The Ergodic Theory of Discrete Sample Paths

Paul C. Shields

Graduate Studies in Mathematics
Volume 13

American Mathematical Society

Editorial Board
James E. Humphreys
David Sattinger
Julius L. Shaneson
Lance W. Small, chair
1991 Mathematics Subject Classification. Primary 28D20, 28D05, 94A17; Secondary 60F05, 60G17, 94A24.
ABSTRACT. This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars, or self-study for students or faculty with some background in measure theory and probability theory.

Library of Congress Cataloging-in-Publication Data

Shields, Paul C.
The ergodic theory of discrete sample paths / Paul C. Shields.
p. cm. — (Graduate studies in mathematics, ISSN 1065-7339; v. 13)
Includes bibliographical references and index.
ISBN 0-8218-0477-4 (alk. paper)
1. Ergodic theory. 2. Measure-preserving transformations. 3. Stochastic processes. I. Title. II. Series.
QA313.555 1996    519.2'32—dc20    96-20186 CIP

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given. Republication, systematic copying, or multiple reproduction of any material in this publication (including abstracts) is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Assistant to the Publisher, American Mathematical Society, P.O. Box 6248, Providence, Rhode Island 02940-6248. Requests can also be made by e-mail to reprint-permission@ams.org.

© Copyright 1996 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Printed in the United States of America. The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Printed on recycled paper.

10 9 8 7 6 5 4 3 2 1    01 00 99 98 97 96

Contents
Preface

I Basic concepts
I.1 Stationary processes
I.2 The ergodic theory model
I.3 The ergodic theorem
I.4 Frequencies of finite blocks
I.5 The entropy theorem
I.6 Entropy as expected value
I.7 Interpretations of entropy
I.8 Stationary coding
I.9 Process topologies
I.10 Cutting and stacking


II Entropy-related properties
II.1 Entropy and coding
II.2 The Lempel-Ziv algorithm
II.3 Empirical entropy
II.4 Partitions of sample paths
II.5 Entropy and recurrence times

III Entropy for restricted classes
III.1 Rates of convergence
III.2 Entropy and joint distributions
III.3 The d-admissibility problem
III.4 Blowing-up properties
III.5 The waiting-time problem

IV B-processes
IV.1 Almost block-independence
IV.2 The finitely determined property
IV.3 Other B-process characterizations
Bibliography

Index


Preface

This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars, or self study for students or faculty with some background in measure theory and probability theory. The focus is on the combinatorial properties of typical finite sample paths drawn from a stationary, ergodic process. A primary goal, only partially realized, is to develop a theory based directly on sample path arguments, with minimal appeals to the probability formalism. A secondary goal is to give a careful presentation of the many models for stationary finite-alphabet processes that have been developed in probability theory, ergodic theory, and information theory.

The two basic tools for a sample path theory are a packing lemma, which shows how "almost" packings of integer intervals can be extracted from coverings by overlapping subintervals, and a counting lemma, which bounds the number of n-sequences that can be partitioned into long blocks subject to the condition that most of them are drawn from collections of known size. These two simple ideas, introduced by Ornstein and Weiss in 1980, immediately yield the two fundamental theorems of ergodic theory, namely, the ergodic theorem of Birkhoff and the entropy theorem of Shannon, McMillan, and Breiman. The packing and counting ideas yield more than these two classical results, however, for in combination with the ergodic and entropy theorems and further simple combinatorial ideas they provide powerful tools for the study of sample paths. Much of Chapter I and all of Chapter II are devoted to the development of these ideas.

The classical process models are based on independence ideas and include the i.i.d. processes, Markov chains, instantaneous functions of Markov chains, and renewal and regenerative processes. Related models are obtained by block coding and randomizing the start, or by stationary coding. An important and simple class of such models is the class of concatenated-block processes, that is, the processes obtained by independently concatenating fixed-length blocks according to some block distribution and randomizing the start. Further models, including the weak Bernoulli processes and the important class of stationary codings of i.i.d. processes, are discussed in Chapter III and Chapter IV. All these models and more are introduced in the first two sections of Chapter I.

Of particular note in the discussion of process models is how ergodic theorists think of a stationary process, namely, as a measure-preserving transformation on a probability space, together with a partition of the space. This point of view, introduced in Section I.2, leads directly to Kakutani's simple geometric representation of a process in terms of a recurrent event, a representation that not only simplifies the discussion of stationary renewal and regenerative processes but generalizes these concepts to the case where times between recurrences are not assumed to be independent, but only stationary.

A further generalization, given in Section I.10, leads to a powerful method for constructing examples known as cutting and stacking.

The book has four chapters. The first chapter, which is half the book, is devoted to the basic tools, including the Kolmogorov and ergodic theory models for a process, the ergodic theorem and its connection with empirical distributions, the entropy theorem and its interpretations, a method for converting block codes to stationary codes, the weak topology and the even more important d-metric topology, and the cutting and stacking method. Properties related to entropy which hold for every ergodic process are discussed in Chapter II. These include entropy as the almost-sure bound on per-symbol compression, Ziv's proof of asymptotic optimality of the Lempel-Ziv algorithm via his interesting concept of individual sequence entropy, the relation between entropy and partitions of sample paths into fixed-length blocks, or partitions into distinct blocks, or partitions into repeated blocks, and the connection between entropy and recurrence times and entropy and the growth of prefix trees. Properties related to entropy which hold only for restricted classes of processes are discussed in Chapter III, including rates of convergence for frequencies and entropy, the estimation of joint distributions in both the variational metric and the d-metric, a connection between entropy and d-neighborhoods, and a connection between entropy and waiting times. Several characterizations of the class of stationary codings of i.i.d. processes are given in Chapter IV, including the almost block-independence, finitely determined, very weak Bernoulli, and blowing-up characterizations. Some of these date back to the original work of Ornstein and others on the isomorphism problem for Bernoulli shifts, although the almost block-independence and blowing-up ideas are more recent. With a few exceptions, the sections in Chapters II, III, and IV are approximately independent of each other, conditioned on the material in Chapter I.

Many standard topics from ergodic theory are omitted or given only cursory treatment, in part because the book is already too long and in part because they are not close to the central focus of this book. These topics include topological dynamics, smooth dynamics, K-processes, general ergodic theorems, combinatorial number theory, random fields, and continuous time and/or space theory. Likewise little or nothing is said about such standard information theory topics as rate distortion theory, divergence-rate theory, redundancy, algebraic coding, channel theory, and multi-user theory.

This book is an outgrowth of the lectures I gave each fall in Budapest from 1989 through 1994, both as special lectures and seminars at the Mathematics Institute of the Hungarian Academy of Sciences and as courses given in the Probability Department of Eötvös Loránd University. The audiences included ergodic theorists, information theorists, and probabilists, as well as combinatorialists and people from engineering and other mathematics disciplines, ranging from undergraduate and graduate students through post-docs and junior faculty to senior professors and researchers. In addition, lectures on parts of the penultimate draft of the book were presented at the Technical University of Delft in the fall of 1995.

Some specific stylistic guidelines were followed in writing this book. Proofs are sketched first, then given in complete detail. Theorems and lemmas are given names that include some information about content, for example, the entropy theorem rather than the Shannon-McMillan-Breiman theorem. Likewise, suggestive names are used for concepts, such as building blocks (a name proposed by Zoli Györfi) and column structures (as opposed to gadgets). Also numbered displays are often (informally) given names similar to those used for LaTeX labels. Exercises that extend the ideas are given at the end of most sections; these range in difficulty from quite easy to quite hard. Only those references that seem directly related to the topics discussed are included.

I am indebted to many people for assistance with this project. Imre Csiszár and Katalin Marton not only attended most of my lectures but critically read parts of the manuscript at all stages of its development and discussed many aspects of the book with me. It was Imre's suggestion that led me to include the discussions of renewal and regenerative processes. Much of the material in Chapter III as well as the blowing-up ideas in Chapter IV are the result of joint work with Kati. In addition I had numerous helpful conversations with Benjy Weiss, and his criticisms led to many revisions of the cutting and stacking discussion. Others who contributed ideas and/or read parts of the manuscript include Gábor Tusnády, György Michaletzky, Gusztáv Morvai, Bob Burton, Don Ornstein, Jacek Serafin, Dave Neuhoff, Aaron Wyner, Jacob Ziv, and Nancy Morrison. Last, but far from least, I am much indebted to my two Toledo graduate students, Shaogang Xu and Xuehong Li, who learned ergodic theory by carefully reading almost all of the manuscript at each stage of its development, in the process discovering numerous errors and poor ways of saying things.

Much of the work for this project was supported by NSF grants DMS-8742630 and DMS-9024240 and by a joint NSF-Hungarian Academy grant MTA-NSF project 37. My initial lectures in 1989 were supported by a Fulbright lectureship. I am grateful to the Mathematics Institute of the Hungarian Academy for providing me with many years of space in a comfortable and stimulating environment, as well as to the Institute and to the Probability Department of Eötvös Loránd University for the many lecture opportunities.

No project such as this can be free from errors and incompleteness. A list of errata as well as a forum for discussion will be available on the Internet at the following web address: http://www.math.utoledo.edu/~pshields/ergodic.html

This book is dedicated to my son, Jeffrey.

Paul C. Shields
Toledo, Ohio
March 22, 1996


Chapter I
Basic concepts

Section I.1 Stationary processes.

A (discrete-time, stochastic) process is a sequence X_1, X_2, ... of random variables defined on a probability space (X, Σ, μ). The process has alphabet A if the range of each X_i is contained in A. In this book the focus is on finite-alphabet processes, so, unless stated otherwise, "process" means a discrete-time finite-alphabet process. Also, unless it is clear from the context or explicitly stated otherwise, "measure" will mean "probability measure" and "function" will mean "measurable function" with respect to some appropriate σ-algebra on a probability space.

The cardinality of a finite set A is denoted by |A|. The sequence a_m, a_{m+1}, ..., a_n, where each a_i ∈ A, is denoted by a_m^n. The set of all such a_m^n is denoted by A_m^n, with A^n used when m = 1.

The k-th order joint distribution of the process {X_k} is the measure μ_k on A^k defined by the formula

    μ_k(a_1^k) = Prob(X_1^k = a_1^k),   a_1^k ∈ A^k.

When no confusion will result the subscript k on μ_k may be omitted. The set of joint distributions {μ_k : k ≥ 1} is called the distribution of the process. The distribution of a process is thus a family of probability distributions, one for each k. The family cannot be completely arbitrary, however, for implicit in the definition of process is that the following consistency condition must hold for each k ≥ 1,

(1)    μ_k(a_1^k) = Σ_{a_{k+1}} μ_{k+1}(a_1^{k+1}),   a_1^k ∈ A^k.

The distribution of a process can, of course, also be defined by specifying the start distribution μ_1 and the successive conditional distributions

    μ(a_k | a_1^{k-1}) = Prob(X_k = a_k | X_1^{k-1} = a_1^{k-1}) = μ(a_1^k) / μ(a_1^{k-1}).

A process is considered to be defined by its joint distributions, that is, the particular space on which the functions X_n are defined is not important; all that really matters in probability theory is the distribution of the process. Thus one is free to choose the underlying space (X, Σ, μ) on which the X_n are defined in any convenient manner, as long as the joint distributions are left unchanged.
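Since later arguments lean on the consistency condition (1) and the conditional-probability quotient above, a small numerical illustration may help. The following Python sketch is not from the book; the two-symbol alphabet and the particular conditional probabilities are invented for illustration only. It checks (1) for a toy family of joint distributions and computes a conditional probability by the quotient formula.

```python
# A minimal sketch (not from the book): numerical check of the consistency
# condition (1) for a family of joint distributions, using a toy two-symbol
# alphabet and an invented example process.
import itertools

A = ['0', '1']  # hypothetical alphabet

def joint(seq):
    """Example joint distribution mu_k: an invented Markov-like rule."""
    p = 0.5                                   # start distribution mu_1 is uniform
    for prev, cur in zip(seq, seq[1:]):
        p *= 0.7 if cur == prev else 0.3      # toy conditional probabilities
    return p

def check_consistency(k):
    """Verify mu_k(a_1^k) == sum over a_{k+1} of mu_{k+1}(a_1^{k+1})."""
    for seq in itertools.product(A, repeat=k):
        total = sum(joint(seq + (b,)) for b in A)
        assert abs(joint(seq) - total) < 1e-12
    return True

def conditional(seq):
    """mu(a_k | a_1^{k-1}) = mu(a_1^k) / mu(a_1^{k-1})."""
    return joint(seq) / joint(seq[:-1])

if __name__ == '__main__':
    for k in range(1, 6):
        check_consistency(k)
    print("consistency condition (1) holds for k = 1, ..., 5")
    print("mu(1 | 0, 0) =", conditional(('0', '0', '1')))
```

Any rule whose conditional probabilities sum to one over the alphabet produces a consistent family in this way, which is exactly the point of defining a process by a start distribution and successive conditional distributions.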

let C = Un C„. Let E be the a-algebra generated by the cylinder sets C. BASIC CONCEPTS. 1 < i < co. on A' for which the sequence of coordinate functions fin } has the same distribution as {X J. then there is a unique Borel probability measure p. and let R. Thus the lemma is established. 1 _< i < n). is closed.} is increasing. from (a). xn . equipped with a Borel measure constructed from the consistency conditions (1).1 (a) Each set in R. Theorem 1. The members of E are commonly called the Borel sets of A. implies that p can be extended to a . Proof The proof of (a) is left as an exercise. Xi E A. if {X. extends to a finitely additive set function on the ring R. Part (b) is an application of the finite intersection property for sequences of compact sets.1. The sequence {R. m < i < n} . Two important properties are summarized in the following lemma. together with a Borel measure on A". The rigorous construction of the Kolmogorov representation is carried out as follows.2 CHAPTER I. Lemma 1. which represents it as the sequence of coordinate functions on the space of infinite sequences drawn from the alphabet A. In other words. = 'R.1.1(a).} defines a set function p. there is a unique Borel measure p..1.2 (Kolmogorov representation theorem. p([4]) = pk (alic).} is a process with finite alphabet A. on A" such that. The Kolmogorov can be thought of as the coordinate function process {i n }.) If {p k } is a sequence of measures for which the consistency conditions (1) hold. imply that p. = ai . Let Aœ denote the set of all infinite sequences X = 1Xil. An important instance of this idea is the Kolmogorov model for a process. for each k and each 4. The consistency conditions (1) together with Lemma I.(C. Proof The process {X. is the subset of A' defined by The cylinder set determined by am [4] = {x: x. and its union R = R(C) is the ring generated by all the cylinder sets. Let C. = a i . The collection E can also be defined as the the a -algebra generated by R..1. x E For each n > 1 the coordinate function in : A' representation theorem states that every process with alphabet A A. A is defined by kn (x). or by the compact sets. since the space A is compact in the product topology and each set in R. n > N. The finite intersection property. n . on the collection C of cylinder sets by the formula (2) p({41) = Prob(X. denoted by [4].) denote the ring generated by Cn . (b) If {B„} c R is a decreasing sequence of sets with empty intersection then there is an N such that B„ = 0. is a finite disjoint union of cylinder sets from C. be the collection of cylinder sets defined by sequences that belong to An. Lemma I. = R(C) generated by the cylinder sets. long as the the joint distributions are left unchanged.1(b).

finite-alphabet process.A" is defined by (Tx) n= xn+i. T -1 B = fx: T x E BI. for example.n } on the probability space (A'. x2. it) will be called the Kolmogorov representation of the process {U. The Kolmogorov measure on A extends to a complete measure on the completion t of the Borel sets E relative to tt. The statement that a process {X n } with finite alphabet A is stationary translates into the statement that the Kolmogorov measure tt is invariant under the shift transformation T. for all m. . The measure will be called the Kolmogorov measure of the process. different processes have Kolmogorov measures with different completions. the particular space on which the functions X„ are defined is not important. and uniqueness is preserved. . n.2. B . E. The shift transformation is continuous relative to the product topology on A". that is. The (left) shift T: A" 1-*. Many ideas are easier to express and many results are easier to establish in the framework of complete measures.1. or to its completion. In particular. ni <j < n) = Prob(Xi+i = ai . that is. This proves Theorem 1. thus. The phrase "stationary process" will mean a stationary. discrete-time.. one in which subsets of sets of measure 0 are measurable. or the Kolmogorov measure of the sequence Lu l l. including physics. unless stated otherwise.2. (3) Prob(Xi = ai . The Kolmogorov model is simply one particular way to define a space and a sequence of functions with the given distributions. Furthermore. whenever it is convenient to do so the complete Kolmogorov model will be used in this book. if the joint distributions do not depend on the choice of time origin. x which is expressed in symbolic form by E A.SECTION Ll. Process and measure language will often be used interchangeably. . whichever is appropriate to the context. 3 unique countably additive measure on the a-algebra E generated by R. A process is stationary. n> 1. "let be a process. x3. ni <j < n). and defines a set transformation T -1 by the formula.) == T x = (x2. for Aga in) = ictk (alic) for all k > 1 and all a. "Kolmogorov measure" is taken to refer to either the measure defined by Theorem 1. that is. and statistics. completion has no effect on joint distributions. The sequence of coordinate functions { k .. Equation (2) translates into the statement that { 5C." means "let i be the Kolmogorov measure for some process {Xn } . and affi n . though often without explicitly saying so. x = (xi. data transmission and storage.). a process is considered to be defined by its joint distributions. that is." As noted earlier. Stationary finite-alphabet processes serve as models in many settings of interest.} have the saine joint distribution.} and {X. STATIONARY PROCESSES. Another useful model is the complete measure model.1.
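The stationarity condition above says that cylinder-set probabilities are unchanged by the shift, μ([a_1^k]) = μ(T^{-1}[a_1^k]). As a concrete, hypothetical illustration, the sketch below verifies this identity for a Markov measure whose start distribution is stationary for its transition matrix; the numbers are toy values and are not taken from the book.

```python
# Illustrative sketch (toy numbers): for a Markov measure with stationary start
# distribution, the probability of a cylinder set [a_1^k] equals the probability
# of its shifted inverse image T^{-1}[a_1^k] = union over b of [b a_1^k].
import itertools

states = [0, 1]
pi = [2/3, 1/3]                      # stationary: pi M = pi for the M below
M = [[0.9, 0.1],
     [0.2, 0.8]]

def mu(seq):
    """Kolmogorov measure of the cylinder set determined by seq."""
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= M[a][b]
    return p

def mu_shifted(seq):
    """Measure of T^{-1}[seq]: prepend every possible first symbol b."""
    return sum(mu((b,) + seq) for b in states)

if __name__ == '__main__':
    for k in (1, 2, 3):
        for seq in itertools.product(states, repeat=k):
            assert abs(mu(seq) - mu_shifted(seq)) < 1e-12
    print("cylinder probabilities are shift-invariant for this stationary chain")
```

If the start vector were not stationary for M, the assertion would fail for k = 1 already, which is a quick way to see why shift invariance of the Kolmogorov measure is exactly the stationarity of the process.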

a homeomorphism on the compact space A z . that is. so that only nontransient effects. as appropriate to the context. the set of all sequences x = {x„}. Furthermore. on A". equivalent to the condition that p(B) = p(T -1 B) for each Borel set B. that is. onto il°'° is. Remark 1. different processes with the same alphabet A are distinguished by having . m <j < n. on A z such that Again) = pk (a1 ). In some cases where such a simplification is possible. Note that the shift T for the two-sided model is invertible. T-1 r„n 1 _ ri. which means that T is a Borel measurable mapping. It preserves a Borel probability measure A if and only if the sequence of coordinate functions {:kn } is a stationary process on the probability space (A". T is one-to-one and onto and T -1 is measure preserving. if {p4} is a set of measures for which the consistency conditions (1) and the stationarity conditions (3) hold then there is a unique T-invariant Borel measure ii. It follows that T -1 B is a Borel set for each Borel set B. It is often convenient to think of a process has having run for a long time before the start of observation. that is. The shift T: A z 1-. p(B) = p(T -1 B). for stationary processes there are two standard representations. the shift T is Borel measurable.3 Much of the success of modern probability theory is based on use of the Kolmogorov model. for asymptotic properties can often be easily formulated in terms of subsets of the infinite product space A". Such a model is well interpreted by another model. is usually summarized by saying that T preserves the measure iu or. B E E. n E Z. p. is T -invariant. The concept of cylinder set extends immediately to this new setting. for each k > 1 and each all'. only the proof for the invertible case will be given. so that the set transformation T -1 maps a cylinder set onto a cylinder set. (The two-sided shift is. "Kolmogorov measure" is taken to refer to either the one-sided measure or its completion. In addition. which is. in fact. of course. E. where each xn E A and n E Z ranges from —oo to oo. are seen. such results will then apply to the one-sided model in the stationary case. in turn.A z is defined by (Tx) n= xn+i.([4]) = p.1 .n+1 i L"m-1 — Lum+1. alternatively. for invertibility of the shift on A z makes things easier. BASIC CONCEPTS. The intuitive idea of stationarity is that the mechanism generating the random variables doesn't change in time.1. the model provides a concrete uniform setting for all processes.4 Note that CHAPTER I. called the doubly-infinite sequence model or two-sided model. Let Z denote the set of integers and let A z denote the set of doubly-infinite A-valued sequences. those that do not depend on the time origin. one as the Kolmogorov measure p. In summary. The latter model will be called the two-sided Kolmogorov model of a stationary process. or to the two-sided measure or its completion. the Kolmogorov measure on A' for the process defined by {pa In summary. The condition.([b 11 ]). bi+1 = ai. the other as the shift-invariant extension to the space Az. that ti. The condition that the process be stationary is just the condition that p.) From a physical point of view it does not matter in the stationary case whether the one-sided or two-sided model is used. The projection of p. Proofs are sometimes simpler if the two-sided model is used. In the stationary case. A(B) = p(T -1 B) for each cylinder set B.). that is.

different Kolmogorov measures on A^∞.

Remark I.1.4
While this book is primarily concerned with finite-alphabet processes, it is sometimes desirable to consider countable-alphabet processes, or even continuous-alphabet processes. The Kolmogorov theorem extends easily to the countable case, and, in the stationary case, even more can be gained by retaining the abstract concept of a stationary process as a sequence of random functions with a specified joint distribution on an arbitrary probability space. As will be shown in later sections, return-time processes associated with finite-alphabet processes are, in general, countable-alphabet processes, and functions of uniformly distributed i.i.d. processes will be useful in Chapter 4.

I.1.a Examples.

Some standard examples of finite-alphabet processes will now be presented.

Example I.1.5 (Independent processes.)
The simplest example of a stationary finite-alphabet process is an independent, identically distributed (i.i.d.) process. A sequence of random variables {X_n} is independent if

    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n)

holds for all n > 1 and all a_1^n ∈ A^n. It is identically distributed if Prob(X_n = a) = Prob(X_1 = a) holds for all n > 1 and all a ∈ A. An independent process is stationary if and only if it is identically distributed. The i.i.d. condition holds if and only if the product formula

    μ_k(a_1^k) = ∏_{i=1}^{k} μ_1(a_i)

holds. Such a measure is called a product measure defined by μ_1. Thus a process is i.i.d. if and only if its Kolmogorov measure is the product measure defined by some distribution on A.

Example I.1.6 (Markov chains.)
The simplest examples of finite-alphabet dependent processes are the Markov processes. A sequence of random variables, {X_n}, is a Markov chain if

(4)    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n | X_{n-1} = a_{n-1})

holds for all n ≥ 2 and all a_1^n ∈ A^n. A Markov chain is said to be homogeneous, or to have stationary transitions, if Prob(X_n = b | X_{n-1} = a) does not depend on n, in which case the |A| × |A| matrix M defined by

    M_{ab} = M_{b|a} = Prob(X_n = b | X_{n-1} = a)

is called the transition matrix of the chain.
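As a concrete complement to Example I.1.6, the sketch below computes k-th order joint probabilities of a homogeneous Markov chain from a start distribution and transition matrix, and compares a block probability with its empirical frequency along a simulated path. The alphabet, matrix, and start distribution are toy values chosen for illustration (the start vector is stationary for the matrix, so the chain is a stationary process); none of this is from the book.

```python
# Illustrative sketch (toy values): joint distributions of a homogeneous Markov
# chain, mu_k(a_1^k) = mu_1(a_1) * prod_i M[a_i][a_{i+1}], plus a quick
# simulation comparing a block probability with its empirical frequency.
import random

states = [0, 1]                      # hypothetical alphabet A = {0, 1}
mu1 = [2/3, 1/3]                     # stationary for M below, so the chain is stationary
M = [[0.9, 0.1],                     # example transition matrix, rows sum to 1
     [0.2, 0.8]]

def joint(seq):
    p = mu1[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= M[a][b]
    return p

def simulate(n, rng=random.Random(0)):
    x = rng.choices(states, weights=mu1)[0]
    path = [x]
    for _ in range(n - 1):
        x = rng.choices(states, weights=M[x])[0]
        path.append(x)
    return path

if __name__ == '__main__':
    # consistency: summing mu_2 over the second symbol recovers mu_1
    print(sum(joint((0, b)) for b in states), "should equal", mu1[0])
    path = simulate(200000)
    pairs = list(zip(path, path[1:]))
    freq = pairs.count((0, 0)) / len(pairs)
    print("empirical frequency of block (0,0):", round(freq, 3),
          " model probability:", round(joint((0, 0)), 3))
```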

The start distribution of the chain is the (row) vector μ_1 = {μ_1(a)}, where μ_1(a) = Prob(X_1 = a). The start distribution is called a stationary distribution of a homogeneous chain if the matrix product μ_1 M is equal to the row vector μ_1. The condition that a Markov chain be a stationary process is that it have stationary transitions and that its start distribution be a stationary distribution. Unless stated otherwise, "stationary Markov process" will mean a Markov chain with stationary transitions for which μ_1 M = μ_1 and for which μ_1(a) > 0 for all a ∈ A. The positivity condition rules out transient states and is included so as to avoid trivialities.

Example I.1.7 (Multistep Markov chains.)
A generalization of the Markov property allows dependence on k steps in the past. A process is called k-step Markov if

(5)    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n | X_{n-k}^{n-1} = a_{n-k}^{n-1})

holds for all n > k and for all a_1^n ∈ A^n. The two conditions for stationarity are, first, that transitions be stationary, that is,

    μ(a_n | a_{n-k}^{n-1}) = Prob(X_n = a_n | X_{n-k}^{n-1} = a_{n-k}^{n-1})

does not depend on n as long as n > k, and, second, that μ_k M^(k) = μ_k, where μ_k(a_1^k) = Prob(X_1^k = a_1^k), B = {a_1^k : μ_k(a_1^k) > 0}, and M^(k) is the |B| × |B| matrix defined by

    M^(k)_{a_1^k, b_1^k} = Prob(X_{k+1} = b_k | X_1^k = a_1^k)  if b_1^{k-1} = a_2^k,  and 0 otherwise.

If these two conditions hold, then a direct calculation shows that the process is stationary. Note that probabilities for a k-th order stationary chain can be calculated by thinking of it as a first-order stationary chain with alphabet B, see Exercise 2.

Example I.1.8 (Finite-state processes.)
Let A and S be finite sets. An A-valued process {Y_n} is said to be a finite-state process with state space S if there is an S-valued process {s_n} such that the pair process {X_n} = {(s_n, Y_n)} is a Markov chain. In other words, {Y_n} is a finite-state process if

(6)    Prob(s_n = s, Y_n = b | s_1^{n-1}, Y_1^{n-1}) = Prob(s_n = s, Y_n = b | s_{n-1})

holds for all n > 1. The Markov chain {X_n} is often referred to as the underlying Markov process or hidden Markov process associated with {Y_n}. Note that the finite-state process {Y_n} is stationary if the chain {X_n} is stationary, but {Y_n} is not, in general, a Markov chain.

In ergodic theory, any finite-alphabet process is called a finite-state process, and what is here called a finite-state process is called a function (or lumping) of a Markov chain. In information theory, what is here called a finite-state process is sometimes called a Markov source. Here "Markov" will be reserved for processes that satisfy the Markov property (4) or its more general form (5), and "finite-state process" will be reserved for processes satisfying property (6), following the terminology used in [15].
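To make the lifting in Example I.1.7 concrete, here is a small sketch that builds the first-order transition matrix M^(k) on the block alphabet B = A^k from a k-step conditional probability. The alphabet and the conditional rule are invented for illustration; the only structural point is that a block a_1^k can move only to blocks b_1^k with b_1^{k-1} = a_2^k.

```python
# Illustrative sketch (invented probabilities): lifting a k-step Markov chain to
# a first-order chain on the block alphabet B = A^k, as in the matrix M^(k) of
# Example I.1.7.  Transitions are allowed only between overlapping blocks.
import itertools

A = (0, 1)
k = 2
B = list(itertools.product(A, repeat=k))     # lifted alphabet B = A^k

def cond(a_block, b):
    """Hypothetical k-step conditional Prob(X_{k+1} = b | X_1^k = a_block)."""
    return 0.8 if b == a_block[-1] else 0.2  # toy rule: tend to repeat the last symbol

def M_k(a_block, b_block):
    """Entry M^(k)_{a_1^k, b_1^k}: zero unless b_1^{k-1} = a_2^k."""
    if b_block[:-1] != a_block[1:]:
        return 0.0
    return cond(a_block, b_block[-1])

if __name__ == '__main__':
    for a_block in B:
        row = [M_k(a_block, b_block) for b_block in B]
        print(a_block, row, "row sum =", sum(row))   # each row sums to 1
```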

• • • 1 Xn-f-w which determines the value y. At time n the window will contain Xn—w. the window is shifted to the right (that is. Finite coding is sometimes called sliding-block or sliding-window coding. A Borel measurable mapping F: Az 1-. the contents of the window are shifted to the left) eliminating xn _ u..).71 Z E Z. The coder F transports a Borel measure IL on Az into the measure v = it o F -1 . The stationary coder F and its time-zero coder f are connected by the formula = f (rx). or in terms of the sequence-to-symbol coder f.9 (Stationary coding. that is. The function f is called the time-zero coder associated with F.) The concept of finite-state process can be generalized to allow dependence on the past and future. x E AZ. when there is a nonnegative integer w such that = f (x) = f CO .„ with coder F. Xn—w+1. and the encoded process {Yn } defined by Yn = f (X. and bringing in .1.Bz is a stationary coder if F(TA X) = TB F(x). for one can think of the coding as being done by sliding a window of width 2w ± 1 along x. The word "coder" or "encoder" may refer to either F or f. Note that a stationary coding of a stationary process is automatically a stationary process.1. The encoded sequence y = F(x) is defined by the formula . 7 Example 1. n Yn = f (x'. Thus the stationary coding idea can be expressed in terms of the sequence-to-sequence coder F. Suppose A and B are finite sets. = f (T:1 4 x). depends only on xn and the values of its w past and future neighbors. In this case it is common practice to write f (x) = f (xw w ) and say that f (or its associated full coder F) is a finite coder with window half-width w.) is called an instantaneous function or simply a function of the {X n } process. where TA and TB denotes the shifts on the respective two-sided sequence spaces Az and B z . To determine yn+1 . defined for Borel subsets C of Bz by v(C) = The encoded process v is said to be a stationary coding of ti. In the special case when w = 0 the encoding is said to be instantaneous. A case of particular interest occurs when the time-zero coder f depends on only a finite number of coordinates of the input sequence x. n E Z. sometimes called the full coder. otherwise known as the per-symbol or time-zero coder. Vx E Az . that is. that is. Define the function f: Az i.SECTION 1.B by the formula f (x) = F(x)o. • • • 9 Xn. an idea most easily expressed in terms of the two-sided representation of a stationary process. STATIONARY PROCESSES. the value y. that is. through the function f. the image y = F(x) is the sequence defined by y. f (x) is just the 0-th coordinate of y = F (x).
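Example I.1.9 describes a finite coder as a window of width 2w + 1 slid along the input sequence. The sketch below implements that mechanism for a hypothetical time-zero coder (a majority rule, chosen only for illustration); on a finite sample only the interior positions are encoded, while on two-sided sequences every coordinate has a full window.

```python
# A minimal sketch (invented coding rule): a finite coder with window half-width
# w, applied by sliding a window of width 2w+1 along the input sequence.
w = 1

def f(window):
    """Hypothetical time-zero coder: output 1 if a majority of the window is 1."""
    return 1 if sum(window) > len(window) // 2 else 0

def encode(x):
    """y_n = f(x_{n-w}^{n+w}) for every index with a full window."""
    return [f(x[n - w:n + w + 1]) for n in range(w, len(x) - w)]

if __name__ == '__main__':
    x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
    print(encode(x))   # one output symbol for each of the len(x) - 2w interior positions
```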

then F' ([y]) = f (x. n . process with finite or infinite alphabet. N] according to the uniform distribution and defining i = 1. 1) is measurable with respect to the a-algebra E(X. (j+1)N X j1s1+1 ) 9 j 0. where T = Tg.i. 2. a function of a Markov chain. • . especially in Chapter 4. Such a code can be used to map an A-valued process {X n } into a B-valued process {Yn } by applying CN to consecutive nonoverlapping blocks of length N. it is invariant under the N-fold shift T N . stationary..nn+ -w w ) = ym n . in particular. pt and v. that is. If F: Az B z is the full coder defined by f." The relation = . the shift on B. More generally. that is. that is. such as the encoded process {Yn }..7.1. from An B. by the formula f (x. in general.„ + i. y(j+1)N jN+1 = CN ()((f+1)N) +1 iN j = 0 1. It is frequently used in practice. however. . Example 1.v ) so that. An N -block code is a function CN: A N B N . . Such a finite coder f defines a mapping.(F-1 [4]) = (7) E f A process {Yn } is a finite coding of a process {X n } if it is a stationary coding of {Xn } with finite width time-zero coder.8 CHAPTER I. The process {Yn } is called the N-block coding of the process {X n ) defined by the N -block code C N.. Note that a finite-state process is merely a finite coding of a Markov chain in which the window half-width is 0. to of the code. destroys stationarity. are connected by the formula y = o F -1 ..n41.) Another type of coding. The Kolmogorov measures..d. A stationary coding of an i. a finite alphabet process is called a B-process if it is a stationary coding of some i. the N-block coding {Yn } is not. This method is called "randomizing the start.10 (Block codings. 2.d. called block coding. . where y = F(x) is the (measurable) mapping defined by (j+1)N r jN+1 = . where A and B are finite sets. I. {Yn }. BASIC CONCEPTS. for any measure .u.} is stationary. also denoted by f. . {X. Furthermore. If the process {X. 2.. The key property of finite codes is that the inverse image of a cylinder set of length n is a finite union of cylinder sets of length n 2w. The process {Yn } is defined bj selecting an integer u E [1. where w is the window half-width . and is often easier to analyze than is stationary coding. Much more will be said about B-processes in later parts of this book. although it is N-stationary. into a stationary process.n+w generated by the random variables X„. cylinder set probabilities for the encoded measure y = o F -1 are given by the formula y(4) = ii. where several characterizations will be discussed.} and {K}..i. where yi = f m <j < n. There is a simple way to convert an N-stationary 2rocess. Xn. process is called a B-process. Xn±„.nitww)=Y s11 [x:t w w] .. F -1 ([). of the respective processes..
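The two-step construction of Example I.1.10, apply an N-block code to consecutive nonoverlapping blocks and then randomize the start, can be sketched in a few lines. The particular 3-block code below (reversing each block) is hypothetical, chosen only to make the block structure visible.

```python
# Illustrative sketch (hypothetical 3-block code): applying an N-block code to
# consecutive nonoverlapping blocks, then "randomizing the start" by a uniform
# shift U in {1, ..., N} to restore stationarity.
import random

N = 3

def C_N(block):
    """Hypothetical N-block code: reverse the block."""
    return tuple(reversed(block))

def block_encode(x):
    """Apply C_N to consecutive nonoverlapping N-blocks (an N-stationary process)."""
    y = []
    for j in range(0, len(x) - N + 1, N):
        y.extend(C_N(tuple(x[j:j + N])))
    return y

def randomize_start(y, rng=random.Random(0)):
    """Drop U - 1 initial symbols, with U uniform on {1, ..., N}."""
    u = rng.randint(1, N)
    return y[u - 1:]

if __name__ == '__main__':
    x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    y = block_encode(x)          # [2, 1, 0, 5, 4, 3, 8, 7, 6]
    print(randomize_start(y))
```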

suppose X liv = (X1. 2. block coding introduces a periodic structure in the encoded process. the final stationary process {Yn } is not a stationary encoding of the original process {X. y and î. transporting this to an N-stationary measure p. by the process {41. on A N .I. In Section 1. NI such that. (i) li. 9 between the Kolmogorov measures. For example. with distribution t.1. . }. . } obtained (j-11N+1 be independent and have the distribution of Xr. an N-block code CN induces a block coding of a stationary process {Xn } onto an N-stationary process MI. together with its first N — 1 shifts. j = 1. .. {Y. The concatenated-block process is characterized by the following two conditions.i. with a random delay.* o 0 -1 on A" via the mapping = x = w(l)w(2) • • • where JN X (j-1)N+1 = W(i). of the respective processes. 2. a structure which is inherited. which can then be converted to a stationary process. . Let {Y. conditioned on U = u. randomizing the start clearly generalizes to convert any N-stationary process into a stationary process.} is stationary. } . B E E.8.SECTION I. In summary. 2.. by randomizing the start. then averaging to obtain the measure Z defined by the formula N-1 N il(B) = — i=o it*(0 -1 (7—i B)). The process { 171 from the N-stationary process {Yi I by randomizing the start is called the concatenatedblock process defined by the random vector Xr .1..) Processes can also be constructed by concatenating N-blocks drawn independently at random according a given distribution. The method of that is.. In general. . is expressed by the formula v ( c) = E v (T C). i=o is the average of y. . {Y. X N ) is a random vector with values in AN.} and many useful properties of {X n } may get destroyed. STATIONARY PROCESSES. There are several equivalent ways to describe such a process. the sequence {Y:± ±(iiN oN±i : j = 1.. Example 1. with stationarity produced by randomizing the start.} and { Y.} be the A-valued process defined by the requirement that the blocks . An alternative formulation of the same idea in terms of measures is obtained by starting with a probability measure p. forming the product measure If on (A N )".} is i.11 (Concatenated-block processes..d. (ii) There is random variable U uniformly distributed on {1.. a general procedure will be developed for converting a block code to a stationary code by inserting spacers between blocks so that the resulting process is a stationary coding of the original process. In random variable terms. . and as such inherits many of its properties. {Y..
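A sample path of a concatenated-block process, as in Example I.1.11, can be generated directly from its description: draw N-blocks independently from a block distribution, concatenate them, and randomize the start. The block distribution below is a toy example, not taken from the book.

```python
# A small sketch (toy block distribution): generating a sample path of a
# concatenated-block process by concatenating independently drawn N-blocks
# and randomizing the start.
import random

N = 2
blocks = [(0, 1), (1, 1), (0, 0)]        # hypothetical support of the block distribution
weights = [0.5, 0.3, 0.2]                # hypothetical block probabilities

def sample_path(num_blocks, rng=random.Random(1)):
    path = []
    for _ in range(num_blocks):
        path.extend(rng.choices(blocks, weights=weights)[0])
    u = rng.randint(1, N)                # uniform random start
    return path[u - 1:]

if __name__ == '__main__':
    print(sample_path(10))
```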

i) with probability i.(br). defined by the following transition and starting rules.) Associated with a stationary process {X n } and a positive integer N are two different stationary processes. (a) If i < N . on A N and let { Y„.b. 2. however..12 (Blocked processes. Yn = X (n-1)N+1. (b) (ar . +i A related process of interest {Yn }. i ± 1). A third formulation. b' 1 1 = { MaNbN. • • • 9 Xn+N-1)• The Z process is called the nonoverlapping N-block process determined by {X n }. (ar . on AN. if Y. To describe this representation fix a measure p. .(4)/N. Note also that if {X n} is Markov with transition matrix Mab. Wn = ant X+1.10 CHAPTER I. hence this measure terminology is consistent with the random variable terminology. The measure ii. for n > 1. n = ai.} process. The process is started by selecting (ar . called the N-th term process. which is quite useful. If the process {X.} is stationary. The process defined by ii has the properties (i) and (ii). i) can only go to (ar .r. represents a concatenated-block process as a finite-state process. but overlapping clearly destroys independence. is the same as the concatenated-block process based on the block distribution A.} is i. In fact.i. {Z n } and IKI. N) goes to (b Ç'. by Zn = (X(n-1)N+19 X(n-1)N+29 • • • 1 XnN).) Concatenated-block processes will play an important role in many later discussions in this book. then the overlapping N-block process is Markov with alphabet B = far: War) > 0} and transition matrix defined by Ma i. 1) with probability p.i. .} be the (stationary) Markov chain with alphabet S = AN x {1. defined.i. N}. If {Xn } is Markov with transition matrix M. the N-th power of M.} is Markov with transition matrix Mab. (See Exercise 7..d. or Markov process is Markov.tSiv = MaNbi ri i= 1 Mb. then its nonoverlapping blockings are i. Each is stationary if {X. = (ar . that is. The overlapping N-blocking of an i.i } will be Markov with transition matrix MN. then the nonoverlapping N-block process is Markov with alphabet B and transition matrix defined by N-1 1-4a'iv .n } defined by setting Y. for each br (C) E AN. a function of a Markov chain. BASIC CONCEPTS. is called the concatenated-block process defined by the measure p. then {Y. that is..d. selects every N-th term from the {X. if {X. j). while the W process is called the overlapping N-block process determined by {X J. The process {Y. Example 1.d. n E Z. 0 if bN i —1 = ci 2 iv otherwise.1. .
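The two blockings of Example I.1.12 are easy to see on a short sample sequence: the nonoverlapping N-block process {Z_n} reads consecutive disjoint N-blocks, while the overlapping N-block process {W_n} reads the N-block starting at every position. A minimal sketch on arbitrary sample data:

```python
# Illustrative sketch: the nonoverlapping N-block process {Z_n} and the
# overlapping N-block process {W_n} derived from a sequence x.
N = 3

def nonoverlapping_blocks(x):
    """Z_n = (x_{(n-1)N+1}, ..., x_{nN})."""
    return [tuple(x[j:j + N]) for j in range(0, len(x) - N + 1, N)]

def overlapping_blocks(x):
    """W_n = (x_n, ..., x_{n+N-1})."""
    return [tuple(x[j:j + N]) for j in range(len(x) - N + 1)]

if __name__ == '__main__':
    x = [1, 0, 0, 1, 1, 0, 1]
    print(nonoverlapping_blocks(x))   # [(1, 0, 0), (1, 1, 0)]
    print(overlapping_blocks(x))      # [(1, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 1)]
```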

I.1.b Probability tools.

Two elementary results from probability theory will be frequently used, the Markov inequality and the Borel-Cantelli principle. They are summarized as follows.

Lemma I.1.13 (The Markov inequality.)
Let f be a nonnegative, integrable function on a probability space (X, Σ, μ), and let δ > 0. If ∫ f dμ ≤ cδ then f(x) < c, except for a set of measure at most δ.

Lemma I.1.14 (The Borel-Cantelli principle.)
If {C_n} is a sequence of measurable sets in a probability space (X, Σ, μ) such that Σ_n μ(C_n) < ∞, then for almost every x there is an N = N(x) such that x ∉ C_n, n ≥ N.

In general, a property P is said to be measurable if the set of all x for which P(x) is true is a measurable set. If {P_n} is a sequence of measurable properties then

(a) P_n(x) holds eventually almost surely, if for almost every x there is an N = N(x) such that P_n(x) is true for n ≥ N.

(b) P_n(x) holds infinitely often, almost surely, if for almost every x there is an increasing sequence {n_i} of integers, which may depend on x, such that P_{n_i}(x) is true for i = 1, 2, ...

In this language, the Borel-Cantelli principle is often expressed by saying that if Σ_n μ(C_n) < ∞ then x ∉ C_n, eventually almost surely. Almost-sure convergence is often established using the following generalization of the Borel-Cantelli principle.

Lemma I.1.15 (The iterated Borel-Cantelli principle.)
Suppose {G_n} and {B_n} are two sequences of measurable sets such that x ∈ G_n, eventually almost surely, and Σ_n μ(B_n ∩ G_n) < ∞. Then x ∉ B_n, eventually almost surely.

In many applications the fact that x ∈ G_n, eventually almost surely, will already have been established, in which case the iterated Borel-Cantelli principle is, indeed, just a generalized Borel-Cantelli principle.

Frequent use will be made of various equivalent forms of almost-sure convergence, summarized in the following lemma.

Lemma I.1.16
The following are equivalent for measurable functions on a probability space.

(a) f_n → f, almost surely.

(b) |f_n(x) − f(x)| < ε, eventually almost surely, for every ε > 0.

(c) Given ε > 0, there is an N and a set G of measure at least 1 − ε, such that |f_n(x) − f(x)| < ε, x ∈ G, n ≥ N.
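Both elementary tools are easy to see numerically. The sketch below (a toy Monte Carlo, not from the book) checks the Markov inequality bound for a sample of nonnegative values and illustrates the Borel-Cantelli conclusion with the sets C_n = [0, 1/n^2), whose measures are summable, so a typical point falls in only finitely many of them.

```python
# Toy illustration (not from the book) of the Markov inequality and the
# Borel-Cantelli principle on the unit interval with Lebesgue measure.
import random

rng = random.Random(0)

# Markov inequality: mu{ f >= c } <= (1/c) * integral of f.
samples = [rng.expovariate(1.0) for _ in range(100000)]   # nonnegative f values
mean = sum(samples) / len(samples)
c = 5.0
frac = sum(1 for v in samples if v >= c) / len(samples)
print("fraction with f >= c:", round(frac, 4), "<= bound", round(mean / c, 4))

# Borel-Cantelli: with C_n = [0, 1/n^2), the sum of measures is finite, so a
# random point x lies in C_n for only finitely many n.
for _ in range(5):
    x = rng.random()
    hits = [n for n in range(1, 10000) if x < 1.0 / n**2]
    print("x =", round(x, 4), "lies in C_n only for n in", hits)
```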

i. Show that {X.18. then a stationary process is said to be m-dependent. As several of the preceding examples suggest. such as the martingale theorem.) . o F-1 and let r) be the completion of v. this is not a serious problem. namely. if U„ > Un _i .17 (The Borel mapping lemma.) Let p. BASIC CONCEPTS. first see what is happening on a set of probability 1 in the old process then transfer to the new process.l. .d. and sometimes the martingale theorem is used to simplify an argument. be a Borel measure on A". then I BI 5_ 11a.) Let F be a Borel function from A" into B". If X is a Borel set such that p(X) = 1. e.1.1 is 1-dependent. a process is often specified as a function of some other process. Lemma 1. Let {tin } be i. by AZ or B z . This fact is summarized by the following lemma. if p. For ease of later reference these are stated here as the following lemma. (Hint: show that such a coding would have the property that there is a number c > 0 such that A(4) > c". and let a be a positive number (a) If a E B (b) For b E p(a) a a. Use will also be made of two almost trivial facts about the connection between cardinality and probability. except for a subset of B of measure at most a.. for such images are always measurable with respect to the completion of the image measure. 1]. and upper bounds on cardinality "almost" imply lower bounds on probability. (Include a proof that your example is not Markov of any order. [5]. process is m-dependent for some m. the law of the iterated logarithm.d. uniformly distributed on [0. otherwise X. p(b) > a/I BI. Give an example of a function of a finite-alphabet Markov chain that is not Markov of any order.d. respectively. namely. 2. Probabilities for the new process can then be calculated by using the inverse image to transfer back to the old process. Show that a finite coding of an i.(x 1') 0 O.u([4]). Lemma 1. B. for all n and k. Deeper results from probability theory. be a probability measure on the finite set A. = O. and the renewal theorem.1. that lower bounds on probability give upper bounds on cardinality. let v = p. Prove Lemma 1.u([4] n [ci_t) . = 1. (How is m related to window half-width?) 4. the central limit theorem. I. Define X.i. that Borel images of Borel sets may not be Borel sets. For nice spaces. let B c A. A similar result holds with either A" or B" replaced. Sometimes it is useful to go in the opposite direction. will not play a major role in this book.18 (Cardinality bounds. g. and is not a finite coding of a finite-alphabet i. A complication arises. such as product spaces.1.i. though they may be used in various examples. let p. with each U.12 CHAPTER I. Then show that the probability of n consecutive O's is 11(n + 1)!. process.' and an+ni±k n+m-1-1 . If . then F(X) is measurable with respect to the completion i of v and ii(F (X)) = 1.c Exercises. and all a.) 3. 1.u([4]).

15. Ergodic theory is concerned with the orbits x. is the subject of this section and is the basis for much of the remainder of this book. 7. of a transformation T: X X on some given space X. Lemma 1.Kn-Fr \ 1-4-4 Kn+1 . then give rise to stationary processes.} satisfy the Kolmogorov consistency conditions. 2. are given by the formula (1) X(x) = Xp(Tn -1 x). Show that the finite-state representation of a concatenated-block process in Example 1. The sequence of random variables and the joint distributions are expressed in terms of the shift and Kolmogorov partition. is the transformation and the partition is P = {Pa : a E A}. x E Pa .. X = A". (1/n) E7=1 kt* (1' B). that is..}. ÎZ(B) = . as follows. A measure kt on An defines a measure . Show that the process constructed in the preceding exercise is not Markov of any order.1. hence have a common extension . by the formula = (n K-1 kn -En iU(Xkr1+1) . THE ERGODIC THEORY MODEL. This suggests the possibility of using ergodic theory ideas in the study of stationary processes. Pc. Show that the concatenatedblock process Z defined by kt is an average of shifts of . Since to say that . This model. The Kolmogorov measure for a stationary process is preserved by the shift T on the sequence space... called the transformation/partition model for a stationary process. for each Borel subset B C 9. Establish the Kolmogorov representation theorem for countable alphabet processes. 8. The shift T on the sequence space.11 satisfies the two conditions (i) and (ii) of that example.u (N) on AN.1.Tx. for each n. • k=0 The measures {. First. Finite measurements on the space X. at random according to the Kolmogorov measure.a (N) . 0 < n < r. {X.T 2x. Prove the iterated Borel-Cantelli principle. relative to which information about orbit structure can be expressed in probability language. where for each a E A.. that is. let Xn = X n (x) be the label of the member of P to which 7' 1 x belongs. that is. The partition P is called the Kolmogorov partition associated with the process. and. N = 1. 6. = {x: x 1 = a}. (or on the space A z ).2 The ergodic theory model. The Kolmogorov model for a stationary process implicitly contains the transformation/partition model. The coordinate functions. an infinite sequence. . where N = Kn+r. associate with the partition P the random variable Xp defined by X2 (x) = a if a is the label of member of P to which x belongs. The process can therefore be described as follows: Pick a point x E X. In many cases of interest there is a natural probability measure preserved by T. which correspond to finite partitions of X. n > 1.SECTION L2. Section 1.tc* to A". 13 5.

is called the process defined by the transformation T and partition P. indexed by a finite set A. more simply. 2) name of x. or equivalently. BASIC CONCEPTS. in the Kolmogorov representation a random point x is a sequence in 21°' or A z . are useful.) . the coordinate functions. The concept of stationary process is formulated in terms of the abstract concepts of measure-preserving transformation and partition. cylinder sets and joint distributions may be expressed by the respective formulas. X 2 (x X3(x). 2) process.14 Tn-1 X E CHAPTER I. Pa is the same as saying that x E T -n+ 1 Pa . Continuing in this manner. (In some situations. Pick x E X at random according to the A-distribution and let X 1 (x) be the label of the set in P to which x belongs. 2)-name of x is the same as x. the cylinder sets in the Kolmogorov representation. that is. Let (X.„7—i+1 Pai .(B). tell to which set of the partition the corresponding member of the random orbit belongs. (2). The sequence { xa = X n (x): n > 1} defined for a point x E X by the formula Tn -l x E Pin . as follows. - (x).) Associated with the partition P = {Pa : a E Al is the random variable Xp defined by Xp(x) = a if x E Pa . X — U n Pa is a null set. Of course. n> 1. whose union has measure 1. partitions into countably many sets. disjoint collection of measurable sets. The (T. is called the (T. by (4). A) be a probability space. and measure preserving if it is measurable and if (T' B) = . X a (x). ) In summary. and (2) = u([4]) = i (nLI T —i+1 Pa.u. 2)-process may also be described as follows. and the joint distributions can all be expressed directly in terms of the Kolmogorov partition and the shift transformation. for all B E E. the values. E. (or the forward part of x in the two-sided case. B A partition E E. the (T. ). A mapping T: X X is said to be measurable if T -1 B E E. The k-th order distribution Ak of the process {Xa } is given by the formula (4) (4) = u (nLiT -i+1 Pai ) the direct analogue of the sequence space formula. that is. P = {Pa : a E A} of X is a finite.. Then apply T to x to obtain Tx and let X2(x) be the label of the set in P to which Tx belongs. and the (T. The process {Xa : n > 1} defined by (3). countable partitions. [4] = n7. The random variable Xp and the measure-preserving transformation T together define a process by the formula (3) X(x) = Xp(T n-l x). or.

the natural object of study becomes the restriction of the transformation to sets that cannot be split into nontrivial invariant sets. The mapping F extends to a stationary coding to Bz in the case when X = A z and T is the shift. 2)-process concept is. a stationary process can be thought of as a shift-invariant measure on a sequence space. i = 1. = (c) T -1 B D B. = f . The partition P is defined by Pb = {x: f (x) = 1) } . Lemma 1. 2)-process. hence to say that the space is indecomposable is to say that if T -1 B = B then p(B) is 0 or 1. (b) T -1 B C B. a stationary coding F: A z B z carrying it onto y determines a partition P = {Pb: b E BI of Az such that y is the Kolmogorov measure of the (TA. The ergodic theory point of view starts with the transformation/partition concept. 15 The (T.2. . The following lemma contains several equivalent formulations of the ergodicity condition.SECTION 1. since once an orbit enters an invariant set it never leaves it. In summary. in that a partition P of (X. The condition that TX C Xi. It is natural to study the orbits of a transformation by looking at its action on invariant sets. C AD = (C — D)U(D— C). (a) T is ergodic. Conversely. In particular. I. 2 translates into the statement that T -1 Xi = Xi. implies that f is constant.2. E. where the domain can be taken to be any one of the LP-spaces or the space of measurable functions. THE ERGODIC THEORY MODEL. It is standard practice to use the word "ergodic" to mean that the space is indecomposable. as a measure-preserving transformation T and partition P of an arbitrary probability space.1 (Ergodicity equivalents. equivalently. such that F(Tx) = TA F (x). 2.e. p) gives rise to a measurable function F: X 1-÷ B" which carries p onto the Kolmogorov measure y of the (T. This leads to the concept of ergodic transformation. . just an abstract form of the stationary coding concept.a Ergodic processes. The space X is Tdecomposable if it can be expressed as the disjoint union X = X1U X2 of two measurable invariant sets.e. a. See Exercise 2 and Exercise 3. denotes symmetric difference. or.) The following are equivalent for a measure-preserving transformation T on a probability space.i(B) = 0 or p(B) = 1. in essence. and UT denotes the operator on functions defined by (UT f)(x) = f (T x). while modern probability theory starts with the sequence space concept. where TB is the shift on 13'. each of positive measure. A measure-preserving transformation T is said to be ergodic if (5) 7-1 B = B = (B) = 0 or p(B) = 1. B LB) = 0 = (B) = 0 or kt(B) = 1.. (d) (e) UT (B) = 0 or kt(B) = 1. i = 1. The notation. 2)-process. A measurable set B is said to be T-invariant if TB C B. a.2.

Proof The equivalence of the first two follows from the fact that if T -1 B c B and C = n>0T -n B. but not all. concatenated-block processes.2. Thus the concept of ergodic process. then T -1 C = C and . t/2) (2s — 1. processes obtained from a block coding of an ergodic process by randomizing the start. Cut the unit square into two columns of equal width.4 The Baker's transformation.b Examples of ergodic processes.2.) A simple geometric example provides a transformation and partition for which the resulting process is the familiar coin-tossing process. I. and some. Also. BASIC CONCEPTS.) 1.2. Examples of ergodic processes include i. (See Figure 1. stationary codings of ergodic processes. 2. Place the right rectangle on top of the left to obtain a square. t) = { (2s. A stationary process is ergodic. • Tw • Tz Tw • • Tz Figure 1. As will be shown in Section 1.2 In the particular case when T is invertible.3 (The Baker's transformation. when T is one-to-one and for each measurable set C the set T C is measurable and has the same measure as C. the conditions for ergodicity can be expressed in terms of the action of T. is equivalent to an important probability concept. Remark 1. for almost every x.2.i. if the shift in the Kolmogorov representation is ergodic relative to the Kolmogorov measure. Example 1. a shift-invariant measure shift-invariant measures. note that if T is invertible then UT is a unitary operator on L2 . . rather than T -1 . These and other examples will now be discussed. Furthermore.u(B).4.u(C) = . Let X = [0. 3. which is natural from the transformation point of view. to say that a stationary process is ergodic is equivalent to saying that measures of cylinder sets can be determined by counting limiting relative frequencies along a sample path x. Squeeze each column down to height 1/2 and stretch it to width 1. 1) denote the unit square and define a transformation T by T (s. The transformation T is called the Baker's transformation since its action can be described as follows. In particular. (t ± 1)/2) if S <1/2 it s> 1/2. processes. that is.4. an invertible T is ergodic if and only if any T-invariant set has measure 0 or 1. The proofs of the other equivalences are left to the reader. irreducible Markov chains and functions thereof.2. 1) x [0.16 CHAPTER I.d.

The distribution of P is the probability distribution ftt(Pa). v P v TT into exactly the same proportions as it partitions the entire space X.6. process. described as follows. defined inductively by APO ) = (vrP (i) ) y p(k). indeed. In general. PO by setting (6) Po = {(s.2. t): s < 1/2}. defined by the first-order distribution j 1 (a). 17 The Baker's transformation T preserves Lebesgue measure in the square. Let P = {Pa : a E A} be a partition of the probability space (X. implies.P)-process is. To assist in showing that the (T . P(k) of partitions is their common refinement. the partition P v Q = {Pa n Qb: Pa E P . T -I P v P v T'P v T 2P T -1 P Figure 1. 7-1 P . symmetric. that is. for dyadic subsquares (which generate the Borel field) are mapped into rectangles of the same area. the binary.i. The sequence of partitions {P (i) : i > 1} is said to be independent if P (k+ i) and v1P (i) independent for each k? 1.Va. i. the i. b.2. together with the fact that each of the two sets P0 . This is precisely the meaning of independence. (See Figure 1. and 7— '1'. In particular. TP. t): s _?: 1/2}. that is. Qb E Q} • The join AP(i) of a finite sequence P (1) .T 2P . along with the join of these partitions. Note that T 2P partitions each set of T -1 7. . Two partitions P and Q are independent if the distribution of the restriction of P to each Qb E Q does not depend on b.d. 2)-process is just the coin-tossing process. so that {T i P} is an independent sequence. of course. it can be shown that the partition TnP is independent of vg -1 PP . TP . 1 0 P T -1 1)1 n Po n T Pi n T2 Po Ti' I I 1 0 1 0 T 2p 01101 I I .5 The join of P.2. if [t(Pa n Qb) = 12 (Pa)11 (Qb).) . some useful partition notation and terminology will be developed. P v Q. To obtain the coin-tossing process define the two-set partition P = {Po. that the (T.. This. ii). P1 = {(s. a E Al. Figure 1.5 illustrates the partitions P.. T2P. P1 has measure 1/2. are For the Baker's transformation and partition (6). . E. a E A.SECTION I. P (2) .2.i. The join. for each n > 1. of two partitions P and Q is their common refinement.d. THE ERGODIC THEORY MODEL. process ii. is given by a generalized Baker's transformation.

in = 0.(T -Nc n D) = . if any. But this is easy. implies that p.d.a(D) = Thus i.froo lim . The results are summarized here. for if T -1 C = C then T -ncnD=CnD.d.7 (Ergodicity for i. in the transition matrix M determine whether a Markov process is ergodic. 1.u(D).) To show that a process is ergodic it is sometimes easier to verify a stronger property.i. BASIC CONCEPTS. 3. labeled by the letters of A. Thus.u(C) 2 and hence . if D = [br] and N > m then T -N C and D depend on values of x i for indices i in disjoint sets of integers.u(T -nC n D) = . A transformation is mixing if (7) n— . 2.i(C) = . • Figure 1. . To show that the mixing condition (7) holds for the shift first note that it is enough to establish the condition for any two sets in a generating algebra. for if C = [a7] and N > 0 then T-N = A +1 bNA -i = a1. called mixing. Example 1. Since the measure is product measure this means that p. (C nD) = Since this holds for all sets D.2.d. Mixing clearly implies ergodicity. n — 1.u(C) is 0 or 1. A stationary process is mixing if the shift is mixing for its Kolmogorov measure.6 The generalized Baker's transformation. Example 1. The stochastic matrix M is said to be irreducible. For each a E A squeeze the column labeled a down and stretch it to obtain a rectangle of height (a) and width 1. so that the mixing property.u(C). the reader may refer to standard probability books or to the extensive discussion in [29] for details.) The location of the zero entries. Cut the unit square into I AI columns. .i. 1 < < n. if for any pair i. Stack the rectangles to obtain a square. j there is a sequence io. The partition P into columns defined in part (a) of the definition of T then produces the desired i. measures satisfy the mixing condition. C.18 CHAPTER I.8 (Ergodicity for Markov chains. . the choice D = C gives . The corresponding Baker's transformation T preserves Lebesgue measure.i. such that the column labeled a has width A i (a). in with io = i and in = j such that > 0.2. Suppose it is a product measure on A'. for all sets D and positive integers n. process. D E E. processes. hence it is enough to show that it holds for any two cylinder sets. 1.2. i 1 .a(T-N C). (7).

This is just the assertion that for any pair i, j of states, if the chain is at state i at some time then there is a positive probability that it will be in state j at some later time. If M is irreducible there is a unique probability vector π such that πM = π. In the irreducible case, each entry of π is positive and

(8)    lim_{N→∞} (1/N) Σ_{n=1}^N M^n = P,

where each row of the k × k limit matrix P is equal to π. In fact, the limit of the averages of powers can be shown to exist for an arbitrary finite stochastic matrix M; in the irreducible case there is only one probability vector π with πM = π, and the limit matrix P has all its rows equal to π.

The condition πM = π shows that the Markov chain with start distribution μ_1 = π and transition matrix M is stationary. An irreducible Markov chain is ergodic. To prove this it is enough to show that

(9)    lim_{N→∞} (1/N) Σ_{n=1}^N μ(T^{-n}C ∩ D) = μ(C)μ(D),

for any cylinder sets C and D. This condition, which is weaker than the mixing condition (7), implies ergodicity, for if T^{-1}C = C and D = C, then the limit formula (9) gives μ(C) = μ(C)^2, so that C must have measure 0 or 1.

To establish (9) for an irreducible chain, let D = [d_1^n] and C = [c_1^m] and write

T^{-k-n}C = [b_{n+k+1}^{n+k+m}],    where b_{n+k+i} = c_i, 1 ≤ i ≤ m.

Thus when k > 0 the sets T^{-k-n}C and D depend on different coordinates, and hence

μ([d_1^n] ∩ [b_{n+k+1}^{n+k+m}]) = u v w,

where

u = π(d_1) Π_{i=1}^{n-1} M_{d_i d_{i+1}},    v = [M^k]_{d_n c_1},    w = Π_{i=n+k+1}^{n+k+m-1} M_{b_i b_{i+1}}.

Summing over all the ways to get from d_n to b_{n+k+1} in k steps yields v = [M^k]_{d_n c_1}, the d_n, c_1 term in the k-th power of M, which is the probability of transition from d_n to b_{n+k+1} = c_1 in k steps. The sequence [M^k]_{d_n c_1} converges in the sense of Cesàro to π(c_1), by (8), hence

lim_{N→∞} (1/N) Σ_{k=1}^N μ(T^{-k-n}C ∩ D) = u π(c_1) w = μ(D)μ(C),

which establishes (9). This proves that irreducible Markov chains are ergodic.
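Both the irreducibility condition and the Cesàro limit (8) are easy to check numerically. The following sketch is not from the text; the 3 × 3 matrix is an arbitrary example, and the reachability test and averaging loop are one straightforward way to carry out the check.

```python
# Sketch (not from the text): test irreducibility of a stochastic matrix and
# approximate the Cesaro limit (8), whose rows should all equal the unique
# probability vector pi with pi M = pi.
import numpy as np

def irreducible(M):
    """True if every state reaches every state through positive-probability steps."""
    n = len(M)
    reach = (M > 0).astype(int)
    for _ in range(n):                          # grow the set of reachable pairs
        reach = ((reach + reach @ reach) > 0).astype(int)
    return bool(reach.all())

M = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.7, 0.0]])                 # arbitrary irreducible example

print(irreducible(M))                           # True

N = 2000
P, acc = np.eye(3), np.zeros((3, 3))
for _ in range(N):
    P = P @ M                                   # M^n
    acc += P
avg = acc / N                                   # (1/N) sum of the first N powers
print(avg.round(4))                             # every row approximates pi
print((avg[0] @ M).round(4))                    # pi M = pi, approximately
```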

The converse is also true, at least with the additional assumption being used in this book that every state has positive probability. Indeed, if P_j = {x: x_1 = j}, then the set B = T^{-1}P_j ∪ T^{-2}P_j ∪ ... has positive probability and satisfies T^{-1}B ⊂ B; since μ(T^{-1}B) = μ(B), the set B is invariant modulo a null set, so that if the chain is ergodic then μ(B) = 1. But then μ(B ∩ P_i) > 0, for any state i, that is, μ(T^{-n}P_j ∩ P_i) > 0, for some n, which means that transition from i to j in n steps occurs with positive probability. Thus the chain is irreducible. In summary,

Proposition 1.2.9 A stationary Markov chain is ergodic if and only if its transition matrix is irreducible.

If some power of M is positive (that is, has all positive entries) then M is certainly irreducible. In this case, the Cesàro limit theorem (8) can be strengthened to lim_{N→∞} M^N = P, and the argument used to prove (9) then shows that the shift T must be mixing. The converse is also true; see Exercise 19. Thus,

Proposition 1.2.10 A stationary Markov chain is mixing if and only if some power of its transition matrix has all positive entries.

Remark 1.2.11 In some probability books, "ergodic" for Markov chains is equivalent to the condition that some power of the transition matrix be positive; see, for example, Feller's book, [11]. To be consistent with the general concept of ergodic process as used in this book, "ergodic" for Markov chains will mean merely irreducible, with "mixing Markov" reserved for the additional property that some power of the transition matrix has all positive entries.

Example 1.2.12 (Codings and ergodicity.) Stationary coding preserves the ergodicity property. This follows easily from the definitions, for suppose F: A^Z → B^Z is a stationary encoder, suppose μ is the Kolmogorov measure of an ergodic A-valued process, and suppose ν = μ ∘ F^{-1} is the Kolmogorov measure of the encoded B-valued process. If C is a shift-invariant subset of B^Z then F^{-1}C is a shift-invariant subset of A^Z, so that μ(F^{-1}C) is 0 or 1. Since ν(C) = μ(F^{-1}C) it follows that ν(C) is 0 or 1, and hence that ν is the Kolmogorov measure of an ergodic process. It is not important that the domain of F be the sequence space A^Z; any probability space will do. Thus, if T is ergodic the (T, P)-process is ergodic for any finite partition P. It should be noted, however, that the (T, P)-process can be ergodic even though T is not. A simple extension of the above argument shows that stationary coding also preserves the mixing property. (See Exercise 5.)

A finite-state process is a stationary coding with window width 0 of a Markov chain, so that if the chain is ergodic then the finite-state process will also be ergodic. Likewise, a finite-state process for which the underlying Markov chain is mixing must itself be mixing. A concatenated-block process is always ergodic, since it can be represented as a stationary coding of an irreducible Markov chain; see the discussion below.

The underlying Markov chain is not generally mixing, however, since it has a periodic structure due to the blocking. Since this periodic structure is inherited, concatenated-block processes are not mixing except in special cases.

As noted in Section 1.1, N-block codings destroy stationarity, but a stationary process can be constructed from the encoded process by randomizing the start. The final randomized-start process may not be ergodic, however, even if the original process was ergodic. For example, let μ be the stationary Markov measure with transition matrix M and start distribution π given, respectively, by

M = [ 0 1 ; 1 0 ],    π = [ 1/2, 1/2 ],

which gives measure 1/2 to each of the two sequences 1010... and 0101.... Let ν be the encoding of μ defined by the 2-block code C(01) = 00, C(10) = 11, so that ν is concentrated on the two sequences 000... and 111.... The measure ν̄ obtained by randomizing the start is, in this case, the same as ν, hence is not an ergodic process.

A condition insuring ergodicity of the process obtained by N-block coding and randomizing the start is that the original process be ergodic relative to the N-shift T^N. A transformation T on a probability space (X, Σ, μ) is said to be totally ergodic if every power T^N is ergodic. If T is the shift on a sequence space then T is totally ergodic if and only if, for each N, the nonoverlapping N-block process defined by

Z_n = (X_{(n-1)N+1}, X_{(n-1)N+2}, ..., X_{nN})

is ergodic. In particular, applying a block code to a totally ergodic process and randomizing the start produces an ergodic process. Since the condition that F(T_A x) = T_B F(x), for all x, implies that F(T_A^N x) = T_B^N F(x), for all x and all N, it follows that a stationary coding of a totally ergodic process must be totally ergodic.

Example 1.2.13 (Rotation processes.) Let α be a fixed real number and let T be defined on X = [0, 1) by the formula Tx = x ⊕ α, where ⊕ indicates addition modulo 1. The mapping T is called translation by α. It is also called rotation, since X can be thought of as the unit circle by identifying x with the angle 2πx, so that translation by α becomes rotation by 2πα. Any partition P of the interval gives rise to a process, and such processes are called translation processes (or rotation processes if the circle representation is used.) A subinterval of the circle corresponds to a subinterval or the complement of a subinterval in X, hence the word "interval" can be used for subsets of X that are connected or whose complements are connected.

The transformation T is one-to-one and maps intervals onto intervals of the same length, hence preserves Lebesgue measure μ. (The measure-preserving property is often established by proving, as in this case, that it holds on a family of sets that generates the σ-algebra.)

As an example, let P consist of the two intervals P_0 = [0, 1/2), P_1 = [1/2, 1), which correspond to the upper and lower halves of the circle in the circle representation. The (T, P)-process {X_n} is then described as follows. Pick a point x at random in the unit interval according to the uniform (Lebesgue) measure. The value X_n(x) = χ_{P_1}(T^{n-1}x) is then 0 or 1, depending on whether T^{n-1}x = x ⊕ (n − 1)α belongs to P_0 or P_1. The following proposition is basic.

Proposition 1.2.14 T is ergodic if and only if α is irrational.

Proof It is left to the reader to show that T is not ergodic if α is rational. Assume that α is irrational. Two proofs of ergodicity will be given. The first proof is based on the following classical result of Kronecker.

Proposition 1.2.15 (Kronecker's theorem.) If α is irrational the forward orbit {T^n x: n ≥ 1} is dense in X, for each x.

Proof To establish Kronecker's theorem let F be the closure of the forward orbit, {T^n x: n ≥ 1}, and suppose that F is not dense in X. Then there is an interval I of positive length which is maximal with respect to the property of not meeting F. For each positive integer n, T^n I is an interval of the same length whose endpoints do not belong to F; furthermore, since α is irrational, T^n I cannot be the same as I, and hence the maximality of I implies that T^n I ∩ I = ∅. It follows that {T^n I} is a disjoint sequence and

1 = μ([0, 1)) ≥ Σ_{n=1}^∞ μ(T^n I) = ∞,

which is a contradiction. This proves Kronecker's theorem.

Proof of Proposition 1.2.14, continued. To proceed with the proof that T is ergodic if α is irrational, suppose A is an invariant set of positive measure. Given ε > 0 choose an interval I such that μ(I) < ε and μ(A ∩ I) ≥ (1 − ε)μ(I). Kronecker's theorem, applied to the end points of I, produces integers n_1 < n_2 < ... < n_k such that {T^{n_i} I} is a disjoint sequence and

Σ_{i=1}^k μ(T^{n_i} I) ≥ 1 − 2ε.

The assumption that μ(A ∩ I) ≥ (1 − ε)μ(I), together with the invariance of A and the measure-preserving property of T, gives

μ(A) ≥ Σ_{i=1}^k μ(A ∩ T^{n_i} I) ≥ (1 − ε) Σ_{i=1}^k μ(T^{n_i} I) ≥ (1 − ε)(1 − 2ε).

Thus μ(A) = 1, which completes the proof of Proposition 1.2.14.

The preceding proof is typical of proofs of ergodicity for many transformations defined on the unit interval. One first shows that orbits are dense, then that small disjoint intervals can be placed around each point in the orbit. A generalization of this idea will be used in Section 1.10 to establish ergodicity of a large class of transformations. Such a technique does not work in all cases, but does establish ergodicity for many transformations of interest.
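Ergodicity of an irrational rotation can be seen numerically: limiting frequencies of visits to an interval equal its length, independent of the starting point. The sketch below is not from the text; the choices of α, starting point, and test intervals are arbitrary.

```python
# Sketch (not from the text): visit frequencies of the orbit x + n*alpha mod 1.
# For irrational alpha the frequency approximates the interval's length; for
# rational alpha the orbit is periodic and the frequency depends on the start.
import math

def visit_freq(alpha, x, interval, n):
    a, b = interval
    hits = 0
    for _ in range(n):
        hits += a <= x < b
        x = (x + alpha) % 1.0
    return hits / n

alpha = math.sqrt(2) % 1
print(visit_freq(alpha, 0.1234, (0.0, 0.5), 100000))   # ~ 0.5
print(visit_freq(alpha, 0.1234, (0.2, 0.3), 100000))   # ~ 0.1
print(visit_freq(0.25, 0.1234, (0.2, 0.3), 100000))    # 0.0: the periodic orbit misses the interval
```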

An alternate proof of ergodicity can be obtained using Fourier series. Suppose f is square integrable and has Fourier series Σ a_n e^{2πinx}. The Fourier series of g(x) = f(Tx) is

Σ a_n e^{2πin(x+α)} = Σ a_n e^{2πinα} e^{2πinx}.

If g(x) = f(x), almost surely, then a_n = a_n e^{2πinα} holds for each integer n. If α is irrational then e^{2πinα} ≠ 1 unless n = 0, so that a_n = 0 unless n = 0, which, in turn, implies that the function f is constant. Thus, in the irrational case, the rotation T has no invariant functions and hence is ergodic.

In general, translation of a compact group will be ergodic with respect to Haar measure if orbits are dense. For example, consider the torus T = [0, 1) × [0, 1), the product of two circles, that is, the product of two intervals, with addition mod 1. The proof of the following is left to the reader.

Proposition 1.2.16 The following are equivalent for the mapping T: (x, y) ↦ (x ⊕ α, y ⊕ β).
(i) {T^n(x, y)} is dense for any pair (x, y).
(ii) T is ergodic.
(iii) α and β are rationally independent.

I.2.c The return-time picture.

A general return-time concept has been developed in ergodic theory, which, along with a suggestive picture, provides an alternative and often simpler view of two standard probability models, the renewal processes and the regenerative processes.

Let T be a measure-preserving transformation on the probability space (X, Σ, μ) and let B be a measurable subset of X of positive measure. For x ∈ B, let n(x) be the least positive integer n such that T^n x ∈ B, that is, n(x) is the time of first return to B. Return to B is almost certain, that is,

Theorem 1.2.17 (The Poincaré recurrence theorem.) n(x) < ∞, almost surely.

Proof To prove this, define B_∞ = {x ∈ B: T^n x ∉ B, ∀n ≥ 1}, the set of points of B that never return to B. If x ∈ B_∞ then T^n x ∉ B_∞, for all n ≥ 1, from which it follows that the sequence of sets {T^{-n}B_∞: n ≥ 0} is disjoint. Since they all have the same measure as B_∞ and μ(X) = 1, it follows that μ(B_∞) = 0. This proves Theorem 1.2.17.

For each n, define the sets B_n = {x ∈ B: n(x) = n}. These sets are clearly disjoint and have union B, almost surely. The sets

(11)    B_n, TB_n, ..., T^{n-1}B_n, n ≥ 1,

are measurable, for B_1 = B ∩ T^{-1}B and, for n ≥ 2, B_n = B ∩ T^{-1}(X − B) ∩ ... ∩ T^{-(n-1)}(X − B) ∩ T^{-n}B. To simplify further discussion it will be assumed in the remainder of this subsection that T is invertible. (Some extensions to the noninvertible case are outlined in the exercises.) In this case, for each n, the sets (11) are disjoint and have the same measure; only the first set B_n in this sequence meets B, while applying T to the last set T^{n-1}B_n returns it to B.

The return-time picture is obtained by representing the sets (11) as intervals, one above the other. Points in the top level of each column are moved by T to the base B in some manner unspecified by the picture. Furthermore, by reassembling within each level it can be assumed that T just moves points directly upwards one level. (See Figure 1.2.18.)

Figure 1.2.18 Return-time picture.
Note that the set ∪_{i=0}^{n-1} T^i B_n is just the union of the column whose base is B_n, so that the picture is a representation of the set

(12)    ∪_{n=0}^∞ T^n B.

This set is T-invariant and has positive measure. If T is ergodic then it must have measure 1, in which case the picture represents the whole space, modulo a null set, of course.

The picture suggests the following terminology. The ordered set

B_n, TB_n, ..., T^{n-1}B_n

is called the column C = C_n with base B_n, width w(C) = μ(B_n), height h(C) = n, levels L_i = T^{i-1}B_n, 1 ≤ i ≤ n, and top T^{n-1}B_n. Note that the measure of column C is just its width times its height, that is, μ(C) = h(C)w(C).

Various quantities of interest can be easily expressed in terms of the return-time picture. For example, the return-time distribution is given by

Prob(n(x) = n | x ∈ B) = w(C_n) / Σ_m w(C_m),

and the expected return time is given by

(13)    E(n(x) | x ∈ B) = Σ_n h(C_n) Prob(n(x) = n | x ∈ B) = Σ_n h(C_n)w(C_n) / Σ_m w(C_m).

In the ergodic case the latter takes the form

(14)    E(n(x) | x ∈ B) = 1 / Σ_m w(C_m) = 1/μ(B),

since Σ_n h(C_n)w(C_n) is the measure of the set (12), which equals 1, when T is ergodic. This formula is due to Kac, [23].

The transformation T̂ = T_B defined on B by the formula T̂x = T^{n(x)}x is called the transformation induced by T on the subset B. The basic theorem about induced transformations is due to Kakutani.
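Before turning to that theorem, here is a quick numerical illustration of Kac's formula (14). The sketch is not from the text; the two-state Markov chain and its transition probabilities are arbitrary choices, and B is taken to be the set of sequences whose first coordinate is 1.

```python
# Sketch (not from the text): along a sample path of an ergodic Markov chain,
# the average time between successive visits to a state b is 1/pi(b), as Kac's
# formula predicts for B = {x : x_1 = b}.
import random
random.seed(1)

P = {0: [0.9, 0.1], 1: [0.4, 0.6]}      # transition probabilities (arbitrary example)
# stationary vector: pi(0) = 0.8, pi(1) = 0.2, since it solves pi P = pi

def step(s):
    return 0 if random.random() < P[s][0] else 1

s, last_visit, gaps = 0, 0, []
for t in range(1, 1_000_000):
    s = step(s)
    if s == 1:                           # a return to B
        gaps.append(t - last_visit)
        last_visit = t

print(sum(gaps) / len(gaps))             # ~ 5.0 = 1/pi(1) = 1/mu(B)
```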

Theorem 1.2.19 (The induced transformation theorem.) If μ(B) > 0 the induced transformation preserves the conditional measure μ(·|B) and is ergodic if T is ergodic.

Proof If C ⊂ B, then x ∈ T̂^{-1}C if and only if there is an n ≥ 1 such that x ∈ B_n and T^n x ∈ C. This translates into the equation

T̂^{-1}C = ∪_{n=1}^∞ (B_n ∩ T^{-n}C),

which shows that T̂^{-1}C is measurable for any measurable C ⊂ B, and also that

μ(T̂^{-1}C) = Σ_{n=1}^∞ μ(B_n ∩ T^{-n}C),

since B_m ∩ B_n = ∅, if m ≠ n, with m, n ≥ 1. More is true, namely,

(15)    T^n B_n ∩ T^m B_m = ∅, m, n ≥ 1, m ≠ n,

by the definition of "first return" and the assumption that T is invertible. But this implies

Σ_{n=1}^∞ μ(T^n B_n) = Σ_{n=1}^∞ μ(B_n) = μ(B),

which, together with (15), yields

Σ_n μ(B_n ∩ T^{-n}C) = Σ_n μ(T^n B_n ∩ C) = μ(C).

Thus the induced transformation T̂ preserves the conditional measure μ(·|B).

The induced transformation is ergodic if the original transformation is ergodic. The picture in Figure 1.2.18 shows why this is so. If T̂^{-1}C = C, then each C ∩ B_n can be pushed upwards along its column to obtain the set

D = ∪_{n=1}^∞ ∪_{i=0}^{n-1} T^i(C ∩ B_n).

But μ(TD Δ D) = 0 and T is invertible, so the measure of D must be 0 or 1, which, in turn, implies that μ(C|B) is 0 or 1, since μ((D ∩ B) Δ C) = 0. This completes the proof of the induced-transformation theorem.

1.2.c.1 Processes associated with the return-time picture.
Several processes of interest are associated with the induced transformation and the return-time picture. It will be assumed throughout this discussion that T is an invertible, ergodic transformation on the probability space (X, Σ, μ), partitioned into a finite partition P = {P_a: a ∈ A}; that B is a set of positive measure; that n(x) = min{n: T^n x ∈ B}, x ∈ B, is the return-time function; and that T̂ is the transformation induced on B by T.

Two simple processes are connected to returns to B. The first of these is {R_n}, the (T, B)-process defined by the partition B = {B, X − B}, with B labeled by 1 and X − B labeled by 0, in other words, the binary process defined by

R_n(x) = χ_B(T^{n-1}x), n ≥ 1,


where χ_B denotes the indicator function of B. The (T, B)-process is called the generalized renewal process defined by T and B. This terminology comes from the classical definition of a (stationary) renewal process as a binary process in which the times between occurrences of 1's are independent and identically distributed, with a finite expectation, the only difference being that now these times are not required to be i.i.d., but only stationary with finite expected return-time. The process {R_n} is given by the time-zero encoder χ_B, hence, since T is ergodic, it is ergodic. Any ergodic, binary process which is not the all 0 process is, in fact, the generalized renewal process defined by some transformation T and set B; see Exercise 20.

The second process connected to returns to B is {R̂_j}, the (T̂, B̂)-process defined by the (countable) partition

B̂ = {B_1, B_2, ...}

of B, where B_n is the set of points in B whose first return to B occurs at time n. The process {R̂_j} is called the return-time process defined by T and B. It takes its values in the positive integers, and has finite expectation given by (14). Also, it is ergodic, since T̂ is ergodic. Later, it will be shown that any ergodic positive-integer valued process with finite expectation is the return-time process for some transformation and partition; see Theorem 1.2.24.

The generalized renewal process {R_n} and the return-time process {R̂_j} are connected, for conditioned on starting in B, the times between successive occurrences of 1's in {R_n} are distributed according to the return-time process. In other words, if R_1 = 1 and the sequence {S_j} of successive returns to 1 is defined inductively by setting S_0 = 1 and

S_j = min{n > S_{j-1}: R_n = 1}, j ≥ 1,

then the sequence of random variables defined by

(16)    R̂_j = S_j − S_{j-1}, j ≥ 1,

has the distribution of the return-time process. Thus, the return-time process is a function of the generalized renewal process.

Except for a random shift, the generalized renewal process {R_n} is a function of the return-time process, for given a random sample path R̂ = {R̂_j} for the return-time process, a random sample path R = {R_n} for the generalized renewal process can be expressed as a concatenation

(17)    R = ŵ(1)w(2)w(3)...

of blocks, where, for m > 1, w(m) = 0^{R̂_m − 1}1, that is, a block of 0's of length R̂_m − 1, followed by a 1. The initial block ŵ(1) is a tail of the block w(1) = 0^{R̂_1 − 1}1. The only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words, how to determine the waiting time τ until the first 1 occurs.

The start position problem is solved by the return-time picture and the definition of the (T, B)-process. The (T, B)-process is defined by selecting x ∈ X at random according to the μ-measure, then setting R_n(x) = 1 if and only if T^{n-1}x ∈ B. In the ergodic case, the return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column C at random according to column measure μ(C),


then selecting j at random according to the uniform distribution on {1, 2, ..., h(C)}, then selecting x at random in the j-th level of C. In summary,

Theorem 1.2.20 The generalized renewal process {R_n} and the return-time process {R̂_m} are connected via the successive-returns formula, (16), and the concatenation representation, (17), with initial block ŵ(1) = 0^{τ-1}1, where τ = R̂_1 − j + 1 and j is uniformly distributed on [1, R̂_1].

The distribution of τ can be found by noting that τ(x) = i if and only if x lies in the (n − i + 1)-st level of some column C_n with n ≥ i, that is, x ∈ ∪_{n≥i} T^{n-i}B_n, from which it follows that

Prob(τ = i) = Σ_{n≥i} μ(T^{n-i}B_n) = Σ_{n≥i} w(C_n).

The latter sum, however, is equal to Prob(x ∈ B and n(x) ≥ i), which gives the alternative form

(18)    Prob(τ = i) = (1 / E(n(x)|x ∈ B)) Prob(n(x) ≥ i | x ∈ B),

since E(n(x)|x ∈ B) = 1/μ(B) holds for ergodic processes, by (13). The formula (18) was derived in [11, Section XIII.5] for renewal processes by using generating functions. As the preceding argument shows, it is a very general and quite simple result about ergodic binary processes with finite expected return time.
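Formula (18) is simple to evaluate directly. The following sketch is not from the text; the return-time distribution q used in it is an arbitrary example, and the check merely confirms that (18) produces a probability distribution.

```python
# Sketch (not from the text): compute the waiting-time distribution (18) from a
# given return-time distribution q(n) = Prob(n(x) = n | x in B).
q = {1: 0.5, 2: 0.3, 5: 0.2}                      # arbitrary return-time distribution
mean = sum(n * p for n, p in q.items())            # E(n(x) | x in B) = 1/mu(B)

def waiting(i):
    tail = sum(p for n, p in q.items() if n >= i)  # Prob(n(x) >= i | x in B)
    return tail / mean

dist = {i: waiting(i) for i in range(1, max(q) + 1)}
print(dist)
print(sum(dist.values()))                          # 1.0, so (18) defines a distribution
```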

The return-time process keeps track of returns to the base, but may lose information about what is happening outside B. Another process, called the induced process, carries along such information. Let A* be the set of all finite sequences drawn from A and define the countable partition P̂ = {P̂_w: w ∈ A*} of B, where, for w = a_1^k,

P̂_w = {x ∈ B_k: T^{j-1}x ∈ P_{a_j}, 1 ≤ j ≤ k}.

The (T̂, P̂)-process is called the process induced by the (T, P)-process on the set B. The relation between the (T, P)-process {X_n} and its induced (T̂, P̂)-process {X̂_m} parallels the relationship (17) between the generalized renewal process and the return-time process. Thus, given a random sample path X̂ = {X̂_m} for the induced process, a random sample path x = {X_n} for the (T, P)-process can be expressed as a concatenation

(19)    x = ŵ(1)w(2)w(3)...,

into blocks, where, for m > 1, w(m) = X̂_m, and the initial block ŵ(1) is a tail of the block w(1) = X̂_1. Again, the only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words, how to determine the length of the tail ŵ(1). A generalization of the return-time picture provides the answer.

The generalized version of the return-time picture is obtained by further partitioning the columns of the return-time picture, Figure 1.2.18, into subcolumns according to the conditional distribution of (T, P)-names. Thus the column with base B_k is partitioned


into subcolumns C_w, labeled by k-length names w, so that the subcolumn corresponding to w = a_1^k has width

μ( ∩_{i=1}^k T^{-i+1}P_{a_i} ∩ B_k ),

the i-th level being labeled by a_i. Furthermore, by reassembling within each level of a subcolumn, it can again be assumed that T moves points directly upwards one level. Points in the top level of each subcolumn are moved by T to the base B in some manner unspecified by the picture. For example, in Figure 1.2.21, the fact that the third subcolumn over B_3, which has levels labeled 1, 1, 0, is twice the width of the second subcolumn, indicates that given that a return occurs in three steps, it is twice as likely to visit 1, 1, 0 as it is to visit 1, 0, 0 in its next three steps.
Figure 1.2.21 The generalized return-time picture.

The start position problem is solved by the generalized return-time picture and the definition of the (T, P)-process. The (T, P)-process is defined by selecting x ∈ X at random according to the μ-measure, then setting X_n(x) = a if and only if T^{n-1}x ∈ P_a. In the ergodic case, the generalized return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the generalized return-time picture. But this, in turn, is the same as selecting a subcolumn C_w at random according to column measure μ(C_w), then selecting j at random according to the uniform distribution on {1, 2, ..., h(w)}, where h(w) = h(C_w), then selecting x at random in the j-th level of C_w. This proves the following theorem.
Theorem 1.2.22 The process {X_n} has the stationary concatenation representation (19) in terms of the induced process {X̂_m} if and only if the initial block is ŵ(1) = w(1)_j^{h(w(1))}, where j is uniformly distributed on [1, h(w(1))].

A special case of the induced process occurs when B is one of the sets of P, say B = P_b. In this case, the induced process outputs the blocks that occur between successive occurrences of the symbol b. In the general case, however, knowledge of where the blocking occurs may require knowledge of the entire past and future, or may even be fully hidden.

I.2.c.2 The tower construction.
The return-time construction has an inverse construction, called the tower construction, described as follows. Let T be a measure-preserving mapping on the probability

space (X, Σ, μ), and let f be a measurable mapping from X into the positive integers N = {1, 2, ...}, of finite expected value. Let X̄ be the subset of the product space X × N defined by

X̄ = {(x, i): x ∈ X, 1 ≤ i ≤ f(x)}.

A useful picture is obtained by defining B_n = {x: f(x) = n}, and, for each n ≥ 1, stacking the sets

(20)    B_n × {1}, B_n × {2}, ..., B_n × {n}

as a column in order of increasing i, as shown in Figure 1.2.23. A set B̄ ⊆ X̄ is defined to be measurable if and only if {x: (x, i) ∈ B̄} ∩ B_n ∈ Σ, for each n ≥ 1 and each i ≤ n. The measure μ is extended to a measure on X̄ by thinking of each set in (20) as a copy of B_n with measure μ(B_n), then normalizing to get a probability measure, the normalizing factor being

Σ_{n=1}^∞ n μ(B_n) = E(f),

the expected value of f; in other words, the measure of B̄ ⊆ X̄ is given by

μ̄(B̄) = (1/E(f)) Σ_{n=1}^∞ Σ_{i=1}^n μ({x ∈ B_n: (x, i) ∈ B̄}).

A transformation T̄ is obtained by mapping points upwards, with points in the top of each column sent back to the base according to T. The formal definition is

T̄(x, i) = (x, i + 1), if i < f(x);    T̄(x, i) = (Tx, 1), if i = f(x).

The transformation T̄ is called the tower extension of T by the (height) function f. The transformation T̄ is easily shown to preserve the measure μ̄ and to be invertible (and ergodic) if T is invertible (and ergodic). The tower extension T̄ induces the original transformation T on the base B = X × {1}. This and other ways in which inducing and tower extensions are inverse operations are explored in the exercises.

Figure 1.2.23 The tower transformation.

As an application of the tower construction it will be shown that return-time processes are really just stationary positive-integer valued processes with finite expected value.
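The mechanics of the tower map are easy to see on a finite toy example. The sketch below is not from the text; the six-point base, the rotation used as the base map, and the height function are all arbitrary illustrative choices.

```python
# Sketch (not from the text): the tower extension of a base map T by a height
# function f, illustrated with T a rotation of a 6-point cycle.
BASE = list(range(6))
f = {0: 1, 1: 3, 2: 2, 3: 1, 4: 2, 5: 3}     # arbitrary height function

def T(x):                                    # base transformation: rotate the cycle
    return (x + 1) % 6

def tower(x, i):                             # points are (x, i) with 1 <= i <= f(x)
    if i < f(x):
        return (x, i + 1)                    # move up one level in the column over x
    return (T(x), 1)                         # from the top, return to the base under T

p = (0, 1)
orbit = [p]
for _ in range(sum(f.values()) - 1):
    p = tower(*p)
    orbit.append(p)
print(orbit)   # visits every level of every column exactly once, base points in T-order
```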

with the distribution of the waiting time r until the first I occurs given by (18). and note that the shift T on Aîz is invertible and ergodic and the function f (x) =X has finite expectation. In particular. onto the (two-sided) Kolmogorov measure y of the (T. (Recall that if Y.2.} such that the future joint process {(Xn.30 CHAPTER I. /2). E. A point (x. Show that the (T. (a) Show that {Zn } is the (TN. Show that a T-subinvariant function is almost-surely invariant. P)-process can be mixing even if T is only ergodic.Ar z of doubly-infinite positive-integer valued sequences.) 6.2.2. n .) 3. P-IP)-process. N}. then use Exercise 2.n ): n < t}. P)-process. Show that there is a measurable mapping F: X 1-± A z which carries p. for all n. The induced transformation and the tower construction can be used to prove the following theorem. = f (X n ). Suppose F: A z F--> Bz is a stationary coding which carries p. P)-process can be ergodic even if T is not. 7. where TA is the shift on A z . . and let Yn = X(n-1)N+1 be the N-th term process defined by the (T. A stationary process {X. Theorem 1. Ra ):n > t} is independent of the past joint process {(X.25 An ergodic process is regenerative if and only if it is an instantaneous function of an induced process whose return-time process is I. BASIC CONCEPTS. 1) at time f (x). R. (Hint: start with a suitable nonergodic Markov chain and lump states to produce an ergodic chain.} is regenerative if there is a renewal process {R. the return-time process defined by T and B has the same distribution as {k1 }. (Hint: consider g(x) = min{ f (x). 1. P)-process {Xn}. 1) returns to B = X x L1} at the point (T x . Xn+19 • • • Xn+N-1) be the overlapping N-block process. A nonnegative measurable function f is said to be T-subinvariant. Let Zn = ( X (n -1 )N+19 41-1)N+2. XnN) be the nonoverlapping N-block process. see Exercise 24. Let T be an invertible measure-preserving transformation and P = {Pa : a E Al a finite partition of the probability space (X. Show that the (T. . binary process via the concatenation-representation (17). a partition gives rise to a stationary coding. let Wn = (Xn. P)-process.24 An ergodic.2. This proves the theorem.1. given that Rt = 1.) Theorem 1.d Exercises. Determine a partition P of A z such that y is the Kolmogorov measure of the (TA.) 4. onto v. Proof Let p. fkil defines a stationary. that is. positive-integer valued process {kJ} with E(R 1 ) < oc is the return-time process for some transformation and set. such that F(Tx) = TAF(x). Let T be the tower extension of T by the function f and let B = x x {1}. Complete the proof of Lemma 1. then {Yn } is an instantaneous function of {Xn}. (Thus. if f(Tx) < f (x). a stationary coding gives rise to a partition.) 5. (Thus. Let T be a measure-preserving transformation. be the two-sided Kolmogorov measure on the space . 2.

Show that the overlapping n-blocking of an ergodic process is ergodic. for every x. X x X.u x .. j > 0. = (ar . N) goes to (br . (ar . then S(x . 15. 31 8. is a mixing finite-state process. y ± a) is not ergodic. 11. y) = (Tx. with the product measure y = p.) (b) Show that the process {} defined by setting 4. that if a is irrational then the direct product (x. 2. 1. p. 'P)-process.u(br).2. N) goes to (br .) and {Tx : x c X) is a family of measure-preserving transformations on a probability space (Y. (Hint: the transition matrix is irreducible. . y). Q)-name of (x. for each br E AN. for each br E AN. N}. 13. 16. S is called a skew product. and number p E (0.u. Let Y=Xx X. THE ERGODIC THEORY MODEL. 1). Fix a measure p. Generalize the preceding exercise by showing that if T is a measure-preserving transformation on a probability space (X. 0) with probability p ite(br ). v7P -1 P)-process.(1) = 1/2. Show that a concatenated-block process is ergodic. i 1). then S is called the direct product of T and R. that is. Let T be the shift on X = {-1. Give a formula expressing the (S. } is the (T. x y). j). 1} Z . (iii) (ar . Show that an ergodic rotation is not mixing. Txoy). T ry) is measurable. but is totally ergodic. y) = (Tx..n } be the stationary Markov chain with alphabet S = A N x {0. Show that the nonoverlapping n-blocking of an ergodic process may fail to be ergodic. y) i-± (Tx. 0) and Ym = ai . such that the mapping (x. Show that if T is mixing then it is totally ergodic. 12.SECTION 1. x and define S(x. if Y„. Let {Z. 17. (a) Show that there is indeed a unique stationary Markov chain with these properties. Try) is a measure-preserving mapping on the product space (X x Y. 1) with probability (1 — p). Show that if T is mixing then so is the direct product T x T: X x X where X x X is given the measure . 14. defined by the following transition rules. = a if I'm = (ar . Show directly. 10.(-1) = p. Insertion of random spacers between blocks can be used to change a concatenatedblock process into a mixing process. without using Proposition 6. . i) can only go to (ar .. y) 1-± (x a. (i) If i < N. = R. and let . (ii) (ar.") 9. y) in terms of the coordinates of x and y. Q)-process is called "random walk with random scenery.u be the product measure on X defined by p. (b) Let P be the time-zero partition of X and let Q = P x P. on AN. . If T. (c) Show that {Y } is the (T N . (a) Show that S preserves v. a symbol a E A N . (The (S. (b) Show that {147.

i F -1 D. Show how to extend P to a partition 2 of X. E. Prove Theorem 1. Show that the tower defined by the induced transformation function n(x) is isomorphic to T.) 21.25.) 19. Define P = {Pa : a E AI and 2 = {Qb: b Ea. . 24. Let T be an invertible measure-preserving transformation and P = {Pa : a E A} a finite partition of the probability space (X.2. 25. preserves the conditional measure. then P and 2 are 3E-independent. except for a set of Qb's of total measure at most . (Hint: use the formula F-1 (C n T ' D) = F-1 C n TA.} is a binary. 22. then it is the generalized renewal process for some transformation T and set B. P)-process is an instantaneous coding of the induced (.2. ? I' and return-time 23. except for a set of Qb's of total measure at most 6._ E. Show that if {X.?.32 CHAPTER I. BASIC CONCEPTS. (b) Show that if P and 2 are 6-independent then Ea iii(Pa I Qb) — 11 (Pa)i < Ag. 18. (Hint: obtain a picture like Figure 1. (Hint: let T be the shift in the two-sided Kolmogorov representation and take B = Ix: xo = 1 1. and is ergodic if T is ergodic.(Qb)1 . and let S = T be the tower transformation.b lit(Pa n Qb) — it(Pa)P.g.) 20. Let X be the tower over X defined by f.18 for T -1 . ergodic process which is not identically 0. Prove that a stationary coding of a mixing process is mixing. (c) Show that if Ea lia(PaiQb) — ii(Pa)i < E. such that the (T. Prove: a tower i'' over T induces on the base a transformation isomorphic to T. Prove that even if T is not invertible. then use this to guide a proof. d)-process. pt). the induced transformation is measurable. E BI to be 6-independent if (a) Show that p and 2 are independent if and only if they are 6-independent for each 6 > 0.

however. it may not be possible. A general technique for extracting "almost packings" of integer intervals from certain kinds of "coverings" of the natural numbers will be described in this subsection. of the form [n. The L'-convergence implies that f f* d 1a = f f d ia. The proof of the ergodic theorem will be based on a rather simple combinatorial result discussed in the next subsection. to pack an initial segment [1.1-valued and T to be ergodic the following form of the ergodic theorem is obtained. p. if f is taken to be 0. n E H.. Thus if T is ergodic then the limit function f*. In particular. K]. m] = (j E n < j < m). there is a subcover C' whose members are disjoint. .. but the more general version will also be needed and is quite useful in many situations not treated in this book. Birkhoff and is often called Birkhoff's ergodic theorem or the individual ergodic theorem. The ergodic theorem in the almost-sure form presented here is due to G.i.u = 1. D.This binary version of the ergodic theorem is sufficient for many results in this book. THE ERGODIC THEOREM. . E. Theorem 1. for it is an important tool in its own right and will be used frequently in later parts of this book.}.3. processes to the general class of stationary processes. where n1 = 1 and n i±i = 1 + m(n). i > 1. since it is T-invariant. 33 Section 1.u. In this discussion intervals are subsets of the natural numbers Ar = {1.a Packings from coverings. unless the function m(n) is severely restricted.3. there be a disjoint subcollection that fills most of [1.d. K] by disjoint subcollections of a given strong cover C of the natural numbers.) If T is a measure-preserving transformation on a probability space (X.) If {X n } is a binary ergodic process then the average (X 1 +X2+. namely.1 (The ergodic theorem. The finite problem has a different character. The combinatorial idea is not a merely step on the way to the ergodic theorem. just set C' = {[n.u = f f -E n n i=1 by the measure-preserving property of T.E f f(T i-l x)d..SECTION 1. and consists of all intervals of the form [n. If it is only required.3. must be almost-surely equal to the constant value f f d.. f (T i-1 x) converges almost surely and in L 1 -norm to a T -invariant function f*(x).) and if f is integrable then the average (1/n) EL. Theorem 1. however. even asymptotically. A strong cover C of Af is defined by an integer-valued function n m(n) for which m(n) > n. (The word "strong" is used to indicate that every natural number is required to be the left endpoint of a member of the cover. The ergodic theorem extends the strong law of large numbers from i. m(n)]}. I.) A strong cover C has a packing property. 2. for. since f1 n f(T i-l x) d.3.3 The ergodic theorem. .. This is a trivial observation.+Xn )In converges almost surely to the constant value E(X 1 ). then a positive and useful result is possible.2 (Ergodic theorem: binary ergodic form. m(n)].

This produces the disjoint collection C' = {[iL +1. The (L. thus guarantees that 1[1. start from the left and select successive disjoint intervals. This completes the proof of the packing lemma.) Let C be a strong cover of N and let S > 0 be given. For the interval [1. by induction. K]. selecting the first interval of length no more than L that is disjoint from the previous selections. K — L]: m(j) — j +1 < L}. K]. by construction. In particular. Proof By hypothesis K > LIS and (1) iin E K]: m(n) — n +1 > L11 < SK. suppose all the intervals of C have the same length. K] has length at most L — 1 < SK.m(n i )]: 1 < i < I} is a (1 — 26)-packing of [1.m(n i )] has length at least (1 — 25)K. stopping when within L of the end. The desired positive result is just an extension of this idea to the case when most of the intervals have length bounded by some L <8 K. If K > LIS and if [1.K]—U]I< 6K. and apply a sequential greedy algorithm. stopping when within L of the end of [1. To carry this out rigorously set m(0) = no = 0 and. Lemma 1. To motivate the positive result. then there is a subcollection C' c C which is a (1 — 26)packing of [1. since m(n i )—n i +1 < L. and hence m(n i ) < K. The interval (K — L. and are contained in [1. The interval [1. K] is (L. The intervals are disjoint. K — L]-1.(i +1)L]: 0 < i < (K — L)IL} which covers all but at most the final L — 1 members of [1. K] is said to be (L. K]. K — L] —Li then m(j) — j + 1 > L. K — L] for which m(j) — j + 1 < L.34 CHAPTER I. K — L]. the definition of the n i implies the following fact. K]. 6)-strong-cover assumption. The construction of the (1 — 26)-packing will proceed sequentially from left to right. that is. BASIC CONCEPTS. K] if the intervals in C' are disjoint and their union has cardinality at least (1 — . so that I(K —L.1 of the [ni . The construction stops after I = I (C. The claim is that C' = {[n i .3. (1). A collection of subintervals C' of the interval [1. K) steps if m(n i ) > K — L or there is no j E[1 ±m(ni). if 6 > 0 is given and K > LIS then all but at most a 6-fraction is covered.3 (The packing lemma. Thus it is only necessary to show that the union 1. Let C be a strong cover of the natural numbers N. define n i = minfj E [1+ m(n i _i). (2) If j E [I.ll< 6K. K] is called a (1 —6)-packing of [1. 8)-strongly-covered by C. K]. . say L. K]: m(n) — n +1 > L}1 < 6. 6)-strongly-covered by C if 1ln E [1.

In this case . Suppose (3) is false. Since B is T-invariant. including the description (2) of those indices that are neither within L of the end of the interval nor are contained in sets of the packing. But. as it illustrates the ideas in simplest form. Some extensions of the packing lemma will be mentioned in the exercises. Without loss of generality the first of these possibilities can be assumed.b The binary. ergodic process proof.u. m(n)]: n E JVI is a (random) strong cover of the natural numbers . Thus the collection C(x) = lln. for each integer n there will be a first integer m(n) > n such that X n Xn+1 + • • • + Xm(n) m(n) — n + 1 > p. A proof of the ergodic theorem for the binary ergodic process case will be given first.SECTION 1. i < K are contained in disjoint intervals over which the average is too big. if the average over most such disjoint blocks is too big.4 In most applications of the packing lemma all that matters is the result. each interval in C(x) has the property that the average of the xi over that interval is too big by a fixed amount. The packing lemma can be used to reduce the problem to a nonoverlapping interval problem. ( 1 ) n-+oo n i=i has positive measure. Since this occurs with high probability it implies that the expected value of the average over the entire set A K must be too big.3. it will imply that if K is large enough then with high probability.(B) = 1. and the goal is to show that (3) lirn — x i = . a. Furthermore. • + Xn+1 the set B is T-invariant and therefore p.s. 1}. Then either the limit superior of the averages is too large on a set of positive measure or the limit inferior is too small on a set of positive measure. is invariant and ergodic with respect to the shift T on {0. which is a contradiction. combined with a simple observation about almost-surely finite variables. 35 Remark 1.A'. large interval.3.3. and hence there is an E > 0 such that the set {n B = x : lirn sup — E xi > .. . but in the proof of the general ergodic theorem use will be made of the explicit construction given in the proof of Lemma 1. however. I. for.) Now suppose x E B.a(1).3.u. THE ERGODIC THEOREM. These intervals overlap. (Here is where the ergodicity assumption is used.(1) + E. K] must also be too big by a somewhat smaller fixed amount. most of the terms xi . Since lim sup n—>oo 1 +E xi + + xn = =sup n—>oo x2 + . then the average over the entire interval [1. so it is not easy to see what happens to the average over a fixed. n—*øo n i=1 i n where A(1) = A{x: x 1 = 1} = E(X1).3.

The definitions of D and GK imply that if x E GK then C(x) is an (L. then . E.23)(1 6)[. it will be used to obtain the general theorem. Note that while the collection C(x) and subcollection C'(x) both depend on x.a). Second. the set GK = {X: gic(x) < 3} has measure at least 1 .) > (1 .m(ni)]: i < I (x)} C C(X) which is a (1 . 1(x) m(n) (4) j=1 x.c The proof in the general case.3. . note that the function g K(x) = i=1 1 IC where x denotes the indicator function. let f let a be an arbitrary real number. K]. BASIC CONCEPTS. This completes the proof of (3). The Markov inequality implies that for each K. E.3. E L i (X.u). Thus given 3 > 0 there is a number L such that if D = {x: m(1) > L}. as long as x E GK. by assumption. . so that 1 K p. (1) = E (—Ex .8. since the random variable m(1) is almost-surely finite. The following lemma generalizes the essence of what was actually proved.36 CHAPTER I. n i=i Then fB f (x) d(x) > a g(B). since T is measure preserving.5 Let T be a measure-preserving transformation on (X. and since the intervals in C'(x) are disjoint. . the lower bound is independent of x. ergodic process form of the ergodic theorem. I. and define the set 1 n B = {x: lim sup . Since the xi.E f (T i-1 x) > al. > i=1 j_—n 1 (l _ 28)K [p. The preceding argument will now be extended to obtain the general ergodic theorem.a(1) + c]. A bit of preparation is needed before the packing lemma can be applied. it is bounded except for a set of small probability. Thus. 3)-strongcovering of [1.K].u(D) <62. Lemma 1.(1)+ c] .26)(1 (5)K [p.23)-packing of [1. First. are nonnegative. the sum over the intervals in C'(x) lower bounds the sum over the entire interval.26)K ku(1) + El u(GK) > (1 .(1) + c]. taking expected values yields (K E iE =1 •x > (1 .u(D) < 6 2 . J=1 which cannot be true for all S. the binary.3. and thereby the proof of Theorem 1.2. has integral . Thus the packing lemma implies that if K > LI 8 and x E GK then there is a subcollection Cf (x) = {[ni. that is.

1B).3. where f is the indicator function of a set C and a = p(C) c. x) = > R(K . it is enough to prove fB f (x) c 1 ii.m(n i )] intervals.m(ni)]} depends on x.SECTION 1. Only a bit more is needed to handle the general case. if K > LIB then the packing lemma can be applied and the argument used to prove (4) yields a lower bound of the form E f( Ti-lx ) j=1 (6) where R(K . x) is bounded from below by —2M KS and the earlier argument produces the desired conclusion. x) as well as the effect of integrating over GK in place of X. this lemma is essentially what was just proved. does give 1 v-IC s f (x) dia(x1B) = (7) a' B i=i II -I.i(B) > O.12 + 13. The set B is T-invariant and the restriction of T to it preserves the conditional measure . THE ERGODIC THEOREM. a bit more care is needed to control the effect of R(K. In the unbounded case. + (1 — 28)K a. Thus. say 1f(x)1 < M. given 3 > 0 there is an L such that if D= E B: m(1) > L } then . (Keep in mind that the collection {[ni.) In the bounded case. m(n)]: n E AO is a (random) strong cover of the natural numbers Js/.(x1B) > a.5 is true in the ergodic case for bounded functions. For x E B and n E A1 there is a least integer m(n) > n such that (5) Em i= (n n ) f (T i-1 m(n)—n+1 > a. though integrable. As before. E jE[1. the sum R(K . where f is allowed to be unbounded. so it can be assumed that . (1 — 8)a A(G KIB) . Furthermore.. the collection C(x) = fin. 37 Proof Note that in the special case where the process is ergodic.)] f (T x) is the sum of f(Ti -l x) over the indices that do not belong to the [ne. together with the lower bound (6). An integration.1q—U[n.3.u(.u(D1B) <82 and for every K the set 1 G K = {X E B: 1 -7 i=1 E xD(Ti-ix) K 6} has conditional measure at least 1 — B. The same argument can be used to show that Lemma 1. The lemma is clearly true if p(B) = 0. Since B is T-invariant. x) + E E f (T i-1 x) i=1 j=ni 1(x) m(ni) R(K .m(n.

which is small if K is large enough.K— L] —U[ni.m(n. E GK jE[1. 1 ___ K f (Ti -1 x) c 1 ii.k. which is upper bounded by 8.(x1B) > a.M(ri.(x1B).(xim.K]—U[ni.)] E 13= (T i-1 x) d. will be small if 8 is small enough. in the inequality f D1 f (x)1 clia(x1B). see (2).38 where CHAPTER I. and hence 11 itself.K] and hence the measure-preserving property of T yields 1131 B 1 fl(x) dp. The integral 13 is also easy to bound for it satisfies 1 f dp. In the current setting this translates into the statement that if j E [1. L f(x) d ii(x B) ?./ x) clitt(xIB) = fT_J(B-GK) f(x) Since f E L I and since all the sets T -i(B — GK) have the same measure.5. the final three terms can all be made arbitrarily small. K fB—Gic j=1 12 = f(Ti -l x) dp.3. K — L] — U[ni . BASIC CONCEPTS. m(n)] then Ti -l x E D. In summary.(xiB). K — L] — m(ni)] then m(j) — j 1 > L. so that passage to the appropriate limit shows that indeed fB f (x) dp. 1131 15. that if j E [1. Thus 1121 5_ -17 1 f v—N K .u(xIB).)] GK — K 1 jE(K—L. all of the terms in 11. for any fixed L. . see (7).B jE(K—L. and the measure-preserving property of T yields the bound 1121 5_ which is small if L is large enough.x)1 f (P -1 x)I dwxiB). recall from the proof of the packing lemma. f The measure-preserving property of T gives f3-Gic f (T . which completes the proof of Lemma 1.B xD(T' . (1 — kt(G + 11 + ± 13.(x1B). . To bound 12.

SECTION 1. . say I f(x)I < M. xE X. that is. UT f = if Ii where 11f II = f Ifl diu is the L'-norm.5 remains true for any T-invariant subset of B. including the short. f follows immediately from this inequality. as well as von Neumann's elegant functional analytic proof of L 2-convergence. 1=1 Remark 1.3. The reader is referred to Krengel's book for historical commentary and numerous other ergodic theorems. n. Ag converges in L'-norm.u(C)< Jc f d which can only be true if ti(C) = O. g E L I then for all m.3. Since almost-sure convergence holds. almost surely. for g can be taken to be a bounded approximation to f.-convergence of A. Moreover. Other proofs that were stimulated by [51] are given in [25. the dominated convergence theorem implies L' convergence of A n f for bounded functions.6 Birkhoff's proof was based on a proof of Lemma 1. 39 Proof of the Birkhoff ergodic theorem: general case. that is. whose statement is obtained by replacing lirn sup by sup in Lemma 1. but it is much closer in spirit to the general partitioning. then A n f (x)I < M. Thus if 1 n 1 n f (T i x) C = x: lim inf — f (T i x) <a < fi < lim sup — n-400 n i=1 n—. To complete the discussion of the ergodic theorem L' -convergence must be established. and packing ideas which have been used recently in ergodic and information theory and which will be used frequently in this book. The proof given here is less elegant. There are several different proofs of the maximal theorem. Most subsequent proofs of the ergodic theorem reduce it to a stronger result. This completes the proof of the Birkhoff ergodic theorem. [32]. almost surely. and since the average An f converges almost surely to a limit function f*. THE ERGODIC THEOREM.3. Since. Thus the averaging operator An f 1 E n n i=i T is linear and is a contraction. its application to prove the ergodic theorem as given here was developed by this author in 1980 and published much later in [68]. if f is a bounded function. But if f. IA f II If Il f E L I . elegant proof due to Garsia that appears in many books on ergodic theory. covering.3. it follows that f* E L I and that A n f converges in L'-norm to f*. UT f (x) = f (Tx). The packing lemma is a simpler version of a packing idea used to establish entropy theorems for amenable groups in [51]. Proofs of the maximal theorem and Kingman's subadditive ergodic theorem which use ideas from the packing lemma will be sketched in the exercises.5. To do this define the operator UT on L I by the formula. This proves almost sure convergence. The operator UT is linear and is an isometry because T is measure preserving. called the maximal ergodic theorem. To prove almost-sure convergence first note that Lemma 1.3. A n f — Am f 5- g)11 + ii Ag — Amg ii 2 11 f — gil + 11Ang — A m g iI — g) — A ni (f — The L . f EL I .o0 n i=1 { E E then fi.5. 27]. as was just shown.

x) such that if n > N then [1.-almost-surely finite if .ix)< 6/ 41 G .u) with values in the extended natural numbers . for the first time m that rin f (Ti -1 x) > ma is a stopping time. In particular. A stopping time r is p. hence. . By the ergodic theorem. r) = {[n. Several later results will be proved using extensions of the packing lemma.3. for almost every x.d Extensions of the packing lemma. there is an integer L such that p({x: r(x) > LI) < 6/4. A (generalized) stopping time is a measurable function r defined on the probability space (X. t(Tn -1 x) n — 1]: n E of stopping intervals is. eventually almost surely. Gn = then x E ix 1 x---11 s :— n 2x D (Ti . however. the stopping time for each shift 7' 1 x is finite. The interval [n. The given proof of the ergodic theorem was based on the purely combinatorial packing lemma.40 CHAPTER I. which is a common requirement in probability theory. starting at n. the function r(Tn -l x) is finite for every n. almost surely.u ({x: r(x) = °o }) = O. it is bounded except for a set of small measure. Let D be the set of all x for which r(x) > L. for an almost-surely finite stopping time. nor that the set {x: r(x) = nl be measurable with respect to the first n coordinates. Suppose it is a stationary measure on A'. E. lim almost surely.) If r is an almost-surely finite stopping time for an ergodic process t then for any > 0 and almost every x there is an N = N (3.7 (The ergodic stopping-time packing lemma. Now that the ergodic theorem is available. Proof Since r (x) is almost-surely finite. Thus almost-sure finiteness guarantees that. Lemma 1. . given 8 > 0. Frequent use will be made of the following almost-sure stopping-time packing result for ergodic processes. I.u(D) <6/4. BASIC CONCEPTS. see (5). the collection C = C(x.n] is (1 — 3)-packed by intervals from C(x. An almost-surely finite stopping time r has the property that for almost every x. Note that the concept of stopping time used here does not require that X be a sequence space.r(Tn.3. a (random) strong cover of the natural numbers A. r).lx)-F n — 1] is called the stopping interval for x. The stopping time idea was used implicitly in the proof of the ergodic theorem. so that if 1 n 1=1 x D (T i-1 x) d p(x) = . the packing lemma and its relatives can be reformulated in a more convenient way as lemmas about stopping times.Af U ool.

v. Suppose each [ui . I.3. a partial-packing result that holds in the case when the stopping time is assumed to be finite only on a set of positive measure. vd: i E I} be a disjoint collection of subintervals of [1. for which Tn -l x E F(r).] has length m. ti]. Two variations on these general ideas will also be useful later. and let {[si . These are stated as the next two lemmas.) Let {[ui. Formulate the partial-packing lemma as a strictly combinatorial lemma about the natural numbers. ti ] has length at least M. yi ] that meet the boundary of some [si. n] has a separated (1 — 28)-packing C' c C(x. implies that r(T -(' -1) x) < L for at least (1 — 314)n indices i in the interval [1. Show that this need not be true if some intervals are not separated by at least one integer. .) . The definition of G. where M > m13. then [1.SECTION 1.) 2.9 (The two-packings lemma. and let r be an almost-surely finite stopping time which is almost-surely bounded below by a positive integer M > 1/8. Lemma 1. Let I* be the set of indices i E I such that [ui. Show that for almost every x. But this means that [1.u(F(r)) > y > O. Show that a separated (1 — S)-packing of [1. For almost every x. 4. and each [s1 . n] and hence for at least (1 — 812)n indices i in the interval [1. let r be a stopping time such that . n] has a y -packing C' c C(x. n] is (L. K] of total length 13 K . n — L]. (Hint: define F'(x) = r(x) + 1 and apply the ergodic stopping-time lemma to F. yi] meets at least one of the [s i . r). r) such that if n > N. K] of total length aK. 5. Let p. r). r) to be the set of all intervals [n. For a stopping time r which is not almost-surely infinite. Prove the partial-packing lemma. 1 — 3/2)-stronglycovered by C(x. r) Lemma 1. ti]: j E J} be another disjoint collection of subintervals of [1. The packing lemma produces a (L.e Exercises.8 (The partial-packing lemma. This proves the lemma.3. there is an integer N= x) such that if n > N then [1. THE ERGODIC THEOREM. A (1 — 3)-packing C' of [1. be an ergodic measure on A. r(r-1 n — 1]. ti ]. and a two-packings variation. with the proofs left as exercises. 1. put F(r) = {x: r(x) < ocl. be an ergodic measure on A'. and define C(x. Then the disjoint collection {[s i . 3. ti ]: j E J } U flub i E I*1 has total length at least (a + p . r). 41 Suppose x E G„ and n > 4L/3.) Let p.3.3. n] is said to be separated if there is at least one integer between each interval in C'. 1 — 3/2)-packing of [1. (Hint: use the fact that the cardinality of J is at most K I M to estimate the total length of those [u. n] by a subset of C(x. there is an N = N (x.23)K. n] is determined by its complement. Prove the two-packings lemma.

(e) For B = {x: sup E71-01 f (T i x) > a}. 9. show that (Hint: let r(x) = n. (d) Deduce that L' -convergenceholds. Assume T is measure preserving and f is integrable. The ideas of the following proof are due to M. The mean ergodic theorem of von Neumann asserts that (1/n) EL I verges in L2-norm. BASIC CONCEPTS. let Un be the set of x's for whichEL -01 f (T i x) > 0. as n 7. The ideas of the following proof are due to R.u(B). then ii. Assume T is invertible so that UT is unitary. (Hint: show that its orthogonal complement is 0.} is a sequence of integrable functions such that gn+.F+M is dense in L 2 . (d) Show that the theorem holds for integrable functions.(Gn (r. Kingman's subadditive ergodic theorem asserts that gn (x)/ n converges almost surely to an invariant function g(x) > —oc. (a) Prove the theorem under the additional assumption that gn < 0.) (b) For bounded f and L > N. if f E LI .) . (e) Extend the preceding arguments to the case when it is only assumed that T is measure preserving. Fix N and let E = Un <N Un . r). One form of the maximal ergodic theorem asserts that JE f (x) di(x) > 0. for any L 2 function f. n ] is (1 — (5)-packed by intervals from oc.(x) < g(x) g ni (Tn x)..42 CHAPTER I. (a) Show that the theorem is true for f E F = If E L 2 : UT f = f} and for f E .Jones. (Hint: sum from 0 to r(x) — 1. and for each n. (b) Show that . then the averages (1/N) E7 f con8. [33].) (c) Deduce von Neumann's theorem from the two preceding results. show that h f da a. (Hint: first show that g(x) = liminf g n (x)In is an invariant function. 10. 8)) 1. then apply packing. x E f (T i x)xE(T i x) i=o — (N — fII.) (c) Show that the theorem holds for bounded functions. 8) is the set of all x such that [1. Show that if r is an almost-surely finite stopping time for a stationary process p. Assume T is measure preserving and { g. for some r(x) < E Un . show that r(x)-1 E i=o f (T i x)xE(T i x) ?_ 0. Show that if ti is ergodic and if f is a nonnegative measurable. but not integrable f (T i-1 x) converge almost surely to oc. then from r(x) to r(T) — 1. 6. [80]. C(x. N. function. and G n (r . r(x) = 1. (a) For bounded f. x 0 E.Steele.A4 = (I — UT)L 2 . continuing until within N of L.

4.u < oc. then convergence is also in L 1 -norm. n — k + 1]: xi +k-1 = 4}1 n—k1 n—k+1 If x and k < n are fixed the relative frequency defines a measure pk • 14) on A " .(Tn+gx) + y8 . limiting relative frequencies converge almost surely to the corresponding probabilities. that this limit exists. where x [4] denotes the indicator function of the cylinder set [an. FREQUENCIES OF FINITE BLOCKS. 43 (Hint: let (b) Prove the theorem by reducing it to the case when g„ < O. x is typical for tt if for each k the empirical distribution of each k-block converges to its theoretical probability p. that is. the subscript k on pk(ali( 14) may be omitted. When k is understood. the existence of limiting frequencies allows a stationary process to be represented as an average of ergodic processes. oo provided. . g„. The limiting (relative) frequency of di in the infinite sequence x is defined by (1) p(ainx) = lim p(ainx). Thus. if each block cif appears in x with limiting relative frequency equal to 11. of course.(4). f and a = fg d d. A sequence x is said to be (frequency) typical for the process p. The frequency can also be expressed in the alternate form f(4x) = I{i E [1.4 Frequencies of finite blocks. In the ergodic case. In the nonergodic case. n—k+1 Palic Wiz = The frequency of the block ail' in the sequence Jet' is defined for n > k by the formula E i=1 x fa i l (T i-1 x). n — k + 1]: = where I • I denotes cardinality. g 1 (x) = gni (x) — ET -1 gi (T i x ). where y8 0 as g Section 1. to obtain Pk(aflx7) = f (4x) = I{i E [1. + (d) Show that the same conclusions hold if gn < 0 and gn+8+. The relative frequency is defined by dividing the frequency by the maximum possible number of occurrences. These and related ideas will be discussed in this section. f(a14) is obtained by sliding a window of length k along x'it and counting the number of times af is seen in the window.SECTION 1.) (c) Show that if a = inf.n (x) < cc. called the empirical distribution of overlapping k-blocks. a property characterizing ergodic processes. I. n — k 1.a Frequencies for ergodic processes. An important consequence of the ergodic theorem for stationary finite alphabet processes is that relative frequencies of overlapping k-blocks in n-blocks converge almost surely as n —> oc. or the (overlapping) k-type of x.4.(4).

for almost every x. where C is an arbitrary measurable set.-almost surely. with probability 1.) Suppose A is a stationary measure such that for each k and for each block di'. A . the set 00 B=U k=1 ak i EAk B(a 1`) XE has measure 0 and convergence holds for all k and all al'. the limiting relative frequencies p(ali`Ix) exist and are constant in x.(4). Thus the entire structure of an ergodic process can be almost surely recovered merely by observing limiting relative frequencies along a single sample path.4. Proof If the limiting relative frequency p(a'nx) exists and is a constant c. In other words. BASIC CONCEPTS. Then A is an ergodic measure. n liM — n i=1 x8 = holds for almost all x for any cylinder set B.k +1 E k • (T i X •_ ' ul' ' —1 x). Thus the hypotheses imply that the formula x--.(4). a set of [=:] Theorem 1. which.) If A is ergodic then A(T (A)) = 1.4.44 CHAPTER I. p. Let T(A) denote the set of sequences that are frequency typical for A. for fixed 4. produces liM n—)00 fl E X B (T i-l x)X c (x)= i=1 n and an integration then yields (2) lim A(T -i B n i=1 E n n= .B. Theorem 1.1 (The typical-sequence theorem. A basic result for ergodic processes is that almost every sequence is typical. = p(a). then this constant c must equal p. for all measure 1. for x B(a lic). lim p(aIN) = for all k and all a. such that p(alicixtiz) -> 1). converges almost surely to f x [4] dp. since the ergodic theorem includes L I -convergence of the averages to a function with the same mean. Since there are only a countable number of possible af. by the ergodic theorem. This proves the theorem.2 (The typical-sequence converse. n—k+1 p(afixiz) - n . that is. The converse of the preceding theorem is also true. as n -> co. Multiplying each side of this equation by the indicator function xc (x). there is a set B(a) of measure 0. Proof In the ergodic case.

The usual approximation argument then shows that formula (2) holds for any measurable sets B and C. (ii) lima pi. a. Note that as part of the preceding proof the following characterizations of ergodicity were established. Sequences of length n can be partitioned into two classes. c)-typical sequences.2.4. C.(iii) limn x . 45 The latter holds for any cylinder set B and measurable set C.5 (The finite form of ergodicity.4. The precise definition of the "good" set. . This proves Theorem 1.4. a E The "bad" set is the complement. 6) = An — G n (k. (ii) If x.q) is bounded and converges almost surely to some limit.3 The following are equivalent. are often called the frequency-(k. the limit must be almost surely constant. for every k and c > O. for each E in a generating algebra.(Ti-lx) = tt(E). c) such that if n > N then there is a collection C. = 1. c).L(a)1 < c. so that those in Gn will have "good" k-block frequencies and the "bad" set Bn has low probability when n is large.4 (The "good set" form of ergodicity.e.) (a) If ji is ergodic then 4 (b) If lima p. eventually almost surely. 6) = p(afix) — . for each A and B ma generating algebra. 6)) E Gn (k. C An such that tt(Cn) > 1 — E.u(A)A(B). Some finite forms of the typical sequence theorems will be discussed next. FREQUENCIES OF FINITE BLOCKS. which depends on k and a measure of error. and hence (b) follows from Theorem 1. Theorem 1. G . then p. then 1/3(41fi') — p(aini)I < c. B n (k. Theorem 1.2. E). and B. (i) T is ergodic. G.(k.n (G n (k.SECTION 1. Theorem 1.4. for every integer k > 0 and every c > 0. A(T -i A n B) = . The members of the "good" set. Theorem 1. so that p(B) must be 0 or 1. Since p(an. or when k and c are understood. E). The condition that A n (G n (k. just the typical sequences. is Proof Part (a) is just a finite version of the typical-sequence theorem..4. ergodic.4.) A measure IL is ergodic if and only if for each block ail' and E > 0 there is an N = N(a.4. But if B = B then the formula gives it(B) = A(B) 2 . E)) converge to 1 for any E > 0 is just the condition that p(41x7) converges in probability to ii(4).1. is (3) G n (k. It is also useful to have the following finite-sequence form of the typical-sequence characterization of ergodic processes.

c) such that . the "good" set defined by (3).4. IP(ailx) — P(ailZ)I 3E.s.4. a set Bn c An with desired properties is determined in some way. eventually almost surely. M — n]: E 13n}1 — ML(13n)1 <8M. n ) > 1 — 2E. then liM Pn(BniXi ) M—).7 (The almost-covering principle. in which case the conclusion (4) is expressed. M — n that is. Proof Apply the ergodic theorem to the indicator function of Bn . A sequence xr is said to be (1 — 8)strongly-covered by B n if I {i E [1. there are approximately tt(Bn )M indices i E [1. and use convergence in probability to select n > N(a. if pn (B n iXr) < b. eventually oc. informally. .4. and if 8.46 CHAPTER I. C An. An of positive 1-1 (Bn). then define G n = ix: IP(a inxi) — P(a inx)I < EI. This simple consequence of the ergodic theorem can be stated as either a limit result or as an approximation result. Proof If it is ergodic. xr is "mostly strongly covered" by blocks from Bn . and if Bn is a subset of measure. satisfies A(13 n ) > 1 — then xr is eventually almost surely (1 — (5)-strongly-covered by r3n. If n El. and hence p(a Ix) is constant almost surely.n . Theorem 1. (4) E [1. M — n] for almost surely as M which x: +n-1 E Bn . eventually almost surely. by saying that.) If kt is an ergodic process with alphabet A.6 (The covering theorem. just set Cn = G n (k.b The ergodic theorem and covering properties. is an ergodic process with alphabet A. X. In these applications. In many situations the set 13. BASIC CONCEPTS.) If 12.4. Theorem 1. E _ _ n E. is chosen to have large measure. then the ergodic theorem is applied to show that. and use Theorem 1. The ergodic theorem is often used to derive covering properties of finite sample paths for ergodic processes. proving the theorem.a(a n n e . 1]: xn-1 B n }i < 8M. Thus. E). a.4. I.co In other words. To establish the converse fix a il' and E > 0.u(an) > 1 — Cn satisfies (i) and (ii) put En = {x: 4 E Cn} and note that . for any 8 > 0.

is stated as the following theorem. In the stationary nonergodic case. 47 The almost strong-covering idea and a related almost-packing idea are discussed in more detail in Section L7. let lim kr. F(xr) = min{i E [1. but the limit measure varies from sequence to sequence. that is. almost all of which will. It follows easily from (5). that if xr is (1 — 6)strongly-covered by 13n .4. Such an n exists by the ergodic theorem. first to select the set 13n . with the convention F(41) = L(41 ) = 0.7. nil. FREQUENCIES OF FINITE BLOCKS.c. The sequences that produce the same limit measure can be grouped into classes and the process measure can then be represented as an average of the limit measures. Since 3 is arbitrary. the desired result.8 Let p. Theorem 1. I. and then to obtain almost strong-covering by 13n . These ideas will be made precise in this section.2. in fact. Note. so that L(xr) — F(41) > (1 — 3 8 )M. The almost-covering principle.4. Let F (xr) and L(xr) denote the first and last occurrences of a E Xr. The following example illustrates one simple use of almost strong-covering.00 L(41 )— F(41 ) = 1. implies that xr is eventually almost surely (1 — 3)-strongly-covered by Bn . The set of all x for which the limit p(ainx) = lim p(44) exists for all k and all di will be denoted by E. . Bn = {a7: ai = a. be ergodic. Given 8 > 0. if no a appears in xr. This result also follows easily from the return-time picture discussed in Section I. The almost-covering idea will be used to show that (5) To prove this. The set is shift invariant and any shift-invariant measure has its support in This support result. But if this is so and M > n/3. that the expected waiting time between occurrences of a along a sample path is. be an ergodic process and fix a symbol a E A of positive probability. in the limit. Such "doubling" is common in applications. M]: x i = a} L(x) = max{i E [1.4. however. together with the ergodic theorem. limiting relative frequencies of all orders exist almost surely. (5). almost surely. then at least one member of 13n must start within 3M of the beginning of xr and at least one member of 1:3n must start within 8M of the end of xr. Thus is the set of all sequences for which all limiting relative frequencies exist for all possible blocks of all possible sizes. Note that the ergodic theorem is used twice in the example. M]: x i = a}. almost surely equal to 1/p. Example 1. see Exercise 2. follows.SECTION 1.c The ergodic decomposition. for some i E [1.4. then F(41 ) < 23M and L(xr) > (1 — 3)M. which is a simple consequence of the ergodic theorem. choose n so large that A(B n ) > 1 — 3.(a).

x (a) = p(ainx).. The ergodic measures.1. Let E denote the set of all sequences x E L for which the measure ti. k 1 i=1 the ergodic theorem implies that p(aflx) = limn p(aii`ix) exists almost surely. To make this assertion precise first note that the ergodic theorem gives the formula (6) . is shift-invariant then p. . (x).10 ( The ergodic decomposition theorem. which can be extended.C must El have measure 0 with respect to 1.4.9 If it is a shift-invariant measure then A ( C) = 1.4. the complement of . This formula indeed holds for any cylinder set. to a Borel measure i.u. B E E.) If p. n —k+ 1 The frequencies p(41. x E E.x is ergodic. Proof Since p (all' 14) = n . as an average of its ergodic components. ± i E x [ ak. Theorem 1. and extends by countable additivity to any Borel set B. each frequency-equivalence class is either entirely contained in E or disjoint from E.u(B) = f 1. Let n denote the projection onto frequency equivalence classes. BE E. — ix). Formula (6) can then be expressed in the form (7) . The measure .9 may now be expressed in the following much stronger form.ux will be called the (limiting) empirical measure determined by x.(x). Two sequences x and y are said to be frequency-equivalent if /ix = A y .(Ti.4.u o x -1 on equivalence classes. The usual measure theory argument.4.u(alic) = f p(af Ix) dp. by (6). Theorem 1.4. 4 E Ak defines a measure itx on sequences of length k. shows that to prove the ergodic decomposition theorem it is enough to prove the following lemma.. for if x E r then the formula .1.. are called the ergodic components of kt and the formula represents p. for each k and 4. which is well-defined on the frequency equivalence class of x.(E) = 1 and hence the representation (7) takes the form. Of course. Theorem 1. Theorem 1. . for each k. which establishes the theorem. Since there are only a countable number of the al` ..tx on A. Next define /27(x) = ktx.. An excellent discussion of ergodic components and their interpretation in communications theory can be found in [ 1 9 ]. The projection x is measurable and transforms the measure .(x )(B) dco(7(x)). together with the finite form of ergodicity. (B)= f p. by the Kolmogorov consistency theorem.. a shift-invariant measure always has its support in the set of sequences whose empirical measures are ergodic.48 Theorem 1. e In other words. CHAPTER I.u onto a measure w = .)c) can be thought of as measures. BASIC CONCEPTS..5.( x )(B) da)(7r(x)).9 can be strengthened to assert that a shift-invariant measure kt must actually be supported by the sequences for which the limiting frequencies not only exist but define ergodic measures..u.

such that the frequency of occurrence of 4 in any two members of C.E.4. the set Ps of shift-invariant probability measures is compact in the weak*-topology (obtained by thinking of continuous functions as linear functionals on the space of all measures via the mapping it i— f f dit. that the relative frequency of occurrence of al in any two members of Li differ by no more than E/2. z E L3. The set L1 is invariant since p(ainx) = p(4`17' x).9. the sequence {p(al 14)} converges to p(41x).c. Z E r3. . see Exercise 1 of Section 1. c A n . Let L i = Li(y) denote the set of all x E L such that (8) ip(a'nx) — p(41y)l < E/4. Theorem 1. The unit interval [0. differs by no more than E. Note.) The extreme points of Ps correspond exactly to the ergodic measures. for any x E L.4. Proof Fix E > 0. Thus the conditional measure it (. tt(CnICI) = ti Li P(xilz) dtt(z). hence For each x E Li. oo. 1 . This completes the proof of the ergodic decomposition theorem. ILi) is shift-invariant and 1 .4.4. 1] is bounded so L can be covered by a finite number of the sets Li(y) and hence the lemma is proved.12 The ergodic decomposition can also be viewed as an application of the ChoquetBishop-deLeeuw theorem of functional analysis. x' iz E An. fix a block 4. and fix a sequence y E L. in particular.SECTION 1. with /L(C) > 1 — c.q. Fix n > N and put Cn = {x: x E L2 }. [57]. which translates into the statement that p(C) > 1 — E. n > N. Indeed. The conditions (9) and (8) guarantee that the relative frequency of occurrence of 4 in any two members x7.10. itiz of Cn differ by no more E.u(xilLi) = A(CI) In particular.11 Given c > 0 and 4 there is a set X1 with ii.(Xi) > 1 — E. as n there is an integer N and a set L2 C Li of measure at least (1 — E2 )A(L1) such that (9) IP(414) — p(alx)I < E/4. such that for any x E X1 there is an N such that if n > N there is a collection C. 0 Remark 1. x E L2.„ E p(xz) ?. 1 (L1) L1 P(Crilz) d iu(z) > 1 — 6 2 so the Markov inequality yields a set L3 C L2 such that /4L3) > (1 — E)/L(L 1 ) and for which . FREQUENCIES OF FINITE BLOCKS. the set of sequences with limiting frequencies of all orders. 49 Lemma 1.

Assume p. Let T be an ergodic rotation of the circle and P the two-set partition of the unit circle into upper and lower halves. for infinitely Fix a E A and let Aa R(x) = {R 1 (X). t): where n = tk r. (It is an open question whether a mixing finite-state process is a function of a mixing Markov chain.4. I.u o R -1 on Aroo.4. (Hint: combine Exercise 2 with Theorem 1.50 CHAPTER I. Define the mapping x (x) = min{m > 1: x. R2(x). 8. m+k ) and (4.: n > 1}. The reversed process {Yn : n > 1} is the .2 for a definition of this process. Let qk • 14) be the empirical distribution of nonoverlapping k-blocks in 4.) 4. P)-process. qk (alnx.t(a). Combine (5) with the ergodic theorem to establish that for an ergodic process the expected waiting time between occurrences of a symbol a E A is just 111. This exercise illustrates a direct method for transporting a sample-path theorem from an ergodic process to a sample-path theorem for a (nonstationary) function of the process.1 ) = I{/ E [0. 0 < r < k. (b) Determine the ergodic components of the (T 3 . (Hint: show that the average time between occurrences of a in 4 is close to 1/pi (a14). See Exercise 8 in Section 1. that is. Show that an ergodic finite-state process is a function of an ergodic Markov chain. P)-process. P V TP y T 2P)-process.) 3. zt) are looking at different parts of y.P x P)-process. = (a) Show that if T is totally ergodic then qk (ali`14) .u(x lic). (a) Determine the ergodic components of the (T 3 .n = a} — 1 R1 (x) = minfm > E (x): x m = a} — E Rico i=1 .) A measure on Aa " transports to the measure y = . BASIC CONCEPTS.5. Let {X„: n > 1} be a stationary process.d Exercises 1.) 7. 6. 2. z„. Show that .1 by many n. is the conditional measure on [a] defined by an ergodic process. Show that {Xn } is ergodic if and only its reversed process is ergodic. (b) Show that the preceding result is not true if T is not totally ergodic. . where P is the Kolmogorov partition in the two-sided representation of {X. almost surely. Let P be the time-0 partition for an ergodic Markov chain with period d = 3.) 5. " be the set of all x E [a] such that Tnx E [a]. (Hint: with high probability (x k . Describe the ergodic components of the (T x T . (This is just the return-time process associated with the set B = [a]. Use the central limit theorem to show that the "random walk with random scenery" process is ergodic.

(Hint: use the Borel mapping lemma. There is a nonnegative number h = h(A) such that n oo n 1 lim — log . has limit 0. where S is the shift on (e) If T is the set of the set of frequency-typical sequences for j.5 The entropy theorem.u. CI .3 E(17.(a).) (d) (1/n) En i R i -÷ 11 p. The entropy theorem asserts that. The next three lemmas contain the facts about the entropy function and its connection to counting that will be needed to prove the entropy theorem. z(a) 111AI.(a) log 7(a). the decrease is almost surely exponential in n.(4) 1 log p.5. of the process. aEA Let 1 1 h n (x) = — log — n ti.SECTION 1. the measure p. 0 Lemma 1. In the theorem and henceforth.(4) = h. (a) The mapping x R(X) is Borel.17. n so that if A n is the measure on An defined by tt then H(t) = nE(hn (x)). v-almost surely. (Hint: use the preceding result. then v(R(T)) = 1.4) = nE(h n (x)). almost surely. The entropy of a probability distribution 7(a) on a finite set A is defined by H(z) = — E 71.2 The entropy function H(7) is concave in 7 and attains its maximum value log I AI only for the uniform distribution. For stationary processes. Theorem 1.1 (The entropy theorem. Proof Since 1/(.1. see Exercise 4c. and. the lemma follows from the preceding lemma. some counting arguments. and the concept of the entropy of a finite distribution.(4). The proof of the entropy theorem will use the packing lemma. Proof An elementary calculus exercise.5. for ergodic processes. and denoted by h = h(t). and the natural logarithm will be denoted by ln. 51 (b) R(T R ' (x ) x) = SR(x). except for some interesting cases. Lemma 1. Lemma 1.) Let it be an ergodic measure for the shift T on the space A'. where A is finite.(4) is nonincreasing in n.5.5. log means the base 2 logarithm.) Section 1. THE ENTROPY THEOREM. with a constant exponential rate called the entropy or entropy-rate.(x)) _< log I AI.

The goal is to show that h must be equal to lim sup.00 almost surely. The first task is to show that h(x) is constant. Since T is assumed to be ergodic the invariant function h(x) must indeed be constant. almost surely. almost surely. see Exercise 4. almost surely. To establish this note that if y = Tx then [Yi ] = [X3 +1 ] j so that a(y) > It(x 1 ) and hence h(Tx) < h(x). that is. BASIC CONCEPTS. since subinvariant functions are almost surely invariant. It is just the entropy of the binary distribution with ti(1) = S. h(x) is subinvariant. see [7]. that is I lim — log n-.5. The definition of limit inferior implies that for almost .Co Define where h(x) = —(1/n) log it(x'lz). that the binary entropy function is the correct (asymptotic) exponent in the number of binary sequences of length n that have no more than Sn ones. Section 1. Lemma 1. hn (x). n.52 CHAPTER I.2.a The proof of the entropy theorem. It can be shown.4 (The combinations bound. k < ms.) (n ) denotes the number of combinations of n objects taken k at a time and k 3 < 1/2 then n 2nH(3) . Thus h(Tx) = h(x). Proof The function —q log3 — (1 — q) log(1 —8) is increasing in the interval 0 < q < 3 and hence 2-nH(S) < 5k(1 _ n ) and summing gives k on-k . Define h to be the constant such that h = lim inf hn (x). Multiplying by ( 2-nH(8) E k<nS ( n k ) 'E k<n3 ( n k ) (5k 0 _ on-k n ) ic (1 _ 5 k on-k = 19 which proves the lemma. k) If E k<n6( where H(3) = —3 log3 —(1 — 3) log(1 — 3). h(x) = lim inf hn (x). n-). almost surely.= H (3) k I. let c be a positive number. 0 The function H(S) = —3 log 8—(1—S) log(1-3) is called the binary entropy function.5. Towards this end.00 n k<rz b E( n ) .

m i l E S. almost surely" can be replaced by "eventually. m i [} of disjoint subintervals of [1. If a set of K-length sample paths has cardinality only a bit more than 2K (h+E) . xr E G K (8 . .SECTION 1. Eventually almost surely. long subblocks for which (1) holds cannot have cardinality exponentially much larger than 2 /"+€) . M) if and only if there is a collection xr S = S(xr) = {[n i .. almost surely. then since a location of length n can be filled in at most 2n(h+E) ways if (1) is to hold.Es (mi — n.] E S. most of a sample path is filled by disjoint blocks of varying lengths for each of which the inequality (1) holds. there are not too many ways the parts outside the long blocks can be filled. if these subblocks are to be long. (a) m — n 1 +1 > M. An application of the packing lemma produces the following result. both to be specified later. The second idea is a counting idea.5. This is just an application of the fact that upper bounds on cardinality "almost" imply lower bounds on probability. (b) p(x) > ni+1)(h+c). This is a simple application of the ergodic stopping-time packing lemma.. then once their locations are specified.1. 53 all x the inequality h(x) < h + E holds for infinitely many values of n.. Three ideas are used to complete the proof. a fact expressed in exponential form as (1) A(4) > infinitely often. m. The set of sample paths of length K that can be mostly filled by disjoint. To fill in the details of the first idea. (c) E [ „. [ni. M) be the set of all that are (1 — 23)-packed by disjoint blocks of length at least M for which the inequality (1) holds.. The third idea is a probability idea. there will be a total of at most 2 K(h+E ) ways to fill all the locations. Most important of all is that once locations for these subblocks are specified. let S be a positive number and M > 1/B an integer. Indeed. and if they mostly fill. for large enough K. For each K > M." for a suitable multiple of E. then it is very unlikely that a sample path in the set has probability exponentially much smaller than 2 — K (h+E ) . + 1) > (1 — 28)K. THE ENTROPY THEOREM. The first idea is a packing idea. then there are not too many ways to specify their locations. In other words. let GK(S..18(b). almost surely. Lemma I. The goal is to show that "infinitely often. [n. K] with the following properties.

54

CHAPTER I. BASIC CONCEPTS.

Lemma 1.5.5

E

GK

(3, M), eventually almost surely.

Proof Define T. (x) to be the first time n > M such that p,([xrii]) >n(h+E) Since r is measurable and (1) holds infinitely often, almost surely, r is an almost surely finite stopping time. An application of the ergodic stopping-time lemma, Lemma 1.3.7, then yields the lemma. LI
The second idea, the counting idea, is expressed as the following lemma.

Lemma 1.5.6 There is a 3 > 0 and an M> 116 such that IG large K.

, M)1 <2K (h+26) ,

for all sufficiently

Proof A collection S = {[ni , m i ll of disjoint subintervals of [1, K], will be called a skeleton if it satisfies the requirement that mi — ni ± 1 > M, for each i, and if it covers all but a 26-fraction of [1, K], that is,
(2)

E (mi _ ni ± 1) >_ (1 — 23)K.
[n,,m,JES

A sequence xr is said to be compatible with such a skeleton S if 7i ) > 2-(m, --ni+i)(h+e) kt(x n for each i. The bound of the lemma will be obtained by first upper bounding the number of possible skeletons, then upper bounding the number of sequences xr that are compatible with a given skeleton. The product of these two numbers is an upper bound for the cardinality of G K (6, M) and a suitable choice of 6 will then establish the lemma. First note that the requirement that each member of a skeleton S have length at least M, means that Si < KIM, and hence there are at most KIM ways to choose the starting points of the intervals in S. Thus the number of possible skeletons is upper bounded by (3)

E

K k

H(1/111)

k<KIM\

where the upper bound is provided by Lemma 1.5.4, with H() denoting the binary entropy function. Fix a skeleton S = {[ni,mi]}. The condition that .)C(C be compatible with S means that the compatibility condition (4) it(x,7 1 ) >

must hold for each [ni , m i ] E S. For a given [ni , m i ] the number of ways xn m , can be chosen so that the compatibility condition (4) holds, is upper bounded by 2(m1—n's +1)(h+E) , by the principle that lower bounds on probability imply upper bounds on cardinality, Lemma I.1.18(a). Thus, the number of ways x j can be chosen so that j E Ui [n„ and so that the compatibility conditions hold is upper bounded by
nK h-1-€)
L

(fl

SECTION 1.5. THE ENTROPY THEOREM.

55

Outside the union of the [ni , mi] there are no conditions on xi . Since, however, there are fewer than 28K such j these positions can be filled in at most IA 1 26K ways. Thus, there are at most
IAl23K 2K (h+E)

sequences compatible with a given skeleton S = {[n i , m i ]}. Combining this with the bound, (3), on the number of possible skeletons yields

(5)

1G K(s, m)i

< 2. 1(- H(1lm) oi ncs 2 K(h+E)

Since the binary entropy function H (1 M) approaches 0 as M 00 and since IA I is finite, the numbers 8 > 0 and M > 1/8 can indeed be chosen so that IGK (8, M) < 0 2K(1 +2' ) , for all sufficiently large K. This completes the proof of Lemma 1.5.6. Fix 6 > 0 and M > 1/6 for which Lemma 1.5.6 holds, put GK = Gl{(8, M), and let BK be the set of all xr for which ,u(x(c ) < 2 -1C(h+36) . Then tt(Bic

n GK) < IGKI 2—K(h+3E)

5 2—ICE

,

holds for all sufficiently large K. Thus, xr g BK n GK, eventually almost surely, by the Borel-Cantelli principle. Since xt E GK, eventually almost surely, the iterated almost-sure principle, Lemma 1.1.15, implies that xr g BK, eventually almost surely, that is,

lim sup h K (x) < h + 3e, a.s.
K—>oo

In summary, for each e > 0, h = lim inf h K (x) < lim sup h K (x) < h 3e, a.s.,
K—*co

which completes the proof of the entropy theorem, since c is arbitrary.

0

Remark 1.5.7 The entropy theorem was first proved for Markov processes by Shannon, with convergence in probability established by McMillan and almost-sure convergence later obtained by Breiman, see [4] for references to these results. In information theory the entropy theorem is called the asymptotic equipartition property, or AEP. In ergodic theory it has been traditionally known as the Shannon-McMillan-Breiman theorem. The more descriptive name "entropy theorem" is used in this book. The proof given is due to Ornstein and Weiss, [51], and appeared as part of their extension of ergodic theory ideas to random fields and general amenable group actions. A slight variant of their proof, based on the separated packing idea discussed in Exercise 1, Section 1.3, appeared in [68].

I.5.b Exercises.
1. Prove the entropy theorem for the i.i.d. case by using the product formula on then taking the logarithm and using the strong law of large numbers. This yields the formula h = — Ea ,u(a) log it(a). 2. Use the idea suggested by the preceding exercise to prove the entropy theorem for ergodic Markov chains. What does it give for the value of h?

56

CHAPTER I. BASIC CONCEPTS.
3. Suppose for each k, Tk is a subset of Ac of cardinality at most 2k". A sequence 4 is said to be (K, 8, {T})-packed if it can be expressed as the concatenation 4 = w(1) . w(t), such that the sum of the lengths of the w(i) which belong to Ur_K 'Tk is at least (1 —8)n. Let G„ be the set of all (K, 8, {T})-packed sequences 4 and let E be positive number. Show that if K is large enough, if S is small enough, and if n is large enough relative to K and S, then IG n I < 2n(a+E) . 4. Assume p is ergodic and define c(x) = lima

,44), x

E

A°°.

(a) Show that c(x) is almost surely a constant c. (b) Show that if c > 0 then p. is concentrated on a finite set. (c) Show that if p. is mixing then c(x) = 0 for every x.

Section 1.6

Entropy as expected value.

Entropy for ergodic processes, as defined by the entropy theorem, is given by the almost-sure limit 1 1 — log h= n—>oo n Entropy can also be thought of as the limit of the expected value of the random quantity —(1/n) log A(4). The expected value formulation of entropy will be developed in this section.

I.6.a

The entropy of a random variable.

Let X be a finite-valued random variable with distribution defined by p(x) = X E A. The entropy of X is defined as the expected value of the random variable — log p(X), that is,

Prob(X = x),

H(X) =

xEA

E p(x) log -1-=— p(x)

p(x) log p(x). xEA

The logarithm base is 2 and the conventions Olog 0 = 0 and log0 = —oo are used. If p is the distribution of X, then H(p) may be used in place of H(X). For a pair (X, Y) of random variables with a joint distribution p(x, y) = Prob(X = x, Y = y), the notation H(X,Y)-= — p(x , y) log p(x, y) will be used, a notation which extends to random vectors. Most of the useful properties of entropy depend on the concavity of the logarithm function. One way to organize the concavity idea is expressed as follows.

Lemma 1.6.1 If p and q are probability k-vectors then

_E pi log p, < —
with equality if and only if p = q.

pi log qi ,

SECTION 1.6. ENTROPY AS EXPECTED VALUE.

57

Proof The natural logarithm is strictly concave so that, ln x < x — 1, with equality if and only if x = 0. Thus
qj

_

= 0,

with equality if and only if qi = pi , log x = (In x)/(1n 2).

1 < i 5_ k.

This proves the lemma, since

E,

The proof only requires that q be a sub-probability vector, that is nonnegative with qi 5_ 1. The sum

Dcplio =

pi in

PI
qi

is called the (informational) divergence, or cross-entropy, and the preceding lemma is expressed in the following form.

Lemma 1.6.2 (The divergence inequality.) If p is a probability k-vector and q is a sub-probability k-vector then D(pliq) > 0, with equality if and only if p = q.
A further generalization of the lemma, called the log-sum inequality, is included in the exercises. The basic inequalities for entropy are summarized in the following theorem.

Theorem 1.6.3 (Entropy inequalities.) (a) Positivity. H(X) > 0, with equality if and only if X is constant.

(b) Boundedness. If X has k values then H(X) < log k, with equality if and only if
each p(x) = 11k.

(c) Subadditivity. H(X, Y) < H(X) H(Y), with equality if and only if X and Y
are independent. Proof Positivity is easy to prove, while boundedness is obtained from Lemma 1.6.1 by setting Px = P(x), qx ----- 1/k. To establish subadditivity note that
H(X,Y).—

Ep (x, y) log p(x, y),
X.

y

then replace p(x , y) in the logarithm factor by p(x)p(y), and use Lemma 1.6.1 to obtain the inequality H(X,Y) < H(X)-}- H(Y), with equality if and only if p(x, y) p(x)p(y), that is, if and only if X and Y are independent. This completes the proof of the theorem. El The concept of conditional entropy provides a convenient tool for organizing further results. If p(x , y) is a given joint distribution, with corresponding conditional distribution p(xly) = p(x, y)1 p(y), then H ain

=_

p(x, y) log p(xiy) = —

p(x, y) log

p(x, y)

POO

58

CHAPTER I. BASIC CONCEPTS.

is called the conditional entropy of X, given Y. (Note that this is a slight variation on standard probability language which would call — Ex p(xIy) log p(xly) the conditional entropy. In information theory, however, the common practice is to take expected values with respect to the marginal, p(y), as is done here.) The key identity for conditional entropy is the following addition law.

(1)

H(X, Y) = H(Y)+ H(X1Y).

This is easily proved using the additive property of the logarithm, log ab = log a+log b. The previous unconditional inequalities extend to conditional entropy as follows. (The proofs are left to the reader.)

Theorem 1.6.4 (Conditional entropy inequalities.) (a) Positivity. H (X IY) > 0, with equality if and only if X is a function of Y. (b) Boundedness. H(XIY) < H (X) with equality if and only if X and Y are independent. (c) Subadditivity. H((X, Y)IZ) < H(XIZ)+ H(YIZ), and Y are conditionally independent given Z. with equality if and only if X

A useful fact is that conditional entropy H(X I Y) increases as more is known about the first variable and decreases as more is known about the second variable, that is, for any functions f and g,

Lemma 1.6.5 H(f(X)In < H(XIY) < 1-1 (X1g(n).
The proof follows from the concavity of the logarithm function. This can be done directly (left to the reader), or using the partition formulation of entropy which is developed in the following paragraphs. The entropy of a random variable X really depends only on the partition Px = {Pa: a E A} defined by Pa = fx: X(x) = a), which is called the partition defined by X. The entropy H(1)) is defined as H (X), where X is any random variable such that Px = P. Note that the join Px V Py is just the partition defined by the vector (X, Y) so that H(Px v Pr) = H(X, Y). The conditional entropy of P relative to Q is then defined by H(PIQ) = H(1) V Q) — H (Q). The partition point of view provides a useful geometric framework for interpretation of the inequalities in Lemma 1.6.5, because the partition Px is a refinement of the partition Pf(x), since each atom of Pf(x) is a union of atoms of P. The inequalities in Lemma 1.6.5 are expressed in partition form as follows.

Lemma 1.6.6 (a) If P refines Q then H(QIR) < H(PIR). (b) If R. refines S then H(PIS) > 11(P 1R).

. 59 Proof The proof of (a) is accomplished by manipulating with entropy formulas. Lemma 1. and the inequality (iii) uses the fact that H (P I VR. n n xn In the stationary case the limit superior is a limit. I. t a0b II(Ra) The latter is the same as H(PIS).) > O.SECTION I.6. 0 < r < n.6.) can be expressed as -EEtt(Pt n R a ) log t ab it(Pt n Ra) + (4)11 (Psb liZsb ) Ewpt n Ra ) log It(Pt n Ra) + it(Sb)1-1(Psb ). . so that P = Pv Q.} of A-valued random variables is defined by H(r) = E p(x)log 1 ATEAn =— p(x'iL)log p(4). ENTROPY AS EXPECTED VALUE. is obtained from S by splitting one atom of S into two pieces.6. Sb = Rb U Rb. • The quantity H(P IR.b The entropy of a process.) If {an} is a sequence of nonnegative numbers which is subadditive. where the conditional measure. < pan + b. The n-th order entropy of a sequence {X1. If m > n write m = np +. IC) is used on C. an-Fm < an + am . The equality (i) follows from the fact that P refines Q. X2. (2) 1 1 1 H ({X n }) = lim sup — H((X7) = lim sup — p(4) log n n p(4) . This is a consequence of the following basic subadditivity property of nonnegative sequences. Given c > 0 choose n so that an < n(a ± 6).r. namely. The entropy of a process is defined by a suitable passage to the limit. it (. Subadditivity gives anp < pan . a 0 b..7 (The subadditivity lemma. then limn an ln exists and equals infn an /n. 11) P(x1 xnEan E The process entropy is defined by passing to a limit. so that if b = sup . as follows. say Ra = Sa . To prove (h) it is enough to consider the case when R. the equality (ii) is just the general addition law. then another use of subadditivity yields am < anp + a. which establishes inequality (b). Proof Let a = infn an /n.n a. H (P IR) (i) (P v 21 1Z) H(QITZ) + H(PIQ v 1Z) H (OR).. that is.. Let Pc denote the partition of the set C defined by restricting the sets in P to C.

60 CHAPTER I. The subadditivity property for entropy.6. aEB for any B c A. Theorem I. can then be applied to give H({X n })= lim H (X01X1 nI ) . .u (x '0 = h.6.) _ E p(a) log p(a) _< p(B) log IBI — p(B)log p(B).5. the process entropy H is the same as the entropy-rate h of the entropy theorem. 1B) denote the conditional measure on B and use the trivial bound to obtain P(a1B)log p(aIB) . H(X.6. An alternative formula for the process entropy in the stationary case is obtained by using the general addition law. and process entropy H. Division by m. Towards this end. BASIC CONCEPTS. be an ergodic process with alphabet A. Proof Let p(. a._ log IBIaEB — E 1 The left-hand side is the same as ( p(B) from which the result follows. the following lemma will be useful in controlling entropy on small sets. to produce the formula H(X ° n )— H(X1. that is urn—log n 1 1 n .1 )= H(X0IX:n1). The next goal is to show that for ergodic processes. as in —> pc. and the fact that np/m —> 1. gives lim sup ani /m < a + E. aEB E P(a)log p(a) + log p(B)) .8 (The subset entropy-bound. and let h be the entropy-rate of A as given by the entropy theorem. n > 0. and the simple fact from analysis that if an > 0 and an+ i — an decreases to b then an /n —> b. CI This proves the lemma. from Lemma 1. Y) = H(Y)± H(X1Y). If the process is stationary then H(X n + -FT) = H(r) so the subadditivity lemma with an = H(X7) implies that the limit superior in (2) is a limit._ . The right-hand side is decreasing in n. 0 Let p.3(c).e. (3) n--4co a formula often expressed in the suggestive form (4) H({X n }) = H (X01X_1 0 ) . gives H(X7 +m) < H(X7)+ H(rnTin). Lemma 1.

6. [7]. Lemma 1. If h = 0 then define G. The recent books by Gray. Define G.6. The bound (5) still holds.(x ) log .10 Since for ergodic processes. so multiplication by . discuss many information-theoretic aspects of entropy for the general ergodic process. so the sum converges to H({X„}).u(4) <_ n(h ± c). while D only the upper bound in (6) matters and the theorem again follows. The Csiszdr-Körner book..(4). and Cover and Thomas. since the entropy theorem implies that ti(B) goes to O. [4]. this part must go to 0 as n —> oc. = A n — G.(Bn )log IA 1 — .}) is the same as the entropyrate h of the entropy theorem. both are often simply called the entropy.u(xtil) A" = — E .9 Process entropy H and entropy-rate h are the same for ergodic processes. As n —> co the measure of G. . contains an excellent discussion of combinatorial and communication theory aspects of entropy.i. On the set G. [18].1 (4) < 2—n(h—E)} and let B. Thus h— E which proves the theorem in the case when h > O.6. in the i. the process entropy H({X. After division by n. from the entropy theorem.6. Then H (X7) = — E „. [6].u(4) log p. Proof First assume that h > 0 and fix E such that 0 < E < h.i(B)log(B). = {4: 2—n(h+E) < 1.d..u(xliz) — E . G. 61 Theorem 1. gives (5) — E A (x) log ii(x) < np. approaches 1.u(4) > 2 —ne . B.8. ENTROPY AS EXPECTED VALUE. 8„ since I B I < IAIn . the following holds n(h — e) _< — log .SECTION 1. The two sums will be estimated separately. A detailed mathematical exposition and philosophical interpretation of entropy as a measure of information can be found in Billingsley's book. The subset entropy-bound.u (4 ) log . = {4: . case.u(G n ) and summing produces (6) (h — 6) 1 G.u(4)/n. } Remark 1.

} be an i. (7) H({X. .d. Proof The conditional addition law gives H((X o . Here they will be derived from the definition of process entropy. n > k. and Markov processes.5. (3).3(c).d.6. (3). a E A.6.) A stationary process {X n } is Markov of order k if and only if H(X0IX:) = H(X0IX: n1 ).!) = H(X0)• Now suppose {X n } is a Markov chain with stationary vector p and transition matrix M. The argument used in the preceding paragraph shows that if {X. and Markov processes can be derived from the entropy theorem by using the ergodic theorem to directly estimate p.6.}) = lip H (X0IX1. to obtain H({X. . This condition actually implies that the process is Markov of order k. and the fact that H(Xl Y) = H(X) if X and Y are independent. a from which it follows that E A. gives H (XI') = H(Xi) which yields the entropy formula H (X2) + H(X n ) = nH (X i ). BASIC CONCEPTS. Xink+1). An alternate proof can be given by using the conditional limit formula.d. H (X 0 1 = H(X01 X -1)• Thus the conditional entropy formula for the entropy of a process. The Markov property implies that Prob (X0 = alX) = Prob (X0 = alX_i) .} is Markov of order k then 1/({X}) = H(X0IX:1) = n > k. Let {X.c The entropy of i.i. (8) H({X n }) = H(X0IX_ i ) = — E 04 log Mii . process and let p(x) = Prob(Xi = x). Theorem 1.i. Entropy formulas for i. along with a direct calculation of H(X0IX_ 1 ) yields the following formula for the entropy of a Markov chain.}) = H(X i ) = — p(x) log p(x).11 (The Markov order theorem. I. The additivity of entropy for independent random variables.i.(4). see Exercises 1 and 2 in Section 1. Recall that a process {Xn } is Markov of order k if Prob (X0 = aiX) = Prob (X0 = alX:k1 ) . Theorem 1.62 CHAPTER I. Xi n k+1 )1X1) = H(X: n k+1 1X11) H(X 0 IXI kl .

Two sequences xi' and y'11 are said to be type-equivalent if they have the same type. 14) on A defined by the relative frequency of occurrence of each symbol in the sequence 4. n Ifi: Pi(alxi) = = an . Thus. if each symbol a appears in x'. - Proof First note that if Qn is a product measure on A n then (9) Q" (x)= ji Q(xi) i=1 = fl aEA War P1(a) .4(c).) < 2r1(131).6. the same number of times it appears in Type-equivalence classes are called type classes. k > k*. The empirical distribution or type of a sequence x E A" is the probability distribution Pi = pl (.12 (The type class bound. The following purely combinatorial result bounds the size of a type class in terms of the empirical entropy of the type.6. E . Theorem 1. 63 The second term on the right can be replaced by H(X01X: k1 ). for any type p i . rather than a particular sequence that defines the type. x E rn Tpn. The entropy of an empirical distribution gives the exponent for a bound on the number of sequences that could have produced that empirical distribution. To say that the process is Markov of some order is to say that Hk is eventually constant. x'1' E T.d Entropy and types. that is. The empirical (first-order) entropy of xi' is the entropy of p i (. I. 14). If this is true for every n > k then the process must be Markov of order k. 14)) = a Pi(aixi)log Pi(aI4). that is. and will be useful in some of the deeper interpretations of entropy to be given in later chapters. The type class of xi' will be denoted by T WO or by 7. that is.6. provided H(X01X1 k1 ) = H(X0 1X1n1 ). for n > k.. Theok+1 are conditionally indepenrem I.6. The equality condition of the subadditivity principle. where the latter stresses the type p i = pi (. 17(130= ii(pl(. a E A. a product measure is always constant on each type class. ENTROPY AS EXPECTED VALUE. can then be used to conclude that X0 and X:_ n dent given XI/. then Hks_i > fik* = Hk . If the true order is k*. This completes the proof of the theorem. In since np i (a) is the number of times the symbol a appears in a given 4 particular.14). In general the conditional entropy function Hk = H(X0IX: k1 ) is nonincreasing in k. This fact can be used to estimate the order of a Markov chain from observation of a sample path. if and only if each symbol a appears in xi' exactly n pi(a) times.SECTION 1. This fact is useful in large deviations theory.

Two sequences x. The empirical overlapping k-block distribution or k-type of a sequence x7 E A' is the probability distribution Pk = pk(' 14) on A k defined by the relative frequency of occurrence of each k-block in thesen [i quenk ce +ii'i th_a i± t ki_ S. The bound 2' 17. that is. while not tight.6.6. and k-type-equivalence classes are called k-type classes. The upper bound is all that will be used in this book. [7]. the only possible values for . Theorem 1. Pn (4) has the constant value 2' -17(P' ) on the type class of x7. 4 E T. where Pk = Pk ( Ix).' and yç' are said to be k-type-equivalent if they have the same k-type. is the correct asymptotic exponential bound for the size of a type class. x E In other words. . then the number of k-types is of lower order than IA In.14 (The number of k-types bound. and hence = ypn. The k-type class of x i" will be denoted by Tpnk . 1. npi (alfii) are 0. as shown in the Csiszdr-Körner book. if and only if each block all' appears in xÇ exactly (n — k + 1) pk (alic) times. Proof This follows from the fact that for each a E A. Theorem 1. BASIC CONCEPTS. The bound on the number of types. extends immediately to k-types.12. Theorem 1.6.1 = E I -41 n— k 1 E A k. E r pni . z E A' . in particular.64 CHAPTER 1. produces Pn(xiz) = aEA pi (a)nPl(a) . but satisfies k < a logiAi n.) The number of possible k-types is at most (n — k + Note. that if k is fixed the number of possible k-types grows polynomially in n. A type pi defines a product measure Pn on An by the formula Pn (z?) = pi (zi).) The number of possible types is at most (n + 1)1A I. as stated in the following theorem.13. Replacing Qn by Pi in the product formula. while if k is also growing. after taking the logarithm and rewriting. yields Pn(x7) = 2—n17(PI) . The concept of type extends to the empirical distribution of overlapping k-blocks. This establishes Theorem 1. with a < 1. (9).6. n. 12— n 11(” v 1 ) But Pi is a probability distribution so that Pi (Tpni ) < 1.13 (The number-of-types bound. Later use will also be made of the fact that there are only polynomially many type classes. which.

2)-process. > 0. ak ak by formula (8). 2. Theorem 1.b xr it c Tp2. A suitable bound can be obtained.e Exercises. with Q replaced b y (2-1) = rf. since P1) (4) is a probability measure on A n .15 (The k-type-class bound.) I7k (fil)1 < (n — Proof First consider the case k = 2.1>ii(1 ) IT P2 I — 1n since ii:(1) (xi) > 1/(n — 1).(1) and after suitable rewrite using the definition of W I) becomes 1 TV I) (Xlii ) = Ti-(1) (X02—(n-1)1 .-j(a k Or 1 )Ti(a ki -1 ) log 75(akiar) E Pk(akil4) log Ti(ak lari ). > 0. 2. . . I. Prove the log-sum inequality: If a. the stationary (k — 1)-st order Markov chain ri 0c -1) with the empirical transition function (ak 14 -1 ) = Pk(ainxii ) „ f k-1 „ir‘ uk IA 1 ) E bk pk and with the stationary distribution given by The Markov chain ri(k-1) has entropy (k - "Aar') = Ebk Pk(ar i 1) -E 1. by considering the (k — 1)-st order Markov measure defined by the k-type plc ( 14). The proof for general k is obtained by an obvious extension of the argument. i = 1. that is.SECTION 1. If Q is Markov a direct calculation yields the formula Q(4) = Q(xi)n Q(bi a)(n—l)p2(ab) a. The entropy FP") is called the empirical (k — 1)-st order Markov entropy of 4.6. Let P be the time-0 partition for an ergodic Markov chain with period d = 3 and entropy H > O. This yields the desired bound for the first-order case. and hence P x E 2. ENTROPY AS EXPECTED VALUE. n.(n. 65 Estimating the size of a k-type class is a bit trickier. Note that it is constant on the k-type class 'Tk(xiii) = Tp k of all sequences that have the same k-type Pk as xrii. b = E bi . then E ai log(ai I bi ) ?_ a log(a/b). however.6. and a 1. This formula. since the k-type measures the frequency of overlapping k-blocks.6. (a) Determine the entropies of the ergodic components of the (T3 . b. E ai .

(b) Determine the entropies of the ergodic components of the (T 3 . measurable with respect to xn oc . then use calculus. I. (Hint: let S be uniformly distributed on [1. €)-entropy-typical sequences. namely.(1) = 1/2.u0(-1) = p. with the product measure y = j x it. Prove that the entropy of an n-stationary process {Yn } is the same as the entropy the stationary process {X„} obtained by randomizing the start.1} z . (Hint: recurrence of simple random walk implies that all the sites of y have been almost surely visited by the past of the walk. The entropy theorem may be expressed by saying that xril is eventually almost surely entropy-typical. y) = (T x .66 CHAPTER I. or simply the set of entropy-typical sequences. BASIC CONCEPTS. . Let p. which leads to the concept of entropy-typical sequence. Two simple. Find the entropy of the (S. where jp . Toy).7 Interpretations of entropy.a Entropy-typical sequences. (Hint: use stationarity. The second interpretation is the connection between entropy and expected code length for the special class of prefix codes. and define S(x. be an ergodic process of entropy h.(Pi)1 A(Q i )). Let p be the time-zero partition of X and let Q = P x P. so H(Xr)1N H(Xr IS)1N. useful interpretations of entropy will be discussed in this section. (Hint: apply the ergodic theorem to f (x) = — log //. (4) < t(xIx) 2". and to the related building-block concept. Q)-process. and let ji be the product measure on X defined by . 3.) 6.) 5. Then show that I-1(r IS = s) is the same as the unshifted entropy of Xs/v+1 . (b) Prove Pinsker's inequality. Let p be an ergodic process and let a be a positive number. n] and note that H (X I v .. Show there is an N = N(a) such that if n > N then there is a set Fn .bt 1 a2012 . that D (P Il Q) > (1 / 2 ln 2)IP — Q1 2 . if n and c are understood.2.) 4. Let T be the shift on X = {-1. such that . S) = H H (SIXr) = H(S)+ H(X l iv IS) for any N> n. and use the equality of process entropy and entropy-rate. The divergence for k-set partitions is D(P Q) = Ei It(Pi)log(p. p V Tp T22)-process.) Section 1.) 7.QI 1. (Hint: use part (a) to reduce to the two set case. Let Y = X x X.u(Pi ) . that is. The first is an expression of the entropy theorem in exponential form.7.(xi Ix). which is the "random walk with random scenery" process discussed in Exercise 8 of Section 1. (a) Show that D (P v 7Z I Q v > Q) for any partition R.u(Fn ) > 1 — a and so that if xn E Fn then 27" 11.i(4). For E > 0 and n > 1 define En(E) 1 4: 2—n(h+E) < tt (x in) < 2—n(h-6)} The set T(E) is called the set of (n. Show that the process entropy of a stationary process and its reversed process are the same.

The context usually makes clear the notion of typicality being used. and sometimes it is just shorthand for those sequences that are likely to occur.) < 2n(h--€). this fact yields an upper bound on the cardinality of Tn . as defined in Section 1.3 (The too-small set principle. Since the total probability is at most 1.2 (The entropy-typical cardinality bound. The preceding theorem provides an upper bound on the cardinality of the set of typical sequences and depends on the fact that typical sequences have a lower bound on their probabilities. in information theory. a)-covering number by Arn (a) = minfICI: C c An. is known as the asymptotic equipartition property or AEP. then . Sometimes it means entropy-typical sequences. the measure is eventually mostly concentrated on a set of sequences of the (generally) much smaller cardinality 2n(h+6) • This fact is of key importance in information theory and plays a major role in many applications and interpretations of the entropy concept. on the number of entropy-typical sequences. eventually almost surely.(E12)) is summable in n. fl Another useful formulation of entropy. The cardinality bound on Cn and the probability upper bound 2 2) on members of Tn (E/2) combine to give the bound (Cn n Tn(E/2)) < 2n(12+02—n(h—E/2) < 2—nc/2 Thus .7q If C. The members of the entropy-typical set T(e) all have the lower bound 2 —n(h+f) on their probabilities. Typical sequences also have an upper bound on their probabilities. expresses the connection between entropy and coverings. n > 1. eventually almost surely. sometimes it means sequences that are both frequency-typical and entropy-typical. and p(C)> a}.) For each E > 0.1 (The typical-sequence form of entropy. 67 The convergence in probability form.7. which leads to the fact that too-small sets cannot be visited too often. even though there are I A In possible sequences of length n. Theorem 1. suggested by the upper bound.) The set of entropy-typical sequences satisfies 17-n (01 < 2n (h+E) Thus. x E T(e). Here the focus is on the entropy-typical idea.2. c A n and ICI _ Ci'. (Al) List the n-sequences in decreasing order of probability.7.u(Tn (E)) = 1. as defined here. . n'T.X1' E 'T. namely. The phrase "typical sequence. INTERPRETATIONS OF ENTROPY. sometimes it means frequencytypical sequences.(E12). eventually almost surely. A good way to think of .7.i (e/ 2). Proof Since . it is enough to show that x g C.SECTION 1. that is. For a > 0 define the (n. Theorem 1.7." has different meaning in different contexts.7. Ar(a) is the minimum number of sequences of length n needed to fill an afraction of the total probability.4.u(Cn n 'T. Theorem 1. Theorem 1. (A2) Count down the list until the first time a total probability of at least a is reached. limn . and the theorem is established.I\rn (a) is given by the following algorithm for its calculation. eventually almost surely.

2. for E rn (E then implies that ACTn(E) n = p. y.7.4 (The covering-exponent theorem. The connection between coverings and entropy also has as a useful approximate form. fl The covering exponent idea is quite useful as a tool for entropy estimation.(x) < 2 —n(h—E) . for each n. 1 dn (4 .5_ h. by Theorem 1.Ern (onc. Since the measure of the set of typical sequences goes to I as n oc. Proof Fix a E (0. b) be the (discrete) metric on A. and the 6-neighborhood or 6-blowup of S is defined by [S ]. which proves the theorem.n GI. The distance dn (x.e.Jce) < iTn (01 < 2 n0-1-0 .(6) n cn ) > a(1 — c). Let d(a. the proof in Section I. The connection with entropy is given by the following theorem.7. ) and hence Nn (a) = ICn > IT(e) n Cn > 2n (h-E'A(Tn(E) n cn) > 2n(h-E)a(i 6).I = (a). lim supn (l/n) log Nn (a) . The covering number Arn (a) is the count obtained in (A2). that a rotation process has entropy O. defined by if a = b otherwise. defined by 0 d (a . If n is large enough then A(T. On the other hand. suppose.u(T. the covering exponent (1/n)logArn (a) converges to h. yri`) = — n i=1 yi)• The metric cin is also known as the per-letter Hamming distance. To state this precisely let H(8) = —6 log 6 (1— 6) log(1 — 8) denote the binary entropy function.. .7. . It is important that a small blowup does not increase size by more than a small exponential factor. b) = J 1 1 and extend to the metric d on A'. S) from 4 to a set S C An is defined by dn (x' . S) = min{dn (4. and hence.i(C) > a and I C.(4) E x. below. Theorem 1. which will be discussed in the following paragraphs. = S) < 81. see. BASIC CONCEPTS. The fact that p.i (E)) eventually exceeds a.) For each a > 0.68 CHAPTER I. < 2—n(h—E) 1— -6 .' E S}. the measure . as n oc. 1) and c > O. When this happens N. for example.

M]." from which longer sequences are made by concatenating typical sequences.SECTION 1..4 then exists as n establishes that 1 lim lim — log Afn (a.7. with occasional spacers inserted between the blocks.7. < n(h+2€) 2 In particular. Lemma 1. Proof Given x. if there is an integer I = I(41 ) and a collection llni. and //([C]3 ) a}.b The building-block concept. Each such position can be changed in IA — 1 ways. 6—>0 n—> oo n I. mi]: i < 1). is required to be a member of 13. INTERPRETATIONS OF ENTROPY. This idea and several useful consequences of it will now be developed.7. The blowup form of the covering number is defined by (1) Arn (a.=1 (mi — n i + 1) > (1 — (b) E I3.7. that is.) The 6-blowup of S satisfies 69 I[S]81 ISI2 n11(6) (1A1 — 1) n3 . An application of the ergodic theorem shows that frequency-typical sequences must consist mostly of n-blocks which are entropy-typical. It is left to the reader to prove that the limit of (1/n) log Nen (a. provided only that n is not too small. If 8 > 0 then the notion of (1-0-built-up requires that xr be a concatenation of blocks from 13.5. it follows that if 8 is small enough. it is the minimum size of an n-set for which the 8-blowup covers an a-fraction of the probability. Fix a collection 5. < 2n(h-1-€) since (IAI — 1)n8 < 20 log IAI and since Slog IA I and H(8) Since IT(e)1 _ < 2n (h+26) 0. thought of as the set of building blocks. Thus. with spacers allowed between the blocks. together with Theorem 1. A sequence xr is said to be (1 — 0-built-up from the building blocks B. This proves the blowup-bound lemma..4. Also fix an integer M > n and 8 > O. for all n. c An . such that (a) Eil. subject only to the requirement that the total length of the spacers be at most SM. of disjoint n-length subintervals of [1. are allowed to depend on the sequence xr. An application of Lemma 1. i c I. the _ 2nH(3) (IAI — 1) n8 . m i ] for which x . Both the number I and the intervals [ni . Lemma 1. which combinations bound. implies that I[{4}]51 < yields the stated bound. to say that 41 is 1-built-up from Bn is the same as saying that xr is a concatenation of blocks from B„. 8) oc. The .5. given c > 0 there is a S > 0 such that Ir1n(c)131 for all n. there are at most nS positions i in which xi can be changed to create a member of [ { x7 } 13. In the special case when 8 = 0 and M is a multiple of n. then Irrn(E)ls1 _ both go to 0 as El will hold. In particular.. the entropy-typical n-sequences can be thought of as the "building blocks.7. The reader should also note the concepts of blowup and built-up are quite different. 8) = min{ICI: C c An.5 (The blowup bound. 8) = h.

Proof The number of ways to select a family {[n i .1 2mH (0 1 A im s. As usual. is similar in spirit to the proof of the key bound used to establish the entropy theorem. . For a fixed configuration of locations. Lemma 1. for members of Bn . M] is upper bounded by the number of ways to select at most SM points from a set with M members. blowup concept focuses on creating sequences by making a small density of otherwise arbitrary changes. provided M is large enough. if B. the set of entropy-typical n-sequences relative to c. The argument used to prove the packing lemma. can be used to show that almost strongly-covered sequences are also almost built-up.6 (The built-up set bound.5.(c). the number ways to fill these with members of B„ is upper bounded by IBnl i < .7. The proof of this fact. Y1 E < ) which is.5.4. in turn. Lemma 1.b.) Let DM be the set of all sequences xr that can be (1 — 6)-built-up from a given collection B„ c An. The building-block idea is closely related to the packing/covering ideas discussed in Section I. Then I Dm I IB. then the set of M-sequences that can be (1 — 6)-built-up from a given set 8„ C An of building blocks cannot be exponentially much larger in cardinality than the set of all sequences that can be formed by selecting M I n sequences from Bn and concatenating them without spacers. This establishes the lemma. A sequence xr is said to be (1 — 6)-strongly-covered by B„ C An if E [1. Lemma 1. In particular. An important fact about the building-block concept is that if S is small and n is large. which is stated as the following lemma. M — n +1]: Bn }l < 8(M — n +1). This is stated in precise form as the following lemma.3. = 7. which is the desired bound. BASIC CONCEPTS.4.6.i E / } . Thus I Dm I < ni Min IAIM. though simpler since now the blocks all have a fixed length.m i ] is upper bounded by IA 1 3m . say {[n i . H(8) = —8 log6 — (1 —8) log(1 — S) denotes the binary entropy function.m i ].3. The bound for entropy-typical building blocks follows immediately from the fact that I T(e)l 2 n(h+f) .70 CHAPTER I.3 Min and the number of ways to fill the places that are not in Ui [ni. and if is small enough then 11)m 1 < 2m(h +26) . by the combinations bound. namely. m i ]} of disjoint n-length subintervals that cover all but a 6-fraction of [1. upper bounded by 2 A411(3) . Lemma 1. while the built-up concept only allows changes in the spaces between the building blocks. but allows arbitrary selection of the blocks from a fixed collection.

implies there are at most 6M/2 indices j < M — n 1 which are not contained in one of the [ni . then xr is (1-0-built-up from Bn• Proof Put mo = 0 and for i > 0. that is. will be surprisingly useful later.(7.4. Remark 1. M] which are not contained in one of the [ni . In a typical data compression problem a given finite sequence 4.c Entropy and prefix codes.SECTION 1. A code is said to be faithful or A (binary) code on A is a mapping C: A noiseless if it is one-to-one.7 (The building-block lemma. A suitable application of the built-up set lemma. ergodic process with finite alphabet A is usually called a source.. in such a way that the source sequence x'1' is recoverable from knowledge of the encoded sequence bf. = Tn (e). The goal is to use as little storage space as possible. Lemma 1. An image C (a) is called a codeword and the range of C . INTERPRETATIONS OF ENTROPY. for which there is a close connection between code length and source entropy. hence eventually almost surely mostly built-up from members of 7.8 The preceding two lemmas are strictly combinatorial. For this reason. a subject to be discussed in the next chapter. B*.7. This result.. known as prefix codes.7. drawn from some finite alphabet A. for no upper and lower bounds on their probabilities are given. and if M > 2n/8. which can be viewed as the entropy version of the finite sequence form of the frequency-typical sequence characterization of ergodic processes. Theorem 1.u. The assumption that xr is 13.7. m i ]. and n is large enough. 71 Lemma 1.. so that a typical code must be designed to encode a large number of different sequences and accurately decode them. Of course. In the following discussion B* denotes the set of all finite-length binary sequences and t(w) denotes the length of a member w E B*. implies that eventually almost surely xr is almost stronglycovered by sequences from T(e). The lemma is therefore established. The almost-covering principle.7. and that xr is eventually in DM. all that is known is that the members of Dm are almost built-up from entropy-typical n-blocks. I. stopping when "" Ic (1 — 3/2)-strongly-covered by within n of the end of xr. In this section the focus will be on a special class of codes. In particular. if B. while the condition that M > 2n18 implies there at most 3M/2 indices j E [M — n 1.7. in information theory a stationary.7. at least in some average sense. whose length r may depend on 4 .(€)) 1. often of varying lengths. b2. A standard model is to think of a source sequence as a (finite) sample path drawn from some ergodic process. where n is a fixed large number. define ni to be least integer k > m i _ i such that ric+n--1 E 13„. (e).5. is to be mapped into a binary sequence bf = b1. mi]. then .c. implies that the set Dm of sequences that are mostly built-up from the building blocks T(e) will eventually have cardinality not exponentially much larger than 2/4(h+6) • The sequences in Dm are not necessarily entropy-typical. there are many possible source sequences.) If xr is (1-812)-strongly-covered by 13.6. Theorem 1. In combination with the ergodic and entropy theorems they provide powerful tools for analyzing the combinatorial properties of partitions of sample paths. b. to make the code length as short as possible. .4.

(i) The vertex set V(W) of T (W) is the set of prefixes of members of W.u(X). (As usual. 0100. together with the empty word A. however. In particular. where b from u to y. (See Figure 1. is called the codebook.(a). directed) binary tree T(W). Formally. where X has the distribution A.7. 101.7. A nonempty set W c B* is prefix free if no member of W is a proper prefix of any member of W. Note that the root of the tree is the empty word and the depth d(v) of the vertex is just the length of the word v. The prefix u is called a proper prefix of w if w = u v where y is nonempty. and easily proved. defined by the following two conditions.9. The expected length of a code C relative to a probability distribution a on A is E(L) = Ea .9 The binary tree for W = {00. The function that assigns to each a E A the length of the code word C (a) is called the length function of the code and denoted by £(IC) or by r if C is understood. . For codes that satisfy a simple prefix condition. the concatenation uv is the sequence of length n m defined by (uv) i = ui 1<i<n n < i < n m. base 2 logarithms are used. A nonempty word u is called a prefix of a word w if w = u v. a lower bound which is almost tight. The entropy of p. H(j) is just the expected value of the random variable — log .C(aIC) = £(C (a)). 100 } .) E B then there is a directed edge Figure 1. BASIC CONCEPTS. 0101. entropy is a lower bound to code length. it is the function defined by .C(aIC)p. To develop the prefix code idea some notation and terminology and terminology will be introduced. For prefix-free sets W of binary sequences this fact takes the form (2) . is defined by the formula H (A) = E p(a) log aEA 1 that is.. (ii) If y E V(W) has the form v = ub.w < 1. A prefix-free set W has the property that it can be represented as the leaves of the (labeled. since W is prefix free. property of binary trees is that the sum of 2 —ci(v) over all leaves y of the tree can never exceed 1. An important. a E C. For u = u7 E B* and u = vr E B*.72 CHAPTER I.) Without further assumptions about the code there is little connection between entropy and expected code length. labeled by b. the words w E W correspond to leaves L(T) of the tree T (W).

INTERPRETATIONS OF ENTROPY.u (a ) log . A geometric proof of this inequality is sketched in Exercise 10. 73 where t(w) denotes the length of w E W. for any prefix code C. The Kraft inequality then takes the form E I Gr I2-Lo-) < 1.) Let j. An inductive continuation of this assignment argument clearly establishes the lemma.7.SECTION 1. so that IG21 —‹ 2") — IG11 2L(2)—L(1). This is possible since 1 GI 12—L(1) + I G2 1 2 L(2) < 1.. The codewords of a prefix code are therefore just the leaves of a binary tree. (ii) There is a prefix code C such that EIL (C(. a result known as the Kraft inequality.. and.t be a probability distribution on A. For prefix codes. ) E p. i E [1. a lower bound that is "almost" tight. I=1 A code C is a prefix code if a = a whenever C(a) is a prefix of C(a). entropy is always a lower bound on average code length.. Proof Define i and j to be equivalent if ti = Li and let {G1.u(a). Gs } be the equivalence classes.1C)) < H(g) + 1. Theorem 1.1C)) = Er(aiC)Wa) = a a log 2paic). The Kraft inequality has the following converse. Assign G2 in some one-to-one way to binary words of length L(2) that do not have already assigned words as prefixes. the leaf corresponding to C(a) can be labeled with a or with C(a). This fundamental connection between entropy and prefix codes was discovered by Shannon. t]. where the second term on the right gives the number of words of length L(2) that have prefixes already assigned. which is possible since I Gil < 2" ) . r=1 Assign the indices in G1 to binary words of length L(1) in some one-to-one way. < 1 then there is a prefix-free set W = {w(i): 1 < i < t} such that t(w(i)) = ti . where t i E Gr . Lemma 1. For prefix codes.7. (. (i) H(p) E4 (C(.(a) log a 2-C(a1C) E. holds.10 If 1 < < f2 < < it is a nondecreasing sequence of positive integers such that E.1C)). A little algebra on the expected code length yields E A (C(. Proof Let C be a prefix code.11 (Entropy and prefix codes. the Kraft inequality (3) aEA 1. since the code is one-to-one. written in order of increasing length L(r) = L. Thus C is a prefix code if and only if C is one-to-one and the range of C is prefix-free. a . .7.

The (asymptotic) rate of such a code sequence. by the Kraft inequality.t(a)1. it follows that = a A. a Since. (4) and (5) then yield the following asymptotic results. whose length function is L. a faithful n-code Cn : An B* is a prefix n-code if and only if its range is prefix free.log tt(a))p. Thus "good" codes exist.2. almost-sure versions of these two results will be obtained for ergodic processes.log 1. a E A.(a) a 1— E p(a) log p(a) =1+ H(p). will be called a prefix-code sequence. such that. implies that the first term is nonnegative.7.7. The second term is just H(g). Theorem 1. BASIC CONCEPTS. since the measure v defined by v(a) = 2 —C(a I C) satisfies v(A) < 1. Of interest for n-codes is per-symbol code length .6.11(ii) asserts the existence of a prefix n-code Cn such that 1 1 (5) — E (reIC n)) 5. no sequence of codes can asymptotically compress more than process entropy. while Theorem I.C(4 IQ. n) — . for each n.C(a) < 1— log p(a). In this case. Let /in be a probability measure on A n with per-symbol entropy 1-1.C(4) = . while the divergence inequality. Theorem I.. and "too-good" codes do not exist. that is. where rx1 denotes the least integer > x. that is. relative to a process {X n } with Kolmogorov measure IL is defined by = liM sup n—co E (C(IC n)) The two results. Lemma 1.12 (Process entropy and prefix-codes.10 produces a prefix code C(a) = w(a).H n (P. Next consider n-codes.7.7.(A n ) = H(i)/n. . mappings Cn : A n 1—> B* from source sequences of length n to binary words.) If p. it is possible to compress as well as process entropy in the limit.74 CHAPTER I. Cn is a prefix n-code with length function . is a stationary process with process entropy H then (a) There is a prefix code sequence {C n } such that 1-4({G}) < H. n " n A sequence {Cn } . which completes the proof of the theorem. however. E(1. In the next chapter. Lemma 1. a Thus part (ii) is proved. (b) There is no prefix code sequence {C n } such that 1Z 4 ({C n }) < H.lCn ))1n. To prove part (ii) define L(a) = 1.C(a71Cn )/n and expected per-symbol code length E(C(. This proves (i). . that is. and note that E 2— C (a) < E a a 2logg(a) E a E 1.11(i) takes the form 1 (4) H(iin) for any prefix n-code Cn .
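For concreteness, here is a minimal sketch (not from the text) of the Shannon-code lengths L(a) = ⌈−log μ(a)⌉ and of one standard way to realize the Kraft-inequality converse of Lemma 1.7.10, namely assigning to each symbol, in order of increasing length, the binary expansion of the running Kraft sum; the particular distribution is an illustrative assumption.

```python
import math
from fractions import Fraction

def shannon_lengths(mu):
    # Shannon code lengths L(a) = ceil(-log2 mu(a)); these satisfy the Kraft inequality.
    return {a: math.ceil(-math.log2(p)) for a, p in mu.items()}

def codewords_from_lengths(lengths):
    # Kraft-converse construction in the spirit of Lemma 1.7.10: process symbols in
    # order of increasing length, assigning the binary expansion of the running Kraft sum.
    assert sum(Fraction(1, 2 ** l) for l in lengths.values()) <= 1
    code, total = {}, Fraction(0)
    for a, l in sorted(lengths.items(), key=lambda kv: kv[1]):
        code[a] = format(int(total * 2 ** l), "0{}b".format(l))
        total += Fraction(1, 2 ** l)
    return code

mu = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}     # illustrative distribution
code = codewords_from_lengths(shannon_lengths(mu))
entropy = -sum(p * math.log2(p) for p in mu.values())
expected = sum(mu[a] * len(w) for a, w in code.items())
print(code, entropy, expected)     # H(mu) <= E(L) < H(mu) + 1
```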

For this reason no use will be made of Huffman coding in this book. [8]. for example. 75 Remark 1. the code length function L(a) = 1— log (a)1 is closely related to the function — log p.7. The following is due to Elias.(a). the length of the binary representation of n. This means that w(n) = w(m). Next it will be shown that.7. But then v(n) = v(m). called Huffman coding.d Converting faithful codes to prefix codes. u(n) = u(m). which is asymptotically negligible. so that asymptotic results about prefix codes automatically apply to the weaker concept of faithful code. The code is a prefix code. since both consist only of O's and the first bit of both v(n) and v(m) is a 1. Thus E(12) = 0001001100. so that. for example. so that w' is empty and n = m. Shannon's theorem implies that a Shannon code is within 1 of being the best possible prefix code in the sense of minimizing expected code length. produces prefix codes that minimize expected code length. Huffman codes have considerable practical significance. as far as asymptotic results are concerned.7. since both have length equal to the length of u(n). Any prefix code with this property is called an Elias code. so the lemma is established.SECTION 1. w(12) = 1100. so that. so that. This can be done in such a way that the length of the header is (asymptotically) negligible compared to total codeword length. whose expected value is entropy. B*. The length of w(n) is flog(n +1)1.) There is a prefix code £:11. A somewhat more complicated coding procedure. a prefix) to each codeword to specify its length length.u(a)1 will be called a Shannon code. while both u(n) and v(n) have length equal to f log(1 + Flog(n 1 )1)1.e. INTERPRETATIONS OF ENTROPY.2.C(a) = F— log . Lemma 1. for Shannon codes. The second part v(n) is the binary representation of the length of w(n). but for n-codes the per-letter difference in expected code length between a Shannon code and a Huffman code is at most 1/n. for example. the length of the binary representation of Flog(n + 1)1. v(12) = 100.7. The third part w(n) is the usual binary representation of n. The key to this is the fact that a faithful n-code can always be converted to a prefix code by adding a header (i.13 A prefix code with length function . since both have length specified by v(n). C(n) = u(n)v(n)w(n). The first part u(n) is just a sequence of O's of length equal to the length of v(n). there is no loss of generality in assuming that a faithful code is a prefix code. I. The desired bound f(E(n)) = log n + o(log n) follows easily. Proof The code word assigned to n is a concatenation of three binary sequences. then. for if u(n)v(n)w(n) = u(m)v(in)w(m)w'. since. such that i(e(n)) = log n + o(log n).14 (The Elias-code lemma. CI . u(12)=000.
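To make the three-part construction of the Elias-code lemma concrete, here is a small sketch (a sketch only, not the book's code) that builds C(n) = u(n)v(n)w(n) exactly as described above and decodes it; it reproduces the worked value E(12) = 0001001100.

```python
def elias(n):
    # Three-part Elias-type code: w(n) is the binary expansion of n, v(n) is the
    # binary expansion of the length of w(n), and u(n) is a run of 0's as long as v(n).
    w = format(n, "b")
    v = format(len(w), "b")
    u = "0" * len(v)
    return u + v + w

def elias_decode(bits):
    # Read the leading 0-run to learn |v|, read v to learn |w|, then read w.
    i = 0
    while bits[i] == "0":
        i += 1
    v_len = i
    length = int(bits[i:i + v_len], 2)
    start = i + v_len
    return int(bits[start:start + length], 2), bits[start + length:]

print(elias(12))                              # 0001001100, matching the example in the text
print(elias_decode(elias(12) + elias(5)))     # recovers 12 and leaves the rest of the stream
```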

where e is an Elias prefix code on the integers.7. Let T be the transformation defined by T:xi--> x ea. The code CnK. 001001100111. In the example.15 If .u is a stationary process with entropy-rate H there is no faithful-code sequence such that R IL ({C n }) < H." in the definition of prefix-code sequence. where En —> 0 as .e Rotation processes have entropy O.16 Another application of the Elias prefix idea converts a prefix-code sequence {C n } into a single prefix code C: A* i where C is defined by C(4) = e(n)C(4). for ease of reading. The header tells the decoder which codebook to apply to decode the received message.5). P)-names of length n. since Cn was assumed to be faithful.(4) = 001001100111. 1). 1). and the code word C. I. a word of length 12.4 I Cn ))Cn (4). The proof for the special two set partition P = (Po.12(b) extends to the following somewhat sharper form. the header information is almost as long as Cn (4).) Theorem 1. (The definition of faithful-code sequence is obtained by replacing the word "prefix" by the word "faithful. . which is clearly a prefix code. (4).x11 . Given a faithful n-code Cn with length function Le ICn ).5. 0. and let P be a finite partition of the unit interval [0. 1). where a is irrational and ED is addition modulo 1. 2. but header length becomes negligible relative to the length of Cn (4) as codeword length grows.76 CHAPTER I. e (C(4 ICn )). then C:(4) = 0001001100.7. x E [0. For example. Proposition 1. The proof will be based on direct counting arguments rather than the entropy formalism. (4) starts and that it is 12 bits long. This enables the decoder to determine the preimage . Thus Theorem 1..7.17 The (T. to general partitions (by approximating by partitions into intervals). a prefix n-code Cn * is obtained by the formula (6) C: (4 ) = ECC(. x lii E An . with a bit more effort.. will be called an Elias extension of C.14. if C.7. and. P)-process has entropy O. n = 1. where Po = [0.7. The argument generalizes easily to arbitrary partitions into subintervals. a comma was inserted between the header information. Geometric arguments will be used to estimate the number of possible (T. X tiz E An.7. will be given in detail. and P1 = [0. PO. where. by Lemma 1. These will produce an upper bound of the form 2nEn . As an application of the covering-exponent interpretation of entropy it will be shown that ergodic rotation processes have entropy 0. BASIC CONCEPTS. The decoder reads through the header information and learns where C. Remark 1.

y]. The next lemma is just the formal statement that the name of any point x can be shifted by a bounded amount to obtain a sequence close to the name of z = 0 in the sense of the pseudometric dn.5) E I.Y11 < 6/4 and n > N. The first lemma shows that cln (x. tz]: Tlx E = for any interval I c [0.11 and the pseudometric dn(x.y1. if a = b. Thus every name is obtained by changing the name of some fixed z in a small fraction of places and shifting a bounded amount. given z and x. y) such that if n > N then Given E > 0 there is an N such that dn (x.4. some power y = Tx of x must be close to z. The case when lx . 77 n oo. Proposition 1. b) = 0.5) E <E This implies that dn (x.yli = minflx .7. The key to the proof to the entropy 0 result is that.yli < €14. y) < e. x . Let I = [x.) Inn E [1. and hence the (T-P)-names of y and z will agree most of the time. To make the preceding argument precise define the metric Ix .Y 1 1. and d(a. y) < 1{j E E I or T -1 (0. P)-names of x.SECTION 1.yl can be treated in a similar manner. if Ix .7. which occurs when and only when either T --1 0 E I or (0. The names of x and y disagree at time j if and only if Tx and T y belong to different atoms of the partition P. .yll = 11 + x . which is rotation by -a. respectively. and {x n } and {yn} denote the (T. 1).yi. to provide an N = N(x.7. where A denotes Lebesgue measure. Yi). y) is continuous relative to Ix . Proposition 1. The fact that rotations are rigid motions allows the almost-sure result to be extended to a result that holds for every x. INTERPRETATIONS OF ENTROPY. The uniform distribution property. Compactness then shows that N can be chosen to be independent of x and y. can be applied to T -1 . n where d(a. The result then follows from the covering-exponent interpretation of entropy. But translation is a rigid motion so that Tmy and Trnz will be close for all m. x. n > N. =- 1 n d(xi.7. In particular. 1) and any x E [0.7. 1). Without loss of generality it can be assumed that x < y. y) = dn(xç .18.b) = 1. 1). there cannot be exponentially very many names. y E [0. Theorem 1. if a 0 b. The key to the proof is the following uniform distribution property of irrational rotations. y E [0. Proof Suppose lx . Ergodicity asserts the truth of the proposition for almost every x.y11. Consider the case when Ix Y Ii = lx . The proof is left as an exercise. Lemma 1.19 E.18 (The uniform distribution property.

The number of binary sequences of length n — j that can be obtained by changing the (n — j)-name of z in at most c(n — j) places is upper bounded by 2 (n— i )h(E) . (a) Show that Ann ( '2) < (P.A2) given for the computation of the covering number.7. respectively. where j(x) is given by the preceding lemma. where h(6) = —doge —(1 — 6) log(1 — 6). to obtain a least positive integer j = j(x) such that IT ] x — Ok < 6/4. 6) is defined in (1). + (b) Show that H„.. I. X E X. The unit interval X can be partitioned into measurable sets Xi = Ix : j(x) = j}. Since small changes in x do not change j(x) there is a number M such that j(x) < M for all x. choose N from the preceding lemma so that if n > N and lx — yli < 6/4 then dn (x . for n > K + M.20 CHAPTER I. Let H(A) denote the entropy of p. based on the algorithm (Al .) . y) < E. BASIC CONCEPTS. Given E > 0. 3.T i X) < E. Lemma 1. P)-process is 0. yk ) < 69 k > K.7.) ± log n. Theorem 1. Given x. With K = M N the lemma follows. Proof Now for the final details of the proof that the entropy of the (T.5. There are at most 2i possible values for the first j places of any name.. where Nn (a. M] such that if Z = 0 and y = Tx then i.4. (Hint: what does the entropy theorem say about the first part of the list?) 2.2.n (g) > mH(A).Arn (a. 1) there is a j = j(x) E [1. (Hint: there is at least one way and at most n1441 2n ways to express a given 4 in the form uw(l)w(2). determine M and K > M from the preceding lemma. there is an integer M and an integer K > M such that if x E [0. Give a direct proof of the covering-exponent theorem. w(k — 1)v.78 Lemma 1. where each w(i) E An and u and v are the tail and head.15. j < M. let n > K+M and let z = O.7.. Since h(€) —> 0 as E 1=1 this completes the proof that the (T. 1. 6) exists. is the binary entropy function. Let g be the concatenated-block process defined by a measure it on A n . the number of possible rotation names of 0 length n is upper bounded by M2m2nh(€). according to the combinations bound. dk (zk Given E > 0.4. Note that dn—j(Z. Proposition 1.f Exercises. apply Kronecker's theorem. Given E > 0. of words w(0). 2)-process has entropy 0. w(k) E An. Thus an upper bound on the number of possible names of length n for members of Xi is (7) 2(n— Dh(e) 114 : x E Xill < 2 i IA lnh(6) Since there only M possible values for j. Show that limn (1/n) log.

Find an expression for the entropy of it. Section 1.7. a shift-invariant ergodic measure on A z and let B c A z be a measurable set of positive measure.1.SECTION 1. STATIONARY CODING. The definition and notation associated with stationary coding will be reviewed and the basic approximation by finite codes will be established in this subsection. Show that the resulting set of intervals is disjoint. and that this implies the Kraft inequality (2). Recall that a stationary coder is a measurable mapping F: A z 1--> B z such that F(TA X ) = TB F (x). 5. VX E AZ.8. Two basic results will be established in this section. 6.a Approximation by finite codes.8 Stationary coding.9. (b) 44) 5_ 2 ± 2n log IA I. The first is that stationary codes can be approximated by finite codes.18. that the entropy of the induced process is 1/(A)//t(B).7. Let p. that the entropy-rate of an ergodic process and its reversed process are the same. . for all xri' E An.8. Show that H({Y„}) < 8. 1] of length 2 —L(a ) whose left endpoint has dyadic expansion C (a). Show that if C„: A n H B* is a prefix code with length function L then there is a prefix code e: A" 1—* B* with length function E such that (a) Î(x) <L(x) + 1. 11.„} is ergodic. Suppose {1'.4. Let C be a prefix code on A with length function L. Use this idea to prove that if Ea 2 —C(a) < 1. Prove Proposition 1. The second basic result is a technique for creating stationary codes from block codes. using the entropy theorem. In some cases it is easy to establish a property for a block coding of a process. Show. 7. in terms of the entropy of the waiting-time distribution. Let {Ifn } be the process obtained from the stationary ergodic process {X} by applying the block code CN and randomizing the start. 10. Suppose the waiting times between occurrences of "1" for a binary process it are independent and identically distributed. 9. Show. Stationary coding was introduced in Example 1. for all x'i' E A n . 79 4. Often a property can be established by first showing that it holds for finite codes. then there is prefix code with length function L(a). using Theorem 1. then passing to a suitable limit. then extend using the block-to-stationary code construction to obtain a desired property for a stationary coding. I. Use the covering exponent concept to show that the entropy of an ergodic nstationary process is the same as the entropy of the stationary process obtained by randomizing its start. Associate with each a E A the dyadic subinterval of the unit interval [0.


where T_A and T_B denote the shifts on the respective two-sided sequence spaces A^Z and B^Z. Given a (two-sided) process with alphabet A and Kolmogorov measure μ_A, the encoded process has alphabet B and Kolmogorov measure μ_B = μ_A ∘ F^{-1}. The associated time-zero coder is the function f: A^Z → B defined by the formula f(x) = F(x)_0. Associated with the time-zero coder f: A^Z → B is the partition P = P(f) = {P_b: b ∈ B} of A^Z, defined by P_b = {x: f(x) = b}, that is,

(1)    x ∈ P_b   if and only if   f(x) = b.

Note that if y = F(x) then y_n = f(T^n x), that is, T^n x ∈ P_{y_n}. In other words, y is the (T, P)-name of x and the measure μ_B is the Kolmogorov measure of the (T, P)-process. The partition P(f) is called the partition defined by the encoder f or, simply, the encoding partition. Note also that a measurable partition P = {P_b: b ∈ B} defines an encoder f by the formula (1), such that P = P(f), that is, P is the encoding partition for f. Thus there is a one-to-one correspondence between time-zero coders and measurable finite partitions of A^Z. In summary, a stationary coder can be thought of as a measurable function F: A^Z → B^Z such that F ∘ T_A = T_B ∘ F, or as a measurable function f: A^Z → B, or as a measurable partition P = {P_b: b ∈ B}. The descriptions are connected by the relationships

f(x) = F(x)_0,    F(x)_n = f(T^n x),    P_b = {x: f(x) = b}.

A time-zero coder f is said to be finite (with window half-width w) if f(x) = f(x̃) whenever x_{-w}^{w} = x̃_{-w}^{w}. In this case, the notation f(x_{-w}^{w}) is used instead of f(x). A key fact for finite coders is that

F^{-1}[y_1^n] = ∪ { [x_{1-w}^{n+w}] : f(x_{i-w}^{i+w}) = y_i, 1 ≤ i ≤ n },

and, as shown in Example 1.1.9, for any stationary measure μ, cylinder set probabilities for the encoded measure ν = μ ∘ F^{-1} are given by the formula

(2)    ν(y_1^n) = μ(F^{-1}[y_1^n]) = Σ μ(x_{1-w}^{n+w}),

where the sum is over all x_{1-w}^{n+w} such that f(x_{i-w}^{i+w}) = y_i for 1 ≤ i ≤ n.
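As a small illustration of formula (2) (a sketch, not from the text), the following computes encoded cylinder probabilities for a finite coder by summing μ over the preimage blocks; the binary alphabet, the i.i.d. input measure, and the majority-rule window coder are illustrative assumptions.

```python
from itertools import product

A = (0, 1)
p = 0.3                      # Prob(x_i = 1) under the illustrative i.i.d. measure mu
w = 1                        # window half-width

def mu(block):               # mu of a finite block under the i.i.d.(p) measure
    prob = 1.0
    for symbol in block:
        prob *= p if symbol == 1 else 1 - p
    return prob

def f(window):               # finite time-zero coder f(x_{-w}^{w}): majority rule
    return 1 if sum(window) >= 2 else 0

def nu(y):                   # formula (2): sum mu(x_{1-w}^{n+w}) over the preimage blocks
    n = len(y)
    total = 0.0
    for x in product(A, repeat=n + 2 * w):          # candidate blocks x_{1-w}^{n+w}
        if all(f(x[i:i + 2 * w + 1]) == y[i] for i in range(n)):
            total += mu(x)
    return total

print(nu((1, 0)))                                   # encoded probability of [y_1^2 = 10]
print(sum(nu(y) for y in product(A, repeat=2)))     # the length-2 cylinders sum to 1
```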

The principal result in this subsection is that time-zero coders are "almost finite", in the following sense.

Theorem 1.8.1 (Finite-coder approximation.)
If f: A^Z → B is a time-zero coder, if μ is a shift-invariant measure on A^Z, and if ε > 0, then there is a finite time-zero encoder f̃: A^Z → B such that

μ({x: f(x) ≠ f̃(x)}) ≤ ε.

Proof Let P = {P_b: b ∈ B} be the encoding partition for f. Since P is a finite partition of A^Z into measurable sets, and since the measurable sets are generated by the finite cylinder sets, there is a positive integer w and a partition P̃ = {P̃_b} with the following two properties.

(a) Each P̃_b is a union of cylinder sets of the form [x_{-w}^{w}].

(b) Σ_b μ(P_b Δ P̃_b) < ε.

Let f̃ be the time-zero encoder defined by f̃(x) = b, if [x_{-w}^{w}] ⊂ P̃_b. Condition (a) assures that f̃ is a finite encoder, and μ({x: f(x) ≠ f̃(x)}) ≤ ε follows from condition (b). This proves the theorem. □

Remark 1.8.2
The preceding argument only requires that the coded process be a finite-alphabet process. In particular, any stationary coding onto a finite-alphabet process of any i.i.d. process of finite or infinite alphabet can be approximated arbitrarily well by finite codes. As noted earlier, the stationary coding of an ergodic process is ergodic. A consequence of the finite-coder approximation theorem is that entropy for ergodic processes is not increased by stationary coding. The proof is based on direct estimation of the number of sequences of length n needed to fill a set of fixed probability in the encoded process.

Theorem 1.8.3 (Stationary coding and entropy.)
If ν = μ ∘ F^{-1} is a stationary encoding of an ergodic process μ, then h(ν) ≤ h(μ).

Proof First consider the case when the time-zero encoder f is finite with window half-width w. By the entropy theorem, there is an N such that if n > N, there is a set C_n of blocks x_{1-w}^{n+w} of measure at least 1/2 and cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. The image f(C_n) is a subset of B^n of measure at least 1/2, by formula (2). Furthermore, because mappings cannot increase cardinality, the set f(C_n) has cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. Thus

(1/n) log |f(C_n)| ≤ h(μ) + ε + δ_n,

where δ_n → 0 as n → ∞, since w is fixed. It follows that h(ν) ≤ h(μ), since entropy equals the asymptotic covering rate, Theorem 1.7.4.

In the general case, given ε > 0 there is a finite time-zero coder f̃ such that μ({x: f(x) ≠ f̃(x)}) ≤ ε². Let F and F̃ denote the sample path encoders and let ν = μ ∘ F^{-1} and ν̃ = μ ∘ F̃^{-1} denote the Kolmogorov measures defined by f and f̃, respectively. It has already been established that finite codes do not increase entropy, so that h(ν̃) ≤ h(μ). Thus, there is an n and a collection C̃ ⊂ B^n such that ν̃(C̃) > 1 − ε and |C̃| ≤ 2^{n(h(ν̃)+ε)} ≤ 2^{n(h(μ)+ε)}. Let C = [C̃]_ε be the ε-blowup of C̃, that is,

C = { y_1^n : d_n(y_1^n, ỹ_1^n) ≤ ε, for some ỹ_1^n ∈ C̃ }.

The blowup bound, Lemma 1.7.5, implies that

|C| ≤ |C̃| 2^{nδ(ε)} ≤ 2^{n(h(μ)+ε+δ(ε))},   where δ(ε) → 0 as ε → 0.

Let

D̃ = {x: F̃(x)_1^n ∈ C̃}   and   D = {x: F(x)_1^n ∈ C}

be the respective pull-backs to A^Z. Since μ({x: f(x) ≠ f̃(x)}) ≤ ε², the Markov inequality implies that the set

G = {x: d_n(F(x)_1^n, F̃(x)_1^n) ≤ ε}

has measure at least 1 − ε, so that μ(G ∩ D̃) ≥ 1 − 2ε, since μ(D̃) = ν̃(C̃) > 1 − ε. By definition of G and D, however, G ∩ D̃ ⊂ D, so that

ν(C) = μ(D) ≥ μ(G ∩ D̃) ≥ 1 − 2ε.

The bound |C| ≤ 2^{n(h(μ)+ε+δ(ε))}, and the fact that entropy equals the asymptotic covering rate, Theorem 1.7.4, then imply that h(ν) ≤ h(μ). This completes the proof of Theorem 1.8.3.

Example 1.8.4 (Stationary coding preserves mixing.)
A simple argument, see Exercise 19, can be used to show that stationary coding preserves mixing. Here a proof based on approximation by finite coders will be given. While not as simple as the earlier proof, it gives more direct insight into why stationary coding preserves mixing and is a nice application of coder approximation ideas.

The σ-field generated by the cylinders [a_m^n], for m and n fixed, that is, the σ-field generated by the random variables X_m, X_{m+1}, ..., X_n, will be denoted by Σ(X_m^n). As noted earlier, the i-fold shift of the cylinder set [a_m^n] is the cylinder set T^{-i}[a_m^n] = [c_{m+i}^{n+i}], where c_{j+i} = a_j, m ≤ j ≤ n.

Let ν = μ ∘ F^{-1} be a stationary encoding of a mixing process μ. First consider the case when the time-zero encoder f is finite, with window width 2w + 1, say. The coordinates y_1^n of y = F(x) depend only on the coordinates x_{1-w}^{n+w}. Thus for g = m − n > 2w, the intersection [y_1^n] ∩ [y_{m+1}^{m+n}] is the image under F of C ∩ T^{-m}D, where C and D are both measurable with respect to Σ(X_{1-w}^{n+w}). If μ is mixing then, given ε > 0 and n ≥ 1, there is an M such that

|μ(C ∩ T^{-m}D) − μ(C)μ(D)| < ε,   C, D ∈ Σ(X_{1-w}^{n+w}),   m ≥ M,

which, in turn, implies that

|ν([y_1^n] ∩ [y_{m+1}^{m+n}]) − ν([y_1^n])ν([y_{m+1}^{m+n}])| < ε,   m ≥ M, g = m − n > 2w.

Thus ν is mixing. In the general case, suppose ν = μ ∘ F^{-1} is a stationary coding of the mixing process μ with stationary encoder F and time-zero encoder f, and fix n ≥ 1. Given ε > 0, and a_1^n and b_1^n, let δ be a positive number to be specified later and choose a finite encoder F̃, with time-zero encoder f̃, such that μ({x: f(x) ≠ f̃(x)}) < δ. Thus,

μ({x: F(x)_1^n ≠ F̃(x)_1^n}) ≤ Σ_{i=1}^{n} μ({x: f(T^i x) ≠ f̃(T^i x)}) = n μ({x: f(x) ≠ f̃(x)}) < nδ,

by stationarity. But n is fixed, so that if δ is small enough then ν(a_1^n) will be so close to ν̃(a_1^n) and ν(b_1^n) so close to ν̃(b_1^n) that

(3)    |ν(a_1^n)ν(b_1^n) − ν̃(a_1^n)ν̃(b_1^n)| ≤ ε/3,

for all a_1^n and b_1^n. Likewise, for any m ≥ 1,

μ({x: F(x)_1^n ≠ F̃(x)_1^n, or F(x)_{m+1}^{m+n} ≠ F̃(x)_{m+1}^{m+n}}) ≤ 2nδ,

so that

(4)    |ν([a_1^n] ∩ T^{-m}[b_1^n]) − ν̃([a_1^n] ∩ T^{-m}[b_1^n])| ≤ ε/3,

provided only that δ is small enough, uniformly in m, for all a_1^n and b_1^n. Since ν̃ is mixing, being a finite coding of μ, it is true that

|ν̃([a_1^n] ∩ T^{-m}[b_1^n]) − ν̃([a_1^n])ν̃([b_1^n])| < ε/3,

provided only that m is large enough, and hence, combining this with (3) and (4), and using the triangle inequality yields

|ν([a_1^n] ∩ T^{-m}[b_1^n]) − ν([a_1^n])ν([b_1^n])| < ε,

for all sufficiently large m, provided only that δ is small enough. Thus, indeed, stationary coding preserves the mixing property.

I.8.b From block to stationary codes.

As noted in Example 1.1.10, an N-block code C_N: A^N → B^N can be used to map an A-valued process {X_n} into a B-valued process {Y_n} by applying C_N to consecutive nonoverlapping blocks of length N,

Y_{jN+1}^{(j+1)N} = C_N(X_{jN+1}^{(j+1)N}),    j = 0, 1, 2, ...

If {X_n} is stationary, then a stationary process {Ỹ_n} is obtained by randomizing the start, i.e., by selecting an integer U ∈ [1, N] according to the uniform distribution and defining Ỹ_j = Y_{U+j-1}, j = 1, 2, .... The final process {Ỹ_n} is stationary, but it is not, except in rare cases, a stationary coding of {X_n}, and nice properties of {X_n}, such as mixing or even ergodicity, may get destroyed. A method for producing stationary codes from block codes will now be described. The basic idea is to use an event of small probability as a signal to start using the block code. The block code is then applied to successive N-blocks until within N of the next occurrence of the event. If the event has small enough probability, sample paths will be mostly covered by nonoverlapping blocks of length exactly N to which the block code is applied.
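As a quick sketch (not from the text) of the block-coding construction just described, the following applies an N-block code to consecutive nonoverlapping blocks and then randomizes the start; the particular block code and the random binary input are illustrative assumptions.

```python
import random

N = 3

def C_N(block):                       # illustrative N-block code: reverse each block
    return block[::-1]

def apply_block_code(x):              # code consecutive nonoverlapping N-blocks
    y = []
    for j in range(len(x) // N):
        y.extend(C_N(x[j * N:(j + 1) * N]))
    return y

def randomize_start(y):               # choose U uniformly in [1, N] and drop U-1 symbols
    U = random.randint(1, N)
    return y[U - 1:]

x = [random.randint(0, 1) for _ in range(30)]
y = apply_block_code(x)
print(y, randomize_start(y), sep="\n")
```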

Lemma 1.8.5 (Block-to-stationary construction.)
Let μ be an ergodic measure on A^Z and let C: A^N → B^N be an N-block code. Given ε > 0 there is a stationary code F = F_C: A^Z → B^Z such that for almost every x ∈ A^Z there is an increasing sequence {n_i: i ∈ Z}, which depends on x, such that Z = ∪_i [n_i, n_{i+1}) and

(i) n_{i+1} − n_i ≤ N, i ∈ Z.

(ii) If J_n is the set of indices i such that [n_i, n_{i+1}) ⊂ [−n, n] and n_{i+1} − n_i < N, then lim sup_n (1/2n) Σ_{i∈J_n} (n_{i+1} − n_i) ≤ ε, almost surely.

(iii) If n_{i+1} − n_i = N then y_{n_i}^{n_{i+1}−1} = C(x_{n_i}^{n_{i+1}−1}), where y = F(x).


Proof Let D be a cylinder set such that 0 < μ(D) < ε/N, and let G be the set of all x ∈ A^Z for which T^m x ∈ D for infinitely many positive and negative values of m. The set G is measurable and has measure 1, by the ergodic theorem. For x ∈ G, define m_0 = m_0(x) to be the least nonnegative integer m such that T^m x ∈ D, then extend to obtain an increasing sequence m_j = m_j(x), j ∈ Z, such that T^m x ∈ D if and only if m = m_j, for some j. The next step is to split each interval [m_j, m_{j+1}) into nonoverlapping blocks of length N, starting with m_j, plus a final remainder block of length shorter than N, in case m_{j+1} − m_j is not exactly divisible by N. In other words, for each j let q_j and r_j be nonnegative integers such that m_{j+1} − m_j = q_j N + r_j, 0 ≤ r_j < N, and form the disjoint collection I_x(m_j) of left-closed, right-open intervals

[m_j, m_j + N), [m_j + N, m_j + 2N), ..., [m_j + q_j N, m_{j+1}).

All but the last of these have length exactly N, while the last one is either empty or has length r_j < N. The definition of G guarantees that for x ∈ G, the union ∪_j I_x(m_j) is a partition of Z. The random partition ∪_j I_x(m_j) can then be relabeled as {[n_i, n_{i+1}), i ∈ Z}, where n_i = n_i(x), i ∈ Z. If x ∉ G, define n_i = i, i ∈ Z. By construction, condition (i) certainly holds for every x ∈ G. Furthermore, the ergodic theorem guarantees that the average distance between m_j and m_{j+1} is at least N/ε, so that (ii) also holds, almost surely.

The encoder F = F_C is defined as follows. Let b be a fixed element of B, called the filler symbol, and let b^j denote the sequence of length j, 1 ≤ j < N, each of whose terms is b. If x is a sequence for which Z = ∪_i [n_i, n_{i+1}), then y = F(x) is defined by the formula

y_{n_i}^{n_{i+1}−1} = b^{n_{i+1}−n_i},          if n_{i+1} − n_i < N,
y_{n_i}^{n_{i+1}−1} = C(x_{n_i}^{n_{i+1}−1}),   if n_{i+1} − n_i = N.

This definition guarantees that property (iii) holds for x ∈ G, a set of measure 1. For x ∉ G, define F(x)_i = b, i ∈ Z. The function F is certainly measurable and satisfies F(T_A x) = T_B F(x), for all x. This completes the proof of Lemma 1.8.5.

The blocks of length N are called coding blocks and the blocks of length less than N are called filler or spacer blocks. Any stationary coding F of μ for which (i), (ii), and (iii) hold is called a stationary coding with ε-fraction of spacers induced by the block code C_N. There are, of course, many different processes satisfying the conditions of the lemma, since there are many ways to parse sequences so that properties (i), (ii), and (iii) hold; for example, any event of small enough probability can be used, and how the spacer blocks are coded is left unspecified in the lemma statement. The terminology applies to any of these processes.

Remark 1.8.6
Lemma 1.8.5 was first proved in [40], but it is really only a translation into process language of a theorem about ergodic transformations first proved by Rohlin, [60]. Rohlin's theorem played a central role in Ornstein's fundamental work on the isomorphism problem for Bernoulli shifts, [46, 63].
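The following is a finite-segment sketch (not from the text) of the construction in the proof of Lemma 1.8.5: occurrences of a low-probability marker pattern play the role of the event D, full N-blocks between markers are coded, and remainder blocks receive a fixed filler symbol; the marker, the block code, and the input are illustrative assumptions.

```python
import random

N = 4
FILLER = 0
MARKER = (1, 1, 1, 1, 1, 1)          # stands in for the low-probability cylinder set D

def C(block):                        # illustrative N-block code: complement each symbol
    return [1 - s for s in block]

def block_to_stationary(x):
    # Marker occurrence times play the role of the m_j in the proof.
    hits = [m for m in range(len(x) - len(MARKER) + 1)
            if tuple(x[m:m + len(MARKER)]) == MARKER]
    first = hits[0] if hits else len(x)
    y = list(x)
    y[:first] = [FILLER] * first     # coordinates before the first marker: filler
    for m, m_next in zip(hits, hits[1:] + [len(x)]):
        pos = m
        while pos + N <= m_next:     # coding blocks of length exactly N
            y[pos:pos + N] = C(x[pos:pos + N])
            pos += N
        y[pos:m_next] = [FILLER] * (m_next - pos)   # spacer block, length < N
    return y

x = [random.randint(0, 1) for _ in range(60)]
print(block_to_stationary(x))
```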

i. the {Y} process will be mixing. As an illustration of the method a string matching example will be constructed. and extended to a larger class of processes in Theorem 11. and Markov processes it is known that L(x7) = 0(log n). It can. lim sup i-400 Â(ni) In particular. as follows. where it was shown that for any ergodic process {X.SECTION 1. For i. The construction is iterated to obtain an infinite sequence of such n's. Section 11. A negative solution to the question was provided in [71]. Let . where C„ is the zero-inserter defined by the formula (5).d.d. such that L(Y. for some O <s<t<n—k} .1. fy . The problem is to determine the asymptotic behavior of L(4) along sample paths drawn from a stationary. For x E An let L(4) be the length of the longest block appearing at least twice in x1% that is.5. The block-to-stationary code construction of Lemma 1.} defined by yin clearly has the property —on+1 = cn(X in o .} is the process obtained from {X. To make this precise. be forced to have entropy as close to the entropy of {X n } as desired. that is. of {X„}. then nesting to obtain bad behavior for infinitely many n. 1)/2+1).c A string-matching example. Let pc } .1.-on+1) . The first observation is that if stationarity is not required. case using coding ideas is suggested in Exercise 2. j ?. A question arose as to whether such a log n type bound could hold for the general ergodic case. there is a nontrivial stationary coding.} by the encoding 17(jin 1)n+1 = cn(X.8.8. j > 1. uyss+ for any starting place s. The proof for the binary case and growth rate À(n) = n 112 will be given here.i.') > 1.d. The n-block coding {I'm ) of the process {X„. as a simple illustration of the utility of block-to-stationary constructions.8. 1 " by changing the first 4n 1 i2 terms into all O's and the final 4n 1 12 terms into all O's.. if the start process {X n } is i. finite-alphabet ergodic process. . 85 I. {Yn }. the problem is easy to solve merely by periodically inserting blocks of O's to get bad behavior for one value of n..(n) —> 0.'. almost surely. furthermore. see [26.i. A proof for the i. 2]. for each n > 64.} and any positive function X(n) for which n -1 1. indicate that {Y.5. define the function Cn : A n H {0. and an increasing unbounded sequence {ni I. at least if the process is assumed to be mixing. where } (5) 0 Yi = 1 xi i <4n 112 or i > n — 4n 112 otherwise..5 provides a powerful tool for making counterexamples. STATIONARY CODING. L(x ) = max{k: +k i xss+ xtt±±k. -Fin) > n 1/2 . Cn (4) = y. The string-matching problem is of interest in DNA modeling and is defined as follows. since two disjoint blocks of O's of length at least n 112 must appear in any set of n consecutive places.
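For concreteness, here is a brute-force sketch (not from the text) of the string-matching function L(x_1^n) and of the zero-inserter code C_n of (5); it only illustrates the observation that L(C_n(x_1^n)) ≥ n^{1/2}, since the inserted runs of 0's contain two disjoint occurrences of the same block.

```python
import math, random

def L(x):
    # the string-matching function: length of the longest block appearing twice in x
    n, best = len(x), 0
    for k in range(1, n):
        seen, repeat = set(), False
        for s in range(n - k + 1):
            block = tuple(x[s:s + k])
            if block in seen:
                repeat = True
                break
            seen.add(block)
        if not repeat:
            break
        best = k
    return best

def zero_inserter(x):
    # the n-block code C_n of (5): zero out the first and last 4*sqrt(n) coordinates
    n, m = len(x), 4 * math.sqrt(len(x))
    return [0 if (i + 1 <= m or i + 1 > n - m) else x[i] for i in range(n)]

n = 100
x = [random.randint(0, 1) for _ in range(n)]
y = zero_inserter(x)
print(L(x), L(y), math.sqrt(n))      # L(y) >= sqrt(n), from the two long runs of 0's
```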

UZ { .8. for all m E Z and i > 1. and it is not easy to see how to pass to a limit process. Combining this observation with the fact that O's are not changed. Randomizing the start at each stage will restore stationarity. at each stage converts the block codes into stationary codings. with a limiting Et-fraction of spacers.5.n } by stationary coding using the codebook Cn. since O's inserted at any stage are not changed in subsequent stages. each process {Y(i) m } is a stationary coding of the original process {X.86 CHAPTER I.E. Lemma 1. and define the sequence of processes by starting with any process {Y(0)} and inductively defining {Y(i)} for i > 1 by the formula {Y --4Ecn {Y(i)m}. (6). For any start process {Y(0). so that the code at each stage never changes a 0 into a 1. and f is filler. y is the beginning of a codeword. j > 1. Property (6) guarantees that no coordinate is changed more than once. but mixing properties are lost. 11 z {0. 1 < j < for all s.n }. BASIC CONCEPTS. guarantees that a limit process 1/2 {Y(oo)„. which will. the i-th iterate {Y(i) m } has the property L(Y(i): +7') > n/2 . Given the ergodic process {X n } (now thought of as a two-sided process. that is. which. At each stage the same filler symbol. For n > 64. and let {ci } be a decreasing sequence of positive numbers. mEZ. b = 0. to be specified later. where u is the end of a codeword. where the notation indicates that {Y(j) m } is constructed from {Y(i — 1). since stationary coding will be used).} exists for which L(Y(oo) :. both to be specified later. Since stationary codings of stationary codings are themselves stationary codings. In particular. For each i > 1. mixing is not destroyed. the choice of filler means that L(Y(i)?`) > rt.. Furthermore. and hence there is a limit code F defined by the formula [F(x)]m = lim[F i (x)] m .) {YU — l)m} -14 {Y(i)ml. guarantee that the final limit process is a stationary coding of the starting process with the desired string-matching properties. since this always occurs if Y(i)?' is the concatenation uv. yields (7) L(Y(i)7 J ) > ni ll2 . will be used. + 71 ) > n. 1} z that maps {X. The rigorous construction of a stationary coding is carried out as follows. (6) Y(i — 1) m = 0 =z= Y (i) m = O. < j. . the sequence of processes {Y(i)} is defined by setting {Y(0)} = {X.}. Furthermore. {n i : i > 11 be an increasing sequence of integers. The preceding construction is conceptually quite simple but clearly destroys stationarity. let {n i } be an increasing sequence of natural numbers. provided the n i increase rapidly enough. in particular.n } and inductively defining {Y(i) m } for i > 1 by the formula (cn .1. let Fi denote the stationary encoder Fi : {0. each coordinate is changed at most once. where u is the end of a codeword and v the beginning of a codeword. and long matches are even more likely if Y(i)?' is the concatenation uf v.n } to {Y(i) m }. in turn. hence. the final process has a positive frequency of l's. xE 0. let Cn be the zero-inserter code defined by (5)./2 . in turn. An appropriate use of block-to-stationary coding.

PROCESS TOPOLOGIES. The collection P(A") is just the collection of processes with alphabet A. weak limits of ergodic processes need not be ergodic and entropy is not weakly continuous. compact. entropy is d-continuous and the class of ergodic processes is d-closed. C(4)) < E.n } defined by the stationary encoder F. In particular. The collection of (Borel) probability measures on a compact space X is denoted by Of primary interest are the two cases. declares two processes to be close if their joint distributions are close for a long enough time.8.u (n) } of measures in P(A") converges weakly to a measure it if lim p. where A is a finite set. for each i. Two useful ways to measure the closeness of stationary processes are described in this section.9. but it has two important defects.„ m E Z is the stationary coding of the initial process {X. merely by making the ni increase rapidly enough and the Ei go to zero rapidly enough. The other concept. One concept. The above is typical of stationary coding constructions. First a sequence of block codes is constructed to produce a (nonstationary) limit process with the properties desired. Show that if t is mixing and 8 > 0. processes. then the sequence {ni(x)} of Lemma 1.8. the entropy of the final process will be as close to that of the original process as desired. the limit process has the property L(Y(oo)) > the limiting density of changes can be forced to be as small as desired. I. the d-distance between processes is often difficult to calculate.d Exercises. as are other classes of interest. and it plays a central role in the theory of stationary codings of i. o (Hint: use the preceding exercise. a theory to be presented in Chapter 4.) 2. The d-metric is not so easy to define. 1. A sequence {. Since (7) hold j > 1. X = Anand X = A". Suppose C: A N B N is an N-code such that Ei.5 is applied to produce a stationary process with the same limiting properties. where it is mixing.d. for each al iv E AN . 87 The limit process {Y(oo)„z } defined by Y(oo). then Lemma 1. (Hint: replace D by TD for n large. p(x). and easy to describe. (") (a lic) = n-4.(dN(x fv .00 . and the d-topology is nonseparable.9.9 Process topologies.SECTION 1.) Section 1. however. called the dtopology. Many of the deeper recent results in ergodic theory depend on the d-metric. On the other hand. I. namely. called the weak topology.a The weak topology.. Furthermore. The weak topology is separable.i.5 can be chosen so that Prob(x 1— ' = an < 8 .n = limé Y(i).8. The subcollection of stationary processes is denoted by P s (A") and the subcollection of stationary ergodic processes is denoted by 7'e (A°°). Show that there is a stationary code F such that limn Ett (dn (4. F(x)7) < < E 1-1(un o 2E and such that h(o. declares two ergodic processes to be close if only a a small limiting density of changes are needed to convert a typical sequence for one process into a typical sequence for the other process.

The weak topology on P(A") is the topology defined by weak convergence. .°2) be the concatenated-block process defined by the single n2'1 -block x(n).2e.(4).88 CHAPTER I. as i —> oc. so the class of stationary measures is also compact in the weak topology. p. Now suppose . 0.u (n) } converges weakly to the coin-tossing process. however. iii . It is the weakest topology for which each of the mappings p . In particular. hence has a convergent subsequence. Since Ho (Xo l XII) depends continuously on the probabilities . and each . and recall that 111. let Hi. let x(n) be the concatenation of the members of {0.(0(X 0 IX11) < H(A)-1.u. For example. E > 0 there is a k such that HA (X0I < H(p) €..u(C) is continuous for each cylinder set C.) as k — > oc. The process p. V k.(X01X: ) denote the conditional entropy of X0 relative to X=1. Theorem 1.u.u. Since there are only countably many 4. 1. for any sequence {.vik = E liu(a/1 at denotes the k-th order distributional (variational) distance.u (n) (4) = 2-k . relative to the metric D(g. y) where .. say.) and (1. is a probability measure. then lim sup.(X 0 1X:1) decreases to H(. it must be true that 1-1/. The weak topology is a compact topology. given.u (n) is the Markov process with transition matrix M= n [ 1 — 1/n 1/n 1/n 1 1 — 1/n j then p weakly to the (nonergodic) process p.) — v(a 101. Furthermore. Thus. (n ) has entropy 0. where X ° k is distributed according to p. One way to see this is to note that on the class of stationary Markov processes. If k is understood then I • I may be used in place of I • lk.u(n) is stationary.. . The weak limit of ergodic measures need not be ergodic.u(a k i +1 ). for every k and every 4..°1) } of probability measures and any all'.u (n ) converges weakly to p.). the sequence {. It is easy to check that the limit function p. 1}n in some order and let p. for each n. is stationary if each A (n ) is stationary. the limit p. The weak topology is Hausdorff since a measure is determined by its values on cylinder sets.9. BASIC CONCEPTS. The weak topology is a metric topology. Indeed. the usual diagonalization procedure produces a subsequence {n i } such that { 0 (a)} converges to. if . weak convergence coincides with convergence of the entries in the transition matrix and stationary vector. Entropy is weakly upper semicontinuous. for all k and all'.°1) (a)} is bounded. so that {.1 If p (n ) converges weakly to p. yet lim . which has entropy log 2. . that puts weight 1/2 on each of the two sequences (0. Proof For any stationary measure p. A second defect in the weak topology is that entropy is not weakly continuous. H(a) < H(p)..
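To illustrate the Markov example above numerically, here is a small sketch (not from the text) that computes the k-block distributions of the chain with transition matrix M and checks that their distributional distance to the two-point limit measure shrinks as n grows; the choice k = 5 is an illustrative assumption.

```python
from itertools import product

def markov_kblock(eps, k):
    # k-block distribution of the binary Markov chain with flip probability eps
    # (its stationary vector is (1/2, 1/2)).
    dist = {}
    for w in product((0, 1), repeat=k):
        p = 0.5
        for a, b in zip(w, w[1:]):
            p *= eps if a != b else 1 - eps
        dist[w] = p
    return dist

def limit_kblock(k):
    # the weak limit: mass 1/2 on the all-0 block and 1/2 on the all-1 block
    return {w: (0.5 if len(set(w)) == 1 else 0.0) for w in product((0, 1), repeat=k)}

k = 5
nu = limit_kblock(k)
for n in (10, 100, 1000):
    mu_n = markov_kblock(1.0 / n, k)
    dist = sum(abs(mu_n[w] - nu[w]) for w in mu_n)
    print(n, dist)     # the k-th order distributional distance shrinks as n grows
```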

d. y): x E T(11. y ) = — d(xi.i. and its converse. Theorem 1. PROCESS TOPOLOGIES.T (p)) is v-almost surely constant. so that x and y must . y) = lirn sup d(4 .i.)) = inf{d(x. Tx). Theorem 1. This proves Theorem 1.2 If p. that is. the measure p. v). . x) = d(T y. Note. is ergodic if and only if g(T(g)) = 1. hence f (y) is almost surely constant. Suppose 0 < p <q < 1/2. (n) ) < H(.) Example 1. processes. that is. let g (P) denote the binary i.) + 2E. y) is the limiting per-letter Hamming distance between x and y.(n)(X01X11) is always true.4.SECTION 1. y) is the limiting (upper) density of changes needed to convert the (infinite) sequence x into the (infinite) sequence y. I. and y are ergodic processes.i.u. T (p. This extends to the It measures the fraction of changes needed to convert 4 into pseudometric ci(x.b) = { 1 n i =1 a=b a 0 b.u (P) -typical sequence.9.T (p)) is invariant since ii(y.9.1. y) and is called the d-distance between the ergodic processes it and v. a(x. Let T(g) denote the set of frequency-typical sequences for the stationary process it. as defined.d.2. The sequence x contains a limiting p-fraction of l's while the sequence q contains a limiting q-fraction of l's. and let y be a g (q) -typical sequence. (The fact that d(g. A precise formulation of this idea and some equivalent versions are given in this subsection. then d( y. For 0 < p < 1. The d-distance between two ergodic processes is the minimal limiting (upper) density of changes needed to convert a typical sequence for one process into a typical sequence for the other. The (minimal) limiting (upper) density of changes needed to convert y into a g-typical sequence is given by d(y.9. LI since Ii(g (n ) ) < 1-11. The constant of the theorem is denoted by d(g..b The d-metric. The per-letter Hamming distance is defined by 1 x---N n O d„(4 . in particular. is actually a metric on the class of ergodic processes is a consequence of an alternative description to be given later. .) To illustrate the d-distance concept the distance between two binary i. let x be a . defined by d(x. the set of all sequences for which the limiting empirical measure is equal to p. that for v-almost every v-typical sequence y. But if this is true for some n. then H(g. Proof The function f (y) = 41(y . the d-distance between p.)}.3 (The d-distance for i. 89 for all sufficiently large n. y.9. where d(a. processes will be calculated.9.4. y) on A. Theorem 1. process such that p is the probability that X 1 = 1. and y is the limiting density of changes needed to convert y into a g-typical sequence.d.1.i)• Thus ci(x. By the typical-sequence theorem. yi).

Example 1. Since Z is also almost surely p (r ) -typical.[ p 1 — p 1—p [0 1 1 N= p 1' 1 where p is small. and about half the time the alternations are out of phase. In a later subsection it will be shown that this new process distance is the constant given by Theorem 1.. BASIC CONCEPTS.u (P ) -typical x. For any set S of natural numbers. that is. Let Z be an i. one .u (q ) -typical y such that ii(x. d-far-apart Markov chains. which produces only errors. d -distance This result is also a simple consequence of the more general definition of . hence the nonmixing process v cannot be the d-limit of any sequence of mixing processes. for kt converges weakly to y as p —> 0.2. so that ci(x . On the other hand.)typical. there is therefore at least one . and let r be the solution to the equation g = p(1 — r) ± (1 — p)r .a(a) = E X(a.9.9. on A x A that has . since about half the time the alternation in x is in phase with the alternation in y. vn ) between the corresponding n . see Exercise 5.(x. and hence ii(A (P) . Ty) — 1/2. v(b) = E X(a. a(x.u (r) -typical z such that y = x Ef) z is .) As another example of the d-distance idea it will be shown that the d-distance between two binary Markov chains whose transition matrices are close can nevertheless be close to 1/2.) These results show that the d-topology is not the same as the weak topology. see Exercise 4. p (q ) ) = g — p. This new process distance will also be denoted by d(A. disagree in at least a limiting (g — p)-fraction of places.90 CHAPTER I. Fix a .1 The joining definition of d.th order distributions will be developed. The y-measure is concentrated on the alternating sequence y = 101010. b). where e is addition modulo 2. y) — 1/2. that is. First each of the two measures. and hence. y) is exactly 1/2. for .2 serves as a useful guide to intuition. A joining of two probability measures A and y on a finite set A is a measure A. Likewise. random sequence with the distribution of the Afr ) -process and let Y = x ED Z. separated by an extra 0 or 1 about every 1050 places.9.u (q ) -typical. The definition of d-distance given by Theorem 1.u. v). but it is not always easy to use in proofs and it doesn't apply to nonergodic istance as a processes. Let p. y) = g — p. (1) . is represented by a partition of the unit square into horizontal . 7" (v)) — 1/2. b a A simple geometric model is useful in thinking about the joining concept. which produces no errors.i. for almost any Z. Thus d. ( A key concept used in the definition of the 4-metric is the "join" of two measures. a(x. 1. b).9. (q ) -typical y such that cl(x.b..u-almost all x.4 ( Weakly close.u. and y be the binary Markov chains with the respective transition matrices M. Thus it is enough to find. To remedy these defects an alternative formulation of the d-d limit of a distance dn un . say p = 10-50 . to be given later in this section. y) = g — p. Later it will be shown that the class of mixing processes is a-closed.d. Thus there is indeed at least one . y) > g — p. the values {Zs : s E SI are independent of the values {Zs : s 0 S}. for each p9' ) -typical x.u and y as marginals. kt and y. and its shift Ty. a typical /L-sequence x has blocks of alternating O's and l's. (With a little more effort it can be shown that d(. Y is .

b) R(2c) R(2. 0 R(0. (A space is nonatomic if any subset of positive measure contains subsets of any smaller measure.u(a) = Eb X(a.) A representation of p.a) R(0. 6) means that.a) a R(1.a) R(0. PROCESS TOPOLOGIES. since expectation is continuous with respect to the distributional distance IX — 5. where p.9. relative to which Jn(Itt.c) Figure 1. yti')) = cin v).1 (c)). .9.c) R(2.) In other words. Of course. a) to (A. from (Z. The joining condition .(a. see Exercise 2. one can also think of cutting up the y-rectangles and reassembling to give the A-rectangles. b).c) R(1. The 4-metric is defined by cin(ti. y. (See Figure 1. that is. where ex denotes expectation with respect to X. and v are probability measures on An . A measure X on A n x A n is said to realize v) if it is a joining of A and y such that E Vd n (x7. a) is just a measure-preserving mapping ct.b) has area X(a. with the width of a strip equal to the measure of the symbol it represents. A). a measurable mapping such that a(0 -1 (a)) = p(a). 91 strips. The minimum is attained. ai ) a (q5-1 (ai ) n ( . b): a E A} is exactly the y-measure of the rectangle corresponding to b. Turning now to the definition of y).5.SECTION 1. hence is a metric on the class P(Afl) of measures on A".a) R(1b) R(1. A joining X can then be represented as a pair of measure-preserving mappings.a) R(2. and ik from (Z. means that the total mass of the rectangles {R(a. V) = min Ejdn(4 . y).b):b E A} such that R(a.b) R(2a) 1 2 R(2. for each a. The second joining condition. The function 4(A. such that X is given by the formula (2) X(a i . v(b) = Ea X.5 Joining as reassembly of subrectangles. y) satisfies the triangle inequality and is strictly positive if p. y) denote the set of joinings of p.9. A). a joining is just a rule for cutting each A-rectangle into subrectangles and reassembling them to obtain the y-rectangles.u-rectangle corresponding to a can be partitioned into horizontal rectangles {R(a. (/) from (Z.1In. for a finite distribution can always be represented as a partition of any nonatomic probability space into sets whose masses are given by the distribution. on a nonatomic probability space (Z. the use of the unit square and rectangles is for simplicity only. 6). let J(11.b) R(0. a E A. a) to (A. and v. Also. the . a) to (A. V) is a compact subset of the space 7'(A x A n ).

is a metric on and hence d defines a topology on P(A"). In the stationary case. then. a consequence of the definition of dn as an minimum.) If p. in turn. which is. since d. y). both the limit and the supremum.92 CHAPTER I. implies that p = v. which. and y as marginals. LI A consequence of the preceding result is that the d-pseudometric is actually a metric on the class P(A) of stationary measures. un).. in turn. Exercise 2. yl)). in turn. so the preceding inequality takes the form nd + mdm < n + m)dn+m. un ) must be 0. The set of all stationary joinings of p and y will be denoted by Js(p. q+m). The d-distance between stationary processes is defined as a limit of n-th order dndistance.6 If i and v are stationary then C1 (1-1 .co Note that d(p. since in the stationary case the limit superior is a supremum.9. for all n. that is.. (3) n vilz)± m (p. y) is a metric. in fact.vn)• Proof The proof depends on a superadditivity inequality.(B x v(B) = X(A" x B). In the stationary case. if d(p. BASIC CONCEPTS. It is useful to know that the d-distance can be defined directly in terms of stationary joining measures on the product space A" x A". dn (p. since iin(A.9. and v are stationary then (4) cl(p. where pi denotes the measure induced by p.n (1t7+m .6. d(it. The d-distance between two processes is defined for the class P(A) of probability measures on A" by passing to the limit. The proof of this is left to Exercise 3.:tr. ( which implies that the limsup is in fact both a limit and the supremum. Indeed.. on the set AL. n—). which is. defined as a minimization of expected per-letter Hamming distance over joinings. This establishes Theorem 1.n . the limit superior is. vn) = limiin(An.9. p: + 41 = pr.7 (Stationary joinings and d-distance. . A joining A of p and v is just a measure on A x A" with p. B E E. v) = SUP dn (An. namely. v'nV) < ( n + m) di±. Y) = lirn sup an(An. which implies that A„ = vn for all n. y) is a pseudometric on the class P(A). v) = 0 and p and y are stationary.. Theorem 1. Theorem 1. that is. v) = min ex(d(xi. p(B) = A.

implies that lim. for each n choose A n E Jn(tin.1.7 be projection of A n onto its i-th coordinate. This. together with the averaging formula. Yi)) = litridn (tt. for it depends only on the first order distribution. . B x B) = . vn) = ex(c1. y) then stationarity implies that ex(d(xi. PROCESS TOPOLOGIES.. V). (a) (6) (b) (c) n-+co lim 5:(n)(B x lim 3. for any cylinder set C. for each n. B E E E. yi ) = lima Cin (tt. y) of stationary joinings is weakly closed and hence compact. For each n.. Towards this end. To prove (c). As noted in Exercise 8 of Section 1.(4 . v). yl)) = ex n (dn (4.. va).7 (a. The details are given in the following paragraphs. (") (C x = A(C). which proves the first of the following equalities. Yi)) is weakly continuous. 93 Proof The existence of the minimum follows from the fact that the set . that is. the measure on A x A defined by A.(n) is given by the formula (5) i "(n) (B) = — n n 1=1 where A. This is done by forming. for if A E Js(. The proof of (b) is similar to the proof of (a). y). the concatenatedblock process defined by A n . the measure 7. (5). The goal is to construct a stationary measure A E y) from the collection {A}. yr)) = K-1 ( k=0 H (t v knn -l"n kk"kn-I-1 . let A.The definition of A (n) implies that if 0 < i < n — j then x A c° )) = Xn(T -` [4] x A) = ting—i [xn) = since An has A n as marginal and it is shift invariant. such that ex (d(Xi .01) denote the concatenated-block process defined by An. (n) is the measure on (A x A )° defined by A((4. It is also easy to see that d cannot exceed the minimum of Ex (d(x l . V) = Ci(p. / . let ).u(B).k (d(x i .1s(p. va). y rit )).SECTION 1. E. 0<r <n.(n) (A = bt(B).9. if N = Kn+r. and the latter dominates an (IL. such that an(An.Î. b) = E (uav. and from the fact that e . u'bvi). then taking a weakly convergent subsequence to obtain the desired limit process À.kn-Fn Ykn+1 // t( y Kn+r "n kk""Kn-1-1 )1(n+1 . yi)). It takes a bit more effort to show that d cannot be less than the right-hand side in (4). since A n belongs to -In(un. lim e-)(d(xi.

8 (The empirical-joinings lemma. b1 ))..) The measures 171. { } 0< (n — k + 2) p(414) — (n k— E p(a'nfi!) <1. u) = e). pk ((alic . I —ICX) j—>00 lim p((a .94 and note that CHAPTER L BASIC CONCEPTS. 1x). pk (af 14) and pk (biny). The pair (x. y) = lim dni ((x in i . is obtained by thinking of the pair (4 . are obtained by thinking of 4 and Ai as separate sequences. and plc ( ly7). A diagonalization argument. The limit results that will be needed are summarized in the following lemma.(. To complete the proof select a convergent subsequence of 1-(n) . y)). Lemma 1. The limit X must be stationary. that (7) dn (x. yril) defines three empirical k-block distributions. yril)). if necessary. A simple. y) = EA. Yi)X7 (xi . y) = Eî(d(ai . Passing to the limit it follows that . bk i )1(x n i1. y). E d (xi .„ (dn(4 )11 ))• Ei x7 (a. b). One difficulty is that limits may not exist for a given pair (x. b) = (1/n) 1 n x./ J such that each of the limits (8) lim p(afix). ( .2 Empirical distributions and joinings. and 3. but important fact. since cin(A.p) (d(a. Condition (6c) guarantees that ii(tt. and): are all stationary. yields an increasing sequence fn . however.b.1). y7)) = since 3:(n) (a. By dropping to a further subsequence. Fix such a n1 and denote the Kolmogorov measures exists for all k and all al` and defined by the limits in (8) as TI.9. yi) = e7. yn also exists. ) ) is a joining of pk (. and the third. Thus X E Js(h. Two of these. )). ex n (dn(x tiz .y. ir(x . and X' is a joining of g. is that plc ( 1(4 . = (n — k 1)p(aii` I x7) counts the number of times af appears in x7.j'e ly). yi)). "i. y.9. it can be assumed that the limit i i )) . al since the only difference between the two counts is that a' 2` might be the first k — 1 terms of 4. b)). respectively. I. for example. y). and is a joining of h and y. Note also. Proof Note that Further- M. hence is not counted in the second term. yri') = E p . each side of E(n — k 1)p((af. by conditions (6a) and (6b). 14)1(4 . since. y) as a single sequence.(d(xi.. 07 . and more.(.0)(d (xi . Limit versions of these finite ideas as n oc will be needed. 1(x. Yi)). b 101(x7. This proves (c).

)) .2. y) < Ex„(x .„( . yi))• Proof Let A be a stationary joining of g and y for which c/(u. ) (d(xi.u and y such that Au. Li Next it will be shown that the d-distance defined as a minimization of expected perletter Hamming distance over joinings. y i )). it follows that X E ). But. by the empirical-joinings lemma.y) (d(x i . Furthermore.9.(.u and y are ergodic then there is an ergodic joining A of .) must be a stationary joining of g and y.7(x.9. PROCESS TOPOLOGIES. y E T (v).(x. however. since d(xi.are stationary.. and (9) A= f A(x ) da)(71. by definition of X. Since. y).9. 95 is stationary. The integral expression (9) implies. means that a(i.„. y) E A" x A'. y). then X 7r(x. Likewise.(d(xi let yi)). À-almost surely. In this expression it is the projection onto frequency-equivalence classes of pairs of sequences.9. y) for which x is g-typical and y is v-typical. (10) d(u. y)). y)) be the ergodic decomposition of A. The joining result follows from the 0 corresponding finite result. (x. y) = e). Theorem 1. and hence d(u. y). y) < E). But if this holds for a given pair.y) (d(xi.7.u and v. is the empirical measure defined by (x. it is concentrated on those pairs (x. is..) If .SECTION 1. 8 . co is a probability measure on the set of such equivalence classes. while the d-equality is a consequence of (7). in the ergodic case. Theorem 1. the pair (x.. and A.10.9. 371)). a strengthening of the stationary-joining theorem.t.9. (. combined with (10). Theorem 1. since it was assumed that dCu. for À-almost every pair (x. as given by the ergodic decomposition theorem. y) E (A.9. y) = ex(d(xi .9 (Ergodic joinings and a-distance. = ex(d(xi. y) is ergodic for almost every (x. Theorem 1. The empirical-joinings lemma can be used in conjunction with the ergodic decomposition to show that the d-distance between ergodic processes is actually realized by an ergodic joining. Lemma 1. yi)).)-typical. A(x) is ergodic and is a joining of . . the same as the constant given by Theorem 1. .. A-almost surely. y) is 4( X .4. that x(d(xi yi)) = f yl)) da)(7(x. This. Thus.(x. since À is a stationary joining of p and y.y)) for A-almost every pair (x.9. y) = EA. yl) depends only on the first order distribution and expectation is linear.7. yl )). "if and 1. This establishes Theorem 1.

y) E T(A) x T (v) d(p. random variable notation will be used in stating d-distance results.9.96 CHAPTER I. (a) cln (4 . y 1 ') converges to a limit d(x. For convenience. together with the almost sure limit result.u and v are ergodic then d(. Since the latter is equal to 2( x. The lower bound result. I.n . or by x(jr) when n is understood. Let x be ti-typical and y be v-typical. as j lemma implies that Î is stationary and has p. y). blf)1(x n oc. v). .(a. In particular. and choose an increasing sequence {n 1 } such that )) converges to a limit A. for example. Here is why this is so. BASIC CONCEPTS. and h1 . the proof of (11) is finished.10. v) d(x . the following two properties hold. Next it will be shown that there is a subset X c A of v-measure 1 such that (12) d(y. y). v-almost surely.. there is a set X c A' such that v(X) = 1 and such that for y E X there is at least one x E T(A) for which both (a) and (b) hold. will be denoted by x n (jr). This establishes the desired result (12). and 2(x. As earlier. also by the empirical-joinings lemma. if p.9.9 implies that there is an ergodic joining for v).T (p)) = d(p. The ergodic theorem guarantees that for A-almost all which e(d(x i .') = — n EÀ(d(xi. is the distribution of X and v the to indicate the random vector distribution of Yf'. V y E X. .. and v as marginals.10 (Typical sequences and d-distance. and hence d(p.v). Theorem 1.Y1))=a(u. y). yields LIJ the desired result. y) E TOO Ç T(i) x T (v). y i )) = pairs (x. conditioned on the value r of some auxiliary random variable R. v). Y1)). i=1 (b) (x. T (A)). (11). The vector obtained from xri' be deleting all but the terms with indices j i < j2 < < j. The empirical-joinings so that dnj (x l i . y). Another convenient notation is X' . y'.u.b. (12).9. 17) stands for an (A.(d(Xi.Theorem 1. the distributional (or variational) distance between n-th order distributions is given by Ii — vin = Elp. (a ) — v(a7)I and a measure is said to be TN -invariant or N-stationary if it is invariant under the N-th power of the shift T. Theorem 1. y). bt) for each k and each al' and bt.3 Properties and interpretations of the d-distance. Proof First it will be shown that (11) (x. v) E-i. p((at. v) = d(y.. that is.) If . y) < d(x .9.

Ea ga. Furthermore. y o 7-1 ). y( j. b)) > 1 . with equality when n = 1.m. Y (if n )) n . r P(R = r)d(r Proof A direct calculation shows that IL .(d„(xr il . . . (b) d.u and v are T N -invariant. .2 Emin(u(a7).1A . 97 Lemma 1.Via . in _m } denote the natural ordering of the complement {1. dn(A.9. Li } and for each ( x (jr ) . n } . < md„. nE). v) = lim cin (. y) satisfies the triangle inequality and is positive if p. 14): a7 b7 } ) = 1 .y(j"))(X Y(iin—m )) be a joining of the conditional random variables. unless a = b. Moreover. (d) If .. 2. The function defined by x*(q.SECTION 1. i2. y(jr)) n . and mdm (x(r). there is always at least one joining X* of i and y for which X(4.m."1 )). The fact that cln (A. Yf' gives a joining of X(ffn).vi n . (e) d(u. since d(a. y). let {6. y) (dn (di% b7)) < . for any finite random variable R. (f) If the blocks {X i ort on+1 } and Olin 1) . Ex(d(a.) (a) dn v) _< (1/2)Ip.a7 ) .u . see Exercise 8.(XF n Yr n ) = dn(X i ort1)n+1 . X(i in-m )/x( jim) and Y(i n i -m )/y(jr).9. YU)) yr.. together with the fact that the space of measures is compact and complete relative to the metric ii .11 (Properties of the d-distance. a). v(a)).Y(j im). let yr. y) d(p.E ga7. o T'. yri') = x(x(r). X(x(jr).(ir»(xcili -m). b) = 1.(X (j r).. yuln))x(x(q). = min(A(a7). . IT) < illr.y(jr)) 5_ ndn (x 11 . )711')) < mEVd m (x(iln ).v In . is a joining of X and Yln which extends X. The left-hand inequality in (c) follows from the fact that summing out in a joining of X. y is left as an exercise. is a complete metric on the space of measures on An. v(a)). then mcl. for any joining X of bt and v. yoin -m». and hence. (g) ("in(X7. 14)) X({(a7 . then ci(u. . The right-hand inequality in (c) is a consequence of the fact that any joining X of X(r) and Y(ffn) can be extended to a joining of X and To establish this extension property.Yin = 2. PROCESS TOPOLOGIES. and that Ex (dn (a7 . (c) (X (ir). 2 1 In the case when n = 1. Completeness follows from (a)..YD. This completes the proof of property (a). ±1 } are independent of each other and of past blocks.17(iin—on+i).

12 [d:Gt. by the assumed independence of the blocks. dn A. The preceding lemma is useful in both directions.tt1)n+1)) i • j=1 The reverse inequality follows from the superadditivity property. . is obtained by replacing expected values by probabilities of large deviations.(d n (f.`n )) < 1. choose a joining Ai of {Xiiin j. such that Exj (dn (X17_ 1)n±l' Yii.11. Minimizing over A yields the right-hand inequality since dn . It is one form of a metric commonly known as the Prohorov metric.4. Yintn ) Edna (.(. with bounds given in the following lemma.(dn (4 . versions of each are stated as the following two lemmas.(xi' y) ct (a) 20/. y7)EA. y) < E. v)] 2 5_ 4 (A. y(ijn-on+1)• The product A of the A is a joining of Xr and Yr. equivalent to the 4-metric. Let A n (c) = dn (xiz .in-on+1) . yDA.. and define (13) *(/.-1)n+1 . Property (d) for the case N = 1 was proved earlier. The Markov inequality implies that X(A n ( N/Fe)) > 1 — NATt. y) 5_ 24(u. Yilj. y(i n in property (c). (2). Part (b) of each version is based on the mapping interpretation of joinings.. i .u. This completes the proof of the lemma. Another metric. and hence du. Proof Let A be a joining of 1. yiii)) < E (x7./.. v). The proof of Property (g) is left to the reader. yriz)) < 2a.n (x(irm ). for each To prove property (f). } and {Y(.1)n-1-1 . 1)n-1-1 } .2-1)n+1 )) = i (.0 The function dn * ( u. y) is a metric on the space P(An) of probability measures on A.. and hence property (f) is established. then EÀ(dn (4. if A E y) is such that X(An(a)) > 1 — a and d:(.E.t and y such that dn (A. j=1 since mnEx (c1. )1)) = a. 01.9.98 CHAPTER I. the extension to N > 1 is straightforward. This completes the discussion of Lemma 1.(xrn ./2-1)n+1' y(i. Property (e) follows from (c).9. yrn)) = E nE j (dn (x . y) Likewise. (3). y) = ct. BASIC CONCEPTS. y) = E).u. and hence cl. and hence nz Manin (X7 m . and is uniformly equivalent to the 4-metric.(A n (c)) > 1 — cl. dn(xii . Lemma 1. y) < cl. y) = min minfc > 0: A.

y) < e 2 and . (b) If 4 and are measure-preserving mappings from a nonatomic probability space (Z.b. there are measurepreserving mappings (/) and from (Z. y) < 2e. where the extra factor of k on the last term is to account for the fact that a j for which yi 0 x i can produce as many as k indices i for which y :+ k -1 _ — ak 1 and X tj. Lemma 1. let [B]s = {y7: dn(y ri' .4 Ergodicity. This follows from the fact that if v is ergodic and close to i.(4 . then a small blowup of the set of vtypical sequences must have large ti-measure. PROCESS TOPOLOGIES.) and (A". respectively. almost surely. p. and d limits. for 3 > 0. . As before.) > 1 — 2e. v). then iin (t.un (B) > 1 — c.) The a-limit of ergodic processes is ergodic.14.9. except for a set of a-measure at most c. y) < c2.9. and small blowups don't change empirical frequencies very much. . By the ergodic theorem the relative frequency p(eil`ix) converges almost surely to a limit p(anx). and hence (14) (n — k + 1)p(41). then Y n ([B]. then cin (it. which is just Lemma 1.9. y) < c.2 it is enough to show that p(aii`lx) is constant in x. v).11 ) < (n — k + 1)p(a in. By Theorem 1. 11 ) 3 I p(44) — p(any)1 < c. entropy.u) and (An.13 99 (a) If there is a joining À of it and u such that d„(4 . B) 3).*(z)) > el) 5_ E. respectively. mixing.9. denote the 3-neighborhood or 3-blow-up of B c A".SECTION 1. where [ME = dn(Yç z B) EL (b) If cin (i.-measure at most c. except for a set of A. I. Theorem 1.14 (a) If anCu. a) into (A'. )1). for every a.9. Lemma 1.15 (Ergodicity and d-limits. and (Z. To show that small blowups don't change frequencies very much. and N1 so that if n > N1 then (15) d( x. then yi 0 x i for at least one j E [i.u is the d-limit of ergodic processes. such that a(lz: dn((z). The key to these and many other limit results is that do -closeness implies that sets of large measure for one distribution must have large blowup with respect to the other distribution. fix k and note that if )7 4-k---1 = all' and 4 +k-1 0 al'. i + k — 1].t. y) < 2E.q) + knd. a) into (An. a) is a nonatomic probability space. Proof Suppose . such that dn ( ( Z) (z)) < E.4. In this section it will be shown that the d-limit of ergodic (mixing) processes is ergodic (mixing) and that entropy is d-continuous.t.9. ±1c-1 0 ak 1' Let c be a positive number and use (14) to choose a positive number S < c.

let y be a mixing process such that d(A . Al. vn ) < 82. y) < 3 2 . (Such a 3 exists by the blowup-bound lemma. Suppose A is the a-limit of mixing processes. y) < 3 2 and hence an (11.-measure. choose Tn c An such that ITn I _ 1 as n co. and y such that Evdi (xi . for n > N I .) Entropy is a-continuous on the class of ergodic processes. and let A. for all n > N. 8 then there is a sequence i E Bn such that ] 14) — P(a lic l51)1 < E. <8 and hence I 19 (alic 1-741 ) IY7)1 < 6 by (15). ] 5 ) > 1 — 2 8 . the set an . 8. for each < 2i1(h+E) and p(T) n > N. and.')) < 82. Given E > 0. fix all and VII. Lemma 1.14 to obtain A n ([13. since for stationary measures d is the supremum of the there is an N2 > Ni such that if n > N2. Lemma 1.16.7. Given 8 > 0.9.9.n . yields h(v) < h(p) + 2E. and y are d-close then a small blowup does not increase the set of p. Theorem 1.) The d-limit of mixing processes is mixing. D Theorem 1. This completes the proof of Theorem 1. each of positive p.) Choose > N such that p(T) > 1 — 3. Let pc be an ergodic process with entropy h = h(i). the finite sequence characterization of ergodicity. y. Let y be an ergodic measure such that cl(A. Yi)) = Ex(cin (4. Note. The definition of Bn and the two preceding inequalities yield — p(451)1 < 4E.u) < h (v) 2E.7.14 implies that y([7-. y) < 8 2 . for sufficiently large n. Theorem 1. via finite-coder approximation. BASIC CONCEPTS. implies that A is ergodic.5. so the covering-exponent interpretation of entropy. 57 11 E [B. .]6 ) > 1 — 23. that if y E [A]s then there is a sequence x E Bn such that dn (fi! .16 (Entropy and d-limits.9. n > Because y is ergodic N1 .100 CHAPTER I.17 (Mixing and a-limits. Proof The proof is similar to the proof. see Example 19. be a stationary joining of p.1' E [B.9.9. however. Bn = fx ril : I P(414) — v(4)1 < has vn measure at least 1-6. ] Since [B] large An -measure.4. Theorem 1. Likewise if 5.15.5. Fix n > N2 and apply Lemma 1. Mixing is also preserved in the passage to d-limits. Likewise. Let y be an ergodic process such that a(A.9. completing the proof of Theorem 1. Proof Now the idea is that if p. that stationary coding preserves mixing. v gives measure at least I — 3 to a subset of An of cardinality at most 2n (n( v)+E ) and hence El h(. Let 8 be a positive number such that un bi for all n > N. and has large v-measure.-entropy-typical sequences by very much.4. and fix E > 0. choose N.

SECTION 1.9. PROCESS TOPOLOGIES.

101

This implies that d„(x': , y7) < 8 except for a set of A measure at most 8, so that if has A-measure at most 8. Thus by making xt; < 1/n then the set {(x'il , even smaller it can be supposed that A(a7) is so close to v(a) and p.(1,11') so close to v(b?) that liu(dil )[1(b7) — v(a'12 )v(b)I Likewise, for any m > 1,

- iz X({(xr +n , y'1 ): 4 0 yrii or x: V X({(xi, yi): xi 0 4 } )

y: .- .;1 })
m+n

X({(x,m n F-7,

xm _11
,n

ym m:++:rill)

28,

so that if 8 <112n is small enough, then

1,u([4] n

T - m[bn) — vaan

n T - m[b7D1

6/3,

for all m > 1. Since u is mixing, however,

Iv(a7)v(b7) — vgan

n T - m[b7])1

6/3,

for all sufficiently large m, and hence, the triangle inequality yields

— ii([4]

n T - m[brill)1

6

,

for all sufficiently large m, provided only that 6 is small enough. Thus p, must be mixing, which establishes the theorem. This section will be closed with an example, which shows, in particular, that the

ii-topology is nonseparable.
Example 1.9.18 (Rotation processes are generally d-far-apart.) Let S and T be the transformations of X = [0, 1) defined, respectively, by Sx = x

a, T x = x

fi,

where, as before,

e denotes

addition modulo 1.

Proposition 1.9.19
Let A be an (S x T)-invariant measure on X x X which has Lebesgue measure as marginal on each factor If a and fi are rationally independent then A must be the product measure. Proof "Rationally independent" means that ka ± mfi is irrational for any two rationals k and m with (k, m) (0, 0). Let C and D be measurable subsets of X. The goal is to • show that A.(C x D) =
It is enough to prove this when C and D are intervals and p.(C) = 1/N, where N is an integer. Given E > 0, let C 1 be a subinterval of C of length (1 — E)1N and let

E = C i x D, F = X x D.

102

CHAPTER I. BASIC CONCEPTS.

Since a and fi are rationally independent, the two-dimensional version of Kronecker's theorem, Proposition 1.2.16, can be applied, yielding integers m 1 , m2, , rnN such that if V denotes the transformation S x T then Vms E and X(FAÉ) < 2E, where F' = U 1 Vrn1E .
E -±
n vrni

E = 0, if i j,

It follows that X(E) = X(P)/N is within 26/N of A.(F)/ N = 1,t(C)A(D). Let obtain X(C x D) = This completes the proof of Proposition 1.9.19.

0 to

El

Now let P be the partition of the unit interval that consists of the two intervals Pc, = [0, 1/2), Pi = [1/2, 1). It is easy to see that the mapping that carries x into its (T, 2)-name {x} is an invertible mapping of the unit interval onto the space A z the Kolmogorov measure. This fact, together withwhicaresLbgumont Proposition 1.9.19, implies that the only joining of the (T, 2)-process and the (S, 2)process is the product joining, and this, in turn, implies that the d-distance between these two processes is 1/2. This shows in particular that the class of ergodic processes is not separable, for, in fact, even the translation (rotation) subclass is not separable. It can be shown that the class of all processes that are stationary codings of i.i.d. processes is d-separable, see Exercise 3 in Section W.2.

I.9.c

Exercises.

1. A measure ,u E P(X) is extremal if it cannot be expressed in the form p = Ai 0 ia2. (a) Show that if p is ergodic then p, is extremal. (Hint: if a = ta1 -F(1 — t)A2, apply the Radon-Nikodym theorem to obtain gi (B) = fB fi diu and show that each fi is T-invariant.) (b) Show that if p is extremal, then it must be ergodic. (Hint: if T -1 B = B then p, is a convex sum of the T-invariant conditional measures p(.iB) and peiX — B).) 2. Show that l') is a complete metric by showing that

(a) The triangle inequality holds. (Hint: if X joins X and /7, and X* joins 17 and Zri', then E,T yi)A*(z7ly'll) joins X and 4.) (b) If 24(X?, ri) = 0 then Xi; and 1111 have the same distribution. (c) The metric d(X7, 17) is complete. 3. Prove the superadditivity inequality (3). 4. Let p, and y be the binary Markov chains with the respective transition matrices
m [

P —p

1—p p 1'

N [0 1 1 o]'

1

Let fi be the Markov process defined by M2.

SECTION 1.10. CUTTING AND STACKING.

103

v n = X2n, n = 1, 2, ..., then, almost (a) Show that if x is typical for it, and v surely, y is typical for rt.
(b) Use the result of part (a) to show d(,u, y) = 1/2, if 0 < p < 1. 5. Use Lemma 1.9.11(f) to show that if it. and y are i.i.d. then d(P., y) = (1/2)1A — v I i. (This is a different method for obtaining the d-distance for i.i.d. processes than the one outlined in Example 1.9.3.) 6. Suppose y is ergodic and rt, is the concatenated-block process defined by A n on A n . (p ii ) (Hint: g is concentrated on shifts of sequences Show that d(v ,g) = --n.n, ,.--n,• that are typical for the product measure on (An )° ° defined by lin .)

a.

7. Prove property (d) of Lemma 1.9.11.

* (a7 , a7) = min(p,(a7), v(a)). 8. Show there is a joining A.* of ,u and y such that Xn
9. Prove that ikt — v in = 2 — 2 E a7 min(i.t(a7), v(a)). 10. Two sets C, D C A k are a-separated if C n [D], = 0. Show that if the supports of 121 and yk are a-separated then dk(Ilk.vk) > a. 11. Suppose ,u, and y are ergodic and d(i.t, for sufficiently large n there are subsets v(Dn ) > 1 — E, and dn (x7, y7) > a 1`) ' Ak and Pk('IY1 c ) ' Vk, and k Pk('IX 1 much smaller than a, by (7)) y) > a. Show that if E > 0 is given, then Cn and Dn of A n such that /L(C) > 1 —e, — e, for x7 E Cn , yli' E D. (Hint: if is large enough, then d(4 , y ) cannot be

12. Let (Y, y) be a nonatomic probability space and suppose *: Y 1-4- A n is a measurable mapping such that dn (un , y o * -1 ) < S. Show that there is a measurable mapping q5: y H A n such that A n = V curl and such that Ev(dn(O(Y), fr(y))) <8.

Section 1.10 Cutting and stacking.
Concatenated-block processes and regenerative processes are examples of processes with block structure. Sample paths are concatenations of members of some fixed collection S of blocks, that is, finite-length sequences, with both the initial tail of a block and subsequent block probabilities governed by a product measure it* on S' . The assumption that te is a product measure is not really necessary, for any stationary measure A* on S" leads to a stationary measure p, on Aw whose sample paths are infinite concatenations of members of S, provided only that expected block length is finite. Indeed, the measure tt is just the measure given by the tower construction, with base S", measure IL*, and transformation given by the shift on S°°, see Section 1.2.c.2. It is often easier to construct counterexamples by thinking directly in terms of block structures, first constructing finite blocks that have some approximate form of the final desired property, then concatenating these blocks in some way to obtain longer blocks in which the property is approximated, continuing in this way to obtain the final process as a suitable limit of finite blocks. A powerful method for organizing such constructions will be presented in this section. The method is called "cutting and stacking," a name suggested by the geometric idea used to go from one stage to the next in the construction.

104

CHAPTER I. BASIC CONCEPTS.

Before going into the details of cutting and stacking, it will be shown how a stationary measure ,u* on 8" gives rise to a sequence of pictures, called columnar representations, and how these, in turn, lead to a description of the measure on A'.

I.10.a

The columnar representations.

Fix a set 8 c A* of finite length sequences drawn from a finite alphabet A. The members of the initial set S c A* are called words or first-order blocks. The length of a word w E S is denoted by t(w). The members w7 of the product space 8" are called nth order blocks. The length gel') of an n-th order block wrii satisfies £(w) = E, t vi ). The symbol w7 has two interpretations, one as the n-tuple w1, w2, , w,, in 8", the other as the concatenation w 1 w2 • • • to,„ in AL, where L = E, t(wi ). The context usually makes clear which interpretation is in use. The space S" consists of the infinite sequences iv?"' = w1, w2, ..., where each Wi E S. A (Borel) probability measure 1.1* on S" which is invariant under the shift on S' will be called a block-structure measure if it satisfies the finite expected-length condition,
(

E(t(w)) =

Ef (w),uT(w ) <
WES

onto 8, while, in general, Here 14 denotes the projection of of ,u* onto 8'. Note, by the way, that stationary gives E v( w7 ))

denotes the projection

. E t (w ) ,u.: (w7 ) . nE(t(w)).
to7Esn

Blocks are to be concatenated to form A-valued sequences, hence it is important to have a distribution that takes into account the length of each block. This is the probability measure Â, on S defined by the formula
A( w) —

w)te (w)

E(fi

,WE

8,

where E(ti) denotes the expected length of first-order blocks. The formula indeed defines a probability distribution, since summing t(w),u*(w) over w yields the expected block length E(t 1 ). The measure is called the linear mass distribution of words since, in the case when p,* is ergodic, X(w) is the limiting fraction of the length of a typical concatenation w1 w2 ... occupied by the word w. Indeed, using f (w1w7) to denote the number of times w appears in w7 E S", the fraction of the length of w7 occupied by w is given by

t(w)f (wIw7)
t(w7)

= t(w)

f(w1w7) n
n f(w7)

t(w)11,* (w)
E(t 1 )

= A.(w), a. s.,

since f (w1w7)1 n ---÷ ,e(w) and t(will)In E(t1), almost surely, by the ergodic theorem applied to lu*. The ratio r(w) = kt * (w)/E(ti) is called the width or thickness of w. Note that A(w) = t(w)r(w), that is, linear mass = length x width.

SECTION 1.10. CUTTING AND STACKING.

105

The unit interval can be partitioned into subintervals indexed by 21) E S such that length of the subinterval assigned to w is X(w). Thus no harm comes from thinking of X as Lebesgue measure on the unit interval. A more useful representation is obtained by subdividing the interval that corresponds to w into t(w) subintervals of width r (w), labeling the i-th subinterval with the i-th term of w, then stacking these subintervals into a column, called the column associated with w. This is called the first - order columnar representation of (5 00 , ,u*). Figure 1.10.1 shows the first-order columnar representation of S = {v, w}, where y = 01, w = 011, AIM = 1/3, and /4(w) = 2/3. Ti x

1

X V w

o

Figure 1.10.1 The first-order columnar representation of (S, AT). In the columnar representation, shifting along a word corresponds to moving intervals one level upwards. This upward movement can be accomplished by a point mapping, namely, the mapping T1 that moves each point one level upwards. This mapping is not defined on the top level of each column, but it is a Lebesgue measure-preserving map from its domain to its range, since a level is mapped linearly onto the next level, an interval of the same width. (This is also shown in Figure I.10.1.) In summary, the columnar representation not only carries full information about the distribution A.(w), or alternatively, /4(w) = p,*(w), but shifting along a block can be represented as the transformation that moves points one level upwards, a transformation which is Lebesgue measure preserving on its domain. The first-order columnar representation is determined by S and the first-order distribution AI (modulo, of course, the fact that there are many ways to partition the unit interval into subintervals and stack them into columns of the correct sizes.) Conversely, the columnar representation determines S and Al% The information about the final process is only partial since the picture gives no information about how to get from the top of a column to the base of another column, in other words, it does not tell how the blocks are to be concatenated. The first-order columnar representation is, of course, closely related to the tower representation discussed in Section I.2.c.2, the difference being that now the emphasis is on the width distribution and the partially defined transformation T that moves points upwards. Information about how first-order blocks are concatenated to form second-order blocks is given by the columnar representation of the second-order distribution ,q(w?). Let r (w?) = p(w)/E(2) be the width of w?, where E(t2) is the expected length of w?, with respect to ,4, and let X(w?) = t(w?)r(w?) be the second-order linear mass. The second-order blocks w? are represented as columns of disjoint subintervals of the unit interval of width r (14) and height aw?), with the i-th subinterval labeled by the i-th term of the concatenation tv?. A key observation, which gives the name "cutting and stacking" to the whole procedure that this discussion is leading towards, is that

2 " (w)r(w). as claimed. for the total width of the second-order columns is 1/E(t2)...2 shows how the second-order columnar representation of S = {v. the total mass contributed by the top t(w) levels of all the columns that end with w is Et(w)r(w1w)= Ea(lV t2)) WI Pf(W1W) 2E(ti) =t(w* *(w) Thus half the mass of a first-order column goes to the top parts and half to the bottom parts of second-order columns._ C i D *********** H E ****** B VW A B C V D H E F G ***************4****+********** w A VV F G ****** *********** WV WW Figure 1. BASIC CONCEPTS. 16(ww) = 4/9. .--(t2) 2_. Figure 1. Indeed.10.10. The significance of the fact that second-order columns can be built from first-order columns by appropriate cutting and stacking is that this guarantees that the transformation T2 defined for the second-order picture by mapping points directly upwards one level extends the mapping T1 that was defined for the first-order picture by mapping points directly upwards one level. . . the total mass in the second-order representation contributed by the first f(w) levels of all the columns that start with w is t(w) x--. . Likewise. which is just half the total width 1/E(t 1 ) of the first-order columns. where . w}.10. then it will continue to be so in any second-order representation that is built from the first-order representation by cutting and stacking its columns..1. E aw yr ( ww2 ) .(vw) = . if y is directly above x in some column of the firstorder representation.u. then stacking these in pairs. the 2m-th order representation can be produced by cutting the columns of the m-th order columnar representation into subcolumns and stacking them in pairs.4(vv) = 1/9„u. 2 which is exactly half the total mass of w in the first-order columnar representation.(wv) = 2/9.2 The second-order representation via cutting and stacking. Note also the important property that the set of points where T2 is undefined has only half the measure of the set where T1 is defined. so.106 CHAPTER I. le un-v2)\ ( aw* * (w) = 2E( ii) tV2 2 _ 1 toor(w). can be built by cutting and stacking the first-order representation shown in Figure 1. Second-order columns can be be obtained by cutting each first-order column into subcolumns. If this cutting and stacking method is used then the following two properties hold. it is possible to cut the first-order columns into subcolumns and stack them in pairs so as to obtain the second-order columnar representation. Indeed.1. In general. One can now proceed to define higher order columnar representations.

or height. in essence.5. 2.10. unless stated otherwise. will be used in this book. A column C = (L1. . it is a way of building something new.. The cutting and stacking idea focuses directly on the geometric concept of cutting and stacking labeled columns. The labeling of levels by symbols from A provides a partition P = {Pa : a E AI of the unit interval. P)-process is the same as the stationary process it defined by A*. gives the desired stationary A-valued process. I. . in turn. but this all happens in the background while the user focuses on the combinatorial properties needed at each stage to produce the desired final process. which has been commonly used in ergodic theory. The difference between the building of columnar representations and the general cutting and stacking method is really only a matter of approach and emphasis. Its measure X(C) is the Lebesgue measure X of the support of C. The columnar representation idea starts with the final block-structure measure . N(j) = Ar(L i ) is called the name or label of L. and the transformation T will preserve Lebesgue measure. Thus. column structure. Their common extension is a Lebesgue measure-preserving transformation on the entire interval. if cutting and stacking is used to go from stage to stage. disjoint. labeled columns. A labeling of C is a mapping Ar from {1. is undefined. Thus. Ar(E(C))) is called the name of C. which. see Corollary 1. will be proved later. it is a way of representing something already known.b The basic cutting and stacking vocabulary. The cutting and stacking language and theory will be rigorously developed in the following subsections. and the interval L i is called the j-th level of the column. f(C) is a nonempty. using the desired goal (typically to make an example of a process with some sample path property) to guide how to cut up the columns of one stage and reassemble to create the next stage. The vector Ar(C) = (H(1).10. The interval L1 is called the base or bottom. L2. and cutting and stacking extends these to joint distributions on higher level blocks. (a) The associated column transformation T2m extends the transformation Tm .3. Of course. 107 (b) The set where T2m is defined has one-half the measure of the set where Tft. The goal is still to produce a Lebesgue measure-preserving transformation on the unit interval. A column structure S is a nonempty collection of mutually disjoint. The support of a column C is the union of its levels L. Lac)) of length.10. CUTTING AND STACKING. Note that the members of a column structure are labeled columns. in essence.. Two columns are said to be disjoint if their supports are disjoint. the interval Lt(c) is called the top. Theorem 1. Some applications of its use to construct examples will be given in Chapter 3. .SECTION 1. where Pa is the union of all the levels labeled a. . The (T. The suggestive name. rather than the noninformative name.10. column structures define distributions on blocks. ordered. collection of subintervals of the unit interval of equal positive width (C). gadget. which is of course given by X(C) = i(C)r (C). hence. .u* and uses it to construct a sequence of partial transformations of the unit interval each of which extends the previous ones. and for each j. A general form of this fact. Ar(2). then the set of transformations {TA will have a common extension T defined for almost all x in the unit interval. f(C)} into the finite alphabet A. .. In particular.

An exception to this is the terminology "disjoint column structures." where the support of a column structure is the union of the supports of its columns. Note that T is one-to-one and its inverse maps points downwards. called its upward map.Et(c). This idea can be made precise in terms of subcolumns and column partitions as follows. and is not defined on the top level. .is the normalized column Lebesgue measure. r (5) < 00. . r(S). A subcolumn of L'2 . in other words. (b) The distance from the left end point of L'i to the left end point of L 1 does not depend on j. A(S) = Ex(c). Li) is a column C' = (a) For each j. If a column is pictured by drawing L i+ 1 directly above L1 then T maps points directly upwards one level. The width distribution 7 width r (C) r (C) "f(C) = EVES r (D) (45)• The measure A(S) of a column structure is the Lebesgue measure of its support. .) A (column) partition of a column C is a finite or countable collection {C(i) = (L(i. The transformation T = Ts defined by a column structure S is the union of the upward maps defined by its columns. T is not defined on the top of S and is a Lebesgue measure-preserving mapping from all but the top to all but the bottom of S. C = (L1. ) c cyf( a E r(S) r (S) . Note that implicit in the definition is that a subcolumn always has the same height as the column.. Note that the base and top have the same . L2. this means that expected column height with respect to the width distribution is finite. for 1 < j < f(C). 1). namely. Thus the union S US' of two column structures S and S' consists of the labeled columns of each. Each point below the base of S is mapped one level directly upwards by T. Columns are cut into subcolumns by slicing them vertically.(c). L(i. each labeled column in S is a labeled column in S' with the same labeling. A column C = (L 1 L2. that is." which is taken as shorthand for "column structures with disjoint support. is a subinterval of L i with the same label as L. Le (C )) defines a transformation T = T. . t(C)): i E I} .. CES CES Note that X(S) < 1 since columns always consist of disjoint subintervals of the unit interval. and column structures have disjoint columns. terminology should be interpreted as statements about sets of labeled columns. The precise definition is that T maps Li in a linear and order-preserving manner onto L i+ i. . In particular. L(i. 2) . such that the following hold.108 CHAPTER I. The base or bottom of a column structure is the union of the bases of its columns and its top is the union of the tops of its columns. It is called the mapping or upward mapping defined by the column structure S. The width r(S) of a column structure is the sum of the widths of its columns. (An alternative definition of subcolumn in terms of the upward map is given in Exercise 1. L'e ). for CES (c ) < Ei(c)-r(c) Ex . BASIC CONCEPTS. and S c S' means that S is a substructure of S'.

Li with the label of C1* C2 defined to be the concatenation vw of the label v of C 1 and the label w of C2. which is also the same as the width of C2. The basic fact about stacking is that it extends upward maps. namely.. for example. j): 1 < j < t(C2)) be disjoint labeled columns of the same width. (iii) The upward map Ts is a Lebesgue measure-preserving mapping from its domain to its range. Longer stacks are defined inductively by Ci *C2* • • • Ck Ck+i = (C1 * C2 * • • • Ck) * Ck+1. since Te. it is extended by the upward map T = Ts . where S c A*. Partitioning corresponds to cutting. In general. The cutting idea extends to column structures. thus cutting C into subcolumns according to a distribution 7r is the same as finding a column partition {C(i)} of C such that 7r(i) = t(C(i))1 . and taking the collection of the resulting subcolumns. and extends their union. j): 1 < j < t(C1)) and C2 = (L(2.SECTION 1. is built by cutting each of the first-order columns into subcolumns and stacking these in pairs. for each i.1. . a column partitioning is formed by partitioning each column of S in some way. the key properties of the upward maps defined by column structures are summarized as follows. the upward map Tco. the union of whose supports is the support of C. 109 of disjoint subcolumns of C. Thus. is now defined on the top of C1. then Ts . This is also true of the upward mapping T = Ts defined by a column structure S. (i) The domain of Ts is the union of all except the top of S.. A column partitioning of a column structure S is a column structure S' with the same support as S such that each column of S' is a subcolumn of a column of S. j — f(C1)) t(C1) < j f(Ci) -e(C2). the second-order columnar representation of (S. Thus. (ii) The range of Ts is the union of all except the bottom of S. defined by stacking C2 on top of C1 agrees with the upward map Tc„ wherever it is defined.(C).c. extends T8. the columns of S' are permitted to be stackings of variable numbers of subcolumns of S. Note that the width of C1 * C2 is the same as the width of C1. . A column structure S' is a stacking of a column structure S. and with the upward map TC 2 wherever it is defined.c. (iv) If S' is built by cutting and stacking from S. Indeed. Let C 1 = (L(1. .a*). The stacking idea is defined precisely as follows.10. The stacking of C2 on top of C 1 is denoted by C1* C2 and is the labeled column with levels L i defined by = L(1. if they have the same support and each column of S' is a stacking of columns of S. j) 1 < j < f(C1) L(2. A column structure S' is built by cutting and stacking from a column structure S if it is a stacking of a column partitioning of S. In summary. CUTTING AND STACKING. for any S' built by cutting and stacking from S.

This completes the proof of Theorem 1. I. then T -1 B = Ts -(m i ) B is an interval of the same length as B.3. The process {X. Since such intervals generate the Borel sets on the unit interval. then a common extension T of the successive upward maps is defined for almost all x in the unit interval and preserves Lebesgue measure. If the tops shrink to 0. Together with the partition defined by the names of levels. Cutting and stacking operations are then applied to 8(2) to produce a third structure 8(3). so that the transformation T defined by Tx = Ts(m)x is defined for almost all x and extends every Ts (m) .c The final transformation and process. This is equivalent to picking a random x in the support of 8(1).10. it is clearly the inverse of T. say L i . This is done by starting with a column structure 8(1) with support 1 and applying cutting and stacking operations to produce a new structure 8(2). where Pa is the union of all the levels of all the columns of 8(1) that have the name a. since the tops are shrinking to 0. S(m + 1) is built by cutting and stacking from S(m). (b) X(S(1)) = 1 and the (Lebesgue) measure of the top of S(m) goes to 0 as m —> oc.110 CHAPTER I. called the process and Kolmogorov measure defined by the complete sequence {S(m)}. respectively.) If {8(m)} is complete then the collection {Ts (n )} has a common extension T defined for almost every x E [0. (x) to be the index of the member of P to which 7 1 x belongs. Furthermore.} and its Kolmogorov measure it are. If B is a subinterval of a level in a column of some S(m) which is not the base of that column. the transformation produces a stationary process. The label structure defines the partition P = {Pa : a E A). The (T. BASIC CONCEPTS. 1] there is an M = M (x) such that x lies in an interval below the top of S(M). P)-process {X. of a column C E S(m) of height t(C) > j n — 1. the common extension of the inverses of the Ts (m) is defined for almost all x. then choosing m so that x belongs to the j-th level. El The transformation T is called the transformation defined by the complete sequence {S(m)}. since the top shrinks to 0 as m oc. and defining X. T is invertible and preserves Lebesgue measure. Further cutting and stacking preserves this relationship of being below the top and produces extensions of Ts(m). Proof For almost every x E [0. (Such an m exists eventually almost surely. A sequence {S(m)} of column structures is said to be complete if the following hold. (a) For each m > 1. 1].} is described by selecting a point x at random according to the uniform distribution on the unit interval and defining X. A precise formulation of this result will be presented in this section. Likewise. (x) to be the name of level Li+n _1 of C. Continuing in this manner a sequence {S(m)} of column structures is obtained for which each member is built by cutting and stacking from the preceding one. The goal of cutting and stacking operations is to construct a measure-preserving transformation on the unit interval.10.) The k-th order joint distribution of the process it defined by a complete sequence {S(m)} can be directly estimated by the relative frequency of occurrence of 4 in a . it follows that T is measurable and preserves Lebesgue measure. Theorem 1.3 (The complete-sequences theorem. and a condition for ergodicity.10. along with the basic formula for estimating the joint distributions of the final process from the sequence of column structures.

x. Theorem 1.10.4 (Estimation of joint distributions. The desired result (1) now follows since the sum of the widths t(C) over the columns of (m) was assumed to go oc. The relative frequency of occurrence of all' in a labeled column C is defined by pk(a inC) = Ili E [1. a negligible effect when m is large for then most of the mass must be in long columns. as n An application of the estimation formula (1) is connected to the earlier discussion on of the sequence of columnar representations defined by a block-structure measure where S c A*. CES (1) since f(C)p i (aIC) just counts the number of times a appears in the name of C. averaged over the Lebesgue measure of the column name. The quantity (t(C) — k 1)pk(a lic IC) counts the number of levels below the top k — 1 levels of C that are contained in Pal k . 1]. . in turn. Let Pa be the union of all the levels of all the columns in 8(1) that have the name a. since the top k — 1 levels of C have measure (k — 1)r(C). f(C) — k 1 : x 1' 1 = ] where xf (c) is the name of C. This completes the proof of Theorem 1. of course. pk(lC) is the empirical overlapping k-block distribution pk •ixf (c) ) defined by the name of C. For k > 1. For each m let 8(m) denote the 2''-order columnar representation.SECTION L10. The same argument applies to each S(m). CUTTING AND STACKING. the only error in using (1) to estimate Wa l') comes from the fact that pk (ainC) ignores the final k — 1 levels of the column C.} E A Z be the sequence defined by the relation Tnx E Px„ n E Z. establishing (1) for the case k = 1. {8(m)}. This estimate is exact for k = 1 for any m and for k > 1 is asymptotically correct as m oc.u(a) = X(Pa). so that {S(m)} is a complete sequence.(apx(c). which is. the context will make clear which is intended.4.) If t is the measure defined by the complete sequence. • k E Ak 1 (m) Proof In this proof C will denote either a column or its support. to 0. To make this precise. and let Pa k = { x: x = . that is.. Then. and hence IX(Paf n C) — p k (c41C)A(C)1 < 2(k — 1)r(C). and hence X(Pa fl C) = [t(c)pi (aiC)j r(c) = pi(aIC)X(C).10. Without loss of generality it can be assumed that 8(m +1) is built by cutting and stacking from S(m). let {. given by p. 111 column name. then (1) = lim E CES pk(ailC)X(C). for x E [0. Furthermore. Let itt be the . since the top has small measure. X(8(1)) = 1 and the measure of the top of S(m + 1) is half the measure of the top of S(m).

A(B I C) denotes the conditional measure X(B n c)/x(c) of the intersection of the set B with the support of the column C. . awl)]. since the start distribution for is obtained by selecting w 1 at random according to the distribution AT. that is. = l. which is sufficient for most purposes.10. The sequence {S(m)} of column structures is asymptotically independent if for each m and each e > 0. In other words. One way to make certain that the process defined by a complete sequence is ergodic is to make sure that the transformation defined by the sequence is ergodic relative to Lebesgue measure. where P = {Pa } is the partition defined by letting P a be the set of all pairs (wr) . In the following discussion C denotes either a column or its support.112 CHAPTER I. The proof for k > 1 is left as an exercise. These ideas are developed in the following paragraphs.5 The tower construction and the standard representation produce the same measures. be the Kolmogorov measure of the (T. Next let T be the tower transformation defined by the base measure it*. but only the column distributions. Related concepts are discussed in the exercises. Proof Let pk(afiw7) be the empirical overlapping k-block distribution defined by the sequence w E S n . that these independence concepts do not depend on how the columns are labeled. 7 3)-process. Corollary 1. two column structures S and S' are 6-independent if and only if the partition into columns defined by S and the partition into columns defined by S' are 6-independent. j) for which ai = a. p.(C). where L = Ei aw 1 ). Since this is clearly the same as pk(a inC) where C is the column with name w. is that later stage structures become almost independent of earlier structures. thought of as the concatenation wi w2 • • • wn E AL. measure defined by this sequence. with the shift S on S" as base transformation and height function defined by f (wi . BASIC CONCEPTS. A condition for this.) = t (w 1) • Also let /72. then selecting the start position according to the uniform distribution on [1. The column structures S and S' are 6-independent if (3) E E Ige n D) — X(C)À(D)I CES DES' 6. . For k = 1 the sum is constant in n. it is enough to show that (2) = lim E WES pk (ak l it4)A. 0 Next the question of the ergodicity of the final process will be addressed. The sequence {S(m)} will be called the standard cutting and stacking representation of the measure it. by the way. a E Ak . Note. where tovi) w = ai . there is a k > 1 such that S(m) and S(m k) are 6-independent. For example.

then there must be at least one level of D which is at least (1 — E) filled by B. of some C E 8(M) such that X(B n L) > 0 This implies that the entire column C is filled to within (1 — E 2 ) by B. 113 Theorem 1. This shows that X(B) = 1.F. This completes the proof of Theorem 1. ( since T k (B n L) = B n TkL and T k L sweeps out the entire column as k ranges from —j + 1 to aW) — j.E 2 )x(c). and hence T must be ergodic. (4) À(B n c) > 1 . The set C n D is a union of levels of D. that is.SECTION I10.F. implies E X(D) <2E. Towards this end note that since the top of S(m) shrinks to 0. Fix C and choose M so large that S(m) and S(M) are EX(C)-independent. The argument used to prove (4) then shows that the entire column D must be (1 — c) filled by B. The goal is to show that A(B) = 1.. (6) X(B n D) > (1 — c)X(D). P)-process is ergodic for any partition P.T be the set of all D E S(M) for which X(B1 C n D) > 1 — 6. In summary. which implies that the (T. It will be shown that T is ergodic. DO* Thus summing over D and using (6) yields X(B) > 1 — 3E. Thus. so if D E . the Markov inequality and the fact that E x (7.6 (Complete sequences and ergodicity. .10. Since X(B I C) = E-D A.) A complete asymptotically independent sequence defines an ergodic process. since S(M) is built by cutting and stacking from S(m). the following holds. (1 — 6 2 ). say level j. hence the following stronger form of the complete sequences and ergodicity theorem is quite useful. imply that n v)x(vi C). and hence the collection of all the levels of all the columns in all the S(m) generates the a-algebra.6. Proof Let T be the Lebesgue measure-preserving transformation of the unit interval defined by the complete. the widths of its intervals must also shrink to 0. Let B be a measurable set of positive measure such that T -1 B = B. given E > 0 there is an m and a level L.(B1 C X(B I C) ?. asymptotically independent sequence { 8(m)}. in particular. so that. (5). D e . El If is often easier to make successive stages approximately independent than it is to force asymptotic independence. 1 C) DO' El which together with the condition. CUTTING AND STACKING. (5) E DES(M) igvi c) — x(D)1 E- Let .10.

can produce a variety of counterexamples when applied to substructures.d Independent cutting and stacking.7 (Complete sequences and ergodicity: strong form. The argument used in the preceding theorem then gives À(B) > 1 — 3E. The only real modification that needs to be made in the preceding proof is to note that if {Li} is a disjoint collection of column levels such that X (BC n (UL)) = BciL i . This proves Theorem 1. provided that they have disjoint supports. where Em 0. x ( L i) < E 28. then {S(m)} defines an ergodic process. taking 3 = . which. The freedom in building ergodic processes via cutting and stacking lies in the arbitrary nature of the cutting and stacking rules. Theorem 1. then by the Markov inequality there is a subcollection with total measure at least 3 for which x(B n L i ) > (1 — E 2 )À(L i ). however. in spite of its simplicity. Independent cutting and stacking is the geometric version of the product measure idea. A few of these constructions will be described in later chapters of this book. I. Proof Assume T -1 B = B and X(B) > O. The user is free to vary which columns are to be cut and in what order they are to be stacked. A column structure S can be stacked independently on top of a labeled column C of the same width. In this discussion.10. This gives a new column structure denoted by C * S and defined as follows. (i) Partition C into subcolumns {Ci } so that r (C1) = (Di ). the same notation will be used for a column or its support. . (7) Eg where BC is the complement of B. and m so large that Em is smaller than both E 2 and (ju(B)/2) 2 so that there is a set of levels of columns in S(m) such that (7) holds. as well as how substructures are to become well-distributed in later substructures. of course. BASIC CONCEPTS.) If {S(m)} is a complete sequence such that for each m. in the complexity of the description needed to go from one stage to the next.114 CHAPTER I.10. and the sweeping-out argument used to prove (4) shows that (8) X(B n > ( 1 — E2 E SI )x(c). Let S' be the set of all columns C for which some level has this property. known as independent cutting and stacking. where Di is the ith column of S. c . it follows that and there must be at least one C E S(m) for which (8) holds and for which DES (m+1) E Ix(Dic) — x(D)i < E.7. The next subsection will focus on a simple form of cutting and stacking.u(B)/2. A bewildering array of examples have been constructed. The support of S' has measure at least 3. There are practical limitations. by using only a few simple techniques for going from one stage to the next. as m OC.10. as earlier. Thus. S(m) and S(m 1) are Em -independent. with the context making clear which meaning is intended.

(ii) Stack D_i on top of C_i to obtain the new column, C_i ∗ D_i.

The new column structure C ∗ S consists of all the columns C_i ∗ D_i. It is called the independent stacking of S onto C. (See Figure I.10.8.)

Figure I.10.8  Stacking a column structure independently onto a column.

To say precisely what it means to independently cut and stack one column structure on top of another column structure the concept of copy is useful. A column structure S is said to be a copy of size a of a column structure S′ if there is a one-to-one correspondence between columns such that corresponding columns have the same height and the same labeling, and the ratio of the width of a column of S to the width of its corresponding column in S′ is a. In other words, a scaling of one structure is isomorphic to the other, where two column structures S and S′ are said to be isomorphic if there is a one-to-one correspondence between columns such that corresponding columns have the same height, width, and name. Note, by the way, that a copy has the same width distribution as the original.

A column structure S can be cut into copies {S_i: i ∈ I} according to a distribution π on I, by partitioning each column of S according to the distribution π, and letting S_i be the column structure that consists of the i-th subcolumn of each column of S.

Let S′ and S be disjoint column structures with the same width. The independent cutting and stacking of S′ onto S is denoted by S ∗ S′ and is defined for S = {C_i} as follows. (See Figure I.10.9.)

(i) Cut S′ into copies {S′_i} so that τ(S′_i) = τ(C_i).

(ii) For each i, stack S′_i independently onto C_i, obtaining C_i ∗ S′_i.

The column structure S ∗ S′ is the union of the column structures C_i ∗ S′_i.

Figure I.10.9  Stacking a structure independently onto a structure.

An alternative description of the columns of S ∗ S′ may be given as follows.

The independent cutting and stacking construction contains more than just this concatenation information. the number of columns of S * S' is the product of the number of columns of S and the number of columns of S'. for it carries with it the information about the distribution of the concatenations. The sequence {S(m)} is called the sequence built (or generated) from S by repeated independent cutting and stacking.116 (i) Cut each C each Ci E S. S into subcolumns {CO.10. and S(m + 1) = S(m) * S (m). m > 1. where 5 (1) = S. The columns of Si * • • • * Sm have names that are M-fold concatenations of column names of the initial column structure S.J*Cii . starting with a column structure S produces the sequence {S(m)}. Successive applications of repeated independent cutting and stacking. -r(Ci The column structure S *S' consists of all the C. that is. and. Note that S(m) is isomorphic to the 2m-fold independent . namely. r (Cii *C) r (S * S') r (Ci ) r (S) r(C) since r (S*S') = r(S). i ) I r (S') for S' into subcolumns {C» } such that r(C) = (C)T. • • *Sm = (S1* • • • *Sm_i)*Sm. is that width distributions multiply. BASIC CONCEPTS. the number of columns of Si * • • • * Sm is the M-th power of the number of columns of S. The M-fold independent cutting and stacking of a column structure S is defined by cutting S into M copies {Sm : m = 1. Si S2 SI * S2 Figure 1.10.10. that they are independently concatenated according to the width distribution of S. however. Two-fold independent cutting and stacking is indicated in Figure 1. . such that r(Cii ).10 Two-fold independent cutting and stacking. Note that (Si * • • • * S) = t (S)/M.(Ci E (ii) Cut each column Ci ').. in the finite column case. A column structure can be cut into copies of itself and these copies stacked to form a new column structure. M} of itself of equal width and successively independently cutting and stacking them to obtain Si *52 * • • ' * SM where the latter is defined inductively by Si *. The key property of independent cutting and stacking. in the case when S has only finitely many columns. .. E CHAPTER I. 2. In particular. This formula expresses the probabilistic meaning of independent cutting and stacking: cut and stack so that width distributions multiply.

For k = 2'. The simplest of these. By the law of large numbers. a subcolumn of Ci E S. Theorem 1. for they are the precisely the processes built by repeated independent cutting and stacking from the columnar representation of their firstorder block-structure distributions. as k —> cc. and stacking these in order of increasing i. More interesting examples of cutting and stacking constructions are given in Chapter 3. just another way to define the regenerative processes. the countable case is left to Exercise 2. since r(S)E(t(S)) = X(S) = 1. 117 cutting and stacking of S. {S(m): m > mol is built from S(rno) by repeated independent cutting and stacking. and C E S. of course. so that if X(S) = 1 then the process defined by {S(m)} is ergodic. Division of X(C n D) = kp(C1Cbt(C)r(D) by the product of X(C) = i(C)r(C) and À(D) = i(D)t(D) yields x(c n D) X(C)À(D) kp(CIC I ) i(D)r (C) 1. oc.11 (Repeated independent cutting and stacking. k]:C = p(CICO I = so that. Ergodicity is guaranteed by the following theorem. then given > 0. so that its columns are just concatenations of 2' subcolumns of the columns of S. Exercise 8.. This establishes that indeed S and S(m) are eventually c-independent. by (9) and (10).SECTION I. with probability 1. kp(CICf) is the total number of occurrences of C in the sequence C. The notation D = C = C1 C2 . • Ck. the (T.10. an example of a process with an arbitrary rate of . This completes the proof of Theorem 1. define E [1. will mean the column D was formed by taking. In particular. and hence the proof shows that {S(m)} is asymptotically independent. (9) and (10) 1 1 — i(D) = — k k p(CIC) —> E £(0) i=1 E(t(S)). as k of S. P)-process defined by the sequence generated from S by repeated independent cutting and stacking is called the process built from S by repeated independent cutting and stacking. Note that for each mo. • • Ck E SOTO. for each i.10. Proof The proof for the case when S has only a finite number of columns will be given here. by assumption. This is. if X(S) = 1 then the process built from S by repeated independent cutting and stacking is ergodic.) If {S(m)} is built from S by repeated independent cutting and stacking. D = C1C2. CUTTING AND STACKING. in particular.11. there is an m such that S and S(m) are E-independent. where E(t(S)) is the expected height of the columns both with probability 1. In the case when X(S) = 1. selected independently according to the width distribution of S. The following examples indicate how some standard processes can be built by cutting and stacking. with respect to the width distribution.10.

10. It shows how repeated independent cutting and stacking on separate substructures can be used to make a process "look like another process on part of the space. is recommended as a starting point for the reader unfamiliar with cutting and stacking constructions. hence can be represented as the process built by repeated independent cutting and stacking. convergence for frequencies. that is. Example 1.! .} is a function X. and how to "mix one part slowly into another. say a.Ar(L).i.} is a function of a mixing Markov chain. 0) if L is not the top of its column. The process built from S by repeated independent cutting and stacking is just the coin-tossing process.) If {KJ is ergodic. repeated independent cutting and stacking produces the A-valued i.i. and let pt* be the product measure on S defined by p. Remark 1. the process {X.10. and to (. then {Y. Example 1.) A hidden Markov process {X.14 (Hidden Markov chains. It is easy to see that the process {KJ built from this new column structure by repeated independent cutting and stacking is Markov of order no more than the maximum length of its columns.) Let S be the column structure that consists of two columns each of height 1 and width 1/2. in particular. 1) if L is the top of its column. Since dropping the second coordinates of each new label produces the old labels.10. process in which O's and l's are equally likely.15 (Finite initial structures.12 (I." so as to guarantee ergodicity for the final process. By starting with a partition of the unit interval into disjoint intervals labeled by the finite set A. To see why this is so. (These also known as finite-state processes. process with the probability of a equal to the length of the interval assigned to a. processes. Example 1.16 The cutting and stacking ideas originated with von Neumann and Kakutani. In fact.T.10.) If the initial structure S has only a finite number of columns. and hence {X. For a discussion of this and other early work see [12]. all' E S. then {X. and ak = a.10.d.13 (Markov processes.118 CHAPTER I.} is regenerative. Define the set of all ak Prob(X i` = alic1X0 = a).1 < i < k. it follows that X.i. They have since been used to construct . then drop the first coordinates of each label.c. BASIC CONCEPTS. one can first represent the ergodic Markov chain {(Xn ." so as to approximate the desired property.d. Yn )} as a process built by repeated independent cutting and stacking. see III. the binary i. i } is mixing Markov.} built from it by repeated independent cutting and stacking is a hidden Markov chain. The process built from the first-order columnar representation of S by repeated independent cutting and stacking is the same as the Markov process {X„}. Let S be i for which ai a. Example 1. that if the original structure has two columns with heights differing by 1.) of a Markov chain. = f (Y. labeled as '0' and '1'. Note. assume that the labeling set A does not contain the symbols 0 and 1 and relabel the column levels by changing the label A (L) of level L to (A (L).d. is a function of Y.) The easiest way to construct an ergodic Markov chain {X„} via cutting and stacking is to think of it as the regenerative process defined by a recurrent state.

(m). Suppose S has measure 1 and only finitely many columns.. Prove Theorem 1. (a) Show that if the top of each column is labeled '1'. process is given in [75]. T D.SECTION 1. there is an M = M(m. Suppose 1(m) c S(m). I. (c) Show that if S and S' are 6-independent. some of which are mentioned [73]. (a) Show that {S(m)} is asymptotically independent if and only if {r(m)} is asymptotically independent. then E CES IX(CI D) .10. Let pt" be a block-structure measure on Soe. (b) Show that the asymptotic independence property is equivalent to the asymptotic well-distribution property. then S' and S are 26-independent.. most of which have long been part of the folklore of the subject. (Hint: the tower with base transformation Sn defines the same process Ti.e Exercises 1.. (Hint: for S = S(m)..i.11 for the case when S has countably many columns. 3. 2. with all other levels labeled '0'. for each m. and let {S(m)} be the standard cutting and stacking representation of the stationary A-valued process defined by (S.d. 119 numerous counterexamples.1 as m —> oc.C. 7. where D is a subset of the base B. the conditional probability X(DI C) is the fraction of C that was cut into slices to put into D. (d) Show that if EcEs IX(C1 D) — X(D)I < E.10. The latter also includes several of the results presented here.) 6. and X(1 (m)) -->. Show that if T is the upward map defined by a column C.T 2 D. (a) Suppose S' is built by cutting and stacking from S. Show that {S(m)} is asymptotically independent if and only if it* is totally ergodic. A sequence {S(m)} is asymptotically well-distributed if for each m. disjoint from 1(m). CUTTING AND STACKING. (b) Suppose that for each m there exists R(m) C S(m). Show that formula (2) holds for k > 1.10. p*). Show that for C E S and D E S'. Then apply the k = 1 argument. . each C E S(m). Suppose also that {S(m} is complete. 5. except for a set of D E SI of total measure at most E. where S C A*. re(c)-1 D). Exercise 4c implies that A* is m-ergodic. then a subcolumn has the form (D.À(D)i except for a set of D E S I of total measure at most Nfj. A column C E S is (1 —6)-well-distributed in S' if EDES' lx(vi C)—X(7))1 < E. then the process built from S by repeated independent cutting and stacking is Markov of some order.) 4. A sufficient condition for a cutting and stacking construction to produce a stationary coding of an i. and each E > 0.n such that T(m + 1) is built by An -fold independent cutting and stacking of 1(m) U R. Show that if {Mm } increases fast enough then the process defined by {S(m)} is ergodic. . E) such that C is (1 —6)-well-distributed in S(n) for n > M. and an integer 111. .

(c) Verify that the process constructed in Example 1. .120 CHAPTER I. then the process built from S by repeated independent cutting and stacking is a mixing finitestate process.10. (b) Show that if two columns of S have heights differing by 1.13 is indeed the same as the Markov process {X„}. Show that process is regenerative if and only if it is the process built from a column structure S of measure 1 by repeated independent cutting and stacking. 8. BASIC CONCEPTS.

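Before turning to Chapter II, the following minimal sketch (my own, not from the text) illustrates the probabilistic meaning of repeated independent cutting and stacking for a finite column structure of total measure 1: widths multiply, so sample paths are obtained by independently concatenating column names according to the width distribution. The representation of columns as label strings with widths is an illustrative assumption, and the randomized start that makes the resulting process stationary is ignored here. With two height-1 columns '0' and '1' of width 1/2 this reproduces the coin-tossing structure of Example I.10.12.

```python
import random

# A finite column structure of total measure 1, described here by the label
# string and width of each column (an illustrative representation, not the
# text's notation).  Two height-1 columns '0' and '1' of width 1/2 give the
# coin-tossing structure of Example I.10.12.
columns = {"0": 0.5, "1": 0.5}

def sample_path(columns, num_blocks, rng=random):
    # Probabilistic meaning of repeated independent cutting and stacking:
    # column names are concatenated independently, each name chosen according
    # to the width distribution of the structure (randomized start omitted).
    names = list(columns)
    widths = [columns[name] for name in names]
    return "".join(rng.choices(names, weights=widths, k=num_blocks))

random.seed(0)
print(sample_path(columns, 20))  # a fair coin-tossing sample path
```

Replacing the structure by, say, first-order columnar representations of longer blocks gives concatenated-block processes in the same way; only the label strings and widths change.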
Chapter II

Entropy-related properties.

Section II.1  Entropy and coding.

An n-code is a mapping C_n: Aⁿ → {0, 1}*, where {0, 1}* is the set of finite-length binary sequences. A code sequence is a sequence {C_n: n = 1, 2, ...}, where each C_n is an n-code. If each C_n is one-to-one, the code sequence is called a faithful-code sequence, while if each C_n is a prefix code it is called a prefix-code sequence. The code length function L(·|C_n) of the code is the function that assigns to each x_1^n the length L(C_n(x_1^n)) of the code word C_n(x_1^n). When C_n is understood, L(x_1^n) will be used instead of L(x_1^n|C_n).

As noted in Section I.7, entropy provides an asymptotic lower bound on expected per-symbol code length for prefix-code sequences and faithful-code sequences, see Theorem I.7.12 and Theorem I.7.15. The Shannon code construction provides a prefix-code sequence {C_n} for which L(x_1^n) = ⌈−log μ(x_1^n)⌉, so that, in particular, L(x_1^n)/n → h, almost surely, by the entropy theorem, where h(μ) denotes the entropy of μ. The sequence of Shannon codes compresses to entropy in the limit, but its construction depends on knowledge of the process, knowledge which may not be available in practice. In general, if {C_n} is a Shannon-code sequence for μ, and ν is some other ergodic process, then L(x_1^n)/n may fail to converge on a set of positive ν-measure, see Exercise 1.

Two issues left open by these results will be addressed in this section. The first issue is the universality problem. It will be shown that there are universal codes, that is, code sequences which compress to entropy in the limit, almost surely, for almost every sample path from any ergodic process. A code sequence {C_n} is said to be universally asymptotically optimal or, more simply, universal if

lim sup_{n→∞} L(x_1^n)/n ≤ h(μ), almost surely,

for every ergodic process μ.

The second issue is the almost-sure question. The entropy lower bound is an expected-value result, a lower bound which is "almost" tight, at least asymptotically in source word length n; hence it does not preclude the possibility that there might be code sequences that beat entropy infinitely often on a set of positive measure. It will be shown that this cannot happen.
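To make the Shannon-code construction concrete, here is a minimal sketch (my own; the i.i.d. measure, alphabet, and block length are illustrative assumptions, not an example from the text). It computes the lengths ⌈−log μ(x_1^n)⌉ and checks the Kraft inequality Σ 2^(−L) ≤ 1, which is what guarantees that a prefix code with these lengths exists, and it shows the expected per-symbol length sitting just above the entropy.

```python
import math
from itertools import product

# Illustrative i.i.d. measure on A = {0, 1} with P(1) = p (an assumption).
p = 0.3
n = 8

def mu(block):
    # Product measure of the i.i.d. (1-p, p) process on a binary block.
    ones = sum(block)
    return (p ** ones) * ((1 - p) ** (len(block) - ones))

# Shannon code length for each source word: L(x_1^n) = ceil(-log2 mu(x_1^n)).
lengths = {x: math.ceil(-math.log2(mu(x))) for x in product((0, 1), repeat=n)}

# Kraft inequality: sum of 2^{-L(x)} is at most 1, so a prefix code with
# exactly these lengths exists.
kraft_sum = sum(2.0 ** -L for L in lengths.values())

# Expected per-symbol length is within 1/n of the per-symbol entropy h.
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
expected = sum(mu(x) * L for x, L in lengths.items()) / n
print(f"Kraft sum = {kraft_sum:.4f} (<= 1)")
print(f"expected length per symbol = {expected:.4f}, entropy h = {h:.4f}")
```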

A prefix-code sequence {C_n} will be constructed such that for any ergodic μ,

lim sup_n L(x_1^n)/n ≤ H, almost surely,

where H = H(μ) denotes the process entropy of μ. Since process entropy H is equal to the decay-rate h given by the entropy theorem, this will show that good codes exist. A counting argument, together with a surprisingly simple lower bound on prefix-code word length, will then be used to show that it is not possible to beat process entropy in the limit on a set of positive measure.

Theorem II.1.1 (Universal codes exist.) There is a prefix-code sequence {C_n} such that lim sup_n L(x_1^n)/n ≤ h(μ), almost surely, for any ergodic process μ.

Theorem II.1.2 (Too-good codes do not exist.) If {C_n} is a faithful-code sequence and μ is an ergodic measure with entropy h, then lim inf_n L(x_1^n)/n ≥ h, almost surely.

In other words, it is possible to (universally) compress to entropy in the limit, but for no ergodic process is it possible to beat entropy infinitely often on a set of positive measure. The proof of the existence theorem to be given actually shows the existence of codes that achieve at least the process entropy, that is, lim sup_n L(x_1^n)/n ≤ H, almost surely, for any ergodic process. Of interest, also, is the fact that both the existence and nonexistence theorems can be established using only the existence of process entropy, whose existence depends only on subadditivity of entropy as expected value, while the existence of decay-rate entropy is given by the much deeper entropy theorem. These two results also provide an alternative proof of the entropy theorem. While no simpler than the earlier proof, which is based on a packing idea and is similar in spirit to some later proofs, they show clearly that the basic existence and nonexistence theorems for codes are, in essence, together equivalent to the entropy theorem, thus further sharpening the connection between entropy and coding.

A direct coding argument, which utilizes the empirical distribution of k-blocks, that is, the k-type, for suitable choice of k as a function of n, will be used to establish the universal code existence theorem, with code performance established by using the type counting results discussed in Section I.6.d. In addition, the direct coding construction extends to the case of semifaithful codes, for which a controlled amount of distortion is allowed, as shown in [49]. A second proof of the existence theorem, based on the Lempel-Ziv coding algorithm, will be given in Section II.2. A third proof of the existence theorem, based on entropy estimation ideas of Ornstein and Weiss, together with some results about entropy for Markov chains, will be discussed in Section II.3. The nonexistence theorem will follow from the entropy theorem, together with the lower bound on prefix-code word length mentioned above.

The code C_n is a two-part code; the codeword C_n(x_1^n) is a concatenation of two binary blocks. The first block gives the index of the k-type of the sequence x_1^n, relative to some enumeration of possible k-types; it has fixed length which depends only on the number of type classes. The second block gives the index of the particular sequence x_1^n in its k-type class, relative to some enumeration of the k-type class; its length depends on the size of the type class. Distinct words, x_1^n and y_1^n, either have different k-types, in which case the first blocks of C_n(x_1^n) and C_n(y_1^n) differ, or they have the same k-type, in which case their second blocks differ.
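The first-block information in this two-part code is the (circular) k-type of x_1^n, described in detail below. As a concrete illustration, the following sketch (my own; the sample sequence and block length are arbitrary assumptions) computes the circular k-type, that is, the empirical distribution of k-blocks in x_1^n after it has been extended periodically by its first k − 1 symbols so that every position starts a k-block.

```python
from collections import Counter

def circular_k_type(x, k):
    # Empirical distribution of k-blocks in x extended periodically by its
    # first k-1 symbols, so that every index 1..n starts a k-block.
    n = len(x)
    ext = x + x[:k - 1]
    counts = Counter(tuple(ext[i:i + k]) for i in range(n))
    return {block: c / n for block, c in counts.items()}

# Illustrative sequence and block length (assumptions, not from the text).
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
k = 3
for block, freq in sorted(circular_k_type(x, k).items()):
    print(block, round(freq, 3))

# Two sequences receive the same first code block exactly when these
# empirical distributions coincide.
```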

so that the bounds. i <k — 1. and on the cardinalityj7k (x)i of the set of all n-sequences that have the same circular k-type as 4. aik E A k .4+k-1 ..14 and Theorem 1. (Variable length is needed for the second part.14).15. ' +1 . < iii. let ril+k-1 = xx k i -1 . n) + log1 7 4(4)1 +2. Now suppose k = k(n) is the greatest integer in (1/2) log lAi n and let Cn be the two-part code defined by first transmitting the index of the circular k-type of x. k -. the code Cn is a prefix code. the index of the particular sequence 4 relative to some particular enumeration of T k (4). Theorem 1.6. 1f1 n The circular k-type is just the usual k-type of the sequence . say. Since the first block has fixed length. x 7 denotes the entropy of the (k — 1)-order Markov chain defined by Pk (. ak (3) ilk-Le. hence the second blocks of Cn (4) and Cn (y.:+ k -1 = ak 11 . n) of circular k-types that can be produced by sequences of length n. ENTROPY AND CODING.) Total code length satisfies log /S1 (k. that is. yet has negligible effect on code performance.. The circular k-type is the measure P k = ilk(. that is. Cn (4) = blink. or they have the same k-types but different indices in their common k-type class.`) will differ. because the size of the type class depends on the type. rd: 5.7irk-1 . 123 will be different.' . that is. The gain in using the circular k-type is compatibility in k..f. and bm t +1 is a variable-length binary sequence specifying. the concatenation of xtiz with x k i -1 . in turn. that is. A slight modification of the definition of k-type.IX II ) on il k defined by the relative frequency of occurrence of each k-block in the sequence . then the log of the number of possible k-types is negligible relative to n.c14). Pk-1(4 -1 14) =---" which."7 on the number /V (k.1. so asymptotic code performance depends only on the asymptotic behavior of the cardinality of the k-type class of x.6. . implies the entropy inequality EFkcal. ITi(4)1 < (n — 1)2 (11-1)11. called the circular k-type.SECTION 11. where the two extra bits are needed in case the logarithms are not integers. br relative to some fixed enumeration of the set of possible circular k-types. lic 14) — Pk(a • 1 ti E [1. where Hk_1. on the number of k-types and on the size of a k-type class yield the following bounds (1) (2) (k. 4 is extended periodically for k — 1 more terms. Given a sequence 4 and an integer k < n. If k does not grow too rapidly with n. will be used as it simplifies the final entropy argument. n) < (n ± olAl k . then transmitting the index of 4 in its circular k-type class.(1/2) log IAI n. An entropy argument shows that the log of this class size cannot be asymptotically larger than nH . where is a fixed length binary sequence specifying the index of the circular k-type of x.

u(4). and hence HK _ Lx? integer N = N(x.2 for the case when each G is a prefix code. where. the ergodic theorem implies that PK (ar lx7) oo. is valid for any process. The first proof uses a combination of the entropy theorem and a simple lower bound on the pointwise behavior of prefix codes. This is. the type class. The following lemma. almost surely. The proof that good codes exist is now finished. relative to a given enumeration of bits are needed to specify a given member of 7 Thus. establishing the desired bound (4). n > N. k(n) will exceed K.1. II. Since K is fixed. for es•• <H + 2E. negligible relative to n. almost surely. and does not make use of the entropy theorem. Once this happens the entropy inequality (3) combines with the preceding inequality to yield ilk(n)-1. is based on an explicit code construction which is closely connected to the packing ideas used in the proof of the entropy theorem. Two quite different proofs of Theorem 11. for as noted earlier. Fix an ergodic measure bc with process entropy H. together with the assumption that k < (1/2) log iAi n.1. A faithful-code sequence can be converted to a prefix-code sequence with no change in asymptotic performance by using the Elias header technique. (4) where H is the process entropy of p. in this discussion. choose K such that HK_i _< H ± E. The second proof.u(ar)/. process entropy El H and decay-rate h are the same for ergodic processes. For all sufficiently large n. implies that it takes at most 1 + N/Tz log(n + 1) bits to specify a given circular k(n)-type. described in Section I. [3]. of course.x7 < H. The bound (1) on the number of circular k-types. Given E > 0. Thus it is enough to prove Theorem 11.124 CHAPTER II. a bound due to Barron. In almost surely. relative to some fixed enumeration of circular k(n)-types. E) such that almost every x there is an particular. which was developed in [49]. due to Barron. this shows that lim sun n ilk(n). almost surely. as n . ENTROPY-RELATED PROPERTIES.7.1. stationary or not.u(ar-1 ).l. . HK-1 = H (X IX i K -1 ) denotes the process entropy of the Markov chain of order K — 1 defined by the conditional probabilities .2 will be given. hm sup 'C(n xilz) lim sup iik(n) .b Too-good codes do not exist. Since E is arbitrary.d of Chapter 1.x 1' < H.x? < H + 2E.x7. The bound (2) on the size of a type class implies that at most 1 + (n — log(n — 1) 4(4). so that to show the existence of good codes it is enough to show that for any ergodic measure lim sup fik(n)-1.

A second proof of the nonexistence of too-good codes.C(4) +log . yields lim inf n—000 n > lim inf .3 (The almost-sure code-length bound. (As noted earlier. The lemma follows from the Borel-Cantelli lemma.1. I.1. this can be rewritten as Bn = {x7: iu. with respect to A. Barron's inequality (5) with an = 2 loglAI n. Proof For each n define B.„ Bj ) > y.l. } is a prefix-code sequence such that lim infi £(41C. ={x: . s.)/n < H. This completes the proof of Theorem 11. since adding headers to specify length has no effect on the asymptotics. II.SECTION II. a faithful-code sequence with this property can be replaced by a prefix-code sequence with the same property. for any ergodic measure if/.c Nonexistence: second proof.(4) eventually almost surely. .) Thus there are positive numbers. Using the relation L(x7) = log 2'. such that if Bi = (x: ICJ) < i(H . 2-an < oo.0 <2 -an E 2-4(4) < 2 -"n .1. But -(1/n) log z(x) converges El almost surely to h.t(Bn) = E kt(x . a proof which uses the existence of good codes but makes no use of the entropy theorem.) Let (CO be a prefix-code sequence and let bt be a Borel probability measure on A. 125 Lemma 11. then. Theorem 11.2. Let i be a fixed ergodic process with process entropy H. then If {an } is a sequence of positive numbers such that En (5) .u(fil ) . such that those words that can be coded too well cover at least a fixed fraction of sample path length. the entropy of .log p. and assume {C.2.6)} then i(nn Up. while the complement can be compressed almost to entropy.u(4) ?_ . long enough sample paths can be partitioned into disjoint words. and after division by n. The basic idea is that if a faithful-code sequence indeed compresses more than entropy infinitely often on a set of positive measure. c and y.1. will now be given. thereby producing a code whose expected length per symbol is less than entropy by a fixed amount. x7En„ . on a set of positive measure.ei2E B„ since E „ 2—c(x7) < 1.an. with high probability. by the Kraft inequality for prefix codes.u.C(x7) + log .(4) which yields 2-c(4 ) 2 -an I. eventually a. ENTROPY AND CODING.an).

and M > 2k/S. while the words that are neither in Dk nor in BM U . .1. with high probability.. plus headers to tell which code is being used. E G(n). this produces a single code with expected length smaller than expected entropy. ' > 0. and using a single-letter code on the remaining symbols. 8 > 0. for some large enough M. (b) The total length of the w(i) that belong to A is at most 33n. . a concatentation 4 = w(1)w(2) • • • w(t) is said to be a (y.126 CHAPTER II. Let G(n) be the set of all sequences 4 that have a (y.C*(x`Oln will be uniformly bounded above by a constant on the complement of G(n).C*(4) <n(H — E'). ENTROPY-RELATED PROPERTIES.. A version of the packing lemma will be used to show that long enough sample paths can. xr.1. For n > M. provides a k > 1/3 and a prefix code ek £(.x i` I e' -k).. using the too-good code Ci on those words that belong to Bi for some i > M.. for the uniform boundedness of E*(4)In on the complement of G(n) combines with the properties (6) and (7) to yield the expected value bound E(C*(X7)) < n(H — 672). independent of n. Here it will be shown how a prefix n-code can be constructed by taking advantage of the structure of sequences in G(n). is at least yn. 0-good representation of xt. with length function E(x) = Theorem 11. using the good code Ck on those words that belong to Dk. be partitioned into disjoint words. for all sufficiently large n. such that the words that belong to B m U . (c) The total length of the w(i) that belong to BmU Bm +1U ... If 8 is small enough and such sample paths are long enough. The existence theorem. 3)-good representation. eventually almost surely. The lemma is enough to contradict Shannon's lower bound on expected code length.4 For n > M.. or belongs to Dk. there is a prefix n-code C n * whose length function E*(4) satisfies (7) . cover at least a y-fraction of total length. It will be shown later that (6) x E G(n). cover at most a 33-fraction of total length. Lemma 11. contradicting Shannon's lower bound. has measure at least 1 — S. .1.. given that (6) holds. Furthermore. Sample paths with this property can be encoded by sequentially encoding the words. To make precise the desired sample path partition concept. (a) Each w(i) belongs to A. let M > 2k IS be a positive integer. such that the set Dk = E(X lic ) < k(H 8)) . a code that will eventually beat entropy for a suitable choice of 3 and M. if the following three conditions hold. for a suitable choice of E. The coding result is stated as the following lemma. or belongs to B1 for some j > M. Let 8 < 1/2 be a positive number to be specified later.

Note that C * (4)/n is uniformly bounded on the complement of G(n). and. But L > yn.C*(X7)) < H(1. Cm+i . Here e (. then a (y. Theorem I. thereby contradicting Shannon's D lower bound on expected code length. where d > log IA I. C m . while the other headers specify which of one of the prefix-code collection {F.4. ) is an Elias code. ' (w(i)) iie(j)Ci(W(i)) if w(i) has length 1 if w(i) E Dk if w(i) E Bi for some j > M.. in turn. by the definition of Dk. . a prefix code on the natural numbers for which the length of e(j) is log j o(log j).1. for x g G(n). on the words of length 1. yields E(. then coding the w(i) in order of increasing i. (0) (w(i)) of the w(i) that belong to Up. say.11(i). for all sufficiently large n. for suitable choice of 8 and M > 2kI8.SECTION 11. so that if L is total length of those w(i) that belong to BM U BA. using the good code Ck on those words that have length k... the per-symbol code C:(4) = OF(xi)F(x2) • • • F(xn). as shown in the following paragraphs. The code C(x) is defined for sequences x E G(n) by first determining a (y. so that. Ck. the contribution . that is.). The goal is to show that on sequences in G(n) the code Cn * compresses more than entropy by a fixed amount.1. so that. using the too-good code on those words that have length more than M. For 4 g G(n). But this. the principal contribution to code length comes from the encodings Cf (u.. and using some fixed code. This is so because. ENTROPY AND CODING. a code word Elk (w(i)) has length at most k(h + 6). 127 for all sufficiently large n.. 8)good representation 4 = w(1)w(2) • • • w(t). F : A 1--* Bd.)cr i' = w(l)w(2) • • • w(t) is determined and the code is defined by C:(4) = lv(1)v(2) • • • v(t).7. . A precise definition of Cn * is given as follows.. 3)-good representation . Appropriate headers are also inserted so the decoder can tell which code is being used.. where OF ((w(i)) (8) v(i) = 10 . since the total length of such w(i) is at most N — L. is used.} is being applied to a given word. the code word Cf (w(i)) is no longer than j (H — c). Likewise.m B. since (1/n)H(A n ) -+ H. and the encodings k(W(i)) of the w(i) that belong to Dk. ignoring the headers needed to specify which code is being used as well as the length of the words that belong to Uj >m BJ . Proof of Lemma 11. by property (c) in the definition of G(n). since the initial header specifies whether E G(n) or not. The code Cn * is clearly a prefix code.1 4. By the definition of the Bi .1 . If E G(n). it takes at most L(H — E) + (N — L)(H 8) to encode all those w(i) that belong to either Dk or to Ui > m Bi .t. then it requires at most L(H — E) bits to encode all such w(i). symbols are encoded separately using the fixed code F.

ignoring the one bit needed to tell that 4 E G(n). If 8 is small enough. Lemma 1. and M > 2k13. for suitable choice of 8 . In summary. provided only that M > 2k18. relative to n. then 8 — ye -1-3d8 will be negative. provided k > 118. then n(H — E'). e'. By assumption. choose L > M so large that if B = UL BP • then . so these 2-bits headers contribute at most 28n to total code length. to total code length from encoding those w(i) that belong to either Dk or to Ui .m B i is upper bounded by — ye). eventually almost surely. provided 8 is small enough and M is large enough.M . w(i)Eup. ENTROPY-RELATED PROPERTIES. M > 2k/3 and log M < 8M.u(B) > y. the total length needed to encode all the w(i) is upper bounded by (10) n(H +8 — ye +3d3). the lemma depends on the truth of the assertion that 4 E G(n). then at most K3n bits are needed to specify the lengths of the w(i) that belong to Ui >m B i . the total contribution of all the headers as well as the length specifiers is upper bounded by 38n ± 26n + IC8n. which establishes the lemma.8. k > 113. Since an Elias code is used to do this. To establish this. 1=.128 CHAPTER II. 4 E G(n). which is certainly negligible. The 1-bit headers used to indicate that w(i) has length 1 require a total of at most 33n bits. if it assumed that log M < 8M. But (1/x) log x is a decreasing function for x > e. so that. The partial packing lemma. which exceeds e since it was assumed that < 1/2 and k > 118. it follows that if 8 is chosen small enough. since the total number of such words is at most 38n. since k < M and hence such words must have length at least k. The words w(i) need an additional header to specify their length and hence which Ci is to be used. Of course. Thus. Adding this bound to the bound (10) obtained by ignoring the headers gives the bound L*(4) < n[H 8(6 ± 3d K)— yen on total code length for members of G(n). still ignoring header lengths. n(H +3 — y(e +8)) n(H (9) Each word w(i) of length 1 requires d bits to encode. so essentially all that remains is to show that the headers require few bits. and there are at most 38n such words. for any choice of 0 < 8 < 1/2. there is a constant K such that at most log t(w(i)) K tv(i)Eui>m B1 E bits are required to specify all these lengths. by property (b) in the definition of G(n). m Bi E since each such w(i) has length M > 2k1S. There are at most nlk words w(i) the belong to Dk or to Ui >m Bi . by property (b) of the definition of G(n). so that log log t(w(i)) < K M n.3. Since d and K are constants.
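The length headers above are encoded with an Elias-type prefix code for the integers, whose length is log j + o(log j). As an illustration, here is a sketch (my own) of the closely related Elias gamma code, whose length is 2⌊log j⌋ + 1, so roughly twice log j; it is a simpler variant than the code the text's length estimate assumes, but it shows the essential point, namely that concatenated integer code words can be parsed back without separators.

```python
def elias_gamma_encode(j):
    # Prefix code for positive integers: floor(log2 j) zeros followed by the
    # binary expansion of j (which starts with a 1); length 2*floor(log2 j)+1.
    assert j >= 1
    binary = bin(j)[2:]
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits, pos=0):
    # Returns (integer, next position); the run of leading zeros announces the
    # length of the binary expansion, which is what makes this a prefix code.
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    value = int(bits[pos + zeros:pos + 2 * zeros + 1], 2)
    return value, pos + 2 * zeros + 1

# Concatenated code words parse back uniquely, with no separators needed.
stream = "".join(elias_gamma_encode(j) for j in (1, 5, 17, 100))
pos, decoded = 0, []
while pos < len(stream):
    j, pos = elias_gamma_decode(stream, pos)
    decoded.append(j)
print(decoded)  # [1, 5, 17, 100]
```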

that is. if E .1. eventually almost surely. eventually almost surely. vi ] meets none of the [si . by the ergodic theorem. In particular. eventually almost surely. vi that meet the boundary of at least one [s ti] is at most 2nk/M. The preceding paragraph can be summarized by saying that if x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk then 4 must belong to G(n).SECTION 11. On the other hand. Lemma 1. . since it was assumed that M > 2k/ 6 . t] has length at least M. so the total length of those [ui .l. then the collection {[sj. (This argument is essentially just the proof of the two-packings lemma. and (c) hold. is to say that there is a disjoint collection {[u„ yi ]: i E / } of k-length subintervals of [1. n] of total length at least (1 — 20n for which each 4.d Another proof of the entropy theorem.s. ti]UU[ui. and I Dn I < 2 n(H+E) . E Dk and a disjoint collection {[si. E /*} is disjoint and covers at least a (1 — 30-fraction of [1. To say that x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. But this completes the proof that 4 E G(n).9. Since each [s1 . and let {C„} be a prefix-code sequence such that lim sup. ti]: j E U {[ui. Fix c > 0 and define D„ = 14: £(4) < n(H so that x Thus.3. n — k 1] that are starting places of blocks from Dk. Lemma 1. ti ]: j E J } of M-length or longer subintervals of [1. eventually almost surely. since. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. for which each xs% E B. 4 is (1 — 0-strongly-covered by Dk. But the building-block lemma. . since Cn is invertible.C(x`11 )1 n < H. 1E1. This completes the second proof that too-good codes do not exist.7. 129 provides. = . suggested in Exercise 5. a. II. The entropy theorem can be deduced from the existence and nonexistence theorems connecting process entropy and faithful-code sequences. Dn . Section 1.u(x'1') < 2 1'1+2} . as follows. ENTROPY AND CODING.) The words v:: j E /*I {x:j E J} U {xu together with the length 1 words {x s I defined by those s U[s . provide a representation x = w(1)w(2) • • • w(t) for which the properties (a).3. n]. a y-packing of 4 by blocks drawn from B. Fix an ergodic measure bt with process entropy H. then implies that xri' is eventually almost surely (1 — 20-packed by k-blocks drawn from Dk. the cardinality of J is at most n/M. B. eventually almost surely. which is upper bounded by 8n. eventually almost surely there are at least 8n indices i E [1. n] of total length at least yn. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. Thus if I* is the set of indices i E I such that [ui .7. ti ]. (b).

eventually almost surely.(4) > 2—n(H. is the all 0 process.130 then /t (Bn CHAPTER II.1 .) 3.1(4) log(pn (4)/vn (4)) be the divergence of An from yn . Show that there is prefix n-code Cn whose length function satisfies £(4) < Kn on the complement of S.. C3 II. for X'il E S. and let £(4) be the length function of a Shannon code with respect to vn . be the code obtained by adding the prefix 1 to the code Cn . where K is a constant independent of n. ENTROPY-RELATED PROPERTIES. Let L (xi') denote the length of the longest string that appears twice in 4. then there is a renewal process v such that lim supn D (gn II vn )/n = oc and lim infn D(pn II vn )/n = 0. Define On (4) = 4)(4). is ergodic then C(4)In cannot converge in probability to the entropy of v. (Hint: code the second occurrence of the longest string by telling how long it is. whose range is prefix free.. a.(4) = 1C:(4). where it starts. such that the first symbol in 0(4) is always a 0. n n 11. xii E Un and en (4) = C(4).. and such that C. is 1-1(pn )± D (An II vn). This exercise explores what happens when the Shannon code for one process is used on the sample paths of another process. (a) For each n let p. Code the remainder by using the Shannon code. L(4)/ log n < C. . almost surely. Add suitable headers to each part to make the entire code into a prefix code and apply Barron's lemma. (C) Show that if p. Let C. Such a code Cn is called a bounded extension of Cn * to An. then there is a constant C such that lim sup. Show that if bt is i..2 guarantees that Je ll g Un .(4 For the lower bound define U„ = Viz: p.1. Thus there is a one-to-one function 0 from Un into binary sequences of length no more than 2 -I. n n and completes this proof of the entropy theorem. Lemma 5.s. and where it occurred earlier.n(H — c).. Show that if y 0 p. (b) Let D (A n il v pi ) = E7x /1.. and note that I Uni < 2n(I I —E) ..i. Le Exercises 1. xi" g Un .. eventually almost surely. Show that the expected value of £(4) with respect to p. n D n) < IDn 12—n(H+2E) < 2—n€ Therefore 4 fl B n n Dn .s. and so xri` g Bn .d. be the projection of the Kolmogorov measure of unbiased coin tossing onto A n and let Cn be a Shannon code for /in . This proves that 1 1 lim inf — log — > H. establishing the upper bound 1 1 lim sup — log ) _< H . eventually almost surely. The resulting code On is invertible so that Theorem 11. Suppose Cn * is a one-to-one function defined on a subset S of An . a. 2.

1.2.2 The Lempel-Ziv algorithm. for j < C. [90]. 100. b(C). The parsing defines a prefix n-code Cn . . that is. that the Lempel-Ziv (LZ) algorithm compresses to entropy in the limit will be given in this section.SECTION 11. . the word-formation rule can be summarized by saying: The next word is the shortest new word. Note that the parsing is sequential. . where the final block y is either empty or is equal to some w(j). 0. and it has been extensively analyzed in various finite and limiting forms. In finite versions it is the basis for many popular data compression packages.. called the (simple) Lempel-Ziv (LZ) code by noting that each new word is really only new because of its final symbol. parses into 1. for ease of reading. b(2). . 101. In its simplest form. W(j)} and xm+ 1 g {w(1). [53]. together with a description of its final symbol. 10. 11001010001000100. If fiz is parsed as in (1).. b(C + 1). w(j) = (i) If xn 1 +1 SI {w(1). defined according to the following rules. Ziv's proof. and uses several ideas of independent interest. called here (simple) LZ parsing. 01.1 denote the least integer function and let f: {0.... A second proof. later words have no effect on earlier words. the words are separated by commas. 00. (b) Suppose w(1) . for example.. THE LEMPEL-ZIV ALGORITHM 131 Section 11. To be precise. . .. 7 + 1 . The LZ algorithm is based on a parsing procedure. and therefore an initial segment of length n can be expressed as (1) xi! = w(1)w(2).. .. . .. the LZ code maps it into the concatenation Cn (eii) = b(1)b(2). n) i-. x is parsed inductively according to the following rules.1 +1 E {WM.. w(C) y.. An important coding algorithm was invented by Lempel and Ziv in 1975. is given in Section 11.. called words. • of variable-length blocks. To describe the LZ code Cn precisely. where. that is.. b(C)b(C + 1) of the binary words b(1).. hence it is specified by giving a pointer to where the part before its final symbol occurred earlier. [91]. w(j)} then w(j ± 1) consists of the single letter xn ro.. w(j + 1) = xn such that xm n . 000. 2. WW1.... which is due to Ornstein and Weiss. (a) The first word w(1) consists of the single letter x1. where m is the least integer larger than n 1 (ii) Otherwise. . .. . Thus.Brl0gn1 and g: A 1--÷ BflogiAll be fixed one-to-one functions.4. a way to express an infinite sequence x as a concatenation x = w(1)w(2) .. let 1.

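A minimal sketch (my own) of the simple parsing rule just described, in which each new word is the shortest block not previously seen as a word; the final block may repeat an earlier word, corresponding to the tail y in (1).

```python
def simple_lz_parse(x):
    # Sequential parsing: the next word is the shortest new word, i.e. the
    # shortest prefix of the remaining symbols not already in the word list.
    words, seen, start = [], set(), 0
    n = len(x)
    while start < n:
        end = start + 1
        while end <= n and x[start:end] in seen:
            end += 1
        word = x[start:end]      # a new word, except possibly the final block
        words.append(word)
        seen.add(word)
        start = end
    return words

print(simple_lz_parse("11001010001000100"))
# ['1', '10', '0', '101', '00', '01', '000', '100']
```

Each word except its last symbol has been seen before, which is exactly what the pointer-plus-final-symbol encoding of the LZ code exploits.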
called topological entropy. then b(C the least integer such that y = w(i). (a) If j < C and w(j) has length 1. individual sequence entropy almost surely upper bounds the entropy given by the entropy theorem. for ergodic processes.16. together with a proof that this individual sequence entropy is an asymptotic upper bound on (1/n)C(4) log n. [90]. (b) If j < C and i < j is the least integer for which w(j) = w(i)a. ENTROPY-RELATED PROPERTIES.eiz) is upper bounded by (C+1)logn-FaC-1. that is. where a and /3 are constants. 1) is empty. 1) = 1 f (i). if x is a typical sequence for the binary i. together with a proof that.132 CHAPTER II.7. Ziv's proof that entropy is the correct almost-sure upper bound is based on an interesting extension of the entropy idea to individual sequences. The entropy given by the entropy theorem is h(p) = —p log p — (1 — p) log(1 — p).6. 34] for discussions of this more general concept. Of course. but gives no information about frequencies of occurrence.d. and part (c) requires at most Flog ni + 1 bits. hence its topological entropy h(x) is log 2 = 1. part (b) requires at most Calog n1 ± Flog I A + 2) bits. k = {a• i• x 1' 1 = all for some i _> The topological entropy of x is defined by 1 h(x) = lim — log 114 (x)I k a limit which exists. process with p equal to the probability of a 1 then every finite string occurs with positive probability. by subadditivity. For example. is the same as the usual topological entropy of the orbit closure of x. then b(j) = 1 f (i)0 g(a).) If p is an ergodic process with entropy h. which is the growth rate of the number of observed strings of length k. as k oc.2.2. so total code length £(. see [84. . The dominant term is C log n. Theorem 11. by Theorem 11. so to establish universality it is enough to prove the following theorem of Ziv. in which C1(4) now denotes the number of new words in the simple parsing (1). a E A. since no code can beat entropy in the limit.) Topological entropy takes into account only the number of k-strings that occur in x. then b(j) = 0 g(w(j)). almost surely. otherwise b(C (c) If y is empty. Lemma 1. for it takes into account the frequency with which strings occur. The k-block universe of x is the set tik(x) of all a l' that appear as a block of consecutive symbols in x.1 (The LZ convergence theorem. for every sequence.i. (The topological entropy of x. then (11n)C(4) log n —> h. Ziv's concept of entropy for an individual sequence begins with a simpler idea. since I/44k (x)i < Vim (x)I • Pk (x)1.1. as defined here. it is enough to prove that entropy is an upper bound.. which depends strongly on the value of p. where i is Part (a) requires at most IA l(Flog IA + 1) bits.

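The topological entropy just defined counts only which k-blocks occur, not how often. The sketch below (my own; the sample process and parameters are illustrative assumptions) computes the k-block universe of a finite sample and the finite-k estimate (1/k) log |U_k|, illustrating the point made above: for an i.i.d. binary process with a small probability of a 1, every block eventually occurs, so the limiting topological entropy is log 2 = 1 even though the process entropy is much smaller.

```python
import math
import random

def k_block_universe(x, k):
    # All k-blocks that occur (with overlap) in the finite sample x.
    return {tuple(x[i:i + k]) for i in range(len(x) - k + 1)}

def topological_entropy_estimate(x, k):
    # Finite-k version of h(x) = lim (1/k) log |U_k(x)|, in bits.
    return math.log2(len(k_block_universe(x, k))) / k

# Illustrative sample: i.i.d. bits with P(1) = 0.1; process entropy is about
# 0.47 bits, but in the limit every block occurs, so h(x) = 1.
random.seed(1)
x = [1 if random.random() < 0.1 else 0 for _ in range(100000)]
for k in (2, 4, 8, 12):
    print(k, round(topological_entropy_estimate(x, k), 3))
```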
Theorem 11.2. The first lemma obtains a crude upper bound by a simple worst-case analysis.2. Proof The most words occur in the case when the LZ parsing contains all the words of length 1.SECTION 11. almost surely. The first gives a simple bound (which is useful in its own right).t--ns n is per-letter Hamming distance. Ziv established the LZ convergence theorem by proving the following two theorems.1... when n= In this case. 133 The concept of Ziv-entropy is based on the observation that strings with too small frequency of occurrence can be eliminated by making a small limiting density of changes.1.2.2. The Ziv-entropy of a sequence x is denoted by H(x) and is defined by H(x) = lim inf h(y). where d(4 .) There exists Bn —O.4 (The crude bound. all the words of length 2. )1). The proof of the upper bound. x E A c° . which also have independent interest.i (xii . will be carried out via three lemmas.3 (The Ziv-entropy theorem. which. n (in + 1 )1Al m+1 jiAli and C = . — 1) and C IA — 1). leads immediately to the LZ convergence theorem. y).2.) If is an ergodic process with entropy h then H(x) < h..2. y) = lim sup d. 0 such that C(x) log n n <(1 + 3n)loglAl. defined as before by c1(x . . and hence C logn ' n log I AI.. Theorem 11. The natural distance concept for this idea is cl(x. Lemma 11. up to some length. X E A°°. Theorem 11. THE LEMPEL-ZIV ALGORITHM. almost surely.1 ) = 1 .2. Theorem 11. )7. Theorem 11.2. that is. combined with the fact it is not possible to beat entropy in the limit. say m. These two theorems show that lim supn (l/n)C(x7) log n < h. the second gives topological entropy as an upper bound. and the third establishes a ii-perturbation bound.=.) lim sup C(xti') log n H(x).2 (The LZ upper bound.

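A quick numerical check (my own) of the worst case behind the crude bound: if the parsing uses every word of each length 1, ..., m exactly once, then n = Σ_{j≤m} j|A|^j and C = Σ_{j≤m} |A|^j, and the ratio C log n / n decreases toward log |A|, though slowly, roughly like (m + 1 + log m)/(m − 1).

```python
import math

def crude_bound_point(alphabet_size, m):
    # Worst case for the word count: the parsing contains every word of each
    # length 1..m exactly once and nothing longer.
    n = sum(j * alphabet_size ** j for j in range(1, m + 1))
    C = sum(alphabet_size ** j for j in range(1, m + 1))
    return n, C, C * math.log2(n) / n

for m in (4, 8, 12, 16):
    n, C, ratio = crude_bound_point(2, m)
    print(f"m={m:2d}  n={n:10d}  C={C:8d}  C*log2(n)/n = {ratio:.3f}")
# The last column decreases slowly toward log2|A| = 1, as in the crude bound.
```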
i=1 The desired upper bound then follows from the bound bounds . as k oc. ()) (w(i).. v(i)) < .. y) <S. there is a S >0 such that if d( li m sup n—>oo C(. where t(v(i)) = t(w(i)).5 (The topological entropy bound. together with the fact that hi(x) — Ei < h(x) h (x) where Ei —> 0 as j oc. . ENTROPY-RELATED PROPERTIES. j=1 m+1 The bounds (2). must give an upper bound. To establish the general case in the form stated.x'iz) log n h(y) + Proof Define the word-length function by t(w) = j. tj = Er (en-El The second bounding result extends the crude bound by noting that short words don't matter in the limit. 2.6 (The perturbation bound. for the LZ parsing may be radically altered by even a few changes in x. all of which are valid for t > 1 and are simple consequences of the well-known formula 0 — t)/(t — 1). and the t < - t mtm. Fix S > 0 and suppose d(x. for w E Ai. The word w(i) will be called well-matched if dt(u.2. Lemma 11. —> 0. then Given x and E > 0. there exists 8. and hence the rate of growth of the number of words of length k.) For each x E A'.2. y) < S..) x. Let x = w(1)w(2) .134 CHAPTER II.. Lemma 11. 1 t m it . for k > 1. ei. tm 1 m(t _ i) _< E (2) m (t— 1 i Er ti < t"/(t — 1).. that is. the topological entropy. i = 1... such that C(x) log n n — Proof Let hk(x) = (1/k) log lUk(x)1. first choose m such that n< j=1 jlAli. be the LZ parsing of x and parse y as y = v(1)v(2) . The third lemma asserts that the topological entropy of any sequence in a small dneighborhood of x is (almost) an upper bound on code performance. and choose m such that E j2» j=1 „(x) < n < E i2 p„(x) . yield the desired result.. At first glance the result is quite unexpected.

where lim n fi. H(x) < h. together with the fact that hk(Y) --+ h(y).6. almost surely.2. combined with (3) and the fact that f(S) --÷ 0. x i+ 1 -T k11 • > 1 — E. the blowup-bound lemma.2.4 gives the bound (3) (1 n/ +an) log n'a log I AI where limn an = O. This bound. hence the same must be true for nonoverlapping k-blocks for at least one shift s E [0. y) < e and h(y) <h E. THE LEMPEL-ZIV ALGORITHM. For each k. and hence it can be supposed that for all sufficiently large n.2.7. follows immediately from Lemma 11.2. the total length of the poorly-matched words in the LZ parsing of 4 is at most rt.5. Thus./TS. By the Markov inequality. is an immediate consequence of the following lemma. n — 11: x in+s+1 lim inf > 1.r8fraction of x. and let Gk(y) be the set of all v(i) for which w(i) E Gk(X). for w(i) E Gk(X). 135 otherwise poorly-matched. Let C1(4) be the number of poorly-matched words and let C2(4) be the number of well-matched words in the LZ parsing of x. k — 1]. Lemma 1.2. The idea for the proof is that 4 +k-1 E Tk for all but a limiting (1 — 0-fraction of indices i. Since dk(w(i). Lemma 11. and choose k so large that there is a set 'Tk c Ak of cardinality less than 2n (h+E) and measure more than 1 — E. first note that lim i+k Ili E [0. Theorem 11.5 then gives C2(4) 5_ (1 + On)r— (h(Y) gn f (8)). implies LI the desired result. The LZ upper bound. completing the proof of Lemma 11.SECTION 11. The resulting nonoverlapping blocks that are not in 'Tk can then be replaced by a single fixed block to obtain a sequence close to x with topological entropy close to h. the poorly-matched words cover less than a limiting . k —1] such that i. there is a sequence y such that cl(x. The cardinality of Gk(Y) is at most 2khk (Y ) . Fix a frequency-typical sequence x.E. n — 1]. Proof Fix c > 0. since this is the total number of words of length k in y. v(i)) < N/75. Lemma 11.7 Let p. To make the preceding argument rigorous.n+s +k E Ed' I{i E [0. yields IGk(x)I where f(S) < 2k(h(y)+ f (6)) 0 as 8 O. n—>oo .6. be an ergodic process with entropy h. Now consider the well-matched words.2. = O. Thus there must be an integer s E [0. let Gk(x) be the set of wellmatched w(i) of length k. For any frequency-typical x and c > O.2. Lemma 11. since x was assumed to be frequency-typical.2. 0 The desired inequality.

2. which improves code performance. — n Fix a E A and let ak denote the sequence of length k. where the initial block u has length s. Theorem 11. called the slidingwindow version. then defines .2. and hence h(y) < h + E. Remark 11. 0 and. and (M — 2k)/ k < m < M/k. "new" meant "not seen as a word in the past.2. where q and r have length at most k. of words have been seen. until it reaches the no-th term. thereby.7. each w(i) has length k. One version. which has also been extensively analyzed.10 For computer implementation of LZ-type algorithms some bound on memory growth is needed.. all of whose members are a. the proof of the Ziv-entropy theorem.9 Another variation..2.8 Variations in the parsing rules lead to different forms of the LZ algorithm.. which can be expressed by saying that x is the concatenation x = uw(1)w(2).2. produces slightly more rapid convergence and allows some simplification in the design of practical coding algorithms. and (4) liminf n—. say Co. for now the location of both the start of the initial old word. Lemma 1. Since each v(i) E U {a l } . new words tend to be longer.. This modification. known as the LZW algorithm. defines a word to be new only if it has been seen nowhere in the past (recall that in simple parsing.. This completes the proof of Lemma 11.") In this alternative version. a member of the m-block universe Um (y) of y has the form qv(j)v(j + D. The sequence y is defined as the concatenation y = uv(1)v(2)..136 CHAPTER II. Nevertheless. but more bookkeeping is required..7. remains true. the rate of growth in the number of new words still determines limiting code length and the upper bounding theorem. implies that h m (y) < h + 6 ± 3m. Condition (4) and the definition of y guarantee that ii(x. If M is large relative to k. ENTROPY-RELATED PROPERTIES.3.2. Remark 11. see Exercise 1. Another version. after which the next word is defined to be the longest block that appears in the list of previously seen words.2.00 Ri < n: w(i) E TxII > 1 — E. a set of cardinality at most 1 + 2k(h +E) . Theorem 11. as well as its length must be encoded. • v(j + in — 1)r. y) <E. Tic where 3A1 ---* 0. Remark 11. starts in much the same way. also discussed in the exercises. the built-up set bound. includes the final symbol of the new word in the next word. where w(i) ak if w(i) E 'Tk otherwise. One method is to use some form of the LZ algorithm until a fixed number.6. It is used in several data compression packages.

rit — 1]: in X: . Section 11. [52]. In this section. both of which will also be discussed. and that a small > 21th exponential factor less than 2kh is not enough. EMPIRICAL ENTROPY.. the coding of a block may depend only on the block.) 2. ic.2.9 achieves entropy in the limit.3. II. Furthermore.3 Empirical entropy. 137 the next word as the longest block that appears somewhere in the past no terms. asserting that eventually almost surely only a small exponential factor more than 2kh such k-blocks is enough. In many coding procedures a sample path 4 is partitioned into blocks. the method of proof leads to useful techniques for analyzing variable-length codes.SECTION 11. about the covering exponent of the empirical distribution of fixed-length blocks will be established. here called the empirical-entropy theorem. is defined by qic(cli = Ri E [0. Show that the LZW version discussed in Remark 11. which can depend on n.2. These will be discussed in later sections. or it may depend on the past or even the entire sample path. see [87.8 achieves entropy in the limit. see Exercise 1. then E log Li < C log(n/C). through the empirical distribution of blocks in the sample path. (Hint: if the successive words lengths are . The blocks may have fixed length. The empirical-entropy theorem has a nonoverlapping-block form which will be discussed here. e. and can be shown to approach optimality as window size or tree storage approach infinity. provided only that n _ . 3. g. . a beautiful and deep result due to Ornstein and Weiss. and an overlapping-block form.2. The nonoverlapping k-block distribution. Furthermore. such as the Lempel-Ziv code. 0 < r < k. then each block is coded separately. or variable length. t2. The theorem is concerned with the number of k-blocks it takes to cover a large fraction of a sample path. • • . .cc+ -E =a } I where n = km + r. The theorem. as well as other entropy-related ideas. Such algorithms cannot achieve entropy for every ergodic process.. suggests an entropy-estimation algorithm and a specific coding technique. but do perform well for processes which have constraints on memory growth. Show that simple LZ parsing defines a binary tree in which the number of words corresponds to the number of nodes in the tree. Show that the version of LZ discussed in Remark 11.a Exercises 1. 88].

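A minimal sketch (my own; the sample sequence and block length are arbitrary assumptions) of the nonoverlapping k-block distribution q_k(·|x_1^n) just defined, together with its per-symbol entropy (1/k)H(q_k), the kind of empirical quantity studied in this section.

```python
import math
from collections import Counter

def nonoverlapping_block_distribution(x, k):
    # q_k(a_1^k | x_1^n): relative frequency of each k-block among the
    # m = n // k nonoverlapping blocks of x_1^n (the final r < k symbols
    # are ignored).
    m = len(x) // k
    counts = Counter(tuple(x[i * k:(i + 1) * k]) for i in range(m))
    return {block: c / m for block, c in counts.items()}

def per_symbol_entropy(qk, k):
    # (1/k) H(q_k), in bits per symbol.
    return -sum(p * math.log2(p) for p in qk.values()) / k

# Illustrative sample and block length (assumptions, not from the text).
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
k = 4
qk = nonoverlapping_block_distribution(x, k)
print(sorted(qk.items()))
print(round(per_symbol_entropy(qk, k), 3))
```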
4.a Strong-packing. for unbiased coin-tossing the measure qk (. in particular. (1 — O n. the empirical measure qk(. . For each c > 0 and each k there is a set 'Tk (e) c A" for which l'Tk (e)I < 2k(h +E) . For example. The theorem is. and there is no is a set T(E) c A k such that l'Tk(E)I _ set B C A k such that IB I _< 2k(h--E) and .14) is not close to kt k in variational distance for the case n = k2". a finite sequence w is called a word. then most of the path is covered by words that come from fixed collections of exponential size almost determined by entropy.) Let ii be an ergodic measure with entropy h > O. A fixed-length version of the result is all that is needed for part (a) of the entropy theorem.a(B) > E. A parsing X n i = w(l)w(2) • • • w(t) is (1 — e)-built-up from {'Tk}. the empirical distribution qk(. such that Tk c A k . Part (a) of the empirical-entropy theorem is a consequence of a very general result about parsings of "typical" sample paths. II. an empirical distribution form of entropy as covering exponent.uk. if the short words cover a small enough fraction. Strong-packing captures the essence of the idea that if the long words of a parsing cover most of the sample path. w(2). as it will be useful in later sections. Fix a sequence of sets {E}. the empirical and true distributions will eventually almost surely have approximately the same covering properties. ENTROPY-RELATED PROPERTIES.3. Theorem 11. The surprising aspect of the empirical-entropy theorem is its assertion that even if k is allowed to grow with n. or (1 — e)-packed by {TO.. A sequence 4 is (K. if the words that belong to U'rk cover at least a (1 — E)-fraction of n. x) such that if k > K and n > 2kh . hence. then (a) qk(Z(E)14)) > 1 — E. in spite of the fact that it may not otherwise be close to iik in variational or even cik -distance. . those long words that come from the fixed collections . t(w(I))<K E - 2 is (1 — E)-built-up from MI. word length is denoted by £(w) and a parsing of 4 is an ordered collection P = {w(1). Eventually almost surely.7.1 (The empirical-entropy theorem. w(t)} of words for which 4 = w(1)w(2) • • • w(t). The ergodic theorem implies that for fixed k and mixing kt.. see Exercise 3. subject only to the condition that k < (11 h) log n. for k > 1. c)-strongly-packed by MI if any parsing . E w(i)Eurk f(w(i)) a. that is.138 CHAPTER II.I4) eventually almost surely has the same covering properties as ktk. but the more general result about parsings into variable-length blocks will be developed here. there < 2k(h+f) and pl (Tk(E)) > 1 — e.elz = w(1)w(2) • • • w(t) for which awci» < --En . (b) qk(BI4)) <E. in essence.14) is eventually almost surely close to the true distribution. for the latter says that for given c and all k large enough. . for fixed k.3. As before. for any B c A" for which IBI < 2k( h — e) . such that for almost every x there is a K = K(e. Theorem 1.

For k > m. 0-strongly-packed by {Tk for a suitable K. and let 8 be a positive number to be specified later. n — m + 1]. such that both of the following hold. w(i) is (1-0-packed by C m . Lemma 1. let 'Tk be the set of sequences of length k that are (1 — 3)-built-up from Cm . w(i) E Te(w(j)). Lemma 11.SECTION 11. eventually almost surely most indices in x7 are starting places of m-blocks from Cm . 4 is (1 — 3 2 /2)-strongly-covered by Cm .2 (The strong-packing lemma. by the built-up set lemma.) Let p. and if f(w(i)) ?_ 2/3.(ar) > 2 -m(h+ 5)} has measure at least 1 — 3 2 /4. by the packing lemma.3. by the Markov inequality. By the built-up set lemma. The key result is the existence of collections for which the 'Tk are of exponential size determined by entropy and such that eventually almost surely x is almost strongly-packed. < 2k(h+E) for k > K. if necessary. a set 'Tk = 'Tk(E) C A k . It remains to show that eventually almost surely x'IL is (K. fix c > 0. Lemma 1. the words w(i) that are not (1 — 3/2)-stronglycovered by Cm cannot have total length more than Sn < cn12. then. by the packing lemma. . EMPIRICAL ENTROPY 139 {'Tk } must cover most of the sample path. most of 4 must be covered by those words which themselves have the property that most of their indices are starting places of members of Cm . The entropy theorem provides an integer m and a set Cm c Am of measure close to 1 and cardinality at most 2m (h +6) • By the ergodic theorem.3. If a word is long enough. Proof The idea of the proof is quite simple. (ii) If x is (1 —3 2 /2)-strongly-covered by C m and parsed as 4 = w(1)w(2) • • w(t). however. k > in. for each k > K. that is. By making 8 smaller. The entropy theorem provides an m for which the set Cm = {c4n : p. then.6. The collection 'Tk of words of length k that are mostly built-up from Cm has cardinality only a small exponential factor more than 2"") . the sequence 4 has the property that X: +171-1 E Cm for at least (1 — 3 2 /2)n indices i E [1. (iii) If w(i) is (1 — 3/2)-strongly-covered by C m . it can be supposed that 3 < 12.7. (a) ITkI_ (b) x is eventually almost surely (K.3. (i) The ergodic theorem implies that for almost every x there is an N = N(x) such that for n > N. be an ergodic process of entropy h and let E be a positive number There is an integer K = K(e) and. To make the outline into a precise proof. by the Markov inequality. and most of its indices are starting places of members of Cm . 0-strongly-packed by {7 } . then the word is mostly built-up from C m . This is a consequence of the following three observations. it can be supposed that 6 is small enough to guarantee that iTk i < 2k(n+E) . But if such an x is partitioned into words then.3. that is.

since k < Sn. 8)-strongly-packed by {Tk}. except possibly the last one. and t(w(t + 1)) < k.3.. The second part encodes successive kblocks by giving their index in the listing of B.(1 — 2 8 )n. If a block belongs to Tk — B then its index in some ordering of Tk is transmitted. This completes the proof of Lemma 11. so that if B covers too much the code will beat entropy. Fix x such that 4 is (K.2k(h+') for k a K. such that eventually almost surely most of the k-blocks in an n-length sample path belong to Tk . ENTROPY-RELATED PROPERTIES. The definition of strong-packing implies that (n — k + 1)qk('Tk I4) a. and such that x. Suppose xtiz is too-well covered by a too-small collection B of k-blocks. An informal description of the proof will be given first. if the block belongs to B. while the last one has length less than Sn. and choose K (x) > K such that if k> K(x) and n> 2kh . this contributes asymptotically negligible length if I BI is exponentially smaller than 2kh and n > 2kh .140 CHAPTER II.3. If B is exponentially smaller than 2" then fewer than kh bits are needed to code each block in B. or by applying a fixed good k-block code if the block does not belong to B. 8)-strongly-packed by {T} for n a N(x). i < t. such that Irk I . This completes the proof of part (a) of the empirical-entropy 0 theorem. where t(w(i)) = k. The strongpacking lemma provides K and 'Tk c A " . this requires roughly hk-bits. for if this is so. The first part is an encoding of some listing of B. and if the parsing x 11 = w(1)w(2) • • • w(t) satisfies —2 t(w(i))<K then the set of w(i) for which f(w(i)) > K and w(i) E U Tk must have total length at El least (1 — On. From the preceding it is enough to take K > 2/8. Fix k > K(x) and n a 2kh . since the left side is just the fraction of 4 that is covered by members of Tk . Fix c > 0 and let S < 1 be a positive number to be specified later. are longer than K. II.' is (K. The good k-code used on the blocks that are not in B comes from part (a) of the empirical-entropy theorem. Thus. dividing by n — k + 1 produces qk(Tkixriz ) a (1 — 28)(1 —8) > (1 — c).. and suppose xril = w(1)w(2) • • • w(t)w(t + 1). If the block does . eventually almost surely. then n > N(x) and k < Sn. which supplies for each k a set 'Tk c A k of cardinality roughly 2. Proof of part (b). A simple two-part code can be constructed that takes advantage of the existence of such sets B. All the blocks.b Proof of the empirical-entropy theorem. if n > N(x).2. for suitable choice of S. for some k. provided only that n > 2kh . Proof of part (a).

_.8. (log n)/h] for which there is a set B c A k such that the following two properties hold. D It will be shown that the suggested code Cn beats entropy on B(n) for all sufficiently large n. say G: Tk i. 1} rk(4 ±3)1 . B(n). Several auxiliary codes are used in the formal definition of C. or the following holds. To proceed with the rigorous proof. say.. A faithful single letter code. This fact implies part (b) of the empirical-entropy theorem. (ii) I BI <2k(h6) and qk (BI4) > E. Let K be a positive integer to be specified later and for n > 2/(4 let B(n) be the set of 4 for which there is some k in the interval [K. F: A i-. since no prefix-code sequence can beat entropy infinitely often on a set of positive measure. Part (a) of the theorem provides a set Tk c ilk of cardinality at most 2k(h+3) . for almost every x.{0. eventually almost surely. if 8 is small enough and K is large enough.8. for each k.3. 1}rk(h-01. then property (1) must hold for all n > 2. so that if k is enough larger than K(x) to guarantee that nkh Z > N(x). and.B: B i . then qk(BI4) < c. (1) If B C A k and I B I < 21* -0 .E ) . 1}1 1 05 All is needed along with its extension to length in > 1. . say. an integer K(x) such that if k > K(x) and n > 2" then qk(TkIx) > 1 . for K(x) < k < loh gn. Also needed for each k is a fixed-length faithful coding of Tk. F(x. which implies part (b) of the empirical-entropy theorem. eventually almost surely. 141 not belong to Tk U B then it is transmitted term by term using some fixed 1-block code. (i) qic(Tk14) ?_ 1 . Let 1 • 1 denote the upper integer function. since such blocks cover at most a small fraction of the sample path this contributes little to overall code length.{0.3. Indeed. By using the suggested coding argument it will be shown that if 8 is small enough and K large enough then xn i . EMPIRICAL ENTROPY. eventually almost surely. so that if K(x) < k < (log n)/ h then either 4 E B(n).{0. If 4 g B(n). defined by Fm (X) = F(xi)F(x2). fix c > 0 and let 8 be a positive number to be specified later. the definition of {rk } implies that for almost every x there is an integer K (x) > K such that qk(Tk14) > 1 .. which establishes that 4 g B(n). Mk. and a fixed-length faithful code for each B c A k of cardinality at most 2k(h. n ). for n > N(x).SECTION 11. then for almost every x there is an N(x) such that x'ii g B(n).

and øk(B) determines B. since reading from the left. The two-bit header on each v(i) specifies whether w(i) belongs to B. for each k and B C A k . Once k is known. or to neither of these two sets. The code is defined as the concatenation C(4) = 10k 1E(1B1)0k (B)v( 1)v(2) • • • v(t)v(t + 1). for all sufficiently large n. the block 0k1 determines k. a fact stated as the following lemma. Finally. and qk = clk(Blx'1 2 ). a prefix code such that the length of E(n) is log n + o(log n). Fk.B(w(i)) OlG k (w(i)) 11 Fk(w(i)) 11Fr (w(t +1) 0 w(i) E B W(i) E Tk .4). since no sequence of codes can beat entropy infinitely often on a set of positive measure. Gk. so that if K is large enough and 8 small enough then n(h — c 2 12). (log n)/11] and a set B C A k of cardinality at most 2k(h-E ) are determined such that qk(B14') > c. and 4 is parsed as 4 = w(l)w(2) • • • w(t)w(t + 1). For 4 E B(n). and Fr are now known. that is. the first bit tells whether x`i' E B(n). then an integer k E [K. If 4 V B(n) then Cn (4) = OFn (4).i<t i = t +1.3. given that x til E B(n). Lemma 11. the principal contribution to total code length £(4) = £(41C) comes from the encoding of the w(i) that belong to rk U B. since each of the codes Mk.142 CHAPTER II. and as noted earlier. The lemma is sufficient to complete the proof of part (b) of the empirical-entropy theorem. implies part (b) of the empirical-entropy theorem. k). from which w(i) can then be determined by using the appropriate inverse. for 4 E B(n). for the block length k and set B c itic used to define C(. in turn. ( (I BI) determines the size of B. ENTROPY-RELATED PROPERTIES.C(x) 15_ tkqk(h — 6) + tk(1 — qk)(h + 6) + (3 + K)0(n). and.B w(i) g Tk UB. If 4 E B(n).3 I Mk. (3) . to Tic — B. where aK —> 0 as K —> oc. this. This is because tkq k (h — c) + tk(1 — q k )(h + 3) < n(h + 3 — 6(6 + S)). eventually almost surely. and w(t + 1) has length r = [0.B. where (2) v(i) = The code is a prefix code. xri' E B(n). But this implies that xriz V B(n). where w(i) has length k for i < t. The rigorous definition of Cn is as follows. since qk > E. let Ok(B) be the concatenation in some order of {Fk (4): 4 E B } . and let E be an Elias prefix code on the integers. 1=1 . 4 E B(n).
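The arithmetic behind this last step can be spelled out as follows (a routine verification, not in the original, using only the facts q_k ≥ ε and tk ≤ n noted above; for small ε and δ the factor in parentheses is positive, since ε may be assumed smaller than h):

\begin{aligned}
tkq_k(h-\varepsilon) + tk(1-q_k)(h+\delta)
  &= tk\bigl(h+\delta-q_k(\varepsilon+\delta)\bigr)\\
  &\le tk\bigl(h+\delta-\varepsilon(\varepsilon+\delta)\bigr)
   \le n\bigl(h+\delta-\varepsilon(\varepsilon+\delta)\bigr),
\end{aligned}

and the right-hand side is at most n(h − ε²/2) as soon as δ ≤ ε²/(2(1 − ε)), which is the sense in which "δ small enough" is used.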

Thus with 6 ax = K(1 + log IA1)2 -KE — = o(K). xn from an unknown ergodic process it.3. .3. each of which requires k(h — E) bits. together these contribute at most 3(t + 1) bits to total length. which. then H(qk)1 k will converge almost surely to H(k)/k. of part (b) of the empirical-entropy theorem used a counting argument to show that the cardinality of the set B(n) must eventually be smaller than 2nh by a fixed exponential factor.'. hence at most 80(n) bits are needed to encode them.3. The encoding Øk(B) of B takes k Flog IAMBI < k(1+ log I A1)2 k(h. each of which requires k(h +3) bits. thereby establishing part (b) of the empirical-entropy theorem.3.SECTION 11. There are at most t + 1 blocks. as well as the extra bits that might be needed to round k(h — E) and k(h +3) up to integers.3. the goal is to estimate the entropy h of II. which is. as well as a possible extra bit to round up k(h — E) and k(h +8) to integers.q) blocks that belong to B.c Entropy estimation.4 The original proof. II. K the encoding of 4k(B) and the headers contributes at most naK to total length. and (t — tqk(B 14)) blocks in Tk — B. and then take H (qk )/ k as an estimate of h. El Remark 11. and there are at most 1 +3t such blocks. The coding proof given here is simpler and fits in nicely with the general idea that coding cannot beat entropy in the limit. EMPIRICAL ENTROPY. If k is fixed and n oc. It takes at most k(1 + log I Ai) bits to encode each w(i) Tk U B as well as to encode the final r-block if r > 0. [ 52].E ) bits. (5) provided K > (E ln 2) . -KEn. This completes completes the proof of Lemma 11. since t < nlk. A simple procedure is to determine the empirical distribution of nonoverlapping k-blocks. This gives the dominant terms in (3). each requiring a two-bit header to tell which code is being used. X2. in turn. since there are tqk(Bl. upper bounded by K (1+ loglA1)2. and n> k > K. Ignoring for the moment the contribution of the headers used to describe k and B and to tell which code is being applied to each k-block.3. Adding this to the o(n) bound of (4) then yields the 30(n) term of the lemma. The header 10k le(I BD that describes k and the size of B has length (4) 2 +k+logIBI o(logIBI) which is certainly o(n). converges to h as . in turn. 143 Proof of Lemma 11. Given a sample path Xl. A problem of interest is the entropy-estimation problem. since k > K and n > 2". since log I BI < k(h — E) and k < (log n)I n. The argument parallels the one used to prove the entropy theorem and depends on a count of the number of ways to select the bad set B. a quantity which is at most 6n/K. the number of bits used to encode the blocks in Tk U B is given by tqk (Bix)k(h — E) + (t — tq k (BI4)) k(h +3).3.

k → ∞, at least for a totally ergodic μ. Thus, at least if it is assumed that μ is totally ergodic, there is some choice of k = k(n) as a function of n for which the estimate will converge almost surely to h.

At first glance, the choice of k(n) would appear to be very dependent on the measure μ, because, for example, there is no universal choice of k = k(n) with k(n) → ∞ for which the empirical distribution q_{k(n)}(·|x_1^n) is close to the true distribution μ_{k(n)} for every mixing process μ, see Exercise 2. The empirical-entropy theorem does imply a universal choice for the entropy-estimation problem, however. The general result may be stated as follows.

Theorem II.3.5 (The entropy-estimation theorem.) If μ is an ergodic measure of entropy h > 0, and if k(n) → ∞ and k(n) ≤ (1/h) log n, then

(6) lim_{n→∞} (1/k(n)) H(q_{k(n)}(·|x_1^n)) = h, almost surely.

Thus, if k(n) → ∞ and k(n) ≤ log_{|A|} n, then (6) holds for any ergodic measure μ with alphabet A. In particular, the choice k(n) ~ log n works for any binary ergodic process, while the choice k(n) ~ log log n works for any finite-alphabet ergodic process.

Proof. Fix ε > 0 and let {T_k(ε) ⊂ A^k: k ≥ 1} satisfy part (a) of the empirical-entropy theorem, so that for almost every x, q_k(T_k(ε)|x_1^n) ≥ 1 − ε, for k ≥ K(x), n ≥ 2^{kh}. Let U_k be the set of all a_1^k ∈ T_k(ε) for which q_k(a_1^k|x_1^n) ≤ 2^{−k(h+2ε)}, and note that

(7) q_k(U_k|x_1^n) ≤ |T_k(ε)| 2^{−k(h+2ε)} ≤ 2^{k(h+ε)} 2^{−k(h+2ε)} = 2^{−εk}.

Next let V_k be the set of all a_1^k for which q_k(a_1^k|x_1^n) ≥ 2^{−k(h−2ε)}, so that |V_k| ≤ 2^{k(h−2ε)}. Part (b) of the empirical-entropy theorem implies that for almost every x there is a K(x) such that q_k(V_k|x_1^n) < ε, for k ≥ K(x), n ≥ 2^{kh}. This bound, combined with the bound (7) and part (a) of the empirical-entropy theorem, implies that for almost every x there is a K_1(x) such that q_k(G_k|x_1^n) ≥ 1 − 3ε, for k ≥ K_1(x), n ≥ 2^{kh}, where G_k = T_k(ε) − U_k − V_k. In summary, for all large enough k and all n ≥ 2^{kh}, the set of a_1^k for which

2^{−k(h+2ε)} ≤ q_k(a_1^k|x_1^n) ≤ 2^{−k(h−2ε)}

has q_k(·|x_1^n)-measure at least 1 − 3ε. The same argument that was used to show that entropy-rate is the same as decay-rate entropy can now be applied to complete the proof of Theorem II.3.5.

Remark II.3.6 The entropy-estimation theorem is also true with the overlapping block distribution in place of the nonoverlapping block distribution.
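For concreteness, here is a small Python sketch of this estimator (an illustration only, not from the text); the default choice k = ⌊(1/2) log₂ n⌋ is a stand-in, safe for binary data, for the universal choices just discussed.

import math
from collections import Counter

def empirical_entropy_estimate(x, k=None):
    # Estimate h by H(q_k(. | x_1^n)) / k using nonoverlapping k-blocks.
    # If k is not supplied, k = floor(0.5 * log2 n) is used, a choice
    # which for binary data satisfies k(n) <= (1/h) log n.
    n = len(x)
    if k is None:
        k = max(1, int(0.5 * math.log2(n)))
    m = n // k
    counts = Counter(tuple(x[i * k:(i + 1) * k]) for i in range(m))
    H = -sum((c / m) * math.log2(c / m) for c in counts.values())
    return H / k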

II.3.d Universal coding.

A universal code for the class of ergodic processes is a faithful code sequence {C_n} such that for any ergodic process μ,

lim_{n→∞} ℓ(C_n(x_1^n))/n = h, almost surely,

where h is the entropy of μ. Such a code sequence was constructed in Section II.1, and it was also shown that the Lempel-Ziv algorithm gives a universal code. The empirical-entropy theorem provides another way to construct a universal code. (For simplicity it is assumed that the alphabet is binary.) The steps of the code are as follows.

Step 1. Partition x_1^n into blocks of length k = k(n) ~ (1/2) log_{|A|} n.

Step 2. Transmit a list of these k-blocks in order of decreasing frequency of occurrence in x_1^n.

Step 3. Encode successive k-blocks in x_1^n by giving the index of the block in the code book.

Step 4. Encode the final block, if k does not divide n, with a per-symbol code.

The number of bits needed to transmit the list is short, relative to n, since k ~ (1/2) log_{|A|} n. High frequency blocks appear near the front of the code book and therefore have small indices; since the number of bits needed to transmit an index is of the order of magnitude of the logarithm of the index, the empirical-entropy theorem guarantees good performance.

To make this sketch into a rigorous construction, fix {k(n)} for which k(n) ~ (1/2) log n. Let ε be a prefix code on the natural numbers such that the length of the word ε(j) is log j + o(log j), an Elias code. Suppose n = tk + r, r ∈ [0, k), and x_1^n = w(1)w(2)···w(t)v, where each w(i) has length k and v is the final block, of length r.

The code word C_n(x_1^n) is a concatenation of two binary sequences, a header of fixed length m = k2^k, followed by a sequence whose length depends on x. The header b_1^m is a concatenation of all the members of {0, 1}^k, subject only to the rule that a_1^k precedes ã_1^k whenever q_k(a_1^k|x_1^n) > q_k(ã_1^k|x_1^n). The sequence {v(j): j = 0, 1, ..., 2^k − 1} of k-blocks listed in this order is called the code book. For 1 ≤ i ≤ t and 0 ≤ j < 2^k, define the address function by the rule A(w(i)) = j, if w(i) = v(j), that is, if w(i) is the j-th word in the code book. Define b_{m+1}^{m+L} to be the concatenation of the ε(A(w(i))) in order of increasing i, where L is the sum of the lengths of the ε(A(w(i))). Finally, define b_{m+L+1}^{m+L+r} to be x_{tk+1}^n. The code C_n(x_1^n) is defined as the concatenation

C_n(x_1^n) = b_1^m · b_{m+1}^{m+L} · b_{m+L+1}^{m+L+r}.

The code length is ℓ(x_1^n) = m + L + r. The header length m depends only on n, while L depends on the distribution q_k(·|x_1^n), but the use of the Elias code insures that C_n is a prefix code.
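The following Python sketch illustrates the address part of this construction (Steps 2 and 3): the code book lists k-blocks by decreasing empirical frequency and each successive block is encoded by a prefix code for its index. It is an illustration only; elias_delta is one concrete prefix code on the integers whose length is log j + o(log j), only the blocks that actually occur are listed, and the header and the final short block are omitted.

from collections import Counter

def elias_gamma(j):
    # Prefix code for j >= 1: (len(binary)-1) zeros followed by the binary digits.
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(j):
    # Prefix code for j >= 1 of length log j + O(log log j).
    b = bin(j)[2:]
    return elias_gamma(len(b)) + b[1:]

def address_code(x, k):
    # Code book: the k-blocks of x in order of decreasing frequency of
    # occurrence (ties broken arbitrarily); each successive k-block is then
    # encoded by the Elias code of its index (shifted by 1 to start at 1).
    m = len(x) // k
    blocks = [tuple(x[i * k:(i + 1) * k]) for i in range(m)]
    freq = Counter(blocks)
    book = sorted(freq, key=lambda blk: -freq[blk])
    index = {blk: j for j, blk in enumerate(book)}
    body = "".join(elias_delta(index[blk] + 1) for blk in blocks)
    return book, body

The full construction lists all of {0, 1}^k in the header, which changes the header cost but not the index coding.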

Another application of the empirical-entropy theorem will be given in the next chapter. n — k + 1]: 444-1 = aDi . The empirical-entropy theorem guarantees that eventually almost surely. let {7-k(E) c Ac} be the sequence given by the empirical-entropy theorem. Show that the empirical-entropy theorem is true with the overlapping block distribution. The reverse inequality follows immediately from the fact that too-good codes do not exist. For each n. A second. an integer K(x) such that qk (GkI4) > qk (7k (c)14) > 1 — E for k > K(x). S. so that lirninf' > (h — c)(1 — c). the crude bound k + o(log n) will do. For those w(i) Gk. The empirical-entropy theorem will be used to show that for any ergodic tt. let Gk denote the first 2k(h +E) members of the code book. where k = k(n). but it is instructive to note that it follows from the second part of the empirical-entropy theorem. a.e Exercises 1.. (8) n-* lim f(x n ) 1 = h. a. for k > K (x). Remark 11.C(4) lim sup — < h. This proves that . even simpler algorithm which does not require that the code book be listed in any specific order was obtained in [43].146 CHAPTER II. These are the only k-blocks that have addresses shorter than k(h — 6) + o(log n). there are at most Et indices i for which w(i) E Bk.) Pk(a in4) E [1. furthermore. let Bk be the first 2k(h-E) sequences in the code book for x. see Section 111.3. s. Thus.3.lfiz). s. and in some ways. Note that IGk I > l'Tk(E)I and. To establish this. II. for k > K(x). since Gk is a set of k-sequences of largest qke probability whose cardinality is at most 2k(h +E ) . and is described in Exercise 4. qk(GkIx t 0 qk(Tk(E)Ixi). in the context of the problem of estimating the measure II from observation of a finite sample path. and hence < (1 — On(h + c) + En + o(log n). which also includes universal coding results for coding in which some distortion is allowed.1. In fact. for almost every x. ENTROPY-RELATED PROPERTIES. n where h is the entropy of it. at least a (1 — E)-fraction of the binary addresses e(A(w(i))) will refer to members of Gk and thus have lengths bounded above by k(h + c)+ o(log n). (Hint: reduce to the nonoverlapping case for some small shift of the sequence. The empirical-entropy theorem provides.3. a. n—k+ 1 in place of the nonoverlapping block distribution qk(. Theorem 11.2.7 The universal code construction is drawn from [49].

j. Make a list in some order of the k-blocks that occur.. of k-blocks {w(i)}. most of the sample path is covered by words that are not much shorter than (log n)I h. Show that the entropy-estimation theorem is true with the overlapping block distribution in place of the nonoverlapping block distribution. The code C(4) is the concatenation of e(kiii.u is unbiased coin-tossing.. then the expected fraction of empty boxes is asymptotic to (1 — 4. If w(i) w(j).} is a universal prefix-code sequence. plus a possible final block y of length less than k. Their results were motivated by an attempt to better understand the Lempel-Ziv algorithm.. An interesting connection between entropy and partitions of sample paths into variable-length blocks was established by Ornstein and Weiss. along with a header to specify the value of k. such that all but the last symbol of each word has been seen before. Another simple universal coding procedure is suggested by the empirical-entropy theorem. w(t)} if it is the concatenation (1) 4 = w(1)w(2)...n (x). (Hint: if M balls are thrown at random into M boxes. Section 11. then (1) is called a parsing (or partition) into distinct 000110110100 = [000] [110] [1101] [00] is a partition into distinct words. for i words. PARTITIONS OF SAMPLE PATHS. except possibly for the final word. while 000110110100 = [00] [0110] [1101] [00] is not. For each k < n construct a code Ck m as follows. such a code requires at most k(1 + log IA I) bits per word. let km.'). be the first value of k E [ 1. Show that the variational distance between qk (..4 Partitions of sample paths. w(t).4. Partitions into distinct words have the asymptotic property that most of the sample path must be contained in those words that are not too short relative to entropy. most of the sample path is covered by words that are not much longer than (log n)I h. then code successive w(i) by using a fixed-length code. Call this C k m (Xil ) and let £k(X) denote the length of Ck . .(x. They show that eventually almost surely.(4)}. for any partition into distinct words. which partitions into distinct words. .SECTION 11.14) and . In other words. Transmit k and the list. a word w is a finite sequence of symbols drawn from the alphabet A and the length of w is denoted by i(w). Append some coding of the final block y. [53]. For example. and for any partition into words that have been seen in the past. 3. The final code C(xil) transmits the shortest of the codes {Ck.. 147 2.u k is asymptotically almost surely lower bounded by (1 — e -i ) for the case when n = k2k and . Show that {C. As in earlier discussions..n ) and Cknun . w(2). n] at which rk(x'11 ) achieves its minimum. A sequence Jell is said to be parsed (or partitioned) into the (ordered) set of words {w(1). Express x'12 as the concatenation w(l)w(2) • • • w(q)v..
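A rough Python accounting of the code lengths in this last exercise may help fix ideas (an illustration only, not from the text; the exact header costs are stand-ins, and only their order of growth matters for the universality argument).

import math

def code_length_k(x, k, alphabet_size=2):
    # Rough bit-length of the code C_{k,n}: transmit k, the list of distinct
    # k-blocks that occur (about k*(1+log|A|) bits per listed block), a
    # fixed-length index for each successive k-block, and the final short
    # block symbol by symbol.
    n = len(x)
    m = n // k
    blocks = [tuple(x[i * k:(i + 1) * k]) for i in range(m)]
    distinct = set(blocks)
    per_symbol = 1 + math.ceil(math.log2(alphabet_size))
    list_bits = math.ceil(math.log2(n + 1)) + len(distinct) * k * per_symbol
    index_bits = m * max(1, math.ceil(math.log2(len(distinct))))
    return list_bits + index_bits + (n - m * k) * per_symbol

def best_k(x, alphabet_size=2):
    # Returns (minimal length, smallest k achieving it), i.e. k_min as in the exercise.
    lengths = [(code_length_k(x, k, alphabet_size), k) for k in range(1, len(x) + 1)]
    return min(lengths)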

Theorem II.4.1 (The distinct-words theorem.) Let μ be ergodic with entropy h > 0, and let ε > 0 be given. For almost every x ∈ A^∞ there is an N = N(ε, x) such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into distinct words, then

Σ_{ℓ(w(i)) ≤ (1−ε)h^{−1} log n} ℓ(w(i)) ≤ εn.

The (simple) Lempel-Ziv algorithm discussed in Section II.2 parses a finite sequence into distinct words, except possibly for the final word w(C + 1). If this final word is not empty, it is the same as a prior word, and hence ℓ(w(C + 1)) ≤ n/2. Thus the distinct-words theorem implies that, eventually almost surely, most of the sample path is covered by blocks that are at least as long as (h + ε)^{−1} log(n/2). But the part covered by shorter words contains only a few words, by the crude bound, so that

(2) lim sup_n C(x_1^n) (log n)/n ≤ h.

To discuss partitions into words that have been seen in the past, some way to get started is needed. For this purpose, let F be a fixed finite collection of words, called the start set. A partition x_1^n = w(1)w(2)···w(t) is called a partition (parsing) into repeated words, if each w(i) is either in F or has been seen in the past, meaning that there is an index j ≤ ℓ(w(1)) + ··· + ℓ(w(i − 1)) such that w(i) = x_j^{j+ℓ(w(i))−1}, except, possibly, for the final word. Partitions into repeated words have the asymptotic property that most of the sample path must be contained in words that are not too long relative to entropy.

Theorem II.4.2 (The repeated-words theorem.) Let μ be ergodic with entropy h > 0, and let 0 < ε < h and a start set F be given. For almost every x ∈ A^∞ there is an N = N(ε, x) such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into repeated words, then

Σ_{ℓ(w(i)) ≥ (1+ε)h^{−1} log n} ℓ(w(i)) ≤ εn.

It is not necessary to require exact repetition in the repeated-words theorem; it is sufficient if all but the last symbol appears earlier, see Exercise 1. With this modification, the repeated-words theorem applies, for example, to the versions of the Lempel-Ziv algorithm discussed in Section II.2, so that eventually almost surely most of the sample path must be covered by words that are no longer than (h − ε)^{−1} log n, which, in turn, implies that

(3) lim inf_n C(x_1^n) (log n)/n ≥ h.

Of course, as noted earlier, the lower bound (3) is a consequence of the fact that entropy is an upper bound. Thus the distinct-words theorem and the modified repeated-words theorem together provide an alternate proof of the LZ convergence theorem.

The proofs of both theorems make use of the strong-packing lemma, Lemma II.3.2, this time for parsings into words of variable length, and the fact that too-good codes do not exist.
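As an aside (not in the original), here is a small Python sketch of the simple Lempel-Ziv parsing discussed above, the basic example of a partition into distinct words: each successive word is the shortest block that has not yet occurred as a word, so all words are distinct except possibly the last.

def parse_into_distinct_words(x):
    # Simple LZ parsing: each word is the shortest new word.
    # Returns a list of words whose concatenation is x; all words are
    # distinct, except possibly the final one, which may repeat an
    # earlier word if x ends in the middle of a new word.
    words, seen, start = [], set(), 0
    for end in range(1, len(x) + 1):
        w = x[start:end]
        if w not in seen:            # shortest block not seen as a word before
            seen.add(w)
            words.append(w)
            start = end
    if start < len(x):               # possibly repeated final word
        words.append(x[start:])
    return words

# Example: parse_into_distinct_words("000110110100")
# gives ['0', '00', '1', '10', '11', '01', '00'].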

The fact that the words are distinct then implies that . Lemma 11. 149 II.4.3. The strong-packing lemma._Exio g n)la E top(o) E Ic<(1-6)(log n)la Since IGk I <2"k the sum on the right is upper bounded by ( 2' ((1 — E)logn) —1) exp2 ((1 — E)logn)) = logn) = o(n). This proves the lemma. The proof of this simple fact is left to the reader... Given E > 0 there is an N such that if n > N and x n = w(1)w(2). PARTITIONS OF SAMPLE PATHS.(. w(t) is a partition into distinct words such that t(w(i)) 5_ En. Lemma 11. w(i)gc then t(w(i)) < 2En. Et _ The general case follows since it is enough to estimate the sum over the too-short words that belong to G.. together with some simple facts about partitions into distinct words. w(t) is a partition into distinct words. yields an integer K and { rk C A": k> 1} such that I7j < 21c(h+S) . let 8 be a positive number to be specified later.3 Given K and 8 > 0. where a is a fixed positive number. t —) men. To continue with the proof of the distinct-words theorem.4 For each k.SECTION 11. t > 1.a Proof of the distinct-words theorem. .4. because there are at most IA i k words of length k. and let G = UkGk. Proof First consider the case when all the words are required to come from G. The first observation is that the words must grow in length. then E aw (i) ) Bn. POD (j)) <K The second observation is that if the distinct words mostly come from sets that grow in size at a fixed exponential rate a.4.(. The distinct-words theorem is a consequence of the strong-packing lemma.(i)). k > K.2.. there is an N such that if n > N and xrii = w(1)w(2). then words that are too short relative to (log n)Ice cannot cover too much.4. by an application of the standard bound. Lemma 11. suppose Gk C A k satisfies IGkI _< 2ak .

the existence of good codes is guaranteed by the strong-packing lemma. strong-packing implies that given a parsing x = w(l)w(2) E w(i)EU Klj t (o)) a _3)n. for which s of the w(i) are too long. will be a fixed ergodic process with positive entropy h and E < h will be a given positive number. 3)-strongly-packed by MI. The idea is to code each too-long word of a repeated-word parsing by telling where it occurs in the past and how long it is (this is. 2 Suppose xr. Throughout this discussion p. Let u1 be the concatenation of all the words that precede V(1).4 can be applied with a = h + 3 to yield aw(i)) < 25n. The first idea is to merge the words that are between the too-long words. property (4) holds for any parsing of . and let .2. this time applied to variable-length parsing. of course. Suppose = w(l)w(2) w(t) is some given parsing. It will be shown that if the repeated-words theorem is false. A block of consecutive symbols in xiit of length more (h — E) -1 log n will be said to be too long. e. 3)-strongly-packed by {'Tk }.4.) If a good code is used on the complement of the too-long words and if the too-long words cover too much. for any parsing P of xriz for which (4) E Sn atv(i)) — .4. let ui be the concatenation of all the words that come between V(i — 1) and V(i). Thus. The distinct-words theorem follows. i.8.4..' is (K. and such that eventually almost surely xrii is (K.150 CHAPTER II. Suppose also that n is so large that. then overall code length will be shorter than entropy allows. V(s). ENTROPY-RELATED PROPERTIES. r w(t) into distinct words. V(2).b Proof of the repeated-words theorem. since 8 could have been chosen in advance to be so small that log n (1 — 2S) log n h + c 5h+3 • 1:3 I1. the basic idea of the version of the Lempel-Ziv code suggested in Remark 11. As in the proof of the empirical-entropy theorem. it into distinct words. Label these too-long words in increasing order of appearance as V(1). and therefore Lemma 11. f(w(i))(1-15)(logn)/(h+S) provided only that n is large enough. E w(i)Euciz k rk t (o )) a ( l-3)n.x. for 1 < i < s + 1.3. then a prefix-code sequence can be constructed which beats entropy by a fixed amount infinitely often on a set of positive measure. by Lemma 11.

Let 3 be a positive number to be specified later. (a) Each filler u 1 E U K Tk is coded by specifying its length and giving its index in the set Tk to which it belongs.u s V(s)u s+i. An application of the strong-packing lemma. In this way. x'iz is expressed as the concatenation x Çz = u i V(1)u2V(2) us V(s)u s±i . it can be supposed such that V(j) that when n is large enough no too-long word belongs to F. A code C. 3)-stronglypacked by {'Tk }. Such a representation is called a too-long representation of 4.. (c) Each too-long V(j) is coded by specifying its length and the start position of its earlier occurrence. PARTITIONS OF SAMPLE PATHS. Let G(n) be the set of all xri' that are (K. Lemma 11.2. (ii) Each V(j) has been seen in the past. Since .. such that eventually almost surely xril is (K. then B(n) n G(n). . a too-long representation = u1V(1)u2V(2). eventually almost surely. Sequences not in B(n) n G(n) are coded using some fixed single-letter code on each letter separately. Let B(n) be the set of all sequences x for which there is an s and a too-long representation x = uiV (1)u2 V(2) .3. (i) Ei f ( V (D) >En. The idea is to code sequences in B(n) by telling where the too-long words occurred earlier. to complete the proof of the repeated-words theorem it is enough to prove that if K is large enough. 8)-strongly-packed by {Tk } ..SECTION 11..4. with the too-long words {V(1). If x E B(n) n G(n).. is constructed as follows. a set 'Tk c A k of cardinality at most 2k(h+8) . and therefore to prove the repeated-words theorem. eventually almost surely.T is a fixed finite set. but to make such a code compress too much a good way to compress the fillers is needed.. (b) Each filler u1 g U K r k is coded by specifying its length and applying a fixed single-letter code to each letter separately. The strong-packing lemma provides the good codes to be used on the fillers. us V(s)u s+i with the following two properties. V(s)} and fillers lui. is determined for which each too-long V(j) is seen somewhere in its past and the total length of the {V(j)} is at least En. ni ) To say V(j) = xn Since the start set .q E G(n) eventually almost surely. .. The words in the too-long representation are coded sequentially using the following rules. provides an integer K and for a each k > K. 151 u s+ 1 be the concatenation of all the words that follow V(s). ni + ± imi has been seen in the past is to say that there is an index i E [0. it is enough to prove that x g B(n).

. plus the two bit headers needed to tell which code is being used and the extra bits that might be needed to round up log n and h + 6 to integers. This fact is stated in the form needed as the following lemma. as well as to the lemma itself. shows that. n s < i="' (h — e) < — (h — 6). and from specifying the index of each u i E UlciK rk. must satisfy the bound. ) (h + 6) + 60(n). xÇ2 E B(n) n G(n). "ell B(n) n G(n). Proof of Lemma 11. eventually almost surely.C(x'11 ) = s log n + ( E ti J ELJA7=Kii B(n) n G(n). With a one bit header to tell whether or not xri` belongs to B(n) n G(n). require a total of at most 60(n) bits. E au . and hence if 6 is small enough then £(4) < n(h — € 2 12).152 CHAPTER II. which. completing the proof of the repeated-words theorem. first note that there are at most s + 1 fillers so the bound Es. For 4 E B(n) n G(n). since. Lemma 11. The lemma is sufficient to show that xrii B(n) n G(n).4. while it depends on x. the principal contribution to total code length E(xriz) .5. as well as the encoding of the lengths of all the words.(x €)) . the code becomes a prefix code. which requires £(u) [h + 81 bits per word. then (5) uniformly for 4 . For the fillers that do not belong to U71. By assumption. ) <K (1 + n(h log n j). (6) s — av(i)) log n (h — e). which requires Flog ni bits for each of the s too-long words. so that (5) and (6) yield the code-length bound < n(h + — e(e + 6)) + 0(n).. since it is not possible to beat entropy by a fixed amount infinitely often on a set of positive measure. Two bit headers are appended to each block to specify which of the three types of code is being used. indeed. a prefix code S on the natural numbers such that the length of the code word assign to j is log j + o(log j). — log n — log n implies that E au . eventually almost surely.4. The key to this. ENTROPY-RELATED PROPERTIES. tw(i) > en.C(fi` IC n ) comes from telling where each V(j) occurs in the past. for XF il E B(n) n G(n). by definition. is that the number s of too-long words. that is. It is enough to show that the encoding of the fillers that do not belong to Urk . for all sufficiently large n. An Elias code is used to encode the block lengths.5 If log K < 6 K . a word is too long if its length is at least (h — E) -1 log n.

[53]. 2. 153 which is o(n) for fixed K. This completes the proof of Lemma 11. since it was assumed that log K < 5K. Remark 11. the second and third sums together contribute at most log K K n to total length. is upper bounded by exp((h — c) Et. This is equivalent to showing that the coding of the too-long repeated blocks by telling where they occurred earlier and their lengths requires at most (h — c) E ti + o(n) bits. The key bound is that the number of ways to select s disjoint subintervals of [1.4. Extend the repeated-words theorem to the case when it is only required that all but the last k symbols appeared earlier. which can be assumed to be at least e.SECTION 11. implies that the words longer than K that do not belong to UTk cover at most a 6-fraction of x . + o(n)). PARTITIONS OF SAMPLE PATHS.5. the encoding of the lengths contributes EV(t(u i)) Et(eV(V(i») bits. using a counting argument to show that the set B(n) n G(n) must eventually be smaller than 2nh by a fixed exponential factor. the dominant terms of which can be expressed as E j )<K log t(u i ) + E t(u i )K log t(u i ) + logt(V(i)). which is o(n) for K fixed. then the assumption that 4 E G(n). and since x -1 log x is decreasing for x > e. the total length contributed by the two bit headers required to tell which type of code is being used plus the possible extra bits needed to round up log n and h + 8 to integers is upper bounded by 3(2s + 1) which is o(n) since s < n(h — E)/(logn). E 7. that is. n] of length Li > (log n)/ (h — 6).4. that 4 is (K. Since it takes L(u)Flog IAI1 bits to encode a filler u g UTk. The coding proof is given here as fits in nicely with the general idea that coding cannot beat entropy in the limit. 6)-strongly-packed by {Tk }.. Since it can be assumed that the too-long words have length at least K. II. a u .4.4. The first sum is at most (log K)(s + 1). which is Sn. Finally. Extend the distinct-words and repeated-words theorems to the case where agreement in all but a fixed number of places is required. . In particular.•)) [log on SnFlog IA11. since there are at most 2s + 1 words. which is 50(n). of the repeated-words theorem parallels the proof of the entropy theorem.c Exercises 1. once n is large enough to make this sum less than (Sn/2. Finally.6 The original proof. the total number of bits required to encode all such fillers is at most (u.

3. Let μ be the Kolmogorov measure of the binary, i.i.d., equiprobable process (i.e., unbiased coin-tossing).

(a) Show that eventually almost surely, there are fewer than n^{1−ε/2} words longer than (1 + ε) log n in a parsing of x_1^n into repeated words. (Hint: use Barron's lemma.)

(b) Show that eventually almost surely, the words longer than (1 + ε) log n in a parsing of x_1^n into repeated words have total length at most n^{1−ε/2} log n.

(c) Show that eventually almost surely, no parsing of x_1^n into repeated words contains a word longer than 4 log n.

4. Carry out the details of the proof that (2) follows from the distinct-words theorem.

5. Carry out the details of the proof that (3) follows from the repeated-words theorem.

6. What can be said about distinct and repeated words in the entropy 0 case?

Section II.5 Entropy and recurrence times.

An interesting connection between entropy and recurrence times for ergodic processes was discovered by Wyner and Ziv, [86]. Wyner and Ziv showed that the logarithm of the waiting time until the first n terms of a sequence x occurs again in x is asymptotic to nh, in probability. An almost-sure form of the Wyner-Ziv result was established by Ornstein and Weiss, [53], see also the earlier work of Willems, [85]. These results, along with an application to a prefix-tree problem, will be discussed in this section.

The definition of the recurrence-time function is

R_n(x) = min{m ≥ 1: x_{m+1}^{m+n} = x_1^n}.

The theorem to be proved is

Theorem II.5.1 (The recurrence-time theorem.) For any ergodic process μ with entropy h,

lim_{n→∞} (1/n) log R_n(x) = h, almost surely.

This is a sharpening of the fact that the average recurrence time is 1/μ(x_1^n), whose logarithm is asymptotic to nh.

Some preliminary results are easy to establish. Define the upper and lower limits,

r̄(x) = lim sup_n (1/n) log R_n(x),    r(x) = lim inf_n (1/n) log R_n(x).

Since R_{n−1}(Tx) ≤ R_n(x), both the upper and lower limits are subinvariant, that is, r̄(Tx) ≤ r̄(x) and r(Tx) ≤ r(x).
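Computing R_n(x) from a finite realization is immediate from the definition; the Python sketch below (an illustration, not from the text) returns R_n and the normalized quantity (1/n) log R_n whose almost-sure limit the theorem identifies as h. It needs a realization long enough to contain the recurrence.

import math

def recurrence_time(x, n):
    # R_n(x) = min{ m >= 1 : x[m:m+n] == x[:n] }, computed from a finite
    # initial segment x of the sample path; raises if the first n-block
    # does not recur within the observed data.
    first = x[:n]
    for m in range(1, len(x) - n + 1):
        if x[m:m + n] == first:
            return m
    raise ValueError("first n-block does not recur in the observed segment")

def normalized_log_recurrence(x, n):
    # (1/n) log2 R_n(x); by the recurrence-time theorem this tends to h.
    return math.log2(recurrence_time(x, n)) / n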

is outlined in Exercise 1.SECTION 11." Indeed. however. eventually almost surely. then there is a prefix-code sequence which beats entropy in the limit. It is now a simple matter to get from (2) to the desired result. eventually almost surely. the cardinality of the projection onto An of D. To establish this. The bound F < h will be proved via a nice argument due to Wyner and Ziv. eventually almost surely. both upper and lower limits are almost surely constant. The goal is to show that x g D. is upper bounded by the cardinality of the projection of Tn . both are almost surely invariant. To establish i: < h. so that (2) yields 1-1 (Dn < n Tn) _ 2n(h-1-€12)2—n(h+E) . r < F. that if x E D(a). . 2n(h+E) — 1]. by Theorem 11. Furthermore. A code that compresses more than entropy can then be constructed by specifying where to look for the first later occurrence of such blocks. The inequality r > h will be proved by showing that if it is not true. 7-1 D. It is enough to prove this under the additional assumption that xril is "entropy typical.1. n T„. so the iterated almost sure principle yields the desired result (1).2.„ eventually almost surely.(4)) < 2-n(h+E). based on a different coding idea communicated to the author by Wyner. n > 1.5. . n 'T. fix e > 0 and define D. (1). there are constants F and r such that F(x) = F and r(x) = r. so it is enough to prove (1) x g Dn n 7. 2—nE/2 . The Borel-Cantelli principle implies that x g D. = fx: R(x) > r(h+E ) }. that is. which is impossible. Another proof. almost surely. which is upper bounded by 2n(h+E / 2) . Since these sets all have the same measure. disjointness implies that (2) It(D.(4). 155 hence.. 7' -2a(h+°+1 Dn(ai) must be disjoint. The bound r > h was obtained by Ornstein and Weiss by an explicit counting argument similar to the one used to prove the entropy theorem. then eventually almost surely. so the desired result can be established by showing that r > h and F < h. Indeed. Since the measure is assumed to be ergodic. In the proof given below. fix (4 and consider only those x E D„ for which xri' = ari'. then Tx g [a7]. . sample paths are mostly covered by long disjoint blocks which are repeated too quickly. and hence the sequence D(a). n [4].. if Tri = {X: t(x) > 2-n(h-i-E/2)} then x E T. the set D(a7) = D. for j E [1.. a more direct coding construction is used to show that if recurrence happens too soon infinitely often. and hence. Note. ENTROPY AND RECURRENCE TIMES. by the definition of D(a)..

For n > m. 2h(t -s + 1) ) . Here it will be shown how a prefix n-code Cn can be constructed which. A one bit header to tell whether or not 4 E G(n) is also used to guarantee that C. compresses the members of G(n) too well for all large enough n. 1}d. by the definition of "too-soon recurrent. for which s +k < n. a prefix code on the natural numbers such that t(E(j)) = log j + o(log j). both to be specified later. For x E G(n) a (8.5. ENTROPY-RELATED PROPERTIES. followed by e(k i ). It will be shown later that a consequence of the assumption that r < h — c is that (3) xÇ E G(n). for suitable choice of m and S. a result stated as the following lemma. is a prefix n-code. a concatenation = u V (1)u 2 V (2) • u j V (J)u j+1 .156 CHAPTER II. for n > m. eventually almost surely. The least such k is called the distance from xs to its next occurrence in i• Let m be a positive integer and 8 be a positive number. where d < 2+ log l Ai. The one bit headers on F and in the encoding of each V (j) are used to specify which code is being used. followed by E(t(V (j)). Let G(n) be the set of all xi' that have a (8. if the following hold. Lemma 11." and that the principal contribution to total code length is E i log ki < n(h—c). Sequences not in G(n) are coded by applying F to successive symbols. X E G (n ) . is said to be a (8. that is. (i) Each filler word u i is coded by applying F to its successive symbols. such that each code word starts with a O. where k i is the distance from V(j) to its next occurrence in x. m)-too-soon-recurrent representation. where E > O. m)-too-soon-recurrent representation 4 = u V(1)u2 V(2) • • • u V( .t+i is determined and successive words are coded using the following rules.1 )u. The key facts here are that log ki < i(V(j))(h — c). (a) Each V(j) has length at least m and recurs too soon in 4. (b) The sum of the lengths of the filler words ui is at most 3 8 n. m)-too-soon-recurrent representation of 4. (ii) Each V(j) is coded using a prefix 1.2 (4) f(xii) < n(h — E) n(3dS + ot ni ). To develop the covering idea suppose r < h — E.' of s lik k for consecutive symbols in a sequence 4 is said to recur too soon in 4 if xst = x` some k in the interval [ 1 . oc. The code also uses an Elias code E. A block xs. The code Cn uses a faithful single-letter code F: A H+ {0. where cm --÷ 0 as m .

establishing the recurrence-time theorem.n --* 0 as m cc. R n (x) = r there is an M such that n=M . there are at most n/ m such V(j) and hence the sum over j of the second term in (5) is upper bounded by nI3. Summing on j < n/ m. This is a covering/packing argument. Since each V(j) is assumed to have length at least m. by property (a) of the definition of G(n). Proof of Lemma 11. eventually almost surely.2. the complete encoding of the too-soon recurring words V(j) requires a total of at most n(h — E) nun.8. and hence the packing lemma implies that for almost every x. by the definition of "too-soon recurrent. Thus r > h. Finally.5.u(B) > 1 — 6. for each n > 1. so that .SECTION 11. since (1/x) log x is decreasing for x > e.2. contradicting the fact (3) that 4 E G(n).. The Elias encoding of the distance to the next occurrence of V(j) requires (5) f(V (j))(h — E) o(f(V(j))) bits. for fixed m and 6. To carry it out. x) and a collection {[n i .C(x'ii) < n(h — 6/2) on G(n). But too-good codes do not exist. eventually almost surely. there is an integer I = I (n. 0 It remains to be shown that 4 E G(n). define B n = (x: R(x) < 2 n(h—E) ). ENTROPY AND RECURRENCE TIMES. the encoding of each filler u i requires dt(u i ) bits. the second terms contribute at most Om bits while the total contribution of the first terms is at most Elogf(V(D) n logm rn . and the total length of the fillers is at most 36n. eventually almost surely. for which the following hold. bits. = 2/3. n]." Summing the first term over j yields the principal term n(h — E) in (4). 157 The lemma implies the desired result for by choosing 6 small enough and m large enough. a fact that follows from the assumption that 0 r <h — E. there is an N(x) > M/8 such that if n > (x).5. as m Do. the lemma yields . each of length at least m and at most M. This completes the proof of Lemma 11. Thus the complete encoding of all the fillers requires at most 3d6n bits.u(G(n)) must go to 0. where an. for all large enough n. .„ where /3. In summary.5. The one bit headers on the encoding of each V(j) contribute at most n/ m bits to total code length. assume r < h — E and. m]: i < of disjoint subintervals of [1. + (1 ± log m)/m 0. for B = n=m Bn' The ergodic theorem implies that (1/N) Eni Bx (Ti-i X ) > 1 . Since lim inf. The Elias encoding of the length t(V(j)) requires log t(V(D) o(log f(V(j))) bits.

n. T 2x. called the n-th order prefix tree determined by x.Tn —l x. W(3) = 0100. for example. ±1)(h—E) ]. . T 2x = 1010010010 . W(0) = 00. The prefix tree 'T5(x) is shown in Figure 1. . (b) (mi — ni + 1) > (1 — 25)n. A sequence X E A" together with its first n — 1 shifts. if it is assumed that r < h — E.3 The recurrence-time theorem suggests a Monte Carlo method for estimating entropy.a Entropy and prefix trees. as quick way to test the performance of a coding algorithm. In particular. II. In addition. x 001010010010 . T 3x = 010010010 . . ENTROPY-RELATED PROPERTIES.5. Xi :Fnil—n ` 9 for some j E [ni + 1. • • • . the total length of the V(j) is at least (1 —38)n. the simplest form of Lempel-Ziv parsing grows a subtree of the prefix tree. as indicated in Remark 11.5. in which context they are usually called suffix trees. and where each u i is a (possibly empty) word. recurs too soon. 'Tn (x). W(4) = 100. as tree searches are easy to program. algorithms for encoding are often formulated in terms of tree structures. x E G (n). Prefix trees are used to model computer storage algorithms. Tx. log Rk(T i l X) as an estimate of h. Remark 11. 1. This led Grassberger. where underlining is used to indicate the respective prefixes. where each V(j) = xn 'n.158 (a) For each i < I el' n.. T x = 01010010010 . . Closely related to the recurrence concept is the prefix tree concept. defines a tree as follows. page 72.9. For example. itl. This works reasonably well for suitably nice processes.. W(2) = 101. n]. so that if J is the largest index i for which ni + 2A1(h— E) < n. The sequence x`il can then be expressed as a concatenation (6) xn i = u i V(1)u2V(2)• • • u. and can serve. for any j W(i): i = 0. 0 < j < n. the set which is not a prefix of Px. a log n.. and thereby finishes the proof of the recurrence-time theorem. provided n is large. eventually almost surely. W(1) = 0101. ni + 2(ni i —n. independently at random in Let k [1./ V(J)u j+i. select indices fit.T4x = 10010010. Since by definition. below.5. for fixed m and 3. This completes the proof that 4 E G(n). x) be the shortest prefix of Tx i. it can be supposed that n > 2m(h— E ) IS. it defines a tree. For each 0 < i < n. to propose that use of the full prefix tree . = - CHAPTER II.. n — 11 is prefix-free. In practice. [16]. condition (b) can be replaced by (c) > < (m — n +1) > (1 — 33)n.7. let W(i) = W(i.. and take (11 kt) Et. that is.7. . where a < 1.
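The depths L_i^n = ℓ(W(i)) of the prefix tree just defined can be computed directly from the definition. The Python sketch below is an illustration only (quadratic time; practical implementations use suffix-tree constructions) and assumes the finite realization supplied is long enough to resolve every W(i).

def prefix_tree_depths(x, n):
    # Depths L_i = len(W(i)) of the n-th order prefix tree: W(i) is the
    # shortest prefix of the shifted sequence x[i:] that is not a prefix
    # of any other x[j:], 0 <= j < n, j != i.  Assumes every comparison
    # is resolved within the finite realization x.
    depths = []
    for i in range(n):
        L = 1
        while any(j != i and x[j:j + L] == x[i:i + L] for j in range(n)):
            L += 1
        if i + L > len(x):
            raise ValueError("realization too short to resolve W(%d)" % i)
        depths.append(L)
    return depths

# For the example above, prefix_tree_depths("001010010010", 5) returns
# [2, 4, 3, 4, 3], the lengths of W(0), ..., W(4).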

5. since if L7(x) = k then. . Another type of finite energy process is obtained by adding i. An ergodic process p. eventually almost surely. He suggested that if the depth of node W(i) is the length L7(x) = f(W(i)). namely. at least in probability.5. noise to a given ergodic process. L(x) is the length of the longest string that appears twice in x7. n].d. the finite energy processes. for all t > 1. see Theorem 11. then average depth should be asymptotic to (1/h) log n.d.E)-trimmed mean of a set of cardinality M is obtained by deleting the largest ME/2 and smallest ME/2 numbers in the set and taking the mean of what is left. One type of counterexample is stated as follows. ENTROPY AND RECURRENCE TIMES.5 (The finite-energy theorem. The {L } are bounded below so average depth (1/n) E (x) will be asymptotic to log n1 h in those cases where there is a constant C such that max L7 (x) < C log n.i. such as. the (1 .. This is equivalent to the string matching bound L(4) = 0(log n). Note that the prefix-tree theorem implies a trimmed mean result.4 (The prefix-tree theorem.d. and hence lim n -+cx) E n log n i=1 1 1 (x) = .) The mean length is certainly asymptotically almost surely no smaller than h log n. see Exercise 2. What is true in the general case.5. The (1 . for which a simple coding argument gives L(x) = 0(log n). mixing Markov processes and functions of mixing Markov processes all have finite energy. as earlier. (Adding noise is often called "dithering. it is not true in the general ergodic case. is an ergodic finite energy process with entropy h > 0 then there is a constant C such that L(4) < C log n.i. where.i. almost surely. and some other cases.) If p.i. the binary process defined by X n = Yn Z. by definition of L(x). 159 might lead to more rapid compression. eventually almost surely.E)-trimmed mean of the set {L(x)} is almost surely asymptotic to log n/h.5.5. there is a j E [1. processes. While this is true in the i. such that x: +k-2 = There is a class of processes. (mod 2) where {Yn } is an arbitrary binary ergodic process and {Z n } is binary i. Theorem 11. namely. j i. and independent of {Yn }. is an ergodic process of entropy h > 0. however. for example. as it reflects more of the structure of the data. h The i. is the following theorem.SECTION 11. Theorem 11. all but en of the numbers L7(x)/ log n are within c of (11h). has finite energy if there is are constants c < 1 and K such that kxt-F1 i) < Kc L .) If p. case. eventually almost surely. which kills the possibility of a general limit result for the mean. in general.d. for any E > 0. then for every c > 0 and almost every x. for all L > 1. there is no way to keep the longest W(i) from being much larger than 0(log n). there is an integer N = N(x. eventually almost surely. c) such that for n > N." and is a frequently used technique in image compression. and for all xl +L of positive measure. however.
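The quantity L(x_1^n), the length of the longest string that appears twice in x_1^n, can be computed by exploiting the fact that "some block of length L appears twice" is monotone in L. The Python sketch below is an illustration only (a simple set-based test inside a binary search, not the fastest method).

def longest_repeated_block_length(x):
    # L(x) = length of the longest block appearing at least twice in x.
    def repeats(L):
        seen = set()
        for i in range(len(x) - L + 1):
            w = tuple(x[i:i + L])
            if w in seen:
                return True
            seen.add(w)
        return False

    lo, hi = 0, max(0, len(x) - 1)   # a block of length len(x) cannot repeat
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if repeats(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo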

n) = fi E [0. The prefix-tree and finite-energy theorems will be proved and the counterexample construction will be described in the following subsections.) The limitations of computer memory mean that at some stage the tree can no longer be allowed to grow. define the sets L(x. for the longest prefix is only one longer than a word that has appeared twice.1. each of which produces a different algorithm. can be viewed as an overlapping form of the distinct-words theorem.a. if it has the form w(i) = w(j)a.7 The simplest version of Lempel-Ziv parsing x = w ( 1)w(2) • • •. (8) 1L(x.160 Theorem 11. whose predecessor node was labeled w(j). For S > 0 and each k. Theorem 11. and is a consequence of the fact that. see [87. where the next word w(i). n) are distinct. for some j < i and a E A. (8). It is enough to prove the following two results.5. Such finite LZ-algorithms do not compress to entropy in the limit. grows a sequence of IA I-ary labeled trees. n — I]: (x) > (1 ± c)(log n)I hl. each of which is one symbol shorter than the shortest new word. mentioned in Remark 11.5. ENTROPY-RELATED PROPERTIES. for fixed n. and an increasing sequence {nk}. n — I]: L7(x) < (1— E)(logn) I h} U (x .1. Performance can be improved in some cases by designing a finite tree that better reflects the actual structure of the data. is assigned to the node corresponding to a. n)I < En. For E > 0. To control the location of the prefixes an "almost uniform eventual typicality" idea will be needed. such that vnrfo l Ltilk (x ) = oo. Prefix trees may perform better than the LZW tree for they tend to have more long words. eventually almost surely. n) = fi E [0. (9). The details of the proof that most of the words cannot be too short will be given first. The remainder of the sequence is parsed into words. n)I < En.5. several choices are possible. 88] for some finite version results. I U(x . Theorem 11. the tree grows to a certain size then remains fixed. the W (i. Remark 11.10. define Tk (8) = (x) > .5. There is a stationary coding F of the unbiased coin-tossing process kt. Section 11. eventually almost surely. II. is a consequence of the recurrence-time theorem.6 CHAPTER II. (9) The fact that most of the prefixes cannot be too short. From that point on.2. The fact that most of the prefixes cannot be too long. (See Exercise 3.1 Proof of the prefix-tree theorem. in which the next word is the shortest new word. Subsequent compression is obtained by giving the successive indices of these words in the fixed tree. which means more compression. almost surely. lim (7) nk log nk with respect to the measure y = o F-1 . In the simplest of these. but do achieve something reasonably close to optimal compression for most finite sequences for nice classes of sources.2.4.

If a too-long block appears twice. a. k—> cx) k and when applied to the reversed process it yields the backward result. eventually almost surely. Lemma 11. 1 lim — log Fk (x) = h. Since E (3 ). Thus if n > N(x) is sufficiently large and S is sufficiently small there will be at most cnI2 good indices i for which L'i'(x) < (1— €)(log n)I h. completes the proof that eventually almost surely most of the prefixes cannot be too short.8 larger. x) E U m rk ( . 161 _ 2k(h+5) and so that I2k(6)1 < E Tk (6 ). The ergodic theorem implies that. Hence.(8). 1 urn ..s. Both a forward and a backward recurrence-time theorem will be needed.log Bk (x) = h. for any J > M. eventually almost surely. Suppose k > M and consider the set of good indices i for which L7(x) = k. there is an M such that the set 4 4 G(M) = ix: 4 E 7i. (See Exercise 7. the shift Tx belongs to G(M) for all but at most a limiting 6-fraction of the indices i. it can be supposed that if n > N(x) then the set of non-good indices will have cardinality at most 3 6 n. k—>cc k since the reversed process is ergodic and has the same entropy as the original process.5. if necessary. n).5. (8).6 and Exercise 11. To proceed with the proof that words cannot be too short.SECTION 11. Next the upper bound result. x) are distinct members of the collection 'Tk (6) which has cardinality at most 2k(h+8) . This. will be established by using the fact that. a. each W(i. an index i < n will be called good if L(x) > M and W(i) = W(i. Since the prefixes W(i) are all distinct there are at most lAl m indices i < n such that L7(x) < M. there are at most (constant) 2 J (h+) good indices i for which M < L < J. which can be assumed to be less than en/2.x) appears at least twice. fix c > O. n.s. There are at most 2k(h+3) such indices because the corresponding W (i. For a given sequence x and integer n. The corresponding recurrence-time functions are respectively defined by Fk(X) = Bk(X) = min {in> 1 •. Thus by making the N(x) of Lemma 11. combined with the bound on the number of non-good indices of the preceding paragraph. Section 1.7. ENTROPY AND RECURRENCE TIMES. n. (9).8 For almost every x there is an integer N(x) such that if n > N(x) then for all but at most 2 6 n indices i E [0. then it has too-short return-time. the i-fold shift Tx belongs to G(M). for almost every x.. The recurrence-time theorem shows that this can only happen rarely. This result is summarized in the following "almost uniform eventual typicality" form. X m+ m+k i— min frn > 1: X1_ k+1 = X — ° k+11 The recurrence-time theorem directly yields the forward result. Section 1. except for the final letter. by the entropy theorem.5.n.) These limit results imply that . for all k > MI has measure greater than 1 — S.

the k-block starting at i recurs in the future within the next n steps. it can be supposed that both of the following hold. If i +k < n. n). A version of Barron's bound sufficient for the current result asserts that for any prefix-code sequence {C}. so that 1 logn .n) then T i E F. In summary. The key to making this method yield the desired bound is Barron's almost-sure code-length bound. for j < n.5. n .u(x'11 ) -21ogn. II. Case 2. Here "optimal code" means Shannon code.7. and hence. . this means that Bk(T i+k x) < n. that is. for at most en/3 indices i E [0. The ergodic theorem implies that for almost every x there is an integer N(x) such that if n> N(x) then Tx E F. for at most 6n/3 indices i E [0. if i E Uk (x. which asserts that no prefix code can asymptotically beat the Shannon code by more than 210g n.13.162 CHAPTER II. L7 > (1+ e) log n/ h. 7x E B. (10) £(C(4)) + log . that is. for each w.1]. there is a K. if n also satisfies k < en/3. (a) K < (1+ 6/2) log n/ h. j < i. and Tx E B. ENTROPY-RELATED PROPERTIES. such that j i and j+k -1 =x me two cases i < j and i > j will be discussed separately. then log Fk (x) > kh(1 + /4) -1 .2 Proof of the finite-energy theorem. x) as the shortest prefix of Tx that is not a prefix of any other Ti x. Thus. and log Bk (x) > kh(1+ c /4) -1 . (b) k +1 < (1+ c)log n/ h. which means that 7 k) x E B. there is an index j E [0. such that if k > K. The idea of the proof is suggested by earlier code constructions. each of measure at most en/4. Let k be the least integer that exceeds (1 + e/2) log n/ h. An upper bound on the length of the longest repeated block is obtained by comparing this code with an optimal code. This completes the proof of the prefix-tree theorem. n. Condition (b) implies that L7 -1 > k.log Fk(T i x) < — < h(1 + E/4) -1 . x F. if necessary. n).a. By making n larger. Assume i E U(x. and use an optimal code on the rest of the sequence. Lemma 5. or i + k < n.1]. Fk (T ix) < n. eventually almost surely. In this case. see Remark 1. code the second occurrence of the longest repeated block by telling its length and where it was seen earlier. n)I < en. x V B. namely. then I U (x . i < j. Case 1. and two sets F and B. A Shannon code for a measure p u is a prefix code such that E(w) = log (w)1. Fix such an x and assume n > N(x). by the definition of W (i. n. k which means that Tx E F.

The details of the coding idea can be filled in as follows. Suppose a block of length L occurs at position s, and then again at position t + 1 > s in the sequence x_1^n. The block x_1^t is encoded by specifying its length t, then using a Shannon code with respect to the given measure μ. The block x_{t+1}^{t+L} is encoded by specifying s and L. The final block x_{t+L+1}^n is encoded using a Shannon code with respect to the conditional measure μ(· | x_1^{t+L}). The prefix property is guaranteed by encoding the lengths with the Elias code e on the natural numbers, a prefix code for which ℓ(e(n)) = log n + o(log n). Since t, s, and L must be encoded, total code length satisfies

(11)   3 log n + log (1/μ(x_1^t)) + log (1/μ(x_{t+L+1}^n | x_1^{t+L})),

where the extra o(log n) terms required by the Elias codes and the extra bits needed to round up the logarithms to integers are ignored, as they are asymptotically negligible.

To compare code length with the length of the Shannon code of x_1^n with respect to μ, the probability μ(x_1^n) is factored to produce

log μ(x_1^n) = log μ(x_1^t) + log μ(x_{t+1}^{t+L} | x_1^t) + log μ(x_{t+L+1}^n | x_1^{t+L}).

This is now added to the code length (11) to yield

3 log n + log μ(x_{t+1}^{t+L} | x_1^t).

Thus Barron's lemma gives

3 log n + log μ(x_{t+1}^{t+L} | x_1^t) ≥ −2 log n,

eventually almost surely, to which an application of the finite energy assumption, μ(x_{t+1}^{t+L} | x_1^t) ≤ K c^L, yields −2 log n ≤ 3 log n + log K + L log c, so that, since c < 1,

lim sup_n  L(x_1^n)/log n ≤ 5/(− log c) < ∞,

almost surely. This completes the proof of the finite-energy theorem.

II.5.3 A counterexample to the mean conjecture.

Let μ denote the measure on two-sided binary sequences {0, 1}^Z given by unbiased coin-tossing, that is, the shift-invariant measure defined by requiring that μ(a_1^n) = 2^{-n}, for each n and each a_1^n ∈ A^n. It will be shown that given ε > 0, there is a measurable shift-invariant function F: {0, 1}^Z → {0, 1}^Z and an increasing sequence {n_k} of positive integers with the following properties.

(a) If y = F(x) then, with probability greater than 1 − 2^{-k}, there is an i in the range 0 ≤ i ≤ n_k − n_k^{3/4}, such that y_{i+j} = 0, 0 ≤ j < n_k^{3/4}.

(b) μ({x: x_0 ≠ (F(x))_0}) < ε.

The process ν = μ ∘ F^{-1} will then satisfy the conditions of the theorem, that is, (7) holds, for property (a) guarantees that, eventually almost surely, L_i^{n_k}(y) ≥ n_k^{5/8} for at least n_k^{5/8} indices i < n_k, so that Σ_{i < n_k} L_i^{n_k}(y) ≥ n_k^{5/4}.
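Returning to the coding construction used in the finite-energy proof above, the following sketch (names and the toy sequence are mine) locates a longest repeated block by brute force and shows that the cost of describing the three integers t, s, L is O(log n). The text's Elias code has length log n + o(log n); for simplicity the sketch uses the Elias gamma code, whose length is 2⌊log_2 m⌋ + 1, which is still O(log n).

```python
def longest_repeated_block(x):
    """Brute force: longest block occurring twice in x; returns (s, t, L), where the
    first occurrence starts at s and the second at t + 1, as in the proof above."""
    n = len(x)
    for L in range(n - 1, 0, -1):
        seen = {}
        for i in range(n - L + 1):
            w = tuple(x[i:i + L])
            if w in seen:
                return seen[w], i - 1, L
            seen[w] = i
    return 0, 0, 0

def elias_gamma_length(m):
    """Length of the Elias gamma code for m >= 1: 2*floor(log2 m) + 1 bits."""
    return 2 * (m.bit_length() - 1) + 1

x = [0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
s, t, L = longest_repeated_block(x)
overhead = 3 * elias_gamma_length(len(x))   # describing t, s and L costs O(log n) bits
print(s, t, L, overhead)
```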

Property (b) is not really needed, but it is added to emphasize that such a ν can be produced by making an arbitrarily small limiting density of changes in the sample paths of the i.i.d. process μ; hence, in particular, the entropy of ν can be as close to log 2 as desired. The coding F can be constructed by only a minor modification of the coding used in the string-matching example, discussed earlier as an illustration of the block-to-stationary coding method. In that example, blocks of 0's of length 2n_k^{1/2} were created, eventually almost surely; replacing 2n_k^{1/2} by n_k^{3/4} will produce the coding F needed here.

Remark II.5.9  The prefix tree theorem was proved in [72], which also contains the counterexample construction given here; after its discovery it was learned that the theorem was proved using a different method in [31]. The asymptotics of the length of the longest and shortest word in the Grassberger tree are discussed in [82] for processes satisfying memory decay conditions. The coding proof of the finite-energy theorem is new.

II.5.b Exercises.

1. Show that dithering an ergodic process produces a finite-energy process.

2. Show that if L(x_1^n) = O(log n), eventually almost surely, then max_i L_i^n(x) = O(log n), eventually almost surely. (Hint: L(x_1^n) ≤ K log n implies max_i L_i^n(x) ≤ K log n.)

3. Another proof, due to Wyner, that r ≥ h is based on the following coding idea. A function f will be said to be conditionally invertible if x_{-∞}^n = y_{-∞}^n whenever f(x_{-∞}^n) = f(y_{-∞}^n) and x_{-∞}^0 = y_{-∞}^0.
(a) Show that lim inf_n ℓ(f(x_{-∞}^n))/n ≥ h. (Hint: let B_n = {x_{-∞}^n: μ(x_1^n | x_{-∞}^0) ≥ 2^{-n(H-ε)}} and C_n = {x_{-∞}^n: ℓ(f(x_{-∞}^n)) ≤ n(H − 2ε)}, where H is the entropy-rate of μ, and show that Σ_n μ(B_n ∩ C_n | x_{-∞}^0) is finite.)
(b) Deduce that the entropy of a process and its reversed process are the same.
(c) Define f(x_{-∞}^n) to be the minimum m ≥ n for which x_{m+1}^{m+n} = x_1^n, and apply the preceding result.
(d) The preceding argument establishes the recurrence result for the reversed process. Show that this implies the result for μ.

Chapter III

Entropy for restricted classes.

Section III.1 Rates of convergence.

In this section μ denotes a stationary, ergodic process with alphabet A, with μ_k denoting the projection onto A^k defined by μ_k(a_1^k) = μ({x: x_1^k = a_1^k}), a_1^k ∈ A^k. The k-th order empirical distribution, or k-type, is the measure p_k = p_k(· | x_1^n) on A^k defined by

p_k(a_1^k | x_1^n) = |{i ∈ [1, n − k + 1]: x_i^{i+k-1} = a_1^k}| / (n − k + 1).

The ergodic theorem implies that if k and ε are fixed then

μ_n({x_1^n: |μ_k − p_k(· | x_1^n)| ≥ ε}) → 0, as n → ∞.

Likewise, the entropy theorem implies that

μ_n({x_1^n: 2^{-n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{-n(h-ε)}}) → 1, as n → ∞.

The rate of convergence problem is to determine the rates at which these convergences take place. Many of the processes studied in classical probability theory, such as i.i.d. processes, Markov processes, and finite-state processes, have exponential rates of convergence for frequencies of all orders and for entropy, but some standard processes, such as renewal processes, do not have exponential rates. These results will be developed in this section. In the general ergodic case, no uniform rate is possible, that is, given any convergence rate for frequencies or for entropy, there is an ergodic process that does not satisfy the rate.

A mapping r_k: R^+ × Z^+ → R^+, where R^+ denotes the positive reals and Z^+ the positive integers, is called a rate function for frequencies for a process μ if for each fixed k and ε > 0,

μ_n({x_1^n: |μ_k − p_k(· | x_1^n)| ≥ ε}) ≤ r_k(ε, n).
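As a concrete illustration of the empirical k-block distribution p_k(· | x_1^n) and its distance from μ_k, here is a minimal sketch (the coin-tossing example and function names are my own) that slides a window of length k along a sample path and measures the variational distance to the true product distribution.

```python
from collections import Counter
from itertools import product
import random

def empirical_k_blocks(x, k):
    """k-th order empirical distribution p_k(.|x_1^n) from overlapping k-blocks."""
    n = len(x)
    counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    total = n - k + 1
    return {block: c / total for block, c in counts.items()}

def variational_distance(p, letter_prob, alphabet, k):
    """Sum over all k-blocks of |p(block) - q(block)| for a product measure q."""
    d = 0.0
    for block in product(alphabet, repeat=k):
        q = 1.0
        for a in block:
            q *= letter_prob[a]
        d += abs(p.get(block, 0.0) - q)
    return d

random.seed(0)
x = [random.randint(0, 1) for _ in range(10000)]
for k in (1, 2, 4, 8):
    p = empirical_k_blocks(x, k)
    print(k, round(variational_distance(p, {0: 0.5, 1: 0.5}, (0, 1), k), 4))
```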

The distance between μ_k and p_k will be measured using the distributional (variational) distance, defined by

|μ_k − p_k| = Σ_{a_1^k} |μ_k(a_1^k) − p_k(a_1^k | x_1^n)|.

An ergodic process is said to have exponential rates for frequencies if there is a rate function such that for each ε > 0 and k, r_k(ε, n) → 0 as n → ∞ and (−1/n) log r_k(ε, n) is bounded away from 0. A mapping r: R^+ × Z^+ → R^+ is called a rate function for entropy for a process μ of entropy h if, for each ε > 0 and any n,

μ_n({x_1^n: 2^{-n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{-n(h-ε)}}) ≥ 1 − r(ε, n).

An ergodic process has exponential rates for entropy if it has a rate function for entropy such that for each ε > 0, r(ε, n) → 0 as n → ∞ and (−1/n) log r(ε, n) is bounded away from 0.

Remark III.1.1  An unrelated concept with a similar name, called speed of convergence, is concerned with what happens in the ergodic theorem when (1/n) Σ_i f(T^i x) is replaced by (1/a_n) Σ_i f(T^i x), for some unbounded, nondecreasing sequence {a_n}. See [32] for a discussion of this topic.

Exponential rates of convergence for i.i.d. processes will be established in the next subsection, then extended to Markov and other processes that have appropriate asymptotic independence properties. Examples of processes without exponential rates and even more general rates will be discussed in the final subsection.

III.1.a Exponential rates for i.i.d. processes.

Theorem III.1.2 (Exponential rates for i.i.d. processes)  If μ is i.i.d., then it has exponential rates for frequencies and for entropy.

The theorem is proved by establishing the first-order case. The overlapping-block problem is then reduced to the nonoverlapping-block problem, which is treated by applying the first-order result to the larger alphabet of blocks. Exponential rates for entropy follow immediately from the first-order frequency theorem, for the fact that μ is a product measure implies that μ_n(x_1^n) is a continuous function of p_1(· | x_1^n).

The first-order theorem is a consequence of the bound given in the following lemma, which is, in turn, a combination of a large deviations bound due to Hoeffding and Sanov and an inequality of Pinsker, [22, 58, 62]. (The determination of the best value of the exponent is known as "large deviations" theory; while ideas from that theory will be used, it will not be the focus of interest here.)

Lemma III.1.3 (First-order rate bound.)  There is a positive constant c such that

(1)   μ_n({x_1^n: |p_1(· | x_1^n) − μ_1| ≥ ε}) ≤ (n + 1)^{|A|} 2^{-ncε²},

for any finite set A, for any i.i.d. process μ with alphabet A, for any n, and for any ε.

Proof  A direct calculation, using the fact that μ is a product measure, produces

(2)   μ(x_1^n) = ∏_{a ∈ A} (μ(a))^{n p_1(a | x_1^n)} = 2^{-n(H(p_1) + D(p_1 ‖ μ_1))}

3. w(t). — 1L El Since D(P II Q) I P — Q1 2 1 (2 ln 2) is always true. Theorem 1. E) n T(4). k) such that n = tk + k + r. El Lemma 111.)1 : pie lyD = pie jx7)}. The desired result will follow from an application of the theory of type classes discussed in I. For each s E [0. RATES OF CONVERGENCE. and class. 1. see Pinsker's inequality. produces the bound tt(B(n. (T(x)) < The bad set B(n. k) and r E [0. it follows that 1 12 1Ple lx7) D* r17- which completes the proof of Lemma 111.6. (b) There are at most (n + 1) 1A I type classes. the next t words. E)) < ( n 01Al2—nD. Fact (a) in conjunction with the product formula (2) produces the bound. since (n + 1)I 41 2 .SECTION III.1 3 and Theorem 1. all have length k. The type class of _el is the set T(x) = {. k-block parsing of x. E) = fx7: 1/1 1( . and the final word w(t 1) has length n — tk —s. The sequences x7 and y r i` are said to be (s. The two key facts.6. . This is called the s -shifted. Section 1. 1x7)) = — a (a14) log pi (a lx7). k)-equivalent if their . the sequence x can be expressed as the concatenation (4) x = w(0)w(1)• • • w(t)w(t + 1).1.12.3 gives the desired exponential-rate theorem for first-order frequencies. (3) p. are (a) 17-(4)1 < 2nH(p 1 ) .n'' = exp(—n(cE 2 ön)) where 8. Exercise 6. k — 1]. is the entropy of the empirical 1-block distribution.1. w(1). To assist in this task some notation and terminology will be developed. — I > E } can be partitioned into disjoint hence the bound (3) on the measure of the type sets of the form B(n. The extension to k-th order frequencies is carried out by reducing the overlapping-block problem to k separate nonoverlapping-block problems. together with fact (b). where the first word w(0) has length s < k.6.6. For n > 2k define integers t = t (n.d. 0 as n —> oc. and D(Pi Lai) = E a pi (aIx)log Pl(aix'11) iii(a) is the divergence of pi( • Ixii) relative to pi. where = min ID(Pi Lai): IP1(. 167 where H(pi) = 11(131(. .

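As a numerical check of the product-measure identity (2) and of Pinsker's inequality used in the first-order argument above, here is a small sketch; the toy measure, sample, and function names are my own.

```python
from math import log2, log, prod
from collections import Counter

def first_order_type(x, alphabet):
    """Empirical 1-block distribution p_1(.|x_1^n)."""
    n = len(x)
    c = Counter(x)
    return {a: c[a] / n for a in alphabet}

def entropy(p):
    return -sum(v * log2(v) for v in p.values() if v > 0)

def divergence(p, q):
    return sum(v * log2(v / q[a]) for a, v in p.items() if v > 0)

mu1 = {'a': 0.7, 'b': 0.2, 'c': 0.1}
x = list('aababcaabbacaaab')
n = len(x)
p1 = first_order_type(x, mu1)

direct = prod(mu1[s] for s in x)                             # product-measure probability of x_1^n
via_type = 2 ** (-n * (entropy(p1) + divergence(p1, mu1)))   # 2^{-n(H(p_1)+D(p_1||mu_1))}
var_dist = sum(abs(p1[a] - mu1[a]) for a in mu1)

print(direct, via_type)                                      # the two agree up to rounding
print(divergence(p1, mu1) >= var_dist ** 2 / (2 * log(2)))   # Pinsker: D >= |p - mu|^2 / (2 ln 2)
```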
of course. 04: IPke — itkI > El) < k(t + 0 1 2-46'6214 . however. k-1 (5) 14: Ipk( 14) — iLk1 el g U{4: lp.1.c e S=0 — kl Fix such a k and n. k-block parsings have the same (ordered) set {w(1). ENTROPY FOR RESTRICTED CLASSES. t]: w(j) = a}1 where x is given by (4). is a product measure implies (6) (Sk(x riz . Lemma 111. The fact that p. a lemma. the logarithm of the right-hand side is asymptotic to —tcE 2 14n —c€ 2 /4k. WI E B. (5). This completes the proof of Theorem 111. then provides the desired k-block bound (9) p.1. s) is the (s.ukl> Proof Left to the reader. . that is.. s).(fx: 1Pic (' 14) — e/2})y ({U/1: IwD — > e/21). is upper bounded by (8) (t 1) 1A1 '2'" 2 /4 .) Given e > 0 there is a y > 0 such that if kln < y and Ipk( 14) — I > E then there is an s E [0.(41x Pil ) — ps k (4 E [1.4 (Overlapping to nonoverlapping lemma. applied to the measure y and super-alphabet B = lAlk . The containment relation. k)-equivalence class k. w(t)} of k-blocks. and fix s E [0.1.2. Lemma 111. i yss+t +k . s).el') is the The s-shifted (nonoverlapping) empirical k-block distribution ps A k defined by distribution on l) = ps. If k is fixed and n is large enough the lemma yields the containment relation.168 CHAPTER III. hence the bound decays exponentially in n. k — 1]. so that if B denotes A" and y denotes the product measure on 13' defined by the formula v(wD = then /1(w i ). If k and c > 0 are fixed. k — 1] such that Ips k ( 14) — . constant on the (s. by the first-order result. . The latter. k)-equivalence class of x is denoted by Sk (fi'. 4-k +Ft k = ps k(• 1.3. It is. Sk(x . (7) p. The overlapping-block measure Pk is (almost) an average of the measures ps result summarized as the following where "almost" is needed to account for end effects. k)-equivalence class of xri'. s)) 1=1 where Sk(x . s-shifted. i The (s.

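The following sketch illustrates the s-shifted nonoverlapping empirical k-block distributions used in the reduction above, and checks numerically that the overlapping distribution p_k is, up to end effects, their average over the k shifts. The sample process and names are my own.

```python
from collections import Counter
import random

def shifted_nonoverlapping(x, k, s):
    """p_k^s(.|x_1^n): distribution of the k-blocks w(1),...,w(t) in the s-shifted
    nonoverlapping parsing x = w(0) w(1) ... w(t) w(t+1), with w(0) of length s."""
    n = len(x)
    t = (n - s) // k - 1                  # leave a final block w(t+1)
    blocks = [tuple(x[s + j * k: s + (j + 1) * k]) for j in range(t)]
    c = Counter(blocks)
    return {b: v / len(blocks) for b, v in c.items()}

def overlapping(x, k):
    n = len(x)
    c = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    return {b: v / (n - k + 1) for b, v in c.items()}

random.seed(0)
x = [random.randint(0, 1) for _ in range(5000)]
k = 3
avg = Counter()
for s in range(k):
    for b, v in shifted_nonoverlapping(x, k, s).items():
        avg[b] += v / k
p = overlapping(x, k)
print(max(abs(avg[b] - p.get(b, 0.0)) for b in avg))   # small, up to end effects
```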
1. for computing Markov probabilities yields E w uvw ) t(v)=g mg 7r(b) The right-hand side is upper bounded by *(g). w E A*. Proof A simple modification of the i. since p.SECTION 111. Theorem 111.u(uvw) _<111(g)A(u). where i/î (g)= maxmax a Mg ab (b) • b The function *(g) —> 1. The *-mixing concept is stronger than ordinary mixing. process is *-mixing. . let a be the final symbol of u. g) such that (10) n=t(k+g)±(k+g)-Er. it will be shown that periodic. RATES OF CONVERGENCE.i.u(w).1.u(u).b The Markov and related cases. Section 1. and let b be the first symbol of w. then it has exponential rates for frequencies and for entropy. k. proof.d. which allows the length of the gap to depend on the length of the past and future.7 If kt is ilf-mixing then it has exponential rates for frequencies of all orders.d. define t = t (n. u E Ag.(4) is a continuous function of pi e 14) and p2( • 14).1. since the aperiodicity assumption implies that iti`cfb 7r(b). with Ili(g) 1. The theorem for frequencies is proved in stages. Fix a gap g. irreducible Markov chains have exponential rates. and such that . It requires only a bit more effort to obtain the mixing Markov result. Exponential rates for Markov chains will be established in this section. as g —> oc.u(w).6 An aperiodic Markov chain is ip-mixing. An i.i. Finally. u. A process is 11 f-mixing if there is a nonincreasing sequence {*(g)} such that *(g) 1 as g —± oc.0_<r<k-i-g. Proof Let M be the transition matrix and 7r(-) the stationary vector for iii. This proves the lemma.1. so as to allow gaps between blocks. Lemma 111.l.2. LI The next task is to prove Theorem 111. is an ergodic Markov source. Then it will be shown that -*-mixing processes have exponential rates for both frequencies and entropy. The entropy part follows immediately from the second-order frequency result. The formula (10). First it will be shown that aperiodic Markov chains satisfy a strong mixing condition called *-mixing. For n > 2k + g. 169 III. Fix positive integers k and g.5 (Exponential rates for Markov sources.) If p. is all that is needed to establish this lemma.

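Here is a small numerical sketch of the quantity ψ(g) = max_{a,b} M^g_{ab}/π(b) used in this subsection's proof that aperiodic Markov chains are ψ-mixing; the two-state chain and the function names are my own choices, not from the text.

```python
import numpy as np

# Transition matrix of an aperiodic, irreducible two-state chain (example values mine).
M = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary vector pi solves pi M = pi.
eigvals, eigvecs = np.linalg.eig(M.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

def psi(g):
    """psi(g) = max over a, b of M^g(a,b)/pi(b); it decreases to 1 as the gap g grows."""
    Mg = np.linalg.matrix_power(M, g)
    return (Mg / pi).max()

for g in (1, 2, 5, 10, 20):
    print(g, psi(g))
```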
The (s. g)• The s-shifted (nonoverlapping) empirical k-block distribution. k. and if pke > E.1. provided only that k is large enough. if k > K.i. proof adapts to the Vf-mixing case. The overlappings . For each s E [0. of course. k. with gap g.7 is completed by using the 1i-mixing property to choose g so large that (g) < 2 218 . s.8 (Overlapping to nonoverlapping with gaps. there is a y > 0 and a K > 0. w(t)} of k-blocks.1.4 ( 14) — itkl 612. that k is enough larger than g and n enough larger than k to guarantee that Lemma 111. with gap g. .g .7. g)-equivalence class of 4 is denoted by Sk(xPil . such that if k n < y. then there is an s E [0. E [1.1. and the final block w(t + 1) has length n — t (k + g) — s < 2(k + g).) Given E> 0 and g. with gap g. the s-shifted. g).1. where w(0) has length s < k + g.. is the distribution on Ak defined by P. The upper bound (8) on Ips k ( 14) —lkI€121) is replaced by the upper bound (13) [i (g)]t(t 1)1111 4 C E 2 I4 and the final upper bound (9) on the probability p. Since it is enough to prove exponential rates for k-th order frequencies for large k. of course. k + g — 1] such that — 114.1. however.g(alnxii). . k + g — 1]. the sequence 4 can be expressed as the concatenation (11) x = w(0)g(1)w(1)g(2)w(2). The sequences 4 and are said to be (s. g)) kii(g)i t i=1 ktk(w(i)). with the product measure bound (6) replaced by (12) ian (Sk (4. for 1 < j < t. The logarithm of the bound in (14) is then asymptotically at most —tce 2 18n —c€ 2 /8(k + g).4 easily extends to the following. the g(j) have length g and alternate with the k-blocks w(j). The proof of Theorem 111.g(t)w(t)w(t + 1). This is called the the s-shifted. g)-equivalence class Sk (4.. If k is large relative to gap needed to account both for end effects and for the length and n is large relative to k. where the w(j) are given by (11). this is no problem. (14: Ipke placed by the bound (k + g)[*(g)]t (t + 010 2 -tc€ 2 14 (14) — /41 ED is re- assuming. k-block parsings have the same (ordered) set {w(1).1. t]: w(j) = 411 It is. g-gapped.170 CHAPTER III. s. Completion of proof of Theorem 111.7. . hence the bound decays exponentially. ENTROPY FOR RESTRICTED CLASSES. this completes the proof of Theorem 111. where "almost" is now block measure Pk is (almost) an average of the measures p k gaps. constant on the (s. The i.g (c4) = pisc. and Lemma 111.8 holds. s. g)-equivalent if their s-shifted. k. The following lemma removes the aperiodicity requirement in the Markov case. k-block parsing of x. Lemma 111.d. k-block parsing of 4. and establishes the exponential rates theorem for aperiodic Markov chains.

the nonoverlapping block measure ps condition p g (4) = 0. Cd. ?_ decays exponentially as n -. for k can always be increased or decreased by no more than d to achieve this.10 Let n i— r(n) be a positive decreasing function with limit O. such that Prob(X n+ 1 E Cs91lXn E C s = 1. Define the function c: A i— [1. Theorem 111.1. A simple cutting and stacking procedure which produces counterexamples for general rate functions will be presented. an aperiodic Markov measure with state space C(s) x C(s @ 1) x • • . Thus. unless c(ai) = c(xs ). d] by putting c(a) = s. The following theorem summarizes the goal. because it is conceptually quite simple. Theorem 111. . RATES OF CONVERGENCE. if a E C. n > N.l.1.SECTION 111. 1 < s < d. k .g ( ! XIII ) — 14c(xs)) 1 > 6 / 2 1. It can also be assumed that k is divisible by d. because it illustrates several useful cutting and stacking ideas. Let g be a gap length. IL({xi: IPk — iLk I E}) > 6/2 } separately.u k is an average of the bt k k+g-1 fx rI : IPk( ifil ) — Ad > El C U VI I: s=0 IPsk. 171 Proof The only new case is the periodic case.oc. Also let pP) denote the measure A conditioned on Xi E C5 . which can be assumed to be divisible by d and small relative to k. which has no effect on the asymptotics. . Let A be an ergodic Markov chain with period d > 1. C2. Examples of renewal processes without exponential rates for frequencies and entropy are not hard to construct. so the previous theory applies to each set {xii': ' K g — I (c(xs )) I as before.c Counterexamples. and thereby completing the CI proof of the exponential-rates theorem for Markov chains. . g = ps k .1. it follows that if k I g is large enough then Since . There is a binary ergodic process it and an integer N such that An (1-112 : 'Pi (• Ix) — Ail 1/2» ?. Lemma 111. however. and. and partition A into the (periodic) classes.u (s) is. in part.5. x C(s @ d — 1). see Exercise 4. C1. in part. (s) . k + g — 1].9 An ergodic Markov chain has exponential rates for frequencies of all orders. The measure .1. establishing the lemma.. ) where @ denotes addition mod d.. III. _ r(n). g (• WO satisfies the For s E [0.

i (1) = 1/2.n 1.u were the average of two processes. for all m. then it is certain that xi = 1." suppose kt is defined by a (complete) sequence of column structures. according to whether x`iz consists of l's or O's. to implement both of these goals. then (141 : 1/31( Ixr) — I ?_ 112» ?_ (1 _ so that if 1/2> fi > r(m) and L is large enough then (16) /2. /31(114) = 1. Mixing is done by applying repeated independent cutting and stacking to the union of the second column C2 with the complement of C. no matter what the complement of C looks like. mixing the first part into the second part so slowly that the rate of convergence is as large as desired. In particular. Suppose some column C at some stage has all of its levels labeled '1'. of positive integers which will be specified later. that is. for 1 < j < in. then for any n. for all m > 1. If x is a point in some level of C below the top ni levels.u n ({xr: pi(114) = 11) (1 — -11 L i z ) /3. The mixing of "one part slowly into another" is accomplished by cutting C into two columns.n: — I > 1/2)) > r(m). The cutting and stacking method was designed.„ ({x. The details of the above outline will now be given. Ergodic counterexamples are constructed by making . say C1 and C2. . As long as enough independent cutting and stacking is done at each stage and all the mass is moved in the limit to the second part. = (y +V)/2. and to mix one part slowly into another. The construction will be described inductively in terms of two auxiliary unbounded sequences. For example. if exactly half the measure of 8(1) is concentrated on intervals labeled '1'. so that A n (14: i pi( 14) — I 1/21) = 1. n > 1. if .172 CHAPTER III.u look like one process on part of the space and something else on the other part of the space. ENTROPY FOR RESTRICTED CLASSES. such that /3(1) = 1/2 and r(m) < f3(m) < 1/2. while . the final process will be ergodic and will satisfy (16) for all m for which r(m) < 1/2. Without loss of generality. Let m fi(m) be a nonincreasing function with limit 0. then the chance that a randomly selected x lies in the first L — ni levels of C is (1 — m/L)p. Counterexamples for a given rate would be easy to construct if ergodicity were not required. and hence (15) . where y is concentrated on the sequence of all l's and iY is concentrated on the sequence of all O's. for it can then be cut and stacked into a much longer column. it can be supposed that r(m) < 1/2. {41 } and fr. As an example of what it means to "look like one process on part of the space and something else on the other part of the space. to make a process look like another process on part of the space. long enough to guarantee that (16) will indeed hold with m replaced by ni + 1.u. The desired (complete) sequence {S(m)} will now be defined. Furthermore.u. pi(11x) is either 1 or 0. so that. if C has height L and measure fi. Each S(m) will contain a column . The bound (16) can be achieved with ni replaced by ni + 1 by making sure that the first column CI has measure slightly more than r(m 1). in part.

. and measure 13(m). For ni > 1. so that C(m — 1.(m) goes to 1. ni > 1. to guarantee that (17) holds. by Theorem 1. of height 4.(4) < 2 -n (h+E) } goes to 0 exponentially fast.1 . RATES OF CONVERGENCE. this guarantees that the final process is ergodic. 2) has measure fi(m —1) — )6(m). and rm the final process /1 is ergodic and satisfies the desired condition (17) An (Ix'i'I : Ipi(• 14)— Ail a. Since rm —> oc. (b) Show that the measure of {xii: p. where C(m — 1.10.1/20 > r (m). 2).(m). Show that the measure of the set of n-sequences that are not almost strongly-packed by 'Tk goes to 0 exponentially fast. . 2) U 7Z(m — 1). so large that (1 — m/t n1 )/3(m) > r(m).(1) consist of one interval of length 1/2.) (c) Show that a 0. together with a column structure R.n } is chosen.10. All that remains to be shown is that for suitable choice of 4. labeled '0'. First C(m — 1) is cut into two columns. the total width of 8(m) goes to 0.7.d Exercises. Suppose A has exponential rates for frequencies. has exponential rates for frequencies and entropy. 1) has measure /3(m) and C(m — 1. 1) is cut into t ni g ni -1 columns of equal width. (a) Choose k such that Tk = {x: i(x) ) 2h j has measure close to 1.11. The new remainder R(m) is obtained by applying rni -fold independent cutting and stacking to C(m — 1. 1) and C(m — 1. This completes the proof of Theorem 111. and the total measure of the intervals of S(m) that are labeled with a '0' is 1/2. which are then stacked to obtain C(m). Once this holds.(m — 1) and 1?-(m) are (1 — 2')-independent. and let R. so the sequence {8(m)} is complete and hence defines a process A.: Ipi(. all of whose entries are labeled '1'. then it can be combined with the fact that l's and O's are equally likely and the assumption that 1/2 > 8(m) > r (m). tm holds.-mixing process has exponential rates for entropy. 0 111. sequentially. 173 C(m). let C(1) consist of one interval of length 1/2. 1. for this is all that is needed to make sure that m ttn({x. 14)— Ail a 1/2 }) a (1 — — ) /3(m).10. labeled '1'. disjoint from C(m). The column C(m —1. This guarantees that at all subsequent stages the total measure of the intervals of 8(m) that are labeled with a '1' is 1/2. 8(m) is constructed as follows. (a) Show that y has exponential rates for frequencies.1.SECTION III. (Hint: as in the proof of the entropy theorem it is highly unlikely that a sequence can be mostly packed by 'Tk and have too-small measure. This is possible by Theorem 1. 1. The sequence {r. 2. Since the measure of R. C(m — 1. and since /3(m) -÷ 0. 2) U TZ. To get started. and hence the final process must satisfy t 1 (l) = A i (0) = 1/2. The condition (17) is guaranteed by choosing 4. Let y be a finite stationary coding of A and assume /2.

) 5. such as data compression. Consistent estimation also may not be possible for the choice k(n) (log n)I h. There are some situations. As before. there is no hope that the empirical k-block distribution will be close to the true distribution. as a function of sample path length n. Thus it would be desirable to have consistency results for the case when the block length function k = k(n) grows as rapidly as possible. Establish this by directly constructing a renewal process that is not lit-mixing. see Exercise 2. Here is where entropy enters the picture. the resulting empirical k-block distribution almost surely converges to the true distribution of k-blocks as n oc.2 Entropy and joint distributions. below. The definition of admissible in probability is obtained by replacing almost-sure convergence by convergence in probability.174 CHAPTER III.d.) 3. a.2. by the ergodic theorem. if k(n) > (1 + e)(logn)1 h. where I • I denotes variational. For example. The such estimates is important when using training sequences. In particular. A nondecreasing sequence {k(n)} will be said to be admissible for the ergodic measure bt if 11111 1Pk(n)(• 14) n-4co lik(n) = 0. that is. (Hint: make sure that the expected recurrence-time series converges slowly. This is the problem addressed in this section. ENTROPY FOR RESTRICTED CLASSES.1. conditional k-type too much larger than 2 —k(h(4)—h(v) and measure too much larger than 2-PP). 4. If k is fixed the procedure is almost surely consistent. the probability of a k-block will be roughly 2 -kh . The kth-order joint distribution for an ergodic finite-alphabet process can be estimated from a sample path of length n by sliding a window of length k along the sample path and counting frequencies of k-blocks. Every ergodic process has an admissible sequence such that limp k(n) = oc. Exercise 4 shows that renewal processes are not *-mixing. see Theorem 111. The problem addressed here is whether it is possible to make a universal choice of {k(n)) for "nice" classes of processes. Use cutting and stacking to construct a process with exponential rates for entropy but not for frequencies. The empirical k-block distribution for a training sequence is used as the basis for design. a fact guaranteed by the ergodic theorem. in . processes or Markov chains.i. s. Section 111. to design engineering systems. Show that there exists a renewal process that does not have exponential rates for frequencies of order 1. distance. (b) Show that y has exponential rates for entropy. k(n) > (logn)I(h — 6). or. distributional. for if k is large then. (Hint: bound the number of n-sequences that can have good k-block frequencies. that is. with high probability. after which the system is run on other. that is.. finite consistency of sample paths. independently drawn sample paths. let plc ( 14) denote the empirical distribution of overlapping k-blocks in the sequence _el'. such as the i. It is also can be shown that for any sequence k(n) -± oc there is an ergodic measure for which {k(n)} is not admissible. equivalently. where it is good to make the block length as long as possible.

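The admissibility question discussed in this section can be seen numerically in the following sketch (coin-tossing example, names, and the particular choices of k(n) are mine): the empirical k(n)-block distribution is close to the truth when k(n) grows slower than (log n)/h, and far from it when k(n) grows faster.

```python
import random
from collections import Counter
from math import floor, log2

def k_block_dist_to_truth(x, k):
    """Variational distance between the empirical k-block distribution of x
    and the true k-block distribution of unbiased coin-tossing (2^-k each)."""
    n = len(x)
    counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    total = n - k + 1
    seen = sum(abs(c / total - 2 ** -k) for c in counts.values())
    unseen = (2 ** k - len(counts)) * 2 ** -k
    return seen + unseen

# For coin-tossing h = 1, so k(n) ~ 0.5 log2 n should be admissible,
# while k(n) ~ 2 log2 n should not.
random.seed(0)
x = [random.randint(0, 1) for _ in range(2 ** 16)]
n = len(x)
for alpha in (0.5, 1.0, 2.0):
    k = max(1, floor(alpha * log2(n)))
    print(alpha, k, round(k_block_dist_to_truth(x, k), 3))
```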
2. an approximate (1 — e -1 ) fraction of the k-blocks will fail to appear in a given sample path of length n. . Theorem 111. [52]. and k(n) < (log n)/(h 6) then {k(n)} is admissible for p.d.i. Remark 111. that if the process is a stationary coding of an i. provided x and y are independently chosen sample paths. ENTROPY AND JOINT DISTRIBUTIONS. almost surely. The weak Bernoulli result follows from an even more careful look at what is really used for the ifr-mixing proof. y) is the waiting time until the first n terms of x appear in the sequence y then. The d-distance is upper bounded by half the variational distance.d.2.5. The nonadmissibility theorem is a simple consequence of the fact that no more than 2 k(n)(h—E) sequences of length k can occur in a sequence of length n and hence there is no hope of seeing even a fraction of the full distribution. in particular. or Ili-mixing.3.SECTION 111. Markov.d.2. requires that past and future blocks be almost independent if separated by a long enough gap.) If p is i.5 A third motivation for the problem discussed here was a waiting-time result obtained by Wyner and Ziv.2.2. for it is easy to see that. see Exercise 3. who used the d-distance rather than the variational distance. The Ornstein-Weiss results will be discussed in more detail in Section 111. The admissibility results of obtained here can be used to prove stronger versions of their theorem. These applications.. The weak Bernoulli property.1 (The nonadmissibility theorem. case of most interest is when k(n) The principle results are the following. for ergodic Markov chains. if 0 < E < h.) If p.(x. They showed.(x. along with various related results and counterexamples.3 (Weak Bernoulli admissibility. process then the d-distance between the empirical k(n)-block distribution and the true k(n)-block distribution goes to 0.i. independent of the length of past and future blocks. 175 the unbiased coin-tossing case when h = 1.4 A first motivation for the problem discussed in this section was the training sequence problem described in the opening paragraph. Remark 111.2. (1/n) log 14/. and hence the results described here are a sharpening of the Ornstein-Weiss theorem for the case when k < (log n)/(h 6) and the process satisfies strong enough forms of asymptotic independence.) If p. which will be defined carefully later. provided k(n) — (log n)/ h. the choice k(n) — log n is not admissible. The Vi-mixing concept was introduced in the preceding section. y) converges in probability to h.2 (The positive admissibility theorem.i. Theorem 111. with high probability. is ergodic with positive entropy h. and ifr-mixing cases is a consequence of a slight strengthening of the exponential rate bounds of the preceding section. A second motivation was the desire to obtain a more classical version of the positive results of Ornstein and Weiss. Thus the (log n)/(h 6). The positive admissibility result for the i.u. Theorem 111. They showed that if 147. and if k(n) > (log n)/(h — 6) then {k(n)} is not admissible in probability for p. is weak Bernoulli and k(n) < (log n)/(h 6) then {k(n)} is admissible for . [86]. are presented in Section 111.

so that . Proof Define yn = yn (x) — > 5E1 < 2(n ± 01/312-nCe2 =0 1 I if xn E B otherwise.k (ték oc. it can be supposed that {k(n)} is unbounded. ENTROPY FOR RESTRICTED CLASSES. This implies the nonadmissibility theorem. (2) (14 : 1Pke 14) — 141 ?.d.i. by the distance lower bound.(4) < 2-k(h-E/2)} The assumption n 2k (h -E ) implies that ILik(4)1 < 2k (h -o. the following holds. see the bound (9) of Section MA. with finite alphabet A. and hence Pk . process p.2. the error is summable in n. by the ergodic theorem. Let p. is enough to prove the positive admissibility theorem for unbiased coin-tossing. n — k +1]} .k(h—e12) • The entropy theorem guarantees that . it follows that Since pk(a fic ix) = 0.') n Tk ( 6/2)) 2 k(h-0 2-k(h-E/2) 2 -kE/2 . (1). without loss of generality. process with Prob (yn = 1) < Let Cn = E yi > 2En} . for. where 0 < E < h. where t nlk.) There is a positive constant C such that for any E > 0. Next the positive admissibility theorem will be established. so that { yn } is a binary i. for any finite set A. Define the empirical universe of k-blocks to be the set Lik (4) = x: +k-1 = a l% for some i E [1.d. First the nonadmissibility theorem will be proved.i. The key to such results for other processes is a similar bound which holds when the full alphabet is replaced by a subset of large measure. lik(x). as the reader can show. if n > IA I''. ED < k(t +1) 1A1k 2 -t cE214 . First note that. be an ergodic process with entropy h and suppose k = k(n) > (log n)/(h — E). and hence I pk (.u.176 CHAPTER III. The rate bound. Lemma 111..14) — pk1 -± 1 for every x E A'.2. III.k (Tk (e/2)) —> 1. since any bounded sequence is admissible for any ergodic process.6 (Extended first-order bound. whenever al' (1) Let 1 Pk( ' 14) — 1 111 ((lik(X)) c ) • Tk (6/2) = 14: p.a Proofs of admissibility and nonadmissibility. and for any B c A such that p(B) > 1 — É and I BI > 2. 1/31(. E.uk ((tik(4)) c) also goes to 1. since each member of Tk (E/2) has measure at most 2. for any n > 0. for any i.

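The empirical universe of k-blocks used in the nonadmissibility argument of this section is easy to compute directly; the following sketch (sample process and names mine) shows that its cardinality is trivially at most n − k + 1, which is a vanishing fraction of the roughly 2^{kh} typical k-blocks once k is of order (log n)/(h − ε).

```python
import random

def empirical_universe_size(x, k):
    """Number of distinct k-blocks appearing in x_1^n; trivially at most n - k + 1."""
    return len({tuple(x[i:i + k]) for i in range(len(x) - k + 1)})

random.seed(0)
n = 2 ** 14
x = [random.randint(0, 1) for _ in range(n)]
for k in (7, 14, 28):          # h = 1 here, so k = 2 log2 n is far past the threshold
    u = empirical_universe_size(x, k)
    print(k, u, n - k + 1, 2 ** k)
```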
Let 8 be a given positive number. Fix a set {i1.} with m < 2cn.(B(i i . tt (b) /Ld o B) + E A(b) = 2(1 — . note the set of all 4 for which Xi E B if and only if i çz' {i1. im )) for which in < 26n cannot exceed 1. Also let ALB be the conditional distribution on B defined by .d. The positive admissibility result will be proved first for the i. x i g B. in. in turn. and hence. in. see (5).i. 14) — ?_ 31) < oc. are disjoint for different { -1.d.i.2. x im from x. ENTROPY AND JOINT DISTRIBUTIONS.tti I > 56}) which is. The proof of (6) starts with the overlapping to nonoverlapping relation derived in the preceding section. The assumption that m < 26n the implies that the probability ({Xj1 E B(ii. Furthermore. }.n ) den. in. im ))(s 1)1B12 2< ini mn 1)IBI2—n(1 —20c€ 2 The sum of the p. and assume k(n) < (log n)/ (h 6). by the Borel-Cantelli lemma. i 2 . upper bounded by (5) since E Bs: — 11 1. -2. .cn4E2 177 The idea now is to partition An according to the location of the indices i for which . .BI > € 1) Ii ui . case. (14: ip(. and for xri' E B(ii. the sets {B(ii.kI ?3/21. • • • . put s = n — m. define ï = i(xil) to be the sequence of length s obtained by deleting xi„ xi2 . bEB [ A\ The exponential bound (2) can now be applied with k = 1 to upper bound (5) by tt (B(i i. no) im }:nz>2En + 1) 22-cn4E2 . . with entropy h.)) /1 B E B s : i Pi e liD — I > 3}). k-1 (7) 14: 1Pk( 14) — tuki 31 g s=o IPsk( . (Ca ) < (n 1)22 . — p. the lemma follows. then extended to the *-mixing and periodic Markov cases. It will be shown that (6) Ep.u i and let AB be the corresponding product measure on BS defined by ALB. • • • im) and have union A. combined with (4) and the assumption that I BI > 2. which immediately implies the desired result. in1 ): m > 2en} have union Cn so that (3) yields the bound (4) (ii. Assume . is upper bounded by tc(B(i i . im ): 'pi ( 14) . let B(ii.u(B)) bvB 26. page 168.) . i.AL B I --= 2_.i is i.SECTION 111. For each m < n and 1 < j 1 < i2 < • • • < j The sets . . and apply the bound (2) to obtain (3) p.

there is a K such that yi (T) = kil (Tk ) > 1 — 8/10. Since the number of typical sequences can be controlled. all k > K = K (g). see (7) on page 168. k — 1]. as n oc. ENTROPY FOR RESTRIC I ED CLASSES. The set rk -= {X I( : itt(x lic ) > 2— k(h+€12)} has cardinality at most 2k(h ±e/ 2) . by assumption. and plc ( '. The idea now is to upper bound the right-hand side in (8) by applying the extended first-order bound. with B replaced by a suitable set of entropy-typical sequences. with A replaced by A = A k and B replaced by T. The rigorous proof is given in the following paragraphs. which. though no longer a polynomial in n is still dominated by the exponential factor. Lemma 111. and. processes. indeed.6. as k grows. for s E [O.el') denotes the s-shifted nonoverlapping empirical k-block distribution.178 CHAPTER III. 3 } ) < 2(k + g)[lk(g)]` (t 1 )22127 tceil00 . to the super-alphabet A". and y denotes the product measure on » induced by A. To obtain the desired summability result it is only necessary to choose .(h±E/2)2-tc82/100 . produces the bound (9) tx: IPke — 1kI > < 2k(t 1)21. since it was assumed that k(n) oc.2. since nIk(n) and p < 1. where y is a positive constant which depends only on g.d. furthermore. combined with (8) and (7).6. where n = tk + k +r. This is. The bound (9) is actually summable in n. fi = h + 612 h ±E The right-hand side of (9) is then upper bounded by (10) 2 log n h +E (t since k(n) < (log n)/ (h E). This establishes that the sum in (6) is finite. t The preceding argument applied to the Vi-mixing case yields the bound (1 1) kt({4: 1Pke — il ?_. (8) ti({4: Ins* — Ak1 3/2}) v({ 01: IP1( .2. summable in n. by the entropy theorem. thereby completing the proof of the admissibility theorem for i. Lemma 111. yields the bound P 101) — vil > 3/20 < 2(t + 1 )2/2)2_tc82000. which is valid for k/n smaller than some fixed y > 0.i. To show this put a =C32 1100. Note that k can be replaced by k(n) in this bound. the polynomial factor in that lemma takes the form. valid for all k > K and n > kly. (1 + nl k) 2"1±(') . valid for all g. 0 < r < k. As in the preceding section. for k > K. which. The extended first-order bound. 101 ) — vil where each wi E A = A'. and all n > kly.

n > 1. hence for the aperiodic Markov case. = xi. in turn.) Also let Ex -g (f) denote conditional expectation of a function f with respect to the -g-m random vector X —g—m• —g A stationary process {X i } is weak Bernoulli (WB) or absolutely regular. An equivalent form is obtained by letting m using the martingale theorem to obtain the condition oc and (13) E sifc Gun( . -mixing result was the fact that the measure on shifted. With a = C8 2 /200 and /3 = (h + E /2)/ (h E). This establishes the positive admissibility theorem for the Vi-mixing case. the measure defined for ar iz E An by iun (41x )— g_m = Prob(r. the s-shifted. ENTROPY AND JOINT DISTRIBUTIONS. upper bounded by log n 2(t h 1)n i3 . nonoverlapping blocks could be upper bounded by the product measure on the blocks.. if given E > 0 there is a gap g > 0 such that (12) E X ig-m : n (.SECTION 111. and oc.. for all n > 1 and all m > O. I11. since since t n/ k(n) and 8 < 1. ar iz. 179 g so large that (g) < 2c6212°°. The weak Bernoulli property leads to a similar bound. which requires that jun (4 lxig g _m ) < (1 + c)/L n (ar iz). The weak Bernoulli property is much stronger than mixing. uniformly in m > 0.9. and xig g provided only that g be large enough. which is again summable in n. at least for a large fraction of shifts.b The weak Bernoulli case. assuming that k(n) the right-hand side in (11) is. k-block parsing of x.1X —g—m -g — n i) < E .2. As before. provided a small fraction of blocks are omitted and conditioning on the past is allowed. i X -go) — n> 1. —g — m < j < —g}. A process has the weak Bernoulli property if past and future become almost independent when separated by enough. which only requires that for each m > 0 and n > 1.„(. The key to the O.. let p. subject only to the requirement that k(n) < (log n)/(h E).1.2. and this is enough to obtain the admissibility theorem. however. It is much weaker than Ifr-mixing. To make this precise. = aril and Xi g Prob ( X Tg g . see Lemma 111. The extension to the periodic Markov case is obtained by a similar modification EJ of the exponential-rates result for periodic chains. that is. where the variational distance is used as the measure of approximation. conditioned on the past {xj. with gap g is the expression (14) 4 = w(0)g(1)w(1)g(2)w(2). with only a small exponential error.1x:g g _n1 ) denote the conditional measure on n-steps into the future. there is a gap g for which (12) holds.g(t)w(t)w(t ± 1). . where g and m are nonnegative integers.

_{. and the final block w(t 1) has length n — t(k+g) —s < 2(k ± g). ([w(i.& x_ s tim -1)(k+g) ) < (1 + y)tt(w(j ni )). Note that Bi is measurable with respect to the coordinates i < s + j (k + g). eventually almost surely. fl Bi ) B*) to n Bi ) = [w(j.„ tc(B*). t] is called a (y. Lemma 111. the g-blocks g(j) alternate with the k-blocks w(j). g). JEJ-th„) D and Lemma 111.n )] n B . each The first factor is an average of the measures II (w(j.180 CHAPTER III. . and later. k. and there is a sequence of measurable sets {G n (y)}..u(w(in. s. and g are understood. k. s. g) and fix a finite set J of positive integers. for 1 < j < t. which exists almost surely by the martingale theorem. Lemma 111. The set of all x E Az for which j is a (y.) .n )] n Bi.„ } ( [ w (. k. Thus (15) yields (n([w(i)] n B.)) (1+ y)IJI J EJ Put jp. Here. g)-splitting index will be denoted by Bi (y. where w(0) has length s. k.)) • tt ([.7 follows by induction. by the definition of B11„. = max{ j : j E J } and condition on B* = R E . )] n Bi ) .(w(j)). The weak Bernoulli property guarantees the almost-sure existence of a large density of splitting indices for most shifts. such that the following hold. y. or by B i if s.2. (a) X E Gn(y). there are integers k(y) and t(y). ENTROPY FOR RESTRICTED CLASSES. then there is a gap g = g(y). lie lx i 00 ) denotes the conditional measure on the infinite past.)) (1 + y) • .2. s.2.) If l is weak Bernoulli and 0 < y < 1/2.n )] of which satisfies n Bi I xs_+. defined as the limit of A( Ix i rn ) as m oc.8 (The weak Bernoulli splitting-set lemma. Then for any assignment {w(j): E J } of k-blocks tt Proof obtain (15) JEJ (naw(i)] n B. k.1)(k +g) ).( . g)-splitting index for x E A Z if (w(j) ix _ s ti -1)(k+g) ) < (1 + y)p.7 Fix (y. An index j E [1. s.2. Note that the k-block w(j) starts at index s (j — 1)(k + g) + g ± 1 and ends at index s j (k + g).

Fix k > M.s. k ± g — 1] for each of which there are at least (1 — y)t indices j in the interval [1. so there is a subset S = S(x) C [0. Fatou's lemma implies that I 1 — f (x) d p(x) < )/4 . then for x E G(y) there are at least (1 — y)(k + g) values of s E [0. then /i(CM) > 1 — (y 2 /2).. denotes the indicator function of Cm. eventually almost surely.(x) < Y 4 4 Fix such a g and for each k define fk(X) — . Vk ?_ MI. Direct calculation shows that each fk has expected value 1 and that {fk} is a martingale with respect to the increasing sequence {Ed. Thus fk converges almost surely to some f. t] that are (y.2.3C ( IX:e g.c ) • Let Ek be the a-algebra determined by the random variables.t=o — L — 2 then x E Gn(Y).SECTION 111. The ergodic theorem implies that Nul+ rri cx. see Exercise 7b. and (t + 1)(k g) < n < (t +2)(k ± g). Proof By the weak Bernoulli property there is a gap g = g(y) so large that for any k f (xx ) 1 p(x i`) p(x i`lxioo g) dp. if t > t(y). The definition of G(y) and the assumption t > 2/y 2 imply that 1 r(k+g)-1 i=0 KCht(T1 X) = g) E 1 k-i-g-1 1 t(k k + g s=o t j =i E_ t K cm ( T s-hc i —( k + g) x ) > _ y 2 . t > t(y). so that if G(y) = —EICc m (T z x) > 1 1 n-1 n.U(. g)-splitting indices for x. (Ts+0-1)(k+g)x ) > — y. {Xi : i < —g} U {X. ENTROPY AND JOINT DISTRIBUTIONS. .: 1 < i < kl. Put k(y)= M and let t(y) be any integer larger than 2/y 2 . and if (t +1)(k + g) _< n < (t + 2)(k +g). k + g —1] of cardinality at least (1 — y)(k + g) such that for x E G(y) and s E S(x) k c. 181 (b) If k > k(y).. k. 1 N-1 i=u y2 > 1 — — a. 2 where kc. s. and fix an x E Gn(Y). 1 so there is an M such that if Cm = lx :I 1 — fk (x) 15_ .

k. n > 1. provided the sets J are large fractions of [1. for gaps. J)) < (1+ y)' rip. = Gn(Y).E. g)-splitting indices for x. ( .1-1)(k +g) x E Cm. choose a positive y < 1/2.182 CHAPTER III. The overlapping-block measure Pk is (almost) an average of the measures pk s'g i. For J C [1. if kl n < y. the empirical distribution of k-blocks obtained by looking only at those k-blocks w(j) for which j E J. then Ts±(.J k n Pk. But if Tr+(i -1)(k +g) x E Cm then g(w(i)Ixst i-1)(k+g )— g) < 1 ( + y)tt(w(D).g(ai )= E W(i) an! . Assume A is weak Bernoulli of entropy h.2.(s. and measurable sets G. and Lemma 111. and k(n) < (log n)/(h + E). k. k E k A. g)-splitting index for x. t] of cardinality at least (1 — y)t.8 implies that there are (1 — y)(k + g) indices s E [0.8. ENTROPY FOR RESTRICTED CLASSES. s. Proof of weak Bernoulli admissibility. Fix t > t(y) and (t+1)(k+g) < n < (t+2)(k+g).7 and the fact that I Ji < t yield (1 6) A (n[w(j)] jEJ n D. J). In particular. and note that jEJ n B = Dn(s. t].3. 14) — ILkI 3/4 for at least 2y (k + g) indices s E [0. If k is large relative to gap length and n is large relative to k.2.9 Given 3 > 0 there is a positive y < 1/2 such that for any g there is a K = K(g. Fix > 0. which implies that j is a (y. (. n > 1. . this is no problem. that is. Theorem 111. Lemma 111. g)-splitting index. Since I S(x)I > (1 —3)(k+ g). k +g —1]. if s E S(X).1. this completes the proof of Lemma 111. where "almost" is now needed to account for end effects. t] which are (y. s. for any subset J C [1. for x E G(y) and s E S(X) there are at least (1 — y)t indices j in the interval [1.8 easily extends to the following sharper form in which the conclusion holds for a positive fraction of shifts.)) If x E G(y) then Lemma 111. and if Ipk(-14) — ?. k. k(y) and t(y). y) such that if k > K. g)-splitting indices for x. t]. so that Lemma 111. k.8 hold. where k(y) k < (log n)/(h+E). and for the j g J. s. k + g — 1] for each of which there are at least (1 — y)t indices j in the interval [1.2.2. In summary.2. t] that are (y. k + g — 1] and J C [1. For s E [0. then choose integers g = g(y). for at least (1—y)t indices j E [1. however. J) be the set of those sequences x for which every j E J is a (y. t].2. t] define s. . s.: 3 then 114' g j ( . let Dn (s. so that conditions (a) and (b) of Lemma 111.

i.i. This bound is the counterpart of (11).2.c Exercises. 2. J) and ips (17) is contained in the set k+ g —1 s=0 lpk — itkl a. for which x E Dn (s.1. Aperiodic renewal and regenerative processes. Since X E G n (y). 1. weak Bernoulli. as in the -tfr-mixing case.] 1J12:(1—y)t (ix: Ips k :jg elxi) — > 3/41 n Dn (s.. this establishes Theorem 111. Aperiodic Markov chains are weak Bernoulli because they are ifr-mixing. 183 On the other hand.g) assures that if I pk (14) — Ik1> S then I Ps kv g j ( . The weak Bernoulli concept was introduced by Friedman and Ornstein.2.2. and k > k(y) and t > t(y) are sufficiently large. J)) The proof of weak Bernoulli admissibility can now be completed very much as the proof for the 'tit-mixing case. t ] of cardinality at least (1— y)t. [13]. as part of their proof that aperiodic Markov chains are isomorphic to i.d. Show that if p is i. = {x: Ipk (n) (.9 _> 3/4 for at least 2y (k -I. see Exercises 2 and 3. indices s. t] of cardinality at least > 3/4.d.2. 3. then for any x E G 0 (y) there exists at least one s E [0. processes. In particular. unbounded sequence {k(n)} there is an ergodic process p such that {k(n)} is not admissible for A. the bound (18) is summable in n.10 The admissibility results are based on joint work with Marton. the measure of the set (17) is upper bounded by (18) 2 . Show that for any nondecreasing. in Section 1V. does not have exponential rates of convergence for frequencies. then. [38]. 2-2ty log y to bound the number of subsets J C [1.ei') — [4(n)1 > 31. If y is small enough. which need not be if-mixing are. 31 n G(y) UU ig[1. if k(n) < (logn)I(h + E).SECTION 111. I11.d. t] of cardinality small. Show that there exists an ergodic measure a with positive entropy h such that k(n) (log n)/(h + E) is admissible for p. for any subset J C [1.2. processes in the sense of ergodic theory. yet p. the set k :g j (-14) — (1 — y)t. then p(B 0 ) goes to 0 exponentially fast. . 4. with an extra factor.3. 14) Thus if y is sufficiently at least (1—y)t. y can be assumed to be so small and t so large that Lemma 111. and if B. Remark 111.3. Show that k(n) = [log n] is not admissible for unbiased coin-tossing. It is shown in Chapter 4 that weak Bernoulli processes are stationary codings of i.i. k + g — 1] and at least one J C [1. however. 2-2t y log y yy[k(o+ + 2k(n)(h+E/2)2_ t(1—Y)C32/4°° for t sufficiently large. Using the argument of that proof. eventually almost surely. ENTROPY AND JOINT DISTRIBUTIONS.

k.2. as n 6. Let R. and a deep property.1 (The d-admissibility theorem. 5.3 The d-admissibility problem. ENTROPY FOR RESTRICIED CLASSES. k(n). 151. the theorem was first proved by Ornstein and Weiss.1.. The d-admissibility theorem has a fairly simple proof. process. Stated in a somewhat different form. Show that if k(n) oc. and p. is i. then . where each w(i) has length k. Show that the preceding result holds for weak Bernoulli processes.(x7.3. almost surely.) g dp.(.v(4)1. let E(m) be the a-algebra determined by . The principal positive result is Theorem 111.) If k(n) < (log n)I h then {k(n)} is d-admissible for any process of entropy h which is a stationary coding of an i. 7. the admissibility theorem only requires that k(n) < (log n)I h.) (b) Show that the sequence fk defined in the proof of Lemma 111.. A similar result holds with the nonoverlapping-block empirical distribution in place of the overlapping-block empirical distribution.. a. Theorem 11. and let E be the smallest complete a-algebra containing Um E(m).184 CHAPTER ER III.d. . Note that in the d-m etric case. e). called the finitely determined property.. w(t)r. (a) Show that E(glE)(x) = lim 1 ni -+cx) 11 (Pm(x)) fp. The earlier concept of admissible will now be called variationally admissible.3.. y) E a1 . Since the bound (2) dn (p.i. [52}. always holds a variationally-admissible sequence is also d-admissible.i. A sequence {k(n)} is called d-admissible for the ergodic process if (1) iiIndn(Pk(n)(• Alc(n)) = 0.(x7 . though only the latter will be discussed here. (Hint: use the martingale theorem.u(R. (Hint: use the previous result with Yi = E(fk+ilEk). For each x let Pm (X) be the atom of E(m) that contains x. c) be the set of all kt-sequences that can be obtained by changing an e(log n)I(h+ fraction of the w(i)'s and permuting their order.8 is indeed x4) to evaluate a martingale.S The definition of d-admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. s. c)) —> 1. based on the empirical entropy theorem. Ym. while in the variational case the condition k(n) < (log n)I(h e) was needed. Let {Yi : j > 1} be a sequence of finite-valued random variables. Assume t = Ln/k] and xin = w(1). for any integrable function g. The d-metric forms of the admissibility results of the preceding section are also of interest.d.) Section 111.

processes. 1[14(4)]s I < 2 k(" 12) by the blowup-bound. The other is a much deeper result asserting in a very strong way that no sequence can be admissible for every ergodic process. tai :x i some i E [0.3 (The strong-nonadmissibility theorem. < 2k(h-02) 2-ch-E/4) <2 -k€/4 The entropy theorem implies that it(Tk) —> 1. 14(4) = k i+k-1 = a. pk( 14)) > 82. subsequence version of strong nonadmissibility was first given in [52]. n — k +1]} . But if this holds then dk(/ik. then {k(n)} is not admissible for any ergodic process of entropy h. and hence. define the (empirical) universe of kblocks of x'11 . for any x'11 . then extended to the form given here in [50].SECTION IIL3.2 (The d-nonadmissibility theorem. THE 15-ADMISSIBILITY PROBLEM. produces Ak [11k(Xn)t5 ( [14(4)16 = n I. intersecting with the entropy-typical set. and which will be proved in Section IV. Lemma 1.i. is given in Section III.) For any nondecreasing unbounded sequence {k(n)} and 0 < a < 1/2. Theorem 111.a. If 3 is small enough then for all k.3. if k is large enough. provided 3 is small enough. The negative results are two-fold.3. If k > (log n)/(h — c) then Itik (4)1 _< 2k(h-E).2.3.7. almost surely The d-nonadmissibility theorem is a consequence of the fact that no more than sequences of length k(n) can occur in a sequence of length n < so that at most 2k(n )(h-E/2) sequences can be within 3 of one of them.d.) If k(n) > (log n)/(h — c). One is the simple fact that if n is too short relative to k.5. A proof of the d-admissibility theorem.3. . It will be proved in the next subsection. These results will be discussed in a later subsection. To establish the d-nonadmissibility theorem.16(4)) < 81. I11. so that ttaik(x)is) < 6. assuming this finitely determined property. A weaker.(X1) < k(h—E/4) 1 1 . n--*oo 111(n)) > a. such that liminf dk( fl )(pk( fl )(.a Admissibility and nonadmissibility proofs. Theorem 111. 185 which holds for stationary codings of i. and its 6-blowup dk(4. then a small neighborhood of the empirical universe of k-blocks in an n-sequence is too small to allow admissibility. Tk = 11. there is an ergodic process p.

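The per-letter Hamming distance d_k and the δ-blowup of a set of k-blocks, which appear in the d-nonadmissibility argument of this section, can be sketched as follows; the example blocks and names are mine.

```python
def d_k(a, b):
    """Per-letter Hamming distance between two blocks of the same length k."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def dist_to_set(a, blocks):
    """min over b in blocks of d_k(a, b); a block lies in the delta-blowup of the set
    exactly when this value is at most delta."""
    return min(d_k(a, b) for b in blocks)

universe = [(0, 0, 1, 1, 0, 1), (1, 1, 0, 0, 1, 0)]
print(dist_to_set((0, 0, 1, 1, 1, 1), universe))   # 1/6: inside a 0.25-blowup of the set
print(dist_to_set((1, 0, 1, 0, 1, 0), universe))   # 1/3: outside a 0.25-blowup of the set
```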
If y is a measure on An. v)1 <8. Pk(n)(. (a) 10 (in. by the ergodic theorem. almost surely as n oc. see Section IV. Theorem 11. almost surely. 14)) — (b) (1/k(n))H(p k(n) (. provided only that k < (11h) log n. An equivalent finite form. Theorem 111.1. A stationary process t is finitely determined if for any c > 0 there is a 6 > 0 and positive integers m and K such that if k > K then any measure y on Ak which satisfies the two conditions -(rn. that is. which is more suitable for use here and in Section 111. The important fact needed for the current discussion is that a stationary coding of an i. is really just the empirical-entropy theorem. 11-1(14) — H(v)I <k6. it is sufficient to prove that if {k(n)} is unbounded and k(n) < (11h) log n.2. ---÷ h. by the definition of finitely determined. process is finitely determined. fact (b). for each m.x7 which makes it clear that 4)(m. An alternative expression is Om (ain ) E Pm (ain14)v(x) = Ev (pm (ar x7)). and noting that the sequence {k(n)} can be assumed to be unbounded. since the bounded result is true for any ergodic process. by the ergodic theorem.. of entropy h. of course. the proof that stationary codings of processes are finitely determined. must also satisfy cik(iik. 1 —> 0. then. for otherwise Lemma I. and in < n.2. vk) < E. the only difference being that the former gives less than full weight to those m-blocks that start in the initial k — 1 places or end in the final k — 1 places xn . is expressed in terms of the averaging of finite distributions. modulo.3. Thus.l n-m EE i=0 uEiv veAn—n. fact (a). The d-admissibility theorem is proved by taking y = plc ( I4). a negligible effect if n is large enough. the v-distribution of m-blocks in n-blocks is the measure 0. . The finitely determined property asserts that a process has the finitely determined property if any process close enough in joint distribution and entropy must also be dclose.i. This completes the proof of the d-admissibility theorem.2. since pk(tik(fit)14) = 1. the following hold for any ergodic process p. To establish convergence in m-block distribution.„ = 0(m. 111 . y) on Am defined by çb m (a) = 1 n — m -I. note that the averaged distribution 0(m. E v(uaT v).3.d.186 CHAPTER III. ENTROPY FOR RESTRICTED CLASSES. as k and n go to infinity. for it asserts that (1 /k)H(pke h = 11(A). v n ) = ym for stationary v.9. The desired result then follows since pni (an i 'lxri') —> A(an i z). Convergence of entropy. see Theorem IV. almost surely. Pk( WO) is almost the same as pn. This completes the proof of the d-nonadmissibility theorem.2.14(a) gives /Lk aik (x')13 ) > 1 — 28.4.( • Ixrn). the average of the y-probability of din over all except the final m — 1 starting positions in sequences of length n. .

. - [Dia = i. nondecreasing. keeping the almost the same separation at the same time as the merging is happening. and k(n) < log n.c and y are a-separated. 187 III.t and y on A k are said to be a separated if their supports are a-separated.. if C n [D]ce = 0. as before. n(k) = max{n: k(n) < k}. The support a (i) of a measure pc on Ak is the set of all x i` for which .W-typical sequences. then E(dk (x li` . of {k(n)}-d-admissible ergodic processes which are mutually a' apart in d. 1. for which (3) lim p. is - (4) If 1.u(x) > 0. look as though they are drawn from a large number of mutually far-apart ergodic processes. but important fact. The tool for managing all this is cutting and stacking. and hence lim infdk (pk (. The easiest way to make measures apart sequences and dk d-far-apart is to make supports disjoint. where.3. if X is any joining of two a-separated measures. A simple. This must happen for every value of n. Let Bk(a) = ix ) : dic(Pk• Ix). yf)) ce). where a'(1 — 1/J) > a.b Strong-nonadmissibility examples. 14 (k) ). unbounded sequence of positive integers. The result (4) extends to averages of separated families in the following form. Thus.*oc almost surely. Two measures 1. yet at the same time. This simple nonergodic example suggests the basic idea: build an ergodic process whose n-length sample paths.) is the path-length function. {p (J ) : j < J } .1 and y. (U Bk(a)) = 0 k>K The construction is easy if ergodicity is not an issue. merging of these far-apart processes must be taking place in order to obtain a final ergodic process. . in which case pk(• /2. Two sets C.(a (p) x a (v)) = ce.SECTION 111. D c A k are said to be a separated. It is enough to show that for any 0 <a < 1/2 there is an ergodic measure be. some discussion of the relation between dk-far-far-apart measures is in order. THE D-ADMISSIBILITY PROBLEM. Indeed. Associated with the window function k(. to be the average of a large number. for one can take p. even when blown-up by a. Ak) > a'(1 — 1/J). Throughout this section {k(n)} denotes a fixed. y) > a. see Exercise 1. then cik(p. a-separated means that at least ak changes must be made in a member of C to produce a member of D.. y k < dk (x k ) for some yk i E D} is the a-blowup of D. when viewed through a window of length k(n). Before beginning the construction. which is well-suited to the tasks of both merging and separation.3. k. A given x is typical x) n(k) i is mostly concentrated on for at most one component AO ) . The trick is to do the merging itself in different ways on different parts of the space. Ak) < I.

for 1 < i < n(k. then . Lemma 111. Thus if f(m)In(k m ) is sunu-nable in m. from which the lemma follows.n ). conditioned on being n(km )-levels below the top.3.. (Pkm (. Here it will be shown. A weaker liminf result will be described in the next subsection.) > am . Thus. and if the cardinality of Si (m) is constant in j. A column structure S is said to be uniform if its columns have the same height f(S) and width. yl`)) v(a (0) • [1 — AGa(v)icen a(1 — 1/. The sequence am will decrease to 0 and the total measure of the top n(km )-levels of S(m) will be summable in m. and y.4 suggests a strategy for making (6) hold.4 If it = (1/J)Ei v i is the average of a family {v . if the support of y is contained in the support of y.1). Some terminology will be helpful. and an increasing sequence {km } such that lim p(B km (a)) = O.4 guarantees that the d-property (6) will indeed hold. a complete sequence {8(m)} of column structures and an increasing sequence {km } will be constructed such that a randomly chosen point x which does not lie in the top n(km )-levels of any column of S(m) must satisfy (6) dk„. for any measure v whose support is entirely contained in the support of one of the v i . The k-block universe Li(S) of a column . 11k. after which a sketch of the modifications needed to produce the stronger result (3) will be given.: j < J} of pairwise a-separated measures. in detail. ergodicity will be guaranteed by making sure that S(m) and S(m 1) are sufficiently independent. III... namely.u. ENTROPY FOR RESTRICTED CLASSES. then. v) > a(1 — 1/J). If all the columns in S(m) have the same width and height am).3. the measure is the average of the conditional measures on the separate Si (m). take S(m) to be the union of a disjoint collection of column structures {Si (m): j < Jm } such that any km blockwhich appears in the name of any column of Si (m) must be at least am apart from any km -block which appears in the name of any column of 5. then Lemma 111.3.1 A limit inferior result. (5) The construction. To achieve (5). With these simple preliminary ideas in mind. Furthermore. (m) for i j. illustrates most of the ideas involved and is a bit easier to understand than the one used to obtain the full limit result. the basic constructions can begin.([a(v)]a ) < 1/J.x ' "5. then d(p. based on [52]. and hence EAdk(4. and hence not meet the support of yi .3. of p. guaranteeing that (5) holds. for any joining Â.b. (3).188 CHAPTER III. Lemma 111. where xi is the label of the interval containing T i-l x. how to produce an ergodic p. for i it([a(vma) = — E vi ([a(v.)]a) = v jag (v i)la) = 7 j i=1 if 1 . Proof The a-separation condition implies that the a-blowup of the support of yi does j.

1/2) by constructing a complete sequence {S(m)} of column structures and an increasing sequence {k m } with the following properties.3. namely. then by using cyclical rules with rapidly growing periods many different sets can be produced that are almost as far apart. The idea of merging rule is formalized as follows. an (M.. The real problem is how to go from stage m to stage ni+1 so that separation holds. 01 = (01) 64 b = 00001111000011110. (Al) S(m) is uniform with height am) > 2mn(k. and there is no loss in assuming that distinct columns have the same name... (A2) S(m) and S(m 1) are 2m-independent. k)-separated if their k-block universes are a-separated.. To summarize the discussion up to this point.n ). for which the cardinality of Si (m) is constant in j.. of course. 2. Each sequence has the same frequency of occurrence of 0 and 1. 01111 = (0 414)16 c = 000000000000000001 11 = (016 116)4 d = 000. yr) > 1/2. the goal (6) can be achieved with an ergodic measure for a given a E (0. yet asymptotic independence is guaranteed. yet if a block xr of 32 consecutive symbols is drawn from one of them and a block yr is drawn from another.... and concatenations of sets by independent cutting and stacking of appropriate copies of the corresponding column structures. the simpler concatenation language will be used in place of cutting and stacking language. If one starts with enough far-apart sets.. (7) = 01010. 10 -1 (DI = M/J. The construction (7) suggests a way to merge while keeping separation.. {1. The following four sequences suggest a way to do this. concatenate blocks according to a rule that specifies to which set the m-th block in the concatenation belongs. j <J.. as m Since all the columns of 5(m) will have the same width and height. Conversion back to cutting and stacking language is achieved by replacing S(m) by its columnar representation with all columns equally likely. (A3) S(m) is a union of a disjoint family {S) (m): j < J} of pairwise (am . oc. that is.SECTION III. J)merging rule is a function .. 11 = ( 064164) 1 a .0101. km )separated column structures. This is. For this reason S(m) will be taken to be a subset of A" ) . Disjoint column structures S and S' are said to be (a. then d32(4 2 .. These sequences are created by concatenating the two symbols. using a rapidly increasing period from one sequence to the next. There are many ways to force the initial stage 8 (1) to have property (A3). so first-order frequencies are good. MI whose level sets are constant. 0 and 1. THE 13-ADMISSIBILITY PROBLEM.n ) decreases to a. For M divisible by J. . the simultaneous merging and separation problem. . and for which an2 (1 — 1/J. J } .. 189 structure S is the set of all a l' that appear as a block of consecutive symbols in the name of any column in S.
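The four sequences in (7) are easy to generate, and one can check mechanically that they share the same symbol frequencies while differing markedly in their block structure; the following sketch (our own illustration, not part of the text) does this for 8-blocks.

    def pattern(run_length, total=128):
        # 0^p 1^p 0^p 1^p ... ; the sequences a, b, c, d use p = 1, 4, 16, 64
        return [(i // run_length) % 2 for i in range(total)]

    for p in (1, 4, 16, 64):
        x = pattern(p)
        freq_of_ones = sum(x) / len(x)
        eight_blocks = {tuple(x[i:i + 8]) for i in range(len(x) - 7)}
        print(p, freq_of_ones, len(eight_blocks))
    # All four have frequency 1/2 of ones, yet their sets of 8-blocks are quite
    # different, which is what the block-level separation claim exploits.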

190 CHAPTER III. K)-strongly-separated subsets of A t of the saine cardinalit-v. Lemma 111. for some x E S" and some i > 11. The set S(0) is called the 0-merging of {Si : j E f }. Use of this type of merging at each stage will insure asymptotic independence. K)-strongly-separated if their full k-block universes are a-separated for any k > K. stated in a form suitable for iteration. Given J and J*. Let expb 0 denote the exponential function with base b. J)-merging of the collection for some M. (exp2 (J* — 1)) and for each t E [1.5 (The merging/separation lemma. J)-merging rule 4). ENTROPY FOR RESTRICTED CLASSES. Cyclical merging rules are defined as follows.n ). In cutting and stacking language. each Si is cut into exactly MIJ copies and these are independently cut and stacked in the order specified by 0. such that . formed by concatenating the sets in the order specified by q.1 and J divides p.3. Assume p divides M1. there is a Jo = Jo(a. let (/). m E [1. In other words. p1. J)-merging rule with period pt = exp j (exp2 (t — 1)). the canonical family {0i } produces the collection {S(0)} of disjoint subsets of A mt .. then for any J* there is a K* and an V.1} is a collection of pairwise (a. S' c At are (a. . 8(0) is the set of all concatenations w(1) • • • w(M) that can be formed by selecting w(m) from the 0(m)-th member of {S1 } .P n . The following lemma. Two subsets S. . m E [1. and a collection {S7: t < J* } of subsets of 2V * of equal cardinalitv. The desired "cyclical rules with rapidly growing periods" are obtained as follows.11. that is. the direct product S() = in =1 fl 5rn = 1. it is equally likely to be any member of Si . {Sj }) of A m'.) Given 0 < at < a < 1/2. In proving this it is somewhat easier to use a stronger infinite concatenation form of the separation idea.1 times. The family {Or : t < PI is called the canonical family of cyclical merging rules defined by J and J. When applied to a collection {Si : j E J } of disjoint subsets of A'. when applied to a collection {Si: j E J } of disjoint subsets of A produces the subset S(0) = S(0. then the new collection will be almost as well separated as the old collection. a merging of the collection {Sj : j E J } is just an (M. The full k-block universe of S C A t is the set tik (S oe ) = fak i : ak i = x:+k-1 . /MI. let M = exp . is the key to producing an ergodic measure with the desired property (5). In general. The merging rule defined by the two conditions (i) (ii) 0(m)= j. The key to the construction is that if J is large enough.0(rn ± np) = 0(m). at) such that if J > Jo and {S J : j < . and. 0 *(2) • • • w (m ) : W(M) E So(. be the cyclic (M. The two important properties of this merging idea are that each factor Si appears exactly M1. An (M. given that a block comes from Sj . 0 is called the cyclic rule with period p. for each m.
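A cyclic merging rule is a completely explicit object; the sketch below (our own notation, a toy instance rather than the canonical family with its rapidly growing periods) implements a cyclic (M, J)-rule with period p and uses it to assemble one element of the merged set S(φ) from given families S_1, ..., S_J.

    import random

    def cyclic_rule(M, J, p):
        # phi(m) cycles through 1..J, holding each value for p/J consecutive
        # indices, and repeats with period p; requires J | p and p | M
        assert p % J == 0 and M % p == 0
        block = p // J
        return [(m % p) // block + 1 for m in range(M)]

    def merge_once(rule, families, rng=random):
        # choose w(m) at random from S_{phi(m)} and concatenate
        word = []
        for j in rule:
            word.extend(rng.choice(families[j - 1]))
        return word

    # toy example: J = 2 families of 4-blocks, M = 8, period p = 4
    S1 = [(0, 0, 0, 0), (0, 0, 0, 1)]
    S2 = [(1, 1, 1, 1), (1, 1, 1, 0)]
    rule = cyclic_rule(M=8, J=2, p=4)
    print(rule)                      # [1, 1, 2, 2, 1, 1, 2, 2]
    print(merge_once(rule, [S1, S2]))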

then there can be at most one index k which is equal to 1 d u_one (xvv-i-(J-1)ne+1 y u ( J — 1 ) n f + 1 ) > a by the definition of (a. 4 } of pairwise (an„ kni )- (B2) Each Si (m) is a merging of {S (m — 1): j < 4_ 1 }. and hence (10) 1 )w (t + 2) • • w(J)w(1) • • • w(t) S. Let {Or : t < J* } be the canonical family of cyclical merging rules defined by J and J*.3. . Since all but a limiting (1/J)-fraction of y is covered by such y u u+(j-1)nt+1 . v(j)E ST. LI The construction of the desired sequence {S(m)} is now carried out by induction. this may be described as follows. so it only remains to show that if J is large enough. provided J is large enough. K*)-strongly-separated.3. K)-strongseparation. let S7 = S(0. (13) Each S7 is a merging of {Si: j E 191 J}. Furthermore. each block v(j) is at least as long as each concatenation (8). then K* can be chosen so that property (a) holds.SECTION IIL3. Part (b) is certainly true.Ottn-1-1) from 111. K)-strongly-separated and the assumption that K < E. (B3) 2n(k) <t(m).`)c° and y E (S)oc where s > t. and Yu x v±(J-Unf+1 = w(t where w(k) E j. > JOY . This completes the proof of Lemma 111. for each ni. and since J* is finite. for any j.--m. for each k. any merging of {S7 } is also a merging of {S}. THE D-ADMISSIBILITY PROBLEM. The condition s > t implies that there are integers n and m such that m is divisible by nJ and such that x = b(1)b(2) • • where each b(i) is a concatenation of the form (8) w(l)w(2) • • • w(J). furthermore. and. choose J„. for otherwise {Si : j < J } can be replaced by {S7: j E J } for N > Klt.1-1)nt-1-1 is a subblock of such a v(j). without losing (a. w(j) E Sr» 1 < j < J. Suppose x E (S. and.5. since (87)' = ST. Fix a decreasing sequence {a } such that a < ant < 1/2. Proof Without loss of generality it can be supposed that K < f.). for all large enough K*. K*)-strongly-separated. and for each t. Aar° (B1) S(m) is a disjoint union of a collection {S I (m): j < strongly-separated sets of the same cardinality. (a) {St*: t < J*} is pairwise (a*. 1 _< j _< J. so that if u+(. In concatenation language. the collection {St*: t < J* } is indeed pairwise (a*. while y = c(1)c(2) • •• where each c(i) has the form (9) v(1)v(2) • • • v(J). Suppose m have been determined so that the followLemma S(m) C and k ing hold.5.

and En.5 for J* = J.n + i)x(S.. 0). one on the part waiting to be merged at subsequent steps.3. III.n+ 1 = a*. the sequence {S(m)} defines a stationary A-valued measure it by the formula . and the other on the part already merged. is ergodic since the definition of merging implies that if M is large relative to m. Let {S'tk: t < Jrn+1 } be the collection given by Lemma 111. ±1 )-fraction of the space at each step. 0) is a union of pairwise km )-strongly separated substructures {Si (m. Put Si (m + 1) = S. (B2). S. for the complete proof see [50]. the structure S(m) = S(m.u(Bk m (a)) is summable in m. where } (i) )1/4. j < J. the measure p. ±1 .b. and (B3). 0): j < 41 all of whose columns have the same width and the same height t(m) = (m. To obtain the stronger property (3) it is necessary to control what happens in the interval k m < k < km+1.(St i) = (1 — 1/ Ji)x(S. Define k 1+ 1 to be the K* of Lemma 111. V1 * V2 * • • * V j As in the earlier construction. The only merging idea used is that of cyclical merging of { yi : j < J} with period J. then each member of S(m) appears in most members of S(M) with frequency almost equal to 1/1S(m)1. In the first substage.s(m) 1 k ilxiam)). if necessary. but two widths are possible. (ii) x(S7) = ( i/J. n(lcm )I f(m) < cc. Meanwhile. it can be supposed that 2m+ l n(kn1+ i) < t(S. on each of which only a small fraction of the space is merged.n } with am = a and ce.2 The general case.(m. for m replaced by m +1. 0) is cut into two copies. establishing the desired 1=1 goal (5). (m. the measure of the bad set . and Sj ".n+ 1. is needed. The union S(m + 1) = Ui Si (m + 1) then has properties (Bi). Jm+i -fold independent cutting and stacking is . which is just the independent cutting and stacking in the order given. prior separation properties are retained on the unmerged part of the space until the somewhat smaller separation for longer blocks is obtained. that is. The new construction goes from S(m) to S(m + 1) through a sequence of Jm+i intermediate steps S(m. Cyclical merging with period is applied to the collection {57 } to produce a new column structure R 1 (m. lc \ _ lim [ga l— E IS(m)1 t(m). Since (6) clearly holds. ENTROPY FOR RESTRICTED CLASSES. 1) S(m. 0)). At each intermediate step.*). 0)).4 in this case where all columns have the same height and width. Furthermore.192 CHAPTER III.3. the analogue of Theorem 1. Since t(m) --> cc. J 1+1).5 for the family {Si (m): j < J.10. rather than simple concatenation language.3. By replacing S. each Si (nt. merging only a (1/4. and hence cutting and stacking language.' by (S7) N for some suitably large N. 0) S(m. 1). Only a brief sketch of the ideas will be given here. by (B3). All columns at any substage have the same height. This is accomplished by doing the merging in separate intermediate steps.

In particular. 1)-fold independent cutting and stacking.E. but a more careful look shows that if k(m. t) <k < k(m. 1))/(m. producing in the end an ergodic process p.. > 8(m.n+ i) 2 (1-1/0. 1))-levels have measure at most n(k(m. The merging-of-a-fraction idea can now be applied to S(m. since R. and R(m. 1) > a„. 1)-fold independent cutting and stacking. 1) is upper bounded by 1/.. if M(m. for which lim . then {L i (m. 1) is replaced by its M(m. THE D-ADMISSIBILITY PROBLEM.„<k<k„. by the way. if they lie at least n(k(m. 1) affects this property. J„. km )-strongly-separated. k (m . then applying enough independent cutting and stacking to it and to the separate pieces to achieve almost the same separation for the separate structures. that the columns of R I (m. 1). 1) = {L i (m.(m. and since the probability that both lie in in the same L i (m. Note. Each of the separate substructures can be extended by applying independent cutting and stacking to it. 1).SECTION III. pk(. then it can be assumed that i(m. 1))4 + 1. 1): j < J} is pairwise (a. 2). ±1 . Lemma 111./r .. and (a) continues to hold. of course. since neither cutting into copies nor independent cutting and stacking applied to separate Ej(m.. 1) > k m such that each unmerged part j (m. which. and the whole construction can be applied anew to S(m 1) = S(m. t + 1). since strong-separation for any value of k implies strong-separation for all larger values. 1). which makes each substructure longer without changing any of the separation properties. 1 as wide. Thus if M(m. 1) has measure 1/4 +1 and the top n(k(m. then for any k in the range km < k < k(m.. there is a k(m. This is because. 1): j < J} remains pairwise (am . 1).E.u(Bk (a)) = O. 1) > n(k(m. merging a copy of {L i (m.1). large enough. An argument similar to the one used to prove the merging/separation lemma. 1) is large enough and each L i (m. 1) < 1/4 + 1.3. This is. 1))-strongly separated from the merged part R1 (m. the k-block universes of xrk) and y n(k) i are at least a-apart. 1) have the same height as those of L i (m. t): j < J } U {R 1 (m. then for a suitable choice of an. 1) < k < k(m. 1): j < J} of measure 1/4 +1 . After 4 +1 iterations.0. 193 applied separately to each to obtain a new column structure L i (m. for the stated range of k-values. the following holds. 1). while making each structure so long that (b) holds for k(m. 1))strongly-separated. and this is enough to obtain (12) ni k..3. the unmerged part has disappeared.5. . (a) The collection S(m. the probability that ak(pk(. k m )-strongly-separated. 1). 1) is chosen large enough. then k-blocks drawn from the unmerged part or previously merged parts must be far-apart from a large fraction of k-blocks drawn from the part merged at stage t 1.4_] Bk ) < DC ' .14 (k) ). in turn. 1) is (8(m.Iy in(k) )) > a is at least (1 — 2/4 + 0 2 (1 — 1/. But. (b) If x and y are picked at random in S(m. 1). 1). The collection {. k(m. t))} is (p(m. implies the following. can be used to show that if am) is long enough and J„. but are only 1/J„1 .C i (m. an event of probability at least (1 —2/J. 1). 1) is replaced by its M(m. weaker than the desired almost-sure result. 1))-levels below the top and in different L i (m.

Summability in m then yields, by the Borel-Cantelli lemma, that almost every x lies in B_k(α) for only finitely many k, which is the desired goal (3).

Remark III.3.6 It is important to note here that enough separate independent cutting and stacking can always be done at each stage to make ℓ(m) grow arbitrarily rapidly, relative to k_m, which is, in turn, enough to guarantee the desired result (3), for suitable choice of the sequence {k(n)}.

Remark III.3.7 By starting with a well-separated collection of n-sequences of large enough cardinality one can obtain a final process of positive entropy h. In particular, this shows the existence of an ergodic process for which the sequence k(n) = [(log n)/h] is not α-admissible. This fact will be applied in III.5.e to construct waiting-time counterexamples.

III.3.c Exercises.

1. Show that if ν̄ is the concatenated-block process defined by a measure ν on A^n, then |φ(k, ν) − ν̄_k| ≤ 2(k − 1)/n.

2. Show that if μ = (1/J) Σ_j ν_j, where each ν_j is ergodic and d(ν_i, ν_j) > α for i ≠ j, then lim inf_k d_k(p_k(· | x_1^{n(k)}), μ_k) ≥ α(1 − 1/J), almost surely.

Section III.4 Blowing-up properties.

An interesting property closely connected to entropy ideas, called the blowing-up property, has recently been shown to hold for i.i.d. processes, for aperiodic Markov sources, and for other processes of interest, including a large family called the finitary processes. Processes with the blowing-up property are characterized as those processes that have exponential rates of convergence for frequencies and entropy and are stationary codings of i.i.d. processes. A slightly weaker concept, called the almost blowing-up property, is, in fact, equivalent to being a stationary coding of an i.i.d. process. The blowing-up property and related ideas will be introduced in this section; a full discussion of the connections between blowing-up properties and stationary codings of i.i.d. processes is delayed to Chapter IV.

If C ⊂ A^n then [C]_ε denotes the ε-neighborhood (or ε-blowup) of C, that is,

   [C]_ε = {b_1^n : d_n(a_1^n, b_1^n) ≤ ε, for some a_1^n ∈ C}.

An ergodic process μ has the blowing-up property (BUP) if given ε > 0 there is a δ > 0 and an N such that if n ≥ N then μ([C]_ε) ≥ 1 − ε, for any subset C ⊂ A^n for which μ(C) ≥ 2^{−δn}. Informally, a stationary process has the blowing-up property if sets of n-sequences that are not exponentially too small in probability have a large blowup. The following theorem characterizes those processes with the blowing-up property.
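For tiny n the blowup [C]_ε can be enumerated directly; the following sketch (exponential in n, for illustration only, with our own names) computes μ([C]_ε) for an i.i.d. measure.

    from itertools import product

    def d_n(x, y):
        return sum(a != b for a, b in zip(x, y)) / len(x)

    def blowup(C, eps, alphabet, n):
        # [C]_eps: all n-sequences within eps (per symbol) of some member of C
        return {x for x in product(alphabet, repeat=n)
                if any(d_n(x, c) <= eps for c in C)}

    def iid_prob(x, p):
        prob = 1.0
        for a in x:
            prob *= p[a]
        return prob

    p = {0: 0.5, 1: 0.5}
    n, eps = 8, 0.25
    C = {tuple([0] * n)}                       # a single sequence: mu(C) = 2^{-n}
    blown = blowup(C, eps, (0, 1), n)
    print(sum(iid_prob(x, p) for x in blown))  # still small; the BUP only applies to
                                               # sets of measure at least 2^{-delta n}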

Theorem III.4.1 (Blowing-up property characterization.) A stationary process has the blowing-up property if and only if it is a stationary coding of an i.i.d. process and has exponential rates of convergence for frequencies and entropy.

The fact that processes with the blowing-up property have exponential rates will be established later in the section; most of the rest of the proof of the theorem will be delayed to Chapter IV. In particular, not every stationary coding of an i.i.d. process has the blowing-up property. The difficulty is that only sets of sequences that are mostly both frequency and entropy typical can possibly have a large blowup, and, without exponential rates, there can be sets that are not exponentially too small yet fail to contain any typical sequences.

A particular kind of stationary coding, called finitary coding, does preserve the blowing-up property, however. A stationary coding F: A^Z → B^Z is said to be finitary, relative to an ergodic process μ, if there is a nonnegative, almost surely finite, integer-valued measurable function w(x), called the window function, such that, for almost every x ∈ A^Z, the time-zero encoder f satisfies the following condition. If x̃ ∈ A^Z and x̃_{−w(x)}^{w(x)} = x_{−w(x)}^{w(x)}, then f(x̃) = f(x). A finitary coding of an i.i.d. process is called a finitary process. It is known that aperiodic Markov chains and finite-state processes, m-dependent processes, and some renewal processes are finitary, [28, 79]. Later in this section the following will be proved.

Theorem III.4.2 (The finitary-coding theorem.) Finitary coding preserves the blowing-up property.

Since i.i.d. processes have the blowing-up property it follows that finitary processes have the blowing-up property.

By borrowing one concept from Chapter IV, it will be shown later in this section that a stationary coding of an i.i.d. process has the almost blowing-up property, a result that will be used in the waiting-time discussion in the next section. Once the sequences that are not suitably frequency and entropy typical are removed, a blowing-up property does hold for an arbitrary stationary coding of an i.i.d. process, in the following sense. A set B ⊂ A^k has the (δ, ε)-blowing-up property if μ([C]_ε) ≥ 1 − ε for any subset C ⊂ B for which μ(C) ≥ 2^{−kδ}. An ergodic process μ has the almost blowing-up property (ABUP) if for each k there is a set B_k ⊂ A^k such that the following hold.

   (i) x_1^k ∈ B_k, eventually almost surely.
   (ii) For any ε > 0 there is a δ > 0 and a K such that B_k has the (δ, ε)-blowing-up property for k ≥ K.

Theorem III.4.3 (Almost blowing-up characterization.) A stationary process has the almost blowing-up property if and only if it is a stationary coding of an i.i.d. process.

A proof that a process with the almost blowing-up property is a stationary coding of an i.i.d. process will be given in Chapter IV.
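As a concrete illustration of a finitary code (our own toy example, not one from the text), the time-zero encoder below outputs the parity of the distance back to the most recent occurrence of the symbol 1; its window function is that distance, which is almost surely finite for any ergodic process that produces 1's with positive probability.

    def time_zero_encoder(x, origin):
        # x: a long list standing in for a doubly infinite binary sequence;
        # origin: the index playing the role of time zero.  We assume a 1
        # occurs somewhere at or before the origin.
        w = 0
        while x[origin - w] != 1:     # window function w(x): distance back to a 1
            w += 1
        return w % 2                  # f(x) depends only on x_{-w(x)}, ..., x_0

    # sliding the encoder along x gives the finitary coding F(x)
    x = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]
    encoded = [time_zero_encoder(x, i) for i in range(3, len(x))]
    print(encoded)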

and ti. E)]. however. III. since . The blowing-up property provides 8 > 0 and N so that if n>N.u has the blowing-up property. however. k. has exponential rates for frequencies. k. o ] l < for all n. But this cannot be true for all large n since the ergodic theorem guarantees that lim n p n (B(n. 0. E/2)) = 0 for each fixed k and E. To fill in the details let pk ( WI') denote the empirical distribution of overlapping k blocks.u(B*(n. E)1 0 ) 0 it follows that . for all sufficiently large n. and hence there is an a > 0 such that 1[B* (n. ENTROPY FOR RESTRICTED CLASSES. k. contradicting the ergodic theorem. ç B(n.) > 1 — a. ) fx n i: tin(x n i) < of sequences of too-small probability is a bit trickier to obtain.) —> 1.4. k.a Blowing-up implies exponential rates. The idea of the proof is that if the set of sequences with bad frequencies does not have exponentially small measure then it can be blown up by a small amount to get a set of large measure. (1) Ipk( 14)— pk( 1)1)1 6/2. hence a small blowup of such a set cannot possibly produce enough sequences to cover a large fraction of the measure. so that. for there cannot be too many sequences whose measure is too large. but depends only on the existence of exponential rates for frequencies. and hence (1) would force the measure p n (B(n. The blowing-up property provides 6 and N so that if n>N.Cc A n .u n ([B*(n. and define B(n. k.u(4) 2 ' _E/4)} gives (n. To fill in the details of this part of the argument. since this implies that if n is sufficiently . Since .(C) > 2 -611 then ii. If the amount of blowup is small enough. An exponential bound for the measure of the set B. . k.u n (B(n.Cc An. < 2n(h—E/2)2— n(h—E14) < 2—nEl 4 . If.) would have to be at least 1 —E.u„(T. then frequencies won't change much and hence the blowup would produce a set of large measure all of whose members have bad frequencies.(n. - = trn i: iun (x n i)> so that B* (n. Next it will be shown that blowing-up implies exponential rates for entropy. 6 . E) = 1Pk( 14) — IkI > El. k. by the entropy theorem. let B .naB * 2n(h—E/2) . n In) _ SO p„([B* (n. k. Thus p n (B(n. for all n sufficiently large. First it will be shown that p.(C) > 2 -6 " then . Intersecting with the set Tn = 11.([C]y ) > 1 — E. E/2)) to be at least 1 — E. Suppose . 6)1 < 2n(h €). [B(n.u([C].196 CHAPTER III. E)) must be less than 2 -8n. E/2). and p. E)) > 2 -6" then ([B(n. —> 0. One part of this is easy. n > N and . in particular. c)) < 2 -6n . Note that if y = E/(2k + 1) then d(4.

and note that . and hence there is a k such that if Gk is the set of all x k k such that w(x) < k. most of xPli will be covered by k-blocks whose measure is about 2 -kh. Fix E > 0 and a process p with AZ.u. if n > N I . I=1 III.1)-blocks in xn -tf k that belong to Gk and.u has the blowing-up property. Let D be the projection of F-1 C onto B" HI. there is a 6 > 0 and an N 1 > N such that p. 6/4) has cardinality at most 2 k(h+e/4) . where 6 > O.6. 6 can be assumed to be so small and n so large that if y = E / (2k + 1) then . Lemma 1. 197 large then. and window function w(x). since k is fixed and . Let be a finitary coding of ._E k 1 . Thus f (x) depends only on the values x ww (x()x) . The finitary-coding theorem will now be established. then iGn < 2n(h+E/2). the set T = {x n I± k 1 : P2k+1(GkiX n jc-± k i) > 1 — 61. To fill in the details of the preceding argument apply the entropy theorem to choose k so large that Ak(B(k. there is a sequence i n tE k E D such that dn+2k(X . the built-up set bound. that is.(n. except for a set of exponentially small probability.1 • Thus fewer than (2k + 1)yn < cn of the (2k + 1)-blocks in x h± k± k I can differ from the corresponding block in i1Hk4 k . Now consider a sequence xhti.b Finitary coding and blowing-up. 1) < Y 2k -I. For such a sequence (1 — c)n of its (2k + 1)-blocks belong to Gk. the blowing-up property. n (G n ) > 1 — 2 -6n. x E B Z .4. E /4) < 2al. implies that there is an a > 0 and an N such that if n > N. For n > k let = pk(B. (Gk) > 1 — 6 2 .7. the set of n-sequences that are (1 —20-covered by the complement of B(k. and. 6/4)) < a.u([D]y ) > 1 — E. where w(x) is almost surely finite.u(T) > 1 — c by the Markov inequality. Suppose C c An satisfies v(C) > 2 -h 6 . satisfies . E)) < 2_8n which gives the desired exponential bound. 6/4).SECTION 111. moreover. and hence tin(Gn n 13. n > N1. at the same time agree entirely with the corresponding block in imHk-±k . In particular. Thus + 2—ne/2.u(D) > 2 -h 8 . there is at least a (1 — 20-fraction of (2k -I. this gives an exponential bound on the number of such x.un(B. where a will be specified in a moment. In particular. .4. k 1 E [D]y n T. As in the proof of the entropy theorem. with full encoder F: B z time-zero encoder f: B z A.(k. then p. Thus. Since the complement of 13(k.(n. which in turn means that it is exponentially very unlikely that such an can have probability much smaller than 2 -nh. 6)) < IGn12—n(h+E) < 2—ne/2 n > N • Since exponential rates have already been established for frequencies. BLOWING-UP PROPERTIES.

a property introduced to prove the d-admissibility theorem in Section 111. and which will be discussed in detail in Section IV. v) is the measure on Ak obtained by averaging must also satisfy the y-probability of k-blocks over all starting positions in sequences of length n. This proves that vn ([ che ) n T) > 1 — 2€. k This simply a translation of the definition of Section III. the minimum of dn (4 . Z?) < 26.3. Thus if (LW. The sequence Z7 belongs to C and cln (z7 .4. there is a 3 > 0 and positive integers k and N such that if n > N then any measure y on An which satisfies the two conditions (a) Luk — 0*(k. 4([C]) > 1 — E. An( IC)). E Yil )dn(.d. is finitely determined (FD) if given E > 0. Z = F(Y. Proof The relation (2) vn EC holds for any joining X of A n and A n ( IC). Lemma 111.3.198 CHAPTER III. .u.i.4 If d(1-t. This proves the lemma. that is.c Almost blowing-up and stationary coding.4.2. <e. The proof that stationary codings of i. where 4)(k.(4.a into the notation used here. C) > E has p. ENTROPY FOR RESTRICTED CLASSES. process have the almost blowing-up property makes use of the fact that such coded processes have the finitely determined property. v)I < 8. Also let c/.). I11. ylz). and hence EA(Cin(X til C)) < (LW. ) (3) Ca li) = E P (a14)v(x) = Ev(Pk(a l'IX7)).1 (' IC) denote the conditional measure defined by the set C c An. Let . C). and put z = .([C]e ) > 1 — E. so that Z7 E [C]2 E . c+ k i . then. An ergodic process p. int+ ki. measure less than E. Towards this end a simple connection between the blowup of a set and the 4-distance concept will be needed. .d. by the Markov inequality.i. x ntf k 19 and r-iChoose y and 5. processes have the almost blowing-up property (ABUP). In this section it is shown that stationary codings of i. C) denote the distance from x to C. that is. the set of 4 such that dn (xit . Ane IC» <2 then 1. ktn( . IC)) < 62 . that is.1. E D such that Yi F(y). LII which completes the proof of the finitary-coding theorem. (b) 11-1(u n ) — H(v)I < u8.4 )1) > P(4)dn(4 . y E C.
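The quantity d_n(x_1^n, C) that drives the lemma, and the Markov-inequality step behind it, are easy to compute directly for small examples; a sketch with our own function names:

    def d_n(x, y):
        return sum(a != b for a, b in zip(x, y)) / len(x)

    def dist_to_set(x, C):
        # d_n(x, C): distance from x to the nearest member of C; note that
        # x lies in the eps-blowup [C]_eps exactly when dist_to_set(x, C) <= eps
        return min(d_n(x, y) for y in C)

    def blowup_measure_lower_bound(mu, C, eps):
        # Markov inequality: mu{x : d_n(x, C) > eps} <= E_mu[d_n(x, C)] / eps,
        # so mu([C]_eps) >= 1 - E_mu[d_n(x, C)] / eps
        expected = sum(p * dist_to_set(x, C) for x, p in mu.items())
        return 1.0 - expected / eps

    mu = {(0, 0, 0): 0.4, (0, 0, 1): 0.3, (1, 1, 1): 0.3}
    C = [(0, 0, 0)]
    print(blowup_measure_lower_bound(mu, C, eps=0.5))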

must also satisfy dn (ii. a(n)). Given E > 0.4. I11. Suppose C c Bn and . which also includes applications to some problems in multi-user information theory.4. by Lemma 111. Put Bn = Bn(k(n). The definition of Bn and formula (3) together imply that if k(n) > k.6 The results in this section are mostly drawn from joint work with Marton. Show that (2) is indeed true. where. a)-frequency-typical for all j < k. ttn(.uk and entropy close to H(i). [37. This completes the proof that finitely determined implies almost blowing-up. 1. It will be shown that for any c > 0. both of which depend on . [7]. implies that . is used. (ii) H(14) — na(n) — 178 Milne IC)) MILO ± na(n). the finitely determined property provides 8 > 0 and k such that if n is large enough then any measure y on A n which satisfies WI Ø(k.u([C].SECTION 111. for which 2 —n8 which. a) be the set of n-sequences that are both entropy-typical. relative to a.u. eventually almost surely.4. BLOWING-UP PROPERTIES. ")I < 8 and IHCan ) — HMI < n6.4. I xi ) — .4.d Exercises. ED — Remark 111. a). Let Bn (k. a)-frequency-typical if I Pk (. there is a 8 > 0 such that Bn eventually has the (8. and a nonincreasing sequence {a(n)} with limit 0. implies that C has a large blowup. A sequence 4 will be called entropy-typical. If k and a are fixed.u(C) > 2 —n8 . so there is a nondecreasing. eventually almost surely.4. Kan). by Lemma 111. as usual. Theorem 111. The basic idea is that if C c A n consists of sequences that have good k-block frequencies and probabilities roughly equal to and i(C) is not exponentially too small. Pk(• Ifiz) is the empirical distribution of overlapping k-blocks in xri'.un are d-close. a (n)). . and (j. The first step in the rigorous proof is to remove the sequences that are not suitably frequency and entropy typical. The finitely determined property then implies that gn e IC) and .4.5 (FD implies ABUP. which is all right since H(t)/n —> h. IC) satisfies the following.uk I < a. Show that a process with the almost blowing-up property must be ergodic. Thus if n is large enough then an(it. if tt(x til ) < 2. 2.5 in the Csiszdr-Körner book. then the conditional measure A n (. y) < e 2 . then lin e IC) will have (averaged) k-block probabilities close to . e)blowing-up property.) A finitely determined process has the almost blowing-up property 199 Proof Fix a finitely determined process A. — (i) 10(k. IC)) — 141 < a(n). Note that in this setting n-th order entropy. then xçi E Bn(k. unbounded sequence {k(n)}. 39]. time IC)) < e 2 . A sequence x i" will be called (k.11 (4)+na .) > 1 E. such that 4 E Bn(k(n). which. rather than the usual nh in the definition of entropy-typical. For references to earlier papers on blowing-up ideas the reader is referred to [37] and to Section 1. for any C c Bp. relative to a.4.

3. Show that a coding is finitary relative to μ if and only if, for each b, the set f^{−1}(b) is a countable union of cylinder sets together with a null set.

4. Show that a process with the almost blowing-up property must be mixing.

5. (a) Show that an m-dependent process has the blowing-up property.
   (b) Show that a ψ-mixing process has the blowing-up property.

6. Assume that aperiodic renewal processes are stationary codings of i.i.d. processes. Show that some of them do not have the blowing-up property.

7. Show that condition (i) in the definition of ABUP can be replaced by the condition that μ(B_n) → 1.

Section III.5 The waiting-time problem.

A connection between entropy and recurrence times was shown to hold for any ergodic process in Section II.5. Wyner and Ziv also established a positive connection between a waiting-time concept and entropy, at least for certain classes of ergodic processes, [86]. They showed that if W_k(x, y) is the waiting time until the first k terms of x appear in an independently chosen y, then (1/k) log W_k(x, y) converges in probability to h, for irreducible Markov chains. This result was extended to somewhat larger classes of processes, including the weak Bernoulli class, in [44, 76]. The surprise here, of course, is the positive result, for the well-known waiting-time paradox, [11, pp. 10ff], suggests that waiting times are generally longer than recurrence times.

An almost sure version of the Wyner-Ziv result will be established here for the class of weak Bernoulli processes, by using the joint-distribution estimation theory of Section III.2. In addition, an approximate-match version will be shown to hold for the class of stationary codings of i.i.d. processes, by using the d-admissibility theorem of Section III.3 in conjunction with the almost blowing-up property discussed in the preceding section. Counterexamples to extensions of these results to the general ergodic case will also be discussed. The counterexamples show that waiting-time ideas, unlike recurrence ideas, cannot be extended to the general ergodic case, thus further corroborating the general folklore that waiting times and recurrence times are quite different concepts.

The waiting-time function W_k(x, y) is defined for x, y ∈ A^∞ by

   W_k(x, y) = min{m ≥ 1: y_m^{m+k−1} = x_1^k}.

The approximate-match waiting-time function W_k(x, y, δ) is defined by

   W_k(x, y, δ) = min{m ≥ 1: d_k(x_1^k, y_m^{m+k−1}) ≤ δ}.

Two positive theorems will be proved.
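Both waiting-time functions translate directly into code; the following brute-force sketch (our own function names) scans y for the first exact or approximate occurrence of x_1^k.

    def waiting_time(x, y, k):
        # W_k(x, y): least m >= 1 with y_m^{m+k-1} = x_1^k (1-indexed as in the text)
        target = x[:k]
        for m in range(len(y) - k + 1):
            if y[m:m + k] == target:
                return m + 1
        return None                   # no match within the available portion of y

    def approx_waiting_time(x, y, k, delta):
        # W_k(x, y, delta): least m with d_k(x_1^k, y_m^{m+k-1}) <= delta
        target = x[:k]
        for m in range(len(y) - k + 1):
            mismatches = sum(a != b for a, b in zip(target, y[m:m + k]))
            if mismatches <= delta * k:
                return m + 1
        return None

    x = [0, 1, 1, 0, 1, 0, 1, 1]
    y = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    print(waiting_time(x, y, 4), approx_waiting_time(x, y, 4, 0.25))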

. In other words. V S > 0.. then for any fixed y'iz E G. In the weak Bernoulli case. hence it is very unlikely that a typical 4 can be among them or even close to one of them.log The easy part of both results is the lower bound. the probability that xf E Bk is not within 8 of some k-block in y'i' will be exponentially small in k. k almost surely with respect to the product measure p.) If p is a stationary coding of an i. for there are exponentially too few k-blocks in the first 2k(h .1.} have the additional property that if k < (log n)/(h + 6) then for any fixed xiic E Bk. k almost surely with respect to the product measure p. then k— s oo 1 lirn .3. {B k C Ak} and {G. y. x p. can be taken to depend only on the coordinates from 1 to n.8. such that a large fraction of these k-blocks are almost independent of each other.5.i. hence G„ is taken to be a subset of the space A z of doubly infinite sequences.2 (The approximate-match theorem. 8) < h.1 (The exact-match theorem. is weak Bernoulli with entropy h.f) terms of y. it is necessary to allow the set G„ to depend on the infinite past. y) = h. eventually almost surely.4. are constructed for which the following hold. The set G n consists of those y E A Z for which yll can be split into k-length blocks separated by a fixed length gap.SECTION 111. y.„0 1 k Wk(X. The d-admissibility theorem. In the approximate-match case.d. then lim sup . guarantees that property (b) holds. the G„ are the sets given by the weak Bernoulli splitting-set lemma. The set Bk C A k is chosen to have the property that any subset of it that is not exponentially too small in measure must have a large 3-blowup. the set G.}.0 k—>oo 1 lim lim inf . the Bk are the sets given by the almost blowing-up property of stationary codings of i.i. 201 Theorem 111. see Theorem 111. processes. the sequences {B k } and {G. Theorem 111.5.5.(log n)/(h + 6). In the weak Bernoulli case. S) > h.) If p. 6—). In both cases an application the Borel-Cantelli lemma establishes the desired result. eventually almost surely. so that property (a) holds by the entropy theorem. Theorem 111. In each proof two sequences of measurable sets. though the parallelism is not complete.3.d. and Wk (x. Lemma 111. hence is taken to be a subset of A. The proofs of the two upper bound results will have a parallel structure. . G„ consists of those yli whose set of k-blocks has a large 3-neighborhood. THE WAITING-TIME PROBLEM. (b) y E G n .c. Bk consists of the entropy-typical sequences. In other words. {B k } and {G n } have the property that if n > 2k(h +E) . (a) x'ic E Bk. process. x A. For the weak Bernoulli case.log Wk(x. In the approximate-match proof. for k .2. For the approximate-match result.log k. the probability that y E G„ does not contain x lic anywhere in its first n places is almost exponentially decaying in n.

7.a Lower bound proofs. x /.log Wk(x. and hence so that intersecting with the entropy-typical set 7k (e/2) gives < 2 k (6/2)) _ k(h-0 2 -ko-E/2) < (tik (yi ) n r Since .E) . there is an N = N(x./. < 2k(h -E ) . for any ergodic measure with entropy h.d. and an ergodic process will be constructed for which the approximate-match theorem is false. eventually almost surely. ENTROPY FOR RESTRICTED CLASSES. (y)16 n Tk(e/2)) [lik(yli )]3 = for some i E [0. There is a 3 > 0 such that if k(n) > (log n)/(h . the set of entropy-typical k-sequences (with respect to a > 0) is taken to be 7k(01) 2 -k(h+a) < 4 (4) < 2-k(h-a ) } Theorem 111. n > N. and hence 1/1k(yril)1 < Proof If k > (log n)/(h .5. is small enough then for all k.` E 7 k(E/2). j[lik (y)]51 _ Lemma 1.log Wk(X.202 CHAPTER III. The following notation will be useful here and later subsections. The theorem immediately implies that 1 lim lim inf . 111 IL . Theorem 111.4 Let p.6) then for every y E A" and almost every x.6) then for every y E A and almost every x. an application of the Borel-Cantelli lemma yields the theorem.3 Let p. n . 3) > h. y) such that x ik(n) lik(n)(Yi).5. Proof If k > (log n)/(h . In the final two subsections a stationary coding of an i. The (empirical) universe of k-blocks of yri' is the set iik(Yç') = Its S-blowup is dk(x 1`.5. process will be constructed for which the exact-match version of the approximate-match theorem is false.i. n > N. be an ergodic process with entropy h > 0 and let e > 0 be given. there is an N = N(x.6) then n <2k(h-e). y) such that x ik(n) (/[14k(n)(11`)]5. by the blowup-bound lemma.6) then n _ _ 2 k(h .14(y1)) 6). y. yçl E An . The lower bound and upper bound results are established in the next three subsections. Intersecting with the entropy-typical set 'Tk (E/2) gives < 2 k(h-3E/4) 2 -kch-E/2) < p.x /. k almost surely with respect to the product measure p. for any ergodic measure with entropy h. y) h.0 k—c k almost surely with respect to the product measure p. El The theorem immediately implies that 1 lim inf .5. I11. be an ergodic process with entropy h > 0 and let e > 0 be given. I f 6 < 2k(h-3(/4).k l]}.. 8—». _ _L which establishes the theorem. x A. In the following discussion. If k(n) > (logn)I(h . 2k(h-E).

(1) sa Claw (Di n Bs. and fix c > 0. The s-shifted.s. s. t] is a (y. s. Let Fix a weak Bernoulli measure a on A z with entropy h.5. 1.} are given by Lemma 111. to be specified later.( 0)-1)(k+g) ) < (1+ Y)Ww(i)). t ].SECTION 111. eventually almost surely. .a.g(t)w(t)w(t + 1).log Wk (x. k-block parsing of y. g)-splitting indices for y. and J C [1.u . Bk = Tk(a) = {xj: 2 -k(h+a) < wx lic) < where a is a positive number to be specified later.b. g)-splitting index for y E A z if . and for y E G(y) there is at least one s E [0. k.y)t indices j in the interval [1.s < 2(k + g). t] that are (y. 1 . and the final block w(t +1) has length n . where w(0) has length s < k + g. THE WAITING-TIME PROBLEM.1 k x . stated as the following.t(k g) . k + g . The sets G n C A Z are defined in terms of the splitting-index concept introduced in III. Fix such an (0 (Yrk) )1 and note that ye Ck (x). The basic terminology and results needed here will be restated in the form they will be used. The sets {B k } are just the entropy-typical sets. E For almost every x E A" there exists K(x) such that x x. with gap g is yti` = w(0)g(1)w(1)g(2)w(2)..log Wk(x. The entropy theorem implies that 4 E B. (2) 1 lim sup .u(w(DlY s 40.1] for which there are at least (1 . y) < h. n(k) = n(k. the g(j) have length g and alternate with the k-blocks w(j).2. g)-splitting index.5. The sets {G. it is enough to establish the upper bound. To complete the proof of the weak Bernoulli waiting-time theorem. k. If 8 1 is the set of all y then E A z that have j as a (y. (a) x E Gn . (b) For all large enough k and t. and a gap g. which depends on y. E) denote the least integer n such that k < (log n) I (h É).8 and depend on a positive number y. k > K(x). For each k put Ck(x) = Y Bk. eventually almost surely. for (t + 1)(k + g) < n < (t 2)(k ± g). 203 III. for 1 < j < t. y) > h -E€ so that it is enough to show that y g' Ck (x). s. Property (a) and a slightly weaker version of property (b) from that lemma are needed here.. k..2. eventually almost surely.) (1 + y) Ifl libt(w(i)). that is.b Proof of the exact-match waiting-time theorem. An index j E [1.
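The s-shifted, k-block parsing with gap g is a purely combinatorial operation; a sketch of it (with our own names) that returns w(0), the k-blocks w(1), ..., w(t), and the final block:

    def shifted_parsing(y, k, g, s):
        # y = w(0) g(1) w(1) g(2) w(2) ... g(t) w(t) w(t+1), where w(0) has
        # length s < k+g, each g(j) has length g, each w(j) has length k, and
        # the final block w(t+1) picks up whatever is left over
        assert 0 <= s < k + g
        head, rest = y[:s], y[s:]
        blocks = []
        while len(rest) >= 2 * (k + g):    # room for g(j) w(j) plus a final block
            block = rest[g:g + k]
            rest = rest[g + k:]
            blocks.append(block)
        return head, blocks, rest          # w(0), the k-blocks w(j), the tail

    y = list(range(30))
    head, blocks, tail = shifted_parsing(y, k=4, g=1, s=2)
    print(head, blocks, tail)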

1. for any y E G n (k)(J. is upper bounded by that w(j) (1 — 2 -k(h+a ) ) t(1-Y) and hence )t it(ck(x)n G n(k)(J s)) < (1 + y (1 This implies the bound (3). eventually almost surely. since ii(x) > k(h+a) and since the cardinality of J is at least t(1 — y). if this holds. while k(n) denotes the largest value of k for which k < (log n)/(h c). Fix such a y. yri: E G n . the probability x lic. s.1 n Bsd) + y)' jJ fl bi (w(i)). ENTROPY FOR RESTRICTED Choose k > K (x) and t so large that property (b) holds. By the d-admissibility theorem.5. since t n/ k and n . k.2ty log y 0 y y (1 2—k(h+a))t( 1— Y) Indeed. Since y E Gn . since there are k ± g ways to choose the shift s and at most 2-2ty lo gy ways to choose sets J c [1. n > N(. and hence it is enough to show that xi. eventually almost surely. 3) > h ± c if and only if x i( E C k(y).' Ck (y). I11. and fix E > 0. let Ck (y) = fx 1`: x g [Uk(Yr k) )]81 so that.y).204 ED CLASSES. s) denote the set of all y E G n (k) for which every j E J is a (y. for all j E J.0 > 0 and K such that if k > K and C c Bk is a set of k-sequences of measure at least 2 -0 then 11([C]8/2) > 1/2. To establish the key bound (3). Theorem 111. CHAPTER III. The key bound is (3) ii(Ck(x) n Gn.1. j E J are treated as independent then.5. . then a and y can be chosen so small that the bound is summable in k. eventually almost surely.ualik(n)(31)]812) > 1/2) . and choose . Theorem 111. (1/k) log Wk(x. fix a set J c [1.c Proof of the approximate-match theorem.(0) < (k + g)2.4.3. = . it then follows that y 0 Ck(x). and such that (t+ 1)(k+ g) < n(k) < (t 2)(k ± g). Let be a stationary coding of an i. Since Gnuo (J. Theorem 111. t] of cardinality at least t(1 — y) and an integer s E [0.3. Fix a positive number 3 and define G. This completes Li the proof of the exact-match waiting-time theorem. eventually almost surely. k + g — 1] and let Gn(k) (J. (1). For each k.i. process such that h(A) = h. But if the blocks w(j).d. The least n such that k < (log n)/ (h c) is denoted by n(k). which proves the theorem. s) c the splitting-index product bound. Let {B k } be the sequence of sets given by the almost blowing-up characterization theorem. s). y. yields it(n[w(i)] JE. t] of cardinality at least t(1 — y). For almost every y E Aoe there is an N (y) such that yti' E G n . g)-splitting index for y.

By this mimicking of the bursty property of the classical example. THE WAITING-TIME PROBLEM. and hence E p(Ck(y)n Bk ) < oc. x E GL. . 1} such that g i (0) = . cycling in this way through the possible markers to make sure that the time between blocks with the same marker is very large. that is. n-400 k(n) for every N. I11. however. . The first problem is the creation of a large number of block codes which don't change sequences by much. But this implies that the sets Ck(y) and [ilk (Yin (k) )]8 intersect. for then the construction can be iterated to produce infinitely many bad k. 1. The marker construction idea is easier to carry out if the binary process p. y) will be large for a set of y's of probability close to 1/2.u i (1) = 1/2. with a code that makes only a small density of changes in sample paths. be the Kolmogorov measure of the i. completing the proof of the approximate-match theorem. eventually almost Li surely. fix GL c A L .5. A family {Ci } of functions from G L to A L is said to be pairwise k-separated. 4. that is. with high probability. for with additional (but messy) effort one can create binary markers. which contradicts the definition of Ck(y). aL (C(4)..u. eventually almost surely. A function C: GL A L is said to be 6-close to the identity if points are moved no more than 8. k=1 Since 4 E Bk. The key to the construction is to force Wk(y. y) will be large for a set of probability close to 1. for then blocks of 2's and 3's can be used to form markers. yet are pairwise separated at the k-block level. Si) to be large. 3 } . ) < 6. j j. it follows that 4 g Ck(y). so that yrk) E G(k). The same marker is used for a long succession of nonoverlapping k-blocks.i.u o F -1 will be constructed along with an increasing sequence {k(n)} such that (4) 1 lim P.SECTION 111. where P. If x i = 1. If the process has a large enough alphabet. (1/k) log Wk(y. then switches to another marker for a long time. Thus. where k is large enough to guarantee that many different markers are possible. then. ) 7 ) N) = 0.u(Ck(Y) n Bk) < 2-10 . where A = {0.E Cj(Xt). that is Uk(yP) fl /ta ) = 0. To make this precise. The classic example where waiting times are much longer than recurrence times is a "bursty" process.5.) is forced to be large with high probability. and sample paths cycle through long enough blocks of each symbol then W1 (x. If it were true that [t(Ck(Y) n Bk) > 2-0 then both [Ck(Y) n Bk]512 and [16 (Yr k) )]6/2 each have measure larger than 1/2. A stationary coding y = . This artificial expansion of the alphabet is not really essential. is thought of as a measure on A Z whose support is contained in B z .„ denotes probability with respect to the product measure y x v.„ (— log Wk(n)(y. if the k-block universes of the ranges of different functions are disjoint.d. y1 E Ci(Xt). say. The initial step at each stage is to create markers by changing only a few places in a k-block. A family {Ci} of functions from GL to A L is 6-close to the identity if each member of the family is 6-close to the identity. process with binary alphabet B = {0. 1471 (x. -51.d An exact-match counterexample. 2. Y . 205 Suppose k = k(n) > K and n(k) > N(y). Let . one in which long blocks of l's and O's alternate. hence intersect.
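One way to realize the marker idea is sketched below (our own simplification of the construction made precise in the next lemma): each code overwrites only the first few places of a k-block with the pattern 0 w 0 w 0, where w runs over {2, 3}^g, so the family moves points very little yet different codes stamp different markers on every k-block.

    from itertools import product

    def marker_codes(g, k):
        # one k-block code per word w in {2,3}^g; each code changes only the
        # first 2g+3 symbols of a k-block, so it is (2g+3)/k-close to the identity
        assert k >= 2 * g + 3
        codes = {}
        for w in product((2, 3), repeat=g):
            def code(block, w=w):
                marker = [0, *w, 0, *w, 0]
                return marker + list(block[len(marker):])
            codes[w] = code
        return codes

    codes = marker_codes(g=2, k=10)
    block = [1] * 10
    for w, code in list(codes.items())[:2]:
        print(w, code(block))
    # Distinct codes stamp distinct {2,3}-words into every k-block, so their
    # ranges, and hence their k-block universes, are disjoint (k-separation).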

c. o F-1 the following hold.5 Suppose k > (2g+3)/8 and L = mk. GL. 3) such that i({2. hence no (g -I. Given c > 0. Put . (iii) v({2. where g > 1.(G Proof Order the set {2. Any block of length k in C(x) = yt must contain the (g -I. with the additional property that no (g +1)-block of 2's and 3's occurs in any concatenation of members of U. i (x lic) replaces the first 2g ± 3 terms of 4 by the concatenation 0wf0wf0. An iteration of the method. as done in the string matching construction of Section I.206 CHAPTER III. 5") < 2kN ) 5_ 2 -g ± E.5. is used to produce the final process. El The code family constructed by the preceding lemma is concatenated to produce a single code C of length 2g L. Define the i-th k-block encoder a (4) = yilc by setting y = Yg+2 = Y2g+3 = 0 . The block-to-stationary construction. Note also that the blocks of 2's and 3's introduced by the construction are separated from all else by O's. 2. there is a k > 2g +3 and astationary encoder F: A z 1--+ A z such that for y = p. i > 2g + 3. The following lemma is stated in a form that allows the necessary iteration.. is then used to construct a stationary coding for which the waiting time is long for one value of k.8. (i) pvxv (wk'.(GL). g+1 2g+2 yj = x .2)-block Owf 0. that is.5. F (x)) < 6. the family is 3-close to the identity.8. 3 > 0.5. 1. separately to each k-block. and since different wf are used with different Ci . Y2 = Yg+3 = W1 . 3 } g+ 1 ) = 0. and (C. pairwise k-separation must hold. ENTROPY FOR RESTRICTED CLASSES. its proof shows how a large number of pairwise separated codes close to the identity can be constructed using markers.C. Thus the lemma is established. Ci (xt) = yt. 1.: 1 < i < 2g1 be as in the preceding lemma for the given 3.1)-block of 2's and 3's can occur in any concatenation of members of Ui C. Lemma 111. and leaves the remaining terms unchanged. 3} L consisting of those sequences in which no block of 2's and 3's of length g occur Then there is a pairwise k-separated family {C i : 1 < i < 2g} of functions from GL to A L . The following lemma combines the codes of preceding lemma with the block-to-stationary construction in a form suitable for later iteration. Lemma 111. Lemma 1. Let GL be the subset of {0. Let L. then choose L > 2kN+2 lc so that L is divisible by k. 2. and N. Proof Choose k > (2g +3)18. Since each changes at most 2g+3 < 3k coordinates in a k-block. where jk+k jk+1 = Ci jk+k x )jk+1 o < <M. 2. 31g in some way and let wf denote the i-th member. that is. for all x. which is 8-close to the identity.6 Let it be an ergodic measure with alphabet A = {0. The encoder Ci is defined by blocking xt into k-blocks and applying Cs. (ii) d(x. 3}) = 0.

. and such that ynn+L-1 = Ci ( X. in. then yn (b) If i E Uj E jii . and such that y = F(x) satisfies the following two conditions. n +Al]. Furthermore. Case 1. had no blocks of 2's and 3's of length g. since p. for every x. for 41 (G L) 2g . Either x1 or i is in a gap of the code F. and iL iL g. c has the desired properties. note that if y and 53 are chosen independently. such that if J consists of those j for which I is shorter than M. If Case 2c occurs then Case 2b occurs with probability at most 6/2 since L . the encoded process can have no such blocks of length g + 1. while those of length M will be called coding blocks. • n+ +1M = C(Xn ni ± 1M ). Case 1 occurs with probability at most €72 since the limiting density of time in the gaps is upper bounded by 6/4.5 to obtain a stationary code F = F c with the property that for almost every x E A z there is a partition. L — 1. 2kN+2 /E. i. m < 1 < m + L — 1. then UJEJ /J covers no more than a limiting (6/4)-fraction of Z.8. (a) If = [n + 1. i0 j. first note that the only changes To show that F = F made in x to produce y = F(x) are made by the block codes Ci.k <n+L — 1. and hence property (ii) holds. The intervals I. since y and j3 were independently chosen and there are 2g different kinds of L-blocks each of which is equally likely to be chosen. then only two cases can occur. Case 2. and j such that n < 1 < n + L — 1. that are shorter than M will be called gaps.11 +L-1\ yrn = C1 (4+L-1) . since k > (2g +3)/8. Case 2b. and 2k N < m L — 1. Z = U/1 . then y. 207 M = 2g L and define a code C: Am 1--* A m by setting C(xr) = xr. since the M-block code C is a concatenation of the L-block codes. i = j. To show that property (i) holds. which establishes property (iii). the changes produce blocks of 2's and 3's of length g which are enclosed between two O's. there will be integers n. = x i . = F(). and if y = F(x) and 5".SECTION 111. of the integers into subintervals of length no more than M. 1 < <2 for xr E (GL) 29 Next apply the block-to-stationary construction of Lemma 1. Thus d(x. Case 2a occurs with probability at most 2 -g. Either k > n + L — 1 or 2kN > Case 2c. each of which changes only the first 2g + 3 places in a k-block. hence. THE WAITING-TIME PROBLEM.5. Both xi and i1 belong to coding blocks of F. {C i }. Three cases are now possible. F(x)) < 8. Y (i--1)L+1 = Ci(XoL +1). Case 2a. In Case 2.

the properties (i)-(iii) of Lemma 111. F„ have been constructed so that (a). 2a. This completes the construction of the counterexample. while condition (a) holds for the case i = n + 1. If 6 is small enough. > 2kN . it is enough to show the existence of such a sequence of encoders {Fn } and integers {k(n)}. 2. Az. with GI = F1. . = F o Fn _i o o F1 and measure 01) = o G n x. ( wk(i)(y. . F2 .g + 6/2. with n replaced by n 1. Let 6 < 2 —(n+1) be a positive number to be specified later.. the entropy of the coin-tossing process .5. 3} n ) = 0. ENTROPY FOR RESTRICI ED CLASSES. step n In summary.6 hold. for n + 1. .6 with g = 0 to select k(1) > 3 and F1 such that (a). however. where Go(x) (a) pooxon .208 CHAPTER III.5.5. (b). and 2b therefore Puxv (Wk(Y < 2kN) _< 6/2 2 . G < 2ik(i)) < 27i +1 ± 2-i + 2—n+1 ± 2—n 9 1 < < n.5.6 with 1) larger than both 2n + 2 and k(n) and g = n and p. Wk (y. and (c) hold for n. Remark 111. since y (n+ 1) = o G n since it was assumed that 6 < 2 -(n+ 1) . (b) d (G n (x). Apply Lemma 111. and hence one can make P(xo F(x)o) as small as desired. . Cases 1. 33 ) k(n) co. (x)) almost surely. (b). These properties imply that conditions (b) and (c) hold -± 1 1 . the entropy of the encoded process y can be forced to be arbitrarily close to log 2. will hold for i < n 1.7 The number E can be made arbitrarily small in the preceding argument. the following composition G.. Condition (a) guarantees that the limit measure y will have the 1 log Wk(n)(Y. finitely many times. properties hold for each n > 1. and (c) hold for n = 1. and an increasing sequence {k(n)} such that for the -1 . by the pairwise k-separation property. in y x y probability.: Az 1-4. .. Assume F1.5. n = 1. then y (n+ 1) will be so close to On ) that condition (a). that is.6. so there will be a limit encoder F and measure y = desired property. This will be done by induction. replaced by On ) to select k(n to select an encoder Fn+i : A z A z such that for v (n+ 1) = On ) 0 . In particular. Fn. The first step is to apply Lemma 111. The finish the proof it will be shown that there is a sequence of stationary encoders.u. (c) On ) ({2. Condition (c) is a technical condition which makes it possible to proceed from step n to 1. Condition (b) guarantees that eventually almost surely each coordinate is changed only o F -1 . imply that which completes the proof of Lemma 111.

and a decreasing sequence {a m } c (a. while property (c) implies that (6) lim Prob (Wk„. Lemma 111. inductive application of the merging/separation lemma. = 0. 209 III. M] j E [1.separated if their full k-block universes are a-separated for any k > K. S(m) can be replaced by (S(m))n for any positive integer n without disturbing properties (a) and (b). A merging of a collection {Si : j E J} is a product of the form S = ni =1 S Cm) [1. J] is such that 10 -1 ( = M/J. (b) Each Si (m) is a merging of {Si (m — 1): j (c) 2kni Nin < t(m)/m. D c A k are a-separated if C n [D]. A weaker subsequence result is discussed first as it illustrates the ideas in simpler form. their starting positions x1 and yi will belong to £(m)-blocks that belong to different j (m) and lie at least 2 km N. Two subsets S. J]. The full kblock universe of S c A f is the set k-blocks that appear anywhere in any concatentation of members of S. VN. Given an increasing unbounded sequence {Nm }.-indices below the end of each £(m)-block. Let a be a positive number smaller than 1/2. 1/2). km ) strongly-separated sets of the same cardinality. then with probability at least (1 — 1/40 2 (1 — 1/m) 2 . y. THE WAITING-TIME PROBLEM. First some definitions and results from 111. (a) S(m) is a disjoint union of a collection {Si (m): j < .3. where [D]„ denotes the a-blowup of D. for each where J divides M and 0: [1.5. a) > 2 which proves (6). Thus.3.5. however. (x . Indeed. if sample paths sample paths x and y are picked independently at random.. a) < 2km N ) = 0.3. Property (b) guarantees that the measure p. (x. for any integer N. produces a sequence {S(m) C A €("z ) : in > 1} and an increasing sequence {k m }. y.3. The only new fact here is property (c).strongly . a) N) = 0. see Remark 111./m } of pairwise (am . . ) > (1 — 1/40 2 (1 — 1/rn)2 .e An approximate-match counterexample. and hence property (c) can also be assumed to hold.6. y.b will be recalled. K) . Two sets C. since Nm is unbounded. such that the following properties hold for each tn. Prob (W k.b. This is just the subsequence form of the desired result (5).. S' C A f are (a.5. defined by {S(m)} is ergodic.SECTION 111. Such a t can be produced merely by suitably choosing the parameters in the strongnonadmissibility example constructed in 111. The goal is to show that there is an ergodic process it for which (5 ) 1 lim Prob (— log k co k W k (X . Once km is determined.

and hence.3. . ± 1. as in the preceding discussion. The reader is referred to [76] for a complete discussion of this final argument.2.b. Further independent cutting and stacking can always be applied at any stage without losing separation properties already gained. the column structures at any stage can be made so long that bad waiting-time behavior is guaranteed for the entire range km < k < km± i .210 CHAPTER III. (5). controls what happens in the interval km < k < kn. ENTROPY FOR RESTRICTED CLASSES. discussed in III. The stronger form. This leads to an example satisfying the stronger result.

Invertibility is a basic concept in the abstract study of transformations. if t is the distribution of XT and y the distribution of Yr. like other characterizations of B-processes..d.d. Xin n E A mn . stationary. though it is not. J+1 1 x- for all i < j < mn and all x/. A natural and useful characterization of stationary codings of i.Chapter IV B-processes. finitely determined processes. (j_on+ and the requirement that x. v). processes. including almost block-independent processes. This is fortunate. for example. the almost block-independence property. ji is the measure on A' defined by the formula nz Ti(X inn )= i j=1 (x(i—l)n+ni) . Various other names have been used. Section IV. This terminology and many of the ideas to be discussed here are rooted in O rn stein's fundamental work on the much harder problem of characterizing the invertible stationary codings of i. A block-independent process is formed by extending a measure A n on An to a product measure on (An)". will be discussed in this first section. processes. the so-called isomorphism problem in ergodic theory. processes is still a complex theory it becomes considerably simpler when the invertibility requirement is dropped. The focus in this chapter is on B-processes. randomizing the start produces a stationary process. but is of little interest in stationary process theory where the focus is on the joint distributions rather than on the particular space on which the random variables are defined. is expressed in terms of the a-metric (or some equivalent metric. 211 .i. As in earlier chapters.1 Almost block-independence. that is. The almost block-independence property.d. Yr) will often be used in place of dn (p. in > 1.i. [46].. dn (X7. finite alphabet processes that are stationary codings of i.d. processes.) Either measure or random variable notation will be used for the d-distance.i. each arising from a different characterization. Note that 71 is Ta -invariant. as will be done here.i. for while the theory of stationary codings of i. In other words. in general. and very weak Bernoulli processes. then transporting this to a Ta-invariant measure Ft on A'.

i. since stationary coding cannot increase entropy. for each j > 1. in particular. This result and the fact that mixing Markov chains are almost blockindependent are the principal results of this section. (a) Y U —1)n+n and X7 have the same distribution. then by showing that the ABI property is preserved under the passage to d-limits./ } defined by the following two conditions. the independent n-blocking of {Xi } is the Tn-invariant process {Y. process.d. if it is assumed that the i. Both of these are quite easy to prove. however. The theory to be developed in this called the concatenated-block process defined by chapter could be stated in terms of approximation by concatenated-block processes. that the two theorems together imply that a mixing Markov chain is a stationary coding of an i.d. .i. In fact it is possible to show that an almost block-independent process p is a stationary coding of any i. The fact that stationary codings of i. there is an N such that if n > N and is the independent n-blocking of . B-PROCESSES. An ergodic process p. since the ABI condition holds for every n > 1 and every E > O. which is not at all obvious. but it is generally easier to use the simpler block-independence ideas.d.1.1 (The almost block-independence theorem. processes. processes are almost block-independent is established by first proving it for finite codings.d. all this requires that h(v) > h(p. and is sufficient for the purposes of this book.i.u then d(A. Of course.i. process. for each j > 1. process onto a process d-close to then how to how to make a small density of changes in the code to produce a process even closer in d. The almost block-independence property is preserved under stationary coding and. (k ) v (j-1)n-l-n (j-1)n+1 is independent of { r. process is clearly almost block-independent.i. process has infinite alphabet with continuous distribution for then any n-block code can be represented as a function of the first n-coordinates of the process and d-joinings can be used to modify such codes. The fact that only a small density of changes are needed insures that an iteration of the method produces a limit coding equal to 12.1. process y for which h(v) > h(A).i.u n of p to An.d. It is not as easy to show that an almost block-independent process p.d. Note. process y onto the given almost block-independent process p. The independent n-blocking of an ergodic process . or by understood. Theorem IV. This will be carried out by showing how to code a suitable i. In random variable language.4) < E.) An ergodic process is almost block-independent if and only if it is a stationary coding of an i.212 CHAPTER IV.d. is a stationary coding of an i. The construction is much simpler.u is the block-independent process if n is defined by the restriction . is almost block-independent (ABI) if given E > 0.i.d.i.i. process.d. Theorem IV. It will be denoted by Ft(n).2 A mixing Markov chain is almost block-independent. .). i < (j — 1)n}. in fact. An i.d.i. for this requires the construction of a stationary coding from some i. characterizes the class of stationary codings of i.

the preceding argument applies to stationary codings of infinite alphabet i. it follows that finite codings of i.} be a finite coding the i. 213 IV.d. In summary. by the definition of independent nare independent with the distribution of Yw blocking. processes are almost blockindependent. ALMOST BLOCK-INDEPENDENCE. < c. processes and d-limits of ABI processes are almost block-independent.SECTION IV 1. since a stationary coding is a d-limit of finite codings. If the first and last w terms of each n-block are omitted. for all n. Yr) is upper bounded by m(n _ 2040 _ 2 w) w+1 vn—w-1 Tnn—w-1 '(n1-1)n+w+1 . and hence.i. Since y is almost block-independent there is an N such that if n > N and 7/ is the independent n-blocking of y then d(y. processes are almost blockindependent.d.. ) Since this is less than c. finite codings of i.d.a B-processes are ABI processes.(Zr. < E/ 3.i.i.L) < 6 / .414/ by property (c) of the d-property list. The fact that both and are i. Given E choose an almost block-independent process y such that d(p. Lemma 1.i.11. The triangle inequality then gives This proves that the class of ABI processes is d-closed. for any n> wl€.d. Furthermore. is the d-limit of almost block-independent processes.} be the independent n-blocking of { Yi } .1.i. yields yn—w—1 v mn —w-1 n n— +T -1 7 mn —w-1 am(n_210 (z w w-1-1 (m-1)n+w-1-1 ) Thus dnm (Z. and let {Z. hence such coded processes are also almost block-independent. w+1 ymn—w--1 (m-1)n±w±1 ) 4. Next assume p. Lemma 1. Fix such an n and let FL be the independent n-blocking of A.i. then mnci„. 1. so that property (f) of the d-property list. by (1).d. while the successive n — 2w blocks vn—w-1 4 w+1 y mn —w-1 (m-1)n±w±1 +r -I by the definition of window halfare independent with the distribution of rr width and the assumption that {X i } is i. > 0 (1) d .d.i. < c/3. = anal FL) = cin(y. process {X i } with window half-width w.9. when thought of as An-valued processes implies that d( 17. The successive n — 2w blocks 7n—w-1 'w+1 7 mn —w-1 " • '(ni-1)n+w+1 n+ 1 -1 . processes onto finite alphabet processes.. This is because blocks separated by twice the window half-width are independent. hence stationary codings of i.i.9. y) < c/3.. processes are almost block-independent.d. First it will be shown that finite codings of i.11. Lii .d. This completes the proof that B-processes are almost block-independent. Let {Y.

process (2) {(V(0). is a stationary coding of an infinite alphabet i. process with continuous distribution is a stationary coding.d.i. process {Z.d. process is convenient at a given stage of the construction. The key to the construction is the following lemma. The lemma can then be applied sequentially with Z (t) i = (V(0) i .o G -1 ) <E. = 2 -t. Lemma W. A block code version of the lemma is then used to change to an arbitrarily better block code. for all alphabet symbols b. to obtain a sequence of stationary codes {Ft } that satisfy the desired properties (a) and (b). process {Z.3 (Stationary code improvement. An i. with window width 1.) If a(p. Therefore. as t — * oc. process (2) onto the almost block-independent process A.b ABI processes are B-processes.d..d. process independent of {Z. will be used in the sequel. The lemma immediately yields the desired result for one can start with the vector valued.1. that is. .d.3 is to first truncate G to a block code. The goal is to show that p. (b)Er Prob((Ft+ i (z))o (Fr(z))o) — < oc. where V(0)0 is uniform on [0.. and Er = 2-t+1 and 3. V(t — 1) i ). (ii) (A x v) ({(z.i. . 1. V(t) i . and hence any i. is almost block-independent and G is a finite stationary coding of a continuously distributed i. v)o}) < 4E. and if is a binary equiprobable i..214 CHAPTER IV. process with continuous distribution.} with Kolmogorov measure A.d..i.i. V(1) i . V(t)0 is binary and equiprobable. Let {Xi } be an almost block-independent process with finite alphabet A and Kolmogorov measure A. since any random variable with continuous distribution is a function of any other random variable with continuous distribution.d. A. It will be shown that there is a continuously distributed i. V(2)1 . almost surely.1.)} with independent components. then given . The second condition is important.d. for t > 1. V (1) i . where p. countable component. each coordinate Fe(z)i changes only finitely often as t oc. Vi )} such that the following two properties hold. The exact representation of Vi l is unimportant. A o F -1 ) = 0. which allows one to choose whatever continuously distributed i. B-PROCESSES. of any other i. } is continuously distributed if Prob(Zo = b) = 0. The idea of the proof of Lemma IV.i. IV.d. there will almost surely exist a limit code F(z)1 = lime Ft (z) i which is stationary and for which cl(A. producing thereby a limit coding F from the vector-valued i.d. which allows the creation of arbitrarily better codes by making small changes in a good code. for it means that. process Vil with Kolmogorov measure A for which there is a sequence of stationary codes {Ft } such that the following hold. process with continuous distribution. 1) and. together with an arbitrary initial coding F0 of { V(0)i }.i.}. (A x o F -1 ) < 8.5 > 0 there is a finite stationary coding F of {(Z i .. V): G(z)o F (z.i.i.i. i. A = X o F -1 . This fact.i. (i) d(p. o Fr-1 ) > 0. (a)a(p. .

A binary. given that is a coding block. n)-independent process if {Ri } is a (8. Indeed.5 (Block code improvement. an extension which will be shown to hold for ABI processes.? +n--. with associated (8.1.1 = diliq is a coding block) = . Ri)} will be called a (S. The block code version of the stationary code improvement lemma is stated as the following lemma. n)-truncation of F. Lemma IV. ergodic process {Ri } will be called a (S.) Let F be a finite stationary coding which carries the stationary process v onto the stationary process 71.(F(u)7. such that d. using the auxiliary process {Vi } to tell when to apply the block code. where u is any sample path for which û7 = /47.1. 1.G(u7)) < S. Then there is a measurable mapping 4): Y 1-+ A n such that ji = y o 0-1 and such that Ev(dn((k (Y).8. An ergodic process {Wi } is called a (8. The following terminology will be used. This is well-defined since F(ii)7+ 1 depends only on the values of û.1. G(u7)) < 8. The following lemma supplies the desired stronger form of the ABI property. Ri ): i < 0 } . called the (S. on A'. see Exercise 12 in Section 1. For n > N define G(u7) = bF(ii)7± 1c. since the only disagreements can occur in the first or last w places. Lemma IV. b7) = v(0-1 (4) n (b7)) for 4): Y H A n such that A n = vo V I . (4 . n)-blocking process and if 147Ç' is independent of {(Wi . 215 Finally. n)-independent extension of a measure pt. n)-independent process and Prob(14f. The i. b7)) = E v(dn (0 (Y). so that Vd. 4 E A'.(a7). (Y))).9. if {(147„ Ri )} is a (S. for i< j<i+n-1and Ri+n _1 = 1. A block R:+n-1 of length n will be called a coding block if Ri = 0. y) be a nonatomic probability space and suppose *: Y 1-± A n is a measurable mapping such that V 0 —1 ) < 8. n)-blocking process {Ri } .d. n)blocking process.i. Clearly d.SECTION IV. a joining has the form X(a'i' . Proof This is really just the mapping form of the idea that d-joinings are created by cutting up nonatomic spaces. nature of the auxiliary process (Vi) and its independence of the {Z1 } process guarantee that the final coded process satisfies an extension of almost block-independence in which occasional spacing between blocks is allowed. while retaining the property that the blocks occur independently. An ergodic pair process {(Wi . A finite stationary code is truncated to a block code by using it except when within a window half-width of the end of a block. if the waiting time between l's is never more than n and is exactly equal to n with probability at least 1 — 3. Proof Let w be the window half-width of F and let N = 2w/8. that is. . for v-almost every sample path u.b. The almost block-independence property can be strengthened to allow gaps between the blocks.(F (u)7. Given S > 0 there is an N such that if n > N then there is an n-block code G. for v-almost every sample path u.) Let (Y.u.4 (Code truncation. and b and c are fixed words of length w.1 = On -1 1. ALMOST BLOCK-INDEPENDENCE. the better block code is converted to a stationary code by a modification of the block-to-stationary method of I. (y))) < 8.

The goal is to show that if M1 is large enough and 3 small enough then ii(y. < 2E/3.3. the ci-distance between the processes {K} and {W.i. hence all that is required is that 2N1 be negligible relative to n.1. let {W} denote the W-process conditioned on a realization { ri } of the associated (6. sk NI L and an arbitrary joining on the other parts.216 CHAPTER IV. Ft) < c. in > Mi.6. equiprobable i.} be a continuously distributed i.9. and that 8 be so small that very little is outside coding blocks. say i. with respective Kolmogorov measures À and y. Let {Z. n)-independent extension {W} of An . } . and let À x y denote the Kolmogorov measure of the direct product {(Z„ V.) < c. Thus after omitting at most N1 places at the beginning and end of the n-block it becomes synchronized with the independent Ni -blocking. Lemma 1. then. skNi < Note that m = skNI — tkNi is at least as large as MI. the starting position.i1.} will be less than 2E13. Let 6 be a positive number to be specified later and let be the Kolmogorov measure of a (8.AN Ni1+1 has the same distribution as X' N1 +i . n)independent extension 11.i. with associated (6. can be well-approximated by a (6. for almost every realization {r k }. process and {V. < 2E/3. If M i is large enough so that tk NI — nk nk+1 tIcNI is negligible compared to M1 and if 8 is small enough. and hence cl(p.}. Proof In outline. This is solved by noting that conditioned on starting at a coding n-block.} be an almost block-independent process with Kolmogorov measure t.1. of A n . provided only that 6 is small enough and n is large enough. yields the desired result.d.d. for any (8. The obstacle to a good d-fit that must be overcome is the possible nonsynchronization between n-blocks and multiples of NI -blocks.} a binary.. B-PROCESSES.) Let {X. To make this outline precise. Towards this end. process. here is the idea of the proof.. (3) Put N2 = M1+ 2N1 and fix n > N2. n)-blocking process {R. An application of property (g) in the list of dproperties.1. first choose N1 so large that a(p. n)-blocking process (R 1 ).11. completing the proof of Lemma IV. It is enough to show how an independent NI -blocking that is close to p. ii(v. be . of an n-block may not be an exact multiple of NI . vm ) < 13. but at least there will be an M1 such that dni (/u ni . for all ni. Given E > 0 there is an N and a 6 > 0 such that if n > N then a(p. n)-independent extension.6 (Independent-extension.} and {W. and that i7 V -7.} are now joined by using the joining that yields (4) on the blocks ftk + 1. Since y is not stationary it need not be true that clni (it„„ um ) < E/3. where y is the Kolmogorov measure of the independent N I -blocking IN of pt. { Proof of Lemma IV. For each k for which nk +i — n k = n choose the least integer tk and greatest integer sk such that tk Ni > nk .)}. so that (3) yields t NI (4) ytS:N IV I I± Wits:N A: 1+ ) < € /3. The processes Y. Let p. but i j will be for some j < NI.. independent of {Z. y) < 6/3. Let {n k } be the successive indices of the locations of the l's in { ri}. Lemma IV.

v): G(z)0 F (z. however. let g be a finite stationary coding of {V. . and let G be a finite stationary < E. p e-. and so that there is an (c. let {yk } denote the (random) sequence of indices that mark the starting positions of the successive coding blocks in a (random) realization r of {R.} onto a (S. Yk n — 1]. } . the dn(G(Z)kk+n-1 limiting density of indices i for which i V Uk[Yk. g(v)). E(d. if g(v) = r.d. process. (5).) < 8. y) 0 }) < 4e. a(4 +n-i„ )) < E. r)i = a.i. V. F(Z. The see why property (ii) holds. x y) o <8.)•••d(wk))< 2c. and coding to some fixed letter whenever this does not occur. by the ergodic theorem applied to {R. and hence the proof that an ABI process is a stationary coding of an i. Clearly F maps onto a n)-independent extension of . 0(Z))) 26. F(u. first choose n and S < e so that if is a (S. Xn and hence the block code improvement lemma provides an n-block code 0 such that = o 0 -1 and E(dn (à . (5) Next.d.. In other words. vgk +n-1 = 0 (Wk). By definition. a(w. thereby establishing property (ii). implies that the concatenations satisfy (6) k—>oo lirn dnk(O(Wi) • • • 0(Wk). almost surely. n)-blocking process {Ri } and lift this to the stationary coding g* of {(Z. this means that < 2E.u. The goal is to construct a finite stationary coding F of {(Z. (i) d(p. ALMOST BLOCK-INDEPENDENCE. Furthermore. Let F(z.)} onto {(Z„ Ri )} defined by g*(z. In particular. i fzi Uk[Yk. hence property (i) holds. and . Towards this end. coding such that d(p.)} such that the following hold. yk + n — 1] is almost surely at most & it follows from (6) that (X x y) ({(z. Since. Ø(Z7))) = E(dn (d (Z 11 ). (ii) (X x y) (((z. r) be the (stationary) code defined by applying 0 whenever the waiting time between 1 's is exactly n. This completes the proof of the lemma..}. n)-independent extension of i then d(p. V. fix a E A. Y)o}) <2e E< 4e. r Ykk+" = dp(zYA Yk ) for all k. define +n-1 ).. y) = (z. (X. and define F(z. v): G(z)0 F(z. n)-truncation G of G.(Z). along with the fitting bound. X o Fix 3 > O. with Wo distributed as Z.SECTION IV 1. and hence the process defined by Wk = Z Yk+n-1' i i .z (d(Z11 ).i. 217 an almost block-independent process with alphabet A. first note that {Zi } and {Ri } are independent. 0(Z7)) I R is a coding block) so the ergodic theorem applied to {Wk }.

if fi E [1 .1. and is concentrated on sequences that agree ever after. the coupling function is defined by the formula wn(4 = f minfi E [ 1. once they agree.7 The existence of block codes with a given distribution as well as the independence of the blocking processes used to convert block codes to stationary codes onto independent extensions are easy to obtain for i. The key to the fact that mixing Markov implies ABI. the expected value of the coupling function is bounded. The coupling measure is the measure A n on An x An defined by v (an . Proof The inequality (8) follows from the fact that the A n defined by (7) is a joining of . Remark W. A joining {(Xn . respectively.) The coupling function satisfies the bound (8) dn(tt. The Markov property guarantees that this is indeed a joining of the two processes.i. } . independent of n and a. v(a)p(b'i') 0 if w( a. and y be the Kolmogorov measures of {X.i. With considerably more effort. b7) = n otherwise. n --± oc. Furthermore.Y n ) cannot exceed the expected time until the two processes agree. Yn )} is obtained by running the two processes independently until they agree. this guarantees that ncln (X. Let p. n — 1]: ai = bi}. } be a (stationary) mixing Markov process and let {1. as well as several later results about mixing Markov chains is a simple form of coupling.. The precise formulation of the coupling idea needed here follows. n — 1 ] : ai = bi } 0 0 otherwise. processes with continuous distribution. a special type of joining frequently used in probability theory. v) indeed goes to O. These results. In particular. Lemma IV. with the very strong additional property that a sample path is always paired with a sample path which agrees with it ever after as soon as they agree once. see [35]. b?) = (7) A direct calculation.} and { Y n } . then running them together. [46]. below.1.d. so that iin (p.„ as marginals.c Mixing Markov and almost block-independence.d. see also [63. suitable approximate forms of these ideas can be obtained for any finite alphabet i. see Exercise 2. In particular dn(it. and an w+1 = 1)/ 1_ 1 if wn (a . IV.218 CHAPTER IV B-PROCESSES. which are the basis for Ornstein's isomorphism theory. = w < n. 42].1.u(b7). n. . process whose entropy is at least that of it. v) —> 0. v) < E)" n (wn) .. making use of the Markov properties for both measures. shows that An is a probability measure with vn and p. This establishes the lemma. For each n > 1. are discussed in detail in his book. but which is conditioned to start with Yo = a. . Let {X.8 (The Markov coupling lemma.z } be the nonstationary Markov chain with the same transition matrix as { X.u n and v n .

.n (dmn(xr. a joining Ain „ of A m . exactly the almost mixing Markov processes. } plus the fact that Zk k n +7 is independent of Z j . conditioned on the past depends only on Xk n . proves that d(p. that distribution of Xk realizes the du -distance between them./x. Zr)) < E. A suitable random insertion of spacers between the blocks produces a (8. z7)) < which.. Proof By the independent-extension lemma. by the coupling lemma.9 An almost block-independent process is the d-limit of mixing Markov processes. conditioned on the past. < c. This is easy to accomplish by marking long . 219 To prove that a mixing Markov chain is d-close to its independent n-blocking. For each positive integer m.n± ±n 1 /xkn) < for all sufficiently large n. Section 1. of course.„. 1:1 The almost block-independent processes are. a joining A. n)-independent extension which is a function of a mixing Markov chain. .„ with iimu . provided only that n is large enough. in fact. given Xk n = Xkn • kn "± ditioned distribution of XI and the conditional distribution of Xk The coupling lemma implies that (9) d. this completes the proof that mixing Markov chains are almost block-independent.SECTION IV 1. summing over the successive n-blocks and using the bound (9) yields the bound Ex„. Likewise. for all large enough n. Since the proof extends to Markov chains of arbitrary order. X k n +7/xk n ) will be used to denote the du -distance between the unconI. of the unconditioned kn "Vil and the conditional distribution of Xk kn + 7.6. n)-independent extension ri which is mixing Markov of some order.1. Fix n for which (9) holds and let {Zi } be the independent n-blocking of {Xi } ..pnn(XT n einn ) = ii(4)11.) The only problem. } with Kolmogorov measure and fix E > 0.T. 1. Lemma IV. The notation cin (X k n +7. it is enough to show that for any 8 > 0. with Kolmogorov measure i. and the distribution of X k n +7 conditioned on Xk n is d-close to the unconditioned distribution of Xn i .2. The concatenated-block process defined by pt is a function of an irreducible Markov chain. For each positive integer m. Fix a mixing Markov chain {X. ALMOST BLOCK-INDEPENDENCE. (dnin (finn .(x k n ±n . see Example 1. because the distribution of Xk k . by the Markov property. with ii mn will be constructed such that Ex. (See Exercise 13. for each xbi .11. not just a function of a mixing Markov chain. shows that A.„. x k . u is a joining of A„. j < kn. To construct Xmn. using the Markov property of {X. and hence that {X i } is almost blockindependent. is to produce a (8. therefore. n)-independent extension which is actually a mixing Markov chain.(Z?) k=1 Jr A1 zkknn++7)• A direct calculation. the idea to construct a joining by matching successive n-blocks.1. in the sense that Theorem IV. a measure it on A" has a (8. let À nin be the measure on A"' x A mn defined by m-1 (10) A. given Xk n = xk n . first choose. This will produce the desired dcloseness.

[66]. (b) (sxr . The coupling proof that mixing Markov chains are almost block-independent is modeled after the proof of a conditional form given in [41]. let sCm denote the set of all concatenations of the form s w(1) • • • w(m). n (dnzn Winn zr)) E. Furthermore. concatenations of blocks with a symbol not in the alphabet. } and {W} are (6. n with Ti Ex. .. An).220 CHAPTER IV. let C c A' consist of those ar for which it(4) > 0 and let s be a symbol not in the alphabet A.1. for each syr E SC m . aperiodic Markov chain with state space se" x {0.9. = {1T 7i } . 1) with probability (1 — p).a(syr"). Show that if { 1/17. and let ft be the measure on sCrn defined by the formula ft(sw( 1 ) . 1. il To describe the preceding in a rigorous manner. Show that (7) defines a joining of A n and yn . Show the expected value of the coupling function is bounded. 0) with probability pii(syr). i) to be x. n)-independent extensions of An and fi n . respectively. it is mixing. 3. if 1 < i < mn + 1. where y and D are the Kolmogorov measures of {W} spectively. The process {Zi } defined by Zi = f(Y 1 ) is easily seen to be (mn + 2)-order Markov. n)independent extension of A. i 1). define f(sxrn. • w(m)) = i=1 Fix 0 < p < 1.1.11 The proof that the B-processes are exactly the almost block-independent processes is based on the original proof. .„. 1. then d(v. The proof that ABI processes are d-limits of mixing Markov processes is new. Remark IV.10 The use of a marker symbol not in the original alphabet in the above construction was only to simplify the proof that the final process {Z1 } is Markov. then randomly inserting an occasional repeat of this symbol.d Exercises. n)-blocking process {R.u. nm +1) goes to (sy'llin. 1. Remark W. below. Next. } . Show that the defined by (10) is indeed a joining of . As a function of the mixing Markov chain {ri }. independent of n. irreducible. 2. provided m is large enough and p is small enough (so that only a 6-fraction of indices are spacers s. i) can only go to (sxr". and to (syr". (Z1 ) is clearly a (6. to be specified later. B-PROCESSES.) This completes the proof of Theorem IV. since once the marker s is located in the past all future probabilities are known. For each positive integer m. IV.1. (sxr . with the same associated (6. such that 4. nm 1} and transition matrix defined by the rules (a) If i < nm 1. and s otherwise. A proof that does not require alphabet expansion is outlined in Exercise 6c. and let { ri } be the stationary. where w(i) E C. rean(An.

given E > 0. If k fixed and n is large relative to k.i. the only error in this . process is the finitely determined property.d. THE FINITELY DETERMINED PROPERTY. (a) Show that p. 221 5. the process {Z.SECTION IV. A stationary process p. The most important property of a stationary coding of an i. Section 1. afinitely determined process is a stationary coding of an i. otherwise.1 A finitely determined process is ergodic. and almost block-independent. Let p.) (c) Prove Theorem IV. (See Exercise 4c. 1.1. and define Y„ to be 1 if X„ X„_1. and using Exercise 6b and Exercise 4. be the Kolmogorov measure of an ABI process for which 0 < kc i (a) < 1.} be binary i. where a" denotes the concatenation of a (b) Show that Wan) —3.i. and not i. processes are Markov.i.2. The sequence ba2nb can be used in place of s to mark the beginning of a long concatenation of n-blocks. then the distribution of k-blocks in n-blocks is almost the same as the p. and 0. Let {X.d.d.d. such that any stationary process y which satisfies the two conditions.9 is a (8.u) — H(v)I < must also satisfy (iii) c/(kt.5. there is a 8 > 0 and a positive integer k.-distribution.2 The finitely determined property. Show that {17 } is Markov. (i) 114 — vkl < 6 (ii) IH(. (Hint: if n is large enough. for some a E A. Proof Let p (n ) be the concatenated-block process defined by the restriction pn of a stationary measure p.) 6. is finitely determined (FD) if any sequence of stationary processes which converges to it weakly and in entropy also converges to it in d. 0 as n with itself n times. a stationary p is finitely determined if.i. This principle is used in the following theorem to establish ergodicity and almost block-independence for finitely determined processes.2.9 without extending the alphabet. unless X n = 0 with probability 1/2. In other words. In particular./ } defined in the proof of Theorem IV. Properties of such approximating processes that are preserved under d-limits are automatically inherited by any finitely determined limit process. Section IV. Theorem IV. Show that even if the marker s belongs to the original alphabet A. for it allows d-approximation of the process merely by approximating its entropy and a suitable finite joint distribution. (Hint: use Exercise 4.d. oc. Some special codings of i. process. n)-independent extension of IL. it can be assumed that a(a) = 0 for some a. then by altering probabilities slightly.i. must be mixing. to An. Some standard constructions produce sequences of processes converging weakly and in entropy to a given process. mixing. y) < E.) 7.

and since a-limits of ABI processes are ABI. then p. hence in if is assumed to be finitely determined.a (n ) and p. (n ) have almost the same entropy. v)I < 6 and IH(pn ) — H(v)I < n6 must also satisfy cin (p.i.2.1. Since each 11.1. then 0(k. if p. Concatenated-block processes can be modified by randomly inserting spacers between the blocks. For each n. Proof One direction is easy for if I)„ is the projection onto An of a stationary measure I). v„) = vk .2 (The FD property: finite form. it follows that a finitely determined process is ergodic.222 CHAPTER IV. let ii (n ) be a (8n . since functions of mixing processes are mixing and d-limits of mixing processes are mixing.2. and hence almost block-independent. that is. B-PROCESSES. If y is a measure on An.d. Section IV. A direct calculation shows that 0(aiic ) = E Pk(a14)Y(4) = Ep(Pk(4IX). the average of the y-probability of a i` over the first n — k in sequences of length n. Also useful in applications is a form of the finitely determined property which refers only to finite distributions. a finitely determined process must be mixing. {P ) } will converge weakly and in entropy to /I .1. see Exercise 5. Lemma IV. the v-distribution of k-blocks in n-blocks is the measure = 0(k. process is finitely determined is much deeper and will be established later. as n weakly and in entropy so that {g (n ) } converges to 1-1(u n )/n which converges to H(/)). hence almost the same k-block distribution for any k < n. Thus. a The converse result. Thus 147 ) gk .) If 8. and (11 n)H (v n ) H (v).6. that a stationary coding of an i. y) —1- since the only difference between the two measures is that 14 includes in its average the ways k-blocks can overlap two n-blocks. satisfies the finite form of the FD property stated in the theorem.v) < E.n. . y) on Ak defined by )= 1 k+it k i=ouE.t (n ) is equal n-block have to be ignored. must be finitely determined. Note also that if i3 is the concatenated-block process defined by y. is small enough then /2 00 and ii (n) have almost the same n-block distribution. . approximation being the end effect caused by the fact that the final k symbols of the oo. EE V (14 a k i V). a finitely determined process must be almost block-independent.tvE. Since ii(n ) can be chosen to be a function of a mixing Markov chain. and furthermore. (n) is ergodic and since ergodicity is preserved by passage to d-limits. This completes the proof of Theorem IV. for any k < n. Xn i 1 starting positions the expected value of the empirical k-block distribution with respect to v.) A stationary process p. The entropy of i. Theorem IV. Also. is finitely determined if and only if given c > 0. then (1) 41 _< 2(k — 1)/n. and k < n. to A. 10(k. n)-independent extension of A n (such processes were defined as part of the independent-extension lemma. there is a > 0 and positive integers k and N such that if n > N then any measure v on An for which Ip k — 0(k. Thus if 3n goes to 0 suitably fast.

i. y) = Prob(X = x.2. The proof that almost block-independence implies finitely determined will be given in three stages. the definitions of k and 6 imply that a (/urY-) < c. The dn -fi of the distributions of X'il and yr by fitting the condiidea is to extend a good tting tional distribution of X n+ 1. Y) of finite-valued random variables. The conditional distribution of Xn+ i. These results immediately show that ABI implies FD.y(x..9 of the preceding section. CI IV. given X. A finitely determined process is mixing. y) and p(y) = py (y) = Ex p(x. v) — ii. so Exercise 2 implies that 4. v) I +10(k. By making N Choose N so large that if n > N then 1(1/01-/(an ) — enough larger relative to k to take care of end effects. < 6/2.2. Finally.i. provided closeness in distribution and entropy implies that the conditional distribution of 17. given Y. Furthermore.). Fix n > N and let y be a measure on An for which 10(k. an approximate independence concept will be used. An approximate form of this is stated as the following lemma.4)1 <n6/2 and let Tf be the concatenated-block process defined by v. Thus. in turn. and let p(x) = px (x) = E.d. y) denote the marginal distributions. process is finitely determined.1. p(x. y) = px. then 1. Y = y) denote the joint distribution of the pair (X. 1H (i. which is. y) — p(x)p(y)I < e. This proof will then be extended to establish that mixing Markov processes are finitely determined. for an ABI process is the d-limit of mixing Markov processes by Theorem IV. if independence is assumed.d. and hence the method works. To make this sketch into a real proof. . by Theorem IV. IV.) < E. (A n .a. Let p(x. hence is totally ergodic. it will be shown that the class of finitely determined processes is closed in the d-metric. such that any stationary process y that satisfies Ivk — 12k 1 < 6 and 1H (v) — H COI < 6 must also satisfy d(v.a ABI implies FD. THE FINITELY DETERMINED PROPERTY. which is within 8/2 of (1/n)H(An).0(k.i. X. processes are finitely determined. This completes the proof of Theorem IV. The definition of N guarantees that 117k — till 5_ 17k — 0(k. that is. it can also be supposed that if n > N and if 17 is the concatenated-block process defined by a measure y on An. It is easy to get started for closeness in first-order distribution means that first-order distributions can be dl-well-fitted. as well as possible to the conditional distribution of yr. 223 To prove the converse assume ti. y) —1 <8/2. y If H(X) = H(X1Y).k1 < 3/2 and 1H (v) — H(1. is finitely determined and let c be a given positive number. y) — oak' < 6. given Xi'. given Yin.d.)) — H COI < 6.2. using (1). The i.17) < E. is almost independent of Yçi. The random variables are said to be c-independent if E Ip(x. y) < ii(p.. within 6/2 of H(p.2.1 I. since H(t)) = (1/ n)H (v).1. The definition of finitely determined yields a 6 > 0 and a k. proof will be carried out by induction. First it will be shown that an i. then X and Y are independent.2.2. p. does not depend on ril.SECTION IV.

there is a 8 > 0 such that if 1.d. since H(YilY°m) is decreasing in m. which is just the equality part of property (a) in the d-properties lemma.224 CHAPTER IV. and {Y i } be stationary with respective Kolmogorov measures /2 and v. then H(Yi) — 1-1 (Y11 172 1 ) will be small. Proof The entropy difference may be expressed as the divergence of the joint distribution pxy from the product distribution px p y . so that if 11(v) = limm H(Y i lY i ) is close enough to H(A). y) denotes the joint distribution of a pair (X.y1PxP Y )• P(x. H(X) — H(X1Y) = _ p(x) log p(x) log E p(x. y IP(X x P(X)P(Y)i) where c is a positive constant. with first marginal p(x) = yy p(x. property. however. Exercise 6 of Section 1. Y) p(x)p(y) Pinsker's inequality. proof as well as in later results is stated here.)' g(x)p(x) E p(x) h(y)p(ydx).i. The i. y) = X . The lemma now follows from the preceding lemma.i.2.) There is a positive constant c such that if 11(X) — H(X1Y) < c6 2 . An i. y) x.9. Lemma IV. y) x. Lemma IV. Proof If first-order distributions are close enough then first-order entropies will be close.4 Let {X i } be i. Use will also be made of the fact that first-order d-distance is just one-half of variational distance.i.i. y) = g(x) h(y). 2 D(pxylpxp y) C (E .6. process.d. then X and Y are 6-independent. p(x. Yi-1) and yi are c-independent for each j > 1. that is. implies that H(A) = H(X i ). that is. Lemma 1. without proof. y)p(x. In its statement. .5 If f (x. Given e > 0. process is. then 11 (X 1 ) will be close to 11 (Y1 ).i.2. yields.2.d.3 (The E-independence lemma.d. 0 A stationary process {Yi } is said to be an 6-independent process if (Y1.y P(Y) E p(x. e-independent for every c > O. y) log p(x. Lemma IV.11. for all m > 0. of course. di (A. y)I p(x). y) and conditional distribution p(y1x) = p(x. then E f (x. Y) of finite-valued random variables. independent of X and Y.d. The next lemma asserts that a process is e-independent if it is close enough in entropy and first-order distribution to an i.u1 — vil < 8 and 11I (A) — H(v)1 < then {Yi} is an 6-independent process. This establishes the lemma.y D (p x. as the following lemma. 0 A conditioning formula useful in the i. Thus if Ai is close enough to v1. B-PROCESSES. y) = vii/ 2 .

± 1 defined 1 . by stationarity. since (n 1)dn+1(4 +1 .u i — vil < 3/2 < c/2. Lemma IV. YD.' are c-independent. The third term is just (1/2) the variational distance between the distributions of the unconditioned random variable Yn+1 and the conditioned random variable Y +1 /y. y) (n + A. conditioned on X = x is denoted by X n+ i /4. Thus the second sum in (2) is at most 6/2 ± 6/2. The strategy is to extend by fitting the conditional distributions Xn+ 1 /4 and Yn+i /yri' as well as possible. 4+1) = \ x. close on the average to that of Yn+i . and di (Xn +i Yn-i-i/Y) denotes the a-distance between the corresponding conditional distributions. by independence. The measure X. and the distribution of 1" 1 + 1 /Y1 is.x7.4 provides a positive number 3 so that if j — vil < 8 and I H(/). Lemma IV. Yin). . the conditional formula. let ?. Y < 1 /y) y)di(Xn+04. which is in turn close to the distribution of X n+1 .2.6.2. < E..)T (Xn+1 Yni-1)• n+1 The first term is upper bounded by nc. whose expected value (with respect to yri') is at most /2.. y. n+i andy n+ i. which along with the bound nExn (dn ) < n6 and the fact that A n +1 is a joining of . such a Getting started is easy. Assume it has been shown that an (r i'. This strategy is successful because the distribution of Xn+ 1 /x'il is equal to the distribution of Xn+ i.+1/4).. y) +E < (n + 1)c.6 An i.f» Y.(dn) XI E VI Xri (X n i Yn i )d1(Xn+1. 225 Proof The following notation and terminology will be used. process {Xi} and c > O. since it was assumed that A n realizes while the second sum is equal to cin (X.n realizes d i (X n +i The triangle inequality yields (Xn+1/X r i l Xn+1) dl(Xn+19 Yn+1) + (Yn+1 Y1 /y) The first term is 0. (3) di (X E 1 /x. produces the inequality (n 1)dn+1(. y +1 ) = ndn (x`i' . For each x.i. is certainly a joining of /2. Yin) < c. Xn+1 (4+1. while.SECTION IV. Yri+1).. 1/I' ( An+1. . 1 (dn+i) < ndn(A. THE FINITELY DE IERMINED PROPERTY.i. Y1 ). thereby establishing the induction step.d. Fix It will be shown by induction that cin (X. The random variable X n+1 .2. Theorem IV. then the process {Yi } defined by y is an c-independent process. yn+.5.d. Furthermore. by independence.2. by c-independence. Without loss of generality it can be supposed that 3 < E. Fix an i. yields (n (2) 1 )Ex n+ I(dn+1) = nEA. which is at most c/2. y`i') + (xn+i./4). the second term is equal to di (X I . Y1 ) = (1/2)I. since it was assumed thatX x i . This completes the proof of Theorem IV. 44 and yn+1 .2.) — 1/(y)1 < 3.u. Yn+1)À. process is finitely determined.realize di (Xn+i by {r}. since Yn+1 and Y. and let An realize an (Xtil. since d i (X I .

are conditionally E-independent. ±1 is independent of X7 for the i. The Markov property n +Pin depends on the immediate past X. there is a y = y(E) > 0 such that if H(XIZ)— H(XIY. The Markov result is based on a generalization of the i. dies off in the d-sense. given Yo. . B-PROCESSES.226 IV. given Z. The conditional form of the E-independence lemma.i. Fix m. By decreasing 3 if necessary it can be assumed that if I H(g) — H(v)I <8 then Inanu) — na-/(01 < y/2.y E. as m grows. given Z. Lemma IV. Lemma IV. for suitable choice of m. there is a 3 > 0 such that if lid. which is possible since conditional entropy is continuous in the variational distance. A conditional form of c-independence will be needed. process.2 CHAPTER IV. The next lemma provides the desired approximate form of the Markov property. for every n > 1. The choice of 3 and the fact that H(YZIY ° n ) decreases to m H (v) as n oc. Given E > 0 and a positive integer m. extends to the following result. The key is to show that approximate versions of both properties hold for any process close enough to the Markov process in entropy and in joint distribution for a long enough time.8 Let {X. then choose 3 > 0 so small that if 1 12 m+1 — v m± i I < 8 then IH(Yriy0)-HaTixol < y12. if E E ip(x.a. where p(x.3.i. provided the Y-process is close enough to the X-process in distribution and entropy. then. guarantees that for any n > 1. Proof Choose y = y(E) from preceding lemma.1 +1 is almost independent of 1?.) Given E > 0. Lemma IV. a future block Xn but on no previous values. Z) < y. the fitting was done one step at a time. the Markov property and the Markov coupling lemma. and 17. n > 1. proof to the mixing Markov case.} be an ergodic Markov process with Kolmogorov measure and let {K} be a stationary process with Kolmogorov measure v. Fix a stationary process y for which ym+ il < 3 and 11/(u)— H (v)I <8. The Markov coupling lemma guarantees that for mixing Markov chains even this dependence on X. In that proof.2. The random variables X and Y are conditionally E-independent..1. ylz) denotes the conditional joint distribution and p(x lz) and p(y lz) denote the respective marginals. then X and Y are conditionally E-independent. for then a good match after n steps can be carried forward by fitting future m-blocks.i. To extend the i. whose proof is left to the reader.d.d. then guarantee that HOTIN— 1 -1(YPIY2 n ) < y. 17In and Yi-. Mixing Markov processes are finitely determined.m+1 — v m+ii < 8 and IH(a) — H(v)i < 8. yiz) — p(xlz)p(ydz)ip(z) < z x..2.7 (The conditional 6-independence lemma. A good match up to stage n was continued to stage n + 1 by using the fact that X.2.8.2. proof. Lemma IV.d. two properties will be used.

2. As in the proof for the i. Lemma IV. 17.SECTION IV. Since this inequality depends only on .. Fix a mixing Markov process {X n } with Kolmogorov measure p. Ynn+ = x'17 . yr ) < e/4.9. for all n > 1. it can also be assumed from Lemma IV. case. conditioned on X. + drn ( 17:+7 .4T /Yri) dm (Y: YrinZ) Y:2V/yri?). for any stationary process {Yi } with Kolmogorov measure y for which — vm+1I < and I H(u) — H(p)i <8.8 that Yini and Y i n are conditionally (E/2)-independent. Only the first-order proof will be given.'. This completes the proof of Theorem IV. conditioned on +71 /4 will denote the random vector Xn In the proof. I/2. + cim(r.8.2.'1 /y. (6). in turn. The fourth term is upper bounded by half of the variational distance between the distribution of Y. it can also be assumed Furthermore. The distribution of Xn n+ 47/4 is the same as the distribution of Xn n + r. . y)ani (X . YTP:r/y) E 9 where Xn realizes an (A. The d-fitting of and y is carried out m steps at a time. by the choice of 3 1 . This proves Lemma IV./x 1 . The first three terms contribute less than 3E/4. To complete the proof that mixing Markov implies finitely determined it is enough to show that d(p. since dm (u.i. y) is upper bounded by Ittm — that 6 is small enough to guarantee that (6) (7) dm (xT. Fix a stationary y for which (5). given Yo . it is enough to show that (8) dni(x n+1 . y). and (7) all hold.1. xo E A. Theorem IV.n+i there is a 3 > 0 so that if y is a stationary — vm+1 1 < 6 then measure for which (5) am(Yln. the extension to arbitrary finite order is left to the reader. and hence the triangle inequality yields.:vin/4.n /y. using (6) to get started. by (4).u. XT/x0) < 6/4. By making 3 smaller. by (7). Thus the expected value of the fourth term is at most (1/2)6/2 = 6/4. y) < E. and ii(Xn of X'nl Vin / and 11.2.7:14:r/y. provides an m such that (4) dm (XT. = x. Yln /Y0) < yo E A. The Markov coupling lemma.r+Fr.9 A mixing finite-order Markov process is finitely determined.d.2. Proof n + 7i . and the distribution of Ynn+ +. But the expected value of this variational distance is less than 6/2. X4 +im ) will denote the d-distance between the distributions n + T/xii. E X n Vn 4(4. implies that 171 and Yfni are conditionally E-independent. (6) and (5). and hence the sum in (8) is less than E. and fix E > O. given Yo.. THE FINITELY DETERMINED PROPERTY 227 which. if necessary.2. given Yn .8. since Ynn+ +Ini and Yin—I are conditionally (E/2)-independent.

a such that (a) 3 is close to A in distribution and entropy.. The finitely determined processes are d-closed.. outputs b7 with probability X. (None of the deeper ideas from channel theory will be used..2.a.2. yi)). A simple language for making the preceding sketch into a rigorous proof is the language of channels.. for any stationary process Trt for which 1.) Fix n.228 CHAPTER IV.11 If d(p. then implies that a(u. il b7) A. B-PROCESSES. by property (a).u in distribution and entropy there will be a stationary A with first marginal . The basic idea for the proof is as follows. (. given the input ar il.v) <E and v is finitely determined. v).u. If the second marginal.) The d-limit of finitely determined processes is finitely determined. v).ak — < 3. îl) will be small. O ) } of conditional measures can thought of as the (noisy) channel (or black box). (b7la7). —> which. The triangle inequality d(v + d a].3 The final step in the proof that almost block-independence implies finitely determined is stated as the following theorem. Theorem IV. which guarantees that any process a-close enough to a finitely determined process must have an approximate form of the finitely determined property. then there is a (5 > 0 and a k such that d(u. which is however. and for each ar il. < (5 and IH(A) — < S. is close enough to .:(d(xi. Such a finite-sequence channel extends to an infinite-sequence channel (10) X y . (9) it (a ril ) The family {A.(b714) = E A. close to E). IV. y l )) = d(. Lemma IV.(d(x l . The theorem is an immediate consequence of the following lemma.2. The fact that 3: is a stationary joining of and guarantees that E3. It will be shown that if 72.10 (The a-closure theorem..14) be the conditional measure on An defined by the formula (af. Let A be a stationary joining of it and y such that EAd(x i . (b) The second marginal of 3 is close to u in distribution and entropy. borrowed from information theory. is also small. vi)) = d(p. let 4(. '17 is close enough to y in distribution and entropy then the finitely determined property of y guarantees that a(v. only its suggestive language. a result also useful in other contexts.

that ±n ). a) of a with an output measure /3* = /6 * (4. a) need be stationary. 17) is a random vector with distribution An then H(X7.+ +ki). a) on A'. n+n ) j=0 The projection of A* onto its first factor is clearly the input measure a. = H(X7) H(}7 W ) . /2).2. and hence stationary measures A = A(An . with probability X n (b7lx i n ±i and {y i : i < j n}. then hr(X inn. measure A(4` . let n > k. as n oc.f3(4. if (XI". Given an input measure a on A" the infinite-sequence channel (10) defined by the conditional probabilities (9) determines a joining A* = A*(X n . bki ) converges to X(a /ic.) If is a stationary joining of the two stationary processes pt and v. _ i < n. and put A* = A*(Xn. the averaged output p. and # = ti(Xn. is the measure on Aœ x A" defined for m > 1 by the formula - m_i A * (a . A direct calculation shows that A is a joining of a and /3. in n and in a. THE FINITELY DE I ERMINED PROPERTY 229 which outputs an infinite sequence y. that is. = b„ for 1 < s < k. b i ) converges to a oc. then A(X n . with respect to weak convergence and convergence in entropy. "" lai. yrn) = H(X) + myrnIxrn) . a). a) are obtained by randomizing the start. blf) as n oc. ypi+ in+n i has the value b7. Likewise. and hence ' 1 lim — H(1 11 1X7) = H n — On the other hand. called the stationary joint input output measure and stationary output measure defined by An and the input measure a. though both are certainly n-stationary. first note that if (X7. The output measure /3* is the projection of A* onto its second factor. But. by breaking x into blocks of length n and applying the n-block channel to each block separately. +. given an infinite input sequence x. for i <n — k A * (a(i) i + k i.12 (Continuity in n. respectively. b ) . independent of {xi: i < jn} is.u) converges weakly and in entropy to v. = as and b(i). The i. .nn . ±„. it) and A = A(Xn . Y) is a random vector with distribution A*(X n . b(i) . Proof Fix (4. ii). The joining A*. a) and /3(X n .)(14) = E A A(4. converges weakly and in entropy to X and . a). Km )* mn If a is stationary then neither A*(X n . - Lemma IV. The next two lemmas contain the facts that will be needed about the continuity of A(X n . A(a. called the joint input output measure defined by the channel and the input measure a. The measures A and fi are. bn i ln ) = a(a)4(b i. b(i)+ so that. x(a k i b/ ). 1 b k ) = v(b) as n To establish convergence in entropy. b lic) is an average of the n measures A * (a(i) + x where a(i). a) nor /3*(X n . the measure on A' defined for m > 1 by the formula fi * ( 1/1" ) E A * (aT" .SECTION IV2.

Y) is a random vector with distribution A n. Randomizing the start then produces the desired result. dividing by m and letting in go to oc yields H(A) = H(a)+ H (Y I X 1) . yl)) + c.u). so that H(A*(X n . + E H(rr_1)n+1 I riz_on+i) Do since the channel treats input n-block independently. 1 IX7) . Lemma IV.) < E. provides an n such that the stationary input-output measure A = A (An . p. so that.)) H (v). and H(/3) = H(A) — H(Xi Yi) is continuous in H(A) and the distribution of (X 1 . H(A(4. if (XT.11. yl)) < E. so (ii) holds. a(m) )1 converges weakly and in entropy to t3(X n . Thus (i) holds. see Exercise 5. If a sequence of stationary processes la(m)} converges weakly and in entropy to a stationary process a as m oc. Y1 ). i=1 H(X71n ) MH(YnX ril ). assume 1) is finitely determined. p)) = H(. B-PROCESSES.u)+ — H (Y.13 is established. randomizing the start also yields H(/3(A.2. a). Likewise. a). Ex(d(xi. yi)) < Ex(d(xi. as n oc.230 H (X ) CHAPTER IV.12. Thus. Yr) = H(XT)+mH(Y i 'X i ). Proof For simplicity assume n = 1. a) is weakly continuous in a. so that both A* and fi* are stationary for any stationary input.12. and a stationary measure A on A" x A. as m --÷ oc.13 (Continuity in input distribution. if = /3*(X n .) Fix n > 1. then H(XT. br ) = a(aT) i-1 X(b i lai ). (i) {A(Xn . must also satisfy cl(v. The continuity-in-n lemma. . The finitely determined property of y provides y and E such that any stationary process V' for which I vt —Ve I < y and I H(v) — H(71)1 < y. in particular. °eon)} converges weakly and in entropy to A(4. The extension to n > 1 is straightforward. since the channel treats input symbols independently. This completes the proof of Lemma IV. a).2. Assume cl(p. and hence Lemma IV. g) satisfies (11) EA(d(xl. (ii) 0(4.2. and let A be a stationary measure realizing ci(p. By definition.2. Lemma IV. and IH(fi) — < y/2. . Section 1. Dividing by mn and letting in gives 1 H(A*(4. An„(a r . which depends continuously on 11(a) and the distribution of (X1. p)) H(A). which is weakly continuous in a. since Isin = vn . p)) H (X). y). and the stationary output measure fi = fi(X n . Y1). Proof of Lemma IV. Furthermore. fi = /3(X i . as in --> Do. Furthermore.2.6. y) < E. since the entropy of an n-stationary process is the same as the entropy of the average of its first n-shifts. A) satisfies Ifie — vti < y/2. Fix a stationary input measure a and put A = A(Xi. then.

71) is a joining of the input Ex(d(xi. while the proof that mixing Markov chains are finitely determined is based on the Friedman-Ornstein proof. (Hint: y is close in distribution and entropy to p. and IH(i') — 1-1 (/3)1 < y/2.z -th order Markovizations. 71) _-_. )' 1 )) < En(d(xi. processes are finitely determined is based on Ornstein's original proof. The observation that the ideas of that proof could be expressed in the simple channel language used here is from [83]. since X = A(Xn . The fact that almost block-independence is equivalent to finitely determined first appeared in [67]. Show that a finitely determined process is the d-limit of its .2.SECTION IV. IV. so the definitions of f and y imply that dal.2.13. [46].d. This completes the proof of Lemma IV. The fact that d-limits of finitely determined processes are finitely determined is due to Ornstein.) . yi)) < < yi)). provides 3 and k such that if ii. thereby completing the proof that dlimits of finitely determined processes are finitely determined. [46]. yi) + 6 Ex(d (x i. y) which was assumed to be less than E . and hence the proof that almost block-independent processes are finitely determined. [13]. i)) + 2E < 3E..(4 +1 )/p. yi)) +e and the output distribution il = 0(X. 1 The n-th order Markovization of a stationary process p. then the joint input-output distribution X = A(X„. Ex(d(xi. d(ii. )' by the inequalities (11) and (12).. y) <E. is any stationary process for which Fix — lid < 8 and IH(ii) — H(A)1 < ( 5 .2. ii„) satisfies 1 17 )t — fit' < y/2. il with the output 7/ and ) En(d(xi.11. and — II (v)I III (1 7) — II (0)I + i H(8) — I-I (v)I < 1 / . 231 The continuity-in-input lemma.14 The proof that i. Given such an input 'A it follows that the output -17 satisfies rile — vel Pe — Oel + Ifie — vel < Y. is the n-th order Markov process with transition probabilities y(x-n+i WO = p.i.(4). El Remark IV. Lemma IV.2. Ft) satisfies (12) Ex(d (x i . the proof given here that ABI processes are finitely determined is new. THE FINITELY DETERMINED PROPERTY. and the fact that X realizes d(p.b Exercises.2. Furthermore.

process. see Exercise la in Section 1..3 Other B-process characterizations.4. denotes expectation with respect to the random past X° k Informally stated.a The very weak Bernoulli and weak Bernoulli properties.(X7 I X7)) < c.232 CHAPTER IV. and finitely determined are. Show that if . called the very weak Bernoulli property. indeed. where 7 denotes the concatenated-block process defined by an (A n .17) = d(x .} is very weak Bernoulli (VWB) if given c > 0 there is an n such that for any k > 0.d. This form. Theorem IV. Prove the conditional c-independence lemma. (d. (Hint: a(p. y) < a(i2. it follows that almost block-independence. and v. Hence any of the limiting nonoverlapping n-block distributions determined by (T i x. A more careful look at the proof that mixing Markov chains are finitely determined shows that all that is really needed for a process to be finitely determined is a weaker form of the Markov coupling property. 4. 5.4. equivalent to being a stationary coding of an i.7. see Section 111. A stationary process {X. . for some i < n. hence serves as another characterization of B-processes. and for some x all of whose shifts have limiting nonoverlapping n-block distribution equal to /4.i. As in earlier discussions.3. for some y for which the limiting nonoverlapping kblock distribution of T'y is y. Since it has already been shown that finitely determined implies almost blowing-up. then 0). can be shown to be very weak Bernoulli by exploiting their natural expanding and contracting foliation structures. B-PROCESSES. where Ex o .2. very weak Bernoulli.1 (The very weak Bernoulli characterization. Show that the class of B-processes is a-separable. v. Equivalence will be established by showing that very weak Bernoulli implies finitely determined and then that almost blowing-up implies very weak Bernoulli. a process is VWB if the past has little effect in the a-sense on the future. . 2. IV.) 3. Lemma IV. Show that a process built by repeated independent cutting and stacking is a Bprocess if the initial structure has two columns with heights differing by 1. The significance of very weak Bernoulli is that it is equivalent to finitely determined and that many physically interesting processes.T i )' ) must be a joining of ia.3.) A process is very weak Bernoulli if and only if it is finitely determined. almost blowing-up.u is totally ergodic and y is any probability measure on A'. Section IV. such as geodesic flow on a manifold of constant negative curvature and various processes associated with billiards. and X`i'/x ° k will denote the random vector X7. is actually equivalent to the finitely determined property.. The reader is referred to [54] for a discussion of such applications. conditioned on the past values x ° k . either measure or random variable notation will be used for the d-distance. y). (1) Ex o .

The proof that very weak Bernoulli implies finitely determined will be given in the next subsection, while the proof of the converse will be given later, after some discussion of the almost blowing-up property.

A somewhat stronger property, called weak Bernoulli, is obtained by using variational distance with a gap in place of the d̄-distance. A stationary process {X_i} is weak Bernoulli (WB), or absolutely regular, if past and future become ε-independent if separated by a gap g, that is, given ε > 0 there is a gap g such that for any k > 0 and m > 0, the random vectors X_{g+1}^{g+m} and X_{-k}^0 are ε-independent. The class of weak Bernoulli processes includes the mixing Markov processes and the large class of mixing regenerative processes, see Exercise 2 and Exercise 3. Furthermore, weak Bernoulli processes have nice empirical distribution and waiting-time properties.

Their importance here is that weak Bernoulli processes are very weak Bernoulli. To see why, first note that if X_{g+1}^{g+m} and X_{-k}^0 are ε-independent, then

E_{x_{-k}^0}(d̄_m(X_{g+1}^{g+m}/X_{-k}^0, X_{g+1}^{g+m})) < ε/2,

since d̄-distance is upper bounded by one-half the variational distance. If this is true for all m and k, then one can take n = g + m with m so large that g/(g + m) < ε/2, and use the fact that

d̄_n(U_1^n, V_1^n) ≤ d̄_{n−g}(U_{g+1}^n, V_{g+1}^n) + g/n,

for any n > g and any random vectors (U_1^n, V_1^n), to obtain E_{x_{-k}^0}(d̄_n(X_1^n/X_{-k}^0, X_1^n)) < ε. Thus weak Bernoulli indeed implies very weak Bernoulli.

In a sense made precise in Exercises 4 and 5, weak Bernoulli requires that, with high probability, the conditional measures on different infinite pasts can be joined in the future so that, with high probability, names agree after some point, while very weak Bernoulli only requires that the density of disagreements be small. This property was the key to the example constructed in [65] of a very weak Bernoulli process that is not weak Bernoulli, a result established by another method in [78].

IV.3.b Very weak Bernoulli implies finitely determined.

The proof models the earlier argument that mixing Markov implies finitely determined. As in the earlier proofs, the key is to show that the expected value of d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) is small for some fixed m and every n, so that a good fitting can be carried forward by fitting future m-blocks, conditioned on intermediate values, given only that the processes are close enough in entropy and k-th order distribution, for some fixed k > m. Entropy is used to guarantee approximate independence from the distant past, while very weak Bernoulli is used to guarantee that even such conditional dependence dies off in the d̄-sense as block length grows. Approximate versions of both properties hold for any process close enough to {X_i} in entropy and in joint distribution for a long enough time.
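The gap condition in the weak Bernoulli definition is easy to watch numerically for a finite-state Markov chain, where the dependence of the future on the past is carried entirely by the current state: the variational distance between the g-step transition distributions and the stationary distribution decays as the gap g grows. The sketch below is a rough proxy for the ε-independence statement, not a verification of it; the function name and the example chain are mine.

```python
import numpy as np

def gap_dependence(P, gaps):
    """For a finite-state chain with transition matrix P, report
    max over starting states a of (1/2) * sum_b |P^g(a, b) - pi(b)|,
    a crude measure of how much a gap of length g decouples future from past."""
    pi = np.linalg.matrix_power(P, 200)[0]       # stationary distribution, numerically
    out = {}
    for g in gaps:
        Pg = np.linalg.matrix_power(P, g)
        out[g] = max(0.5 * np.abs(Pg[a] - pi).sum() for a in range(len(P)))
    return out

P = np.array([[0.9, 0.1], [0.2, 0.8]])
print(gap_dependence(P, [1, 5, 10, 20]))         # decays geometrically for a mixing chain
```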

The details of the preceding sketch are carried out as follows. Fix a very weak Bernoulli process {X_n} with Kolmogorov measure μ, and fix ε > 0. The very weak Bernoulli property provides an m so that

(2)  E_{x_{-t}^0}(d̄_m(X_1^m/X_{-t}^0, X_1^m)) < ε/8,  t ≥ 1.

The goal is to show that a process sufficiently close to {X_i} in distribution and entropy satisfies almost the same bound.

Towards this end, let γ be a positive number to be specified later. For m fixed, H(X_1^m|X_{-t}^0) converges to mH(μ), as t → ∞, and hence there is a K such that H(X_1^m|X_{-K}^0) < mH(μ) + γ. Fix k = m + K + 1.

The quantity E_{x_{-K}^0}(d̄_m(X_1^m/X_{-K}^0, X_1^m)) is small, by (2), and depends continuously on μ_k, so it follows from the triangle inequality that there is a δ > 0 such that if {Y_i} is a stationary process with Kolmogorov measure ν for which |μ_k − ν_k| < δ, then

(3)  E_{y_{-K}^0}(d̄_m(Y_1^m/Y_{-K}^0, Y_1^m)) < ε/4.

Furthermore, H(X_1^m|X_{-K}^0) also depends continuously on μ_k, so that if it is also assumed that |H(ν) − H(μ)| < δ, and δ is small enough, then H(Y_1^m|Y_{-K}^0) < mH(ν) + 2γ holds. But, since H(Y_1^m|Y_{-t}^0) decreases to mH(ν) as t → ∞, this means that

H(Y_1^m|Y_{-K}^0) − H(Y_1^m|Y_{-K-j}^0) < 2γ,  j ≥ 1.

If γ is small enough, the conditional ε-independence lemma, Lemma IV.2.7, implies that Y_1^m and Y_{-K-j}^{-K-1} are conditionally (ε/2)-independent, given Y_{-K}^0. Since d̄-distance is upper bounded by one-half the variational distance, this means that

E(d̄_m(Y_1^m/Y_{-K-j}^0, Y_1^m/Y_{-K}^0)) < ε/4,  j ≥ 0.

This result, along with the triangle inequality and the earlier bound (3), yields the result that will be needed, namely,

(4)  E_{y_{-t}^0}(d̄_m(Y_1^m/Y_{-t}^0, Y_1^m)) < ε/2,  t ≥ 0.

In summary, if δ is small enough, then (4) holds for any stationary process {Y_i} with Kolmogorov measure ν for which |μ_k − ν_k| < δ and |H(μ) − H(ν)| < δ.

Thus, once such an m is determined, all that remains to be shown is that the expected value of d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) is small, uniformly in n, provided only that {Y_i} is close enough to {X_i} in k-th order distribution, for some large enough k ≥ m, as well as close enough in entropy. The triangle inequality, together with stationarity, bounds this expected value by a sum of terms each of which is now under control: by stationarity, the first term is an expectation of the form bounded in (2); the expected value of the second term is small, for large enough m, by (4); and the remaining term is at most one-half the variational distance |μ_m − ν_m|, which is small since m < k and since it can also be supposed that δ < ε/2. This gives the required bound, uniformly in n, which completes the proof that very weak Bernoulli implies finitely determined. □
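The entropy fact used to choose K, namely that H(X_1^m|X_{-t}^0) decreases to mH(μ) as t grows, can be illustrated with plug-in estimates computed from block frequencies along a sample path, using the identity H(X_{t+1}^{t+m}|X_1^t) = H(X_1^{t+m}) − H(X_1^t) and stationarity. The sketch below is only a numerical illustration (the estimates are biased when t + m is large relative to the sample length); the function names and the hidden-Markov example are my own.

```python
from collections import Counter
from math import log2
import random

def block_entropy(path, k):
    """Plug-in estimate of H(X_1^k) from overlapping k-block frequencies."""
    counts = Counter(tuple(path[i:i + k]) for i in range(len(path) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def conditional_block_entropy(path, m, t):
    """Estimate of H(X_1^m | X_{-t+1}^0) = H(X_1^{t+m}) - H(X_1^t)."""
    return block_entropy(path, t + m) - block_entropy(path, t)

# A 3-state chain observed through a 2-letter factor map, so that conditioning
# on more of the past keeps lowering the conditional block entropy.
random.seed(0)
T = {0: [0.8, 0.1, 0.1], 1: [0.3, 0.6, 0.1], 2: [0.1, 0.3, 0.6]}
s, path = 0, []
for _ in range(300000):
    s = random.choices([0, 1, 2], weights=T[s])[0]
    path.append(0 if s == 0 else 1)
for t in (0, 1, 2, 4, 8):
    print(t, conditional_block_entropy(path, 3, t))
```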

IV.3.c The almost blowing-up characterization.

The ε-blowup [C]_ε of a set C ⊆ A^n is its ε-neighborhood relative to the d_n-metric, that is,

[C]_ε = {b_1^n : d_n(a_1^n, b_1^n) ≤ ε, for some a_1^n ∈ C}.

A set B ⊆ A^n has the (δ, ε)-blowing-up property if μ([C]_ε) > 1 − ε, for any subset C ⊆ B for which μ(C) ≥ 2^{-δn}. A stationary process has the almost blowing-up property (ABUP) if for each n there is a B_n ⊆ A^n such that x_1^n ∈ B_n, eventually almost surely, and, for any ε > 0, there is a δ > 0 and an N such that B_n has the (δ, ε)-blowing-up property for n ≥ N. The almost blowing-up property was introduced in Section III.4, and it has already been shown that a finitely determined process has the almost blowing-up property.

In this section it will be shown that almost blowing-up implies very weak Bernoulli. Since very weak Bernoulli implies finitely determined, this completes the proof that almost block-independence, finitely determined, very weak Bernoulli, and almost blowing-up are all equivalent ways of saying that a process is a stationary coding of an i.i.d. process.

The proof is based on the following lemma, which, in a more general setting and stronger form, is due to Strassen, [81].

Lemma IV.3.2
Let μ and ν be probability measures on A^n. If ν([C]_ε) ≥ 1 − ε whenever μ(C) ≥ ε, then d̄_n(μ, ν) ≤ 2ε.

Proof. The ideas of the proof will be described first, after which the details will be given. The goal is to construct a joining λ of μ and ν for which d_n(x_1^n, y_1^n) ≤ ε, except on a set of pairs (x_1^n, y_1^n) of λ-measure less than ε.

A joining can be thought of as a partitioning of the ν-mass of each y_1^n into parts and an assignment of these parts to various x_1^n, subject to the joining requirement, namely, that the total mass received by each x_1^n is equal to μ(x_1^n). A simple strategy for beginning the construction of a good joining is to cut off an α-fraction of the ν-mass of y_1^n and assign it to some x_1^n for which d_n(x_1^n, y_1^n) ≤ ε. A trivial argument shows that this is indeed possible for some positive α for those y_1^n that are within ε of some x_1^n of positive μ-mass. The set of such y_1^n is just the ε-blowup of the support of μ, which, by hypothesis, has ν-measure at least 1 − ε.

The key to continuing is the observation that if the set of x_1^n whose μ-mass is not completely filled in the first stage has μ-measure at least ε, then its blowup has large ν-measure and the simple strategy can be used again: for most y_1^n a small fraction of the remaining ν-mass can be cut off and assigned to the unfilled mass of a nearby x_1^n. This simple strategy can be repeated as long as the set of x_1^n whose μ-mass is not yet completely filled has μ-measure at least ε. If the fraction is chosen to be largest possible at each stage, then only a finite number of stages are needed to reach the point when the set of x_1^n whose μ-mass is not yet completely filled has μ-measure less than ε.

Some notation and terminology will assist in making the preceding sketch into a proof. A nonnegative function μ̃ on A^n is called a mass function if it has nonempty support and Σ μ̃(x_1^n) ≤ 1. A function φ: [S]_ε → S is called an ε-matching over S if d_n(y_1^n, φ(y_1^n)) ≤ ε, for all y_1^n ∈ [S]_ε.
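For small n the ε-blowup just defined can be computed by brute force, which makes the blowing-up inequality easy to experiment with on toy measures. The sketch below is such a brute-force computation; the function name and the example are mine, and the enumeration of A^n is exponential, so it is usable only for very small n.

```python
import itertools

def blowup(C, n, eps, alphabet=(0, 1)):
    """epsilon-blowup of C inside A^n under the per-symbol Hamming metric d_n:
    all n-sequences within Hamming distance eps*n of some member of C."""
    C = set(C)
    radius = int(eps * n)
    out = set()
    for y in itertools.product(alphabet, repeat=n):
        if any(sum(a != b for a, b in zip(x, y)) <= radius for x in C):
            out.add(y)
    return out

# Example: the 25%-blowup of the two constant sequences of length 8,
# i.e. all sequences within Hamming distance 2 of a constant sequence.
C = [(0,) * 8, (1,) * 8]
print(len(blowup(C, 8, 0.25)))
```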

If the domain S of such a φ is contained in the support of a mass function μ̃, then the number α = α(ν, μ̃, φ) defined by

(5)  α = min{1, inf_{μ̃(x_1^n) > 0} μ̃(x_1^n) / ν(φ^{-1}(x_1^n))}

is positive. It is called the maximal (ν, μ̃, φ)-stuffing fraction, for it is the largest number α ≤ 1 for which αν(φ^{-1}(x_1^n)) ≤ μ̃(x_1^n), for every x_1^n ∈ S, that is, the largest α for which an α-fraction of the ν-mass of each y_1^n ∈ [S]_ε can be assigned to the μ̃-mass of x_1^n = φ(y_1^n).

The desired joining λ is constructed by induction. To get started, let μ^(1) = μ, let S_1 be the support of μ^(1), let φ_1 be any ε-matching over S_1 (such a φ_1 exists by the definition of ε-blowup, as long as S_1 ≠ ∅), and let α_1 be the maximal (ν, μ^(1), φ_1)-stuffing fraction. Having defined S_i, μ^(i), φ_i, and α_i, let μ^(i+1) be the mass function defined by

μ^(i+1)(x_1^n) = μ^(i)(x_1^n) − α_i ν(φ_i^{-1}(x_1^n)).

The set S_{i+1} is then taken to be the support of μ^(i+1), the function φ_{i+1} is taken to be any ε-matching over S_{i+1}, and α_{i+1} is taken to be the maximal (ν, μ^(i+1), φ_{i+1})-stuffing fraction.

If α_i < 1 then, by the definition of maximal stuffing fraction, there is an x_1^n ∈ S_i for which μ^(i)(x_1^n) = α_i ν(φ_i^{-1}(x_1^n)), and therefore μ(S_{i+1}) < μ(S_i). In either case μ(S_{i+1}) ≤ μ(S_i), and, as long as μ(S_i) ≥ ε, the hypothesis guarantees that ν([S_i]_ε) ≥ 1 − ε. Hence there is a first i, say i*, for which α_i = 1 or for which μ(S_{i+1}) < ε. The construction can be stopped after i*, cutting up and assigning the remaining unassigned ν-mass in any way consistent with the joining requirement that the total mass received by each x_1^n is equal to μ(x_1^n). The resulting λ is a joining of μ and ν for which d_n(x_1^n, y_1^n) ≤ ε, except on a set of pairs (x_1^n, y_1^n) of λ-measure less than ε, so that d̄_n(μ, ν) ≤ 2ε. This completes the proof of Lemma IV.3.2. □

With Lemma IV.3.2 in hand, only one simple entropy fact about conditional measures will be needed to prove that almost blowing-up implies very weak Bernoulli. This is the fact that, with high probability, the conditional probability μ(x_1^n|x_{-∞}^0) has approximately the same exponential size as the unconditioned probability μ(x_1^n), provided only that n is large enough. Only the following consequence of the upper bound will be needed. Let Σ(X_{-∞}^n) denote the σ-algebra generated by the collection of cylinder sets {[x_{-k}^n]: k ≥ 0}, and, for F_n ∈ Σ(X_{-∞}^n), let the notation x_{-∞}^n ∈ F_n be shorthand for the statement that ∩_k [x_{-k}^n] ⊆ F_n.

Lemma IV.3.3 (The conditional entropy lemma.)
Let μ be an ergodic process of entropy H and let α be a given positive number. There is an N = N(α) > 0 such that if n ≥ N then there is a set F_n ∈ Σ(X_{-∞}^n) such that μ(F_n) ≥ 1 − α, and so that if x_{-∞}^n ∈ F_n then

μ(x_1^n | x_{-∞}^0) ≤ 2^{αn} μ(x_1^n).

Proof. By the ergodic theorem, −(1/n) log μ(x_1^n|x_{-∞}^0) → H, almost surely, while, by the entropy theorem, −(1/n) log μ(x_1^n) → H, almost surely. These two facts together imply that μ(x_1^n|x_{-∞}^0) ≤ 2^{αn} μ(x_1^n), eventually almost surely, which establishes the lemma. □
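The stuffing-fraction iteration in the proof of Lemma IV.3.2 can be mimicked on finite distributions. The sketch below is a loose toy version, not the construction itself: it uses the remaining rather than the original ν-mass in the stuffing fraction, picks a nearest-neighbor ε-matching, relies on floating-point thresholds, and omits the final arbitrary assignment of the small leftover mass.

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

def greedy_joining(mu, nu, eps):
    """Toy version of the stuffing-fraction construction: repeatedly cut off the
    maximal uniform fraction of the remaining nu-mass of each y in the blowup of
    the unfilled support and assign it to a nearby x.  Returns the partial
    joining built by the iteration (a dict on pairs (x, y))."""
    remaining_mu = dict(mu)            # unfilled mu-mass, the mass function mu^(i)
    remaining_nu = dict(nu)            # nu-mass not yet assigned
    joint = {}
    while True:
        S = {x for x, m in remaining_mu.items() if m > 1e-12}
        if sum(remaining_mu[x] for x in S) < eps:
            break                      # unfilled mass is already small
        # eps-matching: send each remaining y within distance eps of S to a nearby x
        phi = {}
        for y, m in remaining_nu.items():
            if m <= 1e-12:
                continue
            close = [x for x in S if hamming(x, y) <= eps]
            if close:
                phi[y] = min(close, key=lambda x: hamming(x, y))
        if not phi:
            break
        # maximal stuffing fraction, in the spirit of formula (5)
        pulled = {}
        for y, x in phi.items():
            pulled[x] = pulled.get(x, 0.0) + remaining_nu[y]
        alpha = min(1.0, min(remaining_mu[x] / pulled[x] for x in pulled))
        # assign an alpha-fraction of the matched nu-mass within distance eps
        for y, x in phi.items():
            amount = alpha * remaining_nu[y]
            joint[(x, y)] = joint.get((x, y), 0.0) + amount
            remaining_nu[y] -= amount
            remaining_mu[x] -= amount
        if alpha == 1.0:
            break
    return joint
```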

Lemma IV.3.4
Let N = N(α) be as in the preceding lemma. If n ≥ N, then there is a set D_n ∈ Σ(X_{-∞}^0) of measure at least 1 − √α, such that if x_{-∞}^0 ∈ D_n, then

(a)  μ(F_n | x_{-∞}^0) ≥ 1 − √α, and

(b)  μ(C ∩ F_n | x_{-∞}^0) ≤ 2^{αn} μ(C ∩ F_n) ≤ 2^{αn} μ(C), for C ⊆ A^n.

In particular, if x_{-∞}^0 ∈ D_n and C ⊆ A^n, then μ(C) ≥ 2^{-αn}(μ(C | x_{-∞}^0) − √α).

Proof. For n ≥ N, the quantity μ(F_n) ≥ 1 − α is an average of the quantities μ(F_n|x_{-∞}^0), so, by the Markov inequality, there is a set D_n ∈ Σ(X_{-∞}^0) of measure at least 1 − √α such that (a) holds for x_{-∞}^0 ∈ D_n. Property (b) follows from the conditional entropy lemma by summing over those x_1^n ∈ C for which x_{-∞}^n ∈ F_n. Given (a) and (b), the condition μ(C|x_{-∞}^0) ≥ ε can be rewritten as μ(C ∩ F_n|x_{-∞}^0) ≥ ε − √α, so that (b) gives μ(C) ≥ 2^{-αn}(ε − √α). □

Now the main theorem of this section will be proved.

Theorem IV.3.5
Almost blowing-up implies very weak Bernoulli.

Proof. Let μ be a stationary process with the almost blowing-up property. The goal is to show that μ is very weak Bernoulli, that is, to show that for each ε > 0, there is an n and a set V ∈ Σ(X_{-∞}^0) such that μ(V) ≥ 1 − ε and such that d̄_n(μ, μ(·|x_{-∞}^0)) ≤ 2ε, for x_{-∞}^0 ∈ V. By Lemma IV.3.2, it is enough to show how to find n and V ∈ Σ(X_{-∞}^0) such that μ(V) ≥ 1 − ε and such that the following holds for x_{-∞}^0 ∈ V.

(6)  If C ⊆ A^n and μ(C|x_{-∞}^0) ≥ ε, then μ([C]_ε) ≥ 1 − ε.

Towards this end, let B_n ⊆ A^n be such that x_1^n ∈ B_n, eventually almost surely, and such that, for the given ε > 0, there is a δ > 0 and an N such that B_n has the (δ, ε)-blowing-up property for n ≥ N. The quantity μ(B_n) is an average of the quantities μ(B_n|x_{-∞}^0), so that if n is so large that μ(B_n) ≥ 1 − ε²/4, then, by the Markov inequality, there is a set D̃_n ∈ Σ(X_{-∞}^0) such that μ(D̃_n) ≥ 1 − ε/2 and such that μ(B_n|x_{-∞}^0) ≥ 1 − ε/2, for x_{-∞}^0 ∈ D̃_n.

From Lemma IV.3.3 and Lemma IV.3.4 it follows that if n is large enough and α ≤ ε²/4, then the set V = D_n ∩ D̃_n ∈ Σ(X_{-∞}^0) has measure at least 1 − ε. Moreover, if x_{-∞}^0 ∈ V, C ⊆ A^n, and μ(C|x_{-∞}^0) ≥ ε, then μ(C ∩ B_n|x_{-∞}^0) ≥ ε − ε/2, so that combining this with Lemma IV.3.4, applied to C ∩ B_n, gives

μ(C ∩ B_n) ≥ 2^{-αn}(ε − √α − ε/2).

If α is chosen to be less than both δ and ε²/8, then

2^{-αn}(ε − √α − ε/2) ≥ 2^{-δn},

provided only that n is large enough, so that the blowing-up property of B_n yields

μ([C]_ε) ≥ μ([C ∩ B_n]_ε) ≥ 1 − ε.

This establishes the desired result (6) and completes the proof that almost blowing-up implies very weak Bernoulli. □

Remark IV.3.6
The proof of Theorem IV.3.5 appeared in [39]. The proof of Strassen's lemma given here uses a construction suggested by Ornstein and Weiss which appeared in [37]. A more sophisticated argument, using a marriage lemma to pick the ε-matchings φ_i, yields the stronger result that the d̄_n-metric is equivalent to the metric d*_n, defined as the minimum ε > 0 for which μ(C) ≤ ν([C]_ε) + ε, for all C ⊆ A^n. This and related results are discussed in Pollard's book, [59, Example 26, pp. 79-80].

IV.3.d Exercises.

1. Show that a stationary process is very weak Bernoulli if and only if for each ε > 0 there is an m such that for each k ≥ 1, d̄_m(X_1^m/x_{-k}^0, X_1^m) < ε, except for a set of x_{-k}^0 of measure at most ε.

2. Show by using coupling that a mixing Markov chain is weak Bernoulli. (A simulation sketch of the coupling idea follows these exercises.)

3. Show that a mixing regenerative process is weak Bernoulli. (Hint: coupling.)

4. Show by using the martingale theorem that a stationary process μ is very weak Bernoulli if given ε > 0 there is a positive integer n and a measurable set G ∈ Σ(X_{-∞}^0) of measure at least 1 − ε, such that for every pair of pasts x_{-∞}^0, x̃_{-∞}^0 ∈ G there is a measurable mapping φ: [x_{-∞}^0] → [x̃_{-∞}^0] which maps the conditional measure μ(·|x_{-∞}^0) onto the conditional measure μ(·|x̃_{-∞}^0), and a measurable set B ⊆ [x_{-∞}^0] such that μ(B|x_{-∞}^0) < ε, with the property that for all x ∉ B, φ(x)_i = x_i for all except at most εn indices i ∈ [1, n].

5. Show by using the martingale theorem that a stationary process μ is weak Bernoulli if given ε > 0 there is a positive integer K such that for every n > K there is a measurable set G ∈ Σ(X_{-∞}^0) of measure at least 1 − ε, such that for every pair of pasts x_{-∞}^0, x̃_{-∞}^0 ∈ G there is a measurable mapping φ: [x_{-∞}^0] → [x̃_{-∞}^0] which maps the conditional measure μ(·|x_{-∞}^0) onto the conditional measure μ(·|x̃_{-∞}^0), and a measurable set B ⊆ [x_{-∞}^0] such that μ(B|x_{-∞}^0) < ε, with the property that for all x ∉ B, φ(x)_i = x_i, for i ∈ [K, n].
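Exercise 2 rests on the classical coupling argument: run two copies of the chain from different states, independently until they meet and together afterwards; the probability that they have not met within g steps controls the dependence between past and future separated by a gap g. The Monte Carlo sketch below estimates that tail for a two-state example; everything in it (names, the example chain, the trial count) is an illustrative choice of mine.

```python
import random

def coupling_tail(P, g, trials=20000, seed=1):
    """Estimate P(two independent copies of the chain, started in the two
    different states, have not met within g steps).  Once the copies meet they
    can be run together, so this tail bounds the epsilon in the gap condition
    of the weak Bernoulli definition for this chain."""
    rng = random.Random(seed)

    def step(s):
        return 0 if rng.random() < P[s][0] else 1

    fail = 0
    for _ in range(trials):
        a, b = 0, 1
        for _ in range(g):
            if a == b:
                break
            a, b = step(a), step(b)
        fail += (a != b)
    return fail / trials

P = [[0.9, 0.1], [0.2, 0.8]]
print([coupling_tail(P, g) for g in (1, 5, 10, 20)])
```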

1965. Dept.". An introduction to probability theory and its applications.Bibliography [1] Arnold. Recurrence in ergodic theory and combinatorial number theory Princeton Univ. Csiszdr and J. New York. Information theory and reliable communication. W. New York. New York. New York. of Math. Thomas. Stanford Univ. and Avez. Info. "The Erd6s-Rényi strong law for pattern matching with a given proportion of mismatches. 5(1970). Inc. D. New York. IT-21(1975). Akadémiai Kiack"). Information Theory Coding theorems for discrete memoryless systems. of Elec. 36(1980). Eng. Cover and J.. Ornstein. NY. NY.. [15] R. M.. [3] A. 194-203. [7] I. [4] P. Friedman and D. Kohn. "r-Entropy." IEEE Trans. Princeton. Wiley.I. 1981." J. Feldman. Furstenberg. Gallager. Budapest. [12] N. Introduction to ergodic theory Van Nostrand Reinhold. 1991. Elements of information theory. A. 1968." Ann. Friedman. 321-343. [5] D. Körner. [8] P. [9] P. d'analyse. [14] H. 1968. "On a new law of large numbers. "Logically smooth density estimation. Birkhauser.. NJ. 1971. 1152-1169. 1980." Advances in Math. Ergodic problems of classical mechanics. [13] N. John Wiley and Sons. Ergodic theory and information. 103-111. Barron. 365-394. C. 1985. and Ornstein's isomorphism theorem. [11] W. Th.. Benjamin. Thesis. New York. Rényi. V. [10] J. Probab. [2] R. 1970. Wiley.Elias. 17(1989).... Billingsley. Measure Theory. "Universal codeword sets and representations of the integers." Ph. Feller.. equipartition. A. Erdôs and A. John Wiley and Sons. 239 . "On isomorphism of weak Bernoulli transformations. Volume II (Second Edition). 1981. Press. Waterman.A. Arratia and M. New York. [6] T. Israel J. 23(1970).

Lind and B.S. Th. Info. 1950. "A simple proof of some ergodic theorems.." Israel J. Davisson. [31] I." Israel J. of Math. of Math. [23] M. Lectures on the coupling method. van Nostrand Co. 263-268. Gray and L. Chelsea Publishing Co. AMS.. "Comparative statistics for DNA and protein sequences . Smorodinsky. [20] P. Probability. 1990. 291-296. [30] J. New York. "Induced measure-preserving transformations. New York. Math. New Jersey. [32] U. Kontoyiannis and Y. Springer-Verlag. NJ. 1992. 5800-5804. Kac. NY." IEEE Trans. de Gruyter. [19] R. "New proofs of the maximal ergodic theorem and the Hardy-Littlewood maximal inequality. M. 1993. "Source coding without the ergodic assumption. "Prefixes and the entropy rate for long-range sources. Halmos. W. 669-675. [22] W. [34] D. New York. G. [24] S. [28] M.. random processes." Ann." IEEE Trans.) Wiley. Cambridge. [25] T. Halmos." Ann. IT-35(1989)." Israel J. D. Statist." Proc. 1002-1010. 42(1982).240 BIBLIOGRAPHY [16] P. Kemeny and J. [29] J. and ergodic properties. Gray." Proc. IT-37(1991). 36(1965). - [18] R. Princeton. statistics. Acad." IEEE Trans. [21] P. 1956. . Measure theory. New York. Hoeffding. and optimization.. IT-20(1975)... New York. Kakutani. Marcus. "A simple proof of the ergodic theorem using nonstandard analysis. Th. Finite Markov chains. [27] I. Lectures on ergodic theory. P. 681-4. 1988. Princeton. "Estimating the information content of symbol sequences and efficient codes. Weiss. Krengel. Ergodic theorems. Kelly. Kieffer. 19(1943). "Asymptotically optimal tests for multinomial distributions. Lindvall. NY. An Introduction to Symbolic Dynamics and Coding. Suhov. Berlin. 34(1979) 281-286. [33] R. 502-516. "On the notion of recurrence in discrete stochastic processes. Van Nostrand Reinhold. Inform. [17] R. Sci. Math. 53(1947)..A. Press. [35] T.. 635-641. [26] S. Cambridge Univ. 369-400. ed. Japan Acad. John Wiley and Sons. Info.. Statist.. 42(1982). "Sample converses in source coding theory. 1995. Gray. Jones. Kamae.. 1985. U. 284-290." Proc. (F.single sequence analysis. Entropy and information theory. Grassberger. " Finitary isomorphism of irreducible Markov shifts. Math.. Karlin and G. D." Probability. 1960. Theory. 82(1985). L. Snell. 87(1983). Ghandour. Springer Verlag. Nat. Katznelson and B. Keane and M.

43-58." IEEE Trans. "An application of ergodic theory to probability theory"." Israel J. Probab. [47] D. Ornstein and B." Ann. Ergodic theory. Weiss. "Entropy and the consistent estimation of joint distributions. Wyner..." Israel J. Math. IT-38(1992). lossless code. 905-930. Ornstein and P. Advances in Math. [41] D. 262(1982). 951-960. Ornstein. 18(1990). "Almost sure waiting time results for weak and very weak Bernoulli processes. 445-447." Ann. 1974." Ergodic Th. Inform. 15(1995). [44] A.. [43] D. [53] D. 211-215.. "A very simplistic.. Neuhoff and P.. Th. Th." IEEE Trans. 188-198... Inform. "A recurrence theorem for dependent processes with applications to data compression. [39] K. Neuhoff and P. Rydzyna." IEEE Workshop on Information Theory. "Indecomposable finite state channels and primitive approximation. IT-23(1977). Shields. [38] K. Nobel and A.BIBLIOGRAPHY 241 [36] K. 44(1983). "Entropy and data compression. Weiss." Ann. "The d-recognition of processes. Th. "Equivalence of measure preserving transformations. 78-83. Ornstein and B.. 441-452. 331-348. Shields. [40] D. Ornstein. 18(1990). Ornstein and P. [52] D. Marton and P.. Ornstein and P. Shields. "A simple proof of the blowing-up lemma. 63-88. Ornstein. Correction: Ann. 960-977." Ann. "The positive-divergence and blowing-up properties. Probab." Advances in Math. Probab. "How sampling reveals a process. Th. Shields. to appear. "The Shannon-McMillan-Breiman theorem for amenable groups. Info. Th. Probab. Weiss. 53-60. Shields. 86(1994). . 1(1973). Marton. [48] D. Yale Mathematical Monographs 5. CT. Sys... "Channel entropy and primitive approximation. [49] D. Ann. Rudolph." IEEE Trans. Inform. 11-18. Marton and P. 104(1994)." IEEE Trans. Info. Shields. Neuhoff and P. Math. Neuhoff and P.. IT-39(1993). [46] D. [42] D. Shields. 22(1994). randomness. [51] D. 10(1973). and Dynam. Probab. Ornstein and B. and dynamical systems.. "Universal almost sure data compression. Poland. [37] K. Shields. June. Shields. 182-224. Probab.. 10(1982). Shields. 1561-1563." IEEE Trans. Press. and B. Weiss. "Block and sliding-block source coding. IT-28(1982). IT-42(1986). Yale Univ. 1995. "An uncountable family of K-Automorphisms"." Memoirs of the AMS.. D. [45] D. Marton and P. universal. New Haven. [50] D.

283-291." Israel J. English translation: Select. "The ergodic and entropy theorems revisited. Th. Van Nostrand. 1(1961)." Mathematical Systems Theory.." IEEE Trans. [55] W. 84 (1977). 213-244. Shields. Lectures on Choquet's theorem. Ornstein and B. [62] I." Dokl. 1199-1203." Monatshefte fiir Mathematik. 1605-1617. Shields. 263-6." (in Russian). 159-180. Weiss. Petersen." Ann. Inform. of Math. "On the probability of large deviations of random variables. . AN SSSR. Sanov. The theory of Bernoulli shifts." Ann. Information and information stability of random variables and processes. Th. Rudolph. and Probability. Parry. Shields. Inform. "On the Bernoulli nature of systems with some hyperbolic structure. R. "Cutting and independent stacking of intervals. [57] Phelps. Springer-Verlag. 7 (1973). [67] P. Shields. Probab. "Stationary coding of processes. Probab. 11-44. Nauk SSSR. fur Wahr. Sys. 133-142. [64] P. "Entropy and prefixes.. Press. [70] P. Shields. "String matching . Shields. [72] P. (In Russian) Vol.. Sbornik. Statist. "A 'general' measure-preserving transformation is not mixing.J. Rohlin. [68] P. IT-37(1991). 20(1992). 1984... 349-351. Math. [66] P. Inform. Univ.. 1983. [61] D.." IEEE Trans. 60(1948). 119-123. Ergodic theory. [71] P. Convergence of stochastic processes. 19(1990). 7 of the series Problemy Peredai Informacii. 403-409. 49(1979). Shields. "Weak and very weak Bernoulli partitions.. Akad. Cambridge. [63] P. Th. Shields. N." Problems of Control and Information Theory. Pinsker.. Chicago. [59] D. of Chicago Press.. "Cutting and stacking. to appear. 20(1992). "Almost block independence. Mat. 1964. Inform. Shields. Pollard. 42(1957)." Ergodic Th. 269-277. [69] P. Transi. "If a two-point extension of a Bernoulli shift has an ergodic square. IT-25(1979). San Francisco. Press." Z. [56] K. 1981.. A method for constructing stationary processes." IEEE Trans. Topics in ergodic theory Cambridge Univ. [65] P. Moscow. [73] P. 30(1978).242 BIBLIOGRAPHY [54] D. English translation: Holden-Day. 1-4. 1960. Th. and Dynam. IT-37(1991). then it is Bernoulli. [60] V. Shields. "The entropy theorem via coding bounds. 1966 [58] M. Princeton. Shields. IT-33(1987). Cambridge Univ.the general ergodic case. 1973. 1645-1647. Cambridge." IEEE Trans. New York. "Universal almost sure data compression using Markov types.

"The sliding-window Lempel-Ziv algorithm is asymptotically optimal. Inform. Phillips. M. Th. 36(1965)." Math.. [88] A. Th. Prob. "Coding of sources with unknown statistics-Part I: Probability of encoding error. Th." IEEE Trans. Th. 5(1971). J. "Kingman's subadditive ergodic theorem. Inst. 878-880. of the IEEE. 6(1993). "An ergodic process of zero divergence-distance from the class of all stationary processes. Th. Part II: Distortion relative to a fidelity criterion". [80] M. Szpankowski. 3(1975).. IT- 39(1993). submitted. 337-343. Shields. Statist. Springer-Verlag. of Theor. Wyner and J. [84] P. "Entropy zero x Bernoulli processes are closed in the a-metric.. IT-23(1977)." Ann. IT-35(1989). [83] J." IEEE Trans. [81] V." Ann. "Two divergence-rate counterexamples.. "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. Strassen." Proc. Prob. "The existence of probability measures with given marginals. Xu.. 93-98. of Math.. Lernpel.. of Math." J. Sciences. 54-58. [87] A. Ziv. 6(1993). of Theor... Syst. Probab. Inform. Prob. IT-39(1993). Inform. 1647-1659.. [75] P. 384-394. [78] M." J.. Ergodic theory. Ziv. Ziv. Th. Smorodinsky. 1975. 499-519. Th. [90] J. Th. Shields. "Coding theorems for individual sequences. Info. 520-524. Ziv. 210-203. 521-545. Moser." IEEE Trans. 125-1258. "Fixed data base version of the Lempel-ziv data compression algorithm. Ziv and A. IEEE Trans. [86] A.. Shields and J. 405-412. vol. "Universal redundancy rates don't exist. Info. "Universal data compression and repetition times. "Waiting times: positive and negative results on the Wyner-Ziv problem. [79] M. [89] S. Asymptotic properties of data compression and suffix trees." IEEE Trans.. [91] J. [82] W. Shields. "A partition on a Bernoulli shift which is not 'weak Bernoulli'. [85] F. [76] P... 732-6. Wyner and J. Inform.Varadhan. [77] P. IT-35(1989). Thouvenot. New York. Inform." J. "A universal algorithm for sequential data compression. IT-18(1972). NYU." Contemporary Math." IEEE Trans. "Finitary isomorphism of m-dependent processes." IEEE Trans. Smorodinsky.-P. [92] J. IT-37(1991)." IEEE Trans. 25(1989). of Theor. 872-877. An introduction to ergodic theory. Th." Ann. Henri Poincaré. Willems. Ziv. . New York. 82(1994). Wyner and J. Walters. E. 423-439. 1982. and S. a seminar Courant Inst. IT- 24(1978). 135(1992) 373-376.BIBLIOGRAPHY 243 [74] P. Steele. Inform.


125 base (bottom) of a column. 189 (a. 108 width (thickness). 184 almost block-independence (ABI). 9. 108 width distribution. 8 block-independent process. 211 Barron's code-length bound. 107 height (length). 115 disjoint column structures. 107 base (bottom). 103. 74 codebook. 195. 71 coding block. 24. 211 block-structure measure. K)-strongly-separated. 71 length function. 107 column partitioning. 108 uniform. 52 block code. 107 complete sequences. 109 built-up. 110 copy. 68 blowup bound. 68 8-blowup. 27 conditional E -independence. 215 column. 70 (1 — E)-built-up. 24. 84. 108 width. E)-blowing-up. 69 built by cutting and stacking. 194 almost blowing-up. 108 estimation of distributions. 10 concatenation representation. 104 block-to-stationary construction. 58 admissible. 174 in probability. 107 level. 58 . 235 alphabet. k)-separated structures. 138 circular k-type. 1 asymptotic equipartition (AEP). 104. 110 concatenated-block process. 111 column structure: top. 107 upward map. 188 upward map. 109 disjoint columns. 226 entropy. 107 name. 121 n-code. 235 (8. 187 absolutely regular. 108 support. 195. 8. 7 truncation. 107 top. 24. 110 transformation defined by.Index a-separated. 179. 72. 174 in d. 185. 26. 107 of a column structure. 8 block coding of a process. 24. 72 codeword. 194 Borel-Cantelli principle. 190 columnar representation. 108 binary entropy function. 71 code sequence. 107 cutting a column. 215 245 faithful (noiseless). 83 blowing-up property (BUP). 11 building blocks. 107 labeling. 233 addition law. 55 B-process. 74. 115 cutting into copies. 24. 123 code. 69 built-up set bound. 121 rate of. 108 (a. 109 column structure. 107 subcolumn. 212 almost blowing-up (ABUP). 24. 195. 121 per-symbol. 105 complete sequences of structures. 235 blowup.

58 a random variable. 90 rotation processes. 80 entropy. 11 expected return time. 43 ergodicity and d-limits. 1 continuous distribution. 133 entropy interpretations: covering exponent. 214 coupling. 88 entropy of i. 62 Markov processes.i. 62 concatenated-block processes. 1 of blocks in blocks. 46 number. 45 "good set" form. 56 topological. 63 . 144 n-th order. 91 definition (joining). 57 doubly-infinite sequence model.246 conditional invertibility. 92 for ergodic processes. 222 of a partition. 31 distinct words. 184 d-distance . 100 typical sequences. 46 almost-covering principle. 89 realizing tin 91 d-topology. 51 a process. 98 direct product. 176. 166 for frequencies. 16 totally. 58 entropy rate. 75. 186. 43. 67 cardinality bound. 65 empirical joinings. 147 distribution of a process. 168 divergence. 99 finite form. 185 encoding partition. 100 subset bound. 129 entropy-typical sequences. 64 empirical measure (limiting). 60 upper semicontinuity. 112 cyclic merging rule. 89 for stationary processes. 96 d-far-apart Markov chains. 73 entropy properties: d-limits. 68 prefix codes. 51 empirical. 94 empirical universe. 48 Markov. 89. 190 cylinder set. 99 mixing. processes. 20 process. 59 of a distribution. 78 entropy theorem. 51 conditional. 67 E-independence. 42 maximal.i. 4 . 164 consistency conditions. 76 empirical distribution. 6 shifted empirical block. 109 standard representation. 45 eventually almost surely. 138 estimation. 224 ergodic. 97 completeness. processes. 67 cutting and stacking. 100 ergodicity. 102 relationship to entropy. 143. 218 coupling measure. 63 of overlapping k-blocks. 166 Elias code. 49 decomposition. 92 for sequences. 87 d-distance properties. 2 d-admissible. 24 exponential rates for entropy. 42 of Kingman. 218 covering. 89 for i. 90 d'-distance. 48 empirical entropy.d. 33 of von Neumann. 132 Ziv entropy. 17 start. 91 definition for processes. 59 a partition. 138 first-order entropy.d. 223 for processes. INDEX empirical Markov entropy. 51. 57 inequality. 15 components. 103. 21 ergodic theorem (of Birkhoff).

190 mixing. 10 overlapping to nonoverlapping. 138 partial. 34 stopping-time. 15 irreducible Markov. 14 merging and separation. 166 frequency. 41 two-packings. 4 complete measure model. 91 join of partitions. 198. 139 separated. 6 generalized renewal process. 115 induced process. 17 cutting and stacking. 45 function of a Markov chain. 40 strong-packing. 195 finitary process. 131 convergence theorem. 25 infinitely often. 159 finite-state process. 11 instantaneous coder. 62 matching. E)-typical sequences. 221 finite form. 110 Kolmogorov partition. 185 nonexistence of too-good codes. 107 (T. 117 extension. 22 Lempel-Ziv (LZ) algorithm. 107 of a column level. 175. 3. processes. 102 faithful code. 71 faithful-code sequence. 65 Markov inequality. 136 linear mass distribution. 132 upper bound. 26 generalized return-time picture. 116 repeated. 121 filler or spacer blocks. 14 nonadmissibility. 34. 6 concatenated-block process. 80 finite energy. 131 LZW algorithm. 34 (1 — 0-packing. 114 M-fold. 168 overlapping-block process. 4. 10 finitely determined. 5 source. 41 . 44 (k. 27 transformation. 212 stacking of a structure: onto a column. 28 independent partitions.d. 7 finite coding. 73 Kronecker's theorem. 6 Markov order theorem. 195 finite coder. 84 fillers. 8 approximation theorem. 104 log-sum inequality. 186. 2. 122 nonoverlapping-block process. 6 k-th order. 90 mapping interpretation. 13 Kolmogorov representation. 4 set. 229 247 Kac's return-time formula. 10 packing. 18 and d-limits. 2)-name. 100 and Markov. 133 simple LZ parsing. 20 name of a column. 7 invariant measure. 43 typical. 41 packing lemma. 104 first-order rate bound.i. 25 Kolmogorov measure. 91 definition of d. 7 function. 151 finitary coding. 115 onto a structure.INDEX extremal measure. 76. 222 i. 17 joint input-output measure. 3 two-sided model. 223 first-order blocks. 4 and complete sequences. almost surely. 18 join of measures. 11 Markov chain. 215 n-blocking. 235 measure preserving. 3 Kraft inequality.

59 subinvariant function. 14 product measure. 166 for frequencies. 13 and complete sequences. 167 equivalence. 82 time-zero coder. 211 entropy. 7 and entropy. 120 stationary. 70 (1 — 0-strongly-covered. 14 per-letter Hamming distance. 5 Markov. 84 stationary process. 0-strongly-packed. 205 partition. 0-strongly-covered. 85 strong cover. 167 equivalence. 67 too-soon-recurrent representation. 33 strongly-covered. 46 (L. 12 N-th term process. 30. 167. 179 skew product. 3 (T . 8 rate function for entropy. 55 shift transformation. 63 k-type. 44. 154 repeated words. 229 Markov process. 24 picture. 64 class bound. 63 class. 40 string matching. 28 stationary coding. 110 B-process. 74. 43. 4 shift. 63. 29 type.248 pairwise k-separated. 89 Pinsker's inequality. 3 transformation/partition model. 159 (S. 31 sliding-block (-window) coding. 215 (S. 150 representation. 13 too-long word. n)-independent. 6 input-output measure. 30 transformation. 158 process. n)-blocking. 156 totally ergodic. 148 return-time distribution. 109 universal code. block parsings. 21 tower construction. 180 splitting-set lemma. 165 recurrence-time function. 65 typical sequence. 169. 166 splitting index. 132 of a column structure. 215 i. 7 with induced spacers. 32. 63. 3. 151 too-small set principle. 64 class. 6 in-dependent.. 3 shifted. 150 (K. 73 code sequence. 8 output measure. 67 and d-distance. 24 process. 190 . 5 Vf-mixing. 180 stacking columns. 170. 170 with gap. 72 code. 175 - INDEX and upward maps. 13 distribution. 10 regenerative. 229 stopping time. 68. 26 rotation (translation). 121 universe (k-block). 45. 145 existence theorem. 7 speed of convergence. 27. 66 Poincare recurrence theorem. 109 start position. 21 process. 122 sequence. 121 trees. 6 N-stationary. 188 full k-block universe.d. 21 d-far-apart rotations.P) process. 138 subadditivity. 40 interval. 23 prefix. 96 randomizing the start. 34 strongly-packed. 1. 8.i. 59 finite-energy. 81 and mixing. 101 Shannon-McMillan-Breiman theorem.

INDEX very weak Bernoulli (VWB). 200 weak Bernoulli (WB). 7 words. 87. 148 repeated. 233 weak topology. 195 window half-width. 104 distinct. 232 waiting-time function. 119 window function. 88 weak convergence. 87 well-distributed. 133 249 . 200 approximate-match. 179. 148 Ziv entropy.
