The Ergodic Theory of Discrete Sample Paths
Paul C. Shields
Graduate Studies in Mathematics
Volume 13
American Mathematical Society
Editorial Board
James E. Humphreys
David Sattinger
Julius L. Shaneson
Lance W. Small, chair
1991 Mathematics Subject Classification. Primary 28D20, 28D05, 94A17; Secondary 60F05, 60G17, 94A24.
ABSTRACT. This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars, or self-study for students or faculty with some background in measure theory and probability theory.
Library of Congress Cataloging-in-Publication Data

Shields, Paul C.
The ergodic theory of discrete sample paths / Paul C. Shields.
p. cm. — (Graduate studies in mathematics, ISSN 1065-7339; v. 13)
Includes bibliographical references and index.
ISBN 0-8218-0477-4 (alk. paper)
1. Ergodic theory. 2. Measure-preserving transformations. 3. Stochastic processes. I. Title. II. Series.
QA313.S555 1996  519.2'32—dc20  96-20186 CIP
Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given. Republication, systematic copying, or multiple reproduction of any material in this publication (including abstracts) is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Assistant to the Publisher, American Mathematical Society, P.O. Box 6248, Providence, Rhode Island 02940-6248. Requests can also be made by e-mail to reprint-permission@ams.org.
© Copyright 1996 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Printed in the United States of America. The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Printed on recycled paper.

10 9 8 7 6 5 4 3 2 1    01 00 99 98 97 96
Contents

Preface

I Basic concepts
  I.1 Stationary processes
  I.2 The ergodic theory model
  I.3 The ergodic theorem
  I.4 Frequencies of finite blocks
  I.5 The entropy theorem
  I.6 Entropy as expected value
  I.7 Interpretations of entropy
  I.8 Stationary coding
  I.9 Process topologies
  I.10 Cutting and stacking

II Entropy-related properties
  II.1 Entropy and coding
  II.2 The Lempel-Ziv algorithm
  II.3 Empirical entropy
  II.4 Partitions of sample paths
  II.5 Entropy and recurrence times

III Entropy for restricted classes
  III.1 Rates of convergence
  III.2 Entropy and joint distributions
  III.3 The d̄-admissibility problem
  III.4 Blowing-up properties
  III.5 The waiting-time problem

IV B-processes
  IV.1 Almost block-independence
  IV.2 The finitely determined property
  IV.3 Other B-process characterizations

Bibliography

Index
Preface

This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The focus is on the combinatorial properties of typical finite sample paths drawn from a stationary, ergodic process. A primary goal, only partially realized, is to develop a theory based directly on sample path arguments, with minimal appeals to the probability formalism. A secondary goal is to give a careful presentation of the many models for stationary finite-alphabet processes that have been developed in probability theory, ergodic theory, and information theory.

The two basic tools for a sample path theory are a packing lemma, which shows how "almost" packings of integer intervals can be extracted from coverings by overlapping subintervals, and a counting lemma, which bounds the number of n-sequences that can be partitioned into long blocks subject to the condition that most of them are drawn from collections of known size. These two simple ideas, introduced by Ornstein and Weiss in 1980, immediately yield the two fundamental theorems of ergodic theory, namely, the ergodic theorem of Birkhoff and the entropy theorem of Shannon, McMillan, and Breiman. The packing and counting ideas yield more than these two classical results, however, for in combination with the ergodic and entropy theorems and further simple combinatorial ideas they provide powerful tools for the study of sample paths. Much of Chapter I and all of Chapter II are devoted to the development of these ideas.

The classical process models are based on independence ideas and include the i.i.d. processes, Markov chains, instantaneous functions of Markov chains, and renewal and regenerative processes. Of particular note in the discussion of process models is how ergodic theorists think of a stationary process, namely, as a measure-preserving transformation on a probability space, together with a partition of the space. This point of view, introduced in Section I.2, leads directly to Kakutani's simple geometric representation of a process in terms of a recurrent event, a representation that not only simplifies the discussion of stationary renewal and regenerative processes but generalizes these concepts to the case where times between recurrences are not assumed to be independent, but only stationary. A further generalization, given in Section I.10, leads to a powerful method for constructing examples known as cutting and stacking. An important and simple class of such models is the class of concatenated-block processes, that is, the processes obtained by independently concatenating fixed-length blocks according to some block distribution and randomizing the start. Related models are obtained by block coding and randomizing the start, or by stationary coding, an extension of the instantaneous function concept which allows the function to depend on both past and future. All these models and more are introduced in the first two sections of Chapter I. Further models, including the weak Bernoulli processes and the important class of stationary codings of i.i.d. processes, are discussed in Chapter III and Chapter IV.

The book has four chapters. The first chapter, which is half the book, is devoted to the basic tools, including the Kolmogorov and ergodic theory models for a process, the ergodic theorem and its connection with empirical distributions, the entropy theorem and its interpretations, the weak topology and the even more important d̄-metric topology, a method for converting block codes to stationary codes, and the cutting and stacking method.

Properties related to entropy which hold for every ergodic process are discussed in Chapter II. These include entropy as the almost-sure bound on per-symbol compression, Ziv's proof of asymptotic optimality of the Lempel-Ziv algorithm via his interesting concept of individual sequence entropy, the relation between entropy and partitions of sample paths into fixed-length blocks, or partitions into distinct blocks, or partitions into repeated blocks, and the connection between entropy and recurrence times and entropy and the growth of prefix trees.

Properties related to entropy which hold only for restricted classes of processes are discussed in Chapter III, including rates of convergence for frequencies and entropy, the estimation of joint distributions in both the variational metric and the d̄-metric, a connection between entropy and d̄-neighborhoods, and a connection between entropy and waiting times. Several characterizations of the class of stationary codings of i.i.d. processes are given in Chapter IV, including the almost block-independence, very weak Bernoulli, finitely determined, and blowing-up characterizations. Some of these date back to the original work of Ornstein and others on the isomorphism problem for Bernoulli shifts, although the almost block-independence and blowing-up ideas are more recent. With a few exceptions, the sections in Chapters II, III, and IV are approximately independent of each other, conditioned on the material in Chapter I.

Many standard topics from ergodic theory are omitted or given only cursory treatment, in part because the book is already too long and in part because they are not close to the central focus of this book. These topics include topological dynamics, smooth dynamics, random fields, combinatorial number theory, general ergodic theorems, K-processes, and continuous time and/or space theory. Likewise little or nothing is said about such standard information theory topics as rate distortion theory, channel theory, algebraic coding, divergence-rate theory, redundancy, and multi-user theory. Only those references that seem directly related to the topics discussed are included.

Some specific stylistic guidelines were followed in writing this book. Proofs are sketched first, then given in complete detail. Suggestive names are used for concepts, for example, the entropy theorem rather than the Shannon-McMillan-Breiman theorem, and such as building blocks (a name proposed by Zoli Györfi) and column structures (as opposed to gadgets). Theorems and lemmas are given names that include some information about content. (Also numbered displays are often (informally) given names similar to those used for LaTeX labels.) Exercises that extend the ideas are given at the end of most sections; these range in difficulty from quite easy to quite hard.

This book is an outgrowth of the lectures I gave each fall in Budapest from 1989 through 1994, both as special lectures and seminars at the Mathematics Institute of the Hungarian Academy of Sciences and as courses given in the Probability Department of Eötvös Loránd University. In addition, lectures on parts of the penultimate draft of the book were presented at the Technical University of Delft in the fall of 1995. The audiences included ergodic theorists, information theorists, and probabilists, as well as combinatorialists and people from engineering and other mathematics disciplines, ranging from undergraduate and graduate students through postdocs and junior faculty to senior professors and researchers. The book is designed for use in graduate courses, seminars, or self-study for students or faculty with some background in measure theory and probability theory.

I am indebted to many people for assistance with this project. Imre Csiszár and Katalin Marton not only attended most of my lectures but critically read parts of the manuscript at all stages of its development and discussed many aspects of the book with me. It was Imre's suggestion that led me to include the discussions of renewal and regenerative processes, and his criticisms led to many revisions of the cutting and stacking discussion. Much of the material in Chapter III as well as the blowing-up ideas in Chapter IV are the result of joint work with Kati. In addition I had numerous helpful conversations with Benjy Weiss, Don Ornstein, Dave Neuhoff, Aaron Wyner, and Jacob Ziv. Others who contributed ideas and/or read parts of the manuscript include Gábor Tusnády, Gusztáv Morvai, György Michaletzky, Bob Burton, Jacek Serafin, and Nancy Morrison. I am much indebted to my two Toledo graduate students, Shaogang Xu and Xuehong Li, who learned ergodic theory by carefully reading almost all of the manuscript at each stage of its development, in the process discovering numerous errors and poor ways of saying things.

I am grateful to the Mathematics Institute of the Hungarian Academy for providing me with many years of space in a comfortable and stimulating environment, as well as to the Institute and to the Probability Department of Eötvös Loránd University for the many lecture opportunities. My initial lectures in 1989 were supported by a Fulbright lectureship. Much of the work for this project was supported by NSF grants DMS-8742630 and DMS-9024240 and by a joint NSF-Hungarian Academy grant, MTA-NSF project 37.

No project such as this can be free from errors and incompleteness. A list of errata as well as a forum for discussion will be available on the Internet at the following web address:

http://www.math.utoledo.edu/~pshields/ergodic.html

Last, but far from least: this book is dedicated to my son, Jeffrey.

Paul C. Shields
Toledo, Ohio
March 22, 1996
Chapter I

Basic concepts

Section I.1 Stationary processes.

A (discrete-time, stochastic) process is a sequence $X_1, X_2, \ldots$ of random variables defined on a probability space $(X, \Sigma, P)$. The process has alphabet $A$ if the range of each $X_i$ is contained in $A$. In this book the focus is on finite-alphabet processes, so, unless it is clear from the context or explicitly stated otherwise, "process" means a discrete-time finite-alphabet process. The cardinality of a finite set $A$ is denoted by $|A|$. Also, unless stated otherwise, "measure" will mean "probability measure" and "function" will mean "measurable function" with respect to some appropriate $\sigma$-algebra on a probability space.

The sequence $a_m, a_{m+1}, \ldots, a_n$, where each $a_i \in A$, is denoted by $a_m^n$. The set of all such $a_m^n$ is denoted by $A_m^n$, except when $m = 1$, when $A^n$ is used. The $k$-th order joint distribution of the process $\{X_k\}$ is the measure $\mu_k$ on $A^k$ defined by the formula
$$\mu_k(a_1^k) = \mathrm{Prob}(X_1^k = a_1^k), \quad a_1^k \in A^k.$$
When no confusion will result, the subscript $k$ on $\mu_k$ may be omitted. The set of joint distributions $\{\mu_k : k \ge 1\}$ is called the distribution of the process. The distribution of a process is thus a family of probability distributions, one for each $k$. The sequence cannot be completely arbitrary, however, for implicit in the definition of process is that the following consistency condition must hold for each $k \ge 1$:
$$(1)\qquad \mu_k(a_1^k) = \sum_{a_{k+1}} \mu_{k+1}(a_1^{k+1}), \quad a_1^k \in A^k.$$

A process is considered to be defined by its joint distributions; that is, the particular space on which the functions $X_n$ are defined is not important, for all that really matters in probability theory is the distribution of the process. Thus one is free to choose the underlying space $(X, \Sigma, P)$ on which the $X_n$ are defined in any convenient manner, as long as the joint distributions are left unchanged.

The distribution of a process can, of course, also be defined by specifying the start distribution $\mu_1$ and the successive conditional distributions
$$\mu_k(a_k \mid a_1^{k-1}) = \mathrm{Prob}(X_k = a_k \mid X_1^{k-1} = a_1^{k-1}) = \frac{\mu_k(a_1^k)}{\mu_{k-1}(a_1^{k-1})}, \quad a_1^k \in A^k.$$

An important instance of this idea is the Kolmogorov model for a process, which represents it as the sequence of coordinate functions on the space of infinite sequences drawn from the alphabet $A$, equipped with a Borel measure constructed from the consistency conditions (1).

The rigorous construction of the Kolmogorov representation is carried out as follows. Let $A^\infty$ denote the set of all infinite sequences $x = \{x_i\}$, $1 \le i < \infty$, $x_i \in A$. The cylinder set determined by $a_m^n$, denoted by $[a_m^n]$, is the subset of $A^\infty$ defined by
$$[a_m^n] = \{x : x_i = a_i,\ m \le i \le n\}.$$
Let $C_n$ be the collection of cylinder sets defined by sequences that belong to $A^n$, let $C = \cup_n C_n$, and let $\mathcal{R}_n = \mathcal{R}(C_n)$ denote the ring generated by $C_n$. The sequence $\{\mathcal{R}_n\}$ is increasing, and its union $\mathcal{R} = \mathcal{R}(C)$ is the ring generated by all the cylinder sets. Two important properties are summarized in the following lemma.

Lemma I.1.1
(a) Each set in $\mathcal{R}_n$ is a finite disjoint union of cylinder sets from $C_n$.
(b) If $\{B_n\} \subset \mathcal{R}$ is a decreasing sequence of sets with empty intersection, then there is an $N$ such that $B_n = \emptyset$, $n \ge N$.

Proof. The proof of (a) is left as an exercise. Part (b) is an application of the finite intersection property for sequences of compact sets, since the space $A^\infty$ is compact in the product topology and each set in $\mathcal{R}$ is closed. Thus the lemma is established.

Let $\Sigma$ be the $\sigma$-algebra generated by the cylinder sets $C$. The collection $\Sigma$ can also be defined as the $\sigma$-algebra generated by $\mathcal{R}$, or by the compact sets. The members of $\Sigma$ are commonly called the Borel sets of $A^\infty$. For each $n \ge 1$ the coordinate function $\hat{X}_n : A^\infty \to A$ is defined by $\hat{X}_n(x) = x_n$, $x \in A^\infty$. The Kolmogorov representation theorem states that every process with alphabet $A$ can be thought of as the coordinate function process $\{\hat{X}_n\}$, together with a Borel measure on $A^\infty$.

Theorem I.1.2 (Kolmogorov representation theorem.) If $\{\mu_k\}$ is a sequence of measures for which the consistency conditions (1) hold, then there is a unique Borel probability measure $\mu$ on $A^\infty$ such that, for each $k$ and each $a_1^k$,
$$\mu([a_1^k]) = \mu_k(a_1^k).$$

Proof. The process $\{X_n\}$ defines a set function $\mu$ on the collection $C$ of cylinder sets by the formula
$$(2)\qquad \mu([a_1^n]) = \mathrm{Prob}(X_i = a_i,\ 1 \le i \le n).$$
The consistency conditions (1), together with Lemma I.1.1(a), imply that $\mu$ extends to a finitely additive set function on the ring $\mathcal{R} = \mathcal{R}(C)$. The finite intersection property, Lemma I.1.1(b), implies that $\mu$ can be extended to a unique countably additive measure on the $\sigma$-algebra $\Sigma$ generated by $\mathcal{R}$. In other words, if $\{\mu_k\}$ is a sequence of measures for which the consistency conditions (1) hold, then there is a unique Borel probability measure $\mu$ on $A^\infty$ for which the sequence of coordinate functions $\{\hat{X}_n\}$ has the same distribution as $\{X_n\}$, that is, $\mu([a_1^k]) = \mu_k(a_1^k)$, for each $k$ and each $a_1^k$. This proves Theorem I.1.2.
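The consistency condition (1) is easy to check numerically when the joint distributions are given as finite tables. The following Python sketch is not from the book; the helper names `is_consistent` and `product_mu` are invented for illustration. It represents each μ_k as a dictionary from k-tuples to probabilities, builds the joint distributions of a two-symbol i.i.d. process, and verifies that marginalizing μ_{k+1} over its last coordinate recovers μ_k.

```python
from itertools import product

def is_consistent(mus, alphabet, tol=1e-12):
    """Check the consistency condition (1): for each k,
    mu_k(a_1^k) must equal the sum over b of mu_{k+1}(a_1^k b).
    `mus` is the list [mu_1, mu_2, ...], each a dict from tuples to probabilities."""
    for k in range(len(mus) - 1):
        mu_k, mu_next = mus[k], mus[k + 1]
        for a in product(alphabet, repeat=k + 1):
            marginal = sum(mu_next.get(a + (b,), 0.0) for b in alphabet)
            if abs(mu_k.get(a, 0.0) - marginal) > tol:
                return False
    return True

def product_mu(p, k):
    """k-th order joint distribution of the i.i.d. process with marginal p."""
    mu = {}
    for a in product(tuple(p), repeat=k):
        prob = 1.0
        for symbol in a:
            prob *= p[symbol]
        mu[a] = prob
    return mu

p = {0: 0.3, 1: 0.7}
mus = [product_mu(p, k) for k in (1, 2, 3)]
print(is_consistent(mus, (0, 1)))   # True

mus[1][(0, 0)] += 0.1               # deliberately break consistency
print(is_consistent(mus, (0, 1)))   # False
```

Any family passing this check determines, by Theorem I.1.2, a unique Borel measure on the sequence space; the check is exactly the hypothesis of the theorem, restricted to the finitely many levels supplied.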
The sequence of coordinate functions $\{\hat{X}_n\}$ on the probability space $(A^\infty, \Sigma, \mu)$ will be called the Kolmogorov representation of the process $\{X_n\}$, and the measure $\mu$ will be called the Kolmogorov measure of the process, or the Kolmogorov measure of the sequence $\{\mu_k\}$. Equation (2) translates into the statement that $\{\hat{X}_n\}$ and $\{X_n\}$ have the same joint distribution. As noted earlier, a process is considered to be defined by its joint distributions; the particular space on which the functions $X_n$ are defined is not important. The Kolmogorov model is simply one particular way to define a space and a sequence of functions with the given distributions.

Another useful model is the complete measure model, one in which subsets of sets of measure 0 are measurable. The Kolmogorov measure on $A^\infty$ extends to a complete measure on the completion $\hat{\Sigma}$ of the Borel sets $\Sigma$ relative to $\mu$. Completion has no effect on joint distributions, uniqueness is preserved, and different processes have Kolmogorov measures with different completions. Many ideas are easier to express and many results are easier to establish in the framework of complete measures; thus, whenever it is convenient to do so, the complete Kolmogorov model will be used in this book, though often without explicitly saying so. In particular, unless stated otherwise, "Kolmogorov measure" is taken to refer to either the measure defined by Theorem I.1.2 or to its completion, whichever is appropriate to the context. Process and measure language will often be used interchangeably; for example, "let $\mu$ be a process" means "let $\mu$ be the Kolmogorov measure for some process $\{X_n\}$."

Stationary finite-alphabet processes serve as models in many settings of interest, including physics, statistics, and data transmission and storage. The intuitive idea of stationarity is that the mechanism generating the random variables doesn't change in time. A process is stationary if the joint distributions do not depend on the choice of time origin, that is, if
$$(3)\qquad \mathrm{Prob}(X_i = a_i,\ m \le i \le n) = \mathrm{Prob}(X_{i+1} = a_i,\ m \le i \le n),$$
for all $m$, $n$, and $a_m^n$. The phrase "stationary process" will mean a stationary, discrete-time, finite-alphabet process, unless stated otherwise.

The statement that a process $\{X_n\}$ with finite alphabet $A$ is stationary translates into the statement that its Kolmogorov measure $\mu$ is invariant under the shift transformation $T$. The (left) shift $T: A^\infty \to A^\infty$ is defined by $(Tx)_n = x_{n+1}$, $n \ge 1$, which is expressed in symbolic form by
$$x = (x_1, x_2, x_3, \ldots) \implies Tx = (x_2, x_3, \ldots).$$
The shift transformation is continuous relative to the product topology on $A^\infty$, and defines a set transformation $T^{-1}$ by the formula
$$T^{-1}B = \{x : Tx \in B\}.$$
Note that $T^{-1}[a_m^n] = [b_{m+1}^{n+1}]$, where $b_{i+1} = a_i$, $m \le i \le n$, so that the set transformation $T^{-1}$ maps a cylinder set onto a cylinder set. It follows that $T^{-1}B$ is a Borel set for each Borel set $B$, which means that $T$ is a Borel measurable mapping. $T$ preserves a Borel probability measure $\mu$ if and only if the sequence of coordinate functions $\{\hat{X}_n\}$ is a stationary process on the probability space $(A^\infty, \Sigma, \mu)$. The condition that the process be stationary is just the condition that $\mu(B) = \mu(T^{-1}B)$ for each cylinder set $B$, which is, in turn, equivalent to the condition that $\mu(B) = \mu(T^{-1}B)$ for each Borel set $B$. The condition $\mu(B) = \mu(T^{-1}B)$, $B \in \Sigma$, is usually summarized by saying that $T$ preserves the measure $\mu$ or, alternatively, that $\mu$ is $T$-invariant.

It is often convenient to think of a process as having run for a long time before the start of observation, so that only nontransient effects, that is, those that do not depend on the time origin, are seen. Such a point of view is well interpreted by another model, called the doubly-infinite sequence model or two-sided model. Let $Z$ denote the set of integers and let $A^Z$ denote the set of doubly-infinite $A$-valued sequences, that is, the set of all sequences $x = \{x_n\}$, where each $x_n \in A$ and $n \in Z$ ranges from $-\infty$ to $\infty$. The concept of cylinder set extends immediately to this new setting, and the shift $T: A^Z \to A^Z$ is defined by $(Tx)_n = x_{n+1}$, $n \in Z$. In the stationary case, if $\{\mu_k\}$ is a set of measures for which the consistency conditions (1) and the stationarity conditions (3) hold, then there is a unique $T$-invariant Borel measure $\mu$ on $A^Z$ such that $\mu([a_1^k]) = \mu_k(a_1^k)$, for each $k \ge 1$ and each $a_1^k$. The latter model will be called the two-sided Kolmogorov model of a stationary process. Note that the shift $T$ for the two-sided model is invertible: $T$ is one-to-one and onto, and $T^{-1}$ is measure preserving. (The two-sided shift is, in fact, a homeomorphism on the compact space $A^Z$.) The projection of the two-sided measure onto $A^\infty$ is the Kolmogorov measure on $A^\infty$ for the process defined by $\{\mu_k\}$.

From a physical point of view it does not matter in the stationary case whether the one-sided or two-sided model is used. Proofs are sometimes simpler if the two-sided model is used, for invertibility of the shift on $A^Z$ makes things easier. In some cases where such a simplification is possible, only the proof for the invertible case will be given; such results will then apply to the one-sided model in the stationary case. In summary, for stationary processes there are two standard representations, one as the Kolmogorov measure $\mu$ on $A^\infty$, the other as the shift-invariant extension to the space $A^Z$, and "Kolmogorov measure" is taken to refer to either the one-sided measure or its completion, or to the two-sided measure or its completion, as appropriate to the context.

Remark I.1.3
Much of the success of modern probability theory is based on use of the Kolmogorov model, for asymptotic properties can often be easily formulated in terms of subsets of the infinite product space $A^\infty$. In addition, the model provides a concrete uniform setting for all processes; different processes with the same alphabet $A$ are distinguished by having different Kolmogorov measures on $A^\infty$.
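At the level of joint distributions, the stationarity condition (3) says that μ_k must also be recovered from μ_{k+1} by summing out the *first* coordinate. The following Python sketch is illustrative only (the helper names are invented, not from the book): it contrasts an i.i.d. process, which passes the test, with an independent but not identically distributed process, which fails it, consistent with the fact, noted in Example I.1.5 below, that an independent process is stationary if and only if it is identically distributed.

```python
from itertools import product

def is_stationary(mu_k, mu_next, alphabet, tol=1e-12):
    """Check the stationarity condition (3) at one level:
    mu_k(a_1^k) must equal Prob(X_2^{k+1} = a_1^k) = sum over b of mu_{k+1}(b a_1^k)."""
    k = len(next(iter(mu_k)))           # block length of mu_k
    for a in product(alphabet, repeat=k):
        shifted = sum(mu_next.get((b,) + a, 0.0) for b in alphabet)
        if abs(mu_k.get(a, 0.0) - shifted) > tol:
            return False
    return True

def independent_mu(marginals):
    """Joint distribution of independent coordinates with the given marginals,
    one dict per coordinate."""
    mu = {}
    for a in product((0, 1), repeat=len(marginals)):
        prob = 1.0
        for i, symbol in enumerate(a):
            prob *= marginals[i][symbol]
        mu[a] = prob
    return mu

p = {0: 0.3, 1: 0.7}
q = {0: 0.5, 1: 0.5}

# i.i.d.: independent and identically distributed -> stationary.
print(is_stationary(independent_mu([p]), independent_mu([p, p]), (0, 1)))   # True
# Independent but not identically distributed -> not stationary.
print(is_stationary(independent_mu([q]), independent_mu([q, p]), (0, 1)))   # False
```

Together with the consistency check for condition (1), this is precisely the finite-level content of the two-sided extension: a family satisfying both conditions determines a unique shift-invariant measure on A^Z.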
Example 1. STATIONARY PROCESSES.) The simplest example of a stationary finitealphabet process is an independent. A sequence of random variables. Thus a process is i. As will be shown in later sections.i. countablealphabet processes. even more can be gained in the stationary case by retaining the abstract concept of a stationary process as a sequence of random functions with a specified joint distribution on an arbitrary probability space.SECTION 1.) The simplest examples of finitealphabet dependent processes are the Markov processes.d.6 (Markov chains. An independent process is stationary if and only if it is identically distributed.d. in which case the I AI x IAI matrix M defined by Mab = Mbia = Prob(X„ = bi X„_i = a) . or even continuousalphabet processes. The Kolmogorov theorem extends easily to the countable case. 5 different Kolmogorov measures on A.d. if and only if its Kolmogorov measure is the product measure defined by some distribution on A. The i.i. For example. returntime processes associated with finitealphabet processes are. it is sometimes desirable to consider countablealphabet processes. It is identically distributed if Prob(X„ = a) = Prob(Xi = a) holds for all n > 1 and all a E A.d. processes will be useful in Chapter 4. and.a Examples. condition holds if and only if the product formula iak (al k )= n Ai (ai).1.i.1. when invertibility is useful the twosided model on A z can be used. is a Markov chain if (4) Prob (X n = an I V I 1 = ar l ) = Prob(Xn = an IXn _i =a_1) holds for all n > 2 and all a7 E An. in the stationary case. A sequence of random variables {X„} is independent if Prob(X. When completeness is needed the complete model can be used. as well as to the continuousalphabet i. Example 1. except in trivial cases.i.5 (Independent processes.d. and functions of uniformly distributed i.1. i= 1 k holds.) process. = an IX7 1 = ar I ) = Prob(X„ = an ). {X„}.1. 
Such a measure is called a product measure defined by Ai.4 While this book is primarily concerned with finitealphabet processes. identically distributed (i.i. Some standard examples of finitealphabet processes will now be presented. Remark 1. I.1. cases that will be considered. A Markov chain is said to be homogeneous or to have stationary transitions if Prob(X n = blXn _i = a) does not depend on n. holds for all n > 1 and all al' E An.
following the terminology used in [15 ]. any finitealphabet process is called a finitestate process. in general. The Markov chain {X. see Exercise 2. = an i Xn n :ki = an niki ) holds for all n > k and for all dii.8 (Finitestate processes. that transitions be stationary. Unless stated otherwise. and M (k) is the I BI x IBI matrix defined by Ai (k) _ bk — l' 1 1 r1 ak Prob(Xk ±i = bk I X ki = a k\ il .uk .6 CHAPTER I. what is here called a finitestate process is sometimes called a Markov source. where /2i(a) = Prob(X i = a). . In ergodic theory. Here "Markov" will be reserved for processes that satisfy the Markov property (4) or its more general form (5). for which iu 1 (a) > 0 for all a E A. Y.) A generalization of the Markov property allows dependence on k steps in the past. a Markov chain. that is. BASIC CONCEPTS. for all n > 1. Note that the finitestate process {Yn } is stationary if the chain {X„} is stationary. anix ri . The start distribution is called a stationary distribution of a homogeneous chain if the matrix product A I M is equal to the row vector pl.1. = biSn1).1T 1 ) = Prob(sn = s. A (an I a n v n—i _ ) = Prob(X n = anl.) Let A and S be finite sets. and B = fa: p. Note that probabilities for a kth order stationary chain can be calculated by thinking of it as a firstorder stationary chain with alphabet B.7 (Multistep Markov chains. 4 E A " . A process is called kstep Markov if (5) Prob(X. An Avalued process {Y. a E A. the process is stationary. Yn )} is a Markov chain.k (4`) = Prob(X i` = 4). and "finitestate process" will be reserved for processes satisfying property (6). is called the transition matrix of the chain.} is often referred to as the underlying Markov process or hidden Markov process associated with {Yn }. The start distribution is the (row) vector lit = Lai (s)}. = (sa . The positivity condition rules out transient states and is included so as to avoid trivialities. 
"stationary Markov process" will mean a Markov chain with stationary transitions for which A I M = tti and. .1. il n—k — n1\ an—k ) does not depend on n as long as n > k. and. If this holds. second. where ii. Yn = bisrl . then a direct calculation shows that.k (4c) > 01.41 ) . Prob(X. 0 Lk—I •fu i i = a2 k otherwise. that A k M (k) = . In information theory. {Yn } is a finitestate process if there is a finitealphabet process {s n } such that the pair process {X. The condition that a Markov chain be a stationary process is that it have stationary transitions and that its start distribution is a stationary distribution. indeed. In other words. Example 1. The two conditions for stationarity are. if (6) Prob(sn = s. but {Yn } is not. Example 1. and what is here called a finitestate process is called a function (or lumping) of a Markov chain.} is said to be a finitestate process with state space S. first.
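The stationary start distribution required above, a row vector π with πM = π, can be computed numerically. The sketch below is not from the book (the function name and the particular chain are invented for illustration); it approximates π by power iteration for a two-state chain and then checks the stationarity condition μ₁M = μ₁ directly.

```python
def stationary_distribution(M, iters=2000):
    """Approximate the stationary row vector pi, with pi M = pi, by power
    iteration. M is a square row-stochastic matrix given as a list of rows."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[a] * M[a][b] for a in range(n)) for b in range(n)]
    return pi

# A two-state homogeneous chain; its stationary distribution is (5/6, 1/6).
M = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(M)
print(pi)

# Started from pi, the one-step block probabilities are shift-invariant:
# Prob(X_1 = a, X_2 = b) = pi[a] * M[a][b], and summing over the first index
# returns pi[b], which is exactly the condition mu_1 M = mu_1.
assert all(abs(sum(pi[a] * M[a][b] for a in range(2)) - pi[b]) < 1e-9
           for b in range(2))
```

For this chain both entries of π are strictly positive, so it also satisfies the positivity condition demanded of a "stationary Markov process" above; a chain with transient states would put mass 0 on some symbol.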
) is called an instantaneous function or simply a function of the {X n } process. Vx E Az . x E AZ. Finite coding is sometimes called slidingblock or slidingwindow coding. the value y. The function f is called the timezero coder associated with F. the image y = F(x) is the sequence defined by y. the contents of the window are shifted to the left) eliminating xn _ u. an idea most easily expressed in terms of the twosided representation of a stationary process.71 Z E Z.) The concept of finitestate process can be generalized to allow dependence on the past and future. Define the function f: Az i. A case of particular interest occurs when the timezero coder f depends on only a finite number of coordinates of the input sequence x. for one can think of the coding as being done by sliding a window of width 2w ± 1 along x. depends only on xn and the values of its w past and future neighbors.).SECTION 1. otherwise known as the persymbol or timezero coder. At time n the window will contain Xn—w. the window is shifted to the right (that is. In the special case when w = 0 the encoding is said to be instantaneous.9 (Stationary coding. that is. Xn—w+1. that is. • • • 9 Xn. STATIONARY PROCESSES.1. = f (T:1 4 x). The coder F transports a Borel measure IL on Az into the measure v = it o F 1 . To determine yn+1 . n E Z. Note that a stationary coding of a stationary process is automatically a stationary process. that is.B by the formula f (x) = F(x)o. n Yn = f (x'. sometimes called the full coder. or in terms of the sequencetosymbol coder f. Suppose A and B are finite sets. that is.Bz is a stationary coder if F(TA X) = TB F(x). In this case it is common practice to write f (x) = f (xw w ) and say that f (or its associated full coder F) is a finite coder with window halfwidth w. f (x) is just the 0th coordinate of y = F (x). through the function f. • • • 1 Xnfw which determines the value y.„ with coder F. 
Example 1.1.9 (Stationary coding.) Let A and B be finite sets, and let T_A and T_B denote the shifts on the respective two-sided sequence spaces A^Z and B^Z. A Borel measurable mapping F: A^Z → B^Z is called a stationary coder if

F(T_A x) = T_B F(x), for all x ∈ A^Z.

Given a stationary measure μ on A^Z, the encoded measure ν is defined for Borel subsets C of B^Z by ν(C) = μ(F^{-1}(C)). The encoded process, whose Kolmogorov measure is ν, is said to be a stationary coding of μ. Associated with F is its time-zero coder f: A^Z → B, defined by f(x) = F(x)_0. The stationary coder F and its time-zero coder f are connected by the formula F(x)_n = f(T_A^n x), n ∈ Z, so the stationary coding idea can be expressed either in terms of the sequence-to-sequence coder F or in terms of f: the encoded sequence y = F(x) is defined by the formula y_n = f(T_A^n x), and the encoded process {Y_n} is defined by Y_n = f(T_A^n X), where {X_n} is the process with Kolmogorov measure μ. The Kolmogorov measures, μ and ν, of the respective processes {X_n} and {Y_n} are thus connected by the formula ν = μ ∘ F^{-1}. The word "coder" or "encoder" may refer to either F or f.

The coder f is said to be a finite coder when there is a nonnegative integer w, called the window half-width of the code, such that f(x) = f(x_{-w}^{w}), that is, f(x) depends only on the coordinates x_{-w}, ..., x_w. A process {Y_n} is a finite coding of a process {X_n} if it is a stationary coding of {X_n} with finite-width time-zero coder; in this case Y_n is measurable with respect to the σ-algebra Σ(X_{n-w}, ..., X_{n+w}) generated by the random variables X_{n-w}, ..., X_{n+w}. Such a finite coder f defines a mapping, also denoted by f, from blocks to blocks, by the formula f(x_{m-w}^{n+w}) = y_m^n, where y_j = f(x_{j-w}^{j+w}), m ≤ j ≤ n. The key property of finite codes is that the inverse image of a cylinder set of length n is a finite union of cylinder sets of length n + 2w; if F is the full coder defined by f, then

F^{-1}([y_m^n]) = ∪ [x_{m-w}^{n+w}],

where the union is over all x_{m-w}^{n+w} such that f(x_{j-w}^{j+w}) = y_j, m ≤ j ≤ n. In particular, cylinder set probabilities for the encoded measure ν = μ ∘ F^{-1} are given by the formula

ν([y_m^n]) = Σ μ([x_{m-w}^{n+w}]),

the sum being over the same collection of blocks. Note that a finite-state process is merely a finite coding of a Markov chain in which the window half-width is 0. More generally, a finite-alphabet process is called a B-process if it is a stationary coding of some i.i.d. process, with finite or infinite alphabet. Much more will be said about B-processes in later parts of this book, especially in Chapter 4, where several characterizations will be discussed.

Example 1.1.10 (Block codings.) Another type of coding, called block coding, is frequently used in practice, and is often easier to analyze than is stationary coding. An N-block code is a function C_N: A^N → B^N, where A and B are finite sets. Such a code can be used to map an A-valued process {X_n} into a B-valued process {Y_n} by applying C_N to consecutive nonoverlapping blocks of length N, that is,

Y_{jN+1}^{(j+1)N} = C_N(X_{jN+1}^{(j+1)N}), j = 0, 1, 2, ....

The process {Y_n} is called the N-block coding of the process {X_n} defined by the N-block code C_N. Block coding, however, destroys stationarity; the encoded process is not, in general, stationary, although it is N-stationary, that is, its Kolmogorov measure is invariant under the N-fold shift T^N. There is a simple way to convert an N-stationary process, such as the encoded process {Y_n}, into a stationary process, called randomizing the start. The process {Ỹ_n} is defined by selecting an integer u ∈ [1, N] according to the uniform distribution and defining

Ỹ_i = Y_{u+i-1}, i = 1, 2, ....

If ν is the Kolmogorov measure of the N-stationary process {Y_n}, the Kolmogorov measure ν̄ of the process {Ỹ_n} is given by the relation

ν̄(C) = (1/N) Σ_{i=0}^{N-1} ν(T^{-i}C), C a Borel set,

where T = T_B, the shift on B^∞.
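The block-coding and randomizing-the-start constructions are easy to simulate. The following is a minimal sketch (the function names are mine, not from the text), applied to the binary 2-block code used later in this chapter; only finite sample paths are handled, with a trailing partial block dropped.

```python
import random

def n_block_code(x, N, C):
    """Apply the N-block code C (a dict from N-tuples to N-tuples)
    to consecutive nonoverlapping N-blocks of the sequence x."""
    y = []
    for j in range(len(x) // N):
        block = tuple(x[j*N:(j+1)*N])
        y.extend(C[block])
    return y

def randomize_start(y, N):
    """Sample the randomized-start process: choose u uniformly in
    {1, ..., N} and output Y~_i = Y_{u+i-1}."""
    u = random.randint(1, N)          # endpoints inclusive
    return y[u-1:]

# The 2-block code C(01) = 00, C(10) = 11 from a later example.
C = {(0, 1): (0, 0), (1, 0): (1, 1)}
x = [0, 1] * 6                        # sample path 010101...
y = n_block_code(x, 2, C)             # encoded path 000000...
```

The encoded path here is constant, illustrating how a block code can collapse the periodic structure of the input onto a single block pattern.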
In summary, an N-block code C_N induces a block coding of a stationary process {X_n} onto an N-stationary process {Y_n}, which can then be converted to a stationary process by randomizing the start. In general, block coding introduces a periodic structure in the encoded process, a structure which is inherited, with a random delay, by the randomized-start process {Ỹ_n}; many useful properties of {X_n} may be destroyed, and the final stationary process {Ỹ_n} is not a stationary encoding of the original process {X_n}. In Section 1.8, a general procedure will be developed for converting a block code to a stationary code by inserting spacers between blocks, so that the resulting process is a stationary coding of the original process. The method of randomizing the start clearly generalizes to convert any N-stationary process into a stationary process.

Example 1.1.11 (Concatenated-block processes.) Processes can also be constructed by concatenating N-blocks drawn independently at random according to a given distribution, with stationarity produced by randomizing the start. Suppose X_1^N = (X_1, ..., X_N) is a random vector with values in A^N. Let {Y_n} be the A-valued process defined by the requirement that the blocks Y_{(j-1)N+1}^{jN}, j = 1, 2, ..., be independent, each with the distribution of X_1^N. The process {Y_n} is N-stationary, and the process {Ỹ_n} obtained from {Y_n} by randomizing the start is called the concatenated-block process defined by the random vector X_1^N. There are several equivalent ways to describe such a process. In random variable terms, the concatenated-block process is characterized by the following two conditions.

(i) {Ỹ_n} is stationary.

(ii) There is a random variable U, uniformly distributed on {1, 2, ..., N}, such that, conditioned on U = u, the blocks {Ỹ_{jN+2-u}^{(j+1)N+1-u}: j = 1, 2, ...} are independent, each with the distribution of X_1^N.

An alternative formulation of the same idea in terms of measures is obtained by starting with a probability measure μ on A^N, forming the product measure μ* on (A^N)^∞, transporting μ* to the N-stationary measure μ̂* = μ* ∘ ψ^{-1} on A^∞ via the mapping ψ defined by

ψ(w(1), w(2), ...) = x, where x_{(j-1)N+1}^{jN} = w(j), j = 1, 2, ...,

and then averaging to obtain the measure μ̄ defined by the formula

μ̄(B) = (1/N) Σ_{i=0}^{N-1} μ̂*(T^{-i}B),

for each Borel subset B of A^∞.
The process defined by μ̄ has the properties (i) and (ii), hence this measure terminology is consistent with the random variable terminology; the process defined by μ̄ is called the concatenated-block process defined by the measure μ. (See Exercise 7.)

A third formulation, which is quite useful, represents a concatenated-block process as a finite-state process, that is, a function of a Markov chain. To describe this representation, fix a measure μ on A^N and let {Y_n} be the (stationary) Markov chain with alphabet S = A^N × {1, 2, ..., N}, defined by the following transition and starting rules.

(a) If i < N, then (a_1^N, i) can only go to (a_1^N, i + 1).

(b) (a_1^N, N) goes to (b_1^N, 1) with probability μ(b_1^N), for each b_1^N ∈ A^N.

The process is started by selecting (a_1^N, i) with probability μ(a_1^N)/N. The function of this chain defined by setting Ỹ_n = a_i, if Y_n = (a_1^N, i), then yields the concatenated-block process defined by μ. Concatenated-block processes will play an important role in many later discussions in this book.

Example 1.1.12 (Blocked processes.) Associated with a stationary process {X_n} and a positive integer N are two different stationary processes, {Z_n} and {W_n}, defined, for n ∈ Z, by

Z_n = (X_{(n-1)N+1}, X_{(n-1)N+2}, ..., X_{nN}), W_n = (X_n, X_{n+1}, ..., X_{n+N-1}).

The Z-process is called the nonoverlapping N-block process, while the W-process is called the overlapping N-block process, determined by {X_n}. Each is stationary if {X_n} is stationary. A related process of interest, called the Nth term process, selects every Nth term from the {X_n} process; it is defined by Y_n = X_{(n-1)N+1}, n ≥ 1.

If the process {X_n} is i.i.d., then its nonoverlapping blockings are i.i.d.; overlapping, however, clearly destroys independence. The overlapping N-blocking of an i.i.d. or Markov process is nevertheless Markov. In fact, if {X_n} is Markov with transition matrix M_{ab}, then the overlapping N-block process is Markov with alphabet B = {a_1^N: μ(a_1^N) > 0} and transition matrix defined by

M_{a_1^N b_1^N} = M_{a_N b_N}, if b_1^{N-1} = a_2^N; = 0, otherwise,

while the nonoverlapping N-block process is Markov with alphabet B and transition matrix defined by

M_{a_1^N b_1^N} = M_{a_N b_1} Π_{i=1}^{N-1} M_{b_i b_{i+1}}.

Note also that if {X_n} is Markov with transition matrix M, then the Nth term process {Y_n} is Markov with transition matrix M^N, the Nth power of M.
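The two blockings are simple to compute on a finite sample path; the sketch below (function names are mine) also checks the defining overlap property of consecutive W-blocks.

```python
def nonoverlapping_blocks(x, N):
    """Z_n = (X_{(n-1)N+1}, ..., X_{nN}): consecutive disjoint N-blocks."""
    return [tuple(x[j*N:(j+1)*N]) for j in range(len(x) // N)]

def overlapping_blocks(x, N):
    """W_n = (X_n, ..., X_{n+N-1}): sliding N-blocks."""
    return [tuple(x[i:i+N]) for i in range(len(x) - N + 1)]

x = [0, 1, 1, 0, 1, 0]
Z = nonoverlapping_blocks(x, 2)   # [(0,1), (1,0), (1,0)]
W = overlapping_blocks(x, 2)      # [(0,1), (1,1), (1,0), (0,1), (1,0)]
```

The last N − 1 symbols of each W-block coincide with the first N − 1 symbols of the next, which is exactly the support constraint appearing in the overlapping transition matrix above.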
I.1.b Probability tools.

Two elementary results from probability theory will be frequently used: the Markov inequality and the Borel-Cantelli principle. They are summarized as follows.

Lemma 1.1.13 (The Markov inequality.) Let f be a nonnegative, integrable function on a probability space (X, Σ, μ). If ∫ f dμ < cδ, then f(x) < c, except for a set of measure at most δ.

Lemma 1.1.14 (The Borel-Cantelli principle.) If {C_n} is a sequence of measurable sets in a probability space (X, Σ, μ) such that Σ_n μ(C_n) < ∞, then for almost every x there is an N = N(x) such that x ∉ C_n, n ≥ N.

In general, a property P is said to be measurable if the set of all x for which P(x) is true is a measurable set. If {P_n} is a sequence of measurable properties, then

(a) P_n(x) holds eventually almost surely, if for almost every x there is an N = N(x) such that P_n(x) is true for n ≥ N;

(b) P_n(x) holds infinitely often, almost surely, if for almost every x there is an increasing sequence {n_i} of integers, which may depend on x, such that P_{n_i}(x) is true for i = 1, 2, ....

For example, the Borel-Cantelli principle is often expressed by saying that if Σ μ(C_n) < ∞ then x ∉ C_n, eventually almost surely. Almost-sure convergence is often established using the following generalization of the Borel-Cantelli principle. The proof of this is left to the exercises.

Lemma 1.1.15 (The iterated Borel-Cantelli principle.) Suppose {G_n} and {B_n} are two sequences of measurable sets such that x ∈ G_n, eventually almost surely, and Σ_n μ(B_n ∩ G_n) < ∞. Then x ∉ B_n, eventually almost surely.

In many applications, the fact that x ∉ B_n, eventually almost surely, is established by showing that Σ μ(B_n ∩ G_n) < ∞ for a suitable sequence {G_n} with x ∈ G_n, eventually almost surely, in which case the iterated Borel-Cantelli principle is, indeed, just a generalized Borel-Cantelli principle.

Frequent use will be made of various equivalent forms of almost sure convergence, summarized as follows.

Lemma 1.1.16 The following are equivalent for measurable functions on a probability space.

(a) f_n → f, almost surely.

(b) |f_n(x) − f(x)| < ε, eventually almost surely, for every ε > 0.

(c) Given ε > 0, there is an N and a set G of measure at least 1 − ε, such that |f_n(x) − f(x)| < ε, for x ∈ G and n ≥ N.
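The Markov inequality can be checked numerically on a finite space with uniform measure, where integrals reduce to averages. The following sketch is mine (the function name and test data are illustrative, not from the text).

```python
def markov_bound_holds(values, c, delta):
    """On a finite space with uniform measure, check the Markov
    inequality: if (1/n) * sum(f) < c * delta, then the set where
    f >= c has measure at most delta."""
    n = len(values)
    integral = sum(values) / n                  # integral of f, uniform measure
    if integral >= c * delta:
        return True                             # hypothesis not met; nothing to check
    bad = sum(1 for v in values if v >= c) / n  # measure of {f >= c}
    return bad <= delta

# f takes value 100 on 1% of the points and 0 elsewhere, so the
# integral is 1; with c = 150 and delta = 0.05 the bound must hold.
f = [100.0] * 1 + [0.0] * 99
```

The underlying argument is one line: c · μ{f ≥ c} ≤ ∫ f dμ < cδ, so μ{f ≥ c} < δ.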
As several of the preceding examples suggest, a process is often specified as a function of some other process: first see what is happening on a set of probability 1 in the old process, then transfer to the new process. Probabilities for the new process can then be calculated by using the inverse image to transfer back to the old process. A complication arises, namely, that Borel images of Borel sets may not be Borel sets. For nice spaces, such as product spaces, this is not a serious problem, for such images are always measurable with respect to the completion of the image measure. This fact is summarized by the following lemma.

Lemma 1.1.17 (The Borel mapping lemma.) Let μ be a Borel measure on A^∞, let F be a Borel function from A^∞ into B^∞, let ν = μ ∘ F^{-1}, and let ν̄ be the completion of ν. If X is a Borel set such that μ(X) = 1, then F(X) is measurable with respect to the completion ν̄ of ν, and ν̄(F(X)) = 1.

A similar result holds with either A^∞ or B^∞ replaced, respectively, by A^Z or B^Z.

Sometimes it is useful to go in the opposite direction. Use will also be made of two almost trivial facts about the connection between cardinality and probability, namely, that lower bounds on probability give upper bounds on cardinality, and that upper bounds on cardinality "almost" imply lower bounds on probability. For ease of later reference these are stated here as the following lemma.

Lemma 1.1.18 (Cardinality bounds.) Let μ be a probability measure on the finite set A, let B ⊆ A, and let α be a positive number.

(a) If μ(a) ≥ α, a ∈ B, then |B| ≤ 1/α.

(b) For b ∈ B, except for a subset of B of measure at most α, μ(b) > α/|B|.

Deeper results from probability theory, such as the martingale theorem, the central limit theorem, the law of the iterated logarithm, and the renewal theorem, will not play a major role in this book, though they may be used in various examples, and sometimes the martingale theorem is used to simplify an argument.

I.1.c Exercises.

1. Let {U_n} be i.i.d., with each U_n uniformly distributed on [0, 1]. Define X_n = 1, if U_n > U_{n-1}; otherwise X_n = 0. Show that {X_n} is 1-dependent, where a stationary process is said to be m-dependent if μ([a_1^n] ∩ [b_{n+m+1}^{n+m+k}]) = μ([a_1^n]) μ([b_{n+m+1}^{n+m+k}]), for all n and k, and all a_1^n and b_{n+m+1}^{n+m+k}. Then show that the probability of n consecutive 0's is 1/(n + 1)!, and that {X_n} is not a finite coding of a finite-alphabet i.i.d. process. (Hint: show that such a coding would have the property that there is a number c > 0 such that μ(a_1^n) ≥ c^n, whenever μ(a_1^n) ≠ 0.)

2. Show that a finite coding of an i.i.d. process is m-dependent for some m. (How is m related to window half-width?)

3. Give an example of a function of a finite-alphabet Markov chain that is not Markov of any order. (Include a proof that your example is not Markov of any order.)

4. Prove Lemma 1.1.18.
5. Show that the finite-state representation of a concatenated-block process in Example 1.1.11 satisfies the two conditions (i) and (ii) of that example.

6. Prove the iterated Borel-Cantelli principle, Lemma 1.1.15.

7. A measure μ on A^n defines a measure μ^{(N)} on A^N, for each N, by the formula

μ^{(N)}(a_1^N) = μ(a_{Kn+1}^{Kn+r}) Π_{k=0}^{K-1} μ(a_{kn+1}^{(k+1)n}), where N = Kn + r, 0 ≤ r < n,

the first factor being computed with the r-dimensional marginal of μ. Show that the measures {μ^{(N)}} satisfy the Kolmogorov consistency conditions, hence have a common extension μ* to A^∞, and show that the concatenated-block measure μ̄ defined by μ is an average of shifts of μ*, that is, μ̄(B) = (1/n) Σ_{i=0}^{n-1} μ*(T^{-i}B), for each Borel subset B of A^∞.

8. Establish the Kolmogorov representation theorem for countable alphabet processes.

Section 1.2 The ergodic theory model.

Ergodic theory is concerned with the orbits x, Tx, T^2x, ..., of a transformation T: X → X on some given space X. In many cases of interest there is a natural probability measure preserved by T, relative to which information about orbit structure can be expressed in probability language. Finite measurements on the space X, which correspond to finite partitions of X, then give rise to stationary processes. This model, called the transformation/partition model for a stationary process, is the subject of this section and is the basis for much of the remainder of this book.

The Kolmogorov model for a stationary process implicitly contains the transformation/partition model, and this suggests the possibility of using ergodic theory ideas in the study of stationary processes. The Kolmogorov measure for a stationary process is preserved by the shift T on the sequence space X = A^∞ (or on the space A^Z): the shift is the transformation, and the partition is P = {P_a: a ∈ A}, where P_a = {x: x_1 = a}, for each a ∈ A. The partition P is called the Kolmogorov partition associated with the process. The sequence of random variables and the joint distributions are expressed in terms of the shift and Kolmogorov partition as follows. First, associate with the partition P the random variable X_P defined by X_P(x) = a, if a is the label of the member of P to which x belongs. The coordinate functions are then given by the formula

(1) X_n(x) = X_P(T^{n-1}x), n ≥ 1,

that is, for each n, X_n(x) is the label of the member of P to which T^{n-1}x belongs. The process can therefore be described as follows: pick a point x ∈ X, that is, an infinite sequence, at random according to the Kolmogorov measure; the values X_1(x), X_2(x), ... then record the successive members of the partition visited by the orbit of x.
Since to say that T^{n-1}x ∈ P_a is the same as saying that x ∈ T^{-n+1}P_a, cylinder sets and joint distributions may be expressed by the respective formulas

[a_1^k] = ∩_{i=1}^k T^{-i+1}P_{a_i},

and

(2) μ_k(a_1^k) = μ([a_1^k]) = μ(∩_{i=1}^k T^{-i+1}P_{a_i}).

The concept of stationary process can thus be formulated in terms of the abstract concepts of measure-preserving transformation and partition, as follows. Let (X, Σ, μ) be a probability space. A mapping T: X → X is said to be measurable if T^{-1}B ∈ Σ, for all B ∈ Σ, and measure-preserving if it is measurable and μ(T^{-1}B) = μ(B), for all B ∈ Σ. A partition P = {P_a: a ∈ A} of X is a finite, disjoint collection of measurable sets, indexed by a finite set A, whose union has measure 1, that is, X − ∪_a P_a is a null set. (In some situations, countable partitions, that is, partitions into countably many sets, are useful.) Associated with the partition P = {P_a: a ∈ A} is the random variable X_P defined by X_P(x) = a, if x ∈ P_a. The random variable X_P and the measure-preserving transformation T together define a process by the formula

(3) X_n(x) = X_P(T^{n-1}x), n ≥ 1.

The process {X_n: n ≥ 1} defined by (3) is called the process defined by the transformation T and partition P, or, more simply, the (T, P)-process. The kth order distribution μ_k of the process {X_n} is given by the formula

(4) μ_k(a_1^k) = μ(∩_{i=1}^k T^{-i+1}P_{a_i}),

the direct analogue of the sequence space formula (2).

The (T, P)-process may also be described as follows. Pick x ∈ X at random according to the μ-distribution and let X_1(x) be the label of the set in P to which x belongs. Then apply T to x to obtain Tx, and let X_2(x) be the label of the set in P to which Tx belongs. Continuing in this manner, the values X_1(x), X_2(x), X_3(x), ..., tell to which set of the partition the corresponding member of the random orbit belongs.

The sequence {x_n = X_n(x): n ≥ 1}, defined for a point x ∈ X by the formula T^{n-1}x ∈ P_{x_n}, n ≥ 1, is called the (T, P)-name of x. In the Kolmogorov representation the (T, P)-name of x is the same as x, (or the forward part of x in the two-sided case.)
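The (T, P)-name construction in (3) is easy to simulate for concrete transformations. The sketch below (function names are mine) uses a toy four-point rotation T x = x + 1 mod 4 with the even/odd two-set partition.

```python
def tp_name(x, T, label, n):
    """First n terms of the (T, P)-name of x: the partition label of
    x, Tx, T^2 x, ...."""
    name = []
    for _ in range(n):
        name.append(label(x))
        x = T(x)
    return name

# Cyclic rotation on {0, 1, 2, 3}, partitioned into evens and odds.
T = lambda x: (x + 1) % 4
label = lambda x: x % 2
```

With this T and partition, every name is the alternating sequence 0, 1, 0, 1, ..., or its shift, a degenerate but transparent instance of formula (3).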
In summary, a stationary process can be thought of as a shift-invariant measure on a sequence space, or, equivalently, as a measure-preserving transformation T and partition P of an arbitrary probability space. Modern probability theory starts with the sequence space concept; the ergodic theory point of view starts with the transformation/partition concept. The (T, P)-process concept is, in essence, just an abstract form of the stationary coding concept, in that a partition P of (X, Σ, μ) gives rise to a measurable function F: X → B^∞ which carries μ onto the Kolmogorov measure ν of the (T, P)-process. The mapping F extends to a stationary coding of A^Z to B^Z in the case when X = A^Z and T is the shift. Conversely, a stationary coding F: A^Z → B^Z, with time-zero coder f, carrying μ onto ν, determines a partition P = {P_b: b ∈ B} of A^Z, defined by P_b = {x: f(x) = b}, such that ν is the Kolmogorov measure of the (T_A, P)-process. See Exercise 2 and Exercise 3.

I.2.a Ergodic processes.

It is natural to study the orbits of a transformation by looking at its action on invariant sets, since once an orbit enters an invariant set it never leaves it. A measurable set B is said to be T-invariant if TB ⊆ B. The space X is T-decomposable if it can be expressed as the disjoint union X = X_1 ∪ X_2 of two measurable invariant sets, each of positive measure. The condition that TX_i ⊆ X_i, i = 1, 2, translates into the statement that T^{-1}X_i = X_i, i = 1, 2; hence to say that the space is indecomposable is to say that if T^{-1}B = B then μ(B) is 0 or 1. The natural object of study thus becomes the restriction of the transformation to sets that cannot be split into nontrivial invariant sets. This leads to the concept of ergodic transformation. It is standard practice to use the word "ergodic" to mean that the space is indecomposable: a measure-preserving transformation T is said to be ergodic if

(5) T^{-1}B = B ⟹ μ(B) = 0 or μ(B) = 1.

The following lemma contains several equivalent formulations of the ergodicity condition.

Lemma 1.2.1 (Ergodicity equivalents.) The following are equivalent for a measure-preserving transformation T on a probability space.

(a) T is ergodic.

(b) T^{-1}B ⊆ B ⟹ μ(B) = 0 or μ(B) = 1.

(c) T^{-1}B ⊇ B ⟹ μ(B) = 0 or μ(B) = 1.

(d) μ(T^{-1}B Δ B) = 0 ⟹ μ(B) = 0 or μ(B) = 1.

(e) U_T f = f, a.e., implies that f is constant, a.e.

The notation C Δ D = (C − D) ∪ (D − C) denotes symmetric difference, and U_T denotes the operator on functions defined by (U_T f)(x) = f(Tx), where the domain of U_T can be taken to be any one of the L^p-spaces or the space of measurable functions.
A stationary process, that is, a shift-invariant measure on sequence space, is ergodic if the shift is ergodic relative to the Kolmogorov measure. As will be shown in Section 1.4, to say that a stationary process is ergodic is equivalent to saying that measures of cylinder sets can be determined, for almost every x, by counting limiting relative frequencies along the sample path x. Thus the concept of ergodic process, which is natural from the transformation point of view, is equivalent to an important probability concept.

Proof The equivalence of the first two follows from the fact that if T^{-1}B ⊆ B and C = ∩_{n≥0} T^{-n}B, then T^{-1}C = C and μ(C) = μ(B). The proofs of the other equivalences are left to the reader.

Remark 1.2.2 In the particular case when T is invertible, that is, when T is one-to-one and for each measurable set C the set TC is measurable and has the same measure as C, the conditions for ergodicity can be expressed in terms of the action of T, rather than T^{-1}. In particular, an invertible T is ergodic if and only if any T-invariant set has measure 0 or 1. Furthermore, note that if T is invertible then U_T is a unitary operator on L^2.

I.2.b Examples of ergodic processes.

Examples of ergodic processes include:

1. i.i.d. processes.
2. irreducible Markov chains and functions thereof.
3. stationary codings of ergodic processes.
4. concatenated-block processes.
5. processes obtained from a block coding of an ergodic process by randomizing the start.

These and other examples will now be discussed, and some, but not all, of the proofs will be given.

Example 1.2.3 (The Baker's transformation.) A simple geometric example provides a transformation and partition for which the resulting process is the familiar coin-tossing process. Let X = [0, 1) × [0, 1) denote the unit square and define a transformation T by

T(s, t) = (2s, t/2), if s < 1/2; T(s, t) = (2s − 1, (t + 1)/2), if s ≥ 1/2.

The transformation T is called the Baker's transformation, since its action can be described as follows. Cut the unit square into two columns of equal width; squeeze each column down to height 1/2 and stretch it to width 1; then place the right rectangle on top of the left to obtain a square. (See Figure 1.2.4.)

Figure 1.2.4 The Baker's transformation.
The Baker's transformation T preserves Lebesgue measure in the square, for dyadic subsquares (which generate the Borel field) are mapped onto rectangles of the same area. To obtain the coin-tossing process define the two-set partition P = {P_0, P_1} by setting

(6) P_0 = {(s, t): s < 1/2}, P_1 = {(s, t): s ≥ 1/2}.

To assist in showing that the (T, P)-process is, indeed, the coin-tossing process, some useful partition notation and terminology will be developed. Let P = {P_a: a ∈ A} be a partition of the probability space (X, Σ, μ). The distribution of P is the probability distribution {μ(P_a): a ∈ A}. Two partitions P and Q are independent if μ(P_a ∩ Q_b) = μ(P_a)μ(Q_b), for all a, b, that is, if the distribution of the restriction of P to each Q_b ∈ Q does not depend on b. The join, P ∨ Q, of two partitions P and Q is their common refinement, the partition

P ∨ Q = {P_a ∩ Q_b: P_a ∈ P, Q_b ∈ Q}.

The join ∨_1^k P^{(i)} of a finite sequence P^{(1)}, P^{(2)}, ..., P^{(k)} of partitions is their common refinement, defined inductively by ∨_1^{k+1} P^{(i)} = (∨_1^k P^{(i)}) ∨ P^{(k+1)}. A sequence of partitions {P^{(i)}: i ≥ 1} is said to be independent if P^{(k+1)} and ∨_1^k P^{(i)} are independent, for each k ≥ 1.

For the Baker's transformation and partition (6), it can be shown that, for each n ≥ 1, the partition T^{-n}P is independent of ∨_{i=0}^{n-1} T^{-i}P, so that {T^{-i}P} is an independent sequence. This, together with the fact that each of the two sets P_0, P_1 has measure 1/2, implies that the (T, P)-process is just the coin-tossing process, that is, the binary, symmetric, i.i.d. process. Figure 1.2.5 illustrates the partitions P, T^{-1}P, and T^{-2}P, along with the join of these partitions. Note that T^{-2}P partitions each set of P ∨ T^{-1}P into exactly the same proportions as it partitions the entire space X. This is precisely the meaning of independence.

Figure 1.2.5 The join of P, T^{-1}P, and T^{-2}P.

In general, the i.i.d. process defined by an arbitrary first-order distribution μ_1(a), a ∈ A, is given by a generalized Baker's transformation, described as follows.
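The connection between the Baker's transformation and coin tossing can be seen directly in code: the (T, P)-name of a point reads off the binary digits of its first coordinate. The sketch below is mine (the text does not give an algorithm), and floating-point arithmetic is exact only for the dyadic inputs used here.

```python
def baker(s, t):
    """The Baker's transformation on the unit square."""
    if s < 0.5:
        return 2*s, t/2
    return 2*s - 1, (t + 1)/2

def coin_name(s, t, n):
    """First n terms of the (T, P)-name for P_0 = {s < 1/2},
    P_1 = {s >= 1/2}."""
    out = []
    for _ in range(n):
        out.append(0 if s < 0.5 else 1)
        s, t = baker(s, t)
    return out

# s = 0.625 = 0.101 in binary, so the name should begin 1, 0, 1.
digits = coin_name(0.625, 0.0, 3)
```

Under Lebesgue measure the binary digits of s are independent fair bits, which is exactly the claim that the (T, P)-process is the coin-tossing process.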
Example 1.2.6 (The generalized Baker's transformation.) Cut the unit square into |A| columns, labeled by the letters of A, such that the column labeled a has width μ_1(a). For each a ∈ A, squeeze the column labeled a down and stretch it to obtain a rectangle of height μ_1(a) and width 1. Stack the rectangles, in some fixed order, to obtain a square. (See Figure 1.2.6.)

Figure 1.2.6 The generalized Baker's transformation.

The corresponding generalized Baker's transformation T preserves Lebesgue measure, and the partition P into the columns of the first step then produces the desired i.i.d. process, that is, the (T, P)-process is the i.i.d. process with first-order distribution μ_1.

Example 1.2.7 (Ergodicity for i.i.d. processes.) To show that a process is ergodic it is sometimes easier to verify a stronger property, called mixing. A transformation is mixing if

(7) lim_{n→∞} μ(T^{-n}C ∩ D) = μ(C)μ(D), C, D ∈ Σ.

A stationary process is mixing if the shift is mixing for its Kolmogorov measure. Mixing clearly implies ergodicity, for if T^{-1}C = C then T^{-n}C ∩ D = C ∩ D, for all sets D and positive integers n, so that the mixing property implies that μ(C ∩ D) = μ(C)μ(D). Since this holds for all sets D, the choice D = C gives μ(C) = μ(C)^2, and hence μ(C) is 0 or 1.

Suppose μ is a product measure on A^∞. To show that the mixing condition (7) holds for the shift, first note that it is enough to establish the condition for any two sets in a generating algebra, hence it is enough to show that it holds for any two cylinder sets. But this is easy, for if C = [a_1^m] and D = [b_1^m], and N > m, then T^{-N}C and D depend on values of x_i for indices i in disjoint sets of integers. Since the measure is product measure, this means that

μ(T^{-N}C ∩ D) = μ(T^{-N}C)μ(D) = μ(C)μ(D), N > m.

Thus i.i.d. measures satisfy the mixing condition, and hence i.i.d. processes are ergodic.

Example 1.2.8 (Ergodicity for Markov chains.) The location of the zero entries, if any, in the transition matrix M determines whether a Markov process is ergodic. The stochastic matrix M is said to be irreducible if for any pair i, j of states there is a sequence i_0, i_1, ..., i_n, with i_0 = i and i_n = j, such that M_{i_m i_{m+1}} > 0, 0 ≤ m ≤ n − 1. The results are summarized here; the reader may refer to standard probability books or to the extensive discussion in [29] for details.
This is just the assertion that if the chain is at state i at some time, then there is a positive probability that it will be at state j at some later time. An irreducible Markov chain is ergodic. To see this, note first that if M is irreducible there is a unique probability vector π such that πM = π; each entry of π is positive, and

(8) lim_{N→∞} (1/N) Σ_{n=1}^N M^n = P,

where each row of the k × k limit matrix P is equal to π. (In fact, the limit of the averages of powers can be shown to exist for an arbitrary finite stochastic matrix M; in the irreducible case there is only one probability vector π such that πM = π, and the limit matrix P has all its rows equal to π.) The condition πM = π shows that the Markov chain with start distribution μ_1 = π and transition matrix M is stationary.

To prove that an irreducible chain is ergodic it is enough to show that

(9) lim_{N→∞} (1/N) Σ_{n=1}^N μ(T^{-n}C ∩ D) = μ(C)μ(D),

for any cylinder sets C and D, hence for any measurable sets C and D. This condition, which is weaker than the mixing condition (7), implies ergodicity, for if T^{-1}C = C, then the choice D = C in the limit formula (9) gives μ(C) = μ(C)^2, so that C must have measure 0 or 1.

To establish (9) for an irreducible chain, let D = [d_1^n] and C = [c_1^m], and write

T^{-k-n}C = [b_{n+k+1}^{n+k+m}], where b_{n+k+i} = c_i, 1 ≤ i ≤ m.

Thus when k > 0 the sets T^{-k-n}C and D depend on different coordinates. Summing over all the ways to get from d_n to b_{n+k+1} = c_1 in k + 1 steps yields the factor [M^{k+1}]_{d_n c_1}, the d_n, c_1 term in the (k+1)st power of M, which is the probability of transition from d_n to b_{n+k+1} = c_1 in k + 1 steps, so that

(10) μ([d_1^n] ∩ [b_{n+k+1}^{n+k+m}]) = U V W,

where

U = π(d_1) Π_{i=1}^{n-1} M_{d_i d_{i+1}}, V = [M^{k+1}]_{d_n c_1}, W = Π_{i=1}^{m-1} M_{c_i c_{i+1}}.

The sequence [M^{k+1}]_{d_n c_1} converges in the sense of Cesàro to π(c_1), by (8); since U = μ(D) and π(c_1)W = μ(C), the Cesàro averages of μ(T^{-k-n}C ∩ D) converge to μ(D)μ(C), which establishes (9). This proves that irreducible Markov chains are ergodic.
The converse is also true, at least with the additional assumption being used in this book that every state has positive probability. Indeed, if P_j = {x: x_1 = j}, then the set B = T^{-1}P_j ∪ T^{-2}P_j ∪ ... satisfies T^{-1}B ⊇ B, so that if the chain is ergodic then μ(B) = 1, and hence μ(B ∩ P_i) > 0, for every state i. But if μ(B ∩ P_i) > 0, then μ(T^{-n}P_j ∩ P_i) > 0, for some n, which means that transition from i to j in n steps occurs with positive probability, and hence the chain is irreducible. In summary,

Proposition 1.2.9 A stationary Markov chain is ergodic if and only if its transition matrix is irreducible.

If some power of M is positive (that is, has all positive entries), then M is certainly irreducible. In this case, the Cesàro limit theorem (8) can be strengthened to lim_{N→∞} M^N = P, and the argument used to prove (9) then shows that the shift T must be mixing. The converse is also true; see Exercise 19.

Proposition 1.2.10 A stationary Markov chain is mixing if and only if some power of its transition matrix has all positive entries.

Remark 1.2.11 In some probability books, "ergodic" for Markov chains is equivalent to the condition that some power of the transition matrix be positive; see, for example, Feller's book, [11]. To be consistent with the general concept of ergodic process as used in this book, "ergodic" for Markov chains will mean merely irreducible, with "mixing Markov" reserved for the additional property that some power of the transition matrix has all positive entries.

Example 1.2.12 (Codings and ergodicity.) Stationary coding preserves the ergodicity property. This follows easily from the definitions, for suppose F: A^Z → B^Z is a stationary encoder, suppose μ is the Kolmogorov measure of an ergodic A-valued process, and suppose ν = μ ∘ F^{-1} is the Kolmogorov measure of the encoded B-valued process. If C is a shift-invariant subset of B^Z, then F^{-1}C is a shift-invariant subset of A^Z, so that μ(F^{-1}C) is 0 or 1. Since ν(C) = μ(F^{-1}C), it follows that ν(C) is 0 or 1, and hence ν is the Kolmogorov measure of an ergodic process. (It is not important that the domain of F be the sequence space A^Z; any probability space will do, so that, in particular, if T is ergodic then the (T, P)-process is ergodic for any finite partition P.) It should be noted, however, that the (T, P)-process can be ergodic even though T is not; see Exercise 5. A simple extension of the above argument shows that stationary coding also preserves the mixing property.

A finite-state process is a stationary coding, with window width 0, of a Markov chain. Thus, if the underlying chain is ergodic then the finite-state process will also be ergodic; likewise, a finite-state process for which the underlying Markov chain is mixing must itself be mixing. A concatenated-block process is always ergodic, since it can be represented as a stationary coding of an irreducible Markov chain; see Example 1.1.11. The underlying
N-block codings destroy stationarity, but a stationary process can be constructed from the encoded process by randomizing the start. The final randomized-start process may not be ergodic, however, even if the original process was ergodic, for it has a periodic structure due to the blocking. For example, let μ be the stationary Markov measure with transition matrix M and start distribution π given by

M = [ 0 1 ; 1 0 ],   π = [ 1/2, 1/2 ],

that is, give measure 1/2 to each of the two sequences 1010... and 0101... Let ν be the encoding of μ defined by the 2-block code C(01) = 00, C(10) = 11, so that ν is concentrated on the two sequences 000... and 111..., respectively. The measure ν̄ obtained by randomizing the start is, in this case, up to a shift, the same as ν, hence is not an ergodic process. As noted in Section 1.1, concatenated-block processes are not mixing except in special cases; since the periodic structure due to the blocking is inherited, such a Markov chain is not generally mixing.

A condition insuring ergodicity of the process obtained by N-block coding and randomizing the start is that the original process be ergodic relative to the N-shift T^N. A transformation T on a probability space (X, Σ, μ) is said to be totally ergodic if every power T^N is ergodic. If T is the shift on a sequence space then T is totally ergodic if and only if, for each N, the nonoverlapping N-block process defined by

Z_n = (X_{(n-1)N+1}, X_{(n-1)N+2}, ..., X_{nN})

is ergodic. The proof of this is left to the reader. In particular, applying a block code to a totally ergodic process and randomizing the start produces an ergodic process. Since the condition F(T_A x) = T_B F(x), for all x, implies that F(T_A^N x) = T_B^N F(x), for all x and all N, it follows that a stationary coding of a totally ergodic process must be totally ergodic.

Example 1.2.13 (Rotation processes.) Let α be a fixed real number and let T be defined on X = [0, 1) by the formula Tx = x ⊕ α, where ⊕ indicates addition modulo 1. The mapping T is called translation by α. It is also called rotation, since X can be thought of as the unit circle by identifying x with the angle 2πx, so that translation by α becomes rotation by 2πα. The transformation T is one-to-one and maps intervals onto intervals of the same length, hence preserves Lebesgue measure μ. (The measure-preserving property is often established by proving, as in this case, that it holds on a family of sets that generates the σ-algebra.) A subinterval of the circle corresponds to a subinterval or the complement of a subinterval in X, hence the word "interval" can be used for subsets of X that are connected or whose complements are connected.

Any partition P of the interval gives rise to a process; such processes are called translation processes, or rotation processes if the circle representation is used. As an example, let P consist of the two intervals P_0 = [0, 1/2) and P_1 = [1/2, 1), which correspond to the upper and lower halves of the circle in the circle representation. The (T, P)-process {X_n} is then described as follows. Pick a point x at random in the unit interval according to the uniform (Lebesgue) measure. The value X_n(x) = χ_{P_1}(T^{n-1}x) is then 0 or 1, depending on whether T^{n-1}x = x ⊕ (n - 1)α belongs to P_0 or P_1.
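The rotation process of Example 1.2.13 is easy to simulate. The following sketch is not from the text; it is an illustrative script (function name and parameters are the author's of this sketch, not the book's) that generates the binary (T, P)-process for Tx = x ⊕ α with P_0 = [0, 1/2) labeled 0 and P_1 = [1/2, 1) labeled 1:

```python
import math

def rotation_process(x, alpha, n):
    """Generate X_1, ..., X_n for the (T,P)-process of the rotation
    Tx = x + alpha (mod 1), where P0 = [0, 1/2) is labeled 0 and
    P1 = [1/2, 1) is labeled 1."""
    out = []
    for _ in range(n):
        out.append(0 if x < 0.5 else 1)
        x = (x + alpha) % 1.0          # translation modulo 1
    return out

# rational alpha = 1/2: the orbit is periodic, so the sample path is too
print(rotation_process(0.1, 0.5, 6))   # [0, 1, 0, 1, 0, 1]

# irrational alpha: the time average of the labels approaches mu(P1) = 1/2
seq = rotation_process(0.0, math.sqrt(2) - 1, 20_000)
print(sum(seq) / len(seq))             # close to 0.5
```

The contrast between the two runs illustrates the proposition that follows: rational α gives a periodic, nonergodic process, while irrational α gives time averages converging to the measure of P_1.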
The following proposition is basic.

Proposition 1.2.14 T is ergodic if and only if α is irrational.

Two proofs of ergodicity will be given. The first proof is based on the following classical result of Kronecker.

Proposition 1.2.15 (Kronecker's theorem.) If α is irrational the forward orbit {T^n x: n ≥ 1} is dense in X, for each x.

Proof To establish Kronecker's theorem let F be the closure of the forward orbit {T^n x: n ≥ 1}, and suppose that F is not dense in X. Then there is an interval I of positive length which is maximal with respect to the property of not meeting F. Since α is irrational, T^n I cannot be the same as I, for each positive integer n, and hence the maximality of I implies that T^n I ∩ I = ∅. It follows that {T^n I} is a disjoint sequence, and hence

1 = μ([0, 1)) ≥ ∑_{n=1}^∞ μ(T^n I) = ∑_{n=1}^∞ μ(I),

which is a contradiction, since μ(I) > 0. This proves Kronecker's theorem.

Proof of Proposition 1.2.14. It is left to the reader to show that T is not ergodic if α is rational. To proceed with the proof that T is ergodic if α is irrational, assume that α is irrational and suppose A is an invariant set of positive measure. Given ε > 0 choose an interval I such that μ(I) < ε and μ(A ∩ I) ≥ (1 - ε)μ(I). Kronecker's theorem, applied to the end points of I, produces integers n_1 < n_2 < ... < n_k such that {T^{n_i} I} is a disjoint sequence and ∑_{i=1}^k μ(T^{n_i} I) ≥ 1 - 2ε. The assumption that μ(A ∩ I) ≥ (1 - ε)μ(I) gives

μ(A) ≥ ∑_{i=1}^k μ(A ∩ T^{n_i} I) ≥ (1 - ε) ∑_{i=1}^k μ(T^{n_i} I) ≥ (1 - ε)(1 - 2ε).

Since ε is arbitrary, μ(A) = 1, which completes the proof of Proposition 1.2.14.

The preceding proof is typical of proofs of ergodicity for many transformations defined on the unit interval. One first shows that orbits are dense, then that small disjoint intervals can be placed around each point in the orbit. Such a technique does not work in all cases, but does establish ergodicity for many transformations of interest. A generalization of this idea will be used in Section 1.10 to establish ergodicity of a large class of transformations.
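Kronecker's theorem can be observed numerically: for irrational α the orbit of translation by α comes arbitrarily close to any target point on the circle. This is an illustrative sketch, not from the text (the function name and the choice α = √2 − 1 are assumptions of this sketch):

```python
import math

def orbit_min_distance(alpha, target, n_steps):
    """Circle distance from the target to the closest point of the
    forward orbit {k*alpha mod 1 : 1 <= k <= n_steps} of 0."""
    x, best = 0.0, 1.0
    for _ in range(n_steps):
        x = (x + alpha) % 1.0
        d = abs(x - target)
        d = min(d, 1.0 - d)            # distance on the circle
        best = min(best, d)
    return best

alpha = math.sqrt(2) - 1               # an irrational rotation number
# Kronecker: the orbit approaches any target as n_steps grows
print(orbit_min_distance(alpha, 0.123, 10_000))
```

With 10,000 orbit points the minimum distance is already far below 1/1000, consistent with density of the orbit.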
An alternate proof of ergodicity can be obtained using Fourier series. Suppose f is square integrable and has Fourier series ∑ a_n e^{2πinx}. The Fourier series of g(x) = f(Tx) = f(x ⊕ α) is ∑ a_n e^{2πinα} e^{2πinx}. If g(x) = f(x), then a_n = a_n e^{2πinα} holds for each integer n. If α is irrational then e^{2πinα} ≠ 1 unless n = 0, which means that a_n = 0 unless n = 0, which, in turn, implies that the function f is constant. Thus, in the irrational case, the rotation T has no invariant functions and hence is ergodic.

In general, translation of a compact group will be ergodic with respect to Haar measure if orbits are dense. For example, consider the torus, the product of two circles (or, alternatively, the product of two intervals, with addition mod 1). The proof of the following is left to the reader.

Proposition 1.2.16 The following are equivalent for the mapping T: (x, y) ↦ (x ⊕ α, y ⊕ β).
(i) {T^n(x, y)} is dense for any pair (x, y).
(ii) T is ergodic.
(iii) α and β are rationally independent.

Theorem 1.2.17 (The Poincaré recurrence theorem.) Let T be a measure-preserving transformation on the probability space (X, Σ, μ) and let B be a measurable subset of X of positive measure. Then return to B is almost certain; that is, if x ∈ B then n(x) < ∞, almost surely, where n(x), the time of first return to B, is the least positive integer n such that T^n x ∈ B.

To prove this, define B_∞ = {x ∈ B: T^n x ∉ B, ∀n ≥ 1}. If x ∈ B_∞ then T^n x ∉ B_∞, for all n ≥ 1, from which it follows that the sequence of sets {T^{-n}B_∞} is disjoint. Since they all have the same measure as B_∞ and μ(X) = 1, it follows that μ(B_∞) = 0. This proves Theorem 1.2.17.

I.2.c The return-time picture.

A general return-time concept has been developed in ergodic theory, along with a suggestive picture, which provides an alternative and often simpler view of two standard probability models, the renewal processes and the regenerative processes. To simplify further discussion it will be assumed in the remainder of this subsection that T is invertible. (Some extensions to the noninvertible case are outlined in the exercises.)

If x ∈ B, let n(x) be the least positive integer n such that T^n x ∈ B, so that n(x) < ∞, almost surely. For each n ≥ 1, define the sets

(11) B_n = {x ∈ B: n(x) = n},

that is, B_1 = B ∩ T^{-1}B and, for n ≥ 2, B_n = B ∩ T^{-1}(X - B) ∩ ... ∩ T^{-(n-1)}(X - B) ∩ T^{-n}B. These sets are clearly disjoint and have union B. Furthermore, for each n, the sets B_n, TB_n, ..., T^{n-1}B_n are disjoint and have the same measure; only the first set B_n in this sequence meets B, for n ≥ 2, while applying T to the last set T^{n-1}B_n returns it to B, that is, T^n B_n ⊆ B.

The return-time picture is obtained by representing the sets (11) as intervals, one above the other; furthermore, by reassembling within each level it can be assumed that T just moves points directly upwards one level. Points in the top level of each column are moved by T to the base B in some manner unspecified by the picture. (See Figure 1.2.18.)
Figure 1.2.18 Return-time picture. (The figure shows the columns over the bases B1, B2, B3, B4; T moves each point directly up one level, and points in the top level of each column return to the base B.)
Note that the set ∪_{i=1}^{n} T^{i-1}B_n is just the union of the column whose base is B_n, so that the picture is a representation of the set

(12) ∪_{i=0}^{∞} T^i B.
This set is T-invariant and has positive measure. If T is ergodic then it must have measure 1, in which case the picture represents the whole space, modulo a null set, of course. The picture suggests the following terminology. The ordered set
B_n, TB_n, ..., T^{n-1}B_n

is called the column C = C_n with base B_n, width w(C) = μ(B_n), height h(C) = n, levels L_i = T^{i-1}B_n, 1 ≤ i ≤ n, and top T^{n-1}B_n. Note that the measure of column C is just its width times its height, that is, μ(C) = h(C)w(C).

Various quantities of interest can be easily expressed in terms of the return-time picture. For example, the return-time distribution is given by

Prob(n(x) = n | x ∈ B) = w(C_n) / ∑_m w(C_m),

and the expected return time is given by

(13) E(n(x) | x ∈ B) = ∑_n h(C_n) Prob(n(x) = n | x ∈ B) = ∑_n h(C_n)w(C_n) / ∑_m w(C_m).

In the ergodic case the latter takes the form

(14) E(n(x) | x ∈ B) = 1 / ∑_m w(C_m) = 1/μ(B),

since ∑_n h(C_n)w(C_n) is the measure of the set (12), which equals 1 when T is ergodic, and ∑_m w(C_m) = μ(B). This formula is due to Kac, [23].
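Kac's formula (14) is easy to check by simulation. The sketch below is not from the text; the two-state Markov chain, its parameters, and the function name are illustrative assumptions. It estimates the mean return time to a state and compares it with 1/μ(B):

```python
import random

def kac_check(p, q, steps, seed=0):
    """Simulate the two-state Markov chain with P(0->1) = p and
    P(1->0) = q, taking B = {state 1}.  Return the empirical mean
    return time to B and Kac's value 1/mu(B), where the stationary
    probability of state 1 is mu(B) = p/(p+q)."""
    rng = random.Random(seed)
    x, gaps, last_hit = 1, [], 0
    for t in range(1, steps + 1):
        x = (rng.random() < p) if x == 0 else (rng.random() >= q)
        if x:
            gaps.append(t - last_hit)   # time since the previous visit
            last_hit = t
    return sum(gaps) / len(gaps), (p + q) / p

emp, kac = kac_check(0.2, 0.3, 200_000)
print(emp, kac)   # empirical mean return time vs 1/mu(B) = 2.5
```

The empirical average of return times taken over visits to B converges to 1/μ(B), exactly as (14) predicts for an ergodic process.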
The transformation T̂ = T_B defined on B by the formula T̂x = T^{n(x)}x is called the transformation induced by T on the subset B. The basic theorem about induced transformations is due to Kakutani.

Theorem 1.2.19 (The induced transformation theorem.) If μ(B) > 0 the induced transformation preserves the conditional measure μ(·|B) and is ergodic if T is ergodic.

Proof If C ⊆ B, then x ∈ T̂^{-1}C if and only if there is an n ≥ 1 such that x ∈ B_n and T^n x ∈ C. This translates into the equation

T̂^{-1}C = ∪_{n=1}^∞ (B_n ∩ T^{-n}C),

which shows that T̂^{-1}C is measurable for any measurable C ⊆ B, and also that

μ(T̂^{-1}C) = ∑_{n=1}^∞ μ(B_n ∩ T^{-n}C),

since B_m ∩ B_n = ∅, if m ≠ n, with m, n ≥ 1. More is true, namely,

(15) T^n B_n ∩ T^m B_m = ∅, m, n ≥ 1, m ≠ n,

by the definition of "first return" and the assumption that T is invertible. But this implies

∑_{n=1}^∞ μ(T^n B_n) = ∑_{n=1}^∞ μ(B_n) = μ(B),

which, together with (15), yields

∑_{n=1}^∞ μ(B_n ∩ T^{-n}C) = ∑_{n=1}^∞ μ(T^n B_n ∩ C) = μ(C).

Thus the induced transformation T̂ preserves the conditional measure μ(·|B).

The induced transformation is ergodic if the original transformation is ergodic. The picture in Figure 1.2.18 shows why this is so. If T̂^{-1}C = C, then each C ∩ B_n can be pushed upwards along its column to obtain the set

D = ∪_{n=1}^∞ ∪_{i=0}^{n-1} T^i(C ∩ B_n).

But μ(TD Δ D) = 0 and T is ergodic, so the measure of D must be 0 or 1, which, in turn, implies that μ(C|B) is 0 or 1, since μ((D ∩ B) Δ C) = 0. This completes the proof of the induced-transformation theorem.
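In a finite setting the induced transformation can be computed exactly. The following sketch is not from the text; it uses the cyclic shift on Z_m with uniform measure as an assumed toy example, and shows that T̂ permutes B (so it preserves the conditional measure) and that the average return time is 1/μ(B), as in Kac's formula:

```python
def induced_map(m, B):
    """Induced transformation of the cyclic shift T(i) = i+1 mod m
    on a subset B: for each x in B, follow the orbit until it
    re-enters B, recording the first-return time n(x)."""
    Bset = set(B)
    T_hat, ret = {}, {}
    for x in B:
        y, n = (x + 1) % m, 1
        while y not in Bset:
            y, n = (y + 1) % m, n + 1
        T_hat[x], ret[x] = y, n
    return T_hat, ret

T_hat, ret = induced_map(12, [0, 3, 4, 10])
# T-hat is a bijection of B, hence preserves the uniform measure on B
print(sorted(T_hat.values()))          # [0, 3, 4, 10]
# Kac: average return time over B equals 1/mu(B) = 12/4
print(sum(ret.values()) / len(ret))    # 3.0
```

Here B = {0, 3, 4, 10} has return times 3, 1, 6, 2, whose average is exactly 12/4 = 3.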
I.2.c.1 Processes associated with the return-time picture.
Several processes of interest are associated with the induced transformation and the return-time picture. It will be assumed throughout this discussion that T is an invertible, ergodic transformation on the probability space (X, Σ, μ), partitioned into a finite partition P = {P_a: a ∈ A}; that B is a set of positive measure; that n(x) = min{n: T^n x ∈ B}, x ∈ B, is the return-time function; and that T̂ is the transformation induced on B by T. Two simple processes are connected to returns to B. The first of these is {R_n}, the (T, B)-process defined by the partition B = {B, X - B}, with B labeled by 1 and X - B labeled by 0, in other words, the binary process defined by

R_n(x) = χ_B(T^{n-1}x), n ≥ 1,
where χ_B denotes the indicator function of B. The (T, B)-process is called the generalized renewal process defined by T and B. This terminology comes from the classical definition of a (stationary) renewal process as a binary process in which the times between occurrences of 1's are independent and identically distributed, with a finite expectation, the only difference being that now these times are not required to be i.i.d., but only stationary with finite expected return time. The process {R_n} is a stationary coding with time-zero encoder χ_B, hence it is ergodic. Any ergodic, binary process which is not the all-0 process is, in fact, the generalized renewal process defined by some transformation T and set B; see Exercise 20.

The second process connected to returns to B is {R̂_j}, the (T̂, B̂)-process defined by the (countable) partition

B̂ = {B_1, B_2, ...}

of B, where B_n is the set of points in B whose first return to B occurs at time n. The process {R̂_j} is called the return-time process defined by T and B. It takes its values in the positive integers, and has finite expectation given by (14). Also, it is ergodic, since T̂ is ergodic. Later, it will be shown that any ergodic positive-integer valued process with finite expectation is the return-time process for some transformation and partition; see Theorem 1.2.24.

The generalized renewal process {R_n} and the return-time process {R̂_j} are connected, for, conditioned on starting in B, the times between successive occurrences of 1's in {R_n} are distributed according to the return-time process. In other words, if R_1 = 1 and the sequence {S_j} of successive returns to 1 is defined inductively by setting S_0 = 1 and

S_j = min{n > S_{j-1}: R_n = 1}, j ≥ 1,

then the sequence of random variables defined by

(16) R̂_j = S_j - S_{j-1}, j ≥ 1,

has the distribution of the return-time process. Thus, the return-time process is a function of the generalized renewal process.

Except for a random shift, the generalized renewal process {R_n} is a function of the return-time process, for given a random sample path R̂ = {R̂_j} for the return-time process, a random sample path R = {R_n} for the generalized renewal process can be expressed as a concatenation

(17) R = ŵ(1)w(2)w(3) ···

of blocks, where, for m ≥ 1, w(m) = 0^{R̂_m - 1}1, that is, a block of 0's of length R̂_m - 1, followed by a 1. The initial block ŵ(1) is a tail of the block w(1) = 0^{R̂_1 - 1}1. The only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words, how to determine the waiting time τ until the first 1 occurs.

The start position problem is solved by the return-time picture and the definition of the (T, P)-process. The (T, P)-process is defined by selecting x ∈ X at random according to the μ-measure, then setting X_n(x) = a if and only if T^{n-1}x ∈ P_a. In the ergodic case, the return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column C at random according to column measure μ(C),
then selecting j at random according to the uniform distribution on {1, 2, ..., h(C)}, then selecting x at random in the j-th level of C. In summary,
Theorem 1.2.20 The generalized renewal process {R_n} and the return-time process {R̂_m} are connected via the successive-returns formula, (16), and the concatenation representation, (17), with initial block ŵ(1) = 0^{τ-1}1, where τ = R̂_1 - j + 1 and j is uniformly distributed on [1, R̂_1].
The distribution of τ can be found by noting that

τ(x) = i if and only if x ∈ ∪_{n=i}^∞ T^{n-i+1}B_n,

from which it follows that

Prob(τ = i) = ∑_{n=i}^∞ μ(T^{n-i+1}B_n) = ∑_{n≥i} w(C_n).

The latter sum, however, is equal to Prob(x ∈ B and n(x) ≥ i), which gives the alternative form

(18) Prob(τ = i) = Prob(n(x) ≥ i | x ∈ B) / E(n(x) | x ∈ B),

since E(n(x) | x ∈ B) = 1/μ(B) holds for ergodic processes, by (14). The formula (18) was derived in [11, Section XIII.5] for renewal processes by using generating functions. As the preceding argument shows, it is a very general and quite simple result about ergodic binary processes with finite expected return time.

The return-time process keeps track of returns to the base, but may lose information about what is happening outside B. Another process, called the induced process, carries along such information.
Let A* be the set of all finite sequences drawn from A and define the countable partition P̂ = {P̂_w: w ∈ A*} of B, where, for w = a_1^k,

P̂_w = {x ∈ B_k: T^{j-1}x ∈ P_{a_j}, 1 ≤ j ≤ k}.

The (T̂, P̂)-process is called the process induced by the (T, P)-process on the set B. The relation between the (T, P)-process {X_n} and its induced (T̂, P̂)-process {X̂_m} parallels the relationship (17) between the generalized renewal process and the return-time process. Thus, given a random sample path X̂ = {X̂_m} for the induced process, a random sample path x = {X_n} for the (T, P)-process can be expressed as a concatenation
(19) x = ŵ(1)w(2)w(3) ···

of blocks, where, for m ≥ 1, w(m) = X̂_m, and the initial block ŵ(1) is a tail of the block w(1) = X̂_1. Again, the only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words, how to determine the length of the tail ŵ(1). A generalization of the return-time picture provides the answer.

The generalized version of the return-time picture is obtained by further partitioning the columns of the return-time picture, Figure 1.2.18, into subcolumns according to the conditional distribution of (T, P)-names. Thus the column with base B_k is partitioned
into subcolumns C_k^w, labeled by k-length names w, so that the subcolumn corresponding to w = a_1^k has width

μ(∩_{i=1}^k T^{-i+1}P_{a_i} ∩ B_k),

the i-th level being labeled by a_i. Furthermore, by reassembling within each level of a subcolumn, it can again be assumed that T moves points directly upwards one level. Points in the top level of each subcolumn are moved by T to the base B in some manner unspecified by the picture. For example, in Figure 1.2.21, the fact that the third subcolumn over B_3, which has levels labeled 1, 1, 0, is twice the width of the second subcolumn, indicates that given that a return occurs in three steps, it is twice as likely to visit 1, 1, 0 as it is to visit 1, 0, 0 in its next three steps.
Figure 1.2.21 The generalized return-time picture.

The start position problem is solved by the generalized return-time picture and the definition of the (T, P)-process. The (T, P)-process is defined by selecting x ∈ X at random according to the μ-measure, then setting X_n(x) = a if and only if T^{n-1}x ∈ P_a. In the ergodic case, the generalized return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a subcolumn C_w at random according to column measure μ(C_w), then selecting j at random according to the uniform distribution on {1, 2, ..., h(w)}, where h(w) = h(C_w), then selecting x at random in the j-th level of C_w. This proves the following theorem.
Theorem 1.2.22 The process {X_n} has the stationary concatenation representation (19) in terms of the induced process {X̂_m}, if and only if the initial block is ŵ(1) = w(1)_j^{h(w(1))}, the tail of w(1) beginning in position j, where j is uniformly distributed on [1, h(w(1))].
A special case of the induced process occurs when B is one of the sets of P, say B = P_b. In this case, the induced process outputs the blocks that occur between successive occurrences of the symbol b. In the general case, however, knowledge of where the blocking occurs may require knowledge of the entire past and future, or may even be fully hidden.
I.2.c.2 The tower construction.
The return-time construction has an inverse construction, called the tower construction, described as follows. Let T be a measure-preserving mapping on the probability space (X, Σ, μ), and let f be a measurable mapping from X into the positive integers N = {1, 2, ...}, of finite expected value. Let X̃ be the subset of the product space X × N defined by

X̃ = {(x, i): 1 ≤ i ≤ f(x)}.

A set B̃ ⊆ X̃ is defined to be measurable if and only if {x: (x, i) ∈ B̃} ∈ Σ, for each i, with its measure given by

μ̃(B̃) = (1/E(f)) ∑_{n=1}^∞ ∑_{i=1}^n μ({x ∈ B_n: (x, i) ∈ B̃}),

where B_n = {x: f(x) = n}, for each n ≥ 1. In other words, the measure μ is extended to a measure on X̃ by thinking of each set in (20) as a copy of B_n with measure μ(B_n), then normalizing to get a probability measure, the normalizing factor being ∑_n n μ(B_n) = E(f), the expected value of f.

A transformation T̃ is obtained by mapping points upwards, with points in the top of each column sent back to the base according to T. The formal definition is

T̃(x, i) = (x, i + 1), if i < f(x); T̃(x, i) = (Tx, 1), if i = f(x).

A useful picture is obtained by stacking the sets

(20) B_n × {1}, B_n × {2}, ..., B_n × {n}

as a column in order of increasing i, as shown in Figure 1.2.23.

Figure 1.2.23 The tower transformation.

The transformation T̃ is easily shown to preserve the measure μ̃ and to be invertible (and ergodic) if T is invertible (and ergodic). The transformation T̃ is called the tower extension of T by the (height) function f. The tower extension T̃ induces the original transformation T on the base B = X × {1}. This and other ways in which inducing and tower extensions are inverse operations are explored in the exercises.
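For a finite base space the tower extension can be built explicitly. The sketch below is illustrative and not from the text (the cyclic map and heights are assumed toy data); it checks that T̃ is a bijection of the tower, hence preserves the normalized counting measure, and that the top of each column is sent back to the base:

```python
def tower(T, f, points):
    """Tower extension of T by the height function f on a finite
    set: the space is {(x, i): 1 <= i <= f(x)}; T-tilde moves a
    point up one level, and sends the top level (x, f(x)) to
    (T(x), 1) in the base."""
    space = [(x, i) for x in points for i in range(1, f[x] + 1)]
    def T_tilde(p):
        x, i = p
        return (x, i + 1) if i < f[x] else (T[x], 1)
    return space, T_tilde

T = {0: 1, 1: 2, 2: 0}      # cyclic map on {0, 1, 2}
f = {0: 2, 1: 1, 2: 3}      # heights, E(f) proportional to 6 points
space, Tt = tower(T, f, [0, 1, 2])
print(Tt((0, 2)))           # top of the first column returns to (1, 1)
```

Since T̃ permutes the six points of the tower, the uniform measure on the tower is preserved, which is the finite analogue of the measure-preserving property above.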
As an application of the tower construction it will be shown that return-time processes are really just stationary positive-integer valued processes with finite expected value. The induced transformation and the tower construction can be used to prove the following theorem.

Theorem 1.2.24 An ergodic, positive-integer valued process {R̂_j} with E(R̂_1) < ∞ is the return-time process for some transformation and set.

Proof Let μ be the two-sided Kolmogorov measure on the space N^Z of doubly-infinite positive-integer valued sequences, and note that the shift T on N^Z is invertible and ergodic and the function f(x) = x_1 has finite expectation. Let T̃ be the tower extension of T by the function f and let B = X × {1}. A point (x, 1) returns to B = X × {1} at the point (Tx, 1) at time f(x), that is, the return-time process defined by T̃ and B has the same distribution as {R̂_j}. In particular, {R̂_j} defines a stationary, binary process via the concatenation representation (17), with the distribution of the waiting time τ until the first 1 occurs given by (18). This proves the theorem.

A stationary process {X_n} is regenerative if there is a renewal process {R_n} such that the future joint process {(X_n, R_n): n > t} is independent of the past joint process {(X_n, R_n): n ≤ t}, given that R_t = 1. (Recall that if Y_n = f(X_n), then {Y_n} is an instantaneous function of {X_n}.)

Theorem 1.2.25 An ergodic process is regenerative if and only if it is an instantaneous function of an induced process whose return-time process is i.i.d.; see Exercise 24.

I.2.d Exercises.

1. Suppose F: A^Z → B^Z is a stationary coding which carries μ onto ν. Determine a partition P of A^Z such that ν is the Kolmogorov measure of the (T_A, P)-process, where T_A is the shift on A^Z. (Thus, a stationary coding gives rise to a partition.)

2. Let T be an invertible measure-preserving transformation and P = {P_a: a ∈ A} a finite partition of the probability space (X, Σ, μ). Show that there is a measurable mapping F: X → A^Z which carries μ onto the (two-sided) Kolmogorov measure ν of the (T, P)-process, such that F(Tx) = T_A F(x), for all x. (Thus, a partition gives rise to a stationary coding.)

3. Show that the (T, P)-process can be ergodic even if T is not. (Hint: start with a suitable nonergodic Markov chain and lump states to produce an ergodic chain.)

4. Show that the (T, P)-process can be mixing even if T is only ergodic.

5. Let Z_n = (X_{(n-1)N+1}, X_{(n-1)N+2}, ..., X_{nN}) be the nonoverlapping N-block process, let W_n = (X_n, X_{n+1}, ..., X_{n+N-1}) be the overlapping N-block process, and let Y_n = X_{(n-1)N+1} be the N-th term process defined by the (T, P)-process {X_n}. (a) Show that {Z_n} is the (T^N, ∨_{i=0}^{N-1} T^{-i}P)-process.

6. Let T be a measure-preserving transformation. A nonnegative measurable function f is said to be T-subinvariant if f(Tx) ≤ f(x), almost surely. Show that a T-subinvariant function is almost-surely invariant. (Hint: consider g(x) = min{f(x), N}.)
(b) Show that {W_n} is the (T, ∨_{i=0}^{N-1} T^{-i}P)-process. (c) Show that {Y_n} is the (T^N, P)-process.

8. Show that if T is mixing then so is the direct product T × T: X × X → X × X, where X × X is given the measure μ × μ.

9. Show that an ergodic rotation is not mixing.

10. Show that if T is mixing then it is totally ergodic.

11. Show directly, without using Proposition 1.2.16, that if α is irrational then the direct product (x, y) ↦ (x ⊕ α, y ⊕ α) is not ergodic.

12. Show that the overlapping n-blocking of an ergodic process is ergodic.

13. Show that the nonoverlapping n-blocking of an ergodic process may fail to be ergodic.

14. Show that a concatenated-block process is ergodic. (Hint: the transition matrix is irreducible.)

15. Let T be the shift on X = {-1, 1}^Z, with the product measure ν defined by ν(1) = ν(-1) = 1/2. Let Y = X × X and define S(x, y) = (Tx, T^{x_0}y). (a) Show that S preserves ν × ν. (b) Let P be the time-zero partition of X and let Q = P × P. Give a formula expressing the (S, Q)-name of (x, y) in terms of the coordinates of x and y. (The (S, Q)-process is called "random walk with random scenery.")

16. Generalize the preceding exercise by showing that if T is a measure-preserving transformation on a probability space (X, Σ, μ) and {T_x: x ∈ X} is a family of measure-preserving transformations on a probability space (Y, Σ', ν), such that the mapping (x, y) ↦ (Tx, T_x y) is measurable, then S(x, y) = (Tx, T_x y) is a measure-preserving mapping on the product space (X × Y, μ × ν). S is called a skew product. If T_x = R, for all x, then S is called the direct product of T and R.

17. Insertion of random spacers between blocks can be used to change a concatenated-block process into a mixing process. Fix a measure μ on A^N, a symbol ā ∈ A, and a number p ∈ (0, 1). Let {Z_m} be the stationary Markov chain with alphabet S = A^N × {0, 1, ..., N}, defined by the following transition rules. (i) If i < N, (a_1^N, i) can only go to (a_1^N, i + 1). (ii) (a_1^N, N) goes to (b_1^N, 0) with probability p μ(b_1^N), for each b_1^N ∈ A^N. (iii) (a_1^N, N) goes to (b_1^N, 1) with probability (1 - p) μ(b_1^N), for each b_1^N ∈ A^N. (a) Show that there is indeed a unique stationary Markov chain with these properties. (b) Show that the process {Ỹ_m} defined by setting Ỹ_m = a_i if Z_m = (a_1^N, i), i ≥ 1, and Ỹ_m = ā if Z_m = (a_1^N, 0), is a mixing finite-state process.
18. Prove that a stationary coding of a mixing process is mixing. (Hint: use the formula F^{-1}(C ∩ T^{-n}D) = F^{-1}C ∩ T_A^{-n}F^{-1}D.)

19. Prove that even if T is not invertible, the induced transformation is measurable, preserves the conditional measure, and is ergodic if T is ergodic.

20. Show that if {X_n} is a binary, ergodic process which is not identically 0, then it is the generalized renewal process for some transformation T and set B. (Hint: let T be the shift in the two-sided Kolmogorov representation and take B = {x: x_0 = 1}.)

21. Let T be an invertible measure-preserving transformation and P = {P_a: a ∈ A} a finite partition of the probability space (X, Σ, μ). Let X̃ be the tower over X defined by f, and let S = T̃ be the tower transformation. Show how to extend P to a partition P̃ of X̃, such that the (T, P)-process is an instantaneous coding of the induced (S, P̃)-process.

22. Prove: a tower T̃ over T induces on the base a transformation isomorphic to T.

23. Show that the tower defined by the induced transformation T̂ and return-time function n(x) is isomorphic to T.

24. Prove Theorem 1.2.25. (Hint: obtain a picture like Figure 1.2.18 for T̂, then use this to guide a proof.)

25. Define P = {P_a: a ∈ A} and Q = {Q_b: b ∈ B} to be ε-independent if

∑_{a,b} |μ(P_a ∩ Q_b) - μ(P_a)μ(Q_b)| ≤ ε.

(a) Show that P and Q are independent if and only if they are ε-independent for each ε > 0.

(b) Show that if P and Q are ε-independent then ∑_a |μ(P_a|Q_b) - μ(P_a)| ≤ √ε, except for a set of Q_b's of total measure at most √ε.

(c) Show that if ∑_a |μ(P_a|Q_b) - μ(P_a)| ≤ ε, except for a set of Q_b's of total measure at most ε, then P and Q are 3ε-independent.
Section 1.3 The ergodic theorem.

The ergodic theorem extends the strong law of large numbers from i.i.d. processes to the general class of stationary processes. The ergodic theorem in the almost-sure form presented here is due to G. D. Birkhoff and is often called Birkhoff's ergodic theorem or the individual ergodic theorem.

Theorem 1.3.1 (The ergodic theorem.) If T is a measure-preserving transformation on a probability space (X, Σ, μ) and if f is integrable then the average (1/n) ∑_{i=1}^n f(T^{i-1}x) converges almost surely and in L^1-norm to a T-invariant function f*(x).

The L^1-convergence implies that ∫ f* dμ = ∫ f dμ, since

(1/n) ∑_{i=1}^n ∫ f(T^{i-1}x) dμ = ∫ f dμ,

by the measure-preserving property of T. Thus if T is ergodic then the limit function f*, since it is T-invariant, must be almost-surely equal to the constant value ∫ f dμ. In particular, if f is taken to be 0-1 valued and T to be ergodic the following form of the ergodic theorem is obtained.

Theorem 1.3.2 (Ergodic theorem: binary ergodic form.) If {X_n} is a binary ergodic process then the average (X_1 + X_2 + ... + X_n)/n converges almost surely to the constant value E(X_1).

This binary version of the ergodic theorem is sufficient for many results in this book, but the more general version will also be needed and is quite useful in many situations not treated in this book. The proof of the ergodic theorem will be based on a rather simple combinatorial result discussed in the next subsection. The combinatorial idea is not merely a step on the way to the ergodic theorem, for it is an important tool in its own right and will be used frequently in later parts of this book.

I.3.a Packings from coverings.

A general technique for extracting "almost packings" of integer intervals from certain kinds of "coverings" of the natural numbers will be described in this subsection. In this discussion intervals are subsets of the natural numbers N = {1, 2, ...}, of the form

[n, m] = {j ∈ N: n ≤ j ≤ m}.

A strong cover C of N is defined by an integer-valued function n ↦ m(n) for which m(n) ≥ n, and consists of all intervals of the form [n, m(n)]. (The word "strong" is used to indicate that every natural number is required to be the left endpoint of a member of the cover.)

A strong cover C has a packing property, namely, there is a subcover C' whose members are disjoint; just set C' = {[n_i, m(n_i)]}, where n_1 = 1 and n_{i+1} = 1 + m(n_i), i ≥ 1. This is a trivial observation, however. The finite problem has a different character: given a strong cover C of the natural numbers, it may not be possible, even asymptotically, to pack an initial segment [1, K] by disjoint subcollections of C, unless the function m(n) is severely restricted. If it is only required, however, that there be a disjoint subcollection that fills most of [1, K], then a positive and useful result is possible.
A collection of subintervals C' of the interval [1, K] is called a (1 - δ)-packing of [1, K] if the intervals in C' are disjoint and their union has cardinality at least (1 - δ)K. The interval [1, K] is said to be (L, δ)-strongly-covered by C if

|{n ∈ [1, K]: m(n) - n + 1 > L}| < δK.

To motivate the positive result, suppose all the intervals of C have the same length, say L. For the interval [1, K], start from the left and select successive disjoint intervals, stopping when within L of the end. This produces the disjoint collection C' = {[iL + 1, (i + 1)L]: 0 ≤ i ≤ (K - L)/L}, which covers all but at most the final L - 1 members of [1, K]. Thus, if δ > 0 is given and K ≥ L/δ then all but at most a δ-fraction is covered. The desired positive result is just an extension of this idea to the case when most of the intervals have length bounded by some L ≤ δK.

Lemma 1.3.3 (The packing lemma.) Let C be a strong cover of N and let δ > 0 be given. If K ≥ L/δ and if [1, K] is (L, δ)-strongly-covered by C, then there is a subcollection C' ⊆ C which is a (1 - 2δ)-packing of [1, K].

Proof By hypothesis K ≥ L/δ and

(1) |{n ∈ [1, K]: m(n) - n + 1 > L}| < δK.

The construction of the (1 - 2δ)-packing will proceed sequentially from left to right, using a sequential greedy algorithm: select the first interval of length no more than L that is disjoint from the previous selections, stopping when within L of the end of [1, K]. To carry this out rigorously set m(n_0) = n_0 = 0 and, by induction, define

n_i = min{j ∈ [1 + m(n_{i-1}), K - L]: m(j) - j + 1 ≤ L}.

The construction stops after I = I(C, K) steps if m(n_I) ≥ K - L or there is no j ∈ [1 + m(n_I), K - L] for which m(j) - j + 1 ≤ L. The claim is that C' = {[n_i, m(n_i)]: 1 ≤ i ≤ I} is a (1 - 2δ)-packing of [1, K]. The intervals are disjoint, by construction, and are contained in [1, K], since m(n_i) - n_i + 1 ≤ L, and hence m(n_i) ≤ K. Thus it is only necessary to show that the union U = ∪_{i=1}^I [n_i, m(n_i)] has cardinality at least (1 - 2δ)K.

The definition of the n_i implies the following fact.

(2) If j ∈ [1, K - L] - U, then m(j) - j + 1 > L.

The (L, δ)-strong-cover assumption, (1), thus guarantees that |[1, K - L] - U| < δK. The interval (K - L, K] has length at most L - 1 < δK, so that |[1, K] - U| < 2δK, that is, the union U of the [n_i, m(n_i)] has cardinality at least (1 - 2δ)K. This completes the proof of the packing lemma.
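The greedy construction in the proof is concrete enough to run. The following sketch is illustrative and not from the text (the function name and the example cover m(n) = n + 1 are assumptions of this sketch):

```python
def greedy_packing(m, K, L):
    """Greedy packing from the proof of the packing lemma: m maps
    each n to m(n) >= n, so the intervals [n, m(n)] form a strong
    cover.  Scan [1, K-L] left to right, selecting the first
    interval of length at most L disjoint from earlier selections."""
    packing, n = [], 1
    while n <= K - L:
        if m(n) - n + 1 <= L:
            packing.append((n, m(n)))
            n = m(n) + 1               # jump past the selected interval
        else:
            n += 1                     # skip an over-long interval
    return packing

# strong cover by intervals of length 2: m(n) = n + 1
pack = greedy_packing(lambda n: n + 1, K=20, L=2)
print(pack)   # [(1, 2), (3, 4), ..., (17, 18)]
```

Here delta = L/K = 0.1, and the selected intervals cover 18 of the 20 points, at least the (1 - 2*delta) fraction the lemma guarantees.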
large interval. ergodic process proof.4 In most applications of the packing lemma all that matters is the result.3.3.3.s. K] must also be too big by a somewhat smaller fixed amount. so it is not easy to see what happens to the average over a fixed. Furthermore.u. and hence there is an E > 0 such that the set {n B = x : lirn sup — E xi > . each interval in C(x) has the property that the average of the xi over that interval is too big by a fixed amount. and the goal is to show that (3) lirn — x i = . Since lim sup n—>oo 1 +E xi + + xn = =sup n—>oo x2 + . combined with a simple observation about almostsurely finite variables. as it illustrates the ideas in simplest form. Since B is Tinvariant. In this case . Since this occurs with high probability it implies that the expected value of the average over the entire set A K must be too big. n—*øo n i=1 i n where A(1) = A{x: x 1 = 1} = E(X1). . 1}. it will imply that if K is large enough then with high probability. i < K are contained in disjoint intervals over which the average is too big. Without loss of generality the first of these possibilities can be assumed.3. for each integer n there will be a first integer m(n) > n such that X n Xn+1 + • • • + Xm(n) m(n) — n + 1 > p. m(n)]: n E JVI is a (random) strong cover of the natural numbers . The packing lemma can be used to reduce the problem to a nonoverlapping interval problem.(1) + E. including the description (2) of those indices that are neither within L of the end of the interval nor are contained in sets of the packing.3. most of the terms xi . a. which is a contradiction. but in the proof of the general ergodic theorem use will be made of the explicit construction given in the proof of Lemma 1. Some extensions of the packing lemma will be mentioned in the exercises.. A proof of the ergodic theorem for the binary ergodic process case will be given first. then the average over the entire interval [1. however. (Here is where the ergodicity assumption is used. 
Then either the limit superior of the averages is too large on a set of positive measure or the limit inferior is too small on a set of positive measure. if the average over most such disjoint blocks is too big. These intervals overlap. for.a(1).A'. • + Xn+1 the set B is Tinvariant and therefore p.u. Suppose (3) is false.b The binary.) Now suppose x E B.(B) = 1. 35 Remark 1. I. is invariant and ergodic with respect to the shift T on {0. But. THE ERGODIC THEOREM. Thus the collection C(x) = lln. ( 1 ) n+oo n i=i has positive measure.SECTION 1.
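The greedy construction in the proof of the packing lemma is easy to express in code. The sketch below is ours, not the text's: the cover is represented by a hypothetical dictionary `starts` mapping each covered left endpoint n to the right endpoint m(n) of an interval of length at most L, and the left-to-right selection mirrors the inductive choice of the n_i.

```python
def greedy_packing(starts, K, L):
    """Greedy selection from the packing lemma proof (a sketch).

    starts: dict mapping a left endpoint n to the right endpoint m(n)
    of an interval [n, m(n)] of length at most L.  Scanning left to
    right and always taking the first available interval yields the
    disjoint subcollection [n_i, m(n_i)] of the lemma's proof.
    """
    packing = []
    n = 1
    while n <= K - L:
        if n in starts:
            packing.append((n, starts[n]))
            n = starts[n] + 1   # next candidate past the chosen interval
        else:
            n += 1              # n starts no interval in the cover; skip it
    return packing
```

For instance, with K = 100, L = 5, δ = 0.1 and a cover missing a few left endpoints, the selected disjoint intervals have total length at least (1 − 2δ)K = 80.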
A bit of preparation is needed before the packing lemma can be applied. First, since the random variable m(1) is almost-surely finite, it is bounded except for a set of small probability. Thus given δ > 0 there is a number L such that if D = {x: m(1) > L}, then μ(D) < δ². Second, note that the function

g_K(x) = (1/K) Σ_{i=1}^K χ_D(T^{i−1}x),

where χ_D denotes the indicator function of D, has integral μ(D) < δ², since T is measure preserving. The Markov inequality implies that for each K, the set G_K = {x: g_K(x) ≤ δ} has measure at least 1 − δ.

The definitions of D and G_K imply that if x ∈ G_K then C(x) is an (L, δ)-strong-covering of [1, K]. Thus the packing lemma implies that if K ≥ L/δ and x ∈ G_K then there is a subcollection C'(x) = {[n_i, m(n_i)]: i ≤ I(x)} ⊆ C(x) which is a (1 − 2δ)-packing of [1, K]. Since the x_i are nonnegative, the sum over the intervals in C'(x) lower bounds the sum over the entire interval, and since the intervals in C'(x) are disjoint and the average over each exceeds μ(1) + ε,

(4)  Σ_{j=1}^K x_j ≥ Σ_{i=1}^{I(x)} Σ_{j=n_i}^{m(n_i)} x_j ≥ (1 − 2δ)K [μ(1) + ε].

Note that while the collection C(x) and subcollection C'(x) both depend on x, the lower bound is independent of x, as long as x ∈ G_K. Thus, taking expected values yields

K μ(1) = E(Σ_{j=1}^K x_j) ≥ (1 − 2δ)K [μ(1) + ε] μ(G_K) ≥ (1 − 2δ)(1 − δ)K [μ(1) + ε],

which cannot be true for all δ. This completes the proof of (3), the binary, ergodic process form of the ergodic theorem.

I.3.c The proof in the general case.

The preceding argument will now be extended to obtain the general ergodic theorem. The following lemma generalizes the essence of what was actually proved; it will be used to obtain the general theorem, and thereby to complete the proof of the ergodic theorem.

Lemma 1.3.5. Let T be a measure-preserving transformation on (X, Σ, μ), let f ∈ L¹(X, μ), let α be an arbitrary real number, and define the set

B = {x: lim sup_{n→∞} (1/n) Σ_{i=1}^n f(T^{i−1}x) > α}.

Then ∫_B f(x) dμ(x) ≥ α μ(B).
) In the bounded case. the sum R(K . x) is bounded from below by —2M KS and the earlier argument produces the desired conclusion.(x1B) > a. Only a bit more is needed to handle the general case. if K > LIB then the packing lemma can be applied and the argument used to prove (4) yields a lower bound of the form E f( Tilx ) j=1 (6) where R(K . together with the lower bound (6).3. E jE[1. does give 1 vIC s f (x) dia(x1B) = (7) a' B i=i II I. Furthermore. given 3 > 0 there is an L such that if D= E B: m(1) > L } then .u(D1B) <82 and for every K the set 1 G K = {X E B: 1 7 i=1 E xD(Tiix) K 6} has conditional measure at least 1 — B. it is enough to prove fB f (x) c 1 ii. The set B is Tinvariant and the restriction of T to it preserves the conditional measure .m(ni)]} depends on x. x) as well as the effect of integrating over GK in place of X. where f is allowed to be unbounded.i(B) > O. The lemma is clearly true if p(B) = 0..)] f (T x) is the sum of f(Ti l x) over the indices that do not belong to the [ne. + (1 — 28)K a. For x E B and n E A1 there is a least integer m(n) > n such that (5) Em i= (n n ) f (T i1 m(n)—n+1 > a. though integrable. In the unbounded case. a bit more care is needed to control the effect of R(K. (1 — 8)a A(G KIB) . 1B). An integration. As before.u(. x) + E E f (T i1 x) i=1 j=ni 1(x) m(ni) R(K . 37 Proof Note that in the special case where the process is ergodic.SECTION 1. Since B is Tinvariant. this lemma is essentially what was just proved. the collection C(x) = fin. say 1f(x)1 < M. x) = > R(K .12 + 13. THE ERGODIC THEOREM.1q—U[n. m(n)]: n E AO is a (random) strong cover of the natural numbers Js/.3. so it can be assumed that . where f is the indicator function of a set C and a = p(C) c. The same argument can be used to show that Lemma 1.m(n. (Keep in mind that the collection {[ni.m(n i )] intervals. Thus.5 is true in the ergodic case for bounded functions.
and the measurepreserving property of T yields the bound 1121 5_ which is small if L is large enough. see (7). and hence 11 itself.B xD(T' . will be small if 8 is small enough. see (2).3. In the current setting this translates into the statement that if j E [1. 1 ___ K f (Ti 1 x) c 1 ii./ x) clitt(xIB) = fT_J(BGK) f(x) Since f E L I and since all the sets T i(B — GK) have the same measure.K]—U[ni.m(n. Thus 1121 5_ 17 1 f v—N K .x)1 f (P 1 x)I dwxiB). . which completes the proof of Lemma 1.K— L] —U[ni. 1131 15. (1 — kt(G + 11 + ± 13. The integral 13 is also easy to bound for it satisfies 1 f dp. BASIC CONCEPTS. the final three terms can all be made arbitrarily small. K — L] — U[ni . K fB—Gic j=1 12 = f(Ti l x) dp. . recall from the proof of the packing lemma. L f(x) d ii(x B) ?. To bound 12.u(xIB).(x1B).(x1B). E GK jE[1.M(ri.B jE(K—L.(xiB).5. which is upper bounded by 8.K] and hence the measurepreserving property of T yields 1131 B 1 fl(x) dp. which is small if K is large enough.(x1B) > a.)] E 13= (T i1 x) d.(xim. for any fixed L. K — L] — m(ni)] then m(j) — j 1 > L.)] GK — K 1 jE(K—L. all of the terms in 11. that if j E [1. In summary. m(n)] then Ti l x E D. f The measurepreserving property of T gives f3Gic f (T . in the inequality f D1 f (x)1 clia(x1B).38 where CHAPTER I.k. so that passage to the appropriate limit shows that indeed fB f (x) dp.
UT f = if Ii where 11f II = f Ifl diu is the L'norm. f follows immediately from this inequality. Ag converges in L'norm. it follows that f* E L I and that A n f converges in L'norm to f*. Since.5.o0 n i=1 { E E then fi. f EL I .3.convergence of A. as was just shown. that is. IA f II If Il f E L I . the dominated convergence theorem implies L' convergence of A n f for bounded functions. Proofs of the maximal theorem and Kingman's subadditive ergodic theorem which use ideas from the packing lemma will be sketched in the exercises. and since the average An f converges almost surely to a limit function f*. Most subsequent proofs of the ergodic theorem reduce it to a stronger result. The proof given here is less elegant. that is.5 remains true for any Tinvariant subset of B. But if f. Moreover. Thus the averaging operator An f 1 E n n i=i T is linear and is a contraction. To prove almostsure convergence first note that Lemma 1. almost surely. To complete the discussion of the ergodic theorem L' convergence must be established. To do this define the operator UT on L I by the formula. xE X. if f is a bounded function.3. elegant proof due to Garsia that appears in many books on ergodic theory. [32].3. almost surely. 39 Proof of the Birkhoff ergodic theorem: general case. This proves almost sure convergence. Other proofs that were stimulated by [51] are given in [25. for g can be taken to be a bounded approximation to f.5. This completes the proof of the Birkhoff ergodic theorem. including the short. whose statement is obtained by replacing lirn sup by sup in Lemma 1. The operator UT is linear and is an isometry because T is measure preserving.u(C)< Jc f d which can only be true if ti(C) = O. The packing lemma is a simpler version of a packing idea used to establish entropy theorems for amenable groups in [51]. THE ERGODIC THEOREM. 1=1 Remark 1. its application to prove the ergodic theorem as given here was developed by this author in 1980 and published much later in [68]. 
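The isometry and contraction properties can be checked concretely in a toy setting. The sketch below is ours, not the text's: it takes X = Z_m with the uniform measure and the measure-preserving rotation Tx = x + 1 mod m, so that functions are just m-vectors.

```python
def l1_norm(f, m):
    # L^1 norm with respect to the uniform measure on Z_m
    return sum(abs(v) for v in f) / m

def U(f, m):
    # (U_T f)(x) = f(Tx), with the rotation T x = x + 1 mod m
    return [f[(x + 1) % m] for x in range(m)]

def A(n, f, m):
    # averaging operator A_n f = (1/n) * sum_{i=1}^{n} U_T^{i-1} f
    g, total = list(f), [0.0] * m
    for _ in range(n):
        total = [t + v for t, v in zip(total, g)]
        g = U(g, m)
    return [t / n for t in total]
```

Here U preserves the L¹ norm exactly, since T is measure preserving, A_n is a contraction, and A_m f is the constant function ∫ f dμ, the ergodic average for the full rotation.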
A n f — Am f 5 g)11 + ii Ag — Amg ii 2 11 f — gil + 11Ang — A m g iI — g) — A ni (f — The L . . The reader is referred to Krengel's book for historical commentary and numerous other ergodic theorems. covering. then A n f (x)I < M. n. as well as von Neumann's elegant functional analytic proof of L 2convergence. UT f (x) = f (Tx). but it is much closer in spirit to the general partitioning. Thus if 1 n 1 n f (T i x) C = x: lim inf — f (T i x) <a < fi < lim sup — n400 n i=1 n—.6 Birkhoff's proof was based on a proof of Lemma 1. called the maximal ergodic theorem. Since almostsure convergence holds. say I f(x)I < M.SECTION 1. There are several different proofs of the maximal theorem.3.3. and packing ideas which have been used recently in ergodic and information theory and which will be used frequently in this book. 27]. g E L I then for all m.
lx)F n — 1] is called the stopping interval for x. lim almost surely. Now that the ergodic theorem is available. the stopping time for each shift 7' 1 x is finite. A (generalized) stopping time is a measurable function r defined on the probability space (X. The stopping time idea was used implicitly in the proof of the ergodic theorem.u ({x: r(x) = °o }) = O.n] is (1 — 3)packed by intervals from C(x. so that if 1 n 1=1 x D (T i1 x) d p(x) = .) If r is an almostsurely finite stopping time for an ergodic process t then for any > 0 and almost every x there is an N = N (3. starting at n.u) with values in the extended natural numbers . the collection C = C(x. the packing lemma and its relatives can be reformulated in a more convenient way as lemmas about stopping times. see (5). nor that the set {x: r(x) = nl be measurable with respect to the first n coordinates. for an almostsurely finite stopping time. By the ergodic theorem. which is a common requirement in probability theory. . x) such that if n > N then [1. r) = {[n. a (random) strong cover of the natural numbers A. given 8 > 0. Suppose it is a stationary measure on A'. The given proof of the ergodic theorem was based on the purely combinatorial packing lemma. there is an integer L such that p({x: r(x) > LI) < 6/4. the function r(Tn l x) is finite for every n. for the first time m that rin f (Ti 1 x) > ma is a stopping time. . Thus almostsure finiteness guarantees that. Note that the concept of stopping time used here does not require that X be a sequence space. Proof Since r (x) is almostsurely finite. almost surely. Several later results will be proved using extensions of the packing lemma.7 (The ergodic stoppingtime packing lemma. however.Af U ool. r). t(Tn 1 x) n — 1]: n E of stopping intervals is. E. it is bounded except for a set of small measure. The interval [n. Gn = then x E ix 1 x11 s :— n 2x D (Ti . I. An almostsurely finite stopping time r has the property that for almost every x. 
eventually almost surely. Lemma 1.3.40 CHAPTER I.u(D) <6/4. for almost every x.ix)< 6/ 41 G .d Extensions of the packing lemma. Frequent use will be made of the following almostsure stoppingtime packing result for ergodic processes.r(Tn. hence.almostsurely finite if . A stopping time r is p. Let D be the set of all x for which r(x) > L. In particular.3. BASIC CONCEPTS.
yi ] that meet the boundary of some [si. a partialpacking result that holds in the case when the stopping time is assumed to be finite only on a set of positive measure. r) such that if n > N. and each [s1 . But this means that [1. ti]. there is an integer N= x) such that if n > N then [1.8 (The partialpacking lemma. THE ERGODIC THEOREM.9 (The twopackings lemma. Lemma 1. Two variations on these general ideas will also be useful later. r) to be the set of all intervals [n. Let p. n] is (L. and define C(x. 1 — 3/2)stronglycovered by C(x. Show that for almost every x. and let {[si . n] by a subset of C(x. for which Tn l x E F(r).SECTION 1.) . Then the disjoint collection {[s i .] has length m. The definition of G. Formulate the partialpacking lemma as a strictly combinatorial lemma about the natural numbers. put F(r) = {x: r(x) < ocl. then [1. r) Lemma 1.3. For almost every x. The packing lemma produces a (L. ti]: j E J} be another disjoint collection of subintervals of [1. (Hint: define F'(x) = r(x) + 1 and apply the ergodic stoppingtime lemma to F. K] of total length aK. I.3. Let I* be the set of indices i E I such that [ui. where M > m13. with the proofs left as exercises. Show that this need not be true if some intervals are not separated by at least one integer.e Exercises. there is an N = N (x.3. (Hint: use the fact that the cardinality of J is at most K I M to estimate the total length of those [u. n] and hence for at least (1 — 812)n indices i in the interval [1. Prove the twopackings lemma. be an ergodic measure on A. 4. Show that a separated (1 — S)packing of [1. let r be a stopping time such that . ti ]. For a stopping time r which is not almostsurely infinite. 3. . A (1 — 3)packing C' of [1. This proves the lemma. and a twopackings variation. yi] meets at least one of the [s i . These are stated as the next two lemmas. K] of total length 13 K . r).) 2. r). 41 Suppose x E G„ and n > 4L/3. n — L]. ti ] has length at least M.) Let {[ui. 5. 
n] is determined by its complement. n] has a separated (1 — 28)packing C' c C(x. Prove the partialpacking lemma. vd: i E I} be a disjoint collection of subintervals of [1. n] has a y packing C' c C(x. r(r1 n — 1]. and let r be an almostsurely finite stopping time which is almostsurely bounded below by a positive integer M > 1/8.3. v. r). Suppose each [ui .) Let p. implies that r(T (' 1) x) < L for at least (1 — 314)n indices i in the interval [1. ti ]: j E J } U flub i E I*1 has total length at least (a + p . n] is said to be separated if there is at least one integer between each interval in C'.23)K.u(F(r)) > y > O. be an ergodic measure on A'. 1. 1 — 3/2)packing of [1.
8) is the set of all x such that [1. then from r(x) to r(T) — 1. (a) For bounded f. and G n (r . 9.} is a sequence of integrable functions such that gn+. then ii. (e) Extend the preceding arguments to the case when it is only assumed that T is measure preserving. function. Kingman's subadditive ergodic theorem asserts that gn (x)/ n converges almost surely to an invariant function g(x) > —oc.) (c) Show that the theorem holds for bounded functions. Assume T is measure preserving and f is integrable. 6. n ] is (1 — (5)packed by intervals from oc. The ideas of the following proof are due to M. BASIC CONCEPTS.) (c) Deduce von Neumann's theorem from the two preceding results.u(B). r(x) = 1. (b) Show that . show that h f da a.42 CHAPTER I. 8)) 1. (a) Prove the theorem under the additional assumption that gn < 0. then apply packing. Show that if r is an almostsurely finite stopping time for a stationary process p. (Hint: show that its orthogonal complement is 0. One form of the maximal ergodic theorem asserts that JE f (x) di(x) > 0. as n 7. if f E LI . [80]. show that r(x)1 E i=o f (T i x)xE(T i x) ?_ 0. Assume T is invertible so that UT is unitary. show that (Hint: let r(x) = n. (d) Deduce that L' convergenceholds.A4 = (I — UT)L 2 . r).Jones. C(x. for some r(x) < E Un . The mean ergodic theorem of von Neumann asserts that (1/n) EL I verges in L2norm. (a) Show that the theorem is true for f E F = If E L 2 : UT f = f} and for f E . x 0 E. (d) Show that the theorem holds for integrable functions. N. and for each n.(Gn (r. x E f (T i x)xE(T i x) i=o — (N — fII. let Un be the set of x's for whichEL 01 f (T i x) > 0. Assume T is measure preserving and { g. continuing until within N of L.Steele. but not integrable f (T i1 x) converge almost surely to oc. 10. then the averages (1/N) E7 f con8.) . Fix N and let E = Un <N Un .(x) < g(x) g ni (Tn x). The ideas of the following proof are due to R.F+M is dense in L 2 . 
(Hint: first show that g(x) = liminf g n (x)In is an invariant function. (Hint: sum from 0 to r(x) — 1. [33].) (b) For bounded f and L > N. (e) For B = {x: sup E7101 f (T i x) > a}. Show that if ti is ergodic and if f is a nonnegative measurable. for any L 2 function f..
(b) Prove the theorem by reducing it to the case when g_n ≤ 0. (Hint: let g'_n(x) = g_n(x) − Σ_{i=1}^n g_1(T^{i−1}x).)

(c) Show that if a = inf_n (1/n) ∫ g_n dμ > −∞, then the convergence is also in L¹-norm, and a = ∫ g dμ.

(d) Show that the same conclusions hold if g_n ≤ 0 and g_{n+s+m}(x) ≤ g_n(x) + g_m(T^{n+s}x) + γ_s, where γ_s → 0 as s → ∞.

Section 1.4 Frequencies of finite blocks.

An important consequence of the ergodic theorem for stationary finite-alphabet processes is that relative frequencies of overlapping k-blocks in n-blocks converge almost surely as n → ∞. In the ergodic case, limiting relative frequencies converge almost surely to the corresponding probabilities, a property characterizing ergodic processes. In the nonergodic case, the existence of limiting frequencies allows a stationary process to be represented as an average of ergodic processes. These and related ideas will be discussed in this section.

I.4.a Frequencies for ergodic processes.

The frequency of the block a_1^k in the sequence x_1^n is defined for n ≥ k by the formula

f(a_1^k | x_1^n) = |{i ∈ [1, n − k + 1]: x_i^{i+k−1} = a_1^k}|,

where | · | denotes cardinality. Thus f(a_1^k | x_1^n) is obtained by sliding a window of length k along x_1^n and counting the number of times a_1^k is seen in the window. The relative frequency is defined by dividing the frequency by the maximum possible number of occurrences, to obtain

p_k(a_1^k | x_1^n) = f(a_1^k | x_1^n) / (n − k + 1).

If x and k ≤ n are fixed, the relative frequency defines a measure p_k(· | x_1^n) on A^k, called the empirical distribution of overlapping k-blocks, or the (overlapping) k-type of x_1^n. When k is understood, the subscript k on p_k(a_1^k | x_1^n) may be omitted. The frequency can also be expressed in the alternate form

f(a_1^k | x_1^n) = Σ_{i=1}^{n−k+1} χ_{[a_1^k]}(T^{i−1}x),

where χ_{[a_1^k]} denotes the indicator function of the cylinder set [a_1^k]. The limiting (relative) frequency of a_1^k in the infinite sequence x is defined by

(1)  p(a_1^k | x) = lim_{n→∞} p(a_1^k | x_1^n),

provided, of course, that this limit exists. A sequence x is said to be (frequency) typical for the process μ if each block a_1^k appears in x with limiting relative frequency equal to μ(a_1^k), that is, if for each k the empirical distribution of each k-block converges to its theoretical probability μ(a_1^k).
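The sliding-window definitions above translate directly into code; a minimal sketch (ours, not the text's):

```python
from collections import Counter

def empirical_dist(x, k):
    """Empirical distribution p_k(. | x_1^n) of overlapping k-blocks:
    slide a window of length k along x and divide each count by the
    maximum possible number of occurrences, n - k + 1."""
    n = len(x)
    counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    return {block: c / (n - k + 1) for block, c in counts.items()}
```

For x = abab and k = 2 the three windows are ab, ba, ab, so the empirical distribution assigns 2/3 to ab and 1/3 to ba.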
A basic result for ergodic processes is that almost every sequence is typical. Let T(μ) denote the set of sequences that are frequency-typical for μ.

Theorem 1.4.1 (The typical-sequence theorem.) If μ is ergodic then μ(T(μ)) = 1; that is, with probability 1,

lim_n p(a_1^k | x_1^n) = μ(a_1^k), for all k and all a_1^k.

Thus the entire structure of an ergodic process can be almost surely recovered merely by observing limiting relative frequencies along a single sample path.

Proof. In the ergodic case, for fixed a_1^k, the average

p(a_1^k | x_1^n) = (1/(n − k + 1)) Σ_{i=1}^{n−k+1} χ_{[a_1^k]}(T^{i−1}x)

converges almost surely, by the ergodic theorem, to ∫ χ_{[a_1^k]} dμ = μ(a_1^k), as n → ∞. That is, for fixed a_1^k there is a set B(a_1^k) of measure 0 such that p(a_1^k | x) = μ(a_1^k) for x ∉ B(a_1^k). Since there are only a countable number of possible a_1^k, the set

B = ∪_{k=1}^∞ ∪_{a_1^k ∈ A^k} B(a_1^k)

has measure 0, and convergence holds for all k and all a_1^k, for x ∉ B, that is, μ-almost surely. This proves the theorem. □

The converse of the preceding theorem is also true.

Theorem 1.4.2 (The typical-sequence converse.) Suppose μ is a stationary measure such that for each k and for each block a_1^k, the limiting relative frequencies p(a_1^k | x) exist and are constant in x, almost surely. Then μ is an ergodic measure.

Proof. If the limiting relative frequency p(a_1^k | x) exists and is a constant c, almost surely, then this constant c must equal μ(a_1^k), since the ergodic theorem includes L¹ convergence of the averages to a function with the same mean. Thus the hypotheses imply that the formula

lim_n (1/n) Σ_{i=1}^n χ_B(T^{i−1}x) = μ(B)

holds for almost all x for any cylinder set B. Multiplying each side of this equation by the indicator function χ_C(x), where C is an arbitrary measurable set, produces

lim_n (1/n) Σ_{i=1}^n χ_B(T^{i−1}x) χ_C(x) = μ(B) χ_C(x), almost surely,

and an integration then yields

(2)  lim_n (1/n) Σ_{i=1}^n μ(T^{−(i−1)}B ∩ C) = μ(B) μ(C).
and hence (b) follows from Theorem 1. The usual approximation argument then shows that formula (2) holds for any measurable sets B and C. so that those in Gn will have "good" kblock frequencies and the "bad" set Bn has low probability when n is large.4. Note that as part of the preceding proof the following characterizations of ergodicity were established. a.(Tilx) = tt(E).4.4. E). ergodic. Theorem 1. Some finite forms of the typical sequence theorems will be discussed next. A(T i A n B) = . B n (k.u(A)A(B). c) such that if n > N then there is a collection C. then 1/3(41fi') — p(aini)I < c.4. is Proof Part (a) is just a finite version of the typicalsequence theorem.e.4 (The "good set" form of ergodicity.q) is bounded and converges almost surely to some limit. for every integer k > 0 and every c > 0. C.5 (The finite form of ergodicity. the limit must be almost surely constant. (ii) If x. The precise definition of the "good" set. (i) T is ergodic. for every k and c > O. This proves Theorem 1.2.(iii) limn x .4. eventually almost surely.2. so that p(B) must be 0 or 1.. . Theorem 1. It is also useful to have the following finitesequence form of the typicalsequence characterization of ergodic processes.) (a) If ji is ergodic then 4 (b) If lima p. is (3) G n (k. Since p(an. The members of the "good" set. But if B = B then the formula gives it(B) = A(B) 2 . or when k and c are understood. E). are often called the frequency(k. (ii) lima pi. G. then p.n (G n (k. for each A and B ma generating algebra.1. which depends on k and a measure of error.3 The following are equivalent. Theorem 1. E)) converge to 1 for any E > 0 is just the condition that p(41x7) converges in probability to ii(4). FREQUENCIES OF FINITE BLOCKS.) A measure IL is ergodic if and only if for each block ail' and E > 0 there is an N = N(a.4. c)typical sequences. a E The "bad" set is the complement.4. for each E in a generating algebra. = 1. c). just the typical sequences. and B. 
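A quick numerical illustration of typicality (a sketch of ours, not from the text): for an i.i.d. fair-coin process, which is ergodic, the overlapping relative frequency of any fixed k-block along a long sample path should be close to its probability 2^(−k).

```python
import random

def rel_freq(block, x):
    # overlapping relative frequency p(block | x_1^n)
    k, n = len(block), len(x)
    hits = sum(1 for i in range(n - k + 1) if x[i:i + k] == block)
    return hits / (n - k + 1)

random.seed(0)
x = [random.randint(0, 1) for _ in range(200000)]
# typicality: rel_freq([1], x) should be near 1/2,
# and rel_freq([1, 1], x) near 1/4
```

With 200,000 samples the random fluctuation is on the order of 1/sqrt(n), far below the 0.01 tolerance used below.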
Sequences of length n can be partitioned into two classes.(k. 45 The latter holds for any cylinder set B and measurable set C. The condition that A n (G n (k.SECTION 1.L(a)1 < c. 6) = An — G n (k. 6)) E Gn (k. G . 6) = p(afix) — . Theorem 1. C An such that tt(Cn) > 1 — E.
If n is so large that μ(G_n) > 1 − ε and x, z ∈ G_n with x_1^n, z_1^n ∈ C_n, then |p(a_1^k | x) − p(a_1^k | z)| ≤ 3ε, and hence p(a_1^k | x) is constant almost surely. Theorem 1.4.2 then implies that μ is ergodic, which completes the proof. □

I.4.b The ergodic theorem and covering properties.

The ergodic theorem is often used to derive covering properties of finite sample paths for ergodic processes. In these applications, a set B_n ⊆ A^n with desired properties is determined in some way; the ergodic theorem is then applied to show that, eventually almost surely, x_1^M is "mostly strongly covered" by blocks from B_n. In many situations the set B_n is chosen to have large measure. This simple consequence of the ergodic theorem can be stated as either a limit result or as an approximation result, and is stated as the following theorem.

Theorem 1.4.6 (The covering theorem.) If μ is an ergodic process with alphabet A, and if B_n is a subset of A^n of positive measure, then

lim_{M→∞} p_n(B_n | x_1^M) = μ(B_n), almost surely.

In other words, eventually almost surely as M → ∞, there are approximately μ(B_n)M indices i ∈ [1, M − n] for which x_i^{i+n−1} ∈ B_n; that is, for any δ > 0,

(4)  | |{i ∈ [1, M − n]: x_i^{i+n−1} ∈ B_n}| − Mμ(B_n) | ≤ δM,

eventually almost surely.

Proof. Apply the ergodic theorem to the indicator function of B_n. □

A sequence x_1^M is said to be (1 − δ)-strongly-covered by B_n if

|{i ∈ [1, M − n]: x_i^{i+n−1} ∉ B_n}| < δM.

Theorem 1.4.7 (The almost-covering principle.) If μ is an ergodic process with alphabet A, and if B_n ⊆ A^n satisfies μ(B_n) > 1 − δ, then x_1^M is eventually almost surely (1 − δ)-strongly-covered by B_n.

The almost strong-covering idea and a related almost-packing idea are discussed in more detail in Section I.7. The following example illustrates one simple use of almost strong-covering.

Example 1.4.8. Let μ be an ergodic process and fix a symbol a ∈ A of positive probability. Let F(x_1^M) and L(x_1^M) denote the first and last occurrences of a in x_1^M, that is,

F(x_1^M) = min{i ∈ [1, M]: x_i = a},  L(x_1^M) = max{i ∈ [1, M]: x_i = a},

with the convention F(x_1^M) = L(x_1^M) = 0 if no a appears in x_1^M. The almost-covering idea will be used to show that

(5)  lim_{M→∞} (L(x_1^M) − F(x_1^M))/M = 1, almost surely.

To prove this, given δ > 0, choose n so large that μ(B_n) > 1 − δ, where

B_n = {a_1^n: a_i = a, for some i ∈ [1, n]}.

Such an n exists by the ergodic theorem. The almost-covering principle then implies that x_1^M is eventually almost surely (1 − δ)-strongly-covered by B_n. But if this is so and M > n/δ, then at least one member of B_n must start within δM of the beginning of x_1^M and at least one member of B_n must start within δM of the end of x_1^M, so that F(x_1^M) ≤ 2δM and L(x_1^M) ≥ (1 − δ)M, and hence L(x_1^M) − F(x_1^M) ≥ (1 − 3δ)M. Since δ is arbitrary, (5), the desired result, follows.

Note that the ergodic theorem is used twice in the example: first to select the set B_n, and then, via the almost-covering principle, to obtain almost strong-covering by B_n. Such "doubling" is common in applications. It follows easily from (5) that the expected waiting time between occurrences of a along a sample path is, in the limit, almost surely equal to 1/μ(a); see Exercise 2. This result also follows easily from the return-time picture discussed in Section I.2.c.
implies that xr is eventually almost surely (1 — 3)stronglycovered by Bn . almost surely equal to 1/p. Theorem 1. Since 3 is arbitrary. together with the ergodic theorem. (5). The set of all x for which the limit p(ainx) = lim p(44) exists for all k and all di will be denoted by E. The set is shift invariant and any shiftinvariant measure has its support in This support result.8 Let p. be ergodic. Given 8 > 0.(a). The sequences that produce the same limit measure can be grouped into classes and the process measure can then be represented as an average of the limit measures. if no a appears in xr. nil. which is a simple consequence of the ergodic theorem.
The projection x is measurable and transforms the measure . n —k+ 1 The frequencies p(41. k 1 i=1 the ergodic theorem implies that p(aflx) = limn p(aii`ix) exists almost surely.1. for each k.. which establishes the theorem.u(B) = f 1. e In other words.C must El have measure 0 with respect to 1. (B)= f p.4. 4 E Ak defines a measure itx on sequences of length k..)c) can be thought of as measures. The measure . Of course.ux will be called the (limiting) empirical measure determined by x. Proof Since p (all' 14) = n . — ix).1. to a Borel measure i. Two sequences x and y are said to be frequencyequivalent if /ix = A y . An excellent discussion of ergodic components and their interpretation in communications theory can be found in [ 1 9 ]. each frequencyequivalence class is either entirely contained in E or disjoint from E. x E E.9 can be strengthened to assert that a shiftinvariant measure kt must actually be supported by the sequences for which the limiting frequencies not only exist but define ergodic measures.4. . by the Kolmogorov consistency theorem.u. and extends by countable additivity to any Borel set B. Theorem 1. Next define /27(x) = ktx.. by (6).(E) = 1 and hence the representation (7) takes the form.9 may now be expressed in the following much stronger form. BE E. Theorem 1.48 Theorem 1.u(alic) = f p(af Ix) dp.(x )(B) dco(7(x)). ± i E x [ ak.. The ergodic measures. Let E denote the set of all sequences x E L for which the measure ti. shows that to prove the ergodic decomposition theorem it is enough to prove the following lemma. The usual measure theory argument. BASIC CONCEPTS. Theorem 1. are called the ergodic components of kt and the formula represents p. as an average of its ergodic components. for each k and 4. B E E.(Ti.tx on A. To make this assertion precise first note that the ergodic theorem gives the formula (6) . a shiftinvariant measure always has its support in the set of sequences whose empirical measures are ergodic.....5. 
together with the finite form of ergodicity.x (a) = p(ainx). Since there are only a countable number of the al` .x is ergodic. for if x E r then the formula .4. (x). is shiftinvariant then p. Theorem 1. .10 ( The ergodic decomposition theorem. which can be extended.(x). CHAPTER I.4.( x )(B) da)(7r(x)). which is welldefined on the frequency equivalence class of x. Formula (6) can then be expressed in the form (7) . Let n denote the projection onto frequency equivalence classes.u. This formula indeed holds for any cylinder set. the complement of .) If p.u onto a measure w = .u o x 1 on equivalence classes.4.9 If it is a shiftinvariant measure then A ( C) = 1.
tt(CnICI) = ti Li P(xilz) dtt(z). the sequence {p(al 14)} converges to p(41x). in particular.SECTION 1. hence For each x E Li. Indeed. as n there is an integer N and a set L2 C Li of measure at least (1 — E2 )A(L1) such that (9) IP(414) — p(alx)I < E/4. z E L3.4. The conditions (9) and (8) guarantee that the relative frequency of occurrence of 4 in any two members x7. the set of sequences with limiting frequencies of all orders. [57]. 49 Lemma 1. Z E r3.q. itiz of Cn differ by no more E. differs by no more than E.4. and fix a sequence y E L. The set L1 is invariant since p(ainx) = p(4`17' x). 1] is bounded so L can be covered by a finite number of the sets Li(y) and hence the lemma is proved. x' iz E An. x E L2. fix a block 4. c A n . Theorem 1. The unit interval [0. such that for any x E X1 there is an N such that if n > N there is a collection C. which translates into the statement that p(C) > 1 — E.4. oo. This completes the proof of the ergodic decomposition theorem.11 Given c > 0 and 4 there is a set X1 with ii. Note. 1 . 1 (L1) L1 P(Crilz) d iu(z) > 1 — 6 2 so the Markov inequality yields a set L3 C L2 such that /4L3) > (1 — E)/L(L 1 ) and for which .(Xi) > 1 — E. such that the frequency of occurrence of 4 in any two members of C.12 The ergodic decomposition can also be viewed as an application of the ChoquetBishopdeLeeuw theorem of functional analysis. the set Ps of shiftinvariant probability measures is compact in the weak*topology (obtained by thinking of continuous functions as linear functionals on the space of all measures via the mapping it i— f f dit.10. see Exercise 1 of Section 1. for any x E L.9. .4. Let L i = Li(y) denote the set of all x E L such that (8) ip(a'nx) — p(41y)l < E/4. 0 Remark 1. Fix n > N and put Cn = {x: x E L2 }.E. that the relative frequency of occurrence of al in any two members of Li differ by no more than E/2.„ E p(xz) ?.) The extreme points of Ps correspond exactly to the ergodic measures. ILi) is shiftinvariant and 1 . 
I.4.d Exercises.

1. Show that an ergodic finite-state process is a function of an ergodic Markov chain. (It is an open question whether a mixing finite-state process is a function of a mixing Markov chain.)

2. Let T be an ergodic rotation of the circle and P the two-set partition of the unit circle into upper and lower halves. Describe the ergodic components of the (T × T, P × P)-process.

3. Let {X_n: n ≥ 1} be a stationary process. The reversed process {Y_n: n ≥ 1} is the (T^{−1}, P)-process, where P is the Kolmogorov partition in the two-sided representation of {X_n: n ≥ 1}. Show that {X_n} is ergodic if and only if its reversed process is ergodic.

4. Let P be the time-0 partition for an ergodic Markov chain with period d = 3.

(a) Determine the ergodic components of the (T³, P)-process.

(b) Determine the ergodic components of the (T³, P ∨ TP ∨ T²P)-process.

5. Use the central limit theorem to show that the "random walk with random scenery" process is ergodic. See Exercise 8 in Section 1.2 for a definition of this process. (Hint: with high probability x_k^{m+k} and z_1^t are looking at different parts of y.)

6. Let q_k(·|x_1^n) be the empirical distribution of nonoverlapping k-blocks in x_1^n, that is,

q_k(a_1^k|x_1^n) = (1/t)|{j ∈ [0, t−1]: x_{jk+1}^{jk+k} = a_1^k}|, where n = tk + r, 0 ≤ r < k.

(a) Show that if T is totally ergodic then q_k(a_1^k|x_1^n) → μ(a_1^k), almost surely.

(b) Show that the preceding result is not true if T is not totally ergodic.

7. Combine (5) with the ergodic theorem to establish that for an ergodic process the expected waiting time between occurrences of a symbol a ∈ A is just 1/μ(a). (Hint: show that the average time between occurrences of a in x_1^n is close to 1/p_1(a|x_1^n).)

8. Fix a ∈ A and let A_a^∞ be the set of all x ∈ [a] such that T^n x ∈ [a] for infinitely many n. Define the mapping x → R(x) = (R_1(x), R_2(x), ...) by

R_1(x) = min{m > 1: x_m = a} − 1, R_{i+1}(x) = min{m > 1 + Σ_{j≤i} R_j(x): x_m = a} − (1 + Σ_{j≤i} R_j(x)).

(This is just the return-time process associated with the set B = [a].) The conditional measure on [a] defined by an ergodic process μ transports to the measure ν = μ ∘ R^{−1}. This exercise illustrates a direct method for transporting a sample-path theorem from an ergodic process to a sample-path theorem for a (nonstationary) function of the process. Show the following.
(a) The mapping x → R(x) is Borel.

(b) R(T^{R_1(x)} x) = SR(x), where S is the shift on the range of R.

(d) (1/n) Σ_{i=1}^{n} R_i → 1/μ(a), ν-almost surely. (Hint: use the preceding exercise.)

(e) If T is the set of frequency-typical sequences for μ, then ν(R(T)) = 1. (Hint: use the Borel mapping lemma.)

Section 1.5
The entropy theorem.

Let μ be an ergodic measure for the shift T on the space A^∞, where A is finite. For stationary processes the measure μ(x_1^n) is nonincreasing in n and, except for some interesting cases, see Exercise 4c, has limit 0, almost surely. The entropy theorem asserts that, for ergodic processes, the decrease is almost surely exponential in n, with a constant exponential rate called the entropy or entropy-rate of the process, and denoted by h = h(μ).

Theorem 1.5.1 (The entropy theorem.) There is a nonnegative number h = h(μ) such that

lim_{n→∞} (1/n) log (1/μ(x_1^n)) = h, almost surely.

In the theorem and henceforth, log means the base 2 logarithm, and the natural logarithm will be denoted by ln.

The proof of the entropy theorem will use the packing lemma, some counting arguments, and the concept of the entropy of a finite distribution. The entropy of a probability distribution π(a) on a finite set A is defined by

H(π) = −Σ_{a∈A} π(a) log π(a).

Let

h_n(x) = (1/n) log (1/μ(x_1^n)),

so that if μ_n is the measure on A^n defined by μ, then H(μ_n) = nE(h_n(x)). The next three lemmas contain the facts about the entropy function and its connection to counting that will be needed to prove the entropy theorem.

Lemma 1.5.2 The entropy function H(π) is concave in π and attains its maximum value log|A| only for the uniform distribution π(a) ≡ 1/|A|.

Proof An elementary calculus exercise. □

Lemma 1.5.3 E(h_n(x)) ≤ log|A|.

Proof Since H(μ_n) = nE(h_n(x)), the lemma follows from the preceding lemma. □
Lemma 1.5.4 (The combinations bound.) If (n k) denotes the number of combinations of n objects taken k at a time and δ ≤ 1/2, then

Σ_{k≤nδ} (n k) ≤ 2^{nH(δ)},

where H(δ) = −δ log δ − (1 − δ) log(1 − δ).

Proof The function −q log δ − (1 − q) log(1 − δ) is increasing in the interval 0 ≤ q ≤ δ, and hence

2^{−nH(δ)} ≤ δ^k (1 − δ)^{n−k}, k ≤ nδ.

Multiplying by (n k) and summing gives

2^{−nH(δ)} Σ_{k≤nδ} (n k) ≤ Σ_{k≤nδ} (n k) δ^k (1 − δ)^{n−k} ≤ 1,

which proves the lemma. □

The function H(δ) = −δ log δ − (1 − δ) log(1 − δ) is called the binary entropy function. It is just the entropy of the binary distribution with π(1) = δ. It can be shown, see [7], that the binary entropy function gives the correct (asymptotic) exponent in the number of binary sequences of length n that have no more than δn ones, that is,

lim_n (1/n) log Σ_{k≤nδ} (n k) = H(δ).

I.5.a The proof of the entropy theorem.

Define

h_n(x) = −(1/n) log μ(x_1^n), h(x) = lim inf_n h_n(x).

The first task is to show that h(x) is constant, almost surely. To establish this note that if y = Tx then [y_1^j] ⊇ [x_1^{j+1}], so that μ(y_1^j) ≥ μ(x_1^{j+1}) and hence h(Tx) ≤ h(x), almost surely, that is, h(x) is subinvariant. Since subinvariant functions are almost surely invariant, h(Tx) = h(x), almost surely. Since T is assumed to be ergodic, the invariant function h(x) must indeed be constant, almost surely. Define h to be the constant such that h = lim inf_n h_n(x), almost surely.

The goal is to show that h must be equal to lim sup_n h_n(x), almost surely. Towards this end, let ε be a positive number.
The definition of limit inferior implies that for almost all x the inequality h_n(x) ≤ h + ε holds for infinitely many values of n, a fact expressed in exponential form as

(1) μ(x_1^n) ≥ 2^{−n(h+ε)}, infinitely often, almost surely.

The goal is to show that "infinitely often, almost surely" can be replaced by "eventually, almost surely," for a suitable multiple of ε.

Three ideas are used to complete the proof. The first idea is a packing idea. Eventually almost surely, most of a sample path is filled by disjoint blocks of varying lengths for each of which the inequality (1) holds. This is a simple application of the ergodic stopping-time packing lemma. The second idea is a counting idea. The set of sample paths of length K that can be mostly filled by disjoint, long subblocks for which (1) holds cannot have cardinality exponentially much larger than 2^{K(h+ε)}. Indeed, if these subblocks are to be long, then there are not too many ways to specify their locations, and if they mostly fill, then once their locations are specified, there are not too many ways the parts outside the long blocks can be filled. Most important of all is that once locations for these subblocks are specified, then since a location of length n can be filled in at most 2^{n(h+ε)} ways if (1) is to hold, there will be a total of at most 2^{K(h+ε)} ways to fill all the locations. The third idea is a probability idea. If a set of K-length sample paths has cardinality only a bit more than 2^{K(h+ε)}, then it is very unlikely that a sample path in the set has probability exponentially much smaller than 2^{−K(h+ε)}. This is just an application of the fact that upper bounds on cardinality "almost" imply lower bounds on probability, Lemma I.1.18(b).

To fill in the details of the first idea, let δ be a positive number and M > 1/δ an integer, both to be specified later. For each K ≥ M, let G_K(δ, M) be the set of all x_1^K that are (1 − 2δ)-packed by disjoint blocks of length at least M for which the inequality (1) holds. In other words, x_1^K ∈ G_K(δ, M) if and only if there is a collection S = S(x_1^K) = {[n_i, m_i]} of disjoint subintervals of [1, K] with the following properties.

(a) m_i − n_i + 1 ≥ M, [n_i, m_i] ∈ S.

(b) μ(x_{n_i}^{m_i}) ≥ 2^{−(m_i−n_i+1)(h+ε)}, [n_i, m_i] ∈ S.

(c) Σ_{[n_i,m_i]∈S} (m_i − n_i + 1) ≥ (1 − 2δ)K.

An application of the packing lemma produces the following result.
Lemma 1.5.5 x_1^K ∈ G_K(δ, M), eventually almost surely.
Proof Define τ(x) to be the first time n ≥ M such that μ([x_1^n]) ≥ 2^{−n(h+ε)}. Since τ is measurable and (1) holds infinitely often, almost surely, τ is an almost surely finite stopping time. An application of the ergodic stopping-time lemma, Lemma 1.3.7, then yields the lemma. □
The second idea, the counting idea, is expressed as the following lemma.
Lemma 1.5.6 There is a δ > 0 and an M > 1/δ such that |G_K(δ, M)| ≤ 2^{K(h+2ε)}, for all sufficiently large K.
Proof A collection S = {[n_i, m_i]} of disjoint subintervals of [1, K] will be called a skeleton if it satisfies the requirement that m_i − n_i + 1 ≥ M, for each i, and if it covers all but a 2δ-fraction of [1, K], that is,

(2) Σ_{[n_i,m_i]∈S} (m_i − n_i + 1) ≥ (1 − 2δ)K.
A sequence x_1^K is said to be compatible with such a skeleton S if

μ(x_{n_i}^{m_i}) ≥ 2^{−(m_i−n_i+1)(h+ε)}

for each i. The bound of the lemma will be obtained by first upper bounding the number of possible skeletons, then upper bounding the number of sequences x_1^K that are compatible with a given skeleton. The product of these two numbers is an upper bound for the cardinality of G_K(δ, M), and a suitable choice of δ will then establish the lemma.

First note that the requirement that each member of a skeleton S have length at least M means that |S| ≤ K/M, and hence there are at most K/M ways to choose the starting points of the intervals in S. Thus the number of possible skeletons is upper bounded by

(3) Σ_{k≤K/M} (K k) ≤ 2^{K H(1/M)},

where the upper bound is provided by Lemma 1.5.4, with H(·) denoting the binary entropy function.

Fix a skeleton S = {[n_i, m_i]}. The condition that x_1^K be compatible with S means that the compatibility condition

(4) μ(x_{n_i}^{m_i}) ≥ 2^{−(m_i−n_i+1)(h+ε)}

must hold for each [n_i, m_i] ∈ S. For a given [n_i, m_i] the number of ways x_{n_i}^{m_i} can be chosen so that the compatibility condition (4) holds is upper bounded by 2^{(m_i−n_i+1)(h+ε)}, by the principle that lower bounds on probability imply upper bounds on cardinality, Lemma I.1.18(a). Thus, the number of ways x_j can be chosen so that j ∈ ∪_i [n_i, m_i] and so that the compatibility conditions hold is upper bounded by

∏_i 2^{(m_i−n_i+1)(h+ε)} ≤ 2^{K(h+ε)}.
Outside the union of the [n_i, m_i] there are no conditions on x_j. Since, however, there are fewer than 2δK such j, these positions can be filled in at most |A|^{2δK} ways. Thus, there are at most

|A|^{2δK} 2^{K(h+ε)}

sequences compatible with a given skeleton S = {[n_i, m_i]}. Combining this with the bound, (3), on the number of possible skeletons yields
(5) |G_K(δ, M)| ≤ 2^{K H(1/M)} |A|^{2δK} 2^{K(h+ε)}.
Since the binary entropy function H(1/M) approaches 0 as M → ∞ and since |A| is finite, the numbers δ > 0 and M > 1/δ can indeed be chosen so that |G_K(δ, M)| ≤ 2^{K(h+2ε)}, for all sufficiently large K. This completes the proof of Lemma 1.5.6. □

Fix δ > 0 and M > 1/δ for which Lemma 1.5.6 holds, put G_K = G_K(δ, M), and let B_K be the set of all x_1^K for which μ(x_1^K) ≤ 2^{−K(h+3ε)}. Then

μ(B_K ∩ G_K) ≤ |G_K| 2^{−K(h+3ε)} ≤ 2^{−Kε}

holds for all sufficiently large K. Thus, x_1^K ∉ B_K ∩ G_K, eventually almost surely, by the Borel-Cantelli principle. Since x_1^K ∈ G_K, eventually almost surely, the iterated almost-sure principle, Lemma 1.1.15, implies that x_1^K ∉ B_K, eventually almost surely, that is,
lim sup_{K→∞} h_K(x) ≤ h + 3ε, a.s.

In summary, for each ε > 0,

h = lim inf_{K→∞} h_K(x) ≤ lim sup_{K→∞} h_K(x) ≤ h + 3ε, a.s.,

which completes the proof of the entropy theorem, since ε is arbitrary. □
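The combinations bound of Lemma 1.5.4, which controlled the number of skeletons in the proof above, is easy to check numerically. A minimal sketch in Python (the sample values of n and δ are arbitrary):

```python
import math

# Check the combinations bound (Lemma 1.5.4):
#   sum_{k <= n*delta} C(n, k) <= 2^{n H(delta)},  delta <= 1/2,
# where H is the binary entropy function.
def binary_entropy(d):
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def lhs(n, d):
    # exact integer sum of binomial coefficients with k <= n*d
    return sum(math.comb(n, k) for k in range(int(n * d) + 1))

for n in (10, 100, 1000):
    for d in (0.05, 0.1, 0.3, 0.5):
        assert lhs(n, d) <= 2 ** (n * binary_entropy(d))
print("combinations bound verified")
```

The bound is far from tight for small δ, but as the text notes, H(δ) is the correct asymptotic exponent as n grows.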
Remark 1.5.7 The entropy theorem was first proved for Markov processes by Shannon, with convergence in probability established by McMillan and almost-sure convergence later obtained by Breiman; see [4] for references to these results. In information theory the entropy theorem is called the asymptotic equipartition property, or AEP. In ergodic theory it has been traditionally known as the Shannon-McMillan-Breiman theorem. The more descriptive name "entropy theorem" is used in this book. The proof given is due to Ornstein and Weiss, [51], and appeared as part of their extension of ergodic theory ideas to random fields and general amenable group actions. A slight variant of their proof, based on the separated packing idea discussed in Exercise 1, Section 1.3, appeared in [68].
I.5.b Exercises.
1. Prove the entropy theorem for the i.i.d. case by using the product formula μ(x_1^n) = ∏_{i=1}^{n} μ(x_i), then taking the logarithm and using the strong law of large numbers. This yields the formula h = −Σ_a μ(a) log μ(a).

2. Use the idea suggested by the preceding exercise to prove the entropy theorem for ergodic Markov chains. What does it give for the value of h?
3. Suppose for each k, T_k is a subset of A^k of cardinality at most 2^{kα}. A sequence x_1^n is said to be (K, δ, {T_k})-packed if it can be expressed as the concatenation x_1^n = w(1) ⋯ w(t), such that the sum of the lengths of the w(i) which belong to ∪_{k=K}^∞ T_k is at least (1 − δ)n. Let G_n be the set of all (K, δ, {T_k})-packed sequences x_1^n and let ε be a positive number. Show that if K is large enough, if δ is small enough, and if n is large enough relative to K and δ, then |G_n| ≤ 2^{n(α+ε)}.

4. Assume μ is ergodic and define c(x) = lim_n μ(x_1^n), x ∈ A^∞.

(a) Show that c(x) is almost surely a constant c.

(b) Show that if c > 0 then μ is concentrated on a finite set.

(c) Show that if μ is mixing then c(x) = 0 for every x.
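Exercise 1 can be illustrated numerically: for an i.i.d. process the product formula makes −(1/n) log μ(x_1^n) an average of i.i.d. random variables, so the strong law drives it to h = −Σ_a μ(a) log μ(a). A minimal sketch (the alphabet and distribution below are hypothetical, chosen for illustration):

```python
import math
import random

# H(p) = -sum_a p(a) log2 p(a), the entropy of a finite distribution.
def entropy_bits(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

# h_n(x) = -(1/n) log2 mu(x_1^n); for an i.i.d. process mu(x_1^n) = prod_i p(x_i),
# so this is a sample average of the i.i.d. variables -log2 p(X_i).
def empirical_exponent(sample, p):
    return -sum(math.log2(p[a]) for a in sample) / len(sample)

random.seed(1)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
h = entropy_bits(p)                      # 1.5 bits for this distribution
sample = random.choices(list(p), weights=list(p.values()), k=100_000)
h_n = empirical_exponent(sample, p)
print(h, round(h_n, 3))                  # h_n is close to h, as the theorem predicts
```

The same computation with the transition-count decomposition of log μ(x_1^n) gives the Markov case of Exercise 2.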
Section 1.6
Entropy as expected value.
Entropy for ergodic processes, as defined by the entropy theorem, is given by the almost-sure limit

h = lim_{n→∞} (1/n) log (1/μ(x_1^n)).

Entropy can also be thought of as the limit of the expected value of the random quantity −(1/n) log μ(x_1^n). The expected value formulation of entropy will be developed in this section.
I.6.a
The entropy of a random variable.
Let X be a finite-valued random variable with distribution defined by p(x) = Prob(X = x), x ∈ A. The entropy of X is defined as the expected value of the random variable −log p(X), that is,

H(X) = Σ_{x∈A} p(x) log (1/p(x)) = −Σ_{x∈A} p(x) log p(x).
The logarithm base is 2 and the conventions 0 log 0 = 0 and log 0 = −∞ are used. If p is the distribution of X, then H(p) may be used in place of H(X). For a pair (X, Y) of random variables with a joint distribution p(x, y) = Prob(X = x, Y = y), the notation H(X, Y) = −Σ_{x,y} p(x, y) log p(x, y) will be used, a notation which extends to random vectors. Most of the useful properties of entropy depend on the concavity of the logarithm function. One way to organize the concavity idea is expressed as follows.
Lemma 1.6.1 If p and q are probability k-vectors then

−Σ_i p_i log p_i ≤ −Σ_i p_i log q_i,

with equality if and only if p = q.
Proof The natural logarithm is strictly concave, so that ln x ≤ x − 1, with equality if and only if x = 1. Thus

Σ_i p_i ln (q_i/p_i) ≤ Σ_i p_i (q_i/p_i − 1) = Σ_i q_i − Σ_i p_i = 0,

with equality if and only if q_i = p_i, 1 ≤ i ≤ k. This proves the lemma, since log x = (ln x)/(ln 2). □
The proof only requires that q be a subprobability vector, that is, nonnegative with Σ_i q_i ≤ 1. The sum

D(p‖q) = Σ_i p_i log (p_i/q_i)

is called the (informational) divergence, or cross-entropy, and the preceding lemma is expressed in the following form.
Lemma 1.6.2 (The divergence inequality.) If p is a probability k-vector and q is a subprobability k-vector then D(p‖q) ≥ 0, with equality if and only if p = q.
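The divergence inequality is easy to check numerically. A minimal sketch (the random test vectors are just for illustration):

```python
import math
import random

# D(p||q) = sum_i p_i log2(p_i / q_i); by the divergence inequality this is
# nonnegative, with equality iff p = q (q may be a subprobability vector).
def divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_prob_vector(k, rng):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p = random_prob_vector(5, rng)
    q = random_prob_vector(5, rng)
    assert divergence(p, q) >= 0
assert divergence([0.2, 0.8], [0.2, 0.8]) == 0
print("divergence inequality verified")
```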
A further generalization of the lemma, called the log-sum inequality, is included in the exercises. The basic inequalities for entropy are summarized in the following theorem.

Theorem 1.6.3 (Entropy inequalities.)

(a) Positivity. H(X) ≥ 0, with equality if and only if X is constant.

(b) Boundedness. If X has k values then H(X) ≤ log k, with equality if and only if each p(x) = 1/k.

(c) Subadditivity. H(X, Y) ≤ H(X) + H(Y), with equality if and only if X and Y are independent.

Proof Positivity is easy to prove, while boundedness is obtained from Lemma 1.6.1 by setting p_x = p(x), q_x = 1/k. To establish subadditivity note that
H(X, Y) = −Σ_{x,y} p(x, y) log p(x, y),
then replace p(x, y) in the logarithm factor by p(x)p(y), and use Lemma 1.6.1 to obtain the inequality H(X, Y) ≤ H(X) + H(Y), with equality if and only if p(x, y) = p(x)p(y), that is, if and only if X and Y are independent. This completes the proof of the theorem. □

The concept of conditional entropy provides a convenient tool for organizing further results. If p(x, y) is a given joint distribution, with corresponding conditional distribution p(x|y) = p(x, y)/p(y), then

H(X|Y) = −Σ_{x,y} p(x, y) log p(x|y) = −Σ_{x,y} p(x, y) log (p(x, y)/p(y))
is called the conditional entropy of X, given Y. (Note that this is a slight variation on standard probability language, which would call −Σ_x p(x|y) log p(x|y) the conditional entropy. In information theory, however, the common practice is to take expected values with respect to the marginal, p(y), as is done here.) The key identity for conditional entropy is the following addition law.
(1)
H(X, Y) = H(Y) + H(X|Y).
This is easily proved using the additive property of the logarithm, log ab = log a+log b. The previous unconditional inequalities extend to conditional entropy as follows. (The proofs are left to the reader.)
Theorem 1.6.4 (Conditional entropy inequalities.)

(a) Positivity. H(X|Y) ≥ 0, with equality if and only if X is a function of Y.

(b) Boundedness. H(X|Y) ≤ H(X), with equality if and only if X and Y are independent.

(c) Subadditivity. H((X, Y)|Z) ≤ H(X|Z) + H(Y|Z), with equality if and only if X and Y are conditionally independent given Z.
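The addition law (1) is an exact identity, so it can be verified directly on any small joint distribution. A minimal sketch (the joint distribution below is hypothetical):

```python
import math

# A joint distribution p(x, y) on a small alphabet, chosen for illustration.
p = {("a", 0): 0.3, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.4}

def H(dist):
    # entropy of a distribution given as a dict of probabilities
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

# Marginal p(y), then H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y).
py = {}
for (x, y), v in p.items():
    py[y] = py.get(y, 0.0) + v
H_XY = H(p)
H_Y = H(py)
H_X_given_Y = -sum(v * math.log2(v / py[y]) for (x, y), v in p.items())

assert abs(H_XY - (H_Y + H_X_given_Y)) < 1e-12   # the addition law (1)
print(H_XY, H_Y, H_X_given_Y)
```

The same few lines also let one spot-check the boundedness inequality H(X|Y) ≤ H(X) of Theorem 1.6.4(b).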
A useful fact is that conditional entropy H(X|Y) increases as more is known about the first variable and decreases as more is known about the second variable, that is, for any functions f and g,

Lemma 1.6.5 H(f(X)|Y) ≤ H(X|Y) ≤ H(X|g(Y)).
The proof follows from the concavity of the logarithm function. This can be done directly (left to the reader), or using the partition formulation of entropy which is developed in the following paragraphs.

The entropy of a random variable X really depends only on the partition P_X = {P_a: a ∈ A} defined by P_a = {x: X(x) = a}, which is called the partition defined by X. The entropy H(P) is defined as H(X), where X is any random variable such that P_X = P. Note that the join P_X ∨ P_Y is just the partition defined by the vector (X, Y), so that H(P_X ∨ P_Y) = H(X, Y). The conditional entropy of P relative to Q is then defined by H(P|Q) = H(P ∨ Q) − H(Q). The partition point of view provides a useful geometric framework for interpretation of the inequalities in Lemma 1.6.5, because the partition P_X is a refinement of the partition P_{f(X)}, since each atom of P_{f(X)} is a union of atoms of P_X. The inequalities in Lemma 1.6.5 are expressed in partition form as follows.
Lemma 1.6.6

(a) If P refines Q then H(Q|R) ≤ H(P|R).

(b) If R refines S then H(P|S) ≥ H(P|R).
Proof The proof of (a) is accomplished by manipulating with entropy formulas, namely,

H(P|R) =(i) H(P ∨ Q|R) =(ii) H(Q|R) + H(P|Q ∨ R) ≥(iii) H(Q|R).

The equality (i) follows from the fact that P refines Q, so that P = P ∨ Q, the equality (ii) is just the general addition law, and the inequality (iii) uses the fact that H(P|Q ∨ R) ≥ 0.

To prove (b) it is enough to consider the case when R is obtained from S by splitting one atom of S into two pieces, say R_a = S_a, a ≠ b, S_b = R_b ∪ R_b′. Let P_C denote the partition of the set C defined by restricting the sets in P to C, where the conditional measure, μ(·|C), is used on C. The quantity H(P|R) can be expressed as

−Σ_t Σ_{a≠b} μ(P_t ∩ R_a) log (μ(P_t ∩ R_a)/μ(R_a)) + μ(S_b) H(P_{S_b}|R_{S_b}).

Since H(P_{S_b}|R_{S_b}) ≤ H(P_{S_b}), the quantity H(P|R) is upper bounded by

−Σ_t Σ_{a≠b} μ(P_t ∩ R_a) log (μ(P_t ∩ R_a)/μ(R_a)) + μ(S_b) H(P_{S_b}).

The latter is the same as H(P|S), which establishes inequality (b). □

I.6.b The entropy of a process.

The entropy of a process is defined by a suitable passage to the limit. The nth order entropy of a sequence {X_1, X_2, ...} of A-valued random variables is defined by

H(X_1^n) = Σ_{x_1^n∈A^n} p(x_1^n) log (1/p(x_1^n)) = −Σ_{x_1^n∈A^n} p(x_1^n) log p(x_1^n).

The process entropy is defined by passing to a limit,

(2) H({X_n}) = lim sup_n (1/n) H(X_1^n) = lim sup_n −(1/n) Σ_{x_1^n} p(x_1^n) log p(x_1^n).

In the stationary case the limit superior is a limit. This is a consequence of the following basic subadditivity property of nonnegative sequences.

Lemma 1.6.7 (The subadditivity lemma.) If {a_n} is a sequence of nonnegative numbers which is subadditive, that is, a_{n+m} ≤ a_n + a_m, then lim_n a_n/n exists and equals inf_n a_n/n.

Proof Let a = inf_n a_n/n. Given ε > 0 choose n so that a_n ≤ n(a + ε). If m > n write m = np + r, 0 ≤ r < n. Subadditivity gives a_{np} ≤ pa_n, so that if b = sup_{0≤r<n} a_r, then another use of subadditivity yields

a_m ≤ a_{np} + a_r ≤ pa_n + b.
Division by m, together with the fact that np/m → 1 as m → ∞, gives lim sup_m a_m/m ≤ a + ε. This proves the lemma. □

The subadditivity property for entropy, Theorem 1.6.3(c), gives H(X_1^{n+m}) ≤ H(X_1^n) + H(X_{n+1}^{n+m}). If the process is stationary then H(X_{n+1}^{n+m}) = H(X_1^m), so the subadditivity lemma with a_n = H(X_1^n) implies that the limit superior in (2) is a limit.

An alternative formula for the process entropy in the stationary case is obtained by using the general addition law, H(X, Y) = H(Y) + H(X|Y), to produce the formula

H(X_{−n}^0) − H(X_{−n}^{−1}) = H(X_0|X_{−n}^{−1}), n > 0.

The right-hand side is nonincreasing in n, and the simple fact from analysis, that if a_n ≥ 0 and a_{n+1} − a_n decreases to b then a_n/n → b, can then be applied to give

(3) H({X_n}) = lim_n H(X_0|X_{−n}^{−1}),

a formula often expressed in the suggestive form

(4) H({X_n}) = H(X_0|X_{−∞}^{−1}).

Let μ be an ergodic process with alphabet A and process entropy H, and let h be the entropy-rate of μ as given by the entropy theorem, that is,

lim_n (1/n) log (1/μ(x_1^n)) = h, almost surely.

The next goal is to show that for ergodic processes the process entropy H is the same as the entropy-rate h of the entropy theorem. Towards this end, the following lemma will be useful in controlling entropy on small sets.

Lemma 1.6.8 (The subset entropy-bound.)

−Σ_{a∈B} p(a) log p(a) ≤ p(B) log|B| − p(B) log p(B), for any B ⊆ A.

Proof Let p(·|B) denote the conditional measure on B and use the trivial bound

−Σ_{a∈B} p(a|B) log p(a|B) ≤ log|B|.

The left-hand side is the same as

(1/p(B)) (−Σ_{a∈B} p(a) log p(a) + p(B) log p(B)),

from which the result follows. □
Theorem 1.6.9 Process entropy H and entropy-rate h are the same for ergodic processes.

Proof First assume that h > 0 and fix ε such that 0 < ε < h. Define

G_n = {x_1^n: 2^{−n(h+ε)} ≤ μ(x_1^n) ≤ 2^{−n(h−ε)}}

and let B_n = A^n − G_n. Then

H(X_1^n) = −Σ_{x_1^n∈G_n} μ(x_1^n) log μ(x_1^n) − Σ_{x_1^n∈B_n} μ(x_1^n) log μ(x_1^n).

The two sums will be estimated separately. The subset entropy-bound, Lemma 1.6.8, since |B_n| ≤ |A|^n, gives

(5) −Σ_{x_1^n∈B_n} μ(x_1^n) log μ(x_1^n) ≤ nμ(B_n) log|A| − μ(B_n) log μ(B_n).

After division by n, this part must go to 0 as n → ∞, since the entropy theorem implies that μ(B_n) goes to 0. On the set G_n the following holds,

n(h − ε) ≤ −log μ(x_1^n) ≤ n(h + ε),

so multiplying by μ(x_1^n)/n and summing produces

(6) (h − ε)μ(G_n) ≤ −(1/n) Σ_{x_1^n∈G_n} μ(x_1^n) log μ(x_1^n) ≤ (h + ε)μ(G_n).

As n → ∞ the measure of G_n approaches 1, while (1/n)H(X_1^n) converges to H({X_n}). Thus h − ε ≤ H({X_n}) ≤ h + ε, which proves the theorem in the case when h > 0, since ε is arbitrary.

If h = 0 then define G_n = {x_1^n: μ(x_1^n) ≥ 2^{−nε}}. The bound (5) still holds, while only the upper bound in (6) matters, and the theorem again follows. □

Remark 1.6.10 The recent books by Gray, [18], and Cover and Thomas, [6], discuss many information-theoretic aspects of entropy for the general ergodic process. A detailed mathematical exposition and philosophical interpretation of entropy as a measure of information can be found in Billingsley's book, [4]. The Csiszár-Körner book, [7], contains an excellent discussion of combinatorial and communication theory aspects of entropy. Since for ergodic processes the process entropy H({X_n}) is the same as the entropy-rate h of the entropy theorem, both are often simply called the entropy.
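Theorem 1.6.9 can be illustrated numerically: for a small Markov chain the per-symbol entropies (1/n)H(X_1^n) can be computed exactly by enumeration and compared with the conditional entropy −Σ_i π_i Σ_j M_{ij} log M_{ij} (the Markov entropy formula derived in the next subsection). A minimal sketch, with a hypothetical two-state chain:

```python
import math
from itertools import product

# Hypothetical two-state chain with transition matrix M and stationary vector pi
# (check: 0.8*0.9 + 0.2*0.4 = 0.8, so pi M = pi).
M = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.4, "b": 0.6}}
pi = {"a": 0.8, "b": 0.2}

def seq_prob(seq):
    # mu(x_1^n) = pi(x_1) * prod M(x_i, x_{i+1})
    p = pi[seq[0]]
    for s, t in zip(seq, seq[1:]):
        p *= M[s][t]
    return p

def block_entropy(n):
    # H(X_1^n) by direct enumeration of all 2^n sequences
    return -sum(q * math.log2(q) for seq in product("ab", repeat=n)
                for q in [seq_prob(seq)] if q > 0)

h_markov = -sum(pi[i] * M[i][j] * math.log2(M[i][j]) for i in "ab" for j in "ab")
rates = [block_entropy(n) / n for n in (1, 4, 8, 12)]
print(h_markov, rates)   # the per-symbol rates decrease toward h_markov
```

The decrease of (1/n)H(X_1^n) to the limit reflects the subadditivity lemma: the sequence a_n = H(X_1^n) is subadditive, so a_n/n converges to its infimum.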
I.6.c The entropy of i.i.d. and Markov processes.

Entropy formulas for i.i.d. and Markov processes can be derived from the entropy theorem by using the ergodic theorem to directly estimate μ(x_1^n), see Exercises 1 and 2 in Section 1.5. Here they will be derived from the definition of process entropy.

Let {X_n} be an i.i.d. process and let p(x) = Prob(X_1 = x). The additivity of entropy for independent random variables, Theorem 1.6.3(c), gives

H(X_1^n) = H(X_1) + H(X_2) + ⋯ + H(X_n) = nH(X_1),

which yields the entropy formula

(7) H({X_n}) = H(X_1) = −Σ_x p(x) log p(x).

An alternate proof can be given by using the conditional limit formula, (3), and the fact that H(X|Y) = H(X) if X and Y are independent, to obtain

H({X_n}) = lim_n H(X_0|X_{−n}^{−1}) = H(X_0).

Now suppose {X_n} is a Markov chain with stationary vector p and transition matrix M. The Markov property implies that

Prob(X_0 = a|X_{−n}^{−1}) = Prob(X_0 = a|X_{−1}), a ∈ A,

from which it follows that H(X_0|X_{−n}^{−1}) = H(X_0|X_{−1}). Thus the conditional entropy formula for the entropy of a process, (3), along with a direct calculation of H(X_0|X_{−1}), yields the following formula for the entropy of a Markov chain,

(8) H({X_n}) = H(X_0|X_{−1}) = −Σ_i p_i Σ_j M_{ij} log M_{ij}.

Recall that a process {X_n} is Markov of order k if

Prob(X_0 = a|X_{−n}^{−1}) = Prob(X_0 = a|X_{−k}^{−1}), n ≥ k, a ∈ A.

The argument used in the preceding paragraph shows that if {X_n} is Markov of order k then H({X_n}) = H(X_0|X_{−k}^{−1}) = H(X_0|X_{−n}^{−1}), n ≥ k. This condition actually implies that the process is Markov of order k.

Theorem 1.6.11 (The Markov order theorem.) A stationary process {X_n} is Markov of order k if and only if H(X_0|X_{−k}^{−1}) = H(X_0|X_{−n}^{−1}), n ≥ k.

Proof The conditional addition law gives

H((X_0, X_{−n}^{−k−1})|X_{−k}^{−1}) = H(X_{−n}^{−k−1}|X_{−k}^{−1}) + H(X_0|X_{−n}^{−1}).
The second term on the right can be replaced by H(X_0|X_{−k}^{−1}), provided H(X_0|X_{−k}^{−1}) = H(X_0|X_{−n}^{−1}), for n ≥ k. The equality condition of the subadditivity principle, Theorem 1.6.4(c), can then be used to conclude that X_0 and X_{−n}^{−k−1} are conditionally independent given X_{−k}^{−1}. If this is true for every n ≥ k then the process must be Markov of order k. This completes the proof of the theorem. □

In general the conditional entropy function H_k = H(X_0|X_{−k}^{−1}) is nonincreasing in k. To say that the process is Markov of some order is to say that H_k is eventually constant. If the true order is k*, then H_{k*−1} > H_{k*} = H_k, k ≥ k*. This fact can be used to estimate the order of a Markov chain from observation of a sample path.

I.6.d Entropy and types.

The empirical distribution or type of a sequence x_1^n ∈ A^n is the probability distribution p_1 = p_1(·|x_1^n) on A defined by the relative frequency of occurrence of each symbol in the sequence, that is,

p_1(a|x_1^n) = |{i ∈ [1, n]: x_i = a}|/n, a ∈ A.

Two sequences x_1^n and y_1^n are said to be type-equivalent if they have the same type, that is, if each symbol a appears in x_1^n the same number of times it appears in y_1^n. Type-equivalence classes are called type classes. The type class of x_1^n will be denoted by T(x_1^n) or by T_{p_1}, where the latter stresses the type p_1 = p_1(·|x_1^n), rather than a particular sequence that defines the type.

The empirical (first-order) entropy of x_1^n is the entropy of p_1(·|x_1^n), that is,

Ĥ(p_1) = Ĥ(p_1(·|x_1^n)) = −Σ_a p_1(a|x_1^n) log p_1(a|x_1^n).

The entropy of an empirical distribution gives the exponent for a bound on the number of sequences that could have produced that empirical distribution. This fact is useful in large deviations theory, and will be useful in some of the deeper interpretations of entropy to be given in later chapters. The following purely combinatorial result bounds the size of a type class in terms of the empirical entropy of the type.

Theorem 1.6.12 (The type class bound.) |T_{p_1}| ≤ 2^{nĤ(p_1)}, for any type p_1.

Proof First note that if Q^n is a product measure on A^n then

(9) Q^n(x_1^n) = ∏_{i=1}^{n} Q(x_i) = ∏_{a∈A} Q(a)^{np_1(a)}, x_1^n ∈ T_{p_1},

since np_1(a) is the number of times the symbol a appears in a given x_1^n ∈ T_{p_1}. In particular, a product measure is always constant on each type class.

A type p_1 defines a product measure P^n on A^n by the formula P^n(z_1^n) = ∏_{i} p_1(z_i), z_1^n ∈ A^n. Replacing Q^n by P^n in the product formula, (9), produces

P^n(x_1^n) = ∏_{a∈A} p_1(a)^{np_1(a)}, x_1^n ∈ T_{p_1},

which, after taking the logarithm and rewriting, yields

P^n(x_1^n) = 2^{−nĤ(p_1)}, x_1^n ∈ T_{p_1}.

In other words, P^n has the constant value 2^{−nĤ(p_1)} on the type class of x_1^n, and hence

P^n(T_{p_1}) = |T_{p_1}| 2^{−nĤ(p_1)}.

But P^n is a probability distribution so that P^n(T_{p_1}) ≤ 1, and hence |T_{p_1}| ≤ 2^{nĤ(p_1)}. This establishes Theorem 1.6.12. □

The bound 2^{nĤ(p_1)}, while not tight, is the correct asymptotic exponential bound for the size of a type class, as shown in the Csiszár-Körner book, [7]. The upper bound is all that will be used in this book. Later use will also be made of the fact that there are only polynomially many type classes.

Theorem 1.6.13 (The number-of-types bound.) The number of possible types is at most (n + 1)^{|A|}.

Proof This follows from the fact that for each a ∈ A, the only possible values for np_1(a|x_1^n) are 0, 1, ..., n. □

The concept of type extends to the empirical distribution of overlapping k-blocks. The empirical overlapping k-block distribution or k-type of a sequence x_1^n ∈ A^n is the probability distribution p_k = p_k(·|x_1^n) on A^k defined by the relative frequency of occurrence of each k-block in the sequence, that is,

p_k(a_1^k|x_1^n) = |{i ∈ [1, n − k + 1]: x_i^{i+k−1} = a_1^k}|/(n − k + 1), a_1^k ∈ A^k.

Two sequences x_1^n and y_1^n are said to be k-type-equivalent if they have the same k-type, that is, if and only if each block a_1^k appears in x_1^n exactly (n − k + 1)p_k(a_1^k) times. k-type-equivalence classes are called k-type classes, and the k-type class of x_1^n will be denoted by T_{p_k}. The bound on the number of types extends immediately to k-types.

Theorem 1.6.14 (The number of k-types bound.) The number of possible k-types is at most (n − k + 1)^{|A|^k}.

Note, in particular, that if k is fixed the number of possible k-types grows polynomially in n, while if k is also growing, but satisfies k ≤ α log_{|A|} n, with α < 1, then the number of k-types is of lower order than |A|^n.
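The type class bound can be checked by brute force on short sequences: enumerate the type class and compare with 2^{nĤ(p_1)}. A small sketch (the example sequence is arbitrary):

```python
import math
from itertools import product
from collections import Counter

# Empirical entropy of the type of x_1^n.
def type_entropy(x):
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(x).values())

x = "aabab"                       # a small example sequence over A = {a, b}
n, A = len(x), "ab"
target = Counter(x)
# Brute-force enumeration of the type class: all sequences with the same counts.
type_class = [s for s in product(A, repeat=n) if Counter(s) == target]
bound = 2 ** (n * type_entropy(x))
print(len(type_class), bound)     # the count never exceeds the bound
assert len(type_class) <= bound
```

Here the type class has C(5, 3) = 10 members, comfortably below the bound, illustrating that 2^{nĤ} is an upper bound rather than an exact count.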
Estimating the size of a k-type class is a bit trickier, since the k-type measures the frequency of overlapping k-blocks. A suitable bound can be obtained, however, by considering the (k − 1)st order Markov measure defined by the k-type p_k(·|x_1^n), that is, the stationary (k − 1)st order Markov chain μ̃^{(k−1)} with the empirical transition function

μ̃(a_k|a_1^{k−1}) = p_k(a_1^k|x_1^n) / Σ_{b_k} p_k(a_1^{k−1}b_k|x_1^n),

and with the stationary distribution given by μ̃(a_1^{k−1}) = Σ_{b_k} p_k(a_1^{k−1}b_k|x_1^n). The Markov chain μ̃^{(k−1)} has entropy

Ĥ^{(k−1)} = −Σ_{a_1^{k−1}} μ̃(a_1^{k−1}) Σ_{a_k} μ̃(a_k|a_1^{k−1}) log μ̃(a_k|a_1^{k−1}) = −Σ_{a_1^k} p_k(a_1^k|x_1^n) log μ̃(a_k|a_1^{k−1}),

by formula (8). The entropy Ĥ^{(k−1)} is called the empirical (k − 1)st order Markov entropy of x_1^n.

Theorem 1.6.15 (The k-type-class bound.) |T_k(x_1^n)| ≤ (n − k + 1) 2^{(n−k+1)Ĥ^{(k−1)}}.

Proof First consider the case k = 2. If Q is Markov, a direct calculation yields the formula

Q(x_1^n) = Q(x_1) ∏_{a,b} Q(b|a)^{(n−1)p_2(ab)}.

This formula, with Q replaced by μ̃^{(1)}, and after suitable rewrite using the definition of Ĥ^{(1)}, becomes

μ̃^{(1)}(x_1^n) = μ̃^{(1)}(x_1) 2^{−(n−1)Ĥ^{(1)}}.

Note that it is constant on the k-type class T_2(x_1^n) = T_{p_2} of all sequences that have the same 2-type p_2 as x_1^n. Since μ̃^{(1)} is a probability measure on A^n,

1 ≥ μ̃^{(1)}(T_{p_2}) ≥ |T_{p_2}| (1/(n − 1)) 2^{−(n−1)Ĥ^{(1)}},

since μ̃^{(1)}(x_1) ≥ 1/(n − 1). This yields the desired bound for the first-order case. The proof for general k is obtained by an obvious extension of the argument. □

I.6.e Exercises.

1. Prove the log-sum inequality: If a_i ≥ 0, b_i > 0, i = 1, 2, ..., n, and a = Σ a_i, b = Σ b_i, then Σ_i a_i log(a_i/b_i) ≥ a log(a/b).

2. Let P be the time-0 partition for an ergodic Markov chain with period d = 3 and entropy H > 0.

(a) Determine the entropies of the ergodic components of the (T³, P)-process.
(b) Determine the entropies of the ergodic components of the (T³, P ∨ TP ∨ T²P)-process.

3. Show that the process entropy of a stationary process and its reversed process are the same. (Hint: use stationarity.)

4. Prove that the entropy of an n-stationary process {Y_n} is the same as the entropy of the stationary process {X_n} obtained by randomizing the start. (Hint: let S be uniformly distributed on [1, n] and note that H(X_1^N, S) = H(X_1^N) + H(S|X_1^N) = H(S) + H(X_1^N|S), for any N > n, so H(X_1^N)/N ∼ H(X_1^N|S)/N. Then show that H(X_1^N|S = s) is the same as the unshifted entropy of X_s^{N+1}.)

5. The divergence for k-set partitions is D(P‖Q) = Σ_i μ(P_i) log(μ(P_i)/μ(Q_i)).

(a) Show that D(P ∨ R‖Q ∨ R) ≥ D(P‖Q), for any partition R.

(b) Prove Pinsker's inequality, that D(P‖Q) ≥ (1/(2 ln 2))|P − Q|², where |P − Q| = Σ_i |μ(P_i) − μ(Q_i)|. (Hint: use part (a) to reduce to the two-set case, then use calculus.)

6. Let T be the shift on X = {−1, 1}^Z and let μ be the product measure on X defined by μ_0(−1) = μ_0(1) = 1/2. Let P be the time-zero partition of X. Let Y = X × X, with the product measure ν = μ × μ, and define S(x, y) = (Tx, T^{x_0}y). Find the entropy of the (S, Q)-process, where Q = P × P, which is the "random walk with random scenery" process discussed in Exercise 8 of Section 1.2. (Hint: recurrence of simple random walk implies that all the sites of y have been almost surely visited by the past of the walk.)

7. Let μ be an ergodic process and let α be a positive number. Show there is an N = N(α) such that if n ≥ N then there is a set F_n, measurable with respect to x_n^∞, such that μ(F_n) > 1 − α, and such that if x ∈ F_n then

2^{−nα}μ(x_1^n) ≤ μ(x_1^n|x_{n+1}^∞) ≤ 2^{nα}μ(x_1^n).

(Hint: apply the ergodic theorem to f(x) = −log μ(x_1|x), and use the equality of process entropy and entropy-rate.)

Section 1.7
Interpretations of entropy.

Two simple, useful interpretations of entropy will be discussed in this section. The first is an expression of the entropy theorem in exponential form, which leads to the concept of entropy-typical sequence, and to the related building-block concept. The second interpretation is the connection between entropy and expected code length for the special class of prefix codes.
with the product measure y = j x it.2. Let Y = X x X.a Entropytypical sequences. (Hint: use stationarity.bt 1 a2012 .) 4.(Pi)1 A(Q i )). Then show that I1(r IS = s) is the same as the unshifted entropy of Xs/v+1 .) 5. and use the equality of process entropy and entropyrate. Let T be the shift on X = {1. For E > 0 and n > 1 define En(E) 1 4: 2—n(h+E) < tt (x in) < 2—n(h6)} The set T(E) is called the set of (n.u0(1) = p.(1) = 1/2.u(Pi ) . which is the "random walk with random scenery" process discussed in Exercise 8 of Section 1.u(Fn ) > 1 — a and so that if xn E Fn then 27" 11. useful interpretations of entropy will be discussed in this section.
Theorem 1.7.1 (The typical-sequence form of entropy.) For each ε > 0, x_1^n ∈ T_n(ε), eventually almost surely; in particular, lim_n μ(T_n(ε)) = 1.

The convergence in probability form, lim_n μ(T_n(ε)) = 1, is known in information theory as the asymptotic equipartition property, or AEP.

The phrase "typical sequence" has different meanings in different contexts. Sometimes it means frequency-typical sequences, as defined in Section 1.4, sometimes it means entropy-typical sequences, as defined here, sometimes it means sequences that are both frequency-typical and entropy-typical, and sometimes it is just shorthand for those sequences that are likely to occur. The context usually makes clear the notion of typicality being used. Here the focus is on the entropy-typical idea.

The members of the entropy-typical set T_n(ε) all have the lower bound 2^{-n(h+ε)} on their probabilities. Since the total probability is at most 1, this fact yields an upper bound on the cardinality of T_n(ε).

Theorem 1.7.2 (The entropy-typical cardinality bound.) The set of entropy-typical sequences satisfies |T_n(ε)| ≤ 2^{n(h+ε)}.

Thus, even though there are |A|^n possible sequences of length n, the measure is eventually mostly concentrated on a set of sequences of the (generally) much smaller cardinality 2^{n(h+ε)}. This fact is of key importance in information theory and plays a major role in many applications and interpretations of the entropy concept.

Typical sequences also have an upper bound on their probabilities, which leads to the fact that too-small sets cannot be visited too often.

Theorem 1.7.3 (The too-small set principle.) If C_n ⊆ A^n and |C_n| ≤ 2^{n(h-ε)}, then x_1^n ∉ C_n, eventually almost surely.

Proof. Since x_1^n ∈ T_n(ε/2), eventually almost surely, it is enough to show that x_1^n ∉ C_n ∩ T_n(ε/2), eventually almost surely. The cardinality bound on C_n and the probability upper bound 2^{-n(h-ε/2)} on members of T_n(ε/2) combine to give the bound

    μ(C_n ∩ T_n(ε/2)) ≤ 2^{n(h-ε)} 2^{-n(h-ε/2)} = 2^{-nε/2}.

Thus μ(C_n ∩ T_n(ε/2)) is summable in n, and the theorem is established.

Another useful formulation of entropy, suggested by the upper bound, Theorem 1.7.2, expresses the connection between entropy and coverings. For α > 0 define the (n, α)-covering number by

    N_n(α) = min{|C| : C ⊆ A^n and μ(C) ≥ α},

that is, N_n(α) is the minimum number of sequences of length n needed to fill an α-fraction of the total probability. A good way to think of N_n(α) is given by the following algorithm for its calculation.

(A1) List the n-sequences in decreasing order of probability.
(A2) Count down the list until the first time a total probability of at least α is reached.

The covering number N_n(α) is the count obtained in (A2).
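The two-step algorithm (A1)-(A2) translates directly into code. The following sketch is illustrative and not from the text; the i.i.d. Bernoulli block distribution is an assumed example chosen for concreteness.

```python
from itertools import product

def covering_number(probs, alpha):
    """(A1) list the probabilities in decreasing order;
    (A2) count down the list until total probability >= alpha."""
    total, count = 0.0, 0
    for p in sorted(probs, reverse=True):
        total += p
        count += 1
        if total >= alpha:
            break
    return count

# Assumed example: i.i.d. Bernoulli(0.7) process, block length n = 10.
p, n, alpha = 0.7, 10, 0.5
probs = [p ** sum(x) * (1 - p) ** (n - sum(x)) for x in product([0, 1], repeat=n)]
N = covering_number(probs, alpha)
# N is far smaller than |A|^n = 1024: the measure concentrates on about 2^{nh}
# sequences, as the covering-exponent theorem below makes precise.
```

For large n the covering exponent (1/n) log2 N_n(α) approaches the process entropy, here H(0.7) ≈ 0.881 bits per symbol.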
The connection with entropy is given by the following theorem.

Theorem 1.7.4 (The covering-exponent theorem.) For each α ∈ (0, 1), the covering exponent (1/n) log N_n(α) converges to h, as n → ∞.

Proof. Fix α ∈ (0, 1) and ε > 0. Since the measure of the set of typical sequences goes to 1 as n → ∞, if n is large enough then μ(T_n(ε)) exceeds α. When this happens,

    N_n(α) ≤ |T_n(ε)| ≤ 2^{n(h+ε)},

by Theorem 1.7.2, and hence lim sup_n (1/n) log N_n(α) ≤ h. On the other hand, suppose μ(C_n) ≥ α and |C_n| = N_n(α), for each n. If n is large enough then μ(T_n(ε) ∩ C_n) ≥ α(1 - ε). The bound μ(x_1^n) ≤ 2^{-n(h-ε)}, for x_1^n ∈ T_n(ε), then implies that

    N_n(α) = |C_n| ≥ |T_n(ε) ∩ C_n| ≥ 2^{n(h-ε)} μ(T_n(ε) ∩ C_n) ≥ 2^{n(h-ε)} α(1 - ε),

which proves the theorem.

The covering-exponent idea is quite useful as a tool for entropy estimation; it will be used, for example, to show that a rotation process has entropy 0, see the proof in Section I.7.e, below.

The connection between coverings and entropy also has a useful approximate form, in which a small blowup of the covering set is allowed. Let d(a, b) be the (discrete) metric on A, defined by

    d(a, b) = 0, if a = b;  d(a, b) = 1, if a ≠ b,

and extend to the metric d_n on A^n, defined by

    d_n(x_1^n, y_1^n) = (1/n) Σ_{i=1}^n d(x_i, y_i).

The metric d_n is also known as the per-letter Hamming distance. The distance d_n(x_1^n, S) from x_1^n to a set S ⊆ A^n is defined by

    d_n(x_1^n, S) = min{d_n(x_1^n, y_1^n) : y_1^n ∈ S},

and the δ-neighborhood or δ-blowup of S is defined by

    [S]_δ = {x_1^n : d_n(x_1^n, S) ≤ δ}.

The blowup form of the covering number is defined by

    (1) N_n(α, δ) = min{|C| : C ⊆ A^n and μ([C]_δ) ≥ α},

that is, it is the minimum size of an n-set for which the δ-blowup covers an α-fraction of the probability.

It is important that a small blowup does not increase size by more than a small exponential factor. To state this precisely, let H(δ) = -δ log δ - (1 - δ) log(1 - δ) denote the binary entropy function.

Lemma 1.7.5 (The blowup bound.) The δ-blowup of S ⊆ A^n satisfies

    |[S]_δ| ≤ |S| 2^{nH(δ)} (|A| - 1)^{nδ}.

Proof. Given x_1^n, there are at most nδ positions i in which x_i can be changed to create a member of [{x_1^n}]_δ, and each such position can be changed in |A| - 1 ways. An application of the combinations bound implies that |[{x_1^n}]_δ| ≤ 2^{nH(δ)} (|A| - 1)^{nδ}, which, summed over the members of S, yields the stated bound. This proves the blowup-bound lemma.

Since |T_n(ε)| ≤ 2^{n(h+ε)}, and since δ log |A| and H(δ) both go to 0 as δ → 0, it follows that, given ε > 0, there is a δ > 0 such that

    |[T_n(ε)]_δ| ≤ 2^{n(h+2ε)},

for all n. In particular, if δ is small enough, then the bound N_n(α, δ) ≤ 2^{n(h+2ε)} will hold, provided only that n is not too small. An application of Theorem 1.7.4 then establishes that

    lim_{δ→0} lim_{n→∞} (1/n) log N_n(α, δ) = h.

It is left to the reader to prove that the limit of (1/n) log N_n(α, δ) exists as n → ∞.

I.7.b The building-block concept.

An application of the ergodic theorem shows that frequency-typical sequences must consist mostly of n-blocks which are entropy-typical. Thus the entropy-typical n-sequences can be thought of as the "building blocks" from which longer sequences are made by concatenating typical sequences, with occasional spacers inserted between the blocks. This idea and several useful consequences of it will now be developed.

Fix a collection B_n ⊆ A^n, thought of as the set of building blocks. Also fix an integer M ≥ n and δ > 0. A sequence x_1^M is said to be (1 - δ)-built-up from the building blocks B_n if there is an integer I = I(x_1^M) and a collection {[n_i, m_i] : i ≤ I} of disjoint n-length subintervals of [1, M], such that

    (a) Σ_{i=1}^I (m_i - n_i + 1) ≥ (1 - δ)M,
    (b) x_{n_i}^{m_i} ∈ B_n, for i ≤ I.

Both the number I and the intervals [n_i, m_i] are allowed to depend on the sequence x_1^M; all that is required is that each x_{n_i}^{m_i} be a member of B_n. In the special case when δ = 0 and M is a multiple of n, to say that x_1^M is 1-built-up from B_n is the same as saying that x_1^M is a concatenation of blocks from B_n. If δ > 0 then the notion of (1 - δ)-built-up requires that x_1^M be a concatenation of blocks from B_n, with spacers allowed between the blocks, subject only to the requirement that the total length of the spacers be at most δM.

The reader should note that the concepts of blowup and built-up are quite different. The blowup concept focuses on creating sequences by making a small density of otherwise arbitrary changes, while the built-up concept only allows changes in the spaces between the building blocks, but allows arbitrary selection of the blocks from a fixed collection.
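The per-letter Hamming distance and the δ-blowup membership test amount to a few lines of code. This is an illustrative sketch, not from the text; the set S below is an arbitrary assumed example.

```python
def d_n(x, y):
    """Per-letter Hamming distance d_n between equal-length sequences."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y)) / len(x)

def dist_to_set(x, S):
    """Distance d_n(x, S) from x to a finite set S of n-sequences."""
    return min(d_n(x, y) for y in S)

def in_blowup(x, S, delta):
    """Membership test for the delta-blowup [S]_delta."""
    return dist_to_set(x, S) <= delta

S = {"0000", "1111"}   # assumed example set of building blocks
# "0001" differs from "0000" in one of four places, so d_n = 0.25,
# and "0001" therefore lies in the 0.25-blowup of S.
```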
An important fact about the building-block concept is that if δ is small and n is large, then the set of M-sequences that can be (1 - δ)-built-up from a given set B_n ⊆ A^n of building blocks cannot be exponentially much larger in cardinality than the set of all sequences that can be formed by selecting M/n sequences from B_n and concatenating them without spacers. This is stated in precise form as the following lemma. As usual, H(δ) = -δ log δ - (1 - δ) log(1 - δ) denotes the binary entropy function.

Lemma 1.7.6 (The built-up set bound.) Let D_M be the set of all sequences x_1^M that can be (1 - δ)-built-up from a given collection B_n ⊆ A^n. Then

    |D_M| ≤ |B_n|^{M/n} 2^{MH(δ)} |A|^{Mδ}.

In particular, if B_n = T_n(ε), the set of entropy-typical n-sequences relative to ε, and if δ is small enough, then |D_M| ≤ 2^{M(h+2ε)}.

The proof is similar in spirit to the proof of the key bound used to establish the entropy theorem.

Proof. The number of ways to select a family of disjoint n-length subintervals {[n_i, m_i]} that cover all but a δ-fraction of [1, M] is upper bounded by the number of ways to select at most δM points from a set with M members, which is, in turn, upper bounded by 2^{MH(δ)}, by the combinations bound. For a fixed configuration of locations, say {[n_i, m_i], i ≤ I}, the number of ways to fill these with members of B_n is upper bounded by |B_n|^I ≤ |B_n|^{M/n}, while the number of ways to fill the places that are not in ∪_i [n_i, m_i] is upper bounded by |A|^{δM}. Thus

    |D_M| ≤ 2^{MH(δ)} |B_n|^{M/n} |A|^{δM},

which is the desired bound. The bound for entropy-typical building blocks follows immediately from the fact that |T_n(ε)| ≤ 2^{n(h+ε)}. This establishes the lemma.

The building-block idea is closely related to the packing/covering ideas discussed in Section I.3. A sequence x_1^M is said to be (1 - δ)-strongly-covered by B_n ⊆ A^n if

    |{i ∈ [1, M - n + 1] : x_i^{i+n-1} ∉ B_n}| ≤ δ(M - n + 1).

The argument used to prove the packing lemma, though simpler since now the blocks all have a fixed length, can be used to show that almost strongly-covered sequences are also almost built-up. This is stated in precise form as the following lemma.

Lemma 1.7.7 (The building-block lemma.) If x_1^M is (1 - δ/2)-strongly-covered by B_n, and if M ≥ 2n/δ, then x_1^M is (1 - δ)-built-up from B_n.

Proof. Put m_0 = 0 and, for i > 0, define n_i to be the least integer k > m_{i-1} such that x_k^{k+n-1} ∈ B_n, and put m_i = n_i + n - 1, stopping when within n of the end of x_1^M. The assumption that x_1^M is (1 - δ/2)-strongly-covered by B_n implies there are at most δM/2 indices j ≤ M - n + 1 which are not contained in one of the [n_i, m_i], while the condition that M ≥ 2n/δ implies there are at most δM/2 indices j ∈ [M - n + 1, M] which are not contained in one of the [n_i, m_i]. The lemma is therefore established.

In particular, if B_n = T_n(ε) and n is large enough, the almost-covering principle implies that eventually almost surely x_1^M is almost strongly-covered by sequences from T_n(ε), hence eventually almost surely mostly built-up from members of T_n(ε), so that x_1^M is eventually in D_M. A suitable application of the built-up set lemma, Lemma 1.7.6, implies that the set D_M of sequences that are mostly built-up from the building blocks T_n(ε) will eventually have cardinality not exponentially much larger than 2^{M(h+ε)}. The sequences in D_M are not necessarily entropy-typical, for no upper and lower bounds on their probabilities are given; all that is known is that the members of D_M are almost built-up from entropy-typical n-blocks. This result, which can be viewed as the entropy version of the finite-sequence form of the frequency-typical-sequence characterization of ergodic processes, will be surprisingly useful later.

Remark 1.7.8 The preceding two lemmas are strictly combinatorial. In combination with the ergodic and entropy theorems they provide powerful tools for analyzing the combinatorial properties of partitions of sample paths.

I.7.c Entropy and prefix codes.

In a typical data compression problem a given finite sequence x_1^n, drawn from some finite alphabet A, is to be mapped into a binary sequence b_1^r = b_1 b_2 ... b_r, whose length r may depend on x_1^n, in such a way that the source sequence x_1^n is recoverable from knowledge of the encoded sequence b_1^r. The goal is to use as little storage space as possible, that is, to make the code length as short as possible, at least in some average sense. Of course, there are many possible source sequences, so that a typical code must be designed to encode a large number of different sequences and accurately decode them. A standard model is to think of a source sequence as a (finite) sample path drawn from some ergodic process. (In information theory a stationary, ergodic process with finite alphabet A is usually called a source.) In this section the focus will be on a special class of codes, known as prefix codes, for which there is a close connection between code length and source entropy, a subject to be discussed further in the next chapter.

A (binary) code on A is a mapping C: A → B*. A code is said to be faithful, or noiseless, if it is one-to-one. In the following discussion B* denotes the set of all finite-length binary sequences and ℓ(w) denotes the length of a member w ∈ B*. An image C(a) is called a codeword, and the range of C, {C(a) : a ∈ A}, is called the codebook.
The function that assigns to each a ∈ A the length of the codeword C(a) is called the length function of the code and denoted by ℓ(.|C), or by ℓ if C is understood; it is the function defined by ℓ(a|C) = ℓ(C(a)). The expected length of a code C relative to a probability distribution μ on A is

    E_μ(ℓ) = Σ_{a ∈ A} ℓ(a|C) μ(a).

The entropy of μ is defined by the formula

    H(μ) = Σ_{a ∈ A} μ(a) log (1/μ(a)),

that is, H(μ) is just the expected value of the random variable -log μ(X), where X has the distribution μ. (As usual, base 2 logarithms are used.)

Without further assumptions about the code there is little connection between entropy and expected code length. For codes that satisfy a simple prefix condition, however, entropy is a lower bound to code length, a lower bound which is almost tight.

To develop the prefix code idea some notation and terminology will be introduced. For u = u_1^n ∈ B* and v = v_1^m ∈ B*, the concatenation uv is the sequence of length n + m defined by

    (uv)_i = u_i, 1 ≤ i ≤ n;  (uv)_i = v_{i-n}, n < i ≤ n + m.

A nonempty word u is called a prefix of a word w if w = uv; the prefix u is called a proper prefix of w if w = uv where v is nonempty. A nonempty set W ⊆ B* is prefix free if no member of W is a proper prefix of any member of W.

A prefix-free set W has the property that it can be represented as the leaves of the (labeled, directed) binary tree T(W), defined by the following two conditions.

(i) The vertex set V(W) of T(W) is the set of prefixes of members of W, together with the empty word.
(ii) If v ∈ V(W) has the form v = ub, where b ∈ B, then there is a directed edge, labeled by b, from u to v.

Since W is prefix free, the words w ∈ W correspond to leaves L(T) of the tree T(W). Note that the root of the tree is the empty word and the depth d(v) of a vertex v is just the length of the word v. (See Figure 1.7.9.)

Figure 1.7.9 The binary tree for W = {00, 0100, 0101, 100, 101}.

An important, and easily proved, property of binary trees is that the sum of 2^{-d(v)} over all leaves v of the tree can never exceed 1. For prefix-free sets W of binary sequences this fact takes the form

    (2) Σ_{w ∈ W} 2^{-ℓ(w)} ≤ 1,
where ℓ(w) denotes the length of w ∈ W.

A code C is a prefix code if ã = a whenever C(a) is a prefix of C(ã). Thus C is a prefix code if and only if C is one-to-one and the range of C is prefix free. The codewords of a prefix code are therefore just the leaves of a binary tree; the leaf corresponding to C(a) can be labeled with a or with C(a). In particular, for any prefix code C the Kraft inequality

    (3) Σ_{a ∈ A} 2^{-ℓ(a|C)} ≤ 1

holds. A geometric proof of this inequality is sketched in Exercise 10.

The Kraft inequality has the following converse.

Lemma 1.7.10 If ℓ_1 ≤ ℓ_2 ≤ ... ≤ ℓ_t is a nondecreasing sequence of positive integers such that Σ_{i=1}^t 2^{-ℓ_i} ≤ 1, then there is a prefix-free set W = {w(i) : 1 ≤ i ≤ t} such that ℓ(w(i)) = ℓ_i, i ∈ [1, t].

Proof. Define i and j to be equivalent if ℓ_i = ℓ_j, and let {G_1, ..., G_s} be the equivalence classes, written in order of increasing length L(r) = ℓ_i, i ∈ G_r. The Kraft inequality then takes the form

    Σ_{r=1}^s |G_r| 2^{-L(r)} ≤ 1.

Assign the indices in G_1 to binary words of length L(1) in some one-to-one way, which is possible since |G_1| ≤ 2^{L(1)}. Assign G_2 in some one-to-one way to binary words of length L(2) that do not have already-assigned words as prefixes. This is possible since

    |G_1| 2^{-L(1)} + |G_2| 2^{-L(2)} ≤ 1,

so that |G_2| ≤ 2^{L(2)} - |G_1| 2^{L(2)-L(1)}, where the second term on the right gives the number of words of length L(2) that have prefixes already assigned. An inductive continuation of this assignment argument clearly establishes the lemma.
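The assignment argument is effective, and can be sketched as code: process the lengths in increasing order, keeping an integer counter of the next free codeword at the current length (an illustrative implementation, not from the text).

```python
def kraft_code(lengths):
    """Build a prefix-free codebook whose i-th word has length lengths[i],
    assuming the Kraft sum satisfies sum(2**-l) <= 1."""
    assert sum(2.0 ** -l for l in lengths) <= 1.0 + 1e-12
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    words = [None] * len(lengths)
    next_word, prev_len = 0, 0       # next free codeword, viewed as an integer
    for i in order:
        l = lengths[i]
        next_word <<= l - prev_len   # pad with 0's out to the new length
        words[i] = format(next_word, "0{}b".format(l))
        next_word += 1               # skip every word having this one as prefix
        prev_len = l
    return words
```

For example, `kraft_code([2, 1, 3, 3])` yields `['10', '0', '110', '111']`, a prefix-free set realizing the given lengths; the Kraft inequality guarantees the counter never runs out of words at any length.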
For prefix codes, entropy is always a lower bound on average code length, a lower bound that is "almost" tight. This fundamental connection between entropy and prefix codes was discovered by Shannon.

Theorem 1.7.11 (Entropy and prefix codes.) Let μ be a probability distribution on A.
(i) H(μ) ≤ E_μ(ℓ(.|C)), for any prefix code C.
(ii) There is a prefix code C such that E_μ(ℓ(.|C)) ≤ H(μ) + 1.

Proof. Let C be a prefix code. A little algebra on the expected code length yields

    E_μ(ℓ(.|C)) = Σ_a ℓ(a|C) μ(a) = Σ_a μ(a) log 2^{ℓ(a|C)}
                = Σ_a μ(a) log [μ(a) / 2^{-ℓ(a|C)}] + Σ_a μ(a) log (1/μ(a)).

Since, by the Kraft inequality, the measure ν defined by ν(a) = 2^{-ℓ(a|C)} satisfies ν(A) ≤ 1, the divergence inequality implies that the first term is nonnegative, while the second term is just H(μ). This proves (i).
To prove part (ii), define ℓ(a) = ⌈-log μ(a)⌉, where ⌈x⌉ denotes the least integer ≥ x, and note that

    Σ_a 2^{-ℓ(a)} ≤ Σ_a 2^{log μ(a)} = Σ_a μ(a) = 1,

so that, by Lemma 1.7.10, there is a prefix code C with C(a) = w(a), a ∈ A, whose length function is ℓ. Since ℓ(a) ≤ 1 - log μ(a), it follows that

    E_μ(ℓ) ≤ Σ_a (1 - log μ(a)) μ(a) = 1 + H(μ).

Thus part (ii) is proved, which completes the proof of the theorem.

Next consider n-codes, that is, mappings C_n: A^n → B* from source sequences of length n to binary words. A faithful n-code C_n: A^n → B* is a prefix n-code if and only if its range is prefix free. Of interest for n-codes is the per-symbol code length ℓ(x_1^n|C_n)/n and the expected per-symbol code length E(ℓ(.|C_n))/n. Let μ_n be a probability measure on A^n with per-symbol entropy H_n(μ_n) = H(μ_n)/n. In this case, Theorem 1.7.11(i) takes the form

    (4) H_n(μ_n) ≤ (1/n) E(ℓ(.|C_n)),

for any prefix n-code C_n, while Theorem 1.7.11(ii) asserts the existence, for each n, of a prefix n-code C_n such that

    (5) (1/n) E(ℓ(.|C_n)) ≤ H_n(μ_n) + 1/n.

A sequence {C_n}, where C_n is a prefix n-code with length function ℓ(.|C_n), will be called a prefix-code sequence. The (asymptotic) rate of such a code sequence, relative to a process {X_n} with Kolmogorov measure μ, is defined by

    R_μ({C_n}) = lim sup_n (1/n) E_μ(ℓ(.|C_n)).

The two results, (4) and (5), then yield the following asymptotic results: "good" codes exist, that is, it is possible to compress as well as process entropy in the limit, and "too-good" codes do not exist, that is, no sequence of codes can asymptotically compress more than process entropy.

Theorem 1.7.12 (Process entropy and prefix codes.) If μ is a stationary process with process entropy H then
(a) There is a prefix-code sequence {C_n} such that R_μ({C_n}) ≤ H.
(b) There is no prefix-code sequence {C_n} such that R_μ({C_n}) < H.

In the next chapter, almost-sure versions of these two results will be obtained for ergodic processes.
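The bounds (4) and (5) are easy to check numerically for a single-letter Shannon-type code: with ℓ(a) = ⌈-log μ(a)⌉ the Kraft sum is at most 1, and the expected length lies between H(μ) and H(μ) + 1. A sketch, with an assumed example distribution:

```python
from math import ceil, log2

def shannon_lengths(mu):
    """Code lengths l(a) = ceil(-log2 mu(a)) for a distribution mu (a dict)."""
    return {a: ceil(-log2(p)) for a, p in mu.items()}

def entropy(mu):
    return -sum(p * log2(p) for p in mu.values())

mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed distribution
L = shannon_lengths(mu)
kraft_sum = sum(2.0 ** -l for l in L.values())       # <= 1, so a prefix code exists
mean_len = sum(mu[a] * L[a] for a in mu)             # H(mu) <= mean_len < H(mu) + 1
```

For this dyadic distribution the lengths are exactly -log2 μ(a), so the expected length equals the entropy, 1.75 bits.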
Remark 1.7.13 A prefix code with length function ℓ(a) = ⌈-log μ(a)⌉ will be called a Shannon code. Shannon's theorem implies that a Shannon code is within 1 of being the best possible prefix code, in the sense of minimizing expected code length. A somewhat more complicated coding procedure, called Huffman coding, produces prefix codes that minimize expected code length. Huffman codes have considerable practical significance, but for n-codes the per-letter difference in expected code length between a Shannon code and a Huffman code is at most 1/n, which is asymptotically negligible. For this reason no use will be made of Huffman coding in this book.

I.7.d Converting faithful codes to prefix codes.

Next it will be shown that, as far as asymptotic results are concerned, there is no loss of generality in assuming that a faithful code is a prefix code, so that asymptotic results about prefix codes automatically apply to the weaker concept of faithful code. The key to this is the fact that a faithful n-code can always be converted to a prefix code by adding a header (i.e., a prefix) to each codeword to specify its length. This can be done in such a way that the length of the header is (asymptotically) negligible compared to total codeword length. The following is due to Elias, [8].

Lemma 1.7.14 (The Elias-code lemma.) There is a prefix code E: N → B* such that ℓ(E(n)) = log n + o(log n).

Any prefix code with this property is called an Elias code.

Proof. The codeword assigned to n is a concatenation of three binary sequences, E(n) = u(n)v(n)w(n). The third part w(n) is the usual binary representation of n; its length is ⌈log(n + 1)⌉, the length of the binary representation of n. The second part v(n) is the binary representation of the length of w(n). The first part u(n) is just a sequence of 0's of length equal to the length of v(n), so that both u(n) and v(n) have length equal to ⌈log(1 + ⌈log(n + 1)⌉)⌉. For example, w(12) = 1100, v(12) = 100, and u(12) = 000, so that E(12) = 0001001100.

The code is a prefix code, for if u(n)v(n)w(n) = u(m)v(m)w(m)w', then u(n) = u(m), since both consist only of 0's and the first bit of both v(n) and v(m) is a 1; then v(n) = v(m), since both have length equal to the length of u(n); then w(n) = w(m), since both have length specified by v(n); so that w' is empty and n = m. The desired bound ℓ(E(n)) = log n + o(log n) follows easily, so the lemma is established.
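The three-part construction in the proof is short enough to sketch directly (an illustrative implementation; the decoder is included to show how the header is parsed):

```python
def elias(n):
    """Codeword E(n) = u(n) v(n) w(n): w(n) is the binary representation of n,
    v(n) the binary representation of the length of w(n), and u(n) a run of
    0's as long as v(n)."""
    assert n >= 1
    w = bin(n)[2:]
    v = bin(len(w))[2:]
    return "0" * len(v) + v + w

def elias_decode(code):
    """Inverse map for a single codeword: read the header first."""
    k = code.index("1")             # len(u) = len(v), and v starts with a 1
    w_len = int(code[k:2 * k], 2)   # v(n) encodes the length of w(n)
    return int(code[2 * k:2 * k + w_len], 2)
```

As in the text, `elias(12)` is the string `"0001001100"`.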
Given a faithful n-code C_n with length function ℓ(.|C_n), a prefix n-code C_n* is obtained by the formula

    (6) C_n*(x_1^n) = E(ℓ(x_1^n|C_n)) C_n(x_1^n), x_1^n ∈ A^n,

where E is an Elias prefix code on the integers. The code C_n* will be called an Elias extension of C_n; it is a prefix code, by Lemma 1.7.14. The decoder reads through the header information and learns where C_n(x_1^n) starts and how long it is. This enables the decoder to determine the preimage x_1^n, since C_n was assumed to be faithful. For example, if C_n(x_1^n) = 001001100111, a word of length 12, then

    C_n*(x_1^n) = 0001001100,001001100111,

where, for ease of reading, a comma was inserted between the header information and the code word C_n(x_1^n). In the example, the header information is almost as long as C_n(x_1^n), but header length becomes negligible relative to the length of C_n(x_1^n) as codeword length grows. Thus Theorem 1.7.12(b) extends to the following somewhat sharper form. (The definition of faithful-code sequence is obtained by replacing the word "prefix" by the word "faithful" in the definition of prefix-code sequence.)

Theorem 1.7.15 If μ is a stationary process with entropy-rate H there is no faithful-code sequence {C_n} such that R_μ({C_n}) < H.

Remark 1.7.16 Another application of the Elias prefix idea converts a prefix-code sequence {C_n} into a single prefix code C: A* → B*, where C is defined by C(x_1^n) = E(n)C_n(x_1^n), which is clearly a prefix code. The header tells the decoder which codebook to apply to decode the received message.

I.7.e Rotation processes have entropy 0.

As an application of the covering-exponent interpretation of entropy it will be shown that ergodic rotation processes have entropy 0. Let T be the transformation defined by

    T: x ↦ x ⊕ α, x ∈ [0, 1),

where α is irrational and ⊕ is addition modulo 1, and let P be a finite partition of the unit interval [0, 1).

Proposition 1.7.17 The (T, P)-process has entropy 0.

The proof for the special two-set partition P = (P_0, P_1), where P_0 = [0, 0.5) and P_1 = [0.5, 1), will be given in detail. The argument generalizes easily to arbitrary partitions into subintervals, and, with a bit more effort, to general partitions (by approximating by partitions into intervals). The proof will be based on direct counting arguments rather than the entropy formalism. Geometric arguments will be used to estimate the number of possible (T, P)-names of length n. These will produce an upper bound of the form 2^{nε_n}, where ε_n → 0 as n → ∞. The result then follows from the covering-exponent interpretation of entropy.
The key to the proof is the following uniform distribution property of irrational rotations.

Proposition 1.7.18 (The uniform distribution property.)

    |{m ∈ [1, n] : T^m x ∈ I}| / n → λ(I),

for any interval I ⊆ [0, 1) and any x ∈ [0, 1), where λ denotes Lebesgue measure.

Ergodicity asserts the truth of the proposition for almost every x. The fact that rotations are rigid motions allows the almost-sure result to be extended to a result that holds for every x. The proof is left as an exercise.

The key idea in the proof of the entropy 0 result is that, given z and x, some power y = T^j x of x must be close to z, and hence the (T, P)-names of y and z will agree most of the time. For translation is a rigid motion, so that T^m y and T^m z will be close for all m. Thus every name is obtained by changing the name of some fixed z in a small fraction of places and shifting a bounded amount, and hence there cannot be exponentially very many names.

To make the preceding argument precise, define the metric

    |x - y|_1 = min{|x - y|, 1 - |x - y|}, x, y ∈ [0, 1),

and the pseudometric

    d_n(x, y) = d_n(x_1^n, y_1^n) = (1/n) Σ_{i=1}^n d(x_i, y_i),

where {x_n} and {y_n} denote the (T, P)-names of x and y, respectively, and d(a, b) = 0 if a = b, d(a, b) = 1 if a ≠ b. The first lemma shows that d_n(x, y) is continuous relative to |x - y|_1, in the following uniform sense.

Lemma 1.7.19 Given ε > 0 there is an N = N(x, y) such that if n ≥ N and |x - y|_1 ≤ ε/4, then d_n(x, y) ≤ ε. Compactness shows that N can be chosen to be independent of x and y.

Proof. Consider the case when |x - y|_1 = |x - y|; the case when |x - y|_1 = 1 - |x - y| can be treated in a similar manner. Without loss of generality it can be assumed that x < y. Let I = [x, y]. The names of x and y disagree at time j if and only if T^j x and T^j y belong to different atoms of the partition P, which occurs when and only when either T^{-j}(0) ∈ I or T^{-j}(0.5) ∈ I. This implies that

    d_n(x, y) ≤ |{j ∈ [1, n] : T^{-j}(0) ∈ I or T^{-j}(0.5) ∈ I}| / n.

The uniform distribution property, which can be applied to T^{-1}, which is rotation by -α, provides an N = N(x, y) such that for n ≥ N the fraction on the right is at most 2λ(I) + ε/2 ≤ 2(ε/4) + ε/2 = ε, whenever |x - y|_1 ≤ ε/4. This proves the lemma.

The next lemma is just the formal statement that the name of any point x can be shifted by a bounded amount to obtain a sequence close to the name of z = 0 in the sense of the pseudometric d_n.

Lemma 1.7.20 Given ε > 0, there is an integer M and an integer K ≥ M such that for each x ∈ [0, 1) there is a j = j(x) ∈ [1, M] such that if z = 0 and y = T^j x then

    d_k(z, y) ≤ ε, k ≥ K.

Proof. Given x, apply the uniform distribution property to obtain a least positive integer j = j(x) such that |T^j x - 0|_1 < ε/4. Since small changes in x do not change j(x), there is a number M such that j(x) ≤ M for all x. The unit interval X can be partitioned into measurable sets X_j = {x : j(x) = j}, j ≤ M. Given ε > 0, choose N from the preceding lemma so that if n ≥ N and |x - y|_1 < ε/4 then d_n(x, y) < ε. With K = M + N the lemma follows.

Now for the final details of the proof that the entropy of the (T, P)-process is 0. Given ε > 0, determine M and K ≥ M from the preceding lemma, let n ≥ K + M, and let z = 0. Given x ∈ X_j, where j = j(x) ≤ M is given by the preceding lemma, note that d_{n-j}(z, T^j x) ≤ ε. The number of binary sequences of length n - j that can be obtained by changing the (n - j)-name of z in at most ε(n - j) places is upper bounded by 2^{(n-j)h(ε)}, according to the combinations bound, where h(ε) = -ε log ε - (1 - ε) log(1 - ε) is the binary entropy function. There are at most 2^j possible values for the first j places of any name. Thus an upper bound on the number of possible names of length n for members of X_j is

    (7) |{x_1^n : x ∈ X_j}| ≤ 2^j 2^{(n-j)h(ε)}.

Since there are only M possible values for j, the number of possible rotation names of length n is upper bounded by M 2^M 2^{nh(ε)}, for n ≥ K + M. Since h(ε) → 0 as ε → 0, this completes the proof that the (T, P)-process has entropy 0.
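The counting bound can be checked numerically: for the rotation by an irrational α with the two-set partition, the number of distinct n-names observed over many starting points stays of order n, nowhere near 2^n. This is a numerical sketch, not part of the proof; α = √2 - 1 is an assumed choice.

```python
from math import sqrt

ALPHA = sqrt(2) - 1   # an irrational rotation amount (assumed example)

def rotation_name(x, n, alpha=ALPHA):
    """(T, P)-name of x of length n, where Tx = x + alpha (mod 1),
    P0 = [0, 0.5) and P1 = [0.5, 1)."""
    bits = []
    for _ in range(n):
        bits.append("0" if x < 0.5 else "1")
        x = (x + alpha) % 1.0
    return "".join(bits)

n = 20
names = {rotation_name(i / 5000.0, n) for i in range(5000)}
# The n-names are constant on the arcs cut out of the circle by the 2n
# rotated endpoints of P, so len(names) is at most about 2n,
# versus 2**20 = 1048576 candidate binary words.
```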
I.7.f Exercises.

1. Give a direct proof of the covering-exponent theorem, Theorem 1.7.4, based on the algorithm (A1, A2) given for the computation of the covering number. (Hint: what does the entropy theorem say about the first part of the list?)

2. Let μ̄ be the concatenated-block process defined by a measure λ on A^n, and let H(λ) denote the entropy of λ.
(a) Show that H_{mn}(μ̄) ≤ mH(λ) + log n. (Hint: there is at least one way and at most n|A|^{2n} ways to express a given x_1^{mn} in the form u w(1) w(2) ... w(k-1) v, where each w(i) ∈ A^n and u and v are, respectively, the tail and head of words w(0), w(k) ∈ A^n.)
(b) Show that H_{mn}(μ̄) ≥ mH(λ).

3. Show that lim_n (1/n) log N_n(α, δ) exists, where N_n(α, δ) is defined in (1).

4. Prove Proposition 1.7.18. (Hint: apply Kronecker's theorem.)
that the entropyrate of an ergodic process and its reversed process are the same. 5. STATIONARY CODING.18. In some cases it is easy to establish a property for a block coding of a process. VX E AZ. Use the covering exponent concept to show that the entropy of an ergodic nstationary process is the same as the entropy of the stationary process obtained by randomizing its start. Suppose the waiting times between occurrences of "1" for a binary process it are independent and identically distributed. for all x'i' E A n .8. then passing to a suitable limit. in terms of the entropy of the waitingtime distribution. using Theorem 1. Let {Ifn } be the process obtained from the stationary ergodic process {X} by applying the block code CN and randomizing the start.7. 1] of length 2 —L(a ) whose left endpoint has dyadic expansion C (a). Show. Show. I. Show that the resulting set of intervals is disjoint. 11. Let p. Stationary coding was introduced in Example 1. and that this implies the Kraft inequality (2). Prove Proposition 1. Suppose {1'. Let C be a prefix code on A with length function L. The definition and notation associated with stationary coding will be reviewed and the basic approximation by finite codes will be established in this subsection.9. a shiftinvariant ergodic measure on A z and let B c A z be a measurable set of positive measure. 79 4. Recall that a stationary coder is a measurable mapping F: A z 1> B z such that F(TA X ) = TB F (x).„} is ergodic. then there is prefix code with length function L(a).7.4.8 Stationary coding. Use this idea to prove that if Ea 2 —C(a) < 1. 9.SECTION 1. . The second basic result is a technique for creating stationary codes from block codes. Section 1. 10. Show that H({Y„}) < 8. 7. 6. that the entropy of the induced process is 1/(A)//t(B). The first is that stationary codes can be approximated by finite codes. 
Show that if C„: A n H B* is a prefix code with length function L then there is a prefix code e: A" 1—* B* with length function E such that (a) Î(x) <L(x) + 1.8. for all xri' E An.a Approximation by finite codes. Often a property can be established by first showing that it holds for finite codes. then extend using the blocktostationary code construction to obtain a desired property for a stationary coding. Find an expression for the entropy of it.1. (b) 44) 5_ 2 ± 2n log IA I. using the entropy theorem. Two basic results will be established in this section. Associate with each a E A the dyadic subinterval of the unit interval [0.
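The dyadic-interval picture in the Kraft-inequality exercise above is easy to check numerically. The following sketch (illustrative code, not from the text; the four-word prefix code is a hand-picked example) verifies the Kraft sum and the disjointness of the associated dyadic intervals.

```python
# Illustrative check of the Kraft inequality and the dyadic-interval picture:
# the codeword C(a), of length L(a), corresponds to the dyadic interval of
# length 2**-L(a) whose left endpoint has binary expansion 0.C(a).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # a prefix code

def kraft_sum(code):
    return sum(2 ** -len(w) for w in code.values())

def interval(word):
    left = int(word, 2) / 2 ** len(word)   # left endpoint 0.word in binary
    return (left, left + 2 ** -len(word))

intervals = sorted(interval(w) for w in code.values())
# the prefix property forces the intervals to be pairwise disjoint
disjoint = all(b1 <= a2 for (_, b1), (a2, _) in zip(intervals, intervals[1:]))
```

Here the Kraft sum is exactly 1 and the four intervals tile [0, 1), which is the geometric content of the exercise.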
CHAPTER I. BASIC CONCEPTS.
where T_A and T_B denote the shifts on the respective two-sided sequence spaces A^Z and B^Z. Given a (two-sided) process with alphabet A and Kolmogorov measure μ_A, the encoded process has alphabet B and Kolmogorov measure μ_B = μ_A ∘ F^{−1}. The associated time-zero coder is the function f: A^Z → B defined by the formula f(x) = F(x)_0. Associated with the time-zero coder f: A^Z → B is the partition P = P(f) = {P_b: b ∈ B} of A^Z, defined by P_b = {x: f(x) = b}, that is,
(1)   x ∈ P_b if and only if f(x) = b.
Note that if y = F(x) then y_n = f(T^n x) if and only if T^n x ∈ P_{y_n}. In other words, y is the (T, P)-name of x and the measure μ_B is the Kolmogorov measure of the (T, P)-process. The partition P(f) is called the partition defined by the encoder f or, simply, the encoding partition. Note also that a measurable partition P = {P_b: b ∈ B} defines an encoder f by the formula (1), such that P = P(f), that is, P is the encoding partition for f. Thus there is a one-to-one correspondence between time-zero coders and measurable finite partitions of A^Z. In summary, a stationary coder can be thought of as a measurable function F: A^Z → B^Z such that F ∘ T_A = T_B ∘ F, or as a measurable function f: A^Z → B, or as a measurable partition P = {P_b: b ∈ B}. The descriptions are connected by the relationships
f(x) = F(x)_0,   F(x)_n = f(T^n x),   P_b = {x: f(x) = b}.
A time-zero coder f is said to be finite (with window half-width w) if f(x) = f(x̃) whenever x_{−w}^{w} = x̃_{−w}^{w}. In this case, the notation f(x_{−w}^{w}) is used instead of f(x). A key fact for finite coders is that each P_b is then a finite union of cylinder sets,

P_b = ⋃ {[x_{−w}^{w}]: f(x_{−w}^{w}) = b},
and, as shown in Example 1.1.9, for any stationary measure μ, cylinder set probabilities for the encoded measure ν = μ ∘ F^{−1} are given by the formula

(2)   ν(y_1^n) = μ(F^{−1}[y_1^n]) = Σ μ(x_{1−w}^{n+w}),

where the sum runs over those x_{1−w}^{n+w} for which f(x_{i−w}^{i+w}) = y_i, 1 ≤ i ≤ n.
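To make the finite-coder formalism concrete, here is a small sketch (not from the text; the majority-rule window coder and the i.i.d. Bernoulli choice of μ are illustrative assumptions). It applies a window half-width w = 1 coder along a sequence and computes an encoded cylinder probability by the sum in formula (2).

```python
from itertools import product

# f: a hypothetical majority-rule time-zero coder with window half-width w = 1
def f(window):
    return 1 if sum(window) >= 2 else 0

def encode(x, w=1):
    """Slide the window coder along the interior of a finite sequence."""
    return [f(tuple(x[i - w:i + w + 1])) for i in range(w, len(x) - w)]

# Formula (2) for an i.i.d. Bernoulli(p) choice of mu: nu(y_1^n) is the total
# mu-mass of the input cylinders x_{1-w}^{n+w} whose window images spell y_1^n.
def nu(y, p, w=1):
    n = len(y)
    total = 0.0
    for x in product([0, 1], repeat=n + 2 * w):
        if encode(list(x), w) == list(y):
            total += p ** sum(x) * (1 - p) ** (len(x) - sum(x))
    return total

# the nu-masses of all length-2 output cylinders must sum to 1
mass = sum(nu(list(y), 0.3) for y in product([0, 1], repeat=2))
```

Since every input cylinder contributes to exactly one output cylinder, the ν-masses of the output cylinders of a fixed length sum to 1, which is a quick sanity check on (2).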
The principal result in this subsection is that time-zero coders are "almost finite", in the following sense.
Theorem 1.8.1 (Finite-coder approximation.)
If f: A^Z → B is a time-zero coder, if μ is a shift-invariant measure on A^Z, and if ε > 0, then there is a finite time-zero coder f̃: A^Z → B such that

μ({x: f(x) ≠ f̃(x)}) ≤ ε.

Proof. Let P = {P_b: b ∈ B} be the encoding partition for f. Since P is a finite partition of A^Z into measurable sets, and since the measurable sets are generated by the finite cylinder sets, there is a positive integer w and a partition P̃ = {P̃_b} with the following two properties.

(a) Each P̃_b is a union of cylinder sets of the form [x_{−w}^{w}].
(b) Σ_b μ(P_b Δ P̃_b) ≤ ε.

Let f̃ be the time-zero coder defined by f̃(x) = b, if [x_{−w}^{w}] ⊂ P̃_b. Condition (a) assures that f̃ is a finite coder, and μ({x: f(x) ≠ f̃(x)}) ≤ ε follows from condition (b). This proves the theorem. □
Remark 1.8.2
The preceding argument only requires that the coded process be a finite-alphabet process. In particular, any stationary coding onto a finite-alphabet process of any i.i.d. process of finite or infinite alphabet can be approximated arbitrarily well by finite codes.

As noted earlier, the stationary coding of an ergodic process is ergodic. A consequence of the finite-coder approximation theorem is that entropy for ergodic processes is not increased by stationary coding. The proof is based on direct estimation of the number of sequences of length n needed to fill a set of fixed probability in the encoded process.
Theorem 1.8.3 (Stationary coding and entropy.)
If ν = μ ∘ F^{−1} is a stationary encoding of an ergodic process μ, then h(ν) ≤ h(μ).

Proof. First consider the case when the time-zero coder f is finite with window half-width w, and fix ε > 0. By the entropy theorem, there is an N such that if n ≥ N, there is a set C_n ⊂ A_{1−w}^{n+w} of measure at least 1/2 and cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. The image f(C_n) is a subset of B^n of measure at least 1/2, by formula (2). Furthermore, because mappings cannot increase cardinality, the set f(C_n) has cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. Thus

(1/n) log |f(C_n)| ≤ h(μ) + ε + δ_n,

where δ_n → 0 as n → ∞, since w is fixed. It follows that h(ν) ≤ h(μ), since entropy equals the asymptotic covering rate, Theorem 1.7.4.

In the general case, given ε > 0 there is a finite time-zero coder f̃ such that μ({x: f(x) ≠ f̃(x)}) ≤ ε². Let F and F̃ denote the sample-path encoders and let ν = μ ∘ F^{−1} and ν̃ = μ ∘ F̃^{−1} denote the Kolmogorov measures defined by f and f̃, respectively. It has already been established that finite codes do not increase entropy, so that h(ν̃) ≤ h(μ). Thus, there is an n and a collection C̃ ⊂ B^n such that ν̃(C̃) > 1 − ε and |C̃| ≤ 2^{n(h(μ)+ε)}. Let C = [C̃]_ε be the ε-blowup of C̃, that is,

C = {y_1^n: d_n(y_1^n, ỹ_1^n) ≤ ε, for some ỹ_1^n ∈ C̃}.

The blowup bound, Lemma 1.7.5, implies that

|C| ≤ 2^{n(h(μ)+ε+δ(ε))}, where δ(ε) → 0 as ε → 0.

Let

D̃ = {x: F̃(x)_1^n ∈ C̃} and D = {x: F(x)_1^n ∈ C}

be the respective pullbacks to A^Z. Since μ({x: f(x) ≠ f̃(x)}) ≤ ε², the Markov inequality implies that the set

G = {x: d_n(F(x)_1^n, F̃(x)_1^n) ≤ ε}

has measure at least 1 − ε, so that μ(G ∩ D̃) ≥ 1 − 2ε, since μ(D̃) = ν̃(C̃) > 1 − ε. By definition of G and D, however, G ∩ D̃ ⊂ D, so that

ν(C) = μ(D) ≥ μ(G ∩ D̃) ≥ 1 − 2ε.

The bound |C| ≤ 2^{n(h(μ)+ε+δ(ε))}, and the fact that entropy equals the asymptotic covering rate, Theorem 1.7.4, then imply that h(ν) ≤ h(μ). This completes the proof of Theorem 1.8.3. □
Example 1.8.4 (Stationary coding preserves mixing.)
A simple argument, see Exercise 19, can be used to show that stationary coding preserves mixing. Here a proof based on approximation by finite coders will be given. While not as simple as the earlier proof, it gives more direct insight into why stationary coding preserves mixing and is a nice application of coder approximation ideas.

The σ-field generated by the cylinders [a_m^n], for m and n fixed, that is, the σ-field generated by the random variables X_m, X_{m+1}, ..., X_n, will be denoted by Σ(X_m^n). As noted earlier, the i-fold shift of the cylinder set [a_m^n] is the cylinder set T^{−i}[a_m^n] = [c_{m+i}^{n+i}], where c_{j+i} = a_j, m ≤ j ≤ n.

Let ν = μ ∘ F^{−1} be a stationary encoding of a mixing process μ. First consider the case when the time-zero coder f is finite, with window width 2w + 1, say. The coordinates y_1^n of y = F(x) depend only on the coordinates x_{1−w}^{n+w}. Thus, if the gap g = m − n exceeds 2w, the intersection [y_1^n] ∩ [y_{m+1}^{m+n}] is the image under F of C ∩ T^{−m}D, where C and D are both measurable with respect to Σ(X_{1−w}^{n+w}). If μ is mixing then, given ε > 0 and n ≥ 1, there is an M such that

|μ(C ∩ T^{−m}D) − μ(C)μ(D)| < ε,   C, D ∈ Σ(X_{1−w}^{n+w}),   m ≥ M,

which, in turn, implies that

|ν([y_1^n] ∩ [y_{m+1}^{m+n}]) − ν([y_1^n])ν([y_{m+1}^{m+n}])| < ε,   m ≥ M,  g > 2w.
Thus ν is mixing.

In the general case, suppose ν = μ ∘ F^{−1} is a stationary coding of the mixing process μ, with stationary encoder F and time-zero encoder f, and fix n ≥ 1. Given ε > 0, and a_1^n and b_1^n, let δ be a positive number to be specified later and choose a finite encoder F̃, with time-zero encoder f̃, such that μ({x: f(x) ≠ f̃(x)}) < δ. Thus,

μ({x: F(x)_1^n ≠ F̃(x)_1^n}) ≤ Σ_{i=1}^{n} μ({x: f(T^i x) ≠ f̃(T^i x)}) = n μ({x: f(x) ≠ f̃(x)}) < nδ,

by stationarity. But n is fixed, so that if δ is small enough then ν(a_1^n) will be so close to ν̃(a_1^n) and ν(b_1^n) so close to ν̃(b_1^n) that

(3)   |ν(a_1^n)ν(b_1^n) − ν̃(a_1^n)ν̃(b_1^n)| ≤ ε/3,

for all a_1^n and b_1^n. Likewise, for any m ≥ 1,

μ({x: F(x)_1^n ≠ F̃(x)_1^n, or F(x)_{m+1}^{m+n} ≠ F̃(x)_{m+1}^{m+n}}) < 2nδ,

so that

(4)   |ν([a_1^n] ∩ T^{−m}[b_1^n]) − ν̃([a_1^n] ∩ T^{−m}[b_1^n])| ≤ ε/3,

provided only that δ is small enough, uniformly in m, for all a_1^n and b_1^n. Since ν̃ is mixing, being a finite coding of μ, it is true that

|ν̃([a_1^n] ∩ T^{−m}[b_1^n]) − ν̃([a_1^n])ν̃([b_1^n])| < ε/3,

provided only that m is large enough, and hence, combining this with (3) and (4), and using the triangle inequality, yields

|ν([a_1^n] ∩ T^{−m}[b_1^n]) − ν([a_1^n])ν([b_1^n])| < ε,

for all sufficiently large m, provided only that δ is small enough. Thus, indeed, stationary coding preserves the mixing property. □
I.8.b From block to stationary codes.
As noted in Example 1.1.10, an N-block code C_N: A^N → B^N can be used to map an A-valued process {X_n} into a B-valued process {Y_n} by applying C_N to consecutive nonoverlapping blocks of length N,

Y_{jN+1}^{(j+1)N} = C_N(X_{jN+1}^{(j+1)N}),   j = 0, 1, 2, ...

If {X_n} is stationary, then a stationary process {Ỹ_n} is obtained by randomizing the start, i.e., by selecting an integer U ∈ [1, N] according to the uniform distribution and defining Ỹ_j = Y_{U+j}, j = 1, 2, .... The final process {Ỹ_n} is stationary, but it is not, except in rare cases, a stationary coding of {X_n}, and nice properties of {X_n}, such as mixing or even ergodicity, may get destroyed.

A method for producing stationary codes from block codes will now be described. The basic idea is to use an event of small probability as a signal to start using the block code. The block code is then applied to successive N-blocks until within N of the next occurrence of the event. If the event has small enough probability, sample paths will be mostly covered by nonoverlapping blocks of length exactly N to which the block code is applied.
Lemma 1.8.5 (Block-to-stationary construction.)
Let μ be an ergodic measure on A^Z and let C: A^N → B^N be an N-block code. Given ε > 0 there is a stationary code F = F_C: A^Z → B^Z such that for almost every x ∈ A^Z there is an increasing sequence {n_i: i ∈ Z}, which depends on x, such that Z = ∪_i [n_i, n_{i+1}) and

(i) n_{i+1} − n_i ≤ N, i ∈ Z.
(ii) If J_n is the set of indices i such that [n_i, n_{i+1}) ⊂ [−n, n] and n_{i+1} − n_i < N, then lim sup_n (1/2n) Σ_{i∈J_n} (n_{i+1} − n_i) ≤ ε, almost surely.
(iii) If n_{i+1} − n_i = N then y_{n_i}^{n_{i+1}−1} = C(x_{n_i}^{n_{i+1}−1}), where y = F(x).
Proof. Let D be a cylinder set such that 0 < μ(D) < ε/N, and let G be the set of all x ∈ A^Z for which T^m x ∈ D for infinitely many positive and negative values of m. The set G is measurable and has measure 1, by the ergodic theorem. For x ∈ G, define m_0 = m_0(x) to be the least nonnegative integer m such that T^m x ∈ D, then extend to obtain an increasing sequence m_j = m_j(x), j ∈ Z, such that T^m x ∈ D if and only if m = m_j, for some j.

The next step is to split each interval [m_j, m_{j+1}) into nonoverlapping blocks of length N, starting with m_j, plus a final remainder block of length shorter than N, in case m_{j+1} − m_j is not exactly divisible by N. In other words, for each j let q_j and r_j be nonnegative integers such that m_{j+1} − m_j = q_j N + r_j, 0 ≤ r_j < N, and form the disjoint collection I_x(m_j) of left-closed, right-open intervals

[m_j, m_j + N), [m_j + N, m_j + 2N), ..., [m_j + q_j N, m_{j+1}).

All but the last of these have length exactly N, while the last one is either empty or has length r_j < N. The definition of G guarantees that for x ∈ G, the union ∪_j I_x(m_j) is a partition of Z. The random partition ∪_j I_x(m_j) can then be relabeled as {[n_i, n_{i+1}), i ∈ Z}, where n_i = n_i(x), i ∈ Z. If x ∉ G, define n_i = i, i ∈ Z. By construction, condition (i) certainly holds for every x ∈ G. Furthermore, the ergodic theorem guarantees that the average distance between m_j and m_{j+1} is at least N/ε, so that (ii) also holds, almost surely.

The encoder F = F_C is defined as follows. Let b be a fixed element of B, called the filler symbol, and let b^j denote the sequence of length j, 1 ≤ j < N, each of whose terms is b. If x is a sequence for which Z = ∪[n_i, n_{i+1}), then y = F(x) is defined by the formula

y_{n_i}^{n_{i+1}−1} = b^{n_{i+1}−n_i}, if n_{i+1} − n_i < N;   y_{n_i}^{n_{i+1}−1} = C(x_{n_i}^{n_{i+1}−1}), if n_{i+1} − n_i = N.

This definition guarantees that property (iii) holds for x ∈ G, a set of measure 1. For x ∉ G, define F(x)_i = b, i ∈ Z. The function F is certainly measurable and satisfies F(T_A x) = T_B F(x), for all x. This completes the proof of Lemma 1.8.5. □

The blocks of length N are called coding blocks and the blocks of length less than N are called filler or spacer blocks. Any stationary coding F of μ for which (i), (ii), and (iii) hold is called a stationary coding with ε-fraction of spacers induced by the block code C_N. There are, of course, many different processes satisfying the conditions of the lemma, since there are many ways to parse sequences so that properties (i), (ii), and (iii) hold; for example, any event of small enough probability can be used, and how the spacer blocks are coded is left unspecified in the lemma statement. The terminology applies to any of these processes.

Remark 1.8.6
Lemma 1.8.5 was first proved in [40], but it is really only a translation into process language of a theorem about ergodic transformations first proved by Rohlin, [60]. Rohlin's theorem played a central role in Ornstein's fundamental work on the isomorphism problem for Bernoulli shifts, [46, 63].
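The parsing step of the proof can be sketched in a few lines (illustrative code, not from the text; the marks play the role of the occurrence times m_j of the rare event D, and the constant block code is hypothetical).

```python
# Sketch of the block-to-stationary parsing of Lemma 1.8.5: each gap
# [m_j, m_{j+1}) is split into N-blocks (coded) plus a spacer (filled).

def parse_gaps(marks, N):
    """Split each [m_j, m_{j+1}) into N-blocks plus a remainder of length < N."""
    intervals = []
    for start, stop in zip(marks, marks[1:]):
        pos = start
        while stop - pos >= N:
            intervals.append((pos, pos + N))   # coding block, length N
            pos += N
        if pos < stop:
            intervals.append((pos, stop))      # spacer block, length < N
    return intervals

def stationary_encode(x, marks, N, block_code, filler=0):
    """Code the N-blocks with block_code; fill spacer blocks with the filler."""
    y = list(x)
    for (s, t) in parse_gaps(marks, N):
        y[s:t] = block_code(x[s:t]) if t - s == N else [filler] * (t - s)
    return y

# toy data: a constant path with hypothetical event occurrences at 0, 7, 17
x = [1] * 17
marks = [0, 7, 17]
y = stationary_encode(x, marks, N=3, block_code=lambda b: [2] * len(b))
```

With gaps of lengths 7 and 10 and N = 3, the spacers occupy 2 of 17 positions, illustrating how a rare event keeps the spacer fraction small.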
I.8.c A string-matching example.

As a simple illustration of the utility of block-to-stationary constructions, a string-matching example will be constructed; the block-to-stationary code construction of Lemma 1.8.5 provides a powerful tool for making counterexamples. The string-matching problem is of interest in DNA modeling and is defined as follows. For x_1^n ∈ A^n let L(x_1^n) be the length of the longest block appearing at least twice in x_1^n, that is,

L(x_1^n) = max{k: x_{s+1}^{s+k} = x_{t+1}^{t+k}, for some 0 ≤ s < t ≤ n − k}.

The problem is to determine the asymptotic behavior of L(x_1^n) along sample paths drawn from a stationary, finite-alphabet ergodic process. For i.i.d. and Markov processes it is known that L(x_1^n) = O(log n), see [26, 2]. A question arose as to whether such a log n type bound could hold for the general ergodic case, at least if the process is assumed to be mixing. A negative solution to the question was provided in [71], and extended to a larger class of processes in Chapter II, where it was shown that for any ergodic process {X_n} and any positive function λ(n) for which λ(n)/n → 0, there is a nontrivial stationary coding {Y_n} of {X_n} and an increasing unbounded sequence {n_i} such that

lim sup_{i→∞} L(Y_1^{n_i})/λ(n_i) ≥ 1.

The coded process can, furthermore, be forced to have entropy as close to the entropy of {X_n} as desired, and, if the start process {X_n} is i.i.d., the {Y_n} process will be mixing. The proof for the binary case and growth rate λ(n) = n^{1/2} will be given here. A proof for the i.i.d. case using coding ideas is suggested in Exercise 2.

The first observation is that if stationarity is not required, the problem is easy to solve merely by periodically inserting blocks of 0's to get bad behavior for one value of n, then nesting to obtain bad behavior for infinitely many n. To make this precise, for each n ≥ 64 define the zero-inserter C_n: {0, 1}^n → {0, 1}^n, C_n(x_1^n) = y_1^n, by the formula

(5)   y_i = 0, if i ≤ 4n^{1/2} or i > n − 4n^{1/2};  y_i = x_i, otherwise,

that is, C_n changes the first 4n^{1/2} terms and the final 4n^{1/2} terms into all 0's. The n-block coding {Y_m} of the process {X_m} defined by Y_{jn+1}^{(j+1)n} = C_n(X_{jn+1}^{(j+1)n}), j ≥ 0, clearly has the property L(Y_{s+1}^{s+n}) ≥ n^{1/2}, for any starting place s, since two disjoint blocks of 0's of length at least n^{1/2} must appear in any set of n consecutive places.

The preceding construction is conceptually quite simple but clearly destroys stationarity. Randomizing the start at each stage will restore stationarity, but mixing properties are lost, and it is not easy to see how to pass to a limit process. An appropriate use of block-to-stationary coding, Lemma 1.8.5, at each stage converts the block codes into stationary codings, which, in turn, will guarantee that the final limit process is a stationary coding of the starting process with the desired string-matching properties.

For n ≥ 64, let C_n be the zero-inserter code defined by (5), let {n_i: i ≥ 1} be an increasing sequence of integers, and let {ε_i} be a decreasing sequence of positive numbers, both to be specified later. For any start process {Y(0)_m}, the sequence of processes {Y(i)} is defined by setting {Y(0)_m} = {X_m} and inductively defining {Y(i)_m} for i ≥ 1 by the formula

{Y(i − 1)_m} → {Y(i)_m},  (C_{n_i}, ε_i),

where the notation indicates that {Y(i)_m} is constructed from {Y(i − 1)_m} by stationary coding using the codebook C_{n_i}, as in Lemma 1.8.5, with a limiting ε_i-fraction of spacers. At each stage the same filler symbol, b = 0, is used, so that the code at each stage never changes a 0 into a 1; in particular,

(6)   Y(i − 1)_m = 0 ⟹ Y(i)_m = 0, for all m ∈ Z and i ≥ 1.

Since 0's inserted at any stage are not changed in subsequent stages, each coordinate is changed at most once. The choice of filler means that L(Y(i)_1^n) ≥ n_i^{1/2}, since this always occurs if Y(i)_1^n is the concatenation uv, where u is the end of a codeword and v the beginning of a codeword, and long matches are even more likely if Y(i)_1^n is the concatenation ufv, where u is the end of a codeword, f is filler, and v is the beginning of a codeword. Combining this observation with the fact that 0's are not changed yields

(7)   L(Y(i)_{s+1}^{s+n_j}) ≥ n_j^{1/2}, 1 ≤ j ≤ i, for all s.

The rigorous construction of a stationary coding is carried out as follows. Given the ergodic process {X_n} (now thought of as a two-sided process, since stationary coding will be used), for each i ≥ 1 let F_i denote the stationary encoder F_i: {0, 1}^Z → {0, 1}^Z that maps {X_m} to {Y(i)_m}. Since stationary codings of stationary codings are themselves stationary codings, each process {Y(i)_m} is a stationary coding of the original process {X_m}. Property (6) guarantees that no coordinate is changed more than once, and hence there is a limit code F defined by the formula

[F(x)]_m = lim_i [F_i(x)]_m,  m ∈ Z,  x ∈ {0, 1}^Z.
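The zero-inserter (5) and the string-matching function L can be sketched directly (illustrative code, not from the text; here n is taken to be a perfect square, and occurrences at distinct starting places may overlap, as the definition of L allows).

```python
import math

# Sketch of the zero-inserter (5): the first and last 4*sqrt(n) coordinates
# of an n-block are set to 0.
def zero_insert(block):
    n = len(block)
    k = 4 * math.isqrt(n)
    return [0] * k + block[k:n - k] + [0] * k

# L(x): length of the longest block appearing at least twice in x.
def longest_repeat(x):
    n = len(x)
    best = 0
    for k in range(1, n):
        seen, found = set(), False
        for s in range(n - k + 1):
            w = tuple(x[s:s + k])
            if w in seen:
                found = True
                break
            seen.add(w)
        if not found:
            break        # a repeat of length k+1 would imply one of length k
        best = k
    return best

y = zero_insert([1] * 100)   # n = 100: 40 leading and 40 trailing zeros
```

On this coded 100-block the two zero-runs force a repeated block far longer than sqrt(100) = 10, which is the mechanism behind the lower bound in (7).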
Since (7) holds for 1 ≤ j ≤ i, the limit process {Y(∞)_m}, defined by Y(∞)_m = lim_i Y(i)_m, m ∈ Z, is the stationary coding of the initial process {X_m} defined by the stationary encoder F, and has the property L(Y(∞)_{s+1}^{s+n_i}) ≥ n_i^{1/2}, for each i and every starting place s. Furthermore, the limiting density of changes can be forced to be as small as desired, merely by making the n_i increase rapidly enough and the ε_i go to zero rapidly enough; in particular, the final process has a positive frequency of 1's, and the entropy of the final process will be as close to that of the original process as desired.

The above is typical of stationary coding constructions. First a sequence of block codes is constructed to produce a (nonstationary) limit process with the properties desired. Then Lemma 1.8.5 is applied to produce a stationary process with the same limiting properties.

I.8.d Exercises.

1. Suppose C: A^N → B^N is an N-code such that E_μ(d_N(x_1^N, C(x_1^N))) ≤ ε. Show that there is a stationary code F such that lim_n E_μ(d_n(x_1^n, F(x)_1^n)) ≤ 2ε and such that h(μ ∘ F^{−1}) ≤ h(μ). (Hint: use the preceding exercise.)
2. Show that if μ is mixing and δ > 0, then the sequence {n_i(x)} of Lemma 1.8.5 can be chosen so that |Prob(x_{n_i+1}^{n_i+N} = a_1^N) − μ(a_1^N)| < δ, for each a_1^N ∈ A^N. (Hint: replace D by T^{−n}D for n large.)

Section I.9 Process topologies.

Two useful ways to measure the closeness of stationary processes are described in this section. One concept, called the weak topology, declares two processes to be close if their joint distributions are close for a long enough time. The weak topology is separable, compact, and easy to describe, but it has two important defects: weak limits of ergodic processes need not be ergodic, and entropy is not weakly continuous. The other concept, called the d̄-topology, declares two ergodic processes to be close if only a small limiting density of changes is needed to convert a typical sequence for one process into a typical sequence for the other process. The d̄-metric is not so easy to define, the d̄-distance between processes is often difficult to calculate, and the d̄-topology is nonseparable. On the other hand, entropy is d̄-continuous and the class of ergodic processes is d̄-closed, as are other classes of interest. Many of the deeper recent results in ergodic theory depend on the d̄-metric, and it plays a central role in the theory of stationary codings of i.i.d. processes, a theory to be presented in Chapter 4.

I.9.a The weak topology.

The collection of (Borel) probability measures on a compact space X is denoted by P(X). Of primary interest are the two cases, X = A^n and X = A^∞, where A is a finite set. The collection P(A^∞) is just the collection of processes with alphabet A. The subcollection of stationary processes is denoted by P_s(A^∞) and the subcollection of stationary ergodic processes is denoted by P_e(A^∞). A sequence {μ^{(n)}} of measures in P(A^∞) converges weakly to a measure μ if

lim_n μ^{(n)}(a_1^k) = μ(a_1^k), for every k and every a_1^k.
so the class of stationary measures is also compact in the weak topology. and recall that 111. so that {.u. for any sequence {. The weak topology on P(A") is the topology defined by weak convergence. 1. The weak topology is a metric topology.1 If p (n ) converges weakly to p. say.). . Proof For any stationary measure p. Theorem 1. Since there are only countably many 4.°1) } of probability measures and any all'.. The weak limit of ergodic measures need not be ergodic. it must be true that 11/. One way to see this is to note that on the class of stationary Markov processes. It is easy to check that the limit function p.) — v(a 101. iii .u(C) is continuous for each cylinder set C. if . The process p. It is the weakest topology for which each of the mappings p . (n ) has entropy 0.u (n ) converges weakly to p.) as k — > oc.(4). If k is understood then I • I may be used in place of I • lk. BASIC CONCEPTS. then lim sup. E > 0 there is a k such that HA (X0I < H(p) €. weak convergence coincides with convergence of the entries in the transition matrix and stationary vector. Indeed. for all k and all'. 0. Thus. as i —> oc. let Hi.(X 0 1X:1) decreases to H(.u. where X ° k is distributed according to p. 1}n in some order and let p.°1) (a)} is bounded. Now suppose . The weak topology is Hausdorff since a measure is determined by its values on cylinder sets.(0(X 0 IX11) < H(A)1. the usual diagonalization procedure produces a subsequence {n i } such that { 0 (a)} converges to.88 CHAPTER I. H(a) < H(p).. that puts weight 1/2 on each of the two sequences (0. .9. however.2e. For example. . let x(n) be the concatenation of the members of {0. and each .u. given. y) where .. yet lim .u (n) (4) = 2k . is a probability measure. relative to the metric D(g. Furthermore.u (n) is the Markov process with transition matrix M= n [ 1 — 1/n 1/n 1/n 1 1 — 1/n j then p weakly to the (nonergodic) process p. In particular. 
A second defect in the weak topology is that entropy is not weakly continuous.°2) be the concatenatedblock process defined by the single n2'1 block x(n). for every k and every 4.vik = E liu(a/1 at denotes the kth order distributional (variational) distance.u(n) is stationary.. for each n. The weak topology is a compact topology. hence has a convergent subsequence. Entropy is weakly upper semicontinuous.u(a k i +1 ).(X01X: ) denote the conditional entropy of X0 relative to X=1.. the limit p.u (n) } converges weakly to the cointossing process. which has entropy log 2. the sequence {. is stationary if each A (n ) is stationary.) and (1. p. V k. Since Ho (Xo l XII) depends continuously on the probabilities .
(n)(X01X11) is always true.) + 2E. the measure p.b) = { 1 n i =1 a=b a 0 b.d. let g (P) denote the binary i. The ddistance between two ergodic processes is the minimal limiting (upper) density of changes needed to convert a typical sequence for one process into a typical sequence for the other. y) = lirn sup d(4 . yi). Suppose 0 < p <q < 1/2. Theorem 1.b The dmetric. y. Tx). is ergodic if and only if g(T(g)) = 1. then d( y.9. and let y be a g (q) typical sequence. a(x. x) = d(T y. in particular.) To illustrate the ddistance concept the distance between two binary i. This proves Theorem 1.T (p)) is invariant since ii(y. is actually a metric on the class of ergodic processes is a consequence of an alternative description to be given later. .9.)}. hence f (y) is almost surely constant. For 0 < p < 1. process such that p is the probability that X 1 = 1. PROCESS TOPOLOGIES.4.T (p)) is valmost surely constant.) Example 1. that is. that for valmost every vtypical sequence y. y) is the limiting (upper) density of changes needed to convert the (infinite) sequence x into the (infinite) sequence y. v). so that x and y must .9.1. the set of all sequences for which the limiting empirical measure is equal to p. and y is the limiting density of changes needed to convert y into a gtypical sequence. y) on A. then H(g. processes will be calculated. as defined. I. The sequence x contains a limiting pfraction of l's while the sequence q contains a limiting qfraction of l's.i.. Proof The function f (y) = 41(y . Note.i. (The fact that d(g. y) is the limiting perletter Hamming distance between x and y. and y are ergodic processes. 89 for all sufficiently large n.4.9. y) and is called the ddistance between the ergodic processes it and v.u (P) typical sequence. This extends to the It measures the fraction of changes needed to convert 4 into pseudometric ci(x. (n) ) < H(. let x be a . The perletter Hamming distance is defined by 1 xN n O d„(4 .1. T (p.3 (The ddistance for i. . processes. 
But if this is true for some n. y ) = — d(xi. and its converse. y): x E T(11.)) = inf{d(x. LI since Ii(g (n ) ) < 111. The constant of the theorem is denoted by d(g. that is.2 If p. The (minimal) limiting (upper) density of changes needed to convert y into a gtypical sequence is given by d(y. Theorem 1. defined by d(x.SECTION 1. A precise formulation of this idea and some equivalent versions are given in this subsection.u.d.d. Theorem 1. where d(a. By the typicalsequence theorem.2.i. Let T(g) denote the set of frequencytypical sequences for the stationary process it. the ddistance between p.9.i)• Thus ci(x.
that is. y) is exactly 1/2. For any set S of natural numbers. and hence ii(A (P) . for . p (q ) ) = g — p.i. which produces no errors. Let p.a(a) = E X(a. on A x A that has .u (q ) typical.u. This new process distance will also be denoted by d(A. v(b) = E X(a.u (q ) typical y such that ii(x. there is therefore at least one . On the other hand. Thus there is indeed at least one . but it is not always easy to use in proofs and it doesn't apply to nonergodic istance as a processes. Ty) — 1/2. is represented by a partition of the unit square into horizontal . disagree in at least a limiting (g — p)fraction of places. b).90 CHAPTER I. for kt converges weakly to y as p —> 0. y) = g — p. The definition of ddistance given by Theorem 1.u (r) typical z such that y = x Ef) z is . kt and y. see Exercise 5.[ p 1 — p 1—p [0 1 1 N= p 1' 1 where p is small. 7" (v)) — 1/2. Example 1. that is. y) > g — p. and its shift Ty. Thus it is enough to find. and hence. vn ) between the corresponding n . In a later subsection it will be shown that this new process distance is the constant given by Theorem 1.9. 1. a typical /Lsequence x has blocks of alternating O's and l's. First each of the two measures. ( A key concept used in the definition of the 4metric is the "join" of two measures. b a A simple geometric model is useful in thinking about the joining concept.9. dfarapart Markov chains.2 serves as a useful guide to intuition.2.) As another example of the ddistance idea it will be shown that the ddistance between two binary Markov chains whose transition matrices are close can nevertheless be close to 1/2.1 The joining definition of d.4 ( Weakly close.9..u. and let r be the solution to the equation g = p(1 — r) ± (1 — p)r . so that ci(x . y) — 1/2. y) = g — p. random sequence with the distribution of the Afr ) process and let Y = x ED Z. (1) . say p = 1050 .9.b.. Thus d. BASIC CONCEPTS. see Exercise 4. Likewise. 
A joining of two probability measures A and y on a finite set A is a measure A.) These results show that the dtopology is not the same as the weak topology. (With a little more effort it can be shown that d(. (q ) typical y such that cl(x. for each p9' ) typical x. hence the nonmixing process v cannot be the dlimit of any sequence of mixing processes. Since Z is also almost surely p (r ) typical. v).u and y as marginals. which produces only errors. Fix a . where e is addition modulo 2. Y is . one . the values {Zs : s E SI are independent of the values {Zs : s 0 S}. separated by an extra 0 or 1 about every 1050 places. b). to be given later in this section.ualmost all x. and about half the time the alternations are out of phase.th order distributions will be developed. since about half the time the alternation in x is in phase with the alternation in y. To remedy these defects an alternative formulation of the dd limit of a distance dn un . a(x. d distance This result is also a simple consequence of the more general definition of .d.u (P ) typical x. Let Z be an i.(x. Later it will be shown that the class of mixing processes is aclosed. The ymeasure is concentrated on the alternating sequence y = 101010. and y be the binary Markov chains with the respective transition matrices M.)typical. for almost any Z. a(x.
Turning now to the definition of d̄_n(μ, ν), where μ and ν are probability measures on A^n, let J_n(μ, ν) denote the set of joinings of μ and ν, that is, the measures λ on A^n × A^n such that

  μ(a_1^n) = Σ_b λ(a_1^n, b_1^n),   ν(b_1^n) = Σ_a λ(a_1^n, b_1^n).

The d̄_n-metric is defined by

  d̄_n(μ, ν) = min_{λ ∈ J_n(μ,ν)} E_λ(d_n(x_1^n, y_1^n)),

where E_λ denotes expectation with respect to λ. The minimum is attained, since J_n(μ, ν) is a compact subset of the space P(A^n × A^n), relative to which expectation is continuous with respect to the distributional distance |λ − λ̃|. The function d̄_n(μ, ν) satisfies the triangle inequality and is strictly positive if μ ≠ ν, hence is a metric on the class P(A^n) of measures on A^n; see Exercise 2. A measure λ on A^n × A^n is said to realize d̄_n(μ, ν) if it is a joining of μ and ν such that E_λ(d_n(x_1^n, y_1^n)) = d̄_n(μ, ν).

A representation of μ on a nonatomic probability space (Z, α) is a measurable mapping φ from (Z, α) to (A^n, μ) such that α(φ^{−1}(a)) = μ(a), for each a. (A space is nonatomic if any subset of positive measure contains subsets of any smaller measure.) A representation always exists, for a finite distribution can always be represented as a partition of any nonatomic probability space into sets whose masses are given by the distribution. A joining λ can then be represented as a pair of measure-preserving mappings, φ from (Z, α) to (A^n, μ) and ψ from (Z, α) to (A^n, ν), such that λ is given by the formula

(2)  λ(a, b) = α(φ^{−1}(a) ∩ ψ^{−1}(b)).

A useful picture is obtained by taking (Z, α) to be the unit square with Lebesgue measure, representing μ by a partition of the square into vertical strips, one strip for each a ∈ A^n, with the width of a strip equal to the measure of the symbol it represents. The μ-rectangle corresponding to a can then be partitioned into horizontal rectangles {R(a, b): b ∈ A^n} such that R(a, b) has area λ(a, b). The joining condition ν(b) = Σ_a λ(a, b) means that the total mass of the rectangles {R(a, b): a ∈ A^n} is exactly the ν-measure of the rectangle corresponding to b. In other words, a joining is just a rule for cutting each μ-rectangle into subrectangles and reassembling them to obtain the ν-rectangles. (See Figure 1.9.5.) Of course, one can also think of cutting up the ν-rectangles and reassembling to give the μ-rectangles; the use of the unit square and rectangles is for simplicity only.

Figure 1.9.5 Joining as reassembly of subrectangles.
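The case n = 1 admits an explicit optimal joining: placing mass min(μ(a), ν(a)) on the diagonal and distributing the leftover mass off the diagonal realizes d̄_1(μ, ν) = (1/2)|μ − ν|_1 (see Lemma 1.9.11(a) and Exercise 8 below). A small numerical sketch; the alphabet size and the two distributions are arbitrary choices, not taken from the text:

```python
import numpy as np

def optimal_joining_1(mu, nu):
    """A joining of two distributions on a finite alphabet realizing
    d-bar_1: put mass min(mu[a], nu[a]) on the diagonal, then place the
    leftover row/column mass off the diagonal by a greedy scan."""
    k = len(mu)
    lam = np.zeros((k, k))
    np.fill_diagonal(lam, np.minimum(mu, nu))
    row_left = mu - np.minimum(mu, nu)   # positive only where mu > nu
    col_left = nu - np.minimum(mu, nu)   # positive only where nu > mu
    i = j = 0
    while i < k and j < k:
        m = min(row_left[i], col_left[j])
        if m > 0:                        # never happens with i == j
            lam[i, j] += m
            row_left[i] -= m
            col_left[j] -= m
        if row_left[i] <= 1e-15:
            i += 1
        else:
            j += 1
    return lam

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.2, 0.6])
lam = optimal_joining_1(mu, nu)
assert np.allclose(lam.sum(axis=1), mu) and np.allclose(lam.sum(axis=0), nu)
dbar1 = lam.sum() - np.trace(lam)        # E(d) = off-diagonal mass
assert np.isclose(dbar1, 0.5 * np.abs(mu - nu).sum())
```

The off-diagonal mass of any joining is at least 1 − Σ_a min(μ(a), ν(a)), so this construction is optimal for n = 1.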
The d̄-distance between two processes is defined for the class P(A^∞) of probability measures on A^∞ by passing to the limit, namely,

  d̄(μ, ν) = lim sup_{n→∞} d̄_n(μ_n, ν_n),

where μ_n denotes the measure induced by μ on the set A^n. Note that d̄(μ, ν) is a pseudometric on the class P(A^∞). In the stationary case the limit superior is, in fact, both a limit and the supremum.

Theorem 1.9.6
If μ and ν are stationary then

  d̄(μ, ν) = lim_n d̄_n(μ_n, ν_n) = sup_n d̄_n(μ_n, ν_n).

Proof. The proof depends on a superadditivity inequality,

(3)  n d̄_n(μ_1^n, ν_1^n) + m d̄_m(μ_{n+1}^{n+m}, ν_{n+1}^{n+m}) ≤ (n + m) d̄_{n+m}(μ_1^{n+m}, ν_1^{n+m}),

which is, in turn, a consequence of the definition of d̄_n as a minimum. The proof of (3) is left to Exercise 3. In the stationary case, μ_{n+1}^{n+m} = μ_1^m and ν_{n+1}^{n+m} = ν_1^m, so the preceding inequality takes the form n d̄_n + m d̄_m ≤ (n + m) d̄_{n+m}, which implies that the limsup is in fact both a limit and the supremum. This establishes Theorem 1.9.6. □

A consequence of the preceding result is that the d̄-pseudometric is actually a metric on the class of stationary measures, and hence d̄ defines a topology on this class. Indeed, if d̄(μ, ν) = 0 and μ and ν are stationary, then, since in the stationary case the limit superior is a supremum, d̄_n(μ_n, ν_n) must be 0 for all n, which implies that μ_n = ν_n for all n, which, in turn, implies that μ = ν.
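The mechanism behind the superadditivity inequality (3), requested in Exercise 3, can be checked numerically: for any joining λ of two 2-block distributions, 2 E_λ(d_2) splits into per-position expectations, each of which dominates the optimal first-order distance of the corresponding letter marginals, and for n = 1 that optimum has the closed form (1/2)|μ − ν|_1. A sketch; the particular distributions and the use of the product joining are arbitrary illustrative choices:

```python
def tv_half(p, q):
    """d-bar_1 in closed form: half the variational distance
    (Lemma 1.9.11(a))."""
    return sum(abs(p[a] - q[a]) for a in p) / 2

def letter_marginal(m2, i):
    """Distribution of the i-th letter of a 2-block distribution."""
    out = {0: 0.0, 1: 0.0}
    for w, prob in m2.items():
        out[w[i]] += prob
    return out

# Two arbitrary 2-block distributions and the product joining of them.
mu2 = {(0, 0): .5, (0, 1): .2, (1, 0): .2, (1, 1): .1}
nu2 = {(0, 0): .1, (0, 1): .2, (1, 0): .3, (1, 1): .4}
lam = {(x, y): mu2[x] * nu2[y] for x in mu2 for y in nu2}

# Pointwise, 2 d_2(x, y) = d(x1, y1) + d(x2, y2); take expectations.
lhs = 2 * sum(p * sum(a != b for a, b in zip(x, y)) / 2
              for (x, y), p in lam.items())
# Each position is bounded below by the optimal first-order distance.
rhs = sum(tv_half(letter_marginal(mu2, i), letter_marginal(nu2, i))
          for i in (0, 1))
assert lhs >= rhs - 1e-12
```

Minimizing the left side over all joinings λ gives exactly the inequality 2 d̄_2 ≥ d̄_1 + d̄_1, the n = m = 1 case of (3).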
It is useful to know that the d̄-distance can also be defined directly in terms of stationary joining measures on the product space A^∞ × A^∞. A joining λ of μ and ν is just a measure on A^∞ × A^∞ with μ and ν as marginals, that is,

  μ(B) = λ(B × A^∞),  ν(B) = λ(A^∞ × B),  B ∈ Σ.

The set of all stationary joinings of μ and ν will be denoted by J_s(μ, ν).

Theorem 1.9.7 (Stationary joinings and d̄-distance.)
If μ and ν are stationary then

(4)  d̄(μ, ν) = min_{λ ∈ J_s(μ,ν)} E_λ(d(x_1, y_1)).

Proof. The existence of the minimum follows from the fact that the set J_s(μ, ν) of stationary joinings is weakly closed and hence compact, and from the fact that E_λ(d(x_1, y_1)) is weakly continuous, for it depends only on the first-order distribution.

It is also easy to see that d̄ cannot exceed the minimum of E_λ(d(x_1, y_1)), for if λ ∈ J_s(μ, ν) then stationarity implies that

  E_λ(d(x_1, y_1)) = E_λ_n(d_n(x_1^n, y_1^n)) ≥ d̄_n(μ_n, ν_n), for all n,

since λ_n belongs to J_n(μ_n, ν_n).

It takes a bit more effort to show that d̄ cannot be less than the right-hand side in (4). This is done by forming, for each n, the concatenated-block process defined by a joining λ_n that realizes d̄_n(μ_n, ν_n), then taking a weakly convergent subsequence to obtain the desired limit process λ. The details are given in the following paragraphs.

Towards this end, for each n choose λ_n ∈ J_n(μ_n, ν_n) such that d̄_n(μ_n, ν_n) = E_λ_n(d_n(x_1^n, y_1^n)), and let λ^(n) denote the concatenated-block process defined by λ_n, that is, the measure on (A × A)^∞ under which successive n-blocks of pairs are chosen independently according to λ_n; thus, if N = Kn + r with 0 ≤ r < n, then

  λ^(n)((x_1^N, y_1^N)) = [ Π_{k=0}^{K−1} λ_n((x_{kn+1}^{(k+1)n}, y_{kn+1}^{(k+1)n})) ] · λ_n^{(r)}((x_{Kn+1}^{Kn+r}, y_{Kn+1}^{Kn+r})),

where λ_n^{(r)} denotes the projection of λ_n onto the first r coordinates. The goal is to construct a stationary measure λ ∈ J_s(μ, ν) from the collection {λ^(n)}. For each n, the stationary measure λ̄^(n) is given by the averaging formula

(5)  λ̄^(n)(B) = (1/n) Σ_{i=1}^{n} λ^(n)(T^{−i} B),

where T denotes the shift on the product space (A × A)^∞.
The definition of λ^(n) implies that if 0 ≤ i ≤ n − j then

  λ^(n)(T^{−i}([x_1^j] × A^∞)) = λ_n(T^{−i}[x_1^j] × A^n) = μ_n(T^{−i}[x_1^j]) = μ([x_1^j]),

since λ_n has μ_n as marginal and μ is shift invariant, and a similar assertion holds for the second coordinate. Together with the averaging formula (5), this implies the limit results

(6)
 (a) lim_n λ̄^(n)(B × A^∞) = μ(B),
 (b) lim_n λ̄^(n)(A^∞ × B) = ν(B),
 (c) lim_n E_λ̄^(n)(d(x_1, y_1)) = lim_n d̄_n(μ_n, ν_n) = d̄(μ, ν),

for any cylinder set B. Part (c) holds because E_λ̄^(n)(d(x_1, y_1)) = E_λ_n(d_n(x_1^n, y_1^n)) = d̄_n(μ_n, ν_n), since d_n(x_1^n, y_1^n) = (1/n) Σ_i d(x_i, y_i).

To complete the proof select a weakly convergent subsequence of {λ̄^(n)}, with limit λ. The limit λ must be stationary, since the λ̄^(n) are all stationary, and is a joining of μ and ν, by conditions (6a) and (6b); thus λ ∈ J_s(μ, ν). Condition (6c) guarantees that E_λ(d(x_1, y_1)) = d̄(μ, ν), since d(x_1, y_1) depends only on the first-order distribution. This proves Theorem 1.9.7. □

I.9.b.2 Empirical distributions and joinings.

A pair (x_1^n, y_1^n) defines three empirical k-block distributions, p_k(·|x_1^n), p_k(·|y_1^n), and p_k(·|(x, y)_1^n). Two of these are obtained by thinking of x_1^n and y_1^n as separate sequences, and the third is obtained by thinking of the pair (x_1^n, y_1^n) as a single sequence in (A × A)^n. A simple, but important, fact is that

(7)  d_n(x_1^n, y_1^n) = (1/n) Σ_{i=1}^n d(x_i, y_i) = Σ_{(a,b)} p_1((a, b)|(x, y)_1^n) d(a, b),

that is, the per-letter Hamming distance is the expectation of d with respect to the empirical first-order distribution of the pair.

Limit versions of these finite ideas as n → ∞ will be needed. One difficulty is that limits may not exist for a given pair (x, y). A diagonalization argument, however, yields an increasing sequence {n_j} such that each of the limits

(8)  lim_j p_k(a_1^k|x_1^{n_j}),  lim_j p_k(b_1^k|y_1^{n_j}),  lim_j p_k((a_1^k, b_1^k)|(x, y)_1^{n_j})

exists for all k and all a_1^k and b_1^k. By dropping to a further subsequence, if necessary, it can be assumed that the limit d̄(x, y) = lim_j d_{n_j}(x_1^{n_j}, y_1^{n_j}) also exists. Fix such a sequence {n_j} and denote the Kolmogorov measures defined by the limits in (8) by μ̃, ν̃, and λ̃, respectively. The limit results that will be needed are summarized in the following lemma.

Lemma 1.9.8 (The empirical-joinings lemma.)
The measures μ̃, ν̃, and λ̃ are all stationary, λ̃ is a joining of μ̃ and ν̃, and

  d̄(x, y) = lim_j d_{n_j}(x_1^{n_j}, y_1^{n_j}) = E_λ̃(d(a_1, b_1)).

Proof. Note that (n − k + 1) p_k(a_1^k|x_1^n) counts the number of times a_1^k appears in x_1^n, so that

  | (n − k + 2) p_k(a_1^k|x_1^{n+1}) − (n − k + 1) p_k(a_1^k|x_1^n) | ≤ 1,

since the only difference between the two counts is that a_1^k might appear at the final position of x_1^{n+1}, hence is not counted in the second term. Passing to the limit it follows that μ̃ is shift invariant, and the same argument applies to ν̃ and λ̃. Summing each side of the third limit in (8) over b_1^k, or over a_1^k, shows that λ̃ has μ̃ and ν̃ as marginals, hence λ̃ is a joining of μ̃ and ν̃. The final assertion follows from the identity (7), applied along the subsequence {n_j}. □
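The empirical distributions and the identity (7) are easy to compute directly. The following sketch, with two arbitrary short binary sequences, checks that the per-letter Hamming distance equals the expectation of d under the empirical first-order distribution of the pair:

```python
from collections import Counter

def empirical_kblock(seq, k):
    """Empirical k-block distribution p_k(.|seq) over the n-k+1
    overlapping k-windows of seq."""
    n = len(seq)
    counts = Counter(tuple(seq[i:i + k]) for i in range(n - k + 1))
    return {w: c / (n - k + 1) for w, c in counts.items()}

x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 1, 0, 1]
d_n = sum(a != b for a, b in zip(x, y)) / len(x)   # per-letter Hamming
p1 = empirical_kblock(list(zip(x, y)), 1)          # pair viewed as one sequence
# identity (7): d_n equals the p1-expectation of d(a, b)
expect = sum(prob for ((a, b),), prob in p1.items() if a != b)
assert abs(d_n - expect) < 1e-12
```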
The empirical-joinings lemma can be used in conjunction with the ergodic decomposition to show that the d̄-distance between ergodic processes is actually realized by an ergodic joining, a strengthening of the stationary-joining theorem.

Theorem 1.9.9 (Ergodic joinings and d̄-distance.)
If μ and ν are ergodic then there is an ergodic joining λ of μ and ν such that d̄(μ, ν) = E_λ(d(x_1, y_1)).

Proof. Let λ be a stationary joining of μ and ν for which

(10)  d̄(μ, ν) = E_λ(d(x_1, y_1)),

as given by Theorem 1.9.7, and let

(9)  λ = ∫ λ_{π(x,y)} dω(π(x, y))

be the ergodic decomposition of λ, as given by the ergodic decomposition theorem. In this expression π is the projection onto frequency-equivalence classes of pairs of sequences, ω is a probability measure on the set of such equivalence classes, and λ_{π(x,y)} is the empirical measure defined by (x, y); for λ-almost every pair (x, y), the measure λ_{π(x,y)} is ergodic and the pair (x, y) is λ_{π(x,y)}-typical.

Since λ is a stationary joining of μ and ν, it is concentrated on those pairs (x, y) for which x is μ-typical and y is ν-typical, that is, x ∈ T(μ), y ∈ T(ν), λ-almost surely. But if this holds for a given pair, then, by the empirical-joinings lemma, λ_{π(x,y)} must be a stationary joining of μ and ν; hence, in the ergodic case, λ_{π(x,y)} is, λ-almost surely, an ergodic joining of μ and ν, so that, by definition of d̄,

  d̄(μ, ν) ≤ E_{λ_{π(x,y)}}(d(x_1, y_1)), λ-almost surely.

The integral expression (9) implies, since d(x_1, y_1) depends only on the first-order distribution and expectation is linear, that

  E_λ(d(x_1, y_1)) = ∫ E_{λ_{π(x,y)}}(d(x_1, y_1)) dω(π(x, y)).

This, combined with (10), means that d̄(μ, ν) = E_{λ_{π(x,y)}}(d(x_1, y_1)), ω-almost surely. This establishes Theorem 1.9.9. □

Next it will be shown that the d̄-distance, defined as a minimization of expected per-letter Hamming distance over joinings, is, in the ergodic case, also given by the limiting per-letter Hamming distance between typical sequences.
Theorem 1.9.10 (Typical sequences and d̄-distance.)
If μ and ν are ergodic then the following two properties hold.

 (a) d̄(μ, ν) ≤ d̄(x, y), for every pair (x, y) ∈ T(μ) × T(ν);
 (b) there is a set X ⊂ A^∞ such that ν(X) = 1 and such that for each y ∈ X there is at least one x ∈ T(μ) for which both (a) holds with equality, that is, d̄(x, y) = d̄(μ, ν).

In particular, d̄(y, T(μ)) = d̄(μ, ν), ν-almost surely.

Proof. First it will be shown that

(11)  (x, y) ∈ T(μ) × T(ν) ⟹ d̄(μ, ν) ≤ d̄(x, y).

Let x be μ-typical and y be ν-typical, and choose an increasing sequence {n_j} such that d_{n_j}(x_1^{n_j}, y_1^{n_j}) converges to a limit d̄(x, y) and such that the empirical distributions (8) converge, with limit measures μ̃, ν̃, and λ̃. Since x is μ-typical, μ̃ = μ, and since y is ν-typical, ν̃ = ν. The empirical-joinings lemma implies that λ̃ is stationary and has μ and ν as marginals, so that

  d̄(x, y) = E_λ̃(d(x_1, y_1)) ≥ d̄(μ, ν),

by Theorem 1.9.7. Since this holds for every limit value, the proof of (11) is finished.

Next it will be shown that there is a subset X ⊂ A^∞ of ν-measure 1 such that

(12)  d̄(y, T(μ)) = d̄(μ, ν), ∀ y ∈ X.

Here is why this is so. Theorem 1.9.9 implies that there is an ergodic joining λ for which E_λ(d(x_1, y_1)) = d̄(μ, ν). The ergodic theorem guarantees that for λ-almost all pairs (x, y),

  d_n(x_1^n, y_1^n) = (1/n) Σ_{i=1}^n d(x_i, y_i) → E_λ(d(x_1, y_1)) = d̄(μ, ν),

and, moreover, (x, y) ∈ T(λ) ⊂ T(μ) × T(ν), λ-almost surely. Projecting onto the second coordinate produces a set X of ν-measure 1 such that for each y ∈ X there is at least one x ∈ T(μ) with d̄(x, y) ≤ d̄(μ, ν). The lower bound result, (11), together with the almost sure limit result, yields the desired result (12). This establishes Theorem 1.9.10. □

I.9.b.3 Properties and interpretations of the d̄-distance.

For convenience, random variable notation will be used in stating d̄-distance results; thus d̄_n(X_1^n, Y_1^n) stands for d̄_n(μ, ν), where μ is the distribution of the random vector X_1^n and ν is the distribution of Y_1^n, and conditioning notation, for example X_1^n | R = r, indicates the distribution of X_1^n conditioned on the value r of some auxiliary random variable R. As earlier, the distributional (or variational) distance between n-th order distributions is given by

  |μ − ν|_n = Σ_{a_1^n} |μ(a_1^n) − ν(a_1^n)|,

and a measure is said to be T^N-invariant, or N-stationary, if it is invariant under the N-th power of the shift T. Another convenient notation: the vector obtained from x_1^n by deleting all but the terms with indices j_1 < j_2 < ⋯ < j_m will be denoted by x_n(j_1^m), or by x(j_1^m) when n is understood.
Lemma 1.9.11 (Properties of the d̄-distance.)

 (a) d̄_n(μ, ν) ≤ (1/2)|μ − ν|_n, with equality when n = 1.
 (b) d̄_n(·, ·) is a complete metric on the space of measures on A^n.
 (c) m d̄_m(X(j_1^m), Y(j_1^m)) ≤ n d̄_n(X_1^n, Y_1^n) ≤ m d̄_m(X(j_1^m), Y(j_1^m)) + (n − m).
 (d) If μ and ν are T^N-invariant, then d̄(μ, ν) = lim_n d̄_n(μ_n, ν_n).
 (e) d̄(μ, ν) = d̄(μ ∘ T^{−1}, ν ∘ T^{−1}).
 (f) If the blocks {X_{(i−1)n+1}^{in}} and {Y_{(i−1)n+1}^{in}} are independent of each other and of past blocks, then

  m n d̄_{mn}(X_1^{mn}, Y_1^{mn}) = Σ_{i=1}^m n d̄_n(X_{(i−1)n+1}^{in}, Y_{(i−1)n+1}^{in}).

 (g) d̄_n(X_1^n, Y_1^n) ≤ Σ_r P(R = r) d̄_n(X_1^n | R = r, Y_1^n | R = r), for any finite random variable R.

Proof. A direct calculation shows that

  |μ − ν|_n = 2 − 2 Σ_{a_1^n} min(μ(a_1^n), ν(a_1^n)),

see Exercise 9. Moreover, there is always at least one joining λ* of μ and ν for which λ*(a_1^n, a_1^n) = min(μ(a_1^n), ν(a_1^n)), see Exercise 8. For such a joining,

  λ*({(a_1^n, b_1^n): a_1^n ≠ b_1^n}) = 1 − Σ min(μ(a_1^n), ν(a_1^n)) = (1/2)|μ − ν|_n,

and hence, since d_n ≤ 1,

  d̄_n(μ, ν) ≤ E_λ*(d_n(a_1^n, b_1^n)) ≤ (1/2)|μ − ν|_n.

In the case when n = 1, d(a, b) = 1 unless a = b, so that E_λ(d(a, b)) = λ({(a, b): a ≠ b}) ≥ (1/2)|μ − ν|_1 for any joining λ, which gives equality. This proves (a). Completeness follows from (a), together with the fact that the space of measures is compact and complete relative to the metric |μ − ν|_n. This proves (b).

The left-hand inequality in (c) follows from the fact that summing out in a joining of X_1^n, Y_1^n gives a joining of X(j_1^m), Y(j_1^m), and that m d_m(x(j_1^m), y(j_1^m)) ≤ n d_n(x_1^n, y_1^n). The right-hand inequality in (c) is a consequence of the fact that any joining λ of X(j_1^m) and Y(j_1^m) can be extended to a joining of X_1^n and Y_1^n. To establish this extension property, let {i_1, i_2, ..., i_{n−m}} denote the natural ordering of the complement {1, 2, ..., n} − {j_1, j_2, ..., j_m}, and for each pair (x(j_1^m), y(j_1^m)) let λ_{(x(j_1^m), y(j_1^m))}(x(i_1^{n−m}), y(i_1^{n−m})) be a joining of the conditional random variables X(i_1^{n−m})/x(j_1^m) and Y(i_1^{n−m})/y(j_1^m). The function defined by

  λ*(x_1^n, y_1^n) = λ(x(j_1^m), y(j_1^m)) λ_{(x(j_1^m), y(j_1^m))}(x(i_1^{n−m}), y(i_1^{n−m}))

is a joining of X_1^n and Y_1^n which extends λ. Moreover, n d_n(x_1^n, y_1^n) ≤ m d_m(x(j_1^m), y(j_1^m)) + (n − m), and hence n E_λ*(d_n(x_1^n, y_1^n)) ≤ m E_λ(d_m(x(j_1^m), y(j_1^m))) + (n − m). This completes the proof of property (c).

Property (d) for the case N = 1 was proved earlier; the extension to N > 1 is straightforward. Property (e) follows from (c).

To prove property (f), choose, for each i, a joining λ_i of {X_{(i−1)n+1}^{in}} and {Y_{(i−1)n+1}^{in}} such that

  E_λ_i(d_n(x_{(i−1)n+1}^{in}, y_{(i−1)n+1}^{in})) = d̄_n(X_{(i−1)n+1}^{in}, Y_{(i−1)n+1}^{in}).

The product λ of the λ_i is a joining of X_1^{mn} and Y_1^{mn}, by the assumed independence of the blocks, and

  m n E_λ(d_{mn}(x_1^{mn}, y_1^{mn})) = Σ_{i=1}^m n E_λ_i(d_n(x_{(i−1)n+1}^{in}, y_{(i−1)n+1}^{in})),

so that m n d̄_{mn} is at most the right-hand sum. The reverse inequality follows from the superadditivity property, (3), and hence property (f) is established. The proof of property (g) is left to the reader. This completes the discussion of Lemma 1.9.11. □

Another metric, equivalent to the d̄_n-metric, is obtained by replacing expected values by probabilities of large deviations. It is one form of a metric commonly known as the Prohorov metric, and is defined by

(13)  d_n*(μ, ν) = min_{λ ∈ J_n(μ,ν)} min{ε > 0: λ(d_n(x_1^n, y_1^n) ≥ ε) ≤ ε}.

The function d_n*(μ, ν) is a metric on the space P(A^n) of probability measures on A^n, and is uniformly equivalent to the d̄_n-metric, with bounds given in the following lemma.

Lemma 1.9.12
[d_n*(μ, ν)]² ≤ d̄_n(μ, ν) ≤ 2 d_n*(μ, ν).

Proof. Let λ be a joining of μ and ν such that E_λ(d_n(x_1^n, y_1^n)) = d̄_n(μ, ν) = α, and let A_n(ε) = {(x_1^n, y_1^n): d_n(x_1^n, y_1^n) < ε}. The Markov inequality implies that λ(A_n(√α)) ≥ 1 − √α, and hence [d_n*(μ, ν)]² ≤ α = d̄_n(μ, ν). Likewise, if λ ∈ J_n(μ, ν) is such that λ(A_n(α)) ≥ 1 − α, then, since d_n ≤ 1,

  E_λ(d_n(x_1^n, y_1^n)) ≤ α + λ((A_n(α))^c) ≤ 2α,

and hence d̄_n(μ, ν) ≤ 2 d_n*(μ, ν). This completes the proof of the lemma. □
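The Markov-inequality step in the proof of Lemma 1.9.12 can be made concrete: for any joining λ, the set where d_n ≥ √(E_λ(d_n)) has λ-measure at most √(E_λ(d_n)). The joining below, which flips each letter independently with probability q, is an assumption chosen for illustration, not an example from the text:

```python
import itertools
import math

# A concrete joining: X uniform on binary n-blocks, Y = X with each
# letter flipped independently with probability q.
n, q = 4, 0.1
pairs = []
for x in itertools.product((0, 1), repeat=n):
    for f in itertools.product((0, 1), repeat=n):
        y = tuple(a ^ e for a, e in zip(x, f))
        p = (2 ** -n) * q ** sum(f) * (1 - q) ** (n - sum(f))
        pairs.append((x, y, p))

# alpha = E_lambda(d_n), which equals q for this joining
alpha = sum(p * sum(a != b for a, b in zip(x, y)) / n for x, y, p in pairs)
# Markov's inequality: lambda(d_n >= sqrt(alpha)) <= sqrt(alpha),
# so this lambda witnesses d_n* <= sqrt(E_lambda(d_n)).
tail = sum(p for x, y, p in pairs
           if sum(a != b for a, b in zip(x, y)) / n >= math.sqrt(alpha))
assert abs(alpha - q) < 1e-9
assert tail <= math.sqrt(alpha)
```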
The preceding lemma is useful in both directions; versions of each direction are stated as the following two lemmas. Part (b) of each version is based on the mapping interpretation of joinings.

Lemma 1.9.13
 (a) If there is a joining λ of μ and ν such that d_n(x_1^n, y_1^n) < ε, except for a set of λ-measure at most ε, then d̄_n(μ, ν) ≤ 2ε.
 (b) If φ and ψ are measure-preserving mappings from a nonatomic probability space (Z, α) to (A^n, μ) and (A^n, ν), respectively, such that α({z: d_n(φ(z), ψ(z)) ≥ ε}) ≤ ε, then d̄_n(μ, ν) ≤ 2ε.

Lemma 1.9.14
 (a) If d̄_n(μ, ν) < ε² and μ_n(B) ≥ 1 − ε, then ν_n([B]_ε) ≥ 1 − 2ε, where, for δ > 0, [B]_δ = {y_1^n: d_n(y_1^n, B) ≤ δ} denotes the δ-neighborhood, or δ-blowup, of B ⊂ A^n.
 (b) If d̄_n(μ, ν) < ε², and (Z, α) is a nonatomic probability space, then there are measure-preserving mappings φ and ψ from (Z, α) into (A^n, μ) and (A^n, ν), respectively, such that d_n(φ(z), ψ(z)) < ε, except for a set of α-measure at most ε.

The proof of (b) is similar to the proof of (a) in each case.

I.9.b.4 Ergodicity, entropy, mixing, and d̄-limits.

In this section it will be shown that the d̄-limit of ergodic (mixing) processes is ergodic (mixing) and that entropy is d̄-continuous. The key to these and many other limit results is that d̄_n-closeness implies that sets of large measure for one distribution must have large blowup with respect to the other distribution, which is just Lemma 1.9.14, and that small blowups don't change empirical frequencies very much.

Theorem 1.9.15 (Ergodicity and d̄-limits.)
The d̄-limit of ergodic processes is ergodic.

Proof. Suppose μ is the d̄-limit of ergodic processes. By the finite sequence characterization of ergodicity, Theorem 1.4.2, it is enough to show that p(a_1^k|x) is constant in x, almost surely, for each a_1^k. By the ergodic theorem the relative frequency p(a_1^k|x_1^n) converges almost surely to a limit p(a_1^k|x).

To show that small blowups don't change frequencies very much, fix k and note that if y_i^{i+k−1} = a_1^k and x_i^{i+k−1} ≠ a_1^k, then y_j ≠ x_j for at least one j ∈ [i, i + k − 1], and hence

(14)  (n − k + 1) p(a_1^k|y_1^n) ≤ (n − k + 1) p(a_1^k|x_1^n) + k n d_n(x_1^n, y_1^n),

where the extra factor of k on the last term is to account for the fact that a j for which y_j ≠ x_j can produce as many as k indices i for which y_i^{i+k−1} = a_1^k and x_i^{i+k−1} ≠ a_1^k. Let ε be a positive number and use (14) to choose a positive number δ < ε and an N_1 so that if n ≥ N_1 then

(15)  d_n(x_1^n, y_1^n) ≤ δ ⟹ |p(a_1^k|x_1^n) − p(a_1^k|y_1^n)| ≤ ε.

Let ν be an ergodic process such that d̄(μ, ν) < δ², so that d̄_n(μ_n, ν_n) < δ² for all n, since for stationary measures d̄ is the supremum of the d̄_n. Because ν is ergodic, there is an N_2 ≥ N_1 such that if n ≥ N_2 then the set

  B_n = {x_1^n: |p(a_1^k|x_1^n) − ν(a_1^k)| ≤ ε}

has ν_n-measure at least 1 − δ. Fix n ≥ N_2 and apply Lemma 1.9.14 to obtain μ_n([B_n]_δ) ≥ 1 − 2δ. Note, however, that if ỹ_1^n ∈ [B_n]_δ then there is a sequence x̃_1^n ∈ B_n such that d_n(x̃_1^n, ỹ_1^n) ≤ δ, and hence |p(a_1^k|x̃_1^n) − p(a_1^k|ỹ_1^n)| ≤ ε, by (15). The definition of B_n and the two preceding inequalities yield |p(a_1^k|ỹ_1^n) − ν(a_1^k)| ≤ 2ε for every ỹ_1^n ∈ [B_n]_δ, so that any two members of [B_n]_δ have a_1^k-frequencies within 4ε of each other. Since [B_n]_δ has large μ_n-measure for all n ≥ N_2, and ε was arbitrary, the finite sequence characterization of ergodicity implies that μ is ergodic, completing the proof of Theorem 1.9.15. □

Theorem 1.9.16 (Entropy and d̄-limits.)
Entropy is d̄-continuous on the class of ergodic processes.

Proof. Now the idea is that if μ and ν are d̄-close then a small blowup does not increase the set of μ-entropy-typical sequences by very much, and the blowup has large ν-measure, so the covering-exponent interpretation of entropy can be applied.

Let μ be an ergodic process with entropy h = h(μ). Given ε > 0, choose N and, for each n ≥ N, a set T_n ⊂ A^n such that |T_n| ≤ 2^{n(h+ε)} and μ(T_n) → 1 as n → ∞. Let δ be a positive number such that |[T_n]_δ| ≤ 2^{n(h+2ε)}, for all n ≥ N. (Such a δ exists by the blowup-bound lemma.) Let ν be an ergodic measure such that d̄(μ, ν) < δ², and hence d̄_n(μ_n, ν_n) < δ², for all n. Lemma 1.9.14 implies that ν([T_n]_δ) ≥ 1 − 2δ, for all sufficiently large n. Thus ν gives measure at least 1 − 2δ to a subset of A^n of cardinality at most 2^{n(h+2ε)}, so the covering-exponent interpretation of entropy yields h(ν) ≤ h(μ) + 2ε. Likewise, h(μ) ≤ h(ν) + 2ε, completing the proof of Theorem 1.9.16. □

Mixing is also preserved in the passage to d̄-limits.

Theorem 1.9.17 (Mixing and d̄-limits.)
The d̄-limit of mixing processes is mixing.

Proof. The proof is similar to the proof, via finite-coder approximation, that stationary coding preserves mixing. Suppose μ is the d̄-limit of mixing processes. Fix a_1^n and b_1^n, and fix ε > 0. Given δ > 0, let ν be a mixing process such that d̄(μ, ν) < δ², and let λ be a stationary joining of μ and ν such that

  E_λ(d(x_1, y_1)) = d̄(μ, ν) < δ².
This implies that d_n(x_1^n, y_1^n) < δ except for a set of λ-measure at most δ, so that if δ < 1/n then the set {(x_1^n, y_1^n): x_1^n ≠ y_1^n} has λ-measure at most δ. Thus by making δ even smaller it can be supposed that μ(a_1^n) is so close to ν(a_1^n) and μ(b_1^n) so close to ν(b_1^n) that

  |μ([a_1^n])μ([b_1^n]) − ν([a_1^n])ν([b_1^n])| ≤ ε/3.

Likewise, for any m ≥ 1,

  λ({(x, y): x_1^n ≠ y_1^n or x_{m+1}^{m+n} ≠ y_{m+1}^{m+n}})
    ≤ λ({(x, y): x_1^n ≠ y_1^n}) + λ({(x, y): x_{m+1}^{m+n} ≠ y_{m+1}^{m+n}}) ≤ 2δ,

so that if δ < 1/2n is small enough, then

  |μ([a_1^n] ∩ T^{−m}[b_1^n]) − ν([a_1^n] ∩ T^{−m}[b_1^n])| ≤ ε/3,

for all m ≥ 1. Since ν is mixing, however,

  |ν([a_1^n])ν([b_1^n]) − ν([a_1^n] ∩ T^{−m}[b_1^n])| ≤ ε/3,

for all sufficiently large m, and hence, the triangle inequality yields

  |μ([a_1^n])μ([b_1^n]) − μ([a_1^n] ∩ T^{−m}[b_1^n])| ≤ ε,

for all sufficiently large m, provided only that δ is small enough. Thus μ must be mixing, which establishes the theorem. □

This section will be closed with an example which shows, in particular, that the d̄-topology is nonseparable.
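The blowup operation [B]_δ used in Lemma 1.9.14 and in the preceding proofs can be computed by brute force for small n. A sketch; the set B and the parameters are arbitrary choices:

```python
from itertools import product

def blowup(B, n, delta):
    """The delta-blowup [B]_delta = {y: d_n(y, B) <= delta} of a set of
    binary n-blocks, where d_n is per-letter Hamming distance."""
    B = set(B)
    return {y for y in product((0, 1), repeat=n)
            if min(sum(a != b for a, b in zip(y, x)) for x in B) / n <= delta}

n = 6
B = {(0,) * n}
# blowup radius 1/n allows one letter to change: 1 + n words
assert len(blowup(B, n, 1 / n)) == 1 + n
# radius 0 gives back B itself
assert blowup(B, n, 0) == B
```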
Example 1.9.18 (Rotation processes are generally d̄-far-apart.)
Let S and T be the transformations of X = [0, 1) defined, respectively, by

  Sx = x ⊕ α,  Tx = x ⊕ β,

where, as before, ⊕ denotes addition modulo 1.
Proposition 1.9.19
Let λ be an (S × T)-invariant measure on X × X which has Lebesgue measure as marginal on each factor. If α and β are rationally independent then λ must be the product measure.

Proof. "Rationally independent" means that kα + mβ is irrational for any two rationals k and m with (k, m) ≠ (0, 0). Let C and D be measurable subsets of X. The goal is to show that λ(C × D) = μ(C)μ(D), where μ denotes Lebesgue measure.

It is enough to prove this when C and D are intervals and μ(C) = 1/N, where N is an integer. Given ε > 0, let C_1 be a subinterval of C of length (1 − ε)/N and let

  E = C_1 × D,  F = X × D.
Since α and β are rationally independent, the two-dimensional version of Kronecker's theorem, Proposition 1.2.16, can be applied, yielding integers m_1, m_2, ..., m_N such that, if V denotes the transformation S × T, then

  V^{m_i} E ∩ V^{m_j} E = ∅, if i ≠ j,

and λ(F Δ F′) ≤ 2ε, where F′ = ∪_{i=1}^N V^{m_i} E. It follows that λ(E) = λ(F′)/N is within 2ε/N of λ(F)/N = μ(C)μ(D). Let ε → 0 to obtain λ(C × D) = μ(C)μ(D). This completes the proof of Proposition 1.9.19. □
Now let P be the partition of the unit interval that consists of the two intervals P_0 = [0, 1/2), P_1 = [1/2, 1). It is easy to see that the mapping that carries x into its (T, P)-name {x_n} is an invertible mapping of the unit interval into the space A^Z, which carries Lebesgue measure onto the Kolmogorov measure. This fact, together with Proposition 1.9.19, implies that the only joining of the (T, P)-process and the (S, P)-process is the product joining, and this, in turn, implies that the d̄-distance between these two processes is 1/2. This shows in particular that the class of ergodic processes is not d̄-separable, for, in fact, even the translation (rotation) subclass is not separable. It can be shown that the class of all processes that are stationary codings of i.i.d. processes is d̄-separable, see Exercise 3 in Section IV.2.
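The conclusion of this example can be checked empirically: since the only joining is the product joining, the (T, P)- and (S, P)-names disagree about half the time along typical orbits. A quick simulation; the particular irrationals α = √2 − 1, β = √3 − 1 and the orbit length are arbitrary choices:

```python
import math

# Names of the orbits of 0 under the two rotations, using the partition
# P_0 = [0, 1/2), P_1 = [1/2, 1); equidistribution of (k*alpha, k*beta)
# on the torus makes the mismatch frequency approach 1/2.
alpha, beta = math.sqrt(2) - 1, math.sqrt(3) - 1
N = 200_000
mismatches = sum(
    ((k * alpha) % 1.0 >= 0.5) != ((k * beta) % 1.0 >= 0.5)
    for k in range(N))
avg = mismatches / N
assert abs(avg - 0.5) < 0.02   # empirical per-letter Hamming distance
```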
I.9.c Exercises.
1. A measure μ ∈ P(X) is extremal if it cannot be expressed in the form μ = tμ_1 + (1 − t)μ_2 with μ_1 ≠ μ_2 and 0 < t < 1.
 (a) Show that if μ is ergodic then μ is extremal. (Hint: if μ = tμ_1 + (1 − t)μ_2, apply the Radon-Nikodym theorem to obtain μ_i(B) = ∫_B f_i dμ and show that each f_i is T-invariant.)
 (b) Show that if μ is extremal, then it must be ergodic. (Hint: if T^{−1}B = B then μ is a convex sum of the T-invariant conditional measures μ(·|B) and μ(·|X − B).)

2. Show that d̄_n(·, ·) is a complete metric by showing that
 (a) The triangle inequality holds. (Hint: if λ joins X_1^n and Y_1^n, and λ* joins Y_1^n and Z_1^n, then Σ_{y_1^n} λ(x_1^n|y_1^n) λ*(z_1^n|y_1^n) ν(y_1^n) joins X_1^n and Z_1^n.)
 (b) If d̄_n(X_1^n, Y_1^n) = 0 then X_1^n and Y_1^n have the same distribution.
 (c) The metric d̄_n(X_1^n, Y_1^n) is complete.

3. Prove the superadditivity inequality (3).

4. Let μ and ν be the binary Markov chains with the respective transition matrices
  M = [ p    1−p ]        N = [ 0  1 ]
      [ 1−p  p   ],           [ 1  0 ].

Let μ̃ be the Markov process defined by M².
 (a) Show that if x is typical for μ, and y_n = x_{2n}, n = 1, 2, ..., then, almost surely, y is typical for μ̃.
 (b) Use the result of part (a) to show d̄(μ, ν) = 1/2, if 0 < p < 1.

5. Use Lemma 1.9.11(f) to show that if μ and ν are i.i.d. then d̄(μ, ν) = (1/2)|μ − ν|_1. (This is a different method for obtaining the d̄-distance for i.i.d. processes than the one outlined in Example 1.9.3.)

6. Suppose ν is ergodic and μ̄ is the concatenated-block process defined by ν_n on A^n. Determine d̄(ν, μ̄). (Hint: μ̄ is concentrated on shifts of sequences that are typical for the product measure on (A^n)^∞ defined by ν_n.)

7. Prove property (d) of Lemma 1.9.11.

8. Show there is a joining λ* of μ and ν such that λ*(a_1^n, a_1^n) = min(μ(a_1^n), ν(a_1^n)).

9. Prove that |μ − ν|_n = 2 − 2 Σ_{a_1^n} min(μ(a_1^n), ν(a_1^n)).

10. Two sets C, D ⊂ A^k are α-separated if C ∩ [D]_α = ∅. Show that if the supports of μ_k and ν_k are α-separated then d̄_k(μ_k, ν_k) ≥ α.

11. Suppose μ and ν are ergodic and d̄(μ, ν) > α. Show that if ε > 0 is given, then for sufficiently large n there are subsets C_n and D_n of A^n such that μ(C_n) ≥ 1 − ε, ν(D_n) ≥ 1 − ε, and d_n(x_1^n, y_1^n) ≥ α − ε, for x_1^n ∈ C_n, y_1^n ∈ D_n. (Hint: if k is large enough, then p_k(·|x_1^n) ≈ μ_k and p_k(·|y_1^n) ≈ ν_k, and d̄(x, y) cannot be much smaller than α, by (7).)

12. Let (Y, ν) be a nonatomic probability space and suppose ψ: Y ↦ A^n is a measurable mapping such that d̄_n(μ_n, ν ∘ ψ^{−1}) ≤ δ. Show that there is a measurable mapping φ: Y ↦ A^n such that μ_n = ν ∘ φ^{−1} and such that E_ν(d_n(φ(y), ψ(y))) ≤ δ.
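The hint in Exercise 2(a) composes two joinings through their common middle marginal. A sketch of the construction; the alphabet size, the three distributions, and the use of independence couplings are arbitrary choices made for simplicity:

```python
import numpy as np

# Compose a joining lam_xy of (X, Y) with a joining lam_yz of (Y, Z)
# via lam_xz(x, z) = sum_y lam_xy(x, y) lam_yz(y, z) / q(y),
# which is again a joining, now of the X and Z marginals.
p = np.array([.2, .3, .5])    # distribution of X
q = np.array([.5, .25, .25])  # distribution of Y (the common marginal)
r = np.array([.1, .6, .3])    # distribution of Z
lam_xy = np.outer(p, q)       # independence coupling of p and q
lam_yz = np.outer(q, r)       # independence coupling of q and r

lam_xz = np.zeros((3, 3))
for y in range(3):
    if q[y] > 0:
        lam_xz += np.outer(lam_xy[:, y], lam_yz[y, :]) / q[y]
# the composition has the outer marginals p and r
assert np.allclose(lam_xz.sum(axis=1), p)
assert np.allclose(lam_xz.sum(axis=0), r)
```

Applying this composition to couplings that realize d̄_n, and using the pointwise triangle inequality for d_n, gives the triangle inequality for d̄_n.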
Section 1.10 Cutting and stacking.
Concatenated-block processes and regenerative processes are examples of processes with block structure. Sample paths are concatenations of members of some fixed collection S of blocks, that is, finite-length sequences, with both the initial tail of a block and subsequent block probabilities governed by a product measure μ* on S^∞. The assumption that μ* is a product measure is not really necessary, for any stationary measure μ* on S^∞ leads to a stationary measure μ on A^∞ whose sample paths are infinite concatenations of members of S, provided only that expected block length is finite. Indeed, the measure μ is just the measure given by the tower construction, with base S^∞, measure μ*, and transformation given by the shift on S^∞; see Section I.2.c.2. It is often easier to construct counterexamples by thinking directly in terms of block structures, first constructing finite blocks that have some approximate form of the final desired property, then concatenating these blocks in some way to obtain longer blocks in which the property is approximated, continuing in this way to obtain the final process as a suitable limit of finite blocks. A powerful method for organizing such constructions will be presented in this section. The method is called "cutting and stacking," a name suggested by the geometric idea used to go from one stage to the next in the construction.
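In the simplest case, when μ* is a product measure, a sample path is obtained by concatenating independently drawn blocks. The sketch below assumes i.i.d. block selection and starts at a block boundary, ignoring the stationarity adjustment made by the tower construction; block set and probabilities are arbitrary illustrative choices:

```python
import random

def sample_path(blocks, probs, length, seed=0):
    """Generate a prefix of a concatenated-block sample path by drawing
    blocks i.i.d. according to probs and concatenating their letters."""
    rng = random.Random(seed)
    path = []
    while len(path) < length:
        w = rng.choices(blocks, weights=probs)[0]
        path.extend(w)          # append the letters of the block
    return path[:length]

path = sample_path(['01', '011'], [1/3, 2/3], 20)
assert len(path) == 20 and set(path) <= {'0', '1'}
```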
Before going into the details of cutting and stacking, it will be shown how a stationary measure μ* on S^∞ gives rise to a sequence of pictures, called columnar representations, and how these, in turn, lead to a description of the measure on A^∞.
I.10.a The columnar representations.
Fix a set S ⊂ A* of finite-length sequences drawn from a finite alphabet A. The members of the initial set S ⊂ A* are called words or first-order blocks. The length of a word w ∈ S is denoted by ℓ(w). The members w_1^n of the product space S^n are called n-th order blocks. The length ℓ(w_1^n) of an n-th order block w_1^n satisfies ℓ(w_1^n) = Σ_i ℓ(w_i). The symbol w_1^n has two interpretations, one as the n-tuple w_1, w_2, ..., w_n in S^n, the other as the concatenation w_1 w_2 ⋯ w_n in A^L, where L = Σ_i ℓ(w_i). The context usually makes clear which interpretation is in use. The space S^∞ consists of the infinite sequences w_1^∞ = w_1, w_2, ..., where each w_i ∈ S. A (Borel) probability measure μ* on S^∞ which is invariant under the shift on S^∞ will be called a block-structure measure if it satisfies the finite expected-length condition,

  E(ℓ(w)) = Σ_{w∈S} ℓ(w) μ_1*(w) < ∞.

Here μ_1* denotes the projection of μ* onto S, while, in general, μ_n* denotes the projection of μ* onto S^n. Note, by the way, that stationarity gives

  E(ℓ(w_1^n)) = Σ_{w_1^n ∈ S^n} ℓ(w_1^n) μ_n*(w_1^n) = n E(ℓ_1).
Blocks are to be concatenated to form A-valued sequences, hence it is important to have a distribution that takes into account the length of each block. This is the probability measure λ on S defined by the formula

  λ(w) = ℓ(w) μ*(w) / E(ℓ_1),  w ∈ S,

where E(ℓ_1) denotes the expected length of first-order blocks. The formula indeed defines a probability distribution, since summing ℓ(w)μ*(w) over w yields the expected block length E(ℓ_1). The measure λ is called the linear mass distribution of words since, in the case when μ* is ergodic, λ(w) is the limiting fraction of the length of a typical concatenation w_1 w_2 ⋯ occupied by the word w. Indeed, using f(w|w_1^n) to denote the number of times w appears in w_1^n ∈ S^n, the fraction of the length of w_1^n occupied by w is given by

  ℓ(w) f(w|w_1^n) / ℓ(w_1^n) = ℓ(w) · [f(w|w_1^n)/n] · [n/ℓ(w_1^n)] → ℓ(w) μ*(w) / E(ℓ_1) = λ(w), a.s.,

since f(w|w_1^n)/n → μ*(w) and ℓ(w_1^n)/n → E(ℓ_1), almost surely, by the ergodic theorem applied to μ*. The ratio τ(w) = μ*(w)/E(ℓ_1) is called the width or thickness of w. Note that λ(w) = ℓ(w)τ(w), that is, linear mass = length × width.
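The width and linear-mass formulas are easy to check on the example used in Figure 1.10.1 below, S = {v, w} with v = 01, w = 011, μ_1*(v) = 1/3, μ_1*(w) = 2/3. A sketch in exact rational arithmetic:

```python
from fractions import Fraction as F

mu1 = {'01': F(1, 3), '011': F(2, 3)}
E1 = sum(len(w) * p for w, p in mu1.items())      # E(l_1) = 8/3
tau = {w: p / E1 for w, p in mu1.items()}         # widths tau(w)
lam = {w: len(w) * tau[w] for w in mu1}           # linear mass = length x width
assert E1 == F(8, 3)
assert sum(lam.values()) == 1                     # lam is a probability distribution
assert sum(tau.values()) == 1 / E1                # total width of the columns
```

Here τ(v) = 1/8, τ(w) = 1/4, λ(v) = 1/4, λ(w) = 3/4, so the two columns exactly fill the unit interval by mass.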
The unit interval can be partitioned into subintervals indexed by w ∈ S such that the length of the subinterval assigned to w is λ(w). Thus no harm comes from thinking of λ as Lebesgue measure on the unit interval. A more useful representation is obtained by subdividing the interval that corresponds to w into ℓ(w) subintervals of width τ(w), labeling the i-th subinterval with the i-th term of w, then stacking these subintervals into a column, called the column associated with w. This is called the first-order columnar representation of (S^∞, μ*). Figure 1.10.1 shows the first-order columnar representation of S = {v, w}, where v = 01, w = 011, μ_1*(v) = 1/3, and μ_1*(w) = 2/3.

Figure 1.10.1 The first-order columnar representation of (S, μ_1*).

In the columnar representation, shifting along a word corresponds to moving intervals one level upwards. This upward movement can be accomplished by a point mapping, namely, the mapping T_1 that moves each point one level upwards. This mapping is not defined on the top level of each column, but it is a Lebesgue measure-preserving map from its domain to its range, since a level is mapped linearly onto the next level, an interval of the same width. (This is also shown in Figure 1.10.1.)

In summary, the columnar representation not only carries full information about the distribution λ(w), or alternatively, μ_1*(w), but shifting along a block can be represented as the transformation that moves points one level upwards, a transformation which is Lebesgue measure preserving on its domain. The first-order columnar representation is determined by S and the first-order distribution μ_1* (modulo, of course, the fact that there are many ways to partition the unit interval into subintervals and stack them into columns of the correct sizes). Conversely, the columnar representation determines S and μ_1*. The information about the final process is only partial, since the picture gives no information about how to get from the top of a column to the base of another column; in other words, it does not tell how the blocks are to be concatenated. The first-order columnar representation is, of course, closely related to the tower representation discussed in Section I.2.c.2, the difference being that now the emphasis is on the width distribution and the partially defined transformation T_1 that moves points upwards.

Information about how first-order blocks are concatenated to form second-order blocks is given by the columnar representation of the second-order distribution μ_2*(w_1^2). Let τ(w_1^2) = μ_2*(w_1^2)/E(ℓ_2) be the width of w_1^2, where E(ℓ_2) is the expected length of w_1^2 with respect to μ_2*, and let λ(w_1^2) = ℓ(w_1^2)τ(w_1^2)
be the secondorder linear mass. The secondorder blocks w? are represented as columns of disjoint subintervals of the unit interval of width r (14) and height aw?), with the ith subinterval labeled by the ith term of the concatenation tv?. A key observation, which gives the name "cutting and stacking" to the whole procedure that this discussion is leading towards, is that
it is possible to cut the first-order columns into subcolumns and stack them in pairs so as to obtain the second-order columnar representation. Indeed, since E(ℓ₂) = 2E(ℓ₁), the total mass in the second-order representation contributed by the first ℓ(w) levels of all the columns that start with w is

ℓ(w) Σ_{w₂} τ(ww₂) = ℓ(w) Σ_{w₂} μ₂(ww₂)/E(ℓ₂) = ℓ(w)μ₁(w)/(2E(ℓ₁)) = (1/2) ℓ(w)τ(w),

which is exactly half the total mass of w in the first-order columnar representation. Likewise, the total mass contributed by the top ℓ(w) levels of all the columns that end with w is

Σ_{w₁} ℓ(w)τ(w₁w) = ℓ(w) Σ_{w₁} μ₂(w₁w)/(2E(ℓ₁)) = (1/2) ℓ(w)τ(w).

Thus half the mass of a first-order column goes to the bottom parts and half to the top parts of second-order columns. Likewise, the total width of the second-order columns is 1/E(ℓ₂), which is just half the total width 1/E(ℓ₁) of the first-order columns. Thus, as claimed, second-order columns can be obtained by cutting each first-order column into subcolumns, then stacking these in pairs. Figure I.10.2 shows how the second-order columnar representation of S = {v, w}, where μ₂(vv) = 1/9, μ₂(vw) = μ₂(wv) = 2/9, and μ₂(ww) = 4/9, can be built by cutting and stacking the first-order representation shown in Figure I.10.1.

Figure I.10.2 The second-order representation via cutting and stacking. [The figure cuts the columns of v and w into pieces A, ..., H and stacks them in pairs to form the columns of vv, vw, wv, and ww.]

The significance of the fact that second-order columns can be built from first-order columns by appropriate cutting and stacking is that this guarantees that the transformation T₂, defined for the second-order picture by mapping points directly upwards one level, extends the mapping T₁ that was defined for the first-order picture by mapping points directly upwards one level. Indeed, if y is directly above x in some column of the first-order representation, then it will continue to be so in any second-order representation that is built from the first-order representation by cutting and stacking its columns. Note also the important property that the set of points where T₂ is undefined has only half the measure of the set where T₁ is undefined.

One can now proceed to define higher-order columnar representations. In general, the 2m-th order representation can be produced by cutting the columns of the m-th order columnar representation into subcolumns and stacking them in pairs. If this cutting and stacking method is used, then the following two properties hold.

(a) The associated column transformation T_{2m} extends the transformation T_m.

(b) The set where T_{2m} is undefined has one-half the measure of the set where T_m is undefined.

Thus, if cutting and stacking is used to go from stage to stage, then the set of transformations {T_m} will have a common extension T defined for almost all x in the unit interval, and the transformation T will preserve Lebesgue measure. The labeling of levels by symbols from A provides a partition P = {P_a : a ∈ A} of the unit interval, where P_a is the union of all the levels labeled a. The (T, P)-process is the same as the stationary process μ defined by μ*, which, in essence, gives the desired stationary A-valued process. A general form of this fact will be proved later; see Corollary I.10.5.

The difference between the building of columnar representations and the general cutting and stacking method is really only a matter of approach and emphasis. The columnar representation idea starts with the final block-structure measure μ* and uses it to construct a sequence of partial transformations of the unit interval, each of which extends the previous ones; in essence, it is a way of representing something already known. The cutting and stacking idea focuses directly on the geometric concept of cutting and stacking labeled columns, using the desired goal (typically to make an example of a process with some sample path property) to guide how to cut up the columns of one stage and reassemble to create the next stage; in essence, it is a way of building something new. The goal is still to produce a Lebesgue measure-preserving transformation on the unit interval, but this all happens in the background while the user focuses on the combinatorial properties needed at each stage to produce the desired final process. Some applications of its use to construct examples will be given in Chapter 3.

I.10.b The basic cutting and stacking vocabulary.

The cutting and stacking language and theory will be rigorously developed in the following subsections. The suggestive name, column structure, will be used in this book, rather than the noninformative name, gadget, which has been commonly used in ergodic theory.
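The halving identity just derived can be verified exactly for the example of Figure I.10.2, where μ₂ is the product distribution μ₂(w₁w₂) = μ₁(w₁)μ₁(w₂) and E(ℓ₂) = 2E(ℓ₁). A minimal check, in our own encoding:

```python
from fractions import Fraction as F

mu1 = {"01": F(1, 3), "011": F(2, 3)}
E1 = sum(len(w) * p for w, p in mu1.items())        # E(l_1) = 8/3
tau1 = {w: p / E1 for w, p in mu1.items()}

# Second-order distribution: independent concatenation (Figure I.10.2),
# with E(l_2) = 2 E(l_1) and widths tau2 = mu2 / E(l_2).
mu2 = {(a, b): mu1[a] * mu1[b] for a in mu1 for b in mu1}
E2 = 2 * E1
tau2 = {pair: q / E2 for pair, q in mu2.items()}

# Total second-order width is half the total first-order width.
assert sum(tau2.values()) == sum(tau1.values()) / 2

# The first l(w) levels of all second-order columns starting with w
# carry exactly half of w's first-order mass l(w) * tau(w).
for w in mu1:
    bottom_mass = len(w) * sum(tau2[(w, b)] for b in mu1)
    assert bottom_mass == F(1, 2) * len(w) * tau1[w]
```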
A column C = (L₁, L₂, ..., L_{ℓ(C)}) of length, or height, ℓ(C) is a nonempty, ordered, disjoint collection of subintervals of the unit interval of equal positive width τ(C). The interval L₁ is called the base or bottom, the interval L_{ℓ(C)} is called the top, and the interval L_j is called the j-th level of the column. The support of a column C is the union of its levels L_j. Its measure λ(C) is the Lebesgue measure λ of the support of C, which is of course given by λ(C) = ℓ(C)τ(C). Two columns are said to be disjoint if their supports are disjoint.

A labeling of C is a mapping N from {1, 2, ..., ℓ(C)} into the finite alphabet A. N(j) = N(L_j) is called the name or label of L_j, and the vector N(C) = (N(1), N(2), ..., N(ℓ(C))) is called the name of C. Unless stated otherwise, all columns are assumed to be labeled.

A column C = (L₁, L₂, ..., L_{ℓ(C)}) defines a transformation T = T_C, called its upward map. The precise definition is that T maps L_j in a linear and order-preserving manner onto L_{j+1}, for 1 ≤ j < ℓ(C), and is not defined on the top level. If a column is pictured by drawing L_{j+1} directly above L_j, then T maps points directly upwards one level. Note that T is one-to-one and its inverse maps points downwards.

A column structure S is a nonempty collection of mutually disjoint, labeled columns. Note that the members of a column structure are labeled columns, so terminology should be interpreted as statements about sets of labeled columns. Thus the union S ∪ S' of two column structures S and S' consists of the labeled columns of each, and S ⊆ S' means that each labeled column in S is a labeled column in S' with the same labeling, that is, S is a substructure of S'. An exception to this is the terminology "disjoint column structures," which is taken as shorthand for "column structures with disjoint support," where the support of a column structure is the union of the supports of its columns. The base or bottom of a column structure is the union of the bases of its columns, and its top is the union of the tops of its columns.

The width τ(S) of a column structure is the sum of the widths of its columns, τ(S) = Σ_{C∈S} τ(C), and it is assumed that τ(S) < ∞. The width distribution is the normalized column width,

γ(C) = τ(C)/Σ_{D∈S} τ(D) = τ(C)/τ(S), C ∈ S.

The measure λ(S) of a column structure is the Lebesgue measure of its support, that is, λ(S) = Σ_{C∈S} λ(C) = Σ_{C∈S} ℓ(C)τ(C). Note that λ(S) ≤ 1, since columns always consist of disjoint subintervals of the unit interval; in particular, since

Σ_{C∈S} ℓ(C)γ(C) = λ(S)/τ(S) < ∞,

the expected column height with respect to the width distribution is finite.

The transformation T = T_S defined by a column structure S is the union of the upward maps defined by its columns; it is called the mapping or upward mapping defined by the column structure S. Each point below the top of S is mapped one level directly upwards by T. Thus T is not defined on the top of S, and it is a Lebesgue measure-preserving mapping from all but the top of S to all but the bottom of S.

Columns are cut into subcolumns by slicing them vertically. This idea can be made precise in terms of subcolumns and column partitions as follows. A subcolumn of a column C = (L₁, L₂, ..., L_{ℓ(C)}) is a column C' = (L'₁, L'₂, ..., L'_{ℓ(C)}) such that the following hold.

(a) For each j, L'_j is a subinterval of L_j with the same label as L_j.

(b) The distance from the left end point of L'_j to the left end point of L_j does not depend on j.

Note that implicit in the definition is that a subcolumn always has the same height as the column, and that the base and top of the subcolumn are subintervals of the base and top, respectively, of the column. (An alternative definition of subcolumn in terms of the upward map is given in Exercise 1.)

A (column) partition of a column C is a finite or countable collection {C(i) = (L(i, 1), L(i, 2), ..., L(i, ℓ(C))) : i ∈ I} of disjoint subcolumns of C, the union of whose supports is the support of C. Cutting C into subcolumns according to a distribution π is the same as finding a column partition {C(i)} of C such that π(i) = τ(C(i))/τ(C), for each i.
The cutting idea extends to column structures. A column partitioning of a column structure S is a column structure S' with the same support as S such that each column of S' is a subcolumn of a column of S. In general, a column partitioning is formed by partitioning each column of S in some way and taking the collection of the resulting subcolumns.

The stacking idea is defined precisely as follows. Let C₁ = (L(1, j): 1 ≤ j ≤ ℓ(C₁)) and C₂ = (L(2, j): 1 ≤ j ≤ ℓ(C₂)) be disjoint labeled columns of the same width. The stacking of C₂ on top of C₁ is denoted by C₁ * C₂ and is the labeled column with levels L_j defined by

L_j = L(1, j), for 1 ≤ j ≤ ℓ(C₁), and L_j = L(2, j − ℓ(C₁)), for ℓ(C₁) < j ≤ ℓ(C₁) + ℓ(C₂),

with the label of C₁ * C₂ defined to be the concatenation vw of the label v of C₁ and the label w of C₂. Note that the width of C₁ * C₂ is the same as the width of C₁, which is also the same as the width of C₂. Longer stacks are defined inductively by C₁ * C₂ * ⋯ * C_k * C_{k+1} = (C₁ * C₂ * ⋯ * C_k) * C_{k+1}.

The basic fact about stacking is that it extends upward maps. Indeed, the upward map T_{C₁*C₂}, defined by stacking C₂ on top of C₁, agrees with the upward map T_{C₁} wherever it is defined and with the upward map T_{C₂} wherever it is defined, and it extends their union, since T_{C₁*C₂} is now defined on the top of C₁.

A column structure S' is a stacking of a column structure S if they have the same support and each column of S' is a stacking of columns of S. A column structure S' is built by cutting and stacking from a column structure S if it is a stacking of a column partitioning of S; in other words, the columns of S' are permitted to be stackings of variable numbers of subcolumns of S. Thus, for example, the second-order columnar representation of (S, μ*), where S ⊆ A*, is built by cutting each of the first-order columns into subcolumns and stacking these in pairs. The fact that stacking extends upward maps is also true of the upward mapping T = T_S defined by a column structure S: for any S' built by cutting and stacking from S, T_S is extended by the upward map T_{S'}. In summary, the key properties of the upward maps defined by column structures are the following.

(i) The domain of T_S is the union of all except the top of S.

(ii) The range of T_S is the union of all except the bottom of S.

(iii) The upward map T_S is a Lebesgue measure-preserving mapping from its domain to its range.

(iv) If S' is built by cutting and stacking from S, then T_{S'} extends T_S.
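The vocabulary above — widths, names, masses, cutting according to a distribution, and stacking — can be sketched in a few lines of code. The representation below keeps only a column's width and name, leaving the actual subintervals implicit; all names and identifiers are illustrative, not the book's:

```python
from fractions import Fraction as F
from dataclasses import dataclass

@dataclass
class Column:
    """A labeled column: 'width' is tau(C), 'name' lists the level labels."""
    width: F
    name: str                      # name N(C); height l(C) = len(name)

    def mass(self):                # lambda(C) = l(C) * tau(C)
        return len(self.name) * self.width

def cut(col, pi):
    """Cut a column into subcolumns according to a distribution pi; each
    subcolumn keeps the full height and name (vertical slicing)."""
    assert sum(pi) == 1
    return [Column(col.width * p, col.name) for p in pi]

def stack(c1, c2):
    """Stack c2 on top of c1 (equal widths required); the name of the
    new column is the concatenation of the two names."""
    assert c1.width == c2.width
    return Column(c1.width, c1.name + c2.name)

# Example: cut a column in half, then stack one half on the other.
c = Column(F(1, 4), "011")
left_half, right_half = cut(c, [F(1, 2), F(1, 2)])
d = stack(left_half, right_half)
assert d.name == "011011" and d.mass() == c.mass()   # mass is preserved
```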
I.10.c The final transformation and process.

The goal of cutting and stacking operations is to construct a measure-preserving transformation on the unit interval which, together with the partition defined by the names of levels, produces a stationary process. This is done by starting with a column structure S(1) whose support has measure 1 and applying cutting and stacking operations to produce a new structure S(2). Cutting and stacking operations are then applied to S(2) to produce a third structure S(3). Continuing in this manner, a sequence {S(m)} of column structures is obtained for which each member is built by cutting and stacking from the preceding one. If the tops shrink to 0, then a common extension T of the successive upward maps is defined for almost all x in the unit interval and preserves Lebesgue measure. A precise formulation of this result will be presented in this section, along with the basic formula for estimating the joint distributions of the final process from the sequence of column structures, and a condition for ergodicity.

A sequence {S(m)} of column structures is said to be complete if the following hold.

(a) For each m ≥ 1, S(m + 1) is built by cutting and stacking from S(m).

(b) λ(S(1)) = 1, and the (Lebesgue) measure of the top of S(m) goes to 0 as m → ∞.

Theorem I.10.3 (The complete-sequences theorem.) If {S(m)} is complete, then the collection {T_{S(m)}} has a common extension T defined for almost every x ∈ [0, 1]. Furthermore, T is invertible and preserves Lebesgue measure.

Proof. For almost every x ∈ [0, 1] there is an M = M(x) such that x lies in an interval below the top of S(M), since the top shrinks to 0 as m → ∞. Further cutting and stacking preserves this relationship of being below the top and produces extensions of T_{S(M)}, so that the transformation T defined by Tx = T_{S(M)}x is defined for almost all x and extends every T_{S(m)}. If B is a subinterval of a level in a column of some S(m) which is not the base of that column, then T⁻¹B = T_{S(m)}⁻¹B is an interval of the same length as B. Since such intervals generate the Borel sets on the unit interval, it follows that T is measurable and preserves Lebesgue measure. Likewise, the common extension of the inverses of the T_{S(m)} is defined for almost all x; it is clearly the inverse of T, so that T is invertible and preserves Lebesgue measure. This completes the proof of Theorem I.10.3. □

The transformation T is called the transformation defined by the complete sequence {S(m)}. The label structure defines the partition P = {P_a : a ∈ A}, where P_a is the union of all the levels of all the columns of S(1) that have the name a. The (T, P)-process {X_n} and its Kolmogorov measure μ are called, respectively, the process and Kolmogorov measure defined by the complete sequence {S(m)}. The process {X_n} is described by selecting a point x at random according to the uniform distribution on the unit interval and defining X_n(x) to be the index of the member of P to which T^n x belongs. This is equivalent to picking a random x in the support of S(1), then choosing m so that x belongs to the j-th level, say L_j, of a column C ∈ S(m) of height ℓ(C) ≥ j + n − 1, and defining X_n(x) to be the name of level L_{j+n−1} of C. (Such an m exists eventually almost surely, since the tops are shrinking to 0.)
The k-th order joint distribution of the process μ defined by a complete sequence {S(m)} can be directly estimated by the relative frequency of occurrence of a_1^k in a column name, averaged over the Lebesgue measure of the column. The relative frequency of occurrence of a_1^k in a labeled column C is defined by

p_k(a_1^k | C) = |{i ∈ [1, ℓ(C) − k + 1] : x_i^{i+k−1} = a_1^k}| / (ℓ(C) − k + 1),

where x_1^{ℓ(C)} is the name of C. In other words, p_k(· | C) is the empirical overlapping k-block distribution p_k(· | x_1^{ℓ(C)}) defined by the name of C.

Theorem I.10.4 (Estimation of joint distributions.) If μ is the measure defined by the complete sequence {S(m)}, then

(1) μ(a_1^k) = lim_{m→∞} Σ_{C∈S(m)} p_k(a_1^k | C) λ(C), a_1^k ∈ A^k.

Proof. In this proof C will denote either a column or its support; the context will make clear which is intended. Without loss of generality it can be assumed that S(m + 1) is built by cutting and stacking from S(m). Let P_a be the union of all the levels of all the columns in S(1) that have the name a, so that μ(a) = λ(P_a). Then

λ(P_a ∩ C) = [ℓ(C)p₁(a | C)] τ(C) = p₁(a | C)λ(C),

since ℓ(C)p₁(a | C) just counts the number of times a appears in the name of C. The same argument applies to each S(m), establishing (1) for the case k = 1; indeed, for k = 1 the sum is constant in m.

For k > 1, let {x_n} ∈ A^Z be the sequence defined, for x ∈ [0, 1], by the relation T^n x ∈ P_{x_n}, n ∈ Z, and let P_{a_1^k} = {x : x_1^k = a_1^k}. The quantity (ℓ(C) − k + 1)p_k(a_1^k | C) counts the number of levels below the top k − 1 levels of C that are contained in P_{a_1^k}, and hence

|λ(P_{a_1^k} ∩ C) − p_k(a_1^k | C)λ(C)| ≤ 2(k − 1)τ(C),

since the top k − 1 levels of C have measure (k − 1)τ(C). Thus the only error in using (1) to estimate μ(a_1^k) comes from the fact that p_k(a_1^k | C) ignores the final k − 1 levels of the column C, a negligible effect when m is large, for then most of the mass must be in long columns. The desired result (1) now follows, since the sum of the widths τ(C) over the columns of S(m) is the measure of the top of S(m), which goes to 0 as m → ∞. This completes the proof of Theorem I.10.4. □

The estimate (1) is exact for k = 1 for any m, and for k > 1 it is asymptotically correct as m → ∞.

An application of the estimation formula (1) is connected to the earlier discussion of the sequence of columnar representations defined by a block-structure measure μ*, where S ⊆ A*. For each m, let S(m) denote the 2^{m−1}-order columnar representation. Without loss of generality it can be assumed that S(m + 1) is built by cutting and stacking from S(m). Then λ(S(1)) = 1 and the measure of the top of S(m + 1) is half the measure of the top of S(m), so that {S(m)} is a complete sequence. Let μ be the measure defined by this complete sequence. The sequence {S(m)} will be called the standard cutting and stacking representation of the measure μ.
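Formula (1) can be tried on the running example. Since the k = 1 estimate is exact at every stage, the first- and second-order columnar representations of Figure I.10.1 must give the same value for μ(a); the dict-of-names encoding below is our own shorthand for a column structure:

```python
from fractions import Fraction as F

def p_k(block, name):
    """Empirical overlapping k-block frequency of `block` in a column name."""
    k, n = len(block), len(name)
    hits = sum(name[i:i + k] == block for i in range(n - k + 1))
    return F(hits, n - k + 1)

# Columns as name -> width: the first-order representation of Figure
# I.10.1 and its second-order refinement (independent pairs, E(l_2)=2E(l_1)).
mu1 = {"01": F(1, 3), "011": F(2, 3)}
E1 = sum(len(w) * p for w, p in mu1.items())
first = {w: mu1[w] / E1 for w in mu1}
second = {a + b: mu1[a] * mu1[b] / (2 * E1) for a in mu1 for b in mu1}

def estimate(block, columns):
    # formula (1): sum of p_k(block | C) * lambda(C) over the structure
    return sum(p_k(block, name) * len(name) * tau
               for name, tau in columns.items())

# The k = 1 estimate is exact, hence identical at both stages.
assert estimate("1", first) == estimate("1", second)
```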
The standard representation can be compared with the tower construction. Let T̂ be the tower transformation defined by the base measure μ*, with the shift S on S^∞ as base transformation and with height function defined by f(w₁, w₂, ...) = ℓ(w₁). Let P̂ = {P̂_a} be the partition defined by letting P̂_a be the set of all pairs (w, j) for which the j-th term of w₁ is a, and let μ̂ be the Kolmogorov measure of the (T̂, P̂)-process.

Corollary I.10.5 The tower construction and the standard representation produce the same measures.

Proof. Let p_k(a_1^k | w_1^n) be the empirical overlapping k-block distribution defined by the sequence w_1^n ∈ S^n, thought of as the concatenation w₁w₂⋯w_n ∈ A^L, where L = Σ_i ℓ(w_i). Since this is clearly the same as p_k(a_1^k | C), where C is the column with name w₁w₂⋯w_n, it is enough, by Theorem I.10.4, to show that

(2) μ̂(a_1^k) = lim_{n→∞} Σ_{w_1^n ∈ S^n} p_k(a_1^k | w_1^n) λ(C),

where C is the column with name w₁w₂⋯w_n. For k = 1 the sum is constant in n, and (2) holds since the start distribution for μ̂ is obtained by selecting w₁ at random according to the distribution λ(w₁) = ℓ(w₁)τ(w₁), then selecting the start position according to the uniform distribution on [1, ℓ(w₁)]. The proof for k > 1 is left as an exercise. □

Next the question of the ergodicity of the final process will be addressed. One way to make certain that the process defined by a complete sequence is ergodic is to make sure that the transformation defined by the sequence is ergodic relative to Lebesgue measure. A condition for this, which is sufficient for most purposes, is that later stage structures become almost independent of earlier structures. These ideas are developed in the following paragraphs. In the following discussion C denotes either a column or its support, and λ(B | C) denotes the conditional measure λ(B ∩ C)/λ(C) of the intersection of the set B with the support of the column C.

The column structures S and S' are ε-independent if

(3) Σ_{C∈S} Σ_{D∈S'} |λ(C ∩ D) − λ(C)λ(D)| ≤ ε.

In other words, two column structures S and S' are ε-independent if and only if the partition into columns defined by S and the partition into columns defined by S' are ε-independent. Note, by the way, that these independence concepts do not depend on how the columns are labeled, but only on the column distributions. Related concepts are discussed in the exercises.

The sequence {S(m)} of column structures is asymptotically independent if for each m and each ε > 0 there is a k ≥ 1 such that S(m) and S(m + k) are ε-independent.
Theorem I.10.6 (Complete sequences and ergodicity.) A complete, asymptotically independent sequence defines an ergodic process.

Proof. Let T be the Lebesgue measure-preserving transformation of the unit interval defined by the complete, asymptotically independent sequence {S(m)}. It will be shown that T is ergodic, which implies that the (T, P)-process is ergodic for any partition P. Let B be a measurable set of positive measure such that T⁻¹B = B. The goal is to show that λ(B) = 1.

Towards this end, note that since the top of S(m) shrinks to 0, the widths of its intervals must also shrink to 0, and hence the collection of all the levels of all the columns in all the S(m) generates the σ-algebra. Thus, given ε > 0, there is an m and a level L of some C ∈ S(m) such that λ(B ∩ L) > (1 − ε²)λ(L). This implies that the entire column C is filled to within (1 − ε²) by B, that is,

(4) λ(B ∩ C) ≥ (1 − ε²)λ(C),

since T^k(B ∩ L) = B ∩ T^k L and T^k L sweeps out the entire column as k ranges from −j + 1 to ℓ(C) − j, where j is the index of the level L.

Fix C and choose M so large that S(m) and S(M) are ελ(C)-independent, so that, in particular, since S(M) is built by cutting and stacking from S(m),

(5) Σ_{D∈S(M)} |λ(D | C) − λ(D)| ≤ ε.

Let F be the set of all D ∈ S(M) for which λ(B | C ∩ D) ≥ 1 − ε. Since λ(B | C) = Σ_D λ(B | C ∩ D)λ(D | C), the Markov inequality and the inequality (4) imply

Σ_{D∉F} λ(D | C) ≤ ε,

which together with the condition (5) implies

Σ_{D∉F} λ(D) ≤ 2ε.

The set C ∩ D is a union of levels of D, so if D ∈ F, then there must be at least one level of D which is at least (1 − ε) filled by B. The argument used to prove (4) then shows that the entire column D must be (1 − ε) filled by B, that is,

(6) λ(B ∩ D) ≥ (1 − ε)λ(D), D ∈ F.

Summing (6) over D ∈ F and using the bound Σ_{D∉F} λ(D) ≤ 2ε yields λ(B) ≥ 1 − 3ε. Since ε was arbitrary, this shows that λ(B) = 1, and hence T must be ergodic. This completes the proof of Theorem I.10.6. □

It is often easier to make successive stages approximately independent than it is to force asymptotic independence, hence the following stronger form of the complete sequences and ergodicity theorem is quite useful.
Theorem I.10.7 (Complete sequences and ergodicity: strong form.) If {S(m)} is a complete sequence such that, for each m, S(m) and S(m + 1) are ε_m-independent, where ε_m → 0 as m → ∞, then {S(m)} defines an ergodic process.

Proof. Assume T⁻¹B = B and λ(B) > 0. The only real modification that needs to be made in the preceding proof is to note that if {L_i} is a disjoint collection of column levels of total measure at least 2δ such that

(7) λ(B^c ∩ (∪_i L_i)) = Σ_i λ(B^c ∩ L_i) ≤ ε²δ,

where B^c is the complement of B, then by the Markov inequality there is a subcollection of total measure at least δ for which λ(B ∩ L_i) ≥ (1 − ε²)λ(L_i). Let S' be the set of all columns of S(m) for which some level has this property, so that the support of S' has measure at least δ, and the sweeping-out argument used to prove (4) shows that

(8) λ(B ∩ C) ≥ (1 − ε²)λ(C), C ∈ S'.

Now take δ = λ(B)/2, and choose m so large that ε_m is smaller than both ε² and (λ(B)/2)², and so large that there is a collection of levels of columns of S(m) for which (7) holds. Since S(m) and S(m + 1) are ε_m-independent and the support of S' has measure at least δ, there must be at least one C ∈ S' for which (8) holds and for which

Σ_{D∈S(m+1)} |λ(D | C) − λ(D)| ≤ ε.

The argument used in the preceding theorem then gives λ(B) ≥ 1 − 3ε, and hence λ(B) = 1. This proves Theorem I.10.7. □

I.10.d Independent cutting and stacking.

The freedom in building ergodic processes via cutting and stacking lies in the arbitrary nature of the cutting and stacking rules. The user is free to vary which columns are to be cut and in what order they are to be stacked, as well as how substructures are to become well-distributed in later substructures. A bewildering array of examples has been constructed in this way; a few of these constructions will be described in later chapters of this book. There are practical limitations, however, in the complexity of the description needed to go from one stage to the next. The next subsection will focus on a simple form of cutting and stacking, known as independent cutting and stacking, which, in spite of its simplicity, can produce a variety of counterexamples when applied to substructures, using only a few simple techniques for going from one stage to the next. Independent cutting and stacking is the geometric version of the product measure idea. In this discussion, as earlier, the same notation will be used for a column or its support, with the context making clear which meaning is intended.

A column structure S can be stacked independently on top of a labeled column C of the same width, provided that they have disjoint supports. This gives a new column structure, denoted by C * S and defined as follows, where D_i denotes the i-th column of S.

(i) Partition C into subcolumns {C_i} so that τ(C_i) = τ(D_i).
(ii) Stack D_i on top of C_i to obtain the new column C_i * D_i.

The new column structure C * S consists of all the columns C_i * D_i. It is called the independent stacking of S onto C. (See Figure I.10.8.)

Figure I.10.8 Stacking a column structure independently onto a column.

To say precisely what it means to independently cut and stack one column structure on top of another, the concept of a copy is useful. A column structure S is said to be a copy of size α of a column structure S' if there is a one-to-one correspondence between their columns such that corresponding columns have the same height and the same labeling, and the ratio of the width of a column of S to the width of its corresponding column in S' is α. In other words, a scaling of one structure is isomorphic to the other, where two column structures are said to be isomorphic if there is a one-to-one correspondence between columns such that corresponding columns have the same height, width, and name. Note, by the way, that a copy has the same width distribution as the original. A column structure S can be cut into copies {S_i : i ∈ I} according to a distribution π on I by partitioning each column of S according to the distribution π and letting S_i be the column structure that consists of the i-th subcolumn of each column of S, so that τ(S_i) = π(i)τ(S).

Let S and S' be disjoint column structures with the same width. The independent cutting and stacking of S' onto S is denoted by S * S' and is defined for S = {C_i} as follows.

(i) Cut S' into copies {S'_i} such that τ(S'_i)/τ(S') = τ(C_i)/τ(S).

(ii) For each i, stack S'_i independently onto C_i, obtaining C_i * S'_i.

The column structure S * S' is the union of the column structures C_i * S'_i. (See Figure I.10.9.)

Figure I.10.9 Stacking a structure independently onto a structure.

An alternative description of the columns of S * S' may be given as follows.

(i) Cut each C_i ∈ S into subcolumns {C_{ij}} such that τ(C_{ij}) = τ(C_i)τ(C'_j)/τ(S'), for each C'_j ∈ S'.

(ii) Cut each column C'_j ∈ S' into subcolumns {C'_{ji}} such that τ(C'_{ji}) = τ(C'_j)τ(C_i)/τ(S).

The column structure S * S' then consists of all the stackings C_{ij} * C'_{ji}. The key property of independent cutting and stacking is that width distributions multiply, that is,

τ(C_i * C'_j)/τ(S * S') = (τ(C_i)/τ(S)) · (τ(C'_j)/τ(S')),

since τ(S * S') = τ(S). This formula expresses the probabilistic meaning of independent cutting and stacking: cut and stack so that width distributions multiply. In particular, in the finite column case, the number of columns of S * S' is the product of the number of columns of S and the number of columns of S'.

A column structure can be cut into copies of itself and these copies stacked to form a new column structure. The M-fold independent cutting and stacking of a column structure S is defined by cutting S into M copies {S_m : m = 1, 2, ..., M} of itself of equal width and successively independently cutting and stacking them to obtain S₁ * S₂ * ⋯ * S_M, where the latter is defined inductively by

S₁ * ⋯ * S_M = (S₁ * ⋯ * S_{M−1}) * S_M.

Note that τ(S₁ * ⋯ * S_M) = τ(S)/M and, in the case when S has only finitely many columns, the number of columns of S₁ * ⋯ * S_M is the M-th power of the number of columns of S. The columns of S₁ * ⋯ * S_M have names that are M-fold concatenations of column names of the initial column structure S. The independent cutting and stacking construction contains more than just this concatenation information, however, for it carries with it the information about the distribution of the concatenations, namely, that they are independently concatenated according to the width distribution of S.

Successive applications of repeated independent cutting and stacking, starting with a column structure S, produce the sequence {S(m)}, where S(1) = S and S(m + 1) = S(m) * S(m), m ≥ 1; here S(m) * S(m) means that S(m) is first cut into two copies of equal width and one copy is then independently cut and stacked onto the other. Twofold independent cutting and stacking is indicated in Figure I.10.10.

Figure I.10.10 Twofold independent cutting and stacking.

The sequence {S(m)} is called the sequence built (or generated) from S by repeated independent cutting and stacking. Note that S(m) is isomorphic to the 2^{m−1}-fold independent cutting and stacking of S, so that its columns are just concatenations of 2^{m−1} subcolumns of the columns of S.
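The multiplication of width distributions can be checked on the simplest case, two columns of height 1 and width 1/2. In the sketch below a column structure is encoded as a dict from names to widths (our own shorthand, which merges identically named columns — harmless in this example); one round S * S produces four columns whose normalized widths are the products of the copies' width distributions:

```python
from fractions import Fraction as F

def halve(S):
    """Cut the structure S into two copies of itself, each of half the width."""
    return {name: w / 2 for name, w in S.items()}

def star(S1, S2):
    """Independent cutting and stacking S1 * S2 of two disjoint column
    structures of equal width. Width distributions multiply:
    tau(C * D) = tau(C) * tau(D) / tau(S2)."""
    assert sum(S1.values()) == sum(S2.values())
    t2 = sum(S2.values())
    return {c + d: wc * wd / t2 for c, wc in S1.items() for d, wd in S2.items()}

# Two height-1 columns of width 1/2; one round of S(2) = S(1) * S(1).
S = {"0": F(1, 2), "1": F(1, 2)}
S2 = star(halve(S), halve(S))
assert S2 == {"00": F(1, 8), "01": F(1, 8), "10": F(1, 8), "11": F(1, 8)}
assert sum(S2.values()) == sum(S.values()) / 2      # tau(S * S') = tau(S)/2
```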
Ergodicity is guaranteed by the following theorem.

Theorem I.10.11 (Repeated independent cutting and stacking.) If {S(m)} is built from S by repeated independent cutting and stacking, then, given ε > 0, there is an m such that S and S(m) are ε-independent. In particular, if λ(S) = 1, then the process built from S by repeated independent cutting and stacking is ergodic.

Proof. The proof will be given for the case when S has only a finite number of columns; the countable case is left to Exercise 2. The notation D = C₁C₂⋯C_k, where each C_i ∈ S, will mean that the column D was formed by taking, for each i, a subcolumn of C_i, and stacking these in order of increasing i. For k = 2^{m−1}, the columns of S(m) are exactly the stackings D = C₁C₂⋯C_k of subcolumns of the columns of S, selected independently according to the width distribution of S. For C ∈ S and D = C₁C₂⋯C_k ∈ S(m), define

p(C | C_1^k) = (1/k)|{i ∈ [1, k] : C_i = C}|,

so that kp(C | C_1^k) is the total number of occurrences of C in the sequence C_1^k. Division of λ(C ∩ D) = kp(C | C_1^k)ℓ(C)τ(D) by the product of λ(C) = ℓ(C)τ(C) and λ(D) = ℓ(D)τ(D) yields

λ(C ∩ D)/(λ(C)λ(D)) = kp(C | C_1^k)/(ℓ(D)τ(C)).

By the law of large numbers, with respect to the width distribution of S,

(9) p(C | C_1^k) → τ(C)/τ(S), as k → ∞,

and

(10) (1/k)ℓ(D) = (1/k) Σ_{i=1}^{k} ℓ(C_i) → E(ℓ(S)), as k → ∞,

both with probability 1, where E(ℓ(S)) is the expected height of the columns of S with respect to the width distribution. Thus, by (9) and (10),

λ(C ∩ D)/(λ(C)λ(D)) → 1/(τ(S)E(ℓ(S))) = 1,

with probability 1, since τ(S)E(ℓ(S)) = λ(S) = 1. This establishes that S and S(m) are indeed eventually ε-independent. Moreover, for each m₀, the sequence {S(m) : m > m₀} is built from S(m₀) by repeated independent cutting and stacking, and hence the proof shows that {S(m)} is asymptotically independent, so that, if λ(S) = 1, the process defined by {S(m)} is ergodic, by Theorem I.10.6. This completes the proof of Theorem I.10.11. □

In the case when λ(S) = 1, the (T, P)-process defined by the sequence generated from S by repeated independent cutting and stacking is called the process built from S by repeated independent cutting and stacking. This is, of course, just another way to define the regenerative processes, for they are precisely the processes built by repeated independent cutting and stacking from the columnar representations of their first-order block-structure distributions.

The following examples indicate how some standard processes can be built by cutting and stacking. More interesting examples of cutting and stacking constructions are given in Chapter 3, in particular, an example of a process with an arbitrary rate of convergence for frequencies. That example is recommended as a starting point for the reader unfamiliar with cutting and stacking constructions, for it shows how repeated independent cutting and stacking on separate substructures can be used to make a process "look like another process on part of the space," so as to approximate the desired property, and how to "mix one part slowly into another," so as to guarantee ergodicity for the final process.
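Theorem I.10.11 can also be watched numerically. Starting from a two-column structure of total mass 1 (a column 'a' of height 1 and a column 'bb' of height 2, each of width 1/3 — our own example, with unique parsing of names into factors), the sum in the ε-independence definition (3), computed between S and S(m), shrinks as m grows:

```python
from fractions import Fraction as F

def halve(S):
    return {name: w / 2 for name, w in S.items()}

def star(S1, S2):
    t2 = sum(S2.values())
    return {c + d: wc * wd / t2 for c, wc in S1.items() for d, wd in S2.items()}

# Two disjoint columns: 'a' of height 1 and 'bb' of height 2, total mass 1.
S = {"a": F(1, 3), "bb": F(1, 3)}
heights = {"a": 1, "bb": 2}

def eps_independence(S0, Sm):
    """Sum over C in S0, D in Sm of |lambda(C ∩ D) - lambda(C) lambda(D)|,
    as in definition (3). Occurrences of C among the factors of D are
    recovered by counting symbols ('a' is single, 'bb' a pair of b's)."""
    total = F(0)
    for c, wc in S0.items():
        lam_c = heights[c] * wc
        for d, wd in Sm.items():
            count = d.count("a") if c == "a" else d.count("b") // 2
            lam_cd = count * heights[c] * wd     # levels of D lying in C
            lam_d = len(d) * wd
            total += abs(lam_cd - lam_c * lam_d)
    return total

Sm, eps = dict(S), []
for _ in range(4):
    Sm = star(halve(Sm), halve(Sm))   # repeated independent cutting and stacking
    eps.append(eps_independence(S, Sm))
assert all(x > y for x, y in zip(eps, eps[1:]))   # epsilon shrinks with m
```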
118 CHAPTER I. BASIC CONCEPTS.

The following examples indicate how some standard processes can be built by cutting and stacking. The simplest of these, Example I.10.12, is recommended as a starting point for the reader unfamiliar with cutting and stacking constructions. More interesting examples of cutting and stacking constructions are given in Chapter 3; these show how repeated independent cutting and stacking on separate substructures can be used to make a process "look like another process on part of the space," so as to approximate some desired property, and how to "mix one part slowly into another," so as to guarantee ergodicity for the final process.

Example I.10.12 (I.i.d. processes.) Let S be the column structure that consists of two columns, each of height 1 and width 1/2, labeled '0' and '1'. The process built from S by repeated independent cutting and stacking is just the coin-tossing process, the binary i.i.d. process in which 0's and 1's are equally likely. By starting instead with a partition of the unit interval into disjoint intervals labeled by the finite set A, repeated independent cutting and stacking produces the A-valued i.i.d. process with the probability of a equal to the length of the interval assigned to a.

Example I.10.13 (Markov processes.) The easiest way to construct an ergodic Markov chain {X_n} via cutting and stacking is to think of it as the regenerative process defined by a recurrent state, say a. Let S be the set of all a_1^k for which a_i ≠ a, 1 <= i < k, and a_k = a, and let μ* be the product measure on S^∞ defined by

μ*(a_1^k) = Prob(X_1^k = a_1^k | X_0 = a),  a_1^k ∈ S.

The process built from the first-order columnar representation of S by repeated independent cutting and stacking is the same as the Markov process {X_n}. This is, of course, just another way to define the regenerative processes, for they are precisely the processes built by repeated independent cutting and stacking from the columnar representations of their first-order block-structure distributions.

Example I.10.14 (Hidden Markov chains.) A hidden Markov process {X_n} is a function X_n = f(Y_n) of a Markov chain {Y_n}. (These are also known as finite-state processes.) If {Y_n} is ergodic, hence can be represented as the process built by repeated independent cutting and stacking, then so can {X_n}: one can first represent the ergodic Markov chain {(X_n, Y_n)} as a process built by repeated independent cutting and stacking, then drop the first coordinates of each label.

Example I.10.15 (Finite initial structures.) If the initial structure S has only a finite number of columns, then the process {X_n} built from it by repeated independent cutting and stacking is a hidden Markov chain. To see why this is so, assume that the labeling set A does not contain the symbols 0 and 1, and relabel the column levels by changing the label N(L) of level L to (N(L), 1) if L is the top of its column, and to (N(L), 0) if L is not the top of its column. It is easy to see that the process {Y_n} built from this new column structure by repeated independent cutting and stacking is Markov of order no more than the maximum length of its columns. Since dropping the second coordinates of each new label produces the old labels, X_n is a function X_n = f(Y_n), and hence {X_n} is a hidden Markov chain. Note, in particular, that if the original structure has two columns with heights differing by 1, then {Y_n} is mixing Markov, so that {X_n} is a function of a mixing Markov chain.

Remark I.10.16. The cutting and stacking ideas originated with von Neumann and Kakutani; for a discussion of this and other early work see [12]. They have since been used to construct
numerous counterexamples, some of which are mentioned in [73]. The latter also includes several of the results presented here, most of which have long been part of the folklore of the subject. A sufficient condition for a cutting and stacking construction to produce a stationary coding of an i.i.d. process is given in [75].

SECTION I.10. CUTTING AND STACKING. 119

I.10.e Exercises

1. Show that if T is the upward map defined by a column C, then a subcolumn has the form (D, TD, T^2 D, ..., T^{ℓ(C)−1} D), where D is a subset of the base B.

2. Prove Theorem I.10.11 for the case when S has countably many columns. (Hint: consider S = S(m).)

3. Suppose T(m) ⊆ S(m), for each m, and suppose also that {S(m)} is complete.
(a) Show that {S(m)} is asymptotically independent if and only if {T(m)} is asymptotically independent.
(b) Suppose that for each m there exists R(m) ⊆ S(m), disjoint from T(m), and an integer M_m such that T(m + 1) is built by M_m-fold independent cutting and stacking of T(m) ∪ R(m), and λ(T(m)) → 1 as m → ∞. Show that if {M_m} increases fast enough then the process defined by {S(m)} is ergodic.

4. A column C ∈ S is (1 − ε)-well-distributed in S' if Σ_{D ∈ S'} |λ(D|C) − λ(D)| <= ε. A sequence {S(m)} is asymptotically well-distributed if for each m, each C ∈ S(m), and each ε > 0, there is an M = M(m, ε) such that C is (1 − ε)-well-distributed in S(n) for n >= M.
(a) Suppose S' is built by cutting and stacking from S. Show that for C ∈ S and D ∈ S', the conditional probability λ(D|C) is the fraction of C that was cut into slices to put into D.
(b) Show that the asymptotic independence property is equivalent to the asymptotic well-distribution property.
(c) Show that if Σ_{C ∈ S} |λ(C|D) − λ(C)| <= ε, except for a set of D ∈ S' of total measure at most ε, then S' and S are 2ε-independent.
(d) Show that if S and S' are ε-independent, then Σ_{C ∈ S} |λ(C|D) − λ(C)| <= √ε, except for a set of D ∈ S' of total measure at most √ε.

5. Let μ* be a block-structure measure on S^∞, where S ⊆ A*, and let {S(m)} be the standard cutting and stacking representation of the stationary A-valued process defined by (S, μ*). Show that {S(m)} is asymptotically independent if and only if μ* is totally ergodic. (Hint: Exercise 4c implies that μ* is ergodic.)

6. Show that formula (2) holds for k > 1. (Hint: the tower with base transformation S^n defines the same process. Then apply the k = 1 argument.)

120 CHAPTER I. BASIC CONCEPTS.

7. Suppose S has measure 1 and only finitely many columns.
(a) Show that if the top of each column is labeled '1', with all other levels labeled '0', then the process built from S by repeated independent cutting and stacking is Markov of some order.
(b) Show that if two columns of S have heights differing by 1, then the process built from S by repeated independent cutting and stacking is a mixing finite-state process.
(c) Verify that the process constructed in Example I.10.13 is indeed the same as the Markov process {X_n}.

8. Show that a process is regenerative if and only if it is the process built from a column structure S of measure 1 by repeated independent cutting and stacking.
Chapter II

Entropy-related properties.

121

Section II.1 Entropy and coding.

An n-code is a mapping C_n: A^n → {0, 1}*, where {0, 1}* is the set of finite-length binary sequences. A code sequence is a sequence {C_n: n = 1, 2, ...}, where each C_n is an n-code. If each C_n is one-to-one, the code sequence is called a faithful-code sequence, while if each C_n is a prefix code it is called a prefix-code sequence. The code length function L(·|C_n) of the code is the function that assigns to each x_1^n the length L(x_1^n|C_n) of the code word C_n(x_1^n). When C_n is understood, L(x_1^n) will be used instead of L(x_1^n|C_n).

As noted in Section I.7, entropy provides an asymptotic lower bound on expected per-symbol code length for prefix-code sequences and faithful-code sequences, see Theorem I.7.12 and Theorem I.7.15. The Shannon code construction provides a prefix-code sequence {C_n} for which L(x_1^n) = ⌈−log μ(x_1^n)⌉, so that, by the entropy theorem, L(x_1^n)/n → h, almost surely, where h = h(μ) denotes the entropy of μ. The sequence of Shannon codes thus compresses to entropy in the limit, but its construction depends on knowledge of the process, knowledge which may not be available in practice. In general, if {C_n} is a Shannon-code sequence for μ, and ν is some other ergodic process, then L(x_1^n)/n may fail to converge on a set of positive ν-measure, see Exercise 1.

Two issues left open by these results will be addressed in this section. The first issue is the universality problem. A code sequence {C_n} is said to be universally asymptotically optimal or, more simply, universal if

lim sup_n L(x_1^n)/n <= h(μ), almost surely,

for every ergodic process μ. It will be shown that there are universal codes, code sequences which compress to the entropy in the limit, almost surely, for almost every sample path from any ergodic process. The second issue is the almost-sure question. The entropy lower bound is an expected-value result, a lower bound which is "almost" tight; in particular, it does not preclude the possibility that there might be code sequences that beat entropy infinitely often on a set of positive measure. It will be shown that this cannot happen, at least asymptotically in source word length n.
122 CHAPTER II. ENTROPY-RELATED PROPERTIES.

Theorem II.1.1 (Universal codes exist.) There is a prefix-code sequence {C_n} such that lim sup_n L(x_1^n)/n <= h(μ), almost surely, for any ergodic process μ.

Theorem II.1.2 (Too-good codes do not exist.) If {C_n} is a faithful-code sequence and μ is an ergodic measure with entropy h, then lim inf_n L(x_1^n)/n >= h, almost surely.

In other words, it is possible to (universally) compress to entropy in the limit, but for no ergodic process is it possible to beat entropy infinitely often on a set of positive measure. The proof of the existence theorem to be given actually shows the existence of codes that achieve the process entropy, that is, a prefix-code sequence {C_n} will be constructed such that for any ergodic μ, lim sup_n L(x_1^n)/n <= H, almost surely, where H = H(μ) denotes the process entropy of μ. The nonexistence theorem, together with a surprisingly simple lower bound on prefix-code word length, will follow from the entropy theorem. Since process entropy H is equal to the decay-rate h given by the entropy theorem, this will show that good codes exist. Of interest, also, is the fact that both the existence and nonexistence theorems can be established using only the existence of process entropy, whose existence depends only on subadditivity of entropy as expected value, while the existence of decay-rate entropy is given by the much deeper entropy theorem. These two results also provide an alternative proof of the entropy theorem, thus further sharpening the connection between entropy and coding; in essence, they show clearly that the basic existence and nonexistence theorems for codes are together equivalent to the entropy theorem.

A direct coding argument, which is based on a packing idea and is similar in spirit to some later proofs, will be used to establish the universal code existence theorem, with code performance established by using the type counting results discussed in Section I.6. A second proof of the existence theorem, based on entropy estimation ideas of Ornstein and Weiss, [49], utilizes the empirical distribution of k-blocks, for suitable choice of k as a function of n; it will be discussed in Section II.3. A third proof of the existence theorem, based on the Lempel-Ziv coding algorithm, will be given in Section II.2. A counting argument, together with some results about entropy for Markov chains, will then be used to show that it is not possible to beat process entropy in the limit on a set of positive measure. While no simpler than the earlier proof, the direct coding construction extends to the case of semifaithful codes, codes for which a controlled amount of distortion is allowed, as shown in [49].

II.1.a Universal codes exist.

The code C_n is a two-part code: the codeword C_n(x_1^n) is a concatenation of two binary blocks. The first block gives the index of the k-type of the sequence x_1^n, relative to some enumeration of possible k-types; it has fixed length, which depends only on the number of type classes. The second block gives the index of the particular sequence x_1^n in its k-type class, relative to some enumeration of the k-type class; its length depends on the size of the type class. Distinct words x_1^n and y_1^n either have different k-types, in which case the first blocks of C_n(x_1^n) and C_n(y_1^n)
SECTION II.1. ENTROPY AND CODING. 123

will be different, or they have the same k-types but different indices in their common k-type class, in which case the second blocks of C_n(x_1^n) and C_n(y_1^n) will differ. Since the first block has fixed length, the code C_n is a prefix code. (Variable length is needed for the second part, because the size of the type class depends on the type.) If k does not grow too rapidly with n, then the log of the number of possible k-types is negligible relative to n, so asymptotic code performance depends only on the asymptotic behavior of the cardinality of the k-type class of x_1^n.

A slight modification of the definition of k-type, called the circular k-type, will be used, as it simplifies the final entropy argument, yet has negligible effect on code performance. Given a sequence x_1^n and an integer k <= n, let x̃_1^{n+k−1} = x_1^n x_1^{k−1}, the concatenation of x_1^n with x_1^{k−1}; that is, x_1^n is extended periodically for k − 1 more terms, x̃_{n+i} = x_i, 1 <= i <= k − 1. The circular k-type is the measure p̃_k = p̃_k(·|x_1^n) on A^k defined by the relative frequency of occurrence of each k-block in the sequence x̃_1^{n+k−1}, that is,

p̃_k(a_1^k | x_1^n) = |{i ∈ [1, n]: x̃_i^{i+k−1} = a_1^k}| / n,  a_1^k ∈ A^k.

The circular k-type is just the usual k-type of the sequence x̃_1^{n+k−1}. The bounds, Theorem I.6.14 and Theorem I.6.15, on the number of k-types and on the size of a k-type class yield the following bounds on the number Ñ(k, n) of circular k-types that can be produced by sequences of length n, and on the cardinality |T̃_k(x_1^n)| of the set of all n-sequences that have the same circular k-type as x_1^n:

(1)  Ñ(k, n) <= (n + 1)^{|A|^k},

(2)  |T̃_k(x_1^n)| <= (n − 1) 2^{n H̃_{k−1, x_1^n}},

where H̃_{k−1, x_1^n} denotes the entropy of the (k − 1)-order Markov chain defined by p̃_k(·|x_1^n). The gain in using the circular k-type is compatibility in k, that is,

Σ_{a_k} p̃_k(a_1^k | x_1^n) = p̃_{k−1}(a_1^{k−1} | x_1^n),

which, in turn, implies the entropy inequality

(3)  H̃_{k−1, x_1^n} <= H̃_{K−1, x_1^n},  K <= k <= n.

Now suppose k = k(n) is the greatest integer in (1/2) log_{|A|} n, and let C_n be the two-part code defined by first transmitting the index of the circular k-type of x_1^n, relative to some fixed enumeration of the set of possible circular k-types, then transmitting the index of x_1^n in its circular k-type class, relative to some particular enumeration of T̃_k(x_1^n). Thus C_n(x_1^n) = b_1^{m_1} b_{m_1+1}^{m}, where b_1^{m_1} is a fixed-length binary sequence specifying the index of the circular k-type of x_1^n, and b_{m_1+1}^{m} is a variable-length binary sequence specifying the index of the particular sequence x_1^n, relative to some particular enumeration of T̃_k(x_1^n). Total code length satisfies

L(x_1^n) <= log Ñ(k, n) + log |T̃_k(x_1^n)| + 2,

where the two extra bits are needed in case the logarithms are not integers.
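The circular k-type and its compatibility in k are easy to compute directly. The sketch below is illustrative only (the function and variable names are assumptions): it forms the periodic extension, counts overlapping k-blocks, and lets one check that summing the circular k-type over the last symbol recovers the circular (k−1)-type.

```python
from collections import Counter

def circular_k_type(x, k):
    """Empirical distribution of overlapping k-blocks in x extended
    periodically by its own first k-1 symbols (the circular k-type).
    Each of the n start positions contributes exactly one k-block."""
    n = len(x)
    ext = x + x[:k - 1]  # periodic extension by k-1 symbols
    counts = Counter(tuple(ext[i:i + k]) for i in range(n))
    return {block: c / n for block, c in counts.items()}

x = [0, 1, 1, 0, 1]
p2 = circular_k_type(x, 2)
p1 = circular_k_type(x, 1)
```

Because every start position contributes a block, p2 is a probability distribution, and its marginal over the last coordinate agrees with p1, which is the compatibility property used to derive the entropy inequality.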
124 CHAPTER II. ENTROPY-RELATED PROPERTIES.

The bound (1) on the number of circular k-types, together with the assumption that k <= (1/2) log_{|A|} n, implies that it takes at most 1 + √n log(n + 1) bits to specify a given circular k(n)-type, relative to some fixed enumeration of circular k(n)-types, which is, of course, negligible relative to n. The bound (2) on the size of a type class implies that at most 1 + n H̃_{k(n)−1, x_1^n} + log(n − 1) bits are needed to specify a given member of T̃_{k(n)}(x_1^n), relative to a given enumeration of the type class. Thus, to show the existence of good codes it is enough to show that for any ergodic measure μ,

(4)  lim sup_n H̃_{k(n)−1, x_1^n} <= H, almost surely,

where H is the process entropy of μ.

Fix an ergodic measure μ with process entropy H. Given ε > 0, choose K such that H_{K−1} <= H + ε, where, in this discussion, H_{K−1} = H(X_K | X_1^{K−1}) denotes the process entropy of the Markov chain of order K − 1 defined by the conditional probabilities μ(a_K | a_1^{K−1}). The ergodic theorem implies that p̃_K(a_1^K | x_1^n) → μ(a_1^K), as n → ∞, almost surely, so that for almost every x there is an integer N = N(x, ε) such that H̃_{K−1, x_1^n} <= H + 2ε, for n >= N. Since K is fixed, k(n) will exceed K, for all sufficiently large n. Once this happens, the entropy inequality (3) combines with the preceding inequality to yield H̃_{k(n)−1, x_1^n} <= H + 2ε, almost surely, for all sufficiently large n. Since ε is arbitrary, this shows that lim sup_n H̃_{k(n)−1, x_1^n} <= H, almost surely, establishing the desired bound (4). In particular,

lim sup_n L(x_1^n)/n <= lim sup_n H̃_{k(n)−1, x_1^n} <= H, almost surely.

The proof that good codes exist is now finished, for, as noted earlier, process entropy H and decay-rate h are the same for ergodic processes.

II.1.b Too-good codes do not exist.

Two quite different proofs of Theorem II.1.2 will be given. The first proof uses a combination of the entropy theorem and a simple lower bound on the pointwise behavior of prefix codes, due to Barron, [3]. The second proof, which was developed in [49], is based on an explicit code construction which is closely connected to the packing ideas used in the proof of the entropy theorem, and does not make use of the entropy theorem. A faithful-code sequence can be converted to a prefix-code sequence with no change in asymptotic performance by using the Elias header technique, described in Section I.7.d of Chapter I. Thus it is enough to prove Theorem II.1.2 for the case when each C_n is a prefix code. The following lemma, due to Barron, is valid for any process, stationary or not.
SECTION II.1. ENTROPY AND CODING. 125

Lemma II.1.3 (The almost-sure code-length bound.) Let {C_n} be a prefix-code sequence and let μ be a Borel probability measure on A^∞. If {a_n} is a sequence of positive numbers such that Σ_n 2^{−a_n} < ∞, then

(5)  L(x_1^n) >= −log μ(x_1^n) − a_n, eventually almost surely.

Proof. For each n define B_n = {x_1^n: L(x_1^n) < −log μ(x_1^n) − a_n}. Since L(x_1^n) = −log 2^{−L(x_1^n)}, this can be rewritten as B_n = {x_1^n: μ(x_1^n) < 2^{−a_n} 2^{−L(x_1^n)}}, so that

μ(B_n) = Σ_{x_1^n ∈ B_n} μ(x_1^n) <= 2^{−a_n} Σ_{x_1^n ∈ B_n} 2^{−L(x_1^n)} <= 2^{−a_n},

since Σ_{x_1^n} 2^{−L(x_1^n)} <= 1, by the Kraft inequality for prefix codes. The lemma now follows from the Borel-Cantelli lemma.

Barron's inequality (5), with a_n = 2 log n, yields

lim inf_n L(x_1^n)/n >= lim inf_n −(1/n) log μ(x_1^n).

But (1/n) log(1/μ(x_1^n)) converges almost surely to h, the entropy of μ, by the entropy theorem, for any ergodic measure μ. This completes the proof of Theorem II.1.2.

II.1.c Nonexistence: second proof.

A second proof of the nonexistence of too-good codes will now be given, a proof which uses the existence of good codes but makes no use of the entropy theorem. The basic idea is that if a faithful-code sequence indeed compresses more than entropy infinitely often on a set of positive measure, then, with high probability, long enough sample paths can be partitioned into disjoint words, such that those words that can be coded too well cover at least a fixed fraction of sample path length, while the complement can be compressed almost to entropy, thereby producing a code whose expected length per symbol is less than entropy by a fixed amount. (As noted earlier, a faithful-code sequence with this property can be replaced by a prefix-code sequence with the same property, since adding headers to specify length has no effect on the asymptotics.)

Let μ be a fixed ergodic process with process entropy H, and assume {C_n} is a prefix-code sequence such that lim inf_n L(x_1^n|C_n)/n < H on a set of positive measure. Then there are positive numbers ε and γ such that if B_j = {x_1^j: L(x_1^j|C_j) <= j(H − ε)}, then μ(∩_n ∪_{j >= n} B_j) >= γ.
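The two ingredients of Barron's lemma, the Kraft inequality and the bound on the measure of the bad sets, can be illustrated concretely. The sketch below is hedged: the toy distribution and all names are assumptions for illustration, with Shannon code lengths ⌈−log2 p⌉ standing in for a generic prefix code.

```python
import math

# Toy dyadic distribution; Shannon code lengths are ceil(-log2 p).
mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {w: math.ceil(-math.log2(p)) for w, p in mu.items()}

# Kraft inequality for any prefix code: sum of 2^-L(w) is at most 1.
kraft = sum(2.0 ** -l for l in lengths.values())

def bad_set_measure(a):
    """mu-measure of the bad set B = {w : L(w) < -log2 mu(w) - a},
    which the lemma's argument bounds by 2^-a."""
    return sum(p for w, p in mu.items()
               if lengths[w] < -math.log2(p) - a)
```

Summing the 2^{−a} bounds over n is what makes the Borel-Cantelli argument work in the lemma.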
126 CHAPTER II. ENTROPY-RELATED PROPERTIES.

Let δ < 1/2 be a positive number to be specified later. The existence theorem, Theorem II.1.1, provides a k > 1/δ and a prefix code C̃_k, with length function L̃(·) = L(·|C̃_k), such that the set

D_k = {x_1^k: L̃(x_1^k) <= k(H + δ)}

has measure at least 1 − δ. To make precise the desired sample path partition concept, let M > 2k/δ be a positive integer. For n >= M, a concatenation x_1^n = w(1)w(2) ··· w(t) is said to be a (γ, δ)-good representation of x_1^n if the following three conditions hold.

(a) Each w(i) belongs to A, or belongs to D_k, or belongs to B_j for some j >= M.

(b) The total length of the w(i) that belong to A is at most 3δn.

(c) The total length of the w(i) that belong to B_M ∪ B_{M+1} ∪ ··· is at least γn.

A version of the packing lemma will be used to show that long enough sample paths can, with high probability, be partitioned into disjoint words such that the words that belong to B_M ∪ B_{M+1} ∪ ··· cover at least a γ-fraction of total length, while the words that are neither in D_k nor in B_M ∪ B_{M+1} ∪ ··· cover at most a 3δ-fraction of total length. Let G(n) be the set of all sequences x_1^n that have a (γ, δ)-good representation. It will be shown later that

(6)  x_1^n ∈ G(n), eventually almost surely.

Sample paths with this property can be encoded by sequentially encoding the words, using the too-good code C_j on those words that belong to B_j for some j >= M, using the good code C̃_k on those words that belong to D_k, and using a single-letter code on the remaining symbols, plus headers to tell which code is being used. If δ is small enough and such sample paths are long enough, this produces, for a suitable choice of δ and M, a code that will eventually beat entropy, contradicting Shannon's lower bound. The coding result is stated as the following lemma.

Lemma II.1.4. For n >= M, there is a prefix n-code C_n*, with length function L*(x_1^n), such that

(7)  L*(x_1^n) <= n(H − ε'),  x_1^n ∈ G(n),

for some ε' > 0, independent of n, for all sufficiently large n. Furthermore, L*(x_1^n)/n is uniformly bounded above by a constant on the complement of G(n).

The lemma is enough to contradict Shannon's lower bound on expected code length, given that (6) holds, for the uniform boundedness of L*(x_1^n)/n on the complement of G(n) combines with the properties (6) and (7) to yield the expected value bound E(L*(X_1^n)) < n(H − ε'/2),
SECTION II.1. ENTROPY AND CODING. 127

for all sufficiently large n, since (1/n)H(μ_n) → H, thereby contradicting Shannon's lower bound on expected code length.

Proof of Lemma II.1.4. A precise definition of C_n* is given as follows. For x_1^n ∈ G(n), a (γ, δ)-good representation x_1^n = w(1)w(2) ··· w(t) is determined, the w(i) are coded in order of increasing i, and the code is defined by

C_n*(x_1^n) = 1 v(1)v(2) ··· v(t),

where

(8)  v(i) = 0 F(w(i)), if w(i) has length 1;
     v(i) = 10 C̃_k(w(i)), if w(i) ∈ D_k;
     v(i) = 11 e(j) C_j(w(i)), if w(i) ∈ B_j for some j >= M.

Here F: A → B^d, where d >= log |A|, is some fixed code used on the words of length 1, and e(·) is an Elias code, a prefix code on the natural numbers for which the length of e(j) is log j + o(log j). For x_1^n ∉ G(n), the per-symbol code C_n*(x_1^n) = 0 F(x_1)F(x_2) ··· F(x_n) is used. The code C_n* is clearly a prefix code, since the initial header specifies whether x_1^n ∈ G(n) or not, while the other headers specify which one of the prefix-code collection {F, C̃_k, C_M, C_{M+1}, ...} is being applied to a given word. Note also that L*(x_1^n)/n is uniformly bounded on the complement of G(n).

The goal is to show that on sequences in G(n) the code C_n* compresses more than entropy by a fixed amount, for suitable choice of δ and M > 2k/δ. First, code length is bounded, ignoring the headers needed to specify which code is being used, as well as the lengths of the words that belong to ∪_{j >= M} B_j. By the definition of the B_j, the code word C_j(w(i)) is no longer than j(H − ε) bits, so that if L is the total length of those w(i) that belong to B_M ∪ B_{M+1} ∪ ···, then it requires at most L(H − ε) bits to encode all such w(i). Likewise, by the definition of D_k, a code word C̃_k(w(i)) has length at most k(H + δ), so that, since the total length of such w(i) is at most n − L, it takes at most L(H − ε) + (n − L)(H + δ) bits to encode all those w(i) that belong to either D_k or to ∪_{j >= M} B_j. But L >= γn, by property (c) in the definition of G(n), so the contribution to total code length from encoding those w(i) that belong to either D_k or to ∪_{j >= M} B_j is upper bounded by

(9)  n(H + δ − γ(ε + δ)) <= n(H + δ − γε).
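The Elias header e(·) used in (8) can be realized by the classical Elias gamma and delta constructions. The sketch below is illustrative (function names are assumptions): the gamma code uses about 2 log2 j bits, while iterating the idea once, the delta code, achieves length log2 j + O(log log j), matching the requirement that the length of e(j) be log j + o(log j).

```python
def elias_gamma(j):
    """Elias gamma code for j >= 1: as many leading zeros as there are
    bits after the leading 1 of j, then the binary expansion of j."""
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(j):
    """Elias delta code for j >= 1: gamma-code the bit-length of j,
    then append the bits of j after its leading 1."""
    b = bin(j)[2:]
    return elias_gamma(len(b)) + b[1:]
```

Both are prefix codes on the natural numbers, so their lengths satisfy the Kraft inequality, which is what makes them usable as self-delimiting length headers.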
128 CHAPTER II. ENTROPY-RELATED PROPERTIES.

Each word w(i) of length 1 requires d bits to encode, and there are at most 3δn such words, by property (b) of the definition of G(n), so that, still ignoring header lengths, the total length needed to encode all the w(i) is upper bounded by

(10)  n(H + δ − γε + 3dδ).

So essentially all that remains is to show that the headers require few bits, relative to n. The 1-bit headers used to indicate that w(i) has length 1 require a total of at most 3δn bits, by property (b) in the definition of G(n). There are at most n/k words w(i) that belong to D_k or to ∪_{j >= M} B_j, since k <= M and hence each such word must have length at least k; since k > 1/δ, the number of such words is at most δn, so these 2-bit headers contribute at most 2δn to total code length.

The words w(i) that belong to ∪_{j >= M} B_j need an additional header to specify their length, and hence which C_j is to be used. Since an Elias code is used to do this, there is a constant K such that at most

Σ_{w(i) ∈ ∪_{j >= M} B_j} [log ℓ(w(i)) + K]

bits are required to specify all these lengths. But (1/x) log x is a decreasing function for x >= e, and each such w(i) has length at least M > 2k/δ, so that, if it is assumed that log M <= δM, then log ℓ(w(i)) <= δ ℓ(w(i)) for each such word, and hence at most Kδn bits are needed to specify the lengths of the w(i) that belong to ∪_{j >= M} B_j, provided only that M > 2k/δ and log M <= δM.

In summary, the total contribution of all the headers, as well as the length specifiers, is upper bounded by 3δn + 2δn + Kδn. Since d and K are constants, adding this bound to the bound (10), obtained by ignoring the headers, gives the bound

L*(x_1^n) <= n[H + δ(6 + 3d + K) − γε]

on total code length for members of G(n). If δ is chosen small enough, then δ(6 + 3d + K) − γε will be negative, so that L*(x_1^n) <= n(H − ε') for some ε' > 0, which establishes the lemma, provided δ is small enough and M is large enough.

Of course, the lemma depends on the truth of the assertion (6) that x_1^n ∈ G(n), eventually almost surely, for any choice of 0 < δ < 1/2, k > 1/δ, and M > 2k/δ with log M <= δM. To establish this, choose L > M so large that if B = ∪_{j=M}^{L} B_j, then μ(B) >= γ. The partial packing lemma, Lemma I.3.3, then
SECTION II.1. ENTROPY AND CODING. 129

provides, eventually almost surely, a γ-packing of x_1^n by blocks drawn from B. On the other hand, by the ergodic theorem, eventually almost surely there are at least (1 − δ)n indices i ∈ [1, n − k + 1] that are starting places of blocks from D_k, that is, x_1^n is (1 − δ)-strongly-covered by D_k. The building-block lemma then implies that x_1^n is eventually almost surely (1 − 2δ)-packed by k-blocks drawn from D_k. Thus, eventually almost surely, x_1^n has both a γ-packing by blocks drawn from B and a (1 − 2δ)-packing by blocks drawn from D_k.

To say that x_1^n has both a γ-packing by blocks drawn from B and a (1 − 2δ)-packing by blocks drawn from D_k is to say that there is a disjoint collection {[s_j, t_j]: j ∈ J} of M-length or longer subintervals of [1, n], of total length at least γn, for which each x_{s_j}^{t_j} ∈ B, and a disjoint collection {[u_i, v_i]: i ∈ I} of k-length subintervals of [1, n], of total length at least (1 − 2δ)n, for which each x_{u_i}^{v_i} ∈ D_k. Since each [s_j, t_j] has length at least M, the cardinality of J is at most n/M, so the total length of those [u_i, v_i] that meet the boundary of at least one [s_j, t_j] is at most 2nk/M, which is upper bounded by δn, since it was assumed that M > 2k/δ. Thus, if I* is the set of indices i ∈ I such that [u_i, v_i] meets none of the [s_j, t_j], then the collection {[s_j, t_j]: j ∈ J} ∪ {[u_i, v_i]: i ∈ I*} is disjoint and covers at least a (1 − 3δ)-fraction of [1, n]. (This argument is essentially just the proof of the two-packings lemma suggested in Exercise 5 of Section I.3.) The words

{x_{s_j}^{t_j}: j ∈ J} ∪ {x_{u_i}^{v_i}: i ∈ I*},

together with the length-1 words {x_s} defined by those s that belong to no [s_j, t_j] and to no [u_i, v_i], provide a representation x_1^n = w(1)w(2) ··· w(t) for which the properties (a), (b), and (c) hold. This completes the proof that x_1^n ∈ G(n), eventually almost surely, and with it the second proof that too-good codes do not exist.

II.1.d Another proof of the entropy theorem.

The entropy theorem can be deduced from the existence and nonexistence theorems connecting process entropy and faithful-code sequences, as follows. Fix an ergodic measure μ with process entropy H, and let {C_n} be a prefix-code sequence such that lim sup_n L(x_1^n)/n <= H, almost surely. Fix ε > 0 and define

D_n = {x_1^n: L(x_1^n) <= n(H + ε)},

so that x_1^n ∈ D_n, eventually almost surely, and |D_n| <= 2^{n(H+ε)}, since C_n is invertible. Let B_n = {x_1^n: μ(x_1^n) <= 2^{−n(H+2ε)}};
s. then there is a renewal process v such that lim supn D (gn II vn )/n = oc and lim infn D(pn II vn )/n = 0. a. such that the first symbol in 0(4) is always a 0. be the projection of the Kolmogorov measure of unbiased coin tossing onto A n and let Cn be a Shannon code for /in . and note that I Uni < 2n(I I —E) .. eventually almost surely.. n D n) < IDn 12—n(H+2E) < 2—n€ Therefore 4 fl B n n Dn . whose range is prefix free. then there is a constant C such that lim sup. (C) Show that if p... and so xri` g Bn . Such a code Cn is called a bounded extension of Cn * to An.(4) = 1C:(4). is ergodic then C(4)In cannot converge in probability to the entropy of v.n(H — c).i.1 . ENTROPYRELATED PROPERTIES. xii E Un and en (4) = C(4). where it starts. (Hint: code the second occurrence of the longest string by telling how long it is. is the all 0 process. Let L (xi') denote the length of the longest string that appears twice in 4. Show that if bt is i. (b) Let D (A n il v pi ) = E7x /1.) 3. xi" g Un .. L(4)/ log n < C.1(4) log(pn (4)/vn (4)) be the divergence of An from yn . Add suitable headers to each part to make the entire code into a prefix code and apply Barron's lemma. Let C. The resulting code On is invertible so that Theorem 11. C3 II. (a) For each n let p. eventually almost surely.1. . This proves that 1 1 lim inf — log — > H. n n and completes this proof of the entropy theorem. where K is a constant independent of n. n n 11.. and where it occurred earlier. This exercise explores what happens when the Shannon code for one process is used on the sample paths of another process.. Define On (4) = 4)(4).d. Le Exercises 1. eventually almost surely. a. and such that C. Thus there is a onetoone function 0 from Un into binary sequences of length no more than 2 I.2 guarantees that Je ll g Un . Show that if y 0 p.(4 For the lower bound define U„ = Viz: p. 2.s. establishing the upper bound 1 1 lim sup — log ) _< H . for X'il E S.(4) > 2—n(H. 
be the code obtained by adding the prefix 1 to the code Cn . Show that there is prefix ncode Cn whose length function satisfies £(4) < Kn on the complement of S.. and let £(4) be the length function of a Shannon code with respect to vn . Lemma 5. is 11(pn )± D (An II vn). Code the remainder by using the Shannon code. Suppose Cn * is a onetoone function defined on a subset S of An . Show that the expected value of £(4) with respect to p.130 then /t (Bn CHAPTER II. almost surely.
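The quantity in Exercise 2, the length of the longest string appearing twice, can be computed by brute force. The sketch below is illustrative only (names are assumptions): it exploits the fact that "some block of length L appears at least twice" is a monotone property of L, so binary search over L applies; overlapping occurrences are allowed.

```python
def longest_repeat(x):
    """Length of the longest block occurring at least twice in x
    (overlaps allowed), via binary search over the block length."""
    n = len(x)

    def repeats(L):
        # True if some block of length L occurs twice in x.
        seen = set()
        for i in range(n - L + 1):
            block = x[i:i + L]
            if block in seen:
                return True
            seen.add(block)
        return False

    lo, hi, best = 1, n - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if repeats(mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best
```

For i.i.d. strings this quantity grows logarithmically in n, which is what the exercise's coding argument is designed to prove.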
Section II.2 The Lempel-Ziv algorithm.

An important coding algorithm was invented by Lempel and Ziv in 1975, [90]. In finite versions it is the basis for many popular data compression packages, and it has been extensively analyzed in various finite and limiting forms. Ziv's proof, [91], that the Lempel-Ziv (LZ) algorithm compresses to entropy in the limit will be given in this section. A second proof, which is due to Ornstein and Weiss, [53], is given in Section II.4, and uses several ideas of independent interest.

The LZ algorithm is based on a parsing procedure, called here (simple) LZ parsing, that is, a way to express an infinite sequence x as a concatenation x = w(1)w(2)... of variable-length blocks, called words. In its simplest form, x is parsed inductively according to the following rules.

(a) The first word w(1) consists of the single letter x_1.

(b) Suppose w(1)w(2)...w(j) = x_1^m. (i) If x_{m+1} ∉ {w(1), ..., w(j)}, then w(j+1) consists of the single letter x_{m+1}. (ii) Otherwise, w(j+1) = x_{m+1}^n, where n is the least integer larger than m such that x_{m+1}^n ∉ {w(1), ..., w(j)}.

Thus the word-formation rule can be summarized by saying: the next word is the shortest new word. For example, 11001010001000100... parses into 1, 10, 0, 101, 00, 01, 000, 100, ..., where, for ease of reading, the words are separated by commas. Note that the parsing is sequential, that is, later words have no effect on earlier words, and therefore an initial segment of length n can be expressed as

(1) x_1^n = w(1)w(2)...w(C)y,

where the final block y is either empty or is equal to some w(j), for j ≤ C. The parsing defines a prefix n-code C_n, called the (simple) Lempel-Ziv (LZ) code, by noting that each new word is really only new because of its final symbol: hence a word is specified by giving a pointer to where the part before its final symbol occurred earlier, together with a description of its final symbol.
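The parsing rule ("the next word is the shortest new word") is easy to put into code; the following is an illustrative sketch, not taken from the text:

```python
def lz_parse(x):
    """Simple LZ parse: repeatedly take the shortest word not seen before.

    Returns (words, tail): the distinct words w(1)..w(C) and the final
    block y, which is either empty or equal to some earlier word.
    """
    words, seen = [], set()
    i = 0
    while i < len(x):
        j = i + 1
        # extend the candidate word until it is new or the input ends
        while j <= len(x) and x[i:j] in seen:
            j += 1
        word = x[i:j]
        if word in seen:            # input ended inside an old word
            return words, word      # final block y
        seen.add(word)
        words.append(word)
        i = j
    return words, ""

# The example from the text: 11001010001000100 parses into
# 1, 10, 0, 101, 00, 01, 000, 100.
words, tail = lz_parse("11001010001000100")
```

Note that the number of words C(x_1^n) is just `len(words)`, the quantity counted in the convergence theorem below.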
To describe the LZ code C_n precisely, let ⌈·⌉ denote the least-integer function and let f: {0, 1, ..., n} → B^⌈log n⌉ and g: A → B^⌈log |A|⌉ be fixed one-to-one functions. If x_1^n is parsed as in (1), the LZ code maps it into the concatenation C_n(x_1^n) = b(1)b(2)...b(C)b(C+1) of the binary words b(1), b(2), ..., b(C+1), defined according to the following rules.

(a) If j ≤ C and w(j) has length 1, then b(j) = 0g(w(j)).

(b) If j ≤ C and i < j is the least integer for which w(j) = w(i)a, a ∈ A, then b(j) = 1f(i)0g(a).

(c) If y is empty, then b(C+1) is empty; otherwise b(C+1) = 1f(i), where i is the least integer such that y = w(i).

Part (a) requires at most |A|(⌈log |A|⌉ + 1) bits, part (b) requires at most C(⌈log n⌉ + ⌈log |A|⌉ + 2) bits, and part (c) requires at most ⌈log n⌉ + 1 bits, so total code length ℓ(x_1^n) is upper bounded by (C + 1) log n + αC + β, where α and β are constants. The dominant term is C log n, so to establish universality it is enough to prove the following theorem of Ziv, [91].

Theorem II.2.1 (The LZ convergence theorem.) If μ is an ergodic process with entropy h, then (1/n)C(x_1^n) log n → h, almost surely, in which C(x_1^n) now denotes the number of new words in the simple parsing (1).

Of course, since no code can beat entropy in the limit, it is enough to prove that entropy is an upper bound, almost surely. Ziv's proof that entropy is the correct almost-sure upper bound is based on an interesting extension of the entropy idea to individual sequences, together with a proof that, for ergodic processes, individual sequence entropy almost surely upper bounds the entropy given by the entropy theorem, and a proof that this individual sequence entropy is an asymptotic upper bound on (1/n)C(x_1^n) log n.

Ziv's concept of entropy for an individual sequence begins with a simpler idea, called topological entropy, which is the growth rate of the number of observed strings of length k. The k-block universe of x is the set U_k(x) of all a_1^k that appear as a block of consecutive symbols in x, that is,

U_k(x) = {a_1^k : x_i^{i+k-1} = a_1^k, for some i ≥ 1}.

The topological entropy of x is defined by

h(x) = lim_k (1/k) log |U_k(x)|,

a limit which exists, by subadditivity, since |U_{m+k}(x)| ≤ |U_m(x)| · |U_k(x)|. (Topological entropy, as defined here, is the same as the usual topological entropy of the orbit closure of x; see [84, 34] for discussions of this more general concept.)

Topological entropy takes into account only the number of k-strings that occur in x, but gives no information about frequencies of occurrence. For example, if x is a typical sequence for the binary i.i.d. process with p equal to the probability of a 1, then every finite string occurs with positive probability, hence its topological entropy h(x) is log 2 = 1, for every such p. The entropy given by the entropy theorem is h(μ) = −p log p − (1 − p) log(1 − p), which depends strongly on the value of p. Ziv's concept of entropy for individual sequences is more subtle, for it takes into account the frequency with which strings occur.
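The k-block universe and the finite-k quantities h_k(x) = (1/k) log |U_k(x)|, whose limit is the topological entropy, can be computed directly for a finite sample; a minimal sketch (function names are mine):

```python
from math import log2

def k_block_universe(x, k):
    """U_k(x): the set of all k-blocks occurring (with overlap) in x."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}

def h_k(x, k):
    """(1/k) log2 |U_k(x)|; its limit in k is the topological entropy h(x)."""
    return log2(len(k_block_universe(x, k))) / k

# A periodic sequence 0101... has only two k-blocks for every k, so
# h_k tends to 0, while a "rich" binary sequence can keep h_k near 1.
x = "01" * 64
```

This illustrates the point made in the text: h_k counts which blocks occur but is blind to how often they occur.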
Ziv established the LZ convergence theorem by proving the following two theorems.

Theorem II.2.2 (The LZ upper bound.)

lim sup_n C(x_1^n) log n / n ≤ H(x), x ∈ A^∞.

Theorem II.2.3 (The Ziv-entropy theorem.) If μ is an ergodic process with entropy h, then H(x) ≤ h, almost surely.

These two theorems show that lim sup_n (1/n)C(x_1^n) log n ≤ h, almost surely, which, combined with the fact that it is not possible to beat entropy in the limit, leads immediately to the LZ convergence theorem, Theorem II.2.1.

The concept of Ziv-entropy is based on the observation that strings with too small frequency of occurrence can be eliminated by making a small limiting density of changes. The natural distance concept for this idea is d̄(x, y), defined as before by d̄(x, y) = lim sup_n d_n(x_1^n, y_1^n), where d_n(x_1^n, y_1^n) is per-letter Hamming distance. The Ziv-entropy of a sequence x is denoted by H(x) and is defined by

H(x) = lim_{δ→0} inf {h(y) : d̄(x, y) ≤ δ}.

The proof of the upper bound, Theorem II.2.2, will be carried out via three lemmas, which also have independent interest. The first lemma obtains a crude upper bound by a simple worst-case analysis, the second gives topological entropy as an upper bound, and the third establishes a δ-perturbation bound.

Lemma II.2.4 (The crude bound.) There exists δ_n → 0 such that

C(x_1^n) log n / n ≤ (1 + δ_n) log |A|, x ∈ A^∞.

Proof The most words occur in the case when the LZ parsing contains all the words of length 1, all the words of length 2, ..., up to some length, say m, that is, when

n = Σ_{j=1}^{m} j|A|^j and C = Σ_{j=1}^{m} |A|^j.

In this case n ≥ m|A|^m, while C ≤ |A|^{m+1}/(|A| − 1), and hence C log n / n → log |A|.
5 (The topological entropy bound. the topological entropy. must give an upper bound. together with the fact that hi(x) — Ei < h(x) h (x) where Ei —> 0 as j oc. Lemma 11. tm 1 m(t _ i) _< E (2) m (t— 1 i Er ti < t"/(t — 1). for w E Ai.. ei. and the t <  t mtm. all of which are valid for t > 1 and are simple consequences of the wellknown formula 0 — t)/(t — 1).x'iz) log n h(y) + Proof Define the wordlength function by t(w) = j. The word w(i) will be called wellmatched if dt(u. for the LZ parsing may be radically altered by even a few changes in x. then Given x and E > 0. ()) (w(i).. Let x = w(1)w(2) . y) <S.. i=1 The desired upper bound then follows from the bound bounds . as k oc. there exists 8. that is.) For each x E A'. 1 t m it . .2.) x. tj = Er (enEl The second bounding result extends the crude bound by noting that short words don't matter in the limit. 2. there is a S >0 such that if d( li m sup n—>oo C(. yield the desired result. for k > 1. Lemma 11. The third lemma asserts that the topological entropy of any sequence in a small dneighborhood of x is (almost) an upper bound on code performance.6 (The perturbation bound.. —> 0.2. j=1 m+1 The bounds (2).. i = 1.. be the LZ parsing of x and parse y as y = v(1)v(2) . and hence the rate of growth of the number of words of length k.134 CHAPTER II.. first choose m such that n< j=1 jlAli. ENTROPYRELATED PROPERTIES. At first glance the result is quite unexpected.. Fix S > 0 and suppose d(x. To establish the general case in the form stated. where t(v(i)) = t(w(i)). such that C(x) log n n — Proof Let hk(x) = (1/k) log lUk(x)1. and choose m such that E j2» j=1 „(x) < n < E i2 p„(x) . v(i)) < . y) < S.
n+s +k E Ed' I{i E [0. To make the preceding argument rigorous.2. n — 1]. hence the same must be true for nonoverlapping kblocks for at least one shift s E [0. Lemma 11. v(i)) < N/75. k —1] such that i. together with the fact that hk(Y) + h(y). Lemma 1.2. be an ergodic process with entropy h.5 then gives C2(4) 5_ (1 + On)r— (h(Y) gn f (8)). since this is the total number of words of length k in y. The cardinality of Gk(Y) is at most 2khk (Y ) .4 gives the bound (3) (1 n/ +an) log n'a log I AI where limn an = O.6. for w(i) E Gk(X).2.2.2. This bound. Lemma 11. implies LI the desired result. the total length of the poorlymatched words in the LZ parsing of 4 is at most rt.r8fraction of x. 0 The desired inequality. Proof Fix c > 0. Fix a frequencytypical sequence x.2. Let C1(4) be the number of poorlymatched words and let C2(4) be the number of wellmatched words in the LZ parsing of x. For any frequencytypical x and c > O.6. almost surely. is an immediate consequence of the following lemma. n — 11: x in+s+1 lim inf > 1.7.SECTION 11. and choose k so large that there is a set 'Tk c Ak of cardinality less than 2n (h+E) and measure more than 1 — E. Lemma 11.5. first note that lim i+k Ili E [0. Thus there must be an integer s E [0./TS. Theorem 11. and let Gk(y) be the set of all v(i) for which w(i) E Gk(X). By the Markov inequality. there is a sequence y such that cl(x. follows immediately from Lemma 11. 135 otherwise poorlymatched. H(x) < h. The LZ upper bound. THE LEMPELZIV ALGORITHM. y) < e and h(y) <h E. For each k. where lim n fi. since x was assumed to be frequencytypical. combined with (3) and the fact that f(S) ÷ 0. and hence it can be supposed that for all sufficiently large n. k — 1].2. = O. n—>oo . The idea for the proof is that 4 +k1 E Tk for all but a limiting (1 — 0fraction of indices i. Now consider the wellmatched words.2.E. Since dk(w(i).7 Let p. yields IGk(x)I where f(S) < 2k(h(y)+ f (6)) 0 as 8 O. 
The resulting nonoverlapping blocks that are not in 'Tk can then be replaced by a single fixed block to obtain a sequence close to x with topological entropy close to h. the poorlymatched words cover less than a limiting . let Gk(x) be the set of wellmatched w(i) of length k. the blowupbound lemma. Thus. completing the proof of Lemma 11. x i+ 1 T k11 • > 1 — E.
9 Another variation.2. then defines .2.8 Variations in the parsing rules lead to different forms of the LZ algorithm. Remark 11.2.7.6.2. Another version. for now the location of both the start of the initial old word.136 CHAPTER II. Remark 11. y) <E. where w(i) ak if w(i) E 'Tk otherwise. Theorem 11. This modification. "new" meant "not seen as a word in the past.2. — n Fix a E A and let ak denote the sequence of length k. a set of cardinality at most 1 + 2k(h +E) . called the slidingwindow version. which can be expressed by saying that x is the concatenation x = uw(1)w(2). remains true. If M is large relative to k.7.. as well as its length must be encoded. of words have been seen. ENTROPYRELATED PROPERTIES. Tic where 3A1 * 0. Theorem 11. Condition (4) and the definition of y guarantee that ii(x. also discussed in the exercises.2. until it reaches the noth term. see Exercise 1.3. which has also been extensively analyzed.10 For computer implementation of LZtype algorithms some bound on memory growth is needed. produces slightly more rapid convergence and allows some simplification in the design of practical coding algorithms. which improves code performance.. • v(j + in — 1)r. thereby. Nevertheless. One version..") In this alternative version. where the initial block u has length s. The sequence y is defined as the concatenation y = uv(1)v(2). Lemma 1. starts in much the same way. the proof of the Ziventropy theorem. the rate of growth in the number of new words still determines limiting code length and the upper bounding theorem. all of whose members are a.. Since each v(i) E U {a l } . implies that h m (y) < h + 6 ± 3m.. a member of the mblock universe Um (y) of y has the form qv(j)v(j + D. Remark 11. after which the next word is defined to be the longest block that appears in the list of previously seen words. and hence h(y) < h + E. 0 and.00 Ri < n: w(i) E TxII > 1 — E. where q and r have length at most k. the builtup set bound. 
includes the final symbol of the new word in the next word. new words tend to be longer. One method is to use some form of the LZ algorithm until a fixed number. and (4) liminf n—. say Co. This completes the proof of Lemma 11.2. known as the LZW algorithm. but more bookkeeping is required. and (M — 2k)/ k < m < M/k. each w(i) has length k... defines a word to be new only if it has been seen nowhere in the past (recall that in simple parsing. It is used in several data compression packages.
Such algorithms, which define the next word as the longest block that appears somewhere in the past n_0 terms, where n_0 can depend on n, cannot achieve entropy for every ergodic process, but do perform well for processes which have constraints on memory growth, and can be shown to approach optimality as window size or tree storage approach infinity, see [87, 88].

II.2.a Exercises

1. Show that simple LZ parsing defines a binary tree in which the number of words corresponds to the number of nodes in the tree. (Hint: if the successive word lengths are ℓ_1, ℓ_2, ..., then Σ_i log ℓ_i ≤ C log(n/C).)

2. Show that the version of LZ discussed in Remark II.2.8 achieves entropy in the limit.

3. Show that the LZW version discussed in Remark II.2.9 achieves entropy in the limit.

Section II.3 Empirical entropy.

In many coding procedures a sample path x_1^n is partitioned into blocks, and then each block is coded separately. The blocks may have fixed length, or variable length, which can depend on n, and the coding of a block may depend only on the block, or it may depend on the past or even the entire sample path, e.g., through the empirical distribution of blocks in the sample path, as in the Lempel-Ziv code. In this section a beautiful and deep result due to Ornstein and Weiss, [52], about the covering exponent of the empirical distribution of fixed-length blocks will be established. The theorem, here called the empirical-entropy theorem, is concerned with the number of k-blocks it takes to cover a large fraction of a sample path, asserting that eventually almost surely only a small exponential factor more than 2^{kh} such k-blocks is enough, and that a small exponential factor less than 2^{kh} is not enough. The empirical-entropy theorem has a nonoverlapping-block form, which will be discussed here, and an overlapping-block form, both of which will also be discussed. Furthermore, the theorem suggests an entropy-estimation algorithm and a specific coding technique, and the method of proof leads to useful techniques for analyzing variable-length codes. These will be discussed in later sections.

The nonoverlapping k-block distribution is defined by

q_k(a_1^k | x_1^n) = |{i ∈ [0, m − 1] : x_{ik+1}^{ik+k} = a_1^k}| / m,

where n = km + r, 0 ≤ r < k.
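The nonoverlapping k-block distribution q_k(·|x_1^n) can be computed directly; a minimal sketch (the helper name is mine):

```python
from collections import Counter

def q_k(x, k):
    """Empirical distribution of nonoverlapping k-blocks of x.

    With n = k*m + r, 0 <= r < k, only the first m full blocks are
    counted; each block a_1^k gets weight (occurrence count)/m.
    """
    m = len(x) // k
    counts = Counter(x[i * k:(i + 1) * k] for i in range(m))
    return {block: c / m for block, c in counts.items()}

# q_2 of "abababab": the single block "ab" has empirical probability 1.
dist = q_k("abababab", 2)
```

The final r < k symbols are simply ignored, as in the definition above.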
The theorem is. w(2). such that Tk c A k . see Exercise 3. E w(i)Eurk f(w(i)) a. x) such that if k > K and n > 2kh . that is. A sequence 4 is (K. hence. As before. if the words that belong to U'rk cover at least a (1 — E)fraction of n. Part (a) of the empiricalentropy theorem is a consequence of a very general result about parsings of "typical" sample paths. II.3.4. for the latter says that for given c and all k large enough. for unbiased cointossing the measure qk (.. and there is no is a set T(E) c A k such that l'Tk(E)I _ set B C A k such that IB I _< 2k(hE) and . . subject only to the condition that k < (11 h) log n.14) is not close to kt k in variational distance for the case n = k2". . but the more general result about parsings into variablelength blocks will be developed here. (b) qk(BI4)) <E. A parsing X n i = w(l)w(2) • • • w(t) is (1 — e)builtup from {'Tk}. The surprising aspect of the empiricalentropy theorem is its assertion that even if k is allowed to grow with n.. there < 2k(h+f) and pl (Tk(E)) > 1 — e. t(w(I))<K E  2 is (1 — E)builtup from MI.14) is eventually almost surely close to the true distribution.138 CHAPTER II. those long words that come from the fixed collections . then most of the path is covered by words that come from fixed collections of exponential size almost determined by entropy. For example. for k > 1. Eventually almost surely. a finite sequence w is called a word. an empirical distribution form of entropy as covering exponent. c)stronglypacked by MI if any parsing . in particular. ENTROPYRELATED PROPERTIES. in essence. Theorem 11. the empirical distribution qk(. The ergodic theorem implies that for fixed k and mixing kt.a Strongpacking. For each c > 0 and each k there is a set 'Tk (e) c A" for which l'Tk (e)I < 2k(h +E) . word length is denoted by £(w) and a parsing of 4 is an ordered collection P = {w(1). such that for almost every x there is a K = K(e. the empirical measure qk(. or (1 — e)packed by {TO. 
.3.I4) eventually almost surely has the same covering properties as ktk.) Let ii be an ergodic measure with entropy h > O. in spite of the fact that it may not otherwise be close to iik in variational or even cik distance. for fixed k.uk. if the short words cover a small enough fraction. Fix a sequence of sets {E}. as it will be useful in later sections.a(B) > E. the empirical and true distributions will eventually almost surely have approximately the same covering properties.7.1 (The empiricalentropy theorem. Theorem 1.elz = w(1)w(2) • • • w(t) for which awci» < En . A fixedlength version of the result is all that is needed for part (a) of the entropy theorem. Strongpacking captures the essence of the idea that if the long words of a parsing cover most of the sample path. then (a) qk(Z(E)14)) > 1 — E. (1 — O n. for any B c A" for which IBI < 2k( h — e) . w(t)} of words for which 4 = w(1)w(2) • • • w(t).
6. If a word is long enough. then. (ii) If x is (1 —3 2 /2)stronglycovered by C m and parsed as 4 = w(1)w(2) • • w(t). however. To make the outline into a precise proof. < 2k(h+E) for k > K. . by the packing lemma. the sequence 4 has the property that X: +1711 E Cm for at least (1 — 3 2 /2)n indices i E [1. But if such an x is partitioned into words then. and let 8 be a positive number to be specified later. then the word is mostly builtup from C m . w(i) E Te(w(j)). This is a consequence of the following three observations. 0stronglypacked by {Tk for a suitable K. The key result is the existence of collections for which the 'Tk are of exponential size determined by entropy and such that eventually almost surely x is almost stronglypacked. Lemma 1.7. by the builtup set lemma.3. be an ergodic process of entropy h and let E be a positive number There is an integer K = K(e) and. By the builtup set lemma. that is.) Let p. The collection 'Tk of words of length k that are mostly builtup from Cm has cardinality only a small exponential factor more than 2"") . 4 is (1 — 3 2 /2)stronglycovered by Cm . fix c > 0. for each k > K. and if f(w(i)) ?_ 2/3. EMPIRICAL ENTROPY 139 {'Tk } must cover most of the sample path. (i) The ergodic theorem implies that for almost every x there is an N = N(x) such that for n > N. k > in. The entropy theorem provides an integer m and a set Cm c Am of measure close to 1 and cardinality at most 2m (h +6) • By the ergodic theorem. n — m + 1]. It remains to show that eventually almost surely x'IL is (K.(ar) > 2 m(h+ 5)} has measure at least 1 — 3 2 /4. if necessary. by the packing lemma. that is.3. By making 8 smaller. w(i) is (10packed by C m .SECTION 11. and most of its indices are starting places of members of Cm . Proof The idea of the proof is quite simple. Lemma 1. the words w(i) that are not (1 — 3/2)stronglycovered by Cm cannot have total length more than Sn < cn12.3. 
eventually almost surely most indices in x7 are starting places of mblocks from Cm . then. For k > m. such that both of the following hold. Lemma 11. (a) ITkI_ (b) x is eventually almost surely (K. most of 4 must be covered by those words which themselves have the property that most of their indices are starting places of members of Cm .3. by the Markov inequality. by the Markov inequality. let 'Tk be the set of sequences of length k that are (1 — 3)builtup from Cm .2 (The strongpacking lemma. a set 'Tk = 'Tk(E) C A k . (iii) If w(i) is (1 — 3/2)stronglycovered by C m . it can be supposed that 3 < 12. 0stronglypacked by {7 } . it can be supposed that 6 is small enough to guarantee that iTk i < 2k(n+E) . The entropy theorem provides an m for which the set Cm = {c4n : p.
such that Irk I . where t(w(i)) = k. The good kcode used on the blocks that are not in B comes from part (a) of the empiricalentropy theorem. are longer than K.2.140 CHAPTER II. Fix c > 0 and let S < 1 be a positive number to be specified later. This completes the proof of Lemma 11. if the block belongs to B. for if this is so. Fix x such that 4 is (K.. Fix k > K(x) and n a 2kh . The second part encodes successive kblocks by giving their index in the listing of B. and choose K (x) > K such that if k> K(x) and n> 2kh .3. The first part is an encoding of some listing of B. provided only that n > 2kh .2k(h+') for k a K. Thus. if n > N(x).b Proof of the empiricalentropy theorem. 8)stronglypacked by {Tk}.3. The strongpacking lemma provides K and 'Tk c A " . then n > N(x) and k < Sn. while the last one has length less than Sn. A simple twopart code can be constructed that takes advantage of the existence of such sets B. Proof of part (b). dividing by n — k + 1 produces qk(Tkixriz ) a (1 — 28)(1 —8) > (1 — c).. An informal description of the proof will be given first.' is (K. If B is exponentially smaller than 2" then fewer than kh bits are needed to code each block in B. except possibly the last one. If a block belongs to Tk — B then its index in some ordering of Tk is transmitted. All the blocks. for some k. This completes the proof of part (a) of the empiricalentropy 0 theorem. eventually almost surely. which supplies for each k a set 'Tk c A k of cardinality roughly 2. this contributes asymptotically negligible length if I BI is exponentially smaller than 2kh and n > 2kh . and t(w(t + 1)) < k. since k < Sn. such that eventually almost surely most of the kblocks in an nlength sample path belong to Tk . since the left side is just the fraction of 4 that is covered by members of Tk . The definition of strongpacking implies that (n — k + 1)qk('Tk I4) a. and suppose xril = w(1)w(2) • • • w(t)w(t + 1). ENTROPYRELATED PROPERTIES. i < t. and such that x. 
for suitable choice of S. so that if B covers too much the code will beat entropy. Proof of part (a). this requires roughly hkbits. 8)stronglypacked by {T} for n a N(x). or by applying a fixed good kblock code if the block does not belong to B.(1 — 2 8 )n. If the block does . From the preceding it is enough to take K > 2/8. II. Suppose xtiz is toowell covered by a toosmall collection B of kblocks. and if the parsing x 11 = w(1)w(2) • • • w(t) satisfies —2 t(w(i))<K then the set of w(i) for which f(w(i)) > K and w(i) E U Tk must have total length at El least (1 — On.
Let K be a positive integer to be specified later and for n > 2/(4 let B(n) be the set of 4 for which there is some k in the interval [K. Part (a) of the theorem provides a set Tk c ilk of cardinality at most 2k(h+3) . 1} rk(4 ±3)1 .. 1}rk(h01. say. defined by Fm (X) = F(xi)F(x2). and.. eventually almost surely.{0. an integer K(x) such that if k > K(x) and n > 2" then qk(TkIx) > 1 .3. which establishes that 4 g B(n). eventually almost surely. (ii) I BI <2k(h6) and qk (BI4) > E. say. since no prefixcode sequence can beat entropy infinitely often on a set of positive measure. so that if K(x) < k < (log n)/ h then either 4 E B(n). Let 1 • 1 denote the upper integer function. say G: Tk i. Also needed for each k is a fixedlength faithful coding of Tk. eventually almost surely. 141 not belong to Tk U B then it is transmitted term by term using some fixed 1block code. if 8 is small enough and K is large enough. Indeed. for almost every x. (1) If B C A k and I B I < 21* 0 .{0. and a fixedlength faithful code for each B c A k of cardinality at most 2k(h. . B(n). Mk. or the following holds. F: A i.E ) ._. for K(x) < k < loh gn. F(x. 1}1 1 05 All is needed along with its extension to length in > 1.{0. since such blocks cover at most a small fraction of the sample path this contributes little to overall code length. then property (1) must hold for all n > 2. fix c > 0 and let 8 be a positive number to be specified later.3. EMPIRICAL ENTROPY. Several auxiliary codes are used in the formal definition of C. then for almost every x there is an N(x) such that x'ii g B(n).B: B i . This fact implies part (b) of the empiricalentropy theorem. (i) qic(Tk14) ?_ 1 . for n > N(x). for each k. so that if k is enough larger than K(x) to guarantee that nkh Z > N(x). D It will be shown that the suggested code Cn beats entropy on B(n) for all sufficiently large n. which implies part (b) of the empiricalentropy theorem. 
the definition of {rk } implies that for almost every x there is an integer K (x) > K such that qk(Tk14) > 1 . If 4 g B(n).8.8. By using the suggested coding argument it will be shown that if 8 is small enough and K large enough then xn i .SECTION 11. n ). then qk(BI4) < c. (log n)/h] for which there is a set B c A k such that the following two properties hold. To proceed with the rigorous proof. A faithful single letter code.
and w(t + 1) has length r = [0. eventually almost surely. where (2) v(i) = The code is a prefix code.B w(i) g Tk UB.3. and. let Ok(B) be the concatenation in some order of {Fk (4): 4 E B } . The rigorous definition of Cn is as follows. a fact stated as the following lemma. where w(i) has length k for i < t. If 4 V B(n) then Cn (4) = OFn (4). for the block length k and set B c itic used to define C(. and Fr are now known. where aK —> 0 as K —> oc. The twobit header on each v(i) specifies whether w(i) belongs to B. (log n)/11] and a set B C A k of cardinality at most 2k(hE ) are determined such that qk(B14') > c.142 CHAPTER II. ( (I BI) determines the size of B. ENTROPYRELATED PROPERTIES.B. this. and as noted earlier. that is. for 4 E B(n). The lemma is sufficient to complete the proof of part (b) of the empiricalentropy theorem. Finally.B(w(i)) OlG k (w(i)) 11 Fk(w(i)) 11Fr (w(t +1) 0 w(i) E B W(i) E Tk . the block 0k1 determines k.4). For 4 E B(n). in turn. the first bit tells whether x`i' E B(n). Once k is known. xri' E B(n). Gk. (3) . 4 E B(n). since each of the codes Mk. since reading from the left. the principal contribution to total code length £(4) = £(41C) comes from the encoding of the w(i) that belong to rk U B. for each k and B C A k .3 I Mk. This is because tkq k (h — c) + tk(1 — q k )(h + 3) < n(h + 3 — 6(6 + S)). from which w(i) can then be determined by using the appropriate inverse. The code is defined as the concatenation C(4) = 10k 1E(1B1)0k (B)v( 1)v(2) • • • v(t)v(t + 1). given that x til E B(n). and øk(B) determines B. to Tic — B.i<t i = t +1. for all sufficiently large n.C(x) 15_ tkqk(h — 6) + tk(1 — qk)(h + 6) + (3 + K)0(n). a prefix code such that the length of E(n) is log n + o(log n). since no sequence of codes can beat entropy infinitely often on a set of positive measure. k). 1=1 . If 4 E B(n). But this implies that xriz V B(n). Lemma 11. implies part (b) of the empiricalentropy theorem. then an integer k E [K. 
so that if K is large enough and 8 small enough then n(h — c 2 12). or to neither of these two sets. and 4 is parsed as 4 = w(l)w(2) • • • w(t)w(t + 1). since qk > E. and qk = clk(Blx'1 2 ). and let E be an Elias prefix code on the integers. Fk.
The encoding Øk(B) of B takes k Flog IAMBI < k(1+ log I A1)2 k(h. The header 10k le(I BD that describes k and the size of B has length (4) 2 +k+logIBI o(logIBI) which is certainly o(n). and (t — tqk(B 14)) blocks in Tk — B. together these contribute at most 3(t + 1) bits to total length.3. as well as the extra bits that might be needed to round k(h — E) and k(h +3) up to integers.'. upper bounded by K (1+ loglA1)2. converges to h as . II. K the encoding of 4k(B) and the headers contributes at most naK to total length. each requiring a twobit header to tell which code is being used. A simple procedure is to determine the empirical distribution of nonoverlapping kblocks. Thus with 6 ax = K(1 + log IA1)2 KE — = o(K). EMPIRICAL ENTROPY. since k > K and n > 2".SECTION 11. (5) provided K > (E ln 2) .4 The original proof. xn from an unknown ergodic process it. If k is fixed and n oc. each of which requires k(h — E) bits. hence at most 80(n) bits are needed to encode them.3. El Remark 11. Adding this to the o(n) bound of (4) then yields the 30(n) term of the lemma. This completes completes the proof of Lemma 11. [ 52]. 143 Proof of Lemma 11. then H(qk)1 k will converge almost surely to H(k)/k. It takes at most k(1 + log I Ai) bits to encode each w(i) Tk U B as well as to encode the final rblock if r > 0. The argument parallels the one used to prove the entropy theorem and depends on a count of the number of ways to select the bad set B. Ignoring for the moment the contribution of the headers used to describe k and B and to tell which code is being applied to each kblock. since there are tqk(Bl. a quantity which is at most 6n/K.3. thereby establishing part (b) of the empiricalentropy theorem.3. which. of part (b) of the empiricalentropy theorem used a counting argument to show that the cardinality of the set B(n) must eventually be smaller than 2nh by a fixed exponential factor. KEn. the goal is to estimate the entropy h of II. and n> k > K.q) blocks that belong to B. 
and then take H (qk )/ k as an estimate of h. in turn.E ) bits. X2. since log I BI < k(h — E) and k < (log n)I n.3. The coding proof given here is simpler and fits in nicely with the general idea that coding cannot beat entropy in the limit. A problem of interest is the entropyestimation problem. which is.3.c Entropy estimation. each of which requires k(h +3) bits. and there are at most 1 +3t such blocks. . Given a sample path Xl. the number of bits used to encode the blocks in Tk U B is given by tqk (Bix)k(h — E) + (t — tq k (BI4)) k(h +3). This gives the dominant terms in (3). There are at most t + 1 blocks. as well as a possible extra bit to round up k(h — E) and k(h +8) to integers.3. since t < nlk. in turn.
at least if it is assumed that it is totally ergodic. the set of al' for which . In particular. (7) qk(Uklxi) _ < Irk(E)12ch+2E) < 2k(h+E)2 k(h+2E) _ "Ek —z. n _ This bound. with alphabet A. In summary.) If it is an ergodic measure of entropy h > 0.6. At first glance. while if k(n) — log log n. where G k = Tk(E) — Uk — Vk.6 The entropyestimation theorem is also true with the overlapping block distribution in place of the nonoverlapping block distribution. Theorem 11. the choice k(n) — log n works for any binary ergodic process.144 CHAPTER II. for example. Let Uk be the set of all cif E Tk(E) for which qk (ali` 'xi' ) < 2 k(h+2E) .3. .u.(.3.log iAl n then (6) holds for any ergodic measure p. for example. so that for all large enough k and all n > 2kh . Part (b) of the empiricalentropy theorem implies that for almost every x there is a K (x) such that qk(VkIX) < E. if k(n) . because. combined with the bound (7) and part (a) of the empiricalentropy theorem. at least for a totally ergodic . as n . k —> oc. lx) measure at least 1 — 3e. can now be applied to complete the proof of Theorem 11. there is a Ki (x) such that qk(GkIx) > 1 — 36.5 (The entropyestimation theorem. The empirical entropy theorem does imply a universal choice for the entropyestimation problem. > for k > K (x).(k1n — iikkai 'x i ) < 20h2E) has qk (. and if k(n) < (1/h)log n. then it holds for any finitealphabet ergodic process. then (6) n>C0 k(n) 1 lim — H(q00(. a.3. ENTROPYRELATED PROPERTIES. if k(n) —> oc. The general result may be stated as follows.14) is close to the true distribution bak for every mixing process A.5.oc. there is some choice of k = k(n) as a function of n for which the estimate will converge almost surely to h. implies that for almost every x. Theorem 1.9. the choice of k(n) would appear to be very dependent on the measure bt. The same argument that was used to show that entropyrate is the same as decayrate entropy. CI Remark 11. see Exercise 2. 
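The entropy-estimation theorem suggests a concrete procedure: compute the nonoverlapping k-block distribution with k = k(n) growing like a multiple of log n and return H(q_{k(n)})/k(n). The sketch below is illustrative, not from the text; it adopts the choice k(n) ~ (1/2) log_|A| n used by the universal code later in the section, with a binary alphabet assumed:

```python
from collections import Counter
from math import log2

def empirical_entropy_rate(x, k):
    """H(q_k)/k, where q_k is the nonoverlapping k-block distribution of x."""
    m = len(x) // k
    counts = Counter(x[i * k:(i + 1) * k] for i in range(m))
    H = -sum((c / m) * log2(c / m) for c in counts.values())
    return H / k

def estimate_entropy(x):
    """Entropy estimate with block length k(n) ~ (1/2) log2 n
    (binary alphabet assumed, an illustrative choice)."""
    k = max(1, int(log2(len(x)) / 2))
    return empirical_entropy_rate(x, k)
```

For a fixed k the estimate converges to H(k)/k as n grows; letting k grow slowly enough with n, as in the theorem, gives convergence to h itself.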
Next let I Vkl < Vk be the set of all alic for which qk (cif 14) > 2 k(h26) . Proof Fix c > 0 and let {Tk(c) c A k : n > 1) satisfy part (a) of the empiricalentropy theorem.s. and note that 2/(/126) . n > 2kh . there is no universal choice of k = k(n) for which k(n) —> oc for which the empirical distribution qi. Thus. for k > Ki(x).14)) = h.
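The estimator just described is simple to implement. In the sketch below (an illustration added here, not from the text), the empirical distribution q_k is tabulated from the nonoverlapping k-blocks of the sample and the normalized empirical entropy H(q_k)/k is returned.

```python
import math
from collections import Counter

def entropy_estimate(x, k):
    """Estimate the process entropy h by H(q_k)/k, where q_k is the
    empirical distribution of nonoverlapping k-blocks of the sample x."""
    t = len(x) // k                      # number of complete k-blocks
    counts = Counter(tuple(x[i * k:(i + 1) * k]) for i in range(t))
    # H(q_k) = -sum q_k(a) log2 q_k(a) over observed blocks a
    H = -sum((c / t) * math.log2(c / t) for c in counts.values())
    return H / k
```

With k = k(n) growing slowly, for example k ~ log log n, the theorem asserts that this quantity converges almost surely to h for totally ergodic processes.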
SECTION II.3. EMPIRICAL ENTROPY 145

II.3.d Universal coding.

A universal code for the class of ergodic processes is a faithful code sequence {C_n} such that for any ergodic process μ,

    lim_n ℓ(x_1^n)/n = h, almost surely,

where h is the entropy of μ. Such a code sequence was constructed earlier, and it was also shown that the Lempel-Ziv algorithm gives a universal code. The empirical-entropy theorem provides another way to construct a universal code. The steps of the code are as follows.

Step 1. Partition x_1^n into blocks of length k = k(n) ~ (1/2) log_{|A|} n.
Step 2. Transmit a list of these k-blocks in order of decreasing frequency of occurrence in x_1^n.
Step 3. Encode successive k-blocks in x_1^n by giving the index of the block in the list.
Step 4. Encode the final block, if k does not divide n, with a per-symbol code.

High frequency blocks appear near the front of the list and therefore have small indices, and the number of bits needed to transmit an index is of the order of magnitude of the logarithm of the index, so that the empirical-entropy theorem guarantees good performance. The number of bits needed to transmit the list is short, relative to n, since k ~ (1/2) log_{|A|} n.

To make this sketch into a rigorous construction, fix {k(n)} for which k(n) ~ (1/2) log n. (For simplicity it is assumed that the alphabet is binary.) Suppose n = tk + r, with r ∈ [0, k), and x_1^n = w(1)w(2)···w(t)v, where each w(i) has length k. The code word C_n(x_1^n) is a concatenation of two binary sequences, a header of fixed length m = k2^k, followed by a sequence whose length depends on x_1^n. The header b_1^m is a concatenation of all the members of {0, 1}^k, subject only to the rule that a_1^k precedes ã_1^k whenever q_k(a_1^k|x_1^n) > q_k(ã_1^k|x_1^n). The resulting sequence {v(j): j = 0, 1, ..., 2^k − 1} is called the code book. For 1 ≤ i ≤ t and 0 ≤ j < 2^k, define the address function by the rule A(w(i)) = j, if w(i) = v(j), that is, if w(i) is the j-th word in the code book. Let E be a prefix code on the natural numbers, an Elias code, such that the length of the word E(j) is log j + o(log j). Define b_{m+1}^{m+L} to be the concatenation of the E(A(w(i))) in order of increasing i, where L is the sum of the lengths of the E(A(w(i))). Finally, define b_{m+L+1}^{m+L+r} to be x_{tk+1}^n. The code C_n(x_1^n) is defined as the concatenation

    C_n(x_1^n) = b_1^m · b_{m+1}^{m+L} · b_{m+L+1}^{m+L+r},

so the code length is ℓ(x_1^n) = m + L + r. Note that L depends on the distribution q_k(·|x_1^n), but the use of the Elias code insures that C_n is a prefix code.

The empirical-entropy theorem will be used to show that for any ergodic μ,

    lim sup_n ℓ(x_1^n)/n ≤ h, a.s.,

where h is the entropy of μ. For each n, let G_k denote the first 2^{k(h+ε)} members of the code book, where k = k(n), and let {T_k(ε) ⊂ A^k} be the sequence given by the empirical-entropy theorem. Note that |G_k| ≥ |T_k(ε)| and, furthermore, since G_k is a set of k-sequences of largest q_k-probability whose cardinality is at most 2^{k(h+ε)},

    q_k(G_k|x_1^n) ≥ q_k(T_k(ε)|x_1^n).

The empirical-entropy theorem provides, for almost every x, an integer K(x) such that q_k(G_k|x_1^n) ≥ q_k(T_k(ε)|x_1^n) ≥ 1 − ε, for k ≥ K(x). Thus, eventually almost surely, at least a (1 − ε)-fraction of the binary addresses E(A(w(i))) will refer to members of G_k and thus have lengths bounded above by k(h + ε) + o(log n). For those w(i) ∉ G_k, the crude bound k + o(log n) will do, and hence

    ℓ(x_1^n) ≤ (1 − ε)n(h + ε) + εn + o(n).

This proves that lim sup_n ℓ(x_1^n)/n ≤ h, a.s.

The reverse inequality follows immediately from the fact that too-good codes do not exist, but it is instructive to note that it also follows from the second part of the empirical-entropy theorem. For each n, let B_k be the first 2^{k(h−ε)} sequences in the code book for x_1^n, where k = k(n). These are the only k-blocks that have addresses shorter than k(h − ε) + o(log n). The empirical-entropy theorem guarantees that, eventually almost surely, there are at most εt indices i for which w(i) ∈ B_k, so that lim inf_n ℓ(x_1^n)/n ≥ (h − ε)(1 − ε). Thus

(8)    lim_n ℓ(x_1^n)/n = h, a.s.,

so that {C_n} is a universal prefix-code sequence.

Remark II.3.7. The universal code construction is drawn from [49]. A second, even simpler algorithm, which does not require that the code book be listed in any specific order, was obtained in [43], and is described in Exercise 4. Another application of the empirical-entropy theorem will be given in the next chapter, in the context of the problem of estimating the measure μ from observation of a finite sample path, see Section III.2, which also includes universal coding results for coding in which some distortion is allowed.
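The code-book construction of Steps 1-4 is easy to simulate. The sketch below (an illustration added here, not from the text) assumes a binary alphabet, orders the code book by decreasing empirical frequency, and charges 2·log2(j + 2) bits for index j as a crude stand-in for the Elias code E(j), plus the k·2^k-bit header; it returns the resulting total code length.

```python
import math
from collections import Counter

def code_length(x, k):
    """Bit length of the code-book encoding sketched in Steps 1-4:
    a k * 2**k bit header listing {0,1}^k by decreasing empirical
    frequency, then one index code per k-block of x.  Charging
    2*log2(j + 2) bits for index j is an assumption of this sketch,
    standing in for an Elias prefix code of length log j + o(log j)."""
    t = len(x) // k
    blocks = [x[i * k:(i + 1) * k] for i in range(t)]
    counts = Counter(blocks)
    # the code book: every binary k-block, most frequent first
    book = sorted((format(v, "0{}b".format(k)) for v in range(2 ** k)),
                  key=lambda b: -counts[b])
    index = {b: j for j, b in enumerate(book)}
    header = k * 2 ** k
    body = sum(2 * math.log2(index[b] + 2) for b in blocks)
    return header + body   # the final partial block is ignored here
```

On a highly repetitive sample path most indices are 0, so the body is near-minimal, matching the intuition that high-frequency blocks get short addresses.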
II.3.e Exercises.

1. Show that the empirical-entropy theorem is true with the overlapping block distribution

    p_k(a_1^k|x_1^n) = |{i ∈ [1, n − k + 1]: x_i^{i+k-1} = a_1^k}| / (n − k + 1)

in place of the nonoverlapping block distribution q_k(·|x_1^n). (Hint: reduce to the nonoverlapping case for some small shift of the sequence.)

2. Show that the entropy-estimation theorem is true with the overlapping block distribution in place of the nonoverlapping block distribution.

3. Show that the variational distance between q_k(·|x_1^n) and μ_k is asymptotically almost surely lower bounded by (1 − e^{-1}) for the case when n = k2^k and μ is unbiased coin-tossing. (Hint: if M balls are thrown at random into M boxes, then the expected fraction of empty boxes is asymptotic to e^{-1}.)

4. Another simple universal coding procedure is suggested by the empirical-entropy theorem. For each k ≤ n construct a code C_{k,n} as follows. Express x_1^n as the concatenation w(1)w(2)···w(q)v of k-blocks {w(i)}, plus a possible final block v of length less than k. Make a list in some order of the k-blocks that occur. Transmit k and the list, then code successive w(i) by using a fixed-length code; such a code requires at most k(1 + log |A|) bits per word. Append some coding of the final block v. Call this C_{k,n}(x_1^n) and let ℓ_k(x_1^n) denote the length of C_{k,n}(x_1^n). Let k_min be the first value of k ∈ [1, n] at which ℓ_k(x_1^n) achieves its minimum. The final code C(x_1^n) is the concatenation of E(k_min) and C_{k_min,n}(x_1^n), that is, it transmits the shortest of the codes {C_{k,n}(x_1^n)}, along with a header to specify the value of k. Show that {C_n} is a universal prefix-code sequence.

SECTION II.4. PARTITIONS OF SAMPLE PATHS. 147

Section II.4 Partitions of sample paths.

An interesting connection between entropy and partitions of sample paths into variable-length blocks was established by Ornstein and Weiss, [53]. Their results were motivated by an attempt to better understand the Lempel-Ziv algorithm. They show that, eventually almost surely, for any partition of the sample path into distinct words, most of the sample path is covered by words that are not much shorter than (log n)/h, while for any partition into words that have been seen in the past, most of the sample path is covered by words that are not much longer than (log n)/h.

As in earlier discussions, a word w is a finite sequence of symbols drawn from the alphabet A, and the length of w is denoted by ℓ(w). A sequence x_1^n is said to be parsed (or partitioned) into the (ordered) set of words {w(1), w(2), ..., w(t)} if it is the concatenation

(1)    x_1^n = w(1)w(2)···w(t).

If w(i) ≠ w(j), for i ≠ j, then (1) is called a parsing (or partition) into distinct words. For example,

    000110110100 = [000][110][1101][00]

is a partition into distinct words, while 000110110100 = [00][0110][1101][00] is not.
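As a companion to these definitions, the simple Lempel-Ziv rule (each word is the shortest block that has not yet occurred as an earlier word) always yields a parsing into distinct words, except possibly for the final word. A small sketch (an illustration added here, not from the text):

```python
def lz_parse(x):
    """Parse the string x into words by the simple Lempel-Ziv rule:
    each word is the shortest block not seen as an earlier word.
    All words are distinct, except that the leftover final word may
    duplicate an earlier one."""
    words, seen, start = [], set(), 0
    for i in range(1, len(x) + 1):
        w = x[start:i]
        if w not in seen:
            seen.add(w)
            words.append(w)
            start = i
    if start < len(x):
        words.append(x[start:])   # leftover final word, possibly repeated
    return words
```

For instance, lz_parse("0001101101") produces the distinct words [0][00][1][10][11][01].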
Partitions into distinct words have the asymptotic property that most of the sample path must be contained in those words that are not too short relative to entropy.

Theorem II.4.1 (The distinct-words theorem.) Let μ be ergodic with entropy h > 0, and let ε > 0 be given. For almost every x ∈ A^∞ there is an N = N(ε, x) such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into distinct words, then

    Σ_{ℓ(w(i)) < (h+ε)^{-1} log n} ℓ(w(i)) ≤ εn.

The (simple) Lempel-Ziv algorithm discussed in Section II.2 parses a finite sequence into distinct words, except possibly for the final word w(C + 1); if this final word is not empty, it is the same as a prior word, and hence ℓ(w(C + 1)) ≤ n/2. Thus the distinct-words theorem implies that, eventually almost surely, most of the sample path is covered by blocks that are at least as long as (h + ε)^{-1} log(n/2), so that

(2)    lim sup_n C(x_1^n) log n / n ≤ h.

To discuss partitions into words that have been seen in the past, some way to get started is needed. For this purpose, let F be a fixed finite collection of words, called the start set. A partition x_1^n = w(1)w(2)···w(t) is called a partition (parsing) into repeated words if each w(i) is either in F or has been seen in the past, meaning that there is an index j ≤ ℓ(w(1)) + ··· + ℓ(w(i − 1)) such that w(i) = x_j^{j+ℓ(w(i))−1}. Partitions into repeated words have the asymptotic property that most of the sample path must be contained in words that are not too long relative to entropy.

Theorem II.4.2 (The repeated-words theorem.) Let μ be ergodic with entropy h > 0, and let 0 < ε < h and a start set F be given. For almost every x ∈ A^∞ there is an N = N(ε, x) such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into repeated words, then

    Σ_{ℓ(w(i)) > (h−ε)^{-1} log n} ℓ(w(i)) ≤ εn,

so that eventually almost surely most of the sample path must be covered by words that are no longer than (h − ε)^{-1} log n.

It is not necessary to require exact repetition in the repeated-words theorem; it is sufficient if all but the last symbol appears earlier, see Exercise 2. With this modification, the repeated-words theorem applies to the versions of the Lempel-Ziv algorithm discussed in Section II.2, which, in turn, implies that

(3)    lim inf_n C(x_1^n) log n / n ≥ h.

Thus the distinct-words theorem and the modified repeated-words theorem together provide an alternate proof of the LZ convergence theorem. Of course, as noted earlier, the lower bound (3) is also a consequence of the fact that entropy is an upper bound and the fact that too-good codes do not exist.

The proofs of both theorems make use of the strong-packing lemma, this time applied to parsings into words of variable length.
II.4.a Proof of the distinct-words theorem.

The distinct-words theorem is a consequence of the strong-packing lemma, together with some simple facts about partitions into distinct words. The first observation is that the words must grow in length, because there are at most |A|^k words of length k. The proof of this simple fact is left to the reader.

Lemma II.4.3. Given K and δ > 0, there is an N such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into distinct words, then Σ_{ℓ(w(i)) < K} ℓ(w(i)) ≤ δn.

The second observation is that if the distinct words mostly come from sets that grow in size at a fixed exponential rate α, then words that are too short relative to (log n)/α cannot cover too much.

Lemma II.4.4. For each k, suppose G_k ⊂ A^k satisfies |G_k| ≤ 2^{αk}, where α is a fixed positive number, and let G = ∪_k G_k. Given ε > 0 there is an N such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into distinct words such that Σ_{w(i) ∉ G} ℓ(w(i)) ≤ εn, then

    Σ_{ℓ(w(i)) ≤ (1−ε)(log n)/α} ℓ(w(i)) ≤ 2εn.

Proof. First consider the case when all the words are required to come from G. The fact that the words are distinct then implies that

    Σ_{ℓ(w(i)) ≤ (1−ε)(log n)/α} ℓ(w(i)) ≤ Σ_{k ≤ (1−ε)(log n)/α} k|G_k|.

Since |G_k| ≤ 2^{αk}, the sum on the right is upper bounded by

    α^{-1}((1 − ε) log n) · 2 · exp_2((1 − ε) log n) = O(n^{1−ε} log n) = o(n).

The general case follows, since it is enough to estimate the sum over the too-short words that belong to G. This proves the lemma. □

To continue with the proof of the distinct-words theorem, let δ be a positive number to be specified later. The strong-packing lemma yields an integer K and sets {T_k ⊂ A^k: k ≥ 1} such that |T_k| ≤ 2^{k(h+δ)} and such that eventually almost surely x_1^n is (K, δ)-strongly-packed by {T_k}, that is, for any parsing x_1^n = w(1)w(2)···w(t) for which

(4)    Σ_{ℓ(w(i)) < K} ℓ(w(i)) ≤ δn,

the words drawn from the {T_k} cover most of the sample path:

    Σ_{w(i) ∈ ∪_{k≥K} T_k} ℓ(w(i)) ≥ (1 − 3δ)n.

Suppose x_1^n is (K, δ)-strongly-packed by {T_k}, and suppose also that n is so large that, by Lemma II.4.3, property (4) holds for any parsing of x_1^n into distinct words. Therefore Lemma II.4.4 can be applied with α = h + δ to yield

    Σ_{ℓ(w(i)) ≤ (1−δ)(log n)/(h+δ)} ℓ(w(i)) ≤ 2δn,

provided only that n is large enough. The distinct-words theorem follows, since δ could have been chosen in advance to be so small that 2δ ≤ ε and

    (1 − δ)(log n)/(h + δ) ≥ (log n)/(h + ε). □

II.4.b Proof of the repeated-words theorem.

Throughout this discussion μ will be a fixed ergodic process with positive entropy h, and ε < h will be a given positive number. It will be shown that if the repeated-words theorem is false, then a prefix-code sequence can be constructed which beats entropy by a fixed amount infinitely often on a set of positive measure. The idea is to code each too-long word of a repeated-word parsing by telling where it occurs in the past and how long it is (this is, of course, the basic idea of the version of the Lempel-Ziv code suggested earlier). As in the proof of the empirical-entropy theorem, the existence of good codes on the remainder is guaranteed by the strong-packing lemma. If a good code is used on the complement of the too-long words and if the too-long words cover too much,
then overall code length will be shorter than entropy allows.

A block of consecutive symbols in x_1^n of length more than (h − ε)^{-1} log n will be said to be too long. Suppose x_1^n = w(1)w(2)···w(t) is some given parsing into repeated words, for which s of the w(i) are too long. Label these too-long words in increasing order of appearance as V(1), V(2), ..., V(s). The first idea is to merge the words that lie between the too-long words. Let u_1 be the concatenation of all the words that precede V(1); for 1 < i ≤ s, let u_i be the concatenation of all the words that come between V(i − 1) and V(i); and let u_{s+1} be the concatenation of all the words that follow V(s). In this way, a representation

    x_1^n = u_1 V(1) u_2 V(2) ··· u_s V(s) u_{s+1}

is determined, in which each too-long V(j) is seen somewhere in its past. Such a representation is called a too-long representation of x_1^n. To say that V(j) = x_{n_j}^{m_j} has been seen in the past is to say that there is an index i < n_j such that V(j) = x_i^{i + m_j − n_j}. Since the start set F is a fixed finite set, it can be supposed that when n is large enough no too-long word belongs to F.

Let B(n) be the set of all sequences x_1^n for which there is an s and a too-long representation x_1^n = u_1V(1)u_2V(2)···u_sV(s)u_{s+1} with the following two properties.

(i) Σ_j ℓ(V(j)) ≥ εn.
(ii) Each V(j) has been seen in the past.

An application of the strong-packing lemma provides an integer K and, for each k ≥ K, a set T_k ⊂ A^k of cardinality at most 2^{k(h+δ)}, such that eventually almost surely x_1^n is (K, δ)-strongly-packed by {T_k}, where δ is a positive number to be specified later. Let G(n) be the set of all x_1^n that are (K, δ)-strongly-packed by {T_k}, so that x_1^n ∈ G(n) eventually almost surely. Thus, to complete the proof of the repeated-words theorem it is enough to prove that

    x_1^n ∉ B(n) ∩ G(n), eventually almost surely.

The idea is to code sequences in B(n) by telling where the too-long words occurred earlier, but to make such a code compress too much a good way to compress the fillers is needed; the strong-packing lemma provides the good codes to be used on the fillers.

A code C_n is constructed as follows. Sequences not in B(n) ∩ G(n) are coded using some fixed single-letter code on each letter separately. If x_1^n ∈ B(n) ∩ G(n), then x_1^n is expressed as the concatenation x_1^n = u_1V(1)u_2V(2)···u_sV(s)u_{s+1} of a too-long representation, and the words are coded sequentially using the following rules.

(a) Each filler u_i ∈ ∪_{k≥K} T_k is coded by specifying its length and giving its index in the set T_k to which it belongs.
(b) Each filler u_i ∉ ∪_{k≥K} T_k is coded by specifying its length and applying a fixed single-letter code to each letter separately.
(c) Each too-long V(j) is coded by specifying its length and the start position of its earlier occurrence,
which requires ⌈log n⌉ bits for each of the s too-long words. An Elias code is used to encode the block lengths, that is, a prefix code E on the natural numbers such that the length of the code word assigned to j is log j + o(log j). Two bit headers are appended to each block to specify which of the three types of code is being used, and with a one bit header to tell whether or not x_1^n belongs to B(n) ∩ G(n), the code becomes a prefix code.

For x_1^n ∈ B(n) ∩ G(n), the principal contribution to the total code length ℓ(x_1^n | C_n) comes from telling where each V(j) occurs in the past, and from specifying the index of each u_i ∈ ∪_{k≥K} T_k. This fact is stated in the form needed as the following lemma.

Lemma II.4.5. If log K ≤ δK, then

(5)    ℓ(x_1^n) = s log n + ( Σ_{u_j ∈ ∪_{k≥K} T_k} ℓ(u_j) )(h + δ) + δO(n),

uniformly for x_1^n ∈ B(n) ∩ G(n), that is, up to the two bit headers needed to tell which code is being used and the extra bits that might be needed to round up log n and h + δ to integers.

The lemma is sufficient to show that x_1^n ∉ B(n) ∩ G(n), eventually almost surely, since it is not possible to beat entropy by a fixed amount infinitely often on a set of positive measure. The key to this, as well as to the lemma itself, is that the number s of too-long words, while it depends on x_1^n, must satisfy the bound

(6)    s ≤ ( Σ_i ℓ(V(i)) / log n )(h − ε) ≤ (n / log n)(h − ε),

since, by definition, a word is too long if its length is at least (h − ε)^{-1} log n. By assumption, Σ_i ℓ(V(i)) ≥ εn, while the fillers have total length at most n − εn, so that (5) and (6) yield the code-length bound

    ℓ(x_1^n) ≤ n(h + δ − ε(ε + δ)) + δO(n),

and hence, if δ is small enough, then ℓ(x_1^n) ≤ n(h − ε²/2), for x_1^n ∈ B(n) ∩ G(n) and all sufficiently large n. Since this beats entropy, x_1^n ∉ B(n) ∩ G(n), eventually almost surely, completing the proof of the repeated-words theorem, given the lemma.
Proof of Lemma II.4.5. It is enough to show that the encoding of the fillers that do not belong to ∪_{k≥K} T_k, as well as the encoding of the lengths of all the words, require a total of at most δO(n) bits. First note that there are at most 2s + 1 words in the too-long representation, so the total length contributed by the two bit headers required to tell which type of code is being used, plus the possible extra bits needed to round up log n and h + δ to integers, is upper bounded by 3(2s + 1), which is o(n), since s ≤ n(h − ε)/(log n).

Next consider the fillers. Since it can be assumed that the too-long words have length at least K, the assumption that x_1^n ∈ G(n), that is, that x_1^n is (K, δ)-strongly-packed by {T_k}, implies that the words longer than K that do not belong to ∪T_k cover at most a δ-fraction of x_1^n. Since it takes ℓ(u)⌈log |A|⌉ bits to encode a filler u ∉ ∪T_k, the total number of bits required to encode all such fillers is at most

    ( Σ_{u_i ∉ ∪T_k} ℓ(u_i) ) ⌈log |A|⌉ ≤ δn⌈log |A|⌉,

which is δO(n).

Finally, the encoding of the lengths contributes Σ_i ℓ(E(ℓ(u_i))) + Σ_i ℓ(E(ℓ(V(i)))) bits, the dominant terms of which can be expressed as

    Σ_{ℓ(u_i) < K} log ℓ(u_i) + Σ_{ℓ(u_i) ≥ K} log ℓ(u_i) + Σ_i log ℓ(V(i)).

The first sum is at most (log K)(s + 1), which is o(n) for fixed K, and hence less than δn/2 once n is large enough. Since x^{-1} log x is decreasing for x ≥ e, and since it was assumed that log K ≤ δK, the second and third sums together contribute at most (log K / K)n ≤ δn to total length. This completes the proof of Lemma II.4.5. □

Remark II.4.6. The original proof of the repeated-words theorem, [53], parallels the proof of the entropy theorem, using a counting argument to show that the set B(n) ∩ G(n) must eventually be smaller than 2^{nh} by a fixed exponential factor. The key bound is that the number of ways to select s disjoint subintervals of [1, n], of lengths L_i ≥ (log n)/(h − ε), is upper bounded by exp((h − ε) Σ_i L_i + o(n)). This is equivalent to showing that the coding of the too-long repeated blocks, by telling where they occurred earlier and their lengths, requires at most (h − ε) Σ_i L_i + o(n) bits. The coding proof is given here as it fits in nicely with the general idea that coding cannot beat entropy in the limit.

II.4.c Exercises.

1. Extend the distinct-words and repeated-words theorems to the case where agreement in all but a fixed number of places is required.

2. Extend the repeated-words theorem to the case when it is only required that all but the last k symbols appeared earlier.
3. Carry out the details of the proof that (2) follows from the distinct-words theorem.

4. Carry out the details of the proof that (3) follows from the repeated-words theorem.

5. What can be said about distinct and repeated words in the entropy 0 case?

6. Let μ be the Kolmogorov measure of the binary, equiprobable, i.i.d. process, that is, unbiased coin-tossing.
(a) Show that, eventually almost surely, there are fewer than n^{1−ε/2} words longer than (1 + ε) log n in a parsing of x_1^n into repeated words. (Hint: use Barron's lemma.)
(b) Show that, eventually almost surely, the words longer than (1 + ε) log n in a parsing of x_1^n into repeated words have total length at most n^{1−ε/2} log n.
(c) Show that, eventually almost surely, no parsing of x_1^n into repeated words contains a word longer than 4 log n.

Section II.5 Entropy and recurrence times.

An interesting connection between entropy and recurrence times for ergodic processes was discovered by Wyner and Ziv, [85]; see also the earlier work of Willems, [86]. Wyner and Ziv showed that the logarithm of the waiting time until the first n terms of a sequence x occurs again in x is asymptotic to nh, in probability, where h is the entropy. An almost-sure form of the Wyner-Ziv result was established by Ornstein and Weiss, [53]. These results, along with an application to a prefix-tree problem, will be discussed in this section.

The definition of the recurrence-time function is

    R_n(x) = min{m ≥ 1: x_{m+1}^{m+n} = x_1^n}.

The theorem to be proved is

Theorem II.5.1 (The recurrence-time theorem.) For any ergodic process μ with entropy h,

    lim_{n→∞} (1/n) log R_n(x) = h, almost surely.

This is a sharpening of the fact that the average recurrence time is 1/μ(x_1^n), whose logarithm is asymptotic to nh. Some preliminary results are easy to establish. Define the upper and lower limits

    F(x) = lim sup_n (1/n) log R_n(x),    r(x) = lim inf_n (1/n) log R_n(x).

Since R_{n−1}(Tx) ≤ R_n(x), both the upper and lower limits are subinvariant, that is, F(Tx) ≤ F(x) and r(Tx) ≤ r(x), almost surely.
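The recurrence-time function R_n is straightforward to compute on a finite sample. The following sketch (an illustration added here, not from the text) scans for the first return of the initial n-block; (1/n) log_2 R_n(x) is then the quantity whose almost-sure limit the theorem identifies as h.

```python
def recurrence_time(x, n):
    """R_n(x) = least m >= 1 with x[m:m+n] == x[:n], the first return
    time of the initial n-block; None if the block does not recur
    within the available finite sample."""
    first = x[:n]
    for m in range(1, len(x) - n + 1):
        if x[m:m + n] == first:
            return m
    return None
```

For example, in 0010010... the initial 3-block 001 first recurs at distance 3.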
Since the measure is assumed to be ergodic and both limits are almost surely subinvariant, both are almost surely invariant, so there are constants F and r such that F(x) = F and r(x) = r, almost surely. Since r ≤ F, the desired result can be established by showing that r ≥ h and F ≤ h.

The bound r ≥ h was obtained by Ornstein and Weiss by an explicit counting argument similar to the one used to prove the entropy theorem. The bound F ≤ h will be proved via a nice argument due to Wyner and Ziv. In the proof given below, a more direct coding construction is used to show that if recurrence happens too soon infinitely often, then sample paths are mostly covered by long disjoint blocks which are repeated too quickly, and a code that compresses more than entropy can then be constructed by specifying where to look for the first later occurrence of such blocks. Another proof, based on a different coding idea communicated to the author by Wyner, is outlined in Exercise 1.

To establish F ≤ h, fix ε > 0 and define D_n = {x: R_n(x) ≥ 2^{n(h+ε)}}. The goal is to show that x ∉ D_n, eventually almost surely. It is enough to prove this under the additional assumption that x_1^n is "entropy typical." Indeed, if T_n = {x: μ(x_1^n) ≥ 2^{-n(h+ε/2)}}, then x ∈ T_n, eventually almost surely, so it is enough to prove

(1)    x ∉ D_n ∩ T_n, eventually almost surely,

since the iterated almost sure principle then yields the desired result.

To establish (1), fix a_1^n and consider only those x ∈ D_n for which x_1^n = a_1^n, that is, the set D_n(a_1^n) = D_n ∩ [a_1^n]. Note that if x ∈ D_n(a_1^n), then T^j x ∉ [a_1^n], for j ∈ [1, 2^{n(h+ε)} − 1], by the definition of D_n(a_1^n). Hence the sets T^{-j} D_n(a_1^n), for j ∈ [1, 2^{n(h+ε)} − 1], must be disjoint. Since these sets all have the same measure, disjointness implies that

(2)    μ(D_n(a_1^n)) ≤ 2^{-n(h+ε)}.

Furthermore, the cardinality of the projection onto A^n of D_n ∩ T_n is upper bounded by the cardinality of the projection of T_n, which is upper bounded by 2^{n(h+ε/2)}, so that (2) yields

    μ(D_n ∩ T_n) ≤ 2^{n(h+ε/2)} 2^{-n(h+ε)} = 2^{-nε/2}.

The Borel-Cantelli principle implies that x ∉ D_n ∩ T_n, eventually almost surely, which establishes (1), and hence F ≤ h.

The inequality r ≥ h will be proved by showing that if it is not true, then there is a prefix-code sequence which beats entropy in the limit, which is impossible.
To develop the coding idea, suppose r < h − ε, where ε > 0. A block x_s^t of s − t + 1 consecutive symbols in a sequence x_1^n is said to recur too soon in x_1^n if x_s^t = x_{s+k}^{t+k} for some k in the interval [1, 2^{ℓ(h−ε)}], where ℓ = t − s + 1. The least such k is called the distance from x_s^t to its next occurrence in x_1^n.

Let m be a positive integer and δ a positive number, both to be specified later. For n ≥ m, a concatenation

    x_1^n = u_1 V(1) u_2 V(2) ··· u_J V(J) u_{J+1}

is said to be a (δ, m)-too-soon-recurrent representation of x_1^n if the following hold.

(a) Each V(j) has length at least m and recurs too soon in x_1^n.
(b) The sum of the lengths of the filler words u_i is at most 3δn.

Let G(n) be the set of all x_1^n that have a (δ, m)-too-soon-recurrent representation. It will be shown later that a consequence of the assumption that r < h − ε is that

(3)    x_1^n ∈ G(n), eventually almost surely.

Here it will be shown how a prefix n-code C_n can be constructed which, for suitable choice of m and δ, compresses the members of G(n) too well for all large enough n. The code C_n uses a faithful single-letter code F: A → {0, 1}^d, where d ≤ 2 + log |A|, such that each code word starts with a 0. The code also uses an Elias code E, a prefix code on the natural numbers such that ℓ(E(j)) = log j + o(log j). Sequences not in G(n) are coded by applying F to successive symbols. For x_1^n ∈ G(n), a (δ, m)-too-soon-recurrent representation x_1^n = u_1V(1)u_2V(2)···u_JV(J)u_{J+1} is determined, and successive words are coded using the following rules.

(i) Each filler word u_i is coded by applying F to its successive symbols.
(ii) Each V(j) is coded using a prefix 1, followed by E(ℓ(V(j))), followed by E(k_j), where k_j is the distance from V(j) to its next occurrence in x_1^n.

A one bit header to tell whether or not x_1^n ∈ G(n) is also used, to guarantee that C_n is a prefix n-code; the one bit headers on F and in the encoding of each V(j) specify which code is being used. The key facts here are that log k_j ≤ ℓ(V(j))(h − ε), by the definition of "too-soon recurrent," and that the principal contribution to total code length is Σ_j log k_j ≤ n(h − ε). The resulting bound is stated as the following lemma.

Lemma II.5.2. For suitable choice of m and δ,

(4)    ℓ(x_1^n) ≤ n(h − ε) + n(3dδ + α_m),    x_1^n ∈ G(n),

where α_m → 0 as m → ∞.

The lemma implies the desired result, for by choosing δ small enough and m large enough, it yields ℓ(x_1^n) ≤ n(h − ε/2) on G(n), for all large enough n. But too-good codes do not exist, so μ(G(n)) must go to 0, contradicting the fact (3) that x_1^n ∈ G(n), eventually almost surely. Thus the assumption r < h − ε is impossible, and hence r ≥ h, establishing the recurrence-time theorem, once (3) is proved.

Proof of Lemma II.5.2. Since each V(j) is assumed to have length at least m, there are at most n/m of the V(j), so the one bit headers on the encoding of each V(j) contribute at most n/m bits to total code length. The encoding of each filler u_i requires dℓ(u_i) bits, and the total length of the fillers is at most 3δn, so the complete encoding of all the fillers requires at most 3dδn bits. The Elias encoding of the length ℓ(V(j)) requires log ℓ(V(j)) + o(log ℓ(V(j))) bits; since (1/x) log x is decreasing for x ≥ e, the total contribution of the first terms is at most

    Σ_j log ℓ(V(j)) ≤ n (log m)/m.

The Elias encoding of the distance to the next occurrence of V(j) requires

(5)    ℓ(V(j))(h − ε) + o(ℓ(V(j))) bits,

by the definition of "too-soon recurrent." Summing the first term in (5) over j yields the principal term n(h − ε) in (4), while, summing on j ≤ n/m, the second terms contribute at most nβ_m bits, where β_m → 0 as m → ∞. In summary, the complete encoding of the too-soon recurring words V(j) requires a total of at most n(h − ε) + nα_m bits, where α_m = 2β_m + (1 + log m)/m → 0 as m → ∞, for fixed m and δ. This completes the proof of Lemma II.5.2. □

It remains to be shown that x_1^n ∈ G(n), eventually almost surely, a fact that follows from the assumption that r < h − ε. This is a covering/packing argument. To carry it out, assume r < h − ε and, for each n ≥ 1, define B_n = {x: R_n(x) ≤ 2^{n(h−ε)}}. Since lim inf_n (1/n) log R_n(x) = r < h − ε, almost surely, each x belongs to infinitely many of the B_n, so there is an M such that μ(B) ≥ 1 − δ, for B = ∪_{n=m}^{M} B_n. The ergodic theorem implies that (1/N) Σ_{i=1}^{N} 1_B(T^i x) > 1 − 2δ, eventually almost surely, and hence the packing lemma implies that, for almost every x,
there is an N(x) ≥ M/δ such that if n ≥ N(x) then x_1^n admits a parsing with disjoint blocks V(j) = x_{n_j}^{m_j}, each of length at least m, such that

(a) each V(j) recurs too soon, that is, x_{n_j}^{m_j} reappears starting at some index i ∈ [n_j + 1, n_j + 2^{(m_j − n_j + 1)(h − ε)}];

(b) Σ_j (m_j − n_j + 1) ≥ (1 − 2δ)n;

and hence the packing lemma implies that, for almost every x, x_1^n ∈ G(n), for all sufficiently large n. This completes the proof that x_1^n ∈ G(n), eventually almost surely.

By making n larger, if necessary, it can be supposed that n ≥ 2^{M(h−ε)}/δ, so that if J is the largest index i for which n_i + 2^{M(h−ε)} < n, then condition (b) can be replaced by

(c) Σ_{i ≤ J} (m_i − n_i + 1) ≥ (1 − 3δ)n,

that is, the total length of the V(j) is at least (1 − 3δ)n. The sequence x_1^n can then be expressed as a concatenation

(6) x_1^n = u_1 V(1) u_2 V(2) ··· u_J V(J) u_{J+1},

where each u_i is a (possibly empty) word. Now code x_1^n by specifying the u_i literally and specifying each V(j) by an Elias encoding of the distance to its next occurrence. The Elias encoding of the distance to the next occurrence of V(j) requires

(5) ℓ(V(j))(h − ε) + o(ℓ(V(j)))

bits, a fact that follows from the assumption that V(j) recurs within 2^{ℓ(V(j))(h−ε)} steps. Since each V(j) has length at least m, there are at most n/m such V(j), and hence the sum over j of the second term in (5) is upper bounded by nδ, for fixed m and δ, provided n is large. The total code length is therefore too small to be consistent with the entropy theorem, for example, if it is assumed that r < h − ε, and this contradiction finishes the proof of the recurrence-time theorem.

Remark II.5.3
The recurrence-time theorem suggests a Monte Carlo method for estimating entropy, namely, select indices i_1, i_2, ..., i_t independently at random in [1, n], and take (1/(kt)) Σ_{l=1}^{t} log R_k(T^{i_l} x) as an estimate of h. This works reasonably well for suitably nice processes, and can serve as a quick way to test the performance of a coding algorithm.

II.5.a Entropy and prefix trees.

Closely related to the recurrence concept is the prefix tree concept. A sequence x ∈ A^∞ together with its first n − 1 shifts, Tx, T²x, ..., T^{n−1}x, defines a tree, T_n(x), called the n-th order prefix tree determined by x, as follows. For each 0 ≤ i < n, let W(i) = W(i, n) = W(i, n, x) be the shortest prefix of T^i x which is not a prefix of T^j x, for any j ≠ i, 0 ≤ j < n. Since, by definition, the set {W(i): i = 0, 1, ..., n − 1} is prefix-free, it defines a tree, T_n(x). For example, if

x = 001010010010...
Tx = 01010010010...
T²x = 1010010010...
T³x = 010010010...
T⁴x = 10010010...

then W(0) = 00, W(1) = 0101, W(2) = 101, W(3) = 0100, W(4) = 100, where underlining is used in the text to indicate the respective prefixes. The prefix tree T_5(x) is shown in Figure 1.

In practice, algorithms for encoding are often formulated in terms of tree structures, as tree searches are easy to program. Prefix trees are used to model computer storage algorithms, in which context they are usually called suffix trees. In particular, the simplest form of Lempel-Ziv parsing grows a subtree of the prefix tree.
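To make the definition concrete, here is a small illustrative sketch (the function name is mine, not the book's) that computes the words W(i) = W(i, n, x) directly from the definition, using a long enough finite initial segment of x in place of the infinite sequence. Applied to the example above, it recovers W(0) = 00 through W(4) = 100.

```python
def prefix_tree_words(x, n):
    """Compute W(i): the shortest prefix of T^i x that is not a prefix of
    T^j x for any j != i, 0 <= j < n.  Here x is a (long enough) finite
    initial segment of the sequence, given as a string."""
    words = []
    for i in range(n):
        ell = 1
        while True:
            w = x[i:i + ell]
            assert len(w) == ell, "initial segment of x is too short"
            # w qualifies if no other shift shares it as a prefix
            if all(x[j:j + ell] != w for j in range(n) if j != i):
                words.append(w)
                break
            ell += 1
    return words

# the book's example: x = 001010010010...
EXAMPLE = prefix_tree_words("001010010010", 5)
```

Since the W(i) form a prefix-free set, no word in the output is a proper prefix of another, which is exactly what lets them be arranged as leaves of the tree T_n(x).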
This led Grassberger, [16], to propose that use of the full prefix tree might lead to more rapid compression, as it reflects more of the structure of the data. He suggested that if the depth of node W(i) is the length L_i^n(x) = ℓ(W(i)), then average depth should be asymptotic to (1/h) log n, at least in probability. While this is true in the i.i.d. case, and some other cases, it is not true in the general ergodic case. The mean length is certainly asymptotically almost surely no smaller than (1/h) log n, but there is no way, in general, to keep the longest W(i) from being much larger than O(log n), which kills the possibility of a general limit result for the mean. What is true in the general case is the following theorem.

Theorem II.5.4 (The prefix-tree theorem.)
If μ is an ergodic process of entropy h > 0, then for any ε > 0 and almost every x, there is an integer N = N(x, ε) such that for n ≥ N, all but εn of the numbers L_i^n(x)/log n are within ε of 1/h.

Note that the prefix-tree theorem implies a trimmed mean result. The (1 − ε)-trimmed mean of a set of cardinality M is obtained by deleting the largest Mε/2 and smallest Mε/2 numbers in the set and taking the mean of what is left. The prefix-tree theorem implies that the (1 − ε)-trimmed mean of the set {L_i^n(x)} is almost surely asymptotic to (log n)/h. The {L_i^n} are bounded below, so the average depth (1/n) Σ_i L_i^n(x) will be asymptotic to (log n)/h in those cases where there is a constant C such that max_i L_i^n(x) ≤ C log n, eventually almost surely. This is equivalent to the string-matching bound L(x_1^n) = O(log n), where, as earlier, L(x_1^n) is the length of the longest string that appears twice in x_1^n, since if L_i^n(x) = k then, by definition of L_i^n(x), there is a j ∈ [1, n], j ≠ i, such that x_i^{i+k−2} = x_j^{j+k−2}.

There is a class of processes, the finite energy processes, for which a simple coding argument gives L(x_1^n) = O(log n). An ergodic process μ has finite energy if there are constants c < 1 and K such that

μ(x_{t+1}^{t+L} | x_1^t) ≤ K c^L,

for all t ≥ 1, for all L ≥ 1, and for all x_1^{t+L} of positive measure. The i.i.d. processes, mixing Markov processes, and functions of mixing Markov processes all have finite energy; see Exercise 2. Another type of finite energy process is obtained by adding i.i.d. noise to a given ergodic process, such as, for example, the binary process defined by X_n = Y_n + Z_n (mod 2), where {Y_n} is an arbitrary binary ergodic process and {Z_n} is binary i.i.d. and independent of {Y_n}. (Adding noise is often called "dithering," and is a frequently used technique in image compression.)

Theorem II.5.5 (The finite-energy theorem.)
If μ is an ergodic finite energy process with entropy h > 0, then there is a constant C such that L(x_1^n) ≤ C log n, eventually almost surely.

A general limit result for the mean itself fails, however; one type of counterexample is stated as Theorem II.5.6, below.
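A brute-force sketch (my own, not from the book) of the two quantities just discussed: L(x_1^n), the length of the longest block appearing at two distinct start positions, and the (1 − ε)-trimmed mean.

```python
def longest_repeat(x):
    """L(x_1^n): length of the longest block appearing at two distinct
    start positions in x (the occurrences may overlap)."""
    n = len(x)
    for ell in range(n - 1, 0, -1):       # try lengths from longest down
        seen = set()
        for i in range(n - ell + 1):
            w = x[i:i + ell]
            if w in seen:
                return ell
            seen.add(w)
    return 0

def trimmed_mean(values, eps):
    """(1 - eps)-trimmed mean: drop the largest and smallest M*eps/2
    values, then average what is left."""
    m = int(len(values) * eps / 2)
    vals = sorted(values)
    kept = vals[m:len(vals) - m] if m > 0 else vals
    return sum(kept) / len(kept)
```

For the example sequence 001010010010 used earlier, the longest repeated block is 010010, of length 6.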
Theorem II.5.6
There is a stationary coding F of the unbiased coin-tossing process μ, and an increasing sequence {n_k}, such that

(7) lim_k (1/(n_k log n_k)) Σ_{i<n_k} L_i^{n_k}(x) = ∞, almost surely,

with respect to the measure ν = μ∘F^{−1}.

Remark II.5.7
The simplest version of Lempel-Ziv parsing, x = w(1)w(2)···, is the one in which the next word is the shortest new word: w(i) has the form w(i) = w(j)a, for some j < i and a ∈ A, and is assigned to the node corresponding to a, whose predecessor node was labeled w(j); the parsing thus grows a sequence of |A|-ary labeled trees. The limitations of computer memory mean that at some stage the tree can no longer be allowed to grow; several choices are possible at that stage, each of which produces a different algorithm. In the simplest of these, the tree grows to a certain size, then remains fixed. From that point on, the remainder of the sequence is parsed into words, each of which is one symbol shorter than the shortest new word, and subsequent compression is obtained by giving the successive indices of these words in the fixed tree. (See Exercise 3.) Such finite LZ-algorithms do not compress to entropy in the limit, but do achieve something reasonably close to optimal compression for most finite sequences for nice classes of sources; see [87, 88] for some finite version results. Performance can be improved in some cases by designing a finite tree that better reflects the actual structure of the data; prefix trees may perform better than the LZW tree, for they tend to have more long words, which means more compression.
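The shortest-new-word parsing just described can be sketched as follows (an illustrative implementation under my own naming conventions; the final, possibly already-seen, leftover word is returned separately).

```python
def lz_parse(x):
    """Parse x into successive shortest new words w(1)w(2)...; each new
    word extends an earlier word (or the empty word) by one symbol."""
    words, seen = [], set()
    i = 0
    leftover = ""
    while i < len(x):
        ell = 1
        while i + ell <= len(x) and x[i:i + ell] in seen:
            ell += 1
        w = x[i:i + ell]
        if w in seen:
            # ran off the end of x before a new word appeared
            leftover = w
        else:
            words.append(w)
            seen.add(w)
        i += ell
    return words, leftover

WORDS, LEFTOVER = lz_parse("001010010010")
```

Each complete word, minus its final symbol, is either empty or an earlier word; this is the tree-growing property noted above.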
The prefix-tree and finite-energy theorems will be proved, and the counterexample construction will be described, in the following subsections.

II.5.a.1 Proof of the prefix-tree theorem.

It is enough to prove the following two results. For ε > 0, define the sets

L(x, n) = {i ∈ [0, n − 1]: L_i^n(x) ≤ (1 − ε)(log n)/h},
U(x, n) = {i ∈ [0, n − 1]: L_i^n(x) ≥ (1 + ε)(log n)/h}.

(8) |L(x, n)| ≤ εn, eventually almost surely.
(9) |U(x, n)| ≤ εn, eventually almost surely.

The fact that most of the prefixes cannot be too short, (8), can be viewed as an overlapping form of the distinct-words theorem, and is a consequence of the fact that, for fixed n, the W(i, n) are distinct, for the longest prefix is only one longer than a word that has appeared twice. The fact that most of the prefixes cannot be too long, (9), is a consequence of the recurrence-time theorem. The details of the proof that most of the words cannot be too short will be given first. To control the location of the prefixes an "almost uniform eventual typicality" idea will be needed. For δ > 0 and each k, define

T_k(δ) = {x_1^k: μ(x_1^k) ≥ 2^{−k(h+δ)}},

so that |T_k(δ)| ≤ 2^{k(h+δ)}.
By the entropy theorem, there is an M such that the set

G(M) = {x: x_1^k ∈ T_k(δ), for all k ≥ M}

has measure greater than 1 − δ. The ergodic theorem implies that, for almost every x, the shift T^i x belongs to G(M) for all but at most a limiting δ-fraction of the indices i. This result is summarized in the following "almost uniform eventual typicality" form.

Lemma II.5.8
For almost every x there is an integer N(x) such that if n ≥ N(x) then for all but at most 2δn indices i ∈ [0, n − 1], the i-fold shift T^i x belongs to G(M).

To proceed with the proof that words cannot be too short, fix ε > 0. For a given sequence x and integer n, an index i < n will be called good if L_i^n(x) ≥ M and W(i) = W(i, n, x) ∈ ∪_{k≥M} T_k(δ). Thus, by making the N(x) of Lemma II.5.8 larger, if necessary, it can be supposed that if n ≥ N(x) then the set of non-good indices has cardinality at most 3δn. Since the prefixes W(i) are all distinct, there are at most |A|^M indices i < n such that L_i^n(x) ≤ M. Suppose k ≥ M and consider the set of good indices i for which L_i^n(x) = k. There are at most 2^{k(h+δ)} such indices, because the corresponding W(i, n, x) are distinct members of the collection T_k(δ), which has cardinality at most 2^{k(h+δ)}. Since Σ_{k≤J} 2^{k(h+δ)} ≤ (constant) 2^{J(h+δ)}, it follows that, for any J ≥ M, there are at most (constant) 2^{J(h+δ)} good indices i for which M ≤ L_i^n(x) ≤ J. Thus if n ≥ N(x) is sufficiently large and δ is sufficiently small there will be at most εn/2 good indices i for which L_i^n(x) ≤ (1 − ε)(log n)/h. This, combined with the bound on the number of non-good indices of the preceding paragraph, which can be assumed to be less than εn/2, completes the proof that eventually almost surely most of the prefixes cannot be too short, (8).

Next the upper bound result, (9), will be established by using the fact that, if a too-long block appears twice, then it has too-short return-time. The recurrence-time theorem shows that this can only happen rarely. Both a forward and a backward recurrence-time theorem will be needed. The corresponding recurrence-time functions are respectively defined by

F_k(x) = min{m ≥ 1: x_{m+1}^{m+k} = x_1^k},
B_k(x) = min{m ≥ 1: x_{−m−k+1}^{−m} = x_{−k+1}^{0}}.

The recurrence-time theorem directly yields the forward result,

lim_k (1/k) log F_k(x) = h, a.s.,

and when applied to the reversed process it yields the backward result,

lim_k (1/k) log B_k(x) = h, a.s.,

since the reversed process is ergodic and has the same entropy as the original process. (See Exercise 7.)
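The Monte Carlo estimate of Remark II.5.3 uses exactly the forward recurrence time just defined, evaluated along a single sample path. Here is an illustrative sketch (my own conventions, not the book's code): for a fair-coin path, with h = 1, the averaged values (1/k) log₂ R_k should cluster near 1.

```python
import random
from math import log2

def recurrence_time(x, i, k):
    """Smallest m >= 1 with x[i+m : i+m+k] == x[i : i+k], or None if the
    k-block starting at i does not recur inside the sample."""
    pattern = x[i:i + k]
    m = 1
    while i + m + k <= len(x):
        if x[i + m:i + m + k] == pattern:
            return m
        m += 1
    return None

def entropy_estimate(x, k, trials, rng):
    """(1/(k*t)) * sum of log2 R_k(T^i x) over t random indices i."""
    total, used = 0.0, 0
    for _ in range(trials):
        i = rng.randrange(len(x) - 2 * k)
        r = recurrence_time(x, i, k)
        if r is not None:          # ignore the rare non-recurrences
            total += log2(r)
            used += 1
    return total / (k * used)

# fair-coin sample path: entropy h = 1 bit per symbol
rng = random.Random(0)
x = "".join(rng.choice("01") for _ in range(20000))
EST = entropy_estimate(x, k=8, trials=50, rng=rng)
```

With k = 8 and 50 sampled indices the estimate is crude but lands well within the right range, which is all the remark claims for "suitably nice" processes.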
These limit results imply that, for almost every x, there is a K = K(x, ε) such that if k ≥ K then

log F_k(x) ≥ kh(1 + ε/4)^{−1} and log B_k(x) ≥ kh(1 + ε/4)^{−1}.

Define the two sets

F = {x: log F_k(x) < kh(1 + ε/4)^{−1}}, B = {x: log B_k(x) < kh(1 + ε/4)^{−1}},

each of measure at most ε/4, for all sufficiently large k. The ergodic theorem implies that for almost every x there is an integer N(x) such that if n ≥ N(x), then T^i x ∈ F for at most εn/3 indices i ∈ [0, n − 1], and T^i x ∈ B for at most εn/3 indices i ∈ [0, n − 1]. Fix such an x and assume n ≥ N(x). Let k be the least integer that exceeds (1 + ε/2) log n/h. By making n larger, if necessary, it can be supposed that both of the following hold.

(a) K ≤ (1 + ε/2) log n/h.
(b) k + 1 ≤ (1 + ε) log n/h.

Assume i ∈ U(x, n). Condition (b) implies that L_i^n > k, so that, by the definition of W(i, n), there is an index j ∈ [0, n − 1], j ≠ i, such that x_{i+1}^{i+k} = x_{j+1}^{j+k}. The two cases i < j and i > j will be discussed separately.

Case 1. i < j. In this case, the k-block starting at i recurs in the future within the next n steps, that is, F_k(T^i x) ≤ n, so that

(1/k) log F_k(T^i x) ≤ (log n)/k < h(1 + ε/4)^{−1},

which means that T^i x ∈ F.

Case 2. i > j. If i + k ≤ n, the k-block ending at i + k has occurred in the past within n steps, that is, B_k(T^{i+k} x) ≤ n, which means that T^{i+k} x ∈ B.

In summary, if i ∈ U(x, n), then T^i x ∈ F, or T^{i+k} x ∈ B, or i + k > n. Thus, if n also satisfies k < εn/3, then |U(x, n)| ≤ εn. This completes the proof of the prefix-tree theorem.

II.5.a.2 Proof of the finite-energy theorem.

The idea of the proof is suggested by earlier code constructions, namely, code the second occurrence of the longest repeated block by telling its length and where it was seen earlier, and use an optimal code on the rest of the sequence. Here "optimal code" means Shannon code; a Shannon code for a measure μ is a prefix code such that ℓ(C(w)) = ⌈log 1/μ(w)⌉, for each w. An upper bound on the length of the longest repeated block is then obtained by comparing this code with an optimal code. The key to making this method yield the desired bound is Barron's almost-sure code-length bound, see Remark 1.13, which asserts that no prefix code can asymptotically beat the Shannon code by more than 2 log n. A version of Barron's bound sufficient for the current result asserts that for any prefix-code sequence {C_n},

(10) ℓ(C_n(x_1^n)) + log μ(x_1^n) ≥ −2 log n, eventually almost surely.
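The Shannon code just defined assigns each word w the length ⌈log 1/μ(w)⌉; a quick numerical sketch (mine, not from the book) checks that these lengths satisfy the Kraft inequality, which is what guarantees that a prefix code with exactly these lengths exists.

```python
from itertools import product
from math import ceil, log2

def shannon_lengths(mu1, n):
    """Shannon code lengths ceil(log2 1/mu(w)) for all w in A^n, under
    the i.i.d. product measure built from the 1-block distribution mu1."""
    lengths = {}
    for w in product(mu1, repeat=n):
        p = 1.0
        for a in w:
            p *= mu1[a]
        lengths[w] = ceil(log2(1.0 / p))
    return lengths

LENS = shannon_lengths({"0": 0.3, "1": 0.7}, 6)
KRAFT = sum(2.0 ** -ell for ell in LENS.values())
```

Since 2^{−⌈log 1/p⌉} lies between p/2 and p, the Kraft sum is at most 1 (prefix-freeness is achievable) and greater than 1/2 (the lengths waste less than one bit per word).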
The details of the coding idea can be filled in as follows. Suppose a block of length L occurs at position s, and then again at position t + 1 > s in the sequence x_1^n. The block x_1^t is encoded by specifying its length t, then using a Shannon code with respect to the given measure μ. The block x_{t+1}^{t+L} is encoded by specifying s and L. The final block x_{t+L+1}^n is encoded using a Shannon code with respect to the conditional measure μ(· | x_1^{t+L}). The prefix property is guaranteed by encoding the lengths with the Elias code e on the natural numbers, that is, a prefix code for which ℓ(e(n)) = log n + o(log n). Since t, s, and L must be encoded, the total code length satisfies

(11) ℓ(C_n(x_1^n)) ≤ 3 log n + log(1/μ(x_1^t)) + log(1/μ(x_{t+L+1}^n | x_1^{t+L})),

where the extra o(log n) terms required by the Elias codes and the extra bits needed to round the logarithms up to integers are ignored, as they are asymptotically negligible.

To compare code length with the length of the Shannon code of x_1^n with respect to μ, the probability μ(x_1^n) is factored to produce

log μ(x_1^n) = log μ(x_1^t) + log μ(x_{t+1}^{t+L} | x_1^t) + log μ(x_{t+L+1}^n | x_1^{t+L}).

This is now added to the code length (11) to yield

ℓ(C_n(x_1^n)) + log μ(x_1^n) ≤ 3 log n + log μ(x_{t+1}^{t+L} | x_1^t),

so Barron's lemma gives

3 log n + log μ(x_{t+1}^{t+L} | x_1^t) ≥ −2 log n, eventually almost surely,

to which an application of the finite energy assumption, μ(x_{t+1}^{t+L} | x_1^t) ≤ Kc^L, yields

lim sup_n L(x_1^n)/log n ≤ 5/(−log c), almost surely,

since c < 1. This completes the proof of the finite-energy theorem.

II.5.a.3 A counterexample to the mean conjecture.

Let μ denote the measure on two-sided binary sequences {0, 1}^Z given by unbiased coin-tossing, that is, the shift-invariant measure defined by requiring that μ(a_1^n) = 2^{−n}, for each n and each a_1^n ∈ {0, 1}^n. It will be shown that given ε > 0, there is a measurable shift-invariant function F: {0, 1}^Z → {0, 1}^Z and an increasing sequence {n_k} of positive integers with the following properties.

(a) If y = F(x) then, with probability greater than 1 − 2^{−k}, there is an i in the range 0 ≤ i < n_k − n_k^{3/4}, such that y_{i+j} = 0, 0 ≤ j ≤ n_k^{3/4}.

(b) μ({x: x_0 ≠ (F(x))_0}) ≤ ε.

The process ν = μ∘F^{−1} will then satisfy the conditions of Theorem II.5.6, for property (a) guarantees that, eventually almost surely, L_i^{n_k}(y) ≥ n_k^{5/8} for at least n_k^{3/4}/8 indices i < n_k, so that Σ_{i<n_k} L_i^{n_k}(y) ≥ n_k^{5/4}, and hence (7) holds, almost surely.
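An Elias code with ℓ(e(n)) = log n + o(log n) can be realized by the Elias delta code; the sketch below (my own, with bits represented as strings of '0'/'1') encodes the binary length of n with the simpler Elias gamma scheme, then appends n without its leading 1. Prefix-freeness is what allows a concatenated stream of such code words, as in the proof above, to be decoded unambiguously.

```python
def elias_delta(n):
    """Elias delta code word for n >= 1, a bit-string of length
    floor(log2 n) + 2*floor(log2(floor(log2 n) + 1)) + 1."""
    assert n >= 1
    nb = bin(n)[2:]            # binary of n, leading 1 included
    lb = bin(len(nb))[2:]      # gamma-code the length of nb:
    return "0" * (len(lb) - 1) + lb + nb[1:]

def elias_decode_stream(bits):
    """Decode a concatenation of delta code words back to integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i + z] == "0":          # gamma part: count the zeros
            z += 1
        L = int(bits[i + z:i + 2 * z + 1], 2)   # length of binary(n)
        i += 2 * z + 1
        out.append(int("1" + bits[i:i + L - 1], 2))
        i += L - 1
    return out
```

The code length grows like log n plus a doubly-logarithmic correction, so it meets the ℓ(e(n)) = log n + o(log n) requirement.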
Remark II.5.9
The prefix tree theorem was proved in [72], which also contains the counterexample construction given here; after its discovery it was learned that the theorem was proved using a different method in [31]. The asymptotics of the length of the longest and shortest word in the Grassberger tree are discussed in [82] for processes satisfying memory decay conditions. The coding proof of the finite-energy theorem is new.

The coding F can be constructed by only a minor modification of the coding used in the string-matching example, discussed in Section I.8 as an illustration of the block-to-stationary coding method. In that example, blocks of 0's of length 2n_k^{1/2} were created; replacing 2n_k^{1/2} by n_k^{3/4} will produce the coding F needed here. Property (b) is not really needed, but it is added to emphasize that such a ν can be produced by making an arbitrarily small limiting density of changes in the sample paths of the i.i.d. process μ; in particular, the entropy of ν can be as close to log 2 as desired.

II.5.b Exercises.

1. Another proof, due to Wyner, that r ≥ h is based on the following coding idea. A function f: A^∞ → B* will be said to be conditionally invertible if x_{−∞}^n = y_{−∞}^n whenever f(x_{−∞}^n) = f(y_{−∞}^n) and x_{−∞}^0 = y_{−∞}^0.

(a) Show that lim inf_n ℓ(f(x_{−∞}^n))/n ≥ h, for any ergodic measure μ of entropy h. (Hint: let B_n = {x_{−∞}^n: μ(x_1^n | x_{−∞}^0) ≥ 2^{−n(H−ε)}} and C_n = {x_{−∞}^n: ℓ(f(x_{−∞}^n)) ≤ 2^{n(H−2ε)}}, where H is the entropy-rate of μ, and show that Σ_n μ(B_n ∩ C_n | x_{−∞}^0) is finite.)

(b) Deduce that the entropy of a process and its reversed process are the same.

(c) Define f(x_{−∞}^n) to be the minimum m > n for which x_{m+1}^{m+n} = x_1^n and apply the preceding result.

(d) The preceding argument establishes the recurrence result for the reversed process. Show that this implies the result for μ.

2. Show that if L(x_1^n) = O(log n), eventually almost surely, then max_i L_i^n(x) = O(log n), eventually almost surely. (Hint: L(x_1^{4n}) ≤ K log n implies max_i L_i^n(x) ≤ K log n.)

3. Show that dithering an ergodic process produces a finite-energy process.
Chapter III

Entropy for restricted classes.

Section III.1 Rates of convergence.

In this section μ denotes a stationary, ergodic process with alphabet A, with μ_k denoting the projection onto A^k, defined by μ_k(a_1^k) = μ({x: x_1^k = a_1^k}). The k-th order empirical distribution, or k-type, is the measure p_k = p_k(· | x_1^n) on A^k defined by

p_k(a_1^k | x_1^n) = |{i ∈ [1, n − k + 1]: x_i^{i+k−1} = a_1^k}| / (n − k + 1), a_1^k ∈ A^k.

The distance between μ_k and p_k will be measured using the distributional (variational) distance,

|μ_k − p_k| = Σ_{a_1^k} |μ_k(a_1^k) − p_k(a_1^k | x_1^n)|.

The ergodic theorem implies that if k and ε are fixed then

μ({x_1^n: |μ_k − p_k(· | x_1^n)| ≥ ε}) → 0, as n → ∞.

Likewise, the entropy theorem implies that

μ_n({x_1^n: 2^{−n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{−n(h−ε)}}) → 1, as n → ∞.

The rate of convergence problem is to determine the rates at which these convergences take place. Many of the processes studied in classical probability theory, such as i.i.d. processes, Markov processes, and finite-state processes, have exponential rates of convergence for frequencies of all orders and for entropy. These results will be developed in this section. Some standard processes, however, such as renewal processes, do not have exponential rates. In the general ergodic case, no uniform rate is possible: given any convergence rate for frequencies or for entropy, there is an ergodic process that does not satisfy the rate.
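The empirical k-block distribution and the variational distance can be computed directly; the following is an illustrative sketch with function names of my own choosing, checked against the binary example sequence used in Chapter II.

```python
from collections import Counter

def empirical(x, k):
    """p_k(.|x_1^n): overlapping k-block frequencies in the string x."""
    wins = [x[i:i + k] for i in range(len(x) - k + 1)]
    counts = Counter(wins)
    total = len(wins)
    return {w: c / total for w, c in counts.items()}

def variational(p, q):
    """|p - q| = sum over blocks a of |p(a) - q(a)|."""
    keys = set(p) | set(q)
    return sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in keys)

P1 = empirical("001010010010", 1)
P2 = empirical("001010010010", 2)
```

For the sample above, p_1 puts mass 8/12 on 0 and 4/12 on 1, so its variational distance from the fair-coin marginal (1/2, 1/2) is 1/3.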
A mapping r_k: R^+ × Z^+ → R^+, where R^+ denotes the positive reals and Z^+ the positive integers, is called a rate function for frequencies for a process μ if for each fixed k and ε > 0,

μ_n({x_1^n: |μ_k − p_k(· | x_1^n)| ≥ ε}) ≤ r_k(ε, n),

where r_k(ε, n) → 0 as n → ∞. An ergodic process is said to have exponential rates for frequencies if there is a rate function such that for each ε > 0 and k, (1/n) log r_k(ε, n) is bounded away from 0. Likewise, a mapping r: R^+ × Z^+ → R^+ is called a rate function for entropy for a process μ of entropy h if for each ε > 0,

μ_n({x_1^n: 2^{−n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{−n(h−ε)}}) ≥ 1 − r(ε, n),

where r(ε, n) → 0 as n → ∞. An ergodic process has exponential rates for entropy if (1/n) log r(ε, n) is bounded away from 0, for each ε > 0.

Remark III.1.1
An unrelated concept with a similar name, called speed of convergence, is concerned with what happens in the ergodic theorem when (1/n) Σ f(T^i x) is replaced by (1/a_n) Σ f(T^i x), for some unbounded, nondecreasing sequence {a_n}. While ideas from that theory will be used, it will not be the focus of interest here. See [32] for a discussion of this topic. (The determination of the best value of the exponent is known as "large deviations" theory.)

In this section the following theorem will be proved.

Theorem III.1.2 (Exponential rates for i.i.d. processes.)
If μ is i.i.d., then it has exponential rates for frequencies and for entropy.

The theorem is proved by establishing the first-order case; the overlapping-block problem is then reduced to the non-overlapping-block problem, which is, in turn, treated by applying the first-order result to the larger alphabet of blocks. The method is then extended to Markov and other processes that have appropriate asymptotic independence properties. Examples of processes without exponential rates, and even more general rates, will be discussed in the final subsection.

III.1.a Exponential rates for i.i.d. processes.

Exponential rates of convergence for i.i.d. processes will be established in this subsection. Exponential rates for entropy follow immediately from the first-order frequency theorem, using the fact that μ_n(x_1^n) is a continuous function of p_1(· | x_1^n). The first-order theorem is a consequence of the bound given in the following lemma, a combination of a large deviations bound due to Hoeffding and Sanov and an inequality of Pinsker, [22, 62, 58].

Lemma III.1.3 (First-order rate bound.)
There is a positive constant c such that

(1) μ_n({x_1^n: |p_1(· | x_1^n) − μ_1| ≥ ε}) ≤ (n + 1)^{|A|} 2^{−ncε²},

for any finite set A, for any n, and for any i.i.d. process μ with alphabet A. The bound decays exponentially, since (n + 1)^{|A|} 2^{−ncε²} = exp(−n(cε² − δ_n)), where δ_n → 0 as n → ∞; thus Lemma III.1.3 gives the desired exponential-rate theorem for first-order frequencies.

Proof. A direct calculation, using the fact that μ is a product measure, produces

(2) μ(x_1^n) = Π_{a∈A} μ(a)^{n p_1(a|x_1^n)} = 2^{−n(H(p_1) + D(p_1‖μ_1))},

where

H(p_1) = H(p_1(· | x_1^n)) = −Σ_a p_1(a | x_1^n) log p_1(a | x_1^n)

is the entropy of the empirical 1-block distribution, and

D(p_1‖μ_1) = Σ_a p_1(a | x_1^n) log [p_1(a | x_1^n)/μ_1(a)]

is the divergence of p_1(· | x_1^n) relative to μ_1. The desired result will follow from an application of the theory of type classes discussed in Section I.6. The type class of x_1^n is the set

T(x_1^n) = {y_1^n: p_1(· | y_1^n) = p_1(· | x_1^n)}.

The two key facts, from Section I.6, are

(a) |T(x_1^n)| ≤ 2^{nH(p_1)};
(b) there are at most (n + 1)^{|A|} type classes.

Fact (a), in conjunction with the product formula (2), produces the bound

μ(T(x_1^n)) ≤ 2^{−nD(p_1‖μ_1)}.

The bad set B(n, ε) = {x_1^n: |p_1(· | x_1^n) − μ_1| ≥ ε} can be partitioned into disjoint sets of the form B(n, ε) ∩ T(x_1^n), hence the bound on the measure of the type class, together with fact (b), produces the bound

(3) μ(B(n, ε)) ≤ (n + 1)^{|A|} 2^{−nD*},

where

D* = min{D(p_1‖μ_1): |p_1(· | x_1^n) − μ_1| ≥ ε}.

Since D(P‖Q) ≥ |P − Q|²/(2 ln 2) is always true, see Pinsker's inequality, Exercise 6, it follows that

D* ≥ ε²/(2 ln 2),

which completes the proof of Lemma III.1.3.

The extension to kth order frequencies is carried out by reducing the overlapping-block problem to k separate non-overlapping-block problems. To assist in this task some notation and terminology will be developed. For n ≥ 2k define integers t = t(n, k) and r ∈ [0, k − 1] such that n = tk + k + r. For each s ∈ [0, k − 1], the sequence x_1^n can be expressed as the concatenation

(4) x_1^n = w(0)w(1)···w(t)w(t + 1),

where the first word w(0) has length s ≤ k, the next t words, w(1), ..., w(t), all have length k, and the final word w(t + 1) has length n − tk − s. This is called the s-shifted, k-block parsing of x_1^n. The sequences x_1^n and y_1^n are said to be (s, k)-equivalent if their s-shifted, k-block parsings have the same (ordered) set {w(1), ..., w(t)} of k-blocks. The (s, k)-equivalence class of x_1^n is denoted by S_k(x_1^n, s).
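Formula (2) is an exact identity, not an estimate; the following numerical sketch (my own) confirms it on a small example: the product-measure probability of x_1^n equals 2^{−n(H(p_1)+D(p_1‖μ_1))} to within floating-point error.

```python
from math import log2

def product_prob(x, mu1):
    """mu(x_1^n) for the i.i.d. measure with 1-block distribution mu1."""
    p = 1.0
    for a in x:
        p *= mu1[a]
    return p

def H_and_D(x, mu1):
    """Entropy of the empirical 1-block distribution of x, and its
    divergence relative to mu1 (both in bits)."""
    n = len(x)
    p1 = {a: x.count(a) / n for a in set(x)}
    H = -sum(p * log2(p) for p in p1.values())
    D = sum(p * log2(p / mu1[a]) for a, p in p1.items())
    return H, D

X, MU = "0010110", {"0": 0.4, "1": 0.6}
H, D = H_and_D(X, MU)
```

The divergence term D is nonnegative and vanishes exactly when the empirical distribution matches μ_1, which is why sequences of the wrong type are exponentially unlikely.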
The s-shifted (non-overlapping) empirical k-block distribution p^s_k = p^s_k(· | x_1^n) is the distribution on A^k defined by

p^s_k(a_1^k | x_1^n) = |{j ∈ [1, t]: w(j) = a_1^k}| / t,

where x_1^n is given by (4). It is, of course, constant on the (s, k)-equivalence class S_k(x_1^n, s). The overlapping-block measure p_k is (almost) an average of the measures p^s_k, where "almost" is needed to account for end effects, a result summarized as the following lemma.

Lemma III.1.4 (Overlapping to non-overlapping lemma.)
Given ε > 0 there is a γ > 0 such that if k/n ≤ γ and |p_k(· | x_1^n) − μ_k| ≥ ε, then there is an s ∈ [0, k − 1] such that |p^s_k(· | x_1^n) − μ_k| ≥ ε/2.

Proof. Left to the reader.

If k is fixed and n is large enough the lemma yields the containment relation

(5) {x_1^n: |p_k(· | x_1^n) − μ_k| ≥ ε} ⊆ ∪_{s=0}^{k−1} {x_1^n: |p^s_k(· | x_1^n) − μ_k| ≥ ε/2}.

Fix such a k and n, and fix s ∈ [0, k − 1]. The fact that μ is a product measure implies

(6) μ(S_k(x_1^n, s)) = Π_{i=1}^{t} μ_k(w(i)),

so that if B denotes A^k and ν denotes the product measure on B^t defined by the formula

ν(w_1^t) = Π_{i=1}^{t} μ_k(w_i), w_i ∈ B,

then

(7) μ({x_1^n: |p^s_k(· | x_1^n) − μ_k| ≥ ε/2}) = ν({w_1^t: |p_1(· | w_1^t) − μ_k| ≥ ε/2}).

The latter, by the first-order result, Lemma III.1.3, applied to the measure ν and super-alphabet B = A^k, is upper bounded by

(8) (t + 1)^{|A|^k} 2^{−tcε²/4}.

The containment relation, (5), then provides the desired k-block bound

(9) μ({x_1^n: |p_k(· | x_1^n) − μ_k| ≥ ε}) ≤ k(t + 1)^{|A|^k} 2^{−tcε²/4}.

If k and ε > 0 are fixed, the logarithm of the right-hand side, divided by n, is asymptotic to −cε²/4k; hence the bound decays exponentially in n. This completes the proof of Theorem III.1.2.
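The s-shifted distributions of (4), and the sense in which p_k is almost their average, can be checked numerically; this is a sketch under my own naming, with t = t(n, k) as in the text.

```python
def shifted_nonoverlapping(x, k, s):
    """p^s_k: frequencies of the t blocks w(j) = x[s+(j-1)k : s+jk] in
    the s-shifted k-block parsing (4), where n = tk + k + r."""
    n = len(x)
    t = n // k - 1                  # n = tk + k + r, with r = n mod k
    dist = {}
    for j in range(t):
        b = x[s + j * k: s + (j + 1) * k]
        dist[b] = dist.get(b, 0.0) + 1.0 / t
    return dist

def average_over_shifts(x, k):
    """(1/k) * sum over s of p^s_k -- the 'almost p_k' average."""
    avg = {}
    for s in range(k):
        for b, p in shifted_nonoverlapping(x, k, s).items():
            avg[b] = avg.get(b, 0.0) + p / k
    return avg

def overlapping(x, k):
    """The overlapping-block measure p_k, for comparison."""
    wins = [x[i:i + k] for i in range(len(x) - k + 1)]
    return {w: wins.count(w) / len(wins) for w in set(wins)}

P_AVG = average_over_shifts("01" * 50, 2)
P_K = overlapping("01" * 50, 2)
```

For the periodic sample 0101...01 the shift average differs from p_2 only through end effects, of total size 1/99 here, illustrating the "almost" in the lemma.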
III.1.b The Markov and related cases.

Exponential rates for Markov chains will be established in this section.

Theorem III.1.5 (Exponential rates for Markov sources.)
If μ is an ergodic Markov source, then it has exponential rates for frequencies and for entropy.

The entropy part follows immediately from the second-order frequency result, since μ(x_1^n) is a continuous function of p_1(· | x_1^n) and p_2(· | x_1^n). The theorem for frequencies is proved in stages. First it will be shown that aperiodic Markov chains satisfy a strong mixing condition called ψ-mixing. Then it will be shown that ψ-mixing processes have exponential rates for both frequencies and entropy. Finally, it will be shown that periodic, irreducible Markov chains have exponential rates.

A process is ψ-mixing if there is a non-increasing sequence {ψ(g)} such that ψ(g) → 1 as g → ∞, and such that

μ(uvw) ≤ ψ(g) μ(u) μ(w), u, w ∈ A*, v ∈ A^g.

An i.i.d. process is ψ-mixing, with ψ(g) ≡ 1. The ψ-mixing concept is stronger than ordinary mixing, which allows the length of the gap to depend on the length of the past and the future.

Lemma III.1.6
An aperiodic Markov chain is ψ-mixing.

Proof. Let M be the transition matrix and π(·) the stationary vector for μ, let a be the final symbol of u, and let b be the first symbol of w. The formula for computing Markov probabilities yields

Σ_{v: ℓ(v)=g} μ(uvw) = μ(u) (M^g)_{ab} μ(w) / π(b).

The right-hand side is upper bounded by ψ(g) μ(u) μ(w), where

ψ(g) = max_a max_b (M^g)_{ab} / π(b).

The function ψ(g) → 1, as g → ∞, since the aperiodicity assumption implies that (M^g)_{ab} → π(b). This proves the lemma.

The next task is to prove

Theorem III.1.7
If μ is ψ-mixing, then it has exponential rates for frequencies of all orders.

Proof. A simple modification of the i.i.d. proof, so as to allow gaps between blocks, is all that is needed. Fix positive integers k and g. For n ≥ 2(k + g), define t = t(n, k, g) such that

(10) n = t(k + g) + (k + g) + r, 0 ≤ r < k + g.

For each s ∈ [0, k + g − 1], the sequence x_1^n can be expressed as the concatenation

(11) x_1^n = w(0)g(1)w(1)g(2)w(2)···g(t)w(t)w(t + 1),

where w(0) has length s < k + g, the g(j) have length g and alternate with the k-blocks w(j), for 1 ≤ j ≤ t, and the final block w(t + 1) has length n − t(k + g) − s ≤ 2(k + g). This is called the s-shifted, g-gapped, k-block parsing of x_1^n. The sequences x_1^n and y_1^n are said to be (s, k, g)-equivalent if their s-shifted, g-gapped, k-block parsings have the same (ordered) set {w(1), ..., w(t)} of k-blocks. The (s, k, g)-equivalence class of x_1^n is denoted by S_k(x_1^n, s, g). The s-shifted (non-overlapping) empirical k-block distribution, with gap g, is the distribution on A^k defined by

p^{s,g}_k(a_1^k | x_1^n) = |{j ∈ [1, t]: w(j) = a_1^k}| / t,

where the w(j) are given by (11). It is, of course, constant on the (s, k, g)-equivalence class of x_1^n. The overlapping-block measure p_k is (almost) an average of the measures p^{s,g}_k, where "almost" is now needed to account both for end effects and for the gaps. If k is large relative to gap length and n is large relative to k, Lemma III.1.4 easily extends to the following.

Lemma III.1.8 (Overlapping to non-overlapping with gaps.)
Given ε > 0 and g, there is a γ > 0 and a K > 0, such that if k ≥ K, if k/n ≤ γ, and if |p_k(· | x_1^n) − μ_k| ≥ ε, then there is an s ∈ [0, k + g − 1] such that |p^{s,g}_k(· | x_1^n) − μ_k| ≥ ε/2.

The i.i.d. proof adapts to the ψ-mixing case, with the product measure bound (6) replaced by

(12) μ(S_k(x_1^n, s, g)) ≤ [ψ(g)]^t Π_{i=1}^{t} μ_k(w(i)),

the upper bound (8) on μ({x_1^n: |p^{s,g}_k(· | x_1^n) − μ_k| ≥ ε/2}) replaced by

(13) [ψ(g)]^t (t + 1)^{|A|^k} 2^{−tcε²/4},

and the final upper bound (9) on μ({x_1^n: |p_k(· | x_1^n) − μ_k| ≥ ε}) replaced by

(14) (k + g)[ψ(g)]^t (t + 1)^{|A|^k} 2^{−tcε²/4},

assuming, of course, that k is enough larger than g, and n enough larger than k, to guarantee that Lemma III.1.8 holds. The proof of Theorem III.1.7 is completed by using the ψ-mixing property to choose g so large that ψ(g) ≤ 2^{cε²/8}. The logarithm of the bound in (14), divided by n, is then asymptotically at most −cε²/8(k + g), hence the bound decays exponentially. Since it is enough to prove exponential rates for kth order frequencies for large k, this is no problem. This completes the proof of Theorem III.1.7, and establishes the exponential-rates theorem for aperiodic Markov chains.
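For a concrete aperiodic chain, the constant ψ(g) = max_{a,b} (M^g)_{ab}/π(b) of Lemma III.1.6 can be computed directly and seen to decrease to 1; the sketch below uses a hypothetical two-state chain of my own choosing (the matrix-power convention follows the displayed formula above).

```python
def mat_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, g):
    """M^g by repeated multiplication (identity for g = 0)."""
    n = len(M)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(g):
        R = mat_mult(R, M)
    return R

def psi(M, pi, g):
    """psi(g) = max over a, b of (M^g)_{ab} / pi(b)."""
    Mg = mat_pow(M, g)
    n = len(M)
    return max(Mg[a][b] / pi[b] for a in range(n) for b in range(n))

# hypothetical two-state aperiodic chain and its stationary vector
M = [[0.9, 0.1], [0.2, 0.8]]
PI = [2/3, 1/3]
```

As g grows, each row of M^g converges to π, so ψ(g) falls toward 1 geometrically (here at the rate of the second eigenvalue, 0.7).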
the sshifted.4 easily extends to the following. and the final block w(t + 1) has length n — t (k + g) — s < 2(k + g). The (s. k + g — 1]. s. and if pke > E. g). g)equivalence class of 4 is denoted by Sk(xPil .8 holds. constant on the (s. and establishes the exponential rates theorem for aperiodic Markov chains. k. with gap g. If k is large relative to gap needed to account both for end effects and for the length and n is large relative to k. and Lemma 111. The proof of Theorem 111.1.4 ( 14) — itkl 612. provided only that k is large enough. E [1. proof adapts to the Vfmixing case.1. hence the bound decays exponentially. s. k.
because it illustrates several useful cutting and stacking ideas.1. and partition A into the (periodic) classes.oc. which has no effect on the asymptotics. if a E C. ) where @ denotes addition mod d. IL({xi: IPk — iLk I E}) > 6/2 } separately.9 An ergodic Markov chain has exponential rates for frequencies of all orders. The following theorem summarizes the goal. 1 < s < d. . and. ?_ decays exponentially as n . Thus. Also let pP) denote the measure A conditioned on Xi E C5 . see Exercise 4. and thereby completing the CI proof of the exponentialrates theorem for Markov chains...1. C2. k + g — 1]. g (• WO satisfies the For s E [0. it follows that if k I g is large enough then Since . . C1.c Counterexamples. g = ps k . Theorem 111. A simple cutting and stacking procedure which produces counterexamples for general rate functions will be presented. such that Prob(X n+ 1 E Cs91lXn E C s = 1. the nonoverlapping block measure ps condition p g (4) = 0. Examples of renewal processes without exponential rates for frequencies and entropy are not hard to construct.l.1. Let g be a gap length.10 Let n i— r(n) be a positive decreasing function with limit O. 171 Proof The only new case is the periodic case. Lemma 111. Define the function c: A i— [1. It can also be assumed that k is divisible by d. RATES OF CONVERGENCE.1. however. There is a binary ergodic process it and an integer N such that An (1112 : 'Pi (• Ix) — Ail 1/2» ?.SECTION 111. . unless c(ai) = c(xs ).u k is an average of the bt k k+g1 fx rI : IPk( ifil ) — Ad > El C U VI I: s=0 IPsk. for k can always be increased or decreased by no more than d to achieve this. because it is conceptually quite simple.5. in part. establishing the lemma. x C(s @ d — 1).g ( ! XIII ) — 14c(xs)) 1 > 6 / 2 1. so the previous theory applies to each set {xii': ' K g — I (c(xs )) I as before. which can be assumed to be divisible by d and small relative to k. an aperiodic Markov measure with state space C(s) x C(s @ 1) x • • . 
d] by putting c(a) = s. Theorem 111. _ r(n). The measure . k . in part. III. Cd. n > N. Let A be an ergodic Markov chain with period d > 1.u (s) is. (s) .
Counterexamples for a given rate would be easy to construct if ergodicity were not required. For example, let $\mu = (\nu + \bar\nu)/2$, where $\nu$ is concentrated on the sequence of all 1's and $\bar\nu$ is concentrated on the sequence of all 0's. Then $p_1(1|x_1^n)$ is either 1 or 0, according to whether $x_1^n$ consists of 1's or of 0's, so that

$$\mu_n\big(\{x_1^n : |p_1(\cdot\,|x_1^n) - \mu_1| \geq 1/2\}\big) = 1, \quad n \geq 1.$$

Ergodic counterexamples are constructed by making $\mu$ look like one process on part of the space and something else on the other part of the space, mixing the first part into the second part so slowly that the rate of convergence is as large as desired.

The cutting and stacking method was designed to implement both of these goals, that is, to make a process look like another process on part of the space, and to mix one part slowly into another. As an example of what it means to "look like one process on part of the space and something else on the other part of the space," suppose $\mu$ is defined by a (complete) sequence of column structures, and suppose some column $C$ at some stage has all of its levels labeled '1'. If $x$ is a point in some level of $C$ below the top $m$ levels, then it is certain that $x_1^m$ consists of 1's, that is, $p_1(1|x_1^m) = 1$, no matter what the complement of $C$ looks like. In particular, if $C$ has height $L$ and measure $\beta$, then the chance that a randomly selected $x$ lies in the first $L - m$ levels of $C$ is $(1 - m/L)\beta$, and hence

(15) $\mu_m\big(\{x_1^m : p_1(1|x_1^m) = 1\}\big) \geq (1 - m/L)\beta.$

If, in addition, exactly half the measure of $S(1)$ is concentrated on intervals labeled '1', so that $\mu_1(1) = 1/2$, then

$$\mu_m\big(\{x_1^m : |p_1(\cdot\,|x_1^m) - \mu_1| \geq 1/2\}\big) \geq (1 - m/L)\beta,$$

so that if $1/2 > \beta \geq r(m)$ and $L$ is large enough, then

(16) $\mu_m\big(\{x_1^m : |p_1(\cdot\,|x_1^m) - \mu_1| \geq 1/2\}\big) \geq r(m).$

The mixing of "one part slowly into another" is accomplished by cutting $C$ into two columns, say $C_1$ and $C_2$. Mixing is done by applying repeated independent cutting and stacking to the union of the second column $C_2$ with the complement of $C$. The bound (16) can be achieved with $m$ replaced by $m + 1$ by making sure that the first column $C_1$ has measure slightly more than $r(m+1)$, for it can then be cut and stacked into a much longer column, long enough to guarantee that (16) will indeed hold with $m$ replaced by $m + 1$. As long as enough independent cutting and stacking is done at each stage, and all the mass is moved in the limit to the second part, the final process will be ergodic and will satisfy (16) for all $m$ for which $r(m) < 1/2$. Without loss of generality, it can be supposed that $r(m) < 1/2$, for all $m$.

The details of the above outline will now be given. Let $m \mapsto \beta(m)$ be a nonincreasing function with limit 0, such that $\beta(1) = 1/2$ and $r(m) \leq \beta(m) \leq 1/2$, for all $m \geq 1$. The construction will be described inductively in terms of two auxiliary unbounded sequences, $\{\ell_m\}$ and $\{r_m\}$, of positive integers which will be specified later.
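The nonergodic counterexample above is easy to check numerically. The following Python sketch (our illustration, not part of the text) draws sample paths of $\mu = (\nu + \bar\nu)/2$ and verifies that the first-order frequency deviation never falls below 1/2.

```python
import random

def empirical_p1(x):
    """First-order empirical distribution p_1(.|x) of a 0-1 sequence."""
    p1 = sum(x) / len(x)
    return {1: p1, 0: 1 - p1}

def variational_distance(p, q):
    """Variational (distributional) distance |p - q|."""
    return sum(abs(p[a] - q[a]) for a in p)

mu1 = {1: 0.5, 0: 0.5}   # first-order distribution of mu = (nu + nu-bar)/2

random.seed(0)
for _ in range(5):
    # A sample path of mu is all 1's or all 0's, each with probability 1/2.
    symbol = random.choice([0, 1])
    for n in (10, 100, 1000):
        x = [symbol] * n
        # the empirical distribution is degenerate, so the deviation is
        # always the full distance 1, never below 1/2
        assert variational_distance(empirical_p1(x), mu1) >= 0.5
```

The point of the sketch is only that no amount of data helps here: the deviation probability is 1 for every $n$, which is why ergodicity is the whole difficulty in the construction.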
The desired (complete) sequence $\{S(m)\}$ will now be defined. Each $S(m)$ will contain a column $C(m)$, of height $\ell_m$ and measure $\beta(m)$, all of whose entries are labeled '1', together with a column structure $R(m)$, disjoint from $C(m)$. To get started, let $C(1)$ consist of one interval of length 1/2, labeled '1', and let $R(1)$ consist of one interval of length 1/2, labeled '0'. This guarantees that at all subsequent stages the total measure of the intervals of $S(m)$ that are labeled with a '1' is 1/2, and the total measure of the intervals of $S(m)$ that are labeled with a '0' is 1/2, and hence the final process must satisfy $\mu_1(1) = \mu_1(0) = 1/2$.

For $m > 1$, $S(m)$ is constructed as follows. First $C(m-1)$ is cut into two columns, $C(m-1, 1)$ and $C(m-1, 2)$, where $C(m-1, 1)$ has measure $\beta(m)$ and $C(m-1, 2)$ has measure $\beta(m-1) - \beta(m)$. The column $C(m-1, 1)$ is cut into $\ell_m/\ell_{m-1}$ columns of equal width, which are then stacked to obtain $C(m)$. The new remainder $R(m)$ is obtained by applying $r_m$-fold independent cutting and stacking to $C(m-1, 2) \cup R(m-1)$, so that $C(m-1, 2) \cup R(m-1)$ and $R(m)$ are $(1 - 2^{-m})$-independent. This is possible by Theorem I.10. Since the measure of $R(m)$ goes to 1, and since $\beta(m) \to 0$, the total width of $S(m)$ goes to 0, so the sequence $\{S(m)\}$ is complete and hence defines a process $\mu$.

All that remains to be shown is that for suitable choice of $\ell_m$ and $r_m$ the final process $\mu$ is ergodic and satisfies the desired condition

(17) $\mu_m\big(\{x_1^m : |p_1(\cdot\,|x_1^m) - \mu_1| \geq 1/2\}\big) \geq r(m), \quad m \geq 1.$

The condition (17) is guaranteed by choosing $\ell_m$ so large that $(1 - m/\ell_m)\beta(m) \geq r(m)$, for this is all that is needed to make sure that

$$\mu_m\big(\{x_1^m : |p_1(\cdot\,|x_1^m) - \mu_1| \geq 1/2\}\big) \geq (1 - m/\ell_m)\beta(m) \geq r(m)$$

holds. Since $r_m \to \infty$, choosing $\{r_m\}$ sequentially so as to achieve the successive independence requirements guarantees that the final process is ergodic. This completes the proof of Theorem III.1.10. □

III.1.d Exercises.

1. Suppose $\mu$ is $\psi$-mixing with entropy $h$.
(a) Choose $k$ such that $T_k = \{x_1^k : \mu(x_1^k) \geq 2^{-k(h+\epsilon)}\}$ has measure close to 1. Show that the measure of the set of $n$-sequences that are not almost strongly-packed by $T_k$ goes to 0 exponentially fast. (Hint: as in the proof of the entropy theorem, it is highly unlikely that a sequence can be mostly packed by $T_k$ and have too-small measure.)
(b) Show that the measure of $\{x_1^n : \mu(x_1^n) \leq 2^{-n(h+\epsilon)}\}$ goes to 0 exponentially fast.
(c) Show that a $\psi$-mixing process has exponential rates for entropy.

2. Let $\nu$ be a finite stationary coding of $\mu$ and assume $\mu$ has exponential rates for frequencies and entropy.
(a) Show that $\nu$ has exponential rates for frequencies. (Hint: bound the number of $n$-sequences that can have good $k$-block frequencies.)
(b) Show that $\nu$ has exponential rates for entropy. (Hint: it is unlikely that a sequence has conditional $k$-type too much larger than $2^{-k(h(\mu)-h(\nu))}$ and measure too much larger than $2^{-nh(\mu)}$.)

3. Show that there exists a renewal process that does not have exponential rates for frequencies of order 1. (Hint: make sure that the expected recurrence-time series converges slowly.)

4. Use cutting and stacking to construct a process with exponential rates for entropy but not for frequencies.

5. Since $\psi$-mixing implies exponential rates, Exercise 3 shows that renewal processes need not be $\psi$-mixing. Establish this by directly constructing a renewal process that is not $\psi$-mixing.

Section III.2 Entropy and joint distributions.

The $k$-th order joint distribution for an ergodic finite-alphabet process can be estimated from a sample path of length $n$ by sliding a window of length $k$ along the sample path and counting frequencies of $k$-blocks. If $k$ is fixed the procedure is almost surely consistent, that is, the resulting empirical $k$-block distribution almost surely converges to the true distribution of $k$-blocks as $n \to \infty$, a fact guaranteed by the ergodic theorem. The consistency of such estimates is important when using training sequences to design engineering systems, such as data compression systems. The empirical $k$-block distribution for a training sequence is used as the basis for design, after which the system is run on other, independently drawn sample paths. In such situations it is good to make the block length as long as possible, so it would be desirable to have consistency results for the case when the block length function $k = k(n)$ grows as rapidly as possible, as a function of sample path length $n$. This is the problem addressed in this section.

Here is where entropy enters the picture, for if $k$ is large then, with high probability, the probability of a $k$-block will be roughly $2^{-kh}$. In particular, if $k(n) \geq (1+\epsilon)(\log n)/h$, or, equivalently, $k(n) \geq (\log n)/(h - \delta)$, there is no hope that the empirical $k(n)$-block distribution will be close to the true distribution, see Theorem III.2.1, below. Thus the case of most interest is when $k(n) \leq (\log n)/(h + \delta)$.

As before, let $p_k(\cdot\,|x_1^n)$ denote the empirical distribution of overlapping $k$-blocks in the sequence $x_1^n$. A nondecreasing sequence $\{k(n)\}$ will be said to be admissible for the ergodic measure $\mu$ if

$$\lim_{n\to\infty}\big|p_{k(n)}(\cdot\,|x_1^n) - \mu_{k(n)}\big| = 0, \quad \text{almost surely},$$

where $|\cdot|$ denotes variational, that is, distributional, distance. The definition of admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. Every ergodic process has an admissible sequence such that $\lim_n k(n) = \infty$, see Exercise 3 of Section III.2.c.
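The sliding-window estimator just described is simple to implement. The following Python sketch (ours, not the book's) computes the empirical overlapping $k$-block distribution $p_k(\cdot\,|x_1^n)$.

```python
from collections import Counter

def empirical_k_blocks(x, k):
    """Empirical distribution p_k(.|x_1^n) of overlapping k-blocks:
    slide a window of length k along x and count frequencies."""
    n = len(x)
    counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    total = n - k + 1
    return {block: c / total for block, c in counts.items()}

# Example: overlapping 2-blocks of 0,1,1,0,1.
p2 = empirical_k_blocks([0, 1, 1, 0, 1], 2)
# windows: (0,1), (1,1), (1,0), (0,1) -> p((0,1)) = 2/4
assert p2[(0, 1)] == 0.5
assert abs(sum(p2.values()) - 1.0) < 1e-12
```

For fixed $k$ this estimate converges to the true $k$-block distribution by the ergodic theorem; the question studied in this section is how fast $k$ may grow with $n$ before the estimate breaks down.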
Consistent estimation also may not be possible for the choice $k(n) \sim (\log n)/h$, for it is easy to see that, in the unbiased coin-tossing case, where $h = 1$, an approximate $(1 - e^{-1})$ fraction of the $k$-blocks will fail to appear in a given sample path of length $n$, with high probability, so the choice $k(n) \sim \log n$ is not admissible, see Exercise 1 of Section III.2.c. It can also be shown that for any sequence $k(n) \to \infty$ there is an ergodic measure for which $\{k(n)\}$ is not admissible. The problem addressed here is whether it is possible to make a universal choice of $\{k(n)\}$ for "nice" classes of processes, such as the i.i.d. processes or Markov chains.

The principal results are the following.

Theorem III.2.1 (The nonadmissibility theorem.) If $\mu$ is ergodic with positive entropy $h$, and if $k(n) \geq (\log n)/(h - \epsilon)$, where $0 < \epsilon < h$, then $\{k(n)\}$ is not admissible in probability for $\mu$.

Theorem III.2.2 (The positive admissibility theorem.) If $\mu$ is i.i.d., ergodic Markov, or $\psi$-mixing, and if $k(n) \leq (\log n)/(h + \epsilon)$, then $\{k(n)\}$ is admissible for $\mu$.

Theorem III.2.3 (Weak Bernoulli admissibility.) If $\mu$ is weak Bernoulli and $k(n) \leq (\log n)/(h + \epsilon)$, then $\{k(n)\}$ is admissible for $\mu$.

The positive admissibility result for the i.i.d., Markov, and $\psi$-mixing cases is a consequence of a slight strengthening of the exponential rate bounds of the preceding section. The $\psi$-mixing concept was introduced in the preceding section. The weak Bernoulli property, which will be defined carefully later, requires that past and future blocks be almost independent if separated by a long enough gap, independent of the length of past and future blocks. The weak Bernoulli result follows from an even more careful look at what is really used for the $\psi$-mixing proof.

Remark III.2.4 A first motivation for the problem discussed in this section was the training sequence problem described in the opening paragraph. A second motivation was the desire to obtain a more classical version of the positive results of Ornstein and Weiss, [52], who used the $\bar{d}$-distance rather than the variational distance. They showed, in particular, that if the process is a stationary coding of an i.i.d. process then the $\bar{d}$-distance between the empirical $k(n)$-block distribution and the true $k(n)$-block distribution goes to 0, almost surely, provided $k(n) \leq (\log n)/h$. The $\bar{d}$-distance is upper bounded by half the variational distance, and hence the results described here are a sharpening of the Ornstein-Weiss theorem for the case when $k(n) \leq (\log n)/(h + \epsilon)$ and the process satisfies strong enough forms of asymptotic independence. The Ornstein-Weiss results will be discussed in more detail in Section III.3.

Remark III.2.5 A third motivation for the problem discussed here was a waiting-time result obtained by Wyner and Ziv, [86]. They showed that if $W_n(x, y)$ is the waiting time until the first $n$ terms of $x$ appear in the sequence $y$, then $(1/n)\log W_n(x, y)$ converges in probability to $h$, provided $x$ and $y$ are independently chosen sample paths. The admissibility results obtained here can be used to prove stronger versions of their theorem. These applications, along with various related results and counterexamples, are presented in Section III.5.
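The waiting-time quantity of Remark III.2.5 is easy to simulate. The sketch below (our illustration; the parameters are arbitrary, not from the text) computes $W_n(x, y)$ for unbiased coin tossing, where $h = 1$, so $(1/n)\log_2 W_n$ should be near 1 for large $n$.

```python
import math
import random

def waiting_time(x, y, n):
    """W_n(x, y): 1-based index of the first occurrence of x_1^n in y,
    or None if x_1^n never occurs in y."""
    pattern = x[:n]
    for i in range(len(y) - n + 1):
        if y[i:i + n] == pattern:
            return i + 1
    return None

random.seed(1)
n = 8
x = [random.randint(0, 1) for _ in range(n)]
y = [random.randint(0, 1) for _ in range(1 << (2 * n))]  # typically long enough
w = waiting_time(x, y, n)
assert w is not None
rate = math.log2(w) / n   # Wyner-Ziv: converges in probability to h = 1
assert 0.0 <= rate <= 2.0
```

With $n$ small the estimate is noisy; the theorem is an asymptotic statement, and the connection to admissibility is developed in Section III.5.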
The nonadmissibility theorem is a simple consequence of the fact that no more than $2^{k(n)(h-\epsilon)}$ sequences of length $k(n)$ can occur in a sequence of length $n$, and hence there is no hope of seeing even a fraction of the full distribution.

III.2.a Proofs of admissibility and nonadmissibility.

First the nonadmissibility theorem will be proved. Let $\mu$ be an ergodic process with finite alphabet $A$ and entropy $h$, and suppose $k = k(n) \geq (\log n)/(h - \epsilon)$, where $0 < \epsilon < h$. Since any bounded sequence is admissible for any ergodic process, it can be supposed, without loss of generality, that $\{k(n)\}$ is unbounded. Define the empirical universe of $k$-blocks to be the set

$$\mathcal{U}_k(x_1^n) = \{a_1^k : x_i^{i+k-1} = a_1^k, \text{ for some } i \in [1, n-k+1]\}.$$

Since $p_k(a_1^k|x_1^n) = 0$ whenever $a_1^k \notin \mathcal{U}_k(x_1^n)$, it follows that

(1) $|p_k(\cdot\,|x_1^n) - \mu_k| \geq \mu_k\big((\mathcal{U}_k(x_1^n))^c\big)$, for every $x_1^n \in A^n$.

Let

$$T_k(\epsilon/2) = \{x_1^k : \mu(x_1^k) \leq 2^{-k(h-\epsilon/2)}\}.$$

The assumption $n \leq 2^{k(h-\epsilon)}$ implies that $|\mathcal{U}_k(x_1^n)| \leq 2^{k(h-\epsilon)}$, so that, since each member of $T_k(\epsilon/2)$ has measure at most $2^{-k(h-\epsilon/2)}$,

$$\mu_k\big(\mathcal{U}_k(x_1^n) \cap T_k(\epsilon/2)\big) \leq 2^{k(h-\epsilon)}2^{-k(h-\epsilon/2)} = 2^{-k\epsilon/2}.$$

The entropy theorem guarantees that $\mu_k(T_k(\epsilon/2)) \to 1$, so that $\mu_k\big((\mathcal{U}_k(x_1^n))^c\big) \to 1$, uniformly in $x_1^n$. This implies the nonadmissibility theorem, by (1).

Next the positive admissibility theorem will be established. The rate bound of the preceding section, see the bound (9) of Section III.1, is enough to prove the positive admissibility theorem for unbiased coin-tossing, as the reader can show. The key to such results for other processes is a similar bound which holds when the full alphabet is replaced by a subset of large measure.

Lemma III.2.6 (Extended first-order bound.) There is a positive constant $C$ such that for any $\epsilon > 0$, for any finite set $A$, for any i.i.d. process $\mu$ with alphabet $A$, and for any $B \subseteq A$ such that $\mu(B) \geq 1 - \epsilon$ and $|B| \geq 2$, the following holds for any $n$:

(2) $\mu\big(\{x_1^n : |p_1(\cdot\,|x_1^n) - \mu_1| \geq 5\epsilon\}\big) \leq 2(n+1)^{|B|}2^{-nC\epsilon^2}.$

Proof. Define $y_i = y_i(x) = 1$ if $x_i \notin B$, and $y_i = 0$ otherwise, so that $\{y_i\}$ is a binary i.i.d. process with $\mathrm{Prob}(y_i = 1) \leq \epsilon$. Let

$$C_n = \Big\{x_1^n : \sum_{i=1}^n y_i \geq 2\epsilon n\Big\}.$$

The first-order version of the rate bound gives

(3) $\mu(C_n) \leq (n+1)^2 2^{-nC\epsilon^2}.$
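The counting fact driving the nonadmissibility argument can be seen numerically: an $n$-sequence contains at most $n - k + 1$ distinct $k$-blocks, so when $2^k$ greatly exceeds $n$ the empirical distribution must miss almost all of the true $k$-block mass. A small Python check for unbiased coin tossing (our parameters, chosen only for illustration):

```python
import random
from collections import Counter

random.seed(0)
n = 4096
k = 24            # k = 2 log2(n), far above (log2 n)/h = 12 for h = 1
x = [random.randint(0, 1) for _ in range(n)]

counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
total = n - k + 1
mu_k = 2.0 ** (-k)    # true probability of each of the 2^k blocks

# variational distance |p_k - mu_k|: seen blocks plus all unseen blocks
seen = sum(abs(c / total - mu_k) for c in counts.values())
unseen = (2 ** k - len(counts)) * mu_k
dist = seen + unseen
assert dist > 1.9     # at most n-k+1 of the 2^24 blocks can appear
```

The distance is forced close to its maximum value 2, no matter how the sample path comes out, which is exactly the content of inequality (1) above.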
The idea now is to partition $A^n$ according to the location of the indices $i$ for which $x_i \notin B$. For each $m \leq n$ and $1 \leq i_1 < i_2 < \cdots < i_m \leq n$, let $B(i_1, \ldots, i_m)$ denote the set of all $x_1^n$ for which $x_i \in B$ if and only if $i \notin \{i_1, \ldots, i_m\}$. The sets $B(i_1, \ldots, i_m)$ are disjoint for different $\{i_1, \ldots, i_m\}$ and have union $A^n$, and the sets $\{B(i_1, \ldots, i_m) : m > 2\epsilon n\}$ have union $C_n$, so that (3) yields the bound

(4) $\sum_{\{i_1,\ldots,i_m\}:\ m > 2\epsilon n}\mu\big(B(i_1, \ldots, i_m)\big) \leq (n+1)^2 2^{-nC\epsilon^2}.$

Fix a set $\{i_1, \ldots, i_m\}$ with $m \leq 2\epsilon n$ and put $s = n - m$. For $x_1^n \in B(i_1, \ldots, i_m)$, define $\tilde{x} = \tilde{x}(x_1^n)$ to be the sequence of length $s$ obtained by deleting $x_{i_1}, x_{i_2}, \ldots, x_{i_m}$ from $x_1^n$. Let $\mu_B$ be the conditional distribution on $B$ defined by $\mu_B(b) = \mu(b)/\mu(B)$, $b \in B$, and let $\mu_B^s$ be the corresponding product measure on $B^s$ defined by $\mu_B$. Note that

$$|\mu_1 - \mu_B| = \sum_{b \in B}\big(\mu_B(b) - \mu(b)\big) + \sum_{b \notin B}\mu(b) = 2(1 - \mu(B)) \leq 2\epsilon.$$

Moreover, since $m \leq 2\epsilon n$, the empirical distribution of $x_1^n$ differs from that of $\tilde{x}$ by at most $2\epsilon$ in variational distance, and hence, for $x_1^n \in B(i_1, \ldots, i_m)$, if $|p_1(\cdot\,|x_1^n) - \mu_1| \geq 5\epsilon$ then $|p_1(\cdot\,|\tilde{x}) - \mu_B| \geq \epsilon$. It follows that

(5) $\mu\Big(B(i_1,\ldots,i_m) \cap \big\{x_1^n : |p_1(\cdot\,|x_1^n) - \mu_1| \geq 5\epsilon\big\}\Big) \leq \mu\big(B(i_1,\ldots,i_m)\big)\,\mu_B^s\big(\{\tilde{x} \in B^s : |p_1(\cdot\,|\tilde{x}) - \mu_B| \geq \epsilon\}\big),$

since, conditioned on membership in $B(i_1, \ldots, i_m)$, the deleted sequence $\tilde{x}$ is distributed according to the product measure $\mu_B^s$. The first-order rate bound can now be applied to upper bound the second factor in (5) by

$$(s+1)^{|B|}2^{-sC\epsilon^2} \leq (n+1)^{|B|}2^{-n(1-2\epsilon)C\epsilon^2},$$

since $s \geq (1-2\epsilon)n$. The sum of the $\mu(B(i_1, \ldots, i_m))$ over the sets for which $m \leq 2\epsilon n$ cannot exceed 1, which, combined with (4) and the assumption that $|B| \geq 2$, establishes the lemma. □

The positive admissibility result will be proved first for the i.i.d. case, then extended to the $\psi$-mixing and periodic Markov cases. Assume $\mu$ is i.i.d. with entropy $h$, let $\delta$ be a given positive number, and assume $k(n) \leq (\log n)/(h + \epsilon)$; as noted earlier it can be supposed that $k(n) \to \infty$. By the Borel-Cantelli lemma, it is enough to prove that

(6) $\sum_n \mu\big(\{x_1^n : |p_{k(n)}(\cdot\,|x_1^n) - \mu_{k(n)}| \geq \delta\}\big) < \infty.$

The proof of (6) starts with the overlapping to nonoverlapping relation derived in the preceding section, see (7) on page 168:

(7) $\{x_1^n : |p_k(\cdot\,|x_1^n) - \mu_k| \geq \delta\} \subseteq \bigcup_{s=0}^{k-1}\{x_1^n : |p_{k,s}(\cdot\,|x_1^n) - \mu_k| \geq \delta/2\},$

where $p_{k,s}(\cdot\,|x_1^n)$ denotes the $s$-shifted nonoverlapping empirical $k$-block distribution.
Each shift $s$ is handled by passing to the superalphabet $\tilde{A} = A^k$. Fix $s$ and write $n = s + tk + r$, $0 \leq r < k$, so that the $s$-shifted nonoverlapping $k$-blocks in $x_1^n$ form a sequence $w_1^t$, where each $w_i \in \tilde{A} = A^k$. If $\nu$ denotes the i.i.d. measure on sequences drawn from $\tilde{A}$ induced by $\mu_k$, then

(8) $\mu\big(\{x_1^n : |p_{k,s}(\cdot\,|x_1^n) - \mu_k| \geq \delta/2\}\big) = \nu\big(\{w_1^t : |p_1(\cdot\,|w_1^t) - \nu_1| \geq \delta/2\}\big).$

The idea now is to upper bound the right-hand side in (8) by applying the extended first-order bound, Lemma III.2.6, with $A$ replaced by $\tilde{A} = A^k$ and $B$ replaced by a suitable set of entropy-typical sequences. The set

$$T_k = \{x_1^k : \mu(x_1^k) \geq 2^{-k(h+\epsilon/2)}\}$$

has cardinality at most $2^{k(h+\epsilon/2)}$ and, by the entropy theorem, there is a $K$ such that $\nu_1(T_k) = \mu_k(T_k) \geq 1 - \delta/10$, for all $k \geq K$. Lemma III.2.6, with $B$ replaced by $T_k$, yields the bound

$$\nu\big(\{w_1^t : |p_1(\cdot\,|w_1^t) - \nu_1| \geq \delta/2\}\big) \leq 2(t+1)^{2^{k(h+\epsilon/2)}}2^{-tC\delta^2/100},$$

which, combined with (8) and (7), produces the bound

(9) $\mu\big(\{x_1^n : |p_k(\cdot\,|x_1^n) - \mu_k| \geq \delta\}\big) \leq 2k(t+1)^{2^{k(h+\epsilon/2)}}2^{-tC\delta^2/100},$

valid for all $k \geq K$ and all $n \geq k/\gamma$, where $\gamma$ is a positive constant; the polynomial factor of Lemma III.2.6 now takes the form $(1 + n/k)^{2^{k(h+\epsilon/2)}}$, since the subset $T_k$ of the superalphabet can be large. Note that $k$ can be replaced by $k(n)$ in this bound, since it was assumed that $k(n) \to \infty$.

The bound (9) is actually summable in $n$, since $k(n) \leq (\log n)/(h+\epsilon)$, by assumption. To show this put $\alpha = C\delta^2/100$ and

$$\beta = \frac{h+\epsilon/2}{h+\epsilon} < 1,$$

so that $2^{k(n)(h+\epsilon/2)} \leq n^\beta$. The right-hand side of (9) is then upper bounded by

(10) $2\,\frac{\log n}{h+\epsilon}\,(t+1)^{n^\beta}\,2^{-t\alpha},$

which, as $k$ grows, though no longer a polynomial in $n$ times an exponential, is still dominated by the exponential factor, since $t \sim n/k(n)$, and is, indeed, summable in $n$. This establishes that the sum in (6) is finite, thereby completing the proof of the positive admissibility theorem for i.i.d. processes.

The preceding argument applied to the $\psi$-mixing case yields the bound

(11) $\mu\big(\{x_1^n : |p_k(\cdot\,|x_1^n) - \mu_k| \geq \delta\}\big) \leq 2(k+g)[\psi(g)]^t(t+1)^{2^{k(h+\epsilon/2)}}2^{-tC\delta^2/100},$

valid for all $g$, all $k \geq K = K(g)$, and all $n \geq k/\gamma$, where $\gamma$ is a positive constant which depends only on $g$. To obtain the desired summability it is only necessary to choose $g$ so large that $\psi(g) \leq 2^{C\delta^2/200}$. With $\alpha = C\delta^2/200$ and $\beta = (h+\epsilon/2)/(h+\epsilon)$, and assuming that $k(n) \leq (\log n)/(h+\epsilon)$, the right-hand side in (11) is, in turn, upper bounded by

$$2\,\frac{\log n}{h+\epsilon}\,(t+1)^{n^\beta}\,2^{-t\alpha},$$

which is again summable in $n$, since $t \sim n/k(n)$ and $\delta < 1$. This establishes the positive admissibility theorem for the $\psi$-mixing case, hence for the aperiodic Markov case. The extension to the periodic Markov case is obtained by a similar modification of the exponential-rates result for periodic chains. □

III.2.b The weak Bernoulli case.

The key to the $\psi$-mixing result was the fact that the measure on shifted, nonoverlapping blocks could be upper bounded by the product measure on the blocks, with only a small exponential error. The weak Bernoulli property leads to a similar bound, provided a small fraction of blocks are omitted and conditioning on the past is allowed, at least for a large fraction of shifts.

A process has the weak Bernoulli property if past and future become almost independent when separated by a large enough gap. To make this precise, let $\mu_n(\cdot\,|x_{-g-m}^{-g})$ denote the conditional measure on $n$ steps into the future, conditioned on the past $\{x_j : -g-m \leq j \leq -g\}$, that is, the measure defined for $a_1^n \in A^n$ by

$$\mu_n(a_1^n|x_{-g-m}^{-g}) = \mathrm{Prob}\big(X_1^n = a_1^n \mid X_{-g-m}^{-g} = x_{-g-m}^{-g}\big),$$

where $g$ and $m$ are nonnegative integers. Also let $E_{X_{-g-m}^{-g}}(f)$ denote the expectation of a function $f$ with respect to the distribution of the random vector $X_{-g-m}^{-g}$. A stationary process $\{X_i\}$ is weak Bernoulli (WB), or absolutely regular, if given $\epsilon > 0$ there is a gap $g \geq 0$ such that

(12) $E_{X_{-g-m}^{-g}}\Big(\big|\mu_n(\cdot\,|X_{-g-m}^{-g}) - \mu_n\big|\Big) \leq \epsilon$, for all $n \geq 1$ and all $m \geq 0$.

An equivalent form is obtained by letting $m \to \infty$ and using the martingale theorem to obtain the condition

(13) $E_{X_{-\infty}^{-g}}\Big(\big|\mu_n(\cdot\,|X_{-\infty}^{-g}) - \mu_n\big|\Big) \leq \epsilon$, for all $n \geq 1$.

The weak Bernoulli property is much stronger than mixing. It is much weaker than $\psi$-mixing, however, which requires that

$$\mu_n(a_1^n|x_{-g-m}^{-g}) \leq (1+\epsilon)\mu_n(a_1^n),$$

for each $m \geq 0$ and $n \geq 1$, uniformly in $a_1^n$ and $x_{-g-m}^{-g}$, provided only that $g$ be large enough. As before, the $s$-shifted, $k$-block parsing of $x_1^n$, with gap $g$, is the expression

(14) $x_1^n = w(0)\,g(1)\,w(1)\,g(2)\,w(2)\cdots g(t)\,w(t)\,w(t+1),$

where $w(0)$ has length $s$, the $g$-blocks $g(j)$ alternate with the $k$-blocks $w(j)$, each $w(j)$, $1 \leq j \leq t$, has length $k$, and the final block $w(t+1)$ has length $n - t(k+g) - s < 2(k+g)$. Note that the $k$-block $w(j)$ starts at index $s + (j-1)(k+g) + g + 1$ and ends at index $s + j(k+g)$.
An index $j \in [1, t]$ is called a $(\gamma, s, k, g)$-splitting index for $x \in A^Z$ if

$$\mu\big(w(j)\,\big|\,x_{-\infty}^{s+(j-1)(k+g)}\big) \leq (1+\gamma)\,\mu(w(j)),$$

where, here and later, $\mu(\cdot\,|x_{-\infty}^{i})$ denotes the conditional measure given the infinite past, defined as the limit of $\mu(\cdot\,|x_{-m}^{i})$ as $m \to \infty$, which exists almost surely by the martingale theorem. The set of all $x \in A^Z$ for which $j$ is a $(\gamma, s, k, g)$-splitting index will be denoted by $B_j(\gamma, s, k, g)$, or by $B_j$ if $\gamma$, $s$, $k$, and $g$ are understood. Note that $B_j$ is measurable with respect to the coordinates $i \leq s + j(k+g)$.

Lemma III.2.7 Fix $(\gamma, s, k, g)$ and fix a finite set $J$ of positive integers. Then for any assignment $\{w(j) : j \in J\}$ of $k$-blocks,

$$\mu\Big(\bigcap_{j \in J}\big([w(j)] \cap B_j\big)\Big) \leq (1+\gamma)^{|J|}\prod_{j \in J}\mu(w(j)).$$

Proof. Put $j_m = \max\{j : j \in J\}$ and condition on $B^* = \bigcap_{j \in J,\ j < j_m}([w(j)] \cap B_j)$ to obtain

(15) $\mu\Big(\bigcap_{j \in J}\big([w(j)] \cap B_j\big)\Big) \leq \mu\big([w(j_m)] \cap B_{j_m}\,\big|\,B^*\big)\,\mu(B^*).$

The first factor is an average of the measures $\mu\big(w(j_m)\,|\,x_{-\infty}^{s+(j_m-1)(k+g)}\big)$, each of which satisfies $\mu\big(w(j_m)\,|\,x_{-\infty}^{s+(j_m-1)(k+g)}\big) \leq (1+\gamma)\mu(w(j_m))$, by the definition of $B_{j_m}$. Thus (15) yields

$$\mu\Big(\bigcap_{j \in J}\big([w(j)] \cap B_j\big)\Big) \leq (1+\gamma)\,\mu(w(j_m))\,\mu\Big(\bigcap_{j \in J,\ j < j_m}\big([w(j)] \cap B_j\big)\Big),$$

and Lemma III.2.7 follows by induction. □

The weak Bernoulli property guarantees the almost-sure existence of a large density of splitting indices for most shifts.

Lemma III.2.8 (The weak Bernoulli splitting-set lemma.) If $\mu$ is weak Bernoulli and $0 < \gamma < 1/2$, then there is a gap $g = g(\gamma)$, there are integers $k(\gamma)$ and $t(\gamma)$, and there is a sequence of measurable sets $\{G_n(\gamma)\}$, such that the following hold.

(a) $x \in G_n(\gamma)$, eventually almost surely.

(b) If $k \geq k(\gamma)$, $t \geq t(\gamma)$, and $(t+1)(k+g) \leq n < (t+2)(k+g)$, then for $x \in G_n(\gamma)$ there are at least $(1-\gamma)(k+g)$ values of $s \in [0, k+g-1]$ for each of which there are at least $(1-\gamma)t$ indices $j$ in the interval $[1, t]$ that are $(\gamma, s, k, g)$-splitting indices for $x$.

Proof. By the weak Bernoulli property there is a gap $g = g(\gamma)$ so large that for each $k$

$$\int\Big|1 - \frac{\mu(x_1^k\,|\,x_{-\infty}^{-g})}{\mu(x_1^k)}\Big|\,d\mu(x) \leq \frac{\gamma^4}{4}.$$

Fix such a $g$ and for each $k$ define $f_k(x) = \mu(x_1^k\,|\,x_{-\infty}^{-g})/\mu(x_1^k)$. Let $\Sigma_k$ be the $\sigma$-algebra determined by the random variables $\{X_i : i \leq -g\} \cup \{X_i : 1 \leq i \leq k\}$. Direct calculation shows that each $f_k$ has expected value 1 and that $\{f_k\}$ is a martingale with respect to the increasing sequence $\{\Sigma_k\}$, see Exercise 6b. Thus $f_k$ converges almost surely to some $f$, and Fatou's lemma implies that $\int|1 - f(x)|\,d\mu(x) \leq \gamma^4/4$, so there is an $M$ such that if

$$C_M = \Big\{x : |1 - f_k(x)| \leq \frac{\gamma^2}{2},\ \forall k \geq M\Big\},$$

then $\mu(C_M) \geq 1 - \gamma^2/2$. The ergodic theorem implies that

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=0}^{N-1}\kappa_{C_M}(T^i x) \geq 1 - \frac{\gamma^2}{2}, \quad \text{a.s.},$$

where $\kappa_{C_M}$ denotes the indicator function of $C_M$, so that if

$$G_n(\gamma) = \Big\{x : \frac{1}{n}\sum_{i=0}^{n-1}\kappa_{C_M}(T^i x) \geq 1 - \gamma^2\Big\},$$

then $x \in G_n(\gamma)$, eventually almost surely. Put $k(\gamma) = M$ and let $t(\gamma)$ be any integer larger than $2/\gamma^2$. Fix $k \geq M$, $t \geq t(\gamma)$, and $(t+1)(k+g) \leq n < (t+2)(k+g)$, and fix an $x \in G_n(\gamma)$. The definition of $G_n(\gamma)$ and the assumption $t \geq 2/\gamma^2$ imply that

$$\frac{1}{t(k+g)}\sum_{i=0}^{t(k+g)-1}\kappa_{C_M}(T^i x) = \frac{1}{k+g}\sum_{s=0}^{k+g-1}\frac{1}{t}\sum_{j=1}^{t}\kappa_{C_M}\big(T^{s+(j-1)(k+g)}x\big) \geq 1 - 2\gamma^2,$$

so there is a subset $S = S(x) \subseteq [0, k+g-1]$ of cardinality at least $(1-\gamma)(k+g)$ such that for $s \in S(x)$

$$\frac{1}{t}\sum_{j=1}^{t}\kappa_{C_M}\big(T^{s+(j-1)(k+g)}x\big) \geq 1 - \gamma.$$

But if $T^{s+(j-1)(k+g)}x \in C_M$ then $\mu\big(w(j)\,|\,x_{-\infty}^{s+(j-1)(k+g)}\big) \leq (1+\gamma)\mu(w(j))$, which implies that $j$ is a $(\gamma, s, k, g)$-splitting index for $x$. In summary, for $x \in G_n(\gamma)$ and $s \in S(x)$ there are at least $(1-\gamma)t$ indices $j$ in the interval $[1, t]$ that are $(\gamma, s, k, g)$-splitting indices for $x$. Since $|S(x)| \geq (1-\gamma)(k+g)$, this completes the proof of Lemma III.2.8. □

For $s \in [0, k+g-1]$ and $J \subseteq [1, t]$, let $D_n(s, J) = \bigcap_{j \in J}B_j$ be the set of those sequences $x$ for which every $j \in J$ is a $(\gamma, s, k, g)$-splitting index, and for $x \in D_n(s, J)$ let

$$p_{k,s,J}(a_1^k|x_1^n) = \frac{|\{j \in J : w(j) = a_1^k\}|}{|J|},$$

the empirical distribution of $k$-blocks obtained by looking only at those $k$-blocks $w(j)$ for which $j \in J$. For any assignment $\{w(j) : j \in J\}$, Lemma III.2.7 and the fact that $|J| \leq t$ yield

(16) $\mu\Big(\bigcap_{j \in J}[w(j)] \cap D_n(s, J)\Big) \leq (1+\gamma)^t\prod_{j \in J}\mu(w(j)).$

The overlapping-block measure $p_k$ is (almost) an average of the measures $p_{k,s,J}$, where "almost" is now needed to account for end effects, for gaps, and for the $j \notin J$. If $k$ is large relative to the gap length, $n$ is large relative to $k$, and the sets $J$ are large fractions of $[1, t]$, however, this is no problem. In particular, if $k/n \leq \gamma$ and $\gamma$ is small enough, and if $|p_k(\cdot\,|x_1^n) - \mu_k| \geq \delta$, then $|p_{k,s,J}(\cdot\,|x_1^n) - \mu_k| \geq 3\delta/4$ for at least $2\gamma(k+g)$ indices $s \in [0, k+g-1]$, for any choice of subsets $J = J(s) \subseteq [1, t]$ of cardinality at least $(1-\gamma)t$. Combined with Lemma III.2.8 this establishes the following sharper form, in which the conclusion holds for a positive fraction of the shifts.

Lemma III.2.9 Given $\delta > 0$ there is a positive $\gamma < 1/2$ such that for any $g$ there is a $K = K(g, \gamma)$ and measurable sets $G_n = G_n(\gamma)$, $n \geq 1$, with $x \in G_n$, eventually almost surely, such that if $k \geq K$ and $|p_k(\cdot\,|x_1^n) - \mu_k| \geq \delta$, then for any $x \in G_n(\gamma)$ there exists at least one $s \in [0, k+g-1]$ and at least one $J \subseteq [1, t]$ of cardinality at least $(1-\gamma)t$, for which $x \in D_n(s, J)$ and $|p_{k,s,J}(\cdot\,|x_1^n) - \mu_k| \geq 3\delta/4$.

Proof of weak Bernoulli admissibility. Assume $\mu$ is weak Bernoulli of entropy $h$, and $k(n) \leq (\log n)/(h+\epsilon)$. Fix $\delta > 0$, choose a positive $\gamma < 1/2$ as in Lemma III.2.9, then choose integers $g = g(\gamma)$, $k(\gamma)$, and $t(\gamma)$ so that conditions (a) and (b) of Lemma III.2.8 hold.
By Lemma III.2.9, if $k(\gamma) \leq k(n) \leq (\log n)/(h+\epsilon)$ and $|p_k(\cdot\,|x_1^n) - \mu_k| \geq \delta$, then any $x \in G_n(\gamma)$ belongs to the set

(17) $\bigcup_{s=0}^{k+g-1}\ \bigcup_{\substack{J \subseteq [1,t]\\ |J| \geq (1-\gamma)t}}\Big(\big\{x_1^n : |p_{k,s,J}(\cdot\,|x_1^n) - \mu_k| \geq 3\delta/4\big\} \cap D_n(s, J)\Big).$

Using the bound (16) and the argument of the $\psi$-mixing case, with an extra factor $2^{-2t\gamma\log\gamma}$ to bound the number of subsets $J \subseteq [1, t]$ of cardinality at least $(1-\gamma)t$, the measure of the set (17) is upper bounded by

(18) $2^{-2t\gamma\log\gamma}\,(k+g)\,(t+1)^{2^{k(n)(h+\epsilon/2)}}\,2^{-t(1-\gamma)C\delta^2/400},$

for $t$ sufficiently large. If $\gamma$ is small enough, then, just as in the $\psi$-mixing case, the bound (18) is summable in $n$, provided $k(n) \leq (\log n)/(h+\epsilon)$. Since $x \in G_n(\gamma)$, eventually almost surely, the Borel-Cantelli lemma yields $|p_{k(n)}(\cdot\,|x_1^n) - \mu_{k(n)}| < \delta$, eventually almost surely, and since $\delta$ was arbitrary, this establishes Theorem III.2.3. □

Remark III.2.10 The admissibility results are based on joint work with Marton, [38].

The weak Bernoulli concept was introduced by Friedman and Ornstein, [13], as part of their proof that aperiodic Markov chains are isomorphic to i.i.d. processes in the sense of ergodic theory. Aperiodic Markov chains are weak Bernoulli because they are $\psi$-mixing. Aperiodic renewal and regenerative processes, which need not be $\psi$-mixing, are, however, weak Bernoulli, see Exercises 2 and 3 of Section IV.3; in particular, a weak Bernoulli process need not have exponential rates of convergence for frequencies. It is shown in Chapter 4 that weak Bernoulli processes are stationary codings of i.i.d. processes.

III.2.c Exercises.

1. Show that $k(n) = \lceil\log n\rceil$ is not admissible for unbiased coin-tossing.

2. Show that if $\mu$ is i.i.d., if $k(n) \leq (\log n)/(h+\epsilon)$, and if $B_n = \{x_1^n : |p_{k(n)}(\cdot\,|x_1^n) - \mu_{k(n)}| \geq \delta\}$, then $\mu(B_n)$ goes to 0 exponentially fast.

3. Show that there exists an ergodic measure $\mu$ with positive entropy $h$ such that $k(n) = (\log n)/(h+\epsilon)$ is admissible for $\mu$.

4. Show that for any nondecreasing, unbounded sequence $\{k(n)\}$ there is an ergodic process $\mu$ such that $\{k(n)\}$ is not admissible for $\mu$.
5. Show that the preceding result holds for weak Bernoulli processes.

6. Let $\{Y_j : j \geq 1\}$ be a sequence of finite-valued random variables, let $\Sigma(m)$ be the $\sigma$-algebra determined by $Y_1, \ldots, Y_m$, and let $\Sigma$ be the smallest complete $\sigma$-algebra containing $\bigcup_m\Sigma(m)$. For each $x$ let $P_m(x)$ be the atom of $\Sigma(m)$ that contains $x$.
(a) Show that, for any integrable function $g$,
$$E(g|\Sigma)(x) = \lim_{m\to\infty}\frac{1}{\mu(P_m(x))}\int_{P_m(x)}g\,d\mu, \quad \text{almost surely}.$$
(Hint: use the martingale theorem.)
(b) Show that the sequence $f_k$ defined in the proof of Lemma III.2.8 is indeed a martingale. (Hint: use the previous result to evaluate $E(f_{k+1}|\Sigma_k)$.)

7. Assume $\mu$ is i.i.d., $t = \lfloor n/k\rfloor$, and $x_1^n = w(1)\cdots w(t)r$, where each $w(i)$ has length $k$. Let $R(x_1^n, \epsilon)$ be the set of all $kt$-sequences that can be obtained by changing an $\epsilon$ fraction of the $w(i)$'s and permuting their order. Show that if $k(n) \to \infty$ and $k(n) \leq (\log n)/(h+\epsilon)$, then $\mu(R(x_1^n, \epsilon)) \to 1$, almost surely, as $n \to \infty$.

Section III.3 The $\bar{d}$-admissibility problem.

The $\bar{d}$-metric forms of the admissibility results of the preceding section are also of interest. A sequence $\{k(n)\}$ is called $\bar{d}$-admissible for the ergodic process $\mu$ if

(1) $\lim_{n\to\infty}\bar{d}_{k(n)}\big(p_{k(n)}(\cdot\,|x_1^n),\ \mu_{k(n)}\big) = 0$, almost surely.

The definition of $\bar{d}$-admissible in probability is obtained by replacing almost-sure convergence by convergence in probability.
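The $\bar{d}$-distance between $k$-block distributions is upper bounded by half their variational distance. For 1-blocks this is in fact an equality: the best joining matches $\min(\mu(a), \nu(a))$ mass at each symbol and must mismatch the rest. A short Python sketch (ours, not from the text) checks this on random pairs of first-order distributions.

```python
import random

def d1(mu, nu):
    """d-bar distance for 1-blocks: minimal mismatch probability over all
    joinings, which equals the unmatched mass 1 - sum_a min(mu(a), nu(a))."""
    return 1.0 - sum(min(mu[a], nu[a]) for a in mu)

def variational(mu, nu):
    """Variational (distributional) distance |mu - nu|."""
    return sum(abs(mu[a] - nu[a]) for a in mu)

random.seed(2)
alphabet = range(4)
for _ in range(100):
    w1 = [random.random() for _ in alphabet]
    w2 = [random.random() for _ in alphabet]
    mu = {a: w / sum(w1) for a, w in zip(alphabet, w1)}
    nu = {a: w / sum(w2) for a, w in zip(alphabet, w2)}
    # d-bar <= half the variational distance, with equality for 1-blocks
    assert abs(d1(mu, nu) - 0.5 * variational(mu, nu)) < 1e-12
```

For $k > 1$ the $\bar{d}$-distance can be strictly smaller than half the variational distance, since a joining may match blocks that agree in most, but not all, coordinates at reduced cost; that is why $\bar{d}$-admissibility is a weaker requirement than variational admissibility.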
The earlier concept of admissible will now be called variationally admissible. Since the bound

(2) $\bar{d}_n(\mu, \nu) \leq \frac{1}{2}\,|\mu_n - \nu_n|$

always holds, a variationally-admissible sequence is also $\bar{d}$-admissible. Note that in the $\bar{d}$-metric case the admissibility theorem only requires that $k(n) \leq (\log n)/h$, while in the variational case the condition $k(n) \leq (\log n)/(h+\epsilon)$ was needed.

The principal positive result is

Theorem III.3.1 (The $\bar{d}$-admissibility theorem.) If $k(n) \leq (\log n)/h$ then $\{k(n)\}$ is $\bar{d}$-admissible for any process of entropy $h$ which is a stationary coding of an i.i.d. process.

A similar result holds with the nonoverlapping-block empirical distribution in place of the overlapping-block empirical distribution, though only the latter will be discussed here. Stated in a somewhat different form, the theorem was first proved by Ornstein and Weiss, [52], then extended to the form given here in [50]. The $\bar{d}$-admissibility theorem has a fairly simple proof, based on the empirical-entropy theorem and a deep property, called the finitely determined property; the proof is given below, assuming this finitely determined property.

The negative results are twofold. One is the simple fact that if $n$ is too short relative to $k$, then a small neighborhood of the empirical universe of $k$-blocks in an $n$-sequence is too small to allow admissibility.

Theorem III.3.2 (The $\bar{d}$-nonadmissibility theorem.) If $k(n) \geq (\log n)/(h - \epsilon)$, then $\{k(n)\}$ is not $\bar{d}$-admissible for any ergodic process of entropy $h$.

The other is a much deeper result, asserting in a very strong way that no sequence can be admissible for every ergodic process.

Theorem III.3.3 (The strong-nonadmissibility theorem.) For any nondecreasing unbounded sequence $\{k(n)\}$ and $0 < \alpha < 1/2$, there is an ergodic process $\mu$ such that

$$\liminf_{n\to\infty}\bar{d}_{k(n)}\big(p_{k(n)}(\cdot\,|x_1^n),\ \mu_{k(n)}\big) \geq \alpha, \quad \text{almost surely}.$$

A weaker, subsequence version of strong nonadmissibility was first given in [52]; the form given here will be proved in Section III.3.b.

III.3.a Admissibility and nonadmissibility proofs.

To establish the $\bar{d}$-nonadmissibility theorem, define the (empirical) universe of $k$-blocks of $x_1^n$,

$$\mathcal{U}_k(x_1^n) = \{a_1^k : x_i^{i+k-1} = a_1^k,\ \text{for some } i \in [1, n-k+1]\},$$

and its $\delta$-blowup $[\mathcal{U}_k(x_1^n)]_\delta$. If $k \geq (\log n)/(h-\epsilon)$ then $|\mathcal{U}_k(x_1^n)| \leq 2^{k(h-\epsilon)}$, for any $x_1^n$, and hence, if $\delta$ is small enough then, for all $k$, $\big|[\mathcal{U}_k(x_1^n)]_\delta\big| \leq 2^{k(h-\epsilon/2)}$, by the blowup bound. Intersecting with the entropy-typical set

$$T_k = \{x_1^k : \mu(x_1^k) \leq 2^{-k(h-\epsilon/4)}\}$$

produces

$$\mu_k\big([\mathcal{U}_k(x_1^n)]_\delta\big) \leq \mu_k\big([\mathcal{U}_k(x_1^n)]_\delta \cap T_k\big) + \mu_k(T_k^c) \leq 2^{k(h-\epsilon/2)}2^{-k(h-\epsilon/4)} + \mu_k(T_k^c),$$

so that $\mu_k\big([\mathcal{U}_k(x_1^n)]_\delta\big) \leq \delta$, if $k$ is large enough, since $\mu_k(T_k) \to 1$, by the entropy theorem. But if this holds then $\bar{d}_k\big(p_k(\cdot\,|x_1^n), \mu_k\big) \geq \delta^2$, since $p_k(\mathcal{U}_k(x_1^n)\,|x_1^n) = 1$ while $\mu_k$ gives measure at most $\delta$ to the $\delta$-blowup of $\mathcal{U}_k(x_1^n)$. This completes the proof of the $\bar{d}$-nonadmissibility theorem. □
for otherwise the blowup bound gives μ_k([T_k]_δ) > 1 − 2δ, since p_k(T_k|x_1^n) = 1. This completes the proof of the d̄-nonadmissibility theorem.

186                CHAPTER III. ENTROPY FOR RESTRICTED CLASSES

The important fact needed for the current discussion is that a stationary coding of an i.i.d. process is finitely determined; see Theorem IV.3. The finitely determined property asserts that any process close enough to μ in joint distribution and in entropy must also be d̄-close to μ. An equivalent finite form, which is more suitable for use here and in Section III.4, is expressed in terms of the averaging of finite distributions. If ν is a measure on A^n, and m < n, the ν-distribution of m-blocks in n-blocks is the measure φ_m = φ(m, ν) on A^m defined by

    φ_m(a_1^m) = (1/(n − m + 1)) Σ_{i=0}^{n−m} Σ_{u ∈ A^i} Σ_{v ∈ A^{n−m−i}} ν(u a_1^m v),

that is, the average of the ν-probability of a_1^m over all starting positions in sequences of length n. An alternative expression is

    φ_m(a_1^m) = Σ_{x_1^n} p_m(a_1^m|x_1^n) ν(x_1^n) = E_ν(p_m(a_1^m|x_1^n)),

which makes it clear that φ(m, ν_n) = ν_m for stationary ν. A stationary process μ is finitely determined if for any ε > 0 there is a δ > 0 and positive integers m and K such that if k ≥ K then any measure ν on A^k which satisfies the two conditions

    (i)  |φ(m, μ_k) − φ(m, ν)| < δ,
    (ii) |H(μ_k) − H(ν)| < kδ,

must also satisfy d̄_k(μ_k, ν) < ε.

The d̄-admissibility theorem is proved by taking ν = p_k(·|x_1^n). Since the bounded result is true for any ergodic process, the sequence {k(n)} can be assumed to be unbounded, and it is sufficient to prove that if {k(n)} is unbounded and k(n) ≤ (1/h) log n, then the following hold, almost surely, for any ergodic process μ of entropy h.

    (a)  |φ(m, p_{k(n)}(·|x_1^n)) − p_m(·|x_1^n)| → 0, for each m, as n → ∞.

    (b)  (1/k(n)) H(p_{k(n)}(·|x_1^n)) → h, as k and n go to infinity.

To establish convergence in m-block distribution, fact (a), note that the averaged distribution φ(m, p_k(·|x_1^n)) is almost the same as p_m(·|x_1^n), the only difference being that the former gives less than full weight to those m-blocks that start in the initial k − 1 places or end in the final k − 1 places of x_1^n, a negligible effect if n is large enough, provided only that k ≤ (1/h) log n. The desired result then follows since p_m(a_1^m|x_1^n) → μ(a_1^m), almost surely, by the ergodic theorem. Convergence of entropy, fact (b), is really just the empirical-entropy theorem, for it asserts that (1/k)H(p_k(·|x_1^n)) → h = H(μ), almost surely, provided only that k ≤ (1/h) log n. This completes the proof of the d̄-admissibility theorem, modulo, of course, the proof that stationary codings of i.i.d. processes are finitely determined.
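The averaged distribution φ(m, ν) and the empirical overlapping-block distribution p_m(·|x_1^m) lend themselves to direct computation. The following sketch (the helper names are illustrative, not from the text) verifies the identity φ(m, ν_n) = ν_m for an i.i.d. fair-coin measure, using exact rational arithmetic.

```python
from itertools import product
from fractions import Fraction

def p_block(m, x):
    """Empirical distribution p_m(.|x) of overlapping m-blocks in the string x."""
    n = len(x)
    count = {}
    for i in range(n - m + 1):
        b = x[i:i + m]
        count[b] = count.get(b, 0) + 1
    total = n - m + 1
    return {b: Fraction(c, total) for b, c in count.items()}

def phi(m, nu):
    """Averaged m-block distribution phi(m, nu) = E_nu(p_m(.|x_1^n))."""
    out = {}
    for x, prob in nu.items():
        for b, p in p_block(m, x).items():
            out[b] = out.get(b, 0) + prob * p
    return out

# i.i.d. fair-coin measure nu on {0,1}^4: phi(2, nu) recovers nu_2,
# the uniform distribution on the four 2-blocks, as stationarity predicts.
nu = {''.join(w): Fraction(1, 16) for w in product('01', repeat=4)}
avg = phi(2, nu)
```

For a single sample path the two expressions for φ_m agree with the direct average over starting positions, which is the content of the alternative formula above.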
SECTION III.3. THE d̄-ADMISSIBILITY PROBLEM.                                187

III.3.b  Strong-nonadmissibility examples.

Throughout this section {k(n)} denotes a fixed, nondecreasing, unbounded sequence of positive integers with k(n) ≤ log n. Associated with the window function k(·) is the path-length function n(k) = max{n: k(n) ≤ k}. Let

    B_k(α) = {x_1^{n(k)}: d̄_k(p_k(·|x_1^{n(k)}), μ_k) ≤ α}.

It is enough to show that for any 0 < α < 1/2 there is an ergodic measure μ for which

    (3)    lim_{K→∞} μ( ∪_{k>K} B_k(α) ) = 0,

see Exercise 1.

The construction is easy if ergodicity is not an issue, for one can take μ to be the average of a large number J of ergodic processes {μ^(j): j ≤ J} which are mutually α′-apart in d̄, where α′(1 − 1/J) > α. A given x is typical for at most one component μ^(j), in which case p_k(·|x_1^{n(k)}) is mostly concentrated on the μ^(j)-typical sequences, and hence lim inf_k d̄_k(p_k(·|x_1^{n(k)}), μ_k) ≥ α′(1 − 1/J) > α, almost surely; this must happen for every value of n. This simple nonergodic example suggests the basic idea: build an ergodic process whose n-length sample paths, when viewed through a window of length k(n), look as though they are drawn from a large number of mutually far-apart ergodic processes. Merging of these far-apart processes must be taking place in order to obtain a final ergodic process, yet, at the same time, almost the same separation must be kept as the merging is happening. The trick is to do the merging itself in different ways on different parts of the space. The tool for managing all this is cutting and stacking, which is well-suited to the tasks of both merging and separation.

Before beginning the construction, some discussion of the relation between far-apart sequences and d̄-far-apart measures is in order. The support σ(μ) of a measure μ on A^k is the set of all x_1^k for which μ(x_1^k) > 0. Two sets C, D ⊂ A^k are said to be α-separated if C ∩ [D]_α = ∅, where

    [D]_α = {x_1^k: d̄_k(x_1^k, y_1^k) ≤ α, for some y_1^k ∈ D}

is the α-blowup of D. Thus, α-separated means that at least αk changes must be made in a member of C to produce a member of D. Two measures μ and ν on A^k are said to be α-separated if their supports are α-separated. The easiest way to make measures d̄-far-apart is to make their supports α-separated, for if λ is any joining of two α-separated measures, then E_λ(d̄_k(x_1^k, y_1^k)) ≥ α, and hence

    (4)    If μ and ν are α-separated, then d̄_k(μ, ν) ≥ α.
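The α-separation condition is finite and checkable. A small sketch (illustrative helper names, not from the text) tests whether two sets of k-blocks are α-separated by checking that every cross pair is more than α apart in the per-letter Hamming distance, which is equivalent to C missing the α-blowup of D.

```python
def d_bar(x, y):
    """Per-letter Hamming distance between two k-blocks of equal length."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def alpha_separated(C, D, alpha):
    """C and D are alpha-separated iff no member of C lies in [D]_alpha,
    i.e. every pair (x, y) with x in C, y in D is more than alpha apart."""
    return all(d_bar(x, y) > alpha for x in C for y in D)

# Two small 6-block sets: the closest cross pair differs in 4 of 6 places,
# so the sets are (1/2)-separated but not (2/3)-separated.
C = {'000000', '000011'}
D = {'111111', '111100'}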
A simple, but important fact is that the result (4) extends to averages of separated families in the following form.

Lemma III.3.4
If μ = (1/J) Σ_j ν_j is the average of a family {ν_j: j ≤ J} of pairwise α-separated measures, then d̄(μ, ν) ≥ α(1 − 1/J), for any measure ν whose support is entirely contained in the support of one of the ν_i.

Proof  The α-separation condition implies that the α-blowup of the support of ν_i does not meet the support of ν_j, for j ≠ i, and hence, if the support of ν is contained in the support of ν_i,

    μ([σ(ν_i)]_α) = (1/J) Σ_{j=1}^{J} ν_j([σ(ν_i)]_α) = (1/J) ν_i([σ(ν_i)]_α) ≤ 1/J.

Thus, for any joining λ of μ and ν,

    E_λ(d̄_k(x_1^k, y_1^k)) ≥ ν(σ(ν_i)) · [1 − μ([σ(ν_i)]_α)] · α ≥ α(1 − 1/J),

from which the lemma follows.

III.3.b.1  A limit inferior result.

Here it will be shown, in detail, how to produce an ergodic μ and an increasing sequence {k_m} such that

    (5)    lim_m μ(B_{k_m}(α)) = 0,

after which a sketch of the modifications needed to produce the stronger result (3) will be given. This weaker liminf result, based on [52], illustrates most of the ideas involved and is a bit easier to understand than the one used to obtain the full limit result.

To achieve (5), a complete sequence {S(m)} of column structures and an increasing sequence {k_m} will be constructed such that a randomly chosen point x which does not lie in the top n(k_m) levels of any column of S(m) must satisfy

    (6)    d̄_{k_m}(p_{k_m}(·|x_1^{n(k_m)}), μ_{k_m}) > α_m,

where x_i is the label of the interval containing T^{i−1}x, for 1 ≤ i ≤ n(k_m). The sequence α_m will decrease to α, and the total measure of the top n(k_m) levels of S(m) will be summable in m; thus if n(k_m)/ℓ(m) is summable in m, the measure of the bad set is summable, guaranteeing that (5) holds.

Lemma III.3.4 suggests a strategy for making (6) hold, namely, take S(m) to be the union of a disjoint collection of column structures {S_j(m): j ≤ J_m} such that any k_m-block which appears in the name of any column of S_i(m) must be at least α_m-apart from any k_m-block which appears in the name of any column of S_j(m), for i ≠ j. Some terminology will be helpful. A column structure S is said to be uniform if its columns have the same height ℓ(S) and width. If all the columns in S(m) have the same width and height ℓ(m), and if the cardinality of S_j(m) is constant in j, then the measure is the average of the conditional measures on the separate S_j(m), and, conditioned on being n(k_m) levels below the top, p_{k_m}(·|x_1^{n(k_m)}) is concentrated on the k_m-blocks of a single S_j(m), so that Lemma III.3.4 guarantees that the d̄-property (6) will indeed hold. Furthermore, ergodicity will be guaranteed by making sure that S(m) and S(m + 1) are sufficiently independent. With these simple preliminary ideas in mind, the basic constructions can begin. The k-block universe U_k(S) of a column structure S is the set of all a_1^k that appear as a block of consecutive symbols in the name of any column in S.
Disjoint column structures S and S′ are said to be (α, k)-separated if their k-block universes are α-separated. Conversion back to cutting and stacking language is achieved by replacing S(m) by its columnar representation with all columns equally likely, and concatenations of sets by independent cutting and stacking of appropriate copies of the corresponding column structures. Since all the columns of S(m) will have the same width and height, and there is no loss in assuming that no two distinct columns have the same name, S(m) will be taken to be a subset of A^{ℓ(m)}, and the simpler concatenation language will be used in place of cutting and stacking language.

To summarize the discussion up to this point, the goal (6) can be achieved with an ergodic measure for a given α ∈ (0, 1/2) by constructing a complete sequence {S(m)} of column structures and an increasing sequence {k_m} with the following properties.

    (A1) S(m) is uniform with height ℓ(m) ≥ 2^m n(k_m).

    (A2) S(m) and S(m + 1) are 2^{−m}-independent.

    (A3) S(m) is a union of a disjoint family {S_j(m): j ≤ J} of pairwise (α_m, k_m)-separated column structures, for which the cardinality of S_j(m) is constant in j, and for which α_m(1 − 1/J) decreases to α, as m → ∞.

There are many ways to force the initial stage S(1) to have property (A3). The real problem is how to go from stage m to stage m + 1 so that separation holds, yet asymptotic independence is guaranteed. This is, of course, the simultaneous merging and separation problem.

The following four sequences suggest a way to do this. These sequences are created by concatenating the two symbols, 0 and 1, using a rapidly increasing period from one sequence to the next.

    (7)    a = 010101...01       = (01)^64
           b = 0000111100001111... = (0^4 1^4)^16
           c = 0^16 1^16 0^16 1^16... = (0^16 1^16)^4
           d = 000...0111...1    = (0^64 1^64)^1

Each sequence has the same frequency of occurrence of 0 and 1, so first-order frequencies are good, yet if a block x_1^{32} of 32 consecutive symbols is drawn from one of them and a block y_1^{32} is drawn from another, then d̄_32(x_1^{32}, y_1^{32}) ≥ 1/2. The construction (7) suggests a way to merge while keeping separation, namely, concatenate blocks according to a rule that specifies to which set the m-th block in the concatenation belongs. If one starts with enough far-apart sets, then by using cyclical rules with rapidly growing periods many different sets can be produced that are almost as far apart. The idea of merging rule is formalized as follows. For M divisible by J, an (M, J)-merging rule is a function φ: {1, 2, ..., M} → {1, ..., J} whose level sets are of constant cardinality, |φ^{−1}(j)| = M/J.
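The separation behavior of the four sequences in (7) can be checked directly. The sketch below (reading the claim as applying to the constituent 32-blocks of each sequence — the aligned reading, which is my assumption here) confirms that all four sequences are exactly balanced in 0s and 1s, while any two 32-blocks taken from different sequences disagree in at least half their places.

```python
# The four sequences of (7), each of length 128, with doubling run lengths.
a = '01' * 64                    # (01)^64
b = ('0' * 4 + '1' * 4) * 16     # (0^4 1^4)^16
c = ('0' * 16 + '1' * 16) * 4    # (0^16 1^16)^4
d = '0' * 64 + '1' * 64          # (0^64 1^64)^1

def d_bar(x, y):
    """Per-letter Hamming distance between equal-length blocks."""
    return sum(s != t for s, t in zip(x, y)) / len(x)

def aligned_blocks(s, k=32):
    """The constituent k-blocks of s, taken at positions 0, k, 2k, ..."""
    return [s[i:i + k] for i in range(0, len(s), k)]

# First-order frequencies agree: every sequence is half ones.
freqs = [s.count('1') / len(s) for s in (a, b, c, d)]

# Any aligned 32-block from one sequence vs. any from another is far apart.
seqs = (a, b, c, d)
min_dist = min(d_bar(u, v)
               for i in range(4) for j in range(4) if i != j
               for u in aligned_blocks(seqs[i])
               for v in aligned_blocks(seqs[j]))
```

The minimum over all such cross pairs turns out to be exactly 1/2, which is the separation the construction exploits.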
In proving this it is somewhat easier to use a stronger infinite concatenation form of the separation idea. The full k-block universe of S ⊂ A^ℓ is the set

    U_k(S^∞) = {a_1^k: a_1^k = x_i^{i+k−1}, for some x ∈ S^∞ and some i ≥ 1}.

Two subsets S, S′ ⊂ A^ℓ are (α, K)-strongly-separated if their full k-block universes are α-separated for any k ≥ K.

Cyclical merging rules are defined as follows. Assume p divides M/J and J divides p. The merging rule defined by the two conditions

    (i)  φ(m) = j, for m ∈ ((j − 1)p, jp], 1 ≤ j ≤ J,
    (ii) φ(m + Jp) = φ(m),

is called the cyclic rule with period p. When applied to a collection {S_j: j ∈ J} of disjoint subsets of A^ℓ, an (M, J)-merging rule φ produces the subset

    S(φ) = S(φ, {S_j}) = {w(1)w(2)···w(M): w(m) ∈ S_{φ(m)}}

of A^{Mℓ}, formed by concatenating the sets in the order specified by φ; as a set, S(φ) is the direct product ∏_{m=1}^{M} S_{φ(m)}. The set S(φ) is called the φ-merging of {S_j: j ∈ J}. In general, a merging of the collection {S_j: j ∈ J} is just an (M, J)-merging of the collection, for some M. The two important properties of this merging idea are that each factor S_j appears exactly M/J times, and that, given that a block comes from S_j, it is equally likely to be any member of S_j. In cutting and stacking language, each S_j is cut into exactly M/J copies and these are independently cut and stacked in the order specified by φ.

The desired "cyclical rules with rapidly growing periods" are obtained as follows. Let exp_b(·) denote the exponential function with base b. Given J and J*, let M = exp_J(exp_2(J* − 1)), and for each t ∈ [1, J*], let φ_t be the cyclic (M, J)-merging rule with period p_t = exp_J(exp_2(t − 1)). The family {φ_t: t ≤ J*} is called the canonical family of cyclical merging rules defined by J and J*. When applied to a collection {S_j: j ∈ J} of disjoint subsets of A^ℓ, the canonical family {φ_t} produces the collection {S(φ_t)} of disjoint subsets of A^{Mℓ}. The key to the construction is that if J is large enough, then the new collection will be almost as well separated as the old collection, and use of this type of merging at each stage will insure asymptotic independence.
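A cyclic merging rule and the φ-merging it induces are easy to realize concretely. The sketch below (0-indexed, with illustrative names; the text's rules are 1-indexed) builds the rule from its period and forms the concatenation set.

```python
from itertools import product

def cyclic_rule(M, J, p):
    """Cyclic (M, J)-merging rule with period p: stay on class j for p steps,
    cycle through the J classes, repeat with overall period J*p.  Assumes
    J*p divides M, so each level set has exactly M/J elements."""
    assert M % (J * p) == 0
    return [(m // p) % J for m in range(M)]

def merging(rule, S):
    """The phi-merging S(phi): all concatenations w(1)...w(M) with
    w(m) drawn from S[phi(m)] -- the direct product of the chosen factors."""
    choices = [S[j] for j in rule]
    return {''.join(w) for w in product(*choices)}
```

For example, `cyclic_rule(8, 2, 2)` alternates two steps on each of two classes, and merging two singleton sets with the period-1 rule simply interleaves their blocks; when a factor set has several members, every choice appears, reflecting the "equally likely to be any member" property.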
The following lemma, stated in a form suitable for iteration, is the key to producing an ergodic measure with the desired property (5).

Lemma III.3.5 (The merging/separation lemma.)
Given 0 < α* < α < 1/2, there is a J_0 = J_0(α, α*) such that if J ≥ J_0 and {S_j: j ≤ J} is a collection of pairwise (α, K)-strongly-separated subsets of A^ℓ of the same cardinality, then for any J* there is a K* and an ℓ*, and a collection {S*_t: t ≤ J*} of subsets of A^{ℓ*} of equal cardinality, such that

    (a)  {S*_t: t ≤ J*} is pairwise (α*, K*)-strongly-separated.

    (b)  Each S*_t is a merging of {S_j: j ∈ J}.

Furthermore, since (S*_t)^∞ = (S(φ_t))^∞, any merging of {S*_t} is also a merging of {S_j}.

Proof  Without loss of generality it can be supposed that K ≤ ℓ, for otherwise {S_j: j ≤ J} can be replaced by {S_j^N: j ∈ J} for N > K/ℓ, without losing (α, K)-strong-separation. Let {φ_t: t ≤ J*} be the canonical family of cyclical merging rules defined by J and J*, and, for each t, let S*_t = S(φ_t). Part (b) is certainly true, so it only remains to show that if J is large enough, then K* can be chosen so that property (a) holds.

Suppose x ∈ (S*_s)^∞ and y ∈ (S*_t)^∞, where s > t. The condition s > t implies that there are integers n and m such that m is divisible by nJ and such that x = b(1)b(2)···, where each b(i) is a concatenation of the form

    (8)    w(1)w(2)···w(J),    w(j) ∈ S_j^n,  1 ≤ j ≤ J,

while y = c(1)c(2)···, where each c(i) has the form

    (9)    v(1)v(2)···v(J),    v(j) ∈ S_j^m,  1 ≤ j ≤ J,

and, furthermore, each block v(j) is at least as long as each concatenation (8). Now x_u^{u+(J−1)nℓ−1} = w(t + 1)w(t + 2)···w(J)w(1)···w(t), where w(k) ∈ S_k^n, so that if y_u^{u+(J−1)nℓ−1} is a subblock of such a v(j), then there can be at most one index k which is equal to j, and hence

    (10)    d̄_{(J−1)nℓ}(x_u^{u+(J−1)nℓ−1}, y_u^{u+(J−1)nℓ−1}) ≥ α(1 − 1/J),

by the definition of (α, K)-strong-separation and the assumption that K ≤ ℓ. Since all but a limiting (1/J)-fraction of y is covered by such blocks y_u^{u+(J−1)nℓ−1}, and since J* is finite, the collection {S*_t: t ≤ J*} is indeed pairwise (α*, K*)-strongly-separated, for all large enough K*, provided J is large enough. This completes the proof of Lemma III.3.5.

The construction of the desired sequence {S(m)} is now carried out by induction. Fix a decreasing sequence {α_m} such that α < α_m < 1/2, for each m. Suppose S(m) ⊂ A^{ℓ(m)} and k_m have been determined so that the following hold.

    (B1) S(m) is a disjoint union of a collection {S_j(m): j ≤ J_m} of pairwise (α_m, k_m)-strongly-separated sets of the same cardinality.

    (B2) Each S_j(m) is a merging of {S_j(m − 1): j ≤ J_{m−1}}.

    (B3) 2^m n(k_m) ≤ ℓ(m).

Choose J_{m+1} ≥ J_0(α_m, α_{m+1}),
define k_{m+1} to be the K* of Lemma III.3.5 for J* = J_{m+1}, and let {S*_t: t ≤ J_{m+1}} be the collection given by Lemma III.3.5 for the family {S_j(m): j ≤ J_m}, with α = α_m and α* = α_{m+1}. Put S_j(m + 1) = S*_j, for j ≤ J_{m+1}. By replacing S*_j by (S*_j)^N for some suitably large N, if necessary, it can be supposed that 2^{m+1} n(k_{m+1}) ≤ ℓ(S(m + 1)). The union S(m + 1) = ∪_j S_j(m + 1) then has properties (B1), (B2), and (B3), for m replaced by m + 1.

As in the earlier construction, the sequence {S(m)} defines a stationary A-valued measure μ by the formula

    μ(a_1^k) = lim_m (1/|S(m)|) Σ_{x_1^{ℓ(m)} ∈ S(m)} p_k(a_1^k | x_1^{ℓ(m)}),

the analogue of Theorem I.10 in this case where all columns have the same height and width. The measure μ is ergodic, since the definition of merging implies that if M is large relative to m, then each member of S(m) appears in most members of S(M) with frequency almost equal to 1/|S(m)|. Since ℓ(m) ≥ 2^m n(k_m), by (B3), the sum Σ_m n(k_m)/ℓ(m) is finite, so the measure μ(B_{k_m}(α)) of the bad set is summable in m; and since (6) clearly holds, this establishes the desired goal (5).

III.3.b.2  The general case.

To obtain the stronger property (3) it is necessary to control what happens in the interval k_m ≤ k ≤ k_{m+1}. This is accomplished by doing the merging in separate intermediate steps, on each of which only a small fraction of the space is merged. At each intermediate step, prior separation properties are retained on the unmerged part of the space until the somewhat smaller separation for longer blocks is obtained. Only a brief sketch of the ideas will be given here; for the complete proof see [50].

The new construction goes from S(m) to S(m + 1) through a sequence of J_{m+1} intermediate steps

    S(m, 0) → S(m, 1) → ··· → S(m, J_{m+1}),

merging only a (1/J_{m+1})-fraction of the space at each step. The only merging idea used is that of cyclical merging of {ν_j: j ≤ J} with period J, which is just the independent cutting and stacking ν_1 * ν_2 * ··· * ν_J, in the order given. At the start, the structure S(m) = S(m, 0) is a union of pairwise (α_m, k_m)-strongly-separated substructures {S_j(m, 0): j ≤ J_m}, all of whose columns have the same width and the same height ℓ(m) = ℓ(m, 0). All columns at any substage have the same height, but two widths are possible, one on the part waiting to be merged at subsequent steps, and the other on the part already merged. In the first substage, each S_j(m, 0) is cut into two copies, S_j′ and S_j″, where

    λ(S_j′) = (1 − 1/J_{m+1}) λ(S_j(m, 0))  and  λ(S_j″) = (1/J_{m+1}) λ(S_j(m, 0)).

Cyclical merging is applied to the collection {S_j″} to produce a new column structure R_1(m, 1), while J_{m+1}-fold independent cutting and stacking is
applied separately to each S_j′ to obtain a new column structure L_j(m, 1), which makes each substructure longer without changing any of the separation properties. Thus S(m, 1) = {L_j(m, 1): j ≤ J_m} ∪ {R_1(m, 1)}, where the merged part R_1(m, 1) has measure 1/J_{m+1}. The collection {L_j(m, 1): j ≤ J_m} remains pairwise (α_m, k_m)-strongly-separated, since neither cutting into copies nor independent cutting and stacking applied to separate substructures affects this property. Note, by the way, that the columns of R_1(m, 1) have the same height as those of L_j(m, 1), but are only 1/J_{m+1} as wide.

An argument similar to the one used to prove the merging/separation lemma, Lemma III.3.5, can be used to show that if ℓ(m) is long enough and J_{m+1} is large enough, then there is a k(m, 1) > k_m and an α(m, 1) > α_{m+1} such that each unmerged part L_j(m, 1) is (δ(m, 1), k(m, 1))-strongly-separated from the merged part R_1(m, 1), where δ(m, 1) ≥ α_{m+1}, and such that the following hold.

    (a)  The collection S(m, 1) is pairwise (α(m, 1), k(m, 1))-strongly-separated, since strong-separation for any value of k implies strong-separation for all larger values.

    (b)  If x and y are picked at random in S(m, 1), conditioned on lying at least n(k(m, 1)) levels below the top and in different L_j(m, 1), then for any k in the range k_m ≤ k ≤ k(m, 1), the probability that d̄_k(p_k(·|x_1^{n(k)}), p_k(·|y_1^{n(k)})) > α is at least (1 − 2/J_{m+1})^2 (1 − 1/J_{m+1}).

This is because, for the stated range of k-values, the k-block universes of x_1^{n(k)} and y_1^{n(k)} are at least α-apart if both points lie at least n(k(m, 1)) levels below the top and in different L_j(m, 1), an event of probability at least (1 − 2/J_{m+1})^2 (1 − 1/J_{m+1}), since the top n(k(m, 1)) levels have measure at most n(k(m, 1))/ℓ(m, 1) < 1/J_{m+1}, and the probability that both lie in the same L_j(m, 1) is upper bounded by 1/J_{m+1}. Each of the separate substructures can be extended by applying independent cutting and stacking to it: if M(m, 1) is chosen large enough, each L_j(m, 1) is replaced by its M(m, 1)-fold independent cutting and stacking, and R_1(m, 1) is replaced by its M(m, 1)-fold independent cutting and stacking, then it can be assumed that ℓ(m, 1) ≥ n(k(m, 1)) J_{m+1}, while making each structure so long that (b) holds for k(m, 1) ≤ k ≤ k(m, 2).

The merging-of-a-fraction idea can now be applied to S(m, 1), merging a copy of {L_j(m, 1): j ≤ J_m} of measure 1/J_{m+1}, then applying enough independent cutting and stacking to the merged part and to the separate pieces to achieve almost the same separation for the separate structures. After J_{m+1} iterations, the unmerged part has disappeared, and the whole construction can be applied anew to S(m + 1) = S(m, J_{m+1}). In particular, μ(B_k(α)) → 0 for k_m ≤ k ≤ k_{m+1}. This is, of course, weaker than the desired almost-sure result, but a more careful look shows that if each k(m, t) is chosen large enough, then

    (12)    Σ_m Σ_{k_m < k ≤ k_{m+1}} μ(B_k(α)) < ∞,

and this is enough to obtain (3), producing in the end an ergodic process μ for which lim_k μ(B_k(α)) = 0, almost surely.
III.3.c  Exercises.

1. Show that if ν̄ is the concatenated-block process defined by a measure ν on A^n, then |φ(k, ν) − ν̄_k| ≤ 2(k − 1)/n. (Hint: use Exercise II.9.)

2. Let μ = (1/J) Σ_i ν_i, where each ν_i is ergodic and d̄(ν_i, ν_j) > α, for i ≠ j. Show that, almost surely,

    lim inf_k d̄_k(p_k(·|x_1^{n(k)}), μ_k) ≥ α(1 − 1/J),

for suitable choice of the sequence {k(n)}.

Remark III.3.6
It is important to note here that enough separate independent cutting and stacking can always be done at each stage to make ℓ(m) grow arbitrarily rapidly, relative to k_m. This fact will be applied to construct waiting-time counterexamples.

Remark III.3.7
By starting with a well-separated collection of n-sequences of cardinality more than one, one can obtain a final process of positive entropy h. In particular, this shows the existence of an ergodic process for which the sequence k(n) = ⌈(log n)/h⌉ is not α-admissible, for suitable choice of the sequence {J_m}.

Section III.4  Blowing-up properties.

An interesting property closely connected to entropy ideas, called the blowing-up property, has recently been shown to hold for i.i.d. processes, for aperiodic Markov sources, and for other processes of interest, including a large family called the finitary processes. Processes with the blowing-up property are characterized as those processes that have exponential rates of convergence for frequencies and entropy and are stationary codings of i.i.d. processes. A slightly weaker concept, called the almost blowing-up property, is, in fact, equivalent to being a stationary coding of an i.i.d. process. The blowing-up property and related ideas will be introduced in this section. A full discussion of the connections between blowing-up properties and stationary codings of i.i.d. processes is delayed to Chapter 4.

An ergodic process μ has the blowing-up property (BUP) if given ε > 0 there is a δ > 0 and an N such that if n ≥ N then μ([C]_ε) ≥ 1 − ε, for any subset C ⊂ A^n for which μ(C) ≥ 2^{−nδ}. Here [C]_ε denotes the ε-neighborhood (or ε-blowup) of C, that is,

    [C]_ε = {b_1^n: d_n(a_1^n, b_1^n) ≤ ε, for some a_1^n ∈ C}.

Informally, a stationary process has the blowing-up property if sets of n-sequences that are not exponentially too small in probability have a large blowup.
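The force of the blowing-up idea is already visible in a toy computation. The sketch below (illustrative names, fair-coin measure) takes a single n-sequence — a set of measure 2^{-n} — and shows that its (1/2)-blowup, a Hamming ball of radius n/2, already covers most of the cube.

```python
from itertools import product

def d_n(x, y):
    """Normalized Hamming distance between two n-sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def eps_blowup(C, eps, n):
    """[C]_eps: all n-sequences within normalized distance eps of some member of C."""
    return {''.join(w) for w in product('01', repeat=n)
            if any(d_n(''.join(w), c) <= eps for c in C)}

# Fair-coin measure: mu(S) = |S| / 2^n.  The singleton {0^n} has measure
# 2^-n, yet its 1/2-blowup captures well over half the total measure.
n = 10
ball = eps_blowup({'0' * n}, 0.5, n)
mu_ball = len(ball) / 2 ** n
```

Here `mu_ball` is the binomial tail P(Bin(10, 1/2) ≤ 5) = 638/1024 ≈ 0.62, a small-scale illustration of how blowing up by a fixed fraction of the coordinates converts exponentially small sets into sets of substantial measure.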
The following theorem characterizes those processes with the blowing-up property.

SECTION III.4. BLOWING-UP PROPERTIES.                                       195

Theorem III.4.1 (Blowing-up property characterization.)
A stationary process has the blowing-up property if and only if it is a stationary coding of an i.i.d. process and has exponential rates of convergence for frequencies and entropy.

Note, in particular, that the theorem asserts that an i.i.d. process has the blowing-up property. The fact that processes with the blowing-up property have exponential rates will be established later in the section. Not every stationary coding of an i.i.d. process has the blowing-up property, however: without exponential rates there can be sets that are not exponentially too small yet fail to contain any typical sequences, and the difficulty is that only sets of sequences that are mostly both frequency and entropy typical can possibly have a large blowup. Once these are suitably removed, a blowing-up property will hold for an arbitrary stationary coding of an i.i.d. process.

A particular kind of stationary coding, called finitary coding, does preserve the blowing-up property. A stationary coding F: A^Z → B^Z is said to be finitary, relative to an ergodic process μ, if for almost every x ∈ A^Z there is a nonnegative, almost surely finite, integer-valued measurable function w(x), called the window function, such that the time-zero encoder f satisfies the following condition: if x̃ ∈ A^Z and x̃_{−w(x)}^{w(x)} = x_{−w(x)}^{w(x)}, then f(x̃) = f(x). A finitary coding of an i.i.d. process is called a finitary process. It is known that aperiodic Markov chains and finite-state processes, and some renewal processes, are finitary, [28, 79]. Later in this section the following will be proved.

Theorem III.4.2 (The finitary-coding theorem.)
Finitary coding preserves the blowing-up property.

In particular, since i.i.d. processes have the blowing-up property it follows that finitary processes have the blowing-up property. By borrowing one concept from Chapter 4, it will be shown later in this section that a stationary coding of an i.i.d. process has the almost blowing-up property, a result that will be used in the waiting-time discussion in the next section. A set B ⊂ A^k has the (δ, ε)-blowing-up property, relative to an ergodic process μ, if μ([C]_ε) ≥ 1 − ε, for any subset C ⊂ B for which μ(C) ≥ 2^{−kδ}. An ergodic process μ has the almost blowing-up property (ABUP) if for each k there is a set B_k ⊂ A^k such that the following hold.

    (i)  x_1^k ∈ B_k, eventually almost surely.

    (ii) For any ε > 0 there is a δ > 0 and a K such that B_k has the (δ, ε)-blowing-up property for k ≥ K.

Theorem III.4.3 (Almost blowing-up characterization.)
A stationary process has the almost blowing-up property if and only if it is a stationary coding of an i.i.d. process.

A proof that a process with the almost blowing-up property is a stationary coding of an i.i.d. process will be given in Section IV; the proof of this fact, as well as most of the proof of the theorem, is delayed to Chapter 4.
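The window-function definition can be made concrete with a toy coder. The sketch below is a hypothetical illustration (not a coder from the text): the time-zero output depends on the input only through the distance to the nearest 1 at or to the right of the origin, so the window w is that distance, which is almost surely finite for a coin-tossing input.

```python
def timezero_encoder(x, origin):
    """f(x): emit 1 when the nearest 1 at or to the right of `origin` lies at
    even distance, else 0.  The window needed is exactly that distance,
    finite almost surely when 1s occur with positive probability.
    (Hypothetical illustrative coder, not one from the text.)"""
    d = 0
    while x[origin + d] != 1:
        d += 1
    return 1 if d % 2 == 0 else 0

def finitary_code(x, lo, hi):
    """Slide the time-zero encoder to produce the coded block y_lo ... y_{hi-1}."""
    return [timezero_encoder(x, i) for i in range(lo, hi)]
```

The defining finitary property is visible here: two inputs that agree out to the window distance receive the same output symbol, no matter how they differ beyond it.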
however. ) fx n i: tin(x n i) < of sequences of toosmall probability is a bit trickier to obtain. hence a small blowup of such a set cannot possibly produce enough sequences to cover a large fraction of the measure. and p. and ti. k. and define B(n. —> 0. First it will be shown that p. To fill in the details let pk ( WI') denote the empirical distribution of overlapping k blocks. Since . for all sufficiently large n.(C) > 2 6 " then . Suppose .(n. E) = 1Pk( 14) — IkI > El.u n (B(n. by the entropy theorem. for all n sufficiently large.u(4) 2 ' _E/4)} gives (n. in particular. has exponential rates for frequencies. 0. E)) must be less than 2 8n. n > N and . But this cannot be true for all large n since the ergodic theorem guarantees that lim n p n (B(n. To fill in the details of this part of the argument.) —> 1.Cc A n . If the amount of blowup is small enough. Intersecting with the set Tn = 11. but depends only on the existence of exponential rates for frequencies. One part of this is easy.([C]y ) > 1 — E. for there cannot be too many sequences whose measure is too large. c)) < 2 6n . however. [B(n. and hence there is an a > 0 such that 1[B* (n. Thus p n (B(n. ç B(n. o ] l < for all n.u n ([B*(n. since . . E/2). The idea of the proof is that if the set of sequences with bad frequencies does not have exponentially small measure then it can be blown up by a small amount to get a set of large measure. (1) Ipk( 14)— pk( 1)1)1 6/2.4. then frequencies won't change much and hence the blowup would produce a set of large measure all of whose members have bad frequencies. < 2n(h—E/2)2— n(h—E14) < 2—nEl 4 .Cc An. k. E)]. 6 .u([C]. k. E)) > 2 6" then ([B(n.) would have to be at least 1 —E. contradicting the ergodic theorem. so that.  = trn i: iun (x n i)> so that B* (n. An exponential bound for the measure of the set B. E/2)) = 0 for each fixed k and E. n In) _ SO p„([B* (n.a Blowingup implies exponential rates.naB * 2n(h—E/2) . k. k. k. 
The blowingup property provides 6 and N so that if n>N. E)1 0 ) 0 it follows that . III.u„(T.) > 1 — a. k.196 CHAPTER III. let B . since this implies that if n is sufficiently .u(B*(n. Next it will be shown that blowingup implies exponential rates for entropy. E/2)) to be at least 1 — E. Note that if y = E/(2k + 1) then d(4. The blowingup property provides 8 > 0 and N so that if n>N. ENTROPY FOR RESTRICTED CLASSES. 6)1 < 2n(h €). If. and hence (1) would force the measure p n (B(n. k.u has the blowingup property.(C) > 2 611 then ii.
The idea is that, except for a set of exponentially small probability, most of x_1^n will be covered by k-blocks whose measure is about 2^{−kh}, which in turn means that it is exponentially very unlikely that such an x_1^n can have probability much smaller than 2^{−nh}. To fill in the details, apply the entropy theorem to choose k so large that μ_k(B(k, ε/4)) ≤ α, where α will be specified in a moment, and note that the complement of B(k, ε/4) has cardinality at most 2^{k(h+ε/4)}. Let G_n be the set of n-sequences that are (1 − 2ε)-covered by the complement of B(k, ε/4). As in the proof of the entropy theorem, the built-up set bound of Chapter I implies that there is an α > 0 and an N such that if n ≥ N, then |G_n| ≤ 2^{n(h+ε/2)}. Thus

    μ_n(G_n ∩ B_*(n, ε)) ≤ |G_n| 2^{−n(h+ε)} ≤ 2^{−nε/2},  n ≥ N.

Since exponential rates have already been established for frequencies, there is a δ > 0 and an N_1 ≥ N such that μ_n(G_n) ≥ 1 − 2^{−δn}, for n ≥ N_1. Thus

    μ_n(B_*(n, ε)) ≤ 2^{−δn} + 2^{−nε/2},  n ≥ N_1,

which gives the desired exponential bound.

III.4.b  Finitary coding and blowing-up.

The finitary-coding theorem will now be established. Fix ε > 0 and let ν be a finitary coding of a process μ with the blowing-up property, with full encoder F: B^Z → A^Z, time-zero encoder f: B^Z → A, and window function w(x), where w(x) is almost surely finite; thus f(x) depends only on the values x_{−w(x)}^{w(x)}. Since w(x) is almost surely finite, there is a k such that if G_k is the set of all x_{−k}^{k} for which w(x) ≤ k, then μ(G_k) ≥ 1 − ε². For n ≥ k let

    T = {x_{−k}^{n+k}: p_{2k+1}(G_k | x_{−k}^{n+k}) ≥ 1 − ε},

and note that μ(T) ≥ 1 − ε, by the Markov inequality. Moreover, δ can be assumed to be so small, and n so large, that the blowing-up property for μ applies with γ = ε/(2k + 1): if μ(D) ≥ 2^{−nδ}, then μ([D]_γ) ≥ 1 − ε, for n ≥ N_1.

Suppose C ⊂ A^n satisfies ν(C) ≥ 2^{−nδ}, where δ > 0. Let D be the projection of F^{−1}C onto B^{[−k, n+k]}; then μ(D) ≥ 2^{−nδ}, and hence μ([D]_γ) ≥ 1 − ε, for n ≥ N_1. Now consider a sequence x̃_{−k}^{n+k} ∈ [D]_γ ∩ T. There is a sequence x_{−k}^{n+k} ∈ D such that d_{n+2k}(x̃_{−k}^{n+k}, x_{−k}^{n+k}) ≤ γ. Thus fewer than (2k + 1)γn ≤ εn of the (2k + 1)-blocks in x̃_{−k}^{n+k} can differ from the corresponding block in x_{−k}^{n+k}. For such a sequence, (1 − ε)n of its (2k + 1)-blocks belong to G_k, since x̃ ∈ T, so there is at least a (1 − 2ε)-fraction of (2k + 1)-blocks in x̃_{−k}^{n+k} that belong to G_k and, at the same time, agree entirely with the corresponding block in x_{−k}^{n+k}.
ENTROPY FOR RESTRICTED CLASSES. E D such that Yi F(y).1. where 4)(k. LII which completes the proof of the finitarycoding theorem. Also let c/.a into the notation used here. C) > E has p. that is. there is a 3 > 0 and positive integers k and N such that if n > N then any measure y on An which satisfies the two conditions (a) Luk — 0*(k.3.).4 )1) > P(4)dn(4 . . Z = F(Y. then. The sequence Z7 belongs to C and cln (z7 . and put z = . the minimum of dn (4 . so that Z7 E [C]2 E .u. ) (3) Ca li) = E P (a14)v(x) = Ev(Pk(a l'IX7)). <e. Thus if (LW. that is.d. and hence EA(Cin(X til C)) < (LW. C). that is. Towards this end a simple connection between the blowup of a set and the 4distance concept will be needed.c Almost blowingup and stationary coding. . Lemma 111. by the Markov inequality. This proves that vn ([ che ) n T) > 1 — 2€. v)I < 8.d. process have the almost blowingup property makes use of the fact that such coded processes have the finitely determined property. the set of 4 such that dn (xit . ktn( .4.2. Proof The relation (2) vn EC holds for any joining X of A n and A n ( IC).4. This proves the lemma. processes have the almost blowingup property (ABUP). measure less than E. ylz). and which will be discussed in detail in Section IV. c+ k i . C) denote the distance from x to C. v) is the measure on Ak obtained by averaging must also satisfy the yprobability of kblocks over all starting positions in sequences of length n.([C]e ) > 1 — E. a property introduced to prove the dadmissibility theorem in Section 111. (b) 111(u n ) — H(v)I < u8. k This simply a translation of the definition of Section III. is finitely determined (FD) if given E > 0. Ane IC» <2 then 1. 4([C]) > 1 — E.198 CHAPTER III.1 (' IC) denote the conditional measure defined by the set C c An. Let .(4. y E C. x ntf k 19 and riChoose y and 5. I11. Z?) < 26.i.i.4 If d(1t. E Yil )dn(. An( IC)). An ergodic process p.3. IC)) < 62 . In this section it is shown that stationary codings of i. 
SECTION III.4. BLOWING-UP PROPERTIES. 199

Theorem III.4.5 (FD implies ABUP.) A finitely determined process has the almost blowing-up property.

Proof Fix a finitely determined process μ. The first step in the rigorous proof is to remove the sequences that are not suitably frequency and entropy typical. A sequence x_1^n will be called entropy-typical, relative to α, if μ(x_1^n) ≤ 2^{−H(μ_n)+nα}. Note that in this setting nth order entropy, H(μ_n), rather than the usual nh, is used in the definition of entropy-typical, which is all right since H(μ_n)/n → h. A sequence x_1^n will be called (k, α)-frequency-typical if |p_k(·|x_1^n) − μ_k| ≤ α, where, as usual, p_k(·|x_1^n) is the empirical distribution of overlapping k-blocks in x_1^n. Let B_n(k, α) be the set of n-sequences that are both entropy-typical, relative to α, and (j, α)-frequency-typical for all j ≤ k. Then x_1^n ∈ B_n(k, α), eventually almost surely, so there is a nondecreasing, unbounded sequence {k(n)}, and a nonincreasing sequence {α(n)} with limit 0, such that x_1^n ∈ B_n(k(n), α(n)), eventually almost surely. Put B_n = B_n(k(n), α(n)).

It will be shown that for any ε > 0, there is a δ > 0 such that B_n eventually has the (δ, ε)-blowing-up property. The basic idea is that if C ⊂ B_n consists of sequences that have good k-block frequencies and probabilities roughly equal to 2^{−H(μ_n)}, and μ(C) is not exponentially too small, then the conditional measure μ_n(·|C) will have (averaged) k-block probabilities close to μ_k and entropy close to H(μ_n). The finitely determined property then implies that μ_n(·|C) and μ_n are d̄-close, which, by Lemma III.4.4, implies that C has a large blowup.

Given ε > 0, the finitely determined property provides δ > 0 and k such that if n is large enough then any measure ν on A^n which satisfies |φ(k, ν) − μ_k| ≤ δ and |H(μ_n) − H(ν)| ≤ nδ must also satisfy d̄_n(μ_n, ν) ≤ ε². Suppose C ⊂ B_n and μ(C) ≥ 2^{−nδ}. The definition of B_n and formula (3) together imply that if k(n) ≥ k, then the conditional measure μ_n(·|C) satisfies the following.

(i) |φ(k, μ_n(·|C)) − μ_k| ≤ α(n).
(ii) H(μ_n) − nα(n) − nδ ≤ H(μ_n(·|C)) ≤ H(μ_n) + nα(n).

Thus if n is large enough then d̄_n(μ_n, μ_n(·|C)) ≤ ε², which, by Lemma III.4.4, implies that μ([C]_ε) ≥ 1 − ε. This completes the proof that finitely determined implies almost blowing-up. □

Remark III.4.6 The results in this section are mostly drawn from joint work with Marton, [37, 39]. For references to earlier papers on blowing-up ideas the reader is referred to [37] and to Section 1.5 in the Csiszár-Körner book, [7], which also includes applications to some problems in multi-user information theory.

III.4.d Exercises.

1. Show that a process with the almost blowing-up property must be ergodic.
2. Show that (2) is indeed true.
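The frequency-typicality condition |p_k(·|x_1^n) − μ_k| ≤ α is directly computable from the overlapping-block counts. A minimal sketch (names and the choice of |·| as maximum deviation over k-blocks are my own illustrative assumptions):

```python
from collections import Counter

def empirical_kblocks(x, k):
    """Empirical distribution p_k(.|x) of overlapping k-blocks in x."""
    blocks = [tuple(x[i:i + k]) for i in range(len(x) - k + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return {b: c / total for b, c in counts.items()}

def is_frequency_typical(x, k, mu_k, alpha):
    """Check |p_k(.|x) - mu_k| <= alpha, taking |.| to be the maximum
    deviation over k-blocks (one convenient choice of norm)."""
    p = empirical_kblocks(x, k)
    support = set(p) | set(mu_k)
    return max(abs(p.get(b, 0.0) - mu_k.get(b, 0.0)) for b in support) <= alpha

# Alternating 0s and 1s match the fair-coin 1-block distribution exactly.
x = (0, 1) * 50
print(is_frequency_typical(x, 1, {(0,): 0.5, (1,): 0.5}, 0.05))  # -> True
```

A constant sequence fails the same test, since its empirical 1-block distribution puts all mass on one symbol.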
200 CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

3. Show that a process with the almost blowing-up property must be mixing.
4. Show that condition (i) in the definition of ABUP can be replaced by the condition that μ(B_n) → 1.
5. (a) Show that an m-dependent process has the blowing-up property. (You may assume that i.i.d. processes have the blowing-up property.)
   (b) Show that a ψ-mixing process has the blowing-up property.
6. Assume that aperiodic renewal processes are stationary codings of i.i.d. processes. Show that some of them do not have the blowing-up property.
7. Show that a coding is finitary relative to μ if and only if for each b the set f^{−1}(b) is a countable union of cylinder sets together with a null set.

Section III.5 The waiting-time problem.

A connection between entropy and recurrence times, first noted by Wyner and Ziv, was shown to hold for any ergodic process in Section II. Wyner and Ziv also established a positive connection between a waiting-time concept and entropy. They showed that if W_k(x, y) is the waiting time until the first k terms of x appear in an independently chosen y, then (1/k) log W_k(x, y) converges in probability to h, for irreducible Markov chains. This result was extended to somewhat larger classes of processes, [11], including the weak Bernoulli class, [44, 76].

Two positive theorems will be proved. An almost sure version of the Wyner-Ziv result will be established here for the class of weak Bernoulli processes by using the joint-distribution estimation theory of III.2. In addition, an approximate-match version will be shown to hold for the class of stationary codings of i.i.d. processes, by using the d̄-admissibility theorem in conjunction with the almost blowing-up property discussed in the preceding section. Counterexamples to extensions of these results to the general ergodic case will also be discussed. The counterexamples show that waiting-time ideas, unlike recurrence ideas, cannot be extended to the general ergodic case.

The surprise here, of course, is the positive result, at least for certain classes of ergodic processes, for the well-known waiting-time paradox, see [86], pp. 10ff, suggests that waiting times are generally longer than recurrence times, thus further corroborating the general folklore that waiting times and recurrence times are quite different concepts.

The waiting-time function W_k(x, y) is defined for x, y ∈ A^∞ by

W_k(x, y) = min{m ≥ 1 : y_m^{m+k−1} = x_1^k}.

The approximate-match waiting-time function W_k(x, y, δ) is defined by

W_k(x, y, δ) = min{m ≥ 1 : d_k(x_1^k, y_m^{m+k−1}) ≤ δ}.
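Both waiting-time functions are easy to compute on finite sample paths. The sketch below (function names are mine; positions are 1-based as in the text) scans y for the first exact, respectively approximate, occurrence of x_1^k:

```python
def waiting_time(x, y, k):
    """W_k(x, y): least m >= 1 with y_m^{m+k-1} = x_1^k, or None if the
    block never appears in the given finite stretch of y."""
    target = tuple(x[:k])
    for m in range(len(y) - k + 1):
        if tuple(y[m:m + k]) == target:
            return m + 1          # 1-based position, as in the text
    return None

def approx_waiting_time(x, y, k, delta):
    """W_k(x, y, delta): least m with d_k(x_1^k, y_m^{m+k-1}) <= delta,
    d_k being the per-symbol Hamming distance."""
    target = x[:k]
    for m in range(len(y) - k + 1):
        dist = sum(a != b for a, b in zip(target, y[m:m + k])) / k
        if dist <= delta:
            return m + 1
    return None

x = (1, 1, 0, 1)
y = (0, 1, 0, 1, 1, 0)
print(waiting_time(x, y, 3))              # -> 4
print(approx_waiting_time(x, y, 3, 1/3))  # -> 1
```

The example shows why W_k(x, y, δ) ≤ W_k(x, y): allowing one mismatch in three symbols finds a match at position 1, while the exact block 110 first appears at position 4.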
SECTION III.5. THE WAITING-TIME PROBLEM. 201

Theorem III.5.1 (The exact-match theorem.) If μ is weak Bernoulli with entropy h, then

lim_{k→∞} (1/k) log W_k(x, y) = h,

almost surely with respect to the product measure μ × μ.

Theorem III.5.2 (The approximate-match theorem.) If μ is a stationary coding of an i.i.d. process, then

lim sup_{k→∞} (1/k) log W_k(x, y, δ) ≤ h, ∀δ > 0,

and

lim_{δ→0} lim inf_{k→∞} (1/k) log W_k(x, y, δ) ≥ h,

almost surely with respect to the product measure μ × μ.

The easy part of both results is the lower bound, for there are exponentially too few k-blocks in the first 2^{k(h−ε)} terms of y, hence it is very unlikely that a typical x_1^k can be among them, or even close to one of them.

The proofs of the two upper bound results will have a parallel structure, though the parallelism is not complete. In each proof two sequences of measurable sets, {B_k ⊂ A^k} and {G_n}, are constructed for which the following hold.

(a) x_1^k ∈ B_k, eventually almost surely.
(b) y ∈ G_n, eventually almost surely.

In the weak Bernoulli case, B_k consists of the entropy-typical sequences, so that property (a) holds by the entropy theorem. In the approximate-match proof, the B_k are the sets given by the almost blowing-up property of stationary codings of i.i.d. processes; the set B_k ⊂ A^k is chosen to have the property that any subset of it that is not exponentially too small in measure must have a large δ-blowup.

For the weak Bernoulli case, the G_n are the sets given by the weak Bernoulli splitting-set lemma. The set G_n consists of those y ∈ A^Z for which y_1^n can be split into k-length blocks separated by a fixed-length gap, such that a large fraction of these k-blocks are almost independent of each other. In the weak Bernoulli case it is necessary to allow the set G_n to depend on the infinite past, hence G_n is taken to be a subset of the space A^Z of doubly infinite sequences. In the approximate-match case, the set G_n can be taken to depend only on the coordinates from 1 to n, hence is taken to be a subset of A^n; G_n consists of those y_1^n whose set of k-blocks has a large δ-neighborhood.

The sequences {B_k} and {G_n} have the additional property that if k ≤ (log n)/(h + ε), equivalently n ≥ 2^{k(h+ε)}, then the following holds. In the weak Bernoulli case, for any fixed x_1^k ∈ B_k, the probability that a y ∈ G_n does not contain x_1^k anywhere in its first n places is almost exponentially decaying in n. In the approximate-match case, for any fixed x_1^k ∈ B_k, the probability that x_1^k is not within δ of some k-block of a y_1^n ∈ G_n is exponentially small in k. In both cases an application of the Borel-Cantelli lemma establishes the desired result.
The lower bound and upper bound results are established in the next three subsections. In the final two subsections a stationary coding of an i.i.d. process will be constructed for which the exact-match version of the approximate-match theorem is false, and an ergodic process will be constructed for which the approximate-match theorem is false.

III.5.a Lower bound proofs.

The following notation will be useful here and in later subsections. The (empirical) universe of k-blocks of y_1^n is the set

U_k(y_1^n) = {y_i^{i+k−1} : i ∈ [1, n − k + 1]}.

Its δ-blowup is

[U_k(y_1^n)]_δ = {x_1^k : d_k(x_1^k, U_k(y_1^n)) ≤ δ}.

In the following discussion, the set of entropy-typical k-sequences (with respect to α > 0) is taken to be

T_k(α) = {x_1^k : 2^{−k(h+α)} ≤ μ(x_1^k) ≤ 2^{−k(h−α)}}.

Theorem III.5.3 Let μ be an ergodic process with entropy h > 0 and let ε > 0 be given. If k(n) ≥ (log n)/(h − ε) then for every y ∈ A^∞ and almost every x, there is an N = N(x, y) such that x_1^{k(n)} ∉ U_{k(n)}(y_1^n), n ≥ N.

The theorem immediately implies that

lim inf_{k→∞} (1/k) log W_k(x, y) ≥ h,

almost surely with respect to the product measure μ × μ, for any ergodic measure with entropy h.

Proof If k ≥ (log n)/(h − ε) then n ≤ 2^{k(h−ε)}, and hence |U_k(y_1^n)| ≤ 2^{k(h−ε)}, so that intersecting with the entropy-typical set T_k(ε/2) gives

μ(U_k(y_1^n) ∩ T_k(ε/2)) ≤ 2^{k(h−ε)} 2^{−k(h−ε/2)} ≤ 2^{−kε/2}.

Since x_1^k ∈ T_k(ε/2), eventually almost surely, an application of the Borel-Cantelli lemma yields the theorem. □

Theorem III.5.4 Let μ be an ergodic process with entropy h > 0 and let ε > 0 be given. There is a δ > 0 such that if k(n) ≥ (log n)/(h − ε) then for every y ∈ A^∞ and almost every x, there is an N = N(x, y) such that x_1^{k(n)} ∉ [U_{k(n)}(y_1^n)]_δ, n ≥ N.

The theorem immediately implies that

lim_{δ→0} lim inf_{k→∞} (1/k) log W_k(x, y, δ) ≥ h,

almost surely with respect to the product measure μ × μ, for any ergodic measure with entropy h.

Proof If k ≥ (log n)/(h − ε) then n ≤ 2^{k(h−ε)}, and hence |U_k(y_1^n)| ≤ 2^{k(h−ε)}, so that, by the blowup-bound lemma, if δ is small enough then for all k,

|[U_k(y_1^n)]_δ| ≤ 2^{k(h−3ε/4)}.

Intersecting with the entropy-typical set T_k(ε/2) gives

μ([U_k(y_1^n)]_δ ∩ T_k(ε/2)) ≤ 2^{k(h−3ε/4)} 2^{−k(h−ε/2)} ≤ 2^{−kε/4},

which establishes the theorem. □
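The counting step in both proofs rests on the trivial bound |U_k(y_1^n)| ≤ n − k + 1: a string of length n simply has no room for more overlapping k-blocks than that. The following sketch (names mine) computes the empirical universe and checks the bound:

```python
def kblock_universe(y, k):
    """U_k(y_1^n): the set of distinct k-blocks appearing in y."""
    return {tuple(y[i:i + k]) for i in range(len(y) - k + 1)}

y = (0, 1, 1, 0, 1, 1, 0)
U = kblock_universe(y, 3)
print(len(U))                    # -> 3
# The count bound |U_k(y_1^n)| <= n - k + 1 used in the proofs:
print(len(U) <= len(y) - 3 + 1)  # -> True
```

When n ≤ 2^{k(h−ε)} this bound says the universe is exponentially smaller than the roughly 2^{kh} entropy-typical k-blocks, which is exactly why a typical x_1^k is unlikely to lie in it.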
III.5.b Proof of the exact-match waiting-time theorem.

Fix a weak Bernoulli measure μ on A^Z with entropy h, and fix ε > 0. To complete the proof of the weak Bernoulli waiting-time theorem it is enough to establish the upper bound,

(2)  lim sup_{k→∞} (1/k) log W_k(x, y) ≤ h,

almost surely with respect to the product measure μ × μ. Let n(k) = n(k, ε) denote the least integer n such that k ≤ (log n)/(h + ε). For each k put

C_k(x) = {y : x_1^k ∉ U_k(y_1^{n(k)})}.

If (1/k) log W_k(x, y) > h + ε then y ∈ C_k(x), so that it is enough to show that y ∉ C_k(x), eventually almost surely.

The sets {B_k} are just the entropy-typical sets,

B_k = T_k(α) = {x_1^k : 2^{−k(h+α)} ≤ μ(x_1^k) ≤ 2^{−k(h−α)}},

where α is a positive number to be specified later. The entropy theorem implies that x_1^k ∈ B_k, eventually almost surely; in particular, for almost every x ∈ A^∞ there exists K(x) such that x_1^k ∈ B_k, k ≥ K(x).

The sets G_n ⊂ A^Z are defined in terms of the splitting-index concept introduced in III.2. The basic terminology and results needed here will be restated in the form they will be used. The s-shifted, k-block parsing of y_1^n, with gap g, is

y_1^n = w(0)g(1)w(1)g(2)w(2)...g(t)w(t)w(t + 1),

where w(0) has length s < k + g, the g(j) have length g and alternate with the k-blocks w(j), for 1 ≤ j ≤ t, and the final block w(t + 1) has length n − s − t(k + g) < 2(k + g). An index j ∈ [1, t] is a (y, s, k, g)-splitting index for y ∈ A^Z if

μ(w(j) | y_{−∞}^{s+(j−1)(k+g)}) ≤ (1 + γ) μ(w(j)),

and if every j in a set J ⊂ [1, t] is a (y, s, k, g)-splitting index for y, then the splitting-index product bound

(1)  μ(∩_{j∈J} [w(j)] ∩ B_{s,J}) ≤ (1 + γ)^{|J|} ∏_{j∈J} μ(w(j))

holds, where B_{s,J} denotes the set of such y.

The sets {G_n} are given by the splitting-set lemma of III.2 and depend on a positive number γ, to be specified later, and a gap g. Property (a) and a slightly weaker version of property (b) from that lemma are needed here, stated as the following.

(a) For y ∈ G_n(γ) there is at least one s ∈ [0, k + g − 1] for which there are at least (1 − γ)t indices j in the interval [1, t] that are (y, s, k, g)-splitting indices for y.
(b) For all large enough k and t, y ∈ G_n, eventually almost surely, for (t + 1)(k + g) ≤ n < (t + 2)(k + g).

Choose k ≥ K(x) and t so large that property (b) holds, and such that (t + 1)(k + g) ≤ n(k) < (t + 2)(k + g). For s ∈ [0, k + g − 1] and J ⊂ [1, t] of cardinality at least t(1 − γ), let G_{n(k)}(J, s) denote the set of all y ∈ G_{n(k)} for which every j ∈ J is a (y, s, k, g)-splitting index. The key bound is

(3)  μ(C_k(x) ∩ G_{n(k)}(J, s)) ≤ (1 + γ)^t (1 − 2^{−k(h+α)})^{t(1−γ)}.

To establish the key bound (3), note that if y ∈ C_k(x) then w(j) ≠ x_1^k, for all j. If the blocks w(j), j ∈ J, are treated as independent then, since μ(x_1^k) ≥ 2^{−k(h+α)} and since the cardinality of J is at least t(1 − γ), the probability that w(j) ≠ x_1^k, for all j ∈ J, is upper bounded by (1 − 2^{−k(h+α)})^{t(1−γ)}, and hence the splitting-index product bound (1) yields (3).

Since there are k + g ways to choose the shift s and at most 2^{−2tγ log γ} ways to choose sets J ⊂ [1, t] of cardinality at least t(1 − γ), property (a) gives

μ(C_k(x) ∩ G_{n(k)}) ≤ (k + g) 2^{−2tγ log γ} (1 + γ)^t (1 − 2^{−k(h+α)})^{t(1−γ)}.

Since t ~ n/k and n ≥ 2^{k(h+ε)}, the numbers α and γ can be chosen so small that this bound is summable in k. Since y ∈ G_n, eventually almost surely, it then follows that y ∉ C_k(x), eventually almost surely, which proves the theorem. This completes the proof of the exact-match waiting-time theorem. □
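The s-shifted, k-block parsing is purely combinatorial and easy to implement. The sketch below (names mine; the tail is simply whatever remains after the last full gap-plus-block pair, a mild simplification of the text's w(t+1)) splits a sequence into the initial block w(0), the gaps g(j), and the k-blocks w(j):

```python
def shifted_parsing(y, s, k, g):
    """s-shifted k-block parsing of y with gap g:
    y = w(0) g(1) w(1) g(2) w(2) ... g(t) w(t) tail,
    where w(0) has length s, each g(j) has length g, and each w(j),
    1 <= j <= t, has length k."""
    w0, rest = y[:s], y[s:]
    gaps, blocks = [], []
    while len(rest) >= g + k:
        gaps.append(rest[:g])
        blocks.append(rest[g:g + k])
        rest = rest[g + k:]
    return w0, gaps, blocks, rest   # rest = the final leftover block

y = tuple(range(20))
w0, gaps, blocks, tail = shifted_parsing(y, 2, 4, 1)
print([len(b) for b in blocks])  # -> [4, 4, 4]
print(len(tail))                 # -> 3
```

Varying s over [0, k + g − 1] slides the block grid, which is exactly the freedom the proof exploits when it chooses a shift for which most indices are splitting indices.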
III.5.c Proof of the approximate-match theorem.

Let μ be a stationary coding of an i.i.d. process such that h(μ) = h, and fix ε > 0. By the d̄-admissibility theorem, it is enough to establish the upper bound. Let {B_k} be the sequence of sets given by the almost blowing-up characterization theorem, so that x_1^k ∈ B_k, eventually almost surely, and choose β > 0 and K such that if k ≥ K and C ⊂ B_k is a set of k-sequences of measure at least 2^{−kβ}, then μ([C]_{δ/2}) > 1/2.

The least n such that k ≤ (log n)/(h + ε) is denoted by n(k), while k(n) denotes the largest value of k for which k ≤ (log n)/(h + ε). Fix a positive number δ and define

G_n = {y_1^n : μ([U_{k(n)}(y_1^n)]_{δ/2}) > 1/2}.

For each k, let

C_k(y) = {x_1^k : x_1^k ∉ [U_k(y_1^{n(k)})]_δ},

so that (1/k) log W_k(x, y, δ) > h + ε if and only if x_1^k ∈ C_k(y), and hence it is enough to show that x_1^k ∉ C_k(y), eventually almost surely. For almost every y ∈ A^∞ there is an N(y) such that y_1^n ∈ G_n, n ≥ N(y).
Suppose k = k(n) ≥ K and n(k) ≥ N(y), so that y_1^{n(k)} ∈ G_{n(k)}. If it were true that μ(C_k(y) ∩ B_k) ≥ 2^{−kβ}, then [C_k(y) ∩ B_k]_{δ/2} and [U_k(y_1^{n(k)})]_{δ/2} would each have measure larger than 1/2, hence would intersect. But this implies that the sets C_k(y) and [U_k(y_1^{n(k)})]_δ intersect, which contradicts the definition of C_k(y). Thus

μ(C_k(y) ∩ B_k) < 2^{−kβ},

and hence Σ_k μ(C_k(y) ∩ B_k) < ∞. Since x_1^k ∈ B_k, eventually almost surely, it follows that x_1^k ∉ C_k(y), eventually almost surely, completing the proof of the approximate-match theorem. □

III.5.d An exact-match counterexample.

A stationary coding ν = μ ∘ F^{−1} will be constructed along with an increasing sequence {k(n)} such that

(4)  lim_{n→∞} P_{ν×ν}((1/k(n)) log W_{k(n)}(y, ỹ) ≤ N) = 0,

for every N, where P_{ν×ν} denotes probability with respect to the product measure ν × ν.

The classic example where waiting times are much longer than recurrence times is a "bursty" process, one in which long blocks of 1's and 0's alternate. If x_1 = 1, then W_1(x, y) will be large for a set of y's of probability close to 1/2. If the process has a large enough alphabet, and sample paths cycle through long enough blocks of each symbol, then W_1(x, y) will be large for a set of y's of probability close to 1.

The key to the construction is to force W_k(y, ỹ) to be large with high probability, where k is large enough to guarantee that many different markers are possible. The initial step at each stage is to create markers by changing only a few places in a k-block. The same marker is used for a long succession of nonoverlapping k-blocks, then the construction switches to another marker for a long time, cycling in this way through the possible markers to make sure that the time between blocks with the same marker is very large. By this mimicking of the bursty property of the classical example, W_k(y, ỹ) is forced to be large with high probability, and the construction can then be iterated to produce infinitely many bad k.

Let μ be the Kolmogorov measure of the i.i.d. process with binary alphabet B = {0, 1} such that μ_1(0) = μ_1(1) = 1/2. The marker construction idea is easier to carry out if the binary process μ is thought of as a measure on A^Z whose support is contained in B^Z, where A = {0, 1, 2, 3}, for then blocks of 2's and 3's can be used to form markers. This artificial expansion of the alphabet is not really essential, for with additional (but messy) effort one can create binary markers.

The first problem is the creation of a large number of block codes which don't change sequences by much, yet are pairwise separated at the k-block level. To make this precise, fix G_L ⊂ A^L. A function C: G_L → A^L is said to be δ-close to the identity if points are moved no more than δ, that is,

d_L(C(x_1^L), x_1^L) ≤ δ, x_1^L ∈ G_L.

A family {C_i} of functions from G_L to A^L is δ-close to the identity if each member of the family is δ-close to the identity. A family {C_i} of functions from G_L to A^L is said to be pairwise k-separated if the k-block universes of the ranges of different functions are disjoint, that is,

U_k(y_1^L) ∩ U_k(ỹ_1^L) = ∅, if y_1^L ∈ C_i(G_L), ỹ_1^L ∈ C_j(G_L), i ≠ j.
Lemma III.5.5 Suppose k ≥ (2g + 3)/δ and L = mk, where g ≥ 1. Let G_L be the subset of A^L consisting of those sequences in which no block of 2's and 3's of length g occurs. Then there is a pairwise k-separated family {C_i : 1 ≤ i ≤ 2^g} of functions from G_L to A^L, which is δ-close to the identity, with the additional property that no (g + 1)-block of 2's and 3's occurs in any concatenation of members of ∪_i C_i(G_L).

Proof Order the set {2, 3}^g in some way and let w_i^g denote the i-th member. Define the i-th k-block encoder Ĉ_i(x_1^k) = y_1^k by setting

y_1 = y_{g+2} = y_{2g+3} = 0,  y_2^{g+1} = y_{g+3}^{2g+2} = w_i^g,  y_j = x_j, j > 2g + 3,

that is, Ĉ_i(x_1^k) replaces the first 2g + 3 terms of x_1^k by the concatenation 0w_i^g0w_i^g0, and leaves the remaining terms unchanged. The encoder C_i is defined by blocking x_1^L into k-blocks and applying Ĉ_i separately to each k-block, that is, C_i(x_1^L) = y_1^L, where

y_{jk+1}^{jk+k} = Ĉ_i(x_{jk+1}^{jk+k}), 0 ≤ j < m.

Since each Ĉ_i changes at most 2g + 3 ≤ δk coordinates in a k-block, the family is δ-close to the identity. Any block of length k in C_i(x_1^L) = y_1^L must contain the (g + 2)-block 0w_i^g0, and since different w_i^g are used with different C_i, pairwise k-separation must hold. Note also that the blocks of 2's and 3's introduced by the construction are separated from all else by 0's, and since x_1^L ∈ G_L had no blocks of 2's and 3's of length g, no (g + 1)-block of 2's and 3's can occur in any concatenation of members of ∪_i C_i(G_L). Thus the lemma is established. □

The code family constructed by the preceding lemma is concatenated to produce a single code C of length 2^g L. The block-to-stationary construction, as done in the string matching construction of Section I.8, is then used to construct a stationary coding for which the waiting time is long for one value of k. An iteration of the method is used to produce the final process. The following lemma combines the codes of the preceding lemma with the block-to-stationary construction; it is stated in a form that allows the necessary iteration.

Lemma III.5.6 Let μ be an ergodic measure with alphabet A = {0, 1, 2, 3} such that μ({2, 3}^g) = 0. Given ε > 0, δ > 0, and N, there is a k ≥ 2g + 3 and a stationary encoder F: A^Z → A^Z such that for ν = μ ∘ F^{−1} the following hold.

(i) P_{ν×ν}(W_k(y, ỹ) ≤ 2^{kN}) ≤ 2^{−g} + ε.
(ii) d(x, F(x)) ≤ δ, almost surely.
(iii) ν({2, 3}^{g+1}) = 0.

Proof Choose k ≥ (2g + 3)/δ, then choose L ≥ 2^{kN+2}/ε so that L is divisible by k. Let {C_i : 1 ≤ i ≤ 2^g} be as in the preceding lemma for the given δ.
Put M = 2^g L and define a code C: A^M → A^M by setting C(x_1^M) = x_1^M, for x_1^M ∉ (G_L)^{2^g}, and

C(x_1^M) = y_1^M, where y_{(i−1)L+1}^{iL} = C_i(x_{(i−1)L+1}^{iL}), 1 ≤ i ≤ 2^g,

for x_1^M ∈ (G_L)^{2^g}. Next apply the block-to-stationary construction of Section I.8 to obtain a stationary code F = F_C with the property that for almost every x ∈ A^Z there is a partition, Z = ∪_j I_j, of the integers into subintervals of length no more than M, such that if J consists of those j for which I_j is shorter than M, then ∪_{j∈J} I_j covers no more than a limiting (ε/4)-fraction of Z, and such that y = F(x) satisfies the following two conditions.

(a) If I_j = [n + 1, n + M], then y_{n+1}^{n+M} = C(x_{n+1}^{n+M}).
(b) If i ∈ ∪_{j∈J} I_j, then y_i = x_i.

The intervals I_j of length M will be called coding blocks, while those that are shorter than M will be called gaps.

To show that F = F_C has the desired properties, first note that the only changes made in x to produce y = F(x) are made by the block codes C_i, each of which changes only the first 2g + 3 places in a k-block. Thus d(x, F(x)) ≤ δ, since k ≥ (2g + 3)/δ, and hence property (ii) holds. Furthermore, the changes produce blocks of 2's and 3's of length g which are enclosed between two 0's; since μ had no blocks of 2's and 3's of length g, the encoded process can have no such blocks of length g + 1, which establishes property (iii).

To show that property (i) holds, note that if y and ỹ are chosen independently, then, since the M-block code C is a concatenation of the L-block codes, only the following cases can occur.

Case 1. Either y_1 or ỹ_1 is in a gap of the code F.

Case 2. Both y_1 and ỹ_1 belong to coding blocks of F, that is, there are integers n, m, i, and j such that n < 1 ≤ n + L − 1, m < 1 ≤ m + L − 1, y_n^{n+L−1} = C_i(x_n^{n+L−1}), and ỹ_m^{m+L−1} = C_j(x̃_m^{m+L−1}). Three subcases are now possible.

Case 2a. i = j.
Case 2b. i ≠ j and 2^{kN} ≤ m + L − 1.
Case 2c. i ≠ j and 2^{kN} > m + L − 1.

Case 1 occurs with probability at most ε/2, since the limiting density of time in the gaps is upper bounded by ε/4 for each of the two independently chosen sequences. Case 2a occurs with probability at most 2^{−g}, since y and ỹ were independently chosen and there are 2^g different kinds of L-blocks, each of which is equally likely to be chosen. In Case 2b, the first 2^{kN} terms of ỹ lie in an L-block coded by C_j with j ≠ i, so pairwise k-separation guarantees that W_k(y, ỹ) > 2^{kN}. If Case 2c occurs then a block boundary falls within the first 2^{kN} terms of ỹ, which happens with probability at most ε/2, since L ≥ 2^{kN+2}/ε. Therefore

P_{ν×ν}(W_k(y, ỹ) ≤ 2^{kN}) ≤ 2^{−g} + ε/2 + ε/2 = 2^{−g} + ε,

which establishes property (i) and completes the proof of Lemma III.5.6. □
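The marker step of the construction is concrete enough to code directly. The sketch below (names mine) implements the k-block encoder of Lemma III.5.5, which overwrites the first 2g + 3 symbols of a k-block with the pattern 0 w 0 w 0 for a marker w ∈ {2, 3}^g:

```python
def marker_encode(block, w, g):
    """Replace the first 2g+3 symbols of a k-block by 0 w 0 w 0, where w
    is a fixed g-block of 2's and 3's (the marker); the rest of the
    block is unchanged.  Requires len(block) >= 2g+3."""
    marker = (0,) + w + (0,) + w + (0,)
    return marker + tuple(block[2 * g + 3:])

g = 2
w1, w2 = (2, 2), (3, 2)        # two of the 2^g possible markers
b = (1,) * 10                  # a k-block, k = 10 >= 2g+3
y1, y2 = marker_encode(b, w1, g), marker_encode(b, w2, g)
print(y1)        # -> (0, 2, 2, 0, 2, 2, 0, 1, 1, 1)
print(y1 != y2)  # -> True: different markers give different encodings
```

Only 2g + 3 of the k symbols are touched, so the code is ((2g + 3)/k)-close to the identity, and every k-window of the output contains the signature 0w0, which is what drives the pairwise k-separation.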
To finish the proof it will be shown that there is a sequence of stationary encoders, F_n: A^Z → A^Z, and an increasing sequence {k(n)} such that, for the composition G_n = F_n ∘ G_{n−1}, with G_1 = F_1, and measure ν^{(n)} = μ ∘ G_n^{−1}, the following properties hold for each n ≥ 1.

(a) P_{ν^{(n)}×ν^{(n)}}(W_{k(i)}(y, ỹ) ≤ 2^{ik(i)}) ≤ 2^{−i+1} + 2^{−i} + 2^{−n+1} + 2^{−n}, 1 ≤ i ≤ n.
(b) d(G_n(x), G_{n+1}(x)) ≤ 2^{−(n+1)}, almost surely.
(c) ν^{(n)}({2, 3}^n) = 0.

Condition (b) guarantees that eventually almost surely each coordinate is changed only finitely many times, so there will be a limit encoder F and a limit measure ν = μ ∘ F^{−1}. Condition (a) guarantees that the limit measure ν will have the desired property, namely that (1/k(n)) log W_{k(n)}(y, ỹ) → ∞ in ν × ν probability. Condition (c) is a technical condition which makes it possible to proceed from step n to step n + 1. In summary, it is enough to show the existence of such a sequence of encoders {F_n} and integers {k(n)}.

This will be done by induction. The first step is to apply Lemma III.5.6 with g = 0 to select k(1) ≥ 3 and F_1 such that (a), (b), and (c) hold for n = 1, with G_1 = F_1. Assume F_1, F_2, ..., F_n have been constructed so that (a), (b), and (c) hold for n. Let δ ≤ 2^{−(n+1)} be a positive number to be specified later. Apply Lemma III.5.6 with g = n and μ replaced by ν^{(n)} to select k(n + 1) larger than both 2n + 2 and k(n), and to select an encoder F_{n+1}: A^Z → A^Z such that for ν^{(n+1)} = ν^{(n)} ∘ F_{n+1}^{−1} the properties (i)-(iii) of Lemma III.5.6 hold. These properties imply that conditions (b) and (c) hold for n + 1, while condition (a) holds for the case i = n + 1. If δ is small enough, then ν^{(n+1)} will be so close to ν^{(n)} that condition (a), with n replaced by n + 1, will hold for i ≤ n + 1, since ν^{(n+1)} = ν^{(n)} ∘ F_{n+1}^{−1} and since it was assumed that δ ≤ 2^{−(n+1)}. This completes the construction of the counterexample.

Remark III.5.7 The number ε can be made arbitrarily small in the preceding argument, and hence one can make P(x_0 ≠ F(x)_0) as small as desired. In particular, the entropy of the encoded process ν can be forced to be arbitrarily close to log 2, the entropy of the coin-tossing process μ.
III.5.e An approximate-match counterexample.

The goal is to show that there is an ergodic process μ for which

(5)  lim_{k→∞} Prob((1/k) log W_k(x, y, α) ≤ N) = 0, ∀N.

Such a μ can be produced merely by suitably choosing the parameters in the strong-nonadmissibility example constructed in III.3.b. A weaker subsequence result is discussed first, as it illustrates the ideas in simpler form. First some definitions and results from III.3.b will be recalled.

Two sets C, D ⊂ A^k are α-separated if C ∩ [D]_α = ∅, where [D]_α denotes the α-blowup of D. The full k-block universe of S ⊂ A^ℓ is the set of k-blocks that appear anywhere in any concatenation of members of S. Two subsets S, S' ⊂ A^ℓ are (α, K)-strongly-separated if their full k-block universes are α-separated for any k ≥ K. A merging of a collection {S_j : j ∈ J} is a product of the form

S = ∏_{m=1}^M S_{φ(m)},

where J divides M and φ: [1, M] → [1, J] is such that |φ^{−1}(j)| = M/J, for each j ∈ J.

Let α be a positive number smaller than 1/2. Given an increasing unbounded sequence {N_m} and a decreasing sequence {α_m} ⊂ (α, 1/2), inductive application of the merging/separation lemma, see Remark III.3.?, produces a sequence {S(m) ⊂ A^{ℓ(m)} : m ≥ 1} and an increasing sequence {k_m}, such that the following properties hold for each m.

(a) S(m) is a disjoint union of a collection {S_j(m) : j ≤ J_m} of pairwise (α_m, k_m)-strongly-separated sets of the same cardinality.
(b) Each S_j(m) is a merging of {S_j(m − 1) : j ≤ J_{m−1}}.
(c) 2^{k_m N_m} ≤ ℓ(m)/m.

The only new fact here is property (c). Once k_m is determined, S(m) can be replaced by (S(m))^n for any positive integer n without disturbing properties (a) and (b), and hence, since N_m is unbounded, property (c) can also be assumed to hold.

Property (b) guarantees that the measure μ defined by {S(m)} is ergodic, while property (c) implies that

(6)  lim_{m→∞} Prob(W_{k_m}(x, y, α) ≤ 2^{k_m N_m}) = 0.

Indeed, if sample paths x and y are picked independently at random, then with probability at least (1 − 1/J_m)²(1 − 1/m)², their starting positions x_1 and y_1 will belong to ℓ(m)-blocks that belong to different S_j(m) and lie at least 2^{k_m N_m} indices below the end of each ℓ(m)-block, so that

Prob(W_{k_m}(x, y, α) > 2^{k_m N_m}) ≥ (1 − 1/J_m)²(1 − 1/m)²,

which proves (6). This is just the subsequence form of the desired result (5).

Further independent cutting and stacking can always be applied at any stage without losing separation properties already gained, and hence the column structures at any stage can be made so long that bad waiting-time behavior is guaranteed for the entire range k_m ≤ k ≤ k_{m+1}. This leads to an example satisfying the stronger result, (5), which controls what happens in the interval k_m ≤ k ≤ k_{m+1}. The reader is referred to [76] for a complete discussion of this final argument.
Chapter IV

B-processes.

The focus in this chapter is on B-processes, that is, stationary, finite alphabet processes that are stationary codings of i.i.d. processes. Various other names have been used, including almost block-independent processes, finitely determined processes, and very weak Bernoulli processes, each arising from a different characterization. This terminology and many of the ideas to be discussed here are rooted in Ornstein's fundamental work on the much harder problem of characterizing the invertible stationary codings of i.i.d. processes, the so-called isomorphism problem in ergodic theory, [46]. Invertibility is a basic concept in the abstract study of transformations, but is of little interest in stationary process theory, where the focus is on the joint distributions rather than on the particular space on which the random variables are defined. This is fortunate, for while the theory of invertible stationary codings of i.i.d. processes is still a complex theory, it becomes considerably simpler when the invertibility requirement is dropped.

Section IV.1 Almost block-independence.

A natural and useful characterization of stationary codings of i.i.d. processes, the almost block-independence property, will be discussed in this first section. The almost block-independence property, like other characterizations of B-processes, is expressed in terms of the d̄-metric (or some equivalent metric.) As in earlier chapters, either measure or random variable notation will be used for the d̄-distance, that is, if μ is the distribution of X_1^n and ν the distribution of Y_1^n, then d̄_n(X_1^n, Y_1^n) will often be used in place of d̄_n(μ, ν).

A block-independent process is formed by extending a measure μ_n on A^n to a product measure on (A^n)^∞, then transporting this to a T^n-invariant measure μ̃ on A^∞. In other words, μ̃ is the measure on A^∞ defined by the formula

μ̃(x_1^{mn}) = ∏_{j=1}^m μ_n(x_{(j−1)n+1}^{(j−1)n+n}), x_1^{mn} ∈ A^{mn}, m ≥ 1,

together with the requirement that μ̃(x_i^j) be obtained by summing over the unspecified coordinates, for all i ≤ j ≤ mn and all x_i^j. Note that μ̃ is T^n-invariant, though it is not, in general, stationary; randomizing the start produces a stationary process.
) An ergodic process is almost blockindependent if and only if it is a stationary coding of an i.u n of p to An. process y for which h(v) > h(A). (k ) v (j1)nln (j1)n+1 is independent of { r. In random variable language. The fact that stationary codings of i. Both of these are quite easy to prove.4) < E.i. then by showing that the ABI property is preserved under the passage to dlimits.d. In fact it is possible to show that an almost blockindependent process p is a stationary coding of any i. characterizes the class of stationary codings of i. The fact that only a small density of changes are needed insures that an iteration of the method produces a limit coding equal to 12. Of course.i. for each j > 1.d. however. Note. process onto a process dclose to then how to how to make a small density of changes in the code to produce a process even closer in d. is a stationary coding of an i. An i. The theory to be developed in this called the concatenatedblock process defined by chapter could be stated in terms of approximation by concatenatedblock processes. that the two theorems together imply that a mixing Markov chain is a stationary coding of an i. process. and is sufficient for the purposes of this book. in particular. there is an N such that if n > N and is the independent nblocking of .u is the blockindependent process if n is defined by the restriction .1 (The almost blockindependence theorem.d.d. Theorem IV.d. all this requires that h(v) > h(p. process.d.d.2 A mixing Markov chain is almost blockindependent. BPROCESSES. .i. the independent nblocking of {Xi } is the Tninvariant process {Y. An ergodic process p. for each j > 1. The independent nblocking of an ergodic process ./ } defined by the following two conditions.i.d. since the ABI condition holds for every n > 1 and every E > O. Theorem IV. processes are almost blockindependent is established by first proving it for finite codings.1. (a) Y U —1)n+n and X7 have the same distribution. 
It is not as easy to show that an almost block-independent process µ is a stationary coding of an i.i.d. process, for this requires the construction of a stationary coding from some i.i.d. process ν onto the given almost block-independent process µ. The construction is much simpler, however, if it is assumed that the i.i.d. process has an infinite alphabet with continuous distribution, for then any n-block code can be represented as a function of the first n coordinates of the process, and d̄-joinings can be used to modify such codes. The almost block-independence property is preserved under stationary coding and, in fact, under passage to d̄-limits. In summary, finite codings of i.i.d. processes are almost block-independent; since a stationary coding is a d̄-limit of finite codings, stationary codings of i.i.d. processes are almost block-independent as well. This result, and the fact that mixing Markov chains are almost block-independent, which is not at all obvious, are the principal results of this section.
IV.1.a B-processes are ABI processes.

First it will be shown that finite codings of i.i.d. processes are almost block-independent. Let {Y_i} be a finite coding of the i.i.d. process {X_i} with window half-width w, and let {Z_i} be the independent n-blocking of {Y_i}, for any n > 2w/ε. The successive n − 2w blocks

Y_{w+1}^{n-w-1}, Y_{n+w+1}^{2n-w-1}, ..., Y_{(m-1)n+w+1}^{mn-w-1}

are independent with the distribution of Y_{w+1}^{n-w-1}, by the definition of window half-width and the assumption that {X_i} is i.i.d. This is because blocks separated by twice the window half-width are independent. Likewise, the successive n − 2w blocks

Z_{w+1}^{n-w-1}, Z_{n+w+1}^{2n-w-1}, ..., Z_{(m-1)n+w+1}^{mn-w-1}

are independent with the distribution of Y_{w+1}^{n-w-1}, by the definition of independent n-blocking. If the first and last w terms of each n-block are omitted, property (f) of the d̄-property list in Section I.9 yields

(1) mn d̄_mn(Z_1^mn, Y_1^mn) ≤ m(n − 2w) d̄_{m(n-2w)}(Z_{w+1}^{n-w-1} ⋯ Z_{(m-1)n+w+1}^{mn-w-1}, Y_{w+1}^{n-w-1} ⋯ Y_{(m-1)n+w+1}^{mn-w-1}) + 2wm.

The first term on the right vanishes, since the two omitted-end processes have the same distribution, and 2wm/mn = 2w/n < ε, so d̄_mn(Z_1^mn, Y_1^mn) < ε, for all m. This proves that finite codings of i.i.d. processes are almost block-independent. Furthermore, the preceding argument applies to stationary codings of infinite alphabet i.i.d. processes onto finite alphabet processes, hence such coded processes are also almost block-independent.

Next assume µ is the d̄-limit of almost block-independent processes. Given ε > 0, choose an almost block-independent process ν such that d̄(µ, ν) < ε/3. Since ν is almost block-independent there is an N such that if n ≥ N and ν̄ is the independent n-blocking of ν, then d̄(ν, ν̄) < ε/3.
Fix such an n and let µ̄ be the independent n-blocking of µ. The fact that both µ̄ and ν̄ are i.i.d. when thought of as A^n-valued processes implies that

d̄(µ̄, ν̄) ≤ d̄_n(µ_n, ν_n) ≤ d̄(µ, ν) < ε/3,

by property (c) of the d̄-property list in Section I.9. The triangle inequality then gives

d̄(µ, µ̄) ≤ d̄(µ, ν) + d̄(ν, ν̄) + d̄(ν̄, µ̄) < ε.

This proves that the class of ABI processes is d̄-closed, and completes the proof that B-processes are almost block-independent.
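The window half-width argument above can be checked empirically: if Y_i = f(X_{i−w}, ..., X_{i+w}) is a finite coding of an i.i.d. process, then coordinates of {Y_i} separated by more than 2w depend on disjoint blocks of the X's and so are independent. A minimal sketch, with an arbitrary illustrative coder f (XOR over a window of half-width w = 1):

```python
import random

def finite_coding(x):
    """A finite coding with window half-width w = 1: each output symbol
    is the XOR of a 3-symbol window of the input."""
    return [x[i - 1] ^ x[i] ^ x[i + 1] for i in range(1, len(x) - 1)]

rng = random.Random(1)
x = [rng.randint(0, 1) for _ in range(200_002)]
y = finite_coding(x)

# Coordinates at distance 3 > 2w = 2 are functions of disjoint i.i.d.
# windows, so the joint frequency should factor into the marginals.
gap = 3
p1 = sum(y) / len(y)
p11 = sum(a & b for a, b in zip(y, y[gap:])) / (len(y) - gap)
```

For fair input bits the empirical joint frequency p11 is close to p1², confirming the independence of blocks separated by more than twice the window half-width.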
IV.1.b ABI processes are B-processes.

Let {X_i} be an almost block-independent process with finite alphabet A and Kolmogorov measure µ. The goal is to show that µ is a stationary coding of an i.i.d. process. An i.i.d. process {Z_i} is said to be continuously distributed if Prob(Z_0 = b) = 0, for all alphabet symbols b. It will be shown that there is a continuously distributed i.i.d. process {Z_i}, with Kolmogorov measure λ, such that µ = λ ∘ F^{-1} for some stationary coding F. Since any random variable with continuous distribution is a function of any other random variable with continuous distribution, a stationary coding of one continuously distributed i.i.d. process is a stationary coding of any other, which allows one to choose whatever continuously distributed i.i.d. process is convenient at a given stage of the construction.

The key to the construction is the following lemma, which allows the creation of arbitrarily better codes by making small changes in a good code.

Lemma IV.1.3 (Stationary code improvement.) Suppose µ is almost block-independent, and suppose d̄(µ, λ ∘ G^{-1}) ≤ ε, where G is a finite stationary coding of a continuously distributed i.i.d. process {Z_i} with Kolmogorov measure λ, and {V_i} is a binary, equiprobable i.i.d. process, independent of {Z_i}, with Kolmogorov measure ν. Then given δ > 0 there is a finite stationary coding F of {(Z_i, V_i)} such that the following two properties hold.

(i) d̄(µ, (λ × ν) ∘ F^{-1}) ≤ δ.

(ii) (λ × ν)({(z, v): G(z)_0 ≠ F(z, v)_0}) ≤ 4ε.

The lemma immediately yields the desired result, for one can start with the vector-valued i.i.d. process

(2) {(V(0)_i, V(1)_i, V(2)_i, ...)},

with independent components, where V(0)_0 is uniform on [0, 1) and, for t ≥ 1, V(t)_0 is binary and equiprobable.
This is an i.i.d. process with Kolmogorov measure λ for which there is a sequence of stationary codes {F_t} such that the following hold.

(a) d̄(µ, λ ∘ F_t^{-1}) → 0, as t → ∞.

(b) Σ_t Prob((F_{t+1}(z))_0 ≠ (F_t(z))_0) < ∞.

Indeed, the lemma can be applied sequentially with Z(t)_i = (V(0)_i, V(1)_i, ..., V(t−1)_i), together with an arbitrary initial coding F_0 of {V(0)_i}, and with ε_t = 2^{-t} and δ_t = 2^{-(t+1)}, to obtain a sequence of stationary codes {F_t} that satisfy the desired properties (a) and (b). The second condition is important, for it means that, almost surely, each coordinate F_t(z)_i changes only finitely often as t → ∞. Hence there will almost surely exist a limit code F(z)_i = lim_t F_t(z)_i, which is stationary and for which d̄(µ, λ ∘ F^{-1}) = 0, that is, µ = λ ∘ F^{-1}, producing thereby a limit coding F from the vector-valued i.i.d. process (2) onto the almost block-independent process µ. The exact representation of {V_i} is unimportant, since any i.i.d. process with continuous distribution is a stationary coding, with window width 1, of any other i.i.d. process with continuous distribution; this fact will be used in the sequel.

The idea of the proof of Lemma IV.1.3 is to first truncate G to a block code. A block code version of the lemma is then used to change to an arbitrarily better block code.
Finally, the better block code is converted to a stationary code by a modification of the block-to-stationary method of Section I.8, using the auxiliary process {V_i} to tell when to apply the block code. A finite stationary code is truncated to a block code by using it except when within a window half-width of the end of a block.

Lemma IV.1.4 (Code truncation.) Let F be a finite stationary coding which carries the stationary process ν onto the stationary process ν ∘ F^{-1}. Given δ > 0 there is an N such that if n ≥ N then there is an n-block code G, called the (δ, n)-truncation of F, such that

d_n(F(u)_1^n, G(u_1^n)) ≤ δ, for ν-almost every sample path u.

Proof. Let w be the window half-width of F and let N = 2w/δ. For n ≥ N define G(u_1^n) = b F(ũ)_{w+1}^{n-w} c, where ũ is any sample path for which ũ_1^n = u_1^n, and b and c are fixed words of length w. This is well-defined since F(ũ)_{w+1}^{n-w} depends only on the values of ũ_1^n. Clearly d_n(F(u)_1^n, G(u_1^n)) ≤ δ, for ν-almost every sample path u, since the only disagreements can occur in the first or last w places. □

The block code version of the stationary code improvement lemma is stated as the following lemma.

Lemma IV.1.5 (Block code improvement.) Let (Y, ν) be a nonatomic probability space and suppose ψ: Y → A^n is a measurable mapping such that d̄_n(µ_n, ν ∘ ψ^{-1}) ≤ δ. Then there is a measurable mapping φ: Y → A^n such that µ_n = ν ∘ φ^{-1} and such that E_ν(d_n(φ(y), ψ(y))) ≤ δ.

Proof. This is really just the mapping form of the idea that d̄-joinings are created by cutting up nonatomic spaces, see Exercise 12 in Section I.9. Indeed, a joining has the form

λ(a_1^n, b_1^n) = ν(φ^{-1}(a_1^n) ∩ ψ^{-1}(b_1^n)), a_1^n, b_1^n ∈ A^n,

for φ: Y → A^n such that µ_n = ν ∘ φ^{-1}, so that E_ν(d_n(φ(y), ψ(y))) = E_λ(d_n(a_1^n, b_1^n)). □

The following terminology will be used. A binary, ergodic process {R_i} will be called a (δ, n)-blocking process if the waiting time between 1's is never more than n and is exactly equal to n with probability at least 1 − δ. A block R_i^{i+n-1} of length n will be called a coding block if R_j = 0, for i ≤ j < i + n − 1, and R_{i+n-1} = 1. An ergodic pair process {(W_i, R_i)} will be called a (δ, n)-independent process if {R_i} is a (δ, n)-blocking process and W_1^n is independent of {(W_i, R_i): i ≤ 0}, given that R_1^n is a coding block. An ergodic process {W_i} is called a (δ, n)-independent extension of a measure µ_n on A^n, with associated (δ, n)-blocking process {R_i}, if {(W_i, R_i)} is a (δ, n)-independent process and

Prob(W_1^n = a_1^n | R_1^n is a coding block) = µ_n(a_1^n), a_1^n ∈ A^n.
The i.i.d. nature of the auxiliary process {V_i}, and its independence of the {Z_i} process, guarantee that the final coded process satisfies an extension of almost block-independence in which occasional spacing between blocks is allowed. The almost block-independence property can thus be strengthened to allow gaps between the blocks, while retaining the property that the blocks occur independently. The following lemma supplies the desired stronger form of the ABI property, an extension which will be shown to hold for ABI processes.
Lemma IV.1.6 (Independent-extension.) Let {X_i} be an almost block-independent process with Kolmogorov measure µ. Given ε > 0 there is an N and a δ > 0 such that if n ≥ N then d̄(µ, µ̃) ≤ ε, for any (δ, n)-independent extension µ̃ of µ_n.

Proof. In outline, here is the idea of the proof. Let {W_i} be a (δ, n)-independent extension of µ_n, with associated (δ, n)-blocking process {R_i}, and let {W̃_i} denote the W-process conditioned on a realization {r_i} of the associated (δ, n)-blocking process. It is enough to show that, for almost every realization {r_i}, the conditioned process can be well-fitted to an independent blocking that is d̄-close to µ, provided only that δ is small enough and n is large enough.

To make this outline precise, first choose N_1 so large that d̄(µ, γ) < ε/3, where γ is the Kolmogorov measure of the independent N_1-blocking of µ. Since γ is not stationary it need not be true that d̄_m(µ_m, γ_m) < ε/3, for all m, but at least there will be an M_1 such that

(3) d̄_m(µ_m, γ_m) < ε/3, m ≥ M_1.

Put N_2 = M_1 + 2N_1 and fix n ≥ N_2. The obstacle to a good d̄-fit that must be overcome is the possible nonsynchronization between n-blocks and multiples of N_1. This is solved by noting that, conditioned on starting at a coding n-block, the starting position, say i, of an n-block may not be an exact multiple of N_1, but i + j will be, for some j < N_1. Thus, after omitting at most N_1 places at the beginning and end of the n-block, it becomes synchronized with the independent N_1-blocking.

Let {n_k} be the successive indices of the locations of the 1's in {r_i}. For each k for which n_{k+1} − n_k = n, choose the least integer t_k and greatest integer s_k such that t_k N_1 ≥ n_k and s_k N_1 ≤ n_{k+1}. Note that m = s_k N_1 − t_k N_1 is at least as large as M_1, and that W̃_{t_k N_1 + 1}^{s_k N_1} has the same distribution as a length-m segment of {X_i}, so that (3) yields

(4) d̄_m(γ_{t_k N_1 + 1}^{s_k N_1}, W̃_{t_k N_1 + 1}^{s_k N_1}) < ε/3.
The processes {γ_i} and {W̃_i} are now joined by using the joining that yields (4) on the blocks [t_k N_1 + 1, s_k N_1], and an arbitrary joining on the other parts. If M_1 is large enough, so that the omitted segments t_k N_1 − n_k and n_{k+1} − s_k N_1 are negligible compared to M_1, and if δ is small enough, so that very little is outside coding blocks, then the d̄-distance between {γ_i} and {W̃_i} will be less than 2ε/3, for almost every realization {r_k}, and hence d̄(γ, µ̃) ≤ 2ε/3. An application of property (g) in the list of d̄-properties of Section I.9 then yields the desired result, d̄(µ, µ̃) ≤ ε, completing the proof of Lemma IV.1.6. □

Proof of Lemma IV.1.3. Let {Z_i} be a continuously distributed i.i.d. process and {V_i} a binary, equiprobable i.i.d. process, independent of {Z_i}, with respective Kolmogorov measures λ and ν, and let λ × ν denote the Kolmogorov measure of the direct product {(Z_i, V_i)}. Let µ be
an almost block-independent process with alphabet A and Kolmogorov measure µ, let G be a finite stationary coding such that d̄(µ, λ ∘ G^{-1}) ≤ ε, and fix δ > 0. The goal is to construct a finite stationary coding F of {(Z_i, V_i)} such that the following hold.

(i) d̄(µ, (λ × ν) ∘ F^{-1}) ≤ δ.

(ii) (λ × ν)({(z, v): G(z)_0 ≠ F(z, v)_0}) ≤ 4ε.

Towards this end, first choose n and δ̃ < ε so that if µ̃ is a (δ̃, n)-independent extension of µ_n then d̄(µ, µ̃) ≤ δ, and so that there is an (ε, n)-truncation G̃ of G. By definition,

(5) E(d_n(G(Z)_1^n, G̃(Z_1^n))) ≤ ε.

Since d̄(µ, λ ∘ G^{-1}) ≤ ε, the distribution of G̃(Z_1^n) is within 2ε of µ_n in d̄_n, and hence the block code improvement lemma provides an n-block code φ such that µ_n = λ_n ∘ φ^{-1} and E(d_n(G̃(Z_1^n), φ(Z_1^n))) ≤ 2ε.

Next, let g be a finite stationary coding of {V_i} onto a (δ̃, n)-blocking process {R_i}, and lift this to the stationary coding g* of {(Z_i, V_i)} onto {(Z_i, R_i)} defined by g*(z, v) = (z, g(v)), that is, R_i = g(v)_i. Let {y_k} denote the (random) sequence of indices that mark the starting positions of the successive coding blocks in a (random) realization r of {R_i}, and let F(z, r) be the (stationary) code defined by applying φ whenever the waiting time between 1's is exactly n, that is,

F(z, r)_{y_k}^{y_k + n - 1} = φ(z_{y_k}^{y_k + n - 1}), for all k,

and F(z, r)_i = a, for i ∉ ∪_k [y_k, y_k + n − 1], where a ∈ A is some fixed letter. Clearly F maps onto a (δ̃, n)-independent extension of µ_n, hence property (i) holds.
To see why property (ii) holds, first note that {Z_i} and {R_i} are independent, and hence the process defined by

W_k = d_n(G(Z)_{y_k}^{y_k + n - 1}, φ(Z_{y_k}^{y_k + n - 1}))

is ergodic, with expected value E(d_n(G(Z)_1^n, φ(Z_1^n)) | R_1^n is a coding block), which, by (5) and the block improvement bound, is at most 3ε. The ergodic theorem applied to {W_k}, along with the fitting bound, implies that the concatenations satisfy

(6) lim sup_K (1/K) Σ_{k=1}^K W_k ≤ 3ε, almost surely.

Since, by the ergodic theorem applied to {R_i}, the limiting density of indices i for which i ∉ ∪_k [y_k, y_k + n − 1] is almost surely at most δ̃ < ε, it follows from (6) that

(λ × ν)({(z, v): G(z)_0 ≠ F(z, v)_0}) ≤ 3ε + ε = 4ε,

thereby establishing property (ii). This completes the proof of the lemma, and hence the proof that an ABI process is a stationary coding of an i.i.d. process. □
Remark IV.1.7 The existence of block codes with a given distribution, as well as the independence of the blocking processes used to convert block codes to stationary codes onto independent extensions, are easy to obtain for i.i.d. processes with continuous distribution. With considerably more effort, suitable approximate forms of these ideas can be obtained for any finite alphabet i.i.d. process whose entropy is at least that of µ, see [35]. These results, which are the basis for Ornstein's isomorphism theory, are discussed in detail in his book, [46], see also [63, 42].

IV.1.c Mixing Markov and almost block-independence.

The key to the fact that mixing Markov implies ABI, as well as to several later results about mixing Markov chains, is a simple form of coupling, a special type of joining frequently used in probability theory.
A joining {(X_n, Y_n)} is obtained by running the two processes independently until they agree, then running them together. The Markov property guarantees that this is indeed a joining of the two processes, with the very strong additional property that a sample path is always paired with a sample path which agrees with it ever after as soon as they agree once. In particular, this guarantees that n d̄_n(X_1^n, Y_1^n) cannot exceed the expected time until the two processes agree. The precise formulation of the coupling idea needed here follows.

Let {X_n} be a (stationary) mixing Markov process and let {Y_n} be the nonstationary Markov chain with the same transition matrix as {X_n}, but which is conditioned to start with Y_0 = a. Let µ and ν be the Kolmogorov measures of {X_n} and {Y_n}, respectively. For each n ≥ 1, the coupling function is defined by the formula

w_n(a_1^n, b_1^n) = min{i ∈ [1, n − 1]: a_i = b_i}, if {i ∈ [1, n − 1]: a_i = b_i} ≠ ∅, and w_n(a_1^n, b_1^n) = n, otherwise.

The coupling measure is the measure λ_n on A^n × A^n defined by

(7) λ_n(a_1^n, b_1^n) = ν(a_1^w) µ(b_1^w) µ(b_{w+1}^n | b_w), if w = w_n(a_1^n, b_1^n) < n and a_{w+1}^n = b_{w+1}^n; λ_n(a_1^n, b_1^n) = ν(a_1^n) µ(b_1^n), if w_n(a_1^n, b_1^n) = n; and λ_n(a_1^n, b_1^n) = 0, otherwise.

Lemma IV.1.8 (The Markov coupling lemma.) The coupling function satisfies the bound

(8) d̄_n(µ, ν) ≤ (1/n) E_{λ_n}(w_n).

Furthermore, the expected value of the coupling function is bounded, independent of n and a, see Exercise 2, below, so that d̄_n(µ, ν) indeed goes to 0, as n → ∞.

Proof. The inequality (8) follows from the fact that the λ_n defined by (7) is a joining of µ_n and ν_n, and is concentrated on sequences that agree ever after, once they agree. A direct calculation, making use of the Markov properties for both measures, shows that λ_n is a probability measure with ν_n and µ_n as marginals, see Exercise 1. This establishes the lemma. □
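The coupling behind Lemma IV.1.8 is easy to simulate: run the stationary chain and the chain started at a independently until they first agree; the empirical mean agreement time bounds n d̄_n(µ, ν) and stays bounded in n. A sketch (the chain, its parameters, and the function names are illustrative assumptions):

```python
import random

P = [[0.7, 0.3], [0.4, 0.6]]   # a mixing 2-state transition matrix
pi = [4 / 7, 3 / 7]            # its stationary distribution

def step(x, rng):
    return 0 if rng.random() < P[x][0] else 1

def coupling_time(a, rng, max_t=10_000):
    """Run X (stationary start) and Y (started at state a) independently
    until they first agree; from then on they can be run together, so the
    per-symbol distance d_n is at most (agreement time)/n."""
    x = 0 if rng.random() < pi[0] else 1
    y = a
    for t in range(1, max_t + 1):
        x, y = step(x, rng), step(y, rng)
        if x == y:
            return t
    return max_t

rng = random.Random(2)
times = [coupling_time(1, rng) for _ in range(5000)]
mean_time = sum(times) / len(times)  # bounded independently of n, per the lemma
```

Dividing the mean agreement time by n gives an explicit upper bound on d̄_n(µ, ν), which therefore vanishes as n grows.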
To prove that a mixing Markov chain is d̄-close to its independent n-blocking, the idea is to construct a joining by matching successive n-blocks. This is possible because, by the Markov property, the distribution of X_{kn+1}^{kn+n}, conditioned on the past, depends only on X_{kn}, and, by the coupling lemma, the distribution of X_{kn+1}^{kn+n}, conditioned on X_{kn}, is d̄-close to the unconditioned distribution of X_1^n.

Fix a mixing Markov chain {X_i} with Kolmogorov measure µ, and fix ε > 0. The notation d̄_n(X_{kn+1}^{kn+n}, X_{kn+1}^{kn+n}/x_{kn}) will be used to denote the d̄_n-distance between the unconditioned distribution of X_{kn+1}^{kn+n} and the conditional distribution of X_{kn+1}^{kn+n}, given X_{kn} = x_{kn}. The coupling lemma implies that

(9) d̄_n(X_{kn+1}^{kn+n}, X_{kn+1}^{kn+n}/x_{kn}) < ε, for all x_{kn},

for all sufficiently large n. Fix n for which (9) holds and let {Z_i} be the independent n-blocking of {X_i}. For each positive integer m, a joining λ_mn of µ_mn with the distribution of Z_1^{mn} will be constructed such that E_{λ_mn}(d_mn(x_1^{mn}, z_1^{mn})) < ε. This will produce the desired d̄-closeness. For each x_{kn}, let λ_n(·, ·/x_{kn}) be a joining of the conditional distribution of X_{kn+1}^{kn+n}, given X_{kn} = x_{kn}, and the unconditioned distribution of X_1^n, that realizes the d̄_n-distance between them, and let λ_mn be the measure on A^{mn} × A^{mn} defined by

(10) λ_mn(x_1^{mn}, z_1^{mn}) = µ(x_1^n) 1(z_1^n = x_1^n) ∏_{k=1}^{m−1} λ_n(x_{kn+1}^{kn+n}, z_{kn+1}^{kn+n}/x_{kn}),

where the first n-blocks may be joined diagonally since Z_1^n and X_1^n have the same distribution. A direct calculation, using the Markov property of {X_i}, plus the fact that Z_{kn+1}^{kn+n} is independent of Z_j, j ≤ kn, shows that λ_mn is a joining of µ_mn with the distribution of Z_1^{mn}, see Exercise 3. Summing over the successive n-blocks and using the bound (9) yields E_{λ_mn}(d_mn(x_1^{mn}, z_1^{mn})) < ε, which proves that d̄(µ, µ̄(n)) ≤ ε, and hence that {X_i} is almost block-independent. Since the proof extends to Markov chains of arbitrary order, this completes the proof that mixing Markov chains are almost block-independent. □
The almost block-independent processes are, in fact, exactly the d̄-limits of mixing Markov processes, since mixing Markov chains are ABI and the ABI class is d̄-closed, and in the sense that the following holds.

Theorem IV.1.9 An almost block-independent process is the d̄-limit of mixing Markov processes.

Proof. By the independent-extension lemma, Lemma IV.1.6, it is enough to show that, for any δ > 0, a measure µ_n on A^n has a (δ, n)-independent extension which is actually a mixing Markov chain, provided only that n is large enough. The concatenated-block process defined by µ_n is a function of an irreducible Markov chain, see the concatenated-block example of Chapter I. (See Exercise 13.) The only problem, of course, is to produce a (δ, n)-independent extension which is mixing Markov of some order, not just a function of a mixing Markov chain. A suitable random insertion of spacers between the blocks produces a (δ, n)-independent extension which is mixing Markov of some order. This is easy to accomplish by marking long
concatenations of blocks with a symbol not in the alphabet, then randomly inserting an occasional repeat of this symbol. To describe the preceding in a rigorous manner, let C ⊆ A^n consist of those a_1^n for which µ(a_1^n) > 0, and let s be a symbol not in the alphabet A. For each positive integer m, let sC^m denote the set of all concatenations of the form s w(1) ⋯ w(m), where w(i) ∈ C, and let µ̂ be the measure on sC^m defined by the formula

µ̂(s w(1) ⋯ w(m)) = ∏_{i=1}^m µ(w(i)).

Fix 0 < p < 1, to be specified later, and let {Y_i} be the stationary, irreducible, aperiodic Markov chain with state space sC^m × {0, 1, ..., nm + 1} and transition matrix defined by the rules

(a) If i < nm + 1, then (s x_1^{nm}, i) can only go to (s x_1^{nm}, i + 1).

(b) (s x_1^{nm}, nm + 1) goes to (s y_1^{nm}, 0) with probability p µ̂(s y_1^{nm}), and to (s y_1^{nm}, 1) with probability (1 − p) µ̂(s y_1^{nm}), for each s y_1^{nm} ∈ sC^m.

Define f(s x_1^{nm}, i) to be x_{i−1}, if 1 < i ≤ nm + 1, and s otherwise. The process {Z_i} defined by Z_i = f(Y_i) is easily seen to be (mn + 2)-order Markov, since once the marker s is located in the past all future probabilities are known. As a function of the mixing Markov chain {Y_i}, it is mixing. Furthermore, {Z_i} is clearly a (δ, n)-independent extension of µ_n, provided m is large enough and p is small enough (so that only a δ-fraction of indices are spacers s). This completes the proof of Theorem IV.1.9. □
Remark IV.1.10 The use of a marker symbol not in the original alphabet in the above construction was only to simplify the proof that the final process {Z_i} is Markov. A proof that does not require alphabet expansion is outlined in Exercise 6c, below.

Remark IV.1.11 The proof that the B-processes are exactly the almost block-independent processes is based on the original proof, [66]. The coupling proof that mixing Markov chains are almost block-independent is modeled after the proof of a conditional form given in [41]. The proof that ABI processes are d̄-limits of mixing Markov processes is new.

IV.1.d Exercises.

1. Show that (7) defines a joining of µ_n and ν_n.

2. Show that the expected value of the coupling function is bounded, independent of n.

3. Show that the λ_mn defined by (10) is indeed a joining of µ_mn and the distribution of Z_1^{mn}.

4. Show that if {W_i} and {W̃_i} are (δ, n)-independent extensions of µ_n and µ̃_n, respectively, with the same associated (δ, n)-blocking process {R_i}, then d̄(ν, ν̃) ≤ d̄_n(µ_n, µ̃_n) + 2δ, where ν and ν̃ are the Kolmogorov measures of {W_i} and {W̃_i}, respectively.
5. Let {X_n} be binary i.i.d., and define Y_n to be 1 if X_n ≠ X_{n−1}, and 0, otherwise. Show that {Y_n} is Markov, and not i.i.d., unless X_n = 0 with probability 1/2. (Some special codings of i.i.d. processes are Markov.)

6. Let µ be the Kolmogorov measure of an ABI process for which 0 < µ_1(a) < 1, for some a ∈ A.

(a) Show that µ(a^n) → 0 as n → ∞, where a^n denotes the concatenation of a with itself n times.

(b) Show that µ̄(a^n) → 0 as n → ∞, where µ̄ is the independent n-blocking of µ.

(c) Prove Theorem IV.1.9 without extending the alphabet. (Hint: if n is large enough, the sequence b a^{2n} b can be used in place of s to mark the beginning of a long concatenation of n-blocks, using Exercise 6b and Exercise 4.)

7. Show that, even if the marker s belongs to the original alphabet A, the process {Z_i} defined in the proof of Theorem IV.1.9 is a (δ, n)-independent extension of µ_n. (Hint: use Exercise 4.)

Section IV.2 The finitely determined property.

The most important property of a stationary coding of an i.i.d. process is the finitely determined property, for it allows d̄-approximation of the process merely by approximating its entropy and a suitable finite joint distribution. In particular, it will be seen that a finitely determined process is a stationary coding of an i.i.d. process.
A stationary process µ is finitely determined (FD) if any sequence of stationary processes which converges to it weakly and in entropy also converges to it in d̄. In other words, a stationary µ is finitely determined if, given ε > 0, there is a δ > 0 and a positive integer k, such that any stationary process ν which satisfies the two conditions

(i) |µ_k − ν_k| < δ,

(ii) |H(µ) − H(ν)| < δ,

must also satisfy

(iii) d̄(µ, ν) < ε.

Some standard constructions produce sequences of processes converging weakly and in entropy to a given process. Properties of such approximating processes that are preserved under d̄-limits are automatically inherited by any finitely determined limit process. This principle is used in the following theorem to establish ergodicity and almost block-independence for finitely determined processes.

Theorem IV.2.1 A finitely determined process is ergodic, mixing, and almost block-independent.

Proof. Let µ^(n) be the concatenated-block process defined by the restriction µ_n of a stationary measure µ to A^n. If k is fixed and n is large relative to k, then the distribution of k-blocks in n-blocks is almost the same as the µ_k-distribution, hence µ^(n) and µ have almost the same k-block distribution for any k ≤ n, the only error in this approximation being the end effect caused by the fact that the final k symbols of each n-block have to be ignored. Also, the entropy of µ^(n) is equal to H(µ_n)/n, which converges to H(µ). Thus {µ^(n)} converges weakly and in entropy to µ, as n → ∞.

Concatenated-block processes can be modified by randomly inserting spacers between the blocks. For each n, let µ̃^(n) be a (δ_n, n)-independent extension of µ_n; such processes were defined as part of the independent-extension lemma, Lemma IV.1.6. If δ_n is small enough, then µ^(n) and µ̃^(n) have almost the same n-block distribution and almost the same entropy. Thus, if δ_n goes to 0 suitably fast, {µ̃^(n)} will converge weakly and in entropy to µ.
Since each µ̃^(n) is ergodic, and since ergodicity is preserved by passage to d̄-limits, it follows that a finitely determined process is ergodic. Since each µ̃^(n) can be chosen to be a function of a mixing Markov chain, and since functions of mixing processes are mixing and d̄-limits of mixing processes are mixing, a finitely determined process must be mixing. Since each such µ̃^(n) is also almost block-independent, and since d̄-limits of ABI processes are ABI, a finitely determined process must be almost block-independent. This completes the proof of Theorem IV.2.1. □

The converse result, that a stationary coding of an i.i.d. process must be finitely determined, is much deeper and will be established later. Also useful in applications is a form of the finitely determined property which refers only to finite distributions. If ν is a measure on A^n, and k ≤ n, the ν-distribution of k-blocks in n-blocks is the measure φ(k, ν) on A^k defined by

φ(k, ν)(a_1^k) = (1/(n − k + 1)) Σ_{i=0}^{n−k} Σ_{u ∈ A^i} Σ_{v ∈ A^{n−k−i}} ν(u a_1^k v),

the average of the ν-probability of a_1^k over the first n − k + 1 starting positions in sequences of length n. A direct calculation shows that

φ(k, ν)(a_1^k) = Σ_{x_1^n} p_k(a_1^k | x_1^n) ν(x_1^n) = E_ν(p_k(a_1^k | x_1^n)),

the expected value of the empirical k-block distribution with respect to ν. Note also that if ν̄ is the concatenated-block process defined by ν, then

(1) |ν̄_k − φ(k, ν)| ≤ 2(k − 1)/n,

since the only difference between the two measures is that ν̄_k includes in its average the ways k-blocks can overlap two n-blocks.
Theorem IV.2.2 (The FD property: finite form.) A stationary process µ is finitely determined if and only if, given ε > 0, there is a δ > 0 and positive integers k and N such that if n ≥ N then any measure ν on A^n for which |µ_k − φ(k, ν)| < δ and |H(µ_n) − H(ν)| < nδ must also satisfy d̄_n(µ_n, ν) < ε.

Proof. One direction is easy, for if ν_n is the projection onto A^n of a stationary measure ν, then φ(k, ν_n) = ν_k, for any k ≤ n, and (1/n)H(ν_n) → H(ν), so that a process µ which satisfies the finite form of the FD property stated in the theorem is finitely determined.
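The distribution of k-blocks in n-blocks is directly computable. The sketch below (the function name and the i.i.d. example are illustrative assumptions) implements φ(k, ν) for a measure ν on A^n given as a dictionary, and checks the fact used in the easy direction: φ(k, ν) is a probability measure, and when ν is the projection of a stationary (here i.i.d.) measure, φ(k, ν) recovers the k-block distribution exactly.

```python
from itertools import product
from collections import defaultdict

def phi(k, nu):
    """phi(k, nu)(a) averages, over the n - k + 1 starting positions,
    the nu-probability of seeing the k-block a at that position."""
    n = len(next(iter(nu)))
    out = defaultdict(float)
    for word, prob in nu.items():
        for i in range(n - k + 1):
            out[word[i:i + k]] += prob / (n - k + 1)
    return dict(out)

# nu = projection onto A^4 of an i.i.d.(p) measure on {0, 1}
p = 0.3
nu = {w: p ** sum(w) * (1 - p) ** (4 - sum(w))
      for w in product((0, 1), repeat=4)}
f = phi(2, nu)
total = sum(f.values())    # phi(k, nu) is a probability measure
prob_11 = f[(1, 1)]        # equals p*p for the stationary (i.i.d.) case
```

For a non-stationary ν the same function still applies, and the bound (1) controls how far φ(k, ν) can be from the k-block distribution of the concatenated-block process.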
To prove the converse, assume µ is finitely determined and let ε be a given positive number. The definition of finitely determined yields a δ > 0 and a k such that any stationary process ν̄ that satisfies |ν̄_k − µ_k| < δ and |H(ν̄) − H(µ)| < δ must also satisfy d̄(µ, ν̄) < ε.

Choose N so large that if n ≥ N then |(1/n)H(µ_n) − H(µ)| < δ/2. By making N enough larger relative to k to take care of end effects, it can also be supposed, using (1), that if n ≥ N and ν̄ is the concatenated-block process defined by a measure ν on A^n, then |ν̄_k − φ(k, ν)| < δ/2.

Fix n ≥ N and let ν be a measure on A^n for which |φ(k, ν) − µ_k| < δ/2 and |H(ν) − H(µ_n)| < nδ/2, and let ν̄ be the concatenated-block process defined by ν. The definition of N guarantees that

|ν̄_k − µ_k| ≤ |ν̄_k − φ(k, ν)| + |φ(k, ν) − µ_k| < δ,

and that |H(ν̄) − H(µ)| < δ, since H(ν̄) = (1/n)H(ν), which is within δ/2 of (1/n)H(µ_n), which is, in turn, within δ/2 of H(µ). The definitions of k and δ therefore imply that d̄(µ, ν̄) < ε, from which d̄_n(µ_n, ν) < ε follows, since an n-block of ν̄ aligned with a concatenation block has distribution ν. This completes the proof of Theorem IV.2.2. □
IV.2.a ABI implies FD.

The proof that almost block-independence implies finitely determined will be given in three stages. First it will be shown that an i.i.d. process is finitely determined. This proof will then be extended to establish that mixing Markov processes are finitely determined. Finally, it will be shown that the class of finitely determined processes is closed in the d̄-metric. These results immediately show that ABI implies FD, for an ABI process is the d̄-limit of mixing Markov processes, by Theorem IV.1.9 of the preceding section.

The idea of the i.i.d. proof is to extend a good d̄_n-fitting of the distributions of X_1^n and Y_1^n by fitting the conditional distribution of X_{n+1}, given X_1^n, as well as possible to the conditional distribution of Y_{n+1}, given Y_1^n. This strategy is successful because the conditional distribution of X_{n+1}, given X_1^n, does not depend on x_1^n, by independence; the method works provided closeness in distribution and entropy implies that the conditional distribution of Y_{n+1}, given Y_1^n, is almost independent of Y_1^n. To make this sketch into a real proof, an approximate independence concept will be used, and the proof will be carried out by induction.

Let p(x, y) = Prob(X = x, Y = y) denote the joint distribution of the pair (X, Y) of finite-valued random variables, and let p(x) = p_X(x) = Σ_y p(x, y) and p(y) = p_Y(y) = Σ_x p(x, y) denote the marginal distributions. The random variables are said to be ε-independent if

Σ_{x,y} |p(x, y) − p(x)p(y)| < ε.

If H(X) = H(X|Y), then X and Y are independent. An approximate form of this is stated as the following lemma.
Lemma IV.2.3 (The ε-independence lemma.) There is a positive constant c such that if H(X) − H(X|Y) ≤ cε², then X and Y are ε-independent.

Proof. The entropy difference may be expressed as the divergence of the joint distribution p_XY from the product distribution p_X p_Y, that is,

H(X) − H(X|Y) = Σ_{x,y} p(x, y) log (p(x, y) / (p(x)p(y))) = D(p_XY | p_X p_Y).

Pinsker's inequality, Exercise 6 of Section I.6, yields

D(p_XY | p_X p_Y) ≥ (1/c) (Σ_{x,y} |p(x, y) − p(x)p(y)|)²,

where c is a positive constant. This establishes the lemma. □

A stationary process {Y_i} is said to be an ε-independent process if (Y_1, ..., Y_{j−1}) and Y_j are ε-independent, for each j > 1. An i.i.d. process is, of course, ε-independent for every ε > 0. The next lemma asserts that a process is ε-independent if it is close enough in entropy and first-order distribution to an i.i.d. process.

Lemma IV.2.4 Let {X_i} be i.i.d. and {Y_i} be stationary, with respective Kolmogorov measures µ and ν. Given ε > 0, there is a δ > 0 such that if |µ_1 − ν_1| < δ and |H(µ) − H(ν)| < δ, then {Y_i} is an ε-independent process.
Proof If firstorder distributions are close enough then firstorder entropies will be close.2. as the following lemma. which is just the equality part of property (a) in the dproperties lemma.i. independent of X and Y. y) = X .2.d. y) = vii/ 2 . eindependent for every c > O.d.i. Y) p(x)p(y) Pinsker's inequality.
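The identity of Lemma IV.2.2 and the Pinsker bound behind the ε-independence lemma can be checked numerically. A minimal sketch, with a hypothetical joint distribution chosen only for illustration; natural logarithms are used, for which Pinsker's inequality is known to hold with c = 1/2:

```python
import math
from itertools import product

# A hypothetical joint distribution p(x, y) on a two-letter alphabet.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}
py = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in dist.values() if q > 0)

# H(X) - H(X|Y), computed as H(X) + H(Y) - H(X, Y).
gap = entropy(px) + entropy(py) - entropy(p)

# The divergence D(p_XY || p_X p_Y) of the joint from the product.
div = sum(p[(x, y)] * math.log(p[(x, y)] / (px[x] * py[y]))
          for x, y in product((0, 1), repeat=2))

# Variational distance between the joint and the product distribution.
var = sum(abs(p[(x, y)] - px[x] * py[y]) for x, y in product((0, 1), repeat=2))

assert abs(gap - div) < 1e-12   # Lemma IV.2.2: entropy gap equals divergence
assert div >= 0.5 * var ** 2    # Pinsker's inequality (c = 1/2 in nats)
```

A small entropy gap thus forces a small variational distance from independence, which is exactly how the ε-independence lemma is used below.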
Theorem IV.2.6. An i.i.d. process is finitely determined.

Proof. Fix an i.i.d. process {X_i} with Kolmogorov measure μ, and fix ε > 0. Lemma IV.2.4 provides a positive number δ so that if ν is a stationary measure for which |μ_1 − ν_1| < δ and |H(μ) − H(ν)| < δ, then the process {Y_i} defined by ν is an ε-independent process. Without loss of generality it can be supposed that δ < ε. Fix such a ν. It will be shown by induction that d̄_n(X_1^n, Y_1^n) ≤ ε, for every n, which implies that d̄(μ, ν) ≤ ε and establishes the theorem.

The following notation and terminology will be used. The random variable X_{n+1}, conditioned on X_1^n = x_1^n, is denoted by X_{n+1}/x_1^n, and d̄_1(X_{n+1}/x_1^n, Y_{n+1}/y_1^n) denotes the d̄-distance between the corresponding conditional distributions.

Getting started is easy, for closeness in first-order distribution means that first-order distributions can be d̄_1-well-fitted: d̄_1(X_1, Y_1) = (1/2)|μ_1 − ν_1| < δ/2 < ε. Assume it has been shown that d̄_n(X_1^n, Y_1^n) ≤ ε, and let λ_n realize d̄_n(X_1^n, Y_1^n). The strategy is to extend λ_n by fitting the conditional distributions X_{n+1}/x_1^n and Y_{n+1}/y_1^n as well as possible. This strategy is successful because the distribution of X_{n+1}/x_1^n is equal to the distribution of X_{n+1}, by independence, which is close to the distribution of Y_{n+1}, which is, in turn, close on the average to the distribution of Y_{n+1}/y_1^n, by ε-independence.

For each x_1^n and y_1^n, let λ^{x,y} realize d̄_1(X_{n+1}/x_1^n, Y_{n+1}/y_1^n). The measure

λ_{n+1}(x_1^{n+1}, y_1^{n+1}) = λ_n(x_1^n, y_1^n) λ^{x,y}(x_{n+1}, y_{n+1})

is certainly a joining of μ_{n+1} and ν_{n+1}. The identity (n+1) d_{n+1}(x_1^{n+1}, y_1^{n+1}) = n d_n(x_1^n, y_1^n) + d_1(x_{n+1}, y_{n+1}), together with the conditioning formula, Lemma IV.2.5, yields

(2) (n+1) E_{λ_{n+1}}(d_{n+1}) = n E_{λ_n}(d_n) + Σ_{x_1^n, y_1^n} λ_n(x_1^n, y_1^n) d̄_1(X_{n+1}/x_1^n, Y_{n+1}/y_1^n).

The triangle inequality yields

(3) d̄_1(X_{n+1}/x_1^n, Y_{n+1}/y_1^n) ≤ d̄_1(X_{n+1}/x_1^n, X_{n+1}) + d̄_1(X_{n+1}, Y_{n+1}) + d̄_1(Y_{n+1}, Y_{n+1}/y_1^n).

The first term is 0, since, by independence, the distribution of X_{n+1}/x_1^n is equal to the distribution of X_{n+1}. The second term is equal to d̄_1(X_1, Y_1) = (1/2)|μ_1 − ν_1| < δ/2 < ε/2, by stationarity. The third term is just one-half the variational distance between the distributions of the unconditioned random variable Y_{n+1} and the conditioned random variable Y_{n+1}/y_1^n, whose expected value (with respect to y_1^n) is at most ε/2, by ε-independence. Thus the second sum in (2) is at most ε/2 + ε/2 = ε, which, along with the induction bound n E_{λ_n}(d_n) ≤ nε and the fact that λ_{n+1} is a joining of μ_{n+1} and ν_{n+1}, produces

(n+1) d̄_{n+1}(X_1^{n+1}, Y_1^{n+1}) ≤ nε + ε = (n+1)ε,

thereby establishing the induction step. This completes the proof of Theorem IV.2.6.
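The start of the induction rests on the fact that first-order d̄-distance is one-half the variational distance: the best joining of two first-order distributions matches as much mass as possible on the diagonal. A small numerical sketch (the two marginals are hypothetical):

```python
# First-order marginals on a three-letter alphabet (illustrative values).
mu = {'a': 0.5, 'b': 0.3, 'c': 0.2}
nu = {'a': 0.4, 'b': 0.4, 'c': 0.2}

# Half the variational distance.
half_var = 0.5 * sum(abs(mu[s] - nu[s]) for s in mu)

# The best joining puts mass min(mu(s), nu(s)) on each diagonal pair (s, s),
# so the minimal expected Hamming cost is 1 minus the total overlap.
d1 = 1.0 - sum(min(mu[s], nu[s]) for s in mu)

assert abs(d1 - half_var) < 1e-12  # d-bar_1 equals half the variational distance
```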
Mixing Markov processes are finitely determined. The Markov result is based on a generalization of the i.i.d. proof. In that proof, the fitting was done one step at a time, using the fact that X_{n+1} is independent of X_1^n for the i.i.d. process, while Y_{n+1} is almost independent of Y_1^n, provided the Y-process is close enough to the X-process in distribution and entropy. To extend the i.i.d. proof to the mixing Markov case, two properties will be used. The first is the Markov property: a future block X_{n+1}^{n+m} depends on the immediate past X_n, but on no previous values. The second is the Markov coupling lemma, which guarantees that for mixing Markov chains even this dependence on X_n dies off in the d̄-sense as m grows. The key is to show that approximate versions of both properties hold for any process close enough to the Markov process in entropy and in joint distribution for a long enough time, for then a good match after n steps can be carried forward by fitting future m-blocks.

A conditional form of ε-independence will be needed. The random variables X and Y are conditionally ε-independent, given Z, if

Σ_z p(z) Σ_{x,y} |p(x, y|z) − p(x|z) p(y|z)| < ε,

where p(x, y|z) denotes the conditional joint distribution and p(x|z) and p(y|z) denote the respective marginals. The ε-independence lemma, Lemma IV.2.3, extends to the following conditional form, whose proof is left to the reader.

Lemma IV.2.7 (The conditional ε-independence lemma). Given ε > 0, there is a γ = γ(ε) > 0 such that if H(X|Z) − H(X|Y, Z) < γ, then X and Y are conditionally ε-independent, given Z.

The next lemma provides the desired approximate form of the Markov property.

Lemma IV.2.8. Let {X_n} be an ergodic Markov process with Kolmogorov measure μ and let {Y_n} be a stationary process with Kolmogorov measure ν. Given ε > 0 and a positive integer m, there is a δ > 0 such that if |μ_{m+1} − ν_{m+1}| < δ and |H(μ) − H(ν)| < δ, then Y_1^m and Y^{−1}_{−n} are conditionally ε-independent, given Y_0, for every n ≥ 1.

Proof. Choose γ = γ(ε) from the preceding lemma, then choose δ > 0 so small that if |μ_{m+1} − ν_{m+1}| < δ then |H(Y_1^m|Y_0) − H(X_1^m|X_0)| < γ/2, which is possible since conditional entropy is continuous in the variational distance. By decreasing δ if necessary, it can also be assumed that if |H(μ) − H(ν)| < δ then |mH(μ) − mH(ν)| < γ/2. The Markov property guarantees that H(X_1^m|X_0) = mH(μ), while H(Y_1^m|Y⁰_{−n}) decreases to mH(ν) as n → ∞. The choice of δ then guarantees that

H(Y_1^m|Y_0) − H(Y_1^m|Y⁰_{−n}) < γ, for every n ≥ 1,

which, by Lemma IV.2.7, implies that Y_1^m and Y^{−1}_{−n} are conditionally ε-independent, given Y_0, for every n ≥ 1. This proves Lemma IV.2.8.
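The Markov coupling lemma invoked above can be illustrated by simulation: run two copies of a mixing chain from different initial states, independently until they first agree and together thereafter (one standard coupling, not the book's construction). The per-coordinate disagreement rate of an m-block upper-bounds d̄_m between the two conditional block distributions. The transition matrix and parameters below are hypothetical, chosen only for illustration:

```python
import random

# A mixing two-state Markov chain (hypothetical transition matrix).
P = {0: [0.7, 0.3], 1: [0.4, 0.6]}

def step(state, rng):
    return 0 if rng.random() < P[state][0] else 1

def coupled_block(x0, y0, m, rng):
    """Run two copies from states x0, y0: independently until they meet,
    together afterwards. Returns the number of disagreeing coordinates."""
    x, y, disagree = x0, y0, 0
    for _ in range(m):
        if x == y:
            x = y = step(x, rng)
        else:
            x, y = step(x, rng), step(y, rng)
            disagree += (x != y)
    return disagree

rng = random.Random(0)
m, trials = 200, 2000
# Average per-coordinate disagreement; an upper bound on d-bar_m between the
# conditional m-block distributions started at state 0 and at state 1.
avg = sum(coupled_block(0, 1, m, rng) for _ in range(trials)) / (trials * m)
assert avg < 0.05  # the dependence on the initial state dies off
```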
Theorem IV.2.9. A mixing finite-order Markov process is finitely determined.

Only the first-order proof will be given; the extension to arbitrary finite order is left to the reader.

Proof. Fix a mixing Markov process {X_n} with Kolmogorov measure μ, and fix ε > 0. In the proof, X_{n+1}^{n+m}/x_1^n will denote the random vector X_{n+1}^{n+m}, conditioned on X_1^n = x_1^n, and d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) will denote the d̄-distance between the distributions of X_{n+1}^{n+m}/x_1^n and Y_{n+1}^{n+m}/y_1^n.

The Markov coupling lemma provides an m such that

(4) d̄_m(X_1^m/x_0, X_1^m/x̃_0) < ε/4, for all x_0, x̃_0 ∈ A.

Since d̄_m is upper bounded by one-half the variational distance, there is a δ > 0 so that if ν is a stationary measure for which |μ_{m+1} − ν_{m+1}| < δ, then

(5) d̄_m(X_1^m, Y_1^m) < ε/4.

By making δ smaller, if necessary, it can also be assumed that

(6) d̄_m(X_1^m/x_0, Y_1^m/y_0) < ε/4, whenever y_0 = x_0 ∈ A,

and, from Lemma IV.2.8, that

(7) Y_1^m and Y^{−1}_{−n} are conditionally (ε/2)-independent, given Y_0, for all n ≥ 1.

Fix a stationary process {Y_i} with Kolmogorov measure ν for which (5), (6), and (7) all hold and for which |H(μ) − H(ν)| < δ. The d̄-fitting of μ and ν is carried out m steps at a time, using (5) to get started. As in the proof for the i.i.d. case, it is enough to show that

(8) Σ_{x_1^n, y_1^n} λ_n(x_1^n, y_1^n) d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) < ε, for all n ≥ 1,

where λ_n realizes d̄_n(X_1^n, Y_1^n). The triangle inequality yields

d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) ≤ d̄_m(X_{n+1}^{n+m}/x_1^n, X_{n+1}^{n+m}/x_n) + d̄_m(X_{n+1}^{n+m}/x_n, X_{n+1}^{n+m}/y_n) + d̄_m(X_{n+1}^{n+m}/y_n, Y_{n+1}^{n+m}/y_n) + d̄_m(Y_{n+1}^{n+m}/y_n, Y_{n+1}^{n+m}/y_1^n).

The first term is 0, since the distribution of X_{n+1}^{n+m}/x_1^n is the same as the distribution of X_{n+1}^{n+m}/x_n, by the Markov property. The second term is less than ε/4, by (4) and stationarity, and the third term is less than ε/4, by (6), so the first three terms contribute less than 3ε/4. The fourth term is upper bounded by half of the variational distance between the distributions of Y_{n+1}^{n+m}/y_n and Y_{n+1}^{n+m}/y_1^n. Since Y_{n+1}^{n+m} and Y_1^{n−1} are conditionally (ε/2)-independent, given Y_n, by (7) and stationarity, the expected value of this variational distance is less than ε/2, and hence the expected value of the fourth term is at most (1/2)(ε/2) = ε/4. Hence the sum in (8) is less than ε.

Since this argument depends only on the conditions (5), (6), (7), and the entropy condition, it shows that d̄(μ, ν) ≤ ε for any stationary process {Y_i} with Kolmogorov measure ν for which |μ_{m+1} − ν_{m+1}| < δ and |H(μ) − H(ν)| < δ. This completes the proof of Theorem IV.2.9.
The final step in the proof that almost block-independence implies finitely determined is stated as the following theorem, a result also useful in other contexts.

Theorem IV.2.10 (The d̄-closure theorem). The d̄-limit of finitely determined processes is finitely determined.

The theorem is an immediate consequence of the following lemma, which guarantees that any process d̄-close enough to a finitely determined process must have an approximate form of the finitely determined property.

Lemma IV.2.11. If d̄(μ, ν) < ε and ν is finitely determined, then there is a δ > 0 and a k such that d̄(μ, μ̄) < 5ε, for any stationary process μ̄ for which |μ_k − μ̄_k| < δ and |H(μ) − H(μ̄)| < δ.

The basic idea for the proof is as follows. Let λ be a stationary joining of μ and ν such that E_λ(d(x_1, y_1)) = d̄(μ, ν). If μ̄ is close enough to μ in distribution and entropy, there will be a stationary λ̄ with first marginal μ̄ such that

(a) λ̄ is close to λ in distribution and entropy.
(b) The second marginal ν̄ of λ̄ is close to ν in distribution and entropy.

If the second marginal ν̄ is close enough to ν in distribution and entropy, then the finitely determined property of ν guarantees that d̄(ν, ν̄) is small. The fact that λ̄ is a stationary joining of μ̄ and ν̄ guarantees that E_λ̄(d(x_1, y_1)) is close to E_λ(d(x_1, y_1)) = d̄(μ, ν), by property (a), so that d̄(μ̄, ν̄) is also small. The triangle inequality, d̄(μ, μ̄) ≤ d̄(μ, ν) + d̄(ν, ν̄) + d̄(ν̄, μ̄), then implies that d̄(μ, μ̄) is small.

A simple language for making the preceding sketch into a rigorous proof is the language of channels, borrowed from information theory. (None of the deeper ideas from channel theory will be used, only its suggestive language.) Fix n, and for each a_1^n with μ(a_1^n) > 0, let λ_n(·|a_1^n) be the conditional measure on A^n defined by the formula

(9) λ_n(b_1^n|a_1^n) = λ_n(a_1^n, b_1^n)/μ(a_1^n).

The family {λ_n(·|a_1^n)} of conditional measures can be thought of as the (noisy) channel (or black box) which, given the input a_1^n, outputs b_1^n with probability λ_n(b_1^n|a_1^n). Such a finite-sequence channel extends to an infinite-sequence channel

(10) x → y,

which outputs an infinite sequence y, given an infinite input sequence x, by breaking x into blocks of length n and applying the n-block channel to each block separately.
Given an input measure α on A^∞, the infinite-sequence channel (10) defined by the conditional probabilities (9) determines a joining λ* = λ*(λ_n, α) of α with an output measure β* = β*(λ_n, α). The joining λ*, called the joint input-output measure defined by the channel and the input measure α, is the measure on A^∞ × A^∞ defined for m ≥ 1 by the formula

λ*(a_1^{mn}, b_1^{mn}) = α(a_1^{mn}) ∏_{j=0}^{m−1} λ_n(b_{jn+1}^{jn+n} | a_{jn+1}^{jn+n}),

that is, given an infinite input sequence x, the block y_{jn+1}^{jn+n} has the value b_1^n with probability λ_n(b_1^n|x_{jn+1}^{jn+n}), independent of {x_i: i ≤ jn} and {y_i: i ≤ jn}. The projection of λ* onto its first factor is clearly the input measure α. The output measure β* is the projection of λ* onto its second factor, that is, the measure on A^∞ defined for m ≥ 1 by the formula

β*(b_1^{mn}) = Σ_{a_1^{mn}} λ*(a_1^{mn}, b_1^{mn}).

If α is stationary, neither λ*(λ_n, α) nor β*(λ_n, α) need be stationary, though both are certainly n-stationary. Stationary measures λ = λ(λ_n, α) and β = β(λ_n, α), called the stationary joint input-output measure and the stationary output measure defined by λ_n and the input measure α, are obtained by randomizing the start, that is, by averaging the first n shifts of λ* and of β*, respectively. A direct calculation shows that λ is a joining of α and β. The next two lemmas contain the facts that will be needed about the continuity of λ(λ_n, α) and β(λ_n, α), in n and in α, with respect to weak convergence and convergence in entropy.

Lemma IV.2.12 (Continuity in n). If λ is a stationary joining of the two stationary processes μ and ν, then λ(λ_n, μ) converges weakly and in entropy to λ, and β(λ_n, μ) converges weakly and in entropy to ν, as n → ∞.

Proof. Fix (a_1^k, b_1^k) and let n ≥ k. The quantity λ(λ_n, μ)(a_1^k, b_1^k) is an average of the n measures λ*(a(i), b(i)), 0 ≤ i < n, where a(i)_{i+s} = a_s and b(i)_{i+s} = b_s, for 1 ≤ s ≤ k. For i ≤ n − k the shifted block lies inside a single n-block of the channel, so that the corresponding term is exactly the λ_n-probability of the k-block (a_1^k, b_1^k), which equals λ(a_1^k, b_1^k), since λ_n is the n-th order marginal of the stationary λ. Thus all but at most k of the n terms agree with λ(a_1^k, b_1^k), and hence λ(λ_n, μ)(a_1^k, b_1^k) converges to λ(a_1^k, b_1^k) as n → ∞. Likewise, β(λ_n, μ)(b_1^k) converges to ν(b_1^k) as n → ∞.

To establish convergence in entropy, first note that if (X_1^n, Y_1^n) is a random vector with distribution λ_n, then

H(X_1^{mn}, Y_1^{mn}) = H(X_1^{mn}) + Σ_{j=0}^{m−1} H(Y_{jn+1}^{jn+n} | X_{jn+1}^{jn+n}) = H(X_1^{mn}) + m H(Y_1^n|X_1^n),

since the channel treats input n-blocks independently. Dividing by mn and letting m → ∞ gives

H(λ*(λ_n, μ)) = H(μ) + (1/n) H(Y_1^n|X_1^n).

On the other hand, (1/n) H(Y_1^n|X_1^n) = (1/n)[H(X_1^n, Y_1^n) − H(X_1^n)] → H(λ) − H(μ), as n → ∞. Since the entropy of an n-stationary process is the same as the entropy of the average of its first n shifts, it follows that H(λ(λ_n, μ)) → H(λ), as n → ∞. Randomizing the start also yields H(β(λ_n, μ)) → H(ν). This completes the proof of Lemma IV.2.12.
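The n-block channel construction can be made concrete in a few lines. The sketch below is a hypothetical instance with n = 1, so the channel (9) acts symbol by symbol (here a binary symmetric channel, chosen only for illustration); it verifies that the joint input-output measure projects onto the input measure on its first factor and onto a probability measure on its second:

```python
from itertools import product

# A hypothetical 1-block channel: binary symmetric with crossover 0.1.
lam = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}

# Input measure alpha on 2-blocks: i.i.d.(0.6, 0.4), chosen for illustration.
q = [0.6, 0.4]
alpha = {(a1, a2): q[a1] * q[a2] for a1, a2 in product((0, 1), repeat=2)}

# Joint input-output measure: the channel treats each symbol independently.
joint = {(a, b): alpha[a] * lam[a[0]][b[0]] * lam[a[1]][b[1]]
         for a in alpha for b in product((0, 1), repeat=2)}

# Output measure beta*: projection of the joint onto the second factor.
beta = {}
for (a, b), w in joint.items():
    beta[b] = beta.get(b, 0.0) + w

# Projection onto the first factor recovers the input measure alpha.
for a in alpha:
    first = sum(w for (a2, _), w in joint.items() if a2 == a)
    assert abs(first - alpha[a]) < 1e-12
assert abs(sum(beta.values()) - 1.0) < 1e-12
```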
Lemma IV.2.13 (Continuity in input distribution). Fix n ≥ 1. If a sequence of stationary processes {α(m)} converges weakly and in entropy to a stationary process α, as m → ∞, then

(i) {λ(λ_n, α(m))} converges weakly and in entropy to λ(λ_n, α), and
(ii) {β(λ_n, α(m))} converges weakly and in entropy to β(λ_n, α),

as m → ∞.

Proof. For simplicity assume n = 1. By definition,

λ*(a_1^m, b_1^m) = α(a_1^m) ∏_{i=1}^m λ(b_i|a_i),

which is weakly continuous in α. Furthermore, if (X_1^m, Y_1^m) is a random vector with distribution λ*_m, then H(X_1^m, Y_1^m) = H(X_1^m) + m H(Y_1|X_1), since the channel treats input symbols independently. Dividing by m and letting m → ∞ yields

H(λ*) = H(α) + H(Y_1|X_1),

which depends continuously on H(α) and the distribution of (X_1, Y_1). Randomizing the start then produces the desired result, so that (i) holds. Likewise, H(β) = H(λ) − H(X_1|Y_1) is continuous in H(λ) and the distribution of (X_1, Y_1), see Exercise 5, Section I.6, so that (ii) holds. The extension to n > 1 is straightforward. This completes the proof of Lemma IV.2.13.

Proof of Lemma IV.2.11. Assume d̄(μ, ν) < ε, assume ν is finitely determined, and let λ be a stationary measure realizing d̄(μ, ν), so that E_λ(d(x_1, y_1)) = d̄(μ, ν) < ε. The finitely determined property of ν provides γ and ℓ such that any stationary process ν̄ for which |ν_ℓ − ν̄_ℓ| < γ and |H(ν) − H(ν̄)| < γ must satisfy d̄(ν, ν̄) < ε. The continuity-in-n lemma, Lemma IV.2.12, provides an n such that the stationary input-output measure λ̃ = λ(λ_n, μ) satisfies

(11) E_λ̃(d(x_1, y_1)) ≤ E_λ(d(x_1, y_1)) + ε,

and the stationary output measure β = β(λ_n, μ) satisfies |β_ℓ − ν_ℓ| < γ/2 and |H(β) − H(ν)| < γ/2.
The continuity-in-input lemma, Lemma IV.2.13, provides δ and k such that if μ̄ is any stationary process for which |μ_k − μ̄_k| < δ and |H(μ̄) − H(μ)| < δ, then the joint input-output distribution λ̄ = λ(λ_n, μ̄) satisfies

(12) E_λ̄(d(x_1, y_1)) ≤ E_λ̃(d(x_1, y_1)) + ε,

and the output distribution β̄ = β(λ_n, μ̄) satisfies |β̄_ℓ − β_ℓ| < γ/2 and |H(β̄) − H(β)| < γ/2.

Given such an input μ̄, it follows that the output β̄ satisfies

|β̄_ℓ − ν_ℓ| ≤ |β̄_ℓ − β_ℓ| + |β_ℓ − ν_ℓ| < γ, and |H(β̄) − H(ν)| ≤ |H(β̄) − H(β)| + |H(β) − H(ν)| < γ,

so the definitions of ℓ and γ imply that d̄(β̄, ν) < ε. Furthermore, since λ̄ = λ(λ_n, μ̄) is a joining of the input μ̄ with the output β̄, the inequalities (11) and (12), together with the fact that λ realizes d̄(μ, ν) < ε, yield

d̄(μ̄, β̄) ≤ E_λ̄(d(x_1, y_1)) ≤ E_λ̃(d(x_1, y_1)) + ε ≤ E_λ(d(x_1, y_1)) + 2ε < 3ε.

The triangle inequality then gives

d̄(μ, μ̄) ≤ d̄(μ, ν) + d̄(ν, β̄) + d̄(β̄, μ̄) < ε + ε + 3ε = 5ε,

thereby completing the proof of Lemma IV.2.11, and hence the proof that d̄-limits of finitely determined processes are finitely determined.

Remark IV.2.14. The proof that i.i.d. processes are finitely determined is based on Ornstein's original proof, [46], while the proof that mixing Markov chains are finitely determined is based on the Friedman-Ornstein proof, [13]. The fact that d̄-limits of finitely determined processes are finitely determined is due to Ornstein, [46]. The fact that almost block-independence is equivalent to finitely determined first appeared in [67]; the proof given here that ABI processes are finitely determined, and hence the proof that almost block-independent processes are finitely determined, is new. The observation that the ideas of that proof could be expressed in the simple channel language used here is from [83].

IV.2.b Exercises.

1. The nth order Markovization of a stationary process μ is the nth order Markov process ν with transition probabilities ν(x_{n+1}|x_1^n) = μ(x_1^{n+1})/μ(x_1^n). Show that a finitely determined process is the d̄-limit of its nth order Markovizations. (Hint: ν is close in distribution and entropy to μ.)
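The Markovization in Exercise 1 can be computed directly from block probabilities. A minimal sketch, using an i.i.d.(0.7, 0.3) input as a hypothetical example, for which every Markovization simply returns the i.i.d. transitions:

```python
from itertools import product

# Block probabilities of an illustrative binary process: i.i.d.(0.7, 0.3).
p = [0.7, 0.3]
mu2 = {w: p[w[0]] * p[w[1]] for w in product((0, 1), repeat=2)}
mu3 = {w: p[w[0]] * p[w[1]] * p[w[2]] for w in product((0, 1), repeat=3)}

# Second-order Markovization: nu(x3 | x1 x2) = mu(x1 x2 x3) / mu(x1 x2).
nu = {w: mu3[w] / mu2[w[:2]] for w in mu3}

for ctx in mu2:
    # Each context yields a genuine transition distribution...
    assert abs(sum(nu[ctx + (s,)] for s in (0, 1)) - 1.0) < 1e-12
    # ...which, for an i.i.d. input, does not depend on the context.
    assert abs(nu[ctx + (0,)] - 0.7) < 1e-12
```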
2. Show that the class of B-processes is d̄-separable.

3. Show that if μ is totally ergodic and ν is any probability measure on A^n, then d̄(μ, ν̃) ≤ d̄_n(μ_n, ν), where ν̃ denotes the concatenated-block process defined by ν. (Hint: choose x all of whose shifts have limiting nonoverlapping n-block distribution equal to μ_n, and y for which the limiting nonoverlapping n-block distribution of T^i y is ν, for some i < n; any of the limiting nonoverlapping n-block distributions determined by (T^i x, T^i y) must then be a joining of μ_n and ν.)

4. Show that a process built by repeated independent cutting and stacking is a B-process if the initial structure has two columns with heights differing by 1.

5. Prove the conditional ε-independence lemma, Lemma IV.2.7.

Section IV.3. Other B-process characterizations.

A more careful look at the proof that mixing Markov chains are finitely determined shows that all that is really needed for a process to be finitely determined is a weaker form of the Markov coupling property, called the very weak Bernoulli property. This form is actually equivalent to the finitely determined property, and hence serves as another characterization of B-processes. Since it has already been shown that finitely determined implies almost blowing-up, equivalence will be established by showing that very weak Bernoulli implies finitely determined, and then that almost blowing-up implies very weak Bernoulli. It then follows that almost block-independence, finitely determined, very weak Bernoulli, and almost blowing-up are, indeed, all equivalent to being a stationary coding of an i.i.d. process.

IV.3.a The very weak Bernoulli and weak Bernoulli properties.

As in earlier discussions, either measure or random variable notation will be used for the d̄-distance, and X_1^n/x⁰_{−k} will denote the random vector X_1^n, conditioned on the past values x⁰_{−k}. A stationary process {X_i} is very weak Bernoulli (VWB) if given ε > 0 there is an n such that for any k > 0,

(1) E_{x⁰_{−k}}(d̄_n(X_1^n/x⁰_{−k}, X_1^n)) ≤ ε,

where E_{x⁰_{−k}} denotes expectation with respect to the random past X⁰_{−k}. Informally stated, a process is VWB if the past has little effect in the d̄-sense on the future.

Theorem IV.3.1 (The very weak Bernoulli characterization). A process is very weak Bernoulli if and only if it is finitely determined.

The significance of very weak Bernoulli is that it is equivalent to finitely determined and that many physically interesting processes, such as geodesic flow on a manifold of constant negative curvature and various processes associated with billiards, can be shown to be very weak Bernoulli by exploiting their natural expanding and contracting foliation structures. The reader is referred to [54] for a discussion of such applications.
The proof that very weak Bernoulli implies finitely determined will be given in the next subsection; the proof of the converse will be given later, after some discussion of the almost blowing-up property.

A somewhat stronger property, called weak Bernoulli, is obtained by using variational distance with a gap in place of the d̄-distance. A stationary process {X_i} is weak Bernoulli (WB), or absolutely regular, if given ε > 0 there is a gap g such that for any k > 0 and m > 0, the random vectors X_{g+1}^{g+m} and X⁰_{−k} are ε-independent, that is, if past and future become ε-independent when separated by a gap g. The class of weak Bernoulli processes includes the mixing Markov processes and the large class of mixing regenerative processes, see Exercise 2 and Exercise 3. Furthermore, as noted in earlier theorems, weak Bernoulli processes have nice empirical distribution and waiting-time properties. Their importance here is that weak Bernoulli processes are very weak Bernoulli.

To see why, first note that if X_{g+1}^{g+m} and X⁰_{−k} are ε-independent, then

E_{x⁰_{−k}}(d̄_m(X_{g+1}^{g+m}/x⁰_{−k}, X_{g+1}^{g+m})) ≤ ε/2,

since d̄-distance is upper bounded by one-half the variational distance. If this is true for all m and k, then one can take n = g + m with m so large that g/(g + m) ≤ ε/2, and use the fact that, for any n > g and any pair of random vectors (U_1^n, V_1^n),

d̄_n(U_1^n, V_1^n) ≤ g/n + d̄_{n−g}(U_{g+1}^n, V_{g+1}^n),

to obtain E_{x⁰_{−k}}(d̄_n(X_1^n/x⁰_{−k}, X_1^n)) ≤ ε/2 + ε/2 = ε, for every k. Thus weak Bernoulli indeed implies very weak Bernoulli.

In a sense made precise in Exercises 4 and 5, weak Bernoulli requires that, with high probability, the conditional measures on different infinite pasts can be joined in the future so that, with high probability, names agree after some point, while very weak Bernoulli only requires that the density of disagreements be small. This property was the key to the example constructed in [65] of a very weak Bernoulli process that is not weak Bernoulli, a result established by another method in [78].

IV.3.b Very weak Bernoulli implies finitely determined.

The proof models the earlier argument that mixing Markov implies finitely determined. Entropy is used to guarantee approximate independence from the distant past, while very weak Bernoulli is used to guarantee that even such conditional dependence dies off in the d̄-sense as block length grows. Since approximate versions of both properties hold for any process close enough to {X_i} in entropy and in joint distribution for a long enough time, a good fitting can be carried forward by fitting future m-blocks. As in the earlier proofs, the key is to show that

Σ_{x_1^n, y_1^n} λ_n(x_1^n, y_1^n) d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n)

is small for some fixed m and every n, given only that the processes are close enough in entropy and kth order distribution, for some fixed k ≥ m.
The details of the preceding sketch are carried out as follows. Fix a very weak Bernoulli process {X_n} with Kolmogorov measure μ, and fix ε > 0. The very weak Bernoulli property provides an m so that

(2) E_{x⁰_{−t}}(d̄_m(X_1^m/x⁰_{−t}, X_1^m)) ≤ ε/8, t ≥ 0.

The goal is to show that a process sufficiently close to {X_i} in distribution and entropy satisfies almost the same bound, for once such an m is determined, all that remains to be shown is that the expected value of d̄_m(Y_{n+1}^{n+m}/y_1^n, Y_{n+1}^{n+m}) is small, uniformly in n.

Towards this end, let γ be a positive number to be specified later. As t → ∞, H(X_1^m|X⁰_{−t}) converges to mH(μ), and hence there is a K such that H(X_1^m|X⁰_{−K}) < mH(μ) + γ. Fix k = m + K + 1. For m fixed, the quantities H(X_1^m|X⁰_{−K}) and E_{x⁰_{−K}}(d̄_m(X_1^m/x⁰_{−K}, X_1^m)) both depend continuously on μ_k. Thus, by the triangle inequality, there is a δ > 0 such that if {Y_i} is a stationary process with Kolmogorov measure ν for which |μ_k − ν_k| < δ and |H(μ) − H(ν)| < δ, then

(3) E_{y⁰_{−K}}(d̄_m(Y_1^m/y⁰_{−K}, Y_1^m)) ≤ ε/4,

and, if δ is small enough, H(Y_1^m|Y⁰_{−K}) < mH(ν) + 2γ. Since H(Y_1^m|Y⁰_{−t}) decreases to mH(ν) as t → ∞, this means that

H(Y_1^m|Y⁰_{−K}) − H(Y_1^m|Y⁰_{−t}) < 2γ, for all t ≥ K.

If γ is small enough, the conditional ε-independence lemma, Lemma IV.2.7, implies that Y_1^m and Y^{−K−1}_{−t} are conditionally (ε/2)-independent, given Y⁰_{−K}. Since d̄-distance is upper bounded by one-half the variational distance, this means that

E_{y⁰_{−t}}(d̄_m(Y_1^m/y⁰_{−t}, Y_1^m/y⁰_{−K})) ≤ ε/4, t ≥ K,

which, along with the triangle inequality and the earlier bound (3), yields

(4) E_{y⁰_{−t}}(d̄_m(Y_1^m/y⁰_{−t}, Y_1^m)) ≤ ε/2, for all t ≥ 0,

the case t < K following from the case t = K, by the convexity of the d̄-distance in its first argument.

In summary, for any stationary process {Y_i} with Kolmogorov measure ν for which |μ_k − ν_k| < δ and |H(μ) − H(ν)| < δ, the bounds (2) and (4), stationarity, and the triangle inequality give

Σ_{x_1^n, y_1^n} λ_n(x_1^n, y_1^n) d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) ≤ ε/8 + d̄_m(X_1^m, Y_1^m) + ε/2 < ε,

for all n, since it can also be supposed that δ is small enough to guarantee that d̄_m(X_1^m, Y_1^m) < ε/4. The m-step fitting method of the preceding section then produces d̄(μ, ν) ≤ ε, which completes the proof that very weak Bernoulli implies finitely determined.
IV.3.c The almost blowing-up characterization.

The ε-blowup [C]_ε of a set C ⊆ A^n is its ε-neighborhood relative to the d_n metric, that is,

[C]_ε = {b_1^n : d_n(a_1^n, b_1^n) ≤ ε, for some a_1^n ∈ C}.

A set B ⊆ A^n has the (δ, ε)-blowing-up property if μ([C]_ε) > 1 − ε, for any subset C ⊆ B for which μ(C) ≥ 2^{−nδ}. A stationary process has the almost blowing-up property if for each n there is a B_n ⊆ A^n such that x_1^n ∈ B_n, eventually almost surely, and for any ε > 0 there is a δ > 0 and an N such that B_n has the (δ, ε)-blowing-up property for n ≥ N.

The almost blowing-up property (ABUP) was introduced in Section III.4, where it was shown that a finitely determined process has the almost blowing-up property. In this section it will be shown that almost blowing-up implies very weak Bernoulli. Since very weak Bernoulli implies finitely determined, this completes the proof that almost block-independence, finitely determined, very weak Bernoulli, and almost blowing-up are all equivalent ways of saying that a process is a stationary coding of an i.i.d. process. The proof is based on the following lemma, which, in a more general setting and stronger form, is due to Strassen, [81].

Lemma IV.3.2. Let μ and ν be probability measures on A^n. If ν([C]_ε) ≥ 1 − ε whenever μ(C) ≥ ε, then d̄_n(μ, ν) ≤ 2ε.

Proof. The ideas of the proof will be described first, after which the details will be given. A joining can be thought of as a partitioning of the ν-mass of each y_1^n into parts and an assignment of these parts to various x_1^n, subject to the joining requirement that the total mass received by each x_1^n is equal to μ(x_1^n). The goal is to construct a joining λ of μ and ν for which d_n(x_1^n, y_1^n) ≤ ε, except on a set of (x_1^n, y_1^n) of λ-measure less than ε, for then d̄_n(μ, ν) ≤ E_λ(d_n) ≤ ε + ε = 2ε.

A simple strategy for beginning the construction of a good joining is to cut off an α-fraction of the ν-mass of y_1^n and assign it to some x_1^n for which d_n(x_1^n, y_1^n) ≤ ε. A trivial argument shows that this is indeed possible, for some positive α, for those y_1^n that are within ε of some x_1^n of positive μ-mass. The set of such y_1^n is just the ε-blowup of the support of μ, which, by hypothesis, has ν-measure at least 1 − ε. The key to continuing is the observation that if the set of x_1^n whose μ-mass is not completely filled in the first stage has μ-measure at least ε, then its blowup has ν-measure at least 1 − ε, and the simple strategy can be used again: for most y_1^n a small fraction of the remaining ν-mass can be cut off and assigned to the unfilled mass of a nearby x_1^n. This simple strategy can be repeated as long as the set of x_1^n whose μ-mass is not yet completely filled has μ-measure at least ε. If the fraction is chosen to be largest possible at each stage, then only a finite number of stages are needed to reach the point where the set of x_1^n whose μ-mass is not yet completely filled has μ-measure less than ε.

Some notation and terminology will assist in making the preceding sketch into a proof. A nonnegative function μ̃ on A^n is called a mass function if it has nonempty support and Σ μ̃(x_1^n) ≤ 1. A function φ: [S]_ε → S is called an ε-matching over S if d_n(y_1^n, φ(y_1^n)) ≤ ε, for all y_1^n ∈ [S]_ε.
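The ε-blowup is a finite, directly checkable object. A brute-force sketch over binary 6-blocks (block length and the set C are purely illustrative; real uses of the blowing-up property concern exponentially large n):

```python
from itertools import product

n = 6
words = list(product((0, 1), repeat=n))

def dn(a, b):
    """Per-letter Hamming distance between two n-blocks."""
    return sum(s != t for s, t in zip(a, b)) / n

def blowup(C, eps):
    """[C]_eps: all n-blocks within eps of some block of C."""
    return {b for b in words if any(dn(a, b) <= eps for a in C)}

# Blowing up a single word by eps = 1/3 allows up to two changed letters,
# so the blowup is the Hamming ball of radius 2 around that word.
C = {(0, 0, 0, 0, 0, 0)}
B = blowup(C, 1 / 3)
assert len(B) == 1 + 6 + 15
```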
If S is the support of a mass function μ̃, and φ is an ε-matching over S, then the number α = α(ν, μ̃, φ) defined by

(5) α = min{1, inf_{x_1^n: μ̃(x_1^n) > 0} μ̃(x_1^n)/ν(φ^{−1}(x_1^n))}

is positive. It is called the maximal (ν, μ̃, φ)-stuffing fraction, for it is the largest number α ≤ 1 for which α ν(φ^{−1}(x_1^n)) ≤ μ̃(x_1^n), for every x_1^n, that is, the largest fraction for which an α-fraction of the ν-mass of each y_1^n ∈ [S]_ε can be assigned to the μ̃-mass of x_1^n = φ(y_1^n).

The desired joining λ is constructed by induction. To get started, let μ̃(1) = μ, let S_1 be the support of μ̃(1), let φ_1 be any ε-matching over S_1 (such a φ_1 exists by the definition of ε-blowup, as long as S_1 ≠ ∅), and let α_1 be the maximal (ν, μ̃(1), φ_1)-stuffing fraction. Having defined μ̃(i), S_i, φ_i, and α_i, let μ̃(i+1) be the mass function defined by

μ̃(i+1)(x_1^n) = μ̃(i)(x_1^n) − α_i ν(φ_i^{−1}(x_1^n)).

The set S_{i+1} is then taken to be the support of μ̃(i+1), the function φ_{i+1} to be any ε-matching over S_{i+1}, and α_{i+1} to be the maximal (ν, μ̃(i+1), φ_{i+1})-stuffing fraction. If α_i < 1 then, by the definition of maximal stuffing fraction, there is an x_1^n ∈ S_i for which μ̃(i)(x_1^n) = α_i ν(φ_i^{−1}(x_1^n)), so that S_{i+1} is strictly smaller than S_i. Hence there is a first i, say i*, for which α_{i*} = 1 or for which μ(S_{i*+1}) < ε. The construction can be stopped after stage i*, cutting up and assigning the remaining unassigned ν-mass in any way consistent with the joining requirement that the total mass received by each x_1^n is equal to μ(x_1^n). In either case, the mass assigned through stage i* moves each y_1^n to an x_1^n within d_n-distance ε, while the mass assigned afterwards is less than ε: if the construction stops because μ(S_{i*+1}) < ε, the unfilled μ-mass is less than ε, and if it stops because α_{i*} = 1, the unassigned ν-mass lies outside [S_{i*}]_ε, a set of ν-measure less than ε. Thus d_n(x_1^n, y_1^n) ≤ ε, except on a set of (x_1^n, y_1^n) of λ-measure less than ε, and the proof of Lemma IV.3.2 is finished.

Only one simple entropy fact about conditional measures will be needed to prove that almost blowing-up implies very weak Bernoulli. This is the fact that, with high probability, the conditional probability μ(x_1^n|x⁰_{−∞}) has approximately the same exponential size as the unconditioned probability μ(x_1^n). Let Σ(X^n_{−∞}) denote the σ-algebra generated by the collection of cylinder sets {[x^n_{−k}]: k ≥ 0}; for F_n ∈ Σ(X^n_{−∞}), membership x ∈ F_n depends only on the coordinates x^n_{−∞}.

Lemma IV.3.3 (The conditional entropy lemma). Let μ be an ergodic process of entropy H and let α be a given positive number. There is an N = N(α) > 0 such that if n ≥ N, then there is a set F_n ∈ Σ(X^n_{−∞}) such that

(a) μ(F_n) ≥ 1 − α, and
(b) μ(x_1^n|x⁰_{−∞}) ≤ 2^{αn} μ(x_1^n), for x ∈ F_n.

Proof. The quantity −(1/n) log μ(x_1^n|x⁰_{−∞}) converges almost surely to H, while, by the entropy theorem, −(1/n) log μ(x_1^n) also converges almost surely to H. These two facts together imply the lemma.
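One round of the stuffing construction used in the proof of Strassen's lemma above can be sketched concretely. Everything below is a hypothetical toy instance: the matching phi is assumed to move each y to an x within distance ε, and alpha is the maximal stuffing fraction of (5):

```python
# A toy instance of one stuffing round (names and numbers are hypothetical).
mu = {'x1': 0.6, 'x2': 0.4}                  # mass to be filled
nu = {'y1': 0.5, 'y2': 0.3, 'y3': 0.2}       # mass to be assigned
phi = {'y1': 'x1', 'y2': 'x1', 'y3': 'x2'}   # assumed eps-matching

def preimage_mass(x):
    return sum(nu[y] for y in nu if phi[y] == x)

# Maximal stuffing fraction: the largest alpha <= 1 with
# alpha * nu(phi^{-1}(x)) <= mu(x) for every x in the support of mu.
alpha = min(1.0, min(mu[x] / preimage_mass(x)
                     for x in mu if preimage_mass(x) > 0))

# Assign an alpha-fraction of the nu-mass of each y to x = phi(y).
joined = {(phi[y], y): alpha * nu[y] for y in nu}

# Unfilled mass after this round; a point that is exactly filled leaves the
# support, which is why only finitely many rounds are needed.
remaining = {x: mu[x] - sum(w for (x2, _), w in joined.items() if x2 == x)
             for x in mu}
assert abs(alpha - 0.75) < 1e-12
assert abs(remaining['x1']) < 1e-12   # 'x1' is exactly filled
```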
D. there is a set D.NATe. If n > N. (b) 11.(Dn this with Lemma IV.s/CTe < 2" . OTHER BPROCESS CHARACTERIZATIONS.0 E D.NA ii(Clx ° Proof For n > N.) _ oe ) 2an. Towards this end. 237 Lemma IV. and for E V and C c A'.u(x).) such that (6) If C c An and .4 it follows that if n is large enough and a < c 2 /4 then the set * E E(X).3.u(C n Fn ) Ara < 2"A(C)1.X. o ) > c can be rewritten as .0 ) such that A(B n ) > 1— c 2 /4. there is a set Dn *) > 1 — c/2 and such that A(B n ix%) > 1 — e12.) such that . E By Lemma IV. I x_ . to show that for each E > 0. there is an n and a set V E E(X° .5 Almost blowingup implies very weak Bernoulli.u(Clx ( 1 00 ) > c then ..u(B. the Markov inequality yields a set D.u(V) > 1 — c and such that dn(A. let Bn c An . which. eventually almost surely.u([C]f) a.3. The goal is to show that . V = Dn n Dn .. by the Markov inequality. for x°.u(c" n Fn ix) < 2" .Ix° 00). 1 — c.a(cix) < . 0blowingup property for n > N.u is very weak Bernoulli. so that if n is so large * E E(X ° .. be such that x7 E B„.3. E(X°. of measure at least 1 — Nfii. A (• lx° .1x%) 6/2 .u(C) .(V) > 1 — c and such that the following holds for x ° 00 E V. and C C An. if A(Clx. The quantity . then E E (X ° cx.. has measure at least 1 — c.2 E Fn . a set Fn c A" such that 1 — Afrx and. E E(X° 0° ) of measure at least Dn .0 )) <e.4 Let N = N (a) be as in the preceding lemma.)> 2"(c — N/Fx — el2).u(C n 12.u(c n /3. then.such that if x ° E D. which establishes the lemma.u(C n B.n > 1. X° E V. (x tilx ° If E .) 5_ 2" .3. Now the main theorem of this section will be proved.. for each x 0 ° 00 ) > 1 — (a) iu(F.. .3.u(Bn ) is an average of the quantities .. that is. Theorem IV.2. Proof Let p.. there is a 3 > 0 and an N such that Bn has the (3.SECTION IV.„ and C c An then (a) and (b) give 11(Clx) . and such that for any e > 0. be a stationary process with the almost blowingup property. Combining that p. it is enough to show how to find n and V p.
238 CHAPTER IV. B-PROCESSES.

If α is chosen to be less than both δ and ε²/8, then 2^{−αn}(ε − √α − ε/2) ≥ 2^{−δn}, provided only that n is large enough, so that the blowing-up property of B_n yields

μ([C]_ε) ≥ μ([C ∩ B_n]_ε) ≥ 1 − ε.

This establishes the desired result (6) and completes the proof that almost blowing-up implies very weak Bernoulli.

Remark IV.3.6
The proof of Theorem IV.3.5 appeared in [39]. A more sophisticated result, using a marriage lemma to pick the ε-matchings φ_i, yields the stronger result that the d̄_n metric is equivalent to the metric defined as the minimum ε > 0 for which μ(C) ≤ ν([C]_ε) + ε, for all C ⊆ A^n, for all sufficiently large n. This and related results are discussed in Pollard's book, [59, Example 26, pp. 79-80]. The proof of Strassen's lemma given here uses a construction suggested by Ornstein and Weiss which appeared in [37].

IV.3.d Exercises.

1. Show by using the martingale theorem that a stationary process μ is very weak Bernoulli if given ε > 0 there is a positive integer n and a measurable set G ∈ Σ(X^0_{−∞}) of measure at least 1 − ε, such that if x̃^0_{−∞} ∈ G then there is a measurable mapping φ: [x^0_{−∞}] → [x̃^0_{−∞}] which maps the conditional measure μ(· | x^0_{−∞}) onto the conditional measure μ(· | x̃^0_{−∞}), and a measurable set B ⊆ [x^0_{−∞}] such that μ(B | x^0_{−∞}) < ε, with the property that for all x ∉ B, φ(x)_i = x_i for all except at most εn indices i ∈ [1, n].

2. Show by using the martingale theorem that a stationary process μ is weak Bernoulli if given ε > 0 there is a positive integer K such that for every n ≥ K there is a measurable set G ∈ Σ(X^0_{−∞}) of measure at least 1 − ε, such that if x̃^0_{−∞} ∈ G then there is a measurable mapping φ: [x^0_{−∞}] → [x̃^0_{−∞}] which maps the conditional measure μ(· | x^0_{−∞}) onto the conditional measure μ(· | x̃^0_{−∞}), and a measurable set B ⊆ [x^0_{−∞}] such that μ(B | x^0_{−∞}) < ε, with the property that for all x ∉ B, φ(x)_i = x_i, for i ∈ [K, n].

3. Show that a stationary process is very weak Bernoulli if and only if for each ε > 0 there is an m such that for each k ≥ 1, except for a set of pasts x^0_{−k} of measure at most ε, d̄_m(X_1^m | x^0_{−k}, X_1^m) < ε.

4. Show, using coupling, that a mixing Markov chain is weak Bernoulli.

5. Show that a mixing regenerative process is weak Bernoulli. (Hint: coupling.)
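The ε-blowup [C]_ε that appears throughout the preceding argument is concrete enough to compute directly for small examples. The following sketch is an illustration of mine, not part of the text; the function names are invented, and the brute-force enumeration is practical only for small n and small alphabets. It builds [C]_ε, the set of all n-strings within per-letter Hamming distance ε of some string in C:

```python
from itertools import product

def hamming_fraction(x, y):
    """Per-letter Hamming distance: the fraction of indices where x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def blowup(C, eps, alphabet):
    """Return [C]_eps: all n-strings within per-letter Hamming distance eps of C.

    Brute force over the full product space alphabet^n (illustration only).
    """
    n = len(next(iter(C)))
    return {y for y in product(alphabet, repeat=n)
            if any(hamming_fraction(y, x) <= eps for x in C)}

# Hypothetical example: the 1/4-blowup of a single binary 4-string,
# i.e. all strings obtainable by changing at most one coordinate.
ball = blowup({(0, 0, 0, 0)}, 0.25, (0, 1))
```

With these parameters the blowup consists of 0000 together with the four strings at Hamming distance one, which makes visible how fast [C]_ε grows even when C is a single string — the phenomenon the blowing-up property quantifies.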
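Exercises 4 and 5 suggest coupling; the classical construction they point to can be sketched in code. This is an illustration under my own naming, not the book's notation: two copies of a chain with transition matrix P are run independently until they first occupy the same state, and move together from then on. For a mixing chain the copies meet quickly, which is the heart of the weak Bernoulli property for such chains.

```python
import random

def step(P, s, rng):
    """One transition of the chain from state s, sampled from row P[s]."""
    u, acc = rng.random(), 0.0
    for t, p in enumerate(P[s]):
        acc += p
        if u < acc:
            return t
    return len(P[s]) - 1  # guard against rounding in the row sum

def coupled_paths(P, x0, y0, steps, rng):
    """Run two copies from x0 and y0: independent until they meet, identical after."""
    x, y = x0, y0
    xs, ys = [x0], [y0]
    for _ in range(steps):
        if x == y:
            x = y = step(P, x, rng)          # coupled: one draw drives both
        else:
            x = step(P, x, rng)              # not yet coupled: independent draws
            y = step(P, y, rng)
        xs.append(x)
        ys.append(y)
    return xs, ys

# Hypothetical chain: a mixing two-state chain, the two copies started apart.
P = [[0.9, 0.1], [0.2, 0.8]]
xs, ys = coupled_paths(P, 0, 1, 1000, random.Random(0))
```

Once the paths agree at some time T, they agree forever after, so the distributions of the two futures can differ only on the event that coupling has not yet occurred — the estimate that drives Exercise 4.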
Bibliography

[1] Arnold, V. A., and Avez, A., Ergodic problems of classical mechanics, Benjamin, New York, 1968.
[2] R. Arratia and M. Waterman, "The Erdős-Rényi strong law for pattern matching with a given proportion of mismatches," Ann. Probab., 17(1989), 1152-1169.
[3] A. Barron, "Logically smooth density estimation," Ph.D. Thesis, Dept. of Elec. Eng., Stanford Univ., 1985.
[4] P. Billingsley, Ergodic theory and information, John Wiley and Sons, New York, 1965.
[5] D. Cohn, Measure Theory, Birkhäuser, Boston, 1980.
[6] T. Cover and J. Thomas, Elements of information theory, John Wiley and Sons, Inc., New York, 1991.
[7] I. Csiszár and J. Körner, Information Theory. Coding theorems for discrete memoryless systems, Akadémiai Kiadó, Budapest, 1981.
[8] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Info. Th., IT-21(1975), 194-203.
[9] P. Erdős and A. Rényi, "On a new law of large numbers," J. d'analyse, 23(1970), 103-111.
[10] J. Feldman, "r-Entropy, equipartition, and Ornstein's isomorphism theorem," Israel J. Math., 36(1980), 321-343.
[11] W. Feller, An introduction to probability theory and its applications, Volume II (Second Edition), John Wiley and Sons, New York, 1971.
[12] N. Friedman and D. Ornstein, "On isomorphism of weak Bernoulli transformations," Advances in Math., 5(1970), 365-394.
[13] N. Friedman, Introduction to ergodic theory, Van Nostrand Reinhold, New York, 1970.
[14] H. Furstenberg, Recurrence in ergodic theory and combinatorial number theory, Princeton Univ. Press, Princeton, NJ, 1981.
[15] R. Gallager, Information theory and reliable communication, Wiley, New York, 1968.

239
240 BIBLIOGRAPHY

[16] P. Grassberger, "Estimating the information content of symbol sequences and efficient codes," IEEE Trans. Inform. Th., IT-35(1989), 669-675.
[17] R. M. Gray, Entropy and information theory, Springer-Verlag, New York, 1990.
[18] R. M. Gray, Probability, random processes, and ergodic properties, Springer-Verlag, New York, 1988.
[19] R. M. Gray and L. D. Davisson, "Source coding without the ergodic assumption," IEEE Trans. Info. Th., IT-20(1974), 502-516.
[20] P. Halmos, Measure theory, D. van Nostrand Co., New York, 1950.
[21] P. Halmos, Lectures on ergodic theory, Chelsea Publishing Co., New York, 1956.
[22] W. Hoeffding, "Asymptotically optimal tests for multinomial distributions," Ann. Math. Statist., 36(1965), 369-400.
[23] M. Kac, "On the notion of recurrence in discrete stochastic processes," Bull. AMS, 53(1947), 1002-1010.
[24] S. Kakutani, "Induced measure-preserving transformations," Proc. Japan Acad., 19(1943), 635-641.
[25] T. Kamae, "A simple proof of the ergodic theorem using nonstandard analysis," Israel J. Math., 42(1982), 284-290.
[26] S. Karlin and G. Ghandour, "Comparative statistics for DNA and protein sequences - single sequence analysis," Proc. Nat. Acad. Sci. U.S.A., 82(1985), 5800-5804.
[27] Y. Katznelson and B. Weiss, "A simple proof of some ergodic theorems," Israel J. Math., 42(1982), 291-296.
[28] M. Keane and M. Smorodinsky, "Finitary isomorphism of irreducible Markov shifts," Israel J. Math., 34(1979), 281-286.
[29] J. Kemeny and J. Snell, Finite Markov chains, D. van Nostrand Co., Princeton, New Jersey, 1960.
[30] J. Kieffer, "Sample converses in source coding theory," IEEE Trans. Inform. Th., IT-37(1991), 263-268.
[31] I. Kontoyiannis and Y. Suhov, "Prefixes and the entropy rate for long-range sources," Probability, statistics, and optimization (F. Kelly, ed.), Wiley, New York, 1993.
[32] U. Krengel, Ergodic theorems, de Gruyter, Berlin, 1985.
[33] R. Jones, "New proofs of the maximal ergodic theorem and the Hardy-Littlewood maximal inequality," Proc. AMS, 87(1983), 681-684.
[34] D. Lind and B. Marcus, An Introduction to Symbolic Dynamics and Coding, Cambridge Univ. Press, Cambridge, 1995.
[35] T. Lindvall, Lectures on the coupling method, John Wiley and Sons, New York, 1992.
BIBLIOGRAPHY 241

[36] K. Marton, "A simple proof of the blowing-up lemma," IEEE Trans. Inform. Th., IT-32(1986), 445-447.
[37] K. Marton and P. Shields, "The positive-divergence and blowing-up properties," Israel J. Math., 86(1994), 331-348.
[38] K. Marton and P. Shields, "Entropy and the consistent estimation of joint distributions," Ann. Probab., 22(1994), 960-977. Correction: Ann. Probab., to appear.
[39] K. Marton and P. Shields, "Almost sure waiting time results for weak and very weak Bernoulli processes," Ergodic Th. and Dynam. Sys., 15(1995), 951-960.
[40] D. Neuhoff and P. Shields, "Block and sliding-block source coding," IEEE Trans. Info. Th., IT-23(1977), 211-215.
[41] D. Neuhoff and P. Shields, "Channel entropy and primitive approximation," Ann. Probab., 10(1982), 188-198.
[42] D. Neuhoff and P. Shields, "Indecomposable finite state channels and primitive approximation," IEEE Trans. Info. Th., IT-28(1982), 11-18.
[43] D. Neuhoff and P. Shields, "A very simplistic, universal, lossless code," IEEE Workshop on Information Theory, Rydzyna, Poland, June, 1995.
[44] A. Nobel and A. Wyner, "A recurrence theorem for dependent processes with applications to data compression," IEEE Trans. Info. Th., IT-38(1992), 1561-1563.
[45] D. Ornstein, Ergodic theory, randomness, and dynamical systems, Yale Mathematical Monographs 5, Yale Univ. Press, New Haven, CT, 1974.
[46] D. Ornstein and P. Shields, "An application of ergodic theory to probability theory," Ann. Probab., 1(1973), 43-58.
[47] D. Ornstein and P. Shields, "An uncountable family of K-Automorphisms," Advances in Math., 10(1973), 63-88.
[48] D. Ornstein and P. Shields, "Universal almost sure data compression," Ann. Probab., 18(1990), 441-452.
[49] D. Ornstein and B. Weiss, "The Shannon-McMillan-Breiman theorem for amenable groups," Israel J. Math., 44(1983), 53-60.
[50] D. Ornstein and B. Weiss, "How sampling reveals a process," Ann. Probab., 18(1990), 905-930.
[51] D. Ornstein and B. Weiss, "Entropy and data compression," IEEE Trans. Info. Th., IT-39(1993), 78-83.
[52] D. Ornstein and B. Weiss, "Equivalence of measure preserving transformations," Memoirs of the AMS, 262(1982).
[53] D. Ornstein and B. Weiss, "The d̄-recognition of processes," Advances in Math., 104(1994), 182-224.
242 BIBLIOGRAPHY

[54] D. Ornstein and B. Weiss, "On the Bernoulli nature of systems with some hyperbolic structure," Ergodic Th. and Dynam. Sys., to appear.
[55] W. Parry, Topics in ergodic theory, Cambridge Univ. Press, Cambridge, 1981.
[56] K. Petersen, Ergodic theory, Cambridge Univ. Press, Cambridge, 1983.
[57] R. Phelps, Lectures on Choquet's theorem, Van Nostrand, Princeton, N.J., 1966.
[58] M. S. Pinsker, Information and information stability of random variables and processes, (in Russian), Vol. 7 of the series Problemy Peredači Informacii, AN SSSR, Moscow, 1960. English translation: Holden-Day, San Francisco, 1964.
[59] D. Pollard, Convergence of stochastic processes, Springer-Verlag, New York, 1984.
[60] V. A. Rohlin, "A 'general' measure-preserving transformation is not mixing," Dokl. Akad. Nauk SSSR, 60(1948), 349-351.
[61] D. Rudolph, "If a two-point extension of a Bernoulli shift has an ergodic square, then it is Bernoulli," Israel J. Math., 30(1978), 159-180.
[62] I. Sanov, "On the probability of large deviations of random variables," (in Russian), Mat. Sbornik, 42(1957), 11-44. English translation: Select. Transl. Math. Statist. and Probability, 1(1961), 213-244.
[63] P. Shields, The theory of Bernoulli shifts, Univ. of Chicago Press, Chicago, 1973.
[64] P. Shields, "Cutting and independent stacking of intervals," Mathematical Systems Theory, 7(1973).
[65] P. Shields, "Cutting and stacking. A method for constructing stationary processes," IEEE Trans. Inform. Th., IT-37(1991), 1605-1617.
[66] P. Shields, "The entropy theorem via coding bounds," IEEE Trans. Inform. Th., IT-37(1991), 1645-1647.
[67] P. Shields, "The ergodic and entropy theorems revisited," IEEE Trans. Inform. Th., IT-33(1987), 263-266.
[68] P. Shields, "Almost block independence," Z. für Wahr., 49(1979), 119-123.
[69] P. Shields, "Stationary coding of processes," IEEE Trans. Inform. Th., IT-25(1979), 283-291.
[70] P. Shields, "Weak and very weak Bernoulli partitions," Monatshefte für Mathematik, 84(1977), 133-142.
[71] P. Shields, "Entropy and prefixes," Ann. Probab., 20(1992), 403-409.
[72] P. Shields, "String matching - the general ergodic case," Ann. Probab., 20(1992), 1199-1203.
[73] P. Shields, "Universal almost sure data compression using Markov types," Problems of Control and Information Theory, 19(1990), 269-277.

BIBLIOGRAPHY 243

[74] P. Shields, "An ergodic process of zero divergence-distance from the class of all stationary processes," J. of Theor. Prob., 6(1993).
[75] P. Shields, "Two divergence-rate counterexamples," J. of Theor. Prob., 6(1993), 521-545.
[76] P. Shields, "Universal redundancy rates don't exist," IEEE Trans. Inform. Th., IT-39(1993), 520-524.
[77] P. Shields, "Waiting times: positive and negative results on the Wyner-Ziv problem," J. of Theor. Prob., 6(1993), 499-519.
[78] M. Smorodinsky, "A partition on a Bernoulli shift which is not 'weak Bernoulli'," Math. Syst. Th., 5(1971), 201-203.
[79] M. Smorodinsky, "Finitary isomorphism of m-dependent processes," Contemporary Math., 135(1992), 373-376.
[80] M. Steele, "Kingman's subadditive ergodic theorem," Ann. Inst. Henri Poincaré, 25(1989), 93-98.
[81] V. Strassen, "The existence of probability measures with given marginals," Ann. of Math. Statist., 36(1965), 423-439.
[82] W. Szpankowski, "Asymptotic properties of data compression and suffix trees," IEEE Trans. Inform. Th., IT-39(1993), 1647-1659.
[83] J. Moser, E. Phillips, and S. Varadhan, Ergodic theory, a seminar, Courant Inst. of Math. Sciences, NYU, New York, 1975.
[84] P. Walters, An introduction to ergodic theory, Springer-Verlag, New York, 1982.
[85] F. Willems, "Universal data compression and repetition times," IEEE Trans. Inform. Th., IT-35(1989), 54-58.
[86] A. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inform. Th., IT-35(1989), 1250-1258.
[87] A. Wyner and J. Ziv, "Fixed data base version of the Lempel-Ziv data compression algorithm," IEEE Trans. Inform. Th., IT-37(1991), 878-880.
[88] A. Wyner and J. Ziv, "The sliding-window Lempel-Ziv algorithm is asymptotically optimal," Proc. of the IEEE, 82(1994), 872-877.
[89] P. Shields and J.-P. Thouvenot, "Entropy zero × Bernoulli processes are closed in the d̄-metric," Ann. Probab., submitted.
[90] J. Ziv, "Coding of sources with unknown statistics - Part I: Probability of encoding error; Part II: Distortion relative to a fidelity criterion," IEEE Trans. Inform. Th., IT-18(1972), 384-394.
[91] J. Ziv, "Coding theorems for individual sequences," IEEE Trans. Inform. Th., IT-24(1978), 405-412.
[92] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inform. Th., IT-23(1977), 337-343.
Index

235; (δ; top, 107; conditional ε-independence, 27; blowup bound, 68; 195; 195; upward map, 107; length function, 71; column structure, 109; 104; in d̄, 174; 179; 84; blowing-up property (BUP), 83; building blocks, 11; almost blowing-up, 194; height (length), 107; code sequence, 71; width (thickness), 108; code, 123; concatenated-block process, 110; name, 107; block code, 52; entropy, 226; per-symbol, 108; n-code, 121; rate of, 121; codebook, 74; coding block, 72; codeword, 71; faithful (noiseless), 215; column, 107; column partitioning, 107; base (bottom), 107; level, 107; subcolumn, 107; labeling, 107; support, 108; width, 108; width distribution, 108; uniform, 108; cutting a column, 108; cutting into copies, 115; disjoint columns, 109; disjoint column structures, 109; column structure, 107; top, 107; base (bottom) of a column, 107; columnar representation, 108; copy, 110; transformation defined by, 110; estimation of distributions, 138; circular k-type, 138; (α, K)-strongly-separated, 189; (α, k)-separated structures, 190; α-separated, 1; absolutely regular, 233; addition law, 58; admissible, 58; alphabet, 1; almost block-independence (ABI), 184; 212; almost blowing-up (ABUP), 194; asymptotic equipartition (AEP), 55; Barron's codelength bound, 211; binary entropy function, 69; block coding of a process, 8; block-independent process, 8; block-to-stationary construction, 103; block-structure measure, 104; blowup, 24; δ-blowup, 24; (δ, ε)-blowing-up, 24; 235; Borel-Cantelli principle, 26; B-process, 105; building blocks, 111; built by cutting and stacking, 115; built-up, 69; (1 − ε)-built-up, 70; built-up set bound, 69; complete sequences, 105; complete sequences of structures, 233; concatenation representation, 7; truncation, 24; 71; 72; 74; 121; 107; 108; 109; 110; 121; 125; 174; 185; 187; 188; 195; 215; 245

conditional invertibility, 76; empirical distribution, 48; empirical entropy, 51; empirical measure (limiting), 64; empirical joinings, 65; empirical Markov entropy, 63; empirical universe, 94; of blocks in blocks, 6; shifted empirical block, 4; entropy properties: d̄-limits, 73; entropy interpretations: covering exponent, 78; entropy theorem, 60; upper semicontinuity, 59; a partition, 58; a random variable, 59; of a distribution, 133; entropy rate, 57; inequality, 58; entropy-typical sequences, 129; first-order entropy, 138; addition law (conditional), 51; conditional, 51; a process, 46; number, 46; Markov, 48; empirical, 63; 64; empirical entropy, 100; ergodicity, 100; ergodicity and d̄-limits, 43; ergodic theorem (of Birkhoff), 42; of von Neumann, 33; of Kingman, 42; maximal, 45; "good set" form, 45; eventually almost surely, 11; expected return time, 24; exponential rates for entropy, 90; d̄'-distance, 91; definition for processes, 89; realizing d̄ in 91; d̄-topology, 92; for sequences, 87; d̄-distance properties, 89; for stationary processes, 91; definition (joining), 89; for i.i.d. processes, 90; rotation processes, 96; d̄-far-apart Markov chains, 97; completeness, 98; direct product, 99; finite form, 99; mixing, 100; subset bound, 100; typical sequences, 102; relationship to entropy, 102; ergodic, 16; totally, 17; start, 21; process, 20; cylinder set, 2; d̄-admissible, 184; divergence, 168; distinct words, 147; distribution of a process, 4; doubly-infinite sequence model, 56; topological, 57; Elias code, 166; ε-independence, 67; cardinality bound, 67; Markov processes, 62; concatenated-block processes, 62; empirical, 63; of overlapping k-blocks, 63; encoding partition, 185; estimation, 143; 144; nth order, 46; almost-covering principle, 49; decomposition, 45; covering, 31; coupling, 218; coupling measure, 218; consistency conditions, 164; 166; for frequencies, 166; 222; of a partition, 224; for processes, 223; 214; cutting and stacking, 112; cyclic merging rule, 109; standard representation, 132; Ziv entropy, 129; 133; 186; 176; i.i.d. processes, 88; entropy of i.i.d. processes, 92; for ergodic processes, 80; 75; 17; 15; components, 20; 21; 33; 51; 57; 58; 59; 62; 63; 65; 67; 68; prefix codes, 89; 90; 91; 96; 100; 103; 138; 147; 166; 168; 246

56; topological, 57; doubly-infinite sequence model, 215

n-blocking, 104; log-sum inequality, 166; frequency, 62; matching, 44; (k, ε)-typical sequences, 102; faithful code, 223; first-order blocks, 28; independent partitions, 122; non-overlapping-block process, 65; Markov inequality, 139; separated, 27; transformation, 15; irreducible Markov, 17; joint input-output measure, 168; overlapping-block process, 7; function, 14; non-admissibility, 7; invariant measure, 6; Markov chain, 6; Markov order theorem, 11; kth order, 10; packing, 175; 13; Kolmogorov representation, 100; and Markov, 18; and d̄-limits, 11; instantaneous coder, 7; finite coder, 195; finite coding, 151; finitary coding, 159; finitary process, 195; finite-state process, 80; finite energy, 84; fillers, 121; filler or spacer blocks, 222; i.i.d., 25; infinitely often, almost surely, 26; generalized renewal process, 25; generalized return-time picture, 18; join of measures, 91; join of partitions, 91; definition of d̄, 90; mapping interpretation, 221; finite form, 186; joining, 17; cutting and stacking, 114; M-fold, 116; repeated, 115; onto a structure, 115; onto a column, 212; stacking of a structure, 131; LZW algorithm, 131; Lempel-Ziv (LZ) algorithm, 132; upper bound, 133; simple LZ parsing, 136; convergence theorem, 131; linear mass distribution, 104; Kac's return-time formula, 3; Kolmogorov measure, 2; Kolmogorov partition, 4; complete measure model, 3; two-sided model, 3; Kraft inequality, 73; Kronecker's theorem, 235; measure preserving, 14; merging and separation, 139; mixing, 190; Markov process, 229; approximation theorem, 8; 6; concatenated-block process, 10; overlapping to non-overlapping, 10; name of a column, 107; of a column level, 107; (T, P)-name, 20; nonexistence of too-good codes, 122; Nth term process, 12; N-stationary, 6; induced process, 25; packing lemma, 41; (1 − ε)-packing, 34; stopping-time, 34; strong-packing, 41; two-packings, 41; 34; 40; 43; 44; 45; 76; 84; 110; 151; 159; 168; 175; 186; 195; 198; 212; 221; 222; 229; 247

pairwise k-separated, 169; 63; k-type, 64; class bound, 64; class, 63; type, 63; typical sequence, 65; universe (k-block), 63; full k-block universe, 188; trees, 121; prefix, 23; prefix code, 74; 72; code sequence, 73; too-long word, 151; too-small set principle, 150; too-soon-recurrent representation, 148; repeated words, 154; return-time distribution, 26; picture, 24; process, 24; (S, n)-independent, 170; with gap, 170; (K, ε)-strongly-covered, 46; (L, δ)-strongly-covered, 70; (1 − ε)-strongly-covered, 33; strongly-covered, 30; strongly-packed, 34; (S, δ)-strongly-packed, 85; strong cover, 40; string matching, 44; splitting index, 166; splitting-set lemma, 167; equivalence, 167; with induced spacers, 170; randomizing the start, 96; rate function for entropy, 59; finite-energy, 84; and entropy, 59; recurrence-time function, 145; regenerative, 10; renewal process (generalized), 26; rotation (translation), 21; d̄-far-apart rotations, 96; process, 21; Markov process, 229; partition, 14; per-letter Hamming distance, 89; Pinsker's inequality, 66; Poincaré recurrence theorem, 23; product measure, 14; 205; 180; 158; 156; 148; 150; repeated words, 154; (S, δ)-, 150; representation, 148; shift, 4; shift transformation, 5; shifted, 3; ψ-mixing, 5; sequence, 122; skew product, 190; sliding-block (window) coding, 31; 32; speed of convergence, 55; stationary coding, 8; and mixing, 81; stationary process, 82; time-zero coder, 215; Shannon-McMillan-Breiman theorem, 101; stacking columns, 13; and complete sequences, 110; start position, 120; stationary, 7; totally ergodic, 5; 6; independent, 6; input-output measure, 6; inputoutput, 14; Markov, 6; N-stationary, 12; process, 3; (T, P) process, 13; tower construction, 21; transformation, 30; transformation/partition model, 13; subadditivity, 138; subinvariant function, 59; 248

and upward maps, 175; stopping time, 33; 34; 40; string matching, 44; block parsings, 45; type, 63; k-type, 63; class, 64; class bound, 65; typical sequence, 63; universal code, 132; universe (k-block), 63; toosmall set principle, 150; (K, ε)-, 46; 67; too-soon-recurrent, 148; and d̄-distance, 67; and entropy, 66; 29; 30; 64; 74; 82; 120; 121; 122; 145; 148; 150; 154; 156; 158; 165; 166; 167; 169; 170; 179; 180; 188; 205; 211; 215; 229; entropy, 7; 8; 10; 13; 14; 21; 23; 24; 26; 27; 28; 31; 32; 43; 44; 55; 59; 63; 66; 68; 70; 72; 73; 81; 84; 85; 89; 96; 101; 109; 110; 112; 114; 115; 116; 131; 132; 133; 136; 139; 151; 171 

very weak Bernoulli (VWB), 232; waiting-time function, 87; well-distributed, 133; approximate-match, 200; weak Bernoulli (WB), 200; window function, 195; window half-width, 119; weak convergence, 87; weak topology, 88; 104; distinct words, 148; repeated words, 148; words, 7; Ziv entropy, 133; 179; 195; 233; 249