The Ergodic Theory of Discrete Sample Paths

Paul C. Shields

Graduate Studies in Mathematics
Volume 13

American Mathematical Society

Editorial Board: James E. Humphreys, David Sattinger, Julius L. Shaneson, Lance W. Small (chair)
1991 Mathematics Subject Classification. Primary 28D20, 28D05, 94A17; Secondary 60F05, 60G17, 94A24.
ABSTRACT. This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars or self study for students or faculty with some background in measure theory and probability theory.

Library of Congress Cataloging-in-Publication Data

Shields, Paul C.
The ergodic theory of discrete sample paths / Paul C. Shields.
p. cm. — (Graduate studies in mathematics, ISSN 1065-7339; v. 13)
Includes bibliographical references and index.
ISBN 0-8218-0477-4 (alk. paper)
1. Ergodic theory. 2. Measure-preserving transformations. 3. Stochastic processes. I. Title. II. Series.
QA313.S555 1996
519.2'32—dc20    96-20186
CIP

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given. Republication, systematic copying, or multiple reproduction of any material in this publication (including abstracts) is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Assistant to the Publisher, American Mathematical Society, P.O. Box 6248, Providence, Rhode Island 02940-6248. Requests can also be made by e-mail to reprint-permission@ams.org.

© Copyright 1996 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Printed in the United States of America. The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Printed on recycled paper.

Contents
Preface

I Basic concepts
I.1 Stationary processes
I.2 The ergodic theory model
I.3 The ergodic theorem
I.4 Frequencies of finite blocks
I.5 The entropy theorem
I.6 Entropy as expected value
I.7 Interpretations of entropy
I.8 Stationary coding
I.9 Process topologies
I.10 Cutting and stacking


II Entropy-related properties
II.1 Entropy and coding
II.2 The Lempel-Ziv algorithm
II.3 Empirical entropy
II.4 Partitions of sample paths
II.5 Entropy and recurrence times

III Entropy for restricted classes
III.1 Rates of convergence
III.2 Entropy and joint distributions
III.3 The d-admissibility problem
III.4 Blowing-up properties
III.5 The waiting-time problem

IV B-processes
IV.1 Almost block-independence
IV.2 The finitely determined property
IV.3 Other B-process characterizations
Bibliography

Index


Preface

This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars or self study for students or faculty with some background in measure theory and probability theory.

A primary goal, only partially realized, is to develop a theory based directly on sample path arguments, with minimal appeals to the probability formalism. The focus is on the combinatorial properties of typical finite sample paths drawn from a stationary, ergodic process. The two basic tools for a sample path theory are a packing lemma, which shows how "almost" packings of integer intervals can be extracted from coverings by overlapping subintervals, and a counting lemma, which bounds the number of n-sequences that can be partitioned into long blocks subject to the condition that most of them are drawn from collections of known size. These two simple ideas, introduced by Ornstein and Weiss in 1980, immediately yield the two fundamental theorems of ergodic theory, namely, the ergodic theorem of Birkhoff and the entropy theorem of Shannon, McMillan, and Breiman. The packing and counting ideas yield more than these two classical results, however, for in combination with the ergodic and entropy theorems and further simple combinatorial ideas they provide powerful tools for the study of sample paths. Much of Chapter I and all of Chapter II are devoted to the development of these ideas.

A secondary goal is to give a careful presentation of the many models for stationary finite-alphabet processes that have been developed in probability theory, ergodic theory, and information theory. The classical process models are based on independence ideas and include the i.i.d. processes, Markov chains, instantaneous functions of Markov chains, and renewal and regenerative processes. Related models are obtained by block coding and randomizing the start, or by stationary coding, an extension of the instantaneous function concept which allows the function to depend on both past and future. An important and simple class of such models is the class of concatenated-block processes, that is, the processes obtained by independently concatenating fixed-length blocks according to some block distribution and randomizing the start. Further models, including the weak Bernoulli processes and the important class of stationary codings of i.i.d. processes, are discussed in Chapter III and Chapter IV. All these models and more are introduced in the first two sections of Chapter I.

Of particular note in the discussion of process models is how ergodic theorists think of a stationary process, namely, as a measure-preserving transformation on a probability space, together with a partition of the space. This point of view, introduced in Section I.2, leads directly to Kakutani's simple geometric representation of a process in terms of a recurrent event, a representation that not only simplifies the discussion of stationary renewal and regenerative processes but generalizes these concepts to the case where times between recurrences are not assumed to be independent, but only stationary.

A further generalization, given in Section I.10, leads to a powerful method for constructing examples known as cutting and stacking.

The book has four chapters. The first chapter, which is half the book, is devoted to the basic tools, including the Kolmogorov and ergodic theory models for a process, the ergodic theorem and its connection with empirical distributions, the entropy theorem and its interpretations, a method for converting block codes to stationary codes, the weak topology and the even more important d-metric topology, and the cutting and stacking method. Properties related to entropy which hold for every ergodic process are discussed in Chapter II. These include entropy as the almost-sure bound on per-symbol compression, Ziv's proof of asymptotic optimality of the Lempel-Ziv algorithm via his interesting concept of individual sequence entropy, the relation between entropy and partitions of sample paths into fixed-length blocks, or partitions into distinct blocks, or partitions into repeated blocks, and the connection between entropy and recurrence times and entropy and the growth of prefix trees. Properties related to entropy which hold only for restricted classes of processes are discussed in Chapter III, including rates of convergence for frequencies and entropy, the estimation of joint distributions in both the variational metric and the d-metric, a connection between entropy and d-neighborhoods, and a connection between entropy and waiting times. Several characterizations of the class of stationary codings of i.i.d. processes are given in Chapter IV, including the almost block-independence, finitely determined, very weak Bernoulli, and blowing-up characterizations. Some of these date back to the original work of Ornstein and others on the isomorphism problem for Bernoulli shifts, although the almost block-independence and blowing-up ideas are more recent. With a few exceptions, the sections in Chapters II, III, and IV are approximately independent of each other, conditioned on the material in Chapter I.

Many standard topics from ergodic theory are omitted or given only cursory treatment, in part because the book is already too long and in part because they are not close to the central focus of this book. These topics include topological dynamics, smooth dynamics, random fields, general ergodic theorems, K-processes, combinatorial number theory, and continuous time and/or space theory. Likewise little or nothing is said about such standard information theory topics as rate distortion theory, redundancy, divergence-rate theory, channel theory, algebraic coding, and multi-user theory.

This book is an outgrowth of the lectures I gave each fall in Budapest from 1989 through 1994, both as special lectures and seminars at the Mathematics Institute of the Hungarian Academy of Sciences and as courses given in the Probability Department of Eötvös Loránd University. The audiences included ergodic theorists, information theorists, and probabilists, as well as combinatorialists and people from engineering and other mathematics disciplines, ranging from undergraduate and graduate students through post-docs and junior faculty to senior professors and researchers. In addition, lectures on parts of the penultimate draft of the book were presented at the Technical University of Delft in the fall of 1995.

Some specific stylistic guidelines were followed in writing this book. Proofs are sketched first, then given in complete detail. Theorems and lemmas are given names that include some information about content, for example, the entropy theorem rather than the Shannon-McMillan-Breiman theorem. Likewise, suggestive names are used for concepts, such as building blocks (a name proposed by Zoli Györfi) and column structures (as opposed to gadgets). Also numbered displays are often (informally) given names similar to those used for LaTeX labels. Exercises that extend the ideas are given at the end of most sections; these range in difficulty from quite easy to quite hard. Only those references that seem directly related to the topics discussed are included.

No project such as this can be free from errors and incompleteness. A list of errata as well as a forum for discussion will be available on the Internet at the following web address.

http://www.math.utoledo.edu/~pshields/ergodic.html

I am indebted to many people for assistance with this project. Imre Csiszár and Katalin Marton not only attended most of my lectures but critically read parts of the manuscript at all stages of its development and discussed many aspects of the book with me. It was Imre's suggestion that led me to include the discussions of renewal and regenerative processes. Much of the material in Chapter III as well as the blowing-up ideas in Chapter IV are the result of joint work with Kati. In addition I had numerous helpful conversations with Benjy Weiss, and his criticisms led to many revisions of the cutting and stacking discussion. I am much indebted to my two Toledo graduate students, Shaogang Xu and Xuehong Li, who learned ergodic theory by carefully reading almost all of the manuscript at each stage of its development, in the process discovering numerous errors and poor ways of saying things. Others who contributed ideas and/or read parts of the manuscript include Gábor Tusnády, Don Ornstein, Dave Neuhoff, Gusztáv Morvai, György Michaletzky, Bob Burton, Jacek Serafin, Aaron Wyner, Jacob Ziv, and Nancy Morrison.

I am grateful to the Mathematics Institute of the Hungarian Academy for providing me with many years of space in a comfortable and stimulating environment, as well as to the Institute and to the Probability Department of Eötvös Loránd University for the many lecture opportunities. Much of the work for this project was supported by NSF grants DMS-8742630 and DMS-9024240 and by a joint NSF-Hungarian Academy grant MTA-NSF project 37. My initial lectures in 1989 were supported by a Fulbright lectureship.

Last, but far from least, this book is dedicated to my son, Jeffrey.

Paul C. Shields
Toledo, Ohio
March 22, 1996


Chapter I

Basic concepts

Section I.1 Stationary processes.

A (discrete-time, stochastic) process is a sequence X_1, X_2, ... of random variables defined on a probability space (X, Σ, μ). The process has alphabet A if the range of each X_i is contained in A. In this book the focus is on finite-alphabet processes, so, unless stated otherwise, "process" means a discrete-time finite-alphabet process. Also, unless it is clear from the context or explicitly stated otherwise, "measure" will mean "probability measure" and "function" will mean "measurable function" with respect to some appropriate σ-algebra on a probability space.

The cardinality of a finite set A is denoted by |A|. The sequence a_m, a_{m+1}, ..., a_n, where each a_i ∈ A, is denoted by a_m^n. The set of all such a_m^n is denoted by A_m^n, except that A^n is used when m = 1.

The k-th order joint distribution of the process {X_k} is the measure μ_k on A^k defined by the formula

    μ_k(a_1^k) = Prob(X_1^k = a_1^k),  a_1^k ∈ A^k.

When no confusion will result the subscript k on μ_k may be omitted. The set of joint distributions {μ_k: k ≥ 1} is called the distribution of the process. A process is considered to be defined by its joint distributions, that is, the particular space on which the functions X_n are defined is not important; all that really matters in probability theory is the distribution of the process. Thus one is free to choose the underlying space (X, Σ, μ) on which the X_n are defined in any convenient manner, as long as the joint distributions are left unchanged.

The distribution of a process is thus a family of probability distributions, one for each k. The sequence cannot be completely arbitrary, for implicit in the definition of process is that the following consistency condition must hold for each k ≥ 1,

(1)    μ_k(a_1^k) = Σ_{a_{k+1}} μ_{k+1}(a_1^{k+1}),  a_1^k ∈ A^k.

The distribution of a process can, of course, also be defined by specifying the start distribution μ_1 and the successive conditional distributions

    μ(a_k | a_1^{k-1}) = Prob(X_k = a_k | X_1^{k-1} = a_1^{k-1}) = μ_k(a_1^k) / μ_{k-1}(a_1^{k-1}).
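The consistency condition (1) is easy to check numerically for a concrete family of joint distributions. The following short Python sketch is not part of the text; the two-letter alphabet, the transition matrix, and the start distribution are arbitrary illustrative choices. It verifies condition (1) for the joint distributions of a simple Markov chain.

    import itertools
    import numpy as np

    # A hypothetical two-letter alphabet and an illustrative Markov chain.
    A = [0, 1]
    M = np.array([[0.9, 0.1],
                  [0.4, 0.6]])      # transition matrix
    mu1 = np.array([0.8, 0.2])      # start distribution

    def mu_k(word):
        """Joint probability mu_k(a_1^k) of the chain."""
        p = mu1[word[0]]
        for a, b in zip(word, word[1:]):
            p *= M[a, b]
        return p

    # Check condition (1): mu_k(a_1^k) = sum over a_{k+1} of mu_{k+1}(a_1^{k+1}).
    for k in range(1, 5):
        for word in itertools.product(A, repeat=k):
            total = sum(mu_k(word + (b,)) for b in A)
            assert abs(mu_k(word) - total) < 1e-12
    print("consistency condition (1) holds for k = 1, ..., 4")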

An important instance of this idea is the Kolmogorov model for a process, which represents it as the sequence of coordinate functions on the space of infinite sequences drawn from the alphabet A, equipped with a Borel measure constructed from the consistency conditions (1).

Let A^∞ denote the set of all infinite sequences x = {x_i}, x_i ∈ A, 1 ≤ i < ∞. The cylinder set determined by a_m^n, denoted by [a_m^n], is the subset of A^∞ defined by

    [a_m^n] = {x: x_i = a_i, m ≤ i ≤ n}.

For each n ≥ 1 the coordinate function X̂_n: A^∞ → A is defined by X̂_n(x) = x_n, x ∈ A^∞. The Kolmogorov representation theorem states that every process with alphabet A can be thought of as the coordinate function process {X̂_n} on A^∞, together with a Borel measure on A^∞.

Theorem I.1.2 (Kolmogorov representation theorem.) If {μ_k} is a sequence of measures for which the consistency conditions (1) hold, then there is a unique Borel probability measure μ on A^∞ such that, for each k and each a_1^k, μ([a_1^k]) = μ_k(a_1^k). In other words, if {X_n} is a process with finite alphabet A, then there is a unique Borel measure μ on A^∞ for which the sequence of coordinate functions {X̂_n} has the same distribution as {X_n}.

The rigorous construction of the Kolmogorov representation is carried out as follows. Let C_n be the collection of cylinder sets defined by sequences that belong to A^n, let C = ∪_n C_n, and let R_n = R(C_n) denote the ring generated by C_n. The sequence {R_n} is increasing, and its union R = R(C) is the ring generated by all the cylinder sets. Let Σ be the σ-algebra generated by the cylinder sets C. The members of Σ are commonly called the Borel sets of A^∞. The collection Σ can also be defined as the σ-algebra generated by R, or by the compact sets. Two important properties are summarized in the following lemma.

Lemma I.1.1
(a) Each set in R is a finite disjoint union of cylinder sets from C.
(b) If {B_n} ⊂ R is a decreasing sequence of sets with empty intersection then there is an N such that B_n = ∅, n ≥ N.

Proof The proof of (a) is left as an exercise. Part (b) is an application of the finite intersection property for sequences of compact sets, since the space A^∞ is compact in the product topology and each set in R is closed. Thus the lemma is established.

The process {X_n} defines a set function μ on the collection C of cylinder sets by the formula

(2)    μ([a_1^n]) = Prob(X_i = a_i, 1 ≤ i ≤ n).

The consistency conditions (1), together with Lemma I.1.1(a), imply that μ extends to a finitely additive set function on the ring R = R(C) generated by the cylinder sets. The finite intersection property, Lemma I.1.1(b), implies that μ can be extended to a unique countably additive measure on the σ-algebra Σ generated by R. Equation (2) translates into the statement that {X̂_n} and {X_n} have the same joint distribution. This proves Theorem I.1.2.

The measure μ will be called the Kolmogorov measure of the process {X_n}, or the Kolmogorov measure of the sequence {μ_k}, and the sequence of coordinate functions {X̂_n} on the probability space (A^∞, Σ, μ) will be called the Kolmogorov representation of the process {X_n}. As noted earlier, a process is considered to be defined by its joint distributions, that is, the particular space on which the functions X_n are defined is not important. The Kolmogorov model is simply one particular way to define a space and a sequence of functions with the given distributions.

Another useful model is the complete measure model, one in which subsets of sets of measure 0 are measurable. The Kolmogorov measure on A^∞ extends to a complete measure on the completion of the Borel sets Σ relative to μ. Completion has no effect on joint distributions, and, of course, different processes have Kolmogorov measures with different completions. Many ideas are easier to express and many results are easier to establish in the framework of complete measures, so whenever it is convenient to do so the complete Kolmogorov model will be used in this book, though often without explicitly saying so. In particular, "Kolmogorov measure" is taken to refer to either the measure defined by Theorem I.1.2, or to its completion, whichever is appropriate to the context.

Process and measure language will often be used interchangeably; for example, "let μ be a process" means "let μ be the Kolmogorov measure for some process {X_n}."

Stationary finite-alphabet processes serve as models in many settings of interest, including physics, data transmission and storage, and statistics. A process is stationary if the joint distributions do not depend on the choice of time origin, that is,

(3)    Prob(X_i = a_i, m ≤ i ≤ n) = Prob(X_{i+1} = a_i, m ≤ i ≤ n),

for all m, n, and a_m^n. The phrase "stationary process" will mean a stationary, discrete-time, finite-alphabet process, unless stated otherwise.

The (left) shift T: A^∞ → A^∞ is defined by (Tx)_n = x_{n+1}, n ≥ 1, which is expressed in symbolic form by

    x = (x_1, x_2, x_3, ...) ⟹ Tx = (x_2, x_3, ...).

The shift defines a set transformation T^{-1} by the formula T^{-1}B = {x: Tx ∈ B}, B ∈ Σ. The statement that a process {X_n} with finite alphabet A is stationary translates into the statement that the Kolmogorov measure μ is invariant under the shift transformation T.

Note that the set transformation T^{-1} maps a cylinder set onto a cylinder set, that is,

    T^{-1}[a_m^n] = [b_{m+1}^{n+1}], where b_{i+1} = a_i, m ≤ i ≤ n,

so that the shift T is Borel measurable. It follows that T^{-1}B is a Borel set for each Borel set B. The condition that the process be stationary is just the condition that μ(B) = μ(T^{-1}B) for each cylinder set B, which is, in turn, equivalent to the condition that μ(B) = μ(T^{-1}B) for each Borel set B. The condition, μ(B) = μ(T^{-1}B), B ∈ Σ, is usually summarized by saying that T preserves the measure μ or, alternatively, that μ is T-invariant.

Remark I.1.3 Much of the success of modern probability theory is based on use of the Kolmogorov model, for asymptotic properties can often be easily formulated in terms of subsets of the infinite product space A^∞. In addition, the model provides a concrete uniform setting for all processes.

The intuitive idea of stationarity is that the mechanism generating the random variables doesn't change in time. It is often convenient to think of a process as having run for a long time before the start of observation, so that only nontransient effects, that is, those that do not depend on the time origin, are seen. Such a model is well interpreted by another model, called the doubly-infinite sequence model or two-sided model. Let Z denote the set of integers and let A^Z denote the set of doubly-infinite A-valued sequences, that is, the set of all sequences x = {x_n}, where each x_n ∈ A and n ∈ Z ranges from −∞ to ∞. The shift T: A^Z → A^Z is defined by (Tx)_n = x_{n+1}, n ∈ Z. Note that the shift T for the two-sided model is invertible, that is, T is one-to-one and onto and T^{-1} is measure preserving. (The two-sided shift is, in fact, a homeomorphism on the compact space A^Z.) It preserves a Borel probability measure μ if and only if the sequence of coordinate functions {X̂_n} is a stationary process on the probability space (A^Z, Σ, μ). The concept of cylinder set extends immediately to this new setting, and the set transformation T^{-1} again maps a cylinder set onto a cylinder set. In the stationary case, if {μ_k} is a set of measures for which the consistency conditions (1) and the stationarity conditions (3) hold, then there is a unique T-invariant Borel measure μ on A^Z such that μ([a_m^n]) = μ_k(a_1^k), for k = n − m + 1. The latter model will be called the two-sided Kolmogorov model of a stationary process. The projection of μ onto A^∞ is, in turn, the Kolmogorov measure on A^∞ for the process defined by {μ_k}.

In summary, for stationary processes there are two standard representations, one as the Kolmogorov measure μ on A^∞, the other as the shift-invariant extension to the space A^Z. From a physical point of view it does not matter in the stationary case whether the one-sided or two-sided model is used. Proofs are sometimes simpler if the two-sided model is used, for invertibility of the shift on A^Z makes things easier. In some cases where such a simplification is possible, only the proof for the invertible case will be given; such results will then apply to the one-sided model in the stationary case. As before, "Kolmogorov measure" is taken to refer to either the one-sided measure or its completion, or to the two-sided measure or its completion, as appropriate to the context. Different processes with the same alphabet A are distinguished by having different Kolmogorov measures on A^∞.

Remark I.1.4 While this book is primarily concerned with finite-alphabet processes, it is sometimes desirable to consider countable-alphabet processes, or even continuous-alphabet processes. The Kolmogorov theorem extends easily to the countable case, as well as to the continuous-alphabet i.i.d. cases that will be considered. For example, return-time processes associated with finite-alphabet processes are, except in trivial cases, countable-alphabet processes, and functions of uniformly distributed i.i.d. processes will be useful in Chapter 4. As will be shown in later sections, even more can be gained in the stationary case by retaining the abstract concept of a stationary process as a sequence of random functions with a specified joint distribution on an arbitrary probability space. When completeness is needed the complete model can be used, and when invertibility is useful the two-sided model on A^Z can be used.

I.1.a Examples.

Some standard examples of finite-alphabet processes will now be presented.

Example I.1.5 (Independent processes.) The simplest example of a stationary finite-alphabet process is an independent, identically distributed (i.i.d.) process. A sequence of random variables {X_n} is independent if

    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n)

holds for all n > 1 and all a_1^n ∈ A^n. It is identically distributed if Prob(X_n = a) = Prob(X_1 = a) holds for all n ≥ 1 and all a ∈ A. The i.i.d. condition holds if and only if the product formula

    μ_k(a_1^k) = ∏_{i=1}^{k} μ_1(a_i)

holds, for all k and all a_1^k ∈ A^k. Such a measure is called a product measure defined by μ_1. Thus a process is i.i.d. if and only if its Kolmogorov measure is the product measure defined by some distribution on A. An independent process is stationary if and only if it is identically distributed.

Example I.1.6 (Markov chains.) The simplest examples of finite-alphabet dependent processes are the Markov processes. A sequence of random variables, {X_n}, is a Markov chain if

(4)    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n | X_{n-1} = a_{n-1})

holds for all n ≥ 2 and all a_1^n ∈ A^n. A Markov chain is said to be homogeneous or to have stationary transitions if Prob(X_n = b | X_{n-1} = a) does not depend on n, in which case the |A| × |A| matrix M defined by

    M_{ab} = M_{b|a} = Prob(X_n = b | X_{n-1} = a)

is called the transition matrix of the chain. The start distribution is the (row) vector μ_1 = {μ_1(a)}, where μ_1(a) = Prob(X_1 = a). The start distribution is called a stationary distribution of a homogeneous chain if the matrix product μ_1 M is equal to the row vector μ_1. The condition that a Markov chain be a stationary process is that it have stationary transitions and that its start distribution is a stationary distribution. Unless stated otherwise, "stationary Markov process" will mean a Markov chain with stationary transitions for which μ_1 M = μ_1, and for which μ_1(a) > 0 for all a ∈ A. The positivity condition rules out transient states and is included so as to avoid trivialities.

Example I.1.7 (Multistep Markov chains.) A generalization of the Markov property allows dependence on k steps in the past. A process is called k-step Markov if

(5)    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n | X_{n-k}^{n-1} = a_{n-k}^{n-1})

holds for all n > k and for all a_1^n ∈ A^n. Note that probabilities for a k-th order stationary chain can be calculated by thinking of it as a first-order stationary chain with alphabet B, where μ_k(a_1^k) = Prob(X_1^k = a_1^k), B = {a_1^k: μ_k(a_1^k) > 0}, and M^(k) is the |B| × |B| matrix defined by

    M^(k)_{a_1^k, b_1^k} = Prob(X_{k+1} = b_k | X_1^k = a_1^k), if b_1^{k-1} = a_2^k, and 0 otherwise.

The two conditions for stationarity are, first, that transitions be stationary, that is, that

    Prob(X_n = a_n | X_{n-k}^{n-1} = a_{n-k}^{n-1})

does not depend on n as long as n > k, and, second, that μ_k M^(k) = μ_k. If this holds, then a direct calculation shows that the process is stationary.

Example I.1.8 (Finite-state processes.) Let A and S be finite sets. An A-valued process {Y_n} is said to be a finite-state process with state space S, if there is a finite-alphabet process {s_n}, with values in S, such that the pair process {X_n = (s_n, Y_n)} is a Markov chain, that is, if

(6)    Prob(s_n = s, Y_n = b | s_1^{n-1}, Y_1^{n-1}) = Prob(s_n = s, Y_n = b | s_{n-1}).

The Markov chain {X_n} is often referred to as the underlying Markov process or hidden Markov process associated with {Y_n}. Note that the finite-state process {Y_n} is stationary if the chain {X_n} is stationary, but {Y_n} is not, in general, a Markov chain, see Exercise 2. In information theory, what is here called a finite-state process is sometimes called a Markov source, following the terminology used in [15]. In ergodic theory, any finite-alphabet process is called a finite-state process, and what is here called a finite-state process is called a function (or lumping) of a Markov chain. Here "Markov" will be reserved for processes that satisfy the Markov property (4) or its more general form (5), and "finite-state process" will be reserved for processes satisfying property (6).
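As a small numerical illustration of Example I.1.6 (not part of the text; the three-state transition matrix below is an arbitrary choice), the following Python sketch computes a stationary distribution satisfying μ_1 M = μ_1 and samples a path of the resulting stationary chain; the empirical first-order frequencies approach the stationary vector.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative transition matrix on the alphabet {0, 1, 2}.
    M = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

    # A stationary distribution is a left eigenvector of M for eigenvalue 1,
    # normalized to be a probability vector (mu1 M = mu1).
    eigvals, eigvecs = np.linalg.eig(M.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
    pi = np.clip(pi, 0, None)
    pi = pi / pi.sum()
    assert np.allclose(pi @ M, pi)

    def sample_path(n):
        """Sample X_1, ..., X_n of the stationary chain: start with pi, move by rows of M."""
        x = [rng.choice(3, p=pi)]
        for _ in range(n - 1):
            x.append(rng.choice(3, p=M[x[-1]]))
        return np.array(x)

    path = sample_path(100000)
    print("empirical first-order distribution:", np.bincount(path, minlength=3) / len(path))
    print("stationary vector pi:             ", pi)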

Example I.1.9 (Stationary coding.) The concept of finite-state process can be generalized to allow dependence on both past and future, an idea most easily expressed in terms of the two-sided representation of a stationary process. Suppose A and B are finite sets. A Borel measurable mapping F: A^Z → B^Z is a stationary coder if

    F(T_A x) = T_B F(x),  x ∈ A^Z,

where T_A and T_B denote the shifts on the respective two-sided sequence spaces A^Z and B^Z. Define the function f: A^Z → B by the formula f(x) = F(x)_0, x ∈ A^Z, that is, f(x) is just the 0-th coordinate of y = F(x). The function f is called the time-zero coder associated with F, otherwise known as the per-symbol or time-zero coder. The stationary coder F and its time-zero coder f are connected by the formula

    y_n = f(T_A^n x),  n ∈ Z,

that is, the image y = F(x) is the sequence defined by y_n = f(T_A^n x), n ∈ Z. Thus the stationary coding idea can be expressed in terms of the sequence-to-sequence coder F, sometimes called the full coder, or in terms of the sequence-to-symbol coder f. The word "coder" or "encoder" may refer to either F or f.

The coder F transports a Borel measure μ on A^Z into the measure ν = μ ∘ F^{-1}, defined for Borel subsets C of B^Z by

    ν(C) = μ(F^{-1}(C)).

The encoded process ν is said to be a stationary coding of μ, with coder F. Note that a stationary coding of a stationary process is automatically a stationary process.

A case of particular interest occurs when the time-zero coder f depends on only a finite number of coordinates of the input sequence x, that is, when there is a nonnegative integer w such that f(x) = f(x_{-w}^w). In this case it is common practice to write f(x) = f(x_{-w}^w) and say that f (or its associated full coder F) is a finite coder with window half-width w. In the special case when w = 0 the encoding is said to be instantaneous, and the encoded process {Y_n} defined by Y_n = f(X_n) is called an instantaneous function or simply a function of the {X_n} process. Finite coding is sometimes called sliding-block or sliding-window coding, for one can think of the coding as being done by sliding a window of width 2w + 1 along x. At time n the window will contain x_{n-w}, x_{n-w+1}, ..., x_{n+w}, which determines the value y_n, that is, the value y_n depends only on x_n and the values of its w past and future neighbors. To determine y_{n+1}, the window is shifted to the right (that is, the contents of the window are shifted to the left), eliminating x_{n-w} and bringing in x_{n+w+1}.
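To make the sliding-window picture concrete, here is a minimal Python sketch (illustrative only; the particular time-zero coder, a majority vote on a binary window, is an arbitrary choice and not from the text) that applies a finite coder with window half-width w = 1 to a finite stretch of a sample path.

    # Minimal sliding-window (finite) coder sketch with window half-width w = 1.
    def time_zero_coder(window):
        # window = (x_{n-1}, x_n, x_{n+1}); output the majority symbol.
        return 1 if sum(window) >= 2 else 0

    def finite_code(x, f, w):
        """Encode y_n = f(x_{n-w}^{n+w}) for every n whose full window lies inside x."""
        return [f(tuple(x[n - w:n + w + 1])) for n in range(w, len(x) - w)]

    x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
    y = finite_code(x, time_zero_coder, w=1)
    print(y)   # one output symbol for each position whose window fits inside x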

Such a finite coder f defines a mapping, also denoted by f, from A_{m-w}^{n+w} to B_m^n, by the formula f(x_{m-w}^{n+w}) = y_m^n, where y_j = f(x_{j-w}^{j+w}), m ≤ j ≤ n. The key property of finite codes is that the inverse image of a cylinder set of length n is a finite union of cylinder sets of length n + 2w, where w is the window half-width of the code. If F: A^Z → B^Z is the full coder defined by f, then

    F^{-1}([y_m^n]) = ∪_{f(x_{m-w}^{n+w}) = y_m^n} [x_{m-w}^{n+w}],

so that, in particular, cylinder set probabilities for the encoded measure ν = μ ∘ F^{-1} are given by the formula

(7)    ν([y_m^n]) = Σ_{f(x_{m-w}^{n+w}) = y_m^n} μ([x_{m-w}^{n+w}]),

for any measure μ on A^Z. A process {Y_n} is a finite coding of a process {X_n} if it is a stationary coding of {X_n} with finite width time-zero coder. Note that a finite-state process is merely a finite coding of a Markov chain in which the window half-width is 0.

A stationary coding of an i.i.d. process is called a B-process, that is, a finite-alphabet process is called a B-process if it is a stationary coding of some i.i.d. process with finite or infinite alphabet. Much more will be said about B-processes in later parts of this book, especially in Chapter 4, where several characterizations will be discussed.

Example I.1.10 (Block codings.) Another type of coding, called block coding, is frequently used in practice, and is often easier to analyze than is stationary coding. An N-block code is a function C_N: A^N → B^N, where A and B are finite sets. Such a code can be used to map an A-valued process {X_n} into a B-valued process {Y_n} by applying C_N to consecutive nonoverlapping blocks of length N, that is,

    Y_{jN+1}^{(j+1)N} = C_N(X_{jN+1}^{(j+1)N}),  j = 0, 1, 2, ....

The process {Y_n} is called the N-block coding of the process {X_n} defined by the N-block code C_N. The Kolmogorov measures, μ and ν, of the respective processes, {X_n} and {Y_n}, are connected by the formula ν = μ ∘ F^{-1}, where y = F(x) is the (measurable) mapping defined by y_{jN+1}^{(j+1)N} = C_N(x_{jN+1}^{(j+1)N}), j = 0, 1, 2, .... If the process {X_n} is stationary, the N-block coding {Y_n} is not, in general, stationary, although it is N-stationary, that is, it is invariant under the N-fold shift T^N, where T = T_B is the shift on B^∞. Block coding, that is, destroys stationarity.

There is a simple way to convert an N-stationary process, such as the encoded process {Y_n}, into a stationary process. The process {Ȳ_n} is defined by selecting an integer u ∈ [1, N] according to the uniform distribution and defining

    Ȳ_i = Y_{u+i-1},  i = 1, 2, ....

This method is called "randomizing the start."
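The following Python sketch is an illustration only, not the author's construction: it applies an N-block code to consecutive nonoverlapping blocks and then randomizes the start. The particular 3-block code used here (reversing each block) is an arbitrary choice; any map from A^N to B^N would do.

    import random

    random.seed(1)
    N = 3

    def C_N(block):
        """An arbitrary 3-block code: reverse the block."""
        return tuple(reversed(block))

    def block_code(x):
        """Apply C_N to consecutive nonoverlapping N-blocks of x."""
        y = []
        for j in range(len(x) // N):
            y.extend(C_N(tuple(x[j * N:(j + 1) * N])))
        return y

    def randomize_start(y):
        """Pick u uniformly from {1, ..., N} and return Y_u, Y_{u+1}, ...."""
        u = random.randint(1, N)
        return y[u - 1:]

    x = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
    y = block_code(x)           # N-stationary, not stationary in general
    y_bar = randomize_start(y)  # stationary version, at the cost of a random delay
    print(y, y_bar, sep="\n")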

The relation between the Kolmogorov measures, ν and ν̄, of the respective processes {Y_n} and {Ȳ_n}, is expressed by the formula

    ν̄(C) = (1/N) Σ_{i=0}^{N-1} ν(T^{-i}C),  C ∈ Σ,

that is, ν̄ is the average of ν together with its first N − 1 shifts. In summary, an N-block code C_N induces a block coding of a stationary process {X_n} onto an N-stationary process {Y_n}, which can then be converted to a stationary process by randomizing the start. The method of randomizing the start clearly generalizes to convert any N-stationary process into a stationary process. In general, however, the final stationary process {Ȳ_n} is not a stationary encoding of the original process {X_n}, and many useful properties of {X_n} may get destroyed. For example, block coding introduces a periodic structure in the encoded process, a structure which is inherited, with a random delay, by the process {Ȳ_n}. In Section I.8, a general procedure will be developed for converting a block code to a stationary code by inserting spacers between blocks so that the resulting process is a stationary coding of the original process.

Example I.1.11 (Concatenated-block processes.) Processes can also be constructed by concatenating N-blocks drawn independently at random according to a given distribution, with stationarity produced by randomizing the start. There are several equivalent ways to describe such a process.

In random variable terms, suppose X_1^N = (X_1, ..., X_N) is a random vector with values in A^N. Let {Y_n} be the A-valued process defined by the requirement that the blocks Y_{(j-1)N+1}^{jN}, j = 1, 2, ..., be independent and have the distribution of X_1^N. The process {Ȳ_n} obtained from the N-stationary process {Y_n} by randomizing the start is called the concatenated-block process defined by the random vector X_1^N. The concatenated-block process is characterized by the following two conditions.

(i) {Ȳ_n} is stationary.

(ii) There is a random variable U, uniformly distributed on {1, 2, ..., N}, such that, conditioned on U = u, the blocks {Ȳ_{jN+2-u}^{(j+1)N+1-u}: j = 1, 2, ...} are independent, each with the distribution of X_1^N.

An alternative formulation of the same idea in terms of measures is obtained by starting with a probability measure μ on A^N, forming the product measure μ* on (A^N)^∞, transporting this to an N-stationary measure μ* ∘ φ^{-1} on A^∞ via the mapping φ({w(j)}) = x = w(1)w(2)···, where x_{(j-1)N+1}^{jN} = w(j), j = 1, 2, ..., then averaging to obtain the measure μ̄ defined by the formula

    μ̄(B) = (1/N) Σ_{i=0}^{N-1} μ*(φ^{-1}(T^{-i}B)),  B ∈ Σ.

The process defined by μ̄ has the properties (i) and (ii), hence this measure terminology is consistent with the random variable terminology.

The measure μ̄ is called the concatenated-block process defined by the measure μ.

A third formulation, which is quite useful, represents a concatenated-block process as a finite-state process, that is, a function of a Markov chain. To describe this representation fix a measure μ on A^N and let {Y_n} be the (stationary) Markov chain with alphabet S = A^N × {1, 2, ..., N}, defined by the following transition and starting rules.

(a) If i < N, (a_1^N, i) can only go to (a_1^N, i + 1).
(b) (a_1^N, N) goes to (b_1^N, 1) with probability μ(b_1^N).

The process is started by selecting (a_1^N, i) with probability μ(a_1^N)/N. The process {Ỹ_n} defined by setting Ỹ_n = a_i, if Y_n = (a_1^N, i), is the same as the concatenated-block process based on the block distribution μ. (See Exercise 7.) Concatenated-block processes will play an important role in many later discussions in this book.

Example I.1.12 (Blocked processes.) Associated with a stationary process {X_n} and a positive integer N are two different stationary processes, {Z_n} and {W_n}, defined, for n ≥ 1, by

    Z_n = (X_{(n-1)N+1}, X_{(n-1)N+2}, ..., X_{nN}),
    W_n = (X_n, X_{n+1}, ..., X_{n+N-1}).

The Z process is called the nonoverlapping N-block process determined by {X_n}, while the W process is called the overlapping N-block process determined by {X_n}. Each is stationary if {X_n} is stationary. If the process {X_n} is i.i.d., then its nonoverlapping blockings are i.i.d., but overlapping clearly destroys independence. The overlapping N-blocking of an i.i.d. or Markov process is, however, Markov. In fact, if {X_n} is Markov with transition matrix M_{ab}, then the overlapping N-block process is Markov with alphabet B = {a_1^N: μ(a_1^N) > 0} and transition matrix defined by

    M_{a_1^N, b_1^N} = M_{a_N b_N}, if b_1^{N-1} = a_2^N, and 0 otherwise,

while the nonoverlapping N-block process is Markov with alphabet B and transition matrix defined by

    M_{a_1^N, b_1^N} = M_{a_N b_1} ∏_{i=1}^{N-1} M_{b_i b_{i+1}}.

A related process of interest {Y_n}, called the N-th term process, selects every N-th term from the {X_n} process, that is, Y_n = X_{(n-1)N+1}, for n ≥ 1. If {X_n} is Markov with transition matrix M, then {Y_n} will be Markov with transition matrix M^N, the N-th power of M.
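A quick Python sketch (illustrative only; the sample path is arbitrary) of the two blocked processes of Example I.1.12, built from a single finite sample path.

    def nonoverlapping_blocks(x, N):
        """Z_n = (X_{(n-1)N+1}, ..., X_{nN})."""
        return [tuple(x[j * N:(j + 1) * N]) for j in range(len(x) // N)]

    def overlapping_blocks(x, N):
        """W_n = (X_n, ..., X_{n+N-1})."""
        return [tuple(x[j:j + N]) for j in range(len(x) - N + 1)]

    x = [0, 1, 1, 0, 1, 0, 0, 1]
    print(nonoverlapping_blocks(x, 2))   # [(0, 1), (1, 0), (1, 0), (0, 1)]
    print(overlapping_blocks(x, 2))      # [(0, 1), (1, 1), (1, 0), ..., (0, 1)]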

I.1.b Probability tools.

Two elementary results from probability theory will be frequently used, the Markov inequality and the Borel-Cantelli principle. They are summarized as follows.

Lemma I.1.13 (The Markov inequality.) Let f be a nonnegative, integrable function on a probability space (X, Σ, μ). If ∫ f dμ ≤ cδ then f(x) ≤ c, except for a set of measure at most δ.

Lemma I.1.14 (The Borel-Cantelli principle.) If {C_n} is a sequence of measurable sets in a probability space (X, Σ, μ) such that Σ_n μ(C_n) < ∞, then for almost every x there is an N = N(x) such that x ∉ C_n, n ≥ N.

In general, a property P is said to be measurable if the set of all x for which P(x) is true is a measurable set. If {P_n} is a sequence of measurable properties then

(a) P_n(x) holds eventually almost surely, if for almost every x there is an N = N(x) such that P_n(x) is true for n ≥ N.

(b) P_n(x) holds infinitely often, almost surely, if for almost every x there is an increasing sequence {n_i} of integers, which may depend on x, such that P_{n_i}(x) is true for i = 1, 2, ....

For example, the Borel-Cantelli principle is often expressed by saying that if Σ_n μ(C_n) < ∞ then x ∉ C_n, eventually almost surely. Almost-sure convergence is often established using the following generalization of the Borel-Cantelli principle.

Lemma I.1.15 (The iterated Borel-Cantelli principle.) Suppose {G_n} and {B_n} are two sequences of measurable sets such that x ∈ G_n, eventually almost surely, and Σ_n μ(B_n ∩ G_n) < ∞. Then x ∉ B_n, eventually almost surely.

In many applications, the fact that x ∈ G_n, eventually almost surely, is itself established by showing that Σ_n μ(G_n^c) < ∞, in which case the iterated Borel-Cantelli principle is, indeed, just a generalized Borel-Cantelli principle.

Frequent use will be made of various equivalent forms of almost sure convergence, summarized as follows.

Lemma I.1.16 The following are equivalent for measurable functions on a probability space.

(a) f_n → f, almost surely.

(b) |f_n(x) − f(x)| < ε, eventually almost surely, for every ε > 0.

(c) Given ε > 0, there is an N and a set G of measure at least 1 − ε, such that |f_n(x) − f(x)| < ε, x ∈ G, n ≥ N.

The proof of this is left to the exercises.

As several of the preceding examples suggest, a process is often specified as a function of some other process. Probabilities for the new process can then be calculated by using the inverse image to transfer back to the old process. Sometimes it is useful to go in the opposite direction, namely, first see what is happening on a set of probability 1 in the old process, then transfer to the new process. A complication arises, namely, that Borel images of Borel sets may not be Borel sets. For nice spaces, such as product spaces, this is not a serious problem, for such images are always measurable with respect to the completion of the image measure. This fact is summarized by the following lemma.

Lemma I.1.17 (The Borel mapping lemma.) Let F be a Borel function from A^∞ into B^∞, let μ be a Borel measure on A^∞, let ν = μ ∘ F^{-1} and let ν̄ be the completion of ν. If X is a Borel set such that μ(X) = 1, then F(X) is measurable with respect to the completion ν̄ of ν and ν̄(F(X)) = 1. A similar result holds with either A^∞ or B^∞ replaced, respectively, by A^Z or B^Z.

Use will also be made of two almost trivial facts about the connection between cardinality and probability, namely, that lower bounds on probability give upper bounds on cardinality, and upper bounds on cardinality "almost" imply lower bounds on probability. For ease of later reference these are stated here as the following lemma.

Lemma I.1.18 (Cardinality bounds.) Let μ be a probability measure on the finite set A, let B ⊂ A, and let α be a positive number.

(a) If μ(a) ≥ α, a ∈ B, then |B| ≤ 1/α.

(b) For b ∈ B, μ(b) ≥ α/|B|, except for a subset of B of measure at most α.

Deeper results from probability theory, such as the martingale theorem, the central limit theorem, the law of the iterated logarithm, and the renewal theorem, will not play a major role in this book, though they may be used in various examples, and sometimes the martingale theorem is used to simplify an argument.

I.1.c Exercises.

1. Prove Lemma I.1.18.

2. Give an example of a function of a finite-alphabet Markov chain that is not Markov of any order. (Include a proof that your example is not Markov of any order.)

3. A stationary process is said to be m-dependent if μ([a_1^n] ∩ [a_{n+m+1}^{n+m+k}]) = μ([a_1^n]) μ([a_{n+m+1}^{n+m+k}]), for all n and k, and all a_1^n and a_{n+m+1}^{n+m+k}. Show that a finite coding of an i.i.d. process is m-dependent for some m. (How is m related to window half-width?)

4. Let {U_n} be i.i.d., with each U_i uniformly distributed on [0, 1]. Define X_n = 1, if U_n > U_{n-1}, otherwise X_n = 0. Show that {X_n} is 1-dependent, and is not a finite coding of a finite-alphabet i.i.d. process. (Hint: show that such a coding would have the property that there is a number c > 0 such that μ(x_1^n) ≥ c^n, if μ(x_1^n) ≠ 0. Then show that the probability of n consecutive 0's is 1/(n + 1)!.)

5. Show that the process constructed in the preceding exercise is not Markov of any order.

6. Establish the Kolmogorov representation theorem for countable alphabet processes.

7. Show that the finite-state representation of a concatenated-block process in Example I.1.11 satisfies the two conditions (i) and (ii) of that example.

8. A measure μ on A^n defines a measure μ^(N) on A^N, N = 1, 2, ..., by the formula

    μ^(N)(x_1^N) = (∏_{k=0}^{K-1} μ(x_{kn+1}^{(k+1)n})) μ(x_{Kn+1}^{Kn+r}),

where N = Kn + r, 0 ≤ r < n. The measures {μ^(N)} satisfy the Kolmogorov consistency conditions, hence have a common extension μ* to A^∞. Show that the concatenated-block process μ̄ defined by μ is an average of shifts of μ*, that is, μ̄(B) = (1/n) Σ_{i=0}^{n-1} μ*(T^{-i}B), for each Borel subset B ⊂ A^∞.

9. Prove the iterated Borel-Cantelli principle, Lemma I.1.15.

Section I.2 The ergodic theory model.

Ergodic theory is concerned with the orbits x, Tx, T²x, ... of a transformation T: X → X on some given space X. In many cases of interest there is a natural probability measure preserved by T, relative to which information about orbit structure can be expressed in probability language. Finite measurements on the space X, which correspond to finite partitions of X, then give rise to stationary processes. This suggests the possibility of using ergodic theory ideas in the study of stationary processes. The Kolmogorov measure for a stationary process is preserved by the shift T on the sequence space A^∞ (or on the space A^Z). This model, called the transformation/partition model for a stationary process, is the subject of this section and is the basis for much of the remainder of this book.

The Kolmogorov model for a stationary process implicitly contains the transformation/partition model. The shift T on the sequence space is the transformation and the partition is P = {P_a: a ∈ A}, where for each a ∈ A, P_a = {x: x_1 = a}. The partition P is called the Kolmogorov partition associated with the process. First, associate with the partition P the random variable X_P defined by X_P(x) = a if a is the label of the member of P to which x belongs. The coordinate functions are then given by the formula

(1)    X_n(x) = X_P(T^{n-1}x),  n ≥ 1.

The process can therefore be described as follows: Pick a point x ∈ X = A^∞ at random according to the Kolmogorov measure, that is, an infinite sequence, and, for each n, let X_n = X_n(x) be the label of the member of P to which T^{n-1}x belongs.

Since to say that T^{n-1}x ∈ P_a is the same as saying that x ∈ T^{-n+1}P_a, cylinder sets and joint distributions may be expressed by the respective formulas

    [a_1^n] = ∩_{i=1}^{n} T^{-i+1} P_{a_i},

and

(2)    μ_k(a_1^k) = μ([a_1^k]) = μ(∩_{i=1}^{k} T^{-i+1} P_{a_i}).

In summary, the cylinder sets in the Kolmogorov representation, the coordinate functions, and the joint distributions can all be expressed directly in terms of the Kolmogorov partition and the shift transformation.

The concept of stationary process is formulated in terms of the abstract concepts of measure-preserving transformation and partition, as follows. Let (X, Σ, μ) be a probability space. A mapping T: X → X is said to be measurable if T^{-1}B ∈ Σ, B ∈ Σ, and measure preserving if it is measurable and if μ(T^{-1}B) = μ(B), for all B ∈ Σ. A partition P = {P_a: a ∈ A} of X is a finite, disjoint collection of measurable sets, indexed by a finite set A, whose union has measure 1, that is, X − ∪_a P_a is a null set. (In some situations, countable partitions, that is, partitions into countably many sets, are useful.) Associated with the partition P = {P_a: a ∈ A} is the random variable X_P defined by X_P(x) = a if x ∈ P_a. The random variable X_P and the measure-preserving transformation T together define a process by the formula

(3)    X_n(x) = X_P(T^{n-1}x),  n ≥ 1.

The process {X_n: n ≥ 1} defined by (3) is called the process defined by the transformation T and partition P, or, more simply, the (T, P)-process. The k-th order distribution μ_k of the process {X_n} is given by the formula

(4)    μ_k(a_1^k) = μ(∩_{i=1}^{k} T^{-i+1} P_{a_i}),

the direct analogue of the sequence space formula, (2).

The sequence {x_n = X_n(x): n ≥ 1} defined for a point x ∈ X by the formula T^{n-1}x ∈ P_{x_n} is called the (T, P)-name of x; the values, X_1(x), X_2(x), X_3(x), ..., tell to which set of the partition the corresponding member of the random orbit belongs. Of course, in the Kolmogorov representation a random point x is a sequence in A^∞ or A^Z, and the (T, P)-name of x is the same as x, (or the forward part of x in the two-sided case.)

The (T, P)-process may also be described as follows. Pick x ∈ X at random according to the μ-distribution and let X_1(x) be the label of the set in P to which x belongs. Then apply T to x to obtain Tx and let X_2(x) be the label of the set in P to which Tx belongs. Continuing in this manner defines X_3(x), X_4(x), ....

The (T, P)-process concept is, in essence, just an abstract form of the stationary coding concept, in that a partition P = {P_b: b ∈ B} of (X, Σ, μ) gives rise to a measurable function F: X → B^∞ which carries μ onto the Kolmogorov measure ν of the (T, P)-process, such that F(Tx) = T_B F(x), where T_B is the shift on B^∞. The mapping F extends to a stationary coding to B^Z in the case when X = A^Z and T is the shift. Conversely, a stationary coding F: A^Z → B^Z carrying μ onto ν determines a partition P = {P_b: b ∈ B} of A^Z such that ν is the Kolmogorov measure of the (T_A, P)-process. See Exercise 2 and Exercise 3.

In summary, a stationary process can be thought of as a shift-invariant measure on a sequence space, or, equivalently, as a measure-preserving transformation T and partition P of an arbitrary probability space. The ergodic theory point of view starts with the transformation/partition concept, while modern probability theory starts with the sequence space concept.

I.2.a Ergodic processes.

It is natural to study the orbits of a transformation by looking at its action on invariant sets, since once an orbit enters an invariant set it never leaves it. A measurable set B is said to be T-invariant if TB ⊂ B. The space X is T-decomposable if it can be expressed as the disjoint union X = X_1 ∪ X_2 of two measurable invariant sets, each of positive measure. The condition that TX_i ⊂ X_i, i = 1, 2, translates into the statement that T^{-1}X_i = X_i, i = 1, 2, hence to say that the space is indecomposable is to say that if T^{-1}B = B then μ(B) is 0 or 1. Thus the natural object of study becomes the restriction of the transformation to sets that cannot be split into nontrivial invariant sets. This leads to the concept of ergodic transformation. It is standard practice to use the word "ergodic" to mean that the space is indecomposable.

A measure-preserving transformation T is said to be ergodic if

(5)    T^{-1}B = B ⟹ μ(B) = 0 or μ(B) = 1.

The following lemma contains several equivalent formulations of the ergodicity condition.

Lemma I.2.1 (Ergodicity equivalents.) The following are equivalent for a measure-preserving transformation T on a probability space.

(a) T is ergodic.
(b) T^{-1}B ⊂ B ⟹ μ(B) = 0 or μ(B) = 1.
(c) T^{-1}B ⊃ B ⟹ μ(B) = 0 or μ(B) = 1.
(d) μ(T^{-1}B Δ B) = 0 ⟹ μ(B) = 0 or μ(B) = 1.
(e) U_T f = f, a.e., implies that f is constant, a.e.

The notation, C Δ D = (C − D) ∪ (D − C), denotes symmetric difference, and U_T denotes the operator on functions defined by (U_T f)(x) = f(Tx), where the domain can be taken to be any one of the L^p-spaces or the space of measurable functions.

Proof The equivalence of the first two follows from the fact that if T^{-1}B ⊂ B and C = ∩_{n≥0} T^{-n}B, then T^{-1}C = C and μ(C) = μ(B). The proofs of the other equivalences are left to the reader.

Remark I.2.2 In the particular case when T is invertible, that is, when T is one-to-one and for each measurable set C the set TC is measurable and has the same measure as C, the conditions for ergodicity can be expressed in terms of the action of T, rather than T^{-1}. In particular, an invertible T is ergodic if and only if any T-invariant set has measure 0 or 1. Also, note that if T is invertible then U_T is a unitary operator on L².

A stationary process is ergodic if the shift in the Kolmogorov representation is ergodic relative to the Kolmogorov measure. As will be shown in Section I.4, to say that a stationary process is ergodic is equivalent to saying that measures of cylinder sets can be determined by counting limiting relative frequencies along a sample path x, for almost every x. Thus the concept of ergodic process, which is natural from the transformation point of view, is equivalent to an important probability concept.

I.2.b Examples of ergodic processes.

Examples of ergodic processes include i.i.d. processes, irreducible Markov chains and functions thereof, stationary codings of ergodic processes, concatenated-block processes, and some, but not all, processes obtained from a block coding of an ergodic process by randomizing the start. These and other examples will now be discussed.

Example I.2.3 (The Baker's transformation.) A simple geometric example provides a transformation and partition for which the resulting process is the familiar coin-tossing process. Let X = [0, 1) × [0, 1) denote the unit square and define a transformation T by

    T(s, t) = (2s, t/2), if s < 1/2, and T(s, t) = (2s − 1, (t + 1)/2), if s ≥ 1/2.

The transformation T is called the Baker's transformation since its action can be described as follows. (See Figure I.2.4.)

1. Cut the unit square into two columns of equal width.
2. Squeeze each column down to height 1/2 and stretch it to width 1.
3. Place the right rectangle on top of the left to obtain a square.

Figure I.2.4 The Baker's transformation.

E. This is precisely the meaning of independence. is given by a generalized Baker's transformation. some useful partition notation and terminology will be developed. The join. In particular. v P v TT into exactly the same proportions as it partitions the entire space X. TP . symmetric. that is. t): s < 1/2}.2.5 The join of P. T -I P v P v T'P v T 2P T -1 P Figure 1.T 2P . defined by the first-order distribution j 1 (a).. P(k) of partitions is their common refinement. if [t(Pa n Qb) = 12 (Pa)11 (Qb). P1 = {(s. P v Q. described as follows. P1 has measure 1/2. Note that T 2P partitions each set of T -1 7. ii). i.) . together with the fact that each of the two sets P0 . The distribution of P is the probability distribution ftt(Pa). To assist in showing that the (T . Qb E Q} • The join AP(i) of a finite sequence P (1) . process ii. a E Al. that is.. TP.Va. for dyadic subsquares (which generate the Borel field) are mapped into rectangles of the same area. This.P)-process is.d. of course. T2P. t): s _?: 1/2}. for each n > 1. defined inductively by APO ) = (vrP (i) ) y p(k). process. the partition P v Q = {Pa n Qb: Pa E P .2. 2)-process is just the coin-tossing process. and 7— '1'.2. are For the Baker's transformation and partition (6).2. The sequence of partitions {P (i) : i > 1} is said to be independent if P (k+ i) and v1P (i) independent for each k? 1.i. Two partitions P and Q are independent if the distribution of the restriction of P to each Qb E Q does not depend on b. THE ERGODIC THEORY MODEL. a E A. that the (T. 17 The Baker's transformation T preserves Lebesgue measure in the square. PO by setting (6) Po = {(s. implies. of two partitions P and Q is their common refinement. In general. along with the join of these partitions. so that {T i P} is an independent sequence.d.SECTION I. To obtain the coin-tossing process define the two-set partition P = {Po. the i. .5 illustrates the partitions P. 1 0 P T -1 1)1 n Po n T Pi n T2 Po Ti' I I 1 0 1 0 T 2p 01101 I I . Figure 1.6. .i. indeed. (See Figure 1. the binary. Let P = {Pa : a E A} be a partition of the probability space (X. b. 7-1 P . P (2) . it can be shown that the partition TnP is independent of vg -1 PP .

In general, the i.i.d. process defined by an arbitrary first-order distribution μ_1(a), a ∈ A, is given by a generalized Baker's transformation, described as follows. (See Figure I.2.6.)

1. Cut the unit square into |A| columns, labeled by the letters of A, such that the column labeled a has width μ_1(a).
2. For each a ∈ A squeeze the column labeled a down and stretch it to obtain a rectangle of height μ_1(a) and width 1.
3. Stack the rectangles to obtain a square.

Figure I.2.6 The generalized Baker's transformation.

The corresponding Baker's transformation T preserves Lebesgue measure. The partition P into columns defined in part 1 of the definition of T then produces the desired i.i.d. process.

Example I.2.7 (Ergodicity for i.i.d. processes.) To show that a process is ergodic it is sometimes easier to verify a stronger property, called mixing. A transformation is mixing if

(7)    lim_{n→∞} μ(T^{-n}C ∩ D) = μ(C) μ(D),  C, D ∈ Σ.

A stationary process is mixing if the shift is mixing for its Kolmogorov measure. Mixing clearly implies ergodicity, for if T^{-1}C = C then T^{-n}C ∩ D = C ∩ D, for all sets D and positive integers n, so that the mixing property, (7), implies that μ(C ∩ D) = μ(C)μ(D). Since this holds for all sets D, the choice D = C gives μ(C) = μ(C)², and hence μ(C) is 0 or 1.

Suppose μ is a product measure on A^∞. To show that the mixing condition (7) holds for the shift, first note that it is enough to establish the condition for any two sets in a generating algebra, hence it is enough to show that it holds for any two cylinder sets. But this is easy, for if C = [a_1^m] and D = [b_1^m] and N > m, then T^{-N}C and D depend on values of x_i for indices i in disjoint sets of integers. Since the measure is product measure this means that

    μ(T^{-N}C ∩ D) = μ(T^{-N}C) μ(D) = μ(C) μ(D).

Thus, i.i.d. measures satisfy the mixing condition.

Example I.2.8 (Ergodicity for Markov chains.) The location of the zero entries, if any, in the transition matrix M determine whether a Markov process is ergodic. The stochastic matrix M is said to be irreducible, if for any pair i, j of states there is a sequence i_0, i_1, ..., i_n with i_0 = i and i_n = j such that M_{i_k i_{k+1}} > 0, 0 ≤ k ≤ n − 1. The results are summarized here; the reader may refer to standard probability books or to the extensive discussion in [29] for details.

This is just the assertion that for any pair i, j of states, if the chain is at state i at some time then there is a positive probability that it will be in state j at some later time.

An irreducible Markov chain is ergodic. To prove this it is enough to show that

(9)   lim_{N→∞} (1/N) Σ_{n=1}^{N} μ(T^{-n}C ∩ D) = μ(C)μ(D),

for any cylinder sets C and D. This condition, which is weaker than the mixing condition (7), implies ergodicity, for if T^{-1}C = C and D = C, then the limit formula (9) gives μ(C) = μ(C)^2, so that C must have measure 0 or 1.

If M is irreducible there is a unique probability vector π such that πM = π. In the irreducible case, each entry of π is positive and

(8)   lim_{N→∞} (1/N) Σ_{n=1}^{N} M^n = P,

where each row of the k × k limit matrix P is equal to π. In fact, the limit of the averages of powers can be shown to exist for an arbitrary finite stochastic matrix M. Furthermore, there is only one probability vector π such that πM = π, and the limit matrix P has all its rows equal to π. The condition πM = π shows that the Markov chain with start distribution μ_1 = π and transition matrix M is stationary.

To establish (9) for an irreducible chain, let D = [d_1^n] and C = [c_1^m] and write

T^{-k-n}C = [b_{n+k+1}^{n+k+m}], where b_{n+k+i} = c_i, 1 ≤ i ≤ m.

Thus when k > 0 the sets T^{-k-n}C and D depend on different coordinates. Furthermore,

(10)   μ([d_1^n] ∩ [b_{n+k+1}^{n+k+m}]) = π(d_1) (Π_{i=1}^{n-1} M_{d_i d_{i+1}}) V (Π_{i=n+k+1}^{n+k+m-1} M_{b_i b_{i+1}}),

where V is the sum, over all the ways to get from d_n to b_{n+k+1} in k steps, of the products of the corresponding transition probabilities. Summing over all these ways yields V = [M^k]_{d_n c_1}, the d_n c_1 term in the k-th power of M, which is the probability of transition from d_n to b_{n+k+1} = c_1 in k steps. The sequence [M^k]_{d_n c_1} converges in the sense of Cesàro to π(c_1), by (8), and hence

(1/N) Σ_{k=1}^{N} μ(T^{-k-n}C ∩ D) → μ(C)μ(D),

which establishes (9). This proves that irreducible Markov chains are ergodic.
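The Cesàro limit (8) is easy to see in a small example. The following sketch is illustrative only; the matrix is an arbitrary choice, deliberately made periodic so that M^n itself does not converge, while the averages of the powers do, with both rows approaching the unique probability vector π satisfying πM = π.

def mat_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# an irreducible stochastic matrix with period 2: M^n oscillates,
# but the Cesaro averages (1/N) sum_{n=1}^{N} M^n converge, as in (8)
M = [[0.0, 1.0],
     [1.0, 0.0]]

N = 1000
power = [row[:] for row in M]            # M^1
cesaro = [[0.0, 0.0], [0.0, 0.0]]
for n in range(1, N + 1):
    for i in range(2):
        for j in range(2):
            cesaro[i][j] += power[i][j] / N
    power = mat_mult(power, M)           # M^(n+1)

print([[round(v, 3) for v in row] for row in cesaro])
# both rows are near pi = (1/2, 1/2), the unique solution of pi M = pi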

Example 1. for suppose F: A z B z is a stationary encoder. "ergodic" for Markov chains is equivalent to the condition that some power of the transition matrix be positive. 2)-process can be ergodic even though T is not. The underlying . and hence that v is the Kolmogorov measure of an ergodic process. Feller's book. It is not important that the domain of F be the sequence space A z . o F -1 is the Kolmogorov measure of the encoded B-valued process. Remark 1.) A simple extension of the above argument shows that stationary coding also preserves the mixing property. so that. which means that transition from i to j in n steps occurs with positive probability. at least with the additional assumption being used in this book that every state has positive probability. if T is ergodic the (T.11 In some probability books. and hence the chain is irreducible. In summary.20 CHAPTER I. see Exercise 19. then . for example. the Cesaro limit theorem (8) can be strengthened to limN. assuming that all states have positive probability. Since v(C) = p.„ M N = P. however. A finite-state process is a stationary coding with window width 0 of a Markov chain. Proposition 1.11.u(B) = 1.u(T -nP n Pi ) > 0. The converse is also true.) Stationary coding preserves the ergodicity property.2. Likewise.u(B n Pi ) > 0. If some power of M is positive (that is. The argument used to prove (9) then shows that the shift T must be mixing.(F-1 C) it follows that v(C) is 0 or 1. if every state has positive probability and if Pi = {x: xi = j}. "ergodic" for Markov chains will mean merely irreducible with "mixing Markov" reserved for the additional property that some power of the transition matrix has all positive entries. a finite-state process for which the underlying Markov chain is mixing must itself be mixing. Thus if the chain is ergodic then . BASIC CONCEPTS. The converse is also true. (See Exercise 5. that the (T. A concatenated-block process is always ergodic. then the set B = T -1 Pi U T -2 PJ U has positive probability and satisfies T -1 B D B. any probability space will do.1.2. suppose it is the Kolmogorov measure of an ergodic A-valued process. Thus. for some n. . if the chain is ergodic then the finite-state process will also be ergodic.9 A stationary Markov chain is ergodic if and only if its transition matrix is irreducible.2. Thus. and suppose v = p. Proposition 1.u(F-1 C) is 0 or 1. If C is a shift-invariant subset of B z then F-1 C is a shift-invariant subset of Az so that . below. But if it(B n Pi ) > 0. Indeed. since it can be represented as a stationary coding of an irreducible Markov chain. In this case.12 (Codings and ergodicity. This follows easily from the definitions.10 A stationary Markov chain is mixing if and only if some power of its transition matrix has all positive entries. see Example 1. [11]. It should be noted. has all positive entries) then M is certainly irreducible. 2)-process is ergodic for any finite partition P. To be consistent with the general concept of ergodic process as used in this book. again.2. for any state i.

1). even if the original process was ergodic. for it has a periodic structure due to the blocking.SECTION 1. concatenated-block processes are not mixing except in special cases.12. p. depending on whether Tn .) Let a be a fixed real number and let T be defined on X = [0... The mapping T is called translation by a. see Example 1. for all x. it follows that a stationary coding of a totally ergodic process must be totally ergodic.10. As noted in Section 1.) Since the condition that F(TA X) = TB F (x). Since this periodic structure is inherited. such processes are called translation processes. hence the word "interval" can be used for subsets of X that are connected or whose complements are connected. implies that F (Ti x) = T: F(x). THE ERGODIC THEORY MODEL. 21 Markov chain is not generally mixing.. A subinterval of the circle corresponds to a subinterval or the complement of a subinterval in X. since X can be thought of as the unit circle by identifying x with the angle 27rx. of the interval gives rise to a process.2. however. . by m ro 1 L o j' = rl 1] i• Let y be the encoding of p.1. The proof of this is left to the reader. (or rotation processes if the circle representation is used. (See Example 1. and 0101 . E. If T is the shift on a sequence space then T is totally ergodic if and only if for each N. 1/2). where ED indicates addition modulo 1. however.13 (Rotation processes. In particular. (The measure-preserving property is often established by proving.. The (T.. which correspond to the upper and lower halves of the circle in the circle representation. The transformation T is one-to-one and maps intervals onto intervals of the same length. hence preserves Lebesgue measure p.2.) As an example. defined by the 2-block code C(01) = 00. Example 1. P 1 = [1/2. • • • 9 Xn is ergodic.1. The following proposition is basic. the nonoverlapping N-block process defined by Zn = (X (n-1)N-1-11 X (n-1)N+2. The value X(x) = Xp(Tn -1 x) is then 0 or 1... A transformation T on a probability space (X... The measure -17 obtained by randomizing the start is. so that translation by a becomes rotation by 27ra. so that y is concentrated on the two sequences 000.. and 111 . let P consist of the two intervals P0 = [0..) Any partition 7. For example. respectively. 1) by the formula Tx = x ED a.lx = x ED (n — 1)a belongs to P0 or P1 .1. that is. Pick a point x at random in the unit interval according to the uniform (Lebesgue) measure. N-block codings destroy stationarity. p is the stationary Markov measure with transition matrix M and start distribution 7r given. in this case. P)-process {X n } is then described as follows. A condition insuring ergodicity of the process obtained by N-block coding and randomizing the start is that the original process be ergodic relative to the N-shift T N . that it holds on a family of sets that generates the a-algebra. the same as y. for all x and all N. as in this case. C(10) = 11. The final randomized-start process may not be ergodic. applying a block code to a totally ergodic process and randomizing the start produces an ergodic process. hence is not an ergodic process. It is also called rotation. up to a shift. but a stationary process can be constructed from the encoded process by randomizing the start. let p. give measure 1/2 to each of the two sequences 1010.) is said to be totally ergodic if every power TN is ergodic.

Proposition 1.2.14 T is ergodic if and only if α is irrational.

Proof It is left to the reader to show that T is not ergodic if α is rational. Assume that α is irrational. Two proofs of ergodicity will be given. The first proof is based on the following classical result of Kronecker.

Proposition 1.2.15 (Kronecker's theorem.) If α is irrational the forward orbit {T^n x: n ≥ 1} is dense in X, for each x.

Proof To establish Kronecker's theorem let F be the closure of the forward orbit {T^n x: n ≥ 1}, and suppose that F is not dense in X. Then there is an interval I of positive length which is maximal with respect to the property of not meeting F. Since α is irrational, T^n I cannot be the same as I, and hence the maximality of I implies that T^n I ∩ I = ∅, for each positive integer n. It follows that {T^n I} is a disjoint sequence and

1 = μ([0, 1)) ≥ Σ_{n=1}^{∞} μ(T^n I) = ∞,

which is a contradiction. This proves Kronecker's theorem.

Proof of Proposition 1.2.14 continued. To proceed with the proof that T is ergodic if α is irrational, suppose A is an invariant set of positive measure. Given ε > 0 choose an interval I such that μ(I) < ε and μ(A ∩ I) > (1 − ε)μ(I). Kronecker's theorem, applied to the end points of I, produces integers n_1 < n_2 < ··· < n_k such that {T^{n_i} I} is a disjoint sequence and Σ_{i=1}^{k} μ(T^{n_i} I) > 1 − 2ε. The assumption that μ(A ∩ I) > (1 − ε)μ(I) gives

μ(A) ≥ Σ_{i=1}^{k} μ(A ∩ T^{n_i} I) ≥ (1 − ε) Σ_{i=1}^{k} μ(T^{n_i} I) ≥ (1 − ε)(1 − 2ε).

Since ε is arbitrary, μ(A) = 1, which completes the proof of Proposition 1.2.14.

The preceding proof is typical of proofs of ergodicity for many transformations defined on the unit interval. One first shows that orbits are dense, then that small disjoint intervals can be placed around each point in the orbit. Such a technique does not work in all cases, but does establish ergodicity for many transformations of interest. A generalization of this idea will be used in Section 1.10 to establish ergodicity of a large class of transformations.
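A quick numerical illustration of these facts, with arbitrary choices of α, starting point, and test interval: the orbit of an irrational rotation visits a subinterval with limiting frequency equal to its length, which is what ergodicity predicts.

import math

alpha = math.sqrt(2) - 1          # an irrational rotation amount
x = 0.123                         # arbitrary starting point
interval = (0.2, 0.5)             # arbitrary test interval

hits, N = 0, 100000
for n in range(N):
    x = (x + alpha) % 1.0
    if interval[0] <= x < interval[1]:
        hits += 1

print("fraction of orbit in [0.2, 0.5):", hits / N)   # near 0.3, the interval's length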

An alternate proof of ergodicity can be obtained using Fourier series. Suppose f is square integrable and has Fourier series Σ_n a_n e^{2πinx}. The Fourier series of g(x) = f(Tx) is Σ_n a_n e^{2πin(x+α)} = Σ_n a_n e^{2πinα} e^{2πinx}. If g(x) = f(x), then a_n = a_n e^{2πinα} holds for each integer n. If α is irrational then e^{2πinα} ≠ 1 unless n = 0, so that a_n = 0 unless n = 0, which implies that the function f is constant. Thus, in the irrational case, the rotation T has no invariant functions and hence is ergodic.

In general, translation of a compact group will be ergodic with respect to Haar measure if orbits are dense. For example, consider the torus T = [0, 1) × [0, 1), the product of two intervals, with addition mod 1, (or, alternatively, the product of two circles). The proof of the following is left to the reader.

Proposition 1.2.16 The following are equivalent for the mapping T: (x, y) ↦ (x ⊕ α, y ⊕ β):

(i) {T^n(x, y)} is dense for any pair (x, y).
(ii) T is ergodic.
(iii) α and β are rationally independent.

I.2.c The return-time picture.

A general return-time concept has been developed in ergodic theory, which, along with a suggestive picture, provides an alternative and often simpler view of two standard probability models, the renewal processes and the regenerative processes.

Let T be a measure-preserving transformation on the probability space (X, Σ, μ) and let B be a measurable subset of X of positive measure. To simplify further discussion it will be assumed in the remainder of this subsection that T is invertible. (Some extensions to the noninvertible case are outlined in the exercises.) If x ∈ B, let n(x) be the least positive integer n such that T^n x ∈ B, that is, n(x) is the time of first return to B.

Theorem 1.2.17 (The Poincaré recurrence theorem.) Return to B is almost certain, that is, n(x) < ∞, almost surely.

Proof To prove this, define

B_∞ = {x ∈ B: T^n x ∉ B, ∀n > 0}.

If x ∈ B_∞ then T^n x ∉ B_∞, for all n ≥ 1, from which it follows that the sequence of sets {T^n B_∞} is disjoint. Since they all have the same measure as B_∞ and μ(X) = 1, it follows that μ(B_∞) = 0. This proves Theorem 1.2.17.

Furthermore, for each n, define the sets

(11)   B_n = {x ∈ B: n(x) = n},

that is, B_1 = B ∩ T^{-1}B and B_n = B ∩ T^{-1}(X − B) ∩ ··· ∩ T^{-(n-1)}(X − B) ∩ T^{-n}B, n ≥ 2. These sets are measurable, are clearly disjoint, and have union B. Furthermore, for each n, the sets B_n, T B_n, ..., T^{n-1}B_n are disjoint and have the same measure, and T^n B_n ⊆ B; applying T to the last set T^{n-1}B_n returns it to B.

The return-time picture is obtained by representing the sets (11) as intervals, one above the other. Furthermore, by reassembling within each level it can be assumed that T just moves points directly upwards one level. Points in the top level of each column are moved by T to the base B in some manner unspecified by the picture. (See Figure 1.2.18.)

[Figure: columns of heights 1, 2, 3, 4 over the bases B_1, B_2, B_3, B_4, with T moving points up one level at a time.]

Figure 1.2.18 Return-time picture.
Note that the set ∪_{i=1}^{n} T^{i-1}B_n is just the union of the column whose base is B_n, so that the picture is a representation of the set

(12)   ∪_{n=0}^{∞} T^n B.

This set is T-invariant and has positive measure. If T is ergodic then it must have measure 1, in which case the picture represents the whole space, modulo a null set, of course. The picture suggests the following terminology. The ordered set

B_n, T B_n, ..., T^{n-1} B_n

is called the column C = C_n with base B_n, width w(C) = μ(B_n), height h(C) = n, levels L_i = T^{i-1}B_n, 1 ≤ i ≤ n, and top T^{n-1}B_n. Note that the measure of column C is just its width times its height, that is, μ(C) = h(C)w(C). Various quantities of interest can be easily expressed in terms of the return-time picture. For example, the return-time distribution is given by

Prob(n(x) = n | x ∈ B) = w(C_n) / Σ_m w(C_m),

and the expected return time is given by

(13)   E(n(x) | x ∈ B) = Σ_n h(C_n) Prob(n(x) = n | x ∈ B) = Σ_n h(C_n)w(C_n) / Σ_m w(C_m).

In the ergodic case the latter takes the form

(14)   E(n(x) | x ∈ B) = 1 / Σ_m w(C_m),

since Σ_n h(C_n)w(C_n) is the measure of the set (12), which equals 1, when T is ergodic. This formula is due to Kac, [23].
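Kac's formula is easy to test by simulation. The following sketch is an illustration only; the two-state Markov chain and its transition probabilities are arbitrary choices. Taking B to be the event that the current state is 0, the average gap between visits to state 0 along a long sample path should be close to 1/μ(B) = 1/π_0.

import random

random.seed(2)
p01, p10 = 0.2, 0.5                 # transition probabilities of a two-state chain
pi0 = p10 / (p01 + p10)             # stationary probability of state 0

# simulate the chain and record the gaps between successive visits to state 0,
# i.e. the return times to B = {x: current state is 0}
state, last_visit, gaps = 0, 0, []
for t in range(1, 500000):
    r = random.random()
    state = (1 if r < p01 else 0) if state == 0 else (0 if r < p10 else 1)
    if state == 0:
        gaps.append(t - last_visit)
        last_visit = t

print("average return time:", round(sum(gaps) / len(gaps), 3))
print("Kac's formula 1/mu(B):", round(1 / pi0, 3))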

The transformation T̂ = T_B defined on B by the formula T̂x = T^{n(x)}x is called the transformation induced by T on the subset B. The basic theorem about induced transformations is due to Kakutani.


Theorem 1.2.19 (The induced transformation theorem.) If μ(B) > 0 the induced transformation preserves the conditional measure μ(·|B) and is ergodic if T is ergodic.

Proof If C ⊆ B, then x ∈ T̂^{-1}C if and only if there is an n ≥ 1 such that x ∈ B_n and T^n x ∈ C. This translates into the equation

T̂^{-1}C = ∪_{n=1}^{∞} (B_n ∩ T^{-n}C),

which shows that T̂^{-1}C is measurable for any measurable C ⊆ B, and also that

μ(T̂^{-1}C) = Σ_{n=1}^{∞} μ(B_n ∩ T^{-n}C),

since B_n ∩ B_m = ∅, if m ≠ n, with m, n ≥ 1. More is true, namely,

(15)   T^n B_n ∩ T^m B_m = ∅,   m, n ≥ 1, m ≠ n,

by the definition of "first return" and the assumption that T is invertible. But this implies

Σ_{n=1}^{∞} μ(T^n B_n) = Σ_{n=1}^{∞} μ(B_n) = μ(B),

which, together with (15), yields

Σ_n μ(B_n ∩ T^{-n}C) = Σ_n μ(T^n B_n ∩ C) = μ(C).

Thus the induced transformation T̂ preserves the conditional measure μ(·|B).

The induced transformation is ergodic if the original transformation is ergodic. The picture in Figure 1.2.18 shows why this is so. If T̂^{-1}C = C, then each C ∩ B_n can be pushed upwards along its column to obtain the set

D = ∪_{n=1}^{∞} ∪_{i=0}^{n-1} T^i(C ∩ B_n).

But μ(T D Δ D) = 0 and T is invertible, so the measure of D must be 0 or 1, which, in turn, implies that μ(C|B) is 0 or 1, since μ((D ∩ B) Δ C) = 0. This completes the proof of the induced-transformation theorem. □
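As a concrete, purely illustrative instance of the theorem, the sketch below induces an irrational rotation on the arbitrary base B = [0, 1/4): the induced map iterates T until the orbit returns to B, and uniformly distributed points of B are carried to points that again look uniformly distributed on B, as preservation of the conditional measure requires.

import math, random

alpha = math.sqrt(2) - 1
B = (0.0, 0.25)                     # the base set, an arbitrary subinterval

def T(x):
    return (x + alpha) % 1.0

def induced(x):
    # T_B x = T^{n(x)} x, where n(x) is the first return time to B
    y = T(x)
    while not (B[0] <= y < B[1]):
        y = T(y)
    return y

random.seed(3)
points = [random.uniform(*B) for _ in range(100000)]
images = [induced(x) for x in points]

# if T_B preserves the normalized Lebesgue measure on B, the image points
# should still look uniform on B; compare frequencies in a test sub-interval
test = lambda xs: sum(B[0] <= x < 0.1 for x in xs) / len(xs)
print("fraction in [0, 0.1) before:", round(test(points), 3),
      "after:", round(test(images), 3))    # both near 0.4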

1.2.c.1 Processes associated with the return-time picture.
Several processes of interest are associated with the induced transformation and the return-time picture. It will be assumed throughout this discussion that T is an invertible, ergodic transformation on the probability space (X, Σ, μ), partitioned into a finite partition P = {P_a: a ∈ A}; that B is a set of positive measure; that n(x) = min{n: T^n x ∈ B}, x ∈ B, is the return-time function; and that T̂ is the transformation induced on B by T. Two simple processes are connected to returns to B. The first of these is {R_n}, the (T, B)-process defined by the partition B = {B, X − B}, with B labeled by 1 and X − B labeled by 0, in other words, the binary process defined by

R_n(x) = χ_B(T^{n-1}x),   n ≥ 1,


where χ_B denotes the indicator function of B. The (T, B)-process is called the generalized renewal process defined by T and B. This terminology comes from the classical definition of a (stationary) renewal process as a binary process in which the times between occurrences of 1's are independent and identically distributed, with a finite expectation, the only difference being that now these times are not required to be i.i.d., but only stationary with finite expected return-time. The process {R_n} is a stationary coding of the (T, P)-process with time-zero encoder χ_B, hence it is ergodic. Any ergodic, binary process which is not the all-0 process is, in fact, the generalized renewal process defined by some transformation T and set B; see Exercise 20. The second process connected to returns to B is {R̂_j}, the (T̂, B̂)-process defined by the (countable) partition

{B_1, B_2, ···},
of B, where B_n is the set of points in B whose first return to B occurs at time n. The process {R̂_j} is called the return-time process defined by T and B. It takes its values in the positive integers, and has finite expectation given by (14). Also, it is ergodic, since T̂ is ergodic. Later, it will be shown that any ergodic positive-integer valued process with finite expectation is the return-time process for some transformation and partition; see Theorem 1.2.24. The generalized renewal process {R_n} and the return-time process {R̂_j} are connected, for conditioned on starting in B, the times between successive occurrences of 1's in {R_n} are distributed according to the return-time process. In other words, if R_1 = 1 and the sequence {S_j} of successive returns to 1 is defined inductively by setting S_0 = 1 and

S_j = min{n > S_{j-1}: R_n = 1},   j ≥ 1,

then the sequence of random variables defined by

(16)   R̂_j = S_j − S_{j-1},   j ≥ 1,

has the distribution of the return-time process. Thus, the return-time process is a function of the generalized renewal process. Except for a random shift, the generalized renewal process {R_n} is a function of the return-time process, for given a random sample path R̂ = {R̂_j} for the return-time process, a random sample path R = {R_n} for the generalized renewal process can be expressed as a concatenation

(17)   R = ŵ(1)w(2)w(3) ···

of blocks, where, for m > 1, w(m) = 0^{R̂_m - 1}1, that is, a block of 0's of length R̂_m − 1, followed by a 1. The initial block ŵ(1) is a tail of the block w(1) = 0^{R̂_1 - 1}1. The only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the waiting time τ until the first 1 occurs. The start position problem is solved by the return-time picture and the definition of the (T, P)-process. The (T, P)-process is defined by selecting x ∈ X at random according to the μ-measure, then setting X_n(x) = a if and only if T^{n-1}x ∈ P_a. In the ergodic case, the return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column C at random according to column measure μ(C),


then selecting j at random according to the uniform distribution on {1, 2, ..., h(C)}, then selecting x at random in the j-th level of C. In summary,

Theorem 1.2.20 The generalized renewal process {R_n} and the return-time process {R̂_j} are connected via the successive-returns formula, (16), and the concatenation representation, (17), with initial block ŵ(1) = 0^{τ-1}1, where τ = R̂_1 − j + 1 and j is uniformly distributed on [1, R̂_1].
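The passage between the two processes in (16) and (17) is purely mechanical. The sketch below is illustrative only: the return times are drawn i.i.d. just to keep it short, and the random initial tail ŵ(1) required for stationarity is ignored by starting at a block boundary. It concatenates the blocks 0^{R̂_m - 1}1 and then recovers the return times from the positions of the 1's.

import random

random.seed(4)
# a sample path of a positive-integer valued (return-time) process
rhat = [random.randint(1, 4) for _ in range(10)]

# (17): each return time r contributes the block 0^(r-1) 1
R = []
for r in rhat:
    R.extend([0] * (r - 1) + [1])

# (16): recover the return times as the gaps between successive 1's
ones = [i for i, b in enumerate(R) if b == 1]
recovered = [ones[0] + 1] + [b - a for a, b in zip(ones, ones[1:])]

print(rhat)
print(R)
print(recovered)     # equals rhat, since the path starts at a block boundary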
The distribution of τ can be found by noting that

τ(x) = i if and only if x ∈ T^{n-i}B_n for some n ≥ i,

from which it follows that

Prob(τ = i) = Σ_{n≥i} μ(T^{n-i}B_n) = Σ_{n≥i} w(C_n).

The latter sum, however, is equal to Prob(x ∈ B and n(x) ≥ i), which gives the alternative form

(18)   Prob(τ = i) = Prob(n(x) ≥ i | x ∈ B) / E(n(x) | x ∈ B),

since E(n(x) | x ∈ B) = 1/μ(B) holds for ergodic processes, by (13). The formula (18) was derived in [11, Section XIII.5] for renewal processes by using generating functions. As the preceding argument shows, it is a very general and quite simple result about ergodic binary processes with finite expected return time.

The return-time process keeps track of returns to the base, but may lose information about what is happening outside B. Another process, called the induced process, carries along such information. Let A* be the set of all finite sequences drawn from A and define the countable partition P̂ = {P̂_w: w ∈ A*} of B, where, for w = a_1^k,

P̂_w = {x ∈ B_k: T^{j-1}x ∈ P_{a_j}, 1 ≤ j ≤ k}.

The (T̂, P̂)-process is called the process induced by the (T, P)-process on the set B. The relation between the (T, P)-process {X_n} and its induced (T̂, P̂)-process {X̂_m} parallels the relationship (17) between the generalized renewal process and the return-time process. Thus, given a random sample path X̂ = {X̂_m} for the induced process, a random sample path x = {X_n} for the (T, P)-process can be expressed as a concatenation

(19)   x = ŵ(1)w(2)w(3) ···,

into blocks, where, for m > 1, w(m) = X̂_m, and the initial block ŵ(1) is a tail of the block w(1) = X̂_1. Again, the only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the length of the tail ŵ(1). A generalization of the return-time picture provides the answer. The generalized version of the return-time picture is obtained by further partitioning the columns of the return-time picture, Figure 1.2.18, into subcolumns according to the conditional distribution of (T, P)-names. Thus the column with base B_k is partitioned


into subcolumns C_w, labeled by k-length names w, so that the subcolumn corresponding to w = a_1^k has width

μ(B_k ∩ ∩_{i=1}^{k} T^{-i+1}P_{a_i}),

the i-th level being labeled by a_i. Furthermore, by reassembling within each level of a subcolumn, it can again be assumed that T moves points directly upwards one level. Points in the top level of each subcolumn are moved by T to the base B in some manner unspecified by the picture. For example, in Figure 1.2.21, the fact that the third subcolumn over B_3, which has levels labeled 1, 1, 0, is twice the width of the second subcolumn, indicates that given that a return occurs in three steps, it is twice as likely to visit 1, 1, 0 as it is to visit 1, 0, 0 in its next three steps.
Figure 1.2.21 The generalized return-time picture.

The start position problem is solved by the generalized return-time picture and the definition of the (T, P)-process. The (T, P)-process is defined by selecting x ∈ X at random according to the μ-measure, then setting X_n(x) = a if and only if T^{n-1}x ∈ P_a. In the ergodic case, the generalized return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column C_w at random according to column measure μ(C_w), then selecting j at random according to the uniform distribution on {1, 2, ..., h(w)}, where h(w) = h(C_w), then selecting x at random in the j-th level of C_w. This proves the following theorem.
Theorem 1.2.22 The process {X_n} has the stationary concatenation representation (19) in terms of the induced process {X̂_m}, if and only if the initial block is ŵ(1) = w(1)_j^{h(w(1))}, the tail of w(1) starting at position j, where j is uniformly distributed on [1, h(w(1))].

A special case of the induced process occurs when B is one of the sets of P, say B = P_b. In this case, the induced process outputs the blocks that occur between successive occurrences of the symbol b. In the general case, however, knowledge of where the blocking occurs may require knowledge of the entire past and future, or may even be fully hidden.
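For the special case B = P_b the parsing is easy to carry out on a sample path. The sketch below is illustrative only (the ternary path is an arbitrary random choice): each block starts at an occurrence of b and its length is the return time to B, while the segment before the first b plays the role of the initial tail ŵ(1) in (19).

import random

random.seed(5)
path = "".join(random.choices("abc", weights=[2, 1, 1], k=40))
b = "a"

# each block starts at an occurrence of b and runs up to (not including) the
# next b; the final, possibly incomplete, block is dropped
starts = [i for i, s in enumerate(path) if s == b]
blocks = [path[i:j] for i, j in zip(starts, starts[1:])]

print(path)
print("initial tail:", path[:starts[0]])
print("blocks:", blocks)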

I.2.c.2 The tower construction.
The return-time construction has an inverse construction, called the tower construction, described as follows. Let T be a measure-preserving mapping on the probability

A useful picture is obtained by defining Bn = {x: f (x) = n}. n > 1. 0 . E (f ) n=1 i=1 A transformation T is obtained by mapping points upwards. B x {2}.23 The tower transformation. by thinking of each set in (20) as a copy of Bn with measure A(B). The tower extension T induces the original transformation T on the base B = X x {0}.2. i) = + 1) i < f (x) i = f (x). i < n. stacking the sets Bn x {1}.. and for each n > 1.).be the subset of the product space X x Ar Ar = { 1. Let k defined by = {(x . p.2. E.0 . 0) The transformation T is easily shown to preserve the measure Ti and to be invertible (and ergodic) if T is invertible (and ergodic). (T x .23.. i) E T8} n Bn E E. of finite expected value.2. as shown in Figure 1. The measure it is extended to a measure on X. 1 < i f (x)}. with its measure given by ‘ Tt(k) = 2-.u({x: . and let f be a measurable mapping from X into the positive integers . .}. n=1 the expected value of f. then normalizing to get a probability measure. In other words. . The formal definition is T (x. Bn x {n} (20) as a column in order of increasing i. 29 space (X. E 131 ) n 13. As an application of the tower construction it will be shown that return-time processes are really just stationary positive-integer valued processes with finite expected value. Tx B3 X {2} B2 X 111 B3 X Bi X {0} B2 X {0} B3 X {0} Figure 1. with points in the top of each column sent back to the base according to T.SECTION 1. 2. the normalizing factor being n ( B n ) = E(f). B C X is defined to be measurable if and only if (x. THE ERGODIC THEORY MODEL. The transformation T is called the tower extension of T by the (height) function f. This and other ways in which inducing and tower extensions are inverse operations are explored in the exercises.
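The tower construction described in this subsection is easy to animate. The sketch below is illustrative only; the base rotation and the height function are arbitrary choices. It applies the tower map, moving up one level at a time and returning to the base by applying T at the top, and the successive visits to the base X × {1} trace out the orbit of the base transformation, in line with the statement that the tower extension induces T on its base.

import math

alpha = math.sqrt(2) - 1

def T(x):                       # base transformation: an irrational rotation
    return (x + alpha) % 1.0

def f(x):                       # a height function with finite expectation
    return 1 if x < 0.5 else 3

def tower(point):
    # T-hat(x, i) = (x, i+1) below the top of the column, (Tx, 1) at the top
    x, i = point
    if i < f(x):
        return (x, i + 1)
    return (T(x), 1)

p = (0.2, 1)
base_visits = []
for _ in range(20):
    p = tower(p)
    if p[1] == 1:
        base_visits.append(round(p[0], 4))
print(base_visits)   # successive points T(0.2), T^2(0.2), ... of the base orbit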

then {Yn } is an instantaneous function of {Xn}.} is regenerative if there is a renewal process {R.2. P)-process can be mixing even if T is only ergodic. BASIC CONCEPTS.n ): n < t}. P-IP)-process. onto the (two-sided) Kolmogorov measure y of the (T. Let T be a measure-preserving transformation.2. P)-process. Let T be an invertible measure-preserving transformation and P = {Pa : a E Al a finite partition of the probability space (X. given that Rt = 1.2. such that F(Tx) = TAF(x). Proof Let p.) Theorem 1.) 4.2. 7.) 3. 1) returns to B = X x L1} at the point (T x . = f (X n ).} such that the future joint process {(Xn. . N}.25 An ergodic process is regenerative if and only if it is an instantaneous function of an induced process whose return-time process is I. where TA is the shift on A z . In particular. and let Yn = X(n-1)N+1 be the N-th term process defined by the (T. a partition gives rise to a stationary coding. /2). a stationary coding gives rise to a partition. P)-process can be ergodic even if T is not. Determine a partition P of A z such that y is the Kolmogorov measure of the (TA. P)-process {Xn}. for all n. be the two-sided Kolmogorov measure on the space . (Thus. This proves the theorem. A point (x. Let T be the tower extension of T by the function f and let B = x x {1}. Xn+19 • • • Xn+N-1) be the overlapping N-block process. The induced transformation and the tower construction can be used to prove the following theorem. the return-time process defined by T and B has the same distribution as {k1 }. (Hint: consider g(x) = min{ f (x). Show that there is a measurable mapping F: X 1-± A z which carries p. 1) at time f (x). Theorem 1. that is. Show that the (T. P)-process. Show that the (T. (Thus. Complete the proof of Lemma 1. see Exercise 24. let Wn = (Xn. if f(Tx) < f (x). (Recall that if Y. Show that a T-subinvariant function is almost-surely invariant. A stationary process {X. 1. Suppose F: A z F--> Bz is a stationary coding which carries p. .) 5.24 An ergodic.) 6. A nonnegative measurable function f is said to be T-subinvariant. R. positive-integer valued process {kJ} with E(R 1 ) < oc is the return-time process for some transformation and set.30 CHAPTER I. and note that the shift T on Aîz is invertible and ergodic and the function f (x) =X has finite expectation.1. (a) Show that {Zn } is the (TN. Let Zn = ( X (n -1 )N+19 41-1)N+2. then use Exercise 2. Ra ):n > t} is independent of the past joint process {(X. onto v.Ar z of doubly-infinite positive-integer valued sequences. XnN) be the nonoverlapping N-block process. binary process via the concatenation-representation (17). n . with the distribution of the waiting time r until the first I occurs given by (18). (Hint: start with a suitable nonergodic Markov chain and lump states to produce an ergodic chain.d Exercises. fkil defines a stationary. 2. E.

y) in terms of the coordinates of x and y. Show that an ergodic rotation is not mixing.2. y) = (Tx. 1). N}. Fix a measure p. i 1). j). then S is called the direct product of T and R. 0) with probability p ite(br ). (i) If i < N. 13. (c) Show that {Y } is the (T N . (Hint: the transition matrix is irreducible. (a) Show that there is indeed a unique stationary Markov chain with these properties. If T.u. 12.u(br). 2. = a if I'm = (ar .(1) = 1/2. i) can only go to (ar . x and define S(x. that if a is irrational then the direct product (x.. Q)-name of (x. Show that the overlapping n-blocking of an ergodic process is ergodic. 14. j > 0. if Y„. T ry) is measurable. 'P)-process.) and {Tx : x c X) is a family of measure-preserving transformations on a probability space (Y. defined by the following transition rules. . 31 8. for each br E AN. (ii) (ar. 11. for every x. Show directly. 1} Z . . y ± a) is not ergodic. N) goes to (br .u x . y) 1-± (x a. . y).n } be the stationary Markov chain with alphabet S = A N x {0. (The (S. Generalize the preceding exercise by showing that if T is a measure-preserving transformation on a probability space (X. 17. such that the mapping (x. 16. N) goes to (br . (iii) (ar . Show that if T is mixing then so is the direct product T x T: X x X where X x X is given the measure . 0) and Ym = ai .(-1) = p. but is totally ergodic. 1) with probability (1 — p). S is called a skew product. (ar . Let T be the shift on X = {-1. 15. x y). } is the (T. y) i-± (Tx. Let {Z. = (ar . then S(x .) (b) Show that the process {} defined by setting 4. without using Proposition 6. (b) Let P be the time-zero partition of X and let Q = P x P. THE ERGODIC THEORY MODEL.. y) = (Tx. (b) Show that {147.") 9. v7P -1 P)-process. Show that a concatenated-block process is ergodic. Show that the nonoverlapping n-blocking of an ergodic process may fail to be ergodic. Txoy). is a mixing finite-state process.u be the product measure on X defined by p. Give a formula expressing the (S. Show that if T is mixing then it is totally ergodic. X x X. Let Y=Xx X. on AN. p. = R. Try) is a measure-preserving mapping on the product space (X x Y. and number p E (0. Insertion of random spacers between blocks can be used to change a concatenatedblock process into a mixing process.. with the product measure y = p. 10. that is. and let . for each br E AN. (a) Show that S preserves v.SECTION 1. a symbol a E A N . Q)-process is called "random walk with random scenery. 1.

) 20. and let S = T be the tower transformation. (Hint: obtain a picture like Figure 1. Prove: a tower i'' over T induces on the base a transformation isomorphic to T. (Hint: let T be the shift in the two-sided Kolmogorov representation and take B = Ix: xo = 1 1. Let X be the tower over X defined by f. such that the (T. (b) Show that if P and 2 are 6-independent then Ea iii(Pa I Qb) — 11 (Pa)i < Ag. Show that if {X. then P and 2 are 3E-independent. Show how to extend P to a partition 2 of X. ergodic process which is not identically 0. except for a set of Qb's of total measure at most 6.25.g.?. 22. BASIC CONCEPTS. (c) Show that if Ea lia(PaiQb) — ii(Pa)i < E. P)-process is an instantaneous coding of the induced (.18 for T -1 . Prove that even if T is not invertible. 25. 18. Let T be an invertible measure-preserving transformation and P = {Pa : a E A} a finite partition of the probability space (X.2. . and is ergodic if T is ergodic.2.) 19._ E. except for a set of Qb's of total measure at most . then it is the generalized renewal process for some transformation T and set B. preserves the conditional measure.(Qb)1 . 24. E BI to be 6-independent if (a) Show that p and 2 are independent if and only if they are 6-independent for each 6 > 0.} is a binary.i F -1 D. the induced transformation is measurable. then use this to guide a proof.32 CHAPTER I.b lit(Pa n Qb) — it(Pa)P. Prove Theorem 1. Prove that a stationary coding of a mixing process is mixing. E. (Hint: use the formula F-1 (C n T ' D) = F-1 C n TA. Define P = {Pa : a E AI and 2 = {Qb: b Ea. d)-process. ? I' and return-time 23. pt).) 21. Show that the tower defined by the induced transformation function n(x) is isomorphic to T.

) A strong cover C has a packing property. however..3 The ergodic theorem. The proof of the ergodic theorem will be based on a rather simple combinatorial result discussed in the next subsection. m(n)].) If {X n } is a binary ergodic process then the average (X 1 +X2+.d. A general technique for extracting "almost packings" of integer intervals from certain kinds of "coverings" of the natural numbers will be described in this subsection. then a positive and useful result is possible. n E H. A strong cover C of Af is defined by an integer-valued function n m(n) for which m(n) > n. The ergodic theorem extends the strong law of large numbers from i.i. K]. In particular.3.) If T is a measure-preserving transformation on a probability space (X. for. to pack an initial segment [1. if f is taken to be 0. since f1 n f(T i-l x) d. p.2 (Ergodic theorem: binary ergodic form.3. The L'-convergence implies that f f* d 1a = f f d ia. . i > 1.E f f(T i-l x)d.u = f f -E n n i=1 by the measure-preserving property of T.3. .1 (The ergodic theorem. D.u. must be almost-surely equal to the constant value f f d. namely. The combinatorial idea is not a merely step on the way to the ergodic theorem.1-valued and T to be ergodic the following form of the ergodic theorem is obtained.. K] by disjoint subcollections of a given strong cover C of the natural numbers. and consists of all intervals of the form [n. there be a disjoint subcollection that fills most of [1. The finite problem has a different character. The ergodic theorem in the almost-sure form presented here is due to G. E. however. Birkhoff and is often called Birkhoff's ergodic theorem or the individual ergodic theorem. just set C' = {[n. THE ERGODIC THEOREM. m] = (j E n < j < m).This binary version of the ergodic theorem is sufficient for many results in this book. f (T i-1 x) converges almost surely and in L 1 -norm to a T -invariant function f*(x).}. unless the function m(n) is severely restricted.a Packings from coverings. (The word "strong" is used to indicate that every natural number is required to be the left endpoint of a member of the cover. where n1 = 1 and n i±i = 1 + m(n). since it is T-invariant.. Theorem 1. but the more general version will also be needed and is quite useful in many situations not treated in this book. it may not be possible. This is a trivial observation. m(n)]}. Theorem 1. of the form [n. 2.) and if f is integrable then the average (1/n) EL..3. Thus if T is ergodic then the limit function f*. there is a subcover C' whose members are disjoint. In this discussion intervals are subsets of the natural numbers Ar = {1. . processes to the general class of stationary processes. even asymptotically. I..u = 1. If it is only required. for it is an important tool in its own right and will be used frequently in later parts of this book.+Xn )In converges almost surely to the constant value E(X 1 ).SECTION 1. 33 Section 1.

The claim is that C' = {[n i . The (L. K] is called a (1 —6)-packing of [1.m(n i )]: 1 < i < I} is a (1 — 26)-packing of [1. (2) If j E [I.m(n i )] has length at least (1 — 25)K. and apply a sequential greedy algorithm. and are contained in [1. The interval [1. K] if the intervals in C' are disjoint and their union has cardinality at least (1 — . The construction stops after I = I (C. so that I(K —L. if 6 > 0 is given and K > LIS then all but at most a 6-fraction is covered. K — L] for which m(j) — j + 1 < L. say L. For the interval [1.1 of the [ni . by induction. The intervals are disjoint. If K > LIS and if [1. BASIC CONCEPTS.(i +1)L]: 0 < i < (K — L)IL} which covers all but at most the final L — 1 members of [1. The construction of the (1 — 26)-packing will proceed sequentially from left to right.3. K].K]—U]I< 6K. start from the left and select successive disjoint intervals. 8)-strongly-covered by C. K) steps if m(n i ) > K — L or there is no j E[1 ±m(ni). K]. To motivate the positive result. A collection of subintervals C' of the interval [1. To carry this out rigorously set m(0) = no = 0 and. . K]. since m(n i )—n i +1 < L. K — L]. and hence m(n i ) < K. In particular. the definition of the n i implies the following fact. define n i = minfj E [1+ m(n i _i).3 (The packing lemma. The interval (K — L. K]. 6)-strong-cover assumption. that is. The desired positive result is just an extension of this idea to the case when most of the intervals have length bounded by some L <8 K.) Let C be a strong cover of N and let S > 0 be given. then there is a subcollection C' c C which is a (1 — 26)packing of [1. K] has length at most L — 1 < SK. by construction. Thus it is only necessary to show that the union 1. selecting the first interval of length no more than L that is disjoint from the previous selections. K — L] —Li then m(j) — j + 1 > L. This produces the disjoint collection C' = {[iL +1. K].34 CHAPTER I. 6)-strongly-covered by C if 1ln E [1. Let C be a strong cover of the natural numbers N.ll< 6K. stopping when within L of the end of [1. K] is (L. Lemma 1. thus guarantees that 1[1. stopping when within L of the end. suppose all the intervals of C have the same length. K — L]: m(j) — j +1 < L}. This completes the proof of the packing lemma. K — L]-1. K]: m(n) — n +1 > L}1 < 6. K] is said to be (L. (1). Proof By hypothesis K > LIS and (1) iin E K]: m(n) — n +1 > L11 < SK.
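The selection used in the proof of the packing lemma is a simple greedy left-to-right scan, which can be written out directly. The sketch below is illustrative only (the strong cover m(n) is generated at random, and the constants K and L are arbitrary): starting points whose intervals are longer than L are skipped, every other interval is selected and jumped over, and the scan stops within L of the right end.

import random

random.seed(6)
K, L = 200, 10
# a strong cover of [1, K]: to each n attach an interval [n, m(n)] with m(n) >= n
m = {n: n + random.choice([0, 1, 2, 3, 40]) for n in range(1, K + 1)}

# greedy left-to-right selection of disjoint intervals of length at most L
packing, n = [], 1
while n <= K - L:
    if m[n] - n + 1 <= L:        # interval short enough: select it
        packing.append((n, m[n]))
        n = m[n] + 1             # continue just past the selected interval
    else:                        # interval too long: skip this start point
        n += 1

covered = sum(b - a + 1 for a, b in packing)
print(len(packing), "disjoint intervals covering", covered, "of", K, "points")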

35 Remark 1. Thus the collection C(x) = lln. Furthermore. but in the proof of the general ergodic theorem use will be made of the explicit construction given in the proof of Lemma 1. for. which is a contradiction.(B) = 1.3.4 In most applications of the packing lemma all that matters is the result.u. • + Xn+1 the set B is T-invariant and therefore p. including the description (2) of those indices that are neither within L of the end of the interval nor are contained in sets of the packing. Some extensions of the packing lemma will be mentioned in the exercises. and hence there is an E > 0 such that the set {n B = x : lirn sup — E xi > . A proof of the ergodic theorem for the binary ergodic process case will be given first. if the average over most such disjoint blocks is too big.3. In this case . The packing lemma can be used to reduce the problem to a nonoverlapping interval problem. I.) Now suppose x E B. 1}. ( 1 ) n-+oo n i=i has positive measure.A'.SECTION 1.b The binary. however.3. and the goal is to show that (3) lirn — x i = . Without loss of generality the first of these possibilities can be assumed. large interval. a. . then the average over the entire interval [1. Suppose (3) is false.a(1). Since lim sup n—>oo 1 +E xi + + xn = =sup n—>oo x2 + . as it illustrates the ideas in simplest form. THE ERGODIC THEOREM. But. is invariant and ergodic with respect to the shift T on {0. most of the terms xi . so it is not easy to see what happens to the average over a fixed.s.(1) + E. These intervals overlap. i < K are contained in disjoint intervals over which the average is too big. K] must also be too big by a somewhat smaller fixed amount.. (Here is where the ergodicity assumption is used. it will imply that if K is large enough then with high probability.u. Since this occurs with high probability it implies that the expected value of the average over the entire set A K must be too big. Since B is T-invariant.3. n—*øo n i=1 i n where A(1) = A{x: x 1 = 1} = E(X1). m(n)]: n E JVI is a (random) strong cover of the natural numbers . ergodic process proof.3. Then either the limit superior of the averages is too large on a set of positive measure or the limit inferior is too small on a set of positive measure. each interval in C(x) has the property that the average of the xi over that interval is too big by a fixed amount. combined with a simple observation about almost-surely finite variables. for each integer n there will be a first integer m(n) > n such that X n Xn+1 + • • • + Xm(n) m(n) — n + 1 > p.

26)K ku(1) + El u(GK) > (1 . it will be used to obtain the general theorem. The following lemma generalizes the essence of what was actually proved.3. Thus. . ergodic process form of the ergodic theorem. the set GK = {X: gic(x) < 3} has measure at least 1 .u(D) <62.26)(1 (5)K [p. Thus the packing lemma implies that if K > LI 8 and x E GK then there is a subcollection Cf (x) = {[ni. . are nonnegative. Lemma 1.5 Let T be a measure-preserving transformation on (X.m(ni)]: i < I (x)} C C(X) which is a (1 . since the random variable m(1) is almost-surely finite.3. K].u(D) < 6 2 .2. and since the intervals in C'(x) are disjoint. E. The Markov inequality implies that for each K.a(1) + c]. E. The definitions of D and GK imply that if x E GK then C(x) is an (L. Second. (1) = E (—Ex . The preceding argument will now be extended to obtain the general ergodic theorem.u). taking expected values yields (K E iE =1 •x > (1 . I. and define the set 1 n B = {x: lim sup .K]. the lower bound is independent of x. A bit of preparation is needed before the packing lemma can be applied. and thereby the proof of Theorem 1. First. the sum over the intervals in C'(x) lower bounds the sum over the entire interval. so that 1 K p. let f let a be an arbitrary real number. Thus given 3 > 0 there is a number L such that if D = {x: m(1) > L}. 3)-strongcovering of [1.(1)+ c] . then .8.3. by assumption.E f (T i-1 x) > al. . note that the function g K(x) = i=1 1 IC where x denotes the indicator function. Note that while the collection C(x) and subcollection C'(x) both depend on x. has integral . E L i (X.c The proof in the general case. This completes the proof of (3). the binary.) > (1 . J=1 which cannot be true for all S. > i=1 j_—n 1 (l _ 28)K [p. since T is measure preserving. BASIC CONCEPTS. 1(x) m(n) (4) j=1 x. Since the xi.(1) + c].36 CHAPTER I. it is bounded except for a set of small probability.23)(1 6)[.a). that is.23)-packing of [1. as long as x E GK. n i=i Then fB f (x) d(x) > a g(B).

3. The set B is T-invariant and the restriction of T to it preserves the conditional measure . m(n)]: n E AO is a (random) strong cover of the natural numbers Js/. where f is allowed to be unbounded.1q—U[n. x) + E E f (T i-1 x) i=1 j=ni 1(x) m(ni) R(K . Since B is T-invariant.u(D1B) <82 and for every K the set 1 G K = {X E B: 1 -7 i=1 E xD(Ti-ix) K 6} has conditional measure at least 1 — B. if K > LIB then the packing lemma can be applied and the argument used to prove (4) yields a lower bound of the form E f( Ti-lx ) j=1 (6) where R(K . Furthermore.m(n i )] intervals. say 1f(x)1 < M. given 3 > 0 there is an L such that if D= E B: m(1) > L } then . x) as well as the effect of integrating over GK in place of X. + (1 — 28)K a. together with the lower bound (6). Only a bit more is needed to handle the general case. where f is the indicator function of a set C and a = p(C) c. though integrable.5 is true in the ergodic case for bounded functions. (1 — 8)a A(G KIB) . the collection C(x) = fin. this lemma is essentially what was just proved.i(B) > O. 37 Proof Note that in the special case where the process is ergodic.u(. In the unbounded case. does give 1 v-IC s f (x) dia(x1B) = (7) a' B i=i II -I. x) = > R(K . E jE[1.3. the sum R(K . (Keep in mind that the collection {[ni. THE ERGODIC THEOREM. An integration.12 + 13.)] f (T x) is the sum of f(Ti -l x) over the indices that do not belong to the [ne. so it can be assumed that . The same argument can be used to show that Lemma 1. x) is bounded from below by —2M KS and the earlier argument produces the desired conclusion. The lemma is clearly true if p(B) = 0..) In the bounded case.(x1B) > a.SECTION 1. As before.m(n. it is enough to prove fB f (x) c 1 ii. a bit more care is needed to control the effect of R(K. Thus. For x E B and n E A1 there is a least integer m(n) > n such that (5) Em i= (n n ) f (T i-1 m(n)—n+1 > a.m(ni)]} depends on x. 1B).

m(n. .x)1 f (P -1 x)I dwxiB). In summary.u(xIB). will be small if 8 is small enough.)] E 13= (T i-1 x) d. K — L] — m(ni)] then m(j) — j 1 > L. the final three terms can all be made arbitrarily small. see (7). which is small if K is large enough. so that passage to the appropriate limit shows that indeed fB f (x) dp. which is upper bounded by 8. 1131 15. (1 — kt(G + 11 + ± 13.K]—U[ni.5. K — L] — U[ni ./ x) clitt(xIB) = fT_J(B-GK) f(x) Since f E L I and since all the sets T -i(B — GK) have the same measure.B jE(K—L.3.(x1B). 1 ___ K f (Ti -1 x) c 1 ii. In the current setting this translates into the statement that if j E [1.K— L] —U[ni. and hence 11 itself. see (2). BASIC CONCEPTS.k.(xiB).M(ri. To bound 12. . for any fixed L.(x1B).(xim. Thus 1121 5_ -17 1 f v—N K . L f(x) d ii(x B) ?.B xD(T' . m(n)] then Ti -l x E D. K fB—Gic j=1 12 = f(Ti -l x) dp. that if j E [1.(x1B) > a. in the inequality f D1 f (x)1 clia(x1B). f The measure-preserving property of T gives f3-Gic f (T .38 where CHAPTER I. recall from the proof of the packing lemma. E GK jE[1.K] and hence the measure-preserving property of T yields 1131 B 1 fl(x) dp.)] GK — K 1 jE(K—L. all of the terms in 11. which completes the proof of Lemma 1. and the measure-preserving property of T yields the bound 1121 5_ which is small if L is large enough. The integral 13 is also easy to bound for it satisfies 1 f dp.

Thus the averaging operator An f 1 E n n i=i T is linear and is a contraction.3. There are several different proofs of the maximal theorem. [32]. f follows immediately from this inequality. The packing lemma is a simpler version of a packing idea used to establish entropy theorems for amenable groups in [51]. Since. but it is much closer in spirit to the general partitioning. g E L I then for all m.u(C)< Jc f d which can only be true if ti(C) = O. Other proofs that were stimulated by [51] are given in [25. 1=1 Remark 1. then A n f (x)I < M. The reader is referred to Krengel's book for historical commentary and numerous other ergodic theorems.5.5 remains true for any T-invariant subset of B. The proof given here is less elegant. 27]. To complete the discussion of the ergodic theorem L' -convergence must be established. This proves almost sure convergence. Ag converges in L'-norm. 39 Proof of the Birkhoff ergodic theorem: general case. But if f. if f is a bounded function. including the short.3. xE X. as well as von Neumann's elegant functional analytic proof of L 2-convergence. almost surely. Moreover. whose statement is obtained by replacing lirn sup by sup in Lemma 1. THE ERGODIC THEOREM. This completes the proof of the Birkhoff ergodic theorem. for g can be taken to be a bounded approximation to f.5.3. Since almost-sure convergence holds. f EL I .3. as was just shown. . it follows that f* E L I and that A n f converges in L'-norm to f*.3. Thus if 1 n 1 n f (T i x) C = x: lim inf — f (T i x) <a < fi < lim sup — n-400 n i=1 n—. Proofs of the maximal theorem and Kingman's subadditive ergodic theorem which use ideas from the packing lemma will be sketched in the exercises. the dominated convergence theorem implies L' convergence of A n f for bounded functions. To prove almost-sure convergence first note that Lemma 1. that is. elegant proof due to Garsia that appears in many books on ergodic theory. A n f — Am f 5- g)11 + ii Ag — Amg ii 2 11 f — gil + 11Ang — A m g iI — g) — A ni (f — The L .6 Birkhoff's proof was based on a proof of Lemma 1. UT f (x) = f (Tx). and since the average An f converges almost surely to a limit function f*. its application to prove the ergodic theorem as given here was developed by this author in 1980 and published much later in [68].-convergence of A.o0 n i=1 { E E then fi. covering. UT f = if Ii where 11f II = f Ifl diu is the L'-norm. called the maximal ergodic theorem. IA f II If Il f E L I . and packing ideas which have been used recently in ergodic and information theory and which will be used frequently in this book. almost surely. The operator UT is linear and is an isometry because T is measure preserving. say I f(x)I < M. n. To do this define the operator UT on L I by the formula. Most subsequent proofs of the ergodic theorem reduce it to a stronger result.SECTION 1. that is.

n] is (1 — 3)-packed by intervals from C(x.7 (The ergodic stopping-time packing lemma. BASIC CONCEPTS. Now that the ergodic theorem is available. given 8 > 0. hence. Lemma 1. the packing lemma and its relatives can be reformulated in a more convenient way as lemmas about stopping times.3.Af U ool. it is bounded except for a set of small measure. Proof Since r (x) is almost-surely finite.) If r is an almost-surely finite stopping time for an ergodic process t then for any > 0 and almost every x there is an N = N (3. the stopping time for each shift 7' 1 x is finite. In particular. Frequent use will be made of the following almost-sure stopping-time packing result for ergodic processes. The given proof of the ergodic theorem was based on the purely combinatorial packing lemma. Thus almost-sure finiteness guarantees that. The interval [n. however. Note that the concept of stopping time used here does not require that X be a sequence space. almost surely. for almost every x. the collection C = C(x.d Extensions of the packing lemma. A (generalized) stopping time is a measurable function r defined on the probability space (X. The stopping time idea was used implicitly in the proof of the ergodic theorem. starting at n. x) such that if n > N then [1.u) with values in the extended natural numbers .u ({x: r(x) = °o }) = O. r) = {[n. r). so that if 1 n 1=1 x D (T i-1 x) d p(x) = . the function r(Tn -l x) is finite for every n. t(Tn -1 x) n — 1]: n E of stopping intervals is. Gn = then x E ix 1 x---11 s :— n 2x D (Ti . Suppose it is a stationary measure on A'.ix)< 6/ 41 G . E. By the ergodic theorem. An almost-surely finite stopping time r has the property that for almost every x.40 CHAPTER I. . A stopping time r is p. eventually almost surely. a (random) strong cover of the natural numbers A.-almost-surely finite if . Several later results will be proved using extensions of the packing lemma.u(D) <6/4. lim almost surely.lx)-F n — 1] is called the stopping interval for x. I.3. . Let D be the set of all x for which r(x) > L. for an almost-surely finite stopping time. for the first time m that rin f (Ti -1 x) > ma is a stopping time. see (5). which is a common requirement in probability theory. nor that the set {x: r(x) = nl be measurable with respect to the first n coordinates. there is an integer L such that p({x: r(x) > LI) < 6/4.r(Tn.

Suppose each [ui .3. For almost every x. and each [s1 . 1 — 3/2)-packing of [1. n] is said to be separated if there is at least one integer between each interval in C'. Then the disjoint collection {[s i . THE ERGODIC THEOREM. n] has a y -packing C' c C(x. 3. Let p. put F(r) = {x: r(x) < ocl. Formulate the partial-packing lemma as a strictly combinatorial lemma about the natural numbers. then [1. ti ]: j E J } U flub i E I*1 has total length at least (a + p .) 2. 41 Suppose x E G„ and n > 4L/3. vd: i E I} be a disjoint collection of subintervals of [1. r). 1. This proves the lemma. Two variations on these general ideas will also be useful later.3.3. (Hint: use the fact that the cardinality of J is at most K I M to estimate the total length of those [u. The definition of G. Prove the partial-packing lemma. Prove the two-packings lemma. with the proofs left as exercises. r) to be the set of all intervals [n. there is an integer N= x) such that if n > N then [1. Lemma 1. yi] meets at least one of the [s i . ti]: j E J} be another disjoint collection of subintervals of [1.SECTION 1. r). r). and a two-packings variation. n] is (L. where M > m13. n — L]. let r be a stopping time such that . K] of total length 13 K .) . Show that a separated (1 — S)-packing of [1. These are stated as the next two lemmas. ti ].u(F(r)) > y > O. and define C(x. and let {[si . be an ergodic measure on A. implies that r(T -(' -1) x) < L for at least (1 — 314)n indices i in the interval [1. (Hint: define F'(x) = r(x) + 1 and apply the ergodic stopping-time lemma to F. Let I* be the set of indices i E I such that [ui. for which Tn -l x E F(r). there is an N = N (x. n] has a separated (1 — 28)-packing C' c C(x. ti]. r) such that if n > N. be an ergodic measure on A'. yi ] that meet the boundary of some [si. Show that this need not be true if some intervals are not separated by at least one integer. Show that for almost every x. K] of total length aK.9 (The two-packings lemma. n] by a subset of C(x.8 (The partial-packing lemma.e Exercises. and let r be an almost-surely finite stopping time which is almost-surely bounded below by a positive integer M > 1/8. n] and hence for at least (1 — 812)n indices i in the interval [1.3.) Let p. The packing lemma produces a (L. For a stopping time r which is not almost-surely infinite. . 5.] has length m. a partial-packing result that holds in the case when the stopping time is assumed to be finite only on a set of positive measure. n] is determined by its complement. 4. A (1 — 3)-packing C' of [1. ti ] has length at least M. I. r(r-1 n — 1]. 1 — 3/2)-stronglycovered by C(x. r) Lemma 1. v.) Let {[ui.23)K. But this means that [1.

C(x. and for each n. (a) Prove the theorem under the additional assumption that gn < 0. and G n (r . for any L 2 function f. Show that if ti is ergodic and if f is a nonnegative measurable. x E f (T i x)xE(T i x) i=o — (N — fII. r).) (c) Show that the theorem holds for bounded functions.(Gn (r. Assume T is invertible so that UT is unitary.(x) < g(x) g ni (Tn x).A4 = (I — UT)L 2 .} is a sequence of integrable functions such that gn+. as n 7. (a) For bounded f. [80].Jones. (Hint: first show that g(x) = liminf g n (x)In is an invariant function. (a) Show that the theorem is true for f E F = If E L 2 : UT f = f} and for f E . n ] is (1 — (5)-packed by intervals from oc.F+M is dense in L 2 . Assume T is measure preserving and f is integrable. then the averages (1/N) E7 f con8. for some r(x) < E Un . (d) Deduce that L' -convergenceholds. The ideas of the following proof are due to R.. x 0 E. (b) Show that . but not integrable f (T i-1 x) converge almost surely to oc. 6.) (c) Deduce von Neumann's theorem from the two preceding results. if f E LI . 8)) 1.Steele. continuing until within N of L. N. show that r(x)-1 E i=o f (T i x)xE(T i x) ?_ 0. The mean ergodic theorem of von Neumann asserts that (1/n) EL I verges in L2-norm. (e) Extend the preceding arguments to the case when it is only assumed that T is measure preserving. 9.) . [33]. show that (Hint: let r(x) = n. then from r(x) to r(T) — 1. (Hint: sum from 0 to r(x) — 1.u(B). function. Fix N and let E = Un <N Un . Kingman's subadditive ergodic theorem asserts that gn (x)/ n converges almost surely to an invariant function g(x) > —oc. (Hint: show that its orthogonal complement is 0. let Un be the set of x's for whichEL -01 f (T i x) > 0. then ii. show that h f da a. BASIC CONCEPTS. then apply packing. (e) For B = {x: sup E71-01 f (T i x) > a}. One form of the maximal ergodic theorem asserts that JE f (x) di(x) > 0. 8) is the set of all x such that [1. r(x) = 1. (d) Show that the theorem holds for integrable functions. The ideas of the following proof are due to M.42 CHAPTER I. Assume T is measure preserving and { g. Show that if r is an almost-surely finite stopping time for a stationary process p. 10.) (b) For bounded f and L > N.

g 1 (x) = gni (x) — ET -1 gi (T i x ). 43 (Hint: let (b) Prove the theorem by reducing it to the case when g„ < O.u < oc. + (d) Show that the same conclusions hold if gn < 0 and gn+8+.) (c) Show that if a = inf. In the nonergodic case. g„. n—k+1 Palic Wiz = The frequency of the block ail' in the sequence Jet' is defined for n > k by the formula E i=1 x fa i l (T i-1 x). or the (overlapping) k-type of x. In the ergodic case. that this limit exists. n — k + 1]: xi +k-1 = 4}1 n—k1 n—k+1 If x and k < n are fixed the relative frequency defines a measure pk • 14) on A " . where y8 0 as g Section 1.(4). f and a = fg d d. a property characterizing ergodic processes. FREQUENCIES OF FINITE BLOCKS. n — k + 1]: = where I • I denotes cardinality. The frequency can also be expressed in the alternate form f(4x) = I{i E [1. n — k 1. The limiting (relative) frequency of di in the infinite sequence x is defined by (1) p(ainx) = lim p(ainx). . These and related ideas will be discussed in this section. Thus. to obtain Pk(aflx7) = f (4x) = I{i E [1. When k is understood. f(a14) is obtained by sliding a window of length k along x'it and counting the number of times af is seen in the window. then convergence is also in L 1 -norm. the subscript k on pk(ali( 14) may be omitted.4. oo provided. The relative frequency is defined by dividing the frequency by the maximum possible number of occurrences. A sequence x is said to be (frequency) typical for the process p. called the empirical distribution of overlapping k-blocks. An important consequence of the ergodic theorem for stationary finite alphabet processes is that relative frequencies of overlapping k-blocks in n-blocks converge almost surely as n —> oc.(Tn+gx) + y8 .4.n (x) < cc.SECTION 1. of course.4 Frequencies of finite blocks. the existence of limiting frequencies allows a stationary process to be represented as an average of ergodic processes. where x [4] denotes the indicator function of the cylinder set [an. if each block cif appears in x with limiting relative frequency equal to 11. that is.(4).a Frequencies for ergodic processes. x is typical for tt if for each k the empirical distribution of each k-block converges to its theoretical probability p. limiting relative frequencies converge almost surely to the corresponding probabilities. I.

the set 00 B=U k=1 ak i EAk B(a 1`) XE has measure 0 and convergence holds for all k and all al'. Proof In the ergodic case. for x B(a lic). n—k+1 p(afixiz) - n . lim p(aIN) = for all k and all a. p. with probability 1. Proof If the limiting relative frequency p(a'nx) exists and is a constant c. This proves the theorem. n liM — n i=1 x8 = holds for almost all x for any cylinder set B.(4). = p(a). for fixed 4. Thus the entire structure of an ergodic process can be almost surely recovered merely by observing limiting relative frequencies along a single sample path. which. converges almost surely to f x [4] dp. the limiting relative frequencies p(ali`Ix) exist and are constant in x.44 CHAPTER I. A basic result for ergodic processes is that almost every sequence is typical. where C is an arbitrary measurable set. there is a set B(a) of measure 0. In other words. produces liM n—)00 fl E X B (T i-l x)X c (x)= i=1 n and an integration then yields (2) lim A(T -i B n i=1 E n n= . Since there are only a countable number of possible af. Theorem 1. Multiplying each side of this equation by the indicator function xc (x). for all measure 1. The converse of the preceding theorem is also true. that is. Then A is an ergodic measure. as n -> co. a set of [=:] Theorem 1.-almost surely.k +1 E k • (T i X •_ ' ul' ' —1 x). since the ergodic theorem includes L I -convergence of the averages to a function with the same mean. Thus the hypotheses imply that the formula x--.4. by the ergodic theorem.2 (The typical-sequence converse. A .4.1 (The typical-sequence theorem.) If A is ergodic then A(T (A)) = 1.) Suppose A is a stationary measure such that for each k and for each block di'. then this constant c must equal p. BASIC CONCEPTS.(4).B. for almost every x. Let T(A) denote the set of sequences that are frequency typical for A. such that p(alicixtiz) -> 1).

a E The "bad" set is the complement. so that those in Gn will have "good" k-block frequencies and the "bad" set Bn has low probability when n is large. It is also useful to have the following finite-sequence form of the typical-sequence characterization of ergodic processes. are often called the frequency-(k. Note that as part of the preceding proof the following characterizations of ergodicity were established. the limit must be almost surely constant.) (a) If ji is ergodic then 4 (b) If lima p. just the typical sequences. This proves Theorem 1.3 The following are equivalent. is Proof Part (a) is just a finite version of the typical-sequence theorem. 45 The latter holds for any cylinder set B and measurable set C. E)) converge to 1 for any E > 0 is just the condition that p(41x7) converges in probability to ii(4). = 1.u(A)A(B). then p. Since p(an.4. c). or when k and c are understood.(Ti-lx) = tt(E).1. for every integer k > 0 and every c > 0.4.2. C. and B. Theorem 1. c) such that if n > N then there is a collection C.4.4 (The "good set" form of ergodicity. 6) = An — G n (k.4. FREQUENCIES OF FINITE BLOCKS.4. . for each A and B ma generating algebra. But if B = B then the formula gives it(B) = A(B) 2 . then 1/3(41fi') — p(aini)I < c.4. 6) = p(afix) — . C An such that tt(Cn) > 1 — E. E).q) is bounded and converges almost surely to some limit. for each E in a generating algebra. is (3) G n (k. (i) T is ergodic. (ii) lima pi.e.5 (The finite form of ergodicity. which depends on k and a measure of error. E). The members of the "good" set. eventually almost surely. so that p(B) must be 0 or 1.. 6)) E Gn (k.(iii) limn x .SECTION 1. The condition that A n (G n (k. c)-typical sequences. A(T -i A n B) = . B n (k. G.n (G n (k. The precise definition of the "good" set.(k. Theorem 1. G . Theorem 1. for every k and c > O. Sequences of length n can be partitioned into two classes.) A measure IL is ergodic if and only if for each block ail' and E > 0 there is an N = N(a.4. ergodic. Theorem 1. The usual approximation argument then shows that formula (2) holds for any measurable sets B and C. (ii) If x. and hence (b) follows from Theorem 1.2. a.L(a)1 < c. Some finite forms of the typical sequence theorems will be discussed next.

4. then define G n = ix: IP(a inxi) — P(a inx)I < EI. and use Theorem 1. by saying that. Theorem 1. (4) E [1. Proof Apply the ergodic theorem to the indicator function of Bn . if pn (B n iXr) < b. then liM Pn(BniXi ) M—).) If 12.46 CHAPTER I. . Thus.a(a n n e . eventually almost surely. a. c) such that . I. Theorem 1. M — n that is. M — n] for almost surely as M which x: +n-1 E Bn . and use convergence in probability to select n > N(a. X.4. M — n]: E 13n}1 — ML(13n)1 <8M.n . then the ergodic theorem is applied to show that. a set Bn c An with desired properties is determined in some way. To establish the converse fix a il' and E > 0. 1]: xn-1 B n }i < 8M.s. and if Bn is a subset of measure. E).4.7 (The almost-covering principle. A sequence xr is said to be (1 — 8)strongly-covered by B n if I {i E [1. C An.) If kt is an ergodic process with alphabet A. IP(ailx) — P(ailZ)I 3E. just set Cn = G n (k. is chosen to have large measure. is an ergodic process with alphabet A. Proof If it is ergodic. If n El. E _ _ n E. In these applications.4. eventually oc. informally. n ) > 1 — 2E. and hence p(a Ix) is constant almost surely. The ergodic theorem is often used to derive covering properties of finite sample paths for ergodic processes.4. eventually almost surely. the "good" set defined by (3).u(an) > 1 — Cn satisfies (i) and (ii) put En = {x: 4 E Cn} and note that .co In other words. In many situations the set 13. xr is "mostly strongly covered" by blocks from Bn . satisfies A(13 n ) > 1 — then xr is eventually almost surely (1 — (5)-strongly-covered by r3n. BASIC CONCEPTS. This simple consequence of the ergodic theorem can be stated as either a limit result or as an approximation result.b The ergodic theorem and covering properties. An of positive 1-1 (Bn). in which case the conclusion (4) is expressed. there are approximately tt(Bn )M indices i E [1.6 (The covering theorem. and if 8. for any 8 > 0. proving the theorem.

if no a appears in xr. Given 8 > 0. but the limit measure varies from sequence to sequence. nil. which is a simple consequence of the ergodic theorem. (5). The almost-covering principle. be ergodic.2. then F(41 ) < 23M and L(xr) > (1 — 3)M. the desired result. however. Let F (xr) and L(xr) denote the first and last occurrences of a E Xr. first to select the set 13n . for some i E [1.c. with the convention F(41) = L(41 ) = 0.4. The set is shift invariant and any shift-invariant measure has its support in This support result. Thus is the set of all sequences for which all limiting relative frequencies exist for all possible blocks of all possible sizes. in the limit.4.00 L(41 )— F(41 ) = 1. limiting relative frequencies of all orders exist almost surely. But if this is so and M > n/3. The set of all x for which the limit p(ainx) = lim p(44) exists for all k and all di will be denoted by E.7. so that L(xr) — F(41) > (1 — 3 8 )M. In the stationary nonergodic case. M]: x i = a} L(x) = max{i E [1. Since 3 is arbitrary. implies that xr is eventually almost surely (1 — 3)-strongly-covered by Bn . M]: x i = a}. that the expected waiting time between occurrences of a along a sample path is. 47 The almost strong-covering idea and a related almost-packing idea are discussed in more detail in Section L7. almost surely. and then to obtain almost strong-covering by 13n . Note that the ergodic theorem is used twice in the example. Note. Bn = {a7: ai = a. Theorem 1. choose n so large that A(B n ) > 1 — 3. I. These ideas will be made precise in this section.4.(a). The sequences that produce the same limit measure can be grouped into classes and the process measure can then be represented as an average of the limit measures.8 Let p. that if xr is (1 — 6)strongly-covered by 13n . in fact. see Exercise 2. Example 1.SECTION 1.c The ergodic decomposition. This result also follows easily from the return-time picture discussed in Section I. The following example illustrates one simple use of almost strong-covering. together with the ergodic theorem. almost all of which will. . Such "doubling" is common in applications.4. The almost-covering idea will be used to show that (5) To prove this. F(xr) = min{i E [1. almost surely equal to 1/p. then at least one member of 13n must start within 3M of the beginning of xr and at least one member of 1:3n must start within 8M of the end of xr. Such an n exists by the ergodic theorem. that is. follows. It follows easily from (5). let lim kr. FREQUENCIES OF FINITE BLOCKS. be an ergodic process and fix a symbol a E A of positive probability. is stated as the following theorem.

BE E. Proof Since p (all' 14) = n . Let n denote the projection onto frequency equivalence classes. Theorem 1.. as an average of its ergodic components. 4 E Ak defines a measure itx on sequences of length k. which establishes the theorem.9 If it is a shift-invariant measure then A ( C) = 1.5.1. and extends by countable additivity to any Borel set B.u. BASIC CONCEPTS. together with the finite form of ergodicity.C must El have measure 0 with respect to 1. Next define /27(x) = ktx. x E E. Two sequences x and y are said to be frequency-equivalent if /ix = A y .)c) can be thought of as measures. is shift-invariant then p.4. The projection x is measurable and transforms the measure .( x )(B) da)(7r(x)).x (a) = p(ainx).. for each k.4.10 ( The ergodic decomposition theorem.tx on A. Theorem 1.(x). The usual measure theory argument. k 1 i=1 the ergodic theorem implies that p(aflx) = limn p(aii`ix) exists almost surely.. a shift-invariant measure always has its support in the set of sequences whose empirical measures are ergodic.4. . by (6).4.. Let E denote the set of all sequences x E L for which the measure ti.(E) = 1 and hence the representation (7) takes the form.48 Theorem 1.. An excellent discussion of ergodic components and their interpretation in communications theory can be found in [ 1 9 ].) If p. to a Borel measure i. Formula (6) can then be expressed in the form (7) .(x )(B) dco(7(x)). ± i E x [ ak. Theorem 1. (x). This formula indeed holds for any cylinder set.u(B) = f 1.u onto a measure w = . are called the ergodic components of kt and the formula represents p. the complement of . The measure . each frequency-equivalence class is either entirely contained in E or disjoint from E. B E E.(Ti.4. Of course.ux will be called the (limiting) empirical measure determined by x.. which can be extended. .. n —k+ 1 The frequencies p(41. CHAPTER I.u o x -1 on equivalence classes.u(alic) = f p(af Ix) dp.x is ergodic.9 can be strengthened to assert that a shift-invariant measure kt must actually be supported by the sequences for which the limiting frequencies not only exist but define ergodic measures.9 may now be expressed in the following much stronger form.u. — ix). To make this assertion precise first note that the ergodic theorem gives the formula (6) .1. Theorem 1. by the Kolmogorov consistency theorem. Since there are only a countable number of the al` .. (B)= f p. which is well-defined on the frequency equivalence class of x. The ergodic measures. shows that to prove the ergodic decomposition theorem it is enough to prove the following lemma. for each k and 4. for if x E r then the formula . e In other words.

E. with /L(C) > 1 — c.(Xi) > 1 — E. The conditions (9) and (8) guarantee that the relative frequency of occurrence of 4 in any two members x7.11 Given c > 0 and 4 there is a set X1 with ii. x E L2. which translates into the statement that p(C) > 1 — E. hence For each x E Li. 49 Lemma 1.4. tt(CnICI) = ti Li P(xilz) dtt(z).4.10.12 The ergodic decomposition can also be viewed as an application of the ChoquetBishop-deLeeuw theorem of functional analysis. FREQUENCIES OF FINITE BLOCKS. see Exercise 1 of Section 1. Proof Fix E > 0. itiz of Cn differ by no more E.) The extreme points of Ps correspond exactly to the ergodic measures. ILi) is shift-invariant and 1 . such that the frequency of occurrence of 4 in any two members of C. Thus the conditional measure it (. Theorem 1. . 1] is bounded so L can be covered by a finite number of the sets Li(y) and hence the lemma is proved.9. fix a block 4. The set L1 is invariant since p(ainx) = p(4`17' x).„ E p(xz) ?. Indeed. and fix a sequence y E L. for any x E L. c A n . differs by no more than E.4. in particular. the sequence {p(al 14)} converges to p(41x). as n there is an integer N and a set L2 C Li of measure at least (1 — E2 )A(L1) such that (9) IP(414) — p(alx)I < E/4. Fix n > N and put Cn = {x: x E L2 }. The unit interval [0. Z E r3.u(xilLi) = A(CI) In particular. This completes the proof of the ergodic decomposition theorem. z E L3.SECTION 1. x' iz E An. Let L i = Li(y) denote the set of all x E L such that (8) ip(a'nx) — p(41y)l < E/4. the set Ps of shift-invariant probability measures is compact in the weak*-topology (obtained by thinking of continuous functions as linear functionals on the space of all measures via the mapping it i— f f dit. 1 . [57]. that the relative frequency of occurrence of al in any two members of Li differ by no more than E/2.4. 0 Remark 1. 1 (L1) L1 P(Crilz) d iu(z) > 1 — 6 2 so the Markov inequality yields a set L3 C L2 such that /4L3) > (1 — E)/L(L 1 ) and for which .q. oo. Note. the set of sequences with limiting frequencies of all orders. n > N. such that for any x E X1 there is an N such that if n > N there is a collection C.c.

Let {X„: n > 1} be a stationary process.u(x lic). Use the central limit theorem to show that the "random walk with random scenery" process is ergodic. P)-process. (It is an open question whether a mixing finite-state process is a function of a mixing Markov chain. z„. 2. " be the set of all x E [a] such that Tnx E [a]. zt) are looking at different parts of y. (b) Show that the preceding result is not true if T is not totally ergodic.) A measure on Aa " transports to the measure y = . Show that .5. Let qk • 14) be the empirical distribution of nonoverlapping k-blocks in 4.4. I.) 5. (Hint: combine Exercise 2 with Theorem 1. P V TP y T 2P)-process. This exercise illustrates a direct method for transporting a sample-path theorem from an ergodic process to a sample-path theorem for a (nonstationary) function of the process. Show that {Xn } is ergodic if and only its reversed process is ergodic.: n > 1}. Let T be an ergodic rotation of the circle and P the two-set partition of the unit circle into upper and lower halves.u o R -1 on Aroo. (a) Determine the ergodic components of the (T 3 . R2(x).) 7. Describe the ergodic components of the (T x T . 6. = (a) Show that if T is totally ergodic then qk (ali`14) . is the conditional measure on [a] defined by an ergodic process.1 by many n.P x P)-process. (b) Determine the ergodic components of the (T 3 . almost surely. that is. Assume p. The reversed process {Yn : n > 1} is the . See Exercise 8 in Section 1. (This is just the return-time process associated with the set B = [a]. Define the mapping x (x) = min{m > 1: x. (Hint: with high probability (x k .n = a} — 1 R1 (x) = minfm > E (x): x m = a} — E Rico i=1 . (Hint: show that the average time between occurrences of a in 4 is close to 1/pi (a14). P)-process.1 ) = I{/ E [0. Show that an ergodic finite-state process is a function of an ergodic Markov chain. t): where n = tk r.4. .t(a).d Exercises 1. where P is the Kolmogorov partition in the two-sided representation of {X. Combine (5) with the ergodic theorem to establish that for an ergodic process the expected waiting time between occurrences of a symbol a E A is just 111. 0 < r < k. for infinitely Fix a E A and let Aa R(x) = {R 1 (X). m+k ) and (4.50 CHAPTER I.2 for a definition of this process.) 3.) 4. 8. qk (alnx. BASIC CONCEPTS. Let P be the time-0 partition for an ergodic Markov chain with period d = 3.

The entropy of a probability distribution 7(a) on a finite set A is defined by H(z) = — E 71. some counting arguments. For stationary processes. 0 Lemma 1. the measure p.) Section 1. The entropy theorem asserts that. (a) The mapping x R(X) is Borel.(x)) _< log I AI. and denoted by h = h(t).) (d) (1/n) En i R i -÷ 11 p.5. The next three lemmas contain the facts about the entropy function and its connection to counting that will be needed to prove the entropy theorem.SECTION 1.5. and the concept of the entropy of a finite distribution. There is a nonnegative number h = h(A) such that n oo n 1 lim — log .(a) log 7(a). (Hint: use the preceding result.4) = nE(h n (x)). and. 51 (b) R(T R ' (x ) x) = SR(x).(4) = h.(4) 1 log p. has limit 0. z(a) 111AI. then v(R(T)) = 1.(4). with a constant exponential rate called the entropy or entropy-rate.(a).5 The entropy theorem. the decrease is almost surely exponential in n.1 (The entropy theorem. aEA Let 1 1 h n (x) = — log — n ti. THE ENTROPY THEOREM. see Exercise 4c.2 The entropy function H(7) is concave in 7 and attains its maximum value log I AI only for the uniform distribution.5. Proof Since 1/(. except for some interesting cases. Lemma 1.17. The proof of the entropy theorem will use the packing lemma. v-almost surely. Proof An elementary calculus exercise. n so that if A n is the measure on An defined by tt then H(t) = nE(hn (x)). where S is the shift on (e) If T is the set of the set of frequency-typical sequences for j. CI . (Hint: use the Borel mapping lemma. of the process.5. Lemma 1.u.) Let it be an ergodic measure for the shift T on the space A'. In the theorem and henceforth. where A is finite. and the natural logarithm will be denoted by ln. the lemma follows from the preceding lemma. for ergodic processes.1. Theorem 1. log means the base 2 logarithm.3 E(17. almost surely.(4) is nonincreasing in n.

h(x) = lim inf hn (x). To establish this note that if y = Tx then [Yi ] = [X3 +1 ] j so that a(y) > It(x 1 ) and hence h(Tx) < h(x). almost surely. Proof The function —q log3 — (1 — q) log(1 —8) is increasing in the interval 0 < q < 3 and hence 2-nH(S) < 5k(1 _ n ) and summing gives k on-k .4 (The combinations bound. see [7]. that the binary entropy function is the correct (asymptotic) exponent in the number of binary sequences of length n that have no more than Sn ones. Thus h(Tx) = h(x).= H (3) k I. almost surely.00 n k<rz b E( n ) . n. The goal is to show that h must be equal to lim sup. 0 The function H(S) = —3 log 8—(1—S) log(1-3) is called the binary entropy function. BASIC CONCEPTS.00 almost surely. It can be shown. k < ms.) (n ) denotes the number of combinations of n objects taken k at a time and k 3 < 1/2 then n 2nH(3) .a The proof of the entropy theorem. hn (x). let c be a positive number. Define h to be the constant such that h = lim inf hn (x).5. Towards this end.52 CHAPTER I. that is. almost surely. Section 1. since subinvariant functions are almost surely invariant. that is I lim — log n-. Since T is assumed to be ergodic the invariant function h(x) must indeed be constant.Co Define where h(x) = —(1/n) log it(x'lz). k) If E k<n6( where H(3) = —3 log3 —(1 — 3) log(1 — 3). almost surely.2. The definition of limit inferior implies that for almost . h(x) is subinvariant. Lemma 1. see Exercise 4. It is just the entropy of the binary distribution with ti(1) = S.5. n-). The first task is to show that h(x) is constant. Multiplying by ( 2-nH(8) E k<nS ( n k ) 'E k<n3 ( n k ) (5k 0 _ on-k n ) ic (1 _ 5 k on-k = 19 which proves the lemma.

there are not too many ways the parts outside the long blocks can be filled.. This is a simple application of the ergodic stopping-time packing lemma. The goal is to show that "infinitely often.. . for large enough K. a fact expressed in exponential form as (1) A(4) > infinitely often. long subblocks for which (1) holds cannot have cardinality exponentially much larger than 2 /"+€) . [ni. Indeed. + 1) > (1 — 28)K. there will be a total of at most 2 K(h+E ) ways to fill all the locations. The first idea is a packing idea. let S be a positive number and M > 1/B an integer. (a) m — n 1 +1 > M. M) be the set of all that are (1 — 23)-packed by disjoint blocks of length at least M for which the inequality (1) holds. Lemma I.1. then once their locations are specified. For each K > M. M) if and only if there is a collection xr S = S(xr) = {[n i . almost surely. Eventually almost surely. most of a sample path is filled by disjoint blocks of varying lengths for each of which the inequality (1) holds. and if they mostly fill. m i [} of disjoint subintervals of [1.] E S. then since a location of length n can be filled in at most 2n(h+E) ways if (1) is to hold. The second idea is a counting idea. Most important of all is that once locations for these subblocks are specified. then it is very unlikely that a sample path in the set has probability exponentially much smaller than 2 — K (h+E ) . The third idea is a probability idea. then there are not too many ways to specify their locations.Es (mi — n. To fill in the details of the first idea. [n. m i l E S. K] with the following properties.18(b). xr E G K (8 . THE ENTROPY THEOREM." for a suitable multiple of E. (b) p(x) > ni+1)(h+c). (c) E [ „.. almost surely" can be replaced by "eventually. Three ideas are used to complete the proof. let GK(S. if these subblocks are to be long. This is just an application of the fact that upper bounds on cardinality "almost" imply lower bounds on probability.. almost surely. In other words. both to be specified later. If a set of K-length sample paths has cardinality only a bit more than 2K (h+E) . 53 all x the inequality h(x) < h + E holds for infinitely many values of n. The set of sample paths of length K that can be mostly filled by disjoint. An application of the packing lemma produces the following result.5..SECTION 1. m.

54

CHAPTER I. BASIC CONCEPTS.

Lemma 1.5.5

E

GK

(3, M), eventually almost surely.

Proof Define T. (x) to be the first time n > M such that p,([xrii]) >n(h+E) Since r is measurable and (1) holds infinitely often, almost surely, r is an almost surely finite stopping time. An application of the ergodic stopping-time lemma, Lemma 1.3.7, then yields the lemma. LI
The second idea, the counting idea, is expressed as the following lemma.

Lemma 1.5.6 There is a 3 > 0 and an M> 116 such that IG large K.

, M)1 <2K (h+26) ,

for all sufficiently

Proof A collection S = {[ni , m i ll of disjoint subintervals of [1, K], will be called a skeleton if it satisfies the requirement that mi — ni ± 1 > M, for each i, and if it covers all but a 26-fraction of [1, K], that is,
(2)

E (mi _ ni ± 1) >_ (1 — 23)K.
[n,,m,JES

A sequence xr is said to be compatible with such a skeleton S if 7i ) > 2-(m, --ni+i)(h+e) kt(x n for each i. The bound of the lemma will be obtained by first upper bounding the number of possible skeletons, then upper bounding the number of sequences xr that are compatible with a given skeleton. The product of these two numbers is an upper bound for the cardinality of G K (6, M) and a suitable choice of 6 will then establish the lemma. First note that the requirement that each member of a skeleton S have length at least M, means that Si < KIM, and hence there are at most KIM ways to choose the starting points of the intervals in S. Thus the number of possible skeletons is upper bounded by (3)

E

K k

H(1/111)

k<KIM\

where the upper bound is provided by Lemma 1.5.4, with H() denoting the binary entropy function. Fix a skeleton S = {[ni,mi]}. The condition that .)C(C be compatible with S means that the compatibility condition (4) it(x,7 1 ) >

must hold for each [ni , m i ] E S. For a given [ni , m i ] the number of ways xn m , can be chosen so that the compatibility condition (4) holds, is upper bounded by 2(m1—n's +1)(h+E) , by the principle that lower bounds on probability imply upper bounds on cardinality, Lemma I.1.18(a). Thus, the number of ways x j can be chosen so that j E Ui [n„ and so that the compatibility conditions hold is upper bounded by
nK h-1-€)
L

(fl

SECTION 1.5. THE ENTROPY THEOREM.

55

Outside the union of the [ni , mi] there are no conditions on xi . Since, however, there are fewer than 28K such j these positions can be filled in at most IA 1 26K ways. Thus, there are at most
IAl23K 2K (h+E)

sequences compatible with a given skeleton S = {[n i , m i ]}. Combining this with the bound, (3), on the number of possible skeletons yields

(5)

1G K(s, m)i

< 2. 1(- H(1lm) oi ncs 2 K(h+E)

Since the binary entropy function H (1 M) approaches 0 as M 00 and since IA I is finite, the numbers 8 > 0 and M > 1/8 can indeed be chosen so that IGK (8, M) < 0 2K(1 +2' ) , for all sufficiently large K. This completes the proof of Lemma 1.5.6. Fix 6 > 0 and M > 1/6 for which Lemma 1.5.6 holds, put GK = Gl{(8, M), and let BK be the set of all xr for which ,u(x(c ) < 2 -1C(h+36) . Then tt(Bic

n GK) < IGKI 2—K(h+3E)

5 2—ICE

,

holds for all sufficiently large K. Thus, xr g BK n GK, eventually almost surely, by the Borel-Cantelli principle. Since xt E GK, eventually almost surely, the iterated almost-sure principle, Lemma 1.1.15, implies that xr g BK, eventually almost surely, that is,

lim sup h K (x) < h + 3e, a.s.
K—>oo

In summary, for each e > 0, h = lim inf h K (x) < lim sup h K (x) < h 3e, a.s.,
K—*co

which completes the proof of the entropy theorem, since c is arbitrary.

0

Remark 1.5.7 The entropy theorem was first proved for Markov processes by Shannon, with convergence in probability established by McMillan and almost-sure convergence later obtained by Breiman, see [4] for references to these results. In information theory the entropy theorem is called the asymptotic equipartition property, or AEP. In ergodic theory it has been traditionally known as the Shannon-McMillan-Breiman theorem. The more descriptive name "entropy theorem" is used in this book. The proof given is due to Ornstein and Weiss, [51], and appeared as part of their extension of ergodic theory ideas to random fields and general amenable group actions. A slight variant of their proof, based on the separated packing idea discussed in Exercise 1, Section 1.3, appeared in [68].

I.5.b Exercises.
1. Prove the entropy theorem for the i.i.d. case by using the product formula on then taking the logarithm and using the strong law of large numbers. This yields the formula h = — Ea ,u(a) log it(a). 2. Use the idea suggested by the preceding exercise to prove the entropy theorem for ergodic Markov chains. What does it give for the value of h?

56

CHAPTER I. BASIC CONCEPTS.
3. Suppose for each k, Tk is a subset of Ac of cardinality at most 2k". A sequence 4 is said to be (K, 8, {T})-packed if it can be expressed as the concatenation 4 = w(1) . w(t), such that the sum of the lengths of the w(i) which belong to Ur_K 'Tk is at least (1 —8)n. Let G„ be the set of all (K, 8, {T})-packed sequences 4 and let E be positive number. Show that if K is large enough, if S is small enough, and if n is large enough relative to K and S, then IG n I < 2n(a+E) . 4. Assume p is ergodic and define c(x) = lima

,44), x

E

A°°.

(a) Show that c(x) is almost surely a constant c. (b) Show that if c > 0 then p. is concentrated on a finite set. (c) Show that if p. is mixing then c(x) = 0 for every x.

Section 1.6

Entropy as expected value.

Entropy for ergodic processes, as defined by the entropy theorem, is given by the almost-sure limit 1 1 — log h= n—>oo n Entropy can also be thought of as the limit of the expected value of the random quantity —(1/n) log A(4). The expected value formulation of entropy will be developed in this section.

I.6.a

The entropy of a random variable.

Let X be a finite-valued random variable with distribution defined by p(x) = X E A. The entropy of X is defined as the expected value of the random variable — log p(X), that is,

Prob(X = x),

H(X) =

xEA

E p(x) log -1-=— p(x)

p(x) log p(x). xEA

The logarithm base is 2 and the conventions Olog 0 = 0 and log0 = —oo are used. If p is the distribution of X, then H(p) may be used in place of H(X). For a pair (X, Y) of random variables with a joint distribution p(x, y) = Prob(X = x, Y = y), the notation H(X,Y)-= — p(x , y) log p(x, y) will be used, a notation which extends to random vectors. Most of the useful properties of entropy depend on the concavity of the logarithm function. One way to organize the concavity idea is expressed as follows.

Lemma 1.6.1 If p and q are probability k-vectors then

_E pi log p, < —
with equality if and only if p = q.

pi log qi ,

SECTION 1.6. ENTROPY AS EXPECTED VALUE.

57

Proof The natural logarithm is strictly concave so that, ln x < x — 1, with equality if and only if x = 0. Thus
qj

_

= 0,

with equality if and only if qi = pi , log x = (In x)/(1n 2).

1 < i 5_ k.

This proves the lemma, since

E,

The proof only requires that q be a sub-probability vector, that is nonnegative with qi 5_ 1. The sum

Dcplio =

pi in

PI
qi

is called the (informational) divergence, or cross-entropy, and the preceding lemma is expressed in the following form.

Lemma 1.6.2 (The divergence inequality.) If p is a probability k-vector and q is a sub-probability k-vector then D(pliq) > 0, with equality if and only if p = q.
A further generalization of the lemma, called the log-sum inequality, is included in the exercises. The basic inequalities for entropy are summarized in the following theorem.

Theorem 1.6.3 (Entropy inequalities.) (a) Positivity. H(X) > 0, with equality if and only if X is constant.

(b) Boundedness. If X has k values then H(X) < log k, with equality if and only if
each p(x) = 11k.

(c) Subadditivity. H(X, Y) < H(X) H(Y), with equality if and only if X and Y
are independent. Proof Positivity is easy to prove, while boundedness is obtained from Lemma 1.6.1 by setting Px = P(x), qx ----- 1/k. To establish subadditivity note that
H(X,Y).—

Ep (x, y) log p(x, y),
X.

y

then replace p(x , y) in the logarithm factor by p(x)p(y), and use Lemma 1.6.1 to obtain the inequality H(X,Y) < H(X)-}- H(Y), with equality if and only if p(x, y) p(x)p(y), that is, if and only if X and Y are independent. This completes the proof of the theorem. El The concept of conditional entropy provides a convenient tool for organizing further results. If p(x , y) is a given joint distribution, with corresponding conditional distribution p(xly) = p(x, y)1 p(y), then H ain

=_

p(x, y) log p(xiy) = —

p(x, y) log

p(x, y)

POO

58

CHAPTER I. BASIC CONCEPTS.

is called the conditional entropy of X, given Y. (Note that this is a slight variation on standard probability language which would call — Ex p(xIy) log p(xly) the conditional entropy. In information theory, however, the common practice is to take expected values with respect to the marginal, p(y), as is done here.) The key identity for conditional entropy is the following addition law.

(1)

H(X, Y) = H(Y)+ H(X1Y).

This is easily proved using the additive property of the logarithm, log ab = log a+log b. The previous unconditional inequalities extend to conditional entropy as follows. (The proofs are left to the reader.)

Theorem 1.6.4 (Conditional entropy inequalities.) (a) Positivity. H (X IY) > 0, with equality if and only if X is a function of Y. (b) Boundedness. H(XIY) < H (X) with equality if and only if X and Y are independent. (c) Subadditivity. H((X, Y)IZ) < H(XIZ)+ H(YIZ), and Y are conditionally independent given Z. with equality if and only if X

A useful fact is that conditional entropy H(X I Y) increases as more is known about the first variable and decreases as more is known about the second variable, that is, for any functions f and g,

Lemma 1.6.5 H(f(X)In < H(XIY) < 1-1 (X1g(n).
The proof follows from the concavity of the logarithm function. This can be done directly (left to the reader), or using the partition formulation of entropy which is developed in the following paragraphs. The entropy of a random variable X really depends only on the partition Px = {Pa: a E A} defined by Pa = fx: X(x) = a), which is called the partition defined by X. The entropy H(1)) is defined as H (X), where X is any random variable such that Px = P. Note that the join Px V Py is just the partition defined by the vector (X, Y) so that H(Px v Pr) = H(X, Y). The conditional entropy of P relative to Q is then defined by H(PIQ) = H(1) V Q) — H (Q). The partition point of view provides a useful geometric framework for interpretation of the inequalities in Lemma 1.6.5, because the partition Px is a refinement of the partition Pf(x), since each atom of Pf(x) is a union of atoms of P. The inequalities in Lemma 1.6.5 are expressed in partition form as follows.

Lemma 1.6.6 (a) If P refines Q then H(QIR) < H(PIR). (b) If R. refines S then H(PIS) > 11(P 1R).

so that P = Pv Q. • The quantity H(P IR.r. The equality (i) follows from the fact that P refines Q. say Ra = Sa . then another use of subadditivity yields am < anp + a. IC) is used on C. I.SECTION I. namely. If m > n write m = np +. Subadditivity gives anp < pan . This is a consequence of the following basic subadditivity property of nonnegative sequences. The n-th order entropy of a sequence {X1. . H (P IR) (i) (P v 21 1Z) H(QITZ) + H(PIQ v 1Z) H (OR). so that if b = sup . n n xn In the stationary case the limit superior is a limit. 59 Proof The proof of (a) is accomplished by manipulating with entropy formulas. and the inequality (iii) uses the fact that H (P I VR. is obtained from S by splitting one atom of S into two pieces.n a.7 (The subadditivity lemma. t a0b II(Ra) The latter is the same as H(PIS). where the conditional measure. which establishes inequality (b).) If {an} is a sequence of nonnegative numbers which is subadditive.b The entropy of a process.6.6. (2) 1 1 1 H ({X n }) = lim sup — H((X7) = lim sup — p(4) log n n p(4) . .. Lemma 1. that is. 11) P(x1 xnEan E The process entropy is defined by passing to a limit.} of A-valued random variables is defined by H(r) = E p(x)log 1 ATEAn =— p(x'iL)log p(4). it (. To prove (h) it is enough to consider the case when R.6. Proof Let a = infn an /n. a 0 b.) can be expressed as -EEtt(Pt n R a ) log t ab it(Pt n Ra) + (4)11 (Psb liZsb ) Ewpt n Ra ) log It(Pt n Ra) + it(Sb)1-1(Psb ). Given c > 0 choose n so that an < n(a ± 6). Let Pc denote the partition of the set C defined by restricting the sets in P to C.) > O. The entropy of a process is defined by a suitable passage to the limit.. < pan + b. Sb = Rb U Rb. as follows. ENTROPY AS EXPECTED VALUE. an-Fm < an + am . X2. 0 < r < n.. then limn an ln exists and equals infn an /n. the equality (ii) is just the general addition law.

8 (The subset entropy-bound. Towards this end. and let h be the entropy-rate of A as given by the entropy theorem. BASIC CONCEPTS. Theorem I. 0 Let p.6. An alternative formula for the process entropy in the stationary case is obtained by using the general addition law. can then be applied to give H({X n })= lim H (X01X1 nI ) . as in —> pc. The right-hand side is decreasing in n. and process entropy H. The subadditivity property for entropy. from Lemma 1. the following lemma will be useful in controlling entropy on small sets.u (x '0 = h.3(c). and the simple fact from analysis that if an > 0 and an+ i — an decreases to b then an /n —> b._ . Lemma 1. gives lim sup ani /m < a + E. be an ergodic process with alphabet A.) _ E p(a) log p(a) _< p(B) log IBI — p(B)log p(B). Y) = H(Y)± H(X1Y). gives H(X7 +m) < H(X7)+ H(rnTin).6. The next goal is to show that for ergodic processes. Division by m._ log IBIaEB — E 1 The left-hand side is the same as ( p(B) from which the result follows. aEB E P(a)log p(a) + log p(B)) . the process entropy H is the same as the entropy-rate h of the entropy theorem. that is urn—log n 1 1 n .e. . and the fact that np/m —> 1.6. a. n > 0. aEB for any B c A.60 CHAPTER I. If the process is stationary then H(X n + -FT) = H(r) so the subadditivity lemma with an = H(X7) implies that the limit superior in (2) is a limit.5. to produce the formula H(X ° n )— H(X1. (3) n--4co a formula often expressed in the suggestive form (4) H({X n }) = H (X01X_1 0 ) .1 )= H(X0IX:n1). H(X. CI This proves the lemma. 1B) denote the conditional measure on B and use the trivial bound to obtain P(a1B)log p(aIB) . Proof Let p(.

6..(x ) log . [4]. If h = 0 then define G. . ENTROPY AS EXPECTED VALUE. Thus h— E which proves the theorem in the case when h > O. After division by n. The recent books by Gray. Proof First assume that h > 0 and fix E such that 0 < E < h.i. so the sum converges to H({X„}). approaches 1. discuss many information-theoretic aspects of entropy for the general ergodic process. } Remark 1. [7].1 (4) < 2—n(h—E)} and let B. so multiplication by .i(B)log(B).u(xtil) A" = — E .(Bn )log IA 1 — .u(4) <_ n(h ± c).10 Since for ergodic processes. The two sums will be estimated separately.6.d. this part must go to 0 as n —> oc.u(4) log p. = {4: 2—n(h+E) < 1. = A n — G. The bound (5) still holds. Then H (X7) = — E „.u(4) > 2 —ne . while D only the upper bound in (6) matters and the theorem again follows.}) is the same as the entropyrate h of the entropy theorem. 61 Theorem 1. The subset entropy-bound. the process entropy H({X. both are often simply called the entropy. the following holds n(h — e) _< — log . As n —> co the measure of G. [6]. case. since the entropy theorem implies that ti(B) goes to O.8..(4). and Cover and Thomas.u(G n ) and summing produces (6) (h — 6) 1 G.6.u (4 ) log .u(xliz) — E . On the set G. in the i. A detailed mathematical exposition and philosophical interpretation of entropy as a measure of information can be found in Billingsley's book.6. = {4: . B.9 Process entropy H and entropy-rate h are the same for ergodic processes. Lemma 1.u(4)/n. G.SECTION 1. Define G. contains an excellent discussion of combinatorial and communication theory aspects of entropy. [18]. The Csiszdr-Körner book. gives (5) — E A (x) log ii(x) < np. 8„ since I B I < IAIn . from the entropy theorem.

The additivity of entropy for independent random variables. (8) H({X n }) = H(X0IX_ i ) = — E 04 log Mii . The argument used in the preceding paragraph shows that if {X.i.} is Markov of order k then 1/({X}) = H(X0IX:1) = n > k. n > k. Xink+1). BASIC CONCEPTS. Theorem 1.!) = H(X0)• Now suppose {X n } is a Markov chain with stationary vector p and transition matrix M.(4). (7) H({X. . along with a direct calculation of H(X0IX_ 1 ) yields the following formula for the entropy of a Markov chain. Entropy formulas for i. a E A. I. Let {X.}) = H(X i ) = — p(x) log p(x). and Markov processes. (3). see Exercises 1 and 2 in Section 1.c The entropy of i.11 (The Markov order theorem. to obtain H({X.d. Here they will be derived from the definition of process entropy. An alternate proof can be given by using the conditional limit formula. process and let p(x) = Prob(Xi = x).) A stationary process {X n } is Markov of order k if and only if H(X0IX:) = H(X0IX: n1 ).i.6.d. } be an i.3(c).6. Proof The conditional addition law gives H((X o . Xi n k+1 )1X1) = H(X: n k+1 1X11) H(X 0 IXI kl . (3).5.}) = lip H (X0IX1. and the fact that H(Xl Y) = H(X) if X and Y are independent. a from which it follows that E A. H (X 0 1 = H(X01 X -1)• Thus the conditional entropy formula for the entropy of a process. . and Markov processes can be derived from the entropy theorem by using the ergodic theorem to directly estimate p. Recall that a process {Xn } is Markov of order k if Prob (X0 = aiX) = Prob (X0 = alX:k1 ) .d. This condition actually implies that the process is Markov of order k.i.6. The Markov property implies that Prob (X0 = alX) = Prob (X0 = alX_i) . gives H (XI') = H(Xi) which yields the entropy formula H (X2) + H(X n ) = nH (X i ).62 CHAPTER I. Theorem 1.

that is. Thus.6. The following purely combinatorial result bounds the size of a type class in terms of the empirical entropy of the type.d Entropy and types. for n > k. the same number of times it appears in Type-equivalence classes are called type classes. The type class of xi' will be denoted by T WO or by 7. a product measure is always constant on each type class. n Ifi: Pi(alxi) = = an . Two sequences xi' and y'11 are said to be type-equivalent if they have the same type. This fact can be used to estimate the order of a Markov chain from observation of a sample path. 17(130= ii(pl(.12 (The type class bound. The empirical (first-order) entropy of xi' is the entropy of p i (. This fact is useful in large deviations theory. 63 The second term on the right can be replaced by H(X01X: k1 ). The empirical distribution or type of a sequence x E A" is the probability distribution Pi = pl (.14). 14)) = a Pi(aixi)log Pi(aI4). where the latter stresses the type p i = pi (. can then be used to conclude that X0 and X:_ n dent given XI/. Theok+1 are conditionally indepenrem I. for any type p i . provided H(X01X1 k1 ) = H(X0 1X1n1 ). and will be useful in some of the deeper interpretations of entropy to be given in later chapters. If this is true for every n > k then the process must be Markov of order k. a E A. The equality condition of the subadditivity principle. 14) on A defined by the relative frequency of occurrence of each symbol in the sequence 4.SECTION 1. x'1' E T.) < 2r1(131). ENTROPY AS EXPECTED VALUE. k > k*.6.. If the true order is k*. In general the conditional entropy function Hk = H(X0IX: k1 ) is nonincreasing in k.6. I. x E rn Tpn.4(c). then Hks_i > fik* = Hk . 14).6. if and only if each symbol a appears in xi' exactly n pi(a) times. In since np i (a) is the number of times the symbol a appears in a given 4 particular. Theorem 1. that is. The entropy of an empirical distribution gives the exponent for a bound on the number of sequences that could have produced that empirical distribution. To say that the process is Markov of some order is to say that Hk is eventually constant. This completes the proof of the theorem. if each symbol a appears in x'. rather than a particular sequence that defines the type. - Proof First note that if Qn is a product measure on A n then (9) Q" (x)= ji Q(xi) i=1 = fl aEA War P1(a) . E . that is.

Theorem 1. as shown in the Csiszdr-Körner book. Pn (4) has the constant value 2' -17(P' ) on the type class of x7. and k-type-equivalence classes are called k-type classes. the only possible values for . [7]. (9). E r pni . A type pi defines a product measure Pn on An by the formula Pn (z?) = pi (zi).6.6.64 CHAPTER 1.) The number of possible types is at most (n + 1)1A I. while not tight. Replacing Qn by Pi in the product formula. The empirical overlapping k-block distribution or k-type of a sequence x7 E A' is the probability distribution Pk = pk(' 14) on A k defined by the relative frequency of occurrence of each k-block in thesen [i quenk ce +ii'i th_a i± t ki_ S. yields Pn(x7) = 2—n17(PI) . produces Pn(xiz) = aEA pi (a)nPl(a) . where Pk = Pk ( Ix). Proof This follows from the fact that for each a E A.12. as stated in the following theorem. The bound 2' 17. with a < 1. The k-type class of x i" will be denoted by Tpnk . Theorem 1.6. is the correct asymptotic exponential bound for the size of a type class.1 = E I -41 n— k 1 E A k. This establishes Theorem 1. The concept of type extends to the empirical distribution of overlapping k-blocks.14 (The number of k-types bound.13. while if k is also growing. extends immediately to k-types. 12— n 11(” v 1 ) But Pi is a probability distribution so that Pi (Tpni ) < 1. after taking the logarithm and rewriting. . if and only if each block all' appears in xÇ exactly (n — k + 1) pk (alic) times. that if k is fixed the number of possible k-types grows polynomially in n. npi (alfii) are 0. which.13 (The number-of-types bound. x E In other words.) The number of possible k-types is at most (n — k + Note. Two sequences x. BASIC CONCEPTS. and hence = ypn. in particular. 4 E T. Theorem 1. Later use will also be made of the fact that there are only polynomially many type classes. that is.6.' and yç' are said to be k-type-equivalent if they have the same k-type. The bound on the number of types. 1. then the number of k-types is of lower order than IA In. n. The upper bound is all that will be used in this book. but satisfies k < a logiAi n. z E A' .

(a) Determine the entropies of the ergodic components of the (T3 . The proof for general k is obtained by an obvious extension of the argument.b xr it c Tp2.) I7k (fil)1 < (n — Proof First consider the case k = 2. 2)-process. ak ak by formula (8). A suitable bound can be obtained. and hence P x E 2. and a 1. . ENTROPY AS EXPECTED VALUE. Theorem 1. since the k-type measures the frequency of overlapping k-blocks. Note that it is constant on the k-type class 'Tk(xiii) = Tp k of all sequences that have the same k-type Pk as xrii. E ai . with Q replaced b y (2-1) = rf. Let P be the time-0 partition for an ergodic Markov chain with period d = 3 and entropy H > O. n.(1) and after suitable rewrite using the definition of W I) becomes 1 TV I) (Xlii ) = Ti-(1) (X02—(n-1)1 . however.6. since P1) (4) is a probability measure on A n . 2.-j(a k Or 1 )Ti(a ki -1 ) log 75(akiar) E Pk(akil4) log Ti(ak lari ). that is.6.15 (The k-type-class bound. > 0. The entropy FP") is called the empirical (k — 1)-st order Markov entropy of 4. the stationary (k — 1)-st order Markov chain ri 0c -1) with the empirical transition function (ak 14 -1 ) = Pk(ainxii ) „ f k-1 „ir‘ uk IA 1 ) E bk pk and with the stationary distribution given by The Markov chain ri(k-1) has entropy (k - "Aar') = Ebk Pk(ar i 1) -E 1. 2. then E ai log(ai I bi ) ?_ a log(a/b). b.6. .1>ii(1 ) IT P2 I — 1n since ii:(1) (xi) > 1/(n — 1). b = E bi . i = 1.SECTION 1. Prove the log-sum inequality: If a. If Q is Markov a direct calculation yields the formula Q(4) = Q(xi)n Q(bi a)(n—l)p2(ab) a. 65 Estimating the size of a k-type class is a bit trickier.e Exercises. > 0. This formula. This yields the desired bound for the first-order case.(n. by considering the (k — 1)-st order Markov measure defined by the k-type plc ( 14). I.

66 CHAPTER I.(1) = 1/2. so H(Xr)1N H(Xr IS)1N. (Hint: recurrence of simple random walk implies that all the sites of y have been almost surely visited by the past of the walk.) 5. which is the "random walk with random scenery" process discussed in Exercise 8 of Section 1. BASIC CONCEPTS. (Hint: apply the ergodic theorem to f (x) = — log //. . The second interpretation is the connection between entropy and expected code length for the special class of prefix codes. €)-entropy-typical sequences. and use the equality of process entropy and entropy-rate. (Hint: use stationarity. 3. and let ji be the product measure on X defined by . p V Tp T22)-process. S) = H H (SIXr) = H(S)+ H(X l iv IS) for any N> n.1} z . Let p be the time-zero partition of X and let Q = P x P.(xi Ix). and define S(x.) 7.i(4). which leads to the concept of entropy-typical sequence. Let Y = X x X. be an ergodic process of entropy h.) Section 1. Show that the process entropy of a stationary process and its reversed process are the same. Let p be an ergodic process and let a be a positive number. where jp . or simply the set of entropy-typical sequences. (4) < t(xIx) 2". Two simple. Let T be the shift on X = {-1. Show there is an N = N(a) such that if n > N then there is a set Fn . that is. that D (P Il Q) > (1 / 2 ln 2)IP — Q1 2 . n] and note that H (X I v .. (a) Show that D (P v 7Z I Q v > Q) for any partition R. The entropy theorem may be expressed by saying that xril is eventually almost surely entropy-typical. Let p. namely. y) = (T x . measurable with respect to xn oc . useful interpretations of entropy will be discussed in this section.QI 1. For E > 0 and n > 1 define En(E) 1 4: 2—n(h+E) < tt (x in) < 2—n(h-6)} The set T(E) is called the set of (n. with the product measure y = j x it. (b) Prove Pinsker's inequality. such that .u(Fn ) > 1 — a and so that if xn E Fn then 27" 11.bt 1 a2012 . The divergence for k-set partitions is D(P Q) = Ei It(Pi)log(p.a Entropy-typical sequences.(Pi)1 A(Q i )). Prove that the entropy of an n-stationary process {Yn } is the same as the entropy the stationary process {X„} obtained by randomizing the start. and to the related building-block concept. Q)-process.u0(-1) = p. Then show that I-1(r IS = s) is the same as the unshifted entropy of Xs/v+1 . I. Find the entropy of the (S.2. Toy). (b) Determine the entropies of the ergodic components of the (T 3 .7 Interpretations of entropy. if n and c are understood. The first is an expression of the entropy theorem in exponential form. (Hint: use part (a) to reduce to the two set case. then use calculus. (Hint: let S be uniformly distributed on [1.u(Pi ) .) 6.) 4.7.

.7. (Al) List the n-sequences in decreasing order of probability. The cardinality bound on Cn and the probability upper bound 2 2) on members of Tn (E/2) combine to give the bound (Cn n Tn(E/2)) < 2n(12+02—n(h—E/2) < 2—nc/2 Thus . that is. A good way to think of . namely.7q If C. suggested by the upper bound. the measure is eventually mostly concentrated on a set of sequences of the (generally) much smaller cardinality 2n(h+6) • This fact is of key importance in information theory and plays a major role in many applications and interpretations of the entropy concept.u(Tn (E)) = 1.1 (The typical-sequence form of entropy.2. Sometimes it means entropy-typical sequences.7. For a > 0 define the (n.2 (The entropy-typical cardinality bound. eventually almost surely. a)-covering number by Arn (a) = minfICI: C c An. eventually almost surely. eventually almost surely. (A2) Count down the list until the first time a total probability of at least a is reached. 67 The convergence in probability form. even though there are I A In possible sequences of length n. as defined here. it is enough to show that x g C. in information theory. Proof Since .4. Theorem 1. eventually almost surely. fl Another useful formulation of entropy. and the theorem is established. sometimes it means sequences that are both frequency-typical and entropy-typical. which leads to the fact that too-small sets cannot be visited too often. n'T.SECTION 1.i (e/ 2).7. The members of the entropy-typical set T(e) all have the lower bound 2 —n(h+f) on their probabilities. Theorem 1. The phrase "typical sequence. as defined in Section 1. Here the focus is on the entropy-typical idea. and p(C)> a}. Theorem 1. Typical sequences also have an upper bound on their probabilities.(E12).u(Cn n 'T. expresses the connection between entropy and coverings.7. limn .X1' E 'T. is known as the asymptotic equipartition property or AEP.) For each E > 0.I\rn (a) is given by the following algorithm for its calculation. and sometimes it is just shorthand for those sequences that are likely to occur. sometimes it means frequencytypical sequences. on the number of entropy-typical sequences. then . INTERPRETATIONS OF ENTROPY. The context usually makes clear the notion of typicality being used. x E T(e). this fact yields an upper bound on the cardinality of Tn . c A n and ICI _ Ci'.3 (The too-small set principle.(E12)) is summable in n. Since the total probability is at most 1. The preceding theorem provides an upper bound on the cardinality of the set of typical sequences and depends on the fact that typical sequences have a lower bound on their probabilities. n > 1. Ar(a) is the minimum number of sequences of length n needed to fill an afraction of the total probability. Theorem 1.) The set of entropy-typical sequences satisfies 17-n (01 < 2n (h+E) Thus.7.) < 2n(h--€)." has different meaning in different contexts.

Theorem 1. for each n. .) For each a > 0. y.2.4 (The covering-exponent theorem. It is important that a small blowup does not increase size by more than a small exponential factor. The connection between coverings and entropy also has as a useful approximate form.7. lim supn (l/n) log Nn (a) .(4) E x. which will be discussed in the following paragraphs.(6) n cn ) > a(1 — c). The distance dn (x. To state this precisely let H(8) = —6 log 6 (1— 6) log(1 — 8) denote the binary entropy function. If n is large enough then A(T. the proof in Section I. BASIC CONCEPTS. When this happens N. b) = J 1 1 and extend to the metric d on A'. The fact that p. b) be the (discrete) metric on A. and hence. yri`) = — n i=1 yi)• The metric cin is also known as the per-letter Hamming distance. S) = min{dn (4. ) and hence Nn (a) = ICn > IT(e) n Cn > 2n (h-E'A(Tn(E) n cn) > 2n(h-E)a(i 6). fl The covering exponent idea is quite useful as a tool for entropy estimation. 1) and c > O. as n oc.Jce) < iTn (01 < 2 n0-1-0 . for example. = S) < 81. Since the measure of the set of typical sequences goes to I as n oc.i (E)) eventually exceeds a. S) from 4 to a set S C An is defined by dn (x' . by Theorem 1. the covering exponent (1/n)logArn (a) converges to h.68 CHAPTER I.n GI.e..' E S}. and the 6-neighborhood or 6-blowup of S is defined by [S ]. Proof Fix a E (0. 1 dn (4 . < 2—n(h—E) 1— -6 . . that a rotation process has entropy O. suppose. The covering number Arn (a) is the count obtained in (A2). defined by if a = b otherwise. The connection with entropy is given by the following theorem. below. On the other hand. for E rn (E then implies that ACTn(E) n = p. the measure .7. which proves the theorem.(x) < 2 —n(h—E) .5_ h.I = (a).Ern (onc.7.u(T.i(C) > a and I C. Let d(a. defined by 0 d (a . see.

to say that 41 is 1-built-up from Bn is the same as saying that xr is a concatenation of blocks from B„.5. INTERPRETATIONS OF ENTROPY. provided only that n is not too small. the _ 2nH(3) (IAI — 1) n8 . Fix a collection 5. with spacers allowed between the blocks. and //([C]3 ) a}.7. there are at most nS positions i in which xi can be changed to create a member of [ { x7 } 13.4 then exists as n establishes that 1 lim lim — log Afn (a.) The 6-blowup of S satisfies 69 I[S]81 ISI2 n11(6) (1A1 — 1) n3 . If 8 > 0 then the notion of (1-0-built-up requires that xr be a concatenation of blocks from 13. if there is an integer I = I(41 ) and a collection llni..7. are allowed to depend on the sequence xr. for all n. Proof Given x. Lemma 1. the entropy-typical n-sequences can be thought of as the "building blocks. 6—>0 n—> oo n I. subject only to the requirement that the total length of the spacers be at most SM. of disjoint n-length subintervals of [1.SECTION 1. then Irrn(E)ls1 _ both go to 0 as El will hold. Also fix an integer M > n and 8 > O. is required to be a member of 13. mi]: i < 1). This idea and several useful consequences of it will now be developed. given c > 0 there is a S > 0 such that Ir1n(c)131 for all n.=1 (mi — n i + 1) > (1 — (b) E I3. A sequence xr is said to be (1 — 0-built-up from the building blocks B. thought of as the set of building blocks. i c I. M]. 8) = min{ICI: C c An. together with Theorem 1. The reader should also note the concepts of blowup and built-up are quite different. < n(h+2€) 2 In particular. with occasional spacers inserted between the blocks. The .7.5.4.5 (The blowup bound. An application of the ergodic theorem shows that frequency-typical sequences must consist mostly of n-blocks which are entropy-typical. Both the number I and the intervals [ni ." from which longer sequences are made by concatenating typical sequences. implies that I[{4}]51 < yields the stated bound.7.b The building-block concept. c An . Thus. Lemma 1. 8) oc. < 2n(h-1-€) since (IAI — 1)n8 < 20 log IAI and since Slog IA I and H(8) Since IT(e)1 _ < 2n (h+26) 0.. that is. 8) = h. m i ] for which x .7. This proves the blowup-bound lemma. It is left to the reader to prove that the limit of (1/n) log Nen (a. which combinations bound. The blowup form of the covering number is defined by (1) Arn (a. In particular. it follows that if 8 is small enough. Each such position can be changed in IA — 1 ways.. such that (a) Eil. An application of Lemma 1. In the special case when 8 = 0 and M is a multiple of n. it is the minimum size of an n-set for which the 8-blowup covers an a-fraction of the probability.

3. m i ]} of disjoint n-length subintervals that cover all but a 6-fraction of [1. In particular. . while the built-up concept only allows changes in the spaces between the building blocks. This establishes the lemma.m i ] is upper bounded by IA 1 3m . For a fixed configuration of locations. An important fact about the building-block concept is that if S is small and n is large. Proof The number of ways to select a family {[n i .5. BASIC CONCEPTS. by the combinations bound. Then I Dm I IB.) Let DM be the set of all sequences xr that can be (1 — 6)-built-up from a given collection B„ c An. H(8) = —8 log6 — (1 —8) log(1 — S) denotes the binary entropy function. but allows arbitrary selection of the blocks from a fixed collection. Lemma 1. the set of entropy-typical n-sequences relative to c. Y1 E < ) which is.(c). Lemma 1.i E / } . in turn. The argument used to prove the packing lemma.3.7. if B. Thus I Dm I < ni Min IAIM. The proof of this fact. As usual. which is stated as the following lemma. the number ways to fill these with members of B„ is upper bounded by IBnl i < . can be used to show that almost strongly-covered sequences are also almost built-up.5.m i ]. is similar in spirit to the proof of the key bound used to establish the entropy theorem. A sequence xr is said to be (1 — 6)-strongly-covered by B„ C An if E [1. upper bounded by 2 A411(3) . blowup concept focuses on creating sequences by making a small density of otherwise arbitrary changes. Lemma 1.4.6. M — n +1]: Bn }l < 8(M — n +1). which is the desired bound. then the set of M-sequences that can be (1 — 6)-built-up from a given set 8„ C An of building blocks cannot be exponentially much larger in cardinality than the set of all sequences that can be formed by selecting M I n sequences from Bn and concatenating them without spacers.6 (The built-up set bound.4. provided M is large enough. Lemma 1. and if is small enough then 11)m 1 < 2m(h +26) . say {[n i . though simpler since now the blocks all have a fixed length.b. for members of Bn . namely. The building-block idea is closely related to the packing/covering ideas discussed in Section I.1 2mH (0 1 A im s. This is stated in precise form as the following lemma. M] is upper bounded by the number of ways to select at most SM points from a set with M members.70 CHAPTER I. = 7.3 Min and the number of ways to fill the places that are not in Ui [ni. The bound for entropy-typical building blocks follows immediately from the fact that I T(e)l 2 n(h+f) .

then . often of varying lengths.(€)) 1. A suitable application of the built-up set lemma. whose length r may depend on 4 . drawn from some finite alphabet A. For this reason. so that a typical code must be designed to encode a large number of different sequences and accurately decode them.6. Of course. b2. and that xr is eventually in DM. will be surprisingly useful later. in information theory a stationary.c Entropy and prefix codes. In this section the focus will be on a special class of codes. ergodic process with finite alphabet A is usually called a source. = Tn (e). In the following discussion B* denotes the set of all finite-length binary sequences and t(w) denotes the length of a member w E B*. a subject to be discussed in the next chapter.7. for no upper and lower bounds on their probabilities are given.. all that is known is that the members of Dm are almost built-up from entropy-typical n-blocks.7. then xr is (1-0-built-up from Bn• Proof Put mo = 0 and for i > 0.7.8 The preceding two lemmas are strictly combinatorial.7. The goal is to use as little storage space as possible. A standard model is to think of a source sequence as a (finite) sample path drawn from some ergodic process.7. The almost-covering principle.u. (e). implies there are at most 6M/2 indices j < M — n 1 which are not contained in one of the [ni . The assumption that xr is 13.4. Remark 1. where n is a fixed large number. implies that eventually almost surely xr is almost stronglycovered by sequences from T(e). mi]. for which there is a close connection between code length and source entropy.4.c.) If xr is (1-812)-strongly-covered by 13. 71 Lemma 1.5. define ni to be least integer k > m i _ i such that ric+n--1 E 13„. known as prefix codes. Theorem 1. In particular. INTERPRETATIONS OF ENTROPY. in such a way that the source sequence x'1' is recoverable from knowledge of the encoded sequence bf. at least in some average sense. to make the code length as short as possible. M] which are not contained in one of the [ni .(7. An image C (a) is called a codeword and the range of C . there are many possible source sequences.. while the condition that M > 2n18 implies there at most 3M/2 indices j E [M — n 1. if B. and if M > 2n/8. A code is said to be faithful or A (binary) code on A is a mapping C: A noiseless if it is one-to-one. stopping when "" Ic (1 — 3/2)-strongly-covered by within n of the end of xr. is to be mapped into a binary sequence bf = b1. that is. In a typical data compression problem a given finite sequence 4. This result. Lemma 1. . and n is large enough. In combination with the ergodic and entropy theorems they provide powerful tools for analyzing the combinatorial properties of partitions of sample paths.7 (The building-block lemma. The lemma is therefore established. Theorem 1.. m i ]. implies that the set Dm of sequences that are mostly built-up from the building blocks T(e) will eventually have cardinality not exponentially much larger than 2/4(h+6) • The sequences in Dm are not necessarily entropy-typical.7. which can be viewed as the entropy version of the finite sequence form of the frequency-typical sequence characterization of ergodic processes.SECTION 1. b. hence eventually almost surely mostly built-up from members of 7. B*. I.

C(aIC)p. The function that assigns to each a E A the length of the code word C (a) is called the length function of the code and denoted by £(IC) or by r if C is understood. entropy is a lower bound to code length. For codes that satisfy a simple prefix condition. labeled by b. BASIC CONCEPTS. 100 } . For u = u7 E B* and u = vr E B*. 0100. a lower bound which is almost tight.7.7.) E B then there is a directed edge Figure 1. since W is prefix free. the concatenation uv is the sequence of length n m defined by (uv) i = ui 1<i<n n < i < n m. base 2 logarithms are used. In particular. a E C. To develop the prefix code idea some notation and terminology and terminology will be introduced.9 The binary tree for W = {00. and easily proved. The entropy of p.w < 1. An important. is defined by the formula H (A) = E p(a) log aEA 1 that is. where b from u to y.9. however. (See Figure 1. 101. Note that the root of the tree is the empty word and the depth d(v) of the vertex is just the length of the word v. A prefix-free set W has the property that it can be represented as the leaves of the (labeled.. A nonempty set W c B* is prefix free if no member of W is a proper prefix of any member of W. A nonempty word u is called a prefix of a word w if w = u v. where X has the distribution A. property of binary trees is that the sum of 2 —ci(v) over all leaves y of the tree can never exceed 1. The prefix u is called a proper prefix of w if w = u v where y is nonempty.C(aIC) = £(C (a)). is called the codebook. defined by the following two conditions. directed) binary tree T(W). (i) The vertex set V(W) of T (W) is the set of prefixes of members of W.72 CHAPTER I. 0101. Formally. the words w E W correspond to leaves L(T) of the tree T (W). (ii) If y E V(W) has the form v = ub. . The expected length of a code C relative to a probability distribution a on A is E(L) = Ea . together with the empty word A. (As usual. H(j) is just the expected value of the random variable — log . For prefix-free sets W of binary sequences this fact takes the form (2) . it is the function defined by .) Without further assumptions about the code there is little connection between entropy and expected code length.(a).u(X).

Gs } be the equivalence classes.(a) log a 2-C(a1C) E. a lower bound that is "almost" tight.t be a probability distribution on A. holds.SECTION 1. i E [1. INTERPRETATIONS OF ENTROPY. t]. where the second term on the right gives the number of words of length L(2) that have prefixes already assigned. The codewords of a prefix code are therefore just the leaves of a binary tree. a result known as the Kraft inequality.1C)) < H(g) + 1. for any prefix code C.1C)) = Er(aiC)Wa) = a a log 2paic). For prefix codes. Theorem 1.u(a).10 If 1 < < f2 < < it is a nondecreasing sequence of positive integers such that E.1C)).. < 1 then there is a prefix-free set W = {w(i): 1 < i < t} such that t(w(i)) = ti . r=1 Assign the indices in G1 to binary words of length L(1) in some one-to-one way. . The Kraft inequality then takes the form E I Gr I2-Lo-) < 1. A geometric proof of this inequality is sketched in Exercise 10. written in order of increasing length L(r) = L. ) E p. For prefix codes. 73 where t(w) denotes the length of w E W.7. and. Proof Let C be a prefix code. I=1 A code C is a prefix code if a = a whenever C(a) is a prefix of C(a). which is possible since I Gil < 2" ) .. An inductive continuation of this assignment argument clearly establishes the lemma. Lemma 1. the leaf corresponding to C(a) can be labeled with a or with C(a). This fundamental connection between entropy and prefix codes was discovered by Shannon.7. The Kraft inequality has the following converse. Proof Define i and j to be equivalent if ti = Li and let {G1.) Let j. where t i E Gr . since the code is one-to-one. so that IG21 —‹ 2") — IG11 2L(2)—L(1). (i) H(p) E4 (C(. a . (ii) There is a prefix code C such that EIL (C(. the Kraft inequality (3) aEA 1..u (a ) log .7. Thus C is a prefix code if and only if C is one-to-one and the range of C is prefix-free. Assign G2 in some one-to-one way to binary words of length L(2) that do not have already assigned words as prefixes. entropy is always a lower bound on average code length. A little algebra on the expected code length yields E A (C(. This is possible since 1 GI 12—L(1) + I G2 1 2 L(2) < 1.11 (Entropy and prefix codes. (.
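The assignment argument behind the converse to the Kraft inequality can be made concrete in a few lines of code. The sketch below is my own illustration, not the book's construction; the bookkeeping via a running Kraft sum differs slightly from the grouping argument in the proof, but it produces a prefix-free set of binary words with exactly the prescribed lengths. Taking the lengths ⌈-log μ(a)⌉ then gives a Shannon code, which is how the upper bound H(μ) + 1 on expected code length is achieved.

```python
from fractions import Fraction

def prefix_code_from_lengths(lengths):
    """Given a nondecreasing list of codeword lengths satisfying the Kraft
    inequality, return binary words of exactly those lengths forming a
    prefix-free set (each word is the expansion of the running Kraft sum)."""
    assert all(a <= b for a, b in zip(lengths, lengths[1:])), "lengths must be nondecreasing"
    assert sum(Fraction(1, 2 ** l) for l in lengths) <= 1, "Kraft inequality fails"
    words, acc = [], Fraction(0)
    for l in lengths:
        words.append(format(int(acc * 2 ** l), "0{}b".format(l)))  # l-bit expansion of acc
        acc += Fraction(1, 2 ** l)
    return words

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```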

6.(A n ) = H(i)/n. such that. Cn is a prefix n-code with length function .C(a) < 1— log p(a). that is. This proves (i). it follows that = a A.7. In this case. will be called a prefix-code sequence.log 1. a Since.10 produces a prefix code C(a) = w(a). is a stationary process with process entropy H then (a) There is a prefix code sequence {C n } such that 1-4({G}) < H. Theorem I.74 CHAPTER I. n) — . Let /in be a probability measure on A n with per-symbol entropy 1-1. Of interest for n-codes is per-symbol code length . The (asymptotic) rate of such a code sequence.. a faithful n-code Cn : An B* is a prefix n-code if and only if its range is prefix free. . a Thus part (ii) is proved.2. whose length function is L.) If p. relative to a process {X n } with Kolmogorov measure IL is defined by = liM sup n—co E (C(IC n)) The two results. that is. no sequence of codes can asymptotically compress more than process entropy.C(4 IQ.t(a)1. and "too-good" codes do not exist.H n (P. Next consider n-codes. however.12 (Process entropy and prefix-codes. mappings Cn : A n 1—> B* from source sequences of length n to binary words. Theorem 1. BASIC CONCEPTS. n " n A sequence {Cn } .C(a71Cn )/n and expected per-symbol code length E(C(. a E A. since the measure v defined by v(a) = 2 —C(a I C) satisfies v(A) < 1. which completes the proof of the theorem. To prove part (ii) define L(a) = 1. Lemma 1. it is possible to compress as well as process entropy in the limit.(a) a 1— E p(a) log p(a) =1+ H(p). E(1.lCn ))1n. .log tt(a))p. while Theorem I. implies that the first term is nonnegative.7.11(ii) asserts the existence of a prefix n-code Cn such that 1 1 (5) — E (reIC n)) 5.7. for each n. almost-sure versions of these two results will be obtained for ergodic processes. The second term is just H(g).11(i) takes the form 1 (4) H(iin) for any prefix n-code Cn . while the divergence inequality. (b) There is no prefix code sequence {C n } such that 1Z 4 ({C n }) < H.C(4) = . and note that E 2— C (a) < E a a 2logg(a) E a E 1. by the Kraft inequality. Lemma 1. where rx1 denotes the least integer > x. that is. In the next chapter. Thus "good" codes exist. (4) and (5) then yield the following asymptotic results.7.

so that w' is empty and n = m. for example.SECTION 1. so that asymptotic results about prefix codes automatically apply to the weaker concept of faithful code. C(n) = u(n)v(n)w(n).7. The length of w(n) is flog(n +1)1. But then v(n) = v(m). whose expected value is entropy. while both u(n) and v(n) have length equal to f log(1 + Flog(n 1 )1)1. produces prefix codes that minimize expected code length. 75 Remark 1.C(a) = F— log . The first part u(n) is just a sequence of O's of length equal to the length of v(n). INTERPRETATIONS OF ENTROPY. a prefix) to each codeword to specify its length length. Any prefix code with this property is called an Elias code. The key to this is the fact that a faithful n-code can always be converted to a prefix code by adding a header (i. but for n-codes the per-letter difference in expected code length between a Shannon code and a Huffman code is at most 1/n. since both consist only of O's and the first bit of both v(n) and v(m) is a 1. for if u(n)v(n)w(n) = u(m)v(in)w(m)w'. B*. the length of the binary representation of Flog(n + 1)1. This means that w(n) = w(m). The following is due to Elias. [8].) There is a prefix code £:11. Proof The code word assigned to n is a concatenation of three binary sequences. there is no loss of generality in assuming that a faithful code is a prefix code. since both have length equal to the length of u(n). for example. so that. for Shannon codes.13 A prefix code with length function . CI . as far as asymptotic results are concerned. so that. for example. then. such that i(e(n)) = log n + o(log n). The desired bound f(E(n)) = log n + o(log n) follows easily. w(12) = 1100.14 (The Elias-code lemma. u(12)=000. Shannon's theorem implies that a Shannon code is within 1 of being the best possible prefix code in the sense of minimizing expected code length. v(12) = 100.u(a)1 will be called a Shannon code.2. The third part w(n) is the usual binary representation of n. Huffman codes have considerable practical significance.7.d Converting faithful codes to prefix codes. the code length function L(a) = 1— log (a)1 is closely related to the function — log p.7. Lemma 1. The second part v(n) is the binary representation of the length of w(n). u(n) = u(m). so that. A somewhat more complicated coding procedure. which is asymptotically negligible.7.(a). The code is a prefix code. the length of the binary representation of n. Thus E(12) = 0001001100. so the lemma is established. called Huffman coding. This can be done in such a way that the length of the header is (asymptotically) negligible compared to total codeword length. since both have length specified by v(n). since. I. Next it will be shown that.e. For this reason no use will be made of Huffman coding in this book.
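The three-part construction in the proof of the Elias-code lemma translates directly into code. The following Python sketch (illustrative only; the function names are mine) builds E(n) = u(n)v(n)w(n) and a matching decoder, and reproduces the worked example E(12) = 0001001100.

```python
def elias_encode(n):
    """E(n) = u(n) v(n) w(n): w(n) is the usual binary representation of n,
    v(n) is the binary representation of the length of w(n), and u(n) is a
    run of 0's of the same length as v(n)."""
    w = format(n, "b")
    v = format(len(w), "b")
    u = "0" * len(v)
    return u + v + w

def elias_decode(bits):
    """Invert E(n): count the leading 0's to learn the length of v(n), read
    v(n) to learn the length of w(n), then read w(n) itself."""
    k = 0
    while bits[k] == "0":
        k += 1                                   # k = len(u) = len(v)
    length_of_w = int(bits[k:2 * k], 2)
    return int(bits[2 * k:2 * k + length_of_w], 2)

assert elias_encode(12) == "0001001100"          # the worked example in the text
assert all(elias_decode(elias_encode(n)) == n for n in range(1, 1000))
```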

will be called an Elias extension of C. PO.16 Another application of the Elias prefix idea converts a prefix-code sequence {C n } into a single prefix code C: A* i where C is defined by C(4) = e(n)C(4).7. Proposition 1.) Theorem 1. P)-process has entropy O. where Po = [0.7.5). 2. which is clearly a prefix code.7..17 The (T. x E [0. X tiz E An. These will produce an upper bound of the form 2nEn . Geometric arguments will be used to estimate the number of possible (T. For example. Thus Theorem 1. where a is irrational and ED is addition modulo 1. 1). The proof for the special two set partition P = (Po.. a prefix n-code Cn * is obtained by the formula (6) C: (4 ) = ECC(. (4). I. This enables the decoder to determine the preimage . Remark 1.4 I Cn ))Cn (4). where.7. and let P be a finite partition of the unit interval [0. The decoder reads through the header information and learns where C. and. As an application of the covering-exponent interpretation of entropy it will be shown that ergodic rotation processes have entropy 0. where e is an Elias prefix code on the integers. (4) starts and that it is 12 bits long. 1). The header tells the decoder which codebook to apply to decode the received message.76 CHAPTER I. 0. with a bit more effort. for ease of reading. the header information is almost as long as Cn (4). x lii E An . to general partitions (by approximating by partitions into intervals).5. if C. Let T be the transformation defined by T:xi--> x ea.7.15 If .x11 . and P1 = [0. The code CnK.12(b) extends to the following somewhat sharper form. will be given in detail.(4) = 001001100111. (The definition of faithful-code sequence is obtained by replacing the word "prefix" by the word "faithful. e (C(4 ICn )). 1). The argument generalizes easily to arbitrary partitions into subintervals. a word of length 12. In the example.7.e Rotation processes have entropy O. P)-names of length n. The proof will be based on direct counting arguments rather than the entropy formalism. by Lemma 1.14. where En —> 0 as . .u is a stationary process with entropy-rate H there is no faithful-code sequence such that R IL ({C n }) < H. n = 1. a comma was inserted between the header information. since Cn was assumed to be faithful. BASIC CONCEPTS. and the code word C. but header length becomes negligible relative to the length of Cn (4) as codeword length grows. 001001100111. Given a faithful n-code Cn with length function Le ICn )." in the definition of prefix-code sequence. then C:(4) = 0001001100.

Compactness then shows that N can be chosen to be independent of x and y. and hence the (T-P)-names of y and z will agree most of the time.Y11 < 6/4 and n > N. But translation is a rigid motion so that Tmy and Trnz will be close for all m.7. The next lemma is just the formal statement that the name of any point x can be shifted by a bounded amount to obtain a sequence close to the name of z = 0 in the sense of the pseudometric dn. 1). =- 1 n d(xi. n > N. The result then follows from the covering-exponent interpretation of entropy. there cannot be exponentially very many names.y11.yi.18. The key to the proof to the entropy 0 result is that.4. To make the preceding argument precise define the metric Ix . . Thus every name is obtained by changing the name of some fixed z in a small fraction of places and shifting a bounded amount. y) = dn(xç .b) = 1. y E [0. INTERPRETATIONS OF ENTROPY. The first lemma shows that cln (x. The case when lx . y]. if a 0 b. n where d(a.7. which occurs when and only when either T --1 0 E I or (0. 1). tz]: Tlx E = for any interval I c [0. if Ix . Lemma 1.SECTION 1. Consider the case when Ix Y Ii = lx . y) < e. Theorem 1.Y 1 1. Proposition 1. Proposition 1.7. if a = b. which is rotation by -a.yli < €14. Let I = [x. and d(a.7.yl can be treated in a similar manner. y) is continuous relative to Ix . In particular. The proof is left as an exercise. 77 n oo. y E [0.5) E I. The uniform distribution property. some power y = Tx of x must be close to z. can be applied to T -1 .yll = 11 + x . b) = 0. 1) and any x E [0. x.5) E <E This implies that dn (x. Without loss of generality it can be assumed that x < y.19 E.y1. P)-names of x.7. The key to the proof is the following uniform distribution property of irrational rotations.18 (The uniform distribution property. given z and x. to provide an N = N(x. The fact that rotations are rigid motions allows the almost-sure result to be extended to a result that holds for every x.yli = minflx .11 and the pseudometric dn(x. x . respectively.) Inn E [1. Yi). 1). y) such that if n > N then Given E > 0 there is an N such that dn (x. where A denotes Lebesgue measure. y) < 1{j E E I or T -1 (0. The names of x and y disagree at time j if and only if Tx and T y belong to different atoms of the partition P. and {x n } and {yn} denote the (T. Proof Suppose lx . Ergodicity asserts the truth of the proposition for almost every x.

7. Lemma 1. dk (zk Given E > 0.7. (Hint: there is at least one way and at most n1441 2n ways to express a given 4 in the form uw(l)w(2). Proof Now for the final details of the proof that the entropy of the (T. w(k) E An. 2)-process has entropy 0.f Exercises. where j(x) is given by the preceding lemma. respectively. Show that limn (1/n) log.20 CHAPTER I.78 Lemma 1.A2) given for the computation of the covering number. There are at most 2i possible values for the first j places of any name.n (g) > mH(A). to obtain a least positive integer j = j(x) such that IT ] x — Ok < 6/4. yk ) < 69 k > K. there is an integer M and an integer K > M such that if x E [0.) . Given E > 0. for n > K + M. 6) exists. Let H(A) denote the entropy of p. choose N from the preceding lemma so that if n > N and lx — yli < 6/4 then dn (x ..4. apply Kronecker's theorem. The number of binary sequences of length n — j that can be obtained by changing the (n — j)-name of z in at most c(n — j) places is upper bounded by 2 (n— i )h(E) . where each w(i) E An and u and v are the tail and head. (a) Show that Ann ( '2) < (P. let n > K+M and let z = O. X E X. where Nn (a. Since small changes in x do not change j(x) there is a number M such that j(x) < M for all x.2. based on the algorithm (Al . With K = M N the lemma follows. 1. P)-process is 0.T i X) < E. Thus an upper bound on the number of possible names of length n for members of Xi is (7) 2(n— Dh(e) 114 : x E Xill < 2 i IA lnh(6) Since there only M possible values for j. 3. Proposition 1. The unit interval X can be partitioned into measurable sets Xi = Ix : j(x) = j}. + (b) Show that H„. of words w(0). is the binary entropy function. Given E > 0.. M] such that if Z = 0 and y = Tx then i.7. j < M. 6) is defined in (1). where h(6) = —doge —(1 — 6) log(1 — 6).4. determine M and K > M from the preceding lemma. Let g be the concatenated-block process defined by a measure it on A n .5. y) < E. Theorem 1.15.. 1) there is a j = j(x) E [1. I. (Hint: what does the entropy theorem say about the first part of the list?) 2.) ± log n. the number of possible rotation names of 0 length n is upper bounded by M2m2nh(€). Given x. BASIC CONCEPTS.Arn (a. Since h(€) —> 0 as E 1=1 this completes the proof that the (T. w(k — 1)v. Give a direct proof of the covering-exponent theorem. Note that dn—j(Z. according to the combinations bound.

Find an expression for the entropy of it. that the entropy of the induced process is 1/(A)//t(B). 5. Use this idea to prove that if Ea 2 —C(a) < 1. then there is prefix code with length function L(a). VX E AZ. In some cases it is easy to establish a property for a block coding of a process.a Approximation by finite codes. Let C be a prefix code on A with length function L. . for all xri' E An.SECTION 1. Prove Proposition 1. then passing to a suitable limit.8. Suppose {1'.1. 7.7.„} is ergodic. Let {Ifn } be the process obtained from the stationary ergodic process {X} by applying the block code CN and randomizing the start. Show.8. Section 1. using the entropy theorem. 6. then extend using the block-to-stationary code construction to obtain a desired property for a stationary coding. 79 4.9. Show that H({Y„}) < 8. that the entropy-rate of an ergodic process and its reversed process are the same. Let p. Often a property can be established by first showing that it holds for finite codes. The second basic result is a technique for creating stationary codes from block codes. for all x'i' E A n . Two basic results will be established in this section. 1] of length 2 —L(a ) whose left endpoint has dyadic expansion C (a). Associate with each a E A the dyadic subinterval of the unit interval [0. (b) 44) 5_ 2 ± 2n log IA I. a shift-invariant ergodic measure on A z and let B c A z be a measurable set of positive measure. and that this implies the Kraft inequality (2). 9.18. Show that if C„: A n H B* is a prefix code with length function L then there is a prefix code e: A" 1—* B* with length function E such that (a) Î(x) <L(x) + 1.8 Stationary coding. Recall that a stationary coder is a measurable mapping F: A z 1--> B z such that F(TA X ) = TB F (x). STATIONARY CODING.4. using Theorem 1. Stationary coding was introduced in Example 1. in terms of the entropy of the waiting-time distribution.7. Show that the resulting set of intervals is disjoint. The first is that stationary codes can be approximated by finite codes. 10. Use the covering exponent concept to show that the entropy of an ergodic nstationary process is the same as the entropy of the stationary process obtained by randomizing its start. 11. Suppose the waiting times between occurrences of "1" for a binary process it are independent and identically distributed. The definition and notation associated with stationary coding will be reviewed and the basic approximation by finite codes will be established in this subsection. Show. I.


where T_A and T_B denote the shifts on the respective two-sided sequence spaces A^Z and B^Z. Given a (two-sided) process with alphabet A and Kolmogorov measure μ_A, the encoded process has alphabet B and Kolmogorov measure μ_B = μ_A ∘ F^{-1}. The associated time-zero coder is the function f: A^Z → B defined by the formula f(x) = F(x)_0. Associated with the time-zero coder f: A^Z → B is the partition P = P(f) = {P_b: b ∈ B} of A^Z, defined by P_b = {x: f(x) = b}, that is,

(1)   x ∈ P_b if and only if f(x) = b.

Note that if y = F(x) then y_n = f(T^n x) if and only if T^n x ∈ P_{y_n}. In other words, y is the (T, P)-name of x and the measure μ_B is the Kolmogorov measure of the (T, P)-process. The partition P(f) is called the partition defined by the encoder f or, simply, the encoding partition. Note also that a measurable partition P = {P_b: b ∈ B} defines an encoder f by the formula (1), such that P = P(f), that is, P is the encoding partition for f. Thus there is a one-to-one correspondence between time-zero coders and measurable finite partitions of A^Z.

In summary, a stationary coder can be thought of as a measurable function F: A^Z → B^Z such that F ∘ T_A = T_B ∘ F, or as a measurable function f: A^Z → B, or as a measurable partition P = {P_b: b ∈ B}. The descriptions are connected by the relationships

f(x) = F(x)_0,   F(x)_n = f(T^n x),   P_b = {x: f(x) = b}.

A time-zero coder f is said to be finite (with window half-width w) if f(x) = f(x̃) whenever x_{-w}^{w} = x̃_{-w}^{w}. In this case, the notation f(x_{-w}^{w}) is used instead of f(x). A key fact for finite coders is that each P_b is a finite union of cylinder sets, namely,

P_b = ∪ {[x_{-w}^{w}]: f(x_{-w}^{w}) = b},   b ∈ B.
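For a concrete picture of a finite time-zero coder, the following Python sketch (my own illustration; the majority-vote rule is just a stand-in for f) applies a coder of window half-width w as a sliding-window code, so that the output at time n is f(x_{n-w}^{n+w}).

```python
def majority(window):
    """A sample finite time-zero coder on {0,1}: majority vote over the window."""
    return int(sum(window) > len(window) // 2)

def sliding_code(x, w, f):
    """Return the encoded symbols f(x_{n-w}^{n+w}) for every n at which the
    (2w+1)-wide window fits inside the finite sequence x."""
    return [f(tuple(x[n - w:n + w + 1])) for n in range(w, len(x) - w)]

x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
print(sliding_code(x, w=1, f=majority))          # 2w symbols shorter than x
```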

As shown in Example 1.1.9, for any stationary measure μ, cylinder set probabilities for the encoded measure ν = μ ∘ F^{-1} are given by the formula

(2)   ν(y_1^n) = μ(F^{-1}[y_1^n]) = Σ μ(x_{1-w}^{n+w}),

where the sum is over all x_{1-w}^{n+w} such that f(x_{i-w}^{i+w}) = y_i, 1 ≤ i ≤ n.
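Formula (2) can be checked by brute force for small examples. The sketch below is illustrative only; it assumes an i.i.d. measure on {0,1} with P(1) = p, which is not required by the formula, and reuses a majority-vote coder as the finite time-zero coder.

```python
from itertools import product

def encoded_block_probability(y, f, w, p=0.5):
    """Brute-force evaluation of formula (2) for an i.i.d. measure on {0,1}
    with P(1) = p: sum mu(x_{1-w}^{n+w}) over all input blocks whose
    sliding-window image under f is y_1^n."""
    n, total = len(y), 0.0
    for x in product((0, 1), repeat=n + 2 * w):              # all blocks x_{1-w}^{n+w}
        if all(f(x[i:i + 2 * w + 1]) == y[i] for i in range(n)):
            ones = sum(x)
            total += (p ** ones) * ((1 - p) ** (len(x) - ones))
    return total

majority = lambda window: int(sum(window) > len(window) // 2)
print(encoded_block_probability((1, 1, 0), majority, w=1, p=0.4))
```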

The principal result in this subsection is that time-zero coders are "almost finite", in the following sense.

Theorem 1.8.1 (Finite-coder approximation.) If f: A^Z → B is a time-zero coder, if μ is a shift-invariant measure on A^Z, and if ε > 0, then there is a finite time-zero encoder f̃: A^Z → B such that

μ({x: f(x) ≠ f̃(x)}) ≤ ε.

Proof Let P = {P_b: b ∈ B} be the encoding partition for f. Since P is a finite partition of A^Z into measurable sets, and since the measurable sets are generated by the finite cylinder sets, there is a positive integer w and a partition P̃ = {P̃_b} with the following two properties.

(a) Each P̃_b is a union of cylinder sets of the form [x_{-w}^{w}].


(b) Σ_b μ(P_b Δ P̃_b) < ε.

Let f̃ be the time-zero encoder defined by f̃(x) = b, if [x_{-w}^{w}] ⊂ P̃_b. Condition (a) assures that f̃ is a finite encoder, and μ({x: f(x) ≠ f̃(x)}) ≤ ε follows from condition (b). This proves the theorem. □

Remark 1.8.2
The preceding argument only requires that the coded process be a finite-alphabet process. In particular, any stationary coding onto a finite-alphabet process of any i.i.d. process of finite or infinite alphabet can be approximated arbitrarily well by finite codes. As noted earlier, the stationary coding of an ergodic process is ergodic. A consequence of the finite-coder approximation theorem is that entropy for ergodic processes is not increased by stationary coding. The proof is based on direct estimation of the number of sequences of length n needed to fill a set of fixed probability in the encoded process.

Theorem 1.8.3 (Stationary coding and entropy.)
If ν = μ ∘ F^{-1} is a stationary encoding of an ergodic process μ then h(ν) ≤ h(μ).

Proof First consider the case when the time-zero encoder f is finite with window half-width w. By the entropy theorem, there is an N such that if n > N, there is a set C_n of sequences x_{1-w}^{n+w} of measure at least 1/2 and cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. The image f(C_n) is a subset of B^n of measure at least 1/2, by formula (2). Furthermore, because mappings cannot increase cardinality, the set f(C_n) has cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. Thus

(1/n) log |f(C_n)| ≤ h(μ) + ε + δ_n,

where δ_n → 0 as n → ∞, since w is fixed. It follows that h(ν) ≤ h(μ), since entropy equals the asymptotic covering rate, Theorem 1.7.4.

In the general case, given ε > 0 there is a finite time-zero coder f̃ such that μ({x: f(x) ≠ f̃(x)}) ≤ ε². Let F and F̃ denote the sample path encoders and let ν = μ ∘ F^{-1} and ν̃ = μ ∘ F̃^{-1} denote the Kolmogorov measures defined by f and f̃, respectively. It has already been established that finite codes do not increase entropy, so that h(ν̃) ≤ h(μ). Thus, there is an n and a collection C̃ ⊂ B^n such that ν̃(C̃) > 1 − ε and |C̃| ≤ 2^{n(h(μ)+ε)}. Let C = [C̃]_ε be the ε-blowup of C̃, that is,

C = {y_1^n: d_n(y_1^n, ỹ_1^n) ≤ ε, for some ỹ_1^n ∈ C̃}.

The blowup bound, Lemma 1.7.5, implies that

|C| ≤ |C̃| 2^{nδ(ε)},

where δ(ε) → 0 as ε → 0.

Let

D̃ = {x: F̃(x)_1^n ∈ C̃}   and   D = {x: F(x)_1^n ∈ C}

be the respective pull-backs to A^Z. Since μ({x: f(x) ≠ f̃(x)}) ≤ ε², the Markov inequality implies that the set

G = {x: d_n(F(x)_1^n, F̃(x)_1^n) ≤ ε}


has measure at least 1 − ε, so that μ(G ∩ D̃) ≥ 1 − 2ε, since μ(D̃) = ν̃(C̃) > 1 − ε. By definition of G and D, however, G ∩ D̃ ⊂ D, so that

ν(C) = μ(D) ≥ μ(G ∩ D̃) ≥ 1 − 2ε.

The bound |C| ≤ 2^{n(h(μ)+ε+δ(ε))}, and the fact that entropy equals the asymptotic covering rate, Theorem 1.7.4, then imply that h(ν) ≤ h(μ). This completes the proof of Theorem 1.8.3. □

Example 1.8.4 (Stationary coding preserves mixing.)
A simple argument, see Exercise 19, can be used to show that stationary coding preserves mixing. Here a proof based on approximation by finite coders will be given. While not as simple as the earlier proof, it gives more direct insight into why stationary coding preserves mixing and is a nice application of coder approximation ideas.

The σ-field generated by the cylinders [a_m^n], for m and n fixed, that is, the σ-field generated by the random variables X_m, X_{m+1}, ..., X_n, will be denoted by Σ(X_m^n). As noted earlier, the i-fold shift of the cylinder set [a_m^n] is the cylinder set T^{-i}[a_m^n] = [c_{m+i}^{n+i}], where c_{j+i} = a_j, m ≤ j ≤ n.

Let ν = μ ∘ F^{-1} be a stationary encoding of a mixing process μ. First consider the case when the time-zero encoder f is finite, with window width 2w + 1, say. The coordinates y_1^n of y = F(x) depend only on the coordinates x_{1-w}^{n+w}. Thus, if the gap g = m − n exceeds 2w, the intersection [y_1^n] ∩ T^{-m}[ȳ_1^n] is the image under F of C ∩ T^{-m}D, where C and D are both measurable with respect to Σ(X_{1-w}^{n+w}). If μ is mixing then, given ε > 0 and n ≥ 1, there is an M such that

|μ(C ∩ T^{-m}D) − μ(C)μ(D)| < ε,   C, D ∈ Σ(X_{1-w}^{n+w}),   m ≥ M,

which, in turn, implies that

|ν([y_1^n] ∩ T^{-m}[ȳ_1^n]) − ν([y_1^n]) ν([ȳ_1^n])| < ε,   m ≥ M,   g = m − n > 2w.

Thus ν is mixing.

In the general case, suppose ν = μ ∘ F^{-1} is a stationary coding of the mixing process μ with stationary encoder F and time-zero encoder f, and fix n ≥ 1. Given ε > 0, and a_1^n and b_1^n, let δ be a positive number to be specified later and choose a finite encoder F̃ with time-zero encoder f̃, such that μ({x: f(x) ≠ f̃(x)}) < δ. Thus,

μ({x: F(x)_1^n ≠ F̃(x)_1^n}) ≤ Σ_{i=1}^{n} μ({x: f(T^i x) ≠ f̃(T^i x)}) = n μ({x: f(x) ≠ f̃(x)}) < nδ,

by stationarity. But n is fixed, so that if δ is small enough then ν(a_1^n) will be so close to ν̃(a_1^n) and ν(b_1^n) so close to ν̃(b_1^n) that

(3)   |ν(a_1^n)ν(b_1^n) − ν̃(a_1^n)ν̃(b_1^n)| ≤ ε/3,

for all a_1^n and b_1^n. Likewise, for any m ≥ 1,

μ({x: F(x)_1^n ≠ F̃(x)_1^n, or F(x)_{m+1}^{m+n} ≠ F̃(x)_{m+1}^{m+n}}) < 2nδ,

so that

(4)   |ν([a_1^n] ∩ T^{-m}[b_1^n]) − ν̃([a_1^n] ∩ T^{-m}[b_1^n])| ≤ ε/3,

provided only that δ is small enough, uniformly in m, for all a_1^n and b_1^n. Since ν̃ is mixing, being a finite coding of μ, it is true that

|ν̃([a_1^n] ∩ T^{-m}[b_1^n]) − ν̃([a_1^n]) ν̃([b_1^n])| < ε/3,

provided only that m is large enough, and hence, combining this with (3) and (4), and using the triangle inequality yields

|ν([a_1^n] ∩ T^{-m}[b_1^n]) − ν([a_1^n]) ν([b_1^n])| < ε,

for all sufficiently large m, provided only that δ is small enough. Thus, indeed, stationary coding preserves the mixing property.

I.8.b From block to stationary codes.

As noted in Example 1.1.10, an N-block code C_N: A^N → B^N can be used to map an A-valued process {X_n} into a B-valued process {Y_n} by applying C_N to consecutive nonoverlapping blocks of length N,

Y_{jN+1}^{(j+1)N} = C_N(X_{jN+1}^{(j+1)N}),   j = 0, 1, 2, ...
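In code, the nonoverlapping block coding amounts to a loop over consecutive N-blocks. The following Python sketch is my own illustration, with a toy complementing block code; it applies C_N block by block and simply discards a ragged final block.

```python
def block_code(x, N, C_N):
    """Apply the N-block code C_N to consecutive nonoverlapping N-blocks of
    the finite sequence x; an incomplete final block is simply dropped here."""
    y = []
    for j in range(len(x) // N):
        y.extend(C_N(x[j * N:(j + 1) * N]))
    return y

flip = lambda block: [1 - s for s in block]               # toy 3-block code on {0,1}
print(block_code([0, 0, 1, 1, 1, 0, 1], N=3, C_N=flip))   # [1, 1, 0, 0, 0, 1]
```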

If {X_n} is stationary, then a stationary process {Ỹ_n} is obtained by randomizing the start, i.e., by selecting an integer U ∈ [1, N] according to the uniform distribution and defining Ỹ_j = Y_{U+j-1}, j = 1, 2, .... The final process {Ỹ_n} is stationary, but it is not, except in rare cases, a stationary coding of {X_n}, and nice properties of {X_n}, such as mixing or even ergodicity, may get destroyed. A method for producing stationary codes from block codes will now be described. The basic idea is to use an event of small probability as a signal to start using the block code. The block code is then applied to successive N-blocks until within N of the next occurrence of the event. If the event has small enough probability, sample paths will be mostly covered by nonoverlapping blocks of length exactly N to which the block code is applied.

Lemma 1.8.5 (Block-to-stationary construction.)
Let μ be an ergodic measure on A^Z and let C: A^N → B^N be an N-block code. Given ε > 0 there is a stationary code F = F_C: A^Z → B^Z such that for almost every x ∈ A^Z there is an increasing sequence {n_i: i ∈ Z}, which depends on x, such that Z = ∪_i [n_i, n_{i+1}) and

(i) n_{i+1} − n_i ≤ N, i ∈ Z.

(ii) If J_n is the set of indices i such that [n_i, n_{i+1}) ⊂ [−n, n] and n_{i+1} − n_i < N, then lim sup_n (1/2n) Σ_{i∈J_n} (n_{i+1} − n_i) ≤ ε, almost surely.

(iii) If n_{i+1} − n_i = N then y_{n_i}^{n_{i+1}-1} = C(x_{n_i}^{n_{i+1}-1}), where y = F(x).


Proof Let D be a cylinder set such that 0 < μ(D) < ε/N, and let G be the set of all x ∈ A^Z for which T^m x ∈ D for infinitely many positive and negative values of m. The set G is measurable and has measure 1, by the ergodic theorem. For x ∈ G, define m_0 = m_0(x) to be the least nonnegative integer m such that T^m x ∈ D, then extend to obtain an increasing sequence m_j = m_j(x), j ∈ Z, such that T^m x ∈ D if and only if m = m_j, for some j. The next step is to split each interval [m_j, m_{j+1}) into nonoverlapping blocks of length N, starting with m_j, plus a final remainder block of length shorter than N, in case m_{j+1} − m_j is not exactly divisible by N. In other words, for each j let q_j and r_j be nonnegative integers such that m_{j+1} − m_j = q_j N + r_j, 0 ≤ r_j < N, and form the disjoint collection I_x(m_j) of left-closed, right-open intervals

[m_j, m_j + N), [m_j + N, m_j + 2N), ..., [m_j + q_j N, m_{j+1}).

All but the last of these have length exactly N, while the last one is either empty or has length r_j < N. The definition of G guarantees that for x ∈ G, the union ∪_j I_x(m_j) is a partition of Z. The random partition ∪_j I_x(m_j) can then be relabeled as {[n_i, n_{i+1}), i ∈ Z}, where n_i = n_i(x), i ∈ Z. If x ∉ G, define n_i = i, i ∈ Z. By construction, condition (i) certainly holds for every x ∈ G. Furthermore, the ergodic theorem guarantees that the average distance between m_j and m_{j+1} is at least N/ε, so that (ii) also holds, almost surely.

The encoder F = F_C is defined as follows. Let b be a fixed element of B, called the filler symbol, and let b^j denote the sequence of length j, 1 ≤ j < N, each of whose terms is b. If x is a sequence for which Z = ∪_i [n_i, n_{i+1}), then y = F(x) is defined by the formula

y_{n_i}^{n_{i+1}-1} = b^{n_{i+1}-n_i},   if n_{i+1} − n_i < N,
y_{n_i}^{n_{i+1}-1} = C(x_{n_i}^{n_{i+1}-1}),   if n_{i+1} − n_i = N.

This definition guarantees that property (iii) holds for x ∈ G, a set of measure 1. For x ∉ G, define F(x)_i = b, i ∈ Z. The function F is certainly measurable and satisfies F(T_A x) = T_B F(x), for all x. This completes the proof of Lemma 1.8.5. □

The blocks of length N are called coding blocks and the blocks of length less than N are called filler or spacer blocks. Any stationary coding F of μ for which (i), (ii), and (iii) hold is called a stationary coding with ε-fraction of spacers induced by the block code C_N. There are, of course, many different processes satisfying the conditions of the lemma, since there are many ways to parse sequences so that properties (i), (ii), and (iii) hold; for example, any event of small enough probability can be used, and how the spacer blocks are coded is left unspecified in the lemma statement. The terminology applies to any of these processes.

Remark 1.8.6
Lemma 1.8.5 was first proved in [40], but it is really only a translation into process language of a theorem about ergodic transformations first proved by Rohlin, [60]. Rohlin's theorem played a central role in Ornstein's fundamental work on the isomorphism problem for Bernoulli shifts, [46, 63].
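The parsing used in the proof is easy to imitate on a finite stretch of sample path. The Python sketch below is my own illustration: a fixed pattern plays the role of the cylinder set D, and only a one-sided finite segment is coded. Each gap between successive occurrences of the pattern is cut into coding blocks of length N plus a remainder, the block code is applied to the coding blocks, and the remainders are filled with the filler symbol.

```python
def block_to_stationary(x, N, C, marker, filler=0):
    """Parse x at the occurrences of `marker`, cut each gap [m_j, m_{j+1})
    into blocks of length N plus a short remainder, code the full N-blocks
    with C, and put the filler symbol in the remainder blocks (and in the
    stretch before the first occurrence, since this is a one-sided sketch)."""
    L = len(marker)
    marks = [i for i in range(len(x) - L + 1) if x[i:i + L] == marker]
    y = [filler] * len(x)
    for j, m in enumerate(marks):
        end = marks[j + 1] if j + 1 < len(marks) else len(x)
        pos = m
        while pos + N <= end:                    # coding blocks of length exactly N
            y[pos:pos + N] = C(x[pos:pos + N])
            pos += N
    return y

flip = lambda blk: [1 - s for s in blk]
x = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
print(block_to_stationary(x, N=3, C=flip, marker=[0, 0, 0]))
```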

j ?. as follows. fy .'. The first observation is that if stationarity is not required. define the function Cn : A n H {0.. the problem is easy to solve merely by periodically inserting blocks of O's to get bad behavior for one value of n. 1)/2+1). see [26.. A proof for the i.d.') > 1. uyss+ for any starting place s. 1 " by changing the first 4n 1 i2 terms into all O's and the final 4n 1 12 terms into all O's. STATIONARY CODING. Section 11.5. The proof for the binary case and growth rate À(n) = n 112 will be given here. . Cn (4) = y. The string-matching problem is of interest in DNA modeling and is defined as follows. that is. 85 I. The n-block coding {I'm ) of the process {X„.8. indicate that {Y..5. be forced to have entropy as close to the entropy of {X n } as desired.5 provides a powerful tool for making counterexamples. and extended to a larger class of processes in Theorem 11.d.SECTION 1. For i. where } (5) 0 Yi = 1 xi i <4n 112 or i > n — 4n 112 otherwise.1. for some O <s<t<n—k} . for each n > 64.i. The block-to-stationary code construction of Lemma 1. if the start process {X n } is i.d. A negative solution to the question was provided in [71]. since two disjoint blocks of O's of length at least n 112 must appear in any set of n consecutive places.i. 2]. {Yn }.(n) —> 0. as a simple illustration of the utility of block-to-stationary constructions. lim sup i-400 Â(ni) In particular. As an illustration of the method a string matching example will be constructed.} and any positive function X(n) for which n -1 1.1. case using coding ideas is suggested in Exercise 2. at least if the process is assumed to be mixing. almost surely.} is the process obtained from {X. of {X„}. Let pc } .} by the encoding 17(jin 1)n+1 = cn(X. where C„ is the zero-inserter defined by the formula (5). furthermore. and Markov processes it is known that L(x7) = 0(log n). such that L(Y. It can. The problem is to determine the asymptotic behavior of L(4) along sample paths drawn from a stationary. A question arose as to whether such a log n type bound could hold for the general ergodic case. where it was shown that for any ergodic process {X.-on+1) . -Fin) > n 1/2 .c A string-matching example. To make this precise. The construction is iterated to obtain an infinite sequence of such n's. the {Y} process will be mixing. For x E An let L(4) be the length of the longest block appearing at least twice in x1% that is. and an increasing unbounded sequence {ni I. there is a nontrivial stationary coding. Let .i.8. then nesting to obtain bad behavior for infinitely many n.8.} defined by yin clearly has the property —on+1 = cn(X in o . finite-alphabet ergodic process. j > 1. L(x ) = max{k: +k i xss+ xtt±±k.

1} z that maps {X. Given the ergodic process {X n } (now thought of as a two-sided process. j > 1. Combining this observation with the fact that O's are not changed.n }. (6). For n > 64./2 .. the choice of filler means that L(Y(i)?`) > rt. In particular. the sequence of processes {Y(i)} is defined by setting {Y(0)} = {X. and define the sequence of processes by starting with any process {Y(0)} and inductively defining {Y(i)} for i > 1 by the formula {Y --4Ecn {Y(i)m}. UZ { . in particular. let Fi denote the stationary encoder Fi : {0. each coordinate is changed at most once. hence. since stationary coding will be used). and it is not easy to see how to pass to a limit process. provided the n i increase rapidly enough.}.5. let Cn be the zero-inserter code defined by (5). will be used. guarantee that the final limit process is a stationary coding of the starting process with the desired string-matching properties. but mixing properties are lost. Property (6) guarantees that no coordinate is changed more than once. and let {ci } be a decreasing sequence of positive numbers. Lemma 1.) {YU — l)m} -14 {Y(i)ml. {n i : i > 11 be an increasing sequence of integers. The rigorous construction of a stationary coding is carried out as follows. and f is filler. guarantees that a limit process 1/2 {Y(oo)„. Randomizing the start at each stage will restore stationarity. and hence there is a limit code F defined by the formula [F(x)]m = lim[F i (x)] m . which will. 1 < j < for all s. to be specified later.n } by stationary coding using the codebook Cn. Furthermore.E. 11 z {0. where u is the end of a codeword.1. where u is the end of a codeword and v the beginning of a codeword.8. the i-th iterate {Y(i) m } has the property L(Y(i): +7') > n/2 . in turn. An appropriate use of block-to-stationary coding. (6) Y(i — 1) m = 0 =z= Y (i) m = O. which. xE 0. BASIC CONCEPTS. at each stage converts the block codes into stationary codings. for all m E Z and i > 1. the final process has a positive frequency of l's. since this always occurs if Y(i)?' is the concatenation uv. that is. both to be specified later. so that the code at each stage never changes a 0 into a 1. . The preceding construction is conceptually quite simple but clearly destroys stationarity. At each stage the same filler symbol. + 71 ) > n.} exists for which L(Y(oo) :. b = 0.n } to {Y(i) m }. in turn. mEZ. with a limiting Et-fraction of spacers. and long matches are even more likely if Y(i)?' is the concatenation uf v. Furthermore.n } and inductively defining {Y(i) m } for i > 1 by the formula (cn . Since stationary codings of stationary codings are themselves stationary codings. < j.86 CHAPTER I. since O's inserted at any stage are not changed in subsequent stages. yields (7) L(Y(i)7 J ) > ni ll2 . For any start process {Y(0). each process {Y(i) m } is a stationary coding of the original process {X. y is the beginning of a codeword. mixing is not destroyed. where the notation indicates that {Y(j) m } is constructed from {Y(i — 1). let {n i } be an increasing sequence of natural numbers. For each i > 1.

Since (7) hold j > 1. the d-distance between processes is often difficult to calculate. Suppose C: A N B N is an N-code such that Ei. however. then the sequence {ni(x)} of Lemma 1. the entropy of the final process will be as close to that of the original process as desired. the limit process has the property L(Y(oo)) > the limiting density of changes can be forced to be as small as desired. as are other classes of interest.9. called the dtopology.SECTION 1. o (Hint: use the preceding exercise. but it has two important defects. The collection P(A") is just the collection of processes with alphabet A.n } defined by the stationary encoder F. I. PROCESS TOPOLOGIES. weak limits of ergodic processes need not be ergodic and entropy is not weakly continuous. The above is typical of stationary coding constructions. Many of the deeper recent results in ergodic theory depend on the d-metric. On the other hand.. The d-metric is not so easy to define. merely by making the ni increase rapidly enough and the Ei go to zero rapidly enough.9. A sequence {. Show that there is a stationary code F such that limn Ett (dn (4. One concept. p(x). for each i. then Lemma 1. I.d. where A is a finite set.„ m E Z is the stationary coding of the initial process {X.) 2.a The weak topology. The weak topology is separable. (Hint: replace D by TD for n large. where it is mixing. processes. namely. declares two processes to be close if their joint distributions are close for a long enough time. (") (a lic) = n-4. for each al iv E AN . 87 The limit process {Y(oo)„z } defined by Y(oo).(dN(x fv . F(x)7) < < E 1-1(un o 2E and such that h(o. The subcollection of stationary processes is denoted by P s (A") and the subcollection of stationary ergodic processes is denoted by 7'e (A°°).00 . X = Anand X = A". The other concept. declares two ergodic processes to be close if only a a small limiting density of changes are needed to convert a typical sequence for one process into a typical sequence for the other process. entropy is d-continuous and the class of ergodic processes is d-closed.u (n) } of measures in P(A") converges weakly to a measure it if lim p. The collection of (Borel) probability measures on a compact space X is denoted by Of primary interest are the two cases.8.) Section 1. and it plays a central role in the theory of stationary codings of i. compact.5 is applied to produce a stationary process with the same limiting properties.i.9 Process topologies. 1.n = limé Y(i). Two useful ways to measure the closeness of stationary processes are described in this section. and easy to describe. a theory to be presented in Chapter 4. In particular. Show that if t is mixing and 8 > 0.8.d Exercises. First a sequence of block codes is constructed to produce a (nonstationary) limit process with the properties desired. called the weak topology. C(4)) < E.8.5 can be chosen so that Prob(x 1— ' = an < 8 . Furthermore. and the d-topology is nonseparable.

In particular. let Hi. iii .. For example. The process p.2e.) as k — > oc. which has entropy log 2.vik = E liu(a/1 at denotes the k-th order distributional (variational) distance. Since there are only countably many 4. (n ) has entropy 0.. . then lim sup.u. for all k and all'. Furthermore.(4). that puts weight 1/2 on each of the two sequences (0. The weak limit of ergodic measures need not be ergodic.(X01X: ) denote the conditional entropy of X0 relative to X=1. however.°1) } of probability measures and any all'. so that {. Now suppose .°2) be the concatenated-block process defined by the single n2'1 -block x(n). Entropy is weakly upper semicontinuous. The weak topology is Hausdorff since a measure is determined by its values on cylinder sets. The weak topology is a compact topology.u(C) is continuous for each cylinder set C.u (n) is the Markov process with transition matrix M= n [ 1 — 1/n 1/n 1/n 1 1 — 1/n j then p weakly to the (nonergodic) process p.u (n) (4) = 2-k . and each . weak convergence coincides with convergence of the entries in the transition matrix and stationary vector. the usual diagonalization procedure produces a subsequence {n i } such that { 0 (a)} converges to. so the class of stationary measures is also compact in the weak topology.) and (1. where X ° k is distributed according to p. The weak topology is a metric topology. Theorem 1.. is stationary if each A (n ) is stationary. H(a) < H(p).u. relative to the metric D(g.u(n) is stationary. The weak topology on P(A") is the topology defined by weak convergence. let x(n) be the concatenation of the members of {0. given. Thus. Proof For any stationary measure p. If k is understood then I • I may be used in place of I • lk.).1 If p (n ) converges weakly to p. the limit p.(0(X 0 IX11) < H(A)-1.u. Indeed. say. 0. the sequence {.u (n) } converges weakly to the coin-tossing process. E > 0 there is a k such that HA (X0I < H(p) €. is a probability measure. . y) where . p.u (n ) converges weakly to p. Since Ho (Xo l XII) depends continuously on the probabilities . BASIC CONCEPTS. yet lim . 1}n in some order and let p.(X 0 1X:1) decreases to H(.9. 1. for every k and every 4. V k. and recall that 111. if .u(a k i +1 ). it must be true that 1-1/. It is the weakest topology for which each of the mappings p . It is easy to check that the limit function p.88 CHAPTER I.) — v(a 101. One way to see this is to note that on the class of stationary Markov processes.. for any sequence {. A second defect in the weak topology is that entropy is not weakly continuous. for each n. as i —> oc. hence has a convergent subsequence.°1) (a)} is bounded.. .

let g (P) denote the binary i. Theorem 1. Theorem 1..u (P) -typical sequence.T (p)) is invariant since ii(y. This extends to the It measures the fraction of changes needed to convert 4 into pseudometric ci(x. The (minimal) limiting (upper) density of changes needed to convert y into a g-typical sequence is given by d(y. Note. PROCESS TOPOLOGIES. y): x E T(11.9.2 If p.T (p)) is v-almost surely constant. y) is the limiting (upper) density of changes needed to convert the (infinite) sequence x into the (infinite) sequence y.i. then H(g. y) = lirn sup d(4 . and its converse.d.SECTION 1. yi). x) = d(T y.) Example 1. .) To illustrate the d-distance concept the distance between two binary i. A precise formulation of this idea and some equivalent versions are given in this subsection. in particular.d. y) is the limiting per-letter Hamming distance between x and y. a(x. I. hence f (y) is almost surely constant. the measure p.9. where d(a.i.d.b) = { 1 n i =1 a=b a 0 b. that is.b The d-metric. The d-distance between two ergodic processes is the minimal limiting (upper) density of changes needed to convert a typical sequence for one process into a typical sequence for the other. and y is the limiting density of changes needed to convert y into a g-typical sequence. (n) ) < H(. The constant of the theorem is denoted by d(g. let x be a . then d( y. y ) = — d(xi. process such that p is the probability that X 1 = 1.) + 2E.u. Let T(g) denote the set of frequency-typical sequences for the stationary process it. 89 for all sufficiently large n. Theorem 1. .(n)(X01X11) is always true. is actually a metric on the class of ergodic processes is a consequence of an alternative description to be given later. the set of all sequences for which the limiting empirical measure is equal to p.3 (The d-distance for i. as defined. the d-distance between p.i. so that x and y must .1. But if this is true for some n. T (p. y. Suppose 0 < p <q < 1/2. For 0 < p < 1. Tx). (The fact that d(g. This proves Theorem 1. that is.2. processes will be calculated. processes.9.4. and let y be a g (q) -typical sequence. Proof The function f (y) = 41(y . y) and is called the d-distance between the ergodic processes it and v. that for v-almost every v-typical sequence y. LI since Ii(g (n ) ) < 1-11.9.9. y) on A.)) = inf{d(x.)}. By the typical-sequence theorem.4. is ergodic if and only if g(T(g)) = 1.i)• Thus ci(x. The sequence x contains a limiting p-fraction of l's while the sequence q contains a limiting q-fraction of l's. defined by d(x. and y are ergodic processes. The per-letter Hamming distance is defined by 1 x---N n O d„(4 .1. v).

p (q ) ) = g — p. to be given later in this section. To remedy these defects an alternative formulation of the d-d limit of a distance dn un . and its shift Ty. Thus it is enough to find.2 serves as a useful guide to intuition. and hence. b). separated by an extra 0 or 1 about every 1050 places. a typical /L-sequence x has blocks of alternating O's and l's. (With a little more effort it can be shown that d(.90 CHAPTER I. that is. 7" (v)) — 1/2. for . 1.9.1 The joining definition of d. and hence ii(A (P) . The definition of d-distance given by Theorem 1. Since Z is also almost surely p (r ) -typical.9. where e is addition modulo 2. y) is exactly 1/2.u. (1) .)typical. b).u and y as marginals.u (P ) -typical x. hence the nonmixing process v cannot be the d-limit of any sequence of mixing processes.) As another example of the d-distance idea it will be shown that the d-distance between two binary Markov chains whose transition matrices are close can nevertheless be close to 1/2.(x. d-far-apart Markov chains.[ p 1 — p 1—p [0 1 1 N= p 1' 1 where p is small. y) > g — p. for each p9' ) -typical x. Likewise.) These results show that the d-topology is not the same as the weak topology. b a A simple geometric model is useful in thinking about the joining concept.4 ( Weakly close. Ty) — 1/2. a(x. and let r be the solution to the equation g = p(1 — r) ± (1 — p)r . one . kt and y. d -distance This result is also a simple consequence of the more general definition of .a(a) = E X(a. see Exercise 5.u. which produces only errors. see Exercise 4. BASIC CONCEPTS. is represented by a partition of the unit square into horizontal . for almost any Z. and about half the time the alternations are out of phase.d. On the other hand. the values {Zs : s E SI are independent of the values {Zs : s 0 S}. v(b) = E X(a. This new process distance will also be denoted by d(A.2.th order distributions will be developed. Later it will be shown that the class of mixing processes is a-closed.9. that is.u (q ) -typical. for kt converges weakly to y as p —> 0. Thus there is indeed at least one . For any set S of natural numbers. there is therefore at least one . A joining of two probability measures A and y on a finite set A is a measure A.. since about half the time the alternation in x is in phase with the alternation in y. random sequence with the distribution of the Afr ) -process and let Y = x ED Z. and y be the binary Markov chains with the respective transition matrices M. so that ci(x . y) = g — p. v). vn ) between the corresponding n .b. In a later subsection it will be shown that this new process distance is the constant given by Theorem 1.i. Let Z be an i.u-almost all x. disagree in at least a limiting (g — p)-fraction of places. (q ) -typical y such that cl(x. on A x A that has . Y is . which produces no errors. but it is not always easy to use in proofs and it doesn't apply to nonergodic istance as a processes. Thus d. First each of the two measures.u (q ) -typical y such that ii(x. a(x.9. Let p. y) — 1/2.u (r) -typical z such that y = x Ef) z is . ( A key concept used in the definition of the 4-metric is the "join" of two measures.. Fix a . The y-measure is concentrated on the alternating sequence y = 101010. Example 1. say p = 10-50 . y) = g — p.

y) satisfies the triangle inequality and is strictly positive if p.9.c) R(2. and v are probability measures on An .b) R(2a) 1 2 R(2.a) R(2.5. PROCESS TOPOLOGIES. where p.1In.1 (c)). b): a E A} is exactly the y-measure of the rectangle corresponding to b. y. The joining condition . 6) means that. The second joining condition. ai ) a (q5-1 (ai ) n ( . . for each a. the use of the unit square and rectangles is for simplicity only.c) R(1.a) R(0. a) to (A. y) denote the set of joinings of p. means that the total mass of the rectangles {R(a. 0 R(0. 91 strips. since expectation is continuous with respect to the distributional distance IX — 5. V) = min Ejdn(4 .) In other words.5 Joining as reassembly of subrectangles.c) Figure 1.a) R(0.b):b E A} such that R(a. (/) from (Z. V) is a compact subset of the space 7'(A x A n ).(a. The minimum is attained. with the width of a strip equal to the measure of the symbol it represents.9. that is. where ex denotes expectation with respect to X. Also. (A space is nonatomic if any subset of positive measure contains subsets of any smaller measure. b). A). y).b) R(0. a E A. Turning now to the definition of y).u(a) = Eb X(a. for a finite distribution can always be represented as a partition of any nonatomic probability space into sets whose masses are given by the distribution. hence is a metric on the class P(Afl) of measures on A". one can also think of cutting up the y-rectangles and reassembling to give the A-rectangles. v(b) = Ea X. a) to (A.SECTION 1. and v. from (Z.b) R(2c) R(2. a joining is just a rule for cutting each A-rectangle into subrectangles and reassembling them to obtain the y-rectangles. see Exercise 2. A joining X can then be represented as a pair of measure-preserving mappings. let J(11. 6).u-rectangle corresponding to a can be partitioned into horizontal rectangles {R(a. yti')) = cin v). such that X is given by the formula (2) X(a i . Of course. a) is just a measure-preserving mapping ct. (See Figure 1.a) R(1b) R(1.9. and ik from (Z.a) a R(1. The function 4(A. on a nonatomic probability space (Z. the .b) has area X(a. a measurable mapping such that a(0 -1 (a)) = p(a). A). a) to (A. relative to which Jn(Itt. A measure X on A n x A n is said to realize v) if it is a joining of A and y such that E Vd n (x7. The 4-metric is defined by cin(ti.) A representation of p.

in fact.9. is a metric on and hence d defines a topology on P(A").9.9. v) = min ex(d(xi.6.n (1t7+m . In the stationary case. BASIC CONCEPTS. yl)). namely. in turn. and y as marginals..n . The d-distance between stationary processes is defined as a limit of n-th order dndistance. on the set AL. A joining A of p and v is just a measure on A x A" with p. p: + 41 = pr. un ) must be 0. dn (p.92 CHAPTER I.. q+m). Y) = lirn sup an(An. and v are stationary then (4) cl(p. This establishes Theorem 1. The set of all stationary joinings of p and y will be denoted by Js(p. vn) = limiin(An. which. that is. which implies that A„ = vn for all n. y) is a pseudometric on the class P(A). in turn. un). Theorem 1. v) = 0 and p and y are stationary. if d(p. The d-distance between two processes is defined for the class P(A) of probability measures on A" by passing to the limit. Indeed. y) is a metric.. since in the stationary case the limit superior is a supremum. y). v) = SUP dn (An. (3) n vilz)± m (p. since d.co Note that d(p. which is. the limit superior is. n—). It is useful to know that the d-distance can be defined directly in terms of stationary joining measures on the product space A" x A". . since iin(A. defined as a minimization of expected per-letter Hamming distance over joinings. that is. both the limit and the supremum. v'nV) < ( n + m) di±.. in turn. LI A consequence of the preceding result is that the d-pseudometric is actually a metric on the class P(A) of stationary measures. Theorem 1. a consequence of the definition of dn as an minimum. implies that p = v. ( which implies that the limsup is in fact both a limit and the supremum. Exercise 2.:tr.(B x v(B) = X(A" x B). so the preceding inequality takes the form nd + mdm < n + m)dn+m. where pi denotes the measure induced by p.vn)• Proof The proof depends on a superadditivity inequality. which is.7 (Stationary joinings and d-distance. B E E.6 If i and v are stationary then C1 (1-1 .) If p. then. In the stationary case. The proof of this is left to Exercise 3. d(it. p(B) = A. for all n.

vn) = ex(c1. Yi)) = litridn (tt. PROCESS TOPOLOGIES. This. b) = E (uav. y rit )). for any cylinder set C. let ). yr)) = K-1 ( k=0 H (t v knn -l"n kk"kn-I-1 . It takes a bit more effort to show that d cannot be less than the right-hand side in (4).. To prove (c).7 (a. E.k (d(x i .1s(p.The definition of A (n) implies that if 0 < i < n — j then x A c° )) = Xn(T -` [4] x A) = ting—i [xn) = since An has A n as marginal and it is shift invariant.. (") (C x = A(C). the measure 7. This is done by forming. for it depends only on the first order distribution.(4 . and the latter dominates an (IL. As noted in Exercise 8 of Section 1. if N = Kn+r. the measure on A x A defined by A. which proves the first of the following equalities. implies that lim. 93 Proof The existence of the minimum follows from the fact that the set .kn-Fn Ykn+1 // t( y Kn+r "n kk""Kn-1-1 )1(n+1 . B x B) = . (n) is the measure on (A x A )° defined by A((4. (a) (6) (b) (c) n-+co lim 5:(n)(B x lim 3. 0<r <n. .(n) (A = bt(B). / . The goal is to construct a stationary measure A E y) from the collection {A}. for each n.. yi ) = lima Cin (tt. that is. For each n. y) of stationary joinings is weakly closed and hence compact. va).Î. lim e-)(d(xi. the concatenatedblock process defined by A n . together with the averaging formula. then taking a weakly convergent subsequence to obtain the desired limit process À.u(B).7 be projection of A n onto its i-th coordinate. V) = Ci(p. It is also easy to see that d cannot exceed the minimum of Ex (d(x l . since A n belongs to -In(un.1. for each n choose A n E Jn(tin. yi)). such that an(An. u'bvi). yl)) = ex n (dn (4. y).01) denote the concatenated-block process defined by An. y) then stationarity implies that ex(d(xi. and from the fact that e . V). The proof of (b) is similar to the proof of (a). let A. B E E E. for if A E Js(. Yi)) is weakly continuous.9. Towards this end. va). such that ex (d(Xi . The details are given in the following paragraphs. (5).SECTION 1. v).(n) is given by the formula (5) i "(n) (B) = — n n 1=1 where A.

Note also.„ (dn(4 )11 ))• Ei x7 (a. The limit X must be stationary. bk i )1(x n i1. Thus X E Js(h. ex n (dn(x tiz . yril)). y7)) = since 3:(n) (a. Condition (6c) guarantees that ii(tt. yril) defines three empirical k-block distributions. )). This proves (c).. and X' is a joining of g. and 3. To complete the proof select a convergent subsequence of 1-(n) . y)). "i. and the third. = (n — k 1)p(aii` I x7) counts the number of times af appears in x7.y. I. Lemma 1. and): are all stationary. Fix such a n1 and denote the Kolmogorov measures exists for all k and all al` and defined by the limits in (8) as TI. One difficulty is that limits may not exist for a given pair (x. each side of E(n — k 1)p((af. y. y). for example. if necessary. Proof Note that Further- M.2 Empirical distributions and joinings.1).(d(xi. E d (xi .9. yields an increasing sequence fn . ir(x . y) = Eî(d(ai . y). and is a joining of h and y. y) as a single sequence. hence is not counted in the second term. ( . Two of these. Passing to the limit it follows that . respectively. and plc ( ly7)./ J such that each of the limits (8) lim p(afix). yi) = e7. b)). Yi)). y) = EA. Limit versions of these finite ideas as n oc will be needed. b1 )).) The measures 171. are obtained by thinking of 4 and Ai as separate sequences.94 and note that CHAPTER L BASIC CONCEPTS. ) ) is a joining of pk (. u) = e). Yi)X7 (xi . al since the only difference between the two counts is that a' 2` might be the first k — 1 terms of 4. y) = lim dni ((x in i . b) = (1/n) 1 n x.b. By dropping to a further subsequence.9. 1x). that (7) dn (x. b 101(x7.p) (d(a. The pair (x.8 (The empirical-joinings lemma. I —ICX) j—>00 lim p((a . A simple. since cin(A.j'e ly).0)(d (xi . yi)).. yri') = E p . A diagonalization argument. but important fact. 14)1(4 . by conditions (6a) and (6b).(. yn also exists. is obtained by thinking of the pair (4 . { } 0< (n — k + 2) p(414) — (n k— E p(a'nfi!) <1.(. it can be assumed that the limit i i )) . pk (af 14) and pk (biny). 1(x. The limit results that will be needed are summarized in the following lemma. however. 07 . since. pk ((alic . is that plc ( 1(4 . b). and more.
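The empirical distributions used in the lemma are easy to compute for finite sequences. The following small sketch (illustrative code, not from the text) computes $p_k(\cdot|x_1^n)$ and the pair distribution $p_k(\cdot|(x_1^n, y_1^n))$, and checks the identity (7) relating $d_n(x_1^n, y_1^n)$ to the empirical first-order pair distribution.

```python
from collections import Counter

def empirical_kblocks(seq, k):
    """Empirical overlapping k-block distribution p_k(.|seq) of a finite sequence."""
    seq = list(seq)
    total = len(seq) - k + 1
    counts = Counter(tuple(seq[i:i + k]) for i in range(total))
    return {block: c / total for block, c in counts.items()}

def empirical_pair_kblocks(x, y, k):
    """p_k(.|(x, y)): view the pair as a single sequence over the product alphabet."""
    return empirical_kblocks(list(zip(x, y)), k)

x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 1, 0, 1]
p1 = empirical_pair_kblocks(x, y, 1)
# Identity (7): d_n(x, y) equals the expected Hamming distance under the
# empirical first-order pair distribution.
d_n = sum(prob for block, prob in p1.items() if block[0][0] != block[0][1])
assert abs(d_n - sum(a != b for a, b in zip(x, y)) / len(x)) < 1e-12
```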

10. then X 7r(x.. and (9) A= f A(x ) da)(71. y)). This establishes Theorem 1. means that a(i. while the d-equality is a consequence of (7). But. The joining result follows from the 0 corresponding finite result.are stationary. y) is 4( X . it is concentrated on those pairs (x. PROCESS TOPOLOGIES. This. But if this holds for a given pair.9. 8 . "if and 1. y) = e). The empirical-joinings lemma can be used in conjunction with the ergodic decomposition to show that the d-distance between ergodic processes is actually realized by an ergodic joining. Thus. Li Next it will be shown that the d-distance defined as a minimization of expected perletter Hamming distance over joinings.) must be a stationary joining of g and y.9. (10) d(u.SECTION 1. The integral expression (9) implies.9.) If . is the empirical measure defined by (x. y)) be the ergodic decomposition of A. that x(d(xi yi)) = f yl)) da)(7(x. Likewise. y E T (v). y i )). Theorem 1. y).7. combined with (10). À-almost surely. yl )). since À is a stationary joining of p and y. and hence d(u.9. for À-almost every pair (x. )) . (. ) (d(xi.(x. . .(.2. A(x) is ergodic and is a joining of .9.9 (Ergodic joinings and a-distance. y) E A" x A'.„( .(x..9. Since. yi))• Proof Let A be a stationary joining of g and y for which c/(u. by definition of X.„. y) E (A. since it was assumed that dCu. y).t. y).y) (d(x i . Theorem 1. A-almost surely. the same as the constant given by Theorem 1.9. = ex(d(xi.u and v. 371)). by the empirical-joinings lemma. however. Lemma 1.)-typical.u and y are ergodic then there is an ergodic joining A of . y) = ex(d(xi .4. co is a probability measure on the set of such equivalence classes. Theorem 1.(d(xi let yi)).9. In this expression it is the projection onto frequency-equivalence classes of pairs of sequences. y) is ergodic for almost every (x. Furthermore. it follows that X E ). (x.7(x. the pair (x. as given by the ergodic decomposition theorem.y)) for A-almost every pair (x. is.7.u and y such that Au. y) for which x is g-typical and y is v-typical. yi)). yl) depends only on the first order distribution and expectation is linear. y) < E)..y) (d(xi. y) = EA. since d(xi. in the ergodic case. Theorem 1. 95 is stationary. a strengthening of the stationary-joining theorem.. and A. y) < Ex„(x .

y).9 implies that there is an ergodic joining for v).9. that is.9. Theorem 1. BASIC CONCEPTS. y).. v) = d(y. v-almost surely. or by x(jr) when n is understood.b. Since the latter is equal to 2( x. As earlier. V y E X. Proof First it will be shown that (11) (x. if p. The vector obtained from xri' be deleting all but the terms with indices j i < j2 < < j. The ergodic theorem guarantees that for A-almost all which e(d(x i . also by the empirical-joinings lemma. y). and hence d(p. Here is why this is so. y) E TOO Ç T(i) x T (v). For convenience. This establishes the desired result (12). p((at.') = — n EÀ(d(xi. will be denoted by x n (jr). i=1 (b) (x.10 (Typical sequences and d-distance.u.) If . and 2(x. (11). . v).9. conditioned on the value r of some auxiliary random variable R. Next it will be shown that there is a subset X c A of v-measure 1 such that (12) d(y.T (p)) = d(p. T (A)).(a. and v as marginals. In particular. Theorem 1. v).10.3 Properties and interpretations of the d-distance.96 CHAPTER I.v). yields LIJ the desired result. and h1 . y). as j lemma implies that Î is stationary and has p. v) E-i.. Another convenient notation is X' . y). is the distribution of X and v the to indicate the random vector distribution of Yf'. y) < d(x . (12). I. the following two properties hold. y) E T(A) x T (v) d(p. The empirical-joinings so that dnj (x l i . The lower bound result.Theorem 1. the proof of (11) is finished. bt) for each k and each al' and bt. blf)1(x n oc. the distributional (or variational) distance between n-th order distributions is given by Ii — vin = Elp. (a) cln (4 . Let x be ti-typical and y be v-typical. for example..Y1))=a(u. 17) stands for an (A. there is a set X c A' such that v(X) = 1 and such that for y E X there is at least one x E T(A) for which both (a) and (b) hold. y'. y 1 ') converges to a limit d(x.n . v) d(x .9.(d(Xi.u and v are ergodic then d(. Y1)). (a ) — v(a7)I and a measure is said to be TN -invariant or N-stationary if it is invariant under the N-th power of the shift T. random variable notation will be used in stating d-distance results. . and choose an increasing sequence {n 1 } such that )) converges to a limit A. together with the almost sure limit result. y i )) = pairs (x.

97 Lemma 1.(d„(xr il . (e) d(u. Completeness follows from (a). unless a = b. and hence."1 )). in _m } denote the natural ordering of the complement {1.y(jr)) 5_ ndn (x 11 . (d) If . y o 7-1 ).v In .9. The left-hand inequality in (c) follows from the fact that summing out in a joining of X.. is a joining of X and Yln which extends X. n } . yri') = x(x(r). .11 (Properties of the d-distance. y) (dn (di% b7)) < . o T'. y) satisfies the triangle inequality and is positive if p. dn(A.(XF n Yr n ) = dn(X i ort1)n+1 .m. for any joining X of bt and v.Via . (b) d.Y(j im).y(j"))(X Y(iin—m )) be a joining of the conditional random variables. (f) If the blocks {X i ort on+1 } and Olin 1) .) (a) dn v) _< (1/2)Ip. ±1 } are independent of each other and of past blocks.a7 ) . and mdm (x(r). .Ea ga. Li } and for each ( x (jr ) .(X (j r). Ex(d(a.Yin = 2. X(i in-m )/x( jim) and Y(i n i -m )/y(jr). 14): a7 b7 } ) = 1 . YU)) yr. and that Ex (dn (a7 . PROCESS TOPOLOGIES. 14)) X({(a7 . y). IT) < illr. . )711')) < mEVd m (x(iln ). yuln))x(x(q).. This completes the proof of property (a). Y (if n )) n . yoin -m». together with the fact that the space of measures is compact and complete relative to the metric ii . 2. a). then ci(u. then mcl. v(a)).17(iin—on+i).. Furthermore.2 Emin(u(a7). The fact that cln (A. y is left as an exercise. y) d(p. 2 1 In the case when n = 1. see Exercise 8.u and v are T N -invariant. The right-hand inequality in (c) is a consequence of the fact that any joining X of X(r) and Y(ffn) can be extended to a joining of X and To establish this extension property. since d(a. with equality when n = 1.vi n .m.(ir»(xcili -m). (c) (X (ir).. = min(A(a7). Moreover. for any finite random variable R.1A . let {6. (g) ("in(X7. let yr. X(x(jr).E ga7. < md„.9. there is always at least one joining X* of i and y for which X(4. is a complete metric on the space of measures on An. i2.YD.u . b)) > 1 . . . nE). r P(R = r)d(r Proof A direct calculation shows that IL . b) = 1. y(jr)) n .SECTION 1. v(a)). v) = lim cin (. y( j. The function defined by x*(q. Yf' gives a joining of X(ffn).

then EÀ(dn (4. The proof of Property (g) is left to the reader.11. such that Exj (dn (X17_ 1)n±l' Yii. Another metric. } and {Y(. y) = min minfc > 0: A. versions of each are stated as the following two lemmas. Property (d) for the case N = 1 was proved earlier. dn A.4.. y(ijn-on+1)• The product A of the A is a joining of Xr and Yr. y) = E).(xi' y) ct (a) 20/.. if A E y) is such that X(An(a)) > 1 — a and d:(. y) is a metric on the space P(An) of probability measures on A.12 [d:Gt. y) = ct. y) < cl.9. Yintn ) Edna (. Part (b) of each version is based on the mapping interpretation of joinings.(dn (4 . The Markov inequality implies that X(A n ( N/Fe)) > 1 — NATt. It is one form of a metric commonly known as the Prohorov metric. v).in-on+1) . Yilj.u./. and hence property (f) is established. Proof Let A be a joining of 1.E. yiii)) < E (x7./2-1)n+1' y(i.(.(d n (f. )1)) = a. for each To prove property (f). choose a joining Ai of {Xiiin j. (3). This completes the proof of the lemma. yriz)) < 2a.0 The function dn * ( u. y7)EA.98 CHAPTER I. Lemma 1.9.2-1)n+1 )) = i (.t and y such that dn (A. and define (13) *(/. This completes the discussion of Lemma 1. .u. is obtained by replacing expected values by probabilities of large deviations. 1)n-1-1 } .(xrn . 01. j=1 since mnEx (c1. by the assumed independence of the blocks. (2).`n )) < 1. y) 5_ 24(u. i . and hence nz Manin (X7 m . y) < E. BASIC CONCEPTS.-1)n+1 . v)] 2 5_ 4 (A..tt1)n+1)) i • j=1 The reverse inequality follows from the superadditivity property.. and hence cl. y(i n in property (c). the extension to N > 1 is straightforward. Minimizing over A yields the right-hand inequality since dn . yDA. Let A n (c) = dn (xiz .1)n-1-1 . with bounds given in the following lemma. dn(xii . yrn)) = E nE j (dn (x . and is uniformly equivalent to the 4-metric. equivalent to the 4-metric. Property (e) follows from (c). y) Likewise. and hence du.n (x(irm ). The preceding lemma is useful in both directions.(A n (c)) > 1 — cl.

11 ) < (n — k + 1)p(a in. ±1c-1 0 ak 1' Let c be a positive number and use (14) to choose a positive number S < c. and N1 so that if n > N1 then (15) d( x. fix k and note that if )7 4-k---1 = all' and 4 +k-1 0 al'.-measure at most c. then Y n ([B]. )1). y) < e 2 and . then iin (t. y) < 2e. B) 3). This follows from the fact that if v is ergodic and close to i. To show that small blowups don't change frequencies very much. . PROCESS TOPOLOGIES.14.4 Ergodicity. where [ME = dn(Yç z B) EL (b) If cin (i. Theorem 1. Lemma 1.u) and (An. except for a set of a-measure at most c.un (B) > 1 — c.13 99 (a) If there is a joining À of it and u such that d„(4 . In this section it will be shown that the d-limit of ergodic (mixing) processes is ergodic (mixing) and that entropy is d-continuous. then a small blowup of the set of vtypical sequences must have large ti-measure. y) < c.t.) The a-limit of ergodic processes is ergodic.15 (Ergodicity and d-limits. v). Lemma 1. then yi 0 x i for at least one j E [i. p.u is the d-limit of ergodic processes. 11 ) 3 I p(44) — p(any)1 < c. i + k — 1]. which is just Lemma 1.9. and (Z. for 3 > 0. a) is a nonatomic probability space.*(z)) > el) 5_ E. let [B]s = {y7: dn(y ri' . respectively. .b.) and (A". and small blowups don't change empirical frequencies very much. a) into (An. entropy. such that a(lz: dn((z).4. The key to these and many other limit results is that do -closeness implies that sets of large measure for one distribution must have large blowup with respect to the other distribution. except for a set of A. I.9. y) < c2.9.9. (b) If 4 and are measure-preserving mappings from a nonatomic probability space (Z. almost surely. and d limits. such that dn ( ( Z) (z)) < E. By Theorem 1.t. then cin (it. there are measurepreserving mappings (/) and from (Z. denote the 3-neighborhood or 3-blow-up of B c A". Proof Suppose .14 (a) If anCu.2 it is enough to show that p(aii`lx) is constant in x. respectively.9.(4 .) > 1 — 2e. By the ergodic theorem the relative frequency p(eil`ix) converges almost surely to a limit p(anx). y) < 2E.q) + knd. where the extra factor of k on the last term is to account for the fact that a j for which yi 0 x i can produce as many as k indices i for which y :+ k -1 _ — ak 1 and X tj. a) into (A'.9. v). mixing.SECTION 1. and hence (14) (n — k + 1)p(41). for every a. As before.

Theorem 1. that if y E [A]s then there is a sequence x E Bn such that dn (fi! . y) < 3 2 . completing the proof of Theorem 1.-measure. Suppose A is the a-limit of mixing processes.) Entropy is a-continuous on the class of ergodic processes. v gives measure at least I — 3 to a subset of An of cardinality at most 2n (n( v)+E ) and hence El h(. 57 11 E [B. that stationary coding preserves mixing. Mixing is also preserved in the passage to d-limits.9. since for stationary measures d is the supremum of the there is an N2 > Ni such that if n > N2. Lemma 1. Likewise if 5. Lemma 1. implies that A is ergodic.7.) Choose > N such that p(T) > 1 — 3. BASIC CONCEPTS. and let A.17 (Mixing and a-limits. via finite-coder approximation. vn ) < 82. for sufficiently large n. y) < 8 2 . Theorem 1. the finite sequence characterization of ergodicity.9. Let 8 be a positive number such that un bi for all n > N. however. each of positive p. ] Since [B] large An -measure.9.4. and y such that Evdi (xi . Al. D Theorem 1. choose Tn c An such that ITn I _ 1 as n co. Yi)) = Ex(cin (4. Fix n > N2 and apply Lemma 1.) The d-limit of mixing processes is mixing.')) < 82. Given E > 0. Theorem 1.u) < h (v) 2E. This completes the proof of Theorem 1.14 implies that y([7-.16 (Entropy and d-limits.15. and fix E > 0. Given 8 > 0. Bn = fx ril : I P(414) — v(4)1 < has vn measure at least 1-6.5.1' E [B. and. Let y be an ergodic measure such that cl(A.9.14 to obtain A n ([13. see Example 19. choose N. (Such a 3 exists by the blowup-bound lemma. fix all and VII. let y be a mixing process such that d(A .7. and y are d-close then a small blowup does not increase the set of p.5. Let pc be an ergodic process with entropy h = h(i).9.n . y.-entropy-typical sequences by very much. ] 5 ) > 1 — 2 8 .16. Proof The proof is similar to the proof. and has large v-measure. for all n > N. y) < 3 2 and hence an (11. Note. for each < 2i1(h+E) and p(T) n > N. Proof Now the idea is that if p. yields h(v) < h(p) + 2E.9. . Let y be an ergodic process such that a(A. for n > N I . the set an .]6 ) > 1 — 23. n > Because y is ergodic N1 . be a stationary joining of p.4. so the covering-exponent interpretation of entropy. 8 then there is a sequence i E Bn such that ] 14) — P(a lic l51)1 < E. Likewise.100 CHAPTER I. <8 and hence I 19 (alic 1-741 ) IY7)1 < 6 by (15). The definition of Bn and the two preceding inequalities yield — p(451)1 < 4E. 8.


This implies that $d_n(x_1^n, y_1^n) \le \delta$ except for a set of $\lambda$-measure at most $\delta$, so that if $\delta < 1/n$ then the set $\{(x_1^n, y_1^n)\colon x_1^n \ne y_1^n\}$ has $\lambda$-measure at most $\delta$. Thus by making $\delta$ even smaller it can be supposed that $\mu(a_1^n)$ is so close to $\nu(a_1^n)$ and $\mu(b_1^n)$ so close to $\nu(b_1^n)$ that
$$|\mu(a_1^n)\mu(b_1^n) - \nu(a_1^n)\nu(b_1^n)| \le \epsilon/3.$$
Likewise, for any $m \ge 1$,
$$\lambda(\{(x_1^{m+n}, y_1^{m+n})\colon x_1^n \ne y_1^n \text{ or } x_{m+1}^{m+n} \ne y_{m+1}^{m+n}\})
\le \lambda(\{(x_1^n, y_1^n)\colon x_1^n \ne y_1^n\}) + \lambda(\{(x_{m+1}^{m+n}, y_{m+1}^{m+n})\colon x_{m+1}^{m+n} \ne y_{m+1}^{m+n}\}) \le 2\delta,$$
so that if $\delta < 1/2n$ is small enough, then
$$|\mu([a_1^n] \cap T^{-m}[b_1^n]) - \nu([a_1^n] \cap T^{-m}[b_1^n])| \le \epsilon/3,$$
for all $m \ge 1$. Since $\nu$ is mixing, however,
$$|\nu(a_1^n)\nu(b_1^n) - \nu([a_1^n] \cap T^{-m}[b_1^n])| \le \epsilon/3,$$
for all sufficiently large $m$, and hence the triangle inequality yields
$$|\mu(a_1^n)\mu(b_1^n) - \mu([a_1^n] \cap T^{-m}[b_1^n])| \le \epsilon,$$
for all sufficiently large $m$, provided only that $\delta$ is small enough. Thus $\mu$ must be mixing, which establishes the theorem.

This section will be closed with an example, which shows, in particular, that the d̄-topology is nonseparable.

a, T x = x

fi,

where, as before,

e denotes

addition modulo 1.

Proposition 1.9.19
Let A be an (S x T)-invariant measure on X x X which has Lebesgue measure as marginal on each factor If a and fi are rationally independent then A must be the product measure. Proof "Rationally independent" means that ka ± mfi is irrational for any two rationals k and m with (k, m) (0, 0). Let C and D be measurable subsets of X. The goal is to • show that A.(C x D) =
It is enough to prove this when C and D are intervals and p.(C) = 1/N, where N is an integer. Given E > 0, let C 1 be a subinterval of C of length (1 — E)1N and let

E = C i x D, F = X x D.

102

CHAPTER I. BASIC CONCEPTS.

Since a and fi are rationally independent, the two-dimensional version of Kronecker's theorem, Proposition 1.2.16, can be applied, yielding integers m 1 , m2, , rnN such that if V denotes the transformation S x T then Vms E and X(FAÉ) < 2E, where F' = U 1 Vrn1E .
E -±
n vrni

E = 0, if i j,

It follows that X(E) = X(P)/N is within 26/N of A.(F)/ N = 1,t(C)A(D). Let obtain X(C x D) = This completes the proof of Proposition 1.9.19.

0 to

El

Now let P be the partition of the unit interval that consists of the two intervals Pc, = [0, 1/2), Pi = [1/2, 1). It is easy to see that the mapping that carries x into its (T, 2)-name {x} is an invertible mapping of the unit interval onto the space A z the Kolmogorov measure. This fact, together withwhicaresLbgumont Proposition 1.9.19, implies that the only joining of the (T, 2)-process and the (S, 2)process is the product joining, and this, in turn, implies that the d-distance between these two processes is 1/2. This shows in particular that the class of ergodic processes is not separable, for, in fact, even the translation (rotation) subclass is not separable. It can be shown that the class of all processes that are stationary codings of i.i.d. processes is d-separable, see Exercise 3 in Section W.2.

I.9.c

Exercises.

1. A measure ,u E P(X) is extremal if it cannot be expressed in the form p = Ai 0 ia2. (a) Show that if p is ergodic then p, is extremal. (Hint: if a = ta1 -F(1 — t)A2, apply the Radon-Nikodym theorem to obtain gi (B) = fB fi diu and show that each fi is T-invariant.) (b) Show that if p is extremal, then it must be ergodic. (Hint: if T -1 B = B then p, is a convex sum of the T-invariant conditional measures p(.iB) and peiX — B).) 2. Show that l') is a complete metric by showing that

(a) The triangle inequality holds. (Hint: if X joins X and /7, and X* joins 17 and Zri', then E,T yi)A*(z7ly'll) joins X and 4.) (b) If 24(X?, ri) = 0 then Xi; and 1111 have the same distribution. (c) The metric d(X7, 17) is complete. 3. Prove the superadditivity inequality (3). 4. Let p, and y be the binary Markov chains with the respective transition matrices
m [

P —p

1—p p 1'

N [0 1 1 o]'

1

Let fi be the Markov process defined by M2.

SECTION 1.10. CUTTING AND STACKING.

103

v n = X2n, n = 1, 2, ..., then, almost (a) Show that if x is typical for it, and v surely, y is typical for rt.
(b) Use the result of part (a) to show d(,u, y) = 1/2, if 0 < p < 1. 5. Use Lemma 1.9.11(f) to show that if it. and y are i.i.d. then d(P., y) = (1/2)1A — v I i. (This is a different method for obtaining the d-distance for i.i.d. processes than the one outlined in Example 1.9.3.) 6. Suppose y is ergodic and rt, is the concatenated-block process defined by A n on A n . (p ii ) (Hint: g is concentrated on shifts of sequences Show that d(v ,g) = --n.n, ,.--n,• that are typical for the product measure on (An )° ° defined by lin .)

a.

7. Prove property (d) of Lemma 1.9.11.

* (a7 , a7) = min(p,(a7), v(a)). 8. Show there is a joining A.* of ,u and y such that Xn
9. Prove that ikt — v in = 2 — 2 E a7 min(i.t(a7), v(a)). 10. Two sets C, D C A k are a-separated if C n [D], = 0. Show that if the supports of 121 and yk are a-separated then dk(Ilk.vk) > a. 11. Suppose ,u, and y are ergodic and d(i.t, for sufficiently large n there are subsets v(Dn ) > 1 — E, and dn (x7, y7) > a 1`) ' Ak and Pk('IY1 c ) ' Vk, and k Pk('IX 1 much smaller than a, by (7)) y) > a. Show that if E > 0 is given, then Cn and Dn of A n such that /L(C) > 1 —e, — e, for x7 E Cn , yli' E D. (Hint: if is large enough, then d(4 , y ) cannot be

12. Let (Y, y) be a nonatomic probability space and suppose *: Y 1-4- A n is a measurable mapping such that dn (un , y o * -1 ) < S. Show that there is a measurable mapping q5: y H A n such that A n = V curl and such that Ev(dn(O(Y), fr(y))) <8.

Section 1.10 Cutting and stacking.
Concatenated-block processes and regenerative processes are examples of processes with block structure. Sample paths are concatenations of members of some fixed collection S of blocks, that is, finite-length sequences, with both the initial tail of a block and subsequent block probabilities governed by a product measure it* on S' . The assumption that te is a product measure is not really necessary, for any stationary measure A* on S" leads to a stationary measure p, on Aw whose sample paths are infinite concatenations of members of S, provided only that expected block length is finite. Indeed, the measure tt is just the measure given by the tower construction, with base S", measure IL*, and transformation given by the shift on S°°, see Section 1.2.c.2. It is often easier to construct counterexamples by thinking directly in terms of block structures, first constructing finite blocks that have some approximate form of the final desired property, then concatenating these blocks in some way to obtain longer blocks in which the property is approximated, continuing in this way to obtain the final process as a suitable limit of finite blocks. A powerful method for organizing such constructions will be presented in this section. The method is called "cutting and stacking," a name suggested by the geometric idea used to go from one stage to the next in the construction.

104

CHAPTER I. BASIC CONCEPTS.

Before going into the details of cutting and stacking, it will be shown how a stationary measure ,u* on 8" gives rise to a sequence of pictures, called columnar representations, and how these, in turn, lead to a description of the measure on A'.

I.10.a

The columnar representations.

Fix a set 8 c A* of finite length sequences drawn from a finite alphabet A. The members of the initial set S c A* are called words or first-order blocks. The length of a word w E S is denoted by t(w). The members w7 of the product space 8" are called nth order blocks. The length gel') of an n-th order block wrii satisfies £(w) = E, t vi ). The symbol w7 has two interpretations, one as the n-tuple w1, w2, , w,, in 8", the other as the concatenation w 1 w2 • • • to,„ in AL, where L = E, t(wi ). The context usually makes clear which interpretation is in use. The space S" consists of the infinite sequences iv?"' = w1, w2, ..., where each Wi E S. A (Borel) probability measure 1.1* on S" which is invariant under the shift on S' will be called a block-structure measure if it satisfies the finite expected-length condition,
(

E(t(w)) =

Ef (w),uT(w ) <
WES

onto 8, while, in general, Here 14 denotes the projection of of ,u* onto 8'. Note, by the way, that stationary gives E v( w7 ))

denotes the projection

. E t (w ) ,u.: (w7 ) . nE(t(w)).
to7Esn

Blocks are to be concatenated to form A-valued sequences, hence it is important to have a distribution that takes into account the length of each block. This is the probability measure Â, on S defined by the formula
A( w) —

w)te (w)

E(fi

,WE

8,

where E(ti) denotes the expected length of first-order blocks. The formula indeed defines a probability distribution, since summing t(w),u*(w) over w yields the expected block length E(t 1 ). The measure is called the linear mass distribution of words since, in the case when p,* is ergodic, X(w) is the limiting fraction of the length of a typical concatenation w1 w2 ... occupied by the word w. Indeed, using f (w1w7) to denote the number of times w appears in w7 E S", the fraction of the length of w7 occupied by w is given by

t(w)f (wIw7)
t(w7)

= t(w)

f(w1w7) n
n f(w7)

t(w)11,* (w)
E(t 1 )

= A.(w), a. s.,

since f (w1w7)1 n ---÷ ,e(w) and t(will)In E(t1), almost surely, by the ergodic theorem applied to lu*. The ratio r(w) = kt * (w)/E(ti) is called the width or thickness of w. Note that A(w) = t(w)r(w), that is, linear mass = length x width.

SECTION 1.10. CUTTING AND STACKING.

105

The unit interval can be partitioned into subintervals indexed by 21) E S such that length of the subinterval assigned to w is X(w). Thus no harm comes from thinking of X as Lebesgue measure on the unit interval. A more useful representation is obtained by subdividing the interval that corresponds to w into t(w) subintervals of width r (w), labeling the i-th subinterval with the i-th term of w, then stacking these subintervals into a column, called the column associated with w. This is called the first - order columnar representation of (5 00 , ,u*). Figure 1.10.1 shows the first-order columnar representation of S = {v, w}, where y = 01, w = 011, AIM = 1/3, and /4(w) = 2/3. Ti x

1

X V w

o

Figure 1.10.1 The first-order columnar representation of (S, AT). In the columnar representation, shifting along a word corresponds to moving intervals one level upwards. This upward movement can be accomplished by a point mapping, namely, the mapping T1 that moves each point one level upwards. This mapping is not defined on the top level of each column, but it is a Lebesgue measure-preserving map from its domain to its range, since a level is mapped linearly onto the next level, an interval of the same width. (This is also shown in Figure I.10.1.) In summary, the columnar representation not only carries full information about the distribution A.(w), or alternatively, /4(w) = p,*(w), but shifting along a block can be represented as the transformation that moves points one level upwards, a transformation which is Lebesgue measure preserving on its domain. The first-order columnar representation is determined by S and the first-order distribution AI (modulo, of course, the fact that there are many ways to partition the unit interval into subintervals and stack them into columns of the correct sizes.) Conversely, the columnar representation determines S and Al% The information about the final process is only partial since the picture gives no information about how to get from the top of a column to the base of another column, in other words, it does not tell how the blocks are to be concatenated. The first-order columnar representation is, of course, closely related to the tower representation discussed in Section I.2.c.2, the difference being that now the emphasis is on the width distribution and the partially defined transformation T that moves points upwards. Information about how first-order blocks are concatenated to form second-order blocks is given by the columnar representation of the second-order distribution ,q(w?). Let r (w?) = p(w)/E(2) be the width of w?, where E(t2) is the expected length of w?, with respect to ,4, and let X(w?) = t(w?)r(w?) be the second-order linear mass. The second-order blocks w? are represented as columns of disjoint subintervals of the unit interval of width r (14) and height aw?), with the i-th subinterval labeled by the i-th term of the concatenation tv?. A key observation, which gives the name "cutting and stacking" to the whole procedure that this discussion is leading towards, is that

BASIC CONCEPTS.2 The second-order representation via cutting and stacking. Figure 1.. Note also the important property that the set of points where T2 is undefined has only half the measure of the set where T1 is defined.(vw) = . . which is just half the total width 1/E(t 1 ) of the first-order columns. for the total width of the second-order columns is 1/E(t2). The significance of the fact that second-order columns can be built from first-order columns by appropriate cutting and stacking is that this guarantees that the transformation T2 defined for the second-order picture by mapping points directly upwards one level extends the mapping T1 that was defined for the first-order picture by mapping points directly upwards one level. Indeed. w}. the total mass contributed by the top t(w) levels of all the columns that end with w is Et(w)r(w1w)= Ea(lV t2)) WI Pf(W1W) 2E(ti) =t(w* *(w) Thus half the mass of a first-order column goes to the top parts and half to the bottom parts of second-order columns.1. . If this cutting and stacking method is used then the following two properties hold.10.4(vv) = 1/9„u.2 shows how the second-order columnar representation of S = {v. then it will continue to be so in any second-order representation that is built from the first-order representation by cutting and stacking its columns. the total mass in the second-order representation contributed by the first f(w) levels of all the columns that start with w is t(w) x--.1. where . if y is directly above x in some column of the firstorder representation. then stacking these in pairs. Likewise. can be built by cutting and stacking the first-order representation shown in Figure 1.10.--(t2) 2_.10. the 2m-th order representation can be produced by cutting the columns of the m-th order columnar representation into subcolumns and stacking them in pairs. it is possible to cut the first-order columns into subcolumns and stack them in pairs so as to obtain the second-order columnar representation. . 2 " (w)r(w). One can now proceed to define higher order columnar representations. ._ C i D *********** H E ****** B VW A B C V D H E F G ***************4****+********** w A VV F G ****** *********** WV WW Figure 1. Second-order columns can be be obtained by cutting each first-order column into subcolumns. In general.(wv) = 2/9. as claimed.. 16(ww) = 4/9.u. so. E aw yr ( ww2 ) . le un-v2)\ ( aw* * (w) = 2E( ii) tV2 2 _ 1 toor(w).. Indeed. 2 which is exactly half the total mass of w in the first-order columnar representation.106 CHAPTER I.

I. and the interval L i is called the j-th level of the column. Theorem 1. A labeling of C is a mapping Ar from {1. Some applications of its use to construct examples will be given in Chapter 3. labeled columns. Ar(E(C))) is called the name of C. collection of subintervals of the unit interval of equal positive width (C). Of course. Note that the members of a column structure are labeled columns. .b The basic cutting and stacking vocabulary. .3. The support of a column C is the union of its levels L.10. rather than the noninformative name. 2. Their common extension is a Lebesgue measure-preserving transformation on the entire interval. The labeling of levels by symbols from A provides a partition P = {Pa : a E AI of the unit interval. A column structure S is a nonempty collection of mutually disjoint. The vector Ar(C) = (H(1). using the desired goal (typically to make an example of a process with some sample path property) to guide how to cut up the columns of one stage and reassemble to create the next stage. Ar(2). Thus. in turn. CUTTING AND STACKING. A general form of this fact. The cutting and stacking idea focuses directly on the geometric concept of cutting and stacking labeled columns. The cutting and stacking language and theory will be rigorously developed in the following subsections. f(C) is a nonempty. The columnar representation idea starts with the final block-structure measure . column structures define distributions on blocks. . which is of course given by X(C) = i(C)r (C). The difference between the building of columnar representations and the general cutting and stacking method is really only a matter of approach and emphasis. P)-process is the same as the stationary process it defined by A*. (a) The associated column transformation T2m extends the transformation Tm . in essence.10. and cutting and stacking extends these to joint distributions on higher level blocks. . In particular.u* and uses it to construct a sequence of partial transformations of the unit interval each of which extends the previous ones. Its measure X(C) is the Lebesgue measure X of the support of C. 107 (b) The set where T2m is defined has one-half the measure of the set where Tft.. or height. see Corollary 1.5. and the transformation T will preserve Lebesgue measure. . L2. it is a way of building something new.. and for each j. Thus. but this all happens in the background while the user focuses on the combinatorial properties needed at each stage to produce the desired final process. the interval Lt(c) is called the top. The goal is still to produce a Lebesgue measure-preserving transformation on the unit interval. gives the desired stationary A-valued process. N(j) = Ar(L i ) is called the name or label of L. The suggestive name.10. unless stated otherwise. will be proved later.. which. ordered. f(C)} into the finite alphabet A.10. where Pa is the union of all the levels labeled a. gadget. disjoint. Two columns are said to be disjoint if their supports are disjoint. column structure. is undefined. if cutting and stacking is used to go from stage to stage. Lac)) of length. The interval L1 is called the base or bottom. will be used in this book. The (T. it is a way of representing something already known.SECTION 1. A column C = (L1. hence. . in essence. then the set of transformations {TA will have a common extension T defined for almost all x in the unit interval. which has been commonly used in ergodic theory.

Le (C )) defines a transformation T = T. This idea can be made precise in terms of subcolumns and column partitions as follows.. (An alternative definition of subcolumn in terms of the upward map is given in Exercise 1. and S c S' means that S is a substructure of S'. terminology should be interpreted as statements about sets of labeled columns. L2. The width r(S) of a column structure is the sum of the widths of its columns.. . The base or bottom of a column structure is the union of the bases of its columns and its top is the union of the tops of its columns.is the normalized column Lebesgue measure. A(S) = Ex(c). ) c cyf( a E r(S) r (S) . namely." which is taken as shorthand for "column structures with disjoint support. . BASIC CONCEPTS. 1). for 1 < j < f(C).(c).108 CHAPTER I. . L(i. Note that the base and top have the same . called its upward map. t(C)): i E I} . If a column is pictured by drawing L i+ 1 directly above L1 then T maps points directly upwards one level. for CES (c ) < Ei(c)-r(c) Ex . C = (L1. 2) . such that the following hold. Each point below the base of S is mapped one level directly upwards by T. The transformation T = Ts defined by a column structure S is the union of the upward maps defined by its columns. T is not defined on the top of S and is a Lebesgue measure-preserving mapping from all but the top to all but the bottom of S. It is called the mapping or upward mapping defined by the column structure S. and column structures have disjoint columns. Note that T is one-to-one and its inverse maps points downwards. The precise definition is that T maps Li in a linear and order-preserving manner onto L i+ i. and is not defined on the top level. The width distribution 7 width r (C) r (C) "f(C) = EVES r (D) (45)• The measure A(S) of a column structure is the Lebesgue measure of its support. Note that implicit in the definition is that a subcolumn always has the same height as the column. . (b) The distance from the left end point of L'i to the left end point of L 1 does not depend on j. r(S)." where the support of a column structure is the union of the supports of its columns. Li) is a column C' = (a) For each j. L'e ). each labeled column in S is a labeled column in S' with the same labeling. A column C = (L 1 L2. is a subinterval of L i with the same label as L. . r (5) < 00. this means that expected column height with respect to the width distribution is finite. in other words.) A (column) partition of a column C is a finite or countable collection {C(i) = (L(i. L(i. A subcolumn of L'2 . Columns are cut into subcolumns by slicing them vertically. that is. CES CES Note that X(S) < 1 since columns always consist of disjoint subintervals of the unit interval. An exception to this is the terminology "disjoint column structures.Et(c). In particular. Thus the union S US' of two column structures S and S' consists of the labeled columns of each.

Indeed. for any S' built by cutting and stacking from S.1.SECTION 1. j — f(C1)) t(C1) < j f(Ci) -e(C2). The stacking of C2 on top of C 1 is denoted by C1* C2 and is the labeled column with levels L i defined by = L(1. Note that the width of C1 * C2 is the same as the width of C1. the union of whose supports is the support of C.(C). . and with the upward map TC 2 wherever it is defined. The basic fact about stacking is that it extends upward maps. thus cutting C into subcolumns according to a distribution 7r is the same as finding a column partition {C(i)} of C such that 7r(i) = t(C(i))1 . A column partitioning of a column structure S is a column structure S' with the same support as S such that each column of S' is a subcolumn of a column of S. (iv) If S' is built by cutting and stacking from S. the key properties of the upward maps defined by column structures are summarized as follows.. it is extended by the upward map T = Ts . j): 1 < j < t(C2)) be disjoint labeled columns of the same width.. is built by cutting each of the first-order columns into subcolumns and stacking these in pairs. Partitioning corresponds to cutting. defined by stacking C2 on top of C1 agrees with the upward map Tc„ wherever it is defined. then Ts . j) 1 < j < f(C1) L(2. extends T8. namely. . is now defined on the top of C1. which is also the same as the width of C2.a*). A column structure S' is built by cutting and stacking from a column structure S if it is a stacking of a column partitioning of S. Li with the label of C1* C2 defined to be the concatenation vw of the label v of C 1 and the label w of C2. where S c A*.10. In general. for each i. In summary. . (i) The domain of Ts is the union of all except the top of S. (ii) The range of Ts is the union of all except the bottom of S. j): 1 < j < t(C1)) and C2 = (L(2. (iii) The upward map Ts is a Lebesgue measure-preserving mapping from its domain to its range. This is also true of the upward mapping T = Ts defined by a column structure S. for example.c. The cutting idea extends to column structures. A column structure S' is a stacking of a column structure S. and taking the collection of the resulting subcolumns. The stacking idea is defined precisely as follows. Thus. Longer stacks are defined inductively by Ci *C2* • • • Ck Ck+i = (C1 * C2 * • • • Ck) * Ck+1. if they have the same support and each column of S' is a stacking of columns of S. Let C 1 = (L(1. Thus. 109 of disjoint subcolumns of C. the second-order columnar representation of (S. CUTTING AND STACKING. the upward map Tco. the columns of S' are permitted to be stackings of variable numbers of subcolumns of S.c. and extends their union. since Te. a column partitioning is formed by partitioning each column of S in some way.

Proof For almost every x E [0. The process {X. (b) X(S(1)) = 1 and the (Lebesgue) measure of the top of S(m) goes to 0 as m —> oc. (x) to be the index of the member of P to which 7 1 x belongs. and a condition for ergodicity. and defining X. (Such an m exists eventually almost surely. Continuing in this manner a sequence {S(m)} of column structures is obtained for which each member is built by cutting and stacking from the preceding one. the transformation produces a stationary process. it follows that T is measurable and preserves Lebesgue measure.} is described by selecting a point x at random according to the uniform distribution on the unit interval and defining X. then a common extension T of the successive upward maps is defined for almost all x in the unit interval and preserves Lebesgue measure. A precise formulation of this result will be presented in this section.} and its Kolmogorov measure it are. the common extension of the inverses of the Ts (m) is defined for almost all x. A sequence {S(m)} of column structures is said to be complete if the following hold.10. Since such intervals generate the Borel sets on the unit interval. El The transformation T is called the transformation defined by the complete sequence {S(m)}. along with the basic formula for estimating the joint distributions of the final process from the sequence of column structures. The goal of cutting and stacking operations is to construct a measure-preserving transformation on the unit interval.3. called the process and Kolmogorov measure defined by the complete sequence {S(m)}. The (T. T is invertible and preserves Lebesgue measure. Further cutting and stacking preserves this relationship of being below the top and produces extensions of Ts(m).3 (The complete-sequences theorem. then choosing m so that x belongs to the j-th level. BASIC CONCEPTS. Together with the partition defined by the names of levels. 1].) If {8(m)} is complete then the collection {Ts (n )} has a common extension T defined for almost every x E [0. so that the transformation T defined by Tx = Ts(m)x is defined for almost all x and extends every Ts (m) . then T -1 B = Ts -(m i ) B is an interval of the same length as B. (a) For each m > 1. Furthermore.c The final transformation and process. I. This is done by starting with a column structure 8(1) with support 1 and applying cutting and stacking operations to produce a new structure 8(2). (x) to be the name of level Li+n _1 of C. Theorem 1. Cutting and stacking operations are then applied to 8(2) to produce a third structure 8(3). 1] there is an M = M (x) such that x lies in an interval below the top of S(M). since the tops are shrinking to 0. say L i . This is equivalent to picking a random x in the support of 8(1). If the tops shrink to 0.10. The label structure defines the partition P = {Pa : a E A). Likewise.) The k-th order joint distribution of the process it defined by a complete sequence {S(m)} can be directly estimated by the relative frequency of occurrence of 4 in a . since the top shrinks to 0 as m oc. where Pa is the union of all the levels of all the columns of 8(1) that have the name a. P)-process {X. If B is a subinterval of a level in a column of some S(m) which is not the base of that column. respectively. it is clearly the inverse of T.110 CHAPTER I. S(m + 1) is built by cutting and stacking from S(m). of a column C E S(m) of height t(C) > j n — 1. This completes the proof of Theorem 1.10.

To make this precise. then (1) = lim E CES pk(ailC)X(C). This completes the proof of Theorem 1. This estimate is exact for k = 1 for any m and for k > 1 is asymptotically correct as m oc. given by p. establishing (1) for the case k = 1. and hence IX(Paf n C) — p k (c41C)A(C)1 < 2(k — 1)r(C). in turn. that is. Without loss of generality it can be assumed that 8(m +1) is built by cutting and stacking from S(m). let {. CUTTING AND STACKING. a negligible effect when m is large for then most of the mass must be in long columns. X(8(1)) = 1 and the measure of the top of S(m + 1) is half the measure of the top of S(m). which is. Theorem 1. for x E [0.x. Let Pa be the union of all the levels of all the columns in 8(1) that have the name a. pk(lC) is the empirical overlapping k-block distribution pk •ixf (c) ) defined by the name of C. and hence X(Pa fl C) = [t(c)pi (aiC)j r(c) = pi(aIC)X(C). of course.4. 1]. since the top k — 1 levels of C have measure (k — 1)r(C). The same argument applies to each S(m).SECTION L10. The quantity (t(C) — k 1)pk(a lic IC) counts the number of levels below the top k — 1 levels of C that are contained in Pal k .) If t is the measure defined by the complete sequence. f(C) — k 1 : x 1' 1 = ] where xf (c) is the name of C. Then. For k > 1. The desired result (1) now follows since the sum of the widths t(C) over the columns of (m) was assumed to go oc. For each m let 8(m) denote the 2''-order columnar representation.} E A Z be the sequence defined by the relation Tnx E Px„ n E Z. Let itt be the . Furthermore. as n An application of the estimation formula (1) is connected to the earlier discussion on of the sequence of columnar representations defined by a block-structure measure where S c A*. the context will make clear which is intended. 111 column name.4 (Estimation of joint distributions. so that {S(m)} is a complete sequence. . to 0. since the top has small measure.u(a) = X(Pa).10. {8(m)}. and let Pa k = { x: x = .(apx(c). the only error in using (1) to estimate Wa l') comes from the fact that pk (ainC) ignores the final k — 1 levels of the column C. CES (1) since f(C)p i (aIC) just counts the number of times a appears in the name of C.. • k E Ak 1 (m) Proof In this proof C will denote either a column or its support. averaged over the Lebesgue measure of the column name.10. The relative frequency of occurrence of all' in a labeled column C is defined by pk(a inC) = Ili E [1.

thought of as the concatenation wi w2 • • • wn E AL. Proof Let pk(afiw7) be the empirical overlapping k-block distribution defined by the sequence w E S n .5 The tower construction and the standard representation produce the same measures. One way to make certain that the process defined by a complete sequence is ergodic is to make sure that the transformation defined by the sequence is ergodic relative to Lebesgue measure.10. In the following discussion C denotes either a column or its support. that is. but only the column distributions.) = t (w 1) • Also let /72. a E Ak . be the Kolmogorov measure of the (T. In other words. The column structures S and S' are 6-independent if (3) E E Ige n D) — X(C)À(D)I CES DES' 6. two column structures S and S' are 6-independent if and only if the partition into columns defined by S and the partition into columns defined by S' are 6-independent. The sequence {S(m)} will be called the standard cutting and stacking representation of the measure it. Since this is clearly the same as pk(a inC) where C is the column with name w. awl)]. . that these independence concepts do not depend on how the columns are labeled. with the shift S on S" as base transformation and height function defined by f (wi . Next let T be the tower transformation defined by the base measure it*. then selecting the start position according to the uniform distribution on [1. BASIC CONCEPTS. since the start distribution for is obtained by selecting w 1 at random according to the distribution AT. where P = {Pa } is the partition defined by letting P a be the set of all pairs (wr) . where tovi) w = ai . which is sufficient for most purposes. by the way. A condition for this.(C). . For example. The sequence {S(m)} of column structures is asymptotically independent if for each m and each e > 0. = l. j) for which ai = a.112 CHAPTER I. 0 Next the question of the ergodicity of the final process will be addressed. is that later stage structures become almost independent of earlier structures. 7 3)-process. where L = Ei aw 1 ). p. measure defined by this sequence. Note. Corollary 1. Related concepts are discussed in the exercises. The proof for k > 1 is left as an exercise. there is a k > 1 such that S(m) and S(m k) are 6-independent. For k = 1 the sum is constant in n. it is enough to show that (2) = lim E WES pk (ak l it4)A. These ideas are developed in the following paragraphs. A(B I C) denotes the conditional measure X(B n c)/x(c) of the intersection of the set B with the support of the column C.

.F.SECTION I10. (5). given E > 0 there is an m and a level L. (6) X(B n D) > (1 — c)X(D).E 2 )x(c). . which implies that the (T. then there must be at least one level of D which is at least (1 — E) filled by B. asymptotically independent sequence { 8(m)}. El If is often easier to make successive stages approximately independent than it is to force asymptotic independence. the Markov inequality and the fact that E x (7. It will be shown that T is ergodic. hence the following stronger form of the complete sequences and ergodicity theorem is quite useful. This completes the proof of Theorem 1. so that. the following holds. Since X(B I C) = E-D A. of some C E 8(M) such that X(B n L) > 0 This implies that the entire column C is filled to within (1 — E 2 ) by B. D e .10.10. CUTTING AND STACKING. implies E X(D) <2E. the widths of its intervals must also shrink to 0. In summary. say level j.(B1 C X(B I C) ?.) A complete asymptotically independent sequence defines an ergodic process. Thus.6. and hence T must be ergodic. 1 C) DO' El which together with the condition. Proof Let T be the Lebesgue measure-preserving transformation of the unit interval defined by the complete.F. The set C n D is a union of levels of D. DO* Thus summing over D and using (6) yields X(B) > 1 — 3E. (4) À(B n c) > 1 . and hence the collection of all the levels of all the columns in all the S(m) generates the a-algebra. Fix C and choose M so large that S(m) and S(M) are EX(C)-independent. that is. in particular. imply that n v)x(vi C). The argument used to prove (4) then shows that the entire column D must be (1 — c) filled by B. The goal is to show that A(B) = 1. so if D E . P)-process is ergodic for any partition P. since S(M) is built by cutting and stacking from S(m). 113 Theorem 1. (5) E DES(M) igvi c) — x(D)1 E- Let . Towards this end note that since the top of S(m) shrinks to 0. Let B be a measurable set of positive measure such that T -1 B = B. This shows that X(B) = 1. ( since T k (B n L) = B n TkL and T k L sweeps out the entire column as k ranges from —j + 1 to aW) — j.6 (Complete sequences and ergodicity.T be the set of all D E S(M) for which X(B1 C n D) > 1 — 6. (1 — 6 2 ).

c . can produce a variety of counterexamples when applied to substructures. which. known as independent cutting and stacking. Let S' be the set of all columns C for which some level has this property. I. Theorem 1. in spite of its simplicity.) If {S(m)} is a complete sequence such that for each m. A column structure S can be stacked independently on top of a labeled column C of the same width. This gives a new column structure denoted by C * S and defined as follows. in the complexity of the description needed to go from one stage to the next.d Independent cutting and stacking. Proof Assume T -1 B = B and X(B) > O. The support of S' has measure at least 3.7. where Em 0. x ( L i) < E 28. as well as how substructures are to become well-distributed in later substructures. Independent cutting and stacking is the geometric version of the product measure idea. A bewildering array of examples have been constructed. (i) Partition C into subcolumns {Ci } so that r (C1) = (Di ). however. Thus. provided that they have disjoint supports.10. it follows that and there must be at least one C E S(m) for which (8) holds and for which DES (m+1) E Ix(Dic) — x(D)i < E. then by the Markov inequality there is a subcollection with total measure at least 3 for which x(B n L i ) > (1 — E 2 )À(L i ). In this discussion. The argument used in the preceding theorem then gives À(B) > 1 — 3E. There are practical limitations. as earlier. The freedom in building ergodic processes via cutting and stacking lies in the arbitrary nature of the cutting and stacking rules. BASIC CONCEPTS.10.u(B)/2. This proves Theorem 1. with the context making clear which meaning is intended. and m so large that Em is smaller than both E 2 and (ju(B)/2) 2 so that there is a set of levels of columns in S(m) such that (7) holds. then {S(m)} defines an ergodic process. A few of these constructions will be described in later chapters of this book. the same notation will be used for a column or its support. and the sweeping-out argument used to prove (4) shows that (8) X(B n > ( 1 — E2 E SI )x(c). by using only a few simple techniques for going from one stage to the next. of course. (7) Eg where BC is the complement of B. The next subsection will focus on a simple form of cutting and stacking. S(m) and S(m 1) are Em -independent. The user is free to vary which columns are to be cut and in what order they are to be stacked. .10.7 (Complete sequences and ergodicity: strong form.114 CHAPTER I. as m OC. where Di is the ith column of S. The only real modification that needs to be made in the preceding proof is to note that if {Li} is a disjoint collection of column levels such that X (BC n (UL)) = BciL i . taking 3 = .

Let S' and S be disjoint column structures with the same width.. (See Figure 1. (ii) For each i. CUTTING AND STACKING. An alternative description of the columns of S * S' may given as follows. The independent cutting and stacking of S' onto S is denoted by S * S' and is defined for S = {Ci } as follows. . It is called the independent stacking of S onto C. by the way.10. Note. so that r(S:)= .SECTION 1. and letting Si be the column structure that consists of the i-th subcolumn of each column of S.) (i) Cut S' into copies {S.(Ci ).8.10. To say precisely what it means to independently cut and stack one column structure on top of another column structure the concept of copy is useful.8 Stacking a column structure independently onto a column. (See Figure 1. and name. A column structure S is said to be a copy of size a of a column structure S' if there is a one-to-one correspondence between columns such that corresponding columns have the same height and the same labeling. independently onto Ci . and the ratio of the width of a column of S to the width of its corresponding column in S' is a. Ci *Di .10. that a copy has the same width distribution as the original. In other words. stack S. a scaling of one structure is isomorphic to the other. The new column structure C * S consists of all the columns C.9 Stacking a structure independently onto a structure. by partitioning each column of S according to the distribution 7r. 115 (ii) Stack Di on top of Ci to obtain the new column.10. obtaining Ci * S. * V.) C* S Figure 1.1.9. The column structure S * S' is the union of the column structures Ci * C2 S' Figure 1. width.10. A column structure S can be cut into copies {S i : i E n according to a distribution on /. where two column structures S and S' are said to be isomorphic if there is a one-to-one correspondence between columns such that corresponding columns have the same height.
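To see the bookkeeping involved, here is a minimal computational sketch, not from the book, in which a column is represented by its label string, width, and height; the names Column, cs_product, and repeated_ics are hypothetical choices for this illustration. It carries out the independent cutting and stacking of one structure onto another and the repeated construction S(m+1) = S(m) * S(m); the point it exhibits is that width distributions multiply and that total measure (width times height) is preserved.

from typing import List, NamedTuple

class Column(NamedTuple):
    name: str      # concatenation of the labels of its levels, bottom to top
    width: float
    height: int

def total_width(S: List[Column]) -> float:
    return sum(c.width for c in S)

def cs_product(S: List[Column], Sprime: List[Column]) -> List[Column]:
    """Independent cutting and stacking of S' onto S (assumes equal total widths):
    each column C_i of S is cut into subcolumns proportional to the widths of S',
    and a copy of C'_j is stacked on the j-th piece, giving the column C_i * C'_j
    of width tau(C_i) * tau(C'_j) / tau(S').  Width distributions multiply."""
    tau_Sp = total_width(Sprime)
    return [Column(c.name + d.name, c.width * d.width / tau_Sp, c.height + d.height)
            for c in S for d in Sprime]

def twofold(S: List[Column]) -> List[Column]:
    """S(m+1) = S(m) * S(m): cut S(m) into two copies of half the width and stack
    one independently onto the other; total measure is preserved."""
    half = [Column(c.name, c.width / 2, c.height) for c in S]
    return cs_product(half, half)

def repeated_ics(S: List[Column], m: int) -> List[Column]:
    for _ in range(m):
        S = twofold(S)
    return S

if __name__ == "__main__":
    # two columns of height 1 and width 1/2, labeled '0' and '1'
    S = [Column("0", 0.5, 1), Column("1", 0.5, 1)]
    S2 = repeated_ics(S, 2)
    print(len(S2), sum(c.width * c.height for c in S2))   # 16 columns, total measure 1.0

In this example the normalized widths of the 16 columns of S(2) are all 1/16, which is the product measure on 4-letter names, consistent with the isomorphism between S(m) and the 2^m-fold independent cutting and stacking of S.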

that they are independently concatenated according to the width distribution of S. and. E CHAPTER I. M} of itself of equal width and successively independently cutting and stacking them to obtain Si *52 * • • ' * SM where the latter is defined inductively by Si *.10. • • *Sm = (S1* • • • *Sm_i)*Sm. is that width distributions multiply. . r (Cii *C) r (S * S') r (Ci ) r (S) r(C) since r (S*S') = r(S). This formula expresses the probabilistic meaning of independent cutting and stacking: cut and stack so that width distributions multiply. In particular.. m > 1. The independent cutting and stacking construction contains more than just this concatenation information. however. that is.. 2. Two-fold independent cutting and stacking is indicated in Figure 1.10 Two-fold independent cutting and stacking.(Ci E (ii) Cut each column Ci '). i ) I r (S') for S' into subcolumns {C» } such that r(C) = (C)T. starting with a column structure S produces the sequence {S(m)}.10. The sequence {S(m)} is called the sequence built (or generated) from S by repeated independent cutting and stacking. in the finite column case. Successive applications of repeated independent cutting and stacking.J*Cii . The M-fold independent cutting and stacking of a column structure S is defined by cutting S into M copies {Sm : m = 1.116 (i) Cut each C each Ci E S. BASIC CONCEPTS.10. Si S2 SI * S2 Figure 1. such that r(Cii ). The key property of independent cutting and stacking. the number of columns of S * S' is the product of the number of columns of S and the number of columns of S'. The columns of Si * • • • * Sm have names that are M-fold concatenations of column names of the initial column structure S. S into subcolumns {CO. -r(Ci The column structure S *S' consists of all the C. A column structure can be cut into copies of itself and these copies stacked to form a new column structure. in the case when S has only finitely many columns. Note that (Si * • • • * S) = t (S)/M. for it carries with it the information about the distribution of the concatenations. where 5 (1) = S. Note that S(m) is isomorphic to the 2m-fold independent . the number of columns of Si * • • • * Sm is the M-th power of the number of columns of S. and S(m + 1) = S(m) * S (m). namely. .

Theorem 1. the (T. The following examples indicate how some standard processes can be built by cutting and stacking. Note that for each mo..10. This establishes that indeed S and S(m) are eventually c-independent. This completes the proof of Theorem 1.11 (Repeated independent cutting and stacking. • Ck. More interesting examples of cutting and stacking constructions are given in Chapter 3. with probability 1. just another way to define the regenerative processes. For k = 2'. by assumption. k]:C = p(CICO I = so that. and hence the proof shows that {S(m)} is asymptotically independent. the countable case is left to Exercise 2. • • Ck E SOTO. selected independently according to the width distribution of S. 117 cutting and stacking of S. since r(S)E(t(S)) = X(S) = 1. of course. then given > 0.SECTION I. define E [1. as k —> cc. by (9) and (10). CUTTING AND STACKING. and C E S. there is an m such that S and S(m) are E-independent. so that if X(S) = 1 then the process defined by {S(m)} is ergodic. as k of S. for each i. Division of X(C n D) = kp(C1Cbt(C)r(D) by the product of X(C) = i(C)r(C) and À(D) = i(D)t(D) yields x(c n D) X(C)À(D) kp(CIC I ) i(D)r (C) 1. where E(t(S)) is the expected height of the columns both with probability 1. with respect to the width distribution. kp(CICf) is the total number of occurrences of C in the sequence C. This is. (9) and (10) 1 1 — i(D) = — k k p(CIC) —> E £(0) i=1 E(t(S)). In the case when X(S) = 1. for they are the precisely the processes built by repeated independent cutting and stacking from the columnar representation of their firstorder block-structure distributions. The simplest of these. will mean the column D was formed by taking. Exercise 8. so that its columns are just concatenations of 2' subcolumns of the columns of S. P)-process defined by the sequence generated from S by repeated independent cutting and stacking is called the process built from S by repeated independent cutting and stacking.11. an example of a process with an arbitrary rate of . {S(m): m > mol is built from S(rno) by repeated independent cutting and stacking.10. Ergodicity is guaranteed by the following theorem. if X(S) = 1 then the process built from S by repeated independent cutting and stacking is ergodic. a subcolumn of Ci E S. D = C1C2. In particular. oc. in particular. By the law of large numbers. and stacking these in order of increasing i.10. The notation D = C = C1 C2 . Proof The proof for the case when S has only a finite number of columns will be given here.) If {S(m)} is built from S by repeated independent cutting and stacking.

Example 1. Example 1. For a discussion of this and other early work see [12].Ar(L).10. say a.10.} is regenerative.T. In fact. Define the set of all ak Prob(X i` = alic1X0 = a). and ak = a.c. The process built from S by repeated independent cutting and stacking is just the coin-tossing process. Let S be i for which ai a.i. They have since been used to construct .d. By starting with a partition of the unit interval into disjoint intervals labeled by the finite set A. see III. Remark 1. and how to "mix one part slowly into another. labeled as '0' and '1'. is recommended as a starting point for the reader unfamiliar with cutting and stacking constructions. the binary i. in particular. processes. one can first represent the ergodic Markov chain {(Xn . Yn )} as a process built by repeated independent cutting and stacking.10. Since dropping the second coordinates of each new label produces the old labels.1 < i < k. To see why this is so.d. The process built from the first-order columnar representation of S by repeated independent cutting and stacking is the same as the Markov process {X„}. 1) if L is the top of its column. it follows that X. that if the original structure has two columns with heights differing by 1.d.} is a function of a mixing Markov chain.! . then {Y. Example 1.i.10. i } is mixing Markov. the process {X. Note.14 (Hidden Markov chains. BASIC CONCEPTS. assume that the labeling set A does not contain the symbols 0 and 1 and relabel the column levels by changing the label A (L) of level L to (A (L). and to (. then {X.) A hidden Markov process {X.12 (I.) If {KJ is ergodic." so as to approximate the desired property. repeated independent cutting and stacking produces the A-valued i.16 The cutting and stacking ideas originated with von Neumann and Kakutani. process in which O's and l's are equally likely.118 CHAPTER I.10. and hence {X.) Let S be the column structure that consists of two columns each of height 1 and width 1/2. process with the probability of a equal to the length of the interval assigned to a. Example 1. and let pt* be the product measure on S defined by p. then drop the first coordinates of each label. that is." so as to guarantee ergodicity for the final process. convergence for frequencies. is a function of Y.15 (Finite initial structures.13 (Markov processes. It shows how repeated independent cutting and stacking on separate substructures can be used to make a process "look like another process on part of the space. all' E S. It is easy to see that the process {KJ built from this new column structure by repeated independent cutting and stacking is Markov of order no more than the maximum length of its columns.i.} built from it by repeated independent cutting and stacking is a hidden Markov chain.) The easiest way to construct an ergodic Markov chain {X„} via cutting and stacking is to think of it as the regenerative process defined by a recurrent state. 0) if L is not the top of its column. hence can be represented as the process built by repeated independent cutting and stacking.) of a Markov chain.} is a function X. (These also known as finite-state processes. = f (Y.) If the initial structure S has only a finite number of columns.

each C E S(m).) 6.d. for each m. then a subcolumn has the form (D.(m). 7..11 for the case when S has countably many columns. 119 numerous counterexamples. except for a set of D E SI of total measure at most E. Exercise 4c implies that A* is m-ergodic. A sufficient condition for a cutting and stacking construction to produce a stationary coding of an i.n such that T(m + 1) is built by An -fold independent cutting and stacking of 1(m) U R. most of which have long been part of the folklore of the subject. (a) Show that {S(m)} is asymptotically independent if and only if {r(m)} is asymptotically independent. Suppose S has measure 1 and only finitely many columns. (Hint: for S = S(m). Suppose also that {S(m} is complete. then S' and S are 26-independent. I. then the process built from S by repeated independent cutting and stacking is Markov of some order. . and an integer 111. (Hint: the tower with base transformation Sn defines the same process Ti.À(D)i except for a set of D E S I of total measure at most Nfj. and each E > 0. the conditional probability X(DI C) is the fraction of C that was cut into slices to put into D.. where D is a subset of the base B. 2..i.C. 5.e Exercises 1. Prove Theorem 1. process is given in [75]. CUTTING AND STACKING. The latter also includes several of the results presented here. where S C A*. Show that if {Mm } increases fast enough then the process defined by {S(m)} is ergodic. 3. (d) Show that if EcEs IX(C1 D) — X(D)I < E. Show that formula (2) holds for k > 1. some of which are mentioned [73]. (b) Suppose that for each m there exists R(m) C S(m). (a) Show that if the top of each column is labeled '1'. Show that if T is the upward map defined by a column C.10. E) such that C is (1 —6)-well-distributed in S(n) for n > M. T D. (c) Show that if S and S' are 6-independent.10. Then apply the k = 1 argument.) 4. and let {S(m)} be the standard cutting and stacking representation of the stationary A-valued process defined by (S. A sequence {S(m)} is asymptotically well-distributed if for each m. there is an M = M(m. A column C E S is (1 —6)-well-distributed in S' if EDES' lx(vi C)—X(7))1 < E. (b) Show that the asymptotic independence property is equivalent to the asymptotic well-distribution property.T 2 D.10.SECTION 1. disjoint from 1(m). Suppose 1(m) c S(m). and X(1 (m)) -->. . then E CES IX(CI D) . Let pt" be a block-structure measure on Soe. Show that {S(m)} is asymptotically independent if and only if it* is totally ergodic. (a) Suppose S' is built by cutting and stacking from S. Show that for C E S and D E S'. . re(c)-1 D)..1 as m —> oc. with all other levels labeled '0'. p*).

. then the process built from S by repeated independent cutting and stacking is a mixing finitestate process. BASIC CONCEPTS.10. Show that process is regenerative if and only if it is the process built from a column structure S of measure 1 by repeated independent cutting and stacking. (c) Verify that the process constructed in Example 1. (b) Show that if two columns of S have heights differing by 1.13 is indeed the same as the Markov process {X„}.120 CHAPTER I. 8.

Section II.1 Entropy and coding.

As noted in Section I.7, entropy provides an asymptotic lower bound on expected per-symbol code length for prefix-code sequences and faithful-code sequences, a lower bound which is "almost" tight, at least asymptotically in source word length $n$.

An $n$-code is a mapping $C_n: A^n \to \{0,1\}^*$, where $\{0,1\}^*$ is the set of finite-length binary sequences. The code length function $\mathcal{L}(\cdot\mid C_n)$ of the code is the function that assigns to each $x_1^n$ the length $\mathcal{L}(C_n(x_1^n))$ of the code word $C_n(x_1^n)$. When $C_n$ is understood, $\mathcal{L}(x_1^n)$ will be used instead of $\mathcal{L}(x_1^n\mid C_n)$. A code sequence is a sequence $\{C_n: n = 1, 2, \ldots\}$, where each $C_n$ is an $n$-code. If each $C_n$ is one-to-one, the code sequence is called a faithful-code sequence, while if each $C_n$ is a prefix code it is called a prefix-code sequence. A code sequence $\{C_n\}$ is said to be universally asymptotically optimal or, more simply, universal if

$\limsup_{n\to\infty} \mathcal{L}(x_1^n)/n \le h(\mu)$, almost surely,

for every ergodic process $\mu$, where $h(\mu)$ denotes the entropy of $\mu$.

The Shannon code construction provides a prefix-code sequence $\{C_n\}$ for which $\mathcal{L}(x_1^n) = \lceil -\log \mu(x_1^n)\rceil$, so that, by the entropy theorem, $\mathcal{L}(x_1^n)/n \to h$, almost surely. The sequence of Shannon codes compresses to entropy in the limit, but its construction depends on knowledge of the process, knowledge which may not be available in practice. In general, however, if $\{C_n\}$ is a Shannon-code sequence for $\mu$, and $\nu$ is some other ergodic process, then $\mathcal{L}(x_1^n)/n$ may fail to converge on a set of positive $\nu$-measure, see Exercise 1.

Two issues left open by these results will be addressed in this section. The first issue is the universality problem. It will be shown that there are universal codes, that is, code sequences which compress to the entropy in the limit, for almost every sample path from any ergodic process. The second issue is the almost-sure question. The entropy lower bound is an expected-value result, hence it does not preclude the possibility that there might be code sequences that beat entropy infinitely often on a set of positive measure. It will be shown that this cannot happen.
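As a quick illustration of the Shannon code lengths just described, the following sketch, with hypothetical names and an i.i.d. measure chosen only for concreteness, computes $\mathcal{L}(x_1^n) = \lceil -\log_2 \mu(x_1^n)\rceil$, checks the Kraft sum $\sum 2^{-\mathcal{L}} \le 1$ (which is what guarantees these lengths can be realized by a prefix code), and compares expected per-symbol length with entropy.

import math
from itertools import product

A = "ab"                      # hypothetical two-letter alphabet
p = {"a": 0.7, "b": 0.3}      # hypothetical i.i.d. marginal

def mu(x):                    # product measure of a word x
    out = 1.0
    for s in x:
        out *= p[s]
    return out

def shannon_length(x):        # L(x) = ceil(-log2 mu(x))
    return math.ceil(-math.log2(mu(x)))

n = 8
words = ["".join(w) for w in product(A, repeat=n)]
kraft = sum(2.0 ** -shannon_length(w) for w in words)
avg = sum(mu(w) * shannon_length(w) for w in words) / n
H = -sum(q * math.log2(q) for q in p.values())
print(f"Kraft sum {kraft:.4f} <= 1, per-symbol expected length {avg:.4f}, entropy {H:.4f}")

The expected per-symbol length exceeds the entropy by at most 1/n, the rounding cost of the ceiling.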

Theorem II.1.1 (Universal codes exist.) There is a prefix-code sequence $\{C_n\}$ such that $\limsup_n \mathcal{L}(x_1^n)/n \le h(\mu)$, almost surely, for any ergodic process $\mu$.

Theorem II.1.2 (Too-good codes do not exist.) If $\{C_n\}$ is a faithful-code sequence and $\mu$ is an ergodic measure with entropy $h$, then $\liminf_n \mathcal{L}(x_1^n)/n \ge h$, almost surely.

In other words, it is possible to (universally) compress to entropy in the limit, but for no ergodic process is it possible to beat entropy infinitely often on a set of positive measure.

A direct coding argument, with code performance established by using the type counting results discussed in Section I.6, will be used to establish the universal code existence theorem, Theorem II.1.1. The proof actually shows the existence of codes that achieve at least the process entropy, that is, $\limsup_n \mathcal{L}(x_1^n)/n \le H$, almost surely, for suitable choice of $k$ as a function of $n$, where $H = H(\mu)$ denotes the process entropy of $\mu$. Since process entropy $H$ is equal to the decay-rate $h$ given by the entropy theorem, this will show that good codes exist. A counting argument, together with a surprisingly simple lower bound on prefix-code word length, will then be used to show that it is not possible to beat process entropy in the limit on a set of positive measure; the nonexistence theorem, Theorem II.1.2, will follow from this and the entropy theorem.

Of interest, also, is the fact that both the existence and nonexistence theorems can be established using only the existence of process entropy, whose existence depends only on subadditivity of entropy as expected value, while the existence of decay-rate entropy is given by the much deeper entropy theorem. These two results also provide an alternative proof of the entropy theorem; while no simpler than the earlier proof, they show clearly that the basic existence and nonexistence theorems for codes are, in essence, together equivalent to the entropy theorem, thus further sharpening the connection between entropy and coding. In addition, the direct coding construction extends to the case of semifaithful codes, for which a controlled amount of distortion is allowed, as shown in [49]. A second proof of the existence theorem, based on entropy estimation ideas of Ornstein and Weiss, which is based on a packing idea and is similar in spirit to some later proofs, will be given in Section II.3. A third proof of the existence theorem, based on the Lempel-Ziv coding algorithm, will be discussed in Section II.2.

II.1.a Universal codes exist.

The code $C_n$ utilizes the empirical distribution of k-blocks, the k-type, and is a two-part code, that is, the codeword $C_n(x_1^n)$ is a concatenation of two binary blocks. The first block gives the index of the k-type of the sequence $x_1^n$, relative to some enumeration of possible k-types; it has fixed length which depends only on the number of type classes. The second block gives the index of the particular sequence $x_1^n$ in its k-type class, relative to some enumeration of the k-type class; its length depends on the size of the type class. Distinct words $x_1^n$ and $y_1^n$ either have different k-types, in which case the first blocks of $C_n(x_1^n)$ and $C_n(y_1^n)$ will be different, or they have the same k-type but different indices in their common k-type class, in which case the second blocks will differ.
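The two-part structure can be made concrete on a toy scale. The sketch below is hypothetical and exhaustive, so it is feasible only for tiny n and k, and it uses ordinary overlapping k-blocks as a simplified stand-in for the circular k-type defined in the text; the first block has fixed length, the second block's length is determined by the type, so the concatenation is decodable.

from itertools import product
from collections import Counter
from math import ceil, log2

A, n, k = "01", 10, 2

def k_type(x):                          # empirical overlapping k-block counts
    return tuple(sorted(Counter(x[i:i+k] for i in range(len(x)-k+1)).items()))

classes = {}                            # k-type -> ordered list of sequences
for w in ("".join(t) for t in product(A, repeat=n)):
    classes.setdefault(k_type(w), []).append(w)

types = sorted(classes)                 # fixed enumeration of the possible k-types
type_bits = max(1, ceil(log2(len(types))))   # first block: fixed length

def encode(x):
    t = k_type(x)
    first = format(types.index(t), "0{}b".format(type_bits))
    cls = classes[t]
    width = max(1, ceil(log2(len(cls))))     # second block: length set by the type class
    second = format(cls.index(x), "0{}b".format(width))
    return first + second

x = "0110100110"
print(len(types), "k-types;", len(encode(x)), "bits for", x)

The code length is essentially log(number of k-types) plus log(size of the k-type class), which is the split analyzed in the bounds that follow.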

) Total code length satisfies log /S1 (k. hence the second blocks of Cn (4) and Cn (y. Cn (4) = blink. where is a fixed length binary sequence specifying the index of the circular k-type of x.IX II ) on il k defined by the relative frequency of occurrence of each k-block in the sequence . because the size of the type class depends on the type. n) < (n ± olAl k . or they have the same k-types but different indices in their common k-type class.1. let ril+k-1 = xx k i -1 . then the log of the number of possible k-types is negligible relative to n. < iii. A slight modification of the definition of k-type. the index of the particular sequence 4 relative to some particular enumeration of T k (4). yet has negligible effect on code performance. on the number of k-types and on the size of a k-type class yield the following bounds (1) (2) (k. ' +1 .. br relative to some fixed enumeration of the set of possible circular k-types. n) + log1 7 4(4)1 +2. say.f. so asymptotic code performance depends only on the asymptotic behavior of the cardinality of the k-type class of x. If k does not grow too rapidly with n.7irk-1 . i <k — 1. n) of circular k-types that can be produced by sequences of length n.15. ak (3) ilk-Le. Pk-1(4 -1 14) =---" which.4+k-1 . Since the first block has fixed length. aik E A k . Given a sequence 4 and an integer k < n. called the circular k-type.`) will differ. ITi(4)1 < (n — 1)2 (11-1)11.. the code Cn is a prefix code."7 on the number /V (k.(1/2) log IAI n. 4 is extended periodically for k — 1 more terms. Now suppose k = k(n) is the greatest integer in (1/2) log lAi n and let Cn be the two-part code defined by first transmitting the index of the circular k-type of x.:+ k -1 = ak 11 . ENTROPY AND CODING. then transmitting the index of 4 in its circular k-type class. in turn. where the two extra bits are needed in case the logarithms are not integers. x 7 denotes the entropy of the (k — 1)-order Markov chain defined by Pk (. the concatenation of xtiz with x k i -1 . The gain in using the circular k-type is compatibility in k. An entropy argument shows that the log of this class size cannot be asymptotically larger than nH .14 and Theorem 1. Theorem 1.. that is. The circular k-type is the measure P k = ilk(. that is.SECTION 11. k -. implies the entropy inequality EFkcal. that is.14). .6. that is. where Hk_1.6. rd: 5. so that the bounds.c14). will be used as it simplifies the final entropy argument. 123 will be different. lic 14) — Pk(a • 1 ti E [1.' . 1f1 n The circular k-type is just the usual k-type of the sequence . and bm t +1 is a variable-length binary sequence specifying. and on the cardinalityj7k (x)i of the set of all n-sequences that have the same circular k-type as 4. (Variable length is needed for the second part.
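The circular k-type and the empirical Markov entropy $\hat H_{k-1}$ that appear in these bounds can be computed directly. The following sketch, with hypothetical names, extends $x_1^n$ periodically by its first $k-1$ symbols, tabulates the overlapping k-block frequencies $\widetilde P_k$, and evaluates the entropy of the $(k-1)$-order Markov chain they define.

from collections import Counter
from math import log2

def circular_k_type(x, k):
    """Empirical distribution of overlapping k-blocks in x extended periodically
    by its first k-1 symbols, so that every index of x starts a k-block."""
    y = x + x[:k-1]
    counts = Counter(y[i:i+k] for i in range(len(x)))
    n = len(x)
    return {b: c / n for b, c in counts.items()}

def empirical_markov_entropy(x, k):
    """H_{k-1} = - sum_b P_k(b) log2 P_k(b_k | b_1^{k-1}), using the circular k-type,
    whose (k-1)-block marginal is automatically the circular (k-1)-type."""
    Pk = circular_k_type(x, k)
    Pk1 = Counter()
    for b, p in Pk.items():
        Pk1[b[:-1]] += p
    return -sum(p * log2(p / Pk1[b[:-1]]) for b, p in Pk.items())

if __name__ == "__main__":
    import random
    random.seed(0)
    x = "".join(random.choice("01") for _ in range(1 << 14))   # fair-coin sample
    for k in (2, 4, 6):
        print(k, round(empirical_markov_entropy(x, k), 4))     # close to 1 bit per symbol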

almost surely. stationary or not. due to Barron.u(ar)/. A faithful-code sequence can be converted to a prefix-code sequence with no change in asymptotic performance by using the Elias header technique.x? < H + 2E. in this discussion.x 1' < H.u(4). k(n) will exceed K. Thus it is enough to prove Theorem 11. E) such that almost every x there is an particular. relative to some fixed enumeration of circular k(n)-types. For all sufficiently large n. a bound due to Barron. choose K such that HK_i _< H ± E. ENTROPY-RELATED PROPERTIES. The bound (2) on the size of a type class implies that at most 1 + (n — log(n — 1) 4(4).7. the type class. The first proof uses a combination of the entropy theorem and a simple lower bound on the pointwise behavior of prefix codes. .l. relative to a given enumeration of bits are needed to specify a given member of 7 Thus. The second proof. where. is valid for any process. Given E > 0. almost surely. so that to show the existence of good codes it is enough to show that for any ergodic measure lim sup fik(n)-1. II. the ergodic theorem implies that PK (ar lx7) oo. as n .b Too-good codes do not exist. is based on an explicit code construction which is closely connected to the packing ideas used in the proof of the entropy theorem. n > N. described in Section I.2 for the case when each G is a prefix code. process entropy El H and decay-rate h are the same for ergodic processes. negligible relative to n. together with the assumption that k < (1/2) log iAi n.d of Chapter 1. The following lemma. which was developed in [49]. Since E is arbitrary. The bound (1) on the number of circular k-types. This is. for es•• <H + 2E. [3]. for as noted earlier. Since K is fixed. Once this happens the entropy inequality (3) combines with the preceding inequality to yield ilk(n)-1.x7 < H. of course.1. this shows that lim sun n ilk(n).124 CHAPTER II. implies that it takes at most 1 + N/Tz log(n + 1) bits to specify a given circular k(n)-type. and does not make use of the entropy theorem. and hence HK _ Lx? integer N = N(x. In almost surely. hm sup 'C(n xilz) lim sup iik(n) . HK-1 = H (X IX i K -1 ) denotes the process entropy of the Markov chain of order K — 1 defined by the conditional probabilities .1. (4) where H is the process entropy of p. establishing the desired bound (4).u(ar-1 ). Fix an ergodic measure bc with process entropy H. The proof that good codes exist is now finished. almost surely.2 will be given. Two quite different proofs of Theorem 11.1.x7.

1.t(Bn) = E kt(x . by the Kraft inequality for prefix codes. x7En„ . a faithful-code sequence with this property can be replaced by a prefix-code sequence with the same property. with high probability. ENTROPY AND CODING.u(4) ?_ .C(x7) + log . Using the relation L(x7) = log 2'.1. such that if Bi = (x: ICJ) < i(H . } is a prefix-code sequence such that lim infi £(41C. eventually a.6)} then i(nn Up. this can be rewritten as Bn = {x7: iu. The basic idea is that if a faithful-code sequence indeed compresses more than entropy infinitely often on a set of positive measure. But -(1/n) log z(x) converges El almost surely to h.1. the entropy of . This completes the proof of Theorem 11.log p.(4) eventually almost surely.an). A second proof of the nonexistence of too-good codes. on a set of positive measure.SECTION II. Barron's inequality (5) with an = 2 loglAI n.(4) which yields 2-c(4 ) 2 -an I. 125 Lemma 11. c and y.3 (The almost-sure code-length bound.l. 2-an < oo.c Nonexistence: second proof.2.„ Bj ) > y. and after division by n. long enough sample paths can be partitioned into disjoint words. then.2.) Let (CO be a prefix-code sequence and let bt be a Borel probability measure on A. II. Let i be a fixed ergodic process with process entropy H. (As noted earlier.C(4) +log .u(fil ) . will now be given. I. then If {an } is a sequence of positive numbers such that En (5) . for any ergodic measure if/. such that those words that can be coded too well cover at least a fixed fraction of sample path length.)/n < H. since adding headers to specify length has no effect on the asymptotics.) Thus there are positive numbers. Proof For each n define B. Theorem 11.1. . The lemma follows from the Borel-Cantelli lemma. with respect to A.u. ={x: . a proof which uses the existence of good codes but makes no use of the entropy theorem.0 <2 -an E 2-4(4) < 2 -"n . while the complement can be compressed almost to entropy.ei2E B„ since E „ 2—c(x7) < 1. s.an. yields lim inf n—000 n > lim inf . thereby producing a code whose expected length per symbol is less than entropy by a fixed amount. and assume {C.

3)-good representation. for some large enough M. and using a single-letter code on the remaining symbols. E G(n). such that the words that belong to B m U . To make precise the desired sample path partition concept. or belongs to Dk. or belongs to B1 for some j > M. using the good code Ck on those words that belong to Dk. contradicting Shannon's lower bound.C*(4) <n(H — E').x i` I e' -k).. there is a prefix n-code C n * whose length function E*(4) satisfies (7) . for all sufficiently large n. (a) Each w(i) belongs to A. a concatentation 4 = w(1)w(2) • • • w(t) is said to be a (y. let M > 2k IS be a positive integer.. be partitioned into disjoint words. with high probability. with length function E(x) = Theorem 11. Here it will be shown how a prefix n-code can be constructed by taking advantage of the structure of sequences in G(n).1. plus headers to tell which code is being used. eventually almost surely.126 CHAPTER II. Lemma 11. Furthermore. Sample paths with this property can be encoded by sequentially encoding the words. For n > M. A version of the packing lemma will be used to show that long enough sample paths can. is at least yn. (b) The total length of the w(i) that belong to A is at most 33n.. such that the set Dk = E(X lic ) < k(H 8)) . . . while the words that are neither in Dk nor in BM U . a code that will eventually beat entropy for a suitable choice of 3 and M.1. ENTROPY-RELATED PROPERTIES. It will be shown later that (6) x E G(n). given that (6) holds. for a suitable choice of E.4 For n > M. and M > 2k/S. .. for the uniform boundedness of E*(4)In on the complement of G(n) combines with the properties (6) and (7) to yield the expected value bound E(C*(X7)) < n(H — 672). Let G(n) be the set of all sequences 4 that have a (y. cover at most a 33-fraction of total length. (c) The total length of the w(i) that belong to BmU Bm +1U . xr. The lemma is enough to contradict Shannon's lower bound on expected code length. this produces a single code with expected length smaller than expected entropy. ' > 0. Let 8 < 1/2 be a positive number to be specified later.. if the following three conditions hold. provides a k > 1/3 and a prefix code ek £(.C*(x`Oln will be uniformly bounded above by a constant on the complement of G(n).. cover at least a y-fraction of total length. independent of n. using the too-good code Ci on those words that belong to Bi for some i > M. 8 > 0. has measure at least 1 — S. The existence theorem. The coding result is stated as the following lemma. 0-good representation of xt. If 8 is small enough and such sample paths are long enough.1...

a prefix code on the natural numbers for which the length of e(j) is log j o(log j). in turn. by the definition of Dk.7. since the initial header specifies whether E G(n) or not. The code C(x) is defined for sequences x E G(n) by first determining a (y. then a (y.)cr i' = w(l)w(2) • • • w(t) is determined and the code is defined by C:(4) = lv(1)v(2) • • • v(t).). the code word Cf (w(i)) is no longer than j (H — c).1 .t.1. it takes at most L(H — E) + (N — L)(H 8) to encode all those w(i) that belong to either Dk or to Ui > m Bi . the per-symbol code C:(4) = OF(xi)F(x2) • • • F(xn). Likewise. and. 3)-good representation . Proof of Lemma 11. ENTROPY AND CODING. Note that C * (4)/n is uniformly bounded on the complement of G(n). a code word Elk (w(i)) has length at most k(h + 6). If E G(n). ) is an Elias code. so that. the principal contribution to code length comes from the encodings Cf (u. thereby contradicting Shannon's D lower bound on expected code length. The code Cn * is clearly a prefix code.. A precise definition of Cn * is given as follows.. Appropriate headers are also inserted so the decoder can tell which code is being used.m B. But this. where OF ((w(i)) (8) v(i) = 10 .11(i).. But L > yn. Cm+i . that is. yields E(. as shown in the following paragraphs.1 4. since the total length of such w(i) is at most N — L. 127 for all sufficiently large n. since (1/n)H(A n ) -+ H. . say. ' (w(i)) iie(j)Ci(W(i)) if w(i) has length 1 if w(i) E Dk if w(i) E Bi for some j > M. the contribution .} is being applied to a given word. then it requires at most L(H — E) bits to encode all such w(i).4.1. This is so because. for x g G(n). Here e (... on the words of length 1. F : A 1--* Bd.. Theorem I. and using some fixed code. where d > log IA I. By the definition of the Bi . using the good code Ck on those words that have length k. (0) (w(i)) of the w(i) that belong to Up. while the other headers specify which of one of the prefix-code collection {F.C*(X7)) < H(1.. Ck. and the encodings k(W(i)) of the w(i) that belong to Dk. For 4 g G(n). symbols are encoded separately using the fixed code F. so that if L is total length of those w(i) that belong to BM U BA.SECTION 11. C m . ignoring the headers needed to specify which code is being used as well as the length of the words that belong to Uj >m BJ . so that. using the too-good code on those words that have length more than M. for suitable choice of 8 and M > 2kI8. then coding the w(i) in order of increasing i. 8)good representation 4 = w(1)w(2) • • • w(t). by property (c) in the definition of G(n). The goal is to show that on sequences in G(n) the code Cn * compresses more than entropy by a fixed amount. is used. . for all sufficiently large n.

m B i is upper bounded by — ye). n(H +3 — y(e +8)) n(H (9) Each word w(i) of length 1 requires d bits to encode. which is certainly negligible. since the total number of such words is at most 38n. which establishes the lemma. the total contribution of all the headers as well as the length specifiers is upper bounded by 38n ± 26n + IC8n. by property (b) of the definition of G(n). There are at most nlk words w(i) the belong to Dk or to Ui >m Bi . Of course. 4 E G(n). then n(H — E'). Since d and K are constants. The 1-bit headers used to indicate that w(i) has length 1 require a total of at most 33n bits. ENTROPY-RELATED PROPERTIES. so essentially all that remains is to show that the headers require few bits. and there are at most 38n such words. But (1/x) log x is a decreasing function for x > e. If 8 is small enough. so these 2-bits headers contribute at most 28n to total code length. for any choice of 0 < 8 < 1/2. In summary. which exceeds e since it was assumed that < 1/2 and k > 118. Lemma 1.M . eventually almost surely. the total length needed to encode all the w(i) is upper bounded by (10) n(H +8 — ye +3d3). m Bi E since each such w(i) has length M > 2k1S. it follows that if 8 is chosen small enough. the lemma depends on the truth of the assertion that 4 E G(n). k > 113. then 8 — ye -1-3d8 will be negative. provided only that M > 2k18. provided k > 118. so that. so that log log t(w(i)) < K M n. Adding this bound to the bound (10) obtained by ignoring the headers gives the bound L*(4) < n[H 8(6 ± 3d K)— yen on total code length for members of G(n). The words w(i) need an additional header to specify their length and hence which Ci is to be used. still ignoring header lengths.u(B) > y.8. relative to n. ignoring the one bit needed to tell that 4 E G(n). by property (b) in the definition of G(n). to total code length from encoding those w(i) that belong to either Dk or to Ui . By assumption. Thus. e'. provided 8 is small enough and M is large enough. M > 2k/3 and log M < 8M.3. there is a constant K such that at most log t(w(i)) K tv(i)Eui>m B1 E bits are required to specify all these lengths. choose L > M so large that if B = UL BP • then . if it assumed that log M < 8M.128 CHAPTER II. w(i)Eup. The partial packing lemma. 1=. and M > 2k13. since k < M and hence such words must have length at least k. then at most K3n bits are needed to specify the lengths of the w(i) that belong to Ui >m B i . To establish this. Since an Elias code is used to do this. for suitable choice of 8 .

) The words v:: j E /*I {x:j E J} U {xu together with the length 1 words {x s I defined by those s U[s . n — k 1] that are starting places of blocks from Dk. as follows. suggested in Exercise 5. a.SECTION 11. and (c) hold. II. vi ] meets none of the [si . and let {C„} be a prefix-code sequence such that lim sup. eventually almost surely. n]. and I Dn I < 2 n(H+E) . ti]: j E U {[ui. since Cn is invertible. The preceding paragraph can be summarized by saying that if x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk then 4 must belong to G(n). ti ]. Since each [s1 . n] of total length at least yn. Fix an ergodic measure bt with process entropy H. Fix c > 0 and define D„ = 14: £(4) < n(H so that x Thus. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk.9. provide a representation x = w(1)w(2) • • • w(t) for which the properties (a). (b). eventually almost surely there are at least 8n indices i E [1. vi that meet the boundary of at least one [s ti] is at most 2nk/M. B. Lemma 1. ti]UU[ui.d Another proof of the entropy theorem. On the other hand. E Dk and a disjoint collection {[si. But this completes the proof that 4 E G(n). is to say that there is a disjoint collection {[u„ yi ]: i E / } of k-length subintervals of [1. eventually almost surely. which is upper bounded by 8n. But the building-block lemma. 4 is (1 — 0-strongly-covered by Dk.3.1. by the ergodic theorem. To say that x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. then implies that xri' is eventually almost surely (1 — 20-packed by k-blocks drawn from Dk. that is. since. This completes the second proof that too-good codes do not exist.u(x'1') < 2 1'1+2} . E /*} is disjoint and covers at least a (1 — 30-fraction of [1. t] has length at least M. Lemma 1. 1E1.s. eventually almost surely. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. . then the collection {[sj. (This argument is essentially just the proof of the two-packings lemma. eventually almost surely. so the total length of those [ui . . eventually almost surely.3.C(x`11 )1 n < H. 129 provides. In particular.7. for which each xs% E B. Thus if I* is the set of indices i E I such that [ui .7. ENTROPY AND CODING. = . n] of total length at least (1 — 20n for which each 4. The entropy theorem can be deduced from the existence and nonexistence theorems connecting process entropy and faithful-code sequences. since it was assumed that M > 2k/ 6 .l. ti ]: j E J } of M-length or longer subintervals of [1. Dn . Section 1. the cardinality of J is at most n/M. if E . a y-packing of 4 by blocks drawn from B.

This proves that 1 1 lim inf — log — > H.. xii E Un and en (4) = C(4). then there is a renewal process v such that lim supn D (gn II vn )/n = oc and lim infn D(pn II vn )/n = 0. and so xri` g Bn . xi" g Un . be the projection of the Kolmogorov measure of unbiased coin tossing onto A n and let Cn be a Shannon code for /in . establishing the upper bound 1 1 lim sup — log ) _< H . Let C. n n and completes this proof of the entropy theorem..1. such that the first symbol in 0(4) is always a 0. C3 II.. then there is a constant C such that lim sup.(4) > 2—n(H. Lemma 5.n(H — c). eventually almost surely. is the all 0 process. (Hint: code the second occurrence of the longest string by telling how long it is.. whose range is prefix free.. Let L (xi') denote the length of the longest string that appears twice in 4. Show that if bt is i. n n 11. a. Suppose Cn * is a one-to-one function defined on a subset S of An . n D n) < IDn 12—n(H+2E) < 2—n€ Therefore 4 fl B n n Dn . . Define On (4) = 4)(4). eventually almost surely.1 . Add suitable headers to each part to make the entire code into a prefix code and apply Barron's lemma.(4 For the lower bound define U„ = Viz: p. Le Exercises 1. almost surely. 2. Such a code Cn is called a bounded extension of Cn * to An. and let £(4) be the length function of a Shannon code with respect to vn .1(4) log(pn (4)/vn (4)) be the divergence of An from yn .130 then /t (Bn CHAPTER II. where it starts.d. Code the remainder by using the Shannon code. and note that I Uni < 2n(I I —E) . This exercise explores what happens when the Shannon code for one process is used on the sample paths of another process. (b) Let D (A n il v pi ) = E7x /1.i.. The resulting code On is invertible so that Theorem 11. (a) For each n let p. Show that there is prefix n-code Cn whose length function satisfies £(4) < Kn on the complement of S.s.2 guarantees that Je ll g Un .(4) = 1C:(4). is ergodic then C(4)In cannot converge in probability to the entropy of v. and such that C. for X'il E S. Show that the expected value of £(4) with respect to p.) 3. a. where K is a constant independent of n. is 1-1(pn )± D (An II vn).. eventually almost surely. (C) Show that if p.. Show that if y 0 p. ENTROPY-RELATED PROPERTIES. Thus there is a one-to-one function 0 from Un into binary sequences of length no more than 2 -I. be the code obtained by adding the prefix 1 to the code Cn . and where it occurred earlier. L(4)/ log n < C.s.

later words have no effect on earlier words. In finite versions it is the basis for many popular data compression packages. for ease of reading. w(j + 1) = xn such that xm n . b(C)b(C + 1) of the binary words b(1). defined according to the following rules. [53].. is given in Section 11. (a) The first word w(1) consists of the single letter x1.. for example. that the Lempel-Ziv (LZ) algorithm compresses to entropy in the limit will be given in this section.. Note that the parsing is sequential..1 denote the least integer function and let f: {0. • of variable-length blocks. Ziv's proof.. W(j)} and xm+ 1 g {w(1).2. If fiz is parsed as in (1). 100..4. a way to express an infinite sequence x as a concatenation x = w(1)w(2) ..SECTION 11. . To describe the LZ code Cn precisely. . An important coding algorithm was invented by Lempel and Ziv in 1975. parses into 1. Thus. together with a description of its final symbol. the words are separated by commas. .1 +1 E {WM. 101. 11001010001000100. where the final block y is either empty or is equal to some w(j). that is.. w(j) = (i) If xn 1 +1 SI {w(1). x is parsed inductively according to the following rules. n) i-.. . 01. . called the (simple) Lempel-Ziv (LZ) code by noting that each new word is really only new because of its final symbol. [91]. 000. 7 + 1 .. w(j)} then w(j ± 1) consists of the single letter xn ro.. and therefore an initial segment of length n can be expressed as (1) xi! = w(1)w(2).. . where. . In its simplest form. To be precise. let 1. 00. (b) Suppose w(1) .. .2 The Lempel-Ziv algorithm.. for j < C. w(C) y... THE LEMPEL-ZIV ALGORITHM 131 Section 11. where m is the least integer larger than n 1 (ii) Otherwise.Brl0gn1 and g: A 1--÷ BflogiAll be fixed one-to-one functions. b(2). the word-formation rule can be summarized by saying: The next word is the shortest new word. which is due to Ornstein and Weiss. . . hence it is specified by giving a pointer to where the part before its final symbol occurred earlier. .. b(C + 1).. [90].. 2. The parsing defines a prefix n-code Cn . WW1.. . A second proof. and uses several ideas of independent interest. called words. 1. the LZ code maps it into the concatenation Cn (eii) = b(1)b(2). 0. 10. and it has been extensively analyzed in various finite and limiting forms.. called here (simple) LZ parsing. that is. The LZ algorithm is based on a parsing procedure. b(C).

it is enough to prove that entropy is an upper bound. for it takes into account the frequency with which strings occur. by Theorem 11. where a and /3 are constants. is the same as the usual topological entropy of the orbit closure of x. then b(j) = 0 g(w(j)). ENTROPY-RELATED PROPERTIES.) If p is an ergodic process with entropy h. then b(j) = 1 f (i)0 g(a). Ziv's proof that entropy is the correct almost-sure upper bound is based on an interesting extension of the entropy idea to individual sequences. The dominant term is C log n. Lemma 1. together with a proof that this individual sequence entropy is an asymptotic upper bound on (1/n)C(4) log n.132 CHAPTER II. The entropy given by the entropy theorem is h(p) = —p log p — (1 — p) log(1 — p).d. as k oc. The k-block universe of x is the set tik(x) of all a l' that appear as a block of consecutive symbols in x. together with a proof that. in which C1(4) now denotes the number of new words in the simple parsing (1). 1) = 1 f (i).) Topological entropy takes into account only the number of k-strings that occur in x.. as defined here. where i is Part (a) requires at most IA l(Flog IA + 1) bits. (a) If j < C and w(j) has length 1. 34] for discussions of this more general concept.1 (The LZ convergence theorem. for every sequence. Ziv's concept of entropy for an individual sequence begins with a simpler idea. then b(C the least integer such that y = w(i). for ergodic processes. (b) If j < C and i < j is the least integer for which w(j) = w(i)a. (The topological entropy of x.2. [90]. part (b) requires at most Calog n1 ± Flog I A + 2) bits. Theorem 11. which is the growth rate of the number of observed strings of length k. individual sequence entropy almost surely upper bounds the entropy given by the entropy theorem.7. if x is a typical sequence for the binary i.i. which depends strongly on the value of p. so to establish universality it is enough to prove the following theorem of Ziv.eiz) is upper bounded by (C+1)logn-FaC-1. k = {a• i• x 1' 1 = all for some i _> The topological entropy of x is defined by 1 h(x) = lim — log 114 (x)I k a limit which exists. For example. since I/44k (x)i < Vim (x)I • Pk (x)1. so total code length £(. since no code can beat entropy in the limit. but gives no information about frequencies of occurrence. a E A. that is.6. by subadditivity. process with p equal to the probability of a 1 then every finite string occurs with positive probability. then (11n)C(4) log n —> h. 1) is empty. otherwise b(C (c) If y is empty. called topological entropy. and part (c) requires at most Flog ni + 1 bits. almost surely.1. Of course. hence its topological entropy h(x) is log 2 = 1. . see [84.16.2.

.1. n (in + 1 )1Al m+1 jiAli and C = .) If is an ergodic process with entropy h then H(x) < h. and hence C logn ' n log I AI. will be carried out via three lemmas.) lim sup C(xti') log n H(x). The natural distance concept for this idea is cl(x. Lemma 11.2. leads immediately to the LZ convergence theorem.SECTION 11.2. y) = lim sup d. combined with the fact it is not possible to beat entropy in the limit. when n= In this case. almost surely. where d(4 . say m. X E A°°. x E A c° . which.) There exists Bn —O. 0 such that C(x) log n n <(1 + 3n)loglAl. Ziv established the LZ convergence theorem by proving the following two theorems. up to some length. Theorem 11. 133 The concept of Ziv-entropy is based on the observation that strings with too small frequency of occurrence can be eliminated by making a small limiting density of changes. defined as before by c1(x . )1).2.1. These two theorems show that lim supn (l/n)C(x7) log n < h..t--ns n is per-letter Hamming distance. that is.4 (The crude bound. The Ziv-entropy of a sequence x is denoted by H(x) and is defined by H(x) = lim inf h(y). almost surely.2. y).2. )7. Theorem 11. Proof The most words occur in the case when the LZ parsing contains all the words of length 1.2. — 1) and C IA — 1). THE LEMPEL-ZIV ALGORITHM. which also have independent interest.. The first gives a simple bound (which is useful in its own right).2. . The proof of the upper bound.2 (The LZ upper bound. Theorem 11. all the words of length 2.2. and the third establishes a ii-perturbation bound.1 ) = 1 ..=.i (xii . the second gives topological entropy as an upper bound. Theorem 11.3 (The Ziv-entropy theorem. The first lemma obtains a crude upper bound by a simple worst-case analysis. Theorem 11.

134 CHAPTER II. and choose m such that E j2» j=1 „(x) < n < E i2 p„(x) . Let x = w(1)w(2) ... 2.) x. ENTROPY-RELATED PROPERTIES. i=1 The desired upper bound then follows from the bound bounds .2.6 (The perturbation bound. first choose m such that n< j=1 jlAli. must give an upper bound. The word w(i) will be called well-matched if dt(u. 1 t m it . Lemma 11. v(i)) < .. and hence the rate of growth of the number of words of length k.. that is. as k oc. such that C(x) log n n — Proof Let hk(x) = (1/k) log lUk(x)1. then Given x and E > 0. Lemma 11. there is a S >0 such that if d( li m sup n—>oo C(. together with the fact that hi(x) — Ei < h(x) h (x) where Ei —> 0 as j oc. where t(v(i)) = t(w(i)).. i = 1. for the LZ parsing may be radically altered by even a few changes in x. .2. for k > 1.5 (The topological entropy bound... tj = Er (en-El The second bounding result extends the crude bound by noting that short words don't matter in the limit. tm 1 m(t _ i) _< E (2) m (t— 1 i Er ti < t"/(t — 1).. The third lemma asserts that the topological entropy of any sequence in a small dneighborhood of x is (almost) an upper bound on code performance. ()) (w(i). all of which are valid for t > 1 and are simple consequences of the well-known formula 0 — t)/(t — 1). for w E Ai. and the t < - t mtm. j=1 m+1 The bounds (2). be the LZ parsing of x and parse y as y = v(1)v(2) . y) <S. the topological entropy. Fix S > 0 and suppose d(x.x'iz) log n h(y) + Proof Define the word-length function by t(w) = j. there exists 8. yield the desired result. At first glance the result is quite unexpected. ei.) For each x E A'. y) < S. —> 0. To establish the general case in the form stated.

where lim n fi. the total length of the poorly-matched words in the LZ parsing of 4 is at most rt. Let C1(4) be the number of poorly-matched words and let C2(4) be the number of well-matched words in the LZ parsing of x. v(i)) < N/75. Thus there must be an integer s E [0. For any frequency-typical x and c > O. and let Gk(y) be the set of all v(i) for which w(i) E Gk(X). Lemma 11. implies LI the desired result. The resulting nonoverlapping blocks that are not in 'Tk can then be replaced by a single fixed block to obtain a sequence close to x with topological entropy close to h.5. is an immediate consequence of the following lemma.5 then gives C2(4) 5_ (1 + On)r— (h(Y) gn f (8)).2. n — 11: x in+s+1 lim inf > 1. since x was assumed to be frequency-typical. Fix a frequency-typical sequence x.E. the poorly-matched words cover less than a limiting .2. This bound.2./TS. let Gk(x) be the set of wellmatched w(i) of length k. for w(i) E Gk(X). there is a sequence y such that cl(x. x i+ 1 -T k11 • > 1 — E. The LZ upper bound. k —1] such that i. and hence it can be supposed that for all sufficiently large n. The idea for the proof is that 4 +k-1 E Tk for all but a limiting (1 — 0-fraction of indices i. Theorem 11. y) < e and h(y) <h E. Proof Fix c > 0.2. first note that lim i+k Ili E [0. Thus. yields IGk(x)I where f(S) < 2k(h(y)+ f (6)) 0 as 8 O.6. together with the fact that hk(Y) --+ h(y). By the Markov inequality. Lemma 11. completing the proof of Lemma 11. 135 otherwise poorly-matched. The cardinality of Gk(Y) is at most 2khk (Y ) . be an ergodic process with entropy h.2. Lemma 1.2. Since dk(w(i). follows immediately from Lemma 11. 0 The desired inequality. = O. hence the same must be true for nonoverlapping k-blocks for at least one shift s E [0.n+s +k E Ed' I{i E [0. k — 1]. combined with (3) and the fact that f(S) --÷ 0.7.7 Let p.SECTION 11. H(x) < h. n — 1]. the blowup-bound lemma. THE LEMPEL-ZIV ALGORITHM. Lemma 11. For each k.6. almost surely. n—>oo .2.r8fraction of x.2. To make the preceding argument rigorous. Now consider the well-matched words. since this is the total number of words of length k in y. and choose k so large that there is a set 'Tk c Ak of cardinality less than 2n (h+E) and measure more than 1 — E.4 gives the bound (3) (1 n/ +an) log n'a log I AI where limn an = O.

— n Fix a E A and let ak denote the sequence of length k.. • v(j + in — 1)r. Theorem 11.6...8 Variations in the parsing rules lead to different forms of the LZ algorithm. after which the next word is defined to be the longest block that appears in the list of previously seen words.2.") In this alternative version.2. each w(i) has length k. a set of cardinality at most 1 + 2k(h +E) . see Exercise 1. and (M — 2k)/ k < m < M/k. Theorem 11. The sequence y is defined as the concatenation y = uv(1)v(2). Condition (4) and the definition of y guarantee that ii(x. Remark 11. It is used in several data compression packages. remains true. ENTROPY-RELATED PROPERTIES. also discussed in the exercises. thereby. "new" meant "not seen as a word in the past. as well as its length must be encoded. produces slightly more rapid convergence and allows some simplification in the design of practical coding algorithms. One method is to use some form of the LZ algorithm until a fixed number. implies that h m (y) < h + 6 ± 3m. known as the LZW algorithm. of words have been seen.7.3. where the initial block u has length s. but more bookkeeping is required. say Co. new words tend to be longer. defines a word to be new only if it has been seen nowhere in the past (recall that in simple parsing. Another version. the built-up set bound. starts in much the same way.136 CHAPTER II. until it reaches the no-th term. If M is large relative to k. a member of the m-block universe Um (y) of y has the form qv(j)v(j + D.2. all of whose members are a. 0 and.10 For computer implementation of LZ-type algorithms some bound on memory growth is needed.2.9 Another variation. which improves code performance. which has also been extensively analyzed. and (4) liminf n—. and hence h(y) < h + E. the rate of growth in the number of new words still determines limiting code length and the upper bounding theorem.2. Tic where 3A1 ---* 0. Nevertheless. includes the final symbol of the new word in the next word. y) <E.00 Ri < n: w(i) E TxII > 1 — E.2. Remark 11. which can be expressed by saying that x is the concatenation x = uw(1)w(2).. Since each v(i) E U {a l } . where q and r have length at most k.. Remark 11. One version. then defines . the proof of the Ziv-entropy theorem. Lemma 1. for now the location of both the start of the initial old word. This modification. where w(i) ak if w(i) E 'Tk otherwise.. called the slidingwindow version..2. This completes the proof of Lemma 11.7.

. suggests an entropy-estimation algorithm and a specific coding technique. Show that simple LZ parsing defines a binary tree in which the number of words corresponds to the number of nodes in the tree. The nonoverlapping k-block distribution. Furthermore. and an overlapping-block form. The theorem.8 achieves entropy in the limit. asserting that eventually almost surely only a small exponential factor more than 2kh such k-blocks is enough.. These will be discussed in later sections.3. or it may depend on the past or even the entire sample path.a Exercises 1. e. is defined by qic(cli = Ri E [0. rit — 1]: in X: . which can depend on n. and that a small > 21th exponential factor less than 2kh is not enough. (Hint: if the successive words lengths are . Show that the version of LZ discussed in Remark 11. both of which will also be discussed.3 Empirical entropy. through the empirical distribution of blocks in the sample path.2. provided only that n _ .9 achieves entropy in the limit. EMPIRICAL ENTROPY. see [87. In many coding procedures a sample path 4 is partitioned into blocks. but do perform well for processes which have constraints on memory growth. Show that the LZW version discussed in Remark 11. t2.2.) 2.cc+ -E =a } I where n = km + r. ic.2. then each block is coded separately. the method of proof leads to useful techniques for analyzing variable-length codes. or variable length. The theorem is concerned with the number of k-blocks it takes to cover a large fraction of a sample path. see Exercise 1. 137 the next word as the longest block that appears somewhere in the past no terms. 88]. then E log Li < C log(n/C).. a beautiful and deep result due to Ornstein and Weiss. 0 < r < k. the coding of a block may depend only on the block. Such algorithms cannot achieve entropy for every ergodic process. In this section. about the covering exponent of the empirical distribution of fixed-length blocks will be established. II. The empirical-entropy theorem has a nonoverlapping-block form which will be discussed here. • • . such as the Lempel-Ziv code. 3. Furthermore. as well as other entropy-related ideas. Section 11. and can be shown to approach optimality as window size or tree storage approach infinity.SECTION 11. here called the empirical-entropy theorem. The blocks may have fixed length. g. . [52].

for fixed k. The ergodic theorem implies that for fixed k and mixing kt.. x) such that if k > K and n > 2kh . those long words that come from the fixed collections . the empirical measure qk(. For each c > 0 and each k there is a set 'Tk (e) c A" for which l'Tk (e)I < 2k(h +E) . then (a) qk(Z(E)14)) > 1 — E. a finite sequence w is called a word.4. such that Tk c A k .a(B) > E. Strong-packing captures the essence of the idea that if the long words of a parsing cover most of the sample path. in spite of the fact that it may not otherwise be close to iik in variational or even cik -distance.1 (The empirical-entropy theorem. if the words that belong to U'rk cover at least a (1 — E)-fraction of n. for unbiased coin-tossing the measure qk (. II.3. For example. .14) is not close to kt k in variational distance for the case n = k2". A parsing X n i = w(l)w(2) • • • w(t) is (1 — e)-built-up from {'Tk}.) Let ii be an ergodic measure with entropy h > O. w(2). A fixed-length version of the result is all that is needed for part (a) of the entropy theorem.. E w(i)Eurk f(w(i)) a.7. then most of the path is covered by words that come from fixed collections of exponential size almost determined by entropy. Part (a) of the empirical-entropy theorem is a consequence of a very general result about parsings of "typical" sample paths. but the more general result about parsings into variable-length blocks will be developed here. c)-strongly-packed by MI if any parsing . ENTROPY-RELATED PROPERTIES. w(t)} of words for which 4 = w(1)w(2) • • • w(t). . Theorem 11. word length is denoted by £(w) and a parsing of 4 is an ordered collection P = {w(1). Fix a sequence of sets {E}. for k > 1.I4) eventually almost surely has the same covering properties as ktk. t(w(I))<K E - 2 is (1 — E)-built-up from MI. in particular. for any B c A" for which IBI < 2k( h — e) . that is. as it will be useful in later sections. if the short words cover a small enough fraction. for the latter says that for given c and all k large enough. see Exercise 3. subject only to the condition that k < (11 h) log n.3.uk. As before. . there < 2k(h+f) and pl (Tk(E)) > 1 — e. the empirical and true distributions will eventually almost surely have approximately the same covering properties. or (1 — e)-packed by {TO. (b) qk(BI4)) <E. The surprising aspect of the empirical-entropy theorem is its assertion that even if k is allowed to grow with n. The theorem is. Theorem 1. hence. A sequence 4 is (K. such that for almost every x there is a K = K(e. Eventually almost surely.14) is eventually almost surely close to the true distribution. in essence. and there is no is a set T(E) c A k such that l'Tk(E)I _ set B C A k such that IB I _< 2k(h--E) and .a Strong-packing.138 CHAPTER II. (1 — O n.elz = w(1)w(2) • • • w(t) for which awci» < --En . an empirical distribution form of entropy as covering exponent. the empirical distribution qk(.

The key result is the existence of collections for which the $\mathcal{T}_k$ are of exponential size determined by entropy and such that eventually almost surely $x_1^n$ is almost strongly-packed.

Lemma II.3.2 (The strong-packing lemma.) Let $\mu$ be an ergodic process of entropy h and let $\varepsilon$ be a positive number. There is an integer $K = K(\varepsilon)$ and, for each $k \ge K$, a set $\mathcal{T}_k = \mathcal{T}_k(\varepsilon) \subset A^k$, such that both of the following hold.

(a) $|\mathcal{T}_k| \le 2^{k(h+\varepsilon)}$, for $k \ge K$.

(b) $x_1^n$ is eventually almost surely $(K, \varepsilon)$-strongly-packed by $\{\mathcal{T}_k\}$.

Proof. The idea of the proof is quite simple. The entropy theorem provides an integer m and a set $C_m \subset A^m$ of measure close to 1 and cardinality at most $2^{m(h+\delta)}$. By the ergodic theorem, eventually almost surely most indices in $x_1^n$ are starting places of m-blocks from $C_m$. But if such an $x_1^n$ is partitioned into words then, by the Markov inequality, most of $x_1^n$ must be covered by those words which themselves have the property that most of their indices are starting places of members of $C_m$. If a word is long enough, and most of its indices are starting places of members of $C_m$, then, by the packing lemma, the word is $(1-\delta)$-packed by $C_m$, hence, by the built-up set lemma, the word is mostly built-up from $C_m$. The collection $\mathcal{T}_k$ of words of length k that are mostly built-up from $C_m$ has cardinality only a small exponential factor more than $2^{k(h+\delta)}$.

To make the outline into a precise proof, fix $\varepsilon > 0$, and let $\delta$ be a positive number to be specified later. By making $\delta$ smaller, if necessary, it can be supposed that $\delta < \varepsilon/2$. The entropy theorem provides an m for which the set

$$C_m = \{a_1^m : \mu(a_1^m) \ge 2^{-m(h+\delta)}\}$$

has measure at least $1 - \delta^2/4$. For $k \ge m$, let $\mathcal{T}_k$ be the set of sequences of length k that are $(1-\delta)$-built-up from $C_m$. By the built-up set lemma, it can be supposed that $\delta$ is small enough to guarantee that $|\mathcal{T}_k| \le 2^{k(h+\varepsilon)}$.

It remains to show that eventually almost surely $x_1^n$ is $(K, \varepsilon)$-strongly-packed by $\{\mathcal{T}_k\}$, for a suitable K. This is a consequence of the following three observations.

(i) The ergodic theorem implies that for almost every x there is an $N = N(x)$ such that for $n \ge N$, the sequence $x_1^n$ has the property that $x_i^{i+m-1} \in C_m$ for at least $(1 - \delta^2/2)n$ indices $i \in [1, n-m+1]$, that is, $x_1^n$ is $(1 - \delta^2/2)$-strongly-covered by $C_m$.

(ii) If $x_1^n$ is $(1 - \delta^2/2)$-strongly-covered by $C_m$ and parsed as $x_1^n = w(1)w(2)\cdots w(t)$, then, by the Markov inequality, the words w(i) that are not $(1 - \delta/2)$-strongly-covered by $C_m$ cannot have total length more than $\delta n \le \varepsilon n/2$.

(iii) If w(i) is $(1 - \delta/2)$-strongly-covered by $C_m$, and if $\ell(w(i)) \ge 2m/\delta$, then, by the packing lemma, w(i) is $(1-\delta)$-packed by $C_m$, that is, $w(i) \in \mathcal{T}_{\ell(w(i))}$.

. If a block belongs to Tk — B then its index in some ordering of Tk is transmitted. and t(w(t + 1)) < k. Thus. From the preceding it is enough to take K > 2/8.2k(h+') for k a K.2.3. An informal description of the proof will be given first. Proof of part (a). 8)-strongly-packed by {Tk}. If the block does . Fix c > 0 and let S < 1 be a positive number to be specified later. Proof of part (b). are longer than K. i < t. The definition of strong-packing implies that (n — k + 1)qk('Tk I4) a. if n > N(x).. The strongpacking lemma provides K and 'Tk c A " . This completes the proof of part (a) of the empirical-entropy 0 theorem. Fix k > K(x) and n a 2kh . provided only that n > 2kh . for if this is so. eventually almost surely. The second part encodes successive kblocks by giving their index in the listing of B. All the blocks. where t(w(i)) = k. The good k-code used on the blocks that are not in B comes from part (a) of the empirical-entropy theorem. This completes the proof of Lemma 11. this requires roughly hk-bits. since k < Sn. while the last one has length less than Sn. A simple two-part code can be constructed that takes advantage of the existence of such sets B. or by applying a fixed good k-block code if the block does not belong to B.140 CHAPTER II.(1 — 2 8 )n.' is (K.b Proof of the empirical-entropy theorem. Fix x such that 4 is (K. Suppose xtiz is too-well covered by a too-small collection B of k-blocks. ENTROPY-RELATED PROPERTIES.3. II. The first part is an encoding of some listing of B. since the left side is just the fraction of 4 that is covered by members of Tk . for suitable choice of S. If B is exponentially smaller than 2" then fewer than kh bits are needed to code each block in B. such that Irk I . for some k. which supplies for each k a set 'Tk c A k of cardinality roughly 2. and choose K (x) > K such that if k> K(x) and n> 2kh . except possibly the last one. and suppose xril = w(1)w(2) • • • w(t)w(t + 1). dividing by n — k + 1 produces qk(Tkixriz ) a (1 — 28)(1 —8) > (1 — c). so that if B covers too much the code will beat entropy. this contributes asymptotically negligible length if I BI is exponentially smaller than 2kh and n > 2kh . and if the parsing x 11 = w(1)w(2) • • • w(t) satisfies —2 t(w(i))<K then the set of w(i) for which f(w(i)) > K and w(i) E U Tk must have total length at El least (1 — On. if the block belongs to B. and such that x. then n > N(x) and k < Sn. 8)-strongly-packed by {T} for n a N(x). such that eventually almost surely most of the k-blocks in an n-length sample path belong to Tk .

{0. which implies part (b) of the empirical-entropy theorem. for n > N(x). since such blocks cover at most a small fraction of the sample path this contributes little to overall code length. 1}1 1 05 All is needed along with its extension to length in > 1. 1} rk(4 ±3)1 . defined by Fm (X) = F(xi)F(x2). so that if K(x) < k < (log n)/ h then either 4 E B(n). since no prefix-code sequence can beat entropy infinitely often on a set of positive measure. Let 1 • 1 denote the upper integer function. D It will be shown that the suggested code Cn beats entropy on B(n) for all sufficiently large n. eventually almost surely. then qk(BI4) < c.8. and. eventually almost surely. which establishes that 4 g B(n). if 8 is small enough and K is large enough. fix c > 0 and let 8 be a positive number to be specified later.. Indeed. Mk. eventually almost surely.{0. 141 not belong to Tk U B then it is transmitted term by term using some fixed 1-block code. for almost every x. F: A i-._. B(n).3.{0. (ii) I BI <2k(h6) and qk (BI4) > E. Part (a) of the theorem provides a set Tk c ilk of cardinality at most 2k(h+3) . the definition of {rk } implies that for almost every x there is an integer K (x) > K such that qk(Tk14) > 1 . say G: Tk i. If 4 g B(n).8. or the following holds.3. Also needed for each k is a fixed-length faithful coding of Tk.E ) . To proceed with the rigorous proof. then for almost every x there is an N(x) such that x'ii g B(n). 1}rk(h-01. EMPIRICAL ENTROPY. say. (1) If B C A k and I B I < 21* -0 .B: B i . say. Let K be a positive integer to be specified later and for n > 2/(4 let B(n) be the set of 4 for which there is some k in the interval [K. for K(x) < k < loh gn. so that if k is enough larger than K(x) to guarantee that nkh Z > N(x).SECTION 11. By using the suggested coding argument it will be shown that if 8 is small enough and K large enough then xn i . then property (1) must hold for all n > 2. and a fixed-length faithful code for each B c A k of cardinality at most 2k(h. (i) qic(Tk14) ?_ 1 . A faithful single letter code. Several auxiliary codes are used in the formal definition of C. . This fact implies part (b) of the empirical-entropy theorem. n ). F(x. (log n)/h] for which there is a set B c A k such that the following two properties hold.. an integer K(x) such that if k > K(x) and n > 2" then qk(TkIx) > 1 . for each k.

3. The two-bit header on each v(i) specifies whether w(i) belongs to B. for 4 E B(n). If 4 E B(n). The rigorous definition of Cn is as follows. and Fr are now known.C(x) 15_ tkqk(h — 6) + tk(1 — qk)(h + 6) + (3 + K)0(n). so that if K is large enough and 8 small enough then n(h — c 2 12). and as noted earlier. let Ok(B) be the concatenation in some order of {Fk (4): 4 E B } . and w(t + 1) has length r = [0. or to neither of these two sets. since reading from the left. But this implies that xriz V B(n). where (2) v(i) = The code is a prefix code. and let E be an Elias prefix code on the integers. since no sequence of codes can beat entropy infinitely often on a set of positive measure.4). Fk. 1=1 . for the block length k and set B c itic used to define C(.3 I Mk. then an integer k E [K. For 4 E B(n). Finally. for all sufficiently large n. (log n)/11] and a set B C A k of cardinality at most 2k(h-E ) are determined such that qk(B14') > c. ENTROPY-RELATED PROPERTIES. Gk. implies part (b) of the empirical-entropy theorem. The code is defined as the concatenation C(4) = 10k 1E(1B1)0k (B)v( 1)v(2) • • • v(t)v(t + 1). and 4 is parsed as 4 = w(l)w(2) • • • w(t)w(t + 1). This is because tkq k (h — c) + tk(1 — q k )(h + 3) < n(h + 3 — 6(6 + S)). 4 E B(n). the principal contribution to total code length £(4) = £(41C) comes from the encoding of the w(i) that belong to rk U B. given that x til E B(n). If 4 V B(n) then Cn (4) = OFn (4). k). and øk(B) determines B. to Tic — B. in turn. ( (I BI) determines the size of B. Lemma 11. eventually almost surely. where w(i) has length k for i < t.i<t i = t +1.B. xri' E B(n). this. the first bit tells whether x`i' E B(n). (3) . and qk = clk(Blx'1 2 ).B(w(i)) OlG k (w(i)) 11 Fk(w(i)) 11Fr (w(t +1) 0 w(i) E B W(i) E Tk . a prefix code such that the length of E(n) is log n + o(log n).B w(i) g Tk UB. Once k is known. since qk > E. and.142 CHAPTER II. a fact stated as the following lemma. for each k and B C A k . that is. from which w(i) can then be determined by using the appropriate inverse. the block 0k1 determines k. since each of the codes Mk. where aK —> 0 as K —> oc. The lemma is sufficient to complete the proof of part (b) of the empirical-entropy theorem.

which. converges to h as . X2. The encoding Øk(B) of B takes k Flog IAMBI < k(1+ log I A1)2 k(h. since k > K and n > 2". -KEn. as well as the extra bits that might be needed to round k(h — E) and k(h +3) up to integers. the number of bits used to encode the blocks in Tk U B is given by tqk (Bix)k(h — E) + (t — tq k (BI4)) k(h +3). which is.3. each of which requires k(h +3) bits. and then take H (qk )/ k as an estimate of h. The header 10k le(I BD that describes k and the size of B has length (4) 2 +k+logIBI o(logIBI) which is certainly o(n). It takes at most k(1 + log I Ai) bits to encode each w(i) Tk U B as well as to encode the final r-block if r > 0. and n> k > K. (5) provided K > (E ln 2) . There are at most t + 1 blocks. This gives the dominant terms in (3). EMPIRICAL ENTROPY. Ignoring for the moment the contribution of the headers used to describe k and B and to tell which code is being applied to each k-block. and there are at most 1 +3t such blocks. This completes completes the proof of Lemma 11. The coding proof given here is simpler and fits in nicely with the general idea that coding cannot beat entropy in the limit. a quantity which is at most 6n/K. thereby establishing part (b) of the empirical-entropy theorem. If k is fixed and n oc.3. .q) blocks that belong to B.3.3. upper bounded by K (1+ loglA1)2. Given a sample path Xl. A simple procedure is to determine the empirical distribution of nonoverlapping k-blocks. The argument parallels the one used to prove the entropy theorem and depends on a count of the number of ways to select the bad set B.3. since t < nlk.4 The original proof. hence at most 80(n) bits are needed to encode them. together these contribute at most 3(t + 1) bits to total length. then H(qk)1 k will converge almost surely to H(k)/k. Adding this to the o(n) bound of (4) then yields the 30(n) term of the lemma. II.SECTION 11. in turn. xn from an unknown ergodic process it.'. in turn. and (t — tqk(B 14)) blocks in Tk — B.E ) bits. El Remark 11. [ 52]. A problem of interest is the entropy-estimation problem. since log I BI < k(h — E) and k < (log n)I n. Thus with 6 ax = K(1 + log IA1)2 -KE — = o(K). 143 Proof of Lemma 11. K the encoding of 4k(B) and the headers contributes at most naK to total length. each of which requires k(h — E) bits. of part (b) of the empirical-entropy theorem used a counting argument to show that the cardinality of the set B(n) must eventually be smaller than 2nh by a fixed exponential factor.3. each requiring a two-bit header to tell which code is being used.3. the goal is to estimate the entropy h of II. as well as a possible extra bit to round up k(h — E) and k(h +8) to integers.c Entropy estimation. since there are tqk(Bl.

There is thus some choice of k = k(n) as a function of n for which the estimate will converge almost surely to h. At first glance, the choice of k(n) would appear to be very dependent on the measure $\mu$, because, for example, there is no universal choice of k = k(n) with $k(n)\to\infty$ for which the empirical distribution $q_{k(n)}(\cdot\mid x_1^n)$ is close to the true distribution $\mu_k$ for every mixing process $\mu$. The empirical-entropy theorem does, however, imply a universal choice for the entropy-estimation problem. The general result may be stated as follows.

Theorem II.3.5 (The entropy-estimation theorem.) If $\mu$ is an ergodic measure of entropy $h > 0$, if $k(n) \to \infty$, and if $k(n) \le (1/h)\log n$, then

(6) $\displaystyle\lim_{n\to\infty} \frac{1}{k(n)}\, H\!\left(q_{k(n)}(\cdot\mid x_1^n)\right) = h$, almost surely.

Proof. Fix $\varepsilon > 0$ and let $\{\mathcal{T}_k(\varepsilon) \subset A^k : k \ge 1\}$ satisfy part (a) of the empirical-entropy theorem. Let $U_k$ be the set of all $a_1^k \in \mathcal{T}_k(\varepsilon)$ for which $q_k(a_1^k\mid x_1^n) \le 2^{-k(h+2\varepsilon)}$, and note that

(7) $q_k(U_k\mid x_1^n) \le |\mathcal{T}_k(\varepsilon)|\,2^{-k(h+2\varepsilon)} \le 2^{k(h+\varepsilon)}2^{-k(h+2\varepsilon)} = 2^{-\varepsilon k}$,

so that $q_k(U_k\mid x_1^n) \le \varepsilon$ for all large enough k and all $n \ge 2^{kh}$. Next let $V_k$ be the set of all $a_1^k$ for which $q_k(a_1^k\mid x_1^n) \ge 2^{-k(h-2\varepsilon)}$, and note that $|V_k| \le 2^{k(h-2\varepsilon)}$. Part (b) of the empirical-entropy theorem implies that for almost every x there is a K(x) such that $q_k(V_k\mid x_1^n) < \varepsilon$, for $k \ge K(x)$ and $n \ge 2^{kh}$. Thus, for almost every x, there is a $K_1(x)$ such that $q_k(G_k\mid x_1^n) \ge 1 - 3\varepsilon$, for $k \ge K_1(x)$ and $n \ge 2^{kh}$, where $G_k = \mathcal{T}_k(\varepsilon) - U_k - V_k$; in particular, the set of $a_1^k$ for which $2^{-k(h+2\varepsilon)} \le q_k(a_1^k\mid x_1^n) \le 2^{-k(h-2\varepsilon)}$ has $q_k(\cdot\mid x_1^n)$-measure at least $1 - 3\varepsilon$. The same argument that was used to show that entropy-rate is the same as decay-rate entropy, combined with the bound (7) and part (a) of the empirical-entropy theorem, can now be applied to complete the proof of Theorem II.3.5.

Remark II.3.6 The entropy-estimation theorem is also true with the overlapping block distribution in place of the nonoverlapping block distribution, see Exercise 2.

In summary, if $k(n) \to \infty$ and $k(n) \le (1/\log|A|)\log n$, then (6) holds for any ergodic measure $\mu$ with alphabet A, at least if it is assumed that $\mu$ is totally ergodic, while if $k(n) \sim \log\log n$, then it holds for any finite-alphabet ergodic process. In particular, the choice $k(n) \sim \log n$ works for any binary ergodic process, at least for a totally ergodic $\mu$.
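The following Python sketch illustrates the plug-in estimate $H(q_{k(n)}(\cdot\mid x_1^n))/k(n)$ of the entropy-estimation theorem on a biased coin; the schedule $k(n)\approx\log\log n$ is the conservative universal choice mentioned above, and the helper `nonoverlapping_block_distribution` is repeated so the sketch is self-contained. It is an illustration only, not a statement about rates of convergence.

```python
import math
import random
from collections import Counter

def nonoverlapping_block_distribution(x, k):
    m = len(x) // k
    counts = Counter(tuple(x[i * k:(i + 1) * k]) for i in range(m))
    return {b: c / m for b, c in counts.items()}

def entropy_estimate(x):
    """Plug-in estimate H(q_k(.|x)) / k with the conservative schedule
    k(n) ~ log log n discussed above (base-2 logarithms)."""
    n = len(x)
    k = max(1, int(round(math.log2(max(2.0, math.log2(n))))))
    q = nonoverlapping_block_distribution(x, k)
    H = -sum(p * math.log2(p) for p in q.values())
    return H / k

# A biased coin has entropy -p log p - (1-p) log(1-p); compare with the estimate.
random.seed(1)
p = 0.3
x = [1 if random.random() < p else 0 for _ in range(200000)]
true_h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(round(entropy_estimate(x), 3), "vs true", round(true_h, 3))
```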

II.3.d Universal coding.

A universal code for the class of ergodic processes is a faithful code sequence $\{C_n\}$ such that for any ergodic process $\mu$,

$$\lim_{n\to\infty} \frac{\ell(x_1^n)}{n} = h, \text{ almost surely,}$$

where h is the entropy of $\mu$. Such a code sequence was constructed in Section II.1, and it was also shown that the Lempel-Ziv algorithm gives a universal code. The empirical-entropy theorem provides another way to construct a universal code. (For simplicity it is assumed that the alphabet is binary.) The steps of the code are as follows.

Step 1. Partition $x_1^n$ into blocks of length $k = k(n) \sim (1/2)\log_{|A|} n$.

Step 2. Transmit a list of these k-blocks in order of decreasing frequency of occurrence in $x_1^n$.

Step 3. Encode successive k-blocks in $x_1^n$ by giving the index of the block in the code book, with an Elias code, since the number of bits needed to transmit an index is of the order of magnitude of the logarithm of the index. High frequency blocks appear near the front of the code book and therefore have small indices.

Step 4. Encode the final block, if k does not divide n, with a per-symbol code.

To make this sketch into a rigorous construction, fix $\{k(n)\}$ for which $k(n) \sim (1/2)\log n$, and let $\mathcal{E}$ be a prefix code on the natural numbers such that the length of the word $\mathcal{E}(j)$ is $\log j + o(\log j)$, an Elias code. Suppose $n = tk + r$, $r \in [0, k)$, and $x_1^n = w(1)w(2)\cdots w(t)v$, where each w(i) has length k. The code word $C_n(x_1^n)$ is a concatenation of two binary sequences, a header of fixed length $m = k2^k$, followed by a sequence whose length depends on $x_1^n$.

The header $b_1^m$ is a concatenation of all the members of $\{0,1\}^k$, subject only to the rule that $a_1^k$ precedes $\tilde a_1^k$ whenever

$$q_k(a_1^k\mid x_1^n) > q_k(\tilde a_1^k\mid x_1^n).$$

The sequence $\{v(j) = b_{jk+1}^{jk+k}: j = 0, 1, \ldots, 2^k - 1\}$ is called the code book. For $1 \le i \le t$ and $0 \le j < 2^k$, define the address function by the rule $A(w(i)) = j$, if $w(i) = v(j)$, that is, if w(i) is the j-th word in the code book. Define $b_{m+1}^{m+L}$ to be the concatenation of the $\mathcal{E}(A(w(i)))$ in order of increasing i, where L is the sum of the lengths of the $\mathcal{E}(A(w(i)))$. Finally, define $b_{m+L+1}^{m+L+r}$ to be $x_{tk+1}^n$. The code $C_n(x_1^n)$ is defined as the concatenation

$$C_n(x_1^n) = b_1^m \cdot b_{m+1}^{m+L} \cdot b_{m+L+1}^{m+L+r}.$$

The number of bits needed to transmit the list, $m = k2^k$, is short relative to n, since $k \sim (1/2)\log n$. The code length $\ell(x_1^n) = m + L + r$ depends on $x_1^n$, since L depends on the distribution $q_k(\cdot\mid x_1^n)$, but the use of the Elias code insures that $C_n$ is a prefix code.
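A minimal Python sketch of the code just described, assuming an Elias gamma code as the prefix code $\mathcal{E}$ on the integers and a binary alphabet; the header format and the handling of the final block are simplified assumptions, so this is an illustrative reading of Steps 1-4 rather than the exact bit layout of the construction.

```python
from collections import Counter

def elias_gamma(j):
    """Elias gamma code for j >= 1: (len-1) zeros, then j in binary."""
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def universal_code(x, k):
    """Encode a binary string x with a frequency-sorted code book of k-blocks."""
    t = len(x) // k
    blocks = [x[i * k:(i + 1) * k] for i in range(t)]
    freq = Counter(blocks)
    # Code book: all 2^k blocks, most frequent first (ties broken lexicographically).
    book = sorted((format(j, "0%db" % k) for j in range(2 ** k)),
                  key=lambda b: (-freq[b], b))
    index = {b: j for j, b in enumerate(book)}
    header = "".join(book)                      # fixed length k * 2^k
    body = "".join(elias_gamma(index[b] + 1) for b in blocks)
    tail = x[t * k:]                            # final r < k symbols sent as-is
    return header + body + tail

x = "0110101101101011" * 50
code = universal_code(x, 4)
print(len(code), "code bits for", len(x), "source bits")
```

Frequent blocks sit near the front of the code book, so their gamma code words are short, which is exactly the mechanism behind Step 3.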

The empirical-entropy theorem will be used to show that for any ergodic $\mu$,

(8) $\displaystyle\lim_{n\to\infty} \frac{\ell(x_1^n)}{n} = h$, a.s.,

where h is the entropy of $\mu$.

To establish this, let $\{\mathcal{T}_k(\varepsilon) \subset A^k\}$ be the sequence given by the empirical-entropy theorem. For each n, where $k = k(n)$, let $G_k$ denote the first $2^{k(h+\varepsilon)}$ members of the code book. Note that $|G_k| \ge |\mathcal{T}_k(\varepsilon)|$ and, furthermore,

$$q_k(G_k\mid x_1^n) \ge q_k(\mathcal{T}_k(\varepsilon)\mid x_1^n),$$

since $G_k$ consists of $2^{k(h+\varepsilon)}$ k-sequences of largest $q_k(\cdot\mid x_1^n)$-probability, while $|\mathcal{T}_k(\varepsilon)| \le 2^{k(h+\varepsilon)}$. The empirical-entropy theorem provides, for almost every x, an integer K(x) such that $q_k(G_k\mid x_1^n) \ge q_k(\mathcal{T}_k(\varepsilon)\mid x_1^n) \ge 1 - \varepsilon$ for $k \ge K(x)$. Thus, for $k \ge K(x)$, at least a $(1-\varepsilon)$-fraction of the binary addresses $\mathcal{E}(A(w(i)))$ will refer to members of $G_k$ and thus have lengths bounded above by $k(h+\varepsilon) + o(\log n)$, while for those $w(i) \notin G_k$ the crude bound $k + o(\log n)$ will do. Hence

$$\ell(x_1^n) \le (1-\varepsilon)n(h+\varepsilon) + \varepsilon n + o(\log n),$$

which proves that

$$\limsup_{n} \frac{\ell(x_1^n)}{n} \le h, \text{ a.s.}$$

The reverse inequality follows immediately from the fact that too-good codes do not exist, but it is instructive to note that it also follows from the second part of the empirical-entropy theorem. In fact, let $B_k$ be the first $2^{k(h-\varepsilon)}$ sequences in the code book for x. The empirical-entropy theorem guarantees that eventually almost surely there are at most $\varepsilon t$ indices i for which $w(i) \in B_k$. These are the only k-blocks that have addresses shorter than $k(h-\varepsilon) + o(\log n)$, so that

$$\liminf_{n} \frac{\ell(x_1^n)}{n} \ge (h-\varepsilon)(1-\varepsilon), \text{ a.s.}$$

Since $\varepsilon$ is arbitrary, (8) follows.

Remark II.3.7 The universal code construction is drawn from [49], which also includes universal coding results for coding in which some distortion is allowed. A second, and in some ways even simpler, algorithm which does not require that the code book be listed in any specific order was obtained in [43], and is described in Exercise 4. Another application of the empirical-entropy theorem will be given in the next chapter, in the context of the problem of estimating the measure $\mu$ from observation of a finite sample path, see Section III.2.

II.3.e Exercises

1. Show that the empirical-entropy theorem is true with the overlapping block distribution,

$$p_k(a_1^k\mid x_1^n) = \frac{\left|\{i \in [1, n-k+1]: x_i^{i+k-1} = a_1^k\}\right|}{n-k+1},$$

in place of the nonoverlapping block distribution $q_k(\cdot\mid x_1^n)$. (Hint: reduce to the nonoverlapping case for some small shift of the sequence.)

2. Show that the entropy-estimation theorem is true with the overlapping block distribution in place of the nonoverlapping block distribution.

3. Show that the variational distance between $q_k(\cdot\mid x_1^n)$ and $\mu_k$ is asymptotically almost surely lower bounded by $(1 - e^{-1})$ for the case when $n = k2^k$ and $\mu$ is unbiased coin-tossing. (Hint: if M balls are thrown at random into M boxes, then the expected fraction of empty boxes is asymptotic to $e^{-1}$.)

4. Another simple universal coding procedure is suggested by the empirical-entropy theorem. For each $k \le n$ construct a code $C_{k,n}$ as follows. Express $x_1^n$ as the concatenation $w(1)w(2)\cdots w(q)v$, of k-blocks $\{w(i)\}$, plus a possible final block v of length less than k. Make a list in some order of the k-blocks that occur, transmit k and the list, then code successive w(i) by using a fixed-length code; such a code requires at most $k(1 + \log|A|)$ bits per word, along with a header to specify the value of k. Append some coding of the final block v. Call this $C_{k,n}(x_1^n)$ and let $\ell_k(x_1^n)$ denote the length of $C_{k,n}(x_1^n)$. The final code $C(x_1^n)$ transmits the shortest of the codes $\{C_{k,n}(x_1^n)\}$: let $k_{\min}$ be the first value of $k \in [1, n]$ at which $\ell_k(x_1^n)$ achieves its minimum; the code $C(x_1^n)$ is the concatenation of $\mathcal{E}(k_{\min})$ and $C_{k_{\min},n}(x_1^n)$. Show that $\{C_n\}$ is a universal prefix-code sequence.

Section II.4 Partitions of sample paths.

An interesting connection between entropy and partitions of sample paths into variable-length blocks was established by Ornstein and Weiss, [53]. They show that eventually almost surely, for any partition into distinct words, most of the sample path is covered by words that are not much shorter than $(\log n)/h$, and for any partition into words that have been seen in the past, most of the sample path is covered by words that are not much longer than $(\log n)/h$. In other words, distinct words cannot be too short and repeated words cannot be too long, relative to $(\log n)/h$. Their results were motivated by an attempt to better understand the Lempel-Ziv algorithm, which partitions into distinct words, except possibly for the final word, such that all but the last symbol of each word has been seen before.

As in earlier discussions, a word w is a finite sequence of symbols drawn from the alphabet A and the length of w is denoted by $\ell(w)$. A sequence $x_1^n$ is said to be parsed (or partitioned) into the (ordered) set of words $\{w(1), w(2), \ldots, w(t)\}$ if it is the concatenation

(1) $x_1^n = w(1)w(2)\cdots w(t)$.

If $w(i) \ne w(j)$, for $i \ne j$, then (1) is called a parsing (or partition) into distinct words. For example,

000110110100 = [000][110][1101][00]

is a partition into distinct words, while

000110110100 = [00][0110][1101][00]

is not.
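To make the distinct-words definition concrete, here is a hypothetical Python sketch of the simple Lempel-Ziv parsing mentioned above, in which the next word is the shortest block that has not yet occurred as a word; all words are then distinct except possibly the final one.

```python
def lz_parse_distinct(x):
    """Parse x into words, each the shortest block not yet used as a word;
    all words are distinct except possibly the final (incomplete) one."""
    words, seen = [], set()
    start = 0
    for i in range(len(x)):
        w = x[start:i + 1]
        if w not in seen:
            seen.add(w)
            words.append(w)
            start = i + 1
    if start < len(x):
        words.append(x[start:])   # possibly a repeat of an earlier word
    return words

print(lz_parse_distinct("000110110100"))
# ['0', '00', '1', '10', '11', '01', '00']  -- final word repeats an earlier one
```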

Theorem II.4.1 (The distinct-words theorem.) Let $\mu$ be ergodic with entropy $h > 0$, and let $\varepsilon > 0$ be given. For almost every $x \in A^\infty$ there is an $N = N(\varepsilon, x)$ such that if $n \ge N$ and $x_1^n = w(1)w(2)\cdots w(t)$ is a partition into distinct words, then

$$\sum_{\ell(w(i)) \le (1-\varepsilon)(\log n)/h} \ell(w(i)) \le \varepsilon n.$$

The (simple) Lempel-Ziv algorithm discussed in Section II.2 parses a finite sequence into distinct words, $x_1^n = w(1)w(2)\cdots w(C)w(C+1)$, except possibly for the final word w(C+1), which is the same as a prior word and hence satisfies $\ell(w(C+1)) \le n/2$. Thus the distinct-words theorem implies that, eventually almost surely, most of the sample path is covered by blocks that are at least as long as $(h+\varepsilon)^{-1}\log(n/2)$, while the part covered by shorter words contains only a few words, by the crude bound, so that

(2) $\displaystyle\limsup_{n} \frac{C(x_1^n)\log n}{n} \le h$, a.s.

To discuss partitions into words that have been seen in the past, some way to get started is needed. For this purpose, let $\mathcal{F}$ be a fixed finite collection of words, called the start set. A partition $x_1^n = w(1)w(2)\cdots w(t)$ is called a partition (parsing) into repeated words, if each w(i) is either in $\mathcal{F}$ or has been seen in the past, that is, it is the same as a prior word, meaning that there is an index $j \le \ell(w(1)) + \cdots + \ell(w(i-1))$ such that $w(i) = x_j^{j+\ell(w(i))-1}$. Partitions into repeated words have the asymptotic property that most of the sample path must be contained in words that are not too long relative to entropy.

Theorem II.4.2 (The repeated-words theorem.) Let $\mu$ be ergodic with entropy $h > 0$, and let $0 < \varepsilon < h$ and a start set $\mathcal{F}$ be given. For almost every $x \in A^\infty$ there is an $N = N(\varepsilon, x)$ such that if $n \ge N$ and $x_1^n = w(1)w(2)\cdots w(t)$ is a partition into repeated words, then

$$\sum_{\ell(w(i)) \ge (1+\varepsilon)(\log n)/h} \ell(w(i)) \le \varepsilon n.$$

It is not necessary to require exact repetition in the repeated-words theorem; it is sufficient if all but the last symbol appears earlier, see Exercise 1. With this modification, the repeated-words theorem applies to the versions of the Lempel-Ziv algorithm discussed in Section II.2, so that eventually almost surely most of the sample path must be covered by words that are no longer than $(h-\varepsilon)^{-1}\log n$, which, in turn, implies that

(3) $\displaystyle\liminf_{n} \frac{C(x_1^n)\log n}{n} \ge h$, a.s.

Thus the distinct-words theorem and the modified repeated-words theorem together provide an alternate proof of the LZ convergence theorem, Theorem II.2.2. Of course, as noted earlier, the lower bound (3) is also a consequence of the fact that entropy is an upper bound, (2), and the fact that too-good codes do not exist. The proofs of both theorems make use of the strong-packing lemma, Lemma II.3.2.

then words that are too short relative to (log n)Ice cannot cover too much. Lemma 11. w(t) is a partition into distinct words such that t(w(i)) 5_ En. POD (j)) <K The second observation is that if the distinct words mostly come from sets that grow in size at a fixed exponential rate a. there is an N such that if n > N and xrii = w(1)w(2).(i)). t > 1.4 For each k..(.3 Given K and 8 > 0. PARTITIONS OF SAMPLE PATHS. because there are at most IA i k words of length k. Lemma 11. Proof First consider the case when all the words are required to come from G. This proves the lemma. The first observation is that the words must grow in length. suppose Gk C A k satisfies IGkI _< 2ak . To continue with the proof of the distinct-words theorem. yields an integer K and { rk C A": k> 1} such that I7j < 21c(h+S) . and let G = UkGk. where a is a fixed positive number. let 8 be a positive number to be specified later. by an application of the standard bound.(. Lemma 11. The distinct-words theorem is a consequence of the strong-packing lemma. The strong-packing lemma.4.4.SECTION 11. Et _ The general case follows since it is enough to estimate the sum over the too-short words that belong to G. The fact that the words are distinct then implies that .._Exio g n)la E top(o) E Ic<(1-6)(log n)la Since IGk I <2"k the sum on the right is upper bounded by ( 2' ((1 — E)logn) —1) exp2 ((1 — E)logn)) = logn) = o(n).2. w(t) is a partition into distinct words. 149 II. The proof of this simple fact is left to the reader. Given E > 0 there is an N such that if n > N and x n = w(1)w(2).3.a Proof of the distinct-words theorem. together with some simple facts about partitions into distinct words. w(i)gc then t(w(i)) < 2En. . t —) men. k > K. then E aw (i) ) Bn.4.4...

this time applied to variable-length parsing. r w(t) into distinct words.8. As in the proof of the empirical-entropy theorem.b Proof of the repeated-words theorem.4. f(w(i))(1-15)(logn)/(h+S) provided only that n is large enough. 3)-strongly-packed by {'Tk }. It will be shown that if the repeated-words theorem is false.2. and such that eventually almost surely xrii is (K.. Throughout this discussion p. The first idea is to merge the words that are between the too-long words. Label these too-long words in increasing order of appearance as V(1). Suppose also that n is so large that.x. V(2). Suppose = w(l)w(2) w(t) is some given parsing. ENTROPY-RELATED PROPERTIES.' is (K. then a prefix-code sequence can be constructed which beats entropy by a fixed amount infinitely often on a set of positive measure. and therefore Lemma 11. 3)-strongly-packed by MI. will be a fixed ergodic process with positive entropy h and E < h will be a given positive number. for any parsing P of xriz for which (4) E Sn atv(i)) — . the existence of good codes is guaranteed by the strong-packing lemma.4 can be applied with a = h + 3 to yield aw(i)) < 25n. since 8 could have been chosen in advance to be so small that log n (1 — 2S) log n h + c 5h+3 • 1:3 I1. the basic idea of the version of the Lempel-Ziv code suggested in Remark 11. E w(i)Euciz k rk t (o )) a ( l-3)n.) If a good code is used on the complement of the too-long words and if the too-long words cover too much. for 1 < i < s + 1. V(s). it into distinct words. Let u1 be the concatenation of all the words that precede V(1).150 CHAPTER II.3. property (4) holds for any parsing of . let ui be the concatenation of all the words that come between V(i — 1) and V(i).4. The distinct-words theorem follows. and let . then overall code length will be shorter than entropy allows. for which s of the w(i) are too long. i. Thus. strong-packing implies that given a parsing x = w(l)w(2) E w(i)EU Klj t (o)) a _3)n.4. e. by Lemma 11. A block of consecutive symbols in xiit of length more (h — E) -1 log n will be said to be too long. 2 Suppose xr. of course. The idea is to code each too-long word of a repeated-word parsing by telling where it occurs in the past and how long it is (this is.

ni + ± imi has been seen in the past is to say that there is an index i E [0.. Let G(n) be the set of all xri' that are (K. a too-long representation = u1V(1)u2V(2). eventually almost surely. V(s)} and fillers lui. (c) Each too-long V(j) is coded by specifying its length and the start position of its earlier occurrence. Let 3 be a positive number to be specified later. PARTITIONS OF SAMPLE PATHS. a set 'Tk c A k of cardinality at most 2k(h+8) . with the too-long words {V(1).T is a fixed finite set. it can be supposed such that V(j) that when n is large enough no too-long word belongs to F. provides an integer K and for a each k > K. 3)-stronglypacked by {'Tk }. (b) Each filler u1 g U K r k is coded by specifying its length and applying a fixed single-letter code to each letter separately. eventually almost surely. us V(s)u s+i with the following two properties. such that eventually almost surely xril is (K.2. The idea is to code sequences in B(n) by telling where the too-long words occurred earlier. Since .. The strong-packing lemma provides the good codes to be used on the fillers. In this way. If x E B(n) n G(n). 8)-strongly-packed by {Tk } ... ni ) To say V(j) = xn Since the start set .. Let B(n) be the set of all sequences x for which there is an s and a too-long representation x = uiV (1)u2 V(2) . A code C. (a) Each filler u 1 E U K Tk is coded by specifying its length and giving its index in the set Tk to which it belongs. to complete the proof of the repeated-words theorem it is enough to prove that if K is large enough. (i) Ei f ( V (D) >En..SECTION 11.3. and therefore to prove the repeated-words theorem. x'iz is expressed as the concatenation x Çz = u i V(1)u2V(2) us V(s)u s±i . The words in the too-long representation are coded sequentially using the following rules. Such a representation is called a too-long representation of 4.q E G(n) eventually almost surely. . Sequences not in B(n) n G(n) are coded using some fixed single-letter code on each letter separately. it is enough to prove that x g B(n). but to make such a code compress too much a good way to compress the fillers is needed.. is constructed as follows.u s V(s)u s+i. An application of the strong-packing lemma. is determined for which each too-long V(j) is seen somewhere in its past and the total length of the {V(j)} is at least En. . Lemma 11. 151 u s+ 1 be the concatenation of all the words that follow V(s).4. then B(n) n G(n). (ii) Each V(j) has been seen in the past.

"ell B(n) n G(n). as well as to the lemma itself.(x €)) .C(x'11 ) = s log n + ( E ti J ELJA7=Kii B(n) n G(n). For the fillers that do not belong to U71. as well as the encoding of the lengths of all the words. — log n — log n implies that E au . and from specifying the index of each u i E UlciK rk. is that the number s of too-long words. the code becomes a prefix code. ) <K (1 + n(h log n j). E au . plus the two bit headers needed to tell which code is being used and the extra bits that might be needed to round up log n and h + 6 to integers. This fact is stated in the form needed as the following lemma. that is. must satisfy the bound.5 If log K < 6 K . a prefix code S on the natural numbers such that the length of the code word assign to j is log j + o(log j). The lemma is sufficient to show that xrii B(n) n G(n). eventually almost surely. first note that there are at most s + 1 fillers so the bound Es. For 4 E B(n) n G(n). completing the proof of the repeated-words theorem. which. tw(i) > en. for XF il E B(n) n G(n). by definition. With a one bit header to tell whether or not xri` belongs to B(n) n G(n)... which requires Flog ni bits for each of the s too-long words. which requires £(u) [h + 81 bits per word. xÇ2 E B(n) n G(n). By assumption. ENTROPY-RELATED PROPERTIES. It is enough to show that the encoding of the fillers that do not belong to Urk . and hence if 6 is small enough then £(4) < n(h — € 2 12). then (5) uniformly for 4 . (6) s — av(i)) log n (h — e). The key to this. for all sufficiently large n. ) (h + 6) + 60(n). shows that. while it depends on x.5. a word is too long if its length is at least (h — E) -1 log n.4. An Elias code is used to encode the block lengths. indeed.C(fi` IC n ) comes from telling where each V(j) occurs in the past. Two bit headers are appended to each block to specify which of the three types of code is being used. Proof of Lemma 11. eventually almost surely. since. so that (5) and (6) yield the code-length bound < n(h + — e(e + 6)) + 0(n). n s < i="' (h — e) < — (h — 6).152 CHAPTER II. since it is not possible to beat entropy by a fixed amount infinitely often on a set of positive measure. Lemma 11.4. require a total of at most 60(n) bits. the principal contribution to total code length E(xriz) .

The key bound is that the number of ways to select s disjoint subintervals of [1. Finally. which can be assumed to be at least e. since it was assumed that log K < 5K. and since x -1 log x is decreasing for x > e.SECTION 11. which is 50(n). PARTITIONS OF SAMPLE PATHS. then the assumption that 4 E G(n). Remark 11. .4. Finally. that is. The first sum is at most (log K)(s + 1). the total number of bits required to encode all such fillers is at most (u. the total length contributed by the two bit headers required to tell which type of code is being used plus the possible extra bits needed to round up log n and h + 8 to integers is upper bounded by 3(2s + 1) which is o(n) since s < n(h — E)/(logn). which is o(n) for K fixed.6 The original proof. of the repeated-words theorem parallels the proof of the entropy theorem.c Exercises 1. II. Since it can be assumed that the too-long words have length at least K. the dominant terms of which can be expressed as E j )<K log t(u i ) + E t(u i )K log t(u i ) + logt(V(i)). the encoding of the lengths contributes EV(t(u i)) Et(eV(V(i») bits.4.5. since there are at most 2s + 1 words. n] of length Li > (log n)/ (h — 6). + o(n)). a u . Since it takes L(u)Flog IAI1 bits to encode a filler u g UTk. [53]. Extend the distinct-words and repeated-words theorems to the case where agreement in all but a fixed number of places is required. is upper bounded by exp((h — c) Et.•)) [log on SnFlog IA11. using a counting argument to show that the set B(n) n G(n) must eventually be smaller than 2nh by a fixed exponential factor. This completes the proof of Lemma 11. 6)-strongly-packed by {Tk }. once n is large enough to make this sum less than (Sn/2. This is equivalent to showing that the coding of the too-long repeated blocks by telling where they occurred earlier and their lengths requires at most (h — c) E ti + o(n) bits. Extend the repeated-words theorem to the case when it is only required that all but the last k symbols appeared earlier. The coding proof is given here as fits in nicely with the general idea that coding cannot beat entropy in the limit. E 7.4. implies that the words longer than K that do not belong to UTk cover at most a 6-fraction of x .. which is Sn. In particular. that 4 is (K. 153 which is o(n) for fixed K. 2.4. the second and third sums together contribute at most log K K n to total length.

Section 11. 4. process (i.5. Wyner and Ziv showed that the logarithm of the waiting time until the first n terms of a sequence x occurs again in x is asymptotic to nh.u(x'11 ). (c) Show that eventually almost surely. i. [86]. lim 1 log Rn (x) = h. both the upper and lower limits are subinvariant. n n--÷oo Some preliminary results are easy to establish.d. [85]. equiprobable. almost surely. The definition of the recurrence-time function is Rn (x) = min{rn > 1: x m+ m+n 1 — Xn } — . e.3. 5.5 Entropy and recurrence times. . 3. (Hint: use Barron's lemma. see also the earlier work of Willems. Define the upper and lower limits. These results. n 1 r(x) = lim inf . the words longer than (1 + c) log n in a parsing of 4 into repeated words have total length at most n 1.) For any ergodic process p. Lemma 11. almost surely. i . An almost-sure form of the Wyner-Ziv result was established by Ornstein and Weiss.1. along with an application to a prefix-tree problem will be discussed in this section. Carry out the details of the proof that (3) follows from the repeated-words theorem. What can be said about distinct and repeated words in the entropy 0 case? 6. be the Kolmogorov measure of the binary.) (a) Show that eventually almost surely.i..1 (The recurrence-time theorem. ENTROPY-RELATED PROPERTIES. that is. [53]. Carry out the details of the proof that (2) follows from the distinct-words theorem. Let . no parsing of 4 into repeated words contains a word longer than 4 log n. unbiased coin-tossing. there are fewer than n 1-612 words longer than (1 + c) log n in a parsing of 4 into repeated words. n—woo n T(x) Since Rn _1(T x) < R n (x). An interesting connection between entropy and recurrence times for ergodic processes was discovered by Wyner and Ziv.u. This is a sharpening of the fact that the average recurrence time is 1/.log Rn (x). whose logarithm is asymptotic to nh. The theorem to be proved is Theorem 11. and r(Tx) < r(x). 1 = lim sup .log R„(x). F(Tx) <?(x).E12 log n. in probability. with entropy h.154 CHAPTER II.) (b) Show that eventually almost surely.

if Tri = {X: t(x) > 2-n(h-i-E/2)} then x E T. and hence. so it is enough to prove (1) x g Dn n 7. . then Tx g [a7]. . and hence the sequence D(a). there are constants F and r such that F(x) = F and r(x) = r. fix (4 and consider only those x E D„ for which xri' = ari'. n > 1.(4).. n [4]. for j E [1. a more direct coding construction is used to show that if recurrence happens too soon infinitely often.2. (1). eventually almost surely. 7-1 D. however. ENTROPY AND RECURRENCE TIMES. then there is a prefix-code sequence which beats entropy in the limit. by the definition of D(a). almost surely. 7' -2a(h+°+1 Dn(ai) must be disjoint. both upper and lower limits are almost surely constant.. Since these sets all have the same measure.„ eventually almost surely. The inequality r > h will be proved by showing that if it is not true." Indeed. It is enough to prove this under the additional assumption that xril is "entropy typical. The Borel-Cantelli principle implies that x g D. The bound r > h was obtained by Ornstein and Weiss by an explicit counting argument similar to the one used to prove the entropy theorem. eventually almost surely. the set D(a7) = D. The goal is to show that x g D. = fx: R(x) > r(h+E ) }. then eventually almost surely. both are almost surely invariant. 155 hence. which is impossible. based on a different coding idea communicated to the author by Wyner. is outlined in Exercise 1..5. .SECTION 11. r < F. by Theorem 11. To establish this. 2n(h+E) — 1]. In the proof given below. 2—nE/2 . so the desired result can be established by showing that r > h and F < h. To establish i: < h. n T„. The bound F < h will be proved via a nice argument due to Wyner and Ziv. Since the measure is assumed to be ergodic. Indeed. Note. Another proof.(4)) < 2-n(h+E). n 'T.. It is now a simple matter to get from (2) to the desired result.1. disjointness implies that (2) It(D. that is. which is upper bounded by 2n(h+E / 2) . the cardinality of the projection onto An of D. so that (2) yields 1-1 (Dn < n Tn) _ 2n(h-1-€12)2—n(h+E) . that if x E D(a). A code that compresses more than entropy can then be constructed by specifying where to look for the first later occurrence of such blocks. is upper bounded by the cardinality of the projection of Tn . fix e > 0 and define D. sample paths are mostly covered by long disjoint blocks which are repeated too quickly. eventually almost surely. Furthermore. so the iterated almost sure principle yields the desired result (1).

m)-too-soon-recurrent representation. To develop the covering idea suppose r < h — E. where cm --÷ 0 as m .2 (4) f(xii) < n(h — E) n(3dS + ot ni ). compresses the members of G(n) too well for all large enough n.5. The one bit headers on F and in the encoding of each V (j) are used to specify which code is being used. The key facts here are that log ki < i(V(j))(h — c). a concatenation = u V (1)u 2 V (2) • u j V (J)u j+1 . followed by e(k i ). where E > O. For n > m. (b) The sum of the lengths of the filler words ui is at most 3 8 n. (i) Each filler word u i is coded by applying F to its successive symbols. such that each code word starts with a O.156 CHAPTER II.1 )u. is a prefix n-code. m)-too-soon-recurrent representation of 4. (ii) Each V(j) is coded using a prefix 1. both to be specified later. that is." and that the principal contribution to total code length is E i log ki < n(h—c). For x E G(n) a (8. is said to be a (8. a prefix code on the natural numbers such that t(E(j)) = log j + o(log j). The least such k is called the distance from xs to its next occurrence in i• Let m be a positive integer and 8 be a positive number. It will be shown later that a consequence of the assumption that r < h — c is that (3) xÇ E G(n). A block xs. if the following hold. followed by E(t(V (j)). for n > m. where k i is the distance from V(j) to its next occurrence in x. Let G(n) be the set of all xi' that have a (8.' of s lik k for consecutive symbols in a sequence 4 is said to recur too soon in 4 if xst = x` some k in the interval [ 1 . m)-too-soon-recurrent representation 4 = u V(1)u2 V(2) • • • u V( . oc. (a) Each V(j) has length at least m and recurs too soon in 4. Lemma 11. for which s +k < n. a result stated as the following lemma. X E G (n ) . ENTROPY-RELATED PROPERTIES. 2h(t -s + 1) ) . eventually almost surely. Here it will be shown how a prefix n-code Cn can be constructed which. The code also uses an Elias code E. A one bit header to tell whether or not 4 E G(n) is also used to guarantee that C. 1}d. where d < 2+ log l Ai. by the definition of "too-soon recurrent.t+i is determined and successive words are coded using the following rules. Sequences not in G(n) are coded by applying F to successive symbols. The code Cn uses a faithful single-letter code F: A H+ {0. for suitable choice of m and S.

the encoding of each filler u i requires dt(u i ) bits.C(x'ii) < n(h — 6/2) on G(n). . for B = n=m Bn' The ergodic theorem implies that (1/N) Eni Bx (Ti-i X ) > 1 . To carry it out. each of length at least m and at most M. x) and a collection {[n i . since (1/x) log x is decreasing for x > e. Since lim inf. The Elias encoding of the length t(V(j)) requires log t(V(D) o(log f(V(j))) bits. Thus r > h. by the definition of "too-soon recurrent. and hence the packing lemma implies that for almost every x.5. there are at most n/ m such V(j) and hence the sum over j of the second term in (5) is upper bounded by nI3. the second terms contribute at most Om bits while the total contribution of the first terms is at most Elogf(V(D) n logm rn . 0 It remains to be shown that 4 E G(n).. assume r < h — E and. where an. Proof of Lemma 11. for which the following hold.8. by property (a) of the definition of G(n). m]: i < of disjoint subintervals of [1. as m Do. 157 The lemma implies the desired result for by choosing 6 small enough and m large enough. Since each V(j) is assumed to have length at least m. define B n = (x: R(x) < 2 n(h—E) ). In summary. Finally. But too-good codes do not exist.SECTION 11.u(G(n)) must go to 0.2. n]. + (1 ± log m)/m 0. ENTROPY AND RECURRENCE TIMES. R n (x) = r there is an M such that n=M . so that . establishing the recurrence-time theorem. The one bit headers on the encoding of each V(j) contribute at most n/ m bits to total code length. there is an N(x) > M/8 such that if n > (x). bits. Summing on j < n/ m.5.u(B) > 1 — 6. contradicting the fact (3) that 4 E G(n). eventually almost surely. This completes the proof of Lemma 11. This is a covering/packing argument. The Elias encoding of the distance to the next occurrence of V(j) requires (5) f(V (j))(h — E) o(f(V(j))) bits.„ where /3. for each n > 1.2. and the total length of the fillers is at most 36n. a fact that follows from the assumption that 0 r <h — E. Thus the complete encoding of all the fillers requires at most 3d6n bits. there is an integer I = I (n.5. eventually almost surely.n --* 0 as m cc." Summing the first term over j yields the principal term n(h — E) in (4). the lemma yields . for all large enough n. the complete encoding of the too-soon recurring words V(j) requires a total of at most n(h — E) nun. for fixed m and 6. eventually almost surely. = 2/3.

(a) For each $i \le I$, the block $x_{n_i}^{m_i}$ recurs too soon in $x_1^n$, that is, $x_{n_i}^{m_i} = x_j^{j+m_i-n_i}$ for some $j \in [n_i + 1,\ n_i + 2^{(m_i-n_i+1)(h-\varepsilon)}]$.

(b) $\sum_i (m_i - n_i + 1) \ge (1 - 2\delta)n$.

Since, by definition, each of these blocks has length at least m and at most M, and since it can be supposed that $n \ge 2^{M(h-\varepsilon)}/\delta$, condition (b) can be replaced by

(c) $\sum_{i \le J} (m_i - n_i + 1) \ge (1 - 3\delta)n$,

where J is the largest index i for which $n_i + 2^{M(h-\varepsilon)} \le n$. The sequence $x_1^n$ can then be expressed as a concatenation

(6) $x_1^n = u_1 V(1) u_2 V(2) \cdots u_J V(J) u_{J+1}$,

where each $V(j) = x_{n_j}^{m_j}$ recurs too soon, where each $u_i$ is a (possibly empty) word, and where the total length of the V(j) is at least $(1-3\delta)n$, provided n is large. This completes the proof that $x_1^n \in G(n)$, eventually almost surely, for fixed m and $\delta$, if it is assumed that $\underline r < h - \varepsilon$, and thereby finishes the proof of the recurrence-time theorem.

Remark II.5.3 The recurrence-time theorem suggests a Monte Carlo method for estimating entropy, as a quick way to test the performance of a coding algorithm. Let $k \sim \alpha\log n$, where $\alpha < 1$, select indices $i_1, i_2, \ldots, i_t$ independently at random in $[1, n]$, and take $(1/kt)\sum_s \log R_k(T^{i_s}x)$ as an estimate of h. This works reasonably well for suitably nice processes.

II.5.a Entropy and prefix trees.

Closely related to the recurrence concept is the prefix tree concept. A sequence $x \in A^\infty$ together with its first $n-1$ shifts, $Tx, T^2x, \ldots, T^{n-1}x$, defines a tree as follows. For each $0 \le i < n$, let $W(i) = W(i, n, x)$ be the shortest prefix of $T^i x$ which is not a prefix of $T^j x$, for any $j \ne i$, $0 \le j < n$. Since the set $\{W(i): i = 0, 1, \ldots, n-1\}$ is prefix-free, it defines a tree, $\mathcal{T}_n(x)$, called the n-th order prefix tree determined by x. For example, if

x = 001010010010..., Tx = 01010010010..., T^2x = 1010010010..., T^3x = 010010010..., T^4x = 10010010...,

then W(0) = 00, W(1) = 0101, W(2) = 101, W(3) = 0100, W(4) = 100, where underlining is used to indicate the respective prefixes. The prefix tree $\mathcal{T}_5(x)$ is shown in Figure 1.

Prefix trees are used to model computer storage algorithms, in which context they are usually called suffix trees. In addition, algorithms for encoding are often formulated in terms of tree structures, as tree searches are easy to program; for example, the simplest form of Lempel-Ziv parsing grows a subtree of the prefix tree, as indicated in Remark II.5.7, below. This led Grassberger, [16], to propose that use of the full prefix tree might lead to more rapid compression, as it reflects more of the structure of the data. He suggested that if the depth of node W(i) is the length $L_i^n(x) = \ell(W(i))$, then average depth should be asymptotic to $(1/h)\log n$, at least in probability.
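The following brute-force Python sketch computes the prefix-tree depths $L_i^n(x) = \ell(W(i))$ directly from the definition and reports the average depth divided by $\log n$, the quantity behind the suggestion just described; it is meant only as an illustration on small n, not as the efficient suffix-tree constructions used in practice.

```python
import math
import random

def prefix_depths(x, n):
    """Depths L_i = len(W(i)), where W(i) is the shortest prefix of the
    i-th shift that is a prefix of no other shift T^j x, 0 <= j < n.
    x must be somewhat longer than n so that the needed prefixes exist."""
    depths = []
    for i in range(n):
        L = 1
        while any(j != i and x[j:j + L] == x[i:i + L] for j in range(n)):
            L += 1
        depths.append(L)
    return depths

random.seed(3)
n = 400
x = [random.randint(0, 1) for _ in range(2 * n)]     # coin-tossing, entropy h = 1
depths = prefix_depths(x, n)
print(round(sum(depths) / (n * math.log2(n)), 3), "~ 1/h =", 1.0)
```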

While this is true in the i.i.d. case, and some other cases, it is not true in the general ergodic case, for, in general, there is no way to keep the longest W(i) from being much larger than $O(\log n)$, which kills the possibility of a general limit result for the mean. What is true in the general case, however, is the following theorem.

Theorem II.5.4 (The prefix-tree theorem.) If $\mu$ is an ergodic process of entropy $h > 0$, then for every $\varepsilon > 0$ and almost every x, eventually almost surely, all but $\varepsilon n$ of the numbers $L_i^n(x)/\log n$ are within $\varepsilon$ of $1/h$.

Note that the prefix-tree theorem implies a trimmed mean result, namely, for any $\varepsilon > 0$, the $(1-\varepsilon)$-trimmed mean of the set $\{L_i^n(x)\}$ is almost surely asymptotic to $\log n/h$. (The $(1-\varepsilon)$-trimmed mean of a set of cardinality M is obtained by deleting the largest $M\varepsilon/2$ and smallest $M\varepsilon/2$ numbers in the set and taking the mean of what is left.) The mean length is certainly asymptotically no smaller than $(\log n)/h$, at least in probability. The $\{L_i^n\}$ are bounded below, so average depth $(1/n)\sum_i L_i^n(x)$ will be asymptotic to $(\log n)/h$ in those cases where there is a constant C such that $\max_i L_i^n(x) \le C\log n$, eventually almost surely. This is equivalent to the string matching bound $L(x_1^n) = O(\log n)$, see Exercise 2, where, as earlier, $L(x_1^n)$ is the length of the longest string that appears twice in $x_1^n$, since if $L_i^n(x) = k$ then, by definition of $L_i^n(x)$, there is a $j \in [1, n]$, $j \ne i$, such that $x_i^{i+k-2} = x_j^{j+k-2}$.

There is a class of processes, the finite energy processes, for which a simple coding argument gives $L(x_1^n) = O(\log n)$, eventually almost surely. An ergodic process $\mu$ has finite energy if there are constants $c < 1$ and K such that

$$\mu(x_{t+1}^{t+L}\mid x_1^t) \le Kc^L,$$

for all $t \ge 1$, for all $L \ge 1$, and for all $x_1^{t+L}$ of positive measure. The i.i.d. processes, mixing Markov processes, and functions of mixing Markov processes all have finite energy. Another type of finite energy process is obtained by adding i.i.d. noise to a given ergodic process, for example, the binary process defined by $X_n = Y_n + Z_n$ (mod 2), where $\{Y_n\}$ is an arbitrary binary ergodic process and $\{Z_n\}$ is binary i.i.d. and independent of $\{Y_n\}$, see Exercise 3. (Adding noise is often called "dithering," and is a frequently used technique in image compression.)

Theorem II.5.5 (The finite-energy theorem.) If $\mu$ is an ergodic finite energy process with entropy $h > 0$, then there is a constant C such that $L(x_1^n) \le C\log n$, eventually almost surely, and hence

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \frac{L_i^n(x)}{\log n} = \frac{1}{h}, \text{ almost surely.}$$

The prefix-tree and finite-energy theorems will be proved and the counterexample construction will be described in the following subsections. One type of counterexample is stated as follows.

several choices are possible. lim (7) nk log nk with respect to the measure y = o F-1 . the W (i. such that vnrfo l Ltilk (x ) = oo. Theorem 11. Prefix trees may perform better than the LZW tree for they tend to have more long words. is a consequence of the recurrence-time theorem. eventually almost surely. In the simplest of these. if it has the form w(i) = w(j)a. define the sets L(x.1. whose predecessor node was labeled w(j). n)I < En. Performance can be improved in some cases by designing a finite tree that better reflects the actual structure of the data.4. For E > 0. which means more compression. for the longest prefix is only one longer than a word that has appeared twice.6 CHAPTER II.5. The details of the proof that most of the words cannot be too short will be given first. To control the location of the prefixes an "almost uniform eventual typicality" idea will be needed. for fixed n. n — I]: (x) > (1 ± c)(log n)I hl. Subsequent compression is obtained by giving the successive indices of these words in the fixed tree. (See Exercise 3. but do achieve something reasonably close to optimal compression for most finite sequences for nice classes of sources. The fact that most of the prefixes cannot be too long. ENTROPY-RELATED PROPERTIES. 88] for some finite version results.5.5. n — I]: L7(x) < (1— E)(logn) I h} U (x . (9) The fact that most of the prefixes cannot be too short.a.) The limitations of computer memory mean that at some stage the tree can no longer be allowed to grow. n) = fi E [0. The remainder of the sequence is parsed into words. is assigned to the node corresponding to a.5.160 Theorem 11.10. define Tk (8) = (x) > . From that point on. Theorem 11. mentioned in Remark 11. each of which is one symbol shorter than the shortest new word. for some j < i and a E A.7 The simplest version of Lempel-Ziv parsing x = w ( 1)w(2) • • •. the tree grows to a certain size then remains fixed. Such finite LZ-algorithms do not compress to entropy in the limit. (8). n) are distinct. in which the next word is the shortest new word. II. n) = fi E [0. The prefix-tree and finite-energy theorems will be proved and the counterexample construction will be described in the following subsections. and is a consequence of the fact that. see [87. can be viewed as an overlapping form of the distinct-words theorem. (9). grows a sequence of IA I-ary labeled trees. There is a stationary coding F of the unbiased coin-tossing process kt.1 Proof of the prefix-tree theorem. and an increasing sequence {nk}. Remark 11.2.2. It is enough to prove the following two results. n)I < En.1. For S > 0 and each k. almost surely. I U(x . eventually almost surely. Section 11. where the next word w(i). (8) 1L(x. each of which produces a different algorithm.

(See Exercise 7. by the entropy theorem. it can be supposed that if n > N(x) then the set of non-good indices will have cardinality at most 3 6 n. n). The recurrence-time theorem shows that this can only happen rarely.7.log Bk (x) = h. there are at most (constant) 2 J (h+) good indices i for which M < L < J. ENTROPY AND RECURRENCE TIMES. Lemma 11. except for the final letter. 1 lim — log Fk (x) = h. Thus if n > N(x) is sufficiently large and S is sufficiently small there will be at most cnI2 good indices i for which L'i'(x) < (1— €)(log n)I h. x) are distinct members of the collection 'Tk (6) which has cardinality at most 2k(h+8) .s. The ergodic theorem implies that. combined with the bound on the number of non-good indices of the preceding paragraph. Next the upper bound result.6 and Exercise 11. each W(i. a. (9). Thus by making the N(x) of Lemma 11. Section 1. for all k > MI has measure greater than 1 — S. If a too-long block appears twice.n. For a given sequence x and integer n. The corresponding recurrence-time functions are respectively defined by Fk(X) = Bk(X) = min {in> 1 •. which can be assumed to be less than en/2.8 For almost every x there is an integer N(x) such that if n > N(x) then for all but at most 2 6 n indices i E [0.x) appears at least twice. To proceed with the proof that words cannot be too short. an index i < n will be called good if L(x) > M and W(i) = W(i. eventually almost surely. x) E U m rk ( . n. There are at most 2k(h+3) such indices because the corresponding W (i. Suppose k > M and consider the set of good indices i for which L7(x) = k.5. Both a forward and a backward recurrence-time theorem will be needed. 1 urn . This. if necessary. fix c > O. Hence. k—> cx) k and when applied to the reversed process it yields the backward result. completes the proof that eventually almost surely most of the prefixes cannot be too short. will be established by using the fact that. there is an M such that the set 4 4 G(M) = ix: 4 E 7i. the shift Tx belongs to G(M) for all but at most a limiting 6-fraction of the indices i. This result is summarized in the following "almost uniform eventual typicality" form. (8). then it has too-short return-time.SECTION 11. for almost every x..5. a. n.) These limit results imply that .(8). 161 _ 2k(h+5) and so that I2k(6)1 < E Tk (6 ). Since E (3 ). the i-fold shift Tx belongs to G(M).s.8 larger. Since the prefixes W(i) are all distinct there are at most lAl m indices i < n such that L7(x) < M.. eventually almost surely.5. Section 1. for any J > M. k—>cc k since the reversed process is ergodic and has the same entropy as the original process. X m+ m+k i— min frn > 1: X1_ k+1 = X — ° k+11 The recurrence-time theorem directly yields the forward result.

the k-block starting at i recurs in the future within the next n steps.2 Proof of the finite-energy theorem. this means that Bk(T i+k x) < n.log Fk(T i x) < — < h(1 + E/4) -1 . such that if k > K. which means that 7 k) x E B. eventually almost surely. The key to making this method yield the desired bound is Barron's almost-sure code-length bound. By making n larger. each of measure at most en/4. for at most en/3 indices i E [0. namely. x) as the shortest prefix of Tx that is not a prefix of any other Ti x. Thus. n . In this case. n)I < en.13. n). x V B. Here "optimal code" means Shannon code. see Remark 1. An upper bound on the length of the longest repeated block is obtained by comparing this code with an optimal code. Case 2. If i +k < n.5. and use an optimal code on the rest of the sequence. (10) £(C(4)) + log . II. code the second occurrence of the longest repeated block by telling its length and where it was seen earlier. Assume i E U(x. for each w. so that 1 logn . for at most 6n/3 indices i E [0. or i + k < n. and log Bk (x) > kh(1+ c /4) -1 . and two sets F and B. The ergodic theorem implies that for almost every x there is an integer N(x) such that if n> N(x) then Tx E F. 7x E B.u(x'11 ) -21ogn. then I U (x . This completes the proof of the prefix-tree theorem. such that j i and j+k -1 =x me two cases i < j and i > j will be discussed separately. In summary. then log Fk (x) > kh(1 + /4) -1 .7.a.1]. if n also satisfies k < en/3. for j < n. A Shannon code for a measure p u is a prefix code such that E(w) = log (w)1. n. k which means that Tx E F. if i E Uk (x. there is a K. there is an index j E [0. . Case 1. Let k be the least integer that exceeds (1 + e/2) log n/ h. and hence. (b) k +1 < (1+ c)log n/ h. Condition (b) implies that L7 -1 > k. i < j. and Tx E B. ENTROPY-RELATED PROPERTIES.162 CHAPTER II. Lemma 5. that is. Fix such an x and assume n > N(x). (a) K < (1+ 6/2) log n/ h. n). n. it can be supposed that both of the following hold. Fk (T ix) < n. x F. A version of Barron's bound sufficient for the current result asserts that for any prefix-code sequence {C}. j < i. if necessary. by the definition of W (i. which asserts that no prefix code can asymptotically beat the Shannon code by more than 210g n.1]. L7 > (1+ e) log n/ h. The idea of the proof is suggested by earlier code constructions. that is.n) then T i E F.

for property (a) guarantees that eventually almost surely L7k (x) > n:I8 for at least 4/8 indices i < nk . s. there is an i in the range 0 < i < nk — n k 314 .a. x ts 0+ log 2(4+L+1 1xi+L ). 1} z and an increasing sequence {n k } of positive integers with the following properties.6. and then again at position t 1 > s in the sequence xÇ. t 3 log n + log p(xt+1 s Thus Barron's lemma gives 1 lxf)> —2 log n. The block xtr+t is encoded by specifying s and L.5. 1} z {0. Suppose a block of length L occurs at position s.SECTION 11. the shift-invariant measure defined by requiring that p(aii) = 2. li m _sxu. yields to which an application of the finite energy assumption p(x:+ L(4) 5 < . almost surely. with probability greater than 1 — 2-k . Since t. 3 log n + log p(x t + + IL lxi) < Kc L . The process v = p. 163 The details of the coding idea can be filled in as follows. then using a Shannon code with respect to the given measure A. (a) If y = F(x) then. This is now added to the code length (11) to yield t-I-L .5. 1} z . The block xi is encoded by specifying its length t. (7) holds.5. 0 < j < n k 3 I4 . It will be shown that given E > 0. (b) . such that yi+i = 0. To compare code length with the length of the Shannon code of x with respect to it. the probability it (4) is factored to produce log (4) = log ti ) log A(xt v. for each n and each a E An. that is. o F -1 will then satisfy the conditions of Theorem 11.3 A counterexample to the mean conjecture. as they are asymptotically negligible. given by unbiased coin-tossing. that is. 5/4 so that Ei<n. L7(x) > nk and hence. The final block xtn+L±I is encoded using a Shannon code with respect to the conditional measure Ix ti +L ). total code length satisfies 3 log n + log 1 1 + log p(xl) kt(4A-L+11xi+L)' where the extra o(logn) terms required by the Elias codes and the extra bits needed to roundup the logarithms to integers are ignored. since c <1. II. a prefix code for which t(e(n)) = log n + o(log n). .u(fx: xo (F(x)01) < E. and L must be encoded. ENTROPY AND RECURRENCE TIMES. The prefix property is guaranteed by encoding the lengths with the Elias code e on the natural numbers. Let it denote the measure on two-sided binary sequences {0.p — log c n log n This completes the proof of the finite-energy theorem. there is a measurable shift invariant function F: {0.
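The Elias code itself is not written out in the text; it is only used as a prefix code on the natural numbers with length log n + o(log n). As one concrete possibility (an assumption for illustration, not necessarily the exact code intended), the Elias delta code below has this property.

```python
# Sketch: a prefix code for the natural numbers with code length log2(n) + o(log2 n):
# the Elias delta code, built from the Elias gamma code.  This particular
# construction is an illustrative choice, not necessarily the one intended in the text.
import math

def elias_gamma(n):                     # n >= 1; length 2*floor(log2 n) + 1
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def elias_delta(n):                     # n >= 1; length log2 n + O(log log n)
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]  # gamma-code the bit-length, then the remaining bits

for n in (1, 2, 10, 1000, 10 ** 6):
    print(n, len(elias_delta(n)), round(math.log2(n), 2))
# Because no codeword is a prefix of another, the lengths occurring in the
# three-part code described above can be decoded unambiguously.
```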

. discussed in Section I. Show that dithering an ergodic process produces a finite-energy process. where H is the entropy-rate of it and show that E p.(B„ n cn ix°) is finite. (Hint: let B. In that example.xn = yn 00 whenever f (xn .8. ENTROPY-RELATED PROPERTIES. (c) Define f (x" cc ) to be the minimum m > n for which xim niV = .u of entropy h. The coding proof of the finite-energy theorem is new. (d) The preceding argument establishes the recurrence result for the reversed process. as an illustration of the blockto-stationary coding method.(4 Ix ° ) > 2-n ( H—E) and Cn = {X n c„) : f (xn 00 )) < 2n (H-2€ ) 1.0 ) = f (y" co ) and x 0 y° . blocks of O's of length 2n k i l2 were created. then max i 14 (x) = 0 (log n). Replacing 2n k 1/2 by nk 3/4 will produce the coding F needed here.q and apply the preceding result. process a. (Hint: L (4n ) < K log n implies max L7(x) < K log n. for any ergodic measure . Show that if L(x) = 0(log n). Remark 11. but it is added to emphasize that such a y can be produced by making an arbitrarily small limiting density of changes in the sample paths of the i. due to Wyner. almost surely. which also contains the counterexample construction given here. Show that this implies the result for ft.d. II. the entropy of y can be as close to log 2 as desired.) 3. (a) Show that lim infn t(f(x n „))/ n > h. eventually almost surely.5.b Exercises. eventually almost surely.c. The coding F can be constructed by only a minor modification of the coding used in the string-matching example. hence. Property (b) is not really needed. The asymptotics of the length of the longest and shortest word in the Grassberger tree are discussed in [82] for processes satisfying memory decay conditions. A function f: An B* will be said to be conditionally invertible if . that r > h is based on the following coding idea. . = fx" o„: p. in particular. 2. 1. Another proof. after its discovery it was learned that the theorem was proved using a different method in [31].164 CHAPTER II.) (b) Deduce that the entropy of a process and its reversed process are the same.5.i.9 The prefix tree theorem was proved in [72].

Chapter III

Entropy for restricted classes.

Section III.1  Rates of convergence.

Many of the processes studied in classical probability theory, such as i.i.d. processes, Markov processes, and finite-state processes, have exponential rates of convergence for frequencies of all orders and for entropy, but some standard processes, such as renewal processes, do not have exponential rates. In the general ergodic case no uniform rate is possible, that is, given any convergence rate for frequencies or for entropy, there is an ergodic process that does not satisfy the rate. These results will be developed in this section.

In this section μ denotes a stationary, ergodic process with alphabet A. The k-th order empirical distribution, or k-type, is the measure p_k = p_k(·|x_1^n) on A^k defined by

    p_k(a_1^k | x_1^n) = |{i ∈ [1, n−k+1] : x_i^{i+k−1} = a_1^k}| / (n − k + 1),   a_1^k ∈ A^k.

The distance between μ_k and p_k will be measured using the distributional (variational) distance,

    |μ_k − p_k| = Σ_{a_1^k} |μ_k(a_1^k) − p_k(a_1^k | x_1^n)|,

with μ_k denoting the projection onto A^k defined by μ_k(a_1^k) = μ({x : x_1^k = a_1^k}).

The ergodic theorem implies that if k and ε are fixed then

    μ_n({x_1^n : |μ_k − p_k(·|x_1^n)| ≥ ε}) → 0,  as n → ∞.

Likewise, the entropy theorem implies that

    μ_n({x_1^n : 2^{−n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{−n(h−ε)}}) → 1,  as n → ∞.

The rate of convergence problem is to determine the rates at which these convergences take place. A mapping r_k : R^+ × Z^+ → R^+, where R^+ denotes the positive reals and Z^+ the positive integers, is called a rate function for frequencies for a process μ if for each fixed k and ε > 0,

    μ_n({x_1^n : |μ_k − p_k(·|x_1^n)| ≥ ε}) ≤ r_k(ε, n),  n ≥ 1,

and r_k(ε, n) → 0 as n → ∞.
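As an aside, the empirical k-block distribution and the distributional distance above are simple to compute from data. The following sketch is not from the text; the process (unbiased coin-tossing) and all parameters are illustrative choices.

```python
# Sketch: sliding-window k-block distribution p_k(.|x_1^n) and the variational
# distance |p_k - mu_k|, for simulated unbiased coin-tossing.  Illustrative only.
import random
from collections import Counter
from itertools import product

def empirical_k_blocks(x, k):
    """Sliding-window k-block distribution p_k(.|x_1^n)."""
    n = len(x)
    counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    return {block: c / (n - k + 1) for block, c in counts.items()}

def variational_distance(p, q_prob, alphabet, k):
    """Sum over all k-blocks of |p(block) - q(block)|."""
    return sum(abs(p.get(b, 0.0) - q_prob(b)) for b in product(alphabet, repeat=k))

random.seed(0)
n, k = 100_000, 4
x = [random.randint(0, 1) for _ in range(n)]
p_k = empirical_k_blocks(x, k)
mu_k = lambda block: 2.0 ** (-len(block))          # unbiased coin-tossing
print(variational_distance(p_k, mu_k, (0, 1), k))  # small for fixed k and large n
```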

An ergodic process is said to have exponential rates for frequencies if it has a rate function for frequencies such that for each ε > 0 and k, (−1/n) log r_k(ε, n) is bounded away from 0, as n → ∞. Likewise, a mapping r : R^+ × Z^+ → R^+ is called a rate function for entropy for a process μ of entropy h if for each ε > 0,

    μ_n({x_1^n : 2^{−n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{−n(h−ε)}}) ≥ 1 − r(ε, n),

and r(ε, n) → 0 as n → ∞. An ergodic process has exponential rates for entropy if it has a rate function for entropy such that for each ε > 0, (−1/n) log r(ε, n) is bounded away from 0.

In this section the following theorem will be proved.

Theorem III.1.2 (Exponential rates for i.i.d. processes.)
If μ is i.i.d., then it has exponential rates for frequencies and for entropy.

Exponential rates of convergence for i.i.d. processes will be established in the next subsection, then extended to Markov and other processes that have appropriate asymptotic independence properties. Examples of processes without exponential rates, and even more general rates, will be discussed in the final subsection.

Remark III.1.1
An unrelated concept with a similar name, called speed of convergence, is concerned with what happens in the ergodic theorem when (1/n) Σ_i f(T^i x) is replaced by (1/a_n) Σ_i f(T^i x), for some unbounded, nondecreasing sequence {a_n}, [22, 62, 58]. While ideas from that theory will be used, it will not be the focus of interest here.

III.1.a  Exponential rates for i.i.d. processes.

The theorem is proved by first establishing the first-order case, a combination of a large deviations bound due to Hoeffding and Sanov and an inequality of Pinsker. (The determination of the best value of the exponent is known as "large deviations" theory; see [32] for a discussion of this topic.) The extension to k-th order frequencies is carried out by reducing the overlapping-block problem to the nonoverlapping-block problem, which is, in turn, treated by applying the first-order result to the larger alphabet of blocks. Exponential rates for entropy follow immediately from the first-order frequency theorem, for the fact that μ is a product measure implies that μ_n(x_1^n) is a continuous function of p_1(·|x_1^n).

The first-order theorem is a consequence of the bound given in the following lemma.

Lemma III.1.3 (First-order rate bound.)
There is a positive constant c such that

(1)    μ_n({x_1^n : |p_1(·|x_1^n) − μ_1| ≥ ε}) ≤ (n + 1)^{|A|} 2^{−ncε²},

for any finite set A, for any n, and for any i.i.d. process μ with alphabet A.

Proof. A direct calculation, using the fact that μ is a product measure, produces

(2)    μ_n(x_1^n) = Π_{a∈A} μ(a)^{n p_1(a|x_1^n)} = 2^{−n(H(p_1) + D(p_1‖μ_1))},

where

    H(p_1) = H(p_1(·|x_1^n)) = −Σ_a p_1(a|x_1^n) log p_1(a|x_1^n)

is the entropy of the empirical 1-block distribution, and

    D(p_1‖μ_1) = Σ_a p_1(a|x_1^n) log [ p_1(a|x_1^n) / μ_1(a) ]

is the divergence of p_1(·|x_1^n) relative to μ_1.
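Identity (2) can be checked numerically. The sketch below is illustrative only: the three-letter i.i.d. measure, the seed, and the sample length are arbitrary choices; it computes log2 μ_n(x_1^n) directly and via the right-hand side of (2).

```python
# Sketch: numerical check of mu_n(x_1^n) = 2^{-n(H(p_1) + D(p_1||mu_1))}
# for an i.i.d. measure.  Alphabet, probabilities and length are illustrative.
import math, random
from collections import Counter

mu = {'a': 0.5, 'b': 0.3, 'c': 0.2}                       # an i.i.d. measure
random.seed(1)
n = 200
x = random.choices(list(mu), weights=list(mu.values()), k=n)

p1 = {a: c / n for a, c in Counter(x).items()}            # empirical 1-block distribution
H = -sum(p * math.log2(p) for p in p1.values())           # H(p_1)
D = sum(p * math.log2(p / mu[a]) for a, p in p1.items())  # D(p_1 || mu_1)

log_prob_direct = sum(math.log2(mu[a]) for a in x)        # log2 mu_n(x_1^n), product measure
log_prob_identity = -n * (H + D)
print(log_prob_direct, log_prob_identity)                 # the two values agree
```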

The proof is completed by an application of the theory of type classes discussed in Section I.6. The type class of x_1^n is the set

    T(x_1^n) = {y_1^n : p_1(·|y_1^n) = p_1(·|x_1^n)}.

The two key facts about type classes, established in Section I.6, are

(a) |T(x_1^n)| ≤ 2^{nH(p_1)};
(b) there are at most (n + 1)^{|A|} type classes.

Fact (a), in conjunction with the product formula (2), produces the bound

(3)    μ_n(T(x_1^n)) ≤ 2^{−nD(p_1‖μ_1)}.

The bad set B(n, ε) = {x_1^n : |p_1(·|x_1^n) − μ_1| ≥ ε} can be partitioned into disjoint sets of the form B(n, ε) ∩ T(x_1^n), hence the bound (3) on the measure of the type class, together with fact (b), produces the bound

    μ_n(B(n, ε)) ≤ (n + 1)^{|A|} 2^{−nD*},

where D* = min{ D(p_1(·|x_1^n)‖μ_1) : |p_1(·|x_1^n) − μ_1| ≥ ε }. Since D(P‖Q) ≥ |P − Q|² / (2 ln 2) is always true, see Pinsker's inequality, Exercise 6, Section I.6, it follows that

    D* ≥ ε² / (2 ln 2),

so that the lemma holds with c = 1/(2 ln 2). This completes the proof of Lemma III.1.3.  □

Lemma III.1.3 gives the desired exponential-rate theorem for first-order frequencies, since

    (n + 1)^{|A|} 2^{−ncε²} = 2^{−n(cε² − δ_n)},  where δ_n = |A| log(n + 1)/n → 0 as n → ∞.
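The two type-class facts (a) and (b) can be verified by brute force for small n. The sketch below is purely illustrative (small alphabet and short length, chosen only so that full enumeration is feasible).

```python
# Sketch: exact type-class counts versus the bounds
#   (a) |T(x_1^n)| <= 2^{n H(p_1)}   and   (b) number of types <= (n+1)^{|A|}.
import math
from collections import Counter
from itertools import product

A = ('0', '1', '2')
n = 8
types = {}                                   # count vector -> size of its type class
for x in product(A, repeat=n):
    counts = tuple(sorted(Counter(x).items()))
    types[counts] = types.get(counts, 0) + 1

print("number of types:", len(types), "<=", (n + 1) ** len(A))       # fact (b)
for counts, size in list(types.items())[:3]:
    p1 = {a: c / n for a, c in counts}
    H = -sum(p * math.log2(p) for p in p1.values() if p > 0)
    print(counts, "class size", size, "<= 2^{nH} =", round(2 ** (n * H), 1))   # fact (a)
```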

The extension to k-th order frequencies is carried out by reducing the overlapping-block problem to k separate nonoverlapping-block problems, each of which is treated by applying the first-order result to the larger alphabet of k-blocks. To assist in this task some notation and terminology will be developed.

For n ≥ 2k define integers t = t(n, k) and r ∈ [0, k) such that n = tk + k + r. For each s ∈ [0, k − 1], the sequence x_1^n can be expressed as the concatenation

(4)    x_1^n = w(0)w(1)···w(t)w(t + 1),

where the first word w(0) has length s < k, the next t words, w(1), ..., w(t), all have length k, and the final word w(t + 1) has length n − tk − s. This is called the s-shifted, k-block parsing of x_1^n. The sequences x_1^n and y_1^n are said to be (s, k)-equivalent if their s-shifted, k-block parsings have the same (ordered) set {w(1), ..., w(t)} of k-blocks. The (s, k)-equivalence class of x_1^n is denoted by S_k(x_1^n, s).

The s-shifted (nonoverlapping) empirical k-block distribution p_k^s(·|x_1^n) is the distribution on A^k defined by

    p_k^s(a_1^k | x_1^n) = |{j ∈ [1, t] : w(j) = a_1^k}| / t,

where x_1^n is given by (4). It is, of course, constant on the (s, k)-equivalence class S_k(x_1^n, s). The overlapping-block measure p_k is (almost) an average of the measures p_k^s, where "almost" is needed to account for end effects, a result summarized as the following lemma.

Lemma III.1.4 (Overlapping to nonoverlapping lemma.)
Given ε > 0 there is a γ > 0 such that if k/n ≤ γ and |p_k(·|x_1^n) − μ_k| ≥ ε, then there is an s ∈ [0, k − 1] such that |p_k^s(·|x_1^n) − μ_k| ≥ ε/2.

Proof. Left to the reader.  □

If k is fixed and n is large enough the lemma yields the containment relation

(5)    {x_1^n : |p_k(·|x_1^n) − μ_k| ≥ ε} ⊆ ⋃_{s=0}^{k−1} {x_1^n : |p_k^s(·|x_1^n) − μ_k| ≥ ε/2}.

Fix such a k and n, and fix s ∈ [0, k − 1]. The fact that μ is a product measure implies

(6)    μ_n(S_k(x_1^n, s)) ≤ Π_{j=1}^{t} μ_k(w(j)),

so that if B denotes A^k and ν denotes the product measure on B^t defined by the formula ν(w_1^t) = Π_i μ_k(w_i), then

(7)    μ_n({x_1^n : |p_k^s(·|x_1^n) − μ_k| ≥ ε/2}) ≤ ν({w_1^t : |p_1(·|w_1^t) − ν_1| ≥ ε/2}).

The latter is, by the first-order result applied to the measure ν and super-alphabet B = A^k, upper bounded by

(8)    (t + 1)^{|A|^k} 2^{−tcε²/4}.

The containment relation, (5), then provides the desired k-block bound

(9)    μ_n({x_1^n : |p_k(·|x_1^n) − μ_k| ≥ ε}) ≤ k(t + 1)^{|A|^k} 2^{−tcε²/4}.

If k and ε > 0 are fixed, the logarithm of the right-hand side is asymptotic to −tcε²/4 ∼ −ncε²/4k, hence the bound decays exponentially in n. This completes the proof of Theorem III.1.2.  □
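The reduction behind Lemma III.1.4 can be seen numerically. The sketch below (coin-tossing data, illustrative parameters, end effects dropped) parses a sample into its s-shifted nonoverlapping k-blocks and checks that the overlapping distribution p_k is essentially the average over s of the distributions p_k^s.

```python
# Sketch: overlapping k-block distribution versus the average over s of the
# s-shifted nonoverlapping k-block distributions.  Illustrative parameters.
import random
from collections import Counter

def overlapping(x, k):
    n = len(x)
    c = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    return {b: v / (n - k + 1) for b, v in c.items()}

def shifted_nonoverlapping(x, k, s):
    t = (len(x) - s) // k                      # blocks w(1),...,w(t); end effects dropped
    c = Counter(tuple(x[s + j * k: s + (j + 1) * k]) for j in range(t))
    return {b: v / t for b, v in c.items()}

random.seed(2)
x = [random.randint(0, 1) for _ in range(50_000)]
k = 3
pk = overlapping(x, k)
shifted = [shifted_nonoverlapping(x, k, s) for s in range(k)]
blocks = set(pk).union(*shifted)
avg = {b: sum(ps.get(b, 0.0) for ps in shifted) / k for b in blocks}
print(max(abs(pk.get(b, 0.0) - avg[b]) for b in blocks))   # small: p_k ~ average of p_k^s
```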

III.1.b  The Markov and related cases.

Exponential rates for Markov chains will be established in this section. The entropy part follows immediately from the second-order frequency result, since for a Markov chain μ_n(x_1^n) is a continuous function of p_1(·|x_1^n) and p_2(·|x_1^n). The goal is the following theorem.

Theorem III.1.5 (Exponential rates for Markov sources.)
If μ is an ergodic Markov source, then it has exponential rates for frequencies and for entropy.

The theorem for frequencies is proved in stages. First it will be shown that aperiodic Markov chains satisfy a strong mixing condition called ψ-mixing. Then it will be shown that ψ-mixing processes have exponential rates for both frequencies and entropy. Finally, it will be shown that periodic, irreducible Markov chains have exponential rates.

A process is ψ-mixing if there is a nonincreasing sequence {ψ(g)} such that ψ(g) → 1 as g → ∞, and such that, for every gap g,

    Σ_{ℓ(v)=g} μ(uvw) ≤ ψ(g) μ(u) μ(w),   u, w ∈ A*.

An i.i.d. process is ψ-mixing, with ψ(g) ≡ 1.

Lemma III.1.6
An aperiodic Markov chain is ψ-mixing.

Proof. Let M be the transition matrix and π(·) the stationary vector for μ. Fix a gap g and words u, w ∈ A*, let a be the final symbol of u, and let b be the first symbol of w. The formula for computing Markov probabilities yields

    Σ_{ℓ(v)=g} μ(uvw) = μ(u) · (M^g_{ab}/π(b)) · μ(w).

The right-hand side is upper bounded by ψ(g) μ(u) μ(w), where

    ψ(g) = max_{a,b} M^g_{ab}/π(b).

The function ψ(g) → 1, as g → ∞, since the aperiodicity assumption implies that M^g_{ab} → π(b). This proves the lemma.  □
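The mixing function ψ(g) = max_{a,b} M^g_{ab}/π(b) is easy to compute for a concrete chain. The sketch below uses an arbitrary two-state chain; the transition matrix and stationary vector are illustrative assumptions, not taken from the text.

```python
# Sketch: psi(g) = max_{a,b} M^g(a,b) / pi(b) for an aperiodic two-state chain.
def mat_mult(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(M, g):
    n = len(M)
    P = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    for _ in range(g):
        P = mat_mult(P, M)
    return P

M = [[0.9, 0.1],
     [0.3, 0.7]]                       # aperiodic, irreducible (illustrative)
pi = [0.75, 0.25]                      # stationary vector: pi = pi M

for g in (1, 2, 5, 10, 20):
    Mg = mat_pow(M, g)
    psi = max(Mg[a][b] / pi[b] for a in range(2) for b in range(2))
    print(g, round(psi, 6))            # psi(g) decreases to 1
```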

Theorem III.1.7
If μ is ψ-mixing then it has exponential rates for frequencies of all orders.

Proof. The i.i.d. proof adapts to the ψ-mixing case, so as to allow gaps between blocks. Fix positive integers k and g. For n ≥ 2k + g, define t = t(n, k, g) such that

(10)    n = t(k + g) + (k + g) + r,   0 ≤ r < k + g.

For each s ∈ [0, k + g − 1], the sequence x_1^n can be expressed as the concatenation

(11)    x_1^n = w(0)g(1)w(1)g(2)w(2)···g(t)w(t)w(t + 1),

where w(0) has length s < k + g, the g(j) have length g and alternate with the k-blocks w(j), for 1 ≤ j ≤ t, and the final block w(t + 1) has length n − t(k + g) − s < 2(k + g). This is called the s-shifted, k-block parsing of x_1^n, with gap g. The sequences x_1^n and y_1^n are said to be (s, k, g)-equivalent if their s-shifted, k-block parsings with gap g have the same (ordered) set {w(1), ..., w(t)} of k-blocks. The (s, k, g)-equivalence class of x_1^n is denoted by S_k(x_1^n, s, g).

The s-shifted (nonoverlapping) empirical k-block distribution, with gap g, is the distribution on A^k defined by

    p_k^{s,g}(a_1^k | x_1^n) = |{j ∈ [1, t] : w(j) = a_1^k}| / t,

where the w(j) are given by (11). It is, of course, constant on the (s, k, g)-equivalence class S_k(x_1^n, s, g). The overlapping-block measure p_k is (almost) an average of the measures p_k^{s,g}, where "almost" is now needed to account both for end effects and for the gaps. If k is large relative to gap length and n is large relative to k, however, this is no problem, and Lemma III.1.4 easily extends to the following.

Lemma III.1.8 (Overlapping to nonoverlapping with gaps.)
Given ε > 0 and g, there is a γ > 0 and a K > 0 such that if k/n ≤ γ, if k ≥ K, and if |p_k(·|x_1^n) − μ_k| ≥ ε, then there is an s ∈ [0, k + g − 1] such that |p_k^{s,g}(·|x_1^n) − μ_k| ≥ ε/2.

The ψ-mixing property, applied to the parsing (11), replaces the product-measure bound (6) by

(12)    μ_n(S_k(x_1^n, s, g)) ≤ [ψ(g)]^t Π_{j=1}^{t} μ_k(w(j)),

so the upper bound (8) on μ_n({x_1^n : |p_k^{s,g}(·|x_1^n) − μ_k| ≥ ε/2}) is replaced by

(13)    [ψ(g)]^t (t + 1)^{|A|^k} 2^{−tcε²/4},

and the final upper bound (9) on the probability μ_n({x_1^n : |p_k(·|x_1^n) − μ_k| ≥ ε}) is replaced by

(14)    (k + g)[ψ(g)]^t (t + 1)^{|A|^k} 2^{−tcε²/4},

assuming, of course, that k is enough larger than g and n enough larger than k to guarantee that Lemma III.1.8 holds. The proof of Theorem III.1.7 is completed by using the ψ-mixing property to choose g so large that ψ(g) ≤ 2^{cε²/8}. The logarithm of the bound in (14) is then asymptotically at most −tcε²/8 ∼ −ncε²/8(k + g), hence the bound decays exponentially, provided only that k is large enough. Since it is enough to prove exponential rates for k-th order frequencies for large k, this is no problem, and establishes the exponential-rates theorem for aperiodic Markov chains.  □

The following lemma removes the aperiodicity requirement in the Markov case.

Let A be an ergodic Markov chain with period d > 1. unless c(ai) = c(xs ). d] by putting c(a) = s. Let g be a gap length. Lemma 111. 171 Proof The only new case is the periodic case. k + g — 1]. and. The following theorem summarizes the goal. the nonoverlapping block measure ps condition p g (4) = 0. g = ps k . for k can always be increased or decreased by no more than d to achieve this. x C(s @ d — 1). 1 < s < d.l.9 An ergodic Markov chain has exponential rates for frequencies of all orders. in part. A simple cutting and stacking procedure which produces counterexamples for general rate functions will be presented. g (• WO satisfies the For s E [0. ) where @ denotes addition mod d. Examples of renewal processes without exponential rates for frequencies and entropy are not hard to construct. C2. and partition A into the (periodic) classes.5. such that Prob(X n+ 1 E Cs91lXn E C s = 1. It can also be assumed that k is divisible by d.10 Let n i— r(n) be a positive decreasing function with limit O. The measure . n > N. in part. so the previous theory applies to each set {xii': ' K g — I (c(xs )) I as before. (s) . There is a binary ergodic process it and an integer N such that An (1-112 : 'Pi (• Ix) — Ail 1/2» ?. if a E C. which has no effect on the asymptotics. because it is conceptually quite simple. Thus. Also let pP) denote the measure A conditioned on Xi E C5 . because it illustrates several useful cutting and stacking ideas.. III. see Exercise 4.1.1. Cd. . IL({xi: IPk — iLk I E}) > 6/2 } separately. Theorem 111. C1.1. it follows that if k I g is large enough then Since ..oc.SECTION 111.c Counterexamples. establishing the lemma.u k is an average of the bt k k+g-1 fx rI : IPk( ifil ) — Ad > El C U VI I: s=0 IPsk. RATES OF CONVERGENCE.1. Theorem 111. _ r(n). and thereby completing the CI proof of the exponential-rates theorem for Markov chains. . . ?_ decays exponentially as n -. which can be assumed to be divisible by d and small relative to k. an aperiodic Markov measure with state space C(s) x C(s @ 1) x • • .g ( ! XIII ) — 14c(xs)) 1 > 6 / 2 1. k .u (s) is. however. Define the function c: A i— [1.

then for any n. long enough to guarantee that (16) will indeed hold with m replaced by ni + 1. according to whether x`iz consists of l's or O's. Each S(m) will contain a column . if . for all m. {41 } and fr. The cutting and stacking method was designed. = (y +V)/2. for all m > 1.u. Without loss of generality. to implement both of these goals. In particular. so that A n (14: i pi( 14) — I 1/21) = 1. then it is certain that xi = 1. If x is a point in some level of C below the top ni levels. i (1) = 1/2. and to mix one part slowly into another. such that /3(1) = 1/2 and r(m) < f3(m) < 1/2. The construction will be described inductively in terms of two auxiliary unbounded sequences. Furthermore. so that. then the chance that a randomly selected x lies in the first L — ni levels of C is (1 — m/L)p. in part." suppose kt is defined by a (complete) sequence of column structures.u were the average of two processes. it can be supposed that r(m) < 1/2. while . ENTROPY FOR RESTRICTED CLASSES. n > 1. where y is concentrated on the sequence of all l's and iY is concentrated on the sequence of all O's. of positive integers which will be specified later. Mixing is done by applying repeated independent cutting and stacking to the union of the second column C2 with the complement of C. The details of the above outline will now be given. no matter what the complement of C looks like. if C has height L and measure fi. As an example of what it means to "look like one process on part of the space and something else on the other part of the space. then (141 : 1/31( Ixr) — I ?_ 112» ?_ (1 _ so that if 1/2> fi > r(m) and L is large enough then (16) /2. Suppose some column C at some stage has all of its levels labeled '1'. Counterexamples for a given rate would be easy to construct if ergodicity were not required. Let m fi(m) be a nonincreasing function with limit 0.u. Ergodic counterexamples are constructed by making . pi(11x) is either 1 or 0. and hence (15) . The mixing of "one part slowly into another" is accomplished by cutting C into two columns. As long as enough independent cutting and stacking is done at each stage and all the mass is moved in the limit to the second part.u look like one process on part of the space and something else on the other part of the space. that is.u n ({xr: pi(114) = 11) (1 — -11 L i z ) /3. The bound (16) can be achieved with ni replaced by ni + 1 by making sure that the first column CI has measure slightly more than r(m 1). the final process will be ergodic and will satisfy (16) for all m for which r(m) < 1/2. /31(114) = 1. For example. say C1 and C2. for 1 < j < in. for it can then be cut and stacked into a much longer column. .172 CHAPTER III. The desired (complete) sequence {S(m)} will now be defined.n: — I > 1/2)) > r(m).n 1.„ ({x. to make a process look like another process on part of the space. mixing the first part into the second part so slowly that the rate of convergence is as large as desired. if exactly half the measure of 8(1) is concentrated on intervals labeled '1'.
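The non-ergodic average of the all-1's and all-0's processes mentioned above makes the failure of any rate completely transparent. The following sketch is purely illustrative: it samples from that mixture and confirms that the distributional distance |p_1(·|x_1^n) − μ_1| equals 1 on every sample path, for every n, so no rate function can decay.

```python
# Sketch: the non-ergodic mixture mu = (delta_all_ones + delta_all_zeros)/2.
# Every sample path gives |p_1(.|x_1^n) - mu_1| = 1, no matter how large n is.
import random

def p1(x):
    ones = sum(x) / len(x)
    return {1: ones, 0: 1.0 - ones}

mu1 = {0: 0.5, 1: 0.5}
random.seed(3)
for _ in range(5):
    path = [1] * 1000 if random.random() < 0.5 else [0] * 1000   # a sample from the mixture
    emp = p1(path)
    print(sum(abs(emp[a] - mu1[a]) for a in (0, 1)))             # always 1.0
```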

Since rm —> oc. Since the measure of R. ni > 1. the total width of 8(m) goes to 0.(1) consist of one interval of length 1/2. then it can be combined with the fact that l's and O's are equally likely and the assumption that 1/2 > 8(m) > r (m). so large that (1 — m/t n1 )/3(m) > r(m).n } is chosen.(4) < 2 -n (h+E) } goes to 0 exponentially fast. sequentially. The sequence {r. (b) Show that the measure of {xii: p. Show that the measure of the set of n-sequences that are not almost strongly-packed by 'Tk goes to 0 exponentially fast. labeled '1'.7.(m). C(m — 1.) (c) Show that a 0. labeled '0'.SECTION III. All that remains to be shown is that for suitable choice of 4. by Theorem 1. First C(m — 1) is cut into two columns. To get started. This guarantees that at all subsequent stages the total measure of the intervals of 8(m) that are labeled with a '1' is 1/2. tm holds. let C(1) consist of one interval of length 1/2. 14)— Ail a 1/2 }) a (1 — — ) /3(m).1/20 > r (m). .1 . . (a) Choose k such that Tk = {x: i(x) ) 2h j has measure close to 1. together with a column structure R. 2) U 7Z(m — 1). This is possible by Theorem 1. for this is all that is needed to make sure that m ttn({x.11. and rm the final process /1 is ergodic and satisfies the desired condition (17) An (Ix'i'I : Ipi(• 14)— Ail a. Let y be a finite stationary coding of A and assume /2. 1) has measure /3(m) and C(m — 1. and let R. Suppose A has exponential rates for frequencies. (Hint: as in the proof of the entropy theorem it is highly unlikely that a sequence can be mostly packed by 'Tk and have too-small measure. The condition (17) is guaranteed by choosing 4.1.10. and since /3(m) -÷ 0. to guarantee that (17) holds. 2) has measure fi(m —1) — )6(m). all of whose entries are labeled '1'. 2) U TZ. 173 C(m). RATES OF CONVERGENCE. 0 111.(m) goes to 1. 2). so that C(m — 1.d Exercises. and the total measure of the intervals of S(m) that are labeled with a '0' is 1/2.(m — 1) and 1?-(m) are (1 — 2')-independent. and measure 13(m). The new remainder R(m) is obtained by applying rni -fold independent cutting and stacking to C(m — 1. For ni > 1. of height 4. 1. (a) Show that y has exponential rates for frequencies. 1) is cut into t ni g ni -1 columns of equal width. 1) and C(m — 1. 1. The column C(m —1. has exponential rates for frequencies and entropy. so the sequence {8(m)} is complete and hence defines a process A. where C(m — 1.10. This completes the proof of Theorem 111. and hence the final process must satisfy t 1 (l) = A i (0) = 1/2. disjoint from C(m).10. Once this holds. which are then stacked to obtain C(m). 2. this guarantees that the final process is ergodic.: Ipi(.-mixing process has exponential rates for entropy. 8(m) is constructed as follows.

The kth-order joint distribution for an ergodic finite-alphabet process can be estimated from a sample path of length n by sliding a window of length k along the sample path and counting frequencies of k-blocks. where it is good to make the block length as long as possible. that is. such as data compression. in .) 3.174 CHAPTER III. As before. that is. there is no hope that the empirical k-block distribution will be close to the true distribution. The definition of admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. A nondecreasing sequence {k(n)} will be said to be admissible for the ergodic measure bt if 11111 1Pk(n)(• 14) n-4co lik(n) = 0. Exercise 4 shows that renewal processes are not *-mixing. if k(n) > (1 + e)(logn)1 h. ENTROPY FOR RESTRICTED CLASSES.. by the ergodic theorem. the probability of a k-block will be roughly 2 -kh . a fact guaranteed by the ergodic theorem. s. to design engineering systems. Show that there exists a renewal process that does not have exponential rates for frequencies of order 1. distributional. independently drawn sample paths. There are some situations. processes or Markov chains. This is the problem addressed in this section.) 5. If k is fixed the procedure is almost surely consistent. where I • I denotes variational. Consistent estimation also may not be possible for the choice k(n) (log n)I h. It is also can be shown that for any sequence k(n) -± oc there is an ergodic measure for which {k(n)} is not admissible. In particular. below. equivalently. conditional k-type too much larger than 2 —k(h(4)—h(v) and measure too much larger than 2-PP). as a function of sample path length n. see Theorem 111. or.d. (Hint: make sure that the expected recurrence-time series converges slowly. (b) Show that y has exponential rates for entropy. that is. For example.1. k(n) > (logn)I(h — 6). see Exercise 2. 4. after which the system is run on other. Establish this by directly constructing a renewal process that is not lit-mixing. Every ergodic process has an admissible sequence such that limp k(n) = oc. for if k is large then. Thus it would be desirable to have consistency results for the case when the block length function k = k(n) grows as rapidly as possible. Section 111. The such estimates is important when using training sequences. The empirical k-block distribution for a training sequence is used as the basis for design. a. let plc ( 14) denote the empirical distribution of overlapping k-blocks in the sequence _el'. Use cutting and stacking to construct a process with exponential rates for entropy but not for frequencies.2. distance. such as the i. The problem addressed here is whether it is possible to make a universal choice of {k(n)) for "nice" classes of processes. finite consistency of sample paths.i. (Hint: bound the number of n-sequences that can have good k-block frequencies.2 Entropy and joint distributions. Here is where entropy enters the picture. with high probability. the resulting empirical k-block distribution almost surely converges to the true distribution of k-blocks as n oc.
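The obstruction behind the admissibility question is easy to see empirically: a sample of length n contains at most n − k + 1 distinct k-blocks, so once k exceeds (log n)/h most k-blocks cannot appear at all, and near k = (log n)/h roughly a (1 − e^{-1}) fraction is missed. The sketch below (unbiased coin-tossing, so h = 1; all parameters illustrative) counts the fraction of the 2^k possible k-blocks that occur in a sample.

```python
# Sketch: fraction of the 2^k possible k-blocks appearing in a coin-tossing
# sample of length n = 2^14.  The fraction drops sharply once k exceeds log2 n.
import random

random.seed(4)
n = 2 ** 14
x = [random.randint(0, 1) for _ in range(n)]

for k in (8, 12, 14, 16, 18):
    seen = {tuple(x[i:i + k]) for i in range(n - k + 1)}
    print(k, round(len(seen) / 2 ** k, 4))    # near 1 for small k, small for k > log2 n = 14
```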

ENTROPY AND JOINT DISTRIBUTIONS. process then the d-distance between the empirical k(n)-block distribution and the true k(n)-block distribution goes to 0. y) is the waiting time until the first n terms of x appear in the sequence y then. provided x and y are independently chosen sample paths. The d-distance is upper bounded by half the variational distance. independent of the length of past and future blocks. They showed.) If p.3 (Weak Bernoulli admissibility. and hence the results described here are a sharpening of the Ornstein-Weiss theorem for the case when k < (log n)/(h 6) and the process satisfies strong enough forms of asymptotic independence. The positive admissibility result for the i. The nonadmissibility theorem is a simple consequence of the fact that no more than 2 k(n)(h—E) sequences of length k can occur in a sequence of length n and hence there is no hope of seeing even a fraction of the full distribution. (1/n) log 14/. or Ili-mixing. almost surely. These applications. Theorem 111. [52]. . The Vi-mixing concept was introduced in the preceding section. along with various related results and counterexamples.2.2..1 (The nonadmissibility theorem. an approximate (1 — e -1 ) fraction of the k-blocks will fail to appear in a given sample path of length n.(x. Theorem 111.) If p. and ifr-mixing cases is a consequence of a slight strengthening of the exponential rate bounds of the preceding section.d.i.d. The Ornstein-Weiss results will be discussed in more detail in Section 111.2. who used the d-distance rather than the variational distance.SECTION 111.(x. and k(n) < (log n)/(h 6) then {k(n)} is admissible for p.) If p is i. The weak Bernoulli result follows from an even more careful look at what is really used for the ifr-mixing proof. requires that past and future blocks be almost independent if separated by a long enough gap.i. which will be defined carefully later.d. A second motivation was the desire to obtain a more classical version of the positive results of Ornstein and Weiss. Remark 111.2. if 0 < E < h. that if the process is a stationary coding of an i. Markov.4 A first motivation for the problem discussed in this section was the training sequence problem described in the opening paragraph. Thus the (log n)/(h 6).u. y) converges in probability to h. case of most interest is when k(n) The principle results are the following. provided k(n) — (log n)/ h.2. is ergodic with positive entropy h.i. The weak Bernoulli property. for ergodic Markov chains. are presented in Section 111.3.5. in particular.2. Theorem 111. They showed that if 147. [86]. and if k(n) > (log n)/(h — 6) then {k(n)} is not admissible in probability for p. the choice k(n) — log n is not admissible. The admissibility results of obtained here can be used to prove stronger versions of their theorem. Remark 111. for it is easy to see that.5 A third motivation for the problem discussed here was a waiting-time result obtained by Wyner and Ziv. is weak Bernoulli and k(n) < (log n)/(h 6) then {k(n)} is admissible for . with high probability.2 (The positive admissibility theorem. 175 the unbiased coin-tossing case when h = 1. see Exercise 3.

lik(x).u. process with Prob (yn = 1) < Let Cn = E yi > 2En} . as the reader can show. (1). is enough to prove the positive admissibility theorem for unbiased coin-tossing. for any n > 0.i. since each member of Tk (E/2) has measure at most 2. be an ergodic process with entropy h and suppose k = k(n) > (log n)/(h — E). This implies the nonadmissibility theorem. n — k +1]} . it can be supposed that {k(n)} is unbounded.2.d. and for any B c A such that p(B) > 1 — É and I BI > 2. Define the empirical universe of k-blocks to be the set Lik (4) = x: +k-1 = a l% for some i E [1. if n > IA I''. (2) (14 : 1Pke 14) — 141 ?.k (ték oc..) There is a positive constant C such that for any E > 0. for any finite set A. with finite alphabet A.') n Tk ( 6/2)) 2 k(h-0 2-k(h-E/2) 2 -kE/2 . and hence I pk (.14) — pk1 -± 1 for every x E A'. Proof Define yn = yn (x) — > 5E1 < 2(n ± 01/312-nCe2 =0 1 I if xn E B otherwise. Lemma 111. First the nonadmissibility theorem will be proved. for. ED < k(t +1) 1A1k 2 -t cE214 .d. for any i. see the bound (9) of Section MA. so that { yn } is a binary i.uk ((tik(4)) c) also goes to 1. where t nlk. First note that. ENTROPY FOR RESTRICTED CLASSES.k (Tk (e/2)) —> 1. E.i. where 0 < E < h. III. the following holds. The key to such results for other processes is a similar bound which holds when the full alphabet is replaced by a subset of large measure.a Proofs of admissibility and nonadmissibility. by the distance lower bound. so that . The rate bound.k(h—e12) • The entropy theorem guarantees that . whenever al' (1) Let 1 Pk( ' 14) — 1 111 ((lik(X)) c ) • Tk (6/2) = 14: p.2. it follows that Since pk(a fic ix) = 0.6 (Extended first-order bound.(4) < 2-k(h-E/2)} The assumption n 2k (h -E ) implies that ILik(4)1 < 2k (h -o. since any bounded sequence is admissible for any ergodic process.176 CHAPTER III. the error is summable in n. by the ergodic theorem. process p. and hence Pk . Next the positive admissibility theorem will be established. without loss of generality. 1/31(. Let p.

-2. by the Borel-Cantelli lemma. case.SECTION 111. • • • . Assume . k-1 (7) 14: 1Pk( 14) — tuki 31 g s=o IPsk( . It will be shown that (6) Ep. i. tt (b) /Ld o B) + E A(b) = 2(1 — .cn4E2 177 The idea now is to partition An according to the location of the indices i for which . Furthermore.2. x im from x. and hence. bEB [ A\ The exponential bound (2) can now be applied with k = 1 to upper bound (5) by tt (B(i i. then extended to the *-mixing and periodic Markov cases. .BI > € 1) Ii ui . combined with (4) and the assumption that I BI > 2. define ï = i(xil) to be the sequence of length s obtained by deleting xi„ xi2 .(B(i i . in turn.n ) den.u i and let AB be the corresponding product measure on BS defined by ALB. in1 ): m > 2en} have union Cn so that (3) yields the bound (4) (ii. . in. and apply the bound (2) to obtain (3) p. and for xri' E B(ii. — p. The proof of (6) starts with the overlapping to nonoverlapping relation derived in the preceding section.AL B I --= 2_.kI ?3/21. im )) for which in < 26n cannot exceed 1.i.)) /1 B E B s : i Pi e liD — I > 3}). with entropy h.d. . Fix a set {i1. }. The positive admissibility result will be proved first for the i. page 168.i is i. For each m < n and 1 < j 1 < i2 < • • • < j The sets .u(B)) bvB 26. 14) — ?_ 31) < oc. . in. are disjoint for different { -1. put s = n — m. x i g B.d. the sets {B(ii. upper bounded by (5) since E Bs: — 11 1. Let 8 be a given positive number. see (5). which immediately implies the desired result.} with m < 2cn. (Ca ) < (n 1)22 . ENTROPY AND JOINT DISTRIBUTIONS. im ): 'pi ( 14) . The assumption that m < 26n the implies that the probability ({Xj1 E B(ii.tti I > 56}) which is. im ))(s 1)1B12 2< ini mn 1)IBI2—n(1 —20c€ 2 The sum of the p. in. is upper bounded by tc(B(i i . (14: ip(. let B(ii. note the set of all 4 for which Xi E B if and only if i çz' {i1.) . • • • im) and have union A. the lemma follows. and assume k(n) < (log n)/ (h 6).i. no) im }:nz>2En + 1) 22-cn4E2 . Also let ALB be the conditional distribution on B defined by . i 2 .

i. furthermore. The idea now is to upper bound the right-hand side in (8) by applying the extended first-order bound. summable in n. and. (8) ti({4: Ins* — Ak1 3/2}) v({ 01: IP1( . Note that k can be replaced by k(n) in this bound. all k > K = K (g).6. Lemma 111. valid for all g. for k > K.6. which is valid for k/n smaller than some fixed y > 0. This is. yields the bound P 101) — vil > 3/20 < 2(t + 1 )2/2)2_tc82000. which. where y is a positive constant which depends only on g. produces the bound (9) tx: IPke — 1kI > < 2k(t 1)21.d.178 CHAPTER III. and plc ( '. indeed. processes. where n = tk + k +r. 101 ) — vil where each wi E A = A'. (1 + nl k) 2"1±(') . The rigorous proof is given in the following paragraphs. fi = h + 612 h ±E The right-hand side of (9) is then upper bounded by (10) 2 log n h +E (t since k(n) < (log n)/ (h E). combined with (8) and (7). Lemma 111. though no longer a polynomial in n is still dominated by the exponential factor. and all n > kly. The extended first-order bound. To obtain the desired summability result it is only necessary to choose . ENTROPY FOR RESTRIC I ED CLASSES. This establishes that the sum in (6) is finite. by the entropy theorem. 0 < r < k. with A replaced by A = A k and B replaced by T. as k grows. which. 3 } ) < 2(k + g)[lk(g)]` (t 1 )22127 tceil00 . thereby completing the proof of the admissibility theorem for i. by assumption. there is a K such that yi (T) = kil (Tk ) > 1 — 8/10. The bound (9) is actually summable in n.2. as n oc. Since the number of typical sequences can be controlled.el') denotes the s-shifted nonoverlapping empirical k-block distribution. The set rk -= {X I( : itt(x lic ) > 2— k(h+€12)} has cardinality at most 2k(h ±e/ 2) . To show this put a =C32 1100.(h±E/2)2-tc82/100 . since it was assumed that k(n) oc. As in the preceding section. the polynomial factor in that lemma takes the form. since nIk(n) and p < 1.2. to the super-alphabet A". for s E [O. with B replaced by a suitable set of entropy-typical sequences. and y denotes the product measure on » induced by A. t The preceding argument applied to the Vi-mixing case yields the bound (1 1) kt({4: 1Pke — il ?_. see (7) on page 168. valid for all k > K and n > kly. k — 1].

at least for a large fraction of shifts. provided a small fraction of blocks are omitted and conditioning on the past is allowed. The weak Bernoulli property leads to a similar bound. ar iz.1.1x:g g _n1 ) denote the conditional measure on n-steps into the future. if given E > 0 there is a gap g > 0 such that (12) E X ig-m : n (. i X -go) — n> 1. in turn.9.g(t)w(t)w(t ± 1). A process has the weak Bernoulli property if past and future become almost independent when separated by enough. It is much weaker than Ifr-mixing. The key to the O. which requires that jun (4 lxig g _m ) < (1 + c)/L n (ar iz). the s-shifted.) Also let Ex -g (f) denote conditional expectation of a function f with respect to the -g-m random vector X —g—m• —g A stationary process {X i } is weak Bernoulli (WB) or absolutely regular. —g — m < j < —g}.b The weak Bernoulli case. with only a small exponential error. = aril and Xi g Prob ( X Tg g . As before. where the variational distance is used as the measure of approximation. which only requires that for each m > 0 and n > 1. for all n > 1 and all m > O. upper bounded by log n 2(t h 1)n i3 .„(. To make this precise.SECTION 111. subject only to the requirement that k(n) < (log n)/(h E). The weak Bernoulli property is much stronger than mixing. that is. however. where g and m are nonnegative integers.. with gap g is the expression (14) 4 = w(0)g(1)w(1)g(2)w(2). ENTROPY AND JOINT DISTRIBUTIONS. n > 1. which is again summable in n.1X —g—m -g — n i) < E . there is a gap g for which (12) holds.2. k-block parsing of x. This establishes the positive admissibility theorem for the Vi-mixing case. .. and this is enough to obtain the admissibility theorem. An equivalent form is obtained by letting m using the martingale theorem to obtain the condition oc and (13) E sifc Gun( . assuming that k(n) the right-hand side in (11) is. nonoverlapping blocks could be upper bounded by the product measure on the blocks. let p. and xig g provided only that g be large enough. and oc. I11. With a = C8 2 /200 and /3 = (h + E /2)/ (h E). = xi. -mixing result was the fact that the measure on shifted.. since since t n/ k(n) and 8 < 1. hence for the aperiodic Markov case. 179 g so large that (g) < 2c6212°°. The extension to the periodic Markov case is obtained by a similar modification EJ of the exponential-rates result for periodic chains. conditioned on the past {xj. the measure defined for ar iz E An by iun (41x )— g_m = Prob(r.2. see Lemma 111. uniformly in m > 0.
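For an aperiodic Markov chain the conditional law of the future given the entire past depends on that past only through the state g steps back, so the weak Bernoulli quantity is bounded above by the expected variational distance between the g-step transition distribution and the stationary vector, a bound that tends to 0 as the gap grows. The sketch below computes this bound for an arbitrary two-state chain; the chain is an illustrative assumption, not taken from the text.

```python
# Sketch: an upper bound on the weak Bernoulli (absolute regularity) quantity
# for a Markov chain:  sum_a pi(a) * sum_b |M^g(a,b) - pi(b)|,  as the gap g grows.
def mat_pow(M, g):
    n = len(M)
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(g):
        P = [[sum(P[i][k] * M[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    return P

M = [[0.9, 0.1], [0.3, 0.7]]           # illustrative aperiodic chain
pi = [0.75, 0.25]                      # its stationary vector
for g in (1, 2, 5, 10, 20):
    Mg = mat_pow(M, g)
    bound = sum(pi[a] * sum(abs(Mg[a][b] - pi[b]) for b in range(2)) for a in range(2))
    print(g, round(bound, 8))          # decreases to 0, so the chain is weak Bernoulli
```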

8 (The weak Bernoulli splitting-set lemma.& x_ s tim -1)(k+g) ) < (1 + y)tt(w(j ni )). by the definition of B11„.1)(k +g) ). g)-splitting index for x E A Z if (w(j) ix _ s ti -1)(k+g) ) < (1 + y)p.7 Fix (y. t] is called a (y. such that the following hold. where w(0) has length s.( .n )] n B . which exists almost surely by the martingale theorem. Here.)) (1 + y) • . eventually almost surely. and there is a sequence of measurable sets {G n (y)}. g)-splitting index will be denoted by Bi (y.2.)) (1+ y)IJI J EJ Put jp. and g are understood. fl Bi ) B*) to n Bi ) = [w(j.2. or by B i if s. k. Lemma 111.) . (a) X E Gn(y). g). Thus (15) yields (n([w(i)] n B. each The first factor is an average of the measures II (w(j. The set of all x E Az for which j is a (y.„ } ( [ w (. for 1 < j < t. Lemma 111. s. .„ tc(B*). Note that Bi is measurable with respect to the coordinates i < s + j (k + g).7 follows by induction. and the final block w(t 1) has length n — t(k+g) —s < 2(k ± g).)) • tt ([.n )] of which satisfies n Bi I xs_+.u(w(in. )] n Bi ) . k. Note that the k-block w(j) starts at index s (j — 1)(k + g) + g ± 1 and ends at index s j (k + g). and later. An index j E [1. there are integers k(y) and t(y).180 CHAPTER III. k. ([w(i. lie lx i 00 ) denotes the conditional measure on the infinite past.(w(j)). Then for any assignment {w(j): E J } of k-blocks tt Proof obtain (15) JEJ (naw(i)] n B. JEJ-th„) D and Lemma 111.2._{. ENTROPY FOR RESTRICTED CLASSES. g) and fix a finite set J of positive integers. k.. k.) If l is weak Bernoulli and 0 < y < 1/2. s. s. The weak Bernoulli property guarantees the almost-sure existence of a large density of splitting indices for most shifts. y. the g-blocks g(j) alternate with the k-blocks w(j). s. then there is a gap g = g(y).2.n )] n Bi. = max{ j : j E J } and condition on B* = R E . defined as the limit of A( Ix i rn ) as m oc.

Fatou's lemma implies that I 1 — f (x) d p(x) < )/4 . The ergodic theorem implies that Nul+ rri cx. Proof By the weak Bernoulli property there is a gap g = g(y) so large that for any k f (xx ) 1 p(x i`) p(x i`lxioo g) dp. 1 N-1 i=u y2 > 1 — — a. ..U(.: 1 < i < kl. Vk ?_ MI. g)-splitting indices for x.2.s. and if (t +1)(k + g) _< n < (t + 2)(k +g). {Xi : i < —g} U {X. so there is a subset S = S(x) C [0.c ) • Let Ek be the a-algebra determined by the random variables.t=o — L — 2 then x E Gn(Y).SECTION 111. t > t(y).(x) < Y 4 4 Fix such a g and for each k define fk(X) — . see Exercise 7b. Direct calculation shows that each fk has expected value 1 and that {fk} is a martingale with respect to the increasing sequence {Ed. (Ts+0-1)(k+g)x ) > — y. t] that are (y.. 1 so there is an M such that if Cm = lx :I 1 — fk (x) 15_ . if t > t(y).3C ( IX:e g. k. s. so that if G(y) = —EICc m (T z x) > 1 1 n-1 n. ENTROPY AND JOINT DISTRIBUTIONS. 2 where kc. Put k(y)= M and let t(y) be any integer larger than 2/y 2 . denotes the indicator function of Cm. eventually almost surely. then /i(CM) > 1 — (y 2 /2). and (t + 1)(k g) < n < (t +2)(k ± g). k + g —1] of cardinality at least (1 — y)(k + g) such that for x E G(y) and s E S(x) k c. The definition of G(y) and the assumption t > 2/y 2 imply that 1 r(k+g)-1 i=0 KCht(T1 X) = g) E 1 k-i-g-1 1 t(k k + g s=o t j =i E_ t K cm ( T s-hc i —( k + g) x ) > _ y 2 . 181 (b) If k > k(y). Thus fk converges almost surely to some f. and fix an x E Gn(Y). Fix k > M. k ± g — 1] for each of which there are at least (1 — y)t indices j in the interval [1. then for x E G(y) there are at least (1 — y)(k + g) values of s E [0.

t] define s. For J C [1. s. J)) < (1+ y)' rip. k E k A. k +g —1]. J) be the set of those sequences x for which every j E J is a (y. t]. n > 1. k + g — 1] and J C [1. But if Tr+(i -1)(k +g) x E Cm then g(w(i)Ixst i-1)(k+g )— g) < 1 ( + y)tt(w(D). Theorem 111. J).3.2. g)-splitting indices for x. and note that jEJ n B = Dn(s. (. k(y) and t(y).g(ai )= E W(i) an! . In particular.9 Given 3 > 0 there is a positive y < 1/2 such that for any g there is a K = K(g. which implies that j is a (y. for at least (1—y)t indices j E [1. k. t]. Proof of weak Bernoulli admissibility. then choose integers g = g(y). t] of cardinality at least (1 — y)t. n > 1. let Dn (s. for x E G(y) and s E S(X) there are at least (1 — y)t indices j in the interval [1. . and measurable sets G. so that conditions (a) and (b) of Lemma 111. g)-splitting indices for x.8 implies that there are (1 — y)(k + g) indices s E [0.8.8 hold.J k n Pk. and if Ipk(-14) — ?. s. k + g — 1] for each of which there are at least (1 — y)t indices j in the interval [1. For s E [0. The overlapping-block measure Pk is (almost) an average of the measures pk s'g i. this is no problem. Fix t > t(y) and (t+1)(k+g) < n < (t+2)(k+g). if kl n < y. g)-splitting index. for any subset J C [1.: 3 then 114' g j ( .1-1)(k +g) x E Cm.2. Assume A is weak Bernoulli of entropy h. the empirical distribution of k-blocks obtained by looking only at those k-blocks w(j) for which j E J. t]. and k(n) < (log n)/(h + E). s. ( . k. k.1.182 CHAPTER III. then Ts±(. that is.7 and the fact that I Ji < t yield (1 6) A (n[w(j)] jEJ n D.2. y) such that if k > K. provided the sets J are large fractions of [1. .)) If x E G(y) then Lemma 111.8 easily extends to the following sharper form in which the conclusion holds for a positive fraction of shifts. where "almost" is now needed to account for end effects. = Gn(Y). s.2. and Lemma 111. t] that are (y. ENTROPY FOR RESTRICTED CLASSES. if s E S(X). 14) — ILkI 3/4 for at least 2y (k + g) indices s E [0.2. where k(y) k < (log n)/(h+E). Since I S(x)I > (1 —3)(k+ g). Fix > 0. however. If k is large relative to gap length and n is large relative to k. this completes the proof of Lemma 111.2. t] which are (y. g)-splitting index for x. so that Lemma 111. and for the j g J. In summary. Lemma 111. k.E. for gaps. choose a positive y < 1/2.(s.

Show that there exists an ergodic measure a with positive entropy h such that k(n) (log n)/(h + E) is admissible for p. = {x: Ipk (n) (. t] of cardinality small.2. 183 On the other hand. does not have exponential rates of convergence for frequencies.d.10 The admissibility results are based on joint work with Marton. Aperiodic Markov chains are weak Bernoulli because they are ifr-mixing. Since X E G n (y). which need not be if-mixing are.i. indices s. and k > k(y) and t > t(y) are sufficiently large. t] of cardinality at least > 3/4.g) assures that if I pk (14) — Ik1> S then I Ps kv g j ( . In particular.] 1J12:(1—y)t (ix: Ips k :jg elxi) — > 3/41 n Dn (s.. k + g — 1] and at least one J C [1. Remark 111. [13]. eventually almost surely.2.2.d.9 _> 3/4 for at least 2y (k -I.ei') — [4(n)1 > 31. weak Bernoulli. It is shown in Chapter 4 that weak Bernoulli processes are stationary codings of i. . The weak Bernoulli concept was introduced by Friedman and Ornstein. then for any x E G 0 (y) there exists at least one s E [0.d. Show that if p is i. 2-2ty log y to bound the number of subsets J C [1. If y is small enough. however. for any subset J C [1.i.3. ENTROPY AND JOINT DISTRIBUTIONS. with an extra factor. 2-2t y log y yy[k(o+ + 2k(n)(h+E/2)2_ t(1—Y)C32/4°° for t sufficiently large. y can be assumed to be so small and t so large that Lemma 111. Using the argument of that proof. Aperiodic renewal and regenerative processes. Show that k(n) = [log n] is not admissible for unbiased coin-tossing.2.1. if k(n) < (logn)I(h + E). This bound is the counterpart of (11). I11. Show that for any nondecreasing.SECTION 111. the set k :g j (-14) — (1 — y)t. J)) The proof of weak Bernoulli admissibility can now be completed very much as the proof for the 'tit-mixing case. see Exercises 2 and 3.i. 1. 14) Thus if y is sufficiently at least (1—y)t.3. processes in the sense of ergodic theory. then. t ] of cardinality at least (1— y)t. processes. 4.c Exercises. in Section 1V. as in the -tfr-mixing case. J) and ips (17) is contained in the set k+ g —1 s=0 lpk — itkl a.2. the measure of the set (17) is upper bounded by (18) 2 . then p(B 0 ) goes to 0 exponentially fast. and if B. the bound (18) is summable in n. this establishes Theorem 111. 3. for which x E Dn (s. 2. unbounded sequence {k(n)} there is an ergodic process p such that {k(n)} is not admissible for A. 31 n G(y) UU ig[1. as part of their proof that aperiodic Markov chains are isomorphic to i. yet p. [38].

) g dp. w(t)r.1 (The d-admissibility theorem. where each w(i) has length k. while in the variational case the condition k(n) < (log n)I(h e) was needed. .S The definition of d-admissible in probability is obtained by replacing almost-sure convergence by convergence in probability.1. Note that in the d-m etric case. process. for any integrable function g. [52}.. Show that if k(n) oc. Theorem 11.. let E(m) be the a-algebra determined by . c) be the set of all kt-sequences that can be obtained by changing an e(log n)I(h+ fraction of the w(i)'s and permuting their order. k.(x7. y) E a1 . as n 6. The d-admissibility theorem has a fairly simple proof.) (b) Show that the sequence fk defined in the proof of Lemma 111. Show that the preceding result holds for weak Bernoulli processes. based on the empirical entropy theorem.(. The earlier concept of admissible will now be called variationally admissible. Since the bound (2) dn (p. a.i.u(R. e).3..3.. (Hint: use the martingale theorem. A sequence {k(n)} is called d-admissible for the ergodic process if (1) iiIndn(Pk(n)(• Alc(n)) = 0. ENTROPY FOR RESTRICIED CLASSES. Let R. 7.3 The d-admissibility problem. almost surely.2.) If k(n) < (log n)I h then {k(n)} is d-admissible for any process of entropy h which is a stationary coding of an i. Ym. For each x let Pm (X) be the atom of E(m) that contains x.184 CHAPTER ER III.) Section 111.d. 5.d. the theorem was first proved by Ornstein and Weiss.v(4)1. and a deep property. always holds a variationally-admissible sequence is also d-admissible. The d-metric forms of the admissibility results of the preceding section are also of interest. s. Assume t = Ln/k] and xin = w(1). though only the latter will be discussed here.i. Stated in a somewhat different form. c)) —> 1. called the finitely determined property. A similar result holds with the nonoverlapping-block empirical distribution in place of the overlapping-block empirical distribution. (a) Show that E(glE)(x) = lim 1 ni -+cx) 11 (Pm(x)) fp. and p. is i. (Hint: use the previous result with Yi = E(fk+ilEk). and let E be the smallest complete a-algebra containing Um E(m). Let {Yi : j > 1} be a sequence of finite-valued random variables. k(n). 151.8 is indeed x4) to evaluate a martingale. the admissibility theorem only requires that k(n) < (log n)I h. then .. The principal positive result is Theorem 111.(x7 .

such that liminf dk( fl )(pk( fl )(. Tk = 11.5.3. and its 6-blowup dk(4. 185 which holds for stationary codings of i. define the (empirical) universe of kblocks of x'11 .) If k(n) > (log n)/(h — c). To establish the d-nonadmissibility theorem. intersecting with the entropy-typical set. subsequence version of strong nonadmissibility was first given in [52]. These results will be discussed in a later subsection. n--*oo 111(n)) > a. If 3 is small enough then for all k. and which will be proved in Section IV. If k > (log n)/(h — c) then Itik (4)1 _< 2k(h-E). n — k +1]} . The negative results are two-fold. A proof of the d-admissibility theorem. Theorem 111. Lemma 1. there is an ergodic process p.a. then {k(n)} is not admissible for any ergodic process of entropy h. But if this holds then dk(/ik. The other is a much deeper result asserting in a very strong way that no sequence can be admissible for every ergodic process. processes. almost surely The d-nonadmissibility theorem is a consequence of the fact that no more than sequences of length k(n) can occur in a sequence of length n < so that at most 2k(n )(h-E/2) sequences can be within 3 of one of them. I11.16(4)) < 81.a Admissibility and nonadmissibility proofs. produces Ak [11k(Xn)t5 ( [14(4)16 = n I.2.d. assuming this finitely determined property. 1[14(4)]s I < 2 k(" 12) by the blowup-bound. provided 3 is small enough.SECTION IIL3. A weaker. then a small neighborhood of the empirical universe of k-blocks in an n-sequence is too small to allow admissibility. is given in Section III.i. pk( 14)) > 82. THE 15-ADMISSIBILITY PROBLEM. so that ttaik(x)is) < 6. for any x'11 . One is the simple fact that if n is too short relative to k. 14(4) = k i+k-1 = a.7. .(X1) < k(h—E/4) 1 1 . then extended to the form given here in [50].3.3 (The strong-nonadmissibility theorem. Theorem 111. if k is large enough.3. and hence.) For any nondecreasing unbounded sequence {k(n)} and 0 < a < 1/2. < 2k(h-02) 2-ch-E/4) <2 -k€/4 The entropy theorem implies that it(Tk) —> 1. tai :x i some i E [0.3.2 (The d-nonadmissibility theorem. It will be proved in the next subsection.

Theorem 111. Pk( WO) is almost the same as pn. The desired result then follows since pni (an i 'lxri') —> A(an i z). fact (b). is really just the empirical-entropy theorem. the v-distribution of m-blocks in n-blocks is the measure 0. see Theorem IV.( • Ixrn). An equivalent finite form. 111 .i. since the bounded result is true for any ergodic process. The finitely determined property asserts that a process has the finitely determined property if any process close enough in joint distribution and entropy must also be dclose.l n-m EE i=0 uEiv veAn—n.x7 which makes it clear that 4)(m.. almost surely as n oc.3. the proof that stationary codings of processes are finitely determined.186 CHAPTER III. Theorem 11. of course.14(a) gives /Lk aik (x')13 ) > 1 — 28. A stationary process t is finitely determined if for any c > 0 there is a 6 > 0 and positive integers m and K such that if k > K then any measure y on Ak which satisfies the two conditions -(rn.d. vk) < E.2. then. provided only that k < (11h) log n. modulo. almost surely. by the ergodic theorem. This completes the proof of the d-nonadmissibility theorem. by the definition of finitely determined. The d-admissibility theorem is proved by taking y = plc ( I4). (a) 10 (in. Thus.2. The important fact needed for the current discussion is that a stationary coding of an i. must also satisfy cik(iik. . 11-1(14) — H(v)I <k6. it is sufficient to prove that if {k(n)} is unbounded and k(n) < (11h) log n.1. ---÷ h. and noting that the sequence {k(n)} can be assumed to be unbounded.„ = 0(m. Pk(n)(. for otherwise Lemma I. E v(uaT v). see Section IV. This completes the proof of the d-admissibility theorem. An alternative expression is Om (ain ) E Pm (ain14)v(x) = Ev (pm (ar x7)). almost surely. which is more suitable for use here and in Section 111. for each m. note that the averaged distribution 0(m. 14)) — (b) (1/k(n))H(p k(n) (. since pk(tik(fit)14) = 1. is expressed in terms of the averaging of finite distributions. ENTROPY FOR RESTRICTED CLASSES.9. as k and n go to infinity. a negligible effect if n is large enough. and in < n. of entropy h. To establish convergence in m-block distribution. Convergence of entropy.2. If y is a measure on An. for it asserts that (1 /k)H(pke h = 11(A). 1 —> 0. that is. v)1 <8. the only difference being that the former gives less than full weight to those m-blocks that start in the initial k — 1 places or end in the final k — 1 places xn .2. process is finitely determined. by the ergodic theorem. . fact (a).3. the average of the y-probability of din over all except the final m — 1 starting positions in sequences of length n. y) on Am defined by çb m (a) = 1 n — m -I. the following hold for any ergodic process p. v n ) = ym for stationary v.4.

merging of these far-apart processes must be taking place in order to obtain a final ergodic process. a-separated means that at least ak changes must be made in a member of C to produce a member of D. y k < dk (x k ) for some yk i E D} is the a-blowup of D. (U Bk(a)) = 0 k>K The construction is easy if ergodicity is not an issue.) is the path-length function.. The trick is to do the merging itself in different ways on different parts of the space. Indeed. if C n [D]ce = 0. for which (3) lim p. which is well-suited to the tasks of both merging and separation. 14 (k) ). Ak) < I. Before beginning the construction. some discussion of the relation between dk-far-far-apart measures is in order. even when blown-up by a. y) > a. in which case pk(• /2.. n(k) = max{n: k(n) < k}. 187 III. k. then cik(p. if X is any joining of two a-separated measures. see Exercise 1. unbounded sequence of positive integers.3.t and y on A k are said to be a separated if their supports are a-separated.SECTION 111. {p (J ) : j < J } . This must happen for every value of n. keeping the almost the same separation at the same time as the merging is happening. of {k(n)}-d-admissible ergodic processes which are mutually a' apart in d. for one can take p. The easiest way to make measures apart sequences and dk d-far-apart is to make supports disjoint. . The tool for managing all this is cutting and stacking. THE D-ADMISSIBILITY PROBLEM. nondecreasing. 1. Ak) > a'(1 — 1/J). then E(dk (x li` . where a'(1 — 1/J) > a. It is enough to show that for any 0 <a < 1/2 there is an ergodic measure be. is - (4) If 1. to be the average of a large number. The result (4) extends to averages of separated families in the following form. Let Bk(a) = ix ) : dic(Pk• Ix). where.u(x) > 0. when viewed through a window of length k(n). but important fact.1 and y. A simple. and k(n) < log n. Associated with the window function k(. yf)) ce). Two measures 1.c and y are a-separated. yet at the same time. as before. The support a (i) of a measure pc on Ak is the set of all x i` for which .3. look as though they are drawn from a large number of mutually far-apart ergodic processes. - [Dia = i. Throughout this section {k(n)} denotes a fixed. Thus. A given x is typical x) n(k) i is mostly concentrated on for at most one component AO ) . D c A k are said to be a separated.W-typical sequences. This simple nonergodic example suggests the basic idea: build an ergodic process whose n-length sample paths. Two sets C.*oc almost surely.. and hence lim infdk (pk (.(a (p) x a (v)) = ce.b Strong-nonadmissibility examples.

If all the columns in S(m) have the same width and height am).188 CHAPTER III.3.x ' "5. for 1 < i < n(k. v) > a(1 — 1/J). To achieve (5). and hence EAdk(4. and y. III. yl`)) v(a (0) • [1 — AGa(v)icen a(1 — 1/. how to produce an ergodic p.u. where xi is the label of the interval containing T i-l x. Thus.: j < J} of pairwise a-separated measures. then. The sequence am will decrease to 0 and the total measure of the top n(km )-levels of S(m) will be summable in m. based on [52]. ergodicity will be guaranteed by making sure that S(m) and S(m 1) are sufficiently independent. Some terminology will be helpful. after which a sketch of the modifications needed to produce the stronger result (3) will be given. A weaker liminf result will be described in the next subsection.)]a) = v jag (v i)la) = 7 j i=1 if 1 . Lemma 111.. the measure is the average of the conditional measures on the separate Si (m). (m) for i j. namely. conditioned on being n(km )-levels below the top.1). of p. illustrates most of the ideas involved and is a bit easier to understand than the one used to obtain the full limit result.4 If it = (1/J)Ei v i is the average of a family {v . if the support of y is contained in the support of y. Thus if f(m)In(k m ) is sunu-nable in m. ENTROPY FOR RESTRICTED CLASSES.b. Lemma 111. guaranteeing that (5) holds.4 suggests a strategy for making (6) hold.([a(v)]a ) < 1/J. and if the cardinality of Si (m) is constant in j.3.n ). Here it will be shown.1 A limit inferior result. in detail.. for any joining Â. for any measure v whose support is entirely contained in the support of one of the v i . then d(p. and hence not meet the support of yi . then Lemma 111.) > am . (5) The construction. for i it([a(vma) = — E vi ([a(v. Furthermore. The k-block universe Li(S) of a column . (Pkm (. from which the lemma follows. 11k. the basic constructions can begin. take S(m) to be the union of a disjoint collection of column structures {Si (m): j < Jm } such that any km blockwhich appears in the name of any column of Si (m) must be at least am apart from any km -block which appears in the name of any column of 5..3. and an increasing sequence {km } such that lim p(B km (a)) = O.4 guarantees that the d-property (6) will indeed hold. A column structure S is said to be uniform if its columns have the same height f(S) and width. then .3. (3). With these simple preliminary ideas in mind. a complete sequence {8(m)} of column structures and an increasing sequence {km } will be constructed such that a randomly chosen point x which does not lie in the top n(km )-levels of any column of S(m) must satisfy (6) dk„. Proof The a-separation condition implies that the a-blowup of the support of yi does j.
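The two estimates behind Lemma III.3.4 can be written out as follows; this is a sketch in the section's notation, reconstructing the displayed computation rather than quoting it.

```latex
% mu = (1/J) sum_j nu_j, with the nu_j pairwise alpha-separated, and nu supported
% in the support sigma(nu_i) of a single nu_i.  Since alpha-separation makes
% nu_j([sigma(nu_i)]_alpha) = 0 for j != i,
\mu\big([\sigma(\nu_i)]_\alpha\big)
  = \frac1J\sum_{j=1}^{J}\nu_j\big([\sigma(\nu_i)]_\alpha\big)
  = \frac1J\,\nu_i\big([\sigma(\nu_i)]_\alpha\big)
  \le \frac1J .
% Hence, for any joining lambda of mu and nu,
E_\lambda\big(d_k(x_1^k, y_1^k)\big)
  \;\ge\; \alpha\,\nu\big(\sigma(\nu_i)\big)\,\Big[1-\mu\big([\sigma(\nu_i)]_\alpha\big)\Big]
  \;\ge\; \alpha\Big(1-\frac1J\Big),
% so that d_k(mu, nu) >= alpha(1 - 1/J).
```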

then d32(4 2 . MI whose level sets are constant. then by using cyclical rules with rapidly growing periods many different sets can be produced that are almost as far apart. (A2) S(m) and S(m 1) are 2m-independent. k)-separated if their k-block universes are a-separated. the simpler concatenation language will be used in place of cutting and stacking language. For this reason S(m) will be taken to be a subset of A" ) . namely. 01111 = (0 414)16 c = 000000000000000001 11 = (016 116)4 d = 000. j <J. and for which an2 (1 — 1/J. using a rapidly increasing period from one sequence to the next. J)merging rule is a function .. so first-order frequencies are good. The following four sequences suggest a way to do this.3.n ).. 11 = ( 064164) 1 a . yr) > 1/2.. For M divisible by J. concatenate blocks according to a rule that specifies to which set the m-th block in the concatenation belongs. that is...0101. The real problem is how to go from stage m to stage ni+1 so that separation holds. (7) = 01010. the goal (6) can be achieved with an ergodic measure for a given a E (0. The idea of merging rule is formalized as follows. Disjoint column structures S and S' are said to be (a. .. 189 structure S is the set of all a l' that appear as a block of consecutive symbols in the name of any column in S. (Al) S(m) is uniform with height am) > 2mn(k. {1. yet if a block xr of 32 consecutive symbols is drawn from one of them and a block yr is drawn from another. J } . yet asymptotic independence is guaranteed. 10 -1 (DI = M/J. 2. Conversion back to cutting and stacking language is achieved by replacing S(m) by its columnar representation with all columns equally likely. THE 13-ADMISSIBILITY PROBLEM. These sequences are created by concatenating the two symbols. The construction (7) suggests a way to merge while keeping separation. Each sequence has the same frequency of occurrence of 0 and 1. an (M. . as m Since all the columns of 5(m) will have the same width and height. (A3) S(m) is a union of a disjoint family {S) (m): j < J} of pairwise (am . and there is no loss in assuming that distinct columns have the same name. There are many ways to force the initial stage 8 (1) to have property (A3). km )separated column structures. 01 = (01) 64 b = 00001111000011110. oc... and concatenations of sets by independent cutting and stacking of appropriate copies of the corresponding column structures. the simultaneous merging and separation problem. of course. 0 and 1..SECTION III... To summarize the discussion up to this point. for which the cardinality of Si (m) is constant in j. This is.n ) decreases to a... 1/2) by constructing a complete sequence {S(m)} of column structures and an increasing sequence {k m } with the following properties. If one starts with enough far-apart sets..
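The following minimal Python sketch (names are illustrative) builds the four length-128 sequences of (7) and verifies the point of the example: each sequence has exactly 64 zeros and 64 ones, so first-order frequencies agree, yet any two of the sequences disagree in exactly half of their 128 aligned positions.

```python
def periodic(run_length, total=128):
    """Concatenate alternating runs of 0s and 1s of the given run length, `total` symbols long."""
    pattern = [0] * run_length + [1] * run_length
    return (pattern * (total // len(pattern)))[:total]

a, b, c, d = (periodic(r) for r in (1, 4, 16, 64))

for s in (a, b, c, d):
    assert sum(s) == 64                                  # identical first-order frequencies

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v)) / len(u)

pairs = [(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)]
assert all(hamming(u, v) == 0.5 for u, v in pairs)        # aligned sequences are far apart
```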

J)-merging of the collection for some M. The key to the construction is that if J is large enough. S' c At are (a. m E [1. An (M. is the key to producing an ergodic measure with the desired property (5). When applied to a collection {Si : j E J } of disjoint subsets of A'.1 times. .. when applied to a collection {Si: j E J } of disjoint subsets of A produces the subset S(0) = S(0. 0 *(2) • • • w (m ) : W(M) E So(. K)-strongly-separated subsets of A t of the saine cardinalit-v. stated in a form suitable for iteration.3. the direct product S() = in =1 fl 5rn = 1. 0 is called the cyclic rule with period p. and a collection {S7: t < J* } of subsets of 2V * of equal cardinalitv. The family {Or : t < PI is called the canonical family of cyclical merging rules defined by J and J. 8(0) is the set of all concatenations w(1) • • • w(M) that can be formed by selecting w(m) from the 0(m)-th member of {S1 } . ENTROPY FOR RESTRICTED CLASSES. let M = exp .n ). each Si is cut into exactly MIJ copies and these are independently cut and stacked in the order specified by 0. be the cyclic (M. formed by concatenating the sets in the order specified by q.P n . a merging of the collection {Sj : j E J } is just an (M. for some x E S" and some i > 11. and. Two subsets S. it is equally likely to be any member of Si . (exp2 (J* — 1)) and for each t E [1. In cutting and stacking language. Cyclical merging rules are defined as follows. Given J and J*. Assume p divides M1. J)-merging rule 4). Lemma 111.0(rn ± np) = 0(m). The desired "cyclical rules with rapidly growing periods" are obtained as follows. let (/). then for any J* there is a K* and an V. p1. The following lemma. The merging rule defined by the two conditions (i) (ii) 0(m)= j.1 and J divides p. given that a block comes from Sj . In proving this it is somewhat easier to use a stronger infinite concatenation form of the separation idea. there is a Jo = Jo(a. The two important properties of this merging idea are that each factor Si appears exactly M1. In other words. at) such that if J > Jo and {S J : j < . that is. J)-merging rule with period pt = exp j (exp2 (t — 1)).11. The set S(0) is called the 0-merging of {Si : j E f }.190 CHAPTER III. K)-strongly-separated if their full k-block universes are a-separated for any k > K.1} is a collection of pairwise (a.) Given 0 < at < a < 1/2. /MI. Let expb 0 denote the exponential function with base b. . The full k-block universe of S C A t is the set tik (S oe ) = fak i : ak i = x:+k-1 . for each m. then the new collection will be almost as well separated as the old collection. such that . the canonical family {0i } produces the collection {S(0)} of disjoint subsets of A mt . {Sj }) of A m'. Use of this type of merging at each stage will insure asymptotic independence. In general. m E [1.5 (The merging/separation lemma.
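As an illustration of the cyclic rules just defined, the sketch below (hypothetical names; J divides p and p divides M, as in the text) computes a cyclic merging rule with period p and samples one concatenation from the corresponding φ-merging.

```python
import random

def cyclic_rule(m, p, J):
    """Cyclic merging rule with period p: phi(m) = j on the j-th sub-block of length p/J,
    extended periodically in m (1-based indices; J divides p)."""
    return ((m - 1) % p) * J // p + 1

def sample_merging(sets, M, p, seed=0):
    """One element of the phi-merging S(phi): a concatenation w(1)...w(M),
    where w(m) is chosen uniformly from the phi(m)-th set."""
    rng = random.Random(seed)
    J = len(sets)
    blocks = [rng.choice(sets[cyclic_rule(m, p, J) - 1]) for m in range(1, M + 1)]
    return sum(blocks, ())                     # concatenate the chosen blocks

# Example: J = 3 sets of 2-blocks, period p = 6, and M = 12 blocks in the concatenation.
S = [((0, 0),), ((0, 1),), ((1, 1),)]
word = sample_merging(S, M=12, p=6)
```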

each block v(j) is at least as long as each concatenation (8). K)-strongly-separated and the assumption that K < E. Fix a decreasing sequence {a } such that a < ant < 1/2. let S7 = S(0. and Yu x v±(J-Unf+1 = w(t where w(k) E j.3. so that if u+(. K*)-strongly-separated. 4 } of pairwise (an„ kni )- (B2) Each Si (m) is a merging of {S (m — 1): j < 4_ 1 }.3. and. furthermore. this may be described as follows.SECTION IIL3. Let {Or : t < J* } be the canonical family of cyclical merging rules defined by J and J*. Part (b) is certainly true. while y = c(1)c(2) • •• where each c(i) has the form (9) v(1)v(2) • • • v(J). Suppose m have been determined so that the followLemma S(m) C and k ing hold. (13) Each S7 is a merging of {Si: j E 191 J}. In concatenation language. for otherwise {Si : j < J } can be replaced by {S7: j E J } for N > Klt. THE D-ADMISSIBILITY PROBLEM. LI The construction of the desired sequence {S(m)} is now carried out by induction.Ottn-1-1) from 111. K*)-strongly-separated. provided J is large enough. Suppose x E (S. for all large enough K*. (a) {St*: t < J*} is pairwise (a*. for any j. and. and since J* is finite.5. Furthermore. 1 _< j _< J. then K* can be chosen so that property (a) holds. w(j) E Sr» 1 < j < J. This completes the proof of Lemma 111. so it only remains to show that if J is large enough.5. without losing (a. any merging of {S7 } is also a merging of {S}.--m. (B3) 2n(k) <t(m). v(j)E ST.`)c° and y E (S)oc where s > t. for each ni. and hence (10) 1 )w (t + 2) • • w(J)w(1) • • • w(t) S. choose J„. Since all but a limiting (1/J)-fraction of y is covered by such y u u+(j-1)nt+1 .1-1)nt-1-1 is a subblock of such a v(j). > JOY . K)-strongseparation. since (87)' = ST. then there can be at most one index k which is equal to 1 d u_one (xvv-i-(J-1)ne+1 y u ( J — 1 ) n f + 1 ) > a by the definition of (a.). The condition s > t implies that there are integers n and m such that m is divisible by nJ and such that x = b(1)b(2) • • where each b(i) is a concatenation of the form (8) w(l)w(2) • • • w(J). the collection {St*: t < J* } is indeed pairwise (a*. Proof Without loss of generality it can be supposed that K < f. and for each t. for each k. . Aar° (B1) S(m) is a disjoint union of a collection {S I (m): j < strongly-separated sets of the same cardinality.

J 1+1). (ii) x(S7) = ( i/J. 0): j < 41 all of whose columns have the same width and the same height t(m) = (m. The only merging idea used is that of cyclical merging of { yi : j < J} with period J. 0) is a union of pairwise km )-strongly separated substructures {Si (m. it can be supposed that 2m+ l n(kn1+ i) < t(S. and En. which is just the independent cutting and stacking in the order given.5 for the family {Si (m): j < J. 0) is cut into two copies. The new construction goes from S(m) to S(m + 1) through a sequence of Jm+i intermediate steps S(m.5 for J* = J. ENTROPY FOR RESTRICTED CLASSES.n+ 1. The union S(m + 1) = Ui Si (m + 1) then has properties (Bi). This is accomplished by doing the merging in separate intermediate steps. lc \ _ lim [ga l— E IS(m)1 t(m). Put Si (m + 1) = S. the measure of the bad set . for the complete proof see [50].192 CHAPTER III.' by (S7) N for some suitably large N.3. but two widths are possible. Only a brief sketch of the ideas will be given here. V1 * V2 * • • * V j As in the earlier construction. the sequence {S(m)} defines a stationary A-valued measure it by the formula .*). is needed. then each member of S(m) appears in most members of S(M) with frequency almost equal to 1/1S(m)1. the analogue of Theorem 1. Define k 1+ 1 to be the K* of Lemma 111. j < J. prior separation properties are retained on the unmerged part of the space until the somewhat smaller separation for longer blocks is obtained. establishing the desired 1=1 goal (5). n(lcm )I f(m) < cc. 0) S(m.n + i)x(S. 0)).10. rather than simple concatenation language.n } with am = a and ce. S. ±1 . and Sj ". Since (6) clearly holds. if necessary. the structure S(m) = S(m. At each intermediate step. for m replaced by m +1.n+ 1 = a*. In the first substage. III.3. By replacing S. each Si (nt. merging only a (1/4. Jm+i -fold independent cutting and stacking is . and (B3). (m.2 The general case.s(m) 1 k ilxiam)). ±1 )-fraction of the space at each step. and hence cutting and stacking language. where } (i) )1/4.u(Bk m (a)) is summable in m. by (B3). and the other on the part already merged. is ergodic since the definition of merging implies that if M is large relative to m. the measure p. 0)). 1) S(m. one on the part waiting to be merged at subsequent steps.(m. All columns at any substage have the same height. Meanwhile. Let {S'tk: t < Jrn+1 } be the collection given by Lemma 111. 1). on each of which only a small fraction of the space is merged. (B2). Since t(m) --> cc. that is. Cyclical merging with period is applied to the collection {57 } to produce a new column structure R 1 (m.b..4 in this case where all columns have the same height and width. 0).3. Furthermore. To obtain the stronger property (3) it is necessary to control what happens in the interval k m < k < km+1.(St i) = (1 — 1/ Ji)x(S.

implies the following. 1) affects this property. 1 as wide. and the whole construction can be applied anew to S(m 1) = S(m.3. the k-block universes of xrk) and y n(k) i are at least a-apart. can be used to show that if am) is long enough and J„. 1) is replaced by its M(m. if they lie at least n(k(m. the unmerged part has disappeared. merging a copy of {L i (m. 1) is chosen large enough. t + 1). k m )-strongly-separated. Note. 1): j < J} remains pairwise (am . then {L i (m.n+ i) 2 (1-1/0. After 4 +1 iterations. and this is enough to obtain (12) ni k. 1) > a„..3. (a) The collection S(m./r . 1). for which lim . the probability that ak(pk(. . km )-strongly-separated. An argument similar to the one used to prove the merging/separation lemma. k(m.5. 1).E.„<k<k„. for the stated range of k-values. 1))-levels have measure at most n(k(m. since R. 1). 1) is upper bounded by 1/. t): j < J } U {R 1 (m. and since the probability that both lie in in the same L i (m.14 (k) ). 1))-strongly separated from the merged part R1 (m. and (a) continues to hold. This is because. since neither cutting into copies nor independent cutting and stacking applied to separate Ej(m. 2). > 8(m. the following holds. This is.C i (m.0. 1))4 + 1. 1) is large enough and each L i (m. and R(m. 1)-fold independent cutting and stacking. pk(. 1) < 1/4 + 1. that the columns of R I (m. since strong-separation for any value of k implies strong-separation for all larger values. there is a k(m. by the way. large enough. 1). 1). The merging-of-a-fraction idea can now be applied to S(m. if M(m. ±1 . 1))strongly-separated. t) <k < k(m.4_] Bk ) < DC ' . but are only 1/J„1 . 1) > k m such that each unmerged part j (m.. (b) If x and y are picked at random in S(m. an event of probability at least (1 —2/J. which.E. 1) < k < k(m. 1))-levels below the top and in different L i (m. 1))/(m. 1) is replaced by its M(m.. but a more careful look shows that if k(m. 1). J„. weaker than the desired almost-sure result. 1)-fold independent cutting and stacking. 1) = {L i (m. 193 applied separately to each to obtain a new column structure L i (m.Iy in(k) )) > a is at least (1 — 2/4 + 0 2 (1 — 1/. of course. 1) has measure 1/4 +1 and the top n(k(m. 1): j < J} of measure 1/4 +1 . then k-blocks drawn from the unmerged part or previously merged parts must be far-apart from a large fraction of k-blocks drawn from the part merged at stage t 1. 1).1). producing in the end an ergodic process p. k (m . 1). Thus if M(m. Lemma 111.(m. while making each structure so long that (b) holds for k(m... The collection {. 1): j < J} is pairwise (a. then for any k in the range km < k < k(m.u(Bk (a)) = O. then it can be assumed that i(m. 1) have the same height as those of L i (m. then for a suitable choice of an. then applying enough independent cutting and stacking to it and to the separate pieces to achieve almost the same separation for the separate structures. But. In particular. 1) is (8(m. 1). THE D-ADMISSIBILITY PROBLEM. Each of the separate substructures can be extended by applying independent cutting and stacking to it. t))} is (p(m. which makes each substructure longer without changing any of the separation properties..SECTION III. in turn. 1) > n(k(m.

Remark III.3.6
It is important to note here that enough separate independent cutting and stacking can always be done at each stage to make ℓ(m) grow arbitrarily rapidly, relative to k_m. This fact will be applied in III.5.e to construct waiting-time counterexamples.

Remark III.3.7
By starting with a well-separated collection of n-sequences of large enough cardinality, one can obtain a final process of positive entropy h. In particular, this shows the existence of an ergodic process for which the sequence k(n) = ⌊(log n)/h⌋ is not α-admissible.

III.3.c Exercises.

1. Show that if ν̄ is the concatenated-block process defined by a measure ν on A^n, then |φ(k, ν) − ν̄_k| ≤ 2(k − 1)/n.

2. Let μ = (1/J) Σ_i ν_i, where each ν_i is ergodic and d(ν_i, ν_j) ≥ α for i ≠ j. Show that, for a suitable choice of the sequence {k(n)},

    lim inf_k d_k(p_k(·|x_1^{n(k)}), μ_k) ≥ α(1 − 1/J), almost surely.

Section III.4 Blowing-up properties.

An interesting property closely connected to entropy ideas, called the blowing-up property, has recently been shown to hold for i.i.d. processes, for aperiodic Markov sources, and for other processes of interest, including a large family called the finitary processes. Informally, a stationary process has the blowing-up property if sets of n-sequences that are not exponentially too small in probability have a large blowup. Processes with the blowing-up property are characterized as those processes that have exponential rates of convergence for frequencies and entropy and are stationary codings of i.i.d. processes. A slightly weaker concept, called the almost blowing-up property, is, in fact, equivalent to being a stationary coding of an i.i.d. process. The blowing-up property and related ideas will be introduced in this section; a full discussion of the connections between blowing-up properties and stationary codings of i.i.d. processes is delayed to Chapter IV.

If C ⊆ A^n then

    [C]_ε = { b_1^n : d_n(a_1^n, b_1^n) ≤ ε, for some a_1^n ∈ C }

denotes the ε-neighborhood (or ε-blowup) of C. An ergodic process μ has the blowing-up property (BUP) if given ε > 0 there is a δ > 0 and an N such that if n ≥ N then μ([C]_ε) ≥ 1 − ε, for any subset C ⊆ A^n for which μ(C) ≥ 2^{−nδ}.

The following theorem characterizes those processes with the blowing-up property.

Theorem III.4.1 (Blowing-up property characterization.)
A stationary process has the blowing-up property if and only if it is a stationary coding of an i.i.d. process and has exponential rates of convergence for frequencies and entropy.

Note, in particular, that the theorem asserts that an i.i.d. process has the blowing-up property. The proof of this fact, as well as most of the proof of the theorem, will be delayed to Chapter IV. The fact that processes with the blowing-up property have exponential rates will be established later in the section.

Not every stationary coding of an i.i.d. process has the blowing-up property, however. The difficulty is that only sets of sequences that are mostly both frequency and entropy typical can possibly have a large blowup, and, without exponential rates, there can be sets that are not exponentially too small yet fail to contain any typical sequences. Once these are suitably removed, a blowing-up property will hold for an arbitrary stationary coding of an i.i.d. process.

A particular kind of stationary coding, called finitary coding, does preserve the blowing-up property. A stationary coding F: A^Z → B^Z is said to be finitary, relative to an ergodic process μ, if there is a nonnegative, almost surely finite, integer-valued measurable function w(x), called the window function, such that, for almost every x ∈ A^Z, the time-zero encoder f satisfies the following condition. If x̃ ∈ A^Z and x̃_{−w(x)}^{w(x)} = x_{−w(x)}^{w(x)}, then f(x̃) = f(x).

A finitary coding of an i.i.d. process is called a finitary process. It is known that aperiodic Markov chains and finite-state processes, m-dependent processes, and some renewal processes are finitary, [28, 79], and, since i.i.d. processes have the blowing-up property, it follows that finitary processes have the blowing-up property. Later in this section the following will be proved.

Theorem III.4.2 (The finitary-coding theorem.)
Finitary coding preserves the blowing-up property.

A set B ⊆ A^k has the (δ, ε)-blowing-up property, relative to an ergodic process μ, if μ([C]_ε) ≥ 1 − ε for any subset C ⊆ B for which μ(C) ≥ 2^{−kδ}. An ergodic process μ has the almost blowing-up property (ABUP) if for each k there is a set B_k ⊆ A^k such that the following hold.

(i) x_1^k ∈ B_k, eventually almost surely.
(ii) For any ε > 0 there is a δ > 0 and a K such that B_k has the (δ, ε)-blowing-up property for k ≥ K.

Theorem III.4.3 (Almost blowing-up characterization.)
A stationary process has the almost blowing-up property if and only if it is a stationary coding of an i.i.d. process.

A proof that a process with the almost blowing-up property is a stationary coding of an i.i.d. process will be given in Section IV.3. By borrowing one concept from Chapter IV, it will be shown later in this section that a stationary coding of an i.i.d. process has the almost blowing-up property, a result that will be used in the waiting-time discussion in the next section.
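A toy example may make the window-function idea concrete. The sketch below is not from the text: it encodes a binary sequence by the parity of the distance from the current position to the nearest 1 at or to the right of it, so the window w(x) is that distance, which is almost surely finite for an i.i.d. source with a positive probability of the symbol 1.

```python
def time_zero_encoder(x, origin):
    """f(x): parity of the distance from `origin` to the nearest 1 at or to the right of it.
    The value depends only on x[origin : origin + w + 1], where w is that distance."""
    w = 0
    while x[origin + w] != 1:        # a.s. finite search when the source gives 1 positive probability
        w += 1
    return w % 2

def finitary_code(x, lo, hi):
    """Apply the time-zero encoder at each position in [lo, hi); x must extend far enough."""
    return [time_zero_encoder(x, i) for i in range(lo, hi)]

# Example on a finite chunk of a sample path (positions 0..9 encoded).
sample = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1]
coded = finitary_code(sample, 0, 10)
```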

a Blowing-up implies exponential rates. E) = 1Pk( 14) — IkI > El. - = trn i: iun (x n i)> so that B* (n.(C) > 2 -6 " then . Since . however. E)) > 2 -6" then ([B(n. To fill in the details of this part of the argument.196 CHAPTER III.u([C]. The blowing-up property provides 8 > 0 and N so that if n>N. E/2)) to be at least 1 — E. by the entropy theorem. since this implies that if n is sufficiently .u has the blowing-up property. and ti. and p.) would have to be at least 1 —E. If the amount of blowup is small enough. Note that if y = E/(2k + 1) then d(4.(C) > 2 -611 then ii. and hence (1) would force the measure p n (B(n. since .(n. in particular. The blowing-up property provides 6 and N so that if n>N.u(B*(n. III. has exponential rates for frequencies.u n (B(n.u n ([B*(n. o ] l < for all n. If.Cc A n . contradicting the ergodic theorem. and define B(n. ç B(n. < 2n(h—E/2)2— n(h—E14) < 2—nEl 4 .4.Cc An. Suppose . E)1 0 ) 0 it follows that . [B(n.) —> 1. c)) < 2 -6n . k. An exponential bound for the measure of the set B. for there cannot be too many sequences whose measure is too large. First it will be shown that p. k.) > 1 — a. E)) must be less than 2 -8n.u„(T. hence a small blowup of such a set cannot possibly produce enough sequences to cover a large fraction of the measure. The idea of the proof is that if the set of sequences with bad frequencies does not have exponentially small measure then it can be blown up by a small amount to get a set of large measure. ENTROPY FOR RESTRICTED CLASSES. E)]. k. ) fx n i: tin(x n i) < of sequences of too-small probability is a bit trickier to obtain. To fill in the details let pk ( WI') denote the empirical distribution of overlapping k blocks. k. let B . k.u(4) 2 ' _E/4)} gives (n. Next it will be shown that blowing-up implies exponential rates for entropy. —> 0. .([C]y ) > 1 — E. (1) Ipk( 14)— pk( 1)1)1 6/2. for all sufficiently large n. Intersecting with the set Tn = 11. n > N and . Thus p n (B(n. E/2)) = 0 for each fixed k and E. for all n sufficiently large. 0. however. then frequencies won't change much and hence the blowup would produce a set of large measure all of whose members have bad frequencies. But this cannot be true for all large n since the ergodic theorem guarantees that lim n p n (B(n. E/2).naB * 2n(h—E/2) . k. One part of this is easy. k. but depends only on the existence of exponential rates for frequencies. 6 . n In) _ SO p„([B* (n. so that. k. 6)1 < 2n(h €). and hence there is an a > 0 such that 1[B* (n.
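The counting step in the "too-large probability" part of the argument can be written out as follows; this is a sketch using generic set names, not the section's exact labels.

```latex
% Let B = {x_1^n : mu(x_1^n) >= 2^{-n(h - eps/2)}} be the sequences of too-large
% probability and T_n = {x_1^n : mu(x_1^n) <= 2^{-n(h - eps/4)}}.  Then |B| <= 2^{n(h-eps/2)}, so
\mu(B \cap T_n) \;\le\; |B|\cdot 2^{-n(h-\epsilon/4)}
             \;\le\; 2^{n(h-\epsilon/2)}\,2^{-n(h-\epsilon/4)}
             \;=\; 2^{-n\epsilon/4},
% an exponentially small bound, as used in the entropy-rate part of the proof.
```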

u(T) > 1 — c by the Markov inequality. The finitary-coding theorem will now be established. Fix E > 0 and a process p with AZ.4. the set of n-sequences that are (1 —20-covered by the complement of B(k. In particular. the set T = {x n I± k 1 : P2k+1(GkiX n jc-± k i) > 1 — 61. 6/4). and hence tin(Gn n 13. if n > N I . satisfies .1)-blocks in xn -tf k that belong to Gk and.u. then iGn < 2n(h+E/2). then p. this gives an exponential bound on the number of such x. Let D be the projection of F-1 C onto B" HI. there is a sequence i n tE k E D such that dn+2k(X . 6)) < IGn12—n(h+E) < 2—ne/2 n > N • Since exponential rates have already been established for frequencies.(n. implies that there is an a > 0 and an N such that if n > N. For such a sequence (1 — c)n of its (2k + 1)-blocks belong to Gk. Suppose C c An satisfies v(C) > 2 -h 6 .SECTION 111. 6/4)) < a. the blowing-up property. Thus. where 6 > O.6. which in turn means that it is exponentially very unlikely that such an can have probability much smaller than 2 -nh. Now consider a sequence xhti. k 1 E [D]y n T. Lemma 1. x E B Z . BLOWING-UP PROPERTIES. and window function w(x).un(B. Since the complement of 13(k. Let be a finitary coding of . Thus f (x) depends only on the values x ww (x()x) . there is a 6 > 0 and an N 1 > N such that p. since k is fixed and .(k.(n. For n > k let = pk(B.1 • Thus fewer than (2k + 1)yn < cn of the (2k + 1)-blocks in x h± k± k I can differ from the corresponding block in i1Hk4 k . . (Gk) > 1 — 6 2 . To fill in the details of the preceding argument apply the entropy theorem to choose k so large that Ak(B(k.b Finitary coding and blowing-up. I=1 III. that is. except for a set of exponentially small probability. where a will be specified in a moment. 6/4) has cardinality at most 2 k(h+e/4) . As in the proof of the entropy theorem.7. the built-up set bound.4. n (G n ) > 1 — 2 -6n.u([D]y ) > 1 — E. 6 can be assumed to be so small and n so large that if y = E / (2k + 1) then .u(D) > 2 -h 8 . 1) < Y 2k -I. at the same time agree entirely with the corresponding block in imHk-±k .u has the blowing-up property. n > N1. and note that . E)) < 2_8n which gives the desired exponential bound. E /4) < 2al. most of xPli will be covered by k-blocks whose measure is about 2 -kh. and hence there is a k such that if Gk is the set of all x k k such that w(x) < k. where w(x) is almost surely finite._E k 1 . In particular. Thus + 2—ne/2. moreover. there is at least a (1 — 20-fraction of (2k -I. with full encoder F: B z time-zero encoder f: B z A. 197 large then. and.

. IC)) < 62 .1. ENTROPY FOR RESTRICTED CLASSES.(4. that is. x ntf k 19 and r-iChoose y and 5. ) (3) Ca li) = E P (a14)v(x) = Ev(Pk(a l'IX7)).2. An ergodic process p. and which will be discussed in detail in Section IV. Ane IC» <2 then 1. Z = F(Y. c+ k i . Let .3. measure less than E.198 CHAPTER III.([C]e ) > 1 — E. a property introduced to prove the d-admissibility theorem in Section 111. 4([C]) > 1 — E. ktn( . LII which completes the proof of the finitary-coding theorem. int+ ki.c Almost blowing-up and stationary coding. (b) 11-1(u n ) — H(v)I < u8. An( IC)). C). the set of 4 such that dn (xit . C) > E has p. is finitely determined (FD) if given E > 0.u. I11. v) is the measure on Ak obtained by averaging must also satisfy the y-probability of k-blocks over all starting positions in sequences of length n. Proof The relation (2) vn EC holds for any joining X of A n and A n ( IC). <e.4. Thus if (LW.4 If d(1-t.i. The sequence Z7 belongs to C and cln (z7 .4.d. In this section it is shown that stationary codings of i.a into the notation used here. C) denote the distance from x to C. and put z = . that is. by the Markov inequality. Towards this end a simple connection between the blowup of a set and the 4-distance concept will be needed.d. v)I < 8. This proves the lemma.4 )1) > P(4)dn(4 . so that Z7 E [C]2 E . then.). This proves that vn ([ che ) n T) > 1 — 2€. where 4)(k. E Yil )dn(. the minimum of dn (4 .1 (' IC) denote the conditional measure defined by the set C c An. ylz). that is.3. processes have the almost blowing-up property (ABUP). E D such that Yi F(y). there is a 3 > 0 and positive integers k and N such that if n > N then any measure y on An which satisfies the two conditions (a) Luk — 0*(k. . The proof that stationary codings of i. Lemma 111. and hence EA(Cin(X til C)) < (LW. Also let c/. process have the almost blowing-up property makes use of the fact that such coded processes have the finitely determined property.i. k This simply a translation of the definition of Section III. Z?) < 26. y E C.

a) be the set of n-sequences that are both entropy-typical. relative to a. for any C c Bp. I xi ) — . Show that (2) is indeed true. implies that C has a large blowup. Pk(• Ifiz) is the empirical distribution of overlapping k-blocks in xri'. For references to earlier papers on blowing-up ideas the reader is referred to [37] and to Section 1. [7]. both of which depend on .6 The results in this section are mostly drawn from joint work with Marton. The first step in the rigorous proof is to remove the sequences that are not suitably frequency and entropy typical. BLOWING-UP PROPERTIES. IC)) — 141 < a(n). a(n)). there is a 8 > 0 such that Bn eventually has the (8. Theorem 111. a)-frequency-typical for all j < k. 1. such that 4 E Bn(k(n). so there is a nondecreasing. The finitely determined property then implies that gn e IC) and . ttn(. Kan). a (n)). as usual. then the conditional measure A n (. if tt(x til ) < 2. where. then xçi E Bn(k. a)-frequency-typical if I Pk (.SECTION 111. .uk and entropy close to H(i). relative to a. implies that . Given E > 0. then lin e IC) will have (averaged) k-block probabilities close to . IC) satisfies the following. It will be shown that for any c > 0.) A finitely determined process has the almost blowing-up property 199 Proof Fix a finitely determined process A.11 (4)+na . eventually almost surely.4. and a nonincreasing sequence {a(n)} with limit 0. which also includes applications to some problems in multi-user information theory.) > 1 E. ED — Remark 111. Note that in this setting n-th order entropy. is used.4.5 in the Csiszdr-Körner book. Let Bn (k.4. ")I < 8 and IHCan ) — HMI < n6. 2.4. e)blowing-up property. and (j. rather than the usual nh in the definition of entropy-typical. Thus if n is large enough then an(it. 39].4.d Exercises. time IC)) < e 2 . I11.4. If k and a are fixed. Suppose C c Bn and .u([C].5 (FD implies ABUP. This completes the proof that finitely determined implies almost blowing-up. [37.uk I < a.un are d-close. The basic idea is that if C c A n consists of sequences that have good k-block frequencies and probabilities roughly equal to and i(C) is not exponentially too small. for which 2 —n8 which. which is all right since H(t)/n —> h.4. a).4. by Lemma 111. eventually almost surely. y) < e 2 . the finitely determined property provides 8 > 0 and k such that if n is large enough then any measure y on A n which satisfies WI Ø(k. A sequence 4 will be called entropy-typical. by Lemma 111. Put Bn = Bn(k(n). must also satisfy dn (ii. A sequence x i" will be called (k.u.u(C) > 2 —n8 . which. (ii) H(14) — na(n) — 178 Milne IC)) MILO ± na(n). The definition of Bn and formula (3) together imply that if k(n) > k. — (i) 10(k. Show that a process with the almost blowing-up property must be ergodic. unbounded sequence {k(n)}.

3. Assume that i.i.d. processes have the blowing-up property.
(a) Show that an m-dependent process has the blowing-up property.
(b) Show that a ψ-mixing process has the blowing-up property.

4. Show that a process with the almost blowing-up property must be mixing.

5. Show that condition (i) in the definition of ABUP can be replaced by the condition that μ(B_n) → 1.

6. Assume that aperiodic renewal processes are stationary codings of i.i.d. processes. Show that some of them do not have the blowing-up property.

7. Show that a coding is finitary relative to μ if and only if, for each b, the set f^{−1}(b) is a countable union of cylinder sets together with a null set.

Section III.5 The waiting-time problem.

A connection between entropy and recurrence times was shown to hold for any ergodic process in Section II.5. Wyner and Ziv also established a positive connection between a waiting-time concept and entropy, [86]. They showed that if W_k(x, y) is the waiting time until the first k terms of x appear in an independently chosen y, then (1/k) log W_k(x, y) converges in probability to h, for irreducible Markov chains. This result was extended to somewhat larger classes of processes, [11, 76], including the weak Bernoulli class in [44, pp. 10ff].

Two positive theorems will be proved here, at least for certain classes of ergodic processes. An almost sure version of the Wyner-Ziv result will be established for the class of weak Bernoulli processes, by using the joint-distribution estimation theory of III.2. In addition, an approximate-match version will be shown to hold for the class of stationary codings of i.i.d. processes, by using the d-admissibility theorem, Theorem III.3.1, in conjunction with the almost blowing-up property discussed in the preceding section. Counterexamples to extensions of these results to the general ergodic case will also be discussed. The counterexamples show that waiting-time ideas, unlike recurrence ideas, cannot be extended to the general ergodic case, thus further corroborating the general folklore that waiting times and recurrence times are quite different concepts. The surprise here, of course, is the positive result, for the well-known waiting-time paradox suggests that waiting times are generally longer than recurrence times.

The waiting-time function W_k(x, y) is defined for x, y ∈ A^∞ by

    W_k(x, y) = min{ m ≥ 1 : y_m^{m+k−1} = x_1^k }.

The approximate-match waiting-time function W_k(x, y, δ) is defined by

    W_k(x, y, δ) = min{ m ≥ 1 : d_k(x_1^k, y_m^{m+k−1}) ≤ δ }.
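Both waiting-time quantities are straightforward to compute on finite sample paths. The sketch below is illustrative only (the function names are not from the text); it returns W_k(x, y), or its approximate-match version, using 1-based positions as in the definitions, and None if no match occurs within the available portion of y.

```python
def waiting_time(x, y, k):
    """W_k(x, y): least m >= 1 with y[m .. m+k-1] equal to x[1 .. k]."""
    target = x[:k]
    for m in range(1, len(y) - k + 2):
        if y[m - 1:m - 1 + k] == target:
            return m
    return None                      # no exact match inside the finite portion of y

def approx_waiting_time(x, y, k, delta):
    """W_k(x, y, delta): least m >= 1 with d_k(x[1..k], y[m .. m+k-1]) <= delta."""
    target = x[:k]
    for m in range(1, len(y) - k + 2):
        window = y[m - 1:m - 1 + k]
        if sum(a != b for a, b in zip(target, window)) / k <= delta:
            return m
    return None

# Example: the first 3 symbols of x first appear in y at position 3.
x = [1, 0, 1, 1, 0]
y = [0, 1, 1, 0, 1, 0, 1, 1]
m_exact = waiting_time(x, y, 3)                    # equals 3 here
m_approx = approx_waiting_time(x, y, 3, 1 / 3)     # also 3 here
```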

Theorem III.5.1 (The exact-match theorem.)
If μ is weak Bernoulli with entropy h, then

    lim_{k→∞} (1/k) log W_k(x, y) = h,

almost surely with respect to the product measure μ × μ.

Theorem III.5.2 (The approximate-match theorem.)
If μ is a stationary coding of an i.i.d. process, of entropy h, then

    lim sup_{k→∞} (1/k) log W_k(x, y, δ) ≤ h, for every δ > 0, and
    lim_{δ→0} lim inf_{k→∞} (1/k) log W_k(x, y, δ) ≥ h,

almost surely with respect to the product measure μ × μ.

The easy part of both results is the lower bound, for there are exponentially too few k-blocks in the first 2^{k(h−ε)} terms of y, hence it is very unlikely that a typical x_1^k can be among them or even close to one of them.

The proofs of the two upper bound results will have a parallel structure, though the parallelism is not complete. In each proof two sequences of measurable sets, {B_k ⊆ A^k} and {G_n}, are constructed for which the following hold.

(a) x_1^k ∈ B_k, eventually almost surely.
(b) y ∈ G_n, eventually almost surely.

In both cases an application of the Borel-Cantelli lemma establishes the desired result.

In the weak Bernoulli case, the B_k are the entropy-typical sequences, so that property (a) holds by the entropy theorem, while the G_n are the sets given by the weak Bernoulli splitting-set lemma. The set G_n consists of those y ∈ A^Z for which y_1^n can be split into k-length blocks separated by a fixed length gap, such that a large fraction of these k-blocks are almost independent of each other. In this case it is necessary to allow the set G_n to depend on the infinite past, hence G_n is taken to be a subset of the space A^Z of doubly infinite sequences. The sequences {B_k} and {G_n} have the additional property that if k ≤ (log n)/(h + ε) then, for any fixed x_1^k ∈ B_k, the probability that y ∈ G_n does not contain x_1^k anywhere in its first n places is almost exponentially decaying in n.

In the approximate-match case, the B_k are the sets given by the almost blowing-up property of stationary codings of i.i.d. processes, established in the preceding section, and G_n consists of those y_1^n whose set of k-blocks has a large δ-neighborhood. Here G_n can be taken to depend only on the coordinates from 1 to n, hence is taken to be a subset of A^n, and the d-admissibility theorem guarantees that property (b) holds. The sets {B_k} and {G_n} have the property that if n ≥ 2^{k(h+ε)}, then for any fixed y_1^n ∈ G_n, the probability that x_1^k ∈ B_k is not within δ of some k-block in y_1^n is exponentially small in k.

The lower bound and upper bound results are established in the next three subsections. In the final two subsections a stationary coding of an i.i.d. process will be constructed for which the exact-match version of the approximate-match theorem is false, and an ergodic process will be constructed for which the approximate-match theorem is false.

III.5.a Lower bound proofs.

The following notation will be useful here and in later subsections. The (empirical) universe of k-blocks of y_1^n ∈ A^n is the set

    U_k(y_1^n) = { y_i^{i+k−1} : i ∈ [1, n − k + 1] }.

Its δ-blowup is

    [U_k(y_1^n)]_δ = { x_1^k : d_k(x_1^k, y_1^k) ≤ δ, for some y_1^k ∈ U_k(y_1^n) }.

In the following discussion, the set of entropy-typical k-sequences (with respect to α > 0) is taken to be

    T_k(α) = { x_1^k : 2^{−k(h+α)} ≤ μ(x_1^k) ≤ 2^{−k(h−α)} }.

Theorem III.5.3
Let μ be an ergodic process with entropy h > 0 and let ε > 0 be given. If k(n) ≥ (log n)/(h − ε) then for every y ∈ A^∞ and almost every x, there is an N = N(x, y) such that x_1^{k(n)} ∉ U_{k(n)}(y_1^n), n ≥ N.

Proof. If k ≥ (log n)/(h − ε) then n ≤ 2^{k(h−ε)}, and hence |U_k(y_1^n)| ≤ n ≤ 2^{k(h−ε)}. Intersecting with the entropy-typical set T_k(ε/2) gives

    μ( U_k(y_1^n) ∩ T_k(ε/2) ) ≤ 2^{k(h−ε)} 2^{−k(h−ε/2)} ≤ 2^{−kε/2}.

Since x_1^k ∈ T_k(ε/2), eventually almost surely, an application of the Borel-Cantelli lemma yields the theorem. □

The theorem immediately implies that

    lim inf_k (1/k) log W_k(x, y) ≥ h,

almost surely with respect to the product measure μ × μ, for any ergodic measure with entropy h.

Theorem III.5.4
Let μ be an ergodic process with entropy h > 0 and let ε > 0 be given. There is a δ > 0 such that if k(n) ≥ (log n)/(h − ε) then for every y ∈ A^∞ and almost every x, there is an N = N(x, y) such that x_1^{k(n)} ∉ [U_{k(n)}(y_1^n)]_δ, n ≥ N.

Proof. If k ≥ (log n)/(h − ε) then |U_k(y_1^n)| ≤ n ≤ 2^{k(h−ε)}, so that, by the blowup-bound lemma, if δ is small enough then, for all k, |[U_k(y_1^n)]_δ| ≤ 2^{k(h−3ε/4)}. Intersecting with the entropy-typical set T_k(ε/2) gives

    μ( [U_k(y_1^n)]_δ ∩ T_k(ε/2) ) ≤ 2^{k(h−3ε/4)} 2^{−k(h−ε/2)} ≤ 2^{−kε/4}.

Since x_1^k ∈ T_k(ε/2), eventually almost surely, an application of the Borel-Cantelli lemma yields the theorem. □

The theorem immediately implies that

    lim_{δ→0} lim inf_k (1/k) log W_k(x, y, δ) ≥ h,

almost surely with respect to the product measure μ × μ, for any ergodic measure with entropy h.

k + g .5. and fix c > 0. 203 III.log Wk(x. and a gap g. 1 .2. (1) sa Claw (Di n Bs. t ]. k-block parsing of y.( 0)-1)(k+g) ) < (1+ Y)Ww(i)). k. (b) For all large enough k and t. where w(0) has length s < k + g. y) < h.5. For each k put Ck(x) = Y Bk. The sets G n C A Z are defined in terms of the splitting-index concept introduced in III. Bk = Tk(a) = {xj: 2 -k(h+a) < wx lic) < where a is a positive number to be specified later.} are given by Lemma 111. 1. y) > h -E€ so that it is enough to show that y g' Ck (x). (2) 1 lim sup . k > K(x).. to be specified later. s. The s-shifted.1] for which there are at least (1 . (a) x E Gn .b Proof of the exact-match waiting-time theorem. s.u . with gap g is yti` = w(0)g(1)w(1)g(2)w(2). and for y E G(y) there is at least one s E [0. The sets {G. k. The basic terminology and results needed here will be restated in the form they will be used. t] is a (y. Fix such an (0 (Yrk) )1 and note that ye Ck (x). for 1 < j < t.s. g)-splitting index.y)t indices j in the interval [1. An index j E [1. n(k) = n(k. t] that are (y. for (t + 1)(k + g) < n < (t 2)(k ± g).. The sets {B k } are just the entropy-typical sets. which depends on y. g)-splitting indices for y.u(w(DlY s 40. Property (a) and a slightly weaker version of property (b) from that lemma are needed here. If 8 1 is the set of all y then E A z that have j as a (y.. s. To complete the proof of the weak Bernoulli waiting-time theorem. that is.g(t)w(t)w(t + 1).t(k g) .) (1 + y) Ifl libt(w(i)).a. eventually almost surely. stated as the following. k. eventually almost surely.8 and depend on a positive number y. E) denote the least integer n such that k < (log n) I (h É).2. and J C [1. Let Fix a weak Bernoulli measure a on A z with entropy h. and the final block w(t +1) has length n .s < 2(k + g). it is enough to establish the upper bound. E For almost every x E A" there exists K(x) such that x x.log Wk (x.1 k x .SECTION 111.b. the g(j) have length g and alternate with the k-blocks w(j). . g)-splitting index for y E A z if . eventually almost surely. THE WAITING-TIME PROBLEM. The entropy theorem implies that 4 E B.

eventually almost surely.c Proof of the approximate-match theorem. the probability x lic. eventually almost surely. eventually almost surely. CHAPTER III. eventually almost surely.2ty log y 0 y y (1 2—k(h+a))t( 1— Y) Indeed. Let be a stationary coding of an i. For each k.ualik(n)(31)]812) > 1/2) . t] of cardinality at least t(1 — y) and an integer s E [0.3. Since y E Gn . yields it(n[w(i)] JE.1.y). 3) > h ± c if and only if x i( E C k(y).204 ED CLASSES. since ii(x) > k(h+a) and since the cardinality of J is at least t(1 — y).4. and fix E > 0.5. Fix a positive number 3 and define G. Let {B k } be the sequence of sets given by the almost blowing-up characterization theorem. k.5. (1/k) log Wk(x. and hence it is enough to show that xi. for any y E G n (k)(J.' Ck (y). The least n such that k < (log n)/ (h c) is denoted by n(k). This completes Li the proof of the exact-match waiting-time theorem. s. then a and y can be chosen so small that the bound is summable in k. For almost every y E Aoe there is an N (y) such that yti' E G n . ENTROPY FOR RESTRICTED Choose k > K (x) and t so large that property (b) holds. s) c the splitting-index product bound. since there are k ± g ways to choose the shift s and at most 2-2ty lo gy ways to choose sets J c [1.3. Since Gnuo (J. process such that h(A) = h. and choose . n > N(. which proves the theorem. Fix such a y.0 > 0 and K such that if k > K and C c Bk is a set of k-sequences of measure at least 2 -0 then 11([C]8/2) > 1/2.(0) < (k + g)2. since t n/ k and n . k + g — 1] and let Gn(k) (J. let Ck (y) = fx 1`: x g [Uk(Yr k) )]81 so that. for all j E J. t] of cardinality at least t(1 — y). and such that (t+ 1)(k+ g) < n(k) < (t 2)(k ± g). while k(n) denotes the largest value of k for which k < (log n)/(h c). s) denote the set of all y E G n (k) for which every j E J is a (y. it then follows that y 0 Ck(x).1. I11. = . yri: E G n . By the d-admissibility theorem. To establish the key bound (3). Theorem 111. . y. The key bound is (3) ii(Ck(x) n Gn. j E J are treated as independent then. Theorem 111. (1). s). But if the blocks w(j). is upper bounded by that w(j) (1 — 2 -k(h+a ) ) t(1-Y) and hence )t it(ck(x)n G n(k)(J s)) < (1 + y (1 This implies the bound (3).1 n Bsd) + y)' jJ fl bi (w(i)). g)-splitting index for y.i. Theorem 111. if this holds.d. fix a set J c [1.

The marker construction idea is easier to carry out if the binary process p.. that is.„ (— log Wk(n)(y. If x i = 1. yet are pairwise separated at the k-block level. aL (C(4). A family {Ci } of functions from G L to A L is said to be pairwise k-separated. and sample paths cycle through long enough blocks of each symbol then W1 (x. This artificial expansion of the alphabet is not really essential. -51. Y .i. 4.u o F -1 will be constructed along with an increasing sequence {k(n)} such that (4) 1 lim P. hence intersect. Si) to be large. n-400 k(n) for every N. eventually almost surely. fix GL c A L .u. j j. however.5. where P.SECTION 111. 3 } . then switches to another marker for a long time. I11. If the process has a large enough alphabet. The same marker is used for a long succession of nonoverlapping k-blocks. y) will be large for a set of probability close to 1. But this implies that the sets Ck(y) and [ilk (Yin (k) )]8 intersect. ) 7 ) N) = 0. y) will be large for a set of y's of probability close to 1/2. and hence E p(Ck(y)n Bk ) < oc. k=1 Since 4 E Bk. A family {Ci} of functions from GL to A L is 6-close to the identity if each member of the family is 6-close to the identity. The classic example where waiting times are much longer than recurrence times is a "bursty" process. .„ denotes probability with respect to the product measure y x v. Let . y1 E Ci(Xt). 1.) is forced to be large with high probability. The initial step at each stage is to create markers by changing only a few places in a k-block.u(Ck(Y) n Bk) < 2-10 . is thought of as a measure on A Z whose support is contained in B z .d An exact-match counterexample. 1} such that g i (0) = . one in which long blocks of l's and O's alternate. where k is large enough to guarantee that many different markers are possible. for then the construction can be iterated to produce infinitely many bad k. . completing the proof of the approximate-match theorem. 2. Thus. for then blocks of 2's and 3's can be used to form markers.u i (1) = 1/2. for with additional (but messy) effort one can create binary markers. 1471 (x. which contradicts the definition of Ck(y). so that yrk) E G(k). with high probability. be the Kolmogorov measure of the i. The first problem is the creation of a large number of block codes which don't change sequences by much. By this mimicking of the bursty property of the classical example. cycling in this way through the possible markers to make sure that the time between blocks with the same marker is very large. 205 Suppose k = k(n) > K and n(k) > N(y).5. The key to the construction is to force Wk(y.E Cj(Xt). that is. To make this precise. (1/k) log Wk(y. then. ) < 6. x E GL. where A = {0. process with binary alphabet B = {0. that is Uk(yP) fl /ta ) = 0. eventually almost Li surely. it follows that 4 g Ck(y). THE WAITING-TIME PROBLEM. A function C: GL A L is said to be 6-close to the identity if points are moved no more than 8.d. say. A stationary coding y = . If it were true that [t(Ck(Y) n Bk) > 2-0 then both [Ck(Y) n Bk]512 and [16 (Yr k) )]6/2 each have measure larger than 1/2. if the k-block universes of the ranges of different functions are disjoint. with a code that makes only a small density of changes in sample paths.

where g > 1. and since different wf are used with different Ci . 31g in some way and let wf denote the i-th member. 1. 3 > 0. and (C. (i) pvxv (wk'. Put . Given c > 0. Define the i-th k-block encoder a (4) = yilc by setting y = Yg+2 = Y2g+3 = 0 . for all x.. i > 2g + 3. the family is 3-close to the identity. Lemma 111. as done in the string matching construction of Section I.c. The following lemma combines the codes of preceding lemma with the block-to-stationary construction in a form suitable for later iteration.8. Let L. pairwise k-separation must hold. and N. The block-to-stationary construction. o F-1 the following hold. An iteration of the method.8. Thus the lemma is established.2)-block Owf 0. hence no (g -I. The encoder Ci is defined by blocking xt into k-blocks and applying Cs. there is a k > 2g +3 and astationary encoder F: A z 1--+ A z such that for y = p.1)-block of 2's and 3's can occur in any concatenation of members of Ui C. with the additional property that no (g +1)-block of 2's and 3's occurs in any concatenation of members of U. F (x)) < 6. and leaves the remaining terms unchanged.C.: 1 < i < 2g1 be as in the preceding lemma for the given 3.206 CHAPTER III. 3} L consisting of those sequences in which no block of 2's and 3's of length g occur Then there is a pairwise k-separated family {C i : 1 < i < 2g} of functions from GL to A L . 2.5. g+1 2g+2 yj = x . 3 } g+ 1 ) = 0.5. i (x lic) replaces the first 2g ± 3 terms of 4 by the concatenation 0wf0wf0. then choose L > 2kN+2 lc so that L is divisible by k. El The code family constructed by the preceding lemma is concatenated to produce a single code C of length 2g L. where jk+k jk+1 = Ci jk+k x )jk+1 o < <M.(G Proof Order the set {2. ENTROPY FOR RESTRICTED CLASSES. Lemma 111. The following lemma is stated in a form that allows the necessary iteration. Any block of length k in C(x) = yt must contain the (g -I. GL. 2. Lemma 1. 2. that is. Proof Choose k > (2g +3)18.5 Suppose k > (2g+3)/8 and L = mk. 3) such that i({2. 1.6 Let it be an ergodic measure with alphabet A = {0. 3}) = 0. is used to produce the final process.(GL). Let GL be the subset of {0. (iii) v({2. Ci (xt) = yt. separately to each k-block. (ii) d(x. Since each changes at most 2g+3 < 3k coordinates in a k-block. which is 8-close to the identity.5. 5") < 2kN ) 5_ 2 -g ± E. Y2 = Yg+3 = W1 . its proof shows how a large number of pairwise separated codes close to the identity can be constructed using markers. is then used to construct a stationary coding for which the waiting time is long for one value of k. Note also that the blocks of 2's and 3's introduced by the construction are separated from all else by O's. that is.
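The i-th k-block encoder just described changes only the first 2g + 3 places of a k-block. A minimal sketch follows (names are illustrative; w stands for the chosen g-block of 2's and 3's, and the block is assumed longer than 2g + 3).

```python
def marker_block_encoder(block, w):
    """C_i on a k-block: write the marker 0 w 0 w 0 in the first 2g+3 places, keep the rest unchanged."""
    g = len(w)
    assert len(block) > 2 * g + 3
    prefix = [0] + list(w) + [0] + list(w) + [0]      # the marker 0 w 0 w 0, length 2g+3
    return prefix + list(block[2 * g + 3:])

# Example with g = 2, marker block w = (2, 3), and a binary 10-block.
encoded = marker_block_encoder([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], (2, 3))
# encoded == [0, 2, 3, 0, 2, 3, 0, 0, 1, 1]
```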

the encoded process can have no such blocks of length g + 1. L — 1. i = j.5. • n+ +1M = C(Xn ni ± 1M ). Y (i--1)L+1 = Ci(XoL +1). i. F(x)) < 8. then yn (b) If i E Uj E jii .. since p. To show that property (i) holds. and such that y = F(x) satisfies the following two conditions. {C i }. 1 < <2 for xr E (GL) 29 Next apply the block-to-stationary construction of Lemma 1. that are shorter than M will be called gaps.SECTION 111. Both xi and i1 belong to coding blocks of F. each of which changes only the first 2g + 3 places in a k-block. Case 1 occurs with probability at most €72 since the limiting density of time in the gaps is upper bounded by 6/4. then UJEJ /J covers no more than a limiting (6/4)-fraction of Z. THE WAITING-TIME PROBLEM. had no blocks of 2's and 3's of length g. note that if y and 53 are chosen independently. The intervals I. and 2k N < m L — 1. and if y = F(x) and 5". then y. Case 2b.5 to obtain a stationary code F = F c with the property that for almost every x E A z there is a partition. Three cases are now possible. and iL iL g.11 +L-1\ yrn = C1 (4+L-1) . and such that ynn+L-1 = Ci ( X. for every x. i0 j. If Case 2c occurs then Case 2b occurs with probability at most 6/2 since L . m < 1 < m + L — 1. In Case 2. Either k > n + L — 1 or 2kN > Case 2c.8. and hence property (ii) holds.k <n+L — 1. of the integers into subintervals of length no more than M. = F(). while those of length M will be called coding blocks. such that if J consists of those j for which I is shorter than M. Either x1 or i is in a gap of the code F. for 41 (G L) 2g . Case 2a. Case 1. Z = U/1 . = x i . 2kN+2 /E. there will be integers n. Case 2. since y and j3 were independently chosen and there are 2g different kinds of L-blocks each of which is equally likely to be chosen. hence. (a) If = [n + 1. then only two cases can occur. which establishes property (iii). n +Al]. and j such that n < 1 < n + L — 1. since k > (2g +3)/8. in. first note that the only changes To show that F = F made in x to produce y = F(x) are made by the block codes Ci. the changes produce blocks of 2's and 3's of length g which are enclosed between two O's. Case 2a occurs with probability at most 2 -g. since the M-block code C is a concatenation of the L-block codes. 207 M = 2g L and define a code C: Am 1--* A m by setting C(xr) = xr. c has the desired properties. Furthermore. Thus d(x.

. where Go(x) (a) pooxon . since y (n+ 1) = o G n since it was assumed that 6 < 2 -(n+ 1) . Assume F1. imply that which completes the proof of Lemma 111. Az.7 The number E can be made arbitrarily small in the preceding argument. The finish the proof it will be shown that there is a sequence of stationary encoders. then y (n+ 1) will be so close to On ) that condition (a). step n In summary. Let 6 < 2 —(n+1) be a positive number to be specified later. . while condition (a) holds for the case i = n + 1. in y x y probability.5. the entropy of the encoded process y can be forced to be arbitrarily close to log 2. The first step is to apply Lemma 111. . so there will be a limit encoder F and measure y = desired property. 2. with n replaced by n 1. and (c) hold for n = 1. n = 1. (b). the entropy of the coin-tossing process .6 with g = 0 to select k(1) > 3 and F1 such that (a). F„ have been constructed so that (a).. (b).5.208 CHAPTER III.: Az 1-4. This completes the construction of the counterexample. = F o Fn _i o o F1 and measure 01) = o G n x.5.5. Wk (y. for n + 1. Condition (c) is a technical condition which makes it possible to proceed from step n to 1. will hold for i < n 1. it is enough to show the existence of such a sequence of encoders {Fn } and integers {k(n)}.5. Remark 111. replaced by On ) to select k(n to select an encoder Fn+i : A z A z such that for v (n+ 1) = On ) 0 . Fn. that is.6. G < 2ik(i)) < 27i +1 ± 2-i + 2—n+1 ± 2—n 9 1 < < n.. . > 2kN . (x)) almost surely. with GI = F1. by the pairwise k-separation property. 33 ) k(n) co. Condition (a) guarantees that the limit measure y will have the 1 log Wk(n)(Y. the following composition G. F2 . (c) On ) ({2. (b) d (G n (x). In particular. ENTROPY FOR RESTRICI ED CLASSES.u. however. and (c) hold for n. These properties imply that conditions (b) and (c) hold -± 1 1 . finitely many times. the properties (i)-(iii) of Lemma 111. Cases 1. and an increasing sequence {k(n)} such that for the -1 . If 6 is small enough.6 hold. properties hold for each n > 1. Apply Lemma 111. and hence one can make P(xo F(x)o) as small as desired. and 2b therefore Puxv (Wk(Y < 2kN) _< 6/2 2 .. 3} n ) = 0.g + 6/2. ( wk(i)(y. . Condition (b) guarantees that eventually almost surely each coordinate is changed only o F -1 . This will be done by induction. 2a.6 with 1) larger than both 2n + 2 and k(n) and g = n and p.

209 III. (a) S(m) is a disjoint union of a collection {Si (m): j < . The goal is to show that there is an ergodic process it for which (5 ) 1 lim Prob (— log k co k W k (X . Let a be a positive number smaller than 1/2. Such a t can be produced merely by suitably choosing the parameters in the strongnonadmissibility example constructed in 111. The only new fact here is property (c).separated if their full k-block universes are a-separated for any k > K. for any integer N. VN..-indices below the end of each £(m)-block.. see Remark 111. defined by {S(m)} is ergodic. if sample paths sample paths x and y are picked independently at random. J].b. D c A k are a-separated if C n [D]. a) > 2 which proves (6).3. and a decreasing sequence {a m } c (a. (x.5. Two subsets S. (x . First some definitions and results from 111.6. 1/2). . since Nm is unbounded. (b) Each Si (m) is a merging of {Si (m — 1): j (c) 2kni Nin < t(m)/m.strongly . K) .e An approximate-match counterexample.3. km ) strongly-separated sets of the same cardinality./m } of pairwise (am . Two sets C. J] is such that 10 -1 ( = M/J.SECTION 111.5. a) N) = 0.3. inductive application of the merging/separation lemma. This is just the subsequence form of the desired result (5). S' C A f are (a.3. then with probability at least (1 — 1/40 2 (1 — 1/m) 2 . Lemma 111. a) < 2km N ) = 0. M] j E [1. Property (b) guarantees that the measure p. y. and hence property (c) can also be assumed to hold. = 0. Indeed. produces a sequence {S(m) C A €("z ) : in > 1} and an increasing sequence {k m }. THE WAITING-TIME PROBLEM. y. Once km is determined. A weaker subsequence result is discussed first as it illustrates the ideas in simpler form. The full kblock universe of S c A f is the set k-blocks that appear anywhere in any concatentation of members of S. their starting positions x1 and yi will belong to £(m)-blocks that belong to different j (m) and lie at least 2 km N. Prob (W k. however. Thus. ) > (1 — 1/40 2 (1 — 1/rn)2 . y. where [D]„ denotes the a-blowup of D.b will be recalled. S(m) can be replaced by (S(m))n for any positive integer n without disturbing properties (a) and (b).5. such that the following properties hold for each tn. A merging of a collection {Si : j E J} is a product of the form S = ni =1 S Cm) [1. Given an increasing unbounded sequence {Nm }. while property (c) implies that (6) lim Prob (Wk„. for each where J divides M and 0: [1.

The stronger form, (5), however, controls what happens in the entire interval k_m ≤ k < k_{m+1}, not just along the subsequence {k_m}. As in the preceding discussion, further independent cutting and stacking can always be applied at any stage without losing the separation properties already gained, and hence, much as in the constructions discussed in III.3.b, the column structures at any stage can be made so long that bad waiting-time behavior is guaranteed for the entire range k_m ≤ k < k_{m+1}. This leads to an example satisfying the stronger result, (5). The reader is referred to [76] for a complete discussion of this final argument.

Chapter IV

B-processes.

The focus in this chapter is on B-processes, that is, stationary, finite-alphabet processes that are stationary codings of i.i.d. processes. Various other names have been used, for example, almost block-independent processes, finitely determined processes, and very weak Bernoulli processes, each arising from a different characterization. This terminology and many of the ideas to be discussed here are rooted in Ornstein's fundamental work on the much harder problem of characterizing the invertible stationary codings of i.i.d. processes, the so-called isomorphism problem in ergodic theory, [46]. Invertibility is a basic concept in the abstract study of transformations, but is of little interest in stationary process theory, where the focus is on the joint distributions rather than on the particular space on which the random variables are defined. This is fortunate, for while the theory of invertible stationary codings of i.i.d. processes is still a complex theory, it becomes considerably simpler when the invertibility requirement is dropped.

Section IV.1 Almost block-independence.

A natural and useful characterization of stationary codings of i.i.d. processes, the almost block-independence property, will be discussed in this first section. The almost block-independence property, like other characterizations of B-processes, is expressed in terms of the d̄-metric (or some equivalent metric.) As in earlier chapters, either measure or random variable notation will be used for the d̄-distance, that is, d̄_n(X_1^n, Y_1^n) will often be used in place of d̄_n(μ, ν), if μ is the distribution of X_1^n and ν the distribution of Y_1^n.

A block-independent process is formed by extending a measure μ_n on A^n to a product measure on (A^n)^∞, then transporting this to a T^n-invariant measure μ̄ on A^∞, that is, μ̄ is the measure on A^∞ defined by the formula

    μ̄(x_1^{mn}) = ∏_{j=1}^{m} μ_n( x_{(j-1)n+1}^{(j-1)n+n} ),    x_1^{mn} ∈ A^{mn}, m ≥ 1,

together with the requirement that these values be consistent as m varies. Note that μ̄ is T^n-invariant, though it is not, in general, stationary; randomizing the start produces a stationary process, called the concatenated-block process defined by μ_n.
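To keep the constructions below concrete, the randomized-start step can be written out explicitly. The following display is a sketch in the notation of this chapter, consistent with the concatenated-block construction used in earlier chapters rather than a formula quoted from them.

```latex
% Stationary (concatenated-block) process obtained from the
% T^n-invariant block-independent measure \bar{\mu} by averaging
% over the n possible starting phases.
\nu(B) \;=\; \frac{1}{n}\sum_{i=0}^{n-1}\bar{\mu}\!\left(T^{-i}B\right),
\qquad B\subset A^{\infty}\ \text{measurable}.
```

Since μ̄ ∘ T^{-n} = μ̄, shifting the average by one index leaves it unchanged, so ν is T-invariant.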

The independent n-blocking of an ergodic process μ is the block-independent process μ̄ defined by the restriction μ_n of μ to A^n. It will be denoted by μ̄(n). In random variable language, the independent n-blocking of {X_i} is the T^n-invariant process {Y_i} defined by the following two conditions.

(a) Y_{(j-1)n+1}^{(j-1)n+n} and X_1^n have the same distribution, for each j ≥ 1.

(b) Y_{(j-1)n+1}^{(j-1)n+n} is independent of {Y_i : i ≤ (j-1)n}, for each j ≥ 1.

An ergodic process μ is almost block-independent (ABI) if given ε > 0, there is an N such that if n ≥ N and μ̄(n) is the independent n-blocking of μ, then d̄(μ, μ̄(n)) < ε. The theory to be developed in this chapter could be stated in terms of approximation by concatenated-block processes, but it is generally easier to use the simpler block-independence ideas, as will be done here.

The almost block-independence property is preserved under stationary coding and, in fact, characterizes the class of stationary codings of i.i.d. processes.

Theorem IV.1.1 (The almost block-independence theorem.) An ergodic process is almost block-independent if and only if it is a stationary coding of an i.i.d. process.

Theorem IV.1.2 A mixing Markov chain is almost block-independent.

Note, in particular, that the two theorems together imply that a mixing Markov chain is a stationary coding of an i.i.d. process, which is not at all obvious.

An i.i.d. process is clearly almost block-independent, since the ABI condition holds for every n ≥ 1 and every ε > 0. The fact that stationary codings of i.i.d. processes are almost block-independent is established by first proving it for finite codings, then by showing that the ABI property is preserved under the passage to d̄-limits. Both of these are quite easy to prove. It is not as easy to show that an almost block-independent process μ is a stationary coding of an i.i.d. process, for this requires the construction of a stationary coding from some i.i.d. process ν onto the given almost block-independent process μ. This will be carried out by showing how to code a suitable i.i.d. process onto a process d̄-close to μ, then how to make a small density of changes in the code to produce a process even closer in d̄. The fact that only a small density of changes is needed insures that an iteration of the method produces a limit coding equal to μ. Of course, all this requires that h(ν) ≥ h(μ), since stationary coding cannot increase entropy. In fact it is possible to show that an almost block-independent process μ is a stationary coding of any i.i.d. process ν for which h(ν) ≥ h(μ). The construction is much simpler, however, if it is assumed that the i.i.d. process has infinite alphabet with continuous distribution, for then any n-block code can be represented as a function of the first n coordinates of the process and d̄-joinings can be used to modify such codes; this is sufficient for the purposes of this book. This result and the fact that mixing Markov chains are almost block-independent are the principal results of this section.
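Conditions (a) and (b) are easy to realize in simulation. The following sketch, an illustration only, generates a sample path of the independent n-blocking from any routine that draws an n-block with distribution μ_n; the two-state Markov sampler and its parameters are made-up stand-ins, not constructions from the text.

```python
import random

def markov_n_block(n, P=((0.9, 0.1), (0.2, 0.8)), start=(2/3, 1/3)):
    """Illustrative sampler for mu_n: an n-block of a two-state Markov chain,
    started from the stationary distribution of the (made-up) matrix P."""
    x = random.choices((0, 1), weights=start)[0]
    block = [x]
    for _ in range(n - 1):
        x = random.choices((0, 1), weights=P[x])[0]
        block.append(x)
    return block

def independent_n_blocking(sample_n_block, n, num_blocks, randomize_start=False):
    """Concatenate independently drawn n-blocks (conditions (a) and (b));
    randomizing the start gives the stationary concatenated-block process."""
    path = []
    for _ in range(num_blocks):
        path.extend(sample_n_block(n))       # each block has distribution mu_n
    if randomize_start:                      # drop a uniformly chosen prefix of length < n
        path = path[random.randrange(n):]
    return path

y = independent_n_blocking(markov_n_block, n=8, num_blocks=1000, randomize_start=True)
```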

> 0 (1) d .L) < 6 / .d.i. for all n.i.. ALMOST BLOCK-INDEPENDENCE.d.d. 213 IV. so that property (f) of the d-property list. This completes the proof that B-processes are almost block-independent. processes and d-limits of ABI processes are almost block-independent. Lii . 1. process {X i } with window half-width w.i. Next assume p. it follows that finite codings of i. The successive n — 2w blocks 7n—w-1 'w+1 7 mn —w-1 " • '(ni-1)n+w+1 n+ 1 -1 .} be the independent n-blocking of { Yi } .d. since a stationary coding is a d-limit of finite codings. Furthermore. by the definition of independent nare independent with the distribution of Yw blocking. hence such coded processes are also almost block-independent. This is because blocks separated by twice the window half-width are independent.1. then mnci„. is the d-limit of almost block-independent processes. and hence.11. finite codings of i.(Zr. < E/ 3. The triangle inequality then gives This proves that the class of ABI processes is d-closed.11.414/ by property (c) of the d-property list. If the first and last w terms of each n-block are omitted. Yr) is upper bounded by m(n _ 2040 _ 2 w) w+1 vn—w-1 Tnn—w-1 '(n1-1)n+w+1 . Lemma 1. First it will be shown that finite codings of i. ) Since this is less than c. processes are almost blockindependent. < c/3.d. by (1).SECTION IV 1. < c. Given E choose an almost block-independent process y such that d(p.d.a B-processes are ABI processes.i.d.i. the preceding argument applies to stationary codings of infinite alphabet i.i. = anal FL) = cin(y.. Let {Y. Since y is almost block-independent there is an N such that if n > N and 7/ is the independent n-blocking of y then d(y.} be a finite coding the i. and let {Z.d. Fix such an n and let FL be the independent n-blocking of A.9. processes onto finite alphabet processes. Lemma 1. yields yn—w—1 v mn —w-1 n n— +T -1 7 mn —w-1 am(n_210 (z w w-1-1 (m-1)n+w-1-1 ) Thus dnm (Z. w+1 ymn—w--1 (m-1)n±w±1 ) 4. when thought of as An-valued processes implies that d( 17.i. processes are almost blockindependent. for any n> wl€. In summary.9. while the successive n — 2w blocks vn—w-1 4 w+1 y mn —w-1 (m-1)n±w±1 +r -I by the definition of window halfare independent with the distribution of rr width and the assumption that {X i } is i.. y) < c/3.i. processes are almost block-independent. The fact that both and are i. hence stationary codings of i.

where V(0)0 is uniform on [0.i. to obtain a sequence of stationary codes {Ft } that satisfy the desired properties (a) and (b).. V): G(z)o F (z. will be used in the sequel.} with Kolmogorov measure A. 1. V (1) i .d. for t > 1. 1) and.d. process {Z. and hence any i. } is continuously distributed if Prob(Zo = b) = 0. there will almost surely exist a limit code F(z)1 = lime Ft (z) i which is stationary and for which cl(A. IV. = 2 -t. ..d. The lemma can then be applied sequentially with Z (t) i = (V(0) i . process is convenient at a given stage of the construction. This fact.d. V(2)1 .214 CHAPTER IV. since any random variable with continuous distribution is a function of any other random variable with continuous distribution. The idea of the proof of Lemma IV. is almost block-independent and G is a finite stationary coding of a continuously distributed i. i.d. (a)a(p.d. The lemma immediately yields the desired result for one can start with the vector valued.. B-PROCESSES.o G -1 ) <E. countable component. which allows one to choose whatever continuously distributed i.i. that is. and if is a binary equiprobable i.) If a(p. The goal is to show that p. process (2) {(V(0). process {Z.i.d.d. o Fr-1 ) > 0. of any other i. Therefore. for all alphabet symbols b.3 is to first truncate G to a block code. for it means that. process (2) onto the almost block-independent process A. (A x o F -1 ) < 8.i. each coordinate Fe(z)i changes only finitely often as t oc.1. process with continuous distribution is a stationary coding.. An i.i. as t — * oc. which allows the creation of arbitrarily better codes by making small changes in a good code.1. A block code version of the lemma is then used to change to an arbitrarily better block code. The key to the construction is the following lemma..i. process with continuous distribution. almost surely.b ABI processes are B-processes. A = X o F -1 . (b)Er Prob((Ft+ i (z))o (Fr(z))o) — < oc. is a stationary coding of an infinite alphabet i. Let {Xi } be an almost block-independent process with finite alphabet A and Kolmogorov measure A.i. A. process with continuous distribution.i. V(t — 1) i ). A o F -1 ) = 0. (ii) (A x v) ({(z.i.i. It will be shown that there is a continuously distributed i. producing thereby a limit coding F from the vector-valued i. The exact representation of Vi l is unimportant.5 > 0 there is a finite stationary coding F of {(Z i .d. where p.)} with independent components. and Er = 2-t+1 and 3. process independent of {Z. together with an arbitrary initial coding F0 of { V(0)i }. (i) d(p. V(t) i . then given . .3 (Stationary code improvement.}. V(1) i . Vi )} such that the following two properties hold. process Vil with Kolmogorov measure A for which there is a sequence of stationary codes {Ft } such that the following hold. v)o}) < 4E. Lemma W. V(t)0 is binary and equiprobable. The second condition is important.d. . with window width 1.
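The basic object being manipulated here, a finite stationary (sliding-window) code applied to an i.i.d. input, is simple to write down. The sketch below is only an illustration; the majority-vote window function and the binary input are invented for the example and are not codes appearing in the text.

```python
import random

def finite_stationary_code(x, w, f):
    """Apply a finite stationary code with window half-width w:
    output_i = f(x_{i-w}, ..., x_{i+w}).  End effects are ignored for simplicity."""
    return [f(x[i - w: i + w + 1]) for i in range(w, len(x) - w)]

def majority(window):
    # Illustrative window function: majority vote over the window.
    return int(sum(window) > len(window) // 2)

x = [random.randint(0, 1) for _ in range(10_000)]   # i.i.d. input sequence
y = finite_stationary_code(x, w=2, f=majority)

# Output blocks whose index sets are separated by more than 2w depend on
# disjoint stretches of the input, hence are independent; this is the fact
# exploited in the block-independence estimates of this section.
```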

see Exercise 12 in Section 1. 215 Finally.9. while retaining the property that the blocks occur independently. G(u7)) < 8. Lemma IV. using the auxiliary process {Vi } to tell when to apply the block code. for v-almost every sample path u.i. n)-independent extension of a measure pt. Ri ): i < 0 } .d.) Let F be a finite stationary coding which carries the stationary process v onto the stationary process 71. with associated (8. . so that Vd. The almost block-independence property can be strengthened to allow gaps between the blocks. Ri)} will be called a (S. Given S > 0 there is an N such that if n > N then there is an n-block code G. such that d. and b and c are fixed words of length w. called the (S. n)-blocking process {Ri } .b. Proof Let w be the window half-width of F and let N = 2w/8. a joining has the form X(a'i' . The i.1. Indeed. if the waiting time between l's is never more than n and is exactly equal to n with probability at least 1 — 3. Proof This is really just the mapping form of the idea that d-joinings are created by cutting up nonatomic spaces. where u is any sample path for which û7 = /47. 4 E A'.8. n)-truncation of F. n)-independent process and Prob(14f.4 (Code truncation. A block R:+n-1 of length n will be called a coding block if Ri = 0. y) be a nonatomic probability space and suppose *: Y 1-± A n is a measurable mapping such that V 0 —1 ) < 8. given that is a coding block. since the only disagreements can occur in the first or last w places.1. n)blocking process.1 = On -1 1.(a7). An ergodic pair process {(Wi . A finite stationary code is truncated to a block code by using it except when within a window half-width of the end of a block. for v-almost every sample path u. The following terminology will be used. b7) = v(0-1 (4) n (b7)) for 4): Y H A n such that A n = vo V I . for i< j<i+n-1and Ri+n _1 = 1. Then there is a measurable mapping 4): Y 1-+ A n such that ji = y o 0-1 and such that Ev(dn((k (Y). This is well-defined since F(ii)7+ 1 depends only on the values of û. The block code version of the stationary code improvement lemma is stated as the following lemma. (Y))). n)-blocking process and if 147Ç' is independent of {(Wi . 1.5 (Block code improvement. n)-independent process if {Ri } is a (8.) Let (Y. an extension which will be shown to hold for ABI processes.(F (u)7.SECTION IV. that is. the better block code is converted to a stationary code by a modification of the block-to-stationary method of I. if {(147„ Ri )} is a (S. (4 . ergodic process {Ri } will be called a (S. on A'.? +n--. nature of the auxiliary process (Vi) and its independence of the {Z1 } process guarantee that the final coded process satisfies an extension of almost block-independence in which occasional spacing between blocks is allowed.G(u7)) < S. For n > N define G(u7) = bF(ii)7± 1c. (y))) < 8. The following lemma supplies the desired stronger form of the ABI property. ALMOST BLOCK-INDEPENDENCE. b7)) = E v(dn (0 (Y).1. Clearly d. A binary. An ergodic process {Wi } is called a (8.(F(u)7. Lemma IV.1 = diliq is a coding block) = .u.

n)-independent extension {W} of An . here is the idea of the proof. be . independent of {Z. (3) Put N2 = M1+ 2N1 and fix n > N2. say i. { Proof of Lemma IV.} a binary.11.i. Since y is not stationary it need not be true that clni (it„„ um ) < E/3. Given E > 0 there is an N and a 6 > 0 such that if n > N then a(p.AN Ni1+1 has the same distribution as X' N1 +i . Lemma 1.} be an almost block-independent process with Kolmogorov measure t.1. yields the desired result. It is enough to show how an independent NI -blocking that is close to p. n)independent extension 11.} are now joined by using the joining that yields (4) on the blocks ftk + 1. n)-independent extension. The goal is to show that if M1 is large enough and 3 small enough then ii(y. B-PROCESSES. completing the proof of Lemma IV. n)-blocking process {R. for almost every realization {r k }. provided only that 6 is small enough and n is large enough. } . Towards this end. Proof In outline. then. The processes Y. with associated (6.i1. equiprobable i.6 (Independent-extension. skNi < Note that m = skNI — tkNi is at least as large as MI. and let À x y denote the Kolmogorov measure of the direct product {(Z„ V. the starting position. Let p.6. hence all that is required is that 2N1 be negligible relative to n. y) < 6/3. < 2E/3.216 CHAPTER IV.1.1. and that i7 V -7. but at least there will be an M1 such that dni (/u ni . For each k for which nk +i — n k = n choose the least integer tk and greatest integer sk such that tk Ni > nk . An application of property (g) in the list of dproperties.d. let {W} denote the W-process conditioned on a realization { ri } of the associated (6. for any (8. with respective Kolmogorov measures À and y.d.} and {W.3.. process. The obstacle to a good d-fit that must be overcome is the possible nonsynchronization between n-blocks and multiples of NI -blocks. If M i is large enough so that tk NI — nk nk+1 tIcNI is negligible compared to M1 and if 8 is small enough. sk NI L and an arbitrary joining on the other parts.} will be less than 2E13.i. can be well-approximated by a (6. but i j will be for some j < NI. first choose N1 so large that a(p. so that (3) yields t NI (4) ytS:N IV I I± Wits:N A: 1+ ) < € /3. Lemma IV.}. Let 6 be a positive number to be specified later and let be the Kolmogorov measure of a (8. n)-blocking process (R 1 )..)}.) < c. the ci-distance between the processes {K} and {W.} be a continuously distributed i.) Let {X. This is solved by noting that conditioned on starting at a coding n-block. where y is the Kolmogorov measure of the independent N I -blocking IN of pt. Ft) < c. in > Mi. Let {Z. Let {n k } be the successive indices of the locations of the l's in { ri}. vm ) < 13. of an n-block may not be an exact multiple of NI . < 2E/3. of A n .9. ii(v. Thus after omitting at most N1 places at the beginning and end of the n-block it becomes synchronized with the independent Ni -blocking. process and {V. and hence cl(p.. To make this outline precise. for all ni. and that 8 be so small that very little is outside coding blocks.

if g(v) = r.)} such that the following hold. r)i = a. F(u. Furthermore. In other words. In particular. Yk n — 1]. v): G(z)0 F(z. Clearly F maps onto a n)-independent extension of . a(4 +n-i„ )) < E. and hence the proof that an ABI process is a stationary coding of an i. and coding to some fixed letter whenever this does not occur. fix a E A.i. let {yk } denote the (random) sequence of indices that mark the starting positions of the successive coding blocks in a (random) realization r of {R. n)-blocking process {Ri } and lift this to the stationary coding g* of {(Z.i. and define F(z. vgk +n-1 = 0 (Wk). X o Fix 3 > O. x y) o <8. almost surely. Since. and hence the process defined by Wk = Z Yk+n-1' i i .d. the dn(G(Z)kk+n-1 limiting density of indices i for which i V Uk[Yk. g(v)). 0(Z))) 26. Towards this end. Y)o}) <2e E< 4e. E(d. The goal is to construct a finite stationary coding F of {(Z.d. first choose n and S < e so that if is a (S. first note that {Zi } and {Ri } are independent.SECTION IV 1.(Z). 0(Z7)) I R is a coding block) so the ergodic theorem applied to {Wk }.)•••d(wk))< 2c. hence property (i) holds. (X. } . process. along with the fitting bound. 217 an almost block-independent process with alphabet A. r) be the (stationary) code defined by applying 0 whenever the waiting time between 1 's is exactly n. however. V. and so that there is an (c..u. . (i) d(p. n)-independent extension of i then d(p. Xn and hence the block code improvement lemma provides an n-block code 0 such that = o 0 -1 and E(dn (à . Ø(Z7))) = E(dn (d (Z 11 ). implies that the concatenations satisfy (6) k—>oo lirn dnk(O(Wi) • • • 0(Wk). y) 0 }) < 4e. (ii) (X x y) (((z.. this means that < 2E.. coding such that d(p. thereby establishing property (ii). y) = (z.) < 8. by the ergodic theorem applied to {R. F(Z.}. i fzi Uk[Yk. (5) Next. and . (5). and let G be a finite stationary < E. define +n-1 ). This completes the proof of the lemma. let g be a finite stationary coding of {V. ALMOST BLOCK-INDEPENDENCE. with Wo distributed as Z. r Ykk+" = dp(zYA Yk ) for all k.)} onto {(Z„ Ri )} defined by g*(z. Let F(z. yk + n — 1] is almost surely at most & it follows from (6) that (X x y) ({(z.z (d(Z11 ). V. p e-. By definition. n)-truncation G of G. a(w.} onto a (S. The see why property (ii) holds. v): G(z)0 F (z.

Remark IV.1.7 The existence of block codes with a given distribution, as well as the independence of the blocking processes used to convert block codes to stationary codes onto independent extensions, are easy to obtain for i.i.d. processes with continuous distribution. With considerably more effort, suitable approximate forms of these ideas can be obtained for any finite-alphabet i.i.d. process whose entropy is at least that of μ. These results, which are the basis for Ornstein's isomorphism theory, are discussed in detail in his book, [46].

IV.1.c Mixing Markov and almost block-independence.

The key to the fact that mixing Markov implies ABI, as well as several later results about mixing Markov chains, is a simple form of coupling, a special type of joining frequently used in probability theory, see [35]; see also [63, 42]. A joining {(X_n, Y_n)} is obtained by running the two processes independently until they agree, then running them together. The Markov property guarantees that this is indeed a joining of the two processes, with the very strong additional property that a sample path is always paired with a sample path which agrees with it ever after as soon as they agree once. In particular, this guarantees that n d̄_n(X_1^n, Y_1^n) cannot exceed the expected time until the two processes agree, so that d̄_n(μ, ν) indeed goes to 0 as n → ∞. The precise formulation of the coupling idea needed here follows.

Let {X_n} be a (stationary) mixing Markov process and let {Y_n} be the nonstationary Markov chain with the same transition matrix as {X_n}, but which is conditioned to start with Y_0 = a. Let μ and ν be the Kolmogorov measures of {X_n} and {Y_n}, respectively. For each n ≥ 1, the coupling function is defined by the formula

    w_n(a_1^n, b_1^n) = min{ i ∈ [1, n-1]: a_i = b_i },  if { i ∈ [1, n-1]: a_i = b_i } ≠ ∅,

and w_n(a_1^n, b_1^n) = n, otherwise. The coupling measure is the measure λ_n on A^n × A^n defined by

(7)    λ_n(a_1^n, b_1^n) = ν(a_1^w) μ(b_1^n),  if w_n(a_1^n, b_1^n) = w < n and a_{w+1}^n = b_{w+1}^n,
       λ_n(a_1^n, b_1^n) = ν(a_1^n) μ(b_1^n),  if w_n(a_1^n, b_1^n) = n,
       λ_n(a_1^n, b_1^n) = 0,                  otherwise.

A direct calculation, making use of the Markov properties for both measures, shows that λ_n is a probability measure with ν_n and μ_n as marginals, and is concentrated on sequences that agree ever after, once they first agree.

Lemma IV.1.8 (The Markov coupling lemma.) The coupling function satisfies the bound

(8)    d̄_n(μ, ν) ≤ E_{λ_n}(w_n) / n.

Furthermore, the expected value of the coupling function is bounded, independent of n and a; in particular, d̄_n(μ, ν) → 0 as n → ∞.

Proof The inequality (8) follows from the fact that the λ_n defined by (7) is a joining of μ_n and ν_n, and is concentrated on sequences that agree ever after, so that a pair of sequences carrying positive λ_n-measure can disagree in at most w_n places. This establishes the lemma.
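The coupling bound (8) is easy to explore numerically. The sketch below is an illustration only; the two-state transition matrix and its stationary distribution are made-up parameters, not data from the text. It runs the stationary chain and the chain started at a fixed state independently until they first agree, and reports the average agreement time, which by (8) bounds n d̄_n(μ, ν).

```python
import random

P = ((0.9, 0.1), (0.2, 0.8))   # illustrative mixing transition matrix
PI = (2/3, 1/3)                # its stationary distribution

def step(state):
    return random.choices((0, 1), weights=P[state])[0]

def coupling_time(a, n):
    """Run X (stationary start) and Y (started at Y_0 = a) independently
    and return the first i in [1, n-1] with X_i = Y_i, or n if none."""
    x = random.choices((0, 1), weights=PI)[0]
    y = a
    for i in range(1, n):
        x, y = step(x), step(y)
        if x == y:
            return i          # from here on the two paths can be run together
    return n

n, trials = 200, 5000
avg_w = sum(coupling_time(a=0, n=n) for _ in range(trials)) / trials
print("E(w_n)/n, an upper bound for dbar_n(mu, nu):", avg_w / n)
```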

of course./x.pnn(XT n einn ) = ii(4)11. let À nin be the measure on A"' x A mn defined by m-1 (10) A. given Xk n = xk n .n± ±n 1 /xkn) < for all sufficiently large n. ALMOST BLOCK-INDEPENDENCE. n)-independent extension which is a function of a mixing Markov chain. and the distribution of X k n +7 conditioned on Xk n is d-close to the unconditioned distribution of Xn i . a joining Ain „ of A m . j < kn. u is a joining of A„. n)-independent extension ri which is mixing Markov of some order. with ii mn will be constructed such that Ex. n)-independent extension which is actually a mixing Markov chain. provided only that n is large enough. with Kolmogorov measure i. by the coupling lemma.11. because the distribution of Xk k . This is easy to accomplish by marking long . the idea to construct a joining by matching successive n-blocks.9 An almost block-independent process is the d-limit of mixing Markov processes. } plus the fact that Zk k n +7 is independent of Z j . given Xk n = Xkn • kn "± ditioned distribution of XI and the conditional distribution of Xk The coupling lemma implies that (9) d. not just a function of a mixing Markov chain. Section 1.. } with Kolmogorov measure and fix E > 0. a measure it on A" has a (8. a joining A. Zr)) < E. in fact. exactly the almost mixing Markov processes.„. To construct Xmn. Likewise. A suitable random insertion of spacers between the blocks produces a (8.n (dmn(xr. Fix n for which (9) holds and let {Zi } be the independent n-blocking of {Xi } . conditioned on the past depends only on Xk n . is to produce a (8. and hence that {X i } is almost blockindependent. see Example 1. The concatenated-block process defined by pt is a function of an irreducible Markov chain. z7)) < which. For each positive integer m. proves that d(p. summing over the successive n-blocks and using the bound (9) yields the bound Ex„.SECTION IV 1. Since the proof extends to Markov chains of arbitrary order.6. first choose. for each xbi . conditioned on the past. The notation cin (X k n +7.„ with iimu . (dnin (finn .1. 1:1 The almost block-independent processes are. < c. Fix a mixing Markov chain {X.(x k n ±n .1.2. Proof By the independent-extension lemma. 219 To prove that a mixing Markov chain is d-close to its independent n-blocking. For each positive integer m. of the unconditioned kn "Vil and the conditional distribution of Xk kn + 7. Lemma IV.) The only problem. in the sense that Theorem IV.. This will produce the desired dcloseness. that distribution of Xk realizes the du -distance between them. (See Exercise 13. it is enough to show that for any 8 > 0.„.(Z?) k=1 Jr A1 zkknn++7)• A direct calculation. x k . by the Markov property.. this completes the proof that mixing Markov chains are almost block-independent. X k n +7/xk n ) will be used to denote the du -distance between the unconI. for all large enough n. therefore. . using the Markov property of {X. shows that A. 1.T.

it is mixing. define f(sxrn. . n (dnzn Winn zr)) E.220 CHAPTER IV. n)-independent extensions of An and fi n .„. such that 4. 3. where y and D are the Kolmogorov measures of {W} spectively.10 The use of a marker symbol not in the original alphabet in the above construction was only to simplify the proof that the final process {Z1 } is Markov. 1. if 1 < i < mn + 1. concatenations of blocks with a symbol not in the alphabet.11 The proof that the B-processes are exactly the almost block-independent processes is based on the original proof. • w(m)) = i=1 Fix 0 < p < 1. = {1T 7i } . i) to be x. and let ft be the measure on sCrn defined by the formula ft(sw( 1 ) . For each positive integer m. .9. il To describe the preceding in a rigorous manner. Furthermore.. rean(An. } . n with Ti Ex.d Exercises. 2. The proof that ABI processes are d-limits of mixing Markov processes is new. Remark W. A proof that does not require alphabet expansion is outlined in Exercise 6c. Show that the defined by (10) is indeed a joining of . n)independent extension of A. As a function of the mixing Markov chain {ri }. provided m is large enough and p is small enough (so that only a 6-fraction of indices are spacers s. Remark IV. then randomly inserting an occasional repeat of this symbol. Show that (7) defines a joining of A n and yn . (b) (sxr . i 1). Next. below. 0) with probability pii(syr). let sCm denote the set of all concatenations of the form s w(1) • • • w(m). An).1. nm 1} and transition matrix defined by the rules (a) If i < nm 1. 1. where w(i) E C.1. then d(v. irreducible. respectively. (Z1 ) is clearly a (6. to be specified later.1. and s otherwise. and let { ri } be the stationary.) This completes the proof of Theorem IV. independent of n. 1) with probability (1 — p). 1. i) can only go to (sxr".u. for each syr E SC m . Show the expected value of the coupling function is bounded. and to (syr". since once the marker s is located in the past all future probabilities are known. aperiodic Markov chain with state space se" x {0. n)-blocking process {R. The coupling proof that mixing Markov chains are almost block-independent is modeled after the proof of a conditional form given in [41]. (sxr . } and {W} are (6. let C c A' consist of those ar for which it(4) > 0 and let s be a symbol not in the alphabet A. B-PROCESSES. The process {Zi } defined by Zi = f(Y 1 ) is easily seen to be (mn + 2)-order Markov. Show that if { 1/17. nm +1) goes to (sy'llin. with the same associated (6. IV.a(syr"). [66].

5. Let {X_n} be binary i.i.d., and define Y_n to be 1 if X_n ≠ X_{n-1}, and 0 otherwise. Show that {Y_n} is Markov, mixing, and not i.i.d., unless X_n = 0 with probability 1/2. (Thus some special codings of i.i.d. processes are Markov.)

6. Let μ be the Kolmogorov measure of an ABI process for which 0 < μ_1(a) < 1, for some a ∈ A.
(a) Show that μ must be mixing.
(b) Show that μ(a^n) → 0 as n → ∞, where a^n denotes the concatenation of a with itself n times.
(c) Prove Theorem IV.1.9 without extending the alphabet. (Hint: if n is large enough then, by altering probabilities slightly, it can be assumed that μ(a^n) = 0 for some a; the sequence ba^{2n}b can then be used in place of s to mark the beginning of a long concatenation of n-blocks. Use Exercise 6b and Exercise 4.)

7. Show that even if the marker s belongs to the original alphabet A, the process {Z_i} defined in the proof of Theorem IV.1.9 is a (δ, n)-independent extension of μ_n. (Hint: use Exercise 4.)

Section IV.2 The finitely determined property.

The most important property of a stationary coding of an i.i.d. process is the finitely determined property, for it allows d̄-approximation of the process merely by approximating its entropy and a suitable finite joint distribution. A stationary process μ is finitely determined (FD) if any sequence of stationary processes which converges to it weakly and in entropy also converges to it in d̄. In other words, a stationary μ is finitely determined if, given ε > 0, there is a δ > 0 and a positive integer k, such that any stationary process ν which satisfies the two conditions,

(i)  |μ_k − ν_k| < δ,
(ii) |H(μ) − H(ν)| < δ,

must also satisfy

(iii) d̄(μ, ν) < ε.

Some standard constructions produce sequences of processes converging weakly and in entropy to a given process. Properties of such approximating processes that are preserved under d̄-limits are automatically inherited by any finitely determined limit process. This principle is used in the following theorem to establish ergodicity and almost block-independence for finitely determined processes; in particular, since almost block-independent processes are stationary codings of i.i.d. processes, a finitely determined process is a stationary coding of an i.i.d. process.

Theorem IV.2.1 A finitely determined process is ergodic, mixing, and almost block-independent.

Proof Let μ^(n) be the concatenated-block process defined by the restriction μ_n of a stationary measure μ to A^n. If k is fixed and n is large relative to k, then the distribution of k-blocks in n-blocks is almost the same as the μ_k-distribution, the only error in this

and furthermore.2. y) on Ak defined by )= 1 k+it k i=ouE. This completes the proof of Theorem IV. let ii (n ) be a (8n . see Exercise 5. for any k < n.222 CHAPTER IV. (n) is ergodic and since ergodicity is preserved by passage to d-limits. as n weakly and in entropy so that {g (n ) } converges to 1-1(u n )/n which converges to H(/)). is small enough then /2 00 and ii (n) have almost the same n-block distribution. and since a-limits of ABI processes are ABI. v)I < 6 and IH(pn ) — H(v)I < n6 must also satisfy cin (p. a finitely determined process must be mixing. Xn i 1 starting positions the expected value of the empirical k-block distribution with respect to v. then (1) 41 _< 2(k — 1)/n.v) < E. Thus. . A direct calculation shows that 0(aiic ) = E Pk(a14)Y(4) = Ep(Pk(4IX).1. approximation being the end effect caused by the fact that the final k symbols of the oo. Concatenated-block processes can be modified by randomly inserting spacers between the blocks. if p.tvE. hence in if is assumed to be finitely determined. to A. that is. Proof One direction is easy for if I)„ is the projection onto An of a stationary measure I). the v-distribution of k-blocks in n-blocks is the measure = 0(k. hence almost the same k-block distribution for any k < n. Note also that if i3 is the concatenated-block process defined by y. process is finitely determined is much deeper and will be established later. and hence almost block-independent. (n ) have almost the same entropy.1.d.1. For each n.6. y) —1- since the only difference between the two measures is that 14 includes in its average the ways k-blocks can overlap two n-blocks. B-PROCESSES. Lemma IV.i. then p.2 (The FD property: finite form. must be finitely determined. If y is a measure on An.2.n. a finitely determined process must be almost block-independent. Since ii(n ) can be chosen to be a function of a mixing Markov chain. it follows that a finitely determined process is ergodic. v„) = vk . {P ) } will converge weakly and in entropy to /I . a The converse result. . Also useful in applications is a form of the finitely determined property which refers only to finite distributions. Thus if 3n goes to 0 suitably fast. 10(k. satisfies the finite form of the FD property stated in the theorem. then 0(k.a (n ) and p. and k < n. Theorem IV. Also. and (11 n)H (v n ) H (v).t (n ) is equal n-block have to be ignored. EE V (14 a k i V).) A stationary process p. Thus 147 ) gk .) If 8. there is a > 0 and positive integers k and N such that if n > N then any measure v on An for which Ip k — 0(k. that a stationary coding of an i. is finitely determined if and only if given c > 0. Section IV. the average of the y-probability of a i` over the first n — k in sequences of length n. The entropy of i. since functions of mixing processes are mixing and d-limits of mixing processes are mixing. Since each 11. n)-independent extension of A n (such processes were defined as part of the independent-extension lemma.

given Y. Finally.2. an approximate independence concept will be used.d.1.2. < 6/2. so Exercise 2 implies that 4. This proof will then be extended to establish that mixing Markov processes are finitely determined. This completes the proof of Theorem IV. it will be shown that the class of finitely determined processes is closed in the d-metric. is finitely determined and let c be a given positive number.4)1 <n6/2 and let Tf be the concatenated-block process defined by v. A finitely determined process is mixing.i. that is. it can also be supposed that if n > N and if 17 is the concatenated-block process defined by a measure y on An.a. y) — p(x)p(y)I < e. process is finitely determined. if independence is assumed. then X and Y are independent.i. An approximate form of this is stated as the following lemma.d.1 I. does not depend on ril.d. y) = px. and hence the method works. Let p(x.17) < E.. Fix n > N and let y be a measure on An for which 10(k.2. in turn. and let p(x) = px (x) = E. Thus. To make this sketch into a real proof.. The dn -fi of the distributions of X'il and yr by fitting the condiidea is to extend a good tting tional distribution of X n+ 1. v) I +10(k. since H(t)) = (1/ n)H (v). CI IV. The conditional distribution of Xn+ i. X.9 of the preceding section. THE FINITELY DETERMINED PROPERTY. The random variables are said to be c-independent if E Ip(x. hence is totally ergodic. It is easy to get started for closeness in first-order distribution means that first-order distributions can be dl-well-fitted.y(x. Furthermore. Y) of finite-valued random variables. By making N Choose N so large that if n > N then 1(1/01-/(an ) — enough larger relative to k to take care of end effects. using (1).2.k1 < 3/2 and 1H (v) — H(1. The definition of finitely determined yields a 6 > 0 and a k. p(x. is almost independent of Yçi. . y) = Prob(X = x. provided closeness in distribution and entropy implies that the conditional distribution of 17. (A n . given X. within 6/2 of H(p.0(k. given Yin. y) denote the marginal distributions. y) —1 <8/2.)) — H COI < 6. the definitions of k and 6 imply that a (/urY-) < c.a ABI implies FD. v) — ii.1. The i. then 1. First it will be shown that an i. given Xi'. The proof that almost block-independence implies finitely determined will be given in three stages. y) < ii(p.2. 223 To prove the converse assume ti.2. y) — oak' < 6. proof will be carried out by induction. p.). These results immediately show that ABI implies FD.SECTION IV. by Theorem IV.i. y If H(X) = H(X1Y). for an ABI process is the d-limit of mixing Markov processes by Theorem IV. such that any stationary process y that satisfies Ivk — 12k 1 < 6 and 1H (v) — H COI < 6 must also satisfy d(v. Y = y) denote the joint distribution of the pair (X. which is within 8/2 of (1/n)H(An). The definition of N guarantees that 117k — till 5_ 17k — 0(k. 1H (i. which is. processes are finitely determined. as well as possible to the conditional distribution of yr. y) and p(y) = py (y) = Ex p(x. IV.) < E.

i. that is.y D (p x. then H(Yi) — 1-1 (Y11 172 1 ) will be small. y) denotes the joint distribution of a pair (X. Given e > 0. y) and conditional distribution p(y1x) = p(x. Exercise 6 of Section 1. as the following lemma. di (A.d. Lemma IV. for all m > 0. y) x.) There is a positive constant c such that if 11(X) — H(X1Y) < c6 2 . H(X) — H(X1Y) = _ p(x) log p(x) log E p(x. so that if 11(v) = limm H(Y i lY i ) is close enough to H(A).3 (The E-independence lemma. Use will also be made of the fact that first-order d-distance is just one-half of variational distance. property. The next lemma asserts that a process is e-independent if it is close enough in entropy and first-order distribution to an i. y) = vii/ 2 . y IP(X x P(X)P(Y)i) where c is a positive constant. In its statement. Y) of finite-valued random variables. and {Y i } be stationary with respective Kolmogorov measures /2 and v.11.d. The lemma now follows from the preceding lemma.u1 — vil < 8 and 11I (A) — H(v)1 < then {Yi} is an 6-independent process. since H(YilY°m) is decreasing in m. however. y) = X . Y) p(x)p(y) Pinsker's inequality.i. then X and Y are 6-independent. implies that H(A) = H(X i ). which is just the equality part of property (a) in the d-properties lemma. B-PROCESSES. process.i. without proof.2. 0 A conditioning formula useful in the i. 0 A stationary process {Yi } is said to be an 6-independent process if (Y1.i. of course.9.y P(Y) E p(x. then 11 (X 1 ) will be close to 11 (Y1 ).2. An i.224 CHAPTER IV.d. then E f (x. there is a 8 > 0 such that if 1.i.6. Yi-1) and yi are c-independent for each j > 1. p(x. proof as well as in later results is stated here. process is. Lemma 1. independent of X and Y. y) x.y1PxP Y )• P(x. y) log p(x. yields. that is.d. The i.2. Lemma IV.5 If f (x.d.)' g(x)p(x) E p(x) h(y)p(ydx). e-independent for every c > O. Proof If first-order distributions are close enough then first-order entropies will be close. y)I p(x). . Lemma IV. with first marginal p(x) = yy p(x. Thus if Ai is close enough to v1.4 Let {X i } be i. y)p(x. 2 D(pxylpxp y) C (E . This establishes the lemma. y) = g(x) h(y). Proof The entropy difference may be expressed as the divergence of the joint distribution pxy from the product distribution px p y .

(3) di (X E 1 /x. and let An realize an (Xtil.)T (Xn+1 Yni-1)• n+1 The first term is upper bounded by nc..2. while.' are c-independent. since it was assumed thatX x i . Fix It will be shown by induction that cin (X. close on the average to that of Yn+i .u i — vil < 3/2 < c/2.f» Y. 44 and yn+1 . y. and di (Xn +i Yn-i-i/Y) denotes the a-distance between the corresponding conditional distributions. Furthermore..) — 1/(y)1 < 3. since d i (X I . The third term is just (1/2) the variational distance between the distributions of the unconditioned random variable Yn+1 and the conditioned random variable Y +1 /y. such a Getting started is easy. YD. 4+1) = \ x. let ?. Xn+1 (4+1. which is in turn close to the distribution of X n+1 . process {Xi} and c > O..u. y +1 ) = ndn (x`i' . y`i') + (xn+i. The random variable X n+1 .2. yn+. n+i andy n+ i.n realizes d i (X n +i The triangle inequality yields (Xn+1/X r i l Xn+1) dl(Xn+19 Yn+1) + (Yn+1 Y1 /y) The first term is 0. is certainly a joining of /2.2.6 An i. by independence. conditioned on X = x is denoted by X n+ i /4. .(dn) XI E VI Xri (X n i Yn i )d1(Xn+1.2.. Y1 ). The strategy is to extend by fitting the conditional distributions Xn+ 1 /4 and Yn+i /yri' as well as possible. the second term is equal to di (X I . since it was assumed that A n realizes while the second sum is equal to cin (X. Y1 ) = (1/2)I. For each x. whose expected value (with respect to yri') is at most /2.6. 225 Proof The following notation and terminology will be used. process is finitely determined. which along with the bound nExn (dn ) < n6 and the fact that A n +1 is a joining of . since Yn+1 and Y. Assume it has been shown that an (r i'. produces the inequality (n 1)dn+1(.+1/4). ± 1 defined 1 . Yn+1)À. This strategy is successful because the distribution of Xn+ 1 /x'il is equal to the distribution of Xn+ i. Lemma IV. Lemma IV.x7. 1 (dn+i) < ndn(A. Yri+1).i. which is at most c/2. then the process {Yi } defined by y is an c-independent process. and the distribution of 1" 1 + 1 /Y1 is.2. thereby establishing the induction step.SECTION IV. Fix an i. < E. . y) +E < (n + 1)c.realize di (Xn+i by {r}. by independence. by c-independence. y) (n + A. Yin) < c./4). yields (n (2) 1 )Ex n+ I(dn+1) = nEA.5.4 provides a positive number 3 so that if j — vil < 8 and I H(/).d. This completes the proof of Theorem IV. Yin). since (n 1)dn+1(4 +1 . 1/I' ( An+1. by stationarity. THE FINITELY DE IERMINED PROPERTY. Theorem IV. The measure X.d. Thus the second sum in (2) is at most 6/2 ± 6/2. Y < 1 /y) y)di(Xn+04. Without loss of generality it can be supposed that 3 < E. the conditional formula.i.

A conditional form of c-independence will be needed. Lemma IV. process.3..226 IV. then X and Y are conditionally E-independent.i. Lemma IV. The choice of 3 and the fact that H(YZIY ° n ) decreases to m H (v) as n oc. dies off in the d-sense. The key is to show that approximate versions of both properties hold for any process close enough to the Markov process in entropy and in joint distribution for a long enough time.i. Lemma IV. ±1 is independent of X7 for the i. The Markov property n +Pin depends on the immediate past X. 17In and Yi-. and 17. B-PROCESSES.2. Proof Choose y = y(E) from preceding lemma. The next lemma provides the desired approximate form of the Markov property.2.8. The random variables X and Y are conditionally E-independent.8 Let {X.) Given E > 0. are conditionally E-independent. The Markov coupling lemma guarantees that for mixing Markov chains even this dependence on X. Lemma IV.2.i. the fitting was done one step at a time.y E. The conditional form of the E-independence lemma. given Z. the Markov property and the Markov coupling lemma. provided the Y-process is close enough to the X-process in distribution and entropy. then. whose proof is left to the reader. there is a 3 > 0 such that if lid. Z) < y. as m grows. proof. In that proof.m+1 — v m+ii < 8 and IH(a) — H(v)i < 8.2 CHAPTER IV. yiz) — p(xlz)p(ydz)ip(z) < z x.1. a future block Xn but on no previous values. n > 1. given Yo. where p(x. ylz) denotes the conditional joint distribution and p(x lz) and p(y lz) denote the respective marginals. given Z. Mixing Markov processes are finitely determined. Fix a stationary process y for which ym+ il < 3 and 11/(u)— H (v)I <8.} be an ergodic Markov process with Kolmogorov measure and let {K} be a stationary process with Kolmogorov measure v. then guarantee that HOTIN— 1 -1(YPIY2 n ) < y. By decreasing 3 if necessary it can be assumed that if I H(g) — H(v)I <8 then Inanu) — na-/(01 < y/2. for then a good match after n steps can be carried forward by fitting future m-blocks. for suitable choice of m. guarantees that for any n > 1. two properties will be used. which is possible since conditional entropy is continuous in the variational distance. proof to the mixing Markov case. A good match up to stage n was continued to stage n + 1 by using the fact that X. The Markov result is based on a generalization of the i. then choose 3 > 0 so small that if 1 12 m+1 — v m± i I < 8 then IH(Yriy0)-HaTixol < y12.1 +1 is almost independent of 1?. for every n > 1. if E E ip(x.a. Fix m. extends to the following result. Given E > 0 and a positive integer m.d.2..d. To extend the i.d. there is a y = y(E) > 0 such that if H(XIZ)— H(XIY.7 (The conditional 6-independence lemma. .

and hence the sum in (8) is less than E. implies that 171 and Yfni are conditionally E-independent. 17.'. Yln /Y0) < yo E A. This completes the proof of Theorem IV.r+Fr. conditioned on +71 /4 will denote the random vector Xn In the proof. for any stationary process {Yi } with Kolmogorov measure y for which — vm+1I < and I H(u) — H(p)i <8. if necessary. The first three terms contribute less than 3E/4. As in the proof for the i. since Ynn+ +Ini and Yin—I are conditionally (E/2)-independent. X4 +im ) will denote the d-distance between the distributions n + T/xii.2. E X n Vn 4(4.7:14:r/y. it can also be assumed Furthermore.n+i there is a 3 > 0 so that if y is a stationary — vm+1 1 < 6 then measure for which (5) am(Yln.2. Theorem IV. Ynn+ = x'17 . This proves Lemma IV. and ii(Xn of X'nl Vin / and 11. for all n > 1.4T /Yri) dm (Y: YrinZ) Y:2V/yri?). by (7). (6). YTP:r/y) E 9 where Xn realizes an (A. y) is upper bounded by Ittm — that 6 is small enough to guarantee that (6) (7) dm (xT.9 A mixing finite-order Markov process is finitely determined.d./x 1 . conditioned on X. I/2.n /y. y)ani (X .:vin/4.8. yr ) < e/4. To complete the proof that mixing Markov implies finitely determined it is enough to show that d(p. y).SECTION IV. xo E A..2. y) < E. (6) and (5). case. Since this inequality depends only on . Only the first-order proof will be given. using (6) to get started. = x.8 that Yini and Y i n are conditionally (E/2)-independent. provides an m such that (4) dm (XT. given Yn . the extension to arbitrary finite order is left to the reader.i. The distribution of Xn n+ 47/4 is the same as the distribution of Xn n + r. The d-fitting of and y is carried out m steps at a time.2. by (4). But the expected value of this variational distance is less than 6/2. Fix a mixing Markov process {X n } with Kolmogorov measure p. Thus the expected value of the fourth term is at most (1/2)6/2 = 6/4. + drn ( 17:+7 . XT/x0) < 6/4.2. By making 3 smaller. since dm (u. Lemma IV. it can also be assumed from Lemma IV. it is enough to show that (8) dni(x n+1 . and the distribution of Ynn+ +. Fix a stationary y for which (5).8.9. and fix E > O. and (7) all hold. by the choice of 3 1 . in turn. and hence the triangle inequality yields.1.'1 /y. given Yo . THE FINITELY DETERMINED PROPERTY 227 which.u. The fourth term is upper bounded by half of the variational distance between the distribution of Y. .. Proof n + 7i . given Yo. + cim(r. The Markov coupling lemma.

2. is close enough to .ak — < 3. vi)) = d(p.11 If d(p.2. (None of the deeper ideas from channel theory will be used. yi)).(b714) = E A. —> which. is also small.. close to E). Theorem IV. îl) will be small.a such that (a) 3 is close to A in distribution and entropy. then implies that a(u. (9) it (a ril ) The family {A.2.3 The final step in the proof that almost block-independence implies finitely determined is stated as the following theorem. by property (a). (. O ) } of conditional measures can thought of as the (noisy) channel (or black box). The theorem is an immediate consequence of the following lemma.14) be the conditional measure on An defined by the formula (af. It will be shown that if 72. which is however. Such a finite-sequence channel extends to an infinite-sequence channel (10) X y . IV.. outputs b7 with probability X.a.v) <E and v is finitely determined.:(d(xi. il b7) A.(d(x l . The basic idea for the proof is as follows.. If the second marginal. B-PROCESSES. < (5 and IH(A) — < S.10 (The a-closure theorem. (b7la7). Lemma IV.. The finitely determined processes are d-closed. Let A be a stationary joining of it and y such that EAd(x i . v). only its suggestive language. v). and for each ar il. given the input ar il. a result also useful in other contexts.) The d-limit of finitely determined processes is finitely determined..) Fix n. for any stationary process Trt for which 1. The fact that 3: is a stationary joining of and guarantees that E3. which guarantees that any process a-close enough to a finitely determined process must have an approximate form of the finitely determined property. A simple language for making the preceding sketch into a rigorous proof is the language of channels. then there is a (5 > 0 and a k such that d(u. y l )) = d(.u in distribution and entropy there will be a stationary A with first marginal .u. '17 is close enough to y in distribution and entropy then the finitely determined property of y guarantees that a(v.228 CHAPTER IV. let 4(. The triangle inequality d(v + d a]. (b) The second marginal of 3 is close to u in distribution and entropy. borrowed from information theory.

b ) . a) need be stationary. measure A(4` . the averaged output p. then hr(X inn. a). a) on A'. b i ) converges to a oc. Km )* mn If a is stationary then neither A*(X n .)(14) = E A A(4. and hence ' 1 lim — H(1 11 1X7) = H n — On the other hand. let n > k. and # = ti(Xn. = H(X7) H(}7 W ) . as n oc. then A(X n . it) and A = A(Xn . Likewise. = as and b(i). for i <n — k A * (a(i) i + k i. given an infinite input sequence x.nn . by breaking x into blocks of length n and applying the n-block channel to each block separately. and hence stationary measures A = A(An . _ i < n. and put A* = A*(Xn. with respect to weak convergence and convergence in entropy. ii). THE FINITELY DE I ERMINED PROPERTY 229 which outputs an infinite sequence y. A direct calculation shows that A is a joining of a and /3. The output measure /3* is the projection of A* onto its second factor. blf) as n oc. b(i)+ so that. that is. The measures A and fi are. the measure on A' defined for m > 1 by the formula fi * ( 1/1" ) E A * (aT" . a) and /3(X n . though both are certainly n-stationary. +. b lic) is an average of the n measures A * (a(i) + x where a(i). respectively. bn i ln ) = a(a)4(b i.12 (Continuity in n. ±„. - Lemma IV. independent of {xi: i < jn} is. 1 b k ) = v(b) as n To establish convergence in entropy. . in n and in a. if (XI". a) are obtained by randomizing the start. that ±n ). A(a. But. is the measure on Aœ x A" defined for m > 1 by the formula - m_i A * (a . b(i) . a). yrn) = H(X) + myrnIxrn) . Given an input measure a on A" the infinite-sequence channel (10) defined by the conditional probabilities (9) determines a joining A* = A*(X n . called the stationary joint input output measure and stationary output measure defined by An and the input measure a.2. converges weakly and in entropy to X and . = b„ for 1 < s < k. first note that if (X7. a) nor /3*(X n . Y) is a random vector with distribution A*(X n . The i.) If is a stationary joining of the two stationary processes pt and v.u) converges weakly and in entropy to v. with probability X n (b7lx i n ±i and {y i : i < j n}. "" lai. ypi+ in+n i has the value b7. n+n ) j=0 The projection of A* onto its first factor is clearly the input measure a. The joining A*. called the joint input output measure defined by the channel and the input measure a. x(a k i b/ ). a) of a with an output measure /3* = /6 * (4. bki ) converges to X(a /ic. /2). 17) is a random vector with distribution An then H(X7. The next two lemmas contain the facts that will be needed about the continuity of A(X n .f3(4.+ +ki).SECTION IV2. Proof Fix (4.

°eon)} converges weakly and in entropy to A(4. yl)) + c. and IH(fi) — < y/2. br ) = a(aT) i-1 X(b i lai ). 1 IX7) . Lemma IV. yl)) < E. a). and a stationary measure A on A" x A. then H(XT. a) is weakly continuous in a. as n oc. assume 1) is finitely determined. p)) H (X). a). y) < E. + E H(rr_1)n+1 I riz_on+i) Do since the channel treats input n-block independently. By definition. since Isin = vn . Yr) = H(XT)+mH(Y i 'X i ).12. which is weakly continuous in a. B-PROCESSES. Furthermore. (i) {A(Xn .u). so that. which depends continuously on 11(a) and the distribution of (X1. y). since the entropy of an n-stationary process is the same as the entropy of the average of its first n-shifts. so (ii) holds. p)) H(A).6. Y1).13 (Continuity in input distribution.2. If a sequence of stationary processes la(m)} converges weakly and in entropy to a stationary process a as m oc. Assume cl(p. and let A be a stationary measure realizing ci(p. Ex(d(xi. . The extension to n > 1 is straightforward. Likewise.12.u)+ — H (Y. Proof of Lemma IV. then. Thus.2. Section 1. Dividing by mn and letting in gives 1 H(A*(4. provides an n such that the stationary input-output measure A = A (An . fi = /3(X i . if = /3*(X n . A) satisfies Ifie — vti < y/2. dividing by m and letting in go to oc yields H(A) = H(a)+ H (Y I X 1) . (ii) 0(4. so that H(A*(X n . and hence Lemma IV. as in --> Do.2. if (XT.)) H (v). p.) < E. so that both A* and fi* are stationary for any stationary input. and H(/3) = H(A) — H(Xi Yi) is continuous in H(A) and the distribution of (X 1 . Randomizing the start then produces the desired result. Furthermore.11. must also satisfy cl(v. Y1 ). Lemma IV. Thus (i) holds. Proof For simplicity assume n = 1. since the channel treats input symbols independently. see Exercise 5.13 is established. This completes the proof of Lemma IV. as m --÷ oc. in particular. a). The finitely determined property of y provides y and E such that any stationary process V' for which I vt —Ve I < y and I H(v) — H(71)1 < y. g) satisfies (11) EA(d(xl. H(A(4. i=1 H(X71n ) MH(YnX ril ). Fix a stationary input measure a and put A = A(Xi. yi)) < Ex(d(xi. . An„(a r .2.) Fix n > 1. The continuity-in-n lemma. Y) is a random vector with distribution A n. p)) = H(. randomizing the start also yields H(/3(A. a(m) )1 converges weakly and in entropy to t3(X n . and the stationary output measure fi = fi(X n .2.230 H (X ) CHAPTER IV.

The continuity-in-input lemma, Lemma IV.2.13, provides δ and k such that if μ̄ is any stationary process for which |μ̄_k − μ_k| < δ and |H(μ̄) − H(μ)| < δ, then the joint input-output distribution λ̄ = Λ(λ_n, μ̄) satisfies

(12)    E_λ̄( d(x_1, y_1) ) ≤ E_Λ( d(x_1, y_1) ) + ε,

and the output distribution β̄ = β(λ_n, μ̄) satisfies |β̄_ℓ − β_ℓ| < γ/2 and |H(β̄) − H(β)| < γ/2.

Given such an input μ̄ it follows that the output β̄ satisfies

    |β̄_ℓ − ν_ℓ| ≤ |β̄_ℓ − β_ℓ| + |β_ℓ − ν_ℓ| < γ,   and   |H(β̄) − H(ν)| ≤ |H(β̄) − H(β)| + |H(β) − H(ν)| < γ,

so the definitions of ℓ and γ imply that d̄(β̄, ν) < ε. Furthermore, since λ̄ = Λ(λ_n, μ̄) is a joining of the input μ̄ with the output β̄, the inequalities (11) and (12), together with the fact that λ realizes d̄(μ, ν), which was assumed to be less than ε, yield

    d̄(μ̄, β̄) ≤ E_λ̄( d(x_1, y_1) ) ≤ E_λ( d(x_1, y_1) ) + 2ε < 3ε,

and hence d̄(μ̄, ν) ≤ d̄(μ̄, β̄) + d̄(β̄, ν) < 4ε, thereby completing the proof that d̄-limits of finitely determined processes are finitely determined.

Remark IV.2.14 The proof that i.i.d. processes are finitely determined is based on Ornstein's original proof, [46], while the proof that mixing Markov chains are finitely determined is based on the Friedman-Ornstein proof, [13]. The fact that d̄-limits of finitely determined processes are finitely determined is due to Ornstein, [46]; the observation that the ideas of that proof could be expressed in the simple channel language used here is from [83]. The fact that almost block-independence is equivalent to finitely determined first appeared in [67]; the proof given here that ABI processes are finitely determined, and hence the proof that almost block-independent processes are finitely determined, is new.

IV.2.b Exercises.

1. The n-th order Markovization of a stationary process μ is the n-th order Markov process ν with transition probabilities

    ν( x_{n+1} | x_1^n ) = μ( x_1^{n+1} ) / μ( x_1^n ).

Show that a finitely determined process is the d̄-limit of its n-th order Markovizations. (Hint: ν is close in distribution and entropy to μ.)

2. Show that if μ is totally ergodic and ν is any probability measure on A^n, then d̄(μ, ν̄) ≥ d̄_n(μ_n, ν), where ν̄ denotes the concatenated-block process defined by ν. (Hint: d̄(μ, ν̄) is realized along a pair (x, y), for some x all of whose shifts have limiting nonoverlapping n-block distribution equal to μ_n, and for some y for which the limiting nonoverlapping n-block distribution of T^i y is ν, for some i < n. Hence any of the limiting nonoverlapping n-block distributions determined by (T^i x, T^i y) must be a joining of μ_n and ν.)

3. Show that the class of B-processes is d̄-separable.

4. Prove the conditional ε-independence lemma.

5. Show that a process built by repeated independent cutting and stacking is a B-process if the initial structure has two columns with heights differing by 1.

Section IV.3 Other B-process characterizations.

A more careful look at the proof that mixing Markov chains are finitely determined shows that all that is really needed for a process to be finitely determined is a weaker form of the Markov coupling property. This form, called the very weak Bernoulli property, is actually equivalent to the finitely determined property, hence serves as another characterization of B-processes. Since it has already been shown that finitely determined implies almost blowing-up, see Section III.4, it follows that almost block-independence, very weak Bernoulli, almost blowing-up, and finitely determined are, indeed, equivalent, each of them equivalent to being a stationary coding of an i.i.d. process. Equivalence will be established by showing that very weak Bernoulli implies finitely determined and then that almost blowing-up implies very weak Bernoulli.

The significance of very weak Bernoulli is that it is equivalent to finitely determined and that many physically interesting processes, such as geodesic flow on a manifold of constant negative curvature and various processes associated with billiards, can be shown to be very weak Bernoulli by exploiting their natural expanding and contracting foliation structures. The reader is referred to [54] for a discussion of such applications.

IV.3.a The very weak Bernoulli and weak Bernoulli properties.

As in earlier discussions, either measure or random variable notation will be used for the d̄-distance, see Exercise 1a in Section I.9, and X_1^n/x_{-k}^0 will denote the random vector X_1^n, conditioned on the past values x_{-k}^0. A stationary process {X_i} is very weak Bernoulli (VWB) if given ε > 0 there is an n such that for any k > 0,

(1)    E_{x_{-k}^0}( d̄_n(X_1^n/x_{-k}^0, X_1^n) ) ≤ ε,

where E_{x_{-k}^0} denotes expectation with respect to the random past X_{-k}^0. Informally stated, a process is VWB if the past has little effect in the d̄-sense on the future.

Theorem IV.3.1 (The very weak Bernoulli characterization.) A process is very weak Bernoulli if and only if it is finitely determined.
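The approximate-independence idea that runs through these arguments, and through the weak Bernoulli property discussed next, can be checked empirically from a single sample path. The sketch below is an illustration only: it estimates the ε-independence sum Σ|p(x, y) − p(x)p(y)| for symbols separated by a gap g, using a made-up two-state Markov sampler as the data source.

```python
import random
from collections import Counter

def markov_path(n, P=((0.9, 0.1), (0.2, 0.8)), x0=0):
    path, x = [x0], x0
    for _ in range(n - 1):
        x = random.choices((0, 1), weights=P[x])[0]
        path.append(x)
    return path

def independence_gap(path, g):
    """Estimate sum_{x,y} |p(x,y) - p(x)p(y)| for the pairs (X_i, X_{i+g})."""
    pairs = Counter((path[i], path[i + g]) for i in range(len(path) - g))
    total = sum(pairs.values())
    left = Counter(x for x, _ in pairs.elements())
    right = Counter(y for _, y in pairs.elements())
    return sum(abs(pairs[(x, y)] / total - (left[x] / total) * (right[y] / total))
               for x in left for y in right)

path = markov_path(200_000)
for g in (1, 5, 20):
    # For a mixing chain the sum decreases as the gap grows.
    print(g, round(independence_gap(path, g), 4))
```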

Vr) < cin _g (Ugn+i . Theorems 111. weak Bernoulli requires that with high probability.b Very weak Bernoulli implies finitely determined.. Their importance here is that weak Bernoulli processes are very weak Bernoulli. OTHER B-PROCESS CHARACTERIZATIONS. V :+i) g In.3. + .} in entropy and in joint distribution for a long enough time. Entropy is used to guarantee approximate independence from the distant past. 233 The proof that very weak Bernoulli implies finitely determined will be given in the next subsection. ID.T Y::ET /)1) is small for some fixed m and every n. conditioned on intermediate values. A somewhat stronger property. Furthermore.3 and 111. and use the fact that . }. given only that the processes are close enough in entropy and k-th order distribution.SECTION IV.) .nnz /y. while very weak Bernoulli is used to guarantee that even such conditional dependence dies off in the ii-sense as block length grows. This property was the key to the example constructed in [65] of a very weak Bernoulli process that is not weak Bernoulli. X g ±n i z )) < E/2. see Exercise 2 and Exercise 3. for some fixed k > ni. If this is true for all m and k.3.-. iin (14 . -k Thus weak Bernoulli indeed implies very weak Bernoulli. p-r . The class of weak Bernoulli processes includes the mixing Markov processes and the large class of mixing regenerative processes.4) + cini (yn n _. as noted in earlier theorems.5. then one can take n = g + ni with m so large that g 1(g + ni) < € 12. A stationary process {X i } is weak Bernoulli (WB) or absolutely regular. with high probability. IV. that is. names agree after some point.'. for suitable choice of ni. X'1 ) ) < E. a result established by another method in [78]. the random vectors Xr and X° k are c-independent. the key is to show that E A. while the proof of the converse will be given later after some discussion of the almost blowing-up property. The proof models the earlier argument that mixing Markov implies finitely determined. if past and future become c-independent if separated by a gap g. To see why. In a sense made precise in Exercises 4 and 5. x:-±. called weak Bernoulli. The triangle inequality gives <(x. Since approximate versions of both properties hold for any process close enough to {X. since 4-distance is upper bounded by one-half the variational distance.2.77. the conditional measures on different infinite pasts can be joined in the future so that. is obtained by using variational distance with a gap in place of the ci-distance. given c > 0 there is a gap g such that for any k > 0 and m > 0. to obtain Ex o (cin (XTIV X ° k . vin . a good fitting can be carried forward by fitting future ni-blocks. first note that if Xr and X ° k are c-independent then Ex o (dm (X g +T/X ° k . y rli )dni (x. for any n > g and any random vector (Uli.1. while very weak Bernoulli only requires that the density of disagreements be small. weak Bernoulli processes have nice empirical distribution and waiting-time properties. ( 4. As in the earlier proofs.

Towards this end. and hence there is a K such that H(XTIX ° _ K ) < mH(u) + y. H(XT IX° K ) also depends continuously on 1Uk . and hence that am (XI''.Yr /Y_°. it follows from the triangle inequality that there is a S > 0 such that E y. and Sis small enough. LI . so that if it also assumed that I H(v) — H(g)I <S. implies that Yr and YIK K4 are conditionally (E/2)-independent. Y/yr. which is small since in < k.)) < 6/2. then (3) EyoK (iim (Yr i /Y K . namely.n (XT. By stationarity. But. j>1 . r in )) < c /8.. Fix k = m K 1. as t —> Do.n\ 'n+1 I "Ai x 4 . all that remains to be ±. Y.2. let y be a positive number to be specified later. the first term is equal a. since it can also be supposed that 8 < E/2. B-PROCESSES. along with the triangle inequality and the earlier bound (3).m. t > 1. This result. t > O. If y is small enough.(XT / X ° t . H(XTIX ° t ) converges to mH(A). The very weak Bernoulli property provides an in so that (2) Ex o (d„. For in fixed.234 CHAPTER IV. yields the result that will be needed. The goal is to show that a process sufficiently close to {Xi} in distribution and entropy satisfies almost the same bound. then H(YrIY ° K ) < m H (v) + 2y holds.)ani(Xnn±n1 +I / y n vn+ni 1. given — °K• Since a-distance is upper bounded by one-half the variational distance. this means that Eyo(dm(Yr/ 17° K. The details of the preceding sketch are carried out as follows. .') is small. for any stationary process {Y. (4) Eyot (c1(Y im .7. uniformly in n. by the very weak Bernoulli property. The quantity Exo K (a.8 is small enough and {Yi } is a stationary process with Kolmogorov measure V such that I tuk — Vkl <S. j O. since H(YrIY ° r ) decreases to m H (v) this means that II (Yi n IY ° K) — 11 ( 11171 1 17 K_i) < 2y. for all n. Lemma IV.n) < 6/4. provided only shown is that the expected value of dm (Ynn± that is close enough to {X i } in k-order distribution for some large enough k > as well as close enough in entropy. so if . Yin /Y° K-j)) < E/4 . which completes the proof that very weak Bernoulli implies finitely determined.Yr i )) < c/4. Fix a very weak Bernoulli process {X n } with Kolmogorov measure p. )( tin )) depends continuously on 'Lk. } with Kolmogorov measure u for which Ik — VkI < and I H Cu) — H (v)I < (3. Thus. while the expected value of the second term is small for large enough ni.1" -1 for all n.n (rin /X ° K . once such an in is determined. yr). the conditional 6-independence lemma. Furthermore. In summary. and fix 6 > O.

4 whose /i -mass is not completely filled in the first stage has /i -measure at least E then its blowup has large vmeasure and the simple strategy can be used again.' . I : dn (d. that is. after which the details will be given. A simple strategy for beginning the construction of a good joining is to cut off an a-fraction of the v-mass of yri' and assign it to some xr. yl') < e. of A and v for which dk (4. .S. there is a 8 > 0 and an N such that Bn has the (3. of a set C c A" is its c-neighborhood relative to the dn -metric. is called an c-matching over S if dk ( yr.-measure less than E. It was shown in Section III. )1) < E.SECTION IV. whenever it(C) > c. namely. C A" such that x'll E Bn . OTHER B-PROCESS CHARACTERIZATIONS. The goal is to construct a joining A. that the total mass received by each 4 is equal to A(4). [81]. If v([C]. If the fraction is chosen to be largest possible at each stage then only a finite number of stages are needed to reach the point when the set of 4 whose pt-mass is not yet completely filled has 1a-measure less than E. This simple strategy can repeated as long as the set of x'i' whose .4. for most y'i' a small fraction of the remaining v-mass can be cut off and and assigned to the unfilled mass of a nearby x'11 .c that a finitely determined process has the almost blowing-up property.u-mass is not yet completely filled has it-measure at least E.c The almost blowing-up characterization. subject to the joining requirement. The e-blowup [C].d. b) <E. A function 0: [S]. except on a set of (4. which. [C].c. namely.3. then dn (A .3.2 Let A and v be probability measures on A. and almost blowing-up are all equivalent ways of saying that a process is a stationary coding of an i.a is called a mass function if it has nonempty support and E Ti(4) < 1. and for any E > 0. The set of such yii' is just the c-blowup of the support of A. which. in a more general setting and stronger form. c)-blowing-up property for n > N. this completes the proof that almost block-independence. by hypothesis. Proof The ideas of the proof will be described first. 235 IV.) > 1 . The proof is based on the following lemma. has v-measure at least 1 . )1) of A. 0 (yriz)) < E.i. A set B C An has the (6.c. If the domain S of such a 0 is contained in the support of a mass . The key to continuing is the observation that if the set of . very weak Bernoulli. Since very weak Bernoulli implies finitely determined. Some notation and terminology will assist in making the preceding sketch into a proof. A nonnegative function . for all y [Sk.u([C]. process. finitely determined. was introduced in Section 111. The almost blowing-up property (ABUP). y) < 2e.4. i. A joining can be thought of as a partitioning of the v-mass of each y'i' into parts and an assignment of these parts to various xlii. In this section it will be shown that almost blowing-up implies very weak Bernoulli. for any subset C c B for which i(C) > 2 -1'3 • A stationary process has the almost blowing-up property if for each n there is a B.) > 1. for some all E C) . eventually almost surely. = fb. Lemma IV. A trivial argument shows that this is indeed possible for some positive a for those )1' that are within c of some fil of positive A-mass.' for which dn (x1' . is due to Strassen. c)-blowing-up property if .E.3.

Lemma IV. The construction can be stopped after i*. and ai±i is taken to be the maximal (y. = 1. is constructed by induction. This is the fact that.o ) denote the a-algebra generated by the collection of cylinder sets {[xn k > 0 } . x E S. p. Having defined Si . inf 71. the notation xn E Fn will be used as shorthand for the statement that Clk[zn k ] c F. (pi and ai . then the number a = 0) defined by i"u. by the entropy theorem.(4). —(1/n) log . Ft.(x)>0 ) I is positive. With Lemma IV. dk (x.(Si+i) < .Lemma IV. (1) . let . (i + 1) .u(Si.u(F) > 1 — a and so that if xn E Fn then o so . otherwise .u (i) . . 0)-stuffi ng fraction. almost surely. Op-stuffing fraction.236 function at. as long as SI 0 0) and let al be the maximal (y. with high probability. I=1 Only the following consequence of the upper bound will be needed. there is an 4 E Si for which td ) (4) = ai y(0. (1) = A. (i + 1) . To get started let p. Let E(Xn. say i*. .2 is finished. provided only that n is large enough.u(4) H.u(4lx ° . Also. the function Oi+1 is taken to be any c-matching over Si+i . p.u (i + 1) be the mass function defined by tt o)(x i ) — • The set Si+1 is then taken to be the support of p. Hence there is first i. for Fn E E(X" co ). y in) of I=1 X-measure less than e.3 (The conditional entropy lemma. for which ai = 1 or for which p. B-PROCESSES. (5) a = min { 1.(Si +i) < c.0) has approximately the same exponential size as the unconditioned probability . for it is the largest number a < 1 for which av(V 1 (4)) < (x). These two facts together imply the lemma.3. then V([Si d E ) > I — E.. by the definition of maximal stuffing fraction. can be assigned to the at-mass of 4 = The desired joining A..2 in hand. y) < E.u. . Proof By the ergodic thZoc rigiiii(-x-Rlit)(14ift4 AID almost surely. It is called the maximal (y.7 1 (4)) and therefore . Oi+i )-stuffing fraction. only one simple entropy fact about conditional measures will be needed to prove that almost blowing-up implies very weak Bernoulli.) Let be an ergodic process of entropy H and let a be a given positive number There is an N = N(a) > 0 such that if n > N then there is a set Fn E E(X" co ) such that . +1 ) < E. let 01 be any E-matching over S1 (such a (P i exists by the definition of c-blowup.u(Si ). that is. for which an a-fraction of the y-mass of each y'i? E [S]. except on a set of (4. cutting up and assigning the remaining unassigned v-mass in any way consistent with the joining requirement that the total mass received by each xÇ is equal to If ai.3. let Si be the support of itt (1) . If ai < 1 then.s: ( x I' CHAPTER IV. while. and the proof of.3. In either case.u. the conditional probability .

u(c n /3. Now the main theorem of this section will be proved. the Markov inequality yields a set D. 237 Lemma IV. The quantity . eventually almost surely.X.) such that .3.3.such that if x ° E D.2 E Fn .u is very weak Bernoulli. then E E (X ° cx. . has measure at least 1 — c. then. let Bn c An . 0-blowing-up property for n > N.u(c" n Fn ix) < 2" ...5 Almost blowing-up implies very weak Bernoulli.0 ) such that A(B n ) > 1— c 2 /4. (b) 11. E(X°.n > 1. Proof Let p.u([C]f) a. which establishes the lemma. X° E V. D. Theorem IV.SECTION IV. V = Dn n Dn . Towards this end.NATe. a set Fn c A" such that 1 — Afrx and. to show that for each E > 0.) 5_ 2" . there is an n and a set V E E(X° .3. E By Lemma IV.u(C) .. and such that for any e > 0.u(V) > 1 — c and such that dn(A..u(C n 12. be a stationary process with the almost blowing-up property.u(B. there is a 3 > 0 and an N such that Bn has the (3.„ and C c An then (a) and (b) give 11-(Clx) ..) _ oe ) 2an.u(Bn ) is an average of the quantities . which.2.u(C n B. A (• lx° . there is a set Dn *) > 1 — c/2 and such that A(B n ix%) > 1 — e12. 1 — c. be such that x7 E B„.Ix° 00). by the Markov inequality. and for E V and C c A'. of measure at least 1 — Nfii. for x°. for each x 0 ° 00 ) > 1 — (a) iu(F. there is a set D. o ) > c can be rewritten as . The goal is to show that .) such that (6) If C c An and ..u(x).4 it follows that if n is large enough and a < c 2 /4 then the set * E E(X).0 E D. If n > N.1x%) 6/2 .3..u(Clx ( 1 00 ) > c then . I x_ ..(V) > 1 — c and such that the following holds for x ° 00 E V. if A(Clx. OTHER B-PROCESS CHARACTERIZATIONS.)> 2-"(c — N/Fx — el2).u(C n Fn ) Ara < 2"A(C)-1. it is enough to show how to find n and V p. that is.3. E E(X° 0° ) of measure at least Dn . and C C An.0 )) <e. so that if n is so large * E E(X ° .s/CTe < 2" . Combining that p.a(cix) < . (x tilx ° If E .(Dn this with Lemma IV.4 Let N = N (a) be as in the preceding lemma.NA ii(Clx ° Proof For n > N.

d Exercises. 2. Show by using the martingale theorem that a stationary process p. 1. c. then 2-"(e — . Show using coupling that a mixing Markov chain is weak Bernoulli. yields the stronger result that the an -metric is equivalent to the metric cl. with the property that for all x B. 1)). < E.„o ] which maps the conditional measure ii(-Ix°) onto the conditional measure tie li° . for all sufficiently large n.6 The proof of Theorem IV.(•1. Show that a mixing regenerative process is weak Bernoulli. with the property that for all x B. This establishes the desired result (6) and completes the proof that almost blowing-up implies very weak Bernoulli.. 4)(x).0 0o ).u. defined as the minimum c > 0 for which . and a measurable set B c [x° . n].) 4. . and a measurable set B c [x] such that .3. IV.3. .14Fe — 6/2) > 2 -8n. using a marriage lemma to pick the c-matchings 0i .u(Bix ° .a is weak Bernoulli if given c > 0 there is a positive integer K such that for every n > K there is a measurable set G E E(X ° .238 CHAPTER IV. B-PROCESSES. provided only that n is large enough.. 0(x) 1 = xi for all except at most en indices i E [1.0 0.u(C) < v([C]. This and related results are discussed in Pollard's book. for all C c A.*(. yields ittaCli)> ou([C n B4O1.5 appeared in [39]. The proof of Strassen's lemma given here uses a construction suggested by Ornstein and Weiss which appeared in [37].o ] such that ti(B Ix%) < E.) > 1 — E. = x. . such that if i° 00 E G then there is a measurable mapping 0: [x] 1-÷ [. Cin (X7 IX ° k . such that if x 0 E G i then there is a measurable mapping 0: [x. Show that a stationary process is very weak Bernoulli if and only if for each c > 0 there is an m such that for each k > 1. 3. except for a set of x° k of measure at most E. pp. 5. Example 26.. so that the blowing-up property of B. Remark IV. (Hint: coupling.0 ) of measure at least 1 — c.3. is very weak Bernoulli if given E > 0 there is a positive integer n and a measurable set G E E(X° 0° ) of measure at least 1 — c. Show by using the martingale theorem that a stationary process . n].„) < c. for i E [K.u. If a is chosen to be less than both 8 and c2 /8. A more sophisticated result. 79-80].0 ] which maps the conditional measure . [59.c ).u(• ix%) onto the conditional measure .

Feldman. C. of Elec. Info. [8] P. Th. New York. [13] N. Kohn. Israel J. Recurrence in ergodic theory and combinatorial number theory Princeton Univ." J. d'analyse. [5] D. [12] N. Information Theory Coding theorems for discrete memoryless systems." Advances in Math. NY. V. Budapest.. A. Feller. 1985. of Math. W. Friedman and D. Press. Thesis. Elements of information theory." Ann. "The Erd6s-Rényi strong law for pattern matching with a given proportion of mismatches. Stanford Univ. Wiley. [6] T. John Wiley and Sons.Bibliography [1] Arnold. Dept. 321-343. Information theory and reliable communication. M. "Logically smooth density estimation. Körner. [4] P.". An introduction to probability theory and its applications. Arratia and M." Ph. Billingsley. 36(1980). 1981.. Furstenberg. Cover and J. NY. Thomas. Wiley. [15] R. "r-Entropy. 17(1989). Benjamin. New York. "On isomorphism of weak Bernoulli transformations. New York. 1968.. Measure Theory.. Birkhauser. A. Csiszdr and J. and Ornstein's isomorphism theorem. Gallager. 1152-1169. [7] I. Inc.A. Probab. Ergodic theory and information. 365-394. Introduction to ergodic theory Van Nostrand Reinhold. 194-203.. 5(1970). [14] H. 1971. Ergodic problems of classical mechanics." IEEE Trans. Akadémiai Kiack"). 1968. Princeton.. Barron. Volume II (Second Edition). New York. New York. 103-111. Rényi. Erdôs and A. 1981.. Eng. D. 1970. and Avez. [11] W. [10] J. [2] R. 1965. John Wiley and Sons. New York. equipartition. 239 . 1980. New York. "Universal codeword sets and representations of the integers. Friedman. [9] P. Waterman. 1991. "On a new law of large numbers..Elias.I. Ornstein. 23(1970). [3] A. IT-21(1975). NJ..

Kamae. Kakutani.. 1990. 42(1982). U." IEEE Trans. [34] D. Grassberger. Cambridge. Suhov." Proc.S. 669-675. Math. 1993.) Wiley." Probability. 284-290.. John Wiley and Sons. Hoeffding. Springer Verlag.single sequence analysis. Princeton. 34(1979) 281-286. "Prefixes and the entropy rate for long-range sources.A. of Math. Ghandour. Gray and L. [26] S. D. Statist." Israel J. "Asymptotically optimal tests for multinomial distributions. "Source coding without the ergodic assumption. Halmos. Jones." Proc. IT-20(1975). Gray. [17] R. [25] T. An Introduction to Symbolic Dynamics and Coding. Finite Markov chains. Math. "A simple proof of the ergodic theorem using nonstandard analysis." Israel J. de Gruyter. [30] J.. [20] P. W. and optimization.. [32] U. Halmos. New York. [23] M. [27] I. statistics. 36(1965). 369-400. Ergodic theorems. Acad. Entropy and information theory. [33] R. 5800-5804. [29] J. Math. Van Nostrand Reinhold. Info.. M. Lectures on the coupling method. Princeton. "On the notion of recurrence in discrete stochastic processes." IEEE Trans. Lind and B. IT-37(1991). Sci. New York. 291-296. 681-4. Th. " Finitary isomorphism of irreducible Markov shifts. Keane and M. Kac." Ann.. . "Induced measure-preserving transformations. 1950. 1992. Kieffer. 1960. Kontoyiannis and Y. 1988. Inform. 1985. "Comparative statistics for DNA and protein sequences ." Israel J... Weiss.. [35] T. 1995. Probability. ed. Press." Proc. Krengel. Springer-Verlag. [22] W. Japan Acad. 502-516. Gray. New York. Kelly. NY. Snell. van Nostrand Co. 1002-1010." IEEE Trans. Berlin. 42(1982). Marcus. NJ. 19(1943). (F.. Karlin and G. "Sample converses in source coding theory. 82(1985). - [18] R. 87(1983). [24] S. 1956. NY. New York. Statist. AMS. IT-35(1989). Measure theory. [19] R. "New proofs of the maximal ergodic theorem and the Hardy-Littlewood maximal inequality.. New York. Chelsea Publishing Co. Davisson. D. Katznelson and B. Lindvall.240 BIBLIOGRAPHY [16] P. and ergodic properties.. [31] I. P. L. 263-268. [21] P. random processes. G. Smorodinsky. Theory. 635-641. Cambridge Univ. Nat. [28] M. "Estimating the information content of symbol sequences and efficient codes. Kemeny and J." Ann. "A simple proof of some ergodic theorems. of Math. 53(1947). Info. Th. New Jersey. Lectures on ergodic theory.

"How sampling reveals a process. Ornstein and P." Israel J. "The positive-divergence and blowing-up properties. 104(1994). CT. Yale Univ.. Th. Th. Probab. Neuhoff and P. Probab.. 63-88. Poland. Weiss. Ornstein. Correction: Ann. "The d-recognition of processes. Ornstein and B. 905-930. Ergodic theory. IT-28(1982).BIBLIOGRAPHY 241 [36] K. 182-224. "Entropy and the consistent estimation of joint distributions. IT-42(1986). 1561-1563. "The Shannon-McMillan-Breiman theorem for amenable groups. 43-58. "Block and sliding-block source coding. Shields. Weiss. Ann. Shields. 18(1990). Sys.. . Shields." IEEE Workshop on Information Theory. "Universal almost sure data compression. universal. "An application of ergodic theory to probability theory".. [50] D." Ann. D. Probab. 53-60. 441-452. Shields. [46] D. Neuhoff and P. [47] D. Th. 10(1982). Marton and P.. Weiss. New Haven. Inform.. Rydzyna. Ornstein and P.. [49] D. Ornstein. "Entropy and data compression. Wyner. [37] K. Shields. Math. 960-977. Ornstein and B.. Th. 86(1994). [40] D. "Channel entropy and primitive approximation. and B. 1(1973). Press. Shields. Probab. and dynamical systems. Probab. "A very simplistic. 1974." Advances in Math. 78-83. "Indecomposable finite state channels and primitive approximation. [38] K. Marton. "A recurrence theorem for dependent processes with applications to data compression. to appear." Memoirs of the AMS. Neuhoff and P. Nobel and A. 331-348. Neuhoff and P. [41] D.. Shields. Inform. Info." Israel J. IT-23(1977). 1995. Shields. "Equivalence of measure preserving transformations. Ornstein... Marton and P. Marton and P. 18(1990). "Almost sure waiting time results for weak and very weak Bernoulli processes. Inform. Shields. 188-198. 44(1983)." IEEE Trans. "A simple proof of the blowing-up lemma. 262(1982). 211-215.." Ann. Ornstein and B. Probab." IEEE Trans.. [45] D. 11-18.. IT-39(1993). [43] D." Ergodic Th. Weiss.. Th. Rudolph." Ann. Yale Mathematical Monographs 5. Advances in Math. [52] D. Ornstein and P. 15(1995). randomness. [39] K." IEEE Trans. and Dynam." IEEE Trans. lossless code. 22(1994)." Ann." IEEE Trans. [42] D. 951-960. 445-447. IT-38(1992). Info. "An uncountable family of K-Automorphisms". Math. 10(1973). June. Shields. [44] A. [53] D. [48] D.. [51] D.

30(1978).. of Chicago Press. Transi. Shields." IEEE Trans. IT-33(1987). 1981. Parry. 20(1992). Shields. Mat. Shields. Princeton... Shields. Pollard. 7 (1973). Weiss. Press. Shields. and Probability. 1983. Lectures on Choquet's theorem. [66] P. Cambridge Univ. Chicago. San Francisco. Math. 159-180. Ornstein and B. [59] D. "Cutting and independent stacking of intervals. Moscow. R. [69] P. 1973.the general ergodic case.. Convergence of stochastic processes." IEEE Trans. New York.J.." IEEE Trans. "Cutting and stacking. [62] I. English translation: Select. 49(1979). "On the Bernoulli nature of systems with some hyperbolic structure. English translation: Holden-Day. 263-6. Van Nostrand. [65] P. Shields. "If a two-point extension of a Bernoulli shift has an ergodic square. Topics in ergodic theory Cambridge Univ. A method for constructing stationary processes. AN SSSR. N. Shields." Ann. [55] W." Ann. 20(1992). "The entropy theorem via coding bounds. [57] Phelps. 84 (1977).." Problems of Control and Information Theory. . to appear. 269-277. Th. Probab. 403-409. 1199-1203. Shields. 1984. Sys. Rohlin.. 1960." (in Russian). Inform. Information and information stability of random variables and processes. Sbornik.. Cambridge." Dokl. [64] P. "Almost block independence. Rudolph. [70] P. 283-291. Statist." IEEE Trans. [71] P. Shields. Petersen. "On the probability of large deviations of random variables. Ergodic theory. [73] P. "Stationary coding of processes. [67] P. Inform. 1-4. Akad. 1966 [58] M. 1645-1647. Th.242 BIBLIOGRAPHY [54] D. Shields. 119-123. Cambridge. Inform. Press. [61] D. The theory of Bernoulli shifts." Israel J." Ergodic Th. Inform." Monatshefte fiir Mathematik. [60] V. [63] P. 19(1990). [56] K. "The ergodic and entropy theorems revisited. Th. "A 'general' measure-preserving transformation is not mixing. Springer-Verlag. "Universal almost sure data compression using Markov types. "Weak and very weak Bernoulli partitions. fur Wahr. Sanov. Nauk SSSR.. "Entropy and prefixes... [68] P. 60(1948). Probab." Z. 349-351. "String matching . of Math. Pinsker. IT-37(1991). 42(1957). IT-25(1979). 133-142. 11-44. then it is Bernoulli. Univ. 1(1961). Shields. 1605-1617. 213-244. 1964. [72] P. Th." Mathematical Systems Theory. (In Russian) Vol. and Dynam. IT-37(1991). 7 of the series Problemy Peredai Informacii.

. Henri Poincaré. Thouvenot. "Fixed data base version of the Lempel-ziv data compression algorithm. Inst. "Coding of sources with unknown statistics-Part I: Probability of encoding error. Ziv and A. 93-98. 521-545. Phillips. Ziv. [76] P. 25(1989). IT-35(1989). Smorodinsky. IT- 39(1993)." IEEE Trans. Syst. Th." Math. Ergodic theory. Inform. Th. [90] J. 54-58. "An ergodic process of zero divergence-distance from the class of all stationary processes." IEEE Trans." Contemporary Math. IT- 24(1978). Ziv. of Theor." J. Prob. Walters. 384-394. 732-6. Steele. Th. Inform. Th. of Theor. 337-343. of Math. New York.. Wyner and J. "Universal redundancy rates don't exist. 210-203. Shields and J. [87] A. Prob. [81] V. Willems. IT-35(1989). Strassen. "The sliding-window Lempel-Ziv algorithm is asymptotically optimal. [77] P. 125-1258. 878-880. Wyner and J. [85] F.Varadhan." IEEE Trans.." IEEE Trans. Info." J. 36(1965). [80] M." IEEE Trans. E. Smorodinsky. Ziv. 1647-1659. Ziv." IEEE Trans. IT-18(1972). [86] A.. Prob. Lernpel. "The existence of probability measures with given marginals. [91] J. NYU. Th. 520-524. 135(1992) 373-376... "Kingman's subadditive ergodic theorem." Ann. "A partition on a Bernoulli shift which is not 'weak Bernoulli'. Th. of Math. Inform. Szpankowski. "Waiting times: positive and negative results on the Wyner-Ziv problem. "Universal data compression and repetition times. "Entropy zero x Bernoulli processes are closed in the a-metric. Inform.. Part II: Distortion relative to a fidelity criterion". "Finitary isomorphism of m-dependent processes.. [82] W.-P. 6(1993). 423-439. Shields." Ann. Wyner and J. Sciences. Shields. vol. "A universal algorithm for sequential data compression. 1975. M." J. Probab. 405-412.. [89] S." Proc. New York." Ann. 3(1975). Springer-Verlag.. 5(1971)." IEEE Trans. [88] A. An introduction to ergodic theory. Inform. Th. "Coding theorems for individual sequences.. Th. IEEE Trans. . IT-23(1977). [92] J. IT-39(1993). "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. 82(1994). of Theor. Ziv. [78] M. [79] M. 1982. 499-519. and S. J. Asymptotic properties of data compression and suffix trees. [83] J..BIBLIOGRAPHY 243 [74] P. submitted. Statist. "Two divergence-rate counterexamples. Th. Info. [84] P. IT-37(1991).. a seminar Courant Inst. Shields. Moser. [75] P. Inform. of the IEEE. Xu.. 6(1993).. 872-877.

.

71 code sequence. 24. 83 blowing-up property (BUP). 9. 194 almost blowing-up. 115 disjoint column structures.Index a-separated. 110 copy. 52 block code. 107 column partitioning. 107 of a column structure. 109 built-up. 108 uniform. 226 entropy. k)-separated structures. 115 cutting into copies. 195. 68 8-blowup. 215 column. 174 in probability. 74. 108 width. 84. 69 built-up set bound. 58 admissible. 189 (a. 24. 211 Barron's code-length bound. 108 estimation of distributions. 8 block-independent process. 179. 70 (1 — E)-built-up. 26. 107 top. 107 subcolumn. 72 codeword. 11 building blocks. K)-strongly-separated. 69 built by cutting and stacking. 24. 233 addition law. 190 columnar representation. 111 column structure: top. 125 base (bottom) of a column. 212 almost blowing-up (ABUP). 72. 104 block-to-stationary construction. 8. 107 upward map. 107 name. 121 rate of. 235 (8. 103. 123 code. 110 transformation defined by. 107 level. 187 absolutely regular. 1 asymptotic equipartition (AEP). 188 upward map. 107 complete sequences. 10 concatenation representation. 71 length function. 24. 184 almost block-independence (ABI). 108 width (thickness). 8 block coding of a process. 107 cutting a column. 24. 55 B-process. 104. 71 coding block. 194 Borel-Cantelli principle. E)-blowing-up. 121 per-symbol. 138 circular k-type. 58 . 110 concatenated-block process. 109 column structure. 108 width distribution. 74 codebook. 105 complete sequences of structures. 235 blowup. 174 in d. 108 (a. 185. 107 height (length). 195. 7 truncation. 68 blowup bound. 211 block-structure measure. 109 disjoint columns. 108 support. 27 conditional E -independence. 215 245 faithful (noiseless). 195. 107 base (bottom). 24. 235 alphabet. 108 binary entropy function. 121 n-code. 107 labeling.

4 . 223 for processes. 1 continuous distribution. 99 mixing. 48 empirical entropy. 48 Markov. 80 entropy. 133 entropy interpretations: covering exponent. 90 rotation processes. 143. 176. 67 E-independence. 103. 186. 92 for sequences. INDEX empirical Markov entropy. 16 totally. 98 direct product. 67 cutting and stacking. 89 for stationary processes. 138 estimation. 222 of a partition. 17 start. 33 of von Neumann. 42 maximal. 90 d'-distance.i. 147 distribution of a process. 129 entropy-typical sequences. 75. 92 for ergodic processes. 166 for frequencies. 59 of a distribution. 89. 2 d-admissible. 58 a random variable. 45 eventually almost surely. 100 subset bound. 144 n-th order. 89 realizing tin 91 d-topology. 164 consistency conditions. 45 "good set" form. 11 expected return time.i. 51 empirical. 58 entropy rate. 73 entropy properties: d-limits. 31 distinct words. 224 ergodic. 1 of blocks in blocks. 185 encoding partition.246 conditional invertibility. 49 decomposition. 15 components. 168 divergence. 64 empirical measure (limiting). 46 number. 57 inequality. 97 completeness. 6 shifted empirical block. 132 Ziv entropy. 56 topological. 60 upper semicontinuity. 51. 76 empirical distribution. 21 ergodic theorem (of Birkhoff). 43. 218 covering. 68 prefix codes. 184 d-distance . 190 cylinder set. 89 for i. 109 standard representation. 65 empirical joinings. 96 d-far-apart Markov chains. 100 typical sequences.d. 214 coupling. 43 ergodicity and d-limits. 87 d-distance properties. 91 definition (joining). 63 . 62 concatenated-block processes. 91 definition for processes. 20 process. 62 Markov processes. 100 ergodicity. processes.d. 42 of Kingman. 94 empirical universe. 51 a process. processes. 138 first-order entropy. 24 exponential rates for entropy. 67 cardinality bound. 59 a partition. 51 conditional. 99 finite form. 88 entropy of i. 46 almost-covering principle. 218 coupling measure. 166 Elias code. 78 entropy theorem. 112 cyclic merging rule. 63 of overlapping k-blocks. 57 doubly-infinite sequence model. 102 relationship to entropy.

195 finitary process. 73 Kronecker's theorem. 235 measure preserving. 20 name of a column. 34 (1 — 0-packing. 25 infinitely often. 190 mixing. 80 finite energy. 2)-name. 44 (k. processes. 17 joint input-output measure.i. 6 k-th order. 6 generalized renewal process. 14 merging and separation. 223 first-order blocks. 114 M-fold. 7 function. 102 faithful code. 221 finite form. 136 linear mass distribution. 132 upper bound. 11 instantaneous coder. 104 log-sum inequality. 8 approximation theorem. 168 overlapping-block process. 4. almost surely. 34. 7 invariant measure. 100 and Markov. 15 irreducible Markov. 159 finite-state process. 5 source. 25 Kolmogorov measure. 6 concatenated-block process. 26 generalized return-time picture. 4 set. 138 partial. 229 247 Kac's return-time formula. 3. 91 definition of d. 84 fillers.INDEX extremal measure. 185 nonexistence of too-good codes. 10 overlapping to nonoverlapping. 115 induced process. 18 join of measures. 40 strong-packing. 10 packing. 104 first-order rate bound. 17 cutting and stacking. 65 Markov inequality. 151 finitary coding. 122 nonoverlapping-block process. 41 packing lemma. 110 Kolmogorov partition. 14 nonadmissibility. 18 and d-limits. 28 independent partitions. 43 typical. 121 filler or spacer blocks. 27 transformation. 41 two-packings. 131 convergence theorem. 195 finite coder. E)-typical sequences. 186.d. 166 frequency. 116 repeated. 175. 10 finitely determined. 212 stacking of a structure: onto a column. 131 LZW algorithm. 139 separated. 4 complete measure model. 198. 91 join of partitions. 107 of a column level. 62 matching. 3 Kraft inequality. 222 i. 7 finite coding. 4 and complete sequences. 107 (T. 6 Markov order theorem. 41 . 215 n-blocking. 2. 90 mapping interpretation. 34 stopping-time. 45 function of a Markov chain. 115 onto a structure. 117 extension. 11 Markov chain. 71 faithful-code sequence. 13 Kolmogorov representation. 133 simple LZ parsing. 76. 22 Lempel-Ziv (LZ) algorithm. 3 two-sided model.

59 finite-energy. 33 strongly-covered. 8. 67 and d-distance. 7 with induced spacers. 1. 70 (1 — 0-strongly-covered. 14 product measure. n)-blocking. 211 entropy. 27. 109 universal code. 8 output measure. 145 existence theorem. 59 subinvariant function. 82 time-zero coder. 229 Markov process. 132 of a column structure. 120 stationary. 65 typical sequence. 167 equivalence. 81 and mixing. 6 N-stationary. 3. 215 (S. 0-strongly-covered. 66 Poincare recurrence theorem. 180 stacking columns. 188 full k-block universe. 24 process. 12 N-th term process. 28 stationary coding. 150 representation. 0-strongly-packed. 74. 158 process. 21 process. 138 subadditivity. 13 too-long word. 3 (T . 85 strong cover.248 pairwise k-separated. 6 in-dependent. 169. 175 - INDEX and upward maps. 215 i. 166 splitting index. 44. 170. 43. 89 Pinsker's inequality. 34 strongly-packed. 64 class. 190 . 167. 167 equivalence. 32. 180 splitting-set lemma. 122 sequence. 165 recurrence-time function. 229 stopping time. 23 prefix. 21 d-far-apart rotations. 5 Markov. 159 (S. 73 code sequence. 63. 3 shifted. 166 for frequencies. 40 string matching. 148 return-time distribution. 63 class. 110 B-process. 121 trees. 4 shift. 55 shift transformation. 13 and complete sequences. 64 class bound. block parsings. 14 per-letter Hamming distance. 26 rotation (translation).i. 40 interval. 68. 29 type. 45. 5 Vf-mixing. 46 (L. 8 rate function for entropy. 30 transformation. 84 stationary process. 156 totally ergodic. 109 start position. 96 randomizing the start. 24 picture.P) process. 205 partition. 10 regenerative. 7 speed of convergence. 31 sliding-block (-window) coding. 30. 63 k-type. 150 (K. 72 code. 6 input-output measure. 67 too-soon-recurrent representation. 63. 179 skew product. 21 tower construction. 154 repeated words. 13 distribution. 3 transformation/partition model.. 151 too-small set principle. 7 and entropy. n)-independent.d. 121 universe (k-block). 101 Shannon-McMillan-Breiman theorem. 170 with gap.

87. 233 weak topology. 7 words. 200 weak Bernoulli (WB). 87 well-distributed. 179. 133 249 . 88 weak convergence. 148 repeated. 195 window half-width. 200 approximate-match. 148 Ziv entropy.INDEX very weak Bernoulli (VWB). 119 window function. 232 waiting-time function. 104 distinct.

Sign up to vote on this title
UsefulNot useful

Master Your Semester with Scribd & The New York Times

Special offer for students: Only $4.99/month.

Master Your Semester with a Special Offer from Scribd & The New York Times

Cancel anytime.