The Ergodic Theory of Discrete Sample Paths

Paul C. Shields

Graduate Studies in Mathematics
Volume 13

American Mathematical Society

Editorial Board: James E. Humphreys, David Sattinger, Julius L. Shaneson, Lance W. Small (chair)
1991 Mathematics Subject Classification. Primary 28D20, 28D05, 94A17; Secondary 60F05, 60G17, 94A24.
ABSTRACT. This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars or self study for students or faculty with some background in measure theory and probability theory.

Library of Congress Cataloging-in-Publication Data

Shields, Paul C. The ergodic theory of discrete sample paths / Paul C. Shields. p. cm. — (Graduate studies in mathematics, ISSN 1065-7339; v. 13) Includes bibliographical references and index. ISBN 0-8218-0477-4 (alk. paper) 1. Ergodic theory. 2. Measure-preserving transformations. 3. Stochastic processes. I. Title. II. Series. QA313.555 1996 96-20186 519.2'32—dc20 CIP

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given. Republication, systematic copying, or multiple reproduction of any material in this publication (including abstracts) is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Assistant to the Publisher, American Mathematical Society, P.O. Box 6248, Providence, Rhode Island 02940-6248. Requests can also be made by e-mail to reprint-permission@ams.org.

© Copyright 1996 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Printed in the United States of America. The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Printed on recycled paper.

Contents
Preface

I Basic concepts
I.1 Stationary processes
I.2 The ergodic theory model
I.3 The ergodic theorem
I.4 Frequencies of finite blocks
I.5 The entropy theorem
I.6 Entropy as expected value
I.7 Interpretations of entropy
I.8 Stationary coding
I.9 Process topologies
I.10 Cutting and stacking


II Entropy-related properties
II.1 Entropy and coding
II.2 The Lempel-Ziv algorithm
II.3 Empirical entropy
II.4 Partitions of sample paths
II.5 Entropy and recurrence times
III Entropy for restricted classes
III.1 Rates of convergence
III.2 Entropy and joint distributions
III.3 The d-admissibility problem
III.4 Blowing-up properties
III.5 The waiting-time problem
IV B-processes
IV.1 Almost block-independence
IV.2 The finitely determined property
IV.3 Other B-process characterizations
Bibliography

Index


Preface

This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The focus is on the combinatorial properties of typical finite sample paths drawn from a stationary, ergodic process. A primary goal, only partially realized, is to develop a theory based directly on sample path arguments, with minimal appeals to the probability formalism. A secondary goal is to give a careful presentation of the many models for stationary finite-alphabet processes that have been developed in probability theory, ergodic theory, and information theory. The book is designed for use in graduate courses, seminars or self study for students or faculty with some background in measure theory and probability theory.

The two basic tools for a sample path theory are a packing lemma, which shows how "almost" packings of integer intervals can be extracted from coverings by overlapping subintervals, and a counting lemma, which bounds the number of n-sequences that can be partitioned into long blocks subject to the condition that most of them are drawn from collections of known size. These two simple ideas, introduced by Ornstein and Weiss in 1980, immediately yield the two fundamental theorems of ergodic theory, namely, the ergodic theorem of Birkhoff and the entropy theorem of Shannon, McMillan, and Breiman. The packing and counting ideas yield more than these two classical results, however, for in combination with the ergodic and entropy theorems and further simple combinatorial ideas they provide powerful tools for the study of sample paths. Much of Chapter I and all of Chapter II are devoted to the development of these ideas.

The classical process models are based on independence ideas and include the i.i.d. processes, Markov chains, instantaneous functions of Markov chains, and renewal and regenerative processes. An important and simple class of such models is the class of concatenated-block processes, that is, the processes obtained by independently concatenating fixed-length blocks according to some block distribution and randomizing the start. Related models are obtained by block coding and randomizing the start, or by stationary coding, an extension of the instantaneous function concept which allows the function to depend on both past and future. All these models and more are introduced in the first two sections of Chapter I. Further models, including the weak Bernoulli processes and the important class of stationary codings of i.i.d. processes, are discussed in Chapter III and Chapter IV.

Of particular note in the discussion of process models is how ergodic theorists think of a stationary process, namely, as a measure-preserving transformation on a probability space, together with a partition of the space. This point of view, introduced in Section I.2, leads directly to Kakutani's simple geometric representation of a process in terms of a recurrent event, a representation that not only simplifies the discussion of stationary renewal and regenerative processes but generalizes these concepts to the case where times between recurrences are not assumed to be independent.

A further generalization, given in Section I.10, leads to a powerful method for constructing examples known as cutting and stacking.

The book has four chapters. The first chapter, which is half the book, is devoted to the basic tools, including the Kolmogorov and ergodic theory models for a process, the ergodic theorem and its connection with empirical distributions, the entropy theorem and its interpretations, a method for converting block codes to stationary codes, the weak topology and the even more important d-metric topology, and the cutting and stacking method.

Properties related to entropy which hold for every ergodic process are discussed in Chapter II. These include entropy as the almost-sure bound on per-symbol compression, Ziv's proof of asymptotic optimality of the Lempel-Ziv algorithm via his interesting concept of individual sequence entropy, the relation between entropy and partitions of sample paths into fixed-length blocks, or partitions into distinct blocks, or partitions into repeated blocks, and the connection between entropy and recurrence times and entropy and the growth of prefix trees.

Properties related to entropy which hold only for restricted classes of processes are discussed in Chapter III, including rates of convergence for frequencies and entropy, the estimation of joint distributions in both the variational metric and the d-metric, a connection between entropy and d-neighborhoods, and a connection between entropy and waiting times. Several characterizations of the class of stationary codings of i.i.d. processes are given in Chapter IV, including the almost block-independence, finitely determined, very weak Bernoulli, and blowing-up characterizations. Some of these date back to the original work of Ornstein and others on the isomorphism problem for Bernoulli shifts, although the almost block-independence and blowing-up ideas are more recent.

Many standard topics from ergodic theory are omitted or given only cursory treatment, in part because the book is already too long and in part because they are not close to the central focus of this book. These topics include topological dynamics, smooth dynamics, K-processes, general ergodic theorems, combinatorial number theory, and continuous time and/or space theory. Likewise little or nothing is said about such standard information theory topics as rate distortion theory, redundancy, divergence-rate theory, channel theory, algebraic coding, random fields, and multi-user theory.

Some specific stylistic guidelines were followed in writing this book. Proofs are sketched first, then given in complete detail. With a few exceptions, the sections in Chapters II, III, and IV are approximately independent of each other, conditioned on the material in Chapter I. Theorems and lemmas are given names that include some information about content, for example, the entropy theorem rather than the Shannon-McMillan-Breiman theorem. Likewise, suggestive names are used for concepts, such as building blocks (a name proposed by Zoli Györfi) and column structures (as opposed to gadgets.) Also numbered displays are often (informally) given names similar to those used for LaTeX labels. Exercises that extend the ideas are given at the end of most sections; these range in difficulty from quite easy to quite hard. Only those references that seem directly related to the topics discussed are included.

This book is an outgrowth of the lectures I gave each fall in Budapest from 1989 through 1994, both as special lectures and seminars at the Mathematics Institute of the Hungarian Academy of Sciences and as courses given in the Probability Department of Eötvös Loránd University. The audiences included ergodic theorists, information theorists, and probabilists, as well as combinatorialists and people from engineering and other mathematics disciplines, ranging from undergraduate and graduate students through post-docs and junior faculty to senior professors and researchers. In addition, lectures on parts of the penultimate draft of the book were presented at the Technical University of Delft in the fall of 1995.

I am grateful to the Mathematics Institute of the Hungarian Academy for providing me with many years of space in a comfortable and stimulating environment, as well as to the Institute and to the Probability Department of Eötvös Loránd University for the many lecture opportunities.

I am indebted to many people for assistance with this project. Imre Csiszár and Katalin Marton not only attended most of my lectures but critically read parts of the manuscript at all stages of its development and discussed many aspects of the book with me. It was Imre's suggestion that led me to include the discussions of renewal and regenerative processes, and his criticisms led to many revisions of the cutting and stacking discussion. Much of the material in Chapter III as well as the blowing-up ideas in Chapter IV are the result of joint work with Kati. In addition I had numerous helpful conversations with Benjy Weiss, Don Ornstein, Dave Neuhoff, Aaron Wyner, and Jacob Ziv. Others who contributed ideas and/or read parts of the manuscript include Gábor Tusnády, Bob Burton, Gusztáv Morvai, György Michaletzky, Jacek Serafin, and Nancy Morrison. Last, but far from least, I am much indebted to my two Toledo graduate students, Shaogang Xu and Xuehong Li, who learned ergodic theory by carefully reading almost all of the manuscript at each stage of its development, in the process discovering numerous errors and poor ways of saying things.

No project such as this can be free from errors and incompleteness. A list of errata as well as a forum for discussion will be available on the Internet at the following web address.

http://www.math.utoledo.edu/~pshields/ergodic.html

Much of the work for this project was supported by NSF grants DMS-8742630 and DMS-9024240 and by a joint NSF-Hungarian Academy grant, MTA-NSF project 37. My initial lectures in 1989 were supported by a Fulbright lectureship.

This book is dedicated to my son, Jeffrey.

Paul C. Shields
Toledo, Ohio
March 22, 1996


Chapter I
Basic concepts

Section I.1 Stationary processes.

A (discrete-time, stochastic) process is a sequence X_1, X_2, ..., of random variables defined on a probability space (X, Σ, μ). In this book the focus is on finite-alphabet processes. The process has alphabet A if the range of each X_i is contained in A. Unless stated otherwise, "process" means a discrete-time finite-alphabet process. Also, unless it is clear from the context or explicitly stated otherwise, "measure" will mean "probability measure" and "function" will mean "measurable function" with respect to some appropriate σ-algebra on a probability space. The cardinality of a finite set A is denoted by |A|. The sequence a_m, a_{m+1}, ..., a_n, where each a_i ∈ A, is denoted by a_m^n. The set of all such a_m^n is denoted by A_m^n; when m = 1, A^n is used.

The k-th order joint distribution of the process {X_k} is the measure μ_k on A^k defined by the formula

    μ_k(a_1^k) = Prob(X_1^k = a_1^k),   a_1^k ∈ A^k.

When no confusion will result the subscript k on μ_k may be omitted. The set of joint distributions {μ_k: k ≥ 1} is called the distribution of the process. The sequence cannot be completely arbitrary, for implicit in the definition of process is that the following consistency condition must hold for each k ≥ 1,

(1)    μ_k(a_1^k) = Σ_{a_{k+1}} μ_{k+1}(a_1^{k+1}),   a_1^k ∈ A^k.

The distribution of a process is thus a family of probability distributions, one for each k. The distribution of a process can, of course, also be defined by specifying the start distribution μ_1 and the successive conditional distributions

    μ_{k|k-1}(a_k | a_1^{k-1}) = Prob(X_k = a_k | X_1^{k-1} = a_1^{k-1}) = μ_k(a_1^k) / μ_{k-1}(a_1^{k-1}),   a_1^k ∈ A^k.

A process is considered to be defined by its joint distributions, that is, the particular space on which the functions X_n are defined is not important; all that really matters in probability theory is the distribution of the process. Thus one is free to choose the underlying space (X, Σ, μ) on which the X_n are defined in any convenient manner, as long as the joint distributions are left unchanged.
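The consistency condition (1) is easy to check numerically for a concrete family of joint distributions. The following Python sketch does this for the product measures of an i.i.d. example; the alphabet and the first-order distribution are illustrative choices, not taken from the text.

    from itertools import product

    A = ['0', '1']                      # illustrative two-letter alphabet
    p = {'0': 0.3, '1': 0.7}            # first-order distribution of an i.i.d. example

    def mu(seq):
        """k-th order joint distribution mu_k(a_1^k); here the product measure."""
        prob = 1.0
        for a in seq:
            prob *= p[a]
        return prob

    def check_consistency(k, tol=1e-12):
        """Verify mu_k(a_1^k) = sum over a_{k+1} of mu_{k+1}(a_1^{k+1})."""
        for seq in product(A, repeat=k):
            marginal = sum(mu(seq + (a,)) for a in A)
            if abs(marginal - mu(seq)) > tol:
                return False
        return True

    print(all(check_consistency(k) for k in range(1, 6)))   # True

Any family {mu_k} produced by marginalizing a single process passes this test; an arbitrary family of measures on the sets A^k need not.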

An important instance of this idea is the Kolmogorov model for a process, which represents it as the sequence of coordinate functions on the space of infinite sequences drawn from the alphabet A, together with a Borel measure constructed from the consistency conditions (1). The rigorous construction of the Kolmogorov representation is carried out as follows.

Let A^∞ denote the set of all infinite sequences x = {x_i}, x_i ∈ A, 1 ≤ i < ∞. The cylinder set determined by a_m^n, denoted by [a_m^n], is the subset of A^∞ defined by

    [a_m^n] = {x: x_i = a_i, m ≤ i ≤ n}.

Let C_n be the collection of cylinder sets defined by sequences that belong to A^n, let C = ∪_n C_n, and let R_n = R(C_n) denote the ring generated by C_n. The sequence {R_n} is increasing, and its union R = R(C) is the ring generated by all the cylinder sets. Let Σ be the σ-algebra generated by the cylinder sets C. The collection Σ can also be defined as the σ-algebra generated by R, or by the compact sets. The members of Σ are commonly called the Borel sets of A^∞. Two important properties are summarized in the following lemma.

Lemma I.1.1
(a) Each set in R_n is a finite disjoint union of cylinder sets from C_n.
(b) If {B_n} ⊂ R is a decreasing sequence of sets with empty intersection then there is an N such that B_n = ∅, n ≥ N.

Proof. The proof of (a) is left as an exercise. Part (b) is an application of the finite intersection property for sequences of compact sets, since the space A^∞ is compact in the product topology and each set in R is closed. Thus the lemma is established.

For each n ≥ 1 the coordinate function X̂_n: A^∞ → A is defined by X̂_n(x) = x_n, x ∈ A^∞. The Kolmogorov representation theorem states that every process with alphabet A can be thought of as the coordinate function process {X̂_n}, together with a Borel measure on A^∞.

Theorem I.1.2 (Kolmogorov representation theorem.)
If {X_n} is a process with finite alphabet A, then there is a unique Borel probability measure μ on A^∞ for which the sequence of coordinate functions {X̂_n} has the same distribution as {X_n}. In particular, if {μ_k} is a sequence of measures for which the consistency conditions (1) hold, then there is a unique Borel measure μ on A^∞ such that μ([a_1^k]) = μ_k(a_1^k), for each k and each a_1^k.

Proof. The process {X_n} defines a set function μ on the collection C of cylinder sets by the formula

(2)    μ([a_m^n]) = Prob(X_i = a_i, m ≤ i ≤ n).

The consistency conditions (1), together with Lemma I.1.1(a), imply that μ extends to a finitely additive set function on the ring R = R(C) generated by the cylinder sets. The finite intersection property, Lemma I.1.1(b), implies that μ can be extended to a unique countably additive measure on the σ-algebra Σ generated by R.

Equation (2) translates into the statement that {X̂_n} and {X_n} have the same joint distribution. This proves Theorem I.1.2.

The sequence of coordinate functions {X̂_n} on the probability space (A^∞, Σ, μ) will be called the Kolmogorov representation of the process {X_n}. The measure μ will be called the Kolmogorov measure of the process {X_n}, or the Kolmogorov measure of the sequence {μ_k}. As noted earlier, unless stated otherwise, a process is considered to be defined by its joint distributions, that is, the particular space on which the functions X_n are defined is not important. The Kolmogorov model is simply one particular way to define a space and a sequence of functions with the given distributions.

Another useful model is the complete measure model, that is, one in which subsets of sets of measure 0 are measurable. The Kolmogorov measure on A^∞ extends to a complete measure on the completion of the Borel sets Σ relative to μ. Completion has no effect on joint distributions, and uniqueness is preserved; in particular, different processes have Kolmogorov measures with different completions. Many ideas are easier to express and many results are easier to establish in the framework of complete measures; thus, whenever it is convenient to do so the complete Kolmogorov model will be used in this book, though often without explicitly saying so. In particular, "Kolmogorov measure" is taken to refer to either the measure defined by Theorem I.1.2, or to its completion, whichever is appropriate to the context. Process and measure language will often be used interchangeably; for example, "let μ be a process" means "let μ be the Kolmogorov measure for some process {X_n}."

A process is stationary, if the joint distributions do not depend on the choice of time origin, that is,

(3)    Prob(X_i = a_i, m ≤ i ≤ n) = Prob(X_{i+1} = a_i, m ≤ i ≤ n),

for all m, n, and a_m^n. The phrase "stationary process" will mean a stationary, discrete-time, finite-alphabet process, unless stated otherwise. Stationary finite-alphabet processes serve as models in many settings of interest, including physics, data transmission and storage, and statistics.

The statement that a process {X_n} with finite alphabet A is stationary translates into the statement that the Kolmogorov measure μ is invariant under the shift transformation T. The (left) shift T: A^∞ → A^∞ is defined by (Tx)_n = x_{n+1}, n ≥ 1, which is expressed in symbolic form by

    x = (x_1, x_2, x_3, ...)  ⟹  Tx = (x_2, x_3, ...).

The shift transformation is continuous relative to the product topology on A^∞, and defines a set transformation T^{-1} by the formula

    T^{-1}B = {x: Tx ∈ B}.

Note that

    T^{-1}[a_m^n] = [b_{m+1}^{n+1}],  where b_{i+1} = a_i, m ≤ i ≤ n,

so that the set transformation T^{-1} maps a cylinder set onto a cylinder set. It follows that T^{-1}B is a Borel set for each Borel set B, that is, the shift T is Borel measurable, which means that T is a Borel measurable mapping. The condition that the process be stationary is just the condition that μ(B) = μ(T^{-1}B) for each cylinder set B, which is, in turn, equivalent to the condition that μ(B) = μ(T^{-1}B) for each Borel set B. The condition, μ(B) = μ(T^{-1}B), B ∈ Σ, is usually summarized by saying that T preserves the measure μ or, alternatively, that μ is T-invariant.

The intuitive idea of stationarity is that the mechanism generating the random variables doesn't change in time. It is often convenient to think of a process as having run for a long time before the start of observation, so that only nontransient effects, that is, those that do not depend on the time origin, are seen. Such a process is well interpreted by another model, called the doubly-infinite sequence model or two-sided model. Let Z denote the set of integers and let A^Z denote the set of doubly-infinite A-valued sequences, that is, the set of all sequences x = {x_n}, where each x_n ∈ A and n ∈ Z ranges from −∞ to ∞. The shift T: A^Z → A^Z is defined by (Tx)_n = x_{n+1}, n ∈ Z. The concept of cylinder set extends immediately to this new setting, and T preserves a Borel probability measure μ if and only if the sequence of coordinate functions {X̂_n} is a stationary process on the probability space (A^Z, Σ, μ). In particular, if {μ_k} is a set of measures for which the consistency conditions (1) and the stationarity conditions (3) hold, then there is a unique T-invariant Borel measure μ on A^Z such that μ([a_1^k]) = μ_k(a_1^k), for each k ≥ 1 and each a_1^k. The projection of μ onto A^∞ is, of course, the Kolmogorov measure on A^∞ for the process defined by {μ_k}. The latter model will be called the two-sided Kolmogorov model of a stationary process. Note that the shift T for the two-sided model is invertible, that is, T is one-to-one and onto and T^{-1} is measure preserving. (The two-sided shift is, in fact, a homeomorphism on the compact space A^Z.)

From a physical point of view it does not matter in the stationary case whether the one-sided or two-sided model is used. Proofs are sometimes simpler if the two-sided model is used, for invertibility of the shift on A^Z makes things easier. In some cases where such a simplification is possible, only the proof for the invertible case will be given; such results will then apply to the one-sided model in the stationary case.

In summary, for stationary processes there are two standard representations, one as the Kolmogorov measure μ on A^∞, the other as the shift-invariant extension to the space A^Z. In the stationary case, "Kolmogorov measure" is taken to refer to either the one-sided measure or its completion, or to the two-sided measure or its completion, as appropriate to the context.

Remark I.1.3
Much of the success of modern probability theory is based on use of the Kolmogorov model, for asymptotic properties can often be easily formulated in terms of subsets of the infinite product space A^∞. In addition, the model provides a concrete uniform setting for all processes; different processes with the same alphabet A are distinguished by having different Kolmogorov measures on A^∞.
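The stationarity condition (3) says that the probability of a block does not depend on its starting time. The sketch below checks this on all 3-blocks for a two-state Markov chain started from a distribution π with πM = π; the matrix M and the vector π are illustrative choices, not taken from the text.

    import numpy as np
    from itertools import product

    M = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
    pi = np.array([0.8, 0.2])           # satisfies pi M = pi

    def prob_block(start, seq):
        """Prob(X_start = seq[0], ..., X_{start+len-1} = seq[-1]), start >= 1."""
        dist = pi @ np.linalg.matrix_power(M, start - 1)   # distribution of X_start
        p = dist[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= M[a, b]
        return p

    # stationarity (3): block probabilities do not depend on the starting time
    for seq in product([0, 1], repeat=3):
        assert abs(prob_block(1, seq) - prob_block(2, seq)) < 1e-12
    print("shift-invariant on all 3-blocks")

If the start distribution is changed to one that does not satisfy πM = π, the assertion fails, which is the content of the stationarity discussion for Markov chains below.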

Remark I.1.4
While this book is primarily concerned with finite-alphabet processes, it is sometimes desirable to consider countable-alphabet processes, or even continuous-alphabet processes. For example, return-time processes associated with finite-alphabet processes are, except in trivial cases, countable-alphabet processes, and functions of uniformly distributed i.i.d. processes will be useful in Chapter 4. The Kolmogorov theorem extends easily to the countable case, as well as to the continuous-alphabet i.i.d. cases that will be considered. As will be shown in later sections, even more can be gained in the stationary case by retaining the abstract concept of a stationary process as a sequence of random functions with a specified joint distribution on an arbitrary probability space. When completeness is needed the complete model can be used, and, when invertibility is useful the two-sided model on A^Z can be used.

I.1.a Examples.

Some standard examples of finite-alphabet processes will now be presented.

Example I.1.5 (Independent processes.)
The simplest example of a stationary finite-alphabet process is an independent, identically distributed (i.i.d.) process. A sequence of random variables {X_n} is independent if

    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n)

holds for all n > 1 and all a_1^n ∈ A^n. It is identically distributed if Prob(X_n = a) = Prob(X_1 = a) holds for all n ≥ 1 and all a ∈ A. An independent process is stationary if and only if it is identically distributed. The i.i.d. condition holds if and only if the product formula

    μ_k(a_1^k) = ∏_{i=1}^{k} μ_1(a_i)

holds for all k. Such a measure is called a product measure defined by μ_1. Thus a process is i.i.d. if and only if its Kolmogorov measure is the product measure defined by some distribution on A.

Example I.1.6 (Markov chains.)
The simplest examples of finite-alphabet dependent processes are the Markov processes. A sequence of random variables, {X_n}, is a Markov chain if

(4)    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n | X_{n-1} = a_{n-1})

holds for all n ≥ 2 and all a_1^n ∈ A^n. A Markov chain is said to be homogeneous or to have stationary transitions if Prob(X_n = b | X_{n-1} = a) does not depend on n, in which case the |A| × |A| matrix M defined by

    M_{ab} = M(b|a) = Prob(X_n = b | X_{n-1} = a)

is called the transition matrix of the chain. The start distribution is the (row) vector μ_1 = {μ_1(a)}, where μ_1(a) = Prob(X_1 = a). The start distribution is called a stationary distribution of a homogeneous chain if the matrix product μ_1 M is equal to the row vector μ_1. If this holds, then a direct calculation shows that, for all n ≥ 1, Prob(X_n = a) = μ_1(a), a ∈ A, and, indeed, the process is stationary. The two conditions for stationarity are, first, that transitions be stationary, and, second, that μ_1 M = μ_1. Unless stated otherwise, "stationary Markov process" will mean a Markov chain with stationary transitions for which μ_1 M = μ_1 and for which μ_1(a) > 0 for all a ∈ A. The positivity condition rules out transient states and is included so as to avoid trivialities.

Example I.1.7 (Multistep Markov chains.)
A generalization of the Markov property allows dependence on k steps in the past. A process is called k-step Markov if

(5)    Prob(X_n = a_n | X_1^{n-1} = a_1^{n-1}) = Prob(X_n = a_n | X_{n-k}^{n-1} = a_{n-k}^{n-1})

holds for all n > k and for all a_1^n ∈ A^n, and the process is stationary if, in addition, Prob(X_n = a_n | X_{n-k}^{n-1} = a_{n-k}^{n-1}) does not depend on n as long as n > k. Note that probabilities for a k-th order stationary chain can be calculated by thinking of it as a first-order stationary chain with alphabet B = {a_1^k: μ_k(a_1^k) > 0}, where μ_k(a_1^k) = Prob(X_1^k = a_1^k), and with the |B| × |B| transition matrix M^(k) defined by

    M^(k)(b_1^k | a_1^k) = Prob(X_{k+1} = b_k | X_1^k = a_1^k),  if b_1^{k-1} = a_2^k;  0, otherwise.

The two conditions for stationarity are that transitions be stationary and that μ_k M^(k) = μ_k; see Exercise 2.

Example I.1.8 (Finite-state processes.)
Let A and S be finite sets. An A-valued process {Y_n} is said to be a finite-state process with state space S, if there is a finite-alphabet process {s_n}, with values in S, such that the pair process {X_n} = {(s_n, Y_n)} is a Markov chain, that is, if

(6)    Prob(s_n = s, Y_n = b | s_1^{n-1}, Y_1^{n-1}) = Prob(s_n = s, Y_n = b | s_{n-1})

holds for all n > 1. The Markov chain {X_n} is often referred to as the underlying Markov process or hidden Markov process associated with {Y_n}. Note that the finite-state process {Y_n} is stationary if the chain {X_n} is stationary. The pair process {X_n} is a Markov chain, but {Y_n} is not, in general. In information theory, following the terminology used in [15], what is here called a finite-state process is sometimes called a Markov source. In ergodic theory, any finite-alphabet process is called a finite-state process, and what is here called a finite-state process is called a function (or lumping) of a Markov chain. Here "Markov" will be reserved for processes that satisfy the Markov property (4) or its more general form (5), and "finite-state process" will be reserved for processes satisfying property (6).
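A stationary start distribution can be computed numerically as a left eigenvector of the transition matrix for the eigenvalue 1. The sketch below does this for an illustrative three-state matrix (the matrix is an assumption chosen for the example) and verifies the condition μ_1 M = μ_1.

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])

    # a stationary start distribution solves pi M = pi with the entries summing to 1;
    # one standard way to get it is the left eigenvector of M for eigenvalue 1
    vals, vecs = np.linalg.eig(M.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()

    print(pi)                        # stationary distribution
    print(np.allclose(pi @ M, pi))   # True: the chain started from pi is stationary

The same device applies to a k-step chain once it is rewritten as a first-order chain on the block alphabet B, with the transition matrix M^(k) described above.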

that is.1. when there is a nonnegative integer w such that = f (x) = f CO .B by the formula f (x) = F(x)o. Finite coding is sometimes called sliding-block or sliding-window coding. depends only on xn and the values of its w past and future neighbors. The stationary coder F and its time-zero coder f are connected by the formula = f (rx). defined for Borel subsets C of Bz by v(C) = The encoded process v is said to be a stationary coding of ti. Vx E Az . otherwise known as the per-symbol or time-zero coder. = f (T:1 4 x). Define the function f: Az i. through the function f. and the encoded process {Yn } defined by Yn = f (X. A Borel measurable mapping F: Az 1-. Suppose A and B are finite sets. 7 Example 1. that is. an idea most easily expressed in terms of the two-sided representation of a stationary process. In the special case when w = 0 the encoding is said to be instantaneous.). The word "coder" or "encoder" may refer to either F or f. In this case it is common practice to write f (x) = f (xw w ) and say that f (or its associated full coder F) is a finite coder with window half-width w. • • • 9 Xn. At time n the window will contain Xn—w.) The concept of finite-state process can be generalized to allow dependence on the past and future. The coder F transports a Borel measure IL on Az into the measure v = it o F -1 . that is. that is.71 Z E Z.. STATIONARY PROCESSES. n E Z. Thus the stationary coding idea can be expressed in terms of the sequence-to-sequence coder F. The encoded sequence y = F(x) is defined by the formula .1. Note that a stationary coding of a stationary process is automatically a stationary process.SECTION 1. the window is shifted to the right (that is. To determine yn+1 . n Yn = f (x'. the image y = F(x) is the sequence defined by y. or in terms of the sequence-to-symbol coder f.) is called an instantaneous function or simply a function of the {X n } process. • • • 1 Xn-f-w which determines the value y. for one can think of the coding as being done by sliding a window of width 2w ± 1 along x. and bringing in . sometimes called the full coder. the value y. f (x) is just the 0-th coordinate of y = F (x). x E AZ.9 (Stationary coding. A case of particular interest occurs when the time-zero coder f depends on only a finite number of coordinates of the input sequence x. The function f is called the time-zero coder associated with F.Bz is a stationary coder if F(TA X) = TB F(x). where TA and TB denotes the shifts on the respective two-sided sequence spaces Az and B z .„ with coder F. Xn—w+1. the contents of the window are shifted to the left) eliminating xn _ u.

The key property of finite codes is that the inverse image of a cylinder set of length n is a finite union of cylinder sets of length n + 2w. Such a finite coder f defines a mapping, also denoted by f, from A_{m-w}^{n+w} to B_m^n, namely, f(x_{m-w}^{n+w}) = y_m^n, where y_j = f(x_{j-w}^{j+w}), m ≤ j ≤ n. If F: A^Z → B^Z is the full coder defined by f, then

    F^{-1}([y_m^n]) = ∪ { [x_{m-w}^{n+w}] : f(x_{m-w}^{n+w}) = y_m^n },

so that, for any measure μ, cylinder set probabilities for the encoded measure ν = μ ∘ F^{-1} are given by the formula

(7)    ν([y_m^n]) = Σ_{f(x_{m-w}^{n+w}) = y_m^n} μ([x_{m-w}^{n+w}]).

A process {Y_n} is a finite coding of a process {X_n} if it is a stationary coding of {X_n} with finite width time-zero coder, that is, the time-zero coder is measurable with respect to the σ-algebra Σ(X_{n-w}^{n+w}) generated by the random variables X_{n-w}, X_{n-w+1}, ..., X_{n+w}, where w is the window half-width of the code. Note that a finite-state process is merely a finite coding of a Markov chain in which the window half-width is 0. A stationary coding of an i.i.d. process is called a B-process. More generally, a finite-alphabet process is called a B-process if it is a stationary coding of some i.i.d. process with finite or infinite alphabet. Much more will be said about B-processes in later parts of this book, especially in Chapter 4, where several characterizations will be discussed.

Example I.1.10 (Block codings.)
Another type of coding, called block coding, is frequently used in practice, and is often easier to analyze than is stationary coding. An N-block code is a function C_N: A^N → B^N, where A and B are finite sets. Such a code can be used to map an A-valued process {X_n} into a B-valued process {Y_n} by applying C_N to consecutive nonoverlapping blocks of length N, that is,

    Y_{jN+1}^{(j+1)N} = C_N(X_{jN+1}^{(j+1)N}),   j = 0, 1, 2, ....

The process {Y_n} is called the N-block coding of the process {X_n} defined by the N-block code C_N. The Kolmogorov measures, μ and ν, of the respective processes {X_n} and {Y_n}, are connected by the formula ν = μ ∘ F^{-1}, where y = F(x) is the (measurable) mapping defined by

    y_{jN+1}^{(j+1)N} = C_N(x_{jN+1}^{(j+1)N}),   j = 0, 1, 2, ....

If the process {X_n} is stationary, the N-block coding {Y_n} is not, in general, stationary; it is, however, N-stationary, that is, it is invariant under the N-fold shift T^N, where T = T_B, the shift on B^∞. Block coding, in general, destroys stationarity.

There is a simple way to convert an N-stationary process, such as the encoded process {Y_n}, into a stationary process. The process {Ỹ_n} is defined by selecting an integer u ∈ [1, N] according to the uniform distribution and defining

    Ỹ_i = Y_{u+i-1},   i = 1, 2, ....

This method is called "randomizing the start."
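The two operations just described, applying an N-block code to consecutive nonoverlapping blocks and then randomizing the start, can be sketched directly on a finite sample path. The 2-block code and the input sequence below are illustrative choices.

    import random

    def block_encode(x, code, N):
        """Apply an N-block code to consecutive nonoverlapping N-blocks of x."""
        y = []
        for j in range(len(x) // N):
            y.extend(code(tuple(x[j * N:(j + 1) * N])))
        return y

    def randomize_start(y, N):
        """Drop u-1 initial terms, where u is uniform on {1, ..., N}."""
        u = random.randint(1, N)
        return y[u - 1:]

    # illustrative 2-block code: swap the two symbols within each block
    swap = lambda block: (block[1], block[0])
    x = [0, 1, 1, 1, 0, 0, 1, 0]
    y = block_encode(x, swap, N=2)       # [1, 0, 1, 1, 0, 0, 0, 1]
    print(randomize_start(y, N=2))       # y itself, or y shifted by one place

The output process is 2-stationary before the random shift and stationary after it, which is the point of the construction.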

.. a structure which is inherited. an N-block code CN induces a block coding of a stationary process {Xn } onto an N-stationary process MI.1. 2. An alternative formulation of the same idea in terms of measures is obtained by starting with a probability measure p. then averaging to obtain the measure Z defined by the formula N-1 N il(B) = — i=o it*(0 -1 (7—i B)). 2. In general. {Y. (i) li.} and many useful properties of {X n } may get destroyed. } obtained (j-11N+1 be independent and have the distribution of Xr. . randomizing the start clearly generalizes to convert any N-stationary process into a stationary process.} is stationary.i. B E E. with distribution t. {Y. on A N . X N ) is a random vector with values in AN.} be the A-valued process defined by the requirement that the blocks . In Section 1.. STATIONARY PROCESSES.. j = 1. {Y. For example.} is i. Let {Y. by randomizing the start. with a random delay. The concatenated-block process is characterized by the following two conditions. which can then be converted to a stationary process. . block coding introduces a periodic structure in the encoded process. }. by the process {41. is expressed by the formula v ( c) = E v (T C). 9 between the Kolmogorov measures. y and î. The process { 171 from the N-stationary process {Yi I by randomizing the start is called the concatenatedblock process defined by the random vector Xr . i=o is the average of y.* o 0 -1 on A" via the mapping = x = w(l)w(2) • • • where JN X (j-1)N+1 = W(i). the sequence {Y:± ±(iiN oN±i : j = 1. There are several equivalent ways to describe such a process. NI such that. In summary. of the respective processes. .8. suppose X liv = (X1. the final stationary process {Yn } is not a stationary encoding of the original process {X.SECTION I.11 (Concatenated-block processes.. Example 1. . The method of that is. In random variable terms.1. and as such inherits many of its properties. forming the product measure If on (A N )". 2. together with its first N — 1 shifts.. .d. (ii) There is random variable U uniformly distributed on {1. a general procedure will be developed for converting a block code to a stationary code by inserting spacers between blocks so that the resulting process is a stationary coding of the original process.... } .} and { Y. transporting this to an N-stationary measure p. conditioned on U = u.) Processes can also be constructed by concatenating N-blocks drawn independently at random according a given distribution. with stationarity produced by randomizing the start.I. .

. (b) (ar . on A N and let { Y„. that is. (ar . n E Z.i) with probability i. The overlapping N-blocking of an i. (See Exercise 7. {Z n } and IKI. N) goes to (b Ç'. selects every N-th term from the {X.n } defined by setting Y. 1) with probability p. j). Yn = X (n-1)N+1. for n > 1.) Concatenated-block processes will play an important role in many later discussions in this book. b' 1 1 = { MaNbN. (a) If i < N .(4)/N.1. Example 1.. is the same as the concatenated-block process based on the block distribution A. n = ai.tSiv = MaNbi ri i= 1 Mb. Wn = ant X+1.d. BASIC CONCEPTS. i ± 1). 2. a function of a Markov chain.12 (Blocked processes. • • • 9 Xn+N-1)• The Z process is called the nonoverlapping N-block process determined by {X n }. +i A related process of interest {Yn }. defined by the following transition and starting rules. by Zn = (X(n-1)N+19 X(n-1)N+29 • • • 1 XnN)..r. while the W process is called the overlapping N-block process determined by {X J. represents a concatenated-block process as a finite-state process.d.} is Markov with transition matrix Mab. A third formulation. hence this measure terminology is consistent with the random variable terminology.i.i } will be Markov with transition matrix MN. called the N-th term process. The measure ii.10 CHAPTER I.i.(br). The process is started by selecting (ar . The process {Y.} process.b. that is. which is quite useful. then the overlapping N-block process is Markov with alphabet B = far: War) > 0} and transition matrix defined by Ma i. . If {Xn } is Markov with transition matrix M. N}. but overlapping clearly destroys independence. To describe this representation fix a measure p.} is i. i) can only go to (ar . then the nonoverlapping N-block process is Markov with alphabet B and transition matrix defined by N-1 1-4a'iv . if {X.d.) Associated with a stationary process {X n } and a positive integer N are two different stationary processes. then {Y. if Y. is called the concatenated-block process defined by the measure p.. In fact. on AN. Each is stationary if {X. however. The process defined by ii has the properties (i) and (ii).} is stationary. 0 if bN i —1 = ci 2 iv otherwise. defined. for each br (C) E AN. = (ar . or Markov process is Markov.} be the (stationary) Markov chain with alphabet S = AN x {1. the N-th power of M.i. then its nonoverlapping blockings are i. If the process {X. Note also that if {X n} is Markov with transition matrix Mab. .

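The constructions of Example I.1.11 and Example I.1.12 are both easy to act out on a sample path: draw i.i.d. N-blocks from a block distribution, concatenate them and randomize the start, then form the various blockings. The block distribution below is an illustrative choice, not taken from the text.

    import random

    def concatenated_block_sample(mu, N, num_blocks):
        """Concatenate i.i.d. N-blocks drawn from the block distribution mu
        (a dict mapping N-tuples to probabilities), then randomize the start."""
        blocks, weights = zip(*mu.items())
        path = []
        for _ in range(num_blocks):
            path.extend(random.choices(blocks, weights=weights)[0])
        u = random.randint(1, N)                 # uniform random start
        return path[u - 1:]

    def nonoverlapping_blocks(x, N):
        """Z_n = (X_{(n-1)N+1}, ..., X_{nN})."""
        return [tuple(x[j * N:(j + 1) * N]) for j in range(len(x) // N)]

    def overlapping_blocks(x, N):
        """W_n = (X_n, ..., X_{n+N-1})."""
        return [tuple(x[i:i + N]) for i in range(len(x) - N + 1)]

    def nth_term_process(x, N):
        """Y_n = X_{(n-1)N+1}."""
        return x[::N]

    # illustrative block distribution on A = {0, 1} with N = 3
    mu = {(0, 0, 1): 0.5, (1, 1, 0): 0.3, (0, 1, 0): 0.2}
    x = concatenated_block_sample(mu, N=3, num_blocks=4)
    print(x)
    print(nonoverlapping_blocks(x, 2))
    print(overlapping_blocks(x, 2))
    print(nth_term_process(x, 2))

The overlapping blocks share N − 1 coordinates with their neighbors, which is why overlapping destroys independence even when the underlying process is i.i.d.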
I.1.b Probability tools.

Two elementary results from probability theory will be frequently used, the Markov inequality and the Borel-Cantelli principle.

Lemma I.1.13 (The Markov inequality.)
Let f be a nonnegative, integrable function on a probability space (X, Σ, μ). If ∫ f dμ < cδ then f(x) < c, except for a set of measure at most δ.

Lemma I.1.14 (The Borel-Cantelli principle.)
If {C_n} is a sequence of measurable sets in a probability space (X, Σ, μ) such that Σ_n μ(C_n) < ∞, then for almost every x there is an N = N(x) such that x ∉ C_n, n ≥ N.

In general, a property P is said to be measurable if the set of all x for which P(x) is true is a measurable set. If {P_n} is a sequence of measurable properties then

(a) P_n(x) holds eventually almost surely, if for almost every x there is an N = N(x) such that P_n(x) is true for n ≥ N.
(b) P_n(x) holds infinitely often, almost surely, if for almost every x there is an increasing sequence {n_i} of integers, which may depend on x, such that P_{n_i}(x) is true for i = 1, 2, ....

For example, the Borel-Cantelli principle is often expressed by saying that if Σ μ(C_n) < ∞ then x ∉ C_n, eventually almost surely. Almost-sure convergence is often established using the following generalization of the Borel-Cantelli principle.

Lemma I.1.15 (The iterated Borel-Cantelli principle.)
Suppose {G_n} and {B_n} are two sequences of measurable sets such that x ∈ G_n, eventually almost surely, and x ∉ B_n ∩ G_n, eventually almost surely. Then x ∉ B_n, eventually almost surely.

The proof of this is left to the exercises. In many applications, the fact that x ∉ B_n ∩ G_n, eventually almost surely, is established by showing that Σ μ(B_n ∩ G_n) < ∞, in which case the iterated Borel-Cantelli principle is, indeed, just a generalized Borel-Cantelli principle.

Frequent use will be made of various equivalent forms of almost sure convergence, summarized as follows.

Lemma I.1.16
The following are equivalent for measurable functions on a probability space.
(a) f_n → f, almost surely.
(b) |f_n(x) − f(x)| < ε, eventually almost surely, for every ε > 0.
(c) Given ε > 0, there is an N and a set G of measure at least 1 − ε, such that |f_n(x) − f(x)| < ε, x ∈ G, n ≥ N.
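An empirical version of the Markov inequality is easy to run. The sketch below uses exponentially distributed samples as an illustrative nonnegative f (an assumption for the example only): when the empirical mean is at most cδ, the fraction of samples that are at least c stays below δ.

    import random

    random.seed(0)
    sample = [random.expovariate(1.0) for _ in range(100_000)]   # nonnegative f
    c, delta = 20.0, 0.1

    mean = sum(sample) / len(sample)
    frac_large = sum(v >= c for v in sample) / len(sample)

    print(mean < c * delta)          # True here: the mean is about 1.0
    print(frac_large <= delta)       # True: the exceptional set is small

The same kind of bound, applied to the measures μ(B_n ∩ G_n) and summed over n, is how the iterated Borel-Cantelli principle is typically put to work later in the book.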

) Let p.d. BASIC CONCEPTS.) 3.18 (Cardinality bounds. If X is a Borel set such that p(X) = 1.1. let v = p. by AZ or B z . For ease of later reference these are stated here as the following lemma. such as product spaces. let p. though they may be used in various examples.u([4] n [ci_t) . g. (Hint: show that such a coding would have the property that there is a number c > 0 such that A(4) > c".18.1 is 1-dependent. process. this is not a serious problem. for all n and k. first see what is happening on a set of probability 1 in the old process then transfer to the new process. and the renewal theorem. be a probability measure on the finite set A. Show that {X.u([4]). p(b) > a/I BI. respectively. (Include a proof that your example is not Markov of any order. such as the martingale theorem.i.(x 1') 0 O.) .17 (The Borel mapping lemma. otherwise X. let B c A.i. For nice spaces. and is not a finite coding of a finite-alphabet i.1. As several of the preceding examples suggest. . except for a subset of B of measure at most a. Give an example of a function of a finite-alphabet Markov chain that is not Markov of any order. namely. Let {tin } be i. if U„ > Un _i . B.u([4]).i. and upper bounds on cardinality "almost" imply lower bounds on probability.l. will not play a major role in this book. This fact is summarized by the following lemma. Sometimes it is useful to go in the opposite direction. uniformly distributed on [0. and let a be a positive number (a) If a E B (b) For b E p(a) a a. be a Borel measure on A". = 1.d.1. o F-1 and let r) be the completion of v.) Let F be a Borel function from A" into B". Probabilities for the new process can then be calculated by using the inverse image to transfer back to the old process. then I BI 5_ 11a. a process is often specified as a function of some other process. (How is m related to window half-width?) 4. e. with each U. I. process is m-dependent for some m. then F(X) is measurable with respect to the completion i of v and ii(F (X)) = 1. and sometimes the martingale theorem is used to simplify an argument. that lower bounds on probability give upper bounds on cardinality.d. namely. 1. the central limit theorem. Use will also be made of two almost trivial facts about the connection between cardinality and probability.c Exercises. Show that a finite coding of an i. 1].' and an+ni±k n+m-1-1 . Prove Lemma 1. the law of the iterated logarithm. if p. that Borel images of Borel sets may not be Borel sets. = O. Deeper results from probability theory. Then show that the probability of n consecutive O's is 11(n + 1)!. A complication arises. A similar result holds with either A" or B" replaced. for such images are always measurable with respect to the completion of the image measure. [5]. and all a. 2. then a stationary process is said to be m-dependent. If . Define X. Lemma 1.12 CHAPTER I.. Lemma 1.

5. Show that the finite-state representation of a concatenated-block process given in Example I.1.11 satisfies the two conditions (i) and (ii) of that example.

6. Establish the Kolmogorov representation theorem for countable alphabet processes.

7. Prove the iterated Borel-Cantelli principle, Lemma I.1.15.

8. A measure μ on A^n defines a measure μ^(N) on A^N, N = 1, 2, ..., by the formula

    μ^(N)(x_1^N) = [ ∏_{k=0}^{K-1} μ(x_{kn+1}^{(k+1)n}) ] μ_r(x_{Kn+1}^{Kn+r}),

where N = Kn + r, 0 ≤ r < n, and μ_r denotes the r-th order marginal of μ. The measures {μ^(N)} satisfy the Kolmogorov consistency conditions, hence have a common extension μ* to A^∞. Show that the concatenated-block process μ̃ defined by μ is an average of shifts of μ*, that is,

    μ̃(B) = (1/n) Σ_{i=1}^{n} μ*(T^{-i}B),

for each Borel subset B ⊂ A^∞.

9. Show that the process constructed in the preceding exercise is not Markov of any order.

Section I.2 The ergodic theory model.

Ergodic theory is concerned with the orbits x, Tx, T²x, ..., of a transformation T: X → X on some given space X. In many cases of interest there is a natural probability measure preserved by T, relative to which information about orbit structure can be expressed in probability language. Finite measurements on the space X, which correspond to finite partitions of X, then give rise to stationary processes. This suggests the possibility of using ergodic theory ideas in the study of stationary processes. This model, called the transformation/partition model for a stationary process, is the subject of this section and is the basis for much of the remainder of this book.

The Kolmogorov model for a stationary process implicitly contains the transformation/partition model, as follows. The shift T on the sequence space, X = A^∞ (or on the space A^Z), is the transformation and the partition is P = {P_a: a ∈ A}, where for each a ∈ A, P_a = {x: x_1 = a}. The partition P is called the Kolmogorov partition associated with the process. The sequence of random variables and the joint distributions are expressed in terms of the shift and Kolmogorov partition, as follows. First, associate with the partition P the random variable X_P defined by X_P(x) = a if a is the label of the member of P to which x belongs, that is, x ∈ P_a. The coordinate functions, X_n, are then given by the formula

(1)    X_n(x) = X_P(T^{n-1}x),   n ≥ 1.

The process can therefore be described as follows: Pick a point x ∈ X, at random according to the Kolmogorov measure, and, for each n, let X_n = X_n(x) be the label of the member of P to which T^{n-1}x belongs.

In summary, in the Kolmogorov representation a random point x is a sequence in A^∞ or A^Z, and cylinder sets and joint distributions may be expressed by the respective formulas

    [a_1^n] = ∩_{i=1}^{n} T^{-i+1} P_{a_i},

and

(2)    μ_n(a_1^n) = μ([a_1^n]) = μ(∩_{i=1}^{n} T^{-i+1} P_{a_i}).

The concept of stationary process is formulated in terms of the abstract concepts of measure-preserving transformation and partition, as follows. Let (X, Σ, μ) be a probability space. A mapping T: X → X is said to be measurable if T^{-1}B ∈ Σ, B ∈ Σ, and measure preserving if it is measurable and if μ(T^{-1}B) = μ(B), for all B ∈ Σ. A partition P = {P_a: a ∈ A} of X, indexed by a finite set A, is a finite, disjoint collection of measurable sets, whose union has measure 1, that is, X − ∪_a P_a is a null set. (In some situations, partitions into countably many sets, that is, countable partitions, are useful.)

Associated with the partition P = {P_a: a ∈ A} is the random variable X_P defined by X_P(x) = a if x ∈ P_a. The random variable X_P and the measure-preserving transformation T together define a process by the formula

(3)    X_n(x) = X_P(T^{n-1}x),   n ≥ 1.

The process {X_n: n ≥ 1} defined by (3), is called the process defined by the transformation T and partition P, or, more simply, the (T, P)-process. The k-th order distribution μ_k of the process {X_n} is given by the formula

(4)    μ_k(a_1^k) = μ(∩_{i=1}^{k} T^{-i+1} P_{a_i}),

the direct analogue of the sequence space formula, (2).

The sequence {x_n = X_n(x): n ≥ 1} defined for a point x ∈ X by the formula T^{n-1}x ∈ P_{x_n}, n ≥ 1, is called the (T, P)-name of x. In the Kolmogorov representation the (T, P)-name of x is the same as x, (or the forward part of x in the two-sided case.)

The (T, P)-process may also be described as follows. Pick x ∈ X at random according to the μ-distribution and let X_1(x) be the label of the set in P to which x belongs. Then apply T to x to obtain Tx and let X_2(x) be the label of the set in P to which Tx belongs. Continuing in this manner, the values X_1(x), X_2(x), X_3(x), ..., tell to which set of the partition the corresponding member of the random orbit belongs.
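The (T, P)-name is easy to compute for a concrete transformation and partition. The sketch below uses the doubling map on [0, 1), which preserves Lebesgue measure, with the two-set partition P_0 = [0, 1/2), P_1 = [1/2, 1); the choice of map, partition, and starting point are illustrative assumptions, not examples from the text.

    def tp_name(x, T, label, n):
        """First n terms of the (T,P)-name of x: X_k(x) = label(T^{k-1} x)."""
        name = []
        for _ in range(n):
            name.append(label(x))
            x = T(x)
        return name

    T = lambda x: (2 * x) % 1.0           # doubling map on [0, 1)
    label = lambda x: 0 if x < 0.5 else 1 # P_0 = [0, 1/2), P_1 = [1/2, 1)

    print(tp_name(0.1, T, label, 10))     # the first binary digits of 0.1

For this transformation the (T, P)-name of a point is its binary expansion, so a point chosen uniformly at random produces the coin-tossing process, a theme taken up in Example I.2.3 below.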

The (T, P)-process concept is, in essence, just an abstract form of the stationary coding concept, in that a partition P of (X, Σ, μ) gives rise to a measurable function F: X → B^∞ which carries μ onto the Kolmogorov measure ν of the (T, P)-process, and which satisfies F(Tx) = T_B F(x), where T_B denotes the shift on B^∞. The mapping F extends to a stationary coding to B^Z in the case when X = A^Z and T is the shift. Conversely, a stationary coding F: A^Z → B^Z carrying μ onto ν determines a partition P = {P_b: b ∈ B} of A^Z such that ν is the Kolmogorov measure of the (T_A, P)-process; the partition P is defined by P_b = {x: f(x) = b}, where f is the time-zero coder. See Exercise 2 and Exercise 3.

In summary, a stationary process can be thought of as a shift-invariant measure on a sequence space, or, equivalently, as a measure-preserving transformation T and partition P of an arbitrary probability space. The ergodic theory point of view starts with the transformation/partition concept, while modern probability theory starts with the sequence space concept.

I.2.a Ergodic processes.

It is natural to study the orbits of a transformation by looking at its action on invariant sets. A measurable set B is said to be T-invariant if TB ⊂ B. Since once an orbit enters an invariant set it never leaves it, the natural object of study becomes the restriction of the transformation to sets that cannot be split into nontrivial invariant sets. This leads to the concept of ergodic transformation. The space X is T-decomposable if it can be expressed as the disjoint union X = X_1 ∪ X_2 of two measurable invariant sets, each of positive measure. The condition that TX_i ⊂ X_i, i = 1, 2, translates into the statement that T^{-1}X_i = X_i, i = 1, 2; hence to say that the space is indecomposable is to say that if T^{-1}B = B then μ(B) is 0 or 1. It is standard practice to use the word "ergodic" to mean that the space is indecomposable. A measure-preserving transformation T is said to be ergodic if

(5)    T^{-1}B = B  ⟹  μ(B) = 0 or μ(B) = 1.

A stationary process is ergodic, if the shift in the Kolmogorov representation is ergodic relative to the Kolmogorov measure. The following lemma contains several equivalent formulations of the ergodicity condition.

Lemma I.2.1 (Ergodicity equivalents.)
The following are equivalent for a measure-preserving transformation T on a probability space.
(a) T is ergodic.
(b) T^{-1}B ⊂ B  ⟹  μ(B) = 0 or μ(B) = 1.
(c) T^{-1}B ⊃ B  ⟹  μ(B) = 0 or μ(B) = 1.
(d) μ(T^{-1}B Δ B) = 0  ⟹  μ(B) = 0 or μ(B) = 1.
(e) U_T f = f, a.e., implies that f is constant, a.e.

The notation C Δ D = (C − D) ∪ (D − C) denotes symmetric difference, and U_T denotes the operator on functions defined by (U_T f)(x) = f(Tx), where the domain can be taken to be any one of the L^p-spaces or the space of measurable functions.

Proof. The equivalence of the first two follows from the fact that if T^{-1}B ⊂ B and C = ∩_{n≥0} T^{-n}B, then T^{-1}C = C and μ(C) = μ(B). The proofs of the other equivalences are left to the reader.

Remark I.2.2
In the particular case when T is invertible, that is, when T is one-to-one and for each measurable set C the set TC is measurable and has the same measure as C, the conditions for ergodicity can be expressed in terms of the action of T, rather than T^{-1}. In particular, an invertible T is ergodic if and only if any T-invariant set has measure 0 or 1. Also, note that if T is invertible then U_T is a unitary operator on L².

As will be shown in Section I.4, to say that a stationary process is ergodic is equivalent to saying that measures of cylinder sets can be determined by counting limiting relative frequencies along a sample path x, for almost every x. Thus the concept of ergodic process, which is natural from the transformation point of view, is equivalent to an important probability concept.

I.2.b Examples of ergodic processes.

Examples of ergodic processes include

1. i.i.d. processes,
2. irreducible Markov chains and functions thereof,
3. stationary codings of ergodic processes, concatenated-block processes, and some, but not all, processes obtained from a block coding of an ergodic process by randomizing the start.

These and other examples will now be discussed.

Example I.2.3 (The Baker's transformation.)
A simple geometric example provides a transformation and partition for which the resulting process is the familiar coin-tossing process. Let X = [0, 1) × [0, 1) denote the unit square and define a transformation T by

    T(s, t) = (2s, t/2),             if s < 1/2,
              (2s − 1, (t + 1)/2),   if s ≥ 1/2.

The transformation T is called the Baker's transformation since its action can be described as follows. Cut the unit square into two columns of equal width; squeeze each column down to height 1/2 and stretch it to width 1; then place the right rectangle on top of the left to obtain a square. (See Figure I.2.4.)

Figure I.2.4 The Baker's transformation.

The Baker's transformation T preserves Lebesgue measure in the square, for dyadic subsquares (which generate the Borel field) are mapped into rectangles of the same area. To obtain the coin-tossing process define the two-set partition P = {P_0, P_1} by setting

(6)    P_0 = {(s, t): s < 1/2},   P_1 = {(s, t): s ≥ 1/2}.

To assist in showing that the (T, P)-process is, indeed, the coin-tossing process, some useful partition notation and terminology will be developed. Let P = {P_a: a ∈ A} be a partition of the probability space (X, Σ, μ). The distribution of P is the probability distribution {μ(P_a), a ∈ A}. Two partitions P and Q are independent if the distribution of the restriction of P to each Q_b ∈ Q does not depend on b, that is, if μ(P_a ∩ Q_b) = μ(P_a)μ(Q_b), for all a, b. The join, P ∨ Q, of two partitions P and Q is their common refinement, that is, the partition

    P ∨ Q = {P_a ∩ Q_b: P_a ∈ P, Q_b ∈ Q}.

The join ∨P^(i) of a finite sequence P^(1), P^(2), ..., P^(k) of partitions is their common refinement, defined inductively by ∨_1^k P^(i) = (∨_1^{k-1} P^(i)) ∨ P^(k). The sequence of partitions {P^(i): i ≥ 1} is said to be independent if P^(k+1) and ∨_1^k P^(i) are independent for each k ≥ 1.

Figure I.2.5 illustrates the partitions P, TP, T²P, and T^{-1}P, along with the join of these partitions, for the Baker's transformation and partition (6).

Figure I.2.5 The join T^{-1}P ∨ P ∨ TP ∨ T²P.

Note that T²P partitions each set of T^{-1}P ∨ P ∨ TP into exactly the same proportions as it partitions the entire space X. This is precisely the meaning of independence. In general, it can be shown that the partition T^nP is independent of ∨_0^{n-1} T^iP, for each n ≥ 1, so that {T^iP} is an independent sequence. This, together with the fact that each of the two sets P_0, P_1 has measure 1/2, implies, of course, that the (T, P)-process is just the coin-tossing process, the binary, symmetric, i.i.d. process.
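The connection between the Baker's transformation and coin tossing can be seen directly by iterating the map. The short sketch below computes a few terms of the (T, P)-name for the partition (6); only a modest number of terms is computed because, in floating-point arithmetic, repeated doubling of the s-coordinate eventually reaches an exactly dyadic value. The starting point is an illustrative choice.

    def baker(s, t):
        """The Baker's transformation of the unit square."""
        if s < 0.5:
            return 2 * s, t / 2
        return 2 * s - 1, (t + 1) / 2

    def tp_name(s, t, n):
        """First n terms of the (T,P)-name for P_0 = {s < 1/2}, P_1 = {s >= 1/2}."""
        out = []
        for _ in range(n):
            out.append(0 if s < 0.5 else 1)
            s, t = baker(s, t)
        return out

    print(tp_name(0.3, 0.6, 20))
    # the s-coordinate doubles mod 1 at each step, so this output is just the
    # binary expansion of 0.3; for a uniformly chosen point of the square these
    # digits are fair, independent coin tosses

The t-coordinate plays no role in the name; it records the past, which is what makes the two-sided process invertible.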

In general, the i.i.d. process μ, defined by the first-order distribution μ_1(a), a ∈ A, is given by a generalized Baker's transformation, described as follows.

1. Cut the unit square into |A| columns, labeled by the letters of A, such that the column labeled a has width μ_1(a).
2. For each a ∈ A squeeze the column labeled a down and stretch it to obtain a rectangle of height μ_1(a) and width 1.
3. Stack the rectangles to obtain a square.

The corresponding Baker's transformation T preserves Lebesgue measure. The partition P into columns defined in part 1 of the definition of T then produces the desired i.i.d. process.

Figure I.2.6 The generalized Baker's transformation.

Example I.2.7 (Ergodicity for i.i.d. processes.)
To show that a process is ergodic it is sometimes easier to verify a stronger property, called mixing. A transformation is mixing if

(7)    lim_{n→∞} μ(T^{-n}C ∩ D) = μ(C)μ(D),   C, D ∈ Σ.

A stationary process is mixing if the shift is mixing for its Kolmogorov measure. Mixing clearly implies ergodicity, for if T^{-1}C = C then T^{-n}C ∩ D = C ∩ D, for all sets D and positive integers n, so that the mixing property, (7), implies that μ(C ∩ D) = μ(C)μ(D). Since this holds for all sets D, the choice D = C gives μ(C) = μ(C)², and hence μ(C) is 0 or 1.

Suppose μ is a product measure on A^∞. To show that the mixing condition (7) holds for the shift first note that it is enough to establish the condition for any two sets in a generating algebra, hence it is enough to show that it holds for any two cylinder sets. But this is easy, for if C = [a_1^m] and N > 0 then

    T^{-N}C = [b_{N+1}^{N+m}],  where b_{N+i} = a_i, 1 ≤ i ≤ m,

so that, if D = [b_1^m] and N > m, then T^{-N}C and D depend on values of x_i for indices i in disjoint sets of integers. Since the measure is product measure this means that μ(T^{-N}C ∩ D) = μ(T^{-N}C)μ(D) = μ(C)μ(D). Thus i.i.d. measures satisfy the mixing condition, (7), and hence i.i.d. processes are ergodic.

Example I.2.8 (Ergodicity for Markov chains.)
The location of the zero entries, if any, in the transition matrix M determine whether a Markov process is ergodic. The results are summarized here; the reader may refer to standard probability books or to the extensive discussion in [29] for details. The stochastic matrix M is said to be irreducible, if for any pair i, j there is a sequence i_0, i_1, ..., i_n, with i_0 = i and i_n = j, such that M_{i_{k-1} i_k} > 0, 1 ≤ k ≤ n.

Mc1.1 1 converges in the sense of Cesaro to (Ci). 1 < i < m. THE ERGODIC THEORY MODEL. In fact. +1 = acin).t(C) = 1u(C) 2 . Thus when k > 0 the sets T -k-n C and D depend on different coordinates. since . Furthermore.c.([4] n [btni 1 -_ . let D = [4] and C = [cri and write + T—k—n = [bnn ++ kk-Fr ]. 1=.1±kbn+k+i V = (n-Flc-1 i=n JJ Md. implies ergodicity. then the limit formula (9) gives /. +1 = 11 n+k+m-1 i=n+k-F1 Note that V. where bn_f_k+i = ci.. by (8). there is only one probability vector IT such that 7rM = it and the limit matrix P has it all its rows equal to 7r. mc. An irreducible Markov chain is ergodic.-1 (10) itt([d] k steps. Summing over all the ways to get from d. hence for any measurable sets C and D. To establish (9) for an irreducible chain. for if T -1 C = C and D = C. which establishes (9).:Vinj) = uvw. and hence . which is weaker than the mixing condition (7). 19 This is just the assertion that for any pair i. the limit of the averages of powers can be shown to exist for an arbitrary finite stochastic matrix M. This condition.+i.SECTION 1. is n ii n [bn 1) = ttacril i) A I . To prove this it is enough to show that N (9) LlM oo C n D) = for any cylinder sets C and D. to bn+k+1 yields the product p.d. if the chain is at state i at some time then there is a positive probability that it will be in state j at some later time. If M is irreducible there is a unique probability vector it such that irM = 7r.u(C) = This proves that irreducible Markov chains are ergodic.. so that C must have measure 0 or 1. where n-1 7r (di)11Md.2.d. which is the probability of transition from dn to bn+k+1 = Cl in equal to MI ci = [Mk ]d c „ the dn ci term in the k-th power of M. In the irreducible case. The sequence ML. j of states. each entry of it is positive and N (8) N—oc) liM N n=1 where each row of the k x k limit matrix P is equal to jr. The condition 7rM = it shows that the Markov chain with start distribution Ai = it and transition matrix M is stationary. .

But if it(B n Pi ) > 0. A concatenated-block process is always ergodic.2. so that. Indeed. "ergodic" for Markov chains will mean merely irreducible with "mixing Markov" reserved for the additional property that some power of the transition matrix has all positive entries. The converse is also true. Remark 1. In summary. Thus if the chain is ergodic then .11 In some probability books. see Example 1.(F-1 C) it follows that v(C) is 0 or 1. The argument used to prove (9) then shows that the shift T must be mixing. To be consistent with the general concept of ergodic process as used in this book. for some n. . see Exercise 19. It should be noted. if T is ergodic the (T. Example 1. has all positive entries) then M is certainly irreducible. Thus.10 A stationary Markov chain is mixing if and only if some power of its transition matrix has all positive entries. then . The underlying .1.u(B) = 1. which means that transition from i to j in n steps occurs with positive probability.12 (Codings and ergodicity. In this case. assuming that all states have positive probability.11. o F -1 is the Kolmogorov measure of the encoded B-valued process. This follows easily from the definitions. however. It is not important that the domain of F be the sequence space A z . Since v(C) = p.9 A stationary Markov chain is ergodic if and only if its transition matrix is irreducible. If some power of M is positive (that is.u(B n Pi ) > 0. then the set B = T -1 Pi U T -2 PJ U has positive probability and satisfies T -1 B D B. The converse is also true. Thus. at least with the additional assumption being used in this book that every state has positive probability. BASIC CONCEPTS. for any state i.20 CHAPTER I. that the (T. and hence that v is the Kolmogorov measure of an ergodic process. (See Exercise 5. If C is a shift-invariant subset of B z then F-1 C is a shift-invariant subset of Az so that . A finite-state process is a stationary coding with window width 0 of a Markov chain. suppose it is the Kolmogorov measure of an ergodic A-valued process. and hence the chain is irreducible. 2)-process is ergodic for any finite partition P. and suppose v = p.u(T -nP n Pi ) > 0.2.u(F-1 C) is 0 or 1. Feller's book. if the chain is ergodic then the finite-state process will also be ergodic. Likewise. below.2. since it can be represented as a stationary coding of an irreducible Markov chain. 2)-process can be ergodic even though T is not.„ M N = P.) Stationary coding preserves the ergodicity property. the Cesaro limit theorem (8) can be strengthened to limN. Proposition 1. "ergodic" for Markov chains is equivalent to the condition that some power of the transition matrix be positive. any probability space will do.2. [11]. if every state has positive probability and if Pi = {x: xi = j}.) A simple extension of the above argument shows that stationary coding also preserves the mixing property. again. Proposition 1. for suppose F: A z B z is a stationary encoder. for example. a finite-state process for which the underlying Markov chain is mixing must itself be mixing.

A transformation T on a probability space (X. The following proposition is basic.. since X can be thought of as the unit circle by identifying x with the angle 27rx. where ED indicates addition modulo 1.12. E..) Let a be a fixed real number and let T be defined on X = [0. Pick a point x at random in the unit interval according to the uniform (Lebesgue) measure. The final randomized-start process may not be ergodic. (The measure-preserving property is often established by proving.) is said to be totally ergodic if every power TN is ergodic. for all x and all N. In particular. For example. however. The measure -17 obtained by randomizing the start is. such processes are called translation processes. 1/2). applying a block code to a totally ergodic process and randomizing the start produces an ergodic process. The transformation T is one-to-one and maps intervals onto intervals of the same length. concatenated-block processes are not mixing except in special cases. The mapping T is called translation by a.SECTION 1. let p.) Since the condition that F(TA X) = TB F (x).. depending on whether Tn .... see Example 1. which correspond to the upper and lower halves of the circle in the circle representation. but a stationary process can be constructed from the encoded process by randomizing the start. even if the original process was ergodic.1. the same as y. A condition insuring ergodicity of the process obtained by N-block coding and randomizing the start is that the original process be ergodic relative to the N-shift T N . Example 1. and 0101 .1.. let P consist of the two intervals P0 = [0. and 111 . It is also called rotation.10. that is.. . hence is not an ergodic process.. (or rotation processes if the circle representation is used. N-block codings destroy stationarity. implies that F (Ti x) = T: F(x). so that y is concentrated on the two sequences 000. hence preserves Lebesgue measure p. so that translation by a becomes rotation by 27ra. A subinterval of the circle corresponds to a subinterval or the complement of a subinterval in X. The (T.2. the nonoverlapping N-block process defined by Zn = (X (n-1)N-1-11 X (n-1)N+2. If T is the shift on a sequence space then T is totally ergodic if and only if for each N. P)-process {X n } is then described as follows. p.) Any partition 7. The proof of this is left to the reader.. 1). give measure 1/2 to each of the two sequences 1010. for all x.2. it follows that a stationary coding of a totally ergodic process must be totally ergodic. (See Example 1. however.lx = x ED (n — 1)a belongs to P0 or P1 . as in this case. 1) by the formula Tx = x ED a. in this case. THE ERGODIC THEORY MODEL. As noted in Section 1. for it has a periodic structure due to the blocking. The value X(x) = Xp(Tn -1 x) is then 0 or 1. P 1 = [1/2.1. 21 Markov chain is not generally mixing. • • • 9 Xn is ergodic. respectively. p is the stationary Markov measure with transition matrix M and start distribution 7r given. of the interval gives rise to a process. C(10) = 11.13 (Rotation processes. up to a shift.) As an example. hence the word "interval" can be used for subsets of X that are connected or whose complements are connected. that it holds on a family of sets that generates the a-algebra. by m ro 1 L o j' = rl 1] i• Let y be the encoding of p.. Since this periodic structure is inherited. defined by the 2-block code C(01) = 00.

14. CHAPTER I. and hence the maximality of I implies that n I = 0. Two proofs of ergodicity will be given.10 to establish ergodicity of a large class of transformations. Proof To establish Kronecker's theorem let F be the closure of the forward orbit.15 (Kronecker's theorem. applied to the end points of I.14 T is ergodic if and only if a is irrational.. One first shows that orbits are dense. Given c > 0 choose an interval I such that /2(1) < E and /. It follows that is a disjoint sequence and 00 1 = . then that small disjoint intervals can be placed around each point in the orbit. but does establish ergodicity for many transformations of interest. for each x. The Fourier series of g(x) = f (T x) is E a. since a is irrational. Assume that a is irrational. Proposition 1. Suppose f is square integrable and has Fourier series E an e brinx . produces integers n1 <n2 < < n k such that {Tni / } is a disjoint sequence and Ek i=1 tt(Tni /) > 1 — 2E.14 continued. e2"ina e 27rinx . Furthermore. Proof of Proposition 1.u ( [0. {Tx: n > 1 } . An alternate proof of ergodicity can be obtained using Fourier series.u(A) = 1.t(A n I) > (1 — 6) gives it (A) E —E) n Trig) (1 — 0(1 — 2E). BASIC CONCEPTS. i= 1 Thus . E n=1 00. Such a technique does not work in all cases.2. 1)) a. The first proof is based on the following classical result of Kronecker. e2-n(x+a) a. The preceding proof is typical of proofs of ergodicity for many transformations defined on the unit interval.2. which is a contradiction. for each positive integer n. suppose A is an invariant set of positive measure.2. To proceed with the proof that T is ergodic if a is irrational. This proves Kronecker's theorem. The assumption that i. Then there is an interval I of positive length which is maximal with respect to the property of not meeting F. which completes the proof of Proposition 1. A generalization of this idea will be used in Section 1. Proof It is left to the reader to show that T is not ergodic if a is rational.2.22 Proposition 1. TI cannot be the same as I.t(A n I) > (1 — Kronecker's theorem. and suppose that F is not dense in X.) If a is irrational the forward orbit {Tnx: n > 1} is dense in X.

the product of two circles.. let n(x) be the least positive integer n such that Tnx E B.) Return to B is almost certain. with addition mod 1.18)..) The proof of the following is left to the reader. the product of two intervals. for all n > 1. the rotation T has no invariant functions and hence is ergodic. translation of a compact group will be ergodic with respect to Haar measure if orbits are dense. for n > 1. e. which means that a. E. T n-1 Bn are disjoint and have the same measure. In general. Proof To prove this.17 (The Poincare recurrence theorem. This proves Theorem 1. and define = {X E B: Tx B.. the renewal processes and the regenerative processes. and.o ) = 0. TBn C B. which. n(x) is the time of first return to B. provides an alternative and often simpler view of two standard probability models.) For each n. it follows that p(B.Vn > O. in turn. while applying T to the last set Tn -1 Bn returns it to B. (ii) T is ergodic. y)} is dense for any pair (x. that is. Furthermore. that is. furthermore. 1) x [0. which. Let T be a measure-preserving transformation on the probability space (X. consider the torus T. almost surely. To simplify further discussion it will be assumed in the remainder of this subsection that T is invertible.2. then an = an e 2lri" holds for each integer n. (See Figure 1.. Since they all have the same measure as B and p(X) = 1. IL) and let B be a measurable subset of X of positive measure. Proposition 1. for B 1 = B n 7-1 B and Bn = (X — B) n B n T -n B.2. In other words.c The return-time picture. a. y El) p) (i) {Tn(x.2.2. unless and an = an e 27' then e2lrin" n=0. y) 1-÷ (x a. 1). Points in the top level of each column are moved by T to the base B in some manner unspecified by the picture. (iii) a and 13 are rationally independent. If x E B. y). For example. . define the sets Bn = E B: n(x) = n}. Furthermore. implies that the function f is constant. e.2. one above the other. a. . alternatively. If a is irrational 0 unless n = 0. n(x) < oc. (Some extensions to the noninvertible case are outlined in the exercises. (or.16 The following are equivalent for the mapping T: (x. TB. that is. These sets are clearly disjoint and have union B. 23 If g(x) = f(x).SECTION 1. T = [0. by reassembling within each level it can be assumed that T just moves points directly upwards one level. that is. = 0. Theorem 1. only the first set Bn in this sequence meets B. n > 2.2. The return-time picture is obtained by representing the sets (11) as intervals. I. in the irrational case. A general return-time concept has been developed in ergodic theory. from which it follows that the sequence of sets {T B} is disjoint. THE ERGODIC THEORY MODEL. the sets (11) Bn .17. along with a suggestive picture. are measurable. Furthermore. if x E 1100 then Tx g Bc.

24

CHAPTER I. BASIC CONCEPTS.

Tx

B1

B2

B3

B4

Figure 1.2.18 Return-time picture.
Note that the set Util=i Bn is just the union of the column whose base is Bn , so that the picture is a representation of the set (12) U7 =0 TB

This set is T-invariant and has positive measure. If T is ergodic then it must have measure 1, in which case, the picture represents the whole space, modulo a null set, of course. The picture suggests the following terminology. The ordered set

Bn , TB,

, T 11-1 13,

is called the column C = Cn with base Bn, width w(C) = kt(B,z ), height h(C) = n, levels L i = Ti-1 Bn , 1 < i < n, and top T'' B. Note that the measure of column C is just its width times its height, that is, p,(C) = h(C)w(C). Various quantities of interest can be easily expressed in terms of the return-time picture. For example, the return-time distribution is given by Prob(n(x) =nIxEB)— and the expected return time is given by w (Ca )

Em w(C, n

)

E(n(x)lx E B) =
(13)

h(C,i )Prob(n(x) =nix E B)

h(C)w(C)

=

m w(Cm )

In the ergodic case the latter takes the form

(14)

E(n(x)ix E B) =

1 Em w(Cm )

1

since En h(C)w(C) is the measure of the set (12), which equals 1, when T is ergodic. This formula is due to Kac, [23]. The transformation I' = TB defined on B by the formula fx = Tn(x)x is called the transformation induced by T on the subset B. The basic theorem about induced transformations is due to Kakutani.

SECTION I.2. THE ERGODIC THEORY MODEL.
-

25

Theorem 1.2.19 (The induced transformation theorem.) If p(B) > 0 the induced transformation preserves the conditional measure p(.IB) and is ergodic if T is ergodic. Proof If C c B, then x E f- 1C if and only if there is an n > 1 such that x Tnx E C. This translates into the equation
00

E

Bn and

F-1 C =U(B n n=1

n TAC),

which shows that T-1 C is measurable for any measurable C c B, and also that
/2(f—i
n

= E ,u(B,, n T'C),
n=1

00

since Bn n An = 0, if m 0 n, with m, n > 1. More is true, namely, (15)

T n Ba nT ni Bm = 0, m,n > 1, m n,

by the definition of "first return" and the assumption that T is invertible. But this implies 00 E,(T.Bo
n=1

= E p(Bn ) = AO),
n=1
,

co

which, together with (15) yields E eu(B, n T'C) = E iu(Tnii n n c) = u(C). Thus the induced transformation f preserves the conditional measure p.(.1B). The induced transformation is ergodic if the original transformation is ergodic. The picture in Figure 1.2.18 shows why this is so. If f --1 C, then each C n Bn can be pushed upwards along its column to obtain the set

c.

oo n-1
Ti(c n Bn)• D=UU n=1 i=0

But p,(T DAD) = 0 and T is invertible, so the measure of D must be 0 or 1, which, in turn, implies that kt(C1B) is 0 or 1, since p,((D n B),LC) = O. This completes the proof of the induced-transformation theorem. El

1.2.c.1 Processes associated with the return-time picture.
Several processes of interest are associated with the induced transformation and the return-time picture. It will be assumed throughout this discussion that T is an invertible, ergodic transformation on the probability space (X, E, p,), partitioned into a finite partition P = {Pa : a E A}; that B is a set of positive measure; that n(x) = minfn: Tnx E BI, x E B, is the return-time function; and that f is the transformation induced on B by T. Two simple processes are connected to returns to B. The first of these is {R,}, the (T, B)-process defined by the partition B = { B, X B}, with B labeled by 1 and X — B labeled by 0, in other words, the binary process defined by

Rn(x) = xB(T n-l x), n?: 1,

26

CHAPTER I. BASIC CONCEPTS.

where xB denotes the indicator function of B. The (T, 5)-process is called the generalized renewal process defined by T and B. This terminology comes from the classical definition of a (stationary) renewal process as a binary process in which the times between occurrences of l's are independent and identically distributed, with a finite expectation, the only difference being that now these times are not required to be i.i.d., but only stationary with finite expected return-time. The process {R,,} is a stationary coding of the (T, 2)-process with time-zero encoder XB, hence it is ergodic. Any ergodic, binary process which is not the all 0 process is, in fact, the generalized renewal process defined by some transformation T and set B; see Exercise 20. The second process connected to returns to B is {Ri } , the (7." , g)-process defined by the (countable) partition

{Bi, B2, • • •},
of B, where B, is the set of points in B whose first return to B occurs at time n. The process {k } is called the return-time process defined by T and B. It takes its values in the positive integers, and has finite expectation given by (14). Also, it is ergodic, since I' is ergodic. Later, it will be shown that any ergodic positive-integer valued process with finite expectation is the return-time process for some transformation and partition; see Theorem 1.2.24. The generalized renewal process {R n } and the return-time process {kJ } are connected, for conditioned on starting in B, the times between successive occurrences of l's in {Rn} are distributed according to the return-time process. In other words, if R 1 = 1 and the sequence {S1 } of successive returns to 1 is defined inductively by setting So = 1 and

= min{n >

Rn = 1},

j > 1,

then the sequence of random variables defined by

(16)

= Si — Sf _ 1 , j> 1

has the distribution of the return-time process. Thus, the return-time process is a function of the generalized renewal process. Except for a random shift, the generalized renewal process {R,,} is a function of the return-time process, for given a random sample path R = {k } for the return-time process, a random sample path R = {R n } for the generalized renewal process can be expressed as a concatenation (17) R = ii)(1)w(2)w(3) • • •

of blocks, where, for m > 1, w(m) = 01 _ 1 1, that is, a block of O's of length kn -1, followed by a 1. The initial block tt,(1) is a tail of the block w(1) = 0 /1-1-1 1. The only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the waiting time r until the first 1 occurs. The start position problem is solved by the return-time picture and the definition of the (T, 2)-process. The (T, 2)-process is defined by selecting x E X at random according to the it-measure, then setting X(x) = a if and only if Tx E Pa . In the ergodic case, the return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column C at random according to column measure

SECTION 1.2. THE ERGODIC THEORY MODEL.

27

then selecting j at random according to the uniform distribution on 11, 2, ... , h (C)}, then selecting x at random in the j-th level of C. In summary,

Theorem 1.2.20 The generalized renewal process {R,} and the return-time process {km } are connected via the successive-returns formula, (16), and the concatenation representation, (17) with initial block th(1) = OT-1 1, where r = R — j 1 and j is uniformly distributed on [1,
The distribution of r can be found by noting that

r (x) = jx E
from which it follows that
00

Prob(r = i) =E p.,(T 4-1 n=i

=

E w (C ) .
ni

The latter sum, however, is equal to Prob(x E B and n(x) > i), which gives the alternative form 1 Prob (r = i) = (18) Prob (n(x) ijx E B) , E(n(x)lx E B) since E(n(x)lx E B) = 1/ p.(B) holds for ergodic processes, by (13). The formula (18) was derived in [11, Section XIII.5] for renewal processes by using generating functions. As the preceding argument shows, it is a very general and quite simple result about ergodic binary processes with finite expected return time. The return-time process keeps track of returns to the base, but may lose information about what is happening outside B. Another process, called the induced process, carries along such information. Let A* be the set all finite sequences drawn from A and define the countable partition 13 = {P: w E A* } of B, where, for w =

=

=

E Bk: T i-1 X E

Pao 1 <j

kl.

The (?,)-process is called the process induced by the (T, P)-process on the set B. The relation between the (T, P)-process { X, } and its induced (T, f 5)-process {km } parallels the relationship (17) between the generalized renewal process and the returntime process. Thus, given a random sample path X = {W„,} for the induced process, a random sample path x = {X,} for the (T, P)-process can be expressed as a concatenation

(19)

x = tv(1)w(2)w(3) • • • ,

into blocks, where, for m > 1, w(m) = X m , and the initial block th(1) is a tail of the block w(1) = X i . Again, the only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the length of the tail fi)(1). A generalization of the return-time picture provides the answer. The generalized version of the return-time picture is obtained by further partitioning the columns of the return-time picture, Figure 1.2.18, into subcolumns according to the conditional distribution of (T, P)-names. Thus the column with base Bk is partitioned

28

CHAPTER I. BASIC CONCEPTS.

into subcolumns CL, labeled by k-length names w, so that the subcolumn corresponding to w = a k has width (ct, T - i+ 1 Pa n BO,
,

the ith-level being labeled by a,. Furthermore, by reassembling within each level of a subcolumn, it can again be assumed that T moves points directly upwards one level. Points in the top level of each subcolumn are moved by T to the base B in some manner unspecified by the picture. For example, in Figure 1.2.21, the fact that the third subcolumn over B3, which has levels labeled 1, 1, 0, is twice the width of the second subcolumn, indicates that given that a return occurs in three steps, it is twice as likely to visit 1, 1, 0 as it is to visit 1, 0, 0 in its next three steps.
o o I
0 1 1

o
i

o
i

I

o .

1

I I
1

I

1o 1

I I

I
0

I I 1

I 1 I I

I

o
0

I I

1

1Ii

I

10

I I 1 I

o o

I

1

11

B1

B2

B3

B4

Figure 1.2.21 The generalized return-time picture. The start position problem is solved by the generalized return-time picture and the definition of the (T, 2)-process. The (T, 2)-process is defined by selecting x E X at random according to the ii-measure, then setting X, (x) = a if and only if 7' 1 x E Pa . In the ergodic case, the generalized return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column Cu, at random according to column measure p.(Cu,), then selecting j at random according to the uniform distribution on {1, 2, ... , h(w), where h(w) = h(C), then selecting x'at random in the j-th level of C. This proves the following theorem.
Theorem 1.2.22 The process {X,..,} has the stationary concatenation representation (19) in terms of the induced process {54}, if and only if the initial block is th(1) = w(1)i h. (w(l)) , where j is uniformly distributed on [1, h(w(1))].

A special case of the induced process occurs when B is one of the sets of P, say B = Bb. In this case, the induced process outputs the blocks that occur between successive occurrences of the symbol b. In general case, however, knowledge of where the blocking occurs may require knowledge of the entire past and future, or may even be fully hidden.

I.2.c.2 The tower construction.
The return-time construction has an inverse construction, called the tower construction, described as follows. Let T be a measure-preserving mapping on the probability

of finite expected value.2. B C X is defined to be measurable if and only if (x.23. As an application of the tower construction it will be shown that return-time processes are really just stationary positive-integer valued processes with finite expected value. p. .SECTION 1. (T x .23 The tower transformation.2. Let k defined by = {(x . as shown in Figure 1. i < n. i) = + 1) i < f (x) i = f (x). 0) The transformation T is easily shown to preserve the measure Ti and to be invertible (and ergodic) if T is invertible (and ergodic). E. and for each n > 1. THE ERGODIC THEORY MODEL. E 131 ) n 13. This and other ways in which inducing and tower extensions are inverse operations are explored in the exercises. Tx B3 X {2} B2 X 111 B3 X Bi X {0} B2 X {0} B3 X {0} Figure 1. 0 .u({x: . n=1 the expected value of f. E (f ) n=1 i=1 A transformation T is obtained by mapping points upwards. the normalizing factor being n ( B n ) = E(f). i) E T8} n Bn E E. Bn x {n} (20) as a column in order of increasing i. . B x {2}.0 . with its measure given by ‘ Tt(k) = 2-. then normalizing to get a probability measure.2. stacking the sets Bn x {1}.. and let f be a measurable mapping from X into the positive integers .}. The formal definition is T (x. The tower extension T induces the original transformation T on the base B = X x {0}. A useful picture is obtained by defining Bn = {x: f (x) = n}.). The transformation T is called the tower extension of T by the (height) function f. 1 < i f (x)}.be the subset of the product space X x Ar Ar = { 1. The measure it is extended to a measure on X. 29 space (X. n > 1. by thinking of each set in (20) as a copy of Bn with measure A(B). 2. In other words. with points in the top of each column sent back to the base according to T..

1) returns to B = X x L1} at the point (T x . Show that the (T.) 5.24 An ergodic. then use Exercise 2.Ar z of doubly-infinite positive-integer valued sequences. 2. Suppose F: A z F--> Bz is a stationary coding which carries p. .) 6. P)-process can be mixing even if T is only ergodic.2. 1) at time f (x). P)-process can be ergodic even if T is not. Ra ):n > t} is independent of the past joint process {(X. 7. A stationary process {X.30 CHAPTER I. a stationary coding gives rise to a partition. The induced transformation and the tower construction can be used to prove the following theorem. E. Complete the proof of Lemma 1. with the distribution of the waiting time r until the first I occurs given by (18). n . P)-process. onto the (two-sided) Kolmogorov measure y of the (T. be the two-sided Kolmogorov measure on the space . onto v. Xn+19 • • • Xn+N-1) be the overlapping N-block process. P)-process {Xn}. XnN) be the nonoverlapping N-block process. Let T be the tower extension of T by the function f and let B = x x {1}.n ): n < t}. (Hint: start with a suitable nonergodic Markov chain and lump states to produce an ergodic chain. (Thus. Theorem 1.d Exercises. (Thus. = f (X n ).1. N}. for all n.} such that the future joint process {(Xn. Let Zn = ( X (n -1 )N+19 41-1)N+2.) 4.) 3.25 An ergodic process is regenerative if and only if it is an instantaneous function of an induced process whose return-time process is I. fkil defines a stationary. then {Yn } is an instantaneous function of {Xn}.} is regenerative if there is a renewal process {R. A nonnegative measurable function f is said to be T-subinvariant. and note that the shift T on Aîz is invertible and ergodic and the function f (x) =X has finite expectation. Let T be a measure-preserving transformation. A point (x.2. that is.) Theorem 1. Show that a T-subinvariant function is almost-surely invariant.2. and let Yn = X(n-1)N+1 be the N-th term process defined by the (T. (Hint: consider g(x) = min{ f (x). let Wn = (Xn. (a) Show that {Zn } is the (TN. such that F(Tx) = TAF(x). In particular. if f(Tx) < f (x). BASIC CONCEPTS. a partition gives rise to a stationary coding. Show that there is a measurable mapping F: X 1-± A z which carries p. . where TA is the shift on A z .2. binary process via the concatenation-representation (17). 1. given that Rt = 1. the return-time process defined by T and B has the same distribution as {k1 }. positive-integer valued process {kJ} with E(R 1 ) < oc is the return-time process for some transformation and set. (Recall that if Y. see Exercise 24. Show that the (T. Proof Let p. P-IP)-process. This proves the theorem. P)-process. Let T be an invertible measure-preserving transformation and P = {Pa : a E Al a finite partition of the probability space (X. R. Determine a partition P of A z such that y is the Kolmogorov measure of the (TA. /2).

If T. 13. N) goes to (br . j). X x X. y). such that the mapping (x. Generalize the preceding exercise by showing that if T is a measure-preserving transformation on a probability space (X. for each br E AN. a symbol a E A N . defined by the following transition rules. x y). 16. 1} Z . y ± a) is not ergodic. i) can only go to (ar . 10. p. Show that the nonoverlapping n-blocking of an ergodic process may fail to be ergodic. . Show that the overlapping n-blocking of an ergodic process is ergodic. x and define S(x. Show that a concatenated-block process is ergodic. 12. 1).u x . . THE ERGODIC THEORY MODEL. Let T be the shift on X = {-1. . Show directly. (ii) (ar.u be the product measure on X defined by p. 15.) and {Tx : x c X) is a family of measure-preserving transformations on a probability space (Y.. Try) is a measure-preserving mapping on the product space (X x Y. then S is called the direct product of T and R. then S(x . on AN.. Give a formula expressing the (S. 2. that is. Show that if T is mixing then so is the direct product T x T: X x X where X x X is given the measure .2. (a) Show that S preserves v. but is totally ergodic.u. and number p E (0.) (b) Show that the process {} defined by setting 4. 11. Q)-name of (x. T ry) is measurable.SECTION 1. y) = (Tx. Show that an ergodic rotation is not mixing. (i) If i < N. with the product measure y = p. (ar . that if a is irrational then the direct product (x. (iii) (ar . y) 1-± (x a. v7P -1 P)-process. Insertion of random spacers between blocks can be used to change a concatenatedblock process into a mixing process.u(br). 'P)-process. N) goes to (br . 1) with probability (1 — p).n } be the stationary Markov chain with alphabet S = A N x {0. } is the (T.") 9. Txoy). is a mixing finite-state process. 14. for every x. Let {Z. = (ar . 17. Show that if T is mixing then it is totally ergodic. y) in terms of the coordinates of x and y. without using Proposition 6. for each br E AN. (b) Show that {147. (The (S. (c) Show that {Y } is the (T N .(-1) = p. if Y„. Q)-process is called "random walk with random scenery. (b) Let P be the time-zero partition of X and let Q = P x P. y) = (Tx. = a if I'm = (ar . y) i-± (Tx. 1. (a) Show that there is indeed a unique stationary Markov chain with these properties. = R.(1) = 1/2. S is called a skew product. (Hint: the transition matrix is irreducible. 0) and Ym = ai . 0) with probability p ite(br ). j > 0. N}. and let . i 1).. Let Y=Xx X. Fix a measure p. 31 8.

P)-process is an instantaneous coding of the induced (.} is a binary. BASIC CONCEPTS.) 20. then P and 2 are 3E-independent. Show that if {X. (b) Show that if P and 2 are 6-independent then Ea iii(Pa I Qb) — 11 (Pa)i < Ag.) 19.) 21. 25. Show that the tower defined by the induced transformation function n(x) is isomorphic to T. the induced transformation is measurable.b lit(Pa n Qb) — it(Pa)P. 24.25. except for a set of Qb's of total measure at most . pt). (Hint: obtain a picture like Figure 1. then use this to guide a proof. Define P = {Pa : a E AI and 2 = {Qb: b Ea. Prove that a stationary coding of a mixing process is mixing. (Hint: use the formula F-1 (C n T ' D) = F-1 C n TA.2. and let S = T be the tower transformation.g. Let X be the tower over X defined by f. except for a set of Qb's of total measure at most 6. (Hint: let T be the shift in the two-sided Kolmogorov representation and take B = Ix: xo = 1 1.18 for T -1 . ? I' and return-time 23. then it is the generalized renewal process for some transformation T and set B. preserves the conditional measure. 18. and is ergodic if T is ergodic. E BI to be 6-independent if (a) Show that p and 2 are independent if and only if they are 6-independent for each 6 > 0.i F -1 D.(Qb)1 . Let T be an invertible measure-preserving transformation and P = {Pa : a E A} a finite partition of the probability space (X.32 CHAPTER I. ._ E. E. Prove: a tower i'' over T induces on the base a transformation isomorphic to T. Show how to extend P to a partition 2 of X. ergodic process which is not identically 0.2. Prove Theorem 1.?. (c) Show that if Ea lia(PaiQb) — ii(Pa)i < E. d)-process. Prove that even if T is not invertible. such that the (T. 22.

just set C' = {[n. K]. for. . since f1 n f(T i-l x) d.3.1 (The ergodic theorem.u = 1. p. A strong cover C of Af is defined by an integer-valued function n m(n) for which m(n) > n. there be a disjoint subcollection that fills most of [1.}. Thus if T is ergodic then the limit function f*. even asymptotically. 2.a Packings from coverings..i.d.SECTION 1. then a positive and useful result is possible. The combinatorial idea is not a merely step on the way to the ergodic theorem.1-valued and T to be ergodic the following form of the ergodic theorem is obtained. if f is taken to be 0.. D.E f f(T i-l x)d. namely. for it is an important tool in its own right and will be used frequently in later parts of this book.) A strong cover C has a packing property. to pack an initial segment [1. In this discussion intervals are subsets of the natural numbers Ar = {1.2 (Ergodic theorem: binary ergodic form. processes to the general class of stationary processes. m(n)]}. I.3. m(n)]. Theorem 1. and consists of all intervals of the form [n. f (T i-1 x) converges almost surely and in L 1 -norm to a T -invariant function f*(x). If it is only required. ..) If T is a measure-preserving transformation on a probability space (X. THE ERGODIC THEOREM. The proof of the ergodic theorem will be based on a rather simple combinatorial result discussed in the next subsection. i > 1. The ergodic theorem in the almost-sure form presented here is due to G.) and if f is integrable then the average (1/n) EL.3. but the more general version will also be needed and is quite useful in many situations not treated in this book. E. m] = (j E n < j < m). however.+Xn )In converges almost surely to the constant value E(X 1 ). .3. The L'-convergence implies that f f* d 1a = f f d ia. (The word "strong" is used to indicate that every natural number is required to be the left endpoint of a member of the cover.This binary version of the ergodic theorem is sufficient for many results in this book.. Birkhoff and is often called Birkhoff's ergodic theorem or the individual ergodic theorem. it may not be possible. 33 Section 1. A general technique for extracting "almost packings" of integer intervals from certain kinds of "coverings" of the natural numbers will be described in this subsection. This is a trivial observation.u. where n1 = 1 and n i±i = 1 + m(n). n E H.) If {X n } is a binary ergodic process then the average (X 1 +X2+. The ergodic theorem extends the strong law of large numbers from i.3 The ergodic theorem. there is a subcover C' whose members are disjoint. K] by disjoint subcollections of a given strong cover C of the natural numbers. Theorem 1. The finite problem has a different character. since it is T-invariant.. of the form [n. unless the function m(n) is severely restricted. must be almost-surely equal to the constant value f f d. however.u = f f -E n n i=1 by the measure-preserving property of T. In particular.

The desired positive result is just an extension of this idea to the case when most of the intervals have length bounded by some L <8 K. K]. To carry this out rigorously set m(0) = no = 0 and. If K > LIS and if [1. K — L]: m(j) — j +1 < L}. start from the left and select successive disjoint intervals.3. by construction. (2) If j E [I. The (L.34 CHAPTER I.m(n i )]: 1 < i < I} is a (1 — 26)-packing of [1. thus guarantees that 1[1. The construction stops after I = I (C. . K) steps if m(n i ) > K — L or there is no j E[1 ±m(ni).1 of the [ni . Lemma 1. by induction. This produces the disjoint collection C' = {[iL +1. Let C be a strong cover of the natural numbers N. K]. The interval (K — L. K]. (1). For the interval [1. and apply a sequential greedy algorithm. A collection of subintervals C' of the interval [1. K — L]-1. 6)-strongly-covered by C if 1ln E [1. and are contained in [1.(i +1)L]: 0 < i < (K — L)IL} which covers all but at most the final L — 1 members of [1. K — L] for which m(j) — j + 1 < L. This completes the proof of the packing lemma. In particular. 6)-strong-cover assumption. K] is called a (1 —6)-packing of [1. stopping when within L of the end. K]: m(n) — n +1 > L}1 < 6. and hence m(n i ) < K. Proof By hypothesis K > LIS and (1) iin E K]: m(n) — n +1 > L11 < SK. The claim is that C' = {[n i . the definition of the n i implies the following fact. since m(n i )—n i +1 < L.m(n i )] has length at least (1 — 25)K. say L.ll< 6K. K] has length at most L — 1 < SK. K]. so that I(K —L. The construction of the (1 — 26)-packing will proceed sequentially from left to right. To motivate the positive result. define n i = minfj E [1+ m(n i _i). BASIC CONCEPTS. if 6 > 0 is given and K > LIS then all but at most a 6-fraction is covered. that is.K]—U]I< 6K. K — L] —Li then m(j) — j + 1 > L.) Let C be a strong cover of N and let S > 0 be given. then there is a subcollection C' c C which is a (1 — 26)packing of [1. K] if the intervals in C' are disjoint and their union has cardinality at least (1 — . selecting the first interval of length no more than L that is disjoint from the previous selections. K — L]. The interval [1. suppose all the intervals of C have the same length. Thus it is only necessary to show that the union 1. K] is (L. stopping when within L of the end of [1. 8)-strongly-covered by C.3 (The packing lemma. K]. The intervals are disjoint. K] is said to be (L.

Since lim sup n—>oo 1 +E xi + + xn = =sup n—>oo x2 + .3.SECTION 1. I. for each integer n there will be a first integer m(n) > n such that X n Xn+1 + • • • + Xm(n) m(n) — n + 1 > p.) Now suppose x E B. But. then the average over the entire interval [1. each interval in C(x) has the property that the average of the xi over that interval is too big by a fixed amount. Furthermore. • + Xn+1 the set B is T-invariant and therefore p.3.(1) + E. but in the proof of the general ergodic theorem use will be made of the explicit construction given in the proof of Lemma 1. THE ERGODIC THEOREM. including the description (2) of those indices that are neither within L of the end of the interval nor are contained in sets of the packing. ergodic process proof.3. if the average over most such disjoint blocks is too big.b The binary.. Suppose (3) is false. ( 1 ) n-+oo n i=i has positive measure. a. i < K are contained in disjoint intervals over which the average is too big.s. Some extensions of the packing lemma will be mentioned in the exercises. The packing lemma can be used to reduce the problem to a nonoverlapping interval problem. for.u.a(1). large interval. which is a contradiction. K] must also be too big by a somewhat smaller fixed amount. Without loss of generality the first of these possibilities can be assumed. however. . Since this occurs with high probability it implies that the expected value of the average over the entire set A K must be too big. In this case . and the goal is to show that (3) lirn — x i = . it will imply that if K is large enough then with high probability.3.u. 1}. m(n)]: n E JVI is a (random) strong cover of the natural numbers .3. These intervals overlap.4 In most applications of the packing lemma all that matters is the result. is invariant and ergodic with respect to the shift T on {0. A proof of the ergodic theorem for the binary ergodic process case will be given first. Thus the collection C(x) = lln. (Here is where the ergodicity assumption is used. Since B is T-invariant. n—*øo n i=1 i n where A(1) = A{x: x 1 = 1} = E(X1). as it illustrates the ideas in simplest form.(B) = 1. Then either the limit superior of the averages is too large on a set of positive measure or the limit inferior is too small on a set of positive measure. most of the terms xi .A'. and hence there is an E > 0 such that the set {n B = x : lirn sup — E xi > . combined with a simple observation about almost-surely finite variables. so it is not easy to see what happens to the average over a fixed. 35 Remark 1.

then .E f (T i-1 x) > al.36 CHAPTER I. Thus given 3 > 0 there is a number L such that if D = {x: m(1) > L}. it is bounded except for a set of small probability. the binary. and since the intervals in C'(x) are disjoint.m(ni)]: i < I (x)} C C(X) which is a (1 . the lower bound is independent of x.3. First. note that the function g K(x) = i=1 1 IC where x denotes the indicator function. since T is measure preserving. The preceding argument will now be extended to obtain the general ergodic theorem. Thus the packing lemma implies that if K > LI 8 and x E GK then there is a subcollection Cf (x) = {[ni.23)-packing of [1. let f let a be an arbitrary real number. 3)-strongcovering of [1.(1) + c]. since the random variable m(1) is almost-surely finite. ergodic process form of the ergodic theorem. by assumption. The Markov inequality implies that for each K. are nonnegative. and thereby the proof of Theorem 1. Thus.) > (1 . has integral . and define the set 1 n B = {x: lim sup . n i=i Then fB f (x) d(x) > a g(B). Since the xi.26)K ku(1) + El u(GK) > (1 . E. BASIC CONCEPTS.u(D) <62. the set GK = {X: gic(x) < 3} has measure at least 1 .5 Let T be a measure-preserving transformation on (X.a(1) + c]. taking expected values yields (K E iE =1 •x > (1 . A bit of preparation is needed before the packing lemma can be applied. E L i (X. J=1 which cannot be true for all S. > i=1 j_—n 1 (l _ 28)K [p. The definitions of D and GK imply that if x E GK then C(x) is an (L. it will be used to obtain the general theorem. K]. that is.23)(1 6)[.u). I. the sum over the intervals in C'(x) lower bounds the sum over the entire interval.(1)+ c] . Lemma 1.8.a). as long as x E GK. (1) = E (—Ex . The following lemma generalizes the essence of what was actually proved. Note that while the collection C(x) and subcollection C'(x) both depend on x. 1(x) m(n) (4) j=1 x. E.26)(1 (5)K [p. .3. This completes the proof of (3).u(D) < 6 2 . . . Second.K].c The proof in the general case.3.2. so that 1 K p.

3.12 + 13. E jE[1. (Keep in mind that the collection {[ni. Furthermore.m(ni)]} depends on x.) In the bounded case. together with the lower bound (6).5 is true in the ergodic case for bounded functions. As before. Since B is T-invariant. The lemma is clearly true if p(B) = 0. if K > LIB then the packing lemma can be applied and the argument used to prove (4) yields a lower bound of the form E f( Ti-lx ) j=1 (6) where R(K . In the unbounded case.(x1B) > a.u(. a bit more care is needed to control the effect of R(K. given 3 > 0 there is an L such that if D= E B: m(1) > L } then . x) as well as the effect of integrating over GK in place of X. x) = > R(K . the sum R(K . this lemma is essentially what was just proved. though integrable. THE ERGODIC THEOREM. + (1 — 28)K a. x) + E E f (T i-1 x) i=1 j=ni 1(x) m(ni) R(K .m(n.m(n i )] intervals.SECTION 1. Only a bit more is needed to handle the general case.. m(n)]: n E AO is a (random) strong cover of the natural numbers Js/.i(B) > O. For x E B and n E A1 there is a least integer m(n) > n such that (5) Em i= (n n ) f (T i-1 m(n)—n+1 > a. 1B).3.u(D1B) <82 and for every K the set 1 G K = {X E B: 1 -7 i=1 E xD(Ti-ix) K 6} has conditional measure at least 1 — B. where f is the indicator function of a set C and a = p(C) c. where f is allowed to be unbounded. x) is bounded from below by —2M KS and the earlier argument produces the desired conclusion. 37 Proof Note that in the special case where the process is ergodic. The same argument can be used to show that Lemma 1. (1 — 8)a A(G KIB) . does give 1 v-IC s f (x) dia(x1B) = (7) a' B i=i II -I.1q—U[n. so it can be assumed that .)] f (T x) is the sum of f(Ti -l x) over the indices that do not belong to the [ne. the collection C(x) = fin. Thus. say 1f(x)1 < M. An integration. The set B is T-invariant and the restriction of T to it preserves the conditional measure . it is enough to prove fB f (x) c 1 ii.

. which is small if K is large enough. f The measure-preserving property of T gives f3-Gic f (T . E GK jE[1. will be small if 8 is small enough.)] GK — K 1 jE(K—L. see (7). .3. m(n)] then Ti -l x E D.k.M(ri./ x) clitt(xIB) = fT_J(B-GK) f(x) Since f E L I and since all the sets T -i(B — GK) have the same measure. the final three terms can all be made arbitrarily small. all of the terms in 11.(xiB). that if j E [1. 1 ___ K f (Ti -1 x) c 1 ii. see (2). for any fixed L. K — L] — m(ni)] then m(j) — j 1 > L.(xim. (1 — kt(G + 11 + ± 13. so that passage to the appropriate limit shows that indeed fB f (x) dp.x)1 f (P -1 x)I dwxiB).38 where CHAPTER I. in the inequality f D1 f (x)1 clia(x1B).5. The integral 13 is also easy to bound for it satisfies 1 f dp.(x1B). BASIC CONCEPTS. K fB—Gic j=1 12 = f(Ti -l x) dp. K — L] — U[ni . and hence 11 itself.u(xIB). which completes the proof of Lemma 1. Thus 1121 5_ -17 1 f v—N K .K]—U[ni. L f(x) d ii(x B) ?. recall from the proof of the packing lemma. which is upper bounded by 8.)] E 13= (T i-1 x) d.K— L] —U[ni.m(n.(x1B).B xD(T' . 1131 15.K] and hence the measure-preserving property of T yields 1131 B 1 fl(x) dp.(x1B) > a. In summary. To bound 12. In the current setting this translates into the statement that if j E [1. and the measure-preserving property of T yields the bound 1121 5_ which is small if L is large enough.B jE(K—L.

5 remains true for any T-invariant subset of B. The packing lemma is a simpler version of a packing idea used to establish entropy theorems for amenable groups in [51]. Since almost-sure convergence holds. This completes the proof of the Birkhoff ergodic theorem. n.3. Proofs of the maximal theorem and Kingman's subadditive ergodic theorem which use ideas from the packing lemma will be sketched in the exercises. Thus the averaging operator An f 1 E n n i=i T is linear and is a contraction. 39 Proof of the Birkhoff ergodic theorem: general case. IA f II If Il f E L I . A n f — Am f 5- g)11 + ii Ag — Amg ii 2 11 f — gil + 11Ang — A m g iI — g) — A ni (f — The L . Moreover. g E L I then for all m. as well as von Neumann's elegant functional analytic proof of L 2-convergence.3. it follows that f* E L I and that A n f converges in L'-norm to f*. To complete the discussion of the ergodic theorem L' -convergence must be established.3. xE X.o0 n i=1 { E E then fi. Since. its application to prove the ergodic theorem as given here was developed by this author in 1980 and published much later in [68]. then A n f (x)I < M. and packing ideas which have been used recently in ergodic and information theory and which will be used frequently in this book. elegant proof due to Garsia that appears in many books on ergodic theory.SECTION 1.3. 1=1 Remark 1. Most subsequent proofs of the ergodic theorem reduce it to a stronger result. f EL I . as was just shown. [32]. THE ERGODIC THEOREM. To do this define the operator UT on L I by the formula. that is. Ag converges in L'-norm. But if f. almost surely. Thus if 1 n 1 n f (T i x) C = x: lim inf — f (T i x) <a < fi < lim sup — n-400 n i=1 n—. almost surely. if f is a bounded function. UT f (x) = f (Tx). called the maximal ergodic theorem. The reader is referred to Krengel's book for historical commentary and numerous other ergodic theorems. The proof given here is less elegant.-convergence of A. the dominated convergence theorem implies L' convergence of A n f for bounded functions. but it is much closer in spirit to the general partitioning.5. To prove almost-sure convergence first note that Lemma 1. say I f(x)I < M. for g can be taken to be a bounded approximation to f.6 Birkhoff's proof was based on a proof of Lemma 1. covering. whose statement is obtained by replacing lirn sup by sup in Lemma 1. Other proofs that were stimulated by [51] are given in [25. and since the average An f converges almost surely to a limit function f*. 27]. . UT f = if Ii where 11f II = f Ifl diu is the L'-norm.5.u(C)< Jc f d which can only be true if ti(C) = O.3. including the short. There are several different proofs of the maximal theorem. that is. f follows immediately from this inequality. The operator UT is linear and is an isometry because T is measure preserving. This proves almost sure convergence.

Suppose it is a stationary measure on A'. almost surely. lim almost surely. Proof Since r (x) is almost-surely finite.) If r is an almost-surely finite stopping time for an ergodic process t then for any > 0 and almost every x there is an N = N (3. r) = {[n.r(Tn. The interval [n. the function r(Tn -l x) is finite for every n.u) with values in the extended natural numbers . hence. starting at n. there is an integer L such that p({x: r(x) > LI) < 6/4. see (5). BASIC CONCEPTS. An almost-surely finite stopping time r has the property that for almost every x. which is a common requirement in probability theory.d Extensions of the packing lemma. the stopping time for each shift 7' 1 x is finite. Several later results will be proved using extensions of the packing lemma. Gn = then x E ix 1 x---11 s :— n 2x D (Ti . t(Tn -1 x) n — 1]: n E of stopping intervals is. Note that the concept of stopping time used here does not require that X be a sequence space. A (generalized) stopping time is a measurable function r defined on the probability space (X. Let D be the set of all x for which r(x) > L. so that if 1 n 1=1 x D (T i-1 x) d p(x) = . x) such that if n > N then [1. I. Frequent use will be made of the following almost-sure stopping-time packing result for ergodic processes.3. A stopping time r is p.ix)< 6/ 41 G . however.40 CHAPTER I. The given proof of the ergodic theorem was based on the purely combinatorial packing lemma. By the ergodic theorem.Af U ool.3.u(D) <6/4. nor that the set {x: r(x) = nl be measurable with respect to the first n coordinates. the collection C = C(x. The stopping time idea was used implicitly in the proof of the ergodic theorem. Thus almost-sure finiteness guarantees that.u ({x: r(x) = °o }) = O. for the first time m that rin f (Ti -1 x) > ma is a stopping time.-almost-surely finite if . Lemma 1. E.7 (The ergodic stopping-time packing lemma. a (random) strong cover of the natural numbers A. eventually almost surely. for an almost-surely finite stopping time. the packing lemma and its relatives can be reformulated in a more convenient way as lemmas about stopping times. given 8 > 0. Now that the ergodic theorem is available. In particular. r).n] is (1 — 3)-packed by intervals from C(x. . . it is bounded except for a set of small measure.lx)-F n — 1] is called the stopping interval for x. for almost every x.

. For a stopping time r which is not almost-surely infinite. v. This proves the lemma. K] of total length 13 K . let r be a stopping time such that .) 2.e Exercises.) Let {[ui. and let {[si . The definition of G. with the proofs left as exercises. ti]: j E J} be another disjoint collection of subintervals of [1. r). n — L]. yi ] that meet the boundary of some [si.3. n] has a y -packing C' c C(x. Formulate the partial-packing lemma as a strictly combinatorial lemma about the natural numbers. Let p. These are stated as the next two lemmas. 1. ti]. be an ergodic measure on A'. 5. n] and hence for at least (1 — 812)n indices i in the interval [1. n] is said to be separated if there is at least one integer between each interval in C'. 4. K] of total length aK. and let r be an almost-surely finite stopping time which is almost-surely bounded below by a positive integer M > 1/8. where M > m13. Show that a separated (1 — S)-packing of [1. be an ergodic measure on A. But this means that [1. a partial-packing result that holds in the case when the stopping time is assumed to be finite only on a set of positive measure.9 (The two-packings lemma. Prove the two-packings lemma. The packing lemma produces a (L. n] is determined by its complement. for which Tn -l x E F(r).) Let p. Two variations on these general ideas will also be useful later. (Hint: use the fact that the cardinality of J is at most K I M to estimate the total length of those [u. n] is (L. Then the disjoint collection {[s i . n] by a subset of C(x. Prove the partial-packing lemma.23)K. r) Lemma 1. and a two-packings variation. THE ERGODIC THEOREM. Suppose each [ui . 1 — 3/2)-stronglycovered by C(x. Show that for almost every x. r) to be the set of all intervals [n. and each [s1 . For almost every x. n] has a separated (1 — 28)-packing C' c C(x. ti ]. 1 — 3/2)-packing of [1. there is an N = N (x. Lemma 1. I. yi] meets at least one of the [s i . r(r-1 n — 1]. put F(r) = {x: r(x) < ocl.8 (The partial-packing lemma. ti ]: j E J } U flub i E I*1 has total length at least (a + p . A (1 — 3)-packing C' of [1. r).) . 41 Suppose x E G„ and n > 4L/3.u(F(r)) > y > O.3.SECTION 1.3. Show that this need not be true if some intervals are not separated by at least one integer. implies that r(T -(' -1) x) < L for at least (1 — 314)n indices i in the interval [1. then [1. (Hint: define F'(x) = r(x) + 1 and apply the ergodic stopping-time lemma to F. Let I* be the set of indices i E I such that [ui.] has length m. r). vd: i E I} be a disjoint collection of subintervals of [1. r) such that if n > N. 3. and define C(x.3. there is an integer N= x) such that if n > N then [1. ti ] has length at least M.

for any L 2 function f. and G n (r . function. Show that if r is an almost-surely finite stopping time for a stationary process p. n ] is (1 — (5)-packed by intervals from oc. if f E LI . 8)) 1. N. let Un be the set of x's for whichEL -01 f (T i x) > 0.(Gn (r. (a) Prove the theorem under the additional assumption that gn < 0. [33]. C(x. 6. Assume T is measure preserving and { g. show that (Hint: let r(x) = n.F+M is dense in L 2 .(x) < g(x) g ni (Tn x). (a) For bounded f.u(B). (e) For B = {x: sup E71-01 f (T i x) > a}. 10. The ideas of the following proof are due to R. show that r(x)-1 E i=o f (T i x)xE(T i x) ?_ 0.Jones. then ii. r(x) = 1.42 CHAPTER I. Assume T is measure preserving and f is integrable. (b) Show that . as n 7. for some r(x) < E Un .Steele. x E f (T i x)xE(T i x) i=o — (N — fII. (d) Deduce that L' -convergenceholds. Assume T is invertible so that UT is unitary.) (c) Deduce von Neumann's theorem from the two preceding results. (d) Show that the theorem holds for integrable functions. then apply packing. and for each n. (Hint: show that its orthogonal complement is 0. then from r(x) to r(T) — 1. x 0 E. 9. 8) is the set of all x such that [1. (Hint: first show that g(x) = liminf g n (x)In is an invariant function..) (b) For bounded f and L > N. BASIC CONCEPTS. (Hint: sum from 0 to r(x) — 1. The ideas of the following proof are due to M. show that h f da a.} is a sequence of integrable functions such that gn+.) (c) Show that the theorem holds for bounded functions. Fix N and let E = Un <N Un . then the averages (1/N) E7 f con8.) . (e) Extend the preceding arguments to the case when it is only assumed that T is measure preserving. continuing until within N of L. r). but not integrable f (T i-1 x) converge almost surely to oc. [80]. One form of the maximal ergodic theorem asserts that JE f (x) di(x) > 0. The mean ergodic theorem of von Neumann asserts that (1/n) EL I verges in L2-norm. Kingman's subadditive ergodic theorem asserts that gn (x)/ n converges almost surely to an invariant function g(x) > —oc. Show that if ti is ergodic and if f is a nonnegative measurable. (a) Show that the theorem is true for f E F = If E L 2 : UT f = f} and for f E .A4 = (I — UT)L 2 .

(b) Prove the theorem by reducing it to the case when g_n ≤ 0. (Hint: let g'_n(x) = g_n(x) - Σ_{i=0}^{n-1} g_1(T^i x).)

(c) Show that if a = inf_n (1/n) ∫ g_n dμ > -∞, then convergence is also in L^1-norm and a = ∫ g dμ.

(d) Show that the same conclusions hold if g_n ≤ 0 and g_{n+g+m}(x) ≤ g_n(x) + g_m(T^{n+g}x) + γ_g, where γ_g → 0 as g → ∞.

Section 1.4 Frequencies of finite blocks.

An important consequence of the ergodic theorem for stationary finite-alphabet processes is that relative frequencies of overlapping k-blocks in n-blocks converge almost surely as n → ∞. In the ergodic case, limiting relative frequencies converge almost surely to the corresponding probabilities, a property characterizing ergodic processes. In the nonergodic case, the existence of limiting frequencies allows a stationary process to be represented as an average of ergodic processes. These and related ideas will be discussed in this section.

I.4.a Frequencies for ergodic processes.

The frequency of the block a_1^k in the sequence x_1^n is defined for n ≥ k by the formula

f(a_1^k | x_1^n) = Σ_{i=1}^{n-k+1} χ_{[a_1^k]}(T^{i-1}x),

where χ_{[a_1^k]} denotes the indicator function of the cylinder set [a_1^k]. The frequency can also be expressed in the alternate form

f(a_1^k | x_1^n) = |{i ∈ [1, n-k+1]: x_i^{i+k-1} = a_1^k}|,

where | · | denotes cardinality. Thus f(a_1^k | x_1^n) is obtained by sliding a window of length k along x_1^n and counting the number of times a_1^k is seen in the window. The relative frequency is defined by dividing the frequency by the maximum possible number of occurrences, to obtain

p_k(a_1^k | x_1^n) = f(a_1^k | x_1^n)/(n-k+1) = |{i ∈ [1, n-k+1]: x_i^{i+k-1} = a_1^k}| / (n-k+1).

If x_1^n and k ≤ n are fixed, the relative frequency defines a measure p_k(· | x_1^n) on A^k, called the empirical distribution of overlapping k-blocks, or the (overlapping) k-type of x_1^n. When k is understood, the subscript k on p_k(a_1^k | x_1^n) may be omitted.

The limiting (relative) frequency of a_1^k in the infinite sequence x is defined by

(1)    p(a_1^k | x) = lim_n p(a_1^k | x_1^n),

provided, of course, that this limit exists. A sequence x is said to be (frequency) typical for the process μ if each block a_1^k appears in x with limiting relative frequency equal to μ(a_1^k), that is, x is typical for μ if for each k the empirical distribution of each k-block converges to its theoretical probability μ(a_1^k). In other words, with probability 1,

lim_n p(a_1^k | x_1^n) = μ(a_1^k), for all k and all a_1^k.
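For readers who like to compute, the overlapping k-type p_k(· | x_1^n) is easy to tabulate directly from its definition; the short Python sketch below (an illustrative aside, not part of the text) performs exactly the sliding-window count described above.

from collections import Counter

def empirical_k_type(x, k):
    # Overlapping k-type p_k(. | x): count each length-k window and divide
    # by the number of window positions, n - k + 1.
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(x)")
    counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    total = n - k + 1
    return {block: c / total for block, c in counts.items()}

# Example: the 2-type of a short binary sequence.
print(empirical_k_type([0, 1, 1, 0, 1], 2))   # {(0, 1): 0.5, (1, 1): 0.25, (1, 0): 0.25}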

Thus the entire structure of an ergodic process can be almost surely recovered merely by observing limiting relative frequencies along a single sample path.

Let T(μ) denote the set of sequences that are frequency-typical for μ. A basic result for ergodic processes is that almost every sequence is typical.

Theorem 1.4.1 (The typical-sequence theorem.) If μ is ergodic then μ(T(μ)) = 1.

Proof In the ergodic case,

p(a_1^k | x_1^n) = (1/(n-k+1)) Σ_{i=1}^{n-k+1} χ_{[a_1^k]}(T^{i-1}x)

converges almost surely to ∫ χ_{[a_1^k]} dμ = μ(a_1^k), by the ergodic theorem. Thus, for fixed a_1^k, there is a set B(a_1^k) of measure 0, such that p(a_1^k | x_1^n) → μ(a_1^k), for x ∉ B(a_1^k). Since there are only a countable number of possible a_1^k, the set

B = ∪_{k=1}^∞ ∪_{a_1^k ∈ A^k} B(a_1^k)

has measure 0 and convergence holds for all k and all a_1^k, for x ∉ B, a set of measure 1. This proves the theorem.

The converse of the preceding theorem is also true.

Theorem 1.4.2 (The typical-sequence converse.) Suppose μ is a stationary measure such that for each k and for each block a_1^k, the limiting relative frequencies p(a_1^k | x) exist and are constant in x, μ-almost surely. Then μ is an ergodic measure.

Proof If the limiting relative frequency p(a_1^k | x) exists and is a constant c, almost surely, then this constant c must equal μ(a_1^k), since the ergodic theorem includes L^1-convergence of the averages to a function with the same mean. Thus the hypotheses imply that the formula

lim_n (1/n) Σ_{i=1}^n χ_B(T^{i-1}x) = μ(B)

holds for almost all x for any cylinder set B. Multiplying each side of this equation by the indicator function χ_C(x), where C is an arbitrary measurable set, produces

lim_{n→∞} (1/n) Σ_{i=1}^n χ_B(T^{i-1}x) χ_C(x) = μ(B) χ_C(x),

and an integration then yields

(2)    lim_n (1/n) Σ_{i=1}^n μ(T^{-i}B ∩ C) = μ(B)μ(C).

4.) (a) If ji is ergodic then 4 (b) If lima p. FREQUENCIES OF FINITE BLOCKS.) A measure IL is ergodic if and only if for each block ail' and E > 0 there is an N = N(a. The members of the "good" set. ergodic.2. is Proof Part (a) is just a finite version of the typical-sequence theorem. G. c)-typical sequences.(k. eventually almost surely. 6) = An — G n (k. But if B = B then the formula gives it(B) = A(B) 2 . c). Note that as part of the preceding proof the following characterizations of ergodicity were established. for every integer k > 0 and every c > 0. (i) T is ergodic. or when k and c are understood.. G . C An such that tt(Cn) > 1 — E.4. The precise definition of the "good" set. = 1.(Ti-lx) = tt(E). The condition that A n (G n (k. C.4. Sequences of length n can be partitioned into two classes. This proves Theorem 1.e.4. then 1/3(41fi') — p(aini)I < c. for each A and B ma generating algebra. then p. a E The "bad" set is the complement. B n (k. (ii) If x. E)) converge to 1 for any E > 0 is just the condition that p(41x7) converges in probability to ii(4). 6) = p(afix) — . and B. 6)) E Gn (k.4. Theorem 1. 45 The latter holds for any cylinder set B and measurable set C. which depends on k and a measure of error.2.5 (The finite form of ergodicity. E). Since p(an. It is also useful to have the following finite-sequence form of the typical-sequence characterization of ergodic processes. . and hence (b) follows from Theorem 1. for every k and c > O. Some finite forms of the typical sequence theorems will be discussed next. a. so that p(B) must be 0 or 1. E). the limit must be almost surely constant.1.3 The following are equivalent. Theorem 1.4 (The "good set" form of ergodicity.u(A)A(B).4.n (G n (k. is (3) G n (k. A(T -i A n B) = .q) is bounded and converges almost surely to some limit. are often called the frequency-(k.(iii) limn x . Theorem 1. c) such that if n > N then there is a collection C.L(a)1 < c. just the typical sequences.4. (ii) lima pi. The usual approximation argument then shows that formula (2) holds for any measurable sets B and C. so that those in Gn will have "good" k-block frequencies and the "bad" set Bn has low probability when n is large. for each E in a generating algebra.SECTION 1. Theorem 1.

eventually almost surely.b The ergodic theorem and covering properties.6 (The covering theorem. A sequence xr is said to be (1 — 8)strongly-covered by B n if I {i E [1.) If kt is an ergodic process with alphabet A.s. Theorem 1.) If 12. M — n]: E 13n}1 — ML(13n)1 <8M. and if Bn is a subset of measure.4. BASIC CONCEPTS.n . X. E). satisfies A(13 n ) > 1 — then xr is eventually almost surely (1 — (5)-strongly-covered by r3n. and hence p(a Ix) is constant almost surely. Theorem 1. xr is "mostly strongly covered" by blocks from Bn . is chosen to have large measure. M — n] for almost surely as M which x: +n-1 E Bn . by saying that. for any 8 > 0. the "good" set defined by (3).4.a(a n n e . This simple consequence of the ergodic theorem can be stated as either a limit result or as an approximation result. The ergodic theorem is often used to derive covering properties of finite sample paths for ergodic processes. a set Bn c An with desired properties is determined in some way. In these applications. To establish the converse fix a il' and E > 0. and use convergence in probability to select n > N(a.4. then the ergodic theorem is applied to show that.4. 1]: xn-1 B n }i < 8M. Thus. IP(ailx) — P(ailZ)I 3E. n ) > 1 — 2E. eventually oc.co In other words.46 CHAPTER I.4. if pn (B n iXr) < b. If n El. eventually almost surely. a. informally.u(an) > 1 — Cn satisfies (i) and (ii) put En = {x: 4 E Cn} and note that . is an ergodic process with alphabet A. In many situations the set 13. then liM Pn(BniXi ) M—). in which case the conclusion (4) is expressed. Proof Apply the ergodic theorem to the indicator function of Bn .7 (The almost-covering principle. M — n that is. . and if 8. c) such that . C An. there are approximately tt(Bn )M indices i E [1. An of positive 1-1 (Bn). I. then define G n = ix: IP(a inxi) — P(a inx)I < EI. E _ _ n E. proving the theorem. just set Cn = G n (k. and use Theorem 1. (4) E [1. Proof If it is ergodic.
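The strong-covering condition that appears in the covering results above is simple to check numerically for a finite sequence and a given collection of n-blocks; the following Python sketch (an informal aside, using one common indexing convention) computes the covered fraction.

def covered_fraction(x, blocks, n):
    # Fraction of starting positions whose length-n window lies in `blocks`;
    # x is (1 - delta)-strongly-covered by `blocks` when this is >= 1 - delta.
    starts = len(x) - n + 1
    if starts <= 0:
        raise ValueError("sequence shorter than block length")
    hits = sum(1 for i in range(starts) if tuple(x[i:i + n]) in blocks)
    return hits / starts

good_blocks = {(0, 1), (1, 0)}
print(covered_fraction([0, 1, 0, 1, 1, 0], good_blocks, 2))   # 0.8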

however. see Exercise 2. almost all of which will. It follows easily from (5). FREQUENCIES OF FINITE BLOCKS. for some i E [1. The following example illustrates one simple use of almost strong-covering. Since 3 is arbitrary. limiting relative frequencies of all orders exist almost surely. The set of all x for which the limit p(ainx) = lim p(44) exists for all k and all di will be denoted by E. F(xr) = min{i E [1.4. In the stationary nonergodic case. then at least one member of 13n must start within 3M of the beginning of xr and at least one member of 1:3n must start within 8M of the end of xr.SECTION 1.7. Theorem 1. so that L(xr) — F(41) > (1 — 3 8 )M. in the limit. choose n so large that A(B n ) > 1 — 3.2.00 L(41 )— F(41 ) = 1.c The ergodic decomposition. Note. and then to obtain almost strong-covering by 13n . Note that the ergodic theorem is used twice in the example. The sequences that produce the same limit measure can be grouped into classes and the process measure can then be represented as an average of the limit measures. nil. then F(41 ) < 23M and L(xr) > (1 — 3)M. first to select the set 13n . But if this is so and M > n/3. together with the ergodic theorem. I. Thus is the set of all sequences for which all limiting relative frequencies exist for all possible blocks of all possible sizes. with the convention F(41) = L(41 ) = 0. M]: x i = a} L(x) = max{i E [1. Such "doubling" is common in applications. Such an n exists by the ergodic theorem. but the limit measure varies from sequence to sequence. in fact.4. that if xr is (1 — 6)strongly-covered by 13n . The set is shift invariant and any shift-invariant measure has its support in This support result. is stated as the following theorem.4. the desired result. almost surely equal to 1/p. This result also follows easily from the return-time picture discussed in Section I. (5).(a). almost surely. which is a simple consequence of the ergodic theorem. These ideas will be made precise in this section. 47 The almost strong-covering idea and a related almost-packing idea are discussed in more detail in Section L7. implies that xr is eventually almost surely (1 — 3)-strongly-covered by Bn .c. Example 1. if no a appears in xr. Let F (xr) and L(xr) denote the first and last occurrences of a E Xr. be an ergodic process and fix a symbol a E A of positive probability. let lim kr. Given 8 > 0. Bn = {a7: ai = a.8 Let p. The almost-covering idea will be used to show that (5) To prove this.4. that the expected waiting time between occurrences of a along a sample path is. be ergodic. that is. The almost-covering principle. follows. M]: x i = a}. .

1.. each frequency-equivalence class is either entirely contained in E or disjoint from E.u onto a measure w = . Theorem 1. (x). — ix). .ux will be called the (limiting) empirical measure determined by x. BE E. The measure . shows that to prove the ergodic decomposition theorem it is enough to prove the following lemma. are called the ergodic components of kt and the formula represents p. and extends by countable additivity to any Borel set B. Next define /27(x) = ktx. for if x E r then the formula .C must El have measure 0 with respect to 1.1. as an average of its ergodic components.u(alic) = f p(af Ix) dp.(Ti.5. The usual measure theory argument. by (6). together with the finite form of ergodicity. Theorem 1.u.( x )(B) da)(7r(x)). ± i E x [ ak. x E E. To make this assertion precise first note that the ergodic theorem gives the formula (6) . CHAPTER I.10 ( The ergodic decomposition theorem. Formula (6) can then be expressed in the form (7) . a shift-invariant measure always has its support in the set of sequences whose empirical measures are ergodic.u o x -1 on equivalence classes.)c) can be thought of as measures. The projection x is measurable and transforms the measure . k 1 i=1 the ergodic theorem implies that p(aflx) = limn p(aii`ix) exists almost surely.(E) = 1 and hence the representation (7) takes the form.x is ergodic. Since there are only a countable number of the al` .tx on A. (B)= f p. Of course. Theorem 1.9 If it is a shift-invariant measure then A ( C) = 1. to a Borel measure i. An excellent discussion of ergodic components and their interpretation in communications theory can be found in [ 1 9 ]..(x )(B) dco(7(x)).. is shift-invariant then p.u.. e In other words. which can be extended.48 Theorem 1. Theorem 1. the complement of .u(B) = f 1. Let n denote the projection onto frequency equivalence classes. Let E denote the set of all sequences x E L for which the measure ti. which is well-defined on the frequency equivalence class of x. BASIC CONCEPTS.x (a) = p(ainx). B E E. for each k and 4. Two sequences x and y are said to be frequency-equivalent if /ix = A y . Proof Since p (all' 14) = n .4.9 can be strengthened to assert that a shift-invariant measure kt must actually be supported by the sequences for which the limiting frequencies not only exist but define ergodic measures... . by the Kolmogorov consistency theorem. which establishes the theorem. 4 E Ak defines a measure itx on sequences of length k.(x).9 may now be expressed in the following much stronger form.4. for each k. n —k+ 1 The frequencies p(41..) If p..4.4.4. This formula indeed holds for any cylinder set. The ergodic measures.

SECTION 1. the sequence {p(al 14)} converges to p(41x).10.9.E. Let L i = Li(y) denote the set of all x E L such that (8) ip(a'nx) — p(41y)l < E/4. [57]. the set Ps of shift-invariant probability measures is compact in the weak*-topology (obtained by thinking of continuous functions as linear functionals on the space of all measures via the mapping it i— f f dit. Thus the conditional measure it (. This completes the proof of the ergodic decomposition theorem. ILi) is shift-invariant and 1 . for any x E L. z E L3.) The extreme points of Ps correspond exactly to the ergodic measures. The conditions (9) and (8) guarantee that the relative frequency of occurrence of 4 in any two members x7. and fix a sequence y E L. fix a block 4. itiz of Cn differ by no more E.q. such that the frequency of occurrence of 4 in any two members of C. such that for any x E X1 there is an N such that if n > N there is a collection C. Indeed.4. 1 (L1) L1 P(Crilz) d iu(z) > 1 — 6 2 so the Markov inequality yields a set L3 C L2 such that /4L3) > (1 — E)/L(L 1 ) and for which . hence For each x E Li. FREQUENCIES OF FINITE BLOCKS. Z E r3. c A n .4. The unit interval [0. 49 Lemma 1. tt(CnICI) = ti Li P(xilz) dtt(z). 0 Remark 1. that the relative frequency of occurrence of al in any two members of Li differ by no more than E/2. Proof Fix E > 0. x' iz E An.c.„ E p(xz) ?. in particular. differs by no more than E. Fix n > N and put Cn = {x: x E L2 }. 1 . Note. oo. see Exercise 1 of Section 1.u(xilLi) = A(CI) In particular. 1] is bounded so L can be covered by a finite number of the sets Li(y) and hence the lemma is proved. Theorem 1. as n there is an integer N and a set L2 C Li of measure at least (1 — E2 )A(L1) such that (9) IP(414) — p(alx)I < E/4. .11 Given c > 0 and 4 there is a set X1 with ii.4. x E L2. the set of sequences with limiting frequencies of all orders.(Xi) > 1 — E. The set L1 is invariant since p(ainx) = p(4`17' x). which translates into the statement that p(C) > 1 — E.12 The ergodic decomposition can also be viewed as an application of the ChoquetBishop-deLeeuw theorem of functional analysis. n > N. with /L(C) > 1 — c.4.

Show that {Xn } is ergodic if and only its reversed process is ergodic.d Exercises 1.) 3. almost surely. (a) Determine the ergodic components of the (T 3 . t): where n = tk r. Show that an ergodic finite-state process is a function of an ergodic Markov chain. (Hint: combine Exercise 2 with Theorem 1. . I. Assume p. Let P be the time-0 partition for an ergodic Markov chain with period d = 3. P)-process. P)-process. Let T be an ergodic rotation of the circle and P the two-set partition of the unit circle into upper and lower halves. (It is an open question whether a mixing finite-state process is a function of a mixing Markov chain. where P is the Kolmogorov partition in the two-sided representation of {X. (Hint: show that the average time between occurrences of a in 4 is close to 1/pi (a14). Describe the ergodic components of the (T x T . " be the set of all x E [a] such that Tnx E [a].1 ) = I{/ E [0.P x P)-process.n = a} — 1 R1 (x) = minfm > E (x): x m = a} — E Rico i=1 . P V TP y T 2P)-process.: n > 1}.) 5. Define the mapping x (x) = min{m > 1: x. This exercise illustrates a direct method for transporting a sample-path theorem from an ergodic process to a sample-path theorem for a (nonstationary) function of the process.50 CHAPTER I.4. is the conditional measure on [a] defined by an ergodic process. = (a) Show that if T is totally ergodic then qk (ali`14) .5. Let {X„: n > 1} be a stationary process. (b) Show that the preceding result is not true if T is not totally ergodic. qk (alnx. (Hint: with high probability (x k .) 4. m+k ) and (4. R2(x).) 7. Combine (5) with the ergodic theorem to establish that for an ergodic process the expected waiting time between occurrences of a symbol a E A is just 111. z„. Let qk • 14) be the empirical distribution of nonoverlapping k-blocks in 4.1 by many n.4. BASIC CONCEPTS. 8. zt) are looking at different parts of y. 6.u(x lic). The reversed process {Yn : n > 1} is the . that is.) A measure on Aa " transports to the measure y = . 0 < r < k.u o R -1 on Aroo. (This is just the return-time process associated with the set B = [a].t(a). 2.2 for a definition of this process. See Exercise 8 in Section 1. for infinitely Fix a E A and let Aa R(x) = {R 1 (X). Show that . Use the central limit theorem to show that the "random walk with random scenery" process is ergodic. (b) Determine the ergodic components of the (T 3 .

v-almost surely.SECTION 1.1. THE ENTROPY THEOREM. the lemma follows from the preceding lemma. almost surely. The next three lemmas contain the facts about the entropy function and its connection to counting that will be needed to prove the entropy theorem. z(a) 111AI. The entropy of a probability distribution 7(a) on a finite set A is defined by H(z) = — E 71. except for some interesting cases. log means the base 2 logarithm. see Exercise 4c. n so that if A n is the measure on An defined by tt then H(t) = nE(hn (x)). Proof An elementary calculus exercise.(4) is nonincreasing in n. (Hint: use the Borel mapping lemma.u. of the process.(4) = h.5.(4) 1 log p. some counting arguments.(4). and the concept of the entropy of a finite distribution.1 (The entropy theorem. The proof of the entropy theorem will use the packing lemma.5.2 The entropy function H(7) is concave in 7 and attains its maximum value log I AI only for the uniform distribution. has limit 0. then v(R(T)) = 1. the measure p. 0 Lemma 1.(x)) _< log I AI.(a) log 7(a). Proof Since 1/(. for ergodic processes. In the theorem and henceforth. where A is finite. with a constant exponential rate called the entropy or entropy-rate. the decrease is almost surely exponential in n. For stationary processes. Lemma 1.3 E(17. where S is the shift on (e) If T is the set of the set of frequency-typical sequences for j.) (d) (1/n) En i R i -÷ 11 p. and denoted by h = h(t). (Hint: use the preceding result. Theorem 1. There is a nonnegative number h = h(A) such that n oo n 1 lim — log .5. CI . 51 (b) R(T R ' (x ) x) = SR(x).5 The entropy theorem.(a). The entropy theorem asserts that.) Section 1.4) = nE(h n (x)). Lemma 1. aEA Let 1 1 h n (x) = — log — n ti. (a) The mapping x R(X) is Borel.17.) Let it be an ergodic measure for the shift T on the space A'. and. and the natural logarithm will be denoted by ln.5.

Lemma 1.5.4 (The combinations bound.) If \binom{n}{k} denotes the number of combinations of n objects taken k at a time and δ ≤ 1/2, then

Σ_{k ≤ nδ} \binom{n}{k} ≤ 2^{nH(δ)},

where H(δ) = -δ log δ - (1 - δ) log(1 - δ).

Proof The function -q log δ - (1 - q) log(1 - δ) is increasing in the interval 0 ≤ q ≤ δ, and hence

2^{-nH(δ)} ≤ δ^k (1 - δ)^{n-k},   k ≤ nδ.

Multiplying by \binom{n}{k} and summing gives

2^{-nH(δ)} Σ_{k ≤ nδ} \binom{n}{k} ≤ Σ_{k ≤ nδ} \binom{n}{k} δ^k (1 - δ)^{n-k} ≤ Σ_k \binom{n}{k} δ^k (1 - δ)^{n-k} = 1,

which proves the lemma.

The function H(δ) = -δ log δ - (1 - δ) log(1 - δ) is called the binary entropy function. It is just the entropy of the binary distribution with μ(1) = δ. It can be shown, see [7], that the binary entropy function is the correct (asymptotic) exponent in the number of binary sequences of length n that have no more than δn ones, that is,

lim_{n→∞} (1/n) log Σ_{k ≤ nδ} \binom{n}{k} = H(δ),

see Exercise 4.

I.5.a The proof of the entropy theorem.

Define

h(x) = lim inf_n h_n(x), where h_n(x) = -(1/n) log μ(x_1^n).

The first task is to show that h(x) is constant, almost surely. To establish this note that if y = Tx then [y_1^{n-1}] = [x_2^n], so that μ(y_1^{n-1}) ≥ μ(x_1^n), and hence h(Tx) ≤ h(x), almost surely, that is, h(x) is subinvariant. Thus h(Tx) = h(x), almost surely, since subinvariant functions are almost surely invariant. Since T is assumed to be ergodic the invariant function h(x) must indeed be constant, almost surely. Define h to be the constant such that h = lim inf_n h_n(x), almost surely.

The goal is to show that h must be equal to lim sup_n h_n(x), almost surely. Towards this end, let ε be a positive number. The definition of limit inferior implies that for almost all x the inequality h_n(x) ≤ h + ε holds for infinitely many values of n, a fact expressed in exponential form as

(1)    μ(x_1^n) ≥ 2^{-n(h+ε)}, infinitely often, almost surely.

The goal is to show that "infinitely often, almost surely" can be replaced by "eventually, almost surely," for a suitable multiple of ε.

Three ideas are used to complete the proof. The first idea is a packing idea. Eventually almost surely, most of a sample path is filled by disjoint blocks of varying lengths for each of which the inequality (1) holds. This is a simple application of the ergodic stopping-time packing lemma.

The second idea is a counting idea. The set of sample paths of length K that can be mostly filled by disjoint, long subblocks for which (1) holds cannot have cardinality exponentially much larger than 2^{K(h+ε)}. Indeed, if these subblocks are to be long, then there are not too many ways to specify their locations; if they mostly fill, then there are not too many ways the parts outside the long blocks can be filled; and, most important of all, once locations for these subblocks are specified, then since a location of length n can be filled in at most 2^{n(h+ε)} ways if (1) is to hold, there will be a total of at most 2^{K(h+ε)} ways to fill all the locations.

The third idea is a probability idea. If a set of K-length sample paths has cardinality only a bit more than 2^{K(h+ε)}, then it is very unlikely that a sample path in the set has probability exponentially much smaller than 2^{-K(h+ε)}. This is just an application of the fact that upper bounds on cardinality "almost" imply lower bounds on probability, Lemma I.1.18(b).

To fill in the details of the first idea, let δ be a positive number and M > 1/δ an integer, both to be specified later. For each K ≥ M, let G_K(δ, M) be the set of all x_1^K that are (1 - 2δ)-packed by disjoint blocks of length at least M for which the inequality (1) holds. In other words, x_1^K ∈ G_K(δ, M) if and only if there is a collection S = S(x_1^K) = {[n_i, m_i]} of disjoint subintervals of [1, K] with the following properties.

(a) m_i - n_i + 1 ≥ M, [n_i, m_i] ∈ S.
(b) μ(x_{n_i}^{m_i}) ≥ 2^{-(m_i-n_i+1)(h+ε)}, [n_i, m_i] ∈ S.
(c) Σ_{[n_i,m_i] ∈ S} (m_i - n_i + 1) ≥ (1 - 2δ)K.

An application of the packing lemma produces the following result.

Lemma 1.5.5  x_1^K ∈ G_K(δ, M), eventually almost surely.

Proof Define τ(x) to be the first time n ≥ M such that μ([x_1^n]) ≥ 2^{-n(h+ε)}. Since τ is measurable and (1) holds infinitely often, almost surely, τ is an almost surely finite stopping time. An application of the ergodic stopping-time lemma, Lemma 1.3.7, then yields the lemma.
The second idea, the counting idea, is expressed as the following lemma.

Lemma 1.5.6 There is a δ > 0 and an M > 1/δ such that |G_K(δ, M)| ≤ 2^{K(h+2ε)}, for all sufficiently large K.

Proof A collection S = {[n_i, m_i]} of disjoint subintervals of [1, K] will be called a skeleton if it satisfies the requirement that m_i - n_i + 1 ≥ M, for each i, and if it covers all but a 2δ-fraction of [1, K], that is,

(2)    Σ_{[n_i,m_i] ∈ S} (m_i - n_i + 1) ≥ (1 - 2δ)K.

A sequence x_1^K is said to be compatible with such a skeleton S if

μ(x_{n_i}^{m_i}) ≥ 2^{-(m_i-n_i+1)(h+ε)}, for each i.

The bound of the lemma will be obtained by first upper bounding the number of possible skeletons, then upper bounding the number of sequences x_1^K that are compatible with a given skeleton. The product of these two numbers is an upper bound for the cardinality of G_K(δ, M), and a suitable choice of δ will then establish the lemma.

First note that the requirement that each member of a skeleton S have length at least M means that |S| ≤ K/M, and hence there are at most K/M ways to choose the starting points of the intervals in S. Thus the number of possible skeletons is upper bounded by

(3)    Σ_{k ≤ K/M} \binom{K}{k} ≤ 2^{K H(1/M)},

where the upper bound is provided by Lemma 1.5.4, with H(·) denoting the binary entropy function.

Fix a skeleton S = {[n_i, m_i]}. The condition that x_1^K be compatible with S means that the compatibility condition

(4)    μ(x_{n_i}^{m_i}) ≥ 2^{-(m_i-n_i+1)(h+ε)}

must hold for each [n_i, m_i] ∈ S. For a given [n_i, m_i] the number of ways x_{n_i}^{m_i} can be chosen so that the compatibility condition (4) holds is upper bounded by 2^{(m_i-n_i+1)(h+ε)}, by the principle that lower bounds on probability imply upper bounds on cardinality, Lemma I.1.18(a). Thus, the number of ways x_j can be chosen so that j ∈ ∪_i [n_i, m_i] and so that the compatibility conditions hold is upper bounded by

Π_i 2^{(m_i-n_i+1)(h+ε)} ≤ 2^{K(h+ε)}.


Outside the union of the [n_i, m_i] there are no conditions on x_j. Since, however, there are fewer than 2δK such j, these positions can be filled in at most |A|^{2δK} ways. Thus, there are at most

|A|^{2δK} 2^{K(h+ε)}

sequences compatible with a given skeleton S = {[n_i, m_i]}. Combining this with the bound, (3), on the number of possible skeletons yields

(5)    |G_K(δ, M)| ≤ 2^{K H(1/M)} |A|^{2δK} 2^{K(h+ε)}.
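In outline, the choice of δ and M made in the next paragraph rests on rewriting (5) as a single exponential (a routine step, recorded here for convenience):

|G_K(δ, M)| ≤ 2^{K(h + ε + H(1/M) + 2δ log|A|)},

so it suffices first to take δ ≤ ε/(4 log|A|), and then to take M > 1/δ large enough that H(1/M) ≤ ε/2; this gives H(1/M) + 2δ log|A| ≤ ε and hence the bound 2^{K(h+2ε)} asserted in Lemma 1.5.6.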

Since the binary entropy function H(1/M) approaches 0 as M → ∞ and since |A| is finite, the numbers δ > 0 and M > 1/δ can indeed be chosen so that |G_K(δ, M)| ≤ 2^{K(h+2ε)}, for all sufficiently large K. This completes the proof of Lemma 1.5.6.

Fix δ > 0 and M > 1/δ for which Lemma 1.5.6 holds, put G_K = G_K(δ, M), and let B_K be the set of all x_1^K for which μ(x_1^K) ≤ 2^{-K(h+3ε)}. Then

μ(B_K ∩ G_K) ≤ |G_K| 2^{-K(h+3ε)} ≤ 2^{-Kε}

holds for all sufficiently large K. Thus, x_1^K ∉ B_K ∩ G_K, eventually almost surely, by the Borel-Cantelli principle. Since x_1^K ∈ G_K, eventually almost surely, the iterated almost-sure principle, Lemma I.1.15, implies that x_1^K ∉ B_K, eventually almost surely, that is,

lim sup_{K→∞} h_K(x) ≤ h + 3ε, a.s.

In summary, for each ε > 0,

h = lim inf_K h_K(x) ≤ lim sup_{K→∞} h_K(x) ≤ h + 3ε, a.s.,

which completes the proof of the entropy theorem, since ε is arbitrary.

Remark 1.5.7 The entropy theorem was first proved for Markov processes by Shannon, with convergence in probability established by McMillan and almost-sure convergence later obtained by Breiman, see [4] for references to these results. In information theory the entropy theorem is called the asymptotic equipartition property, or AEP. In ergodic theory it has been traditionally known as the Shannon-McMillan-Breiman theorem. The more descriptive name "entropy theorem" is used in this book. The proof given is due to Ornstein and Weiss, [51], and appeared as part of their extension of ergodic theory ideas to random fields and general amenable group actions. A slight variant of their proof, based on the separated packing idea discussed in Exercise 1, Section 1.3, appeared in [68].
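As a small numerical aside (not part of the text), the combinations bound of Lemma 1.5.4, which drives the skeleton count above, is easy to check directly in Python:

from math import comb, log2

def binary_entropy(d):
    # H(d) = -d log d - (1-d) log(1-d), in bits, with H(0) = H(1) = 0.
    if d in (0.0, 1.0):
        return 0.0
    return -d * log2(d) - (1 - d) * log2(1 - d)

# Lemma 1.5.4: for delta <= 1/2, sum over k <= n*delta of C(n, k) <= 2^(n H(delta)).
n, delta = 100, 0.2
total = sum(comb(n, k) for k in range(int(n * delta) + 1))
print(log2(total), n * binary_entropy(delta))   # about 69.3 versus 72.2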

I.5.b Exercises.
1. Prove the entropy theorem for the i.i.d. case by using the product formula on μ(x_1^n), then taking the logarithm and using the strong law of large numbers. This yields the formula h = -Σ_a μ(a) log μ(a).

2. Use the idea suggested by the preceding exercise to prove the entropy theorem for ergodic Markov chains. What does it give for the value of h?

3. Suppose for each k, T_k is a subset of A^k of cardinality at most 2^{kα}. A sequence x_1^n is said to be (K, δ, {T_k})-packed if it can be expressed as the concatenation x_1^n = w(1) ⋯ w(t), such that the sum of the lengths of the w(i) which belong to ∪_{k≥K} T_k is at least (1 - δ)n. Let G_n be the set of all (K, δ, {T_k})-packed sequences x_1^n and let ε be a positive number. Show that if K is large enough, if δ is small enough, and if n is large enough relative to K and δ, then |G_n| ≤ 2^{n(α+ε)}.

4. Assume μ is ergodic and define c(x) = lim_n μ(x_1^n), x ∈ A^∞.

(a) Show that c(x) is almost surely a constant c.
(b) Show that if c > 0 then μ is concentrated on a finite set.
(c) Show that if μ is mixing then c(x) = 0 for every x.

Section 1.6 Entropy as expected value.

Entropy for ergodic processes, as defined by the entropy theorem, is given by the almost-sure limit

h = lim_{n→∞} (1/n) log (1/μ(x_1^n)).

Entropy can also be thought of as the limit of the expected value of the random quantity -(1/n) log μ(x_1^n). The expected value formulation of entropy will be developed in this section.
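In the i.i.d. case this expected value can be computed exactly by brute force for small n; the Python sketch below (an illustrative aside with a made-up three-letter distribution) confirms that the expected value of -(1/n) log μ(X_1^n) coincides with the single-letter entropy.

from itertools import product
from math import log2

# A small i.i.d. process: mu(x_1^n) is the product of the marginals, so the
# expected value of -(1/n) log mu(X_1^n) should equal H(p) for every n.
p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
H = -sum(q * log2(q) for q in p.values())

n = 4
expected = 0.0
for seq in product(p, repeat=n):
    mu = 1.0
    for s in seq:
        mu *= p[s]
    expected += mu * (-log2(mu) / n)

print(H, expected)   # both equal 1.5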

I.6.a The entropy of a random variable.

Let X be a finite-valued random variable with distribution defined by p(x) = Prob(X = x), x ∈ A. The entropy of X is defined as the expected value of the random variable - log p(X), that is,

H(X) = Σ_{x∈A} p(x) log (1/p(x)) = - Σ_{x∈A} p(x) log p(x).

The logarithm base is 2 and the conventions 0 log 0 = 0 and log 0 = -∞ are used. If p is the distribution of X, then H(p) may be used in place of H(X). For a pair (X, Y) of random variables with a joint distribution p(x, y) = Prob(X = x, Y = y), the notation H(X, Y) = - Σ_{x,y} p(x, y) log p(x, y) will be used, a notation which extends to random vectors. Most of the useful properties of entropy depend on the concavity of the logarithm function. One way to organize the concavity idea is expressed as follows.

Lemma 1.6.1 If p and q are probability k-vectors then

- Σ_i p_i log p_i ≤ - Σ_i p_i log q_i,

with equality if and only if p = q.


Proof The natural logarithm is strictly concave, so that ln x ≤ x - 1, with equality if and only if x = 1. Thus

Σ_i p_i ln (q_i/p_i) ≤ Σ_i p_i (q_i/p_i - 1) = Σ_i q_i - Σ_i p_i = 0,

with equality if and only if q_i = p_i, 1 ≤ i ≤ k. This proves the lemma, since log x = (ln x)/(ln 2).

The proof only requires that q be a sub-probability vector, that is, nonnegative with Σ_i q_i ≤ 1. The sum

D(p‖q) = Σ_i p_i log (p_i/q_i)

is called the (informational) divergence, or cross-entropy, and the preceding lemma is expressed in the following form.

Lemma 1.6.2 (The divergence inequality.) If p is a probability k-vector and q is a sub-probability k-vector then D(p‖q) ≥ 0, with equality if and only if p = q.
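A quick numerical illustration of these two quantities (an informal aside, with arbitrary example vectors):

from math import log2

def entropy(p):
    # H(p) = -sum p_i log p_i, in bits, with the convention 0 log 0 = 0.
    return -sum(a * log2(a) for a in p if a > 0)

def divergence(p, q):
    # D(p||q) = sum p_i log(p_i / q_i); finite when q_i > 0 wherever p_i > 0.
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 0.3, 0.2]
q = [1/3, 1/3, 1/3]
print(entropy(p))         # about 1.485
print(divergence(p, q))   # about 0.100, positive since p != q
print(divergence(p, p))   # 0.0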
A further generalization of the lemma, called the log-sum inequality, is included in the exercises. The basic inequalities for entropy are summarized in the following theorem.

Theorem 1.6.3 (Entropy inequalities.)

(a) Positivity. H(X) ≥ 0, with equality if and only if X is constant.

(b) Boundedness. If X has k values then H(X) ≤ log k, with equality if and only if each p(x) = 1/k.

(c) Subadditivity. H(X, Y) ≤ H(X) + H(Y), with equality if and only if X and Y are independent.

Proof Positivity is easy to prove, while boundedness is obtained from Lemma 1.6.1 by setting p_x = p(x), q_x = 1/k. To establish subadditivity note that

H(X, Y) = - Σ_{x,y} p(x, y) log p(x, y),

then replace p(x, y) in the logarithm factor by p(x)p(y), and use Lemma 1.6.1 to obtain the inequality H(X, Y) ≤ H(X) + H(Y), with equality if and only if p(x, y) = p(x)p(y), that is, if and only if X and Y are independent. This completes the proof of the theorem.

The concept of conditional entropy provides a convenient tool for organizing further results. If p(x, y) is a given joint distribution, with corresponding conditional distribution p(x|y) = p(x, y)/p(y), then

H(X|Y) = - Σ_{x,y} p(x, y) log p(x|y) = - Σ_{x,y} p(x, y) log (p(x, y)/p(y))


is called the conditional entropy of X, given Y. (Note that this is a slight variation on standard probability language which would call — Ex p(xIy) log p(xly) the conditional entropy. In information theory, however, the common practice is to take expected values with respect to the marginal, p(y), as is done here.) The key identity for conditional entropy is the following addition law.

(1)    H(X, Y) = H(Y) + H(X|Y).

This is easily proved using the additive property of the logarithm, log ab = log a+log b. The previous unconditional inequalities extend to conditional entropy as follows. (The proofs are left to the reader.)

Theorem 1.6.4 (Conditional entropy inequalities.)

(a) Positivity. H(X|Y) ≥ 0, with equality if and only if X is a function of Y.

(b) Boundedness. H(X|Y) ≤ H(X), with equality if and only if X and Y are independent.

(c) Subadditivity. H((X, Y)|Z) ≤ H(X|Z) + H(Y|Z), with equality if and only if X and Y are conditionally independent given Z.
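Conditional entropy, defined here with the expectation taken over the marginal p(y), can be checked against the addition law (1) on any small joint distribution; the Python sketch below uses a made-up joint distribution purely for illustration.

from math import log2

joint = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

def H_joint(p):
    # H(X, Y) = -sum p(x, y) log p(x, y).
    return -sum(v * log2(v) for v in p.values() if v > 0)

# Marginal p(y) and the conditional entropy H(X|Y) = -sum p(x, y) log p(x|y).
pY = {}
for (x, y), v in joint.items():
    pY[y] = pY.get(y, 0.0) + v
H_Y = -sum(v * log2(v) for v in pY.values() if v > 0)
H_X_given_Y = -sum(v * log2(v / pY[y]) for (x, y), v in joint.items() if v > 0)

print(H_joint(joint), H_Y + H_X_given_Y)   # both about 1.846, as in (1)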

A useful fact is that conditional entropy H(X I Y) increases as more is known about the first variable and decreases as more is known about the second variable, that is, for any functions f and g,

Lemma 1.6.5  H(f(X)|Y) ≤ H(X|Y) ≤ H(X|g(Y)).
The proof follows from the concavity of the logarithm function. This can be done directly (left to the reader), or using the partition formulation of entropy which is developed in the following paragraphs. The entropy of a random variable X really depends only on the partition P_X = {P_a: a ∈ A} defined by P_a = {x: X(x) = a}, which is called the partition defined by X. The entropy H(P) is defined as H(X), where X is any random variable such that P_X = P. Note that the join P_X ∨ P_Y is just the partition defined by the vector (X, Y), so that H(P_X ∨ P_Y) = H(X, Y). The conditional entropy of P relative to Q is then defined by H(P|Q) = H(P ∨ Q) - H(Q). The partition point of view provides a useful geometric framework for interpretation of the inequalities in Lemma 1.6.5, because the partition P_X is a refinement of the partition P_{f(X)}, since each atom of P_{f(X)} is a union of atoms of P_X. The inequalities in Lemma 1.6.5 are expressed in partition form as follows.

Lemma 1.6.6 (a) If P refines Q then H(Q|R) ≤ H(P|R). (b) If R refines S then H(P|S) ≥ H(P|R).

Sb = Rb U Rb.. so that if b = sup . 0 < r < n. To prove (h) it is enough to consider the case when R. so that P = Pv Q. .b The entropy of a process.} of A-valued random variables is defined by H(r) = E p(x)log 1 ATEAn =— p(x'iL)log p(4).6. The entropy of a process is defined by a suitable passage to the limit. and the inequality (iii) uses the fact that H (P I VR. H (P IR) (i) (P v 21 1Z) H(QITZ) + H(PIQ v 1Z) H (OR). Let Pc denote the partition of the set C defined by restricting the sets in P to C. that is. Proof Let a = infn an /n. The n-th order entropy of a sequence {X1. the equality (ii) is just the general addition law.. The equality (i) follows from the fact that P refines Q. This is a consequence of the following basic subadditivity property of nonnegative sequences. < pan + b. If m > n write m = np +.6. say Ra = Sa . which establishes inequality (b).) If {an} is a sequence of nonnegative numbers which is subadditive. as follows. . a 0 b.. Subadditivity gives anp < pan . is obtained from S by splitting one atom of S into two pieces. where the conditional measure. IC) is used on C.r.) > O.n a. Given c > 0 choose n so that an < n(a ± 6).6. an-Fm < an + am . namely. I. then limn an ln exists and equals infn an /n. then another use of subadditivity yields am < anp + a. it (. 59 Proof The proof of (a) is accomplished by manipulating with entropy formulas. X2. n n xn In the stationary case the limit superior is a limit. t a0b II(Ra) The latter is the same as H(PIS).7 (The subadditivity lemma. 11) P(x1 xnEan E The process entropy is defined by passing to a limit. Lemma 1. • The quantity H(P IR.) can be expressed as -EEtt(Pt n R a ) log t ab it(Pt n Ra) + (4)11 (Psb liZsb ) Ewpt n Ra ) log It(Pt n Ra) + it(Sb)1-1(Psb ).SECTION I. (2) 1 1 1 H ({X n }) = lim sup — H((X7) = lim sup — p(4) log n n p(4) . ENTROPY AS EXPECTED VALUE.

H(X. BASIC CONCEPTS. and the fact that np/m —> 1.6. Lemma 1. The subadditivity property for entropy. and the simple fact from analysis that if an > 0 and an+ i — an decreases to b then an /n —> b. as in —> pc.1 )= H(X0IX:n1). and process entropy H. the process entropy H is the same as the entropy-rate h of the entropy theorem.e.6. The next goal is to show that for ergodic processes.u (x '0 = h.5._ . Towards this end. 1B) denote the conditional measure on B and use the trivial bound to obtain P(a1B)log p(aIB) . from Lemma 1.8 (The subset entropy-bound. can then be applied to give H({X n })= lim H (X01X1 nI ) . CI This proves the lemma. that is urn—log n 1 1 n . Proof Let p(. aEB E P(a)log p(a) + log p(B)) . If the process is stationary then H(X n + -FT) = H(r) so the subadditivity lemma with an = H(X7) implies that the limit superior in (2) is a limit. . 0 Let p. Division by m. Y) = H(Y)± H(X1Y). Theorem I.6.) _ E p(a) log p(a) _< p(B) log IBI — p(B)log p(B).60 CHAPTER I. (3) n--4co a formula often expressed in the suggestive form (4) H({X n }) = H (X01X_1 0 ) . n > 0. and let h be the entropy-rate of A as given by the entropy theorem.3(c). The right-hand side is decreasing in n. be an ergodic process with alphabet A. to produce the formula H(X ° n )— H(X1. the following lemma will be useful in controlling entropy on small sets. aEB for any B c A. gives lim sup ani /m < a + E. a. gives H(X7 +m) < H(X7)+ H(rnTin)._ log IBIaEB — E 1 The left-hand side is the same as ( p(B) from which the result follows. An alternative formula for the process entropy in the stationary case is obtained by using the general addition law.

the following holds n(h — e) _< — log . Proof First assume that h > 0 and fix E such that 0 < E < h. both are often simply called the entropy. = {4: . this part must go to 0 as n —> oc.i. case.u(xliz) — E . since the entropy theorem implies that ti(B) goes to O. the process entropy H({X.u(xtil) A" = — E . As n —> co the measure of G.u(G n ) and summing produces (6) (h — 6) 1 G.(x ) log .SECTION 1.10 Since for ergodic processes. [4]. [18]. ENTROPY AS EXPECTED VALUE. discuss many information-theoretic aspects of entropy for the general ergodic process. contains an excellent discussion of combinatorial and communication theory aspects of entropy. Lemma 1.u(4) log p. so the sum converges to H({X„}). The Csiszdr-Körner book. On the set G. and Cover and Thomas. in the i. If h = 0 then define G. Thus h— E which proves the theorem in the case when h > O. from the entropy theorem.(Bn )log IA 1 — . 8„ since I B I < IAIn . The bound (5) still holds. The subset entropy-bound.8. [7].u(4)/n. while D only the upper bound in (6) matters and the theorem again follows.6.u(4) <_ n(h ± c).. [6].6. Define G. } Remark 1.(4).. Then H (X7) = — E „. = A n — G.}) is the same as the entropyrate h of the entropy theorem.6.6. G. = {4: 2—n(h+E) < 1. The two sums will be estimated separately.1 (4) < 2—n(h—E)} and let B. After division by n. approaches 1. A detailed mathematical exposition and philosophical interpretation of entropy as a measure of information can be found in Billingsley's book.i(B)log(B). . B.u(4) > 2 —ne . gives (5) — E A (x) log ii(x) < np.9 Process entropy H and entropy-rate h are the same for ergodic processes.d. The recent books by Gray. 61 Theorem 1.u (4 ) log . so multiplication by .

Theorem 1. This condition actually implies that the process is Markov of order k.i. Here they will be derived from the definition of process entropy. along with a direct calculation of H(X0IX_ 1 ) yields the following formula for the entropy of a Markov chain. I.62 CHAPTER I. Let {X. Xink+1). gives H (XI') = H(Xi) which yields the entropy formula H (X2) + H(X n ) = nH (X i ). An alternate proof can be given by using the conditional limit formula. (8) H({X n }) = H(X0IX_ i ) = — E 04 log Mii .6.6. Xi n k+1 )1X1) = H(X: n k+1 1X11) H(X 0 IXI kl .c The entropy of i.d. Entropy formulas for i. BASIC CONCEPTS.i.}) = lip H (X0IX1. a E A.(4). (3).5. (3). Recall that a process {Xn } is Markov of order k if Prob (X0 = aiX) = Prob (X0 = alX:k1 ) . H (X 0 1 = H(X01 X -1)• Thus the conditional entropy formula for the entropy of a process. The argument used in the preceding paragraph shows that if {X. process and let p(x) = Prob(Xi = x).!) = H(X0)• Now suppose {X n } is a Markov chain with stationary vector p and transition matrix M.3(c). (7) H({X. and Markov processes can be derived from the entropy theorem by using the ergodic theorem to directly estimate p. and the fact that H(Xl Y) = H(X) if X and Y are independent. Theorem 1.d. a from which it follows that E A. to obtain H({X. . .6.i.11 (The Markov order theorem. n > k. see Exercises 1 and 2 in Section 1.}) = H(X i ) = — p(x) log p(x). The additivity of entropy for independent random variables.d.} is Markov of order k then 1/({X}) = H(X0IX:1) = n > k. The Markov property implies that Prob (X0 = alX) = Prob (X0 = alX_i) . } be an i.) A stationary process {X n } is Markov of order k if and only if H(X0IX:) = H(X0IX: n1 ). and Markov processes. Proof The conditional addition law gives H((X o .

17(130= ii(pl(. To say that the process is Markov of some order is to say that Hk is eventually constant. I. for n > k.6. The type class of xi' will be denoted by T WO or by 7. If this is true for every n > k then the process must be Markov of order k. Theorem 1. if each symbol a appears in x'. rather than a particular sequence that defines the type. In general the conditional entropy function Hk = H(X0IX: k1 ) is nonincreasing in k. where the latter stresses the type p i = pi (. The equality condition of the subadditivity principle. if and only if each symbol a appears in xi' exactly n pi(a) times. and will be useful in some of the deeper interpretations of entropy to be given in later chapters. that is.) < 2r1(131).6. Theok+1 are conditionally indepenrem I. that is. can then be used to conclude that X0 and X:_ n dent given XI/. provided H(X01X1 k1 ) = H(X0 1X1n1 ). a product measure is always constant on each type class. 14). k > k*.. The empirical (first-order) entropy of xi' is the entropy of p i (. n Ifi: Pi(alxi) = = an . The empirical distribution or type of a sequence x E A" is the probability distribution Pi = pl (. The entropy of an empirical distribution gives the exponent for a bound on the number of sequences that could have produced that empirical distribution.d Entropy and types.6. for any type p i . then Hks_i > fik* = Hk . The following purely combinatorial result bounds the size of a type class in terms of the empirical entropy of the type. If the true order is k*. the same number of times it appears in Type-equivalence classes are called type classes. 63 The second term on the right can be replaced by H(X01X: k1 ). x'1' E T. This fact can be used to estimate the order of a Markov chain from observation of a sample path. a E A. In since np i (a) is the number of times the symbol a appears in a given 4 particular. ENTROPY AS EXPECTED VALUE. - Proof First note that if Qn is a product measure on A n then (9) Q" (x)= ji Q(xi) i=1 = fl aEA War P1(a) . Two sequences xi' and y'11 are said to be type-equivalent if they have the same type.14). This completes the proof of the theorem. x E rn Tpn. 14)) = a Pi(aixi)log Pi(aI4).4(c). Thus.SECTION 1. 14) on A defined by the relative frequency of occurrence of each symbol in the sequence 4. E .12 (The type class bound. that is. This fact is useful in large deviations theory.6.

and hence = ypn.' and yç' are said to be k-type-equivalent if they have the same k-type.14 (The number of k-types bound. . BASIC CONCEPTS. extends immediately to k-types. which. yields Pn(x7) = 2—n17(PI) . where Pk = Pk ( Ix).6. n. in particular. Later use will also be made of the fact that there are only polynomially many type classes. The empirical overlapping k-block distribution or k-type of a sequence x7 E A' is the probability distribution Pk = pk(' 14) on A k defined by the relative frequency of occurrence of each k-block in thesen [i quenk ce +ii'i th_a i± t ki_ S.) The number of possible k-types is at most (n — k + Note. (9). Pn (4) has the constant value 2' -17(P' ) on the type class of x7. as shown in the Csiszdr-Körner book. A type pi defines a product measure Pn on An by the formula Pn (z?) = pi (zi). This establishes Theorem 1. and k-type-equivalence classes are called k-type classes. while if k is also growing. The bound on the number of types. is the correct asymptotic exponential bound for the size of a type class.13. Replacing Qn by Pi in the product formula. Theorem 1. Two sequences x. the only possible values for . 12— n 11(” v 1 ) But Pi is a probability distribution so that Pi (Tpni ) < 1. 4 E T. while not tight. that is. z E A' .1 = E I -41 n— k 1 E A k.13 (The number-of-types bound.64 CHAPTER 1. as stated in the following theorem.12. The concept of type extends to the empirical distribution of overlapping k-blocks. The k-type class of x i" will be denoted by Tpnk . but satisfies k < a logiAi n. [7]. The upper bound is all that will be used in this book. that if k is fixed the number of possible k-types grows polynomially in n. Theorem 1. x E In other words. with a < 1. The bound 2' 17. after taking the logarithm and rewriting. Proof This follows from the fact that for each a E A.) The number of possible types is at most (n + 1)1A I.6. npi (alfii) are 0. Theorem 1.6.6. 1. produces Pn(xiz) = aEA pi (a)nPl(a) . E r pni . if and only if each block all' appears in xÇ exactly (n — k + 1) pk (alic) times. then the number of k-types is of lower order than IA In.

i = 1. > 0. This yields the desired bound for the first-order case. ak ak by formula (8). Let P be the time-0 partition for an ergodic Markov chain with period d = 3 and entropy H > O. Theorem 1. with Q replaced b y (2-1) = rf. Note that it is constant on the k-type class 'Tk(xiii) = Tp k of all sequences that have the same k-type Pk as xrii. . that is. . E ai . 65 Estimating the size of a k-type class is a bit trickier. This formula. ENTROPY AS EXPECTED VALUE.-j(a k Or 1 )Ti(a ki -1 ) log 75(akiar) E Pk(akil4) log Ti(ak lari ). I.SECTION 1. 2. however. The proof for general k is obtained by an obvious extension of the argument. since the k-type measures the frequency of overlapping k-blocks.b xr it c Tp2. A suitable bound can be obtained. then E ai log(ai I bi ) ?_ a log(a/b).(n.(1) and after suitable rewrite using the definition of W I) becomes 1 TV I) (Xlii ) = Ti-(1) (X02—(n-1)1 . and hence P x E 2. since P1) (4) is a probability measure on A n .1>ii(1 ) IT P2 I — 1n since ii:(1) (xi) > 1/(n — 1). Prove the log-sum inequality: If a. If Q is Markov a direct calculation yields the formula Q(4) = Q(xi)n Q(bi a)(n—l)p2(ab) a.6. by considering the (k — 1)-st order Markov measure defined by the k-type plc ( 14). > 0. n.e Exercises.) I7k (fil)1 < (n — Proof First consider the case k = 2. 2.15 (The k-type-class bound. and a 1.6. b.6. The entropy FP") is called the empirical (k — 1)-st order Markov entropy of 4. 2)-process. b = E bi . (a) Determine the entropies of the ergodic components of the (T3 . the stationary (k — 1)-st order Markov chain ri 0c -1) with the empirical transition function (ak 14 -1 ) = Pk(ainxii ) „ f k-1 „ir‘ uk IA 1 ) E bk pk and with the stationary distribution given by The Markov chain ri(k-1) has entropy (k - "Aar') = Ebk Pk(ar i 1) -E 1.

i(4). 3.2.u(Pi ) . (Hint: use part (a) to reduce to the two set case. S) = H H (SIXr) = H(S)+ H(X l iv IS) for any N> n.66 CHAPTER I. measurable with respect to xn oc . so H(Xr)1N H(Xr IS)1N. n] and note that H (X I v . Prove that the entropy of an n-stationary process {Yn } is the same as the entropy the stationary process {X„} obtained by randomizing the start. and to the related building-block concept. useful interpretations of entropy will be discussed in this section. (Hint: use stationarity. Show that the process entropy of a stationary process and its reversed process are the same. y) = (T x . Find the entropy of the (S.u(Fn ) > 1 — a and so that if xn E Fn then 27" 11.1} z . (Hint: recurrence of simple random walk implies that all the sites of y have been almost surely visited by the past of the walk. such that . Q)-process. BASIC CONCEPTS. that is. and use the equality of process entropy and entropy-rate.QI 1.7. (b) Determine the entropies of the ergodic components of the (T 3 . Show there is an N = N(a) such that if n > N then there is a set Fn . then use calculus. and define S(x. I. p V Tp T22)-process. Let Y = X x X. Let T be the shift on X = {-1. Let p be the time-zero partition of X and let Q = P x P. Toy).u0(-1) = p. if n and c are understood.. Let p be an ergodic process and let a be a positive number. (Hint: apply the ergodic theorem to f (x) = — log //.(Pi)1 A(Q i )). which is the "random walk with random scenery" process discussed in Exercise 8 of Section 1. be an ergodic process of entropy h. (a) Show that D (P v 7Z I Q v > Q) for any partition R. The second interpretation is the connection between entropy and expected code length for the special class of prefix codes. where jp .bt 1 a2012 .a Entropy-typical sequences. The first is an expression of the entropy theorem in exponential form.) Section 1. The entropy theorem may be expressed by saying that xril is eventually almost surely entropy-typical. €)-entropy-typical sequences. with the product measure y = j x it. which leads to the concept of entropy-typical sequence. Let p. namely.) 7. (b) Prove Pinsker's inequality.7 Interpretations of entropy. (Hint: let S be uniformly distributed on [1.(xi Ix).(1) = 1/2. Then show that I-1(r IS = s) is the same as the unshifted entropy of Xs/v+1 . and let ji be the product measure on X defined by . For E > 0 and n > 1 define En(E) 1 4: 2—n(h+E) < tt (x in) < 2—n(h-6)} The set T(E) is called the set of (n. Two simple.) 4.) 5. that D (P Il Q) > (1 / 2 ln 2)IP — Q1 2 . The divergence for k-set partitions is D(P Q) = Ei It(Pi)log(p. . or simply the set of entropy-typical sequences.) 6. (4) < t(xIx) 2".

A good way to think of . the measure is eventually mostly concentrated on a set of sequences of the (generally) much smaller cardinality 2n(h+6) • This fact is of key importance in information theory and plays a major role in many applications and interpretations of the entropy concept.(E12)) is summable in n.2.u(Cn n 'T.1 (The typical-sequence form of entropy. Theorem 1. eventually almost surely.7. sometimes it means frequencytypical sequences.i (e/ 2). The preceding theorem provides an upper bound on the cardinality of the set of typical sequences and depends on the fact that typical sequences have a lower bound on their probabilities.SECTION 1." has different meaning in different contexts. even though there are I A In possible sequences of length n.7. on the number of entropy-typical sequences.7q If C.) < 2n(h--€). Proof Since .X1' E 'T. eventually almost surely. suggested by the upper bound. The context usually makes clear the notion of typicality being used. in information theory.) For each E > 0. this fact yields an upper bound on the cardinality of Tn . Since the total probability is at most 1. a)-covering number by Arn (a) = minfICI: C c An. . (Al) List the n-sequences in decreasing order of probability. then . INTERPRETATIONS OF ENTROPY. The members of the entropy-typical set T(e) all have the lower bound 2 —n(h+f) on their probabilities. c A n and ICI _ Ci'. 67 The convergence in probability form.7.u(Tn (E)) = 1.3 (The too-small set principle. sometimes it means sequences that are both frequency-typical and entropy-typical.7.) The set of entropy-typical sequences satisfies 17-n (01 < 2n (h+E) Thus. Sometimes it means entropy-typical sequences. as defined in Section 1. namely. is known as the asymptotic equipartition property or AEP. that is.7. n'T. as defined here. Theorem 1. The cardinality bound on Cn and the probability upper bound 2 2) on members of Tn (E/2) combine to give the bound (Cn n Tn(E/2)) < 2n(12+02—n(h—E/2) < 2—nc/2 Thus . (A2) Count down the list until the first time a total probability of at least a is reached. n > 1. it is enough to show that x g C. and p(C)> a}.(E12). x E T(e). Theorem 1. fl Another useful formulation of entropy. Theorem 1. and the theorem is established. For a > 0 define the (n. eventually almost surely.4.I\rn (a) is given by the following algorithm for its calculation. Here the focus is on the entropy-typical idea. Typical sequences also have an upper bound on their probabilities. Ar(a) is the minimum number of sequences of length n needed to fill an afraction of the total probability. The phrase "typical sequence. limn . which leads to the fact that too-small sets cannot be visited too often. expresses the connection between entropy and coverings. and sometimes it is just shorthand for those sequences that are likely to occur.2 (The entropy-typical cardinality bound. eventually almost surely.

lim supn (l/n) log Nn (a) .i(C) > a and I C.4 (The covering-exponent theorem.) For each a > 0. the measure . ) and hence Nn (a) = ICn > IT(e) n Cn > 2n (h-E'A(Tn(E) n cn) > 2n(h-E)a(i 6).7. below. Let d(a. y. which will be discussed in the following paragraphs. S) = min{dn (4.5_ h.68 CHAPTER I. yri`) = — n i=1 yi)• The metric cin is also known as the per-letter Hamming distance. suppose. 1) and c > O. Proof Fix a E (0.2. It is important that a small blowup does not increase size by more than a small exponential factor.(6) n cn ) > a(1 — c). When this happens N. To state this precisely let H(8) = —6 log 6 (1— 6) log(1 — 8) denote the binary entropy function. defined by if a = b otherwise. b) be the (discrete) metric on A.(4) E x.n GI. . On the other hand.(x) < 2 —n(h—E) . by Theorem 1. BASIC CONCEPTS. The distance dn (x. fl The covering exponent idea is quite useful as a tool for entropy estimation. as n oc. 1 dn (4 .Jce) < iTn (01 < 2 n0-1-0 .e. = S) < 81. defined by 0 d (a . see. for example. that a rotation process has entropy O. < 2—n(h—E) 1— -6 . for E rn (E then implies that ACTn(E) n = p. If n is large enough then A(T. for each n. and hence.u(T. Theorem 1. Since the measure of the set of typical sequences goes to I as n oc.7.7.' E S}. . The covering number Arn (a) is the count obtained in (A2). which proves the theorem.I = (a). S) from 4 to a set S C An is defined by dn (x' . and the 6-neighborhood or 6-blowup of S is defined by [S ].. the proof in Section I.Ern (onc. b) = J 1 1 and extend to the metric d on A'.i (E)) eventually exceeds a. The connection between coverings and entropy also has as a useful approximate form. The fact that p. The connection with entropy is given by the following theorem. the covering exponent (1/n)logArn (a) converges to h.

then Irrn(E)ls1 _ both go to 0 as El will hold. The reader should also note the concepts of blowup and built-up are quite different. c An . Thus.5. M].=1 (mi — n i + 1) > (1 — (b) E I3. m i ] for which x . Proof Given x. 8) = min{ICI: C c An. provided only that n is not too small. that is. there are at most nS positions i in which xi can be changed to create a member of [ { x7 } 13..7. with occasional spacers inserted between the blocks. An application of Lemma 1. are allowed to depend on the sequence xr. The . thought of as the set of building blocks.7. This idea and several useful consequences of it will now be developed. is required to be a member of 13. subject only to the requirement that the total length of the spacers be at most SM.SECTION 1. of disjoint n-length subintervals of [1. Lemma 1. Each such position can be changed in IA — 1 ways.7. 6—>0 n—> oo n I. The blowup form of the covering number is defined by (1) Arn (a." from which longer sequences are made by concatenating typical sequences.7. together with Theorem 1. < n(h+2€) 2 In particular.5.4 then exists as n establishes that 1 lim lim — log Afn (a. for all n. An application of the ergodic theorem shows that frequency-typical sequences must consist mostly of n-blocks which are entropy-typical. 8) = h. Fix a collection 5. A sequence xr is said to be (1 — 0-built-up from the building blocks B. it is the minimum size of an n-set for which the 8-blowup covers an a-fraction of the probability. This proves the blowup-bound lemma. implies that I[{4}]51 < yields the stated bound. such that (a) Eil. 8) oc. the _ 2nH(3) (IAI — 1) n8 .) The 6-blowup of S satisfies 69 I[S]81 ISI2 n11(6) (1A1 — 1) n3 .b The building-block concept. < 2n(h-1-€) since (IAI — 1)n8 < 20 log IAI and since Slog IA I and H(8) Since IT(e)1 _ < 2n (h+26) 0. INTERPRETATIONS OF ENTROPY. it follows that if 8 is small enough. Both the number I and the intervals [ni . In the special case when 8 = 0 and M is a multiple of n. In particular.5 (The blowup bound. Also fix an integer M > n and 8 > O. to say that 41 is 1-built-up from Bn is the same as saying that xr is a concatenation of blocks from B„. which combinations bound. If 8 > 0 then the notion of (1-0-built-up requires that xr be a concatenation of blocks from 13. if there is an integer I = I(41 ) and a collection llni. It is left to the reader to prove that the limit of (1/n) log Nen (a. i c I..4. given c > 0 there is a S > 0 such that Ir1n(c)131 for all n. and //([C]3 ) a}. mi]: i < 1)..7. the entropy-typical n-sequences can be thought of as the "building blocks. with spacers allowed between the blocks. Lemma 1.

The building-block idea is closely related to the packing/covering ideas discussed in Section I.3. An important fact about the building-block concept is that if δ is small and n is large, then the set of M-sequences that can be (1 - δ)-built-up from a given set B_n ⊆ A^n of building blocks cannot be exponentially much larger in cardinality than the set of all sequences that can be formed by selecting M/n sequences from B_n and concatenating them without spacers. This is stated in precise form as the following lemma; its proof is similar in spirit to the proof of the key bound used to establish the entropy theorem, though simpler since now the blocks all have a fixed length. As usual, H(δ) = -δ log δ - (1 - δ) log(1 - δ) denotes the binary entropy function.

Lemma I.7.6 (The built-up set bound.) Let D_M be the set of all sequences x_1^M that can be (1 - δ)-built-up from a given collection B_n ⊆ A^n. Then

|D_M| ≤ |B_n|^{M/n} 2^{MH(δ)} |A|^{Mδ}.

In particular, if B_n = T_n(ε), the set of entropy-typical n-sequences relative to ε, and if δ is small enough, then |D_M| ≤ 2^{M(h+2ε)}, provided M is large enough.

Proof. The number of ways to select a family {[n_i, m_i]} of disjoint n-length subintervals that cover all but a δ-fraction of [1, M] is upper bounded by the number of ways to select at most δM points from a set with M members, which is, in turn, upper bounded by 2^{MH(δ)}, by the combinations bound. For a fixed configuration of locations, the number of ways to fill these with members of B_n is upper bounded by |B_n|^{M/n}, and the number of ways to fill the places that are not in ∪_i [n_i, m_i] is upper bounded by |A|^{δM}. Thus

|D_M| ≤ 2^{MH(δ)} |B_n|^{M/n} |A|^{δM},

which is the desired bound. The bound for entropy-typical building blocks follows immediately from the fact that |T_n(ε)| ≤ 2^{n(h+ε)}. This establishes the lemma. □

The argument used to prove the packing lemma can be used to show that almost strongly-covered sequences are also almost built-up. A sequence x_1^M is said to be (1 - δ)-strongly-covered by B_n ⊆ A^n if

|{i ∈ [1, M - n + 1]: x_i^{i+n-1} ∉ B_n}| ≤ δ(M - n + 1).
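Both coverage notions can be checked mechanically on concrete sequences. The sketch below is illustrative only (the names are mine): it computes the strongly-covered fraction and a greedy built-up parse of the kind used in the proof of the building-block lemma that follows; the greedy parse gives a valid, though not necessarily optimal, decomposition.

```python
def strongly_covered_fraction(x, blocks, n):
    """Fraction of starting positions i with x[i:i+n] in the block set."""
    M = len(x)
    starts = range(M - n + 1)
    hits = sum(1 for i in starts if tuple(x[i:i + n]) in blocks)
    return hits / len(starts)

def built_up_fraction(x, blocks, n):
    """
    Greedy left-to-right parse: whenever the next n-block lies in the block
    set, accept it and jump past it; other positions count as spacers.
    Returns the fraction of positions covered by accepted blocks, a lower
    bound on the best possible coverage, so x is (1-delta)-built-up from
    the blocks whenever this fraction is at least 1-delta.
    """
    M, i, covered = len(x), 0, 0
    while i <= M - n:
        if tuple(x[i:i + n]) in blocks:
            covered += n
            i += n
        else:
            i += 1
    return covered / M

# Example with blocks of length 3 over {0,1}.
blocks = {(0, 1, 1), (1, 0, 1)}
x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
print(strongly_covered_fraction(x, blocks, 3), built_up_fraction(x, blocks, 3))
```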

This almost-covering property is stated as the following lemma.

Lemma I.7.7 (The building-block lemma.) If x_1^M is (1 - δ/2)-strongly-covered by B_n, and if M ≥ 2n/δ, then x_1^M is (1 - δ)-built-up from B_n.

Proof. Put m_0 = 0 and, for i > 0, define n_i to be the least integer k > m_{i-1} such that x_k^{k+n-1} ∈ B_n, and put m_i = n_i + n - 1, stopping when within n of the end of x_1^M. The assumption that x_1^M is (1 - δ/2)-strongly-covered by B_n implies there are at most δM/2 indices j ≤ M - n + 1 which are not contained in one of the [n_i, m_i], while the condition that M ≥ 2n/δ implies there are at most δM/2 indices j ∈ [M - n + 1, M] which are not contained in one of the [n_i, m_i]. The lemma is therefore established. □

Remark I.7.8
The preceding two lemmas are strictly combinatorial. In combination with the ergodic and entropy theorems they provide powerful tools for analyzing the combinatorial properties of partitions of sample paths, a subject to be discussed in the next chapter.

The almost-covering principle of Section I.4 implies that eventually almost surely x_1^M is almost strongly-covered by sequences from T_n(ε), where n is a fixed large number, hence eventually almost surely mostly built-up from members of T_n(ε), and hence that x_1^M is eventually in D_M. A suitable application of the built-up set lemma, Lemma I.7.6, implies that the set D_M of sequences that are mostly built-up from the building blocks T_n(ε) will eventually have cardinality not exponentially much larger than 2^{M(h+ε)}. Of course, the sequences in D_M are not necessarily entropy-typical, for no upper and lower bounds on their probabilities are given; all that is known is that the members of D_M are almost built-up from entropy-typical n-blocks. This result, which can be viewed as the entropy version of the finite sequence form of the frequency-typical sequence characterization of ergodic processes, will be surprisingly useful later.

I.7.c Entropy and prefix codes.

In a typical data compression problem a given finite sequence x_1^n, drawn from some finite alphabet A, is to be mapped into a binary sequence b_1^r = b_1, b_2, ..., b_r, whose length r may depend on x_1^n, in such a way that the source sequence x_1^n is recoverable from knowledge of the encoded sequence b_1^r. The goal is to use as little storage space as possible, that is, to make the code length as short as possible, at least in some average sense. Of course, there are many possible source sequences, so that a typical code must be designed to encode a large number of different sequences and accurately decode them. A standard model is to think of a source sequence as a (finite) sample path drawn from some ergodic process; for this reason, in information theory a stationary, ergodic process with finite alphabet A is usually called a source.

In the following discussion B* denotes the set of all finite-length binary sequences and ℓ(w) denotes the length of a member w ∈ B*. A (binary) code on A is a mapping C: A → B*. A code is said to be faithful or noiseless if it is one-to-one. In this section the focus will be on a special class of codes, known as prefix codes, for which there is a close connection between code length and source entropy.

An image C(a) is called a codeword and the range of C is called the codebook. The function that assigns to each a ∈ A the length of the code word C(a) is called the length function of the code and denoted by ℒ(·|C), or by ℒ if C is understood. Formally, it is the function defined by ℒ(a|C) = ℓ(C(a)). The expected length of a code C relative to a probability distribution μ on A is

E_μ(ℒ) = Σ_{a∈A} ℒ(a|C) μ(a).

The entropy of μ is defined by the formula

H(μ) = Σ_{a∈A} μ(a) log (1/μ(a)),

that is, H(μ) is just the expected value of the random variable - log μ(X), where X has the distribution μ. (As usual, base 2 logarithms are used.) Without further assumptions about the code there is little connection between entropy and expected code length. For codes that satisfy a simple prefix condition, however, entropy is a lower bound to code length, a lower bound which is almost tight.

To develop the prefix code idea some notation and terminology will be introduced. For u = u_1^n ∈ B* and v = v_1^m ∈ B*, the concatenation uv is the sequence of length n + m defined by

(uv)_i = u_i, 1 ≤ i ≤ n;  (uv)_i = v_{i-n}, n < i ≤ n + m.

A nonempty word u is called a prefix of a word w if w = uv. The prefix u is called a proper prefix of w if w = uv where v is nonempty. A nonempty set W ⊆ B* is prefix free if no member of W is a proper prefix of any member of W.

A prefix-free set W has the property that it can be represented as the leaves of the (labeled, directed) binary tree T(W), defined by the following two conditions.

(i) The vertex set V(W) of T(W) is the set of prefixes of members of W, together with the empty word Λ.
(ii) If v ∈ V(W) has the form v = ub, where b ∈ B, then there is a directed edge from u to v, labeled by b.

Since W is prefix free, the words w ∈ W correspond to leaves L(T) of the tree T(W). Note that the root of the tree is the empty word and the depth d(v) of a vertex v is just the length of the word v. (See Figure I.7.9.)

Figure I.7.9 The binary tree for W = {00, 0100, 0101, 100, 101}.

An important, and easily proved, property of binary trees is that the sum of 2^{-d(v)} over all leaves v of the tree can never exceed 1. For prefix-free sets W of binary sequences this fact takes the form

(2)  Σ_{w∈W} 2^{-ℓ(w)} ≤ 1,

where ℓ(w) denotes the length of w ∈ W, a result known as the Kraft inequality. A geometric proof of this inequality is sketched in Exercise 10.

A code C is a prefix code if a = ã whenever C(a) is a prefix of C(ã). Thus C is a prefix code if and only if C is one-to-one and the range of C is prefix-free. The codewords of a prefix code are therefore just the leaves of a binary tree; since the code is one-to-one, the leaf corresponding to C(a) can be labeled with a or with C(a). For prefix codes, the Kraft inequality

(3)  Σ_{a∈A} 2^{-ℒ(a|C)} ≤ 1

holds. The Kraft inequality has the following converse.

Lemma I.7.10 If ℓ_1 ≤ ℓ_2 ≤ ... ≤ ℓ_t is a nondecreasing sequence of positive integers such that Σ_i 2^{-ℓ_i} ≤ 1, then there is a prefix-free set W = {w(i): 1 ≤ i ≤ t} such that ℓ(w(i)) = ℓ_i, i ∈ [1, t].

Proof. Define i and j to be equivalent if ℓ_i = ℓ_j and let {G_1, ..., G_s} be the equivalence classes, written in order of increasing length L(r) = ℓ_i, where ℓ_i ∈ G_r. The Kraft inequality then takes the form

Σ_{r=1}^{s} |G_r| 2^{-L(r)} ≤ 1.

Assign the indices in G_1 to binary words of length L(1) in some one-to-one way, which is possible since |G_1| ≤ 2^{L(1)}. Assign G_2 in some one-to-one way to binary words of length L(2) that do not have already assigned words as prefixes. This is possible since |G_1| 2^{-L(1)} + |G_2| 2^{-L(2)} ≤ 1, so that |G_2| ≤ 2^{L(2)} - |G_1| 2^{L(2)-L(1)}, where the second term on the right gives the number of words of length L(2) that have prefixes already assigned. An inductive continuation of this assignment argument clearly establishes the lemma. □
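The assignment argument in the proof can be carried out numerically: process the lengths in nondecreasing order and give each the binary expansion of the accumulated Kraft sum, padded to the required length. The following is a minimal sketch with my own naming, not code from the text.

```python
from fractions import Fraction

def prefix_code_from_lengths(lengths):
    """
    Given nondecreasing lengths l_1 <= ... <= l_t with sum 2^{-l_i} <= 1,
    return binary codewords w(i) with len(w(i)) = l_i forming a prefix-free
    set (a standard realization of the Kraft-inequality converse).
    """
    assert lengths == sorted(lengths), "lengths must be nondecreasing"
    assert sum(Fraction(1, 2 ** l) for l in lengths) <= 1, "Kraft inequality fails"
    words, acc = [], Fraction(0)
    for l in lengths:
        # acc has denominator dividing 2^l, so this is an exact integer.
        numer = int(acc * 2 ** l)
        words.append(format(numer, "b").zfill(l))   # l-bit expansion of acc
        acc += Fraction(1, 2 ** l)
    return words

print(prefix_code_from_lengths([1, 2, 3, 3]))       # ['0', '10', '110', '111']
```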

For prefix codes, entropy is always a lower bound on average code length, a lower bound that is "almost" tight. This fundamental connection between entropy and prefix codes was discovered by Shannon.

Theorem I.7.11 (Entropy and prefix codes.) Let μ be a probability distribution on A.
(i) H(μ) ≤ E_μ(ℒ(·|C)), for any prefix code C.
(ii) There is a prefix code C such that E_μ(ℒ(·|C)) ≤ H(μ) + 1.

Proof. Let C be a prefix code. A little algebra on the expected code length yields

E_μ(ℒ(·|C)) = Σ_a ℒ(a|C) μ(a) = Σ_a μ(a) log [μ(a)/2^{-ℒ(a|C)}] + Σ_a μ(a) log (1/μ(a)).

The second term is just H(μ), while the divergence inequality implies that the first term is nonnegative, since the measure ν defined by ν(a) = 2^{-ℒ(a|C)} satisfies ν(A) ≤ 1, by the Kraft inequality. This proves (i).

To prove part (ii) define ℒ(a) = ⌈- log μ(a)⌉, where ⌈x⌉ denotes the least integer ≥ x, and note that

Σ_a 2^{-ℒ(a)} ≤ Σ_a 2^{log μ(a)} = Σ_a μ(a) = 1,

so that Lemma I.7.10 produces a prefix code C(a) = w(a), a ∈ A, whose length function is ℒ. Since ℒ(a) ≤ 1 - log μ(a), it follows that

E_μ(ℒ(·|C)) = Σ_a ℒ(a) μ(a) ≤ 1 - Σ_a μ(a) log μ(a) = 1 + H(μ).

Thus part (ii) is proved, which completes the proof of the theorem. □

Next consider n-codes, that is, mappings C_n: A^n → B* from source sequences of length n to binary words. As noted earlier, a faithful n-code C_n: A^n → B* is a prefix n-code if and only if its range is prefix free. Of interest for n-codes is per-symbol code length ℒ(x_1^n|C_n)/n and expected per-symbol code length E(ℒ(·|C_n))/n. Let μ_n be a probability measure on A^n with per-symbol entropy H(μ_n)/n. Theorem I.7.11(i) takes the form

(4)  (1/n) H(μ_n) ≤ (1/n) E(ℒ(·|C_n)),

for any prefix n-code C_n, while Theorem I.7.11(ii) asserts the existence of a prefix n-code C_n such that

(5)  (1/n) E(ℒ(·|C_n)) ≤ (1/n) H(μ_n) + (1/n).

A sequence {C_n}, such that, for each n, C_n is a prefix n-code with length function ℒ(·|C_n), will be called a prefix-code sequence. The (asymptotic) rate of such a code sequence, relative to a process {X_n} with Kolmogorov measure μ, is defined by

R_μ({C_n}) = lim sup_{n→∞} (1/n) E_μ(ℒ(·|C_n)).

The two results, (4) and (5), then yield the following asymptotic results: "good" codes exist, and "too-good" codes do not exist; that is, it is possible to compress as well as process entropy in the limit, while no sequence of prefix codes can asymptotically compress more than process entropy.

Theorem I.7.12 (Process entropy and prefix-codes.) If μ is a stationary process with process entropy H then
(a) There is a prefix-code sequence {C_n} such that R_μ({C_n}) ≤ H.
(b) There is no prefix-code sequence {C_n} such that R_μ({C_n}) < H.

In the next chapter, almost-sure versions of these two results will be obtained for ergodic processes.
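The code constructed in part (ii) of Theorem I.7.11, with lengths ⌈- log μ(a)⌉ (called a Shannon code in the remark below), can be built by feeding those lengths to the Kraft construction sketched earlier, and the bounds H(μ) ≤ E_μ(ℒ) ≤ H(μ) + 1 can then be checked numerically. The sketch below assumes the hypothetical prefix_code_from_lengths helper from the previous example.

```python
from math import ceil, log2

def shannon_code(mu):
    """Prefix code with lengths ceil(-log mu(a)), via the Kraft construction."""
    items = sorted(mu.items(), key=lambda kv: -kv[1])   # shortest words first
    lengths = [ceil(-log2(p)) for _, p in items]
    words = prefix_code_from_lengths(lengths)           # helper sketched above
    return {a: w for (a, _), w in zip(items, words)}

mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = shannon_code(mu)
H = -sum(p * log2(p) for p in mu.values())
EL = sum(mu[a] * len(w) for a, w in code.items())
print(code, H, EL)            # here E(L) = H; in general H <= E(L) < H + 1
```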

Remark I.7.13
A prefix code with length function ℒ(a) = ⌈- log μ(a)⌉ will be called a Shannon code. For Shannon codes, the code length function is closely related to the function - log μ(a), whose expected value is entropy. Shannon's theorem implies that a Shannon code is within 1 of being the best possible prefix code in the sense of minimizing expected code length. A somewhat more complicated coding procedure, called Huffman coding, produces prefix codes that minimize expected code length. Huffman codes have considerable practical significance, but for n-codes the per-letter difference in expected code length between a Shannon code and a Huffman code is at most 1/n, which is asymptotically negligible. For this reason no use will be made of Huffman coding in this book; see, for example, [8].

I.7.d Converting faithful codes to prefix codes.

Next it will be shown that, as far as asymptotic results are concerned, there is no loss of generality in assuming that a faithful code is a prefix code, so that asymptotic results about prefix codes automatically apply to the weaker concept of faithful code. The key to this is the fact that a faithful n-code can always be converted to a prefix code by adding a header (i.e., a prefix) to each codeword to specify its length. This can be done in such a way that the length of the header is (asymptotically) negligible compared to total codeword length. The following is due to Elias.

Lemma I.7.14 (The Elias-code lemma.) There is a prefix code ℰ: N → B*, such that ℓ(ℰ(n)) = log n + o(log n).

Proof. The code word assigned to n is a concatenation of three binary sequences, ℰ(n) = u(n)v(n)w(n). The third part w(n) is the usual binary representation of n; for example, w(12) = 1100. The length of w(n) is ⌈log(n + 1)⌉, the length of the binary representation of n. The second part v(n) is the binary representation of the length of w(n); for example, v(12) = 100. The first part u(n) is just a sequence of 0's of length equal to the length of v(n); for example, u(12) = 000, so that ℰ(12) = 0001001100. Both u(n) and v(n) have length equal to ⌈log(1 + ⌈log(n + 1)⌉)⌉, the length of the binary representation of ⌈log(n + 1)⌉.

The code is a prefix code, for if u(n)v(n)w(n) = u(m)v(m)w(m)w', then u(n) = u(m), since both consist only of 0's and the first bit of both v(n) and v(m) is a 1. But then v(n) = v(m), since both have length equal to the length of u(n). This means that w(n) = w(m), since both have length specified by v(n), so that w' is empty and n = m, so the lemma is established. The desired bound ℓ(ℰ(n)) = log n + o(log n) follows easily. □

Any prefix code with this property is called an Elias code.
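The three-part construction in the proof is short to implement; the sketch below follows the u(n)v(n)w(n) recipe literally and reproduces the worked example ℰ(12) = 0001001100 (the function names are mine).

```python
def elias_code(n):
    """Elias prefix code E(n) = u(n) v(n) w(n) for a positive integer n."""
    w = format(n, "b")              # usual binary representation of n
    v = format(len(w), "b")         # binary representation of len(w)
    u = "0" * len(v)                # run of 0's as long as v
    return u + v + w

def elias_decode(s):
    """Inverse of elias_code; also returns any unread suffix of s."""
    k = 0
    while s[k] == "0":              # leading zeros give len(v)
        k += 1
    length_of_w = int(s[k:2 * k], 2)
    start = 2 * k
    return int(s[start:start + length_of_w], 2), s[start + length_of_w:]

print(elias_code(12))                               # 0001001100, as in the text
print(elias_decode(elias_code(12) + "111"))         # (12, '111')
```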

Given a faithful n-code C_n with length function ℒ(·|C_n), a prefix n-code C_n* is obtained by the formula

(6)  C_n*(x_1^n) = ℰ(ℒ(x_1^n|C_n)) C_n(x_1^n),  x_1^n ∈ A^n,

where ℰ is an Elias prefix code on the integers. The code C_n*, which is clearly a prefix code, will be called an Elias extension of C_n. The decoder reads through the header information and learns where C_n(x_1^n) starts and how long it is; this enables the decoder to determine the preimage x_1^n, since C_n was assumed to be faithful. For example, if C_n(x_1^n) = 001001100111, a word of length 12, then C_n*(x_1^n) = 0001001100, 001001100111, where, for ease of reading, a comma was inserted between the header information, ℰ(ℒ(x_1^n|C_n)), and the code word C_n(x_1^n). In the example, the header information is almost as long as C_n(x_1^n), but header length becomes negligible relative to the length of C_n(x_1^n) as codeword length grows. Thus Theorem I.7.12(b) extends to the following somewhat sharper form.

Theorem I.7.15 If μ is a stationary process with entropy-rate H there is no faithful-code sequence {C_n} such that R_μ({C_n}) < H.

(The definition of faithful-code sequence is obtained by replacing the word "prefix" by the word "faithful" in the definition of prefix-code sequence.)

Remark I.7.16
Another application of the Elias prefix idea converts a prefix-code sequence {C_n} into a single prefix code C: A* → B*, where C is defined by C(x_1^n) = ℰ(n)C_n(x_1^n), x_1^n ∈ A^n, n = 1, 2, ...
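Formula (6) and the idea of Remark I.7.16 amount to prepending an Elias header to each codeword. A minimal sketch, assuming the hypothetical elias_code and elias_decode helpers from the previous example.

```python
def elias_extend(codeword):
    """C*_n(x) = E(len(C_n(x))) C_n(x): make a faithful codeword self-delimiting."""
    return elias_code(len(codeword)) + codeword

def elias_strip(stream):
    """Read one header+codeword pair from the front of a bit string."""
    length, rest = elias_decode(stream)
    return rest[:length], rest[length:]

print(elias_extend("001001100111"))     # header 0001001100 + the 12-bit word
```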

I.7.e Rotation processes have entropy 0.

As an application of the covering-exponent interpretation of entropy it will be shown that ergodic rotation processes have entropy 0. Let T be the transformation defined by T: x ↦ x ⊕ α, x ∈ [0, 1), where α is irrational and ⊕ is addition modulo 1, and let P be a finite partition of the unit interval [0, 1). The proof will be based on direct counting arguments rather than the entropy formalism. Geometric arguments will be used to estimate the number of possible (T, P)-names of length n; these will produce an upper bound of the form 2^{nε_n}, where ε_n → 0 as n → ∞. The proof for the special two-set partition P = (P_0, P_1), where P_0 = [0, 0.5) and P_1 = [0.5, 1), will be given in detail. The argument generalizes easily to arbitrary partitions into subintervals, and, with a bit more effort, to general partitions (by approximating by partitions into intervals).

Proposition I.7.17 The (T, P)-process has entropy 0.

The key to the proof is the following uniform distribution property of irrational rotations. The proof is left as an exercise.

Proposition I.7.18 (The uniform distribution property.)

lim_n (1/n) |{m ∈ [1, n]: T^m x ∈ I}| = λ(I),

for any interval I ⊆ [0, 1) and any x ∈ [0, 1), where λ denotes Lebesgue measure.

Ergodicity asserts the truth of the proposition for almost every x. The fact that rotations are rigid motions allows the almost-sure result to be extended to a result that holds for every x.

The key to the proof of the entropy 0 result is that, given z and x, some power y = T^j x of x must be close to z. But translation is a rigid motion, so that T^m y and T^m z will be close for all m, and hence the (T, P)-names of y and z will agree most of the time. Thus every name is obtained by changing the name of some fixed z in a small fraction of places and shifting a bounded amount. In particular, there cannot be exponentially very many names. The result then follows from the covering-exponent interpretation of entropy.

To make the preceding argument precise define the metric

|x - y|_1 = min{|x - y|, 1 - |x - y|},  x, y ∈ [0, 1),

and the pseudometric

d_n(x, y) = d_n(x_1^n, y_1^n) = (1/n) Σ_{i=1}^n d(x_i, y_i),

where d(a, b) = 0, if a = b, d(a, b) = 1, if a ≠ b, and {x_n} and {y_n} denote the (T, P)-names of x and y, respectively. The first lemma shows that d_n(x, y) is continuous relative to |x - y|_1, providing an N = N(x, y) such that d_n(x, y) ≤ ε for n ≥ N; compactness then shows that N can be chosen to be independent of x and y.

Lemma I.7.19 Given ε > 0 there is an N such that d_n(x, y) ≤ ε, if |x - y|_1 ≤ ε/4 and n ≥ N.

Proof. Suppose |x - y|_1 ≤ ε/4. Without loss of generality it can be assumed that x < y. Consider the case when |x - y|_1 = |x - y|, and let I = [x, y]. The names of x and y disagree at time j if and only if T^j x and T^j y belong to different atoms of the partition P, which occurs when and only when either T^{-j}0 ∈ I or T^{-j}(0.5) ∈ I. The uniform distribution property, Proposition I.7.18, applied to T^{-1}, which is rotation by -α, then yields

d_n(x, y) ≤ (1/n) |{j ∈ [1, n]: T^{-j}0 ∈ I or T^{-j}(0.5) ∈ I}| ≤ ε,

for all n ≥ N, for a suitable N. The case when |x - y|_1 = 1 - |x - y| can be treated in a similar manner. □
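The counting phenomenon behind the proposition, that the number of rotation names grows subexponentially, can be observed numerically by generating (T, P)-names from many starting points and counting distinct n-blocks. The sketch below is illustrative only; the particular α and the sample sizes are arbitrary choices of mine.

```python
import math
import random

ALPHA = math.sqrt(2) - 1            # an irrational rotation number (illustrative)

def rotation_name(x, n, alpha=ALPHA):
    """(T,P)-name of length n for the two-set partition P0=[0,.5), P1=[.5,1)."""
    name = []
    for _ in range(n):
        name.append(0 if x < 0.5 else 1)
        x = (x + alpha) % 1.0
    return tuple(name)

def count_names(n, trials=20000, seed=0):
    """Count distinct length-n names over many random starting points."""
    rng = random.Random(seed)
    return len({rotation_name(rng.random(), n) for _ in range(trials)})

for n in (5, 10, 20, 40):
    # Grows roughly linearly in n (the circle is cut into about 2n arcs),
    # far below the 2^n possible binary words.
    print(n, count_names(n))
```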

The next lemma is just the formal statement that the name of any point x can be shifted by a bounded amount to obtain a sequence close to the name of z = 0 in the sense of the pseudometric d_n.

Lemma I.7.20 Given ε > 0, there is an integer M and an integer K ≥ M such that if x ∈ [0, 1) there is a j = j(x) ∈ [1, M] such that if z = 0 and y = T^j x then d_k(z_1^k, y_1^k) ≤ ε, k ≥ K.

Proof. Given ε > 0, choose N from the preceding lemma so that if n ≥ N and |x - y|_1 ≤ ε/4 then d_n(x, y) ≤ ε. Given x, apply Kronecker's theorem to obtain a least positive integer j = j(x) such that |T^j x - 0|_1 ≤ ε/4. Since small changes in x do not change j(x) there is a number M such that j(x) ≤ M for all x. With K = M + N the lemma follows. □

Proof of Proposition I.7.17. Now for the final details of the proof that the entropy of the (T, P)-process is 0. Given ε > 0, determine M and K ≥ M from the preceding lemma. The unit interval X can be partitioned into measurable sets X_j = {x: j(x) = j}, j ≤ M, where j(x) is given by the preceding lemma. Let n ≥ K + M and let z = 0, and note that d_{n-j}(z, T^j x) ≤ ε, for x ∈ X_j. There are at most 2^j possible values for the first j places of any name, and the number of binary sequences of length n - j that can be obtained by changing the (n - j)-name of z in at most ε(n - j) places is upper bounded by 2^{(n-j)h(ε)}, according to the combinations bound, where h(ε) = -ε log ε - (1 - ε) log(1 - ε). Thus an upper bound on the number of possible names of length n for members of X_j is

(7)  |{x_1^n: x ∈ X_j}| ≤ 2^j 2^{(n-j)h(ε)} ≤ 2^M 2^{nh(ε)}.

Since there are only M possible values for j, the number of possible rotation names of length n is upper bounded by M 2^M 2^{nh(ε)}, for n ≥ K + M. Since h(ε) → 0 as ε → 0, this completes the proof that the (T, P)-process has entropy 0. □

I.7.f Exercises.

1. Give a direct proof of the covering-exponent theorem, based on the algorithm (A1, A2) given for the computation of the covering number. (Hint: what does the entropy theorem say about the first part of the list?)

2. Let μ be the concatenated-block process defined by a measure μ_n on A^n, and let H(μ_n) denote the entropy of μ_n.
(a) Show that H_{mn}(μ) ≤ m H(μ_n) + log n.
(b) Show that H_{mn}(μ) ≥ m H(μ_n).
(Hint: there is at least one way and at most n|A|^{2n} ways to express a given x_1^{mn} in the form u w(1) w(2) ... w(k - 1) v, where each w(i) ∈ A^n and u and v are the tail and head, respectively, of words w(0), w(k) ∈ A^n.)

3. Show that lim_n (1/n) log N_n(α, δ) exists, where N_n(α, δ) is defined in (1).

4. Prove Proposition I.7.18.

5. Use the covering exponent concept to show that the entropy of an ergodic n-stationary process is the same as the entropy of the stationary process obtained by randomizing its start.

6. Show, using Theorem I.7.4, that the entropy-rate of an ergodic process and its reversed process are the same.

7. Let {Y_n} be the process obtained from the stationary ergodic process {X_n} by applying the block code C_N and randomizing the start. Suppose {Y_n} is ergodic. Show that H({Y_n}) ≤ H({X_n}).

8. Suppose the waiting times between occurrences of "1" for a binary process μ are independent and identically distributed. Find an expression for the entropy of μ, in terms of the entropy of the waiting-time distribution.

9. Let μ be a shift-invariant ergodic measure on A^Z and let B ⊆ A^Z be a measurable set of positive measure. Show, using the entropy theorem, that the entropy of the induced process is h(μ)/μ(B).

10. Let C be a prefix code on A with length function ℒ. Associate with each a ∈ A the dyadic subinterval of the unit interval [0, 1] of length 2^{-ℒ(a)} whose left endpoint has dyadic expansion C(a). Show that the resulting set of intervals is disjoint, and that this implies the Kraft inequality (2). Use this idea to prove that if Σ_a 2^{-ℒ(a)} ≤ 1, then there is a prefix code with length function ℒ.

11. Show that if C_n: A^n → B* is a prefix code with length function ℒ then there is a prefix code C̃_n: A^n → B* with length function ℒ̃ such that
(a) ℒ̃(x_1^n) ≤ ℒ(x_1^n) + 1, for all x_1^n ∈ A^n;
(b) ℒ̃(x_1^n) ≤ 2 + 2n log |A|, for all x_1^n ∈ A^n.

Section I.8 Stationary coding.

Stationary coding was introduced in Example I.1.9. Two basic results will be established in this section. The first is that stationary codes can be approximated by finite codes. Often a property can be established by first showing that it holds for finite codes, then passing to a suitable limit. The second basic result is a technique for creating stationary codes from block codes. In some cases it is easy to establish a property for a block coding of a process, then extend using the block-to-stationary code construction to obtain a desired property for a stationary coding.

I.8.a Approximation by finite codes.

The definition and notation associated with stationary coding will be reviewed and the basic approximation by finite codes will be established in this subsection. Recall that a stationary coder is a measurable mapping F: A^Z → B^Z such that

F(T_A x) = T_B F(x),  ∀x ∈ A^Z,


where T_A and T_B denote the shifts on the respective two-sided sequence spaces A^Z and B^Z. Given a (two-sided) process with alphabet A and Kolmogorov measure μ_A, the encoded process has alphabet B and Kolmogorov measure μ_B = μ_A ∘ F^{-1}. The associated time-zero coder is the function f: A^Z → B defined by the formula f(x) = F(x)_0. Associated with the time-zero coder f: A^Z → B is the partition P = P(f) = {P_b: b ∈ B} of A^Z, defined by P_b = {x: f(x) = b}, that is,
(1)  x ∈ P_b if and only if f(x) = b.

Note that if y = F(x) then y_n = f(T^n x), that is, T^n x ∈ P_{y_n}. In other words, y is the (T, P)-name of x and the measure μ_B is the Kolmogorov measure of the (T, P)-process. The partition P(f) is called the partition defined by the encoder f or, simply, the encoding partition. Note also that a measurable partition P = {P_b: b ∈ B} defines an encoder f by the formula (1), such that P = P(f), that is, P is the encoding partition for f. Thus there is a one-to-one correspondence between time-zero coders and measurable finite partitions of A^Z. In summary, a stationary coder can be thought of as a measurable function F: A^Z → B^Z such that F ∘ T_A = T_B ∘ F, or as a measurable function f: A^Z → B, or as a measurable partition P = {P_b: b ∈ B}. The descriptions are connected by the relationships

f(x) = F(x)_0,  F(x)_n = f(T^n x),  P_b = {x: f(x) = b}.

A time-zero coder f is said to be finite (with window half-width w) if f(x) = f(x̃), whenever x_{-w}^{w} = x̃_{-w}^{w}. In this case, the notation f(x_{-w}^{w}) is used instead of f(x). A key fact for finite coders is that

F^{-1}([y_1^n]) = ∪ {[x_{1-w}^{n+w}]: f(x_{i-w}^{i+w}) = y_i, 1 ≤ i ≤ n},

and, as shown in Example I.1.9, for any stationary measure μ, cylinder set probabilities for the encoded measure ν = μ ∘ F^{-1} are given by the formula

(2)  ν(y_1^n) = μ(F^{-1}[y_1^n]) = Σ_{f(x_{1-w}^{n+w}) = y_1^n} μ(x_{1-w}^{n+w}).
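A finite time-zero coder with window half-width w is just a sliding-window map, and formula (2) says that the encoded probability of [y_1^n] is the total μ-mass of the input windows that code to y_1^n. The sketch below applies such a coder to a finite stretch of a sample path; the particular coder is an arbitrary illustration of mine, not one from the text.

```python
def encode_sliding(x, f, w):
    """
    Apply a finite time-zero coder f with window half-width w to a finite
    sequence x, producing y_i = f(x_{i-w}^{i+w}) for every position whose
    full window lies inside x (the two boundary strips of width w are dropped).
    """
    n = len(x)
    return [f(tuple(x[i - w:i + w + 1])) for i in range(w, n - w)]

# Illustrative coder with w = 1: majority vote over a 3-window of bits.
f_majority = lambda window: int(sum(window) >= 2)

x = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
print(encode_sliding(x, f_majority, w=1))
```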

The principal result in this subsection is that time-zero coders are "almost finite", in the following sense.

Theorem I.8.1 (Finite-coder approximation.) If f: A^Z → B is a time-zero coder, if μ is a shift-invariant measure on A^Z, and if ε > 0, then there is a finite time-zero encoder f̃: A^Z → B such that

μ({x: f(x) ≠ f̃(x)}) ≤ ε.

Proof. Let P = {P_b: b ∈ B} be the encoding partition for f. Since P is a finite partition of A^Z into measurable sets, and since the measurable sets are generated by the finite cylinder sets, there is a positive integer w and a partition P̃ = {P̃_b} with the following two properties.
"../

(a) Each Pb is union of cylinder sets of the form

SECTION I8. STATIONARY CODING.

81

(h)
I../

Eb kt(PbA 7)1,) < E.
r',I 0.s•s

Let f be the time-zero encoder defined by f(x) = b, if [xi'.] c Pb. Condition (a) assures that f is a finite encoder, and p,({x: f (x) 0 f (x))) < c follows from condition I=1 (b). This proves the theorem.

Remark 1.8.2
The preceding argument only requires that the coded process be a finite-alphabet process. In particular, any stationary coding onto a finite-alphabet process of any i.i.d. process of finite or infinite alphabet can be approximated arbitrarily well by finite codes. As noted earlier, the stationary coding of an ergodic process is ergodic. A consequence of the finite-coder approximation theorem is that entropy for ergodic processes is not increased by stationary coding. The proof is based on direct estimation of the number of sequences of length n needed to fill a set of fixed probability in the encoded process.

Theorem 1.8.3 (Stationary coding and entropy.)
If ν = μ ∘ F^{-1} is a stationary encoding of an ergodic process μ then h(ν) ≤ h(μ).

Proof. First consider the case when the time-zero encoder f is finite with window half-width w. By the entropy theorem, there is an N such that if n ≥ N, there is a set C_n ⊆ A_{1-w}^{n+w} of measure at least 1/2 and cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. The image f(C_n) is a subset of B_1^n of measure at least 1/2, by formula (2). Furthermore, because mappings cannot increase cardinality, the set f(C_n) has cardinality at most 2^{(n+2w+1)(h(μ)+ε)}. Thus

(1/n) log |f(C_n)| ≤ h(μ) + ε + δ_n,

where δ_n → 0, as n → ∞, since w is fixed. It follows that h(ν) ≤ h(μ), since entropy equals the asymptotic covering rate, Theorem I.7.4.

In the general case, given ε > 0 there is a finite time-zero coder f̃ such that μ({x: f(x) ≠ f̃(x)}) ≤ ε². Let F and F̃ denote the sample path encoders and let ν = μ ∘ F^{-1} and ν̃ = μ ∘ F̃^{-1} denote the Kolmogorov measures defined by f and f̃, respectively. It has already been established that finite codes do not increase entropy, so that h(ν̃) ≤ h(μ). Thus, there is an n and a collection C̃ ⊆ B^n such that ν̃(C̃) > 1 - ε and |C̃| ≤ 2^{n(h(μ)+ε)}. Let C = [C̃]_ε be the ε-blowup of C̃, that is,

C = {y_1^n: d_n(y_1^n, ỹ_1^n) ≤ ε, for some ỹ_1^n ∈ C̃}.

The blowup bound, Lemma I.7.5, implies that

|C| ≤ 2^{n(h(μ)+ε+δ(ε))},  where δ(ε) → 0 as ε → 0.

Let

D̃ = {x: F̃(x)_1^n ∈ C̃}  and  D = {x: F(x)_1^n ∈ C}

be the respective pull-backs to A^Z. Since μ({x: f(x) ≠ f̃(x)}) ≤ ε², the Markov inequality implies that the set

G = {x: d_n(F(x)_1^n, F̃(x)_1^n) ≤ ε}

has measure at least 1 - ε, so that μ(G ∩ D̃) ≥ 1 - 2ε, since μ(D̃) = ν̃(C̃) > 1 - ε. By definition of G and D, however, G ∩ D̃ ⊆ D, so that

ν(C) = μ(D) ≥ μ(G ∩ D̃) ≥ 1 - 2ε.

The bound |C| ≤ 2^{n(h(μ)+ε+δ(ε))}, and the fact that entropy equals the asymptotic covering rate, Theorem I.7.4, then imply that h(ν) ≤ h(μ). This completes the proof of Theorem I.8.3. □

Example 1.8.4 (Stationary coding preserves mixing.)
A simple argument, see Exercise 19, can be used to show that stationary coding preserves mixing. Here a proof based on approximation by finite coders will be given. While not as simple as the earlier proof, it gives more direct insight into why stationary coding preserves mixing and is a nice application of coder approximation ideas. The a-field generated by the cylinders [4], for m and n fixed, that is, the cr-field generated by the random variables Xm , Xm+ i, , X n , will be denoted by E(Xm n ). As noted earlier, the i-fold shift of the cylinder set [cini n ] is the cylinder set T[a] = [c.:Ln ], where ci +i = ai, m < j <n. Let v = o F-1 be a stationary encoding of a mixing process kt. First consider the case when the time-zero encoder f is finite, with window width 2w + 1, say. The coordinates yri' of y = F(x) depend only on the coordinates xn i +:. Thus for g > 2w, the intersection [Yi ] n [ym mLg: i n] is the image under F of C n Tm D where C and D are both measurable with respect to E(X7 +:). If p, is mixing then, given E > 0 and n > 1, there is an M such that

jp,(c n

D) — ,u(C),u(D)I < c, C, D E

Earn,
El in

m > M,

which, in turn, implies that

n [y:Z +1]) — v([yi])v(Ey:VDI

M, g > 2w.

Thus v is mixing. In the general case, suppose v = p, o F -1 is a stationary coding of the mixing process with stationary encoder F and time-zero encoder f, and fix n > 1. Given E > 0, and a7 and Pi', let S be a positive number to be specified later and choose a finite encoder with time-zero encoder f, such that A({x: f(x) f . Thus,

col) <S

pax: F(x)7

fr(x)71)

/tax:

f (T i x)

f (Ti x)i)

f(x)

f (x)}), < n3,

by stationarity. But n is fixed, so that if (5 is small enough then v(4) will be so close to î(a) and v(b7) so close to i(b) that

(3)

Iv(a7)v(b7) — j) (a7) 13 (bi)I

E/ 3 , • P(x)21:71) < 2n8,

for all ar il and b7. Likewise, for any m > 1,

p,

F(x)7

P(x)7, or F(x)m m -Vi

so that (4)


Ivaan n T - m[b7]) — j)([4] n T -m[b71)1

6/3,

provided only that 8 is small enough, uniformly in m for all ail and b7. Since 1'5 is mixing, being a finite coding of ,u, it is true that

n T -m[bn) — i)([4])Dab7DI < E/3,
provided only that m is large enough, and hence, combining this with (3) and (4), and using the triangle inequality yields I vga7] n T -ni[b7]) — vga7])v([14])1 < c, for all sufficiently large m, provided only that 8 is small enough. Thus, indeed, stationary coding preserves the mixing property.

I.8.b From block to stationary codes.

As noted in Example I.1.10, an N-block code C_N: A^N → B^N can be used to map an A-valued process {X_n} into a B-valued process {Y_n} by applying C_N to consecutive nonoverlapping blocks of length N,

Y_{jN+1}^{(j+1)N} = C_N(X_{jN+1}^{(j+1)N}),  j = 0, 1, 2, ...

If {X_n} is stationary, then a stationary process {Ỹ_n} is obtained by randomizing the start, i.e., by selecting an integer U ∈ [1, N] according to the uniform distribution and defining Ỹ_j = Y_{U+j-1}, j = 1, 2, .... The final process {Ỹ_n} is stationary, but it is not, except in rare cases, a stationary coding of {X_n}, and nice properties of {X_n}, such as mixing or even ergodicity, may get destroyed.

A method for producing stationary codes from block codes will now be described. The basic idea is to use an event of small probability as a signal to start using the block code. The block code is then applied to successive N-blocks until within N of the next occurrence of the event. If the event has small enough probability, sample paths will be mostly covered by nonoverlapping blocks of length exactly N to which the block code is applied.
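The nonoverlapping N-block coding with randomized start described above is easy to simulate on a finite sample path; the block code used below is an arbitrary illustration and the names are mine.

```python
import random

def block_code_process(x, C, N, seed=0):
    """
    Apply an N-block code C to consecutive nonoverlapping N-blocks of x,
    then randomize the start: pick U uniform in [1, N] and shift by U - 1.
    """
    rng = random.Random(seed)
    y = []
    for j in range(len(x) // N):
        y.extend(C(x[j * N:(j + 1) * N]))
    U = rng.randint(1, N)
    return y[U - 1:]

# Illustrative 3-block code: reverse each block.
C_rev = lambda block: list(reversed(block))
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(block_code_process(x, C_rev, N=3))
```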

Lemma 1.8.5 (Block-to-stationary construction.)
Let μ be an ergodic measure on A^Z and let C: A^N → B^N be an N-block code. Given ε > 0 there is a stationary code F = F_C: A^Z → B^Z such that for almost every x ∈ A^Z there is an increasing sequence {n_i: i ∈ Z}, which depends on x, such that Z = ∪_i [n_i, n_{i+1}) and

(i) n_{i+1} - n_i ≤ N, i ∈ Z.

(ii) If J_n is the set of indices i such that [n_i, n_{i+1}) ⊆ [-n, n] and n_{i+1} - n_i < N, then lim sup_n (1/2n) Σ_{i∈J_n} (n_{i+1} - n_i) ≤ ε, almost surely.

(iii) If n_{i+1} - n_i = N then y_{n_i}^{n_{i+1}-1} = C(x_{n_i}^{n_{i+1}-1}), where y = F(x).


Proof. Let D be a cylinder set such that 0 < μ(D) < ε/N, and let G be the set of all x ∈ A^Z for which T^m x ∈ D for infinitely many positive and negative values of m. The set G is measurable and has measure 1, by the ergodic theorem. For x ∈ G, define m_0 = m_0(x) to be the least nonnegative integer m such that T^m x ∈ D, then extend to obtain an increasing sequence m_j = m_j(x), j ∈ Z, such that T^m x ∈ D if and only if m = m_j, for some j.

The next step is to split each interval [m_j, m_{j+1}) into nonoverlapping blocks of length N, starting with m_j, plus a final remainder block of length shorter than N, in case m_{j+1} - m_j is not exactly divisible by N. In other words, for each j let q_j and r_j be nonnegative integers such that m_{j+1} - m_j = q_j N + r_j, 0 ≤ r_j < N, and form the disjoint collection I_x(m_j) of left-closed, right-open intervals

[m_j, m_j + N), [m_j + N, m_j + 2N), ..., [m_j + q_j N, m_{j+1}).

All but the last of these have length exactly N, while the last one is either empty or has length r_j < N. The definition of G guarantees that for x ∈ G, the union ∪_j I_x(m_j) is a partition of Z. The random partition ∪_j I_x(m_j) can then be relabeled as {[n_i, n_{i+1}), i ∈ Z}, where n_i = n_i(x), i ∈ Z. If x ∉ G, define n_i = i, i ∈ Z. By construction, condition (i) certainly holds for every x ∈ G. Furthermore, the ergodic theorem guarantees that the average distance between m_j and m_{j+1} is at least N/ε, so that (ii) also holds, almost surely.

The encoder F = F_C is defined as follows. Let b be a fixed element of B, called the filler symbol, and let b^j denote the sequence of length j, 1 ≤ j < N, each of whose terms is b. If x is a sequence for which Z = ∪[n_i, n_{i+1}), then y = F(x) is defined by the formula

y_{n_i}^{n_{i+1}-1} = b^{n_{i+1}-n_i},  if n_{i+1} - n_i < N;
y_{n_i}^{n_{i+1}-1} = C(x_{n_i}^{n_{i+1}-1}),  if n_{i+1} - n_i = N.

This definition guarantees that property (iii) holds for x ∈ G, a set of measure 1. For x ∉ G, define F(x)_i = b, i ∈ Z. The function F is certainly measurable and satisfies F(T_A x) = T_B F(x), for all x. This completes the proof of Lemma I.8.5. □

The blocks of length N are called coding blocks and the blocks of length less than N are called filler or spacer blocks. Any stationary coding F of μ for which (i), (ii), and (iii) hold is called a stationary coding with ε-fraction of spacers induced by the block code C_N. There are, of course, many different processes satisfying the conditions of the lemma, since there are many ways to parse sequences so that properties (i), (ii), and (iii) hold; for example, any event of small enough probability can be used, and how the spacer blocks are coded is left unspecified in the lemma statement. The terminology applies to any of these processes.

Remark I.8.6
Lemma I.8.5 was first proved in [40], but it is really only a translation into process language of a theorem about ergodic transformations first proved by Rohlin, [60]. Rohlin's theorem played a central role in Ornstein's fundamental work on the isomorphism problem for Bernoulli shifts, [46, 63].
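The construction in the proof can be imitated on a finite stretch of a sample path: occurrences of a rare marker pattern play the role of the times m_j, each gap is cut into N-blocks plus a short remainder, the block code is applied to the full blocks, and the remainders are filled with the filler symbol. The sketch below is only a finite, illustrative version of the measure-theoretic construction, with names of my own.

```python
def block_to_stationary(x, C, N, is_marker, filler=0):
    """
    Finite-sequence sketch of the block-to-stationary construction: starting
    at each marker occurrence, code successive N-blocks with C until within
    N of the next marker; positions in the short leftover blocks get filler.
    Positions before the first marker are left unchanged in this sketch.
    """
    marks = [i for i in range(len(x)) if is_marker(x, i)]
    y = list(x)
    for start, stop in zip(marks, marks[1:] + [len(x)]):
        i = start
        while i + N <= stop:
            y[i:i + N] = C(x[i:i + N])       # coding block of length exactly N
            i += N
        y[i:stop] = [filler] * (stop - i)    # spacer block (length < N)
    return y

# Illustrative use: marker = three consecutive 1's, block code = reversal.
is_marker = lambda x, i: tuple(x[i:i + 3]) == (1, 1, 1)
C_rev = lambda b: list(reversed(b))
x = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
print(block_to_stationary(x, C_rev, N=4, is_marker=is_marker))
```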

1)/2+1). for some O <s<t<n—k} .d. The first observation is that if stationarity is not required. The block-to-stationary code construction of Lemma 1.d. Let . since two disjoint blocks of O's of length at least n 112 must appear in any set of n consecutive places.} defined by yin clearly has the property —on+1 = cn(X in o . As an illustration of the method a string matching example will be constructed. furthermore. be forced to have entropy as close to the entropy of {X n } as desired. The n-block coding {I'm ) of the process {X„.. that is. The construction is iterated to obtain an infinite sequence of such n's. the {Y} process will be mixing. such that L(Y. Cn (4) = y.'. there is a nontrivial stationary coding. 85 I.c A string-matching example. For i. then nesting to obtain bad behavior for infinitely many n. The problem is to determine the asymptotic behavior of L(4) along sample paths drawn from a stationary. if the start process {X n } is i. and extended to a larger class of processes in Theorem 11. j ?. j > 1. as a simple illustration of the utility of block-to-stationary constructions.} by the encoding 17(jin 1)n+1 = cn(X. the problem is easy to solve merely by periodically inserting blocks of O's to get bad behavior for one value of n. almost surely. at least if the process is assumed to be mixing.8.i. and an increasing unbounded sequence {ni I.1. indicate that {Y. A proof for the i.-on+1) . of {X„}. The string-matching problem is of interest in DNA modeling and is defined as follows.i. For x E An let L(4) be the length of the longest block appearing at least twice in x1% that is. Let pc } . finite-alphabet ergodic process.1. uyss+ for any starting place s. L(x ) = max{k: +k i xss+ xtt±±k.i. . as follows. The proof for the binary case and growth rate À(n) = n 112 will be given here. Section 11. for each n > 64. fy ... To make this precise.5. and Markov processes it is known that L(x7) = 0(log n). A question arose as to whether such a log n type bound could hold for the general ergodic case.d. 1 " by changing the first 4n 1 i2 terms into all O's and the final 4n 1 12 terms into all O's. where it was shown that for any ergodic process {X.8.} and any positive function X(n) for which n -1 1.SECTION 1.8. see [26. where } (5) 0 Yi = 1 xi i <4n 112 or i > n — 4n 112 otherwise. -Fin) > n 1/2 .5. 2].(n) —> 0. lim sup i-400 Â(ni) In particular. define the function Cn : A n H {0. STATIONARY CODING.5 provides a powerful tool for making counterexamples. {Yn }.} is the process obtained from {X. where C„ is the zero-inserter defined by the formula (5). case using coding ideas is suggested in Exercise 2.') > 1. It can. A negative solution to the question was provided in [71].

xE 0. the final process has a positive frequency of l's. guarantees that a limit process 1/2 {Y(oo)„.8. BASIC CONCEPTS. both to be specified later. An appropriate use of block-to-stationary coding. but mixing properties are lost. Randomizing the start at each stage will restore stationarity.. For n > 64. For any start process {Y(0). Since stationary codings of stationary codings are themselves stationary codings. + 71 ) > n. each coordinate is changed at most once. and hence there is a limit code F defined by the formula [F(x)]m = lim[F i (x)] m . in turn. since O's inserted at any stage are not changed in subsequent stages.n } and inductively defining {Y(i) m } for i > 1 by the formula (cn . 1} z that maps {X. Property (6) guarantees that no coordinate is changed more than once. b = 0. and it is not easy to see how to pass to a limit process. Furthermore. and let {ci } be a decreasing sequence of positive numbers. the sequence of processes {Y(i)} is defined by setting {Y(0)} = {X. for all m E Z and i > 1. 11 z {0. yields (7) L(Y(i)7 J ) > ni ll2 . Combining this observation with the fact that O's are not changed. and define the sequence of processes by starting with any process {Y(0)} and inductively defining {Y(i)} for i > 1 by the formula {Y --4Ecn {Y(i)m}. Lemma 1. let Fi denote the stationary encoder Fi : {0.5. that is. and long matches are even more likely if Y(i)?' is the concatenation uf v./2 . where u is the end of a codeword. provided the n i increase rapidly enough. . 1 < j < for all s. so that the code at each stage never changes a 0 into a 1. hence. < j. let {n i } be an increasing sequence of natural numbers. will be used. j > 1. At each stage the same filler symbol. Furthermore. y is the beginning of a codeword. in turn. where the notation indicates that {Y(j) m } is constructed from {Y(i — 1). each process {Y(i) m } is a stationary coding of the original process {X. with a limiting Et-fraction of spacers.86 CHAPTER I.1. (6). at each stage converts the block codes into stationary codings. mixing is not destroyed. UZ { . the choice of filler means that L(Y(i)?`) > rt.E. to be specified later. the i-th iterate {Y(i) m } has the property L(Y(i): +7') > n/2 . {n i : i > 11 be an increasing sequence of integers. which will. (6) Y(i — 1) m = 0 =z= Y (i) m = O. Given the ergodic process {X n } (now thought of as a two-sided process.} exists for which L(Y(oo) :. For each i > 1. where u is the end of a codeword and v the beginning of a codeword. mEZ. in particular. The rigorous construction of a stationary coding is carried out as follows. guarantee that the final limit process is a stationary coding of the starting process with the desired string-matching properties. since stationary coding will be used). let Cn be the zero-inserter code defined by (5). The preceding construction is conceptually quite simple but clearly destroys stationarity. and f is filler. which. In particular.n }. since this always occurs if Y(i)?' is the concatenation uv.n } by stationary coding using the codebook Cn.) {YU — l)m} -14 {Y(i)ml.}.n } to {Y(i) m }.

PROCESS TOPOLOGIES. X = Anand X = A". On the other hand. Two useful ways to measure the closeness of stationary processes are described in this section. (Hint: replace D by TD for n large. merely by making the ni increase rapidly enough and the Ei go to zero rapidly enough. Furthermore. I.(dN(x fv . The collection of (Borel) probability measures on a compact space X is denoted by Of primary interest are the two cases.SECTION 1. One concept. the d-distance between processes is often difficult to calculate.9 Process topologies. weak limits of ergodic processes need not be ergodic and entropy is not weakly continuous. The other concept. entropy is d-continuous and the class of ergodic processes is d-closed. but it has two important defects.8. declares two ergodic processes to be close if only a a small limiting density of changes are needed to convert a typical sequence for one process into a typical sequence for the other process.d. called the weak topology. where A is a finite set. o (Hint: use the preceding exercise.9.n } defined by the stationary encoder F. then Lemma 1. The d-metric is not so easy to define. First a sequence of block codes is constructed to produce a (nonstationary) limit process with the properties desired. where it is mixing. The above is typical of stationary coding constructions.„ m E Z is the stationary coding of the initial process {X. Since (7) hold j > 1. p(x).n = limé Y(i).5 is applied to produce a stationary process with the same limiting properties. Many of the deeper recent results in ergodic theory depend on the d-metric. for each al iv E AN . In particular.) Section 1. a theory to be presented in Chapter 4. (") (a lic) = n-4.5 can be chosen so that Prob(x 1— ' = an < 8 . and it plays a central role in the theory of stationary codings of i.8. Suppose C: A N B N is an N-code such that Ei. 87 The limit process {Y(oo)„z } defined by Y(oo). Show that if t is mixing and 8 > 0.a The weak topology. The weak topology is separable.) 2. The subcollection of stationary processes is denoted by P s (A") and the subcollection of stationary ergodic processes is denoted by 7'e (A°°). the limit process has the property L(Y(oo)) > the limiting density of changes can be forced to be as small as desired.8. declares two processes to be close if their joint distributions are close for a long enough time.. namely. compact.9. for each i.d Exercises. A sequence {. I. 1.00 . F(x)7) < < E 1-1(un o 2E and such that h(o.u (n) } of measures in P(A") converges weakly to a measure it if lim p. and the d-topology is nonseparable. C(4)) < E. The collection P(A") is just the collection of processes with alphabet A.i. then the sequence {ni(x)} of Lemma 1. and easy to describe. processes. the entropy of the final process will be as close to that of the original process as desired. called the dtopology. however. as are other classes of interest. Show that there is a stationary code F such that limn Ett (dn (4.

The weak limit of ergodic measures need not be ergodic. however.(X 0 1X:1) decreases to H(.) as k — > oc.vik = E liu(a/1 at denotes the k-th order distributional (variational) distance.u (n) (4) = 2-k . Since Ho (Xo l XII) depends continuously on the probabilities .. for every k and every 4.1 If p (n ) converges weakly to p. E > 0 there is a k such that HA (X0I < H(p) €.) and (1. relative to the metric D(g. p.) — v(a 101. which has entropy log 2. as i —> oc. The weak topology is Hausdorff since a measure is determined by its values on cylinder sets. Entropy is weakly upper semicontinuous..u. The weak topology is a compact topology.. for any sequence {.(X01X: ) denote the conditional entropy of X0 relative to X=1.u (n ) converges weakly to p. hence has a convergent subsequence. Thus. If k is understood then I • I may be used in place of I • lk. and recall that 111.u (n) } converges weakly to the coin-tossing process. 0.u(n) is stationary..u(C) is continuous for each cylinder set C. if . Since there are only countably many 4. for all k and all'. V k.u. is a probability measure.2e. .. weak convergence coincides with convergence of the entries in the transition matrix and stationary vector.u(a k i +1 ). it must be true that 1-1/.(4). so the class of stationary measures is also compact in the weak topology. The weak topology is a metric topology. yet lim . . H(a) < H(p). the sequence {. The weak topology on P(A") is the topology defined by weak convergence. given. where X ° k is distributed according to p. let x(n) be the concatenation of the members of {0. Proof For any stationary measure p. the limit p. For example.u (n) is the Markov process with transition matrix M= n [ 1 — 1/n 1/n 1/n 1 1 — 1/n j then p weakly to the (nonergodic) process p. It is the weakest topology for which each of the mappings p . One way to see this is to note that on the class of stationary Markov processes. then lim sup. for each n.°2) be the concatenated-block process defined by the single n2'1 -block x(n).u. Theorem 1. 1}n in some order and let p. In particular.°1) (a)} is bounded. Now suppose . the usual diagonalization procedure produces a subsequence {n i } such that { 0 (a)} converges to. say.). iii . The process p.9. Indeed. A second defect in the weak topology is that entropy is not weakly continuous. (n ) has entropy 0. let Hi.(0(X 0 IX11) < H(A)-1. that puts weight 1/2 on each of the two sequences (0. . y) where . and each .88 CHAPTER I.°1) } of probability measures and any all'. It is easy to check that the limit function p. BASIC CONCEPTS. 1. Furthermore. so that {. is stationary if each A (n ) is stationary.

i)• Thus ci(x. For 0 < p < 1.T (p)) is invariant since ii(y. Theorem 1. y) is the limiting per-letter Hamming distance between x and y.d.4.(n)(X01X11) is always true. Theorem 1. so that x and y must .i. Tx). processes.) To illustrate the d-distance concept the distance between two binary i.2 If p.u. I. processes will be calculated. x) = d(T y. y) and is called the d-distance between the ergodic processes it and v.b) = { 1 n i =1 a=b a 0 b. and y are ergodic processes.d. then H(g. that is.9.) + 2E. (n) ) < H(. the measure p. .T (p)) is v-almost surely constant.SECTION 1. PROCESS TOPOLOGIES. The sequence x contains a limiting p-fraction of l's while the sequence q contains a limiting q-fraction of l's. y ) = — d(xi. the set of all sequences for which the limiting empirical measure is equal to p.. is ergodic if and only if g(T(g)) = 1. This extends to the It measures the fraction of changes needed to convert 4 into pseudometric ci(x.d. By the typical-sequence theorem.i. hence f (y) is almost surely constant.1. let g (P) denote the binary i. A precise formulation of this idea and some equivalent versions are given in this subsection. (The fact that d(g. then d( y. Note.9. y) = lirn sup d(4 .u (P) -typical sequence.9.i.) Example 1.)}. and let y be a g (q) -typical sequence. y) on A. is actually a metric on the class of ergodic processes is a consequence of an alternative description to be given later. and its converse. T (p. Proof The function f (y) = 41(y . Theorem 1. Suppose 0 < p <q < 1/2.4. in particular. y) is the limiting (upper) density of changes needed to convert the (infinite) sequence x into the (infinite) sequence y.)) = inf{d(x. This proves Theorem 1. v).9. y): x E T(11. y. that is. the d-distance between p. and y is the limiting density of changes needed to convert y into a g-typical sequence. The d-distance between two ergodic processes is the minimal limiting (upper) density of changes needed to convert a typical sequence for one process into a typical sequence for the other. 89 for all sufficiently large n. . that for v-almost every v-typical sequence y. The constant of the theorem is denoted by d(g.2. as defined. where d(a. let x be a . The per-letter Hamming distance is defined by 1 x---N n O d„(4 .3 (The d-distance for i. yi). defined by d(x. LI since Ii(g (n ) ) < 1-11. Let T(g) denote the set of frequency-typical sequences for the stationary process it. process such that p is the probability that X 1 = 1.9.1.b The d-metric. a(x. The (minimal) limiting (upper) density of changes needed to convert y into a g-typical sequence is given by d(y. But if this is true for some n.

In a later subsection it will be shown that this new process distance is the constant given by Theorem 1. random sequence with the distribution of the Afr ) -process and let Y = x ED Z. y) is exactly 1/2. since about half the time the alternation in x is in phase with the alternation in y. The y-measure is concentrated on the alternating sequence y = 101010.u and y as marginals. Since Z is also almost surely p (r ) -typical. For any set S of natural numbers.) As another example of the d-distance idea it will be shown that the d-distance between two binary Markov chains whose transition matrices are close can nevertheless be close to 1/2. a(x. Later it will be shown that the class of mixing processes is a-closed. v).1 The joining definition of d. and y be the binary Markov chains with the respective transition matrices M. and let r be the solution to the equation g = p(1 — r) ± (1 — p)r . (With a little more effort it can be shown that d(. is represented by a partition of the unit square into horizontal .a(a) = E X(a. v(b) = E X(a. there is therefore at least one . (q ) -typical y such that cl(x. Let p. disagree in at least a limiting (g — p)-fraction of places. (1) . ( A key concept used in the definition of the 4-metric is the "join" of two measures. On the other hand. Example 1.90 CHAPTER I. Thus d. for kt converges weakly to y as p —> 0. b).u.d. To remedy these defects an alternative formulation of the d-d limit of a distance dn un . y) = g — p. but it is not always easy to use in proofs and it doesn't apply to nonergodic istance as a processes. on A x A that has . The definition of d-distance given by Theorem 1. y) > g — p.u (r) -typical z such that y = x Ef) z is . A joining of two probability measures A and y on a finite set A is a measure A. This new process distance will also be denoted by d(A. for .) These results show that the d-topology is not the same as the weak topology.u (q ) -typical. b). Let Z be an i.4 ( Weakly close. Fix a . that is. and about half the time the alternations are out of phase. kt and y. a typical /L-sequence x has blocks of alternating O's and l's. that is. Y is .9..u (q ) -typical y such that ii(x. vn ) between the corresponding n . First each of the two measures. the values {Zs : s E SI are independent of the values {Zs : s 0 S}. Thus there is indeed at least one . and hence ii(A (P) . hence the nonmixing process v cannot be the d-limit of any sequence of mixing processes. and its shift Ty.. a(x. to be given later in this section.i. which produces no errors.th order distributions will be developed. y) — 1/2. which produces only errors.)typical. Likewise.u (P ) -typical x. BASIC CONCEPTS.2 serves as a useful guide to intuition. d-far-apart Markov chains. d -distance This result is also a simple consequence of the more general definition of . for almost any Z. see Exercise 4. separated by an extra 0 or 1 about every 1050 places.9. y) = g — p. 7" (v)) — 1/2. one .u-almost all x. p (q ) ) = g — p.u. for each p9' ) -typical x.9.2.(x. and hence. b a A simple geometric model is useful in thinking about the joining concept. so that ci(x . Thus it is enough to find.b. say p = 10-50 . 1.[ p 1 — p 1—p [0 1 1 N= p 1' 1 where p is small. see Exercise 5.9. Ty) — 1/2. where e is addition modulo 2.

a) R(0.b):b E A} such that R(a. a) to (A. 6) means that. the use of the unit square and rectangles is for simplicity only.b) R(2a) 1 2 R(2. from (Z. The minimum is attained. A). The 4-metric is defined by cin(ti. 91 strips.c) R(1. A). A joining X can then be represented as a pair of measure-preserving mappings.b) R(0. that is.) A representation of p. (A space is nonatomic if any subset of positive measure contains subsets of any smaller measure. ai ) a (q5-1 (ai ) n ( . y) denote the set of joinings of p.SECTION 1. a E A. means that the total mass of the rectangles {R(a. where p. The function 4(A. for each a. hence is a metric on the class P(Afl) of measures on A". the . and v are probability measures on An . y.1 (c)). V) = min Ejdn(4 .a) R(1b) R(1. 0 R(0.a) R(0.b) has area X(a.b) R(2c) R(2. a measurable mapping such that a(0 -1 (a)) = p(a).5. y). one can also think of cutting up the y-rectangles and reassembling to give the A-rectangles. for a finite distribution can always be represented as a partition of any nonatomic probability space into sets whose masses are given by the distribution. b). see Exercise 2. on a nonatomic probability space (Z. a) is just a measure-preserving mapping ct. (See Figure 1. let J(11. 6).9. b): a E A} is exactly the y-measure of the rectangle corresponding to b. V) is a compact subset of the space 7'(A x A n ). Of course. PROCESS TOPOLOGIES. A measure X on A n x A n is said to realize v) if it is a joining of A and y such that E Vd n (x7.u-rectangle corresponding to a can be partitioned into horizontal rectangles {R(a.u(a) = Eb X(a.c) Figure 1. The joining condition . Also.1In. relative to which Jn(Itt. where ex denotes expectation with respect to X.a) a R(1. such that X is given by the formula (2) X(a i .9. . with the width of a strip equal to the measure of the symbol it represents. The second joining condition. a) to (A. y) satisfies the triangle inequality and is strictly positive if p. since expectation is continuous with respect to the distributional distance IX — 5. yti')) = cin v). v(b) = Ea X. a joining is just a rule for cutting each A-rectangle into subrectangles and reassembling them to obtain the y-rectangles.9.) In other words. a) to (A. and ik from (Z.5 Joining as reassembly of subrectangles. (/) from (Z.a) R(2.c) R(2. Turning now to the definition of y). and v.(a.

Thus d̄_n is a metric on P(A^n). The d̄-distance between two processes is defined for the class P(A^∞) of probability measures on A^∞ by passing to the limit, that is,

d̄(μ, ν) = lim sup_n d̄_n(μ_n, ν_n),

where μ_n denotes the measure induced by μ on A^n. Note that d̄(μ, ν) is a pseudometric on the class P(A^∞), and hence d̄ defines a topology on P(A^∞). In the stationary case the limit superior is, in fact, both a limit and the supremum.

Theorem I.9.6
If μ and ν are stationary then

d̄(μ, ν) = lim_n d̄_n(μ_n, ν_n) = sup_n d̄_n(μ_n, ν_n).

Proof The proof depends on a superadditivity inequality, namely,

(3)    n d̄_n(μ_1^n, ν_1^n) + m d̄_m(μ_{n+1}^{n+m}, ν_{n+1}^{n+m}) ≤ (n + m) d̄_{n+m}(μ_1^{n+m}, ν_1^{n+m}),

a consequence of the definition of d̄_n as a minimum, where μ_{n+1}^{n+m} denotes the measure induced by μ on the coordinates n + 1, ..., n + m. In the stationary case, μ_{n+1}^{n+m} = μ_1^m, so the preceding inequality takes the form n d̄_n + m d̄_m ≤ (n + m) d̄_{n+m}, which implies that the limsup is in fact both a limit and the supremum. The proof of this is left to Exercise 3. This establishes Theorem I.9.6.

A consequence of the preceding result is that the d̄-pseudometric is actually a metric on the class of stationary measures. Indeed, if d̄(μ, ν) = 0 and μ and ν are stationary, then, since in the stationary case the limit superior is a supremum, d̄_n(μ_n, ν_n) must be 0 for all n, which implies that μ_n = ν_n for all n, which, in turn, implies that μ = ν.

It is useful to know that the d̄-distance can be defined directly in terms of stationary joining measures on the product space A^∞ × A^∞. A joining λ of μ and ν is just a measure on A^∞ × A^∞ with μ and ν as marginals, that is, μ(B) = λ(B × A^∞), ν(B) = λ(A^∞ × B), B ∈ Σ. The set of all stationary joinings of μ and ν will be denoted by J_s(μ, ν).

Theorem I.9.7 (Stationary joinings and d̄-distance.)
If μ and ν are stationary then

(4)    d̄(μ, ν) = min_{λ ∈ J_s(μ, ν)} E_λ(d(x_1, y_1)).

/ . which proves the first of the following equalities. for if A E Js(.k (d(x i .1. yr)) = K-1 ( k=0 H (t v knn -l"n kk"kn-I-1 .u(B). For each n. This. b) = E (uav. . The goal is to construct a stationary measure A E y) from the collection {A}. the measure 7. the concatenatedblock process defined by A n .kn-Fn Ykn+1 // t( y Kn+r "n kk""Kn-1-1 )1(n+1 . and from the fact that e . for it depends only on the first order distribution. that is. 0<r <n. implies that lim. the measure on A x A defined by A. va). It is also easy to see that d cannot exceed the minimum of Ex (d(x l . if N = Kn+r. y) of stationary joinings is weakly closed and hence compact. yi)). B x B) = . The details are given in the following paragraphs. yl)) = ex n (dn (4. let ). then taking a weakly convergent subsequence to obtain the desired limit process À. Towards this end.7 be projection of A n onto its i-th coordinate. It takes a bit more effort to show that d cannot be less than the right-hand side in (4).. let A. and the latter dominates an (IL. v).1s(p. va).01) denote the concatenated-block process defined by An. together with the averaging formula.7 (a. E. B E E E.. for each n choose A n E Jn(tin. u'bvi). (n) is the measure on (A x A )° defined by A((4. V). yi ) = lima Cin (tt.The definition of A (n) implies that if 0 < i < n — j then x A c° )) = Xn(T -` [4] x A) = ting—i [xn) = since An has A n as marginal and it is shift invariant. V) = Ci(p. PROCESS TOPOLOGIES. This is done by forming. As noted in Exercise 8 of Section 1. such that ex (d(Xi . (") (C x = A(C). To prove (c). Yi)) is weakly continuous.Î. vn) = ex(c1. such that an(An.. y). lim e-)(d(xi. Yi)) = litridn (tt. for any cylinder set C.(4 . for each n.9. (a) (6) (b) (c) n-+co lim 5:(n)(B x lim 3. 93 Proof The existence of the minimum follows from the fact that the set . since A n belongs to -In(un. The proof of (b) is similar to the proof of (a).(n) (A = bt(B).SECTION 1.(n) is given by the formula (5) i "(n) (B) = — n n 1=1 where A. y) then stationarity implies that ex(d(xi. (5). y rit )).

To complete the proof select a convergent subsequence of 1-(n) . b) = (1/n) 1 n x. y). ir(x . yri') = E p . { } 0< (n — k + 2) p(414) — (n k— E p(a'nfi!) <1.p) (d(a.„ (dn(4 )11 ))• Ei x7 (a. and the third. yi) = e7. and more. y). bk i )1(x n i1..(d(xi. respectively.(. 1x). and X' is a joining of g. y)). by conditions (6a) and (6b). This proves (c). b1 )). Lemma 1.9. and is a joining of h and y.) The measures 171. if necessary. Thus X E Js(h. )). pk ((alic . since cin(A. pk (af 14) and pk (biny). A diagonalization argument. The pair (x. I. ( . Two of these.y. each side of E(n — k 1)p((af. that (7) dn (x. b). u) = e).j'e ly). Yi)). it can be assumed that the limit i i )) . y) = lim dni ((x in i . but important fact. yril) defines three empirical k-block distributions. 1(x.2 Empirical distributions and joinings. The limit results that will be needed are summarized in the following lemma. b 101(x7. Fix such a n1 and denote the Kolmogorov measures exists for all k and all al` and defined by the limits in (8) as TI. hence is not counted in the second term.1).94 and note that CHAPTER L BASIC CONCEPTS. The limit X must be stationary. A simple. however. y) = Eî(d(ai . b)). Note also. are obtained by thinking of 4 and Ai as separate sequences. since. yields an increasing sequence fn . Passing to the limit it follows that . Limit versions of these finite ideas as n oc will be needed. yn also exists.8 (The empirical-joinings lemma. ) ) is a joining of pk (./ J such that each of the limits (8) lim p(afix). and plc ( ly7). al since the only difference between the two counts is that a' 2` might be the first k — 1 terms of 4. By dropping to a further subsequence. "i. 07 .0)(d (xi . y) as a single sequence. One difficulty is that limits may not exist for a given pair (x. y.b. and 3. Proof Note that Further- M. = (n — k 1)p(aii` I x7) counts the number of times af appears in x7.(. ex n (dn(x tiz . I —ICX) j—>00 lim p((a .9. yi)). is that plc ( 1(4 . E d (xi . for example. y) = EA.. y7)) = since 3:(n) (a. Condition (6c) guarantees that ii(tt. yril)). is obtained by thinking of the pair (4 . and): are all stationary. Yi)X7 (xi . 14)1(4 .
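The finite empirical-joining facts used here are easy to verify directly: the overlapping k-block distribution of the pair sequence is a joining of the overlapping k-block distributions of the two coordinate sequences, and for k = 1 its expected Hamming cost is exactly d_n(x_1^n, y_1^n). A small check, mine and not from the text, with made-up sequences:

```python
from collections import Counter

def empirical_k_blocks(seq, k):
    """Overlapping k-block empirical distribution p_k(. | seq)."""
    total = len(seq) - k + 1
    counts = Counter(tuple(seq[i:i + k]) for i in range(total))
    return {block: c / total for block, c in counts.items()}

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
y = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
k = 2

# Treat the pair as one sequence over A x A; its k-block distribution has the
# k-block distributions of x and y as coordinate marginals.
pair = list(zip(x, y))
p_joint = empirical_k_blocks(pair, k)
p_x = empirical_k_blocks(x, k)
p_y = empirical_k_blocks(y, k)

marg_x, marg_y = Counter(), Counter()
for block, prob in p_joint.items():
    marg_x[tuple(a for a, b in block)] += prob
    marg_y[tuple(b for a, b in block)] += prob
assert all(abs(marg_x[w] - p_x.get(w, 0.0)) < 1e-12 for w in marg_x)
assert all(abs(marg_y[w] - p_y.get(w, 0.0)) < 1e-12 for w in marg_y)

# For k = 1 the expected Hamming distance under the empirical joint
# distribution equals d_n(x, y).
p1 = empirical_k_blocks(pair, 1)
exp_d = sum(prob * (block[0][0] != block[0][1]) for block, prob in p1.items())
d_n = sum(a != b for a, b in zip(x, y)) / len(x)
assert abs(exp_d - d_n) < 1e-12
```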

y)). as given by the ergodic decomposition theorem.. combined with (10). however. This. y) = e).9 (Ergodic joinings and a-distance. Lemma 1. (10) d(u.10. it follows that X E ). y) < E).(x..7(x. 95 is stationary.9. it is concentrated on those pairs (x.u and y such that Au. is. yl )). y i )).9. Since.are stationary. Likewise. the same as the constant given by Theorem 1. since d(xi.(.u and y are ergodic then there is an ergodic joining A of . The integral expression (9) implies. y) = ex(d(xi . yi))• Proof Let A be a stationary joining of g and y for which c/(u. since À is a stationary joining of p and y. y). A(x) is ergodic and is a joining of . Theorem 1. y) = EA.7.SECTION 1. (. The empirical-joinings lemma can be used in conjunction with the ergodic decomposition to show that the d-distance between ergodic processes is actually realized by an ergodic joining. But if this holds for a given pair. ) (d(xi. This establishes Theorem 1. Theorem 1. a strengthening of the stationary-joining theorem. Thus.(d(xi let yi)). 371)). y) E (A. means that a(i. yl) depends only on the first order distribution and expectation is linear.(x. )) .9.. and A. yi)). Li Next it will be shown that the d-distance defined as a minimization of expected perletter Hamming distance over joinings. by definition of X.9.7.9. A-almost surely. y) E A" x A'.) If . y) is 4( X . 8 . y) is ergodic for almost every (x. PROCESS TOPOLOGIES.t. in the ergodic case. is the empirical measure defined by (x. y E T (v).„. "if and 1.y) (d(x i . But. and hence d(u. y)) be the ergodic decomposition of A.9. Theorem 1. y).4. and (9) A= f A(x ) da)(71. . that x(d(xi yi)) = f yl)) da)(7(x.y)) for A-almost every pair (x. since it was assumed that dCu.9. by the empirical-joinings lemma.2.)-typical. while the d-equality is a consequence of (7). . In this expression it is the projection onto frequency-equivalence classes of pairs of sequences. y) < Ex„(x .„( . (x.u and v.) must be a stationary joining of g and y. y) for which x is g-typical and y is v-typical. The joining result follows from the 0 corresponding finite result. co is a probability measure on the set of such equivalence classes. Theorem 1. the pair (x. À-almost surely. then X 7r(x. for À-almost every pair (x. Furthermore. = ex(d(xi.y) (d(xi.. y).9.

9. v).10. y) E TOO Ç T(i) x T (v). will be denoted by x n (jr). y i )) = pairs (x. together with the almost sure limit result. (a ) — v(a7)I and a measure is said to be TN -invariant or N-stationary if it is invariant under the N-th power of the shift T. y) E T(A) x T (v) d(p. conditioned on the value r of some auxiliary random variable R. . Here is why this is so. p((at. the following two properties hold..9. and h1 . random variable notation will be used in stating d-distance results. y 1 ') converges to a limit d(x.(d(Xi. if p. i=1 (b) (x. y) < d(x . v) d(x . blf)1(x n oc. also by the empirical-joinings lemma..Theorem 1. (12). v-almost surely. Y1)).(a. y). The vector obtained from xri' be deleting all but the terms with indices j i < j2 < < j. Theorem 1. As earlier. This establishes the desired result (12).v). and 2(x.9. y'. and v as marginals. For convenience. The ergodic theorem guarantees that for A-almost all which e(d(x i . v). the distributional (or variational) distance between n-th order distributions is given by Ii — vin = Elp.u and v are ergodic then d(.96 CHAPTER I.. Since the latter is equal to 2( x. that is.u.n . 17) stands for an (A. there is a set X c A' such that v(X) = 1 and such that for y E X there is at least one x E T(A) for which both (a) and (b) hold.') = — n EÀ(d(xi. bt) for each k and each al' and bt. y). as j lemma implies that Î is stationary and has p. Next it will be shown that there is a subset X c A of v-measure 1 such that (12) d(y. yields LIJ the desired result. (a) cln (4 . y). T (A)).3 Properties and interpretations of the d-distance. (11). Let x be ti-typical and y be v-typical. In particular. BASIC CONCEPTS.b. y). I.9 implies that there is an ergodic joining for v).) If .10 (Typical sequences and d-distance. v) = d(y. v) E-i. Proof First it will be shown that (11) (x. the proof of (11) is finished. for example. or by x(jr) when n is understood.T (p)) = d(p.9. is the distribution of X and v the to indicate the random vector distribution of Yf'. y). The empirical-joinings so that dnj (x l i . and choose an increasing sequence {n 1 } such that )) converges to a limit A. . and hence d(p. Theorem 1. The lower bound result.Y1))=a(u. V y E X. Another convenient notation is X' .

.m. together with the fact that the space of measures is compact and complete relative to the metric ii . see Exercise 8. = min(A(a7). yuln))x(x(q). o T'. Yf' gives a joining of X(ffn). b)) > 1 . i2. (c) (X (ir).9. Li } and for each ( x (jr ) .. y) satisfies the triangle inequality and is positive if p. v(a)). unless a = b. )711')) < mEVd m (x(iln ). Ex(d(a.y(j"))(X Y(iin—m )) be a joining of the conditional random variables. y is left as an exercise. < md„. This completes the proof of property (a).u and v are T N -invariant. .) (a) dn v) _< (1/2)Ip.Via . and mdm (x(r).Yin = 2. 14): a7 b7 } ) = 1 . dn(A. y o 7-1 ). X(i in-m )/x( jim) and Y(i n i -m )/y(jr).(XF n Yr n ) = dn(X i ort1)n+1 . n } .2 Emin(u(a7). Furthermore. there is always at least one joining X* of i and y for which X(4. y). for any joining X of bt and v. The right-hand inequality in (c) is a consequence of the fact that any joining X of X(r) and Y(ffn) can be extended to a joining of X and To establish this extension property. Moreover.m.u .(ir»(xcili -m).v In . Y (if n )) n . y) d(p. and hence. . nE).11 (Properties of the d-distance. y) (dn (di% b7)) < . v(a)). 2. y( j. . The left-hand inequality in (c) follows from the fact that summing out in a joining of X."1 )). X(x(jr). .Y(j im).1A . 97 Lemma 1.17(iin—on+i).E ga7. and that Ex (dn (a7 . then ci(u. y(jr)) n . IT) < illr. let yr. a). since d(a. b) = 1.a7 ) .Ea ga.YD. The fact that cln (A.(d„(xr il .9.SECTION 1. PROCESS TOPOLOGIES.. v) = lim cin (. (b) d. YU)) yr. with equality when n = 1. is a joining of X and Yln which extends X. r P(R = r)d(r Proof A direct calculation shows that IL .(X (j r). Completeness follows from (a). (g) ("in(X7. is a complete metric on the space of measures on An. The function defined by x*(q. then mcl. (e) d(u. let {6.vi n . 14)) X({(a7 . in _m } denote the natural ordering of the complement {1. ±1 } are independent of each other and of past blocks. 2 1 In the case when n = 1. yoin -m».. (f) If the blocks {X i ort on+1 } and Olin 1) .y(jr)) 5_ ndn (x 11 . yri') = x(x(r). . for any finite random variable R. (d) If .

01.u. and define (13) *(/. It is one form of a metric commonly known as the Prohorov metric. by the assumed independence of the blocks.1)n-1-1 . y) < cl. the extension to N > 1 is straightforward. yriz)) < 2a./.(xi' y) ct (a) 20/.98 CHAPTER I. The Markov inequality implies that X(A n ( N/Fe)) > 1 — NATt. and hence cl. This completes the proof of the lemma. with bounds given in the following lemma.u.in-on+1) .n (x(irm ).0 The function dn * ( u.12 [d:Gt. versions of each are stated as the following two lemmas.4. and hence du. y) = ct. and is uniformly equivalent to the 4-metric. (2). 1)n-1-1 } . y7)EA. Property (e) follows from (c). Let A n (c) = dn (xiz . is obtained by replacing expected values by probabilities of large deviations. (3).11. Property (d) for the case N = 1 was proved earlier. Proof Let A be a joining of 1.(A n (c)) > 1 — cl.(dn (4 . y) is a metric on the space P(An) of probability measures on A. yrn)) = E nE j (dn (x . Yilj. The preceding lemma is useful in both directions.(xrn . BASIC CONCEPTS. y(i n in property (c). such that Exj (dn (X17_ 1)n±l' Yii. dn(xii . and hence nz Manin (X7 m . equivalent to the 4-metric. y) Likewise. This completes the discussion of Lemma 1. } and {Y(. Yintn ) Edna (. )1)) = a. i . v)] 2 5_ 4 (A.E. y(ijn-on+1)• The product A of the A is a joining of Xr and Yr... Lemma 1. yDA. Another metric. y) 5_ 24(u.-1)n+1 . and hence property (f) is established. y) < E. .tt1)n+1)) i • j=1 The reverse inequality follows from the superadditivity property. Minimizing over A yields the right-hand inequality since dn . Part (b) of each version is based on the mapping interpretation of joinings.t and y such that dn (A. v). if A E y) is such that X(An(a)) > 1 — a and d:(. dn A. y) = E)... then EÀ(dn (4.9. j=1 since mnEx (c1. choose a joining Ai of {Xiiin j. for each To prove property (f). y) = min minfc > 0: A. The proof of Property (g) is left to the reader.9./2-1)n+1' y(i.(. yiii)) < E (x7.`n )) < 1.2-1)n+1 )) = i (.(d n (f.
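Properties (a) and (b) above can be checked by hand in the case n = 1: the joining that matches each symbol with itself as much as possible (compare Exercise 8) leaves exactly (1/2)|μ − ν|_1 units of mass unmatched, and every unmatched unit costs 1. A small sketch, with distributions made up for illustration:

```python
def dbar_1(mu, nu):
    """d-bar_1 via the explicit joining lambda*(a, a) = min(mu(a), nu(a)):
    self-matched mass costs 0, all leftover mass must be paired across
    different symbols and costs 1 per unit."""
    matched = sum(min(mu.get(a, 0.0), nu.get(a, 0.0)) for a in set(mu) | set(nu))
    return 1.0 - matched

def half_variational(mu, nu):
    """(1/2)|mu - nu|_1, the right side of property (a) for n = 1."""
    return 0.5 * sum(abs(mu.get(a, 0.0) - nu.get(a, 0.0)) for a in set(mu) | set(nu))

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}
print(dbar_1(mu, nu), half_variational(mu, nu))   # both print 0.3
```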

then cin (it. v).9. (b) If 4 and are measure-preserving mappings from a nonatomic probability space (Z. respectively. such that a(lz: dn((z). This follows from the fact that if v is ergodic and close to i. Lemma 1. y) < c2.b. fix k and note that if )7 4-k---1 = all' and 4 +k-1 0 al'. a) into (A'. 11 ) 3 I p(44) — p(any)1 < c.9. and N1 so that if n > N1 then (15) d( x. Proof Suppose . mixing. B) 3).(4 .4. let [B]s = {y7: dn(y ri' .u) and (An. almost surely.SECTION 1. PROCESS TOPOLOGIES. In this section it will be shown that the d-limit of ergodic (mixing) processes is ergodic (mixing) and that entropy is d-continuous. and hence (14) (n — k + 1)p(41). a) is a nonatomic probability space. y) < e 2 and . Lemma 1.9. i + k — 1].u is the d-limit of ergodic processes.9. then yi 0 x i for at least one j E [i.t. respectively. y) < 2e. Theorem 1.2 it is enough to show that p(aii`lx) is constant in x.9. then Y n ([B]. I.un (B) > 1 — c. except for a set of a-measure at most c. By Theorem 1. where the extra factor of k on the last term is to account for the fact that a j for which yi 0 x i can produce as many as k indices i for which y :+ k -1 _ — ak 1 and X tj.15 (Ergodicity and d-limits.t. which is just Lemma 1. a) into (An. there are measurepreserving mappings (/) and from (Z.4 Ergodicity. then iin (t. p. and d limits. As before.) The a-limit of ergodic processes is ergodic.) > 1 — 2e.14 (a) If anCu. for 3 > 0. for every a. entropy.) and (A". By the ergodic theorem the relative frequency p(eil`ix) converges almost surely to a limit p(anx).q) + knd. The key to these and many other limit results is that do -closeness implies that sets of large measure for one distribution must have large blowup with respect to the other distribution. ±1c-1 0 ak 1' Let c be a positive number and use (14) to choose a positive number S < c. y) < 2E.13 99 (a) If there is a joining À of it and u such that d„(4 .*(z)) > el) 5_ E. and small blowups don't change empirical frequencies very much.14. . v). .11 ) < (n — k + 1)p(a in. denote the 3-neighborhood or 3-blow-up of B c A". To show that small blowups don't change frequencies very much. )1). y) < c. except for a set of A. and (Z. such that dn ( ( Z) (z)) < E. where [ME = dn(Yç z B) EL (b) If cin (i.-measure at most c. then a small blowup of the set of vtypical sequences must have large ti-measure.9.

via finite-coder approximation. (Such a 3 exists by the blowup-bound lemma.7.9. implies that A is ergodic. and has large v-measure.4. <8 and hence I 19 (alic 1-741 ) IY7)1 < 6 by (15). ] 5 ) > 1 — 2 8 . Let 8 be a positive number such that un bi for all n > N. and.7.14 implies that y([7-. for sufficiently large n. Theorem 1. choose N.) The d-limit of mixing processes is mixing. Proof The proof is similar to the proof. so the covering-exponent interpretation of entropy. Al. however. be a stationary joining of p. 8 then there is a sequence i E Bn such that ] 14) — P(a lic l51)1 < E. D Theorem 1.9.9. n > Because y is ergodic N1 . Fix n > N2 and apply Lemma 1. completing the proof of Theorem 1. and let A.) Choose > N such that p(T) > 1 — 3.) Entropy is a-continuous on the class of ergodic processes.n . Given E > 0. Theorem 1. that if y E [A]s then there is a sequence x E Bn such that dn (fi! . the finite sequence characterization of ergodicity. Mixing is also preserved in the passage to d-limits. for each < 2i1(h+E) and p(T) n > N.')) < 82. since for stationary measures d is the supremum of the there is an N2 > Ni such that if n > N2. yields h(v) < h(p) + 2E. Yi)) = Ex(cin (4. Proof Now the idea is that if p. vn ) < 82.9. 57 11 E [B. see Example 19. Likewise. v gives measure at least I — 3 to a subset of An of cardinality at most 2n (n( v)+E ) and hence El h(. Let pc be an ergodic process with entropy h = h(i).1' E [B.14 to obtain A n ([13. and y such that Evdi (xi .]6 ) > 1 — 23. Theorem 1.-measure. let y be a mixing process such that d(A . y) < 3 2 and hence an (11. fix all and VII.-entropy-typical sequences by very much. Let y be an ergodic measure such that cl(A. and fix E > 0.16 (Entropy and d-limits. and y are d-close then a small blowup does not increase the set of p. y) < 3 2 . Bn = fx ril : I P(414) — v(4)1 < has vn measure at least 1-6.100 CHAPTER I. Suppose A is the a-limit of mixing processes. This completes the proof of Theorem 1. Note. y. y) < 8 2 . Likewise if 5. each of positive p.4.9. that stationary coding preserves mixing.17 (Mixing and a-limits.16. choose Tn c An such that ITn I _ 1 as n co.5. . Lemma 1. 8. BASIC CONCEPTS.9. Lemma 1.5. ] Since [B] large An -measure. the set an . The definition of Bn and the two preceding inequalities yield — p(451)1 < 4E. Given 8 > 0. for n > N I .u) < h (v) 2E.15. for all n > N. Let y be an ergodic process such that a(A.
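The blowup [B]_δ that drives these limit arguments is just a Hamming neighborhood, and for small n it can be computed by brute force. A toy illustration, mine and not from the text:

```python
import itertools

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v)) / len(u)

def blowup(B, delta, n, alphabet=(0, 1)):
    """The delta-blowup [B]_delta = {y in A^n : d_n(y, B) <= delta},
    computed by exhaustive search over A^n (feasible only for small n)."""
    return {y for y in itertools.product(alphabet, repeat=n)
            if any(hamming(y, x) <= delta for x in B)}

n = 6
B = {(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1)}
for delta in (0.0, 1 / n, 2 / n):
    print(delta, len(blowup(B, delta, n)))   # sizes 2, 14, 44
```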


This implies that d_n(x_1^n, y_1^n) ≤ δ except for a set of λ-measure at most δ, so that if δ < 1/n then the set {(x_1^n, y_1^n): x_1^n ≠ y_1^n} has λ-measure at most δ. Thus, by making δ even smaller, it can be supposed that μ(a_1^n) is so close to ν(a_1^n) and μ(b_1^n) so close to ν(b_1^n) that

|μ(a_1^n)μ(b_1^n) − ν(a_1^n)ν(b_1^n)| ≤ ε/3.

Likewise, for any m ≥ 1,

λ({(x, y): x_1^n ≠ y_1^n or x_{m+1}^{m+n} ≠ y_{m+1}^{m+n}}) ≤ λ({(x, y): x_1^n ≠ y_1^n}) + λ({(x, y): x_{m+1}^{m+n} ≠ y_{m+1}^{m+n}}) ≤ 2δ,

so that if δ < 1/2n is small enough, then

|μ([a_1^n] ∩ T^−m[b_1^n]) − ν([a_1^n] ∩ T^−m[b_1^n])| ≤ ε/3,

for all m ≥ 1. Since ν is mixing, however,

|ν(a_1^n)ν(b_1^n) − ν([a_1^n] ∩ T^−m[b_1^n])| ≤ ε/3,

for all sufficiently large m, and hence the triangle inequality yields

|μ(a_1^n)μ(b_1^n) − μ([a_1^n] ∩ T^−m[b_1^n])| ≤ ε,

for all sufficiently large m, provided only that δ is small enough. Thus μ must be mixing, which establishes the theorem.

This section will be closed with an example which shows, in particular, that the
ii-topology is nonseparable.
Example 1.9.18 (Rotation processes are generally d-far-apart.) Let S and T be the transformations of X = [0, 1) defined, respectively, by Sx = x

a, T x = x

fi,

where, as before,

e denotes

addition modulo 1.

Proposition 1.9.19
Let A be an (S x T)-invariant measure on X x X which has Lebesgue measure as marginal on each factor If a and fi are rationally independent then A must be the product measure. Proof "Rationally independent" means that ka ± mfi is irrational for any two rationals k and m with (k, m) (0, 0). Let C and D be measurable subsets of X. The goal is to • show that A.(C x D) =
It is enough to prove this when C and D are intervals and p.(C) = 1/N, where N is an integer. Given E > 0, let C 1 be a subinterval of C of length (1 — E)1N and let

E = C i x D, F = X x D.

102

CHAPTER I. BASIC CONCEPTS.

Since a and fi are rationally independent, the two-dimensional version of Kronecker's theorem, Proposition 1.2.16, can be applied, yielding integers m 1 , m2, , rnN such that if V denotes the transformation S x T then Vms E and X(FAÉ) < 2E, where F' = U 1 Vrn1E .
E -±
n vrni

E = 0, if i j,

It follows that X(E) = X(P)/N is within 26/N of A.(F)/ N = 1,t(C)A(D). Let obtain X(C x D) = This completes the proof of Proposition 1.9.19.

0 to

El

Now let P be the partition of the unit interval that consists of the two intervals Pc, = [0, 1/2), Pi = [1/2, 1). It is easy to see that the mapping that carries x into its (T, 2)-name {x} is an invertible mapping of the unit interval onto the space A z the Kolmogorov measure. This fact, together withwhicaresLbgumont Proposition 1.9.19, implies that the only joining of the (T, 2)-process and the (S, 2)process is the product joining, and this, in turn, implies that the d-distance between these two processes is 1/2. This shows in particular that the class of ergodic processes is not separable, for, in fact, even the translation (rotation) subclass is not separable. It can be shown that the class of all processes that are stationary codings of i.i.d. processes is d-separable, see Exercise 3 in Section W.2.

I.9.c

Exercises.

1. A measure ,u E P(X) is extremal if it cannot be expressed in the form p = Ai 0 ia2. (a) Show that if p is ergodic then p, is extremal. (Hint: if a = ta1 -F(1 — t)A2, apply the Radon-Nikodym theorem to obtain gi (B) = fB fi diu and show that each fi is T-invariant.) (b) Show that if p is extremal, then it must be ergodic. (Hint: if T -1 B = B then p, is a convex sum of the T-invariant conditional measures p(.iB) and peiX — B).) 2. Show that l') is a complete metric by showing that

(a) The triangle inequality holds. (Hint: if X joins X and /7, and X* joins 17 and Zri', then E,T yi)A*(z7ly'll) joins X and 4.) (b) If 24(X?, ri) = 0 then Xi; and 1111 have the same distribution. (c) The metric d(X7, 17) is complete. 3. Prove the superadditivity inequality (3). 4. Let p, and y be the binary Markov chains with the respective transition matrices
m [

P —p

1—p p 1'

N [0 1 1 o]'

1

Let fi be the Markov process defined by M2.

SECTION 1.10. CUTTING AND STACKING.

103

v n = X2n, n = 1, 2, ..., then, almost (a) Show that if x is typical for it, and v surely, y is typical for rt.
(b) Use the result of part (a) to show d(,u, y) = 1/2, if 0 < p < 1. 5. Use Lemma 1.9.11(f) to show that if it. and y are i.i.d. then d(P., y) = (1/2)1A — v I i. (This is a different method for obtaining the d-distance for i.i.d. processes than the one outlined in Example 1.9.3.) 6. Suppose y is ergodic and rt, is the concatenated-block process defined by A n on A n . (p ii ) (Hint: g is concentrated on shifts of sequences Show that d(v ,g) = --n.n, ,.--n,• that are typical for the product measure on (An )° ° defined by lin .)

a.

7. Prove property (d) of Lemma 1.9.11.

* (a7 , a7) = min(p,(a7), v(a)). 8. Show there is a joining A.* of ,u and y such that Xn
9. Prove that ikt — v in = 2 — 2 E a7 min(i.t(a7), v(a)). 10. Two sets C, D C A k are a-separated if C n [D], = 0. Show that if the supports of 121 and yk are a-separated then dk(Ilk.vk) > a. 11. Suppose ,u, and y are ergodic and d(i.t, for sufficiently large n there are subsets v(Dn ) > 1 — E, and dn (x7, y7) > a 1`) ' Ak and Pk('IY1 c ) ' Vk, and k Pk('IX 1 much smaller than a, by (7)) y) > a. Show that if E > 0 is given, then Cn and Dn of A n such that /L(C) > 1 —e, — e, for x7 E Cn , yli' E D. (Hint: if is large enough, then d(4 , y ) cannot be

12. Let (Y, y) be a nonatomic probability space and suppose *: Y 1-4- A n is a measurable mapping such that dn (un , y o * -1 ) < S. Show that there is a measurable mapping q5: y H A n such that A n = V curl and such that Ev(dn(O(Y), fr(y))) <8.

Section 1.10 Cutting and stacking.
Concatenated-block processes and regenerative processes are examples of processes with block structure. Sample paths are concatenations of members of some fixed collection S of blocks, that is, finite-length sequences, with both the initial tail of a block and subsequent block probabilities governed by a product measure it* on S' . The assumption that te is a product measure is not really necessary, for any stationary measure A* on S" leads to a stationary measure p, on Aw whose sample paths are infinite concatenations of members of S, provided only that expected block length is finite. Indeed, the measure tt is just the measure given by the tower construction, with base S", measure IL*, and transformation given by the shift on S°°, see Section 1.2.c.2. It is often easier to construct counterexamples by thinking directly in terms of block structures, first constructing finite blocks that have some approximate form of the final desired property, then concatenating these blocks in some way to obtain longer blocks in which the property is approximated, continuing in this way to obtain the final process as a suitable limit of finite blocks. A powerful method for organizing such constructions will be presented in this section. The method is called "cutting and stacking," a name suggested by the geometric idea used to go from one stage to the next in the construction.

104

CHAPTER I. BASIC CONCEPTS.

Before going into the details of cutting and stacking, it will be shown how a stationary measure ,u* on 8" gives rise to a sequence of pictures, called columnar representations, and how these, in turn, lead to a description of the measure on A'.

I.10.a

The columnar representations.

Fix a set 8 c A* of finite length sequences drawn from a finite alphabet A. The members of the initial set S c A* are called words or first-order blocks. The length of a word w E S is denoted by t(w). The members w7 of the product space 8" are called nth order blocks. The length gel') of an n-th order block wrii satisfies £(w) = E, t vi ). The symbol w7 has two interpretations, one as the n-tuple w1, w2, , w,, in 8", the other as the concatenation w 1 w2 • • • to,„ in AL, where L = E, t(wi ). The context usually makes clear which interpretation is in use. The space S" consists of the infinite sequences iv?"' = w1, w2, ..., where each Wi E S. A (Borel) probability measure 1.1* on S" which is invariant under the shift on S' will be called a block-structure measure if it satisfies the finite expected-length condition,
(

E(t(w)) =

Ef (w),uT(w ) <
WES

onto 8, while, in general, Here 14 denotes the projection of of ,u* onto 8'. Note, by the way, that stationary gives E v( w7 ))

denotes the projection

. E t (w ) ,u.: (w7 ) . nE(t(w)).
to7Esn

Blocks are to be concatenated to form A-valued sequences, hence it is important to have a distribution that takes into account the length of each block. This is the probability measure Â, on S defined by the formula
A( w) —

w)te (w)

E(fi

,WE

8,

where E(ti) denotes the expected length of first-order blocks. The formula indeed defines a probability distribution, since summing t(w),u*(w) over w yields the expected block length E(t 1 ). The measure is called the linear mass distribution of words since, in the case when p,* is ergodic, X(w) is the limiting fraction of the length of a typical concatenation w1 w2 ... occupied by the word w. Indeed, using f (w1w7) to denote the number of times w appears in w7 E S", the fraction of the length of w7 occupied by w is given by

t(w)f (wIw7)
t(w7)

= t(w)

f(w1w7) n
n f(w7)

t(w)11,* (w)
E(t 1 )

= A.(w), a. s.,

since f (w1w7)1 n ---÷ ,e(w) and t(will)In E(t1), almost surely, by the ergodic theorem applied to lu*. The ratio r(w) = kt * (w)/E(ti) is called the width or thickness of w. Note that A(w) = t(w)r(w), that is, linear mass = length x width.

SECTION 1.10. CUTTING AND STACKING.

105

The unit interval can be partitioned into subintervals indexed by 21) E S such that length of the subinterval assigned to w is X(w). Thus no harm comes from thinking of X as Lebesgue measure on the unit interval. A more useful representation is obtained by subdividing the interval that corresponds to w into t(w) subintervals of width r (w), labeling the i-th subinterval with the i-th term of w, then stacking these subintervals into a column, called the column associated with w. This is called the first - order columnar representation of (5 00 , ,u*). Figure 1.10.1 shows the first-order columnar representation of S = {v, w}, where y = 01, w = 011, AIM = 1/3, and /4(w) = 2/3. Ti x

1

X V w

o

Figure 1.10.1 The first-order columnar representation of (S, AT). In the columnar representation, shifting along a word corresponds to moving intervals one level upwards. This upward movement can be accomplished by a point mapping, namely, the mapping T1 that moves each point one level upwards. This mapping is not defined on the top level of each column, but it is a Lebesgue measure-preserving map from its domain to its range, since a level is mapped linearly onto the next level, an interval of the same width. (This is also shown in Figure I.10.1.) In summary, the columnar representation not only carries full information about the distribution A.(w), or alternatively, /4(w) = p,*(w), but shifting along a block can be represented as the transformation that moves points one level upwards, a transformation which is Lebesgue measure preserving on its domain. The first-order columnar representation is determined by S and the first-order distribution AI (modulo, of course, the fact that there are many ways to partition the unit interval into subintervals and stack them into columns of the correct sizes.) Conversely, the columnar representation determines S and Al% The information about the final process is only partial since the picture gives no information about how to get from the top of a column to the base of another column, in other words, it does not tell how the blocks are to be concatenated. The first-order columnar representation is, of course, closely related to the tower representation discussed in Section I.2.c.2, the difference being that now the emphasis is on the width distribution and the partially defined transformation T that moves points upwards. Information about how first-order blocks are concatenated to form second-order blocks is given by the columnar representation of the second-order distribution ,q(w?). Let r (w?) = p(w)/E(2) be the width of w?, where E(t2) is the expected length of w?, with respect to ,4, and let X(w?) = t(w?)r(w?) be the second-order linear mass. The second-order blocks w? are represented as columns of disjoint subintervals of the unit interval of width r (14) and height aw?), with the i-th subinterval labeled by the i-th term of the concatenation tv?. A key observation, which gives the name "cutting and stacking" to the whole procedure that this discussion is leading towards, is that

Figure 1. it is possible to cut the first-order columns into subcolumns and stack them in pairs so as to obtain the second-order columnar representation.10. Indeed. the total mass contributed by the top t(w) levels of all the columns that end with w is Et(w)r(w1w)= Ea(lV t2)) WI Pf(W1W) 2E(ti) =t(w* *(w) Thus half the mass of a first-order column goes to the top parts and half to the bottom parts of second-order columns. le un-v2)\ ( aw* * (w) = 2E( ii) tV2 2 _ 1 toor(w). In general. Note also the important property that the set of points where T2 is undefined has only half the measure of the set where T1 is defined.(vw) = . . If this cutting and stacking method is used then the following two properties hold. the 2m-th order representation can be produced by cutting the columns of the m-th order columnar representation into subcolumns and stacking them in pairs. . 16(ww) = 4/9.2 shows how the second-order columnar representation of S = {v._ C i D *********** H E ****** B VW A B C V D H E F G ***************4****+********** w A VV F G ****** *********** WV WW Figure 1. Indeed. the total mass in the second-order representation contributed by the first f(w) levels of all the columns that start with w is t(w) x--.10. 2 " (w)r(w).(wv) = 2/9. Second-order columns can be be obtained by cutting each first-order column into subcolumns.1. so.--(t2) 2_. then it will continue to be so in any second-order representation that is built from the first-order representation by cutting and stacking its columns. 2 which is exactly half the total mass of w in the first-order columnar representation. w}.1. One can now proceed to define higher order columnar representations.u. .. as claimed. if y is directly above x in some column of the firstorder representation. E aw yr ( ww2 ) . can be built by cutting and stacking the first-order representation shown in Figure 1. for the total width of the second-order columns is 1/E(t2). which is just half the total width 1/E(t 1 ) of the first-order columns.4(vv) = 1/9„u. .. Likewise.10.2 The second-order representation via cutting and stacking. The significance of the fact that second-order columns can be built from first-order columns by appropriate cutting and stacking is that this guarantees that the transformation T2 defined for the second-order picture by mapping points directly upwards one level extends the mapping T1 that was defined for the first-order picture by mapping points directly upwards one level. where .106 CHAPTER I. then stacking these in pairs.. BASIC CONCEPTS.

2. hence. in turn. if cutting and stacking is used to go from stage to stage. and the interval L i is called the j-th level of the column. The vector Ar(C) = (H(1). ordered. it is a way of building something new. Their common extension is a Lebesgue measure-preserving transformation on the entire interval. 107 (b) The set where T2m is defined has one-half the measure of the set where Tft. . Thus. unless stated otherwise.u* and uses it to construct a sequence of partial transformations of the unit interval each of which extends the previous ones. (a) The associated column transformation T2m extends the transformation Tm . which.10. is undefined. Ar(2). The labeling of levels by symbols from A provides a partition P = {Pa : a E AI of the unit interval. and the transformation T will preserve Lebesgue measure. column structures define distributions on blocks. The support of a column C is the union of its levels L. gadget. Thus. in essence. A general form of this fact. . L2. gives the desired stationary A-valued process. A column structure S is a nonempty collection of mutually disjoint. and for each j. The cutting and stacking language and theory will be rigorously developed in the following subsections. which has been commonly used in ergodic theory. The difference between the building of columnar representations and the general cutting and stacking method is really only a matter of approach and emphasis. A labeling of C is a mapping Ar from {1. column structure.10. f(C)} into the finite alphabet A. The goal is still to produce a Lebesgue measure-preserving transformation on the unit interval. in essence.5. In particular.SECTION 1. The cutting and stacking idea focuses directly on the geometric concept of cutting and stacking labeled columns. f(C) is a nonempty. A column C = (L1. . Lac)) of length.. then the set of transformations {TA will have a common extension T defined for almost all x in the unit interval. P)-process is the same as the stationary process it defined by A*. Two columns are said to be disjoint if their supports are disjoint. Theorem 1. disjoint.3. the interval Lt(c) is called the top. will be proved later. where Pa is the union of all the levels labeled a. Note that the members of a column structure are labeled columns. or height. collection of subintervals of the unit interval of equal positive width (C). Some applications of its use to construct examples will be given in Chapter 3. The columnar representation idea starts with the final block-structure measure . The interval L1 is called the base or bottom. .. rather than the noninformative name. CUTTING AND STACKING. which is of course given by X(C) = i(C)r (C). and cutting and stacking extends these to joint distributions on higher level blocks. labeled columns. Its measure X(C) is the Lebesgue measure X of the support of C. Ar(E(C))) is called the name of C. The (T. but this all happens in the background while the user focuses on the combinatorial properties needed at each stage to produce the desired final process. . Of course.. The suggestive name. will be used in this book.10.b The basic cutting and stacking vocabulary. using the desired goal (typically to make an example of a process with some sample path property) to guide how to cut up the columns of one stage and reassemble to create the next stage. N(j) = Ar(L i ) is called the name or label of L. see Corollary 1.10. it is a way of representing something already known. . I.

A subcolumn of L'2 . Li) is a column C' = (a) For each j.is the normalized column Lebesgue measure. ." which is taken as shorthand for "column structures with disjoint support. and column structures have disjoint columns.(c). Note that implicit in the definition is that a subcolumn always has the same height as the column. r (5) < 00. The width distribution 7 width r (C) r (C) "f(C) = EVES r (D) (45)• The measure A(S) of a column structure is the Lebesgue measure of its support. (An alternative definition of subcolumn in terms of the upward map is given in Exercise 1. . for CES (c ) < Ei(c)-r(c) Ex . It is called the mapping or upward mapping defined by the column structure S. is a subinterval of L i with the same label as L.Et(c). A(S) = Ex(c). If a column is pictured by drawing L i+ 1 directly above L1 then T maps points directly upwards one level. Each point below the base of S is mapped one level directly upwards by T. . ) c cyf( a E r(S) r (S) . The width r(S) of a column structure is the sum of the widths of its columns. Le (C )) defines a transformation T = T. Note that the base and top have the same . 1). . and S c S' means that S is a substructure of S'. for 1 < j < f(C). and is not defined on the top level. L2. L(i. each labeled column in S is a labeled column in S' with the same labeling. 2) . this means that expected column height with respect to the width distribution is finite. L'e ). L(i. called its upward map. The base or bottom of a column structure is the union of the bases of its columns and its top is the union of the tops of its columns. that is. (b) The distance from the left end point of L'i to the left end point of L 1 does not depend on j. The precise definition is that T maps Li in a linear and order-preserving manner onto L i+ i. terminology should be interpreted as statements about sets of labeled columns. T is not defined on the top of S and is a Lebesgue measure-preserving mapping from all but the top to all but the bottom of S.. C = (L1. t(C)): i E I} . BASIC CONCEPTS. A column C = (L 1 L2." where the support of a column structure is the union of the supports of its columns.. in other words. .) A (column) partition of a column C is a finite or countable collection {C(i) = (L(i. Note that T is one-to-one and its inverse maps points downwards. The transformation T = Ts defined by a column structure S is the union of the upward maps defined by its columns. Columns are cut into subcolumns by slicing them vertically. namely.108 CHAPTER I. Thus the union S US' of two column structures S and S' consists of the labeled columns of each. CES CES Note that X(S) < 1 since columns always consist of disjoint subintervals of the unit interval. An exception to this is the terminology "disjoint column structures. such that the following hold. In particular. r(S). This idea can be made precise in terms of subcolumns and column partitions as follows.

In summary. and extends their union. . where S c A*. The cutting idea extends to column structures. A column partitioning of a column structure S is a column structure S' with the same support as S such that each column of S' is a subcolumn of a column of S.c. j): 1 < j < t(C1)) and C2 = (L(2. A column structure S' is built by cutting and stacking from a column structure S if it is a stacking of a column partitioning of S. for each i. the second-order columnar representation of (S. defined by stacking C2 on top of C1 agrees with the upward map Tc„ wherever it is defined.. Partitioning corresponds to cutting. j): 1 < j < t(C2)) be disjoint labeled columns of the same width..10. and with the upward map TC 2 wherever it is defined. (ii) The range of Ts is the union of all except the bottom of S. This is also true of the upward mapping T = Ts defined by a column structure S. 109 of disjoint subcolumns of C. the upward map Tco. Li with the label of C1* C2 defined to be the concatenation vw of the label v of C 1 and the label w of C2. then Ts . The basic fact about stacking is that it extends upward maps. CUTTING AND STACKING. . for example. the key properties of the upward maps defined by column structures are summarized as follows. is now defined on the top of C1. (i) The domain of Ts is the union of all except the top of S. j) 1 < j < f(C1) L(2. In general.1. . since Te. if they have the same support and each column of S' is a stacking of columns of S. extends T8. The stacking of C2 on top of C 1 is denoted by C1* C2 and is the labeled column with levels L i defined by = L(1. is built by cutting each of the first-order columns into subcolumns and stacking these in pairs. thus cutting C into subcolumns according to a distribution 7r is the same as finding a column partition {C(i)} of C such that 7r(i) = t(C(i))1 . j — f(C1)) t(C1) < j f(Ci) -e(C2). which is also the same as the width of C2. and taking the collection of the resulting subcolumns. a column partitioning is formed by partitioning each column of S in some way. Thus.(C). it is extended by the upward map T = Ts . Longer stacks are defined inductively by Ci *C2* • • • Ck Ck+i = (C1 * C2 * • • • Ck) * Ck+1. Indeed.a*).c. the columns of S' are permitted to be stackings of variable numbers of subcolumns of S. A column structure S' is a stacking of a column structure S. (iv) If S' is built by cutting and stacking from S. The stacking idea is defined precisely as follows. namely. Note that the width of C1 * C2 is the same as the width of C1. Thus. Let C 1 = (L(1. for any S' built by cutting and stacking from S. (iii) The upward map Ts is a Lebesgue measure-preserving mapping from its domain to its range. the union of whose supports is the support of C.SECTION 1.

The goal of cutting and stacking operations is to construct a measure-preserving transformation on the unit interval. say L i . A precise formulation of this result will be presented in this section. respectively. Likewise.} and its Kolmogorov measure it are. Since such intervals generate the Borel sets on the unit interval. The process {X. (x) to be the name of level Li+n _1 of C. where Pa is the union of all the levels of all the columns of 8(1) that have the name a. The label structure defines the partition P = {Pa : a E A). Proof For almost every x E [0. BASIC CONCEPTS. (a) For each m > 1. Together with the partition defined by the names of levels. If the tops shrink to 0. along with the basic formula for estimating the joint distributions of the final process from the sequence of column structures. and defining X. Continuing in this manner a sequence {S(m)} of column structures is obtained for which each member is built by cutting and stacking from the preceding one. This is equivalent to picking a random x in the support of 8(1).10. This completes the proof of Theorem 1. P)-process {X. S(m + 1) is built by cutting and stacking from S(m). (Such an m exists eventually almost surely. and a condition for ergodicity. so that the transformation T defined by Tx = Ts(m)x is defined for almost all x and extends every Ts (m) . it is clearly the inverse of T. Cutting and stacking operations are then applied to 8(2) to produce a third structure 8(3).} is described by selecting a point x at random according to the uniform distribution on the unit interval and defining X. Theorem 1. of a column C E S(m) of height t(C) > j n — 1. then a common extension T of the successive upward maps is defined for almost all x in the unit interval and preserves Lebesgue measure. (b) X(S(1)) = 1 and the (Lebesgue) measure of the top of S(m) goes to 0 as m —> oc.3 (The complete-sequences theorem. since the top shrinks to 0 as m oc. called the process and Kolmogorov measure defined by the complete sequence {S(m)}. the common extension of the inverses of the Ts (m) is defined for almost all x. the transformation produces a stationary process. 1]. Furthermore. T is invertible and preserves Lebesgue measure.) If {8(m)} is complete then the collection {Ts (n )} has a common extension T defined for almost every x E [0. I. 1] there is an M = M (x) such that x lies in an interval below the top of S(M). (x) to be the index of the member of P to which 7 1 x belongs.110 CHAPTER I.10. A sequence {S(m)} of column structures is said to be complete if the following hold. If B is a subinterval of a level in a column of some S(m) which is not the base of that column.) The k-th order joint distribution of the process it defined by a complete sequence {S(m)} can be directly estimated by the relative frequency of occurrence of 4 in a . This is done by starting with a column structure 8(1) with support 1 and applying cutting and stacking operations to produce a new structure 8(2).10. The (T.c The final transformation and process. Further cutting and stacking preserves this relationship of being below the top and produces extensions of Ts(m). since the tops are shrinking to 0.3. it follows that T is measurable and preserves Lebesgue measure. then T -1 B = Ts -(m i ) B is an interval of the same length as B. El The transformation T is called the transformation defined by the complete sequence {S(m)}. then choosing m so that x belongs to the j-th level.

so that {S(m)} is a complete sequence. CES (1) since f(C)p i (aIC) just counts the number of times a appears in the name of C. To make this precise. The same argument applies to each S(m). f(C) — k 1 : x 1' 1 = ] where xf (c) is the name of C. then (1) = lim E CES pk(ailC)X(C). The quantity (t(C) — k 1)pk(a lic IC) counts the number of levels below the top k — 1 levels of C that are contained in Pal k . and hence IX(Paf n C) — p k (c41C)A(C)1 < 2(k — 1)r(C). establishing (1) for the case k = 1. . given by p. Let itt be the . that is.4 (Estimation of joint distributions. and hence X(Pa fl C) = [t(c)pi (aiC)j r(c) = pi(aIC)X(C).10. which is. Without loss of generality it can be assumed that 8(m +1) is built by cutting and stacking from S(m). The desired result (1) now follows since the sum of the widths t(C) over the columns of (m) was assumed to go oc. • k E Ak 1 (m) Proof In this proof C will denote either a column or its support. {8(m)}. pk(lC) is the empirical overlapping k-block distribution pk •ixf (c) ) defined by the name of C. and let Pa k = { x: x = .u(a) = X(Pa)..(apx(c). The relative frequency of occurrence of all' in a labeled column C is defined by pk(a inC) = Ili E [1. as n An application of the estimation formula (1) is connected to the earlier discussion on of the sequence of columnar representations defined by a block-structure measure where S c A*. Then. X(8(1)) = 1 and the measure of the top of S(m + 1) is half the measure of the top of S(m).) If t is the measure defined by the complete sequence.SECTION L10. let {. in turn. averaged over the Lebesgue measure of the column name.4.x. since the top has small measure. Let Pa be the union of all the levels of all the columns in 8(1) that have the name a. for x E [0. of course.10. to 0. the context will make clear which is intended. Furthermore. 1]. This estimate is exact for k = 1 for any m and for k > 1 is asymptotically correct as m oc. For k > 1. Theorem 1. since the top k — 1 levels of C have measure (k — 1)r(C). For each m let 8(m) denote the 2''-order columnar representation. This completes the proof of Theorem 1.} E A Z be the sequence defined by the relation Tnx E Px„ n E Z. 111 column name. the only error in using (1) to estimate Wa l') comes from the fact that pk (ainC) ignores the final k — 1 levels of the column C. CUTTING AND STACKING. a negligible effect when m is large for then most of the mass must be in long columns.

. 7 3)-process. measure defined by this sequence. by the way. thought of as the concatenation wi w2 • • • wn E AL. In other words.112 CHAPTER I. where tovi) w = ai . These ideas are developed in the following paragraphs. Proof Let pk(afiw7) be the empirical overlapping k-block distribution defined by the sequence w E S n . One way to make certain that the process defined by a complete sequence is ergodic is to make sure that the transformation defined by the sequence is ergodic relative to Lebesgue measure. .) = t (w 1) • Also let /72. then selecting the start position according to the uniform distribution on [1. there is a k > 1 such that S(m) and S(m k) are 6-independent. that these independence concepts do not depend on how the columns are labeled. where P = {Pa } is the partition defined by letting P a be the set of all pairs (wr) . it is enough to show that (2) = lim E WES pk (ak l it4)A. where L = Ei aw 1 ). which is sufficient for most purposes. For k = 1 the sum is constant in n. The proof for k > 1 is left as an exercise. since the start distribution for is obtained by selecting w 1 at random according to the distribution AT. Corollary 1. In the following discussion C denotes either a column or its support. Note.5 The tower construction and the standard representation produce the same measures.(C). Since this is clearly the same as pk(a inC) where C is the column with name w. A condition for this. BASIC CONCEPTS. awl)]. A(B I C) denotes the conditional measure X(B n c)/x(c) of the intersection of the set B with the support of the column C. p. that is.10. The sequence {S(m)} will be called the standard cutting and stacking representation of the measure it. two column structures S and S' are 6-independent if and only if the partition into columns defined by S and the partition into columns defined by S' are 6-independent. but only the column distributions. The column structures S and S' are 6-independent if (3) E E Ige n D) — X(C)À(D)I CES DES' 6. j) for which ai = a. Related concepts are discussed in the exercises. a E Ak . For example. is that later stage structures become almost independent of earlier structures. The sequence {S(m)} of column structures is asymptotically independent if for each m and each e > 0. = l. be the Kolmogorov measure of the (T. with the shift S on S" as base transformation and height function defined by f (wi . 0 Next the question of the ergodicity of the final process will be addressed. Next let T be the tower transformation defined by the base measure it*.

) A complete asymptotically independent sequence defines an ergodic process. the following holds.E 2 )x(c).F. so that. and hence T must be ergodic. hence the following stronger form of the complete sequences and ergodicity theorem is quite useful. and hence the collection of all the levels of all the columns in all the S(m) generates the a-algebra. of some C E 8(M) such that X(B n L) > 0 This implies that the entire column C is filled to within (1 — E 2 ) by B. P)-process is ergodic for any partition P. the widths of its intervals must also shrink to 0. The goal is to show that A(B) = 1. in particular. Thus. 1 C) DO' El which together with the condition. say level j. El If is often easier to make successive stages approximately independent than it is to force asymptotic independence. (4) À(B n c) > 1 . .6. Fix C and choose M so large that S(m) and S(M) are EX(C)-independent. the Markov inequality and the fact that E x (7. asymptotically independent sequence { 8(m)}. then there must be at least one level of D which is at least (1 — E) filled by B. which implies that the (T.(B1 C X(B I C) ?. In summary. that is.F. given E > 0 there is an m and a level L. so if D E . (5).10.. 113 Theorem 1. The argument used to prove (4) then shows that the entire column D must be (1 — c) filled by B.T be the set of all D E S(M) for which X(B1 C n D) > 1 — 6. DO* Thus summing over D and using (6) yields X(B) > 1 — 3E. D e .10.6 (Complete sequences and ergodicity. since S(M) is built by cutting and stacking from S(m). (5) E DES(M) igvi c) — x(D)1 E- Let . (6) X(B n D) > (1 — c)X(D).SECTION I10. Let B be a measurable set of positive measure such that T -1 B = B. Proof Let T be the Lebesgue measure-preserving transformation of the unit interval defined by the complete. CUTTING AND STACKING. imply that n v)x(vi C). implies E X(D) <2E. This completes the proof of Theorem 1. This shows that X(B) = 1. (1 — 6 2 ). It will be shown that T is ergodic. ( since T k (B n L) = B n TkL and T k L sweeps out the entire column as k ranges from —j + 1 to aW) — j. Since X(B I C) = E-D A. Towards this end note that since the top of S(m) shrinks to 0. The set C n D is a union of levels of D.

by using only a few simple techniques for going from one stage to the next.) If {S(m)} is a complete sequence such that for each m. x ( L i) < E 28. and the sweeping-out argument used to prove (4) shows that (8) X(B n > ( 1 — E2 E SI )x(c). A column structure S can be stacked independently on top of a labeled column C of the same width. S(m) and S(m 1) are Em -independent. of course. where Em 0. can produce a variety of counterexamples when applied to substructures. Thus. the same notation will be used for a column or its support. BASIC CONCEPTS. which. and m so large that Em is smaller than both E 2 and (ju(B)/2) 2 so that there is a set of levels of columns in S(m) such that (7) holds. (7) Eg where BC is the complement of B. The next subsection will focus on a simple form of cutting and stacking. taking 3 = . in spite of its simplicity. There are practical limitations.7 (Complete sequences and ergodicity: strong form. . In this discussion. The only real modification that needs to be made in the preceding proof is to note that if {Li} is a disjoint collection of column levels such that X (BC n (UL)) = BciL i . as earlier. in the complexity of the description needed to go from one stage to the next. This gives a new column structure denoted by C * S and defined as follows.10. A bewildering array of examples have been constructed. The freedom in building ergodic processes via cutting and stacking lies in the arbitrary nature of the cutting and stacking rules. then {S(m)} defines an ergodic process. I. Let S' be the set of all columns C for which some level has this property.10. The user is free to vary which columns are to be cut and in what order they are to be stacked. Proof Assume T -1 B = B and X(B) > O. known as independent cutting and stacking.u(B)/2. This proves Theorem 1. A few of these constructions will be described in later chapters of this book. c .114 CHAPTER I. where Di is the ith column of S. then by the Markov inequality there is a subcollection with total measure at least 3 for which x(B n L i ) > (1 — E 2 )À(L i ). it follows that and there must be at least one C E S(m) for which (8) holds and for which DES (m+1) E Ix(Dic) — x(D)i < E. (i) Partition C into subcolumns {Ci } so that r (C1) = (Di ). with the context making clear which meaning is intended.7. The support of S' has measure at least 3. provided that they have disjoint supports. The argument used in the preceding theorem then gives À(B) > 1 — 3E.10. as well as how substructures are to become well-distributed in later substructures. however. Independent cutting and stacking is the geometric version of the product measure idea. Theorem 1. as m OC.d Independent cutting and stacking.

In other words. that a copy has the same width distribution as the original. so that r(S:)= . It is called the independent stacking of S onto C. The new column structure C * S consists of all the columns C. Note. by the way. width. independently onto Ci .9. stack S. Ci *Di .8 Stacking a column structure independently onto a column. and name. (See Figure 1. (See Figure 1.10.) C* S Figure 1. obtaining Ci * S. * V. . The column structure S * S' is the union of the column structures Ci * C2 S' Figure 1.) (i) Cut S' into copies {S. The independent cutting and stacking of S' onto S is denoted by S * S' and is defined for S = {Ci } as follows.(Ci ).9 Stacking a structure independently onto a structure.. where two column structures S and S' are said to be isomorphic if there is a one-to-one correspondence between columns such that corresponding columns have the same height.10. CUTTING AND STACKING.10. An alternative description of the columns of S * S' may given as follows.8. and letting Si be the column structure that consists of the i-th subcolumn of each column of S.SECTION 1. by partitioning each column of S according to the distribution 7r.10. a scaling of one structure is isomorphic to the other. A column structure S is said to be a copy of size a of a column structure S' if there is a one-to-one correspondence between columns such that corresponding columns have the same height and the same labeling. Let S' and S be disjoint column structures with the same width. To say precisely what it means to independently cut and stack one column structure on top of another column structure the concept of copy is useful. 115 (ii) Stack Di on top of Ci to obtain the new column. and the ratio of the width of a column of S to the width of its corresponding column in S' is a.1. A column structure S can be cut into copies {S i : i E n according to a distribution on /.10. (ii) For each i.
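In the same bookkeeping model, the independent cutting and stacking of one structure onto another reduces to a double loop over columns: each column of S receives a copy of S' of proportional size and the names concatenate. A minimal, self-contained sketch:

```python
def independent_cut_and_stack(S, S_prime):
    """S * S_prime: each column of S receives a copy of S_prime of
    proportional size, cut and stacked independently on top of it.  With the
    widths of each structure normalized to total width 1, the width of the
    column C * C' is the product of the widths of C and C'."""
    total = sum(w for w, _ in S_prime)
    return [(w * w2 / total, name + name2)
            for w, name in S
            for w2, name2 in S_prime]
```

The width formula built into the return statement is the "width distributions multiply" property discussed in the next part of the text.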

.10 Two-fold independent cutting and stacking. m > 1..J*Cii . The independent cutting and stacking construction contains more than just this concatenation information. . Two-fold independent cutting and stacking is indicated in Figure 1. The M-fold independent cutting and stacking of a column structure S is defined by cutting S into M copies {Sm : m = 1. M} of itself of equal width and successively independently cutting and stacking them to obtain Si *52 * • • ' * SM where the latter is defined inductively by Si *. that is. A column structure can be cut into copies of itself and these copies stacked to form a new column structure. however. BASIC CONCEPTS. and S(m + 1) = S(m) * S (m). The sequence {S(m)} is called the sequence built (or generated) from S by repeated independent cutting and stacking. -r(Ci The column structure S *S' consists of all the C. S into subcolumns {CO. This formula expresses the probabilistic meaning of independent cutting and stacking: cut and stack so that width distributions multiply. namely. Note that S(m) is isomorphic to the 2m-fold independent . In particular. . 2. that they are independently concatenated according to the width distribution of S.(Ci E (ii) Cut each column Ci '). is that width distributions multiply. • • *Sm = (S1* • • • *Sm_i)*Sm. and. for it carries with it the information about the distribution of the concatenations. The key property of independent cutting and stacking. The columns of Si * • • • * Sm have names that are M-fold concatenations of column names of the initial column structure S. the number of columns of Si * • • • * Sm is the M-th power of the number of columns of S. Successive applications of repeated independent cutting and stacking. Si S2 SI * S2 Figure 1. such that r(Cii ). Note that (Si * • • • * S) = t (S)/M. E CHAPTER I. in the finite column case. starting with a column structure S produces the sequence {S(m)}.10. the number of columns of S * S' is the product of the number of columns of S and the number of columns of S'.116 (i) Cut each C each Ci E S. r (Cii *C) r (S * S') r (Ci ) r (S) r(C) since r (S*S') = r(S). where 5 (1) = S.10. in the case when S has only finitely many columns.10. i ) I r (S') for S' into subcolumns {C» } such that r(C) = (C)T.
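Repeated independent cutting and stacking is then a one-line recursion on top of the preceding sketch. This is bookkeeping only: in the actual construction S(m) is first cut into two disjoint copies of itself before stacking, but the width distribution of S(m + 1) is the same either way, and the width distribution is all that is tracked here.

```python
def repeated_ics(S, stages):
    """The sequence S(1) = S, S(m+1) = S(m) * S(m) built from S by repeated
    independent cutting and stacking; returns [S(1), ..., S(stages)].
    Uses independent_cut_and_stack from the preceding sketch."""
    seq = [S]
    for _ in range(stages - 1):
        seq.append(independent_cut_and_stack(seq[-1], seq[-1]))
    return seq

# Width distributions multiply: the normalized width of a column of S(m)
# whose name is the concatenation w(1)...w(k), with k = 2**(m-1), is the
# product of the widths of the columns w(i) of the initial structure S.
```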

if X(S) = 1 then the process built from S by repeated independent cutting and stacking is ergodic.11. Theorem 1. {S(m): m > mol is built from S(rno) by repeated independent cutting and stacking. by assumption. since r(S)E(t(S)) = X(S) = 1. k]:C = p(CICO I = so that. with respect to the width distribution. in particular. • • Ck E SOTO. and hence the proof shows that {S(m)} is asymptotically independent. the countable case is left to Exercise 2. This completes the proof of Theorem 1. More interesting examples of cutting and stacking constructions are given in Chapter 3.10. CUTTING AND STACKING. and stacking these in order of increasing i. where E(t(S)) is the expected height of the columns both with probability 1. The notation D = C = C1 C2 . oc. for they are the precisely the processes built by repeated independent cutting and stacking from the columnar representation of their firstorder block-structure distributions. of course. Note that for each mo. as k of S. define E [1. Ergodicity is guaranteed by the following theorem. By the law of large numbers. In the case when X(S) = 1.11 (Repeated independent cutting and stacking. will mean the column D was formed by taking. kp(CICf) is the total number of occurrences of C in the sequence C..10. a subcolumn of Ci E S. Division of X(C n D) = kp(C1Cbt(C)r(D) by the product of X(C) = i(C)r(C) and À(D) = i(D)t(D) yields x(c n D) X(C)À(D) kp(CIC I ) i(D)r (C) 1. so that its columns are just concatenations of 2' subcolumns of the columns of S. The following examples indicate how some standard processes can be built by cutting and stacking. The simplest of these. This is. For k = 2'.SECTION I. and C E S. then given > 0. just another way to define the regenerative processes. with probability 1. an example of a process with an arbitrary rate of . (9) and (10) 1 1 — i(D) = — k k p(CIC) —> E £(0) i=1 E(t(S)). for each i.10. D = C1C2. • Ck. 117 cutting and stacking of S. there is an m such that S and S(m) are E-independent. as k —> cc.) If {S(m)} is built from S by repeated independent cutting and stacking. selected independently according to the width distribution of S. the (T. Exercise 8. Proof The proof for the case when S has only a finite number of columns will be given here. P)-process defined by the sequence generated from S by repeated independent cutting and stacking is called the process built from S by repeated independent cutting and stacking. so that if X(S) = 1 then the process defined by {S(m)} is ergodic. In particular. This establishes that indeed S and S(m) are eventually c-independent. by (9) and (10).
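Written out in display form, and assuming the usual normalization λ(S) = 1, the computation at the heart of this proof reads as follows: for a column C of S and a column D = C₁C₂⋯C_k of S(m), with k = 2^{m−1},

```latex
\[
\lambda(C\cap D)=k\,p(C\mid C_1^k)\,\ell(C)\,\tau(D),\qquad
\lambda(C)=\ell(C)\,\tau(C),\qquad
\lambda(D)=\ell(D)\,\tau(D),
\]
\[
\frac{\lambda(C\cap D)}{\lambda(C)\,\lambda(D)}
  =\frac{k\,p(C\mid C_1^k)}{\ell(D)\,\tau(C)}
  =\frac{p(C\mid C_1^k)}{\tau(C)}\cdot\frac{k}{\ell(D)}
  \longrightarrow \frac{1/\tau(\mathcal S)}{E(\ell(\mathcal S))}=1 ,
\]
```

as k → ∞, since by the law of large numbers p(C | C₁^k) → τ(C)/τ(S) and ℓ(D)/k → E(ℓ(S)), both with probability 1, while τ(S)E(ℓ(S)) = λ(S) = 1.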

d. Define the set of all ak Prob(X i` = alic1X0 = a). one can first represent the ergodic Markov chain {(Xn .10. repeated independent cutting and stacking produces the A-valued i. convergence for frequencies. the binary i. labeled as '0' and '1'.) Let S be the column structure that consists of two columns each of height 1 and width 1/2.i. = f (Y.) If the initial structure S has only a finite number of columns. all' E S. (These also known as finite-state processes. processes.10.i. then {Y. BASIC CONCEPTS. is a function of Y. that if the original structure has two columns with heights differing by 1.Ar(L).} is regenerative.) The easiest way to construct an ergodic Markov chain {X„} via cutting and stacking is to think of it as the regenerative process defined by a recurrent state. Example 1.d. For a discussion of this and other early work see [12]. process in which O's and l's are equally likely. Example 1. 1) if L is the top of its column. To see why this is so.10. Remark 1. Since dropping the second coordinates of each new label produces the old labels." so as to guarantee ergodicity for the final process. and to (. say a. It is easy to see that the process {KJ built from this new column structure by repeated independent cutting and stacking is Markov of order no more than the maximum length of its columns. in particular. By starting with a partition of the unit interval into disjoint intervals labeled by the finite set A.10.T. It shows how repeated independent cutting and stacking on separate substructures can be used to make a process "look like another process on part of the space. then {X.10. it follows that X. Example 1.15 (Finite initial structures. that is. and hence {X. assume that the labeling set A does not contain the symbols 0 and 1 and relabel the column levels by changing the label A (L) of level L to (A (L). i } is mixing Markov. Example 1. In fact. see III." so as to approximate the desired property. They have since been used to construct . then drop the first coordinates of each label.d. The process built from the first-order columnar representation of S by repeated independent cutting and stacking is the same as the Markov process {X„}. and let pt* be the product measure on S defined by p.1 < i < k.} built from it by repeated independent cutting and stacking is a hidden Markov chain.} is a function X. and ak = a.) If {KJ is ergodic. hence can be represented as the process built by repeated independent cutting and stacking.118 CHAPTER I. process with the probability of a equal to the length of the interval assigned to a.16 The cutting and stacking ideas originated with von Neumann and Kakutani. The process built from S by repeated independent cutting and stacking is just the coin-tossing process. the process {X.! .12 (I. Let S be i for which ai a.) A hidden Markov process {X. 0) if L is not the top of its column.i.14 (Hidden Markov chains. Yn )} as a process built by repeated independent cutting and stacking.c. Note. is recommended as a starting point for the reader unfamiliar with cutting and stacking constructions.13 (Markov processes.) of a Markov chain. and how to "mix one part slowly into another.} is a function of a mixing Markov chain.

and each E > 0. and an integer 111. Let pt" be a block-structure measure on Soe. Show that {S(m)} is asymptotically independent if and only if it* is totally ergodic. 7. A column C E S is (1 —6)-well-distributed in S' if EDES' lx(vi C)—X(7))1 < E. 5.) 6. (Hint: the tower with base transformation Sn defines the same process Ti. disjoint from 1(m). then the process built from S by repeated independent cutting and stacking is Markov of some order. The latter also includes several of the results presented here. (b) Show that the asymptotic independence property is equivalent to the asymptotic well-distribution property.i. Prove Theorem 1. some of which are mentioned [73]. and let {S(m)} be the standard cutting and stacking representation of the stationary A-valued process defined by (S. .C. 2. then a subcolumn has the form (D. and X(1 (m)) -->. E) such that C is (1 —6)-well-distributed in S(n) for n > M. Show that if T is the upward map defined by a column C..e Exercises 1. the conditional probability X(DI C) is the fraction of C that was cut into slices to put into D. A sequence {S(m)} is asymptotically well-distributed if for each m. Suppose also that {S(m} is complete.1 as m —> oc. Then apply the k = 1 argument. T D.10.10. . each C E S(m). (a) Show that if the top of each column is labeled '1'.T 2 D. for each m. I. re(c)-1 D). most of which have long been part of the folklore of the subject. where S C A*. there is an M = M(m.À(D)i except for a set of D E S I of total measure at most Nfj. (c) Show that if S and S' are 6-independent. Exercise 4c implies that A* is m-ergodic. . A sufficient condition for a cutting and stacking construction to produce a stationary coding of an i. then S' and S are 26-independent.n such that T(m + 1) is built by An -fold independent cutting and stacking of 1(m) U R. (Hint: for S = S(m). process is given in [75]. except for a set of D E SI of total measure at most E. Suppose S has measure 1 and only finitely many columns...11 for the case when S has countably many columns. 3. with all other levels labeled '0'. (d) Show that if EcEs IX(C1 D) — X(D)I < E. then E CES IX(CI D) . Show that for C E S and D E S'. (a) Show that {S(m)} is asymptotically independent if and only if {r(m)} is asymptotically independent. p*). (a) Suppose S' is built by cutting and stacking from S. Show that formula (2) holds for k > 1. CUTTING AND STACKING. (b) Suppose that for each m there exists R(m) C S(m).) 4.(m).d.SECTION 1. where D is a subset of the base B. 119 numerous counterexamples. Show that if {Mm } increases fast enough then the process defined by {S(m)} is ergodic. Suppose 1(m) c S(m)..10.

(b) Show that if two columns of S have heights differing by 1, then the process built from S by repeated independent cutting and stacking is a mixing finite-state process.
(c) Verify that the process constructed in Example I.10.13 is indeed the same as the Markov process {X_n}.
8. Show that a process is regenerative if and only if it is the process built from a column structure S of measure 1 by repeated independent cutting and stacking.

121 .7. see Exercise 1.c. code sequences which compress to the entropy in the limit. 1}*. When Cn is understood. for almost every sample path from any ergodic process. It will be shown that there are universal codes. Section 11.Chapter II Entropy-related properties. r (4) will be used instead of r (x rii ICn ). in particular. where {0. hence it does not preclude the possibility that there might be code sequences that beat entropy infinitely often on a set of positive measure.). if {C n } is a Shannon-code sequence for A. 1}* is the set of finite-length binary sequences. almost surely.. . then L(4)1 n may fail to converge on a set of positive y-measure.12 and Theorem 1. more simply. almost surely. knowledge which may not be available in practice. n n—>co for every ergodic process . universal if lim sup L(4) < h(p). where h(p) denotes the entropy of A. The second issue is the almost-sure question.1 Entropy and coding. where each Cn is an n-code. As noted in Section I.7. however. The Shannon code construction provides a prefix-code sequence {C n } for which L(xrii) = r—log.7q . The code length function Le Icn ) of the code is the function that assigns to each . at least asymptotically in source word length n. that is. but its construction depends on knowledge of the process. In general. The sequence of Shannon codes compresses to entropy in the limit. the length i(Cn (4)) of the code word Cn (4). so that. An n-code is a mapping Cn : A n F-* 10. entropy provides an asymptotic lower bound on expected per-symbol code length for prefix-code sequences and faithful-code sequences. by the entropy theorem. The first issue is the universality problem.15.u(4)1.7. Two issues left open by these results will be addressed in this section. is a prefix code it is called a prefix-code sequence.u. A code sequence {C n } is said to be universally asymptotically optimal or. a lower bound which is "almost" tight. while if each C. The entropy lower bound is an expected-value result. 2.. If each Cn is one-to-one. see Theorem 1. the code sequence is called a faithful-code sequence. L(4)In h. and y is some other ergodic process. It will be shown that this cannot happen. A code sequence is a sequence {Cn : n = 1.
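For a concrete instance of these definitions, the Shannon code assigns to each n-block a codeword of length ⌈−log₂ μ(x₁ⁿ)⌉; these lengths satisfy the Kraft inequality, so a prefix n-code with exactly these lengths exists, and its expected length is within one bit of the entropy of the n-block distribution. A minimal sketch (mu_n below is a hypothetical table of n-block probabilities, not an object from the text):

```python
import math

def shannon_code_lengths(mu_n):
    """Code lengths L(x) = ceil(-log2 mu_n(x)) for a probability mass
    function mu_n on A^n, given as a dict from n-blocks to probabilities.
    The Kraft sum over these lengths is at most 1, so a prefix code with
    these lengths exists."""
    lengths = {x: math.ceil(-math.log2(p)) for x, p in mu_n.items() if p > 0}
    assert sum(2.0 ** -L for L in lengths.values()) <= 1.0 + 1e-12  # Kraft inequality
    return lengths
```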

d. Of interest. A direct coding argument. The code C. for any ergodic process . Theorem 11. the direct coding construction extends to the case of semifaithful codes. The code is a two-part code. will follow from the entropy theorem.) There is a prefix-code sequence {C„} such that limsup n . whose existence depends only on subadditivity of entropy as expected value. the k-type. also. they show clearly that the basic existence and nonexistence theorems for codes are.) If {C n } is a faithful-code sequence and it is an ergodic measure with entropy h. relative to some enumeration of possible k-types.2 (Too-good codes do not exist.6. while the existence of decay-rate entropy is given by the much deeper entropy theorem. will be given in Section 11.122 CHAPTER II. it has fixed length which depends only on the number of type classes. Theorem 11. Since process entropy H is equal to the decay-rate h given by the entropy theorem.2. utilizes the empirical distribution of k-blocks. n A prefix-code sequence {C} will be constructed such that for any ergodic it. relative to some enumeration of the k-type class. this will show that good codes exist. its length depends on the size of the type class. In addition. the codeword Cn (4) is a concatenation of two binary blocks. together with some results about entropy for Markov chains. thus further sharpening the connection between entropy and coding. where H = 114 denotes the process entropy of pt. In other words. lim sup . II. that is. 4 and yril. will be discussed in Section 11. While no simpler than the earlier proof.1.2.d. based on the Lempel-Ziv coding algorithm.1. A second proof of the existence theorem. in which case the first blocks of Cn (fi') and Cn (Yil) . as shown in [49].1 (Universal codes exist. ENTROPY-RELATED PROPERTIES. The first block gives the index of the k-type of the sequence 4. L(x 11 )/n > h. for which a controlled amount of distortion is allowed. based on entropy estimation ideas of Ornstein and Weiss. Distinct words.u. Theorem 11. a.1. The proof of the existence theorem to be given actually shows the existence of codes that achieve at least the process entropy.C(4)/n < H.3. The second block gives the index of the particular sequence x'iz in its k-type class. almost surely. in essence. together with a surprisingly simple lower bound on prefix-code word length.s. which is based on a packing idea and is similar in spirit to some later proofs. that is. it is possible to (universally) compress to entropy in the limit. These two results also provide an alternative proof of the entropy theorem. Theorem 11. The nonexistence theorem. A third proof of the existence theorem. almost surely.1. together equivalent to the entropy theorem. but for no ergodic process is it possible to beat entropy infinitely often on a set of positive measure.. either have different k-types. then lim inf.a Universal codes exist. A counting argument. with code performance established by using the type counting results discussed in Section I. almost surely for any ergodic process. will be used to establish the universal code existence theorem.C(fii)1 n < h(a). is the fact that both the existence and nonexistence theorems can be established using only the existence of process entropy.1. will then be used to show that it is not possible to beat process entropy in the limit on a set of positive measure. for suitable choice of k as a function of n.1.

14)... 1f1 n The circular k-type is just the usual k-type of the sequence . A slight modification of the definition of k-type. hence the second blocks of Cn (4) and Cn (y. Pk-1(4 -1 14) =---" which. 123 will be different.1.SECTION 11. the index of the particular sequence 4 relative to some particular enumeration of T k (4).6. n) of circular k-types that can be produced by sequences of length n. n) + log1 7 4(4)1 +2.15. let ril+k-1 = xx k i -1 . will be used as it simplifies the final entropy argument.) Total code length satisfies log /S1 (k. in turn. yet has negligible effect on code performance. so asymptotic code performance depends only on the asymptotic behavior of the cardinality of the k-type class of x. so that the bounds. If k does not grow too rapidly with n. that is.f.' . An entropy argument shows that the log of this class size cannot be asymptotically larger than nH . ITi(4)1 < (n — 1)2 (11-1)11. 4 is extended periodically for k — 1 more terms.`) will differ. lic 14) — Pk(a • 1 ti E [1. < iii. then transmitting the index of 4 in its circular k-type class. Since the first block has fixed length.(1/2) log IAI n. (Variable length is needed for the second part. or they have the same k-types but different indices in their common k-type class. ak (3) ilk-Le. aik E A k .4+k-1 . Given a sequence 4 and an integer k < n. ' +1 . the code Cn is a prefix code."7 on the number /V (k. and on the cardinalityj7k (x)i of the set of all n-sequences that have the same circular k-type as 4. that is. Theorem 1. Cn (4) = blink. then the log of the number of possible k-types is negligible relative to n. The gain in using the circular k-type is compatibility in k. say.7irk-1 . where Hk_1. .6. called the circular k-type.14 and Theorem 1. k -. br relative to some fixed enumeration of the set of possible circular k-types. rd: 5. where is a fixed length binary sequence specifying the index of the circular k-type of x. and bm t +1 is a variable-length binary sequence specifying. n) < (n ± olAl k . that is.c14). that is..:+ k -1 = ak 11 .IX II ) on il k defined by the relative frequency of occurrence of each k-block in the sequence . on the number of k-types and on the size of a k-type class yield the following bounds (1) (2) (k. x 7 denotes the entropy of the (k — 1)-order Markov chain defined by Pk (. because the size of the type class depends on the type. i <k — 1. implies the entropy inequality EFkcal. The circular k-type is the measure P k = ilk(. ENTROPY AND CODING. Now suppose k = k(n) is the greatest integer in (1/2) log lAi n and let Cn be the two-part code defined by first transmitting the index of the circular k-type of x. the concatenation of xtiz with x k i -1 . where the two extra bits are needed in case the logarithms are not integers.
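In code, the circular k-type of x₁ⁿ is just the k-block distribution of the sequence extended periodically by its first k − 1 symbols, so that each of the n positions starts a block. A minimal sketch:

```python
from collections import Counter

def circular_k_type(x, k):
    """Empirical circular k-type of x (a string or tuple over a finite
    alphabet): the distribution of the k-blocks starting at each of the n
    positions of x extended periodically by its first k-1 symbols."""
    n = len(x)
    ext = x + x[:k - 1]
    counts = Counter(ext[i:i + k] for i in range(n))
    return {block: count / n for block, count in counts.items()}
```

The two-part code then transmits the index of this type among all possible circular k-types, followed by the index of x₁ⁿ within its type class.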

x7 < H.x? < H + 2E. The bound (2) on the size of a type class implies that at most 1 + (n — log(n — 1) 4(4). HK-1 = H (X IX i K -1 ) denotes the process entropy of the Markov chain of order K — 1 defined by the conditional probabilities . is based on an explicit code construction which is closely connected to the packing ideas used in the proof of the entropy theorem. which was developed in [49]. The proof that good codes exist is now finished. ENTROPY-RELATED PROPERTIES. Since E is arbitrary. For all sufficiently large n. of course.2 for the case when each G is a prefix code. together with the assumption that k < (1/2) log iAi n. [3]. a bound due to Barron.1.l. A faithful-code sequence can be converted to a prefix-code sequence with no change in asymptotic performance by using the Elias header technique. establishing the desired bound (4).u(ar)/.2 will be given. is valid for any process. . E) such that almost every x there is an particular. choose K such that HK_i _< H ± E. in this discussion.7.u(4). for as noted earlier. for es•• <H + 2E. Thus it is enough to prove Theorem 11. Since K is fixed. the type class. implies that it takes at most 1 + N/Tz log(n + 1) bits to specify a given circular k(n)-type. II. This is. almost surely. almost surely. relative to some fixed enumeration of circular k(n)-types. hm sup 'C(n xilz) lim sup iik(n) .b Too-good codes do not exist. negligible relative to n. due to Barron. described in Section I. (4) where H is the process entropy of p. relative to a given enumeration of bits are needed to specify a given member of 7 Thus.u(ar-1 ).d of Chapter 1. k(n) will exceed K. Fix an ergodic measure bc with process entropy H. this shows that lim sun n ilk(n).x7. The first proof uses a combination of the entropy theorem and a simple lower bound on the pointwise behavior of prefix codes. and does not make use of the entropy theorem. where. and hence HK _ Lx? integer N = N(x.1. Once this happens the entropy inequality (3) combines with the preceding inequality to yield ilk(n)-1. Two quite different proofs of Theorem 11. The bound (1) on the number of circular k-types. stationary or not. the ergodic theorem implies that PK (ar lx7) oo. In almost surely.x 1' < H.124 CHAPTER II. The following lemma. as n . n > N. so that to show the existence of good codes it is enough to show that for any ergodic measure lim sup fik(n)-1. Given E > 0.1. The second proof. process entropy El H and decay-rate h are the same for ergodic processes. almost surely.

Using the relation L(x7) = log 2'.an.u(fil ) . eventually a. Let i be a fixed ergodic process with process entropy H. yields lim inf n—000 n > lim inf . ENTROPY AND CODING. But -(1/n) log z(x) converges El almost surely to h.6)} then i(nn Up. a faithful-code sequence with this property can be replaced by a prefix-code sequence with the same property.) Thus there are positive numbers. such that if Bi = (x: ICJ) < i(H .l. and after division by n. since adding headers to specify length has no effect on the asymptotics.)/n < H.ei2E B„ since E „ 2—c(x7) < 1. with respect to A. and assume {C. by the Kraft inequality for prefix codes.c Nonexistence: second proof. with high probability. c and y.(4) which yields 2-c(4 ) 2 -an I.C(4) +log .u. This completes the proof of Theorem 11.an).) Let (CO be a prefix-code sequence and let bt be a Borel probability measure on A. the entropy of . 125 Lemma 11.2. x7En„ .SECTION II. a proof which uses the existence of good codes but makes no use of the entropy theorem. Proof For each n define B. such that those words that can be coded too well cover at least a fixed fraction of sample path length.t(Bn) = E kt(x . thereby producing a code whose expected length per symbol is less than entropy by a fixed amount.1. on a set of positive measure.1. for any ergodic measure if/.1. (As noted earlier. Barron's inequality (5) with an = 2 loglAI n.C(x7) + log . this can be rewritten as Bn = {x7: iu. will now be given.0 <2 -an E 2-4(4) < 2 -"n . then If {an } is a sequence of positive numbers such that En (5) . ={x: .u(4) ?_ .log p.1. Theorem 11. II. The basic idea is that if a faithful-code sequence indeed compresses more than entropy infinitely often on a set of positive measure. I. while the complement can be compressed almost to entropy. . long enough sample paths can be partitioned into disjoint words. then.3 (The almost-sure code-length bound. A second proof of the nonexistence of too-good codes.„ Bj ) > y. s.(4) eventually almost surely. 2-an < oo.2. } is a prefix-code sequence such that lim infi £(41C. The lemma follows from the Borel-Cantelli lemma.

and using a single-letter code on the remaining symbols. if the following three conditions hold. cover at least a y-fraction of total length. contradicting Shannon's lower bound. or belongs to Dk. for a suitable choice of E. a concatentation 4 = w(1)w(2) • • • w(t) is said to be a (y. plus headers to tell which code is being used.. Let G(n) be the set of all sequences 4 that have a (y.C*(x`Oln will be uniformly bounded above by a constant on the complement of G(n). let M > 2k IS be a positive integer. and M > 2k/S.C*(4) <n(H — E'). eventually almost surely. ' > 0....4 For n > M. It will be shown later that (6) x E G(n). Here it will be shown how a prefix n-code can be constructed by taking advantage of the structure of sequences in G(n). If 8 is small enough and such sample paths are long enough. independent of n. for some large enough M. (b) The total length of the w(i) that belong to A is at most 33n. The existence theorem. (a) Each w(i) belongs to A. Furthermore. with high probability.. given that (6) holds. . ENTROPY-RELATED PROPERTIES. For n > M. a code that will eventually beat entropy for a suitable choice of 3 and M. Sample paths with this property can be encoded by sequentially encoding the words. Let 8 < 1/2 be a positive number to be specified later. provides a k > 1/3 and a prefix code ek £(. using the good code Ck on those words that belong to Dk. The lemma is enough to contradict Shannon's lower bound on expected code length. A version of the packing lemma will be used to show that long enough sample paths can.. (c) The total length of the w(i) that belong to BmU Bm +1U . .. xr. this produces a single code with expected length smaller than expected entropy. with length function E(x) = Theorem 11. there is a prefix n-code C n * whose length function E*(4) satisfies (7) . is at least yn.1. for all sufficiently large n. using the too-good code Ci on those words that belong to Bi for some i > M. such that the set Dk = E(X lic ) < k(H 8)) . or belongs to B1 for some j > M. 3)-good representation. . E G(n). 0-good representation of xt. while the words that are neither in Dk nor in BM U .x i` I e' -k). cover at most a 33-fraction of total length. The coding result is stated as the following lemma. be partitioned into disjoint words. has measure at least 1 — S.126 CHAPTER II. To make precise the desired sample path partition concept.1.1. for the uniform boundedness of E*(4)In on the complement of G(n) combines with the properties (6) and (7) to yield the expected value bound E(C*(X7)) < n(H — 672). 8 > 0.. such that the words that belong to B m U . Lemma 11.

(0) (w(i)) of the w(i) that belong to Up. then coding the w(i) in order of increasing i. Theorem I. on the words of length 1. For 4 g G(n). it takes at most L(H — E) + (N — L)(H 8) to encode all those w(i) that belong to either Dk or to Ui > m Bi .. The code C(x) is defined for sequences x E G(n) by first determining a (y. using the too-good code on those words that have length more than M. in turn. . Likewise. then a (y.4.SECTION 11. thereby contradicting Shannon's D lower bound on expected code length.C*(X7)) < H(1. say. 3)-good representation . ENTROPY AND CODING. This is so because.. a prefix code on the natural numbers for which the length of e(j) is log j o(log j). 8)good representation 4 = w(1)w(2) • • • w(t). Note that C * (4)/n is uniformly bounded on the complement of G(n). 127 for all sufficiently large n. Ck. for x g G(n). But L > yn. F : A 1--* Bd. Appropriate headers are also inserted so the decoder can tell which code is being used.t.7. ) is an Elias code. is used. using the good code Ck on those words that have length k. the per-symbol code C:(4) = OF(xi)F(x2) • • • F(xn). the code word Cf (w(i)) is no longer than j (H — c). symbols are encoded separately using the fixed code F. where d > log IA I. since the initial header specifies whether E G(n) or not...1. so that if L is total length of those w(i) that belong to BM U BA. and using some fixed code.} is being applied to a given word. since the total length of such w(i) is at most N — L. where OF ((w(i)) (8) v(i) = 10 . Here e (. Proof of Lemma 11. . ignoring the headers needed to specify which code is being used as well as the length of the words that belong to Uj >m BJ . for suitable choice of 8 and M > 2kI8.1 . since (1/n)H(A n ) -+ H. the principal contribution to code length comes from the encodings Cf (u. C m .. yields E(. A precise definition of Cn * is given as follows. The code Cn * is clearly a prefix code. for all sufficiently large n.11(i). a code word Elk (w(i)) has length at most k(h + 6).)cr i' = w(l)w(2) • • • w(t) is determined and the code is defined by C:(4) = lv(1)v(2) • • • v(t).1 4. If E G(n). by the definition of Dk.). so that. as shown in the following paragraphs.m B.. that is. But this. and the encodings k(W(i)) of the w(i) that belong to Dk. The goal is to show that on sequences in G(n) the code Cn * compresses more than entropy by a fixed amount. by property (c) in the definition of G(n). By the definition of the Bi .1. the contribution . ' (w(i)) iie(j)Ci(W(i)) if w(i) has length 1 if w(i) E Dk if w(i) E Bi for some j > M. and. Cm+i . then it requires at most L(H — E) bits to encode all such w(i). so that.. while the other headers specify which of one of the prefix-code collection {F.

Since d and K are constants. provided only that M > 2k18. since the total number of such words is at most 38n. for suitable choice of 8 . eventually almost surely. provided k > 118. 4 E G(n). M > 2k/3 and log M < 8M. and there are at most 38n such words. The 1-bit headers used to indicate that w(i) has length 1 require a total of at most 33n bits. w(i)Eup. The words w(i) need an additional header to specify their length and hence which Ci is to be used. choose L > M so large that if B = UL BP • then . which exceeds e since it was assumed that < 1/2 and k > 118. Thus.8. Of course. for any choice of 0 < 8 < 1/2. provided 8 is small enough and M is large enough.128 CHAPTER II. it follows that if 8 is chosen small enough. there is a constant K such that at most log t(w(i)) K tv(i)Eui>m B1 E bits are required to specify all these lengths. which is certainly negligible. ignoring the one bit needed to tell that 4 E G(n). which establishes the lemma. the total length needed to encode all the w(i) is upper bounded by (10) n(H +8 — ye +3d3). But (1/x) log x is a decreasing function for x > e. The partial packing lemma. To establish this. then 8 — ye -1-3d8 will be negative. so essentially all that remains is to show that the headers require few bits. and M > 2k13. By assumption. to total code length from encoding those w(i) that belong to either Dk or to Ui . k > 113. since k < M and hence such words must have length at least k. There are at most nlk words w(i) the belong to Dk or to Ui >m Bi . n(H +3 — y(e +8)) n(H (9) Each word w(i) of length 1 requires d bits to encode.M . so these 2-bits headers contribute at most 28n to total code length. m Bi E since each such w(i) has length M > 2k1S.u(B) > y.m B i is upper bounded by — ye). then at most K3n bits are needed to specify the lengths of the w(i) that belong to Ui >m B i . 1=.3. then n(H — E'). In summary. the total contribution of all the headers as well as the length specifiers is upper bounded by 38n ± 26n + IC8n. If 8 is small enough. e'. Adding this bound to the bound (10) obtained by ignoring the headers gives the bound L*(4) < n[H 8(6 ± 3d K)— yen on total code length for members of G(n). if it assumed that log M < 8M. by property (b) in the definition of G(n). Since an Elias code is used to do this. relative to n. Lemma 1. by property (b) of the definition of G(n). ENTROPY-RELATED PROPERTIES. so that log log t(w(i)) < K M n. so that. still ignoring header lengths. the lemma depends on the truth of the assertion that 4 E G(n).

C(x`11 )1 n < H. eventually almost surely. Thus if I* is the set of indices i E I such that [ui .3.9. provide a representation x = w(1)w(2) • • • w(t) for which the properties (a). eventually almost surely. eventually almost surely. n]. since it was assumed that M > 2k/ 6 . E Dk and a disjoint collection {[si. eventually almost surely there are at least 8n indices i E [1. Dn . Section 1. Lemma 1. 4 is (1 — 0-strongly-covered by Dk. then implies that xri' is eventually almost surely (1 — 20-packed by k-blocks drawn from Dk. . vi that meet the boundary of at least one [s ti] is at most 2nk/M. But this completes the proof that 4 E G(n). = .7. ti]UU[ui.SECTION 11. To say that x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. ti]: j E U {[ui. as follows. a y-packing of 4 by blocks drawn from B. 1E1. which is upper bounded by 8n. by the ergodic theorem.7. and let {C„} be a prefix-code sequence such that lim sup. since. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. that is. Since each [s1 . t] has length at least M. Lemma 1. Fix c > 0 and define D„ = 14: £(4) < n(H so that x Thus. ENTROPY AND CODING. then the collection {[sj.l.1. if E . the cardinality of J is at most n/M.s. eventually almost surely. B. n] of total length at least (1 — 20n for which each 4. (This argument is essentially just the proof of the two-packings lemma. This completes the second proof that too-good codes do not exist. In particular. n — k 1] that are starting places of blocks from Dk. ti ]. II.d Another proof of the entropy theorem. On the other hand. for which each xs% E B. suggested in Exercise 5. . The entropy theorem can be deduced from the existence and nonexistence theorems connecting process entropy and faithful-code sequences. so the total length of those [ui . n] of total length at least yn. (b). ti ]: j E J } of M-length or longer subintervals of [1.) The words v:: j E /*I {x:j E J} U {xu together with the length 1 words {x s I defined by those s U[s . The preceding paragraph can be summarized by saying that if x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk then 4 must belong to G(n). But the building-block lemma. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. and (c) hold. eventually almost surely. and I Dn I < 2 n(H+E) . since Cn is invertible. 129 provides.u(x'1') < 2 1'1+2} . vi ] meets none of the [si . Fix an ergodic measure bt with process entropy H.3. E /*} is disjoint and covers at least a (1 — 30-fraction of [1. is to say that there is a disjoint collection {[u„ yi ]: i E / } of k-length subintervals of [1. a.

) 3. Define On (4) = 4)(4). whose range is prefix free. such that the first symbol in 0(4) is always a 0. Le Exercises 1. Show that there is prefix n-code Cn whose length function satisfies £(4) < Kn on the complement of S. xii E Un and en (4) = C(4).s. and where it occurred earlier. then there is a renewal process v such that lim supn D (gn II vn )/n = oc and lim infn D(pn II vn )/n = 0. is 1-1(pn )± D (An II vn). ENTROPY-RELATED PROPERTIES. Show that if y 0 p. Let C. (a) For each n let p. be the projection of the Kolmogorov measure of unbiased coin tossing onto A n and let Cn be a Shannon code for /in .. Such a code Cn is called a bounded extension of Cn * to An. where it starts. and such that C.. 2.. Lemma 5. L(4)/ log n < C. Suppose Cn * is a one-to-one function defined on a subset S of An . eventually almost surely.. Code the remainder by using the Shannon code. a. eventually almost surely.(4 For the lower bound define U„ = Viz: p.(4) > 2—n(H.1. be the code obtained by adding the prefix 1 to the code Cn .n(H — c). and let £(4) be the length function of a Shannon code with respect to vn . n n and completes this proof of the entropy theorem. Let L (xi') denote the length of the longest string that appears twice in 4. establishing the upper bound 1 1 lim sup — log ) _< H . almost surely..i. (Hint: code the second occurrence of the longest string by telling how long it is. a. and note that I Uni < 2n(I I —E) . Show that if bt is i. is ergodic then C(4)In cannot converge in probability to the entropy of v. C3 II.. This proves that 1 1 lim inf — log — > H. The resulting code On is invertible so that Theorem 11. where K is a constant independent of n. xi" g Un .1(4) log(pn (4)/vn (4)) be the divergence of An from yn . eventually almost surely. . Thus there is a one-to-one function 0 from Un into binary sequences of length no more than 2 -I. (b) Let D (A n il v pi ) = E7x /1. n n 11. is the all 0 process.s.130 then /t (Bn CHAPTER II. (C) Show that if p.d. This exercise explores what happens when the Shannon code for one process is used on the sample paths of another process. and so xri` g Bn ...(4) = 1C:(4).2 guarantees that Je ll g Un . Show that the expected value of £(4) with respect to p. then there is a constant C such that lim sup. n D n) < IDn 12—n(H+2E) < 2—n€ Therefore 4 fl B n n Dn . Add suitable headers to each part to make the entire code into a prefix code and apply Barron's lemma. for X'il E S.1 .

hence it is specified by giving a pointer to where the part before its final symbol occurred earlier. . In finite versions it is the basis for many popular data compression packages. called here (simple) LZ parsing. (b) Suppose w(1) . together with a description of its final symbol.. the LZ code maps it into the concatenation Cn (eii) = b(1)b(2). that is. A second proof. and uses several ideas of independent interest.. x is parsed inductively according to the following rules. let 1.. b(C). W(j)} and xm+ 1 g {w(1). . To describe the LZ code Cn precisely. . .. called words. later words have no effect on earlier words. The LZ algorithm is based on a parsing procedure.SECTION 11. w(C) y. for example.. that the Lempel-Ziv (LZ) algorithm compresses to entropy in the limit will be given in this section. 2. called the (simple) Lempel-Ziv (LZ) code by noting that each new word is really only new because of its final symbol. where the final block y is either empty or is equal to some w(j). 1.. b(C + 1).. THE LEMPEL-ZIV ALGORITHM 131 Section 11. parses into 1.1 +1 E {WM. 7 + 1 . Thus. 00. [90]... w(j + 1) = xn such that xm n . Ziv's proof. . [53]. n) i-. b(C)b(C + 1) of the binary words b(1). . To be precise. is given in Section 11. An important coding algorithm was invented by Lempel and Ziv in 1975.. 000.. for j < C. In its simplest form.Brl0gn1 and g: A 1--÷ BflogiAll be fixed one-to-one functions.2.. defined according to the following rules. Note that the parsing is sequential. 101.2 The Lempel-Ziv algorithm.. The parsing defines a prefix n-code Cn . 10. and it has been extensively analyzed in various finite and limiting forms.4. which is due to Ornstein and Weiss.. b(2).. the words are separated by commas. WW1. w(j) = (i) If xn 1 +1 SI {w(1).. 11001010001000100. for ease of reading. 100.. [91]..1 denote the least integer function and let f: {0. w(j)} then w(j ± 1) consists of the single letter xn ro. a way to express an infinite sequence x as a concatenation x = w(1)w(2) . If fiz is parsed as in (1). 0. and therefore an initial segment of length n can be expressed as (1) xi! = w(1)w(2). where. . (a) The first word w(1) consists of the single letter x1. that is. . . . . • of variable-length blocks... where m is the least integer larger than n 1 (ii) Otherwise. . 01. the word-formation rule can be summarized by saying: The next word is the shortest new word..
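The parsing rule "the next word is the shortest new word" translates directly into code. A minimal sketch, with the example string from the text as a check:

```python
def lz_parse(x):
    """Simple Lempel-Ziv parsing of a string: the next word is the shortest
    word not seen before as a word (the final word may repeat an earlier one
    when the string runs out)."""
    words, seen, start = [], set(), 0
    while start < len(x):
        end = start + 1
        while end <= len(x) and x[start:end] in seen:
            end += 1
        words.append(x[start:end])
        seen.add(x[start:end])
        start = end
    return words

# The example from the text:
assert lz_parse("11001010001000100") == ["1", "10", "0", "101", "00", "01", "000", "100"]
```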

then b(j) = 1 f (i)0 g(a). ENTROPY-RELATED PROPERTIES.) If p is an ergodic process with entropy h. . by subadditivity.16. which depends strongly on the value of p. 1) is empty. Of course. hence its topological entropy h(x) is log 2 = 1. since no code can beat entropy in the limit. but gives no information about frequencies of occurrence. by Theorem 11. The k-block universe of x is the set tik(x) of all a l' that appear as a block of consecutive symbols in x. individual sequence entropy almost surely upper bounds the entropy given by the entropy theorem. it is enough to prove that entropy is an upper bound. Ziv's concept of entropy for an individual sequence begins with a simpler idea.d.1. so total code length £(. (The topological entropy of x. 34] for discussions of this more general concept. Lemma 1.1 (The LZ convergence theorem. which is the growth rate of the number of observed strings of length k. then (11n)C(4) log n —> h. part (b) requires at most Calog n1 ± Flog I A + 2) bits. (b) If j < C and i < j is the least integer for which w(j) = w(i)a. The dominant term is C log n. and part (c) requires at most Flog ni + 1 bits. for it takes into account the frequency with which strings occur. if x is a typical sequence for the binary i. in which C1(4) now denotes the number of new words in the simple parsing (1). see [84.6. 1) = 1 f (i). since I/44k (x)i < Vim (x)I • Pk (x)1. process with p equal to the probability of a 1 then every finite string occurs with positive probability. a E A. The entropy given by the entropy theorem is h(p) = —p log p — (1 — p) log(1 — p). then b(j) = 0 g(w(j)). for every sequence.2. where i is Part (a) requires at most IA l(Flog IA + 1) bits. as defined here. Theorem 11.7.2. as k oc. otherwise b(C (c) If y is empty. called topological entropy. where a and /3 are constants. For example.eiz) is upper bounded by (C+1)logn-FaC-1.i. k = {a• i• x 1' 1 = all for some i _> The topological entropy of x is defined by 1 h(x) = lim — log 114 (x)I k a limit which exists. that is. so to establish universality it is enough to prove the following theorem of Ziv. for ergodic processes.) Topological entropy takes into account only the number of k-strings that occur in x. is the same as the usual topological entropy of the orbit closure of x.132 CHAPTER II. Ziv's proof that entropy is the correct almost-sure upper bound is based on an interesting extension of the entropy idea to individual sequences. (a) If j < C and w(j) has length 1. together with a proof that this individual sequence entropy is an asymptotic upper bound on (1/n)C(4) log n. together with a proof that.. [90]. almost surely. then b(C the least integer such that y = w(i).
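The information carried by the codewords b(j) is a pointer to the earlier word that the new word extends, plus the new final symbol. A minimal sketch of that bookkeeping, which accepts any list of parsed words (for instance the output of a parser like the one above) and ignores the exact binary formatting with f and g:

```python
def lz_pointer_pairs(words):
    """For each word in the parse, return (i, a): i is the 1-based index of
    the previously parsed word it extends (0 if the word is a single new
    letter) and a is its final symbol.  A possibly repeated final word is
    handled the same way here, for simplicity."""
    position = {}
    pairs = []
    for j, w in enumerate(words, start=1):
        pairs.append((position.get(w[:-1], 0), w[-1]))
        position[w] = j
    return pairs
```

With C words in the parse, each pointer costs about ⌈log C⌉ ≤ ⌈log n⌉ bits, which is where the dominant C log n term in the code length comes from.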

Theorem 11. The Ziv-entropy of a sequence x is denoted by H(x) and is defined by H(x) = lim inf h(y). almost surely. which also have independent interest.=. — 1) and C IA — 1).i (xii . Ziv established the LZ convergence theorem by proving the following two theorems. 0 such that C(x) log n n <(1 + 3n)loglAl.2.1. and hence C logn ' n log I AI. Lemma 11. )7.2.2.2.4 (The crude bound.2. y).2.3 (The Ziv-entropy theorem. Theorem 11. where d(4 .. The proof of the upper bound.) lim sup C(xti') log n H(x). The natural distance concept for this idea is cl(x. These two theorems show that lim supn (l/n)C(x7) log n < h.) If is an ergodic process with entropy h then H(x) < h. The first gives a simple bound (which is useful in its own right). Proof The most words occur in the case when the LZ parsing contains all the words of length 1. the second gives topological entropy as an upper bound. n (in + 1 )1Al m+1 jiAli and C = . . combined with the fact it is not possible to beat entropy in the limit. y) = lim sup d. )1). 133 The concept of Ziv-entropy is based on the observation that strings with too small frequency of occurrence can be eliminated by making a small limiting density of changes. THE LEMPEL-ZIV ALGORITHM.2. x E A c° . Theorem 11.. Theorem 11.t--ns n is per-letter Hamming distance.. almost surely.2.) There exists Bn —O.SECTION 11.1 ) = 1 . The first lemma obtains a crude upper bound by a simple worst-case analysis.2 (The LZ upper bound.. up to some length. which. defined as before by c1(x . when n= In this case. all the words of length 2. X E A°°.1. and the third establishes a ii-perturbation bound. leads immediately to the LZ convergence theorem. will be carried out via three lemmas. Theorem 11. that is. say m.

At first glance the result is quite unexpected. together with the fact that hi(x) — Ei < h(x) h (x) where Ei —> 0 as j oc. and the t < - t mtm. then Given x and E > 0.2. such that C(x) log n n — Proof Let hk(x) = (1/k) log lUk(x)1. as k oc.. tm 1 m(t _ i) _< E (2) m (t— 1 i Er ti < t"/(t — 1).. for w E Ai. y) < S. v(i)) < .. The third lemma asserts that the topological entropy of any sequence in a small dneighborhood of x is (almost) an upper bound on code performance. tj = Er (en-El The second bounding result extends the crude bound by noting that short words don't matter in the limit.. ei. —> 0. first choose m such that n< j=1 jlAli.) x. i = 1.134 CHAPTER II. there exists 8.. Lemma 11.5 (The topological entropy bound. ()) (w(i).2.x'iz) log n h(y) + Proof Define the word-length function by t(w) = j. j=1 m+1 The bounds (2). yield the desired result. that is. Fix S > 0 and suppose d(x. and hence the rate of growth of the number of words of length k. be the LZ parsing of x and parse y as y = v(1)v(2) . where t(v(i)) = t(w(i)). The word w(i) will be called well-matched if dt(u. ENTROPY-RELATED PROPERTIES. the topological entropy. .. Let x = w(1)w(2) . 2. To establish the general case in the form stated. all of which are valid for t > 1 and are simple consequences of the well-known formula 0 — t)/(t — 1). 1 t m it . i=1 The desired upper bound then follows from the bound bounds .) For each x E A'. Lemma 11..6 (The perturbation bound. must give an upper bound. and choose m such that E j2» j=1 „(x) < n < E i2 p„(x) .. y) <S. for the LZ parsing may be radically altered by even a few changes in x. for k > 1. there is a S >0 such that if d( li m sup n—>oo C(.

The resulting nonoverlapping blocks that are not in 'Tk can then be replaced by a single fixed block to obtain a sequence close to x with topological entropy close to h. Lemma 11. Lemma 11. n — 11: x in+s+1 lim inf > 1. For each k. let Gk(x) be the set of wellmatched w(i) of length k. Theorem 11. for w(i) E Gk(X). combined with (3) and the fact that f(S) --÷ 0.2. the blowup-bound lemma.2. implies LI the desired result. = O. n—>oo . Lemma 1.n+s +k E Ed' I{i E [0. k — 1].2. v(i)) < N/75. The cardinality of Gk(Y) is at most 2khk (Y ) .6.2./TS. k —1] such that i. yields IGk(x)I where f(S) < 2k(h(y)+ f (6)) 0 as 8 O. 135 otherwise poorly-matched.r8fraction of x. Let C1(4) be the number of poorly-matched words and let C2(4) be the number of well-matched words in the LZ parsing of x. x i+ 1 -T k11 • > 1 — E.2.2. completing the proof of Lemma 11. Lemma 11. Fix a frequency-typical sequence x.E. where lim n fi. 0 The desired inequality. Proof Fix c > 0. be an ergodic process with entropy h. and choose k so large that there is a set 'Tk c Ak of cardinality less than 2n (h+E) and measure more than 1 — E. Thus. since this is the total number of words of length k in y. n — 1].7 Let p. and hence it can be supposed that for all sufficiently large n. H(x) < h. Since dk(w(i).7. This bound. the poorly-matched words cover less than a limiting . Now consider the well-matched words. THE LEMPEL-ZIV ALGORITHM.2. To make the preceding argument rigorous. first note that lim i+k Ili E [0. follows immediately from Lemma 11. By the Markov inequality. For any frequency-typical x and c > O. hence the same must be true for nonoverlapping k-blocks for at least one shift s E [0.5. y) < e and h(y) <h E. The LZ upper bound. and let Gk(y) be the set of all v(i) for which w(i) E Gk(X). almost surely.6. together with the fact that hk(Y) --+ h(y). the total length of the poorly-matched words in the LZ parsing of 4 is at most rt. is an immediate consequence of the following lemma. since x was assumed to be frequency-typical.2. The idea for the proof is that 4 +k-1 E Tk for all but a limiting (1 — 0-fraction of indices i. there is a sequence y such that cl(x.SECTION 11. Thus there must be an integer s E [0.5 then gives C2(4) 5_ (1 + On)r— (h(Y) gn f (8)).4 gives the bound (3) (1 n/ +an) log n'a log I AI where limn an = O.

136 CHAPTER II. and (M — 2k)/ k < m < M/k. defines a word to be new only if it has been seen nowhere in the past (recall that in simple parsing.2. until it reaches the no-th term.2.2. produces slightly more rapid convergence and allows some simplification in the design of practical coding algorithms. see Exercise 1. new words tend to be longer. Tic where 3A1 ---* 0. implies that h m (y) < h + 6 ± 3m. If M is large relative to k. also discussed in the exercises. then defines . remains true. It is used in several data compression packages. ENTROPY-RELATED PROPERTIES. Nevertheless. The sequence y is defined as the concatenation y = uv(1)v(2). a set of cardinality at most 1 + 2k(h +E) . of words have been seen. "new" meant "not seen as a word in the past.8 Variations in the parsing rules lead to different forms of the LZ algorithm. the proof of the Ziv-entropy theorem..6. and hence h(y) < h + E. which can be expressed by saying that x is the concatenation x = uw(1)w(2)..3.2. say Co. after which the next word is defined to be the longest block that appears in the list of previously seen words... a member of the m-block universe Um (y) of y has the form qv(j)v(j + D.") In this alternative version. Theorem 11. Remark 11. Remark 11.2. and (4) liminf n—. as well as its length must be encoded. thereby.. which has also been extensively analyzed. for now the location of both the start of the initial old word. where q and r have length at most k. — n Fix a E A and let ak denote the sequence of length k. One version. y) <E. known as the LZW algorithm.. Condition (4) and the definition of y guarantee that ii(x.00 Ri < n: w(i) E TxII > 1 — E. Theorem 11. 0 and. each w(i) has length k. Lemma 1. This completes the proof of Lemma 11. but more bookkeeping is required.10 For computer implementation of LZ-type algorithms some bound on memory growth is needed. which improves code performance.2.7. the built-up set bound. called the slidingwindow version. where w(i) ak if w(i) E 'Tk otherwise. Since each v(i) E U {a l } . starts in much the same way. This modification. • v(j + in — 1)r. the rate of growth in the number of new words still determines limiting code length and the upper bounding theorem.. Remark 11. includes the final symbol of the new word in the next word. where the initial block u has length s. One method is to use some form of the LZ algorithm until a fixed number.7.2.9 Another variation. Another version. all of whose members are a.

3 Empirical entropy. or variable length.2.9 achieves entropy in the limit.8 achieves entropy in the limit. II. and can be shown to approach optimality as window size or tree storage approach infinity. 3. Section 11. both of which will also be discussed. 0 < r < k. such as the Lempel-Ziv code.. [52]. here called the empirical-entropy theorem. but do perform well for processes which have constraints on memory growth. is defined by qic(cli = Ri E [0. Show that the LZW version discussed in Remark 11.cc+ -E =a } I where n = km + r.a Exercises 1. t2. provided only that n _ . In this section. Show that simple LZ parsing defines a binary tree in which the number of words corresponds to the number of nodes in the tree. . then E log Li < C log(n/C). which can depend on n. ic. In many coding procedures a sample path 4 is partitioned into blocks. suggests an entropy-estimation algorithm and a specific coding technique. • • . Furthermore. as well as other entropy-related ideas. Furthermore. (Hint: if the successive words lengths are . the coding of a block may depend only on the block. 137 the next word as the longest block that appears somewhere in the past no terms. asserting that eventually almost surely only a small exponential factor more than 2kh such k-blocks is enough.. or it may depend on the past or even the entire sample path. through the empirical distribution of blocks in the sample path. see [87. EMPIRICAL ENTROPY.3. The blocks may have fixed length. and an overlapping-block form. rit — 1]: in X: . and that a small > 21th exponential factor less than 2kh is not enough. Show that the version of LZ discussed in Remark 11. e. then each block is coded separately.) 2. .2. The nonoverlapping k-block distribution. The theorem.2. about the covering exponent of the empirical distribution of fixed-length blocks will be established. The theorem is concerned with the number of k-blocks it takes to cover a large fraction of a sample path.SECTION 11. These will be discussed in later sections. the method of proof leads to useful techniques for analyzing variable-length codes. Such algorithms cannot achieve entropy for every ergodic process. see Exercise 1. 88]. a beautiful and deep result due to Ornstein and Weiss. g. The empirical-entropy theorem has a nonoverlapping-block form which will be discussed here.
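The nonoverlapping k-block distribution with which the theorem is concerned is computed as follows (a minimal sketch; x is a string or tuple, and the final r = n − km symbols are ignored):

```python
from collections import Counter

def nonoverlapping_k_type(x, k):
    """Empirical distribution q_k(. | x_1^n) of the m = n // k nonoverlapping
    k-blocks x_{ik+1}...x_{ik+k}, i = 0, ..., m-1; the final n - km symbols
    are not used."""
    m = len(x) // k
    counts = Counter(x[i * k:(i + 1) * k] for i in range(m))
    return {block: count / m for block, count in counts.items()}
```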

as it will be useful in later sections. Theorem 11. see Exercise 3.14) is not close to kt k in variational distance for the case n = k2". in spite of the fact that it may not otherwise be close to iik in variational or even cik -distance.3. those long words that come from the fixed collections .elz = w(1)w(2) • • • w(t) for which awci» < --En .a Strong-packing. ENTROPY-RELATED PROPERTIES.) Let ii be an ergodic measure with entropy h > O. A fixed-length version of the result is all that is needed for part (a) of the entropy theorem. The theorem is. the empirical measure qk(. or (1 — e)-packed by {TO. in particular. word length is denoted by £(w) and a parsing of 4 is an ordered collection P = {w(1). but the more general result about parsings into variable-length blocks will be developed here. subject only to the condition that k < (11 h) log n. . for the latter says that for given c and all k large enough.138 CHAPTER II. for k > 1. for unbiased coin-tossing the measure qk (. the empirical distribution qk(.. x) such that if k > K and n > 2kh .uk. such that for almost every x there is a K = K(e. for fixed k. Fix a sequence of sets {E}.a(B) > E. w(t)} of words for which 4 = w(1)w(2) • • • w(t). that is.3. an empirical distribution form of entropy as covering exponent.I4) eventually almost surely has the same covering properties as ktk. (1 — O n.. if the words that belong to U'rk cover at least a (1 — E)-fraction of n. . and there is no is a set T(E) c A k such that l'Tk(E)I _ set B C A k such that IB I _< 2k(h--E) and . c)-strongly-packed by MI if any parsing . then (a) qk(Z(E)14)) > 1 — E. For example. Strong-packing captures the essence of the idea that if the long words of a parsing cover most of the sample path. there < 2k(h+f) and pl (Tk(E)) > 1 — e. For each c > 0 and each k there is a set 'Tk (e) c A" for which l'Tk (e)I < 2k(h +E) . As before.7. in essence. . Part (a) of the empirical-entropy theorem is a consequence of a very general result about parsings of "typical" sample paths.14) is eventually almost surely close to the true distribution. if the short words cover a small enough fraction. A parsing X n i = w(l)w(2) • • • w(t) is (1 — e)-built-up from {'Tk}. w(2). hence. t(w(I))<K E - 2 is (1 — E)-built-up from MI.4. then most of the path is covered by words that come from fixed collections of exponential size almost determined by entropy. A sequence 4 is (K. The ergodic theorem implies that for fixed k and mixing kt.1 (The empirical-entropy theorem. such that Tk c A k . a finite sequence w is called a word. for any B c A" for which IBI < 2k( h — e) . II. E w(i)Eurk f(w(i)) a. the empirical and true distributions will eventually almost surely have approximately the same covering properties. Eventually almost surely. (b) qk(BI4)) <E. Theorem 1. The surprising aspect of the empirical-entropy theorem is its assertion that even if k is allowed to grow with n.

fix c > 0. To make the outline into a precise proof. 0-strongly-packed by {Tk for a suitable K. For k > m. The key result is the existence of collections for which the 'Tk are of exponential size determined by entropy and such that eventually almost surely x is almost strongly-packed. most of 4 must be covered by those words which themselves have the property that most of their indices are starting places of members of Cm . and let 8 be a positive number to be specified later. it can be supposed that 3 < 12. then. by the packing lemma. by the packing lemma.3. the words w(i) that are not (1 — 3/2)-stronglycovered by Cm cannot have total length more than Sn < cn12.3. be an ergodic process of entropy h and let E be a positive number There is an integer K = K(e) and. let 'Tk be the set of sequences of length k that are (1 — 3)-built-up from Cm . eventually almost surely most indices in x7 are starting places of m-blocks from Cm . a set 'Tk = 'Tk(E) C A k . If a word is long enough. however. Lemma 1. EMPIRICAL ENTROPY 139 {'Tk } must cover most of the sample path.3. By making 8 smaller.7.3. (a) ITkI_ (b) x is eventually almost surely (K. Lemma 11.SECTION 11. and if f(w(i)) ?_ 2/3. if necessary. k > in. But if such an x is partitioned into words then. The entropy theorem provides an integer m and a set Cm c Am of measure close to 1 and cardinality at most 2m (h +6) • By the ergodic theorem. The entropy theorem provides an m for which the set Cm = {c4n : p.(ar) > 2 -m(h+ 5)} has measure at least 1 — 3 2 /4.2 (The strong-packing lemma. The collection 'Tk of words of length k that are mostly built-up from Cm has cardinality only a small exponential factor more than 2"") . then the word is mostly built-up from C m . 4 is (1 — 3 2 /2)-strongly-covered by Cm . that is. that is. (ii) If x is (1 —3 2 /2)-strongly-covered by C m and parsed as 4 = w(1)w(2) • • w(t). It remains to show that eventually almost surely x'IL is (K. By the built-up set lemma. and most of its indices are starting places of members of Cm .) Let p. w(i) is (1-0-packed by C m . (i) The ergodic theorem implies that for almost every x there is an N = N(x) such that for n > N. . 0-strongly-packed by {7 } . Lemma 1. w(i) E Te(w(j)). by the built-up set lemma. n — m + 1].6. the sequence 4 has the property that X: +171-1 E Cm for at least (1 — 3 2 /2)n indices i E [1. such that both of the following hold. it can be supposed that 6 is small enough to guarantee that iTk i < 2k(n+E) . This is a consequence of the following three observations. < 2k(h+E) for k > K. Proof The idea of the proof is quite simple. (iii) If w(i) is (1 — 3/2)-strongly-covered by C m . by the Markov inequality. then. for each k > K. by the Markov inequality.

3. and t(w(t + 1)) < k. If the block does . Thus. From the preceding it is enough to take K > 2/8. and suppose xril = w(1)w(2) • • • w(t)w(t + 1). i < t. A simple two-part code can be constructed that takes advantage of the existence of such sets B. If a block belongs to Tk — B then its index in some ordering of Tk is transmitted. 8)-strongly-packed by {Tk}. An informal description of the proof will be given first. then n > N(x) and k < Sn.(1 — 2 8 )n. where t(w(i)) = k. This completes the proof of Lemma 11. Suppose xtiz is too-well covered by a too-small collection B of k-blocks.. The definition of strong-packing implies that (n — k + 1)qk('Tk I4) a.' is (K. provided only that n > 2kh . The strongpacking lemma provides K and 'Tk c A " . Fix x such that 4 is (K. and choose K (x) > K such that if k> K(x) and n> 2kh . since the left side is just the fraction of 4 that is covered by members of Tk .140 CHAPTER II. The good k-code used on the blocks that are not in B comes from part (a) of the empirical-entropy theorem. ENTROPY-RELATED PROPERTIES. dividing by n — k + 1 produces qk(Tkixriz ) a (1 — 28)(1 —8) > (1 — c)..2. and if the parsing x 11 = w(1)w(2) • • • w(t) satisfies —2 t(w(i))<K then the set of w(i) for which f(w(i)) > K and w(i) E U Tk must have total length at El least (1 — On. for suitable choice of S. for if this is so. or by applying a fixed good k-block code if the block does not belong to B. for some k. while the last one has length less than Sn. The second part encodes successive kblocks by giving their index in the listing of B. eventually almost surely. if n > N(x). Fix k > K(x) and n a 2kh . Proof of part (b).b Proof of the empirical-entropy theorem. if the block belongs to B. this contributes asymptotically negligible length if I BI is exponentially smaller than 2kh and n > 2kh . except possibly the last one. Fix c > 0 and let S < 1 be a positive number to be specified later. This completes the proof of part (a) of the empirical-entropy 0 theorem. this requires roughly hk-bits.3. 8)-strongly-packed by {T} for n a N(x). and such that x. Proof of part (a). such that Irk I . II. which supplies for each k a set 'Tk c A k of cardinality roughly 2.2k(h+') for k a K. The first part is an encoding of some listing of B. so that if B covers too much the code will beat entropy. If B is exponentially smaller than 2" then fewer than kh bits are needed to code each block in B. such that eventually almost surely most of the k-blocks in an n-length sample path belong to Tk . since k < Sn. are longer than K. All the blocks.

3. or the following holds.3. say G: Tk i. Let K be a positive integer to be specified later and for n > 2/(4 let B(n) be the set of 4 for which there is some k in the interval [K.._. EMPIRICAL ENTROPY. say. 141 not belong to Tk U B then it is transmitted term by term using some fixed 1-block code. (log n)/h] for which there is a set B c A k such that the following two properties hold.B: B i . D It will be shown that the suggested code Cn beats entropy on B(n) for all sufficiently large n. and a fixed-length faithful code for each B c A k of cardinality at most 2k(h. F(x. say. Also needed for each k is a fixed-length faithful coding of Tk. which establishes that 4 g B(n). Several auxiliary codes are used in the formal definition of C. since no prefix-code sequence can beat entropy infinitely often on a set of positive measure.. (1) If B C A k and I B I < 21* -0 . 1}rk(h-01.SECTION 11. for almost every x. an integer K(x) such that if k > K(x) and n > 2" then qk(TkIx) > 1 . defined by Fm (X) = F(xi)F(x2).{0. B(n).E ) .8. 1} rk(4 ±3)1 . then for almost every x there is an N(x) such that x'ii g B(n). so that if k is enough larger than K(x) to guarantee that nkh Z > N(x). . for each k. and. F: A i-. A faithful single letter code.{0. since such blocks cover at most a small fraction of the sample path this contributes little to overall code length. if 8 is small enough and K is large enough. n ). eventually almost surely. fix c > 0 and let 8 be a positive number to be specified later. This fact implies part (b) of the empirical-entropy theorem. 1}1 1 05 All is needed along with its extension to length in > 1. eventually almost surely.8. If 4 g B(n). then property (1) must hold for all n > 2. Indeed. (i) qic(Tk14) ?_ 1 . To proceed with the rigorous proof. which implies part (b) of the empirical-entropy theorem. Let 1 • 1 denote the upper integer function. Part (a) of the theorem provides a set Tk c ilk of cardinality at most 2k(h+3) . By using the suggested coding argument it will be shown that if 8 is small enough and K large enough then xn i . (ii) I BI <2k(h6) and qk (BI4) > E. for K(x) < k < loh gn. for n > N(x). so that if K(x) < k < (log n)/ h then either 4 E B(n). the definition of {rk } implies that for almost every x there is an integer K (x) > K such that qk(Tk14) > 1 . eventually almost surely.{0. then qk(BI4) < c. Mk.

3. for all sufficiently large n. let Ok(B) be the concatenation in some order of {Fk (4): 4 E B } . to Tic — B. then an integer k E [K. 4 E B(n). and let E be an Elias prefix code on the integers. Lemma 11.3 I Mk. since no sequence of codes can beat entropy infinitely often on a set of positive measure. If 4 E B(n).B(w(i)) OlG k (w(i)) 11 Fk(w(i)) 11Fr (w(t +1) 0 w(i) E B W(i) E Tk .i<t i = t +1. in turn. Gk. k).C(x) 15_ tkqk(h — 6) + tk(1 — qk)(h + 6) + (3 + K)0(n). ( (I BI) determines the size of B. (log n)/11] and a set B C A k of cardinality at most 2k(h-E ) are determined such that qk(B14') > c. Once k is known. and Fr are now known. and 4 is parsed as 4 = w(l)w(2) • • • w(t)w(t + 1). a fact stated as the following lemma. 1=1 . The lemma is sufficient to complete the proof of part (b) of the empirical-entropy theorem.142 CHAPTER II. where (2) v(i) = The code is a prefix code. that is. since qk > E. given that x til E B(n). This is because tkq k (h — c) + tk(1 — q k )(h + 3) < n(h + 3 — 6(6 + S)). and as noted earlier. (3) . If 4 V B(n) then Cn (4) = OFn (4). or to neither of these two sets.4). But this implies that xriz V B(n). Finally. ENTROPY-RELATED PROPERTIES. the first bit tells whether x`i' E B(n).B. implies part (b) of the empirical-entropy theorem. where aK —> 0 as K —> oc. since reading from the left. the block 0k1 determines k. and. eventually almost surely. and qk = clk(Blx'1 2 ). since each of the codes Mk. The code is defined as the concatenation C(4) = 10k 1E(1B1)0k (B)v( 1)v(2) • • • v(t)v(t + 1). this. For 4 E B(n). from which w(i) can then be determined by using the appropriate inverse. and w(t + 1) has length r = [0. xri' E B(n). a prefix code such that the length of E(n) is log n + o(log n). The rigorous definition of Cn is as follows. The two-bit header on each v(i) specifies whether w(i) belongs to B.B w(i) g Tk UB. so that if K is large enough and 8 small enough then n(h — c 2 12). the principal contribution to total code length £(4) = £(41C) comes from the encoding of the w(i) that belong to rk U B. for each k and B C A k . Fk. and øk(B) determines B. for 4 E B(n). where w(i) has length k for i < t. for the block length k and set B c itic used to define C(.

(5) provided K > (E ln 2) . as well as a possible extra bit to round up k(h — E) and k(h +8) to integers. -KEn. in turn. There are at most t + 1 blocks. a quantity which is at most 6n/K. each of which requires k(h +3) bits.3. which. The header 10k le(I BD that describes k and the size of B has length (4) 2 +k+logIBI o(logIBI) which is certainly o(n). Given a sample path Xl. This completes completes the proof of Lemma 11. thereby establishing part (b) of the empirical-entropy theorem. Ignoring for the moment the contribution of the headers used to describe k and B and to tell which code is being applied to each k-block. upper bounded by K (1+ loglA1)2. It takes at most k(1 + log I Ai) bits to encode each w(i) Tk U B as well as to encode the final r-block if r > 0.'. A problem of interest is the entropy-estimation problem. The argument parallels the one used to prove the entropy theorem and depends on a count of the number of ways to select the bad set B. which is. of part (b) of the empirical-entropy theorem used a counting argument to show that the cardinality of the set B(n) must eventually be smaller than 2nh by a fixed exponential factor.c Entropy estimation. [ 52]. K the encoding of 4k(B) and the headers contributes at most naK to total length. A simple procedure is to determine the empirical distribution of nonoverlapping k-blocks.3.E ) bits. since k > K and n > 2". each requiring a two-bit header to tell which code is being used. as well as the extra bits that might be needed to round k(h — E) and k(h +3) up to integers. and then take H (qk )/ k as an estimate of h. and (t — tqk(B 14)) blocks in Tk — B. in turn. This gives the dominant terms in (3). II.3. . If k is fixed and n oc. since there are tqk(Bl. xn from an unknown ergodic process it. and n> k > K. The coding proof given here is simpler and fits in nicely with the general idea that coding cannot beat entropy in the limit. EMPIRICAL ENTROPY.SECTION 11.3. since log I BI < k(h — E) and k < (log n)I n.4 The original proof. El Remark 11. the number of bits used to encode the blocks in Tk U B is given by tqk (Bix)k(h — E) + (t — tq k (BI4)) k(h +3). and there are at most 1 +3t such blocks. together these contribute at most 3(t + 1) bits to total length. each of which requires k(h — E) bits.3. 143 Proof of Lemma 11. the goal is to estimate the entropy h of II. then H(qk)1 k will converge almost surely to H(k)/k.3. The encoding Øk(B) of B takes k Flog IAMBI < k(1+ log I A1)2 k(h. Thus with 6 ax = K(1 + log IA1)2 -KE — = o(K). converges to h as .q) blocks that belong to B. X2. since t < nlk. Adding this to the o(n) bound of (4) then yields the 30(n) term of the lemma. hence at most 80(n) bits are needed to encode them.3.

$k\to\infty$, at least if it is assumed that $\mu$ is totally ergodic. Thus there is some choice of $k=k(n)$ as a function of $n$ for which the estimate will converge almost surely to $h$, at least for a totally ergodic $\mu$. At first glance, the choice of $k(n)$ would appear to be very dependent on the measure $\mu$, because, for example, there is no universal choice of $k=k(n)$ for which $k(n)\to\infty$ and for which the empirical distribution $q_{k(n)}(\cdot\mid x_1^n)$ is close to the true distribution $\mu_{k(n)}$ for every mixing process $\mu$. The empirical-entropy theorem does, however, imply a universal choice for the entropy-estimation problem. The general result may be stated as follows.

Theorem II.3.5 (The entropy-estimation theorem.) If $\mu$ is an ergodic measure of entropy $h>0$, if $k(n)\to\infty$, and if $k(n)\le(1/h)\log n$, then
$$
(6)\qquad \lim_{n\to\infty}\frac{1}{k(n)}H\bigl(q_{k(n)}(\cdot\mid x_1^n)\bigr)=h,\quad\text{almost surely.}
$$

Thus, if $k(n)\to\infty$ and $k(n)\le\log_{|A|}n$, then (6) holds for any ergodic measure $\mu$ with alphabet $A$. In particular, the choice $k(n)\sim\log n$ works for any binary ergodic process, while if $k(n)\sim\log\log n$, then it holds for any finite-alphabet ergodic process.

Proof. Fix $\epsilon>0$ and let $\{\mathcal{T}_k(\epsilon)\subseteq A^k\colon k\ge 1\}$ satisfy part (a) of the empirical-entropy theorem. Let $U_k$ be the set of all $a_1^k\in\mathcal{T}_k(\epsilon)$ for which $q_k(a_1^k\mid x_1^n)\le 2^{-k(h+2\epsilon)}$, so that for all large enough $k$ and all $n\ge 2^{kh}$,
$$
(7)\qquad q_k(U_k\mid x_1^n)\le|\mathcal{T}_k(\epsilon)|\,2^{-k(h+2\epsilon)}\le 2^{k(h+\epsilon)}2^{-k(h+2\epsilon)}=2^{-\epsilon k}.
$$
Next let $V_k$ be the set of all $a_1^k$ for which $q_k(a_1^k\mid x_1^n)\ge 2^{-k(h-2\epsilon)}$, and note that $|V_k|\le 2^{k(h-2\epsilon)}$. Part (b) of the empirical-entropy theorem implies that for almost every $x$ there is a $K(x)$ such that $q_k(V_k\mid x_1^n)<\epsilon$, for $k\ge K(x)$ and $n\ge 2^{kh}$. This bound, combined with the bound (7) and part (a) of the empirical-entropy theorem, implies that for almost every $x$ there is a $K_1(x)$ such that $q_k(G_k\mid x_1^n)\ge 1-3\epsilon$, for $k\ge K_1(x)$ and $n\ge 2^{kh}$, where $G_k=\mathcal{T}_k(\epsilon)-U_k-V_k$. In summary, for $k\ge K_1(x)$ and $n\ge 2^{kh}$, the set of $a_1^k$ for which
$$
2^{-k(h+2\epsilon)}<q_k(a_1^k\mid x_1^n)<2^{-k(h-2\epsilon)}
$$
has $q_k(\cdot\mid x_1^n)$ measure at least $1-3\epsilon$. The same argument that was used to show that entropy-rate is the same as decay-rate entropy can now be applied to complete the proof of Theorem II.3.5.

Remark II.3.6 The entropy-estimation theorem is also true with the overlapping block distribution in place of the nonoverlapping block distribution, see Exercise 2.
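As a small illustration of the estimator behind Theorem II.3.5, the sketch below (my own code, not the text's) computes $H(q_k(\cdot\mid x_1^n))/k$ for a binary sample with $k(n)$ of order $\tfrac12\log_2 n$, a choice safely below the $(1/h)\log n$ threshold for any binary process.

```python
import math
import random
from collections import Counter

def entropy_estimate(x, k):
    """Plug-in estimate H(q_k(. | x)) / k from the nonoverlapping k-block
    empirical distribution, as in the entropy-estimation theorem."""
    m = len(x) // k
    counts = Counter(tuple(x[i * k:i * k + k]) for i in range(m))
    H = -sum((c / m) * math.log2(c / m) for c in counts.values())
    return H / k

# Toy usage: a biased coin with P(1) = 0.3 has entropy about 0.881 bits.
random.seed(0)
x = [1 if random.random() < 0.3 else 0 for _ in range(2**18)]
k = int(0.5 * math.log2(len(x)))     # k(n) of order (1/2) log n
print(entropy_estimate(x, k))
```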

is a prefix code. Encode successive k-blocks in 4 by giving the index of the block in the code Step 4. an Elias code. Encode the final block. relative to n. where h is the entropy of A. almost surely. k) and 4 = w(1)w(2)•• • w(t)v. if k does not divide n.42 ) 71 = h. that is. A universal code for the class of ergodic processes is a faithful code sequence {C n } such that for any ergodic process p. Step 1. Define b2+ +L i to be the concatenation of the E(A(w(i))) in order of increasing i.1. 2k — 1) is called the code book.lxiii). where k = k(n). Suppose n = tk r. subject only to the rule that af precedes 5k whenever k n -k n) qk(ailX1) > qk( all X 1 • The sequence {v(j) = j = 0. and it was also shown that the Lempel-Ziv algorithm gives a universal code. The code length £(4) = ni+ L +r. The header br is a concatenation of all the members of {0. High frequency blocks appear near the front of the code book and therefore have small indices.1.3. Partition xi: into blocks of length k = k(n) — (1/2) log AI n. of these k-blocks in order of decreasing frequency of occur- Step 3.) The code word Cn (4) is a concatenation of two binary sequences.3.C is short. since L depends on the distribution qk (. r E [0. depends on 4. define the address function by the rule A(w(i)) = j.d Universal coding. The steps of the code are as follows. Such a code sequence was constructed in Section II. since k -(1/2) log ik n. so that the empirical-entropy theorem guarantees good performance. book E. . fix {k(n)} for which k(n) — (1/2) log n.iiri I .JF. lirn “. where each w(i) has length k. (For simplicity it is assumed that the alphabet is binary. with a per-symbol code. followed by a sequence whose length depends on x. . To make this sketch into a rigorous construction. if w(i) = that is. where L is the sum of the lengths of the e(A(w(i))). La.SECTION 11. a header of fixed length m = k2k . since the number of bits needed to transmit an index is of the order of magnitude of the logarithm of the index. For 1 < i < t and 0 < j < 2k. The number of bits needed to transmit the list . 1} k . define b:+ 1+ -1 to be x4+1 • The code Cn (4) is defined as the concatenation Cn (x7) = br • b:V' • b7. Step 2. Finally. The empirical-entropy theorem provides another way to construct a universal code. but the use of the Elias code insures that C. Let E be a prefix code on the natural numbers such that the length of the word E(j) is log j o(log j). w(i) is the j-th word in the code book. 1. see the proof of Theorem 11. EMPIRICAL ENTROPY 145 II. Transmit a list rence in 4.
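The construction above transmits a frequency-ordered code book and then Elias-codes the address of each successive $k$-block, so that high-frequency blocks receive short addresses. The sketch below is my own simplified encoder, not the text's exact construction: it lists only the $k$-blocks that actually occur (rather than all of $\{0,1\}^k$), uses 1-based addresses, and uses the Elias gamma code, whose length is $\log j+o(\log j)$ as required.

```python
from collections import Counter

def elias_gamma(j):
    """Elias gamma code of a positive integer j: (len(bin(j)) - 1) zeros
    followed by the binary expansion of j; length 2*floor(log2 j) + 1."""
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def universal_encode(x, k):
    """Frequency-ordered code-book encoder (binary alphabet assumed):
    header = the observed k-blocks listed most-frequent first,
    body   = Elias-coded 1-based address of each successive k-block,
    tail   = the final r < k symbols, sent uncoded."""
    t = len(x) // k
    blocks = ["".join(map(str, x[i * k:(i + 1) * k])) for i in range(t)]
    freq = Counter(blocks)
    book = [b for b, _ in freq.most_common()]          # code book
    address = {b: j + 1 for j, b in enumerate(book)}   # 1-based addresses
    header = "".join(book)
    body = "".join(elias_gamma(address[b]) for b in blocks)
    tail = "".join(map(str, x[t * k:]))
    return header, body, tail

# Toy usage: the most frequent 2-block gets a one-bit address.
x = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
header, body, tail = universal_encode(x, 2)
print(len(header), len(body), tail)
```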

s.. n — k + 1]: 444-1 = aDi . Thus. II. let Bk be the first 2k(h-E) sequences in the code book for x. For those w(i) Gk. and is described in Exercise 4.3. Another application of the empirical-entropy theorem will be given in the next chapter. ENTROPY-RELATED PROPERTIES.2. even simpler algorithm which does not require that the code book be listed in any specific order was obtained in [43]. since Gk is a set of k-sequences of largest qke probability whose cardinality is at most 2k(h +E ) . let {7-k(E) c Ac} be the sequence given by the empirical-entropy theorem. n—k+ 1 in place of the nonoverlapping block distribution qk(. This proves that .) Pk(a in4) E [1.1. For each n. To establish this. In fact.146 CHAPTER II. Theorem 11. and hence < (1 — On(h + c) + En + o(log n). The empirical-entropy theorem guarantees that eventually almost surely. where k = k(n). (8) n-* lim f(x n ) 1 = h. a. but it is instructive to note that it follows from the second part of the empirical-entropy theorem.3. for almost every x. for k > K(x). The empirical-entropy theorem provides. qk(GkIx t 0 qk(Tk(E)Ixi). The reverse inequality follows immediately from the fact that too-good codes do not exist. Show that the empirical-entropy theorem is true with the overlapping block distribution. and in some ways.e Exercises 1. for k > K (x). s. These are the only k-blocks that have addresses shorter than k(h — 6) + o(log n). A second. an integer K(x) such that qk (GkI4) > qk (7k (c)14) > 1 — E for k > K(x). there are at most Et indices i for which w(i) E Bk. so that lirninf' > (h — c)(1 — c). Note that IGk I > l'Tk(E)I and. in the context of the problem of estimating the measure II from observation of a finite sample path.7 The universal code construction is drawn from [49]. at least a (1 — E)-fraction of the binary addresses e(A(w(i))) will refer to members of Gk and thus have lengths bounded above by k(h + c)+ o(log n). a. n where h is the entropy of it. a.3.C(4) lim sup — < h. (Hint: reduce to the nonoverlapping case for some small shift of the sequence. which also includes universal coding results for coding in which some distortion is allowed. The empirical-entropy theorem will be used to show that for any ergodic tt.lfiz). see Section 111. furthermore. Remark 11. S. the crude bound k + o(log n) will do. let Gk denote the first 2k(h +E) members of the code book.
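Exercise 1 involves the overlapping $k$-block distribution $p_k(a_1^k\mid x_1^n)$, the relative frequency of $a_1^k$ among all $n-k+1$ overlapping windows. The following minimal sketch (my own code) computes it next to the nonoverlapping version, so the two can be compared on data.

```python
from collections import Counter

def overlapping_block_distribution(x, k):
    """Overlapping k-block distribution p_k(. | x): relative frequencies of
    the windows x_i^{i+k-1} over all n - k + 1 starting positions."""
    m = len(x) - k + 1
    counts = Counter(tuple(x[i:i + k]) for i in range(m))
    return {b: c / m for b, c in counts.items()}

def nonoverlapping_block_distribution(x, k):
    """Nonoverlapping k-block distribution q_k(. | x), for comparison."""
    m = len(x) // k
    counts = Counter(tuple(x[i * k:i * k + k]) for i in range(m))
    return {b: c / m for b, c in counts.items()}

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
p2 = overlapping_block_distribution(x, 2)
q2 = nonoverlapping_block_distribution(x, 2)
print(max(abs(p2.get(b, 0) - q2.get(b, 0)) for b in set(p2) | set(q2)))
```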

2. Show that the entropy-estimation theorem is true with the overlapping block distribution in place of the nonoverlapping block distribution.

3. Show that the variational distance between $q_k(\cdot\mid x_1^n)$ and $\mu_k$ is asymptotically almost surely lower bounded by $(1-e^{-1})$ for the case when $n=k2^k$ and $\mu$ is unbiased coin-tossing. (Hint: if $M$ balls are thrown at random into $M$ boxes, then the expected fraction of empty boxes is asymptotic to $e^{-1}$.)

4. Another simple universal coding procedure is suggested by the empirical-entropy theorem. For each $k\le n$ construct a code $C_{k,n}$ as follows. Express $x_1^n$ as the concatenation $w(1)w(2)\cdots w(q)v$ of $k$-blocks $\{w(i)\}$, plus a possible final block $v$ of length less than $k$. Make a list in some order of the $k$-blocks that occur, then code successive $w(i)$ by using a fixed-length code; such a code requires at most $k(1+\log|A|)$ bits per word. Transmit $k$ and the list, and append some coding of the final block $v$. Call this $C_{k,n}(x_1^n)$ and let $\mathcal{L}_k(x_1^n)$ denote the length of $C_{k,n}(x_1^n)$. As in earlier discussions, let $k_{\min}$ be the first value of $k\in[1,n]$ at which $\mathcal{L}_k(x_1^n)$ achieves its minimum. The final code $C_n(x_1^n)$ transmits the shortest of the codes $\{C_{k,n}(x_1^n)\}$, along with a header to specify the value of $k$; that is, $C_n(x_1^n)$ is the concatenation of $\mathcal{E}(k_{\min})$ and $C_{k_{\min},n}(x_1^n)$. Show that $\{C_n\}$ is a universal prefix-code sequence.

Section II.4 Partitions of sample paths.

An interesting connection between entropy and partitions of sample paths into variable-length blocks was established by Ornstein and Weiss, [53]. Their results were motivated by an attempt to better understand the Lempel-Ziv algorithm, which partitions into distinct words. They show that eventually almost surely, for any partition into distinct words, most of the sample path is covered by words that are not much shorter than $(\log n)/h$, and for any partition into words that have been seen in the past, most of the sample path is covered by words that are not much longer than $(\log n)/h$.

As in earlier discussions, a word $w$ is a finite sequence of symbols drawn from the alphabet $A$ and the length of $w$ is denoted by $\ell(w)$. A sequence $x_1^n$ is said to be parsed (or partitioned) into the (ordered) set of words $\{w(1),w(2),\ldots,w(t)\}$ if it is the concatenation
$$
(1)\qquad x_1^n=w(1)w(2)\cdots w(t).
$$
If $w(i)\ne w(j)$, for $i\ne j$, then (1) is called a parsing (or partition) into distinct words. For example,
$$
000110110100=[000][110][1101][00]
$$
is a partition into distinct words, while
$$
000110110100=[00][0110][1101][00]
$$
is not. Partitions into distinct words have the asymptotic property that most of the sample path must be contained in those words that are not too short relative to entropy.
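The simple Lempel-Ziv parsing is the standard way to produce a partition into distinct words: each new word is the shortest block that has not already occurred as a word, so all words except possibly the final one are distinct. The following sketch is my own illustration of that parsing rule, applied to the example above.

```python
def lz_distinct_parse(x):
    """Simple LZ parsing: the next word is the shortest block that has not
    already occurred as a word.  All words are distinct, except possibly
    the final (truncated) one."""
    words, seen = [], set()
    start = 0
    for end in range(1, len(x) + 1):
        w = tuple(x[start:end])
        if w not in seen or end == len(x):
            words.append(w)
            seen.add(w)
            start = end
    return words

x = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0]
words = lz_distinct_parse(x)
print(["".join(map(str, w)) for w in words])
print(len(set(words[:-1])) == len(words[:-1]))   # earlier words are distinct
```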

Theorem II.4.1 (The distinct-words theorem.) Let $\mu$ be ergodic with entropy $h>0$, and let $\epsilon>0$ be given. For almost every $x\in A^\infty$ there is an $N=N(\epsilon,x)$ such that if $n\ge N$ and $x_1^n=w(1)w(2)\cdots w(t)$ is a partition into distinct words, then
$$
\sum_{\ell(w(i))\le(h+\epsilon)^{-1}\log n}\ell(w(i))\le\epsilon n.
$$

The (simple) Lempel-Ziv algorithm discussed in Section II.2 parses a finite sequence into distinct words, except, possibly, for the final word, $w(C+1)$. If this final word is not empty, it is the same as a prior word, and hence $\ell(w(C+1))\le n/2$. Thus the distinct-words theorem implies that, eventually almost surely, most of the sample path is covered by blocks that are at least as long as $(h+\epsilon)^{-1}\log(n/2)$. But the part covered by shorter words contains only a few words, by the crude bound, so that
$$
(2)\qquad \limsup_{n}\frac{C(x_1^n)\log n}{n}\le h.
$$

To discuss partitions into words that have been seen in the past, some way to get started is needed. For this purpose, let $\mathcal{F}$ be a fixed finite collection of words, called the start set. A partition $x_1^n=w(1)w(2)\cdots w(t)$ is called a partition (parsing) into repeated words if each $w(i)$ is either in $\mathcal{F}$ or has been seen in the past, meaning that there is an index $j\le\ell(w(1))+\cdots+\ell(w(i-1))$ such that $w(i)=x_j^{j+\ell(w(i))-1}$. Partitions into repeated words have the asymptotic property that most of the sample path must be contained in words that are not too long relative to entropy.

Theorem II.4.2 (The repeated-words theorem.) Let $\mu$ be ergodic with entropy $h>0$, and let $0<\epsilon<h$ and a start set $\mathcal{F}$ be given. For almost every $x\in A^\infty$ there is an $N=N(\epsilon,x)$ such that if $n\ge N$ and $x_1^n=w(1)w(2)\cdots w(t)$ is a partition into repeated words, then
$$
\sum_{\ell(w(i))\ge(h-\epsilon)^{-1}\log n}\ell(w(i))\le\epsilon n.
$$

It is not necessary to require exact repetition in the repeated-words theorem; it is sufficient if all but the last symbol appears earlier, see Exercise 1. With this modification, the repeated-words theorem applies to the versions of the Lempel-Ziv algorithm discussed in Section II.2, so that eventually almost surely most of the sample path must be covered by words that are no longer than $(h-\epsilon)^{-1}\log n$, which, in turn, implies that
$$
(3)\qquad \liminf_{n}\frac{C(x_1^n)\log n}{n}\ge h.
$$
The lower bound (3) is, as noted earlier, a consequence of the fact that entropy is an upper bound on compression. Thus the distinct-words theorem and the modified repeated-words theorem together provide an alternate proof of the LZ convergence theorem.

The proofs of both theorems make use of the strong-packing lemma, Lemma II.3.2, this time for parsings into words of variable length, and the fact that too-good codes do not exist.
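A partition into repeated words can be produced greedily: at each position take the longest block whose copy starts at some earlier index, falling back to a single symbol (playing the role of a start-set word) when nothing longer has been seen. The sketch below is my own illustrative variant, not the construction used in the proofs.

```python
def repeated_word_parse(x):
    """Greedy parsing into repeated words: each word is the longest block
    starting at the current position whose copy also starts at some earlier
    index (overlap with the current position is allowed); a single symbol
    serves as a start-set word when nothing longer has been seen."""
    words, start, n = [], 0, len(x)
    while start < n:
        best = 1                          # fallback: one-symbol start-set word
        for length in range(2, n - start + 1):
            w = x[start:start + length]
            if any(x[j:j + length] == w for j in range(start)):
                best = length             # w already starts somewhere earlier
            else:
                break
        words.append(tuple(x[start:start + best]))
        start += best
    return words

# The long words of such a parsing are rare, by the repeated-words theorem.
print(repeated_word_parse([0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]))
```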

because there are at most IA i k words of length k.2. This proves the lemma. there is an N such that if n > N and xrii = w(1)w(2)..4.4. The distinct-words theorem is a consequence of the strong-packing lemma. t —) men. Proof First consider the case when all the words are required to come from G. together with some simple facts about partitions into distinct words. and let G = UkGk.(..3. then E aw (i) ) Bn. . PARTITIONS OF SAMPLE PATHS... 149 II.4 For each k.(i)). yields an integer K and { rk C A": k> 1} such that I7j < 21c(h+S) . w(t) is a partition into distinct words such that t(w(i)) 5_ En. The first observation is that the words must grow in length. POD (j)) <K The second observation is that if the distinct words mostly come from sets that grow in size at a fixed exponential rate a. then words that are too short relative to (log n)Ice cannot cover too much. t > 1. where a is a fixed positive number. To continue with the proof of the distinct-words theorem. Given E > 0 there is an N such that if n > N and x n = w(1)w(2). The fact that the words are distinct then implies that .4. Lemma 11. w(i)gc then t(w(i)) < 2En.(. Et _ The general case follows since it is enough to estimate the sum over the too-short words that belong to G. k > K. w(t) is a partition into distinct words.4. suppose Gk C A k satisfies IGkI _< 2ak ._Exio g n)la E top(o) E Ic<(1-6)(log n)la Since IGk I <2"k the sum on the right is upper bounded by ( 2' ((1 — E)logn) —1) exp2 ((1 — E)logn)) = logn) = o(n).SECTION 11. The strong-packing lemma. let 8 be a positive number to be specified later.3 Given K and 8 > 0. The proof of this simple fact is left to the reader. Lemma 11. Lemma 11.a Proof of the distinct-words theorem. by an application of the standard bound.

this time applied to variable-length parsing. As in the proof of the empirical-entropy theorem. e. The distinct-words theorem follows.8.4 can be applied with a = h + 3 to yield aw(i)) < 25n. 3)-strongly-packed by {'Tk }. then overall code length will be shorter than entropy allows. for which s of the w(i) are too long. Suppose = w(l)w(2) w(t) is some given parsing.b Proof of the repeated-words theorem. A block of consecutive symbols in xiit of length more (h — E) -1 log n will be said to be too long. Suppose also that n is so large that. V(s). i. V(2). 3)-strongly-packed by MI. ENTROPY-RELATED PROPERTIES. let ui be the concatenation of all the words that come between V(i — 1) and V(i). property (4) holds for any parsing of . Thus. Let u1 be the concatenation of all the words that precede V(1). It will be shown that if the repeated-words theorem is false. Throughout this discussion p. r w(t) into distinct words. f(w(i))(1-15)(logn)/(h+S) provided only that n is large enough. and let .3. strong-packing implies that given a parsing x = w(l)w(2) E w(i)EU Klj t (o)) a _3)n. by Lemma 11. it into distinct words. and such that eventually almost surely xrii is (K. The idea is to code each too-long word of a repeated-word parsing by telling where it occurs in the past and how long it is (this is. since 8 could have been chosen in advance to be so small that log n (1 — 2S) log n h + c 5h+3 • 1:3 I1.x.. for any parsing P of xriz for which (4) E Sn atv(i)) — . Label these too-long words in increasing order of appearance as V(1). The first idea is to merge the words that are between the too-long words.4. E w(i)Euciz k rk t (o )) a ( l-3)n. the basic idea of the version of the Lempel-Ziv code suggested in Remark 11. for 1 < i < s + 1. and therefore Lemma 11. then a prefix-code sequence can be constructed which beats entropy by a fixed amount infinitely often on a set of positive measure. of course.4. will be a fixed ergodic process with positive entropy h and E < h will be a given positive number. the existence of good codes is guaranteed by the strong-packing lemma.150 CHAPTER II.2.' is (K.) If a good code is used on the complement of the too-long words and if the too-long words cover too much.4. 2 Suppose xr.

(b) Each filler u1 g U K r k is coded by specifying its length and applying a fixed single-letter code to each letter separately. Let 3 be a positive number to be specified later. (a) Each filler u 1 E U K Tk is coded by specifying its length and giving its index in the set Tk to which it belongs. is constructed as follows.T is a fixed finite set. then B(n) n G(n).3. us V(s)u s+i with the following two properties. a set 'Tk c A k of cardinality at most 2k(h+8) . . If x E B(n) n G(n). it is enough to prove that x g B(n). Such a representation is called a too-long representation of 4. ni ) To say V(j) = xn Since the start set . it can be supposed such that V(j) that when n is large enough no too-long word belongs to F. ni + ± imi has been seen in the past is to say that there is an index i E [0. (i) Ei f ( V (D) >En. Let B(n) be the set of all sequences x for which there is an s and a too-long representation x = uiV (1)u2 V(2) . An application of the strong-packing lemma.4. The strong-packing lemma provides the good codes to be used on the fillers.. Sequences not in B(n) n G(n) are coded using some fixed single-letter code on each letter separately. x'iz is expressed as the concatenation x Çz = u i V(1)u2V(2) us V(s)u s±i . is determined for which each too-long V(j) is seen somewhere in its past and the total length of the {V(j)} is at least En. V(s)} and fillers lui. but to make such a code compress too much a good way to compress the fillers is needed.. A code C. 8)-strongly-packed by {Tk } . (ii) Each V(j) has been seen in the past. (c) Each too-long V(j) is coded by specifying its length and the start position of its earlier occurrence.. .. with the too-long words {V(1). such that eventually almost surely xril is (K. Since .u s V(s)u s+i. The words in the too-long representation are coded sequentially using the following rules.. and therefore to prove the repeated-words theorem. Lemma 11. In this way..2. Let G(n) be the set of all xri' that are (K. to complete the proof of the repeated-words theorem it is enough to prove that if K is large enough. eventually almost surely. The idea is to code sequences in B(n) by telling where the too-long words occurred earlier.SECTION 11. 151 u s+ 1 be the concatenation of all the words that follow V(s).. PARTITIONS OF SAMPLE PATHS. provides an integer K and for a each k > K. 3)-stronglypacked by {'Tk }. a too-long representation = u1V(1)u2V(2). eventually almost surely.q E G(n) eventually almost surely.

must satisfy the bound.5. completing the proof of the repeated-words theorem. ENTROPY-RELATED PROPERTIES. E au . eventually almost surely. for all sufficiently large n.. for XF il E B(n) n G(n). then (5) uniformly for 4 . and from specifying the index of each u i E UlciK rk. ) (h + 6) + 60(n). For the fillers that do not belong to U71. eventually almost surely. is that the number s of too-long words.5 If log K < 6 K . as well as to the lemma itself. n s < i="' (h — e) < — (h — 6). which requires £(u) [h + 81 bits per word. Proof of Lemma 11. xÇ2 E B(n) n G(n). since. — log n — log n implies that E au . "ell B(n) n G(n). This fact is stated in the form needed as the following lemma. plus the two bit headers needed to tell which code is being used and the extra bits that might be needed to round up log n and h + 6 to integers.(x €)) . It is enough to show that the encoding of the fillers that do not belong to Urk . require a total of at most 60(n) bits. so that (5) and (6) yield the code-length bound < n(h + — e(e + 6)) + 0(n). For 4 E B(n) n G(n). With a one bit header to tell whether or not xri` belongs to B(n) n G(n).152 CHAPTER II. that is. which requires Flog ni bits for each of the s too-long words.. as well as the encoding of the lengths of all the words. An Elias code is used to encode the block lengths. first note that there are at most s + 1 fillers so the bound Es.4. shows that. a word is too long if its length is at least (h — E) -1 log n. and hence if 6 is small enough then £(4) < n(h — € 2 12). The lemma is sufficient to show that xrii B(n) n G(n). (6) s — av(i)) log n (h — e).C(fi` IC n ) comes from telling where each V(j) occurs in the past. By assumption. while it depends on x. which.4. indeed. by definition. Lemma 11. a prefix code S on the natural numbers such that the length of the code word assign to j is log j + o(log j). Two bit headers are appended to each block to specify which of the three types of code is being used. since it is not possible to beat entropy by a fixed amount infinitely often on a set of positive measure. tw(i) > en.C(x'11 ) = s log n + ( E ti J ELJA7=Kii B(n) n G(n). ) <K (1 + n(h log n j). the code becomes a prefix code. The key to this. the principal contribution to total code length E(xriz) .

and since x -1 log x is decreasing for x > e. once n is large enough to make this sum less than (Sn/2. that 4 is (K. PARTITIONS OF SAMPLE PATHS. the dominant terms of which can be expressed as E j )<K log t(u i ) + E t(u i )K log t(u i ) + logt(V(i)).4. the total length contributed by the two bit headers required to tell which type of code is being used plus the possible extra bits needed to round up log n and h + 8 to integers is upper bounded by 3(2s + 1) which is o(n) since s < n(h — E)/(logn).c Exercises 1. The coding proof is given here as fits in nicely with the general idea that coding cannot beat entropy in the limit. This completes the proof of Lemma 11. 2. Since it takes L(u)Flog IAI1 bits to encode a filler u g UTk.4. which is Sn.6 The original proof. the encoding of the lengths contributes EV(t(u i)) Et(eV(V(i») bits.4. of the repeated-words theorem parallels the proof of the entropy theorem. . This is equivalent to showing that the coding of the too-long repeated blocks by telling where they occurred earlier and their lengths requires at most (h — c) E ti + o(n) bits. The key bound is that the number of ways to select s disjoint subintervals of [1.. Extend the repeated-words theorem to the case when it is only required that all but the last k symbols appeared earlier. Since it can be assumed that the too-long words have length at least K. + o(n)). which is o(n) for K fixed. Finally. then the assumption that 4 E G(n). [53]. Extend the distinct-words and repeated-words theorems to the case where agreement in all but a fixed number of places is required. the total number of bits required to encode all such fillers is at most (u. In particular. E 7.•)) [log on SnFlog IA11.5.SECTION 11. 153 which is o(n) for fixed K. implies that the words longer than K that do not belong to UTk cover at most a 6-fraction of x . is upper bounded by exp((h — c) Et. using a counting argument to show that the set B(n) n G(n) must eventually be smaller than 2nh by a fixed exponential factor. Finally. The first sum is at most (log K)(s + 1). that is. since it was assumed that log K < 5K. which can be assumed to be at least e. II. the second and third sums together contribute at most log K K n to total length. 6)-strongly-packed by {Tk }. a u . n] of length Li > (log n)/ (h — 6). Remark 11. which is 50(n). since there are at most 2s + 1 words.4.

3. Carry out the details of the proof that (2) follows from the distinct-words theorem.

4. Carry out the details of the proof that (3) follows from the repeated-words theorem.

5. What can be said about distinct and repeated words in the entropy 0 case?

6. Let $\mu$ be the Kolmogorov measure of the binary, equiprobable, i.i.d. process, that is, unbiased coin-tossing.
(a) Show that eventually almost surely, no parsing of $x_1^n$ into repeated words contains a word longer than $4\log n$. (Hint: use Barron's lemma.)
(b) Show that eventually almost surely, there are fewer than $n^{1-\epsilon/2}$ words longer than $(1+\epsilon)\log n$ in a parsing of $x_1^n$ into repeated words.
(c) Show that eventually almost surely, the words longer than $(1+\epsilon)\log n$ in a parsing of $x_1^n$ into repeated words have total length at most $n^{1-\epsilon/2}\log n$.

Section II.5 Entropy and recurrence times.

An interesting connection between entropy and recurrence times for ergodic processes was discovered by Wyner and Ziv, [85]. They showed that the logarithm of the waiting time until the first $n$ terms of a sequence $x$ occurs again in $x$ is asymptotic to $nh$, in probability. An almost-sure form of the Wyner-Ziv result was established by Ornstein and Weiss, [53]; see also the earlier work of Willems, [86]. These results, along with an application to a prefix-tree problem, will be discussed in this section.

The definition of the recurrence-time function is
$$
R_n(x)=\min\{m\ge 1\colon x_{m+1}^{m+n}=x_1^n\}.
$$
The theorem to be proved is

Theorem II.5.1 (The recurrence-time theorem.) For any ergodic process $\mu$ with entropy $h$,
$$
\lim_{n\to\infty}\frac{1}{n}\log R_n(x)=h,\quad\text{almost surely.}
$$

This is a sharpening of the fact that the average recurrence time is $1/\mu(x_1^n)$, whose logarithm is asymptotic to $nh$.

Some preliminary results are easy to establish. Define the upper and lower limits,
$$
\overline{r}(x)=\limsup_{n\to\infty}\frac{1}{n}\log R_n(x),\qquad
\underline{r}(x)=\liminf_{n\to\infty}\frac{1}{n}\log R_n(x).
$$
Since $R_{n-1}(Tx)\le R_n(x)$, both the upper and lower limits are subinvariant, that is, $\overline{r}(Tx)\le\overline{r}(x)$ and $\underline{r}(Tx)\le\underline{r}(x)$.
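The recurrence time is straightforward to compute by brute force from a long enough realization, and the theorem suggests $(1/n)\log R_n(x)$ as a rough entropy estimate; a Monte Carlo refinement of this idea, averaging over random starting points, is mentioned later in the section. The sketch below is my own illustration (function name and toy data are not from the text).

```python
import math
import random

def recurrence_time(x, n):
    """R_n(x): the least m >= 1 with x_{m+1}^{m+n} = x_1^n, computed from a
    finite realization; returns None if no recurrence appears in the data."""
    target = x[:n]
    for m in range(1, len(x) - n + 1):
        if x[m:m + n] == target:
            return m
    return None

# Toy usage: for fair coin tossing, (1/n) log2 R_n should be near h = 1.
random.seed(1)
x = [random.randint(0, 1) for _ in range(200000)]
n = 12
R = recurrence_time(x, n)
if R is not None:
    print(R, math.log2(R) / n)
```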

the set D(a7) = D. so the desired result can be established by showing that r > h and F < h. n [4].. Indeed. eventually almost surely. that is. then Tx g [a7]. eventually almost surely. The bound r > h was obtained by Ornstein and Weiss by an explicit counting argument similar to the one used to prove the entropy theorem. The inequality r > h will be proved by showing that if it is not true. 7-1 D. fix (4 and consider only those x E D„ for which xri' = ari'. A code that compresses more than entropy can then be constructed by specifying where to look for the first later occurrence of such blocks.„ eventually almost surely. Note.2. eventually almost surely. which is upper bounded by 2n(h+E / 2) ." Indeed.. ENTROPY AND RECURRENCE TIMES. then eventually almost surely. = fx: R(x) > r(h+E ) }. by the definition of D(a). however. Another proof. so the iterated almost sure principle yields the desired result (1). almost surely. 2n(h+E) — 1]. if Tri = {X: t(x) > 2-n(h-i-E/2)} then x E T. The goal is to show that x g D.1. . To establish this. by Theorem 11. 2—nE/2 . The Borel-Cantelli principle implies that x g D. r < F. is outlined in Exercise 1. and hence the sequence D(a). so that (2) yields 1-1 (Dn < n Tn) _ 2n(h-1-€12)2—n(h+E) . so it is enough to prove (1) x g Dn n 7.. To establish i: < h. then there is a prefix-code sequence which beats entropy in the limit. a more direct coding construction is used to show that if recurrence happens too soon infinitely often. for j E [1. 155 hence. both are almost surely invariant. Furthermore.SECTION 11. sample paths are mostly covered by long disjoint blocks which are repeated too quickly. that if x E D(a). It is now a simple matter to get from (2) to the desired result. and hence. the cardinality of the projection onto An of D. (1). n 'T. . disjointness implies that (2) It(D. which is impossible. based on a different coding idea communicated to the author by Wyner.(4)) < 2-n(h+E). both upper and lower limits are almost surely constant. The bound F < h will be proved via a nice argument due to Wyner and Ziv. there are constants F and r such that F(x) = F and r(x) = r. is upper bounded by the cardinality of the projection of Tn . fix e > 0 and define D. n > 1. In the proof given below. 7' -2a(h+°+1 Dn(ai) must be disjoint. Since the measure is assumed to be ergodic. n T„.(4).5. . It is enough to prove this under the additional assumption that xril is "entropy typical.. Since these sets all have the same measure.

For n > m. oc.1 )u. a result stated as the following lemma. m)-too-soon-recurrent representation. for n > m.t+i is determined and successive words are coded using the following rules. m)-too-soon-recurrent representation of 4. The one bit headers on F and in the encoding of each V (j) are used to specify which code is being used. (b) The sum of the lengths of the filler words ui is at most 3 8 n. both to be specified later. such that each code word starts with a O. A one bit header to tell whether or not 4 E G(n) is also used to guarantee that C. m)-too-soon-recurrent representation 4 = u V(1)u2 V(2) • • • u V( . Here it will be shown how a prefix n-code Cn can be constructed which. for which s +k < n. It will be shown later that a consequence of the assumption that r < h — c is that (3) xÇ E G(n). a prefix code on the natural numbers such that t(E(j)) = log j + o(log j). if the following hold. where d < 2+ log l Ai. The key facts here are that log ki < i(V(j))(h — c). The code Cn uses a faithful single-letter code F: A H+ {0. For x E G(n) a (8. where k i is the distance from V(j) to its next occurrence in x. where cm --÷ 0 as m . ENTROPY-RELATED PROPERTIES. is said to be a (8. that is. X E G (n ) . (ii) Each V(j) is coded using a prefix 1. a concatenation = u V (1)u 2 V (2) • u j V (J)u j+1 .156 CHAPTER II. for suitable choice of m and S. (a) Each V(j) has length at least m and recurs too soon in 4. A block xs. The least such k is called the distance from xs to its next occurrence in i• Let m be a positive integer and 8 be a positive number. Sequences not in G(n) are coded by applying F to successive symbols.5." and that the principal contribution to total code length is E i log ki < n(h—c). 2h(t -s + 1) ) . compresses the members of G(n) too well for all large enough n. where E > O. Lemma 11. The code also uses an Elias code E. followed by E(t(V (j)). eventually almost surely.2 (4) f(xii) < n(h — E) n(3dS + ot ni ). by the definition of "too-soon recurrent.' of s lik k for consecutive symbols in a sequence 4 is said to recur too soon in 4 if xst = x` some k in the interval [ 1 . is a prefix n-code. To develop the covering idea suppose r < h — E. 1}d. (i) Each filler word u i is coded by applying F to its successive symbols. followed by e(k i ). Let G(n) be the set of all xi' that have a (8.

The Elias encoding of the length t(V(j)) requires log t(V(D) o(log f(V(j))) bits. To carry it out. Since each V(j) is assumed to have length at least m.5. This is a covering/packing argument. bits. = 2/3.n --* 0 as m cc.5. The one bit headers on the encoding of each V(j) contribute at most n/ m bits to total code length. .8.u(G(n)) must go to 0.2. there is an integer I = I (n. each of length at least m and at most M. In summary. R n (x) = r there is an M such that n=M ." Summing the first term over j yields the principal term n(h — E) in (4). where an. establishing the recurrence-time theorem. Summing on j < n/ m. as m Do. contradicting the fact (3) that 4 E G(n). But too-good codes do not exist. a fact that follows from the assumption that 0 r <h — E. define B n = (x: R(x) < 2 n(h—E) ). for which the following hold. there are at most n/ m such V(j) and hence the sum over j of the second term in (5) is upper bounded by nI3. the complete encoding of the too-soon recurring words V(j) requires a total of at most n(h — E) nun. + (1 ± log m)/m 0. so that .C(x'ii) < n(h — 6/2) on G(n). since (1/x) log x is decreasing for x > e. n].. eventually almost surely. for B = n=m Bn' The ergodic theorem implies that (1/N) Eni Bx (Ti-i X ) > 1 . 157 The lemma implies the desired result for by choosing 6 small enough and m large enough.SECTION 11. by the definition of "too-soon recurrent. Proof of Lemma 11. Since lim inf. by property (a) of the definition of G(n). The Elias encoding of the distance to the next occurrence of V(j) requires (5) f(V (j))(h — E) o(f(V(j))) bits. eventually almost surely. the second terms contribute at most Om bits while the total contribution of the first terms is at most Elogf(V(D) n logm rn . m]: i < of disjoint subintervals of [1.2. for each n > 1.5. x) and a collection {[n i .„ where /3. assume r < h — E and. for fixed m and 6. and the total length of the fillers is at most 36n. the lemma yields . eventually almost surely. Finally. and hence the packing lemma implies that for almost every x. for all large enough n. Thus the complete encoding of all the fillers requires at most 3d6n bits. the encoding of each filler u i requires dt(u i ) bits. there is an N(x) > M/8 such that if n > (x). ENTROPY AND RECURRENCE TIMES. Thus r > h. 0 It remains to be shown that 4 E G(n). This completes the proof of Lemma 11.u(B) > 1 — 6.

n. provided n is large.5.7. Remark 11. independently at random in Let k [1. .. W(2) = 101. where underlining is used to indicate the respective prefixes. called the n-th order prefix tree determined by x. and can serve. eventually almost surely.. x) be the shortest prefix of Tx i. = - CHAPTER II. as quick way to test the performance of a coding algorithm. itl. it defines a tree. algorithms for encoding are often formulated in terms of tree structures. The prefix tree 'T5(x) is shown in Figure 1. for example. This completes the proof that 4 E G(n). In particular.T4x = 10010010. so that if J is the largest index i for which ni + 2A1(h— E) < n. x 001010010010 . This works reasonably well for suitably nice processes. 'Tn (x). select indices fit.9. Since by definition. T 2x = 1010010010 . the total length of the V(j) is at least (1 —38)n. The sequence x`il can then be expressed as a concatenation (6) xn i = u i V(1)u2V(2)• • • u. For example. . W(1) = 0101. it can be supposed that n > 2m(h— E ) IS.5. page 72. . for any j W(i): i = 0. . T 2x. below.. W(3) = 0100. the simplest form of Lempel-Ziv parsing grows a subtree of the prefix tree. defines a tree as follows. Closely related to the recurrence concept is the prefix tree concept.a Entropy and prefix trees. let W(i) = W(i.158 (a) For each i < I el' n.3 The recurrence-time theorem suggests a Monte Carlo method for estimating entropy.7. where a < 1. condition (b) can be replaced by (c) > < (m — n +1) > (1 — 33)n. and where each u i is a (possibly empty) word. a log n. T x = 01010010010 . to propose that use of the full prefix tree . T 3x = 010010010 . and thereby finishes the proof of the recurrence-time theorem. x E G (n). A sequence X E A" together with its first n — 1 shifts. the set which is not a prefix of Px. Prefix trees are used to model computer storage algorithms.. . where each V(j) = xn 'n./ V(J)u j+i. 1. and take (11 kt) Et. ni + 2(ni i —n. Xi :Fnil—n ` 9 for some j E [ni + 1. that is. In addition. W(4) = 100.Tn —l x. as tree searches are easy to program. This led Grassberger. II. (b) (mi — ni + 1) > (1 — 25)n. recurs too soon. Tx. ±1)(h—E) ]. if it is assumed that r < h — E. n]. 0 < j < n. as indicated in Remark 11. W(0) = 00. In practice.5. n — 11 is prefix-free. For each 0 < i < n. for fixed m and 3. • • • .. [16]. log Rk(T i l X) as an estimate of h. in which context they are usually called suffix trees. ENTROPY-RELATED PROPERTIES.

for any E > 0. mixing Markov processes and functions of mixing Markov processes all have finite energy. where. is an ergodic process of entropy h > 0.5. is the following theorem.5. by definition of L(x). Theorem 11. n]. for all t > 1. This is equivalent to the string matching bound L(4) = 0(log n).) If p. case. for all L > 1." and is a frequently used technique in image compression. processes. (mod 2) where {Yn } is an arbitrary binary ergodic process and {Z n } is binary i.. (Adding noise is often called "dithering. see Exercise 2. has finite energy if there is are constants c < 1 and K such that kxt-F1 i) < Kc L .d.5. L(x) is the length of the longest string that appears twice in x7.) The mean length is certainly asymptotically almost surely no smaller than h log n.i. the finite energy processes. 159 might lead to more rapid compression. such that x: +k-2 = There is a class of processes. c) such that for n > N.i. and for all xl +L of positive measure.i. as earlier. j i. there is a j E [1. eventually almost surely. such as. and hence lim n -+cx) E n log n i=1 1 1 (x) = . eventually almost surely. namely.E)-trimmed mean of a set of cardinality M is obtained by deleting the largest ME/2 and smallest ME/2 numbers in the set and taking the mean of what is left.d.E)-trimmed mean of the set {L(x)} is almost surely asymptotic to log n/h. namely. at least in probability.5 (The finite-energy theorem. then for every c > 0 and almost every x. as it reflects more of the structure of the data. and some other cases. While this is true in the i. the binary process defined by X n = Yn Z. noise to a given ergodic process. An ergodic process p.5. however. The (1 . and independent of {Yn }. since if L7(x) = k then.SECTION 11. in general. almost surely. Another type of finite energy process is obtained by adding i.5. He suggested that if the depth of node W(i) is the length L7(x) = f(W(i)). there is an integer N = N(x. the (1 . Note that the prefix-tree theorem implies a trimmed mean result. eventually almost surely. . however. all but en of the numbers L7(x)/ log n are within c of (11h). see Theorem 11. ENTROPY AND RECURRENCE TIMES.4 (The prefix-tree theorem.i. h The i. What is true in the general case. Theorem 11.d. eventually almost surely.) If p. it is not true in the general ergodic case. The {L } are bounded below so average depth (1/n) E (x) will be asymptotic to log n1 h in those cases where there is a constant C such that max L7 (x) < C log n. for which a simple coding argument gives L(x) = 0(log n).d. for example. is an ergodic finite energy process with entropy h > 0 then there is a constant C such that L(4) < C log n. which kills the possibility of a general limit result for the mean. then average depth should be asymptotic to (1/h) log n. there is no way to keep the longest W(i) from being much larger than 0(log n). One type of counterexample is stated as follows.

There is a stationary coding F of the unbiased coin-tossing process kt. each of which produces a different algorithm. define Tk (8) = (x) > . For E > 0. (8) 1L(x. the tree grows to a certain size then remains fixed. where the next word w(i).160 Theorem 11.2.4. I U(x . From that point on. II. and is a consequence of the fact that.6 CHAPTER II. Remark 11. n) are distinct. 88] for some finite version results. (9) The fact that most of the prefixes cannot be too short. Theorem 11. is a consequence of the recurrence-time theorem. Subsequent compression is obtained by giving the successive indices of these words in the fixed tree. n) = fi E [0. several choices are possible. for some j < i and a E A. see [87. The remainder of the sequence is parsed into words.5. Theorem 11. such that vnrfo l Ltilk (x ) = oo.2. lim (7) nk log nk with respect to the measure y = o F-1 . whose predecessor node was labeled w(j). n — I]: L7(x) < (1— E)(logn) I h} U (x .1 Proof of the prefix-tree theorem. n) = fi E [0. (8). Section 11. almost surely.5.) The limitations of computer memory mean that at some stage the tree can no longer be allowed to grow. (See Exercise 3. To control the location of the prefixes an "almost uniform eventual typicality" idea will be needed. In the simplest of these.a. each of which is one symbol shorter than the shortest new word. n)I < En. The fact that most of the prefixes cannot be too long. For S > 0 and each k. if it has the form w(i) = w(j)a. for the longest prefix is only one longer than a word that has appeared twice. and an increasing sequence {nk}. n — I]: (x) > (1 ± c)(log n)I hl. define the sets L(x. mentioned in Remark 11.10. Such finite LZ-algorithms do not compress to entropy in the limit. eventually almost surely. Prefix trees may perform better than the LZW tree for they tend to have more long words. which means more compression. is assigned to the node corresponding to a. for fixed n.1. The prefix-tree and finite-energy theorems will be proved and the counterexample construction will be described in the following subsections. ENTROPY-RELATED PROPERTIES. grows a sequence of IA I-ary labeled trees.1.5. Performance can be improved in some cases by designing a finite tree that better reflects the actual structure of the data. It is enough to prove the following two results. n)I < En. (9).5. but do achieve something reasonably close to optimal compression for most finite sequences for nice classes of sources. eventually almost surely. in which the next word is the shortest new word. the W (i. can be viewed as an overlapping form of the distinct-words theorem.7 The simplest version of Lempel-Ziv parsing x = w ( 1)w(2) • • •. The details of the proof that most of the words cannot be too short will be given first.

5. fix c > O.s. This. if necessary. x) are distinct members of the collection 'Tk (6) which has cardinality at most 2k(h+8) . a. Section 1. There are at most 2k(h+3) such indices because the corresponding W (i. (See Exercise 7. except for the final letter. k—> cx) k and when applied to the reversed process it yields the backward result. the i-fold shift Tx belongs to G(M). 1 urn . an index i < n will be called good if L(x) > M and W(i) = W(i. Section 1. Suppose k > M and consider the set of good indices i for which L7(x) = k. Thus if n > N(x) is sufficiently large and S is sufficiently small there will be at most cnI2 good indices i for which L'i'(x) < (1— €)(log n)I h. Both a forward and a backward recurrence-time theorem will be needed. Thus by making the N(x) of Lemma 11. X m+ m+k i— min frn > 1: X1_ k+1 = X — ° k+11 The recurrence-time theorem directly yields the forward result. k—>cc k since the reversed process is ergodic and has the same entropy as the original process. there is an M such that the set 4 4 G(M) = ix: 4 E 7i. This result is summarized in the following "almost uniform eventual typicality" form. Hence. combined with the bound on the number of non-good indices of the preceding paragraph.n. then it has too-short return-time. n. The corresponding recurrence-time functions are respectively defined by Fk(X) = Bk(X) = min {in> 1 •. (8).(8). n). Lemma 11. Next the upper bound result. eventually almost surely. To proceed with the proof that words cannot be too short. Since the prefixes W(i) are all distinct there are at most lAl m indices i < n such that L7(x) < M. for any J > M.log Bk (x) = h. will be established by using the fact that. x) E U m rk ( .8 For almost every x there is an integer N(x) such that if n > N(x) then for all but at most 2 6 n indices i E [0. If a too-long block appears twice.8 larger..s. The ergodic theorem implies that. the shift Tx belongs to G(M) for all but at most a limiting 6-fraction of the indices i. each W(i.. for almost every x.) These limit results imply that .x) appears at least twice. (9).6 and Exercise 11.5.7.5. 161 _ 2k(h+5) and so that I2k(6)1 < E Tk (6 ). The recurrence-time theorem shows that this can only happen rarely. n. which can be assumed to be less than en/2.SECTION 11. 1 lim — log Fk (x) = h. a. it can be supposed that if n > N(x) then the set of non-good indices will have cardinality at most 3 6 n. there are at most (constant) 2 J (h+) good indices i for which M < L < J. for all k > MI has measure greater than 1 — S. by the entropy theorem. eventually almost surely. completes the proof that eventually almost surely most of the prefixes cannot be too short. Since E (3 ). ENTROPY AND RECURRENCE TIMES. For a given sequence x and integer n.

II. and log Bk (x) > kh(1+ c /4) -1 . and Tx E B. which asserts that no prefix code can asymptotically beat the Shannon code by more than 210g n. i < j. (b) k +1 < (1+ c)log n/ h. L7 > (1+ e) log n/ h. If i +k < n. . Fix such an x and assume n > N(x).1]. and two sets F and B. and hence.7. each of measure at most en/4. This completes the proof of the prefix-tree theorem. Condition (b) implies that L7 -1 > k. Fk (T ix) < n. x) as the shortest prefix of Tx that is not a prefix of any other Ti x. (10) £(C(4)) + log . that is. Lemma 5. n. that is.u(x'11 ) -21ogn. for at most 6n/3 indices i E [0. n).162 CHAPTER II. there is an index j E [0. By making n larger. the k-block starting at i recurs in the future within the next n steps. such that j i and j+k -1 =x me two cases i < j and i > j will be discussed separately. if necessary. see Remark 1. namely.13. n . A Shannon code for a measure p u is a prefix code such that E(w) = log (w)1. Thus. A version of Barron's bound sufficient for the current result asserts that for any prefix-code sequence {C}. for each w. there is a K. k which means that Tx E F. it can be supposed that both of the following hold. Here "optimal code" means Shannon code. Case 2. The idea of the proof is suggested by earlier code constructions. n). if i E Uk (x. x F. n. if n also satisfies k < en/3.a. by the definition of W (i.2 Proof of the finite-energy theorem. then log Fk (x) > kh(1 + /4) -1 . eventually almost surely.1]. which means that 7 k) x E B. ENTROPY-RELATED PROPERTIES. this means that Bk(T i+k x) < n. (a) K < (1+ 6/2) log n/ h. Assume i E U(x.n) then T i E F. An upper bound on the length of the longest repeated block is obtained by comparing this code with an optimal code. The ergodic theorem implies that for almost every x there is an integer N(x) such that if n> N(x) then Tx E F. and use an optimal code on the rest of the sequence. such that if k > K. for at most en/3 indices i E [0. 7x E B.log Fk(T i x) < — < h(1 + E/4) -1 . x V B. n)I < en. or i + k < n. Let k be the least integer that exceeds (1 + e/2) log n/ h. The key to making this method yield the desired bound is Barron's almost-sure code-length bound. In this case.5. then I U (x . In summary. j < i. code the second occurrence of the longest repeated block by telling its length and where it was seen earlier. for j < n. so that 1 logn . Case 1.

(b) .a. The prefix property is guaranteed by encoding the lengths with the Elias code e on the natural numbers. . that is. The block xi is encoded by specifying its length t. o F -1 will then satisfy the conditions of Theorem 11. total code length satisfies 3 log n + log 1 1 + log p(xl) kt(4A-L+11xi+L)' where the extra o(logn) terms required by the Elias codes and the extra bits needed to roundup the logarithms to integers are ignored. li m _sxu. The process v = p. The final block xtn+L±I is encoded using a Shannon code with respect to the conditional measure Ix ti +L ). there is a measurable shift invariant function F: {0. and then again at position t 1 > s in the sequence xÇ. the probability it (4) is factored to produce log (4) = log ti ) log A(xt v. (7) holds. for each n and each a E An. The block xtr+t is encoded by specifying s and L. with probability greater than 1 — 2-k . II. since c <1.3 A counterexample to the mean conjecture. almost surely. Since t. given by unbiased coin-tossing. that is. x ts 0+ log 2(4+L+1 1xi+L ). It will be shown that given E > 0. To compare code length with the length of the Shannon code of x with respect to it. 0 < j < n k 3 I4 . L7(x) > nk and hence. Let it denote the measure on two-sided binary sequences {0. 1} z {0. then using a Shannon code with respect to the given measure A. there is an i in the range 0 < i < nk — n k 314 . a prefix code for which t(e(n)) = log n + o(log n). (a) If y = F(x) then. such that yi+i = 0.u(fx: xo (F(x)01) < E. 1} z .p — log c n log n This completes the proof of the finite-energy theorem.6. t 3 log n + log p(xt+1 s Thus Barron's lemma gives 1 lxf)> —2 log n.5. as they are asymptotically negligible.5. s. the shift-invariant measure defined by requiring that p(aii) = 2. for property (a) guarantees that eventually almost surely L7k (x) > n:I8 for at least 4/8 indices i < nk .5. ENTROPY AND RECURRENCE TIMES. Suppose a block of length L occurs at position s. 5/4 so that Ei<n. 1} z and an increasing sequence {n k } of positive integers with the following properties. 163 The details of the coding idea can be filled in as follows.SECTION 11. This is now added to the code length (11) to yield t-I-L . and L must be encoded. yields to which an application of the finite energy assumption p(x:+ L(4) 5 < . 3 log n + log p(x t + + IL lxi) < Kc L .
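The text uses only the length property of the Elias code, namely that a positive integer n can be prefix-encoded with log n + o(log n) bits. The sketch below shows two standard Elias-type codes; the delta code has the stated length property, while the gamma code is shown because the delta code is built from it. Which specific Elias code the text intends is not specified, so this is an assumption made for illustration.

```python
def elias_gamma(n):
    """Elias gamma code: unary length prefix then binary; 2*floor(log2 n) + 1 bits."""
    b = bin(n)[2:]                      # binary representation of n >= 1
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    """Elias delta code: gamma-code the bit length, then the remaining bits.
    Its length is log2(n) + O(log log n), i.e., log n + o(log n)."""
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

for n in (1, 2, 13, 1000):
    print(n, elias_gamma(n), elias_delta(n))
```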

which also contains the counterexample construction given here. blocks of O's of length 2n k i l2 were created.u of entropy h.d. but it is added to emphasize that such a y can be produced by making an arbitrarily small limiting density of changes in the sample paths of the i. as an illustration of the blockto-stationary coding method.c.) 3. II. A function f: An B* will be said to be conditionally invertible if . (Hint: L (4n ) < K log n implies max L7(x) < K log n.5. that r > h is based on the following coding idea. eventually almost surely. In that example.xn = yn 00 whenever f (xn .8. (a) Show that lim infn t(f(x n „))/ n > h. after its discovery it was learned that the theorem was proved using a different method in [31].i.9 The prefix tree theorem was proved in [72]. then max i 14 (x) = 0 (log n). Replacing 2n k 1/2 by nk 3/4 will produce the coding F needed here.q and apply the preceding result. discussed in Section I.164 CHAPTER II.5. (c) Define f (x" cc ) to be the minimum m > n for which xim niV = .0 ) = f (y" co ) and x 0 y° . ENTROPY-RELATED PROPERTIES. . the entropy of y can be as close to log 2 as desired.(B„ n cn ix°) is finite. eventually almost surely. process a.(4 Ix ° ) > 2-n ( H—E) and Cn = {X n c„) : f (xn 00 )) < 2n (H-2€ ) 1. in particular. 2. = fx" o„: p. (d) The preceding argument establishes the recurrence result for the reversed process. (Hint: let B. hence. Show that this implies the result for ft.) (b) Deduce that the entropy of a process and its reversed process are the same. for any ergodic measure . Another proof. Property (b) is not really needed. Show that dithering an ergodic process produces a finite-energy process. Show that if L(x) = 0(log n). Remark 11. due to Wyner. almost surely. The asymptotics of the length of the longest and shortest word in the Grassberger tree are discussed in [82] for processes satisfying memory decay conditions. 1. . The coding F can be constructed by only a minor modification of the coding used in the string-matching example. The coding proof of the finite-energy theorem is new. where H is the entropy-rate of it and show that E p.b Exercises.

Chapter III
Entropy for restricted classes.

Section III.1 Rates of convergence.

In this section mu denotes a stationary, ergodic process with alphabet A, with mu_k denoting the projection onto A^k defined by mu_k(a_1^k) = mu({x: x_1^k = a_1^k}). The k-th order empirical distribution, or k-type, is the measure p_k = p_k(·|x_1^n) on A^k defined by

   p_k(a_1^k|x_1^n) = |{i in [1, n-k+1]: x_i^{i+k-1} = a_1^k}| / (n-k+1),   a_1^k in A^k.

The distance between mu_k and p_k will be measured using the distributional (variational) distance, defined by

   |mu_k - p_k| = Sum over a_1^k of |mu_k(a_1^k) - p_k(a_1^k|x_1^n)|.

The ergodic theorem implies that if k and epsilon are fixed then

   mu({x_1^n: |mu_k - p_k(·|x_1^n)| >= epsilon}) -> 0,  as n -> infinity.

Likewise, the entropy theorem implies that

   mu_n({x_1^n: 2^{-n(h+epsilon)} <= mu_n(x_1^n) <= 2^{-n(h-epsilon)}}) -> 1.

The rate of convergence problem is to determine the rates at which these convergences take place. Many of the processes studied in classical probability theory, such as i.i.d. processes, Markov processes, and finite-state processes, have exponential rates of convergence for frequencies of all orders and for entropy, but some standard processes, such as renewal processes, do not have exponential rates. In the general ergodic case no uniform rate is possible: given any convergence rate for frequencies or for entropy, there is an ergodic process that does not satisfy the rate. These results will be developed in this section.

A mapping r_k: R+ x Z+ -> R+, where R+ denotes the positive reals and Z+ the positive integers, is called a rate function for frequencies for a process mu if for each fixed k and epsilon > 0,

   mu_n({x_1^n: |mu_k - p_k(·|x_1^n)| >= epsilon}) <= r_k(epsilon, n).
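As a concrete illustration of these definitions, the following sketch (in Python, with invented function names and an arbitrary binary sample path) computes the empirical k-block distribution p_k(·|x_1^n) and its variational distance to a given distribution mu_k.

```python
from collections import Counter

def empirical_k_blocks(x, k):
    """Overlapping k-block distribution p_k(.|x_1^n): counts of x[i:i+k], normalized."""
    n = len(x)
    counts = Counter(x[i:i + k] for i in range(n - k + 1))
    total = n - k + 1
    return {block: c / total for block, c in counts.items()}

def variational_distance(p, q):
    """|p - q| = sum over blocks of |p(block) - q(block)|."""
    keys = set(p) | set(q)
    return sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in keys)

x = "0110100110010110" * 50                                  # an illustrative sample path
p2 = empirical_k_blocks(x, 2)
q2 = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}        # 2-block distribution of fair coin-tossing
print(variational_distance(p2, q2))
```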

An ergodic process is said to have exponential rates for frequencies if there is a rate function such that for each epsilon > 0 and k, (-1/n) log r_k(epsilon, n) is bounded away from 0. (The determination of the best value of the exponent is known as "large deviations" theory; see [32] for a discussion of this topic. While ideas from that theory will be used, it will not be the focus of interest here.)

Remark III.1.1
An unrelated concept with a similar name, called speed of convergence, is concerned with what happens in the ergodic theorem when (1/n) Sum_i f(T^i x) is replaced by (1/a_n) Sum_i f(T^i x), for some unbounded, nondecreasing sequence {a_n}, [22, 62, 58].

A mapping r: R+ x Z+ -> R+ is called a rate function for entropy for a process mu of entropy h, if for each epsilon > 0,

   mu({x_1^n: 2^{-n(h+epsilon)} <= mu_n(x_1^n) <= 2^{-n(h-epsilon)}}) >= 1 - r(epsilon, n).

An ergodic process has exponential rates for entropy if it has a rate function for entropy such that for each epsilon > 0, (-1/n) log r(epsilon, n) is bounded away from 0.

Exponential rates of convergence for i.i.d. processes will be established in the next subsection, then extended to Markov and other processes that have appropriate asymptotic independence properties. Examples of processes without exponential rates, and even more general rates, will be discussed in the final subsection.

III.1.a Exponential rates for i.i.d. processes.

In this section the following theorem will be proved.

Theorem III.1.2 (Exponential rates for i.i.d. processes.)
If mu is i.i.d., then it has exponential rates for frequencies and for entropy.

The theorem is proved by establishing the first-order case. The extension to k-th order frequencies reduces the overlapping-block problem to the nonoverlapping-block problem, which is treated by applying the first-order result to the larger alphabet of blocks. Exponential rates for entropy follow immediately from the first-order frequency theorem, for the fact that mu is a product measure implies that mu_n(x_1^n) is a continuous function of p_1(·|x_1^n).

The first-order theorem is a consequence of the bound given in the following lemma, a combination of a large deviations bound due to Hoeffding and Sanov and an inequality of Pinsker.

Lemma III.1.3 (First-order rate bound.)
There is a positive constant c such that

(1)   mu({x_1^n: |p_1(·|x_1^n) - mu_1| >= epsilon}) <= (n+1)^{|A|} 2^{-n c epsilon^2},

for any finite set A, for any i.i.d. process mu with alphabet A, for any n, and for any epsilon > 0.

Proof. A direct calculation, using the fact that mu is a product measure, produces

(2)   mu(x_1^n) = Prod over a in A of mu(a)^{n p_1(a|x_1^n)} = 2^{-n(H(p_1) + D(p_1||mu_1))},

where H(p_1) = H(p_1(·|x_1^n)) = - Sum_a p_1(a|x_1^n) log p_1(a|x_1^n) is the entropy of the empirical 1-block distribution, and

   D(p_1||mu_1) = Sum_a p_1(a|x_1^n) log [p_1(a|x_1^n)/mu_1(a)]

is the divergence of p_1(·|x_1^n) relative to mu_1.
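The identity (2) is an exact algebraic fact about product measures and is easy to check numerically. The sketch below (illustrative names; any short sample path and any strictly positive marginal will do) computes H(p_1) and D(p_1||mu_1) for the empirical 1-block distribution and compares mu(x_1^n) with 2^{-n(H(p_1)+D(p_1||mu_1))}.

```python
from math import log2
from collections import Counter

def first_order_type_exponent(x, mu1):
    """Return H(p1) and D(p1||mu1) for the empirical 1-block distribution p1 of x."""
    n = len(x)
    p1 = {a: c / n for a, c in Counter(x).items()}
    H = -sum(p * log2(p) for p in p1.values())
    D = sum(p * log2(p / mu1[a]) for a, p in p1.items())
    return H, D

def product_measure_prob(x, mu1):
    """mu(x_1^n) for an i.i.d. measure with marginal mu1."""
    prob = 1.0
    for a in x:
        prob *= mu1[a]
    return prob

x = "aababbabaaabbaba"                  # arbitrary sample path
mu1 = {"a": 0.6, "b": 0.4}              # arbitrary strictly positive marginal
H, D = first_order_type_exponent(x, mu1)
n = len(x)
# the two printed numbers agree up to floating-point rounding
print(product_measure_prob(x, mu1), 2 ** (-n * (H + D)))
```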

The desired result will follow from an application of the theory of type classes discussed in Section I.6. The type class of x_1^n is the set

   T(x_1^n) = {y_1^n: p_1(·|y_1^n) = p_1(·|x_1^n)}.

The two key facts, established in Section I.6, are

(a) |T(x_1^n)| <= 2^{n H(p_1)}.

(b) There are at most (n+1)^{|A|} type classes.

Fact (a) in conjunction with the product formula (2) produces the bound

   mu(T(x_1^n)) <= 2^{-n D(p_1||mu_1)}.

The bad set B(n, epsilon) = {x_1^n: |p_1(·|x_1^n) - mu_1| >= epsilon} can be partitioned into disjoint sets of the form B(n, epsilon) intersected with T(x_1^n), hence the bound on the measure of the type classes, together with fact (b), produces the bound

(3)   mu(B(n, epsilon)) <= (n+1)^{|A|} 2^{-n D*},

where D* = min {D(p_1||mu_1): |p_1(·|x_1^n) - mu_1| >= epsilon}. Since D(P||Q) >= |P - Q|^2 / (2 ln 2) is always true, see Pinsker's inequality, Exercise 6, Section I.6, it follows that D* >= epsilon^2 / (2 ln 2), which completes the proof of Lemma III.1.3.

Lemma III.1.3 gives the desired exponential-rate theorem for first-order frequencies. The extension to k-th order frequencies is carried out by reducing the overlapping-block problem to k separate nonoverlapping-block problems. To assist in this task some notation and terminology will be developed.

For n >= 2k define integers t = t(n, k) and r in [0, k) such that n = tk + k + r. For each s in [0, k-1], the sequence x_1^n can be expressed as the concatenation

(4)   x_1^n = w(0) w(1) ··· w(t) w(t+1),

where the first word w(0) has length s < k, the next t words, w(1), ..., w(t), all have length k, and the final word w(t+1) has length n - tk - s. This is called the s-shifted, k-block parsing of x_1^n. The sequences x_1^n and y_1^n are said to be (s, k)-equivalent if their s-shifted, k-block parsings have the same (ordered) set {w(1), ..., w(t)} of k-blocks.
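Fact (a) is the standard multinomial bound on the size of a type class. A quick numeric check (the letter counts below are an arbitrary example):

```python
from math import comb, log2

def type_class_size(counts):
    """Number of sequences with the given letter counts (the size of the type class)."""
    n = sum(counts)
    size = 1
    for c in counts:
        size *= comb(n, c)
        n -= c
    return size

counts = [6, 3, 1]                      # a length-10 sequence over a 3-letter alphabet
n = sum(counts)
H = -sum((c / n) * log2(c / n) for c in counts)
print(type_class_size(counts), 2 ** (n * H))    # the first number is at most the second
```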

The (s, k)-equivalence class of x_1^n is denoted by S_k(x_1^n, s). The s-shifted (nonoverlapping) empirical k-block distribution p_k^s = p_k^s(·|x_1^n) is the distribution on A^k defined by

   p_k^s(a_1^k|x_1^n) = (1/t) |{j in [1, t]: w(j) = a_1^k}|,

where x_1^n is given by (4). It is, of course, constant on the (s, k)-equivalence class of x_1^n. The overlapping-block measure p_k is (almost) an average of the measures p_k^s, where "almost" is needed to account for end effects. The result is summarized as the following lemma.

Lemma III.1.4 (Overlapping to nonoverlapping lemma.)
Given epsilon > 0 there is a gamma > 0 such that if k/n < gamma and |p_k(·|x_1^n) - mu_k| >= epsilon, then there is an s in [0, k-1] such that |p_k^s(·|x_1^n) - mu_k| >= epsilon/2.

Proof. Left to the reader.

If k is fixed and n is large enough the lemma yields the containment relation

(5)   {x_1^n: |p_k(·|x_1^n) - mu_k| >= epsilon}  is contained in  Union over s = 0, ..., k-1 of {x_1^n: |p_k^s(·|x_1^n) - mu_k| >= epsilon/2}.

Fix such a k and n, and fix s in [0, k-1]. The fact that mu is a product measure implies

(6)   mu(S_k(x_1^n, s)) = Prod over i = 1, ..., t of mu_k(w(i)),

where S_k(x_1^n, s) is the (s, k)-equivalence class of x_1^n, so that if B denotes A^k and nu denotes the product measure on B^t defined by the formula nu(w_1^t) = Prod_i mu_k(w_i), w_i in B, then

(7)   mu({x_1^n: |p_k^s(·|x_1^n) - mu_k| >= epsilon/2}) <= nu({w_1^t: |p_1(·|w_1^t) - nu_1| >= epsilon/2}).

The latter, by the first-order result applied to the measure nu and super-alphabet B = A^k, is upper bounded by

(8)   (t+1)^{|A|^k} 2^{-t c epsilon^2 / 4}.

If k and epsilon > 0 are fixed, the logarithm of the right-hand side is asymptotic to -t c epsilon^2 / 4 ~ -n c epsilon^2 / 4k, hence the bound decays exponentially in n. The containment relation, (5), then provides the desired k-block bound

(9)   mu({x_1^n: |p_k(·|x_1^n) - mu_k| >= epsilon}) <= k (t+1)^{|A|^k} 2^{-t c epsilon^2 / 4}.

This completes the proof of Theorem III.1.2.
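Lemma III.1.4 is easy to visualize computationally: up to end effects, the overlapping k-block distribution is the average over s of the shifted nonoverlapping ones. A small check (illustrative sample path and function names):

```python
from collections import Counter

def overlapping_dist(x, k):
    n = len(x)
    c = Counter(x[i:i + k] for i in range(n - k + 1))
    return {b: v / (n - k + 1) for b, v in c.items()}

def shifted_nonoverlapping_dist(x, k, s):
    """p_k^s(.|x): distribution of the blocks w(j) = x[s + (j-1)k : s + jk]."""
    t = (len(x) - s) // k
    c = Counter(x[s + j * k: s + (j + 1) * k] for j in range(t))
    return {b: v / t for b, v in c.items()}

x = "0110100110010110" * 20
k = 3
avg = Counter()
for s in range(k):
    for b, v in shifted_nonoverlapping_dist(x, k, s).items():
        avg[b] += v / k
p = overlapping_dist(x, k)
# essentially zero here; in general the two differ only by end effects
print(max(abs(avg[b] - p.get(b, 0.0)) for b in avg))
```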

u E Ag. Lemma 111. irreducible Markov chains have exponential rates.1. RATES OF CONVERGENCE.1. For n > 2k + g. proof. Section 1.d.u(w). for computing Markov probabilities yields E w uvw ) t(v)=g mg 7r(b) The right-hand side is upper bounded by *(g).0_<r<k-i-g.7 If kt is ilf-mixing then it has exponential rates for frequencies of all orders.u(uvw) _<111(g)A(u).1.i. The entropy part follows immediately from the second-order frequency result. The theorem for frequencies is proved in stages. The *-mixing concept is stronger than ordinary mixing. u.1.d. process is *-mixing. g) such that (10) n=t(k+g)±(k+g)-Er. An i.6 An aperiodic Markov chain is ip-mixing. w E A*.) If p. Fix a gap g. with Ili(g) 1. The formula (10). define t = t (n. 169 III. Proof Let M be the transition matrix and 7r(-) the stationary vector for iii.u(w). A process is 11 f-mixing if there is a nonincreasing sequence {*(g)} such that *(g) 1 as g —± oc. Proof A simple modification of the i. since p. then it has exponential rates for frequencies and for entropy. This proves the lemma.b The Markov and related cases. it will be shown that periodic. and let b be the first symbol of w. First it will be shown that aperiodic Markov chains satisfy a strong mixing condition called *-mixing. k. which allows the length of the gap to depend on the length of the past and future. so as to allow gaps between blocks.SECTION 111. Then it will be shown that -*-mixing processes have exponential rates for both frequencies and entropy.2. let a be the final symbol of u. Theorem 111. is all that is needed to establish this lemma.(4) is a continuous function of pi e 14) and p2( • 14). as g —> oc. where i/î (g)= maxmax a Mg ab (b) • b The function *(g) —> 1.5 (Exponential rates for Markov sources. and such that . is an ergodic Markov source.l. since the aperiodicity assumption implies that iti`cfb 7r(b). .u(u). LI The next task is to prove Theorem 111.i. Finally. It requires only a bit more effort to obtain the mixing Markov result. Exponential rates for Markov chains will be established in this section. Fix positive integers k and g.

s.g (c4) = pisc. the s-shifted.170 CHAPTER III. The proof of Theorem 111.7. and establishes the exponential rates theorem for aperiodic Markov chains. where "almost" is now block measure Pk is (almost) an average of the measures p k gaps. k-block parsing of 4. g-gapped.d. k.i. and if pke > E.. then there is an s E [0. w(t)} of k-blocks.7 is completed by using the 1i-mixing property to choose g so large that (g) < 2 218 . constant on the (s.g(t)w(t)w(t + 1). (14: Ipke placed by the bound (k + g)[*(g)]t (t + 010 2 -tc€ 2 14 (14) — /41 ED is re- assuming. The logarithm of the bound in (14) is then asymptotically at most —tce 2 18n —c€ 2 /8(k + g). of course.8 holds.8 (Overlapping to nonoverlapping with gaps. there is a y > 0 and a K > 0. if k > K. g)-equivalent if their s-shifted. The upper bound (8) on Ips k ( 14) —lkI€121) is replaced by the upper bound (13) [i (g)]t(t 1)1111 4 C E 2 I4 and the final upper bound (9) on the probability p. . with gap g. the sequence 4 can be expressed as the concatenation (11) x = w(0)g(1)w(1)g(2)w(2). k + g — 1] such that — 114. proof adapts to the Vf-mixing case. for 1 < j < t. g)• The s-shifted (nonoverlapping) empirical k-block distribution. Lemma 111..1. where w(0) has length s < k + g. If k is large relative to gap needed to account both for end effects and for the length and n is large relative to k. For each s E [0. with gap g. k. . hence the bound decays exponentially. k + g — 1]. k-block parsing of x. E [1. provided only that k is large enough. The following lemma removes the aperiodicity requirement in the Markov case. s. with gap g. this is no problem.4 ( 14) — itkl 612. The (s. that k is enough larger than g and n enough larger than k to guarantee that Lemma 111.1. this completes the proof of Theorem 111. The i. s. where the w(j) are given by (11). of course. with the product measure bound (6) replaced by (12) ian (Sk (4. The sequences 4 and are said to be (s.4 easily extends to the following. and Lemma 111.7. Since it is enough to prove exponential rates for k-th order frequencies for large k. and the final block w(t + 1) has length n — t (k + g) — s < 2(k + g). k. g)) kii(g)i t i=1 ktk(w(i)). t]: w(j) = 411 It is. however. g)-equivalence class of 4 is denoted by Sk(xPil .1.1. . Completion of proof of Theorem 111. g). The overlappings .1.1.g . is the distribution on Ak defined by P.) Given E> 0 and g. k-block parsings have the same (ordered) set {w(1). This is called the the s-shifted. the g(j) have length g and alternate with the k-blocks w(j).g(alnxii). g)-equivalence class Sk (4. ENTROPY FOR RESTRICTED CLASSES. such that if k n < y.

1. Theorem 111. see Exercise 4. because it illustrates several useful cutting and stacking ideas.oc. There is a binary ergodic process it and an integer N such that An (1-112 : 'Pi (• Ix) — Ail 1/2» ?. . because it is conceptually quite simple. for k can always be increased or decreased by no more than d to achieve this.l.SECTION 111. and. _ r(n). It can also be assumed that k is divisible by d. Examples of renewal processes without exponential rates for frequencies and entropy are not hard to construct. Cd.1.1.9 An ergodic Markov chain has exponential rates for frequencies of all orders. g = ps k . so the previous theory applies to each set {xii': ' K g — I (c(xs )) I as before. The measure . ) where @ denotes addition mod d.u k is an average of the bt k k+g-1 fx rI : IPk( ifil ) — Ad > El C U VI I: s=0 IPsk. k . Define the function c: A i— [1. in part. k + g — 1]. 171 Proof The only new case is the periodic case.u (s) is. n > N. C2.1. d] by putting c(a) = s.5. such that Prob(X n+ 1 E Cs91lXn E C s = 1. III. an aperiodic Markov measure with state space C(s) x C(s @ 1) x • • . Also let pP) denote the measure A conditioned on Xi E C5 .10 Let n i— r(n) be a positive decreasing function with limit O..g ( ! XIII ) — 14c(xs)) 1 > 6 / 2 1. x C(s @ d — 1). Let g be a gap length. 1 < s < d.. ?_ decays exponentially as n -. in part. Lemma 111. IL({xi: IPk — iLk I E}) > 6/2 } separately. (s) . unless c(ai) = c(xs ). A simple cutting and stacking procedure which produces counterexamples for general rate functions will be presented. and thereby completing the CI proof of the exponential-rates theorem for Markov chains. if a E C. Let A be an ergodic Markov chain with period d > 1. the nonoverlapping block measure ps condition p g (4) = 0. Thus. C1. . it follows that if k I g is large enough then Since . however. Theorem 111. The following theorem summarizes the goal. . RATES OF CONVERGENCE. which can be assumed to be divisible by d and small relative to k. g (• WO satisfies the For s E [0. which has no effect on the asymptotics. and partition A into the (periodic) classes. establishing the lemma.c Counterexamples.

that is. For example.u. of positive integers which will be specified later. if exactly half the measure of 8(1) is concentrated on intervals labeled '1'. long enough to guarantee that (16) will indeed hold with m replaced by ni + 1. Let m fi(m) be a nonincreasing function with limit 0. As long as enough independent cutting and stacking is done at each stage and all the mass is moved in the limit to the second part. As an example of what it means to "look like one process on part of the space and something else on the other part of the space. The desired (complete) sequence {S(m)} will now be defined. Furthermore. to implement both of these goals. then the chance that a randomly selected x lies in the first L — ni levels of C is (1 — m/L)p. ENTROPY FOR RESTRICTED CLASSES. the final process will be ergodic and will satisfy (16) for all m for which r(m) < 1/2. Suppose some column C at some stage has all of its levels labeled '1'.u n ({xr: pi(114) = 11) (1 — -11 L i z ) /3. for all m. and hence (15) . If x is a point in some level of C below the top ni levels. for all m > 1." suppose kt is defined by a (complete) sequence of column structures. pi(11x) is either 1 or 0.u. in part.n 1. say C1 and C2. for 1 < j < in. if C has height L and measure fi. /31(114) = 1. to make a process look like another process on part of the space. Without loss of generality. Counterexamples for a given rate would be easy to construct if ergodicity were not required. Each S(m) will contain a column . so that. The construction will be described inductively in terms of two auxiliary unbounded sequences.172 CHAPTER III. i (1) = 1/2. In particular. then (141 : 1/31( Ixr) — I ?_ 112» ?_ (1 _ so that if 1/2> fi > r(m) and L is large enough then (16) /2. then for any n.n: — I > 1/2)) > r(m). The cutting and stacking method was designed. n > 1. The details of the above outline will now be given. The mixing of "one part slowly into another" is accomplished by cutting C into two columns. = (y +V)/2. {41 } and fr. Mixing is done by applying repeated independent cutting and stacking to the union of the second column C2 with the complement of C. . so that A n (14: i pi( 14) — I 1/21) = 1.„ ({x. The bound (16) can be achieved with ni replaced by ni + 1 by making sure that the first column CI has measure slightly more than r(m 1). it can be supposed that r(m) < 1/2. if . no matter what the complement of C looks like. such that /3(1) = 1/2 and r(m) < f3(m) < 1/2. mixing the first part into the second part so slowly that the rate of convergence is as large as desired. while . and to mix one part slowly into another. where y is concentrated on the sequence of all l's and iY is concentrated on the sequence of all O's. Ergodic counterexamples are constructed by making . according to whether x`iz consists of l's or O's.u were the average of two processes. for it can then be cut and stacked into a much longer column.u look like one process on part of the space and something else on the other part of the space. then it is certain that xi = 1.

(a) Choose k such that Tk = {x: i(x) ) 2h j has measure close to 1. and let R. C(m — 1.(m). so that C(m — 1. 1. for this is all that is needed to make sure that m ttn({x. all of whose entries are labeled '1'. labeled '1'. (b) Show that the measure of {xii: p.) (c) Show that a 0. Since the measure of R. . this guarantees that the final process is ergodic. disjoint from C(m).d Exercises. and the total measure of the intervals of S(m) that are labeled with a '0' is 1/2. The new remainder R(m) is obtained by applying rni -fold independent cutting and stacking to C(m — 1. 1) has measure /3(m) and C(m — 1.10. RATES OF CONVERGENCE. the total width of 8(m) goes to 0. To get started. . 2) U TZ. All that remains to be shown is that for suitable choice of 4. then it can be combined with the fact that l's and O's are equally likely and the assumption that 1/2 > 8(m) > r (m).(m — 1) and 1?-(m) are (1 — 2')-independent. 2).(4) < 2 -n (h+E) } goes to 0 exponentially fast. so large that (1 — m/t n1 )/3(m) > r(m). tm holds. This is possible by Theorem 1. of height 4. and since /3(m) -÷ 0. This guarantees that at all subsequent stages the total measure of the intervals of 8(m) that are labeled with a '1' is 1/2. so the sequence {8(m)} is complete and hence defines a process A. The condition (17) is guaranteed by choosing 4. has exponential rates for frequencies and entropy. (a) Show that y has exponential rates for frequencies. 0 111. 2. 14)— Ail a 1/2 }) a (1 — — ) /3(m). The column C(m —1. (Hint: as in the proof of the entropy theorem it is highly unlikely that a sequence can be mostly packed by 'Tk and have too-small measure. 8(m) is constructed as follows. labeled '0'. and rm the final process /1 is ergodic and satisfies the desired condition (17) An (Ix'i'I : Ipi(• 14)— Ail a. 2) has measure fi(m —1) — )6(m). Suppose A has exponential rates for frequencies.7. 1.(m) goes to 1.1/20 > r (m). and hence the final process must satisfy t 1 (l) = A i (0) = 1/2.1.SECTION III. Since rm —> oc. First C(m — 1) is cut into two columns. sequentially. Let y be a finite stationary coding of A and assume /2. Once this holds.n } is chosen. where C(m — 1. by Theorem 1. 1) and C(m — 1. which are then stacked to obtain C(m).10. This completes the proof of Theorem 111. 1) is cut into t ni g ni -1 columns of equal width.-mixing process has exponential rates for entropy. 2) U 7Z(m — 1). ni > 1.(1) consist of one interval of length 1/2. let C(1) consist of one interval of length 1/2.11. The sequence {r. together with a column structure R. For ni > 1. to guarantee that (17) holds. 173 C(m).: Ipi(.10. and measure 13(m).1 . Show that the measure of the set of n-sequences that are not almost strongly-packed by 'Tk goes to 0 exponentially fast.

(Hint: make sure that the expected recurrence-time series converges slowly. the resulting empirical k-block distribution almost surely converges to the true distribution of k-blocks as n oc. let plc ( 14) denote the empirical distribution of overlapping k-blocks in the sequence _el'. A nondecreasing sequence {k(n)} will be said to be admissible for the ergodic measure bt if 11111 1Pk(n)(• 14) n-4co lik(n) = 0. or.d. The kth-order joint distribution for an ergodic finite-alphabet process can be estimated from a sample path of length n by sliding a window of length k along the sample path and counting frequencies of k-blocks. It is also can be shown that for any sequence k(n) -± oc there is an ergodic measure for which {k(n)} is not admissible. Show that there exists a renewal process that does not have exponential rates for frequencies of order 1. with high probability.174 CHAPTER III. after which the system is run on other. a fact guaranteed by the ergodic theorem. Here is where entropy enters the picture. Exercise 4 shows that renewal processes are not *-mixing. ENTROPY FOR RESTRICTED CLASSES. as a function of sample path length n. to design engineering systems. in . s.) 5. The such estimates is important when using training sequences. if k(n) > (1 + e)(logn)1 h. the probability of a k-block will be roughly 2 -kh . that is. Use cutting and stacking to construct a process with exponential rates for entropy but not for frequencies.1. where I • I denotes variational. there is no hope that the empirical k-block distribution will be close to the true distribution. (Hint: bound the number of n-sequences that can have good k-block frequencies. see Theorem 111. by the ergodic theorem. conditional k-type too much larger than 2 —k(h(4)—h(v) and measure too much larger than 2-PP).i. distance. finite consistency of sample paths. Establish this by directly constructing a renewal process that is not lit-mixing. There are some situations. processes or Markov chains. see Exercise 2. Section 111.2 Entropy and joint distributions. This is the problem addressed in this section. such as the i. below. As before. for if k is large then. equivalently.. The definition of admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. distributional. Every ergodic process has an admissible sequence such that limp k(n) = oc. For example. that is. The problem addressed here is whether it is possible to make a universal choice of {k(n)) for "nice" classes of processes. (b) Show that y has exponential rates for entropy. such as data compression. a. Consistent estimation also may not be possible for the choice k(n) (log n)I h. The empirical k-block distribution for a training sequence is used as the basis for design. that is.2. independently drawn sample paths. 4. where it is good to make the block length as long as possible.) 3. In particular. Thus it would be desirable to have consistency results for the case when the block length function k = k(n) grows as rapidly as possible. k(n) > (logn)I(h — 6). If k is fixed the procedure is almost surely consistent.

the unbiased coin-tossing case when h = 1, the choice k(n) ~ log n is not admissible, for it is easy to see that, with high probability, an approximate (1 - e^{-1}) fraction of the k-blocks will fail to appear in a given sample path of length n. Thus the case of most interest is when k(n) ~ (log n)/(h + epsilon). The principal results are the following.

Theorem III.2.1 (The nonadmissibility theorem.)
If mu is ergodic with positive entropy h, and if k(n) >= (log n)/(h - epsilon), where 0 < epsilon < h, then {k(n)} is not admissible in probability for mu.

Theorem III.2.2 (The positive admissibility theorem.)
If mu is i.i.d., psi-mixing, or ergodic Markov, and k(n) <= (log n)/(h + epsilon), then {k(n)} is admissible for mu.

Theorem III.2.3 (Weak Bernoulli admissibility.)
If mu is weak Bernoulli and k(n) <= (log n)/(h + epsilon), then {k(n)} is admissible for mu.

The nonadmissibility theorem is a simple consequence of the fact that no more than 2^{k(n)(h-epsilon)} sequences of length k can occur in a sequence of length n, and hence there is no hope of seeing even a fraction of the full distribution. The positive admissibility result for the i.i.d., Markov, and psi-mixing cases is a consequence of a slight strengthening of the exponential rate bounds of the preceding section. The psi-mixing concept was introduced in the preceding section. The weak Bernoulli property, which will be defined carefully later, requires that past and future blocks be almost independent if separated by a long enough gap, independent of the length of past and future blocks. The weak Bernoulli result follows from an even more careful look at what is really used for the psi-mixing proof.

Remark III.2.4
A first motivation for the problem discussed in this section was the training sequence problem described in the opening paragraph. A second motivation was the desire to obtain a more classical version of the positive results of Ornstein and Weiss, [52], who used the d-distance rather than the variational distance. They showed, in particular, that if the process is a stationary coding of an i.i.d. process then the d-distance between the empirical k(n)-block distribution and the true k(n)-block distribution goes to 0, almost surely, provided k(n) ~ (log n)/h. The d-distance is upper bounded by half the variational distance, see Exercise 3, and hence the results described here are a sharpening of the Ornstein-Weiss theorem for the case when k <= (log n)/(h + epsilon) and the process satisfies strong enough forms of asymptotic independence. The Ornstein-Weiss results will be discussed in more detail in Section III.3.

Remark III.2.5
A third motivation for the problem discussed here was a waiting-time result obtained by Wyner and Ziv, [86]. They showed that if W_n(x, y) is the waiting time until the first n terms of x appear in the sequence y, then (1/n) log W_n(x, y) converges in probability to h, provided x and y are independently chosen sample paths. The admissibility results obtained here can be used to prove stronger versions of their theorem. These applications, along with various related results and counterexamples, are presented in Section III.5.
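The waiting-time quantity of Remark III.2.5 is easy to simulate. A hedged sketch for unbiased coin-tossing, where h = 1 bit; the sequence lengths, seed, and function names are arbitrary choices made here:

```python
from math import log2
import random

def waiting_time(x, y, n):
    """W_n(x, y): the first position (1-based) at which x_1^n appears in y, or None."""
    m = y.find(x[:n])
    return m + 1 if m >= 0 else None

random.seed(0)
coin = lambda N: "".join(random.choice("01") for _ in range(N))
x, y = coin(20), coin(500_000)          # independently drawn sample paths
for n in (8, 12, 14):
    w = waiting_time(x, y, n)
    if w is not None:
        print(n, w, log2(w) / n)        # (1/n) log W_n is near h = 1 for fair coin-tossing
```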

1/31(. (2) (14 : 1Pke 14) — 141 ?. see the bound (9) of Section MA. Proof Define yn = yn (x) — > 5E1 < 2(n ± 01/312-nCe2 =0 1 I if xn E B otherwise.d. for.uk ((tik(4)) c) also goes to 1. where 0 < E < h. if n > IA I''.2.u. First note that. as the reader can show. and for any B c A such that p(B) > 1 — É and I BI > 2. Next the positive admissibility theorem will be established.i. The key to such results for other processes is a similar bound which holds when the full alphabet is replaced by a subset of large measure. n — k +1]} . (1). since each member of Tk (E/2) has measure at most 2. Define the empirical universe of k-blocks to be the set Lik (4) = x: +k-1 = a l% for some i E [1. lik(x). for any finite set A.k(h—e12) • The entropy theorem guarantees that . This implies the nonadmissibility theorem. by the distance lower bound. it can be supposed that {k(n)} is unbounded.a Proofs of admissibility and nonadmissibility. E.176 CHAPTER III.') n Tk ( 6/2)) 2 k(h-0 2-k(h-E/2) 2 -kE/2 .) There is a positive constant C such that for any E > 0. since any bounded sequence is admissible for any ergodic process.2. for any n > 0.k (ték oc. process with Prob (yn = 1) < Let Cn = E yi > 2En} . is enough to prove the positive admissibility theorem for unbiased coin-tossing. and hence Pk . for any i.i. III. where t nlk. the following holds.6 (Extended first-order bound. First the nonadmissibility theorem will be proved. it follows that Since pk(a fic ix) = 0. be an ergodic process with entropy h and suppose k = k(n) > (log n)/(h — E). The rate bound.d. so that { yn } is a binary i. without loss of generality. whenever al' (1) Let 1 Pk( ' 14) — 1 111 ((lik(X)) c ) • Tk (6/2) = 14: p.14) — pk1 -± 1 for every x E A'.. by the ergodic theorem. the error is summable in n. Lemma 111.(4) < 2-k(h-E/2)} The assumption n 2k (h -E ) implies that ILik(4)1 < 2k (h -o. process p. with finite alphabet A. ED < k(t +1) 1A1k 2 -t cE214 . and hence I pk (. so that . ENTROPY FOR RESTRICTED CLASSES.k (Tk (e/2)) —> 1. Let p.
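The nonadmissibility argument turns on the empirical universe U_k(x_1^n): a sample path of length n exhibits at most n - k + 1 distinct k-blocks, far fewer than the roughly 2^{kh} entropy-typical blocks once k >= (log n)/(h - epsilon). A minimal sketch, with an arbitrary illustrative path and illustrative values of h and epsilon:

```python
import math

def empirical_universe(x, k):
    """U_k(x_1^n): the set of k-blocks that actually appear in x."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}

# With n symbols there are at most n - k + 1 distinct k-blocks, so if
# k >= (log2 n)/(h - eps) then |U_k| <= n <= 2^{k(h - eps)}, which is far fewer
# than the roughly 2^{kh} entropy-typical k-blocks.
x = "0110111001011010" * 64              # n = 1024 binary symbols
n, h, eps = len(x), 1.0, 0.25
k = math.ceil(math.log2(n) / (h - eps))
print(k, len(empirical_universe(x, k)), 2 ** (k * (h - eps)), 2 ** (k * h))
```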

14) — ?_ 31) < oc. im )) for which in < 26n cannot exceed 1. . is upper bounded by tc(B(i i . For each m < n and 1 < j 1 < i2 < • • • < j The sets . combined with (4) and the assumption that I BI > 2.u i and let AB be the corresponding product measure on BS defined by ALB. in. page 168. . in. -2.tti I > 56}) which is. x i g B. then extended to the *-mixing and periodic Markov cases. and assume k(n) < (log n)/ (h 6).i. bEB [ A\ The exponential bound (2) can now be applied with k = 1 to upper bound (5) by tt (B(i i.n ) den.kI ?3/21. Assume .2. let B(ii. define ï = i(xil) to be the sequence of length s obtained by deleting xi„ xi2 .i is i.AL B I --= 2_. and for xri' E B(ii. see (5).) . Let 8 be a given positive number. The assumption that m < 26n the implies that the probability ({Xj1 E B(ii. . by the Borel-Cantelli lemma. with entropy h.d. in turn. • • • im) and have union A. ENTROPY AND JOINT DISTRIBUTIONS.d. k-1 (7) 14: 1Pk( 14) — tuki 31 g s=o IPsk( . Fix a set {i1. note the set of all 4 for which Xi E B if and only if i çz' {i1.BI > € 1) Ii ui . }. It will be shown that (6) Ep.} with m < 2cn.cn4E2 177 The idea now is to partition An according to the location of the indices i for which . no) im }:nz>2En + 1) 22-cn4E2 . tt (b) /Ld o B) + E A(b) = 2(1 — . . and hence. (14: ip(.(B(i i . in1 ): m > 2en} have union Cn so that (3) yields the bound (4) (ii. and apply the bound (2) to obtain (3) p. the lemma follows. put s = n — m. i 2 . x im from x. in. i. • • • .SECTION 111. which immediately implies the desired result.)) /1 B E B s : i Pi e liD — I > 3}). the sets {B(ii.u(B)) bvB 26. The positive admissibility result will be proved first for the i. upper bounded by (5) since E Bs: — 11 1. im ): 'pi ( 14) . The proof of (6) starts with the overlapping to nonoverlapping relation derived in the preceding section. are disjoint for different { -1. — p. case. Also let ALB be the conditional distribution on B defined by . Furthermore.i. (Ca ) < (n 1)22 . im ))(s 1)1B12 2< ini mn 1)IBI2—n(1 —20c€ 2 The sum of the p.

all k > K = K (g).178 CHAPTER III. Lemma 111. Lemma 111. 101 ) — vil where each wi E A = A'. and. with A replaced by A = A k and B replaced by T.i. by assumption. Note that k can be replaced by k(n) in this bound. Since the number of typical sequences can be controlled. The bound (9) is actually summable in n. fi = h + 612 h ±E The right-hand side of (9) is then upper bounded by (10) 2 log n h +E (t since k(n) < (log n)/ (h E). the polynomial factor in that lemma takes the form. produces the bound (9) tx: IPke — 1kI > < 2k(t 1)21. valid for all g.6.2. where n = tk + k +r. combined with (8) and (7). by the entropy theorem.d. and y denotes the product measure on » induced by A. (8) ti({4: Ins* — Ak1 3/2}) v({ 01: IP1( . summable in n. since it was assumed that k(n) oc. This is. with B replaced by a suitable set of entropy-typical sequences. 0 < r < k. valid for all k > K and n > kly. This establishes that the sum in (6) is finite.2. for s E [O. which is valid for k/n smaller than some fixed y > 0.6.el') denotes the s-shifted nonoverlapping empirical k-block distribution. ENTROPY FOR RESTRIC I ED CLASSES. The rigorous proof is given in the following paragraphs. The idea now is to upper bound the right-hand side in (8) by applying the extended first-order bound. k — 1]. furthermore. t The preceding argument applied to the Vi-mixing case yields the bound (1 1) kt({4: 1Pke — il ?_. see (7) on page 168. 3 } ) < 2(k + g)[lk(g)]` (t 1 )22127 tceil00 . which.(h±E/2)2-tc82/100 . indeed. (1 + nl k) 2"1±(') . yields the bound P 101) — vil > 3/20 < 2(t + 1 )2/2)2_tc82000. though no longer a polynomial in n is still dominated by the exponential factor. and all n > kly. there is a K such that yi (T) = kil (Tk ) > 1 — 8/10. for k > K. The extended first-order bound. thereby completing the proof of the admissibility theorem for i. To obtain the desired summability result it is only necessary to choose . processes. To show this put a =C32 1100. as n oc. which. to the super-alphabet A". As in the preceding section. and plc ( '. where y is a positive constant which depends only on g. The set rk -= {X I( : itt(x lic ) > 2— k(h+€12)} has cardinality at most 2k(h ±e/ 2) . as k grows. since nIk(n) and p < 1.

I11. —g — m < j < —g}. The weak Bernoulli property is much stronger than mixing.1. uniformly in m > 0. that is. This establishes the positive admissibility theorem for the Vi-mixing case. the measure defined for ar iz E An by iun (41x )— g_m = Prob(r. which only requires that for each m > 0 and n > 1. The extension to the periodic Markov case is obtained by a similar modification EJ of the exponential-rates result for periodic chains. The weak Bernoulli property leads to a similar bound. where the variational distance is used as the measure of approximation. . let p. = aril and Xi g Prob ( X Tg g . hence for the aperiodic Markov case. where g and m are nonnegative integers. k-block parsing of x.1X —g—m -g — n i) < E . there is a gap g for which (12) holds. and this is enough to obtain the admissibility theorem.9. which is again summable in n. The key to the O.. nonoverlapping blocks could be upper bounded by the product measure on the blocks. and xig g provided only that g be large enough. conditioned on the past {xj. -mixing result was the fact that the measure on shifted. in turn. with only a small exponential error. n > 1. ENTROPY AND JOINT DISTRIBUTIONS. = xi. provided a small fraction of blocks are omitted and conditioning on the past is allowed. for all n > 1 and all m > O. with gap g is the expression (14) 4 = w(0)g(1)w(1)g(2)w(2). and oc. if given E > 0 there is a gap g > 0 such that (12) E X ig-m : n (. An equivalent form is obtained by letting m using the martingale theorem to obtain the condition oc and (13) E sifc Gun( . at least for a large fraction of shifts.2. i X -go) — n> 1.2. With a = C8 2 /200 and /3 = (h + E /2)/ (h E)..SECTION 111.„(. since since t n/ k(n) and 8 < 1. upper bounded by log n 2(t h 1)n i3 .g(t)w(t)w(t ± 1). A process has the weak Bernoulli property if past and future become almost independent when separated by enough.) Also let Ex -g (f) denote conditional expectation of a function f with respect to the -g-m random vector X —g—m• —g A stationary process {X i } is weak Bernoulli (WB) or absolutely regular.1x:g g _n1 ) denote the conditional measure on n-steps into the future. It is much weaker than Ifr-mixing. 179 g so large that (g) < 2c6212°°. which requires that jun (4 lxig g _m ) < (1 + c)/L n (ar iz). ar iz.. see Lemma 111. however.b The weak Bernoulli case. As before. subject only to the requirement that k(n) < (log n)/(h E). the s-shifted. To make this precise. assuming that k(n) the right-hand side in (11) is.

= max{ j : j E J } and condition on B* = R E . (a) X E Gn(y). g) and fix a finite set J of positive integers. defined as the limit of A( Ix i rn ) as m oc. for 1 < j < t.) If l is weak Bernoulli and 0 < y < 1/2. Note that the k-block w(j) starts at index s (j — 1)(k + g) + g ± 1 and ends at index s j (k + g). or by B i if s.180 CHAPTER III.„ } ( [ w (.u(w(in. The set of all x E Az for which j is a (y. Lemma 111. Then for any assignment {w(j): E J } of k-blocks tt Proof obtain (15) JEJ (naw(i)] n B.2._{.1)(k +g) ). The weak Bernoulli property guarantees the almost-sure existence of a large density of splitting indices for most shifts. An index j E [1.n )] of which satisfies n Bi I xs_+.(w(j)).n )] n B . g)-splitting index will be denoted by Bi (y.)) (1 + y) • . g). ([w(i. s. where w(0) has length s.2. Note that Bi is measurable with respect to the coordinates i < s + j (k + g). each The first factor is an average of the measures II (w(j. and g are understood.( . Thus (15) yields (n([w(i)] n B. such that the following hold. and later. which exists almost surely by the martingale theorem.& x_ s tim -1)(k+g) ) < (1 + y)tt(w(j ni )). and there is a sequence of measurable sets {G n (y)}. Here. s. ENTROPY FOR RESTRICTED CLASSES. k.2.7 Fix (y.. .)) (1+ y)IJI J EJ Put jp.8 (The weak Bernoulli splitting-set lemma.n )] n Bi. and the final block w(t 1) has length n — t(k+g) —s < 2(k ± g). y. eventually almost surely. lie lx i 00 ) denotes the conditional measure on the infinite past. Lemma 111. t] is called a (y. k. k.)) • tt ([. JEJ-th„) D and Lemma 111. k. k. there are integers k(y) and t(y).2. the g-blocks g(j) alternate with the k-blocks w(j). by the definition of B11„. )] n Bi ) . s. fl Bi ) B*) to n Bi ) = [w(j. s.) .„ tc(B*). g)-splitting index for x E A Z if (w(j) ix _ s ti -1)(k+g) ) < (1 + y)p. then there is a gap g = g(y).7 follows by induction.

then /i(CM) > 1 — (y 2 /2). Fatou's lemma implies that I 1 — f (x) d p(x) < )/4 . . and fix an x E Gn(Y). 2 where kc.3C ( IX:e g. k.. Thus fk converges almost surely to some f. denotes the indicator function of Cm. if t > t(y). eventually almost surely. and if (t +1)(k + g) _< n < (t + 2)(k +g). {Xi : i < —g} U {X. ENTROPY AND JOINT DISTRIBUTIONS. Fix k > M. Proof By the weak Bernoulli property there is a gap g = g(y) so large that for any k f (xx ) 1 p(x i`) p(x i`lxioo g) dp. 1 so there is an M such that if Cm = lx :I 1 — fk (x) 15_ . then for x E G(y) there are at least (1 — y)(k + g) values of s E [0. k ± g — 1] for each of which there are at least (1 — y)t indices j in the interval [1. so that if G(y) = —EICc m (T z x) > 1 1 n-1 n. The ergodic theorem implies that Nul+ rri cx.s. Put k(y)= M and let t(y) be any integer larger than 2/y 2 .2. 1 N-1 i=u y2 > 1 — — a. Direct calculation shows that each fk has expected value 1 and that {fk} is a martingale with respect to the increasing sequence {Ed.t=o — L — 2 then x E Gn(Y). s. 181 (b) If k > k(y). and (t + 1)(k g) < n < (t +2)(k ± g). so there is a subset S = S(x) C [0.U(. t > t(y). k + g —1] of cardinality at least (1 — y)(k + g) such that for x E G(y) and s E S(x) k c. t] that are (y. (Ts+0-1)(k+g)x ) > — y. Vk ?_ MI.: 1 < i < kl.(x) < Y 4 4 Fix such a g and for each k define fk(X) — .. The definition of G(y) and the assumption t > 2/y 2 imply that 1 r(k+g)-1 i=0 KCht(T1 X) = g) E 1 k-i-g-1 1 t(k k + g s=o t j =i E_ t K cm ( T s-hc i —( k + g) x ) > _ y 2 . see Exercise 7b.SECTION 111.c ) • Let Ek be the a-algebra determined by the random variables. g)-splitting indices for x.

t] define s. and k(n) < (log n)/(h + E). If k is large relative to gap length and n is large relative to k. Since I S(x)I > (1 —3)(k+ g).1.1-1)(k +g) x E Cm. n > 1.2. for at least (1—y)t indices j E [1. g)-splitting index for x. and measurable sets G. Theorem 111. k(y) and t(y).g(ai )= E W(i) an! .7 and the fact that I Ji < t yield (1 6) A (n[w(j)] jEJ n D.8 implies that there are (1 — y)(k + g) indices s E [0. k. J). which implies that j is a (y.8. if kl n < y. let Dn (s.: 3 then 114' g j ( . k E k A. s. then choose integers g = g(y). choose a positive y < 1/2. g)-splitting indices for x. The overlapping-block measure Pk is (almost) an average of the measures pk s'g i. ( . k + g — 1] for each of which there are at least (1 — y)t indices j in the interval [1. (. s. that is. where "almost" is now needed to account for end effects. k + g — 1] and J C [1.2. k +g —1]. and Lemma 111. then Ts±(. k. n > 1. t] that are (y. for x E G(y) and s E S(X) there are at least (1 — y)t indices j in the interval [1. and if Ipk(-14) — ?. for gaps. and for the j g J. Assume A is weak Bernoulli of entropy h. where k(y) k < (log n)/(h+E). if s E S(X). g)-splitting indices for x.2. Fix > 0.9 Given 3 > 0 there is a positive y < 1/2 such that for any g there is a K = K(g. and note that jEJ n B = Dn(s. . t]. k. g)-splitting index. this completes the proof of Lemma 111. s.)) If x E G(y) then Lemma 111. t] which are (y. however. For J C [1. . In summary. y) such that if k > K. Lemma 111. t] of cardinality at least (1 — y)t.182 CHAPTER III. But if Tr+(i -1)(k +g) x E Cm then g(w(i)Ixst i-1)(k+g )— g) < 1 ( + y)tt(w(D). In particular.J k n Pk. t]. so that conditions (a) and (b) of Lemma 111. provided the sets J are large fractions of [1. the empirical distribution of k-blocks obtained by looking only at those k-blocks w(j) for which j E J.8 easily extends to the following sharper form in which the conclusion holds for a positive fraction of shifts. for any subset J C [1. ENTROPY FOR RESTRICTED CLASSES.2.2. this is no problem.(s. so that Lemma 111.E.8 hold.3. k. s. t]. 14) — ILkI 3/4 for at least 2y (k + g) indices s E [0. J)) < (1+ y)' rip. For s E [0.2. J) be the set of those sequences x for which every j E J is a (y. Proof of weak Bernoulli admissibility. Fix t > t(y) and (t+1)(k+g) < n < (t+2)(k+g). = Gn(Y).

t ] of cardinality at least (1— y)t. 183 On the other hand. yet p. as in the -tfr-mixing case. 2-2ty log y to bound the number of subsets J C [1. indices s. Aperiodic Markov chains are weak Bernoulli because they are ifr-mixing. [38]. In particular. eventually almost surely. does not have exponential rates of convergence for frequencies. then for any x E G 0 (y) there exists at least one s E [0. 4. this establishes Theorem 111. the set k :g j (-14) — (1 — y)t. processes. J)) The proof of weak Bernoulli admissibility can now be completed very much as the proof for the 'tit-mixing case. the measure of the set (17) is upper bounded by (18) 2 .9 _> 3/4 for at least 2y (k -I.10 The admissibility results are based on joint work with Marton.ei') — [4(n)1 > 31. ENTROPY AND JOINT DISTRIBUTIONS.i. processes in the sense of ergodic theory.d. for any subset J C [1. Using the argument of that proof. Show that for any nondecreasing.2.g) assures that if I pk (14) — Ik1> S then I Ps kv g j ( . Remark 111. and k > k(y) and t > t(y) are sufficiently large.c Exercises. 2. the bound (18) is summable in n. which need not be if-mixing are.3. Aperiodic renewal and regenerative processes.d. Show that there exists an ergodic measure a with positive entropy h such that k(n) (log n)/(h + E) is admissible for p. however. unbounded sequence {k(n)} there is an ergodic process p such that {k(n)} is not admissible for A. 3.d.2.2.. 31 n G(y) UU ig[1.] 1J12:(1—y)t (ix: Ips k :jg elxi) — > 3/41 n Dn (s. Since X E G n (y).i. as part of their proof that aperiodic Markov chains are isomorphic to i. = {x: Ipk (n) (. and if B. [13].SECTION 111. Show that if p is i. if k(n) < (logn)I(h + E). Show that k(n) = [log n] is not admissible for unbiased coin-tossing.2. for which x E Dn (s. I11. The weak Bernoulli concept was introduced by Friedman and Ornstein. . then. 1. 14) Thus if y is sufficiently at least (1—y)t. weak Bernoulli. J) and ips (17) is contained in the set k+ g —1 s=0 lpk — itkl a.i. with an extra factor. see Exercises 2 and 3. It is shown in Chapter 4 that weak Bernoulli processes are stationary codings of i. then p(B 0 ) goes to 0 exponentially fast. t] of cardinality small. y can be assumed to be so small and t so large that Lemma 111. in Section 1V.3.1. This bound is the counterpart of (11). k + g — 1] and at least one J C [1. 2-2t y log y yy[k(o+ + 2k(n)(h+E/2)2_ t(1—Y)C32/4°° for t sufficiently large.2. If y is small enough. t] of cardinality at least > 3/4.

process. (Hint: use the previous result with Yi = E(fk+ilEk).S The definition of d-admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. s.3 The d-admissibility problem. c) be the set of all kt-sequences that can be obtained by changing an e(log n)I(h+ fraction of the w(i)'s and permuting their order.3. (Hint: use the martingale theorem. the admissibility theorem only requires that k(n) < (log n)I h. Show that if k(n) oc.3.(x7 . A similar result holds with the nonoverlapping-block empirical distribution in place of the overlapping-block empirical distribution. For each x let Pm (X) be the atom of E(m) that contains x. 151. The d-admissibility theorem has a fairly simple proof.1. Note that in the d-m etric case. for any integrable function g. 7.. The d-metric forms of the admissibility results of the preceding section are also of interest. then . and p.v(4)1.) g dp. The principal positive result is Theorem 111.. A sequence {k(n)} is called d-admissible for the ergodic process if (1) iiIndn(Pk(n)(• Alc(n)) = 0.i. Theorem 11. The earlier concept of admissible will now be called variationally admissible. (a) Show that E(glE)(x) = lim 1 ni -+cx) 11 (Pm(x)) fp. though only the latter will be discussed here. where each w(i) has length k.u(R. is i. almost surely. Show that the preceding result holds for weak Bernoulli processes. let E(m) be the a-algebra determined by . e). and let E be the smallest complete a-algebra containing Um E(m). as n 6. c)) —> 1.1 (The d-admissibility theorem.d.) Section 111.(x7... k. Stated in a somewhat different form.) (b) Show that the sequence fk defined in the proof of Lemma 111. 5.2. y) E a1 . called the finitely determined property. Let {Yi : j > 1} be a sequence of finite-valued random variables. a.i. the theorem was first proved by Ornstein and Weiss. based on the empirical entropy theorem. k(n).(.d. always holds a variationally-admissible sequence is also d-admissible..) If k(n) < (log n)I h then {k(n)} is d-admissible for any process of entropy h which is a stationary coding of an i. Assume t = Ln/k] and xin = w(1). . [52}. Since the bound (2) dn (p.184 CHAPTER ER III. while in the variational case the condition k(n) < (log n)I(h e) was needed. and a deep property.8 is indeed x4) to evaluate a martingale. Ym. ENTROPY FOR RESTRICIED CLASSES. Let R. w(t)r.

The negative results are two-fold. One is the simple fact that if n is too short relative to k, then a small neighborhood of the empirical universe of k-blocks in an n-sequence is too small to allow admissibility.

Theorem III.3.2 (The d-nonadmissibility theorem.)
If k(n) >= (log n)/(h - epsilon), then {k(n)} is not d-admissible for any ergodic process of entropy h.

The d-nonadmissibility theorem is a consequence of the fact that no more than 2^{k(n)(h-epsilon)} sequences of length k(n) can occur in a sequence of length n <= 2^{k(n)(h-epsilon)}, so that at most 2^{k(n)(h-epsilon/2)} sequences can be within delta of one of them, provided delta is small enough. It will be proved in the next subsection.

The other negative result is a much deeper one, asserting in a very strong way that no sequence can be admissible for every ergodic process.

Theorem III.3.3 (The strong-nonadmissibility theorem.)
For any nondecreasing unbounded sequence {k(n)} and 0 < alpha < 1/2, there is an ergodic process mu such that

   liminf as n -> infinity of d_{k(n)}(p_{k(n)}(·|x_1^n), mu_{k(n)}) >= alpha,  almost surely.

A weaker, subsequence version of strong nonadmissibility was first given in [52], then extended to the form given here in [50]. These results will be discussed in a later subsection. A proof of the d-admissibility theorem, assuming the finitely determined property, which will be proved in Section IV, is given in the next subsection.

III.3.a Admissibility and nonadmissibility proofs.

To establish the d-nonadmissibility theorem, define the (empirical) universe of k-blocks of x_1^n,

   U_k(x_1^n) = {a_1^k: x_i^{i+k-1} = a_1^k, for some i in [1, n-k+1]},

and its delta-blowup,

   [U_k(x_1^n)]_delta = {y_1^k: d_k(x_1^k, y_1^k) <= delta, for some x_1^k in U_k(x_1^n)}.

If k >= (log n)/(h - epsilon) then |U_k(x_1^n)| <= 2^{k(h-epsilon)}, and hence, if delta is small enough, then for all k,

   |[U_k(x_1^n)]_delta| <= 2^{k(h-epsilon/2)},

by the blowup-bound lemma of Chapter I. Intersecting with the entropy-typical set

   T_k = {x_1^k: mu(x_1^k) <= 2^{-k(h-epsilon/4)}}

produces

   mu_k([U_k(x_1^n)]_delta intersected with T_k) <= 2^{k(h-epsilon/2)} 2^{-k(h-epsilon/4)} <= 2^{-k epsilon/4}.

The entropy theorem implies that mu(T_k) -> 1, so that mu_k([U_k(x_1^n)]_delta) <= delta, for any x_1^n, if k is large enough. But if this holds then d_k(mu_k, p_k(·|x_1^n)) >= delta^2.
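The two objects driving this argument, the per-letter Hamming distance d_k between k-blocks and the delta-blowup of a set of k-blocks, are simple to compute. A minimal sketch; the set D and the test blocks below are invented examples:

```python
def dk(xk, yk):
    """Per-letter Hamming distance between two k-blocks."""
    k = len(xk)
    return sum(a != b for a, b in zip(xk, yk)) / k

def in_blowup(yk, D, alpha):
    """Is y_1^k in the alpha-blowup [D]_alpha, i.e., within alpha of some member of D?"""
    return any(dk(yk, xk) <= alpha for xk in D)

D = {"000000", "111111"}
print(dk("000111", "000000"),            # 0.5
      in_blowup("001000", D, 1 / 6),     # True: one change away from 000000
      in_blowup("010101", D, 1 / 6))     # False: three changes from either member
```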

2..( • Ixrn). of course. by the ergodic theorem. by the definition of finitely determined. y) on Am defined by çb m (a) = 1 n — m -I.d. (a) 10 (in.4.3.1. To establish convergence in m-block distribution.i. that is.9.3. see Section IV. for it asserts that (1 /k)H(pke h = 11(A). ---÷ h. since the bounded result is true for any ergodic process. ENTROPY FOR RESTRICTED CLASSES. for each m. provided only that k < (11h) log n. the proof that stationary codings of processes are finitely determined. the v-distribution of m-blocks in n-blocks is the measure 0. almost surely. almost surely. The finitely determined property asserts that a process has the finitely determined property if any process close enough in joint distribution and entropy must also be dclose. The desired result then follows since pni (an i 'lxri') —> A(an i z). An alternative expression is Om (ain ) E Pm (ain14)v(x) = Ev (pm (ar x7)). note that the averaged distribution 0(m. the following hold for any ergodic process p.2.x7 which makes it clear that 4)(m. The important fact needed for the current discussion is that a stationary coding of an i. Pk(n)(. 14)) — (b) (1/k(n))H(p k(n) (. as k and n go to infinity. v n ) = ym for stationary v. This completes the proof of the d-admissibility theorem. E v(uaT v).186 CHAPTER III. and noting that the sequence {k(n)} can be assumed to be unbounded. v)1 <8. of entropy h. 1 —> 0. vk) < E. 111 . for otherwise Lemma I. it is sufficient to prove that if {k(n)} is unbounded and k(n) < (11h) log n.„ = 0(m. 11-1(14) — H(v)I <k6. Convergence of entropy. This completes the proof of the d-nonadmissibility theorem.2.l n-m EE i=0 uEiv veAn—n. a negligible effect if n is large enough. Thus. If y is a measure on An. Pk( WO) is almost the same as pn. fact (a). must also satisfy cik(iik. is expressed in terms of the averaging of finite distributions. almost surely as n oc. is really just the empirical-entropy theorem. see Theorem IV. process is finitely determined. the average of the y-probability of din over all except the final m — 1 starting positions in sequences of length n. by the ergodic theorem.14(a) gives /Lk aik (x')13 ) > 1 — 28. and in < n. modulo. which is more suitable for use here and in Section 111. since pk(tik(fit)14) = 1. Theorem 111. the only difference being that the former gives less than full weight to those m-blocks that start in the initial k — 1 places or end in the final k — 1 places xn . . A stationary process t is finitely determined if for any c > 0 there is a 6 > 0 and positive integers m and K such that if k > K then any measure y on Ak which satisfies the two conditions -(rn. The d-admissibility theorem is proved by taking y = plc ( I4). An equivalent finite form. then. Theorem 11.2. fact (b). .

It is enough to show that for any 0 <a < 1/2 there is an ergodic measure be. 1. A simple. Two measures 1. The trick is to do the merging itself in different ways on different parts of the space.. of {k(n)}-d-admissible ergodic processes which are mutually a' apart in d. The support a (i) of a measure pc on Ak is the set of all x i` for which . even when blown-up by a. Indeed.SECTION 111. This must happen for every value of n. THE D-ADMISSIBILITY PROBLEM. k. if C n [D]ce = 0.3. Let Bk(a) = ix ) : dic(Pk• Ix). see Exercise 1.W-typical sequences.(a (p) x a (v)) = ce. The result (4) extends to averages of separated families in the following form. look as though they are drawn from a large number of mutually far-apart ergodic processes. when viewed through a window of length k(n).) is the path-length function.. where. Ak) < I. keeping the almost the same separation at the same time as the merging is happening. yf)) ce). for which (3) lim p.. This simple nonergodic example suggests the basic idea: build an ergodic process whose n-length sample paths.u(x) > 0.3.*oc almost surely. 187 III. then cik(p. D c A k are said to be a separated. A given x is typical x) n(k) i is mostly concentrated on for at most one component AO ) . which is well-suited to the tasks of both merging and separation. y) > a. Throughout this section {k(n)} denotes a fixed. is - (4) If 1.t and y on A k are said to be a separated if their supports are a-separated. then E(dk (x li` . - [Dia = i. merging of these far-apart processes must be taking place in order to obtain a final ergodic process. to be the average of a large number. Before beginning the construction. n(k) = max{n: k(n) < k}. Associated with the window function k(. in which case pk(• /2. and hence lim infdk (pk (.1 and y. {p (J ) : j < J } . y k < dk (x k ) for some yk i E D} is the a-blowup of D.b Strong-nonadmissibility examples. nondecreasing. The easiest way to make measures apart sequences and dk d-far-apart is to make supports disjoint. yet at the same time. some discussion of the relation between dk-far-far-apart measures is in order. where a'(1 — 1/J) > a. a-separated means that at least ak changes must be made in a member of C to produce a member of D. and k(n) < log n.c and y are a-separated. Thus. unbounded sequence of positive integers. but important fact. . The tool for managing all this is cutting and stacking. Two sets C. if X is any joining of two a-separated measures. (U Bk(a)) = 0 k>K The construction is easy if ergodicity is not an issue. as before. for one can take p. 14 (k) ). Ak) > a'(1 — 1/J).

namely. and y.. for i it([a(vma) = — E vi ([a(v. III. for any measure v whose support is entirely contained in the support of one of the v i . then d(p. yl`)) v(a (0) • [1 — AGa(v)icen a(1 — 1/. Thus if f(m)In(k m ) is sunu-nable in m.3.3.4 suggests a strategy for making (6) hold. illustrates most of the ideas involved and is a bit easier to understand than the one used to obtain the full limit result.)]a) = v jag (v i)la) = 7 j i=1 if 1 . based on [52]. of p. The sequence am will decrease to 0 and the total measure of the top n(km )-levels of S(m) will be summable in m. Furthermore. where xi is the label of the interval containing T i-l x.. The k-block universe Li(S) of a column . from which the lemma follows. a complete sequence {8(m)} of column structures and an increasing sequence {km } will be constructed such that a randomly chosen point x which does not lie in the top n(km )-levels of any column of S(m) must satisfy (6) dk„.188 CHAPTER III. conditioned on being n(km )-levels below the top. then Lemma 111.([a(v)]a ) < 1/J. after which a sketch of the modifications needed to produce the stronger result (3) will be given.. and if the cardinality of Si (m) is constant in j. 11k. guaranteeing that (5) holds. for 1 < i < n(k. With these simple preliminary ideas in mind. then .3. take S(m) to be the union of a disjoint collection of column structures {Si (m): j < Jm } such that any km blockwhich appears in the name of any column of Si (m) must be at least am apart from any km -block which appears in the name of any column of 5. and an increasing sequence {km } such that lim p(B km (a)) = O. Lemma 111.x ' "5.1 A limit inferior result. Here it will be shown. (3). if the support of y is contained in the support of y.1). Proof The a-separation condition implies that the a-blowup of the support of yi does j. Some terminology will be helpful. (5) The construction. A weaker liminf result will be described in the next subsection. the measure is the average of the conditional measures on the separate Si (m).n ). and hence not meet the support of yi . ENTROPY FOR RESTRICTED CLASSES.: j < J} of pairwise a-separated measures.4 guarantees that the d-property (6) will indeed hold. v) > a(1 — 1/J). for any joining Â.3. (Pkm (.u. how to produce an ergodic p. If all the columns in S(m) have the same width and height am). (m) for i j. ergodicity will be guaranteed by making sure that S(m) and S(m 1) are sufficiently independent. To achieve (5).4 If it = (1/J)Ei v i is the average of a family {v . Lemma 111. in detail. A column structure S is said to be uniform if its columns have the same height f(S) and width.) > am . and hence EAdk(4. the basic constructions can begin. then. Thus.b.

The following four sequences suggest a way to do this. 11 = ( 064164) 1 a .SECTION III. so first-order frequencies are good. 2. the simpler concatenation language will be used in place of cutting and stacking language.0101. J } . To summarize the discussion up to this point. of course.. . for which the cardinality of Si (m) is constant in j. that is. then d32(4 2 . 1/2) by constructing a complete sequence {S(m)} of column structures and an increasing sequence {k m } with the following properties.. the goal (6) can be achieved with an ergodic measure for a given a E (0. Disjoint column structures S and S' are said to be (a. The construction (7) suggests a way to merge while keeping separation. the simultaneous merging and separation problem. (7) = 01010.. and concatenations of sets by independent cutting and stacking of appropriate copies of the corresponding column structures. and for which an2 (1 — 1/J. 01111 = (0 414)16 c = 000000000000000001 11 = (016 116)4 d = 000. MI whose level sets are constant.3. (Al) S(m) is uniform with height am) > 2mn(k.. 01 = (01) 64 b = 00001111000011110. yet if a block xr of 32 consecutive symbols is drawn from one of them and a block yr is drawn from another.n ).. (A2) S(m) and S(m 1) are 2m-independent... Conversion back to cutting and stacking language is achieved by replacing S(m) by its columnar representation with all columns equally likely. k)-separated if their k-block universes are a-separated. and there is no loss in assuming that distinct columns have the same name. There are many ways to force the initial stage 8 (1) to have property (A3). (A3) S(m) is a union of a disjoint family {S) (m): j < J} of pairwise (am .. yr) > 1/2. The real problem is how to go from stage m to stage ni+1 so that separation holds.n ) decreases to a. .. The idea of merging rule is formalized as follows. yet asymptotic independence is guaranteed.. If one starts with enough far-apart sets. For M divisible by J. J)merging rule is a function . This is. then by using cyclical rules with rapidly growing periods many different sets can be produced that are almost as far apart. These sequences are created by concatenating the two symbols. concatenate blocks according to a rule that specifies to which set the m-th block in the concatenation belongs.. namely. For this reason S(m) will be taken to be a subset of A" ) . 10 -1 (DI = M/J.. {1. THE 13-ADMISSIBILITY PROBLEM. Each sequence has the same frequency of occurrence of 0 and 1. an (M. km )separated column structures.. using a rapidly increasing period from one sequence to the next. as m Since all the columns of 5(m) will have the same width and height. oc. 0 and 1. 189 structure S is the set of all a l' that appear as a block of consecutive symbols in the name of any column in S.. j <J.

a merging of the collection {Sj : j E J } is just an (M. and a collection {S7: t < J* } of subsets of 2V * of equal cardinalitv. ENTROPY FOR RESTRICTED CLASSES. When applied to a collection {Si : j E J } of disjoint subsets of A'.11. for some x E S" and some i > 11. given that a block comes from Sj . The merging rule defined by the two conditions (i) (ii) 0(m)= j. then for any J* there is a K* and an V.0(rn ± np) = 0(m).1} is a collection of pairwise (a. The family {Or : t < PI is called the canonical family of cyclical merging rules defined by J and J. Let expb 0 denote the exponential function with base b. let M = exp . The following lemma. let (/). m E [1. .190 CHAPTER III. In proving this it is somewhat easier to use a stronger infinite concatenation form of the separation idea.1 and J divides p. then the new collection will be almost as well separated as the old collection. be the cyclic (M. J)-merging of the collection for some M. m E [1. that is.3. and. Given J and J*. In other words. there is a Jo = Jo(a. Cyclical merging rules are defined as follows. /MI.. An (M. The key to the construction is that if J is large enough. 0 is called the cyclic rule with period p. In general. Lemma 111. Use of this type of merging at each stage will insure asymptotic independence. 0 *(2) • • • w (m ) : W(M) E So(. the canonical family {0i } produces the collection {S(0)} of disjoint subsets of A mt . 8(0) is the set of all concatenations w(1) • • • w(M) that can be formed by selecting w(m) from the 0(m)-th member of {S1 } . formed by concatenating the sets in the order specified by q. .) Given 0 < at < a < 1/2. J)-merging rule with period pt = exp j (exp2 (t — 1)).n ).1 times. (exp2 (J* — 1)) and for each t E [1. The two important properties of this merging idea are that each factor Si appears exactly M1. In cutting and stacking language. S' c At are (a. stated in a form suitable for iteration.P n . The set S(0) is called the 0-merging of {Si : j E f }. The full k-block universe of S C A t is the set tik (S oe ) = fak i : ak i = x:+k-1 . Assume p divides M1. Two subsets S. {Sj }) of A m'. K)-strongly-separated if their full k-block universes are a-separated for any k > K. each Si is cut into exactly MIJ copies and these are independently cut and stacked in the order specified by 0.5 (The merging/separation lemma. The desired "cyclical rules with rapidly growing periods" are obtained as follows. at) such that if J > Jo and {S J : j < . such that . for each m. is the key to producing an ergodic measure with the desired property (5). the direct product S() = in =1 fl 5rn = 1. when applied to a collection {Si: j E J } of disjoint subsets of A produces the subset S(0) = S(0. K)-strongly-separated subsets of A t of the saine cardinalit-v. it is equally likely to be any member of Si . p1. J)-merging rule 4).

Since all but a limiting (1/J)-fraction of y is covered by such y u u+(j-1)nt+1 . Suppose x E (S. Furthermore. v(j)E ST. The condition s > t implies that there are integers n and m such that m is divisible by nJ and such that x = b(1)b(2) • • where each b(i) is a concatenation of the form (8) w(l)w(2) • • • w(J). since (87)' = ST. so it only remains to show that if J is large enough. and.3. and. this may be described as follows.--m. 1 _< j _< J. (a) {St*: t < J*} is pairwise (a*.SECTION IIL3. furthermore. choose J„. Fix a decreasing sequence {a } such that a < ant < 1/2. for each k. w(j) E Sr» 1 < j < J. (13) Each S7 is a merging of {Si: j E 191 J}.`)c° and y E (S)oc where s > t.5. each block v(j) is at least as long as each concatenation (8). K*)-strongly-separated. 4 } of pairwise (an„ kni )- (B2) Each Si (m) is a merging of {S (m — 1): j < 4_ 1 }. > JOY . and hence (10) 1 )w (t + 2) • • w(J)w(1) • • • w(t) S. then K* can be chosen so that property (a) holds. K*)-strongly-separated. LI The construction of the desired sequence {S(m)} is now carried out by induction. without losing (a. . In concatenation language. Proof Without loss of generality it can be supposed that K < f. Let {Or : t < J* } be the canonical family of cyclical merging rules defined by J and J*. then there can be at most one index k which is equal to 1 d u_one (xvv-i-(J-1)ne+1 y u ( J — 1 ) n f + 1 ) > a by the definition of (a. so that if u+(. THE D-ADMISSIBILITY PROBLEM. (B3) 2n(k) <t(m).3. let S7 = S(0. while y = c(1)c(2) • •• where each c(i) has the form (9) v(1)v(2) • • • v(J). for all large enough K*. for any j. for otherwise {Si : j < J } can be replaced by {S7: j E J } for N > Klt. and Yu x v±(J-Unf+1 = w(t where w(k) E j. Suppose m have been determined so that the followLemma S(m) C and k ing hold.5.1-1)nt-1-1 is a subblock of such a v(j). the collection {St*: t < J* } is indeed pairwise (a*.). Aar° (B1) S(m) is a disjoint union of a collection {S I (m): j < strongly-separated sets of the same cardinality. and since J* is finite. Part (b) is certainly true.Ottn-1-1) from 111. for each ni. provided J is large enough. This completes the proof of Lemma 111. any merging of {S7 } is also a merging of {S}. K)-strongseparation. K)-strongly-separated and the assumption that K < E. and for each t.

Meanwhile.n + i)x(S. for the complete proof see [50]. The new construction goes from S(m) to S(m + 1) through a sequence of Jm+i intermediate steps S(m. Cyclical merging with period is applied to the collection {57 } to produce a new column structure R 1 (m.' by (S7) N for some suitably large N. Let {S'tk: t < Jrn+1 } be the collection given by Lemma 111. J 1+1). (B2). is needed. and the other on the part already merged. 0). on each of which only a small fraction of the space is merged. ±1 )-fraction of the space at each step. V1 * V2 * • • * V j As in the earlier construction. In the first substage. To obtain the stronger property (3) it is necessary to control what happens in the interval k m < k < km+1. the sequence {S(m)} defines a stationary A-valued measure it by the formula .n } with am = a and ce. Jm+i -fold independent cutting and stacking is . All columns at any substage have the same height. Only a brief sketch of the ideas will be given here. 1). 0)). each Si (nt. and (B3). n(lcm )I f(m) < cc. Furthermore. This is accomplished by doing the merging in separate intermediate steps. At each intermediate step.u(Bk m (a)) is summable in m. merging only a (1/4.4 in this case where all columns have the same height and width.. 0) S(m. rather than simple concatenation language. The union S(m + 1) = Ui Si (m + 1) then has properties (Bi). Since (6) clearly holds. if necessary. the analogue of Theorem 1.2 The general case. the measure p. establishing the desired 1=1 goal (5).5 for the family {Si (m): j < J.192 CHAPTER III.s(m) 1 k ilxiam)). by (B3). (ii) x(S7) = ( i/J. (m.(St i) = (1 — 1/ Ji)x(S.(m. 1) S(m.*). and Sj ". that is.3. one on the part waiting to be merged at subsequent steps. ±1 . the structure S(m) = S(m. prior separation properties are retained on the unmerged part of the space until the somewhat smaller separation for longer blocks is obtained. S. The only merging idea used is that of cyclical merging of { yi : j < J} with period J. Define k 1+ 1 to be the K* of Lemma 111. By replacing S.b. 0) is a union of pairwise km )-strongly separated substructures {Si (m. but two widths are possible. 0)). the measure of the bad set . ENTROPY FOR RESTRICTED CLASSES. Put Si (m + 1) = S.n+ 1 = a*. then each member of S(m) appears in most members of S(M) with frequency almost equal to 1/1S(m)1. III. and En.3. and hence cutting and stacking language. 0) is cut into two copies.3. Since t(m) --> cc. which is just the independent cutting and stacking in the order given. for m replaced by m +1. it can be supposed that 2m+ l n(kn1+ i) < t(S. j < J.10. where } (i) )1/4. is ergodic since the definition of merging implies that if M is large relative to m.n+ 1. 0): j < 41 all of whose columns have the same width and the same height t(m) = (m.5 for J* = J. lc \ _ lim [ga l— E IS(m)1 t(m).

t + 1). . 1). by the way. 1) < k < k(m. 1) > n(k(m. the probability that ak(pk(.„<k<k„.C i (m. The merging-of-a-fraction idea can now be applied to S(m. 1): j < J} is pairwise (a. This is. > 8(m..1). In particular.n+ i) 2 (1-1/0. 1))4 + 1. 193 applied separately to each to obtain a new column structure L i (m. 1)-fold independent cutting and stacking. 1).5. This is because. merging a copy of {L i (m. the following holds.. 1) is replaced by its M(m. can be used to show that if am) is long enough and J„. 1) has measure 1/4 +1 and the top n(k(m. 1))/(m. for which lim . which makes each substructure longer without changing any of the separation properties. which. 2). 1) have the same height as those of L i (m. the unmerged part has disappeared. 1) < 1/4 + 1. 1). t))} is (p(m.. Thus if M(m..u(Bk (a)) = O. 1))-strongly separated from the merged part R1 (m./r . an event of probability at least (1 —2/J. pk(. for the stated range of k-values. 1). 1): j < J} of measure 1/4 +1 . 1) is large enough and each L i (m. then {L i (m. if M(m. 1) is (8(m. t): j < J } U {R 1 (m.. But. producing in the end an ergodic process p. and since the probability that both lie in in the same L i (m. large enough. k m )-strongly-separated. then for a suitable choice of an.. 1))-levels below the top and in different L i (m. 1) is chosen large enough. 1) affects this property. then applying enough independent cutting and stacking to it and to the separate pieces to achieve almost the same separation for the separate structures. 1) is upper bounded by 1/. 1) is replaced by its M(m.0. that the columns of R I (m.3. 1). then it can be assumed that i(m. 1) = {L i (m. (a) The collection S(m. (b) If x and y are picked at random in S(m. in turn.(m. Note. 1) > k m such that each unmerged part j (m. 1 as wide. An argument similar to the one used to prove the merging/separation lemma. while making each structure so long that (b) holds for k(m. then k-blocks drawn from the unmerged part or previously merged parts must be far-apart from a large fraction of k-blocks drawn from the part merged at stage t 1. and the whole construction can be applied anew to S(m 1) = S(m. 1)-fold independent cutting and stacking. t) <k < k(m. After 4 +1 iterations. 1))-levels have measure at most n(k(m. there is a k(m. 1). if they lie at least n(k(m.E. of course. 1). the k-block universes of xrk) and y n(k) i are at least a-apart. since R. then for any k in the range km < k < k(m.14 (k) ). and this is enough to obtain (12) ni k. implies the following. but are only 1/J„1 . but a more careful look shows that if k(m.SECTION III. The collection {. Lemma 111.4_] Bk ) < DC ' . k (m . km )-strongly-separated. since strong-separation for any value of k implies strong-separation for all larger values. ±1 . 1) > a„. 1). 1). 1))strongly-separated. Each of the separate substructures can be extended by applying independent cutting and stacking to it.Iy in(k) )) > a is at least (1 — 2/4 + 0 2 (1 — 1/. 1): j < J} remains pairwise (am .3. weaker than the desired almost-sure result. THE D-ADMISSIBILITY PROBLEM. and R(m. since neither cutting into copies nor independent cutting and stacking applied to separate Ej(m.E. k(m. J„. and (a) continues to hold.

in turn. 1. An ergodic process i has the blowing-up property (B UP) if given E > 0 there is a 8 > 0 and an N such that if n > N then p. for any subset C c An for which 11 (C) . cik(Pk(.3. for i j. yi ) > a..e to construct waiting-time counterexamples.i.9. Remark 111. If C c An then [C]. 00 > — 1/f). Show that 0. and d(yi .3. This fact will be applied in I11.194 CHAPTER III. which is. (Hint: use Exercise 11. b7) < E.([C].) 2. A full discussion of the connections between blowing-up properties and stationary codings of i. called the almost blowing-up property is. I11. then 10(k.3. processes.5. 2_ . enough to guarantee the desired result (3).4 Blowing-up properties. Section 111. An interesting property closely connected to entropy ideas.i. The following theorem characterizes those processes with the blowing-up property.i. = (1/J)Ei yi. and Lemma 111.6 It is important to note here that enough separate independent cutting and stacking can always be done at each stage to make i(m) grow arbitrarily rapidly. Show that if "if is the concatenated-block process defined by a measure y on An.3. where each yi is ergodic. for some ariL E C} .1x n i (k) ). A slightly weaker concept. 2(k — 1)/n. In particular. for suitable choice of the sequence {40. ixin(k) ). Remark 111.i. a stationary process has the blowing-up property if sets of n-sequences that are not exponentially too small in probability have a large blowup.d.d. including a large family called the finitary processes.) > 1 — E.7 By starting with a well-separated collection of n-sequences of cardinality more than one can obtain a final process of positive entropy h. almost lim infak(Pk(. equivalent to being a stationary coding of an i. processes is delayed to Chapter 4.d. y) —1511 5 .c Exercises. Section 1. that is [C]. Let p. relative to km . process. ENTROPY FOR RESTRICTED CLASSES. Processes with the blowing-up property are characterized as those processes that have exponential rates of convergence for frequencies and entropy and are stationary codings of i. almost surely. and for other processes of interest. k-4. The blowing-up property and related ideas will be introduced in this section. processes. in fact. this shows the existence of an ergodic process for which the sequence k(n) = [(log n)/ hi is not a-admissible.d. (Yi)k) surely. Informally. for aperiodic Markov sources. has recently been shown to hold for i. denotes the 6-neighborhood (or 6-blowup) of C. = {b7: dn (a7. called the blowing-up property.4.

since i.d.d.i. Theorem 111.i.i.d. however. almost surely finite integer-valued measurable function w(x).SECTION 111.i -w(x) A finitary coding of an i.c. A stationary coding F: A z i. called the window function.) > 1 — E. If i E A Z and x -w(x) w(x) — w(x) then f (x ) = f Gip . Not every stationary coding of an i. such that.) A stationary process has the blowing-up property if and only if it is a stationary coding of an i. if there is a nonnegative. in-dependent processes.d. 0-blowing-up property for k > K. A particular kind of stationary coding. eventually almost surely. process.4. . The proof of this fact as well as most of the proof of the theorem will be delayed to Chapter 4.d. processes have the blowing-up property it follows that finitary processes have the blowing-up property. process has the blowing-up property.d. that the theorem asserts that an i. In particular. process will be given in Section IV. process has the blowing-up property. process and has exponential rates of convergence for frequencies and entropy.i. for almost every x E A z . An ergodic process p.3. Theorem 111. and some renewal processes are finitary.4. process. then a blowing-up property will hold for an arbitrary stationary coding of an i. without exponential rates there can be sets that are not exponentially too small yet fail to contain any typical sequences.i.) A stationary process has the almost blowing-up property if and only if it is a stationary coding of an i.a(C) > 2-18 . it will be shown later in this section that a stationary coding of an i.d.d. a result that will be used in the waiting-time discussion in the next section. 0-blowing-up property if p.2 (The finitary-coding theorem. (i) xlic E Bk. for any subset C C B for which . the time-zero encoder f satisfies the following condition.i. (ii) For any 6 > 0 there is a 8 > 0 and a K such that Bk has the (8.3 (Almost blowing-up characterization.4. relative to an ergodic process ii.4. Later in this section the following will be proved.i. process has the almost blowing-up property. does preserve the blowing-up property. The fact that processes with the blowing-up property have exponential rates will be established later in the section. called finitary coding. and.1 (Blowing-up property characterization. By borrowing one concept from Chapter 4. The difficulty is that only sets of sequences that are mostly both frequency and entropy typical can possibly have a large blowup. A proof that a process with the almost blowing-up property is a stationary coding of an i. 79]. A set B c A k has the (8. Once these are suitably removed. in particular. — .i. process is called a finitary process. BLOWING-UP PROPERTIES.B z is said to be finitary.) Finitary coding preserves the blowing-up property. has the almost blowing-up property (ABUP) if for each k there is a set Bk C A k such that the following hold. 195 Theorem 111.([C]. Note.i. It is known that aperiodic Markov chains and finite-state processes.d. [28.

naB * 2n(h—E/2) . The blowing-up property provides 6 and N so that if n>N. An exponential bound for the measure of the set B.) > 1 — a.) would have to be at least 1 —E. E)1 0 ) 0 it follows that . To fill in the details of this part of the argument. but depends only on the existence of exponential rates for frequencies. let B . Since . and hence (1) would force the measure p n (B(n.u has the blowing-up property. 6 . One part of this is easy.Cc A n . k. E/2). n > N and . [B(n. First it will be shown that p. ) fx n i: tin(x n i) < of sequences of too-small probability is a bit trickier to obtain. But this cannot be true for all large n since the ergodic theorem guarantees that lim n p n (B(n. (1) Ipk( 14)— pk( 1)1)1 6/2. E/2)) = 0 for each fixed k and E. for all sufficiently large n. E)) > 2 -6" then ([B(n. for all n sufficiently large. and hence there is an a > 0 such that 1[B* (n. To fill in the details let pk ( WI') denote the empirical distribution of overlapping k blocks.u(4) 2 ' _E/4)} gives (n. Intersecting with the set Tn = 11. c)) < 2 -6n . since .([C]y ) > 1 — E. If the amount of blowup is small enough. Thus p n (B(n.u„(T. k. III.Cc An. n In) _ SO p„([B* (n. E)]. and p. then frequencies won't change much and hence the blowup would produce a set of large measure all of whose members have bad frequencies.(C) > 2 -611 then ii. . k.196 CHAPTER III. Suppose . The blowing-up property provides 8 > 0 and N so that if n>N.(C) > 2 -6 " then . ç B(n. in particular.a Blowing-up implies exponential rates. 0.u([C]. E)) must be less than 2 -8n.4. however. ENTROPY FOR RESTRICTED CLASSES. Note that if y = E/(2k + 1) then d(4. since this implies that if n is sufficiently . —> 0. and define B(n. hence a small blowup of such a set cannot possibly produce enough sequences to cover a large fraction of the measure. by the entropy theorem. has exponential rates for frequencies. k.u n ([B*(n. E/2)) to be at least 1 — E. so that. < 2n(h—E/2)2— n(h—E14) < 2—nEl 4 . k. - = trn i: iun (x n i)> so that B* (n. o ] l < for all n.) —> 1. 6)1 < 2n(h €). however. Next it will be shown that blowing-up implies exponential rates for entropy. The idea of the proof is that if the set of sequences with bad frequencies does not have exponentially small measure then it can be blown up by a small amount to get a set of large measure. for there cannot be too many sequences whose measure is too large. E) = 1Pk( 14) — IkI > El. If. k. k.u(B*(n. and ti.(n. contradicting the ergodic theorem.u n (B(n. k.

6/4).u(D) > 2 -h 8 . and hence tin(Gn n 13.SECTION 111.u has the blowing-up property. x E B Z . For n > k let = pk(B. n > N1.1 • Thus fewer than (2k + 1)yn < cn of the (2k + 1)-blocks in x h± k± k I can differ from the corresponding block in i1Hk4 k . In particular.6. Thus f (x) depends only on the values x ww (x()x) .1)-blocks in xn -tf k that belong to Gk and. and hence there is a k such that if Gk is the set of all x k k such that w(x) < k. most of xPli will be covered by k-blocks whose measure is about 2 -kh. there is at least a (1 — 20-fraction of (2k -I. the built-up set bound. E /4) < 2al. and window function w(x). Fix E > 0 and a process p with AZ.(n. implies that there is an a > 0 and an N such that if n > N. with full encoder F: B z time-zero encoder f: B z A. 6/4) has cardinality at most 2 k(h+e/4) .(n. E)) < 2_8n which gives the desired exponential bound. . n (G n ) > 1 — 2 -6n. and note that . satisfies . then iGn < 2n(h+E/2). I=1 III. For such a sequence (1 — c)n of its (2k + 1)-blocks belong to Gk.u([D]y ) > 1 — E. there is a 6 > 0 and an N 1 > N such that p.un(B.4.7. where w(x) is almost surely finite.u. 197 large then. 1) < Y 2k -I. Now consider a sequence xhti. the set of n-sequences that are (1 —20-covered by the complement of B(k. where a will be specified in a moment. Suppose C c An satisfies v(C) > 2 -h 6 . To fill in the details of the preceding argument apply the entropy theorem to choose k so large that Ak(B(k. moreover. that is. except for a set of exponentially small probability. Let be a finitary coding of . if n > N I ._E k 1 . there is a sequence i n tE k E D such that dn+2k(X . then p. the blowing-up property. As in the proof of the entropy theorem.b Finitary coding and blowing-up. Let D be the projection of F-1 C onto B" HI. Thus + 2—ne/2.4. BLOWING-UP PROPERTIES.(k.u(T) > 1 — c by the Markov inequality. 6)) < IGn12—n(h+E) < 2—ne/2 n > N • Since exponential rates have already been established for frequencies. 6 can be assumed to be so small and n so large that if y = E / (2k + 1) then . (Gk) > 1 — 6 2 . In particular. 6/4)) < a. this gives an exponential bound on the number of such x. k 1 E [D]y n T. Since the complement of 13(k. Thus. at the same time agree entirely with the corresponding block in imHk-±k . where 6 > O. since k is fixed and . Lemma 1. and. which in turn means that it is exponentially very unlikely that such an can have probability much smaller than 2 -nh. The finitary-coding theorem will now be established. the set T = {x n I± k 1 : P2k+1(GkiX n jc-± k i) > 1 — 61.

I11. Also let c/. and put z = . The proof that stationary codings of i. Proof The relation (2) vn EC holds for any joining X of A n and A n ( IC).i. .4 )1) > P(4)dn(4 .1. ktn( .4. This proves the lemma. ) (3) Ca li) = E P (a14)v(x) = Ev(Pk(a l'IX7)). so that Z7 E [C]2 E . The sequence Z7 belongs to C and cln (z7 . and which will be discussed in detail in Section IV. E D such that Yi F(y).i.1 (' IC) denote the conditional measure defined by the set C c An.).198 CHAPTER III.3. Towards this end a simple connection between the blowup of a set and the 4-distance concept will be needed. E Yil )dn(. In this section it is shown that stationary codings of i. C) > E has p. the minimum of dn (4 .c Almost blowing-up and stationary coding. Lemma 111. a property introduced to prove the d-admissibility theorem in Section 111. then.d. there is a 3 > 0 and positive integers k and N such that if n > N then any measure y on An which satisfies the two conditions (a) Luk — 0*(k. the set of 4 such that dn (xit . k This simply a translation of the definition of Section III. process have the almost blowing-up property makes use of the fact that such coded processes have the finitely determined property. An( IC)).4 If d(1-t. (b) 11-1(u n ) — H(v)I < u8. that is.4. LII which completes the proof of the finitary-coding theorem.u. is finitely determined (FD) if given E > 0. that is. Z = F(Y. An ergodic process p. 4([C]) > 1 — E. y E C. processes have the almost blowing-up property (ABUP).2. <e.a into the notation used here. where 4)(k.d. x ntf k 19 and r-iChoose y and 5. C) denote the distance from x to C.([C]e ) > 1 — E. . This proves that vn ([ che ) n T) > 1 — 2€. ENTROPY FOR RESTRICTED CLASSES. C). that is. measure less than E. v)I < 8. by the Markov inequality. ylz). int+ ki. c+ k i . IC)) < 62 . Ane IC» <2 then 1. Z?) < 26.3. and hence EA(Cin(X til C)) < (LW.(4. v) is the measure on Ak obtained by averaging must also satisfy the y-probability of k-blocks over all starting positions in sequences of length n. Let . Thus if (LW.

a)-frequency-typical if I Pk (. [7]. is used.5 (FD implies ABUP.4. rather than the usual nh in the definition of entropy-typical. relative to a. It will be shown that for any c > 0. IC)) — 141 < a(n).SECTION 111. .u. Put Bn = Bn(k(n). and a nonincreasing sequence {a(n)} with limit 0. The definition of Bn and formula (3) together imply that if k(n) > k. 39]. then the conditional measure A n (. The first step in the rigorous proof is to remove the sequences that are not suitably frequency and entropy typical.11 (4)+na . both of which depend on . which is all right since H(t)/n —> h. a) be the set of n-sequences that are both entropy-typical. then xçi E Bn(k.) A finitely determined process has the almost blowing-up property 199 Proof Fix a finitely determined process A. ")I < 8 and IHCan ) — HMI < n6. 2. so there is a nondecreasing.4. y) < e 2 . for any C c Bp. Given E > 0. eventually almost surely. time IC)) < e 2 .5 in the Csiszdr-Körner book. ttn(. If k and a are fixed. This completes the proof that finitely determined implies almost blowing-up. where. Let Bn (k.) > 1 E.uk and entropy close to H(i). if tt(x til ) < 2.un are d-close. eventually almost surely. Thus if n is large enough then an(it. BLOWING-UP PROPERTIES. [37. by Lemma 111. and (j.4. Show that (2) is indeed true. a). e)blowing-up property. I11. which. ED — Remark 111. I xi ) — . Theorem 111. Kan).4. For references to earlier papers on blowing-up ideas the reader is referred to [37] and to Section 1. a)-frequency-typical for all j < k. implies that C has a large blowup.4.6 The results in this section are mostly drawn from joint work with Marton. 1.4. then lin e IC) will have (averaged) k-block probabilities close to . Show that a process with the almost blowing-up property must be ergodic. must also satisfy dn (ii.d Exercises. a (n)). relative to a. such that 4 E Bn(k(n).4. as usual. A sequence x i" will be called (k. IC) satisfies the following.u(C) > 2 —n8 . The finitely determined property then implies that gn e IC) and . a(n)). Note that in this setting n-th order entropy.u([C]. (ii) H(14) — na(n) — 178 Milne IC)) MILO ± na(n). — (i) 10(k. Pk(• Ifiz) is the empirical distribution of overlapping k-blocks in xri'. for which 2 —n8 which. The basic idea is that if C c A n consists of sequences that have good k-block frequencies and probabilities roughly equal to and i(C) is not exponentially too small. A sequence 4 will be called entropy-typical.uk I < a. the finitely determined property provides 8 > 0 and k such that if n is large enough then any measure y on A n which satisfies WI Ø(k. which also includes applications to some problems in multi-user information theory. Suppose C c Bn and . unbounded sequence {k(n)}.4. there is a 8 > 0 such that Bn eventually has the (8. by Lemma 111. implies that .

processes have the blowing-up property.1. yn mi +k-1 < 61. The waiting .d. They showed that if Wk (x. 6) = minfm > 1: dk (x Two positive theorems will be proved. The counterexamples show that waiting time ideas. Show that a process with the almost blowing-up property must be mixing. (b) Show that a i/i -mixing process has the blowing-up property. Assume that aperiodic renewal processes are stationary codings of i. This result was extended to somewhat larger classes of processes. y E Ac° by Wk (x.i.5 The waiting-time problem. Section 111. 3.u(B n ) 1.3. of course. In addition. Theorem 111. 4. y) converges in probability to h.i. Show that some of them do not have the blowing-up property. in conjunction with the almost blowing-up property discussed in the preceding section.d. processes. pp. y) = min{m > 1: yn n: +k-1 i = xk The approximate-match waiting-time function W k (x. by using the d-admissibility theorem. then (1/k) log Wk(x. Wyner and Ziv also established a positive connection between a waiting-time concept and entropy. The surprise here.5. thus further corroborating the general folklore that waiting times and recurrence times are quite different concepts.b.time function Wk (x.d. An almost sure version of the Wyner-Ziv result will be established here for the class of weak Bernoulli processes by using the joint-distribution estimation theory of III. at least for certain classes of ergodic processes. first noted by Wyner and Ziv.2. v. including the weak Bernoulli class in [44. suggests that waiting times are generally longer than recurrence times. y. is the positive result. Show that condition (i) in the definition of ABUP can be replaced by the condition that . 1Off. A connection between entropy and recurrence times. Counterexamples to extensions of these results to the general ergodic case will also be discussed. y) is the waiting time until the first k terms of x appear in an independently chosen y. (a) Show that an m-dependent process has the blowing-up property.200 CHAPTER III. Show that a coding is finitary relative to IL if and only if for each b the set f -1 (b) is a countable union of cylinder sets together with a null set. 5. . processes. [86]. cannot be extended to the general ergodic case. 6) is defined by Wk (x.i. for the well-known waiting-time paradox. was shown to hold for any ergodic process in Section Section 11. for irreducible Markov chains. 76]. 7. ENTROPY FOR RESTRICTED CLASSES. Assume that i. y) is defined for x. 6. [11. an approximate-match version will be shown to hold for the class of stationary codings of i. unlike recurrence ideas. 1.

(a) x'ic E Bk. In the weak Bernoulli case. y) = h.1.. . such that a large fraction of these k-blocks are almost independent of each other.) If p.5. In the approximate-match proof.1 (The exact-match theorem. x p. y. Theorem 111.8. it is necessary to allow the set G„ to depend on the infinite past.0 k—>oo 1 lim lim inf .5. The set Bk C A k is chosen to have the property that any subset of it that is not exponentially too small in measure must have a large 3-blowup. hence G„ is taken to be a subset of the space A z of doubly infinite sequences.3.log k. so that property (a) holds by the entropy theorem.log The easy part of both results is the lower bound.c. For the weak Bernoulli case. The proofs of the two upper bound results will have a parallel structure. {B k } and {G n } have the property that if n > 2k(h +E) .log Wk(x. (b) y E G n . can be taken to depend only on the coordinates from 1 to n. then lim sup .}.SECTION 111. though the parallelism is not complete. The set G n consists of those y E A Z for which yll can be split into k-length blocks separated by a fixed length gap. G„ consists of those yli whose set of k-blocks has a large 3-neighborhood.d.5.} have the additional property that if k < (log n)/(h + 6) then for any fixed xiic E Bk. The d-admissibility theorem. the set G. the sequences {B k } and {G.i.i. In each proof two sequences of measurable sets. the G„ are the sets given by the weak Bernoulli splitting-set lemma.(log n)/(h + 6). are constructed for which the following hold. For the approximate-match result. In the approximate-match case. Theorem 111. 201 Theorem 111. the probability that y E G„ does not contain x lic anywhere in its first n places is almost exponentially decaying in n.3. x A. for there are exponentially too few k-blocks in the first 2k(h ..„0 1 k Wk(X.f) terms of y. hence is taken to be a subset of A. k almost surely with respect to the product measure p. eventually almost surely. and Wk (x. is weak Bernoulli with entropy h. processes. k almost surely with respect to the product measure p.d. guarantees that property (b) holds. THE WAITING-TIME PROBLEM. Bk consists of the entropy-typical sequences. for k .2 (The approximate-match theorem. In other words. In both cases an application the Borel-Cantelli lemma establishes the desired result. hence it is very unlikely that a typical 4 can be among them or even close to one of them. the Bk are the sets given by the almost blowing-up property of stationary codings of i.) If p is a stationary coding of an i. 6—).2. S) > h.4. 8) < h. In the weak Bernoulli case. then k— s oo 1 lirn . then for any fixed y'iz E G. Lemma 111. In other words. process. eventually almost surely. y. see Theorem 111. the probability that xf E Bk is not within 8 of some k-block in y'i' will be exponentially small in k. V S > 0. {B k C Ak} and {G.

E) . If k(n) > (logn)I(h . The lower bound and upper bound results are established in the next three subsections. eventually almost surely. and hence 1/1k(yril)1 < Proof If k > (log n)/(h .x /.d. there is an N = N(x. ENTROPY FOR RESTRICTED CLASSES.3 Let p.4 Let p. and hence so that intersecting with the entropy-typical set 7k (e/2) gives < 2 k (6/2)) _ k(h-0 2 -ko-E/2) < (tik (yi ) n r Since .k l]}. Proof If k > (log n)/(h . y) such that x ik(n) (/[14k(n)(11`)]5. is small enough then for all k.log Wk(x./.6) then for every y E A" and almost every x. process will be constructed for which the exact-match version of the approximate-match theorem is false.5.202 CHAPTER III.5.6) then n <2k(h-e). for any ergodic measure with entropy h. There is a 3 > 0 such that if k(n) > (log n)/(h . and an ergodic process will be constructed for which the approximate-match theorem is false.` E 7 k(E/2). be an ergodic process with entropy h > 0 and let e > 0 be given. n > N. there is an N = N(x. n > N.14(y1)) 6). x A. n . for any ergodic measure with entropy h. 3) > h.5. Intersecting with the entropy-typical set 'Tk (E/2) gives < 2 k(h-3E/4) 2 -kch-E/2) < p. y) such that x ik(n) lik(n)(Yi).6) then for every y E A and almost every x.a Lower bound proofs. In the following discussion. El The theorem immediately implies that 1 lim inf . The following notation will be useful here and later subsections.7. k almost surely with respect to the product measure p. _ _L which establishes the theorem. j[lik (y)]51 _ Lemma 1. be an ergodic process with entropy h > 0 and let e > 0 be given. by the blowup-bound lemma. y. 111 IL . yçl E An . Theorem 111. y) h.0 k—c k almost surely with respect to the product measure p. (y)16 n Tk(e/2)) [lik(yli )]3 = for some i E [0. x /.i. 8—».. < 2k(h -E ) . The theorem immediately implies that 1 lim lim inf . the set of entropy-typical k-sequences (with respect to a > 0) is taken to be 7k(01) 2 -k(h+a) < 4 (4) < 2-k(h-a ) } Theorem 111.log Wk(X.5. I f 6 < 2k(h-3(/4). The (empirical) universe of k-blocks of yri' is the set iik(Yç') = Its S-blowup is dk(x 1`. an application of the Borel-Cantelli lemma yields the theorem. In the final two subsections a stationary coding of an i.6) then n _ _ 2 k(h . 2k(h-E). I11.

y) < h. If 8 1 is the set of all y then E A z that have j as a (y.. for (t + 1)(k + g) < n < (t 2)(k ± g).2.5. THE WAITING-TIME PROBLEM. t] that are (y. 1. The s-shifted.} are given by Lemma 111. g)-splitting index. the g(j) have length g and alternate with the k-blocks w(j). it is enough to establish the upper bound. The sets G n C A Z are defined in terms of the splitting-index concept introduced in III. k. E For almost every x E A" there exists K(x) such that x x. t ].. with gap g is yti` = w(0)g(1)w(1)g(2)w(2). k > K(x). 1 .log Wk(x. k.t(k g) . s. and J C [1. The basic terminology and results needed here will be restated in the form they will be used. for 1 < j < t. (b) For all large enough k and t.b Proof of the exact-match waiting-time theorem. n(k) = n(k. Fix such an (0 (Yrk) )1 and note that ye Ck (x). An index j E [1.2.b.u . E) denote the least integer n such that k < (log n) I (h É).u(w(DlY s 40. y) > h -E€ so that it is enough to show that y g' Ck (x).y)t indices j in the interval [1. The entropy theorem implies that 4 E B. eventually almost surely. t] is a (y. which depends on y.a. that is. k + g . s. (2) 1 lim sup . k. . Property (a) and a slightly weaker version of property (b) from that lemma are needed here. where w(0) has length s < k + g. For each k put Ck(x) = Y Bk. and fix c > 0.1 k x .log Wk (x.5. 203 III. The sets {B k } are just the entropy-typical sets.g(t)w(t)w(t + 1). (a) x E Gn .8 and depend on a positive number y. eventually almost surely. The sets {G. k-block parsing of y..s.s < 2(k + g).( 0)-1)(k+g) ) < (1+ Y)Ww(i)).SECTION 111. g)-splitting indices for y.1] for which there are at least (1 . and a gap g. and for y E G(y) there is at least one s E [0. eventually almost surely. Bk = Tk(a) = {xj: 2 -k(h+a) < wx lic) < where a is a positive number to be specified later.) (1 + y) Ifl libt(w(i)). to be specified later. (1) sa Claw (Di n Bs. and the final block w(t +1) has length n . g)-splitting index for y E A z if . To complete the proof of the weak Bernoulli waiting-time theorem. Let Fix a weak Bernoulli measure a on A z with entropy h. s. stated as the following.

j E J are treated as independent then. n > N(. 3) > h ± c if and only if x i( E C k(y).' Ck (y). Since y E Gn . for all j E J. if this holds. and hence it is enough to show that xi. yri: E G n . By the d-admissibility theorem.3. let Ck (y) = fx 1`: x g [Uk(Yr k) )]81 so that. ENTROPY FOR RESTRICTED Choose k > K (x) and t so large that property (b) holds. it then follows that y 0 Ck(x).y). s) c the splitting-index product bound.1. This completes Li the proof of the exact-match waiting-time theorem.0 > 0 and K such that if k > K and C c Bk is a set of k-sequences of measure at least 2 -0 then 11([C]8/2) > 1/2. eventually almost surely. But if the blocks w(j). eventually almost surely. For almost every y E Aoe there is an N (y) such that yti' E G n . t] of cardinality at least t(1 — y). while k(n) denotes the largest value of k for which k < (log n)/(h c).(0) < (k + g)2.i. (1/k) log Wk(x. s). Theorem 111. g)-splitting index for y.ualik(n)(31)]812) > 1/2) . and such that (t+ 1)(k+ g) < n(k) < (t 2)(k ± g). For each k.5. The least n such that k < (log n)/ (h c) is denoted by n(k). eventually almost surely. Let {B k } be the sequence of sets given by the almost blowing-up characterization theorem. since t n/ k and n . Fix such a y. k. for any y E G n (k)(J.1 n Bsd) + y)' jJ fl bi (w(i)). process such that h(A) = h. and choose . CHAPTER III. To establish the key bound (3). and fix E > 0. is upper bounded by that w(j) (1 — 2 -k(h+a ) ) t(1-Y) and hence )t it(ck(x)n G n(k)(J s)) < (1 + y (1 This implies the bound (3). . y. which proves the theorem.1. s) denote the set of all y E G n (k) for which every j E J is a (y. since there are k ± g ways to choose the shift s and at most 2-2ty lo gy ways to choose sets J c [1. eventually almost surely.4.204 ED CLASSES. (1). Fix a positive number 3 and define G. then a and y can be chosen so small that the bound is summable in k.3. fix a set J c [1. Theorem 111. since ii(x) > k(h+a) and since the cardinality of J is at least t(1 — y). The key bound is (3) ii(Ck(x) n Gn. Since Gnuo (J. k + g — 1] and let Gn(k) (J.5.c Proof of the approximate-match theorem. the probability x lic. Let be a stationary coding of an i. = . yields it(n[w(i)] JE. Theorem 111. s. I11.d.2ty log y 0 y y (1 2—k(h+a))t( 1— Y) Indeed. t] of cardinality at least t(1 — y) and an integer s E [0.

„ (— log Wk(n)(y. k=1 Since 4 E Bk. 1471 (x. then. for then the construction can be iterated to produce infinitely many bad k.u(Ck(Y) n Bk) < 2-10 . A family {Ci } of functions from G L to A L is said to be pairwise k-separated. This artificial expansion of the alphabet is not really essential. 205 Suppose k = k(n) > K and n(k) > N(y). be the Kolmogorov measure of the i. x E GL. cycling in this way through the possible markers to make sure that the time between blocks with the same marker is very large. is thought of as a measure on A Z whose support is contained in B z . aL (C(4). By this mimicking of the bursty property of the classical example. The same marker is used for a long succession of nonoverlapping k-blocks. 4. A function C: GL A L is said to be 6-close to the identity if points are moved no more than 8. The classic example where waiting times are much longer than recurrence times is a "bursty" process.i. 1} such that g i (0) = .) is forced to be large with high probability. where k is large enough to guarantee that many different markers are possible. fix GL c A L . The key to the construction is to force Wk(y. 3 } . I11. To make this precise. it follows that 4 g Ck(y). where P. Thus. 1.„ denotes probability with respect to the product measure y x v. THE WAITING-TIME PROBLEM. Y . A stationary coding y = . that is. y1 E Ci(Xt). process with binary alphabet B = {0. n-400 k(n) for every N. The initial step at each stage is to create markers by changing only a few places in a k-block. -51. and hence E p(Ck(y)n Bk ) < oc. The marker construction idea is easier to carry out if the binary process p. if the k-block universes of the ranges of different functions are disjoint. with a code that makes only a small density of changes in sample paths. yet are pairwise separated at the k-block level. If the process has a large enough alphabet. eventually almost Li surely. which contradicts the definition of Ck(y). so that yrk) E G(k). But this implies that the sets Ck(y) and [ilk (Yin (k) )]8 intersect. j j. Si) to be large. y) will be large for a set of y's of probability close to 1/2.. eventually almost surely.u i (1) = 1/2.u.d.E Cj(Xt). ) < 6. and sample paths cycle through long enough blocks of each symbol then W1 (x. then switches to another marker for a long time. y) will be large for a set of probability close to 1. say. If x i = 1. (1/k) log Wk(y. Let . that is.5. A family {Ci} of functions from GL to A L is 6-close to the identity if each member of the family is 6-close to the identity. however.d An exact-match counterexample. ) 7 ) N) = 0. completing the proof of the approximate-match theorem. with high probability.u o F -1 will be constructed along with an increasing sequence {k(n)} such that (4) 1 lim P.SECTION 111. that is Uk(yP) fl /ta ) = 0.5. If it were true that [t(Ck(Y) n Bk) > 2-0 then both [Ck(Y) n Bk]512 and [16 (Yr k) )]6/2 each have measure larger than 1/2. The first problem is the creation of a large number of block codes which don't change sequences by much. for with additional (but messy) effort one can create binary markers. one in which long blocks of l's and O's alternate. . for then blocks of 2's and 3's can be used to form markers. . hence intersect. where A = {0. 2.

El The code family constructed by the preceding lemma is concatenated to produce a single code C of length 2g L.C. (i) pvxv (wk'. i (x lic) replaces the first 2g ± 3 terms of 4 by the concatenation 0wf0wf0. Lemma 1. Define the i-th k-block encoder a (4) = yilc by setting y = Yg+2 = Y2g+3 = 0 .(G Proof Order the set {2.5. ENTROPY FOR RESTRICTED CLASSES. and (C. pairwise k-separation must hold.6 Let it be an ergodic measure with alphabet A = {0. o F-1 the following hold. with the additional property that no (g +1)-block of 2's and 3's occurs in any concatenation of members of U. 1. where jk+k jk+1 = Ci jk+k x )jk+1 o < <M. 3} L consisting of those sequences in which no block of 2's and 3's of length g occur Then there is a pairwise k-separated family {C i : 1 < i < 2g} of functions from GL to A L . An iteration of the method. i > 2g + 3. (iii) v({2.8. Since each changes at most 2g+3 < 3k coordinates in a k-block. 1. 2. The following lemma is stated in a form that allows the necessary iteration.5. is then used to construct a stationary coding for which the waiting time is long for one value of k. The encoder Ci is defined by blocking xt into k-blocks and applying Cs.. Let GL be the subset of {0. that is. Thus the lemma is established. Ci (xt) = yt. 31g in some way and let wf denote the i-th member. (ii) d(x. The block-to-stationary construction.5 Suppose k > (2g+3)/8 and L = mk. as done in the string matching construction of Section I. Given c > 0.1)-block of 2's and 3's can occur in any concatenation of members of Ui C. Put . 5") < 2kN ) 5_ 2 -g ± E. 2.206 CHAPTER III. Proof Choose k > (2g +3)18. F (x)) < 6. Lemma 111.8. 3 > 0. there is a k > 2g +3 and astationary encoder F: A z 1--+ A z such that for y = p. hence no (g -I. Let L. g+1 2g+2 yj = x . GL. for all x. and N. and leaves the remaining terms unchanged. which is 8-close to the identity. separately to each k-block. and since different wf are used with different Ci .2)-block Owf 0. that is. where g > 1. Note also that the blocks of 2's and 3's introduced by the construction are separated from all else by O's. the family is 3-close to the identity. 3 } g+ 1 ) = 0. is used to produce the final process. 2. The following lemma combines the codes of preceding lemma with the block-to-stationary construction in a form suitable for later iteration. 3}) = 0. Any block of length k in C(x) = yt must contain the (g -I. 3) such that i({2. its proof shows how a large number of pairwise separated codes close to the identity can be constructed using markers. Y2 = Yg+3 = W1 .5.c. then choose L > 2kN+2 lc so that L is divisible by k. Lemma 111.(GL).: 1 < i < 2g1 be as in the preceding lemma for the given 3.

for 41 (G L) 2g . L — 1. and j such that n < 1 < n + L — 1. and hence property (ii) holds. then yn (b) If i E Uj E jii . and if y = F(x) and 5". since y and j3 were independently chosen and there are 2g different kinds of L-blocks each of which is equally likely to be chosen. each of which changes only the first 2g + 3 places in a k-block. for every x. of the integers into subintervals of length no more than M.5. which establishes property (iii). THE WAITING-TIME PROBLEM. Case 1 occurs with probability at most €72 since the limiting density of time in the gaps is upper bounded by 6/4. c has the desired properties. Z = U/1 . 207 M = 2g L and define a code C: Am 1--* A m by setting C(xr) = xr. and such that y = F(x) satisfies the following two conditions. Three cases are now possible. 1 < <2 for xr E (GL) 29 Next apply the block-to-stationary construction of Lemma 1. had no blocks of 2's and 3's of length g. Case 2b.5 to obtain a stationary code F = F c with the property that for almost every x E A z there is a partition. Either k > n + L — 1 or 2kN > Case 2c. the encoded process can have no such blocks of length g + 1. = F(). = x i . n +Al]. then y. and iL iL g. The intervals I. Y (i--1)L+1 = Ci(XoL +1). note that if y and 53 are chosen independently. i = j. m < 1 < m + L — 1. Both xi and i1 belong to coding blocks of F. the changes produce blocks of 2's and 3's of length g which are enclosed between two O's. and 2k N < m L — 1. and such that ynn+L-1 = Ci ( X. Case 1. Case 2. If Case 2c occurs then Case 2b occurs with probability at most 6/2 since L . In Case 2. since the M-block code C is a concatenation of the L-block codes. in. {C i }. that are shorter than M will be called gaps. Case 2a occurs with probability at most 2 -g. Furthermore. Thus d(x. since k > (2g +3)/8. i. then UJEJ /J covers no more than a limiting (6/4)-fraction of Z. (a) If = [n + 1.8. hence.. while those of length M will be called coding blocks. 2kN+2 /E. • n+ +1M = C(Xn ni ± 1M ). i0 j. such that if J consists of those j for which I is shorter than M. first note that the only changes To show that F = F made in x to produce y = F(x) are made by the block codes Ci. then only two cases can occur. Either x1 or i is in a gap of the code F. Case 2a.SECTION 111. there will be integers n. since p. F(x)) < 8. To show that property (i) holds.11 +L-1\ yrn = C1 (4+L-1) .k <n+L — 1.

with GI = F1.5. it is enough to show the existence of such a sequence of encoders {Fn } and integers {k(n)}.5. and an increasing sequence {k(n)} such that for the -1 . where Go(x) (a) pooxon .6 with 1) larger than both 2n + 2 and k(n) and g = n and p. by the pairwise k-separation property.7 The number E can be made arbitrarily small in the preceding argument. and (c) hold for n = 1. that is. G < 2ik(i)) < 27i +1 ± 2-i + 2—n+1 ± 2—n 9 1 < < n. Apply Lemma 111. 2. Az. Condition (c) is a technical condition which makes it possible to proceed from step n to 1. (c) On ) ({2.6 hold. These properties imply that conditions (b) and (c) hold -± 1 1 . and 2b therefore Puxv (Wk(Y < 2kN) _< 6/2 2 . Assume F1. the entropy of the encoded process y can be forced to be arbitrarily close to log 2. and hence one can make P(xo F(x)o) as small as desired.: Az 1-4. and (c) hold for n. > 2kN . F2 . Condition (a) guarantees that the limit measure y will have the 1 log Wk(n)(Y. Condition (b) guarantees that eventually almost surely each coordinate is changed only o F -1 . = F o Fn _i o o F1 and measure 01) = o G n x. (b).6. 33 ) k(n) co.u. In particular.208 CHAPTER III.5. replaced by On ) to select k(n to select an encoder Fn+i : A z A z such that for v (n+ 1) = On ) 0 . Wk (y. while condition (a) holds for the case i = n + 1.5. so there will be a limit encoder F and measure y = desired property. (b) d (G n (x). The finish the proof it will be shown that there is a sequence of stationary encoders. (x)) almost surely. (b). 3} n ) = 0. however. finitely many times. If 6 is small enough. n = 1. then y (n+ 1) will be so close to On ) that condition (a).5. Fn.6 with g = 0 to select k(1) > 3 and F1 such that (a).. This will be done by induction. F„ have been constructed so that (a). imply that which completes the proof of Lemma 111.g + 6/2. the properties (i)-(iii) of Lemma 111. ( wk(i)(y. . Cases 1. Remark 111. with n replaced by n 1. Let 6 < 2 —(n+1) be a positive number to be specified later. step n In summary. the entropy of the coin-tossing process . ENTROPY FOR RESTRICI ED CLASSES. for n + 1.. . 2a. the following composition G. will hold for i < n 1.. . properties hold for each n > 1. The first step is to apply Lemma 111. . . in y x y probability. since y (n+ 1) = o G n since it was assumed that 6 < 2 -(n+ 1) . This completes the construction of the counterexample.

D c A k are a-separated if C n [D].strongly . Two subsets S. however. First some definitions and results from 111. (x . for any integer N.3. S' C A f are (a. Once km is determined.b. Thus.3. Indeed.5. Prob (W k. Lemma 111. since Nm is unbounded. y. Property (b) guarantees that the measure p. = 0.. and a decreasing sequence {a m } c (a. defined by {S(m)} is ergodic. A merging of a collection {Si : j E J} is a product of the form S = ni =1 S Cm) [1. y. A weaker subsequence result is discussed first as it illustrates the ideas in simpler form./m } of pairwise (am . J] is such that 10 -1 ( = M/J. 1/2). Let a be a positive number smaller than 1/2. This is just the subsequence form of the desired result (5). where [D]„ denotes the a-blowup of D. 209 III. S(m) can be replaced by (S(m))n for any positive integer n without disturbing properties (a) and (b).. see Remark 111. Given an increasing unbounded sequence {Nm }.separated if their full k-block universes are a-separated for any k > K. The goal is to show that there is an ergodic process it for which (5 ) 1 lim Prob (— log k co k W k (X . such that the following properties hold for each tn. . inductive application of the merging/separation lemma. The full kblock universe of S c A f is the set k-blocks that appear anywhere in any concatentation of members of S. ) > (1 — 1/40 2 (1 — 1/rn)2 . a) > 2 which proves (6). a) N) = 0. their starting positions x1 and yi will belong to £(m)-blocks that belong to different j (m) and lie at least 2 km N.SECTION 111. M] j E [1.6. then with probability at least (1 — 1/40 2 (1 — 1/m) 2 . while property (c) implies that (6) lim Prob (Wk„. J].3. Such a t can be produced merely by suitably choosing the parameters in the strongnonadmissibility example constructed in 111. and hence property (c) can also be assumed to hold.5. Two sets C.5. THE WAITING-TIME PROBLEM. (x. K) .b will be recalled.-indices below the end of each £(m)-block. (b) Each Si (m) is a merging of {Si (m — 1): j (c) 2kni Nin < t(m)/m. produces a sequence {S(m) C A €("z ) : in > 1} and an increasing sequence {k m }.3. for each where J divides M and 0: [1. The only new fact here is property (c). km ) strongly-separated sets of the same cardinality. VN. a) < 2km N ) = 0.e An approximate-match counterexample. (a) S(m) is a disjoint union of a collection {Si (m): j < . if sample paths sample paths x and y are picked independently at random. y.

The stronger form, (5), controls what happens in the entire interval k_m ≤ k < k_{m+1}. As in the preceding discussion, further independent cutting and stacking can always be applied at any stage without losing the separation properties already gained, and hence, as in the example discussed in III.3.b, the column structures at any stage can be made so long that bad waiting-time behavior is guaranteed for the entire range k_m ≤ k < k_{m+1}. This leads to an example satisfying the stronger result. The reader is referred to [76] for a complete discussion of this final argument.

Chapter IV

B-processes.

The focus in this chapter is on B-processes, that is, stationary, finite alphabet processes that are stationary codings of i.i.d. processes. This terminology and many of the ideas to be discussed here are rooted in Ornstein's fundamental work on the much harder problem of characterizing the invertible stationary codings of i.i.d. processes, the so-called isomorphism problem in ergodic theory, [46]. Invertibility is a basic concept in the abstract study of transformations, but is of little interest in stationary process theory, where the focus is on the joint distributions rather than on the particular space on which the random variables are defined. This is fortunate, for while the theory of invertible stationary codings of i.i.d. processes is still a complex theory, it becomes considerably simpler when the invertibility requirement is dropped. Various other names have been used, each arising from a different characterization, including almost block-independent processes, finitely determined processes, and very weak Bernoulli processes. A natural and useful characterization of stationary codings of i.i.d. processes, the almost block-independence property, will be discussed in this first section.

Section IV.1 Almost block-independence.

The almost block-independence property, like other characterizations of B-processes, is expressed in terms of the d-metric (or some equivalent metric). As in earlier chapters, either measure or random variable notation will be used for the d-distance; that is, if μ is the distribution of X_1^n and ν the distribution of Y_1^n, then d_n(X_1^n, Y_1^n) will often be used in place of d_n(μ, ν).

A block-independent process is formed by extending a measure μ_n on A^n to a product measure on (A^n)^∞, then transporting this to a T^n-invariant measure μ̃ on A^∞. In other words, μ̃ is the measure on A^∞ defined by the formula

   μ̃(x_1^{mn}) = Π_{j=1}^{m} μ_n(x_{(j-1)n+1}^{(j-1)n+n}),   x_1^{mn} ∈ A^{mn},  m ≥ 1,

together with the requirement that μ̃(x_i^j), for all i ≤ j ≤ mn and all x_i^j, be obtained by summing over the remaining coordinates. Note that μ̃ is T^n-invariant, though it is not, in general, stationary; randomizing the start produces a stationary process.

The stationary process obtained from μ̃ by randomizing the start is called the concatenated-block process defined by μ_n. The theory to be developed in this chapter could be stated in terms of approximation by concatenated-block processes, but it is generally easier to use the simpler block-independence ideas.

The independent n-blocking of an ergodic process μ is the block-independent process defined by the restriction μ_n of μ to A^n. It will be denoted by μ̃(n), or simply by μ̃ when n is understood. In random variable language, the independent n-blocking of {X_i} is the T^n-invariant process {Y_i} defined by the following two conditions.

(a) Y_{(j-1)n+1}^{(j-1)n+n} and X_1^n have the same distribution, for each j ≥ 1.

(b) Y_{(j-1)n+1}^{(j-1)n+n} is independent of {Y_i: i ≤ (j-1)n}, for each j ≥ 1.

An ergodic process μ is almost block-independent (ABI) if given ε > 0, there is an N such that if n ≥ N and μ̃ is the independent n-blocking of μ, then d(μ, μ̃) < ε. An i.i.d. process is clearly almost block-independent, since the ABI condition holds for every n ≥ 1 and every ε > 0. The almost block-independence property is preserved under stationary coding and, in fact, characterizes the class of stationary codings of i.i.d. processes.

Theorem IV.1.1 (The almost block-independence theorem.) An ergodic process is almost block-independent if and only if it is a stationary coding of an i.i.d. process.

This result and the fact that mixing Markov chains are almost block-independent are the principal results of this section.

Theorem IV.1.2 A mixing Markov chain is almost block-independent.

Note, in particular, that the two theorems together imply that a mixing Markov chain is a stationary coding of an i.i.d. process, which is not at all obvious.

The fact that stationary codings of i.i.d. processes are almost block-independent is established by first proving it for finite codings, then by showing that the ABI property is preserved under the passage to d-limits. Both of these are quite easy to prove. It is not as easy to show that an almost block-independent process μ is a stationary coding of an i.i.d. process, for this requires the construction of a stationary coding from some i.i.d. process ν onto the given almost block-independent process μ. This will be carried out by showing how to code a suitable i.i.d. process onto a process d-close to μ, then how to make a small density of changes in the code to produce a process even closer in d. The fact that only a small density of changes are needed insures that an iteration of the method produces a limit coding equal to μ. The construction is much simpler if it is assumed that the i.i.d. process has an infinite alphabet with continuous distribution, for then any n-block code can be represented as a function of the first n coordinates of the process and d-joinings can be used to modify such codes; this assumption is sufficient for the purposes of this book. In fact it is possible to show that an almost block-independent process μ is a stationary coding of any i.i.d. process ν for which h(ν) > h(μ). Of course, all this requires that h(ν) ≥ h(μ), since stationary coding cannot increase entropy.
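The block-independence constructions are easy to simulate. The following sketch, in Python, uses illustrative names that are not from the text: it estimates an n-block distribution from a sample path and then generates a path of the corresponding block-independent process by concatenating independently drawn n-blocks. Deleting a uniformly chosen number of initial symbols, between 0 and n-1, would produce the stationary concatenated-block process.

import random
from collections import Counter

def empirical_n_block_distribution(path, n):
    """Estimate mu_n from the nonoverlapping n-blocks of a sample path."""
    blocks = [tuple(path[i:i + n]) for i in range(0, len(path) - n + 1, n)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def independent_n_blocking_path(mu_n, num_blocks, rng=random):
    """Concatenate num_blocks independent n-blocks drawn from mu_n."""
    blocks, weights = zip(*mu_n.items())
    path = []
    for _ in range(num_blocks):
        path.extend(rng.choices(blocks, weights=weights)[0])
    return path

# Example input: a sticky two-state Markov chain, far from i.i.d. at the symbol level.
def markov_path(length, p=0.05, rng=random):
    x, out = 0, []
    for _ in range(length):
        out.append(x)
        x = x if rng.random() > p else 1 - x
    return out

sample = markov_path(200000)
mu_8 = empirical_n_block_distribution(sample, 8)
blocked = independent_n_blocking_path(mu_8, 1000)

For an ABI process, paths produced this way are d-close to the original process once n is large; the sketch only illustrates the construction, not the proof of that fact.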

IV.1.a B-processes are ABI processes.

First it will be shown that finite codings of i.i.d. processes are almost block-independent. Let {Y_i} be a finite coding of the i.i.d. process {X_i} with window half-width w, and let {Z_i} be the independent n-blocking of {Y_i}. The successive n-2w blocks

   Y_{w+1}^{n-w-1}, Y_{n+w+1}^{2n-w-1}, ..., Y_{(m-1)n+w+1}^{mn-w-1}

are independent, each with the distribution of Y_{w+1}^{n-w-1}, by the definition of window half-width and the assumption that {X_i} is i.i.d.; this is because blocks separated by twice the window half-width are independent. Likewise, the successive n-2w blocks

   Z_{w+1}^{n-w-1}, Z_{n+w+1}^{2n-w-1}, ..., Z_{(m-1)n+w+1}^{mn-w-1}

are independent, each with the distribution of Y_{w+1}^{n-w-1}, by the definition of independent n-blocking, so that property (f) of the d-property list, Lemma I.9.11, yields

   d_{m(n-2w)}(Z_{w+1}^{n-w-1} ... Z_{(m-1)n+w+1}^{mn-w-1}, Y_{w+1}^{n-w-1} ... Y_{(m-1)n+w+1}^{mn-w-1}) = 0.

Thus, since omitting the first and last w terms of each n-block affects at most 2wm of the mn coordinates,

   d_{mn}(Z_1^{mn}, Y_1^{mn}) ≤ (m(n-2w)·0 + 2wm)/(mn) = 2w/n,   for all m.

Since this is less than ε for any n > 2w/ε, it follows that finite codings of i.i.d. processes are almost block-independent. Furthermore, the preceding argument applies to stationary codings of infinite alphabet i.i.d. processes onto finite alphabet processes, hence such coded processes are also almost block-independent.

Next assume μ is the d-limit of almost block-independent processes. Given ε, choose an almost block-independent process ν such that d(μ, ν) < ε/3. Since ν is almost block-independent there is an N such that if n ≥ N and ν̃ is the independent n-blocking of ν, then d(ν, ν̃) < ε/3. Fix such an n and let μ̃ be the independent n-blocking of μ. The fact that both μ̃ and ν̃ are i.i.d., when thought of as A^n-valued processes, implies that

(1)   d(μ̃, ν̃) = d_n(μ_n, ν_n) ≤ d(μ, ν) < ε/3,

by property (c) of the d-property list. The triangle inequality, together with (1), then gives d(μ, μ̃) < ε. This proves that the class of ABI processes is d-closed.

In summary, finite codings of i.i.d. processes are almost block-independent, and d-limits of ABI processes are almost block-independent; since a stationary coding is a d-limit of finite codings, it follows that stationary codings of i.i.d. processes are almost block-independent. This completes the proof that B-processes are almost block-independent.
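The window half-width argument is concrete: a finite stationary code is a sliding-window map of 2w+1 input symbols, so output blocks separated by more than 2w input positions are functions of disjoint, hence independent, stretches of the i.i.d. input. A minimal sketch, with a hypothetical window rule, is the following; the names are illustrative only.

import random

def finite_stationary_code(x, w, rule):
    """Apply a window code of half-width w: y_i = rule(x_{i-w}, ..., x_{i+w}).
    Boundary indices are skipped, so the output is shorter by 2w symbols."""
    return [rule(tuple(x[i - w:i + w + 1])) for i in range(w, len(x) - w)]

# Hypothetical rule on binary inputs: majority vote over the window.
def majority(window):
    return 1 if 2 * sum(window) > len(window) else 0

iid_bits = [random.randint(0, 1) for _ in range(10000)]
coded = finite_stationary_code(iid_bits, w=2, rule=majority)

Under the natural joining described above, the coded process and its independent n-blocking can be made to differ only in the first and last w places of each n-block, which is the source of the 2w/n bound.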

(ii) (A x v) ({(z. almost surely. The lemma immediately yields the desired result for one can start with the vector valued.i. i. v)o}) < 4E. then given . The idea of the proof of Lemma IV.. The exact representation of Vi l is unimportant.i. which allows one to choose whatever continuously distributed i. The key to the construction is the following lemma. process Vil with Kolmogorov measure A for which there is a sequence of stationary codes {Ft } such that the following hold. (A x o F -1 ) < 8. (b)Er Prob((Ft+ i (z))o (Fr(z))o) — < oc.i.}.d.i. (a)a(p... A block code version of the lemma is then used to change to an arbitrarily better block code. process with continuous distribution is a stationary coding.d. o Fr-1 ) > 0.214 CHAPTER IV.i. A = X o F -1 . with window width 1.)} with independent components.1.. and hence any i. process with continuous distribution.d. V(t)0 is binary and equiprobable. each coordinate Fe(z)i changes only finitely often as t oc. An i.o G -1 ) <E. for it means that.5 > 0 there is a finite stationary coding F of {(Z i . process (2) onto the almost block-independent process A.d. (i) d(p. A. .d. of any other i. Let {Xi } be an almost block-independent process with finite alphabet A and Kolmogorov measure A. 1) and. producing thereby a limit coding F from the vector-valued i. V(2)1 . for all alphabet symbols b.i. Lemma W. where V(0)0 is uniform on [0. . process {Z. The lemma can then be applied sequentially with Z (t) i = (V(0) i . This fact.d. to obtain a sequence of stationary codes {Ft } that satisfy the desired properties (a) and (b).3 is to first truncate G to a block code. for t > 1. where p. is almost block-independent and G is a finite stationary coding of a continuously distributed i. as t — * oc. and Er = 2-t+1 and 3. . Therefore.d.d.} with Kolmogorov measure A. which allows the creation of arbitrarily better codes by making small changes in a good code. V): G(z)o F (z..b ABI processes are B-processes. countable component. = 2 -t. since any random variable with continuous distribution is a function of any other random variable with continuous distribution.i. V(t — 1) i ).3 (Stationary code improvement. B-PROCESSES.1. that is. V (1) i .i. together with an arbitrary initial coding F0 of { V(0)i }. 1.d.i.i. process {Z. process (2) {(V(0). It will be shown that there is a continuously distributed i. IV.d. The second condition is important. there will almost surely exist a limit code F(z)1 = lime Ft (z) i which is stationary and for which cl(A. V(1) i . A o F -1 ) = 0. is a stationary coding of an infinite alphabet i. and if is a binary equiprobable i. process is convenient at a given stage of the construction. Vi )} such that the following two properties hold. The goal is to show that p. V(t) i . process with continuous distribution. will be used in the sequel. process independent of {Z.) If a(p. } is continuously distributed if Prob(Zo = b) = 0.

Then there is a measurable mapping 4): Y 1-+ A n such that ji = y o 0-1 and such that Ev(dn((k (Y). For n > N define G(u7) = bF(ii)7± 1c.b. for v-almost every sample path u. y) be a nonatomic probability space and suppose *: Y 1-± A n is a measurable mapping such that V 0 —1 ) < 8. n)-independent extension of a measure pt. a joining has the form X(a'i' . 1. n)-independent process if {Ri } is a (8. An ergodic pair process {(Wi .i.G(u7)) < S. b7)) = E v(dn (0 (Y). b7) = v(0-1 (4) n (b7)) for 4): Y H A n such that A n = vo V I . for v-almost every sample path u. using the auxiliary process {Vi } to tell when to apply the block code. ALMOST BLOCK-INDEPENDENCE. such that d. Given S > 0 there is an N such that if n > N then there is an n-block code G.u.5 (Block code improvement. n)blocking process.SECTION IV. A finite stationary code is truncated to a block code by using it except when within a window half-width of the end of a block. The following terminology will be used. The following lemma supplies the desired stronger form of the ABI property.8.1 = diliq is a coding block) = .9.(F (u)7.4 (Code truncation. The block code version of the stationary code improvement lemma is stated as the following lemma. while retaining the property that the blocks occur independently. where u is any sample path for which û7 = /47. see Exercise 12 in Section 1. ergodic process {Ri } will be called a (S. n)-truncation of F. with associated (8. .) Let F be a finite stationary coding which carries the stationary process v onto the stationary process 71. called the (S. so that Vd. that is. Lemma IV. This is well-defined since F(ii)7+ 1 depends only on the values of û. n)-blocking process {Ri } . Proof Let w be the window half-width of F and let N = 2w/8. if the waiting time between l's is never more than n and is exactly equal to n with probability at least 1 — 3. n)-independent process and Prob(14f. 4 E A'. nature of the auxiliary process (Vi) and its independence of the {Z1 } process guarantee that the final coded process satisfies an extension of almost block-independence in which occasional spacing between blocks is allowed.) Let (Y.1. n)-blocking process and if 147Ç' is independent of {(Wi . the better block code is converted to a stationary code by a modification of the block-to-stationary method of I.(a7). Proof This is really just the mapping form of the idea that d-joinings are created by cutting up nonatomic spaces.d. G(u7)) < 8.1 = On -1 1. Lemma IV. since the only disagreements can occur in the first or last w places. if {(147„ Ri )} is a (S. Clearly d. An ergodic process {Wi } is called a (8. an extension which will be shown to hold for ABI processes. Indeed. (4 . The i. (y))) < 8. Ri ): i < 0 } . 215 Finally. on A'. and b and c are fixed words of length w. (Y))). given that is a coding block.(F(u)7. for i< j<i+n-1and Ri+n _1 = 1. A binary.1. A block R:+n-1 of length n will be called a coding block if Ri = 0.? +n--. Ri)} will be called a (S. The almost block-independence property can be strengthened to allow gaps between the blocks.1.
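To make the blocking terminology concrete, here is a small simulation sketch. The names and the particular indexing convention for coding blocks are assumptions for illustration (the text's indexing may differ in minor details). It generates a binary (δ, n)-blocking process, in which the waiting time between 1's is never more than n and equals n with probability at least 1 - δ, and then converts an arbitrary n-block code into a sequence code by applying the block code on coding blocks and writing a fixed letter elsewhere, in the spirit of the block-to-stationary method of I.8.

import random

def blocking_process(length, n, delta, rng=random):
    """A binary sequence whose gaps between successive 1's are at most n
    and equal to n with probability about 1 - delta."""
    r = []
    while len(r) < length:
        gap = n if rng.random() > delta else rng.randint(1, n - 1)
        r.extend([0] * (gap - 1) + [1])
    return r[:length]

def block_to_sequence_code(x, r, n, block_code, filler):
    """Apply an n-block code on each stretch of length n that begins just
    after a 1 and ends at the next 1 (one convention for a coding block);
    write `filler` at every other position."""
    y = [filler] * len(x)
    ones = [j for j, b in enumerate(r) if b == 1]
    for a, b in zip(ones, ones[1:]):
        if b - a == n:                       # gap of exactly n
            s = a + 1                        # coding block occupies s .. s+n-1
            y[s:s + n] = list(block_code(tuple(x[s:s + n])))
    return y

n, delta = 10, 0.05
r = blocking_process(5000, n, delta)
x = [random.randint(0, 1) for _ in range(5000)]
y = block_to_sequence_code(x, r, n, block_code=lambda b: b[::-1], filler=0)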

If M i is large enough so that tk NI — nk nk+1 tIcNI is negligible compared to M1 and if 8 is small enough. n)-blocking process (R 1 ).6. and hence cl(p. To make this outline precise.} be an almost block-independent process with Kolmogorov measure t. Thus after omitting at most N1 places at the beginning and end of the n-block it becomes synchronized with the independent Ni -blocking. and let À x y denote the Kolmogorov measure of the direct product {(Z„ V. vm ) < 13. ii(v. the starting position. n)-independent extension {W} of An . The obstacle to a good d-fit that must be overcome is the possible nonsynchronization between n-blocks and multiples of NI -blocks. y) < 6/3.i.. For each k for which nk +i — n k = n choose the least integer tk and greatest integer sk such that tk Ni > nk . Proof In outline.} a binary.d. with associated (6. n)-independent extension. of an n-block may not be an exact multiple of NI .} are now joined by using the joining that yields (4) on the blocks ftk + 1. first choose N1 so large that a(p.1. n)independent extension 11. The goal is to show that if M1 is large enough and 3 small enough then ii(y.) Let {X. say i. Let p. completing the proof of Lemma IV. yields the desired result. Lemma 1. B-PROCESSES.} and {W.) < c. for all ni. here is the idea of the proof. Let {Z.216 CHAPTER IV.AN Ni1+1 has the same distribution as X' N1 +i . skNi < Note that m = skNI — tkNi is at least as large as MI. It is enough to show how an independent NI -blocking that is close to p.}. but at least there will be an M1 such that dni (/u ni . and that i7 V -7. Since y is not stationary it need not be true that clni (it„„ um ) < E/3.6 (Independent-extension. provided only that 6 is small enough and n is large enough. be .} will be less than 2E13. in > Mi.} be a continuously distributed i..1. { Proof of Lemma IV. (3) Put N2 = M1+ 2N1 and fix n > N2. equiprobable i. Let {n k } be the successive indices of the locations of the l's in { ri}.3.1. < 2E/3. but i j will be for some j < NI. < 2E/3. An application of property (g) in the list of dproperties. Let 6 be a positive number to be specified later and let be the Kolmogorov measure of a (8. Towards this end. } . independent of {Z.i1. so that (3) yields t NI (4) ytS:N IV I I± Wits:N A: 1+ ) < € /3. the ci-distance between the processes {K} and {W. n)-blocking process {R. can be well-approximated by a (6. Given E > 0 there is an N and a 6 > 0 such that if n > N then a(p.. process and {V. process. Ft) < c. where y is the Kolmogorov measure of the independent N I -blocking IN of pt. then.)}. for almost every realization {r k }. of A n . The processes Y. with respective Kolmogorov measures À and y. for any (8. let {W} denote the W-process conditioned on a realization { ri } of the associated (6.11.i. Lemma IV. This is solved by noting that conditioned on starting at a coding n-block.9. hence all that is required is that 2N1 be negligible relative to n. sk NI L and an arbitrary joining on the other parts. and that 8 be so small that very little is outside coding blocks.d.

and hence the process defined by Wk = Z Yk+n-1' i i .SECTION IV 1. E(d. 217 an almost block-independent process with alphabet A.. (X. almost surely.)} such that the following hold. Furthermore. ALMOST BLOCK-INDEPENDENCE. y) 0 }) < 4e.u. Let F(z. v): G(z)0 F (z. process. n)-blocking process {Ri } and lift this to the stationary coding g* of {(Z. let g be a finite stationary coding of {V. fix a E A. Xn and hence the block code improvement lemma provides an n-block code 0 such that = o 0 -1 and E(dn (à .) < 8.)•••d(wk))< 2c. a(4 +n-i„ )) < E. The see why property (ii) holds.. (i) d(p. The goal is to construct a finite stationary coding F of {(Z. In particular. vgk +n-1 = 0 (Wk). (ii) (X x y) (((z. V..z (d(Z11 ).i. This completes the proof of the lemma. Towards this end. Ø(Z7))) = E(dn (d (Z 11 ). n)-truncation G of G.d. r)i = a. (5).}. let {yk } denote the (random) sequence of indices that mark the starting positions of the successive coding blocks in a (random) realization r of {R. Since. X o Fix 3 > O. V. implies that the concatenations satisfy (6) k—>oo lirn dnk(O(Wi) • • • 0(Wk).i. g(v)). hence property (i) holds. first note that {Zi } and {Ri } are independent. thereby establishing property (ii). coding such that d(p. and hence the proof that an ABI process is a stationary coding of an i. first choose n and S < e so that if is a (S. By definition. by the ergodic theorem applied to {R. and define F(z. 0(Z7)) I R is a coding block) so the ergodic theorem applied to {Wk }. i fzi Uk[Yk. define +n-1 ). however.d. In other words. the dn(G(Z)kk+n-1 limiting density of indices i for which i V Uk[Yk. a(w. Y)o}) <2e E< 4e. F(u.} onto a (S. along with the fitting bound. n)-independent extension of i then d(p. 0(Z))) 26. and let G be a finite stationary < E. v): G(z)0 F(z. r Ykk+" = dp(zYA Yk ) for all k.)} onto {(Z„ Ri )} defined by g*(z. (5) Next. Yk n — 1]. r) be the (stationary) code defined by applying 0 whenever the waiting time between 1 's is exactly n. and so that there is an (c. } . if g(v) = r. F(Z. with Wo distributed as Z. x y) o <8. and . this means that < 2E. p e-. yk + n — 1] is almost surely at most & it follows from (6) that (X x y) ({(z. and coding to some fixed letter whenever this does not occur. Clearly F maps onto a n)-independent extension of . y) = (z.(Z). .

Remark IV.1.7 The existence of block codes with a given distribution, as well as the independence of the blocking processes used to convert block codes to stationary codes onto independent extensions, are easy to obtain for i.i.d. processes with continuous distribution. With considerably more effort, suitable approximate forms of these ideas can be obtained for any finite alphabet i.i.d. process whose entropy is at least that of μ, see [35]. These results, which are the basis for Ornstein's isomorphism theory, are discussed in detail in his book, [46], see also [63, 42].

IV.1.c Mixing Markov and almost block-independence.

The key to the fact that mixing Markov implies ABI, as well as to several later results about mixing Markov chains, is a simple form of coupling, a special type of joining frequently used in probability theory. Let {X_n} be a (stationary) mixing Markov process and let {Y_n} be the nonstationary Markov chain with the same transition matrix as {X_n}, but which is conditioned to start with Y_0 = a. Let μ and ν be the Kolmogorov measures of {X_n} and {Y_n}, respectively. A joining {(X_n, Y_n)} is obtained by running the two processes independently until they agree, then running them together. The Markov property guarantees that this is indeed a joining of the two processes, with the very strong additional property that a sample path is always paired with a sample path which agrees with it ever after as soon as they agree once. In particular, n·d_n(X_1^n, Y_1^n) cannot exceed the expected time until the two processes agree.

The precise formulation of the coupling idea needed here follows. For each n ≥ 1, the coupling function is defined by the formula

   w_n(a_1^n, b_1^n) = min{i ∈ [1, n-1]: a_i = b_i},  if {i ∈ [1, n-1]: a_i = b_i} ≠ ∅,
   w_n(a_1^n, b_1^n) = n,  otherwise.

The coupling measure is the measure λ_n on A^n × A^n defined by

(7)  λ_n(a_1^n, b_1^n) = ν(a_1^w) μ(b_1^w) μ(a_{w+1}^n | a_w),  if w_n(a_1^n, b_1^n) = w < n and a_{w+1}^n = b_{w+1}^n,
     λ_n(a_1^n, b_1^n) = ν(a_1^n) μ(b_1^n),  if w_n(a_1^n, b_1^n) = n,
     λ_n(a_1^n, b_1^n) = 0,  otherwise.

Lemma IV.1.8 (The Markov coupling lemma.) The coupling function satisfies the bound

(8)  d_n(μ, ν) ≤ (1/n) E_{λ_n}(w_n).

Furthermore, the expected value of the coupling function is bounded, independent of n and a, so that d_n(μ, ν) → 0, as n → ∞.

Proof The inequality (8) follows from the fact that the λ_n defined by (7) is a joining of μ_n and ν_n which is concentrated on pairs of sequences that agree ever after, as soon as they agree once. A direct calculation, making use of the Markov property for both measures, shows that λ_n is a probability measure with ν_n and μ_n as marginals. Furthermore, since the chain is mixing, the expected value of the coupling function is bounded, independent of n and a, see Exercise 2, below, so that d_n(μ, ν) indeed goes to 0. This establishes the lemma.
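The coupling bound is easy to check numerically. The sketch below, which is illustrative code rather than anything from the text, runs the stationary chain and the chain conditioned to start at a fixed state a independently until they first agree, records the agreement time, and estimates E_{λ_n}(w_n); for a mixing chain this expectation stays bounded in n, so the bound (8) forces d_n(μ, ν) to zero.

import random

def step(state, P, rng=random):
    """One step of a finite Markov chain with transition matrix P (list of rows)."""
    return rng.choices(range(len(P)), weights=P[state])[0]

def coupling_time(P, pi, a, n, rng=random):
    """First index i in [1, n-1] at which the two coupled paths agree (n if never)."""
    x = rng.choices(range(len(P)), weights=pi)[0]   # stationary start
    y = a                                           # conditioned start Y_0 = a
    for i in range(1, n):
        x, y = step(x, P, rng), step(y, P, rng)
        if x == y:
            return i
    return n

# A mixing two-state chain and its stationary vector.
P = [[0.9, 0.1], [0.2, 0.8]]
pi = [2 / 3, 1 / 3]
n, trials = 200, 5000
avg_w = sum(coupling_time(P, pi, a=0, n=n) for _ in range(trials)) / trials
print("estimated E(w_n):", avg_w, "  resulting bound on d_n:", avg_w / n)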

for all large enough n.. see Example 1. } plus the fact that Zk k n +7 is independent of Z j . and hence that {X i } is almost blockindependent. with Kolmogorov measure i. given Xk n = Xkn • kn "± ditioned distribution of XI and the conditional distribution of Xk The coupling lemma implies that (9) d. Likewise. Since the proof extends to Markov chains of arbitrary order. 1:1 The almost block-independent processes are. therefore.9 An almost block-independent process is the d-limit of mixing Markov processes. The notation cin (X k n +7. provided only that n is large enough.2. Proof By the independent-extension lemma. using the Markov property of {X. For each positive integer m. n)-independent extension ri which is mixing Markov of some order. Fix a mixing Markov chain {X. Section 1. } with Kolmogorov measure and fix E > 0. the idea to construct a joining by matching successive n-blocks. that distribution of Xk realizes the du -distance between them.(x k n ±n . (See Exercise 13..n (dmn(xr. A suitable random insertion of spacers between the blocks produces a (8. by the Markov property. < c. u is a joining of A„.1. and the distribution of X k n +7 conditioned on Xk n is d-close to the unconditioned distribution of Xn i . Zr)) < E. first choose./x. X k n +7/xk n ) will be used to denote the du -distance between the unconI.„ with iimu . To construct Xmn.1. .6. summing over the successive n-blocks and using the bound (9) yields the bound Ex„. For each positive integer m. is to produce a (8. in the sense that Theorem IV. conditioned on the past depends only on Xk n . j < kn.T.n± ±n 1 /xkn) < for all sufficiently large n. ALMOST BLOCK-INDEPENDENCE. a joining A. let À nin be the measure on A"' x A mn defined by m-1 (10) A. (dnin (finn .„. This will produce the desired dcloseness. n)-independent extension which is actually a mixing Markov chain. n)-independent extension which is a function of a mixing Markov chain. given Xk n = xk n . proves that d(p. The concatenated-block process defined by pt is a function of an irreducible Markov chain. conditioned on the past. Fix n for which (9) holds and let {Zi } be the independent n-blocking of {Xi } . of the unconditioned kn "Vil and the conditional distribution of Xk kn + 7. exactly the almost mixing Markov processes.pnn(XT n einn ) = ii(4)11. 1. This is easy to accomplish by marking long . 219 To prove that a mixing Markov chain is d-close to its independent n-blocking. a joining Ain „ of A m . x k . Lemma IV. it is enough to show that for any 8 > 0.„. in fact. for each xbi .11. this completes the proof that mixing Markov chains are almost block-independent.SECTION IV 1. because the distribution of Xk k .(Z?) k=1 Jr A1 zkknn++7)• A direct calculation. by the coupling lemma. a measure it on A" has a (8. with ii mn will be constructed such that Ex.. z7)) < which.) The only problem. of course. shows that A. not just a function of a mixing Markov chain.

respectively. 1) with probability (1 — p). it is mixing. (sxr . [66]. aperiodic Markov chain with state space se" x {0. An). (b) (sxr . n (dnzn Winn zr)) E. Show the expected value of the coupling function is bounded. concatenations of blocks with a symbol not in the alphabet. Next. where y and D are the Kolmogorov measures of {W} spectively. independent of n. i) can only go to (sxr". let C c A' consist of those ar for which it(4) > 0 and let s be a symbol not in the alphabet A. and s otherwise.. For each positive integer m. let sCm denote the set of all concatenations of the form s w(1) • • • w(m). 0) with probability pii(syr). The process {Zi } defined by Zi = f(Y 1 ) is easily seen to be (mn + 2)-order Markov. to be specified later. for each syr E SC m . } and {W} are (6. Remark IV. Show that if { 1/17.1. and let { ri } be the stationary.) This completes the proof of Theorem IV. if 1 < i < mn + 1. n)-blocking process {R.220 CHAPTER IV. then randomly inserting an occasional repeat of this symbol. 1. where w(i) E C. n with Ti Ex. Furthermore. A proof that does not require alphabet expansion is outlined in Exercise 6c. define f(sxrn. il To describe the preceding in a rigorous manner. Show that (7) defines a joining of A n and yn .u. i) to be x. nm 1} and transition matrix defined by the rules (a) If i < nm 1. 2.1. IV. 1. 1.10 The use of a marker symbol not in the original alphabet in the above construction was only to simplify the proof that the final process {Z1 } is Markov. = {1T 7i } . irreducible. provided m is large enough and p is small enough (so that only a 6-fraction of indices are spacers s. since once the marker s is located in the past all future probabilities are known. such that 4.9. . B-PROCESSES. The coupling proof that mixing Markov chains are almost block-independent is modeled after the proof of a conditional form given in [41]. rean(An. i 1). and let ft be the measure on sCrn defined by the formula ft(sw( 1 ) . 3. • w(m)) = i=1 Fix 0 < p < 1. below. n)-independent extensions of An and fi n . (Z1 ) is clearly a (6. The proof that ABI processes are d-limits of mixing Markov processes is new.1. then d(v.d Exercises. n)independent extension of A. } . . Show that the defined by (10) is indeed a joining of .a(syr").„. As a function of the mixing Markov chain {ri }. with the same associated (6. and to (syr". nm +1) goes to (sy'llin.11 The proof that the B-processes are exactly the almost block-independent processes is based on the original proof. Remark W.

THE FINITELY DETERMINED PROPERTY. (Hint: use Exercise 4. then by altering probabilities slightly.i. Show that {17 } is Markov. and define Y„ to be 1 if X„ X„_1.d. Let {X.2. Some special codings of i. 1. given E > 0. it can be assumed that a(a) = 0 for some a.2 The finitely determined property. In other words. Section IV.) (c) Prove Theorem IV.2. y) < E. to An. (i) 114 — vkl < 6 (ii) IH(. afinitely determined process is a stationary coding of an i.d. process. and almost block-independent. (See Exercise 4c. The most important property of a stationary coding of an i. the process {Z.1. In particular. Proof Let p (n ) be the concatenated-block process defined by the restriction pn of a stationary measure p. Let p.9 is a (8.i.1 A finitely determined process is ergodic./ } defined in the proof of Theorem IV. for it allows d-approximation of the process merely by approximating its entropy and a suitable finite joint distribution.) 6. oc.} be binary i. and not i. The sequence ba2nb can be used in place of s to mark the beginning of a long concatenation of n-blocks. mixing.d. processes are Markov.) 7. unless X n = 0 with probability 1/2. be the Kolmogorov measure of an ABI process for which 0 < kc i (a) < 1. 221 5.u) — H(v)I < must also satisfy (iii) c/(kt. a stationary p is finitely determined if. the only error in this . 0 as n with itself n times. such that any stationary process y which satisfies the two conditions. Theorem IV. Section 1.i. Some standard constructions produce sequences of processes converging weakly and in entropy to a given process. then the distribution of k-blocks in n-blocks is almost the same as the p. If k fixed and n is large relative to k. there is a 8 > 0 and a positive integer k. Show that even if the marker s belongs to the original alphabet A. (Hint: if n is large enough.-distribution. (a) Show that p.5. otherwise. A stationary process p.i.d. for some a E A. n)-independent extension of IL. and using Exercise 6b and Exercise 4.i. where a" denotes the concatenation of a (b) Show that Wan) —3.d. and 0. must be mixing. Properties of such approximating processes that are preserved under d-limits are automatically inherited by any finitely determined limit process. process is the finitely determined property.9 without extending the alphabet. This principle is used in the following theorem to establish ergodicity and almost block-independence for finitely determined processes.SECTION IV. is finitely determined (FD) if any sequence of stationary processes which converges to it weakly and in entropy also converges to it in d.

that a stationary coding of an i. since functions of mixing processes are mixing and d-limits of mixing processes are mixing. approximation being the end effect caused by the fact that the final k symbols of the oo. .v) < E.a (n ) and p. Theorem IV. the v-distribution of k-blocks in n-blocks is the measure = 0(k.6. satisfies the finite form of the FD property stated in the theorem. then (1) 41 _< 2(k — 1)/n.i.) A stationary process p.2 (The FD property: finite form. that is.tvE. {P ) } will converge weakly and in entropy to /I .1. a finitely determined process must be almost block-independent. Also useful in applications is a form of the finitely determined property which refers only to finite distributions. Thus if 3n goes to 0 suitably fast. If y is a measure on An. is small enough then /2 00 and ii (n) have almost the same n-block distribution. there is a > 0 and positive integers k and N such that if n > N then any measure v on An for which Ip k — 0(k. y) on Ak defined by )= 1 k+it k i=ouE. Xn i 1 starting positions the expected value of the empirical k-block distribution with respect to v. a finitely determined process must be mixing.t (n ) is equal n-block have to be ignored. is finitely determined if and only if given c > 0. then p. hence in if is assumed to be finitely determined. as n weakly and in entropy so that {g (n ) } converges to 1-1(u n )/n which converges to H(/)). Concatenated-block processes can be modified by randomly inserting spacers between the blocks. hence almost the same k-block distribution for any k < n. and since a-limits of ABI processes are ABI. process is finitely determined is much deeper and will be established later. and hence almost block-independent. (n ) have almost the same entropy. for any k < n. must be finitely determined.d. and furthermore. let ii (n ) be a (8n . The entropy of i. (n) is ergodic and since ergodicity is preserved by passage to d-limits. it follows that a finitely determined process is ergodic. Also. if p.n. and k < n. Note also that if i3 is the concatenated-block process defined by y. . For each n. y) —1- since the only difference between the two measures is that 14 includes in its average the ways k-blocks can overlap two n-blocks. a The converse result. see Exercise 5. then 0(k. Lemma IV. the average of the y-probability of a i` over the first n — k in sequences of length n. v„) = vk . v)I < 6 and IH(pn ) — H(v)I < n6 must also satisfy cin (p. Thus. Since ii(n ) can be chosen to be a function of a mixing Markov chain. Since each 11. A direct calculation shows that 0(aiic ) = E Pk(a14)Y(4) = Ep(Pk(4IX).222 CHAPTER IV. EE V (14 a k i V). n)-independent extension of A n (such processes were defined as part of the independent-extension lemma.2.) If 8. This completes the proof of Theorem IV. and (11 n)H (v n ) H (v). Proof One direction is easy for if I)„ is the projection onto An of a stationary measure I). Section IV. B-PROCESSES.1. to A. 10(k. Thus 147 ) gk .1.2.
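As an aside, the k-block-in-n-block distribution φ(k, ν) is a finite computation, which makes the end-effect discussion above easy to check numerically. In the sketch below (the helper names are illustrative), ν is a measure on A^n given as a dictionary, and φ(k, ν) is computed as the ν-average of the empirical overlapping k-block distribution of each n-word.

from collections import defaultdict

def empirical_k_blocks(word, k):
    """p_k(. | word): overlapping k-block frequencies inside one n-word."""
    counts = defaultdict(float)
    m = len(word) - k + 1
    for i in range(m):
        counts[word[i:i + k]] += 1.0 / m
    return counts

def k_block_in_n_block(nu, k):
    """phi(k, nu) = E_nu(p_k(. | x_1^n)) for a measure nu on A^n (dict: word -> prob)."""
    phi = defaultdict(float)
    for word, prob in nu.items():
        for block, freq in empirical_k_blocks(word, k).items():
            phi[block] += prob * freq
    return dict(phi)

# Tiny example: nu concentrated on two 6-words over {0, 1}.
nu = {(0, 1, 0, 1, 0, 1): 0.5, (1, 1, 0, 0, 1, 1): 0.5}
phi2 = k_block_in_n_block(nu, 2)

The k-block distribution of the concatenated-block process defined by ν differs from φ(k, ν) only through k-blocks that straddle two consecutive n-blocks, which is where the 2(k-1)/n end-effect bound comes from.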

223 To prove the converse assume ti.i. within 6/2 of H(p. by Theorem IV. This completes the proof of Theorem IV. Let p(x. then 1. y) denote the marginal distributions. for an ABI process is the d-limit of mixing Markov processes by Theorem IV. hence is totally ergodic. such that any stationary process y that satisfies Ivk — 12k 1 < 6 and 1H (v) — H COI < 6 must also satisfy d(v.2.i. given Y. The i. it will be shown that the class of finitely determined processes is closed in the d-metric. in turn. y) — oak' < 6. if independence is assumed.9 of the preceding section. and hence the method works. Furthermore. given Xi'. < 6/2. An approximate form of this is stated as the following lemma. Fix n > N and let y be a measure on An for which 10(k. an approximate independence concept will be used.).4)1 <n6/2 and let Tf be the concatenated-block process defined by v. is finitely determined and let c be a given positive number. The definition of N guarantees that 117k — till 5_ 17k — 0(k.2. y) and p(y) = py (y) = Ex p(x. First it will be shown that an i. proof will be carried out by induction. which is within 8/2 of (1/n)H(An). To make this sketch into a real proof.SECTION IV.. so Exercise 2 implies that 4.a. using (1).2.d.i. Thus. v) I +10(k.1 I. IV.)) — H COI < 6. which is.2. y) —1 <8/2. The proof that almost block-independence implies finitely determined will be given in three stages. 1H (i. as well as possible to the conditional distribution of yr. the definitions of k and 6 imply that a (/urY-) < c. These results immediately show that ABI implies FD. Y = y) denote the joint distribution of the pair (X.2. it can also be supposed that if n > N and if 17 is the concatenated-block process defined by a measure y on An. y) — p(x)p(y)I < e. y) < ii(p. The definition of finitely determined yields a 6 > 0 and a k.) < E. The dn -fi of the distributions of X'il and yr by fitting the condiidea is to extend a good tting tional distribution of X n+ 1. y If H(X) = H(X1Y).1. is almost independent of Yçi. The conditional distribution of Xn+ i. X. It is easy to get started for closeness in first-order distribution means that first-order distributions can be dl-well-fitted. given X. that is.d. This proof will then be extended to establish that mixing Markov processes are finitely determined.0(k. processes are finitely determined.17) < E. p. given Yin. y) = Prob(X = x. then X and Y are independent. (A n . CI IV.1. since H(t)) = (1/ n)H (v).d. provided closeness in distribution and entropy implies that the conditional distribution of 17. and let p(x) = px (x) = E. By making N Choose N so large that if n > N then 1(1/01-/(an ) — enough larger relative to k to take care of end effects.a ABI implies FD.2. Y) of finite-valued random variables. THE FINITELY DETERMINED PROPERTY.y(x. A finitely determined process is mixing.k1 < 3/2 and 1H (v) — H(1. y) = px. Finally. v) — ii. . The random variables are said to be c-independent if E Ip(x. process is finitely determined. does not depend on ril.. p(x.

)' g(x)p(x) E p(x) h(y)p(ydx). that is. y) = vii/ 2 . H(X) — H(X1Y) = _ p(x) log p(x) log E p(x. for all m > 0. Yi-1) and yi are c-independent for each j > 1. An i.i. process is.2. p(x.i.2. with first marginal p(x) = yy p(x. y) x. y) and conditional distribution p(y1x) = p(x. y)p(x. Lemma 1. then X and Y are 6-independent.d. then 11 (X 1 ) will be close to 11 (Y1 ).4 Let {X i } be i.d. independent of X and Y.i. 2 D(pxylpxp y) C (E .y1PxP Y )• P(x. however.y D (p x.d. y) = g(x) h(y). then H(Yi) — 1-1 (Y11 172 1 ) will be small. Proof If first-order distributions are close enough then first-order entropies will be close. there is a 8 > 0 such that if 1. y)I p(x). so that if 11(v) = limm H(Y i lY i ) is close enough to H(A). process. which is just the equality part of property (a) in the d-properties lemma. y IP(X x P(X)P(Y)i) where c is a positive constant. This establishes the lemma. di (A.6. without proof. 0 A stationary process {Yi } is said to be an 6-independent process if (Y1.224 CHAPTER IV. 0 A conditioning formula useful in the i. of course. .3 (The E-independence lemma. proof as well as in later results is stated here. e-independent for every c > O.y P(Y) E p(x.11. Thus if Ai is close enough to v1. Proof The entropy difference may be expressed as the divergence of the joint distribution pxy from the product distribution px p y . Y) p(x)p(y) Pinsker's inequality. The i. that is. Lemma IV. yields. then E f (x. property. since H(YilY°m) is decreasing in m. y) = X .) There is a positive constant c such that if 11(X) — H(X1Y) < c6 2 . Lemma IV. In its statement. B-PROCESSES.2. Use will also be made of the fact that first-order d-distance is just one-half of variational distance. as the following lemma.u1 — vil < 8 and 11I (A) — H(v)1 < then {Yi} is an 6-independent process.d. y) log p(x. The lemma now follows from the preceding lemma. y) denotes the joint distribution of a pair (X. The next lemma asserts that a process is e-independent if it is close enough in entropy and first-order distribution to an i. Y) of finite-valued random variables.i.9. Given e > 0.d. and {Y i } be stationary with respective Kolmogorov measures /2 and v. implies that H(A) = H(X i ). y) x.i. Exercise 6 of Section 1.5 If f (x. Lemma IV.
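ε-independence itself is a purely finite computation. The following sketch (illustrative, not from the text) evaluates the sum over (x, y) of |p(x, y) - p(x)p(y)| for a joint distribution given as a dictionary; this is the quantity that the preceding lemmas control through the entropy gap H(X) - H(X|Y) and Pinsker's inequality.

from collections import defaultdict

def independence_gap(joint):
    """Sum over all (x, y) of |p(x, y) - p(x) p(y)|; joint is a dict {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(abs(joint.get((x, y), 0.0) - px[x] * py[y])
               for x in px for y in py)

# X and Y are epsilon-independent exactly when this gap is less than epsilon.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
print(independence_gap(joint))   # 0.2 for this example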

and let An realize an (Xtil. Assume it has been shown that an (r i'. Lemma IV. Theorem IV. This strategy is successful because the distribution of Xn+ 1 /x'il is equal to the distribution of Xn+ i./4). THE FINITELY DE IERMINED PROPERTY. < E. then the process {Yi } defined by y is an c-independent process. by independence.2.6. which is at most c/2.n realizes d i (X n +i The triangle inequality yields (Xn+1/X r i l Xn+1) dl(Xn+19 Yn+1) + (Yn+1 Y1 /y) The first term is 0. Lemma IV. since it was assumed that A n realizes while the second sum is equal to cin (X..u. The strategy is to extend by fitting the conditional distributions Xn+ 1 /4 and Yn+i /yri' as well as possible. y`i') + (xn+i.5. Y < 1 /y) y)di(Xn+04.realize di (Xn+i by {r}. 225 Proof The following notation and terminology will be used. 1 (dn+i) < ndn(A. since Yn+1 and Y.u i — vil < 3/2 < c/2. (3) di (X E 1 /x.) — 1/(y)1 < 3. since d i (X I . and di (Xn +i Yn-i-i/Y) denotes the a-distance between the corresponding conditional distributions. is certainly a joining of /2. thereby establishing the induction step. Y1 ) = (1/2)I. Yri+1). n+i andy n+ i. The third term is just (1/2) the variational distance between the distributions of the unconditioned random variable Yn+1 and the conditioned random variable Y +1 /y. The measure X.. y +1 ) = ndn (x`i' . The random variable X n+1 . close on the average to that of Yn+i .6 An i. whose expected value (with respect to yri') is at most /2.(dn) XI E VI Xri (X n i Yn i )d1(Xn+1. ± 1 defined 1 . Fix It will be shown by induction that cin (X.f» Y.2. which along with the bound nExn (dn ) < n6 and the fact that A n +1 is a joining of . .' are c-independent. process is finitely determined. y) +E < (n + 1)c. yields (n (2) 1 )Ex n+ I(dn+1) = nEA. by independence. Y1 ).d. while.4 provides a positive number 3 so that if j — vil < 8 and I H(/).. Yin). This completes the proof of Theorem IV. Fix an i. YD. Yin) < c.x7.d. since (n 1)dn+1(4 +1 .. 4+1) = \ x. and the distribution of 1" 1 + 1 /Y1 is.)T (Xn+1 Yni-1)• n+1 The first term is upper bounded by nc. the conditional formula.2. produces the inequality (n 1)dn+1(. the second term is equal to di (X I . Xn+1 (4+1. Thus the second sum in (2) is at most 6/2 ± 6/2. Without loss of generality it can be supposed that 3 < E. let ?. y) (n + A. y. For each x. 44 and yn+1 . Yn+1)À. conditioned on X = x is denoted by X n+ i /4.+1/4).SECTION IV.2. since it was assumed thatX x i .2. which is in turn close to the distribution of X n+1 . yn+. such a Getting started is easy. Furthermore. process {Xi} and c > O.i.i. by stationarity. 1/I' ( An+1. by c-independence. .
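The induction step in the preceding proof is constructive: a joining of the first n coordinates is extended by coupling the two conditional next-symbol distributions so as to maximize the probability of agreement, which realizes d_1 since first-order d-distance is half the variational distance. A sketch of this one-step extension follows, under the simplifying assumption that the conditional distributions are supplied explicitly as dictionaries; in the i.i.d. case the first conditional ignores its argument. All names are illustrative.

def optimal_d1_coupling(p, q):
    """Couple two distributions on the same alphabet, maximizing P(a = b)."""
    support = set(p) | set(q)
    common = {a: min(p.get(a, 0.0), q.get(a, 0.0)) for a in support}
    lam = {(a, a): m for a, m in common.items() if m > 0}
    rest_p = [(a, p.get(a, 0.0) - common[a]) for a in support if p.get(a, 0.0) - common[a] > 1e-15]
    rest_q = [(b, q.get(b, 0.0) - common[b]) for b in support if q.get(b, 0.0) - common[b] > 1e-15]
    i = j = 0
    while i < len(rest_p) and j < len(rest_q):   # pair leftover mass arbitrarily
        a, mp = rest_p[i]
        b, mq = rest_q[j]
        m = min(mp, mq)
        lam[(a, b)] = lam.get((a, b), 0.0) + m
        rest_p[i], rest_q[j] = (a, mp - m), (b, mq - m)
        if rest_p[i][1] <= 1e-15: i += 1
        if rest_q[j][1] <= 1e-15: j += 1
    return lam

def extend_joining(lam_n, next_p, next_q):
    """Extend a joining of (X_1^n, Y_1^n) one step: for each joined pair of pasts,
    couple the conditional next-symbol distributions optimally.
    next_p(x_past) and next_q(y_past) return dicts over the alphabet."""
    lam_next = {}
    for (xs, ys), weight in lam_n.items():
        for (a, b), m in optimal_d1_coupling(next_p(xs), next_q(ys)).items():
            key = (xs + (a,), ys + (b,))
            lam_next[key] = lam_next.get(key, 0.0) + weight * m
    return lam_next

# First-order example: the total off-diagonal mass below is d_1 of the two laws.
lam1 = optimal_d1_coupling({0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4})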

226 IV. process. n > 1.) Given E > 0. In that proof. The conditional form of the E-independence lemma. proof.d. The Markov coupling lemma guarantees that for mixing Markov chains even this dependence on X.y E. Mixing Markov processes are finitely determined. guarantees that for any n > 1. The Markov property n +Pin depends on the immediate past X. Lemma IV. the fitting was done one step at a time. as m grows. The choice of 3 and the fact that H(YZIY ° n ) decreases to m H (v) as n oc. The key is to show that approximate versions of both properties hold for any process close enough to the Markov process in entropy and in joint distribution for a long enough time.2. 17In and Yi-. extends to the following result. The random variables X and Y are conditionally E-independent. proof to the mixing Markov case.2 CHAPTER IV. A good match up to stage n was continued to stage n + 1 by using the fact that X. Given E > 0 and a positive integer m. then guarantee that HOTIN— 1 -1(YPIY2 n ) < y.8 Let {X. A conditional form of c-independence will be needed. the Markov property and the Markov coupling lemma.d. given Yo.} be an ergodic Markov process with Kolmogorov measure and let {K} be a stationary process with Kolmogorov measure v.i. then. there is a 3 > 0 such that if lid. The Markov result is based on a generalization of the i. ylz) denotes the conditional joint distribution and p(x lz) and p(y lz) denote the respective marginals. then X and Y are conditionally E-independent.2.1 +1 is almost independent of 1?. Lemma IV. .. for then a good match after n steps can be carried forward by fitting future m-blocks.m+1 — v m+ii < 8 and IH(a) — H(v)i < 8.2.a. for every n > 1. B-PROCESSES. dies off in the d-sense. given Z.. Proof Choose y = y(E) from preceding lemma.8. whose proof is left to the reader. which is possible since conditional entropy is continuous in the variational distance. Lemma IV.1. By decreasing 3 if necessary it can be assumed that if I H(g) — H(v)I <8 then Inanu) — na-/(01 < y/2. are conditionally E-independent.3.i.7 (The conditional 6-independence lemma. two properties will be used. yiz) — p(xlz)p(ydz)ip(z) < z x. To extend the i. The next lemma provides the desired approximate form of the Markov property.d. where p(x. Fix a stationary process y for which ym+ il < 3 and 11/(u)— H (v)I <8.i. ±1 is independent of X7 for the i. then choose 3 > 0 so small that if 1 12 m+1 — v m± i I < 8 then IH(Yriy0)-HaTixol < y12. for suitable choice of m. a future block Xn but on no previous values. Z) < y. given Z. there is a y = y(E) > 0 such that if H(XIZ)— H(XIY. Fix m. provided the Y-process is close enough to the X-process in distribution and entropy. Lemma IV. if E E ip(x. and 17.2.

YTP:r/y) E 9 where Xn realizes an (A. xo E A.9. The Markov coupling lemma.'1 /y.n+i there is a 3 > 0 so that if y is a stationary — vm+1 1 < 6 then measure for which (5) am(Yln.SECTION IV./x 1 . Since this inequality depends only on .i. by (4). E X n Vn 4(4.8. + cim(r. Yln /Y0) < yo E A. implies that 171 and Yfni are conditionally E-independent. and hence the triangle inequality yields. (6). .:vin/4. Only the first-order proof will be given. Fix a stationary y for which (5). using (6) to get started.8.1. since dm (u. THE FINITELY DETERMINED PROPERTY 227 which. conditioned on X. the extension to arbitrary finite order is left to the reader.d. X4 +im ) will denote the d-distance between the distributions n + T/xii. Lemma IV. if necessary. and hence the sum in (8) is less than E.n /y.2. The distribution of Xn n+ 47/4 is the same as the distribution of Xn n + r.2. it can also be assumed from Lemma IV. given Yn . for any stationary process {Yi } with Kolmogorov measure y for which — vm+1I < and I H(u) — H(p)i <8... y). This completes the proof of Theorem IV. (6) and (5).2. and fix E > O. 17. Thus the expected value of the fourth term is at most (1/2)6/2 = 6/4.4T /Yri) dm (Y: YrinZ) Y:2V/yri?). provides an m such that (4) dm (XT.8 that Yini and Y i n are conditionally (E/2)-independent. Proof n + 7i . by (7).u.9 A mixing finite-order Markov process is finitely determined. and ii(Xn of X'nl Vin / and 11. = x. y) < E. conditioned on +71 /4 will denote the random vector Xn In the proof. yr ) < e/4.2. given Yo. XT/x0) < 6/4.r+Fr. Fix a mixing Markov process {X n } with Kolmogorov measure p. But the expected value of this variational distance is less than 6/2. + drn ( 17:+7 . By making 3 smaller. for all n > 1. As in the proof for the i.2. Theorem IV. in turn. The first three terms contribute less than 3E/4. The fourth term is upper bounded by half of the variational distance between the distribution of Y. I/2. y)ani (X . y) is upper bounded by Ittm — that 6 is small enough to guarantee that (6) (7) dm (xT. and (7) all hold.'. This proves Lemma IV. Ynn+ = x'17 . and the distribution of Ynn+ +. given Yo .7:14:r/y. it is enough to show that (8) dni(x n+1 . it can also be assumed Furthermore. by the choice of 3 1 . case. To complete the proof that mixing Markov implies finitely determined it is enough to show that d(p. The d-fitting of and y is carried out m steps at a time. since Ynn+ +Ini and Yin—I are conditionally (E/2)-independent.

(d(x l . which is however.a. v).11 If d(p. (None of the deeper ideas from channel theory will be used. a result also useful in other contexts..2. The finitely determined processes are d-closed. for any stationary process Trt for which 1. which guarantees that any process a-close enough to a finitely determined process must have an approximate form of the finitely determined property. The triangle inequality d(v + d a].v) <E and v is finitely determined..ak — < 3. (b) The second marginal of 3 is close to u in distribution and entropy. Let A be a stationary joining of it and y such that EAd(x i . close to E). v). It will be shown that if 72. is also small. y l )) = d(. < (5 and IH(A) — < S.a such that (a) 3 is close to A in distribution and entropy. If the second marginal.10 (The a-closure theorem.. Theorem IV.(b714) = E A.:(d(xi.. (b7la7).) The d-limit of finitely determined processes is finitely determined.228 CHAPTER IV.u in distribution and entropy there will be a stationary A with first marginal . Lemma IV. and for each ar il. then there is a (5 > 0 and a k such that d(u. (9) it (a ril ) The family {A.14) be the conditional measure on An defined by the formula (af. let 4(. The theorem is an immediate consequence of the following lemma. given the input ar il. (. borrowed from information theory. Such a finite-sequence channel extends to an infinite-sequence channel (10) X y . is close enough to . vi)) = d(p. B-PROCESSES. yi)).2. outputs b7 with probability X. il b7) A. The fact that 3: is a stationary joining of and guarantees that E3. —> which. IV.3 The final step in the proof that almost block-independence implies finitely determined is stated as the following theorem. by property (a). îl) will be small. then implies that a(u. The basic idea for the proof is as follows. A simple language for making the preceding sketch into a rigorous proof is the language of channels.2.u.. '17 is close enough to y in distribution and entropy then the finitely determined property of y guarantees that a(v. only its suggestive language.) Fix n. O ) } of conditional measures can thought of as the (noisy) channel (or black box).

is the measure on Aœ x A" defined for m > 1 by the formula - m_i A * (a . Y) is a random vector with distribution A*(X n .) If is a stationary joining of the two stationary processes pt and v. The output measure /3* is the projection of A* onto its second factor. respectively. and hence stationary measures A = A(An . bn i ln ) = a(a)4(b i.nn . and hence ' 1 lim — H(1 11 1X7) = H n — On the other hand. b lic) is an average of the n measures A * (a(i) + x where a(i). then hr(X inn. yrn) = H(X) + myrnIxrn) . it) and A = A(Xn . b ) .f3(4.+ +ki). that is. called the stationary joint input output measure and stationary output measure defined by An and the input measure a. Likewise. ii). a) nor /3*(X n . _ i < n. that ±n ). a). and # = ti(Xn. bki ) converges to X(a /ic. a) and /3(X n . THE FINITELY DE I ERMINED PROPERTY 229 which outputs an infinite sequence y. a) on A'. b(i)+ so that. a) need be stationary. with respect to weak convergence and convergence in entropy. b i ) converges to a oc. ypi+ in+n i has the value b7. a) of a with an output measure /3* = /6 * (4. 17) is a random vector with distribution An then H(X7. a). b(i) . +. as n oc. though both are certainly n-stationary. let n > k. = H(X7) H(}7 W ) . = as and b(i). A(a. "" lai. A direct calculation shows that A is a joining of a and /3. ±„.u) converges weakly and in entropy to v. the averaged output p. Given an input measure a on A" the infinite-sequence channel (10) defined by the conditional probabilities (9) determines a joining A* = A*(X n .2. and put A* = A*(Xn. The i. given an infinite input sequence x. x(a k i b/ ). = b„ for 1 < s < k. by breaking x into blocks of length n and applying the n-block channel to each block separately.12 (Continuity in n. first note that if (X7. The joining A*. n+n ) j=0 The projection of A* onto its first factor is clearly the input measure a. 1 b k ) = v(b) as n To establish convergence in entropy. The measures A and fi are. The next two lemmas contain the facts that will be needed about the continuity of A(X n . independent of {xi: i < jn} is. measure A(4` . a) are obtained by randomizing the start. blf) as n oc. with probability X n (b7lx i n ±i and {y i : i < j n}. then A(X n . .)(14) = E A A(4. converges weakly and in entropy to X and . /2).SECTION IV2. - Lemma IV. Proof Fix (4. called the joint input output measure defined by the channel and the input measure a. for i <n — k A * (a(i) i + k i. the measure on A' defined for m > 1 by the formula fi * ( 1/1" ) E A * (aT" . Km )* mn If a is stationary then neither A*(X n . if (XI". in n and in a. But.
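The block-channel construction is simple to simulate: break the input into consecutive n-blocks and apply the channel independently to each block. The sketch below uses illustrative names and a hypothetical per-symbol error kernel standing in for the family of conditional measures; it is not meant to reproduce any particular channel from the text.

import random

def apply_n_block_channel(x, n, channel, rng=random):
    """channel(a_block) returns a dict {b_block: lambda_n(b | a)}; the channel is
    applied independently to each consecutive n-block of the input x."""
    y = []
    for i in range(0, len(x) - n + 1, n):
        a = tuple(x[i:i + n])
        outs, weights = zip(*channel(a).items())
        y.extend(rng.choices(outs, weights=weights)[0])
    return y

# Hypothetical kernel: flip each binary symbol of the block independently with prob eps.
def noisy_block_channel(a, eps=0.1):
    dist = {}
    def rec(prefix, prob):
        if len(prefix) == len(a):
            dist[tuple(prefix)] = prob
            return
        s = a[len(prefix)]
        rec(prefix + [s], prob * (1 - eps))
        rec(prefix + [1 - s], prob * eps)
    rec([], 1.0)
    return dist

x = [random.randint(0, 1) for _ in range(30)]
y = apply_n_block_channel(x, n=5, channel=noisy_block_channel)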

since the entropy of an n-stationary process is the same as the entropy of the average of its first n-shifts.6. Y) is a random vector with distribution A n. p.13 (Continuity in input distribution. if (XT.) Fix n > 1. A) satisfies Ifie — vti < y/2.2. The finitely determined property of y provides y and E such that any stationary process V' for which I vt —Ve I < y and I H(v) — H(71)1 < y. dividing by m and letting in go to oc yields H(A) = H(a)+ H (Y I X 1) . and the stationary output measure fi = fi(X n . in particular. Lemma IV. Randomizing the start then produces the desired result. a). (i) {A(Xn .2.2. An„(a r . H(A(4. 1 IX7) . p)) H(A).)) H (v). g) satisfies (11) EA(d(xl. Furthermore. see Exercise 5. Ex(d(xi. and hence Lemma IV. since the channel treats input symbols independently. and H(/3) = H(A) — H(Xi Yi) is continuous in H(A) and the distribution of (X 1 . then H(XT. Y1 ). Proof For simplicity assume n = 1. and a stationary measure A on A" x A. so that H(A*(X n . a). a(m) )1 converges weakly and in entropy to t3(X n . a). as in --> Do. Dividing by mn and letting in gives 1 H(A*(4.11.u). Yr) = H(XT)+mH(Y i 'X i ). Fix a stationary input measure a and put A = A(Xi. Thus (i) holds. y) < E. assume 1) is finitely determined. B-PROCESSES.2. + E H(rr_1)n+1 I riz_on+i) Do since the channel treats input n-block independently.12. Proof of Lemma IV. so that. The extension to n > 1 is straightforward. so that both A* and fi* are stationary for any stationary input. Section 1.12. then. br ) = a(aT) i-1 X(b i lai ). and let A be a stationary measure realizing ci(p. yl)) + c. By definition. If a sequence of stationary processes la(m)} converges weakly and in entropy to a stationary process a as m oc. provides an n such that the stationary input-output measure A = A (An . Assume cl(p. . fi = /3(X i . Lemma IV. p)) H (X).13 is established. Thus. y). (ii) 0(4. if = /3*(X n . as n oc.2.u)+ — H (Y. and IH(fi) — < y/2. i=1 H(X71n ) MH(YnX ril ). Y1). This completes the proof of Lemma IV. yl)) < E. which depends continuously on 11(a) and the distribution of (X1. so (ii) holds.) < E. p)) = H(. Likewise. yi)) < Ex(d(xi. Furthermore. a) is weakly continuous in a. must also satisfy cl(v. since Isin = vn . randomizing the start also yields H(/3(A. °eon)} converges weakly and in entropy to A(4. .230 H (X ) CHAPTER IV. which is weakly continuous in a. as m --÷ oc. The continuity-in-n lemma.

[46]. 71) is a joining of the input Ex(d(xi.14 The proof that i.d. )' 1 )) < En(d(xi. (Hint: y is close in distribution and entropy to p. THE FINITELY DETERMINED PROPERTY.2. yi)) +e and the output distribution il = 0(X. Show that a finitely determined process is the d-limit of its .i. 71) _-_. processes are finitely determined is based on Ornstein's original proof.13. Lemma IV.. then the joint input-output distribution X = A(X„. 231 The continuity-in-input lemma.b Exercises. and IH(i') — 1-1 (/3)1 < y/2.2.(4 +1 )/p.. the proof given here that ABI processes are finitely determined is new. Furthermore. and the fact that X realizes d(p. [13].z -th order Markovizations. is any stationary process for which Fix — lid < 8 and IH(ii) — H(A)1 < ( 5 . il with the output 7/ and ) En(d(xi. y) which was assumed to be less than E . yi)) < < yi)). )' by the inequalities (11) and (12). so the definitions of f and y imply that dal.2.2. The fact that almost block-independence is equivalent to finitely determined first appeared in [67].11.2. yi) + 6 Ex(d (x i.(4). thereby completing the proof that dlimits of finitely determined processes are finitely determined. The observation that the ideas of that proof could be expressed in the simple channel language used here is from [83]. Given such an input 'A it follows that the output -17 satisfies rile — vel Pe — Oel + Ifie — vel < Y. IV. i)) + 2E < 3E. This completes the proof of Lemma IV. since X = A(Xn .SECTION IV. provides 3 and k such that if ii. is the n-th order Markov process with transition probabilities y(x-n+i WO = p. Ex(d(xi. [46]. while the proof that mixing Markov chains are finitely determined is based on the Friedman-Ornstein proof. El Remark IV. and — II (v)I III (1 7) — II (0)I + i H(8) — I-I (v)I < 1 / . 1 The n-th order Markovization of a stationary process p. ii„) satisfies 1 17 )t — fit' < y/2. and hence the proof that almost block-independent processes are finitely determined. Ft) satisfies (12) Ex(d (x i . The fact that d-limits of finitely determined processes are finitely determined is due to Ornstein. d(ii.) . y) <E.

such as geodesic flow on a manifold of constant negative curvature and various processes associated with billiards. Equivalence will be established by showing that very weak Bernoulli implies finitely determined and then that almost blowing-up implies very weak Bernoulli. process. and X`i'/x ° k will denote the random vector X7. conditioned on the past values x ° k . indeed. v. . and for some x all of whose shifts have limiting nonoverlapping n-block distribution equal to /4.u is totally ergodic and y is any probability measure on A'.7.(X7 I X7)) < c. This form.T i )' ) must be a joining of ia. where Ex o . IV. denotes expectation with respect to the random past X° k Informally stated. it follows that almost block-independence. see Exercise la in Section 1.3. Show that the class of B-processes is a-separable.17) = d(x . (Hint: a(p. for some i < n. A more careful look at the proof that mixing Markov chains are finitely determined shows that all that is really needed for a process to be finitely determined is a weaker form of the Markov coupling property. The significance of very weak Bernoulli is that it is equivalent to finitely determined and that many physically interesting processes. (1) Ex o .) 3. then 0).232 CHAPTER IV. Theorem IV. Lemma IV. B-PROCESSES.. almost blowing-up. Prove the conditional c-independence lemma. equivalent to being a stationary coding of an i.. Section IV. y). either measure or random variable notation will be used for the d-distance.) A process is very weak Bernoulli if and only if it is finitely determined.4.d. and v. very weak Bernoulli.4. Show that a process built by repeated independent cutting and stacking is a Bprocess if the initial structure has two columns with heights differing by 1. . see Section 111. is actually equivalent to the finitely determined property. A stationary process {X.a The very weak Bernoulli and weak Bernoulli properties. for some y for which the limiting nonoverlapping kblock distribution of T'y is y. a process is VWB if the past has little effect in the a-sense on the future. can be shown to be very weak Bernoulli by exploiting their natural expanding and contracting foliation structures. The reader is referred to [54] for a discussion of such applications. where 7 denotes the concatenated-block process defined by an (A n . and finitely determined are. 5. Show that if . 4. As in earlier discussions.} is very weak Bernoulli (VWB) if given c > 0 there is an n such that for any k > 0.3. Hence any of the limiting nonoverlapping n-block distributions determined by (T i x. hence serves as another characterization of B-processes. (d. Since it has already been shown that finitely determined implies almost blowing-up. called the very weak Bernoulli property. 2.1 (The very weak Bernoulli characterization.2. y) < a(i2.i.3 Other B-process characterizations.

IV.3.a The very weak Bernoulli and weak Bernoulli properties.

A somewhat stronger property, called weak Bernoulli, is obtained by using variational distance with a gap in place of the $\bar d$-distance. A stationary process $\{X_i\}$ is weak Bernoulli (WB), or absolutely regular, if past and future become $\epsilon$-independent when separated by a gap $g$, that is, if given $\epsilon > 0$ there is a gap $g$ such that for any $k > 0$ and $m > 0$, the random vectors $X_{g+1}^{g+m}$ and $X_{-k}^0$ are $\epsilon$-independent.

The class of weak Bernoulli processes includes the mixing Markov processes and the large class of mixing regenerative processes, see Exercise 1 and Exercise 2. Their importance here is that weak Bernoulli processes are very weak Bernoulli. To see why, first note that if $X_{g+1}^{g+m}$ and $X_{-k}^0$ are $\epsilon$-independent, then

$E_{x_{-k}^0}\big(\bar d_m(X_{g+1}^{g+m}/x_{-k}^0,\, X_{g+1}^{g+m})\big) < \epsilon/2$,

since $\bar d$-distance is upper bounded by one-half the variational distance. Furthermore, for any $n > g$ and any random vectors $(U_1^n, V_1^n)$,

$\bar d_n(U_1^n, V_1^n) \le \bar d_{n-g}(U_{g+1}^n, V_{g+1}^n) + g/n$,

so if past and future become $\epsilon$-independent when separated by a gap $g$, then one can take $n = g + m$ with $m$ so large that $g/(g+m) < \epsilon/2$, to obtain

$E_{x_{-k}^0}\big(\bar d_n(X_1^n/x_{-k}^0,\, X_1^n)\big) < \epsilon$,

for every $k > 0$. Thus weak Bernoulli indeed implies very weak Bernoulli.

In a sense made precise in Exercises 4 and 5, weak Bernoulli requires that, with high probability, the conditional measures on different infinite pasts can be joined in the future so that, with high probability, names agree after some point, while very weak Bernoulli only requires that the density of disagreements be small. This difference was the key to the example constructed in [65] of a very weak Bernoulli process that is not weak Bernoulli, a result established by another method in [78]. As noted in earlier theorems, see Sections III.2 and III.5, weak Bernoulli processes have nice empirical-distribution and waiting-time properties.
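A small empirical illustration of the weak Bernoulli gap condition just defined, not taken from the text: for a sample path of a mixing two-state Markov chain (an illustrative choice, as are the block lengths), it estimates the variational distance between the joint law of a past block and a future block separated by a gap g and the product of the two marginals. For weak Bernoulli processes this dependence measure becomes small as g grows.

```python
import random
from collections import Counter

def sample_markov(P, length, rng):
    """Sample a path of a two-state Markov chain with transition matrix P."""
    x = [rng.randint(0, 1)]
    for _ in range(length - 1):
        x.append(0 if rng.random() < P[x[-1]][0] else 1)
    return x

def gap_dependence(path, m, g):
    """Estimated variational distance between the joint law of (past m-block,
    future m-block after a gap g) and the product of its marginals."""
    pairs = [(tuple(path[i:i + m]), tuple(path[i + m + g:i + 2 * m + g]))
             for i in range(len(path) - 2 * m - g)]
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(p for p, _ in pairs)
    fut = Counter(f for _, f in pairs)
    return 0.5 * sum(abs(joint[(p, f)] / n - (past[p] / n) * (fut[f] / n))
                     for p in past for f in fut)

rng = random.Random(0)
P = [[0.9, 0.1], [0.2, 0.8]]            # a mixing chain, chosen only for illustration
path = sample_markov(P, 200000, rng)
for g in [0, 2, 8, 32]:
    print(g, round(gap_dependence(path, m=3, g=g), 4))   # decreases as the gap grows
```

The decay with g is exactly the epsilon-independence with a gap that the definition demands; the sampling noise sets a floor on how small the estimate can get.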

IV.3.b Very weak Bernoulli implies finitely determined.

The proof that very weak Bernoulli implies finitely determined will be given in this subsection; the proof of the converse will be given later, after some discussion of the almost blowing-up property. The proof models the earlier argument that mixing Markov implies finitely determined. As in the earlier proofs, a good fitting can be carried forward by fitting future $m$-blocks, conditioned on intermediate values. Entropy is used to guarantee approximate independence from the distant past, while very weak Bernoulli is used to guarantee that even such conditional dependence dies off in the $\bar d$-sense as block length grows. For suitable choice of $m$, the key is to show that the expected value of $\bar d_m(Y_{n+1}^{n+m}/Y_1^n,\, X_1^m)$ is small, uniformly in $n$, given only that the two processes are close enough in entropy and in $k$-th order distribution for some fixed $k \ge m$; since approximate versions of both properties hold for any process close enough to $\{X_i\}$ in entropy and in joint distribution for a long enough time, this is enough to give the finitely determined property.

The details of the preceding sketch are carried out as follows. Fix a very weak Bernoulli process $\{X_n\}$ with Kolmogorov measure $\mu$, and fix $\epsilon > 0$. The very weak Bernoulli property provides an $m$ so that

(2)   $E_{X_{-t}^{0}}\big(\bar d_m(X_1^m/X_{-t}^0,\, X_1^m)\big) < \epsilon/8, \qquad t \ge 1.$

By stationarity, once such an $m$ is determined, all that remains to be shown is that the expected value of $\bar d_m(Y_{n+1}^{n+m}/Y_1^n,\, X_1^m)$ is small, uniformly in $n$, for any stationary process $\{Y_i\}$ that is close enough to $\{X_i\}$ in $k$-th order distribution, for some large enough $k \ge m$, as well as close enough in entropy.

Towards this end, let $\gamma$ be a positive number to be specified later. For $m$ fixed, $H(X_1^m|X_{-t}^0)$ decreases to $mH(\mu)$ as $t \to \infty$, and hence there is a $K$ such that $H(X_1^m|X_{-K}^0) < mH(\mu) + \gamma$. Fix $k = m + K + 1$.

The quantity $E_{X_{-K}^0}(\bar d_m(X_1^m/X_{-K}^0,\, X_1^m))$ depends continuously on $\mu_k$, and $H(X_1^m|X_{-K}^0)$ also depends continuously on $\mu_k$. It therefore follows from the triangle inequality that there is a $\delta > 0$, which can also be supposed to satisfy $\delta < \epsilon/2$, such that for any stationary process $\{Y_i\}$ with Kolmogorov measure $\nu$ for which $|\mu_k - \nu_k| < \delta$ and $|H(\mu) - H(\nu)| < \delta$,

(3)   $E_{Y_{-K}^{0}}\big(\bar d_m(Y_1^m/Y_{-K}^0,\, X_1^m)\big) < \epsilon/4,$

and, if $\gamma$ and $\delta$ are small enough, $H(Y_1^m|Y_{-K}^0) < mH(\nu) + 2\gamma$ holds. But $H(Y_1^m|Y_{-t}^0)$ decreases to $mH(\nu)$ as $t \to \infty$, so this means that

$H(Y_1^m|Y_{-K}^0) - H(Y_1^m|Y_{-K-j}^0) < 2\gamma, \qquad j \ge 0,$

which, if $\gamma$ is small enough, the conditional $\epsilon$-independence lemma converts into the statement that $Y_1^m$ and $Y_{-K-j}^{-K-1}$ are conditionally $(\epsilon/2)$-independent, given $Y_{-K}^0$. Since $\bar d$-distance is upper bounded by one-half the variational distance, this means that

(4)   $E\big(\bar d_m(Y_1^m/Y_{-K-j}^0,\, Y_1^m/Y_{-K}^0)\big) < \epsilon/4, \qquad j \ge 1.$

In summary, if $|\mu_k - \nu_k| < \delta$ and $|H(\mu) - H(\nu)| < \delta$, then stationarity, the triangle inequality, and the bounds (3) and (4) yield the result that will be needed, namely,

$E\big(\bar d_m(Y_{n+1}^{n+m}/Y_1^n,\, X_1^m)\big) < \epsilon/2$, uniformly in $n$,

which is enough to carry the fitting argument of the mixing Markov proof forward and conclude that $\bar d(\mu, \nu) < \epsilon$. This completes the proof that very weak Bernoulli implies finitely determined. $\square$
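The argument above leans on the monotone convergence of the conditional entropies $H(X_1^m|X_{-t}^0)$ to $mH(\mu)$. A minimal plug-in sketch of the $m = 1$ case (an illustration under assumed data, not the book's argument): estimate $H(X_1|X_{-t}^0)$ as a difference of empirical block entropies along a sample path of an illustrative Markov chain, and watch the estimates settle near the entropy rate.

```python
import math, random
from collections import Counter

def block_entropy(path, k):
    """Entropy (bits) of the empirical k-block distribution of `path`."""
    counts = Counter(tuple(path[i:i + k]) for i in range(len(path) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

rng = random.Random(0)
P = [[0.9, 0.1], [0.2, 0.8]]        # illustrative mixing two-state Markov chain
x = [0]
for _ in range(500000):
    x.append(0 if rng.random() < P[x[-1]][0] else 1)

# Plug-in estimates of H(X_1 | X_{-t}^0) = H_{t+2} - H_{t+1}.  They are nonincreasing in t
# up to estimation noise, and for this chain they already equal the entropy rate at t = 0,
# by the Markov property; for a general process they only converge as t grows.
for t in range(0, 5):
    print(t, round(block_entropy(x, t + 2) - block_entropy(x, t + 1), 4))
```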

IV.3.c The almost blowing-up characterization.

The $\epsilon$-blowup $[C]_\epsilon$ of a set $C \subset A^n$ is its $\epsilon$-neighborhood relative to the $d_n$-metric, that is,

$[C]_\epsilon = \{y_1^n\colon\ d_n(x_1^n, y_1^n) < \epsilon$, for some $x_1^n \in C\}$.

A set $B \subset A^n$ has the $(\delta, \epsilon)$-blowing-up property if $\mu([C]_\epsilon) > 1 - \epsilon$ for any subset $C \subset B$ for which $\mu(C) \ge 2^{-n\delta}$. A stationary process has the almost blowing-up property if for each $n$ there is a $B_n \subset A^n$ such that $x_1^n \in B_n$, eventually almost surely, and for any $\epsilon > 0$ there is a $\delta > 0$ and an $N$ such that $B_n$ has the $(\delta, \epsilon)$-blowing-up property for $n \ge N$.

The almost blowing-up property (ABUP) was introduced in Section III.4, where it was shown that a finitely determined process has the almost blowing-up property. In this section it will be shown that almost blowing-up implies very weak Bernoulli. Since very weak Bernoulli implies finitely determined, this completes the proof that almost block-independence, very weak Bernoulli, finitely determined, and almost blowing-up are all equivalent ways of saying that a process is a stationary coding of an i.i.d. process.

The proof is based on the following lemma, which, in a more general setting and stronger form, is due to Strassen, [81].

Lemma IV.3.2 Let $\mu$ and $\nu$ be probability measures on $A^n$, and let $\epsilon > 0$. If $\nu([C]_\epsilon) \ge 1 - \epsilon$ whenever $\mu(C) \ge \epsilon$, then $\bar d_n(\mu, \nu) \le 2\epsilon$.

Proof The ideas of the proof will be described first, after which the details will be given. The goal is to construct a joining $\lambda$ of $\mu$ and $\nu$ for which $d_n(x_1^n, y_1^n) < \epsilon$, except on a set of $(x_1^n, y_1^n)$ of $\lambda$-measure less than $\epsilon$. A joining can be thought of as a partitioning of the $\nu$-mass of each $y_1^n$ into parts and an assignment of these parts to various $x_1^n$, subject to the joining requirement, namely, that the total mass received by each $x_1^n$ is equal to $\mu(x_1^n)$.

A simple strategy for beginning the construction of a good joining is to cut off an $\alpha$-fraction of the $\nu$-mass of $y_1^n$ and assign it to some $x_1^n$ for which $d_n(x_1^n, y_1^n) < \epsilon$. A trivial argument shows that this is indeed possible, for some positive $\alpha$, for those $y_1^n$ that are within $\epsilon$ of some $x_1^n$ of positive $\mu$-mass. The set of such $y_1^n$ is just the $\epsilon$-blowup of the support of $\mu$, which, by hypothesis, has $\nu$-measure at least $1 - \epsilon$. The key to continuing is the observation that if the set of $x_1^n$ whose $\mu$-mass is not completely filled in the first stage has $\mu$-measure at least $\epsilon$, then its blowup has large $\nu$-measure and the simple strategy can be used again: for most $y_1^n$ a small fraction of the remaining $\nu$-mass can be cut off and assigned to the unfilled mass of a nearby $x_1^n$. This simple strategy can be repeated as long as the set of $x_1^n$ whose $\mu$-mass is not yet completely filled has $\mu$-measure at least $\epsilon$. If the fraction is chosen to be largest possible at each stage, then only a finite number of stages are needed to reach the point where the set of $x_1^n$ whose $\mu$-mass is not yet completely filled has $\mu$-measure less than $\epsilon$.

Some notation and terminology will assist in making the preceding sketch into a proof. A nonnegative function $\tilde\mu$ on $A^n$ is called a mass function if it has nonempty support and $\sum \tilde\mu(x_1^n) \le 1$. A function $\phi\colon [S]_\epsilon \to S$ is called an $\epsilon$-matching over $S$ if $d_n(y_1^n, \phi(y_1^n)) < \epsilon$, for all $y_1^n \in [S]_\epsilon$.

If the domain $S$ of such a $\phi$ is contained in the support of a mass function $\tilde\mu$, then the number $\alpha = \alpha(\nu, \tilde\mu, \phi)$ defined by

(5)   $\alpha = \min\Big\{1,\ \min_{x_1^n \in S,\, \tilde\mu(x_1^n) > 0}\ \tilde\mu(x_1^n)\big/\nu(\phi^{-1}(x_1^n))\Big\}$

is positive. It is called the maximal $(\nu, \tilde\mu, \phi)$-stuffing fraction, for it is the largest number $\alpha \le 1$ for which $\alpha\,\nu(\phi^{-1}(x_1^n)) \le \tilde\mu(x_1^n)$, for all $x_1^n \in S$.

The desired joining $\lambda$ is constructed by induction. To get started let $\mu^{(1)} = \mu$, let $S_1$ be the support of $\mu^{(1)}$, let $\phi_1$ be any $\epsilon$-matching over $S_1$ (such a $\phi_1$ exists by the definition of $\epsilon$-blowup, as long as $S_1 \ne \emptyset$), and let $\alpha_1$ be the maximal $(\nu, \mu^{(1)}, \phi_1)$-stuffing fraction. Having defined $S_i$, $\phi_i$ and $\alpha_i$, let $\mu^{(i+1)}$ be the mass function defined by

$\mu^{(i+1)}(x_1^n) = \mu^{(i)}(x_1^n) - \alpha_i\,\nu(\phi_i^{-1}(x_1^n)),\qquad x_1^n \in S_i.$

The set $S_{i+1}$ is then taken to be the support of $\mu^{(i+1)}$, the function $\phi_{i+1}$ is taken to be any $\epsilon$-matching over $S_{i+1}$, and $\alpha_{i+1}$ is taken to be the maximal $(\nu, \mu^{(i+1)}, \phi_{i+1})$-stuffing fraction, so that an $\alpha_{i+1}$-fraction of the $\nu$-mass of each $y_1^n \in [S_{i+1}]_\epsilon$ can be assigned to the $\mu^{(i+1)}$-mass of $x_1^n = \phi_{i+1}(y_1^n)$.

If $\alpha_i < 1$ then, by the definition of maximal stuffing fraction, there is an $x_1^n \in S_i$ for which $\mu^{(i)}(x_1^n) = \alpha_i\,\nu(\phi_i^{-1}(x_1^n))$, and therefore $\mu(S_{i+1}) < \mu(S_i)$; in either case $\mu^{(i+1)} \le \mu^{(i)}$. Since $A^n$ is finite, there is a first $i$, say $i^*$, for which $\alpha_{i^*} = 1$ or for which $\mu(S_{i^*+1}) < \epsilon$. The construction can be stopped after $i^*$, cutting up and assigning the remaining unassigned $\nu$-mass in any way consistent with the joining requirement that the total mass received by each $x_1^n$ is equal to $\mu(x_1^n)$. The resulting joining $\lambda$ has $d_n(x_1^n, y_1^n) < \epsilon$, except on a set of $(x_1^n, y_1^n)$ of $\lambda$-measure less than $\epsilon$, and hence $\bar d_n(\mu, \nu) \le 2\epsilon$. This finishes the proof of Lemma IV.3.2. $\square$

With Lemma IV.3.2 in hand, only one simple entropy fact about conditional measures will be needed to prove that almost blowing-up implies very weak Bernoulli. This is the fact that, with high probability, the conditional probability $\mu(x_1^n|x_{-\infty}^0)$ has approximately the same exponential size as the unconditioned probability $\mu(x_1^n)$. Let $\Sigma(X_{-\infty}^0)$ denote the $\sigma$-algebra generated by the collection of cylinder sets $\{[x_{-k}^0]\colon k \ge 0\}$. Also, for $F_n \in \Sigma(X_{-\infty}^n)$, the notation $x_{-\infty}^n \in F_n$ will be used as shorthand for the statement that $\cap_k [x_{-k}^n] \subset F_n$.

Lemma IV.3.3 (The conditional entropy lemma.) Let $\mu$ be an ergodic process of entropy $H$ and let $\alpha$ be a given positive number. There is an $N = N(\alpha) > 0$ such that if $n \ge N$ then there is a set $F_n \in \Sigma(X_{-\infty}^n)$ such that

(a)   $\mu(F_n) > 1 - \alpha$, and

(b)   if $x_{-\infty}^n \in F_n$ then $\mu(x_1^n|x_{-\infty}^0) \le 2^{\alpha n}\,\mu(x_1^n)$.

Proof By the ergodic theorem, $-(1/n)\log \mu(x_1^n|x_{-\infty}^0) \to H$, almost surely, while, by the entropy theorem, $-(1/n)\log \mu(x_1^n) \to H$, almost surely. These two facts together imply the lemma. $\square$
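The stuffing construction in the proof of Lemma IV.3.2 can be mimicked numerically. The following is a hedged sketch, not code from the text: given two distributions on a finite set of blocks and a threshold eps, it repeatedly cuts off the maximal stuffing fraction of nu-mass, routed through the simplest possible eps-matching (each y is sent to the first unfilled x within distance eps), exactly in the spirit of the induction above. The example distributions are assumptions chosen for illustration.

```python
def hamming(x, y):
    """Per-letter Hamming distance between equal-length tuples."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def stuff_joining(mu, nu, eps):
    """Greedy partial joining of mu and nu (dicts: block tuple -> mass) that assigns
    nu-mass only to blocks within eps, following the maximal-stuffing-fraction induction."""
    joining = {}                       # (x, y) -> assigned mass
    remaining = dict(mu)               # the mass function mu^(i): unfilled mu-mass
    while True:
        support = [x for x, m in remaining.items() if m > 0]
        if sum(remaining[x] for x in support) < eps:
            break                      # unfilled mass is already small: stop, as in the proof
        phi = {}                       # an eps-matching over the support
        for y in nu:
            close = [x for x in support if hamming(x, y) < eps]
            if close:
                phi[y] = close[0]
        if not phi:
            break
        pulled = {x: sum(nu[y] for y in phi if phi[y] == x) for x in set(phi.values())}
        alpha = min(1.0, min(remaining[x] / pulled[x] for x in pulled))   # maximal stuffing fraction
        for y, x in phi.items():
            joining[(x, y)] = joining.get((x, y), 0.0) + alpha * nu[y]
            remaining[x] -= alpha * nu[y]
        remaining = {x: (m if m > 1e-9 else 0.0) for x, m in remaining.items()}
        if alpha == 1.0:
            break
    return joining

# Illustrative use on binary 3-blocks with eps = 0.4, so one disagreement per block is allowed.
mu = {(0, 0, 0): 0.6, (1, 1, 1): 0.4}
nu = {(0, 0, 0): 0.3, (0, 0, 1): 0.3, (1, 1, 0): 0.2, (1, 1, 1): 0.2}
for pair, mass in stuff_joining(mu, nu, eps=0.4).items():
    print(pair, round(mass, 3))
```

In this small example the first stuffing fraction is already 1, so the entire nu-mass is assigned within one stage and every assigned pair disagrees in at most one letter, which is the kind of joining the lemma needs.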

Only the following consequence of the upper bound will be needed.

Lemma IV.3.4 Let $N = N(\alpha)$ be as in the preceding lemma. For $n \ge N$ there is a set $D_n^* \in \Sigma(X_{-\infty}^0)$ of measure at least $1 - \sqrt\alpha$ such that if $x_{-\infty}^0 \in D_n^*$ and $C \subset A^n$, then

$\mu(C|x_{-\infty}^0) \le 2^{\alpha n}\,\mu(C) + \sqrt\alpha.$

In particular, if $x_{-\infty}^0 \in D_n^*$, $C \subset A^n$, and $\mu(C|x_{-\infty}^0) \ge \epsilon$, then $\mu(C) \ge 2^{-\alpha n}(\epsilon - \sqrt\alpha)$.

Proof For $n \ge N$, the preceding lemma provides a set $F_n \in \Sigma(X_{-\infty}^n)$ with $\mu(F_n) > 1 - \alpha$ such that $\mu(x_1^n|x_{-\infty}^0) \le 2^{\alpha n}\mu(x_1^n)$ for $x_{-\infty}^n \in F_n$. By the Markov inequality there is a set $D_n^* \in \Sigma(X_{-\infty}^0)$ of measure at least $1 - \sqrt\alpha$ such that $\mu(F_n|x_{-\infty}^0) \ge 1 - \sqrt\alpha$ for $x_{-\infty}^0 \in D_n^*$. If $x_{-\infty}^0 \in D_n^*$ and $C \subset A^n$, then (a) and (b) give

$\mu(C|x_{-\infty}^0) \le \mu(C \cap F_n|x_{-\infty}^0) + \sqrt\alpha \le 2^{\alpha n}\mu(C \cap F_n) + \sqrt\alpha \le 2^{\alpha n}\mu(C) + \sqrt\alpha,$

which establishes the lemma. $\square$

Now the main theorem of this section will be proved.

Theorem IV.3.5 Almost blowing-up implies very weak Bernoulli.

Proof Let $\mu$ be a stationary process with the almost blowing-up property: let $B_n \subset A^n$, $n \ge 1$, be such that $x_1^n \in B_n$, eventually almost surely, and such that for any $\epsilon > 0$ there is a $\delta > 0$ and an $N$ such that $B_n$ has the $(\delta, \epsilon)$-blowing-up property for $n \ge N$. The goal is to show that $\mu$ is very weak Bernoulli. By Lemma IV.3.2, it is enough to show that for each $\epsilon > 0$ there is an $n$ and a set $V \in \Sigma(X_{-\infty}^0)$ such that $\mu(V) > 1 - \epsilon$ and such that the following holds for $x_{-\infty}^0 \in V$:

(6)   If $C \subset A^n$ and $\mu(C|x_{-\infty}^0) \ge \epsilon$, then $\mu([C]_\epsilon) \ge 1 - \epsilon$,

for then $\bar d_n(\mu_n(\cdot|x_{-\infty}^0), \mu_n) \le 2\epsilon$ for every $x_{-\infty}^0 \in V$.

Towards this end, fix $\epsilon > 0$ and let $\delta$ and $N$ be given by the almost blowing-up property. Since $x_1^n \in B_n$, eventually almost surely, there is an $n_0$ such that $\mu(B_n) > 1 - \epsilon^2/4$ for $n \ge n_0$. The quantity $\mu(B_n)$ is an average of the quantities $\mu(B_n|x_{-\infty}^0)$, so the Markov inequality yields a set $D_n \in \Sigma(X_{-\infty}^0)$ of measure at least $1 - \epsilon/2$ such that $\mu(B_n|x_{-\infty}^0) > 1 - \epsilon/2$ for $x_{-\infty}^0 \in D_n$. Combining this with Lemma IV.3.4, it follows that if $\alpha < \epsilon^2/4$ and $n$ is large enough, then the set $V = D_n \cap D_n^*$ has measure at least $1 - \epsilon$, and that for $x_{-\infty}^0 \in V$ and $C \subset A^n$ with $\mu(C|x_{-\infty}^0) \ge \epsilon$,

$\mu(C \cap B_n|x_{-\infty}^0) \ge \epsilon - \epsilon/2$, and hence $\mu(C \cap B_n) \ge 2^{-\alpha n}(\epsilon - \sqrt\alpha - \epsilon/2).$

If $\alpha$ is chosen to be less than both $\delta$ and $\epsilon^2/8$, then

$2^{-\alpha n}(\epsilon - \sqrt\alpha - \epsilon/2) \ge 2^{-\delta n},$

provided only that $n$ is large enough, so that the blowing-up property of $B_n$ yields

$\mu([C]_\epsilon) \ge \mu([C \cap B_n]_\epsilon) \ge 1 - \epsilon,$

for all $C \subset A^n$ with $\mu(C|x_{-\infty}^0) \ge \epsilon$ and all $x_{-\infty}^0 \in V$. This establishes the desired result (6) and completes the proof that almost blowing-up implies very weak Bernoulli. $\square$

Remark IV.3.6 The proof of Theorem IV.3.5 appeared in [39]. The proof of Strassen's lemma given here uses a construction suggested by Ornstein and Weiss which appeared in [37]. A more sophisticated argument, using a marriage lemma to pick the $\epsilon$-matchings $\phi_i$, yields the stronger result that the $\bar d_n$-metric is equivalent to the metric defined as the minimum $\epsilon > 0$ for which $\mu(C) \le \nu([C]_\epsilon) + \epsilon$, for all $C \subset A^n$. This and related results are discussed in Pollard's book, [59, Example 26, pp. 79-80].

IV.3.d Exercises.

1. Show that a mixing Markov chain is weak Bernoulli. (Hint: coupling.) A numerical sketch of the coupling argument follows these exercises.

2. Show that a mixing regenerative process is weak Bernoulli.

3. Show that a stationary process is very weak Bernoulli if and only if for each $\epsilon > 0$ there is an $m$ such that for each $k \ge 1$, $\bar d_m(X_1^m/x_{-k}^0,\, X_1^m) < \epsilon$, except for a set of $x_{-k}^0$ of measure at most $\epsilon$.

4. Show by using the martingale theorem that a stationary process $\mu$ is very weak Bernoulli if given $\epsilon > 0$ there is a positive integer $n$ and a measurable set $G \in \Sigma(X_{-\infty}^0)$ of measure at least $1 - \epsilon$, such that if $x_{-\infty}^0, \tilde x_{-\infty}^0 \in G$ then there is a measurable mapping $\phi\colon [x_{-\infty}^0] \mapsto [\tilde x_{-\infty}^0]$ which maps the conditional measure $\mu(\cdot|x_{-\infty}^0)$ onto the conditional measure $\mu(\cdot|\tilde x_{-\infty}^0)$, and a measurable set $B \subset [x_{-\infty}^0]$ such that $\mu(B|x_{-\infty}^0) < \epsilon$, with the property that for all $x \notin B$, $\phi(x)_i = x_i$ for all except at most $\epsilon n$ indices $i \in [1, n]$.

5. Show by using the martingale theorem that a stationary process $\mu$ is weak Bernoulli if given $\epsilon > 0$ there is a positive integer $K$ such that for every $n \ge K$ there is a measurable set $G \in \Sigma(X_{-\infty}^0)$ of measure at least $1 - \epsilon$, such that if $x_{-\infty}^0, \tilde x_{-\infty}^0 \in G$ then there is a measurable mapping $\phi\colon [x_{-\infty}^0] \mapsto [\tilde x_{-\infty}^0]$ which maps the conditional measure $\mu(\cdot|x_{-\infty}^0)$ onto the conditional measure $\mu(\cdot|\tilde x_{-\infty}^0)$, and a measurable set $B \subset [x_{-\infty}^0]$ such that $\mu(B|x_{-\infty}^0) < \epsilon$, with the property that for all $x \notin B$, $\phi(x)_i = x_i$ for $i \in [K, n]$.
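For Exercise 1, the standard coupling argument can be watched numerically. A hedged sketch under assumed parameters (the two-state chain below is only an example): run two copies of the chain from different initial states, let them move independently until they first occupy the same state, and glue them together afterwards. The probability of not having coupled after a gap g bounds the dependence between past and future separated by g, which is the weak Bernoulli condition.

```python
import random

def coupling_time(P, x0, y0, rng, max_steps=10000):
    """Steps until two independently evolving copies of the chain P, started at x0 and y0,
    first occupy the same state (after which they can be run together forever)."""
    x, y, t = x0, y0, 0
    while x != y and t < max_steps:
        x = 0 if rng.random() < P[x][0] else 1
        y = 0 if rng.random() < P[y][0] else 1
        t += 1
    return t

rng = random.Random(0)
P = [[0.9, 0.1], [0.2, 0.8]]      # illustrative mixing two-state chain
times = [coupling_time(P, 0, 1, rng) for _ in range(20000)]
for g in [1, 5, 10, 20, 40]:
    tail = sum(t > g for t in times) / len(times)
    print(g, round(tail, 4))      # P(not yet coupled after gap g) decays geometrically,
                                  # which is what makes a mixing chain weak Bernoulli
```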

Bibliography

[1] V. I. Arnold and A. Avez, Ergodic problems of classical mechanics, Benjamin, New York, 1968.
[2] R. Arratia and M. Waterman, "The Erdős-Rényi strong law for pattern matching with a given proportion of mismatches," Ann. Probab. 17 (1989), 1152-1169.
[3] A. Barron, "Logically smooth density estimation," Ph.D. Thesis, Dept. of Elec. Eng., Stanford Univ., 1985.
[4] P. Billingsley, Ergodic theory and information, John Wiley and Sons, New York, 1965.
[5] D. Cohn, Measure theory, Birkhäuser, Boston, 1980.
[6] T. Cover and J. Thomas, Elements of information theory, John Wiley and Sons, New York, 1991.
[7] I. Csiszár and J. Körner, Information theory: coding theorems for discrete memoryless systems, Akadémiai Kiadó, Budapest, 1981.
[8] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inform. Theory IT-21 (1975), 194-203.
[9] P. Erdős and A. Rényi, "On a new law of large numbers," J. d'analyse 23 (1970), 103-111.
[10] J. Feldman, "r-Entropy, equipartition, and Ornstein's isomorphism theorem," Israel J. Math. 36 (1980), 321-343.
[11] W. Feller, An introduction to probability theory and its applications, Volume II (Second Edition), John Wiley and Sons, New York, 1971.
[12] N. Friedman, Introduction to ergodic theory, Van Nostrand Reinhold, New York, 1970.
[13] N. Friedman and D. Ornstein, "On isomorphism of weak Bernoulli transformations," Advances in Math. 5 (1970), 365-394.
[14] H. Furstenberg, Recurrence in ergodic theory and combinatorial number theory, Princeton Univ. Press, Princeton, NJ, 1981.
[15] R. Gallager, Information theory and reliable communication, Wiley, New York, 1968.
[16] P. Grassberger, "Estimating the information content of symbol sequences and efficient codes," IEEE Trans. Inform. Theory IT-35 (1989), 669-675.
[17] R. Gray, Entropy and information theory, Springer-Verlag, New York, 1990.
[18] R. Gray and L. Davisson, "Source coding without the ergodic assumption," IEEE Trans. Inform. Theory IT-20 (1974), 502-516.
[19] R. Gray, Probability, random processes, and ergodic properties, Springer-Verlag, New York, 1988.
[20] P. Halmos, Measure theory, D. Van Nostrand Co., New York, 1950.
[21] P. Halmos, Lectures on ergodic theory, Chelsea Publishing Co., New York, 1956.
[22] W. Hoeffding, "Asymptotically optimal tests for multinomial distributions," Ann. Math. Statist. 36 (1965), 369-400.
[23] R. Jones, "New proofs of the maximal ergodic theorem and the Hardy-Littlewood maximal inequality," Proc. AMS 87 (1983), 681-684.
[24] M. Kac, "On the notion of recurrence in discrete stochastic processes," Bull. AMS 53 (1947), 1002-1010.
[25] S. Kakutani, "Induced measure-preserving transformations," Proc. Japan Acad. 19 (1943), 635-641.
[26] T. Kamae, "A simple proof of the ergodic theorem using nonstandard analysis," Israel J. Math. 42 (1982), 284-290.
[27] S. Karlin and G. Ghandour, "Comparative statistics for DNA and protein sequences: single sequence analysis," Proc. Nat. Acad. Sci. U.S.A. 82 (1985), 5800-5804.
[28] Y. Katznelson and B. Weiss, "A simple proof of some ergodic theorems," Israel J. Math. 42 (1982), 291-296.
[29] M. Keane and M. Smorodinsky, "Finitary isomorphism of irreducible Markov shifts," Israel J. Math. 34 (1979), 281-286.
[30] J. Kemeny and J. L. Snell, Finite Markov chains, D. Van Nostrand, Princeton, NJ, 1960.
[31] J. Kieffer, "Sample converses in source coding theory," IEEE Trans. Inform. Theory IT-37 (1991), 263-268.
[32] I. Kontoyiannis and Y. Suhov, "Prefixes and the entropy rate for long-range sources," Probability, statistics, and optimization (F. P. Kelly, ed.), Wiley, New York, 1994.
[33] U. Krengel, Ergodic theorems, de Gruyter, Berlin, 1985.
[34] D. Lind and B. Marcus, An introduction to symbolic dynamics and coding, Cambridge Univ. Press, Cambridge, 1995.
[35] T. Lindvall, Lectures on the coupling method, John Wiley and Sons, New York, 1992.
[36] K. Marton, "A simple proof of the blowing-up lemma," IEEE Trans. Inform. Theory IT-32 (1986), 445-447.
[37] K. Marton and P. Shields, "The positive-divergence and blowing-up properties," Israel J. Math. 86 (1994), 331-348.
[38] K. Marton and P. Shields, "Almost sure waiting time results for weak and very weak Bernoulli processes," Ergodic Th. and Dynam. Sys. 15 (1995), 951-960.
[39] K. Marton and P. Shields, "Entropy and the consistent estimation of joint distributions," Ann. Probab. 22 (1994), 960-977. Correction: Ann. Probab. 23 (1995).
[40] D. Neuhoff and P. Shields, "Block and sliding-block source coding," IEEE Trans. Inform. Theory IT-23 (1977), 211-215.
[41] D. Neuhoff and P. Shields, "Channel entropy and primitive approximation," Ann. Probab. 10 (1982), 188-198.
[42] D. Neuhoff and P. Shields, "Indecomposable finite state channels and primitive approximation," IEEE Trans. Inform. Theory IT-28 (1982), 11-18.
[43] D. Neuhoff and P. Shields, "A very simplistic, universal, lossless code," IEEE Workshop on Information Theory, Rydzyna, Poland, June 1995.
[44] A. Nobel and A. Wyner, "A recurrence theorem for dependent processes with applications to data compression," IEEE Trans. Inform. Theory IT-38 (1992), 1561-1563.
[45] D. Ornstein, "An application of ergodic theory to probability theory," Ann. Probab. 1 (1973), 43-58.
[46] D. Ornstein, Ergodic theory, randomness, and dynamical systems, Yale Mathematical Monographs 5, Yale Univ. Press, New Haven, CT, 1974.
[47] D. Ornstein and P. Shields, "An uncountable family of K-automorphisms," Advances in Math. 10 (1973), 63-88.
[48] D. Ornstein and B. Weiss, "Equivalence of measure preserving transformations," Memoirs of the AMS 262 (1982).
[49] D. Ornstein and B. Weiss, "The Shannon-McMillan-Breiman theorem for a class of amenable groups," Israel J. Math. 44 (1983), 53-60.
[50] D. Ornstein and B. Weiss, "Entropy and isomorphism theorems for actions of amenable groups," J. d'Analyse Math. 48 (1987), 1-141.
[51] D. Ornstein and B. Weiss, "How sampling reveals a process," Ann. Probab. 18 (1990), 905-930.
[52] D. Ornstein and B. Weiss, "Entropy and data compression schemes," IEEE Trans. Inform. Theory IT-39 (1993), 78-83.
[53] D. Ornstein and B. Weiss, "The d-recognition of processes," Advances in Math. 104 (1994), 182-224.
[54] D. Ornstein and B. Weiss, "On the Bernoulli nature of systems with some hyperbolic structure," Ergodic Th. and Dynam. Sys., to appear.
[55] W. Parry, Topics in ergodic theory, Cambridge Univ. Press, Cambridge, 1981.
[56] K. Petersen, Ergodic theory, Cambridge Univ. Press, Cambridge, 1983.
[57] R. Phelps, Lectures on Choquet's theorem, Van Nostrand, Princeton, NJ, 1966.
[58] M. Pinsker, Information and information stability of random variables and processes (in Russian), Vol. 7 of the series Problemy Peredači Informacii, AN SSSR, Moscow, 1960. English translation: Holden-Day, San Francisco, 1964.
[59] D. Pollard, Convergence of stochastic processes, Springer-Verlag, New York, 1984.
[60] V. Rohlin, "A 'general' measure-preserving transformation is not mixing" (in Russian), Dokl. Akad. Nauk SSSR 60 (1948), 349-351.
[61] D. Rudolph, "If a two-point extension of a Bernoulli shift has an ergodic square, then it is Bernoulli," Israel J. Math. 30 (1978), 159-180.
[62] I. Sanov, "On the probability of large deviations of random variables" (in Russian), Mat. Sbornik 42 (1957), 11-44. English translation: Select. Transl. Math. Statist. and Probability 1 (1961), 213-244.
[63] P. Shields, The theory of Bernoulli shifts, Univ. of Chicago Press, Chicago, 1973.
[64] P. Shields, "Cutting and independent stacking of intervals," Mathematical Systems Theory 7 (1973), 1-4.
[65] P. Shields, "Weak and very weak Bernoulli partitions," Monatshefte für Mathematik 84 (1977), 133-142.
[66] P. Shields, "Stationary coding of processes," IEEE Trans. Inform. Theory IT-25 (1979), 283-291.
[67] P. Shields, "Almost block independence," Z. für Wahr. 49 (1979), 119-123.
[68] P. Shields, "The ergodic and entropy theorems revisited," IEEE Trans. Inform. Theory IT-33 (1987), 263-266.
[69] P. Shields, "Universal almost sure data compression using Markov types," Problems of Control and Information Theory 19 (1990), 269-277.
[70] P. Shields, "Cutting and stacking: a method for constructing stationary processes," IEEE Trans. Inform. Theory IT-37 (1991), 1605-1617.
[71] P. Shields, "The entropy theorem via coding bounds," IEEE Trans. Inform. Theory IT-37 (1991), 1645-1647.
[72] P. Shields, "Entropy and prefixes," Ann. Probab. 20 (1992), 403-409.
[73] P. Shields, "String matching: the general ergodic case," Ann. Probab. 20 (1992), 1199-1203.
[74] P. Shields, "Universal redundancy rates don't exist," IEEE Trans. Inform. Theory IT-39 (1993), 520-524.
[75] P. Shields, "Waiting times: positive and negative results on the Wyner-Ziv problem," J. of Theor. Prob. 6 (1993), 499-519.
[76] P. Shields, "Two divergence-rate counterexamples," J. of Theor. Prob. 6 (1993), 521-545.
[77] P. Shields and J.-P. Thouvenot, "Entropy zero x Bernoulli processes are closed in the d-metric," Ann. Probab., to appear.
[78] M. Smorodinsky, "A partition on a Bernoulli shift which is not 'weak Bernoulli'," Math. Systems Theory 5 (1971), 201-203.
[79] M. Smorodinsky, "Finitary isomorphism of m-dependent processes," Contemporary Math. 135 (1992), 373-376.
[80] M. Steele, "Kingman's subadditive ergodic theorem," Ann. Inst. Henri Poincaré 25 (1989), 93-98.
[81] V. Strassen, "The existence of probability measures with given marginals," Ann. Math. Statist. 36 (1965), 423-439.
[82] W. Szpankowski, "Asymptotic properties of data compression and suffix trees," IEEE Trans. Inform. Theory IT-39 (1993), 1647-1659.
[83] J.-P. Thouvenot, "Quelques propriétés des systèmes dynamiques qui se décomposent en un produit de deux systèmes dont l'un est un schéma de Bernoulli," Israel J. Math. 21 (1975), 177-207.
[84] P. Walters, An introduction to ergodic theory, Springer-Verlag, New York, 1982.
[85] F. Willems, "Universal data compression and repetition times," IEEE Trans. Inform. Theory IT-35 (1989), 54-58.
[86] A. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inform. Theory IT-35 (1989), 1250-1258.
[87] A. Wyner and J. Ziv, "Fixed data base version of the Lempel-Ziv data compression algorithm," IEEE Trans. Inform. Theory IT-37 (1991), 878-880.
[88] A. Wyner and J. Ziv, "The sliding-window Lempel-Ziv algorithm is asymptotically optimal," Proc. of the IEEE 82 (1994), 872-877.
[89] S. Xu, "An ergodic process of zero divergence-distance from the class of all stationary processes," J. of Theor. Prob. 6 (1993).
[90] J. Ziv, "Coding of sources with unknown statistics, Part I: Probability of encoding error; Part II: Distortion relative to a fidelity criterion," IEEE Trans. Inform. Theory IT-18 (1972), 384-394.
[91] J. Ziv, "Coding theorems for individual sequences," IEEE Trans. Inform. Theory IT-24 (1978), 405-412.
[92] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inform. Theory IT-23 (1977), 337-343.

235 blowup. 24. 123 code. 194 Borel-Cantelli principle. 107 subcolumn. 69 built by cutting and stacking. 24. 187 absolutely regular. 194 almost blowing-up. 26. 108 uniform. 8 block-independent process. 235 (8. 8 block coding of a process. 195. 174 in probability. 215 column. 24.Index a-separated. 107 level. 107 column partitioning. 107 top. 83 blowing-up property (BUP). 111 column structure: top. 109 disjoint columns. 1 asymptotic equipartition (AEP). 107 cutting a column. 7 truncation. 125 base (bottom) of a column. 211 Barron's code-length bound. 24. E)-blowing-up. 8. 110 transformation defined by. 226 entropy. 24. 10 concatenation representation. 69 built-up set bound. 121 n-code. 108 width distribution. 11 building blocks. 138 circular k-type. 84. 195. 108 (a. 121 per-symbol. 110 copy. 52 block code. 27 conditional E -independence. K)-strongly-separated. 121 rate of. 68 blowup bound. 72. 108 binary entropy function. 179. 104 block-to-stationary construction. 71 length function. 107 complete sequences. 189 (a. 109 built-up. 71 coding block. 233 addition law. 72 codeword. 55 B-process. 212 almost blowing-up (ABUP). 174 in d. 74 codebook. 185. 68 8-blowup. 107 height (length). 107 upward map. 108 estimation of distributions. 107 name. 109 column structure. 71 code sequence. 107 base (bottom). 58 admissible. 108 width. k)-separated structures. 104. 107 labeling. 74. 70 (1 — E)-built-up. 108 width (thickness). 235 alphabet. 107 of a column structure. 195. 188 upward map. 115 disjoint column structures. 58 . 211 block-structure measure. 108 support. 190 columnar representation. 103. 110 concatenated-block process. 24. 9. 215 245 faithful (noiseless). 115 cutting into copies. 184 almost block-independence (ABI). 105 complete sequences of structures.

68 prefix codes. 184 d-distance . 87 d-distance properties. 42 of Kingman. 109 standard representation. 63 of overlapping k-blocks. 62 Markov processes. 51. 129 entropy-typical sequences. 90 rotation processes. 89. 15 components. 67 cutting and stacking. 16 totally. 48 Markov. 222 of a partition. 166 for frequencies. 91 definition (joining). 168 divergence. 31 distinct words. 91 definition for processes.d. 218 coupling measure. 96 d-far-apart Markov chains. 49 decomposition. 185 encoding partition. 11 expected return time. 65 empirical joinings. 46 number. 89 realizing tin 91 d-topology. processes. 100 ergodicity. 21 ergodic theorem (of Birkhoff).i. 138 estimation. 48 empirical entropy. 51 a process. 224 ergodic. 24 exponential rates for entropy. 97 completeness. 218 covering.246 conditional invertibility. processes. 43 ergodicity and d-limits. 17 start. 45 "good set" form. 88 entropy of i. 20 process. 6 shifted empirical block. 60 upper semicontinuity. 51 empirical. 58 entropy rate. 75. 89 for i.i. 1 of blocks in blocks. 166 Elias code. 4 . 57 doubly-infinite sequence model. 92 for sequences. 56 topological. 63 . 214 coupling. 76 empirical distribution. 133 entropy interpretations: covering exponent. 59 of a distribution. 186. 143. 100 typical sequences. 223 for processes. 176. 46 almost-covering principle. 147 distribution of a process. 78 entropy theorem. 190 cylinder set. 138 first-order entropy. 98 direct product. 100 subset bound. 90 d'-distance. 99 finite form. 67 cardinality bound. 1 continuous distribution. 80 entropy. 89 for stationary processes. 92 for ergodic processes. 33 of von Neumann. 73 entropy properties: d-limits. 112 cyclic merging rule. 58 a random variable. 59 a partition. 67 E-independence. 43. 51 conditional. 57 inequality. 103. 64 empirical measure (limiting). 144 n-th order. 132 Ziv entropy. 94 empirical universe. 164 consistency conditions.d. INDEX empirical Markov entropy. 102 relationship to entropy. 62 concatenated-block processes. 42 maximal. 2 d-admissible. 99 mixing. 45 eventually almost surely.

27 transformation. 6 generalized renewal process. 10 overlapping to nonoverlapping. 45 function of a Markov chain. 110 Kolmogorov partition. 26 generalized return-time picture. 5 source. 195 finitary process. 25 infinitely often. 122 nonoverlapping-block process. 4 complete measure model. 20 name of a column. 34 (1 — 0-packing. 90 mapping interpretation. 222 i. 4. 17 joint input-output measure. 151 finitary coding. 121 filler or spacer blocks. 40 strong-packing. 80 finite energy. 229 247 Kac's return-time formula. 41 . 3. 185 nonexistence of too-good codes. 117 extension. 168 overlapping-block process. 7 finite coding. 62 matching. 4 set. 84 fillers. 198. 2)-name. 4 and complete sequences. 34 stopping-time. 6 Markov order theorem. 10 packing. 235 measure preserving. 138 partial. 114 M-fold. 76. 13 Kolmogorov representation. 6 concatenated-block process. 131 convergence theorem. 133 simple LZ parsing. 25 Kolmogorov measure. 91 definition of d. 43 typical. 195 finite coder. 215 n-blocking. 132 upper bound. 18 and d-limits.d. 104 first-order rate bound. 10 finitely determined. 15 irreducible Markov. 7 invariant measure. 8 approximation theorem. 14 nonadmissibility. 2. 6 k-th order. 73 Kronecker's theorem. 3 two-sided model. 28 independent partitions. 65 Markov inequality. 11 instantaneous coder. 41 two-packings. 18 join of measures. 131 LZW algorithm. 17 cutting and stacking. 91 join of partitions. 104 log-sum inequality. 115 induced process. 11 Markov chain. 3 Kraft inequality. processes. 14 merging and separation. 139 separated. 44 (k. 212 stacking of a structure: onto a column. 175. 107 (T. 159 finite-state process. 186. 22 Lempel-Ziv (LZ) algorithm. 136 linear mass distribution. 166 frequency. 41 packing lemma. 102 faithful code. 221 finite form. 107 of a column level. 7 function.INDEX extremal measure. 100 and Markov. 190 mixing. 115 onto a structure.i. E)-typical sequences. almost surely. 71 faithful-code sequence. 223 first-order blocks. 34. 116 repeated.

0-strongly-packed. 45. 12 N-th term process. 158 process. 30. 63. 70 (1 — 0-strongly-covered. 63. 67 and d-distance. 175 - INDEX and upward maps. 165 recurrence-time function. 8. 65 typical sequence. 180 splitting-set lemma. 84 stationary process. 27. 59 finite-energy. 211 entropy. 73 code sequence. 3 (T . 159 (S.d. 67 too-soon-recurrent representation. 6 input-output measure. 150 (K. 166 for frequencies. 229 Markov process. 26 rotation (translation). 85 strong cover. 10 regenerative. 55 shift transformation. 8 rate function for entropy. 13 too-long word. 33 strongly-covered. 3. 44.. 205 partition. n)-blocking.P) process. 138 subadditivity. 101 Shannon-McMillan-Breiman theorem. 132 of a column structure. 154 repeated words. 64 class. 145 existence theorem. 89 Pinsker's inequality. 63 k-type. 64 class bound. 148 return-time distribution. 82 time-zero coder. 150 representation. 29 type. 24 picture. 96 randomizing the start. 122 sequence. 46 (L. 40 string matching. 215 i. 109 universal code. 169. 188 full k-block universe. 21 d-far-apart rotations. 167. 13 and complete sequences. 179 skew product. 21 process. 3 shifted. 180 stacking columns. 7 with induced spacers. 151 too-small set principle. 110 B-process. 156 totally ergodic. 170 with gap. 120 stationary. 1. 6 N-stationary. 6 in-dependent. block parsings. 190 . 24 process. 121 universe (k-block). 0-strongly-covered. 81 and mixing. 5 Markov. n)-independent. 32. 7 and entropy. 14 product measure. 167 equivalence. 13 distribution. 215 (S. 34 strongly-packed. 229 stopping time. 40 interval. 43. 23 prefix. 4 shift. 7 speed of convergence. 63 class. 3 transformation/partition model. 31 sliding-block (-window) coding. 30 transformation. 5 Vf-mixing.248 pairwise k-separated. 121 trees. 72 code. 167 equivalence. 109 start position. 28 stationary coding. 166 splitting index.i. 14 per-letter Hamming distance. 21 tower construction. 59 subinvariant function. 170. 66 Poincare recurrence theorem. 68. 74. 8 output measure.

87. 119 window function. 195 window half-width. 148 repeated.INDEX very weak Bernoulli (VWB). 7 words. 200 approximate-match. 104 distinct. 133 249 . 233 weak topology. 200 weak Bernoulli (WB). 148 Ziv entropy. 179. 87 well-distributed. 232 waiting-time function. 88 weak convergence.

Sign up to vote on this title
UsefulNot useful

Master Your Semester with Scribd & The New York Times

Special offer for students: Only $4.99/month.

Master Your Semester with a Special Offer from Scribd & The New York Times

Cancel anytime.