The Ergodic Theory of Discrete Sample Paths

Paul C. Shields

Graduate Studies in Mathematics
Volume 13

American Mathematical Society

Editorial Board James E. Humphreys David Sattinger Julius L. Shaneson Lance W. Small, chair
1991 Mathematics Subject Classification. Primary 28D20, 28D05, 94A17; Secondary 60F05, 60G17, 94A24.
ABSTRACT. This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars or self study for students or faculty with some background in measure theory and probability theory.

Library of Congress Cataloging-in-Publication Data

Shields, Paul C. The ergodic theory of discrete sample paths / Paul C. Shields. p. cm. — (Graduate studies in mathematics, ISSN 1065-7339; v. 13) Includes bibliographical references and index. ISBN 0-8218-0477-4 (alk. paper) 1. Ergodic theory. 2. Measure-preserving transformations. 3. Stochastic processes. I. Title. II. Series. QA313.555 1996 96-20186 519.2'32—dc20 CIP

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given. Republication, systematic copying, or multiple reproduction of any material in this publication (including abstracts) is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Assistant to the Publisher, American Mathematical Society, P.O. Box 6248, Providence, Rhode Island 02940-6248. Requests can also be made by e-mail to reprint-permission@ams.org.

© Copyright 1996 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Printed in the United States of America. The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Printed on recycled paper.

10 9 8 7 6 5 4 3 2 1    01 00 99 98 97 96

Contents
Preface

I Basic concepts
I.1 Stationary processes
I.2 The ergodic theory model
I.3 The ergodic theorem
I.4 Frequencies of finite blocks
I.5 The entropy theorem
I.6 Entropy as expected value
I.7 Interpretations of entropy
I.8 Stationary coding
I.9 Process topologies
I.10 Cutting and stacking


II Entropy-related properties
II.1 Entropy and coding
II.2 The Lempel-Ziv algorithm
II.3 Empirical entropy
II.4 Partitions of sample paths
II.5 Entropy and recurrence times

III Entropy for restricted classes
III.1 Rates of convergence
III.2 Entropy and joint distributions
III.3 The d-admissibility problem
III.4 Blowing-up properties
III.5 The waiting-time problem

IV B-processes
IV.1 Almost block-independence
IV.2 The finitely determined property
IV.3 Other B-process characterizations
Bibliography

Index


Preface

This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The focus is on the combinatorial properties of typical finite sample paths drawn from a stationary, ergodic process. A primary goal, only partially realized, is to develop a theory based directly on sample path arguments, with minimal appeals to the probability formalism. A secondary goal is to give a careful presentation of the many models for stationary finite-alphabet processes that have been developed in probability theory, ergodic theory, and information theory.

The two basic tools for a sample path theory are a packing lemma, which shows how "almost" packings of integer intervals can be extracted from coverings by overlapping subintervals, and a counting lemma, which bounds the number of n-sequences that can be partitioned into long blocks subject to the condition that most of them are drawn from collections of known size. These two simple ideas, introduced by Ornstein and Weiss in 1980, immediately yield the two fundamental theorems of ergodic theory, namely, the ergodic theorem of Birkhoff and the entropy theorem of Shannon, McMillan, and Breiman. The packing and counting ideas yield more than these two classical results, however, for in combination with the ergodic and entropy theorems and further simple combinatorial ideas they provide powerful tools for the study of sample paths. Much of Chapter I and all of Chapter II are devoted to the development of these ideas.

The classical process models are based on independence ideas and include the i.i.d. processes, Markov chains, instantaneous functions of Markov chains, and renewal and regenerative processes. An important and simple class of such models is the class of concatenated-block processes, that is, the processes obtained by independently concatenating fixed-length blocks according to some block distribution and randomizing the start. Related models are obtained by block coding and randomizing the start, or by stationary coding, an extension of the instantaneous function concept which allows the function to depend on both past and future. All these models and more are introduced in the first two sections of Chapter I. Further models, including the weak Bernoulli processes and the important class of stationary codings of i.i.d. processes, are discussed in Chapter III and Chapter IV.

Of particular note in the discussion of process models is how ergodic theorists think of a stationary process, namely, as a measure-preserving transformation on a probability space, together with a partition of the space. This point of view, introduced in Section I.2, leads directly to Kakutani's simple geometric representation of a process in terms of a recurrent event, a representation that not only simplifies the discussion of stationary renewal and regenerative processes but generalizes these concepts to the case where times between recurrences are not assumed to be independent. A further generalization, given in Section I.10, leads to a powerful method for constructing examples known as cutting and stacking.

The book is designed for use in graduate courses, seminars, or self study for students or faculty with some background in measure theory and probability theory.

The book has four chapters. The first chapter, which is half the book, is devoted to the basic tools, including the Kolmogorov and ergodic theory models for a process, the ergodic theorem and its connection with empirical distributions, the entropy theorem and its interpretations, the weak topology and the even more important d-metric topology, a method for converting block codes to stationary codes, and the cutting and stacking method.

Properties related to entropy which hold for every ergodic process are discussed in Chapter II. These include entropy as the almost-sure bound on per-symbol compression, Ziv's proof of asymptotic optimality of the Lempel-Ziv algorithm via his interesting concept of individual sequence entropy, the relation between entropy and partitions of sample paths into fixed-length blocks, or partitions into distinct blocks, or partitions into repeated blocks, and the connection between entropy and recurrence times and entropy and the growth of prefix trees.

Properties related to entropy which hold only for restricted classes of processes are discussed in Chapter III, including rates of convergence for frequencies and entropy, the estimation of joint distributions in both the variational metric and the d-metric, a connection between entropy and d-neighborhoods, and a connection between entropy and waiting times.

Several characterizations of the class of stationary codings of i.i.d. processes are given in Chapter IV, including the almost block-independence, very weak Bernoulli, finitely determined, and blowing-up characterizations. Some of these date back to the original work of Ornstein and others on the isomorphism problem for Bernoulli shifts, although the almost block-independence and blowing-up ideas are more recent.

Many standard topics from ergodic theory are omitted or given only cursory treatment, in part because the book is already too long and in part because they are not close to the central focus of this book. These topics include topological dynamics, smooth dynamics, combinatorial number theory, general ergodic theorems, K-processes, random fields, and continuous time and/or space theory. Likewise little or nothing is said about such standard information theory topics as rate distortion theory, channel theory, divergence-rate theory, algebraic coding, redundancy, and multi-user theory.

Some specific stylistic guidelines were followed in writing this book. With a few exceptions, the sections in Chapters II, III, and IV are approximately independent of each other, conditioned on the material in Chapter I. Proofs are sketched first, then given in complete detail. Theorems and lemmas are given names that include some information about content, for example, the entropy theorem rather than the Shannon-McMillan-Breiman theorem. Likewise, suggestive names are used for concepts, such as building blocks (a name proposed by Zoli Györfi) and column structures (as opposed to gadgets.) Also numbered displays are often (informally) given names similar to those used for LaTeX labels. Exercises that extend the ideas are given at the end of most sections; these range in difficulty from quite easy to quite hard. Only those references that seem directly related to the topics discussed are included.

This book is an outgrowth of the lectures I gave each fall in Budapest from 1989 through 1994, both as special lectures and seminars at the Mathematics Institute of the Hungarian Academy of Sciences and as courses given in the Probability Department of Eötvös Loránd University. In addition, lectures on parts of the penultimate draft of the book were presented at the Technical University of Delft in the fall of 1995. The audiences included ergodic theorists, information theorists, and probabilists, ranging from undergraduate and graduate students through post-docs and junior faculty to senior professors and researchers, as well as combinatorialists and people from engineering and other mathematics disciplines.

I am grateful to the Mathematics Institute of the Hungarian Academy for providing me with many years of space in a comfortable and stimulating environment, as well as to the Institute and to the Probability Department of Eötvös Loránd University for the many lecture opportunities. My initial lectures in 1989 were supported by a Fulbright lectureship. Much of the work for this project was supported by NSF grants DMS-8742630 and DMS-9024240 and by a joint NSF-Hungarian Academy grant, MTA-NSF project 37.

I am indebted to many people for assistance with this project. Imre Csiszár and Katalin Marton not only attended most of my lectures but critically read parts of the manuscript at all stages of its development and discussed many aspects of the book with me. Much of the material in Chapter III as well as the blowing-up ideas in Chapter IV are the result of joint work with Kati. It was Imre's suggestion that led me to include the discussions of renewal and regenerative processes. Others who contributed ideas and/or read parts of the manuscript include Gábor Tusnády, György Michaletzky, Gusztáv Morvai, Jacek Serafin, Bob Burton, and Nancy Morrison; Bob Burton's criticisms led to many revisions of the cutting and stacking discussion. In addition I had numerous helpful conversations with Benjy Weiss, Don Ornstein, Dave Neuhoff, Aaron Wyner, and Jacob Ziv. Last, but far from least, I am much indebted to my two Toledo graduate students, Shaogang Xu and Xuehong Li, who learned ergodic theory by carefully reading almost all of the manuscript at each stage of its development, in the process discovering numerous errors and poor ways of saying things.

No project such as this can be free from errors and incompleteness. A list of errata as well as a forum for discussion will be available on the Internet at the following web address.

http://www.math.utoledo.edu/~pshields/ergodic.html

This book is dedicated to my son, Jeffrey.

Paul C. Shields
Toledo, Ohio
March 22, 1996


Chapter I

Basic concepts

Section I.1. Stationary processes.

A (discrete-time, stochastic) process is a sequence $X_1, X_2, \ldots$ of random variables defined on a probability space $(X, \Sigma, \mu)$. The process has alphabet $A$ if the range of each $X_i$ is contained in $A$. In this book the focus is on finite-alphabet processes, so, unless stated otherwise, "process" means a discrete-time finite-alphabet process. The cardinality of a finite set $A$ is denoted by $|A|$. Also, unless it is clear from the context or explicitly stated otherwise, "measure" will mean "probability measure" and "function" will mean "measurable function" with respect to some appropriate $\sigma$-algebra on a probability space.

The sequence $a_m, a_{m+1}, \ldots, a_n$, where each $a_i \in A$, is denoted by $a_m^n$. The set of all such $a_m^n$ is denoted by $A_m^n$, with $A^n$ used when $m = 1$.

The $k$-th order joint distribution of the process $\{X_k\}$ is the measure $\mu_k$ on $A^k$ defined by the formula
$$\mu_k(a_1^k) = \mathrm{Prob}(X_1^k = a_1^k), \quad a_1^k \in A^k.$$
The set of joint distributions $\{\mu_k\colon k \ge 1\}$ is called the distribution of the process; when no confusion will result the subscript $k$ on $\mu_k$ may be omitted. The sequence $\{\mu_k\}$ cannot be completely arbitrary, however, for implicit in the definition of process is that the following consistency condition must hold for each $k \ge 1$,
$$(1) \qquad \mu_k(a_1^k) = \sum_{a_{k+1}} \mu_{k+1}(a_1^{k+1}), \quad a_1^k \in A^k.$$
The distribution of a process is thus a family of probability distributions, one for each $k$, satisfying the consistency condition (1).

The distribution of a process can, of course, also be defined by specifying the start distribution $\mu_1$ and the successive conditional distributions
$$\mu(a_k \mid a_1^{k-1}) = \mathrm{Prob}(X_k = a_k \mid X_1^{k-1} = a_1^{k-1}) = \frac{\mu_k(a_1^k)}{\mu_{k-1}(a_1^{k-1})}.$$
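The consistency condition (1) and the conditional-distribution formula above are easy to illustrate computationally for a small alphabet. The following Python sketch is not from the text: the dictionary representation of the $\mu_k$, the function names, and the i.i.d. example are illustrative assumptions. It checks that a family of joint distributions satisfies (1) and computes a conditional probability as the ratio $\mu_k(a_1^k)/\mu_{k-1}(a_1^{k-1})$.

```python
from itertools import product

# A family {mu_k} of joint distributions on A^k, stored as dictionaries
# mapping k-tuples over the alphabet A to probabilities.

def is_consistent(mu_k, mu_k1, alphabet, tol=1e-12):
    """Check condition (1): mu_k(a_1^k) = sum over a_{k+1} of mu_{k+1}(a_1^{k+1})."""
    for a in mu_k:
        marginal = sum(mu_k1.get(a + (b,), 0.0) for b in alphabet)
        if abs(marginal - mu_k[a]) > tol:
            return False
    return True

def conditional(mu_k, mu_km1, a):
    """Conditional probability mu(a_k | a_1^{k-1}) = mu_k(a_1^k) / mu_{k-1}(a_1^{k-1})."""
    return mu_k[a] / mu_km1[a[:-1]]

# Example: binary i.i.d.(1/2) joint distributions of orders 1, 2, 3.
A = (0, 1)
mu = {k: {a: 2.0 ** (-k) for a in product(A, repeat=k)} for k in (1, 2, 3)}
assert is_consistent(mu[1], mu[2], A) and is_consistent(mu[2], mu[3], A)
print(conditional(mu[2], mu[1], (0, 1)))  # 0.5 for the i.i.d.(1/2) example
```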

A process is considered to be defined by its joint distributions, that is, the particular space on which the functions $X_n$ are defined is not important; all that really matters in probability theory is the distribution of the process. Thus one is free to choose the underlying space $(X, \Sigma, \mu)$ on which the $X_n$ are defined in any convenient manner, as long as the joint distributions are left unchanged. An important instance of this idea is the Kolmogorov model for a process, which represents it as the sequence of coordinate functions on the space of infinite sequences drawn from the alphabet $A$, equipped with a Borel measure constructed from the consistency conditions (1).

Let $A^\infty$ denote the set of all infinite sequences $x = \{x_i\}$, $1 \le i < \infty$, $x_i \in A$. The cylinder set determined by $a_m^n$, denoted by $[a_m^n]$, is the subset of $A^\infty$ defined by
$$[a_m^n] = \{x\colon x_i = a_i,\ m \le i \le n\}.$$
For each $n \ge 1$ the coordinate function $\hat X_n\colon A^\infty \to A$ is defined by $\hat X_n(x) = x_n$, $x \in A^\infty$. The Kolmogorov representation theorem states that every process with alphabet $A$ can be thought of as the coordinate function process $\{\hat X_n\}$, together with a Borel measure on $A^\infty$.

The rigorous construction of the Kolmogorov representation is carried out as follows. Let $\mathcal{C}_n$ be the collection of cylinder sets defined by sequences that belong to $A^n$, let $\mathcal{C} = \cup_n \mathcal{C}_n$, and let $\mathcal{R}_n = \mathcal{R}(\mathcal{C}_n)$ denote the ring generated by $\mathcal{C}_n$. The sequence $\{\mathcal{R}_n\}$ is increasing, and its union $\mathcal{R} = \mathcal{R}(\mathcal{C})$ is the ring generated by all the cylinder sets. Let $\Sigma$ be the $\sigma$-algebra generated by the cylinder sets $\mathcal{C}$. The members of $\Sigma$ are commonly called the Borel sets of $A^\infty$. The collection $\Sigma$ can also be defined as the $\sigma$-algebra generated by $\mathcal{R}$, or by the compact sets, since the space $A^\infty$ is compact in the product topology and each set in $\mathcal{R}$ is closed. Two important properties are summarized in the following lemma.

Lemma I.1.1
(a) Each set in $\mathcal{R}$ is a finite disjoint union of cylinder sets from $\mathcal{C}$.
(b) If $\{B_n\} \subset \mathcal{R}$ is a decreasing sequence of sets with empty intersection, then there is an $N$ such that $B_n = \emptyset$, $n \ge N$.

Proof The proof of (a) is left as an exercise. Part (b) is an application of the finite intersection property for sequences of compact sets, since each set in $\mathcal{R}$ is closed. Thus the lemma is established.

Theorem I.1.2 (Kolmogorov representation theorem.) If $\{\mu_k\}$ is a sequence of measures for which the consistency conditions (1) hold, then there is a unique Borel probability measure $\mu$ on $A^\infty$ such that, for each $k$ and each $a_1^k$, $\mu([a_1^k]) = \mu_k(a_1^k)$. In other words, if $\{X_n\}$ is a process with finite alphabet $A$, then there is a unique Borel measure $\mu$ on $A^\infty$ for which the sequence of coordinate functions $\{\hat X_n\}$ has the same distribution as $\{X_n\}$.

Proof The process $\{X_n\}$ defines a set function $\mu$ on the collection $\mathcal{C}$ of cylinder sets by the formula
$$(2) \qquad \mu([a_1^n]) = \mathrm{Prob}(X_i = a_i,\ 1 \le i \le n).$$
The consistency conditions (1), together with Lemma I.1.1(a), imply that $\mu$ extends to a finitely additive set function on the ring $\mathcal{R} = \mathcal{R}(\mathcal{C})$ generated by the cylinder sets. The finite intersection property, Lemma I.1.1(b), implies that $\mu$ can be extended to a

. a process is considered to be defined by its joint distributions.SECTION Ll.. that is. A process is stationary. or to its completion.). finite-alphabet process. that is. The Kolmogorov model is simply one particular way to define a space and a sequence of functions with the given distributions. for example.1. STATIONARY PROCESSES. x = (xi.A" is defined by (Tx) n= xn+i. The measure will be called the Kolmogorov measure of the process. discrete-time. . one in which subsets of sets of measure 0 are measurable. the particular space on which the functions X„ are defined is not important.1. n. Another useful model is the complete measure model. whichever is appropriate to the context. that is. Equation (2) translates into the statement that { 5C. Process and measure language will often be used interchangeably. The (left) shift T: A" 1-*. for all m. thus. x3. T -1 B = fx: T x E BI.) == T x = (x2. including physics. it) will be called the Kolmogorov representation of the process {U. ni <j < n)." As noted earlier. though often without explicitly saying so. The phrase "stationary process" will mean a stationary. ni <j < n) = Prob(Xi+i = ai . different processes have Kolmogorov measures with different completions. 3 unique countably additive measure on the a-algebra E generated by R. that is. whenever it is convenient to do so the complete Kolmogorov model will be used in this book. data transmission and storage. E. Stationary finite-alphabet processes serve as models in many settings of interest. and defines a set transformation T -1 by the formula. unless stated otherwise. "let be a process. x2. Furthermore. The statement that a process {X n } with finite alphabet A is stationary translates into the statement that the Kolmogorov measure tt is invariant under the shift transformation T. if the joint distributions do not depend on the choice of time origin. or the Kolmogorov measure of the sequence Lu l l.} and {X. The shift transformation is continuous relative to the product topology on A". B . This proves Theorem 1. and affi n .} have the saine joint distribution." means "let i be the Kolmogorov measure for some process {Xn } . (3) Prob(Xi = ai . n> 1. . "Kolmogorov measure" is taken to refer to either the measure defined by Theorem 1.n } on the probability space (A'. and statistics.2. Many ideas are easier to express and many results are easier to establish in the framework of complete measures. In particular. x which is expressed in symbolic form by E A. and uniqueness is preserved.2. The sequence of coordinate functions { k . for Aga in) = ictk (alic) for all k > 1 and all a. . The Kolmogorov measure on A extends to a complete measure on the completion t of the Borel sets E relative to tt. completion has no effect on joint distributions.

3 Much of the success of modern probability theory is based on use of the Kolmogorov model. In the stationary case. the Kolmogorov measure on A' for the process defined by {pa In summary. onto il°'° is. It preserves a Borel probability measure A if and only if the sequence of coordinate functions {:kn } is a stationary process on the probability space (A". so that only nontransient effects. the shift T is Borel measurable. the model provides a concrete uniform setting for all processes.A z is defined by (Tx) n= xn+i. The condition that the process be stationary is just the condition that p. is T -invariant. p(B) = p(T -1 B). that is. in turn. T-1 r„n 1 _ ri.([4]) = p. one as the Kolmogorov measure p. T is one-to-one and onto and T -1 is measure preserving. so that the set transformation T -1 maps a cylinder set onto a cylinder set. where each xn E A and n E Z ranges from —oo to oo. In addition.). that is.1 . of course. B E E. The condition. the other as the shift-invariant extension to the space Az. those that do not depend on the time origin. that is. for asymptotic properties can often be easily formulated in terms of subsets of the infinite product space A". if {p4} is a set of measures for which the consistency conditions (1) and the stationarity conditions (3) hold then there is a unique T-invariant Borel measure ii. The projection of p. BASIC CONCEPTS. Remark 1.4 Note that CHAPTER I. or to the two-sided measure or its completion. A(B) = p(T -1 B) for each cylinder set B. the set of all sequences x = {x„}. E. m <j < n. bi+1 = ai.n+1 i L"m-1 — Lum+1. Such a model is well interpreted by another model. It follows that T -1 B is a Borel set for each Borel set B. different processes with the same alphabet A are distinguished by having . p. such results will then apply to the one-sided model in the stationary case. Proofs are sometimes simpler if the two-sided model is used. which is. Note that the shift T for the two-sided model is invertible. "Kolmogorov measure" is taken to refer to either the one-sided measure or its completion. for each k > 1 and each all'. on A". which means that T is a Borel measurable mapping. as appropriate to the context. is usually summarized by saying that T preserves the measure iu or. In summary. The intuitive idea of stationarity is that the mechanism generating the random variables doesn't change in time. In some cases where such a simplification is possible. It is often convenient to think of a process has having run for a long time before the start of observation.) From a physical point of view it does not matter in the stationary case whether the one-sided or two-sided model is used. (The two-sided shift is. are seen. equivalent to the condition that p(B) = p(T -1 B) for each Borel set B. The concept of cylinder set extends immediately to this new setting. that is. Furthermore. only the proof for the invertible case will be given. that ti. alternatively. in fact. for invertibility of the shift on A z makes things easier. Let Z denote the set of integers and let A z denote the set of doubly-infinite A-valued sequences. for stationary processes there are two standard representations. n E Z.([b 11 ]). on A z such that Again) = pk (a1 ). called the doubly-infinite sequence model or two-sided model. The latter model will be called the two-sided Kolmogorov model of a stationary process. The shift T: A z 1-.1. a homeomorphism on the compact space A z .

i. in which case the I AI x IAI matrix M defined by Mab = Mbia = Prob(X„ = bi X„_i = a) . 5 different Kolmogorov measures on A. as well as to the continuous-alphabet i.1. return-time processes associated with finite-alphabet processes are. It is identically distributed if Prob(X„ = a) = Prob(Xi = a) holds for all n > 1 and all a E A. and functions of uniformly distributed i. STATIONARY PROCESSES.5 (Independent processes.d.i. For example.d. Some standard examples of finite-alphabet processes will now be presented.i. I.1.1. if and only if its Kolmogorov measure is the product measure defined by some distribution on A. even more can be gained in the stationary case by retaining the abstract concept of a stationary process as a sequence of random functions with a specified joint distribution on an arbitrary probability space.a Examples. condition holds if and only if the product formula iak (al k )= n Ai (ai). A Markov chain is said to be homogeneous or to have stationary transitions if Prob(X n = blXn _i = a) does not depend on n. A sequence of random variables {X„} is independent if Prob(X. When completeness is needed the complete model can be used. holds for all n > 1 and all al' E An. in the stationary case.4 While this book is primarily concerned with finite-alphabet processes. is a Markov chain if (4) Prob (X n = an I V I -1 = ar l ) = Prob(Xn = an IXn _i =a_1) holds for all n > 2 and all a7 E An. Example 1. As will be shown in later sections. = an IX7 -1 = ar I ) = Prob(X„ = an ).) The simplest example of a stationary finite-alphabet process is an independent. {X„}. cases that will be considered. except in trivial cases.d.6 (Markov chains. when invertibility is useful the two-sided model on A z can be used. Thus a process is i. or even continuous-alphabet processes. A sequence of random variables. processes will be useful in Chapter 4.1. The i.d.d. An independent process is stationary if and only if it is identically distributed. and.i. Example 1.SECTION 1. countable-alphabet processes.) process. it is sometimes desirable to consider countable-alphabet processes. Such a measure is called a product measure defined by Ai.) The simplest examples of finite-alphabet dependent processes are the Markov processes.i. The Kolmogorov theorem extends easily to the countable case. Remark 1.1. identically distributed (i. i= 1 k holds.

anix ri . any finite-alphabet process is called a finite-state process. .} is often referred to as the underlying Markov process or hidden Markov process associated with {Yn }. = (sa . second. Yn = bisrl .7 (Multistep Markov chains.) A generalization of the Markov property allows dependence on k steps in the past. Prob(X. Here "Markov" will be reserved for processes that satisfy the Markov property (4) or its more general form (5).1. Unless stated otherwise. BASIC CONCEPTS. The start distribution is called a stationary distribution of a homogeneous chain if the matrix product A I M is equal to the row vector pl. for all n > 1.k (4`) = Prob(X i` = 4). where ii. if (6) Prob(sn = s. In other words. The condition that a Markov chain be a stationary process is that it have stationary transitions and that its start distribution is a stationary distribution. but {Yn } is not. is called the transition matrix of the chain. If this holds.) Let A and S be finite sets. and. Note that the finite-state process {Yn } is stationary if the chain {X„} is stationary. in general. that transitions be stationary.6 CHAPTER I. that A k M (k) = . and what is here called a finite-state process is called a function (or lumping) of a Markov chain. 0 Lk—I •fu i i = a2 k otherwise. A (an I a n v n—i _ ) = Prob(X n = anl. that is. {Yn } is a finite-state process if there is a finite-alphabet process {s n } such that the pair process {X.4-1 ) .1T -1 ) = Prob(sn = s. The positivity condition rules out transient states and is included so as to avoid trivialities. then a direct calculation shows that. the process is stationary. A process is called k-step Markov if (5) Prob(X. for which iu 1 (a) > 0 for all a E A.8 (Finite-state processes. An A-valued process {Y. first. and "finite-state process" will be reserved for processes satisfying property (6). il n—k — n-1\ an—k ) does not depend on n as long as n > k. The two conditions for stationarity are. see Exercise 2. a Markov chain. 4 E A " . Note that probabilities for a k-th order stationary chain can be calculated by thinking of it as a first-order stationary chain with alphabet B. Yn )} is a Markov chain. The Markov chain {X. = biSn-1). where /2i(a) = Prob(X i = a). Y. following the terminology used in [15 ]. "stationary Markov process" will mean a Markov chain with stationary transitions for which A I M = tti and. indeed. In information theory. The start distribution is the (row) vector lit = Lai (s)}.1. Example 1. and B = fa: p.} is said to be a finite-state process with state space S.k (4c) > 01.uk . Example 1. a E A. = an i Xn n :ki = an niki ) holds for all n > k and for all dii. In ergodic theory. and M (k) is the I BI x IBI matrix defined by Ai (k) _ bk — l' 1 1 r1 ak Prob(Xk ±i = bk I X ki = a k\ il . what is here called a finite-state process is sometimes called a Markov source. .
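The stationarity condition for a Markov chain, namely that the start distribution $\pi$ satisfy $\pi M = \pi$ for the transition matrix $M$, can be found and checked numerically. The sketch below is only an illustration, not something from the text: the function name, the eigenvector method, and the sample two-state matrix are assumptions, and the chain is assumed irreducible so that such a $\pi$ exists and is unique.

```python
import numpy as np

def stationary_distribution(M, tol=1e-10):
    """Return a probability vector pi with pi M = pi, found as the
    left eigenvector of M for eigenvalue 1 (assumes such a vector exists)."""
    eigvals, eigvecs = np.linalg.eig(M.T)          # left eigenvectors of M
    idx = np.argmin(np.abs(eigvals - 1.0))         # eigenvalue closest to 1
    pi = np.real(eigvecs[:, idx])
    pi = pi / pi.sum()                             # normalize to a probability vector
    assert np.allclose(pi @ M, pi, atol=tol)       # stationarity: pi M = pi
    return pi

# A two-state transition matrix M_{ab} = Prob(X_n = b | X_{n-1} = a).
M = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = stationary_distribution(M)
print(pi)  # [0.8, 0.2]: with this start distribution the chain is stationary
```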

f (x) is just the 0-th coordinate of y = F (x). otherwise known as the per-symbol or time-zero coder. the contents of the window are shifted to the left) eliminating xn _ u.Bz is a stationary coder if F(TA X) = TB F(x).„ with coder F.1. Define the function f: Az i. n Yn = f (x'. when there is a nonnegative integer w such that = f (x) = f CO . The encoded sequence y = F(x) is defined by the formula .) The concept of finite-state process can be generalized to allow dependence on the past and future.). Thus the stationary coding idea can be expressed in terms of the sequence-to-sequence coder F. Suppose A and B are finite sets. • • • 1 Xn-f-w which determines the value y. • • • 9 Xn. through the function f. The stationary coder F and its time-zero coder f are connected by the formula = f (rx). the window is shifted to the right (that is. x E AZ.71 Z E Z. that is. Xn—w+1. sometimes called the full coder. At time n the window will contain Xn—w. A case of particular interest occurs when the time-zero coder f depends on only a finite number of coordinates of the input sequence x. In the special case when w = 0 the encoding is said to be instantaneous. The coder F transports a Borel measure IL on Az into the measure v = it o F -1 . and bringing in .. Note that a stationary coding of a stationary process is automatically a stationary process. the image y = F(x) is the sequence defined by y. The word "coder" or "encoder" may refer to either F or f.) is called an instantaneous function or simply a function of the {X n } process. where TA and TB denotes the shifts on the respective two-sided sequence spaces Az and B z . and the encoded process {Yn } defined by Yn = f (X.1.SECTION 1. 7 Example 1. Finite coding is sometimes called sliding-block or sliding-window coding. In this case it is common practice to write f (x) = f (xw w ) and say that f (or its associated full coder F) is a finite coder with window half-width w. that is. Vx E Az .9 (Stationary coding. STATIONARY PROCESSES. that is. n E Z. or in terms of the sequence-to-symbol coder f. A Borel measurable mapping F: Az 1-.B by the formula f (x) = F(x)o. The function f is called the time-zero coder associated with F. the value y. = f (T:1 4 x). that is. defined for Borel subsets C of Bz by v(C) = The encoded process v is said to be a stationary coding of ti. for one can think of the coding as being done by sliding a window of width 2w ± 1 along x. an idea most easily expressed in terms of the two-sided representation of a stationary process. To determine yn+1 . depends only on xn and the values of its w past and future neighbors.
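The sliding-window description of a finite coder translates directly into code: each output symbol is obtained by applying the time-zero coder to the $2w + 1$ input symbols centered at the current index. The Python sketch below is an illustrative assumption rather than anything from the text, and the majority-vote time-zero coder is purely an example; on a finite sample path only positions with a full window are encoded.

```python
def sliding_block_code(x, f, w):
    """Apply a time-zero coder f of window half-width w to a finite list x.
    Output y[n] = f(x[n-w], ..., x[n+w]); only positions with a full window
    are encoded, so the output is shorter than the input by 2*w symbols."""
    return [f(x[n - w:n + w + 1]) for n in range(w, len(x) - w)]

# Illustrative time-zero coder: majority vote over a window of 0's and 1's.
def majority(window):
    return 1 if sum(window) * 2 > len(window) else 0

x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
print(sliding_block_code(x, majority, w=1))
```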

. This method is called "randomizing the start. that is." The relation = . are connected by the formula y = o F -1 . the N-block coding {Yn } is not.d. such as the encoded process {Yn }.i.) Another type of coding. to of the code.nn+ -w w ) = ym n . where yi = f m <j < n.. The key property of finite codes is that the inverse image of a cylinder set of length n is a finite union of cylinder sets of length n 2w.d.. a function of a Markov chain. destroys stationarity. • . Example 1. Furthermore. Such a code can be used to map an A-valued process {X n } into a B-valued process {Yn } by applying CN to consecutive nonoverlapping blocks of length N.(F-1 [4]) = (7) E f A process {Yn } is a finite coding of a process {X n } if it is a stationary coding of {Xn } with finite width time-zero coder.. 2.u.n41. An N -block code is a function CN: A N B N . of the respective processes. There is a simple way to convert an N-stationary 2rocess.„ + i. that is. where T = Tg. stationary. especially in Chapter 4. pt and v. {X. F -1 ([). where w is the window half-width . . cylinder set probabilities for the encoded measure y = o F -1 are given by the formula y(4) = ii. Such a finite coder f defines a mapping. If F: Az B z is the full coder defined by f. Xn. process with finite or infinite alphabet.. where A and B are finite sets. in general. although it is N-stationary.7...v ) so that.nitww)=Y s11 [x:t w w] . The process {Yn } is called the N-block coding of the process {X n ) defined by the N -block code C N. also denoted by f.i.8 CHAPTER I. in particular. 2. More generally. and is often easier to analyze than is stationary coding. where several characterizations will be discussed. . BASIC CONCEPTS. a finite alphabet process is called a B-process if it is a stationary coding of some i. (j+1)N X j1s1+1 ) 9 j 0. 1) is measurable with respect to the a-algebra E(X. by the formula f (x. . called block coding. If the process {X. then F' ([y]) = f (x.1. process is called a B-process. I. The process {Yn } is defined bj selecting an integer u E [1. {Yn }. however. for any measure . It is frequently used in practice. Note that a finite-state process is merely a finite coding of a Markov chain in which the window half-width is 0.} is stationary. A stationary coding of an i.. Xn±„. where y = F(x) is the (measurable) mapping defined by (j+1)N r jN+1 = .n+w generated by the random variables X„. into a stationary process.. . The Kolmogorov measures. n . N] according to the uniform distribution and defining i = 1.} and {K}. y(j+1)N jN+1 = CN ()((f+1)N) +1 iN j = 0 1. that is. it is invariant under the N-fold shift T N . the shift on B. 2. Much more will be said about B-processes in later parts of this book.10 (Block codings. from An B.

. In general.. an N-block code CN induces a block coding of a stationary process {Xn } onto an N-stationary process MI. with stationarity produced by randomizing the start. B E E. conditioned on U = u.11 (Concatenated-block processes.I. by the process {41. . (ii) There is random variable U uniformly distributed on {1.} is i. block coding introduces a periodic structure in the encoded process.d.* o 0 -1 on A" via the mapping = x = w(l)w(2) • • • where JN X (j-1)N+1 = W(i). y and î. {Y. and as such inherits many of its properties. . NI such that. Example 1. forming the product measure If on (A N )". . j = 1. (i) li.. a structure which is inherited. suppose X liv = (X1. An alternative formulation of the same idea in terms of measures is obtained by starting with a probability measure p. STATIONARY PROCESSES. {Y..SECTION I. then averaging to obtain the measure Z defined by the formula N-1 N il(B) = — i=o it*(0 -1 (7—i B)).} and many useful properties of {X n } may get destroyed. In summary. a general procedure will be developed for converting a block code to a stationary code by inserting spacers between blocks so that the resulting process is a stationary coding of the original process.. is expressed by the formula v ( c) = E v (T C). .8. In random variable terms. There are several equivalent ways to describe such a process. with distribution t. } obtained (j-11N+1 be independent and have the distribution of Xr. The concatenated-block process is characterized by the following two conditions. For example. }.) Processes can also be constructed by concatenating N-blocks drawn independently at random according a given distribution. of the respective processes. 9 between the Kolmogorov measures. on A N .1. by randomizing the start.i. Let {Y. . . the sequence {Y:± ±(iiN oN±i : j = 1. the final stationary process {Yn } is not a stationary encoding of the original process {X. {Y. The process { 171 from the N-stationary process {Yi I by randomizing the start is called the concatenatedblock process defined by the random vector Xr .1. 2.} and { Y. 2. with a random delay. The method of that is. X N ) is a random vector with values in AN. } ... 2. transporting this to an N-stationary measure p.} is stationary. i=o is the average of y. randomizing the start clearly generalizes to convert any N-stationary process into a stationary process.. which can then be converted to a stationary process.} be the A-valued process defined by the requirement that the blocks . In Section 1. . together with its first N — 1 shifts.

• • • 9 Xn+N-1)• The Z process is called the nonoverlapping N-block process determined by {X n }. then {Y. n = ai. then its nonoverlapping blockings are i.10 CHAPTER I. Yn = X (n-1)N+1. i) can only go to (ar . (See Exercise 7.) Associated with a stationary process {X n } and a positive integer N are two different stationary processes. 1) with probability p. a function of a Markov chain. called the N-th term process.n } defined by setting Y. j). however. on AN. N) goes to (b Ç'.tSiv = MaNbi ri i= 1 Mb. 2.} is Markov with transition matrix Mab. . 0 if bN i —1 = ci 2 iv otherwise. If the process {X. The process is started by selecting (ar .} be the (stationary) Markov chain with alphabet S = AN x {1.d. hence this measure terminology is consistent with the random variable terminology.r. that is.12 (Blocked processes. A third formulation. then the overlapping N-block process is Markov with alphabet B = far: War) > 0} and transition matrix defined by Ma i.d.(4)/N.} process. +i A related process of interest {Yn }.. To describe this representation fix a measure p. by Zn = (X(n-1)N+19 X(n-1)N+29 • • • 1 XnN). The process defined by ii has the properties (i) and (ii).i) with probability i.} is i. which is quite useful. N}. . that is. for n > 1.b. selects every N-th term from the {X. . for each br (C) E AN. while the W process is called the overlapping N-block process determined by {X J. In fact. BASIC CONCEPTS. Each is stationary if {X. Note also that if {X n} is Markov with transition matrix Mab.. represents a concatenated-block process as a finite-state process. (a) If i < N .i. if {X. The process {Y.d.i. (ar ..) Concatenated-block processes will play an important role in many later discussions in this book.i } will be Markov with transition matrix MN. the N-th power of M. i ± 1).1. (b) (ar . but overlapping clearly destroys independence. n E Z. is called the concatenated-block process defined by the measure p. b' 1 1 = { MaNbN. Wn = ant X+1. The measure ii. {Z n } and IKI. or Markov process is Markov. on A N and let { Y„. defined. = (ar . defined by the following transition and starting rules. If {Xn } is Markov with transition matrix M. The overlapping N-blocking of an i. Example 1.(br). if Y. is the same as the concatenated-block process based on the block distribution A. then the nonoverlapping N-block process is Markov with alphabet B and transition matrix defined by N-1 1-4a'iv .} is stationary.i.
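On a finite sample path the nonoverlapping and overlapping $N$-block processes of the preceding example reduce to simple block extraction. The short Python sketch below is illustrative only; the function names are assumptions, not notation from the text.

```python
def nonoverlapping_blocks(x, N):
    """Z_n = (x_{(n-1)N+1}, ..., x_{nN}): consecutive disjoint N-blocks."""
    return [tuple(x[i:i + N]) for i in range(0, len(x) - N + 1, N)]

def overlapping_blocks(x, N):
    """W_n = (x_n, ..., x_{n+N-1}): all length-N windows."""
    return [tuple(x[i:i + N]) for i in range(len(x) - N + 1)]

x = [0, 1, 1, 0, 1, 0]
print(nonoverlapping_blocks(x, 2))  # [(0, 1), (1, 0), (1, 0)]
print(overlapping_blocks(x, 2))     # [(0, 1), (1, 1), (1, 0), (0, 1), (1, 0)]
```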

(x) is true for i = 1. there is an N and a set G of measure at least 1 — c. 11 I. (b) P. . Two elementary results from probability theory will be frequently used. the fact that x E B. such that Pn.) Let f be a nonnegative. Lemma 1. 2. if for almost every x there is an increasing sequence {n i } of integers.1. just a generalized Borel-Cantelli principle. n Gn .) such that E ii(c. the Markov inequality and the Borel-Cantelli principle. < c 8 then f (x) < c. such that Ifn(x) — f(x)1 < E . indeed. Almost-sure convergence is often established using the following generalization of the Borel-Cantelli principle.i ) < 00 then for almost every x there is an N = N(x) such that x g C„. eventually almost surely. (c) Given c > 0. except for a set of measure at most b. almost surely. the Borel-Cantelli principle is often expressed by saying that if E wc. In many applications. Frequent use will be made of various equivalent forms of almost sure convergence. integrable function on a probability space (X. for every c > O. If f f dp.1. Then x g Bn . eventually almost surely.SECTION 1.u(Bn n Gn ) < oo. Lemma 1.„ eventually almost surely. E. eventually almost surely.) < oo then x g C.b Probability tools. E.14 (The Borel-Cantelli principle. and x g B.l. is established by showing that E . n > N. a property P is said to be measurable if the set of all x for which P(x) is true is a measurable set. —> f.).16 The following are equivalent for measurable functions on a probability space. almost surely. In general. eventually almost surely. n Gn. p. For example.15 (The iterated Borel-Cantelli principle. if for almost every x there is an N = N(x) such that Pn (x) is true for n > N.. If {Pn } is a sequence of measurable properties then (a) P(x) holds eventually almost surely.. which may depend on x.1.) If {C n } is a sequence of measurable sets in a probability space (X. summarized as follows. The proof of this is left to the exercises. (x) holds infinitely often. STATIONARY PROCESSES. eventually almost surely.. Lemma 1. . X E G.13 (The Markov inequality.) Suppose {G„} and {B} are two sequences of measurable sets such that x E G„. (a) f. p. Lemma 1.1... (b) I fn (x) — f (x)I <E. in which case the iterated Borel-Cantelli principle is.1. n ? N.

For nice spaces.u([4] n [ci_t) . A similar result holds with either A" or B" replaced. (Hint: show that such a coding would have the property that there is a number c > 0 such that A(4) > c". such as the martingale theorem.i. B.' and an+ni±k n+m-1-1 . namely. if p.d. Define X. let v = p.) Let F be a Borel function from A" into B". and the renewal theorem. the central limit theorem. 1]. If X is a Borel set such that p(X) = 1.(x 1') 0 O. Show that a finite coding of an i. be a probability measure on the finite set A. = O.i. then I BI 5_ 11a.d. For ease of later reference these are stated here as the following lemma. (How is m related to window half-width?) 4.u([4]). and upper bounds on cardinality "almost" imply lower bounds on probability. be a Borel measure on A". then a stationary process is said to be m-dependent. a process is often specified as a function of some other process. Show that {X. Deeper results from probability theory.) .1. 2.12 CHAPTER I. uniformly distributed on [0. Let {tin } be i.d. Then show that the probability of n consecutive O's is 11(n + 1)!.18 (Cardinality bounds.) Let p. that lower bounds on probability give upper bounds on cardinality. will not play a major role in this book. except for a subset of B of measure at most a. such as product spaces.) 3. Lemma 1.1 is 1-dependent. Sometimes it is useful to go in the opposite direction. process. Probabilities for the new process can then be calculated by using the inverse image to transfer back to the old process. [5]. e.1. and let a be a positive number (a) If a E B (b) For b E p(a) a a. .1. namely.. Use will also be made of two almost trivial facts about the connection between cardinality and probability. by AZ or B z .l. for all n and k. this is not a serious problem. let p. p(b) > a/I BI. and sometimes the martingale theorem is used to simplify an argument.u([4]). and all a. g. for such images are always measurable with respect to the completion of the image measure. If .17 (The Borel mapping lemma. A complication arises. first see what is happening on a set of probability 1 in the old process then transfer to the new process. As several of the preceding examples suggest.i. Lemma 1. respectively. (Include a proof that your example is not Markov of any order.c Exercises. if U„ > Un _i . BASIC CONCEPTS. the law of the iterated logarithm. then F(X) is measurable with respect to the completion i of v and ii(F (X)) = 1. otherwise X. o F-1 and let r) be the completion of v. that Borel images of Borel sets may not be Borel sets. though they may be used in various examples. and is not a finite coding of a finite-alphabet i. with each U. I. let B c A. Give an example of a function of a finite-alphabet Markov chain that is not Markov of any order. 1. This fact is summarized by the following lemma. process is m-dependent for some m.18. Prove Lemma 1. = 1.

. This suggests the possibility of using ergodic theory ideas in the study of stationary processes. The coordinate functions.Kn-Fr \ 1-4-4 Kn+1 . which correspond to finite partitions of X. Since to say that . First.. The Kolmogorov model for a stationary process implicitly contains the transformation/partition model. In many cases of interest there is a natural probability measure preserved by T. x E Pa .SECTION L2. that is. (or on the space A z ). The sequence of random variables and the joint distributions are expressed in terms of the shift and Kolmogorov partition. at random according to the Kolmogorov measure. for each Borel subset B C 9. by the formula = (n K-1 kn -En iU(Xkr1+1) . 7. is the subject of this section and is the basis for much of the remainder of this book. then give rise to stationary processes. The shift T on the sequence space. where for each a E A. X = A". (1/n) E7=1 kt* (1' B). 0 < n < r.2 The ergodic theory model. THE ERGODIC THEORY MODEL. Establish the Kolmogorov representation theorem for countable alphabet processes.u (N) on AN.}. of a transformation T: X X on some given space X. {X.Tx. Finite measurements on the space X.11 satisfies the two conditions (i) and (ii) of that example. • k=0 The measures {. Prove the iterated Borel-Cantelli principle... Show that the concatenatedblock process Z defined by kt is an average of shifts of . n > 1. ÎZ(B) = . an infinite sequence. for each n. This model. The partition P is called the Kolmogorov partition associated with the process.15.1. 8.. associate with the partition P the random variable Xp defined by X2 (x) = a if a is the label of member of P to which x belongs.tc* to A". hence have a common extension . that is. relative to which information about orbit structure can be expressed in probability language. let Xn = X n (x) be the label of the member of P to which 7' 1 x belongs.T 2x. 6.a (N) . is the transformation and the partition is P = {Pa : a E A}. Show that the process constructed in the preceding exercise is not Markov of any order. Show that the finite-state representation of a concatenated-block process in Example 1. called the transformation/partition model for a stationary process. N = 1. Lemma 1. A measure kt on An defines a measure . The process can therefore be described as follows: Pick a point x E X. The Kolmogorov measure for a stationary process is preserved by the shift T on the sequence space. Section 1. as follows. are given by the formula (1) X(x) = Xp(Tn -1 x). = {x: x 1 = a}.1. 13 5. Ergodic theory is concerned with the orbits x. and. 2. Pc.} satisfy the Kolmogorov consistency conditions. . where N = Kn+r. that is.

The random variable Xp and the measure-preserving transformation T together define a process by the formula (3) X(x) = Xp(T n-l x). or. Let (X. more simply.14 Tn-1 X E CHAPTER I. is called the (T. and the joint distributions can all be expressed directly in terms of the Kolmogorov partition and the shift transformation. 2) name of x. n> 1. The sequence { xa = X n (x): n > 1} defined for a point x E X by the formula Tn -l x E Pin . Of course. ) In summary. ). [4] = n7. the cylinder sets in the Kolmogorov representation.u. E. in the Kolmogorov representation a random point x is a sequence in 21°' or A z . the values. disjoint collection of measurable sets. or equivalently. (or the forward part of x in the two-sided case. that is. and the (T.) . cylinder sets and joint distributions may be expressed by the respective formulas. The (T. (In some situations. X a (x). the (T. Continuing in this manner. 2)-name of x is the same as x. BASIC CONCEPTS. The k-th order distribution Ak of the process {Xa } is given by the formula (4) (4) = u (nLiT -i+1 Pai ) the direct analogue of the sequence space formula. and measure preserving if it is measurable and if (T' B) = . (2). that is. tell to which set of the partition the corresponding member of the random orbit belongs. Then apply T to x to obtain Tx and let X2(x) be the label of the set in P to which Tx belongs. X 2 (x X3(x). A) be a probability space. is called the process defined by the transformation T and partition P.(B). P = {Pa : a E A} of X is a finite. the coordinate functions. The concept of stationary process is formulated in terms of the abstract concepts of measure-preserving transformation and partition..) Associated with the partition P = {Pa : a E Al is the random variable Xp defined by Xp(x) = a if x E Pa . and (2) = u([4]) = i (nLI T —i+1 Pa. B A partition E E. by (4). - (x). X — U n Pa is a null set. Pick x E X at random according to the A-distribution and let X 1 (x) be the label of the set in P to which x belongs. 2) process. The process {Xa : n > 1} defined by (3). Pa is the same as saying that x E T -n+ 1 Pa . countable partitions. 2)-process may also be described as follows. whose union has measure 1.„7—i+1 Pai . A mapping T: X X is said to be measurable if T -1 B E E. for all B E E. are useful. indexed by a finite set A. partitions into countably many sets. as follows.

in essence.SECTION 1.2. p) gives rise to a measurable function F: X 1-÷ B" which carries p onto the Kolmogorov measure y of the (T. The notation.1 (Ergodicity equivalents. equivalently. . The space X is Tdecomposable if it can be expressed as the disjoint union X = X1U X2 of two measurable invariant sets. This leads to the concept of ergodic transformation. denotes symmetric difference. The ergodic theory point of view starts with the transformation/partition concept.e. while modern probability theory starts with the sequence space concept. a stationary coding F: A z B z carrying it onto y determines a partition P = {Pb: b E BI of Az such that y is the Kolmogorov measure of the (TA. . where the domain can be taken to be any one of the LP-spaces or the space of measurable functions.2. hence to say that the space is indecomposable is to say that if T -1 B = B then p(B) is 0 or 1. THE ERGODIC THEORY MODEL. C AD = (C — D)U(D— C). The following lemma contains several equivalent formulations of the ergodicity condition. the natural object of study becomes the restriction of the transformation to sets that cannot be split into nontrivial invariant sets. Conversely. 2)-process. 2)-process concept is.e. In summary. 2)-process. Lemma 1. A measure-preserving transformation T is said to be ergodic if (5) 7-1 B = B = (B) = 0 or p(B) = 1.. B LB) = 0 = (B) = 0 or kt(B) = 1. In particular. i = 1. It is natural to study the orbits of a transformation by looking at its action on invariant sets.a Ergodic processes. i = 1. as a measure-preserving transformation T and partition P of an arbitrary probability space. See Exercise 2 and Exercise 3. where TB is the shift on 13'. 2. or. 15 The (T. in that a partition P of (X. The mapping F extends to a stationary coding to Bz in the case when X = A z and T is the shift. = (c) T -1 B D B. I. (b) T -1 B C B. The condition that TX C Xi. E. each of positive measure. = f .2. and UT denotes the operator on functions defined by (UT f)(x) = f (T x). The partition P is defined by Pb = {x: f (x) = 1) } .) The following are equivalent for a measure-preserving transformation T on a probability space. A measurable set B is said to be T-invariant if TB C B. just an abstract form of the stationary coding concept. It is standard practice to use the word "ergodic" to mean that the space is indecomposable. a. a. such that F(Tx) = TA F (x). a stationary process can be thought of as a shift-invariant measure on a sequence space. implies that f is constant. (a) T is ergodic. since once an orbit enters an invariant set it never leaves it. 2 translates into the statement that T -1 Xi = Xi. (d) (e) UT (B) = 0 or kt(B) = 1.i(B) = 0 or p(B) = 1.

2. if the shift in the Kolmogorov representation is ergodic relative to the Kolmogorov measure.2. that is.3 (The Baker's transformation.) A simple geometric example provides a transformation and partition for which the resulting process is the familiar coin-tossing process. Place the right rectangle on top of the left to obtain a square. then T -1 C = C and . The transformation T is called the Baker's transformation since its action can be described as follows. (t ± 1)/2) if S <1/2 it s> 1/2.2. 3.2 In the particular case when T is invertible.2.b Examples of ergodic processes. processes obtained from a block coding of an ergodic process by randomizing the start. Also. when T is one-to-one and for each measurable set C the set T C is measurable and has the same measure as C. t/2) (2s — 1. Squeeze each column down to height 1/2 and stretch it to width 1. • Tw • Tz Tw • • Tz Figure 1. but not all. the conditions for ergodicity can be expressed in terms of the action of T. Thus the concept of ergodic process. Example 1. Furthermore. As will be shown in Section 1. I. BASIC CONCEPTS. Let X = [0. Cut the unit square into two columns of equal width. 1) x [0. (See Figure 1.4.2.d. which is natural from the transformation point of view.4.u(B).4 The Baker's transformation. processes. rather than T -1 . 1) denote the unit square and define a transformation T by T (s. an invertible T is ergodic if and only if any T-invariant set has measure 0 or 1. Examples of ergodic processes include i. stationary codings of ergodic processes. In particular.) 1.2. and some. irreducible Markov chains and functions thereof. to say that a stationary process is ergodic is equivalent to saying that measures of cylinder sets can be determined by counting limiting relative frequencies along a sample path x. A stationary process is ergodic. Remark 1. a shift-invariant measure shift-invariant measures. These and other examples will now be discussed.i. for almost every x. .16 CHAPTER I. concatenated-block processes. Proof The equivalence of the first two follows from the fact that if T -1 B c B and C = n>0T -n B. is equivalent to an important probability concept. note that if T is invertible then UT is a unitary operator on L2 . t) = { (2s. The proofs of the other equivalences are left to the reader.u(C) = .

d. together with the fact that each of the two sets P0 . P (2) .P)-process is. if [t(Pa n Qb) = 12 (Pa)11 (Qb). defined inductively by APO ) = (vrP (i) ) y p(k). Let P = {Pa : a E A} be a partition of the probability space (X. TP . 2)-process is just the coin-tossing process. for each n > 1. In general.. t): s < 1/2}. Qb E Q} • The join AP(i) of a finite sequence P (1) . process ii. PO by setting (6) Po = {(s.6. process.2.Va. that the (T. of course.5 illustrates the partitions P. To assist in showing that the (T . P1 = {(s. t): s _?: 1/2}. that is. In particular. To obtain the coin-tossing process define the two-set partition P = {Po.2.5 The join of P. The distribution of P is the probability distribution ftt(Pa).) . 17 The Baker's transformation T preserves Lebesgue measure in the square. T -I P v P v T'P v T 2P T -1 P Figure 1.2.SECTION I. Two partitions P and Q are independent if the distribution of the restriction of P to each Qb E Q does not depend on b. a E Al. implies. symmetric. This is precisely the meaning of independence. . defined by the first-order distribution j 1 (a).i. b. . described as follows. some useful partition notation and terminology will be developed. Note that T 2P partitions each set of T -1 7. ii). THE ERGODIC THEORY MODEL.T 2P .. the i. along with the join of these partitions. TP. it can be shown that the partition TnP is independent of vg -1 PP . so that {T i P} is an independent sequence. E. are For the Baker's transformation and partition (6). The join. 1 0 P T -1 1)1 n Po n T Pi n T2 Po Ti' I I 1 0 1 0 T 2p 01101 I I . the binary. indeed. v P v TT into exactly the same proportions as it partitions the entire space X. T2P.2. Figure 1. and 7— '1'. a E A. (See Figure 1. that is. P(k) of partitions is their common refinement. the partition P v Q = {Pa n Qb: Pa E P . This.i. 7-1 P . P1 has measure 1/2. P v Q. The sequence of partitions {P (i) : i > 1} is said to be independent if P (k+ i) and v1P (i) independent for each k? 1.d. is given by a generalized Baker's transformation. i. of two partitions P and Q is their common refinement. for dyadic subsquares (which generate the Borel field) are mapped into rectangles of the same area.

2. called mixing. A stationary process is mixing if the shift is mixing for its Kolmogorov measure. for all sets D and positive integers n. in the transition matrix M determine whether a Markov process is ergodic.i(C) = .7 (Ergodicity for i. The corresponding Baker's transformation T preserves Lebesgue measure. Since the measure is product measure this means that p. .2. Example 1. process.6 The generalized Baker's transformation. measures satisfy the mixing condition. implies that p. 1 < < n. 2. processes.) The location of the zero entries. such that the column labeled a has width A i (a). the reader may refer to standard probability books or to the extensive discussion in [29] for details. j there is a sequence io.u(C) is 0 or 1.(T -Nc n D) = . The stochastic matrix M is said to be irreducible.u(C) 2 and hence . C. To show that the mixing condition (7) holds for the shift first note that it is enough to establish the condition for any two sets in a generating algebra. The partition P into columns defined in part (a) of the definition of T then produces the desired i. for if C = [a7] and N > 0 then T-N = A +1 bNA -i = a1. Suppose it is a product measure on A'. D E E. so that the mixing property. Example 1. • Figure 1. hence it is enough to show that it holds for any two cylinder sets. The results are summarized here.u(C). .) To show that a process is ergodic it is sometimes easier to verify a stronger property.d.u(D). Mixing clearly implies ergodicity.8 (Ergodicity for Markov chains. But this is easy.u(T -nC n D) = . .i. labeled by the letters of A. if for any pair i.2. in with io = i and in = j such that > 0. Stack the rectangles to obtain a square.i.froo lim .a(T-N C). n — 1. 1. Thus. for if T -1 C = C then T -ncnD=CnD. i 1 .d.i. BASIC CONCEPTS. 3. if D = [br] and N > m then T -N C and D depend on values of x i for indices i in disjoint sets of integers. if any. For each a E A squeeze the column labeled a down and stretch it to obtain a rectangle of height (a) and width 1. the choice D = C gives .a(D) = Thus i. 1.18 CHAPTER I. in = 0. Cut the unit square into I AI columns. (C nD) = Since this holds for all sets D. (7). A transformation is mixing if (7) n— .d.

To prove this it is enough to show that N (9) LlM oo C n D) = for any cylinder sets C and D.u(C) = This proves that irreducible Markov chains are ergodic.1 1 converges in the sense of Cesaro to (Ci). Furthermore. since . j of states. so that C must have measure 0 or 1. 19 This is just the assertion that for any pair i. Mc1. by (8). if the chain is at state i at some time then there is a positive probability that it will be in state j at some later time. mc. let D = [4] and C = [cri and write + T—k—n = [bnn ++ kk-Fr ]. for if T -1 C = C and D = C. hence for any measurable sets C and D.+i. where n-1 7r (di)11Md. THE ERGODIC THEORY MODEL. which establishes (9). then the limit formula (9) gives /. Thus when k > 0 the sets T -k-n C and D depend on different coordinates. This condition. An irreducible Markov chain is ergodic. where bn_f_k+i = ci. 1=. The sequence ML. to bn+k+1 yields the product p. . The condition 7rM = it shows that the Markov chain with start distribution Ai = it and transition matrix M is stationary. Summing over all the ways to get from d. each entry of it is positive and N (8) N—oc) liM N n=1 where each row of the k x k limit matrix P is equal to jr.1±kbn+k+i V = (n-Flc-1 i=n JJ Md. the limit of the averages of powers can be shown to exist for an arbitrary finite stochastic matrix M. +1 = 11 n+k+m-1 i=n+k-F1 Note that V.d. +1 = acin). In fact.c. is n ii n [bn 1) = ttacril i) A I . and hence .. which is weaker than the mixing condition (7).d.-1 (10) itt([d] k steps. In the irreducible case.([4] n [btni 1 -_ . implies ergodicity. If M is irreducible there is a unique probability vector it such that irM = 7r. 1 < i < m. there is only one probability vector IT such that 7rM = it and the limit matrix P has it all its rows equal to 7r. To establish (9) for an irreducible chain.t(C) = 1u(C) 2 .:Vinj) = uvw. which is the probability of transition from dn to bn+k+1 = Cl in equal to MI ci = [Mk ]d c „ the dn ci term in the k-th power of M..SECTION 1.2.

that the (T. . Indeed. To be consistent with the general concept of ergodic process as used in this book. In this case. so that. below. for suppose F: A z B z is a stationary encoder. and hence that v is the Kolmogorov measure of an ergodic process. A finite-state process is a stationary coding with window width 0 of a Markov chain. which means that transition from i to j in n steps occurs with positive probability. has all positive entries) then M is certainly irreducible. Example 1. then the set B = T -1 Pi U T -2 PJ U has positive probability and satisfies T -1 B D B. Proposition 1. Thus.2.9 A stationary Markov chain is ergodic if and only if its transition matrix is irreducible. Feller's book. 2)-process is ergodic for any finite partition P. In summary.11 In some probability books. Since v(C) = p. This follows easily from the definitions. for any state i. assuming that all states have positive probability. BASIC CONCEPTS.2.) A simple extension of the above argument shows that stationary coding also preserves the mixing property. and hence the chain is irreducible. if T is ergodic the (T. [11]. The argument used to prove (9) then shows that the shift T must be mixing. at least with the additional assumption being used in this book that every state has positive probability. if every state has positive probability and if Pi = {x: xi = j}.u(B n Pi ) > 0. It should be noted. But if it(B n Pi ) > 0. suppose it is the Kolmogorov measure of an ergodic A-valued process. if the chain is ergodic then the finite-state process will also be ergodic. for example. then .u(F-1 C) is 0 or 1.) Stationary coding preserves the ergodicity property. Likewise. Proposition 1. see Exercise 19. and suppose v = p.20 CHAPTER I.1.2. It is not important that the domain of F be the sequence space A z .u(B) = 1. 2)-process can be ergodic even though T is not.„ M N = P. a finite-state process for which the underlying Markov chain is mixing must itself be mixing.12 (Codings and ergodicity. The converse is also true. "ergodic" for Markov chains is equivalent to the condition that some power of the transition matrix be positive. see Example 1. again. A concatenated-block process is always ergodic. Remark 1. The converse is also true. however. Thus. Thus if the chain is ergodic then . any probability space will do. for some n.u(T -nP n Pi ) > 0. If some power of M is positive (that is. o F -1 is the Kolmogorov measure of the encoded B-valued process. "ergodic" for Markov chains will mean merely irreducible with "mixing Markov" reserved for the additional property that some power of the transition matrix has all positive entries. since it can be represented as a stationary coding of an irreducible Markov chain.2. the Cesaro limit theorem (8) can be strengthened to limN. If C is a shift-invariant subset of B z then F-1 C is a shift-invariant subset of Az so that .(F-1 C) it follows that v(C) is 0 or 1.10 A stationary Markov chain is mixing if and only if some power of its transition matrix has all positive entries. (See Exercise 5.11. The underlying .

) is said to be totally ergodic if every power TN is ergodic. that it holds on a family of sets that generates the a-algebra. depending on whether Tn . The (T...lx = x ED (n — 1)a belongs to P0 or P1 .) Since the condition that F(TA X) = TB F (x). which correspond to the upper and lower halves of the circle in the circle representation.1. respectively. applying a block code to a totally ergodic process and randomizing the start produces an ergodic process. for all x.1. but a stationary process can be constructed from the encoded process by randomizing the start.. The value X(x) = Xp(Tn -1 x) is then 0 or 1. P)-process {X n } is then described as follows. It is also called rotation.1. so that y is concentrated on the two sequences 000. THE ERGODIC THEORY MODEL. The measure -17 obtained by randomizing the start is.2. and 111 . the same as y. 1). however. (or rotation processes if the circle representation is used. The final randomized-start process may not be ergodic. In particular. If T is the shift on a sequence space then T is totally ergodic if and only if for each N. Example 1. N-block codings destroy stationarity. it follows that a stationary coding of a totally ergodic process must be totally ergodic. For example.SECTION 1.10. 1/2). implies that F (Ti x) = T: F(x). p is the stationary Markov measure with transition matrix M and start distribution 7r given. The following proposition is basic. As noted in Section 1. such processes are called translation processes. that is.2. The mapping T is called translation by a. even if the original process was ergodic. p. the nonoverlapping N-block process defined by Zn = (X (n-1)N-1-11 X (n-1)N+2.. The proof of this is left to the reader. • • • 9 Xn is ergodic. where ED indicates addition modulo 1. since X can be thought of as the unit circle by identifying x with the angle 27rx. A subinterval of the circle corresponds to a subinterval or the complement of a subinterval in X.. (The measure-preserving property is often established by proving.. hence the word "interval" can be used for subsets of X that are connected or whose complements are connected. 1) by the formula Tx = x ED a. let P consist of the two intervals P0 = [0.) Let a be a fixed real number and let T be defined on X = [0. 21 Markov chain is not generally mixing. A condition insuring ergodicity of the process obtained by N-block coding and randomizing the start is that the original process be ergodic relative to the N-shift T N . .. and 0101 . (See Example 1. C(10) = 11.12. defined by the 2-block code C(01) = 00.. however. in this case. for it has a periodic structure due to the blocking. hence preserves Lebesgue measure p. E.. as in this case.) As an example. up to a shift.13 (Rotation processes. P 1 = [1/2. A transformation T on a probability space (X. for all x and all N.. give measure 1/2 to each of the two sequences 1010. Since this periodic structure is inherited. let p. by m ro 1 L o j' = rl 1] i• Let y be the encoding of p. of the interval gives rise to a process. see Example 1. Pick a point x at random in the unit interval according to the uniform (Lebesgue) measure. hence is not an ergodic process. concatenated-block processes are not mixing except in special cases.) Any partition 7. The transformation T is one-to-one and maps intervals onto intervals of the same length. so that translation by a becomes rotation by 27ra..

14 continued.t(A n I) > (1 — 6) gives it (A) E —E) n Trig) (1 — 0(1 — 2E).10 to establish ergodicity of a large class of transformations.22 Proposition 1.u(A) = 1. E n=1 00. and hence the maximality of I implies that n I = 0. Given c > 0 choose an interval I such that /2(1) < E and /. The first proof is based on the following classical result of Kronecker. To proceed with the proof that T is ergodic if a is irrational. Furthermore. e2-n(x+a) a. CHAPTER I. for each x.14 T is ergodic if and only if a is irrational. The assumption that i.2. suppose A is an invariant set of positive measure. A generalization of this idea will be used in Section 1. The preceding proof is typical of proofs of ergodicity for many transformations defined on the unit interval. applied to the end points of I.) If a is irrational the forward orbit {Tnx: n > 1} is dense in X. Proposition 1. Proof It is left to the reader to show that T is not ergodic if a is rational. produces integers n1 <n2 < < n k such that {Tni / } is a disjoint sequence and Ek i=1 tt(Tni /) > 1 — 2E. {Tx: n > 1 } . Such a technique does not work in all cases. One first shows that orbits are dense. Then there is an interval I of positive length which is maximal with respect to the property of not meeting F. An alternate proof of ergodicity can be obtained using Fourier series.15 (Kronecker's theorem. Proof To establish Kronecker's theorem let F be the closure of the forward orbit. Two proofs of ergodicity will be given. BASIC CONCEPTS. The Fourier series of g(x) = f (T x) is E a. e2"ina e 27rinx . then that small disjoint intervals can be placed around each point in the orbit. but does establish ergodicity for many transformations of interest.t(A n I) > (1 — Kronecker's theorem.2. which completes the proof of Proposition 1. TI cannot be the same as I.. It follows that is a disjoint sequence and 00 1 = .u ( [0. 1)) a. Assume that a is irrational. and suppose that F is not dense in X.2. Proof of Proposition 1. i= 1 Thus . for each positive integer n. Suppose f is square integrable and has Fourier series E an e brinx . which is a contradiction. since a is irrational.2. This proves Kronecker's theorem.14.

2.. In general.16 The following are equivalent for the mapping T: (x. y El) p) (i) {Tn(x. Let T be a measure-preserving transformation on the probability space (X. n(x) is the time of first return to B.2. Furthermore. and. that is. Furthermore. e. Points in the top level of each column are moved by T to the base B in some manner unspecified by the picture. from which it follows that the sequence of sets {T B} is disjoint. (iii) a and 13 are rationally independent. alternatively. furthermore. (ii) T is ergodic. Furthermore. This proves Theorem 1. (Some extensions to the noninvertible case are outlined in the exercises. provides an alternative and often simpler view of two standard probability models.) Return to B is almost certain. Theorem 1.17 (The Poincare recurrence theorem. Proof To prove this. one above the other. In other words.. by reassembling within each level it can be assumed that T just moves points directly upwards one level. The return-time picture is obtained by representing the sets (11) as intervals. that is.. which means that a. . n > 2. If a is irrational 0 unless n = 0. the product of two intervals. the renewal processes and the regenerative processes. that is. (See Figure 1. (or. If x E B. To simplify further discussion it will be assumed in the remainder of this subsection that T is invertible. along with a suggestive picture. consider the torus T. almost surely. with addition mod 1. for B 1 = B n 7-1 B and Bn = (X — B) n B n T -n B. the product of two circles.o ) = 0. in the irrational case. that is. y) 1-÷ (x a. A general return-time concept has been developed in ergodic theory. T = [0. e. IL) and let B be a measurable subset of X of positive measure.c The return-time picture. translation of a compact group will be ergodic with respect to Haar measure if orbits are dense. and define = {X E B: Tx B. then an = an e 2lri" holds for each integer n. = 0.17.Vn > O.2.) The proof of the following is left to the reader. only the first set Bn in this sequence meets B. if x E 1100 then Tx g Bc. a. Since they all have the same measure as B and p(X) = 1.) For each n. For example. .18). the rotation T has no invariant functions and hence is ergodic. TBn C B. which. 1) x [0. THE ERGODIC THEORY MODEL. a. I.2. the sets (11) Bn . implies that the function f is constant. T n-1 Bn are disjoint and have the same measure. in turn. unless and an = an e 27' then e2lrin" n=0. which. 1). y). y)} is dense for any pair (x. n(x) < oc. 23 If g(x) = f(x).2. E. are measurable. These sets are clearly disjoint and have union B. let n(x) be the least positive integer n such that Tnx E B. for all n > 1.2.SECTION 1. TB. it follows that p(B. Proposition 1. for n > 1.. while applying T to the last set Tn -1 Bn returns it to B. define the sets Bn = E B: n(x) = n}.

24

CHAPTER I. BASIC CONCEPTS.

Tx

B1

B2

B3

B4

Figure 1.2.18 Return-time picture.
Note that the set Util=i Bn is just the union of the column whose base is Bn , so that the picture is a representation of the set (12) U7 =0 TB

This set is T-invariant and has positive measure. If T is ergodic then it must have measure 1, in which case, the picture represents the whole space, modulo a null set, of course. The picture suggests the following terminology. The ordered set

Bn , TB,

, T 11-1 13,

is called the column C = Cn with base Bn, width w(C) = kt(B,z ), height h(C) = n, levels L i = Ti-1 Bn , 1 < i < n, and top T'' B. Note that the measure of column C is just its width times its height, that is, p,(C) = h(C)w(C). Various quantities of interest can be easily expressed in terms of the return-time picture. For example, the return-time distribution is given by Prob(n(x) =nIxEB)— and the expected return time is given by w (Ca )

Em w(C, n

)

E(n(x)lx E B) =
(13)

h(C,i )Prob(n(x) =nix E B)

h(C)w(C)

=

m w(Cm )

In the ergodic case the latter takes the form

(14)

E(n(x)ix E B) =

1 Em w(Cm )

1

since En h(C)w(C) is the measure of the set (12), which equals 1, when T is ergodic. This formula is due to Kac, [23]. The transformation I' = TB defined on B by the formula fx = Tn(x)x is called the transformation induced by T on the subset B. The basic theorem about induced transformations is due to Kakutani.

SECTION I.2. THE ERGODIC THEORY MODEL.
-

25

Theorem 1.2.19 (The induced transformation theorem.) If p(B) > 0 the induced transformation preserves the conditional measure p(.IB) and is ergodic if T is ergodic. Proof If C c B, then x E f- 1C if and only if there is an n > 1 such that x Tnx E C. This translates into the equation
00

E

Bn and

F-1 C =U(B n n=1

n TAC),

which shows that T-1 C is measurable for any measurable C c B, and also that
/2(f—i
n

= E ,u(B,, n T'C),
n=1

00

since Bn n An = 0, if m 0 n, with m, n > 1. More is true, namely, (15)

T n Ba nT ni Bm = 0, m,n > 1, m n,

by the definition of "first return" and the assumption that T is invertible. But this implies 00 E,(T.Bo
n=1

= E p(Bn ) = AO),
n=1
,

co

which, together with (15) yields E eu(B, n T'C) = E iu(Tnii n n c) = u(C). Thus the induced transformation f preserves the conditional measure p.(.1B). The induced transformation is ergodic if the original transformation is ergodic. The picture in Figure 1.2.18 shows why this is so. If f --1 C, then each C n Bn can be pushed upwards along its column to obtain the set

c.

oo n-1
Ti(c n Bn)• D=UU n=1 i=0

But p,(T DAD) = 0 and T is invertible, so the measure of D must be 0 or 1, which, in turn, implies that kt(C1B) is 0 or 1, since p,((D n B),LC) = O. This completes the proof of the induced-transformation theorem. El

1.2.c.1 Processes associated with the return-time picture.
Several processes of interest are associated with the induced transformation and the return-time picture. It will be assumed throughout this discussion that T is an invertible, ergodic transformation on the probability space (X, E, p,), partitioned into a finite partition P = {Pa : a E A}; that B is a set of positive measure; that n(x) = minfn: Tnx E BI, x E B, is the return-time function; and that f is the transformation induced on B by T. Two simple processes are connected to returns to B. The first of these is {R,}, the (T, B)-process defined by the partition B = { B, X B}, with B labeled by 1 and X — B labeled by 0, in other words, the binary process defined by

Rn(x) = xB(T n-l x), n?: 1,

26

CHAPTER I. BASIC CONCEPTS.

where xB denotes the indicator function of B. The (T, 5)-process is called the generalized renewal process defined by T and B. This terminology comes from the classical definition of a (stationary) renewal process as a binary process in which the times between occurrences of l's are independent and identically distributed, with a finite expectation, the only difference being that now these times are not required to be i.i.d., but only stationary with finite expected return-time. The process {R,,} is a stationary coding of the (T, 2)-process with time-zero encoder XB, hence it is ergodic. Any ergodic, binary process which is not the all 0 process is, in fact, the generalized renewal process defined by some transformation T and set B; see Exercise 20. The second process connected to returns to B is {Ri } , the (7." , g)-process defined by the (countable) partition

{Bi, B2, • • •},
of B, where B, is the set of points in B whose first return to B occurs at time n. The process {k } is called the return-time process defined by T and B. It takes its values in the positive integers, and has finite expectation given by (14). Also, it is ergodic, since I' is ergodic. Later, it will be shown that any ergodic positive-integer valued process with finite expectation is the return-time process for some transformation and partition; see Theorem 1.2.24. The generalized renewal process {R n } and the return-time process {kJ } are connected, for conditioned on starting in B, the times between successive occurrences of l's in {Rn} are distributed according to the return-time process. In other words, if R 1 = 1 and the sequence {S1 } of successive returns to 1 is defined inductively by setting So = 1 and

= min{n >

Rn = 1},

j > 1,

then the sequence of random variables defined by

(16)

= Si — Sf _ 1 , j> 1

has the distribution of the return-time process. Thus, the return-time process is a function of the generalized renewal process. Except for a random shift, the generalized renewal process {R,,} is a function of the return-time process, for given a random sample path R = {k } for the return-time process, a random sample path R = {R n } for the generalized renewal process can be expressed as a concatenation (17) R = ii)(1)w(2)w(3) • • •

of blocks, where, for m > 1, w(m) = 01 _ 1 1, that is, a block of O's of length kn -1, followed by a 1. The initial block tt,(1) is a tail of the block w(1) = 0 /1-1-1 1. The only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the waiting time r until the first 1 occurs. The start position problem is solved by the return-time picture and the definition of the (T, 2)-process. The (T, 2)-process is defined by selecting x E X at random according to the it-measure, then setting X(x) = a if and only if Tx E Pa . In the ergodic case, the return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column C at random according to column measure

SECTION 1.2. THE ERGODIC THEORY MODEL.

27

then selecting j at random according to the uniform distribution on 11, 2, ... , h (C)}, then selecting x at random in the j-th level of C. In summary,

Theorem 1.2.20 The generalized renewal process {R,} and the return-time process {km } are connected via the successive-returns formula, (16), and the concatenation representation, (17) with initial block th(1) = OT-1 1, where r = R — j 1 and j is uniformly distributed on [1,
The distribution of r can be found by noting that

r (x) = jx E
from which it follows that
00

Prob(r = i) =E p.,(T 4-1 n=i

=

E w (C ) .
ni

The latter sum, however, is equal to Prob(x E B and n(x) > i), which gives the alternative form 1 Prob (r = i) = (18) Prob (n(x) ijx E B) , E(n(x)lx E B) since E(n(x)lx E B) = 1/ p.(B) holds for ergodic processes, by (13). The formula (18) was derived in [11, Section XIII.5] for renewal processes by using generating functions. As the preceding argument shows, it is a very general and quite simple result about ergodic binary processes with finite expected return time. The return-time process keeps track of returns to the base, but may lose information about what is happening outside B. Another process, called the induced process, carries along such information. Let A* be the set all finite sequences drawn from A and define the countable partition 13 = {P: w E A* } of B, where, for w =

=

=

E Bk: T i-1 X E

Pao 1 <j

kl.

The (?,)-process is called the process induced by the (T, P)-process on the set B. The relation between the (T, P)-process { X, } and its induced (T, f 5)-process {km } parallels the relationship (17) between the generalized renewal process and the returntime process. Thus, given a random sample path X = {W„,} for the induced process, a random sample path x = {X,} for the (T, P)-process can be expressed as a concatenation

(19)

x = tv(1)w(2)w(3) • • • ,

into blocks, where, for m > 1, w(m) = X m , and the initial block th(1) is a tail of the block w(1) = X i . Again, the only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the length of the tail fi)(1). A generalization of the return-time picture provides the answer. The generalized version of the return-time picture is obtained by further partitioning the columns of the return-time picture, Figure 1.2.18, into subcolumns according to the conditional distribution of (T, P)-names. Thus the column with base Bk is partitioned

28

CHAPTER I. BASIC CONCEPTS.

into subcolumns CL, labeled by k-length names w, so that the subcolumn corresponding to w = a k has width (ct, T - i+ 1 Pa n BO,
,

the ith-level being labeled by a,. Furthermore, by reassembling within each level of a subcolumn, it can again be assumed that T moves points directly upwards one level. Points in the top level of each subcolumn are moved by T to the base B in some manner unspecified by the picture. For example, in Figure 1.2.21, the fact that the third subcolumn over B3, which has levels labeled 1, 1, 0, is twice the width of the second subcolumn, indicates that given that a return occurs in three steps, it is twice as likely to visit 1, 1, 0 as it is to visit 1, 0, 0 in its next three steps.
o o I
0 1 1

o
i

o
i

I

o .

1

I I
1

I

1o 1

I I

I
0

I I 1

I 1 I I

I

o
0

I I

1

1Ii

I

10

I I 1 I

o o

I

1

11

B1

B2

B3

B4

Figure 1.2.21 The generalized return-time picture. The start position problem is solved by the generalized return-time picture and the definition of the (T, 2)-process. The (T, 2)-process is defined by selecting x E X at random according to the ii-measure, then setting X, (x) = a if and only if 7' 1 x E Pa . In the ergodic case, the generalized return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column Cu, at random according to column measure p.(Cu,), then selecting j at random according to the uniform distribution on {1, 2, ... , h(w), where h(w) = h(C), then selecting x'at random in the j-th level of C. This proves the following theorem.
Theorem 1.2.22 The process {X,..,} has the stationary concatenation representation (19) in terms of the induced process {54}, if and only if the initial block is th(1) = w(1)i h. (w(l)) , where j is uniformly distributed on [1, h(w(1))].

A special case of the induced process occurs when B is one of the sets of P, say B = Bb. In this case, the induced process outputs the blocks that occur between successive occurrences of the symbol b. In general case, however, knowledge of where the blocking occurs may require knowledge of the entire past and future, or may even be fully hidden.

I.2.c.2 The tower construction.
The return-time construction has an inverse construction, called the tower construction, described as follows. Let T be a measure-preserving mapping on the probability

The measure it is extended to a measure on X. This and other ways in which inducing and tower extensions are inverse operations are explored in the exercises. with points in the top of each column sent back to the base according to T. Let k defined by = {(x . 0 . the normalizing factor being n ( B n ) = E(f). 0) The transformation T is easily shown to preserve the measure Ti and to be invertible (and ergodic) if T is invertible (and ergodic). As an application of the tower construction it will be shown that return-time processes are really just stationary positive-integer valued processes with finite expected value. Tx B3 X {2} B2 X 111 B3 X Bi X {0} B2 X {0} B3 X {0} Figure 1. . and for each n > 1. n=1 the expected value of f. i < n. 1 < i f (x)}.. and let f be a measurable mapping from X into the positive integers . B x {2}.u({x: . The tower extension T induces the original transformation T on the base B = X x {0}. THE ERGODIC THEORY MODEL.}. n > 1. (T x . of finite expected value. i) E T8} n Bn E E. E 131 ) n 13. E (f ) n=1 i=1 A transformation T is obtained by mapping points upwards.23 The tower transformation. . as shown in Figure 1. with its measure given by ‘ Tt(k) = 2-. In other words. 29 space (X..23. p. A useful picture is obtained by defining Bn = {x: f (x) = n}. The transformation T is called the tower extension of T by the (height) function f.0 . 2.2. Bn x {n} (20) as a column in order of increasing i.). then normalizing to get a probability measure.SECTION 1. by thinking of each set in (20) as a copy of Bn with measure A(B). The formal definition is T (x. B C X is defined to be measurable if and only if (x.be the subset of the product space X x Ar Ar = { 1. i) = + 1) i < f (x) i = f (x).2.2. stacking the sets Bn x {1}. E.

XnN) be the nonoverlapping N-block process.2. positive-integer valued process {kJ} with E(R 1 ) < oc is the return-time process for some transformation and set. This proves the theorem. be the two-sided Kolmogorov measure on the space . onto the (two-sided) Kolmogorov measure y of the (T. (Recall that if Y. BASIC CONCEPTS. (Hint: consider g(x) = min{ f (x). The induced transformation and the tower construction can be used to prove the following theorem. the return-time process defined by T and B has the same distribution as {k1 }.d Exercises. Show that the (T. such that F(Tx) = TAF(x). Let Zn = ( X (n -1 )N+19 41-1)N+2. and let Yn = X(n-1)N+1 be the N-th term process defined by the (T. A nonnegative measurable function f is said to be T-subinvariant. and note that the shift T on Aîz is invertible and ergodic and the function f (x) =X has finite expectation. Show that there is a measurable mapping F: X 1-± A z which carries p. A stationary process {X. Proof Let p. P)-process. /2).) 5. onto v.1.24 An ergodic. = f (X n ). (a) Show that {Zn } is the (TN. Suppose F: A z F--> Bz is a stationary coding which carries p.n ): n < t}. Determine a partition P of A z such that y is the Kolmogorov measure of the (TA. Let T be a measure-preserving transformation.) 3. for all n. Let T be an invertible measure-preserving transformation and P = {Pa : a E Al a finite partition of the probability space (X. see Exercise 24. a stationary coding gives rise to a partition. 1) at time f (x). E. then use Exercise 2. 2.2. Theorem 1. then {Yn } is an instantaneous function of {Xn}.Ar z of doubly-infinite positive-integer valued sequences. Complete the proof of Lemma 1. (Hint: start with a suitable nonergodic Markov chain and lump states to produce an ergodic chain. Let T be the tower extension of T by the function f and let B = x x {1}. A point (x. let Wn = (Xn. a partition gives rise to a stationary coding. n .2.) 6.25 An ergodic process is regenerative if and only if it is an instantaneous function of an induced process whose return-time process is I. . R. .2. 7. fkil defines a stationary.} is regenerative if there is a renewal process {R. given that Rt = 1. Ra ):n > t} is independent of the past joint process {(X. P)-process can be mixing even if T is only ergodic. (Thus. that is.) Theorem 1. P)-process can be ergodic even if T is not. with the distribution of the waiting time r until the first I occurs given by (18).30 CHAPTER I. P-IP)-process. if f(Tx) < f (x). Show that the (T. where TA is the shift on A z .} such that the future joint process {(Xn. (Thus. 1) returns to B = X x L1} at the point (T x . Xn+19 • • • Xn+N-1) be the overlapping N-block process. P)-process {Xn}.) 4. In particular. Show that a T-subinvariant function is almost-surely invariant. P)-process. binary process via the concatenation-representation (17). 1. N}.

u be the product measure on X defined by p. Show directly. (a) Show that S preserves v. 15. p. 1). T ry) is measurable. = a if I'm = (ar . Generalize the preceding exercise by showing that if T is a measure-preserving transformation on a probability space (X. Give a formula expressing the (S. Let {Z. for each br E AN. for every x. Q)-name of (x. 11. 'P)-process. if Y„. N}.. 13. (b) Show that {147. (iii) (ar . y).. 1. Let T be the shift on X = {-1. that is. Show that the overlapping n-blocking of an ergodic process is ergodic. (i) If i < N. .n } be the stationary Markov chain with alphabet S = A N x {0. j). = (ar . 16. Show that if T is mixing then it is totally ergodic.u(br). defined by the following transition rules. 2. y) i-± (Tx. Txoy). X x X. a symbol a E A N . on AN. is a mixing finite-state process.(-1) = p. Show that an ergodic rotation is not mixing. for each br E AN. (a) Show that there is indeed a unique stationary Markov chain with these properties. Q)-process is called "random walk with random scenery.SECTION 1. y) = (Tx. y) = (Tx. then S(x . and number p E (0.2. } is the (T. y ± a) is not ergodic. . = R. (The (S. Show that a concatenated-block process is ergodic. . N) goes to (br . y) in terms of the coordinates of x and y. i 1). 17. then S is called the direct product of T and R. Show that the nonoverlapping n-blocking of an ergodic process may fail to be ergodic. 31 8. If T. Let Y=Xx X. Show that if T is mixing then so is the direct product T x T: X x X where X x X is given the measure . (b) Let P be the time-zero partition of X and let Q = P x P. i) can only go to (ar . without using Proposition 6. j > 0.. Fix a measure p.) and {Tx : x c X) is a family of measure-preserving transformations on a probability space (Y. 12. 10.(1) = 1/2. N) goes to (br .") 9. THE ERGODIC THEORY MODEL. (ar . 0) with probability p ite(br ). that if a is irrational then the direct product (x.u x . such that the mapping (x. x y). (ii) (ar. 1} Z . but is totally ergodic. and let . Insertion of random spacers between blocks can be used to change a concatenatedblock process into a mixing process. S is called a skew product. 14. (Hint: the transition matrix is irreducible. y) 1-± (x a. v7P -1 P)-process. x and define S(x. Try) is a measure-preserving mapping on the product space (X x Y. with the product measure y = p.) (b) Show that the process {} defined by setting 4. 0) and Ym = ai .u. 1) with probability (1 — p). (c) Show that {Y } is the (T N .

then P and 2 are 3E-independent.18 for T -1 . Define P = {Pa : a E AI and 2 = {Qb: b Ea. 25. then use this to guide a proof. Show that the tower defined by the induced transformation function n(x) is isomorphic to T. Prove: a tower i'' over T induces on the base a transformation isomorphic to T.i F -1 D. and is ergodic if T is ergodic. . Prove that even if T is not invertible.) 19. Show that if {X.) 21. the induced transformation is measurable. Prove that a stationary coding of a mixing process is mixing.} is a binary. Let T be an invertible measure-preserving transformation and P = {Pa : a E A} a finite partition of the probability space (X.25. ? I' and return-time 23. except for a set of Qb's of total measure at most . E BI to be 6-independent if (a) Show that p and 2 are independent if and only if they are 6-independent for each 6 > 0.2. then it is the generalized renewal process for some transformation T and set B.(Qb)1 .b lit(Pa n Qb) — it(Pa)P. BASIC CONCEPTS. d)-process.g.?. (Hint: obtain a picture like Figure 1. (Hint: use the formula F-1 (C n T ' D) = F-1 C n TA. E. (Hint: let T be the shift in the two-sided Kolmogorov representation and take B = Ix: xo = 1 1. P)-process is an instantaneous coding of the induced (. except for a set of Qb's of total measure at most 6. Let X be the tower over X defined by f._ E. ergodic process which is not identically 0. pt). (b) Show that if P and 2 are 6-independent then Ea iii(Pa I Qb) — 11 (Pa)i < Ag. and let S = T be the tower transformation. such that the (T. Prove Theorem 1. 18. 24. Show how to extend P to a partition 2 of X. (c) Show that if Ea lia(PaiQb) — ii(Pa)i < E. 22.2.) 20.32 CHAPTER I. preserves the conditional measure.

K]. A strong cover C of Af is defined by an integer-valued function n m(n) for which m(n) > n. The ergodic theorem extends the strong law of large numbers from i.a Packings from coverings.This binary version of the ergodic theorem is sufficient for many results in this book. just set C' = {[n.3. it may not be possible. I. to pack an initial segment [1. however.3. for it is an important tool in its own right and will be used frequently in later parts of this book. The proof of the ergodic theorem will be based on a rather simple combinatorial result discussed in the next subsection.2 (Ergodic theorem: binary ergodic form. Birkhoff and is often called Birkhoff's ergodic theorem or the individual ergodic theorem. n E H.) and if f is integrable then the average (1/n) EL. In this discussion intervals are subsets of the natural numbers Ar = {1. 33 Section 1..SECTION 1.) If T is a measure-preserving transformation on a probability space (X.E f f(T i-l x)d. m] = (j E n < j < m).. The finite problem has a different character. Thus if T is ergodic then the limit function f*. unless the function m(n) is severely restricted. K] by disjoint subcollections of a given strong cover C of the natural numbers. m(n)]}..) A strong cover C has a packing property. namely.u = f f -E n n i=1 by the measure-preserving property of T. but the more general version will also be needed and is quite useful in many situations not treated in this book. The combinatorial idea is not a merely step on the way to the ergodic theorem. Theorem 1. THE ERGODIC THEOREM. must be almost-surely equal to the constant value f f d.3. 2. there is a subcover C' whose members are disjoint.3 The ergodic theorem. even asymptotically. since it is T-invariant. then a positive and useful result is possible. In particular.}. .) If {X n } is a binary ergodic process then the average (X 1 +X2+. This is a trivial observation. .+Xn )In converges almost surely to the constant value E(X 1 ). E. . where n1 = 1 and n i±i = 1 + m(n). m(n)]. The ergodic theorem in the almost-sure form presented here is due to G.. p. however. Theorem 1. i > 1. D. A general technique for extracting "almost packings" of integer intervals from certain kinds of "coverings" of the natural numbers will be described in this subsection.3. for. The L'-convergence implies that f f* d 1a = f f d ia.1-valued and T to be ergodic the following form of the ergodic theorem is obtained. if f is taken to be 0.u.u = 1. since f1 n f(T i-l x) d.1 (The ergodic theorem. If it is only required. f (T i-1 x) converges almost surely and in L 1 -norm to a T -invariant function f*(x)..d. there be a disjoint subcollection that fills most of [1. of the form [n. processes to the general class of stationary processes.i. (The word "strong" is used to indicate that every natural number is required to be the left endpoint of a member of the cover. and consists of all intervals of the form [n.

The desired positive result is just an extension of this idea to the case when most of the intervals have length bounded by some L <8 K. and hence m(n i ) < K. The construction of the (1 — 26)-packing will proceed sequentially from left to right. Let C be a strong cover of the natural numbers N. K].3. K]. This completes the proof of the packing lemma. so that I(K —L.) Let C be a strong cover of N and let S > 0 be given. since m(n i )—n i +1 < L. The construction stops after I = I (C. K] is called a (1 —6)-packing of [1. 6)-strong-cover assumption. If K > LIS and if [1. The interval [1. start from the left and select successive disjoint intervals. K]. K — L]-1. the definition of the n i implies the following fact. . The claim is that C' = {[n i . This produces the disjoint collection C' = {[iL +1. K] is said to be (L.K]—U]I< 6K. if 6 > 0 is given and K > LIS then all but at most a 6-fraction is covered. (1). Lemma 1. thus guarantees that 1[1. stopping when within L of the end. K — L]. that is. K]. K]: m(n) — n +1 > L}1 < 6. and are contained in [1. The (L. Proof By hypothesis K > LIS and (1) iin E K]: m(n) — n +1 > L11 < SK. K — L] —Li then m(j) — j + 1 > L. and apply a sequential greedy algorithm.3 (The packing lemma. by induction. A collection of subintervals C' of the interval [1. K] has length at most L — 1 < SK. 8)-strongly-covered by C. To motivate the positive result. by construction. The intervals are disjoint. BASIC CONCEPTS. To carry this out rigorously set m(0) = no = 0 and.ll< 6K.(i +1)L]: 0 < i < (K — L)IL} which covers all but at most the final L — 1 members of [1.m(n i )]: 1 < i < I} is a (1 — 26)-packing of [1. For the interval [1.1 of the [ni . K) steps if m(n i ) > K — L or there is no j E[1 ±m(ni). K — L]: m(j) — j +1 < L}. K — L] for which m(j) — j + 1 < L. say L. The interval (K — L. Thus it is only necessary to show that the union 1. (2) If j E [I. suppose all the intervals of C have the same length.34 CHAPTER I. then there is a subcollection C' c C which is a (1 — 26)packing of [1. K] if the intervals in C' are disjoint and their union has cardinality at least (1 — . In particular. K] is (L. define n i = minfj E [1+ m(n i _i).m(n i )] has length at least (1 — 25)K. stopping when within L of the end of [1. K]. 6)-strongly-covered by C if 1ln E [1. selecting the first interval of length no more than L that is disjoint from the previous selections.

Since lim sup n—>oo 1 +E xi + + xn = =sup n—>oo x2 + . however.) Now suppose x E B. each interval in C(x) has the property that the average of the xi over that interval is too big by a fixed amount.SECTION 1. K] must also be too big by a somewhat smaller fixed amount. so it is not easy to see what happens to the average over a fixed. These intervals overlap. Then either the limit superior of the averages is too large on a set of positive measure or the limit inferior is too small on a set of positive measure. A proof of the ergodic theorem for the binary ergodic process case will be given first. Since B is T-invariant. which is a contradiction.s. THE ERGODIC THEOREM. 1}. Furthermore.3. is invariant and ergodic with respect to the shift T on {0. i < K are contained in disjoint intervals over which the average is too big.A'.u.u. Some extensions of the packing lemma will be mentioned in the exercises. (Here is where the ergodicity assumption is used. including the description (2) of those indices that are neither within L of the end of the interval nor are contained in sets of the packing. most of the terms xi .(1) + E. . then the average over the entire interval [1. large interval. • + Xn+1 the set B is T-invariant and therefore p. n—*øo n i=1 i n where A(1) = A{x: x 1 = 1} = E(X1).(B) = 1.3.. for each integer n there will be a first integer m(n) > n such that X n Xn+1 + • • • + Xm(n) m(n) — n + 1 > p. The packing lemma can be used to reduce the problem to a nonoverlapping interval problem. a. 35 Remark 1. But. Thus the collection C(x) = lln. if the average over most such disjoint blocks is too big.3.a(1).b The binary. Since this occurs with high probability it implies that the expected value of the average over the entire set A K must be too big. as it illustrates the ideas in simplest form. ergodic process proof.3. but in the proof of the general ergodic theorem use will be made of the explicit construction given in the proof of Lemma 1. In this case . and hence there is an E > 0 such that the set {n B = x : lirn sup — E xi > . I.3. Without loss of generality the first of these possibilities can be assumed. Suppose (3) is false. for. m(n)]: n E JVI is a (random) strong cover of the natural numbers . it will imply that if K is large enough then with high probability. and the goal is to show that (3) lirn — x i = .4 In most applications of the packing lemma all that matters is the result. combined with a simple observation about almost-surely finite variables. ( 1 ) n-+oo n i=i has positive measure.

and since the intervals in C'(x) are disjoint. The Markov inequality implies that for each K. Note that while the collection C(x) and subcollection C'(x) both depend on x. since the random variable m(1) is almost-surely finite.u). K]. Thus.23)-packing of [1. so that 1 K p.8. Thus given 3 > 0 there is a number L such that if D = {x: m(1) > L}.23)(1 6)[. Lemma 1. it will be used to obtain the general theorem. E L i (X. and thereby the proof of Theorem 1. by assumption.36 CHAPTER I. The following lemma generalizes the essence of what was actually proved. are nonnegative. . .a).c The proof in the general case. J=1 which cannot be true for all S. since T is measure preserving. n i=i Then fB f (x) d(x) > a g(B). that is. the sum over the intervals in C'(x) lower bounds the sum over the entire interval. note that the function g K(x) = i=1 1 IC where x denotes the indicator function. The definitions of D and GK imply that if x E GK then C(x) is an (L. E. 1(x) m(n) (4) j=1 x. the binary.3.u(D) < 6 2 . has integral .3.K]. taking expected values yields (K E iE =1 •x > (1 . Second. This completes the proof of (3). First.u(D) <62. ergodic process form of the ergodic theorem. BASIC CONCEPTS. (1) = E (—Ex . Since the xi. The preceding argument will now be extended to obtain the general ergodic theorem. as long as x E GK. Thus the packing lemma implies that if K > LI 8 and x E GK then there is a subcollection Cf (x) = {[ni. A bit of preparation is needed before the packing lemma can be applied. I.(1) + c].m(ni)]: i < I (x)} C C(X) which is a (1 . 3)-strongcovering of [1. the set GK = {X: gic(x) < 3} has measure at least 1 . E. the lower bound is independent of x.a(1) + c].3.E f (T i-1 x) > al.2.) > (1 . let f let a be an arbitrary real number. and define the set 1 n B = {x: lim sup .26)K ku(1) + El u(GK) > (1 . > i=1 j_—n 1 (l _ 28)K [p.26)(1 (5)K [p. . then .(1)+ c] .5 Let T be a measure-preserving transformation on (X. it is bounded except for a set of small probability.

so it can be assumed that . does give 1 v-IC s f (x) dia(x1B) = (7) a' B i=i II -I.(x1B) > a.i(B) > O.1q—U[n. THE ERGODIC THEOREM. where f is allowed to be unbounded.)] f (T x) is the sum of f(Ti -l x) over the indices that do not belong to the [ne. 37 Proof Note that in the special case where the process is ergodic. a bit more care is needed to control the effect of R(K. Furthermore. m(n)]: n E AO is a (random) strong cover of the natural numbers Js/. For x E B and n E A1 there is a least integer m(n) > n such that (5) Em i= (n n ) f (T i-1 m(n)—n+1 > a.u(. together with the lower bound (6).12 + 13. E jE[1. (Keep in mind that the collection {[ni. Thus. given 3 > 0 there is an L such that if D= E B: m(1) > L } then . if K > LIB then the packing lemma can be applied and the argument used to prove (4) yields a lower bound of the form E f( Ti-lx ) j=1 (6) where R(K .m(n i )] intervals.3. the sum R(K . the collection C(x) = fin. (1 — 8)a A(G KIB) . this lemma is essentially what was just proved. where f is the indicator function of a set C and a = p(C) c. The set B is T-invariant and the restriction of T to it preserves the conditional measure . 1B). Since B is T-invariant.SECTION 1. x) + E E f (T i-1 x) i=1 j=ni 1(x) m(ni) R(K . Only a bit more is needed to handle the general case. + (1 — 28)K a.u(D1B) <82 and for every K the set 1 G K = {X E B: 1 -7 i=1 E xD(Ti-ix) K 6} has conditional measure at least 1 — B. An integration. x) as well as the effect of integrating over GK in place of X.m(n. x) = > R(K . x) is bounded from below by —2M KS and the earlier argument produces the desired conclusion.5 is true in the ergodic case for bounded functions.. say 1f(x)1 < M.) In the bounded case. As before.3. The same argument can be used to show that Lemma 1. though integrable. The lemma is clearly true if p(B) = 0. In the unbounded case.m(ni)]} depends on x. it is enough to prove fB f (x) c 1 ii.

for any fixed L.3.K] and hence the measure-preserving property of T yields 1131 B 1 fl(x) dp. f The measure-preserving property of T gives f3-Gic f (T . which is upper bounded by 8.B xD(T' . Thus 1121 5_ -17 1 f v—N K .)] E 13= (T i-1 x) d. all of the terms in 11. which is small if K is large enough.k. E GK jE[1. 1 ___ K f (Ti -1 x) c 1 ii. and hence 11 itself. see (7). recall from the proof of the packing lemma.B jE(K—L.(xim.M(ri. K — L] — m(ni)] then m(j) — j 1 > L.)] GK — K 1 jE(K—L.K]—U[ni. K fB—Gic j=1 12 = f(Ti -l x) dp. The integral 13 is also easy to bound for it satisfies 1 f dp.(x1B).K— L] —U[ni.(x1B) > a.(xiB). and the measure-preserving property of T yields the bound 1121 5_ which is small if L is large enough. BASIC CONCEPTS. will be small if 8 is small enough. .m(n. . In summary. K — L] — U[ni . 1131 15.x)1 f (P -1 x)I dwxiB). see (2).38 where CHAPTER I. which completes the proof of Lemma 1. To bound 12. so that passage to the appropriate limit shows that indeed fB f (x) dp. In the current setting this translates into the statement that if j E [1. in the inequality f D1 f (x)1 clia(x1B). the final three terms can all be made arbitrarily small.(x1B). m(n)] then Ti -l x E D.5./ x) clitt(xIB) = fT_J(B-GK) f(x) Since f E L I and since all the sets T -i(B — GK) have the same measure. L f(x) d ii(x B) ?. (1 — kt(G + 11 + ± 13.u(xIB). that if j E [1.

The reader is referred to Krengel's book for historical commentary and numerous other ergodic theorems.3. The operator UT is linear and is an isometry because T is measure preserving. it follows that f* E L I and that A n f converges in L'-norm to f*. 39 Proof of the Birkhoff ergodic theorem: general case.u(C)< Jc f d which can only be true if ti(C) = O.5. but it is much closer in spirit to the general partitioning. almost surely. But if f. To do this define the operator UT on L I by the formula. [32]. There are several different proofs of the maximal theorem. as was just shown. g E L I then for all m.o0 n i=1 { E E then fi. Thus the averaging operator An f 1 E n n i=i T is linear and is a contraction.SECTION 1.6 Birkhoff's proof was based on a proof of Lemma 1. A n f — Am f 5- g)11 + ii Ag — Amg ii 2 11 f — gil + 11Ang — A m g iI — g) — A ni (f — The L . This completes the proof of the Birkhoff ergodic theorem.3. Other proofs that were stimulated by [51] are given in [25.3. then A n f (x)I < M. say I f(x)I < M. its application to prove the ergodic theorem as given here was developed by this author in 1980 and published much later in [68]. that is.3. The packing lemma is a simpler version of a packing idea used to establish entropy theorems for amenable groups in [51]. f follows immediately from this inequality. as well as von Neumann's elegant functional analytic proof of L 2-convergence. f EL I . . the dominated convergence theorem implies L' convergence of A n f for bounded functions. covering. Most subsequent proofs of the ergodic theorem reduce it to a stronger result. To prove almost-sure convergence first note that Lemma 1. Since. 1=1 Remark 1. including the short. THE ERGODIC THEOREM. n. called the maximal ergodic theorem. whose statement is obtained by replacing lirn sup by sup in Lemma 1. This proves almost sure convergence. almost surely. that is. Moreover.3. UT f = if Ii where 11f II = f Ifl diu is the L'-norm. elegant proof due to Garsia that appears in many books on ergodic theory. Since almost-sure convergence holds. and packing ideas which have been used recently in ergodic and information theory and which will be used frequently in this book. and since the average An f converges almost surely to a limit function f*. Thus if 1 n 1 n f (T i x) C = x: lim inf — f (T i x) <a < fi < lim sup — n-400 n i=1 n—. if f is a bounded function. Ag converges in L'-norm. UT f (x) = f (Tx). The proof given here is less elegant. 27]. for g can be taken to be a bounded approximation to f. xE X.5 remains true for any T-invariant subset of B. To complete the discussion of the ergodic theorem L' -convergence must be established.-convergence of A.5. Proofs of the maximal theorem and Kingman's subadditive ergodic theorem which use ideas from the packing lemma will be sketched in the exercises. IA f II If Il f E L I .

BASIC CONCEPTS. which is a common requirement in probability theory.Af U ool. Proof Since r (x) is almost-surely finite. Thus almost-sure finiteness guarantees that.ix)< 6/ 41 G . almost surely. so that if 1 n 1=1 x D (T i-1 x) d p(x) = .r(Tn. nor that the set {x: r(x) = nl be measurable with respect to the first n coordinates.40 CHAPTER I. Let D be the set of all x for which r(x) > L. I.-almost-surely finite if . starting at n. The given proof of the ergodic theorem was based on the purely combinatorial packing lemma. Now that the ergodic theorem is available. The stopping time idea was used implicitly in the proof of the ergodic theorem. . Gn = then x E ix 1 x---11 s :— n 2x D (Ti .n] is (1 — 3)-packed by intervals from C(x. lim almost surely. Lemma 1. By the ergodic theorem. .u(D) <6/4. for an almost-surely finite stopping time.u) with values in the extended natural numbers . A stopping time r is p. see (5). for the first time m that rin f (Ti -1 x) > ma is a stopping time.3. Suppose it is a stationary measure on A'. the stopping time for each shift 7' 1 x is finite.d Extensions of the packing lemma. hence. r). x) such that if n > N then [1. the packing lemma and its relatives can be reformulated in a more convenient way as lemmas about stopping times. there is an integer L such that p({x: r(x) > LI) < 6/4. however. Frequent use will be made of the following almost-sure stopping-time packing result for ergodic processes. the function r(Tn -l x) is finite for every n.3. In particular. the collection C = C(x. E. a (random) strong cover of the natural numbers A.7 (The ergodic stopping-time packing lemma. for almost every x. Several later results will be proved using extensions of the packing lemma. The interval [n. it is bounded except for a set of small measure. Note that the concept of stopping time used here does not require that X be a sequence space. given 8 > 0. A (generalized) stopping time is a measurable function r defined on the probability space (X.u ({x: r(x) = °o }) = O. r) = {[n.) If r is an almost-surely finite stopping time for an ergodic process t then for any > 0 and almost every x there is an N = N (3. eventually almost surely. t(Tn -1 x) n — 1]: n E of stopping intervals is. An almost-surely finite stopping time r has the property that for almost every x.lx)-F n — 1] is called the stopping interval for x.

K] of total length 13 K . Let p. Show that this need not be true if some intervals are not separated by at least one integer. 5. n] is said to be separated if there is at least one integer between each interval in C'.) 2. Suppose each [ui .] has length m. Prove the two-packings lemma.3. there is an integer N= x) such that if n > N then [1. there is an N = N (x. This proves the lemma. r) to be the set of all intervals [n. and each [s1 . vd: i E I} be a disjoint collection of subintervals of [1. Formulate the partial-packing lemma as a strictly combinatorial lemma about the natural numbers. be an ergodic measure on A. 1 — 3/2)-packing of [1. But this means that [1. r). r) such that if n > N. a partial-packing result that holds in the case when the stopping time is assumed to be finite only on a set of positive measure. Show that a separated (1 — S)-packing of [1. The definition of G. 41 Suppose x E G„ and n > 4L/3. ti]. n] has a y -packing C' c C(x. n] is determined by its complement.3.e Exercises. THE ERGODIC THEOREM. Then the disjoint collection {[s i . I. then [1. yi] meets at least one of the [s i . . ti ] has length at least M. These are stated as the next two lemmas. ti ].) Let {[ui. r) Lemma 1.) . v. for which Tn -l x E F(r). and let r be an almost-surely finite stopping time which is almost-surely bounded below by a positive integer M > 1/8. (Hint: use the fact that the cardinality of J is at most K I M to estimate the total length of those [u. yi ] that meet the boundary of some [si.3. Two variations on these general ideas will also be useful later.8 (The partial-packing lemma. 1. n] by a subset of C(x. be an ergodic measure on A'. For almost every x. and let {[si . 3. let r be a stopping time such that .9 (The two-packings lemma. put F(r) = {x: r(x) < ocl. K] of total length aK.SECTION 1. Show that for almost every x. with the proofs left as exercises. n — L]. where M > m13. A (1 — 3)-packing C' of [1. and a two-packings variation.3.u(F(r)) > y > O. ti ]: j E J } U flub i E I*1 has total length at least (a + p . Lemma 1. n] is (L. ti]: j E J} be another disjoint collection of subintervals of [1. and define C(x. The packing lemma produces a (L. For a stopping time r which is not almost-surely infinite. Prove the partial-packing lemma. r(r-1 n — 1]. (Hint: define F'(x) = r(x) + 1 and apply the ergodic stopping-time lemma to F.23)K. r). r). n] and hence for at least (1 — 812)n indices i in the interval [1.) Let p. n] has a separated (1 — 28)-packing C' c C(x. Let I* be the set of indices i E I such that [ui. 1 — 3/2)-stronglycovered by C(x. implies that r(T -(' -1) x) < L for at least (1 — 314)n indices i in the interval [1. 4.

continuing until within N of L. then from r(x) to r(T) — 1. x E f (T i x)xE(T i x) i=o — (N — fII. Assume T is invertible so that UT is unitary.42 CHAPTER I. show that (Hint: let r(x) = n. function.) (c) Deduce von Neumann's theorem from the two preceding results. (b) Show that . Show that if r is an almost-surely finite stopping time for a stationary process p. (e) Extend the preceding arguments to the case when it is only assumed that T is measure preserving. (a) Show that the theorem is true for f E F = If E L 2 : UT f = f} and for f E .} is a sequence of integrable functions such that gn+. as n 7. show that h f da a. Fix N and let E = Un <N Un . if f E LI . Show that if ti is ergodic and if f is a nonnegative measurable. Assume T is measure preserving and f is integrable.Steele. 8) is the set of all x such that [1. n ] is (1 — (5)-packed by intervals from oc.) (c) Show that the theorem holds for bounded functions. [33]. r). 8)) 1. BASIC CONCEPTS. (e) For B = {x: sup E71-01 f (T i x) > a}.. (Hint: first show that g(x) = liminf g n (x)In is an invariant function. The mean ergodic theorem of von Neumann asserts that (1/n) EL I verges in L2-norm. The ideas of the following proof are due to M. (Hint: show that its orthogonal complement is 0.Jones. then apply packing. (d) Show that the theorem holds for integrable functions. then the averages (1/N) E7 f con8. x 0 E. Kingman's subadditive ergodic theorem asserts that gn (x)/ n converges almost surely to an invariant function g(x) > —oc. 6.) . Assume T is measure preserving and { g.A4 = (I — UT)L 2 . show that r(x)-1 E i=o f (T i x)xE(T i x) ?_ 0.) (b) For bounded f and L > N. 9.F+M is dense in L 2 . but not integrable f (T i-1 x) converge almost surely to oc. (a) Prove the theorem under the additional assumption that gn < 0. N. The ideas of the following proof are due to R. and for each n. and G n (r .(x) < g(x) g ni (Tn x).u(B). r(x) = 1. (Hint: sum from 0 to r(x) — 1. [80]. C(x.(Gn (r. let Un be the set of x's for whichEL -01 f (T i x) > 0. for some r(x) < E Un . (a) For bounded f. (d) Deduce that L' -convergenceholds. then ii. One form of the maximal ergodic theorem asserts that JE f (x) di(x) > 0. 10. for any L 2 function f.

(b) Prove the theorem by reducing it to the case when g_n ≤ 0. (Hint: let ĝ_n(x) = g_n(x) − Σ_{i=0}^{n−1} g_1(T^i x).)

(c) Show that if a = inf_n E(g_n)/n > −∞ and a = ∫ g dμ, then convergence is also in L^1-norm.

(d) Show that the same conclusions hold if g_n ≤ 0 and g_{n+g+m}(x) ≤ g_n(x) + g_m(T^{n+g} x) + γ_g, where γ_g → 0 as g → ∞.

Section 1.4 Frequencies of finite blocks.

An important consequence of the ergodic theorem for stationary finite-alphabet processes is that relative frequencies of overlapping k-blocks in n-blocks converge almost surely as n → ∞. In the ergodic case, limiting relative frequencies converge almost surely to the corresponding probabilities, a property characterizing ergodic processes. In the nonergodic case, the existence of limiting frequencies allows a stationary process to be represented as an average of ergodic processes. These and related ideas will be discussed in this section.

I.4.a Frequencies for ergodic processes.

The frequency of the block a_1^k in the sequence x_1^n is defined for n ≥ k by the formula

    f(a_1^k | x_1^n) = Σ_{i=1}^{n−k+1} χ_{[a_1^k]}(T^{i−1} x),

where χ_{[a_1^k]} denotes the indicator function of the cylinder set [a_1^k]. The frequency can also be expressed in the alternate form

    f(a_1^k | x_1^n) = |{i ∈ [1, n − k + 1]: x_i^{i+k−1} = a_1^k}|,

where | · | denotes cardinality. Thus, f(a_1^k | x_1^n) is obtained by sliding a window of length k along x_1^n and counting the number of times a_1^k is seen in the window. The relative frequency is defined by dividing the frequency by the maximum possible number of occurrences, n − k + 1, to obtain

    p_k(a_1^k | x_1^n) = f(a_1^k | x_1^n) / (n − k + 1).

If x and k ≤ n are fixed the relative frequency defines a measure p_k(· | x_1^n) on A^k, called the empirical distribution of overlapping k-blocks, or the (overlapping) k-type of x_1^n. When k is understood, the subscript k on p_k(a_1^k | x_1^n) may be omitted.

The limiting (relative) frequency of a_1^k in the infinite sequence x is defined by

(1)    p(a_1^k | x) = lim_{n→∞} p(a_1^k | x_1^n),

provided, of course, that this limit exists. A sequence x is said to be (frequency) typical for the process μ if each block a_1^k appears in x with limiting relative frequency equal to μ(a_1^k); that is, x is typical for μ if for each k the empirical distribution of each k-block converges to its theoretical probability μ(a_1^k).
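The empirical k-block distribution is straightforward to compute from a finite sample path. The following sketch is illustrative only and not part of the text; the function name k_type and the representation of x_1^n as a Python sequence are assumptions made for the example. It simply slides a window of length k along x_1^n and normalizes the counts by n − k + 1.

    from collections import Counter

    def k_type(x, k):
        # empirical distribution of overlapping k-blocks in x = x_1^n, n >= k
        n = len(x)
        window_count = n - k + 1
        counts = Counter(tuple(x[i:i + k]) for i in range(window_count))
        return {block: c / window_count for block, c in counts.items()}

    # example: the 2-type of a short binary sequence
    print(k_type("0110100", 2))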

A basic result for ergodic processes is that almost every sequence is typical. Let T(μ) denote the set of sequences that are frequency typical for μ.

Theorem 1.4.1 (The typical-sequence theorem.) If μ is ergodic then μ(T(μ)) = 1, that is, with probability 1,

    lim_n p(a_1^k | x_1^n) = μ(a_1^k), for all k and all a_1^k.

Proof In the ergodic case, for fixed a_1^k,

    p(a_1^k | x_1^n) = (1/(n − k + 1)) Σ_{i=1}^{n−k+1} χ_{[a_1^k]}(T^{i−1} x),

which, by the ergodic theorem, converges almost surely to ∫ χ_{[a_1^k]} dμ = μ(a_1^k). Thus, for fixed a_1^k, there is a set B(a_1^k) of measure 0 such that p(a_1^k | x_1^n) → μ(a_1^k), for x ∉ B(a_1^k). Since there are only a countable number of possible a_1^k, the set

    B = ∪_{k=1}^{∞} ∪_{a_1^k ∈ A^k} B(a_1^k)

has measure 0 and convergence holds for all k and all a_1^k, for x ∉ B. This proves the theorem. □

In other words, the limiting relative frequencies p(a_1^k | x) exist and are constant in x on a set of measure 1, and the constant is μ(a_1^k). Thus the entire structure of an ergodic process can be almost surely recovered merely by observing limiting relative frequencies along a single sample path. The converse of the preceding theorem is also true.

Theorem 1.4.2 (The typical-sequence converse.) Suppose μ is a stationary measure such that for each k and for each block a_1^k, the limiting relative frequency p(a_1^k | x) exists and is constant, almost surely. Then μ is an ergodic measure.

Proof If the limiting relative frequency p(a_1^k | x) exists and is a constant c, μ-almost surely, then this constant c must equal μ(a_1^k), since the ergodic theorem includes L^1-convergence of the averages to a function with the same mean. Thus the hypotheses imply that the formula

    lim_n (1/n) Σ_{i=1}^{n} χ_B(T^{i−1} x) = μ(B)

holds for almost all x for any cylinder set B. Multiplying each side of this equation by the indicator function χ_C(x), where C is an arbitrary measurable set, produces

    lim_{n→∞} (1/n) Σ_{i=1}^{n} χ_B(T^{i−1} x) χ_C(x) = μ(B) χ_C(x),

and an integration then yields

(2)    lim_n (1/n) Σ_{i=1}^{n} μ(T^{−i} B ∩ C) = μ(B) μ(C).

Sequences of length n can be partitioned into two classes. a.L(a)1 < c. E). and B. just the typical sequences.. But if B = B then the formula gives it(B) = A(B) 2 . It is also useful to have the following finite-sequence form of the typical-sequence characterization of ergodic processes. .1.q) is bounded and converges almost surely to some limit.(k. for every integer k > 0 and every c > 0. Theorem 1. Theorem 1. for every k and c > O. G .4. Note that as part of the preceding proof the following characterizations of ergodicity were established. (i) T is ergodic.(Ti-lx) = tt(E). The members of the "good" set. for each A and B ma generating algebra.n (G n (k.3 The following are equivalent. so that those in Gn will have "good" k-block frequencies and the "bad" set Bn has low probability when n is large. a E The "bad" set is the complement. FREQUENCIES OF FINITE BLOCKS. c) such that if n > N then there is a collection C.4. The usual approximation argument then shows that formula (2) holds for any measurable sets B and C. This proves Theorem 1.5 (The finite form of ergodicity. G. which depends on k and a measure of error. ergodic.2. E). then 1/3(41fi') — p(aini)I < c.4 (The "good set" form of ergodicity. then p. Theorem 1.4. C. The precise definition of the "good" set.4. Theorem 1. = 1.4. C An such that tt(Cn) > 1 — E. Some finite forms of the typical sequence theorems will be discussed next. or when k and c are understood. and hence (b) follows from Theorem 1. the limit must be almost surely constant. 6)) E Gn (k. is (3) G n (k. 45 The latter holds for any cylinder set B and measurable set C. (ii) lima pi.) A measure IL is ergodic if and only if for each block ail' and E > 0 there is an N = N(a. E)) converge to 1 for any E > 0 is just the condition that p(41x7) converges in probability to ii(4). Since p(an. for each E in a generating algebra.SECTION 1.(iii) limn x .) (a) If ji is ergodic then 4 (b) If lima p. are often called the frequency-(k. 6) = An — G n (k. c)-typical sequences.e. is Proof Part (a) is just a finite version of the typical-sequence theorem. so that p(B) must be 0 or 1. (ii) If x. B n (k. A(T -i A n B) = .4.u(A)A(B).2. eventually almost surely. The condition that A n (G n (k. 6) = p(afix) — .4. c).

4. I. If n El. . (4) E [1. The ergodic theorem is often used to derive covering properties of finite sample paths for ergodic processes.) If 12. n ) > 1 — 2E. eventually almost surely.4. and if Bn is a subset of measure. informally. This simple consequence of the ergodic theorem can be stated as either a limit result or as an approximation result. Proof If it is ergodic. for any 8 > 0. 1]: xn-1 B n }i < 8M. then define G n = ix: IP(a inxi) — P(a inx)I < EI. To establish the converse fix a il' and E > 0.46 CHAPTER I.7 (The almost-covering principle. satisfies A(13 n ) > 1 — then xr is eventually almost surely (1 — (5)-strongly-covered by r3n. X.6 (The covering theorem. M — n] for almost surely as M which x: +n-1 E Bn . a set Bn c An with desired properties is determined in some way. E). Theorem 1. eventually almost surely. C An. xr is "mostly strongly covered" by blocks from Bn .a(a n n e . in which case the conclusion (4) is expressed. E _ _ n E. if pn (B n iXr) < b. and if 8. Thus.4. Theorem 1.4. a.n . proving the theorem. M — n that is. c) such that . Proof Apply the ergodic theorem to the indicator function of Bn .s. is chosen to have large measure. is an ergodic process with alphabet A.b The ergodic theorem and covering properties. A sequence xr is said to be (1 — 8)strongly-covered by B n if I {i E [1. the "good" set defined by (3). An of positive 1-1 (Bn). then liM Pn(BniXi ) M—). and hence p(a Ix) is constant almost surely.u(an) > 1 — Cn satisfies (i) and (ii) put En = {x: 4 E Cn} and note that . M — n]: E 13n}1 — ML(13n)1 <8M. In many situations the set 13. eventually oc. BASIC CONCEPTS. by saying that. In these applications.) If kt is an ergodic process with alphabet A. then the ergodic theorem is applied to show that.co In other words. just set Cn = G n (k. and use convergence in probability to select n > N(a.4. and use Theorem 1. IP(ailx) — P(ailZ)I 3E. there are approximately tt(Bn )M indices i E [1.

The set of all x for which the limit p(ainx) = lim p(44) exists for all k and all di will be denoted by E.c. Such an n exists by the ergodic theorem.4. then at least one member of 13n must start within 3M of the beginning of xr and at least one member of 1:3n must start within 8M of the end of xr. is stated as the following theorem. that the expected waiting time between occurrences of a along a sample path is. first to select the set 13n .2.(a). implies that xr is eventually almost surely (1 — 3)-strongly-covered by Bn . in the limit. and then to obtain almost strong-covering by 13n . almost surely. for some i E [1. the desired result. These ideas will be made precise in this section. M]: x i = a} L(x) = max{i E [1. with the convention F(41) = L(41 ) = 0. I. let lim kr. Example 1. Thus is the set of all sequences for which all limiting relative frequencies exist for all possible blocks of all possible sizes.4. be ergodic. limiting relative frequencies of all orders exist almost surely. but the limit measure varies from sequence to sequence. Such "doubling" is common in applications.00 L(41 )— F(41 ) = 1. Given 8 > 0. The almost-covering idea will be used to show that (5) To prove this. In the stationary nonergodic case.4. . F(xr) = min{i E [1. Since 3 is arbitrary. see Exercise 2.SECTION 1. that is.c The ergodic decomposition.7. which is a simple consequence of the ergodic theorem. that if xr is (1 — 6)strongly-covered by 13n . together with the ergodic theorem. Note. nil. almost all of which will. Let F (xr) and L(xr) denote the first and last occurrences of a E Xr. 47 The almost strong-covering idea and a related almost-packing idea are discussed in more detail in Section L7. This result also follows easily from the return-time picture discussed in Section I. Note that the ergodic theorem is used twice in the example.8 Let p. M]: x i = a}. Theorem 1.4. in fact. be an ergodic process and fix a symbol a E A of positive probability. But if this is so and M > n/3. however. Bn = {a7: ai = a. follows. The following example illustrates one simple use of almost strong-covering. so that L(xr) — F(41) > (1 — 3 8 )M. FREQUENCIES OF FINITE BLOCKS. The set is shift invariant and any shift-invariant measure has its support in This support result. then F(41 ) < 23M and L(xr) > (1 — 3)M. The almost-covering principle. (5). if no a appears in xr. choose n so large that A(B n ) > 1 — 3. almost surely equal to 1/p. It follows easily from (5). The sequences that produce the same limit measure can be grouped into classes and the process measure can then be represented as an average of the limit measures.

to a Borel measure i. Two sequences x and y are said to be frequency-equivalent if /ix = A y .5.(E) = 1 and hence the representation (7) takes the form.9 If it is a shift-invariant measure then A ( C) = 1. which is well-defined on the frequency equivalence class of x. Of course.(Ti. (x). The ergodic measures. Formula (6) can then be expressed in the form (7) . This formula indeed holds for any cylinder set.u.. for if x E r then the formula .4.tx on A. Next define /27(x) = ktx.1. BASIC CONCEPTS.9 may now be expressed in the following much stronger form. Theorem 1.(x )(B) dco(7(x)). To make this assertion precise first note that the ergodic theorem gives the formula (6) .48 Theorem 1..10 ( The ergodic decomposition theorem. Theorem 1. as an average of its ergodic components.) If p. each frequency-equivalence class is either entirely contained in E or disjoint from E.. B E E.( x )(B) da)(7r(x)). .. Theorem 1.4.u(B) = f 1. which can be extended. Let E denote the set of all sequences x E L for which the measure ti. Proof Since p (all' 14) = n . BE E. n —k+ 1 The frequencies p(41. together with the finite form of ergodicity. The measure . and extends by countable additivity to any Borel set B. shows that to prove the ergodic decomposition theorem it is enough to prove the following lemma.4. The usual measure theory argument.u onto a measure w = .u. The projection x is measurable and transforms the measure .ux will be called the (limiting) empirical measure determined by x. An excellent discussion of ergodic components and their interpretation in communications theory can be found in [ 1 9 ].. — ix).u(alic) = f p(af Ix) dp.u o x -1 on equivalence classes. is shift-invariant then p. by (6).x (a) = p(ainx). k 1 i=1 the ergodic theorem implies that p(aflx) = limn p(aii`ix) exists almost surely. x E E. 4 E Ak defines a measure itx on sequences of length k.9 can be strengthened to assert that a shift-invariant measure kt must actually be supported by the sequences for which the limiting frequencies not only exist but define ergodic measures.. CHAPTER I.C must El have measure 0 with respect to 1..4.4.. a shift-invariant measure always has its support in the set of sequences whose empirical measures are ergodic. ± i E x [ ak. by the Kolmogorov consistency theorem. which establishes the theorem. Let n denote the projection onto frequency equivalence classes. for each k. for each k and 4. Theorem 1. (B)= f p. are called the ergodic components of kt and the formula represents p. e In other words.x is ergodic. . the complement of .1. Since there are only a countable number of the al` .)c) can be thought of as measures.(x).

49 Lemma 1. ILi) is shift-invariant and 1 . that the relative frequency of occurrence of al in any two members of Li differ by no more than E/2.12 The ergodic decomposition can also be viewed as an application of the ChoquetBishop-deLeeuw theorem of functional analysis. The set L1 is invariant since p(ainx) = p(4`17' x).10.(Xi) > 1 — E.) The extreme points of Ps correspond exactly to the ergodic measures. 1 (L1) L1 P(Crilz) d iu(z) > 1 — 6 2 so the Markov inequality yields a set L3 C L2 such that /4L3) > (1 — E)/L(L 1 ) and for which . differs by no more than E.9. tt(CnICI) = ti Li P(xilz) dtt(z). oo. 0 Remark 1. as n there is an integer N and a set L2 C Li of measure at least (1 — E2 )A(L1) such that (9) IP(414) — p(alx)I < E/4. the set Ps of shift-invariant probability measures is compact in the weak*-topology (obtained by thinking of continuous functions as linear functionals on the space of all measures via the mapping it i— f f dit. 1] is bounded so L can be covered by a finite number of the sets Li(y) and hence the lemma is proved. the sequence {p(al 14)} converges to p(41x).„ E p(xz) ?.4. fix a block 4. This completes the proof of the ergodic decomposition theorem.E. Indeed.q. z E L3.11 Given c > 0 and 4 there is a set X1 with ii. x' iz E An. x E L2. Fix n > N and put Cn = {x: x E L2 }. such that for any x E X1 there is an N such that if n > N there is a collection C. and fix a sequence y E L. c A n . Theorem 1. The unit interval [0.4. Z E r3.c. which translates into the statement that p(C) > 1 — E. FREQUENCIES OF FINITE BLOCKS. The conditions (9) and (8) guarantee that the relative frequency of occurrence of 4 in any two members x7. the set of sequences with limiting frequencies of all orders. in particular.4. hence For each x E Li. Thus the conditional measure it (. Note. Let L i = Li(y) denote the set of all x E L such that (8) ip(a'nx) — p(41y)l < E/4. see Exercise 1 of Section 1.SECTION 1. with /L(C) > 1 — c. for any x E L. [57]. such that the frequency of occurrence of 4 in any two members of C. 1 . itiz of Cn differ by no more E. . Proof Fix E > 0.u(xilLi) = A(CI) In particular. n > N.4.

z„. See Exercise 8 in Section 1. P V TP y T 2P)-process. for infinitely Fix a E A and let Aa R(x) = {R 1 (X). Let qk • 14) be the empirical distribution of nonoverlapping k-blocks in 4.50 CHAPTER I.) 7. R2(x). t): where n = tk r. Define the mapping x (x) = min{m > 1: x. I. Let P be the time-0 partition for an ergodic Markov chain with period d = 3.2 for a definition of this process. P)-process. (Hint: combine Exercise 2 with Theorem 1. where P is the Kolmogorov partition in the two-sided representation of {X. 8.) 3. . Let T be an ergodic rotation of the circle and P the two-set partition of the unit circle into upper and lower halves. (It is an open question whether a mixing finite-state process is a function of a mixing Markov chain. (b) Show that the preceding result is not true if T is not totally ergodic. (b) Determine the ergodic components of the (T 3 . qk (alnx. P)-process. zt) are looking at different parts of y. Assume p. Use the central limit theorem to show that the "random walk with random scenery" process is ergodic. Combine (5) with the ergodic theorem to establish that for an ergodic process the expected waiting time between occurrences of a symbol a E A is just 111. (a) Determine the ergodic components of the (T 3 .1 by many n. (This is just the return-time process associated with the set B = [a]. Show that an ergodic finite-state process is a function of an ergodic Markov chain. that is. BASIC CONCEPTS.5. = (a) Show that if T is totally ergodic then qk (ali`14) . is the conditional measure on [a] defined by an ergodic process.4. This exercise illustrates a direct method for transporting a sample-path theorem from an ergodic process to a sample-path theorem for a (nonstationary) function of the process. 6.t(a).) 4.n = a} — 1 R1 (x) = minfm > E (x): x m = a} — E Rico i=1 . Show that . (Hint: show that the average time between occurrences of a in 4 is close to 1/pi (a14).u o R -1 on Aroo.d Exercises 1. 0 < r < k. almost surely.1 ) = I{/ E [0. m+k ) and (4. The reversed process {Yn : n > 1} is the .4. 2.u(x lic). Describe the ergodic components of the (T x T .) 5. " be the set of all x E [a] such that Tnx E [a]. (Hint: with high probability (x k .: n > 1}. Show that {Xn } is ergodic if and only its reversed process is ergodic.) A measure on Aa " transports to the measure y = . Let {X„: n > 1} be a stationary process.P x P)-process.

3 E(17.(a) log 7(a). 51 (b) R(T R ' (x ) x) = SR(x). the measure p.17.(x)) _< log I AI.5. The proof of the entropy theorem will use the packing lemma.1. Theorem 1.(a). THE ENTROPY THEOREM.(4). n so that if A n is the measure on An defined by tt then H(t) = nE(hn (x)). (a) The mapping x R(X) is Borel. Proof An elementary calculus exercise. where S is the shift on (e) If T is the set of the set of frequency-typical sequences for j. The next three lemmas contain the facts about the entropy function and its connection to counting that will be needed to prove the entropy theorem.5. then v(R(T)) = 1. z(a) 111AI.1 (The entropy theorem. Lemma 1. almost surely. of the process. aEA Let 1 1 h n (x) = — log — n ti. There is a nonnegative number h = h(A) such that n oo n 1 lim — log . (Hint: use the Borel mapping lemma.5.2 The entropy function H(7) is concave in 7 and attains its maximum value log I AI only for the uniform distribution. The entropy of a probability distribution 7(a) on a finite set A is defined by H(z) = — E 71. log means the base 2 logarithm.) Let it be an ergodic measure for the shift T on the space A'. and. except for some interesting cases. and denoted by h = h(t). the lemma follows from the preceding lemma. where A is finite.) (d) (1/n) En i R i -÷ 11 p.u.5.(4) is nonincreasing in n. some counting arguments. and the natural logarithm will be denoted by ln. Lemma 1.(4) 1 log p. see Exercise 4c. CI . Proof Since 1/(.5 The entropy theorem.4) = nE(h n (x)). For stationary processes. and the concept of the entropy of a finite distribution.SECTION 1. has limit 0. v-almost surely. (Hint: use the preceding result. 0 Lemma 1.(4) = h. The entropy theorem asserts that. the decrease is almost surely exponential in n. for ergodic processes. with a constant exponential rate called the entropy or entropy-rate.) Section 1. In the theorem and henceforth.

The function H(δ) = −δ log δ − (1 − δ) log(1 − δ) is called the binary entropy function. It is just the entropy of the binary distribution with μ(1) = δ. It can be shown that the binary entropy function is the correct (asymptotic) exponent in the number of binary sequences of length n that have no more than δn ones, that is,

    lim_n (1/n) log Σ_{k ≤ nδ} \binom{n}{k} = H(δ),

see [7].

Lemma 1.5.4 (The combinations bound.) If \binom{n}{k} denotes the number of combinations of n objects taken k at a time and δ ≤ 1/2, then

    Σ_{k ≤ nδ} \binom{n}{k} ≤ 2^{nH(δ)}.

Proof The function −q log δ − (1 − q) log(1 − δ) is increasing in the interval 0 ≤ q ≤ δ, and hence 2^{−nH(δ)} ≤ δ^k (1 − δ)^{n−k}, k ≤ nδ. Multiplying by \binom{n}{k} and summing gives

    2^{−nH(δ)} Σ_{k ≤ nδ} \binom{n}{k} ≤ Σ_{k ≤ nδ} \binom{n}{k} δ^k (1 − δ)^{n−k} ≤ 1,

which proves the lemma. □

I.5.a The proof of the entropy theorem.

Define

    h_n(x) = −(1/n) log μ(x_1^n),    h(x) = lim inf_n h_n(x).

The first task is to show that h(x) is constant, almost surely. To establish this note that if y = Tx then [y_1^{n−1}] ⊇ [x_1^n], so that μ(y_1^{n−1}) ≥ μ(x_1^n), and hence h(Tx) ≤ h(x), that is, h(x) is subinvariant. Thus h(Tx) = h(x), almost surely, since subinvariant functions are almost surely invariant (see the exercises of Section I.2). Since T is assumed to be ergodic, the invariant function h(x) must indeed be constant, almost surely. Define h to be the constant such that h = lim inf_n h_n(x), almost surely. The goal is to show that h must be equal to lim sup_n h_n(x), almost surely.

Towards this end, let ε be a positive number. The definition of limit inferior implies that for almost all x the inequality h_n(x) ≤ h + ε holds for infinitely many values of n, a fact expressed in exponential form as

(1)    μ(x_1^n) ≥ 2^{−n(h+ε)}, infinitely often, almost surely.

The goal is to show that "infinitely often, almost surely" can be replaced by "eventually, almost surely," for a suitable multiple of ε.

Three ideas are used to complete the proof. The first idea is a packing idea. Eventually almost surely, most of a sample path is filled by disjoint blocks of varying lengths for each of which the inequality (1) holds. This is a simple application of the ergodic stopping-time packing lemma.

The second idea is a counting idea. The set of sample paths of length K that can be mostly filled by disjoint, long subblocks for which (1) holds cannot have cardinality exponentially much larger than 2^{K(h+ε)}. Indeed, if these subblocks are to be long, and if they mostly fill, then there are not too many ways to specify their locations. Most important of all is that once locations for these subblocks are specified, then, since a location of length n can be filled in at most 2^{n(h+ε)} ways if (1) is to hold, there will be a total of at most 2^{K(h+ε)} ways to fill all the locations. In other words, if these subblocks are to be long, then once their locations are specified, there are not too many ways the parts outside the long blocks can be filled.

The third idea is a probability idea. If a set of K-length sample paths has cardinality only a bit more than 2^{K(h+ε)}, then it is very unlikely that a sample path in the set has probability exponentially much smaller than 2^{−K(h+ε)}. This is just an application of the fact that upper bounds on cardinality "almost" imply lower bounds on probability, Lemma I.1.18(b).

To fill in the details of the first idea, let δ be a positive number and M > 1/δ an integer, both to be specified later. For each K ≥ M, let G_K(δ, M) be the set of all x_1^K that are (1 − 2δ)-packed by disjoint blocks of length at least M for which the inequality (1) holds. In other words, x_1^K ∈ G_K(δ, M) if and only if there is a collection

    S = S(x_1^K) = {[n_i, m_i]}

of disjoint subintervals of [1, K] with the following properties.

(a) m_i − n_i + 1 ≥ M, for each [n_i, m_i] ∈ S.

(b) μ(x_{n_i}^{m_i}) ≥ 2^{−(m_i − n_i + 1)(h+ε)}, for each [n_i, m_i] ∈ S.

(c) Σ_{[n_i, m_i] ∈ S} (m_i − n_i + 1) ≥ (1 − 2δ)K.
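The set G_K(δ, M) is defined by an existential condition, but a greedy left-to-right search gives a simple sufficient test for membership. The sketch below is illustrative only and not part of the text: mu_prob stands in for the process measure μ on blocks, and the parameters h, eps, delta, M are assumed to be given. When the greedy search covers a (1 − 2δ)-fraction it certifies membership, though it need not find a packing whenever one exists.

    def greedy_packing_test(x, mu_prob, h, eps, delta, M):
        # look for disjoint blocks of length >= M satisfying condition (b),
        # scanning left to right, and check the coverage condition (c)
        K = len(x)
        covered, i = 0, 0
        while i + M <= K:
            end = None
            for j in range(i + M, K + 1):        # candidate block x_{i+1}^{j}
                block = x[i:j]
                if mu_prob(block) >= 2.0 ** (-len(block) * (h + eps)):
                    end = j
                    break
            if end is None:
                i += 1                            # no acceptable block starts here
            else:
                covered += end - i
                i = end
        return covered >= (1 - 2 * delta) * K     # condition (c)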


An application of the packing lemma produces the following result.

Lemma 1.5.5
x_1^K ∈ G_K(δ, M), eventually almost surely.

Proof Define τ(x) to be the first time n ≥ M such that μ([x_1^n]) ≥ 2^{−n(h+ε)}. Since τ is measurable and (1) holds infinitely often, almost surely, τ is an almost-surely finite stopping time. An application of the ergodic stopping-time lemma, Lemma I.3.7, then yields the lemma. □
The second idea, the counting idea, is expressed as the following lemma.

Lemma 1.5.6 There is a δ > 0 and an M > 1/δ such that |G_K(δ, M)| ≤ 2^{K(h+2ε)}, for all sufficiently large K.

Proof A collection S = {[n_i, m_i]} of disjoint subintervals of [1, K] will be called a skeleton if it satisfies the requirement that m_i − n_i + 1 ≥ M, for each i, and if it covers all but a 2δ-fraction of [1, K], that is,

(2)    Σ_{[n_i, m_i] ∈ S} (m_i − n_i + 1) ≥ (1 − 2δ)K.

A sequence x_1^K is said to be compatible with such a skeleton S if μ(x_{n_i}^{m_i}) ≥ 2^{−(m_i − n_i + 1)(h+ε)} for each i. The bound of the lemma will be obtained by first upper bounding the number of possible skeletons, then upper bounding the number of sequences x_1^K that are compatible with a given skeleton. The product of these two numbers is an upper bound for the cardinality of G_K(δ, M), and a suitable choice of δ will then establish the lemma.

First note that the requirement that each member of a skeleton S have length at least M means that |S| ≤ K/M, and hence there are at most K/M ways to choose the starting points of the intervals in S. Thus the number of possible skeletons is upper bounded by

(3)    Σ_{k ≤ K/M} \binom{K}{k} ≤ 2^{K H(1/M)},

where the upper bound is provided by Lemma 1.5.4, with H(·) denoting the binary entropy function.

Fix a skeleton S = {[n_i, m_i]}. The condition that x_1^K be compatible with S means that the compatibility condition

(4)    μ(x_{n_i}^{m_i}) ≥ 2^{−(m_i − n_i + 1)(h+ε)}

must hold for each [n_i, m_i] ∈ S. For a given [n_i, m_i] the number of ways x_{n_i}^{m_i} can be chosen so that the compatibility condition (4) holds is upper bounded by 2^{(m_i − n_i + 1)(h+ε)}, by the principle that lower bounds on probability imply upper bounds on cardinality, Lemma I.1.18(a). Thus, the number of ways x_j can be chosen, for j ∈ ∪_i [n_i, m_i], so that the compatibility conditions hold is upper bounded by

    ∏_i 2^{(m_i − n_i + 1)(h+ε)} ≤ 2^{K(h+ε)}.


Outside the union of the [n_i, m_i] there are no conditions on x_j. Since, however, there are fewer than 2δK such j, these positions can be filled in at most |A|^{2δK} ways. Thus, there are at most

    |A|^{2δK} 2^{K(h+ε)}

sequences compatible with a given skeleton S = {[n_i, m_i]}. Combining this with the bound (3) on the number of possible skeletons yields

(5)    |G_K(δ, M)| ≤ 2^{K H(1/M)} |A|^{2δK} 2^{K(h+ε)}.

Since the binary entropy function H(1/M) approaches 0 as M → ∞, and since |A| is finite, the numbers δ > 0 and M > 1/δ can indeed be chosen so that |G_K(δ, M)| ≤ 2^{K(h+2ε)}, for all sufficiently large K. This completes the proof of Lemma 1.5.6. □

Fix δ > 0 and M > 1/δ for which Lemma 1.5.6 holds, put G_K = G_K(δ, M), and let B_K be the set of all x_1^K for which μ(x_1^K) ≤ 2^{−K(h+3ε)}. Then

    μ(B_K ∩ G_K) ≤ |G_K| 2^{−K(h+3ε)} ≤ 2^{−Kε}

holds for all sufficiently large K. Thus, x_1^K ∉ B_K ∩ G_K, eventually almost surely, by the Borel-Cantelli principle. Since x_1^K ∈ G_K, eventually almost surely, the iterated almost-sure principle, Lemma I.1.15, implies that x_1^K ∉ B_K, eventually almost surely, that is,

    lim sup_{K→∞} h_K(x) ≤ h + 3ε, a.s.

In summary, for each ε > 0,

    h = lim inf_K h_K(x) ≤ lim sup_{K→∞} h_K(x) ≤ h + 3ε, a.s.,

which completes the proof of the entropy theorem, since ε is arbitrary. □

Remark 1.5.7 The entropy theorem was first proved for Markov processes by Shannon, with convergence in probability established by McMillan and almost-sure convergence later obtained by Breiman, see [4] for references to these results. In information theory the entropy theorem is called the asymptotic equipartition property, or AEP. In ergodic theory it has been traditionally known as the Shannon-McMillan-Breiman theorem. The more descriptive name "entropy theorem" is used in this book. The proof given is due to Ornstein and Weiss, [51], and appeared as part of their extension of ergodic theory ideas to random fields and general amenable group actions. A slight variant of their proof, based on the separated packing idea discussed in Exercise 1, Section 1.3, appeared in [68].

I.5.b Exercises.
1. Prove the entropy theorem for the i.i.d. case by using the product formula for μ(x_1^n), then taking the logarithm and using the strong law of large numbers. This yields the formula h = −Σ_a μ(a) log μ(a). (A numerical check of this formula is sketched after this exercise list.)

2. Use the idea suggested by the preceding exercise to prove the entropy theorem for ergodic Markov chains. What does it give for the value of h?

3. Suppose for each k, T_k is a subset of A^k of cardinality at most 2^{kα}. A sequence x_1^n is said to be (K, δ, {T_k})-packed if it can be expressed as the concatenation x_1^n = w(1)⋯w(t), such that the sum of the lengths of the w(i) which belong to ∪_{k ≥ K} T_k is at least (1 − δ)n. Let G_n be the set of all (K, δ, {T_k})-packed sequences x_1^n and let ε be a positive number. Show that if K is large enough, if δ is small enough, and if n is large enough relative to K and δ, then |G_n| ≤ 2^{n(α+ε)}.

4. Assume μ is ergodic and define c(x) = lim_n μ(x_1^n), x ∈ A^∞.

(a) Show that c(x) is almost surely a constant c.
(b) Show that if c > 0 then μ is concentrated on a finite set.
(c) Show that if μ is mixing then c(x) = 0 for every x.
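The following sketch is a purely numerical illustration of the formula in Exercise 1; it is not part of the text, and the particular one-symbol distribution is an arbitrary assumption for the example. For an i.i.d. process the product formula gives log μ(x_1^n) = Σ_i log μ(x_i), so −(1/n) log μ(x_1^n) is a sample average that the strong law drives to −Σ_a μ(a) log μ(a).

    import math, random

    mu = {'a': 0.5, 'b': 0.3, 'c': 0.2}               # an assumed i.i.d. marginal
    h_formula = -sum(p * math.log2(p) for p in mu.values())

    n = 100_000
    x = random.choices(list(mu), weights=list(mu.values()), k=n)
    log_prob = sum(math.log2(mu[a]) for a in x)       # log mu(x_1^n) by the product formula
    print(-log_prob / n, "vs", h_formula)             # the two values should nearly agree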

Section 1.6

Entropy as expected value.

Entropy for ergodic processes, as defined by the entropy theorem, is given by the almost-sure limit

    h = lim_{n→∞} (1/n) log (1/μ(x_1^n)).

Entropy can also be thought of as the limit of the expected value of the random quantity −(1/n) log μ(x_1^n). The expected value formulation of entropy will be developed in this section.

I.6.a

The entropy of a random variable.

Let X be a finite-valued random variable with distribution defined by p(x) = Prob(X = x), x ∈ A. The entropy of X is defined as the expected value of the random variable −log p(X), that is,

    H(X) = Σ_{x∈A} p(x) log (1/p(x)) = −Σ_{x∈A} p(x) log p(x).

The logarithm base is 2 and the conventions 0 log 0 = 0 and log 0 = −∞ are used. If p is the distribution of X, then H(p) may be used in place of H(X). For a pair (X, Y) of random variables with a joint distribution p(x, y) = Prob(X = x, Y = y), the notation H(X, Y) = −Σ_{x,y} p(x, y) log p(x, y) will be used, a notation which extends to random vectors. Most of the useful properties of entropy depend on the concavity of the logarithm function. One way to organize the concavity idea is expressed as follows.

Lemma 1.6.1 If p and q are probability k-vectors then

    −Σ_i p_i log p_i ≤ −Σ_i p_i log q_i,

with equality if and only if p = q.


Proof The natural logarithm is strictly concave, so that ln x ≤ x − 1, with equality if and only if x = 1. Thus

    Σ_i p_i ln (q_i/p_i) ≤ Σ_i p_i (q_i/p_i − 1) = Σ_i q_i − Σ_i p_i = 0,

with equality if and only if q_i = p_i, 1 ≤ i ≤ k. This proves the lemma, since log x = (ln x)/(ln 2). □

The proof only requires that q be a sub-probability vector, that is, nonnegative with Σ_i q_i ≤ 1. The sum

    D(p‖q) = Σ_i p_i log (p_i/q_i)

is called the (informational) divergence, or cross-entropy, and the preceding lemma is expressed in the following form.

Lemma 1.6.2 (The divergence inequality.) If p is a probability k-vector and q is a sub-probability k-vector then D(p‖q) ≥ 0, with equality if and only if p = q.
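As a quick illustration of the divergence inequality (not part of the text), the divergence of two explicit vectors can be computed directly; base-2 logarithms are used, matching the chapter's convention, and the numerical values are arbitrary.

    import math

    def divergence(p, q):
        # D(p || q) with the convention 0 log 0 = 0
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.25]
    q = [0.4, 0.4, 0.2]          # a (sub-)probability vector
    print(divergence(p, q))      # strictly positive, since p != q
    print(divergence(p, p))      # equals zero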
A further generalization of the lemma, called the log-sum inequality, is included in the exercises. The basic inequalities for entropy are summarized in the following theorem.

Theorem 1.6.3 (Entropy inequalities.)

(a) Positivity. H(X) ≥ 0, with equality if and only if X is constant.

(b) Boundedness. If X has k values then H(X) ≤ log k, with equality if and only if each p(x) = 1/k.

(c) Subadditivity. H(X, Y) ≤ H(X) + H(Y), with equality if and only if X and Y are independent.

Proof Positivity is easy to prove, while boundedness is obtained from Lemma 1.6.1 by setting p_x = p(x), q_x = 1/k. To establish subadditivity note that
    H(X, Y) = −Σ_{x,y} p(x, y) log p(x, y),

then replace p(x, y) in the logarithm factor by p(x)p(y), and use Lemma 1.6.1 to obtain the inequality H(X, Y) ≤ H(X) + H(Y), with equality if and only if p(x, y) = p(x)p(y), that is, if and only if X and Y are independent. This completes the proof of the theorem. □

The concept of conditional entropy provides a convenient tool for organizing further results. If p(x, y) is a given joint distribution, with corresponding conditional distribution p(x|y) = p(x, y)/p(y), then

    H(X|Y) = −Σ_{x,y} p(x, y) log p(x|y) = −Σ_{x,y} p(x, y) log (p(x, y)/p(y))


is called the conditional entropy of X, given Y. (Note that this is a slight variation on standard probability language, which would call −Σ_x p(x|y) log p(x|y) the conditional entropy. In information theory, however, the common practice is to take expected values with respect to the marginal, p(y), as is done here.) The key identity for conditional entropy is the following addition law.

(1)    H(X, Y) = H(Y) + H(X|Y).

This is easily proved using the additive property of the logarithm, log ab = log a+log b. The previous unconditional inequalities extend to conditional entropy as follows. (The proofs are left to the reader.)
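A short numerical check of the addition law (1), not part of the text, can be run on any explicit joint distribution; the particular table p_xy below is an arbitrary assumption for the example.

    import math

    def H(dist):
        # entropy of a distribution given as a dict of probabilities
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    p_xy = {('a', 0): 0.3, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.4}
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p

    H_given = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
    print(H(p_xy), H(p_y) + H_given)     # both sides of H(X,Y) = H(Y) + H(X|Y)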

Theorem 1.6.4 (Conditional entropy inequalities.)

(a) Positivity. H(X|Y) ≥ 0, with equality if and only if X is a function of Y.

(b) Boundedness. H(X|Y) ≤ H(X), with equality if and only if X and Y are independent.

(c) Subadditivity. H((X, Y)|Z) ≤ H(X|Z) + H(Y|Z), with equality if and only if X and Y are conditionally independent given Z.

A useful fact is that conditional entropy H(X I Y) increases as more is known about the first variable and decreases as more is known about the second variable, that is, for any functions f and g,

Lemma 1.6.5 H(f(X)|Y) ≤ H(X|Y) ≤ H(X|g(Y)).

The proof follows from the concavity of the logarithm function. This can be done directly (left to the reader), or using the partition formulation of entropy which is developed in the following paragraphs.

The entropy of a random variable X really depends only on the partition P_X = {P_a: a ∈ A} defined by P_a = {x: X(x) = a}, which is called the partition defined by X. The entropy H(P) is defined as H(X), where X is any random variable such that P_X = P. Note that the join P_X ∨ P_Y is just the partition defined by the vector (X, Y), so that H(P_X ∨ P_Y) = H(X, Y). The conditional entropy of P relative to Q is then defined by H(P|Q) = H(P ∨ Q) − H(Q). The partition point of view provides a useful geometric framework for interpretation of the inequalities in Lemma 1.6.5, because the partition P_X is a refinement of the partition P_{f(X)}, since each atom of P_{f(X)} is a union of atoms of P_X. The inequalities in Lemma 1.6.5 are expressed in partition form as follows.

Lemma 1.6.6
(a) If P refines Q then H(Q|R) ≤ H(P|R).
(b) If R refines S then H(P|S) ≥ H(P|R).

The n-th order entropy of a sequence {X1. which establishes inequality (b).n a. H (P IR) (i) (P v 21 1Z) H(QITZ) + H(PIQ v 1Z) H (OR). I. and the inequality (iii) uses the fact that H (P I VR. n n xn In the stationary case the limit superior is a limit. so that if b = sup .b The entropy of a process. a 0 b. .. Given c > 0 choose n so that an < n(a ± 6).7 (The subadditivity lemma. the equality (ii) is just the general addition law. ENTROPY AS EXPECTED VALUE. an-Fm < an + am . < pan + b.r. The entropy of a process is defined by a suitable passage to the limit. as follows. that is.. Let Pc denote the partition of the set C defined by restricting the sets in P to C.) If {an} is a sequence of nonnegative numbers which is subadditive. namely.. Sb = Rb U Rb. then limn an ln exists and equals infn an /n. . (2) 1 1 1 H ({X n }) = lim sup — H((X7) = lim sup — p(4) log n n p(4) . The equality (i) follows from the fact that P refines Q. If m > n write m = np +. IC) is used on C. say Ra = Sa . Proof Let a = infn an /n. where the conditional measure. 59 Proof The proof of (a) is accomplished by manipulating with entropy formulas. then another use of subadditivity yields am < anp + a. 11) P(x1 xnEan E The process entropy is defined by passing to a limit. Subadditivity gives anp < pan . This is a consequence of the following basic subadditivity property of nonnegative sequences.6.) can be expressed as -EEtt(Pt n R a ) log t ab it(Pt n Ra) + (4)11 (Psb liZsb ) Ewpt n Ra ) log It(Pt n Ra) + it(Sb)1-1(Psb ). X2.} of A-valued random variables is defined by H(r) = E p(x)log 1 ATEAn =— p(x'iL)log p(4). is obtained from S by splitting one atom of S into two pieces.6.SECTION I. Lemma 1. t a0b II(Ra) The latter is the same as H(PIS). • The quantity H(P IR. 0 < r < n.) > O. it (. so that P = Pv Q. To prove (h) it is enough to consider the case when R.6.

as in —> pc. An alternative formula for the process entropy in the stationary case is obtained by using the general addition law. a. the following lemma will be useful in controlling entropy on small sets.3(c). The next goal is to show that for ergodic processes.u (x '0 = h. Division by m._ .8 (The subset entropy-bound.60 CHAPTER I.6. BASIC CONCEPTS.e.1 )= H(X0IX:n1). aEB E P(a)log p(a) + log p(B)) . If the process is stationary then H(X n + -FT) = H(r) so the subadditivity lemma with an = H(X7) implies that the limit superior in (2) is a limit. CI This proves the lemma._ log IBIaEB — E 1 The left-hand side is the same as ( p(B) from which the result follows. (3) n--4co a formula often expressed in the suggestive form (4) H({X n }) = H (X01X_1 0 ) . and process entropy H. H(X. 0 Let p. Theorem I. to produce the formula H(X ° n )— H(X1. gives H(X7 +m) < H(X7)+ H(rnTin). the process entropy H is the same as the entropy-rate h of the entropy theorem. Towards this end. be an ergodic process with alphabet A. aEB for any B c A.6.5. that is urn—log n 1 1 n . The subadditivity property for entropy. gives lim sup ani /m < a + E. can then be applied to give H({X n })= lim H (X01X1 nI ) .) _ E p(a) log p(a) _< p(B) log IBI — p(B)log p(B). 1B) denote the conditional measure on B and use the trivial bound to obtain P(a1B)log p(aIB) . n > 0. Proof Let p(. The right-hand side is decreasing in n. . and let h be the entropy-rate of A as given by the entropy theorem. and the simple fact from analysis that if an > 0 and an+ i — an decreases to b then an /n —> b. and the fact that np/m —> 1.6. Lemma 1. Y) = H(Y)± H(X1Y). from Lemma 1.

u(4) <_ n(h ± c). On the set G.u(G n ) and summing produces (6) (h — 6) 1 G.u(xliz) — E . from the entropy theorem. 8„ since I B I < IAIn . while D only the upper bound in (6) matters and the theorem again follows.10 Since for ergodic processes. approaches 1.u(4)/n.6. the following holds n(h — e) _< — log . so the sum converges to H({X„}). Proof First assume that h > 0 and fix E such that 0 < E < h. since the entropy theorem implies that ti(B) goes to O. The subset entropy-bound. ENTROPY AS EXPECTED VALUE. = {4: 2—n(h+E) < 1. contains an excellent discussion of combinatorial and communication theory aspects of entropy.6.u (4 ) log . [7]. The two sums will be estimated separately. After division by n. and Cover and Thomas. in the i. this part must go to 0 as n —> oc. the process entropy H({X.u(4) > 2 —ne . Then H (X7) = — E „.6.(4). } Remark 1. As n —> co the measure of G. Thus h— E which proves the theorem in the case when h > O.(Bn )log IA 1 — . = {4: .(x ) log .d. B.i.. [6]. so multiplication by . both are often simply called the entropy. . [18].SECTION 1. Lemma 1. G.8. Define G.9 Process entropy H and entropy-rate h are the same for ergodic processes.1 (4) < 2—n(h—E)} and let B.}) is the same as the entropyrate h of the entropy theorem.. [4]. The recent books by Gray. gives (5) — E A (x) log ii(x) < np.u(xtil) A" = — E . A detailed mathematical exposition and philosophical interpretation of entropy as a measure of information can be found in Billingsley's book. 61 Theorem 1. = A n — G. The Csiszdr-Körner book. discuss many information-theoretic aspects of entropy for the general ergodic process.u(4) log p. If h = 0 then define G. The bound (5) still holds. case.i(B)log(B).6.

to obtain H({X. I. along with a direct calculation of H(X0IX_ 1 ) yields the following formula for the entropy of a Markov chain.i.) A stationary process {X n } is Markov of order k if and only if H(X0IX:) = H(X0IX: n1 ).} is Markov of order k then 1/({X}) = H(X0IX:1) = n > k.!) = H(X0)• Now suppose {X n } is a Markov chain with stationary vector p and transition matrix M.d. Here they will be derived from the definition of process entropy. Theorem 1. process and let p(x) = Prob(Xi = x). see Exercises 1 and 2 in Section 1.11 (The Markov order theorem.d. Proof The conditional addition law gives H((X o .3(c). and Markov processes can be derived from the entropy theorem by using the ergodic theorem to directly estimate p. (8) H({X n }) = H(X0IX_ i ) = — E 04 log Mii . and Markov processes. } be an i. a from which it follows that E A. (3).62 CHAPTER I. An alternate proof can be given by using the conditional limit formula.i.d. BASIC CONCEPTS. Xi n k+1 )1X1) = H(X: n k+1 1X11) H(X 0 IXI kl . The additivity of entropy for independent random variables. The argument used in the preceding paragraph shows that if {X. Entropy formulas for i.(4).5. a E A. . (7) H({X. Xink+1). and the fact that H(Xl Y) = H(X) if X and Y are independent. . Let {X. gives H (XI') = H(Xi) which yields the entropy formula H (X2) + H(X n ) = nH (X i ). Recall that a process {Xn } is Markov of order k if Prob (X0 = aiX) = Prob (X0 = alX:k1 ) .6.}) = lip H (X0IX1. (3). The Markov property implies that Prob (X0 = alX) = Prob (X0 = alX_i) .6.i.c The entropy of i. Theorem 1. This condition actually implies that the process is Markov of order k. n > k.6.}) = H(X i ) = — p(x) log p(x). H (X 0 1 = H(X01 X -1)• Thus the conditional entropy formula for the entropy of a process.

n Ifi: Pi(alxi) = = an . rather than a particular sequence that defines the type. and will be useful in some of the deeper interpretations of entropy to be given in later chapters. Thus. ENTROPY AS EXPECTED VALUE.) < 2r1(131). In since np i (a) is the number of times the symbol a appears in a given 4 particular. k > k*. for n > k.6.4(c). In general the conditional entropy function Hk = H(X0IX: k1 ) is nonincreasing in k.14)..SECTION 1. This fact can be used to estimate the order of a Markov chain from observation of a sample path. can then be used to conclude that X0 and X:_ n dent given XI/. x'1' E T. if and only if each symbol a appears in xi' exactly n pi(a) times. - Proof First note that if Qn is a product measure on A n then (9) Q" (x)= ji Q(xi) i=1 = fl aEA War P1(a) . a product measure is always constant on each type class. This completes the proof of the theorem. 14). then Hks_i > fik* = Hk . 14)) = a Pi(aixi)log Pi(aI4).6. the same number of times it appears in Type-equivalence classes are called type classes. for any type p i . that is. 17(130= ii(pl(. The type class of xi' will be denoted by T WO or by 7. provided H(X01X1 k1 ) = H(X0 1X1n1 ). if each symbol a appears in x'. If the true order is k*.6. Two sequences xi' and y'11 are said to be type-equivalent if they have the same type. The empirical distribution or type of a sequence x E A" is the probability distribution Pi = pl (. The equality condition of the subadditivity principle.6. I. a E A. If this is true for every n > k then the process must be Markov of order k. that is. that is. Theok+1 are conditionally indepenrem I. To say that the process is Markov of some order is to say that Hk is eventually constant. Theorem 1. The entropy of an empirical distribution gives the exponent for a bound on the number of sequences that could have produced that empirical distribution. 63 The second term on the right can be replaced by H(X01X: k1 ). x E rn Tpn. The following purely combinatorial result bounds the size of a type class in terms of the empirical entropy of the type. The empirical (first-order) entropy of xi' is the entropy of p i (.d Entropy and types. where the latter stresses the type p i = pi (. E . 14) on A defined by the relative frequency of occurrence of each symbol in the sequence 4.12 (The type class bound. This fact is useful in large deviations theory.

4 E T. 12— n 11(” v 1 ) But Pi is a probability distribution so that Pi (Tpni ) < 1. then the number of k-types is of lower order than IA In. that is.' and yç' are said to be k-type-equivalent if they have the same k-type.6. The bound 2' 17.1 = E I -41 n— k 1 E A k.6. while if k is also growing.12. as stated in the following theorem. that if k is fixed the number of possible k-types grows polynomially in n. The empirical overlapping k-block distribution or k-type of a sequence x7 E A' is the probability distribution Pk = pk(' 14) on A k defined by the relative frequency of occurrence of each k-block in thesen [i quenk ce +ii'i th_a i± t ki_ S. 1. extends immediately to k-types. npi (alfii) are 0. after taking the logarithm and rewriting. This establishes Theorem 1. produces Pn(xiz) = aEA pi (a)nPl(a) . The bound on the number of types. Later use will also be made of the fact that there are only polynomially many type classes. where Pk = Pk ( Ix). the only possible values for . E r pni . A type pi defines a product measure Pn on An by the formula Pn (z?) = pi (zi). as shown in the Csiszdr-Körner book.13 (The number-of-types bound. (9). The concept of type extends to the empirical distribution of overlapping k-blocks. Theorem 1. with a < 1.) The number of possible k-types is at most (n — k + Note. [7]. while not tight. but satisfies k < a logiAi n. BASIC CONCEPTS. z E A' . if and only if each block all' appears in xÇ exactly (n — k + 1) pk (alic) times. . Pn (4) has the constant value 2' -17(P' ) on the type class of x7.13. yields Pn(x7) = 2—n17(PI) . is the correct asymptotic exponential bound for the size of a type class.64 CHAPTER 1. Theorem 1. Proof This follows from the fact that for each a E A.6.14 (The number of k-types bound. and k-type-equivalence classes are called k-type classes. Replacing Qn by Pi in the product formula. in particular. Two sequences x.) The number of possible types is at most (n + 1)1A I. The upper bound is all that will be used in this book. n. and hence = ypn. The k-type class of x i" will be denoted by Tpnk . x E In other words. which. Theorem 1.6.

.(n. 2. This yields the desired bound for the first-order case. ak ak by formula (8).e Exercises. ENTROPY AS EXPECTED VALUE. since P1) (4) is a probability measure on A n . since the k-type measures the frequency of overlapping k-blocks. > 0. the stationary (k — 1)-st order Markov chain ri 0c -1) with the empirical transition function (ak 14 -1 ) = Pk(ainxii ) „ f k-1 „ir‘ uk IA 1 ) E bk pk and with the stationary distribution given by The Markov chain ri(k-1) has entropy (k - "Aar') = Ebk Pk(ar i 1) -E 1. and hence P x E 2. i = 1. E ai . b.(1) and after suitable rewrite using the definition of W I) becomes 1 TV I) (Xlii ) = Ti-(1) (X02—(n-1)1 . 2. with Q replaced b y (2-1) = rf. The proof for general k is obtained by an obvious extension of the argument. that is. > 0.SECTION 1. Prove the log-sum inequality: If a. 65 Estimating the size of a k-type class is a bit trickier.15 (The k-type-class bound. A suitable bound can be obtained. b = E bi . Let P be the time-0 partition for an ergodic Markov chain with period d = 3 and entropy H > O. (a) Determine the entropies of the ergodic components of the (T3 . I.b xr it c Tp2. Note that it is constant on the k-type class 'Tk(xiii) = Tp k of all sequences that have the same k-type Pk as xrii.1>ii(1 ) IT P2 I — 1n since ii:(1) (xi) > 1/(n — 1).) I7k (fil)1 < (n — Proof First consider the case k = 2. This formula. . and a 1. The entropy FP") is called the empirical (k — 1)-st order Markov entropy of 4. If Q is Markov a direct calculation yields the formula Q(4) = Q(xi)n Q(bi a)(n—l)p2(ab) a.6. however. 2)-process.6.6. n. by considering the (k — 1)-st order Markov measure defined by the k-type plc ( 14). then E ai log(ai I bi ) ?_ a log(a/b).-j(a k Or 1 )Ti(a ki -1 ) log 75(akiar) E Pk(akil4) log Ti(ak lari ). Theorem 1.

1} z . (a) Show that D (P v 7Z I Q v > Q) for any partition R. (b) Determine the entropies of the ergodic components of the (T 3 . (Hint: let S be uniformly distributed on [1. and use the equality of process entropy and entropy-rate. Find the entropy of the (S. (b) Prove Pinsker's inequality. €)-entropy-typical sequences.) 7. The entropy theorem may be expressed by saying that xril is eventually almost surely entropy-typical.(Pi)1 A(Q i )). Let p be the time-zero partition of X and let Q = P x P. with the product measure y = j x it.i(4).(xi Ix). so H(Xr)1N H(Xr IS)1N.a Entropy-typical sequences. The divergence for k-set partitions is D(P Q) = Ei It(Pi)log(p. S) = H H (SIXr) = H(S)+ H(X l iv IS) for any N> n.bt 1 a2012 . The first is an expression of the entropy theorem in exponential form. (Hint: use part (a) to reduce to the two set case.) 4. measurable with respect to xn oc . Let p. 3. that is.7 Interpretations of entropy. I. which is the "random walk with random scenery" process discussed in Exercise 8 of Section 1. BASIC CONCEPTS. Let p be an ergodic process and let a be a positive number.u(Fn ) > 1 — a and so that if xn E Fn then 27" 11.7. (4) < t(xIx) 2".) 6. where jp .QI 1. that D (P Il Q) > (1 / 2 ln 2)IP — Q1 2 . Prove that the entropy of an n-stationary process {Yn } is the same as the entropy the stationary process {X„} obtained by randomizing the start. For E > 0 and n > 1 define En(E) 1 4: 2—n(h+E) < tt (x in) < 2—n(h-6)} The set T(E) is called the set of (n. be an ergodic process of entropy h.u(Pi ) . Q)-process. n] and note that H (X I v . namely.) Section 1.2. Show that the process entropy of a stationary process and its reversed process are the same. useful interpretations of entropy will be discussed in this section. Show there is an N = N(a) such that if n > N then there is a set Fn . and to the related building-block concept..) 5. Toy). y) = (T x . and define S(x. such that .(1) = 1/2. The second interpretation is the connection between entropy and expected code length for the special class of prefix codes. then use calculus. Let T be the shift on X = {-1. (Hint: apply the ergodic theorem to f (x) = — log //. Then show that I-1(r IS = s) is the same as the unshifted entropy of Xs/v+1 . (Hint: use stationarity. Two simple.66 CHAPTER I. if n and c are understood. which leads to the concept of entropy-typical sequence. p V Tp T22)-process. and let ji be the product measure on X defined by . or simply the set of entropy-typical sequences. (Hint: recurrence of simple random walk implies that all the sites of y have been almost surely visited by the past of the walk. Let Y = X x X.u0(-1) = p. .

) For each E > 0. eventually almost surely. as defined here. then . namely. The members of the entropy-typical set T(e) all have the lower bound 2 —n(h+f) on their probabilities. and p(C)> a}. the measure is eventually mostly concentrated on a set of sequences of the (generally) much smaller cardinality 2n(h+6) • This fact is of key importance in information theory and plays a major role in many applications and interpretations of the entropy concept. on the number of entropy-typical sequences.X1' E 'T. eventually almost surely.2. suggested by the upper bound. . The context usually makes clear the notion of typicality being used.(E12). which leads to the fact that too-small sets cannot be visited too often.1 (The typical-sequence form of entropy. (A2) Count down the list until the first time a total probability of at least a is reached. even though there are I A In possible sequences of length n. in information theory. (Al) List the n-sequences in decreasing order of probability. and sometimes it is just shorthand for those sequences that are likely to occur. Ar(a) is the minimum number of sequences of length n needed to fill an afraction of the total probability.u(Tn (E)) = 1. limn . eventually almost surely.3 (The too-small set principle. Here the focus is on the entropy-typical idea. it is enough to show that x g C.7. c A n and ICI _ Ci'.7. a)-covering number by Arn (a) = minfICI: C c An.4. eventually almost surely. For a > 0 define the (n. x E T(e). and the theorem is established." has different meaning in different contexts. Theorem 1. sometimes it means sequences that are both frequency-typical and entropy-typical. The preceding theorem provides an upper bound on the cardinality of the set of typical sequences and depends on the fact that typical sequences have a lower bound on their probabilities. fl Another useful formulation of entropy. The phrase "typical sequence. Theorem 1. 67 The convergence in probability form. Typical sequences also have an upper bound on their probabilities.7.) The set of entropy-typical sequences satisfies 17-n (01 < 2n (h+E) Thus.SECTION 1. this fact yields an upper bound on the cardinality of Tn . A good way to think of . is known as the asymptotic equipartition property or AEP. The cardinality bound on Cn and the probability upper bound 2 2) on members of Tn (E/2) combine to give the bound (Cn n Tn(E/2)) < 2n(12+02—n(h—E/2) < 2—nc/2 Thus .(E12)) is summable in n.7q If C.u(Cn n 'T. Proof Since . sometimes it means frequencytypical sequences. Theorem 1. Since the total probability is at most 1.7. expresses the connection between entropy and coverings. INTERPRETATIONS OF ENTROPY. Sometimes it means entropy-typical sequences.7.) < 2n(h--€). n > 1. n'T.i (e/ 2).2 (The entropy-typical cardinality bound. that is. as defined in Section 1.I\rn (a) is given by the following algorithm for its calculation. Theorem 1.
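Steps (A1) and (A2) translate directly into code. The sketch below is illustrative only and not part of the text: probs is assumed to be a table assigning its probability μ(x_1^n) to each n-sequence, and the function simply sorts the probabilities and counts how many of the largest ones are needed to reach total mass α.

    def covering_number(probs, alpha):
        # probs: dict mapping each n-sequence to its probability
        total, count = 0.0, 0
        for p in sorted(probs.values(), reverse=True):   # (A1) decreasing order
            total += p
            count += 1
            if total >= alpha:                            # (A2) stop once mass alpha is reached
                return count
        return count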

defined by if a = b otherwise. .(4) E x.7. < 2—n(h—E) 1— -6 .7.(x) < 2 —n(h—E) . below. defined by 0 d (a .Jce) < iTn (01 < 2 n0-1-0 .2. S) from 4 to a set S C An is defined by dn (x' . Let d(a. If n is large enough then A(T. and the 6-neighborhood or 6-blowup of S is defined by [S ]. that a rotation process has entropy O. fl The covering exponent idea is quite useful as a tool for entropy estimation.. the covering exponent (1/n)logArn (a) converges to h.) For each a > 0. . The connection with entropy is given by the following theorem.i (E)) eventually exceeds a.I = (a). S) = min{dn (4.5_ h. as n oc. The covering number Arn (a) is the count obtained in (A2). which proves the theorem.4 (The covering-exponent theorem. suppose. see. for E rn (E then implies that ACTn(E) n = p. b) = J 1 1 and extend to the metric d on A'. To state this precisely let H(8) = —6 log 6 (1— 6) log(1 — 8) denote the binary entropy function. Proof Fix a E (0. Since the measure of the set of typical sequences goes to I as n oc. It is important that a small blowup does not increase size by more than a small exponential factor. The distance dn (x.e. for example.n GI. for each n.7.i(C) > a and I C. 1 dn (4 . and hence. On the other hand. ) and hence Nn (a) = ICn > IT(e) n Cn > 2n (h-E'A(Tn(E) n cn) > 2n(h-E)a(i 6). which will be discussed in the following paragraphs. yri`) = — n i=1 yi)• The metric cin is also known as the per-letter Hamming distance. y. BASIC CONCEPTS. the measure . Theorem 1.u(T. = S) < 81.(6) n cn ) > a(1 — c). When this happens N. b) be the (discrete) metric on A. 1) and c > O. lim supn (l/n) log Nn (a) . The connection between coverings and entropy also has as a useful approximate form. The fact that p. by Theorem 1.68 CHAPTER I. the proof in Section I.' E S}.Ern (onc.

. are allowed to depend on the sequence xr. then Irrn(E)ls1 _ both go to 0 as El will hold.5. M]. with spacers allowed between the blocks. This proves the blowup-bound lemma. 8) oc.5. that is. for all n. i c I.b The building-block concept. thought of as the set of building blocks.7. A sequence xr is said to be (1 — 0-built-up from the building blocks B. 8) = min{ICI: C c An. there are at most nS positions i in which xi can be changed to create a member of [ { x7 } 13.4 then exists as n establishes that 1 lim lim — log Afn (a. m i ] for which x .. mi]: i < 1). Both the number I and the intervals [ni . The reader should also note the concepts of blowup and built-up are quite different. It is left to the reader to prove that the limit of (1/n) log Nen (a." from which longer sequences are made by concatenating typical sequences. it follows that if 8 is small enough. such that (a) Eil. provided only that n is not too small. to say that 41 is 1-built-up from Bn is the same as saying that xr is a concatenation of blocks from B„.7. Each such position can be changed in IA — 1 ways. if there is an integer I = I(41 ) and a collection llni. it is the minimum size of an n-set for which the 8-blowup covers an a-fraction of the probability. This idea and several useful consequences of it will now be developed.. given c > 0 there is a S > 0 such that Ir1n(c)131 for all n.7. subject only to the requirement that the total length of the spacers be at most SM. together with Theorem 1.7. Lemma 1. In the special case when 8 = 0 and M is a multiple of n. INTERPRETATIONS OF ENTROPY. 6—>0 n—> oo n I. An application of Lemma 1. implies that I[{4}]51 < yields the stated bound. If 8 > 0 then the notion of (1-0-built-up requires that xr be a concatenation of blocks from 13. < 2n(h-1-€) since (IAI — 1)n8 < 20 log IAI and since Slog IA I and H(8) Since IT(e)1 _ < 2n (h+26) 0. is required to be a member of 13. Proof Given x. and //([C]3 ) a}. the entropy-typical n-sequences can be thought of as the "building blocks. The blowup form of the covering number is defined by (1) Arn (a. the _ 2nH(3) (IAI — 1) n8 . The . < n(h+2€) 2 In particular. Lemma 1. c An . Fix a collection 5. An application of the ergodic theorem shows that frequency-typical sequences must consist mostly of n-blocks which are entropy-typical. Also fix an integer M > n and 8 > O. with occasional spacers inserted between the blocks.SECTION 1.) The 6-blowup of S satisfies 69 I[S]81 ISI2 n11(6) (1A1 — 1) n3 .4.5 (The blowup bound. Thus. which combinations bound.7. 8) = h.=1 (mi — n i + 1) > (1 — (b) E I3. In particular. of disjoint n-length subintervals of [1.

An important fact about the building-block concept is that if δ is small and n is large, then the set of M-sequences that can be (1 - δ)-built-up from a given set B_n ⊂ A^n of building blocks cannot be exponentially much larger in cardinality than the set of all sequences that can be formed by selecting M/n sequences from B_n and concatenating them without spacers. This is stated in precise form as the following lemma.

Lemma I.7.6 (The built-up set bound.) Let D_M be the set of all sequences x_1^M that can be (1 - δ)-built-up from a given collection B_n ⊂ A^n. Then

|D_M| ≤ 2^{MH(δ)} |A|^{δM} |B_n|^{M/n}.

In particular, if B_n = T_n(ε), the set of entropy-typical n-sequences relative to ε, and if δ is small enough, then |D_M| ≤ 2^{M(h+2ε)}, provided M is large enough. As usual, H(δ) = -δ log δ - (1 - δ) log(1 - δ) denotes the binary entropy function.

Proof. The number of ways to select a family {[n_i, m_i]} of disjoint n-length subintervals that cover all but a δ-fraction of [1, M] is upper bounded by the number of ways to select at most δM points from a set with M members, which is, in turn, upper bounded by 2^{MH(δ)}, by the combinations bound. For a fixed configuration of locations, the number of ways to fill these with members of B_n is upper bounded by |B_n|^{M/n}, and the number of ways to fill the places that are not in ∪_i [n_i, m_i] is upper bounded by |A|^{δM}. Thus

|D_M| ≤ 2^{MH(δ)} |B_n|^{M/n} |A|^{δM},

which is the desired bound. The bound for entropy-typical building blocks follows immediately from the fact that |T_n(ε)| ≤ 2^{n(h+ε)}. This establishes the lemma.

The proof of this fact is similar in spirit to the proof of the key bound used to establish the entropy theorem, though simpler since now the blocks all have a fixed length. The building-block idea is closely related to the packing/covering ideas discussed earlier. The argument used to prove the packing lemma can be used to show that almost strongly-covered sequences are also almost built-up. A sequence x_1^M is said to be (1 - δ)-strongly-covered by B_n ⊂ A^n if

|{i ∈ [1, M - n + 1]: x_i^{i+n-1} ∉ B_n}| ≤ δ(M - n + 1).

This is stated in precise form as the following lemma.

Lemma I.7.7 (The building-block lemma.) If x_1^M is (1 - δ/2)-strongly-covered by B_n, and if M ≥ 2n/δ, then x_1^M is (1 - δ)-built-up from B_n.

Proof. Put m_0 = 0 and, for i > 0, define n_i to be the least integer k > m_{i-1} such that x_k^{k+n-1} ∈ B_n, and put m_i = n_i + n - 1, stopping when within n of the end of x_1^M. The assumption that x_1^M is (1 - δ/2)-strongly-covered by B_n implies that there are at most δM/2 indices j ≤ M - n + 1 which are not contained in one of the [n_i, m_i], while the condition that M ≥ 2n/δ implies that there are at most δM/2 indices j ∈ [M - n + 1, M] which are not contained in one of the [n_i, m_i]. The lemma is therefore established.

Remark I.7.8
The preceding two lemmas are strictly combinatorial, for no upper and lower bounds on probabilities are given. In combination with the ergodic and entropy theorems they provide powerful tools for analyzing the combinatorial properties of partitions of sample paths, a subject to be discussed in the next chapter.

The almost-covering principle implies that, eventually almost surely, x_1^M is almost strongly-covered by sequences from T_n(ε), hence, by the building-block lemma, eventually almost surely mostly built-up from members of T_n(ε). A suitable application of the built-up set lemma, Lemma I.7.6, then implies that the set D_M of sequences that are mostly built-up from the building blocks T_n(ε) will eventually have cardinality not exponentially much larger than 2^{M(h+ε)}. The sequences in D_M are not necessarily entropy-typical; all that is known is that the members of D_M are almost built-up from entropy-typical n-blocks, where n is a fixed large number, and that x_1^M is eventually in D_M. This result, which can be viewed as the entropy version of the finite-sequence form of the frequency-typical-sequence characterization of ergodic processes, will be surprisingly useful later.
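The greedy parsing used in the proof of the building-block lemma is easy to carry out explicitly. The following sketch, which is not from the text, scans left to right and accepts a length-n window as a building block whenever it lies in B_n; the toy block set and sequence are arbitrary choices.

```python
# Greedy parsing into building blocks, as in the proof of Lemma I.7.7.
def built_up_intervals(x, blocks_n, n):
    intervals, k = [], 0
    while k + n <= len(x):
        if tuple(x[k:k + n]) in blocks_n:
            intervals.append((k, k + n - 1))   # a building block [n_i, m_i]
            k += n
        else:
            k += 1                             # a spacer position
    return intervals

x = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]
B3 = {(0, 1, 1), (1, 1, 0)}                    # hypothetical building blocks
ivs = built_up_intervals(x, B3, n=3)
covered = sum(b - a + 1 for a, b in ivs)
print(ivs, "covered fraction:", covered / len(x))
```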

I.7.c Entropy and prefix codes.

In a typical data compression problem a given finite sequence x_1^n, drawn from some finite alphabet A, is to be mapped into a binary sequence b_1^r = b_1 b_2 ... b_r, whose length r may depend on x_1^n, in such a way that the source sequence x_1^n is recoverable from knowledge of the encoded sequence b_1^r. The goal is to use as little storage space as possible, that is, to make the code length as short as possible, at least in some average sense. Of course, there are many possible source sequences, so that a typical code must be designed to encode a large number of different sequences and accurately decode them. A standard model is to think of a source sequence as a (finite) sample path drawn from some ergodic process; for this reason, in information theory a stationary, ergodic process with finite alphabet A is usually called a source. In this section the focus will be on a special class of codes, known as prefix codes, for which there is a close connection between code length and source entropy.

In the following discussion B* denotes the set of all finite-length binary sequences and ℓ(w) denotes the length of a member w ∈ B*. A (binary) code on A is a mapping C: A → B*. A code is said to be faithful or noiseless if it is one-to-one. An image C(a) is called a codeword, and the range of C is called the codebook. The function that assigns to each a ∈ A the length of the code word C(a) is called the length function of the code and is denoted by ℒ(·|C), or by ℒ if C is understood; formally, it is the function defined by ℒ(a|C) = ℓ(C(a)). The expected length of a code C relative to a probability distribution μ on A is

E_μ(ℒ) = Σ_a ℒ(a|C) μ(a).

The entropy of μ is defined by the formula

H(μ) = Σ_{a∈A} μ(a) log (1/μ(a)),

that is, H(μ) is just the expected value of the random variable -log μ(X), where X has the distribution μ. (As usual, base 2 logarithms are used.) Without further assumptions about the code there is little connection between entropy and expected code length. For codes that satisfy a simple prefix condition, however, entropy is a lower bound to code length, a lower bound which is almost tight.

To develop the prefix code idea some notation and terminology will be introduced. For u = u_1^n ∈ B* and v = v_1^m ∈ B*, the concatenation uv is the sequence of length n + m defined by (uv)_i = u_i, for 1 ≤ i ≤ n, and (uv)_i = v_{i-n}, for n < i ≤ n + m. A nonempty word u is called a prefix of a word w if w = uv; the prefix u is called a proper prefix of w if w = uv where v is nonempty. A nonempty set W ⊂ B* is prefix free if no member of W is a proper prefix of any member of W.

A prefix-free set W has the property that it can be represented as the leaves of the (labeled, directed) binary tree T(W), defined by the following two conditions.

(i) The vertex set V(W) of T(W) is the set of prefixes of members of W, together with the empty word Λ.
(ii) If v ∈ V(W) has the form v = ub, where b ∈ B, then there is a directed edge from u to v, labeled by b.

Since W is prefix free, the words w ∈ W correspond to leaves L(T) of the tree T(W). (See Figure I.7.9.) Note that the root of the tree is the empty word and the depth d(v) of a vertex v is just the length of the word v.

Figure I.7.9  The binary tree for W = {00, 0100, 0101, 100, 101}.

An important, and easily proved, property of binary trees is that the sum of 2^{-d(v)} over all leaves v of the tree can never exceed 1. For prefix-free sets W of binary sequences this fact takes the form

(2)  Σ_{w∈W} 2^{-ℓ(w)} ≤ 1,

where ℓ(w) denotes the length of w ∈ W.
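A quick computation, not part of the text, checks the prefix-free property and the sum in (2) for the codebook of Figure I.7.9.

```python
# Check prefix-freeness and the Kraft sum (2) for W = {00, 0100, 0101, 100, 101}.
W = ["00", "0100", "0101", "100", "101"]

def is_prefix_free(words):
    return not any(u != w and w.startswith(u) for u in words for w in words)

kraft_sum = sum(2.0 ** -len(w) for w in W)
print(is_prefix_free(W), kraft_sum)   # True, 0.625 <= 1
```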

A geometric proof of this inequality is sketched in Exercise 10.

A code C is a prefix code if ã = a whenever C(a) is a prefix of C(ã). Thus C is a prefix code if and only if C is one-to-one and the range of C is prefix free. The codewords of a prefix code are therefore just the leaves of a binary tree; since the code is one-to-one, the leaf corresponding to C(a) can be labeled with a or with C(a). For prefix codes, the Kraft inequality

(3)  Σ_{a∈A} 2^{-ℒ(a|C)} ≤ 1

holds. The Kraft inequality has the following converse.

Lemma I.7.10
If ℓ_1 ≤ ℓ_2 ≤ ... ≤ ℓ_t is a nondecreasing sequence of positive integers such that Σ_i 2^{-ℓ_i} ≤ 1, then there is a prefix-free set W = {w(i): 1 ≤ i ≤ t} such that ℓ(w(i)) = ℓ_i, for i ∈ [1, t].

Proof. Define i and j to be equivalent if ℓ_i = ℓ_j and let {G_1, ..., G_s} be the equivalence classes, written in order of increasing length L(r). The Kraft inequality then takes the form

Σ_{r=1}^s |G_r| 2^{-L(r)} ≤ 1.

Assign the indices in G_1 to binary words of length L(1) in some one-to-one way, which is possible since |G_1| ≤ 2^{L(1)}. Assign G_2 in some one-to-one way to binary words of length L(2) that do not have already assigned words as prefixes. This is possible since |G_1| 2^{-L(1)} + |G_2| 2^{-L(2)} ≤ 1, so that |G_2| ≤ 2^{L(2)} - |G_1| 2^{L(2)-L(1)}, where the second term on the right gives the number of words of length L(2) that have prefixes already assigned. An inductive continuation of this assignment argument clearly establishes the lemma.

For prefix codes, entropy is always a lower bound on average code length, a lower bound that is "almost" tight. This fundamental connection between entropy and prefix codes was discovered by Shannon.
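The assignment argument in the proof of Lemma I.7.10 can be realized by a standard "canonical" construction; the sketch below is one such realization, not the book's procedure verbatim, and assumes the lengths are given in nondecreasing order.

```python
# Build a prefix-free codebook from nondecreasing lengths with Kraft sum <= 1:
# hand out, for each length, the numerically smallest word that extends no
# previously assigned codeword.
def prefix_code_from_lengths(lengths):
    assert sum(2.0 ** -l for l in lengths) <= 1.0
    words, next_value, prev_len = [], 0, 0
    for l in lengths:
        next_value <<= (l - prev_len)           # keep the same left endpoint, longer word
        words.append(format(next_value, "b").zfill(l))
        next_value += 1                         # move past the word just assigned
        prev_len = l
    return words

print(prefix_code_from_lengths([2, 3, 3, 4, 4]))  # ['00', '010', '011', '1000', '1001']
```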

Theorem I.7.11 (Entropy and prefix codes.) Let μ be a probability distribution on A.
(i) H(μ) ≤ E_μ(ℒ(·|C)), for any prefix code C.
(ii) There is a prefix code C such that E_μ(ℒ(·|C)) ≤ H(μ) + 1.

Proof. Let C be a prefix code. A little algebra on the expected code length yields

E_μ(ℒ(·|C)) = Σ_a ℒ(a|C) μ(a) = Σ_a μ(a) log [μ(a)/2^{-ℒ(a|C)}] + Σ_a μ(a) log (1/μ(a)).

The second term is just H(μ), while the divergence inequality, applied to the measure ν defined by ν(a) = 2^{-ℒ(a|C)}, which satisfies ν(A) ≤ 1 by the Kraft inequality, implies that the first term is nonnegative. This proves (i).

To prove part (ii) define ℒ(a) = ⌈-log μ(a)⌉, where ⌈x⌉ denotes the least integer ≥ x, and note that

Σ_a 2^{-ℒ(a)} ≤ Σ_a 2^{log μ(a)} = Σ_a μ(a) = 1.

Thus Lemma I.7.10 produces a prefix code C(a) = w(a), a ∈ A, whose length function is ℒ. Since ℒ(a) ≤ 1 - log μ(a), it follows that

E_μ(ℒ) = Σ_a ℒ(a) μ(a) ≤ 1 - Σ_a μ(a) log μ(a) = 1 + H(μ).

Thus part (ii) is proved, which completes the proof of the theorem.

Next consider n-codes, that is, mappings C_n: A^n → B* from source sequences of length n to binary words. A faithful n-code C_n: A^n → B* is a prefix n-code if and only if its range is prefix free. Of interest for n-codes is the per-symbol code length ℒ(x_1^n|C_n)/n and the expected per-symbol code length E(ℒ(·|C_n))/n. Let μ_n be a probability measure on A^n with per-symbol entropy H_n(μ_n) = H(μ_n)/n. Theorem I.7.11(i) takes the form

(4)  H_n(μ_n) ≤ (1/n) E(ℒ(·|C_n)),

for any prefix n-code C_n, while Theorem I.7.11(ii) asserts the existence of a prefix n-code C_n such that

(5)  (1/n) E(ℒ(·|C_n)) ≤ H_n(μ_n) + 1/n.

A sequence {C_n} such that, for each n, C_n is a prefix n-code with length function ℒ(·|C_n) will be called a prefix-code sequence. The (asymptotic) rate of such a code sequence, relative to a process {X_n} with Kolmogorov measure μ, is defined by

R_μ({C_n}) = lim sup_{n→∞} (1/n) E(ℒ(X_1^n|C_n)).

The bounds (4) and (5) then yield the following asymptotic results. In the next chapter, almost-sure versions of these two results will be obtained for ergodic processes.

Theorem I.7.12 (Process entropy and prefix-codes.) If μ is a stationary process with process entropy H then
(a) There is a prefix-code sequence {C_n} such that R_μ({C_n}) ≤ H.
(b) There is no prefix-code sequence {C_n} such that R_μ({C_n}) < H.

Thus "good" codes exist, that is, it is possible to compress as well as process entropy in the limit, while "too-good" codes do not exist, that is, no sequence of codes can asymptotically compress more than process entropy.
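A quick numerical illustration, not from the text, of the code used in the proof of part (ii): with lengths ℒ(a) = ⌈-log2 μ(a)⌉ the expected length falls between H(μ) and H(μ) + 1, and the Kraft sum is at most 1. The distribution below is an arbitrary toy choice.

```python
# Shannon-type code lengths L(a) = ceil(-log2 mu(a)) versus entropy.
from math import ceil, log2

mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {a: ceil(-log2(p)) for a, p in mu.items()}

H = -sum(p * log2(p) for p in mu.values())
EL = sum(mu[a] * lengths[a] for a in mu)
kraft = sum(2.0 ** -l for l in lengths.values())
print("H =", H, " E(L) =", EL, " Kraft sum =", kraft)   # H <= E(L) <= H + 1
```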

Remark I.7.13
A prefix code with length function ℒ(a) = ⌈-log μ(a)⌉ will be called a Shannon code. Shannon's theorem implies that a Shannon code is within 1 of being the best possible prefix code in the sense of minimizing expected code length, since the code length function ℒ(a) = ⌈-log μ(a)⌉ is closely related to the function -log μ(a), whose expected value is entropy. A somewhat more complicated coding procedure, called Huffman coding, produces prefix codes that actually minimize expected code length. Huffman codes have considerable practical significance, but for n-codes the per-letter difference in expected code length between a Shannon code and a Huffman code is at most 1/n, which is asymptotically negligible. For this reason no use will be made of Huffman coding in this book.

I.7.d Converting faithful codes to prefix codes.

Next it will be shown that, as far as asymptotic results are concerned, there is no loss of generality in assuming that a faithful code is a prefix code, so that asymptotic results about prefix codes automatically apply to the weaker concept of faithful code. The key to this is the fact that a faithful n-code can always be converted to a prefix code by adding a header (i.e., a prefix) to each codeword to specify its length. This can be done in such a way that the length of the header is (asymptotically) negligible compared to total codeword length. The following is due to Elias, [8].

Lemma I.7.14 (The Elias-code lemma.)
There is a prefix code E: N → B* such that ℓ(E(n)) = log n + o(log n).

Proof. The code word assigned to n is a concatenation of three binary sequences, E(n) = u(n)v(n)w(n). The third part w(n) is the usual binary representation of n; its length is ⌈log(n + 1)⌉. The second part v(n) is the binary representation of the length of w(n). The first part u(n) is just a sequence of 0's of length equal to the length of v(n), so that both u(n) and v(n) have length ⌈log(1 + ⌈log(n + 1)⌉)⌉. Thus, for example, E(12) = 0001001100, since w(12) = 1100, v(12) = 100, and u(12) = 000.

The code is a prefix code, for if u(n)v(n)w(n) = u(m)v(m)w(m)w', then u(n) = u(m), since both consist only of 0's and the first bit of both v(n) and v(m) is a 1. But then v(n) = v(m), since both have length equal to the length of u(n), and hence w(n) = w(m), since both have length specified by v(n). Thus w' is empty and n = m. The desired bound ℓ(E(n)) = log n + o(log n) follows easily, so the lemma is established.

Any prefix code with this property is called an Elias code.
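The three-part construction in the proof is simple to transcribe directly; the sketch below (with hypothetical function names) reproduces the example E(12) = 0001001100 and shows how the header is decoded.

```python
# The Elias construction E(n) = u(n) v(n) w(n) described above, with a decoder.
def elias(n):
    w = format(n, "b")              # usual binary representation of n
    v = format(len(w), "b")         # binary representation of the length of w
    u = "0" * len(v)                # zeros, as long as v
    return u + v + w

def elias_decode(s):
    k = s.index("1")                # length of u, which equals the length of v
    length_w = int(s[k:2 * k], 2)   # v gives the length of w
    return int(s[2 * k:2 * k + length_w], 2)

print(elias(12), elias_decode(elias(12)))   # 0001001100 12
```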

Given a faithful n-code C_n with length function ℒ(·|C_n), a prefix n-code C_n* is obtained by the formula

(6)  C_n*(x_1^n) = E(ℒ(x_1^n|C_n)) C_n(x_1^n),  x_1^n ∈ A^n,

where E is an Elias prefix code on the integers. The code C_n*, which is clearly a prefix code, will be called an Elias extension of C_n. The decoder reads through the header information and learns where C_n(x_1^n) starts and how long it is; since C_n was assumed to be faithful, this enables the decoder to determine the preimage x_1^n. For example, if C_n(x_1^n) = 001001100111, a word of length 12, then

C_n*(x_1^n) = 0001001100,001001100111,

where, for ease of reading, a comma was inserted between the header information, E(ℒ(x_1^n|C_n)), and the code word C_n(x_1^n). In the example the header is almost as long as C_n(x_1^n), but header length becomes negligible relative to the length of C_n(x_1^n) as codeword length grows. Thus Theorem I.7.12(b) extends to the following somewhat sharper form.

Theorem I.7.15
If μ is a stationary process with entropy-rate H there is no faithful-code sequence {C_n} such that R_μ({C_n}) < H.

(The definition of faithful-code sequence is obtained by replacing the word "prefix" by the word "faithful" in the definition of prefix-code sequence.)

Remark I.7.16
Another application of the Elias prefix idea converts a prefix-code sequence {C_n} into a single prefix code C: A* → B*, where C is defined by C(x_1^n) = E(n) C_n(x_1^n), for x_1^n ∈ A^n and n = 1, 2, ....

I.7.e Rotation processes have entropy 0.

As an application of the covering-exponent interpretation of entropy it will be shown that ergodic rotation processes have entropy 0. Let T be the transformation defined by T: x ↦ x ⊕ α, x ∈ [0, 1), where α is irrational and ⊕ is addition modulo 1, and let P be a finite partition of the unit interval [0, 1). Geometric arguments will be used to estimate the number of possible (T, P)-names of length n. These will produce an upper bound of the form 2^{nε_n}, where ε_n → 0 as n → ∞, and the result then follows from the covering-exponent interpretation of entropy. The proof will be based on direct counting arguments rather than the entropy formalism. The proof for the special two-set partition P = (P_0, P_1), where P_0 = [0, 0.5) and P_1 = [0.5, 1), will be given in detail. The argument generalizes easily to arbitrary partitions into subintervals, and, with a bit more effort, to general partitions (by approximating by partitions into intervals).

Proposition I.7.17 The (T, P)-process has entropy 0.

The key to the proof is the following uniform distribution property of irrational rotations.

Theorem I.7.18 (The uniform distribution property.)

lim_{n→∞} (1/n) |{m ∈ [1, n]: T^m x ∈ I}| = λ(I),

for any interval I ⊂ [0, 1) and any x ∈ [0, 1), where λ denotes Lebesgue measure.

Ergodicity asserts the truth of this statement for almost every x. The fact that rotations are rigid motions allows the almost-sure result to be extended to a result that holds for every x. The proof is left as an exercise.

The key to the proof of the entropy 0 result is that, given z and x, some power y = T^j x of x must be close to z, and hence the (T, P)-names of y and z will agree most of the time. But translation is a rigid motion, so that T^m y and T^m z will be close for all m. Thus every name is obtained by changing the name of some fixed z in a small fraction of places and shifting by a bounded amount, and hence there cannot be exponentially very many names.

To make the preceding argument precise define the metric |x - y|_1 = min{|x - y|, 1 - |x - y|} on [0, 1) and the pseudometric

d_n(x, y) = d_n(x_1^n, y_1^n) = (1/n) Σ_{i=1}^n d(x_i, y_i),

where d(a, b) = 0 if a = b, d(a, b) = 1 if a ≠ b, and {x_n} and {y_n} denote the (T, P)-names of x and y, respectively. The first lemma shows that d_n(x, y) is continuous relative to |x - y|_1, uniformly in n for large n.

Lemma I.7.19
Given ε > 0 there is an N such that d_n(x, y) ≤ ε, whenever |x - y|_1 ≤ ε/4 and n ≥ N.

Proof. Suppose |x - y|_1 ≤ ε/4. Without loss of generality it can be assumed that x < y. Consider the case when |x - y|_1 = |x - y| = y - x; the case when |x - y|_1 = 1 - |x - y| can be treated in a similar manner. Let I = [x, y]. The names of x and y disagree at time j if and only if T^j x and T^j y belong to different atoms of the partition P, which occurs when and only when either T^{-j} 0 ∈ I or T^{-j}(0.5) ∈ I. The uniform distribution property, applied to T^{-1}, which is rotation by -α, provides an N = N(x, y) such that if n ≥ N then

d_n(x, y) ≤ (1/n) |{j ∈ [1, n]: T^{-j} 0 ∈ I or T^{-j}(0.5) ∈ I}| ≤ 2λ(I) + ε/2 ≤ ε,

since λ(I) ≤ ε/4. Compactness then shows that N can be chosen to be independent of x and y, which proves the lemma.
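A short simulation, not from the text, illustrates Lemma I.7.19: the names of two nearby points of an irrational rotation disagree only in a small fraction of places. The rotation angle and the two points are arbitrary choices, and the observed disagreement frequency should be about 2|x - y|_1 by the uniform distribution property.

```python
# Compare the (T,P)-names of two nearby points for rotation by an irrational alpha.
from math import sqrt

alpha = sqrt(2) - 1                      # an irrational rotation angle

def name(x, n):                          # name for P0 = [0, .5), P1 = [.5, 1)
    out = []
    for _ in range(n):
        out.append(0 if x < 0.5 else 1)
        x = (x + alpha) % 1.0
    return out

def dn(u, v):
    return sum(a != b for a, b in zip(u, v)) / len(u)

n = 100000
x, y = 0.3, 0.31                         # |x - y|_1 = 0.01
print(dn(name(x, n), name(y, n)))        # expected to be near 0.02
```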

The next lemma is just the formal statement that the name of any point x can be shifted by a bounded amount to obtain a sequence close to the name of z = 0 in the sense of the pseudometric d_n.

Lemma I.7.20
Given ε > 0, there is an integer M and an integer K ≥ M such that for each x ∈ [0, 1) there is a j = j(x) ∈ [1, M] such that if z = 0 and y = T^j x, then d_k(z_1^k, y_1^k) ≤ ε, for k ≥ K.

Proof. Given ε > 0, choose N from the preceding lemma so that if n ≥ N and |x - y|_1 ≤ ε/4 then d_n(x, y) ≤ ε. Given x, apply Kronecker's theorem to obtain a least positive integer j = j(x) such that |T^j x - 0|_1 ≤ ε/4. Since small changes in x do not change j(x), there is a number M such that j(x) ≤ M for all x. With K = M + N the lemma follows.

Proof of Proposition I.7.17. Now for the final details of the proof that the entropy of the (T, P)-process is 0. Given ε > 0, determine M and K ≥ M from the preceding lemma, let n ≥ K + M, and let z = 0. The unit interval X can be partitioned into measurable sets X_j = {x: j(x) = j}, j ≤ M, where j(x) is given by the preceding lemma. There are at most 2^j possible values for the first j places of any name, and, according to the combinations bound, the number of binary sequences of length n - j that can be obtained by changing the (n - j)-name of z in at most ε(n - j) places is upper bounded by 2^{(n-j)h(ε)}, where h(ε) = -ε log ε - (1 - ε) log(1 - ε) is the binary entropy function. Thus an upper bound on the number of possible names of length n for members of X_j is

(7)  |{x_1^n: x ∈ X_j}| ≤ 2^j 2^{(n-j)h(ε)} ≤ 2^M 2^{nh(ε)}.

Since there are only M possible values for j, the number of possible rotation names of length n is upper bounded by M 2^M 2^{nh(ε)}. Since h(ε) → 0 as ε → 0, this completes the proof that the (T, P)-process has entropy 0.

I.7.f Exercises.

1. Give a direct proof of the covering-exponent theorem, Theorem I.7.4, based on the algorithm (A1, A2) given for the computation of the covering number. (Hint: what does the entropy theorem say about the first part of the list?)

2. Let H(μ) denote the entropy of μ, and let μ̄ be the concatenated-block process defined by the measure μ on A^n.
(a) Show that H_{mn}(μ̄) ≤ mH(μ) + log n.
(b) Show that H_{mn}(μ̄) ≥ mH(μ).
(Hint: there is at least one way and at most n|A|^{2n} ways to express a given x_1^{mn} in the form u w(1) w(2) ... w(k - 1) v, where each w(i) ∈ A^n and u and v are, respectively, the tail and head of words w(0), w(k) ∈ A^n.)

3. Show that lim_n (1/n) log N_n(α, δ) exists, where N_n(α, δ) is defined in (1).

4. Let {Y_n} be the process obtained from the stationary ergodic process {X_n} by applying the block code C_N and randomizing the start. Show that H({Y_n}) ≤ H({X_n}).

5. Let μ be a shift-invariant ergodic measure on A^Z and let B ⊂ A^Z be a measurable set of positive measure. Show, using the entropy theorem, that the entropy of the induced process is H(μ)/μ(B).

6. Use the covering-exponent concept to show that the entropy of an ergodic n-stationary process is the same as the entropy of the stationary process obtained by randomizing its start.

7. Show that the entropy-rate of an ergodic process and its reversed process are the same.

8. Suppose the waiting times between occurrences of "1" for a binary process μ are independent and identically distributed. Find an expression for the entropy of μ in terms of the entropy of the waiting-time distribution.

9. Show that if C_n: A^n → B* is a prefix code with length function ℒ, then there is a prefix code C̃_n: A^n → B* with length function ℒ̃ such that
(a) ℒ̃(x_1^n) ≤ ℒ(x_1^n) + 1, for all x_1^n ∈ A^n, and
(b) ℒ̃(x_1^n) ≤ 2 + 2n log |A|, for all x_1^n ∈ A^n.

10. Let C be a prefix code on A with length function L. Associate with each a ∈ A the dyadic subinterval of the unit interval [0, 1] of length 2^{-L(a)} whose left endpoint has dyadic expansion C(a). Show that the resulting set of intervals is disjoint, and that this implies the Kraft inequality (2). Use this idea to prove that if Σ_a 2^{-L(a)} ≤ 1, then there is a prefix code with length function L(a).

11. Prove the uniform distribution property, Theorem I.7.18.

Section I.8 Stationary coding.

Stationary coding was introduced in an example in Section I.1. Two basic results will be established in this section. The first is that stationary codes can be approximated by finite codes. Often a property can be established by first showing that it holds for finite codes, then passing to a suitable limit. The second basic result is a technique for creating stationary codes from block codes. In some cases it is easy to establish a property for a block coding of a process, then extend using the block-to-stationary code construction to obtain the desired property for a stationary coding.

I.8.a Approximation by finite codes.

The definition and notation associated with stationary coding will be reviewed and the basic approximation by finite codes will be established in this subsection. Recall that a stationary coder is a measurable mapping F: A^Z → B^Z such that F(T_A x) = T_B F(x), for all x ∈ A^Z,

80

CHAPTER I. BASIC CONCEPTS.

where TA and TB denote the shifts on the respective two-sided sequence spaces A z and B z . Given a (two-sided) process with alphabet A and Kolmogorov measure ,U A , the encoded process has alphabet B and Kolmogorov measure A B = /LA o F. The associated time-zero coder is the function f: A z 1--* B defined by the formula f (x) = F (x )o. Associated with the time-zero coder f: Az 1---> B is the partition P = 19 (f) = {Pb: h E B} of A z , defined by Pb = {X: f(x) = 1, } , that is,
(1)
X E h

if and only if f (x) = b.

Note that if y = F(x) then yn = f (Tnx) if and only if Tx E Pyn . In other words, y is the (T, 2)-name of x and the measure A B is the Kolmogorov measure of the (T, P)process. The partition 2( f) is called the partition defined by the encoder f or, simply, the encoding partition. Note also that a measurable partition P = {Pb: h E BI defines an encoder f by the formula (1), such that P = 2(f ), that is, P is the encoding partition for f. Thus there is a one-to-one correspondence between time-zero coders and measurable finite partitions of Az . In summary, a stationary coder can be thought of as a measurable function F: A z iB z such that F oTA = TB o F , or as a measurable function f: A z 1--* B, or as a measurable partition P = {Pb: h E B } . The descriptions are connected by the relationships
f (x) = F (x)0 , F (x) n = f (T n x), Pb = lx: f (x) = b).

A time-zero coder f is said to be finite (with window half-width w) if f (x) = f (i), whenever .0'. = i" vu,. In this case, the notation f (xw u.,) is used instead of f (x). A key fact for finite coders is that
=
Ç

U {x, n 71± —w wl ,
( x ,' „u t u,' ) = .
, fl

and, as shown in Example 1.1.9, for any stationary measure ,u, cylinder set probabilities for the encoded measure y = ,u o F -1 are given by the formula

(2)

y()1) = it(F-1 [yi])

=

E it(xi —+:) .
f (4 ± - :)=y' it

The principal result in this subsection is that time-zero coders are "almost finite", in the following sense.

Theorem 1.8.1 (Finite-coder approximation.) If f: Az 1--* B is a time-zero coder, if bt is a shift-invariant measure on A z , and if Az 1---> B such that E > 0, then there is a finite time-zero encoder

7:

1.i (ix: f (x) 0 7(x)}) _< c. Proof Let P = {Pb: b E B) be the encoding partition for f. Since p is a finite partition of A z into measurable sets, and since the measurable sets are generated by the finite cylinder sets, there is a positive integer w and a partition 1 5 = {Pb } with the following two properties
"../

(a) Each Pb is union of cylinder sets of the form

SECTION I8. STATIONARY CODING.

81

(h)
I../

Eb kt(PbA 7)1,) < E.
r',I 0.s•s

Let f be the time-zero encoder defined by f(x) = b, if [xi'.] c Pb. Condition (a) assures that f is a finite encoder, and p,({x: f (x) 0 f (x))) < c follows from condition I=1 (b). This proves the theorem.

Remark 1.8.2
The preceding argument only requires that the coded process be a finite-alphabet process. In particular, any stationary coding onto a finite-alphabet process of any i.i.d. process of finite or infinite alphabet can be approximated arbitrarily well by finite codes. As noted earlier, the stationary coding of an ergodic process is ergodic. A consequence of the finite-coder approximation theorem is that entropy for ergodic processes is not increased by stationary coding. The proof is based on direct estimation of the number of sequences of length n needed to fill a set of fixed probability in the encoded process.

Theorem 1.8.3 (Stationary coding and entropy.)
If y = it o F-1 is a stationary encoding of an ergodic process ii then h(v)

First consider the case when the time-zero encoder f is finite with window half-width w. By the entropy theorem, there is an N such that if n > N, there is a set Cn c An i +w w of measure at least 1/2 and cardinality at most 2 (n±2w+1)(h( A )+6) • The image f (C n ) is a subset of 13 1 of measure at least 1/2, by formula (2). Furthermore, because mappings cannot increase cardinality, the set f (C n ) has cardinality at most 2(n+21D+1)(h(4)+E) . Thus 1 - log I f (Cn)i h(11) + 6 + 8n, n where Bn —>. 0, as n -> oo, since w is fixed. It follows that h(v) < h(tt), since entropy equals the asymptotic covering rate, Theorem 1.7.4. In the general case, given c > 0 there is a finite time-zero coder f such that A (ix: f (x) 0 f (x)}) < E. Let F and F denote the sample path encoders and let y = it o F -1 and 7/ = ki o F -1 denote the Kolmogorov measures defined by f and f, respectively. It has already been established that finite codes do not increase entropy, so that h(71) < h(i). Thus, there is an n and a collection 'e" c Bn such that 71(E) > 1 - E and II Let C = [C], be the 6-blowup of C, that is,
Proof

C = { y: dn(Yi, 3711 ) 5_

E,

for some

3,T;

E '

el

The blowup bound, Lemma 1.7.5, implies that

ICI < where
3(E) = 0(E).

9

Let

15 = {x: T(x)7

E

El} and D = {x: F(x)7

E

CI

be the respective pull-backs to AZ. Since ki (ix: f (x) 0 7(x)}) < _ E2 , the Markov inequality implies that the set G = {x: dn (F(x)7,F(x)7) _< cl

82

CHAPTER I. BASIC CONCEPTS.

has measure at least 1 — c, so that p(G n 5 ) > 1 — 2E, since ,u(15) = 75(e) > 1 — E. By definition of G and D, however, Gniic D, so that

e > 1 — 2c. v(C) = g(D) > p(G n ij)
The bound ICI < 2n (h(A )+E+3(E )) , and the fact that entropy equals the asymptotic covering rate, Theorem 1.7.4, then imply that h(v) < h(u). This completes the proof of Theorem 1.8.3.

Example 1.8.4 (Stationary coding preserves mixing.)
A simple argument, see Exercise 19, can be used to show that stationary coding preserves mixing. Here a proof based on approximation by finite coders will be given. While not as simple as the earlier proof, it gives more direct insight into why stationary coding preserves mixing and is a nice application of coder approximation ideas. The a-field generated by the cylinders [4], for m and n fixed, that is, the cr-field generated by the random variables Xm , Xm+ i, , X n , will be denoted by E(Xm n ). As noted earlier, the i-fold shift of the cylinder set [cini n ] is the cylinder set T[a] = [c.:Ln ], where ci +i = ai, m < j <n. Let v = o F-1 be a stationary encoding of a mixing process kt. First consider the case when the time-zero encoder f is finite, with window width 2w + 1, say. The coordinates yri' of y = F(x) depend only on the coordinates xn i +:. Thus for g > 2w, the intersection [Yi ] n [ym mLg: i n] is the image under F of C n Tm D where C and D are both measurable with respect to E(X7 +:). If p, is mixing then, given E > 0 and n > 1, there is an M such that

jp,(c n

D) — ,u(C),u(D)I < c, C, D E

Earn,
El in

m > M,

which, in turn, implies that

n [y:Z +1]) — v([yi])v(Ey:VDI

M, g > 2w.

Thus v is mixing. In the general case, suppose v = p, o F -1 is a stationary coding of the mixing process with stationary encoder F and time-zero encoder f, and fix n > 1. Given E > 0, and a7 and Pi', let S be a positive number to be specified later and choose a finite encoder with time-zero encoder f, such that A({x: f(x) f . Thus,

col) <S

pax: F(x)7

fr(x)71)

/tax:

f (T i x)

f (Ti x)i)

f(x)

f (x)}), < n3,

by stationarity. But n is fixed, so that if (5 is small enough then v(4) will be so close to î(a) and v(b7) so close to i(b) that

(3)

Iv(a7)v(b7) — j) (a7) 13 (bi)I

E/ 3 , • P(x)21:71) < 2n8,

for all ar il and b7. Likewise, for any m > 1,

p,

F(x)7

P(x)7, or F(x)m m -Vi

SECTION 1.8. STATIONARY CODING.
so that (4)

83

Ivaan n T - m[b7]) — j)([4] n T -m[b71)1

6/3,

provided only that 8 is small enough, uniformly in m for all ail and b7. Since 1'5 is mixing, being a finite coding of ,u, it is true that

n T -m[bn) — i)([4])Dab7DI < E/3,
provided only that m is large enough, and hence, combining this with (3) and (4), and using the triangle inequality yields I vga7] n T -ni[b7]) — vga7])v([14])1 < c, for all sufficiently large m, provided only that 8 is small enough. Thus, indeed, stationary coding preserves the mixing property.

I.8.b

From block to stationary codes.

As noted in Example 1.1.10, an N-block code CN: A N B N can be used to map an A-valued process {X„} into a B-valued process {Y„} by applying CN to consecutive nonoverlapping blocks of length N, i iN+1
v (f-1-1)N
-

ii(j+1)N)
iN+1

j = 0, 1, 2, ...

If {X,} is stationary, then a stationary process {4} is obtained by randomizing the start, i.e., by selecting an integer U E [1, N] according to the uniform distribution and defining = Y+-i, j = 1, 2, .... The final process {Yn } is stationary, but it is not, except in rare cases, a stationary coding of {X„}, and nice properties of {X,}, such as mixing or even ergodicity, may get destroyed. A method for producing stationary codes from block codes will now be described. The basic idea is to use an event of small probability as a signal to start using the block code. The block code is then applied to successive N-blocks until within N of the next occurrence of the event. If the event has small enough probability, sample paths will be mostly covered by nonoverlapping blocks of length exactly N to which the block code is applied.

Lemma 1.8.5 (Block-to-stationary construction.)
Let 1,1, be an ergodic measure on A z and let C: A N B N be an N-block code. Given c > 0 there is a stationary code F = Fc: A z A z such that for almost every X E A Z there is an increasing sequence { ni :i E Z}, which depends on x, such that Z= ni+i) and (i) ni+1—ni < N, i E Z. (ii) If .I„ is the set of indices i such that [ni , +1) c [—n, n] and ni+i — n i < N, then ihn supn (1/2n)E iEjn ni+ 1 — ni < c, almost surely.
(iii) If n i+1 — ni = N then y: n n +'- ' = C(xn:±'), n -1 where y = F(x).

84

CHAPTER I. BASIC CONCEPTS.

Proof Let D be a cylinder set such that 0 < ,a(D) < 6/N, and let G be the set of all x E A z for which Tmx E D for infinitely many positive and negative values of m. The set G is measurable and has measure 1, by the ergodic theorem. For x E G, define mo = mo(x) to be the least nonnegative integer m such that Tx E D, then extend to obtain an increasing sequence mi = m i (x), j E Z such that Tinx E D if and only if m = mi, for some j. The next step is to split each interval [mi, m 1+1) into nonoverlapping blocks of length N, starting with mi, plus a final remainder block of length shorter than N, in case mi+i — mi is not exactly divisible by N. In other words, for each j let qi and ri be nonnegative integers such that mi +i—mi = qiN d-ri, 0 <ri <N, and form the disjoint collection 2-x (mi) of left-closed, right-open intervals [m i , mi N), [m 1 d-N,m i +2N), [mi±qiN, mi+i ).

All but the last of these have length exactly N, while the last one is either empty or has length ri <N. The definition of G guarantees that for x E G, the union Ui/x (mi) is a partition of Z. The random partition U i/x (m i ) can then be relabeled as {[n i , ni+1 ), i E Z}, where ni = ni(x), i E Z. If x G, define ni = i, i E Z. By construction, condition (i) certainly holds for every x E G. Furthermore, the ergodic theorem guarantees that the average distance between m i and m i±i is at least N/6, so that (ii) also holds, almost surely. The encoder F = Fc is defined as follows. Let b be a fixed element of B, called the filler symbol, and let bj denote the sequence of length 1 < j < N, each of whose terms is b. If x is sequence for which Z = U[ni , ni+i), then y = F (x) is defined by the formula bni+i— n, if n i±i — ni <N yn n :+i —1 = if n i+i — n i = N. C (4 +1-1 )

I

This definition guarantees that property (iii) holds for x E G, a set of measure 1. For x G, define F(x) i = b, i E Z. The function F is certainly measurable and satisfies F(TAX) = F(x), for all x. This completes the proof of Lemma 1.8.5. The blocks of length N are called coding blocks and the blocks of length less than N are called filler or spacer blocks. Any stationary coding F of 1,1, for which (i), (ii), and (iii) hold is called a stationary coding with 6-fraction of spacers induced by the block code CN. There are, of course, many different processes satisfying the conditions of the lemma, since there are many ways to parse sequences so that properties (i), (ii), and (iii) hold, for example, any event of small enough probability can be used and how the spacer blocks are coded is left unspecified in the lemma statement. The terminology applies to any of these processes. Remark 1.8.6 Lemma 1.8.5 was first proved in [40], but it is really only a translation into process language of a theorem about ergodic transformations first proved by Rohlin, [60]. Rohlin's theorem played a central role in Ornstein's fundamental work on the isomorphism problem for Bernoulli shifts, [46, 63].

8. of {X„}. where it was shown that for any ergodic process {X. where C„ is the zero-inserter defined by the formula (5). for each n > 64. furthermore.} defined by yin clearly has the property —on+1 = cn(X in o .} by the encoding 17(jin 1)n+1 = cn(X. since two disjoint blocks of O's of length at least n 112 must appear in any set of n consecutive places.d. STATIONARY CODING.i. see [26. The block-to-stationary code construction of Lemma 1. fy . To make this precise. case using coding ideas is suggested in Exercise 2. and an increasing unbounded sequence {ni I.5. Cn (4) = y.d. then nesting to obtain bad behavior for infinitely many n. For i. and Markov processes it is known that L(x7) = 0(log n). A question arose as to whether such a log n type bound could hold for the general ergodic case..1.SECTION 1. lim sup i-400 Â(ni) In particular.d. 2]. It can.8. at least if the process is assumed to be mixing.') > 1.1. L(x ) = max{k: +k i xss+ xtt±±k.i. Let pc } . The problem is to determine the asymptotic behavior of L(4) along sample paths drawn from a stationary. uyss+ for any starting place s. be forced to have entropy as close to the entropy of {X n } as desired. -Fin) > n 1/2 . {Yn }.c A string-matching example. almost surely.5. Let .. the {Y} process will be mixing. as a simple illustration of the utility of block-to-stationary constructions. define the function Cn : A n H {0. A proof for the i. there is a nontrivial stationary coding.8. such that L(Y. as follows..} is the process obtained from {X. The n-block coding {I'm ) of the process {X„. The construction is iterated to obtain an infinite sequence of such n's. Section 11.'. if the start process {X n } is i. j > 1. As an illustration of the method a string matching example will be constructed. where } (5) 0 Yi = 1 xi i <4n 112 or i > n — 4n 112 otherwise.} and any positive function X(n) for which n -1 1. The string-matching problem is of interest in DNA modeling and is defined as follows.5 provides a powerful tool for making counterexamples.-on+1) . 1)/2+1). finite-alphabet ergodic process. . j ?.i. and extended to a larger class of processes in Theorem 11. The proof for the binary case and growth rate À(n) = n 112 will be given here. A negative solution to the question was provided in [71].(n) —> 0. 1 " by changing the first 4n 1 i2 terms into all O's and the final 4n 1 12 terms into all O's. for some O <s<t<n—k} . indicate that {Y. For x E An let L(4) be the length of the longest block appearing at least twice in x1% that is. 85 I. The first observation is that if stationarity is not required. that is. the problem is easy to solve merely by periodically inserting blocks of O's to get bad behavior for one value of n.

At each stage the same filler symbol.} exists for which L(Y(oo) :. since this always occurs if Y(i)?' is the concatenation uv. each coordinate is changed at most once. let Fi denote the stationary encoder Fi : {0.E. will be used. BASIC CONCEPTS. Randomizing the start at each stage will restore stationarity. y is the beginning of a codeword. and f is filler. + 71 ) > n. the final process has a positive frequency of l's. let {n i } be an increasing sequence of natural numbers. For any start process {Y(0).1. mixing is not destroyed. at each stage converts the block codes into stationary codings. and hence there is a limit code F defined by the formula [F(x)]m = lim[F i (x)] m . For n > 64. 1 < j < for all s. the sequence of processes {Y(i)} is defined by setting {Y(0)} = {X.n } and inductively defining {Y(i) m } for i > 1 by the formula (cn . in turn.5. . yields (7) L(Y(i)7 J ) > ni ll2 . but mixing properties are lost. 11 z {0. let Cn be the zero-inserter code defined by (5).) {YU — l)m} -14 {Y(i)ml. In particular. in turn. guarantees that a limit process 1/2 {Y(oo)„. the choice of filler means that L(Y(i)?`) > rt. 1} z that maps {X.n } to {Y(i) m }. Given the ergodic process {X n } (now thought of as a two-sided process. hence. since O's inserted at any stage are not changed in subsequent stages. guarantee that the final limit process is a stationary coding of the starting process with the desired string-matching properties. and let {ci } be a decreasing sequence of positive numbers. the i-th iterate {Y(i) m } has the property L(Y(i): +7') > n/2 . with a limiting Et-fraction of spacers. < j. xE 0. provided the n i increase rapidly enough. for all m E Z and i > 1. which will. each process {Y(i) m } is a stationary coding of the original process {X.n } by stationary coding using the codebook Cn. where u is the end of a codeword and v the beginning of a codeword. in particular. that is. An appropriate use of block-to-stationary coding.. b = 0. where u is the end of a codeword. j > 1. Furthermore. and long matches are even more likely if Y(i)?' is the concatenation uf v. (6). to be specified later. and it is not easy to see how to pass to a limit process. UZ { . since stationary coding will be used). and define the sequence of processes by starting with any process {Y(0)} and inductively defining {Y(i)} for i > 1 by the formula {Y --4Ecn {Y(i)m}. Property (6) guarantees that no coordinate is changed more than once.86 CHAPTER I. The rigorous construction of a stationary coding is carried out as follows./2 . Furthermore. {n i : i > 11 be an increasing sequence of integers. Combining this observation with the fact that O's are not changed. both to be specified later. mEZ. The preceding construction is conceptually quite simple but clearly destroys stationarity.n }. where the notation indicates that {Y(j) m } is constructed from {Y(i — 1). so that the code at each stage never changes a 0 into a 1. For each i > 1.}. (6) Y(i — 1) m = 0 =z= Y (i) m = O. Since stationary codings of stationary codings are themselves stationary codings. which.8. Lemma 1.

The subcollection of stationary processes is denoted by P s (A") and the subcollection of stationary ergodic processes is denoted by 7'e (A°°). The collection of (Borel) probability measures on a compact space X is denoted by Of primary interest are the two cases. namely. the d-distance between processes is often difficult to calculate. called the dtopology.9. Show that there is a stationary code F such that limn Ett (dn (4. Suppose C: A N B N is an N-code such that Ei. but it has two important defects. o (Hint: use the preceding exercise.8.n } defined by the stationary encoder F.. weak limits of ergodic processes need not be ergodic and entropy is not weakly continuous. entropy is d-continuous and the class of ergodic processes is d-closed. A sequence {. 87 The limit process {Y(oo)„z } defined by Y(oo). PROCESS TOPOLOGIES. declares two processes to be close if their joint distributions are close for a long enough time. and it plays a central role in the theory of stationary codings of i. declares two ergodic processes to be close if only a a small limiting density of changes are needed to convert a typical sequence for one process into a typical sequence for the other process. merely by making the ni increase rapidly enough and the Ei go to zero rapidly enough.9. where A is a finite set. The other concept. processes. the entropy of the final process will be as close to that of the original process as desired. The above is typical of stationary coding constructions.8.d Exercises. then the sequence {ni(x)} of Lemma 1.9 Process topologies. First a sequence of block codes is constructed to produce a (nonstationary) limit process with the properties desired. X = Anand X = A". compact. for each i. Two useful ways to measure the closeness of stationary processes are described in this section. then Lemma 1. a theory to be presented in Chapter 4.„ m E Z is the stationary coding of the initial process {X. The d-metric is not so easy to define. and the d-topology is nonseparable. 1. I.d. and easy to describe. (") (a lic) = n-4. Furthermore. On the other hand.a The weak topology.u (n) } of measures in P(A") converges weakly to a measure it if lim p. The weak topology is separable.00 .5 is applied to produce a stationary process with the same limiting properties.5 can be chosen so that Prob(x 1— ' = an < 8 . I.n = limé Y(i). the limit process has the property L(Y(oo)) > the limiting density of changes can be forced to be as small as desired. (Hint: replace D by TD for n large. Since (7) hold j > 1.) 2. as are other classes of interest. F(x)7) < < E 1-1(un o 2E and such that h(o. p(x). however. Show that if t is mixing and 8 > 0.SECTION 1.i. for each al iv E AN . One concept. The collection P(A") is just the collection of processes with alphabet A. Many of the deeper recent results in ergodic theory depend on the d-metric. where it is mixing.8. In particular. called the weak topology.(dN(x fv .) Section 1. C(4)) < E.

that puts weight 1/2 on each of the two sequences (0. One way to see this is to note that on the class of stationary Markov processes.) — v(a 101.u(C) is continuous for each cylinder set C. A second defect in the weak topology is that entropy is not weakly continuous. relative to the metric D(g. so the class of stationary measures is also compact in the weak topology. the usual diagonalization procedure produces a subsequence {n i } such that { 0 (a)} converges to.°1) } of probability measures and any all'. which has entropy log 2.) and (1. given. is stationary if each A (n ) is stationary. y) where ..u(n) is stationary.u. The weak topology is a metric topology. so that {.). Since Ho (Xo l XII) depends continuously on the probabilities . If k is understood then I • I may be used in place of I • lk. and each .. for any sequence {. for each n. Entropy is weakly upper semicontinuous.u (n ) converges weakly to p. The weak topology on P(A") is the topology defined by weak convergence. Thus. 0. the limit p.2e. Indeed. It is easy to check that the limit function p.u (n) } converges weakly to the coin-tossing process.. let x(n) be the concatenation of the members of {0.(X 0 1X:1) decreases to H(.u (n) (4) = 2-k . Now suppose . (n ) has entropy 0. for all k and all'. Theorem 1.. H(a) < H(p).vik = E liu(a/1 at denotes the k-th order distributional (variational) distance. For example. as i —> oc. however. . where X ° k is distributed according to p.(4). The weak topology is a compact topology. say. BASIC CONCEPTS. for every k and every 4.u (n) is the Markov process with transition matrix M= n [ 1 — 1/n 1/n 1/n 1 1 — 1/n j then p weakly to the (nonergodic) process p. It is the weakest topology for which each of the mappings p . Furthermore. The process p. p.(0(X 0 IX11) < H(A)-1. the sequence {. let Hi. The weak topology is Hausdorff since a measure is determined by its values on cylinder sets.u. hence has a convergent subsequence.) as k — > oc.9.u. 1.°2) be the concatenated-block process defined by the single n2'1 -block x(n).(X01X: ) denote the conditional entropy of X0 relative to X=1. . Proof For any stationary measure p. yet lim .u(a k i +1 ). it must be true that 1-1/. if . and recall that 111.°1) (a)} is bounded. is a probability measure.88 CHAPTER I. Since there are only countably many 4. E > 0 there is a k such that HA (X0I < H(p) €. weak convergence coincides with convergence of the entries in the transition matrix and stationary vector. V k. 1}n in some order and let p. then lim sup.1 If p (n ) converges weakly to p. The weak limit of ergodic measures need not be ergodic.. . In particular. iii .

. processes will be calculated.T (p)) is v-almost surely constant. Theorem 1. The sequence x contains a limiting p-fraction of l's while the sequence q contains a limiting q-fraction of l's. is actually a metric on the class of ergodic processes is a consequence of an alternative description to be given later. let g (P) denote the binary i.) To illustrate the d-distance concept the distance between two binary i. defined by d(x.1. x) = d(T y. a(x.)) = inf{d(x. and its converse.b The d-metric. For 0 < p < 1.4. y) = lirn sup d(4 . Theorem 1.i. that is. and y are ergodic processes. Let T(g) denote the set of frequency-typical sequences for the stationary process it. The (minimal) limiting (upper) density of changes needed to convert y into a g-typical sequence is given by d(y. then H(g. let x be a . Proof The function f (y) = 41(y .9.)}. so that x and y must . yi). y) is the limiting (upper) density of changes needed to convert the (infinite) sequence x into the (infinite) sequence y. PROCESS TOPOLOGIES.(n)(X01X11) is always true. and y is the limiting density of changes needed to convert y into a g-typical sequence. y) is the limiting per-letter Hamming distance between x and y.SECTION 1.i. hence f (y) is almost surely constant.i)• Thus ci(x.T (p)) is invariant since ii(y. y) on A. the d-distance between p. The d-distance between two ergodic processes is the minimal limiting (upper) density of changes needed to convert a typical sequence for one process into a typical sequence for the other. .9. Theorem 1. in particular. y) and is called the d-distance between the ergodic processes it and v. then d( y. But if this is true for some n.b) = { 1 n i =1 a=b a 0 b. where d(a. A precise formulation of this idea and some equivalent versions are given in this subsection. v). 89 for all sufficiently large n.) + 2E.. and let y be a g (q) -typical sequence. is ergodic if and only if g(T(g)) = 1. Tx). The constant of the theorem is denoted by d(g.2 If p. I.i. Note.u (P) -typical sequence. The per-letter Hamming distance is defined by 1 x---N n O d„(4 . processes.9. y ) = — d(xi. process such that p is the probability that X 1 = 1. By the typical-sequence theorem. the measure p.9.u.4.d. This extends to the It measures the fraction of changes needed to convert 4 into pseudometric ci(x. as defined.) Example 1. LI since Ii(g (n ) ) < 1-11. the set of all sequences for which the limiting empirical measure is equal to p. that for v-almost every v-typical sequence y.3 (The d-distance for i. This proves Theorem 1. that is.9. y): x E T(11. y. Suppose 0 < p <q < 1/2. T (p.2. (The fact that d(g.1.d. (n) ) < H(.d.

d-far-apart Markov chains. say p = 10-50 . Example 1.u (q ) -typical y such that ii(x. This new process distance will also be denoted by d(A. and its shift Ty.9. for .u (r) -typical z such that y = x Ef) z is .u. First each of the two measures. vn ) between the corresponding n . (q ) -typical y such that cl(x. (1) . ( A key concept used in the definition of the 4-metric is the "join" of two measures.u and y as marginals. for kt converges weakly to y as p —> 0. and y be the binary Markov chains with the respective transition matrices M. separated by an extra 0 or 1 about every 1050 places.(x. Thus it is enough to find. on A x A that has . to be given later in this section.1 The joining definition of d. is represented by a partition of the unit square into horizontal . and about half the time the alternations are out of phase. that is. Thus there is indeed at least one . where e is addition modulo 2.90 CHAPTER I. b a A simple geometric model is useful in thinking about the joining concept. A joining of two probability measures A and y on a finite set A is a measure A. (With a little more effort it can be shown that d(. hence the nonmixing process v cannot be the d-limit of any sequence of mixing processes.9. a(x. y) > g — p. To remedy these defects an alternative formulation of the d-d limit of a distance dn un . which produces only errors. Let p. Likewise. b). For any set S of natural numbers. Fix a . that is. On the other hand. Thus d.th order distributions will be developed.a(a) = E X(a. Later it will be shown that the class of mixing processes is a-closed. v(b) = E X(a. Let Z be an i. for each p9' ) -typical x. The y-measure is concentrated on the alternating sequence y = 101010. BASIC CONCEPTS. see Exercise 5. since about half the time the alternation in x is in phase with the alternation in y. 7" (v)) — 1/2.u (q ) -typical. kt and y. and hence. see Exercise 4.2 serves as a useful guide to intuition.) These results show that the d-topology is not the same as the weak topology.i.u-almost all x. 1. y) = g — p. p (q ) ) = g — p. Since Z is also almost surely p (r ) -typical. so that ci(x . y) is exactly 1/2. b).4 ( Weakly close.9. y) — 1/2. there is therefore at least one . but it is not always easy to use in proofs and it doesn't apply to nonergodic istance as a processes. Y is . and let r be the solution to the equation g = p(1 — r) ± (1 — p)r . a typical /L-sequence x has blocks of alternating O's and l's.u (P ) -typical x.[ p 1 — p 1—p [0 1 1 N= p 1' 1 where p is small. Ty) — 1/2.)typical. y) = g — p. In a later subsection it will be shown that this new process distance is the constant given by Theorem 1. The definition of d-distance given by Theorem 1. the values {Zs : s E SI are independent of the values {Zs : s 0 S}.9.d. one . which produces no errors.) As another example of the d-distance idea it will be shown that the d-distance between two binary Markov chains whose transition matrices are close can nevertheless be close to 1/2. a(x. disagree in at least a limiting (g — p)-fraction of places.2.. v). random sequence with the distribution of the Afr ) -process and let Y = x ED Z. for almost any Z.. d -distance This result is also a simple consequence of the more general definition of .b.u. and hence ii(A (P) .

a E A. see Exercise 2.) In other words.9. A joining X can then be represented as a pair of measure-preserving mappings.) A representation of p. relative to which Jn(Itt. a) to (A. Turning now to the definition of y). from (Z. (/) from (Z. with the width of a strip equal to the measure of the symbol it represents. a) to (A. A). a) is just a measure-preserving mapping ct. 6) means that. 0 R(0.u(a) = Eb X(a. such that X is given by the formula (2) X(a i .b) R(0. yti')) = cin v). Of course. y) satisfies the triangle inequality and is strictly positive if p.(a. V) is a compact subset of the space 7'(A x A n ). . where p. Also. V) = min Ejdn(4 . The function 4(A. ai ) a (q5-1 (ai ) n ( . one can also think of cutting up the y-rectangles and reassembling to give the A-rectangles.b) has area X(a. that is. hence is a metric on the class P(Afl) of measures on A".b) R(2c) R(2. (See Figure 1.b) R(2a) 1 2 R(2. where ex denotes expectation with respect to X.5.a) R(1b) R(1. y). (A space is nonatomic if any subset of positive measure contains subsets of any smaller measure. The second joining condition. a) to (A. 91 strips. The minimum is attained. v(b) = Ea X. b). The 4-metric is defined by cin(ti. 6). for each a. for a finite distribution can always be represented as a partition of any nonatomic probability space into sets whose masses are given by the distribution. on a nonatomic probability space (Z. and ik from (Z.9.u-rectangle corresponding to a can be partitioned into horizontal rectangles {R(a. since expectation is continuous with respect to the distributional distance IX — 5. means that the total mass of the rectangles {R(a. y.a) a R(1. A measure X on A n x A n is said to realize v) if it is a joining of A and y such that E Vd n (x7. a measurable mapping such that a(0 -1 (a)) = p(a). PROCESS TOPOLOGIES.5 Joining as reassembly of subrectangles. let J(11.a) R(0.c) Figure 1. b): a E A} is exactly the y-measure of the rectangle corresponding to b.b):b E A} such that R(a. and v are probability measures on An . A).SECTION 1. The joining condition . y) denote the set of joinings of p.1 (c)). and v.a) R(2. a joining is just a rule for cutting each A-rectangle into subrectangles and reassembling them to obtain the y-rectangles. the .c) R(1.9.c) R(2.a) R(0. the use of the unit square and rectangles is for simplicity only.1In.

The d̄-distance between two processes is defined for the class P(A^∞) of probability measures on A^∞ by passing to the limit,

    d̄(μ, ν) = lim sup_n d̄_n(μ_n, ν_n),

where μ_n denotes the measure induced by μ on the set A^n. Note that d̄(μ, ν) is a pseudometric on the class P(A^∞), and hence d̄ defines a topology on P(A^∞). The d̄-distance between stationary processes is thus defined as a limit of n-th order d̄_n-distances; in the stationary case the limit superior is, in fact, both the limit and the supremum.

Theorem 1.9.6
If μ and ν are stationary then d̄(μ, ν) = lim_n d̄_n(μ_n, ν_n) = sup_n d̄_n(μ_n, ν_n).

Proof The proof depends on a superadditivity inequality, namely,

(3)    n d̄_n(μ_1^n, ν_1^n) + m d̄_m(μ_{n+1}^{n+m}, ν_{n+1}^{n+m}) ≤ (n + m) d̄_{n+m}(μ_1^{n+m}, ν_1^{n+m}),

which is, in turn, a consequence of the definition of d̄_n as a minimum. The proof of this is left to Exercise 3. In the stationary case, μ_{n+1}^{n+m} = μ_1^m, so the preceding inequality takes the form n d̄_n + m d̄_m ≤ (n + m) d̄_{n+m}, which implies that the limsup is in fact both a limit and the supremum. This establishes Theorem 1.9.6. □

A consequence of the preceding result is that the d̄-pseudometric is actually a metric on the class of stationary measures. Indeed, if d̄(μ, ν) = 0 and μ and ν are stationary then, since in the stationary case d̄ is the supremum of the d̄_n(μ_n, ν_n), each d̄_n(μ_n, ν_n) must be 0, which implies that μ_n = ν_n for all n, which, in turn, implies that μ = ν.

It is useful to know that the d̄-distance can be defined directly in terms of stationary joining measures on the product space A^∞ × A^∞. A joining λ of μ and ν is just a measure on A^∞ × A^∞ with μ and ν as marginals, that is,

    μ(B) = λ(B × A^∞),    ν(B) = λ(A^∞ × B),    B ∈ Σ.

The set of all stationary joinings of μ and ν will be denoted by J_S(μ, ν).

Theorem 1.9.7 (Stationary joinings and d̄-distance.)
If μ and ν are stationary then

(4)    d̄(μ, ν) = min over λ ∈ J_S(μ, ν) of E_λ(d(x_1, y_1)).
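Using the dbar_n sketch from above, the supremum characterization can be watched numerically: build the n-block distributions of two stationary chains and compute d̄_n for increasing n. The chains below are those of Example 1.9.4 with a moderate p (an illustrative choice, not from the book); the values grow toward the process distance, though for small p the approach to 1/2 is very slow, which is exactly the point of that example.

```python
import itertools  # the function dbar_n from the previous sketch is assumed to be in scope

def markov_blocks(P, start, n, alphabet=(0, 1)):
    """n-block distribution of a Markov chain with transition matrix P and start distribution."""
    dist = {}
    for block in itertools.product(alphabet, repeat=n):
        pr = start[block[0]]
        for a, b in zip(block, block[1:]):
            pr *= P[a][b]
        dist[block] = pr
    return dist

p = 0.3
M = [[p, 1 - p], [1 - p, p]]     # stationary distribution (1/2, 1/2)
N = [[0.0, 1.0], [1.0, 0.0]]     # deterministic alternation, same stationary distribution
for n in range(1, 7):
    mun = markov_blocks(M, [0.5, 0.5], n)
    nun = markov_blocks(N, [0.5, 0.5], n)
    print(n, round(dbar_n(mun, nun, [0, 1], n), 4))
```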

for if A E Js(. (n) is the measure on (A x A )° defined by A((4. for each n choose A n E Jn(tin. .. To prove (c). then taking a weakly convergent subsequence to obtain the desired limit process À.Î.(n) is given by the formula (5) i "(n) (B) = — n n 1=1 where A.9. This.u(B).. va). It is also easy to see that d cannot exceed the minimum of Ex (d(x l . the concatenatedblock process defined by A n . for each n. since A n belongs to -In(un. vn) = ex(c1. / . if N = Kn+r. that is. let A.k (d(x i . 93 Proof The existence of the minimum follows from the fact that the set . (") (C x = A(C).The definition of A (n) implies that if 0 < i < n — j then x A c° )) = Xn(T -` [4] x A) = ting—i [xn) = since An has A n as marginal and it is shift invariant. It takes a bit more effort to show that d cannot be less than the right-hand side in (4). and the latter dominates an (IL. y) of stationary joinings is weakly closed and hence compact. The details are given in the following paragraphs. for any cylinder set C. yr)) = K-1 ( k=0 H (t v knn -l"n kk"kn-I-1 . implies that lim. y).7 (a. together with the averaging formula. y rit )). such that an(An. V). such that ex (d(Xi . Towards this end. b) = E (uav. the measure 7.01) denote the concatenated-block process defined by An.kn-Fn Ykn+1 // t( y Kn+r "n kk""Kn-1-1 )1(n+1 .. As noted in Exercise 8 of Section 1. yi ) = lima Cin (tt. The proof of (b) is similar to the proof of (a). yi)).(4 . v). V) = Ci(p. for it depends only on the first order distribution. and from the fact that e . Yi)) is weakly continuous. Yi)) = litridn (tt. B E E E. B x B) = . E.1s(p. y) then stationarity implies that ex(d(xi. (a) (6) (b) (c) n-+co lim 5:(n)(B x lim 3. which proves the first of the following equalities. let ). The goal is to construct a stationary measure A E y) from the collection {A}.7 be projection of A n onto its i-th coordinate. u'bvi).SECTION 1. PROCESS TOPOLOGIES.1. (5). 0<r <n.(n) (A = bt(B). This is done by forming. For each n. yl)) = ex n (dn (4. va). lim e-)(d(xi. the measure on A x A defined by A.

2 Empirical distributions and joinings.(d(xi. yi) = e7.8 (The empirical-joinings lemma.. respectively. y.„ (dn(4 )11 ))• Ei x7 (a.94 and note that CHAPTER L BASIC CONCEPTS. and 3.9. if necessary. Passing to the limit it follows that . bk i )1(x n i1. I. b). and plc ( ly7). y). 1x). yi)). To complete the proof select a convergent subsequence of 1-(n) . The pair (x. ir(x . by conditions (6a) and (6b).y.9. )). By dropping to a further subsequence. yri') = E p . ) ) is a joining of pk (. and the third. pk (af 14) and pk (biny). and): are all stationary. I —ICX) j—>00 lim p((a ./ J such that each of the limits (8) lim p(afix). 07 . each side of E(n — k 1)p((af. Fix such a n1 and denote the Kolmogorov measures exists for all k and all al` and defined by the limits in (8) as TI. ex n (dn(x tiz . b) = (1/n) 1 n x. { } 0< (n — k + 2) p(414) — (n k— E p(a'nfi!) <1. One difficulty is that limits may not exist for a given pair (x.b. b 101(x7. but important fact. and more. y)).0)(d (xi . Condition (6c) guarantees that ii(tt. yril)). Two of these. A diagonalization argument. it can be assumed that the limit i i )) . is obtained by thinking of the pair (4 .(. that (7) dn (x. E d (xi . is that plc ( 1(4 . y7)) = since 3:(n) (a.) The measures 171. Yi)).p) (d(a. ( . Proof Note that Further- M. for example. This proves (c). pk ((alic . y) = Eî(d(ai . and X' is a joining of g. Note also. y) as a single sequence. Thus X E Js(h. The limit X must be stationary.j'e ly). and is a joining of h and y. b)). A simple. 14)1(4 . yields an increasing sequence fn . are obtained by thinking of 4 and Ai as separate sequences. Limit versions of these finite ideas as n oc will be needed. hence is not counted in the second term. y). = (n — k 1)p(aii` I x7) counts the number of times af appears in x7. 1(x. b1 )). however. "i. y) = lim dni ((x in i . since. yril) defines three empirical k-block distributions. since cin(A. u) = e). y) = EA. Yi)X7 (xi . al since the only difference between the two counts is that a' 2` might be the first k — 1 terms of 4. yn also exists.(. Lemma 1.1).. The limit results that will be needed are summarized in the following lemma.
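The basic device in the empirical-joinings argument is to view the pair (x_1^n, y_1^n) as a single sequence of pairs; its overlapping k-block distribution is then automatically a joining of the separate k-block distributions of x_1^n and of y_1^n. A quick numerical check of this fact, with illustrative sequences and exact arithmetic via fractions (not from the book):

```python
from collections import Counter
from fractions import Fraction as F

def p_k(seq, k):
    """Empirical overlapping k-block distribution of a sequence, with exact fractions."""
    n = len(seq) - k + 1
    counts = Counter(tuple(seq[i:i + k]) for i in range(n))
    return {block: F(c, n) for block, c in counts.items()}

x = (0, 1, 1, 0, 1, 0, 0, 1, 1, 0)
y = (0, 1, 0, 0, 1, 1, 0, 1, 0, 0)
k = 2
joint = p_k(tuple(zip(x, y)), k)        # k-blocks of the pair sequence (x_i, y_i)
marg_x, marg_y = Counter(), Counter()
for block, pr in joint.items():
    marg_x[tuple(a for a, _ in block)] += pr    # project onto the x-coordinate
    marg_y[tuple(b for _, b in block)] += pr    # project onto the y-coordinate
print(dict(marg_x) == p_k(x, k), dict(marg_y) == p_k(y, k))   # True True
```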

y) < E). by the empirical-joinings lemma. Furthermore. y). co is a probability measure on the set of such equivalence classes.9. in the ergodic case. by definition of X. that x(d(xi yi)) = f yl)) da)(7(x.) If . Li Next it will be shown that the d-distance defined as a minimization of expected perletter Hamming distance over joinings.„( . y)) be the ergodic decomposition of A. 95 is stationary. y) E A" x A'.)-typical. Theorem 1..9. it is concentrated on those pairs (x.. combined with (10). The integral expression (9) implies. the pair (x. y) < Ex„(x . y) = EA.10. y i )). since it was assumed that dCu. .u and v. as given by the ergodic decomposition theorem.) must be a stationary joining of g and y. The joining result follows from the 0 corresponding finite result.7(x. In this expression it is the projection onto frequency-equivalence classes of pairs of sequences. y)). . y) for which x is g-typical and y is v-typical. PROCESS TOPOLOGIES. yl )).y) (d(x i . it follows that X E ). (x. for À-almost every pair (x.9. the same as the constant given by Theorem 1. But if this holds for a given pair. )) . This establishes Theorem 1. y).y) (d(xi. "if and 1. since d(xi.y)) for A-almost every pair (x.(x.9. ) (d(xi.. while the d-equality is a consequence of (7). But. y) is ergodic for almost every (x. Thus.are stationary.9.7. Theorem 1. A(x) is ergodic and is a joining of . y) = ex(d(xi . Theorem 1. yi)).SECTION 1. y). (. is the empirical measure defined by (x.(d(xi let yi)).9. yl) depends only on the first order distribution and expectation is linear.„. A-almost surely. since À is a stationary joining of p and y. This. Likewise. 8 .2. Lemma 1. Theorem 1. a strengthening of the stationary-joining theorem.(x. and hence d(u. y) E (A. and (9) A= f A(x ) da)(71.u and y such that Au. however. À-almost surely. yi))• Proof Let A be a stationary joining of g and y for which c/(u.u and y are ergodic then there is an ergodic joining A of .. y) = e).9.(. = ex(d(xi. and A. The empirical-joinings lemma can be used in conjunction with the ergodic decomposition to show that the d-distance between ergodic processes is actually realized by an ergodic joining.7.4.9. is.t. y E T (v). means that a(i. (10) d(u. 371)). Since. then X 7r(x. y) is 4( X .9 (Ergodic joinings and a-distance.

together with the almost sure limit result. y). BASIC CONCEPTS. For convenience.. there is a set X c A' such that v(X) = 1 and such that for y E X there is at least one x E T(A) for which both (a) and (b) hold. Since the latter is equal to 2( x.') = — n EÀ(d(xi. the distributional (or variational) distance between n-th order distributions is given by Ii — vin = Elp.u. Y1)). if p. v-almost surely.10.) If . the proof of (11) is finished. v).3 Properties and interpretations of the d-distance. T (A)). random variable notation will be used in stating d-distance results. The vector obtained from xri' be deleting all but the terms with indices j i < j2 < < j. y).b. and h1 .u and v are ergodic then d(. The lower bound result. is the distribution of X and v the to indicate the random vector distribution of Yf'. and hence d(p. y i )) = pairs (x. .. I. v) E-i. and choose an increasing sequence {n 1 } such that )) converges to a limit A. y 1 ') converges to a limit d(x. v).. and v as marginals. or by x(jr) when n is understood. Theorem 1. The ergodic theorem guarantees that for A-almost all which e(d(x i . y) < d(x . that is. blf)1(x n oc. will be denoted by x n (jr).(d(Xi. p((at. (a ) — v(a7)I and a measure is said to be TN -invariant or N-stationary if it is invariant under the N-th power of the shift T.v).9 implies that there is an ergodic joining for v). y). Here is why this is so. the following two properties hold. y).n . (12).(a. Theorem 1.9. y) E TOO Ç T(i) x T (v). Next it will be shown that there is a subset X c A of v-measure 1 such that (12) d(y. y). also by the empirical-joinings lemma. v) d(x .9. This establishes the desired result (12). i=1 (b) (x. conditioned on the value r of some auxiliary random variable R.9.96 CHAPTER I.Theorem 1. In particular. bt) for each k and each al' and bt. and 2(x. The empirical-joinings so that dnj (x l i . .10 (Typical sequences and d-distance. y) E T(A) x T (v) d(p. y'.Y1))=a(u. Another convenient notation is X' . (a) cln (4 . v) = d(y. Proof First it will be shown that (11) (x. for example. as j lemma implies that Î is stationary and has p. yields LIJ the desired result. 17) stands for an (A. (11). Let x be ti-typical and y be v-typical. As earlier.9.T (p)) = d(p. V y E X.

E ga7. IT) < illr.vi n . and hence. y). for any finite random variable R.v In .y(j"))(X Y(iin—m )) be a joining of the conditional random variables. (e) d(u. b) = 1. there is always at least one joining X* of i and y for which X(4. let {6. . X(i in-m )/x( jim) and Y(i n i -m )/y(jr). Y (if n )) n . then ci(u. This completes the proof of property (a). PROCESS TOPOLOGIES.9. (g) ("in(X7.Ea ga. Moreover.Via . (d) If . The left-hand inequality in (c) follows from the fact that summing out in a joining of X. and mdm (x(r). 2 1 In the case when n = 1. . n } . .a7 ) . let yr. unless a = b. 14)) X({(a7 . i2. y) satisfies the triangle inequality and is positive if p.. The fact that cln (A. o T'. yuln))x(x(q).Y(j im). The right-hand inequality in (c) is a consequence of the fact that any joining X of X(r) and Y(ffn) can be extended to a joining of X and To establish this extension property..Yin = 2.y(jr)) 5_ ndn (x 11 .(d„(xr il . and that Ex (dn (a7 ."1 )).1A . for any joining X of bt and v.u . y( j. y(jr)) n . (f) If the blocks {X i ort on+1 } and Olin 1) . y is left as an exercise.(X (j r).17(iin—on+i). X(x(jr). 14): a7 b7 } ) = 1 .11 (Properties of the d-distance. (c) (X (ir). v(a)). y) (dn (di% b7)) < .YD. Furthermore. r P(R = r)d(r Proof A direct calculation shows that IL . with equality when n = 1.(ir»(xcili -m). dn(A.. in _m } denote the natural ordering of the complement {1. The function defined by x*(q.(XF n Yr n ) = dn(X i ort1)n+1 . a). yri') = x(x(r). = min(A(a7). y o 7-1 ).) (a) dn v) _< (1/2)Ip. is a joining of X and Yln which extends X. is a complete metric on the space of measures on An. since d(a. b)) > 1 .m. nE). Li } and for each ( x (jr ) . ±1 } are independent of each other and of past blocks. v) = lim cin (. (b) d. < md„. )711')) < mEVd m (x(iln ).m. y) d(p. 97 Lemma 1. 2. see Exercise 8. yoin -m». Yf' gives a joining of X(ffn). Completeness follows from (a). v(a))..2 Emin(u(a7). then mcl. YU)) yr.9.SECTION 1. together with the fact that the space of measures is compact and complete relative to the metric ii . Ex(d(a. . .u and v are T N -invariant.
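Property (a) of the lemma above reduces, for n = 1, to the statement that d̄_1(μ, ν) = (1/2)|μ − ν|_1, realized by a joining that puts mass min(μ(a), ν(a)) on the diagonal (compare Exercise 8). A quick numerical check, with an arbitrary illustrative pair of distributions (not from the book):

```python
mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}
# The diagonal joining leaves mass 1 - sum_a min(mu(a), nu(a)) to be placed off-diagonal,
# and for n = 1 each off-diagonal unit of mass costs exactly 1 in Hamming distance.
overlap = sum(min(mu[s], nu[s]) for s in mu)
print(1 - overlap)                                   # approximately 0.3
print(0.5 * sum(abs(mu[s] - nu[s]) for s in mu))     # approximately 0.3 (half the variational distance)
```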

(xi' y) ct (a) 20/. y) Likewise.`n )) < 1. and is uniformly equivalent to the 4-metric. The Markov inequality implies that X(A n ( N/Fe)) > 1 — NATt. j=1 since mnEx (c1. with bounds given in the following lemma. y) = min minfc > 0: A. 1)n-1-1 } ./. 01. The proof of Property (g) is left to the reader. v)] 2 5_ 4 (A. y) 5_ 24(u. and hence property (f) is established. It is one form of a metric commonly known as the Prohorov metric. y) = E).u.12 [d:Gt. Yintn ) Edna (.tt1)n+1)) i • j=1 The reverse inequality follows from the superadditivity property. This completes the discussion of Lemma 1.(A n (c)) > 1 — cl.u. Proof Let A be a joining of 1. Yilj.(xrn . is obtained by replacing expected values by probabilities of large deviations. and hence nz Manin (X7 m . (2).11. yrn)) = E nE j (dn (x . versions of each are stated as the following two lemmas. )1)) = a. if A E y) is such that X(An(a)) > 1 — a and d:(. yiii)) < E (x7.n (x(irm ). Another metric. Property (d) for the case N = 1 was proved earlier.2-1)n+1 )) = i (. Part (b) of each version is based on the mapping interpretation of joinings. BASIC CONCEPTS. Property (e) follows from (c). i . This completes the proof of the lemma.9. choose a joining Ai of {Xiiin j. then EÀ(dn (4. . and hence cl.(. such that Exj (dn (X17_ 1)n±l' Yii./2-1)n+1' y(i. } and {Y(.(d n (f.1)n-1-1 . y) < E. and define (13) *(/. Lemma 1.-1)n+1 .98 CHAPTER I. yDA. v). y(ijn-on+1)• The product A of the A is a joining of Xr and Yr. y) = ct.. y) < cl. for each To prove property (f).9.E. yriz)) < 2a.4. equivalent to the 4-metric.0 The function dn * ( u. Minimizing over A yields the right-hand inequality since dn .. dn(xii . by the assumed independence of the blocks.(dn (4 .. y7)EA. y(i n in property (c). Let A n (c) = dn (xiz . y) is a metric on the space P(An) of probability measures on A. The preceding lemma is useful in both directions. dn A. the extension to N > 1 is straightforward..in-on+1) . and hence du. (3).t and y such that dn (A.

) > 1 — 2e. 11 ) 3 I p(44) — p(any)1 < c.(4 .15 (Ergodicity and d-limits. a) into (An.9.14 (a) If anCu. a) is a nonatomic probability space. Proof Suppose .q) + knd. respectively. then Y n ([B]. a) into (A'. then a small blowup of the set of vtypical sequences must have large ti-measure. Lemma 1. for every a. except for a set of A. for 3 > 0. such that dn ( ( Z) (z)) < E. where the extra factor of k on the last term is to account for the fact that a j for which yi 0 x i can produce as many as k indices i for which y :+ k -1 _ — ak 1 and X tj. and hence (14) (n — k + 1)p(41).2 it is enough to show that p(aii`lx) is constant in x. In this section it will be shown that the d-limit of ergodic (mixing) processes is ergodic (mixing) and that entropy is d-continuous.9.13 99 (a) If there is a joining À of it and u such that d„(4 . let [B]s = {y7: dn(y ri' .SECTION 1.4. B) 3). y) < 2E. and (Z. fix k and note that if )7 4-k---1 = all' and 4 +k-1 0 al'. v). p. Lemma 1.4 Ergodicity.t. The key to these and many other limit results is that do -closeness implies that sets of large measure for one distribution must have large blowup with respect to the other distribution. y) < 2e. v). )1). entropy. there are measurepreserving mappings (/) and from (Z. y) < c. By Theorem 1. This follows from the fact that if v is ergodic and close to i. .*(z)) > el) 5_ E. respectively.14. and small blowups don't change empirical frequencies very much. and d limits. Theorem 1. To show that small blowups don't change frequencies very much.b.un (B) > 1 — c. i + k — 1].9. PROCESS TOPOLOGIES. then yi 0 x i for at least one j E [i.-measure at most c.) The a-limit of ergodic processes is ergodic. then cin (it. then iin (t. except for a set of a-measure at most c.u is the d-limit of ergodic processes. y) < e 2 and . and N1 so that if n > N1 then (15) d( x.t.9.9. such that a(lz: dn((z). mixing. (b) If 4 and are measure-preserving mappings from a nonatomic probability space (Z. which is just Lemma 1.u) and (An. By the ergodic theorem the relative frequency p(eil`ix) converges almost surely to a limit p(anx). y) < c2.9. where [ME = dn(Yç z B) EL (b) If cin (i.11 ) < (n — k + 1)p(a in. .) and (A". I. As before. almost surely. ±1c-1 0 ak 1' Let c be a positive number and use (14) to choose a positive number S < c. denote the 3-neighborhood or 3-blow-up of B c A".

The definition of Bn and the two preceding inequalities yield — p(451)1 < 4E. so the covering-exponent interpretation of entropy.1' E [B. and fix E > 0.14 implies that y([7-. 8 then there is a sequence i E Bn such that ] 14) — P(a lic l51)1 < E. Given 8 > 0. vn ) < 82.4.17 (Mixing and a-limits. choose Tn c An such that ITn I _ 1 as n co.) Entropy is a-continuous on the class of ergodic processes.9. Let y be an ergodic process such that a(A.4. Given E > 0. Fix n > N2 and apply Lemma 1.5.16 (Entropy and d-limits. completing the proof of Theorem 1.-measure. however. Let 8 be a positive number such that un bi for all n > N.7. ] Since [B] large An -measure.5. y) < 3 2 and hence an (11. BASIC CONCEPTS.16. each of positive p. D Theorem 1. and y such that Evdi (xi . Theorem 1. Mixing is also preserved in the passage to d-limits.-entropy-typical sequences by very much. Suppose A is the a-limit of mixing processes. and has large v-measure. Let pc be an ergodic process with entropy h = h(i). 8.9. <8 and hence I 19 (alic 1-741 ) IY7)1 < 6 by (15).7. Theorem 1. and let A.9. Al. Let y be an ergodic measure such that cl(A. that if y E [A]s then there is a sequence x E Bn such that dn (fi! . Bn = fx ril : I P(414) — v(4)1 < has vn measure at least 1-6. Lemma 1. for n > N I . and. . Proof Now the idea is that if p. and y are d-close then a small blowup does not increase the set of p.) Choose > N such that p(T) > 1 — 3. that stationary coding preserves mixing. since for stationary measures d is the supremum of the there is an N2 > Ni such that if n > N2. ] 5 ) > 1 — 2 8 . Proof The proof is similar to the proof. the finite sequence characterization of ergodicity. (Such a 3 exists by the blowup-bound lemma. n > Because y is ergodic N1 . y) < 8 2 . Yi)) = Ex(cin (4.u) < h (v) 2E. fix all and VII. 57 11 E [B. Note. the set an . for sufficiently large n.15.9. let y be a mixing process such that d(A . y) < 3 2 .100 CHAPTER I. see Example 19. implies that A is ergodic. via finite-coder approximation. be a stationary joining of p. Likewise if 5. This completes the proof of Theorem 1.) The d-limit of mixing processes is mixing. y. Theorem 1. Likewise. choose N.14 to obtain A n ([13.n .9.')) < 82. v gives measure at least I — 3 to a subset of An of cardinality at most 2n (n( v)+E ) and hence El h(. yields h(v) < h(p) + 2E. for each < 2i1(h+E) and p(T) n > N. Lemma 1.9.]6 ) > 1 — 23. for all n > N.


This implies that d_n(x_1^n, y_1^n) < δ except for a set of λ-measure at most δ, so that if δ < 1/n then the set {(x_1^n, y_1^n): x_1^n ≠ y_1^n} has λ-measure at most δ. Thus by making δ even smaller it can be supposed that μ(a_1^n) is so close to ν(a_1^n) and μ(b_1^n) so close to ν(b_1^n) that

    |μ(a_1^n)μ(b_1^n) − ν(a_1^n)ν(b_1^n)| ≤ ε/3.

Likewise, for any m ≥ 1,

    λ({(x_1^{m+n}, y_1^{m+n}): x_1^n ≠ y_1^n or x_{m+1}^{m+n} ≠ y_{m+1}^{m+n}})
        ≤ λ({(x_1^n, y_1^n): x_1^n ≠ y_1^n}) + λ({(x_{m+1}^{m+n}, y_{m+1}^{m+n}): x_{m+1}^{m+n} ≠ y_{m+1}^{m+n}}) ≤ 2δ,

so that if δ < 1/2n is small enough, then

    |μ([a_1^n] ∩ T^{−m}[b_1^n]) − ν([a_1^n] ∩ T^{−m}[b_1^n])| ≤ ε/3,

for all m ≥ 1. Since ν is mixing, however,

    |ν(a_1^n)ν(b_1^n) − ν([a_1^n] ∩ T^{−m}[b_1^n])| ≤ ε/3,

for all sufficiently large m, and hence the triangle inequality yields

    |μ(a_1^n)μ(b_1^n) − μ([a_1^n] ∩ T^{−m}[b_1^n])| ≤ ε,

for all sufficiently large m, provided only that δ is small enough. Thus μ must be mixing, which establishes the theorem. □

for all sufficiently large m, provided only that 6 is small enough. Thus p, must be mixing, which establishes the theorem. This section will be closed with an example, which shows, in particular, that the

ii-topology is nonseparable.
Example 1.9.18 (Rotation processes are generally d̄-far-apart.) Let S and T be the transformations of X = [0, 1) defined, respectively, by

    Sx = x ⊕ α,    Tx = x ⊕ β,

where, as before, ⊕ denotes addition modulo 1.

Proposition 1.9.19
Let λ be an (S × T)-invariant measure on X × X which has Lebesgue measure as marginal on each factor. If α and β are rationally independent then λ must be the product measure.

Proof "Rationally independent" means that kα + mβ is irrational for any two rationals k and m with (k, m) ≠ (0, 0). Let C and D be measurable subsets of X. The goal is to show that λ(C × D) = μ(C)μ(D), where μ denotes Lebesgue measure. It is enough to prove this when C and D are intervals and μ(C) = 1/N, where N is an integer. Given ε > 0, let C_1 be a subinterval of C of length (1 − ε)/N and let

    E = C_1 × D,    F = X × D.

Since α and β are rationally independent, the two-dimensional version of Kronecker's theorem, Proposition 1.2.16, can be applied, yielding integers m_1, m_2, ..., m_N such that if V denotes the transformation S × T then

    V^{m_i}E ∩ V^{m_j}E = ∅, if i ≠ j,

and λ(F Δ F′) < 2ε, where F′ = ∪_{i=1}^{N} V^{m_i}E. It follows that λ(E) = λ(F′)/N is within 2ε/N of λ(F)/N = μ(C)μ(D). Let ε → 0 to obtain λ(C × D) = μ(C)μ(D). This completes the proof of Proposition 1.9.19. □


Now let P be the partition of the unit interval that consists of the two intervals P_0 = [0, 1/2), P_1 = [1/2, 1). It is easy to see that the mapping that carries x into its (T, P)-name is an invertible mapping of the unit interval onto the space A^Z which carries Lebesgue measure onto the Kolmogorov measure. This fact, together with Proposition 1.9.19, implies that the only joining of the (T, P)-process and the (S, P)-process is the product joining, and this, in turn, implies that the d̄-distance between these two processes is 1/2. This shows in particular that the class of ergodic processes is not separable, for, in fact, even the translation (rotation) subclass is not separable. It can be shown that the class of all processes that are stationary codings of i.i.d. processes is d̄-separable, see Exercise 3 in Section IV.2.

I.9.c Exercises.

1. A measure μ ∈ P(X) is extremal if it cannot be expressed in the form μ = tμ_1 + (1 − t)μ_2, where 0 < t < 1 and μ_1 ≠ μ_2. (a) Show that if μ is ergodic then μ is extremal. (Hint: if μ = tμ_1 + (1 − t)μ_2, apply the Radon-Nikodym theorem to obtain μ_i(B) = ∫_B f_i dμ and show that each f_i is T-invariant.) (b) Show that if μ is extremal, then it must be ergodic. (Hint: if T^{−1}B = B then μ is a convex sum of the T-invariant conditional measures μ(·|B) and μ(·|X − B).)

2. Show that d̄_n(X_1^n, Y_1^n) is a complete metric by showing that

(a) The triangle inequality holds. (Hint: if λ joins X_1^n and Y_1^n, and λ* joins Y_1^n and Z_1^n, then Σ_{y_1^n} λ(x_1^n, y_1^n)λ*(z_1^n|y_1^n) joins X_1^n and Z_1^n.) (b) If d̄_n(X_1^n, Y_1^n) = 0 then X_1^n and Y_1^n have the same distribution. (c) The metric d̄_n(X_1^n, Y_1^n) is complete.

3. Prove the superadditivity inequality (3).

4. Let μ and ν be the binary Markov chains with the respective transition matrices

    M = [  p    1−p ]        N = [ 0  1 ]
        [ 1−p    p  ],           [ 1  0 ].

Let μ̄ be the Markov process defined by M².
v n = X2n, n = 1, 2, ..., then, almost (a) Show that if x is typical for it, and v surely, y is typical for rt.
(b) Use the result of part (a) to show d(,u, y) = 1/2, if 0 < p < 1. 5. Use Lemma 1.9.11(f) to show that if it. and y are i.i.d. then d(P., y) = (1/2)1A — v I i. (This is a different method for obtaining the d-distance for i.i.d. processes than the one outlined in Example 1.9.3.) 6. Suppose y is ergodic and rt, is the concatenated-block process defined by A n on A n . (p ii ) (Hint: g is concentrated on shifts of sequences Show that d(v ,g) = --n.n, ,.--n,• that are typical for the product measure on (An )° ° defined by lin .)

a.

7. Prove property (d) of Lemma 1.9.11.

8. Show there is a joining λ* of μ and ν such that λ*(a_1^n, a_1^n) = min(μ(a_1^n), ν(a_1^n)).

9. Prove that |μ − ν|_n = 2 − 2 Σ_{a_1^n} min(μ(a_1^n), ν(a_1^n)).

10. Two sets C, D ⊂ A^k are α-separated if C ∩ [D]_α = ∅. Show that if the supports of μ_k and ν_k are α-separated then d̄_k(μ_k, ν_k) ≥ α.

11. Suppose μ and ν are ergodic and d̄(μ, ν) > α. Show that if ε > 0 is given, then for sufficiently large n there are subsets C_n and D_n of A^n such that μ(C_n) > 1 − ε, ν(D_n) > 1 − ε, and d̄_n(x_1^n, y_1^n) > α − ε, for x_1^n ∈ C_n, y_1^n ∈ D_n. (Hint: if k is large enough, then p_k(·|x_1^n) ≈ μ_k and p_k(·|y_1^n) ≈ ν_k, and then d̄_n(x_1^n, y_1^n) cannot be much smaller than α, by (7).)

12. Let (Y, ν) be a nonatomic probability space and suppose ψ: Y → A^n is a measurable mapping such that d̄_n(μ_n, ν ∘ ψ^{−1}) < δ. Show that there is a measurable mapping φ: Y → A^n such that μ_n = ν ∘ φ^{−1} and such that E_ν(d_n(φ(y), ψ(y))) < δ.

Section 1.10 Cutting and stacking.
Concatenated-block processes and regenerative processes are examples of processes with block structure. Sample paths are concatenations of members of some fixed collection S of blocks, that is, finite-length sequences, with both the initial tail of a block and subsequent block probabilities governed by a product measure it* on S' . The assumption that te is a product measure is not really necessary, for any stationary measure A* on S" leads to a stationary measure p, on Aw whose sample paths are infinite concatenations of members of S, provided only that expected block length is finite. Indeed, the measure tt is just the measure given by the tower construction, with base S", measure IL*, and transformation given by the shift on S°°, see Section 1.2.c.2. It is often easier to construct counterexamples by thinking directly in terms of block structures, first constructing finite blocks that have some approximate form of the final desired property, then concatenating these blocks in some way to obtain longer blocks in which the property is approximated, continuing in this way to obtain the final process as a suitable limit of finite blocks. A powerful method for organizing such constructions will be presented in this section. The method is called "cutting and stacking," a name suggested by the geometric idea used to go from one stage to the next in the construction.

104

CHAPTER I. BASIC CONCEPTS.

Before going into the details of cutting and stacking, it will be shown how a stationary measure ,u* on 8" gives rise to a sequence of pictures, called columnar representations, and how these, in turn, lead to a description of the measure on A'.

I.10.a

The columnar representations.

Fix a set 8 c A* of finite length sequences drawn from a finite alphabet A. The members of the initial set S c A* are called words or first-order blocks. The length of a word w E S is denoted by t(w). The members w7 of the product space 8" are called nth order blocks. The length gel') of an n-th order block wrii satisfies £(w) = E, t vi ). The symbol w7 has two interpretations, one as the n-tuple w1, w2, , w,, in 8", the other as the concatenation w 1 w2 • • • to,„ in AL, where L = E, t(wi ). The context usually makes clear which interpretation is in use. The space S" consists of the infinite sequences iv?"' = w1, w2, ..., where each Wi E S. A (Borel) probability measure 1.1* on S" which is invariant under the shift on S' will be called a block-structure measure if it satisfies the finite expected-length condition,
(

E(t(w)) =

Ef (w),uT(w ) <
WES

onto 8, while, in general, Here 14 denotes the projection of of ,u* onto 8'. Note, by the way, that stationary gives E v( w7 ))

denotes the projection

. E t (w ) ,u.: (w7 ) . nE(t(w)).
to7Esn

Blocks are to be concatenated to form A-valued sequences, hence it is important to have a distribution that takes into account the length of each block. This is the probability measure Â, on S defined by the formula
A( w) —

w)te (w)

E(fi

,WE

8,

where E(ti) denotes the expected length of first-order blocks. The formula indeed defines a probability distribution, since summing t(w),u*(w) over w yields the expected block length E(t 1 ). The measure is called the linear mass distribution of words since, in the case when p,* is ergodic, X(w) is the limiting fraction of the length of a typical concatenation w1 w2 ... occupied by the word w. Indeed, using f (w1w7) to denote the number of times w appears in w7 E S", the fraction of the length of w7 occupied by w is given by

t(w)f (wIw7)
t(w7)

= t(w)

f(w1w7) n
n f(w7)

t(w)11,* (w)
E(t 1 )

= A.(w), a. s.,

since f (w1w7)1 n ---÷ ,e(w) and t(will)In E(t1), almost surely, by the ergodic theorem applied to lu*. The ratio r(w) = kt * (w)/E(ti) is called the width or thickness of w. Note that A(w) = t(w)r(w), that is, linear mass = length x width.
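For the example used in Figure 1.10.1, introduced just below (S = {v, w} with v = 01, w = 011, μ_1^*(v) = 1/3, μ_1^*(w) = 2/3), these quantities are easy to compute. The short sketch below is only illustrative, with exact arithmetic; it prints the expected length 8/3, the widths τ(v) = 1/8 and τ(w) = 1/4, and the linear masses λ(v) = 1/4 and λ(w) = 3/4, which sum to 1.

```python
from fractions import Fraction as F

mu1 = {"01": F(1, 3), "011": F(2, 3)}                  # first-order block distribution
E_len = sum(len(w) * p for w, p in mu1.items())        # expected block length E(l_1) = 8/3
width = {w: p / E_len for w, p in mu1.items()}         # tau(w) = mu*(w) / E(l_1)
mass = {w: len(w) * width[w] for w in mu1}             # lambda(w) = l(w) * tau(w)
print(E_len, width, mass, sum(mass.values()))          # total linear mass is 1
```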


The unit interval can be partitioned into subintervals indexed by w ∈ S such that the length of the subinterval assigned to w is λ(w). Thus no harm comes from thinking of λ as Lebesgue measure on the unit interval. A more useful representation is obtained by subdividing the interval that corresponds to w into ℓ(w) subintervals of width τ(w), labeling the i-th subinterval with the i-th term of w, then stacking these subintervals into a column, called the column associated with w. This is called the first-order columnar representation of (S^∞, μ*). Figure 1.10.1 shows the first-order columnar representation of S = {v, w}, where v = 01, w = 011, μ_1^*(v) = 1/3, and μ_1^*(w) = 2/3.

Figure 1.10.1 The first-order columnar representation of (S, AT). In the columnar representation, shifting along a word corresponds to moving intervals one level upwards. This upward movement can be accomplished by a point mapping, namely, the mapping T1 that moves each point one level upwards. This mapping is not defined on the top level of each column, but it is a Lebesgue measure-preserving map from its domain to its range, since a level is mapped linearly onto the next level, an interval of the same width. (This is also shown in Figure I.10.1.) In summary, the columnar representation not only carries full information about the distribution A.(w), or alternatively, /4(w) = p,*(w), but shifting along a block can be represented as the transformation that moves points one level upwards, a transformation which is Lebesgue measure preserving on its domain. The first-order columnar representation is determined by S and the first-order distribution AI (modulo, of course, the fact that there are many ways to partition the unit interval into subintervals and stack them into columns of the correct sizes.) Conversely, the columnar representation determines S and Al% The information about the final process is only partial since the picture gives no information about how to get from the top of a column to the base of another column, in other words, it does not tell how the blocks are to be concatenated. The first-order columnar representation is, of course, closely related to the tower representation discussed in Section I.2.c.2, the difference being that now the emphasis is on the width distribution and the partially defined transformation T that moves points upwards. Information about how first-order blocks are concatenated to form second-order blocks is given by the columnar representation of the second-order distribution ,q(w?). Let r (w?) = p(w)/E(2) be the width of w?, where E(t2) is the expected length of w?, with respect to ,4, and let X(w?) = t(w?)r(w?) be the second-order linear mass. The second-order blocks w? are represented as columns of disjoint subintervals of the unit interval of width r (14) and height aw?), with the i-th subinterval labeled by the i-th term of the concatenation tv?. A key observation, which gives the name "cutting and stacking" to the whole procedure that this discussion is leading towards, is that
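The columnar representation and its upward map are easy to make concrete. The sketch below is an illustrative layout, not from the book: the levels of each column are simply placed left to right in the unit interval. It builds the first-order columns of Figure 1.10.1 and implements a map T1 that translates a point one level up within its column and is undefined on top levels.

```python
from fractions import Fraction as F

words = {"01": F(1, 3), "011": F(2, 3)}
E1 = sum(len(w) * p for w, p in words.items())
tau = {w: p / E1 for w, p in words.items()}            # column widths

# Level j of the column for w is an interval of width tau(w), labeled w[j].
levels, x = [], F(0)
for w in words:
    for j in range(len(w)):
        levels.append({"word": w, "level": j, "left": x, "label": w[j]})
        x += tau[w]                                    # total length used is sum l(w)tau(w) = 1

def T1(point):
    """Translate a point one level up in its column; None on a top level (T1 undefined there)."""
    for lv in levels:
        if lv["left"] <= point < lv["left"] + tau[lv["word"]]:
            if lv["level"] == len(lv["word"]) - 1:
                return None
            up = next(l for l in levels
                      if l["word"] == lv["word"] and l["level"] == lv["level"] + 1)
            return up["left"] + (point - lv["left"])   # a measure-preserving translation
    return None

print(T1(F(1, 16)))    # a point in the bottom level of the "01" column moves up one level
```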

which is just half the total width 1/E(t 1 ) of the first-order columns. the total mass contributed by the top t(w) levels of all the columns that end with w is Et(w)r(w1w)= Ea(lV t2)) WI Pf(W1W) 2E(ti) =t(w* *(w) Thus half the mass of a first-order column goes to the top parts and half to the bottom parts of second-order columns. where . for the total width of the second-order columns is 1/E(t2). E aw yr ( ww2 ) . Indeed. so. 2 which is exactly half the total mass of w in the first-order columnar representation. 2 " (w)r(w). If this cutting and stacking method is used then the following two properties hold. Likewise.u. .. then it will continue to be so in any second-order representation that is built from the first-order representation by cutting and stacking its columns. can be built by cutting and stacking the first-order representation shown in Figure 1.(wv) = 2/9.2 The second-order representation via cutting and stacking. 16(ww) = 4/9._ C i D *********** H E ****** B VW A B C V D H E F G ***************4****+********** w A VV F G ****** *********** WV WW Figure 1.2 shows how the second-order columnar representation of S = {v.1..(vw) = . if y is directly above x in some column of the firstorder representation. le un-v2)\ ( aw* * (w) = 2E( ii) tV2 2 _ 1 toor(w). Note also the important property that the set of points where T2 is undefined has only half the measure of the set where T1 is defined. Indeed. then stacking these in pairs. it is possible to cut the first-order columns into subcolumns and stack them in pairs so as to obtain the second-order columnar representation. Figure 1.4(vv) = 1/9„u.. w}. as claimed. the total mass in the second-order representation contributed by the first f(w) levels of all the columns that start with w is t(w) x--. The significance of the fact that second-order columns can be built from first-order columns by appropriate cutting and stacking is that this guarantees that the transformation T2 defined for the second-order picture by mapping points directly upwards one level extends the mapping T1 that was defined for the first-order picture by mapping points directly upwards one level.10.10.1. BASIC CONCEPTS.10. Second-order columns can be be obtained by cutting each first-order column into subcolumns. the 2m-th order representation can be produced by cutting the columns of the m-th order columnar representation into subcolumns and stacking them in pairs. One can now proceed to define higher order columnar representations.106 CHAPTER I. . .--(t2) 2_. . In general.
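A minimal numerical sketch of the passage to second-order columns, under the simplifying assumption that μ_2^* is the product measure μ_1^* × μ_1^*, as it is for a concatenated-block process: each pair (v, w) contributes a column labeled vw of width μ_2^*(v, w)/E(ℓ_2), the expected two-block length is 2E(ℓ_1), the total width halves, and the total linear mass stays 1, in line with the mass bookkeeping above.

```python
from fractions import Fraction as F

mu1 = {"01": F(1, 3), "011": F(2, 3)}
E1 = sum(len(w) * p for w, p in mu1.items())
mu2 = {(v, w): mu1[v] * mu1[w] for v in mu1 for w in mu1}    # product-measure assumption
E2 = sum((len(v) + len(w)) * p for (v, w), p in mu2.items())
cols2 = {v + w: p / E2 for (v, w), p in mu2.items()}         # second-order widths tau(vw)
print(E2 == 2 * E1)                                          # True
print(sum(cols2.values()), F(1, 2) / E1)                     # total width is half of 1/E(l_1)
print(sum(len(name) * t for name, t in cols2.items()))       # total linear mass is still 1
```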

hence. Theorem 1. A column C = (L1. disjoint. Ar(2).10. A column structure S is a nonempty collection of mutually disjoint. The interval L1 is called the base or bottom. The vector Ar(C) = (H(1). Of course. which has been commonly used in ergodic theory. labeled columns. The difference between the building of columnar representations and the general cutting and stacking method is really only a matter of approach and emphasis. unless stated otherwise. in essence. The suggestive name. and for each j. ordered. where Pa is the union of all the levels labeled a. column structure. Note that the members of a column structure are labeled columns. rather than the noninformative name.10. Some applications of its use to construct examples will be given in Chapter 3. it is a way of building something new. The cutting and stacking language and theory will be rigorously developed in the following subsections. it is a way of representing something already known.. . Thus. I. A labeling of C is a mapping Ar from {1. The support of a column C is the union of its levels L. The labeling of levels by symbols from A provides a partition P = {Pa : a E AI of the unit interval. The columnar representation idea starts with the final block-structure measure . Two columns are said to be disjoint if their supports are disjoint. column structures define distributions on blocks.b The basic cutting and stacking vocabulary. L2. N(j) = Ar(L i ) is called the name or label of L. is undefined. but this all happens in the background while the user focuses on the combinatorial properties needed at each stage to produce the desired final process. Their common extension is a Lebesgue measure-preserving transformation on the entire interval. .5. f(C)} into the finite alphabet A. . f(C) is a nonempty. Lac)) of length. The goal is still to produce a Lebesgue measure-preserving transformation on the unit interval. P)-process is the same as the stationary process it defined by A*. then the set of transformations {TA will have a common extension T defined for almost all x in the unit interval. A general form of this fact. will be proved later. gadget. which is of course given by X(C) = i(C)r (C). . see Corollary 1. or height.SECTION 1.. The cutting and stacking idea focuses directly on the geometric concept of cutting and stacking labeled columns. and cutting and stacking extends these to joint distributions on higher level blocks.u* and uses it to construct a sequence of partial transformations of the unit interval each of which extends the previous ones. CUTTING AND STACKING. collection of subintervals of the unit interval of equal positive width (C). 2. Ar(E(C))) is called the name of C. if cutting and stacking is used to go from stage to stage. and the interval L i is called the j-th level of the column. Its measure X(C) is the Lebesgue measure X of the support of C. using the desired goal (typically to make an example of a process with some sample path property) to guide how to cut up the columns of one stage and reassemble to create the next stage. 107 (b) The set where T2m is defined has one-half the measure of the set where Tft. will be used in this book. Thus. (a) The associated column transformation T2m extends the transformation Tm .10. gives the desired stationary A-valued process. the interval Lt(c) is called the top. .3. in turn. in essence.10. which. In particular. . The (T.. and the transformation T will preserve Lebesgue measure.

each labeled column in S is a labeled column in S' with the same labeling. . Note that T is one-to-one and its inverse maps points downwards. L'e ). If a column is pictured by drawing L i+ 1 directly above L1 then T maps points directly upwards one level.Et(c). namely. Note that the base and top have the same . r(S). (An alternative definition of subcolumn in terms of the upward map is given in Exercise 1. A subcolumn of L'2 . terminology should be interpreted as statements about sets of labeled columns." which is taken as shorthand for "column structures with disjoint support. The width distribution 7 width r (C) r (C) "f(C) = EVES r (D) (45)• The measure A(S) of a column structure is the Lebesgue measure of its support. The base or bottom of a column structure is the union of the bases of its columns and its top is the union of the tops of its columns. r (5) < 00. .) A (column) partition of a column C is a finite or countable collection {C(i) = (L(i. In particular. Note that implicit in the definition is that a subcolumn always has the same height as the column.. The width r(S) of a column structure is the sum of the widths of its columns. that is. This idea can be made precise in terms of subcolumns and column partitions as follows.(c). t(C)): i E I} .108 CHAPTER I. (b) The distance from the left end point of L'i to the left end point of L 1 does not depend on j.. The transformation T = Ts defined by a column structure S is the union of the upward maps defined by its columns. Li) is a column C' = (a) For each j. . this means that expected column height with respect to the width distribution is finite. ) c cyf( a E r(S) r (S) . L2. T is not defined on the top of S and is a Lebesgue measure-preserving mapping from all but the top to all but the bottom of S. A(S) = Ex(c). is a subinterval of L i with the same label as L. in other words. A column C = (L 1 L2. called its upward map. for 1 < j < f(C). BASIC CONCEPTS. 1). and is not defined on the top level. Thus the union S US' of two column structures S and S' consists of the labeled columns of each. CES CES Note that X(S) < 1 since columns always consist of disjoint subintervals of the unit interval. . L(i. It is called the mapping or upward mapping defined by the column structure S. C = (L1. and S c S' means that S is a substructure of S'. . L(i. Each point below the base of S is mapped one level directly upwards by T. and column structures have disjoint columns." where the support of a column structure is the union of the supports of its columns. such that the following hold. Le (C )) defines a transformation T = T.is the normalized column Lebesgue measure. Columns are cut into subcolumns by slicing them vertically. An exception to this is the terminology "disjoint column structures. The precise definition is that T maps Li in a linear and order-preserving manner onto L i+ i. for CES (c ) < Ei(c)-r(c) Ex . 2) .

the union of whose supports is the support of C.10. In summary. for any S' built by cutting and stacking from S. (iii) The upward map Ts is a Lebesgue measure-preserving mapping from its domain to its range. then Ts . A column partitioning of a column structure S is a column structure S' with the same support as S such that each column of S' is a subcolumn of a column of S. (iv) If S' is built by cutting and stacking from S.1. defined by stacking C2 on top of C1 agrees with the upward map Tc„ wherever it is defined.SECTION 1. Li with the label of C1* C2 defined to be the concatenation vw of the label v of C 1 and the label w of C2. 109 of disjoint subcolumns of C.. Let C 1 = (L(1. j) 1 < j < f(C1) L(2. The stacking idea is defined precisely as follows. . A column structure S' is a stacking of a column structure S. The basic fact about stacking is that it extends upward maps. The cutting idea extends to column structures. namely. Partitioning corresponds to cutting. thus cutting C into subcolumns according to a distribution 7r is the same as finding a column partition {C(i)} of C such that 7r(i) = t(C(i))1 . the upward map Tco. extends T8. (i) The domain of Ts is the union of all except the top of S. (ii) The range of Ts is the union of all except the bottom of S. a column partitioning is formed by partitioning each column of S in some way. This is also true of the upward mapping T = Ts defined by a column structure S. Longer stacks are defined inductively by Ci *C2* • • • Ck Ck+i = (C1 * C2 * • • • Ck) * Ck+1. since Te. Thus. j — f(C1)) t(C1) < j f(Ci) -e(C2). Indeed. is built by cutting each of the first-order columns into subcolumns and stacking these in pairs. CUTTING AND STACKING. The stacking of C2 on top of C 1 is denoted by C1* C2 and is the labeled column with levels L i defined by = L(1.c. .(C). the key properties of the upward maps defined by column structures are summarized as follows.. and extends their union. the columns of S' are permitted to be stackings of variable numbers of subcolumns of S.a*). the second-order columnar representation of (S. is now defined on the top of C1. and taking the collection of the resulting subcolumns. where S c A*. A column structure S' is built by cutting and stacking from a column structure S if it is a stacking of a column partitioning of S. for example. for each i. if they have the same support and each column of S' is a stacking of columns of S. In general.c. and with the upward map TC 2 wherever it is defined. it is extended by the upward map T = Ts . j): 1 < j < t(C1)) and C2 = (L(2. Note that the width of C1 * C2 is the same as the width of C1. j): 1 < j < t(C2)) be disjoint labeled columns of the same width. . Thus. which is also the same as the width of C2.

The process {X. (Such an m exists eventually almost surely. (a) For each m > 1. since the tops are shrinking to 0. (x) to be the index of the member of P to which 7 1 x belongs. A sequence {S(m)} of column structures is said to be complete if the following hold. called the process and Kolmogorov measure defined by the complete sequence {S(m)}. then T -1 B = Ts -(m i ) B is an interval of the same length as B. where Pa is the union of all the levels of all the columns of 8(1) that have the name a. and a condition for ergodicity. the common extension of the inverses of the Ts (m) is defined for almost all x. If the tops shrink to 0. the transformation produces a stationary process. P)-process {X. Likewise. Furthermore. BASIC CONCEPTS. of a column C E S(m) of height t(C) > j n — 1.10. then a common extension T of the successive upward maps is defined for almost all x in the unit interval and preserves Lebesgue measure. then choosing m so that x belongs to the j-th level. it follows that T is measurable and preserves Lebesgue measure. El The transformation T is called the transformation defined by the complete sequence {S(m)}. and defining X. S(m + 1) is built by cutting and stacking from S(m). (b) X(S(1)) = 1 and the (Lebesgue) measure of the top of S(m) goes to 0 as m —> oc. Proof For almost every x E [0. A precise formulation of this result will be presented in this section. This is done by starting with a column structure 8(1) with support 1 and applying cutting and stacking operations to produce a new structure 8(2).) The k-th order joint distribution of the process it defined by a complete sequence {S(m)} can be directly estimated by the relative frequency of occurrence of 4 in a . it is clearly the inverse of T.110 CHAPTER I. The (T. (x) to be the name of level Li+n _1 of C. since the top shrinks to 0 as m oc.c The final transformation and process. The label structure defines the partition P = {Pa : a E A).} and its Kolmogorov measure it are. The goal of cutting and stacking operations is to construct a measure-preserving transformation on the unit interval. This completes the proof of Theorem 1. This is equivalent to picking a random x in the support of 8(1). T is invertible and preserves Lebesgue measure. 1] there is an M = M (x) such that x lies in an interval below the top of S(M). Since such intervals generate the Borel sets on the unit interval. along with the basic formula for estimating the joint distributions of the final process from the sequence of column structures. Cutting and stacking operations are then applied to 8(2) to produce a third structure 8(3). respectively. Theorem 1.} is described by selecting a point x at random according to the uniform distribution on the unit interval and defining X.10. say L i . I. Together with the partition defined by the names of levels. If B is a subinterval of a level in a column of some S(m) which is not the base of that column. 1].10. so that the transformation T defined by Tx = Ts(m)x is defined for almost all x and extends every Ts (m) . Continuing in this manner a sequence {S(m)} of column structures is obtained for which each member is built by cutting and stacking from the preceding one.3 (The complete-sequences theorem.3. Further cutting and stacking preserves this relationship of being below the top and produces extensions of Ts(m).) If {8(m)} is complete then the collection {Ts (n )} has a common extension T defined for almost every x E [0.

since the top has small measure. X(8(1)) = 1 and the measure of the top of S(m + 1) is half the measure of the top of S(m).. CES (1) since f(C)p i (aIC) just counts the number of times a appears in the name of C. and let Pa k = { x: x = . averaged over the Lebesgue measure of the column name. to 0. Let itt be the . establishing (1) for the case k = 1. Let Pa be the union of all the levels of all the columns in 8(1) that have the name a. the only error in using (1) to estimate Wa l') comes from the fact that pk (ainC) ignores the final k — 1 levels of the column C. The quantity (t(C) — k 1)pk(a lic IC) counts the number of levels below the top k — 1 levels of C that are contained in Pal k . since the top k — 1 levels of C have measure (k — 1)r(C). • k E Ak 1 (m) Proof In this proof C will denote either a column or its support. pk(lC) is the empirical overlapping k-block distribution pk •ixf (c) ) defined by the name of C.4 (Estimation of joint distributions. which is. To make this precise. The desired result (1) now follows since the sum of the widths t(C) over the columns of (m) was assumed to go oc. CUTTING AND STACKING. so that {S(m)} is a complete sequence. given by p. {8(m)}. as n An application of the estimation formula (1) is connected to the earlier discussion on of the sequence of columnar representations defined by a block-structure measure where S c A*. that is. the context will make clear which is intended. The relative frequency of occurrence of all' in a labeled column C is defined by pk(a inC) = Ili E [1. The same argument applies to each S(m). of course. This estimate is exact for k = 1 for any m and for k > 1 is asymptotically correct as m oc. and hence X(Pa fl C) = [t(c)pi (aiC)j r(c) = pi(aIC)X(C).u(a) = X(Pa).10. . then (1) = lim E CES pk(ailC)X(C).x. let {. for x E [0.4. Furthermore. 111 column name. Theorem 1.10. Without loss of generality it can be assumed that 8(m +1) is built by cutting and stacking from S(m). in turn. This completes the proof of Theorem 1.(apx(c).} E A Z be the sequence defined by the relation Tnx E Px„ n E Z.) If t is the measure defined by the complete sequence. For each m let 8(m) denote the 2''-order columnar representation. 1].SECTION L10. f(C) — k 1 : x 1' 1 = ] where xf (c) is the name of C. and hence IX(Paf n C) — p k (c41C)A(C)1 < 2(k — 1)r(C). a negligible effect when m is large for then most of the mass must be in long columns. For k > 1. Then.
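The estimate in the theorem above is easy to compute for a concrete column structure. The sketch below is illustrative; the column names and widths are those of the first-order structure of Figure 1.10.1. It forms the empirical overlapping k-block distribution of each column name and weights it by the column's Lebesgue mass λ(C) = height × width; for k = 1 the result is exact.

```python
from collections import Counter

def empirical_k_blocks(name, k):
    """Overlapping k-block frequencies p_k(.|C) computed from a column name."""
    total = len(name) - k + 1
    counts = Counter(name[i:i + k] for i in range(total))
    return {block: c / total for block, c in counts.items()}

def estimate_mu_k(columns, k):
    """columns: dict name -> width; returns sum over columns of p_k(.|C) * lambda(C)."""
    est = Counter()
    for name, width in columns.items():
        mass = len(name) * width                      # lambda(C) = height * width
        for block, freq in empirical_k_blocks(name, k).items():
            est[block] += freq * mass
    return dict(est)

cols = {"01": 1 / 8, "011": 1 / 4}                    # the structure of Figure 1.10.1
print(estimate_mu_k(cols, 1))                         # {'0': 0.375, '1': 0.625}, exact for k = 1
print(estimate_mu_k(cols, 2))                         # approximate 2-block distribution
```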

be the Kolmogorov measure of the (T. For example. there is a k > 1 such that S(m) and S(m k) are 6-independent. awl)].10. j) for which ai = a. . it is enough to show that (2) = lim E WES pk (ak l it4)A. where tovi) w = ai . These ideas are developed in the following paragraphs. a E Ak . since the start distribution for is obtained by selecting w 1 at random according to the distribution AT.5 The tower construction and the standard representation produce the same measures. = l. The sequence {S(m)} of column structures is asymptotically independent if for each m and each e > 0. with the shift S on S" as base transformation and height function defined by f (wi . measure defined by this sequence. is that later stage structures become almost independent of earlier structures. A(B I C) denotes the conditional measure X(B n c)/x(c) of the intersection of the set B with the support of the column C.112 CHAPTER I. that is. that these independence concepts do not depend on how the columns are labeled. The proof for k > 1 is left as an exercise. In the following discussion C denotes either a column or its support. A condition for this. thought of as the concatenation wi w2 • • • wn E AL. Corollary 1. The column structures S and S' are 6-independent if (3) E E Ige n D) — X(C)À(D)I CES DES' 6. 0 Next the question of the ergodicity of the final process will be addressed.(C). Since this is clearly the same as pk(a inC) where C is the column with name w. then selecting the start position according to the uniform distribution on [1. In other words. Next let T be the tower transformation defined by the base measure it*. Proof Let pk(afiw7) be the empirical overlapping k-block distribution defined by the sequence w E S n . 7 3)-process. two column structures S and S' are 6-independent if and only if the partition into columns defined by S and the partition into columns defined by S' are 6-independent. BASIC CONCEPTS. where P = {Pa } is the partition defined by letting P a be the set of all pairs (wr) . For k = 1 the sum is constant in n. where L = Ei aw 1 ). but only the column distributions. . p. Related concepts are discussed in the exercises. One way to make certain that the process defined by a complete sequence is ergodic is to make sure that the transformation defined by the sequence is ergodic relative to Lebesgue measure. which is sufficient for most purposes. Note. The sequence {S(m)} will be called the standard cutting and stacking representation of the measure it.) = t (w 1) • Also let /72. by the way.

F. DO* Thus summing over D and using (6) yields X(B) > 1 — 3E. say level j. the widths of its intervals must also shrink to 0. since S(M) is built by cutting and stacking from S(m). El If is often easier to make successive stages approximately independent than it is to force asymptotic independence. hence the following stronger form of the complete sequences and ergodicity theorem is quite useful. This shows that X(B) = 1. In summary. ( since T k (B n L) = B n TkL and T k L sweeps out the entire column as k ranges from —j + 1 to aW) — j. (6) X(B n D) > (1 — c)X(D).E 2 )x(c). in particular.(B1 C X(B I C) ?. Proof Let T be the Lebesgue measure-preserving transformation of the unit interval defined by the complete.10. asymptotically independent sequence { 8(m)}. 113 Theorem 1. Fix C and choose M so large that S(m) and S(M) are EX(C)-independent. so if D E .6.) A complete asymptotically independent sequence defines an ergodic process. It will be shown that T is ergodic. CUTTING AND STACKING. which implies that the (T. then there must be at least one level of D which is at least (1 — E) filled by B. (5) E DES(M) igvi c) — x(D)1 E- Let . given E > 0 there is an m and a level L. imply that n v)x(vi C). The goal is to show that A(B) = 1. the Markov inequality and the fact that E x (7. (4) À(B n c) > 1 . of some C E 8(M) such that X(B n L) > 0 This implies that the entire column C is filled to within (1 — E 2 ) by B.F. D e . Thus.T be the set of all D E S(M) for which X(B1 C n D) > 1 — 6. . and hence T must be ergodic. implies E X(D) <2E. Since X(B I C) = E-D A. 1 C) DO' El which together with the condition. so that.SECTION I10.. and hence the collection of all the levels of all the columns in all the S(m) generates the a-algebra. Let B be a measurable set of positive measure such that T -1 B = B.6 (Complete sequences and ergodicity. The argument used to prove (4) then shows that the entire column D must be (1 — c) filled by B. P)-process is ergodic for any partition P. (5). The set C n D is a union of levels of D. Towards this end note that since the top of S(m) shrinks to 0. (1 — 6 2 ). that is.10. This completes the proof of Theorem 1. the following holds.

The only real modification that needs to be made in the preceding proof is to note that if {Li} is a disjoint collection of column levels such that X (BC n (UL)) = BciL i . Independent cutting and stacking is the geometric version of the product measure idea. Thus. with the context making clear which meaning is intended. the same notation will be used for a column or its support. and m so large that Em is smaller than both E 2 and (ju(B)/2) 2 so that there is a set of levels of columns in S(m) such that (7) holds. known as independent cutting and stacking. .u(B)/2.7 (Complete sequences and ergodicity: strong form. can produce a variety of counterexamples when applied to substructures. The next subsection will focus on a simple form of cutting and stacking. Proof Assume T -1 B = B and X(B) > O. The freedom in building ergodic processes via cutting and stacking lies in the arbitrary nature of the cutting and stacking rules. and the sweeping-out argument used to prove (4) shows that (8) X(B n > ( 1 — E2 E SI )x(c). (i) Partition C into subcolumns {Ci } so that r (C1) = (Di ). where Di is the ith column of S. Theorem 1. The argument used in the preceding theorem then gives À(B) > 1 — 3E.10. The support of S' has measure at least 3. BASIC CONCEPTS. A bewildering array of examples have been constructed. (7) Eg where BC is the complement of B. where Em 0.10. then {S(m)} defines an ergodic process. which. I. This proves Theorem 1. This gives a new column structure denoted by C * S and defined as follows. taking 3 = . as earlier. c . as m OC.7. Let S' be the set of all columns C for which some level has this property. by using only a few simple techniques for going from one stage to the next. then by the Markov inequality there is a subcollection with total measure at least 3 for which x(B n L i ) > (1 — E 2 )À(L i ). There are practical limitations. as well as how substructures are to become well-distributed in later substructures. in the complexity of the description needed to go from one stage to the next. however. of course. In this discussion. A few of these constructions will be described in later chapters of this book. in spite of its simplicity.) If {S(m)} is a complete sequence such that for each m.114 CHAPTER I. S(m) and S(m 1) are Em -independent. A column structure S can be stacked independently on top of a labeled column C of the same width. it follows that and there must be at least one C E S(m) for which (8) holds and for which DES (m+1) E Ix(Dic) — x(D)i < E. The user is free to vary which columns are to be cut and in what order they are to be stacked. provided that they have disjoint supports.d Independent cutting and stacking. x ( L i) < E 28.10.

A column structure S can be cut into copies {S i : i E n according to a distribution on /. A column structure S is said to be a copy of size a of a column structure S' if there is a one-to-one correspondence between columns such that corresponding columns have the same height and the same labeling. stack S.8. where two column structures S and S' are said to be isomorphic if there is a one-to-one correspondence between columns such that corresponding columns have the same height.10. independently onto Ci .9. Note. In other words. that a copy has the same width distribution as the original. The new column structure C * S consists of all the columns C. The column structure S * S' is the union of the column structures Ci * C2 S' Figure 1. .10.) (i) Cut S' into copies {S.SECTION 1. It is called the independent stacking of S onto C. (ii) For each i.9 Stacking a structure independently onto a structure. by the way. obtaining Ci * S. CUTTING AND STACKING.8 Stacking a column structure independently onto a column. * V.1. Let S' and S be disjoint column structures with the same width. and the ratio of the width of a column of S to the width of its corresponding column in S' is a..10. The independent cutting and stacking of S' onto S is denoted by S * S' and is defined for S = {Ci } as follows. An alternative description of the columns of S * S' may given as follows.10. To say precisely what it means to independently cut and stack one column structure on top of another column structure the concept of copy is useful. and name. by partitioning each column of S according to the distribution 7r. and letting Si be the column structure that consists of the i-th subcolumn of each column of S.10. so that r(S:)= . Ci *Di . (See Figure 1. width.(Ci ). a scaling of one structure is isomorphic to the other. (See Figure 1. 115 (ii) Stack Di on top of Ci to obtain the new column.) C* S Figure 1.

the number of columns of S * S' is the product of the number of columns of S and the number of columns of S'. namely. i ) I r (S') for S' into subcolumns {C» } such that r(C) = (C)T. The independent cutting and stacking construction contains more than just this concatenation information. This formula expresses the probabilistic meaning of independent cutting and stacking: cut and stack so that width distributions multiply.. . -r(Ci The column structure S *S' consists of all the C.(Ci E (ii) Cut each column Ci '). in the finite column case. Si S2 SI * S2 Figure 1. where 5 (1) = S. In particular. that is. and. 2. M} of itself of equal width and successively independently cutting and stacking them to obtain Si *52 * • • ' * SM where the latter is defined inductively by Si *. in the case when S has only finitely many columns. such that r(Cii ). r (Cii *C) r (S * S') r (Ci ) r (S) r(C) since r (S*S') = r(S). and S(m + 1) = S(m) * S (m). BASIC CONCEPTS. The columns of Si * • • • * Sm have names that are M-fold concatenations of column names of the initial column structure S. Successive applications of repeated independent cutting and stacking.10. the number of columns of Si * • • • * Sm is the M-th power of the number of columns of S. The sequence {S(m)} is called the sequence built (or generated) from S by repeated independent cutting and stacking.10.116 (i) Cut each C each Ci E S.10. that they are independently concatenated according to the width distribution of S. • • *Sm = (S1* • • • *Sm_i)*Sm. is that width distributions multiply. The M-fold independent cutting and stacking of a column structure S is defined by cutting S into M copies {Sm : m = 1. The key property of independent cutting and stacking. for it carries with it the information about the distribution of the concatenations. Note that (Si * • • • * S) = t (S)/M. . S into subcolumns {CO.J*Cii .10 Two-fold independent cutting and stacking. Note that S(m) is isomorphic to the 2m-fold independent .. A column structure can be cut into copies of itself and these copies stacked to form a new column structure. starting with a column structure S produces the sequence {S(m)}. E CHAPTER I. Two-fold independent cutting and stacking is indicated in Figure 1. however. m > 1.
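The "width distributions multiply" rule is easy to check numerically. The following minimal sketch is my own illustration, not from the text: the representation of a column structure as a map from column names to widths, and all function names, are choices made here. It performs two-fold independent cutting and stacking, starting from a structure with two columns of height 1 and width 1/2 labeled '0' and '1' (the coin-tossing example discussed below), and confirms that after m steps the width fraction of each name of length 2^m is the product of the single-symbol fractions.

def independent_cut_and_stack(S1, S2):
    # Independently cut and stack the column structure S2 on top of S1.
    # A column structure is represented as a dict mapping a column name
    # (its label sequence) to its width.  The column C_i * C'_j of S1 * S2
    # gets name c1 + c2 and width tau(C_i) * tau(C'_j) / tau(S2), so width
    # fractions multiply while the total width of S1 * S2 equals that of S1.
    # Columns that happen to receive the same name are merged, since only
    # the distribution of names matters for this illustration.
    tau2 = sum(S2.values())
    result = {}
    for c1, w1 in S1.items():
        for c2, w2 in S2.items():
            result[c1 + c2] = result.get(c1 + c2, 0.0) + w1 * w2 / tau2
    return result

def twofold(S):
    # One step of repeated independent cutting and stacking:
    # cut S into two copies of half width and stack the second on the first.
    half = {c: w / 2 for c, w in S.items()}
    return independent_cut_and_stack(dict(half), dict(half))

S = {'0': 0.5, '1': 0.5}      # two columns of height 1 and width 1/2
for m in range(3):
    S = twofold(S)

total = sum(S.values())
for name in sorted(S)[:4]:
    # each width fraction equals 2 ** (-len(name)), the coin-tossing value
    print(name, S[name] / total)

Run on this starting structure, every name of length 8 ends up with width fraction 2^{-8}, which is the numerical face of the statement that the process built from this structure by repeated independent cutting and stacking is the coin-tossing process.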

and hence the proof shows that {S(m)} is asymptotically independent. of course. D = C1C2. CUTTING AND STACKING. Note that for each mo.10. {S(m): m > mol is built from S(rno) by repeated independent cutting and stacking. as k —> cc. the (T. More interesting examples of cutting and stacking constructions are given in Chapter 3. for each i. oc. Theorem 1. kp(CICf) is the total number of occurrences of C in the sequence C. Proof The proof for the case when S has only a finite number of columns will be given here. if X(S) = 1 then the process built from S by repeated independent cutting and stacking is ergodic.10. define E [1.11. by assumption. Division of X(C n D) = kp(C1Cbt(C)r(D) by the product of X(C) = i(C)r(C) and À(D) = i(D)t(D) yields x(c n D) X(C)À(D) kp(CIC I ) i(D)r (C) 1. P)-process defined by the sequence generated from S by repeated independent cutting and stacking is called the process built from S by repeated independent cutting and stacking. will mean the column D was formed by taking.. The following examples indicate how some standard processes can be built by cutting and stacking. In the case when X(S) = 1. selected independently according to the width distribution of S. with respect to the width distribution. • • Ck E SOTO. just another way to define the regenerative processes.SECTION I. so that if X(S) = 1 then the process defined by {S(m)} is ergodic. a subcolumn of Ci E S. This is. where E(t(S)) is the expected height of the columns both with probability 1.11 (Repeated independent cutting and stacking. by (9) and (10). with probability 1. (9) and (10) 1 1 — i(D) = — k k p(CIC) —> E £(0) i=1 E(t(S)). • Ck. in particular. then given > 0. k]:C = p(CICO I = so that. The simplest of these. Ergodicity is guaranteed by the following theorem. there is an m such that S and S(m) are E-independent. For k = 2'. the countable case is left to Exercise 2. By the law of large numbers.) If {S(m)} is built from S by repeated independent cutting and stacking. The notation D = C = C1 C2 . This establishes that indeed S and S(m) are eventually c-independent. and C E S. Exercise 8. so that its columns are just concatenations of 2' subcolumns of the columns of S. In particular. an example of a process with an arbitrary rate of . as k of S. This completes the proof of Theorem 1. for they are the precisely the processes built by repeated independent cutting and stacking from the columnar representation of their firstorder block-structure distributions.10. and stacking these in order of increasing i. since r(S)E(t(S)) = X(S) = 1. 117 cutting and stacking of S.

(These also known as finite-state processes. It shows how repeated independent cutting and stacking on separate substructures can be used to make a process "look like another process on part of the space.16 The cutting and stacking ideas originated with von Neumann and Kakutani." so as to approximate the desired property. The process built from S by repeated independent cutting and stacking is just the coin-tossing process.} is a function X.i. labeled as '0' and '1'. Example 1. one can first represent the ergodic Markov chain {(Xn . By starting with a partition of the unit interval into disjoint intervals labeled by the finite set A. the binary i. Remark 1. then {Y. say a.) The easiest way to construct an ergodic Markov chain {X„} via cutting and stacking is to think of it as the regenerative process defined by a recurrent state. Define the set of all ak Prob(X i` = alic1X0 = a). = f (Y. processes.} built from it by repeated independent cutting and stacking is a hidden Markov chain.10. 0) if L is not the top of its column. is a function of Y.) If the initial structure S has only a finite number of columns.! .12 (I.1 < i < k. assume that the labeling set A does not contain the symbols 0 and 1 and relabel the column levels by changing the label A (L) of level L to (A (L).i.10. Example 1. For a discussion of this and other early work see [12].) Let S be the column structure that consists of two columns each of height 1 and width 1/2. and ak = a.118 CHAPTER I. in particular. process with the probability of a equal to the length of the interval assigned to a. the process {X." so as to guarantee ergodicity for the final process. BASIC CONCEPTS.i. and let pt* be the product measure on S defined by p. hence can be represented as the process built by repeated independent cutting and stacking. Example 1. and hence {X.10.d. repeated independent cutting and stacking produces the A-valued i.c. Since dropping the second coordinates of each new label produces the old labels.10.Ar(L). The process built from the first-order columnar representation of S by repeated independent cutting and stacking is the same as the Markov process {X„}. and how to "mix one part slowly into another. process in which O's and l's are equally likely. They have since been used to construct .} is a function of a mixing Markov chain.T. In fact. 1) if L is the top of its column. Yn )} as a process built by repeated independent cutting and stacking. then drop the first coordinates of each label. Example 1. that if the original structure has two columns with heights differing by 1.14 (Hidden Markov chains.d. convergence for frequencies. Let S be i for which ai a. is recommended as a starting point for the reader unfamiliar with cutting and stacking constructions. it follows that X.10. see III. Note. all' E S. i } is mixing Markov.} is regenerative.) A hidden Markov process {X. To see why this is so. It is easy to see that the process {KJ built from this new column structure by repeated independent cutting and stacking is Markov of order no more than the maximum length of its columns.d. that is.) of a Markov chain.15 (Finite initial structures.) If {KJ is ergodic. and to (.13 (Markov processes. then {X.

Suppose also that {S(m} is complete.SECTION 1. (a) Show that if the top of each column is labeled '1'. disjoint from 1(m). CUTTING AND STACKING.n such that T(m + 1) is built by An -fold independent cutting and stacking of 1(m) U R.10. for each m. some of which are mentioned [73]. (d) Show that if EcEs IX(C1 D) — X(D)I < E. 5. (a) Suppose S' is built by cutting and stacking from S.) 6.1 as m —> oc. . (Hint: for S = S(m).T 2 D.C. Then apply the k = 1 argument..10. then the process built from S by repeated independent cutting and stacking is Markov of some order.10.d. 2. (a) Show that {S(m)} is asymptotically independent if and only if {r(m)} is asymptotically independent. Show that {S(m)} is asymptotically independent if and only if it* is totally ergodic. 3. I. . Show that for C E S and D E S'.) 4..À(D)i except for a set of D E S I of total measure at most Nfj.i. and let {S(m)} be the standard cutting and stacking representation of the stationary A-valued process defined by (S. where S C A*.. the conditional probability X(DI C) is the fraction of C that was cut into slices to put into D. most of which have long been part of the folklore of the subject. there is an M = M(m. 119 numerous counterexamples. Show that if {Mm } increases fast enough then the process defined by {S(m)} is ergodic. (Hint: the tower with base transformation Sn defines the same process Ti. and each E > 0.. where D is a subset of the base B. A sufficient condition for a cutting and stacking construction to produce a stationary coding of an i. T D. process is given in [75]. . Exercise 4c implies that A* is m-ergodic. re(c)-1 D). then a subcolumn has the form (D. E) such that C is (1 —6)-well-distributed in S(n) for n > M. then E CES IX(CI D) .11 for the case when S has countably many columns. (b) Show that the asymptotic independence property is equivalent to the asymptotic well-distribution property. Prove Theorem 1. Let pt" be a block-structure measure on Soe. The latter also includes several of the results presented here. Show that if T is the upward map defined by a column C. except for a set of D E SI of total measure at most E. p*). then S' and S are 26-independent. (b) Suppose that for each m there exists R(m) C S(m). and an integer 111. A column C E S is (1 —6)-well-distributed in S' if EDES' lx(vi C)—X(7))1 < E. (c) Show that if S and S' are 6-independent.(m). Show that formula (2) holds for k > 1. and X(1 (m)) -->. each C E S(m). with all other levels labeled '0'.e Exercises 1. A sequence {S(m)} is asymptotically well-distributed if for each m. 7. Suppose 1(m) c S(m). Suppose S has measure 1 and only finitely many columns.

(b) Show that if two columns of S have heights differing by 1.10. 8. Show that process is regenerative if and only if it is the process built from a column structure S of measure 1 by repeated independent cutting and stacking. . BASIC CONCEPTS. (c) Verify that the process constructed in Example 1.120 CHAPTER I. then the process built from S by repeated independent cutting and stacking is a mixing finitestate process.13 is indeed the same as the Markov process {X„}.

see Exercise 1. A code sequence {C n } is said to be universally asymptotically optimal or. for almost every sample path from any ergodic process. and y is some other ergodic process. code sequences which compress to the entropy in the limit. in particular. but its construction depends on knowledge of the process. n n—>co for every ergodic process . The entropy lower bound is an expected-value result. so that. Section 11. while if each C.. Two issues left open by these results will be addressed in this section. In general. When Cn is understood.7. hence it does not preclude the possibility that there might be code sequences that beat entropy infinitely often on a set of positive measure. The first issue is the universality problem. An n-code is a mapping Cn : A n F-* 10. by the entropy theorem. where each Cn is an n-code. 2.7. L(4)In h. almost surely. As noted in Section I. the code sequence is called a faithful-code sequence. The sequence of Shannon codes compresses to entropy in the limit. universal if lim sup L(4) < h(p). where {0. r (4) will be used instead of r (x rii ICn ). 1}* is the set of finite-length binary sequences.12 and Theorem 1. almost surely.u(4)1. The Shannon code construction provides a prefix-code sequence {C n } for which L(xrii) = r—log. .. see Theorem 1. where h(p) denotes the entropy of A. more simply. is a prefix code it is called a prefix-code sequence. 121 . 1}*. a lower bound which is "almost" tight. A code sequence is a sequence {Cn : n = 1.). however. If each Cn is one-to-one. The second issue is the almost-sure question. if {C n } is a Shannon-code sequence for A. knowledge which may not be available in practice.7q . then L(4)1 n may fail to converge on a set of positive y-measure.7.1 Entropy and coding. that is. It will be shown that there are universal codes.u. the length i(Cn (4)) of the code word Cn (4).c.Chapter II Entropy-related properties. It will be shown that this cannot happen. The code length function Le Icn ) of the code is the function that assigns to each . at least asymptotically in source word length n.15. entropy provides an asymptotic lower bound on expected per-symbol code length for prefix-code sequences and faithful-code sequences.
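The Shannon code referred to above assigns to x_1^n a code word of length ⌈-log μ(x_1^n)⌉ (logarithms base 2 throughout). As a quick illustration, the sketch below is my own; the i.i.d. measure and all names are chosen only for concreteness. It computes these lengths for a toy measure and checks the Kraft inequality Σ 2^{-L(x_1^n)} ≤ 1, which is what guarantees that a prefix code with these lengths exists.

from itertools import product
from math import ceil, log2, prod

def iid_prob(word, p):
    # probability of a word under an i.i.d. measure p on a finite alphabet
    return prod(p[a] for a in word)

def shannon_length(word, p):
    # Shannon code word length: the least integer >= -log2 mu(word)
    return ceil(-log2(iid_prob(word, p)))

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}     # a toy i.i.d. measure
n = 4
words = [''.join(w) for w in product(p, repeat=n)]

kraft = sum(2 ** -shannon_length(w, p) for w in words)
per_symbol = sum(iid_prob(w, p) * shannon_length(w, p) for w in words) / n
entropy = -sum(q * log2(q) for q in p.values())
# kraft <= 1; the expected per-symbol length lies between h and h + 1/n
print(kraft, per_symbol, entropy)

This is only a finite-n illustration of the fact quoted above, that the sequence of Shannon codes compresses to entropy in the limit for the measure it is built from; the universality and almost-sure questions raised in this section concern what happens when the measure is unknown or the code is applied to a different process.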

Theorem 11.2. the direct coding construction extends to the case of semifaithful codes. this will show that good codes exist.) There is a prefix-code sequence {C„} such that limsup n . that is.122 CHAPTER II. will be used to establish the universal code existence theorem.u. thus further sharpening the connection between entropy and coding. 4 and yril. L(x 11 )/n > h. The first block gives the index of the k-type of the sequence 4. the codeword Cn (4) is a concatenation of two binary blocks.1.. A third proof of the existence theorem.s. it has fixed length which depends only on the number of type classes.1. based on the Lempel-Ziv coding algorithm.1.1. These two results also provide an alternative proof of the entropy theorem.d. The nonexistence theorem. will be given in Section 11. Theorem 11.1.2 (Too-good codes do not exist. together equivalent to the entropy theorem. almost surely for any ergodic process. its length depends on the size of the type class. lim sup . which is based on a packing idea and is similar in spirit to some later proofs. A direct coding argument. The proof of the existence theorem to be given actually shows the existence of codes that achieve at least the process entropy. for which a controlled amount of distortion is allowed.2.a Universal codes exist. as shown in [49]. Distinct words. will follow from the entropy theorem. together with a surprisingly simple lower bound on prefix-code word length.C(4)/n < H. Theorem 11.1 (Universal codes exist. either have different k-types. based on entropy estimation ideas of Ornstein and Weiss. Of interest. almost surely. The code C. will then be used to show that it is not possible to beat process entropy in the limit on a set of positive measure.) If {C n } is a faithful-code sequence and it is an ergodic measure with entropy h. whose existence depends only on subadditivity of entropy as expected value. for suitable choice of k as a function of n. but for no ergodic process is it possible to beat entropy infinitely often on a set of positive measure. relative to some enumeration of the k-type class. with code performance established by using the type counting results discussed in Section I. ENTROPY-RELATED PROPERTIES. The second block gives the index of the particular sequence x'iz in its k-type class.1. While no simpler than the earlier proof. n A prefix-code sequence {C} will be constructed such that for any ergodic it. utilizes the empirical distribution of k-blocks. where H = 114 denotes the process entropy of pt. The code is a two-part code. relative to some enumeration of possible k-types. In addition. they show clearly that the basic existence and nonexistence theorems for codes are. Theorem 11. In other words.6. A counting argument. II. then lim inf. together with some results about entropy for Markov chains.d. is the fact that both the existence and nonexistence theorems can be established using only the existence of process entropy. almost surely. it is possible to (universally) compress to entropy in the limit. in which case the first blocks of Cn (fi') and Cn (Yil) . that is. in essence. Since process entropy H is equal to the decay-rate h given by the entropy theorem. a. A second proof of the existence theorem. the k-type.3. will be discussed in Section 11.C(fii)1 n < h(a). for any ergodic process . while the existence of decay-rate entropy is given by the much deeper entropy theorem. also.

k -. Given a sequence 4 and an integer k < n. and on the cardinalityj7k (x)i of the set of all n-sequences that have the same circular k-type as 4. Now suppose k = k(n) is the greatest integer in (1/2) log lAi n and let Cn be the two-part code defined by first transmitting the index of the circular k-type of x. . hence the second blocks of Cn (4) and Cn (y. Cn (4) = blink.14 and Theorem 1. that is. then transmitting the index of 4 in its circular k-type class.15. The circular k-type is the measure P k = ilk(.6.' . say. rd: 5. and bm t +1 is a variable-length binary sequence specifying. ITi(4)1 < (n — 1)2 (11-1)11. ' +1 .SECTION 11.) Total code length satisfies log /S1 (k. Since the first block has fixed length. the index of the particular sequence 4 relative to some particular enumeration of T k (4). < iii. that is. called the circular k-type. x 7 denotes the entropy of the (k — 1)-order Markov chain defined by Pk (.`) will differ. because the size of the type class depends on the type. so that the bounds."7 on the number /V (k. ENTROPY AND CODING. A slight modification of the definition of k-type. br relative to some fixed enumeration of the set of possible circular k-types. where is a fixed length binary sequence specifying the index of the circular k-type of x.. that is. n) + log1 7 4(4)1 +2. aik E A k . implies the entropy inequality EFkcal.. that is. The gain in using the circular k-type is compatibility in k. the concatenation of xtiz with x k i -1 . i <k — 1. the code Cn is a prefix code. Pk-1(4 -1 14) =---" which.1. on the number of k-types and on the size of a k-type class yield the following bounds (1) (2) (k. lic 14) — Pk(a • 1 ti E [1. If k does not grow too rapidly with n.:+ k -1 = ak 11 .7irk-1 . where the two extra bits are needed in case the logarithms are not integers. n) < (n ± olAl k . ak (3) ilk-Le.(1/2) log IAI n.f.6.c14). in turn.IX II ) on il k defined by the relative frequency of occurrence of each k-block in the sequence .4+k-1 . so asymptotic code performance depends only on the asymptotic behavior of the cardinality of the k-type class of x. let ril+k-1 = xx k i -1 . 123 will be different. yet has negligible effect on code performance. 4 is extended periodically for k — 1 more terms. An entropy argument shows that the log of this class size cannot be asymptotically larger than nH . where Hk_1..14). then the log of the number of possible k-types is negligible relative to n. Theorem 1. (Variable length is needed for the second part. will be used as it simplifies the final entropy argument. n) of circular k-types that can be produced by sequences of length n. 1f1 n The circular k-type is just the usual k-type of the sequence . or they have the same k-types but different indices in their common k-type class.
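For concreteness, here is a small sketch (my own, with hypothetical function names) that computes the circular k-type of x_1^n, that is, the empirical distribution of k-blocks in the sequence extended periodically by its first k-1 symbols, together with the difference H(p_k) - H(p_{k-1}). By the compatibility-in-k property noted above, this difference is the conditional entropy of the (k-1)-order Markov chain determined by the circular k-type, the quantity that controls the length of the second, variable-length part of the two-part code.

from collections import Counter
from math import log2

def circular_k_type(x, k):
    # empirical distribution of k-blocks in x extended periodically by k-1 symbols
    n = len(x)
    ext = x + x[:k - 1]
    counts = Counter(ext[i:i + k] for i in range(n))
    return {block: c / n for block, c in counts.items()}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def empirical_markov_entropy(x, k):
    # H(p_k) - H(p_{k-1}), the conditional entropy of the (k-1)-order
    # Markov chain defined by the circular k-type; for k = 1 it is H(p_1)
    if k == 1:
        return entropy(circular_k_type(x, 1))
    return entropy(circular_k_type(x, k)) - entropy(circular_k_type(x, k - 1))

x = '11001010001000100'
for k in (1, 2, 3):
    print(k, round(empirical_markov_entropy(x, k), 4))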

almost surely. in this discussion. A faithful-code sequence can be converted to a prefix-code sequence with no change in asymptotic performance by using the Elias header technique.x? < H + 2E. Two quite different proofs of Theorem 11. for as noted earlier.1. relative to a given enumeration of bits are needed to specify a given member of 7 Thus. For all sufficiently large n. The second proof. II. Since K is fixed. which was developed in [49]. for es•• <H + 2E. In almost surely. n > N. stationary or not. a bound due to Barron. so that to show the existence of good codes it is enough to show that for any ergodic measure lim sup fik(n)-1.1. as n .x 1' < H. described in Section I.x7. almost surely. the ergodic theorem implies that PK (ar lx7) oo. Fix an ergodic measure bc with process entropy H. The following lemma.b Too-good codes do not exist. ENTROPY-RELATED PROPERTIES. almost surely.d of Chapter 1.u(ar-1 ).124 CHAPTER II. process entropy El H and decay-rate h are the same for ergodic processes. relative to some fixed enumeration of circular k(n)-types. The bound (1) on the number of circular k-types. is based on an explicit code construction which is closely connected to the packing ideas used in the proof of the entropy theorem. together with the assumption that k < (1/2) log iAi n. This is.l. The proof that good codes exist is now finished. this shows that lim sun n ilk(n). The bound (2) on the size of a type class implies that at most 1 + (n — log(n — 1) 4(4). Since E is arbitrary. of course. implies that it takes at most 1 + N/Tz log(n + 1) bits to specify a given circular k(n)-type.1. where. hm sup 'C(n xilz) lim sup iik(n) .u(ar)/. due to Barron. and hence HK _ Lx? integer N = N(x.2 will be given. is valid for any process. Once this happens the entropy inequality (3) combines with the preceding inequality to yield ilk(n)-1. and does not make use of the entropy theorem. Thus it is enough to prove Theorem 11. Given E > 0.2 for the case when each G is a prefix code. The first proof uses a combination of the entropy theorem and a simple lower bound on the pointwise behavior of prefix codes. [3]. . choose K such that HK_i _< H ± E.u(4).7.x7 < H. HK-1 = H (X IX i K -1 ) denotes the process entropy of the Markov chain of order K — 1 defined by the conditional probabilities . k(n) will exceed K. establishing the desired bound (4). negligible relative to n. E) such that almost every x there is an particular. (4) where H is the process entropy of p. the type class.

such that if Bi = (x: ICJ) < i(H . since adding headers to specify length has no effect on the asymptotics.1. This completes the proof of Theorem 11. eventually a. for any ergodic measure if/. yields lim inf n—000 n > lim inf .C(x7) + log . Theorem 11.u(fil ) . this can be rewritten as Bn = {x7: iu.(4) eventually almost surely. while the complement can be compressed almost to entropy.C(4) +log .2.) Let (CO be a prefix-code sequence and let bt be a Borel probability measure on A. ENTROPY AND CODING. then.) Thus there are positive numbers.c Nonexistence: second proof. I. } is a prefix-code sequence such that lim infi £(41C.an). Proof For each n define B. Using the relation L(x7) = log 2'. The lemma follows from the Borel-Cantelli lemma. But -(1/n) log z(x) converges El almost surely to h. and assume {C.ei2E B„ since E „ 2—c(x7) < 1. thereby producing a code whose expected length per symbol is less than entropy by a fixed amount. a proof which uses the existence of good codes but makes no use of the entropy theorem. with respect to A. x7En„ .an. Let i be a fixed ergodic process with process entropy H. by the Kraft inequality for prefix codes. ={x: .„ Bj ) > y. A second proof of the nonexistence of too-good codes. long enough sample paths can be partitioned into disjoint words.0 <2 -an E 2-4(4) < 2 -"n .u. the entropy of . (As noted earlier.u(4) ?_ .3 (The almost-sure code-length bound.1. II. The basic idea is that if a faithful-code sequence indeed compresses more than entropy infinitely often on a set of positive measure.2.(4) which yields 2-c(4 ) 2 -an I.SECTION II. with high probability. and after division by n. 2-an < oo. then If {an } is a sequence of positive numbers such that En (5) .1. on a set of positive measure. Barron's inequality (5) with an = 2 loglAI n. .log p. 125 Lemma 11. such that those words that can be coded too well cover at least a fixed fraction of sample path length. s. c and y.l. will now be given.)/n < H.1.6)} then i(nn Up.t(Bn) = E kt(x . a faithful-code sequence with this property can be replaced by a prefix-code sequence with the same property.
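Since the displayed computation in the proof above is hard to read in this rendering, here is the key estimate of the almost-sure code-length bound written out again in LaTeX; it only restates the argument already given, with B_n the set of n-sequences that the prefix code C_n compresses below -log μ(x_1^n) - a_n bits.

% Key step of the almost-sure code-length bound (Barron's inequality).
\[
B_n = \{x_1^n : \mathcal{L}(x_1^n) \le -\log \mu(x_1^n) - a_n\},
\qquad
\mu(B_n) = \sum_{x_1^n \in B_n} \mu(x_1^n)
\le \sum_{x_1^n \in B_n} 2^{-\mathcal{L}(x_1^n) - a_n}
\le 2^{-a_n} \sum_{x_1^n} 2^{-\mathcal{L}(x_1^n)} \le 2^{-a_n},
\]
% the last sum being at most 1 by the Kraft inequality for the prefix code C_n.
\[
\sum_n 2^{-a_n} < \infty
\;\Longrightarrow\;
\mathcal{L}(x_1^n) + \log \mu(x_1^n) \ge -a_n \quad \text{eventually almost surely,}
\]
% by the Borel--Cantelli lemma, which is the statement of the lemma.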

and M > 2k/S. xr. if the following three conditions hold. be partitioned into disjoint words. The coding result is stated as the following lemma. E G(n).. this produces a single code with expected length smaller than expected entropy. such that the set Dk = E(X lic ) < k(H 8)) . for the uniform boundedness of E*(4)In on the complement of G(n) combines with the properties (6) and (7) to yield the expected value bound E(C*(X7)) < n(H — 672). .C*(x`Oln will be uniformly bounded above by a constant on the complement of G(n). using the too-good code Ci on those words that belong to Bi for some i > M.. Let 8 < 1/2 be a positive number to be specified later. while the words that are neither in Dk nor in BM U . a code that will eventually beat entropy for a suitable choice of 3 and M.. there is a prefix n-code C n * whose length function E*(4) satisfies (7) . To make precise the desired sample path partition concept.1..1. with high probability. for a suitable choice of E. Let G(n) be the set of all sequences 4 that have a (y.. with length function E(x) = Theorem 11. If 8 is small enough and such sample paths are long enough. such that the words that belong to B m U .. has measure at least 1 — S. Here it will be shown how a prefix n-code can be constructed by taking advantage of the structure of sequences in G(n). ' > 0. for all sufficiently large n. or belongs to B1 for some j > M. independent of n..1. a concatentation 4 = w(1)w(2) • • • w(t) is said to be a (y.x i` I e' -k). Lemma 11. 0-good representation of xt. is at least yn. ENTROPY-RELATED PROPERTIES. . or belongs to Dk.126 CHAPTER II. Sample paths with this property can be encoded by sequentially encoding the words. provides a k > 1/3 and a prefix code ek £(. (c) The total length of the w(i) that belong to BmU Bm +1U . It will be shown later that (6) x E G(n).4 For n > M. 3)-good representation. (b) The total length of the w(i) that belong to A is at most 33n. let M > 2k IS be a positive integer.. cover at least a y-fraction of total length. eventually almost surely. A version of the packing lemma will be used to show that long enough sample paths can. For n > M. and using a single-letter code on the remaining symbols. given that (6) holds. The lemma is enough to contradict Shannon's lower bound on expected code length. Furthermore. plus headers to tell which code is being used. using the good code Ck on those words that belong to Dk. (a) Each w(i) belongs to A. The existence theorem. for some large enough M. 8 > 0. cover at most a 33-fraction of total length. contradicting Shannon's lower bound. .C*(4) <n(H — E').

.t. But this.. by the definition of Dk. Note that C * (4)/n is uniformly bounded on the complement of G(n).. then it requires at most L(H — E) bits to encode all such w(i). and using some fixed code. since the initial header specifies whether E G(n) or not.. using the too-good code on those words that have length more than M. . the contribution . for suitable choice of 8 and M > 2kI8.1. symbols are encoded separately using the fixed code F. The goal is to show that on sequences in G(n) the code Cn * compresses more than entropy by a fixed amount. 127 for all sufficiently large n. thereby contradicting Shannon's D lower bound on expected code length. For 4 g G(n). Here e (. so that. Proof of Lemma 11.. 3)-good representation . C m . the per-symbol code C:(4) = OF(xi)F(x2) • • • F(xn).1. Cm+i . The code Cn * is clearly a prefix code. using the good code Ck on those words that have length k. in turn. then coding the w(i) in order of increasing i. ENTROPY AND CODING. This is so because.4. ) is an Elias code. F : A 1--* Bd.. ignoring the headers needed to specify which code is being used as well as the length of the words that belong to Uj >m BJ . since (1/n)H(A n ) -+ H. then a (y.SECTION 11. while the other headers specify which of one of the prefix-code collection {F. it takes at most L(H — E) + (N — L)(H 8) to encode all those w(i) that belong to either Dk or to Ui > m Bi . on the words of length 1. where d > log IA I. Ck.)cr i' = w(l)w(2) • • • w(t) is determined and the code is defined by C:(4) = lv(1)v(2) • • • v(t). Appropriate headers are also inserted so the decoder can tell which code is being used. and.11(i). 8)good representation 4 = w(1)w(2) • • • w(t). The code C(x) is defined for sequences x E G(n) by first determining a (y. Likewise. as shown in the following paragraphs. (0) (w(i)) of the w(i) that belong to Up. where OF ((w(i)) (8) v(i) = 10 .C*(X7)) < H(1. ' (w(i)) iie(j)Ci(W(i)) if w(i) has length 1 if w(i) E Dk if w(i) E Bi for some j > M. by property (c) in the definition of G(n). since the total length of such w(i) is at most N — L.1 4. that is. By the definition of the Bi . a prefix code on the natural numbers for which the length of e(j) is log j o(log j). is used. so that if L is total length of those w(i) that belong to BM U BA.1 .} is being applied to a given word. for all sufficiently large n. Theorem I.7. the code word Cf (w(i)) is no longer than j (H — c). If E G(n). .m B. a code word Elk (w(i)) has length at most k(h + 6). But L > yn.. say. yields E(.). A precise definition of Cn * is given as follows. for x g G(n). and the encodings k(W(i)) of the w(i) that belong to Dk. so that. the principal contribution to code length comes from the encodings Cf (u.

In summary. Since an Elias code is used to do this.M . The 1-bit headers used to indicate that w(i) has length 1 require a total of at most 33n bits. If 8 is small enough. by property (b) of the definition of G(n). for any choice of 0 < 8 < 1/2. it follows that if 8 is chosen small enough. which establishes the lemma. the total contribution of all the headers as well as the length specifiers is upper bounded by 38n ± 26n + IC8n. since the total number of such words is at most 38n. M > 2k/3 and log M < 8M. By assumption. w(i)Eup. which is certainly negligible. Lemma 1. Of course. provided 8 is small enough and M is large enough. choose L > M so large that if B = UL BP • then . But (1/x) log x is a decreasing function for x > e. eventually almost surely. There are at most nlk words w(i) the belong to Dk or to Ui >m Bi . and M > 2k13. still ignoring header lengths. the total length needed to encode all the w(i) is upper bounded by (10) n(H +8 — ye +3d3). then n(H — E'). relative to n. 4 E G(n). To establish this. ignoring the one bit needed to tell that 4 E G(n). k > 113. The words w(i) need an additional header to specify their length and hence which Ci is to be used. so that. there is a constant K such that at most log t(w(i)) K tv(i)Eui>m B1 E bits are required to specify all these lengths. then 8 — ye -1-3d8 will be negative. 1=. so that log log t(w(i)) < K M n.u(B) > y. then at most K3n bits are needed to specify the lengths of the w(i) that belong to Ui >m B i . so these 2-bits headers contribute at most 28n to total code length. so essentially all that remains is to show that the headers require few bits. the lemma depends on the truth of the assertion that 4 E G(n). by property (b) in the definition of G(n). Adding this bound to the bound (10) obtained by ignoring the headers gives the bound L*(4) < n[H 8(6 ± 3d K)— yen on total code length for members of G(n).m B i is upper bounded by — ye). Since d and K are constants. for suitable choice of 8 . Thus. ENTROPY-RELATED PROPERTIES. m Bi E since each such w(i) has length M > 2k1S. to total code length from encoding those w(i) that belong to either Dk or to Ui . since k < M and hence such words must have length at least k. and there are at most 38n such words. provided only that M > 2k18. e'.3.128 CHAPTER II. which exceeds e since it was assumed that < 1/2 and k > 118.8. provided k > 118. n(H +3 — y(e +8)) n(H (9) Each word w(i) of length 1 requires d bits to encode. if it assumed that log M < 8M. The partial packing lemma.

. and let {C„} be a prefix-code sequence such that lim sup. eventually almost surely. Dn . by the ergodic theorem. vi ] meets none of the [si . 4 is (1 — 0-strongly-covered by Dk. vi that meet the boundary of at least one [s ti] is at most 2nk/M. ti]: j E U {[ui. B. which is upper bounded by 8n. eventually almost surely. E /*} is disjoint and covers at least a (1 — 30-fraction of [1. The entropy theorem can be deduced from the existence and nonexistence theorems connecting process entropy and faithful-code sequences.u(x'1') < 2 1'1+2} . But this completes the proof that 4 E G(n). ti ]. Fix c > 0 and define D„ = 14: £(4) < n(H so that x Thus. eventually almost surely. In particular. 129 provides. ti]UU[ui. n] of total length at least (1 — 20n for which each 4. the cardinality of J is at most n/M. Since each [s1 . and (c) hold.s. Fix an ergodic measure bt with process entropy H. = . ti ]: j E J } of M-length or longer subintervals of [1. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. But the building-block lemma. suggested in Exercise 5. eventually almost surely there are at least 8n indices i E [1.1. since Cn is invertible. E Dk and a disjoint collection {[si. a y-packing of 4 by blocks drawn from B.7.l. as follows.) The words v:: j E /*I {x:j E J} U {xu together with the length 1 words {x s I defined by those s U[s . since it was assumed that M > 2k/ 6 . Section 1. then the collection {[sj. for which each xs% E B. n]. Lemma 1. ENTROPY AND CODING. n — k 1] that are starting places of blocks from Dk. 1E1. eventually almost surely. This completes the second proof that too-good codes do not exist. II.3.C(x`11 )1 n < H.SECTION 11. The preceding paragraph can be summarized by saying that if x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk then 4 must belong to G(n). and I Dn I < 2 n(H+E) . To say that x has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk.9.3. since. . (This argument is essentially just the proof of the two-packings lemma. t] has length at least M. so the total length of those [ui . On the other hand. Thus if I* is the set of indices i E I such that [ui .7. provide a representation x = w(1)w(2) • • • w(t) for which the properties (a). eventually almost surely. 4 has both a y-packing by blocks drawn from B and a (1 — 20-packing by blocks drawn from Dk. then implies that xri' is eventually almost surely (1 — 20-packed by k-blocks drawn from Dk. if E . (b). n] of total length at least yn. Lemma 1. a.d Another proof of the entropy theorem. that is. is to say that there is a disjoint collection {[u„ yi ]: i E / } of k-length subintervals of [1.

n n 11. This proves that 1 1 lim inf — log — > H. (a) For each n let p. (C) Show that if p. eventually almost surely.130 then /t (Bn CHAPTER II.n(H — c).s. a. and note that I Uni < 2n(I I —E) ... xii E Un and en (4) = C(4). Suppose Cn * is a one-to-one function defined on a subset S of An . where it starts. C3 II.. almost surely. and let £(4) be the length function of a Shannon code with respect to vn . then there is a constant C such that lim sup..) 3.. Show that the expected value of £(4) with respect to p.2 guarantees that Je ll g Un . establishing the upper bound 1 1 lim sup — log ) _< H . such that the first symbol in 0(4) is always a 0.(4 For the lower bound define U„ = Viz: p.. be the code obtained by adding the prefix 1 to the code Cn .1. be the projection of the Kolmogorov measure of unbiased coin tossing onto A n and let Cn be a Shannon code for /in . for X'il E S. Code the remainder by using the Shannon code.1 . Let C.i. Thus there is a one-to-one function 0 from Un into binary sequences of length no more than 2 -I.s. is 1-1(pn )± D (An II vn). n D n) < IDn 12—n(H+2E) < 2—n€ Therefore 4 fl B n n Dn . L(4)/ log n < C. The resulting code On is invertible so that Theorem 11. whose range is prefix free.(4) > 2—n(H. (b) Let D (A n il v pi ) = E7x /1.1(4) log(pn (4)/vn (4)) be the divergence of An from yn .. Add suitable headers to each part to make the entire code into a prefix code and apply Barron's lemma. Show that if bt is i. (Hint: code the second occurrence of the longest string by telling how long it is. This exercise explores what happens when the Shannon code for one process is used on the sample paths of another process. Le Exercises 1. xi" g Un . and so xri` g Bn .. Define On (4) = 4)(4). a. Lemma 5. is the all 0 process. where K is a constant independent of n.d. n n and completes this proof of the entropy theorem.(4) = 1C:(4). then there is a renewal process v such that lim supn D (gn II vn )/n = oc and lim infn D(pn II vn )/n = 0. Such a code Cn is called a bounded extension of Cn * to An. 2. and where it occurred earlier. Show that there is prefix n-code Cn whose length function satisfies £(4) < Kn on the complement of S. . Show that if y 0 p. is ergodic then C(4)In cannot converge in probability to the entropy of v. eventually almost surely. and such that C. Let L (xi') denote the length of the longest string that appears twice in 4. ENTROPY-RELATED PROPERTIES. eventually almost surely.

Section II.2 The Lempel-Ziv algorithm.

An important coding algorithm was invented by Lempel and Ziv in 1975, and it has been extensively analyzed in various finite and limiting forms, [53], [90], [91]. In finite versions it is the basis for many popular data compression packages. Ziv's proof that the Lempel-Ziv (LZ) algorithm compresses to entropy in the limit will be given in this section; it uses several ideas of independent interest. A second proof, which is due to Ornstein and Weiss, is given in Section II.4.

In its simplest form, the LZ algorithm is based on a parsing procedure, called here (simple) LZ parsing, that is, a way to express an infinite sequence x as a concatenation x = w(1)w(2)... of variable-length blocks, called words. To be precise, x is parsed inductively according to the following rules.

(a) The first word w(1) consists of the single letter x_1.

(b) Suppose w(1)w(2)...w(j) = x_1^n.

  (i) If x_{n+1} is not equal to any of w(1), ..., w(j), then w(j+1) consists of the single letter x_{n+1}.

  (ii) Otherwise, w(j+1) = x_{n+1}^m, where m is the least integer larger than n+1 such that x_{n+1}^m is not equal to any of w(1), ..., w(j).

Note that the parsing is sequential, that is, later words have no effect on earlier words. Thus the word-formation rule can be summarized by saying: the next word is the shortest new word. For example, 11001010001000100... parses into

1, 10, 0, 101, 00, 01, 000, 100, ...

where, for ease of reading, the words are separated by commas. An initial segment of length n can therefore be expressed as

(1)  x_1^n = w(1)w(2)...w(C)y,

where the final block y is either empty or is equal to some w(j), for some j ≤ C.

The parsing defines a prefix n-code C_n, called the (simple) Lempel-Ziv (LZ) code, by noting that each new word is really only new because of its final symbol, hence it is specified by giving a pointer to where the part before its final symbol occurred earlier, together with a description of its final symbol. To describe the LZ code C_n precisely, let ⌈·⌉ denote the least integer function and let f: {0, 1, ..., n} -> B^{⌈log n⌉} and g: A -> B^{⌈log |A|⌉} be fixed one-to-one functions. If x_1^n is parsed as in (1), the LZ code maps it into the concatenation

C_n(x_1^n) = b(1)b(2)...b(C)b(C+1)

of the binary words b(1), ..., b(C), b(C+1), defined according to the following rules.

16. as defined here. for it takes into account the frequency with which strings occur. Lemma 1. together with a proof that this individual sequence entropy is an asymptotic upper bound on (1/n)C(4) log n. Ziv's concept of entropy for an individual sequence begins with a simpler idea. since no code can beat entropy in the limit. a E A. almost surely. Theorem 11. which is the growth rate of the number of observed strings of length k. individual sequence entropy almost surely upper bounds the entropy given by the entropy theorem. by Theorem 11. For example.) Topological entropy takes into account only the number of k-strings that occur in x. then b(j) = 0 g(w(j)). then b(j) = 1 f (i)0 g(a). together with a proof that. and part (c) requires at most Flog ni + 1 bits. hence its topological entropy h(x) is log 2 = 1. 34] for discussions of this more general concept.2. so to establish universality it is enough to prove the following theorem of Ziv.6. for ergodic processes. [90]. The dominant term is C log n. see [84. it is enough to prove that entropy is an upper bound. so total code length £(.i. The entropy given by the entropy theorem is h(p) = —p log p — (1 — p) log(1 — p). (b) If j < C and i < j is the least integer for which w(j) = w(i)a. . is the same as the usual topological entropy of the orbit closure of x. but gives no information about frequencies of occurrence..d. where i is Part (a) requires at most IA l(Flog IA + 1) bits.132 CHAPTER II. for every sequence.eiz) is upper bounded by (C+1)logn-FaC-1. The k-block universe of x is the set tik(x) of all a l' that appear as a block of consecutive symbols in x.7. otherwise b(C (c) If y is empty. ENTROPY-RELATED PROPERTIES. as k oc. part (b) requires at most Calog n1 ± Flog I A + 2) bits. if x is a typical sequence for the binary i. Of course. process with p equal to the probability of a 1 then every finite string occurs with positive probability. in which C1(4) now denotes the number of new words in the simple parsing (1). called topological entropy. k = {a• i• x 1' 1 = all for some i _> The topological entropy of x is defined by 1 h(x) = lim — log 114 (x)I k a limit which exists. since I/44k (x)i < Vim (x)I • Pk (x)1. 1) = 1 f (i).) If p is an ergodic process with entropy h. then b(C the least integer such that y = w(i). which depends strongly on the value of p. Ziv's proof that entropy is the correct almost-sure upper bound is based on an interesting extension of the entropy idea to individual sequences. then (11n)C(4) log n —> h. by subadditivity. that is. (a) If j < C and w(j) has length 1. where a and /3 are constants.1 (The LZ convergence theorem.1. (The topological entropy of x. 1) is empty.2.
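The parsing rule and the count C(x_1^n) of new words are easy to simulate. The sketch below (my own illustration; the dictionary bookkeeping and function names are not from the text) parses a finite sequence by the shortest-new-word rule and evaluates the dominant per-symbol term of the code length bound discussed above.

from math import ceil, log2

def lz_parse(x):
    # simple LZ parsing: the next word is the shortest block not yet seen
    # as a word; returns the list of new words and the final block y,
    # which is empty or equal to some earlier word
    words, seen = [], set()
    i = 0
    while i < len(x):
        j = i + 1
        while j <= len(x) and x[i:j] in seen:
            j += 1
        if j > len(x):            # ran off the end: the final block is old
            return words, x[i:]
        seen.add(x[i:j])
        words.append(x[i:j])
        i = j
    return words, ''

x = '11001010001000100'
words, y = lz_parse(x)
print(words)                      # ['1', '10', '0', '101', '00', '01', '000', '100']
C, n = len(words), len(x)
# each word costs roughly a pointer of ceil(log n) bits plus a final symbol,
# so this ratio is the per-symbol figure whose limit is the entropy
print((C + 1) * ceil(log2(n)) / n)

On such a short input the ratio is of course larger than 1; the convergence theorem above is a statement about the limit as n grows, for almost every sample path of an ergodic process.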

X E A°°. Proof The most words occur in the case when the LZ parsing contains all the words of length 1. THE LEMPEL-ZIV ALGORITHM. Theorem 11. )7. combined with the fact it is not possible to beat entropy in the limit. 0 such that C(x) log n n <(1 + 3n)loglAl.2. which also have independent interest. will be carried out via three lemmas. defined as before by c1(x . which.2. Theorem 11. up to some length. The first gives a simple bound (which is useful in its own right). where d(4 . Lemma 11.2. n (in + 1 )1Al m+1 jiAli and C = . y) = lim sup d.) If is an ergodic process with entropy h then H(x) < h.3 (The Ziv-entropy theorem. Theorem 11. . y). the second gives topological entropy as an upper bound.2. and hence C logn ' n log I AI.2. 133 The concept of Ziv-entropy is based on the observation that strings with too small frequency of occurrence can be eliminated by making a small limiting density of changes. that is.2. almost surely. Theorem 11.) There exists Bn —O.SECTION 11.. The Ziv-entropy of a sequence x is denoted by H(x) and is defined by H(x) = lim inf h(y). say m. Theorem 11. Ziv established the LZ convergence theorem by proving the following two theorems. all the words of length 2.. x E A c° . leads immediately to the LZ convergence theorem. The first lemma obtains a crude upper bound by a simple worst-case analysis.1.2. almost surely..4 (The crude bound.) lim sup C(xti') log n H(x).2.=.1. and the third establishes a ii-perturbation bound.1 ) = 1 .2 (The LZ upper bound. These two theorems show that lim supn (l/n)C(x7) log n < h. The natural distance concept for this idea is cl(x. )1). when n= In this case.i (xii . — 1) and C IA — 1). The proof of the upper bound.t--ns n is per-letter Hamming distance..
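The block-counting quantities behind these definitions are easy to compute on finite segments. The sketch below is an illustration only; the function names and the biased-coin example are my choices. It computes h_k(x_1^n) = (1/k) log |U_k(x_1^n)|, the finite-segment analogue of the quantity whose limit in k is the topological entropy, and shows the point made in the preceding discussion: for a typical sample of a biased coin essentially every short block occurs, so these counts grow at rate close to log 2 = 1, even though the entropy of the process (about 0.469 when the probability of a 1 is 0.9) is much smaller.

import random
from math import log2

def k_block_universe(x, k):
    # all k-blocks occurring as consecutive symbols of the finite segment x
    return {x[i:i + k] for i in range(len(x) - k + 1)}

def h_k(x, k):
    # (1/k) log2 of the number of observed k-blocks
    return log2(len(k_block_universe(x, k))) / k

random.seed(0)
x = ''.join(random.choices('01', weights=[1, 9], k=200000))
for k in (2, 4, 6):
    print(k, round(h_k(x, k), 3))     # each value is close to 1

Frequencies are ignored entirely here, which is exactly why the Ziv-entropy H(x) is defined through d-bar perturbations: strings of too small frequency can be removed by changing a small limiting density of symbols, bringing the count-based entropy down toward the process entropy.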

the topological entropy. To establish the general case in the form stated. ()) (w(i). i = 1. first choose m such that n< j=1 jlAli. then Given x and E > 0. where t(v(i)) = t(w(i)). and choose m such that E j2» j=1 „(x) < n < E i2 p„(x) . Lemma 11..x'iz) log n h(y) + Proof Define the word-length function by t(w) = j. together with the fact that hi(x) — Ei < h(x) h (x) where Ei —> 0 as j oc. for the LZ parsing may be radically altered by even a few changes in x. ENTROPY-RELATED PROPERTIES. y) <S. 2. as k oc... all of which are valid for t > 1 and are simple consequences of the well-known formula 0 — t)/(t — 1). . tj = Er (en-El The second bounding result extends the crude bound by noting that short words don't matter in the limit. ei. such that C(x) log n n — Proof Let hk(x) = (1/k) log lUk(x)1. —> 0. The third lemma asserts that the topological entropy of any sequence in a small dneighborhood of x is (almost) an upper bound on code performance. be the LZ parsing of x and parse y as y = v(1)v(2) . j=1 m+1 The bounds (2).. for w E Ai.5 (The topological entropy bound... there is a S >0 such that if d( li m sup n—>oo C(. must give an upper bound. for k > 1..2. At first glance the result is quite unexpected. v(i)) < . Lemma 11. tm 1 m(t _ i) _< E (2) m (t— 1 i Er ti < t"/(t — 1).6 (The perturbation bound. i=1 The desired upper bound then follows from the bound bounds . 1 t m it . Let x = w(1)w(2) . Fix S > 0 and suppose d(x.134 CHAPTER II..) For each x E A'. The word w(i) will be called well-matched if dt(u. there exists 8. yield the desired result. and the t < - t mtm. and hence the rate of growth of the number of words of length k.2. that is.) x. y) < S.

the blowup-bound lemma.E.2. and choose k so large that there is a set 'Tk c Ak of cardinality less than 2n (h+E) and measure more than 1 — E. 0 The desired inequality.2. let Gk(x) be the set of wellmatched w(i) of length k. This bound. The idea for the proof is that 4 +k-1 E Tk for all but a limiting (1 — 0-fraction of indices i. By the Markov inequality. the poorly-matched words cover less than a limiting . Proof Fix c > 0. hence the same must be true for nonoverlapping k-blocks for at least one shift s E [0. and hence it can be supposed that for all sufficiently large n.2. and let Gk(y) be the set of all v(i) for which w(i) E Gk(X). Thus there must be an integer s E [0. Since dk(w(i). where lim n fi. Lemma 11. Lemma 11. Fix a frequency-typical sequence x. n — 11: x in+s+1 lim inf > 1. the total length of the poorly-matched words in the LZ parsing of 4 is at most rt.2.n+s +k E Ed' I{i E [0.5. almost surely. there is a sequence y such that cl(x.2.SECTION 11.7 Let p.6.2. implies LI the desired result. 135 otherwise poorly-matched. Theorem 11. Lemma 1.7. The LZ upper bound. for w(i) E Gk(X). k — 1]. is an immediate consequence of the following lemma. The cardinality of Gk(Y) is at most 2khk (Y ) . Thus. yields IGk(x)I where f(S) < 2k(h(y)+ f (6)) 0 as 8 O. H(x) < h. Now consider the well-matched words. For each k.2. To make the preceding argument rigorous. = O. combined with (3) and the fact that f(S) --÷ 0. For any frequency-typical x and c > O. since this is the total number of words of length k in y. Let C1(4) be the number of poorly-matched words and let C2(4) be the number of well-matched words in the LZ parsing of x./TS.6. together with the fact that hk(Y) --+ h(y).5 then gives C2(4) 5_ (1 + On)r— (h(Y) gn f (8)). n—>oo . be an ergodic process with entropy h. k —1] such that i. The resulting nonoverlapping blocks that are not in 'Tk can then be replaced by a single fixed block to obtain a sequence close to x with topological entropy close to h.2. v(i)) < N/75.r8fraction of x. n — 1]. THE LEMPEL-ZIV ALGORITHM. x i+ 1 -T k11 • > 1 — E. follows immediately from Lemma 11. y) < e and h(y) <h E.4 gives the bound (3) (1 n/ +an) log n'a log I AI where limn an = O. completing the proof of Lemma 11. Lemma 11. since x was assumed to be frequency-typical. first note that lim i+k Ili E [0.

ENTROPY-RELATED PROPERTIES.2. • v(j + in — 1)r.. remains true. after which the next word is defined to be the longest block that appears in the list of previously seen words. but more bookkeeping is required. y) <E.00 Ri < n: w(i) E TxII > 1 — E. Since each v(i) E U {a l } . a set of cardinality at most 1 + 2k(h +E) . It is used in several data compression packages. new words tend to be longer. where the initial block u has length s. Theorem 11. Lemma 1. a member of the m-block universe Um (y) of y has the form qv(j)v(j + D.. then defines . and (M — 2k)/ k < m < M/k.") In this alternative version. and hence h(y) < h + E.. which has also been extensively analyzed. defines a word to be new only if it has been seen nowhere in the past (recall that in simple parsing. 0 and.8 Variations in the parsing rules lead to different forms of the LZ algorithm. for now the location of both the start of the initial old word. If M is large relative to k. also discussed in the exercises.2.. Theorem 11. Another version. as well as its length must be encoded. called the slidingwindow version. "new" meant "not seen as a word in the past. One method is to use some form of the LZ algorithm until a fixed number. — n Fix a E A and let ak denote the sequence of length k.136 CHAPTER II.9 Another variation. known as the LZW algorithm. This modification. implies that h m (y) < h + 6 ± 3m.. One version. the built-up set bound. This completes the proof of Lemma 11. of words have been seen. The sequence y is defined as the concatenation y = uv(1)v(2). starts in much the same way.. where w(i) ak if w(i) E 'Tk otherwise. see Exercise 1. Tic where 3A1 ---* 0. produces slightly more rapid convergence and allows some simplification in the design of practical coding algorithms.2. say Co. the rate of growth in the number of new words still determines limiting code length and the upper bounding theorem. which improves code performance.3.2. thereby. where q and r have length at most k. Remark 11. Condition (4) and the definition of y guarantee that ii(x. which can be expressed by saying that x is the concatenation x = uw(1)w(2).7. the proof of the Ziv-entropy theorem.2.2. Remark 11. all of whose members are a.10 For computer implementation of LZ-type algorithms some bound on memory growth is needed.. Nevertheless. includes the final symbol of the new word in the next word.2.7. Remark 11. and (4) liminf n—. each w(i) has length k.6. until it reaches the no-th term.

(Hint: if the successive words lengths are . Section 11. g. provided only that n _ .. here called the empirical-entropy theorem. then each block is coded separately. In this section. The empirical-entropy theorem has a nonoverlapping-block form which will be discussed here. asserting that eventually almost surely only a small exponential factor more than 2kh such k-blocks is enough.3 Empirical entropy. These will be discussed in later sections. is defined by qic(cli = Ri E [0.. and an overlapping-block form. and that a small > 21th exponential factor less than 2kh is not enough. Such algorithms cannot achieve entropy for every ergodic process. then E log Li < C log(n/C). both of which will also be discussed. .2. e. [52].a Exercises 1. Furthermore.cc+ -E =a } I where n = km + r. such as the Lempel-Ziv code. Furthermore. t2. the method of proof leads to useful techniques for analyzing variable-length codes. The theorem is concerned with the number of k-blocks it takes to cover a large fraction of a sample path. rit — 1]: in X: . but do perform well for processes which have constraints on memory growth. see Exercise 1. Show that the LZW version discussed in Remark 11. the coding of a block may depend only on the block.SECTION 11. and can be shown to approach optimality as window size or tree storage approach infinity. a beautiful and deep result due to Ornstein and Weiss. see [87. suggests an entropy-estimation algorithm and a specific coding technique. II.2. Show that simple LZ parsing defines a binary tree in which the number of words corresponds to the number of nodes in the tree.) 2. 3. The blocks may have fixed length. about the covering exponent of the empirical distribution of fixed-length blocks will be established. • • . as well as other entropy-related ideas. 137 the next word as the longest block that appears somewhere in the past no terms.9 achieves entropy in the limit. through the empirical distribution of blocks in the sample path. Show that the version of LZ discussed in Remark 11. 88]. 0 < r < k. The nonoverlapping k-block distribution.2. . In many coding procedures a sample path 4 is partitioned into blocks.8 achieves entropy in the limit. or variable length. EMPIRICAL ENTROPY. which can depend on n. ic. or it may depend on the past or even the entire sample path. The theorem.3.

The surprising aspect of the empirical-entropy theorem is its assertion that even if k is allowed to grow with n. Eventually almost surely. For each c > 0 and each k there is a set 'Tk (e) c A" for which l'Tk (e)I < 2k(h +E) . Strong-packing captures the essence of the idea that if the long words of a parsing cover most of the sample path. for any B c A" for which IBI < 2k( h — e) .14) is eventually almost surely close to the true distribution. t(w(I))<K E - 2 is (1 — E)-built-up from MI.7. then most of the path is covered by words that come from fixed collections of exponential size almost determined by entropy. Part (a) of the empirical-entropy theorem is a consequence of a very general result about parsings of "typical" sample paths. ENTROPY-RELATED PROPERTIES. there < 2k(h+f) and pl (Tk(E)) > 1 — e. the empirical measure qk(. Fix a sequence of sets {E}. those long words that come from the fixed collections . . that is. but the more general result about parsings into variable-length blocks will be developed here. As before.a(B) > E. as it will be useful in later sections.elz = w(1)w(2) • • • w(t) for which awci» < --En . in particular.14) is not close to kt k in variational distance for the case n = k2".1 (The empirical-entropy theorem.. in essence. such that for almost every x there is a K = K(e. such that Tk c A k . x) such that if k > K and n > 2kh . word length is denoted by £(w) and a parsing of 4 is an ordered collection P = {w(1).a Strong-packing. a finite sequence w is called a word. or (1 — e)-packed by {TO.I4) eventually almost surely has the same covering properties as ktk. The ergodic theorem implies that for fixed k and mixing kt. (1 — O n. the empirical and true distributions will eventually almost surely have approximately the same covering properties.4. then (a) qk(Z(E)14)) > 1 — E. .) Let ii be an ergodic measure with entropy h > O. see Exercise 3.138 CHAPTER II. w(2). the empirical distribution qk(. A sequence 4 is (K. Theorem 1. subject only to the condition that k < (11 h) log n. if the short words cover a small enough fraction. hence. A parsing X n i = w(l)w(2) • • • w(t) is (1 — e)-built-up from {'Tk}.3. in spite of the fact that it may not otherwise be close to iik in variational or even cik -distance.. E w(i)Eurk f(w(i)) a. (b) qk(BI4)) <E. For example. if the words that belong to U'rk cover at least a (1 — E)-fraction of n. and there is no is a set T(E) c A k such that l'Tk(E)I _ set B C A k such that IB I _< 2k(h--E) and . Theorem 11. A fixed-length version of the result is all that is needed for part (a) of the entropy theorem. for unbiased coin-tossing the measure qk (. c)-strongly-packed by MI if any parsing . The theorem is.3. w(t)} of words for which 4 = w(1)w(2) • • • w(t). .uk. for fixed k. for the latter says that for given c and all k large enough. II. an empirical distribution form of entropy as covering exponent. for k > 1.
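One natural way to turn the covering statement into a crude empirical estimate of entropy is to ask how many k-blocks are needed to cover a (1-ε)-fraction of the nonoverlapping k-blocks of x_1^n, and to report (1/k) times the logarithm of that number. The sketch below is my own illustration of this idea on unbiased coin tossing (entropy 1); it is not the entropy-estimation algorithm developed later in the book, and the choices of ε, k, and n are arbitrary.

from collections import Counter
from math import log2
import random

def covering_exponent(x, k, eps=0.25):
    # (1/k) log2 of the least number of k-blocks whose q_k-mass exceeds 1 - eps:
    # greedily take the most frequent nonoverlapping k-blocks of x until a
    # (1 - eps)-fraction of the m = len(x) // k blocks is covered
    m = len(x) // k
    counts = Counter(x[i * k:(i + 1) * k] for i in range(m)).most_common()
    covered, needed = 0, 0
    for _, c in counts:
        if covered >= (1 - eps) * m:
            break
        covered += c
        needed += 1
    return log2(needed) / k

random.seed(1)
x = ''.join(random.choice('01') for _ in range(1 << 16))
for k in (4, 8, 12):
    print(k, round(covering_exponent(x, k), 3))

The values come out somewhat below 1, both because only a (1-ε)-fraction of the blocks needs to be covered and because the sample is finite; the content of the theorem is that, for k growing slowly enough with n, such covering counts are eventually almost surely pinned between 2^{k(h-ε)} and 2^{k(h+ε)}.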

n — m + 1]. By the built-up set lemma. The entropy theorem provides an integer m and a set Cm c Am of measure close to 1 and cardinality at most 2m (h +6) • By the ergodic theorem. a set 'Tk = 'Tk(E) C A k . by the packing lemma. it can be supposed that 3 < 12. w(i) E Te(w(j)). then. Lemma 11. most of 4 must be covered by those words which themselves have the property that most of their indices are starting places of members of Cm . This is a consequence of the following three observations. Proof The idea of the proof is quite simple. w(i) is (1-0-packed by C m . (a) ITkI_ (b) x is eventually almost surely (K. then. Lemma 1. that is. then the word is mostly built-up from C m . For k > m. eventually almost surely most indices in x7 are starting places of m-blocks from Cm . by the Markov inequality. The collection 'Tk of words of length k that are mostly built-up from Cm has cardinality only a small exponential factor more than 2"") . 0-strongly-packed by {Tk for a suitable K.SECTION 11. for each k > K. that is. k > in.3. EMPIRICAL ENTROPY 139 {'Tk } must cover most of the sample path. such that both of the following hold. Lemma 1.3. (ii) If x is (1 —3 2 /2)-strongly-covered by C m and parsed as 4 = w(1)w(2) • • w(t). by the packing lemma. and most of its indices are starting places of members of Cm . and if f(w(i)) ?_ 2/3. the sequence 4 has the property that X: +171-1 E Cm for at least (1 — 3 2 /2)n indices i E [1. .3. (iii) If w(i) is (1 — 3/2)-strongly-covered by C m . But if such an x is partitioned into words then.6. let 'Tk be the set of sequences of length k that are (1 — 3)-built-up from Cm . the words w(i) that are not (1 — 3/2)-stronglycovered by Cm cannot have total length more than Sn < cn12. it can be supposed that 6 is small enough to guarantee that iTk i < 2k(n+E) . If a word is long enough. fix c > 0.2 (The strong-packing lemma. It remains to show that eventually almost surely x'IL is (K.7. 0-strongly-packed by {7 } . be an ergodic process of entropy h and let E be a positive number There is an integer K = K(e) and. and let 8 be a positive number to be specified later.3. if necessary. To make the outline into a precise proof.(ar) > 2 -m(h+ 5)} has measure at least 1 — 3 2 /4.) Let p. by the Markov inequality. The entropy theorem provides an m for which the set Cm = {c4n : p. however. The key result is the existence of collections for which the 'Tk are of exponential size determined by entropy and such that eventually almost surely x is almost strongly-packed. < 2k(h+E) for k > K. (i) The ergodic theorem implies that for almost every x there is an N = N(x) such that for n > N. 4 is (1 — 3 2 /2)-strongly-covered by Cm . by the built-up set lemma. By making 8 smaller.
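Observation (i) concerns the fraction of indices of x_1^n that are starting places of m-blocks from C_m. The following sketch (an editorial aside; C_m here is a hypothetical input set) computes that strong-covering fraction, the quantity compared with 1 - δ²/2 in the proof, up to the negligible boundary effect of the last m - 1 positions.

```python
def strong_covering_fraction(x, C_m, m):
    """Fraction of windows x[i:i+m], 0 <= i <= n-m, that lie in C_m.

    x is (1 - d)-strongly-covered by C_m when this fraction is >= 1 - d
    (up to the edge effect of the final m-1 positions).
    """
    n = len(x)
    starts = range(n - m + 1)
    hits = sum(1 for i in starts if x[i:i + m] in C_m)
    return hits / len(starts)

# Hypothetical C_2 for a short binary path.
print(strong_covering_fraction("0101100101", {"01", "10"}, 2))
```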

for some k. then n > N(x) and k < Sn. for if this is so.' is (K. this contributes asymptotically negligible length if I BI is exponentially smaller than 2kh and n > 2kh . Fix k > K(x) and n a 2kh . since k < Sn. A simple two-part code can be constructed that takes advantage of the existence of such sets B. and suppose xril = w(1)w(2) • • • w(t)w(t + 1).(1 — 2 8 )n.3. provided only that n > 2kh . where t(w(i)) = k.3.2. eventually almost surely. Fix c > 0 and let S < 1 be a positive number to be specified later. An informal description of the proof will be given first. The strongpacking lemma provides K and 'Tk c A " . and t(w(t + 1)) < k. if the block belongs to B. The definition of strong-packing implies that (n — k + 1)qk('Tk I4) a.. This completes the proof of Lemma 11. since the left side is just the fraction of 4 that is covered by members of Tk . so that if B covers too much the code will beat entropy. 8)-strongly-packed by {T} for n a N(x). i < t. if n > N(x). All the blocks. 8)-strongly-packed by {Tk}. except possibly the last one. The good k-code used on the blocks that are not in B comes from part (a) of the empirical-entropy theorem. If a block belongs to Tk — B then its index in some ordering of Tk is transmitted. for suitable choice of S. Proof of part (b). II. which supplies for each k a set 'Tk c A k of cardinality roughly 2. are longer than K. The first part is an encoding of some listing of B. or by applying a fixed good k-block code if the block does not belong to B. Suppose xtiz is too-well covered by a too-small collection B of k-blocks. dividing by n — k + 1 produces qk(Tkixriz ) a (1 — 28)(1 —8) > (1 — c). If B is exponentially smaller than 2" then fewer than kh bits are needed to code each block in B. and choose K (x) > K such that if k> K(x) and n> 2kh . such that Irk I . while the last one has length less than Sn.. Fix x such that 4 is (K.140 CHAPTER II. From the preceding it is enough to take K > 2/8. This completes the proof of part (a) of the empirical-entropy 0 theorem. Thus. and if the parsing x 11 = w(1)w(2) • • • w(t) satisfies —2 t(w(i))<K then the set of w(i) for which f(w(i)) > K and w(i) E U Tk must have total length at El least (1 — On. The second part encodes successive kblocks by giving their index in the listing of B. such that eventually almost surely most of the k-blocks in an n-length sample path belong to Tk . this requires roughly hk-bits. and such that x.2k(h+') for k a K.b Proof of the empirical-entropy theorem. ENTROPY-RELATED PROPERTIES. Proof of part (a). If the block does .

for almost every x. eventually almost surely._.. Mk. Several auxiliary codes are used in the formal definition of C. Indeed. 1}rk(h-01. F: A i-.B: B i . an integer K(x) such that if k > K(x) and n > 2" then qk(TkIx) > 1 . Part (a) of the theorem provides a set Tk c ilk of cardinality at most 2k(h+3) .3. Let 1 • 1 denote the upper integer function. 1} rk(4 ±3)1 . eventually almost surely. the definition of {rk } implies that for almost every x there is an integer K (x) > K such that qk(Tk14) > 1 . then qk(BI4) < c.{0..{0. n ). Also needed for each k is a fixed-length faithful coding of Tk. defined by Fm (X) = F(xi)F(x2). (ii) I BI <2k(h6) and qk (BI4) > E. so that if k is enough larger than K(x) to guarantee that nkh Z > N(x).8. which implies part (b) of the empirical-entropy theorem. 1}1 1 05 All is needed along with its extension to length in > 1. To proceed with the rigorous proof. then for almost every x there is an N(x) such that x'ii g B(n). for n > N(x).{0. If 4 g B(n). (1) If B C A k and I B I < 21* -0 . so that if K(x) < k < (log n)/ h then either 4 E B(n). A faithful single letter code. say.E ) .SECTION 11. Let K be a positive integer to be specified later and for n > 2/(4 let B(n) be the set of 4 for which there is some k in the interval [K. By using the suggested coding argument it will be shown that if 8 is small enough and K large enough then xn i . say. . which establishes that 4 g B(n). (i) qic(Tk14) ?_ 1 . for each k.3. (log n)/h] for which there is a set B c A k such that the following two properties hold. 141 not belong to Tk U B then it is transmitted term by term using some fixed 1-block code. then property (1) must hold for all n > 2.8. for K(x) < k < loh gn. EMPIRICAL ENTROPY. say G: Tk i. since such blocks cover at most a small fraction of the sample path this contributes little to overall code length. D It will be shown that the suggested code Cn beats entropy on B(n) for all sufficiently large n. F(x. and. eventually almost surely. fix c > 0 and let 8 be a positive number to be specified later. B(n). This fact implies part (b) of the empirical-entropy theorem. or the following holds. if 8 is small enough and K is large enough. since no prefix-code sequence can beat entropy infinitely often on a set of positive measure. and a fixed-length faithful code for each B c A k of cardinality at most 2k(h.

the block 0k1 determines k.B. 4 E B(n). Gk.B(w(i)) OlG k (w(i)) 11 Fk(w(i)) 11Fr (w(t +1) 0 w(i) E B W(i) E Tk . a fact stated as the following lemma. Fk. Finally.4). so that if K is large enough and 8 small enough then n(h — c 2 12). given that x til E B(n). and Fr are now known. The two-bit header on each v(i) specifies whether w(i) belongs to B. k). (3) . Once k is known.B w(i) g Tk UB. the first bit tells whether x`i' E B(n).3 I Mk. But this implies that xriz V B(n). from which w(i) can then be determined by using the appropriate inverse. let Ok(B) be the concatenation in some order of {Fk (4): 4 E B } . since each of the codes Mk.i<t i = t +1.C(x) 15_ tkqk(h — 6) + tk(1 — qk)(h + 6) + (3 + K)0(n). If 4 V B(n) then Cn (4) = OFn (4). where aK —> 0 as K —> oc. then an integer k E [K. Lemma 11. since no sequence of codes can beat entropy infinitely often on a set of positive measure. in turn. a prefix code such that the length of E(n) is log n + o(log n). The code is defined as the concatenation C(4) = 10k 1E(1B1)0k (B)v( 1)v(2) • • • v(t)v(t + 1). and w(t + 1) has length r = [0. or to neither of these two sets. implies part (b) of the empirical-entropy theorem. for all sufficiently large n. that is. and 4 is parsed as 4 = w(l)w(2) • • • w(t)w(t + 1). since reading from the left. 1=1 . If 4 E B(n). and as noted earlier. The lemma is sufficient to complete the proof of part (b) of the empirical-entropy theorem.3. where (2) v(i) = The code is a prefix code. For 4 E B(n). This is because tkq k (h — c) + tk(1 — q k )(h + 3) < n(h + 3 — 6(6 + S)). ( (I BI) determines the size of B.142 CHAPTER II. and let E be an Elias prefix code on the integers. for the block length k and set B c itic used to define C(. The rigorous definition of Cn is as follows. where w(i) has length k for i < t. for 4 E B(n). since qk > E. xri' E B(n). and qk = clk(Blx'1 2 ). eventually almost surely. and. (log n)/11] and a set B C A k of cardinality at most 2k(h-E ) are determined such that qk(B14') > c. to Tic — B. the principal contribution to total code length £(4) = £(41C) comes from the encoding of the w(i) that belong to rk U B. this. for each k and B C A k . and øk(B) determines B. ENTROPY-RELATED PROPERTIES.
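Since the argument only uses the length of C_n, a small sketch that accounts for the per-block costs may help make the construction concrete. The following is an editorial illustration, not the author's code: it charges each k-block its two-bit header plus a fixed-length index into B, a fixed-length index into T_k, or a letter-by-letter encoding, using the ceiling of log_2 of the set sizes in place of the k(h - ε) and k(h + δ) bounds of the text, and it omits the o(n) cost of the 10^k1, Elias, and φ_k(B) headers.

```python
import math

def two_part_code_length(blocks, T_k, B, alphabet_size):
    """Total length (in bits) of the per-block part of a code like C_n.

    Each k-block w gets a 2-bit header plus:
      - ceil(log2 |B|)   bits if w in B        (index into B),
      - ceil(log2 |T_k|) bits if w in T_k \\ B  (index into T_k),
      - k*ceil(log2 |A|) bits otherwise         (letter-by-letter code).
    The dictionary and length headers, which the proof shows are o(n),
    are not counted here.
    """
    idx_B = math.ceil(math.log2(max(len(B), 2)))
    idx_T = math.ceil(math.log2(max(len(T_k), 2)))
    per_letter = math.ceil(math.log2(alphabet_size))
    total = 0
    for w in blocks:
        if w in B:
            total += 2 + idx_B
        elif w in T_k:
            total += 2 + idx_T
        else:
            total += 2 + len(w) * per_letter
    return total
```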

(5) provided K > (E ln 2) . If k is fixed and n oc. which is. The encoding Øk(B) of B takes k Flog IAMBI < k(1+ log I A1)2 k(h. the goal is to estimate the entropy h of II.3. 143 Proof of Lemma 11. -KEn. together these contribute at most 3(t + 1) bits to total length. The coding proof given here is simpler and fits in nicely with the general idea that coding cannot beat entropy in the limit. II. each of which requires k(h +3) bits. upper bounded by K (1+ loglA1)2. A problem of interest is the entropy-estimation problem. and there are at most 1 +3t such blocks. since there are tqk(Bl.3. xn from an unknown ergodic process it.3. Ignoring for the moment the contribution of the headers used to describe k and B and to tell which code is being applied to each k-block. then H(qk)1 k will converge almost surely to H(k)/k. Thus with 6 ax = K(1 + log IA1)2 -KE — = o(K). and (t — tqk(B 14)) blocks in Tk — B. . This gives the dominant terms in (3). There are at most t + 1 blocks. The argument parallels the one used to prove the entropy theorem and depends on a count of the number of ways to select the bad set B. as well as a possible extra bit to round up k(h — E) and k(h +8) to integers. It takes at most k(1 + log I Ai) bits to encode each w(i) Tk U B as well as to encode the final r-block if r > 0.3. El Remark 11. The header 10k le(I BD that describes k and the size of B has length (4) 2 +k+logIBI o(logIBI) which is certainly o(n). K the encoding of 4k(B) and the headers contributes at most naK to total length. converges to h as . A simple procedure is to determine the empirical distribution of nonoverlapping k-blocks.4 The original proof. each of which requires k(h — E) bits. [ 52]. in turn.SECTION 11.3. which.3. each requiring a two-bit header to tell which code is being used. as well as the extra bits that might be needed to round k(h — E) and k(h +3) up to integers. of part (b) of the empirical-entropy theorem used a counting argument to show that the cardinality of the set B(n) must eventually be smaller than 2nh by a fixed exponential factor. since t < nlk. This completes completes the proof of Lemma 11. hence at most 80(n) bits are needed to encode them. and then take H (qk )/ k as an estimate of h. since log I BI < k(h — E) and k < (log n)I n. Given a sample path Xl. thereby establishing part (b) of the empirical-entropy theorem. a quantity which is at most 6n/K. EMPIRICAL ENTROPY. since k > K and n > 2".c Entropy estimation. and n> k > K. in turn.E ) bits. the number of bits used to encode the blocks in Tk U B is given by tqk (Bix)k(h — E) + (t — tq k (BI4)) k(h +3).'.3.q) blocks that belong to B. Adding this to the o(n) bound of (4) then yields the 30(n) term of the lemma. X2.

Theorem 11. and if k(n) < (1/h)log n. can now be applied to complete the proof of Theorem 11. if k(n) —> oc. so that for all large enough k and all n > 2kh . there is no universal choice of k = k(n) for which k(n) —> oc for which the empirical distribution qi.) If it is an ergodic measure of entropy h > 0. there is some choice of k = k(n) as a function of n for which the estimate will converge almost surely to h. CI Remark 11. the choice k(n) — log n works for any binary ergodic process. and note that 2/(/1-26) .(k1n — iikkai 'x i ) < 2-0h-2E) has qk (.s. Let Uk be the set of all cif E Tk(E) for which qk (ali` 'xi' ) < 2 -k(h+2E) .5 (The entropy-estimation theorem.6.144 CHAPTER II. Part (b) of the empirical-entropy theorem implies that for almost every x there is a K (x) such that qk(VkIX) < E.oc. The general result may be stated as follows. n _ This bound.3. if k(n) -.14) is close to the true distribution bak for every mixing process A. then (6) n->C0 k(n) 1 lim — H(q00(.(. a. (7) qk(Uklxi) _ < Irk(E)12ch+2E) < 2k(h+E)2 -k(h+2E) _ "-Ek —z. with alphabet A. for k > Ki(x). implies that for almost every x. k —> oc. ENTROPY-RELATED PROPERTIES.u. at least for a totally ergodic . because. In summary. at least if it is assumed that it is totally ergodic. The same argument that was used to show that entropy-rate is the same as decay-rate entropy. there is a Ki (x) such that qk(GkIx) > 1 — 36.5. for example. then it holds for any finite-alphabet ergodic process. In particular.14)) = h. Next let I Vkl < Vk be the set of all alic for which qk (cif 14) > 2 -k(h-26) . n > 2kh . lx) measure at least 1 — 3e. the set of al' for which . Thus. while if k(n) — log log n.log iAl n then (6) holds for any ergodic measure p.3.9. where G k = Tk(E) — Uk — Vk. . as n -. The empirical entropy theorem does imply a universal choice for the entropy-estimation problem. Proof Fix c > 0 and let {Tk(c) c A k : n > 1) satisfy part (a) of the empirical-entropy theorem. for example. > for k > K (x).6 The entropy-estimation theorem is also true with the overlapping block distribution in place of the nonoverlapping block distribution. At first glance. Theorem 1. see Exercise 2. the choice of k(n) would appear to be very dependent on the measure bt. combined with the bound (7) and part (a) of the empirical-entropy theorem.3.

that is. where each w(i) has length k. The empirical-entropy theorem provides another way to construct a universal code. if k does not divide n. and it was also shown that the Lempel-Ziv algorithm gives a universal code. High frequency blocks appear near the front of the code book and therefore have small indices. Encode successive k-blocks in 4 by giving the index of the block in the code Step 4. 1} k .3. 2k — 1) is called the code book. . Finally. Define b2+ +L i to be the concatenation of the E(A(w(i))) in order of increasing i. The code length £(4) = ni+ L +r. where L is the sum of the lengths of the e(A(w(i))). where h is the entropy of A. The steps of the code are as follows. subject only to the rule that af precedes 5k whenever k n -k n) qk(ailX1) > qk( all X 1 • The sequence {v(j) = j = 0. k) and 4 = w(1)w(2)•• • w(t)v.iiri I .42 ) 71 = h. (For simplicity it is assumed that the alphabet is binary. is a prefix code.SECTION 11.d Universal coding. La. Suppose n = tk r. Encode the final block.JF. Let E be a prefix code on the natural numbers such that the length of the word E(j) is log j o(log j). lirn “. book E. . fix {k(n)} for which k(n) — (1/2) log n. Step 2. EMPIRICAL ENTROPY 145 II. The number of bits needed to transmit the list . since the number of bits needed to transmit an index is of the order of magnitude of the logarithm of the index. depends on 4.3. Step 1. since k -(1/2) log ik n.C is short. The header br is a concatenation of all the members of {0. an Elias code. see the proof of Theorem 11. relative to n.) The code word Cn (4) is a concatenation of two binary sequences. a header of fixed length m = k2k .1. almost surely. Such a code sequence was constructed in Section II. if w(i) = that is.1. w(i) is the j-th word in the code book. For 1 < i < t and 0 < j < 2k.lxiii). but the use of the Elias code insures that C. define the address function by the rule A(w(i)) = j. define b:+ 1+ -1 to be x4+1 • The code Cn (4) is defined as the concatenation Cn (x7) = br • b:V' • b7. followed by a sequence whose length depends on x. 1. where k = k(n). since L depends on the distribution qk (. of these k-blocks in order of decreasing frequency of occur- Step 3. A universal code for the class of ergodic processes is a faithful code sequence {C n } such that for any ergodic process p. Transmit a list rence in 4. r E [0. Partition xi: into blocks of length k = k(n) — (1/2) log AI n. To make this sketch into a rigorous construction. so that the empirical-entropy theorem guarantees good performance. with a per-symbol code.
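The construction just described is easy to make runnable. The sketch below is an editorial illustration: it uses the Elias gamma code as a stand-in for a prefix code on the integers with length log j + o(log j), starts the ranks at 1 so that the most frequent block gets the one-bit code word, and appends the final r < k symbols uncoded; these conventions are mine, not the author's.

```python
import itertools
from collections import Counter

def elias_gamma(j):
    """Prefix code for integers j >= 1 with length 2*floor(log2 j) + 1."""
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def universal_code(x, k):
    """Sketch of the frequency-ordered code-book universal code for a
    binary string x; returns header + body + uncoded tail as a bit string."""
    m = len(x) // k
    blocks = [x[i * k:(i + 1) * k] for i in range(m)]
    counts = Counter(blocks)
    # Code book: all of {0,1}^k, listed in order of decreasing frequency in x.
    book = sorted(("".join(b) for b in itertools.product("01", repeat=k)),
                  key=lambda w: -counts[w])
    rank = {w: j + 1 for j, w in enumerate(book)}   # ranks start at 1
    header = "".join(book)                          # k * 2^k bits
    body = "".join(elias_gamma(rank[w]) for w in blocks)
    tail = x[m * k:]                                # final r < k symbols, sent as is
    return header + body + tail

x = "0110001000010010" * 50
print(len(universal_code(x, k=4)), len(x))   # coded length vs. raw length
```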

n — k + 1]: 444-1 = aDi . let Bk be the first 2k(h-E) sequences in the code book for x.3. an integer K(x) such that qk (GkI4) > qk (7k (c)14) > 1 — E for k > K(x). For each n. II. ENTROPY-RELATED PROPERTIES. a. at least a (1 — E)-fraction of the binary addresses e(A(w(i))) will refer to members of Gk and thus have lengths bounded above by k(h + c)+ o(log n).) Pk(a in4) E [1. Another application of the empirical-entropy theorem will be given in the next chapter. a. furthermore. in the context of the problem of estimating the measure II from observation of a finite sample path. S. but it is instructive to note that it follows from the second part of the empirical-entropy theorem. let Gk denote the first 2k(h +E) members of the code book. The empirical-entropy theorem guarantees that eventually almost surely.1. n—k+ 1 in place of the nonoverlapping block distribution qk(. To establish this. Show that the empirical-entropy theorem is true with the overlapping block distribution. Remark 11. and in some ways.3. so that lirninf' > (h — c)(1 — c). for k > K(x). Note that IGk I > l'Tk(E)I and. even simpler algorithm which does not require that the code book be listed in any specific order was obtained in [43]. since Gk is a set of k-sequences of largest qke probability whose cardinality is at most 2k(h +E ) . n where h is the entropy of it. which also includes universal coding results for coding in which some distortion is allowed.C(4) lim sup — < h.2.e Exercises 1. the crude bound k + o(log n) will do. and is described in Exercise 4. qk(GkIx t 0 qk(Tk(E)Ixi).. a. Theorem 11. These are the only k-blocks that have addresses shorter than k(h — 6) + o(log n). The empirical-entropy theorem will be used to show that for any ergodic tt. and hence < (1 — On(h + c) + En + o(log n). where k = k(n).lfiz). The empirical-entropy theorem provides. (8) n-* lim f(x n ) 1 = h. Thus.146 CHAPTER II. This proves that . for almost every x. there are at most Et indices i for which w(i) E Bk. s. For those w(i) Gk. see Section 111.3.7 The universal code construction is drawn from [49]. A second. The reverse inequality follows immediately from the fact that too-good codes do not exist. s. (Hint: reduce to the nonoverlapping case for some small shift of the sequence. let {7-k(E) c Ac} be the sequence given by the empirical-entropy theorem. for k > K (x). In fact.

Call this C k m (Xil ) and let £k(X) denote the length of Ck . w(2). be the first value of k E [ 1. j. of k-blocks {w(i)}.4 Partitions of sample paths.. In other words. Append some coding of the final block y. w(t)} if it is the concatenation (1) 4 = w(1)w(2). For example. along with a header to specify the value of k. Partitions into distinct words have the asymptotic property that most of the sample path must be contained in those words that are not too short relative to entropy.u k is asymptotically almost surely lower bounded by (1 — e -i ) for the case when n = k2k and .(x. [53]. The code C(4) is the concatenation of e(kiii. for i words. Show that {C. Express x'12 as the concatenation w(l)w(2) • • • w(q)v. 3. while 000110110100 = [00] [0110] [1101] [00] is not. most of the sample path is covered by words that are not much shorter than (log n)I h..(4)}. a word w is a finite sequence of symbols drawn from the alphabet A and the length of w is denoted by i(w).. then (1) is called a parsing (or partition) into distinct 000110110100 = [000] [110] [1101] [00] is a partition into distinct words.} is a universal prefix-code sequence.. An interesting connection between entropy and partitions of sample paths into variable-length blocks was established by Ornstein and Weiss. Another simple universal coding procedure is suggested by the empirical-entropy theorem. and for any partition into words that have been seen in the past..u is unbiased coin-tossing.'). for any partition into distinct words. (Hint: if M balls are thrown at random into M boxes. Show that the variational distance between qk (. such a code requires at most k(1 + log IA I) bits per word. Transmit k and the list. Make a list in some order of the k-blocks that occur.. most of the sample path is covered by words that are not much longer than (log n)I h.. For each k < n construct a code Ck m as follows. plus a possible final block y of length less than k. If w(i) w(j). 147 2... w(t).14) and . A sequence Jell is said to be parsed (or partitioned) into the (ordered) set of words {w(1). As in earlier discussions. Section 11. They show that eventually almost surely. Their results were motivated by an attempt to better understand the Lempel-Ziv algorithm. then code successive w(i) by using a fixed-length code.n ) and Cknun . then the expected fraction of empty boxes is asymptotic to (1 — 4.n (x). PARTITIONS OF SAMPLE PATHS.SECTION 11. . except possibly for the final word.4. The final code C(xil) transmits the shortest of the codes {Ck. such that all but the last symbol of each word has been seen before. which partitions into distinct words. n] at which rk(x'11 ) achieves its minimum. . Show that the entropy-estimation theorem is true with the overlapping block distribution in place of the nonoverlapping block distribution. let km.
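As an editorial illustration of parsing into distinct words, the sketch below applies the simple Lempel-Ziv rule mentioned above, in which each new word is the shortest block not yet used as a word, so that all words are distinct except possibly the final one; it is run on the example sequence 000110110100 from the text.

```python
def lz_distinct_parse(x):
    """Parse x into words by the simple LZ rule: each new word is the
    shortest prefix of the remaining sequence not already used as a word
    (the final word may repeat an earlier one if the sequence runs out)."""
    words, seen, i = [], set(), 0
    while i < len(x):
        j = i + 1
        while j <= len(x) and x[i:j] in seen:
            j += 1
        w = x[i:min(j, len(x))]
        words.append(w)
        seen.add(w)
        i += len(w)
    return words

words = lz_distinct_parse("000110110100")
print(words)                                     # ['0', '00', '1', '10', '11', '01', '00']
print(len(set(words[:-1])) == len(words) - 1)    # distinct except possibly the last word
```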

If this final word is not empty. by the crude bound. and let E > 0 be given. if each w(i) is either in .2. in turn.1.. .2. twi>>>0-0-Ilogn It is not necessary to require exact repetition in the repeated-words theorem. possibly for the final word. Theorem 11. it is the same as a prior word. be ergodic with entropy h > 0. ENTROPY-RELATED PROPERTIES.1 (The distinct-words theorem. except. For this purpose. and the fact that too-good codes do not exist. Thus the distinct-words theorem implies that. Of course. which.4. called the start set. then t ( o )) < En.4. s.2 parses a finite sequence into distinct words. Theorem 11. a.be a fixed finite collection of words. Lemma 11. w(C + 1).) Let p. eventually almost surely. as noted earlier. (2). Theorem 11.148 CHAPTER II. implies that C(4) log n lim inf > h. Lemma 11. then t ( to)) <En. some way to get started is needed.. so that (2) lim sup n- C(4) log n fl < h. To discuss partitions into words that have been seen in the past.2 (The repeated-words theorem. see Exercise 1. meaning that there is an index j < t(w(1)) + + f(w(i — 1)) such that w(i) = Partitions into repeated words have the asymptotic property that most of the sample path must be contained in words that are not too long relative to entropy. most of the sample path is covered by blocks that are at least as long as (h + E)' log(n/2).2. and let 0 < E < h and a start set T be given.7. For almost every x E A there is an N = N(E. For example.2. this time for parsings into words of variable length. = w(1)w(2). Theorem 11. and hence t(w(C + 1)) < n/2. For almost every w(t) is a x E A" there is an N = N(E. x) such that if n > N and xri' = w(1)w(2) partition into distinct words.2. it is sufficient if all but the last symbol appears earlier. and xi. But the part covered by shorter words contains only a few words. a. With this modification. ow(0)<0+0-11ogn The (simple) Lempel-Ziv algorithm discussed in Section 11. x) such that if n > N. The proofs of both theorems make use of the strong-packing lemma. A partition x = w(1)w(2) w(k) is called a partition (parsing) into repeated words.T or has been seen in the past. let . the repeated-words theorem applies to the versions of the Lempel-Ziv algorithm discussed in Section 11.4.) Let be ergodic with entropy h > 0. (3) Thus the distinct-words theorem and the modified repeated-words theorem together provide an alternate proof of the LZ convergence theorem. so that eventually almost surely most of the sample path must be covered by words that are no longer than (h — E) -1 log n. w(t) is a partition into repeated words.1. s. the lower bound (3) is a consequence of fact that entropy is an upper bound.3.
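A small checker for the repeated-words condition may also be useful (again an editorial aside; the start set is passed in as a parameter): each word must either belong to the start set or occur in x_1^n beginning at some index strictly earlier than its own starting position.

```python
def is_repeated_word_parsing(x, words, start_set):
    """True if x == w(1)...w(t) and each w(i) either lies in start_set
    or occurs in x starting at some index j strictly before w(i) starts."""
    if "".join(words) != x:
        return False
    pos = 0
    for w in words:
        # A match entirely inside x[0 : pos+len(w)-1] must start before pos.
        if w not in start_set and x.find(w, 0, pos + len(w) - 1) == -1:
            return False
        pos += len(w)
    return True

# [00][0110][1101][00]: the final [00] has been seen, since it starts x.
print(is_repeated_word_parsing("000110110100",
                               ["00", "0110", "1101", "00"],
                               start_set={"00", "0110", "1101"}))
```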

This proves the lemma. let 8 be a positive number to be specified later. The fact that the words are distinct then implies that .. yields an integer K and { rk C A": k> 1} such that I7j < 21c(h+S) .. w(i)gc then t(w(i)) < 2En.4.(i)). then E aw (i) ) Bn. by an application of the standard bound.3. The strong-packing lemma. where a is a fixed positive number. POD (j)) <K The second observation is that if the distinct words mostly come from sets that grow in size at a fixed exponential rate a. The first observation is that the words must grow in length.. .4. Given E > 0 there is an N such that if n > N and x n = w(1)w(2). t > 1. and let G = UkGk. Lemma 11.(.a Proof of the distinct-words theorem. Et _ The general case follows since it is enough to estimate the sum over the too-short words that belong to G. together with some simple facts about partitions into distinct words. t —) men.3 Given K and 8 > 0. then words that are too short relative to (log n)Ice cannot cover too much. there is an N such that if n > N and xrii = w(1)w(2). Proof First consider the case when all the words are required to come from G. w(t) is a partition into distinct words. The distinct-words theorem is a consequence of the strong-packing lemma. PARTITIONS OF SAMPLE PATHS. w(t) is a partition into distinct words such that t(w(i)) 5_ En.(. Lemma 11.2._Exio g n)la E top(o) E Ic<(1-6)(log n)la Since IGk I <2"k the sum on the right is upper bounded by ( 2' ((1 — E)logn) —1) exp2 ((1 — E)logn)) = logn) = o(n).4 For each k.4. The proof of this simple fact is left to the reader.4. 149 II. To continue with the proof of the distinct-words theorem. suppose Gk C A k satisfies IGkI _< 2ak . k > K.SECTION 11.. because there are at most IA i k words of length k. Lemma 11.

will be a fixed ergodic process with positive entropy h and E < h will be a given positive number.3. by Lemma 11. It will be shown that if the repeated-words theorem is false. Thus. the existence of good codes is guaranteed by the strong-packing lemma.150 CHAPTER II.4.4. The distinct-words theorem follows.4 can be applied with a = h + 3 to yield aw(i)) < 25n. e. 2 Suppose xr. A block of consecutive symbols in xiit of length more (h — E) -1 log n will be said to be too long. Label these too-long words in increasing order of appearance as V(1). property (4) holds for any parsing of . E w(i)Euciz k rk t (o )) a ( l-3)n. ENTROPY-RELATED PROPERTIES. for which s of the w(i) are too long. V(s). since 8 could have been chosen in advance to be so small that log n (1 — 2S) log n h + c 5h+3 • 1:3 I1. the basic idea of the version of the Lempel-Ziv code suggested in Remark 11.2. this time applied to variable-length parsing. The idea is to code each too-long word of a repeated-word parsing by telling where it occurs in the past and how long it is (this is. f(w(i))(1-15)(logn)/(h+S) provided only that n is large enough. Suppose also that n is so large that. and let .' is (K. for any parsing P of xriz for which (4) E Sn atv(i)) — . i. and such that eventually almost surely xrii is (K. Suppose = w(l)w(2) w(t) is some given parsing. strong-packing implies that given a parsing x = w(l)w(2) E w(i)EU Klj t (o)) a _3)n. let ui be the concatenation of all the words that come between V(i — 1) and V(i). V(2). then a prefix-code sequence can be constructed which beats entropy by a fixed amount infinitely often on a set of positive measure.b Proof of the repeated-words theorem. As in the proof of the empirical-entropy theorem. The first idea is to merge the words that are between the too-long words. for 1 < i < s + 1. of course. Throughout this discussion p. 3)-strongly-packed by {'Tk }. 3)-strongly-packed by MI.x. then overall code length will be shorter than entropy allows. r w(t) into distinct words. Let u1 be the concatenation of all the words that precede V(1).) If a good code is used on the complement of the too-long words and if the too-long words cover too much.8.. it into distinct words.4. and therefore Lemma 11.

us V(s)u s+i with the following two properties. then B(n) n G(n).SECTION 11. eventually almost surely. eventually almost surely. 3)-stronglypacked by {'Tk }. to complete the proof of the repeated-words theorem it is enough to prove that if K is large enough. The words in the too-long representation are coded sequentially using the following rules. . 151 u s+ 1 be the concatenation of all the words that follow V(s).. it is enough to prove that x g B(n).q E G(n) eventually almost surely. it can be supposed such that V(j) that when n is large enough no too-long word belongs to F.. An application of the strong-packing lemma. is constructed as follows.2.T is a fixed finite set. Lemma 11.. A code C.3.. a set 'Tk c A k of cardinality at most 2k(h+8) . (ii) Each V(j) has been seen in the past. but to make such a code compress too much a good way to compress the fillers is needed. Let 3 be a positive number to be specified later. Let G(n) be the set of all xri' that are (K. The strong-packing lemma provides the good codes to be used on the fillers..u s V(s)u s+i. Since . If x E B(n) n G(n).4. V(s)} and fillers lui. (a) Each filler u 1 E U K Tk is coded by specifying its length and giving its index in the set Tk to which it belongs. provides an integer K and for a each k > K. a too-long representation = u1V(1)u2V(2). (i) Ei f ( V (D) >En.. PARTITIONS OF SAMPLE PATHS.. with the too-long words {V(1). such that eventually almost surely xril is (K. x'iz is expressed as the concatenation x Çz = u i V(1)u2V(2) us V(s)u s±i . ni + ± imi has been seen in the past is to say that there is an index i E [0. is determined for which each too-long V(j) is seen somewhere in its past and the total length of the {V(j)} is at least En. In this way. ni ) To say V(j) = xn Since the start set . . The idea is to code sequences in B(n) by telling where the too-long words occurred earlier. (b) Each filler u1 g U K r k is coded by specifying its length and applying a fixed single-letter code to each letter separately. 8)-strongly-packed by {Tk } . Such a representation is called a too-long representation of 4. and therefore to prove the repeated-words theorem. Let B(n) be the set of all sequences x for which there is an s and a too-long representation x = uiV (1)u2 V(2) . Sequences not in B(n) n G(n) are coded using some fixed single-letter code on each letter separately. (c) Each too-long V(j) is coded by specifying its length and the start position of its earlier occurrence.
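The merging step that produces a too-long representation is mechanical: runs of words of length at most the threshold are concatenated into fillers u_i, and the too-long words V(j) are kept separate. The sketch below is an editorial illustration of that step; the threshold is a parameter standing in for (h - ε)^{-1} log n.

```python
def too_long_representation(words, threshold):
    """Split a parsing into fillers u_1, V(1), u_2, V(2), ..., u_{s+1},
    where the V(j) are the words longer than threshold and each u_i is
    the concatenation of the intervening short words (possibly empty)."""
    fillers, longs, current = [], [], ""
    for w in words:
        if len(w) > threshold:
            fillers.append(current)
            longs.append(w)
            current = ""
        else:
            current += w
    fillers.append(current)
    return fillers, longs

u, V = too_long_representation(["01", "110", "0100110", "0", "10011"], 4)
print(u)   # ['01110', '0', '']
print(V)   # ['0100110', '10011']
```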

which requires Flog ni bits for each of the s too-long words. "ell B(n) n G(n). indeed. The lemma is sufficient to show that xrii B(n) n G(n). For 4 E B(n) n G(n). for XF il E B(n) n G(n). since it is not possible to beat entropy by a fixed amount infinitely often on a set of positive measure.4. is that the number s of too-long words. and from specifying the index of each u i E UlciK rk. must satisfy the bound. the principal contribution to total code length E(xriz) . while it depends on x.. Two bit headers are appended to each block to specify which of the three types of code is being used. require a total of at most 60(n) bits. Proof of Lemma 11. ENTROPY-RELATED PROPERTIES. eventually almost surely. This fact is stated in the form needed as the following lemma. which. ) <K (1 + n(h log n j). shows that. as well as the encoding of the lengths of all the words. For the fillers that do not belong to U71. then (5) uniformly for 4 . since.C(x'11 ) = s log n + ( E ti J ELJA7=Kii B(n) n G(n). first note that there are at most s + 1 fillers so the bound Es. the code becomes a prefix code.152 CHAPTER II.5. eventually almost surely. a word is too long if its length is at least (h — E) -1 log n. An Elias code is used to encode the block lengths. for all sufficiently large n. by definition. It is enough to show that the encoding of the fillers that do not belong to Urk . The key to this. — log n — log n implies that E au .C(fi` IC n ) comes from telling where each V(j) occurs in the past. Lemma 11. tw(i) > en. plus the two bit headers needed to tell which code is being used and the extra bits that might be needed to round up log n and h + 6 to integers. completing the proof of the repeated-words theorem.. which requires £(u) [h + 81 bits per word. as well as to the lemma itself.4. n s < i="' (h — e) < — (h — 6). a prefix code S on the natural numbers such that the length of the code word assign to j is log j + o(log j). ) (h + 6) + 60(n). and hence if 6 is small enough then £(4) < n(h — € 2 12). xÇ2 E B(n) n G(n).5 If log K < 6 K . so that (5) and (6) yield the code-length bound < n(h + — e(e + 6)) + 0(n). (6) s — av(i)) log n (h — e). With a one bit header to tell whether or not xri` belongs to B(n) n G(n). E au . that is. By assumption.(x €)) .

Since it takes L(u)Flog IAI1 bits to encode a filler u g UTk. . then the assumption that 4 E G(n).4. the dominant terms of which can be expressed as E j )<K log t(u i ) + E t(u i )K log t(u i ) + logt(V(i)). and since x -1 log x is decreasing for x > e. This completes the proof of Lemma 11. once n is large enough to make this sum less than (Sn/2. that 4 is (K.4. The coding proof is given here as fits in nicely with the general idea that coding cannot beat entropy in the limit. PARTITIONS OF SAMPLE PATHS. of the repeated-words theorem parallels the proof of the entropy theorem. Extend the distinct-words and repeated-words theorems to the case where agreement in all but a fixed number of places is required. Extend the repeated-words theorem to the case when it is only required that all but the last k symbols appeared earlier. that is.. which is o(n) for K fixed. Finally. the total length contributed by the two bit headers required to tell which type of code is being used plus the possible extra bits needed to round up log n and h + 8 to integers is upper bounded by 3(2s + 1) which is o(n) since s < n(h — E)/(logn). Finally.•)) [log on SnFlog IA11. 6)-strongly-packed by {Tk }. + o(n)). implies that the words longer than K that do not belong to UTk cover at most a 6-fraction of x .SECTION 11. This is equivalent to showing that the coding of the too-long repeated blocks by telling where they occurred earlier and their lengths requires at most (h — c) E ti + o(n) bits. II.c Exercises 1. the total number of bits required to encode all such fillers is at most (u. which is Sn. is upper bounded by exp((h — c) Et. Since it can be assumed that the too-long words have length at least K. since there are at most 2s + 1 words. In particular. The key bound is that the number of ways to select s disjoint subintervals of [1. n] of length Li > (log n)/ (h — 6). 153 which is o(n) for fixed K. since it was assumed that log K < 5K. using a counting argument to show that the set B(n) n G(n) must eventually be smaller than 2nh by a fixed exponential factor.6 The original proof.4. a u . the second and third sums together contribute at most log K K n to total length. E 7. 2. [53]. which is 50(n). Remark 11. the encoding of the lengths contributes EV(t(u i)) Et(eV(V(i») bits.4. The first sum is at most (log K)(s + 1).5. which can be assumed to be at least e.

Let . Section 11. and r(Tx) < r(x). Carry out the details of the proof that (3) follows from the repeated-words theorem. An almost-sure form of the Wyner-Ziv result was established by Ornstein and Weiss. 4. The theorem to be proved is Theorem 11. . almost surely. n 1 r(x) = lim inf .154 CHAPTER II. unbiased coin-tossing.) For any ergodic process p. 5. both the upper and lower limits are subinvariant. see also the earlier work of Willems. along with an application to a prefix-tree problem will be discussed in this section. no parsing of 4 into repeated words contains a word longer than 4 log n.i. almost surely. Lemma 11. (c) Show that eventually almost surely.5.1. Wyner and Ziv showed that the logarithm of the waiting time until the first n terms of a sequence x occurs again in x is asymptotic to nh. be the Kolmogorov measure of the binary.u(x'11 ). What can be said about distinct and repeated words in the entropy 0 case? 6. there are fewer than n 1-612 words longer than (1 + c) log n in a parsing of 4 into repeated words. This is a sharpening of the fact that the average recurrence time is 1/. the words longer than (1 + c) log n in a parsing of 4 into repeated words have total length at most n 1. n—woo n T(x) Since Rn _1(T x) < R n (x). [86]..) (b) Show that eventually almost surely.1 (The recurrence-time theorem. with entropy h. Carry out the details of the proof that (2) follows from the distinct-words theorem. n n--÷oo Some preliminary results are easy to establish. [85]. whose logarithm is asymptotic to nh. i. F(Tx) <?(x).log R„(x).5 Entropy and recurrence times. The definition of the recurrence-time function is Rn (x) = min{rn > 1: x m+ m+n 1 — Xn } — .u. These results. ENTROPY-RELATED PROPERTIES.E12 log n. equiprobable. Define the upper and lower limits. 1 = lim sup . process (i. i . in probability. e. lim 1 log Rn (x) = h. An interesting connection between entropy and recurrence times for ergodic processes was discovered by Wyner and Ziv.log Rn (x). 3.3.) (a) Show that eventually almost surely.d. [53]. that is. (Hint: use Barron's lemma.
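The recurrence-time function R_n(x) is easy to compute from a long realization, and doing so gives a quick empirical check of the theorem. The sketch below is an editorial illustration for simulated coin-tossing, for which h = 1 bit, so that (1/n) log_2 R_n(x) should be near 1.

```python
import math
import random

def recurrence_time(x, n):
    """R_n(x): least m >= 1 with x[m:m+n] == x[0:n], or None if the
    finite realization x is too short to contain a return."""
    target = x[:n]
    for m in range(1, len(x) - n + 1):
        if x[m:m + n] == target:
            return m
    return None

random.seed(1)
x = "".join(random.choice("01") for _ in range(200000))
n = 12
Rn = recurrence_time(x, n)
print(Rn, math.log2(Rn) / n)   # second number should be near h = 1
```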

. . so that (2) yields 1-1 (Dn < n Tn) _ 2n(h-1-€12)2—n(h+E) . then eventually almost surely. fix (4 and consider only those x E D„ for which xri' = ari'. n > 1.(4)) < 2-n(h+E). if Tri = {X: t(x) > 2-n(h-i-E/2)} then x E T. n 'T. that if x E D(a). both are almost surely invariant. disjointness implies that (2) It(D. the set D(a7) = D. r < F. = fx: R(x) > r(h+E ) }. by Theorem 11. n [4]. The inequality r > h will be proved by showing that if it is not true. the cardinality of the projection onto An of D.. . which is upper bounded by 2n(h+E / 2) . 7-1 D. The goal is to show that x g D. Note. sample paths are mostly covered by long disjoint blocks which are repeated too quickly. The bound r > h was obtained by Ornstein and Weiss by an explicit counting argument similar to the one used to prove the entropy theorem.5.. for j E [1. Furthermore. there are constants F and r such that F(x) = F and r(x) = r. almost surely. 2—nE/2 . so the iterated almost sure principle yields the desired result (1).SECTION 11. is upper bounded by the cardinality of the projection of Tn .2. 7' -2a(h+°+1 Dn(ai) must be disjoint. Since these sets all have the same measure. eventually almost surely. The bound F < h will be proved via a nice argument due to Wyner and Ziv.. so the desired result can be established by showing that r > h and F < h. then there is a prefix-code sequence which beats entropy in the limit. both upper and lower limits are almost surely constant. eventually almost surely. eventually almost surely. 2n(h+E) — 1]. The Borel-Cantelli principle implies that x g D. by the definition of D(a). and hence. n T„.(4). that is. It is now a simple matter to get from (2) to the desired result. then Tx g [a7]. so it is enough to prove (1) x g Dn n 7. a more direct coding construction is used to show that if recurrence happens too soon infinitely often..„ eventually almost surely.1. In the proof given below." Indeed. To establish this. 155 hence. however. Since the measure is assumed to be ergodic. To establish i: < h. It is enough to prove this under the additional assumption that xril is "entropy typical. which is impossible. fix e > 0 and define D. and hence the sequence D(a). Another proof. Indeed. (1). is outlined in Exercise 1. based on a different coding idea communicated to the author by Wyner. A code that compresses more than entropy can then be constructed by specifying where to look for the first later occurrence of such blocks. ENTROPY AND RECURRENCE TIMES.

Let G(n) be the set of all xi' that have a (8. (i) Each filler word u i is coded by applying F to its successive symbols.t+i is determined and successive words are coded using the following rules. (a) Each V(j) has length at least m and recurs too soon in 4. a result stated as the following lemma. 1}d.156 CHAPTER II. The key facts here are that log ki < i(V(j))(h — c). where E > O." and that the principal contribution to total code length is E i log ki < n(h—c). To develop the covering idea suppose r < h — E. The code also uses an Elias code E.1 )u. Sequences not in G(n) are coded by applying F to successive symbols. a prefix code on the natural numbers such that t(E(j)) = log j + o(log j). (b) The sum of the lengths of the filler words ui is at most 3 8 n. ENTROPY-RELATED PROPERTIES. It will be shown later that a consequence of the assumption that r < h — c is that (3) xÇ E G(n). For n > m. for which s +k < n.2 (4) f(xii) < n(h — E) n(3dS + ot ni ). The code Cn uses a faithful single-letter code F: A H+ {0. that is. by the definition of "too-soon recurrent. where d < 2+ log l Ai. Lemma 11. The one bit headers on F and in the encoding of each V (j) are used to specify which code is being used. X E G (n ) . is said to be a (8. a concatenation = u V (1)u 2 V (2) • u j V (J)u j+1 . such that each code word starts with a O. m)-too-soon-recurrent representation 4 = u V(1)u2 V(2) • • • u V( . The least such k is called the distance from xs to its next occurrence in i• Let m be a positive integer and 8 be a positive number. m)-too-soon-recurrent representation. followed by E(t(V (j)). where cm --÷ 0 as m . 2h(t -s + 1) ) . where k i is the distance from V(j) to its next occurrence in x. m)-too-soon-recurrent representation of 4. eventually almost surely. both to be specified later.5. for n > m. oc. Here it will be shown how a prefix n-code Cn can be constructed which. if the following hold. for suitable choice of m and S. A block xs. (ii) Each V(j) is coded using a prefix 1.' of s lik k for consecutive symbols in a sequence 4 is said to recur too soon in 4 if xst = x` some k in the interval [ 1 . compresses the members of G(n) too well for all large enough n. followed by e(k i ). A one bit header to tell whether or not 4 E G(n) is also used to guarantee that C. For x E G(n) a (8. is a prefix n-code.

assume r < h — E and. The Elias encoding of the distance to the next occurrence of V(j) requires (5) f(V (j))(h — E) o(f(V(j))) bits. for fixed m and 6. since (1/x) log x is decreasing for x > e. so that . Finally. there is an N(x) > M/8 such that if n > (x). Summing on j < n/ m. n].n --* 0 as m cc. for which the following hold. This completes the proof of Lemma 11. Thus r > h. and hence the packing lemma implies that for almost every x. The one bit headers on the encoding of each V(j) contribute at most n/ m bits to total code length. 157 The lemma implies the desired result for by choosing 6 small enough and m large enough. contradicting the fact (3) that 4 E G(n). as m Do.2.C(x'ii) < n(h — 6/2) on G(n). the lemma yields .8. eventually almost surely.5.SECTION 11. ENTROPY AND RECURRENCE TIMES.„ where /3. 0 It remains to be shown that 4 E G(n). eventually almost surely. Proof of Lemma 11.u(G(n)) must go to 0. Since lim inf.5." Summing the first term over j yields the principal term n(h — E) in (4). the complete encoding of the too-soon recurring words V(j) requires a total of at most n(h — E) nun. eventually almost surely. a fact that follows from the assumption that 0 r <h — E. each of length at least m and at most M. But too-good codes do not exist. the encoding of each filler u i requires dt(u i ) bits. bits. define B n = (x: R(x) < 2 n(h—E) ). m]: i < of disjoint subintervals of [1. R n (x) = r there is an M such that n=M . for all large enough n. there is an integer I = I (n. for B = n=m Bn' The ergodic theorem implies that (1/N) Eni Bx (Ti-i X ) > 1 . Thus the complete encoding of all the fillers requires at most 3d6n bits. + (1 ± log m)/m 0. where an. Since each V(j) is assumed to have length at least m. by property (a) of the definition of G(n). In summary. the second terms contribute at most Om bits while the total contribution of the first terms is at most Elogf(V(D) n logm rn .u(B) > 1 — 6. for each n > 1. To carry it out. This is a covering/packing argument. establishing the recurrence-time theorem. by the definition of "too-soon recurrent. x) and a collection {[n i . The Elias encoding of the length t(V(j)) requires log t(V(D) o(log f(V(j))) bits. and the total length of the fillers is at most 36n. there are at most n/ m such V(j) and hence the sum over j of the second term in (5) is upper bounded by nI3.. = 2/3.2.5. .

. For example.5. and take (11 kt) Et. Xi :Fnil—n ` 9 for some j E [ni + 1.. 1. that is. where underlining is used to indicate the respective prefixes. for fixed m and 3. W(4) = 100. itl. for any j W(i): i = 0. a log n. The prefix tree 'T5(x) is shown in Figure 1. condition (b) can be replaced by (c) > < (m — n +1) > (1 — 33)n. T 2x. T 2x = 1010010010 .a Entropy and prefix trees./ V(J)u j+i.Tn —l x.158 (a) For each i < I el' n.7. Prefix trees are used to model computer storage algorithms.T4x = 10010010. . . x) be the shortest prefix of Tx i.. log Rk(T i l X) as an estimate of h. x 001010010010 . . 0 < j < n. T 3x = 010010010 . page 72.. ±1)(h—E) ]. A sequence X E A" together with its first n — 1 shifts. eventually almost surely. In particular. For each 0 < i < n. The sequence x`il can then be expressed as a concatenation (6) xn i = u i V(1)u2V(2)• • • u. (b) (mi — ni + 1) > (1 — 25)n. in which context they are usually called suffix trees. let W(i) = W(i. and where each u i is a (possibly empty) word. where a < 1. it can be supposed that n > 2m(h— E ) IS. the set which is not a prefix of Px. In practice. W(2) = 101. W(1) = 0101. and thereby finishes the proof of the recurrence-time theorem. called the n-th order prefix tree determined by x. n]. II. if it is assumed that r < h — E.5. provided n is large. algorithms for encoding are often formulated in terms of tree structures.. as quick way to test the performance of a coding algorithm. defines a tree as follows. . [16]. independently at random in Let k [1. Remark 11. This works reasonably well for suitably nice processes. n — 11 is prefix-free. the total length of the V(j) is at least (1 —38)n. Closely related to the recurrence concept is the prefix tree concept.5. 'Tn (x). for example. This completes the proof that 4 E G(n). it defines a tree. where each V(j) = xn 'n. and can serve.n. W(0) = 00. Tx. select indices fit. the simplest form of Lempel-Ziv parsing grows a subtree of the prefix tree. Since by definition. W(3) = 0100.9. to propose that use of the full prefix tree . In addition. as tree searches are easy to program. recurs too soon. ni + 2(ni i —n. T x = 01010010010 . = - CHAPTER II. ENTROPY-RELATED PROPERTIES.7. below. so that if J is the largest index i for which ni + 2A1(h— E) < n. This led Grassberger. x E G (n).3 The recurrence-time theorem suggests a Monte Carlo method for estimating entropy.. as indicated in Remark 11. • • • .

for all L > 1. case. then for every c > 0 and almost every x.5. (Adding noise is often called "dithering.5. and some other cases. eventually almost surely.5. eventually almost surely. Note that the prefix-tree theorem implies a trimmed mean result.5 (The finite-energy theorem.d.E)-trimmed mean of the set {L(x)} is almost surely asymptotic to log n/h. L(x) is the length of the longest string that appears twice in x7. the binary process defined by X n = Yn Z.i. however. for which a simple coding argument gives L(x) = 0(log n). He suggested that if the depth of node W(i) is the length L7(x) = f(W(i)). all but en of the numbers L7(x)/ log n are within c of (11h). The {L } are bounded below so average depth (1/n) E (x) will be asymptotic to log n1 h in those cases where there is a constant C such that max L7 (x) < C log n. The (1 . there is a j E [1. This is equivalent to the string matching bound L(4) = 0(log n). Another type of finite energy process is obtained by adding i. it is not true in the general ergodic case. the (1 . is an ergodic process of entropy h > 0. see Theorem 11. at least in probability. noise to a given ergodic process. as earlier. in general. c) such that for n > N. is the following theorem.d.. as it reflects more of the structure of the data. however. there is an integer N = N(x. and for all xl +L of positive measure. and independent of {Yn }. for all t > 1.5.SECTION 11.) If p. h The i. there is no way to keep the longest W(i) from being much larger than 0(log n).) The mean length is certainly asymptotically almost surely no smaller than h log n. has finite energy if there is are constants c < 1 and K such that kxt-F1 i) < Kc L .) If p.i. mixing Markov processes and functions of mixing Markov processes all have finite energy. such as. for any E > 0. .i. 159 might lead to more rapid compression. One type of counterexample is stated as follows. see Exercise 2.4 (The prefix-tree theorem. such that x: +k-2 = There is a class of processes." and is a frequently used technique in image compression. Theorem 11.i. An ergodic process p. (mod 2) where {Yn } is an arbitrary binary ergodic process and {Z n } is binary i. j i. for example. and hence lim n -+cx) E n log n i=1 1 1 (x) = . by definition of L(x). eventually almost surely. then average depth should be asymptotic to (1/h) log n. namely. ENTROPY AND RECURRENCE TIMES. n]. While this is true in the i.E)-trimmed mean of a set of cardinality M is obtained by deleting the largest ME/2 and smallest ME/2 numbers in the set and taking the mean of what is left.d. which kills the possibility of a general limit result for the mean. processes. the finite energy processes. where. eventually almost surely. Theorem 11. since if L7(x) = k then. What is true in the general case. is an ergodic finite energy process with entropy h > 0 then there is a constant C such that L(4) < C log n. namely. almost surely.d.5.
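Since L(x_1^n) is just the length of the longest block that occurs at two distinct starting positions in x_1^n, it can be computed directly for moderate n. The sketch below is an editorial illustration: it locates L(x_1^n) by a binary search over lengths, using the fact that if a block of length L is repeated then so is its length L - 1 prefix, and compares the result with log_2 n for a simulated coin-tossing path, a finite-energy process for which L(x_1^n) = O(log n).

```python
import math
import random

def longest_repeat(x):
    """Length L(x) of the longest block occurring at two distinct
    (possibly overlapping) starting positions in x."""
    def repeats(L):                       # is some block of length L repeated?
        seen = set()
        for i in range(len(x) - L + 1):
            b = x[i:i + L]
            if b in seen:
                return True
            seen.add(b)
        return False
    lo, hi = 0, len(x) - 1                # repeats is monotone in L
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if repeats(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

random.seed(3)
x = "".join(random.choice("01") for _ in range(1 << 13))
print(longest_repeat(x), math.log2(len(x)))   # L is a small multiple of log2 n
```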

each of which is one symbol shorter than the shortest new word. define the sets L(x.a. for fixed n.5.4.6 CHAPTER II. the tree grows to a certain size then remains fixed. but do achieve something reasonably close to optimal compression for most finite sequences for nice classes of sources. For S > 0 and each k. is a consequence of the recurrence-time theorem. n) = fi E [0. n — I]: L7(x) < (1— E)(logn) I h} U (x .7 The simplest version of Lempel-Ziv parsing x = w ( 1)w(2) • • •. Remark 11. The details of the proof that most of the words cannot be too short will be given first. II. whose predecessor node was labeled w(j). Section 11. Theorem 11. (See Exercise 3. From that point on. ENTROPY-RELATED PROPERTIES.1 Proof of the prefix-tree theorem. In the simplest of these. if it has the form w(i) = w(j)a. almost surely. The prefix-tree and finite-energy theorems will be proved and the counterexample construction will be described in the following subsections. It is enough to prove the following two results. the W (i. n)I < En. n) are distinct. where the next word w(i). The remainder of the sequence is parsed into words. (8). in which the next word is the shortest new word. 88] for some finite version results.) The limitations of computer memory mean that at some stage the tree can no longer be allowed to grow. lim (7) nk log nk with respect to the measure y = o F-1 . To control the location of the prefixes an "almost uniform eventual typicality" idea will be needed.2. grows a sequence of IA I-ary labeled trees. eventually almost surely. eventually almost surely. each of which produces a different algorithm. for the longest prefix is only one longer than a word that has appeared twice. n) = fi E [0.2. For E > 0.160 Theorem 11.10. for some j < i and a E A. and an increasing sequence {nk}. (9). Performance can be improved in some cases by designing a finite tree that better reflects the actual structure of the data. and is a consequence of the fact that.5. define Tk (8) = (x) > . n)I < En. which means more compression.5. see [87. n — I]: (x) > (1 ± c)(log n)I hl. There is a stationary coding F of the unbiased coin-tossing process kt. is assigned to the node corresponding to a. The fact that most of the prefixes cannot be too long.1. can be viewed as an overlapping form of the distinct-words theorem. Theorem 11. Prefix trees may perform better than the LZW tree for they tend to have more long words. such that vnrfo l Ltilk (x ) = oo.1. (9) The fact that most of the prefixes cannot be too short. I U(x . (8) 1L(x. Subsequent compression is obtained by giving the successive indices of these words in the fixed tree. Such finite LZ-algorithms do not compress to entropy in the limit. several choices are possible. mentioned in Remark 11.5.

x) are distinct members of the collection 'Tk (6) which has cardinality at most 2k(h+8) . except for the final letter. combined with the bound on the number of non-good indices of the preceding paragraph. 1 urn . Hence.6 and Exercise 11. (8). by the entropy theorem.5. n. Next the upper bound result.7. for all k > MI has measure greater than 1 — S.8 larger. eventually almost surely. if necessary. a. an index i < n will be called good if L(x) > M and W(i) = W(i. then it has too-short return-time. X m+ m+k i— min frn > 1: X1_ k+1 = X — ° k+11 The recurrence-time theorem directly yields the forward result. To proceed with the proof that words cannot be too short. This. (9).s. Section 1. The recurrence-time theorem shows that this can only happen rarely. the shift Tx belongs to G(M) for all but at most a limiting 6-fraction of the indices i. There are at most 2k(h+3) such indices because the corresponding W (i.. Thus by making the N(x) of Lemma 11. Both a forward and a backward recurrence-time theorem will be needed. n).5. for almost every x.. eventually almost surely. k—> cx) k and when applied to the reversed process it yields the backward result.s. If a too-long block appears twice. Suppose k > M and consider the set of good indices i for which L7(x) = k. a. The corresponding recurrence-time functions are respectively defined by Fk(X) = Bk(X) = min {in> 1 •. the i-fold shift Tx belongs to G(M). Since the prefixes W(i) are all distinct there are at most lAl m indices i < n such that L7(x) < M. which can be assumed to be less than en/2. completes the proof that eventually almost surely most of the prefixes cannot be too short.log Bk (x) = h.8 For almost every x there is an integer N(x) such that if n > N(x) then for all but at most 2 6 n indices i E [0. Lemma 11. x) E U m rk ( .5. 1 lim — log Fk (x) = h. each W(i. Thus if n > N(x) is sufficiently large and S is sufficiently small there will be at most cnI2 good indices i for which L'i'(x) < (1— €)(log n)I h. n. Since E (3 ).SECTION 11. Section 1. For a given sequence x and integer n.) These limit results imply that .n. for any J > M. there is an M such that the set 4 4 G(M) = ix: 4 E 7i.x) appears at least twice. fix c > O. 161 _ 2k(h+5) and so that I2k(6)1 < E Tk (6 ). The ergodic theorem implies that. k—>cc k since the reversed process is ergodic and has the same entropy as the original process.(8). This result is summarized in the following "almost uniform eventual typicality" form. (See Exercise 7. it can be supposed that if n > N(x) then the set of non-good indices will have cardinality at most 3 6 n. there are at most (constant) 2 J (h+) good indices i for which M < L < J. ENTROPY AND RECURRENCE TIMES. will be established by using the fact that.

In summary. if i E Uk (x. or i + k < n. that is. n. II. This completes the proof of the prefix-tree theorem. by the definition of W (i. L7 > (1+ e) log n/ h. code the second occurrence of the longest repeated block by telling its length and where it was seen earlier. namely. then I U (x .2 Proof of the finite-energy theorem. this means that Bk(T i+k x) < n. x V B. Condition (b) implies that L7 -1 > k. By making n larger. there is an index j E [0. and Tx E B. which means that 7 k) x E B. n . The key to making this method yield the desired bound is Barron's almost-sure code-length bound. Thus. for each w. Let k be the least integer that exceeds (1 + e/2) log n/ h.5. there is a K. and two sets F and B. and use an optimal code on the rest of the sequence. n)I < en. and hence. which asserts that no prefix code can asymptotically beat the Shannon code by more than 210g n. An upper bound on the length of the longest repeated block is obtained by comparing this code with an optimal code. n).u(x'11 ) -21ogn. The ergodic theorem implies that for almost every x there is an integer N(x) such that if n> N(x) then Tx E F. In this case. eventually almost surely. If i +k < n. (b) k +1 < (1+ c)log n/ h. x F.1].13.n) then T i E F. such that j i and j+k -1 =x me two cases i < j and i > j will be discussed separately. n. A Shannon code for a measure p u is a prefix code such that E(w) = log (w)1. ENTROPY-RELATED PROPERTIES. n). so that 1 logn . Case 2. x) as the shortest prefix of Tx that is not a prefix of any other Ti x.1].log Fk(T i x) < — < h(1 + E/4) -1 . 7x E B. A version of Barron's bound sufficient for the current result asserts that for any prefix-code sequence {C}. Fix such an x and assume n > N(x).7. each of measure at most en/4. (a) K < (1+ 6/2) log n/ h. it can be supposed that both of the following hold. (10) £(C(4)) + log .162 CHAPTER II. for j < n. that is. Fk (T ix) < n. see Remark 1. then log Fk (x) > kh(1 + /4) -1 . Case 1. . if necessary. i < j. Assume i E U(x.a. if n also satisfies k < en/3. k which means that Tx E F. and log Bk (x) > kh(1+ c /4) -1 . j < i. such that if k > K. for at most 6n/3 indices i E [0. the k-block starting at i recurs in the future within the next n steps. The idea of the proof is suggested by earlier code constructions. Lemma 5. for at most en/3 indices i E [0. Here "optimal code" means Shannon code.

yields to which an application of the finite energy assumption p(x:+ L(4) 5 < . total code length satisfies 3 log n + log 1 1 + log p(xl) kt(4A-L+11xi+L)' where the extra o(logn) terms required by the Elias codes and the extra bits needed to roundup the logarithms to integers are ignored. 3 log n + log p(x t + + IL lxi) < Kc L .5. as they are asymptotically negligible. there is a measurable shift invariant function F: {0. (a) If y = F(x) then. such that yi+i = 0.u(fx: xo (F(x)01) < E.3 A counterexample to the mean conjecture. The final block xtn+L±I is encoded using a Shannon code with respect to the conditional measure Ix ti +L ). given by unbiased coin-tossing. the shift-invariant measure defined by requiring that p(aii) = 2. almost surely. t 3 log n + log p(xt+1 s Thus Barron's lemma gives 1 lxf)> —2 log n. The block xtr+t is encoded by specifying s and L. 163 The details of the coding idea can be filled in as follows. o F -1 will then satisfy the conditions of Theorem 11. then using a Shannon code with respect to the given measure A. the probability it (4) is factored to produce log (4) = log ti ) log A(xt v. 1} z .5. that is. Suppose a block of length L occurs at position s.a. Since t. that is. 5/4 so that Ei<n. (7) holds. a prefix code for which t(e(n)) = log n + o(log n). ENTROPY AND RECURRENCE TIMES. and L must be encoded. The prefix property is guaranteed by encoding the lengths with the Elias code e on the natural numbers.5. L7(x) > nk and hence. The block xi is encoded by specifying its length t.SECTION 11. 1} z and an increasing sequence {n k } of positive integers with the following properties.6. 0 < j < n k 3 I4 . s. with probability greater than 1 — 2-k . This is now added to the code length (11) to yield t-I-L . (b) . It will be shown that given E > 0. li m _sxu. Let it denote the measure on two-sided binary sequences {0. since c <1. there is an i in the range 0 < i < nk — n k 314 . II. and then again at position t 1 > s in the sequence xÇ. To compare code length with the length of the Shannon code of x with respect to it. for property (a) guarantees that eventually almost surely L7k (x) > n:I8 for at least 4/8 indices i < nk . .p — log c n log n This completes the proof of the finite-energy theorem. x ts 0+ log 2(4+L+1 1xi+L ). 1} z {0. The process v = p. for each n and each a E An.
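The Elias prefix code on the natural numbers used above can be realized in several standard ways; the sketch below (illustrative Python, not taken from the text) gives the usual gamma and delta constructions, the second of which has length log n + o(log n), as required.

```python
import math

def elias_gamma(n):
    """Elias gamma code: floor(log2 n) zeros followed by n in binary;
    length is about 2 log2 n."""
    assert n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    """Elias delta code: gamma-code the bit-length of n, then the bits of n
    after its leading 1; length is log2 n + O(log log n) = log n + o(log n)."""
    assert n >= 1
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

for n in (5, 100, 10 ** 6):
    print(n, len(elias_delta(n)), round(math.log2(n), 2))
```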

The entropy of ν can be as close to log 2 as desired. Property (b) is not really needed, but it is added to emphasize that such a ν can be produced by making an arbitrarily small limiting density of changes in the sample paths of the i.i.d. process μ. The coding F can be constructed by only a minor modification of the coding used in the string-matching example, discussed in Section I.8 as an illustration of the block-to-stationary coding method. In that example, blocks of 0's of length 2n_k^{1/2} were created; replacing 2n_k^{1/2} by n_k^{3/4} will produce the coding F needed here.

Remark II.5.9
The prefix tree theorem was proved in [72]; after its discovery it was learned that the theorem was proved using a different method in [31], which also contains the counterexample construction given here. The asymptotics of the length of the longest and shortest word in the Grassberger tree are discussed in [82] for processes satisfying memory decay conditions. The coding proof of the finite-energy theorem is new. Another proof, due to Wyner, that r ≥ h is based on the coding idea developed in the first exercise below.

II.5.b Exercises.

1. A function f: A^n_{-∞} → B* will be said to be conditionally invertible if x^n_{-∞} = y^n_{-∞} whenever f(x^n_{-∞}) = f(y^n_{-∞}) and x^0_{-∞} = y^0_{-∞}. Fix a conditionally invertible f and an ergodic measure μ of entropy h.

(a) Show that lim inf_n ℓ(f(x^n_{-∞}))/n ≥ h, almost surely. (Hint: let B_n = {x^n_{-∞}: μ(x_1^n | x^0_{-∞}) ≥ 2^{-n(H-ε)}} and C_n = {x^n_{-∞}: ℓ(f(x^n_{-∞})) ≤ n(H-2ε)}, where H is the entropy-rate of μ, and show that Σ_n μ(B_n ∩ C_n | x^0_{-∞}) is finite.)

(b) Deduce that the entropy of a process and its reversed process are the same.

(c) Define f(x^n_{-∞}) to be the minimum m ≥ n for which x^{-m+n}_{-m+1} = x^n_1, and apply the preceding result.

(d) The preceding argument establishes the recurrence result for the reversed process. Show that this implies the result for μ.

2. Show that dithering an ergodic process produces a finite-energy process.

3. Show that if L(x_1^n) = O(log n), eventually almost surely, then max_i L_i^n(x) = O(log n), eventually almost surely. (Hint: L(x_1^{4n}) ≤ K log n implies max_i L_i^n(x) ≤ K log n.)
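A brute-force sketch of the prefix-tree depths L_i^n(x) appearing in Exercise 3 is given below (illustrative Python; working with a finite path, so that a "prefix" near the end of x may simply run off the end, is an assumption of the sketch, not of the text).

```python
import math
import random

def prefix_depths(x):
    """L_i = length of the shortest prefix of x[i:] that is a prefix of no
    other x[j:], j != i.  Brute force, O(n^2) comparisons, for illustration."""
    n = len(x)
    suffixes = [tuple(x[i:]) for i in range(n)]
    depths = []
    for i in range(n):
        L = 1
        while any(j != i and suffixes[j][:L] == suffixes[i][:L] for j in range(n)):
            L += 1
        depths.append(L)
    return depths

random.seed(1)
n = 200
x = [random.randint(0, 1) for _ in range(n)]      # unbiased coin, h = 1 bit
depths = prefix_depths(x)
print(sum(depths) / n, math.log2(n))              # typical depth vs (log2 n)/h
```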

Chapter III

Entropy for restricted classes

Section III.1 Rates of convergence.

In this section μ denotes a stationary, ergodic process with alphabet A. The k-th order empirical distribution, or k-type, is the measure p_k = p_k(·|x_1^n) on A^k defined by

p_k(a_1^k|x_1^n) = |{i ∈ [1, n−k+1]: x_i^{i+k−1} = a_1^k}| / (n − k + 1),  a_1^k ∈ A^k.

The distance between μ_k and p_k will be measured using the distributional (variational) distance,

|μ_k − p_k| = Σ_{a_1^k ∈ A^k} |μ_k(a_1^k) − p_k(a_1^k|x_1^n)|,

with μ_k denoting the projection onto A^k defined by μ_k(a_1^k) = μ({x: x_1^k = a_1^k}).

The ergodic theorem implies that if k and ε are fixed then

μ_n({x_1^n: |μ_k − p_k(·|x_1^n)| ≥ ε}) → 0, as n → ∞.

Likewise, the entropy theorem implies that

μ_n({x_1^n: 2^{−n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{−n(h−ε)}}) → 1, as n → ∞.

The rate of convergence problem is to determine the rates at which these convergences take place. Many of the processes studied in classical probability theory, such as i.i.d. processes, Markov processes, and finite-state processes, have exponential rates of convergence for frequencies of all orders and for entropy, but some standard processes, such as renewal processes, do not have exponential rates. In the general ergodic case, no uniform rate is possible, that is, given any convergence rate for frequencies or for entropy, there is an ergodic process that does not satisfy the rate. These results will be developed in this section.

A mapping r_k: R^+ × Z^+ → R^+, where R^+ denotes the positive reals and Z^+ the positive integers, is called a rate function for frequencies for a process μ if for each fixed k and ε > 0,

μ_n({x_1^n: |μ_k − p_k(·|x_1^n)| ≥ ε}) ≤ r_k(ε, n).
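The empirical k-block distribution and the variational distance just defined can be computed directly from a sample path; a minimal sketch in Python (the function names are illustrative, not from the text):

```python
from collections import Counter
from itertools import product
import random

def empirical_k_blocks(x, k):
    """Overlapping empirical k-block distribution p_k(.|x_1^n): the fraction
    of the n-k+1 windows x_i ... x_{i+k-1} equal to each k-block."""
    total = len(x) - k + 1
    counts = Counter(tuple(x[i:i + k]) for i in range(total))
    return {block: c / total for block, c in counts.items()}

def variational_distance(pk, mu_k):
    """|mu_k - p_k| = sum over all k-blocks of |mu_k(a_1^k) - p_k(a_1^k|x_1^n)|."""
    blocks = set(pk) | set(mu_k)
    return sum(abs(mu_k.get(b, 0.0) - pk.get(b, 0.0)) for b in blocks)

random.seed(0)
x = [random.randint(0, 1) for _ in range(10_000)]         # unbiased coin
k = 3
pk = empirical_k_blocks(x, k)
mu_k = {b: 2.0 ** -k for b in product((0, 1), repeat=k)}  # true k-block measure
print(variational_distance(pk, mu_k))
```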

An ergodic process is said to have exponential rates for frequencies if there is a rate function such that for each ε > 0 and k, r_k(ε, n) → 0 as n → ∞ and (−1/n) log r_k(ε, n) is bounded away from 0. (The determination of the best value of the exponent is known as "large deviations" theory, [22, 58, 62]; while ideas from that theory will be used, it will not be the focus of interest here.) A mapping r: R^+ × Z^+ → R^+ is called a rate function for entropy for a process μ of entropy h if for each ε > 0,

μ_n({x_1^n: 2^{−n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{−n(h−ε)}}) ≥ 1 − r(ε, n).

An ergodic process has exponential rates for entropy if it has a rate function for entropy such that for each ε > 0, r(ε, n) → 0 as n → ∞ and (−1/n) log r(ε, n) is bounded away from 0.

Remark III.1.1
An unrelated concept with a similar name, called speed of convergence, is concerned with what happens in the ergodic theorem when (1/n) Σ_{i<n} f(T^i x) is replaced by (1/a_n) Σ_{i<n} f(T^i x), for some unbounded, nondecreasing sequence {a_n}. See [32] for a discussion of this topic.

In this section the following theorem will be proved.

Theorem III.1.2 (Exponential rates for i.i.d. processes.)
If μ is i.i.d., then it has exponential rates for frequencies and for entropy.

Exponential rates of convergence for i.i.d. processes will be established in the next subsection, then extended to Markov and other processes that have appropriate asymptotic independence properties. Examples of processes without exponential rates, and even more general rates, will be discussed in the final subsection. The theorem is proved by establishing the first-order case. The extension to k-th order frequencies reduces the overlapping-block problem to the nonoverlapping-block problem, which is treated by applying the first-order result to the larger alphabet of blocks. Exponential rates for entropy follow immediately from the first-order frequency theorem, for the fact that μ is a product measure implies that μ(x_1^n) is a continuous function of p_1(·|x_1^n).

III.1.a Exponential rates for i.i.d. processes.

The first-order theorem is a consequence of the bound given in the following lemma, which is, in turn, a combination of a large deviations bound due to Hoeffding and Sanov and an inequality of Pinsker.

Lemma III.1.3 (First-order rate bound.)
There is a positive constant c such that

(1) μ_n({x_1^n: |p_1(·|x_1^n) − μ_1| ≥ ε}) ≤ (n + 1)^{|A|} 2^{−ncε²},

for any finite set A, for any i.i.d. process μ with alphabet A, for any ε > 0, and for any n.

Proof. A direct calculation, using the fact that μ is a product measure, produces

(2) μ(x_1^n) = Π_{a∈A} μ(a)^{n p_1(a|x_1^n)} = 2^{−n(H(p_1)+D(p_1‖μ_1))},

where H(p_1) = H(p_1(·|x_1^n)) = −Σ_a p_1(a|x_1^n) log p_1(a|x_1^n) is the entropy of the empirical 1-block distribution, and D(p_1‖μ_1) = Σ_a p_1(a|x_1^n) log[p_1(a|x_1^n)/μ_1(a)] is the divergence of p_1(·|x_1^n) relative to μ_1.
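The identity (2) is exact and easy to check numerically; a small sketch (the three-letter i.i.d. measure is chosen only for illustration):

```python
import math
import random

# Numerical check of the identity (2): for a product measure mu and any sample
# path, mu(x_1^n) = 2^{-n(H(p_1) + D(p_1 || mu_1))} exactly, where p_1 is the
# empirical 1-block distribution of the path.
random.seed(1)
mu1 = {"a": 0.5, "b": 0.3, "c": 0.2}
n = 1000
x = random.choices(list(mu1), weights=list(mu1.values()), k=n)

p1 = {a: x.count(a) / n for a in mu1}
H = -sum(p * math.log2(p) for p in p1.values() if p > 0)
D = sum(p * math.log2(p / mu1[a]) for a, p in p1.items() if p > 0)
log_mu = sum(math.log2(mu1[a]) for a in x)   # log2 of the product measure of x
print(log_mu, -n * (H + D))                  # agree up to floating-point error
```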

The desired result will follow from an application of the theory of type classes discussed in Section I.6. The type class of x_1^n is the set

T(x_1^n) = {y_1^n: p_1(·|y_1^n) = p_1(·|x_1^n)}.

The two key facts, see Section I.6, are

(a) |T(x_1^n)| ≤ 2^{nH(p_1)}.

(b) There are at most (n + 1)^{|A|} type classes.

Fact (a), in conjunction with the product formula (2), produces the bound

(3) μ(T(x_1^n)) ≤ 2^{−nD(p_1‖μ_1)}.

The bad set B(n, ε) = {x_1^n: |p_1(·|x_1^n) − μ_1| ≥ ε} can be partitioned into disjoint sets of the form B(n, ε) ∩ T(x_1^n), hence the bound (3) on the measure of the type class, together with fact (b), produces the bound

μ(B(n, ε)) ≤ (n + 1)^{|A|} 2^{−nD*},

where D* = min{D(p_1(·|y_1^n)‖μ_1): |p_1(·|y_1^n) − μ_1| ≥ ε}. Since D(P‖Q) ≥ |P − Q|²/(2 ln 2) always holds, see Pinsker's inequality, Exercise 6, Section I.6, it follows that D* ≥ ε²/(2 ln 2), which completes the proof of Lemma III.1.3.

Lemma III.1.3 gives the desired exponential-rate theorem for first-order frequencies, since (n + 1)^{|A|} 2^{−ncε²} = exp(−n(cε² − δ_n)), where δ_n → 0 as n → ∞.

The extension to k-th order frequencies is carried out by reducing the overlapping-block problem to k separate nonoverlapping-block problems. To assist in this task some notation and terminology will be developed. For n ≥ 2k define integers t = t(n, k) and r ∈ [0, k − 1] such that n = tk + k + r. For each s ∈ [0, k − 1], the sequence x_1^n can be expressed as the concatenation

(4) x_1^n = w(0)w(1)···w(t)w(t + 1),

where the first word w(0) has length s, the next t words, w(1), ..., w(t), all have length k, and the final word w(t + 1) has length n − tk − s. This is called the s-shifted, k-block parsing of x_1^n. The sequences x_1^n and y_1^n are said to be (s, k)-equivalent if their s-shifted, k-block parsings have the same (ordered) set of k-blocks {w(1), ..., w(t)}.
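A minimal sketch of the s-shifted nonoverlapping parsing (illustrative Python; the block count may differ by one from the t(n, k) above), together with a numerical check that the overlapping distribution is, up to end effects, the average of the shifted nonoverlapping ones:

```python
from collections import Counter
import random

def shifted_nonoverlapping(x, k, s):
    """s-shifted nonoverlapping empirical k-block distribution: skip the first
    s symbols, cut the remainder into consecutive k-blocks, and count only the
    full blocks."""
    blocks = [tuple(x[i:i + k]) for i in range(s, len(x) - k + 1, k)]
    counts = Counter(blocks)
    return {b: c / len(blocks) for b, c in counts.items()}

# Up to end effects, the overlapping p_k is the average over s of the p_k^s.
random.seed(3)
x = [random.randint(0, 1) for _ in range(100_000)]
k = 4
block = (0, 1, 1, 0)
overlapping = sum(tuple(x[i:i + k]) == block
                  for i in range(len(x) - k + 1)) / (len(x) - k + 1)
averaged = sum(shifted_nonoverlapping(x, k, s).get(block, 0.0)
               for s in range(k)) / k
print(round(overlapping, 5), round(averaged, 5))
```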

If k and c > 0 are fixed. . w(t)} of k-blocks. 4-k +Ft k = ps k(• 1.el') is the The s-shifted (nonoverlapping) empirical k-block distribution ps A k defined by distribution on l) = ps.168 CHAPTER III. (5). then provides the desired k-block bound (9) p. constant on the (s. Lemma 111.(41x Pil ) — ps k (4 E [1. The containment relation. however. hence the bound decays exponentially in n.1. applied to the measure y and super-alphabet B = lAlk . is a product measure implies (6) (Sk(x riz . is upper bounded by (8) (t 1) 1A1 '2'" 2 /4 . and fix s E [0. k-1 (5) 14: Ipk( 14) — iLk1 el g U{4: lp.. i yss+t +k . so that if B denotes A" and y denotes the product measure on 13' defined by the formula v(wD = then /1(w i ). The latter. ENTROPY FOR RESTRICTED CLASSES. s).4 (Overlapping to nonoverlapping lemma. Lemma 111. the logarithm of the right-hand side is asymptotic to —tcE 2 14n —c€ 2 /4k.(fx: 1Pic (' 14) — e/2})y ({U/1: IwD — > e/21). k — 1] such that Ips k ( 14) — .c e S=0 — kl Fix such a k and n. Sk(x . s). of course. . s) is the (s. s-shifted. It is.1.2. WI E B. k)-equivalence class of x is denoted by Sk (fi'. This completes the proof of Theorem 111. by the first-order result. t]: w(j) = a}1 where x is given by (4). k)-equivalence class k. 04: IPke — itkI > El) < k(t + 0 1 2-46'6214 .1. k)-equivalence class of xri'. a lemma. The fact that p.3.) Given e > 0 there is a y > 0 such that if kln < y and Ipk( 14) — I > E then there is an s E [0.ukl> Proof Left to the reader. The overlapping-block measure Pk is (almost) an average of the measures ps result summarized as the following where "almost" is needed to account for end effects. (7) p. If k is fixed and n is large enough the lemma yields the containment relation. s)) 1=1 where Sk(x . that is. k — 1]. k-block parsings have the same (ordered) set {w(1). i The (s.

where i/î (g)= maxmax a Mg ab (b) • b The function *(g) —> 1. For n > 2k + g.6 An aperiodic Markov chain is ip-mixing.b The Markov and related cases.d. LI The next task is to prove Theorem 111. with Ili(g) 1. Proof A simple modification of the i. Section 1. Fix a gap g. This proves the lemma. First it will be shown that aperiodic Markov chains satisfy a strong mixing condition called *-mixing.SECTION 111. is an ergodic Markov source. A process is 11 f-mixing if there is a nonincreasing sequence {*(g)} such that *(g) 1 as g —± oc. g) such that (10) n=t(k+g)±(k+g)-Er.l. as g —> oc.1. w E A*. since the aperiodicity assumption implies that iti`cfb 7r(b). 169 III. and such that .1. The formula (10).0_<r<k-i-g. The *-mixing concept is stronger than ordinary mixing. Then it will be shown that -*-mixing processes have exponential rates for both frequencies and entropy. define t = t (n. An i.u(u). .u(uvw) _<111(g)A(u). RATES OF CONVERGENCE. is all that is needed to establish this lemma.(4) is a continuous function of pi e 14) and p2( • 14). process is *-mixing.) If p.u(w). since p. u E Ag. The entropy part follows immediately from the second-order frequency result.2. Lemma 111.1. The theorem for frequencies is proved in stages. Exponential rates for Markov chains will be established in this section. and let b be the first symbol of w.1. which allows the length of the gap to depend on the length of the past and future. Fix positive integers k and g. It requires only a bit more effort to obtain the mixing Markov result. let a be the final symbol of u. proof. so as to allow gaps between blocks.u(w). Theorem 111. it will be shown that periodic. irreducible Markov chains have exponential rates. then it has exponential rates for frequencies and for entropy.i.i. u.5 (Exponential rates for Markov sources. for computing Markov probabilities yields E w uvw ) t(v)=g mg 7r(b) The right-hand side is upper bounded by *(g). Proof Let M be the transition matrix and 7r(-) the stationary vector for iii. Finally. k.d.7 If kt is ilf-mixing then it has exponential rates for frequencies of all orders.
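For an explicit aperiodic chain the quantity ψ(g) = max_{a,b} M^g_{ab}/π(b) can be computed directly and its convergence to 1 observed; a short sketch (the transition matrix is arbitrary, chosen only for illustration):

```python
import numpy as np

# psi(g) = max_{a,b} M^g(a,b) / pi(b) for an aperiodic Markov chain; since
# M^g(a,b) -> pi(b), psi(g) -> 1 as the gap g grows.
M = np.array([[0.9, 0.1],
              [0.2, 0.8]])

eigvals, eigvecs = np.linalg.eig(M.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()                       # stationary vector: pi M = pi

for g in (1, 2, 5, 10, 20):
    Mg = np.linalg.matrix_power(M, g)    # g-step transition probabilities
    print(g, round(float((Mg / pi[np.newaxis, :]).max()), 4))
```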

and Lemma 111. of course. . .1. hence the bound decays exponentially. however.g (c4) = pisc. This is called the the s-shifted. where w(0) has length s < k + g.8 holds.1. the sequence 4 can be expressed as the concatenation (11) x = w(0)g(1)w(1)g(2)w(2). k-block parsings have the same (ordered) set {w(1). .4 ( 14) — itkl 612. The overlappings . the g(j) have length g and alternate with the k-blocks w(j). Completion of proof of Theorem 111. and if pke > E. The upper bound (8) on Ips k ( 14) —lkI€121) is replaced by the upper bound (13) [i (g)]t(t 1)1111 4 C E 2 I4 and the final upper bound (9) on the probability p. with the product measure bound (6) replaced by (12) ian (Sk (4.g . k + g — 1]. g)• The s-shifted (nonoverlapping) empirical k-block distribution. for 1 < j < t. g)) kii(g)i t i=1 ktk(w(i)). (14: Ipke placed by the bound (k + g)[*(g)]t (t + 010 2 -tc€ 2 14 (14) — /41 ED is re- assuming. and establishes the exponential rates theorem for aperiodic Markov chains. k-block parsing of x.1.) Given E> 0 and g. k. ENTROPY FOR RESTRICTED CLASSES. such that if k n < y. s. with gap g. k.1. The (s. Lemma 111. with gap g.. where the w(j) are given by (11). The i.1. The sequences 4 and are said to be (s.1. For each s E [0.g(alnxii). g). g)-equivalent if their s-shifted. of course. then there is an s E [0.i. with gap g. k + g — 1] such that — 114. that k is enough larger than g and n enough larger than k to guarantee that Lemma 111. there is a y > 0 and a K > 0. s. The logarithm of the bound in (14) is then asymptotically at most —tce 2 18n —c€ 2 /8(k + g).170 CHAPTER III.d. g)-equivalence class of 4 is denoted by Sk(xPil . is the distribution on Ak defined by P.7. k-block parsing of 4. k. and the final block w(t + 1) has length n — t (k + g) — s < 2(k + g). proof adapts to the Vf-mixing case.4 easily extends to the following. t]: w(j) = 411 It is.. Since it is enough to prove exponential rates for k-th order frequencies for large k. The proof of Theorem 111.g(t)w(t)w(t + 1). if k > K. constant on the (s. where "almost" is now block measure Pk is (almost) an average of the measures p k gaps. the s-shifted. If k is large relative to gap needed to account both for end effects and for the length and n is large relative to k. provided only that k is large enough. this is no problem.8 (Overlapping to nonoverlapping with gaps. w(t)} of k-blocks. this completes the proof of Theorem 111. E [1. s. g)-equivalence class Sk (4. The following lemma removes the aperiodicity requirement in the Markov case. g-gapped.7.7 is completed by using the 1i-mixing property to choose g so large that (g) < 2 218 .

such that Prob(X n+ 1 E Cs91lXn E C s = 1. ) where @ denotes addition mod d. x C(s @ d — 1).9 An ergodic Markov chain has exponential rates for frequencies of all orders. d] by putting c(a) = s.1. Thus.g ( ! XIII ) — 14c(xs)) 1 > 6 / 2 1.5. III. it follows that if k I g is large enough then Since . and.1. Cd. Let A be an ergodic Markov chain with period d > 1. It can also be assumed that k is divisible by d. Examples of renewal processes without exponential rates for frequencies and entropy are not hard to construct.. IL({xi: IPk — iLk I E}) > 6/2 } separately. k + g — 1]. . C1. Define the function c: A i— [1.oc.1.10 Let n i— r(n) be a positive decreasing function with limit O. unless c(ai) = c(xs ). There is a binary ergodic process it and an integer N such that An (1-112 : 'Pi (• Ix) — Ail 1/2» ?. and partition A into the (periodic) classes. Also let pP) denote the measure A conditioned on Xi E C5 . . see Exercise 4. g = ps k . g (• WO satisfies the For s E [0.. k . so the previous theory applies to each set {xii': ' K g — I (c(xs )) I as before. A simple cutting and stacking procedure which produces counterexamples for general rate functions will be presented. The measure . because it illustrates several useful cutting and stacking ideas. because it is conceptually quite simple. . which can be assumed to be divisible by d and small relative to k. which has no effect on the asymptotics. and thereby completing the CI proof of the exponential-rates theorem for Markov chains. in part. ?_ decays exponentially as n -.l. however. _ r(n). 171 Proof The only new case is the periodic case. the nonoverlapping block measure ps condition p g (4) = 0. if a E C.SECTION 111. establishing the lemma.c Counterexamples.u (s) is. Theorem 111. Lemma 111. The following theorem summarizes the goal. RATES OF CONVERGENCE. n > N.u k is an average of the bt k k+g-1 fx rI : IPk( ifil ) — Ad > El C U VI I: s=0 IPsk. Theorem 111.1. an aperiodic Markov measure with state space C(s) x C(s @ 1) x • • . for k can always be increased or decreased by no more than d to achieve this. in part. 1 < s < d. C2. Let g be a gap length. (s) .

u n ({xr: pi(114) = 11) (1 — -11 L i z ) /3. to implement both of these goals. i (1) = 1/2. . ENTROPY FOR RESTRICTED CLASSES. for 1 < j < in. n > 1. /31(114) = 1. As an example of what it means to "look like one process on part of the space and something else on the other part of the space. = (y +V)/2.u.u look like one process on part of the space and something else on the other part of the space. in part. Let m fi(m) be a nonincreasing function with limit 0.n 1. then (141 : 1/31( Ixr) — I ?_ 112» ?_ (1 _ so that if 1/2> fi > r(m) and L is large enough then (16) /2. mixing the first part into the second part so slowly that the rate of convergence is as large as desired.„ ({x. that is. and to mix one part slowly into another. Each S(m) will contain a column . then for any n. then it is certain that xi = 1.u were the average of two processes. no matter what the complement of C looks like. Furthermore. long enough to guarantee that (16) will indeed hold with m replaced by ni + 1. If x is a point in some level of C below the top ni levels. for all m > 1. The construction will be described inductively in terms of two auxiliary unbounded sequences.n: — I > 1/2)) > r(m). for all m. and hence (15) . it can be supposed that r(m) < 1/2. Counterexamples for a given rate would be easy to construct if ergodicity were not required. if . Mixing is done by applying repeated independent cutting and stacking to the union of the second column C2 with the complement of C. As long as enough independent cutting and stacking is done at each stage and all the mass is moved in the limit to the second part. so that A n (14: i pi( 14) — I 1/21) = 1. so that. Ergodic counterexamples are constructed by making . The mixing of "one part slowly into another" is accomplished by cutting C into two columns. Without loss of generality. according to whether x`iz consists of l's or O's. {41 } and fr. For example. pi(11x) is either 1 or 0. such that /3(1) = 1/2 and r(m) < f3(m) < 1/2.172 CHAPTER III." suppose kt is defined by a (complete) sequence of column structures. The details of the above outline will now be given. if C has height L and measure fi. to make a process look like another process on part of the space. The cutting and stacking method was designed. In particular. say C1 and C2. where y is concentrated on the sequence of all l's and iY is concentrated on the sequence of all O's. Suppose some column C at some stage has all of its levels labeled '1'. while . then the chance that a randomly selected x lies in the first L — ni levels of C is (1 — m/L)p. of positive integers which will be specified later. for it can then be cut and stacked into a much longer column.u. The bound (16) can be achieved with ni replaced by ni + 1 by making sure that the first column CI has measure slightly more than r(m 1). if exactly half the measure of 8(1) is concentrated on intervals labeled '1'. the final process will be ergodic and will satisfy (16) for all m for which r(m) < 1/2. The desired (complete) sequence {S(m)} will now be defined.

-mixing process has exponential rates for entropy. 2) has measure fi(m —1) — )6(m). First C(m — 1) is cut into two columns. ni > 1. labeled '1'. for this is all that is needed to make sure that m ttn({x. to guarantee that (17) holds. C(m — 1. The sequence {r. 173 C(m). Since the measure of R. . has exponential rates for frequencies and entropy. 2).) (c) Show that a 0. and rm the final process /1 is ergodic and satisfies the desired condition (17) An (Ix'i'I : Ipi(• 14)— Ail a.10. 2.(m) goes to 1. 2) U TZ. of height 4. 0 111.(4) < 2 -n (h+E) } goes to 0 exponentially fast. 8(m) is constructed as follows. and the total measure of the intervals of S(m) that are labeled with a '0' is 1/2. (b) Show that the measure of {xii: p. . 1.7. where C(m — 1.SECTION III. (Hint: as in the proof of the entropy theorem it is highly unlikely that a sequence can be mostly packed by 'Tk and have too-small measure. and measure 13(m).11. The condition (17) is guaranteed by choosing 4.10. The column C(m —1. sequentially. so large that (1 — m/t n1 )/3(m) > r(m). together with a column structure R.10. (a) Show that y has exponential rates for frequencies.(m). which are then stacked to obtain C(m). so the sequence {8(m)} is complete and hence defines a process A. and hence the final process must satisfy t 1 (l) = A i (0) = 1/2. and let R. and since /3(m) -÷ 0. 1) has measure /3(m) and C(m — 1. Show that the measure of the set of n-sequences that are not almost strongly-packed by 'Tk goes to 0 exponentially fast. 1) and C(m — 1. the total width of 8(m) goes to 0. To get started. For ni > 1. All that remains to be shown is that for suitable choice of 4.: Ipi(. then it can be combined with the fact that l's and O's are equally likely and the assumption that 1/2 > 8(m) > r (m). Let y be a finite stationary coding of A and assume /2. 14)— Ail a 1/2 }) a (1 — — ) /3(m). This guarantees that at all subsequent stages the total measure of the intervals of 8(m) that are labeled with a '1' is 1/2.1/20 > r (m). by Theorem 1. Suppose A has exponential rates for frequencies. RATES OF CONVERGENCE.1 . labeled '0'. let C(1) consist of one interval of length 1/2. Since rm —> oc.n } is chosen. This completes the proof of Theorem 111. 1) is cut into t ni g ni -1 columns of equal width. 2) U 7Z(m — 1). This is possible by Theorem 1.(1) consist of one interval of length 1/2.d Exercises. this guarantees that the final process is ergodic. tm holds. The new remainder R(m) is obtained by applying rni -fold independent cutting and stacking to C(m — 1. disjoint from C(m). (a) Choose k such that Tk = {x: i(x) ) 2h j has measure close to 1.(m — 1) and 1?-(m) are (1 — 2')-independent.1. so that C(m — 1. all of whose entries are labeled '1'. 1. Once this holds.

2. conditional k-type too much larger than 2 —k(h(4)—h(v) and measure too much larger than 2-PP). see Theorem 111. below. distributional. In particular. a fact guaranteed by the ergodic theorem. The problem addressed here is whether it is possible to make a universal choice of {k(n)) for "nice" classes of processes. such as data compression. Thus it would be desirable to have consistency results for the case when the block length function k = k(n) grows as rapidly as possible. distance. Exercise 4 shows that renewal processes are not *-mixing. It is also can be shown that for any sequence k(n) -± oc there is an ergodic measure for which {k(n)} is not admissible. for if k is large then. that is.d.i. (Hint: make sure that the expected recurrence-time series converges slowly. by the ergodic theorem. there is no hope that the empirical k-block distribution will be close to the true distribution. Here is where entropy enters the picture. let plc ( 14) denote the empirical distribution of overlapping k-blocks in the sequence _el'. independently drawn sample paths. where I • I denotes variational. The kth-order joint distribution for an ergodic finite-alphabet process can be estimated from a sample path of length n by sliding a window of length k along the sample path and counting frequencies of k-blocks. that is. equivalently. finite consistency of sample paths. As before. such as the i. k(n) > (logn)I(h — 6). after which the system is run on other. processes or Markov chains. s. This is the problem addressed in this section. The definition of admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. For example. that is. ENTROPY FOR RESTRICTED CLASSES. (Hint: bound the number of n-sequences that can have good k-block frequencies. to design engineering systems.174 CHAPTER III. with high probability. Establish this by directly constructing a renewal process that is not lit-mixing. Consistent estimation also may not be possible for the choice k(n) (log n)I h. Use cutting and stacking to construct a process with exponential rates for entropy but not for frequencies. If k is fixed the procedure is almost surely consistent. the probability of a k-block will be roughly 2 -kh . the resulting empirical k-block distribution almost surely converges to the true distribution of k-blocks as n oc. a.2 Entropy and joint distributions. in . The empirical k-block distribution for a training sequence is used as the basis for design. Show that there exists a renewal process that does not have exponential rates for frequencies of order 1.1..) 3. Every ergodic process has an admissible sequence such that limp k(n) = oc. if k(n) > (1 + e)(logn)1 h. (b) Show that y has exponential rates for entropy. as a function of sample path length n. or. There are some situations. see Exercise 2. Section 111.) 5. where it is good to make the block length as long as possible. 4. A nondecreasing sequence {k(n)} will be said to be admissible for the ergodic measure bt if 11111 1Pk(n)(• 14) n-4co lik(n) = 0. The such estimates is important when using training sequences.

which will be defined carefully later. These applications.5.1 (The nonadmissibility theorem.2. Theorem 111. The Ornstein-Weiss results will be discussed in more detail in Section 111. The Vi-mixing concept was introduced in the preceding section. The admissibility results of obtained here can be used to prove stronger versions of their theorem. Theorem 111. who used the d-distance rather than the variational distance. for it is easy to see that. They showed that if 147.i. Theorem 111. and if k(n) > (log n)/(h — 6) then {k(n)} is not admissible in probability for p. (1/n) log 14/.3. for ergodic Markov chains. Thus the (log n)/(h 6).d. almost surely.5 A third motivation for the problem discussed here was a waiting-time result obtained by Wyner and Ziv.2. and k(n) < (log n)/(h 6) then {k(n)} is admissible for p.SECTION 111. see Exercise 3. along with various related results and counterexamples.) If p is i.4 A first motivation for the problem discussed in this section was the training sequence problem described in the opening paragraph.2. 175 the unbiased coin-tossing case when h = 1.2.) If p.d. process then the d-distance between the empirical k(n)-block distribution and the true k(n)-block distribution goes to 0. The weak Bernoulli result follows from an even more careful look at what is really used for the ifr-mixing proof.i. is ergodic with positive entropy h. The nonadmissibility theorem is a simple consequence of the fact that no more than 2 k(n)(h—E) sequences of length k can occur in a sequence of length n and hence there is no hope of seeing even a fraction of the full distribution. A second motivation was the desire to obtain a more classical version of the positive results of Ornstein and Weiss. Remark 111.2. and ifr-mixing cases is a consequence of a slight strengthening of the exponential rate bounds of the preceding section. y) is the waiting time until the first n terms of x appear in the sequence y then.(x. provided x and y are independently chosen sample paths.i.2. the choice k(n) — log n is not admissible. independent of the length of past and future blocks. and hence the results described here are a sharpening of the Ornstein-Weiss theorem for the case when k < (log n)/(h 6) and the process satisfies strong enough forms of asymptotic independence. is weak Bernoulli and k(n) < (log n)/(h 6) then {k(n)} is admissible for .u. requires that past and future blocks be almost independent if separated by a long enough gap.(x. an approximate (1 — e -1 ) fraction of the k-blocks will fail to appear in a given sample path of length n. if 0 < E < h. are presented in Section 111. or Ili-mixing.) If p.2 (The positive admissibility theorem. ENTROPY AND JOINT DISTRIBUTIONS. They showed. [52]. [86]. .3 (Weak Bernoulli admissibility. case of most interest is when k(n) The principle results are the following. Markov.. The positive admissibility result for the i. Remark 111. provided k(n) — (log n)/ h.d. y) converges in probability to h. that if the process is a stationary coding of an i. The weak Bernoulli property. in particular. with high probability. The d-distance is upper bounded by half the variational distance.
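The counting obstruction behind the nonadmissibility theorem is easy to see numerically: a path of length n contains at most n − k + 1 distinct k-blocks, far fewer than the roughly 2^{kh} blocks carrying the true distribution once k exceeds (log n)/h. A small sketch (unbiased coin, so h = 1; all parameters are illustrative):

```python
import math
import random

# A path of length n contains at most n - k + 1 distinct k-blocks, while the
# true k-block distribution of an entropy-h process is spread over roughly
# 2^{kh} blocks; once k exceeds (log2 n)/h most of that mass cannot be seen.
random.seed(2)
n = 2 ** 14
x = [random.randint(0, 1) for _ in range(n)]          # unbiased coin, h = 1
for k in (int(0.5 * math.log2(n)), int(1.5 * math.log2(n))):
    seen = len(set(tuple(x[i:i + k]) for i in range(n - k + 1)))
    print("k =", k, ":", seen, "distinct k-blocks out of", 2 ** k, "possible")
```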

for. The key to such results for other processes is a similar bound which holds when the full alphabet is replaced by a subset of large measure. This implies the nonadmissibility theorem.') n Tk ( 6/2)) 2 k(h-0 2-k(h-E/2) 2 -kE/2 . Next the positive admissibility theorem will be established. with finite alphabet A. process p. and hence I pk (. as the reader can show. whenever al' (1) Let 1 Pk( ' 14) — 1 111 ((lik(X)) c ) • Tk (6/2) = 14: p.k(h—e12) • The entropy theorem guarantees that . without loss of generality.176 CHAPTER III.14) — pk1 -± 1 for every x E A'. 1/31(. n — k +1]} . lik(x). so that { yn } is a binary i. Let p. be an ergodic process with entropy h and suppose k = k(n) > (log n)/(h — E). The rate bound. since any bounded sequence is admissible for any ergodic process. where 0 < E < h. see the bound (9) of Section MA. Lemma 111. so that . by the distance lower bound. since each member of Tk (E/2) has measure at most 2.) There is a positive constant C such that for any E > 0.d.(4) < 2-k(h-E/2)} The assumption n 2k (h -E ) implies that ILik(4)1 < 2k (h -o. is enough to prove the positive admissibility theorem for unbiased coin-tossing.k (Tk (e/2)) —> 1.a Proofs of admissibility and nonadmissibility.d. (2) (14 : 1Pke 14) — 141 ?. First note that. ED < k(t +1) 1A1k 2 -t cE214 . Proof Define yn = yn (x) — > 5E1 < 2(n ± 01/312-nCe2 =0 1 I if xn E B otherwise.2. by the ergodic theorem.u. process with Prob (yn = 1) < Let Cn = E yi > 2En} . it follows that Since pk(a fic ix) = 0. it can be supposed that {k(n)} is unbounded. First the nonadmissibility theorem will be proved. for any finite set A. E.i. ENTROPY FOR RESTRICTED CLASSES.uk ((tik(4)) c) also goes to 1. Define the empirical universe of k-blocks to be the set Lik (4) = x: +k-1 = a l% for some i E [1. for any n > 0. (1). where t nlk.i. the following holds.2. for any i. and for any B c A such that p(B) > 1 — É and I BI > 2. if n > IA I''. the error is summable in n.k (ték oc.6 (Extended first-order bound.. and hence Pk . III.

in1 ): m > 2en} have union Cn so that (3) yields the bound (4) (ii. Fix a set {i1. x i g B.d. . and hence.i. no) im }:nz>2En + 1) 22-cn4E2 . Let 8 be a given positive number.(B(i i . Assume . Also let ALB be the conditional distribution on B defined by . — p. the lemma follows.cn4E2 177 The idea now is to partition An according to the location of the indices i for which . let B(ii. -2. in. page 168. x im from x. ENTROPY AND JOINT DISTRIBUTIONS.kI ?3/21.AL B I --= 2_.n ) den. . The assumption that m < 26n the implies that the probability ({Xj1 E B(ii.i. by the Borel-Cantelli lemma. The proof of (6) starts with the overlapping to nonoverlapping relation derived in the preceding section. in. in. • • • .} with m < 2cn.) . . is upper bounded by tc(B(i i . bEB [ A\ The exponential bound (2) can now be applied with k = 1 to upper bound (5) by tt (B(i i. and apply the bound (2) to obtain (3) p.u i and let AB be the corresponding product measure on BS defined by ALB. and assume k(n) < (log n)/ (h 6). k-1 (7) 14: 1Pk( 14) — tuki 31 g s=o IPsk( . note the set of all 4 for which Xi E B if and only if i çz' {i1. im )) for which in < 26n cannot exceed 1. then extended to the *-mixing and periodic Markov cases. 14) — ?_ 31) < oc. combined with (4) and the assumption that I BI > 2. define ï = i(xil) to be the sequence of length s obtained by deleting xi„ xi2 . i 2 . i. im ): 'pi ( 14) . in turn. are disjoint for different { -1. upper bounded by (5) since E Bs: — 11 1.u(B)) bvB 26. For each m < n and 1 < j 1 < i2 < • • • < j The sets . put s = n — m. (14: ip(. im ))(s 1)1B12 2< ini mn 1)IBI2—n(1 —20c€ 2 The sum of the p.)) /1 B E B s : i Pi e liD — I > 3}).tti I > 56}) which is. which immediately implies the desired result. tt (b) /Ld o B) + E A(b) = 2(1 — . case. The positive admissibility result will be proved first for the i. see (5). the sets {B(ii. (Ca ) < (n 1)22 .2.d. with entropy h. and for xri' E B(ii.SECTION 111. • • • im) and have union A.BI > € 1) Ii ui .i is i. . It will be shown that (6) Ep. }. Furthermore.

indeed. with A replaced by A = A k and B replaced by T. for s E [O.el') denotes the s-shifted nonoverlapping empirical k-block distribution. Lemma 111. valid for all k > K and n > kly. all k > K = K (g). since nIk(n) and p < 1. the polynomial factor in that lemma takes the form. t The preceding argument applied to the Vi-mixing case yields the bound (1 1) kt({4: 1Pke — il ?_. summable in n. fi = h + 612 h ±E The right-hand side of (9) is then upper bounded by (10) 2 log n h +E (t since k(n) < (log n)/ (h E). as k grows. This is. 3 } ) < 2(k + g)[lk(g)]` (t 1 )22127 tceil00 . with B replaced by a suitable set of entropy-typical sequences. Since the number of typical sequences can be controlled. which. valid for all g. as n oc. and plc ( '. and all n > kly. ENTROPY FOR RESTRIC I ED CLASSES. To obtain the desired summability result it is only necessary to choose . and y denotes the product measure on » induced by A. The set rk -= {X I( : itt(x lic ) > 2— k(h+€12)} has cardinality at most 2k(h ±e/ 2) . Lemma 111. where y is a positive constant which depends only on g. thereby completing the proof of the admissibility theorem for i. k — 1]. To show this put a =C32 1100. Note that k can be replaced by k(n) in this bound. The idea now is to upper bound the right-hand side in (8) by applying the extended first-order bound.6. 101 ) — vil where each wi E A = A'.6.d. The bound (9) is actually summable in n.(h±E/2)2-tc82/100 . which. which is valid for k/n smaller than some fixed y > 0. This establishes that the sum in (6) is finite. where n = tk + k +r. combined with (8) and (7). though no longer a polynomial in n is still dominated by the exponential factor. by the entropy theorem.i. to the super-alphabet A". by assumption. since it was assumed that k(n) oc. furthermore. 0 < r < k. for k > K. there is a K such that yi (T) = kil (Tk ) > 1 — 8/10. produces the bound (9) tx: IPke — 1kI > < 2k(t 1)21. processes. The extended first-order bound.178 CHAPTER III. (8) ti({4: Ins* — Ak1 3/2}) v({ 01: IP1( . The rigorous proof is given in the following paragraphs. and.2. (1 + nl k) 2"1±(') . yields the bound P 101) — vil > 3/20 < 2(t + 1 )2/2)2_tc82000. As in the preceding section. see (7) on page 168.2.

) Also let Ex -g (f) denote conditional expectation of a function f with respect to the -g-m random vector X —g—m• —g A stationary process {X i } is weak Bernoulli (WB) or absolutely regular. the measure defined for ar iz E An by iun (41x )— g_m = Prob(r. ar iz. let p. for all n > 1 and all m > O.. -mixing result was the fact that the measure on shifted. provided a small fraction of blocks are omitted and conditioning on the past is allowed. which is again summable in n.1x:g g _n1 ) denote the conditional measure on n-steps into the future. hence for the aperiodic Markov case.1. and xig g provided only that g be large enough.1X —g—m -g — n i) < E . i X -go) — n> 1. at least for a large fraction of shifts. that is. since since t n/ k(n) and 8 < 1. in turn.2. This establishes the positive admissibility theorem for the Vi-mixing case. A process has the weak Bernoulli property if past and future become almost independent when separated by enough. I11.9. ENTROPY AND JOINT DISTRIBUTIONS. It is much weaker than Ifr-mixing. assuming that k(n) the right-hand side in (11) is. = xi. which requires that jun (4 lxig g _m ) < (1 + c)/L n (ar iz). 179 g so large that (g) < 2c6212°°.. the s-shifted.. which only requires that for each m > 0 and n > 1. where the variational distance is used as the measure of approximation. The weak Bernoulli property leads to a similar bound. there is a gap g for which (12) holds. upper bounded by log n 2(t h 1)n i3 . —g — m < j < —g}. uniformly in m > 0. see Lemma 111. nonoverlapping blocks could be upper bounded by the product measure on the blocks. with gap g is the expression (14) 4 = w(0)g(1)w(1)g(2)w(2). = aril and Xi g Prob ( X Tg g . The extension to the periodic Markov case is obtained by a similar modification EJ of the exponential-rates result for periodic chains. however.„(. and oc. n > 1.SECTION 111. As before. k-block parsing of x. The weak Bernoulli property is much stronger than mixing. conditioned on the past {xj.b The weak Bernoulli case. where g and m are nonnegative integers. if given E > 0 there is a gap g > 0 such that (12) E X ig-m : n (. and this is enough to obtain the admissibility theorem.g(t)w(t)w(t ± 1).2. An equivalent form is obtained by letting m using the martingale theorem to obtain the condition oc and (13) E sifc Gun( . With a = C8 2 /200 and /3 = (h + E /2)/ (h E). with only a small exponential error. . To make this precise. subject only to the requirement that k(n) < (log n)/(h E). The key to the O.

& x_ s tim -1)(k+g) ) < (1 + y)tt(w(j ni )). g) and fix a finite set J of positive integers. Then for any assignment {w(j): E J } of k-blocks tt Proof obtain (15) JEJ (naw(i)] n B. g)-splitting index will be denoted by Bi (y. JEJ-th„) D and Lemma 111.) If l is weak Bernoulli and 0 < y < 1/2. there are integers k(y) and t(y). fl Bi ) B*) to n Bi ) = [w(j.. Note that the k-block w(j) starts at index s (j — 1)(k + g) + g ± 1 and ends at index s j (k + g). Here. and later. for 1 < j < t.n )] n B . s. s. g)-splitting index for x E A Z if (w(j) ix _ s ti -1)(k+g) ) < (1 + y)p. ENTROPY FOR RESTRICTED CLASSES. (a) X E Gn(y). ([w(i.n )] n Bi.)) (1 + y) • . where w(0) has length s. which exists almost surely by the martingale theorem. y. k.7 Fix (y. each The first factor is an average of the measures II (w(j. the g-blocks g(j) alternate with the k-blocks w(j). lie lx i 00 ) denotes the conditional measure on the infinite past.) . then there is a gap g = g(y). k. t] is called a (y. and there is a sequence of measurable sets {G n (y)}. s. or by B i if s.2.2. by the definition of B11„.( . . Lemma 111. and the final block w(t 1) has length n — t(k+g) —s < 2(k ± g). An index j E [1. = max{ j : j E J } and condition on B* = R E .)) • tt ([. Lemma 111. and g are understood.7 follows by induction.1)(k +g) ).8 (The weak Bernoulli splitting-set lemma.u(w(in._{.2. defined as the limit of A( Ix i rn ) as m oc. eventually almost surely. The weak Bernoulli property guarantees the almost-sure existence of a large density of splitting indices for most shifts.180 CHAPTER III. The set of all x E Az for which j is a (y. )] n Bi ) .2. such that the following hold. k. k.n )] of which satisfies n Bi I xs_+. k.)) (1+ y)IJI J EJ Put jp. s.„ tc(B*). Note that Bi is measurable with respect to the coordinates i < s + j (k + g). g).(w(j)).„ } ( [ w (. Thus (15) yields (n([w(i)] n B.

. 181 (b) If k > k(y). The definition of G(y) and the assumption t > 2/y 2 imply that 1 r(k+g)-1 i=0 KCht(T1 X) = g) E 1 k-i-g-1 1 t(k k + g s=o t j =i E_ t K cm ( T s-hc i —( k + g) x ) > _ y 2 . t > t(y). 2 where kc.3C ( IX:e g. k ± g — 1] for each of which there are at least (1 — y)t indices j in the interval [1. {Xi : i < —g} U {X. Thus fk converges almost surely to some f. Direct calculation shows that each fk has expected value 1 and that {fk} is a martingale with respect to the increasing sequence {Ed. see Exercise 7b. so there is a subset S = S(x) C [0. g)-splitting indices for x. and fix an x E Gn(Y). then for x E G(y) there are at least (1 — y)(k + g) values of s E [0.. Fix k > M.2. Proof By the weak Bernoulli property there is a gap g = g(y) so large that for any k f (xx ) 1 p(x i`) p(x i`lxioo g) dp.c ) • Let Ek be the a-algebra determined by the random variables. Fatou's lemma implies that I 1 — f (x) d p(x) < )/4 . Put k(y)= M and let t(y) be any integer larger than 2/y 2 . .(x) < Y 4 4 Fix such a g and for each k define fk(X) — .t=o — L — 2 then x E Gn(Y). ENTROPY AND JOINT DISTRIBUTIONS. t] that are (y. s. Vk ?_ MI. if t > t(y).U(. The ergodic theorem implies that Nul+ rri cx. (Ts+0-1)(k+g)x ) > — y.s. 1 N-1 i=u y2 > 1 — — a. k. so that if G(y) = —EICc m (T z x) > 1 1 n-1 n. eventually almost surely.: 1 < i < kl. and if (t +1)(k + g) _< n < (t + 2)(k +g). 1 so there is an M such that if Cm = lx :I 1 — fk (x) 15_ .SECTION 111. k + g —1] of cardinality at least (1 — y)(k + g) such that for x E G(y) and s E S(x) k c. and (t + 1)(k g) < n < (t +2)(k ± g). denotes the indicator function of Cm. then /i(CM) > 1 — (y 2 /2).

(s. this is no problem.2. so that Lemma 111. k + g — 1] and J C [1. Theorem 111. g)-splitting index. k(y) and t(y).)) If x E G(y) then Lemma 111. and for the j g J.7 and the fact that I Ji < t yield (1 6) A (n[w(j)] jEJ n D. ( .3. let Dn (s. n > 1.8. s.2. n > 1. . then Ts±(. (.2.8 easily extends to the following sharper form in which the conclusion holds for a positive fraction of shifts.2. for x E G(y) and s E S(X) there are at least (1 — y)t indices j in the interval [1. g)-splitting indices for x. . = Gn(Y). then choose integers g = g(y).2. y) such that if k > K. which implies that j is a (y. For J C [1.2. so that conditions (a) and (b) of Lemma 111. t]. s.1-1)(k +g) x E Cm. and if Ipk(-14) — ?. and Lemma 111. t].g(ai )= E W(i) an! . k E k A. Lemma 111. if kl n < y.E. if s E S(X). s. In particular. Fix > 0. however.J k n Pk. t] that are (y. Assume A is weak Bernoulli of entropy h. and measurable sets G. t].182 CHAPTER III. this completes the proof of Lemma 111. J)) < (1+ y)' rip.: 3 then 114' g j ( . choose a positive y < 1/2.8 hold. for gaps.1. k. Proof of weak Bernoulli admissibility. and k(n) < (log n)/(h + E). g)-splitting index for x. t] of cardinality at least (1 — y)t. the empirical distribution of k-blocks obtained by looking only at those k-blocks w(j) for which j E J. The overlapping-block measure Pk is (almost) an average of the measures pk s'g i. that is. ENTROPY FOR RESTRICTED CLASSES. k. for at least (1—y)t indices j E [1. s. t] define s. where k(y) k < (log n)/(h+E). Since I S(x)I > (1 —3)(k+ g). In summary. g)-splitting indices for x. k.8 implies that there are (1 — y)(k + g) indices s E [0. But if Tr+(i -1)(k +g) x E Cm then g(w(i)Ixst i-1)(k+g )— g) < 1 ( + y)tt(w(D). k. k + g — 1] for each of which there are at least (1 — y)t indices j in the interval [1. For s E [0. where "almost" is now needed to account for end effects. J). J) be the set of those sequences x for which every j E J is a (y. and note that jEJ n B = Dn(s. k +g —1]. t] which are (y. If k is large relative to gap length and n is large relative to k. provided the sets J are large fractions of [1. 14) — ILkI 3/4 for at least 2y (k + g) indices s E [0.9 Given 3 > 0 there is a positive y < 1/2 such that for any g there is a K = K(g. Fix t > t(y) and (t+1)(k+g) < n < (t+2)(k+g). for any subset J C [1.

If y is small enough. 183 On the other hand. Show that if p is i.ei') — [4(n)1 > 31.d.] 1J12:(1—y)t (ix: Ips k :jg elxi) — > 3/41 n Dn (s.d. processes in the sense of ergodic theory. then for any x E G 0 (y) there exists at least one s E [0. yet p. however. . In particular. Aperiodic Markov chains are weak Bernoulli because they are ifr-mixing.i.2. does not have exponential rates of convergence for frequencies. This bound is the counterpart of (11). Show that for any nondecreasing. Show that k(n) = [log n] is not admissible for unbiased coin-tossing.i.d. as in the -tfr-mixing case. in Section 1V. [13]. then.9 _> 3/4 for at least 2y (k -I. for any subset J C [1.2. t] of cardinality at least > 3/4. which need not be if-mixing are. ENTROPY AND JOINT DISTRIBUTIONS. Remark 111. the measure of the set (17) is upper bounded by (18) 2 ..3. = {x: Ipk (n) (. unbounded sequence {k(n)} there is an ergodic process p such that {k(n)} is not admissible for A.2. I11. as part of their proof that aperiodic Markov chains are isomorphic to i. eventually almost surely. Aperiodic renewal and regenerative processes. 14) Thus if y is sufficiently at least (1—y)t.c Exercises. y can be assumed to be so small and t so large that Lemma 111. weak Bernoulli. t] of cardinality small.3. with an extra factor. J) and ips (17) is contained in the set k+ g —1 s=0 lpk — itkl a. indices s. 2-2ty log y to bound the number of subsets J C [1.SECTION 111. t ] of cardinality at least (1— y)t. [38]. the set k :g j (-14) — (1 — y)t. see Exercises 2 and 3.10 The admissibility results are based on joint work with Marton.2. and k > k(y) and t > t(y) are sufficiently large. J)) The proof of weak Bernoulli admissibility can now be completed very much as the proof for the 'tit-mixing case. 31 n G(y) UU ig[1. 2. the bound (18) is summable in n.2. then p(B 0 ) goes to 0 exponentially fast. 1.1. Show that there exists an ergodic measure a with positive entropy h such that k(n) (log n)/(h + E) is admissible for p. this establishes Theorem 111. 2-2t y log y yy[k(o+ + 2k(n)(h+E/2)2_ t(1—Y)C32/4°° for t sufficiently large. Using the argument of that proof. 4. It is shown in Chapter 4 that weak Bernoulli processes are stationary codings of i.i. 3. processes. if k(n) < (logn)I(h + E). The weak Bernoulli concept was introduced by Friedman and Ornstein. k + g — 1] and at least one J C [1. Since X E G n (y). and if B.g) assures that if I pk (14) — Ik1> S then I Ps kv g j ( . for which x E Dn (s.

. the admissibility theorem only requires that k(n) < (log n)I h. 7. though only the latter will be discussed here. Let R.2. based on the empirical entropy theorem. The principal positive result is Theorem 111. [52}. Ym. For each x let Pm (X) be the atom of E(m) that contains x. as n 6. A similar result holds with the nonoverlapping-block empirical distribution in place of the overlapping-block empirical distribution. c)) —> 1. Note that in the d-m etric case.i.3 The d-admissibility problem. a.3.) (b) Show that the sequence fk defined in the proof of Lemma 111. Show that the preceding result holds for weak Bernoulli processes. and p. . k.) If k(n) < (log n)I h then {k(n)} is d-admissible for any process of entropy h which is a stationary coding of an i.(x7.1 (The d-admissibility theorem.) g dp. let E(m) be the a-algebra determined by . where each w(i) has length k.i.184 CHAPTER ER III.1. and a deep property. Since the bound (2) dn (p. Stated in a somewhat different form.(x7 .d. for any integrable function g. Show that if k(n) oc.3. process. Assume t = Ln/k] and xin = w(1). and let E be the smallest complete a-algebra containing Um E(m). then .v(4)1. (Hint: use the martingale theorem.) Section 111. s.S The definition of d-admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. (a) Show that E(glE)(x) = lim 1 ni -+cx) 11 (Pm(x)) fp. k(n). A sequence {k(n)} is called d-admissible for the ergodic process if (1) iiIndn(Pk(n)(• Alc(n)) = 0.8 is indeed x4) to evaluate a martingale. almost surely. ENTROPY FOR RESTRICIED CLASSES.. 5. is i. The earlier concept of admissible will now be called variationally admissible.d. c) be the set of all kt-sequences that can be obtained by changing an e(log n)I(h+ fraction of the w(i)'s and permuting their order. while in the variational case the condition k(n) < (log n)I(h e) was needed. 151... Let {Yi : j > 1} be a sequence of finite-valued random variables. w(t)r. (Hint: use the previous result with Yi = E(fk+ilEk). Theorem 11. The d-metric forms of the admissibility results of the preceding section are also of interest. always holds a variationally-admissible sequence is also d-admissible. e). called the finitely determined property.u(R. The d-admissibility theorem has a fairly simple proof. the theorem was first proved by Ornstein and Weiss.(.. y) E a1 .

define the (empirical) universe of kblocks of x'11 . The negative results are two-fold. produces Ak [11k(Xn)t5 ( [14(4)16 = n I. I11. tai :x i some i E [0. A weaker.3. One is the simple fact that if n is too short relative to k.i.7. so that ttaik(x)is) < 6.3 (The strong-nonadmissibility theorem. intersecting with the entropy-typical set.d. . subsequence version of strong nonadmissibility was first given in [52]. 1[14(4)]s I < 2 k(" 12) by the blowup-bound. 185 which holds for stationary codings of i.a Admissibility and nonadmissibility proofs. Theorem 111. is given in Section III. provided 3 is small enough.5. such that liminf dk( fl )(pk( fl )(. To establish the d-nonadmissibility theorem.3.2. and its 6-blowup dk(4.a. It will be proved in the next subsection. for any x'11 . Tk = 11.SECTION IIL3. n--*oo 111(n)) > a. and which will be proved in Section IV. But if this holds then dk(/ik. almost surely The d-nonadmissibility theorem is a consequence of the fact that no more than sequences of length k(n) can occur in a sequence of length n < so that at most 2k(n )(h-E/2) sequences can be within 3 of one of them. and hence. Lemma 1. there is an ergodic process p. then {k(n)} is not admissible for any ergodic process of entropy h. THE 15-ADMISSIBILITY PROBLEM. If 3 is small enough then for all k. then extended to the form given here in [50]. If k > (log n)/(h — c) then Itik (4)1 _< 2k(h-E).3. These results will be discussed in a later subsection. < 2k(h-02) 2-ch-E/4) <2 -k€/4 The entropy theorem implies that it(Tk) —> 1.(X1) < k(h—E/4) 1 1 .3. The other is a much deeper result asserting in a very strong way that no sequence can be admissible for every ergodic process.16(4)) < 81.) If k(n) > (log n)/(h — c). processes. then a small neighborhood of the empirical universe of k-blocks in an n-sequence is too small to allow admissibility.2 (The d-nonadmissibility theorem. n — k +1]} . 14(4) = k i+k-1 = a. A proof of the d-admissibility theorem.) For any nondecreasing unbounded sequence {k(n)} and 0 < a < 1/2. Theorem 111. if k is large enough. assuming this finitely determined property. pk( 14)) > 82.

Thus. the only difference being that the former gives less than full weight to those m-blocks that start in the initial k — 1 places or end in the final k — 1 places xn . since the bounded result is true for any ergodic process. Convergence of entropy.3. 14)) — (b) (1/k(n))H(p k(n) (.2. a negligible effect if n is large enough. If y is a measure on An. of course.14(a) gives /Lk aik (x')13 ) > 1 — 28. The important fact needed for the current discussion is that a stationary coding of an i. The d-admissibility theorem is proved by taking y = plc ( I4).„ = 0(m. almost surely. for otherwise Lemma I. The desired result then follows since pni (an i 'lxri') —> A(an i z). and noting that the sequence {k(n)} can be assumed to be unbounded.l n-m EE i=0 uEiv veAn—n. The finitely determined property asserts that a process has the finitely determined property if any process close enough in joint distribution and entropy must also be dclose. see Theorem IV. A stationary process t is finitely determined if for any c > 0 there is a 6 > 0 and positive integers m and K such that if k > K then any measure y on Ak which satisfies the two conditions -(rn. the v-distribution of m-blocks in n-blocks is the measure 0.186 CHAPTER III. provided only that k < (11h) log n. is really just the empirical-entropy theorem. ENTROPY FOR RESTRICTED CLASSES. modulo.. 111 .4. that is. is expressed in terms of the averaging of finite distributions. it is sufficient to prove that if {k(n)} is unbounded and k(n) < (11h) log n. since pk(tik(fit)14) = 1. Theorem 11. which is more suitable for use here and in Section 111. almost surely. . This completes the proof of the d-nonadmissibility theorem.2. the average of the y-probability of din over all except the final m — 1 starting positions in sequences of length n. by the definition of finitely determined.1. the proof that stationary codings of processes are finitely determined. the following hold for any ergodic process p.9. for each m. must also satisfy cik(iik. for it asserts that (1 /k)H(pke h = 11(A). 1 —> 0. E v(uaT v). see Section IV. An equivalent finite form.( • Ixrn). . fact (b). and in < n. (a) 10 (in. v)1 <8. fact (a). To establish convergence in m-block distribution. An alternative expression is Om (ain ) E Pm (ain14)v(x) = Ev (pm (ar x7)).3.i. vk) < E. Pk(n)(.2. then. by the ergodic theorem. by the ergodic theorem. y) on Am defined by çb m (a) = 1 n — m -I. v n ) = ym for stationary v.2. 11-1(14) — H(v)I <k6.x7 which makes it clear that 4)(m. Pk( WO) is almost the same as pn. note that the averaged distribution 0(m. of entropy h.d. ---÷ h. almost surely as n oc. Theorem 111. process is finitely determined. This completes the proof of the d-admissibility theorem. as k and n go to infinity.

when viewed through a window of length k(n).. It is enough to show that for any 0 <a < 1/2 there is an ergodic measure be. 14 (k) ). Let Bk(a) = ix ) : dic(Pk• Ix). merging of these far-apart processes must be taking place in order to obtain a final ergodic process. if X is any joining of two a-separated measures. where. {p (J ) : j < J } . nondecreasing. and hence lim infdk (pk (. some discussion of the relation between dk-far-far-apart measures is in order. This must happen for every value of n. of {k(n)}-d-admissible ergodic processes which are mutually a' apart in d.u(x) > 0. see Exercise 1. where a'(1 — 1/J) > a. n(k) = max{n: k(n) < k}. unbounded sequence of positive integers. which is well-suited to the tasks of both merging and separation. look as though they are drawn from a large number of mutually far-apart ergodic processes. Ak) < I.b Strong-nonadmissibility examples. y) > a. Two measures 1. in which case pk(• /2. Thus. keeping the almost the same separation at the same time as the merging is happening. Associated with the window function k(. D c A k are said to be a separated.c and y are a-separated.3.) is the path-length function. Indeed. even when blown-up by a. (U Bk(a)) = 0 k>K The construction is easy if ergodicity is not an issue. as before.*oc almost surely. a-separated means that at least ak changes must be made in a member of C to produce a member of D. for which (3) lim p. to be the average of a large number.W-typical sequences. then cik(p.SECTION 111. Ak) > a'(1 — 1/J). then E(dk (x li` . The easiest way to make measures apart sequences and dk d-far-apart is to make supports disjoint. . yf)) ce).3. The result (4) extends to averages of separated families in the following form.. The tool for managing all this is cutting and stacking. - [Dia = i. Before beginning the construction. but important fact. Throughout this section {k(n)} denotes a fixed. Two sets C. for one can take p. THE D-ADMISSIBILITY PROBLEM. The trick is to do the merging itself in different ways on different parts of the space. A simple. A given x is typical x) n(k) i is mostly concentrated on for at most one component AO ) . and k(n) < log n.1 and y. 187 III.. k. yet at the same time. 1.(a (p) x a (v)) = ce. The support a (i) of a measure pc on Ak is the set of all x i` for which .t and y on A k are said to be a separated if their supports are a-separated. This simple nonergodic example suggests the basic idea: build an ergodic process whose n-length sample paths. y k < dk (x k ) for some yk i E D} is the a-blowup of D. if C n [D]ce = 0. is - (4) If 1.

then. A column structure S is said to be uniform if its columns have the same height f(S) and width.4 suggests a strategy for making (6) hold. namely. 11k. for i it([a(vma) = — E vi ([a(v.x ' "5. Thus if f(m)In(k m ) is sunu-nable in m.)]a) = v jag (v i)la) = 7 j i=1 if 1 . (Pkm (. and if the cardinality of Si (m) is constant in j. III. (3). then d(p. Here it will be shown.3. With these simple preliminary ideas in mind.188 CHAPTER III. the basic constructions can begin. and y. guaranteeing that (5) holds. yl`)) v(a (0) • [1 — AGa(v)icen a(1 — 1/.3.: j < J} of pairwise a-separated measures.4 guarantees that the d-property (6) will indeed hold. a complete sequence {8(m)} of column structures and an increasing sequence {km } will be constructed such that a randomly chosen point x which does not lie in the top n(km )-levels of any column of S(m) must satisfy (6) dk„. for any measure v whose support is entirely contained in the support of one of the v i . in detail. where xi is the label of the interval containing T i-l x. for any joining Â.n ).1 A limit inferior result. (5) The construction. take S(m) to be the union of a disjoint collection of column structures {Si (m): j < Jm } such that any km blockwhich appears in the name of any column of Si (m) must be at least am apart from any km -block which appears in the name of any column of 5. If all the columns in S(m) have the same width and height am). To achieve (5). The k-block universe Li(S) of a column . Thus.3..([a(v)]a ) < 1/J. Furthermore. of p. Proof The a-separation condition implies that the a-blowup of the support of yi does j.u. The sequence am will decrease to 0 and the total measure of the top n(km )-levels of S(m) will be summable in m. then Lemma 111. how to produce an ergodic p. (m) for i j.4 If it = (1/J)Ei v i is the average of a family {v . Lemma 111. A weaker liminf result will be described in the next subsection.. ergodicity will be guaranteed by making sure that S(m) and S(m 1) are sufficiently independent. and an increasing sequence {km } such that lim p(B km (a)) = O. after which a sketch of the modifications needed to produce the stronger result (3) will be given.b. v) > a(1 — 1/J). the measure is the average of the conditional measures on the separate Si (m). for 1 < i < n(k. from which the lemma follows. then . and hence not meet the support of yi . based on [52]. Lemma 111.. conditioned on being n(km )-levels below the top. illustrates most of the ideas involved and is a bit easier to understand than the one used to obtain the full limit result. ENTROPY FOR RESTRICTED CLASSES.3.1). if the support of y is contained in the support of y. Some terminology will be helpful.) > am . and hence EAdk(4.

J)merging rule is a function . The real problem is how to go from stage m to stage ni+1 so that separation holds. and there is no loss in assuming that distinct columns have the same name. J } . (7) = 01010.n ). 10 -1 (DI = M/J. km )separated column structures. The following four sequences suggest a way to do this. 2.. To summarize the discussion up to this point. the simpler concatenation language will be used in place of cutting and stacking language. and for which an2 (1 — 1/J. that is.. {1. The construction (7) suggests a way to merge while keeping separation. If one starts with enough far-apart sets. . the goal (6) can be achieved with an ergodic measure for a given a E (0. using a rapidly increasing period from one sequence to the next. and concatenations of sets by independent cutting and stacking of appropriate copies of the corresponding column structures. as m Since all the columns of 5(m) will have the same width and height.. 189 structure S is the set of all a l' that appear as a block of consecutive symbols in the name of any column in S. then d32(4 2 . These sequences are created by concatenating the two symbols. for which the cardinality of Si (m) is constant in j.. (Al) S(m) is uniform with height am) > 2mn(k...n ) decreases to a. an (M. oc. yet asymptotic independence is guaranteed. THE 13-ADMISSIBILITY PROBLEM. 11 = ( 064164) 1 a . 01 = (01) 64 b = 00001111000011110. The idea of merging rule is formalized as follows. For this reason S(m) will be taken to be a subset of A" ) ..0101. of course. yet if a block xr of 32 consecutive symbols is drawn from one of them and a block yr is drawn from another. MI whose level sets are constant. the simultaneous merging and separation problem.. Each sequence has the same frequency of occurrence of 0 and 1. There are many ways to force the initial stage 8 (1) to have property (A3). ..3. 0 and 1. yr) > 1/2..SECTION III. 1/2) by constructing a complete sequence {S(m)} of column structures and an increasing sequence {k m } with the following properties. Disjoint column structures S and S' are said to be (a... j <J. (A2) S(m) and S(m 1) are 2m-independent. (A3) S(m) is a union of a disjoint family {S) (m): j < J} of pairwise (am . so first-order frequencies are good. namely. then by using cyclical rules with rapidly growing periods many different sets can be produced that are almost as far apart... k)-separated if their k-block universes are a-separated. This is. 01111 = (0 414)16 c = 000000000000000001 11 = (016 116)4 d = 000. Conversion back to cutting and stacking language is achieved by replacing S(m) by its columnar representation with all columns equally likely. For M divisible by J. concatenate blocks according to a rule that specifies to which set the m-th block in the concatenation belongs.
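The four sequences in (7) can be checked numerically. The sketch below (not from the book) builds them by doubling the alternation period from one sequence to the next, verifies that all four have the same first-order frequencies, and reports, for each pair, the smallest d_32-distance between 32-blocks taken at the same starting position, which is one way to read the separation claim made for the display.

    # Sketch: the four 128-term sequences a, b, c, d of display (7) and their
    # pairwise aligned 32-block distances.
    def periodic(run):                       # run=4 gives (0^4 1^4)^16, etc.
        return (("0" * run + "1" * run) * (64 // run))[:128]

    def d(x, y):
        return sum(a != b for a, b in zip(x, y)) / len(x)

    seqs = {"a": periodic(1), "b": periodic(4), "c": periodic(16), "d": periodic(64)}

    for s in "abcd":
        print(s, "frequency of 1:", seqs[s].count("1") / 128)

    for s in "abcd":
        for t in "abcd":
            if s < t:
                worst = min(d(seqs[s][i:i + 32], seqs[t][i:i + 32]) for i in range(128 - 31))
                print(s, t, "min aligned 32-block distance:", worst)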

1} is a collection of pairwise (a.190 CHAPTER III. then for any J* there is a K* and an V. 0 *(2) • • • w (m ) : W(M) E So(. it is equally likely to be any member of Si . /MI.11. a merging of the collection {Sj : j E J } is just an (M. In cutting and stacking language. that is. (exp2 (J* — 1)) and for each t E [1. Let expb 0 denote the exponential function with base b.1 times. ENTROPY FOR RESTRICTED CLASSES. An (M. such that . K)-strongly-separated if their full k-block universes are a-separated for any k > K. when applied to a collection {Si: j E J } of disjoint subsets of A produces the subset S(0) = S(0. for each m. for some x E S" and some i > 11.3. The full k-block universe of S C A t is the set tik (S oe ) = fak i : ak i = x:+k-1 . let M = exp . Given J and J*.) Given 0 < at < a < 1/2. In proving this it is somewhat easier to use a stronger infinite concatenation form of the separation idea. 8(0) is the set of all concatenations w(1) • • • w(M) that can be formed by selecting w(m) from the 0(m)-th member of {S1 } . be the cyclic (M. In general. there is a Jo = Jo(a. 0 is called the cyclic rule with period p. When applied to a collection {Si : j E J } of disjoint subsets of A'. . the canonical family {0i } produces the collection {S(0)} of disjoint subsets of A mt .. J)-merging rule with period pt = exp j (exp2 (t — 1)). K)-strongly-separated subsets of A t of the saine cardinalit-v.5 (The merging/separation lemma.n ). The two important properties of this merging idea are that each factor Si appears exactly M1. The following lemma. Use of this type of merging at each stage will insure asymptotic independence. formed by concatenating the sets in the order specified by q. Cyclical merging rules are defined as follows.1 and J divides p. let (/). stated in a form suitable for iteration.0(rn ± np) = 0(m). The key to the construction is that if J is large enough. J)-merging of the collection for some M.P n . given that a block comes from Sj . . then the new collection will be almost as well separated as the old collection. Two subsets S. at) such that if J > Jo and {S J : j < . The merging rule defined by the two conditions (i) (ii) 0(m)= j. S' c At are (a. the direct product S() = in =1 fl 5rn = 1. Assume p divides M1. m E [1. each Si is cut into exactly MIJ copies and these are independently cut and stacked in the order specified by 0. J)-merging rule 4). Lemma 111. is the key to producing an ergodic measure with the desired property (5). The desired "cyclical rules with rapidly growing periods" are obtained as follows. and a collection {S7: t < J* } of subsets of 2V * of equal cardinalitv. p1. and. The family {Or : t < PI is called the canonical family of cyclical merging rules defined by J and J. m E [1. {Sj }) of A m'. The set S(0) is called the 0-merging of {Si : j E f }. In other words.
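As I read the definition, the cyclic rule with period p assigns the first p/J positions of each period to S_1, the next p/J to S_2, and so on. A small sketch (assumed interface, not the book's notation) builds the rule and draws one element of the corresponding merging S(φ).

    # Sketch: a cyclic (M, J)-merging rule with period p, and a sampler that
    # concatenates M blocks, the m-th drawn from S_{phi(m)}.
    import random

    def cyclic_rule(p, J):
        """Return phi(m) for m = 1..p; phi is then extended with period p."""
        block = p // J
        return lambda m: ((m - 1) % p) // block + 1

    def merge_sample(sets, M, p):
        """Concatenate M blocks, the m-th chosen from S_{phi(m)}."""
        J = len(sets)
        phi = cyclic_rule(p, J)
        return "".join(random.choice(sorted(sets[phi(m) - 1])) for m in range(1, M + 1))

    if __name__ == "__main__":
        S = [{"aa", "ab"}, {"cc", "cd"}, {"ee", "ef"}]   # three disjoint block sets
        print(merge_sample(S, M=12, p=6))                # period 6: two blocks from each set per period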

Part (b) is certainly true.SECTION IIL3. without losing (a.Ottn-1-1) from 111. any merging of {S7 } is also a merging of {S}. Fix a decreasing sequence {a } such that a < ant < 1/2. while y = c(1)c(2) • •• where each c(i) has the form (9) v(1)v(2) • • • v(J). then there can be at most one index k which is equal to 1 d u_one (xvv-i-(J-1)ne+1 y u ( J — 1 ) n f + 1 ) > a by the definition of (a. Furthermore. and Yu x v±(J-Unf+1 = w(t where w(k) E j.1-1)nt-1-1 is a subblock of such a v(j). each block v(j) is at least as long as each concatenation (8). K*)-strongly-separated.5. Since all but a limiting (1/J)-fraction of y is covered by such y u u+(j-1)nt+1 . v(j)E ST. . > JOY . furthermore. since (87)' = ST. for otherwise {Si : j < J } can be replaced by {S7: j E J } for N > Klt. Aar° (B1) S(m) is a disjoint union of a collection {S I (m): j < strongly-separated sets of the same cardinality. K*)-strongly-separated. (a) {St*: t < J*} is pairwise (a*. 1 _< j _< J. and. Let {Or : t < J* } be the canonical family of cyclical merging rules defined by J and J*. for all large enough K*. and for each t.`)c° and y E (S)oc where s > t.). The condition s > t implies that there are integers n and m such that m is divisible by nJ and such that x = b(1)b(2) • • where each b(i) is a concatenation of the form (8) w(l)w(2) • • • w(J). the collection {St*: t < J* } is indeed pairwise (a*.5. for any j. then K* can be chosen so that property (a) holds. so that if u+(. this may be described as follows. Suppose x E (S. K)-strongseparation. This completes the proof of Lemma 111. Suppose m have been determined so that the followLemma S(m) C and k ing hold.--m.3. for each ni. K)-strongly-separated and the assumption that K < E. choose J„. w(j) E Sr» 1 < j < J.3. THE D-ADMISSIBILITY PROBLEM. 4 } of pairwise (an„ kni )- (B2) Each Si (m) is a merging of {S (m — 1): j < 4_ 1 }. and since J* is finite. so it only remains to show that if J is large enough. provided J is large enough. and. In concatenation language. for each k. and hence (10) 1 )w (t + 2) • • w(J)w(1) • • • w(t) S. (13) Each S7 is a merging of {Si: j E 191 J}. Proof Without loss of generality it can be supposed that K < f. LI The construction of the desired sequence {S(m)} is now carried out by induction. (B3) 2n(k) <t(m). let S7 = S(0.

is needed. prior separation properties are retained on the unmerged part of the space until the somewhat smaller separation for longer blocks is obtained. the analogue of Theorem 1. for the complete proof see [50]. one on the part waiting to be merged at subsequent steps. the structure S(m) = S(m. merging only a (1/4. ±1 )-fraction of the space at each step.n+ 1 = a*. Meanwhile. it can be supposed that 2m+ l n(kn1+ i) < t(S. the sequence {S(m)} defines a stationary A-valued measure it by the formula . 1). the measure p. if necessary. Furthermore. the measure of the bad set .3.(m. but two widths are possible. where } (i) )1/4.*). The only merging idea used is that of cyclical merging of { yi : j < J} with period J. J 1+1). To obtain the stronger property (3) it is necessary to control what happens in the interval k m < k < km+1.n + i)x(S. and En. The new construction goes from S(m) to S(m + 1) through a sequence of Jm+i intermediate steps S(m.4 in this case where all columns have the same height and width.' by (S7) N for some suitably large N.n } with am = a and ce. is ergodic since the definition of merging implies that if M is large relative to m. ±1 . ENTROPY FOR RESTRICTED CLASSES.5 for the family {Si (m): j < J.s(m) 1 k ilxiam)). (m.u(Bk m (a)) is summable in m. Cyclical merging with period is applied to the collection {57 } to produce a new column structure R 1 (m. then each member of S(m) appears in most members of S(M) with frequency almost equal to 1/1S(m)1. Since (6) clearly holds. establishing the desired 1=1 goal (5). 0) is a union of pairwise km )-strongly separated substructures {Si (m.. 0) S(m. on each of which only a small fraction of the space is merged. Let {S'tk: t < Jrn+1 } be the collection given by Lemma 111. which is just the independent cutting and stacking in the order given.n+ 1. Define k 1+ 1 to be the K* of Lemma 111. and hence cutting and stacking language. n(lcm )I f(m) < cc. V1 * V2 * • • * V j As in the earlier construction. 0) is cut into two copies.10.3. This is accomplished by doing the merging in separate intermediate steps. each Si (nt. Since t(m) --> cc. In the first substage.2 The general case. j < J. By replacing S. by (B3). 0)). rather than simple concatenation language. The union S(m + 1) = Ui Si (m + 1) then has properties (Bi). (ii) x(S7) = ( i/J. III. and (B3). 0)). S. At each intermediate step.5 for J* = J. lc \ _ lim [ga l— E IS(m)1 t(m).(St i) = (1 — 1/ Ji)x(S. (B2). and the other on the part already merged.192 CHAPTER III.b. Put Si (m + 1) = S.3. for m replaced by m +1. 0): j < 41 all of whose columns have the same width and the same height t(m) = (m. Only a brief sketch of the ideas will be given here. and Sj ". Jm+i -fold independent cutting and stacking is . 1) S(m. that is. 0). All columns at any substage have the same height.

This is. and since the probability that both lie in in the same L i (m.„<k<k„. 1) have the same height as those of L i (m.1). 1)-fold independent cutting and stacking. 1). 1))4 + 1.3. In particular. then for any k in the range km < k < k(m. > 8(m.. 1).C i (m. 1) = {L i (m. 1) is (8(m. since strong-separation for any value of k implies strong-separation for all larger values. while making each structure so long that (b) holds for k(m. 1): j < J} of measure 1/4 +1 . weaker than the desired almost-sure result. Lemma 111. 1): j < J} remains pairwise (am . 1) > k m such that each unmerged part j (m.4_] Bk ) < DC ' . ±1 . t) <k < k(m. 193 applied separately to each to obtain a new column structure L i (m. since neither cutting into copies nor independent cutting and stacking applied to separate Ej(m. t + 1). (b) If x and y are picked at random in S(m. 1) > a„. k(m.E.14 (k) ). 1) has measure 1/4 +1 and the top n(k(m. if they lie at least n(k(m. 1).(m. 1) < 1/4 + 1. But. 1))/(m. An argument similar to the one used to prove the merging/separation lemma. THE D-ADMISSIBILITY PROBLEM. then applying enough independent cutting and stacking to it and to the separate pieces to achieve almost the same separation for the separate structures. The merging-of-a-fraction idea can now be applied to S(m. in turn. Thus if M(m. k m )-strongly-separated. and the whole construction can be applied anew to S(m 1) = S(m. which makes each substructure longer without changing any of the separation properties. the following holds. 1) < k < k(m. The collection {. then for a suitable choice of an. km )-strongly-separated. for which lim . 1) is replaced by its M(m.. k (m . 1) is replaced by its M(m.3. and this is enough to obtain (12) ni k. This is because..E. pk(. and R(m. 1) is large enough and each L i (m. 1) > n(k(m. 1)-fold independent cutting and stacking./r . 1))-levels have measure at most n(k(m. implies the following. After 4 +1 iterations. 2). by the way. 1) is chosen large enough. large enough. then it can be assumed that i(m. . an event of probability at least (1 —2/J.. if M(m. 1). producing in the end an ergodic process p. the probability that ak(pk(. then k-blocks drawn from the unmerged part or previously merged parts must be far-apart from a large fraction of k-blocks drawn from the part merged at stage t 1. 1). can be used to show that if am) is long enough and J„.Iy in(k) )) > a is at least (1 — 2/4 + 0 2 (1 — 1/. 1))-levels below the top and in different L i (m. 1))-strongly separated from the merged part R1 (m.SECTION III.. Note. of course.u(Bk (a)) = O. 1) is upper bounded by 1/. J„. 1))strongly-separated. then {L i (m. 1). but a more careful look shows that if k(m. 1). since R. Each of the separate substructures can be extended by applying independent cutting and stacking to it.. the k-block universes of xrk) and y n(k) i are at least a-apart. the unmerged part has disappeared. which. but are only 1/J„1 . merging a copy of {L i (m. t))} is (p(m.5. (a) The collection S(m. for the stated range of k-values. t): j < J } U {R 1 (m. there is a k(m. 1). 1): j < J} is pairwise (a. 1).n+ i) 2 (1-1/0. 1 as wide. and (a) continues to hold.0. that the columns of R I (m. 1) affects this property.

almost surely. including a large family called the finitary processes. almost lim infak(Pk(.i. A slightly weaker concept. Section 111. A full discussion of the connections between blowing-up properties and stationary codings of i. processes. enough to guarantee the desired result (3). 2_ . ENTROPY FOR RESTRICTED CLASSES. Section 1.([C]. Let p. which is. y) —1511 5 . process. and for other processes of interest. (Hint: use Exercise 11.d.3. An interesting property closely connected to entropy ideas.1x n i (k) ). (Yi)k) surely.d.4 Blowing-up properties. processes. Show that 0. for any subset C c An for which 11 (C) . I11. for some ariL E C} . 1. cik(Pk(.6 It is important to note here that enough separate independent cutting and stacking can always be done at each stage to make i(m) grow arbitrarily rapidly.) > 1 — E. The blowing-up property and related ideas will be introduced in this section. Remark 111.194 CHAPTER III. The following theorem characterizes those processes with the blowing-up property. where each yi is ergodic.i.3. in fact. ixin(k) ). relative to km . Informally. Processes with the blowing-up property are characterized as those processes that have exponential rates of convergence for frequencies and entropy and are stationary codings of i.3. 00 > — 1/f). that is [C]. In particular.i.d. If C c An then [C].9. a stationary process has the blowing-up property if sets of n-sequences that are not exponentially too small in probability have a large blowup.3. b7) < E.4. An ergodic process i has the blowing-up property (B UP) if given E > 0 there is a 8 > 0 and an N such that if n > N then p. and d(yi . in turn. This fact will be applied in I11. called the almost blowing-up property is. called the blowing-up property..d. Remark 111. denotes the 6-neighborhood (or 6-blowup) of C. k-4. then 10(k. has recently been shown to hold for i. = {b7: dn (a7.e to construct waiting-time counterexamples. this shows the existence of an ergodic process for which the sequence k(n) = [(log n)/ hi is not a-admissible. for aperiodic Markov sources. processes is delayed to Chapter 4. for suitable choice of the sequence {40.5. = (1/J)Ei yi. Show that if "if is the concatenated-block process defined by a measure y on An.7 By starting with a well-separated collection of n-sequences of cardinality more than one can obtain a final process of positive entropy h. equivalent to being a stationary coding of an i.) 2. 2(k — 1)/n. and Lemma 111.i. yi ) > a. for i j.c Exercises.
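For small n the quantities in the definition of the blowing-up property can be computed by brute force. The toy sketch below (not from the book; names are illustrative) measures a set C and its ε-blowup under an i.i.d.(p) measure, interpreting d_n(a_1^n, b_1^n) ≤ ε as at most εn changed coordinates.

    # Sketch: mu(C) and mu([C]_epsilon) for an i.i.d.(p) measure on {0,1}^n,
    # by enumeration; only feasible for small n.
    from itertools import product

    def iid_prob(x, p):
        ones = x.count("1")
        return p ** ones * (1 - p) ** (len(x) - ones)

    def blowup_measure(C, eps, n, p):
        radius = int(eps * n)            # allow up to eps*n changed coordinates
        total = 0.0
        for x in map("".join, product("01", repeat=n)):
            if any(sum(a != b for a, b in zip(x, c)) <= radius for c in C):
                total += iid_prob(x, p)
        return total

    if __name__ == "__main__":
        n, p, eps = 10, 0.5, 0.2
        C = {"0" * n}                     # a single sequence
        print("mu(C) =", iid_prob("0" * n, p), " mu([C]_eps) =", blowup_measure(C, eps, n, p))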

Theorem III.4.1 (Blowing-up property characterization.) A stationary process has the blowing-up property if and only if it is a stationary coding of an i.i.d. process and has exponential rates of convergence for frequencies and entropy.

The fact that processes with the blowing-up property have exponential rates will be established later in the section. Note, in particular, that the theorem asserts that an i.i.d. process has the blowing-up property. The proof of this fact, as well as most of the proof of the theorem, will be delayed to Chapter IV.

Not every stationary coding of an i.i.d. process has the blowing-up property. A particular kind of stationary coding, called finitary coding, does preserve the blowing-up property, however. A stationary coding F: A^Z → B^Z is said to be finitary, relative to an ergodic process μ, if there is a nonnegative, almost surely finite, integer-valued measurable function w(x), called the window function, such that, for almost every x ∈ A^Z, the time-zero encoder f satisfies the following condition. If x̃ ∈ A^Z and x_{−w(x)}^{w(x)} = x̃_{−w(x)}^{w(x)}, then f(x) = f(x̃). A finitary coding of an i.i.d. process is called a finitary process.

Theorem III.4.2 (The finitary-coding theorem.) Finitary coding preserves the blowing-up property.

Since i.i.d. processes have the blowing-up property it follows that finitary processes have the blowing-up property. It is known that aperiodic Markov chains and finite-state processes, m-dependent processes, and some renewal processes are finitary, [28, 79].

The difficulty in the general case is that only sets of sequences that are mostly both frequency and entropy typical can possibly have a large blowup; without exponential rates there can be sets that are not exponentially too small yet fail to contain any typical sequences. Once these are suitably removed, however, a blowing-up property will hold for an arbitrary stationary coding of an i.i.d. process.

A set B ⊆ A^k has the (δ, ε)-blowing-up property if μ([C]_ε) ≥ 1 − ε, for any subset C ⊆ B for which μ(C) ≥ 2^{−kδ}. An ergodic process μ has the almost blowing-up property (ABUP) if for each k there is a set B_k ⊆ A^k such that the following hold.

(i) x_1^k ∈ B_k, eventually almost surely.

(ii) For any ε > 0 there is a δ > 0 and a K such that B_k has the (δ, ε)-blowing-up property for k > K.

Theorem III.4.3 (Almost blowing-up characterization.) A stationary process has the almost blowing-up property if and only if it is a stationary coding of an i.i.d. process.

A proof that a process with the almost blowing-up property is a stationary coding of an i.i.d. process will be given in Section IV.3. By borrowing one concept from Chapter IV, it will be shown later in this section that a stationary coding of an i.i.d. process has the almost blowing-up property, a result that will be used in the waiting-time discussion in the next section.
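A one-sided toy version of a finitary time-zero encoder may make the window-function condition above concrete. In the sketch below (my illustration, not an example from the book) the output at a site depends on the input only out to the first occurrence of the symbol 1 at or to the right of that site, so the window is almost surely finite for an i.i.d. input with P(1) > 0.

    # Sketch: a finitary time-zero encoder with a one-sided window; f reports
    # whether the distance to the next 1 is even or odd.
    import random

    def time_zero_encoder(x, i):
        """Return (output, window) at site i, or None if the window runs off x."""
        for w in range(len(x) - i):
            if x[i + w] == 1:
                return w % 2, w
        return None

    def finitary_code(x):
        """Apply the time-zero encoder at every site where the window is visible."""
        out = []
        for i in range(len(x)):
            coded = time_zero_encoder(x, i)
            if coded is None:
                break
            out.append(coded[0])
        return out

    if __name__ == "__main__":
        x = [random.randint(0, 1) for _ in range(30)]   # i.i.d. fair-coin input
        print(x)
        print(finitary_code(x))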

An exponential bound for the measure of the set B. c)) < 2 -6n . (1) Ipk( 14)— pk( 1)1)1 6/2. < 2n(h—E/2)2— n(h—E14) < 2—nEl 4 . 0. The blowing-up property provides 6 and N so that if n>N. however. First it will be shown that p. Note that if y = E/(2k + 1) then d(4. - = trn i: iun (x n i)> so that B* (n. —> 0. then frequencies won't change much and hence the blowup would produce a set of large measure all of whose members have bad frequencies. Intersecting with the set Tn = 11. E/2)) to be at least 1 — E. and ti. If the amount of blowup is small enough. If. k. k. however.196 CHAPTER III.u has the blowing-up property. [B(n. E/2)) = 0 for each fixed k and E. Thus p n (B(n.u(B*(n.u n ([B*(n. let B . k. and p. To fill in the details of this part of the argument. E/2). for all sufficiently large n. k.) —> 1. for there cannot be too many sequences whose measure is too large. k. ) fx n i: tin(x n i) < of sequences of too-small probability is a bit trickier to obtain. n > N and .(C) > 2 -611 then ii. since . But this cannot be true for all large n since the ergodic theorem guarantees that lim n p n (B(n.(n. has exponential rates for frequencies. in particular. E) = 1Pk( 14) — IkI > El. k.u(4) 2 ' _E/4)} gives (n. k. E)) must be less than 2 -8n. E)1 0 ) 0 it follows that .naB * 2n(h—E/2) . and hence (1) would force the measure p n (B(n. o ] l < for all n. One part of this is easy. ç B(n.(C) > 2 -6 " then . Next it will be shown that blowing-up implies exponential rates for entropy.4. .u n (B(n. 6 .Cc An. To fill in the details let pk ( WI') denote the empirical distribution of overlapping k blocks. since this implies that if n is sufficiently . hence a small blowup of such a set cannot possibly produce enough sequences to cover a large fraction of the measure.([C]y ) > 1 — E. Since .) would have to be at least 1 —E. so that. E)) > 2 -6" then ([B(n. ENTROPY FOR RESTRICTED CLASSES.) > 1 — a. and define B(n. but depends only on the existence of exponential rates for frequencies.u([C]. and hence there is an a > 0 such that 1[B* (n. k.a Blowing-up implies exponential rates. E)].Cc A n . III. 6)1 < 2n(h €). for all n sufficiently large. contradicting the ergodic theorem. by the entropy theorem. The blowing-up property provides 8 > 0 and N so that if n>N. The idea of the proof is that if the set of sequences with bad frequencies does not have exponentially small measure then it can be blown up by a small amount to get a set of large measure.u„(T. n In) _ SO p„([B* (n. Suppose .
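The entropy-typicality condition used repeatedly in this argument, 2^{-n(h+ε)} ≤ μ(x_1^n) ≤ 2^{-n(h-ε)}, is easy to test for an i.i.d. measure, where it is just the statement that -(1/n) log_2 μ(x_1^n) is within ε of h. A minimal sketch (illustrative names, i.i.d.(p) case only):

    # Sketch: entropy-typicality test for a binary i.i.d.(p) measure.
    import math, random

    def entropy(p):
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def is_entropy_typical(x, p, eps):
        n = len(x)
        log_prob = sum(math.log2(p if b == 1 else 1 - p) for b in x)
        return abs(-log_prob / n - entropy(p)) <= eps

    if __name__ == "__main__":
        p, n = 0.3, 2000
        x = [1 if random.random() < p else 0 for _ in range(n)]
        print(is_entropy_typical(x, p, eps=0.05))   # True with high probability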

if n > N I . Lemma 1. n > N1.u([D]y ) > 1 — E. with full encoder F: B z time-zero encoder f: B z A. E /4) < 2al. except for a set of exponentially small probability. To fill in the details of the preceding argument apply the entropy theorem to choose k so large that Ak(B(k. there is at least a (1 — 20-fraction of (2k -I. there is a 6 > 0 and an N 1 > N such that p. As in the proof of the entropy theorem. the built-up set bound. where 6 > O. E)) < 2_8n which gives the desired exponential bound. satisfies .4.(k. that is.(n. Now consider a sequence xhti._E k 1 . and note that .u(T) > 1 — c by the Markov inequality. at the same time agree entirely with the corresponding block in imHk-±k . Let D be the projection of F-1 C onto B" HI. In particular. then p. and hence there is a k such that if Gk is the set of all x k k such that w(x) < k.u has the blowing-up property. 197 large then. For such a sequence (1 — c)n of its (2k + 1)-blocks belong to Gk. there is a sequence i n tE k E D such that dn+2k(X . the set T = {x n I± k 1 : P2k+1(GkiX n jc-± k i) > 1 — 61. and.1)-blocks in xn -tf k that belong to Gk and. Fix E > 0 and a process p with AZ. (Gk) > 1 — 6 2 . k 1 E [D]y n T. The finitary-coding theorem will now be established. 6/4). Thus + 2—ne/2. this gives an exponential bound on the number of such x. moreover. implies that there is an a > 0 and an N such that if n > N.1 • Thus fewer than (2k + 1)yn < cn of the (2k + 1)-blocks in x h± k± k I can differ from the corresponding block in i1Hk4 k . since k is fixed and . Let be a finitary coding of .u(D) > 2 -h 8 . 1) < Y 2k -I.SECTION 111. Thus f (x) depends only on the values x ww (x()x) .u. where w(x) is almost surely finite.7. 6 can be assumed to be so small and n so large that if y = E / (2k + 1) then . BLOWING-UP PROPERTIES. the set of n-sequences that are (1 —20-covered by the complement of B(k. Suppose C c An satisfies v(C) > 2 -h 6 .6. where a will be specified in a moment. x E B Z . n (G n ) > 1 — 2 -6n. and hence tin(Gn n 13. In particular.un(B. For n > k let = pk(B.b Finitary coding and blowing-up. 6/4)) < a. . most of xPli will be covered by k-blocks whose measure is about 2 -kh. Since the complement of 13(k. I=1 III. Thus.(n. which in turn means that it is exponentially very unlikely that such an can have probability much smaller than 2 -nh. the blowing-up property. and window function w(x).4. then iGn < 2n(h+E/2). 6/4) has cardinality at most 2 k(h+e/4) . 6)) < IGn12—n(h+E) < 2—ne/2 n > N • Since exponential rates have already been established for frequencies.

1. This proves that vn ([ che ) n T) > 1 — 2€. v) is the measure on Ak obtained by averaging must also satisfy the y-probability of k-blocks over all starting positions in sequences of length n. and which will be discussed in detail in Section IV. ) (3) Ca li) = E P (a14)v(x) = Ev(Pk(a l'IX7)). .u. Z = F(Y.d. and put z = . <e. ktn( . The proof that stationary codings of i. E D such that Yi F(y). C) > E has p. then. An ergodic process p. LII which completes the proof of the finitary-coding theorem. measure less than E. Proof The relation (2) vn EC holds for any joining X of A n and A n ( IC).4 )1) > P(4)dn(4 . x ntf k 19 and r-iChoose y and 5. a property introduced to prove the d-admissibility theorem in Section 111. Let . ylz). by the Markov inequality. Z?) < 26. (b) 11-1(u n ) — H(v)I < u8.2.([C]e ) > 1 — E.4.3. processes have the almost blowing-up property (ABUP). This proves the lemma. Thus if (LW. process have the almost blowing-up property makes use of the fact that such coded processes have the finitely determined property. there is a 3 > 0 and positive integers k and N such that if n > N then any measure y on An which satisfies the two conditions (a) Luk — 0*(k.3. c+ k i .4.4 If d(1-t. The sequence Z7 belongs to C and cln (z7 . 4([C]) > 1 — E. the set of 4 such that dn (xit .d. . ENTROPY FOR RESTRICTED CLASSES. IC)) < 62 . C). the minimum of dn (4 . that is. v)I < 8. An( IC)).1 (' IC) denote the conditional measure defined by the set C c An. I11. that is. Towards this end a simple connection between the blowup of a set and the 4-distance concept will be needed.(4. and hence EA(Cin(X til C)) < (LW.). int+ ki. is finitely determined (FD) if given E > 0. Also let c/.i.a into the notation used here.198 CHAPTER III.c Almost blowing-up and stationary coding. that is. Ane IC» <2 then 1. k This simply a translation of the definition of Section III. y E C. C) denote the distance from x to C. In this section it is shown that stationary codings of i. E Yil )dn(.i. so that Z7 E [C]2 E . Lemma 111. where 4)(k.

39]. unbounded sequence {k(n)}. a (n)). e)blowing-up property. Suppose C c Bn and . [37.4.4. Put Bn = Bn(k(n). .u(C) > 2 —n8 .u. both of which depend on . A sequence 4 will be called entropy-typical. as usual. Thus if n is large enough then an(it. where. time IC)) < e 2 . a) be the set of n-sequences that are both entropy-typical.uk and entropy close to H(i). Note that in this setting n-th order entropy. [7]. The first step in the rigorous proof is to remove the sequences that are not suitably frequency and entropy typical. 1. This completes the proof that finitely determined implies almost blowing-up. Kan).6 The results in this section are mostly drawn from joint work with Marton. and a nonincreasing sequence {a(n)} with limit 0. eventually almost surely. I xi ) — . must also satisfy dn (ii.4. The basic idea is that if C c A n consists of sequences that have good k-block frequencies and probabilities roughly equal to and i(C) is not exponentially too small.) A finitely determined process has the almost blowing-up property 199 Proof Fix a finitely determined process A. then the conditional measure A n (. so there is a nondecreasing. IC)) — 141 < a(n). Show that (2) is indeed true. relative to a.11 (4)+na . a(n)). relative to a. rather than the usual nh in the definition of entropy-typical. implies that .) > 1 E. If k and a are fixed.4. a)-frequency-typical for all j < k.un are d-close. IC) satisfies the following.uk I < a. if tt(x til ) < 2. there is a 8 > 0 such that Bn eventually has the (8. a). y) < e 2 . — (i) 10(k. BLOWING-UP PROPERTIES. ttn(.5 (FD implies ABUP. by Lemma 111. ED — Remark 111. for any C c Bp. The definition of Bn and formula (3) together imply that if k(n) > k. is used. the finitely determined property provides 8 > 0 and k such that if n is large enough then any measure y on A n which satisfies WI Ø(k. A sequence x i" will be called (k. then xçi E Bn(k.d Exercises. For references to earlier papers on blowing-up ideas the reader is referred to [37] and to Section 1. 2. which also includes applications to some problems in multi-user information theory. It will be shown that for any c > 0. for which 2 —n8 which. implies that C has a large blowup.SECTION 111. eventually almost surely. and (j.4.u([C]. a)-frequency-typical if I Pk (. which.5 in the Csiszdr-Körner book. which is all right since H(t)/n —> h. by Lemma 111. Pk(• Ifiz) is the empirical distribution of overlapping k-blocks in xri'. Given E > 0.4. then lin e IC) will have (averaged) k-block probabilities close to .4. Theorem 111. I11. (ii) H(14) — na(n) — 178 Milne IC)) MILO ± na(n). Show that a process with the almost blowing-up property must be ergodic.4. ")I < 8 and IHCan ) — HMI < n6. The finitely determined property then implies that gn e IC) and . Let Bn (k. such that 4 E Bn(k(n).

7. A connection between entropy and recurrence times. 6. y E Ac° by Wk (x. y) converges in probability to h. Wyner and Ziv also established a positive connection between a waiting-time concept and entropy. 1. Counterexamples to extensions of these results to the general ergodic case will also be discussed. yn mi +k-1 < 61. Show that some of them do not have the blowing-up property. in conjunction with the almost blowing-up property discussed in the preceding section. . first noted by Wyner and Ziv. v. for irreducible Markov chains. for the well-known waiting-time paradox.d. 5. 6) = minfm > 1: dk (x Two positive theorems will be proved.2. The counterexamples show that waiting time ideas. suggests that waiting times are generally longer than recurrence times. 6) is defined by Wk (x. [86]. y.1. This result was extended to somewhat larger classes of processes. Show that a coding is finitary relative to IL if and only if for each b the set f -1 (b) is a countable union of cylinder sets together with a null set. The surprise here. at least for certain classes of ergodic processes. processes have the blowing-up property. 3.5. [11. ENTROPY FOR RESTRICTED CLASSES. Show that a process with the almost blowing-up property must be mixing. (b) Show that a i/i -mixing process has the blowing-up property. cannot be extended to the general ergodic case. processes. 76]. y) = min{m > 1: yn n: +k-1 i = xk The approximate-match waiting-time function W k (x. thus further corroborating the general folklore that waiting times and recurrence times are quite different concepts. An almost sure version of the Wyner-Ziv result will be established here for the class of weak Bernoulli processes by using the joint-distribution estimation theory of III. Show that condition (i) in the definition of ABUP can be replaced by the condition that . was shown to hold for any ergodic process in Section Section 11.i. Theorem 111. then (1/k) log Wk(x. Section 111. (a) Show that an m-dependent process has the blowing-up property.d.time function Wk (x. Assume that i.d.5 The waiting-time problem. is the positive result. 1Off.200 CHAPTER III. processes.b.3. of course.u(B n ) 1. pp. The waiting .i. They showed that if Wk (x. an approximate-match version will be shown to hold for the class of stationary codings of i. including the weak Bernoulli class in [44.i. Assume that aperiodic renewal processes are stationary codings of i. unlike recurrence ideas. y) is the waiting time until the first k terms of x appear in an independently chosen y. In addition. 4. by using the d-admissibility theorem. y) is defined for x.
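Both waiting-time functions are simple to compute on finite sample paths. The sketch below (not from the book; names are illustrative) returns the exact-match time W_k(x, y) and the approximate-match time W_k(x, y, δ), with None standing in for the case that no match occurs within the available portion of y.

    # Sketch: the exact-match and approximate-match waiting times.
    def exact_wait(x, y, k):
        """Least m >= 1 with y_m^{m+k-1} = x_1^k, or None if no match in y."""
        target = x[:k]
        for m in range(len(y) - k + 1):
            if y[m:m + k] == target:
                return m + 1
        return None

    def approx_wait(x, y, k, delta):
        """Least m >= 1 with d_k(x_1^k, y_m^{m+k-1}) <= delta, or None."""
        target = x[:k]
        for m in range(len(y) - k + 1):
            mismatches = sum(a != b for a, b in zip(target, y[m:m + k]))
            if mismatches <= delta * k:
                return m + 1
        return None

    if __name__ == "__main__":
        x, y = "0110100110", "1001011010010110"
        print(exact_wait(x, y, 4), approx_wait(x, y, 4, 0.25))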

The d-admissibility theorem. y. can be taken to depend only on the coordinates from 1 to n. The set G n consists of those y E A Z for which yll can be split into k-length blocks separated by a fixed length gap. the probability that xf E Bk is not within 8 of some k-block in y'i' will be exponentially small in k. 6—). for k . it is necessary to allow the set G„ to depend on the infinite past. guarantees that property (b) holds. In each proof two sequences of measurable sets.}. eventually almost surely.5. x p. x A. Bk consists of the entropy-typical sequences. the sequences {B k } and {G. In other words. In other words. for there are exponentially too few k-blocks in the first 2k(h .) If p is a stationary coding of an i.8. hence it is very unlikely that a typical 4 can be among them or even close to one of them. then lim sup . y. Theorem 111. For the approximate-match result.) If p..} have the additional property that if k < (log n)/(h + 6) then for any fixed xiic E Bk.i. THE WAITING-TIME PROBLEM. the G„ are the sets given by the weak Bernoulli splitting-set lemma.5.d.3. In both cases an application the Borel-Cantelli lemma establishes the desired result. and Wk (x. {B k } and {G n } have the property that if n > 2k(h +E) . S) > h. k almost surely with respect to the product measure p.log The easy part of both results is the lower bound. the probability that y E G„ does not contain x lic anywhere in its first n places is almost exponentially decaying in n.1. V S > 0. the Bk are the sets given by the almost blowing-up property of stationary codings of i. 8) < h. though the parallelism is not complete.log Wk(x. For the weak Bernoulli case. {B k C Ak} and {G.f) terms of y. In the approximate-match proof. is weak Bernoulli with entropy h. Lemma 111.SECTION 111.. the set G. G„ consists of those yli whose set of k-blocks has a large 3-neighborhood. eventually almost surely.5. see Theorem 111.2 (The approximate-match theorem. processes.2. 201 Theorem 111. k almost surely with respect to the product measure p. y) = h. (a) x'ic E Bk. The proofs of the two upper bound results will have a parallel structure.c. hence is taken to be a subset of A. such that a large fraction of these k-blocks are almost independent of each other.d.„0 1 k Wk(X. hence G„ is taken to be a subset of the space A z of doubly infinite sequences. In the approximate-match case. In the weak Bernoulli case. so that property (a) holds by the entropy theorem. Theorem 111.i. .4.3.1 (The exact-match theorem.0 k—>oo 1 lim lim inf .(log n)/(h + 6). The set Bk C A k is chosen to have the property that any subset of it that is not exponentially too small in measure must have a large 3-blowup. are constructed for which the following hold.log k. In the weak Bernoulli case. process. then for any fixed y'iz E G. (b) y E G n . then k— s oo 1 lirn .

Intersecting with the entropy-typical set 'Tk (E/2) gives < 2 k(h-3E/4) 2 -kch-E/2) < p.5.5. I11. by the blowup-bound lemma. The (empirical) universe of k-blocks of yri' is the set iik(Yç') = Its S-blowup is dk(x 1`. In the final two subsections a stationary coding of an i. x /. there is an N = N(x. 8—». I f 6 < 2k(h-3(/4). 2k(h-E). be an ergodic process with entropy h > 0 and let e > 0 be given.6) then n <2k(h-e).5.log Wk(X.6) then for every y E A and almost every x. for any ergodic measure with entropy h. Theorem 111.6) then n _ _ 2 k(h . y) h. there is an N = N(x.log Wk(x. and an ergodic process will be constructed for which the approximate-match theorem is false. and hence 1/1k(yril)1 < Proof If k > (log n)/(h . an application of the Borel-Cantelli lemma yields the theorem. The following notation will be useful here and later subsections. j[lik (y)]51 _ Lemma 1. yçl E An . Proof If k > (log n)/(h . n . (y)16 n Tk(e/2)) [lik(yli )]3 = for some i E [0. n > N.i. y.7. for any ergodic measure with entropy h.` E 7 k(E/2). the set of entropy-typical k-sequences (with respect to a > 0) is taken to be 7k(01) 2 -k(h+a) < 4 (4) < 2-k(h-a ) } Theorem 111.d. eventually almost surely. If k(n) > (logn)I(h ./.14(y1)) 6).5. and hence so that intersecting with the entropy-typical set 7k (e/2) gives < 2 k (6/2)) _ k(h-0 2 -ko-E/2) < (tik (yi ) n r Since . _ _L which establishes the theorem. k almost surely with respect to the product measure p. < 2k(h -E ) .. y) such that x ik(n) (/[14k(n)(11`)]5. 3) > h. 111 IL . The theorem immediately implies that 1 lim lim inf .4 Let p. The lower bound and upper bound results are established in the next three subsections.0 k—c k almost surely with respect to the product measure p. be an ergodic process with entropy h > 0 and let e > 0 be given. n > N. ENTROPY FOR RESTRICTED CLASSES. x A.E) .6) then for every y E A" and almost every x. process will be constructed for which the exact-match version of the approximate-match theorem is false.3 Let p.x /.k l]}.a Lower bound proofs.202 CHAPTER III. In the following discussion. There is a 3 > 0 such that if k(n) > (log n)/(h . El The theorem immediately implies that 1 lim inf . is small enough then for all k. y) such that x ik(n) lik(n)(Yi).
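The counting step behind both lower bounds can be seen numerically: the empirical universe U_k(y_1^n) has at most n − k + 1 members, so once k is larger than (log_2 n)/(h − ε) it is much smaller than the collection of entropy-typical k-blocks, whose size is roughly 2^{kh}. A fair-coin sketch (h = 1; not from the book):

    # Sketch: size of the empirical k-block universe of a fair-coin path,
    # compared with 2^{k(h - eps)}.
    import random

    def k_block_universe(y, k):
        return {tuple(y[i:i + k]) for i in range(len(y) - k + 1)}

    if __name__ == "__main__":
        h, eps, n = 1.0, 0.1, 2 ** 14
        y = [random.randint(0, 1) for _ in range(n)]
        for k in (8, 12, 16, 20):
            print(k, len(k_block_universe(y, k)), "vs 2^{k(h-eps)} =", round(2 ** (k * (h - eps))))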

t(k g) .b. E For almost every x E A" there exists K(x) such that x x. and a gap g. To complete the proof of the weak Bernoulli waiting-time theorem. s. the g(j) have length g and alternate with the k-blocks w(j).a. (1) sa Claw (Di n Bs. . (a) x E Gn .b Proof of the exact-match waiting-time theorem.u(w(DlY s 40. k > K(x).8 and depend on a positive number y.s.log Wk (x.. t ].log Wk(x.} are given by Lemma 111.g(t)w(t)w(t + 1). where w(0) has length s < k + g. that is.5.2.1 k x . An index j E [1.) (1 + y) Ifl libt(w(i)). E) denote the least integer n such that k < (log n) I (h É). 1. eventually almost surely. to be specified later. THE WAITING-TIME PROBLEM. The s-shifted. y) > h -E€ so that it is enough to show that y g' Ck (x). k. stated as the following.. 1 . and J C [1.1] for which there are at least (1 .2. 203 III. Let Fix a weak Bernoulli measure a on A z with entropy h.( 0)-1)(k+g) ) < (1+ Y)Ww(i))..s < 2(k + g). If 8 1 is the set of all y then E A z that have j as a (y. t] is a (y.SECTION 111. n(k) = n(k. For each k put Ck(x) = Y Bk. k + g . The basic terminology and results needed here will be restated in the form they will be used.u . The sets G n C A Z are defined in terms of the splitting-index concept introduced in III. t] that are (y. s. y) < h. it is enough to establish the upper bound. eventually almost surely. Bk = Tk(a) = {xj: 2 -k(h+a) < wx lic) < where a is a positive number to be specified later. for 1 < j < t. k-block parsing of y.5. and fix c > 0. and the final block w(t +1) has length n . g)-splitting index for y E A z if . The sets {B k } are just the entropy-typical sets. (2) 1 lim sup . with gap g is yti` = w(0)g(1)w(1)g(2)w(2). k. Fix such an (0 (Yrk) )1 and note that ye Ck (x). for (t + 1)(k + g) < n < (t 2)(k ± g). g)-splitting indices for y.y)t indices j in the interval [1. (b) For all large enough k and t. and for y E G(y) there is at least one s E [0. k. Property (a) and a slightly weaker version of property (b) from that lemma are needed here. The sets {G. eventually almost surely. s. which depends on y. g)-splitting index. The entropy theorem implies that 4 E B.
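As I read the displayed parsing, each k-block w(j), 1 ≤ j ≤ t, is preceded by a gap of length g, after an initial block of length s; the leftover final block is ignored here. A small sketch (illustrative names) that extracts the k-blocks of the s-shifted, k-block parsing with gap g:

    # Sketch: the s-shifted, k-block parsing of y with gap g; returns the
    # k-blocks w(1), ..., w(t).
    def shifted_parsing(y, k, g, s):
        blocks, i, n = [], s, len(y)
        while i + g + k <= n:
            i += g                     # skip the gap g(j)
            blocks.append(y[i:i + k])  # the k-block w(j)
            i += k
        return blocks                  # the final partial block is discarded

    if __name__ == "__main__":
        y = "abcdefghijklmnopqrstuvwxyz"
        print(shifted_parsing(y, k=4, g=2, s=1))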

yields it(n[w(i)] JE.1 n Bsd) + y)' jJ fl bi (w(i)). is upper bounded by that w(j) (1 — 2 -k(h+a ) ) t(1-Y) and hence )t it(ck(x)n G n(k)(J s)) < (1 + y (1 This implies the bound (3). Let be a stationary coding of an i.2ty log y 0 y y (1 2—k(h+a))t( 1— Y) Indeed. let Ck (y) = fx 1`: x g [Uk(Yr k) )]81 so that. t] of cardinality at least t(1 — y).d. while k(n) denotes the largest value of k for which k < (log n)/(h c). eventually almost surely.y). for any y E G n (k)(J. eventually almost surely. Let {B k } be the sequence of sets given by the almost blowing-up characterization theorem. = . Theorem 111. I11. The key bound is (3) ii(Ck(x) n Gn. k. For each k.0 > 0 and K such that if k > K and C c Bk is a set of k-sequences of measure at least 2 -0 then 11([C]8/2) > 1/2. g)-splitting index for y.ualik(n)(31)]812) > 1/2) . . To establish the key bound (3).5. process such that h(A) = h.i. ENTROPY FOR RESTRICTED Choose k > K (x) and t so large that property (b) holds. (1).c Proof of the approximate-match theorem.3. y.3. since ii(x) > k(h+a) and since the cardinality of J is at least t(1 — y). fix a set J c [1. since there are k ± g ways to choose the shift s and at most 2-2ty lo gy ways to choose sets J c [1. n > N(. Theorem 111. But if the blocks w(j).1. s) denote the set of all y E G n (k) for which every j E J is a (y. s).' Ck (y). eventually almost surely. for all j E J. if this holds. and hence it is enough to show that xi. s) c the splitting-index product bound. Theorem 111. and choose . By the d-admissibility theorem. and fix E > 0. since t n/ k and n . which proves the theorem. and such that (t+ 1)(k+ g) < n(k) < (t 2)(k ± g). then a and y can be chosen so small that the bound is summable in k. The least n such that k < (log n)/ (h c) is denoted by n(k). This completes Li the proof of the exact-match waiting-time theorem. 3) > h ± c if and only if x i( E C k(y). (1/k) log Wk(x.5. Since y E Gn . CHAPTER III. Fix a positive number 3 and define G. yri: E G n .4.1. Since Gnuo (J. For almost every y E Aoe there is an N (y) such that yti' E G n . the probability x lic. k + g — 1] and let Gn(k) (J.204 ED CLASSES. j E J are treated as independent then. Fix such a y.(0) < (k + g)2. s. t] of cardinality at least t(1 — y) and an integer s E [0. eventually almost surely. it then follows that y 0 Ck(x).

with high probability. it follows that 4 g Ck(y).SECTION 111. 205 Suppose k = k(n) > K and n(k) > N(y). (1/k) log Wk(y. is thought of as a measure on A Z whose support is contained in B z .E Cj(Xt).u(Ck(Y) n Bk) < 2-10 . Si) to be large. that is Uk(yP) fl /ta ) = 0. then switches to another marker for a long time. cycling in this way through the possible markers to make sure that the time between blocks with the same marker is very large. ) < 6. be the Kolmogorov measure of the i.5.d. . Y . 1} such that g i (0) = ..d An exact-match counterexample. A family {Ci } of functions from G L to A L is said to be pairwise k-separated. 1471 (x. A family {Ci} of functions from GL to A L is 6-close to the identity if each member of the family is 6-close to the identity. eventually almost surely. -51. THE WAITING-TIME PROBLEM. A function C: GL A L is said to be 6-close to the identity if points are moved no more than 8. and hence E p(Ck(y)n Bk ) < oc. If the process has a large enough alphabet. that is.u o F -1 will be constructed along with an increasing sequence {k(n)} such that (4) 1 lim P. say. and sample paths cycle through long enough blocks of each symbol then W1 (x. If it were true that [t(Ck(Y) n Bk) > 2-0 then both [Ck(Y) n Bk]512 and [16 (Yr k) )]6/2 each have measure larger than 1/2. so that yrk) E G(k). yet are pairwise separated at the k-block level. process with binary alphabet B = {0. with a code that makes only a small density of changes in sample paths. y) will be large for a set of y's of probability close to 1/2.u i (1) = 1/2. 1. The classic example where waiting times are much longer than recurrence times is a "bursty" process. I11. eventually almost Li surely. then. for with additional (but messy) effort one can create binary markers. A stationary coding y = . To make this precise. x E GL. But this implies that the sets Ck(y) and [ilk (Yin (k) )]8 intersect. hence intersect.5. where A = {0. The marker construction idea is easier to carry out if the binary process p.„ denotes probability with respect to the product measure y x v. j j. The first problem is the creation of a large number of block codes which don't change sequences by much. for then blocks of 2's and 3's can be used to form markers.i. that is.u. where P. If x i = 1. The key to the construction is to force Wk(y. y) will be large for a set of probability close to 1. The same marker is used for a long succession of nonoverlapping k-blocks. completing the proof of the approximate-match theorem. Thus. 3 } . By this mimicking of the bursty property of the classical example. however. fix GL c A L . which contradicts the definition of Ck(y). This artificial expansion of the alphabet is not really essential. 4. . aL (C(4). y1 E Ci(Xt).) is forced to be large with high probability.„ (— log Wk(n)(y. The initial step at each stage is to create markers by changing only a few places in a k-block. Let . for then the construction can be iterated to produce infinitely many bad k. one in which long blocks of l's and O's alternate. where k is large enough to guarantee that many different markers are possible. k=1 Since 4 E Bk. 2. ) 7 ) N) = 0. n-400 k(n) for every N. if the k-block universes of the ranges of different functions are disjoint.

Note also that the blocks of 2's and 3's introduced by the construction are separated from all else by O's. 3} L consisting of those sequences in which no block of 2's and 3's of length g occur Then there is a pairwise k-separated family {C i : 1 < i < 2g} of functions from GL to A L .6 Let it be an ergodic measure with alphabet A = {0. Put . F (x)) < 6. g+1 2g+2 yj = x . Lemma 111. (iii) v({2. 3) such that i({2. Any block of length k in C(x) = yt must contain the (g -I.2)-block Owf 0. i (x lic) replaces the first 2g ± 3 terms of 4 by the concatenation 0wf0wf0. that is.8. 5") < 2kN ) 5_ 2 -g ± E.c. which is 8-close to the identity.(G Proof Order the set {2. then choose L > 2kN+2 lc so that L is divisible by k. separately to each k-block. Thus the lemma is established. where g > 1. there is a k > 2g +3 and astationary encoder F: A z 1--+ A z such that for y = p. Lemma 1. 2. The block-to-stationary construction. Proof Choose k > (2g +3)18. ENTROPY FOR RESTRICTED CLASSES. where jk+k jk+1 = Ci jk+k x )jk+1 o < <M. 1. Let L. Ci (xt) = yt. i > 2g + 3. (i) pvxv (wk'. and leaves the remaining terms unchanged. 3 > 0. as done in the string matching construction of Section I. and since different wf are used with different Ci .206 CHAPTER III. An iteration of the method. The encoder Ci is defined by blocking xt into k-blocks and applying Cs. with the additional property that no (g +1)-block of 2's and 3's occurs in any concatenation of members of U. The following lemma is stated in a form that allows the necessary iteration. hence no (g -I. Since each changes at most 2g+3 < 3k coordinates in a k-block. that is.5. Define the i-th k-block encoder a (4) = yilc by setting y = Yg+2 = Y2g+3 = 0 . Y2 = Yg+3 = W1 . 3 } g+ 1 ) = 0.C. and (C. 2. El The code family constructed by the preceding lemma is concatenated to produce a single code C of length 2g L.5 Suppose k > (2g+3)/8 and L = mk. Given c > 0.1)-block of 2's and 3's can occur in any concatenation of members of Ui C. GL. (ii) d(x. Let GL be the subset of {0. 2. 3}) = 0.: 1 < i < 2g1 be as in the preceding lemma for the given 3. o F-1 the following hold. is used to produce the final process. 31g in some way and let wf denote the i-th member. Lemma 111.. 1. The following lemma combines the codes of preceding lemma with the block-to-stationary construction in a form suitable for later iteration.(GL). is then used to construct a stationary coding for which the waiting time is long for one value of k.8. and N. its proof shows how a large number of pairwise separated codes close to the identity can be constructed using markers. for all x.5. the family is 3-close to the identity. pairwise k-separation must hold.5.
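As I read the definition of the i-th k-block encoder, it overwrites the first 2g + 3 symbols with the marker 0 w_i 0 w_i 0, where w_i is the i-th word of {2, 3}^g, and copies the remaining symbols unchanged, so different encoders disagree on the marker and each moves points by at most (2g + 3)/k. A toy sketch (small g and k only; names illustrative):

    # Sketch: the marker-based k-block encoders C_i of the construction.
    from itertools import product

    def markers(g):
        return ["".join(w) for w in product("23", repeat=g)]

    def block_encoder(block, w):
        """Write the marker 0 w 0 w 0 over the front of a k-block (k >= 2g+3)."""
        g = len(w)
        prefix = "0" + w + "0" + w + "0"
        return prefix + block[2 * g + 3:]

    if __name__ == "__main__":
        g, k = 2, 12
        w_list = markers(g)                     # 2^g available markers
        x = "010011010110"                      # a k-block over {0,1}
        for w in w_list[:2]:
            print(block_encoder(x, w))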

The intervals I. first note that the only changes To show that F = F made in x to produce y = F(x) are made by the block codes Ci. i0 j. then UJEJ /J covers no more than a limiting (6/4)-fraction of Z. since k > (2g +3)/8. 207 M = 2g L and define a code C: Am 1--* A m by setting C(xr) = xr. Either k > n + L — 1 or 2kN > Case 2c.11 +L-1\ yrn = C1 (4+L-1) . and such that ynn+L-1 = Ci ( X. n +Al]. and 2k N < m L — 1.5. for every x. Case 2a. c has the desired properties. Y (i--1)L+1 = Ci(XoL +1). 1 < <2 for xr E (GL) 29 Next apply the block-to-stationary construction of Lemma 1. and such that y = F(x) satisfies the following two conditions. and hence property (ii) holds. Both xi and i1 belong to coding blocks of F.k <n+L — 1. Z = U/1 . then yn (b) If i E Uj E jii . Case 1. note that if y and 53 are chosen independently. the encoded process can have no such blocks of length g + 1. 2kN+2 /E. since y and j3 were independently chosen and there are 2g different kinds of L-blocks each of which is equally likely to be chosen. then only two cases can occur. {C i }. Either x1 or i is in a gap of the code F. and j such that n < 1 < n + L — 1. for 41 (G L) 2g .SECTION 111. since the M-block code C is a concatenation of the L-block codes. To show that property (i) holds. Case 2. each of which changes only the first 2g + 3 places in a k-block. and iL iL g. In Case 2. F(x)) < 8. there will be integers n. and if y = F(x) and 5". (a) If = [n + 1. that are shorter than M will be called gaps.8. • n+ +1M = C(Xn ni ± 1M ). while those of length M will be called coding blocks. Furthermore. Case 1 occurs with probability at most €72 since the limiting density of time in the gaps is upper bounded by 6/4. since p. had no blocks of 2's and 3's of length g. hence. If Case 2c occurs then Case 2b occurs with probability at most 6/2 since L . = F(). such that if J consists of those j for which I is shorter than M. THE WAITING-TIME PROBLEM. Three cases are now possible.. = x i . then y. in. i. the changes produce blocks of 2's and 3's of length g which are enclosed between two O's. which establishes property (iii). m < 1 < m + L — 1. Thus d(x. Case 2a occurs with probability at most 2 -g.5 to obtain a stationary code F = F c with the property that for almost every x E A z there is a partition. L — 1. of the integers into subintervals of length no more than M. i = j. Case 2b.

by the pairwise k-separation property.5.. and hence one can make P(xo F(x)o) as small as desired. (c) On ) ({2. the following composition G. Wk (y. 33 ) k(n) co. Condition (b) guarantees that eventually almost surely each coordinate is changed only o F -1 . (b). . Cases 1. ENTROPY FOR RESTRICI ED CLASSES. ( wk(i)(y. while condition (a) holds for the case i = n + 1. in y x y probability. finitely many times. and (c) hold for n = 1. F„ have been constructed so that (a). the properties (i)-(iii) of Lemma 111.208 CHAPTER III. that is. The first step is to apply Lemma 111. The finish the proof it will be shown that there is a sequence of stationary encoders. . This completes the construction of the counterexample. (x)) almost surely. the entropy of the coin-tossing process . and an increasing sequence {k(n)} such that for the -1 .6 with 1) larger than both 2n + 2 and k(n) and g = n and p. so there will be a limit encoder F and measure y = desired property. n = 1. = F o Fn _i o o F1 and measure 01) = o G n x. G < 2ik(i)) < 27i +1 ± 2-i + 2—n+1 ± 2—n 9 1 < < n. step n In summary. Let 6 < 2 —(n+1) be a positive number to be specified later. since y (n+ 1) = o G n since it was assumed that 6 < 2 -(n+ 1) . 2. with GI = F1. (b) d (G n (x).5. and (c) hold for n.5. it is enough to show the existence of such a sequence of encoders {Fn } and integers {k(n)}. This will be done by induction.7 The number E can be made arbitrarily small in the preceding argument.6.5. Condition (a) guarantees that the limit measure y will have the 1 log Wk(n)(Y. F2 . however.6 with g = 0 to select k(1) > 3 and F1 such that (a). where Go(x) (a) pooxon . properties hold for each n > 1. . These properties imply that conditions (b) and (c) hold -± 1 1 . (b).g + 6/2. 2a. with n replaced by n 1. Condition (c) is a technical condition which makes it possible to proceed from step n to 1. > 2kN . the entropy of the encoded process y can be forced to be arbitrarily close to log 2. Az. Assume F1.. If 6 is small enough. . will hold for i < n 1..5. Fn. In particular. replaced by On ) to select k(n to select an encoder Fn+i : A z A z such that for v (n+ 1) = On ) 0 . Apply Lemma 111.u. and 2b therefore Puxv (Wk(Y < 2kN) _< 6/2 2 .6 hold. 3} n ) = 0. imply that which completes the proof of Lemma 111. for n + 1. then y (n+ 1) will be so close to On ) that condition (a). Remark 111. .: Az 1-4.

(x. if sample paths sample paths x and y are picked independently at random.SECTION 111. S(m) can be replaced by (S(m))n for any positive integer n without disturbing properties (a) and (b). (b) Each Si (m) is a merging of {Si (m — 1): j (c) 2kni Nin < t(m)/m. for any integer N. Indeed. (a) S(m) is a disjoint union of a collection {Si (m): j < .strongly . Once km is determined.e An approximate-match counterexample. km ) strongly-separated sets of the same cardinality.-indices below the end of each £(m)-block. inductive application of the merging/separation lemma.5. Let a be a positive number smaller than 1/2. 1/2). This is just the subsequence form of the desired result (5). Prob (W k.b. = 0. and hence property (c) can also be assumed to hold.5. ) > (1 — 1/40 2 (1 — 1/rn)2 . then with probability at least (1 — 1/40 2 (1 — 1/m) 2 .3. 209 III. while property (c) implies that (6) lim Prob (Wk„. a) > 2 which proves (6). y. VN. (x . since Nm is unbounded..3. y. Given an increasing unbounded sequence {Nm }. J] is such that 10 -1 ( = M/J. The only new fact here is property (c).3. however. produces a sequence {S(m) C A €("z ) : in > 1} and an increasing sequence {k m }. such that the following properties hold for each tn. and a decreasing sequence {a m } c (a. . their starting positions x1 and yi will belong to £(m)-blocks that belong to different j (m) and lie at least 2 km N. A merging of a collection {Si : j E J} is a product of the form S = ni =1 S Cm) [1. First some definitions and results from 111. defined by {S(m)} is ergodic. where [D]„ denotes the a-blowup of D. A weaker subsequence result is discussed first as it illustrates the ideas in simpler form. Lemma 111. a) < 2km N ) = 0. Property (b) guarantees that the measure p. The goal is to show that there is an ergodic process it for which (5 ) 1 lim Prob (— log k co k W k (X . D c A k are a-separated if C n [D]. see Remark 111. Two subsets S.5. S' C A f are (a. for each where J divides M and 0: [1. J]. M] j E [1. The full kblock universe of S c A f is the set k-blocks that appear anywhere in any concatentation of members of S.3. Such a t can be produced merely by suitably choosing the parameters in the strongnonadmissibility example constructed in 111./m } of pairwise (am . Thus. THE WAITING-TIME PROBLEM.b will be recalled. K) . Two sets C. a) N) = 0.. y.6.separated if their full k-block universes are a-separated for any k > K.

The stronger form, (5), requires control of what happens in the entire interval k_m ≤ k < k_{m+1}. The cutting and stacking form of the construction, discussed in III.3.b, provides this control: further independent cutting and stacking can always be applied at any stage without losing separation properties already gained, and hence, as in the preceding discussion, the column structures at any stage can be made so long that bad waiting-time behavior is guaranteed for the entire range k_m ≤ k < k_{m+1}. This leads to an example satisfying the stronger result, (5). The reader is referred to [76] for a complete discussion of this final argument.
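The approximate-match waiting time used above can be illustrated concretely. The following sketch is not from the text; it adopts one natural reading of W_k(x, y, α), namely the least m ≥ 1 such that the k-block of y starting at position m is within per-symbol Hamming distance α of x_1^k, and the fair-coin illustration at the end is only an example.

```python
import math
import random

def waiting_time_approx(x, y, k, alpha):
    """Least m >= 1 with per-symbol Hamming distance between x[0:k] and
    y[m-1:m-1+k] at most alpha; None if no approximate match is found."""
    target = x[:k]
    for m in range(1, len(y) - k + 2):
        block = y[m - 1:m - 1 + k]
        mismatches = sum(a != b for a, b in zip(target, block))
        if mismatches / k <= alpha:
            return m
    return None

if __name__ == "__main__":
    # For two independent fair-coin sequences, (1/k) log2 W_k(x, y, alpha)
    # stays bounded; the example in the text is built to defeat any such bound.
    rng = random.Random(0)
    x = [rng.randint(0, 1) for _ in range(100)]
    y = [rng.randint(0, 1) for _ in range(200000)]
    for k in (5, 10, 15):
        w = waiting_time_approx(x, y, k, alpha=0.25)
        if w is not None:
            print(k, w, math.log2(w) / k)
```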

Chapter IV

B-processes.

The focus in this chapter is on B-processes, that is, finite alphabet processes that are stationary codings of i.i.d. processes. Various other names have been used, each arising from a different characterization, for example, almost block-independent processes, finitely determined processes, and very weak Bernoulli processes. This terminology and many of the ideas to be discussed here are rooted in Ornstein's fundamental work on the much harder problem of characterizing the invertible stationary codings of i.i.d. processes, the so-called isomorphism problem in ergodic theory, [46]. Invertibility is a basic concept in the abstract study of transformations, but is of little interest in stationary process theory, where the focus is on the joint distributions rather than on the particular space on which the random variables are defined. This is fortunate, for while the theory of invertible stationary codings of i.i.d. processes is still a complex theory, it becomes considerably simpler when the invertibility requirement is dropped.

Section IV.1 Almost block-independence.

A natural and useful characterization of stationary codings of i.i.d. processes, the almost block-independence property, will be discussed in this first section. The almost block-independence property, like other characterizations of B-processes, is expressed in terms of the d-metric (or some equivalent metric). As will be done here, either measure or random variable notation will be used for the d-distance, that is, d_n(X_1^n, Y_1^n) will often be used in place of d_n(μ, ν), if μ is the distribution of X_1^n and ν the distribution of Y_1^n.

A block-independent process is formed by extending a measure μ_n on A^n to a product measure on (A^n)^∞, then transporting this to a T^n-invariant measure μ̃ on A^∞. In other words, μ̃ is the measure on A^∞ defined by the formula

μ̃(x_1^{mn}) = ∏_{j=1}^{m} μ_n( x_{(j−1)n+1}^{(j−1)n+n} ),   x_1^{mn} ∈ A^{mn}, m ≥ 1,

together with the requirement that μ̃(x_i^j) is obtained by summing μ̃(x_1^{mn}) over the remaining coordinates, for all i ≤ j ≤ mn and all x_i^j. Note that μ̃ is T^n-invariant, though it is not, in general, stationary. As in earlier chapters, however, randomizing the start produces a stationary process.

The independent n-blocking of an ergodic process μ is the block-independent process defined by the restriction μ_n of μ to A^n. It will be denoted by μ̃(n), or by μ̃ when n is understood. In random variable language, the independent n-blocking of {X_i} is the T^n-invariant process {Y_i} defined by the following two conditions.

(a) Y_{(j−1)n+1}^{(j−1)n+n} and X_1^n have the same distribution, for each j ≥ 1.
(b) Y_{(j−1)n+1}^{(j−1)n+n} is independent of {Y_i: i ≤ (j − 1)n}, for each j ≥ 1.

An ergodic process μ is almost block-independent (ABI) if given ε > 0, there is an N such that if n ≥ N and μ̃ is the independent n-blocking of μ, then d(μ, μ̃) ≤ ε.

An i.i.d. process is clearly almost block-independent, since the ABI condition holds for every n ≥ 1 and every ε > 0. The almost block-independence property is preserved under stationary coding and, in fact, characterizes the class of stationary codings of i.i.d. processes.

Theorem IV.1.1 (The almost block-independence theorem.)
An ergodic process is almost block-independent if and only if it is a stationary coding of an i.i.d. process.

This theorem and the fact that mixing Markov chains are almost block-independent are the principal results of this section.

Theorem IV.1.2
A mixing Markov chain is almost block-independent.

Note, in particular, that the two theorems together imply that a mixing Markov chain is a stationary coding of an i.i.d. process, which is not at all obvious.

The fact that stationary codings of i.i.d. processes are almost block-independent is established by first proving it for finite codings, then by showing that the ABI property is preserved under the passage to d-limits. Both of these are quite easy to prove. It is not as easy to show that an almost block-independent process μ is a stationary coding of an i.i.d. process, for this requires the construction of a stationary coding from some i.i.d. process ν onto the given almost block-independent process μ. Of course, all this requires that h(ν) ≥ h(μ), since stationary coding cannot increase entropy. In fact it is possible to show that an almost block-independent process μ is a stationary coding of any i.i.d. process ν for which h(ν) > h(μ). The construction is much simpler, however, if it is assumed that the i.i.d. process has infinite alphabet with continuous distribution, for then any n-block code can be represented as a function of the first n coordinates of the process and d-joinings can be used to modify such codes. This will be carried out by showing how to code a suitable i.i.d. process onto a process d-close to μ, then how to make a small density of changes in the code to produce a process even closer in d. The fact that only a small density of changes are needed insures that an iteration of the method produces a limit coding equal to μ.

The theory to be developed in this chapter could be stated in terms of approximation by the concatenated-block process defined by μ_n, but it is generally easier to use the simpler block-independence ideas.
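As an illustration of the definitions (not part of the text), the following sketch draws a finite sample path of the independent n-blocking of a process, given a routine that samples n-blocks from μ_n; the supplier `sample_n_block` and the Markov example are hypothetical, and the uniform shift at the end corresponds to randomizing the start.

```python
import random

def independent_n_blocking_path(sample_n_block, n, num_blocks, rng):
    """Concatenate independently drawn n-blocks, then randomize the start."""
    path = []
    for _ in range(num_blocks):
        block = sample_n_block(rng)        # one n-block drawn from mu_n
        path.extend(block)
    shift = rng.randrange(n)               # uniform shift in {0, ..., n-1}
    return path[shift:]

def markov_n_block(n, p=0.9):
    """Example supplier: n-blocks of a symmetric two-state Markov chain."""
    def sample(rng):
        x = [rng.randint(0, 1)]
        for _ in range(n - 1):
            x.append(x[-1] if rng.random() < p else 1 - x[-1])
        return x
    return sample

if __name__ == "__main__":
    print(independent_n_blocking_path(markov_n_block(8), 8, 5, random.Random(0)))
```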

IV.1.a B-processes are ABI processes.

First it will be shown that finite codings of i.i.d. processes are almost block-independent. Let {Y_i} be a finite coding of the i.i.d. process {X_i} with window half-width w, and let {Z_i} be the independent n-blocking of {Y_i}. The key fact is that coded blocks separated by twice the window half-width are independent. If the first and last w terms of each n-block are omitted, the successive n − 2w blocks

Y_{w+1}^{n−w−1}, Y_{n+w+1}^{2n−w−1}, ..., Y_{(m−1)n+w+1}^{mn−w−1}

are independent with the distribution of Y_{w+1}^{n−w−1}, by the definition of window half-width and the assumption that {X_i} is i.i.d., while the successive n − 2w blocks

Z_{w+1}^{n−w−1}, Z_{n+w+1}^{2n−w−1}, ..., Z_{(m−1)n+w+1}^{mn−w−1}

are independent with the distribution of Y_{w+1}^{n−w−1}, by the definition of independent n-blocking. The fact that both are i.i.d., when thought of as A^{n−2w}-valued processes, implies that

(1)   d_{m(n−2w)}( Y_{w+1}^{n−w−1} ⋯ Y_{(m−1)n+w+1}^{mn−w−1}, Z_{w+1}^{n−w−1} ⋯ Z_{(m−1)n+w+1}^{mn−w−1} ) = 0,

by property (f) of the d-property list, Lemma I.9.11. Thus mn d_{mn}(Z_1^{mn}, Y_1^{mn}) is upper bounded by m(n − 2w) d_{m(n−2w)}(⋅,⋅) + 2wm, by property (c) of the d-property list, so that d_{mn}(Z_1^{mn}, Y_1^{mn}) ≤ 2w/n, for all m. Since this is less than ε, for any n ≥ 2w/ε, it follows that finite codings of i.i.d. processes are almost block-independent.

Next assume μ is the d-limit of almost block-independent processes. Given ε, choose an almost block-independent process ν such that d(μ, ν) < ε/3. Since ν is almost block-independent there is an N such that if n ≥ N and ν̃ is the independent n-blocking of ν, then d(ν, ν̃) < ε/3. Fix such an n and let μ̃ be the independent n-blocking of μ. The fact that both μ̃ and ν̃ are i.i.d., when thought of as A^n-valued processes, implies that

d(μ̃, ν̃) = d_n(μ_n, ν_n) ≤ d(μ, ν) < ε/3.

The triangle inequality then gives d(μ, μ̃) ≤ d(μ, ν) + d(ν, ν̃) + d(ν̃, μ̃) < ε. This proves that the class of ABI processes is d-closed.

In summary, finite codings of i.i.d. processes are almost block-independent and d-limits of ABI processes are almost block-independent; since a stationary coding is a d-limit of finite codings, it follows that stationary codings of i.i.d. processes are almost block-independent. Furthermore, the preceding argument applies to stationary codings of infinite alphabet i.i.d. processes onto finite alphabet processes, and hence such coded processes are also almost block-independent. This completes the proof that B-processes are almost block-independent.
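As a concrete, purely illustrative companion to the window half-width idea, here is a minimal sketch of a finite stationary code applied through a sliding window; the majority-vote code is an arbitrary choice, not the book's. Output values at positions more than 2w apart depend on disjoint input coordinates, hence are independent when the input is i.i.d., which is the fact used above.

```python
import random

def sliding_window_code(x, w, f):
    """Apply the code f to the (2w+1)-window around each position.
    Only positions with a full window are coded, so the output is shorter by 2w."""
    return [f(x[i - w:i + w + 1]) for i in range(w, len(x) - w)]

if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.randint(0, 1) for _ in range(30)]
    # example code with window half-width w = 2: majority vote in the window
    y = sliding_window_code(x, 2, lambda win: int(sum(win) > len(win) // 2))
    print(x)
    print(y)
```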

is a stationary coding of an infinite alphabet i. Let {Xi } be an almost block-independent process with finite alphabet A and Kolmogorov measure A. there will almost surely exist a limit code F(z)1 = lime Ft (z) i which is stationary and for which cl(A.d. producing thereby a limit coding F from the vector-valued i. The lemma can then be applied sequentially with Z (t) i = (V(0) i . B-PROCESSES. and hence any i. as t — * oc. with window width 1. 1) and.d. for it means that. of any other i. i. (i) d(p. V (1) i . since any random variable with continuous distribution is a function of any other random variable with continuous distribution.d. process {Z. V): G(z)o F (z. process independent of {Z. v)o}) < 4E. = 2 -t.1. process Vil with Kolmogorov measure A for which there is a sequence of stationary codes {Ft } such that the following hold. A block code version of the lemma is then used to change to an arbitrarily better block code. This fact. process (2) onto the almost block-independent process A.5 > 0 there is a finite stationary coding F of {(Z i .214 CHAPTER IV. which allows one to choose whatever continuously distributed i.i.o G -1 ) <E. process with continuous distribution is a stationary coding.d. V(t) i . An i. (A x o F -1 ) < 8.. (a)a(p.3 is to first truncate G to a block code. The exact representation of Vi l is unimportant. is almost block-independent and G is a finite stationary coding of a continuously distributed i. and if is a binary equiprobable i. process with continuous distribution. It will be shown that there is a continuously distributed i. The idea of the proof of Lemma IV..d.i.1. (b)Er Prob((Ft+ i (z))o (Fr(z))o) — < oc.d. Lemma W. V(1) i . Vi )} such that the following two properties hold. V(t — 1) i ). which allows the creation of arbitrarily better codes by making small changes in a good code. The lemma immediately yields the desired result for one can start with the vector valued. The second condition is important. . A. for t > 1.i. A = X o F -1 . for all alphabet symbols b. process (2) {(V(0). The goal is to show that p.d. V(2)1 .d. that is. where V(0)0 is uniform on [0. each coordinate Fe(z)i changes only finitely often as t oc. (ii) (A x v) ({(z.} with Kolmogorov measure A. countable component.d.i. o Fr-1 ) > 0.3 (Stationary code improvement. IV.i. process is convenient at a given stage of the construction. A o F -1 ) = 0. then given ..i. .i..i.i. V(t)0 is binary and equiprobable.. will be used in the sequel. Therefore.}.b ABI processes are B-processes.d. almost surely. } is continuously distributed if Prob(Zo = b) = 0. . process {Z.)} with independent components. The key to the construction is the following lemma. process with continuous distribution.i.) If a(p. 1. to obtain a sequence of stationary codes {Ft } that satisfy the desired properties (a) and (b). and Er = 2-t+1 and 3. where p. together with an arbitrary initial coding F0 of { V(0)i }.

A finite stationary code is truncated to a block code by using it except when within a window half-width of the end of a block. since the only disagreements can occur in the first or last w places. such that d. Proof This is really just the mapping form of the idea that d-joinings are created by cutting up nonatomic spaces.1. using the auxiliary process {Vi } to tell when to apply the block code. ergodic process {Ri } will be called a (S.) Let (Y.8. n)-blocking process and if 147Ç' is independent of {(Wi . The following terminology will be used. if the waiting time between l's is never more than n and is exactly equal to n with probability at least 1 — 3. and b and c are fixed words of length w.(a7).(F(u)7. An ergodic process {Wi } is called a (8. y) be a nonatomic probability space and suppose *: Y 1-± A n is a measurable mapping such that V 0 —1 ) < 8.1 = diliq is a coding block) = . Clearly d.G(u7)) < S. The block code version of the stationary code improvement lemma is stated as the following lemma. Lemma IV. ALMOST BLOCK-INDEPENDENCE. The i. where u is any sample path for which û7 = /47. for i< j<i+n-1and Ri+n _1 = 1. For n > N define G(u7) = bF(ii)7± 1c. (4 . The following lemma supplies the desired stronger form of the ABI property. nature of the auxiliary process (Vi) and its independence of the {Z1 } process guarantee that the final coded process satisfies an extension of almost block-independence in which occasional spacing between blocks is allowed. . Given S > 0 there is an N such that if n > N then there is an n-block code G.i. see Exercise 12 in Section 1. An ergodic pair process {(Wi . a joining has the form X(a'i' . an extension which will be shown to hold for ABI processes. 215 Finally. for v-almost every sample path u.4 (Code truncation. (Y))).SECTION IV. given that is a coding block. while retaining the property that the blocks occur independently.b. on A'.1 = On -1 1.9. if {(147„ Ri )} is a (S.(F (u)7. Then there is a measurable mapping 4): Y 1-+ A n such that ji = y o 0-1 and such that Ev(dn((k (Y). for v-almost every sample path u. n)-independent process and Prob(14f.d. A block R:+n-1 of length n will be called a coding block if Ri = 0.u. the better block code is converted to a stationary code by a modification of the block-to-stationary method of I. A binary. n)-independent process if {Ri } is a (8. with associated (8. Proof Let w be the window half-width of F and let N = 2w/8.5 (Block code improvement.) Let F be a finite stationary coding which carries the stationary process v onto the stationary process 71. (y))) < 8. n)-blocking process {Ri } . Ri ): i < 0 } . b7) = v(0-1 (4) n (b7)) for 4): Y H A n such that A n = vo V I . 1.? +n--. n)-truncation of F. n)-independent extension of a measure pt. called the (S. Ri)} will be called a (S. Lemma IV. The almost block-independence property can be strengthened to allow gaps between the blocks. Indeed. This is well-defined since F(ii)7+ 1 depends only on the values of û. 4 E A'. that is.1. b7)) = E v(dn (0 (Y).1. G(u7)) < 8. n)blocking process. so that Vd.
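The (δ, n)-blocking idea just described can be illustrated with a small sketch, not from the text: a 0-1 sequence marks block starts, the gap between successive 1's is n with probability close to one (the construction of the blocking sequence below is just one simple possibility and ignores stationarity), and a hypothetical n-block code `block_code` is applied on the coding blocks while a filler symbol is used elsewhere, in the spirit of the block-to-stationary conversion.

```python
import random

def blocking_sequence(total_len, n, delta, rng):
    """A (delta, n)-blocking-style 0-1 sequence: 1's mark block starts and the
    gap between successive 1's is n with probability 1 - delta, shorter otherwise."""
    r, i = [], 0
    while i < total_len:
        gap = n if rng.random() > delta else rng.randint(1, n - 1)
        r.extend([1] + [0] * (gap - 1))
        i += gap
    return r[:total_len]

def apply_block_code(x, r, n, block_code, filler):
    """Apply the n-block code on coding blocks (gaps of exact length n);
    copy a fixed filler symbol on the remaining, shorter blocks."""
    y, i = [], 0
    while i < len(x):
        j = i + 1
        while j < len(r) and r[j] == 0:
            j += 1
        if j - i == n:
            y.extend(block_code(x[i:j]))
        else:
            y.extend([filler] * (j - i))
        i = j
    return y

if __name__ == "__main__":
    rng = random.Random(2)
    x = [rng.randint(0, 1) for _ in range(60)]
    r = blocking_sequence(len(x), n=8, delta=0.1, rng=rng)
    print(apply_block_code(x, r, 8, lambda b: list(reversed(b)), filler=0))
```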

Lemma 1.AN Ni1+1 has the same distribution as X' N1 +i . process. An application of property (g) in the list of dproperties.. in > Mi.1.11. If M i is large enough so that tk NI — nk nk+1 tIcNI is negligible compared to M1 and if 8 is small enough. and that i7 V -7. This is solved by noting that conditioned on starting at a coding n-block.} be a continuously distributed i. Since y is not stationary it need not be true that clni (it„„ um ) < E/3.6 (Independent-extension. of an n-block may not be an exact multiple of NI . (3) Put N2 = M1+ 2N1 and fix n > N2. but i j will be for some j < NI.} and {W. then. and that 8 be so small that very little is outside coding blocks. for all ni.216 CHAPTER IV. with associated (6. n)-blocking process {R. and let À x y denote the Kolmogorov measure of the direct product {(Z„ V. completing the proof of Lemma IV. so that (3) yields t NI (4) ytS:N IV I I± Wits:N A: 1+ ) < € /3. Proof In outline. can be well-approximated by a (6. Let 6 be a positive number to be specified later and let be the Kolmogorov measure of a (8.} a binary. let {W} denote the W-process conditioned on a realization { ri } of the associated (6. It is enough to show how an independent NI -blocking that is close to p.3.} are now joined by using the joining that yields (4) on the blocks ftk + 1. where y is the Kolmogorov measure of the independent N I -blocking IN of pt.1. To make this outline precise. ii(v.6. here is the idea of the proof. with respective Kolmogorov measures À and y. skNi < Note that m = skNI — tkNi is at least as large as MI. Let p. { Proof of Lemma IV. B-PROCESSES. Let {n k } be the successive indices of the locations of the l's in { ri}. n)-independent extension {W} of An .} will be less than 2E13. For each k for which nk +i — n k = n choose the least integer tk and greatest integer sk such that tk Ni > nk .) < c.) Let {X. The processes Y. but at least there will be an M1 such that dni (/u ni . n)-blocking process (R 1 ). Let {Z.)}. y) < 6/3. and hence cl(p. process and {V.d. vm ) < 13. < 2E/3. be . The obstacle to a good d-fit that must be overcome is the possible nonsynchronization between n-blocks and multiples of NI -blocks. Towards this end.1.} be an almost block-independent process with Kolmogorov measure t. the ci-distance between the processes {K} and {W. n)independent extension 11. n)-independent extension. The goal is to show that if M1 is large enough and 3 small enough then ii(y.d. Given E > 0 there is an N and a 6 > 0 such that if n > N then a(p. for almost every realization {r k }. for any (8. equiprobable i.. hence all that is required is that 2N1 be negligible relative to n. say i.i.}. } .9. sk NI L and an arbitrary joining on the other parts. first choose N1 so large that a(p. yields the desired result. Thus after omitting at most N1 places at the beginning and end of the n-block it becomes synchronized with the independent Ni -blocking. provided only that 6 is small enough and n is large enough. the starting position.i.. of A n .i1. independent of {Z. < 2E/3. Lemma IV. Ft) < c.

and let G be a finite stationary < E. The goal is to construct a finite stationary coding F of {(Z. ALMOST BLOCK-INDEPENDENCE. Yk n — 1]. and coding to some fixed letter whenever this does not occur. (ii) (X x y) (((z. by the ergodic theorem applied to {R. x y) o <8. and . coding such that d(p. v): G(z)0 F (z. 0(Z7)) I R is a coding block) so the ergodic theorem applied to {Wk }. n)-blocking process {Ri } and lift this to the stationary coding g* of {(Z. and hence the proof that an ABI process is a stationary coding of an i. Let F(z.)} onto {(Z„ Ri )} defined by g*(z.}. and define F(z. first note that {Zi } and {Ri } are independent. almost surely.)} such that the following hold. The see why property (ii) holds. thereby establishing property (ii). This completes the proof of the lemma. 217 an almost block-independent process with alphabet A. Y)o}) <2e E< 4e. define +n-1 ). In other words. 0(Z))) 26. hence property (i) holds. Towards this end. r Ykk+" = dp(zYA Yk ) for all k. the dn(G(Z)kk+n-1 limiting density of indices i for which i V Uk[Yk. (5) Next. n)-truncation G of G.d. V. yk + n — 1] is almost surely at most & it follows from (6) that (X x y) ({(z.) < 8. implies that the concatenations satisfy (6) k—>oo lirn dnk(O(Wi) • • • 0(Wk). let g be a finite stationary coding of {V. Xn and hence the block code improvement lemma provides an n-block code 0 such that = o 0 -1 and E(dn (à . Ø(Z7))) = E(dn (d (Z 11 ). and so that there is an (c. y) = (z.z (d(Z11 ). (5). .SECTION IV 1. Furthermore. r) be the (stationary) code defined by applying 0 whenever the waiting time between 1 's is exactly n. p e-. this means that < 2E. along with the fitting bound. } . Since.)•••d(wk))< 2c. X o Fix 3 > O. g(v)). r)i = a. let {yk } denote the (random) sequence of indices that mark the starting positions of the successive coding blocks in a (random) realization r of {R. process. a(w. (X. F(Z. By definition. and hence the process defined by Wk = Z Yk+n-1' i i .u.d. a(4 +n-i„ )) < E. i fzi Uk[Yk.i. n)-independent extension of i then d(p.(Z).i.. however. first choose n and S < e so that if is a (S. with Wo distributed as Z. Clearly F maps onto a n)-independent extension of . E(d.. vgk +n-1 = 0 (Wk). (i) d(p. In particular. V.} onto a (S. if g(v) = r. y) 0 }) < 4e. v): G(z)0 F(z. F(u.. fix a E A.

Remark IV.1.7
The existence of block codes with a given distribution, as well as the independence of the blocking processes used to convert block codes to stationary codes onto independent extensions, are easy to obtain for i.i.d. processes with continuous distribution. With considerably more effort, suitable approximate forms of these ideas can be obtained for any finite alphabet i.i.d. process whose entropy is at least that of μ. These results, which are the basis for Ornstein's isomorphism theory, are discussed in detail in his book, [46], see also [63, 42].

IV.1.c Mixing Markov and almost block-independence.

The key to the fact that mixing Markov implies ABI, as well as to several later results about mixing Markov chains, is a simple form of coupling, a special type of joining frequently used in probability theory, see [35]. A joining {(X_n, Y_n)} is obtained by running the two processes independently until they agree, then running them together. The Markov property guarantees that this is indeed a joining of the two processes, with the very strong additional property that a sample path is always paired with a sample path which agrees with it ever after as soon as they agree once. In particular, n d_n(X_1^n, Y_1^n) cannot exceed the expected time until the two processes agree. Furthermore, the expected value of the coupling function is bounded, independent of n and a, see Exercise 2, below, so that d_n(μ, ν) → 0, as n → ∞. The precise formulation of the coupling idea needed here follows.

Let {X_n} be a (stationary) mixing Markov process and let {Y_n} be the nonstationary Markov chain with the same transition matrix as {X_n}, but which is conditioned to start with Y_0 = a. Let μ and ν be the Kolmogorov measures of {X_n} and {Y_n}, respectively. For each n ≥ 1, the coupling function is defined by the formula

w_n(a_1^n, b_1^n) =
   min{ i ∈ [1, n−1]: a_i = b_i },  if { i ∈ [1, n−1]: a_i = b_i } ≠ ∅,
   n,  otherwise.

The coupling measure is the measure λ_n on A^n × A^n defined by

(7)   λ_n(a_1^n, b_1^n) =
   ν(a_1^w) μ(b_1^w) μ(b_{w+1}^n | b_w),  if w_n(a_1^n, b_1^n) = w < n and a_{w+1}^n = b_{w+1}^n,
   ν(a_1^n) μ(b_1^n),  if w_n(a_1^n, b_1^n) = n,
   0,  otherwise.

A direct calculation, making use of the Markov properties for both measures, shows that λ_n is a probability measure with ν_n and μ_n as marginals.

Lemma IV.1.8 (The Markov coupling lemma.)
The coupling function satisfies the bound

(8)   d_n(μ, ν) ≤ (1/n) E_{λ_n}(w_n).

In particular, d_n(μ, ν) → 0, as n → ∞.

Proof. The inequality (8) follows from the fact that the λ_n defined by (7) is a joining of μ_n and ν_n which is concentrated on pairs of sequences that agree ever after, once they agree, so that the fraction of disagreements within a pair is at most w_n/n. This establishes the lemma.
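The coupling just described is easy to simulate. The sketch below is not from the text: it runs a two-state mixing chain (example parameters only) from its stationary distribution and from a fixed state a, independently until the two paths first agree and together afterwards, and uses the empirical mean of the coupling time, divided by n, as an upper bound on d_n(μ, ν) in the spirit of (8); endpoint conventions differ slightly from the definition of w_n above.

```python
import random

def coupled_paths(P, x0, a, n, rng):
    """Run two chains with transition matrix P (dict of dicts): one from x0,
    one from a, independently until they first agree, together afterwards.
    Returns both paths and the coupling time (n if they never agree)."""
    def step(s, r):
        u, acc = r.random(), 0.0
        for t, p in P[s].items():
            acc += p
            if u <= acc:
                return t
        return t                      # guard against rounding
    x, y, coupled, w = [x0], [a], False, n
    for i in range(1, n + 1):
        nx = step(x[-1], rng)
        ny = nx if coupled else step(y[-1], rng)
        x.append(nx); y.append(ny)
        if not coupled and nx == ny:
            coupled, w = True, i
    return x[1:], y[1:], w

if __name__ == "__main__":
    P = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # example mixing chain
    rng, n, trials, total = random.Random(3), 50, 2000, 0
    for _ in range(trials):
        x0 = 0 if rng.random() < 0.6 else 1          # stationary dist of P is (0.6, 0.4)
        _, _, w = coupled_paths(P, x0, a=1, n=n, rng=rng)
        total += w
    print("estimated E(w_n)/n, an upper bound on d_n:", total / trials / n)
```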

To prove that a mixing Markov chain is d-close to its independent n-blocking, the idea is to construct a joining by matching successive n-blocks. The notation d_n(X_{kn+1}^{kn+n}/x_{kn}, X_{kn+1}^{kn+n}) will be used to denote the d_n-distance between the conditional distribution of X_{kn+1}^{kn+n}, given X_{kn} = x_{kn}, and the unconditioned distribution of X_{kn+1}^{kn+n}. The coupling lemma implies that

(9)   d_n( X_{kn+1}^{kn+n}/x_{kn}, X_{kn+1}^{kn+n} ) ≤ ε,  for each x_{kn},

for all sufficiently large n. This is all that is needed, because, by the Markov property, the distribution of X_{kn+1}^{kn+n}, conditioned on the past, depends only on X_{kn}.

Fix a mixing Markov chain {X_i} with Kolmogorov measure μ, and fix ε > 0. Fix n for which (9) holds and let {Z_i} be the independent n-blocking of {X_i}, with Kolmogorov measure μ̃. For each positive integer m, a joining λ_mn of μ_mn with μ̃_mn will be constructed for which the expected value of d_mn(x_1^{mn}, z_1^{mn}) is small. To construct λ_mn, first choose, for each x_{kn}, a joining λ_n(·, ·/x_{kn}) of the unconditioned distribution of X_{kn+1}^{kn+n} and the conditional distribution of X_{kn+1}^{kn+n}, given X_{kn} = x_{kn}, that realizes the d_n-distance between them. For each positive integer m, let λ_mn be the measure on A^{mn} × A^{mn} defined by

(10)   λ_mn(x_1^{mn}, z_1^{mn}) = μ(x_1^n) μ̃(z_1^n) ∏_{k=1}^{m−1} λ_n( z_{kn+1}^{kn+n}, x_{kn+1}^{kn+n}/x_{kn} ).

A direct calculation, using the Markov property of {X_i} plus the fact that Z_{kn+1}^{kn+n} is independent of Z_j, j ≤ kn, shows that λ_mn is a joining of μ_mn with μ̃_mn. Summing over the successive n-blocks and using the bound (9) yields the bound

E_{λ_mn}( d_mn(x_1^{mn}, z_1^{mn}) ) ≤ 1/m + ε,

which, since m can be taken arbitrarily large and ε is arbitrary, proves that d(μ, μ̃) → 0 as n → ∞, and hence that {X_i} is almost block-independent. Since the proof extends to Markov chains of arbitrary order, this completes the proof that mixing Markov chains are almost block-independent.

The almost block-independent processes are, in fact, exactly the d-limits of mixing Markov processes, in the sense of the following theorem.

Theorem IV.1.9
An almost block-independent process is the d-limit of mixing Markov processes.

Proof. By the independent-extension lemma, Lemma IV.1.6, it is enough to show that for any δ > 0 and all sufficiently large n, the measure μ_n on A^n has a (δ, n)-independent extension which is a mixing Markov chain of some order. The concatenated-block process defined by μ_n is a function of an irreducible Markov chain; the only problem, therefore, is to produce a (δ, n)-independent extension which is actually a mixing Markov chain, not just a function of a mixing Markov chain. A suitable random insertion of spacers between the blocks produces a (δ, n)-independent extension which is mixing Markov of some order. This is easy to accomplish by marking long concatenations of blocks with a symbol not in the alphabet, then randomly inserting an occasional repeat of this symbol, as carried out below.

Show that if { 1/17. The process {Zi } defined by Zi = f(Y 1 ) is easily seen to be (mn + 2)-order Markov. irreducible. [66]. } and {W} are (6.) This completes the proof of Theorem IV. i) can only go to (sxr". independent of n. n)-blocking process {R.1. and let ft be the measure on sCrn defined by the formula ft(sw( 1 ) . then randomly inserting an occasional repeat of this symbol. • w(m)) = i=1 Fix 0 < p < 1. The coupling proof that mixing Markov chains are almost block-independent is modeled after the proof of a conditional form given in [41]. Show that the defined by (10) is indeed a joining of . 1) with probability (1 — p). i 1). to be specified later. } . respectively.d Exercises. with the same associated (6. Remark IV. As a function of the mixing Markov chain {ri }. il To describe the preceding in a rigorous manner.1. 1. where w(i) E C. nm 1} and transition matrix defined by the rules (a) If i < nm 1. it is mixing.10 The use of a marker symbol not in the original alphabet in the above construction was only to simplify the proof that the final process {Z1 } is Markov. and to (syr".11 The proof that the B-processes are exactly the almost block-independent processes is based on the original proof. concatenations of blocks with a symbol not in the alphabet.a(syr"). where y and D are the Kolmogorov measures of {W} spectively. IV. Show the expected value of the coupling function is bounded. . An). such that 4. if 1 < i < mn + 1.1. then d(v. Furthermore. n)-independent extensions of An and fi n . 3. 0) with probability pii(syr).9. since once the marker s is located in the past all future probabilities are known. (b) (sxr .. Show that (7) defines a joining of A n and yn . B-PROCESSES. 2. = {1T 7i } .„.220 CHAPTER IV. The proof that ABI processes are d-limits of mixing Markov processes is new. define f(sxrn. Next. n (dnzn Winn zr)) E. below. (Z1 ) is clearly a (6. let C c A' consist of those ar for which it(4) > 0 and let s be a symbol not in the alphabet A. for each syr E SC m . n with Ti Ex. aperiodic Markov chain with state space se" x {0. 1. (sxr .u. A proof that does not require alphabet expansion is outlined in Exercise 6c. Remark W. nm +1) goes to (sy'llin. n)independent extension of A. 1. rean(An. and s otherwise. For each positive integer m. let sCm denote the set of all concatenations of the form s w(1) • • • w(m). i) to be x. provided m is large enough and p is small enough (so that only a 6-fraction of indices are spacers s. and let { ri } be the stationary. .

5. Let {X_n} be binary i.i.d. and define Y_n to be 1 if X_n ≠ X_{n−1}, and 0, otherwise. Show that {Y_n} is Markov, and not i.i.d., unless X_n = 0 with probability 1/2. (Some special codings of i.i.d. processes are Markov.)

6. Let μ be the Kolmogorov measure of an ABI process for which 0 < μ_1(a) < 1, for some a ∈ A.
(a) Show that μ(a^n) → 0 as n → ∞, where a^n denotes the concatenation of a with itself n times.
(b) Show that the sequence b a^{2n} b can be used in place of s to mark the beginning of a long concatenation of n-blocks.
(c) Prove Theorem IV.1.9 without extending the alphabet. (Hint: alter probabilities slightly and use Exercise 6b and Exercise 4.)

7. Show that even if the marker s belongs to the original alphabet A, the process {Z_i} defined in the proof of Theorem IV.1.9 is a (δ, n)-independent extension of μ_n. (Hint: use Exercise 4.)

Section IV.2 The finitely determined property.

The most important property of a stationary coding of an i.i.d. process is the finitely determined property, for it allows d-approximation of the process merely by approximating its entropy and a suitable finite joint distribution. A stationary process μ is finitely determined (FD) if any sequence of stationary processes which converges to it weakly and in entropy also converges to it in d. In other words, a stationary μ is finitely determined if, given ε > 0, there is a δ > 0 and a positive integer k, such that any stationary process ν which satisfies the two conditions,

(i)  |μ_k − ν_k| < δ,
(ii) |H(μ) − H(ν)| < δ,

must also satisfy

(iii) d(μ, ν) ≤ ε.
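Conditions (i) and (ii) of this definition involve only a k-block distribution and an entropy, so they are easy to estimate from long sample paths. The sketch below, not from the text, uses standard plug-in estimators (the empirical overlapping k-block distribution and the crude entropy estimate H(μ_k)/k); checking condition (iii) is, of course, the hard part and is not attempted here.

```python
from collections import Counter
import math

def k_block_dist(path, k):
    """Empirical (overlapping) k-block distribution of a sample path."""
    counts = Counter(tuple(path[i:i + k]) for i in range(len(path) - k + 1))
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def variational_distance(p, q):
    return sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in set(p) | set(q))

def entropy_rate_estimate(path, k):
    """Plug-in estimate H(mu_k)/k of the entropy, in bits."""
    p = k_block_dist(path, k)
    return -sum(q * math.log2(q) for q in p.values()) / k

# Conditions (i) and (ii) can then be checked empirically for sample paths x, y:
#   variational_distance(k_block_dist(x, k), k_block_dist(y, k)) < delta
#   abs(entropy_rate_estimate(x, k) - entropy_rate_estimate(y, k)) < delta
```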

Some standard constructions produce sequences of processes converging weakly and in entropy to a given process. Properties of such approximating processes that are preserved under d-limits are automatically inherited by any finitely determined limit process. This principle is used in the following theorem to establish ergodicity and almost block-independence for finitely determined processes.

Theorem IV.2.1
A finitely determined process is ergodic, mixing, and almost block-independent. In particular, a finitely determined process is a stationary coding of an i.i.d. process.

Proof. Let μ^(n) be the concatenated-block process defined by the restriction μ_n of a stationary measure μ to A^n. If k is fixed and n is large relative to k, then the distribution of k-blocks in n-blocks is almost the same as the μ_k-distribution, the only error in this approximation being the end effect caused by the fact that the final k symbols of the n-block have to be ignored. Thus {μ^(n)} converges weakly to μ. Also, H(μ^(n)) is equal to H(μ_n)/n, which converges to H(μ), so that {μ^(n)} converges to μ weakly and in entropy.

Concatenated-block processes can be modified by randomly inserting spacers between the blocks. For each n, let μ̂^(n) be a (δ_n, n)-independent extension of μ_n (such processes were defined as part of the independent-extension lemma, Lemma IV.1.6.) If δ_n is small enough then μ^(n) and μ̂^(n) have almost the same n-block distribution, hence almost the same k-block distribution for any k ≤ n, and furthermore, μ^(n) and μ̂^(n) have almost the same entropy. Thus if δ_n goes to 0 suitably fast, {μ̂^(n)} will converge weakly and in entropy to μ, hence in d, if μ is assumed to be finitely determined. Since each μ̂^(n) is ergodic and since ergodicity is preserved by passage to d-limits, it follows that a finitely determined process is ergodic. Since μ̂^(n) can be chosen to be a function of a mixing Markov chain, and since functions of mixing processes are mixing and d-limits of mixing processes are mixing, a finitely determined process must be mixing. Also, since μ̂^(n) is almost block-independent and d-limits of ABI processes are ABI, a finitely determined process must be almost block-independent. The converse result, that a stationary coding of an i.i.d. process must be finitely determined, is much deeper and will be established later. This completes the proof of Theorem IV.2.1.

Also useful in applications is a form of the finitely determined property which refers only to finite distributions. If ν is a measure on A^n, and k ≤ n, the ν-distribution of k-blocks in n-blocks is the measure φ(k, ν) on A^k defined by

φ(k, ν)(a_1^k) = (1/(n − k + 1)) Σ_{i=0}^{n−k} Σ_{u ∈ A^i} Σ_{v ∈ A^{n−k−i}} ν(u a_1^k v),

the average of the ν-probability of a_1^k over the n − k + 1 starting positions in sequences of length n. A direct calculation shows that

φ(k, ν)(a_1^k) = Σ_{x_1^n} p_k(a_1^k | x_1^n) ν(x_1^n) = E_ν( p_k(a_1^k | x_1^n) ),

the expected value of the empirical k-block distribution with respect to ν. Note also that if ν̄ is the concatenated-block process defined by ν, then

(1)   |ν̄_k − φ(k, ν)| ≤ 2(k − 1)/n,

since the only difference between the two measures is that ν̄_k includes in its average the ways k-blocks can overlap two n-blocks.

Theorem IV.2.2 (The FD property: finite form.)
A stationary process μ is finitely determined if and only if given ε > 0, there is a δ > 0 and positive integers k and N such that if n ≥ N then any measure ν on A^n for which |μ_k − φ(k, ν)| < δ and |H(μ_n) − H(ν)| < nδ must also satisfy d_n(μ_n, ν) ≤ ε.

Proof. One direction is easy, for if ν_n is the projection onto A^n of a stationary measure ν, then φ(k, ν_n) = ν_k, for any k ≤ n, since ν is stationary, and (1/n)H(ν_n) → H(ν). Thus, since d(μ, ν) = lim_n d_n(μ_n, ν_n), if μ satisfies the finite form of the FD property stated in the theorem, then μ is finitely determined.
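The distribution of k-blocks in n-blocks, φ(k, ν), is straightforward to compute when ν is given explicitly. The sketch below is illustrative only; ν is represented as a dictionary from length-n tuples to probabilities, which is a data-format assumption rather than anything in the text.

```python
from collections import defaultdict

def phi(k, nu):
    """phi(k, nu): average, over the n - k + 1 starting positions and over nu,
    of the k-block seen at that position inside an n-block."""
    out = defaultdict(float)
    n = len(next(iter(nu)))
    for block, p in nu.items():
        for i in range(n - k + 1):
            out[block[i:i + k]] += p / (n - k + 1)
    return dict(out)

if __name__ == "__main__":
    # phi(2, nu) for nu uniform on two 4-blocks; it equals the expected
    # empirical 2-block distribution, as noted above.
    nu = {(0, 0, 1, 1): 0.5, (1, 0, 1, 0): 0.5}
    print(phi(2, nu))
```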

To prove the converse, assume μ is finitely determined and let ε be a given positive number. The definition of finitely determined yields a δ > 0 and a k such that any stationary process ν̄ that satisfies |ν̄_k − μ_k| < δ and |H(ν̄) − H(μ)| < δ must also satisfy d(ν̄, μ) ≤ ε.

Choose N so large that if n ≥ N then |(1/n)H(μ_n) − H(μ)| < δ/2. By making N enough larger relative to k to take care of end effects, it can also be supposed, using (1), that if n ≥ N and ν̄ is the concatenated-block process defined by a measure ν on A^n, then |ν̄_k − φ(k, ν)| < δ/2. Fix n ≥ N and let ν be a measure on A^n for which |φ(k, ν) − μ_k| < δ/2 and |H(ν) − H(μ_n)| < nδ/2, and let ν̄ be the concatenated-block process defined by ν. The definition of N guarantees that

|ν̄_k − μ_k| ≤ |ν̄_k − φ(k, ν)| + |φ(k, ν) − μ_k| < δ.

Furthermore, H(ν̄) = (1/n)H(ν), which is within δ/2 of (1/n)H(μ_n), which is, in turn, within δ/2 of H(μ). Thus the definitions of k and δ imply that d(μ, ν̄) ≤ ε, so Exercise 2 implies that d_n(μ_n, ν) ≤ ε. In other words, the finite form of the FD property holds, with δ/2 in place of δ. This completes the proof of Theorem IV.2.2.

IV.2.a ABI implies FD.

The proof that almost block-independence implies finitely determined will be given in three stages. First it will be shown that an i.i.d. process is finitely determined. This proof will then be extended to establish that mixing Markov processes are finitely determined. Finally, it will be shown that the class of finitely determined processes is closed in the d-metric. These results immediately show that ABI implies FD, for an ABI process is the d-limit of mixing Markov processes, by Theorem IV.1.9 of the preceding section.

The i.i.d. proof will be carried out by induction. It is easy to get started, for closeness in first-order distribution means that first-order distributions can be d_1-well-fitted. The idea is to extend a good d_n-fitting of the distributions of X_1^n and Y_1^n by fitting the conditional distribution of X_{n+1}, given X_1^n, as well as possible to the conditional distribution of Y_{n+1}, given Y_1^n. The conditional distribution of X_{n+1}, given X_1^n, does not depend on x_1^n, if independence is assumed, and hence the method works, provided closeness in distribution and entropy implies that the conditional distribution of Y_{n+1}, given Y_1^n, is almost independent of Y_1^n.

To make this sketch into a real proof, an approximate independence concept will be used. Let p(x, y) = p_{X,Y}(x, y) = Prob(X = x, Y = y) denote the joint distribution of the pair (X, Y) of finite-valued random variables, and let p(x) = p_X(x) = Σ_y p(x, y) and p(y) = p_Y(y) = Σ_x p(x, y) denote the marginal distributions. The random variables are said to be ε-independent if

Σ_{x,y} |p(x, y) − p(x)p(y)| ≤ ε.
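ε-independence is a finite computation once the joint distribution is known or estimated. The following sketch, not from the text, computes plug-in estimates of both the variational quantity in the definition just given and the entropy gap H(X) − H(X|Y) that controls it in the lemma which follows.

```python
from collections import Counter
import math

def epsilon_independence_gap(pairs):
    """Given samples (x, y), return plug-in estimates of
    sum |p(x,y) - p(x)p(y)|  and  H(X) - H(X|Y)  (in bits)."""
    joint = Counter(pairs)
    total = sum(joint.values())
    joint = {k: v / total for k, v in joint.items()}
    px, py = Counter(), Counter()
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    var = sum(abs(p - px[x] * py[y]) for (x, y), p in joint.items())
    var += sum(px[x] * py[y] for x in px for y in py if (x, y) not in joint)
    hx = -sum(p * math.log2(p) for p in px.values())
    hy = -sum(p * math.log2(p) for p in py.values())
    hxy = -sum(p * math.log2(p) for p in joint.values())
    return var, hx + hy - hxy    # H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
```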

If H(X) = H(X|Y), then X and Y are independent. An approximate form of this is stated as the following lemma.

Lemma IV.2.3 (The ε-independence lemma.)
There is a positive constant c such that if H(X) − H(X|Y) ≤ cε², then X and Y are ε-independent.

Proof. The entropy difference may be expressed as the divergence of the joint distribution p_{XY} from the product distribution p_X p_Y, that is,

H(X) − H(X|Y) = Σ_{x,y} p(x, y) log ( p(x, y) / (p(x)p(y)) ) = D(p_{XY} | p_X p_Y).

The lemma now follows from Pinsker's inequality,

D(p_{XY} | p_X p_Y) ≥ c ( Σ_{x,y} |p(x, y) − p(x)p(y)| )²,

where c is a positive constant, Exercise 6 of Section I.6. This establishes the lemma.

A stationary process {Y_i} is said to be an ε-independent process if (Y_1, ..., Y_{j−1}) and Y_j are ε-independent, for each j > 1. An i.i.d. process is, of course, ε-independent for every ε > 0. The next lemma asserts that a process is ε-independent if it is close enough in entropy and first-order distribution to an i.i.d. process. Use will also be made of the fact that first-order d-distance is just one-half of variational distance, that is, d_1(X_1, Y_1) = (1/2)|μ_1 − ν_1|, which is just the equality part of property (a) in the d-properties lemma, Lemma I.9.11.

Lemma IV.2.4
Let {X_i} be i.i.d. and let {Y_i} be stationary, with respective Kolmogorov measures μ and ν. Given ε > 0, there is a δ > 0 such that if |μ_1 − ν_1| < δ and |H(μ) − H(ν)| < δ, then {Y_i} is an ε-independent process.

Proof. If first-order distributions are close enough then first-order entropies will be close. The i.i.d. property, however, implies that H(μ) = H(X_1), so that if μ_1 is close enough to ν_1, then H(X_1) will be close to H(Y_1), and hence, if H(ν) = lim_m H(Y_1|Y_{−m}^0) is also close enough to H(μ), then H(Y_1) − H(Y_1|Y_{−m}^0) will be small, for all m ≥ 0, since H(Y_1|Y_{−m}^0) is decreasing in m. The lemma now follows from the preceding lemma.

A conditioning formula, useful in the i.i.d. proof as well as in later results, is stated here without proof. In its statement, p(x, y) denotes the joint distribution of a pair (X, Y) of finite-valued random variables, with first marginal p(x) = Σ_y p(x, y) and conditional distribution p(y|x) = p(x, y)/p(x).

Lemma IV.2.5
If f(x, y) = g(x) + h(y), then

Σ_{x,y} f(x, y) p(x, y) = Σ_x g(x) p(x) + Σ_x p(x) Σ_y h(y) p(y|x).

4+1) = \ x. and the distribution of 1" 1 + 1 /Y1 is.) — 1/(y)1 < 3. let ?./4).2. which is at most c/2. y. This strategy is successful because the distribution of Xn+ 1 /x'il is equal to the distribution of Xn+ i. produces the inequality (n 1)dn+1(.' are c-independent.. The strategy is to extend by fitting the conditional distributions Xn+ 1 /4 and Yn+i /yri' as well as possible. Y1 ). yields (n (2) 1 )Ex n+ I(dn+1) = nEA. Fix It will be shown by induction that cin (X. is certainly a joining of /2..SECTION IV. since d i (X I . 225 Proof The following notation and terminology will be used.2. yn+. Yin) < c. by independence. then the process {Yi } defined by y is an c-independent process. (3) di (X E 1 /x. close on the average to that of Yn+i . Xn+1 (4+1.d. Lemma IV. Thus the second sum in (2) is at most 6/2 ± 6/2. Yn+1)À. the conditional formula. This completes the proof of Theorem IV. since it was assumed that A n realizes while the second sum is equal to cin (X. Theorem IV. since (n 1)dn+1(4 +1 . 1 (dn+i) < ndn(A. process is finitely determined. whose expected value (with respect to yri') is at most /2. .4 provides a positive number 3 so that if j — vil < 8 and I H(/). y) (n + A.x7. Yin). Fix an i.n realizes d i (X n +i The triangle inequality yields (Xn+1/X r i l Xn+1) dl(Xn+19 Yn+1) + (Yn+1 Y1 /y) The first term is 0..i. the second term is equal to di (X I .5. since it was assumed thatX x i . 1/I' ( An+1. The third term is just (1/2) the variational distance between the distributions of the unconditioned random variable Yn+1 and the conditioned random variable Y +1 /y. such a Getting started is easy.u i — vil < 3/2 < c/2. y +1 ) = ndn (x`i' . Y < 1 /y) y)di(Xn+04. < E. which along with the bound nExn (dn ) < n6 and the fact that A n +1 is a joining of . y) +E < (n + 1)c. ± 1 defined 1 . by independence.u. THE FINITELY DE IERMINED PROPERTY.. Furthermore. The measure X. by stationarity. thereby establishing the induction step.2.2. n+i andy n+ i.i.f» Y. Without loss of generality it can be supposed that 3 < E. and di (Xn +i Yn-i-i/Y) denotes the a-distance between the corresponding conditional distributions. Assume it has been shown that an (r i'. Yri+1). and let An realize an (Xtil. since Yn+1 and Y. conditioned on X = x is denoted by X n+ i /4.)T (Xn+1 Yni-1)• n+1 The first term is upper bounded by nc.+1/4).d. For each x. Y1 ) = (1/2)I. process {Xi} and c > O.6 An i.2. by c-independence.realize di (Xn+i by {r}. . 44 and yn+1 .6.(dn) XI E VI Xri (X n i Yn i )d1(Xn+1. while. which is in turn close to the distribution of X n+1 . YD. y`i') + (xn+i. The random variable X n+1 . Lemma IV.

there is a y = y(E) > 0 such that if H(XIZ)— H(XIY.1. ±1 is independent of X7 for the i.2.7 (The conditional 6-independence lemma. The conditional form of the E-independence lemma. Fix a stationary process y for which ym+ il < 3 and 11/(u)— H (v)I <8. In that proof. Lemma IV. there is a 3 > 0 such that if lid.a.i. for suitable choice of m. for then a good match after n steps can be carried forward by fitting future m-blocks. By decreasing 3 if necessary it can be assumed that if I H(g) — H(v)I <8 then Inanu) — na-/(01 < y/2. The random variables X and Y are conditionally E-independent.y E. then. which is possible since conditional entropy is continuous in the variational distance. guarantees that for any n > 1.3.} be an ergodic Markov process with Kolmogorov measure and let {K} be a stationary process with Kolmogorov measure v. Lemma IV.i. The next lemma provides the desired approximate form of the Markov property. Proof Choose y = y(E) from preceding lemma. then X and Y are conditionally E-independent.. then choose 3 > 0 so small that if 1 12 m+1 — v m± i I < 8 then IH(Yriy0)-HaTixol < y12.d.) Given E > 0.d. n > 1. given Z. To extend the i. for every n > 1.8. proof to the mixing Markov case. Lemma IV. Z) < y. the fitting was done one step at a time. a future block Xn but on no previous values.m+1 — v m+ii < 8 and IH(a) — H(v)i < 8. The key is to show that approximate versions of both properties hold for any process close enough to the Markov process in entropy and in joint distribution for a long enough time. ylz) denotes the conditional joint distribution and p(x lz) and p(y lz) denote the respective marginals. process. B-PROCESSES. The choice of 3 and the fact that H(YZIY ° n ) decreases to m H (v) as n oc. where p(x. given Z. two properties will be used. Given E > 0 and a positive integer m.2 CHAPTER IV. A conditional form of c-independence will be needed. Mixing Markov processes are finitely determined. . Fix m.2. dies off in the d-sense. and 17. the Markov property and the Markov coupling lemma. 17In and Yi-.226 IV.1 +1 is almost independent of 1?. provided the Y-process is close enough to the X-process in distribution and entropy. then guarantee that HOTIN— 1 -1(YPIY2 n ) < y. if E E ip(x. whose proof is left to the reader. Lemma IV.d. proof. The Markov result is based on a generalization of the i. yiz) — p(xlz)p(ydz)ip(z) < z x. The Markov coupling lemma guarantees that for mixing Markov chains even this dependence on X.i.. are conditionally E-independent. A good match up to stage n was continued to stage n + 1 by using the fact that X.2. extends to the following result. The Markov property n +Pin depends on the immediate past X. as m grows. given Yo.2.8 Let {X.

y) < E. implies that 171 and Yfni are conditionally E-independent. The Markov coupling lemma. Theorem IV. by the choice of 3 1 .i. The d-fitting of and y is carried out m steps at a time. + cim(r. provides an m such that (4) dm (XT. given Yn .n+i there is a 3 > 0 so that if y is a stationary — vm+1 1 < 6 then measure for which (5) am(Yln. Since this inequality depends only on .r+Fr. and hence the triangle inequality yields. (6) and (5). This proves Lemma IV. The fourth term is upper bounded by half of the variational distance between the distribution of Y. The first three terms contribute less than 3E/4.2. for all n > 1. But the expected value of this variational distance is less than 6/2. in turn. it is enough to show that (8) dni(x n+1 . As in the proof for the i. . By making 3 smaller. yr ) < e/4. it can also be assumed from Lemma IV.2.:vin/4. conditioned on +71 /4 will denote the random vector Xn In the proof. (6).'1 /y.8. XT/x0) < 6/4.SECTION IV. by (4). YTP:r/y) E 9 where Xn realizes an (A. xo E A. Proof n + 7i ./x 1 . = x.9 A mixing finite-order Markov process is finitely determined..u. y) is upper bounded by Ittm — that 6 is small enough to guarantee that (6) (7) dm (xT.2. y)ani (X . Ynn+ = x'17 . Fix a mixing Markov process {X n } with Kolmogorov measure p. Yln /Y0) < yo E A. Fix a stationary y for which (5). Lemma IV. for any stationary process {Yi } with Kolmogorov measure y for which — vm+1I < and I H(u) — H(p)i <8. it can also be assumed Furthermore.8 that Yini and Y i n are conditionally (E/2)-independent. Only the first-order proof will be given. since Ynn+ +Ini and Yin—I are conditionally (E/2)-independent. and fix E > O. and the distribution of Ynn+ +.4T /Yri) dm (Y: YrinZ) Y:2V/yri?). 17. using (6) to get started.'.7:14:r/y. the extension to arbitrary finite order is left to the reader.2. if necessary.9. and ii(Xn of X'nl Vin / and 11. conditioned on X. and (7) all hold.2.d. since dm (u. To complete the proof that mixing Markov implies finitely determined it is enough to show that d(p. This completes the proof of Theorem IV. given Yo. I/2. THE FINITELY DETERMINED PROPERTY 227 which. E X n Vn 4(4.. X4 +im ) will denote the d-distance between the distributions n + T/xii. + drn ( 17:+7 . y). given Yo . The distribution of Xn n+ 47/4 is the same as the distribution of Xn n + r. by (7).n /y. case.1.8. Thus the expected value of the fourth term is at most (1/2)6/2 = 6/4. and hence the sum in (8) is less than E.

The finitely determined processes are d-closed. The final step in the proof that almost block-independence implies finitely determined is stated as the following theorem.

Theorem IV.2.10 (The d-closure theorem.)
The d-limit of finitely determined processes is finitely determined.

The theorem is an immediate consequence of the following lemma, which guarantees that any process d-close enough to a finitely determined process must have an approximate form of the finitely determined property.

Lemma IV.2.11
If d(μ, ν) < ε and ν is finitely determined, then there is a δ > 0 and a k such that d(μ, μ̄) ≤ 3ε, for any stationary process μ̄ for which |μ̄_k − μ_k| < δ and |H(μ̄) − H(μ)| < δ.

The basic idea for the proof is as follows. Let λ be a stationary joining of μ and ν such that E_λ(d(x_1, y_1)) = d(μ, ν). It will be shown that if μ̄ is close enough to μ in distribution and entropy there will be a stationary λ̄ with first marginal μ̄ such that

(a) λ̄ is close to λ in distribution and entropy.
(b) The second marginal of λ̄, say ν̄, is close to ν in distribution and entropy.

The fact that λ̄ is a stationary joining of μ̄ and ν̄ guarantees that E_λ̄(d(x_1, y_1)) is close to E_λ(d(x_1, y_1)) = d(μ, ν), so that d(μ̄, ν̄) is small. If ν̄ is close enough to ν in distribution and entropy, then the finitely determined property of ν guarantees that d(ν, ν̄) is also small. The triangle inequality,

d(μ̄, μ) ≤ d(μ̄, ν̄) + d(ν̄, ν) + d(ν, μ),

then implies that d(μ̄, μ) is small.

A simple language for making the preceding sketch into a rigorous proof is the language of channels, borrowed from information theory. (None of the deeper ideas from channel theory will be used, only its suggestive language.) Fix n, and for each a_1^n, let λ_n(·|a_1^n) be the conditional measure on A^n defined by the formula

(9)   λ_n(b_1^n | a_1^n) = λ_n(a_1^n, b_1^n) / μ(a_1^n).

The family {λ_n(·|a_1^n)} of conditional measures can be thought of as the (noisy) channel (or black box) which, given the input a_1^n, outputs b_1^n with probability λ_n(b_1^n|a_1^n). Such a finite-sequence channel extends to an infinite-sequence channel

(10)   x ↦ y,

which outputs an infinite sequence y, given an infinite input sequence x, by breaking x into blocks of length n and applying the n-block channel to each block separately.
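The blockwise channel (10) is easy to realize in code. The sketch below is illustrative only: the conditional measures λ_n(·|a_1^n) are represented as a dictionary of dictionaries (a data-format assumption, not the book's notation), and a toy two-symbol block channel is used as the example.

```python
import random

def block_channel_output(x, n, conditional, rng):
    """Break x into n-blocks and, for each block a, draw an output block b
    with probability conditional[a][b]; a partial final block is passed through."""
    y = []
    for i in range(0, len(x) - n + 1, n):
        a = tuple(x[i:i + n])
        u, acc = rng.random(), 0.0
        for b, p in conditional[a].items():
            acc += p
            if u <= acc:
                break
        y.extend(b)
    y.extend(x[len(y):])
    return y

if __name__ == "__main__":
    # toy 2-block channel on the binary alphabet: flip the whole block w.p. 0.1
    cond = {}
    for a in [(i, j) for i in (0, 1) for j in (0, 1)]:
        cond[a] = {a: 0.9, tuple(1 - s for s in a): 0.1}
    rng = random.Random(4)
    x = [rng.randint(0, 1) for _ in range(11)]
    print(x)
    print(block_channel_output(x, 2, cond, rng))
```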

b(i)+ so that. _ i < n. A(a. is the measure on Aœ x A" defined for m > 1 by the formula - m_i A * (a . in n and in a. respectively.)(14) = E A A(4. ±„. though both are certainly n-stationary. Km )* mn If a is stationary then neither A*(X n .SECTION IV2. given an infinite input sequence x. n+n ) j=0 The projection of A* onto its first factor is clearly the input measure a. b i ) converges to a oc. and hence ' 1 lim — H(1 11 1X7) = H n — On the other hand.nn . it) and A = A(Xn . /2). by breaking x into blocks of length n and applying the n-block channel to each block separately.f3(4.12 (Continuity in n. The next two lemmas contain the facts that will be needed about the continuity of A(X n . that is. with probability X n (b7lx i n ±i and {y i : i < j n}. ii). b lic) is an average of the n measures A * (a(i) + x where a(i). the measure on A' defined for m > 1 by the formula fi * ( 1/1" ) E A * (aT" . as n oc. The measures A and fi are. Y) is a random vector with distribution A*(X n . a). . with respect to weak convergence and convergence in entropy. and hence stationary measures A = A(An . bki ) converges to X(a /ic. +. a) of a with an output measure /3* = /6 * (4. and put A* = A*(Xn. 17) is a random vector with distribution An then H(X7. then hr(X inn.+ +ki). if (XI". called the stationary joint input output measure and stationary output measure defined by An and the input measure a. that ±n ). b ) . let n > k. Given an input measure a on A" the infinite-sequence channel (10) defined by the conditional probabilities (9) determines a joining A* = A*(X n . The output measure /3* is the projection of A* onto its second factor. independent of {xi: i < jn} is. THE FINITELY DE I ERMINED PROPERTY 229 which outputs an infinite sequence y.2. 1 b k ) = v(b) as n To establish convergence in entropy. x(a k i b/ ). first note that if (X7. A direct calculation shows that A is a joining of a and /3. a) and /3(X n . = b„ for 1 < s < k. bn i ln ) = a(a)4(b i. Proof Fix (4.u) converges weakly and in entropy to v. converges weakly and in entropy to X and . yrn) = H(X) + myrnIxrn) . measure A(4` . a) nor /3*(X n . called the joint input output measure defined by the channel and the input measure a. "" lai. a) on A'. the averaged output p. blf) as n oc. = H(X7) H(}7 W ) . a) need be stationary. - Lemma IV. ypi+ in+n i has the value b7. = as and b(i). b(i) . But. and # = ti(Xn. Likewise. a) are obtained by randomizing the start. The i. a). for i <n — k A * (a(i) i + k i. then A(X n .) If is a stationary joining of the two stationary processes pt and v. The joining A*.

13 (Continuity in input distribution. since Isin = vn . p)) H (X). a(m) )1 converges weakly and in entropy to t3(X n .6. 1 IX7) . This completes the proof of Lemma IV. so that H(A*(X n . if = /3*(X n . and IH(fi) — < y/2. and H(/3) = H(A) — H(Xi Yi) is continuous in H(A) and the distribution of (X 1 .2. provides an n such that the stationary input-output measure A = A (An .2. Furthermore. . p)) = H(. so that both A* and fi* are stationary for any stationary input. a). see Exercise 5. Thus. (ii) 0(4. and hence Lemma IV. Randomizing the start then produces the desired result. since the channel treats input symbols independently. Section 1.)) H (v). assume 1) is finitely determined. Lemma IV.12.) Fix n > 1.2. and a stationary measure A on A" x A. y). Ex(d(xi. yl)) + c. Y1 ). as in --> Do. so that. must also satisfy cl(v. yi)) < Ex(d(xi. (i) {A(Xn . A) satisfies Ifie — vti < y/2. in particular. and let A be a stationary measure realizing ci(p. H(A(4. Dividing by mn and letting in gives 1 H(A*(4. Y1). y) < E. The continuity-in-n lemma. fi = /3(X i . . Yr) = H(XT)+mH(Y i 'X i ).u). then. so (ii) holds. If a sequence of stationary processes la(m)} converges weakly and in entropy to a stationary process a as m oc.2. Lemma IV. Fix a stationary input measure a and put A = A(Xi. then H(XT. p)) H(A). randomizing the start also yields H(/3(A. i=1 H(X71n ) MH(YnX ril ). if (XT.230 H (X ) CHAPTER IV. °eon)} converges weakly and in entropy to A(4. B-PROCESSES. dividing by m and letting in go to oc yields H(A) = H(a)+ H (Y I X 1) . Assume cl(p. The finitely determined property of y provides y and E such that any stationary process V' for which I vt —Ve I < y and I H(v) — H(71)1 < y. An„(a r . as n oc.) < E. p. By definition.12. The extension to n > 1 is straightforward. as m --÷ oc.11. Proof For simplicity assume n = 1. a). Furthermore. and the stationary output measure fi = fi(X n .13 is established. Y) is a random vector with distribution A n. which depends continuously on 11(a) and the distribution of (X1. br ) = a(aT) i-1 X(b i lai ). which is weakly continuous in a. Thus (i) holds. a) is weakly continuous in a. since the entropy of an n-stationary process is the same as the entropy of the average of its first n-shifts. Likewise. Proof of Lemma IV. + E H(rr_1)n+1 I riz_on+i) Do since the channel treats input n-block independently.2. yl)) < E.u)+ — H (Y. g) satisfies (11) EA(d(xl. a).

(4 +1 )/p. and — II (v)I III (1 7) — II (0)I + i H(8) — I-I (v)I < 1 / . y) <E.2. and the fact that X realizes d(p. This completes the proof of Lemma IV. ii„) satisfies 1 17 )t — fit' < y/2. Show that a finitely determined process is the d-limit of its . IV.. so the definitions of f and y imply that dal. provides 3 and k such that if ii.2. )' by the inequalities (11) and (12). Given such an input 'A it follows that the output -17 satisfies rile — vel Pe — Oel + Ifie — vel < Y. il with the output 7/ and ) En(d(xi.. thereby completing the proof that dlimits of finitely determined processes are finitely determined. )' 1 )) < En(d(xi. while the proof that mixing Markov chains are finitely determined is based on the Friedman-Ornstein proof. 231 The continuity-in-input lemma.z -th order Markovizations. yi) + 6 Ex(d (x i. 71) _-_. i)) + 2E < 3E. is any stationary process for which Fix — lid < 8 and IH(ii) — H(A)1 < ( 5 .SECTION IV.i.14 The proof that i. 1 The n-th order Markovization of a stationary process p. Furthermore.11. d(ii. The observation that the ideas of that proof could be expressed in the simple channel language used here is from [83].13. [46]. Ex(d(xi. [13].) .2. (Hint: y is close in distribution and entropy to p.2. Ft) satisfies (12) Ex(d (x i . El Remark IV. The fact that d-limits of finitely determined processes are finitely determined is due to Ornstein. The fact that almost block-independence is equivalent to finitely determined first appeared in [67]. processes are finitely determined is based on Ornstein's original proof. y) which was assumed to be less than E .b Exercises. then the joint input-output distribution X = A(X„. is the n-th order Markov process with transition probabilities y(x-n+i WO = p. [46]. Lemma IV. yi)) < < yi)). the proof given here that ABI processes are finitely determined is new. 71) is a joining of the input Ex(d(xi. yi)) +e and the output distribution il = 0(X. THE FINITELY DETERMINED PROPERTY.d. since X = A(Xn .2. and IH(i') — 1-1 (/3)1 < y/2. and hence the proof that almost block-independent processes are finitely determined.(4).
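Exercise 1 above defines the n-th order Markovization of a stationary process. As an illustration (not from the text), the sketch below estimates its transition probabilities from a long sample path and generates from the estimated chain; contexts that never occur in the training path are simply not handled, since this is only a sketch.

```python
from collections import Counter, defaultdict
import random

def markovization_from_path(path, n):
    """Empirical n-th order Markovization:
    nu(x_{n+1} | x_1^n) = count(x_1^{n+1}) / count(x_1^n)."""
    num = Counter(tuple(path[i:i + n + 1]) for i in range(len(path) - n))
    den = Counter(tuple(path[i:i + n]) for i in range(len(path) - n))
    trans = defaultdict(dict)
    for block, c in num.items():
        trans[block[:n]][block[n]] = c / den[block[:n]]
    return trans

def sample_markovization(trans, n, start, length, rng):
    """Generate from the estimated chain; start must be an observed n-context."""
    x = list(start)
    while len(x) < length:
        u, acc = rng.random(), 0.0
        for s, p in trans[tuple(x[-n:])].items():
            acc += p
            if u <= acc:
                break
        x.append(s)
    return x

if __name__ == "__main__":
    rng = random.Random(5)
    path = [rng.randint(0, 1) for _ in range(5000)]
    trans = markovization_from_path(path, 2)
    print(sample_markovization(trans, 2, path[:2], 20, rng))
```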

2. Show that the class of B-processes is d-separable. (Hint: d(μ, ν̄) is small if ν̄ is the concatenated-block process defined by a measure on A^n close to μ_n.)

3. Show that if μ is totally ergodic and ν is any probability measure on A^∞, then d(μ, ν) = d(x, y), for some x all of whose shifts have limiting nonoverlapping n-block distribution equal to μ_n, and for some y for which the limiting nonoverlapping n-block distribution of T^i y is ν_n, for some i < n. (Hence any of the limiting nonoverlapping n-block distributions determined by (T^i x, T^i y) must be a joining of μ_n and ν_n.)

4. Show that a process built by repeated independent cutting and stacking is a B-process if the initial structure has two columns with heights differing by 1.

5. Prove the conditional ε-independence lemma, Lemma IV.2.7.

Section IV.3 Other B-process characterizations.

A more careful look at the proof that mixing Markov chains are finitely determined shows that all that is really needed for a process to be finitely determined is a weaker form of the Markov coupling property. This form, called the very weak Bernoulli property, is actually equivalent to the finitely determined property, hence serves as another characterization of B-processes. The significance of very weak Bernoulli is that many physically interesting processes, such as geodesic flow on a manifold of constant negative curvature and various processes associated with billiards, can be shown to be very weak Bernoulli by exploiting their natural expanding and contracting foliation structures. The reader is referred to [54] for a discussion of such applications.

Equivalence will be established by showing that very weak Bernoulli implies finitely determined and then that almost blowing-up implies very weak Bernoulli. Since it has already been shown that finitely determined implies almost blowing-up, see Section III.4, it follows that almost block-independence, very weak Bernoulli, almost blowing-up, and finitely determined are, indeed, equivalent to being a stationary coding of an i.i.d. process.

IV.3.a The very weak Bernoulli and weak Bernoulli properties.

As in earlier discussions, either measure or random variable notation will be used for the d-distance, and X_1^n/x_{−k}^0 will denote the random vector X_1^n, conditioned on the past values x_{−k}^0. A stationary process {X_i} is very weak Bernoulli (VWB) if given ε > 0 there is an n such that for any k ≥ 0,

(1)   E_{x_{−k}^0}( d_n( X_1^n/x_{−k}^0, X_1^n ) ) ≤ ε,

where E_{x_{−k}^0} denotes expectation with respect to the random past x_{−k}^0. Informally stated, a process is VWB if the past has little effect in the d-sense on the future.

Theorem IV.3.1 (The very weak Bernoulli characterization.)
A process is very weak Bernoulli if and only if it is finitely determined.

The proof that very weak Bernoulli implies finitely determined will be given in the next subsection, while the proof of the converse will be given later, after some discussion of the almost blowing-up property.

A somewhat stronger property, called weak Bernoulli, is obtained by using variational distance with a gap in place of the d̄-distance. A stationary process {X_i} is weak Bernoulli (WB), or absolutely regular, if past and future become ε-independent when separated by a long enough gap, that is, if given ε > 0 there is a gap g such that for any k ≥ 0 and m > 0, the random vectors X_{g+1}^{g+m} and X^0_{-k} are ε-independent.

In a sense made precise in Exercises 4 and 5, weak Bernoulli requires that, with high probability, the conditional measures on different infinite pasts can be joined in the future so that, with high probability, names agree after some point, while very weak Bernoulli only requires that the density of disagreements be small. This distinction was the key to the example constructed in [65] of a very weak Bernoulli process that is not weak Bernoulli.

The class of weak Bernoulli processes includes the mixing Markov processes and the large class of mixing regenerative processes, see Exercise 2 and Exercise 3. Furthermore, as noted in earlier theorems, weak Bernoulli processes have nice empirical distribution and waiting-time properties. Their importance here is that weak Bernoulli processes are very weak Bernoulli, a result established by another method in [78].

To see why weak Bernoulli implies very weak Bernoulli, first note that if X_{g+1}^{g+m} and X^0_{-k} are ε-independent, then

E_{X^0_{-k}}( d̄_m( X_{g+1}^{g+m} / X^0_{-k}, X_{g+1}^{g+m} ) ) ≤ ε/2,

since d̄-distance is upper bounded by one-half the variational distance. If this is true for all m and k, then one can take n = g + m with m so large that g/(g + m) < ε/2, and use the fact that

d̄_n( U_1^n, V_1^n ) ≤ g/n + d̄_{n−g}( U_{g+1}^n, V_{g+1}^n ),

for any n > g and any random vector (U_1^n, V_1^n), to obtain

E_{X^0_{-k}}( d̄_n( X_1^n / X^0_{-k}, X_1^n ) ) ≤ ε.

Thus weak Bernoulli indeed implies very weak Bernoulli.
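Exercise 2 at the end of this section asks for a coupling proof that mixing Markov chains are weak Bernoulli. The simulation below is only an illustration of that coupling idea and is not part of the text; the particular two-state chain and the run-independently-until-meeting coupling are assumptions of the example. Two copies of the chain are started from different states, and the chance that they have not met within a gap g bounds, in variational distance, how much such different pasts can influence the distribution of the future beyond the gap.

```python
import random

def coupling_tail(P, states, gap, trials=10000, seed=0):
    """Estimate the probability that two copies of a Markov chain with
    transition matrix P, started from the two extreme states and run
    independently until they meet, have still not met after `gap` steps."""
    rng = random.Random(seed)

    def step(s):
        return rng.choices(states, weights=[P[s][t] for t in states])[0]

    failures = 0
    for _ in range(trials):
        x, y = states[0], states[-1]       # two different "pasts"
        for _ in range(gap):
            if x == y:
                break
            x, y = step(x), step(y)
        failures += (x != y)
    return failures / trials

# A mixing two-state chain; the tail decays as the gap grows.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
for g in (1, 5, 20, 50):
    print(g, coupling_tail(P, [0, 1], g))
```

For a mixing chain the failure probability decays geometrically in g, which is exactly the kind of gap-dependence the weak Bernoulli definition requires.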

IV.3.b Very weak Bernoulli implies finitely determined.

The proof models the earlier argument that mixing Markov implies finitely determined. Entropy is used to guarantee approximate independence from the distant past, while very weak Bernoulli is used to guarantee that even such conditional dependence dies off in the d̄-sense as block length grows. As in the earlier proofs, a good fitting can be carried forward by fitting future m-blocks, conditioned on intermediate values, so the key is to show that the expected value of

d̄_m( X_{n+1}^{n+m} / x_{-k}^n , Y_{n+1}^{n+m} / y_{-k}^n )

is small, for suitable choice of m and every n, given only that the two processes are close enough in entropy and in k-th order distribution, for some fixed k ≥ m. Since approximate versions of both properties hold for any process close enough to {X_n} in entropy and in joint distribution for a long enough time, all that remains to be shown is that the expected value of d̄_m( Y_{n+1}^{n+m} / Y_{-k}^n , Y_{n+1}^{n+m} ) is small, provided only that {Y_i} is close enough to {X_i} in k-th order distribution for some large enough k ≥ m, as well as close enough in entropy. The details of the preceding sketch are carried out as follows.

Fix a very weak Bernoulli process {X_n} with Kolmogorov measure μ, and fix ε > 0. The very weak Bernoulli property provides an m so that

(2)  E_{X^0_{-t}}( d̄_m( X_1^m / X^0_{-t}, X_1^m ) ) ≤ ε/8,  t ≥ 1.

The goal is to show that a process sufficiently close to {X_i} in distribution and entropy satisfies almost the same bound, uniformly in n. Towards this end, let γ be a positive number to be specified later. For m fixed, H(X_1^m | X^0_{-t}) converges to mH(μ) as t → ∞, and hence there is a K such that H(X_1^m | X^0_{-K}) < mH(μ) + γ. Fix k = m + K + 1.

The quantity E_{X^0_{-K}}( d̄_m( X_1^m / X^0_{-K}, X_1^m ) ) depends continuously on μ_k, so it follows from the triangle inequality that there is a δ > 0 such that if {Y_j} is a stationary process with Kolmogorov measure ν for which |μ_k − ν_k| < δ, then

(3)  E_{Y^0_{-K}}( d̄_m( Y_1^m / Y^0_{-K}, Y_1^m ) ) ≤ ε/4.

The conditional entropy H(X_1^m | X^0_{-K}) also depends continuously on μ_k, so if δ is small enough and it is also assumed that |H(ν) − H(μ)| < δ, then H(Y_1^m | Y^0_{-K}) < mH(ν) + 2γ holds. But, for any stationary process {Y_j}, H(Y_1^m | Y^0_{-K-j}) decreases to mH(ν) as j → ∞, so that

H(Y_1^m | Y^0_{-K}) − H(Y_1^m | Y^0_{-K-j}) < 2γ,  j ≥ 0.

If γ is small enough, the conditional ε-independence lemma implies that Y_1^m and Y_{-K-j}^{-K-1} are conditionally (ε/2)-independent, given Y^0_{-K}. Since d̄-distance is upper bounded by one-half the variational distance, this means that

(4)  E( d̄_m( Y_1^m / Y^0_{-K}, Y_1^m / Y^0_{-K-j} ) ) ≤ ε/4,  j ≥ 0,

which, along with the triangle inequality and the earlier bound (3), yields

E( d̄_m( Y_1^m / Y^0_{-K-j}, Y_1^m ) ) ≤ ε/2,  j ≥ 0.

In summary, if δ is small enough and {Y_j} is a stationary process with Kolmogorov measure ν for which |μ_k − ν_k| < δ and |H(μ) − H(ν)| < δ, then, by stationarity, the expected value of d̄_m( Y_{n+1}^{n+m} / Y_{-t}^n , Y_{n+1}^{n+m} ) is at most ε/2 for every n and every t. Since it can also be supposed that δ < ε/2, this is enough to carry the fitting argument forward, which completes the proof that very weak Bernoulli implies finitely determined. □

IV.3.c The almost blowing-up characterization.

The almost blowing-up property (ABUP) was introduced in Section III.4. In this section it will be shown that almost blowing-up implies very weak Bernoulli. It was shown in Section III.4 that a finitely determined process has the almost blowing-up property, and, since very weak Bernoulli implies finitely determined, this completes the proof that almost block-independence, very weak Bernoulli, finitely determined, and almost blowing-up are all equivalent ways of saying that a process is a stationary coding of an i.i.d. process.

The ε-blowup [C]_ε of a set C ⊂ A^n is its ε-neighborhood relative to the d_n-metric, that is,

[C]_ε = { y_1^n : d_n(x_1^n, y_1^n) < ε, for some x_1^n ∈ C }.

A set B ⊂ A^n has the (δ, ε)-blowing-up property if μ([C]_ε) ≥ 1 − ε, for any subset C ⊂ B for which μ(C) ≥ 2^{−nδ}. A stationary process has the almost blowing-up property if for each n there is a B_n ⊂ A^n such that x_1^n ∈ B_n, eventually almost surely, and such that for any ε > 0 there is a δ > 0 and an N such that B_n has the (δ, ε)-blowing-up property for n ≥ N.
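For small n the ε-blowup just defined can be computed by brute force, which is a convenient way to experiment with the blowing-up property. The sketch below is illustrative only and not from the text; the binary alphabet and the function names are assumptions, and exhaustive enumeration is, of course, exponential in n.

```python
import itertools

def hamming(x, y):
    """Per-letter Hamming distance d_n between two equal-length words."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def blowup(C, eps, n, alphabet=(0, 1)):
    """[C]_eps: every word of A^n within d_n-distance strictly less than eps
    of some word of C (brute force, so only sensible for small n)."""
    return {y for y in itertools.product(alphabet, repeat=n)
            if any(hamming(x, y) < eps for x in C)}

# With eps just above 1/4, the blowup of a single 4-block is its Hamming
# ball of radius one: five words.
C = {(0, 0, 0, 0)}
print(len(blowup(C, 0.3, 4)))   # 5
```

With ε just above 1/4, blowing up the single block 0000 produces its Hamming ball of radius one, five words in all.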

The proof that almost blowing-up implies very weak Bernoulli is based on the following lemma, which, in a more general setting and stronger form, is due to Strassen, [81].

Lemma IV.3.2. Let μ and ν be probability measures on A^n such that ν([C]_ε) ≥ 1 − ε, whenever μ(C) ≥ ε. Then d̄_n(μ, ν) ≤ 2ε.

Proof. The ideas of the proof will be described first, after which the details will be given. A joining can be thought of as a partitioning of the ν-mass of each y_1^n into parts and an assignment of these parts to various x_1^n, subject to the joining requirement that the total mass received by each x_1^n is equal to μ(x_1^n). The goal is to construct a joining λ of μ and ν for which d_n(x_1^n, y_1^n) < ε, except on a set of pairs (x_1^n, y_1^n) of λ-measure less than ε.

A simple strategy for beginning the construction of a good joining is to cut off an α-fraction of the ν-mass of y_1^n and assign it to some x_1^n for which d_n(x_1^n, y_1^n) < ε. A trivial argument shows that this is indeed possible for some positive α for those y_1^n that are within ε of some x_1^n of positive μ-mass; the set of such y_1^n is just the ε-blowup of the support of μ, which, by hypothesis, has ν-measure at least 1 − ε. This simple strategy can be repeated as long as the set of x_1^n whose μ-mass is not yet completely filled has μ-measure at least ε. The key to continuing is the observation that if the set of x_1^n whose μ-mass is not completely filled in the first stage has μ-measure at least ε, then its blowup has large ν-measure and the simple strategy can be used again, that is, for most y_1^n a small fraction of the remaining ν-mass can be cut off and assigned to the unfilled mass of a nearby x_1^n. If the fraction is chosen to be largest possible at each stage, then only a finite number of stages are needed to reach the point where the set of x_1^n whose μ-mass is not yet completely filled has μ-measure less than ε.

Some notation and terminology will assist in making the preceding sketch into a proof. A nonnegative function μ̃ on A^n is called a mass function if it has nonempty support and Σ μ̃(x_1^n) ≤ 1. A function φ: [S]_ε → S is called an ε-matching over S if d_n(y_1^n, φ(y_1^n)) < ε, for all y_1^n ∈ [S]_ε. If the domain S of such a φ is contained in the support of a mass function μ̃, then the number

(5)  α = min{ 1, inf_{x ∈ S: μ̃(x) > 0} μ̃(x) / ν(φ^{-1}(x)) }

is positive. It is called the maximal (ν, φ)-stuffing fraction, for it is the largest number α ≤ 1 for which α ν(φ^{-1}(x)) ≤ μ̃(x), for all x ∈ S.

The desired joining λ is constructed by induction. To get started, let μ^{(1)} = μ, let S_1 be the support of μ^{(1)}, let φ_1 be any ε-matching over S_1 (such a φ_1 exists by the definition of ε-blowup, as long as S_1 ≠ ∅), and let α_1 be the maximal (ν, φ_1)-stuffing fraction; an α_1-fraction of the ν-mass of each y_1^n ∈ [S_1]_ε is then cut off and assigned to the μ-mass of x_1^n = φ_1(y_1^n). Having defined S_i, φ_i and α_i, let μ^{(i+1)} be the mass function obtained from μ^{(i)} by subtracting, for each x_1^n, the ν-mass assigned to x_1^n at stage i. The set S_{i+1} is then taken to be the support of μ^{(i+1)}, the function φ_{i+1} is taken to be any ε-matching over S_{i+1}, α_{i+1} is taken to be the maximal (ν, φ_{i+1})-stuffing fraction, and an α_{i+1}-fraction of the remaining ν-mass of each y_1^n ∈ [S_{i+1}]_ε is cut off and assigned to the μ-mass of φ_{i+1}(y_1^n).

If α_i < 1 then, by the definition of the maximal stuffing fraction, there is an x_1^n ∈ S_i whose remaining mass is completely filled at stage i, so that μ(S_{i+1}) < μ(S_i). Since A^n is finite, there is therefore a first i, say i*, for which α_{i*} = 1 or for which μ(S_{i*+1}) < ε. In either case the construction can be stopped after i*, cutting up and assigning the remaining unassigned ν-mass in any way consistent with the joining requirement that the total mass received by each x_1^n is equal to μ(x_1^n). The joining λ obtained in this way satisfies d_n(x_1^n, y_1^n) < ε, except on a set of pairs (x_1^n, y_1^n) of λ-measure less than ε, so that d̄_n(μ, ν) ≤ 2ε. This completes the proof of Lemma IV.3.2. □
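The staged construction in the proof just given is algorithmic in spirit. The sketch below is a simplified greedy variant, not the proof's exact maximal-stuffing-fraction procedure: it moves as much remaining ν-mass of each word as will fit onto the unfilled μ-mass of the nearest words within d_n-distance ε, and reports how much ν-mass could not be placed within that distance. The function names, the tie-breaking order, and the greedy rule itself are assumptions of the illustration.

```python
import itertools

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

def greedy_joining(mu, nu, eps):
    """A simplified greedy cousin of the stuffing construction: move as much
    remaining nu-mass of each word y as will fit onto the unfilled mu-mass
    of nearby words x with d_n(x, y) < eps, and report the nu-mass that
    could not be placed within distance eps."""
    mu_left = dict(mu)                 # unfilled mu-mass of each word
    joining = {}                       # partial joining being built
    unplaced = 0.0
    for y, mass in sorted(nu.items()):
        remaining = mass
        for x in sorted(mu_left, key=lambda u: hamming(u, y)):
            if remaining <= 1e-12:
                break
            if hamming(x, y) >= eps:
                break                  # every later candidate is even farther
            amount = min(remaining, mu_left[x])
            if amount > 0:
                joining[(x, y)] = joining.get((x, y), 0.0) + amount
                mu_left[x] -= amount
                remaining -= amount
        unplaced += remaining
    return joining, unplaced

# Example on {0,1}^2: all of nu's mass can be placed within distance 0.6.
mu = {(0, 0): 0.5, (1, 1): 0.5}
nu = {(0, 1): 0.5, (1, 1): 0.5}
print(greedy_joining(mu, nu, 0.6))
```

When the unplaced mass is small, the resulting partial joining witnesses a d̄_n-bound of roughly ε plus the unplaced mass, in the spirit of Lemma IV.3.2.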

With Lemma IV.3.2 in hand, only one simple entropy fact about conditional measures will be needed to prove that almost blowing-up implies very weak Bernoulli. This is the fact that, with high probability, the conditional probability μ(x_1^n | x^0_{-∞}) has approximately the same exponential size as the unconditioned probability μ(x_1^n), provided only that n is large enough.

Let Σ(X^n_{-∞}) denote the σ-algebra generated by the collection of cylinder sets {[x^n_{-k}]: k ≥ 0}. Also, for F_n ∈ Σ(X^n_{-∞}), the notation x^n_{-∞} ∈ F_n will be used as shorthand for the statement that ∩_k [x^n_{-k}] ⊂ F_n.

Lemma IV.3.3 (The conditional entropy lemma.) Let μ be an ergodic process of entropy H and let α be a given positive number. There is an N = N(α) > 0 such that if n ≥ N then there is a set F_n ∈ Σ(X^n_{-∞}) such that μ(F_n) ≥ 1 − α and so that if x^n_{-∞} ∈ F_n then

μ(x_1^n | x^0_{-∞}) ≤ 2^{αn} μ(x_1^n).

Proof. By the ergodic theorem, −(1/n) log μ(x_1^n | x^0_{-∞}) → H, almost surely, while, by the entropy theorem, −(1/n) log μ(x_1^n) → H, almost surely. These two facts together imply the lemma. □

Only the following consequence of the upper bound will be needed.

Lemma IV.3.4. Let N = N(α) be as in the preceding lemma. If n ≥ N, then there is a set D_n ∈ Σ(X^0_{-∞}) of measure at least 1 − √α such that if x^0_{-∞} ∈ D_n and C ⊂ A^n, then

μ(C | x^0_{-∞}) ≤ 2^{αn} μ(C) + √α.

Proof. For n ≥ N, let F_n ∈ Σ(X^n_{-∞}) be given by the preceding lemma, so that

(a)  μ(F_n) ≥ 1 − α, and

(b)  μ(x_1^n | x^0_{-∞}) ≤ 2^{αn} μ(x_1^n), for x^n_{-∞} ∈ F_n.

Since μ(F_n) is an average of the conditional probabilities μ(F_n | x^0_{-∞}), the Markov inequality yields a set D_n ∈ Σ(X^0_{-∞}) of measure at least 1 − √α such that μ(F_n | x^0_{-∞}) ≥ 1 − √α, for x^0_{-∞} ∈ D_n. If x^0_{-∞} ∈ D_n and C ⊂ A^n, then (a) and (b) give

μ(C | x^0_{-∞}) ≤ μ(C ∩ F_n | x^0_{-∞}) + √α ≤ 2^{αn} μ(C ∩ F_n) + √α ≤ 2^{αn} μ(C) + √α,

which establishes the lemma. □

Now the main theorem of this section will be proved.

Theorem IV.3.5. Almost blowing-up implies very weak Bernoulli.

Proof. Let μ be a stationary process with the almost blowing-up property, that is, for each n there is a B_n ⊂ A^n such that x_1^n ∈ B_n, eventually almost surely, and such that for any ε > 0 there is a δ > 0 and an N such that B_n has the (δ, ε)-blowing-up property for n ≥ N. The goal is to show that μ is very weak Bernoulli. Towards this end, it is enough to show how to find, for each ε > 0, an n and a set V ∈ Σ(X^0_{-∞}) such that μ(V) ≥ 1 − ε and such that the following holds for x^0_{-∞} ∈ V.

(6)  If C ⊂ A^n and μ(C | x^0_{-∞}) ≥ ε, then μ([C]_ε) ≥ 1 − ε.

Indeed, by Lemma IV.3.2, (6) implies that d̄_n( μ(· | x^0_{-∞}), μ_n ) ≤ 2ε, for x^0_{-∞} ∈ V, and, since ε is arbitrary, this is enough to establish the very weak Bernoulli property.

Fix ε > 0, and let δ > 0 and N be such that B_n has the (δ, ε)-blowing-up property for n ≥ N. Since x_1^n ∈ B_n, eventually almost surely, n ≥ N can be chosen so large that μ(B_n) ≥ 1 − ε²/4; since μ(B_n) is an average of the quantities μ(B_n | x^0_{-∞}), the Markov inequality then yields a set D_n* ∈ Σ(X^0_{-∞}) with μ(D_n*) ≥ 1 − ε/2 such that μ(B_n | x^0_{-∞}) ≥ 1 − ε/2, for x^0_{-∞} ∈ D_n*. By Lemma IV.3.4, if n is also large enough and α ≤ ε²/4, then there is a set D_n ∈ Σ(X^0_{-∞}) of measure at least 1 − √α ≥ 1 − ε/2 such that, for x^0_{-∞} ∈ D_n and C ⊂ A^n,

μ(C | x^0_{-∞}) ≤ 2^{αn} μ(C) + √α.

Let V = D_n ∩ D_n*, so that μ(V) ≥ 1 − ε.

Fix x^0_{-∞} ∈ V and C ⊂ A^n with μ(C | x^0_{-∞}) ≥ ε. Since μ(B_n | x^0_{-∞}) ≥ 1 − ε/2, it follows that μ(C ∩ B_n | x^0_{-∞}) ≥ ε − ε/2, and hence, by the preceding inequality applied to C ∩ B_n,

μ(C ∩ B_n) ≥ 2^{−αn}( ε − √α − ε/2 ).

If α is chosen to be less than both δ and ε²/8, then 2^{−αn}( ε − √α − ε/2 ) ≥ 2^{−δn}, provided only that n is large enough, so that the blowing-up property of B_n yields

μ([C]_ε) ≥ μ([C ∩ B_n]_ε) ≥ 1 − ε.

This establishes the desired result (6) and completes the proof that almost blowing-up implies very weak Bernoulli. □

Remark IV.3.6. The proof of Theorem IV.3.5 appeared in [39]. The proof of Strassen's lemma given here uses a construction suggested by Ornstein and Weiss which appeared in [37]. A more sophisticated result, using a marriage lemma to pick the ε-matchings φ_i, yields the stronger result that the d̄_n-metric is equivalent to the metric defined as the minimum ε > 0 for which μ(C) ≤ ν([C]_ε) + ε, for all C ⊂ A^n. This and related results are discussed in Pollard's book, [59, Example 26, pp. 79-80].

IV.3.d Exercises.

1. Show that a stationary process is very weak Bernoulli if and only if for each ε > 0 there is an m such that for each k ≥ 1,

E_{X^0_{-k}}( d̄_m( X_{n+1}^{n+m} / X^0_{-k}, X_{n+1}^{n+m} ) ) ≤ ε,

for all sufficiently large n.

2. Show, using coupling, that a mixing Markov chain is weak Bernoulli.

3. Show that a mixing regenerative process is weak Bernoulli. (Hint: coupling.)

4. Show by using the martingale theorem that a stationary process μ is weak Bernoulli if given ε > 0 there is a positive integer K such that for every n ≥ K there is a measurable set G ∈ Σ(X^0_{-∞}) of measure at least 1 − ε, such that if x^0_{-∞} and x̃^0_{-∞} belong to G then there is a measurable mapping φ: [x^0_{-∞}] → [x̃^0_{-∞}] which maps the conditional measure μ(· | x^0_{-∞}) onto the conditional measure μ(· | x̃^0_{-∞}), and a measurable set B ⊂ [x^0_{-∞}] with μ(B | x^0_{-∞}) ≤ ε, with the property that for all x ∉ B, φ(x)_i = x_i, for i ∈ [K, n].

5. Show by using the martingale theorem that a stationary process μ is very weak Bernoulli if given ε > 0 there is a positive integer n and a measurable set G ∈ Σ(X^0_{-∞}) of measure at least 1 − ε, such that if x^0_{-∞} and x̃^0_{-∞} belong to G then there is a measurable mapping φ: [x^0_{-∞}] → [x̃^0_{-∞}] which maps the conditional measure μ(· | x^0_{-∞}) onto the conditional measure μ(· | x̃^0_{-∞}), and a measurable set B ⊂ [x^0_{-∞}] with μ(B | x^0_{-∞}) ≤ ε, with the property that for all x ∉ B, φ(x)_i = x_i for all except at most εn indices i ∈ [1, n].
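For a mixing Markov chain, conditioning on the infinite past reduces to conditioning on X_0, so the conditional entropy lemma of the preceding subsection can be checked directly. The computation below is an illustration only, not from the text, and the particular two-state chain is an assumption; it shows that the per-symbol gap between log μ(x_1^n | x_0) and log μ(x_1^n) shrinks like 1/n, uniformly over blocks and initial states.

```python
import itertools
import numpy as np

# A mixing two-state Markov chain and its stationary distribution.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])          # solves pi P = pi

def block_prob(block, start=None):
    """mu(x_1^n) if start is None, else mu(x_1^n | X_0 = start)."""
    p = pi[block[0]] if start is None else P[start, block[0]]
    for a, b in zip(block, block[1:]):
        p *= P[a, b]
    return p

for n in (2, 6, 10, 14):
    worst = max(abs(np.log2(block_prob(w, start=s) / block_prob(w))) / n
                for s in (0, 1)
                for w in itertools.product((0, 1), repeat=n))
    print(n, round(float(worst), 4))
```

Here the conditional and unconditioned probabilities differ only in the first factor, so the worst-case per-symbol gap is max |log₂(P(x₀, x₁)/π(x₁))| / n, and the printed values decay like 1/n.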

Budapest. [4] P. Wiley.A. Friedman and D. Volume II (Second Edition). Information theory and reliable communication. New York. "Universal codeword sets and representations of the integers. 1968. 23(1970). New York. 1968. 1980. 1152-1169. 1981. Dept. 17(1989)." J. Info. Ergodic problems of classical mechanics. 321-343. Israel J." Ann. D. d'analyse. of Math. [8] P. Stanford Univ. [11] W. Princeton.. Birkhauser. M. of Elec.. [12] N. Wiley. Feller... Ornstein. "On a new law of large numbers. New York. Benjamin. Eng. Feldman. "On isomorphism of weak Bernoulli transformations. An introduction to probability theory and its applications. 365-394. [14] H. Introduction to ergodic theory Van Nostrand Reinhold. Erdôs and A. [2] R. Thesis. "Logically smooth density estimation. equipartition.I. C. NY." IEEE Trans. Csiszdr and J. IT-21(1975). Cover and J. 103-111. [3] A. New York.. [6] T. Waterman. John Wiley and Sons. and Ornstein's isomorphism theorem. 1965. 1991. 1971. 5(1970). Gallager. Ergodic theory and information. John Wiley and Sons. Press.Elias. 36(1980). Billingsley. V.. Thomas. [10] J. 194-203. [5] D. [9] P. Barron.. Furstenberg.". Kohn. 1985.. Elements of information theory. NJ. A. 1970. [13] N.. Inc. New York. New York. NY. [7] I." Ph. Akadémiai Kiack").Bibliography [1] Arnold." Advances in Math. 239 . Arratia and M. Probab. New York. [15] R. Measure Theory. W. Körner. 1981. Friedman. Rényi. "The Erd6s-Rényi strong law for pattern matching with a given proportion of mismatches. Th. "r-Entropy. and Avez. Recurrence in ergodic theory and combinatorial number theory Princeton Univ. Information Theory Coding theorems for discrete memoryless systems. A.

D. [32] U. Entropy and information theory. Kakutani. An Introduction to Symbolic Dynamics and Coding. Halmos. 36(1965). [30] J. Kelly." Probability. - [18] R.. Math. Theory. New York." Ann. [33] R. New York. Ghandour. U. 1956. [17] R. G. 1995. P. [28] M. 82(1985).. 1985. ed. Jones." Israel J. "Estimating the information content of symbol sequences and efficient codes. Nat. AMS. W." IEEE Trans." Ann. Katznelson and B. [24] S. D. 19(1943)." IEEE Trans. Springer Verlag. 284-290.. [23] M." IEEE Trans. Info. of Math. Grassberger. 1002-1010. Th. " Finitary isomorphism of irreducible Markov shifts.. 42(1982). "Source coding without the ergodic assumption. Snell. 502-516. Press. Cambridge Univ. 635-641." Proc. Kemeny and J. Van Nostrand Reinhold. Springer-Verlag. Statist. random processes." Israel J. Math. "Comparative statistics for DNA and protein sequences . and ergodic properties. [29] J.. "Sample converses in source coding theory.240 BIBLIOGRAPHY [16] P.. [34] D. Hoeffding. Kieffer. Lectures on the coupling method.. Info. de Gruyter. [25] T. 669-675. New York. 1988. 42(1982). [31] I. Kamae. Kac. Weiss." Proc. New York. Acad. "Induced measure-preserving transformations. "New proofs of the maximal ergodic theorem and the Hardy-Littlewood maximal inequality. L. M. "A simple proof of the ergodic theorem using nonstandard analysis. 5800-5804. Cambridge. "Prefixes and the entropy rate for long-range sources. John Wiley and Sons. Japan Acad. [21] P..S.. Marcus. Suhov. 34(1979) 281-286. Lectures on ergodic theory.single sequence analysis. 1992. "A simple proof of some ergodic theorems. New York. IT-35(1989). NJ. . Gray and L. Smorodinsky. Measure theory. "On the notion of recurrence in discrete stochastic processes. 369-400. Finite Markov chains. statistics. van Nostrand Co. 291-296. Gray. Lind and B. Inform. Statist. [26] S. Ergodic theorems. Gray. Math. "Asymptotically optimal tests for multinomial distributions.. IT-37(1991). [19] R. Berlin. [27] I. 681-4. 1960. Krengel. Kontoyiannis and Y. IT-20(1975). 1950. Th.A. (F. Halmos. 1993. NY. and optimization. [20] P.. Sci.. Lindvall.) Wiley. Princeton. [35] T. Davisson. Karlin and G. NY. Keane and M." Israel J. 263-268. New Jersey. Princeton. Probability. 87(1983). of Math. 53(1947). 1990." Proc. [22] W. Chelsea Publishing Co.

Ornstein and P. 188-198. [41] D. 43-58. "The Shannon-McMillan-Breiman theorem for amenable groups." Memoirs of the AMS. [51] D. 53-60. CT.. Shields. [48] D." Ergodic Th. Probab. [37] K. Ergodic theory. 63-88." Israel J. IT-39(1993). Marton and P. 182-224.. "A simple proof of the blowing-up lemma. 104(1994). Neuhoff and P. "How sampling reveals a process. Inform." Israel J. [47] D. Probab. Ornstein and B. 18(1990).." Ann. Info. and B. [50] D. Probab. Math. 905-930. IT-38(1992). Ornstein and P." IEEE Trans... "Channel entropy and primitive approximation. Poland." IEEE Trans. Shields. Ornstein. "The positive-divergence and blowing-up properties. Shields. Inform. "A very simplistic." IEEE Trans.. IT-42(1986). 445-447.. Neuhoff and P. Ornstein and B. and dynamical systems. 1561-1563. Shields. "A recurrence theorem for dependent processes with applications to data compression. [38] K. "Block and sliding-block source coding. "Entropy and the consistent estimation of joint distributions. Shields. 1995. 441-452. Neuhoff and P." IEEE Workshop on Information Theory. lossless code. 22(1994). [42] D. Weiss. Probab. 15(1995). Marton and P." Ann." IEEE Trans. Shields. Ann. randomness. [45] D. Th. [40] D. Ornstein and B. Info. 1974.. Weiss. 18(1990).. and Dynam. 11-18." Advances in Math." Ann. [43] D. Shields. Yale Univ. Sys. Nobel and A. Yale Mathematical Monographs 5. Press. D. Marton and P. Weiss." Ann. Ornstein. 44(1983).. "Universal almost sure data compression. "An application of ergodic theory to probability theory"... [53] D. [52] D. [44] A. "Entropy and data compression. . Marton.. Th.... Th. Rudolph. Weiss. Th. 960-977. IT-28(1982). to appear. Probab. Ornstein and P. [46] D. New Haven. "An uncountable family of K-Automorphisms". "Almost sure waiting time results for weak and very weak Bernoulli processes. 951-960. Shields. "Equivalence of measure preserving transformations. Correction: Ann. "Indecomposable finite state channels and primitive approximation. Ornstein. Probab. 331-348. 78-83.BIBLIOGRAPHY 241 [36] K. IT-23(1977). Rydzyna. Math. 211-215. 86(1994). Wyner. Shields. 10(1982). 262(1982). 1(1973). Shields. June. Advances in Math. Inform. [39] K." IEEE Trans. "The d-recognition of processes. 10(1973). Th. [49] D. Neuhoff and P. universal.

159-180. Princeton. 20(1992)." Ann.. Press. [61] D. Lectures on Choquet's theorem. Chicago. [64] P. IT-37(1991). "If a two-point extension of a Bernoulli shift has an ergodic square." Mathematical Systems Theory. Springer-Verlag. 49(1979). English translation: Select." IEEE Trans. Shields. Shields. 263-6." IEEE Trans.242 BIBLIOGRAPHY [54] D. 403-409. Univ. [59] D. IT-37(1991). "On the probability of large deviations of random variables. Convergence of stochastic processes. "Cutting and stacking. "A 'general' measure-preserving transformation is not mixing. IT-25(1979). A method for constructing stationary processes. Rudolph." (in Russian). Inform. "Cutting and independent stacking of intervals. 1-4. 283-291." Dokl. 1199-1203. AN SSSR." Ergodic Th. Rohlin. Shields. then it is Bernoulli. 1645-1647. 1981. 1960." Israel J. 213-244. "On the Bernoulli nature of systems with some hyperbolic structure. N. 1(1961). 269-277. [70] P. Pinsker. "Almost block independence." IEEE Trans. 7 (1973). Inform. Shields. Inform. "Entropy and prefixes. Press. Transi. [57] Phelps. [72] P. Sys. 1964. R. [68] P. Th. Th. Shields. Moscow. Ergodic theory. New York. 1966 [58] M. of Chicago Press. 19(1990). 84 (1977). Shields. [56] K. and Probability." Monatshefte fiir Mathematik.. and Dynam. Pollard. Inform. Nauk SSSR. [65] P. [62] I.J. Probab. 20(1992)." Z. "The entropy theorem via coding bounds. Petersen. [73] P. The theory of Bernoulli shifts. of Math. [71] P. IT-33(1987). Cambridge. Probab. to appear.. [60] V. 349-351. 1984.. 1973." Ann. "The ergodic and entropy theorems revisited. Mat. (In Russian) Vol.. [66] P. 11-44. 1983. Shields. "Stationary coding of processes. Topics in ergodic theory Cambridge Univ.. Sbornik. Information and information stability of random variables and processes. Sanov. [63] P. Van Nostrand. [69] P. Shields. fur Wahr. "Weak and very weak Bernoulli partitions. Shields. Ornstein and B. English translation: Holden-Day.. 30(1978). Statist. Weiss. "String matching . [55] W. Cambridge Univ.. Parry. Shields. 133-142. 7 of the series Problemy Peredai Informacii. Shields..the general ergodic case.. San Francisco. 119-123. Th. Th. Akad. 42(1957).. "Universal almost sure data compression using Markov types. 60(1948). 1605-1617. [67] P. Math." Problems of Control and Information Theory. . Cambridge." IEEE Trans.

Inform. Ziv and A. Info.. Wyner and J. [85] F. "A partition on a Bernoulli shift which is not 'weak Bernoulli'. 521-545. 337-343. 36(1965). Probab. [81] V.BIBLIOGRAPHY 243 [74] P. [78] M. Prob. IT-18(1972)." IEEE Trans. "Kingman's subadditive ergodic theorem. 732-6." IEEE Trans.. IEEE Trans. [76] P. 520-524. [80] M. [79] M. [75] P. of Math. E. [88] A. Ziv. 1647-1659. Shields. Th. M. of Theor. Willems. 135(1992) 373-376. Phillips. 423-439." IEEE Trans. Th. 82(1994). "Waiting times: positive and negative results on the Wyner-Ziv problem. "Universal data compression and repetition times.. IT-39(1993). Ziv. Inform. Th. [82] W. IT-35(1989)." Math.. submitted. Inform. 499-519... IT-37(1991). Th. of Theor. Asymptotic properties of data compression and suffix trees. Thouvenot. An introduction to ergodic theory. IT-23(1977). Walters.. IT-35(1989). "Coding of sources with unknown statistics-Part I: Probability of encoding error. Th. Inst. Syst. Lernpel. "The sliding-window Lempel-Ziv algorithm is asymptotically optimal. Prob. Szpankowski. of the IEEE. 1982. 3(1975). Info. [91] J.." Ann." J. Th. and S. 384-394. Statist. Henri Poincaré." Ann. 25(1989)." IEEE Trans. 872-877. 54-58. [90] J. "Finitary isomorphism of m-dependent processes." J." Ann. "Two divergence-rate counterexamples.." IEEE Trans... Prob. New York. Wyner and J. Ziv." IEEE Trans." Contemporary Math. [77] P. [92] J. 210-203. Shields. 405-412. IT- 24(1978). [83] J." IEEE Trans. Inform. Ziv. Inform. Smorodinsky. Wyner and J. 125-1258. Th. Th. Shields and J. "An ergodic process of zero divergence-distance from the class of all stationary processes." Proc. "Coding theorems for individual sequences. Strassen. Shields. Sciences. "Fixed data base version of the Lempel-ziv data compression algorithm. 6(1993). Inform. "Entropy zero x Bernoulli processes are closed in the a-metric. NYU. Part II: Distortion relative to a fidelity criterion". of Theor. . "The existence of probability measures with given marginals." J.. Smorodinsky. a seminar Courant Inst. Moser.. 878-880. Steele. 5(1971). J. [86] A. [87] A. 93-98. [89] S. Th. [84] P.. 1975.. vol. Xu. "A universal algorithm for sequential data compression. "Universal redundancy rates don't exist. "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression.-P. of Math. Ergodic theory.Varadhan. Ziv. IT- 39(1993). 6(1993). New York. Springer-Verlag.

.

74. 195. 27 conditional E -independence. 195. 194 almost blowing-up. 83 blowing-up property (BUP). E)-blowing-up. 107 of a column structure. 121 per-symbol. 110 concatenated-block process. 107 upward map. 24. 107 top. 107 name. 115 disjoint column structures. 71 coding block. 226 entropy. 123 code. 11 building blocks. 24. 24. 107 column partitioning. 215 column. 110 transformation defined by. 52 block code. 174 in probability. 107 base (bottom). 184 almost block-independence (ABI). 189 (a. 10 concatenation representation. 108 width (thickness). 107 labeling. 107 complete sequences. 58 . 84. 69 built-up set bound. 235 blowup. 211 block-structure measure. 104 block-to-stationary construction. 107 height (length). 195. 110 copy. 8. 8 block-independent process. 235 (8. 108 uniform. 125 base (bottom) of a column. 235 alphabet. 111 column structure: top. 233 addition law. 108 binary entropy function. 188 upward map.Index a-separated. 211 Barron's code-length bound. 115 cutting into copies. 71 code sequence. 68 8-blowup. 105 complete sequences of structures. 108 width. 109 disjoint columns. 185. 190 columnar representation. 108 support. 72 codeword. k)-separated structures. 58 admissible. 24. 107 level. 194 Borel-Cantelli principle. 103. 70 (1 — E)-built-up. 109 column structure. 108 estimation of distributions. 174 in d. 68 blowup bound. 24. 104. 121 n-code. 8 block coding of a process. 187 absolutely regular. 72. 107 subcolumn. 9. K)-strongly-separated. 107 cutting a column. 121 rate of. 215 245 faithful (noiseless). 212 almost blowing-up (ABUP). 7 truncation. 26. 1 asymptotic equipartition (AEP). 109 built-up. 55 B-process. 108 (a. 24. 74 codebook. 179. 69 built by cutting and stacking. 138 circular k-type. 71 length function. 108 width distribution.

112 cyclic merging rule. 94 empirical universe. 56 topological. 67 cardinality bound. 16 totally. 6 shifted empirical block. 49 decomposition. 45 eventually almost surely. 20 process. 90 rotation processes. 89 for i.i. 91 definition for processes. 186.i. 100 ergodicity. 87 d-distance properties. 97 completeness. 21 ergodic theorem (of Birkhoff). 132 Ziv entropy. 17 start. 80 entropy. 90 d'-distance.d. 51 conditional. 42 maximal. 63 of overlapping k-blocks. 51 empirical. 92 for ergodic processes. 15 components. 57 inequality. 1 of blocks in blocks. 59 of a distribution. 168 divergence. 31 distinct words. 89. 184 d-distance . 78 entropy theorem. 166 Elias code. 62 Markov processes. 63 . 76 empirical distribution. 100 subset bound. 62 concatenated-block processes. 99 mixing. 224 ergodic. 129 entropy-typical sequences. 100 typical sequences. 51. 223 for processes. 67 E-independence. 164 consistency conditions. 58 a random variable. 11 expected return time. 65 empirical joinings. 68 prefix codes. 1 continuous distribution. 43 ergodicity and d-limits. 58 entropy rate. 91 definition (joining). 4 . 51 a process. 185 encoding partition. 176. 133 entropy interpretations: covering exponent. 67 cutting and stacking. 102 relationship to entropy. 33 of von Neumann. 109 standard representation. 24 exponential rates for entropy. 92 for sequences. 89 realizing tin 91 d-topology. processes. 138 first-order entropy. 48 Markov. 98 direct product. 59 a partition. INDEX empirical Markov entropy. 143. 138 estimation. 222 of a partition. 64 empirical measure (limiting). 89 for stationary processes. 46 number. 75. 218 covering. 218 coupling measure. 214 coupling. 96 d-far-apart Markov chains. 73 entropy properties: d-limits. 88 entropy of i. 42 of Kingman. 99 finite form. 46 almost-covering principle.246 conditional invertibility. 57 doubly-infinite sequence model. 144 n-th order. 2 d-admissible. 45 "good set" form. 48 empirical entropy. processes. 43. 60 upper semicontinuity. 103.d. 166 for frequencies. 147 distribution of a process. 190 cylinder set.

11 instantaneous coder. 15 irreducible Markov. 6 generalized renewal process. 110 Kolmogorov partition. 11 Markov chain. 41 two-packings. 25 Kolmogorov measure. 3. 25 infinitely often. 7 function. 175. 4. 223 first-order blocks. 139 separated. 14 merging and separation. 71 faithful-code sequence. almost surely. 10 packing. processes. 215 n-blocking. 34 (1 — 0-packing. 20 name of a column. 90 mapping interpretation. 13 Kolmogorov representation. 26 generalized return-time picture. 6 concatenated-block process. 6 k-th order. 3 two-sided model. 186. 34 stopping-time. 115 induced process. 229 247 Kac's return-time formula.i. 4 and complete sequences. 138 partial. E)-typical sequences. 40 strong-packing. 102 faithful code. 190 mixing. 136 linear mass distribution. 3 Kraft inequality. 10 overlapping to nonoverlapping. 7 invariant measure. 62 matching. 122 nonoverlapping-block process. 168 overlapping-block process. 76.INDEX extremal measure. 91 join of partitions. 2. 195 finitary process. 212 stacking of a structure: onto a column. 133 simple LZ parsing. 2)-name. 104 first-order rate bound. 166 frequency. 44 (k. 100 and Markov. 41 . 131 convergence theorem. 131 LZW algorithm. 107 (T. 235 measure preserving. 14 nonadmissibility. 43 typical. 4 complete measure model. 132 upper bound. 41 packing lemma. 116 repeated. 28 independent partitions. 114 M-fold. 10 finitely determined.d. 17 joint input-output measure. 104 log-sum inequality. 222 i. 117 extension. 45 function of a Markov chain. 107 of a column level. 73 Kronecker's theorem. 121 filler or spacer blocks. 5 source. 185 nonexistence of too-good codes. 27 transformation. 22 Lempel-Ziv (LZ) algorithm. 91 definition of d. 34. 159 finite-state process. 195 finite coder. 80 finite energy. 151 finitary coding. 4 set. 18 join of measures. 84 fillers. 115 onto a structure. 221 finite form. 8 approximation theorem. 17 cutting and stacking. 7 finite coding. 6 Markov order theorem. 198. 18 and d-limits. 65 Markov inequality.

34 strongly-packed. 32. 154 repeated words. 165 recurrence-time function. 0-strongly-covered. 63. 40 string matching. 229 Markov process. 1. 30. 23 prefix. 89 Pinsker's inequality. 33 strongly-covered. 190 . 21 process. 215 (S. 3 (T . 63 k-type. 63. 24 process. 138 subadditivity. 101 Shannon-McMillan-Breiman theorem. 170 with gap. 84 stationary process. 67 and d-distance. 167. 13 distribution. 5 Markov. 65 typical sequence. 13 too-long word. 158 process. 229 stopping time. 167 equivalence. block parsings. 6 input-output measure. 122 sequence. 21 d-far-apart rotations. 70 (1 — 0-strongly-covered. 40 interval. 45. 150 (K. 109 universal code. 169. 3. 7 and entropy. 85 strong cover. 31 sliding-block (-window) coding. 55 shift transformation. 215 i.i. 72 code. 44. 205 partition. 179 skew product. 166 for frequencies. 30 transformation. 63 class. 132 of a column structure. 29 type. 82 time-zero coder. 26 rotation (translation). 5 Vf-mixing. 175 - INDEX and upward maps. 166 splitting index. 180 splitting-set lemma. 3 transformation/partition model. 66 Poincare recurrence theorem. n)-independent. 156 totally ergodic. 24 picture. 64 class. 145 existence theorem. 21 tower construction. 121 universe (k-block).248 pairwise k-separated. 14 product measure.P) process. 4 shift. 167 equivalence. 46 (L. 10 regenerative. 14 per-letter Hamming distance. 151 too-small set principle. 96 randomizing the start. 109 start position. 68. 150 representation. 7 with induced spacers. 121 trees. 0-strongly-packed. 6 N-stationary. 59 subinvariant function. 8 rate function for entropy. 64 class bound.. 170. 159 (S.d. 28 stationary coding. 148 return-time distribution. 74. 188 full k-block universe. 59 finite-energy. 81 and mixing. 6 in-dependent. n)-blocking. 27. 211 entropy. 3 shifted. 8. 110 B-process. 180 stacking columns. 12 N-th term process. 8 output measure. 120 stationary. 73 code sequence. 43. 67 too-soon-recurrent representation. 13 and complete sequences. 7 speed of convergence.

119 window function.INDEX very weak Bernoulli (VWB). 148 repeated. 179. 7 words. 200 approximate-match. 232 waiting-time function. 148 Ziv entropy. 87 well-distributed. 87. 200 weak Bernoulli (WB). 195 window half-width. 104 distinct. 233 weak topology. 133 249 . 88 weak convergence.