A Second Course in Probability

Sheldon M. Ross and Erol A. Peköz

Best-selling author Sheldon Ross (University of Southern California) teams up with Erol Peköz (Boston University) to bring you A Second Course in Probability, covering advanced topics in probability for graduate students in statistics, mathematics, engineering, finance, and actuarial science.

www.ProbabilityBookstore.com

A Second Course in Probability
Sheldon M. Ross and Erol A. Peköz
Spring 2007
www.ProbabilityBookstore.com

Copyright © 2007 by Sheldon M. Ross and Erol A. Peköz. All rights reserved. Published by ProbabilityBookstore.com, Boston, MA. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to sales@probabilitybookstore.com. Printed in the United States of America.

ISBN: 0-9795704-0-9
Library of Congress Control Number: 2007930255

To order give your college bookstore the information below: A Second Course in Probability by Sheldon M. Ross and Erol A. Peköz (2007), Publisher: www.ProbabilityBookstore.com, Boston, MA, ISBN: 0-9795704-0-9

Contents

Preface

1 Measure Theory and Laws of Large Numbers
1.1 Introduction
1.2 A Non-Measurable Event
1.3 Countable and Uncountable Sets
1.4 Probability Spaces
1.5 Random Variables
1.6 Expected Value
1.7 Almost Sure Convergence and the Dominated Convergence Theorem
1.8 Convergence in Probability and in Distribution
1.9 The Law of Large Numbers and Ergodic Theorem
1.10 Exercises

2 Stein's Method and Central Limit Theorems
2.1 Introduction
2.2 Coupling
2.3 Poisson Approximation and Le Cam's Theorem
2.4 The Stein-Chen Method
2.5 Stein's Method for the Geometric Distribution
2.6 Stein's Method for the Normal Distribution
2.7 Exercises

3 Conditional Expectation and Martingales
3.1 Introduction
3.2 Conditional Expectation
3.3 Martingales
3.4 The Martingale Stopping Theorem
3.5 The Hoeffding-Azuma Inequality
3.6 Submartingales, Supermartingales, and a Convergence Theorem
3.7 Exercises

4 Bounding Probabilities and Expectations
4.1 Introduction
4.2 Jensen's Inequality
4.3 Probability Bounds via the Importance Sampling Identity
4.4 Chernoff Bounds
4.5 Second Moment and Conditional Expectation Inequalities
4.6 The Min-Max Identity and Bounds on the Maximum
4.7 Stochastic Orderings
4.8 Exercises

5 Markov Chains
5.1 Introduction
5.2 The Transition Matrix
5.3 The Strong Markov Property
5.4 Classification of States
5.5 Stationary and Limiting Distributions
5.6 Time Reversibility
5.7 A Mean Passage Time Bound
5.8 Exercises

6 Renewal Theory
6.1 Introduction
6.2 Some Limit Theorems of Renewal Theory
6.3 Renewal Reward Processes
6.3.1 Queueing Theory Applications of Renewal Reward Processes
6.4 Blackwell's Theorem
6.5 The Poisson Process
6.6 Exercises

7 Brownian Motion
7.1 Introduction
7.2 Continuous Time Martingales
7.3 Constructing Brownian Motion
7.4 Embedding Variables in Brownian Motion
7.5 The Central Limit Theorem
7.6 Exercises

References

Index

Preface

This book is intended to be a second course in probability for undergraduate and graduate students in statistics, mathematics, engineering, finance, and actuarial science. It is a guided tour aimed at instructors who want to give their students a familiarity with some advanced topics in probability, without having to wade through the exhaustive coverage contained in the classic advanced probability theory books (books by Billingsley, Chung, Durrett, Breiman, etc.). The topics covered here include measure theory, limit theorems, bounding probabilities and expectations, coupling, Stein's method, martingales, Markov chains, renewal theory, and Brownian motion.

One noteworthy feature is that this text covers these advanced topics rigorously but without the need of much background in real analysis; other than calculus and material from a first undergraduate course in probability (at the level of A First Course in Probability, by Sheldon Ross), any other concepts required, such as the definition of convergence, the Lebesgue integral, and of countable and uncountable sets, are introduced as needed.

The treatment is highly selective, and one focus is on giving alternative or non-standard approaches for familiar topics to improve intuition. For example we introduce measure theory with an example of a non-measurable set, prove the law of large numbers using the ergodic theorem in the very first chapter, and later give two alternative (but beautiful) proofs of the central limit theorem using Stein's method and Brownian motion embeddings. The coverage of martingales, probability bounds, Markov chains, and renewal theory focuses on applications in applied probability, where a number of recently developed results from the literature are given. The book can be used in a flexible fashion: starting with chapter 1, the remaining chapters can be covered in almost any order, with a few caveats. We hope you enjoy this book.

About notation

Here we assume the reader is familiar with the mathematical notation used in an elementary probability course. For example we write X ~ U(a, b) or X =_d U(a, b) to mean that X is a random variable having a uniform distribution between the numbers a and b. We use common abbreviations like N(μ, σ²) and Poisson(λ) to respectively mean a normal distribution with mean μ and variance σ², and a Poisson distribution with parameter λ. We also write I_A or I{A} to denote a random variable which equals 1 if A is true and equals 0 otherwise, and use the abbreviation "iid" for random variables to mean independent and identically distributed random variables.

About the Authors

Sheldon M. Ross is Epstein Chair Professor in the Department of Industrial and Systems Engineering at the University of Southern California.
He was previously a Professor of Industrial Engineering and Operations Research at the University of California, Berkeley. He received his Ph.D. in statistics from Stanford University in 1968, and has published more than 100 technical articles as well as a variety of textbooks in the areas of applied probability, statistics, and industrial engineering. He is the founding and continuing editor of the journal Probability in the Engineering and Informational Sciences, a fellow of the Institute of Mathematical Statistics, and a recipient of the Humboldt U.S. Senior Scientist Award. He is also the recipient of the 2006 INFORMS Expository Writing Award, awarded by the Institute for Operations Research and the Management Sciences.

Erol A. Peköz is Associate Professor in the School of Management at Boston University. He received his B.S. from Cornell University and Ph.D. in Operations Research from the University of California, Berkeley in 1995 and has published more than 25 technical articles in applied probability and statistics. He also has taught in the statistics departments at the University of California, Berkeley, the University of California, Los Angeles, and Harvard University. At Boston University he was awarded the 2001 Broderick Prize for Teaching, and is now writing The Manager's Common Sense Guide to Statistics.

Acknowledgements

We would like to thank the following people for many valuable comments and suggestions: Jose Blanchet (Harvard University), Joe Blitzstein (Harvard University), Stephen Herschkorn, Amber Puha (California State University, San Marcos), Ward Whitt (Columbia University), and the students in Statistics 212 at Harvard University in the Fall 2005 semester.

Chapter 1

Measure Theory and Laws of Large Numbers

1.1 Introduction

If you're reading this you've probably already seen many different types of random variables, and have applied the usual theorems and laws of probability to them. We will, however, show you there are some seemingly innocent random variables for which none of the laws of probability apply. Measure theory, as it applies to probability, is a theory which carefully describes the types of random variables the laws of probability apply to. This puts the whole field of probability and statistics on a mathematically rigorous foundation.

You are probably familiar with some proof of the famous strong law of large numbers, which asserts that the long-run average of iid random variables converges to the expected value. One goal of this chapter is to show you a beautiful and more general alternative proof of this result using the powerful ergodic theorem. In order to do this, we will first take you on a brief tour of measure theory and introduce you to the dominated convergence theorem, one of measure theory's most famous results and the key ingredient we need.

In section 2 we construct an event, called a non-measurable event, to which the laws of probability don't apply. In section 3 we introduce the notions of countably and uncountably infinite sets, and show you how the elements of some infinite sets cannot be listed in a sequence. In section 4 we define a probability space, and the laws of probability which apply to them. In section 5 we introduce the concept of a measurable random variable, and in section 6 we define the expected value in terms of the Lebesgue integral.
In section 7 we illustrate and prove the dominated convergence theorem, in section 8 we discuss convergence in probability and distribution, and in section 9 we prove 0-1 laws, the ergodic theorem, and use these to obtain the strong law of large numbers.

1.2 A Non-Measurable Event

Consider a circle having radius equal to one. We say that two points on the edge of the circle are in the same family if you can go from one point to the other point by taking steps of length one unit around the edge of the circle. By this we mean each step you take moves you an angle of exactly one radian around the circle, and you are allowed to keep looping around the circle in either direction. Suppose each family elects one of its members to be the head of the family. Here is the question: what is the probability a point X selected uniformly at random along the edge of the circle is the head of its family? It turns out this question has no answer.

The first thing to notice is that each family has an infinite number of family members. Since the circumference of the circle is 2π, you can never get back to your starting point by looping around the circle with steps of length one. If it were possible to start at the top of the circle and get back to the top going a steps clockwise and looping around b times, then you would have a = 2πb for some integers a and b, and hence π = a/(2b). This is impossible because it's well-known that π is an irrational number and can't be written as a ratio of integers.

It may seem to you like the probability should either be zero or one, but we will show you why neither answer could be correct. It doesn't even depend on how the family heads are elected. Define the events A = {X is the head of its family}, A_i = {X is i steps clockwise from the head of its family}, and B_i = {X is i steps counter-clockwise from the head of its family}. Since X was uniformly chosen, we must have P(A) = P(A_i) = P(B_i). But since every family has a head, the sum of these probabilities should equal 1, or in other words

1 = P(A) + ∑_{i=1}^∞ (P(A_i) + P(B_i)).

Thus if c = P(A) we get 1 = c + ∑_{i=1}^∞ 2c, which has no solution with 0 ≤ c ≤ 1. This means it's impossible to compute P(A), and the answer is neither 0 nor 1, nor any other possible number. The event A is called a non-measurable event, because you can't measure its probability in a consistent way.

What's going on here? It turns out that allowing only one head per family, or any finite number of heads, is what makes this event non-measurable. If we allowed more than one head per family and gave everyone a 50% chance, independent of all else, of being a head of the family, then we would have no trouble measuring the probability of this event. Or if we let everyone in the top half of the circle be a family head, and again let families have more than one head, the answer would be easy. Later we will give a careful description of what types of events we can actually compute probabilities for.

Being allowed to choose exactly one family head from each family requires a special mathematical assumption called the axiom of choice. This axiom famously can create all sorts of other logical mayhem, such as allowing you to break a sphere into a finite number of pieces and rearrange them into two spheres of the same size (the "Banach-Tarski paradox"). For this reason the axiom is controversial and has been the subject of much study by mathematicians.
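Although no probability can be attached to the head-of-family event, the geometric fact underlying it, that one-radian steps never return you exactly to your starting point, can be probed numerically. The following is a minimal sketch of our own (not from the text): it tracks how close the walk ever comes back to the start; the gap keeps shrinking but never reaches zero, in line with the irrationality of π.

```python
import math

# Walk around the unit circle in steps of one radian and record how
# close we ever come back to the starting point (angle 0, mod 2*pi).
closest = float("inf")
for n in range(1, 1_000_000):
    angle = n % (2 * math.pi)              # position after n one-radian steps
    gap = min(angle, 2 * math.pi - angle)  # arc distance back to the start
    if gap < closest:
        closest = gap
        # The gap shrinks forever but never hits zero: n = 2*pi*b is
        # impossible for integers n, b since pi is irrational.
print(closest)  # small, but strictly positive
```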
1.3 Countable and Uncountable Sets

You may now be asking yourself if the existence of a uniform random variable X ~ U(0,1) also contradicts the laws of probability. We know that for all x, P(X = x) = 0, but also P(0 ≤ X ≤ 1) = 1. Doesn't this give a contradiction because

P(0 ≤ X ≤ 1) = ∑_{x ∈ [0,1]} P(X = x) = 0?

Actually, this is not a contradiction because a summation over an interval of real numbers does not make any sense. Which values of x would you use for the first few terms in the sum? The first term in the sum could use x = 0, but it's difficult to decide on which value of x to use next. In fact, infinite sums are defined in terms of a sequence of finite sums:

∑_{i=1}^∞ a_i = lim_{n→∞} ∑_{i=1}^n a_i,

and so to have an infinite sum it must be possible to arrange the terms in a sequence. If an infinite set of items can be arranged in a sequence it is called countable, otherwise it is called uncountable. Obviously the integers are countable, using the sequence 0, −1, +1, −2, +2, .... The positive rational numbers are also countable if you express them as a ratio of integers and list them in order by the sum of these integers: 1/1, 1/2, 2/1, 1/3, 2/2, 3/1, 1/4, ....

The real numbers between zero and one, however, are not countable. Here we will explain why. Suppose somebody thinks they have a method of arranging them into a sequence x_1, x_2, ..., where we express them as x_j = ∑_{i=1}^∞ d_{ij} 10^{−i}, so that d_{ij} ∈ {0, 1, 2, ..., 9} is the ith digit after the decimal place of the jth number in their sequence. Then you can clearly see that the number

y = ∑_{i=1}^∞ (1 + I{d_{ii} = 1}) 10^{−i},

where I{A} equals 1 if A is true and 0 otherwise, is nowhere to be found in their sequence. This is because y differs from x_i in at least the ith decimal place, and so it is different from every number in their sequence. Whenever someone tries to arrange the real numbers into a sequence, this shows that they will always be omitting some of the numbers. This proves that the real numbers in any interval are uncountable, and that you can't take a sum over all of them. So it's true with X ~ U(0,1) that for any countable set A we have P(X ∈ A) = ∑_{x∈A} P(X = x) = 0, but we can't simply sum up the probabilities like this for an uncountable set. There are, however, some examples of uncountable sets A (the Cantor set, for example) which have P(X ∈ A) = 0.
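Cantor's diagonal argument can be acted out in code. Here is a small sketch of our own (with a hypothetical finite list standing in for the claimed sequence): given any list of decimal expansions, it builds a number y whose ith digit is 2 if d_ii = 1 and 1 otherwise, so y disagrees with the ith listed number in the ith place.

```python
def diagonal_counterexample(digit_rows):
    """digit_rows[j] holds decimal digits of the j-th listed number.
    Returns digits of a number y differing from the j-th number in place j."""
    return [2 if digit_rows[i][i] == 1 else 1 for i in range(len(digit_rows))]

# A hypothetical "list of all reals in [0,1]" (only 4 shown, digits truncated):
listed = [
    [1, 4, 1, 5],   # 0.1415...
    [7, 1, 8, 2],   # 0.7182...
    [0, 0, 0, 0],   # 0.0000...
    [9, 9, 9, 9],   # 0.9999...
]
print(diagonal_counterexample(listed))  # [2, 2, 1, 1] -> y = 0.2211...
# y's i-th digit differs from the i-th listed number's i-th digit,
# so y cannot appear anywhere in the list.
```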
1.4 Probability Spaces

Let Ω be the set of points in a sample space, and let F be the collection of subsets of Ω for which we can calculate a probability. These subsets are called events, and can be viewed as possible things which could happen. If we let P be the function which gives the probability for any event in F, then the triple (Ω, F, P) is called a probability space. The collection F is usually what is called a sigma field (also called a sigma algebra), which we define next.

Definition 1.1 The collection of sets F is a sigma field, or a σ-field, if it has the following three properties:

1. Ω ∈ F
2. A ∈ F implies A^c ∈ F
3. A_1, A_2, ... ∈ F implies ∪_{i=1}^∞ A_i ∈ F

These properties say you can calculate the probability of the whole sample space (property 1), the complement of any event (property 2), and the countable union of any sequence of events (property 3). These also imply that you can calculate the probability of the countable intersection of any sequence of events, since ∩_{i=1}^∞ A_i = (∪_{i=1}^∞ A_i^c)^c.

To specify a σ-field, people typically start with a collection of events A and write σ(A) to represent the smallest σ-field containing the collection of events A. Thus σ(A) is called the σ-field "generated" by A. It is uniquely defined as the intersection of all possible sigma fields which contain A, and in exercise 2 below you will show such an intersection is always a sigma field.

Example 1.2 Let Ω = {a, b, c} be the sample space and let A = {{a,b}, {c}}. Then A is not a σ-field because {a,b,c} ∉ A, but σ(A) = {{a,b,c}, {a,b}, {c}, ∅}, where ∅ denotes the empty set.

Definition 1.3 A probability measure P is a function, defined on the sets in a sigma field, which has the following three properties:

1. P(Ω) = 1, and
2. P(A) ≥ 0, and
3. P(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i) if for all i ≠ j we have A_i ∩ A_j = ∅.

These imply that probabilities must be between zero and one, and say that the probability of a countable union of mutually exclusive events is the sum of the probabilities.

Example 1.4 Dice. If you roll a pair of dice, the 36 points in the sample space are Ω = {(1,1), (1,2), ..., (5,6), (6,6)}. We can let F be the collection of all possible subsets of Ω, and it's easy to see that it is a sigma field. Then we can define

P(A) = |A|/36,

where |A| is the number of sample space points in A. Thus if A = {(1,1), (3,2)}, then P(A) = 2/36, and it's easy to see that P is a probability measure.

Example 1.5 The unit interval. Suppose we want to pick a uniform random number between zero and one. Then the sample space equals Ω = [0,1], the set of all real numbers between zero and one. We can let F be the collection of all possible subsets of Ω, and it's easy to see that it is a sigma field. But it turns out that it's not possible to put a probability measure on this sigma field. Since one of the sets in F would be similar to the set of heads of the family (from the non-measurable event example), this event cannot have a probability assigned to it. So this sigma field is not a good one to use in probability.

Example 1.6 The unit interval again. Again with Ω = [0,1], suppose we use the sigma field F = σ({x} : x ∈ Ω), the smallest sigma field generated by all possible sets containing a single real number. This is a nice enough sigma field, but it would never be possible to find the probability for some interval, such as [0.2, 0.4]. You can't take a countable union of single real numbers and expect to get an uncountable interval somehow. So this is not a good sigma field to use.

So if we want to put a probability measure on the real numbers between zero and one, what sigma field can we use? The answer is the Borel sigma field B, the smallest sigma field generated by all intervals of the form [x, y) of real numbers between zero and one:

B = σ([x, y) : 0 ≤ x < y ≤ 1).

As an example of computing probabilities, suppose a fair coin is flipped n times, and we want the probability of getting at least k heads in a row somewhere in the n flips. Recall the inclusion-exclusion formula

P(∪_{i=1}^n A_i) = ∑_i P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k) − ··· + (−1)^{n+1} P(A_1 ∩ ··· ∩ A_n).

The answer is then

P(at least k heads in a row) = ∑_{m≥1} (−1)^{m+1} [ (n−mk choose m) 2^{−m(k+1)} + (n−mk choose m−1) 2^{−m(k+1)+1} ].

When we define the event A_i = {a run of a tail immediately followed by k heads in a row starts at flip i}, and A_0 = {the first k flips are heads}, we can use the inclusion-exclusion formula to get the answer above because P(at least k heads in a row) = P(∪_{i=0}^{n−k} A_i) and

P(A_{i_1} ∩ A_{i_2} ∩ ··· ∩ A_{i_m}) = 0 if the flips for any events overlap, 2^{−m(k+1)} otherwise if i_1 > 0, and 2^{−m(k+1)+1} otherwise if i_1 = 0,

and the number of sets of indices i_1 < i_2 < ··· < i_m with non-overlapping flips is (n−mk choose m) if i_1 > 0 (imagine the k heads in each of the m runs are invisible, so this is the number of ways to arrange m tails in n − mk visible flips) and (n−mk choose m−1) if i_1 = 0.
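The inclusion-exclusion answer above is easy to sanity-check by simulation. Below is a small Monte Carlo sketch of our own (the helper names and trial count are ours): it estimates the probability of a run of at least k heads in n fair flips and compares it with the displayed sum.

```python
import random
from math import comb

def run_prob_formula(n, k):
    """Inclusion-exclusion sum for P(at least k heads in a row in n fair flips)."""
    total, m = 0.0, 1
    while n - m * k >= 0:
        term = (comb(n - m*k, m) * 2**(-m*(k+1))
                + comb(n - m*k, m - 1) * 2**(-m*(k+1) + 1))
        total += (-1)**(m + 1) * term
        m += 1
    return total

def run_prob_simulated(n, k, trials=200_000):
    hits = 0
    for _ in range(trials):
        run = best = 0
        for _ in range(n):
            run = run + 1 if random.random() < 0.5 else 0  # extend or reset the run
            best = max(best, run)
        hits += best >= k
    return hits / trials

print(run_prob_formula(10, 3), run_prob_simulated(10, 3))
# Both are close to 520/1024 = 0.5078125.
```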
An important property of the probability function is that it is a continuous function on the events of the sample space Ω. To make this precise, let A_n, n ≥ 1, be a sequence of events, and define the event liminf A_n as

liminf A_n = ∪_{n=1}^∞ ∩_{i=n}^∞ A_i.

Because liminf A_n consists of all outcomes of the sample space that are contained in ∩_{i=n}^∞ A_i for some n, it follows that liminf A_n consists of all outcomes that are contained in all but a finite number of the events A_n, n ≥ 1. Similarly, the event limsup A_n is defined by

limsup A_n = ∩_{n=1}^∞ ∪_{i=n}^∞ A_i.

Because limsup A_n consists of all outcomes of the sample space that are contained in ∪_{i=n}^∞ A_i for all n, it follows that limsup A_n consists of all outcomes that are contained in an infinite number of the events A_n, n ≥ 1. Sometimes the notation {A_n i.o.} is used to represent limsup A_n, where the "i.o." stands for "infinitely often" and it means that an infinite number of the events A_n occur. Note that by their definitions

liminf A_n ⊆ limsup A_n.

Definition 1.11 If limsup A_n = liminf A_n, we say that lim A_n exists and define it by

lim A_n = limsup A_n = liminf A_n.

Example 1.12 (a) Suppose that A_n, n ≥ 1, is an increasing sequence of events, in that A_n ⊆ A_{n+1}, n ≥ 1. Then ∩_{i=n}^∞ A_i = A_n, showing that

liminf A_n = ∪_{n=1}^∞ A_n.

Also, ∪_{i=n}^∞ A_i = ∪_{i=1}^∞ A_i, showing that

limsup A_n = ∪_{i=1}^∞ A_i.

Hence,

lim A_n = ∪_{i=1}^∞ A_i.

(b) If A_n, n ≥ 1, is a decreasing sequence of events, in that A_{n+1} ⊆ A_n, n ≥ 1, then it similarly follows that

lim A_n = ∩_{i=1}^∞ A_i.

The following result is known as the continuity property of probabilities.

Proposition 1.13 If lim_n A_n = A, then lim_n P(A_n) = P(A).

Proof: We prove it first for when A_n is either an increasing or decreasing sequence of events. Suppose A_n ⊆ A_{n+1}, n ≥ 1. Then, with A_0 defined to be the empty set,

P(lim A_n) = P(∪_{i=1}^∞ A_i)
= P(∪_{i=1}^∞ (A_i ∩ A_{i−1}^c))
= ∑_{i=1}^∞ P(A_i ∩ A_{i−1}^c)
= lim_{n→∞} ∑_{i=1}^n P(A_i ∩ A_{i−1}^c)
= lim_{n→∞} P(∪_{i=1}^n (A_i ∩ A_{i−1}^c))
= lim_{n→∞} P(∪_{i=1}^n A_i)
= lim_{n→∞} P(A_n).

Now, suppose that A_{n+1} ⊆ A_n, n ≥ 1. Because A_n^c is an increasing sequence of events, the preceding implies that

P(∪_{i=1}^∞ A_i^c) = lim_{n→∞} P(A_n^c)

or, equivalently,

P((∩_{i=1}^∞ A_i)^c) = lim_{n→∞} (1 − P(A_n)),

which gives P(∩_{i=1}^∞ A_i) = lim_{n→∞} P(A_n), and completes the proof whenever A_n is a monotone sequence.

Now, consider the general case, and let B_n = ∪_{i=n}^∞ A_i. Noting that B_{n+1} ⊆ B_n, and applying the preceding yields

P(limsup A_n) = P(∩_{n=1}^∞ B_n) = lim_{n→∞} P(B_n).   (1.1)

Also, with C_n = ∩_{i=n}^∞ A_i,

P(liminf A_n) = P(∪_{n=1}^∞ C_n) = lim_{n→∞} P(C_n)   (1.2)

because C_n ⊆ C_{n+1}. But

C_n = ∩_{i=n}^∞ A_i ⊆ A_n ⊆ ∪_{i=n}^∞ A_i = B_n,

showing that

P(C_n) ≤ P(A_n) ≤ P(B_n).   (1.3)

Thus, if liminf A_n = limsup A_n = lim A_n, then we obtain from (1.1) and (1.2) that the upper and lower bounds of (1.3) converge to each other in the limit, and this proves the result. ∎
1.5 Random Variables

Suppose you have a function X which assigns a real number to each point in the sample space Ω, and you also have a sigma field F. We say that X is an F-measurable random variable if you can compute its entire cumulative distribution function using probabilities of events in F or, equivalently, that you would know the value of X if you were told which events in F actually happen. We define the notation {X ≤ x} = {ω ∈ Ω : X(ω) ≤ x}, so that X is F-measurable if {X ≤ x} ∈ F for all x. This is often written in shorthand notation as X ∈ F.

Example 1.14 Let Ω = {a, b, c}, A = {{a,b,c}, {a,b}, {c}, ∅}, and define three random variables X, Y, Z as follows:

ω    X(ω)   Y(ω)   Z(ω)
a     1      1      1
b     1      2      7
c     2      2      4

Which of the random variables X, Y, and Z are A-measurable? Well, since {Y ≤ 1} = {a} ∉ A, Y is not A-measurable. For the same reason, Z is not A-measurable. The variable X is A-measurable because {X ≤ 1} = {a,b} ∈ A and {X ≤ 2} = {a,b,c} ∈ A. In other words, you can always figure out the value of X using just the events in A, but you can't always figure out the values of Y and Z.

Definition 1.15 For a random variable X we define σ(X) to be the smallest sigma field with respect to which X is measurable. We read it as "the sigma field generated by X."

Definition 1.16 For random variables X, Y we say that X is Y-measurable if X ∈ σ(Y).

Example 1.17 In the previous example, is Y ∈ σ(Z)? Yes, because σ(Z) = {{a,b,c}, {a}, {a,b}, {b}, {b,c}, {c}, {a,c}, ∅}, the set of all possible subsets of Ω. Is X ∈ σ(Y)? No, since {X ≤ 1} = {a,b} ∉ σ(Y) = {{a,b,c}, {b,c}, {a}, ∅}. To see why σ(Z) is as given, note that {Z ≤ 1} = {a}, {Z ≤ 4} = {a,c}, {Z ≤ 7} = {a,b,c}, {a}^c = {b,c}, {a,b}^c = {c}, {a} ∪ {c} = {a,c}, {a,b,c}^c = ∅, and {a,c}^c = {b}.

Example 1.18 Suppose X and Y are random variables taking values between zero and one, and are measurable with respect to the Borel sigma field B. Is Z = X + Y also measurable with respect to B? Well, we must show that {Z ≤ z} ∈ B for all z. We can write {X + Y > z} = ∪_{q∈Q} ({X > q} ∩ {Y > z − q}), where Q is the set of rational numbers. Since {X > q} ∈ B, {Y > z − q} ∈ B, and Q is countable, this means that {X + Y ≤ z} = {X + Y > z}^c ∈ B, and thus Z is measurable with respect to B.

Example 1.19 The function F(x) = P(X ≤ x) is called the distribution function of the random variable X. If x_n ↓ x then the sequence of events A_n = {X ≤ x_n}, n ≥ 1, is a decreasing sequence whose limit is

lim A_n = ∩_n A_n = {X ≤ x}.

Consequently, the continuity property of probabilities yields that

F(x) = lim_n F(x_n),

showing that a distribution function is always right continuous. On the other hand, if x_n ↑ x, then the sequence of events A_n = {X ≤ x_n}, n ≥ 1, is an increasing sequence, implying that

lim_n F(x_n) = P(∪_n A_n) = P(X < x) = F(x) − P(X = x).

Two events are independent if knowing that one occurs does not change the chance that the other occurs. This is formalized in the following definition.

Definition 1.20 Sigma fields F_1, ..., F_n are independent if whenever A_i ∈ F_i for i = 1, ..., n, we have P(∩_{i=1}^n A_i) = ∏_{i=1}^n P(A_i). Using this we say that random variables X_1, ..., X_n are independent if the sigma fields σ(X_1), ..., σ(X_n) are independent, and we say events A_1, ..., A_n are independent if I_{A_1}, ..., I_{A_n} are independent random variables.

Remark 1.21 One interesting property of independence is that it's possible that events A, B, C are not independent even if each pair of the events are independent. For example, make three independent flips of a fair coin and let A represent the event that exactly 1 head comes up in the first two flips, let B represent the event that exactly 1 head comes up in the last two flips, and let C represent the event that exactly 1 head comes up among the first and last flip. Then each event has probability 1/2, the intersection of each pair of events has probability 1/4, but we have P(ABC) = 0.
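The three events in Remark 1.21 can be checked by brute-force enumeration of the eight equally likely outcomes. A quick sketch of our own:

```python
from itertools import product

outcomes = list(product([0, 1], repeat=3))     # three fair coin flips
A = {w for w in outcomes if w[0] + w[1] == 1}  # exactly 1 head in first two flips
B = {w for w in outcomes if w[1] + w[2] == 1}  # exactly 1 head in last two flips
C = {w for w in outcomes if w[0] + w[2] == 1}  # exactly 1 head in first and last

p = lambda E: len(E) / 8
print(p(A), p(B), p(C))                  # 0.5 0.5 0.5
print(p(A & B), p(A & C), p(B & C))      # 0.25 each: pairwise independent
print(p(A & B & C), p(A) * p(B) * p(C))  # 0.0 vs 0.125: not jointly independent
```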
In our next example we derive a formula for the distribution of the convolution of geometric random variables.

Example 1.22 Suppose we have n coins that we toss in sequence, moving from one coin to the next in line each time a head appears. That is, we continue using a coin until it lands heads, and then we switch to the next one. Let X_i denote the number of flips made with coin i. Assuming that all coin flips are independent and that each lands heads with probability p, we know from our first course in probability that X_i is a geometric random variable with parameter p, and also that the total number of flips made has a negative binomial distribution with probability mass function

P(X_1 + ··· + X_n = m) = (m−1 choose n−1) p^n (1−p)^{m−n},  m ≥ n.

The probability mass function of the total number of flips when each coin has a different probability of landing heads is easily obtained using the following proposition.

Proposition 1.23 If X_1, ..., X_n are independent geometric random variables with parameters p_1, ..., p_n, where p_i ≠ p_j if i ≠ j, then, with q_i = 1 − p_i, for k ≥ n

P(X_1 + ··· + X_n > k) = ∑_{i=1}^n q_i^k ∏_{j≠i} p_j/(p_j − p_i).

Proof. We will prove A_{k,n} = P(X_1 + ··· + X_n > k) is as given above using induction on k + n. Since clearly A_{k,1} = q_1^k, we will assume as our induction hypothesis that A_{i,j} is as given above for all i + j < k + n. Conditioning on whether {X_n > 1} occurs (and using the memoryless property of the geometric) we get

A_{k,n} = q_n A_{k−1,n} + p_n A_{k−1,n−1}
= q_n ∑_{i=1}^n q_i^{k−1} ∏_{j≠i} p_j/(p_j − p_i) + p_n ∑_{i=1}^{n−1} q_i^{k−1} ∏_{j≠i, j<n} p_j/(p_j − p_i)
= q_n^k ∏_{j≠n} p_j/(p_j − p_n) + ∑_{i=1}^{n−1} q_i^{k−1} ( q_n p_n/(p_n − p_i) + p_n ) ∏_{j≠i, j<n} p_j/(p_j − p_i)
= q_n^k ∏_{j≠n} p_j/(p_j − p_n) + ∑_{i=1}^{n−1} q_i^k ∏_{j≠i} p_j/(p_j − p_i),

where the last equality uses q_n p_n/(p_n − p_i) + p_n = p_n(1 − p_i)/(p_n − p_i) = q_i p_n/(p_n − p_i). This equals the claimed expression for A_{k,n}, which completes the proof by induction. ∎
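Proposition 1.23 is easy to test numerically. The sketch below is our own illustration (the parameter values and helper names are hypothetical): it simulates the sum of independent geometrics with distinct success probabilities and compares the empirical tail with the formula.

```python
import random

def tail_formula(ps, k):
    """P(X_1 + ... + X_n > k) from Proposition 1.23; needs distinct ps, k >= n."""
    total = 0.0
    for i, pi in enumerate(ps):
        prod = 1.0
        for j, pj in enumerate(ps):
            if j != i:
                prod *= pj / (pj - pi)
        total += (1 - pi) ** k * prod
    return total

def tail_simulated(ps, k, trials=200_000):
    def geom(p):  # number of flips of a p-coin until the first head
        n = 1
        while random.random() >= p:
            n += 1
        return n
    return sum(sum(geom(p) for p in ps) > k for _ in range(trials)) / trials

ps = [0.3, 0.5, 0.7]  # hypothetical, distinct success probabilities
print(tail_formula(ps, 6), tail_simulated(ps, 6))  # should roughly agree
```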
1.6 Expected Value

A random variable X is continuous if there is a function f, called its density function, so that P(X ≤ x) = ∫_{−∞}^x f(t) dt for all x. A random variable is discrete if it can only take a countable number of different values. In elementary textbooks you usually see two separate definitions for expected value:

E[X] = ∑_i x_i P(X = x_i) if X is discrete, and
E[X] = ∫ x f(x) dx if X is continuous with density f.

But it's possible to have a random variable which is neither continuous nor discrete. For example, with U ~ U(0,1), the variable X = U I_{U>.5} is neither continuous nor discrete. It's also possible to have a sequence of continuous random variables which converges to a discrete random variable, or vice versa. For example, if X_n = U/n, then each X_n is a continuous random variable but lim_{n→∞} X_n is a discrete random variable (which equals zero). This means it would be better to have a single more general definition which covers all types of random variables. We introduce this next.

A simple random variable is one which can take on only a finite number of different possible values, and its expected value is defined as above for discrete random variables. Using these, we next define the expected value of a more general non-negative random variable. We will later define it for general random variables X by expressing X as the difference of two nonnegative random variables X = X⁺ − X⁻, where x⁺ = max(0, x) and x⁻ = max(−x, 0).

Definition 1.24 If X ≥ 0, then we define

E[X] = sup_{all simple variables Y ≤ X} E[Y].

We write Y ≤ X for random variables X, Y to mean P(Y ≤ X) = 1, and this is sometimes written as "Y ≤ X almost surely" and abbreviated "Y ≤ X a.s." For example, if X is nonnegative and a > 0 then Y = a I_{X≥a} is a simple random variable such that Y ≤ X. And by taking a supremum over "all simple variables" we of course mean the simple random variables must be measurable with respect to some given sigma field. Given a nonnegative random variable X, one concrete choice of simple variables is the sequence Y_n = min(⌊2^n X⌋/2^n, n), where ⌊x⌋ denotes the integer portion of x. We ask you in exercise 17 at the end of the chapter to show that Y_n ↑ X and E[X] = lim_n E[Y_n]. Another consequence of the definition of expected value is that if Y ≤ X, then E[Y] ≤ E[X].

Example 1.25 Markov's inequality. Suppose X ≥ 0. Then, for any a > 0 we have that a I_{X≥a} ≤ X. Therefore, E[a I_{X≥a}] ≤ E[X] or, equivalently,

P(X ≥ a) ≤ E[X]/a,

which is known as Markov's inequality.

Example 1.26 Chebyshev's inequality. A consequence of Markov's inequality is that for a > 0

P(|X| ≥ a) = P(X² ≥ a²) ≤ E[X²]/a²,

a result known as Chebyshev's inequality.

Given any random variable X ≥ 0 with E[X] < ∞, and any ε > 0, we can find a simple random variable Y with E[X] − ε ≤ E[Y] ≤ E[X]. Our definition of the expected value also gives what is called the Lebesgue integral of X with respect to the probability measure P, and is sometimes denoted E[X] = ∫ X dP.

So far we have only defined the expected value of a nonnegative random variable. For the general case we first define X⁺ = X I_{X≥0} and X⁻ = −X I_{X<0}, so that X = X⁺ − X⁻, and define E[X] = E[X⁺] − E[X⁻] whenever these two expected values are not both infinite. The expected value defined this way is linear, as we show next.

Proposition 1.27 For random variables X, Y and constants a, b, provided the expectations below exist,

(a) E[aX + b] = aE[X] + b, and
(b) E[X + Y] = E[X] + E[Y].

Proof. We prove it first for the case X, Y ≥ 0, a > 0, and b = 0. The general cases will follow using E[X + Y] = E[X⁺ + Y⁺] − E[X⁻ + Y⁻],

E[b + X] = sup_{Y ≤ X} E[b + Y] = b + sup_{Y ≤ X} E[Y] = b + E[X],

and −aX + b = a(−X) + b.

For part (a), if X is simple we have

E[aX] = ∑_x ax P(X = x) = aE[X],

and since for every simple variable Z ≤ X there corresponds another simple variable aZ ≤ aX, and vice versa, we get

E[aX] = sup_{Z ≤ aX} E[Z] = sup_{Z ≤ X} E[aZ] = sup_{Z ≤ X} aE[Z] = aE[X],

where the supremums are over simple random variables.

For part (b), if X, Y are simple we have

E[X + Y] = ∑_z z P(X + Y = z)
= ∑_z z ∑_{x,y: x+y=z} P(X = x, Y = y)
= ∑_z ∑_{x,y: x+y=z} (x + y) P(X = x, Y = y)
= ∑_{x,y} x P(X = x, Y = y) + ∑_{x,y} y P(X = x, Y = y)
= ∑_x x P(X = x) + ∑_y y P(Y = y)
= E[X] + E[Y],

and applying this in the second line below we get

E[X] + E[Y] = sup_{A ≤ X, B ≤ Y} E[A] + E[B]
= sup_{A ≤ X, B ≤ Y} E[A + B]
≤ sup_{A ≤ X+Y} E[A]
= E[X + Y],

where the supremums are over simple random variables. We then use this inequality in the third line below:

E[min(X + Y, n)] = 2n − E[2n − min(X + Y, n)]
≤ 2n − E[(n − min(X, n)) + (n − min(Y, n))]
≤ 2n − E[n − min(X, n)] − E[n − min(Y, n)]
= E[min(X, n)] + E[min(Y, n)]
≤ E[X] + E[Y].

Letting n → ∞ and using E[min(X + Y, n)] ↑ E[X + Y] (exercise 17) gives E[X + Y] ≤ E[X] + E[Y], which combined with the reverse inequality above proves part (b). ∎

Proposition 1.30 If X is a nonnegative integer valued random variable, then

E[X] = ∑_{n=0}^∞ P(X > n).

Proof. Since E[X] = p_1 + 2p_2 + 3p_3 + 4p_4 + ···, where p_i = P(X = i) (see problem 6), we re-write this as

E[X] = p_1 + p_2 + p_3 + p_4 + ···
            + p_2 + p_3 + p_4 + ···
                 + p_3 + p_4 + ···
                      + p_4 + ···

and notice that the columns respectively sum to p_1, 2p_2, 3p_3, ..., while the rows respectively equal P(X > 0), P(X > 1), P(X > 2), .... ∎

Example 1.31 With X_1, X_2, ... independent U(0,1) random variables, compute the expected value of

N = min{ n : ∑_{i=1}^n X_i > 1 }.

Solution. Using E[N] = ∑_{n=0}^∞ P(N > n), and noting that P(N > 0) = P(N > 1) = 1, and

P(N > n) = ∫_0^1 ∫_0^{1−x_1} ∫_0^{1−x_1−x_2} ··· ∫_0^{1−x_1−···−x_{n−1}} dx_n ··· dx_2 dx_1 = 1/n!,

we get E[N] = ∑_{n=0}^∞ 1/n! = e. ∎
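Example 1.31's elegant answer invites a one-line simulation; a small sketch of our own:

```python
import random

def draws_needed():
    # N = smallest n with X_1 + ... + X_n > 1 for iid U(0,1) draws X_i
    total, n = 0.0, 0
    while total <= 1:
        total += random.random()
        n += 1
    return n

trials = 200_000
print(sum(draws_needed() for _ in range(trials)) / trials)  # close to e = 2.718...
```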
1.7 Almost Sure Convergence and the Dominated Convergence Theorem

For a sequence of non-random real numbers we write x_n → x or lim_{n→∞} x_n = x if for any ε > 0 there exists a value n such that |x_m − x| < ε for all m > n. Intuitively this means eventually the sequence never leaves an arbitrarily small neighborhood around x. It doesn't simply mean that you can always find terms in the sequence which are arbitrarily close to x, but rather it means that eventually all terms in the sequence become arbitrarily close to x. When x_n → ∞, it means that for any k > 0 there exists a value n such that x_m > k for all m > n.

The sequence of random variables X_n, n ≥ 1, is said to converge almost surely to the random variable X, written X_n →a.s. X, or lim_{n→∞} X_n = X a.s., if with probability 1

lim_{n→∞} X_n = X.

The following proposition presents an alternative characterization of almost sure convergence.

Proposition 1.32 X_n →a.s. X if and only if for any ε > 0

P(|X_n − X| ≤ ε for all n ≥ m) → 1 as m → ∞.

Proof. Suppose first that X_n →a.s. X. Fix ε > 0, and for m ≥ 1, define the event

A_m = {|X_n − X| ≤ ε for all n ≥ m}.

Because A_m, m ≥ 1, is an increasing sequence of events, the continuity property of probabilities yields that

lim_m P(A_m) = P(lim_m A_m)
= P(|X_n − X| ≤ ε for all n sufficiently large)
≥ P(lim_n X_n = X)
= 1.

To go the other way, assume that for any ε > 0

P(|X_n − X| ≤ ε for all n ≥ m) → 1 as m → ∞.

Let ε_i, i ≥ 1, be a decreasing sequence of positive numbers that converge to 0, and let

A_{m,i} = {|X_n − X| ≤ ε_i for all n ≥ m}.

Because A_{m,i} ⊆ A_{m+1,i} and, by assumption, lim_m P(A_{m,i}) = 1, it follows from the continuity property that

1 = P(lim_m A_{m,i}) = P(B_i),

where B_i = {|X_n − X| ≤ ε_i for all n sufficiently large}. But B_i, i ≥ 1, is a decreasing sequence of events, so invoking the continuity property once again yields that

1 = lim_{i→∞} P(B_i) = P(lim_i B_i),

which proves the result since

lim_i B_i = {for all i, |X_n − X| ≤ ε_i for all n sufficiently large} = {lim_n X_n = X}. ∎

Remark 1.33 The reason for the word "almost" in "almost surely" is that P(A) = 1 doesn't necessarily mean that A^c is the empty set. For example, if X ~ U(0,1) we know that P(X ≠ 1/3) = 1 even though {X = 1/3} is a possible outcome.

The dominated convergence theorem is one of the fundamental building blocks of all limit theorems in probability. It tells you something about what happens to the expected value of random variables in a sequence, if the random variables are converging almost surely. Many limit theorems in probability involve an almost surely converging sequence, and being able to accurately say something about the expected value of the limiting random variable is important.

Given a sequence of random variables X_1, X_2, ..., it may seem to you at first thought that X_n → X a.s. should imply lim_{n→∞} E[X_n] = E[X]. This is sometimes called "interchanging limit and expectation," since E[X] = E[lim_{n→∞} X_n]. But this interchange is not always valid, and the next example illustrates this.

Example 1.34 Suppose U ~ U(0,1) and X_n = n I_{U<1/n}. Regardless of what U turns out to be, as soon as n gets larger than 1/U the terms X_n in the sequence will all equal zero. This means X_n → 0 a.s., but at the same time we have E[X_n] = nP(U < 1/n) = n/n = 1 for all n, and thus lim_{n→∞} E[X_n] = 1. Interchanging limit and expectation is not valid in this case.
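The failure of the interchange in Example 1.34 shows up clearly in simulation; here is a minimal sketch of our own:

```python
import random

def sample_Xn(n):
    u = random.random()           # U ~ U(0, 1)
    return n if u < 1 / n else 0  # X_n = n * I{U < 1/n}

trials = 200_000
for n in [1, 10, 100, 1000]:
    est = sum(sample_Xn(n) for _ in range(trials)) / trials
    print(n, est)  # E[X_n] stays near 1 for every n, even though X_n -> 0 a.s.
```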
What's going wrong here? In this case X_n can increase beyond any level as n gets larger and larger, and this can cause problems with the expected value. The dominated convergence theorem says that if X_n is always bounded in absolute value by some other random variable with finite mean, then we can interchange limit and expectation. We will first state the theorem, give some examples, and then give a proof. The proof is a nice illustration of the definition of expected value.

Proposition 1.35 The dominated convergence theorem. Suppose X_n → X a.s., and there is a random variable Y with E[Y] < ∞ such that |X_n| ≤ Y for all n. Then

E[lim_{n→∞} X_n] = lim_{n→∞} E[X_n].

This is often used in the form where Y is a nonrandom constant, and then it's called the bounded convergence theorem. Before we prove it, we first give a couple of examples and illustrations.

Example 1.36 Suppose U ~ U(0,1) and X_n = U/n. It's easy to see that X_n → 0 a.s., and the theorem would tell us that E[X_n] → 0. In fact in this case we can easily calculate E[X_n] = 1/(2n) → 0. The theorem applies using Y = 1 since |X_n| ≤ 1.

Example 1.37 With X ~ N(0,1) let X_n = min(X, n), and notice X_n → X almost surely. Since |X_n| ≤ |X| we can apply the theorem using Y = |X| to tell us E[X_n] → E[X].

Example 1.38 Suppose X ~ N(0,1) and let X_n = X I_{X>−n} − n I_{X≤−n}. Again X_n → X, so using Y = |X| the theorem tells us E[X_n] → E[X].

Proof of the dominated convergence theorem. To be able to directly apply the definition of expected value, in this proof we assume X_n ≥ 0. To prove the general result, we can apply the same argument to X_n + Y ≥ 0 with the bound |X_n + Y| ≤ 2Y. Our approach will be to show that for any ε > 0 we have, for all sufficiently large n, both

(a) E[X_n] ≥ E[X] − 3ε, and
(b) E[X_n] ≤ E[X] + 3ε.

Since ε is arbitrary, this will prove the theorem.

First let N_ε = min{n : |X_i − X| ≤ ε for all i ≥ n}, and note that X_n →a.s. X implies that P(N_ε < ∞) = 1. To prove (a), note first that for any m

X_n + ε ≥ min(X, m) − m I_{N_ε > n}.

The preceding is true when N_ε > n because in this case the right hand side is nonpositive; it is also true when N_ε ≤ n because in this case X_n + ε ≥ X. Thus,

E[X_n] + ε ≥ E[min(X, m)] − mP(N_ε > n).

Now, |X| ≤ Y implies that E[X] ≤ E[Y] < ∞. Consequently, using the definition of E[X], we can find a simple random variable Z ≤ X with E[Z] ≥ E[X] − ε. Since Z is simple, we can then pick m large enough so Z ≤ min(X, m), and thus

E[min(X, m)] ≥ E[Z] ≥ E[X] − ε.

Then N_ε < ∞ implies, by the continuity property, that mP(N_ε > n) ≤ ε for sufficiently large n. Combining this with the preceding shows that for sufficiently large n

E[X_n] + ε ≥ E[X] − 2ε,

which is part (a) above.

For part (b), apply part (a) to the sequence of nonnegative random variables Y − X_n, which converges almost surely to Y − X with the bound |Y − X_n| ≤ 2Y. We get E[Y − X_n] ≥ E[Y − X] − 3ε, and re-arranging and subtracting E[Y] from both sides gives (b). ∎

Remark 1.39 Part (a) in the proof holds for non-negative random variables even without the upper bound Y and under the weaker assumption that inf_{m≥n} X_m → X as n → ∞. This result is usually referred to as Fatou's lemma. That is, Fatou's lemma states that for any ε > 0 we have E[X_n] ≥ E[X] − ε for sufficiently large n, or equivalently that inf_{m≥n} E[X_m] ≥ E[X] − ε for sufficiently large n. This result is usually denoted as

liminf_{n→∞} E[X_n] ≥ E[liminf_{n→∞} X_n].

A result called the monotone convergence theorem can also be proved.

Proposition 1.40 The monotone convergence theorem. If 0 ≤ X_n ↑ X, then E[X_n] ↑ E[X].

Proof. If E[X] < ∞, we can apply the dominated convergence theorem using the bound |X_n| ≤ X. Consider now the case where E[X] = ∞. For any m, we have min(X_n, m) → min(X, m). Because E[min(X, m)] < ∞, it follows by the dominated convergence theorem that

lim_n E[min(X_n, m)] = E[min(X, m)].

But since E[X_n] ≥ E[min(X_n, m)], this implies

lim_n E[X_n] ≥ lim_m E[min(X, m)].

Because E[X] = ∞, it follows that for any K there is a simple random variable A ≤ X such that E[A] > K. Because A is simple, A ≤ min(X, m) for sufficiently large m. Thus, for any K

lim_m E[min(X, m)] ≥ E[A] > K,

proving that lim_{m→∞} E[min(X, m)] = ∞, and completing the proof. ∎

We now present a couple of corollaries of the monotone convergence theorem.

Corollary 1.41 If X_i ≥ 0, then

E[∑_{i=1}^∞ X_i] = ∑_{i=1}^∞ E[X_i].

Proof.

∑_{i=1}^∞ E[X_i] = lim_n ∑_{i=1}^n E[X_i] = lim_n E[∑_{i=1}^n X_i] = E[∑_{i=1}^∞ X_i],

where the final equality follows from the monotone convergence theorem since ∑_{i=1}^n X_i ↑ ∑_{i=1}^∞ X_i. ∎

Corollary 1.42 If X and Y are independent, then E[XY] = E[X]E[Y].

Proof. Suppose first that X and Y are simple.
Then we can write

X = ∑_{i=1}^n x_i I_{X=x_i},  Y = ∑_{j=1}^m y_j I_{Y=y_j}.

Thus,

E[XY] = E[∑_{i,j} x_i y_j I_{X=x_i} I_{Y=y_j}]
= ∑_{i,j} x_i y_j E[I_{X=x_i} I_{Y=y_j}]
= ∑_{i,j} x_i y_j P(X = x_i, Y = y_j)
= ∑_{i,j} x_i y_j P(X = x_i) P(Y = y_j)
= E[X]E[Y].

Next, suppose X, Y are general non-negative random variables. For any n define the simple random variables

X_n = k/2^n if k/2^n ≤ X < (k+1)/2^n, k = 0, ..., n2^n − 1, and X_n = n if X ≥ n.

Define random variables Y_n in a similar fashion, and note that

X_n ↑ X,  Y_n ↑ Y,  X_n Y_n ↑ XY.

Hence, by the monotone convergence theorem,

E[X_n Y_n] → E[XY].

But X_n and Y_n are simple (and independent, being functions of X and Y respectively), and so

E[X_n Y_n] = E[X_n]E[Y_n] → E[X]E[Y],

with the convergence again following by the monotone convergence theorem. Thus, E[XY] = E[X]E[Y] when X and Y are nonnegative. The general case follows by writing X = X⁺ − X⁻, Y = Y⁺ − Y⁻, using

E[XY] = E[X⁺Y⁺] − E[X⁺Y⁻] − E[X⁻Y⁺] + E[X⁻Y⁻],

and applying the result to each of the four preceding expectations. ∎

1.8 Convergence in Probability and in Distribution

In this section we introduce two forms of convergence that are weaker than almost sure convergence. However, before giving their definitions, we will start with a useful result, known as the Borel-Cantelli lemma.

Proposition 1.43 If ∑_{i=1}^∞ P(A_i) < ∞, then P(limsup A_n) = 0.

Proof: Suppose ∑_{i=1}^∞ P(A_i) < ∞. Now,

P(limsup A_n) = P(∩_{n=1}^∞ ∪_{i=n}^∞ A_i).

Hence, for any n,

P(limsup A_n) ≤ P(∪_{i=n}^∞ A_i) ≤ ∑_{i=n}^∞ P(A_i),

and the result follows by letting n → ∞. ∎

Remark: Because ∑_{i=1}^∞ I_{A_i} is the number of the events A_n, n ≥ 1, that occur, the Borel-Cantelli theorem states that if the expected number of the events A_n, n ≥ 1, that occur is finite, then the probability that an infinite number of them occur is 0. Thus the Borel-Cantelli lemma is equivalent to the rather intuitive result that if there is a positive probability that an infinite number of the events A_n occur, then the expected number of them that occur is infinite.

The converse of the Borel-Cantelli lemma requires that the indicator variables for each pair of events be negatively correlated.

Proposition 1.44 Let the events A_i, i ≥ 1, be such that

Cov(I_{A_i}, I_{A_j}) = E[I_{A_i} I_{A_j}] − E[I_{A_i}]E[I_{A_j}] ≤ 0,  i ≠ j.

If ∑_{i=1}^∞ P(A_i) = ∞, then P(limsup A_i) = 1.

Proof. Let N_n = ∑_{i=1}^n I_{A_i} be the number of the events A_1, ..., A_n that occur, and let N = ∑_{i=1}^∞ I_{A_i} be the total number of events that occur. Let m_n = E[N_n] = ∑_{i=1}^n P(A_i), and note that lim_n m_n = ∞. Using the formula for the variance of a sum of random variables learned in your first course in probability, we have

Var(N_n) = ∑_{i=1}^n Var(I_{A_i}) + 2 ∑∑_{i<j} Cov(I_{A_i}, I_{A_j})
≤ ∑_{i=1}^n Var(I_{A_i})
= ∑_{i=1}^n P(A_i)(1 − P(A_i))
≤ m_n.

Now, by Chebyshev's inequality, for any x < m_n,

P(N_n ≤ x) = P(m_n − N_n ≥ m_n − x)
≤ P(|N_n − m_n| ≥ m_n − x)
≤ Var(N_n)/(m_n − x)²
≤ m_n/(m_n − x)².

Hence, for any x, lim_{n→∞} P(N_n ≤ x) = 0. Because P(N ≤ x) ≤ P(N_n ≤ x), this implies that P(N ≤ x) = 0. Consequently, by the continuity property of probabilities,

0 = lim_{x→∞} P(N ≤ x) = P(N < ∞),

showing that, with probability 1, an infinite number of the events occur. ∎

Example 1.45 Consider independent flips of a coin that lands heads each time with probability p > 0. For fixed k, let B_n be the event that flips n, n+1, ..., n+k−1 all land heads. Because the events B_n, n ≥ 1, are positively correlated, we cannot directly apply the converse to the Borel-Cantelli lemma to obtain that, with probability 1, an infinite number of them occur. However, by letting A_n be the event that flips nk+1, ..., nk+k all land heads, then because the sets of flips these events refer to are nonoverlapping, it follows that they are independent. Because ∑_n P(A_n) = ∑_n p^k = ∞, we obtain from Borel-Cantelli that P(limsup A_n) = 1. But limsup A_n ⊆ limsup B_n, so the preceding yields the result P(limsup B_n) = 1. ∎
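Example 1.45's conclusion, that runs of k consecutive heads keep recurring, can be watched happening. A small sketch of our own (the run length, coin bias, and checkpoints are hypothetical choices):

```python
import random

k, p = 5, 0.5
count, run = 0, 0
checkpoints = {10**i for i in range(2, 7)}
for n in range(1, 1_000_001):
    run = run + 1 if random.random() < p else 0
    if run >= k:
        count += 1   # a run of k heads has just been completed at flip n
        run = 0      # restart so we count disjoint runs, as with the events A_n
    if n in checkpoints:
        print(n, count)  # the count keeps growing as n increases
```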
Remark 1.46 The converse of the Borel-Cantelli lemma is usually stated as requiring the events A_i, i ≥ 1, to be independent. Our weakening of this condition can be quite useful, as is indicated by our next example.

Example 1.47 Consider an infinite collection of balls that are numbered 0, 1, ..., and an infinite collection of boxes also numbered 0, 1, .... Suppose that ball i, i ≥ 0, is to be put in box i + X_i, where X_i, i ≥ 0, are iid with probability mass function

P(X = j) = p_j,  ∑_{j≥0} p_j = 1.

Suppose also that the X_i are not deterministic, so that p_j < 1 for all j ≥ 0. If A_j denotes the event that box j remains empty, then

P(A_j) = P(X_j ≠ 0, X_{j−1} ≠ 1, ..., X_0 ≠ j)
= P(X_0 ≠ 0, X_1 ≠ 1, ..., X_j ≠ j)
≥ P(X_i ≠ i for all i ≥ 0).

But

P(X_i ≠ i for all i ≥ 0) = 1 − P(∪_{i≥0} {X_i = i})
= 1 − p_0 − ∑_{i≥1} P(X_0 ≠ 0, ..., X_{i−1} ≠ i−1, X_i = i)
= 1 − p_0 − ∑_{i≥1} p_i ∏_{j=0}^{i−1} (1 − p_j).

Now, there is at least one pair k < m with p_k > 0 and p_m > 0. Hence, for that pair,

p_m ∏_{j=0}^{m−1} (1 − p_j) ≤ p_m (1 − p_k) = p_m − p_m p_k,

implying that

P(A_j) ≥ P(X_i ≠ i for all i ≥ 0) ≥ 1 − p_0 − ∑_{i≥1} p_i + p_m p_k = p_m p_k > 0.

Hence, ∑_j P(A_j) = ∞. Conditional on box j being empty, each ball becomes more likely to be put in box i, i ≠ j, so Cov(I_{A_i}, I_{A_j}) ≤ 0 for i ≠ j. It now follows from Proposition 1.44 that, with probability 1, an infinite number of boxes remain empty.

The sequence of random variables X_n, n ≥ 1, converges in probability to the random variable X, written X_n →p X, if for any ε > 0

P(|X_n − X| > ε) → 0 as n → ∞.

An immediate corollary of Proposition 1.32 is that almost sure convergence implies convergence in probability. The following example shows that the converse is not true.

Example 1.48 Let X_n, n ≥ 1, be independent random variables such that

P(X_n = 1) = 1/n = 1 − P(X_n = 0).

For any ε > 0, P(|X_n| > ε) ≤ 1/n → 0; hence, X_n →p 0. However, because ∑_{n=1}^∞ P(X_n = 1) = ∞, it follows from the converse to the Borel-Cantelli lemma that X_n = 1 for infinitely many values of n, showing that the sequence does not converge almost surely to 0.

Let F_n be the distribution function of X_n, and let F be the distribution function of X. We say that X_n converges in distribution to X, written X_n →d X, if

lim_{n→∞} F_n(x) = F(x)

for all x at which F is continuous. (That is, convergence is required at all x for which P(X = x) = 0.)

To understand why convergence in distribution only requires that F_n(x) → F(x) at points of continuity of F, rather than at all values x, let X_n be uniformly distributed on (0, 1/n). Then, it seems reasonable to suppose that X_n converges in distribution to the random variable X that is identically 0. However,

F_n(x) = 0 if x ≤ 0, nx if 0 < x ≤ 1/n, and 1 if x > 1/n,

while the distribution function of X is

F(x) = 0 if x < 0, and 1 if x ≥ 0,

and so lim_n F_n(0) = 0 ≠ F(0) = 1. On the other hand, for all points of continuity of F (that is, for all x ≠ 0) we have that lim_n F_n(x) = F(x), and so with the definition given it is indeed true that X_n →d X.
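The role of the continuity points can be seen numerically for X_n ~ U(0, 1/n); a quick sketch of our own, evaluating the distribution functions directly:

```python
def F_n(x, n):  # distribution function of U(0, 1/n)
    return 0.0 if x <= 0 else min(n * x, 1.0)

def F(x):       # distribution function of the constant 0
    return 1.0 if x >= 0 else 0.0

for x in [-0.1, 0.0, 0.1]:
    print(x, F_n(x, 10**6), F(x))
# F_n(x) -> F(x) at the continuity points x = -0.1 and x = 0.1,
# but F_n(0) = 0 for every n while F(0) = 1.
```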
We now show that convergence in probability implies convergence in distribution.

Proposition 1.49 X_n →p X implies X_n →d X.

Proof: Suppose that X_n →p X. Let F_n be the distribution function of X_n, n ≥ 1, and let F be the distribution function of X. Now, for any ε > 0,

F_n(x) = P(X_n ≤ x, X ≤ x + ε) + P(X_n ≤ x, X > x + ε)
≤ F(x + ε) + P(|X_n − X| > ε),

where the preceding used that

X_n ≤ x, X > x + ε implies |X_n − X| > ε.

Letting n go to infinity yields, upon using that X_n →p X,

limsup_n F_n(x) ≤ F(x + ε).   (1.4)

Similarly,

F(x − ε) = P(X ≤ x − ε, X_n ≤ x) + P(X ≤ x − ε, X_n > x)
≤ F_n(x) + P(|X_n − X| > ε).

Letting n → ∞ gives

F(x − ε) ≤ liminf_n F_n(x).   (1.5)

Combining equations (1.4) and (1.5) shows that

F(x − ε) ≤ liminf_n F_n(x) ≤ limsup_n F_n(x) ≤ F(x + ε).

Letting ε → 0 shows that if x is a continuity point of F then

F(x) ≤ liminf_n F_n(x) ≤ limsup_n F_n(x) ≤ F(x),

and the result is proved. ∎

Proposition 1.50 If X_n →d X, then E[g(X_n)] → E[g(X)] for any bounded continuous function g.

To focus on the essentials, we will present a proof of Proposition 1.50 when all the random variables X_n and X are continuous. Before doing so, we will prove a couple of lemmas.

Lemma 1.51 Let G be the distribution function of a continuous random variable, and let G⁻¹(x) = inf{t : G(t) ≥ x} be its inverse function. If U is a uniform (0,1) random variable, then G⁻¹(U) has distribution function G.

Proof. Since

inf{t : G(t) ≥ U} ≤ x if and only if G(x) ≥ U,

we have

P(G⁻¹(U) ≤ x) = P(G(x) ≥ U) = G(x),

and we get the result. ∎

Lemma 1.52 Let X_n →d X, where X_n is continuous with distribution function F_n, n ≥ 1, and X is continuous with distribution function F. If F_n(x_n) → F(x), where 0 < F(x) < 1, then x_n → x.

Proof. Suppose there is an ε > 0 such that x_n < x − ε for infinitely many n. If so, then F_n(x_n) ≤ F_n(x − ε) for infinitely many n, implying that

F(x) = liminf_n F_n(x_n) ≤ lim_n F_n(x − ε) = F(x − ε),

which is a contradiction. We arrive at a similar contradiction upon assuming there is an ε > 0 such that x_n > x + ε for infinitely many n. Consequently, we can conclude that for any ε > 0, |x_n − x| > ε for only a finite number of n, thus proving the lemma. ∎

Proof of Proposition 1.50: Let U be a uniform (0,1) random variable, and set Y_n = F_n⁻¹(U), n ≥ 1, and Y = F⁻¹(U). Note that from Lemma 1.51 it follows that Y_n has distribution F_n, and Y has distribution F. Because

F_n(F_n⁻¹(u)) = u = F(F⁻¹(u)),

it follows from Lemma 1.52 that F_n⁻¹(u) → F⁻¹(u) for all u. Thus, Y_n →a.s. Y. By continuity, this implies that g(Y_n) →a.s. g(Y), and, because g is bounded, the dominated convergence theorem yields that

E[g(Y_n)] → E[g(Y)].

But X_n and Y_n both have distribution F_n, while X and Y both have distribution F, and so E[g(Y_n)] = E[g(X_n)] and E[g(Y)] = E[g(X)]. ∎

Remark 1.53 The key to our proof of Proposition 1.50 was showing that if X_n →d X, then we can define random variables Y_n, n ≥ 1, and Y such that Y_n has the same distribution as X_n for each n, and Y has the same distribution as X, and such that Y_n →a.s. Y. This result (which is true without the continuity assumptions we made) is known as Skorokhod's representation theorem.

Skorokhod's representation and the dominated convergence theorem immediately yield the following.

Corollary 1.54 If X_n →d X and there exists a constant M < ∞ such that |X_n| ≤ M for all n, then E[X_n] → E[X].

Proof. Let F_n be the distribution function of X_n, n ≥ 1, and F that of X. Let U be a uniform (0,1) random variable and, for n ≥ 1, set Y_n = F_n⁻¹(U), and Y = F⁻¹(U). Note that the hypotheses of the corollary imply that Y_n →a.s. Y and, because F_n(M) = 1 = 1 − F_n(−M), also that |Y_n| ≤ M. Hence, by the dominated convergence theorem, E[Y_n] → E[Y], which proves the result because Y_n has distribution F_n, and Y has distribution F. ∎
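The quantile construction in Lemma 1.51 is also the standard "inverse transform" method for generating random variables. A minimal sketch of our own, using the exponential distribution as a hypothetical example:

```python
import math
import random

def exp_inverse_cdf(u, lam=1.0):
    # G(t) = 1 - exp(-lam * t), so G^{-1}(u) = -log(1 - u) / lam
    return -math.log(1 - u) / lam

samples = [exp_inverse_cdf(random.random()) for _ in range(200_000)]
# Empirical check that G^{-1}(U) has distribution function G:
for t in [0.5, 1.0, 2.0]:
    empirical = sum(s <= t for s in samples) / len(samples)
    print(t, empirical, 1 - math.exp(-t))  # should roughly agree
```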
Proposition 1.50 can also be used to give a simple proof of Weierstrass' approximation theorem.

Corollary 1.55 Weierstrass' Approximation Theorem. Any continuous function f defined on the interval [0,1] can be expressed as a limit of polynomial functions. Specifically, if

B_n(t) = ∑_{i=0}^n f(i/n) (n choose i) t^i (1−t)^{n−i},

then f(t) = lim_{n→∞} B_n(t).

Proof. Let X_i, i ≥ 1, be a sequence of iid random variables such that

P(X_i = 1) = t = 1 − P(X_i = 0).

Because E[(X_1 + ··· + X_n)/n] = t, it follows from Chebyshev's inequality that for any ε > 0

P(|(X_1 + ··· + X_n)/n − t| > ε) ≤ Var((X_1 + ··· + X_n)/n)/ε² = t(1−t)/(nε²) → 0.

Thus, (X_1 + ··· + X_n)/n →p t, implying that (X_1 + ··· + X_n)/n →d t. Because f is a continuous function on a closed interval, it is bounded, and so Proposition 1.50 yields that

E[f((X_1 + ··· + X_n)/n)] → f(t).

But X_1 + ··· + X_n is a binomial (n, t) random variable; thus,

E[f((X_1 + ··· + X_n)/n)] = B_n(t),

and the proof is complete. ∎
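The Bernstein polynomials B_n of Corollary 1.55 are easy to compute directly. The sketch below is our own illustration (the choice of f and the evaluation grid are hypothetical); it shows the maximum approximation error shrinking as n grows.

```python
from math import comb

def bernstein(f, n, t):
    # B_n(t) = sum over i of f(i/n) * C(n, i) * t^i * (1-t)^(n-i)
    return sum(f(i / n) * comb(n, i) * t**i * (1 - t)**(n - i)
               for i in range(n + 1))

f = lambda x: abs(x - 0.5)  # continuous but not differentiable at 1/2
grid = [i / 200 for i in range(201)]
for n in [10, 100, 1000]:
    err = max(abs(bernstein(f, n, t) - f(t)) for t in grid)
    print(n, err)  # the maximum error over [0,1] decreases toward 0
```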
1.9 The Law of Large Numbers and Ergodic Theorem

Definition 1.56 For a sequence of random variables X_1, X_2, ... the tail sigma field T is defined as T = ∩_{n=1}^∞ σ(X_n, X_{n+1}, ...). Events A ∈ T are called tail events.

Though it may seem as though there are no events remaining in the above intersection, there are lots of examples of very interesting tail events. Intuitively, with a tail event you can ignore any finite number of the variables and still be able to tell whether or not the event occurs. Next are some examples.

Example 1.57 Consider a sequence of random variables X_1, X_2, ... having tail sigma field T and satisfying |X_i| < ∞ for all i. For the event

A_x = {lim_{n→∞} (1/n) ∑_{i=1}^n X_i = x},

it's easy to see that A_x ∈ T because to determine if A_x happens you can ignore any finite number of the random variables; their contributions end up becoming negligible in the limit. For the event B_x = {limsup_{i→∞} X_i = x}, it's easy to see that B_x ∈ T because it depends only on the long run behavior of the sequence. Note that {sup_i X_i = x} ∉ T because it depends, for example, on whether or not X_1 ≤ x.

Example 1.58 Consider a sequence of random variables X_1, X_2, ... having tail sigma field T, but this time let it be possible for X_i = ∞ for some i. For the event A_1 = {lim_{n→∞} (1/n) ∑_{i=1}^n X_i = 1} we now have A_1 ∉ T, because any variable along the way which equals infinity will affect the limit.

Remark 1.59 The previous two examples also motivate the subtle difference between X_i < ∞ and X_i < ∞ almost surely. The former means it's impossible to see X_i = ∞ and the latter only says it has probability zero. An event which has probability zero could still be a possible occurrence. For example, if X is a uniform random variable between zero and one, we can write X ≠ 0.2 almost surely even though it is possible to see X = 0.2.

One approach for proving an event always happens is to first prove that its probability must either be zero or one, and then rule out zero as a possibility. This first type of result is called a zero-one law, since we are proving the chance must either be zero or one. A nice way to do this is to show an event A is independent of itself, so that P(A) = P(A ∩ A) = P(A)P(A), and thus P(A) = 0 or 1. We use this approach next to prove a famous zero-one law for independent random variables, and we will use this in our proof of the law of large numbers.

First we need the following definition. Events with probability either zero or one are called trivial events, and a sigma field is called trivial if every event in it is trivial.

Theorem 1.60 Kolmogorov's Zero-One Law. A sequence of independent random variables has a trivial tail sigma field.

Before we give a proof we need the following result. To show that a random variable Y is independent of an infinite sequence of random variables X_1, X_2, ..., it suffices to show that Y is independent of X_1, X_2, ..., X_n for every finite n < ∞. In elementary courses this result is often given as a definition, but it can be justified using measure theory in the next proposition. We define σ(X_i, i ∈ A) = σ(∪_{i∈A} σ(X_i)) to be the smallest sigma field generated by the collection of random variables X_i, i ∈ A.

Proposition 1.61 Consider the random variables Y and X_1, X_2, ..., where σ(Y) is independent of σ(X_1, X_2, ..., X_n) for every n < ∞. Then σ(Y) is independent of σ(X_1, X_2, ...).

Before we prove this proposition, we show how it implies Kolmogorov's Zero-One Law.

Proof of Kolmogorov's Zero-One Law. We will argue that any event A ∈ T is independent of itself, and thus P(A) = P(A ∩ A) = P(A)P(A), and so P(A) = 0 or 1. Note that the tail sigma field T is independent of σ(X_1, X_2, ..., X_n) for every n < ∞ (since T ⊆ σ(X_{n+1}, X_{n+2}, ...)), and so by the previous proposition T is also independent of σ(X_1, X_2, ...). Thus, since T ⊆ σ(X_1, X_2, ...), T is independent of itself. ∎

Now we prove the proposition.

Proof of Proposition 1.61. Pick any A ∈ σ(Y). You might at first think that H = ∪_{n=1}^∞ σ(X_1, X_2, ..., X_n) is the same as F = σ(X_1, X_2, ...), and then the theorem would follow immediately since by assumption A is independent of any event in H. But it is not true that H and F are the same; H may not even be a sigma field. Also, the tail sigma field T is a subset of F but not necessarily of H. It is, however, true that F ⊆ σ(H) (in fact it turns out that σ(H) = F), which follows because σ(X_1, X_2, ...) = σ(∪_{n=1}^∞ σ(X_n)) and ∪_{n=1}^∞ σ(X_n) ⊆ H. We will use F ⊆ σ(H) later below.

Define the collection of events G to contain any B ∈ F where for every ε > 0 we can find a corresponding "approximating" event C ∈ H where P(B ∩ C^c) + P(B^c ∩ C) < ε. Since A is independent of any event C ∈ H, we can see that A must also be independent of any event B ∈ G because, using the corresponding "approximating" event C for any desired ε > 0,

P(A ∩ B) ≤ P(A ∩ B ∩ C) + P(B ∩ C^c)
≤ P(A ∩ C) + ε
= P(A)P(C) + ε
= P(A)(P(C ∩ B) + P(C ∩ B^c)) + ε
≤ P(A)P(B) + 2ε

and

1 − P(A ∩ B) = P(A^c ∪ B^c)
= P(A^c) + P(A ∩ B^c)
= P(A^c) + P(A ∩ B^c ∩ C) + P(A ∩ B^c ∩ C^c)
≤ P(A^c) + P(B^c ∩ C) + P(A ∩ C^c)
≤ P(A^c) + ε + P(A)P(C^c)
= P(A^c) + ε + P(A)(P(C^c ∩ B) + P(C^c ∩ B^c))
≤ P(A^c) + 2ε + P(A)P(B^c)
= 1 + 2ε − P(A)P(B),

which when combined gives

P(A)P(B) − 2ε ≤ P(A ∩ B) ≤ P(A)P(B) + 2ε.

Since ε is arbitrary, this shows σ(Y) is independent of G. We obtain the proposition by showing F ⊆ σ(H) ⊆ G, and thus that σ(Y) is independent of F, as follows. First note we immediately have H ⊆ G, and thus σ(H) ⊆ σ(G), and we will be finished if we can show σ(G) = G.

To show that G is a sigma field, clearly Ω ∈ G, and B^c ∈ G whenever B ∈ G. Next let B_1, B_2, ... be events in G. To show that ∪_{i=1}^∞ B_i ∈ G, pick any ε > 0 and let C_i be the corresponding "approximating" events that satisfy P(B_i ∩ C_i^c) + P(B_i^c ∩ C_i) < ε/2^{i+1}. Then pick n so that

∑_{i=n+1}^∞ P(B_i ∩ B_1^c ∩ B_2^c ∩ ··· ∩ B_{i−1}^c) < ε/2,

and below we use the approximating event C = ∪_{i=1}^n C_i ∈ H to get

P(∪_i B_i ∩ C^c) + P((∪_i B_i)^c ∩ C)
≤ P(∪_{i=1}^n B_i ∩ C^c) + ε/2 + P((∪_{i=1}^n B_i)^c ∩ C)
≤ ∑_{i=1}^n ( P(B_i ∩ C_i^c) + P(B_i^c ∩ C_i) ) + ε/2
≤ ε/2 + ε/2
= ε,

and thus ∪_{i=1}^∞ B_i ∈ G. ∎
We state it without proof.

Theorem 1.62 The extension theorem. Suppose you have random variables $X_1, X_2, \ldots$ and you consistently define probabilities for all events in $\sigma(X_1, X_2, \ldots, X_n)$ for every $n$. Then this determines a unique value for the probability of every event in $\sigma(X_1, X_2, \ldots)$.

Remark 1.63 To see how this implies Kolmogorov's Zero-One Law, specify probabilities under the assumption that $A$ is independent of any event $B \in \cup_{n=1}^\infty \mathcal{F}_n$, where $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$. The extension theorem then says that $A$ is independent of $\sigma(\cup_{n=1}^\infty \mathcal{F}_n)$.

We will prove the law of large numbers using the more powerful ergodic theorem. That is, we will show that the long-run average of a sequence of random variables converges to the expected value under conditions more general than independence. We define these more general conditions next.

Given a sequence of random variables $X_1, X_2, \ldots$, suppose (for simplicity and without loss of generality) that there is a one-to-one correspondence between events of the form $\{X_1 = x_1, X_2 = x_2, X_3 = x_3, \ldots\}$ and elements of the sample space $\Omega$. An event $A$ is called an invariant event if the occurrence of $\{X_1 = x_1, X_2 = x_2, X_3 = x_3, \ldots\} \in A$ implies both
$$\{X_1 = x_2, X_2 = x_3, X_3 = x_4, \ldots\} \in A \quad \text{and} \quad \{X_1 = x_0, X_2 = x_1, X_3 = x_2, \ldots\} \in A.$$
In other words, an invariant event is not affected by shifting the sequence of random variables to the left or right. For example, $A = \{\sup_{n\ge1} X_n = \infty\}$ is an invariant event if $X_n < \infty$ for all $n$, because $\sup_{n\ge1} X_n = \infty$ implies both $\sup_{n\ge1} X_{n+1} = \infty$ and $\sup_{n\ge1} X_{n-1} = \infty$. On the other hand, the event $A = \{\lim_n X_{2n} = 0\}$ is not invariant, because if a sequence $x_2, x_4, \ldots$ converges to 0 it doesn't necessarily mean that $x_1, x_3, \ldots$ converges to 0. Consider the example where $P(X_1 = 1) = 1/2 = 1 - P(X_1 = 0)$ and $X_n = 1 - X_{n-1}$ for $n > 1$. In this case either $X_{2n} = 0$ and $X_{2n-1} = 1$ for all $n \ge 1$, or $X_{2n} = 1$ and $X_{2n-1} = 0$ for all $n \ge 1$, so $\{\lim_n X_{2n} = 0\}$ and $\{\lim_n X_{2n-1} = 0\}$ cannot occur together.

It can be shown (see Problem 21 below) that the set of invariant events makes up a sigma field, called the invariant sigma field, which is a subset of the tail sigma field. A sequence of random variables $X_1, X_2, \ldots$ is called ergodic if it has a trivial invariant sigma field, and is called stationary if the random variables $(X_1, X_2, \ldots, X_n)$ have the same joint distribution as the random variables $(X_k, X_{k+1}, \ldots, X_{n+k-1})$ for every $n, k$. We are now ready to state the ergodic theorem; an immediate corollary will be the strong law of large numbers.

Theorem 1.64 The ergodic theorem. If the sequence $X_1, X_2, \ldots$ is stationary and ergodic with $E|X_1| < \infty$, then $\frac{1}{n}\sum_{i=1}^n X_i \to E[X_1]$ almost surely.

Since a sequence of independent and identically distributed random variables is clearly stationary and, by Kolmogorov's zero-one law, ergodic, we get the strong law of large numbers as an immediate corollary.

Corollary 1.65 The strong law of large numbers. If $X_1, X_2, \ldots$ are iid with $E|X_1| < \infty$, then $\frac{1}{n}\sum_{i=1}^n X_i \to E[X_1]$ almost surely.
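Before turning to the proof, Corollary 1.65 is easy to see in simulation. The sketch below is ours, not from the text; it averages iid Exponential(1) draws, whose mean is 1, and watches the running average settle:

```python
import random

random.seed(1)
n = 10**6
total = 0.0
for i in range(1, n + 1):
    total += random.expovariate(1.0)   # iid Exponential(1), E[X] = 1
    if i in (10, 1000, 100000, n):
        print(i, total / i)            # running average approaches E[X] = 1
```

The early averages fluctuate, but by $n = 10^6$ the average is within a few tenths of a percent of 1.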
Proof of the ergodic theorem. Given $\epsilon > 0$ let $Y_i = X_i - E[X_i] - \epsilon$ and let
$$M_n = \max(0,\, Y_1,\, Y_1 + Y_2,\, \ldots,\, Y_1 + Y_2 + \cdots + Y_n).$$
Since $\frac{1}{n}\sum_{i=1}^n Y_i \le \frac{1}{n}M_n$, we will first show that $M_n/n \to 0$ almost surely; the theorem then follows after repeating the whole argument applied instead to $Y_i = -X_i + E[X_i] - \epsilon$. Letting
$$M_n' = \max(0,\, Y_2,\, Y_2 + Y_3,\, \ldots,\, Y_2 + Y_3 + \cdots + Y_{n+1})$$
and using stationarity in the last equality below, we have
$$E[M_{n+1}] = E[\max(0, Y_1 + M_n')] = E[M_n' + \max(-M_n', Y_1)] = E[M_n] + E[\max(-M_n', Y_1)],$$
and, since $M_n \le M_{n+1}$ implies $E[M_n] \le E[M_{n+1}]$, we can conclude that $E[\max(-M_n', Y_1)] \ge 0$ for all $n$.

Since $\{M_n/n \to 0\}$ is an invariant event, by ergodicity it must have probability either 0 or 1. If we were to assume the probability is 0, then $M_{n+1} \ge M_n$ would imply $M_n \to \infty$ and also $M_n' \to \infty$, and thus $\max(-M_n', Y_1) \to Y_1$. The dominated convergence theorem, using the bound $|\max(-M_n', Y_1)| \le |Y_1|$, would then give $E[\max(-M_n', Y_1)] \to E[Y_1] = -\epsilon$, contradicting the previous conclusion that $E[\max(-M_n', Y_1)] \ge 0$ for all $n$. This contradiction means we must have $M_n/n \to 0$ almost surely, and the theorem is proved.

1.10 Exercises

(Exercises 1-5 are omitted here.)

6. Given a sigma field $\mathcal{F}$, show that if $A_i \in \mathcal{F}$ for all $1 \le i < \infty$, then $\cap_{i=1}^\infty A_i \in \mathcal{F}$.

7. Show that if $X \ge 0$ and $E[X] < \infty$, then $\lim_{n\to\infty} E[X I_{X>n}] = 0$.

8. Assume $X > 0$ is a random variable, but don't necessarily assume that $E[1/X] < \infty$. Show that $\lim_{n\to\infty} E\big[\tfrac{1}{X} I_{X>n}\big] = 0$ and $\lim_{n\to\infty} E\big[X I_{X<1/n}\big] = 0$.

9. Use the definition of expected value in terms of simple variables to prove that if $X \ge 0$ and $E[X] = 0$, then $X = 0$ almost surely.

10. Show that if $X_n \to_d c$ for a constant $c$, then $X_n \to_p c$.

11. Show that if $E[g(X_n)] \to E[g(X)]$ for all bounded, continuous functions $g$, then $X_n \to_d X$.

12. If $X_1, X_2, \ldots$ are non-negative random variables with the same distribution (but the variables are not necessarily independent) and $E[X_1] < \infty$, prove that $\lim_{n\to\infty} E[\max_{i\le n} X_i/n] = 0$.

13. For random variables $X_1, X_2, \ldots$, let $\mathcal{T}$ be the tail sigma field and let $S_n = \sum_{i=1}^n X_i$. (a) Is $\{\lim_{n\to\infty} S_n/n > 0\} \in \mathcal{T}$? (b) Is $\{\lim_{n\to\infty} S_n > 0\} \in \mathcal{T}$?

14. If $X_1, X_2, \ldots$ are non-negative iid random variables with $P(X_1 > 0) > 0$, show that $P(\sum_{i=1}^\infty X_i = \infty) = 1$.

15. Suppose $X_1, X_2, \ldots$ are continuous iid random variables, and $Y_n = I_{\{X_n > \max_{i<n} X_i\}}$. (a) Argue that $Y_i$ is independent of $Y_j$ for $i \ne j$. (b) What is $P(\sum_{i=1}^\infty Y_i < \infty)$? (c) What is $P(\sum_{i=1}^\infty Y_iY_{i+1} < \infty)$?

16. Suppose there is a single server, the $i$th customer to arrive requires the server to spend $U_i$ units of time serving him, the time between his arrival and the next customer's arrival is $V_i$, and the $X_i = U_i - V_i$ are iid with mean $\mu$. (a) If $Q_{n+1}$ is the amount of time the $(n+1)$st customer must wait before being served, explain why
$$Q_{n+1} = \max(Q_n + X_n, 0) = \max(0,\, X_n,\, X_n + X_{n-1},\, \ldots,\, X_n + \cdots + X_1).$$
(b) Show $P(Q_n \to \infty) = 1$ if $\mu > 0$.

17. Given a nonnegative random variable $X$, define the sequence of random variables $Y_n = \min(\lfloor 2^nX \rfloor/2^n,\, n)$, where $\lfloor x \rfloor$ denotes the integer part of $x$. Show that $Y_n \uparrow X$ and $E[X] = \lim_n E[Y_n]$.

18. Show that for any monotone functions $f$ and $g$, if $X, Y$ are independent random variables then so are $f(X), g(Y)$.

19. Let $X_1, X_2, \ldots$ be random variables with $X_i < \infty$ and suppose $\sum_n P(X_n > 1) < \infty$. Compute $P(\sup_n X_n < \infty)$.

20. Suppose $X_n \to_p X$ and there is a random variable $Y$ with $E[Y] < \infty$ such that $|X_n| \le Y$ for all $n$. Show $\lim_{n\to\infty} E[X_n] = E[X]$.

21. For random variables $X_1, X_2, \ldots$, let $\mathcal{T}$ and $\mathcal{I}$ respectively be the set of tail events and the set of invariant events. Show that $\mathcal{I}$ and $\mathcal{T}$ are both sigma fields.

22. A ring is hanging from the ceiling by a string. Someone will cut the ring at two positions chosen uniformly at random on its circumference, breaking the ring into two pieces. Player I gets the piece that falls to the floor, and player II gets the piece that stays attached to the string. Whoever gets the bigger piece wins. Does either player have a big advantage here? Explain.
23. A box contains four marbles. One marble is red, and each of the other three marbles is either yellow or green, but you have no idea exactly how many of each color there are, or even whether the other three marbles are all the same color. (a) Someone chooses one marble at random from the box, and if you correctly guess its color you win \$1,000. What color would you guess? Explain. (b) If this game is going to be played four times using the same box of marbles (replacing the drawn marble each time), what guesses would you make if you had to make all four guesses ahead of time? Explain.

Chapter 2

Stein's Method and Central Limit Theorems

2.1 Introduction

You are probably familiar with the central limit theorem, which says that the sum of a large number of independent random variables has approximately a normal distribution. Most proofs of this celebrated result involve properties of the characteristic function $\phi(t) = E[e^{itX}]$ of a random variable $X$, and these proofs are non-probabilistic and often somewhat mysterious to the uninitiated. One goal of this chapter is to present a beautiful alternative proof of a version of the central limit theorem using a powerful technique called Stein's method. This technique, amazingly, can also be applied in settings with dependent variables, and it gives an explicit bound on the error of the normal approximation; such a bound is quite difficult to derive using other methods. The technique can also be applied to other distributions, the Poisson and geometric distributions included. We first embark on a brief tour of Stein's method in the relatively simpler settings of the Poisson and geometric distributions, and then move to the normal distribution. As a first step we introduce the concept of a coupling, one of the key ingredients we need.

In Section 2 we introduce the concept of coupling and show how it can be used to bound the error when approximating one distribution by another, and in Section 3 we prove a theorem of Le Cam giving a bound on the error of the Poisson approximation for independent events. In Section 4 we introduce the Stein-Chen method, which can give bounds on the error of the Poisson approximation for events with dependencies, and in Section 5 we illustrate how the method can be adapted to the setting of the geometric distribution. In Section 6 we demonstrate Stein's method applied to the normal distribution, obtain a bound on the error of the normal approximation for a sum of independent variables, and use this to prove a version of the central limit theorem.

2.2 Coupling

One of the most interesting properties of expected value is that $E[X - Y] = E[X] - E[Y]$ even if the variables $X$ and $Y$ are highly dependent on each other. A useful strategy for estimating $E[X] - E[Y]$ is therefore to create a dependency between $X$ and $Y$ that simplifies estimating $E[X - Y]$. Such a dependency between two random variables is called a coupling.

Definition 2.1 The pair $(\hat X, \hat Y)$ is a "coupling" of the random variables $(X, Y)$ if $\hat X =_d X$ and $\hat Y =_d Y$.

Example 2.2 Suppose $X, Y$ and $U$ are $U(0,1)$ random variables. Then both $(U, U)$ and $(U, 1-U)$ are couplings of $(X, Y)$.

A random variable $X$ is said to be "stochastically smaller" than $Y$, written $X \le_{st} Y$, if
$$P(X \le x) \ge P(Y \le x) \quad \text{for all } x,$$
and under this condition we can create a coupling where one variable is always less than or equal to the other.

Proposition 2.3 If $X \le_{st} Y$, it is possible to construct a coupling $(\hat X, \hat Y)$ of $(X, Y)$ with $\hat X \le \hat Y$ almost surely.
Proof. With $F(t) = P(X \le t)$, $G(t) = P(Y \le t)$, $F^{-1}(x) = \inf\{t : F(t) \ge x\}$, and $G^{-1}(x) = \inf\{t : G(t) \ge x\}$, let $U \sim U(0,1)$, $\hat X = F^{-1}(U)$, and $\hat Y = G^{-1}(U)$. Since $F(t) \ge G(t)$ for all $t$ implies $F^{-1}(x) \le G^{-1}(x)$ for all $x$, we have $\hat X \le \hat Y$ almost surely. And since $F^{-1}(U) \le x$ if and only if $F(x) \ge U$, which implies
$$P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x),$$
we get $\hat X =_d X$, and $\hat Y =_d Y$ after applying the same argument to $G$.

Example 2.4 If $X \sim N(0,1)$ and $Y \sim N(1,1)$, then $X \le_{st} Y$; here the construction in the proof of Proposition 2.3 amounts to taking $\hat X = X$ and $\hat Y = X + 1$.
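The quantile construction in the proof of Proposition 2.3 is easy to implement. A small sketch of ours (the helper name is made up) couples the two normals of Example 2.4 through one shared uniform:

```python
import random
from statistics import NormalDist

def monotone_coupling(n=10**5, seed=0):
    rng = random.Random(seed)
    std = NormalDist(0, 1)      # X ~ N(0,1)
    shifted = NormalDist(1, 1)  # Y ~ N(1,1)
    pairs = []
    for _ in range(n):
        u = rng.random()        # one shared uniform drives both coordinates
        pairs.append((std.inv_cdf(u), shifted.inv_cdf(u)))
    return pairs

pairs = monotone_coupling()
print(all(x <= y for x, y in pairs))              # True: X_hat <= Y_hat always
print(sum(y - x for x, y in pairs) / len(pairs))  # about 1, which is E[Y] - E[X]
```

Each coordinate is a correct sample from its own marginal, yet the pair is perfectly ordered.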
Definition 2.5 The total variation distance between random variables $X$ and $Y$ is
$$d_{TV}(X, Y) = \sup_A |P(X \in A) - P(Y \in A)|.$$

Proposition 2.6 For any coupling $(\hat X, \hat Y)$ of $(X, Y)$ we have $P(\hat X \ne \hat Y) \ge d_{TV}(X, Y)$, since for any $A$,
$$P(X \in A) - P(Y \in A) = P(\hat X \in A) - P(\hat Y \in A) \le P(\hat X \in A, \hat Y \notin A) \le P(\hat X \ne \hat Y).$$
A coupling for which equality holds is called a maximal coupling, and one always exists.

Suppose $X$ and $Y$ have respective density functions $f$ and $g$, and let $p = \int \min(f(x), g(x))\,dx$. We next construct a coupling with $P(\hat X = \hat Y) \ge p$ which, in light of the previous inequality, must therefore be the maximal coupling. Let $B, C$, and $D$ be independent random variables with respective density functions
$$b(x) = \frac{\min(f(x), g(x))}{p}, \qquad c(x) = \frac{f(x) - \min(f(x), g(x))}{1 - p}, \qquad d(x) = \frac{g(x) - \min(f(x), g(x))}{1 - p},$$
let $I \sim$ Bernoulli($p$), and if $I = 1$ then let $\hat X = \hat Y = B$; otherwise let $\hat X = C$ and $\hat Y = D$. This clearly gives $P(\hat X = \hat Y) \ge P(I = 1) = p$, and
$$P(\hat X \le x) = P(\hat X \le x \mid I = 1)p + P(\hat X \le x \mid I = 0)(1 - p) = \int_{-\infty}^x \min(f(t), g(t))\,dt + \int_{-\infty}^x \big(f(t) - \min(f(t), g(t))\big)\,dt = P(X \le x),$$
with the same argument showing $\hat Y =_d Y$. To see that this coupling is maximal, let $A = \{x : f(x) > g(x)\}$, and note that since $P(X \in E) - P(Y \in E) = \int_E (f - g)$ is largest at $E = A$, we must have
$$d_{TV}(X, Y) = \max\{P(X \in A) - P(Y \in A),\; P(Y \in A^c) - P(X \in A^c)\} = P(X \in A) - P(Y \in A),$$
and so
$$P(\hat X \ne \hat Y) = 1 - \int \min(f(x), g(x))\,dx = 1 - \int_A g(x)\,dx - \int_{A^c} f(x)\,dx = 1 - P(Y \in A) - 1 + P(X \in A) = d_{TV}(X, Y).$$

Corollary 2.7 If $X$ and $Y$ are discrete, the analogous construction gives a maximal coupling with
$$P(\hat X = \hat Y) = \sum_k \min(P(X = k), P(Y = k)) = 1 - d_{TV}(X, Y).$$

2.3 Poisson Approximation and Le Cam's Theorem

It's well known that a binomial distribution converges to a Poisson distribution when the number of trials is increased and the probability of success is decreased at the same time in such a way that the mean stays constant. This also motivates using a Poisson distribution as an approximation for a binomial distribution when the probability of success is small and the number of trials is large. If the number of trials is very large, computing the distribution function of a binomial distribution can be computationally difficult, whereas the Poisson approximation may be much easier.

A fact that is not quite as well known is that the Poisson distribution can be a reasonable approximation even if the trials have varying probabilities and even if the trials are not completely independent of each other. This approximation is interesting because dependent trials are notoriously difficult to analyze in general, while the Poisson distribution is elementary. It's possible to assess how accurate such Poisson approximations are, and we first give a bound on the error of the Poisson approximation for completely independent trials with different probabilities of success.

Proposition 2.8 Let $X_i$ be independent Bernoulli($p_i$), let $W = \sum_{i=1}^n X_i$, and let $Z \sim$ Poisson($\lambda$) with $\lambda = E[W] = \sum_{i=1}^n p_i$. Then
$$d_{TV}(W, Z) \le \sum_{i=1}^n p_i^2.$$

Proof. We first write $Z = \sum_{i=1}^n Z_i$, where the $Z_i$ are independent and $Z_i \sim$ Poisson($p_i$). Then we create the maximal coupling $(\hat Z_i, \hat X_i)$ of $(Z_i, X_i)$ and use the previous corollary to get
$$P(\hat Z_i = \hat X_i) = \sum_k \min(P(X_i = k), P(Z_i = k)) = \min(1 - p_i, e^{-p_i}) + \min(p_i, p_ie^{-p_i}) = 1 - p_i + p_ie^{-p_i} \ge 1 - p_i^2,$$
where we use $e^{-p} \ge 1 - p$. Using that $(\sum_i \hat Z_i, \sum_i \hat X_i)$ is a coupling of $(Z, W)$ yields
$$d_{TV}(W, Z) \le P\Big(\sum_i \hat Z_i \ne \sum_i \hat X_i\Big) \le P\big(\cup_i \{\hat Z_i \ne \hat X_i\}\big) \le \sum_i P(\hat Z_i \ne \hat X_i) \le \sum_{i=1}^n p_i^2.$$

Remark 2.9 The drawback of this beautiful result is that when $E[W]$ is large the upper bound can be much greater than 1 even when $W$ has approximately a Poisson distribution. For example, if $W$ is a binomial (100, .1) random variable and $Z \sim$ Poisson(10), we should have $W \approx_d Z$, but the proposition gives us the trivial and useless result $d_{TV}(W, Z) \le 100 \times (.1)^2 = 1$.
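A quick numerical illustration of Remark 2.9 (our sketch, not from the text): the exact total variation distance between a binomial (100, .1) and a Poisson(10) is far below the bound of 1. We use the fact, proved in Exercise 6 below, that for discrete variables $d_{TV}$ is half the $\ell_1$ distance between the mass functions:

```python
from math import comb, exp, factorial

n, p, lam = 100, 0.1, 10.0
binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
poiss = [exp(-lam) * lam**k / factorial(k) for k in range(n + 1)]

# d_TV = (1/2) * sum_k |P(W=k) - P(Z=k)|; the Poisson mass beyond
# k = 100 is negligible here
d_tv = 0.5 * sum(abs(b, ) if False else abs(b - q) for b, q in zip(binom, poiss))
print(round(d_tv, 4))   # about 0.02, versus the Le Cam bound n*p^2 = 1
```

The Stein-Chen method of the next section closes much of this gap.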
2.4 The Stein-Chen Method

The Stein-Chen method is another approach for obtaining an upper bound on $d_{TV}(W, Z)$, where $Z$ is a Poisson random variable and $W$ is some other variable you are interested in. This approach covers the distribution of the number of successes in dependent as well as independent trials with varying probabilities of success. In order for the bound to be good, it should be close to 0 when $W$ is close in distribution to $Z$. In order to achieve this, the Stein-Chen method uses the following interesting property of the Poisson distribution: if $Z \sim$ Poisson($\lambda$), then
$$kP(Z = k) = \lambda P(Z = k - 1). \tag{2.1}$$
Thus, for any bounded function $f$ with $f(0) = 0$,
$$E[Zf(Z)] = \sum_{k=0}^\infty kf(k)P(Z = k) = \lambda\sum_{k=1}^\infty f(k)P(Z = k - 1) = \lambda\sum_{i=0}^\infty f(i + 1)P(Z = i) = \lambda E[f(Z + 1)].$$
The secret to using the preceding is in cleverly picking a function $f$ so that
$$d_{TV}(W, Z) \le |\lambda E[f(W + 1)] - E[Wf(W)]|,$$
and so we are likely to get a very small upper bound when $W \approx_d Z$ (where we use the notation $W \approx_d Z$ to mean that $W$ and $Z$ approximately have the same distribution).

Suppose $Z \sim$ Poisson($\lambda$) and $A$ is any set of nonnegative integers. Define the function $f_A(k), k = 0, 1, 2, \ldots$, starting with $f_A(0) = 0$ and then using the following "Stein equation" for the Poisson distribution:
$$\lambda f_A(k + 1) - kf_A(k) = I_{k\in A} - P(Z \in A). \tag{2.2}$$
Notice that by plugging in any random variable $W$ for $k$ and taking expected values we get
$$\lambda E[f_A(W + 1)] - E[Wf_A(W)] = P(W \in A) - P(Z \in A),$$
so that
$$d_{TV}(W, Z) \le \sup_A |\lambda E[f_A(W + 1)] - E[Wf_A(W)]|.$$

Lemma 2.10 For any $A$ and $i, j$, $|f_A(i) - f_A(j)| \le \min(1, 1/\lambda)|i - j|$.

Proof. The solution of (2.2) is, for $k \ge 0$,
$$f_A(k + 1) = \frac{P(Z \in A, Z \le k) - P(Z \in A)P(Z \le k)}{\lambda P(Z = k)},$$
because when we plug it into the left hand side of (2.2) and use (2.1) in the form $\lambda P(Z = k - 1) = kP(Z = k)$, we get
$$\lambda f_A(k + 1) - kf_A(k) = \frac{P(Z \in A, Z \le k) - P(Z \in A)P(Z \le k)}{P(Z = k)} - \frac{P(Z \in A, Z \le k - 1) - P(Z \in A)P(Z \le k - 1)}{P(Z = k)} = I_{k\in A} - P(Z \in A),$$
which is the right hand side of (2.2). From this formula one can show, using that $P(Z \le k - 1)/P(Z = k)$ is increasing in $k$, that $|f_A(k + 1) - f_A(k)| \le \min(1, 1/\lambda)$ for every $k$; summing these increments gives the lemma.

Theorem 2.11 Let $X_i$, $i = 1, \ldots, n$, be Bernoulli random variables (not necessarily independent) with $p_i = P(X_i = 1)$, let $W = \sum_{i=1}^n X_i$ and $\lambda = \sum_{i=1}^n p_i$, and let $Z \sim$ Poisson($\lambda$). Then
$$d_{TV}(W, Z) \le \min(1, 1/\lambda)\sum_{i=1}^n p_iE|W - V_i|,$$
where $V_i$ is any random variable having distribution $V_i =_d (W - 1 \mid X_i = 1)$.

Proof. Since $(W \mid X_i = 1) =_d V_i + 1$,
$$E[Wf_A(W)] = \sum_i E[X_if_A(W)] = \sum_i p_iE[f_A(W) \mid X_i = 1] = \sum_i p_iE[f_A(V_i + 1)],$$
so that
$$|\lambda E[f_A(W + 1)] - E[Wf_A(W)]| = \Big|\sum_i p_i\big(E[f_A(W + 1)] - E[f_A(V_i + 1)]\big)\Big| \le \min(1, 1/\lambda)\sum_i p_iE|W - V_i|$$
by Lemma 2.10.

Proposition 2.12 With the above notation, if $W \ge V_i$ for all $i$, then
$$d_{TV}(W, Z) \le 1 - \operatorname{Var}(W)/E[W].$$

Proof. Using $W \ge V_i$ we have
$$\sum_{i=1}^n p_iE|W - V_i| = \sum_{i=1}^n p_iE[W - V_i] = \lambda^2 + \lambda - \sum_{i=1}^n p_iE[1 + V_i] = \lambda^2 + \lambda - \sum_{i=1}^n E[X_iW] = \lambda^2 + \lambda - E[W^2] = \lambda - \operatorname{Var}(W),$$
and the preceding theorem along with $\lambda = E[W]$ gives
$$d_{TV}(W, Z) \le \min(1, 1/E[W])\big(E[W] - \operatorname{Var}(W)\big) \le 1 - \operatorname{Var}(W)/E[W].$$

Example 2.13 Let $X_i$ be independent Bernoulli($p_i$) random variables with $\lambda = \sum_{i=1}^n p_i$. Let $W = \sum_{i=1}^n X_i$, and let $Z \sim$ Poisson($\lambda$). Using $V_i = \sum_{j\ne i} X_j$, note that $W \ge V_i$ and $E[W - V_i] = p_i$, so the preceding theorem gives us
$$d_{TV}(W, Z) \le \min(1, 1/\lambda)\sum_{i=1}^n p_i^2.$$
For instance, if $W$ is a binomial random variable with parameters $n = 100$, $p = 1/10$, then the upper bound on the total variation distance between $W$ and a Poisson random variable with mean 10 given by the Stein-Chen method is $1/10$, as opposed to the upper bound of 1 which results from the Le Cam bound of the preceding section.

Example 2.14 A coin with probability $p = 1 - q$ of coming up heads is flipped $n + k$ times. We are interested in $P(R_k)$, where $R_k$ is the event that a run of at least $k$ heads in a row occurs. To approximate this probability, whose exact expression is given in Example 1.10, let $X_i = 1$ if flip $i$ lands tails and flips $i + 1, \ldots, i + k$ all land heads, and let $X_i = 0$ otherwise, $i = 1, \ldots, n$. Let
$$W = \sum_{i=1}^n X_i$$
and note that there will be a run of $k$ heads either if $W > 0$ or if the first $k$ flips all land heads. Consequently,
$$P(W > 0) \le P(R_k) \le P(W > 0) + p^k.$$
Because the flips are independent and $X_i = 1$ implies $X_j = 0$ for all $j \ne i$ with $|i - j| \le k$, it follows that if we let
$$V_i = W - \sum_{j=i-k}^{i+k} X_j,$$
then $V_i =_d (W - 1 \mid X_i = 1)$ and $W \ge V_i$. Using that $\lambda = E[W] = nqp^k$ and $E[W - V_i] \le (2k + 1)qp^k$, we see that
$$d_{TV}(W, Z) \le \min(1, 1/\lambda)\,n(2k + 1)q^2p^{2k},$$
where $Z \sim$ Poisson($\lambda$).
For instance, suppose we flip a fair coin 1034 times and want to approximate the probability that we have a run of 10 heads in a row. In this case $n = 1024$, $k = 10$, $p = q = 1/2$, and so $\lambda = 1024/2^{11} = .5$. Consequently,
$$P(W > 0) \approx 1 - e^{-.5},$$
with the error in the above approximation being at most $21/2^{12}$. Consequently, we obtain that
$$1 - e^{-.5} - 21/2^{12} \le P(R_{10}) \le 1 - e^{-.5} + 21/2^{12} + (1/2)^{10},$$
or
$$0.388 \le P(R_{10}) \le 0.4.$$

Example 2.15 Birthday Problem. With $m$ people and $n$ days in the year, let $Y_i$ be the number of people born on day $i$. Let $X_i = I_{\{Y_i = 0\}}$, and let $W = \sum_{i=1}^n X_i$ equal the number of days on which nobody has a birthday. Next imagine $n$ different hypothetical scenarios, where in scenario $i$ all the $Y_i$ people initially born on day $i$ have their birthdays reassigned randomly to the other days. Let $1 + V_i$ equal the number of days under scenario $i$ on which nobody has a birthday. Notice that $W \ge V_i$ and $V_i =_d (W - 1 \mid X_i = 1)$. Also, we have $\lambda = E[W] = n(1 - 1/n)^m$ and $E[1 + V_i] = 1 + (n - 1)(1 - 1/(n - 1))^m$, so with $Z \sim$ Poisson($\lambda$) we get
$$d_{TV}(W, Z) \le \min(1, 1/\lambda)\,\lambda\,\Big(n(1 - 1/n)^m - (n - 1)\big(1 - 1/(n - 1)\big)^m\Big).$$

2.5 Stein's Method for the Geometric Distribution

In this section we show how to obtain an upper bound on the distance $d_{TV}(W, Z)$, where $W$ is a given random variable and $Z$ has a geometric distribution with parameter $p = 1 - q = P(W = 1) = P(Z = 1)$. We will use a version of Stein's method applied to the geometric distribution. Define $f_A(1) = 0$ and, for $k = 1, 2, \ldots$, use the recursion
$$f_A(k) - qf_A(k + 1) = I_{k\in A} - P(Z \in A).$$

Lemma 2.16 We have $|f_A(i) - f_A(j)| \le 1/p$.

Proof. It's easy to check that the solution is
$$f_A(k) = \frac{P(Z \in A, Z \ge k)}{P(Z = k)} - \frac{P(Z \in A)}{p}$$
and, since neither of the two terms in the difference above can be larger than $1/p$, the lemma follows.

Theorem 2.17 Given random variables $W$ and $V$ such that $V =_d (W - 1 \mid W > 1)$, let $p = P(W = 1)$ and $Z \sim$ geometric($p$). Then
$$d_{TV}(W, Z) \le \frac{q}{p}\,P(W \ne V).$$

Proof.
$$|P(W \in A) - P(Z \in A)| = |E[f_A(W) - qf_A(W + 1)]| = |qE[f_A(W) \mid W > 1] - qE[f_A(W + 1)]| = q\,|E[f_A(1 + V) - f_A(1 + W)]| \le \frac{q}{p}\,P(W \ne V),$$
where the second equality uses $f_A(1) = 0$, and the last inequality follows from the preceding lemma.
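The identity driving this section, that $E[f(Z) - qf(Z + 1)] = 0$ whenever $Z$ is geometric on $\{1, 2, \ldots\}$ and $f$ is bounded with $f(1) = 0$, can be checked by simulation. A small sketch of ours (the sampler and the particular $f$ are made up for illustration):

```python
import random

def geom(p, rng):
    # geometric on {1, 2, ...}: number of trials until the first success
    k = 1
    while rng.random() >= p:
        k += 1
    return k

p, q = 0.3, 0.7
f = lambda k: 0.0 if k == 1 else (k % 3) + 1.0  # any bounded f with f(1) = 0

rng = random.Random(42)
n = 10**6
est = sum(f(z) - q * f(z + 1) for z in (geom(p, rng) for _ in range(n))) / n
print(round(est, 3))   # close to 0, as the Stein identity predicts
```

The telescoping that makes this work is exactly what the proof of Lemma 2.16 exploits.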
If I4, = 0 then let V = W, and otherwise let V = min{i > 2.6 Stein's Method for the Normal Distribution 69 2: Ip, = 1} - 1. Note that this gives V =4 (W — 1|W > 1) and keh P(W #V) <> P(ALN Bi) ] +1 = P(A1) >° P(AiAS) . Atl = P(Ay) )> P(A:)/P(AS) i=2d = kqp”*. Thus if Z ~geometric(qp*) then applying the previous theorem gives dry(N, Z) < kp*. The triangle inequality applies to dry so we get dry(M — 1,2) < drv(M —1,N) + dry(N, Z) = (k + 1)p* and P(Z € A) —(k+1)p* < P(M € A) < P(Z€ A) + (k+1)p*. 2.6 Stein’s Method for the Normal Distribution Let Z ~ N(0,1) be a standard normal random variable. It can be shown that for smooth functions f we have E[f'(Z) — Zf(Z)] = 0, and this inspires the following Stein equation. Lemma 2.19 Given a >0 and any value of z let 1 ife #))dt, and a similar argument gives eRe) fe f(z)= a) fe (t)P(Z < t)dt P(Zsz) [~,, a i H()P(Z > that, we get If"(a)| < |a'(a)| + | + 2) f(a) + 2(h(x) — ElA(Z)]| Le a + er + (1+2°)P(Z > 2)/9(2))(2P(Z < 2) + 4(2)) ie + (1+2?)P(Z < 2)/4(2))(-2P(Z > 2) + 4(z)) A + wv x 8 2.6 Stein's Method tor the Normal Vistribution a We finally use max(2y) 2 (x) — f(y) < "(a)\da < — \f'@) ros fo If @)ldz s 5 to give the lemma. Theorem 2.20 If Z ~ N(0,1) and W = 0%, Xj where X; are in- dependent mean 0 and Var(W) = 1, then sup|PW <2) —P(Z <2) <2,|87 FIX) ae 1 Proof. Given any a > 0 and any 2z, define h, f as in the previous lemma. Then PW > 321X3I] PW $2) -P(Z <2) $2.13) > BUX). and then by choosing we get 72 Chapter 2 Stein's Method and Central Limit Theorems Repeating the same argument starting with P(Z <2)- P(W <2) 1, and W, = 0%, Xi then Wp i=l satisfies the conditions of Theorem 2.20. Because n a DAP = nba = EGE 0 it follows from Theorem 2.20 that P(W, <2)—+ P(Z <2). @ 74 Chapter 2 Stein's Method and Central Limit Theorems 2.7 Exercises 1. If X ~ Poisson(a) and Y ~ Poisson(b), with b > a, use coupling to show that Y >a X. . Suppose a particle starts at position 5 on a number line and at each time period the particle moves one position to the right with probability p and, if the particle is above position 0, moves one position to left with probability 1 — p. Let X;,(p) be the position of the particle at time n for the given value of p. Use coupling to show that Xp(a) >st Xn(b) for any n if a > b. . Let X,Y be indicator variables with E[X] = a and B[Y] = 6. (a) Show how to construct a maximal coupling X,¥ for X and Y, and then compute P(X = Y) as a function of a,b. (b) Show how to construct a minimal coupling to minimize P(X = ¥). . In a room full of n people, let X be the number of people who share a birthday with at least one other person in the room. Then let Y be the number of pairs of people in the room having the same birthday. (a) Compute E[X] and Var(X) and E[Y] and Var(¥). (b) Which of the two variables X or Y do you believe will more closely follow @ Poisson distribution? Why? (c) Ina room of 51 people, it turns out there are 3 pairs with the same birthday and also a triplet (3 people) with the same birthday. This is a total of 9 people and also 6 pairs. Use a Poisson approximation to estimate P(X > 9) and P(Y > 6). Which of these two approximations do you think will be better? Have we observed a rare event here? . Compute a bound on the accuracy of the better approximation in the previous exercise part (c) using the Stein-Chen method. . For discrete X,Y prove dry(X,¥) = 45, |P(X = 2)-P(Y = 2) . For discrete X,Y show that P(X # Y) > drv(X,Y) and show also that there exists a coupling that yields equality. . 
8. Compute a bound on the accuracy of a normal approximation for a Poisson random variable with mean 100.

9. If $X \sim$ Geometric($p$), with $q = 1 - p$, show that for any bounded function $f$ with $f(0) = 0$ we have $E[f(X) - qf(X + 1)] = 0$.

10. Suppose $X_1, X_2, \ldots$ are independent mean-zero random variables with $|X_n| \le 1$ for all $n$ and $\sum_{i\le n}\operatorname{Var}(X_i)/n \to s < \infty$. If $S_n = \sum_{i\le n} X_i$ and $Z \sim N(0, s)$, show that $S_n/\sqrt n \to_d Z$.

11. Suppose $m$ balls are placed among $n$ urns, with each ball independently going into urn $i$ with probability $p_i$. Assume $m$ is much larger than $n$. Approximate the chance that none of the urns is empty, and give a bound on the error of the approximation.

12. A ring with a circumference of $c$ is cut into $n$ pieces (where $n$ is large) by cutting at $n$ places chosen uniformly at random around the ring. Estimate the chance you get $k$ pieces of length at least $a$, and give a bound on the error of the approximation.

13. Suppose $X_i$, $i = 1, 2, \ldots, 10$, are iid $U(0, 1)$. Give an approximation for $P(\sum_{i=1}^{10} X_i > 7)$, and give a bound on the error of this approximation.

14. Suppose $X_i$, $i = 1, 2, \ldots, n$, are independent random variables with $E[X_i] = 0$ and $\sum_{i=1}^n \operatorname{Var}(X_i) = 1$. Let $W = \sum_{i=1}^n X_i$ and show that
$$\big|E[W^3]\big| \le 3\sum_{i=1}^n E|X_i|^3.$$

Chapter 3

Conditional Expectation and Martingales

3.1 Introduction

A generalization of a sequence of independent random variables occurs when we allow the variables of the sequence to depend on previous variables in the sequence. One example of this type of dependence is called a martingale, and its definition formalizes the concept of a fair gambling game. A number of results which hold for independent random variables also hold, under certain conditions, for martingales, and seemingly complex problems can be elegantly solved by reframing them in terms of a martingale. In Section 2 we introduce the notion of conditional expectation, in Section 3 we formally define a martingale, in Section 4 we introduce the concept of stopping times and prove the martingale stopping theorem, in Section 5 we give an approach for finding tail probabilities for martingales, and in Section 6 we introduce supermartingales and submartingales, and prove the martingale convergence theorem.

3.2 Conditional Expectation

Let $X$ be such that $E[|X|] < \infty$. In a first course in probability, $E[X|Y]$, the conditional expectation of $X$ given $Y$, is defined to be that function of $Y$ that when $Y = y$ is equal to
$$E[X \mid Y = y] = \begin{cases} \sum_x xP(X = x \mid Y = y) & \text{if } X, Y \text{ are discrete} \\ \int xf_{X|Y}(x|y)\,dx & \text{if } X, Y \text{ have joint density } f, \end{cases}$$
where
$$f_{X|Y}(x|y) = \frac{f(x, y)}{\int f(x, y)\,dx} = \frac{f(x, y)}{f_Y(y)}. \tag{3.1}$$
The important result, often called the tower property, that
$$E[X] = E[E[X|Y]]$$
is then proven. This result, which is often written as
$$E[X] = \begin{cases} \sum_y E[X|Y = y]P(Y = y) & \text{if } X, Y \text{ are discrete} \\ \int E[X|Y = y]f_Y(y)\,dy & \text{if } X, Y \text{ are jointly continuous,} \end{cases}$$
is then gainfully employed in a variety of different calculations.

We now show how to give a more general definition of conditional expectation that reduces to the preceding cases when the random variables are discrete or continuous. To motivate our definition, suppose that whether or not $A$ occurs is determined by the value of $Y$. (That is, suppose that $A \in \sigma(Y)$.) Then, using material from our first course in probability, we see that
$$E[XI_A] = E[E[XI_A|Y]] = E[I_AE[X|Y]],$$
where the final equality holds because, given $Y$, $I_A$ is a constant. We are now ready for a general definition of conditional expectation.
Definition For random variables $X, Y$, let $E[X|Y]$, called the conditional expectation of $X$ given $Y$, denote that function $h(Y)$ having the property that for any $A \in \sigma(Y)$,
$$E[XI_A] = E[h(Y)I_A].$$

By the Radon-Nikodym Theorem of measure theory, a function $h$ that makes $h(Y)$ a (measurable) random variable and satisfies the preceding always exists and, as we show in the following, is unique in the sense that any two such functions of $Y$ must, with probability 1, be equal. The function $h$ is also referred to as a Radon-Nikodym derivative.

Proposition 3.1 If $h_1$ and $h_2$ are functions such that $E[h_1(Y)I_A] = E[h_2(Y)I_A]$ for any $A \in \sigma(Y)$, then
$$P(h_1(Y) = h_2(Y)) = 1.$$

Proof. Let $A_n = \{h_1(Y) - h_2(Y) > 1/n\}$. Then
$$0 = E[h_1(Y)I_{A_n}] - E[h_2(Y)I_{A_n}] = E[(h_1(Y) - h_2(Y))I_{A_n}] \ge \frac{1}{n}P(A_n),$$
showing that $P(A_n) = 0$. Because the events $A_n$ are increasing in $n$, this yields
$$0 = \lim_n P(A_n) = P(\lim_n A_n) = P(\cup_nA_n) = P(h_1(Y) > h_2(Y)).$$
Similarly, we can show that $0 = P(h_1(Y) < h_2(Y))$, which proves the result.

We now show that the preceding general definition of conditional expectation reduces to the usual ones when $X, Y$ are either jointly discrete or continuous.

Proposition 3.2 If $X$ and $Y$ are both discrete, then
$$E[X|Y = y] = \sum_x xP(X = x|Y = y),$$
whereas if they are jointly continuous with joint density $f$, then
$$E[X|Y = y] = \int xf_{X|Y}(x|y)\,dx, \quad \text{where } f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}.$$

Proof. Suppose $X$ and $Y$ are both discrete, and let
$$h(y) = \sum_x xP(X = x|Y = y).$$
For $A \in \sigma(Y)$, define
$$B = \{y : I_A = 1 \text{ when } Y = y\}.$$
Because $B \in \sigma(Y)$, it follows that $I_A = I_{\{Y\in B\}}$. Thus,
$$XI_A = \begin{cases} X & \text{if } Y \in B \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad h(Y)I_A = \begin{cases} \sum_x xP(X = x|Y = y) & \text{if } Y = y \in B \\ 0 & \text{otherwise.} \end{cases}$$
Thus,
$$E[XI_A] = \sum_x xP(X = x, Y \in B) = \sum_x\sum_{y\in B} xP(X = x, Y = y) = \sum_{y\in B}\Big(\sum_x xP(X = x|Y = y)\Big)P(Y = y) = E[h(Y)I_A].$$
The result thus follows by uniqueness. The proof in the continuous case is similar, and is left as an exercise.
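The defining identity $E[XI_A] = E[E[X|Y]I_A]$ is easy to test numerically in the discrete case. A small sketch of ours (the particular variables are made up for illustration), with $Y$ a die roll, $X$ equal to $Y$ plus mean-zero noise so that $E[X|Y] = Y$, and $A = \{Y \le 2\} \in \sigma(Y)$:

```python
import random

rng = random.Random(7)
n = 10**6
lhs = rhs = 0.0
for _ in range(n):
    y = rng.randint(1, 6)           # Y: a fair die
    x = y + rng.choice([-1, 0, 1])  # X: Y plus symmetric noise, so E[X|Y] = Y
    if y <= 2:                      # A = {Y <= 2} is in sigma(Y)
        lhs += x                    # accumulates E[X I_A]
        rhs += y                    # accumulates E[E[X|Y] I_A] = E[Y I_A]
print(lhs / n, rhs / n)             # the two averages agree up to sampling noise
```

Trying other events $A \in \sigma(Y)$ gives the same agreement, which is exactly the content of the general definition.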
For any sigma field $\mathcal{F}$ we can define $E[X|\mathcal{F}]$ to be that random variable in $\mathcal{F}$ having the property that for all $A \in \mathcal{F}$,
$$E[XI_A] = E[E[X|\mathcal{F}]I_A].$$
Intuitively, $E[X|\mathcal{F}]$ represents the conditional expectation of $X$ given that we know all of the events in $\mathcal{F}$ that occur.

Remark 3.3 It follows from the preceding definition that $E[X|Y] = E[X|\sigma(Y)]$. For any random variables $X, X_1, \ldots, X_n$, define $E[X|X_1, \ldots, X_n]$ by
$$E[X|X_1, \ldots, X_n] = E[X|\sigma(X_1, \ldots, X_n)].$$
In other words, $E[X|X_1, \ldots, X_n]$ is that function $h(X_1, \ldots, X_n)$ for which
$$E[XI_A] = E[h(X_1, \ldots, X_n)I_A] \quad \text{for all } A \in \sigma(X_1, \ldots, X_n).$$

We now establish some important properties of conditional expectation.

Proposition 3.4
(a) The Tower Property: for any sigma field $\mathcal{F}$, $E[X] = E[E[X|\mathcal{F}]]$.
(b) For any $A \in \mathcal{F}$, $E[XI_A|\mathcal{F}] = I_AE[X|\mathcal{F}]$.
(c) If $X$ is independent of all $Y \in \mathcal{F}$, then $E[X|\mathcal{F}] = E[X]$.
(d) If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, then $E[\sum_{i=1}^n X_i|\mathcal{F}] = \sum_{i=1}^n E[X_i|\mathcal{F}]$.
(e) Jensen's Inequality: if $f$ is a convex function, then $E[f(X)|\mathcal{F}] \ge f(E[X|\mathcal{F}])$, provided the expectations exist.

Proof. Recall that $E[X|\mathcal{F}]$ is the unique random variable in $\mathcal{F}$ such that
$$E[XI_A] = E[E[X|\mathcal{F}]I_A] \quad \text{for all } A \in \mathcal{F}.$$
Letting $A = \Omega$, so that $I_A \equiv 1$, part (a) follows.

To prove (b), fix $A \in \mathcal{F}$ and let $X^* = XI_A$. Because $E[X^*|\mathcal{F}]$ is the unique random variable in $\mathcal{F}$ such that $E[X^*I_{A'}] = E[E[X^*|\mathcal{F}]I_{A'}]$ for all $A' \in \mathcal{F}$, to show that $E[X^*|\mathcal{F}] = I_AE[X|\mathcal{F}]$ it suffices to show that for all $A' \in \mathcal{F}$,
$$E[X^*I_{A'}] = E[I_AE[X|\mathcal{F}]I_{A'}],$$
or, equivalently, that
$$E[XI_{AA'}] = E[E[X|\mathcal{F}]I_{AA'}],$$
which, since $AA' \in \mathcal{F}$, follows by the definition of conditional expectation.

Part (c) will follow if we can show that, for $A \in \mathcal{F}$,
$$E[XI_A] = E[E[X]I_A],$$
which follows because
$$E[XI_A] = E[X]E[I_A] = E[E[X]I_A]$$
by independence. We leave the proofs of (d) and (e) as exercises.

Remark 3.5 It can be shown that $E[\,\cdot\,|\mathcal{F}]$ satisfies all the properties of ordinary expectations, except that all probabilities are now computed conditional on knowing which events in $\mathcal{F}$ have occurred. For instance, applying the tower property to $E[X|\mathcal{F}]$ yields
$$E[X|\mathcal{F}] = E\big[E[X|\sigma(\mathcal{F} \cup \mathcal{G})]\,\big|\,\mathcal{F}\big].$$

While the conditional expectation of $X$ given the sigma field $\mathcal{F}$ is defined to be that function satisfying
$$E[XI_A] = E[E[X|\mathcal{F}]I_A] \quad \text{for all } A \in \mathcal{F},$$
it can be shown, using the dominated convergence theorem, that
$$E[XW] = E[E[X|\mathcal{F}]W] \quad \text{for all } W \in \mathcal{F}. \tag{3.2}$$

The following proposition is quite useful.

Proposition 3.6 If $W \in \mathcal{F}$, then $E[XW|\mathcal{F}] = WE[X|\mathcal{F}]$.

Before giving a proof, let us note that the result is quite intuitive. Because $W \in \mathcal{F}$, it follows that conditional on knowing which events of $\mathcal{F}$ occur, that is, conditional on $\mathcal{F}$, the random variable $W$ becomes a constant, and the expected value of a constant times a random variable is just the constant times the expected value of the random variable. Next we formally prove the result.

Proof. Let
$$Y = E[XW|\mathcal{F}] - WE[X|\mathcal{F}] = (X - E[X|\mathcal{F}])W - (XW - E[XW|\mathcal{F}])$$
and note that $Y \in \mathcal{F}$. Now, for $A \in \mathcal{F}$,
$$E[YI_A] = E[(X - E[X|\mathcal{F}])WI_A] - E[(XW - E[XW|\mathcal{F}])I_A].$$
However, because $WI_A \in \mathcal{F}$, we use Equation (3.2) to get
$$E[(X - E[X|\mathcal{F}])WI_A] = E[XWI_A] - E[E[X|\mathcal{F}]WI_A] = 0,$$
and by the definition of conditional expectation we have
$$E[(XW - E[XW|\mathcal{F}])I_A] = E[XWI_A] - E[E[XW|\mathcal{F}]I_A] = 0.$$
Thus, we see that for $A \in \mathcal{F}$, $E[YI_A] = 0$. Setting first $A = \{Y > 0\}$, and then $A = \{Y < 0\}$ (which are both in $\mathcal{F}$ because $Y \in \mathcal{F}$), shows that
$$P(Y > 0) = P(Y < 0) = 0.$$
Hence $Y = 0$, which proves the result.

3.3 Martingales

We say that a sequence of sigma fields $\mathcal{F}_1, \mathcal{F}_2, \ldots$ is a filtration if $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$. We say a sequence of random variables $X_1, X_2, \ldots$ is adapted to $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$. To obtain a feel for these definitions it is useful to think of $n$ as representing time, with information being gathered as time progresses. With this interpretation, the sigma field $\mathcal{F}_n$ represents all events that are determined by what occurs up to time $n$, and thus contains $\mathcal{F}_{n-1}$. The sequence $X_n, n \ge 1$, is adapted to the filtration $\mathcal{F}_n, n \ge 1$, when the value of $X_n$ is determined by what occurs by time $n$.

Definition 3.7 $Z_n$ is a martingale for the filtration $\mathcal{F}_n$ if
1. $E[|Z_n|] < \infty$;
2. $Z_n$ is adapted to $\mathcal{F}_n$;
3. $E[Z_{n+1}|\mathcal{F}_n] = Z_n$.

A bet is said to be fair if its expected gain is equal to 0, and a martingale can be thought of as a generalized version of a fair game. Consider a gambling casino in which bets can be made concerning games played in sequence. Let $\mathcal{F}_n$ consist of all events whose outcome is determined by the results of the first $n$ games, and let $Z_n$ denote the fortune of a specified gambler after the first $n$ games. Then the martingale condition states that, no matter what the results of the first $n$ games, the gambler's expected fortune after game $n + 1$ is exactly what it was before that game. That is, no matter what has previously occurred, the gambler's expected winnings in any game equal 0.

It follows upon taking expectations of both sides of the final martingale condition that
$$E[Z_{n+1}] = E[Z_n],$$
implying that $E[Z_n] = E[Z_1]$. We call $E[Z_1]$ the mean of the martingale.
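The fair-game picture is easy to watch in simulation. The following sketch of ours tracks the fortune of a gambler who bets 1 on a fair coin each game; the average fortune stays at its starting value at every time, as the martingale property requires:

```python
import random

rng = random.Random(3)

def fortune_path(n):
    # Z_n = starting fortune plus a sum of fair +-1 bets
    z = 10.0
    path = [z]
    for _ in range(n):
        z += rng.choice((-1.0, 1.0))
        path.append(z)
    return path

paths = [fortune_path(50) for _ in range(20000)]
for t in (1, 10, 50):
    print(t, sum(p[t] for p in paths) / len(paths))  # all near 10 = E[Z_1]
```

Individual paths wander widely, but the mean is pinned at 10 for all $n$.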
Another useful martingale result is that
$$E[Z_{n+2}|\mathcal{F}_n] = E\big[E[Z_{n+2}|\mathcal{F}_{n+1}]\,\big|\,\mathcal{F}_n\big] = E[Z_{n+1}|\mathcal{F}_n] = Z_n.$$
Repeating this argument yields that
$$E[Z_{n+k}|\mathcal{F}_n] = Z_n, \quad k \ge 1.$$

Definition 3.8 We say that $Z_n, n \ge 1$, is a martingale (without specifying a filtration) if
1. $E[|Z_n|] < \infty$;
2. $E[Z_{n+1}|Z_1, \ldots, Z_n] = Z_n$.

If $\{Z_n\}$ is a martingale for the filtration $\mathcal{F}_n$, then it is a martingale. This follows from
$$E[Z_{n+1}|Z_1, \ldots, Z_n] = E\big[E[Z_{n+1}|Z_1, \ldots, Z_n, \mathcal{F}_n]\,\big|\,Z_1, \ldots, Z_n\big] = E\big[E[Z_{n+1}|\mathcal{F}_n]\,\big|\,Z_1, \ldots, Z_n\big] = E[Z_n|Z_1, \ldots, Z_n] = Z_n,$$
where the second equality followed because $Z_i \in \mathcal{F}_n$ for all $i = 1, \ldots, n$.

We now give some examples of martingales.

Example 3.9 If $X_i, i \ge 1$, are independent zero mean random variables, then $Z_n = \sum_{i=1}^n X_i, n \ge 1$, is a martingale with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. This follows because
$$E[Z_{n+1}|X_1, \ldots, X_n] = E[Z_n + X_{n+1}|X_1, \ldots, X_n] = Z_n + E[X_{n+1}] = Z_n$$
by the independence of the $X_i$.

Example 3.10 Let $X_i, i \ge 1$, be independent and identically distributed with mean zero and variance $\sigma^2$. Let $S_n = \sum_{i=1}^n X_i$, and define
$$Z_n = S_n^2 - n\sigma^2, \quad n \ge 1.$$
Then $Z_n, n \ge 1$, is a martingale for the filtration $\sigma(X_1, \ldots, X_n)$. To verify this claim, note that
$$E[S_{n+1}^2|X_1, \ldots, X_n] = E[(S_n + X_{n+1})^2|X_1, \ldots, X_n] = S_n^2 + 2S_nE[X_{n+1}] + E[X_{n+1}^2] = S_n^2 + \sigma^2.$$
Subtracting $(n + 1)\sigma^2$ from both sides yields
$$E[Z_{n+1}|X_1, \ldots, X_n] = Z_n.$$

Our next example introduces an important type of martingale, known as a Doob martingale.

Example 3.11 Let $Y$ be an arbitrary random variable with $E[|Y|] < \infty$, let $\mathcal{F}_n, n \ge 1$, be a filtration, and define
$$Z_n = E[Y|\mathcal{F}_n].$$
We claim that $Z_n, n \ge 1$, is a martingale with respect to the filtration $\mathcal{F}_n, n \ge 1$. To verify this, we must first show that $E[|Z_n|] < \infty$, which follows from Jensen's inequality for conditional expectations, since
$$E[|Z_n|] = E\big[|E[Y|\mathcal{F}_n]|\big] \le E\big[E[|Y|\,|\mathcal{F}_n]\big] = E[|Y|] < \infty,$$
and then note that
$$E[Z_{n+1}|\mathcal{F}_n] = E\big[E[Y|\mathcal{F}_{n+1}]\,\big|\,\mathcal{F}_n\big] = E[Y|\mathcal{F}_n] = Z_n.$$
The martingale $Z_n = E[Y|\mathcal{F}_n], n \ge 1$, is called a Doob martingale.

Example 3.12 Our final example generalizes the result that the successive partial sums of independent zero mean random variables constitute a martingale. For any random variables $X_i, i \ge 1$, the random variables $X_i - E[X_i|X_1, \ldots, X_{i-1}], i \ge 1$, have mean 0. Even though they need not be independent, their partial sums constitute a martingale. That is, we claim that
$$Z_n = \sum_{i=1}^n \big(X_i - E[X_i|X_1, \ldots, X_{i-1}]\big)$$
is, provided that $E[|Z_n|] < \infty$, a martingale with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. To verify this claim, note that
$$Z_{n+1} = Z_n + X_{n+1} - E[X_{n+1}|X_1, \ldots, X_n].$$
Thus,
$$E[Z_{n+1}|X_1, \ldots, X_n] = Z_n + E[X_{n+1}|X_1, \ldots, X_n] - E[X_{n+1}|X_1, \ldots, X_n] = Z_n.$$

3.4 The Martingale Stopping Theorem

The positive integer valued, possibly infinite, random variable $N$ is said to be a random time for the filtration $\mathcal{F}_n$ if $\{N = n\} \in \mathcal{F}_n$ or, equivalently, if $\{N \ge n\} \in \mathcal{F}_{n-1}$, for all $n$. If $P(N < \infty) = 1$, then the random time is said to be a stopping time for the filtration.

Thinking once again in terms of information being amassed over time, with $\mathcal{F}_n$ being the cumulative information as to all events that have occurred by time $n$, the random variable $N$ will be a stopping time for this filtration if the decision whether to stop at time $n$ (so that $N = n$) depends only on what has occurred by time $n$. (That is, the decision to stop at time $n$ is not allowed to depend on the results of future events.) It should be noted, however, that the decision to stop at time $n$ need not be independent of future events, only that it must be conditionally independent of future events given all information up to the present.
Lemma 3.13 Let $Z_n, n \ge 1$, be a martingale for the filtration $\mathcal{F}_n$. If $N$ is a random time for this filtration, then the process $\bar Z_n = Z_{\min(N,n)}, n \ge 1$, called the stopped process, is also a martingale for the filtration $\mathcal{F}_n$.

Proof. Start with the identity
$$\bar Z_n = \bar Z_{n-1} + I_{\{N\ge n\}}(Z_n - Z_{n-1}).$$
To verify the preceding, consider two cases: (1) $N \ge n$: here $\bar Z_n = Z_n$, $\bar Z_{n-1} = Z_{n-1}$, $I_{\{N\ge n\}} = 1$, and the identity holds; (2) $N < n$: here $\bar Z_n = \bar Z_{n-1} = Z_N$, $I_{\{N\ge n\}} = 0$, and the identity holds. Because $\{N \ge n\} \in \mathcal{F}_{n-1}$, Proposition 3.6 gives
$$E[\bar Z_n|\mathcal{F}_{n-1}] = \bar Z_{n-1} + I_{\{N\ge n\}}E[Z_n - Z_{n-1}|\mathcal{F}_{n-1}] = \bar Z_{n-1}.$$

Theorem 3.14 The Martingale Stopping Theorem. Let $Z_n, n \ge 1$, be a martingale for the filtration $\mathcal{F}_n$, and suppose that $N$ is a stopping time for this filtration. Then
$$E[Z_N] = E[Z_1]$$
if any of the following three sufficient conditions hold:
(1) the $\bar Z_n$ are uniformly bounded;
(2) $N$ is bounded;
(3) $E[N] < \infty$, and there exists $M < \infty$ such that $E[|Z_{n+1} - Z_n|\,|\,\mathcal{F}_n] < M$.

Proof. By Lemma 3.13 the stopped process is a martingale, so $E[\bar Z_n] = E[\bar Z_1] = E[Z_1]$. Because $N < \infty$ almost surely, $\bar Z_n \to Z_N$ almost surely. Under (2), $\bar Z_n = Z_N$ once $n$ exceeds the bound on $N$. Under (1), the result follows from the dominated convergence theorem. Under (3), the dominated convergence theorem applies with the bound
$$|\bar Z_n| \le |Z_1| + \sum_{i\ge1} |Z_{i+1} - Z_i|I_{\{N>i\}},$$
whose expectation is finite because $\{N > i\} \in \mathcal{F}_i$ gives
$$E\big[|Z_{i+1} - Z_i|I_{\{N>i\}}\big] = E\big[I_{\{N>i\}}E[|Z_{i+1} - Z_i|\,|\,\mathcal{F}_i]\big] \le MP(N > i),$$
and $\sum_i MP(N > i) \le ME[N] < \infty$.

A corollary of the martingale stopping theorem is Wald's equation.

Corollary 3.15 Wald's Equation. If $X_1, X_2, \ldots$ are independent and identically distributed with finite mean $\mu = E[X_1]$, and if $N$ is a stopping time for the filtration $\mathcal{F}_n = \sigma(X_1, \ldots, X_n), n \ge 1$, such that $E[N] < \infty$, then
$$E\Big[\sum_{i=1}^N X_i\Big] = \mu E[N].$$

Proof. $Z_n = \sum_{i=1}^n (X_i - \mu), n \ge 1$, being the successive partial sums of independent zero mean random variables, is a martingale with mean 0. Hence, assuming the martingale stopping theorem can be applied, we have that
$$0 = E[Z_N] = E\Big[\sum_{i=1}^N (X_i - \mu)\Big] = E\Big[\sum_{i=1}^N X_i - N\mu\Big] = E\Big[\sum_{i=1}^N X_i\Big] - E[N]\mu.$$
To complete the proof, we verify sufficient condition (3) of the martingale stopping theorem:
$$E[|Z_{n+1} - Z_n|\,|\,\mathcal{F}_n] = E[|X_{n+1} - \mu|\,|\,\mathcal{F}_n] = E[|X_{n+1} - \mu|] \le E[|X_1|] + |\mu|,$$
where the second equality holds by independence. Thus condition (3) is verified and the result proved.
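Wald's equation is easy to confirm by simulation. In the sketch below (ours, not from the text) the summands are iid $U(0, 2)$ with $\mu = 1$, and $N$ is the first time the running sum exceeds 10, a legitimate stopping time:

```python
import random

rng = random.Random(11)
trials, total_sum, total_n = 10**5, 0.0, 0
for _ in range(trials):
    s, n = 0.0, 0
    while True:                # N: first time the running sum exceeds 10
        s += rng.uniform(0, 2) # iid U(0,2), mean mu = 1
        n += 1
        if s > 10:
            break
    total_sum += s
    total_n += n
print(total_sum / trials, total_n / trials)  # E[sum] matches mu * E[N] = E[N]
```

The two printed averages agree to within simulation error, as Wald's equation requires.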
Example 3.16 Suppose independent and identically distributed discrete random variables $X_1, X_2, \ldots$ are observed in sequence, with $P(X_i = j) = p_j$. What is the expected number of random variables that must be observed until the subsequence $0, 1, 2, 0, 1$ occurs? What is the variance of this number?

Solution. Consider a fair gambling casino, that is, one in which the expected casino winnings on every bet are 0. Note that if a gambler bets her entire fortune of $a$ that the next outcome is $j$, then her fortune after the bet will either be 0, with probability $1 - p_j$, or $a/p_j$, with probability $p_j$. Now imagine a sequence of gamblers betting at this casino. Each gambler starts with an initial fortune 1 and stops playing if his or her fortune ever becomes 0. Gambler $i$ bets 1 that $X_i = 0$; if she wins, she bets her entire fortune (of $1/p_0$) that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $(p_0^2p_1^2p_2)^{-1}$.

Let $Z_n$ denote the casino's winnings after the data value $X_n$ is observed; because it is a fair casino, $Z_n, n \ge 1$, is a martingale with mean 0 with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. Let $N$ denote the number of random variables that need be observed until the pattern $0, 1, 2, 0, 1$ appears, so $(X_{N-4}, \ldots, X_N) = (0, 1, 2, 0, 1)$. As it is easy to verify that $N$ is a stopping time for the filtration, and that condition (3) of the martingale stopping theorem is satisfied with $M = 4/(p_0^2p_1^2p_2)$, it follows that $E[Z_N] = 0$. However, after $X_N$ has been observed, each of the gamblers $1, \ldots, N - 5$ would have lost 1; gambler $N - 4$ would have won $(p_0^2p_1^2p_2)^{-1} - 1$; gamblers $N - 3$ and $N - 2$ would each have lost 1; gambler $N - 1$ would have won $(p_0p_1)^{-1} - 1$; and gambler $N$ would have lost 1. Therefore,
$$Z_N = N - (p_0^2p_1^2p_2)^{-1} - (p_0p_1)^{-1}.$$
Using that $E[Z_N] = 0$ yields the result
$$E[N] = (p_0^2p_1^2p_2)^{-1} + (p_0p_1)^{-1}.$$
In the same manner we can compute the expected time until any specified pattern occurs in independent and identically distributed random data. For instance, when making independent flips of a coin that comes up heads with probability $p$, the mean number of flips until the pattern HHTTHH appears is $p^{-4}q^{-2} + p^{-2} + p^{-1}$, where $q = 1 - p$.

To determine $\operatorname{Var}(N)$, suppose now that gambler $i$ starts with an initial fortune $i$ and bets that amount that $X_i = 0$; if she wins, she bets her entire fortune that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $i/(p_0^2p_1^2p_2)$. The casino's winnings at time $N$ are thus
$$Z_N = 1 + 2 + \cdots + N - \frac{N - 4}{p_0^2p_1^2p_2} - \frac{N - 1}{p_0p_1} = \frac{N(N+1)}{2} - \frac{N - 4}{p_0^2p_1^2p_2} - \frac{N - 1}{p_0p_1}.$$
Assuming that the martingale stopping theorem holds (although none of the three sufficient conditions hold, the stopping theorem can still be shown to be valid for this martingale), we obtain upon taking expectations that
$$E[N^2] + E[N] = \frac{2(E[N] - 4)}{p_0^2p_1^2p_2} + \frac{2(E[N] - 1)}{p_0p_1}.$$
Using the previously obtained value of $E[N]$, the preceding can now be solved for $E[N^2]$, to obtain $\operatorname{Var}(N) = E[N^2] - (E[N])^2$.

Example 3.17 The cards from a shuffled deck of 26 red and 26 black cards are to be turned over one at a time. At any time a player can say "next", and is a winner if the next card is red and is a loser otherwise. A player who has not yet said "next" when only a single card remains is a winner if the final card is red, and is a loser otherwise. What is a good strategy for the player?

Solution. Every strategy has probability 1/2 of resulting in a win. To see this, let $R_n$ denote the number of red cards remaining in the deck after $n$ cards have been shown. Then
$$E[R_{n+1}|R_1, \ldots, R_n] = R_n - \frac{R_n}{52 - n} = R_n\,\frac{51 - n}{52 - n}.$$
Hence, $\frac{R_n}{52-n}, n \ge 0$, is a martingale. Because $R_0/52 = 1/2$, this martingale has mean 1/2. Now consider any strategy, and let $N$ denote the number of cards that are turned over before "next" is said. Because $N$ is a bounded stopping time, it follows from the martingale stopping theorem that
$$E\Big[\frac{R_N}{52 - N}\Big] = 1/2.$$
But given the first $N$ cards, the conditional probability that the next card is red is $R_N/(52 - N)$; hence, with $I = I_{\{\text{win}\}}$,
$$E[I] = E\big[E[I \mid R_N, N]\big] = E\Big[\frac{R_N}{52 - N}\Big] = 1/2.$$

Our next example involves the matching problem.
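The conclusion of Example 3.17 is striking enough to be worth a simulation. The sketch below (ours; the threshold strategy is made up purely for illustration) says "next" as soon as the fraction of red cards remaining exceeds a threshold, and wins about half the time no matter what threshold is used:

```python
import random

rng = random.Random(5)

def play(threshold):
    # say "next" once the fraction of red cards left exceeds the threshold
    deck = [1] * 26 + [0] * 26   # 1 = red, 0 = black
    rng.shuffle(deck)
    for n in range(51):
        reds_left = sum(deck[n:])
        if reds_left / (52 - n) > threshold:
            return deck[n]       # the "next" card decides the game
    return deck[51]              # forced to take the last card

for thr in (0.4, 0.5, 0.6):
    wins = sum(play(thr) for _ in range(20000))
    print(thr, wins / 20000)     # all close to 1/2, whatever the threshold
```

The strategy only uses information available at each step, so it defines a bounded stopping time, and the martingale argument predicts exactly what the simulation shows.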
Example 3.18 Each member of a group of $n$ individuals throws his or her hat in a pile. The hats are then mixed together, and each person randomly selects a hat in such a manner that each of the $n!$ possible selections of the $n$ individuals is equally likely. Any individual who selects his or her own hat departs, and that is the end of round one. If any individuals remain, then each throws the hat they have in a pile, and each one then randomly chooses a hat. Those selecting their own hats leave, and that ends round two; and so on. Find $E[N]$, where $N$ is the number of rounds until everyone has departed.

Solution. Let $X_i, i \ge 1$, denote the number of matches in round $i$ for $i = 1, \ldots, N$, and let it equal 1 for $i > N$. To solve this example, we will use the zero mean martingale $Z_k, k \ge 1$, defined by
$$Z_k = \sum_{i=1}^k \big(X_i - E[X_i|X_1, \ldots, X_{i-1}]\big) = \sum_{i=1}^k X_i - k,$$
where the final equality follows because, for any number of remaining individuals, the expected number of matches in a round is 1 (which is seen by writing this as the sum of indicator variables for the events that each remaining person has a match). Because
$$N = \min\Big\{k : \sum_{i=1}^k X_i = n\Big\}$$
is a stopping time for this martingale, we obtain from the martingale stopping theorem that
$$0 = E[Z_N] = E\Big[\sum_{i=1}^N X_i - N\Big] = n - E[N],$$
and so $E[N] = n$.

Example 3.19 If $X_1, X_2, \ldots$ is a sequence of independent and identically distributed random variables with $P(X_1 = 0) \ne 1$, then the process $S_n = \sum_{i=1}^n X_i$ is said to be a random walk. For given positive constants $a, b$, let $p$ denote the probability that the random walk becomes as large as $a$ before it becomes as small as $-b$. We now show how to use martingale theory to approximate $p$.

In the case where we have a bound $|X_i| \le c$, it can be shown that there will be a value $\theta \ne 0$ such that
$$E[e^{\theta X_1}] = 1.$$
Then, because
$$Z_n = e^{\theta S_n} = \prod_{i=1}^n e^{\theta X_i}$$
is the product of independent random variables with mean 1, it follows that $Z_n, n \ge 1$, is a martingale having mean 1. Let
$$N = \min\{n : S_n \ge a \text{ or } S_n \le -b\}.$$
Condition (3) of the martingale stopping theorem can be shown to hold, implying that
$$E[e^{\theta S_N}] = 1.$$
Thus,
$$1 = E[e^{\theta S_N}|S_N \ge a]\,p + E[e^{\theta S_N}|S_N \le -b]\,(1 - p).$$
Now, if $\theta > 0$, then
$$e^{\theta a} \le E[e^{\theta S_N}|S_N \ge a] \le e^{\theta(a + c)}$$
and
$$e^{-\theta(b + c)} \le E[e^{\theta S_N}|S_N \le -b] \le e^{-\theta b},$$
yielding the bounds
$$\frac{1 - e^{-\theta b}}{e^{\theta(a+c)} - e^{-\theta b}} \le p \le \frac{1 - e^{-\theta(b+c)}}{e^{\theta a} - e^{-\theta(b+c)}}$$
and motivating the approximation
$$p \approx \frac{1 - e^{-\theta b}}{e^{\theta a} - e^{-\theta b}}.$$
We leave it as an exercise to obtain bounds on $p$ when $\theta < 0$.
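For a simple $\pm1$ walk there is no overshoot ($c = 1$ and $S_N$ hits $a$ or $-b$ exactly), so the approximation above is in fact exact, which makes it a convenient test case. A sketch of ours:

```python
import random
from math import log, exp

# +-1 walk: up with probability mu = 0.4, down with 0.6.
# E[exp(theta*X)] = 1 at theta = log((1 - mu)/mu).
mu, a, b = 0.4, 5, 5
theta = log((1 - mu) / mu)
approx = (1 - exp(-theta * b)) / (exp(theta * a) - exp(-theta * b))

rng = random.Random(9)
hits, trials = 0, 10**5
for _ in range(trials):
    s = 0
    while -b < s < a:
        s += 1 if rng.random() < mu else -1
    hits += (s >= a)
print(approx, hits / trials)   # the two numbers agree closely
```

For walks with larger steps the overshoot matters, and the printed approximation would only be guaranteed to lie between the two displayed bounds.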
Our next example involves Doob's backward martingale. Before defining this martingale, we need the following definition.

Definition 3.20 The random variables $X_1, \ldots, X_n$ are said to be exchangeable if $X_{i_1}, \ldots, X_{i_n}$ has the same probability distribution for every permutation $i_1, \ldots, i_n$ of $1, \ldots, n$.

Suppose that $X_1, \ldots, X_n$ are exchangeable. Assume $E[|X_1|] < \infty$, let
$$S_j = \sum_{i=1}^j X_i, \quad j = 1, \ldots, n,$$
and consider the Doob martingale $Z_1, \ldots, Z_n$ given by
$$Z_j = E[X_1|S_n, S_{n-1}, \ldots, S_{n+1-j}].$$
However,
$$S_{n+1-j} = E[S_{n+1-j}|S_{n+1-j}, X_{n+2-j}, \ldots, X_n] = \sum_{i=1}^{n+1-j} E[X_i|S_{n+1-j}, X_{n+2-j}, \ldots, X_n] = (n + 1 - j)E[X_1|S_{n+1-j}, X_{n+2-j}, \ldots, X_n]$$
(by exchangeability)
$$= (n + 1 - j)Z_j,$$
where the final equality follows since knowing $S_n, S_{n-1}, \ldots, S_{n+1-j}$ is equivalent to knowing $S_{n+1-j}, X_{n+2-j}, \ldots, X_n$. The martingale
$$Z_j = \frac{S_{n+1-j}}{n + 1 - j}, \quad j = 1, \ldots, n,$$
is called the Doob backward martingale. We now apply it to solve the ballot problem.

Example 3.21 In an election between candidates A and B, candidate A receives $n$ votes and candidate B receives $m$ votes, where $n > m$. Assuming, in the count of the votes, that all orderings of the $n + m$ votes are equally likely, what is the probability that A is always ahead in the count of the votes?

Solution. Let $X_i$ equal 1 if the $i$th vote counted is for A and let it equal $-1$ if that vote is for B. Because all orderings of the $n + m$ votes are assumed to be equally likely, it follows that the random variables $X_1, \ldots, X_{n+m}$ are exchangeable, and
$$Z_j = \frac{S_{n+m+1-j}}{n + m + 1 - j}, \quad j = 1, \ldots, n + m,$$
is a Doob backward martingale, where $S_j = \sum_{i=1}^j X_i$. Because $Z_1 = S_{n+m}/(n + m) = (n - m)/(n + m)$, the mean of this martingale is $(n - m)/(n + m)$. Because $n > m$, A will always be ahead in the count of the vote unless there is a tie at some point, which will occur if one of the $S_j$, or equivalently one of the $Z_j$, is equal to 0. Consequently, define the bounded stopping time $N$ by
$$N = \min\{j : Z_j = 0 \text{ or } j = n + m\}.$$
Because $Z_{n+m} = X_1$, it follows that $Z_N$ will equal 0 if the candidates are ever tied, and will equal $X_1$ if A is always ahead. However, if A is always ahead, then A must receive the first vote; therefore,
$$Z_N = \begin{cases} 1 & \text{if A is always ahead} \\ 0 & \text{otherwise.} \end{cases}$$
By the martingale stopping theorem, $E[Z_N] = (n - m)/(n + m)$, yielding the result
$$P(\text{A is always ahead}) = \frac{n - m}{n + m}.$$

3.5 The Hoeffding-Azuma Inequality

Let $Z_n, n \ge 0$, be a martingale with respect to the filtration $\mathcal{F}_n$. If the differences $Z_n - Z_{n-1}$ can be shown to lie in a bounded random interval of the form $[-B_n, -B_n + d_n]$, where $B_n \in \mathcal{F}_{n-1}$ and $d_n$ is constant, then the Hoeffding-Azuma inequality often yields useful bounds on the tail probabilities of $Z_n$. Before presenting the inequality we will need a couple of lemmas.

Lemma 3.22 If $E[X] = 0$ and $P(-\alpha \le X \le \beta) = 1$, then for any convex function $f$,
$$E[f(X)] \le \frac{\beta}{\alpha + \beta}f(-\alpha) + \frac{\alpha}{\alpha + \beta}f(\beta).$$

Proof. Because $f$ is convex, it follows that in the region $-\alpha \le x \le \beta$ it lies below the straight line connecting the points $(-\alpha, f(-\alpha))$ and $(\beta, f(\beta))$; that is,
$$f(x) \le f(-\alpha) + \frac{f(\beta) - f(-\alpha)}{\beta + \alpha}(x + \alpha), \quad -\alpha \le x \le \beta.$$
Taking expectations and using $E[X] = 0$ gives the result.

Lemma 3.23 For $0 \le p \le 1$ and all $t$,
$$pe^{(1-p)t} + (1 - p)e^{-pt} \le e^{t^2/8}.$$

Proof. Setting $\alpha = 2p - 1$ and $\beta = t/2$, the claim is equivalent to showing that
$$f(\alpha, \beta) = e^\beta + e^{-\beta} + \alpha(e^\beta - e^{-\beta}) - 2e^{\alpha\beta + \beta^2/2} \le 0$$
for $|\alpha| \le 1$. The inequality is easily checked by crude bounds when $|\beta| \ge 10$, so if the lemma were false, $f$ would assume a strictly positive maximum in the interior of the region
$$R = \{(\alpha, \beta) : |\alpha| \le 1,\, |\beta| \le 10\}.$$
Setting the partial derivatives of $f$ equal to 0, we obtain
$$e^\beta - e^{-\beta} + \alpha(e^\beta + e^{-\beta}) = 2(\alpha + \beta)e^{\alpha\beta + \beta^2/2} \tag{3.3}$$
$$e^\beta - e^{-\beta} = 2\beta e^{\alpha\beta + \beta^2/2}. \tag{3.4}$$
We now prove the lemma by showing that any solution of (3.3) and (3.4) must have $\beta = 0$. Because $f(\alpha, 0) = 0$, this contradicts the hypothesis that $f$ assumes a strictly positive maximum in $R$, establishing the lemma.

So assume there is a solution of (3.3) and (3.4) in which $\beta \ne 0$. First note that there is no solution of these equations with $\alpha = 0$ and $\beta \ne 0$: if there were, (3.4) would say that
$$e^\beta - e^{-\beta} = 2\beta e^{\beta^2/2}, \tag{3.5}$$
but expanding in a power series about 0 shows that (3.5) is equivalent to
$$2\sum_{i\ge0}\frac{\beta^{2i+1}}{(2i+1)!} = 2\sum_{i\ge0}\frac{\beta^{2i+1}}{2^ii!},$$
which (because $(2i + 1)! > 2^ii!$ when $i > 0$) is clearly impossible when $\beta \ne 0$. Thus any solution of (3.3) and (3.4) with $\beta \ne 0$ also has $\alpha \ne 0$. Assuming such a solution and dividing (3.3) by (3.4) gives
$$1 + \alpha\,\frac{e^\beta + e^{-\beta}}{e^\beta - e^{-\beta}} = 1 + \frac{\alpha}{\beta}.$$
Because $\alpha \ne 0$, the preceding is equivalent to
$$\beta(e^\beta + e^{-\beta}) = e^\beta - e^{-\beta}$$
or, expanding in a Taylor series,
$$\sum_{i\ge0}\frac{\beta^{2i+1}}{(2i)!} = \sum_{i\ge0}\frac{\beta^{2i+1}}{(2i+1)!},$$
which is clearly not possible when $\beta \ne 0$. Thus there is no solution of (3.3) and (3.4) in which $\beta \ne 0$, proving the result.

We are now ready for the Hoeffding-Azuma inequality.
Theorem 3.24 The Hoeffding-Azuma Inequality. Let $Z_n, n \ge 1$, be a martingale with mean $\mu$ with respect to the filtration $\mathcal{F}_n$. Let $Z_0 = \mu$, and suppose there exist nonnegative random variables $B_n, n > 0$, where $B_n \in \mathcal{F}_{n-1}$, and positive constants $d_n, n > 0$, such that
$$-B_n \le Z_n - Z_{n-1} \le -B_n + d_n.$$
Then, for $n > 0$, $a > 0$:
$$\text{(i)}\quad P(Z_n - \mu \ge a) \le e^{-2a^2/\sum_{i=1}^n d_i^2} \qquad \text{(ii)}\quad P(Z_n - \mu \le -a) \le e^{-2a^2/\sum_{i=1}^n d_i^2}.$$

Proof. Suppose that $\mu = 0$. For any $c > 0$,
$$P(Z_n \ge a) = P(e^{cZ_n} \ge e^{ca}) \le e^{-ca}E[e^{cZ_n}], \tag{3.6}$$
where we use Markov's inequality in the second step. Let $W_n = e^{cZ_n}$. Note that $W_0 = 1$, and that for $n > 0$,
$$E[W_n|\mathcal{F}_{n-1}] = E[e^{cZ_{n-1}}e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] = W_{n-1}E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}], \tag{3.7}$$
where the second equality used that $Z_{n-1} \in \mathcal{F}_{n-1}$. Because (i) $f(x) = e^{cx}$ is convex, (ii)
$$E[Z_n - Z_{n-1}|\mathcal{F}_{n-1}] = E[Z_n|\mathcal{F}_{n-1}] - E[Z_{n-1}|\mathcal{F}_{n-1}] = Z_{n-1} - Z_{n-1} = 0,$$
and (iii) $-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$, it follows from Lemma 3.22, with $\alpha = B_n$, $\beta = -B_n + d_n$, that
$$E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] \le \frac{(-B_n + d_n)e^{-cB_n} + B_ne^{c(-B_n + d_n)}}{d_n} \le e^{c^2d_n^2/8},$$
where the final inequality used Lemma 3.23 (with $p = B_n/d_n$, $t = cd_n$), together with $B_n \in \mathcal{F}_{n-1}$. Hence, from Equation (3.7), we see that
$$E[W_n|\mathcal{F}_{n-1}] \le W_{n-1}e^{c^2d_n^2/8}.$$
Taking expectations gives
$$E[W_n] \le E[W_{n-1}]e^{c^2d_n^2/8}.$$
Using that $E[W_0] = 1$ yields, upon iterating this inequality,
$$E[W_n] \le e^{c^2\sum_{i=1}^n d_i^2/8}.$$
Therefore, from Equation (3.6), we obtain that for any $c > 0$,
$$P(Z_n \ge a) \le \exp\Big(-ca + c^2\sum_{i=1}^n d_i^2/8\Big).$$
Letting $c = 4a/\sum_{i=1}^n d_i^2$ (which is the value of $c$ that minimizes $-ca + c^2\sum_{i=1}^n d_i^2/8$) gives
$$P(Z_n \ge a) \le e^{-2a^2/\sum_{i=1}^n d_i^2}.$$
Parts (i) and (ii) of the Hoeffding-Azuma inequality now follow from applying the preceding, first to the zero mean martingale $\{Z_n - \mu\}$, and secondly to the zero mean martingale $\{\mu - Z_n\}$.

Example 3.25 Let $X_i, i \ge 1$, be independent Bernoulli random variables with means $p_i, i = 1, \ldots, n$. Then
$$Z_n = \sum_{i=1}^n X_i - \sum_{i=1}^n p_i, \quad n \ge 1,$$
is a martingale with mean 0. Because $Z_n - Z_{n-1} = X_n - p_n$, we see that
$$-p_n \le Z_n - Z_{n-1} \le 1 - p_n.$$
Thus, by the Hoeffding-Azuma inequality ($B_n = p_n$, $d_n = 1$), we see that for $a > 0$, with $S_n = \sum_{i=1}^n X_i$,
$$P\Big(S_n - \sum_{i=1}^n p_i \ge a\Big) \le e^{-2a^2/n}, \qquad P\Big(S_n - \sum_{i=1}^n p_i \le -a\Big) \le e^{-2a^2/n}.$$
The preceding inequalities are often called Chernoff bounds.

The Hoeffding-Azuma inequality is often applied to a Doob type martingale. The following corollary is often used.

Corollary 3.26 Let $h$ be such that if the vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ differ in at most one coordinate (that is, for some $k$, $x_i = y_i$ for all $i \ne k$), then $|h(x) - h(y)| \le 1$. Then, for a vector of independent random variables $X = (X_1, \ldots, X_n)$ and $a > 0$,
$$P(h(X) - E[h(X)] \ge a) \le e^{-2a^2/n}, \qquad P(h(X) - E[h(X)] \le -a) \le e^{-2a^2/n}.$$

Proof sketch. Let $Z_0 = E[h(X)]$ and $Z_i = E[h(X)|X_1, \ldots, X_i]$, a Doob martingale. The condition on $h$, together with the independence of the $X_i$, implies that each difference $Z_i - Z_{i-1}$ lies in a random interval of length 1 whose left endpoint is in $\sigma(X_1, \ldots, X_{i-1})$, so the Hoeffding-Azuma inequality with $d_i = 1$ gives the result.

Example 3.27 Let $X_1, \ldots, X_n$ be iid discrete random variables with $P(X_i = j) = p_j$, and let $N$ be the number of times the pattern $1, 2, 3, 4$ appears in the sequence, so that
$$E[N] = (n - k + 1)p_1p_2p_3p_4, \quad k = 4.$$
If $h(x_1, \ldots, x_n)$ denotes the number of appearances of the pattern, then changing a single coordinate changes $h$ by at most $k = 4$, so $h/4$ satisfies the condition of Corollary 3.26. Since
$$P(N - E[N] \ge a) = P\big(h(X)/4 - E[h(X)]/4 \ge a/4\big),$$
we obtain, for $a > 0$,
$$P\big(N - (n - 3)p_1p_2p_3p_4 \ge a\big) \le e^{-a^2/(8n)}, \qquad P\big(N - (n - 3)p_1p_2p_3p_4 \le -a\big) \le e^{-a^2/(8n)}.$$

Example 3.28 Suppose that $n$ balls are to be placed in $m$ urns, with each ball independently going into urn $i$ with probability $p_i$, $\sum_{i=1}^m p_i = 1$. Find bounds on the tail probability of $Y_k$, the number of urns that receive exactly $k$ balls.

Solution. First note that
$$E[Y_k] = \sum_{i=1}^m P(\text{urn } i \text{ has exactly } k \text{ balls}) = \sum_{i=1}^m \binom{n}{k}p_i^k(1 - p_i)^{n-k}.$$
Let $X_j$ denote the urn in which ball $j$ is put, $j = 1, \ldots, n$. Also, let $h_k(x_1, \ldots, x_n)$ denote the number of urns that receive exactly $k$ balls when $X_j = x_j, j = 1, \ldots, n$, and note that $Y_k = h_k(X_1, \ldots, X_n)$. When $k = 0$ it is easy to see that $h_0$ satisfies the condition that if $x$ and $y$ differ in at most one coordinate, then $|h_0(x) - h_0(y)| \le 1$. Therefore, from Corollary 3.26 we obtain, for $a > 0$, that
$$P\Big(Y_0 - \sum_{i=1}^m (1 - p_i)^n \ge a\Big) \le e^{-2a^2/n}, \qquad P\Big(Y_0 - \sum_{i=1}^m (1 - p_i)^n \le -a\Big) \le e^{-2a^2/n}.$$
For $0 < k \le n$ it is no longer true that if $x$ and $y$ differ in at most one coordinate, then $|h_k(x) - h_k(y)| \le 1$. This is so because the one different value could result in one of the vectors having one less and the other having one more urn with $k$ balls than would have resulted if that coordinate was not included. Thus, if $x$ and $y$ differ in at most one coordinate, then
$$|h_k(x) - h_k(y)| \le 2,$$
showing that $h_k(x)/2$ satisfies the condition of Corollary 3.26. Because
$$P(Y_k - E[Y_k] \ge a) = P\big(h_k(X)/2 - E[h_k(X)]/2 \ge a/2\big),$$
we obtain, for $a > 0$ and $0 < k \le n$,
$$P\Big(Y_k - \sum_{i=1}^m \binom{n}{k}p_i^k(1 - p_i)^{n-k} \ge a\Big) \le e^{-a^2/(2n)}, \qquad P\Big(Y_k - \sum_{i=1}^m \binom{n}{k}p_i^k(1 - p_i)^{n-k} \le -a\Big) \le e^{-a^2/(2n)}.$$
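The Chernoff bounds of Example 3.25 can be compared with an empirical tail. A sketch of ours for $n = 100$ iid Bernoulli(0.3) variables:

```python
import random
from math import exp

rng = random.Random(2)
n, p, a = 100, 0.3, 10
bound = exp(-2 * a * a / n)     # Hoeffding-Azuma bound exp(-2a^2/n)

trials, exceed = 10**5, 0
for _ in range(trials):
    s = sum(rng.random() < p for _ in range(n))
    exceed += (s - n * p >= a)
print(exceed / trials, bound)   # the empirical tail sits below the bound
```

The bound is conservative (here by roughly a factor of five), which is typical: its strength is that it needs nothing beyond boundedness of the increments.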
3.6 Submartingales, Supermartingales, and a Convergence Theorem

Definition 3.29 $Z_n, n \ge 1$, is said to be a submartingale for the filtration $\mathcal{F}_n$ if
1. $E[|Z_n|] < \infty$;
2. $Z_n$ is adapted to $\mathcal{F}_n$;
3. $E[Z_{n+1}|\mathcal{F}_n] \ge Z_n$.
If 3 is replaced by $E[Z_{n+1}|\mathcal{F}_n] \le Z_n$, then $Z_n, n \ge 1$, is said to be a supermartingale.

It follows from the tower property that
$$E[Z_{n+1}] \ge E[Z_n] \quad \text{for a submartingale},$$
with the inequality reversed for a supermartingale. (Of course, if $Z_n, n \ge 1$, is a submartingale, then $-Z_n, n \ge 1$, is a supermartingale, and vice versa.)

The analogs of the martingale stopping theorem remain valid for submartingales and supermartingales. We leave the proof of the following theorem as an exercise.

Theorem 3.30 If $N$ is a stopping time for the filtration $\mathcal{F}_n$, then
$$E[Z_N] \ge E[Z_1] \text{ for a submartingale}, \qquad E[Z_N] \le E[Z_1] \text{ for a supermartingale},$$
provided that any of the sufficient conditions of Theorem 3.14 hold.

One of the most useful results about submartingales is the Kolmogorov inequality. Before presenting it, we need a couple of lemmas.

Lemma 3.31 If $Z_n, n \ge 1$, is a submartingale for the filtration $\mathcal{F}_n$, and $N$ is a stopping time for this filtration such that $P(N \le n) = 1$, then
$$E[Z_1] \le E[Z_N] \le E[Z_n].$$

Proof. The first inequality follows from Theorem 3.30, since $N$ is bounded. For the second, note that on $\{N = k\}$,
$$E[Z_n|\mathcal{F}_k, N = k] = E[Z_n|\mathcal{F}_k] \ge Z_k = Z_N.$$
Taking expectations of this inequality completes the proof.

Lemma 3.32 If $Z_n, n \ge 1$, is a martingale with respect to the filtration $\mathcal{F}_n, n \ge 1$, and $f$ a convex function for which $E[|f(Z_n)|] < \infty$, then $f(Z_n), n \ge 1$, is a submartingale with respect to the filtration $\mathcal{F}_n, n \ge 1$.

Proof.
$$E[f(Z_{n+1})|\mathcal{F}_n] \ge f(E[Z_{n+1}|\mathcal{F}_n]) = f(Z_n)$$
by Jensen's inequality.

Theorem 3.33 Kolmogorov's Inequality for Submartingales. Suppose $Z_n, n \ge 1$, is a nonnegative submartingale; then for $a > 0$,
$$P(\max\{Z_1, \ldots, Z_n\} > a) \le E[Z_n]/a.$$

Proof. Let $N$ be the smallest $i$, $i \le n$, such that $Z_i > a$, and let it equal $n$ if $Z_i \le a$ for all $i \le n$. Note that $\{\max\{Z_1, \ldots, Z_n\} > a\} = \{Z_N > a\}$. Markov's inequality and Lemma 3.31 now give
$$P(Z_N > a) \le E[Z_N]/a \le E[Z_n]/a.$$

Corollary 3.34 Let $Z_n, n \ge 1$, be a martingale. Then for $a > 0$,
$$P(\max\{|Z_1|, \ldots, |Z_n|\} > a) \le \min\big(E[|Z_n|]/a,\; E[Z_n^2]/a^2\big).$$

Proof. Noting that
$$P(\max\{|Z_1|, \ldots, |Z_n|\} > a) = P(\max\{Z_1^2, \ldots, Z_n^2\} > a^2),$$
the corollary follows from Lemma 3.32 and Kolmogorov's inequality for submartingales, upon using that $f(x) = |x|$ and $f(x) = x^2$ are convex functions.

Theorem 3.35 The Martingale Convergence Theorem. Let $Z_n, n \ge 1$, be a martingale. If there is $M < \infty$ such that
$$E[|Z_n|] \le M \quad \text{for all } n,$$
then, with probability 1, $\lim_{n\to\infty} Z_n$ exists and is finite.

Proof. We give a proof under the stronger assumption that $E[Z_n^2]$ is bounded. By Lemma 3.32, $Z_n^2, n \ge 1$, is a submartingale, yielding that $E[Z_n^2]$ is nondecreasing. Because $E[Z_n^2]$ is bounded, it follows that it converges; let $\mu < \infty$ be given by
$$\mu = \lim_{n\to\infty} E[Z_n^2].$$
We now argue that $\lim_{n\to\infty} Z_n$ exists and is finite by showing that, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence. That is, we will show that, with probability 1,
$$|Z_{m+k} - Z_m| \to 0 \quad \text{as } m, k \to \infty.$$
Using that $Z_{m+k} - Z_m, k \ge 1$, is a martingale, it follows that $(Z_{m+k} - Z_m)^2, k \ge 1$, is a submartingale. Thus, by Kolmogorov's inequality,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) = P\big(\max_{1\le k\le n}(Z_{m+k} - Z_m)^2 > \epsilon^2\big) \le E[(Z_{m+n} - Z_m)^2]/\epsilon^2 = E[Z_{m+n}^2 - 2Z_{m+n}Z_m + Z_m^2]/\epsilon^2.$$
However,
$$E[Z_mZ_{n+m}] = E\big[E[Z_mZ_{n+m}|Z_m]\big] = E\big[Z_mE[Z_{n+m}|Z_m]\big] = E[Z_m^2].$$
Therefore,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) \le \big(E[Z_{m+n}^2] - E[Z_m^2]\big)/\epsilon^2.$$
Letting $n \to \infty$ gives
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \le (\mu - E[Z_m^2])/\epsilon^2.$$
Thus,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \to 0 \quad \text{as } m \to \infty.$$
Therefore, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence, and so has a finite limit.

As a consequence of the martingale convergence theorem we obtain the following.

Corollary 3.36 If $Z_n, n \ge 1$, is a nonnegative martingale then, with probability 1, $\lim_{n\to\infty} Z_n$ exists and is finite.

Proof. Because $Z_n$ is nonnegative,
$$E[|Z_n|] = E[Z_n] = E[Z_1] < \infty.$$
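Corollary 3.36 is easy to see in action. A sketch of ours: the product of iid factors taking values 0.5 and 1.5 with equal probability has mean 1 at each step, so the running product is a nonnegative martingale; its paths converge (here, to 0, since the logarithm has negative drift):

```python
import random

rng = random.Random(8)

def product_martingale(n):
    # Z_n = product of iid mean-1 factors: a nonnegative martingale,
    # so by Corollary 3.36 each path converges almost surely
    z = 1.0
    for _ in range(n):
        z *= rng.choice((0.5, 1.5))
    return z

sample = [product_martingale(200) for _ in range(10)]
print([round(z, 6) for z in sample])   # paths have settled, most very near 0
```

Note the contrast with the mean: $E[Z_n] = 1$ for every $n$, even though almost every path tends to 0.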
Example 3.37 A branching process follows the size of a population over succeeding generations. It supposes that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j, j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n+1$. Let $X_n$ denote the number of individuals in generation $n$. Assuming that $m = \sum_j j p_j$, the mean number of offspring of an individual, is finite, it is easy to verify that $Z_n = X_n/m^n, n \ge 0$, is a martingale. Because it is nonnegative, the preceding corollary implies that $\lim_n X_n/m^n$ exists and is finite. But this implies, when $m < 1$, that $\lim_n X_n = 0$; or, equivalently, that $X_n = 0$ for all $n$ sufficiently large. When $m > 1$, the implication is that the generation size either becomes 0 or converges to infinity at an exponential rate. ∎

3.7 Exercises

1. For $\mathcal{F} = \{\emptyset, \Omega\}$, show that $E[X \mid \mathcal{F}] = E[X]$.

2. Give the proof of Proposition 3.2 when $X$ and $Y$ are jointly continuous.

3. If $E[|X_i|] < \infty$, $i = 1,\ldots,n$, show that

$$E\Big[\sum_{i=1}^n X_i \,\Big|\, \mathcal{F}\Big] = \sum_{i=1}^n E[X_i \mid \mathcal{F}]$$

4. Prove that if $f$ is a convex function, then

$$E[f(X) \mid \mathcal{F}] \ge f(E[X \mid \mathcal{F}])$$

provided the expectations exist.

5. Let $X_1, X_2,\ldots$ be independent random variables with mean 1. Show that $Z_n = \prod_{i=1}^n X_i$, $n \ge 1$, is a martingale.

6. If $E[X_{n+1} \mid X_1,\ldots,X_n] = a_n X_n + b_n$ for constants $a_n, b_n, n \ge 0$, find constants $A_n, B_n$ so that $Z_n = A_n X_n + B_n, n \ge 0$, is a martingale with respect to the filtration $\sigma(X_0,\ldots,X_n)$.

7. Consider a population of individuals as it evolves over time, and suppose that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j, j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n+1$. Assume that $m = \sum_j j p_j < \infty$. Let $X_n$ denote the number of individuals in generation $n$, and define a martingale related to $X_n, n \ge 0$. The process $X_n, n \ge 0$, is called a branching process.

8. Suppose $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean zero and finite variance $\sigma^2$. If $T$ is a stopping time with finite mean, show that

$$\mathrm{Var}\Big(\sum_{i=1}^T X_i\Big) = \sigma^2 E[T]$$

9. Suppose $X_1, X_2, \ldots$ are independent and identically distributed mean 0 random variables which each take value $+1$ with probability 1/2 and take value $-1$ with probability 1/2. Let $S_n = \sum_{i=1}^n X_i$. Which of the following are stopping times? Compute the mean for the ones that are stopping times.
(a) $T_1 = \min\{i \ge 5 : S_i = S_{i-5} + 5\}$
(b) $T_2 = T_1 - 5$
(c) $T_3 = T_1 + 10$

10. Consider a sequence of independent flips of a coin, and let $P_h$ denote the probability of a head on any toss. Let $A$ be the hypothesis that $P_h = a$ and let $B$ be the hypothesis that $P_h = b$, for given values $0 < a, b < 1$. Let $X_i$ be the outcome of flip $i$, and set

$$Z_n = \frac{P(X_1,\ldots,X_n \mid A)}{P(X_1,\ldots,X_n \mid B)}$$

If $P_h = b$, show that $Z_n, n \ge 1$, is a martingale having mean 1.

11. Let $Z_n, n \ge 0$, be a martingale with $Z_0 = 0$. Show that

$$E[Z_n^2] = \sum_{i=1}^n E[(Z_i - Z_{i-1})^2]$$

12. Consider an individual who at each stage, independently of past movements, moves to the right with probability $p$ or to the left with probability $1-p$. Assuming that $p > 1/2$, find the expected number of stages it takes the person to move $i$ positions to the right from where she started.

13. In Example 3.19 obtain bounds on $p$ when $\theta < 0$.

14. Use Wald's equation to approximate the expected time it takes a random walk to either become as large as $a$ or as small as $-b$, for positive $a$ and $b$. Give the exact expression if $a$ and $b$ are integers, and at each stage the random walk either moves up 1 with probability $p$ or moves down 1 with probability $1-p$.

15. Consider a branching process that starts with a single individual. Let $\pi$ denote the probability this process eventually dies out. With $X_n$ denoting the number of individuals in generation $n$, argue that $\pi^{X_n}$, $n \ge 0$, is a martingale.
16. Given $X_1, X_2,\ldots$, let $S_n = \sum_{i=1}^n X_i$ and $\mathcal{F}_n = \sigma(X_1,\ldots,X_n)$. Suppose for all $n$, $E[|S_n|] < \infty$ and $E[S_{n+1} \mid \mathcal{F}_n] = S_n$. Show $E[X_i X_j] = 0$ if $i \ne j$.

17. Suppose $n$ random points are chosen in a circle having diameter equal to 1, and let $X$ be the length of the shortest path connecting all of them. For $a > 0$, bound $P(X - E[X] > a)$.

18. Let $X_1, X_2,\ldots,X_n$ be independent and identically distributed discrete random variables, with $P(X_i = j) = p_j$. Obtain bounds on the tail probability of the number of times the pattern $0,0,0,0$ appears in the sequence.

19. Repeat Example 3.27, but now assuming that the $X_i$ are independent but not identically distributed. Let $P_{i,j} = P(X_i = j)$.

20. Let $Z_n, n \ge 0$, be a martingale with mean $Z_0 = 0$, and let $v_j, j \ge 0$, be a sequence of nondecreasing constants with $v_0 = 0$. Prove the Kolmogorov-Hajek-Renyi inequality:

$$P(|Z_j| \le v_j \text{ for all } j = 1,\ldots,n) \ge 1 - \sum_{j=1}^n E[(Z_j - Z_{j-1})^2]/v_j^2$$

21. Consider a gambler who plays at a fair casino. Suppose that the casino does not give any credit, so that the gambler must quit when his fortune is 0. Suppose further that on each bet made at least 1 is either won or lost. Argue that, with probability 1, a gambler who wants to play forever will eventually go broke.

22. What is the implication of the martingale convergence theorem to the scenario of Exercise 10?

23. Three gamblers each start with $a$, $b$, and $c$ chips respectively. In each round of a game a gambler is selected uniformly at random to give up a chip, and one of the other gamblers is selected uniformly at random to receive that chip. The game ends when there are only two players remaining with chips. Let $X_n, Y_n$, and $Z_n$ respectively denote the number of chips the three players have after round $n$, so $(X_0, Y_0, Z_0) = (a, b, c)$.
(a) Compute $E[X_{n+1}Y_{n+1}Z_{n+1} \mid (X_n, Y_n, Z_n) = (x, y, z)]$.
(b) Show that $M_n = X_nY_nZ_n + n(a+b+c)/3$ is a martingale.
(c) Use the preceding to compute the expected length of the game.

Chapter 4 Bounding Probabilities and Expectations

4.1 Introduction

In this chapter we develop some approaches for bounding expectations and probabilities. We start in Section 2 with Jensen's inequality, which bounds the expected value of a convex function of a random variable. In Section 3 we develop the importance sampling identity and show how it can be used to yield bounds on tail probabilities. A specialization of this method results in the Chernoff bound, which is developed in Section 4. Section 5 deals with the second moment and conditional expectation inequalities, which bound the probability that a random variable is positive. Section 6 develops the min-max identity and uses it to obtain bounds on the maximum of a set of random variables. Finally, in Section 7 we introduce some general stochastic order relations and explore their consequences.

4.2 Jensen's Inequality

Jensen's inequality yields a lower bound on the expected value of a convex function of a random variable.

Proposition 4.1 If $f$ is a convex function, then

$$E[f(X)] \ge f(E[X])$$

provided the expectations exist.

Proof. We give a proof under the assumption that $f$ has a Taylor series expansion. Expanding $f$ about the value $\mu = E[X]$, and using the Taylor series expansion with a remainder term, yields that for some $a$

$$f(x) = f(\mu) + f'(\mu)(x - \mu) + f''(a)(x - \mu)^2/2 \ge f(\mu) + f'(\mu)(x - \mu)$$

where the preceding used that $f''(a) \ge 0$ by convexity. Hence,

$$f(X) \ge f(\mu) + f'(\mu)(X - \mu)$$

Taking expectations yields the result. ∎
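A minimal numerical sketch (not from the text) of Proposition 4.1, using the convex function $f(x) = e^x$ and $X$ standard normal, for which $E[e^X] = e^{1/2}$ while $e^{E[X]} = 1$.

```python
import numpy as np

# Jensen's inequality E[f(X)] >= f(E[X]) for f(x) = exp(x), X ~ N(0,1).
rng = np.random.default_rng(2)
x = rng.standard_normal(10**6)
print("E[f(X)] ~", np.exp(x).mean())   # about e^{1/2} = 1.649
print("f(E[X]) ~", np.exp(x.mean()))   # about 1.0
```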
Remark 4.2 If $P(X = x_1) = \lambda = 1 - P(X = x_2)$, then Jensen's inequality implies that for a convex function $f$

$$\lambda f(x_1) + (1-\lambda)f(x_2) \ge f(\lambda x_1 + (1-\lambda)x_2),$$

which is the defining inequality of a convex function. Thus, Jensen's inequality can be thought of as extending the defining inequality of convexity from random variables that take on only two possible values to arbitrary random variables.

4.3 Probability Bounds via the Importance Sampling Identity

Let $f$ and $g$ be probability density (or probability mass) functions; let $h$ be an arbitrary function, and suppose that $g(x) = 0$ implies that $f(x)h(x) = 0$. The following is known as the importance sampling identity.

Proposition 4.3 The Importance Sampling Identity

$$E_f[h(X)] = E_g\Big[\frac{h(X)f(X)}{g(X)}\Big]$$

where the subscript on the expectation indicates the density (or mass function) of the random variable $X$.

Proof. We give the proof when $f$ and $g$ are density functions:

$$E_g\Big[\frac{h(X)f(X)}{g(X)}\Big] = \int \frac{h(x)f(x)}{g(x)}\,g(x)\,dx = \int h(x)f(x)\,dx = E_f[h(X)]$$ ∎

The importance sampling identity yields the following useful corollary concerning the tail probability of a random variable.

Corollary 4.4

$$P_f(X > c) = E_g\Big[\frac{f(X)}{g(X)} \,\Big|\, X > c\Big]\,P_g(X > c)$$

Proof.

$$P_f(X > c) = E_f[I_{X>c}] = E_g\Big[I_{X>c}\frac{f(X)}{g(X)}\Big] = E_g\Big[\frac{f(X)}{g(X)} \,\Big|\, X > c\Big]\,P_g(X > c)$$ ∎

Example 4.5 Bounding Standard Normal Tail Probabilities Let $f$ be the standard normal density function

$$f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}, \quad -\infty < x < \infty$$

For $c > 0$, consider $P_f(X > c)$, the probability that a standard normal random variable exceeds $c$. With

$$g(x) = ce^{-cx}, \quad x > 0$$

we obtain from Corollary 4.4

$$P_f(X > c) = E_g\Big[\frac{e^{cX - X^2/2}}{c\sqrt{2\pi}} \,\Big|\, X > c\Big]\,e^{-c^2} = \frac{e^{-c^2}}{c\sqrt{2\pi}}\,E_g\big[e^{c(X+c) - (X+c)^2/2}\big]$$

where the first equality used that $P_g(X > c) = e^{-c^2}$, and the second the lack of memory property of exponential random variables to conclude that the conditional distribution of an exponential random variable $X$ given that it exceeds $c$ is the unconditional distribution of $X + c$. Because $c(x+c) - (x+c)^2/2 = c^2/2 - x^2/2$, the preceding yields

$$P_f(X > c) = \frac{e^{-c^2/2}}{c\sqrt{2\pi}}\,E_g\big[e^{-X^2/2}\big] \tag{4.1}$$

Noting that, for $x > 0$, $1 - x^2/2 < e^{-x^2/2} < 1 - x^2/2 + x^4/8$, and that $E_g[X^2] = 2/c^2$ and $E_g[X^4] = 24/c^4$, we obtain

$$\frac{e^{-c^2/2}}{c\sqrt{2\pi}}\Big(1 - \frac{1}{c^2}\Big) < P_f(X > c) < \frac{e^{-c^2/2}}{c\sqrt{2\pi}}\Big(1 - \frac{1}{c^2} + \frac{3}{c^4}\Big) \tag{4.2}$$ ∎
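A minimal sketch (not from the text): checking the sandwich (4.2) against the exact normal tail, computed from the standard library as $0.5\,\mathrm{erfc}(c/\sqrt 2)$.

```python
import math

# The two-sided bound (4.2) on P(X > c) for a standard normal X.
for c in (1.5, 2.0, 3.0):
    base = math.exp(-c * c / 2) / (c * math.sqrt(2 * math.pi))
    lower = base * (1 - 1 / c**2)
    upper = base * (1 - 1 / c**2 + 3 / c**4)
    exact = 0.5 * math.erfc(c / math.sqrt(2))
    print(f"c={c}: {lower:.6f} <= {exact:.6f} <= {upper:.6f}")
```

Already at $c = 2$ the two bounds bracket the true tail to within about ten percent.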
Our next example uses the importance sampling identity to bound the probability that successive sums of a sequence of independent and identically distributed normal random variables with a negative mean ever cross some specified positive number.

Example 4.6 Let $X_1, X_2,\ldots$ be a sequence of independent and identically distributed normal random variables with mean $\mu < 0$ and variance 1. Let $S_k = \sum_{i=1}^k X_i$ and, for a fixed $A > 0$, consider

$$p = P(S_k > A \text{ for some } k)$$

Let $f_k(\mathbf{x}_k) = f_k(x_1,\ldots,x_k)$ be the joint density function of $\mathbf{X}_k = (X_1,\ldots,X_k)$. That is,

$$f_k(\mathbf{x}_k) = (2\pi)^{-k/2}e^{-\sum_{i=1}^k (x_i - \mu)^2/2}$$

Also, let $g_k$ be the joint density of $k$ independent and identically distributed normal random variables with mean $-\mu$ and variance 1. That is,

$$g_k(\mathbf{x}_k) = (2\pi)^{-k/2}e^{-\sum_{i=1}^k (x_i + \mu)^2/2}$$

Note that

$$\frac{f_k(\mathbf{x}_k)}{g_k(\mathbf{x}_k)} = e^{2\mu\sum_{i=1}^k x_i}$$

With

$$R_k = \Big\{(x_1,\ldots,x_k): \sum_{i=1}^j x_i \le A \text{ for all } j < k,\ \sum_{i=1}^k x_i > A\Big\}$$

we have

$$p = \sum_{k=1}^\infty P_f(\mathbf{X}_k \in R_k) = \sum_{k=1}^\infty E_f[I_{\{\mathbf{X}_k \in R_k\}}] = \sum_{k=1}^\infty E_g\Big[I_{\{\mathbf{X}_k \in R_k\}}\frac{f_k(\mathbf{X}_k)}{g_k(\mathbf{X}_k)}\Big] = \sum_{k=1}^\infty E_g\big[I_{\{\mathbf{X}_k \in R_k\}}e^{2\mu S_k}\big]$$

Now, if $\mathbf{X}_k \in R_k$ then $S_k > A$, implying, because $\mu < 0$, that $e^{2\mu S_k} < e^{2\mu A}$. Because this implies that

$$I_{\{\mathbf{X}_k \in R_k\}}e^{2\mu S_k} \le I_{\{\mathbf{X}_k \in R_k\}}e^{2\mu A}$$

we obtain from the preceding that

$$p \le e^{2\mu A}\sum_{k=1}^\infty E_g[I_{\{\mathbf{X}_k \in R_k\}}]$$

Now, if $Y_i, i \ge 1$, is a sequence of independent normal random variables with mean $-\mu$ and variance 1, then

$$E_g[I_{\{\mathbf{X}_k \in R_k\}}] = P\Big(\sum_{i=1}^j Y_i \le A,\ j < k,\ \sum_{i=1}^k Y_i > A\Big)$$

Therefore,

$$p \le e^{2\mu A}\sum_{k=1}^\infty P\Big(\sum_{i=1}^j Y_i \le A,\ j < k,\ \sum_{i=1}^k Y_i > A\Big) = e^{2\mu A}\,P\Big(\sum_{i=1}^k Y_i > A \text{ for some } k\Big) = e^{2\mu A}$$

where the final equality follows from the strong law of large numbers: because $\lim_{n\to\infty}\sum_{i=1}^n Y_i/n = -\mu > 0$ implies $P(\lim_{n\to\infty}\sum_{i=1}^n Y_i = \infty) = 1$, we have $P(\sum_{i=1}^k Y_i > A \text{ for some } k) = 1$.

The bound

$$p = P(S_k > A \text{ for some } k) \le e^{2\mu A}$$

is not very useful when $A$ is a small nonnegative number. In this case, we should condition on $X_1$, and then apply the preceding inequality. With $\Phi$ being the standard normal distribution function, this yields

$$p = \int_{-\infty}^\infty P(S_k > A \text{ for some } k \mid X_1 = x)\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx \\ = \int_{-\infty}^A P(S_k > A \text{ for some } k \mid X_1 = x)\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx + P(X_1 > A) \\ \le \frac{1}{\sqrt{2\pi}}\int_{-\infty}^A e^{2\mu(A-x)}e^{-(x-\mu)^2/2}\,dx + 1 - \Phi(A - \mu) \\ = e^{2\mu A}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^A e^{-(x+\mu)^2/2}\,dx + 1 - \Phi(A - \mu) \\ = e^{2\mu A}\,\Phi(A + \mu) + 1 - \Phi(A - \mu)$$

where we used that $-2\mu x - (x-\mu)^2/2 = -(x+\mu)^2/2$. Thus, for instance,

$$P(S_k > 0 \text{ for some } k) \le \Phi(\mu) + 1 - \Phi(-\mu) = 2\Phi(\mu)$$ ∎
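A minimal Monte Carlo sketch (not from the text) of Example 4.6: estimating $p = P(S_k > A \text{ for some } k)$ with each path truncated at an arbitrary horizon (late crossings are rare for a strongly negative drift), against the bound $e^{2\mu A}$.

```python
import math
import numpy as np

# Estimate p = P(S_k > A for some k) for iid N(mu, 1) summands, mu < 0.
rng = np.random.default_rng(3)
mu, A, n_max, trials = -0.5, 1.0, 1000, 10_000   # arbitrary parameters
crossed = 0
for _ in range(trials):
    s = np.cumsum(rng.normal(mu, 1.0, n_max))    # partial sums S_1..S_n_max
    crossed += (s.max() > A)
print("Monte Carlo estimate of p:", crossed / trials)
print("bound exp(2*mu*A)        :", math.exp(2 * mu * A))
```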
4.4 Chernoff Bounds

Suppose that $X$ has probability density (or probability mass) function $f(x)$. For $t > 0$, let

$$g(x) = \frac{e^{tx}f(x)}{M(t)}$$

where $M(t) = E_f[e^{tX}]$ is the moment generating function of $X$. Corollary 4.4 yields, for $c > 0$, that

$$P_f(X > c) = E_g[M(t)e^{-tX} \mid X > c]\,P_g(X > c) \le E_g[M(t)e^{-tX} \mid X > c] \le M(t)e^{-tc}$$

Because the preceding holds for all $t > 0$, we can conclude that

$$P_f(X > c) \le \inf_{t > 0} M(t)e^{-tc} \tag{4.3}$$

The inequality (4.3) is called the Chernoff bound.

Rather than choosing the value of $t$ so as to obtain the best bound, it is often convenient to work with bounds that are more analytically tractable. The following inequality can be used to simplify the Chernoff bound for a sum of independent Bernoulli random variables.

Lemma 4.7 For $0 \le p \le 1$,

$$pe^{t(1-p)} + (1-p)e^{-tp} \le e^{t^2/8}$$

Theorem 4.8 Let $X = \sum_{i=1}^n X_i$, where $X_1,\ldots,X_n$ are independent Bernoulli random variables. Then, for $c > 0$,

$$P(X - E[X] \ge c) \le e^{-2c^2/n} \tag{4.4}$$
$$P(X - E[X] \le -c) \le e^{-2c^2/n} \tag{4.5}$$

Proof. For $c > 0$, $t > 0$,

$$P(X - E[X] \ge c) = P(e^{t(X - E[X])} \ge e^{tc}) \\ \le e^{-tc}E[e^{t(X - E[X])}] \quad \text{by the Markov inequality} \\ = e^{-tc}E\Big[\exp\Big(t\sum_{i=1}^n (X_i - E[X_i])\Big)\Big] \\ = e^{-tc}\prod_{i=1}^n E[e^{t(X_i - E[X_i])}]$$

However, if $Y$ is Bernoulli with parameter $p$, then

$$E[e^{t(Y-p)}] = pe^{t(1-p)} + (1-p)e^{-tp} \le e^{t^2/8}$$

where the inequality follows from Lemma 4.7. Therefore,

$$P(X - E[X] \ge c) \le e^{-tc}e^{nt^2/8}$$

Letting $t = 4c/n$ yields the inequality (4.4). The proof of the inequality (4.5) is obtained by writing it as $P(E[X] - X \ge c) \le e^{-2c^2/n}$ and applying the preceding argument to the independent Bernoulli random variables $1 - X_1,\ldots,1 - X_n$. ∎

Our next result concerns a model of $n + m$ cells, $n$ of them target cells, each having weight 1, and $m$ of them normal cells, each having weight $v > 0$. The cells are killed one at a time, with each surviving cell being the next one killed with probability equal to its weight divided by the sum of the weights of all the surviving cells. For a fixed $0 < a < 1$, let $N_a$ denote the number of normal cells that are still alive when the number of surviving target cells first falls below $na$.

Theorem 4.10 For any $\epsilon > 0$, as $n \to \infty$ and $m \to \infty$,

$$P\big((1-\epsilon)ma^v < N_a < (1+\epsilon)ma^v\big) \to 1$$

Proof. To prove the result, it is convenient to first formulate an equivalent continuous time model that results in the times at which the $n+m$ cells are killed being independent random variables. To do so, let $X_1,\ldots,X_{n+m}$ be independent exponential random variables, with $X_i$ having rate $w_i$, the weight of cell $i$, $i = 1,\ldots,n+m$. Note that $X_i$ will be the smallest of these exponentials with probability $w_i/\sum_j w_j$; further, given that $X_i$ is the smallest, $X_r, r \ne i$, will be the second smallest with probability $w_r/\sum_{j \ne i} w_j$; further, given that $X_i$ and $X_r$ are, respectively, the first and second smallest, $X_s, s \ne i, r$, will be the next smallest with probability $w_s/\sum_{j \ne i,r} w_j$; and so on. Consequently, if we imagine that cell $i$ is killed at time $X_i$, then the order in which the $n+m$ cells are killed has the same distribution as the order in which they are killed in the original model. So let us suppose that cell $i$ is killed at time $X_i, i \ge 1$. Now let $\tau_a$ denote the time at which the number of surviving target cells first falls below $na$. Also, let $N(t)$ denote the number of normal cells that are still alive at time $t$; so $N_a = N(\tau_a)$. We will first show that

$$P(N(\tau_a) < (1+\epsilon)ma^v) \to 1 \quad \text{as } n, m \to \infty \tag{4.6}$$

To prove the preceding, let $\epsilon^*$ be such that $0 < \epsilon^* < \epsilon$, and set $t = -\ln\big(a(1+\epsilon^*)^{1/v}\big)$. We will prove (4.6) by showing that, as $n$ and $m$ approach $\infty$,

(i) $P(\tau_a > t) \to 1$, and
(ii) $P(N(t) < (1+\epsilon)ma^v) \to 1$.

Because the events $\tau_a > t$ and $N(t) < (1+\epsilon)ma^v$ together imply that $N(\tau_a) \le N(t) < (1+\epsilon)ma^v$, the result (4.6) will be established.

To prove (i), note that the number, call it $Y$, of surviving target cells at time $t$ is a binomial random variable with parameters $n$ and $e^{-t} = a(1+\epsilon^*)^{1/v}$. Hence, with $b = na[(1+\epsilon^*)^{1/v} - 1]$, we have

$$P(\tau_a \le t) = P(Y < na) = P(Y < ne^{-t} - b) \le e^{-2b^2/n}$$

where the inequality follows from the Chernoff bound (4.5). This proves (i), because $b^2/n \to \infty$ as $n \to \infty$.

To prove (ii), note that $N(t)$ is a binomial random variable with parameters $m$ and $e^{-vt} = a^v(1+\epsilon^*)$. Thus, by letting $b' = ma^v(\epsilon - \epsilon^*)$ and again applying the Chernoff bound (4.4), we obtain

$$P(N(t) > (1+\epsilon)ma^v) = P(N(t) > me^{-vt} + b') \le e^{-2b'^2/m}$$

This proves (ii), because $b'^2/m \to \infty$ as $m \to \infty$. Thus (4.6) is established. It remains to prove that

$$P(N(\tau_a) > (1-\epsilon)ma^v) \to 1 \quad \text{as } n, m \to \infty \tag{4.7}$$

However, (4.7) can be proven in a similar manner as was (4.6); a combination of these two results completes the proof of the theorem. ∎
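A minimal sketch (not from the text) of the Bernoulli-sum Chernoff bound (4.4), which the preceding proof invoked twice, checked for a binomial random variable; the parameters are arbitrary.

```python
import math
import numpy as np

# P(X - E[X] >= c) <= exp(-2c^2/n) for X ~ Binomial(n, p), a sum of
# n independent Bernoulli(p) variables.
rng = np.random.default_rng(4)
n, p, c = 200, 0.3, 12
x = rng.binomial(n, p, size=100_000)
print("empirical P(X - np >= c):", np.mean(x - n * p >= c))
print("bound exp(-2c^2/n)      :", math.exp(-2 * c * c / n))
```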
4.5 Second Moment and Conditional Expectation Inequalities

The second moment inequality gives a lower bound on the probability that a nonnegative random variable is positive.

Proposition 4.11 The Second Moment Inequality For a nonnegative random variable $X$,

$$P(X > 0) \ge \frac{(E[X])^2}{E[X^2]}$$

Proof. Using Jensen's inequality in the second line below, we have

$$E[X^2] = E[X^2 \mid X > 0]P(X > 0) \ge (E[X \mid X > 0])^2 P(X > 0) = \frac{(E[X])^2}{P(X > 0)}$$ ∎

When $X$ is the sum of Bernoulli random variables, we can improve the bound of the second moment inequality. So, suppose for the remainder of this section that

$$X = \sum_{i=1}^n X_i$$

where $X_i$ is Bernoulli with $E[X_i] = p_i$, $i = 1,\ldots,n$. We will need the following lemma.

Lemma 4.12 For any random variable $R$,

$$E[XR] = \sum_{i=1}^n p_i E[R \mid X_i = 1]$$

Proof.

$$E[XR] = E\Big[\sum_{i=1}^n X_iR\Big] = \sum_{i=1}^n E[X_iR] \\ = \sum_{i=1}^n \big(E[X_iR \mid X_i = 1]p_i + E[X_iR \mid X_i = 0](1-p_i)\big) \\ = \sum_{i=1}^n p_i E[R \mid X_i = 1]$$ ∎

Proposition 4.13 The Conditional Expectation Inequality

$$P(X > 0) \ge \sum_{i=1}^n \frac{p_i}{E[X \mid X_i = 1]}$$

Proof. Let $R = \frac{1}{X}I_{\{X > 0\}}$. Noting that

$$E[XR] = P(X > 0), \qquad E[R \mid X_i = 1] = E[1/X \mid X_i = 1]$$

we obtain, from Lemma 4.12,

$$P(X > 0) = \sum_{i=1}^n p_i E[1/X \mid X_i = 1] \ge \sum_{i=1}^n \frac{p_i}{E[X \mid X_i = 1]}$$

where the final inequality made use of Jensen's inequality. ∎

Example 4.14 Consider a system consisting of $m$ components, each of which either works or not. Suppose, further, that for given subsets of components $S_j, j = 1,\ldots,n$, none of which is a subset of another, the system functions if all of the components of at least one of these subsets work. If component $j$ independently works with probability $\alpha_j$, derive a lower bound on the probability the system functions.

Solution. Let $X_i$ equal 1 if all the components in $S_i$ work, and let it equal 0 otherwise, $i = 1,\ldots,n$. Also, let

$$p_i = P(X_i = 1) = \prod_{j \in S_i} \alpha_j$$

Then, with $X = \sum_{i=1}^n X_i$, we have

$$P(\text{system functions}) = P(X > 0) \ge \sum_{i=1}^n \frac{p_i}{E[X \mid X_i = 1]} = \sum_{i=1}^n \frac{p_i}{1 + \sum_{j \ne i} \prod_{k \in S_j - S_i} \alpha_k}$$

where $S_j - S_i$ consists of all components that are in $S_j$ but not in $S_i$. ∎

Example 4.15 Consider a random graph on the set of vertices $\{1,2,\ldots,n\}$, which is such that each of the $\binom{n}{2}$ pairs of vertices $i \ne j$ is, independently, an edge of the graph with probability $p$. We are interested in the probability that this graph will be connected, where by connected we mean that for each pair of distinct vertices $i \ne j$ there is a sequence of edges of the form $(i,i_1), (i_1,i_2),\ldots,(i_k,j)$. (That is, a graph is connected if for each pair of distinct vertices $i$ and $j$, there is a path from $i$ to $j$.) Suppose that

$$p = \frac{c\ln(n)}{n}$$

We will now show that if $c < 1$, then the probability that the graph is connected goes to 0 as $n \to \infty$. To verify this result, consider the number of isolated vertices, where vertex $i$ is said to be isolated if there are no edges of type $(i,j)$. Let $X_i$ be the indicator variable for the event that vertex $i$ is isolated, and let

$$X = \sum_{i=1}^n X_i$$

be the number of isolated vertices. Now,

$$P(X_i = 1) = (1-p)^{n-1}$$

Also,

$$E[X \mid X_i = 1] = \sum_j P(X_j = 1 \mid X_i = 1) = 1 + \sum_{j \ne i} (1-p)^{n-2} = 1 + (n-1)(1-p)^{n-2}$$

Because

$$(1-p)^n = \Big(1 - \frac{c\ln(n)}{n}\Big)^n \approx e^{-c\ln(n)} = n^{-c}$$

the conditional expectation inequality yields that for $n$ large

$$P(X > 0) \gtrsim \frac{n(1-p)^{n-1}}{1 + (n-1)(1-p)^{n-2}} \approx \frac{n^{1-c}}{1 + n^{1-c}}$$

Therefore, $c < 1 \Rightarrow P(X > 0) \to 1$ as $n \to \infty$. Because the graph is not connected if $X > 0$, it follows that the graph will almost certainly be disconnected when $n$ is large and $c < 1$. (It can be shown when $c > 1$ that the probability the graph is connected goes to 1 as $n \to \infty$.) ∎
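A minimal simulation sketch (not from the text) of Example 4.15: with $p = c\ln(n)/n$ and $c < 1$, random graphs almost always contain an isolated vertex, and so are disconnected. The parameters are arbitrary.

```python
import numpy as np

# Fraction of random graphs G(n, p), p = c*ln(n)/n, having an isolated vertex.
rng = np.random.default_rng(5)
n, c, trials = 400, 0.5, 100
p = c * np.log(n) / n
with_isolated = 0
for _ in range(trials):
    edges = np.triu(rng.random((n, n)) < p, 1)   # undirected edges, i < j
    degree = edges.sum(axis=0) + edges.sum(axis=1)
    with_isolated += bool((degree == 0).any())
print("fraction of graphs having an isolated vertex:", with_isolated / trials)
```

For these parameters the expected number of isolated vertices is about $n^{1-c} = 20$, so essentially every sampled graph is disconnected.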
4.6 The Min-Max Identity and Bounds on the Maximum

In this section we will be interested in obtaining an upper bound on $E[\max_i X_i]$, when $X_1,\ldots,X_n$ are nonnegative random variables. To begin, note that for any nonnegative constant $c$,

$$\max_i X_i \le c + \sum_{i=1}^n (X_i - c)^+ \tag{4.8}$$

where $x^+$, the positive part of $x$, is equal to $x$ if $x > 0$ and is equal to 0 otherwise. Taking expectations of the preceding inequality yields that

$$E[\max_i X_i] \le c + \sum_{i=1}^n E[(X_i - c)^+]$$

Because $(X_i - c)^+$ is a nonnegative random variable, we have

$$E[(X_i - c)^+] = \int_0^\infty P((X_i - c)^+ > x)\,dx = \int_0^\infty P(X_i - c > x)\,dx = \int_c^\infty P(X_i > y)\,dy$$

Therefore,

$$E[\max_i X_i] \le c + \sum_{i=1}^n \int_c^\infty P(X_i > y)\,dy \tag{4.9}$$

Because the preceding is true for all $c \ge 0$, the best bound is obtained by choosing the $c$ that minimizes the right side of the preceding. Differentiating, and setting the result equal to 0, shows that the best bound is obtained when $c$ is the value $c^*$ such that

$$\sum_{i=1}^n P(X_i > c^*) = 1$$

Because $\sum_{i=1}^n P(X_i > c)$ is a decreasing function of $c$, the value of $c^*$ can be easily approximated and then utilized in the inequality (4.9). It is interesting to note that $c^*$ is such that the expected number of the $X_i$ that exceed $c^*$ is equal to 1, which is interesting because the inequality (4.8) becomes an equality when exactly one of the $X_i$ exceeds $c$.

Example 4.16 Suppose the $X_i$ are independent exponential random variables with rates $\lambda_i$, $i = 1,\ldots,n$. Then the minimizing value $c^*$ is such that

$$\sum_{i=1}^n e^{-\lambda_i c^*} = 1$$

with resulting bound

$$E[\max_i X_i] \le c^* + \sum_{i=1}^n \frac{1}{\lambda_i}e^{-\lambda_i c^*}$$

In the special case where the rates are all equal, say $\lambda_i = 1$, then $1 = ne^{-c^*}$, or $c^* = \ln(n)$, and the bound becomes

$$E[\max_i X_i] \le \ln(n) + 1 \tag{4.10}$$

However, it is easy to compute the expected maximum of a sequence of independent exponentials with rate 1. Interpreting these random variables as the failure times of $n$ components, we can write

$$\max_i X_i = \sum_{i=1}^n T_i$$

where $T_i$ is the time between the $(i-1)$st and the $i$th failure. Using the lack of memory property of exponentials, it follows that the $T_i$ are independent, with $T_i$ being an exponential random variable with rate $n-i+1$. (This is because when the $(i-1)$st failure occurs, the time until the next failure is the minimum of the $n-i+1$ remaining lifetimes, each of which is exponential with rate 1.) Therefore, in this case

$$E[\max_i X_i] = \sum_{i=1}^n \frac{1}{n-i+1} = \sum_{i=1}^n \frac{1}{i}$$

As it is known that, for $n$ large,

$$\sum_{i=1}^n \frac{1}{i} \approx \ln(n) + E$$

where $E \approx .5772$ is Euler's constant, we see that the bound yielded by the approach can be quite accurate. (Also, the bound (4.10) only requires that the $X_i$ are exponential with rate 1, and not that they are independent.) ∎

The preceding bounds on $E[\max_i X_i]$ only involve the marginal distributions of the $X_i$. When we have additional knowledge about the joint distributions, we can often do better. To illustrate this, we first need to establish an identity relating the maximum of a set of random variables to the minimums of all the partial sets.

For nonnegative random variables $X_1,\ldots,X_n$, fix $x$ and let $A_i$ denote the event that $X_i > x$. Let

$$X = \max(X_1,\ldots,X_n)$$

Noting that $X$ will be greater than $x$ if and only if at least one of the events $A_i$ occurs, we have

$$P(X > x) = P(\cup_i A_i)$$

and the inclusion-exclusion identity gives

$$P(X > x) = \sum_i P(A_i) - \sum_{i<j} P(A_iA_j) + \sum_{i<j<k} P(A_iA_jA_k) - \cdots$$

Now,

$$P(A_iA_j) = P(X_i > x, X_j > x) = P(\min(X_i,X_j) > x)$$
$$P(A_iA_jA_k) = P(X_i > x, X_j > x, X_k > x) = P(\min(X_i,X_j,X_k) > x)$$

and so on. Thus, we see that

$$P(X > x) = \sum_{r=1}^n (-1)^{r+1}\sum_{i_1 < \cdots < i_r} P(\min(X_{i_1},\ldots,X_{i_r}) > x)$$

Integrating both sides as $x$ goes from 0 to $\infty$ gives the identity

$$E[X] = \sum_{r=1}^n (-1)^{r+1}\sum_{i_1 < \cdots < i_r} E[\min(X_{i_1},\ldots,X_{i_r})]$$

Moreover, because the partial sums of an inclusion-exclusion expansion alternately over- and under-estimate, we obtain the max-min inequalities

$$E[X] \le \sum_i E[X_i]$$
$$E[X] \ge \sum_i E[X_i] - \sum_{i<j} E[\min(X_i,X_j)]$$
$$E[X] \le \sum_i E[X_i] - \sum_{i<j} E[\min(X_i,X_j)] + \sum_{i<j<k} E[\min(X_i,X_j,X_k)]$$

and so on.

Example 4.17 The Coupon Collector Problem Suppose that each new coupon collected is, independently of the past, a type $i$ coupon with probability $p_i$, $\sum_{i=1}^n p_i = 1$. Let $X$ denote the number of coupons one needs to collect to obtain a complete set of at least one of each type. Then, with $X_i$ equal to the number of coupons needed to obtain a type $i$ coupon, $X = \max(X_1,\ldots,X_n)$. Because $\min(X_{i_1},\ldots,X_{i_r})$, the number of coupons needed to obtain any of the types $i_1,\ldots,i_r$, is geometric with parameter $p_{i_1} + \cdots + p_{i_r}$, the identity yields

$$E[X] = \sum_i \frac{1}{p_i} - \sum_{i<j} \frac{1}{p_i + p_j} + \sum_{i<j<k} \frac{1}{p_i + p_j + p_k} - \cdots + (-1)^{n+1}\frac{1}{p_1 + \cdots + p_n}$$ ∎

Using the preceding formula for the mean number of coupons needed to obtain a complete set requires summing over $2^n$ terms, and so is not practical when $n$ is large. Moreover, the bounds obtained by only going out a few steps in the formula for the expected value of a maximum generally turn out to be too loose to be beneficial. However, a useful bound can often be obtained by applying the max-min inequalities to an upper bound for $E[X]$ rather than directly to $E[X]$. We now develop the theory.

For nonnegative random variables $X_1,\ldots,X_n$, let $X = \max(X_1,\ldots,X_n)$. Fix $c > 0$, and note the inequality

$$X \le c + \max((X_1-c)^+,\ldots,(X_n-c)^+)$$

Now apply the max-min upper bound inequalities to the right side of the preceding, take expectations, and obtain that

$$E[X] \le c + \sum_i E[(X_i-c)^+]$$
$$E[X] \le c + \sum_i E[(X_i-c)^+] - \sum_{i<j} E[\min((X_i-c)^+,(X_j-c)^+)] + \sum_{i<j<k} E[\min((X_i-c)^+,(X_j-c)^+,(X_k-c)^+)]$$

and so on.

Example 4.18 Consider the case of independent exponential random variables $X_1,\ldots,X_n$, all having rate 1. Then, the preceding gives the bound

$$E[\max_i X_i] \le c + \sum_i E[(X_i-c)^+] - \sum_{i<j} E[\min((X_i-c)^+,(X_j-c)^+)] + \sum_{i<j<k} E[\min((X_i-c)^+,(X_j-c)^+,(X_k-c)^+)]$$

Conditioning on whether $X_i > c$, whether $\min(X_i,X_j) > c$, and whether $\min(X_i,X_j,X_k) > c$, and using the lack of memory property, yields

$$E[(X_i-c)^+] = e^{-c}$$
$$E[\min((X_i-c)^+,(X_j-c)^+)] = \frac{e^{-2c}}{2}$$
$$E[\min((X_i-c)^+,(X_j-c)^+,(X_k-c)^+)] = \frac{e^{-3c}}{3}$$

Using the constant $c = \ln(n)$ yields the bound

$$E[\max_i X_i] \le \ln(n) + 1 - \frac{n-1}{4n} + \frac{(n-1)(n-2)}{18n^2} \approx \ln(n) + .806 \quad \text{for } n \text{ large}$$ ∎
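A minimal sketch (not from the text): for $n$ iid rate-1 exponentials, comparing the basic bound (4.10), the asymptotic refinement $\ln(n) + .806$ from Example 4.18, and the exact value $E[\max] = \sum_{i=1}^n 1/i$.

```python
import numpy as np

# E[max of n iid Exp(1)] = H_n, versus the bounds ln(n)+1 and ln(n)+0.806.
for n in (10, 100, 1000):
    exact = np.sum(1.0 / np.arange(1, n + 1))     # harmonic number H_n
    print(f"n={n}: exact {exact:.3f}, ln(n)+1 = {np.log(n)+1:.3f}, "
          f"ln(n)+0.806 = {np.log(n)+0.806:.3f}")
```

Even at moderate $n$ the refined bound is within about a quarter of a unit of the exact mean.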
Example 4.19 Let us reconsider Example 4.17, the coupon collector's problem. Let $c$ be an integer. To compute $E[(X_i - c)^+]$, condition on whether a type $i$ coupon is among the first $c$ collected:

$$E[(X_i-c)^+] = E[(X_i-c)^+ \mid X_i \le c]P(X_i \le c) + E[(X_i-c)^+ \mid X_i > c]P(X_i > c) \\ = E[(X_i-c)^+ \mid X_i > c]\,(1-p_i)^c = \frac{(1-p_i)^c}{p_i}$$

where the final equality used the memoryless property of the geometric distribution. Similarly,

$$E[\min((X_i-c)^+,(X_j-c)^+)] = \frac{(1-p_i-p_j)^c}{p_i+p_j}$$
$$E[\min((X_i-c)^+,(X_j-c)^+,(X_k-c)^+)] = \frac{(1-p_i-p_j-p_k)^c}{p_i+p_j+p_k}$$

Therefore, for any nonnegative integer $c$,

$$E[X] \le c + \sum_i \frac{(1-p_i)^c}{p_i} - \sum_{i<j} \frac{(1-p_i-p_j)^c}{p_i+p_j} + \sum_{i<j<k} \frac{(1-p_i-p_j-p_k)^c}{p_i+p_j+p_k}$$ ∎

4.7 Stochastic Orderings

We say that $X$ is stochastically greater than $Y$, written $X \ge_{st} Y$, if

$$P(X > t) \ge P(Y > t) \quad \text{for all } t$$

In this section, we define and compare some other stochastic orderings of random variables.

If $X$ is a nonnegative continuous random variable with distribution function $F$ and density $f$, then the hazard rate function of $X$ is defined by

$$\lambda_X(t) = f(t)/\bar{F}(t)$$

where $\bar{F}(t) = 1 - F(t)$. Interpreting $X$ as the lifetime of an item, then for $\epsilon$ small

$$P(t \text{ year old item dies within an additional time } \epsilon) = P(X < t+\epsilon \mid X > t) \approx \lambda_X(t)\epsilon$$

If $Y$ is a nonnegative continuous random variable with distribution function $G$ and density $g$, say that $X$ is hazard rate order larger than $Y$, written $X \ge_{hr} Y$, if

$$\lambda_X(t) \le \lambda_Y(t) \quad \text{for all } t$$

Say that $X$ is likelihood ratio order larger than $Y$, written $X \ge_{lr} Y$, if

$$f(x)/g(x) \uparrow x$$

that is, if $f(x)/g(x)$ is nondecreasing in $x$, where $f$ and $g$ are the respective densities of $X$ and $Y$.

Proposition 4.20 $X \ge_{lr} Y \Rightarrow X \ge_{hr} Y \Rightarrow X \ge_{st} Y$

Proof. Let $X$ have density $f$, and $Y$ have density $g$. Suppose $X \ge_{lr} Y$. Then, for $y \ge x$,

$$f(y)g(x) \ge f(x)g(y)$$

Integrating over $y \ge x$ gives

$$\int_x^\infty f(y)\,dy \ge \frac{f(x)}{g(x)}\int_x^\infty g(y)\,dy \;\Rightarrow\; \lambda_X(x) \le \lambda_Y(x)$$

To prove the final implication, note first that

$$\int_0^t \lambda_X(s)\,ds = \int_0^t \frac{f(s)}{\bar{F}(s)}\,ds = -\ln \bar{F}(t)$$

implying that

$$\bar{F}(t) = e^{-\int_0^t \lambda_X(s)\,ds}$$

which immediately shows that $X \ge_{hr} Y \Rightarrow X \ge_{st} Y$. ∎

Define $r_X(t)$, the reversed hazard rate function of $X$, by

$$r_X(t) = f(t)/F(t)$$

and note that

$$P(t-\epsilon < X \mid X \le t) \approx r_X(t)\epsilon$$

Say that $X$ is reverse hazard rate order larger than $Y$, written $X \ge_{rh} Y$, if

$$r_X(t) \ge r_Y(t) \quad \text{for all } t$$

Our next theorem gives some representations of the orderings of $X$ and $Y$ in terms of stochastically larger relations between certain conditional distributions of $X$ and $Y$. Before presenting it, we introduce the following notation.

Notation For a random variable $X$ and event $A$, let $[X \mid A]$ denote a random variable whose distribution is that of the conditional distribution of $X$ given $A$.

Theorem 4.21
(i) $X \ge_{hr} Y \Leftrightarrow [X \mid X > t] \ge_{st} [Y \mid Y > t]$ for all $t$
(ii) $X \ge_{rh} Y \Leftrightarrow [X \mid X \le t] \ge_{st} [Y \mid Y \le t]$ for all $t$
(iii) $X \ge_{lr} Y \Leftrightarrow [X \mid s < X < t] \ge_{st} [Y \mid s < Y < t]$ for all $s < t$

Proof. (i) Let $X_t =_d [X-t \mid X > t]$ and $Y_t =_d [Y-t \mid Y > t]$. Noting that

$$\lambda_{X_t}(s) = \lambda_X(t+s), \quad s > 0$$

shows that

$$X \ge_{hr} Y \Rightarrow X_t \ge_{hr} Y_t \Rightarrow X_t \ge_{st} Y_t \Rightarrow [X \mid X > t] \ge_{st} [Y \mid Y > t]$$

To go the other way, use the identity

$$\bar{F}_{X_t}(s) = e^{-\int_0^s \lambda_X(t+u)\,du}$$

to obtain that

$$[X \mid X > t] \ge_{st} [Y \mid Y > t] \text{ for all } t \;\Rightarrow\; \int_0^s \lambda_X(t+u)\,du \le \int_0^s \lambda_Y(t+u)\,du \text{ for all } s, t \;\Rightarrow\; \lambda_X(t) \le \lambda_Y(t)$$

where the final implication follows upon dividing by $s$ and letting $s \to 0$.

(ii) With $X_t =_d [t-X \mid X \le t]$ and $Y_t =_d [t-Y \mid Y \le t]$,

$$\lambda_{X_t}(y) = r_X(t-y), \quad 0 < y < t$$

Hence

$$[X \mid X \le t] \ge_{st} [Y \mid Y \le t] \text{ for all } t \;\Leftrightarrow\; X_t \le_{st} Y_t \text{ for all } t \;\Leftrightarrow\; \int_0^y r_X(t-u)\,du \ge \int_0^y r_Y(t-u)\,du \text{ for all } y < t$$

and dividing by $y$ and letting $y \to 0$ gives $r_X(t) \ge r_Y(t)$; the reverse direction is immediate from the integral characterization.

(iii) Let $X$ and $Y$ have respective densities $f$ and $g$. Suppose $[X \mid s < X < t] \ge_{st} [Y \mid s < Y < t]$ for all $s < t$. Fixing $x < y$ and applying this with $s = x$, $t = y$ gives, for $x < v < y$,

$$\int_v^y f(z)\,dz \int_x^v g(z)\,dz \ge \int_x^v f(z)\,dz \int_v^y g(z)\,dz$$

Dividing by $y - v$ and letting $v \uparrow y$ yields $f(y)\int_x^y g(z)\,dz \ge g(y)\int_x^y f(z)\,dz$, while dividing by $v - x$ and letting $v \downarrow x$ yields $g(x)\int_x^y f(z)\,dz \ge f(x)\int_x^y g(z)\,dz$. Combining gives $f(y)g(x) \ge f(x)g(y)$, showing that $X \ge_{lr} Y$. Now suppose that $X \ge_{lr} Y$. Then clearly $[X \mid s < X < t] \ge_{lr} [Y \mid s < Y < t]$ (the conditional densities are $f$ and $g$ restricted to $(s,t)$ and renormalized, so their ratio remains monotone), implying, from Proposition 4.20, that $[X \mid s < X < t] \ge_{st} [Y \mid s < Y < t]$. ∎

Corollary 4.22 $X \ge_{lr} Y \Rightarrow X \ge_{rh} Y \Rightarrow X \ge_{st} Y$

Proof. The first implication immediately follows from parts (ii) and (iii) of Theorem 4.21. The second implication follows upon taking the limit as $t \to \infty$ in part (ii) of that theorem. ∎

Say that $X$ is an increasing hazard rate (or IHR) random variable if $\lambda_X(t)$ is nondecreasing in $t$. (Other terminology is to say that $X$ has an increasing failure rate.)

Proposition 4.23 Let $X_t =_d [X-t \mid X > t]$. Then,
(a) $X_t \downarrow_{hr}$ as $t \uparrow$ $\Leftrightarrow$ $X$ is IHR
(b) $X_t \downarrow_{lr}$ as $t \uparrow$ $\Leftrightarrow$ $\log f(x)$ is concave

Proof.
(a) Let $\lambda(y)$ be the hazard rate function of $X$. Then $\lambda_t(y)$, the hazard rate function of $X_t$, is given by

$$\lambda_t(y) = \lambda(t+y), \quad y > 0$$

Hence, if $X$ is IHR then $\lambda_t(y)$ is nondecreasing in $t$, implying that $X_t \downarrow_{hr}$ as $t \uparrow$. Now, suppose $X_s \ge_{hr} X_t$ whenever $s \le t$. Then,

$$e^{-\int_0^\epsilon \lambda(t+y)\,dy} = P(X_t > \epsilon) \le P(X_s > \epsilon) = e^{-\int_0^\epsilon \lambda(s+y)\,dy}$$

and dividing by $\epsilon$ and letting $\epsilon \to 0$ shows that $\lambda(t) \ge \lambda(s)$. Thus part (a) is proved.

(b) Using that the density of $X_t$ is $f_{X_t}(x) = f(x+t)/\bar{F}(t)$ yields, for $s < t$, that

$$X_s \ge_{lr} X_t \Leftrightarrow \frac{f(x+s)}{f(x+t)} \uparrow x \\ \Leftrightarrow \log f(x+s) - \log f(x+t) \uparrow x \\ \Leftrightarrow \frac{f'(y)}{f(y)} \downarrow y \\ \Leftrightarrow \log f(y) \text{ is concave}$$ ∎

4.8 Exercises

1. For a nonnegative random variable $X$, show that $(E[X^n])^{1/n}$ is nondecreasing in $n$.

2. Let $X$ be a standard normal random variable. Use Corollary 4.4, along with the density

$$g(x) = xe^{-x^2/2}, \quad x > 0$$

to show, for $c > 0$, that
(a) $P(X > c) = \frac{e^{-c^2/2}}{\sqrt{2\pi}}E\big[(Z^2+c^2)^{-1/2}\big]$, where $Z$ has density $g$
(b) Show, for any positive random variable $W$, that $E[1/W] \ge 1/E[W]$
(c) Show that $E\big[\sqrt{Z^2+c^2}\big] = c + \sqrt{2\pi}\,e^{c^2/2}P(X > c)$
(d) Use Jensen's inequality, along with the preceding, to show that

$$cP(X > c) + \sqrt{2\pi}\,e^{c^2/2}(P(X > c))^2 \ge \frac{e^{-c^2/2}}{\sqrt{2\pi}}$$

(e) Argue from (d) that $P(X > c)$ must be at least as large as the positive root of the equation

$$cx + \sqrt{2\pi}\,e^{c^2/2}x^2 = \frac{e^{-c^2/2}}{\sqrt{2\pi}}$$

(f) Conclude that

$$P(X > c) \ge \frac{1}{2\sqrt{2\pi}}\big(\sqrt{c^2+4} - c\big)e^{-c^2/2}$$

3. Let $X$ be a Poisson random variable with mean $\lambda$. Show that, for $n > \lambda$, the Chernoff bound yields that

$$P(X \ge n) \le \frac{e^{-\lambda}(\lambda e)^n}{n^n}$$

4. For a nonnegative random variable $X$, let $m(t) = E[X^t]$. The moment bound states that for $c > 0$

$$P(X \ge c) \le m(t)c^{-t} \quad \text{for all } t > 0$$

Show that this result can be obtained from the importance sampling identity.

5. Fill in the details of the proof that, for independent Bernoulli random variables $X_1,\ldots,X_n$ and $c > 0$,

$$P(S - E[S] \le -c) \le e^{-2c^2/n}$$

where $S = \sum_{i=1}^n X_i$.

6. If $X$ is a binomial random variable with parameters $n$ and $p$, show
(a) $P(|X - np| > c) \le 2e^{-2c^2/n}$
(b) $P(X - np > \alpha np) \le \exp\{-2np^2\alpha^2\}$

7. Give the details of the proof of (4.7).

8. Prove that

$$E[f(X)] \ge E[f(E[X \mid Y])] \ge f(E[X])$$

Suppose you want a lower bound on $E[f(X)]$ for a convex function $f$. The preceding shows that first conditioning on $Y$ and then applying Jensen's inequality to the individual terms $E[X \mid Y = y]$ results in a larger lower bound than does an immediate application of Jensen's inequality.

9. Let $X_i$ be binary random variables with parameters $p_i$, $i = 1,\ldots,n$. Let $X = \sum_{i=1}^n X_i$, and also let $I$, independent of the variables $X_1,\ldots,X_n$, be equally likely to be any of the values $1,\ldots,n$. For $R$ independent of $I$, show that
(a) $P(I = i \mid X_I = 1) = p_i/E[X]$
(b) $E[XR] = E[X]\,E[R \mid X_I = 1]$
(c) $P(X > 0) = E[X]\,E[1/X \mid X_I = 1]$

10. For $X_i$ and $X$ as in Exercise 9, show that

$$\sum_{i=1}^n \frac{p_i}{E[X \mid X_i = 1]} \ge \frac{(E[X])^2}{E[X^2]}$$

Thus, for sums of binary variables, the conditional expectation inequality yields a stronger lower bound than does the second moment inequality. Hint: Make use of the results of Exercises 8 and 9.

11. Let $X_i$ be exponential with mean $8 + 2i$, for $i = 1,2,3$. Obtain an upper bound on $E[\max X_i]$, and compare it with the exact result when the $X_i$ are independent.

12. Let $U_i$, $i = 1,\ldots,n$, be uniform $(0,1)$ random variables. Obtain an upper bound on $E[\max U_i]$, and compare it with the exact result when the $U_i$ are independent.

13. Let $U_1$ and $U_2$ be uniform $(0,1)$ random variables. Obtain an upper bound on $E[\max(U_1,U_2)]$, and show that the bound is attained when $U_1 = 1 - U_2$.

14. Show that $X \ge_{hr} Y$ if and only if

$$\frac{P(X > t)}{P(X > s)} \ge \frac{P(Y > t)}{P(Y > s)} \quad \text{for all } s \le t$$

15. Let $h(x,y)$ be a real valued function satisfying $h(x,y) \ge h(y,x)$ whenever $x \ge y$.
(a) Show that if $X$ and $Y$ are independent and $X \ge_{lr} Y$ then $h(X,Y) \ge_{st} h(Y,X)$.
(b) Show by a counterexample that the preceding is not valid under the weaker condition $X \ge_{st} Y$.
16. There are $n$ jobs, with job $i$ requiring a random time $X_i$ to process. The jobs must be processed sequentially. Give a sufficient condition, the weaker the better, under which the policy of processing jobs in the order $1,2,\ldots,n$ maximizes the probability that at least $k$ jobs are processed by time $t$, for all $k$ and $t$.

Chapter 5 Markov Chains

5.1 Introduction

This chapter introduces a natural generalization of a sequence of independent random variables, called a Markov chain, where a variable may depend on the immediately preceding variable in the sequence. Named after the 19th century Russian mathematician Andrei Andreyevich Markov, these chains are widely used as simple models of more complex real-world phenomena.

Given a sequence of discrete random variables $X_0, X_1, X_2,\ldots$ taking values in some finite or countably infinite set $S$, we say that $X_n$ is a Markov chain with respect to a filtration $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$ and, for all $B \subseteq S$, we have the Markov property

$$P(X_{n+1} \in B \mid \mathcal{F}_n) = P(X_{n+1} \in B \mid X_n).$$

If we interpret $X_n$ as the state of the chain at time $n$, then the preceding means that if you know the current state, nothing else from the past is relevant to the future of the Markov chain. That is, given the present state, the future states and the past states are independent.

When we let $\mathcal{F}_n = \sigma(X_0, X_1,\ldots,X_n)$ this definition reduces to

$$P(X_{n+1} = j \mid X_n = i_n, X_{n-1} = i_{n-1},\ldots,X_0 = i_0) = P(X_{n+1} = j \mid X_n = i_n).$$

If $P(X_{n+1} = j \mid X_n = i)$ is the same for all $n$, we say that the Markov chain has stationary transition probabilities, and we set

$$P_{ij} = P(X_{n+1} = j \mid X_n = i).$$

In this case the quantities $P_{ij}$ are called the transition probabilities, and specifying them along with a probability distribution for the starting state $X_0$ is enough to determine all probabilities concerning $X_0,\ldots,X_n$. We will assume from here on that all Markov chains considered have stationary transition probabilities. In addition, unless otherwise noted, we will assume that $S$, the set of all possible states of the Markov chain, is the set of nonnegative integers.

Example 5.1 Reflected random walk. Suppose $Y_i$ are independent and identically distributed Bernoulli($p$) random variables, and let $X_0 = 0$ and $X_n = (X_{n-1} + 2Y_n - 1)^+$ for $n = 1,2,\ldots$. The process $X_n$, called a reflected random walk, can be viewed as the position of a particle at time $n$ such that at each time the particle has a $p$ probability of moving one step to the right, a $1-p$ probability of moving one step to the left, and is returned to position 0 if it ever attempts to move to the left of 0. It is immediate from its definition that $X_n$ is a Markov chain.

Example 5.2 A non-Markov chain. Again let $Y_i$ be independent and identically distributed Bernoulli($p$) random variables, let $X_0 = 0$, and this time let $X_n = Y_n + Y_{n-1}$ for $n = 1,2,\ldots$. It's easy to see that $X_n$ is not a Markov chain because

$$P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 2) = 0$$

while on the other hand

$$P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 0) = p.$$
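A minimal simulation sketch (not from the text) of the reflected random walk of Example 5.1, recording the fraction of time spent in each state; the parameters are arbitrary, and for $p < 1/2$ the walk keeps returning to 0.

```python
import numpy as np

# Simulate X_n = max(X_{n-1} + 2Y_n - 1, 0) and tally state occupancies.
rng = np.random.default_rng(6)
p, steps = 0.3, 100_000
x, visits = 0, {}
for _ in range(steps):
    x = max(x + (1 if rng.random() < p else -1), 0)
    visits[x] = visits.get(x, 0) + 1
for state in sorted(visits)[:5]:
    print(f"state {state}: fraction of time {visits[state] / steps:.4f}")
```

The occupancy fractions settle down as the run lengthens, previewing the stationary distributions studied later in this chapter.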
5.2 The Transition Matrix

The transition probabilities

$$P_{ij} = P(X_1 = j \mid X_0 = i)$$

are also called the one-step transition probabilities. We define the $n$-step transition probabilities by

$$P_{ij}^{(n)} = P(X_n = j \mid X_0 = i).$$

In addition, we define the transition probability matrix

$$\mathbf{P} = \begin{pmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$

and the $n$-step transition probability matrix

$$\mathbf{P}^{(n)} = \begin{pmatrix} P_{00}^{(n)} & P_{01}^{(n)} & P_{02}^{(n)} & \cdots \\ P_{10}^{(n)} & P_{11}^{(n)} & P_{12}^{(n)} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$

An interesting relation between these matrices is obtained by noting that

$$P_{ij}^{(n+m)} = \sum_k P(X_{n+m} = j \mid X_0 = i, X_n = k)P(X_n = k \mid X_0 = i) = \sum_k P_{ik}^{(n)}P_{kj}^{(m)}.$$

The preceding are called the Chapman-Kolmogorov equations. It follows from the Chapman-Kolmogorov equations that

$$\mathbf{P}^{(n+m)} = \mathbf{P}^{(n)} \times \mathbf{P}^{(m)}$$

where here "$\times$" represents matrix multiplication. Hence,

$$\mathbf{P}^{(2)} = \mathbf{P} \times \mathbf{P}$$

and, by induction,

$$\mathbf{P}^{(n)} = \mathbf{P}^n,$$

where the right-hand side represents multiplying the matrix $\mathbf{P}$ by itself $n$ times.

Example 5.3 Reflected random walk. A particle starts at position zero and at each time moves one position to the right with probability $p$ and, if the particle is not in position zero, moves one position to the left with probability $1-p$ (if it is in position zero, it remains there with probability $1-p$). The position $X_n$ of the particle at time $n$ forms a Markov chain with transition matrix

$$\mathbf{P} = \begin{pmatrix} 1-p & p & 0 & 0 & 0 & \cdots \\ 1-p & 0 & p & 0 & 0 & \cdots \\ 0 & 1-p & 0 & p & 0 & \cdots \\ \vdots & & & & & \end{pmatrix}$$

Example 5.4 The two state Markov chain. Consider a Markov chain with states 0 and 1 having transition probability matrix

$$\mathbf{P} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$$

The two-step transition probability matrix is given by

$$\mathbf{P}^{(2)} = \begin{pmatrix} (1-\alpha)^2 + \alpha\beta & \alpha(1-\alpha) + \alpha(1-\beta) \\ \beta(1-\alpha) + \beta(1-\beta) & \alpha\beta + (1-\beta)^2 \end{pmatrix}$$

5.3 The Strong Markov Property

Consider a Markov chain $X_n$ having one-step transition probabilities $P_{ij}$, which means that if the Markov chain is in state $i$ at a fixed time $n$, then the next state will be $j$ with probability $P_{ij}$. However, it is not necessarily true that if the Markov chain is in state $i$ at a randomly distributed time $T$, the next state will be $j$ with probability $P_{ij}$. That is, if $T$ is an arbitrary nonnegative integer valued random variable, it is not necessarily true that $P(X_{T+1} = j \mid X_T = i) = P_{ij}$. For a simple counterexample suppose

$$T = \min\{n : X_n = i, X_{n+1} = j\}$$

Then, clearly, $P(X_{T+1} = j \mid X_T = i) = 1$.

The idea behind this counterexample is that a general random variable $T$ may depend not only on the states of the Markov chain up to time $T$ but also on future states after time $T$. Recalling that $T$ is a stopping time for a filtration $\mathcal{F}_n$ if $\{T = n\} \in \mathcal{F}_n$ for every $n$, we see that for a stopping time the value of $T$ can only depend on the states up to time $T$. We now show that $P(X_{T+n} = j \mid X_T = i)$ will equal $P_{ij}^{(n)}$ provided that $T$ is a stopping time. This is usually called the strong Markov property, and essentially means a Markov chain "starts over" at stopping times. Below we define

$$\mathcal{F}_T = \{A : A \cap \{T = t\} \in \mathcal{F}_t \text{ for all } t\},$$

which intuitively represents any information you would know by time $T$.

Proposition 5.5 The strong Markov property. Let $X_n, n \ge 0$, be a Markov chain with respect to the filtration $\mathcal{F}_n$. If $T < \infty$ a.s. is a stopping time with respect to $\mathcal{F}_n$, then

$$P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T) = P_{ij}^{(n)}$$

Proof.

$$P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T, T = t) = P(X_{t+n} = j \mid X_t = i, \mathcal{F}_T, T = t) \\ = P(X_{t+n} = j \mid X_t = i, \mathcal{F}_t) = P_{ij}^{(n)}$$

where the next to last equality used the fact that $T$ is a stopping time to give $\{T = t\} \in \mathcal{F}_t$. ∎
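A minimal simulation sketch (not from the text), using the two-state chain of Example 5.4: at the genuine stopping time $T' = \min\{n : X_n = 0\}$ the next transition behaves like an ordinary one, $P(X_{T'+1} = 1 \mid X_{T'} = 0) = P_{01} = \alpha$, in contrast to the look-ahead time $T$ of the counterexample above.

```python
import numpy as np

# Run the two-state chain from state 1 until it first hits 0 (a stopping
# time), then take one more transition and tally how often it goes to 1.
rng = np.random.default_rng(7)
alpha, beta, trials = 0.3, 0.6, 20_000   # arbitrary parameters
ones = 0
for _ in range(trials):
    x = 1
    while x != 0:                        # wait for T' = first visit to 0
        x = 0 if rng.random() < beta else 1
    ones += rng.random() < alpha         # one more step from state 0
print("estimate of P(X_{T'+1}=1 | X_{T'}=0):", ones / trials, " alpha =", alpha)
```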
Example 5.6 Losses in queueing busy periods. Consider a queueing system where $X_n$ is the number of customers in the system at time $n$. At each time $n = 1,2,\ldots$ there is a probability $p$ that a customer arrives and a probability $1-p$, if there are any customers present, that one of them departs. Starting with $X_0 = 1$, let $T = \min\{t > 0 : X_t = 0\}$ be the length of a busy period. Suppose also there is only space for at most $m$ customers in the system. Whenever a customer arrives to find $m$ customers already in the system, the customer is lost and departs immediately. Letting $N_m$ be the number of customers lost during a busy period, compute $E[N_m]$.

Solution. Let $A$ be the event that the first arrival occurs before the first departure. We will obtain $E[N_m]$ by conditioning on whether $A$ occurs. Now, when $A$ happens, for the busy period to end we must first wait an interval of time until the system goes back to having a single customer, and then after that we must wait another interval of time until the system becomes completely empty. By the Markov property the number of losses during the first time interval has distribution $N_{m-1}$, because we are now starting with two customers and therefore with only $m-1$ spaces for additional customers. The strong Markov property tells us that the number of losses in the second time interval has distribution $N_m$. We therefore have

$$E[N_m \mid A] = \begin{cases} E[N_{m-1}] + E[N_m] & \text{if } m > 1 \\ 1 + E[N_1] & \text{if } m = 1, \end{cases}$$

and using $P(A) = p$ and $E[N_m \mid A^c] = 0$ we have

$$E[N_m] = E[N_m \mid A]P(A) + E[N_m \mid A^c]P(A^c) = pE[N_{m-1}] + pE[N_m] \quad \text{for } m > 1$$

along with $E[N_1] = p + pE[N_1]$, and thus

$$E[N_m] = \Big(\frac{p}{1-p}\Big)^m.$$

It's quite interesting to notice that $E[N_m]$ increases in $m$ when $p > 1/2$, decreases when $p < 1/2$, and stays constant for all $m$ when $p = 1/2$. The intuition for the case $p = 1/2$ is that when $m$ increases, losses become less frequent but the busy period becomes longer. ∎
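A minimal simulation sketch (not from the text) of Example 5.6, comparing the simulated mean number of busy-period losses with $(p/(1-p))^m$; the parameters are arbitrary.

```python
import numpy as np

# Simulate busy periods of the finite-capacity queue and count losses.
rng = np.random.default_rng(8)
p, m, trials = 0.4, 3, 50_000
lost = 0
for _ in range(trials):
    x = 1                               # busy period starts with one customer
    while x > 0:
        if rng.random() < p:            # arrival
            if x == m:
                lost += 1               # system full: the arrival is lost
            else:
                x += 1
        else:                           # departure
            x -= 1
print("simulated E[N_m] :", lost / trials)
print("exact (p/(1-p))^m:", (p / (1 - p)) ** m)
```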
We next apply the strong Markov property to obtain a result for the cover time, the time when all states of a Markov chain have been visited.

Proposition 5.7 Cover times. Given an $N$-state Markov chain $X_n$, let $T_i = \min\{n \ge 0 : X_n = i\}$ and let $C = \max_i T_i$ be the cover time. Then

$$E[C] \le \sum_{m=1}^N \frac{1}{m}\,\max_{i,j} E[T_j \mid X_0 = i].$$

Proof. Let $I_1, I_2,\ldots,I_N$ be a random permutation of the integers $1,\ldots,N$ chosen so that all possible orderings are equally likely. Letting $T_{I_0} = 0$ and noting that $C = \max_{m \le N} T_{I_m}$,

$$E[C] = \sum_{m=1}^N E\Big[\max_{k \le m} T_{I_k} - \max_{k \le m-1} T_{I_k}\Big] \\ = \sum_{m=1}^N E\Big[\Big(\max_{k \le m} T_{I_k} - \max_{k \le m-1} T_{I_k}\Big) I\{T_{I_m} > \max_{k \le m-1} T_{I_k}\}\Big] \\ \le \sum_{m=1}^N \frac{1}{m}\,\max_{i,j} E[T_j \mid X_0 = i]$$

where the second line follows since all orderings are equally likely and thus $P(T_{I_m} > \max_{k \le m-1} T_{I_k}) = 1/m$, and the third from the strong Markov property: given that state $I_m$ has not yet been visited when the states $I_1,\ldots,I_{m-1}$ have all been visited, the conditional expected additional time to reach it is at most $\max_{i,j} E[T_j \mid X_0 = i]$. ∎

5.4 Classification of States

We say that states $i, j$ of a Markov chain communicate with each other, or are in the same class, if there are integers $n$ and $m$ such that both $P_{ij}^{(n)} > 0$ and $P_{ji}^{(m)} > 0$ hold. This means that it is possible for the chain to get from $i$ to $j$ and vice versa. A Markov chain is called irreducible if all states are in the same class.

For a Markov chain $X_n$, let

$$T_i = \min\{n > 0 : X_n = i\}$$

be the time until the Markov chain first makes a transition into state $i$. Using the notation $E_i[\cdots]$ and $P_i(\cdots)$ to denote that the Markov chain starts from state $i$, let

$$f_i = P_i(T_i < \infty)$$

be the probability that the chain ever makes a transition into state $i$ given that it starts in state $i$. We say that state $i$ is transient if $f_i < 1$ and recurrent if $f_i = 1$. Let

$$N_i = \sum_{n=1}^\infty I_{\{X_n = i\}}$$

be the total number of transitions into state $i$. The strong Markov property tells us that, starting in state $i$, $N_i + 1 \sim \text{geometric}(1 - f_i)$, since each time the chain makes a transition into state $i$ there is, independent of all else, a $1-f_i$ chance it will never return. Consequently,

$$E_i[N_i] = \frac{f_i}{1-f_i}$$

is either infinite or finite depending on whether state $i$ is recurrent or transient.

Proposition 5.8 If state $i$ is recurrent and $i$ communicates with $j$, then $j$ is also recurrent.

Proof. Because $i$ and $j$ communicate, there exist values $n$ and $m$ such that $P_{ij}^{(n)}P_{ji}^{(m)} > 0$. But for any $k \ge 0$

$$P_{jj}^{(m+k+n)} \ge P_{ji}^{(m)}P_{ii}^{(k)}P_{ij}^{(n)}$$

where the preceding follows because $P_{jj}^{(m+k+n)}$ is the probability, starting in state $j$, that the chain will be back in $j$ after $m+k+n$ transitions, whereas $P_{ji}^{(m)}P_{ii}^{(k)}P_{ij}^{(n)}$ is the probability of the same event occurring but with the additional condition that the chain must also be in $i$ after the first $m$ transitions and then back in $i$ after an additional $k$ transitions. Summing over $k$ shows that

$$E_j[N_j] \ge \sum_k P_{jj}^{(m+k+n)} \ge P_{ji}^{(m)}P_{ij}^{(n)}\sum_k P_{ii}^{(k)} = \infty$$

Thus, $j$ is also recurrent. ∎

Proposition 5.9 If $j$ is transient, then $\sum_{n=1}^\infty P_{ij}^{(n)} < \infty$.

Proof. Note that

$$E_i[N_j] = E_i\Big[\sum_{n=1}^\infty I_{\{X_n = j\}}\Big] = \sum_{n=1}^\infty P_{ij}^{(n)}$$

Let $f_{ij}$ denote the probability that the chain ever makes a transition into $j$ given that it starts at $i$. Then, conditioning on whether such a transition ever occurs yields, upon using the strong Markov property, that

$$E_i[N_j] = (1 + E_j[N_j])f_{ij} < \infty$$

since $j$ is transient. ∎

If $i$ is recurrent, let

$$\mu_i = E_i[T_i]$$

denote the mean number of transitions it takes the chain to return to state $i$, given it starts in $i$. We say that a recurrent state $i$ is null if $\mu_i = \infty$ and positive if $\mu_i < \infty$. In the next section we will show that positive recurrence is a class property, meaning that if $i$ is positive recurrent and communicates with $j$ then $j$ is also positive recurrent. (This also implies, using that recurrence is a class property, that so is null recurrence.)

5.5 Stationary and Limiting Distributions

For a Markov chain $X_n$ starting in some given state $i$, we define the limiting probability of being in state $j$ to be

$$P_j = \lim_{n\to\infty} P_{ij}^{(n)}$$

if the limit exists and is the same for all $i$. It is easy to see that not all Markov chains will have limiting probabilities. For instance, consider the two state Markov chain with $P_{01} = P_{10} = 1$. For this chain $P_{00}^{(n)}$ will equal 1 when $n$ is even and 0 when $n$ is odd, and so has no limit.

Definition 5.10 State $i$ of a Markov chain $X_n$ is said to have period $d$ if $d$ is the largest integer having the property that $P_{ii}^{(n)} = 0$ when $n$ is not a multiple of $d$.

Proposition 5.11 If states $i$ and $j$ communicate, then they have the same period.

Proof. Let $d_k$ be the period of state $k$. Let $n, m$ be such that $P_{ij}^{(n)}P_{ji}^{(m)} > 0$. Now, if $P_{ii}^{(r)} > 0$, then

$$P_{jj}^{(r+n+m)} \ge P_{ji}^{(m)}P_{ii}^{(r)}P_{ij}^{(n)} > 0$$

So $d_j$ divides $r+n+m$. Moreover, because

$$P_{jj}^{(2r+n+m)} \ge P_{ji}^{(m)}P_{ii}^{(r)}P_{ii}^{(r)}P_{ij}^{(n)} > 0$$

the same argument shows that $d_j$ also divides $2r+n+m$; therefore $d_j$ divides $2r+n+m - (r+n+m) = r$. Because $d_j$ divides $r$ whenever $P_{ii}^{(r)} > 0$, it follows that $d_j$ divides $d_i$. But the same argument can now be used to show that $d_i$ divides $d_j$. Hence, $d_i = d_j$. ∎

It follows from the preceding that all states of an irreducible Markov chain have the same period. If the period is 1, we say that the chain is aperiodic. It's easy to see that only aperiodic chains can have limiting probabilities.

Intimately linked to limiting probabilities are stationary probabilities. The probability vector $\pi_i, i \in S$, is said to be a stationary probability vector for the Markov chain if

$$\pi_j = \sum_i \pi_i P_{ij} \text{ for all } j, \qquad \sum_j \pi_j = 1$$

Its name arises from the fact that if $X_0$ is distributed according to a stationary probability vector $\{\pi_i\}$ then

$$P(X_1 = j) = \sum_i P(X_1 = j \mid X_0 = i)\pi_i = \sum_i \pi_i P_{ij} = \pi_j$$

and, by a simple induction argument,

$$P(X_n = j) = \sum_i P(X_n = j \mid X_{n-1} = i)P(X_{n-1} = i) = \sum_i \pi_i P_{ij} = \pi_j$$

Consequently, if we start the chain with a stationary probability vector then $X_n, X_{n+1},\ldots$ has the same probability distribution for all $n$. The following result will be needed later.
Proposition 5.12 An irreducible transient Markov chain does not have a stationary probability vector.

Proof. Assume there is a stationary probability vector $\pi_i, i \ge 0$, and take it to be the probability mass function of $X_0$. Then, for any $j$,

$$\pi_j = P(X_n = j) = \sum_i \pi_i P_{ij}^{(n)}$$

Consequently, for any $m$,

$$\pi_j = \lim_{n\to\infty} \sum_i \pi_i P_{ij}^{(n)} \le \lim_{n\to\infty}\Big(\sum_{i \le m} \pi_i P_{ij}^{(n)} + \sum_{i > m} \pi_i\Big) = \sum_{i \le m} \pi_i \lim_{n\to\infty} P_{ij}^{(n)} + \sum_{i > m} \pi_i = \sum_{i > m} \pi_i$$

where the final equality used that $\sum_n P_{ij}^{(n)} < \infty$ because $j$ is transient, implying that $\lim_{n\to\infty} P_{ij}^{(n)} = 0$. Letting $m \to \infty$ shows that $\pi_j = 0$ for all $j$, contradicting the fact that $\sum_j \pi_j = 1$. Thus, assuming a stationary probability vector results in a contradiction, proving the result. ∎

The following theorem is of key importance.

Theorem 5.13 An irreducible Markov chain has a stationary probability vector $\{\pi_j\}$ if and only if all states are positive recurrent. The stationary probability vector is unique and satisfies

$$\pi_j = \frac{1}{\mu_j}$$

Moreover, if the chain is aperiodic then

$$\pi_j = \lim_{n\to\infty} P_{ij}^{(n)}$$

To prove the preceding theorem, we will make use of a couple of lemmas.

Lemma 5.14 For an irreducible Markov chain, if there exists a stationary probability vector $\{\pi_j\}$ then all states are positive recurrent. Moreover, the stationary probability vector is unique and satisfies $\pi_j = 1/\mu_j$.

Proof. Let $\pi_j$ be stationary probabilities, and suppose that $P(X_0 = j) = \pi_j$ for all $j$. We first show that $\pi_i > 0$ for all $i$. To verify this, suppose that $\pi_k = 0$. Now for any state $j$, because the chain is irreducible, there is an $n$ such that $P_{jk}^{(n)} > 0$. Because $X_0$ is determined by the stationary probabilities,

$$\pi_k = P(X_n = k) = \sum_i \pi_i P_{ik}^{(n)} \ge \pi_j P_{jk}^{(n)}$$

Consequently, if $\pi_k = 0$ then so is $\pi_j$. Because $j$ was arbitrary, that means that if $\pi_i = 0$ for any $i$, then $\pi_j = 0$ for all $j$. But that would contradict the fact that $\sum_j \pi_j = 1$. Hence any stationary probability vector for an irreducible Markov chain must have all positive elements.

Now, recall that $T_j = \min\{n > 0 : X_n = j\}$. So,

$$\mu_j = E[T_j \mid X_0 = j] = \sum_{n=1}^\infty P(T_j \ge n \mid X_0 = j) = \sum_{n=1}^\infty \frac{P(T_j \ge n, X_0 = j)}{P(X_0 = j)}$$

Because $X_0$ is chosen according to the stationary probability vector $\{\pi_i\}$, this gives

$$\pi_j\mu_j = \sum_{n=1}^\infty P(T_j \ge n, X_0 = j) \tag{5.1}$$

Now,

$$P(T_j \ge 1, X_0 = j) = P(X_0 = j) = \pi_j$$

and, for $n \ge 2$,

$$P(T_j \ge n, X_0 = j) = P(X_0 = j, X_i \ne j, 1 \le i \le n-1) \\ = P(X_i \ne j, 1 \le i \le n-1) - P(X_i \ne j, 0 \le i \le n-1) \\ = P(X_i \ne j, 0 \le i \le n-2) - P(X_i \ne j, 0 \le i \le n-1)$$

where the final equality used stationarity. Substituting into (5.1), the series telescopes and we obtain

$$\pi_j\mu_j = \pi_j + P(X_0 \ne j) - \lim_{n\to\infty} P(X_i \ne j, 0 \le i \le n) = 1 - \lim_{n\to\infty} P(X_i \ne j, 0 \le i \le n)$$

By Proposition 5.12 the chain cannot be transient, so it is recurrent, and an irreducible recurrent chain visits every state with probability 1 from any starting state; hence the limit above is 0, and we obtain

$$\pi_j\mu_j = 1$$

thus showing that there is at most one stationary probability vector. In addition, because all $\pi_j > 0$, we have that all $\mu_j < \infty$, showing that all states of the chain are positive recurrent. ∎

Lemma 5.15 If some state of an irreducible Markov chain is positive recurrent then there exists a stationary probability vector.

Proof. Suppose state $k$ is positive recurrent. Thus,

$$\mu_k = E_k[T_k] < \infty$$

Say that a new cycle begins every time the chain makes a transition into state $k$. For any state $j$, let $A_j$ denote the amount of time the chain spends in state $j$ during a cycle. Then

$$E[A_j] = E_k\Big[\sum_{n=0}^\infty I_{\{X_n = j,\, T_k > n\}}\Big] = \sum_{n=0}^\infty P_k(X_n = j, T_k > n)$$

We claim that $\pi_j = E[A_j]/\mu_k$, $j \ge 0$, is a stationary probability vector. Because $E[\sum_j A_j]$ is the expected time of a cycle, it must equal $E_k[T_k]$, showing that

$$\sum_j \pi_j = \sum_j \frac{E[A_j]}{\mu_k} = 1$$
Corollary 5.16 If one state of an irreducible Markov chain is positive recurrent then all states are positive recurrent. ‘We are now ready to prove Theorem 5.13. Proof. All that remains to be proven is that if the chain is aperi- odic, as well as irreducible and positive recurrent, then the stationary 5. dtationary and Limiting Listriputions probabilities are also limiting probabilities To prove this, let m;,i > 0, be stationary probabilities. Let Xp, > 0 and Yq,n > 0 be indepen- dent Markov chains, both with transition probabilities P,,j, but with Xo =i and with P(% = 4) = 7. Let N = min(n : Xn = Yn) We first show that P(N < 00) = 1. To do so, consider the Markov chain whose state at time n is (Xp, Yn), and which thus has transition probabilities Pj) (47) = PiPjr That this chain is irreducible can be seen by the following ar- gument. Because {Xp} is irreducible and aperiodic, it follows that for any state i there are relatively prime integers n,m such that ”) Ph™ > 0. But any sufficiently large integer can be expressed as a linear combination of relatively prime integers, implying that there is an integer Nj such that P® 30 foralln> N Because i and j communicate this implies the existence of an integer Nig such that PI) >0 for all n> Nig Hence, | = P® p > 0 for all sufficiently large n Gk) 7) ~ Ger ly larg which shows that the vector chain (Xn, Yn) is irreducible. In addition, we claim that 7,3 = 77 is a stationary probability vector, which is seen from Tiny = YomPes oD a, P,j = DV rete Pei Peg k r ker By Lemma 5.14 this shows that the vector Markov chain is positive recurrent and so P(N < co) = 1 and thus lim, P(N > n) = 0. Now, let Z, = Xn ifn < N and let Z, = Y; ifn > N. It is easy to see that Z,,n > 0 is also a Markov chain with transition 156 Chapter 5 Markov Chains probabilities P;; and has Zp = i. Now (n) Pa W P(Z, =i) P(Z, gN n) P(Yn = 5,N $1) +P(Zn = 5,N >n) P(¥_ =) + P(N > n) aj + P(N >n) (5.2) s On the other hand my = PYn = 3) P(¥n = 5,N ) P(Zn = G,N 1) P(Zq = j) + P(N > n) PI) + P(N >n) (5.3) IA Wl i Hence, from (5.2) and (5.3) we see that lim PS? = ny. Remark 5.17 It follows from Theorem 5.13 that if we have an irre- ducible Markov chain, and we can find a solution of the stationarity equations y= DomPy G20 7 Dn 7 then the Markov chain is positive recurrent, and the 7; are the unique stationary probabilities. If, in addition, the chain is aperiodic then the 7; are also limiting probabilities. Remark 5.18 Because j; is the mean number of transitions between successive visits to state é, it is intuitive (and will be formally proven in the chapter on renewal theory) that the long run proportion of 5.6 Time Reversibility 157 time that the chain spends in state i is equal to 1/y;. Hence, the stationary probability 7; is equal to the long run proportion of time that the chain spends in state i. Definition 5.19 A positive recurrent, aperiodic, irreducible Markov chain is called an ergodic Markov chain. Definition 5.20 A positive recurrent irreducible Markov chain whose initial state is distributed according to its stationary probabilities is called a stationary Markov chain. 5.6 Time Reversibility A stationary Markov chain X,, is called time reversible if P(Xn = j/Xnvi =i) = P(Xn41 = Xn = 1) for all i,j. By the Markov property we know that the processes Xn+1, Xn+2,--+ and Xp-1, Xn-2,--- are conditionally independent given Xj, and so it follows that the reversed process Xn~1,Xn-2,-.. 
will also be a Markov chain having transition probabilities PXn = 5, Xnt P(Xnt — Ph P(Xn = Xan = 1) = m% where 7; and P,; respectively denote the stationary probabilities and the transition probabilities for the Markov chain X,. Thus an equiv- alent definition for X;, being time reversible is if mPy = 75Pji for all i,j. Intuitively, a Markov chain is time reversible if it looks the same running backwards as it does running forwards. It also means that the rate of transitions from i to j, namely 7;P;;, is the same as the rate of transitions from j to i, namely 7jPj;. This happens if there are no “loops” for which a Markov chain is more likely in the long tun to go in one direction compared with the other direction. We illustrate with an examples. 158 Chapter 5 Markov Chains Example 5.21 Random walk on the circle. Consider a particle which moves around n. positions on a circle numbered 1,2,....n according to transition probabilities Piz: = p = 1— Piyii forl Si, m; = 44 = 1, that the Markov chain is time reversible with the given stationary probabilities. m 5.7 A Mean Passage Time Bound Consider now a Markov chain whose state space is the set of non- negative integers, and which is such that Py =0, 0Si 0, we have that E[Dj] > di, then BAN] < 1/4 Proof. The proof is by induction on n. It is true when n = 1, because Nj is geometric with mean 1 1 PMI Fe ~ EID] 160 Chapter 5 Markov Chains So, assume that E[N,] < Ch, 1/dj, for all k > ElNn-j]P(Da = 3) j=l a as S1+ Pan E(Nnl + D> P(Dn = 5) D0 di jn fl = 1+ PanE[Nn] + YPw = aod a a i=l i=n-j+1 S14 Pan B[Nn) + YP, = = ay 1/di ~ 5/4) ja where the last line follows because d; is nondecreasing. Continuing from the previous line we get = 1+ PanE[Nal + (1 — Pan) x Udi - ELI j) =14 PanElNa] + (1 — Pan) Sodas — Ba = : S Pan E[Nn] + (1 — Pan) S/d i= which completes the proof. m Example 5.24 At each stage, each of a set of balls is independently put in one of n urns, with each ball being put in urn i with probability Pi, 8.1 pi = 1. After this is done, all of the balls in the same urn
