A Second Course in Probability - Sheldon M Ross and Erol A Peköz PDF
[…] $F^{-1}(U) \le x \iff F(x) \ge U$ implies $P(F^{-1}(U) \le x) = P(F(x) \ge U) = F(x)$, we get $\hat X =_d X$ and $\hat Y =_d Y$ after applying the same argument to $G$.

Example 2.4 If $X \sim N(0,1)$ and $Y \sim N(1,1)$, then $X$ […]

[…] $p$ which, in light of the previous inequality, must therefore be the maximal coupling. Let $B$, $C$, and $D$ be independent random variables with respective density functions

$$b(x) = \frac{\min(f(x), g(x))}{p}, \qquad c(x) = \frac{f(x) - \min(f(x), g(x))}{1-p}, \qquad d(x) = \frac{g(x) - \min(f(x), g(x))}{1-p},$$

let $I \sim$ Bernoulli($p$), and if $I = 1$ then let $\hat X = \hat Y = B$ and otherwise let $\hat X = C$ and $\hat Y = D$. This clearly gives $P(\hat X = \hat Y) \ge P(I = 1) = p$, and

$$P(\hat X \le x) = P(\hat X \le x \mid I = 1)\,p + P(\hat X$$ […]

[…] $g(x)\}$, we must have

$$d_{TV}(X,Y) = \max\{P(X \in A) - P(Y \in A),\; P(Y \in A^c) - P(X \in A^c)\}$$
$$= \max\{P(X \in A) - P(Y \in A),\; 1 - P(Y \in A) - 1 + P(X \in A)\}$$
$$= P(X \in A) - P(Y \in A)$$

and so

$$P(\hat X \ne \hat Y) = 1 - \int \min(f(x), g(x))\,dx = 1 - \int_A g(x)\,dx - \int_{A^c} f(x)\,dx = 1 - P(Y \in A) - 1 + P(X \in A) = d_{TV}(X,Y).$$

Chapter 2 Stein's Method and Central Limit Theorems

2.3 Poisson Approximation and Le Cam's Theorem

It's well-known that a binomial distribution converges to a Poisson distribution when the number of trials is increased and the probability of success is decreased at the same time in such a way that the mean stays constant. This also motivates using a Poisson distribution as an approximation for a binomial distribution if the probability of success is small and the number of trials is large. If the number of trials is very large, computing the distribution function of a binomial distribution can be computationally difficult, whereas the Poisson approximation may be much easier.

A fact which is not quite as well known is that the Poisson distribution can be a reasonable approximation even if the trials have varying probabilities and even if the trials are not completely independent of each other. This approximation is interesting because dependent trials are notoriously difficult to analyze in general, and the Poisson distribution is elementary.
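The convergence described above can be checked numerically. The following is a minimal sketch, not taken from the book: it computes the total variation distance between a Binomial($n$, $\lambda/n$) and a Poisson($\lambda$) distribution as half the sum of the absolute differences of their probability mass functions. The function names and the chosen values of $n$ and $\lambda$ are illustrative assumptions.

```python
import math


def binomial_pmf(n, p, k):
    """P(Binomial(n, p) = k), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def poisson_pmf(lam, k):
    """P(Poisson(lam) = k), computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def tv_binomial_poisson(n, lam):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam).

    Assumes 0 < lam < n. Uses d_TV = (1/2) * sum_k |P(X = k) - P(Y = k)|;
    the Poisson mass above n is added separately because the binomial
    puts no mass there.
    """
    p = lam / n
    head = sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
               for k in range(n + 1))
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1)))
    return 0.5 * (head + poisson_tail)


if __name__ == "__main__":
    # Hypothetical illustration: the mean is held at 2 while n grows.
    for n in (10, 100, 1000):
        print(n, tv_binomial_poisson(n, 2.0))
```

With the mean held fixed, the printed distances should shrink as $n$ grows, which is the convergence the paragraph above refers to.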
It's possible to assess how accurate such Poisson approximations are, and we first give a bound on the error of the Poisson approximation for completely independent trials with different probabilities of success.

Proposition 2.8 Let $X_i$ be independent Bernoulli($p_i$), and let $W = \sum_{i=1}^n X_i$, $Z \sim$ Poisson($\lambda$) and $\lambda = E[W] = \sum_{i=1}^n p_i$. Then

$$d_{TV}(W,Z) \le \sum_{i=1}^n p_i^2.$$

Proof. We first write $Z = \sum_{i=1}^n Z_i$, where the $Z_i$ are independent and $Z_i \sim$ Poisson($p_i$). Then we create the maximal coupling $(\hat Z_i, \hat X_i)$ of $(Z_i, X_i)$ and use the previous corollary to get

$$P(\hat Z_i = \hat X_i) = \sum_{k=0}^{\infty} \min(P(X_i = k), P(Z_i = k)) = \min(1 - p_i, e^{-p_i}) + \min(p_i, p_i e^{-p_i}) = 1 - p_i + p_i e^{-p_i} \ge 1 - p_i^2,$$

where we use $e^{-x} \ge 1 - x$. Using that $(\sum_{i=1}^n \hat Z_i, \sum_{i=1}^n \hat X_i)$ is a coupling of $(Z, W)$ yields that

$$d_{TV}(W,Z) \le P\Big(\sum_{i=1}^n \hat Z_i \ne \sum_{i=1}^n \hat X_i\Big) \le P\Big(\bigcup_i \{\hat Z_i \ne \hat X_i\}\Big) \le \sum_i P(\hat Z_i \ne \hat X_i) \le \sum_{i=1}^n p_i^2.$$

Remark 2.9 The drawback to this beautiful result is that when $E[W]$ is large the upper bound could be much greater than 1 even if $W$ has approximately a Poisson distribution. For example if $W$ is a binomial(100, .1) random variable and $Z \sim$ Poisson(10) we should have $W \approx_d Z$ but the proposition gives us the trivial and useless result $d_{TV}(W,Z) \le 100 \times (.1)^2 = 1$.

2.4 The Stein-Chen Method

The Stein-Chen method is another approach for obtaining an upper bound on $d_{TV}(W,Z)$, where $Z$ is a Poisson random variable and $W$ is some other variable you are interested in. This approach covers the distribution of the number of successes in dependent as well as independent trials with varying probabilities of success.

In order for the bound to be good, it should be close to 0 when $W$ is close in distribution to $Z$. In order to achieve this, the Stein-Chen method uses the interesting property of the Poisson distribution

$$kP(Z = k) = \lambda P(Z = k - 1). \qquad (2.1)$$

Thus, for any bounded function $f$ with $f(0) = 0$,

$$E[Zf(Z)] = \sum_{k=0}^{\infty} k f(k) P(Z = k) = \lambda \sum_{k=1}^{\infty} f(k) P(Z = k-1) = \lambda \sum_{i=0}^{\infty} f(i+1) P(Z = i) = \lambda E[f(Z+1)].$$
The secret to using the preceding identity $E[Zf(Z)] = \lambda E[f(Z+1)]$ is in cleverly picking a function $f$ so that $d_{TV}(W,Z) \le E[Wf(W)] - \lambda E[f(W+1)]$, and so we are likely to get a very small upper bound when $W \approx_d Z$ (where we use the notation $W \approx_d Z$ to mean that $W$ and $Z$ approximately have the same distribution).

Suppose $Z \sim$ Poisson($\lambda$) and $A$ is any set of nonnegative integers. Define the function $f_A(k)$, $k = 0, 1, 2, \ldots$, starting with $f_A(0) = 0$ and then using the following "Stein equation" for the Poisson distribution:

$$\lambda f_A(k+1) - k f_A(k) = I_{k \in A} - P(Z \in A). \qquad (2.2)$$

Notice that by plugging in any random variable $W$ for $k$ and taking expected values we get

$$\lambda E[f_A(W+1)] - E[W f_A(W)] = P(W \in A) - P(Z \in A),$$

so that

$$d_{TV}(W,Z) \le \sup_A \big|\lambda E[f_A(W+1)] - E[W f_A(W)]\big|.$$

Lemma 2.10 For any $A$ and $i, j$, $|f_A(i) - f_A(j)| \le \min(1, 1/\lambda)\,|i - j|$.

Proof. The solution to (2.2) is

$$f_A(k+1) = \sum_{j \in A} \frac{I_{j \le k} - P(Z \le k)}{\lambda P(Z = k)/P(Z = j)}$$

because when we plug it in to the left hand side of (2.2) and use (2.1) in the second line below, we get

$$\lambda f_A(k+1) - k f_A(k) = \sum_{j \in A} \left[ \frac{I_{j \le k} - P(Z \le k)}{P(Z = k)/P(Z = j)} - \frac{I_{j \le k-1} - P(Z \le k-1)}{\lambda P(Z = k-1)/(k P(Z = j))} \right]$$
$$= \sum_{j \in A} \frac{I_{j = k} - P(Z = k)}{P(Z = k)/P(Z = j)}$$
$$= I_{k \in A} - P(Z \in A),$$

which is the right hand side of (2.2). Since $P(Z \le k)/P(Z = k)$ is increasing in $k$ and […]

[…]

$$d_{TV}(W,Z) \le \min(1, 1/\lambda) \sum_{i=1}^n \lambda_i E|W - V_i|,$$

where $V_i$ is any random variable having distribution $V_i =_d (W - 1 \mid X_i = 1)$.

Proposition 2.12 With the above notation if $W \ge V_i$ for all $i$, then we have $d_{TV}(W,Z) \le 1 - \mathrm{Var}(W)/E[W]$.

Proof. Using $W \ge V_i$ we have

$$\sum_{i=1}^n \lambda_i E|W - V_i| = \sum_{i=1}^n \lambda_i E[W - V_i]$$
$$= \lambda^2 + \lambda - \sum_{i=1}^n \lambda_i E[1 + V_i]$$
$$= \lambda^2 + \lambda - \sum_{i=1}^n \lambda_i E[W \mid X_i = 1]$$
$$= \lambda^2 + \lambda - \sum_{i=1}^n E[X_i W]$$
$$= \lambda - \mathrm{Var}(W),$$

and the preceding theorem along with $\lambda = E[W]$ gives

$$d_{TV}(W,Z) \le \min(1, 1/E[W])\,(E[W] - \mathrm{Var}(W)) \le 1 - \mathrm{Var}(W)/E[W].$$

Example 2.13 Let $X_i$ be independent Bernoulli($p_i$) random variables with $\lambda = \sum_{i=1}^n p_i$. Let $W = \sum_{i=1}^n X_i$, and let $Z \sim$ Poisson($\lambda$). Using $V_i = \sum_{j \ne i} X_j$, note that $W \ge V_i$ and $E[W - V_i] = p_i$, so the preceding theorem gives us

$$d_{TV}(W,Z) \le \min(1, 1/\lambda) \sum_{i=1}^n p_i^2.$$
For instance, if $X$ is a binomial random variable with parameters $n = 100$, $p = 1/10$, then the upper bound on the total variation distance between $X$ and a Poisson random variable with mean 10 given by the Stein-Chen method is 1/10, as opposed to the upper bound of 1 which results from the Le Cam method of the preceding section.

Example 2.14 A coin with probability $p = 1 - q$ of coming up heads is flipped $n + k$ times. We are interested in $P(R_k)$, where $R_k$ is the event that a run of at least $k$ heads in a row occurs. To approximate this probability, whose exact expression is given in Example 1.10, let $X_i = 1$ if flip $i$ lands tails and flips $i+1, \ldots, i+k$ all land heads, and let $X_i = 0$ otherwise, $i = 1, \ldots, n$. Let

$$W = \sum_{i=1}^n X_i$$

and note that there will be a run of $k$ heads either if $W > 0$ or if the first $k$ flips all land heads. Consequently,

$$P(W > 0) \le P(R_k) \le P(W > 0) + p^k.$$

Because the flips are independent and $X_i = 1$ implies $X_j = 0$ for all $j \ne i$ where $|i - j| \le k$, it follows that if we let

$$V_i = W - \sum_{j=i-k}^{i+k} X_j$$

then $V_i =_d (W - 1 \mid X_i = 1)$ and $W \ge V_i$. Using that $\lambda = E[W] = nqp^k$ and $E[W - V_i] \le (2k+1)qp^k$, we see that

$$d_{TV}(W,Z) \le \min(1, 1/\lambda)\, n(2k+1) q^2 p^{2k},$$

where $Z \sim$ Poisson($\lambda$). For instance, suppose we flip a fair coin 1034 times, and want to approximate the probability that we have a run of 10 heads in a row. In this case, $n = 1024$, $k = 10$, $p = 1/2$, and so $\lambda = 1024/2^{11} = .5$. Consequently,

$$P(W > 0) \approx 1 - e^{-.5}$$

with the error in the above approximation being at most $21/2^{12}$. Consequently, we obtain that

$$1 - e^{-.5} - 21/2^{12} \le P(R_{10}) \le 1 - e^{-.5} + 21/2^{12} + (1/2)^{10}$$

or

$$0.388 \le P(R_{10}) \le 0.4$$

Example 2.15 Birthday Problem. With $m$ people and $n$ days in the year, let $Y_i$ = number of people born on day $i$. Let $X_i = I_{\{Y_i = 0\}}$, and $W = \sum_{i=1}^n X_i$ equal the number of days on which nobody has a birthday.
Next imagine $n$ different hypothetical scenarios are constructed, where in scenario $i$ all the $Y_i$ people initially born on day $i$ have their birthdays reassigned randomly to other days. Let $1 + V_i$ equal the number of days under scenario $i$ on which nobody has a birthday. Notice that $W \ge V_i$ and $V_i =_d (W - 1 \mid X_i = 1)$. Also we have $\lambda = E[W] = n(1 - 1/n)^m$ and $E[V_i] = (n-1)(1 - 1/(n-1))^m$, so with $Z \sim$ Poisson($\lambda$) and $\lambda_i = P(X_i = 1) = (1 - 1/n)^m$ we get

$$d_{TV}(W,Z) \le \min(1, 1/\lambda)\,\lambda\,\big(n(1 - 1/n)^m - (n-1)(1 - 1/(n-1))^m\big).$$

2.5 Stein's Method for the Geometric Distribution

In this section we will show how to obtain an upper bound on the distance $d_{TV}(W,Z)$ where $W$ is a given random variable and $Z$ has a geometric distribution with parameter $p = 1 - q = P(W = 1) = P(Z = 1)$. We will use a version of Stein's method applied to the geometric distribution. Define $f_A(1) = 0$ and, for $k = 1, 2, \ldots$, define $f_A(k+1)$ using the recursion

$$f_A(k) - q f_A(k+1) = I_{k \in A} - P(Z \in A).$$

Lemma 2.16 We have $|f_A(i) - f_A(j)| \le 1/p$.

Proof. It's easy to check that the solution is

$$f_A(k) = P(Z \in A, Z \ge k)/P(Z = k) - P(Z \in A)/p$$

and, since neither of the two terms in the difference above can be larger than $1/p$, the lemma follows.

Theorem 2.17 Given random variables $W$ and $V$ such that $V =_d (W - 1 \mid W > 1)$, let $p = P(W = 1)$ and $Z \sim$ geometric($p$). Then

$$d_{TV}(W,Z) \le q p^{-1} P(W \ne V).$$

Proof.

$$|P(W \in A) - P(Z \in A)| = |E[f_A(W) - q f_A(W+1)]|$$
$$= |q E[f_A(W) \mid W > 1] - q E[f_A(W+1)]|$$
$$\le q E|f_A(1 + V) - f_A(1 + W)|$$
$$\le q p^{-1} P(W \ne V),$$

where the last inequality follows from the preceding lemma.

Example 2.18 Coin flipping. A coin has probability $p = 1 - q$ of coming up "heads" with each flip, and let $X_i = 1$ if the $i$th flip is heads and $X_i = 0$ if it is tails. We are interested in the distribution of the number of flips $M$ required until the start of the first appearance of a run of $k$ heads in a row. Suppose in particular we are interested in estimating $P(M \in A)$ for some set of integers $A$.
To do this, instead let $N$ be the number of flips required until the start of the first appearance of a tail followed by $k$ heads in a row. For instance, if $k = 3$, and the flip sequence is HHTTHHH, then $M = 5$ and $N = 4$. Note that in general we will have $N + 1 = M$ if $M > 1$ so consequently,

$$d_{TV}(M, N+1) \le P(N + 1 \ne M) = p^k.$$

We will first obtain a bound on the geometric approximation to $N$. Letting $A_i = \{X_i = 0, X_{i+1} = X_{i+2} = \cdots = X_{i+k} = 1\}$ be the event that a tail followed by $k$ heads in a row appears starting with flip number $i$, let $W = \min\{i \ge 2 : I_{A_i} = 1\} - 1$ and note that $N =_d W$.

Next generate $V$ as follows using a new sequence of coin flips $Y_1, Y_2, \ldots$ defined so that $Y_i = X_i$ for all $i > k+1$ and the vector

$$(Y_1, Y_2, \ldots, Y_{k+1}) =_d (X_1, X_2, \ldots, X_{k+1} \mid I_{A_1} = 0)$$

is generated to be independent of all else. Then let $B_i = \{Y_i = 0, Y_{i+1} = Y_{i+2} = \cdots = Y_{i+k} = 1\}$ be the event that a tail followed by $k$ heads in a row appears in this new sequence starting with flip number $i$. If $I_{A_1} = 0$ then let $V = W$, and otherwise let $V = \min\{i \ge 2 : I_{B_i} = 1\} - 1$. Note that this gives $V =_d (W - 1 \mid W > 1)$ and

$$P(W \ne V) \le \sum_{i=2}^{k+1} P(A_1 \cap B_i)$$
$$= P(A_1) \sum_{i=2}^{k+1} P(A_i \mid A_1^c)$$
$$= P(A_1) \sum_{i=2}^{k+1} P(A_i)/P(A_1^c)$$
$$\le k q p^{2k}.$$

Thus if $Z \sim$ geometric($qp^k$) then applying the previous theorem gives $d_{TV}(N,Z) \le kp^k$. The triangle inequality applies to $d_{TV}$ so we get

$$d_{TV}(M - 1, Z) \le d_{TV}(M - 1, N) + d_{TV}(N, Z) \le (k+1)p^k$$

and

$$P(Z \in A) - (k+1)p^k \le P(M - 1 \in A) \le P(Z \in A) + (k+1)p^k.$$

2.6 Stein's Method for the Normal Distribution

Let $Z \sim N(0,1)$ be a standard normal random variable. It can be shown that for smooth functions $f$ we have $E[f'(Z) - Zf(Z)] = 0$, and this inspires the following Stein equation.
Lemma 2.19 Given $a > 0$ and any value of $z$ let […]

[…] and a similar argument gives […] we get

$$|f''(x)| \le |h'(x)| + |(1 + x^2) f(x) + x(h(x) - E[h(Z)])|$$ […]

We finally use

$$|f'(x) - f'(y)| \le \int_{\min(x,y)}^{\max(x,y)} |f''(t)|\,dt$$ […]

to give the lemma.

Theorem 2.20 If $Z \sim N(0,1)$ and $W = \sum_{i=1}^n X_i$ where the $X_i$ are independent with mean 0 and $\mathrm{Var}(W) = 1$, then

$$\sup_z |P(W \le z) - P(Z \le z)| \le 2\sqrt{3 \sum_{i=1}^n E[|X_i|^3]}.$$

Proof. Given any $a > 0$ and any $z$, define $h$, $f$ as in the previous lemma. Then […] and then by choosing […] we get […]

Repeating the same argument starting with $P(Z \le z) - P(W \le z)$ […]

[…] and $W_n = \sum_{i=1}^n X_i$ then $W_n$ satisfies the conditions of Theorem 2.20. Because […] $\to 0$ it follows from Theorem 2.20 that $P(W_n \le z) \to P(Z \le z)$.

2.7 Exercises

1. If $X \sim$ Poisson($a$) and $Y \sim$ Poisson($b$), with $b > a$, use coupling to show that $Y \ge_{st} X$.

2. Suppose a particle starts at position 5 on a number line and at each time period the particle moves one position to the right with probability $p$ and, if the particle is above position 0, moves one position to the left with probability $1 - p$. Let $X_n(p)$ be the position of the particle at time $n$ for the given value of $p$. Use coupling to show that $X_n(a) \ge_{st} X_n(b)$ for any $n$ if $a \ge b$.

3. Let $X, Y$ be indicator variables with $E[X] = a$ and $E[Y] = b$. (a) Show how to construct a maximal coupling $\hat X, \hat Y$ for $X$ and $Y$, and then compute $P(\hat X = \hat Y)$ as a function of $a, b$. (b) Show how to construct a minimal coupling to minimize $P(\hat X = \hat Y)$.

4. In a room full of $n$ people, let $X$ be the number of people who share a birthday with at least one other person in the room. Then let $Y$ be the number of pairs of people in the room having the same birthday.
(a) Compute $E[X]$ and $\mathrm{Var}(X)$ and $E[Y]$ and $\mathrm{Var}(Y)$. (b) Which of the two variables $X$ or $Y$ do you believe will more closely follow a Poisson distribution? Why? (c) In a room of 51 people, it turns out there are 3 pairs with the same birthday and also a triplet (3 people) with the same birthday. This is a total of 9 people and also 6 pairs. Use a Poisson approximation to estimate $P(X \ge 9)$ and $P(Y \ge 6)$. Which of these two approximations do you think will be better? Have we observed a rare event here?

5. Compute a bound on the accuracy of the better approximation in the previous exercise part (c) using the Stein-Chen method.

6. For discrete $X, Y$ prove $d_{TV}(X,Y) = \frac{1}{2} \sum_x |P(X = x) - P(Y = x)|$.

7. For discrete $X, Y$ show that $P(X \ne Y) \ge d_{TV}(X,Y)$ and show also that there exists a coupling that yields equality.

8. Compute a bound on the accuracy of a normal approximation for a Poisson random variable with mean 100.

9. If $X \sim$ Geometric($p$), with $q = 1 - p$, then show that for any bounded function $f$ with $f(0) = 0$, we have $E[f(X) - q f(X+1)] = 0$.

10. Suppose $X_1, X_2, \ldots$ are independent mean zero random variables with $|X_n| < 1$ for all $n$ and $\sum_{i \le n} \mathrm{Var}(X_i)/n \to s < \infty$. If $S_n = \sum_{i \le n} X_i$ and $Z \sim N(0, s)$, show that $S_n/\sqrt{n} \to_d Z$.

11. Suppose $m$ balls are placed among $n$ urns, with each ball independently going in to urn $i$ with probability $p_i$. Assume $m$ is much larger than $n$. Approximate the chance none of the urns are empty, and give a bound on the error of the approximation.

12. A ring with a circumference of $c$ is cut into $n$ pieces (where $n$ is large) by cutting at $n$ places chosen uniformly at random around the ring. Estimate the chance you get $k$ pieces of length at least $a$, and give a bound on the error of the approximation.

13. Suppose $X_i$, $i = 1, 2, \ldots, 10$ are iid $U(0,1)$. Give an approximation for $P(\sum_{i=1}^{10} X_i > 7)$, and give a bound on the error of this approximation.

14. Suppose $X_i$, $i = 1, 2, \ldots, n$ are independent random variables with $E[X_i] = 0$, and $\sum_{i=1}^n \mathrm{Var}(X_i) = 1$.
Let $W = \sum_{i=1}^n X_i$ and show that

$$\big|E|W| - E|Z|\big| \le 3 \sum_{i=1}^n E|X_i|^3.$$

Chapter 3 Conditional Expectation and Martingales

3.1 Introduction

A generalization of a sequence of independent random variables occurs when we allow the variables of the sequence to be dependent on previous variables in the sequence. One example of this type of dependence is called a martingale, and its definition formalizes the concept of a fair gambling game. A number of results which hold for independent random variables also hold, under certain conditions, for martingales, and seemingly complex problems can be elegantly solved by reframing them in terms of a martingale.

In section 2 we introduce the notion of conditional expectation, in section 3 we formally define a martingale, in section 4 we introduce the concept of stopping times and prove the martingale stopping theorem, in section 5 we give an approach for finding tail probabilities for martingales, and in section 6 we introduce supermartingales and submartingales, and prove the martingale convergence theorem.

3.2 Conditional Expectation

Let $X$ be such that $E[|X|] < \infty$. In a first course in probability, $E[X|Y]$, the conditional expectation of $X$ given $Y$, is defined to be that function of $Y$ that when $Y = y$ is equal to

$$E[X \mid Y = y] = \begin{cases} \sum_x x P(X = x \mid Y = y), & \text{if } X, Y \text{ are discrete} \\ \int x f_{X|Y}(x|y)\,dx, & \text{if } X, Y \text{ have joint density } f \end{cases}$$

where

$$f_{X|Y}(x|y) = \frac{f(x,y)}{\int f(x,y)\,dx} = \frac{f(x,y)}{f_Y(y)}. \qquad (3.1)$$

The important result, often called the tower property, that

$$E[X] = E[E[X|Y]]$$

is then proven. This result, which is often written as

$$E[X] = \begin{cases} \sum_y E[X \mid Y = y] P(Y = y), & \text{if } X, Y \text{ are discrete} \\ \int E[X \mid Y = y] f_Y(y)\,dy, & \text{if } X, Y \text{ are jointly continuous} \end{cases}$$

is then gainfully employed in a variety of different calculations.

We now show how to give a more general definition of conditional expectation, that reduces to the preceding cases when the random variables are discrete or continuous.
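As a small check of the discrete form of the tower property stated above, the following sketch computes $E[X \mid Y = y]$ from a joint probability mass function and confirms that averaging it over $Y$ recovers $E[X]$. The joint distribution and all function names are hypothetical, chosen only for illustration; exact fractions are used so the comparison involves no rounding.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), keyed by (x, y); illustration only.
JOINT = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (2, 1): F(1, 4)}


def marginal_y(joint):
    """P(Y = y) for every y with positive probability."""
    py = {}
    for (_, y), prob in joint.items():
        py[y] = py.get(y, F(0)) + prob
    return py


def cond_exp_x_given_y(joint):
    """E[X | Y = y] = sum_x x P(X = x | Y = y) for each y."""
    py = marginal_y(joint)
    h = {y: F(0) for y in py}
    for (x, y), prob in joint.items():
        h[y] += x * prob / py[y]
    return h


def expectation_x(joint):
    """E[X] computed directly from the joint pmf."""
    return sum(x * prob for (x, _), prob in joint.items())


def tower_expectation(joint):
    """E[E[X | Y]] = sum_y E[X | Y = y] P(Y = y)."""
    py = marginal_y(joint)
    h = cond_exp_x_given_y(joint)
    return sum(h[y] * py[y] for y in py)


if __name__ == "__main__":
    print(cond_exp_x_given_y(JOINT))
    print(expectation_x(JOINT), tower_expectation(JOINT))
```

Both printed expectations agree, as the tower property requires for any joint pmf.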
To motivate our definition, suppose that whether or not $A$ occurs is determined by the value of $Y$. (That is, suppose that $A \in \sigma(Y)$.) Then, using material from our first course in probability, we see that

$$E[X I_A] = E[E[X I_A \mid Y]] = E[I_A E[X|Y]]$$

where the final equality holds because, given $Y$, $I_A$ is a constant random variable.

We are now ready for a general definition of conditional expectation.

Definition For random variables $X, Y$, let $E[X|Y]$, called the conditional expectation of $X$ given $Y$, denote that function $h(Y)$ having the property that for any $A \in \sigma(Y)$

$$E[X I_A] = E[h(Y) I_A].$$

By the Radon-Nikodym Theorem of measure theory, a function $h$ that makes $h(Y)$ a (measurable) random variable and satisfies the preceding always exists, and, as we show in the following, is unique in the sense that any two such functions of $Y$ must, with probability 1, be equal. The function $h$ is also referred to as a Radon-Nikodym derivative.

Proposition 3.1 If $h_1$ and $h_2$ are functions such that

$$E[h_1(Y) I_A] = E[h_2(Y) I_A]$$

for any $A \in \sigma(Y)$, then

$$P(h_1(Y) = h_2(Y)) = 1.$$

Proof. Let $A_n = \{h_1(Y) - h_2(Y) > 1/n\}$. Then,

$$0 = E[h_1(Y) I_{A_n}] - E[h_2(Y) I_{A_n}] = E[(h_1(Y) - h_2(Y)) I_{A_n}] \ge \frac{1}{n} P(A_n)$$

showing that $P(A_n) = 0$. Because the events $A_n$ are increasing in $n$, this yields

$$0 = \lim_n P(A_n) = P(\lim_n A_n) = P(\cup_n A_n) = P(h_1(Y) > h_2(Y)).$$

Similarly, we can show that $0 = P(h_1(Y) < h_2(Y))$, which proves the result.

We now show that the preceding general definition of conditional expectation reduces to the usual ones when $X, Y$ are either jointly discrete or continuous.

Proposition 3.2 If $X$ and $Y$ are both discrete, then

$$E[X \mid Y = y] = \sum_x x P(X = x \mid Y = y)$$

whereas if they are jointly continuous with joint density $f$, then

$$E[X \mid Y = y] = \int x f_{X|Y}(x|y)\,dx$$

where $f_{X|Y}(x|y) = \frac{f(x,y)}{\int f(x,y)\,dx}$.

Proof Suppose $X$ and $Y$ are both discrete, and let

$$h(y) = \sum_x x P(X = x \mid Y = y).$$

For $A \in \sigma(Y)$, define

$$B = \{y : I_A = 1 \text{ when } Y = y\}.$$

Because $A \in \sigma(Y)$, it follows that $I_A = I_{\{Y \in B\}}$.
Thus,

$$X I_A = \begin{cases} x, & \text{if } X = x, Y \in B \\ 0, & \text{otherwise} \end{cases}$$

and

$$h(Y) I_A = \begin{cases} \sum_x x P(X = x \mid Y = y), & \text{if } Y = y \in B \\ 0, & \text{otherwise.} \end{cases}$$

Thus,

$$E[X I_A] = \sum_x x P(X = x, Y \in B)$$
$$= \sum_x x \sum_{y \in B} P(X = x, Y = y)$$
$$= \sum_{y \in B} \sum_x x P(X = x \mid Y = y) P(Y = y)$$
$$= E[h(Y) I_A].$$

The result thus follows by uniqueness. The proof in the continuous case is similar, and is left as an exercise.

For any sigma field $F$ we can define $E[X|F]$ to be that random variable in $F$ having the property that for all $A \in F$

$$E[X I_A] = E[E[X|F] I_A].$$

Intuitively, $E[X|F]$ represents the conditional expectation of $X$ given that we know all of the events in $F$ that occur.

Remark 3.3 It follows from the preceding definition that $E[X|Y] = E[X|\sigma(Y)]$. For any random variables $X, X_1, \ldots, X_n$, define $E[X|X_1, \ldots, X_n]$ by

$$E[X|X_1, \ldots, X_n] = E[X|\sigma(X_1, \ldots, X_n)].$$

In other words, $E[X|X_1, \ldots, X_n]$ is that function $h(X_1, \ldots, X_n)$ for which

$$E[X I_A] = E[h(X_1, \ldots, X_n) I_A], \quad \text{for all } A \in \sigma(X_1, \ldots, X_n).$$

We now establish some important properties of conditional expectation.

Proposition 3.4
(a) The Tower Property: For any sigma field $F$, $E[X] = E[E[X|F]]$.
(b) For any $A \in F$, $E[X I_A \mid F] = I_A E[X|F]$.
(c) If $X$ is independent of all $Y \in F$, then $E[X|F] = E[X]$.
(d) If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, then $E[\sum_{i=1}^n X_i \mid F] = \sum_{i=1}^n E[X_i|F]$.
(e) Jensen's Inequality: If $f$ is a convex function, then $E[f(X)|F] \ge f(E[X|F])$ provided the expectations exist.

Proof Recall that $E[X|F]$ is the unique random variable in $F$ such that

$$E[X I_A] = E[E[X|F] I_A] \quad \text{if } A \in F.$$

Letting $A = \Omega$, $I_A = I_\Omega = 1$, and part (a) follows.

To prove (b), fix $A \in F$ and let $X^* = X I_A$. Because $E[X^*|F]$ is the unique random variable in $F$ such that $E[X^* I_{A'}] = E[E[X^*|F] I_{A'}]$ for all $A' \in F$, to show that $E[X^*|F] = I_A E[X|F]$, it suffices to show that for $A' \in F$

$$E[X^* I_{A'}] = E[I_A E[X|F] I_{A'}].$$

That is, it suffices to show that

$$E[X I_A I_{A'}] = E[I_A E[X|F] I_{A'}]$$

or, equivalently, that

$$E[X I_{AA'}] = E[E[X|F] I_{AA'}]$$

which, since $AA' \in F$, follows by the definition of conditional expectation.
Part (c) will follow if we can show that, for $A \in F$

$$E[X I_A] = E[E[X] I_A]$$

which follows because

$$E[X I_A] = E[X] E[I_A] \quad \text{(by independence)}$$
$$= E[E[X] I_A].$$

We will leave the proofs of (d) and (e) as exercises.

Remark 3.5 It can be shown that $E[X|F]$ satisfies all the properties of ordinary expectations, except that all probabilities are now computed conditional on knowing which events in $F$ have occurred. For instance, applying the tower property to $E[X|F]$ yields that

$$E[X|F] = E[E[X \mid F \cup G] \mid F].$$

While the conditional expectation of $X$ given the sigma field $F$ is defined to be that function satisfying

$$E[X I_A] = E[E[X|F] I_A] \quad \text{for all } A \in F$$

it can be shown, using the dominated convergence theorem, that

$$E[XW] = E[E[X|F] W] \quad \text{for all } W \in F. \qquad (3.2)$$

The following proposition is quite useful.

Proposition 3.6 If $W \in F$, then

$$E[XW|F] = W E[X|F].$$

Before giving a proof, let us note that the result is quite intuitive. Because $W \in F$, it follows that conditional on knowing which events of $F$ occur (that is, conditional on $F$) the random variable $W$ becomes a constant, and the expected value of a constant times a random variable is just the constant times the expected value of the random variable. Next we formally prove the result.

Proof. Let

$$Y = E[XW|F] - W E[X|F] = (X - E[X|F])W - (XW - E[XW|F])$$

and note that $Y \in F$. Now, for $A \in F$,

$$E[Y I_A] = E[(X - E[X|F]) W I_A] - E[(XW - E[XW|F]) I_A].$$

However, because $W I_A \in F$, we use Equation (3.2) to get

$$E[(X - E[X|F]) W I_A] = E[XW I_A] - E[E[X|F] W I_A] = 0$$

and by the definition of conditional expectation we have

$$E[(XW - E[XW|F]) I_A] = E[XW I_A] - E[E[XW|F] I_A] = 0.$$

Thus, we see that for $A \in F$,

$$E[Y I_A] = 0.$$

Setting first $A = \{Y > 0\}$, and then $A = \{Y < 0\}$ (which are both in $F$ because $Y \in F$) shows that

$$P(Y > 0) = P(Y < 0) = 0.$$

Hence, $Y = 0$, which proves the result.

3.3 Martingales

We say that the sequence of sigma fields $F_1, F_2, \ldots$ is a filtration if $F_1 \subseteq F_2 \subseteq \cdots$ We say a sequence of random variables $X_1, X_2, \ldots$
is adapted to $F_n$ if $X_n \in F_n$ for all $n$.

To obtain a feel for these definitions it is useful to think of $n$ as representing time, with information being gathered as time progresses. With this interpretation, the sigma field $F_n$ represents all events that are determined by what occurs up to time $n$, and thus contains $F_{n-1}$. The sequence $X_n$, $n \ge 1$, is adapted to the filtration $F_n$, $n \ge 1$, when the value of $X_n$ is determined by what occurs by time $n$.

Definition 3.7 $Z_n$ is a martingale for filtration $F_n$ if
1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $F_n$
3. $E[Z_{n+1}|F_n] = Z_n$

A bet is said to be fair if its expected gain is equal to 0. A martingale can be thought of as being a generalized version of a fair game. For consider a gambling casino in which bets can be made concerning games played in sequence. Let $F_n$ consist of all events whose outcome is determined by the results of the first $n$ games. Let $Z_n$ denote the fortune of a specified gambler after the first $n$ games. Then the martingale condition states that no matter what the results of the first $n$ games, the gambler's expected fortune after game $n+1$ is exactly what it was before that game. That is, no matter what has previously occurred, the gambler's expected winnings in any game is equal to 0.

It follows upon taking expectations of both sides of the final martingale condition that

$$E[Z_{n+1}] = E[Z_n]$$

implying that

$$E[Z_n] = E[Z_1].$$

We call $E[Z_1]$ the mean of the martingale.

Another useful martingale result is that

$$E[Z_{n+2}|F_n] = E[E[Z_{n+2} \mid F_{n+1} \cup F_n] \mid F_n]$$
$$= E[E[Z_{n+2}|F_{n+1}] \mid F_n]$$
$$= E[Z_{n+1}|F_n]$$
$$= Z_n.$$

Repeating this argument yields that

$$E[Z_{n+k}|F_n] = Z_n, \quad k \ge 1.$$

Definition 3.8 We say that $Z_n$, $n \ge 1$, is a martingale (without specifying a filtration) if
1. $E[|Z_n|] < \infty$
2. $E[Z_{n+1}|Z_1, \ldots, Z_n] = Z_n$

If $\{Z_n\}$ is a martingale for the filtration $F_n$, then it is a martingale. This follows from

$$E[Z_{n+1}|Z_1, \ldots, Z_n] = E[E[Z_{n+1}|Z_1, \ldots, Z_n, F_n] \mid Z_1, \ldots, Z_n]$$
$$= E[E[Z_{n+1}|F_n] \mid Z_1, \ldots, Z_n]$$
$$= E[Z_n|Z_1, \ldots, Z_n]$$
$$= Z_n$$
where the second equality followed because $Z_i \in F_n$, for all $i = 1, \ldots, n$.

We now give some examples of martingales.

Example 3.9 If $X_i$, $i \ge 1$, are independent zero mean random variables, then $Z_n = \sum_{i=1}^n X_i$, $n \ge 1$, is a martingale with respect to the filtration $\sigma(X_1, \ldots, X_n)$, $n \ge 1$. This follows because

$$E[Z_{n+1}|X_1, \ldots, X_n] = E[Z_n + X_{n+1}|X_1, \ldots, X_n]$$
$$= E[Z_n|X_1, \ldots, X_n] + E[X_{n+1}|X_1, \ldots, X_n]$$
$$= Z_n + E[X_{n+1}] \quad \text{(by the independence of the } X_i)$$
$$= Z_n.$$

Example 3.10 Let $X_i$, $i \ge 1$, be independent and identically distributed with mean zero and variance $\sigma^2$. Let $S_n = \sum_{i=1}^n X_i$, and define

$$Z_n = S_n^2 - n\sigma^2, \quad n \ge 1.$$

Then $Z_n$, $n \ge 1$, is a martingale for the filtration $\sigma(X_1, \ldots, X_n)$. To verify this claim, note that

$$E[S_{n+1}^2|X_1, \ldots, X_n] = E[(S_n + X_{n+1})^2|X_1, \ldots, X_n]$$
$$= E[S_n^2|X_1, \ldots, X_n] + E[2 S_n X_{n+1}|X_1, \ldots, X_n] + E[X_{n+1}^2|X_1, \ldots, X_n]$$
$$= S_n^2 + 2 S_n E[X_{n+1}|X_1, \ldots, X_n] + E[X_{n+1}^2]$$
$$= S_n^2 + 2 S_n E[X_{n+1}] + \sigma^2$$
$$= S_n^2 + \sigma^2.$$

Subtracting $(n+1)\sigma^2$ from both sides yields

$$E[Z_{n+1}|X_1, \ldots, X_n] = Z_n.$$

Our next example introduces an important type of martingale, known as a Doob martingale.

Example 3.11 Let $Y$ be an arbitrary random variable with $E[|Y|] < \infty$, let $F_n$, $n \ge 1$, be a filtration, and define

$$Z_n = E[Y|F_n].$$

We claim that $Z_n$, $n \ge 1$, is a martingale with respect to the filtration $F_n$, $n \ge 1$. To verify this, we must first show that $E[|Z_n|]$ […]

[…] $\ge 1$, is called a Doob martingale.

Example 3.12 Our final example generalizes the result that the successive partial sums of independent zero mean random variables constitute a martingale. For any random variables $X_i$, $i \ge 1$, the random variables $X_i - E[X_i|X_1, \ldots, X_{i-1}]$, $i \ge 1$, have mean 0. Even though they need not be independent, their partial sums constitute a martingale. That is, we claim that

$$Z_n = \sum_{i=1}^n (X_i - E[X_i|X_1, \ldots, X_{i-1}]), \quad n \ge 1$$

is, provided that $E[|Z_n|] < \infty$, a martingale with respect to the filtration $\sigma(X_1, \ldots, X_n)$, $n \ge 1$. To verify this claim, note that

$$Z_{n+1} = Z_n + X_{n+1} - E[X_{n+1}|X_1, \ldots, X_n].$$

Thus,

$$E[Z_{n+1}|X_1, \ldots, X_n] = Z_n + E[X_{n+1}|X_1, \ldots, X_n] - E[X_{n+1}|X_1, \ldots, X_n] = Z_n.$$
3.4 The Martingale Stopping Theorem

The positive integer valued, possibly infinite, random variable $N$ is said to be a random time for the filtration $F_n$ if $\{N = n\} \in F_n$ or, equivalently, if $\{N \ge n\} \in F_{n-1}$, for all $n$. If $P(N < \infty) = 1$, then the random time is said to be a stopping time for the filtration.

Thinking once again in terms of information being amassed over time, with $F_n$ being the cumulative information as to all events that have occurred by time $n$, the random variable $N$ will be a stopping time for this filtration if the decision whether to stop at time $n$ (so that $N = n$) depends only on what has occurred by time $n$. (That is, the decision to stop at time $n$ is not allowed to depend on the results of future events.) It should be noted, however, that the decision to stop at time $n$ need not be independent of future events, only that it must be conditionally independent of future events given all information up to the present.

Lemma 3.13 Let $Z_n$, $n \ge 1$, be a martingale for the filtration $F_n$. If $N$ is a random time for this filtration, then the process $\bar Z_n = Z_{\min(N,n)}$, $n \ge 1$, called the stopped process, is also a martingale for the filtration $F_n$.

Proof Start with the identity

$$\bar Z_n = \bar Z_{n-1} + I_{\{N \ge n\}}(Z_n - Z_{n-1}).$$

To verify the preceding, consider two cases:
(1) $N \ge n$: Here, $\bar Z_n = Z_n$, $\bar Z_{n-1} = Z_{n-1}$, $I_{\{N \ge n\}} = 1$, and the preceding is true.
(2) $N$ […]

[…]

$$E[\bar Z_{n-1} + I_{\{N \ge n\}}(Z_n - Z_{n-1}) \mid F_{n-1}] = \bar Z_{n-1} + I_{\{N \ge n\}} E[Z_n - Z_{n-1} \mid F_{n-1}] = \bar Z_{n-1}.$$

Theorem 3.14 The Martingale Stopping Theorem Let $Z_n$, $n \ge 1$, be a martingale for the filtration $F_n$, and suppose that $N$ is a stopping time for this filtration. Then

$$E[Z_N] = E[Z_1]$$

if any of the following three sufficient conditions hold.
(1) $\bar Z_n$ are uniformly bounded;
(2) $N$ is bounded;
(3) $E[N] < \infty$, and there exists $M < \infty$ such that $E[|Z_{n+1} - Z_n| \mid F_n] < M$.

[…] $P(N \ge i)$ […] $M E[N] < \infty$.

A corollary of the martingale stopping theorem is Wald's equation.
Corollary 3.15 Wald's Equation If $X_1, X_2, \ldots$ are independent and identically distributed with finite mean $\mu = E[X_i]$, and if $N$ is a stopping time for the filtration $F_n = \sigma(X_1, \ldots, X_n)$, $n \ge 1$, such that $E[N] < \infty$, then

$$E\Big[\sum_{i=1}^N X_i\Big] = \mu E[N].$$

Proof. $Z_n = \sum_{i=1}^n (X_i - \mu)$, $n \ge 1$, being the successive partial sums of independent zero mean random variables, is a martingale with mean 0. Hence, assuming the martingale stopping theorem can be applied we have that

$$0 = E[Z_N] = E\Big[\sum_{i=1}^N (X_i - \mu)\Big] = E\Big[\sum_{i=1}^N X_i - N\mu\Big] = E\Big[\sum_{i=1}^N X_i\Big] - E[N\mu].$$

To complete the proof, we verify sufficient condition (3) of the martingale stopping theorem.

$$E[|Z_{n+1} - Z_n| \mid F_n] = E[|X_{n+1} - \mu| \mid F_n]$$
$$= E[|X_{n+1} - \mu|] \quad \text{(by independence)}$$
$$\le E[|X_1|] + |\mu|.$$

Thus, condition (3) is verified and the result proved.

Example 3.16 Suppose independent and identically distributed discrete random variables $X_1, X_2, \ldots$, are observed in sequence. With $P(X_i = j) = p_j$, what is the expected number of random variables that must be observed until the subsequence 0, 1, 2, 0, 1 occurs? What is the variance?

Solution Consider a fair gambling casino - that is, one in which the expected casino winning for every bet is 0. Note that if a gambler bets her entire fortune of $a$ that the next outcome is $j$, then her fortune after the bet will either be 0 with probability $1 - p_j$, or $a/p_j$ with probability $p_j$. Now, imagine a sequence of gamblers betting at this casino. Each gambler starts with an initial fortune 1 and stops playing if his or her fortune ever becomes 0. Gambler $i$ bets 1 that $X_i = 0$; if she wins, she bets her entire fortune (of $1/p_0$) that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her
entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $(p_0^2 p_1^2 p_2)^{-1}$.

Let $Z_n$ denote the casino's winnings after the data value $X_n$ is observed; because it is a fair casino, $Z_n$, $n \ge 1$, is a martingale with mean 0 with respect to the filtration $\sigma(X_1, \ldots, X_n)$, $n \ge 1$. Let $N$ denote the number of random variables that need be observed until the pattern 0, 1, 2, 0, 1 appears - so $(X_{N-4}, \ldots, X_N) = (0, 1, 2, 0, 1)$. As it is easy to verify that $N$ is a stopping time for the filtration, and that condition (3) of the martingale stopping theorem is satisfied when $M = 4/(p_0^2 p_1^2 p_2)$, it follows that $E[Z_N] = 0$. However, after $X_N$ has been observed, each of the gamblers $1, \ldots, N-5$ would have lost 1; gambler $N-4$ would have won $(p_0^2 p_1^2 p_2)^{-1} - 1$; gamblers $N-3$ and $N-2$ would each have lost 1; gambler $N-1$ would have won $(p_0 p_1)^{-1} - 1$; gambler $N$ would have lost 1. Therefore,

$$Z_N = N - (p_0^2 p_1^2 p_2)^{-1} - (p_0 p_1)^{-1}.$$

Using that $E[Z_N] = 0$ yields the result

$$E[N] = (p_0^2 p_1^2 p_2)^{-1} + (p_0 p_1)^{-1}.$$

In the same manner we can compute the expected time until any specified pattern occurs in independent and identically distributed generated random data. For instance, when making independent flips of a coin that comes up heads with probability $p$, the mean number of flips until the pattern HHTTHH appears is $p^{-4} q^{-2} + p^{-2} + p^{-1}$, where $q = 1 - p$.

To determine $\mathrm{Var}(N)$, suppose now that gambler $i$ starts with an initial fortune $i$ and bets that amount that $X_i = 0$; if she wins, she bets her entire fortune that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $i/(p_0^2 p_1^2 p_2)$.
The casino's winnings at time $N$ are thus
\[ Z_N = 1 + 2 + \cdots + N - \frac{N-4}{p_0^2 p_1^2 p_2} - \frac{N-1}{p_0 p_1} = \frac{N(N+1)}{2} - \frac{N-4}{p_0^2 p_1^2 p_2} - \frac{N-1}{p_0 p_1} \]
Assuming that the martingale stopping theorem holds (although none of the three sufficient conditions hold, the stopping theorem can still be shown to be valid for this martingale), we obtain upon taking expectations that
\[ E[N^2] + E[N] = 2\,\frac{E[N] - 4}{p_0^2 p_1^2 p_2} + 2\,\frac{E[N] - 1}{p_0 p_1} \]
Using the previously obtained value of $E[N]$, the preceding can now be solved for $E[N^2]$, to obtain $\mathrm{Var}(N) = E[N^2] - (E[N])^2$. ∎

Example 3.17 The cards from a shuffled deck of 26 red and 26 black cards are to be turned over one at a time. At any time a player can say "next", and is a winner if the next card is red and is a loser otherwise. A player who has not yet said "next" when only a single card remains is a winner if the final card is red, and is a loser otherwise. What is a good strategy for the player?

Solution Every strategy has probability 1/2 of resulting in a win. To see this, let $R_n$ denote the number of red cards remaining in the deck after $n$ cards have been shown. Then
\[ E[R_{n+1} \mid R_1, \ldots, R_n] = R_n - \frac{R_n}{52 - n} = R_n\,\frac{51 - n}{52 - n} \]
Hence, $\frac{R_n}{52 - n}$, $n \ge 0$, is a martingale. Because $R_0/52 = 1/2$, this martingale has mean 1/2. Now, consider any strategy, and let $N$ denote the number of cards that are turned over before "next" is said. Because $N$ is a bounded stopping time, it follows from the martingale stopping theorem that
\[ E\Big[\frac{R_N}{52 - N}\Big] = 1/2 \]
Hence, with $I = I_{\{\text{win}\}}$,
\[ E[I] = E[E[I \mid R_N]] = E\Big[\frac{R_N}{52 - N}\Big] = 1/2 \]
∎

Our next example involves the matching problem.

Example 3.18 Each member of a group of $n$ individuals throws his or her hat in a pile. The hats are then mixed together, and each person randomly selects a hat in such a manner that each of the $n!$ possible selections of the $n$ individuals are equally likely. Any individual who selects his or her own hat departs, and that is the end of round one.
If any individuals remain, then each throws the hat they have in a pile, and each one then randomly chooses a hat. Those selecting their own hats leave, and that ends round two. Find $E[N]$, where $N$ is the number of rounds until everyone has departed.

Solution Let $X_i, i \ge 1$, denote the number of matches on round $i$ for $i = 1, \ldots, N$, and let it equal 1 for $i > N$. To solve this example, we will use the zero mean martingale $Z_k, k \ge 1$, defined by
\[ Z_k = \sum_{i=1}^{k} (X_i - E[X_i \mid X_1, \ldots, X_{i-1}]) = \sum_{i=1}^{k} (X_i - 1) \]
where the final equality follows because, for any number of remaining individuals, the expected number of matches in a round is 1 (which is seen by writing this as the sum of indicator variables for the events that each remaining person has a match). Because
\[ N = \min\Big\{k : \sum_{i=1}^{k} X_i = n\Big\} \]
is a stopping time for this martingale, we obtain from the martingale stopping theorem that
\[ 0 = E[Z_N] = E\Big[\sum_{i=1}^{N} X_i - N\Big] = n - E[N] \]
and so $E[N] = n$. ∎

Example 3.19 If $X_1, X_2, \ldots$ is a sequence of independent and identically distributed random variables, $P(X_1 = 0) \ne 1$, then the process
\[ S_n = \sum_{i=1}^{n} X_i, \quad n \ge 1 \]
is said to be a random walk. For given positive constants $a, b$, let $p$ denote the probability that the random walk becomes as large as $a$ before it becomes as small as $-b$.

We now show how to use martingale theory to approximate $p$. In the case where we have a bound $|X_i| < c$ it can be shown there will be a value $\theta \ne 0$ such that
\[ E[e^{\theta X}] = 1 \]
Then because
\[ Z_n = e^{\theta S_n} = \prod_{i=1}^{n} e^{\theta X_i} \]
is the product of independent random variables with mean 1, it follows that $Z_n, n \ge 1$, is a martingale having mean 1. Let
\[ N = \min(n : S_n \ge a \text{ or } S_n \le -b) \]
Condition (3) of the martingale stopping theorem can be shown to hold, implying that
\[ E[e^{\theta S_N}] = 1 \]
Thus,
\[ 1 = E[e^{\theta S_N} \mid S_N \ge a]\,p + E[e^{\theta S_N} \mid S_N \le -b]\,(1 - p) \]
Now, if $\theta > 0$, then
\[ e^{\theta a} \le E[e^{\theta S_N} \mid S_N \ge a] \le e^{\theta(a + c)} \]
and
\[ e^{-\theta(b + c)} \le E[e^{\theta S_N} \mid S_N \le -b] \le e^{-\theta b} \]
yielding the bounds
\[ \frac{1 - e^{-\theta b}}{e^{\theta(a + c)} - e^{-\theta b}} \le p \le \frac{1 - e^{-\theta(b + c)}}{e^{\theta a} - e^{-\theta(b + c)}} \]
and motivating the approximation
\[ p \approx \frac{1 - e^{-\theta b}}{e^{\theta a} - e^{-\theta b}} \]
We leave it as an exercise to obtain bounds on $p$ when $\theta < 0$. ∎
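The bounds and approximation of Example 3.19 can be checked numerically. The following is a minimal Python sketch, not part of the original text. The function names, the bisection solver for $\theta$, and the choice of a walk with steps $+1$ (probability 0.4) and $-1$ (probability 0.6) are all illustrative assumptions. It takes $c$ to be a bound with $|X_i| \le c$ and assumes $E[X] < 0$, so that the root $\theta$ is positive. For unit steps and integer $a, b$ there is no overshoot, so the approximation should agree with the classical gambler's ruin probability.

```python
import math
import random


def solve_theta(steps):
    """Positive root of E[exp(theta * X)] = 1 for a discrete step law.

    `steps` maps step value -> probability. Assumes E[X] < 0 and
    P(X > 0) > 0, so the convex function below has a unique positive root.
    """
    def mgf_minus_one(theta):
        return sum(p * math.exp(theta * x) for x, p in steps.items()) - 1.0

    hi = 1.0
    while mgf_minus_one(hi) <= 0:      # move right until the MGF exceeds 1
        hi *= 2.0
    lo = hi
    while mgf_minus_one(lo) >= 0:      # move left until the MGF is below 1
        lo /= 2.0
    for _ in range(100):               # plain bisection on [lo, hi]
        mid = (lo + hi) / 2.0
        if mgf_minus_one(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def hit_probability_bounds(theta, a, b, c):
    """Lower bound, approximation, and upper bound on p for theta > 0."""
    lower = (1 - math.exp(-theta * b)) / (
        math.exp(theta * (a + c)) - math.exp(-theta * b))
    approx = (1 - math.exp(-theta * b)) / (
        math.exp(theta * a) - math.exp(-theta * b))
    upper = (1 - math.exp(-theta * (b + c))) / (
        math.exp(theta * a) - math.exp(-theta * (b + c)))
    return lower, approx, upper


def simulate_hit_probability(steps, a, b, trials, seed=0):
    """Monte Carlo estimate of P(walk reaches a before -b)."""
    rng = random.Random(seed)
    values = list(steps)
    weights = [steps[v] for v in values]
    hits = 0
    for _ in range(trials):
        s = 0
        while -b < s < a:
            s += rng.choices(values, weights)[0]
        hits += s >= a
    return hits / trials


if __name__ == "__main__":
    steps = {1: 0.4, -1: 0.6}          # illustrative step distribution
    theta = solve_theta(steps)
    print("theta:", theta)
    print("bounds:", hit_probability_bounds(theta, a=3, b=5, c=1))
    print("simulated:", simulate_hit_probability(steps, 3, 5, trials=20000))
```

With steps that can overshoot the barriers (for example values in $\{-2, -1, 1, 2\}$), the approximation is no longer exact and the gap between the lower and upper bounds shows how much the overshoot can matter.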
Our next example involves Doob's backward martingale. Before defining this martingale, we need the following definition.

Definition 3.20 The random variables $X_1, \ldots, X_n$ are said to be exchangeable if $X_{i_1}, \ldots, X_{i_n}$ has the same probability distribution for every permutation $i_1, \ldots, i_n$ of $1, \ldots, n$.

Suppose that $X_1, \ldots, X_n$ are exchangeable. Assume $E[|X_1|] < \infty$, let
\[ S_j = \sum_{i=1}^{j} X_i, \quad j = 1, \ldots, n \]
and consider the Doob martingale $Z_1, \ldots, Z_n$, given by
\[ Z_1 = E[X_1 \mid S_n], \qquad Z_j = E[X_1 \mid S_n, S_{n-1}, \ldots, S_{n+1-j}], \quad j = 1, \ldots, n \]
However,
\[ S_{n+1-j} = E[S_{n+1-j} \mid S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \]
\[ = \sum_{i=1}^{n+1-j} E[X_i \mid S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \]
\[ = (n + 1 - j)\,E[X_1 \mid S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \quad \text{(by exchangeability)} \]
\[ = (n + 1 - j)\,Z_j \]
where the final equality follows since knowing $S_n, S_{n-1}, \ldots, S_{n+1-j}$ is equivalent to knowing $S_{n+1-j}, X_{n+2-j}, \ldots, X_n$.

The martingale
\[ Z_j = \frac{S_{n+1-j}}{n + 1 - j}, \quad j = 1, \ldots, n \]
is called the Doob backward martingale. We now apply it to solve the ballot problem.

Example 3.21 In an election between candidates A and B, candidate A receives $n$ votes and candidate B receives $m$ votes, where $n > m$. Assuming, in the count of the votes, that all orderings of the $n + m$ votes are equally likely, what is the probability that A is always ahead in the count of the votes?

Solution Let $X_i$ equal 1 if the $i$th vote counted is for A and let it equal $-1$ if that vote is for B. Because all orderings of the $n + m$ votes are assumed to be equally likely, it follows that the random variables $X_1, \ldots, X_{n+m}$ are exchangeable, and $Z_1, \ldots, Z_{n+m}$ is a Doob backward martingale when
\[ Z_j = \frac{S_{n+m+1-j}}{n + m + 1 - j} \]
where $S_k = \sum_{i=1}^{k} X_i$. Because $Z_1 = S_{n+m}/(n + m) = (n - m)/(n + m)$, the mean of this martingale is $(n - m)/(n + m)$. Because $n > m$, A will always be ahead in the count of the vote unless there is a tie at some point, which will occur if one of the $S_j$ - or equivalently, one of the $Z_j$ - is equal to 0.
Consequently, define the bounded stopping time $N$ by
\[ N = \min\{j : Z_j = 0 \text{ or } j = n + m\} \]
Because $Z_{n+m} = X_1$, it follows that $Z_N$ will equal 0 if the candidates are ever tied, and will equal $X_1$ if A is always ahead. However, if A is always ahead, then A must receive the first vote; therefore,
\[ Z_N = \begin{cases} 1, & \text{if A is always ahead} \\ 0, & \text{otherwise} \end{cases} \]
By the martingale stopping theorem, $E[Z_N] = (n - m)/(n + m)$, yielding the result
\[ P(\text{A is always ahead}) = \frac{n - m}{n + m} \]
∎

3.5 The Hoeffding-Azuma Inequality

Let $Z_n, n \ge 0$, be a martingale with respect to the filtration $F_n$. If the differences $Z_n - Z_{n-1}$ can be shown to lie in a bounded random interval of the form $[-B_n, -B_n + d_n]$ where $B_n \in F_{n-1}$ and $d_n$ is constant, then the Hoeffding-Azuma inequality often yields useful bounds on the tail probabilities of $Z_n$. Before presenting the inequality we will need a couple of lemmas.

Lemma 3.22 If $E[X] = 0$, and $P(-\alpha \le X \le \beta) = 1$, then for any convex function $f$
\[ E[f(X)] \le \frac{\beta}{\alpha + \beta} f(-\alpha) + \frac{\alpha}{\alpha + \beta} f(\beta) \]

Proof Because $f$ is convex it follows that, in the region $-\alpha \le x \le \beta$, [...]

[...] Thus, if the Lemma were false, then the function
\[ f(\alpha, \beta) = e^{\beta} + e^{-\beta} + \alpha(e^{\beta} - e^{-\beta}) - 2e^{\alpha\beta + \beta^2/2} \]
would assume a strictly positive maximum in the interior of the region $R = \{(\alpha, \beta) : |\alpha| \le 1, |\beta| \le 100\}$. Setting the partial derivatives of $f$ equal to 0, we obtain
\[ e^{\beta} - e^{-\beta} + \alpha(e^{\beta} + e^{-\beta}) = 2(\alpha + \beta)e^{\alpha\beta + \beta^2/2} \quad (3.3) \]
\[ e^{\beta} - e^{-\beta} = 2\beta e^{\alpha\beta + \beta^2/2} \quad (3.4) \]
We will now prove the lemma by showing that any solution of (3.3) and (3.4) must have $\beta = 0$. However, because $f(\alpha, 0) = 0$, this would contradict the hypothesis that $f$ assumes a strictly positive maximum in $R$, thus establishing the lemma.

So, assume that there is a solution of (3.3) and (3.4) in which $\beta \ne 0$. Now note that there is no solution of these equations for which $\alpha = 0$ and $\beta \ne 0$. For if there were such a solution, then (3.4) would say that
\[ e^{\beta} - e^{-\beta} = 2\beta e^{\beta^2/2} \quad (3.5) \]
But expanding in a power series about 0 shows that (3.5) is equivalent to
\[ \sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i+1)!} = \sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{2^i\, i!} \]
which (because $(2i + 1)! > 2^i\, i!$ when $i > 0$) is clearly impossible when $\beta \ne 0$. Thus, any solution of (3.3) and (3.4) in which $\beta \ne 0$ will also have $\alpha \ne 0$. Assuming such a solution gives, upon dividing (3.3) by (3.4), that
\[ 1 + \alpha\,\frac{e^{\beta} + e^{-\beta}}{e^{\beta} - e^{-\beta}} = 1 + \frac{\alpha}{\beta} \]
Because $\alpha \ne 0$, the preceding is equivalent to
\[ \beta(e^{\beta} + e^{-\beta}) = e^{\beta} - e^{-\beta} \]
or, expanding in a Taylor series,
\[ \sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i)!} = \sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i+1)!} \]
which is clearly not possible when $\beta \ne 0$. Thus, there is no solution of (3.3) and (3.4) in which $\beta \ne 0$, thus proving the result. ∎

We are now ready for the Hoeffding-Azuma inequality.

Theorem 3.24 The Hoeffding-Azuma Inequality Let $Z_n, n \ge 1$, be a martingale with mean $\mu$ with respect to the filtration $F_n$. Let $Z_0 = \mu$ and suppose there exist nonnegative random variables $B_n, n > 0$, where $B_n \in F_{n-1}$, and positive constants $d_n, n > 0$, such that
\[ -B_n \le Z_n - Z_{n-1} \le -B_n + d_n \]
Then, for $n > 0$, $a > 0$,
\[ \text{(i)} \quad P(Z_n - \mu \ge a) \le e^{-2a^2 / \sum_{i=1}^{n} d_i^2} \]
\[ \text{(ii)} \quad P(Z_n - \mu \le -a) \le e^{-2a^2 / \sum_{i=1}^{n} d_i^2} \]

Proof Suppose that $\mu = 0$. For any $c > 0$
\[ P(Z_n \ge a) = P(e^{cZ_n} \ge e^{ca}) \le e^{-ca} E[e^{cZ_n}] \quad (3.6) \]
where the inequality is Markov's inequality. Let $W_n = e^{cZ_n}$. Note that $W_0 = 1$, and that for $n > 0$
\[ E[W_n \mid F_{n-1}] = E[e^{cZ_{n-1}} e^{c(Z_n - Z_{n-1})} \mid F_{n-1}] = e^{cZ_{n-1}} E[e^{c(Z_n - Z_{n-1})} \mid F_{n-1}] = W_{n-1} E[e^{c(Z_n - Z_{n-1})} \mid F_{n-1}] \quad (3.7) \]
where the second equality used that $Z_{n-1} \in F_{n-1}$. Because
(i) $f(x) = e^{cx}$ is convex;
(ii) $E[Z_n - Z_{n-1} \mid F_{n-1}] = E[Z_n \mid F_{n-1}] - E[Z_{n-1} \mid F_{n-1}] = Z_{n-1} - Z_{n-1} = 0$; and
(iii) $-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$,
it follows from Lemma 3.22, with $\alpha = B_n$, $\beta = -B_n + d_n$, that
\[ E[e^{c(Z_n - Z_{n-1})} \mid F_{n-1}] \le E\Big[\frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \,\Big|\, F_{n-1}\Big] = \frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \]
where the final equality used that $B_n \in F_{n-1}$. Hence, from Equation (3.7), we see that
\[ E[W_n \mid F_{n-1}] \le W_{n-1}\,\frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \le W_{n-1}\, e^{c^2 d_n^2 / 8} \]
where the final inequality used Lemma 3.23 (with $p = B_n/d_n$, $t = cd_n$).
Taking expectations gives
\[ E[W_n] \le E[W_{n-1}]\, e^{c^2 d_n^2 / 8} \]
Using that $E[W_0] = 1$ yields, upon iterating this inequality, that
\[ E[W_n] \le \prod_{i=1}^{n} e^{c^2 d_i^2 / 8} = e^{c^2 \sum_{i=1}^{n} d_i^2 / 8} \]
Therefore, from Equation (3.6), we obtain that for any $c > 0$
\[ P(Z_n \ge a) \le \exp\Big(-ca + c^2 \sum_{i=1}^{n} d_i^2 / 8\Big) \]
Letting $c = 4a / \sum_{i=1}^{n} d_i^2$ (which is the value of $c$ that minimizes $-ca + c^2 \sum_{i=1}^{n} d_i^2 / 8$) gives that
\[ P(Z_n \ge a) \le e^{-2a^2 / \sum_{i=1}^{n} d_i^2} \]
Parts (i) and (ii) of the Hoeffding-Azuma inequality now follow from applying the preceding, first to the zero mean martingale $\{Z_n - \mu\}$, and secondly to the zero mean martingale $\{\mu - Z_n\}$. ∎

Example 3.25 Let $X_i, i \ge 1$, be independent Bernoulli random variables with means $p_i$, $i = 1, \ldots, n$. Then
\[ Z_n = \sum_{i=1}^{n} (X_i - p_i) = S_n - \sum_{i=1}^{n} p_i, \quad n \ge 0 \]
is a martingale with mean 0. Because $Z_n - Z_{n-1} = X_n - p_n$, we see that
\[ -p_n \le Z_n - Z_{n-1} \le 1 - p_n \]
Thus, by the Hoeffding-Azuma inequality ($B_n = p_n$, $d_n = 1$), we see that for $a > 0$
\[ P\Big(S_n - \sum_{i=1}^{n} p_i \ge a\Big) \le e^{-2a^2/n} \]
\[ P\Big(S_n - \sum_{i=1}^{n} p_i \le -a\Big) \le e^{-2a^2/n} \]
The preceding inequalities are often called Chernoff bounds. ∎

The Hoeffding-Azuma inequality is often applied to a Doob type martingale. The following corollary is often used.

Corollary 3.26 Let $h$ be such that if the vectors $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ differ in at most one coordinate (that is, for some $k$, $x_i = y_i$ for all $i \ne k$) then
\[ |h(\mathbf{x}) - h(\mathbf{y})| \le 1 \]
Then, for a vector of independent random variables $\mathbf{X} = (X_1, \ldots, X_n)$, and $a > 0$
\[ P(h(\mathbf{X}) - E[h(\mathbf{X})] \ge a) \le e^{-2a^2/n} \]
\[ P(h(\mathbf{X}) - E[h(\mathbf{X})] \le -a) \le e^{-2a^2/n} \]

[...]

\[ P(N - (n - 3)p_3 p_4 p_5 p_6 \ge a) \le e^{-2a^2/(n-3)} \]
\[ P(N - (n - 3)p_3 p_4 p_5 p_6 \le -a) \le e^{-2a^2/(n-3)} \]
∎

Example 3.28 Suppose that $n$ balls are to be placed in $m$ urns, with each ball independently going into urn $i$ with probability $p_i$, $\sum_{i=1}^{m} p_i = 1$. Find bounds on the tail probability of $Y_k$, equal to the number of urns that receive exactly $k$ balls.

Solution First note that
\[ E[Y_k] = E\Big[\sum_{i=1}^{m} I\{\text{urn } i \text{ has exactly } k \text{ balls}\}\Big] = \sum_{i=1}^{m} \binom{n}{k} p_i^k (1 - p_i)^{n-k} \]
Let $X_j$ denote the urn in which ball $j$ is put, $j = 1, \ldots, n$.
Also, let $h_k(x_1, \ldots, x_n)$ denote the number of urns that receive exactly $k$ balls when $X_i = x_i$, $i = 1, \ldots, n$, and note that $Y_k = h_k(X_1, \ldots, X_n)$. When $k = 0$ it is easy to see that $h_0$ satisfies the condition that if $\mathbf{x}$ and $\mathbf{y}$ differ in at most one coordinate, then $|h_0(\mathbf{x}) - h_0(\mathbf{y})| \le 1$. Therefore, from Corollary 3.26 we obtain, for $a > 0$, that
\[ P\Big(Y_0 - \sum_{i=1}^{m} (1 - p_i)^n \ge a\Big) \le e^{-2a^2/n} \]
\[ P\Big(Y_0 - \sum_{i=1}^{m} (1 - p_i)^n \le -a\Big) \le e^{-2a^2/n} \]
For $0 < k < n$ it is no longer true that if $\mathbf{x}$ and $\mathbf{y}$ differ in at most one coordinate, then $|h_k(\mathbf{x}) - h_k(\mathbf{y})| \le 1$. This is so because the one different value could result in one of the vectors having one less and the other having one more urn with $k$ balls than would have resulted if that coordinate was not included. Thus, if $\mathbf{x}$ and $\mathbf{y}$ differ in at most one coordinate, then
\[ |h_k(\mathbf{x}) - h_k(\mathbf{y})| \le 2 \]
showing that $h_k^*(\mathbf{x}) = h_k(\mathbf{x})/2$ satisfies the condition of Corollary 3.26. Because
\[ P(Y_k - E[Y_k] \ge a) = P(h_k^*(\mathbf{X}) - E[h_k^*(\mathbf{X})] \ge a/2) \]
we obtain, for $a > 0$, $0 < k < n$, that
\[ P\Big(Y_k - \sum_{i=1}^{m} \binom{n}{k} p_i^k (1 - p_i)^{n-k} \ge a\Big) \le e^{-a^2/2n} \]
\[ P\Big(Y_k - \sum_{i=1}^{m} \binom{n}{k} p_i^k (1 - p_i)^{n-k} \le -a\Big) \le e^{-a^2/2n} \]
∎

3.6 Sub and Supermartingales, and Convergence Theorem

[...] $Z_n, n \ge 1$, is said to be a submartingale for the filtration $F_n$ if
1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $F_n$
3. $E[Z_{n+1} \mid F_n] \ge Z_n$
If 3 is replaced by $E[Z_{n+1} \mid F_n] \le Z_n$, then $Z_n, n \ge 1$, is said to be a supermartingale.

It follows from the tower property that
\[ E[Z_{n+1}] \ge E[Z_n] \]
for a submartingale, with the inequality reversed for a supermartingale. (Of course, if $Z_n, n \ge 1$, is a submartingale, then $-Z_n, n \ge 1$, is a supermartingale, and vice-versa.)

The analogs of the martingale stopping theorem remain valid for submartingales and supermartingales. We leave the proof of the following theorem as an exercise.

Theorem 3.30 If $N$ is a stopping time for the filtration $F_n$, then
\[ E[Z_N] \ge E[Z_1] \quad \text{for a submartingale} \]
\[ E[Z_N] \le E[Z_1] \quad \text{for a supermartingale} \]
provided that any of the sufficient conditions of Theorem 3.14 hold.

One of the most useful results about submartingales is the Kolmogorov inequality.
Before presenting it, we need a couple of lemmas.

Lemma 3.31 If $Z_n, n \ge 1$, is a submartingale for the filtration $F_n$, and $N$ is a stopping time for this filtration such that $P(N \le n) = 1$, [...] Now,
\[ E[Z_n \mid F_k, N = k] = E[Z_n \mid F_k] \ge Z_k = Z_N \]
Taking expectations of this inequality completes the proof. ∎

Lemma 3.32 If $Z_n, n \ge 1$, is a martingale with respect to the filtration $F_n, n \ge 1$, and $f$ a convex function for which $E[|f(Z_n)|] < \infty$, then $f(Z_n), n \ge 1$, is a submartingale with respect to the filtration $F_n, n \ge 1$.

Proof.
\[ E[f(Z_{n+1}) \mid F_n] \ge f(E[Z_{n+1} \mid F_n]) \quad \text{(by Jensen's inequality)} \quad = f(Z_n) \]
∎

Theorem 3.33 Kolmogorov's Inequality for Submartingales Suppose $Z_n, n \ge 1$, is a nonnegative submartingale, then for $a > 0$
\[ P(\max\{Z_1, \ldots, Z_n\} \ge a) \le E[Z_n]/a \]

Proof Let $N$ be the smallest $i$, $i \le n$, such that $Z_i \ge a$, and let it equal $n$ if $Z_i$ [...]

Corollary 3.34 If $Z_n, n \ge 1$, is a martingale, then for $a > 0$
\[ P(\max\{|Z_1|, \ldots, |Z_n|\} \ge a) \le \min(E[|Z_n|]/a,\; E[Z_n^2]/a^2) \]

Proof Noting that
\[ P(\max\{|Z_1|, \ldots, |Z_n|\} \ge a) = P(\max\{Z_1^2, \ldots, Z_n^2\} \ge a^2) \]
the corollary follows from Lemma 3.32 and Kolmogorov's inequality for submartingales upon using that $f(x) = |x|$ and $f(x) = x^2$ are convex functions. ∎

Theorem 3.35 The Martingale Convergence Theorem Let $Z_n, n \ge 1$, be a martingale. If there is $M < \infty$ such that $E[|Z_n|]$ [...]

[...] $Z_n^2, n \ge 1$, is a submartingale, yielding that $E[Z_n^2]$ is nondecreasing. Because $E[Z_n^2]$ is bounded, it follows that it converges; let $\mu < \infty$ be given by
\[ \mu = \lim_{n \to \infty} E[Z_n^2] \]
We now argue that $\lim_{n \to \infty} Z_n$ exists and is finite by showing that, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence. That is, we will show that, with probability 1,
\[ |Z_{m+k} - Z_m| \to 0 \quad \text{as } m, k \to \infty \]
Using that $Z_{m+k} - Z_m, k \ge 1$, is a martingale, it follows that $(Z_{m+k} - Z_m)^2, k \ge 1$, is a submartingale. Thus, by Kolmogorov's inequality,
\[ P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) = P\Big(\max_{k=1,\ldots,n} (Z_{m+k} - Z_m)^2 > \epsilon^2\Big) \le E[(Z_{m+n} - Z_m)^2]/\epsilon^2 \]
\[ = E[Z_{n+m}^2 - 2Z_m Z_{n+m} + Z_m^2]/\epsilon^2 \]
However,
\[ E[Z_m Z_{n+m}] = E[E[Z_m Z_{n+m} \mid Z_m]] = E[Z_m E[Z_{n+m} \mid Z_m]] = E[Z_m^2] \]
Therefore,
\[ P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) \le (E[Z_{n+m}^2] - E[Z_m^2])/\epsilon^2 \]
Letting $n \to \infty$ now yields
\[ P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \le (\mu - E[Z_m^2])/\epsilon^2 \]
Thus,
\[ P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \to 0 \quad \text{as } m \to \infty \]
Therefore, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence, and so has a finite limit. ∎

As a consequence of the martingale convergence theorem we obtain the following.

Corollary 3.36 If $Z_n, n \ge 1$, is a nonnegative martingale then, with probability 1, $\lim_{n \to \infty} Z_n$ exists and is finite.

Proof Because $Z_n$ is nonnegative
\[ E[|Z_n|] = E[Z_n] = E[Z_1] < \infty \]
∎

Example 3.37 A branching process follows the size of a population over succeeding generations. It supposes that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j$, $j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n + 1$. Let $X_n$ denote the number of individuals in generation $n$. Assuming that $m = \sum_j j p_j$, the mean number of offspring of an individual, is finite, it is easy to verify that $Z_n = X_n/m^n$, $n \ge 0$, is a martingale. Because it is nonnegative, the preceding corollary implies that $\lim_{n} X_n/m^n$ exists and is finite. But this implies, when $m < 1$, that $\lim_n X_n = 0$; or, equivalently, that $X_n = 0$ for all $n$ sufficiently large. When $m > 1$, the implication is that the generation size either becomes 0 or converges to infinity at an exponential rate. ∎

3.7 Exercises

1. For $F = \{\emptyset, \Omega\}$, show that $E[X \mid F] = E[X]$.

2. Give the proof of Proposition 3.2 when $X$ and $Y$ are jointly continuous.

3. If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, show that
\[ E\Big[\sum_{i=1}^{n} X_i \,\Big|\, F\Big] = \sum_{i=1}^{n} E[X_i \mid F] \]

4. Prove that if $f$ is a convex function, then
\[ E[f(X) \mid F] \ge f(E[X \mid F]) \]
provided the expectations exist.

5. Let $X_1, X_2, \ldots$ be independent random variables with mean 1. Show that $Z_n = \prod_{i=1}^{n} X_i$, $n \ge 1$, is a martingale.
6. If $E[X_{n+1} \mid X_1, \ldots, X_n] = a_n X_n + b_n$ for constants $a_n, b_n, n \ge 0$, find constants $A_n, B_n$ so that $Z_n = A_n X_n + B_n$, $n \ge 0$, is a martingale with respect to the filtration $\sigma(X_0, \ldots, X_n)$.

7. Consider a population of individuals as it evolves over time, and suppose that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j$, $j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n + 1$. Assume that $m = \sum_j j p_j < \infty$. Let $X_n$ denote the number of individuals in generation $n$, and define a martingale related to $X_n, n \ge 0$. The process $X_n, n \ge 0$, is called a branching process.

8. Suppose $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean zero and finite variance $\sigma^2$. If $T$ is a stopping time with finite mean, show that
\[ \mathrm{Var}\Big(\sum_{i=1}^{T} X_i\Big) = \sigma^2 E(T) \]

9. Suppose $X_1, X_2, \ldots$ are independent and identically distributed mean 0 random variables which each take value $+1$ with probability 1/2 and take value $-1$ with probability 1/2. Let $S_n = \sum_{i=1}^{n} X_i$. Which of the following are stopping times? Compute the mean for the ones that are stopping times.
(a) $T_1 = \min\{i \ge 5 : S_i = S_{i-5} + 5\}$
(b) $T_2 = T_1 - 5$
(c) $T_3 = T_2 + 10$

10. Consider a sequence of independent flips of a coin, and let $P_h$ denote the probability of a head on any toss. Let A be the hypothesis that $P_h = a$ and let B be the hypothesis that $P_h = b$, for given values $0 < a, b < 1$. Let $X_i$ be the outcome of flip $i$, and set
\[ Z_n = \frac{P(X_1, \ldots, X_n \mid A)}{P(X_1, \ldots, X_n \mid B)} \]
If $P_h = b$, show that $Z_n, n \ge 1$, is a martingale having mean 1.

11. Let $Z_n, n \ge 0$, be a martingale with $Z_0 = 0$. Show that
\[ E[Z_n^2] = \sum_{i=1}^{n} E[(Z_i - Z_{i-1})^2] \]

12. Consider an individual who at each stage, independently of past movements, moves to the right with probability $p$ or to the left with probability $1 - p$. Assuming that $p > 1/2$, find the expected number of stages it takes the person to move $i$ positions to the right from where she started.
13. In Example 3.19 obtain bounds on $p$ when $\theta < 0$.

14. Use Wald's equation to approximate the expected time it takes a random walk to either become as large as $a$ or as small as $-b$, for positive $a$ and $b$. Give the exact expression if $a$ and $b$ are integers, and at each stage the random walk either moves up 1 with probability $p$ or moves down 1 with probability $1 - p$.

15. Consider a branching process that starts with a single individual. Let $\pi$ denote the probability this process eventually dies out. With $X_n$ denoting the number of individuals in generation $n$, argue that $\pi^{X_n}$, $n \ge 0$, is a martingale.

16. Given $X_1, X_2, \ldots$, let $S_n = \sum_{i=1}^{n} X_i$ and $F_n = \sigma(X_1, \ldots, X_n)$. Suppose for all $n$ that $E|S_n| < \infty$ and $E[S_{n+1} \mid F_n] = S_n$. Show $E[X_i X_j] = 0$ if $i \ne j$.

17. Suppose $n$ random points are chosen in a circle having diameter equal to 1, and let $X$ be the length of the shortest path connecting all of them. For $a > 0$, bound $P(X - E[X] \ge a)$.

18. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed discrete random variables, with $P(X_i = j) = p_j$. Obtain bounds on the tail probability of the number of times the pattern 0, 0, 0, 0 appears in the sequence.

19. Repeat Example 3.27, but now assuming that the $X_i$ are independent but not identically distributed. Let $P_{i,j} = P(X_i = j)$.

20. Let $Z_n, n \ge 0$, be a martingale with mean $Z_0 = 0$, and let $v_j, j \ge 0$, be a sequence of nondecreasing constants with $v_0 = 0$. Prove the Kolmogorov-Hajek-Renyi inequality:
\[ P(|Z_j| \le v_j, \text{ for all } j = 1, \ldots, n) \ge 1 - \sum_{j=1}^{n} E[(Z_j - Z_{j-1})^2]/v_j^2 \]

21. Consider a gambler who plays at a fair casino. Suppose that the casino does not give any credit, so that the gambler must quit when his fortune is 0. Suppose further that on each bet made at least 1 is either won or lost. Argue that, with probability 1, a gambler who wants to play forever will eventually go broke.

22. What is the implication of the martingale convergence theorem to the scenario of Exercise 10?

23. Three gamblers each start with $a$, $b$, and $c$ chips respectively.
In each round of a game a gambler is selected uniformly at random to give up a chip, and one of the other gamblers is selected uniformly at random to receive that chip. The game ends when there are only two players remaining with chips. Let $X_n$, $Y_n$, and $Z_n$ respectively denote the number of chips the three players have after round $n$, so $(X_0, Y_0, Z_0) = (a, b, c)$.
(a) Compute $E[X_{n+1} Y_{n+1} Z_{n+1} \mid (X_n, Y_n, Z_n) = (x, y, z)]$.
(b) Show that $M_n = X_n Y_n Z_n + n(a + b + c)/3$ is a martingale.
(c) Use the preceding to compute the expected length of the game.

Chapter 4 Bounding Probabilities and Expectations

4.1 Introduction

In this chapter we develop some approaches for bounding expectations and probabilities. We start in Section 2 with Jensen's inequality, which bounds the expected value of a convex function of a random variable. In Section 3 we develop the importance sampling identity and show how it can be used to yield bounds on tail probabilities. A specialization of this method results in the Chernoff bound, which is developed in Section 4. Section 5 deals with the Second Moment and the Conditional Expectation Inequalities, which bound the probability that a random variable is positive. Section 6 develops the Min-Max Identity and uses it to obtain bounds on the maximum of a set of random variables. Finally, in Section 7 we introduce some general stochastic order relations and explore their consequences.

4.2 Jensen's Inequality

Jensen's inequality yields a lower bound on the expected value of a convex function of a random variable.

Proposition 4.1 If $f$ is a convex function, then
\[ E[f(X)] \ge f(E[X]) \]
provided the expectations exist.

Proof. We give a proof under the assumption that $f$ has a Taylor series expansion.
Expanding $f$ about the value $\mu = E[X]$, and using the Taylor series expansion with a remainder term, yields that for some $a$
\[ f(x) = f(\mu) + f'(\mu)(x - \mu) + f''(a)(x - \mu)^2/2 \ge f(\mu) + f'(\mu)(x - \mu) \]
where the preceding used that $f''(a) \ge 0$ by convexity. Hence,
\[ f(X) \ge f(\mu) + f'(\mu)(X - \mu) \]
Taking expectations yields the result. ∎

Remark 4.2 If
\[ P(X = x_1) = \lambda = 1 - P(X = x_2) \]
then Jensen's inequality implies that for a convex function $f$
\[ \lambda f(x_1) + (1 - \lambda) f(x_2) \ge f(\lambda x_1 + (1 - \lambda) x_2) \]
which is the definition of a convex function. Thus, Jensen's inequality can be thought of as extending the defining equation of convexity from random variables that take on only two possible values to arbitrary random variables.

4.3 Probability Bounds via the Importance Sampling Identity

Let $f$ and $g$ be probability density (or probability mass) functions; let $h$ be an arbitrary function, and suppose that $g(x) = 0$ implies that $f(x)h(x) = 0$. The following is known as the importance sampling identity.

Proposition 4.3 The Importance Sampling Identity
\[ E_f[h(X)] = E_g\Big[\frac{h(X) f(X)}{g(X)}\Big] \]
where the subscript on the expectation indicates the density (or mass function) of the random variable $X$.

Proof We give the proof when $f$ and $g$ are density functions:
\[ E_f[h(X)] = \int_{-\infty}^{\infty} h(x) f(x)\,dx = \int_{-\infty}^{\infty} \frac{h(x) f(x)}{g(x)}\, g(x)\,dx = E_g\Big[\frac{h(X) f(X)}{g(X)}\Big] \]
∎

The importance sampling identity yields the following useful corollary concerning the tail probability of a random variable.

Corollary 4.4
\[ P_f(X > c) = E_g\Big[\frac{f(X)}{g(X)} \,\Big|\, X > c\Big] P_g(X > c) \]

Proof.
\[ P_f(X > c) = E_f[I_{\{X > c\}}] = E_g\Big[\frac{I_{\{X > c\}} f(X)}{g(X)}\Big] = E_g\Big[\frac{I_{\{X > c\}} f(X)}{g(X)} \,\Big|\, X > c\Big] P_g(X > c) = E_g\Big[\frac{f(X)}{g(X)} \,\Big|\, X > c\Big] P_g(X > c) \]
∎

Example 4.5 Bounding Standard Normal Tail Probabilities Let $f$ be the standard normal density function
\[ f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < \infty \]
For $c > 0$, consider $P_f(X > c)$, the probability that a standard normal random variable exceeds $c$.
With
\[ g(x) = c e^{-cx}, \quad x > 0 \]
we obtain from Corollary 4.4
\[ P_f(X > c) = \frac{e^{-c^2}}{c\sqrt{2\pi}}\, E_g[e^{-X^2/2} e^{cX} \mid X > c] = \frac{e^{-c^2}}{c\sqrt{2\pi}}\, E_g[e^{-(X + c)^2/2} e^{c(X + c)}] \]
where the first equality used that $P_g(X > c) = e^{-c^2}$, and the second the lack of memory property of exponential random variables to conclude that the conditional distribution of an exponential random variable $X$ given that it exceeds $c$ is the unconditional distribution of $X + c$. Thus the preceding yields
\[ P_f(X > c) = \frac{e^{-c^2/2}}{c\sqrt{2\pi}}\, E_g[e^{-X^2/2}] \quad (4.1) \]
Noting that, for $x > 0$,
\[ 1 - x < e^{-x} < 1 - x + x^2/2 \]
we see that
\[ 1 - X^2/2 < e^{-X^2/2} < 1 - X^2/2 + X^4/8 \]
Using that $E[X^2] = 2/c^2$ and $E[X^4] = 24/c^4$ when $X$ is exponential with rate $c$, the preceding inequality yields
\[ 1 - 1/c^2 < E_g[e^{-X^2/2}] < 1 - 1/c^2 + 3/c^4 \]
Consequently, using Equation (4.1), we obtain
\[ (1 - 1/c^2)\,\frac{e^{-c^2/2}}{c\sqrt{2\pi}} < P_f(X > c) < (1 - 1/c^2 + 3/c^4)\,\frac{e^{-c^2/2}}{c\sqrt{2\pi}} \quad (4.2) \]
∎

Our next example uses the importance sampling identity to bound the probability that successive sums of a sequence of independent and identically distributed normal random variables with a negative mean ever cross some specified positive number.

Example 4.6 Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed normal random variables with mean $\mu < 0$ and variance 1. Let $S_k = \sum_{i=1}^{k} X_i$ and, for a fixed $A > 0$, consider
\[ p = P(S_k > A \text{ for some } k) \]
Let $f_k(\mathbf{x}_k) = f_k(x_1, \ldots, x_k)$ be the joint density function of $\mathbf{X}_k = (X_1, \ldots, X_k)$. That is,
\[ f_k(\mathbf{x}_k) = (2\pi)^{-k/2}\, e^{-\sum_{i=1}^{k} (x_i - \mu)^2/2} \]
Also, let $g_k$ be the joint density of $k$ independent and identically distributed normal random variables with mean $-\mu$ and variance 1.
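That is,
\[ g_k(\mathbf{x}_k) = (2\pi)^{-k/2}\, e^{-\sum_{i=1}^{k} (x_i + \mu)^2/2} \]
Note that
\[ \frac{f_k(\mathbf{x}_k)}{g_k(\mathbf{x}_k)} = e^{2\mu \sum_{i=1}^{k} x_i} \]
With
\[ R_k = \Big\{(x_1, \ldots, x_k) : \sum_{i=1}^{j} x_i \le A,\ j < k,\ \sum_{i=1}^{k} x_i > A\Big\} \]
we have
\[ p = \sum_{k=1}^{\infty} P(\mathbf{X}_k \in R_k) = \sum_{k=1}^{\infty} E_{f_k}[I_{\{\mathbf{X}_k \in R_k\}}] = \sum_{k=1}^{\infty} E_{g_k}\Big[\frac{I_{\{\mathbf{X}_k \in R_k\}} f_k(\mathbf{X}_k)}{g_k(\mathbf{X}_k)}\Big] = \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}\, e^{2\mu S_k}] \]
Now, if $\mathbf{X}_k \in R_k$ then $S_k > A$, implying, because $\mu < 0$, that $e^{2\mu S_k} < e^{2\mu A}$. Because this implies that
\[ I_{\{\mathbf{X}_k \in R_k\}}\, e^{2\mu S_k} \le I_{\{\mathbf{X}_k \in R_k\}}\, e^{2\mu A} \]
we obtain from the preceding that
\[ p \le \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}\, e^{2\mu A}] = e^{2\mu A} \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}] \]
Now, if $Y_i, i \ge 1$, is a sequence of independent normal random variables with mean $-\mu$ and variance 1, then
\[ E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}] = P\Big(\sum_{i=1}^{j} Y_i \le A,\ j < k,\ \sum_{i=1}^{k} Y_i > A\Big) \]
Therefore, from the preceding
\[ p \le e^{2\mu A} \sum_{k=1}^{\infty} P\Big(\sum_{i=1}^{j} Y_i \le A,\ j < k,\ \sum_{i=1}^{k} Y_i > A\Big) = e^{2\mu A}\, P\Big(\sum_{i=1}^{k} Y_i > A \text{ for some } k\Big) = e^{2\mu A} \]
where the final equality follows from the strong law of large numbers because $\lim_{n \to \infty} \sum_{i=1}^{n} Y_i/n = -\mu > 0$ implies $P(\lim_{n \to \infty} \sum_{i=1}^{n} Y_i = \infty) = 1$, and thus $P(\sum_{i=1}^{k} Y_i > A \text{ for some } k) = 1$.

The bound
\[ p = P(S_k > A \text{ for some } k) \le e^{2\mu A} \]
is not very useful when $A$ is a small nonnegative number. In this case, we should condition on $X_1$, and then apply the preceding inequality. With $\Phi$ being the standard normal distribution function, this yields
\[ p = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} P(S_k > A \text{ for some } k \mid X_1 = x)\, e^{-(x - \mu)^2/2}\,dx \]
\[ = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} P(S_k > A \text{ for some } k \mid X_1 = x)\, e^{-(x - \mu)^2/2}\,dx + P(X_1 > A) \]
\[ \le \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} e^{2\mu(A - x)}\, e^{-(x - \mu)^2/2}\,dx + 1 - \Phi(A - \mu) \]
\[ = e^{2\mu A}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} e^{-(x + \mu)^2/2}\,dx + 1 - \Phi(A - \mu) \]
\[ = e^{2\mu A}\, \Phi(A + \mu) + 1 - \Phi(A - \mu) \]
Thus, for instance
\[ P(S_k > 0 \text{ for some } k) \le \Phi(\mu) + 1 - \Phi(-\mu) = 2\Phi(\mu) \]
∎

4.4 Chernoff Bounds

Suppose that $X$ has probability density (or probability mass) function $f(x)$. For $t > 0$, let
\[ g(x) = \frac{e^{tx} f(x)}{M(t)} \]
where $M(t) = E_f[e^{tX}]$ is the moment generating function of $X$. Corollary 4.4 yields, for $c > 0$, that
\[ P_f(X \ge c) = E_g[M(t) e^{-tX} \mid X \ge c]\, P_g(X \ge c) \le E_g[M(t) e^{-tX} \mid X \ge c] \le M(t) e^{-tc} \]
Because the preceding holds for all $t > 0$, we can conclude that
\[ P_f(X \ge c) \le \inf_{t > 0} M(t) e^{-tc} \quad (4.3) \]
The inequality (4.3) is called the Chernoff bound.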
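As a numerical illustration of the Chernoff bound (4.3), the following Python sketch minimizes $M(t)e^{-tc}$ over $t$ for a user-supplied log moment generating function. It is not from the original text. The function name `chernoff_bound`, the ternary search, and the search interval $[0, t_{\max}]$ are illustrative assumptions. The search relies on $\log M(t) - tc$ being convex in $t$, and the interval is assumed to contain the minimizing $t$.

```python
import math


def chernoff_bound(log_mgf, c, t_max=50.0, iters=200):
    """Approximate inf_{t > 0} M(t) * exp(-t * c).

    `log_mgf(t)` returns log M(t). The exponent log M(t) - t*c is convex
    in t, so a ternary search on [0, t_max] locates its minimum, provided
    the minimizer lies in that interval (an assumption of this sketch).
    Returns the bound and the minimizing t.
    """
    lo, hi = 0.0, t_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_mgf(m1) - m1 * c < log_mgf(m2) - m2 * c:
            hi = m2
        else:
            lo = m1
    t = (lo + hi) / 2.0
    return math.exp(log_mgf(t) - t * c), t


if __name__ == "__main__":
    # Standard normal: log M(t) = t^2 / 2, so the bound is exp(-c^2 / 2).
    c = 2.0
    bound, t_star = chernoff_bound(lambda t: t * t / 2.0, c)
    exact_tail = 0.5 * math.erfc(c / math.sqrt(2.0))
    print("normal: bound", bound, "at t =", t_star, "exact tail", exact_tail)

    # Poisson with mean 3: log M(t) = 3 * (exp(t) - 1).
    bound, t_star = chernoff_bound(lambda t: 3.0 * (math.exp(t) - 1.0), 8.0)
    print("Poisson(3), c = 8: bound", bound, "at t =", t_star)
```

For the standard normal the minimizing value is $t = c$, which gives the bound $e^{-c^2/2}$. Comparing it with the exact tail shows that the bound captures the exponential rate of decay but not the polynomial factor in front of it.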
Rather than choosing the value of $t$ so as to obtain the best bound, it is often convenient to work with bounds that are more analytically tractable. The following inequality can be used to simplify the Chernoff bound for a sum of independent Bernoulli random variables.

Lemma 4.7 For $0 \le p \le 1$
\[ p e^{t(1 - p)} + (1 - p) e^{-tp} \le e^{t^2/8} \]

[...] $c > 0$
\[ P(X - E[X] \ge c) \le e^{-2c^2/n} \quad (4.4) \]
\[ P(X - E[X] \le -c) \le e^{-2c^2/n} \quad (4.5) \]

Proof. For $c > 0$, $t > 0$
\[ P(X - E[X] \ge c) = P(e^{t(X - E[X])} \ge e^{tc}) \]
\[ \le e^{-tc} E[e^{t(X - E[X])}] \quad \text{(by the Markov inequality)} \]
\[ = e^{-tc} E\Big[\exp\Big\{\sum_{i=1}^{n} t(X_i - E[X_i])\Big\}\Big] \]
\[ = e^{-tc} E\Big[\prod_{i=1}^{n} e^{t(X_i - E[X_i])}\Big] \]
\[ = e^{-tc} \prod_{i=1}^{n} E[e^{t(X_i - E[X_i])}] \]
However, if $Y$ is Bernoulli with parameter $p$, then
\[ E[e^{t(Y - E[Y])}] = p e^{t(1 - p)} + (1 - p) e^{-tp} \le e^{t^2/8} \]
where the inequality follows from Lemma 4.7. Therefore,
\[ P(X - E[X] \ge c) \le e^{-tc + nt^2/8} \]
Letting $t = 4c/n$ yields the inequality (4.4). The proof of the inequality (4.5) is obtained by writing it as
\[ P(E[X] - X \ge c) \]
[...]

[...] $\epsilon > 0$, as $n \to \infty$ and $m \to \infty$,
\[ P((1 - \epsilon) m \alpha^{w} \le N_\alpha \le (1 + \epsilon) m \alpha^{w}) \to 1 \]

Proof. To prove the result, it is convenient to first formulate an equivalent continuous time model that results in the times at which the $n + m$ cells are killed being independent random variables. To do so, let $X_1, \ldots, X_{n+m}$ be independent exponential random variables, with $X_i$ having rate $w_i$, $i = 1, \ldots, n + m$. Note that $X_i$ will be the smallest of these exponentials with probability $w_i / \sum_j w_j$; further, given that $X_i$ is the smallest, $X_r$, $r \ne i$, will be the second smallest with probability $w_r / \sum_{j \ne i} w_j$; further, given that $X_i$ and $X_r$ are, respectively, the first and second smallest, $X_s$, $s \ne i, r$, will be the next smallest with probability $w_s / \sum_{j \ne i, r} w_j$; and so on. Consequently, if we imagine that cell $i$ is killed at time $X_i$, then the order in which the $n + m$ cells are killed has the same distribution as the order in which they are killed in the original model. So let us suppose that cell $i$ is killed at time $X_i$, $i \ge 1$. Now let $\tau_\alpha$ denote the time at which the number of surviving target cells first falls below $n\alpha$.
Also, let $N(t)$ denote the number of normal cells that are still alive at time $t$; so $N_\alpha = N(\tau_\alpha)$. We will first show that
\[ P(N(\tau_\alpha) \le (1 + \epsilon) m \alpha^{w}) \to 1 \quad \text{as } n, m \to \infty \quad (4.6) \]
To prove the preceding, let $\epsilon^*$ be such that $0 < \epsilon^* < \epsilon$, and set $t = -\ln(\alpha (1 + \epsilon^*)^{1/w})$. We will prove (4.6) by showing that as $n$ and $m$ approach $\infty$,
(i) $P(\tau_\alpha > t) \to 1$, and
(ii) $P(N(t) \le (1 + \epsilon) m \alpha^{w}) \to 1$.
Because the events $\tau_\alpha > t$ and $N(t) \le (1 + \epsilon) m \alpha^{w}$ together imply that $N(\tau_\alpha) \le N(t) \le (1 + \epsilon) m \alpha^{w}$, the result (4.6) will be established.

To prove (i), note that the number, call it $Y$, of surviving target cells by time $t$ is a binomial random variable with parameters $n$ and $e^{-t} = \alpha (1 + \epsilon^*)^{1/w}$. Hence, with $a = n\alpha[(1 + \epsilon^*)^{1/w} - 1]$, we have
\[ P(\tau_\alpha \le t) = P(Y \le n\alpha) = P(Y \le n e^{-t} - a) \le e^{-2a^2/n} \]
where the inequality follows from the Chernoff bound (4.3). This proves (i), because $a^2/n \to \infty$ as $n \to \infty$.

To prove (ii), note that $N(t)$ is a binomial random variable with parameters $m$ and $e^{-wt} = \alpha^{w}(1 + \epsilon^*)$. Thus, by letting $b = m\alpha^{w}(\epsilon - \epsilon^*)$ and again applying the Chernoff bound (4.3), we obtain
\[ P(N(t) > (1 + \epsilon) m \alpha^{w}) = P(N(t) > m e^{-wt} + b) \le e^{-2b^2/m} \]
This proves (ii), because $b^2/m \to \infty$ as $m \to \infty$. Thus (4.6) is established.

It remains to prove that
\[ P(N(\tau_\alpha) \ge (1 - \epsilon) m \alpha^{w}) \to 1 \quad \text{as } n, m \to \infty \quad (4.7) \]
However, (4.7) can be proven in a similar manner as was (4.6); a combination of these two results completes the proof of the theorem. ∎

4.5 Second Moment and Conditional Expectation Inequalities

The second moment inequality gives a lower bound on the probability that a nonnegative random variable is positive.

Proposition 4.11 The Second Moment Inequality For a nonnegative random variable $X$
\[ P(X > 0) \ge \frac{(E[X])^2}{E[X^2]} \]

Proof. Using Jensen's inequality in the second line below, we have
\[ E[X^2] = E[X^2 \mid X > 0]\, P(X > 0) \]
\[ \ge (E[X \mid X > 0])^2\, P(X > 0) \]
\[ = \frac{(E[X])^2}{P(X > 0)} \]
∎

When $X$ is the sum of Bernoulli random variables, we can improve the bound of the second moment inequality.
So, suppose for the remainder of this section that
\[ X = \sum_{i=1}^{n} X_i \]
where $X_i$ is Bernoulli with $E[X_i] = p_i$, $i = 1, \ldots, n$. We will need the following lemma.

Lemma 4.12 For any random variable $R$
\[ E[XR] = \sum_{i=1}^{n} p_i E[R \mid X_i = 1] \]

Proof
\[ E[XR] = E\Big[\sum_{i=1}^{n} X_i R\Big] = \sum_{i=1}^{n} E[X_i R] \]
\[ = \sum_{i=1}^{n} \{E[X_i R \mid X_i = 1]\, p_i + E[X_i R \mid X_i = 0]\,(1 - p_i)\} \]
\[ = \sum_{i=1}^{n} E[R \mid X_i = 1]\, p_i \]
∎

Proposition 4.13 The Conditional Expectation Inequality
\[ P(X > 0) \ge \sum_{i=1}^{n} \frac{p_i}{E[X \mid X_i = 1]} \]

Proof. Let $R = I_{\{X > 0\}}/X$. Noting that
\[ E[XR] = P(X > 0), \qquad E[R \mid X_i = 1] = E\Big[\frac{1}{X} \,\Big|\, X_i = 1\Big] \]
we obtain, from Lemma 4.12,
\[ P(X > 0) = \sum_{i=1}^{n} p_i E\Big[\frac{1}{X} \,\Big|\, X_i = 1\Big] \ge \sum_{i=1}^{n} \frac{p_i}{E[X \mid X_i = 1]} \]
where the final inequality made use of Jensen's inequality. ∎

Example 4.14 Consider a system consisting of $m$ components, each of which either works or not. Suppose, further, that for given subsets of components $S_j$, $j = 1, \ldots, n$, none of which is a subset of another, the system functions if all of the components of at least one of these subsets work. If component $j$ independently works with probability $\alpha_j$, derive a lower bound on the probability the system functions.

Solution Let $X_i$ equal 1 if all the components in $S_i$ work, and let it equal 0 otherwise, $i = 1, \ldots, n$. Also, let
\[ p_i = P(X_i = 1) = \prod_{j \in S_i} \alpha_j \]
Then, with $X = \sum_{i=1}^{n} X_i$, we have
\[ P(\text{system functions}) = P(X > 0) \ge \sum_{i=1}^{n} \frac{p_i}{E[X \mid X_i = 1]} = \sum_{i=1}^{n} \frac{p_i}{\sum_{j=1}^{n} P(X_j = 1 \mid X_i = 1)} = \sum_{i=1}^{n} \frac{p_i}{1 + \sum_{j \ne i} \prod_{k \in S_j - S_i} \alpha_k} \]
where $S_j - S_i$ consists of all components that are in $S_j$ but not in $S_i$. ∎

Example 4.15 Consider a random graph on the set of vertices $\{1, 2, \ldots, n\}$, which is such that each of the $\binom{n}{2}$ pairs of vertices $i \ne j$ is, independently, an edge of the graph with probability $p$. We are interested in the probability that this graph will be connected, where by connected we mean that for each pair of distinct vertices $i \ne j$ there is a sequence of edges of the form $(i, i_1), (i_1, i_2), \ldots, (i_k, j)$. (That is, a graph is connected if for each pair of distinct vertices $i$ and $j$, there is a path from $i$ to $j$.)
Suppose that

$$p = \frac{c \ln(n)}{n}$$

We will now show that if $c < 1$, then the probability that the graph is connected goes to 0 as $n \to \infty$. To verify this result, consider the number of isolated vertices, where vertex $i$ is said to be isolated if there are no edges of type $(i, j)$. Let $X_i$ be the indicator variable for the event that vertex $i$ is isolated, and let

$$X = \sum_{i=1}^{n} X_i$$

be the number of isolated vertices. Now,

$$P(X_i = 1) = (1 - p)^{n-1}$$

Also,

$$E[X \mid X_i = 1] = \sum_{j=1}^{n} P(X_j = 1 \mid X_i = 1) = 1 + \sum_{j \ne i} (1 - p)^{n-2} = 1 + (n - 1)(1 - p)^{n-2}$$

Because

$$(1 - p)^{n-1} = \Big(1 - \frac{c \ln(n)}{n}\Big)^{n-1} \approx e^{-c \ln(n)} = n^{-c}$$

the conditional expectation inequality yields that for $n$ large

$$P(X > 0) \ge \frac{n (1 - p)^{n-1}}{1 + (n - 1)(1 - p)^{n-2}} \approx \frac{n^{1-c}}{1 + (n - 1)\, n^{-c}}$$

Therefore,

$$c < 1 \;\Rightarrow\; P(X > 0) \to 1 \quad \text{as } n \to \infty$$

Because the graph is not connected if $X > 0$, it follows that the graph will almost certainly be disconnected when $n$ is large and $c < 1$. (It can be shown when $c > 1$ that the probability the graph is connected goes to 1 as $n \to \infty$.) ∎

4.6 The Min-Max Identity and Bounds on the Maximum

In this section we will be interested in obtaining an upper bound on $E[\max_i X_i]$, when $X_1, \ldots, X_n$ are nonnegative random variables. To begin, note that for any nonnegative constant $c$,

$$\max_i X_i \le c + \sum_{i=1}^{n} (X_i - c)^+ \tag{4.8}$$

where $x^+$, the positive part of $x$, is equal to $x$ if $x > 0$ and is equal to 0 otherwise. Taking expectations of the preceding inequality yields that

$$E[\max_i X_i] \le c + \sum_{i=1}^{n} E[(X_i - c)^+]$$

Because $(X_i - c)^+$ is a nonnegative random variable, we have

$$E[(X_i - c)^+] = \int_0^\infty P((X_i - c)^+ > x)\, dx = \int_0^\infty P(X_i - c > x)\, dx = \int_c^\infty P(X_i > y)\, dy$$

Therefore,

$$E[\max_i X_i] \le c + \sum_{i=1}^{n} \int_c^\infty P(X_i > y)\, dy \tag{4.9}$$

Because the preceding is true for all $c \ge 0$, the best bound is obtained by choosing the $c$ that minimizes the right side of the preceding.
Differentiating, and setting the result equal to 0, shows that the best bound is obtained when $c$ is the value $c^*$ such that

$$\sum_{i=1}^{n} P(X_i > c^*) = 1$$

Because $\sum_{i=1}^{n} P(X_i > c)$ is a decreasing function of $c$, the value of $c^*$ can be easily approximated and then utilized in the inequality (4.9). It is interesting to note that $c^*$ is such that the expected number of the $X_i$ that exceed $c^*$ is equal to 1, which is natural because the inequality (4.8) becomes an equality when exactly one of the $X_i$ exceeds $c$.

Example 4.16 Suppose the $X_i$ are independent exponential random variables with rates $\lambda_i$, $i = 1, \ldots, n$. Then the minimizing value $c^*$ is such that

$$1 = \sum_{i=1}^{n} e^{-\lambda_i c^*}$$

with resulting bound

$$E[\max_i X_i] \le c^* + \sum_{i=1}^{n} E[(X_i - c^*)^+] = c^* + \sum_{i=1}^{n} \frac{1}{\lambda_i} e^{-\lambda_i c^*}$$

In the special case where the rates are all equal, say $\lambda_i = 1$, then

$$1 = n e^{-c^*} \quad \text{or} \quad c^* = \ln(n)$$

and the bound becomes

$$E[\max_i X_i] \le \ln(n) + 1 \tag{4.10}$$

However, it is easy to compute the expected maximum of a sequence of independent exponentials with rate 1. Interpreting these random variables as the failure times of $n$ components, we can write

$$\max_i X_i = \sum_{i=1}^{n} T_i$$

where $T_i$ is the time between the $(i-1)$st and the $i$th failure. Using the lack of memory property of exponentials, it follows that the $T_i$ are independent, with $T_i$ being an exponential random variable with rate $n - i + 1$. (This is because when the $(i-1)$st failure occurs, the time until the next failure is the minimum of the $n - i + 1$ remaining lifetimes, each of which is exponential with rate 1.) Therefore, in this case

$$E[\max_i X_i] = \sum_{i=1}^{n} \frac{1}{n - i + 1} = \sum_{i=1}^{n} \frac{1}{i}$$

As it is known that, for $n$ large,

$$\sum_{i=1}^{n} \frac{1}{i} \approx \ln(n) + E$$

where $E \approx .5772$ is Euler's constant, we see that the bound yielded by the approach can be quite accurate. (Also, the bound (4.10) only requires that the $X_i$ are exponential with rate 1, and not that they are independent.) ∎

The preceding bounds on $E[\max_i X_i]$ only involve the marginal distributions of the $X_i$. When we have additional knowledge about the joint distributions, we can often do better.
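The computation in Example 4.16 can be sketched numerically as follows. The code finds $c^*$ by bisection (possible because $\sum_i P(X_i > c)$ is decreasing in $c$), evaluates the resulting bound for exponential rates, and compares it with the exact mean $\sum_{i=1}^n 1/i$ in the equal-rate case. The function names, the tolerance, and the choice $n = 100$ are assumptions made for illustration only.

```python
import math


def best_c(rates, tol=1e-12):
    """Bisection solve of sum_i exp(-rate_i * c) = 1 for c >= 0."""
    def excess(c):
        return sum(math.exp(-r * c) for r in rates) - 1.0

    lo, hi = 0.0, 1.0
    while excess(hi) > 0:  # expand the bracket until the root lies in [lo, hi]
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def max_upper_bound(rates):
    """Bound c* + sum_i exp(-rate_i * c*) / rate_i on E[max X_i]."""
    c = best_c(rates)
    return c + sum(math.exp(-r * c) / r for r in rates)


def exact_mean_max_rate_one(n):
    """E[max] of n independent rate-1 exponentials: 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


n = 100  # illustrative choice
print(max_upper_bound([1.0] * n), math.log(n) + 1, exact_mean_max_rate_one(n))
```

For equal rates the first two printed values should agree up to the bisection tolerance, and the third shows how close the bound is to the exact mean.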
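To illustrate how knowledge of the joint distribution can help, we first need to establish an identity relating the maximum of a set of random variables to the minimums of all the partial sets. For nonnegative random variables $X_1, \ldots, X_n$, fix $x$ and let $A_i$ denote the event that $X_i > x$. Let

$$X = \max(X_1, \ldots, X_n)$$

Noting that $X$ will be greater than $x$ if and only if at least one of the events $A_i$ occurs, we have

$$P(X > x) = P\Big(\bigcup_{i=1}^{n} A_i\Big)$$

and the inclusion-exclusion identity gives

$$P(X > x) = \sum_i P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \cdots + (-1)^{n+1} P(A_1 \cdots A_n)$$

Now,

$$P(A_i) = P(X_i > x)$$
$$P(A_i A_j) = P(X_i > x, X_j > x) = P(\min(X_i, X_j) > x)$$
$$P(A_i A_j A_k) = P(X_i > x, X_j > x, X_k > x) = P(\min(X_i, X_j, X_k) > x)$$

and so on. Thus, we see that

$$P(X > x) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(\min(X_{i_1}, \ldots, X_{i_r}) > x)$$

Integrating both sides as $x$ goes from 0 to $\infty$ gives the result:

$$E[X] = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} E[\min(X_{i_1}, \ldots, X_{i_r})]$$

Moreover, because the partial sums of the inclusion-exclusion identity alternately give upper and lower bounds, we obtain the max-min inequalities

$$E[X] \le \sum_i E[X_i]$$
$$E[X] \ge \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)]$$
$$E[X] \le \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)] + \sum_{i<j<k} E[\min(X_i, X_j, X_k)]$$

and so on.

Example 4.17 In the coupon collector's problem with $n$ types of coupons, where each coupon collected is of type $i$ with probability $p_i$, let $X_i$ denote the number of coupons collected until a type $i$ coupon is obtained, so that $X = \max_i X_i$ is the number needed to obtain a complete set. Because $\min(X_{i_1}, \ldots, X_{i_r})$ is geometric with parameter $p_{i_1} + \cdots + p_{i_r}$, the identity gives

$$E[X] = \sum_i \frac{1}{p_i} - \sum_{i<j} \frac{1}{p_i + p_j} + \sum_{i<j<k} \frac{1}{p_i + p_j + p_k} - \cdots + (-1)^{n+1} \frac{1}{p_1 + \cdots + p_n}$$ ∎

Using the preceding formula for the mean number of coupons needed to obtain a complete set requires summing over $2^n$ terms, and so is not practical when $n$ is large. Moreover, the bounds obtained by only going out a few steps in the formula for the expected value of a maximum generally turn out to be too loose to be beneficial. However, a useful bound can often be obtained by applying the max-min inequalities to an upper bound for $E[X]$ rather than directly to $E[X]$. We now develop the theory.

For nonnegative random variables $X_1, \ldots, X_n$, let

$$X = \max(X_1, \ldots, X_n)$$

Fix $c \ge 0$, and note the inequality

$$X \le c + \max((X_1 - c)^+, \ldots, (X_n - c)^+)$$

Now apply the max-min upper bound inequalities to the right side of the preceding, take expectations, and obtain that

$$E[X] \le c + \sum_{i=1}^{n} E[(X_i - c)^+]$$
$$E[X] \le c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)]$$

and so on.

Example 4.18 Consider the case of independent exponential random variables $X_1, \ldots, X_n$, all having rate 1.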
Then, the preceding gives the bound

$$E[\max_i X_i] \le c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)]$$

To obtain the terms in the three sums, condition, respectively, on whether $X_i > c$, whether $\min(X_i, X_j) > c$, and whether $\min(X_i, X_j, X_k) > c$. This yields

$$E[(X_i - c)^+] = e^{-c}$$
$$E[\min((X_i - c)^+, (X_j - c)^+)] = \tfrac{1}{2} e^{-2c}$$
$$E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] = \tfrac{1}{3} e^{-3c}$$

Using the constant $c = \ln(n)$ yields the bound

$$E[\max_i X_i] \le \ln(n) + 1 - \frac{n(n-1)}{4n^2} + \frac{n(n-1)(n-2)}{18 n^3} \approx \ln(n) + .806 \quad \text{for } n \text{ large}$$ ∎

Example 4.19 Let us reconsider Example 4.17, the coupon collector's problem. Let $c$ be an integer. To compute $E[(X_i - c)^+]$, condition on whether a type $i$ coupon is among the first $c$ collected.

$$E[(X_i - c)^+] = E[(X_i - c)^+ \mid X_i \le c]\, P(X_i \le c) + E[(X_i - c)^+ \mid X_i > c]\, P(X_i > c)$$
$$= E[(X_i - c)^+ \mid X_i > c]\,(1 - p_i)^c$$
$$= \frac{(1 - p_i)^c}{p_i}$$

Similarly,

$$E[\min((X_i - c)^+, (X_j - c)^+)] = \frac{(1 - p_i - p_j)^c}{p_i + p_j}$$
$$E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] = \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}$$

Therefore, for any nonnegative integer $c$,

$$E[X] \le c + \sum_i \frac{(1 - p_i)^c}{p_i} - \sum_{i<j} \frac{(1 - p_i - p_j)^c}{p_i + p_j} + \sum_{i<j<k} \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}$$ ∎

4.7 Stochastic Orderings

We say that $X$ is stochastically greater than $Y$, written $X \ge_{st} Y$, if

$$P(X > t) \ge P(Y > t) \quad \text{for all } t$$

In this section, we define and compare some other stochastic orderings of random variables.

If $X$ is a nonnegative continuous random variable with distribution function $F$ and density $f$, then the hazard rate function of $X$ is defined by

$$\lambda_X(t) = f(t)/\bar{F}(t)$$

where $\bar{F}(t) = 1 - F(t)$. Interpreting $X$ as the lifetime of an item, then for $\epsilon$ small

$$P(t \text{ year old item dies within an additional time } \epsilon) = P(X < t + \epsilon \mid X > t) \approx \lambda_X(t)\,\epsilon$$

If $Y$ is a nonnegative continuous random variable with distribution function $G$ and density $g$, say that $X$ is hazard rate order larger than $Y$, written $X \ge_{hr} Y$, if

$$\lambda_X(t) \le \lambda_Y(t) \quad \text{for all } t$$

Say that $X$ is likelihood ratio order larger than $Y$, written $X \ge_{lr} Y$, if

$$f(x)/g(x) \uparrow x$$

where $f$ and $g$ are the respective densities of $X$ and $Y$.

Proposition 4.20

$$X \ge_{lr} Y \;\Rightarrow\; X \ge_{hr} Y \;\Rightarrow\; X \ge_{st} Y$$

Proof. Let $X$ have density $f$, and $Y$ have density $g$.
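Suppose $X \ge_{lr} Y$. Then, for $y > x$,

$$f(y) = g(y)\,\frac{f(y)}{g(y)} \ge g(y)\,\frac{f(x)}{g(x)}$$

implying that

$$\int_x^\infty f(y)\, dy \ge \frac{f(x)}{g(x)} \int_x^\infty g(y)\, dy$$

or

$$\lambda_X(x) \le \lambda_Y(x)$$

To prove the final implication, note first that

$$\int_0^s \lambda_X(t)\, dt = \int_0^s \frac{f(t)}{\bar{F}(t)}\, dt = -\log \bar{F}(s)$$

implying that

$$\bar{F}(s) = e^{-\int_0^s \lambda_X(t)\, dt}$$

which immediately shows that $X \ge_{hr} Y \Rightarrow X \ge_{st} Y$. ∎

Define $r_X(t)$, the reversed hazard rate function of $X$, by

$$r_X(t) = f(t)/F(t)$$

and note that

$$r_X(t) = \lim_{\epsilon \downarrow 0} \frac{P(t - \epsilon < X \mid X < t)}{\epsilon}$$

Say that $X$ is reverse hazard rate order larger than $Y$, written $X \ge_{rh} Y$, if

$$r_X(t) \ge r_Y(t) \quad \text{for all } t$$

Our next theorem gives some representations of the orderings of $X$ and $Y$ in terms of stochastically larger relations between certain conditional distributions of $X$ and $Y$. Before presenting it, we introduce the following notation.

Notation For a random variable $X$ and event $A$, let $[X \mid A]$ denote a random variable whose distribution is that of the conditional distribution of $X$ given $A$.

Theorem 4.21
(i) $X \ge_{hr} Y \Leftrightarrow [X \mid X > t] \ge_{st} [Y \mid Y > t]$ for all $t$
(ii) $X \ge_{rh} Y \Leftrightarrow [X \mid X < t] \ge_{st} [Y \mid Y < t]$ for all $t$
(iii) $X \ge_{lr} Y \Leftrightarrow [X \mid s < X < t] \ge_{st} [Y \mid s < Y < t]$ for all $s < t$

Proof. (i) Let $X_t =_d [X - t \mid X > t]$ and $Y_t =_d [Y - t \mid Y > t]$. Noting that

$$\lambda_{X_t}(s) = \lambda_X(t + s), \quad s \ge 0$$

shows that

$$X \ge_{hr} Y \;\Rightarrow\; X_t \ge_{hr} Y_t \;\Rightarrow\; X_t \ge_{st} Y_t \;\Rightarrow\; [X \mid X > t] \ge_{st} [Y \mid Y > t]$$

To go the other way, use the identity

$$\bar{F}_{X_t}(\epsilon) = e^{-\int_t^{t+\epsilon} \lambda_X(s)\, ds}$$

to obtain that

$$[X \mid X > t] \ge_{st} [Y \mid Y > t] \;\Leftrightarrow\; X_t \ge_{st} Y_t \;\Rightarrow\; \lambda_X(t) \le \lambda_Y(t)$$

(ii) With $X_t =_d [t - X \mid X < t]$,

$$\lambda_{X_t}(y) = r_X(t - y), \quad 0 < y < t$$

Therefore,

$$X \ge_{rh} Y \;\Leftrightarrow\; \lambda_{X_t}(y) \ge \lambda_{Y_t}(y) \;\Rightarrow\; X_t \le_{st} Y_t \;\Leftrightarrow\; [t - X \mid X < t] \le_{st} [t - Y \mid Y < t] \;\Leftrightarrow\; [X \mid X < t] \ge_{st} [Y \mid Y < t]$$

On the other hand,

$$[X \mid X < t] \ge_{st} [Y \mid Y < t] \;\Leftrightarrow\; X_t \le_{st} Y_t \;\Rightarrow\; \int_0^\epsilon \lambda_{X_t}(y)\, dy \ge \int_0^\epsilon \lambda_{Y_t}(y)\, dy \;\Leftrightarrow\; \int_0^\epsilon r_X(t - y)\, dy \ge \int_0^\epsilon r_Y(t - y)\, dy \;\Rightarrow\; r_X(t) \ge r_Y(t)$$

(iii) Let $X$ and $Y$ have respective densities $f$ and $g$. Suppose $[X \mid s < X < t] \ge_{st} [Y \mid s < Y < t]$ for all $s < t$. Letting $s < v < t$, this implies that

$$P(X > v \mid s < X < t) \ge P(Y > v \mid s < Y < t)$$

or, equivalently, that

$$\frac{P(v < X < t)}{P(s < X < t)} \ge \frac{P(v < Y < t)}{P(s < Y < t)}$$

Dividing both sides by $t - v$ and letting $v \uparrow t$ gives $f(t)/P(s < X < t) \ge g(t)/P(s < Y < t)$. Applying the same reasoning to the complementary probabilities $P(X \le v \mid s < X < t) \le P(Y \le v \mid s < Y < t)$ and letting $v \downarrow s$ gives $f(s)/P(s < X < t) \le g(s)/P(s < Y < t)$. Hence,

$$\frac{f(t)}{g(t)} \ge \frac{P(s < X < t)}{P(s < Y < t)} \ge \frac{f(s)}{g(s)}$$

showing that $X \ge_{lr} Y$.

Now suppose that $X \ge_{lr} Y$. Then clearly $[X \mid s < X < t] \ge_{lr} [Y \mid s < Y < t]$, implying, from Proposition 4.20, that $[X \mid s < X < t] \ge_{st} [Y \mid s < Y < t]$. ∎

Corollary 4.22

$$X \ge_{lr} Y \;\Rightarrow\; X \ge_{rh} Y \;\Rightarrow\; X \ge_{st} Y$$

Proof. The first implication immediately follows from parts (ii) and (iii) of Theorem 4.21. The second implication follows upon taking the limit as $t \to \infty$ in part (ii) of that theorem. ∎

Say that $X$ is an increasing hazard rate (or IHR) random variable if $\lambda_X(t)$ is nondecreasing in $t$.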
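(Other terminology is to say that $X$ has an increasing failure rate.)

Proposition 4.23 Let $X_t =_d [X - t \mid X > t]$. Then,
(a) $X_t \downarrow_{hr}$ as $t \uparrow$ $\Leftrightarrow$ $X$ is IHR
(b) $X_t \downarrow_{lr}$ as $t \uparrow$ $\Leftrightarrow$ $\log f(x)$ is concave

Proof. (a) Let $\lambda(y)$ be the hazard rate function of $X$. Then $\lambda_t(y)$, the hazard rate function of $X_t$, is given by

$$\lambda_t(y) = \lambda(t + y), \quad y > 0$$

Hence, if $X$ is IHR then $X_t \downarrow_{hr} t$, implying that $X_t \downarrow_{st} t$. Now, let $s < t$, and suppose that $X_s \ge_{st} X_t$. Then,

$$e^{-\int_t^{t+\epsilon} \lambda(y)\, dy} = P(X_t > \epsilon) \le P(X_s > \epsilon) = e^{-\int_s^{s+\epsilon} \lambda(y)\, dy}$$

showing that $\lambda(t) \ge \lambda(s)$. Thus part (a) is proved.

(b) Using that the density of $X_t$ is $f_{X_t}(x) = f(x + t)/\bar{F}(t)$ yields, for $s < t$, that

$$X_s \ge_{lr} X_t \;\Leftrightarrow\; \frac{f(x + s)}{f(x + t)} \uparrow x$$
$$\Leftrightarrow\; \log f(x + s) - \log f(x + t) \uparrow x$$
$$\Leftrightarrow\; \frac{f'(x + s)}{f(x + s)} - \frac{f'(x + t)}{f(x + t)} \ge 0$$
$$\Leftrightarrow\; \frac{f'(y)}{f(y)} \downarrow y$$
$$\Leftrightarrow\; \frac{d}{dy} \log f(y) \downarrow y$$
$$\Leftrightarrow\; \log f(y) \text{ is concave}$$ ∎

4.8 Exercises

1. For a nonnegative random variable $X$, show that $(E[X^n])^{1/n}$ is nondecreasing in $n$.

2. Let $X$ be a standard normal random variable. Use Corollary 4.4, along with the density

$$g(x) = x e^{-x^2/2}, \quad x > 0$$

to show, for $c > 0$, that
(a) $P(X > c) = \frac{1}{\sqrt{2\pi}}\, e^{-c^2/2}\, E_g\big[\frac{1}{X} \,\big|\, X > c\big]$
(b) Show, for any positive random variable $W$, that $E\big[\frac{1}{W} \,\big|\, W > c\big] \le E\big[\frac{1}{W}\big]$
(c) Show that $P(X > c) \le \frac{1}{2} e^{-c^2/2}$
(d) Show that $E_g[X \mid X > c] = c + e^{c^2/2} \sqrt{2\pi}\, P(X > c)$
(e) Use Jensen's inequality, along with the preceding, to show that

$$c\,P(X > c) + e^{c^2/2} \sqrt{2\pi}\, (P(X > c))^2 \ge \frac{1}{\sqrt{2\pi}}\, e^{-c^2/2}$$

(f) Argue from (e) that $P(X > c)$ must be at least as large as the positive root of the equation

$$c x + e^{c^2/2} \sqrt{2\pi}\, x^2 = \frac{1}{\sqrt{2\pi}}\, e^{-c^2/2}$$

(g) Conclude that

$$P(X > c) \ge \frac{1}{2\sqrt{2\pi}} \big(\sqrt{c^2 + 4} - c\big)\, e^{-c^2/2}$$

3. Let $X$ be a Poisson random variable with mean $\lambda$. Show that, for $n \ge \lambda$, the Chernoff bound yields that

$$P(X \ge n) \le \frac{e^{-\lambda} (\lambda e)^n}{n^n}$$

4. Let $m(t) = E[X^t]$. The moment bound states that for $c > 0$

$$P(X \ge c) \le m(t)\, c^{-t}$$

for all $t > 0$. Show that this result can be obtained from the importance sampling identity.

5. Fill in the details of the proof that, for independent Bernoulli random variables $X_1, \ldots, X_n$, and $c > 0$,

$$P(S - E[S] \le -c) \le e^{-2c^2/n}$$

where $S = \sum_{i=1}^{n} X_i$.

6.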
If $X$ is a binomial random variable with parameters $n$ and $p$, show
(a) $P(|X - np| > c) \le 2 e^{-2c^2/n}$
(b) $P(X - np \ge \alpha n p) \le \exp\{-2 n p^2 \alpha^2\}$

7. Give the details of the proof of (4.7).

8. Prove that

$$E[f(X)] \ge E[f(E[X \mid Y])] \ge f(E[X])$$

Suppose you want a lower bound on $E[f(X)]$ for a convex function $f$. The preceding shows that first conditioning on $Y$ and then applying Jensen's inequality to the individual terms $E[f(X) \mid Y = y]$ results in a larger lower bound than does an immediate application of Jensen's inequality.

9. Let $X_i$ be binary random variables with parameters $p_i$, $i = 1, \ldots, n$. Let $X = \sum_{i=1}^{n} X_i$, and also let $I$, independent of the variables $X_1, \ldots, X_n$, be equally likely to be any of the values $1, \ldots, n$. For $R$ independent of $I$, show that
(a) $P(I = i \mid X_I = 1) = p_i / E[X]$
(b) $E[XR] = E[X]\, E[R \mid X_I = 1]$
(c) $P(X > 0) = E[X]\, E\big[\frac{1}{X} \,\big|\, X_I = 1\big]$

10. For $X_i$ and $X$ as in Exercise 9, show that

$$\sum_i \frac{p_i}{E[X \mid X_i = 1]} \ge \frac{(E[X])^2}{E[X^2]}$$

Thus, for sums of binary variables, the conditional expectation inequality yields a stronger lower bound than does the second moment inequality.
Hint: Make use of the results of Exercises 8 and 9.

11. Let $X_i$ be exponential with mean $8 + 2i$, for $i = 1, 2, 3$. Obtain an upper bound on $E[\max X_i]$, and compare it with the exact result when the $X_i$ are independent.

12. Let $U_i$, $i = 1, \ldots, n$ be uniform (0, 1) random variables. Obtain an upper bound on $E[\max U_i]$, and compare it with the exact result when the $U_i$ are independent.

13. Let $U_1$ and $U_2$ be uniform (0, 1) random variables. Obtain an upper bound on $E[\max(U_1, U_2)]$, and show this maximum is obtained when $U_1 = 1 - U_2$.

14. Show that $X \ge_{hr} Y$ if and only if

$$\frac{P(X > t)}{P(X > s)} \ge \frac{P(Y > t)}{P(Y > s)}$$

for all $s < t$.

15. Let $h(x, y)$ be a real valued function satisfying $h(x, y) \ge h(y, x)$ whenever $x \ge y$.
(a) Show that if $X$ and $Y$ are independent and $X \ge_{lr} Y$ then $h(X, Y) \ge_{st} h(Y, X)$.
(b) Show by a counterexample that the preceding is not valid under the weaker condition $X \ge_{st} Y$.

16. There are $n$ jobs, with job $i$ requiring a random time $X_i$ to process.
The jobs must be processed sequentially. Give a sufficient condition, the weaker the better, under which the policy of processing jobs in the order $1, 2, \ldots, n$ maximizes the probability that at least $k$ jobs are processed by time $t$ for all $k$ and $t$.

Chapter 5 Markov Chains

5.1 Introduction

This chapter introduces a natural generalization of a sequence of independent random variables, called a Markov chain, where a variable may depend on the immediately preceding variable in the sequence. Named after the 19th century Russian mathematician Andrei Andreyevich Markov, these chains are widely used as simple models of more complex real-world phenomena.

Given a sequence of discrete random variables $X_0, X_1, X_2, \ldots$ taking values in some finite or countably infinite set $S$, we say that $X_n$ is a Markov chain with respect to a filtration $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$ and, for all $B \subseteq S$, we have the Markov property

$$P(X_{n+1} \in B \mid \mathcal{F}_n) = P(X_{n+1} \in B \mid X_n)$$

If we interpret $X_n$ as the state of the chain at time $n$, then the preceding means that if you know the current state, nothing else from the past is relevant to the future of the Markov chain. That is, given the present state, the future states and the past states are independent. When we let $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ this definition reduces to

$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$$

If $P(X_{n+1} = j \mid X_n = i)$ is the same for all $n$, we say that the Markov chain has stationary transition probabilities, and we set

$$P_{ij} = P(X_{n+1} = j \mid X_n = i)$$

In this case the quantities $P_{ij}$ are called the transition probabilities, and specifying them along with a probability distribution for the starting state $X_0$ is enough to determine all probabilities concerning $X_0, \ldots, X_n$. We will assume from here on that all Markov chains considered have stationary transition probabilities.
In addition, unless otherwise noted, we will assume that $S$, the set of all possible states of the Markov chain, is the set of nonnegative integers.

Example 5.1 Reflected random walk. Suppose $Y_i$ are independent and identically distributed Bernoulli($p$) random variables, and let $X_0 = 0$ and $X_n = (X_{n-1} + 2Y_n - 1)^+$ for $n = 1, 2, \ldots$. The process $X_n$, called a reflected random walk, can be viewed as the position of a particle at time $n$ such that at each time the particle has a $p$ probability of moving one step to the right, a $1 - p$ probability of moving one step to the left, and is returned to position 0 if it ever attempts to move to the left of 0. It is immediate from its definition that $X_n$ is a Markov chain.

Example 5.2 A non-Markov chain. Again let $Y_i$ be independent and identically distributed Bernoulli($p$) random variables, let $X_0 = 0$, and this time let $X_n = Y_n + Y_{n-1}$ for $n = 1, 2, \ldots$. It's easy to see that $X_n$ is not a Markov chain because $P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 2) = 0$ while on the other hand $P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 0) = p$.

5.2 The Transition Matrix

The transition probabilities

$$P_{ij} = P(X_1 = j \mid X_0 = i)$$

are also called the one-step transition probabilities. We define the $n$-step transition probabilities by

$$P_{ij}^{(n)} = P(X_n = j \mid X_0 = i)$$

In addition, we define the transition probability matrix

$$\mathbf{P} = \begin{pmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$

and the $n$-step transition probability matrix

$$\mathbf{P}^{(n)} = \begin{pmatrix} P_{00}^{(n)} & P_{01}^{(n)} & P_{02}^{(n)} & \cdots \\ P_{10}^{(n)} & P_{11}^{(n)} & P_{12}^{(n)} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$

An interesting relation between these matrices is obtained by noting that

$$P_{ij}^{(n+m)} = \sum_k P(X_{n+m} = j \mid X_0 = i, X_n = k)\, P(X_n = k \mid X_0 = i) = \sum_k P_{kj}^{(m)} P_{ik}^{(n)}$$

The preceding are called the Chapman-Kolmogorov equations.

It follows from the Chapman-Kolmogorov equations that

$$\mathbf{P}^{(n+m)} = \mathbf{P}^{(n)} \times \mathbf{P}^{(m)}$$

where here "$\times$" represents matrix multiplication. Hence,

$$\mathbf{P}^{(2)} = \mathbf{P} \times \mathbf{P}$$

and, by induction,

$$\mathbf{P}^{(n)} = \mathbf{P}^n$$

where the right-hand side represents multiplying the matrix $\mathbf{P}$ by itself $n$ times.

Example 5.3 Reflected random walk.
A particle starts at position zero and at each time moves one position to the right with probability $p$ and, if the particle is not in position zero, moves one position to the left (or remains in state 0) with probability $1 - p$. The position $X_n$ of the particle at time $n$ forms a Markov chain with transition matrix

$$\mathbf{P} = \begin{pmatrix} 1-p & p & 0 & 0 & 0 & \cdots \\ 1-p & 0 & p & 0 & 0 & \cdots \\ 0 & 1-p & 0 & p & 0 & \cdots \\ \vdots & & & & & \end{pmatrix}$$

Example 5.4 The two state Markov chain. Consider a Markov chain with states 0 and 1 having transition probability matrix

$$\mathbf{P} = \begin{pmatrix} \alpha & 1 - \alpha \\ \beta & 1 - \beta \end{pmatrix}$$

The two-step transition probability matrix is given by

$$\mathbf{P}^{(2)} = \begin{pmatrix} \alpha^2 + (1 - \alpha)\beta & 1 - \alpha^2 - (1 - \alpha)\beta \\ \alpha\beta + \beta(1 - \beta) & 1 - \alpha\beta - \beta(1 - \beta) \end{pmatrix}$$

5.3 The Strong Markov Property

Consider a Markov chain $X_n$ having one-step transition probabilities $P_{ij}$, which means that if the Markov chain is in state $i$ at a fixed time $n$, then the next state will be $j$ with probability $P_{ij}$. However, it is not necessarily true that if the Markov chain is in state $i$ at a randomly distributed time $T$, the next state will be $j$ with probability $P_{ij}$. That is, if $T$ is an arbitrary nonnegative integer valued random variable, it is not necessarily true that $P(X_{T+1} = j \mid X_T = i) = P_{ij}$. For a simple counterexample suppose

$$T = \min(n : X_n = i, X_{n+1} = j)$$

Then, clearly,

$$P(X_{T+1} = j \mid X_T = i) = 1$$

The idea behind this counterexample is that a general random variable $T$ may depend not only on the states of the Markov chain up to time $T$ but also on future states after time $T$. Recalling that $T$ is a stopping time for a filtration $\mathcal{F}_n$ if $\{T = n\} \in \mathcal{F}_n$ for every $n$, we see that for a stopping time the value of $T$ can only depend on the states up to time $T$. We now show that $P(X_{T+n} = j \mid X_T = i)$ will equal $P_{ij}^{(n)}$ provided that $T$ is a stopping time.

This is usually called the strong Markov property, and essentially means a Markov chain "starts over" at stopping times. Below we define $\mathcal{F}_T = \{A : A \cap \{T = t\} \in \mathcal{F}_t \text{ for all } t\}$, which intuitively represents any information you would know by time $T$.

Proposition 5.5 The strong Markov property.
Let $X_n$, $n \ge 0$, be a Markov chain with respect to the filtration $\mathcal{F}_n$. If $T < \infty$ a.s. is a stopping time with respect to $\mathcal{F}_n$, then

$$P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T) = P_{ij}^{(n)}$$

Proof.

$$P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T, T = t) = P(X_{t+n} = j \mid X_t = i, \mathcal{F}_t, T = t) = P(X_{t+n} = j \mid X_t = i, \mathcal{F}_t) = P_{ij}^{(n)}$$

where the next to last equality used the fact that $T$ is a stopping time to give $\{T = t\} \in \mathcal{F}_t$. ∎

Example 5.6 Losses in queueing busy periods. Consider a queueing system where $X_n$ is the number of customers in the system at time $n$. At each time $n = 1, 2, \ldots$ there is a probability $p$ that a customer arrives and a probability $(1 - p)$, if there are any customers present, that one of them departs. Starting with $X_0 = 1$, let $T = \min\{t > 0 : X_t = 0\}$ be the length of a busy period. Suppose also there is only space for at most $m$ customers in the system. Whenever a customer arrives to find $m$ customers already in the system, the customer is lost and departs immediately. Letting $N_m$ be the number of customers lost during a busy period, compute $E[N_m]$.

Solution. Let $A$ be the event that the first arrival occurs before the first departure. We will obtain $E[N_m]$ by conditioning on whether $A$ occurs. Now, when $A$ happens, for the busy period to end we must first wait an interval of time until the system goes back to having a single customer, and then after that we must wait another interval of time until the system becomes completely empty. By the Markov property the number of losses during the first time interval has distribution $N_{m-1}$, because we are now starting with two customers and therefore with only $m - 1$ spaces for additional customers. The strong Markov property tells us that the number of losses in the second time interval has distribution $N_m$. We therefore have

$$E[N_m \mid A] = \begin{cases} E[N_{m-1}] + E[N_m] & \text{if } m > 1 \\ 1 + E[N_m] & \text{if } m = 1 \end{cases}$$

and using $P(A) = p$ and $E[N_m \mid A^c] = 0$ we have

$$E[N_m] = E[N_m \mid A]\,P(A) + E[N_m \mid A^c]\,P(A^c) = p E[N_{m-1}] + p E[N_m]$$

for $m > 1$ along with

$$E[N_1] = p + p E[N_1]$$

and thus

$$E[N_m] = \Big(\frac{p}{1 - p}\Big)^m$$
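The formula $E[N_m] = (p/(1-p))^m$ from Example 5.6 can be cross-checked by an independent first-step analysis. The sketch below is not the argument used in the text: it assumes the notation $e_k$ for the expected number of losses before the system empties when starting with $k$ customers, so that $e_0 = 0$, $e_k = p\,e_{k+1} + (1-p)\,e_{k-1}$ for $1 \le k < m$, and $e_m = p(1 + e_m) + (1-p)\,e_{m-1}$. Writing $e_k = a_k e_1$ and solving the boundary equation gives $e_1 = E[N_m]$. The function name is hypothetical.

```python
from fractions import Fraction


def expected_losses(m, p):
    """E[N_m] via first-step analysis, writing e_k = a_k * e_1 for k = 0..m."""
    q = 1 - p
    a_prev, a_cur = Fraction(0), Fraction(1)  # a_0 = 0 (e_0 = 0), a_1 = 1
    for _ in range(1, m):
        # Interior equation e_k = p*e_{k+1} + q*e_{k-1} solved for e_{k+1}.
        a_prev, a_cur = a_cur, (a_cur - q * a_prev) / p
    # Boundary equation at k = m: q*e_m = p + q*e_{m-1}, solved for e_1.
    return p / (q * (a_cur - a_prev))


# Illustrative values of p (assumptions), compared against the closed form.
for p in (Fraction(1, 3), Fraction(1, 2), Fraction(3, 5)):
    for m in range(1, 6):
        assert expected_losses(m, p) == (p / (1 - p)) ** m
print(expected_losses(3, Fraction(1, 3)))
```

Because the arithmetic is exact, agreement with the closed form holds as an identity of fractions rather than up to rounding.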
It's quite interesting to notice that $E[N_m]$ increases in $m$ when $p > 1/2$, decreases when $p < 1/2$, and stays constant for all $m$ when $p = 1/2$. The intuition for the case $p = 1/2$ is that when $m$ increases, losses become less frequent but the busy period becomes longer. ∎

We next apply the strong Markov property to obtain a result for the cover time, the time when all states of a Markov chain have been visited.

Proposition 5.7 Cover times. Given an $N$-state Markov chain $X_n$, let $T_i = \min\{n \ge 0 : X_n = i\}$ and let $C = \max_i T_i$ be the cover time. Then

$$E[C] \le \sum_{m=1}^{N} \frac{1}{m} \max_{i,j} E[T_j \mid X_0 = i]$$

Proof. Let $I_1, I_2, \ldots, I_N$ be a random permutation of the integers $1, 2, \ldots, N$ chosen so that all possible orderings are equally likely. Letting $T_{I_0} = 0$ and noting that $\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}$ is the additional time after all states $I_1, \ldots, I_{m-1}$ have been visited until all states $I_1, \ldots, I_m$ have been visited, we see that

$$C = \sum_{m=1}^{N} \Big( \max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j} \Big)$$

Thus, we have

$$E[C] = \sum_{m=1}^{N} E\Big[ \max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j} \Big]$$
$$= \sum_{m=1}^{N} \frac{1}{m}\, E\Big[ \max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j} \,\Big|\, T_{I_m} > \max_{j \le m-1} T_{I_j} \Big]$$
$$\le \sum_{m=1}^{N} \frac{1}{m} \max_{i,j} E[T_j \mid X_0 = i]$$

where the second line follows since all orderings are equally likely and thus $P(T_{I_m} > \max_{j \le m-1} T_{I_j}) = 1/m$, and the third from the strong Markov property. ∎

5.4 Classification of States

We say that states $i, j$ of a Markov chain communicate with each other, or are in the same class, if there are integers $n$ and $m$ such that both $P_{ij}^{(n)} > 0$ and $P_{ji}^{(m)} > 0$ hold. This means that it is possible for the chain to get from $i$ to $j$ and vice versa. A Markov chain is called irreducible if all states are in the same class.

For a Markov chain $X_n$, let

$$T_i = \min\{n > 0 : X_n = i\}$$

be the time until the Markov chain first makes a transition into state $i$. Using the notation $E_i[\cdots]$ and $P_i(\cdots)$ to denote that the Markov chain starts from state $i$, let

$$f_i = P_i(T_i < \infty)$$

be the probability that the chain ever makes a transition into state $i$ given that it starts in state $i$. We say that state $i$ is transient if $f_i < 1$ and recurrent if $f_i = 1$. Let

$$N_i = \sum_{n=1}^{\infty} I_{\{X_n = i\}}$$

be the total number of transitions into state $i$.
The strong Markov property tells us that, starting in state $i$,

$$N_i + 1 \sim \text{geometric}(1 - f_i)$$

since each time the chain makes a transition into state $i$ there is, independent of all else, a $(1 - f_i)$ chance it will never return. Consequently,

$$E_i[N_i] = \sum_{n=1}^{\infty} P_{ii}^{(n)}$$

is either infinite or finite depending on whether state $i$ is recurrent or transient.

Proposition 5.8 If state $i$ is recurrent and $i$ communicates with $j$, then $j$ is also recurrent.

Proof. Because $i$ and $j$ communicate, there exist values $n$ and $m$ such that $P_{ij}^{(n)} P_{ji}^{(m)} > 0$. But for any $k > 0$

$$P_{jj}^{(n+m+k)} \ge P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)}$$

where the preceding follows because $P_{jj}^{(n+m+k)}$ is the probability starting in state $j$ that the chain will be back in $j$ after $n + m + k$ transitions, whereas $P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)}$ is the probability of the same event occurring but with the additional condition that the chain must also be in $i$ after the first $m$ transitions and then back in $i$ after an additional $k$ transitions. Summing over $k$ shows that

$$E_j[N_j] \ge \sum_k P_{jj}^{(n+m+k)} \ge P_{ji}^{(m)} P_{ij}^{(n)} \sum_k P_{ii}^{(k)} = \infty$$

Thus, $j$ is also recurrent. ∎

Proposition 5.9 If $j$ is transient, then $\sum_{n=1}^{\infty} P_{ij}^{(n)} < \infty$.

Proof. Note that

$$E_i[N_j] = E_i\Big[\sum_{n=1}^{\infty} I_{\{X_n = j\}}\Big] = \sum_{n=1}^{\infty} P_{ij}^{(n)}$$

Let $f_{ij}$ denote the probability that the chain ever makes a transition into $j$ given that it starts at $i$. Then, conditioning on whether such a transition ever occurs yields, upon using the strong Markov property, that

$$E_i[N_j] = (1 + E_j[N_j])\, f_{ij} < \infty$$

since $j$ is transient. ∎

If $i$ is recurrent, let

$$\mu_i = E_i[T_i]$$

denote the mean number of transitions it takes the chain to return to state $i$, given it starts in $i$. We say that a recurrent state $i$ is null if $\mu_i = \infty$ and positive if $\mu_i < \infty$. In the next section we will show that positive recurrence is a class property, meaning that if $i$ is positive recurrent and communicates with $j$ then $j$ is also positive recurrent. (This also implies, using that recurrence is a class property, that so is null recurrence.)
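For a finite chain, the classes of Section 5.4 can be computed directly from the transition matrix. The sketch below is a minimal illustration under stated assumptions: states are labeled $0, \ldots, n-1$, the matrix is a list of rows, each state is treated as belonging to its own class even if it never returns to itself, and the function names and example matrix are hypothetical. Reachability is obtained with a standard transitive-closure pass, and two states are grouped when each is reachable from the other.

```python
def communicating_classes(P):
    """Partition the states 0..n-1 of a finite transition matrix into classes."""
    n = len(P)
    # reach[i][j] is True when j can be reached from i in one or more steps.
    reach = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):  # transitive closure (Warshall's algorithm)
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = [j for j in range(n) if j == i or (reach[i][j] and reach[j][i])]
        seen.update(cls)
        classes.append(cls)
    return classes


def is_irreducible(P):
    """A chain is irreducible when all states lie in a single class."""
    return len(communicating_classes(P)) == 1


# Hypothetical 3-state chain: state 2 can reach {0, 1} but not vice versa.
P = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
print(communicating_classes(P), is_irreducible(P))
```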
5.5 Stationary and Limiting Distributions

For a Markov chain $X_n$ starting in some given state $i$, we define the limiting probability of being in state $j$ to be

$$P_j = \lim_{n \to \infty} P_{ij}^{(n)}$$

if the limit exists and is the same for all $i$.

It is easy to see that not all Markov chains will have limiting probabilities. For instance, consider the two state Markov chain with $P_{01} = P_{10} = 1$. For this chain $P_{00}^{(n)}$ will equal 1 when $n$ is even and 0 when $n$ is odd, and so has no limit.

Definition 5.10 State $i$ of a Markov chain $X_n$ is said to have period $d$ if $d$ is the largest integer having the property that $P_{ii}^{(n)} = 0$ when $n$ is not a multiple of $d$.

Proposition 5.11 If states $i$ and $j$ communicate, then they have the same period.

Proof. Let $d_k$ be the period of state $k$. Let $n, m$ be such that $P_{ij}^{(n)} P_{ji}^{(m)} > 0$. Now, if $P_{ii}^{(r)} > 0$, then

$$P_{jj}^{(r+n+m)} \ge P_{ji}^{(m)} P_{ii}^{(r)} P_{ij}^{(n)} > 0$$

So $d_j$ divides $r + n + m$. Moreover, because

$$P_{ii}^{(2r)} \ge P_{ii}^{(r)} P_{ii}^{(r)} > 0$$

the same argument shows that $d_j$ also divides $2r + n + m$; therefore $d_j$ divides $2r + n + m - (r + n + m) = r$. Because $d_j$ divides $r$ whenever $P_{ii}^{(r)} > 0$, it follows that $d_j$ divides $d_i$. But the same argument can now be used to show that $d_i$ divides $d_j$. Hence, $d_i = d_j$. ∎

It follows from the preceding that all states of an irreducible Markov chain have the same period. If the period is 1, we say that the chain is aperiodic. It's easy to see that only aperiodic chains can have limiting probabilities.

Intimately linked to limiting probabilities are stationary probabilities. The probability vector $\pi_i$, $i \in S$, is said to be a stationary probability vector for the Markov chain if

$$\pi_j = \sum_i \pi_i P_{ij} \quad \text{for all } j$$
$$\sum_j \pi_j = 1$$

Its name arises from the fact that if $X_0$ is distributed according to a stationary probability vector $\{\pi_i\}$ then

$$P(X_1 = j) = \sum_i P(X_1 = j \mid X_0 = i)\,\pi_i = \sum_i \pi_i P_{ij} = \pi_j$$

and, by a simple induction argument,

$$P(X_n = j) = \sum_i P(X_n = j \mid X_{n-1} = i)\,P(X_{n-1} = i) = \sum_i \pi_i P_{ij} = \pi_j$$

Consequently, if we start the chain with a stationary probability vector then $X_n, X_{n+1}, \ldots$ has the same probability distribution for all $n$.
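For a finite chain, the stationarity equations $\pi_j = \sum_i \pi_i P_{ij}$, $\sum_j \pi_j = 1$ form a linear system that can be solved directly. The sketch below is one way to do this with exact rational arithmetic and Gauss-Jordan elimination; it assumes the system has a unique solution (as it does for a finite irreducible chain), and the function names and the particular two-state matrix (of the form in Example 5.4 with $\alpha = 1/2$, $\beta = 1/4$) are illustrative choices.

```python
from fractions import Fraction


def stationary_vector(P):
    """Solve pi = pi P, sum(pi) = 1 for a finite chain (unique solution assumed)."""
    n = len(P)
    # Rows j = 0..n-2 encode sum_i pi_i * (P[i][j] - 1{i == j}) = 0;
    # the last row encodes sum_i pi_i = 1. The final column is the right side.
    A = [[Fraction(P[i][j]) - (1 if i == j else 0) for i in range(n)] + [Fraction(0)]
         for j in range(n - 1)]
    A.append([Fraction(1)] * n + [Fraction(1)])
    for col in range(n):  # Gauss-Jordan elimination
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [x / scale for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[r][n] for r in range(n)]


def is_stationary(pi, P):
    """Check pi_j = sum_i pi_i P_ij for every j, and that pi sums to 1."""
    n = len(P)
    balanced = all(sum(pi[i] * P[i][j] for i in range(n)) == pi[j] for j in range(n))
    return balanced and sum(pi) == 1


P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
pi = stationary_vector(P)
print(pi, is_stationary(pi, P))
```

One balance equation is dropped in favor of the normalization because the $n$ balance equations are linearly dependent.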
The following result will be needed later.

Proposition 5.12 An irreducible transient Markov chain does not have a stationary probability vector.

Proof. Assume there is a stationary probability vector $\pi_i$, $i \ge 0$, and take it to be the probability mass function of $X_0$. Then, for any $j$

$$\pi_j = P(X_n = j) = \sum_i \pi_i P_{ij}^{(n)}$$

Consequently, for any $m$

$$\pi_j = \lim_{n \to \infty} \sum_i \pi_i P_{ij}^{(n)}$$
$$\le \lim_{n \to \infty} \Big( \sum_{i \le m} \pi_i P_{ij}^{(n)} + \sum_{i > m} \pi_i \Big)$$
$$= \sum_{i \le m} \pi_i \lim_{n \to \infty} P_{ij}^{(n)} + \sum_{i > m} \pi_i$$
$$= \sum_{i > m} \pi_i$$

where the final equality used that $\sum_n P_{ij}^{(n)} < \infty$ because $j$ is transient, implying that $\lim_{n \to \infty} P_{ij}^{(n)} = 0$. Letting $m \to \infty$ shows that $\pi_j = 0$ for all $j$, contradicting the fact that $\sum_j \pi_j = 1$. Thus, assuming a stationary probability vector results in a contradiction, proving the result. ∎

The following theorem is of key importance.

Theorem 5.13 An irreducible Markov chain has a stationary probability vector $\{\pi_i\}$ if and only if all states are positive recurrent. The stationary probability vector is unique and satisfies

$$\pi_j = 1/\mu_j$$

Moreover, if the chain is aperiodic then

$$\pi_j = \lim_{n \to \infty} P_{ij}^{(n)}$$

To prove the preceding theorem, we will make use of a couple of lemmas.

Lemma 5.14 For an irreducible Markov chain, if there exists a stationary probability vector $\{\pi_i\}$ then all states are positive recurrent. Moreover, the stationary probability vector is unique and satisfies

$$\pi_j = 1/\mu_j$$

Proof. Let $\pi_j$ be stationary probabilities, and suppose that $P(X_0 = j) = \pi_j$ for all $j$. We first show that $\pi_i > 0$ for all $i$. To verify this, suppose that $\pi_k = 0$. Now for any state $j$, because the chain is irreducible, there is an $n$ such that $P_{jk}^{(n)} > 0$. Because $X_0$ is determined by the stationary probabilities,

$$\pi_k = P(X_n = k) = \sum_i \pi_i P_{ik}^{(n)} \ge \pi_j P_{jk}^{(n)}$$

Consequently, if $\pi_k = 0$ then so is $\pi_j$. Because $j$ was arbitrary, that means that if $\pi_i = 0$ for any $i$, then $\pi_i = 0$ for all $i$. But that would contradict the fact that $\sum_i \pi_i = 1$. Hence any stationary probability vector for an irreducible Markov chain must have all positive elements.
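Now, recall that $T_j = \min(n > 0 : X_n = j)$. So,

$$\mu_j = E[T_j \mid X_0 = j] = \sum_{n=1}^{\infty} P(T_j \ge n \mid X_0 = j) = \sum_{n=1}^{\infty} \frac{P(T_j \ge n, X_0 = j)}{P(X_0 = j)}$$

Because $X_0$ is chosen according to the stationary probability vector $\{\pi_i\}$, this gives

$$\pi_j \mu_j = \sum_{n=1}^{\infty} P(T_j \ge n, X_0 = j) \tag{5.1}$$

Now,

$$P(T_j \ge 1, X_0 = j) = P(X_0 = j) = \pi_j$$

and, for $n \ge 2$

$$P(T_j \ge n, X_0 = j) = P(X_i \ne j, 1 \le i \le n-1, X_0 = j)$$
$$= P(X_i \ne j, 1 \le i \le n-1) - P(X_i \ne j, 0 \le i \le n-1)$$
$$= P(X_i \ne j, 1 \le i \le n-1) - P(X_i \ne j, 1 \le i \le n)$$

where the final equality used that $X_0, X_1, \ldots$ is a stationary sequence, so that $(X_0, \ldots, X_{n-1})$ and $(X_1, \ldots, X_n)$ have the same distribution. Substituting these relations into (5.1), the sum telescopes to give

$$\pi_j \mu_j = \pi_j + P(X_1 \ne j) - \lim_{n \to \infty} P(X_i \ne j, 1 \le i \le n)$$

But the existence of a stationary probability vector and Proposition 5.12 show that the Markov chain is recurrent, and thus that $\lim_{n \to \infty} P(X_i \ne j, 1 \le i \le n) = P(X_i \ne j \text{ for all } i \ge 1) = 0$. Because $P(X_1 \ne j) = 1 - \pi_j$, we thus obtain

$$\pi_j = 1/\mu_j$$

thus showing that there is at most one stationary probability vector. In addition, because all $\pi_j > 0$, we have that all $\mu_j < \infty$, showing that all states of the chain are positive recurrent. ∎

Lemma 5.15 If some state of an irreducible Markov chain is positive recurrent then there exists a stationary probability vector.

Proof. Suppose state $k$ is positive recurrent. Thus,

$$\mu_k = E_k[T_k] < \infty$$

Say that a new cycle begins every time the chain makes a transition into state $k$. For any state $j$, let $A_j$ denote the amount of time the chain spends in state $j$ during a cycle. Then

$$E[A_j] = E_k\Big[\sum_{n=0}^{\infty} I_{\{X_n = j,\, T_k > n\}}\Big] = \sum_{n=0}^{\infty} E_k\big[I_{\{X_n = j,\, T_k > n\}}\big] = \sum_{n=0}^{\infty} P_k(X_n = j, T_k > n)$$

We claim that $\pi_j = E[A_j]/\mu_k$, $j \ge 0$, is a stationary probability vector. Because $E[\sum_j A_j]$ is the expected time of a cycle, it must equal $E_k[T_k]$, showing that

$$\sum_j \pi_j = 1$$

Moreover, for $j \ne k$

$$\mu_k \pi_j = \sum_{n \ge 0} P_k(X_n = j, T_k > n)$$
$$= \sum_i \sum_{n \ge 1} P_k(X_n = j, T_k > n-1, X_{n-1} = i)$$
$$= \sum_i \sum_{n \ge 1} P_k(T_k > n-1, X_{n-1} = i)\, P_k(X_n = j \mid T_k > n-1, X_{n-1} = i)$$
$$= \sum_i \sum_{n \ge 1} P_k(T_k > n-1, X_{n-1} = i)\, P_{ij}$$
$$= \sum_i \sum_{n \ge 0} P_k(T_k > n, X_n = i)\, P_{ij}$$
$$= \sum_i E[A_i]\, P_{ij}$$
$$= \mu_k \sum_i \pi_i P_{ij}$$

Finally,

$$\sum_i \pi_i P_{ik} = \sum_i \pi_i \Big(1 - \sum_{j \ne k} P_{ij}\Big) = 1 - \sum_{j \ne k} \sum_i \pi_i P_{ij} = 1 - \sum_{j \ne k} \pi_j = \pi_k$$

and the proof is complete. ∎

Note that Lemmas 5.14 and 5.15 imply the following.

Corollary 5.16 If one state of an irreducible Markov chain is positive recurrent then all states are positive recurrent.

We are now ready to prove Theorem 5.13.

Proof. All that remains to be proven is that if the chain is aperiodic, as well as irreducible and positive recurrent, then the stationary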
dtationary and Limiting Listriputions probabilities are also limiting probabilities To prove this, let m;,i > 0, be stationary probabilities. Let Xp, > 0 and Yq,n > 0 be indepen- dent Markov chains, both with transition probabilities P,,j, but with Xo =i and with P(% = 4) = 7. Let N = min(n : Xn = Yn) We first show that P(N < 00) = 1. To do so, consider the Markov chain whose state at time n is (Xp, Yn), and which thus has transition probabilities Pj) (47) = PiPjr That this chain is irreducible can be seen by the following ar- gument. Because {Xp} is irreducible and aperiodic, it follows that for any state i there are relatively prime integers n,m such that ”) Ph™ > 0. But any sufficiently large integer can be expressed as a linear combination of relatively prime integers, implying that there is an integer Nj such that P® 30 foralln> N Because i and j communicate this implies the existence of an integer Nig such that PI) >0 for all n> Nig Hence, | = P® p > 0 for all sufficiently large n Gk) 7) ~ Ger ly larg which shows that the vector chain (Xn, Yn) is irreducible. In addition, we claim that 7,3 = 77 is a stationary probability vector, which is seen from Tiny = YomPes oD a, P,j = DV rete Pei Peg k r ker By Lemma 5.14 this shows that the vector Markov chain is positive recurrent and so P(N < co) = 1 and thus lim, P(N > n) = 0. Now, let Z, = Xn ifn < N and let Z, = Y; ifn > N. It is easy to see that Z,,n > 0 is also a Markov chain with transition156 Chapter 5 Markov Chains probabilities P;; and has Zp = i. Now (n) Pa W P(Z, =i) P(Z, gN n) P(Yn = 5,N $1) +P(Zn = 5,N >n) P(¥_ =) + P(N > n) aj + P(N >n) (5.2) s On the other hand my = PYn = 3) P(¥n = 5,N ) P(Zn = G,N 1) P(Zq = j) + P(N > n) PI) + P(N >n) (5.3) IA Wl i Hence, from (5.2) and (5.3) we see that lim PS? = ny. 
Remark 5.17 It follows from Theorem 5.13 that if we have an irreducible Markov chain, and we can find a solution of the stationarity equations
$$\pi_j = \sum_i \pi_i P_{ij}, \quad j \ge 0, \qquad \sum_i \pi_i = 1,$$
then the Markov chain is positive recurrent, and the $\pi_i$ are the unique stationary probabilities. If, in addition, the chain is aperiodic then the $\pi_i$ are also limiting probabilities.

Remark 5.18 Because $\mu_i$ is the mean number of transitions between successive visits to state $i$, it is intuitive (and will be formally proven in the chapter on renewal theory) that the long run proportion of time that the chain spends in state $i$ is equal to $1/\mu_i$. Hence, the stationary probability $\pi_i$ is equal to the long run proportion of time that the chain spends in state $i$.

Definition 5.19 A positive recurrent, aperiodic, irreducible Markov chain is called an ergodic Markov chain.

Definition 5.20 A positive recurrent irreducible Markov chain whose initial state is distributed according to its stationary probabilities is called a stationary Markov chain.

5.6 Time Reversibility

A stationary Markov chain $X_n$ is called time reversible if
$$P(X_n = j \mid X_{n+1} = i) = P(X_{n+1} = j \mid X_n = i) \quad \text{for all } i, j.$$
By the Markov property we know that the processes $X_{n+1}, X_{n+2}, \ldots$ and $X_{n-1}, X_{n-2}, \ldots$ are conditionally independent given $X_n$, and so it follows that the reversed process $X_{n-1}, X_{n-2}, \ldots$ will also be a Markov chain having transition probabilities
$$P(X_n = j \mid X_{n+1} = i) = \frac{P(X_n = j, X_{n+1} = i)}{P(X_{n+1} = i)} = \frac{\pi_j P_{ji}}{\pi_i},$$
where $\pi_i$ and $P_{ij}$ respectively denote the stationary probabilities and the transition probabilities for the Markov chain $X_n$. Thus an equivalent definition for $X_n$ being time reversible is if
$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j.$$
Intuitively, a Markov chain is time reversible if it looks the same running backwards as it does running forwards. It also means that the rate of transitions from $i$ to $j$, namely $\pi_i P_{ij}$, is the same as the rate of transitions from $j$ to $i$, namely $\pi_j P_{ji}$.
This happens if there are no “loops” for which a Markov chain is more likely in the long tun to go in one direction compared with the other direction. We illustrate with an examples.158 Chapter 5 Markov Chains Example 5.21 Random walk on the circle. Consider a particle which moves around n. positions on a circle numbered 1,2,....n according to transition probabilities Piz: = p = 1— Piyii forl Si, m; = 44 = 1, that the Markov chain is time reversible with the given stationary probabilities. m 5.7 A Mean Passage Time Bound Consider now a Markov chain whose state space is the set of non- negative integers, and which is such that Py =0, 0Si 0, we have that E[Dj] > di, then BAN] < 1/4 Proof. The proof is by induction on n. It is true when n = 1, because Nj is geometric with mean 1 1 PMI Fe ~ EID]160 Chapter 5 Markov Chains So, assume that E[N,] < Ch, 1/dj, for all k > ElNn-j]P(Da = 3) j=l a as S1+ Pan E(Nnl + D> P(Dn = 5) D0 di jn fl = 1+ PanE[Nn] + YPw = aod a a i=l i=n-j+1 S14 Pan B[Nn) + YP, = = ay 1/di ~ 5/4) ja where the last line follows because d; is nondecreasing. Continuing from the previous line we get = 1+ PanE[Nal + (1 — Pan) x Udi - ELI j) =14 PanElNa] + (1 — Pan) Sodas — Ba = : S Pan E[Nn] + (1 — Pan) S/d i= which completes the proof. m Example 5.24 At each stage, each of a set of balls is independently put in one of n urns, with each ball being put in urn i with probability Pi, 8.1 pi = 1. After this is done, all of the balls in the same urn
p which, in light of the previous inequality, must therefore be the maximal coupling. Let $B$, $C$, and $D$ be independent random variables with respective density functions
$$b(x) = \frac{\min(f(x), g(x))}{p}, \qquad c(x) = \frac{f(x) - \min(f(x), g(x))}{1 - p}, \qquad d(x) = \frac{g(x) - \min(f(x), g(x))}{1 - p},$$
let $I \sim$ Bernoulli$(p)$, and if $I = 1$ then let $\hat{X} = \hat{Y} = B$ and otherwise let $\hat{X} = C$ and $\hat{Y} = D$. This clearly gives $P(\hat{X} = \hat{Y}) \ge P(I = 1) = p$, and
$$P(\hat{X} \le x) = P(\hat{X} \le x \mid I = 1)\,p + P(\hat{X} \le x \mid I = 0)(1 - p) = p \int_{-\infty}^{x} b(t)\,dt + (1 - p) \int_{-\infty}^{x} c(t)\,dt = \int_{-\infty}^{x} f(t)\,dt = P(X \le x),$$
and similarly for $\hat{Y}$. Letting $A = \{x : f(x) > g(x)\}$, we must have
$$d_{TV}(X, Y) = \max\{P(X \in A) - P(Y \in A),\; P(Y \in A^c) - P(X \in A^c)\} = \max\{P(X \in A) - P(Y \in A),\; 1 - P(Y \in A) - 1 + P(X \in A)\} = P(X \in A) - P(Y \in A)$$
and so
$$P(\hat{X} \ne \hat{Y}) = 1 - \int \min(f(x), g(x))\,dx = 1 - \int_A g(x)\,dx - \int_{A^c} f(x)\,dx = 1 - P(Y \in A) - 1 + P(X \in A) = d_{TV}(X, Y).$$

2.3 Poisson Approximation and Le Cam's Theorem

It's well-known that a binomial distribution converges to a Poisson distribution when the number of trials is increased and the probability of success is decreased at the same time in such a way that the mean stays constant. This also motivates using a Poisson distribution as an approximation for a binomial distribution if the probability of success is small and the number of trials is large. If the number of trials is very large, computing the distribution function of a binomial distribution can be computationally difficult, whereas the Poisson approximation may be much easier.

A fact which is not quite as well known is that the Poisson distribution can be a reasonable approximation even if the trials have varying probabilities and even if the trials are not completely independent of each other. This approximation is interesting because dependent trials are notoriously difficult to analyze in general, and the Poisson distribution is elementary.

It's possible to assess how accurate such Poisson approximations are, and we first give a bound on the error of the Poisson approximation for completely independent trials with different probabilities of success.
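To see the binomial-to-Poisson convergence described above in numbers, the following sketch computes the total variation distance between a Binomial$(n, p)$ and a Poisson$(np)$ distribution exactly from their probability mass functions, using the standard identity that for integer-valued variables $d_{TV}$ equals half the sum of the absolute differences of the mass functions. The function names and the parameter choices ($np = 2$ with growing $n$) are assumptions made for illustration only.

```python
import math


def binomial_pmf(n, p):
    """P(X = k) for k = 0..n, computed recursively to avoid overflow."""
    pmf = [(1.0 - p) ** n]
    for k in range(1, n + 1):
        pmf.append(pmf[-1] * (n - k + 1) / k * p / (1.0 - p))
    return pmf


def poisson_pmf(lam, kmax):
    """P(Z = k) for k = 0..kmax, computed recursively."""
    pmf = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf


def tv_binomial_poisson(n, p):
    """Total variation distance between Binomial(n, p) and Poisson(n * p)."""
    b = binomial_pmf(n, p)
    q = poisson_pmf(n * p, n)
    # The binomial puts no mass above n, so the Poisson tail counts in full.
    poisson_tail = max(0.0, 1.0 - sum(q))
    return 0.5 * (sum(abs(bk - qk) for bk, qk in zip(b, q)) + poisson_tail)


# Hold the mean n * p = 2 fixed while the number of trials grows.
for n in (10, 100, 1000):
    print(n, round(tv_binomial_poisson(n, 2.0 / n), 5))
```

With the mean held fixed, the printed distances shrink as $n$ grows, in line with the convergence statement.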
Proposition 2.8 Let $X_i$ be independent Bernoulli$(p_i)$, and let $W = \sum_{i=1}^n X_i$, $Z \sim$ Poisson$(\lambda)$ and $\lambda = E[W] = \sum_{i=1}^n p_i$. Then
$$d_{TV}(W, Z) \le \sum_{i=1}^n p_i^2.$$

Proof. We first write $Z = \sum_{i=1}^n Z_i$, where the $Z_i$ are independent and $Z_i \sim$ Poisson$(p_i)$. Then we create the maximal coupling $(\hat{Z}_i, \hat{X}_i)$ of $(Z_i, X_i)$ and use the previous corollary to get
$$P(\hat{Z}_i = \hat{X}_i) = \sum_{k=0}^{\infty} \min(P(X_i = k), P(Z_i = k)) = \min(1 - p_i, e^{-p_i}) + \min(p_i, p_i e^{-p_i}) = 1 - p_i + p_i e^{-p_i} \ge 1 - p_i^2,$$
where we use $e^{-x} \ge 1 - x$. Using that $(\sum_{i=1}^n \hat{Z}_i, \sum_{i=1}^n \hat{X}_i)$ is a coupling of $(Z, W)$ yields that
$$d_{TV}(W, Z) \le P\Big(\sum_{i=1}^n \hat{Z}_i \ne \sum_{i=1}^n \hat{X}_i\Big) \le P\Big(\bigcup_i \{\hat{Z}_i \ne \hat{X}_i\}\Big) \le \sum_{i=1}^n P(\hat{Z}_i \ne \hat{X}_i) \le \sum_{i=1}^n p_i^2.$$

Remark 2.9 The drawback to this beautiful result is that when $E[W]$ is large the upper bound could be much greater than 1 even if $W$ has approximately a Poisson distribution. For example, if $W$ is a binomial$(100, .1)$ random variable and $Z \sim$ Poisson$(10)$ we should have $W \approx_d Z$, but the proposition gives us the trivial and useless result
$$d_{TV}(W, Z) \le 100 \times (.1)^2 = 1.$$

2.4 The Stein-Chen Method

The Stein-Chen method is another approach for obtaining an upper bound on $d_{TV}(W, Z)$, where $Z$ is a Poisson random variable and $W$ is some other variable you are interested in. This approach covers the distribution of the number of successes in dependent as well as independent trials with varying probabilities of success.

In order for the bound to be good, it should be close to 0 when $W$ is close in distribution to $Z$. In order to achieve this, the Stein-Chen method uses the interesting property of the Poisson distribution
$$kP(Z = k) = \lambda P(Z = k - 1). \quad (2.1)$$
Thus, for any bounded function $f$ with $f(0) = 0$,
$$E[Zf(Z)] = \sum_{k=0}^{\infty} k f(k) P(Z = k) = \lambda \sum_{k=1}^{\infty} f(k) P(Z = k - 1) = \lambda \sum_{i=0}^{\infty} f(i + 1) P(Z = i) = \lambda E[f(Z + 1)].$$
The secret to using the preceding is in cleverly picking a function $f$ so that $d_{TV}(W, Z) \le E[Wf(W)] - \lambda E[f(W + 1)]$, and so we are likely to get a very small upper bound when $W \approx_d Z$ (where we use the notation $W \approx_d Z$ to mean that $W$ and $Z$ approximately have the same distribution).
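The identity $E[Zf(Z)] = \lambda E[f(Z+1)]$ is easy to check numerically. The sketch below evaluates the difference $E[Wf(W)] - \lambda E[f(W+1)]$ for a Poisson variable (where it should vanish up to rounding) and for a binomial variable with the same mean (where it generally does not). The helper names, the truncation of the Poisson sum at 100 terms, and the test functions are assumptions chosen for illustration.

```python
import math


def poisson_pmf(lam, kmax):
    """P(Z = k) for k = 0..kmax, computed recursively."""
    pmf = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf


def binomial_pmf(n, p):
    """P(W = k) for k = 0..n, computed recursively."""
    pmf = [(1.0 - p) ** n]
    for k in range(1, n + 1):
        pmf.append(pmf[-1] * (n - k + 1) / k * p / (1.0 - p))
    return pmf


def stein_gap(pmf, lam, f):
    """E[W f(W)] - lam * E[f(W + 1)] for W with mass pmf[k] at k."""
    lhs = sum(k * f(k) * pk for k, pk in enumerate(pmf))
    rhs = lam * sum(f(k + 1) * pk for k, pk in enumerate(pmf))
    return lhs - rhs


lam = 3.0

# Bounded test function with f(0) = 0; the Poisson sum is truncated at
# k = 100, where the remaining mass is negligible for lam = 3.
gap_poisson = stein_gap(poisson_pmf(lam, 100), lam, math.sin)

# Same mean, different distribution: Binomial(10, 0.3), with f the
# indicator of {1}, so the gap equals P(W = 1) - lam * P(W = 0).
gap_binomial = stein_gap(binomial_pmf(10, 0.3), lam,
                         lambda k: 1.0 if k == 1 else 0.0)

print(gap_poisson, gap_binomial)
```

The Poisson gap is zero up to floating-point error, while the binomial gap is visibly positive; the Stein-Chen method turns the size of such gaps into a bound on the total variation distance.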
Suppose Z ~ Poisson() and A is any set of nonnegative integers. Define the function f4(k),k = 0,1,2,..., starting with f4(0) = 0 and then using the following “Stein equation” for the Poisson distribu- tion: falk +1) ~ kfa(k) = Ikea — P(Z € A). (2.2) Notice that by plugging in any random variable W for k and taking expected values we get AE|fa(W + 1) — E[W fa(W)]] = P(W € A) - P(Z € A), so that dry(W,Z) < sup AE [fa(W +} - EWfa(W)]I- Lemma 2.10 For any A and i,j, |fa(i)— fa(i)| < min(1,1/2)|i—3J]. Proof. The solution to (2.2) is Fick — P(Z < k) La Lxpz =b/P(Z=3)2.4 Ihe Stein-Chen Method 63 because when we plug it in to the left, hand side of (2.2) and use (2.1) in the second line below, we get Afa(k +1) — kfa(k) =v isk PS SH) Vigha = P(Z S- 1) > P(Z=k)/P(Z=j) AP(Z = k-1)/P(Z = 3) -r Ijzk ~ P(Z =k) 2, P= IPE =i) = Ikea- P(Z€ A), which is the right hand side of (2.2). Since P(Z P(Z is increasing in k and 1-P(Zh) P(Zk) | PZSk-1) x k [ Pu > k) , PO <2 MEIW Vil, i=l where Vj is any random variable having distribution Vi =a (W — 1X, =1). = Proposition 2.12 With the above notation if W > V; for all i, then we have drv(W,Z) <1—Var(W)/E|W].2.4 The Stein-Chen Method 65 Proof. Using W > Vi we have 7 YS AEIW - Vil = SO AE IW - Vd i=1 i=l a = +A- SOE + Vi 2 =¥4—S ON BIWIX = 1] al = 2 4a- So BLKA] i=1 =A-Var(W), and the preceding theorem along with \ = E[W] gives dry (W, 2) < min(1,1/E(W))(E(W) ~ Var(W)) <1-Var(W)/E[W). Example 2.13 Let X; be independent Bernoulli(p;) random vari- ables with A= 77, p;. Let W = 7%, Xi, and let Z ~ Poisson(A). Using Vi = 054: Xj, note that W > Vi and E[W — Vi] = pi, so the preceding theorem gives us dry(W, Z) < min(1,1/)) Sor. 
For instance, if $X$ is a binomial random variable with parameters $n = 100$, $p = 1/10$, then the upper bound on the total variation distance between $X$ and a Poisson random variable with mean 10 given by the Stein-Chen method is 1/10, as opposed to the upper bound of 1 which results from the Le Cam method of the preceding section.

Example 2.14 A coin with probability $p = 1 - q$ of coming up heads is flipped $n + k$ times. We are interested in $P(R_k)$, where $R_k$ is the event that a run of at least $k$ heads in a row occurs. To approximate this probability, whose exact expression is given in Example 1.10, let $X_i = 1$ if flip $i$ lands tails and flips $i + 1, \ldots, i + k$ all land heads, and let $X_i = 0$ otherwise, $i = 1, \ldots, n$. Let
$$W = \sum_{i=1}^n X_i$$
and note that there will be a run of $k$ heads either if $W > 0$ or if the first $k$ flips all land heads. Consequently,
$$P(W > 0) \le P(R_k) \le P(W > 0) + p^k.$$
Because the flips are independent and $X_i = 1$ implies $X_j = 0$ for all $j \ne i$ where $|i - j| \le k$, it follows that if we let
$$V_i = W - \sum_{j=i-k}^{i+k} X_j$$
then $V_i =_d (W - 1 \mid X_i = 1)$ and $W \ge V_i$. Using that $\lambda = E[W] = nqp^k$ and $E[W - V_i] = (2k + 1)qp^k$, we see that
$$d_{TV}(W, Z) \le \min(1, 1/\lambda)\, n(2k + 1) q^2 p^{2k},$$
where $Z \sim$ Poisson$(\lambda)$. For instance, suppose we flip a fair coin 1034 times, and want to approximate the probability that we have a run of 10 heads in a row. In this case, $n = 1024$, $k = 10$, $p = 1/2$, and so $\lambda = 1024/2^{11} = .5$. Consequently,
$$P(W > 0) \approx 1 - e^{-.5}$$
with the error in the above approximation being at most $21/2^{12}$. Consequently, we obtain that
$$1 - e^{-.5} - 21/2^{12} \le P(R_{10}) \le 1 - e^{-.5} + 21/2^{12} + (1/2)^{10}$$
or
$$0.388 \le P(R_{10}) \le 0.4.$$

Example 2.15 Birthday Problem. With $m$ people and $n$ days in the year, let $Y_i$ = number of people born on day $i$. Let $X_i = I_{\{Y_i = 0\}}$, and $W = \sum_{i=1}^n X_i$ equal the number of days on which nobody has a birthday.
Next imagine $n$ different hypothetical scenarios are constructed, where in scenario $i$ all the $Y_i$ people initially born on day $i$ have their birthdays reassigned randomly to other days. Let $1 + V_i$ equal the number of days under scenario $i$ on which nobody has a birthday. Notice that $W \ge V_i$ and $V_i =_d (W - 1 \mid X_i = 1)$. Also we have $\lambda = E[W] = n(1 - 1/n)^m$ and $E[V_i] = (n - 1)(1 - 1/(n - 1))^m$, so with $Z \sim$ Poisson$(\lambda)$ we get
$$d_{TV}(W, Z) \le \min(1, 1/\lambda)\, n \big( n(1 - 1/n)^m - (n - 1)(1 - 1/(n - 1))^m \big).$$

2.5 Stein's Method for the Geometric Distribution

In this section we will show how to obtain an upper bound on the distance $d_{TV}(W, Z)$ where $W$ is a given random variable and $Z$ has a geometric distribution with parameter $p = 1 - q = P(W = 1) = P(Z = 1)$. We will use a version of Stein's method applied to the geometric distribution. Define $f_A$ by setting $f_A(1) = 0$ and, for $k = 1, 2, \ldots$, using the recursion
$$f_A(k) - q f_A(k + 1) = I_{k \in A} - P(Z \in A).$$

Lemma 2.16 We have $|f_A(i) - f_A(j)| \le 1/p$.

Proof. It's easy to check that the solution is
$$f_A(k) = P(Z \in A, Z \ge k)/P(Z = k) - P(Z \in A)/p$$
and, since neither of the two terms in the difference above can be larger than $1/p$, the lemma follows.

Theorem 2.17 Given random variables $W$ and $V$ such that $V =_d (W - 1 \mid W > 1)$, let $p = P(W = 1)$ and $Z \sim$ geometric$(p)$. Then
$$d_{TV}(W, Z) \le q p^{-1} P(W \ne V).$$

Proof.
$$|P(W \in A) - P(Z \in A)| = |E[f_A(W) - q f_A(W + 1)]| = |q E[f_A(W) \mid W > 1] - q E[f_A(W + 1)]| \le q E|f_A(1 + V) - f_A(1 + W)| \le q p^{-1} P(W \ne V),$$
where the last inequality follows from the preceding lemma.

Example 2.18 Coin flipping. A coin has probability $p = 1 - q$ of coming up "heads" with each flip, and let $X_i = 1$ if the $i$th flip is heads and $X_i = 0$ if it is tails. We are interested in the distribution of the number of flips $M$ required until the start of the first appearance of a run of $k$ heads in a row. Suppose in particular we are interested in estimating $P(M \in A)$ for some set of integers $A$.
To do this, instead let N be the number of flips required until the start of the first appearance of a run of tails followed by k heads in a row. For instance, if k = 3, and the flip sequence is HHTTHHH, then M = 4 and N = 5. Note that in general we will have N+1 = M if M > 1 so consequently, dry(M,N +1) < P(N+1=M)=p*. We will first obtain a bound on the geometric approximation to N. Letting Aj = {Xi = 0,Xigi = Xina = = Xign = 1} be the event a run of a tail followed by k heads in a row appears starting with flip number 4, let W = min{i > 2: 4, = 1} — 1 and note that N=aW Next generate V as follows using a new sequence of coin flips ¥;,Y2, ... defined so that Yj = X;,Vi > k+1 and the vector (Yay Yay ++) Vent) =a (Xa) Xa, 00) Xegilla, = 0) is generated to be independent of all else. Then let By = {Yj = 0, Yis1 = Yin2 = +++ = Vigne = 1} be the event a run of tails followed by k heads in a row appears in this new sequence starting with fip number i. If I4, = 0 then let V = W, and otherwise let V = min{i >2.6 Stein's Method for the Normal Distribution 69 2: Ip, = 1} - 1. Note that this gives V =4 (W — 1|W > 1) and keh P(W #V) <> P(ALN Bi) ] +1 = P(A1) >° P(AiAS) . Atl = P(Ay) )> P(A:)/P(AS) i=2d = kqp”*. Thus if Z ~geometric(qp*) then applying the previous theorem gives dry(N, Z) < kp*. The triangle inequality applies to dry so we get dry(M — 1,2) < drv(M —1,N) + dry(N, Z) = (k + 1)p* and P(Z € A) —(k+1)p* < P(M € A) < P(Z€ A) + (k+1)p*. 2.6 Stein’s Method for the Normal Distribution Let Z ~ N(0,1) be a standard normal random variable. It can be shown that for smooth functions f we have E[f'(Z) — Zf(Z)] = 0, and this inspires the following Stein equation. 
Lemma 2.19 Given a >0 and any value of z let 1 ife #))dt, and a similar argument gives eRe) fe f(z)= a) fe (t)P(Z < t)dt P(Zsz) [~,, a i H()P(Z > that, we get If"(a)| < |a'(a)| + | + 2) f(a) + 2(h(x) — ElA(Z)]| Le a + er + (1+2°)P(Z > 2)/9(2))(2P(Z < 2) + 4(2)) ie + (1+2?)P(Z < 2)/4(2))(-2P(Z > 2) + 4(z)) A + wv x 82.6 Stein's Method tor the Normal Vistribution a We finally use max(2y) 2 (x) — f(y) < "(a)\da < — \f'@) ros fo If @)ldz s 5 to give the lemma. Theorem 2.20 If Z ~ N(0,1) and W = 0%, Xj where X; are in- dependent mean 0 and Var(W) = 1, then sup|PW <2) —P(Z <2) <2,|87 FIX) ae 1 Proof. Given any a > 0 and any 2z, define h, f as in the previous lemma. Then PW > 321X3I] PW $2) -P(Z <2) $2.13) > BUX). and then by choosing we get72 Chapter 2 Stein's Method and Central Limit Theorems Repeating the same argument starting with P(Z <2)- P(W <2) 1, and W, = 0%, Xi then Wp i=l satisfies the conditions of Theorem 2.20. Because n a DAP = nba = EGE 0 it follows from Theorem 2.20 that P(W, <2)—+ P(Z <2). @74 Chapter 2 Stein's Method and Central Limit Theorems 2.7 Exercises 1. If X ~ Poisson(a) and Y ~ Poisson(b), with b > a, use coupling to show that Y >a X. . Suppose a particle starts at position 5 on a number line and at each time period the particle moves one position to the right with probability p and, if the particle is above position 0, moves one position to left with probability 1 — p. Let X;,(p) be the position of the particle at time n for the given value of p. Use coupling to show that Xp(a) >st Xn(b) for any n if a > b. . Let X,Y be indicator variables with E[X] = a and B[Y] = 6. (a) Show how to construct a maximal coupling X,¥ for X and Y, and then compute P(X = Y) as a function of a,b. (b) Show how to construct a minimal coupling to minimize P(X = ¥). . In a room full of n people, let X be the number of people who share a birthday with at least one other person in the room. Then let Y be the number of pairs of people in the room having the same birthday. 
(a) Compute E[X] and Var(X) and E[Y] and Var(¥). (b) Which of the two variables X or Y do you believe will more closely follow @ Poisson distribution? Why? (c) Ina room of 51 people, it turns out there are 3 pairs with the same birthday and also a triplet (3 people) with the same birthday. This is a total of 9 people and also 6 pairs. Use a Poisson approximation to estimate P(X > 9) and P(Y > 6). Which of these two approximations do you think will be better? Have we observed a rare event here? . Compute a bound on the accuracy of the better approximation in the previous exercise part (c) using the Stein-Chen method. . For discrete X,Y prove dry(X,¥) = 45, |P(X = 2)-P(Y = 2) . For discrete X,Y show that P(X # Y) > drv(X,Y) and show also that there exists a coupling that yields equality. . Compute a bound on the accuracy of a normal approximation for a Poisson random variable with mean 100.2.7 Exercises 75 9. If X ~ Geometric(p), with q = 1 — p, then show that for any bounded function f with f(0) = 0, we have E[f(X) —af(X + )) =0. 10. Suppose X),X2,... are independent mean zero random vari- ables with |Xq| <1 for all n and Dicp, Var(X;)/n + 8 < 00 If Sn = Dign Xi and Z ~ N(0,s), show that Sp/ Vn +p Z. 11. Suppose m balls are placed among n urns, with each ball inde- pendently going in to urn i with probability p;. Assume m is much larger than n. Approximate the chance none of the urns are empty, and give a bound on the error of the approximation. 12. A ring with a circumference of c is cut into n pieces (where nis large) by cutting at n places chosen uniformly at random around the ring. Estimate the chance you get k pieces of length at least a, and give a bound on the error of the approximation. 13. Suppose Xj, i= 1,2,...,10 are iid U(0, 1). Give an approxima- tion for P(}0j2, Xi > 7), and give a bound on the error of this approximation. 14. Suppose Xi, i = 1,2,...,n are independent random variables with E[X;] = 0, and 0", Var(X;) =1. 
Let $W = \sum_{i=1}^n X_i$ and show that
$$\Big| E|W| - \sqrt{2/\pi} \Big| \le 3 \sum_{i=1}^n E|X_i|^3.$$

Chapter 3 Conditional Expectation and Martingales

3.1 Introduction

A generalization of a sequence of independent random variables occurs when we allow the variables of the sequence to be dependent on previous variables in the sequence. One example of this type of dependence is called a martingale, and its definition formalizes the concept of a fair gambling game. A number of results which hold for independent random variables also hold, under certain conditions, for martingales, and seemingly complex problems can be elegantly solved by reframing them in terms of a martingale.

In section 2 we introduce the notion of conditional expectation, in section 3 we formally define a martingale, in section 4 we introduce the concept of stopping times and prove the martingale stopping theorem, in section 5 we give an approach for finding tail probabilities for martingales, and in section 6 we introduce supermartingales and submartingales, and prove the martingale convergence theorem.

3.2 Conditional Expectation

Let $X$ be such that $E[|X|] < \infty$. In a first course in probability, $E[X \mid Y]$, the conditional expectation of $X$ given $Y$, is defined to be that function of $Y$ that when $Y = y$ is equal to
$$E[X \mid Y = y] = \begin{cases} \sum_x x P(X = x \mid Y = y), & \text{if } X, Y \text{ are discrete} \\ \int x f_{X|Y}(x \mid y)\,dx, & \text{if } X, Y \text{ have joint density } f \end{cases}$$
where
$$f_{X|Y}(x \mid y) = \frac{f(x, y)}{\int f(x, y)\,dx} = \frac{f(x, y)}{f_Y(y)}. \quad (3.1)$$
The important result, often called the tower property, that
$$E[X] = E[E[X \mid Y]]$$
is then proven. This result, which is often written as
$$E[X] = \begin{cases} \sum_y E[X \mid Y = y] P(Y = y), & \text{if } X, Y \text{ are discrete} \\ \int E[X \mid Y = y] f_Y(y)\,dy, & \text{if } X, Y \text{ are jointly continuous} \end{cases}$$
is then gainfully employed in a variety of different calculations.

We now show how to give a more general definition of conditional expectation, that reduces to the preceding cases when the random variables are discrete or continuous.
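For the discrete case, the tower property can be verified by direct computation. The sketch below uses exact rational arithmetic on a small joint distribution; the table `JOINT` and the function names are hypothetical choices made for this illustration, not taken from the text.

```python
from fractions import Fraction as F

# Hypothetical joint pmf: keys are (x, y) pairs, values are P(X = x, Y = y).
JOINT = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (2, 1): F(1, 4)}


def marginal_y(joint):
    """Return P(Y = y) for every y with positive probability."""
    out = {}
    for (_, y), pr in joint.items():
        out[y] = out.get(y, 0) + pr
    return out


def cond_exp_x_given_y(joint):
    """Return the function y -> E[X | Y = y] as a dictionary."""
    py = marginal_y(joint)
    numer = {}
    for (x, y), pr in joint.items():
        numer[y] = numer.get(y, 0) + x * pr
    return {y: numer[y] / py[y] for y in py}


def expectation_x(joint):
    """E[X] computed directly from the joint pmf."""
    return sum(x * pr for (x, _), pr in joint.items())


def tower(joint):
    """E[E[X | Y]] = sum over y of E[X | Y = y] * P(Y = y)."""
    py = marginal_y(joint)
    h = cond_exp_x_given_y(joint)
    return sum(h[y] * py[y] for y in py)


print(cond_exp_x_given_y(JOINT))
print(expectation_x(JOINT), tower(JOINT))
```

Both routes give the same value of $E[X]$, as the tower property requires; using `Fraction` keeps the comparison exact rather than approximate.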
To motivate our definition, suppose that whether or not $A$ occurs is determined by the value of $Y$. (That is, suppose that $A \in \sigma(Y)$.) Then, using material from our first course in probability, we see that
$$E[X I_A] = E[E[X I_A \mid Y]] = E[I_A E[X \mid Y]]$$
where the final equality holds because, given $Y$, $I_A$ is a constant random variable.

We are now ready for a general definition of conditional expectation.

Definition For random variables $X, Y$, let $E[X \mid Y]$, called the conditional expectation of $X$ given $Y$, denote that function $h(Y)$ having the property that for any $A \in \sigma(Y)$
$$E[X I_A] = E[h(Y) I_A].$$
By the Radon-Nikodym Theorem of measure theory, a function $h$ that makes $h(Y)$ a (measurable) random variable and satisfies the preceding always exists, and, as we show in the following, is unique in the sense that any two such functions of $Y$ must, with probability 1, be equal. The function $h$ is also referred to as a Radon-Nikodym derivative.

Proposition 3.1 If $h_1$ and $h_2$ are functions such that
$$E[h_1(Y) I_A] = E[h_2(Y) I_A]$$
for any $A \in \sigma(Y)$, then
$$P(h_1(Y) = h_2(Y)) = 1.$$

Proof. Let $A_n = \{h_1(Y) - h_2(Y) > 1/n\}$. Then,
$$0 = E[h_1(Y) I_{A_n}] - E[h_2(Y) I_{A_n}] = E[(h_1(Y) - h_2(Y)) I_{A_n}] \ge \frac{1}{n} P(A_n)$$
showing that $P(A_n) = 0$. Because the events $A_n$ are increasing in $n$, this yields
$$0 = \lim_n P(A_n) = P(\lim_n A_n) = P(\cup_n A_n) = P(h_1(Y) > h_2(Y)).$$
Similarly, we can show that $0 = P(h_1(Y) < h_2(Y))$, which proves the result.

We now show that the preceding general definition of conditional expectation reduces to the usual ones when $X, Y$ are either jointly discrete or continuous.

Proposition 3.2 If $X$ and $Y$ are both discrete, then
$$E[X \mid Y = y] = \sum_x x P(X = x \mid Y = y)$$
whereas if they are jointly continuous with joint density $f$, then
$$E[X \mid Y = y] = \int x f_{X|Y}(x \mid y)\,dx$$
where $f_{X|Y}(x \mid y) = \dfrac{f(x, y)}{\int f(x, y)\,dx}$.

Proof Suppose $X$ and $Y$ are both discrete, and let
$$h(y) = \sum_x x P(X = x \mid Y = y).$$
For $A \in \sigma(Y)$, define
$$B = \{y : I_A = 1 \text{ when } Y = y\}.$$
Because $B \in \sigma(Y)$, it follows that $I_A = I_B$.
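Thus,
$$X I_A = \begin{cases} x, & \text{if } X = x, Y \in B \\ 0, & \text{otherwise} \end{cases}$$
and
$$h(Y) I_A = \begin{cases} \sum_x x P(X = x \mid Y = y), & \text{if } Y = y \in B \\ 0, & \text{otherwise.} \end{cases}$$
Thus,
$$E[X I_A] = \sum_x x P(X = x, Y \in B) = \sum_x x \sum_{y \in B} P(X = x, Y = y) = \sum_{y \in B} \sum_x x P(X = x \mid Y = y) P(Y = y) = E[h(Y) I_A].$$
The result thus follows by uniqueness. The proof in the continuous case is similar, and is left as an exercise.

For any sigma field $\mathcal{F}$ we can define $E[X \mid \mathcal{F}]$ to be that random variable in $\mathcal{F}$ having the property that for all $A \in \mathcal{F}$
$$E[X I_A] = E[E[X \mid \mathcal{F}] I_A].$$
Intuitively, $E[X \mid \mathcal{F}]$ represents the conditional expectation of $X$ given that we know all of the events in $\mathcal{F}$ that occur.

Remark 3.3 It follows from the preceding definition that $E[X \mid Y] = E[X \mid \sigma(Y)]$. For any random variables $X, X_1, \ldots, X_n$, define $E[X \mid X_1, \ldots, X_n]$ by
$$E[X \mid X_1, \ldots, X_n] = E[X \mid \sigma(X_1, \ldots, X_n)].$$
In other words, $E[X \mid X_1, \ldots, X_n]$ is that function $h(X_1, \ldots, X_n)$ for which
$$E[X I_A] = E[h(X_1, \ldots, X_n) I_A], \quad \text{for all } A \in \sigma(X_1, \ldots, X_n).$$

We now establish some important properties of conditional expectation.

Proposition 3.4
(a) The Tower Property: For any sigma field $\mathcal{F}$, $E[X] = E[E[X \mid \mathcal{F}]]$.
(b) For any $A \in \mathcal{F}$, $E[X I_A \mid \mathcal{F}] = I_A E[X \mid \mathcal{F}]$.
(c) If $X$ is independent of all $Y \in \mathcal{F}$, then $E[X \mid \mathcal{F}] = E[X]$.
(d) If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, then $E[\sum_{i=1}^n X_i \mid \mathcal{F}] = \sum_{i=1}^n E[X_i \mid \mathcal{F}]$.
(e) Jensen's Inequality: If $f$ is a convex function, then $E[f(X) \mid \mathcal{F}] \ge f(E[X \mid \mathcal{F}])$, provided the expectations exist.

Proof Recall that $E[X \mid \mathcal{F}]$ is the unique random variable in $\mathcal{F}$ such that
$$E[X I_A] = E[E[X \mid \mathcal{F}] I_A] \quad \text{if } A \in \mathcal{F}.$$
Letting $A = \Omega$, $I_A = I_\Omega = 1$, and part (a) follows.

To prove (b), fix $A \in \mathcal{F}$ and let $X^* = X I_A$. Because $E[X^* \mid \mathcal{F}]$ is the unique random variable in $\mathcal{F}$ such that $E[X^* I_{A'}] = E[E[X^* \mid \mathcal{F}] I_{A'}]$ for all $A' \in \mathcal{F}$, to show that $E[X^* \mid \mathcal{F}] = I_A E[X \mid \mathcal{F}]$, it suffices to show that for $A' \in \mathcal{F}$
$$E[X^* I_{A'}] = E[I_A E[X \mid \mathcal{F}] I_{A'}].$$
That is, it suffices to show that
$$E[X I_A I_{A'}] = E[I_A E[X \mid \mathcal{F}] I_{A'}]$$
or, equivalently, that
$$E[X I_{AA'}] = E[E[X \mid \mathcal{F}] I_{AA'}]$$
which, since $AA' \in \mathcal{F}$, follows by the definition of conditional expectation.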
Part (c) will follow if we can show that, for $A \in \mathcal{F}$
$$E[X I_A] = E[E[X] I_A]$$
which follows because
$$E[X I_A] = E[X] E[I_A] \quad \text{(by independence)} \quad = E[E[X] I_A].$$
We will leave the proofs of (d) and (e) as exercises.

Remark 3.5 It can be shown that $E[X \mid \mathcal{F}]$ satisfies all the properties of ordinary expectations, except that all probabilities are now computed conditional on knowing which events in $\mathcal{F}$ have occurred. For instance, applying the tower property to $E[X \mid \mathcal{F}]$ yields that
$$E[X \mid \mathcal{F}] = E[E[X \mid \mathcal{F} \cup \mathcal{G}] \mid \mathcal{F}].$$

While the conditional expectation of $X$ given the sigma field $\mathcal{F}$ is defined to be that function satisfying
$$E[X I_A] = E[E[X \mid \mathcal{F}] I_A] \quad \text{for all } A \in \mathcal{F}$$
it can be shown, using the dominated convergence theorem, that
$$E[XW] = E[E[X \mid \mathcal{F}] W] \quad \text{for all } W \in \mathcal{F}. \quad (3.2)$$
The following proposition is quite useful.

Proposition 3.6 If $W \in \mathcal{F}$, then
$$E[XW \mid \mathcal{F}] = W E[X \mid \mathcal{F}].$$

Before giving a proof, let us note that the result is quite intuitive. Because $W \in \mathcal{F}$, it follows that conditional on knowing which events of $\mathcal{F}$ occur (that is, conditional on $\mathcal{F}$) the random variable $W$ becomes a constant, and the expected value of a constant times a random variable is just the constant times the expected value of the random variable. Next we formally prove the result.

Proof. Let
$$Y = E[XW \mid \mathcal{F}] - W E[X \mid \mathcal{F}] = (X - E[X \mid \mathcal{F}])W - (XW - E[XW \mid \mathcal{F}])$$
and note that $Y \in \mathcal{F}$. Now, for $A \in \mathcal{F}$,
$$E[Y I_A] = E[(X - E[X \mid \mathcal{F}]) W I_A] - E[(XW - E[XW \mid \mathcal{F}]) I_A].$$
However, because $W I_A \in \mathcal{F}$, we use Equation (3.2) to get
$$E[(X - E[X \mid \mathcal{F}]) W I_A] = E[XW I_A] - E[E[X \mid \mathcal{F}] W I_A] = 0$$
and by the definition of conditional expectation we have
$$E[(XW - E[XW \mid \mathcal{F}]) I_A] = E[XW I_A] - E[E[XW \mid \mathcal{F}] I_A] = 0.$$
Thus, we see that for $A \in \mathcal{F}$,
$$E[Y I_A] = 0.$$
Setting first $A = \{Y > 0\}$, and then $A = \{Y < 0\}$ (which are both in $\mathcal{F}$ because $Y \in \mathcal{F}$) shows that
$$P(Y > 0) = P(Y < 0) = 0.$$
Hence, $Y = 0$, which proves the result.

3.3 Martingales

We say that the sequence of sigma fields $\mathcal{F}_1, \mathcal{F}_2, \ldots$ is a filtration if $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$. We say a sequence of random variables $X_1, X_2, \ldots$
is adapted to $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$.

To obtain a feel for these definitions it is useful to think of $n$ as representing time, with information being gathered as time progresses. With this interpretation, the sigma field $\mathcal{F}_n$ represents all events that are determined by what occurs up to time $n$, and thus contains $\mathcal{F}_{n-1}$. The sequence $X_n, n \ge 1$, is adapted to the filtration $\mathcal{F}_n, n \ge 1$, when the value of $X_n$ is determined by what occurs by time $n$.

Definition 3.7 $Z_n$ is a martingale for filtration $\mathcal{F}_n$ if
1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $\mathcal{F}_n$
3. $E[Z_{n+1} \mid \mathcal{F}_n] = Z_n$

A bet is said to be fair if its expected gain is equal to 0. A martingale can be thought of as being a generalized version of a fair game. For consider a gambling casino in which bets can be made concerning games played in sequence. Let $\mathcal{F}_n$ consist of all events whose outcome is determined by the results of the first $n$ games. Let $Z_n$ denote the fortune of a specified gambler after the first $n$ games. Then the martingale condition states that no matter what the results of the first $n$ games, the gambler's expected fortune after game $n + 1$ is exactly what it was before that game. That is, no matter what has previously occurred, the gambler's expected winnings in any game is equal to 0.

It follows upon taking expectations of both sides of the final martingale condition that
$$E[Z_{n+1}] = E[Z_n]$$
implying that
$$E[Z_n] = E[Z_1].$$
We call $E[Z_1]$ the mean of the martingale.

Another useful martingale result is that
$$E[Z_{n+2} \mid \mathcal{F}_n] = E[E[Z_{n+2} \mid \mathcal{F}_{n+1} \cup \mathcal{F}_n] \mid \mathcal{F}_n] = E[E[Z_{n+2} \mid \mathcal{F}_{n+1}] \mid \mathcal{F}_n] = E[Z_{n+1} \mid \mathcal{F}_n] = Z_n.$$
Repeating this argument yields that
$$E[Z_{n+k} \mid \mathcal{F}_n] = Z_n, \quad k \ge 1.$$

Definition 3.8 We say that $Z_n, n \ge 1$, is a martingale (without specifying a filtration) if
1. $E[|Z_n|] < \infty$
2. $E[Z_{n+1} \mid Z_1, \ldots, Z_n] = Z_n$

If $\{Z_n\}$ is a martingale for the filtration $\mathcal{F}_n$, then it is a martingale. This follows from
$$E[Z_{n+1} \mid Z_1, \ldots, Z_n] = E[E[Z_{n+1} \mid Z_1, \ldots, Z_n, \mathcal{F}_n] \mid Z_1, \ldots, Z_n] = E[E[Z_{n+1} \mid \mathcal{F}_n] \mid Z_1, \ldots, Z_n] = E[Z_n \mid Z_1, \ldots, Z_n] = Z_n
$$
Zn) where the second equality followed because Z; € Fn, for all i = on ‘We now give some examples of martingales. Example 3.9 If X;,i > 1, are independent zero mean random vari- ables, then Z, = 7%, X;,n > 1, is a martingale with respect to the filtration o(X1,...,Xn),m > 1. This follows because ElZn411X15-- +5 ElZn + Xn+ilX1,---» Xn} Biza\Xe ON Ela Xl = Zn + E[Xnsi] by the independence of the X; Zn . I86 Chapter 3 Conditional Expectation and Martingales Example 3.10 Let X;,i > 1, be independent and identically dis- tributed with mean zero and variance o?. Let S, = 2, Xi, and define Z, = S2?-no®, n>1 ‘Then Zp,n > 1 is a martingale for the filtration o(X1,...,Xn). To verify this claim, note that E(S2,5|X1,.-- Xn] El(Sn + Xn41)"|Xi,---) Xn) E[S2|X1,..-, Xn] + E[2SnXnvilXay--- Xal + E[X2,3|X1,---) Xn] $2 + 25nE[XnsilX1,--- Xn) + B[XAaal S32 + 2S, E[Xn4i] +0? = B40? " i Subtracting (n + 1)o? from both sides, yields ElZnt11X1,---)Xn]=Zn Our next example introduces an important type of martingale, known as a Doob martingale. Example 3.11 Let ¥ be an arbitrary random variable with E[[Y |] < 00, let Fy,n > 1, be a filtration, and define Zn = ElY Fr] We claim that Zn,n > 1, is a martingale with respect to the filtration Fayn > 1. To verify this, we must first show that E[|Z,|] 1, is a called a Doob martingale. . Example 3.12 Our final example generalizes the result that the suc- cessive partial sums of independent zero mean random variables con- stitutes a martingale. For any random variables X;,i > 1, the ran- dom variables X;— B[Xi|X1,...,Xi-1], 7 > 1, have mean 0. Even though they need not be independent, their partial sums constitute a martingale. That is, we claim that 7 Zn = (Ki - BIXIM,.-. i=l is, provided that E[|Zp|] < oo, a martingale with respect to the filtration o(X1,...,Xn),n > 1. To verify this claim, note that Zn = Zn + Xnti — E[Xnti1X1,-- Xnj Thus, ElZn+i|X1,- = Zn + E[XneilX1,. 
~ B{XnsilX,.- =% © 3.4 The Martingale Stopping Theorem The positive integer valued, possibly infinite, random variable N is said to be a random time for the filtration F, if {N =n} € Fy or, equivalently, if {N > n} € Fy-1, for all n. If P(N < oo) = 1, then the random time is said to be a stopping time for the filtration. Thinking once again in terms of information being amassed over time, with F;, being the cumulative information as to all events that have occurred by time n, the random variable NV will be a stopping88 Chapter 3 Conditional Expectation and Martingales time for this filtration if the decision whether to stop at time n (so that N =n) depends only on what has occurred by time n. (That is, the decision to stop at time 7 is not allowed to depend on the results of future events.) It should be noted, however, that the decision to stop at time n need not be independent of future events, only that it must be conditionally independent of future events given all information up to the present. Lemma 3.13 Let Zn,n > 1, be a martingale for the filtration Fr. If N is a random time for this filtration, then the process Zn = Zmin(win)s 2 2 1, called the stopped process, is also a martingale for the filtration Fy. Proof Start with the identity Zn = Zn—1 + Tpw>n)(Zn — Zn-1) To verify the preceding, consider two cases: (1) N 2m: Here, Zn, = Zp, Zn1 = Zn—1, [qy>n} = 1, and the preceding is true (2) Nny(Zn — Zn—1)|Fn—1] Zn-1 + Upn>n}E|Zn — Zn-1|Fn-1] Ze Theorem 3.14 The Martingale Stopping Theorem Let Zn, n > 1, be a martingale for the filtration Fn, and suppose that N is a stopping time for this filtration. Then ElZy) = E[Z} if any of the following three sufficient conditions hold. (1) Z, are uniformly bounded; (2) N is bounded; (3) E|N] <0, and there exists M < oo such that El\Zns1 -— ZnllFn] -P(N >i) ia MEIN] < oo. IA90 Chapter 3 Conditional Expectation and Martingales A corollary of the martingale stopping theorem is Wald’s equa- tion. 
Corollary 3.15 Wald's Equation If $X_1, X_2, \ldots$ are independent and identically distributed with finite mean $\mu = E[X_i]$, and if $N$ is a stopping time for the filtration $\mathcal{F}_n = \sigma(X_1, \ldots, X_n), n \ge 1$, such that $E[N] < \infty$, then
$$E\Big[\sum_{i=1}^N X_i\Big] = \mu E[N].$$

Proof. $Z_n = \sum_{i=1}^n (X_i - \mu), n \ge 1$, being the successive partial sums of independent zero mean random variables, is a martingale with mean 0. Hence, assuming the martingale stopping theorem can be applied we have that
$$0 = E[Z_N] = E\Big[\sum_{i=1}^N (X_i - \mu)\Big] = E\Big[\sum_{i=1}^N X_i - N\mu\Big] = E\Big[\sum_{i=1}^N X_i\Big] - E[N\mu].$$
To complete the proof, we verify sufficient condition (3) of the martingale stopping theorem.
$$E[|Z_{n+1} - Z_n| \mid \mathcal{F}_n] = E[|X_{n+1} - \mu| \mid \mathcal{F}_n] = E[|X_{n+1} - \mu|] \quad \text{(by independence)} \quad \le E[|X_1|] + |\mu|.$$
Thus, condition (3) is verified and the result proved.

Example 3.16 Suppose independent and identically distributed discrete random variables $X_1, X_2, \ldots$, are observed in sequence. With $P(X_i = j) = p_j$, what is the expected number of random variables that must be observed until the subsequence 0, 1, 2, 0, 1 occurs? What is the variance?

Solution Consider a fair gambling casino, that is, one in which the expected casino winning for every bet is 0. Note that if a gambler bets her entire fortune of $a$ that the next outcome is $j$, then her fortune after the bet will either be 0 with probability $1 - p_j$, or $a/p_j$ with probability $p_j$. Now, imagine a sequence of gamblers betting at this casino. Each gambler starts with an initial fortune 1 and stops playing if his or her fortune ever becomes 0. Gambler $i$ bets 1 that $X_i = 0$; if she wins, she bets her entire fortune (of $1/p_0$) that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her
entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $(p_0^2 p_1^2 p_2)^{-1}$.

Let $Z_n$ denote the casino's winnings after the data value $X_n$ is observed; because it is a fair casino, $Z_n, n \ge 1$, is a martingale with mean 0 with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. Let $N$ denote the number of random variables that need be observed until the pattern 0, 1, 2, 0, 1 appears - so $(X_{N-4}, \ldots, X_N) = (0, 1, 2, 0, 1)$. As it is easy to verify that $N$ is a stopping time for the filtration, and that condition (3) of the martingale stopping theorem is satisfied when $M = 4/(p_0^2 p_1^2 p_2)$, it follows that $E[Z_N] = 0$. However, after $X_N$ has been observed, each of the gamblers $1, \ldots, N-5$ would have lost 1; gambler $N-4$ would have won $(p_0^2 p_1^2 p_2)^{-1} - 1$; gamblers $N-3$ and $N-2$ would each have lost 1; gambler $N-1$ would have won $(p_0 p_1)^{-1} - 1$; gambler $N$ would have lost 1. Therefore,
$$Z_N = N - (p_0^2 p_1^2 p_2)^{-1} - (p_0 p_1)^{-1}$$
Using that $E[Z_N] = 0$ yields the result
$$E[N] = (p_0^2 p_1^2 p_2)^{-1} + (p_0 p_1)^{-1}$$
In the same manner we can compute the expected time until any specified pattern occurs in independent and identically distributed generated random data. For instance, when making independent flips of a coin that comes up heads with probability $p$, the mean number of flips until the pattern HHTTHH appears is $p^{-4}q^{-2} + p^{-2} + p^{-1}$, where $q = 1 - p$.

To determine $\mathrm{Var}(N)$, suppose now that gambler $i$ starts with an initial fortune $i$ and bets that amount that $X_i = 0$; if she wins, she bets her entire fortune that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $i/(p_0^2 p_1^2 p_2)$.
The casino's winnings at time $N$ is thus
$$Z_N = 1 + 2 + \cdots + N - \frac{N-4}{p_0^2 p_1^2 p_2} - \frac{N-1}{p_0 p_1} = \frac{N(N+1)}{2} - \frac{N-4}{p_0^2 p_1^2 p_2} - \frac{N-1}{p_0 p_1}$$
Assuming that the martingale stopping theorem holds (although none of the three sufficient conditions hold, the stopping theorem can still be shown to be valid for this martingale), we obtain upon taking expectations that
$$E[N^2] + E[N] = 2\,\frac{E[N]-4}{p_0^2 p_1^2 p_2} + 2\,\frac{E[N]-1}{p_0 p_1}$$
Using the previously obtained value of $E[N]$, the preceding can now be solved for $E[N^2]$, to obtain $\mathrm{Var}(N) = E[N^2] - (E[N])^2$. ∎

Example 3.17 The cards from a shuffled deck of 26 red and 26 black cards are to be turned over one at a time. At any time a player can say "next", and is a winner if the next card is red and is a loser otherwise. A player who has not yet said "next" when only a single card remains is a winner if the final card is red, and is a loser otherwise. What is a good strategy for the player?

Solution Every strategy has probability 1/2 of resulting in a win. To see this, let $R_n$ denote the number of red cards remaining in the deck after $n$ cards have been shown. Then
$$E[R_{n+1} \mid R_1, \ldots, R_n] = R_n - \frac{R_n}{52-n} = R_n\,\frac{51-n}{52-n}$$
Hence, $\frac{R_n}{52-n}, n \ge 0$, is a martingale. Because $R_0/52 = 1/2$, this martingale has mean 1/2. Now, consider any strategy, and let $N$ denote the number of cards that are turned over before "next" is said. Because $N$ is a bounded stopping time, it follows from the martingale stopping theorem that
$$E\Big[\frac{R_N}{52-N}\Big] = 1/2$$
Hence, with $I = I_{\{\text{win}\}}$,
$$E[I] = E[E[I \mid R_N]] = E\Big[\frac{R_N}{52-N}\Big] = 1/2$$ ∎

Our next example involves the matching problem.

Example 3.18 Each member of a group of $n$ individuals throws his or her hat in a pile. The hats are then mixed together, and each person randomly selects a hat in such a manner that each of the $n!$ possible selections of the $n$ individuals is equally likely. Any individual who selects his or her own hat departs, and that is the end of round one.
If any individuals remain, then each throws the hat they have in a pile, and each one then randomly chooses a hat. Those selecting their own hats leave, and that ends round two. Find $E[N]$, where $N$ is the number of rounds until everyone has departed.

Solution Let $X_i, i \ge 1$, denote the number of matches on round $i$ for $i = 1, \ldots, N$, and let it equal 1 for $i > N$. To solve this example, we will use the zero mean martingale $Z_k, k \ge 1$, defined by
$$Z_k = \sum_{i=1}^{k} (X_i - E[X_i \mid X_1, \ldots, X_{i-1}]) = \sum_{i=1}^{k} (X_i - 1)$$
where the final equality follows because, for any number of remaining individuals, the expected number of matches in a round is 1 (which is seen by writing this as the sum of indicator variables for the events that each remaining person has a match). Because
$$N = \min\Big\{k : \sum_{i=1}^{k} X_i = n\Big\}$$
is a stopping time for this martingale, we obtain from the martingale stopping theorem that
$$0 = E[Z_N] = E\Big[\sum_{i=1}^{N} X_i - N\Big] = n - E[N]$$
and so $E[N] = n$. ∎

Example 3.19 If $X_1, X_2, \ldots$ is a sequence of independent and identically distributed random variables, $P(X_i = 0) \ne 1$, then the process
$$S_n = \sum_{i=1}^{n} X_i, \quad n \ge 1$$
is said to be a random walk. For given positive constants $a, b$, let $p$ denote the probability that the random walk becomes as large as $a$ before it becomes as small as $-b$.

We now show how to use martingale theory to approximate $p$. In the case where we have a bound $|X_i| < c$ it can be shown there will be a value $\theta \ne 0$ such that
$$E[e^{\theta X}] = 1$$
Then because
$$Z_n = e^{\theta S_n} = \prod_{i=1}^{n} e^{\theta X_i}$$
is the product of independent random variables with mean 1, it follows that $Z_n, n \ge 1$, is a martingale having mean 1. Let
$$N = \min(n : S_n \ge a \ \text{or} \ S_n \le -b)$$
Condition (3) of the martingale stopping theorem can be shown to hold, implying that
$$E[e^{\theta S_N}] = 1$$
Thus,
$$1 = E[e^{\theta S_N} \mid S_N \ge a]\,p + E[e^{\theta S_N} \mid S_N \le -b]\,(1-p)$$
Now, if $\theta > 0$, then
$$e^{\theta a} \le E[e^{\theta S_N} \mid S_N \ge a] \le e^{\theta(a+c)}$$
and
$$e^{-\theta(b+c)} \le E[e^{\theta S_N} \mid S_N \le -b] \le e^{-\theta b}$$
yielding the bounds
$$\frac{1 - e^{-\theta b}}{e^{\theta(a+c)} - e^{-\theta b}} \le p \le \frac{1 - e^{-\theta(b+c)}}{e^{\theta a} - e^{-\theta(b+c)}}$$
and motivating the approximation
$$p \approx \frac{1 - e^{-\theta b}}{e^{\theta a} - e^{-\theta b}}$$
We leave it as an exercise to obtain bounds on $p$ when $\theta < 0$. ∎
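As a rough numerical illustration of the approximation in Example 3.19, the following sketch takes the special case of a walk with steps $\pm 1$ (an assumption made here for illustration, not taken from the text). In that case $\theta = \ln((1-r)/r)$ solves $E[e^{\theta X}] = 1$, where $r$ is a hypothetical name for the up-step probability, and for integer $a, b$ the walk cannot overshoot a boundary, so the approximation should agree with the hitting probability obtained directly from the one-step recursion.

```python
import math


def theta_for_simple_walk(r):
    """Nonzero root of E[exp(theta*X)] = 1 when X = +1 w.p. r and -1 w.p. 1 - r."""
    return math.log((1.0 - r) / r)


def approx_hit_prob(theta, a, b):
    """Approximation from Example 3.19, which ignores any overshoot."""
    return (1.0 - math.exp(-theta * b)) / (math.exp(theta * a) - math.exp(-theta * b))


def exact_hit_prob(r, a, b, sweeps=20000):
    """P(reach a before -b) starting from 0 for the +/-1 walk.

    Repeatedly applies h(x) = r*h(x+1) + (1-r)*h(x-1) with h(a) = 1 and
    h(-b) = 0 until the values settle (illustrative, not optimized).
    """
    h = {x: 0.0 for x in range(-b, a + 1)}
    h[a] = 1.0
    for _ in range(sweeps):
        for x in range(-b + 1, a):
            h[x] = r * h[x + 1] + (1.0 - r) * h[x - 1]
    return h[0]


if __name__ == "__main__":
    r, a, b = 0.4, 3, 2  # negative drift, so theta > 0
    theta = theta_for_simple_walk(r)
    print(approx_hit_prob(theta, a, b), exact_hit_prob(r, a, b))
```

For steps that are not $\pm 1$ the two numbers would generally differ, with the gap controlled by the overshoot bound $c$.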
Our next example involves Doob's backward martingale. Before defining this martingale, we need the following definition.

Definition 3.20 The random variables $X_1, \ldots, X_n$ are said to be exchangeable if $X_{i_1}, \ldots, X_{i_n}$ has the same probability distribution for every permutation $i_1, \ldots, i_n$ of $1, \ldots, n$.

Suppose that $X_1, \ldots, X_n$ are exchangeable. Assume $E[|X_1|] < \infty$, let
$$S_j = \sum_{i=1}^{j} X_i, \quad j = 1, \ldots, n$$
and consider the Doob martingale $Z_1, \ldots, Z_n$, given by
$$Z_1 = E[X_1 \mid S_n], \qquad Z_j = E[X_1 \mid S_n, S_{n-1}, \ldots, S_{n+1-j}], \quad j = 1, \ldots, n$$
However,
$$S_{n+1-j} = E[S_{n+1-j} \mid S_{n+1-j}, X_{n+2-j}, \ldots, X_n] = \sum_{i=1}^{n+1-j} E[X_i \mid S_{n+1-j}, X_{n+2-j}, \ldots, X_n]$$
$$= (n+1-j)\,E[X_1 \mid S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \ \text{(by exchangeability)} \ = (n+1-j)\,Z_j$$
where the final equality follows since knowing $S_n, S_{n-1}, \ldots, S_{n+1-j}$ is equivalent to knowing $S_{n+1-j}, X_{n+2-j}, \ldots, X_n$. The martingale
$$Z_j = \frac{S_{n+1-j}}{n+1-j}, \quad j = 1, \ldots, n$$
is called the Doob backward martingale. We now apply it to solve the ballot problem.

Example 3.21 In an election between candidates A and B, candidate A receives $n$ votes and candidate B receives $m$ votes, where $n > m$. Assuming, in the count of the votes, that all orderings of the $n + m$ votes are equally likely, what is the probability that A is always ahead in the count of the votes?

Solution Let $X_i$ equal 1 if the $i$th vote counted is for A and let it equal $-1$ if that vote is for B. Because all orderings of the $n + m$ votes are assumed to be equally likely, it follows that the random variables $X_1, \ldots, X_{n+m}$ are exchangeable, and $Z_1, \ldots, Z_{n+m}$ is a Doob backward martingale when
$$Z_j = \frac{S_{n+m+1-j}}{n+m+1-j}$$
where $S_k = \sum_{i=1}^{k} X_i$. Because $Z_1 = S_{n+m}/(n+m) = (n-m)/(n+m)$, the mean of this martingale is $(n-m)/(n+m)$. Because $n > m$, A will always be ahead in the count of the vote unless there is a tie at some point, which will occur if one of the $S_j$ - or equivalently, one of the $Z_j$ - is equal to 0.
Consequently, define the bounded stopping time $N$ by
$$N = \min\{j : Z_j = 0 \ \text{or} \ j = n+m\}$$
Because $Z_{n+m} = X_1$, it follows that $Z_N$ will equal 0 if the candidates are ever tied, and will equal $X_1$ if A is always ahead. However, if A is always ahead, then A must receive the first vote; therefore,
$$Z_N = \begin{cases} 1, & \text{if A is always ahead} \\ 0, & \text{otherwise} \end{cases}$$
By the martingale stopping theorem, $E[Z_N] = (n-m)/(n+m)$, yielding the result
$$P(\text{A is always ahead}) = \frac{n-m}{n+m}$$ ∎

3.5 The Hoeffding-Azuma Inequality

Let $Z_n, n \ge 0$, be a martingale with respect to the filtration $\mathcal{F}_n$. If the differences $Z_n - Z_{n-1}$ can be shown to lie in a bounded random interval of the form $[-B_n, -B_n + d_n]$, where $B_n \in \mathcal{F}_{n-1}$ and $d_n$ is constant, then the Hoeffding-Azuma inequality often yields useful bounds on the tail probabilities of $Z_n$. Before presenting the inequality we will need a couple of lemmas.

Lemma 3.22 If $E[X] = 0$, and $P(-\alpha \le X \le \beta) = 1$, then for any convex function $f$
$$E[f(X)] \le \frac{\beta}{\alpha+\beta} f(-\alpha) + \frac{\alpha}{\alpha+\beta} f(\beta)$$

Proof Because $f$ is convex it follows that, in the region $-\alpha$ [...] $100$). Thus, if the lemma were false, then the function
$$f(\alpha, \beta) = e^{\beta} + e^{-\beta} + \alpha(e^{\beta} - e^{-\beta}) - 2e^{\alpha\beta + \beta^2/2}$$
would assume a strictly positive maximum in the interior of the region $R = \{(\alpha, \beta) : |\alpha| \le 1, |\beta| \le 100\}$. Setting the partial derivatives of $f$ equal to 0, we obtain
$$e^{\beta} - e^{-\beta} + \alpha(e^{\beta} + e^{-\beta}) = 2(\alpha+\beta)e^{\alpha\beta+\beta^2/2} \quad (3.3)$$
$$e^{\beta} - e^{-\beta} = 2\beta e^{\alpha\beta+\beta^2/2} \quad (3.4)$$
We will now prove the lemma by showing that any solution of (3.3) and (3.4) must have $\beta = 0$. However, because $f(\alpha, 0) = 0$, this would contradict the hypothesis that $f$ assumes a strictly positive maximum in $R$, thus establishing the lemma.

So, assume that there is a solution of (3.3) and (3.4) in which $\beta \ne 0$. Now note that there is no solution of these equations for which $\alpha = 0$ and $\beta \ne 0$. For if there were such a solution, then (3.4) would say that
$$e^{\beta} - e^{-\beta} = 2\beta e^{\beta^2/2} \quad (3.5)$$
But expanding in a power series about 0 shows that (3.5) is equivalent to
$$\sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i+1)!}$$
$$= \sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{i!\,2^i}$$
which (because $(2i+1)! > i!\,2^i$ when $i > 0$) is clearly impossible when $\beta \ne 0$. Thus, any solution of (3.3) and (3.4) in which $\beta \ne 0$ will also have $\alpha \ne 0$. Assuming such a solution gives, upon dividing (3.3) by (3.4), that
$$1 + \alpha\,\frac{e^{\beta} + e^{-\beta}}{e^{\beta} - e^{-\beta}} = 1 + \frac{\alpha}{\beta}$$
Because $\alpha \ne 0$, the preceding is equivalent to
$$\beta(e^{\beta} + e^{-\beta}) = e^{\beta} - e^{-\beta}$$
or, expanding in a Taylor series,
$$\sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i)!} = \sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i+1)!}$$
which is clearly not possible when $\beta \ne 0$. Thus, there is no solution of (3.3) and (3.4) in which $\beta \ne 0$, thus proving the result. ∎

We are now ready for the Hoeffding-Azuma inequality.

Theorem 3.24 The Hoeffding-Azuma Inequality Let $Z_n, n \ge 1$, be a martingale with mean $\mu$ with respect to the filtration $\mathcal{F}_n$. Let $Z_0 = \mu$ and suppose there exist nonnegative random variables $B_n, n > 0$, where $B_n \in \mathcal{F}_{n-1}$, and positive constants $d_n, n > 0$, such that
$$-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$$
Then, for $n > 0$, $a > 0$,
(i) $P(Z_n - \mu \ge a) \le e^{-2a^2/\sum_{i=1}^{n} d_i^2}$
(ii) $P(Z_n - \mu \le -a) \le e^{-2a^2/\sum_{i=1}^{n} d_i^2}$

Proof Suppose that $\mu = 0$. For any $c > 0$
$$P(Z_n \ge a) = P(e^{cZ_n} \ge e^{ca}) \le e^{-ca} E[e^{cZ_n}] \quad (3.6)$$
where we use Markov's inequality for the inequality. Let $W_n = e^{cZ_n}$. Note that $W_0 = 1$, and that for $n > 0$
$$E[W_n \mid \mathcal{F}_{n-1}] = E[e^{cZ_{n-1}} e^{c(Z_n - Z_{n-1})} \mid \mathcal{F}_{n-1}] = e^{cZ_{n-1}} E[e^{c(Z_n - Z_{n-1})} \mid \mathcal{F}_{n-1}] = W_{n-1} E[e^{c(Z_n - Z_{n-1})} \mid \mathcal{F}_{n-1}] \quad (3.7)$$
where the second equality used that $Z_{n-1} \in \mathcal{F}_{n-1}$. Because
(i) $f(x) = e^{cx}$ is convex;
(ii) $E[Z_n - Z_{n-1} \mid \mathcal{F}_{n-1}] = E[Z_n \mid \mathcal{F}_{n-1}] - E[Z_{n-1} \mid \mathcal{F}_{n-1}] = Z_{n-1} - Z_{n-1} = 0$; and
(iii) $-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$
it follows from Lemma 3.22, with $\alpha = B_n$, $\beta = -B_n + d_n$, that
$$E[e^{c(Z_n - Z_{n-1})} \mid \mathcal{F}_{n-1}] \le E\Big[\frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \,\Big|\, \mathcal{F}_{n-1}\Big] = \frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n}$$
where the final equality used that $B_n \in \mathcal{F}_{n-1}$. Hence, from Equation (3.7), we see that
$$E[W_n \mid \mathcal{F}_{n-1}] \le W_{n-1}\,\frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \le W_{n-1}\,e^{c^2 d_n^2/8}$$
where the final inequality used Lemma 3.23 (with $p = B_n/d_n$, $t = c d_n$).
Taking expectations gives
$$E[W_n] \le E[W_{n-1}]\,e^{c^2 d_n^2/8}$$
Using that $E[W_0] = 1$ yields, upon iterating this inequality, that
$$E[W_n] \le \prod_{i=1}^{n} e^{c^2 d_i^2/8} = e^{c^2 \sum_{i=1}^{n} d_i^2/8}$$
Therefore, from Equation (3.6), we obtain that for any $c > 0$
$$P(Z_n \ge a) \le \exp\Big(-ca + c^2 \sum_{i=1}^{n} d_i^2/8\Big)$$
Letting $c = 4a/\sum_{i=1}^{n} d_i^2$ (which is the value of $c$ that minimizes $-ca + c^2 \sum_{i=1}^{n} d_i^2/8$) gives that
$$P(Z_n \ge a) \le e^{-2a^2/\sum_{i=1}^{n} d_i^2}$$
Parts (i) and (ii) of the Hoeffding-Azuma inequality now follow from applying the preceding, first to the zero mean martingale $\{Z_n - \mu\}$, and secondly to the zero mean martingale $\{\mu - Z_n\}$. ∎

Example 3.25 Let $X_i, i \ge 1$, be independent Bernoulli random variables with means $p_i, i = 1, \ldots, n$. Then
$$Z_n = \sum_{i=1}^{n} (X_i - p_i) = S_n - \sum_{i=1}^{n} p_i, \quad n \ge 0$$
is a martingale with mean 0. Because $Z_n - Z_{n-1} = X_n - p_n$, we see that
$$-p_n \le Z_n - Z_{n-1} \le 1 - p_n$$
Thus, by the Hoeffding-Azuma inequality ($B_n = p_n$, $d_n = 1$), we see that for $a > 0$
$$P\Big(S_n - \sum_{i=1}^{n} p_i \ge a\Big) \le e^{-2a^2/n}$$
$$P\Big(S_n - \sum_{i=1}^{n} p_i \le -a\Big) \le e^{-2a^2/n}$$
The preceding inequalities are often called Chernoff bounds. ∎

The Hoeffding-Azuma inequality is often applied to a Doob type martingale. The following corollary is often used.

Corollary 3.26 Let $h$ be such that if the vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ differ in at most one coordinate (that is, for some $k$, $x_i = y_i$ for all $i \ne k$) then
$$|h(x) - h(y)| \le 1$$
Then, for a vector of independent random variables $X = (X_1, \ldots, X_n)$, and $a > 0$
$$P(h(X) - E[h(X)] \ge a) \le e^{-2a^2/n}$$
$$P(h(X) - E[h(X)] \le -a) \le e^{-2a^2/n}$$

[...] 0 P(N ~(n—k + l)papapsps 2 a) < e**/(n-9) P(N —(n—k+ l)pspapsps < -a) < e77/"-3) og

Example 3.28 Suppose that $n$ balls are to be placed in $m$ urns, with each ball independently going into urn $i$ with probability $p_i$, $\sum_{i=1}^{m} p_i = 1$. Find bounds on the tail probability of $Y_k$, equal to the number of urns that receive exactly $k$ balls.

Solution First note that
$$E[Y_k] = E\Big[\sum_{i=1}^{m} I\{\text{urn } i \text{ has exactly } k \text{ balls}\}\Big] = \sum_{i=1}^{m} \binom{n}{k} p_i^k (1-p_i)^{n-k}$$
Let $X_j$ denote the urn in which ball $j$ is put, $j = 1, \ldots, n$.
Also, let $h_k(x_1, \ldots, x_n)$ denote the number of urns that receive exactly $k$ balls when $X_i = x_i, i = 1, \ldots, n$, and note that $Y_k = h_k(X_1, \ldots, X_n)$.

When $k = 0$ it is easy to see that $h_0$ satisfies the condition that if $x$ and $y$ differ in at most one coordinate, then $|h_0(x) - h_0(y)| \le 1$. Therefore, from Corollary 3.26 we obtain, for $a > 0$, that
$$P\Big(Y_0 - \sum_{i=1}^{m} (1-p_i)^n \ge a\Big) \le e^{-2a^2/n}$$
$$P\Big(Y_0 - \sum_{i=1}^{m} (1-p_i)^n \le -a\Big) \le e^{-2a^2/n}$$
For $0 < k < n$ it is no longer true that if $x$ and $y$ differ in at most one coordinate, then $|h_k(x) - h_k(y)| \le 1$. This is so because the one different value could result in one of the vectors having one less and the other having one more urn with $k$ balls than would have resulted if that coordinate was not included. Thus, if $x$ and $y$ differ in at most one coordinate, then
$$|h_k(x) - h_k(y)| \le 2$$
showing that $h_k^*(x) = h_k(x)/2$ satisfies the condition of Corollary 3.26. Because
$$P(Y_k - E[Y_k] \ge a) = P(h_k^*(X) - E[h_k^*(X)] \ge a/2)$$
we obtain, for $a > 0$, $0 < k < n$, that
$$P\Big(Y_k - \sum_{i=1}^{m} \binom{n}{k} p_i^k (1-p_i)^{n-k} \ge a\Big) \le e^{-a^2/2n}$$
$$P\Big(Y_k - \sum_{i=1}^{m} \binom{n}{k} p_i^k (1-p_i)^{n-k} \le -a\Big) \le e^{-a^2/2n}$$ ∎

3.6 Sub and Supermartingales, and Convergence Theorem

[...] $Z_n, n \ge 1$, is said to be a submartingale for the filtration $\mathcal{F}_n$ if
1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $\mathcal{F}_n$
3. $E[Z_{n+1} \mid \mathcal{F}_n] \ge Z_n$
If 3 is replaced by $E[Z_{n+1} \mid \mathcal{F}_n] \le Z_n$, then $Z_n, n \ge 1$, is said to be a supermartingale.

It follows from the tower property that
$$E[Z_{n+1}] \ge E[Z_n]$$
for a submartingale, with the inequality reversed for a supermartingale. (Of course, if $Z_n, n \ge 1$, is a submartingale, then $-Z_n, n \ge 1$, is a supermartingale, and vice versa.)

The analogs of the martingale stopping theorem remain valid for submartingales and supermartingales. We leave the proof of the following theorem as an exercise.

Theorem 3.30 If $N$ is a stopping time for the filtration $\mathcal{F}_n$, then
$E[Z_N] \ge E[Z_1]$ for a submartingale
$E[Z_N] \le E[Z_1]$ for a supermartingale
provided that any of the sufficient conditions of Theorem 3.14 hold.

One of the most useful results about submartingales is the Kolmogorov inequality.
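Before presenting it, we need a couple of lemmas.

Lemma 3.31 If $Z_n, n \ge 1$, is a submartingale for the filtration $\mathcal{F}_n$, and $N$ is a stopping time for this filtration such that $P(N \le n) = 1$, then
$$E[Z_1] \le E[Z_N] \le E[Z_n]$$

Proof. Because $N$ is bounded, it follows from the submartingale stopping theorem that $E[Z_N] \ge E[Z_1]$. Now,
$$E[Z_n \mid \mathcal{F}_k, N = k] = E[Z_n \mid \mathcal{F}_k] \ge Z_k = Z_N$$
Taking expectations of this inequality completes the proof. ∎

Lemma 3.32 If $Z_n, n \ge 1$, is a martingale with respect to the filtration $\mathcal{F}_n, n \ge 1$, and $f$ is a convex function for which $E[|f(Z_n)|] < \infty$, then $f(Z_n), n \ge 1$, is a submartingale with respect to the filtration $\mathcal{F}_n, n \ge 1$.

Proof.
$$E[f(Z_{n+1}) \mid \mathcal{F}_n] \ge f(E[Z_{n+1} \mid \mathcal{F}_n]) \ \text{(by Jensen's inequality)} \ = f(Z_n)$$ ∎

Theorem 3.33 Kolmogorov's Inequality for Submartingales Suppose $Z_n, n \ge 1$, is a nonnegative submartingale. Then for $a > 0$
$$P(\max\{Z_1, \ldots, Z_n\} \ge a) \le E[Z_n]/a$$

Proof Let $N$ be the smallest $i$, $i \le n$, such that $Z_i \ge a$, and let it equal $n$ if $Z_i < a$ for all $i = 1, \ldots, n$. Then, using Markov's inequality and Lemma 3.31 (since $N \le n$),
$$P(\max\{Z_1, \ldots, Z_n\} \ge a) = P(Z_N \ge a) \le E[Z_N]/a \le E[Z_n]/a$$ ∎

Corollary 3.34 If $Z_n, n \ge 1$, is a martingale, then for $a > 0$
$$P(\max\{|Z_1|, \ldots, |Z_n|\} \ge a) \le \min(E[|Z_n|]/a,\ E[Z_n^2]/a^2)$$

Proof Noting that
$$P(\max\{|Z_1|, \ldots, |Z_n|\} \ge a) = P(\max\{Z_1^2, \ldots, Z_n^2\} \ge a^2)$$
the corollary follows from Lemma 3.32 and Kolmogorov's inequality for submartingales upon using that $f(x) = |x|$ and $f(x) = x^2$ are convex functions. ∎

Theorem 3.35 The Martingale Convergence Theorem Let $Z_n, n \ge 1$, be a martingale. If there is $M < \infty$ such that $E[|Z_n|] \le M$ for all $n$, then, with probability 1, $\lim_{n \to \infty} Z_n$ exists and is finite.

Proof We give a proof under the stronger assumption that $E[Z_n^2]$ is bounded. Because $f(x) = x^2$ is convex, it follows from Lemma 3.32 that $Z_n^2, n \ge 1$, is a submartingale, yielding that $E[Z_n^2]$ is nondecreasing. Because $E[Z_n^2]$ is bounded, it follows that it converges; let $m < \infty$ be given by
$$m = \lim_{n \to \infty} E[Z_n^2]$$
We now argue that $\lim_{n \to \infty} Z_n$ exists and is finite by showing that, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence. That is, we will show that, with probability 1,
$$|Z_{m+k} - Z_m| \to 0 \quad \text{as } m, k \to \infty$$
Using that $Z_{m+k} - Z_m, k \ge 1$, is a martingale, it follows that $(Z_{m+k} - Z_m)^2, k \ge 1$, is a submartingale. Thus, by Kolmogorov's inequality,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) = P\Big(\max_{k=1,\ldots,n} (Z_{m+k} - Z_m)^2 > \epsilon^2\Big) \le E[(Z_{m+n} - Z_m)^2]/\epsilon^2$$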
$$= E[Z_{n+m}^2 - 2Z_m Z_{n+m} + Z_m^2]/\epsilon^2$$
However,
$$E[Z_m Z_{n+m}] = E[E[Z_m Z_{n+m} \mid Z_m]] = E[Z_m E[Z_{n+m} \mid Z_m]] = E[Z_m^2]$$
Therefore,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) \le (E[Z_{n+m}^2] - E[Z_m^2])/\epsilon^2$$
Letting $n \to \infty$ now yields
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \le (m - E[Z_m^2])/\epsilon^2$$
Thus,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \to 0 \quad \text{as } m \to \infty$$
Therefore, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence, and so has a finite limit. ∎

As a consequence of the martingale convergence theorem we obtain the following.

Corollary 3.36 If $Z_n, n \ge 1$, is a nonnegative martingale then, with probability 1, $\lim_{n \to \infty} Z_n$ exists and is finite.

Proof Because $Z_n$ is nonnegative
$$E[|Z_n|] = E[Z_n] = E[Z_1] < \infty$$ ∎

Example 3.37 A branching process follows the size of a population over succeeding generations. It supposes that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j, j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n + 1$. Let $X_n$ denote the number of individuals in generation $n$. Assuming that $m = \sum_j j p_j$, the mean number of offspring of an individual, is finite, it is easy to verify that $Z_n = X_n/m^n, n \ge 0$, is a martingale. Because it is nonnegative, the preceding corollary implies that $\lim_n X_n/m^n$ exists and is finite. But this implies, when $m < 1$, that $\lim_n X_n = 0$; or, equivalently, that $X_n = 0$ for all $n$ sufficiently large. When $m > 1$, the implication is that the generation size either becomes 0 or converges to infinity at an exponential rate. ∎

3.7 Exercises

1. For $\mathcal{F} = \{\emptyset, \Omega\}$, show that $E[X \mid \mathcal{F}] = E[X]$.
2. Give the proof of Proposition 3.2 when $X$ and $Y$ are jointly continuous.
3. If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, show that
$$E\Big[\sum_{i=1}^{n} X_i \,\Big|\, \mathcal{F}\Big] = \sum_{i=1}^{n} E[X_i \mid \mathcal{F}]$$
4. Prove that if $f$ is a convex function, then
$$E[f(X) \mid \mathcal{F}] \ge f(E[X \mid \mathcal{F}])$$
provided the expectations exist.
5. Let $X_1, X_2, \ldots$ be independent random variables with mean 1. Show that $Z_n = \prod_{i=1}^{n} X_i$, $n \ge 1$, is a martingale.
6. If $E[X_{n+1} \mid X_1, \ldots, X_n] = a_n X_n + b_n$ for constants $a_n, b_n, n \ge 0$, find constants $A_n, B_n$ so that $Z_n = A_n X_n + B_n, n \ge 0$, is a martingale with respect to the filtration $\sigma(X_0, \ldots, X_n)$.
7. Consider a population of individuals as it evolves over time, and suppose that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j, j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n + 1$. Assume that $m = \sum_j j p_j < \infty$. Let $X_n$ denote the number of individuals in generation $n$, and define a martingale related to $X_n, n \ge 0$. The process $X_n, n \ge 0$, is called a branching process.
8. Suppose $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean zero and finite variance $\sigma^2$. If $T$ is a stopping time with finite mean, show that
$$\mathrm{Var}\Big(\sum_{i=1}^{T} X_i\Big) = \sigma^2 E(T)$$
9. Suppose $X_1, X_2, \ldots$ are independent and identically distributed mean 0 random variables which each take value $+1$ with probability 1/2 and take value $-1$ with probability 1/2. Let $S_n = \sum_{i=1}^{n} X_i$. Which of the following are stopping times? Compute the mean for the ones that are stopping times.
(a) $T_1 = \min\{i \ge 5 : S_i = S_{i-5} + 5\}$
(b) $T_2 = T_1 - 5$
(c) $T_3 = T_2 + 10$
10. Consider a sequence of independent flips of a coin, and let $P_h$ denote the probability of a head on any toss. Let $A$ be the hypothesis that $P_h = a$ and let $B$ be the hypothesis that $P_h = b$, for given values $0 < a, b < 1$. Let $X_i$ be the outcome of flip $i$, and set
$$Z_n = \frac{P(X_1, \ldots, X_n \mid A)}{P(X_1, \ldots, X_n \mid B)}$$
If $P_h = b$, show that $Z_n, n \ge 1$, is a martingale having mean 1.
11. Let $Z_n, n \ge 0$, be a martingale with $Z_0 = 0$. Show that
$$E[Z_n^2] = \sum_{i=1}^{n} E[(Z_i - Z_{i-1})^2]$$
12. Consider an individual who at each stage, independently of past movements, moves to the right with probability $p$ or to the left with probability $1 - p$. Assuming that $p > 1/2$, find the expected number of stages it takes the person to move $i$ positions to the right from where she started.
13. In Example 3.19 obtain bounds on $p$ when $\theta < 0$.
14. Use Wald's equation to approximate the expected time it takes a random walk to either become as large as $a$ or as small as $-b$, for positive $a$ and $b$. Give the exact expression if $a$ and $b$ are integers, and at each stage the random walk either moves up 1 with probability $p$ or moves down 1 with probability $1 - p$.
15. Consider a branching process that starts with a single individual. Let $\pi$ denote the probability this process eventually dies out. With $X_n$ denoting the number of individuals in generation $n$, argue that $\pi^{X_n}, n \ge 0$, is a martingale.
16. Given $X_1, X_2, \ldots$, let $S_n = \sum_{i=1}^{n} X_i$ and $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$. Suppose for all $n$ that $E|S_n| < \infty$ and $E[S_{n+1} \mid \mathcal{F}_n] = S_n$. Show that $E[X_i X_j] = 0$ if $i \ne j$.
17. Suppose $n$ random points are chosen in a circle having diameter equal to 1, and let $X$ be the length of the shortest path connecting all of them. For $a > 0$, bound $P(X - E[X] > a)$.
18. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed discrete random variables, with $P(X_i = j) = p_j$. Obtain bounds on the tail probability of the number of times the pattern 0, 0, 0, 0 appears in the sequence.
19. Repeat Example 3.27, but now assuming that the $X_i$ are independent but not identically distributed. Let $P_{i,j} = P(X_i = j)$.
20. Let $Z_n, n \ge 0$, be a martingale with mean $Z_0 = 0$, and let $v_j, j \ge 0$, be a sequence of nondecreasing constants with $v_0 = 0$. Prove the Kolmogorov-Hajek-Renyi inequality:
$$P(|Z_j| \le v_j, \text{ for all } j = 1, \ldots, n) \ge 1 - \sum_{j=1}^{n} E[(Z_j - Z_{j-1})^2]/v_j^2$$
21. Consider a gambler who plays at a fair casino. Suppose that the casino does not give any credit, so that the gambler must quit when his fortune is 0. Suppose further that on each bet made at least 1 is either won or lost. Argue that, with probability 1, a gambler who wants to play forever will eventually go broke.
22. What is the implication of the martingale convergence theorem to the scenario of Exercise 10?
23. Three gamblers each start with $a$, $b$, and $c$ chips respectively.
In each round of a game a gambler is selected uniformly at random to give up a chip, and one of the other gamblers is selected uniformly at random to receive that chip. The game ends when there are only two players remaining with chips. Let $X_n, Y_n$, and $Z_n$ respectively denote the number of chips the three players have after round $n$, so $(X_0, Y_0, Z_0) = (a, b, c)$.
(a) Compute $E[X_{n+1} Y_{n+1} Z_{n+1} \mid (X_n, Y_n, Z_n) = (x, y, z)]$.
(b) Show that $M_n = X_n Y_n Z_n + n(a+b+c)/3$ is a martingale.
(c) Use the preceding to compute the expected length of the game.

Chapter 4 Bounding Probabilities and Expectations

4.1 Introduction

In this chapter we develop some approaches for bounding expectations and probabilities. We start in Section 2 with Jensen's inequality, which bounds the expected value of a convex function of a random variable. In Section 3 we develop the importance sampling identity and show how it can be used to yield bounds on tail probabilities. A specialization of this method results in the Chernoff bound, which is developed in Section 4. Section 5 deals with the Second Moment and the Conditional Expectation Inequalities, which bound the probability that a random variable is positive. Section 6 develops the Min-Max Identity and uses it to obtain bounds on the maximum of a set of random variables. Finally, in Section 7 we introduce some general stochastic order relations and explore their consequences.

4.2 Jensen's Inequality

Jensen's inequality yields a lower bound on the expected value of a convex function of a random variable.

Proposition 4.1 If $f$ is a convex function, then
$$E[f(X)] \ge f(E[X])$$
provided the expectations exist.

Proof. We give a proof under the assumption that $f$ has a Taylor series expansion.
Expanding $f$ about the value $\mu = E[X]$, and using the Taylor series expansion with a remainder term, yields that for some $a$
$$f(x) = f(\mu) + f'(\mu)(x - \mu) + f''(a)(x - \mu)^2/2 \ge f(\mu) + f'(\mu)(x - \mu)$$
where the preceding used that $f''(a) \ge 0$ by convexity. Hence,
$$f(X) \ge f(\mu) + f'(\mu)(X - \mu)$$
Taking expectations yields the result. ∎

Remark 4.2 If
$$P(X = x_1) = \lambda = 1 - P(X = x_2)$$
then Jensen's inequality implies that for a convex function $f$
$$\lambda f(x_1) + (1 - \lambda) f(x_2) \ge f(\lambda x_1 + (1 - \lambda) x_2)$$
which is the definition of a convex function. Thus, Jensen's inequality can be thought of as extending the defining equation of convexity from random variables that take on only two possible values to arbitrary random variables.

4.3 Probability Bounds via the Importance Sampling Identity

Let $f$ and $g$ be probability density (or probability mass) functions; let $h$ be an arbitrary function, and suppose that $g(x) = 0$ implies that $f(x)h(x) = 0$. The following is known as the importance sampling identity.

Proposition 4.3 The Importance Sampling Identity
$$E_f[h(X)] = E_g\Big[\frac{h(X) f(X)}{g(X)}\Big]$$
where the subscript on the expectation indicates the density (or mass function) of the random variable $X$.

Proof We give the proof when $f$ and $g$ are density functions:
$$E_f[h(X)] = \int h(x) f(x)\,dx = \int \frac{h(x) f(x)}{g(x)}\,g(x)\,dx = E_g\Big[\frac{h(X) f(X)}{g(X)}\Big]$$ ∎

The importance sampling identity yields the following useful corollary concerning the tail probability of a random variable.

Corollary 4.4
$$P_f(X > c) = E_g\Big[\frac{f(X)}{g(X)} \,\Big|\, X > c\Big] P_g(X > c)$$

Proof.
$$P_f(X > c) = E_f[I_{\{X > c\}}] = E_g\Big[\frac{I_{\{X > c\}} f(X)}{g(X)}\Big] = E_g\Big[\frac{I_{\{X > c\}} f(X)}{g(X)} \,\Big|\, X > c\Big] P_g(X > c) = E_g\Big[\frac{f(X)}{g(X)} \,\Big|\, X > c\Big] P_g(X > c)$$ ∎

Example 4.5 Bounding Standard Normal Tail Probabilities Let $f$ be the standard normal density function
$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}, \quad -\infty < x < \infty$$
For $c > 0$, consider $P_f(X > c)$, the probability that a standard normal random variable exceeds $c$.
With
$$g(x) = c\,e^{-cx}, \quad x > 0$$
we obtain from Corollary 4.4
$$P_f(X > c) = \frac{e^{-c^2}}{c\sqrt{2\pi}}\,E_g[e^{-X^2/2} e^{cX} \mid X > c] = \frac{e^{-c^2}}{c\sqrt{2\pi}}\,E_g[e^{-(X+c)^2/2} e^{c(X+c)}]$$
where the first equality used that $P_g(X > c) = e^{-c^2}$, and the second the lack of memory property of exponential random variables to conclude that the conditional distribution of an exponential random variable $X$ given that it exceeds $c$ is the unconditional distribution of $X + c$. Thus the preceding yields
$$P_f(X > c) = \frac{e^{-c^2/2}}{c\sqrt{2\pi}}\,E_g[e^{-X^2/2}] \quad (4.1)$$
Noting that, for $x > 0$,
$$1 - x < e^{-x} < 1 - x + x^2/2$$
we see that
$$1 - X^2/2 < e^{-X^2/2} < 1 - X^2/2 + X^4/8$$
Using that $E[X^2] = 2/c^2$ and $E[X^4] = 24/c^4$ when $X$ is exponential with rate $c$, the preceding gives
$$1 - 1/c^2 < E_g[e^{-X^2/2}] < 1 - 1/c^2 + 3/c^4$$
Consequently, using (4.1),
$$(1 - 1/c^2)\,\frac{e^{-c^2/2}}{c\sqrt{2\pi}} < P_f(X > c) < (1 - 1/c^2 + 3/c^4)\,\frac{e^{-c^2/2}}{c\sqrt{2\pi}} \quad (4.2)$$ ∎

Our next example uses the importance sampling identity to bound the probability that successive sums of a sequence of independent and identically distributed normal random variables with a negative mean ever cross some specified positive number.

Example 4.6 Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed normal random variables with mean $\mu < 0$ and variance 1. Let $S_k = \sum_{i=1}^{k} X_i$ and, for a fixed $A > 0$, consider
$$p = P(S_k > A \text{ for some } k)$$
Let $f_k(\mathbf{x}_k) = f_k(x_1, \ldots, x_k)$ be the joint density function of $\mathbf{X}_k = (X_1, \ldots, X_k)$. That is,
$$f_k(\mathbf{x}_k) = (2\pi)^{-k/2} e^{-\sum_{i=1}^{k} (x_i - \mu)^2/2}$$
Also, let $g_k$ be the joint density of $k$ independent and identically distributed normal random variables with mean $-\mu$ and variance 1.
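That is,
$$g_k(\mathbf{x}_k) = (2\pi)^{-k/2} e^{-\sum_{i=1}^{k} (x_i + \mu)^2/2}$$
Note that
$$\frac{f_k(\mathbf{x}_k)}{g_k(\mathbf{x}_k)} = e^{2\mu \sum_{i=1}^{k} x_i}$$
With
$$R_k = \Big\{(x_1, \ldots, x_k) : \sum_{i=1}^{j} x_i \le A,\ j < k,\ \sum_{i=1}^{k} x_i > A\Big\}$$
we have
$$p = \sum_{k=1}^{\infty} P(\mathbf{X}_k \in R_k) = \sum_{k=1}^{\infty} E_{f_k}[I_{\{\mathbf{X}_k \in R_k\}}] = \sum_{k=1}^{\infty} E_{g_k}\Big[I_{\{\mathbf{X}_k \in R_k\}} \frac{f_k(\mathbf{X}_k)}{g_k(\mathbf{X}_k)}\Big] = \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu S_k}]$$
Now, if $\mathbf{X}_k \in R_k$ then $S_k > A$, implying, because $\mu < 0$, that $e^{2\mu S_k} < e^{2\mu A}$. Because this implies that
$$I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu S_k} \le I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu A}$$
we obtain from the preceding that
$$p \le \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu A}] = e^{2\mu A} \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}]$$
Now, if $Y_i, i \ge 1$, is a sequence of independent normal random variables with mean $-\mu$ and variance 1, then
$$E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}] = P\Big(\sum_{i=1}^{j} Y_i \le A,\ j < k,\ \sum_{i=1}^{k} Y_i > A\Big)$$
Therefore, from the preceding
$$p \le e^{2\mu A} \sum_{k=1}^{\infty} P\Big(\sum_{i=1}^{j} Y_i \le A,\ j < k,\ \sum_{i=1}^{k} Y_i > A\Big) = e^{2\mu A} P\Big(\sum_{i=1}^{k} Y_i > A \text{ for some } k\Big) = e^{2\mu A}$$
where the final equality follows from the strong law of large numbers because $\lim_{n \to \infty} \sum_{i=1}^{n} Y_i/n = -\mu > 0$ implies $P(\lim_{n \to \infty} \sum_{i=1}^{n} Y_i = \infty) = 1$, and thus $P(\sum_{i=1}^{k} Y_i > A \text{ for some } k) = 1$.

The bound
$$p = P(S_k > A \text{ for some } k) \le e^{2\mu A}$$
is not very useful when $A$ is a small nonnegative number. In this case, we should condition on $X_1$, and then apply the preceding inequality. With $\Phi$ being the standard normal distribution function, this yields
$$p = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} P(S_k > A \text{ for some } k \mid X_1 = x)\,e^{-(x-\mu)^2/2}\,dx$$
$$= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} P(S_k > A \text{ for some } k \mid X_1 = x)\,e^{-(x-\mu)^2/2}\,dx + P(X_1 > A)$$
$$\le \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} e^{2\mu(A-x)} e^{-(x-\mu)^2/2}\,dx + 1 - \Phi(A - \mu)$$
$$= e^{2\mu A}\,\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} e^{-(x+\mu)^2/2}\,dx + 1 - \Phi(A - \mu)$$
$$= e^{2\mu A}\,\Phi(A + \mu) + 1 - \Phi(A - \mu)$$
Thus, for instance
$$P(S_k > 0 \text{ for some } k) \le \Phi(\mu) + 1 - \Phi(-\mu) = 2\Phi(\mu)$$ ∎

4.4 Chernoff Bounds

Suppose that $X$ has probability density (or probability mass) function $f(x)$. For $t > 0$, let
$$g(x) = \frac{e^{tx} f(x)}{M(t)}$$
where $M(t) = E_f[e^{tX}]$ is the moment generating function of $X$. Corollary 4.4 yields, for $c > 0$, that
$$P_f(X \ge c) = E_g[M(t)e^{-tX} \mid X \ge c]\,P_g(X \ge c) \le E_g[M(t)e^{-tX} \mid X \ge c] \le M(t)e^{-tc}$$
Because the preceding holds for all $t > 0$, we can conclude that
$$P_f(X \ge c) \le \inf_{t > 0} M(t)e^{-tc} \quad (4.3)$$
The inequality (4.3) is called the Chernoff bound.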
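To make the Chernoff bound concrete, here is a small sketch for a standard normal $X$, a choice made here for illustration rather than one taken from the text. It uses the known moment generating function $M(t) = e^{t^2/2}$, for which $M(t)e^{-tc}$ is minimized at $t = c$ and the bound becomes $e^{-c^2/2}$. The function names and the grid of $t$ values are hypothetical; the exact tail is computed with `math.erfc` for comparison.

```python
import math


def chernoff_bound(mgf, c, t_grid):
    """Smallest value of M(t) * exp(-t*c) over the supplied grid of t > 0."""
    return min(mgf(t) * math.exp(-t * c) for t in t_grid)


def normal_mgf(t):
    """Moment generating function of a standard normal: exp(t^2 / 2)."""
    return math.exp(t * t / 2.0)


def normal_tail(c):
    """Exact P(X >= c) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))


if __name__ == "__main__":
    c = 2.0
    t_grid = [k / 100.0 for k in range(1, 1001)]  # t in (0, 10], an arbitrary grid
    print("grid Chernoff bound:", chernoff_bound(normal_mgf, c, t_grid))
    print("closed form e^{-c^2/2}:", math.exp(-c * c / 2.0))
    print("exact tail:", normal_tail(c))
```

The grid search is only a stand-in for the infimum; when the minimizing $t$ is available in closed form, as it is here, it should be used directly.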
Rather than choosing the value of $t$ so as to obtain the best bound, it is often convenient to work with bounds that are more analytically tractable. The following inequality can be used to simplify the Chernoff bound for a sum of independent Bernoulli random variables.

Lemma 4.7 For $0 \le p \le 1$,
$$p e^{t(1-p)} + (1-p) e^{-tp} \le e^{t^2/8}$$

[...] with $X = \sum_{i=1}^{n} X_i$ a sum of $n$ independent Bernoulli random variables, for $c > 0$
$$P(X - E[X] \ge c) \le e^{-2c^2/n} \quad (4.4)$$
$$P(X - E[X] \le -c) \le e^{-2c^2/n} \quad (4.5)$$

Proof. For $c > 0$, $t > 0$
$$P(X - E[X] \ge c) = P(e^{t(X - E[X])} \ge e^{tc}) \le e^{-tc} E[e^{t(X - E[X])}] \ \text{(by the Markov inequality)}$$
$$= e^{-tc} E\Big[\exp\Big\{\sum_{i=1}^{n} t(X_i - E[X_i])\Big\}\Big] = e^{-tc} E\Big[\prod_{i=1}^{n} e^{t(X_i - E[X_i])}\Big] = e^{-tc} \prod_{i=1}^{n} E[e^{t(X_i - E[X_i])}]$$
However, if $Y$ is Bernoulli with parameter $p$, then
$$E[e^{t(Y - E[Y])}] = p e^{t(1-p)} + (1-p) e^{-tp} \le e^{t^2/8}$$
where the inequality follows from Lemma 4.7. Therefore,
$$P(X - E[X] \ge c) \le e^{-tc + nt^2/8}$$
Letting $t = 4c/n$ yields the inequality (4.4). The proof of the inequality (4.5) is obtained by writing it as
$$P(E[X] - X \ge c) \le e^{-2c^2/n}$$
[...] for any $\epsilon > 0$, as $n \to \infty$ and $m \to \infty$,
$$P((1-\epsilon)m\alpha^w \le N_\alpha \le (1+\epsilon)m\alpha^w) \to 1$$

Proof. To prove the result, it is convenient to first formulate an equivalent continuous time model that results in the times at which the $n + m$ cells are killed being independent random variables. To do so, let $X_1, \ldots, X_{n+m}$ be independent exponential random variables, with $X_i$ having rate $w_i$, $i = 1, \ldots, n + m$. Note that $X_i$ will be the smallest of these exponentials with probability $w_i/\sum_j w_j$; further, given that $X_i$ is the smallest, $X_r, r \ne i$, will be the second smallest with probability $w_r/\sum_{j \ne i} w_j$; further, given that $X_i$ and $X_r$ are, respectively, the first and second smallest, $X_s, s \ne i, r$, will be the next smallest with probability $w_s/\sum_{j \ne i, r} w_j$; and so on. Consequently, if we imagine that cell $i$ is killed at time $X_i$, then the order in which the $n + m$ cells are killed has the same distribution as the order in which they are killed in the original model. So let us suppose that cell $i$ is killed at time $X_i, i \ge 1$. Now let $\tau_\alpha$ denote the time at which the number of surviving target cells first falls below $n\alpha$.
Also, let $N(t)$ denote the number of normal cells that are still alive at time $t$; so $N_\alpha = N(\tau_\alpha)$. We will first show that
$$P(N(\tau_\alpha) \le (1+\epsilon)m\alpha^w) \to 1 \quad \text{as } n, m \to \infty \quad (4.6)$$
To prove the preceding, let $\epsilon^*$ be such that $0 < \epsilon^* < \epsilon$, and set $t = -\ln(\alpha(1+\epsilon^*)^{1/w})$. We will prove (4.6) by showing that as $n$ and $m$ approach $\infty$,
(i) $P(\tau_\alpha > t) \to 1$, and
(ii) $P(N(t) \le (1+\epsilon)m\alpha^w) \to 1$.
Because the events $\tau_\alpha > t$ and $N(t) \le (1+\epsilon)m\alpha^w$ together imply that $N(\tau_\alpha) \le N(t) \le (1+\epsilon)m\alpha^w$, the result (4.6) will be established.

To prove (i), note that the number, call it $Y$, of surviving target cells by time $t$ is a binomial random variable with parameters $n$ and $e^{-t} = \alpha(1+\epsilon^*)^{1/w}$. Hence, with $a = n\alpha[(1+\epsilon^*)^{1/w} - 1]$, we have
$$P(\tau_\alpha \le t) = P(Y \le n\alpha) = P(Y \le ne^{-t} - a) \le e^{-2a^2/n}$$
where the inequality follows from the Chernoff bound (4.3). This proves (i), because $a^2/n \to \infty$ as $n \to \infty$.

To prove (ii), note that $N(t)$ is a binomial random variable with parameters $m$ and $e^{-wt} = \alpha^w(1+\epsilon^*)$. Thus, by letting $b = m\alpha^w(\epsilon - \epsilon^*)$ and again applying the Chernoff bound (4.3), we obtain
$$P(N(t) > (1+\epsilon)m\alpha^w) = P(N(t) > me^{-wt} + b) \le e^{-2b^2/m}$$
This proves (ii), because $b^2/m \to \infty$ as $m \to \infty$. Thus (4.6) is established.

It remains to prove that
$$P(N(\tau_\alpha) \ge (1-\epsilon)m\alpha^w) \to 1 \quad \text{as } n, m \to \infty \quad (4.7)$$
However, (4.7) can be proven in a similar manner as was (4.6); a combination of these two results completes the proof of the theorem. ∎

4.5 Second Moment and Conditional Expectation Inequalities

The second moment inequality gives a lower bound on the probability that a nonnegative random variable is positive.

Proposition 4.11 The Second Moment Inequality For a nonnegative random variable $X$
$$P(X > 0) \ge \frac{(E[X])^2}{E[X^2]}$$

Proof. Using Jensen's inequality in the second line below, we have
$$E[X^2] = E[X^2 \mid X > 0]\,P(X > 0)$$
$$\ge (E[X \mid X > 0])^2\,P(X > 0)$$
$$= \frac{(E[X])^2}{P(X > 0)}$$ ∎

When $X$ is the sum of Bernoulli random variables, we can improve the bound of the second moment inequality.
So, suppose for the remainder of this section that
\[ X = \sum_{i=1}^n X_i \]
where $X_i$ is Bernoulli with $E[X_i] = p_i$, $i = 1, \ldots, n$. We will need the following lemma.

Lemma 4.12 For any random variable $R$,
\[ E[XR] = \sum_{i=1}^n p_i E[R | X_i = 1] \]

Proof.
\[ E[XR] = E\Big[\sum_{i=1}^n X_i R\Big] = \sum_{i=1}^n E[X_i R] \]
\[ = \sum_{i=1}^n \{ E[X_i R | X_i = 1] p_i + E[X_i R | X_i = 0](1 - p_i) \} \]
\[ = \sum_{i=1}^n E[R | X_i = 1] p_i \]

Proposition 4.13 The Conditional Expectation Inequality.
\[ P(X > 0) \ge \sum_{i=1}^n \frac{p_i}{E[X | X_i = 1]} \]

Proof. Let $R = I_{\{X > 0\}}/X$. Noting that
\[ E[XR] = P(X > 0), \qquad E[R | X_i = 1] = E[1/X \,|\, X_i = 1] \]
we obtain, from Lemma 4.12,
\[ P(X > 0) = \sum_{i=1}^n p_i E[1/X \,|\, X_i = 1] \ge \sum_{i=1}^n \frac{p_i}{E[X | X_i = 1]} \]
where the final inequality made use of Jensen's inequality.

Example 4.14 Consider a system consisting of $m$ components, each of which either works or not. Suppose, further, that for given subsets of components $S_j$, $j = 1, \ldots, n$, none of which is a subset of another, the system functions if all of the components of at least one of these subsets work. If component $j$ independently works with probability $\alpha_j$, derive a lower bound on the probability the system functions.

Solution. Let $X_i$ equal 1 if all the components in $S_i$ work, and let it equal 0 otherwise, $i = 1, \ldots, n$. Also, let
\[ p_i = P(X_i = 1) = \prod_{j \in S_i} \alpha_j \]
Then, with $X = \sum_{i=1}^n X_i$, we have
\[ P(\text{system functions}) = P(X > 0) \ge \sum_{i=1}^n \frac{p_i}{E[X | X_i = 1]} = \sum_{i=1}^n \frac{p_i}{\sum_{j=1}^n P(X_j = 1 | X_i = 1)} = \sum_{i=1}^n \frac{p_i}{1 + \sum_{j \ne i} \prod_{k \in S_j - S_i} \alpha_k} \]
where $S_j - S_i$ consists of all components that are in $S_j$ but not in $S_i$.

Example 4.15 Consider a random graph on the set of vertices $\{1, 2, \ldots, n\}$, which is such that each of the $\binom{n}{2}$ pairs of vertices $i \ne j$ is, independently, an edge of the graph with probability $p$. We are interested in the probability that this graph will be connected, where by connected we mean that for each pair of distinct vertices $i \ne j$ there is a sequence of edges of the form $(i, i_1), (i_1, i_2), \ldots, (i_k, j)$. (That is, a graph is connected if for each pair of distinct vertices $i$ and $j$, there is a path from $i$ to $j$.)
Suppose that
\[ p = c\,\frac{\ln(n)}{n} \]
We will now show that if $c < 1$, then the probability that the graph is connected goes to 0 as $n \to \infty$. To verify this result, consider the number of isolated vertices, where vertex $i$ is said to be isolated if there are no edges of type $(i, j)$. Let $X_i$ be the indicator variable for the event that vertex $i$ is isolated, and let
\[ X = \sum_{i=1}^n X_i \]
be the number of isolated vertices. Now,
\[ P(X_i = 1) = (1-p)^{n-1} \]
Also,
\[ E[X | X_i = 1] = \sum_{j=1}^n P(X_j = 1 | X_i = 1) = 1 + \sum_{j \ne i} (1-p)^{n-2} = 1 + (n-1)(1-p)^{n-2} \]
Because
\[ (1-p)^{n-1} = \Big(1 - c\,\frac{\ln(n)}{n}\Big)^{n-1} \approx e^{-c \ln(n)} = n^{-c} \]
the conditional expectation inequality yields that for $n$ large
\[ P(X > 0) \ge \frac{n(1-p)^{n-1}}{1 + (n-1)(1-p)^{n-2}} \approx \frac{n^{1-c}}{1 + n^{1-c}} \]
Therefore,
\[ c < 1 \;\Rightarrow\; P(X > 0) \to 1 \quad \text{as } n \to \infty \]
Because the graph is not connected if $X > 0$, it follows that the graph will almost certainly be disconnected when $n$ is large and $c < 1$. (It can be shown when $c > 1$ that the probability the graph is connected goes to 1 as $n \to \infty$.)

4.6 The Min-Max Identity and Bounds on the Maximum

In this section we will be interested in obtaining an upper bound on $E[\max_i X_i]$, when $X_1, \ldots, X_n$ are nonnegative random variables. To begin, note that for any nonnegative constant $c$,
\[ \max_i X_i \le c + \sum_{i=1}^n (X_i - c)^+ \tag{4.8} \]
where $x^+$, the positive part of $x$, is equal to $x$ if $x > 0$ and is equal to 0 otherwise. Taking expectations of the preceding inequality yields that
\[ E[\max_i X_i] \le c + \sum_{i=1}^n E[(X_i - c)^+] \]
Because $(X_i - c)^+$ is a nonnegative random variable, we have
\[ E[(X_i - c)^+] = \int_0^\infty P((X_i - c)^+ > x)\,dx = \int_0^\infty P(X_i - c > x)\,dx = \int_c^\infty P(X_i > y)\,dy \]
Therefore,
\[ E[\max_i X_i] \le c + \sum_{i=1}^n \int_c^\infty P(X_i > y)\,dy \tag{4.9} \]
Because the preceding is true for all $c \ge 0$, the best bound is obtained by choosing the $c$ that minimizes the right side of the preceding.
Differentiating, and setting the result equal to 0, shows that the best bound is obtained when $c$ is the value $c^*$ such that
\[ \sum_{i=1}^n P(X_i > c^*) = 1 \]
Because $\sum_{i=1}^n P(X_i > c)$ is a decreasing function of $c$, the value of $c^*$ can be easily approximated and then utilized in the inequality (4.9). It is interesting to note that $c^*$ is such that the expected number of the $X_i$ that exceed $c^*$ is equal to 1, which is interesting because the inequality (4.8) becomes an equality when exactly one of the $X_i$ exceeds $c$.

Example 4.16 Suppose the $X_i$ are independent exponential random variables with rates $\lambda_i$, $i = 1, \ldots, n$. Then the minimizing value $c^*$ is such that
\[ 1 = \sum_{i=1}^n e^{-\lambda_i c^*} \]
with resulting bound
\[ E[\max_i X_i] \le c^* + \sum_{i=1}^n \int_{c^*}^\infty e^{-\lambda_i y}\,dy = c^* + \sum_{i=1}^n \frac{1}{\lambda_i} e^{-\lambda_i c^*} \]
In the special case where the rates are all equal, say $\lambda_i = 1$, then
\[ 1 = n e^{-c^*} \quad \text{or} \quad c^* = \ln(n) \]
and the bound becomes
\[ E[\max_i X_i] \le \ln(n) + 1 \tag{4.10} \]
However, it is easy to compute the expected maximum of a sequence of independent exponentials with rate 1. Interpreting these random variables as the failure times of $n$ components, we can write
\[ \max_i X_i = \sum_{i=1}^n T_i \]
where $T_i$ is the time between the $(i-1)$st and the $i$th failure. Using the lack of memory property of exponentials, it follows that the $T_i$ are independent, with $T_i$ being an exponential random variable with rate $n - i + 1$. (This is because when the $(i-1)$st failure occurs, the time until the next failure is the minimum of the $n - i + 1$ remaining lifetimes, each of which is exponential with rate 1.) Therefore, in this case
\[ E[\max_i X_i] = \sum_{i=1}^n \frac{1}{n - i + 1} = \sum_{i=1}^n \frac{1}{i} \]
As it is known that, for $n$ large,
\[ \sum_{i=1}^n \frac{1}{i} \approx \ln(n) + E \]
where $E \approx .5772$ is Euler's constant, we see that the bound yielded by the approach can be quite accurate. (Also, the bound (4.10) only requires that the $X_i$ are exponential with rate 1, and not that they are independent.)

The preceding bounds on $E[\max_i X_i]$ only involve the marginal distributions of the $X_i$. When we have additional knowledge about the joint distributions, we can often do better.
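For the rate 1 exponential case of Example 4.16, the gap between the bound (4.10) and the exact expected maximum can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not from the text, and the constant 0.5772 is the value of Euler's constant quoted above.

```python
import math

EULER = 0.5772  # Euler's constant, as quoted in the text


def harmonic(n):
    """Exact E[max] of n independent rate-1 exponentials: sum_{i=1}^n 1/i."""
    return sum(1.0 / i for i in range(1, n + 1))


def upper_bound(n):
    """Bound (4.10): ln(n) + 1, obtained from (4.9) with c* = ln(n)."""
    return math.log(n) + 1.0


def compare(ns=(1, 10, 100, 1000, 10000)):
    """Return rows (n, exact value, ln(n) + E approximation, bound (4.10))."""
    return [(n, harmonic(n), math.log(n) + EULER, upper_bound(n)) for n in ns]


if __name__ == "__main__":
    for n, exact, approx, bound in compare():
        print(f"n={n:6d}  exact={exact:.4f}  ln(n)+E={approx:.4f}  bound={bound:.4f}")
```

For large n the bound exceeds the exact value by roughly 1 - .5772, which is the sense in which the bound is "quite accurate" here.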
To illustrate this, we first need to establish an identity relating the maximum of a set of random variables to the minimums of all the partial sets.

For nonnegative random variables $X_1, \ldots, X_n$, fix $x$ and let $A_i$ denote the event that $X_i > x$. Let
\[ X = \max(X_1, \ldots, X_n) \]
Noting that $X$ will be greater than $x$ if and only if at least one of the events $A_i$ occurs, we have
\[ P(X > x) = P\Big(\bigcup_{i=1}^n A_i\Big) \]
and the inclusion-exclusion identity gives
\[ P(X > x) = \sum_i P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \cdots + (-1)^{n+1} P(A_1 A_2 \cdots A_n) \]
[...]
\[ P(A_i A_j) = P(X_i > x, X_j > x) = P(\min(X_i, X_j) > x) \]
\[ P(A_i A_j A_k) = P(X_i > x, X_j > x, X_k > x) = P(\min(X_i, X_j, X_k) > x) \]
and so on. Thus, we see that
\[ P(X > x) = \sum_{r=1}^n (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(\min(X_{i_1}, \ldots, X_{i_r}) > x) \]
Integrating both sides as $x$ goes from 0 to $\infty$ gives the result:
\[ E[X] = \sum_{r=1}^n (-1)^{r+1} \sum_{i_1 < \cdots < i_r} E[\min(X_{i_1}, \ldots, X_{i_r})] \]

[...]
\[ E[X] \ge \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)] \]
\[ E[X] \le \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)] + \sum_{i<j<k} E[\min(X_i, X_j, X_k)] \]

[...]
\[ [\ldots] + \cdots + (-1)^{n+1} \frac{1}{p_1 + \cdots + p_n} \]

Using the preceding formula for the mean number of coupons needed to obtain a complete set requires summing over $2^n$ terms, and so is not practical when $n$ is large. Moreover, the bounds obtained by only going out a few steps in the formula for the expected value of a maximum generally turn out to be too loose to be beneficial. However, a useful bound can often be obtained by applying the max-min inequalities to an upper bound for $E[X]$ rather than directly to $E[X]$. We now develop the theory.

For nonnegative random variables $X_1, \ldots, X_n$, let
\[ X = \max(X_1, \ldots, X_n) \]
Fix $c \ge 0$, and note the inequality
\[ X \le c + \max((X_1 - c)^+, \ldots, (X_n - c)^+) \]
Now apply the max-min upper bound inequalities to the right side of the preceding, take expectations, and obtain that
\[ E[X] \le c + \sum_{i=1}^n E[(X_i - c)^+] \]
\[ E[X] \le c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] \]
and so on.

Example 4.18 Consider the case of independent exponential random variables $X_1, \ldots, X_n$, all having rate 1.
Then, the preceding gives the bound
\[ E[\max_i X_i] \le c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] \]
[...] whether $X_i > c$, whether $\min(X_i, X_j) > c$, and whether $\min(X_i, X_j, X_k) > c$. This yields
\[ E[(X_i - c)^+] = e^{-c} \]
\[ E[\min((X_i - c)^+, (X_j - c)^+)] = \frac{e^{-2c}}{2} \]
\[ E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] = \frac{e^{-3c}}{3} \]
Using the constant $c = \ln(n)$ yields the bound
\[ E[\max_i X_i] \le \ln(n) + 1 - \frac{n(n-1)}{4n^2} + \frac{n(n-1)(n-2)}{18n^3} \approx \ln(n) + .806 \quad \text{for } n \text{ large} \]

Example 4.19 Let us reconsider Example 4.17, the coupon collector's problem. Let $c$ be an integer. To compute $E[(X_i - c)^+]$, condition on whether a type $i$ coupon is among the first $c$ collected.
\[ E[(X_i - c)^+] = E[(X_i - c)^+ | X_i \le c] P(X_i \le c) + E[(X_i - c)^+ | X_i > c] P(X_i > c) \]
\[ = E[(X_i - c)^+ | X_i > c] (1 - p_i)^c = \frac{(1 - p_i)^c}{p_i} \]
Similarly,
\[ E[\min((X_i - c)^+, (X_j - c)^+)] = \frac{(1 - p_i - p_j)^c}{p_i + p_j} \]
\[ E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] = \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k} \]
Therefore, for any nonnegative integer $c$,
\[ E[X] \le c + \sum_i \frac{(1 - p_i)^c}{p_i} - \sum_{i<j} \frac{(1 - p_i - p_j)^c}{p_i + p_j} + \sum_{i<j<k} \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k} \]

4.7 Stochastic Orderings

We say that $X$ is stochastically greater than $Y$, written $X \ge_{st} Y$, if
\[ P(X > t) \ge P(Y > t) \quad \text{for all } t \]
In this section, we define and compare some other stochastic orderings of random variables.

If $X$ is a nonnegative continuous random variable with distribution function $F$ and density $f$, then the hazard rate function of $X$ is defined by
\[ \lambda_X(t) = f(t)/\bar{F}(t) \]
where $\bar{F}(t) = 1 - F(t)$. Interpreting $X$ as the lifetime of an item, then for $\epsilon$ small
\[ P(t \text{ year old item dies within an additional time } \epsilon) = P(X < t + \epsilon \,|\, X > t) \approx \lambda_X(t)\epsilon \]
If $Y$ is a nonnegative continuous random variable with distribution function $G$ and density $g$, say that $X$ is hazard rate order larger than $Y$, written $X \ge_{hr} Y$, if
\[ \lambda_X(t) \le \lambda_Y(t) \quad \text{for all } t \]
Say that $X$ is likelihood ratio order larger than $Y$, written $X \ge_{lr} Y$, if
\[ f(x)/g(x) \uparrow x \]
where $f$ and $g$ are the respective densities of $X$ and $Y$.

Proposition 4.20
\[ X \ge_{lr} Y \;\Rightarrow\; X \ge_{hr} Y \;\Rightarrow\; X \ge_{st} Y \]

Proof. Let $X$ have density $f$, and $Y$ have density $g$.
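Suppose $X \ge_{lr} Y$. Then, for $y > x$,
\[ \frac{f(y)}{g(y)} \ge \frac{f(x)}{g(x)}, \quad \text{that is,} \quad f(y) \ge \frac{f(x)}{g(x)}\, g(y) \]
implying that
\[ \int_x^\infty f(y)\,dy \ge \frac{f(x)}{g(x)} \int_x^\infty g(y)\,dy \]
or
\[ \lambda_X(x) \le \lambda_Y(x) \]
To prove the final implication, note first that
\[ \int_0^s \lambda_X(t)\,dt = \int_0^s \frac{f(t)}{\bar{F}(t)}\,dt = -\log \bar{F}(s) \]
or
\[ \bar{F}(s) = e^{-\int_0^s \lambda_X(t)\,dt} \]
which immediately shows that $X \ge_{hr} Y \Rightarrow X \ge_{st} Y$.

Define $r_X(t)$, the reversed hazard rate function of $X$, by
\[ r_X(t) = f(t)/F(t) \]
and note that
\[ \lim_{\epsilon \downarrow 0} \frac{P(t - \epsilon < X \,|\, X < t)}{\epsilon} = r_X(t) \]
Say that $X$ is reverse hazard rate order larger than $Y$, written $X \ge_{rh} Y$, if
\[ r_X(t) \ge r_Y(t) \quad \text{for all } t \]
Our next theorem gives some representations of the orderings of $X$ and $Y$ in terms of stochastically larger relations between certain conditional distributions of $X$ and $Y$. Before presenting it, we introduce the following notation.

Notation For a random variable $X$ and event $A$, let $[X|A]$ denote a random variable whose distribution is that of the conditional distribution of $X$ given $A$.

Theorem 4.21
(i) $X \ge_{hr} Y \Leftrightarrow [X|X > t] \ge_{st} [Y|Y > t]$ for all $t$
(ii) $X \ge_{rh} Y \Leftrightarrow [X|X < t] \ge_{st} [Y|Y < t]$ for all $t$
(iii) $X \ge_{lr} Y \Leftrightarrow [X|s < X < t] \ge_{st} [Y|s < Y < t]$ for all $s < t$

Proof. (i) Let $X_t =_d [X - t|X > t]$ and $Y_t =_d [Y - t|Y > t]$. Noting that [...] shows that
\[ X \ge_{hr} Y \Rightarrow X_t \ge_{hr} Y_t \Rightarrow X_t \ge_{st} Y_t \Rightarrow [X|X > t] \ge_{st} [Y|Y > t] \]
To go the other way, use the identity
\[ \bar{F}_{X_t}(\epsilon) = e^{-\int_t^{t+\epsilon} \lambda_X(s)\,ds} \]
to obtain that
\[ [X|X > t] \ge_{st} [Y|Y > t] \Leftrightarrow X_t \ge_{st} Y_t \Rightarrow \lambda_X(t) \le \lambda_Y(t) \]
(ii) With $X_t =_d [t - X|X < t]$,
\[ \lambda_{X_t}(y) = r_X(t - y), \quad 0 < y < t \]
Therefore, with $Y_t =_d [t - Y|Y < t]$,
\[ X \ge_{rh} Y \Leftrightarrow \lambda_{X_t}(y) \ge \lambda_{Y_t}(y) \Rightarrow X_t \le_{st} Y_t \Leftrightarrow [t - X|X < t] \le_{st} [t - Y|Y < t] \Leftrightarrow [X|X < t] \ge_{st} [Y|Y < t] \]
On the other hand,
\[ [X|X < t] \ge_{st} [Y|Y < t] \Leftrightarrow X_t \le_{st} Y_t \Rightarrow \int_0^\epsilon \lambda_{X_t}(y)\,dy \ge \int_0^\epsilon \lambda_{Y_t}(y)\,dy \Leftrightarrow \int_0^\epsilon r_X(t - y)\,dy \ge \int_0^\epsilon r_Y(t - y)\,dy \Rightarrow r_X(t) \ge r_Y(t) \]
(iii) Let $X$ and $Y$ have respective densities $f$ and $g$. Suppose $[X|s < X < t] \ge_{st} [Y|s < Y < t]$ for all $s < t$. [...] showing that $X \ge_{lr} Y$. Now suppose that $X \ge_{lr} Y$. Then clearly $[X|s < X < t] \ge_{lr} [Y|s < Y < t]$, implying, from Proposition 4.20, that $[X|s < X < t] \ge_{st} [Y|s < Y < t]$.

Corollary 4.22
\[ X \ge_{lr} Y \;\Rightarrow\; X \ge_{rh} Y \;\Rightarrow\; X \ge_{st} Y \]

Proof. The first implication immediately follows from parts (ii) and (iii) of Theorem 4.21. The second implication follows upon taking the limit as $t \to \infty$ in part (ii) of that theorem.

Say that $X$ is an increasing hazard rate (or IHR) random variable if $\lambda_X(t)$ is nondecreasing in $t$.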
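(Other terminology is to say that $X$ has an increasing failure rate.)

Proposition 4.23 Let $X_t =_d [X - t|X > t]$. Then,
(a) $X_t \downarrow_{st}$ as $t \uparrow$ $\Leftrightarrow$ $X$ is IHR
(b) $X_t \downarrow_{lr}$ as $t \uparrow$ $\Leftrightarrow$ $\log f(x)$ is concave

Proof. (a) Let $\lambda(y)$ be the hazard rate function of $X$. Then $\lambda_t(y)$, the hazard rate function of $X_t$, is given by
\[ \lambda_t(y) = \lambda(t + y), \quad y > 0 \]
Hence, if $X$ is IHR then $X_t \downarrow_{hr} t$, implying that $X_t \downarrow_{st} t$. Now, let $s < t$, and suppose that $X_s \ge_{st} X_t$. Then,
\[ e^{-\int_t^{t+\epsilon} \lambda(y)\,dy} = P(X_t > \epsilon) \le P(X_s > \epsilon) = e^{-\int_s^{s+\epsilon} \lambda(y)\,dy} \]
showing that $\lambda(t) \ge \lambda(s)$. Thus part (a) is proved.

(b) Using that the density of $X_t$ is $f_{X_t}(x) = f(x + t)/\bar{F}(t)$ yields, for $s < t$, that
\[ X_s \ge_{lr} X_t \Leftrightarrow \frac{f(x+s)}{f(x+t)} \uparrow x \Leftrightarrow \log f(x+s) - \log f(x+t) \uparrow x \Leftrightarrow \frac{f'(x+s)}{f(x+s)} - \frac{f'(x+t)}{f(x+t)} \ge 0 \]
\[ \Leftrightarrow \frac{f'(y)}{f(y)} \downarrow y \Leftrightarrow \frac{d}{dy} \log f(y) \downarrow y \Leftrightarrow \log f(y) \text{ is concave} \]

4.8 Exercises

1. For a nonnegative random variable $X$, show that $(E[X^n])^{1/n}$ is nondecreasing in $n$.

2. Let $X$ be a standard normal random variable. Use Corollary 4.4, along with the density
\[ g(x) = x e^{-x^2/2}, \quad x > 0 \]
to show, for $c > 0$, that
(a) $P(X > c) = \frac{1}{\sqrt{2\pi}} e^{-c^2/2} E_g[1/X \,|\, X > c]$
(b) Show, for any positive random variable $W$, that $E[1/W \,|\, W > c] \le E[1/W]$
(c) Show that $P(X > c) \le \frac{1}{2} e^{-c^2/2}$
(d) Show that $E_g[X | X > c] = c + e^{c^2/2} \sqrt{2\pi}\, P(X > c)$
(e) Use Jensen's inequality, along with the preceding, to show that
\[ c P(X > c) + e^{c^2/2} \sqrt{2\pi}\, (P(X > c))^2 \ge \frac{1}{\sqrt{2\pi}} e^{-c^2/2} \]
(f) Argue from (e) that $P(X > c)$ must be at least as large as the positive root of the equation
\[ c x + e^{c^2/2} \sqrt{2\pi}\, x^2 = \frac{1}{\sqrt{2\pi}} e^{-c^2/2} \]
(g) Conclude that
\[ P(X > c) \ge \frac{1}{2\sqrt{2\pi}} \big(\sqrt{c^2 + 4} - c\big) e^{-c^2/2} \]

3. Let $X$ be a Poisson random variable with mean $\lambda$. Show that, for $n \ge \lambda$, the Chernoff bound yields that
\[ P(X \ge n) \le \frac{e^{-\lambda} (\lambda e)^n}{n^n} \]

4. Let $m(t) = E[X^t]$. The moment bound states that for $c > 0$
\[ P(X \ge c) \le m(t) c^{-t} \]
for all $t > 0$. Show that this result can be obtained from the importance sampling identity.

5. Fill in the details of the proof that, for independent Bernoulli random variables $X_1, \ldots, X_n$, and $c > 0$,
\[ P(S - E[S] \le -c) \le e^{-2c^2/n} \]
where $S = \sum_{i=1}^n X_i$.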
6. If $X$ is a binomial random variable with parameters $n$ and $p$, show
(a) $P(|X - np| > c) \le 2e^{-2c^2/n}$
(b) $P(X - np \ge \alpha np) \le \exp\{-2np^2\alpha^2\}$

7. Give the details of the proof of (4.7).

8. Prove that
\[ E[f(X)] \ge E[f(E[X|Y])] \ge f(E[X]) \]
Suppose you want a lower bound on $E[f(X)]$ for a convex function $f$. The preceding shows that first conditioning on $Y$ and then applying Jensen's inequality to the individual terms $E[f(X)|Y = y]$ results in a larger lower bound than does an immediate application of Jensen's inequality.

9. Let $X_i$ be binary random variables with parameters $p_i$, $i = 1, \ldots, n$. Let $X = \sum_{i=1}^n X_i$, and also let $I$, independent of the variables $X_1, \ldots, X_n$, be equally likely to be any of the values $1, \ldots, n$. For $R$ independent of $I$, show that
(a) $P(I = i | X_I = 1) = p_i/E[X]$
(b) $E[XR] = E[X] E[R | X_I = 1]$
(c) $P(X > 0) = E[X]\, E[1/X \,|\, X_I = 1]$

10. For $X_i$ and $X$ as in Exercise 9, show that
\[ \sum_i \frac{p_i}{E[X | X_i = 1]} \ge \frac{(E[X])^2}{E[X^2]} \]
Thus, for sums of binary variables, the conditional expectation inequality yields a stronger lower bound than does the second moment inequality. Hint: Make use of the results of Exercises 8 and 9.

11. Let $X_i$ be exponential with mean $8 + 2i$, for $i = 1, 2, 3$. Obtain an upper bound on $E[\max X_i]$, and compare it with the exact result when the $X_i$ are independent.

12. Let $U_i$, $i = 1, \ldots, n$ be uniform (0, 1) random variables. Obtain an upper bound on $E[\max U_i]$, and compare it with the exact result when the $U_i$ are independent.

13. Let $U_1$ and $U_2$ be uniform (0, 1) random variables. Obtain an upper bound on $E[\max(U_1, U_2)]$, and show this maximum is obtained when $U_1 = 1 - U_2$.

14. Show that $X \ge_{hr} Y$ if and only if
\[ \frac{P(X > t)}{P(X > s)} \ge \frac{P(Y > t)}{P(Y > s)} \]
for all $s < t$.

15. [...] $h(x, y) \ge h(y, x)$ whenever $x \ge y$.
(a) Show that if $X$ and $Y$ are independent and $X \ge_{lr} Y$ then $h(X, Y) \ge_{st} h(Y, X)$.
(b) Show by a counterexample that the preceding is not valid under the weaker condition $X \ge_{st} Y$.

16. There are $n$ jobs, with job $i$ requiring a random time $X_i$ to process.
The jobs must be processed sequentially. Give a sufficient condition, the weaker the better, under which the policy of processing jobs in the order $1, 2, \ldots, n$ maximizes the probability that at least $k$ jobs are processed by time $t$ for all $k$ and $t$.

Chapter 5 Markov Chains

5.1 Introduction

This chapter introduces a natural generalization of a sequence of independent random variables, called a Markov chain, where a variable may depend on the immediately preceding variable in the sequence. Named after the 19th century Russian mathematician Andrei Andreyevich Markov, these chains are widely used as simple models of more complex real-world phenomena.

Given a sequence of discrete random variables $X_0, X_1, X_2, \ldots$ taking values in some finite or countably infinite set $S$, we say that $X_n$ is a Markov chain with respect to a filtration $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$ and, for all $B \subseteq S$, we have the Markov property
\[ P(X_{n+1} \in B \,|\, \mathcal{F}_n) = P(X_{n+1} \in B \,|\, X_n) \]
If we interpret $X_n$ as the state of the chain at time $n$, then the preceding means that if you know the current state, nothing else from the past is relevant to the future of the Markov chain. That is, given the present state, the future states and the past states are independent. When we let $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ this definition reduces to
\[ P(X_{n+1} = j \,|\, X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \,|\, X_n = i) \]
If $P(X_{n+1} = j \,|\, X_n = i)$ is the same for all $n$, we say that the Markov chain has stationary transition probabilities, and we set
\[ P_{ij} = P(X_{n+1} = j \,|\, X_n = i) \]
In this case the quantities $P_{ij}$ are called the transition probabilities, and specifying them along with a probability distribution for the starting state $X_0$ is enough to determine all probabilities concerning $X_0, \ldots, X_n$. We will assume from here on that all Markov chains considered have stationary transition probabilities.
In addition, unless otherwise noted, we will assume that $S$, the set of all possible states of the Markov chain, is the set of nonnegative integers.

Example 5.1 Reflected random walk. Suppose $Y_i$ are independent and identically distributed Bernoulli($p$) random variables, and let $X_0 = 0$ and $X_n = (X_{n-1} + 2Y_n - 1)^+$ for $n = 1, 2, \ldots$. The process $X_n$, called a reflected random walk, can be viewed as the position of a particle at time $n$ such that at each time the particle has a $p$ probability of moving one step to the right, a $1 - p$ probability of moving one step to the left, and is returned to position 0 if it ever attempts to move to the left of 0. It is immediate from its definition that $X_n$ is a Markov chain.

Example 5.2 A non-Markov chain. Again let $Y_i$ be independent and identically distributed Bernoulli($p$) random variables, let $X_0 = 0$, and this time let $X_n = Y_n + Y_{n-1}$ for $n = 1, 2, \ldots$. It's easy to see that $X_n$ is not a Markov chain because $P(X_{n+1} = 2 \,|\, X_n = 1, X_{n-1} = 2) = 0$ while on the other hand $P(X_{n+1} = 2 \,|\, X_n = 1, X_{n-1} = 0) = p$.

5.2 The Transition Matrix

The transition probabilities
\[ P_{ij} = P(X_1 = j \,|\, X_0 = i) \]
are also called the one-step transition probabilities. We define the $n$-step transition probabilities by
\[ P_{ij}^{(n)} = P(X_n = j \,|\, X_0 = i) \]
In addition, we define the transition probability matrix
\[ P = \begin{bmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix} \]
and the $n$-step transition probability matrix
\[ P^{(n)} = \begin{bmatrix} P_{00}^{(n)} & P_{01}^{(n)} & P_{02}^{(n)} & \cdots \\ P_{10}^{(n)} & P_{11}^{(n)} & P_{12}^{(n)} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix} \]
An interesting relation between these matrices is obtained by noting that
\[ P_{ij}^{(n+m)} = \sum_k P(X_{n+m} = j \,|\, X_0 = i, X_n = k) P(X_n = k \,|\, X_0 = i) = \sum_k P_{kj}^{(m)} P_{ik}^{(n)} \]
The preceding are called the Chapman-Kolmogorov equations.

It follows from the Chapman-Kolmogorov equations that
\[ P^{(n+m)} = P^{(n)} \times P^{(m)} \]
where here "$\times$" represents matrix multiplication. Hence,
\[ P^{(2)} = P \times P \]
and, by induction,
\[ P^{(n)} = P^n \]
where the right-hand side represents multiplying the matrix $P$ by itself $n$ times.

Example 5.3 Reflected random walk.
A particle starts at position zero and at each time moves one position to the right with probability $p$ and, if the particle is not in position zero, moves one position to the left (or remains in state 0) with probability $1 - p$. The position $X_n$ of the particle at time $n$ forms a Markov chain with transition matrix
\[ P = \begin{bmatrix} 1-p & p & 0 & 0 & 0 & \cdots \\ 1-p & 0 & p & 0 & 0 & \cdots \\ 0 & 1-p & 0 & p & 0 & \cdots \\ \vdots & & & & & \end{bmatrix} \]

Example 5.4 The two state Markov chain. Consider a Markov chain with states 0 and 1 having transition probability matrix
\[ P = \begin{bmatrix} \alpha & 1 - \alpha \\ \beta & 1 - \beta \end{bmatrix} \]
The two-step transition probability matrix is given by
\[ P^{(2)} = \begin{bmatrix} \alpha^2 + (1-\alpha)\beta & 1 - \alpha^2 - (1-\alpha)\beta \\ \alpha\beta + \beta(1-\beta) & 1 - \alpha\beta - \beta(1-\beta) \end{bmatrix} \]

5.3 The Strong Markov Property

Consider a Markov chain $X_n$ having one-step transition probabilities $P_{ij}$, which means that if the Markov chain is in state $i$ at a fixed time $n$, then the next state will be $j$ with probability $P_{ij}$. However, it is not necessarily true that if the Markov chain is in state $i$ at a randomly distributed time $T$, the next state will be $j$ with probability $P_{ij}$. That is, if $T$ is an arbitrary nonnegative integer valued random variable, it is not necessarily true that $P(X_{T+1} = j \,|\, X_T = i) = P_{ij}$. For a simple counterexample suppose
\[ T = \min(n : X_n = i, X_{n+1} = j) \]
Then, clearly,
\[ P(X_{T+1} = j \,|\, X_T = i) = 1 \]
The idea behind this counterexample is that a general random variable $T$ may depend not only on the states of the Markov chain up to time $T$ but also on future states after time $T$. Recalling that $T$ is a stopping time for a filtration $\mathcal{F}_n$ if $\{T = n\} \in \mathcal{F}_n$ for every $n$, we see that for a stopping time the value of $T$ can only depend on the states up to time $T$. We now show that $P(X_{T+n} = j \,|\, X_T = i)$ will equal $P_{ij}^{(n)}$ provided that $T$ is a stopping time. This is usually called the strong Markov property, and essentially means a Markov chain "starts over" at stopping times. Below we define $\mathcal{F}_T = \{A : A \cap \{T = t\} \in \mathcal{F}_t \text{ for all } t\}$, which intuitively represents any information you would know by time $T$.

Proposition 5.5 The strong Markov property.
Let $X_n$, $n \ge 0$, be a Markov chain with respect to the filtration $\mathcal{F}_n$. If $T < \infty$ a.s. is a stopping time with respect to $\mathcal{F}_n$, then
\[ P(X_{T+n} = j \,|\, X_T = i, \mathcal{F}_T) = P_{ij}^{(n)} \]

Proof.
\[ P(X_{T+n} = j \,|\, X_T = i, \mathcal{F}_T, T = t) = P(X_{t+n} = j \,|\, X_t = i, \mathcal{F}_t, T = t) = P(X_{t+n} = j \,|\, X_t = i, \mathcal{F}_t) = P_{ij}^{(n)} \]
where the next to last equality used the fact that $T$ is a stopping time to give $\{T = t\} \in \mathcal{F}_t$.

Example 5.6 Losses in queueing busy periods. Consider a queueing system where $X_n$ is the number of customers in the system at time $n$. At each time $n = 1, 2, \ldots$ there is a probability $p$ that a customer arrives and a probability $(1 - p)$, if there are any customers present, that one of them departs. Starting with $X_0 = 1$, let $T = \min\{t > 0 : X_t = 0\}$ be the length of a busy period. Suppose also there is only space for at most $m$ customers in the system. Whenever a customer arrives to find $m$ customers already in the system, the customer is lost and departs immediately. Letting $N_m$ be the number of customers lost during a busy period, compute $E[N_m]$.

Solution. Let $A$ be the event that the first arrival occurs before the first departure. We will obtain $E[N_m]$ by conditioning on whether $A$ occurs. Now, when $A$ happens, for the busy period to end we must first wait an interval of time until the system goes back to having a single customer, and then after that we must wait another interval of time until the system becomes completely empty. By the Markov property the number of losses during the first time interval has distribution $N_{m-1}$, because we are now starting with two customers and therefore with only $m - 1$ spaces for additional customers. The strong Markov property tells us that the number of losses in the second time interval has distribution $N_m$. We therefore have
\[ E[N_m | A] = \begin{cases} E[N_{m-1}] + E[N_m] & \text{if } m > 1 \\ 1 + E[N_m] & \text{if } m = 1 \end{cases} \]
and using $P(A) = p$ and $E[N_m | A^c] = 0$ we have
\[ E[N_m] = E[N_m | A] P(A) + E[N_m | A^c] P(A^c) = p E[N_{m-1}] + p E[N_m] \]
for $m > 1$ along with
\[ E[N_1] = p + p E[N_1] \]
and thus
\[ E[N_m] = \Big(\frac{p}{1-p}\Big)^m \]
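The closed form for $E[N_m]$ in Example 5.6 can be cross-checked by first-step analysis on the finite chain. The sketch below is an illustration only, not the authors' method: the function name `expected_losses` and the array `L` are hypothetical, and the code assumes the dynamics stated in the example (each step is an arrival with probability p, otherwise a departure, and an arrival to a full system is lost).

```python
def expected_losses(p, m, tol=1e-14, max_sweeps=1_000_000):
    """Approximate E[N_m] for Example 5.6 by iterating first-step equations.

    L[k] is the expected number of lost customers before the system
    empties, starting with k customers and room for at most m.
    """
    L = [0.0] * (m + 1)  # L[0] = 0: the busy period is over
    for _ in range(max_sweeps):
        change = 0.0
        for k in range(1, m + 1):
            if k < m:
                # arrival moves to k + 1, departure moves to k - 1
                new = p * L[k + 1] + (1 - p) * L[k - 1]
            else:
                # full system: an arrival is lost and the state stays at m
                new = p * (1 + L[m]) + (1 - p) * L[m - 1]
            change = max(change, abs(new - L[k]))
            L[k] = new
        if change < tol:
            break
    return L[1]  # a busy period starts with one customer


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.6):
        for m in (1, 2, 3):
            print(p, m, expected_losses(p, m), (p / (1 - p)) ** m)
```

The iteration solves the linear system numerically rather than in closed form, so agreement with $(p/(1-p))^m$ is only up to the chosen tolerance.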
It's quite interesting to notice that $E[N_m]$ increases in $m$ when $p > 1/2$, decreases when $p < 1/2$, and stays constant for all $m$ when $p = 1/2$. The intuition for the case $p = 1/2$ is that when $m$ increases, losses become less frequent but the busy period becomes longer.

We next apply the strong Markov property to obtain a result for the cover time, the time when all states of a Markov chain have been visited.

Proposition 5.7 Cover times. Given an $N$-state Markov chain $X_n$, let $T_i = \min\{n \ge 0 : X_n = i\}$ and let $C = \max_i T_i$ be the cover time. Then
\[ E[C] \le \sum_{m=1}^N \frac{1}{m} \max_{i,j} E[T_j \,|\, X_0 = i] \]

Proof. Let $I_1, I_2, \ldots, I_N$ be a random permutation of the integers $1, \ldots, N$ chosen so that all possible orderings are equally likely. Letting $T_{I_0} = 0$ and noting that $\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}$ is the additional time needed, after the states $I_1, \ldots, I_{m-1}$ have all been visited, until the states $I_1, \ldots, I_m$ have all been visited, we see that
\[ E[C] = \sum_{m=1}^N E\Big[\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}\Big] \]
\[ = \sum_{m=1}^N \frac{1}{m} E\Big[\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j} \,\Big|\, T_{I_m} > \max_{j \le m-1} T_{I_j}\Big] \]
\[ \le \sum_{m=1}^N \frac{1}{m} \max_{i,j} E[T_j \,|\, X_0 = i] \]
where the second line follows since all orderings are equally likely and thus $P(T_{I_m} > \max_{j \le m-1} T_{I_j}) = 1/m$, and the third from the strong Markov property.

5.4 Classification of States

We say that states $i, j$ of a Markov chain communicate with each other, or are in the same class, if there are integers $n$ and $m$ such that both $P_{ij}^{(n)} > 0$ and $P_{ji}^{(m)} > 0$ hold. This means that it is possible for the chain to get from $i$ to $j$ and vice versa. A Markov chain is called irreducible if all states are in the same class.

For a Markov chain $X_n$, let
\[ T_i = \min\{n > 0 : X_n = i\} \]
be the time until the Markov chain first makes a transition into state $i$. Using the notation $E_i[\cdots]$ and $P_i(\cdots)$ to denote that the Markov chain starts from state $i$, let
\[ f_i = P_i(T_i < \infty) \]
be the probability that the chain ever makes a transition into state $i$ given that it starts in state $i$. We say that state $i$ is transient if $f_i < 1$ and recurrent if $f_i = 1$. Let
\[ N_i = \sum_{n=1}^\infty I_{\{X_n = i\}} \]
be the total number of transitions into state $i$.
The strong Markov property tells us that, starting in state $i$,
\[ N_i + 1 \sim \text{geometric}(1 - f_i) \]
since each time the chain makes a transition into state $i$ there is, independent of all else, a $(1 - f_i)$ chance it will never return. Consequently,
\[ E_i[N_i] = \sum_{n=1}^\infty P_{ii}^{(n)} \]
is infinite if state $i$ is recurrent and finite if state $i$ is transient.

Proposition 5.8 If state $i$ is recurrent and $i$ communicates with $j$, then $j$ is also recurrent.

Proof. Because $i$ and $j$ communicate, there exist values $n$ and $m$ such that $P_{ij}^{(n)} P_{ji}^{(m)} > 0$. But for any $k > 0$
\[ P_{jj}^{(n+m+k)} \ge P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)} \]
where the preceding follows because $P_{jj}^{(n+m+k)}$ is the probability starting in state $j$ that the chain will be back in $j$ after $n + m + k$ transitions, whereas $P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)}$ is the probability of the same event occurring but with the additional condition that the chain must also be in $i$ after the first $m$ transitions and then back in $i$ after an additional $k$ transitions. Summing over $k$ shows that
\[ E_j[N_j] \ge \sum_k P_{jj}^{(n+m+k)} \ge P_{ji}^{(m)} P_{ij}^{(n)} \sum_k P_{ii}^{(k)} = \infty \]
Thus, $j$ is also recurrent.

Proposition 5.9 If $j$ is transient, then $\sum_{n=1}^\infty P_{ij}^{(n)} < \infty$.

Proof. Note that
\[ E_i[N_j] = E_i\Big[\sum_{n=1}^\infty I_{\{X_n = j\}}\Big] = \sum_{n=1}^\infty P_{ij}^{(n)} \]
Let $f_{ij}$ denote the probability that the chain ever makes a transition into $j$ given that it starts at $i$. Then, conditioning on whether such a transition ever occurs yields, upon using the strong Markov property, that
\[ E_i[N_j] = (1 + E_j[N_j]) f_{ij} < \infty \]
since $j$ is transient.

If $i$ is recurrent, let
\[ \mu_i = E_i[T_i] \]
denote the mean number of transitions it takes the chain to return to state $i$, given it starts in $i$. We say that a recurrent state $i$ is null if $\mu_i = \infty$ and positive if $\mu_i < \infty$. In the next section we will show that positive recurrence is a class property, meaning that if $i$ is positive recurrent and communicates with $j$ then $j$ is also positive recurrent. (This also implies, using that recurrence is a class property, that so is null recurrence.)
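The relation between the return probability $f_i$ and the expected number of returns can be illustrated on a small transient example. The sketch below uses a hypothetical three-state chain (states 0 and 1 transient, state 2 absorbing) that is not from the text; it compares a truncated version of $\sum_n P_{00}^{(n)}$ with $f_0/(1 - f_0)$, the mean of $N_0$ when $N_0 + 1$ is geometric with parameter $1 - f_0$.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical chain: states 0 and 1 are transient, state 2 is absorbing.
P = [[0.2, 0.5, 0.3],
     [0.6, 0.1, 0.3],
     [0.0, 0.0, 1.0]]


def expected_returns(P, i, n_terms=400):
    """Truncated sum_{n=1}^{n_terms} P_ii^(n), approximating E_i[N_i]."""
    total = 0.0
    Pn = [row[:] for row in P]  # current power P^n, starting at n = 1
    for _ in range(n_terms):
        total += Pn[i][i]
        Pn = mat_mul(Pn, P)
    return total


# Return probability f_0 by first-step analysis:
# h = P(reach 0 before absorption | start in 1) solves h = 0.6 + 0.1 * h.
h = 0.6 / (1 - 0.1)
f0 = 0.2 + 0.5 * h

if __name__ == "__main__":
    print("sum of P_00^(n):", expected_returns(P, 0))
    print("f0 / (1 - f0):  ", f0 / (1 - f0))
```

Because $f_0 < 1$ here, the sum is finite, in line with the transient case above; for a recurrent state the partial sums would grow without bound.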
5.5 Stationary and Limiting Distributions

For a Markov chain $X_n$ starting in some given state $i$, we define the limiting probability of being in state $j$ to be
\[ P_j = \lim_{n \to \infty} P_{ij}^{(n)} \]
if the limit exists and is the same for all $i$.

It is easy to see that not all Markov chains will have limiting probabilities. For instance, consider the two state Markov chain with $P_{01} = P_{10} = 1$. For this chain $P_{00}^{(n)}$ will equal 1 when $n$ is even and 0 when $n$ is odd, and so has no limit.

Definition 5.10 State $i$ of a Markov chain $X_n$ is said to have period $d$ if $d$ is the largest integer having the property that $P_{ii}^{(n)} = 0$ when $n$ is not a multiple of $d$.

Proposition 5.11 If states $i$ and $j$ communicate, then they have the same period.

Proof. Let $d_k$ be the period of state $k$. Let $n, m$ be such that $P_{ij}^{(n)} P_{ji}^{(m)} > 0$. Now, if $P_{ii}^{(r)} > 0$, then
\[ P_{jj}^{(r+n+m)} \ge P_{ji}^{(m)} P_{ii}^{(r)} P_{ij}^{(n)} > 0 \]
So $d_j$ divides $r + n + m$. Moreover, because
\[ P_{ii}^{(2r)} \ge P_{ii}^{(r)} P_{ii}^{(r)} > 0 \]
the same argument shows that $d_j$ also divides $2r + n + m$; therefore $d_j$ divides $2r + n + m - (r + n + m) = r$. Because $d_j$ divides $r$ whenever $P_{ii}^{(r)} > 0$, it follows that $d_j$ divides $d_i$. But the same argument can now be used to show that $d_i$ divides $d_j$. Hence, $d_i = d_j$.

It follows from the preceding that all states of an irreducible Markov chain have the same period. If the period is 1, we say that the chain is aperiodic. It's easy to see that only aperiodic chains can have limiting probabilities.

Intimately linked to limiting probabilities are stationary probabilities. The probability vector $\pi_i$, $i \in S$, is said to be a stationary probability vector for the Markov chain if
\[ \pi_j = \sum_i \pi_i P_{ij} \quad \text{for all } j \]
\[ \sum_j \pi_j = 1 \]
Its name arises from the fact that if $X_0$ is distributed according to a stationary probability vector $\{\pi_i\}$ then
\[ P(X_1 = j) = \sum_i P(X_1 = j \,|\, X_0 = i)\pi_i = \sum_i \pi_i P_{ij} = \pi_j \]
and, by a simple induction argument,
\[ P(X_n = j) = \sum_i P(X_n = j \,|\, X_{n-1} = i) P(X_{n-1} = i) = \sum_i \pi_i P_{ij} = \pi_j \]
Consequently, if we start the chain with a stationary probability vector then $X_n, X_{n+1}, \ldots$ has the same probability distribution for all $n$.
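The difference between stationary and limiting probabilities can be made concrete with the two state chain of Example 5.4. The sketch below is illustrative only: the helper names are hypothetical, and the closed form in `two_state_stationary` comes from solving $\pi_0 = \pi_0\alpha + \pi_1\beta$ with $\pi_0 + \pi_1 = 1$, assuming $1 - \alpha + \beta > 0$.

```python
def step(dist, P):
    """One step of the chain: the row vector dist multiplied by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def two_state_stationary(alpha, beta):
    """Stationary vector of P = [[alpha, 1 - alpha], [beta, 1 - beta]].

    Assumes 1 - alpha + beta > 0, i.e. not both alpha = 1 and beta = 0.
    """
    denom = 1 - alpha + beta
    return [beta / denom, (1 - alpha) / denom]


if __name__ == "__main__":
    alpha, beta = 0.7, 0.4
    P = [[alpha, 1 - alpha], [beta, 1 - beta]]
    pi = two_state_stationary(alpha, beta)
    print("pi      :", pi)
    print("pi * P  :", step(pi, P))  # unchanged by one step

    # The chain with P_01 = P_10 = 1 has stationary vector (1/2, 1/2),
    # yet started in state 0 its distribution alternates and has no limit.
    flip = [[0.0, 1.0], [1.0, 0.0]]
    dist = [1.0, 0.0]
    for n in range(4):
        print("n =", n, "distribution:", dist)
        dist = step(dist, flip)
```

The second chain shows that having a stationary probability vector does not by itself give limiting probabilities; the periodicity has to be ruled out as well.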
The following result will be needed later.

Proposition 5.12 An irreducible transient Markov chain does not have a stationary probability vector.

Proof. Assume there is a stationary probability vector $\pi_i$, $i \ge 0$, and take it to be the probability mass function of $X_0$. Then, for any $j$
\[ \pi_j = P(X_n = j) = \sum_i \pi_i P_{ij}^{(n)} \]
Consequently, for any $m$
\[ \pi_j = \lim_{n \to \infty} \sum_i \pi_i P_{ij}^{(n)} \le \lim_{n \to \infty} \Big( \sum_{i \le m} \pi_i P_{ij}^{(n)} + \sum_{i > m} \pi_i \Big) = \sum_{i \le m} \pi_i \lim_{n \to \infty} P_{ij}^{(n)} + \sum_{i > m} \pi_i = \sum_{i > m} \pi_i \]
where the final equality used that $\sum_n P_{ij}^{(n)} < \infty$ because $j$ is transient, implying that $\lim_{n \to \infty} P_{ij}^{(n)} = 0$. Letting $m \to \infty$ shows that $\pi_j = 0$ for all $j$, contradicting the fact that $\sum_j \pi_j = 1$. Thus, assuming a stationary probability vector results in a contradiction, proving the result.

The following theorem is of key importance.

Theorem 5.13 An irreducible Markov chain has a stationary probability vector $\{\pi_i\}$ if and only if all states are positive recurrent. The stationary probability vector is unique and satisfies
\[ \pi_j = 1/\mu_j \]
Moreover, if the chain is aperiodic then
\[ \pi_j = \lim_{n \to \infty} P_{ij}^{(n)} \]
To prove the preceding theorem, we will make use of a couple of lemmas.

Lemma 5.14 For an irreducible Markov chain, if there exists a stationary probability vector $\{\pi_i\}$ then all states are positive recurrent. Moreover, the stationary probability vector is unique and satisfies
\[ \pi_j = 1/\mu_j \]

Proof. Let $\pi_j$ be stationary probabilities, and suppose that $P(X_0 = j) = \pi_j$ for all $j$. We first show that $\pi_i > 0$ for all $i$. To verify this, suppose that $\pi_k = 0$. Now for any state $j$, because the chain is irreducible, there is an $n$ such that $P_{jk}^{(n)} > 0$. Because $X_0$ is determined by the stationary probabilities,
\[ \pi_k = P(X_n = k) = \sum_i \pi_i P_{ik}^{(n)} \ge \pi_j P_{jk}^{(n)} \]
Consequently, if $\pi_k = 0$ then so is $\pi_j$. Because $j$ was arbitrary, that means that if $\pi_i = 0$ for any $i$, then $\pi_i = 0$ for all $i$. But that would contradict the fact that $\sum_i \pi_i = 1$. Hence any stationary probability vector for an irreducible Markov chain must have all positive elements.
Now, recall that $T_j = \min(n > 0 : X_n = j)$. So,
\[ \mu_j = E[T_j \,|\, X_0 = j] = \sum_{n=1}^\infty P(T_j \ge n \,|\, X_0 = j) = \sum_{n=1}^\infty \frac{P(T_j \ge n, X_0 = j)}{P(X_0 = j)} \]
Because $X_0$ is chosen according to the stationary probability vector $\{\pi_i\}$, this gives
\[ \pi_j \mu_j = \sum_{n=1}^\infty P(T_j \ge n, X_0 = j) \tag{5.1} \]
Now,
\[ P(T_j \ge 1, X_0 = j) = P(X_0 = j) = \pi_j \]
and, for $n \ge 2$
\[ P(T_j \ge n, X_0 = j) = P(X_i \ne j,\ 1 \le i \le n-1,\ X_0 = j) \]
[...] Because $P(X_0 \ne j) = 1 - \pi_j$, we thus obtain
\[ \pi_j = 1/\mu_j \]
thus showing that there is at most one stationary probability vector. In addition, because all $\pi_j > 0$, we have that all $\mu_j < \infty$, showing that all states of the chain are positive recurrent.

Lemma 5.15 If some state of an irreducible Markov chain is positive recurrent then there exists a stationary probability vector.

Proof. Suppose state $k$ is positive recurrent. Thus,
\[ \mu_k = E_k[T_k] < \infty \]
Say that a new cycle begins every time the chain makes a transition into state $k$. For any state $j$, let $A_j$ denote the amount of time the chain spends in state $j$ during a cycle. Then
\[ E[A_j] = E_k\Big[\sum_{n=0}^\infty I_{\{X_n = j,\, T_k > n\}}\Big] = \sum_{n=0}^\infty E_k[I_{\{X_n = j,\, T_k > n\}}] = \sum_{n=0}^\infty P_k(X_n = j, T_k > n) \]
We claim that $\pi_j = E[A_j]/\mu_k$, $j \ge 0$, is a stationary probability vector. Because $E[\sum_j A_j]$ is the expected time of a cycle, it must equal $E_k[T_k]$, showing that
\[ \sum_j \pi_j = 1 \]
Moreover, for $j \ne k$
\[ \mu_k \pi_j = \sum_{n \ge 0} P_k(X_n = j, T_k > n) \]
\[ = \sum_i \sum_{n \ge 1} P_k(X_n = j, T_k > n-1, X_{n-1} = i) \]
\[ = \sum_i \sum_{n \ge 1} P_k(T_k > n-1, X_{n-1} = i)\, P_k(X_n = j \,|\, T_k > n-1, X_{n-1} = i) \]
\[ = \sum_i \sum_{n \ge 1} P_k(T_k > n-1, X_{n-1} = i) P_{ij} \]
\[ = \sum_i \sum_{n \ge 0} P_k(T_k > n, X_n = i) P_{ij} \]
\[ = \sum_i E[A_i] P_{ij} = \mu_k \sum_i \pi_i P_{ij} \]
Finally,
\[ \sum_i \pi_i P_{ik} = \sum_i \pi_i \Big(1 - \sum_{j \ne k} P_{ij}\Big) = 1 - \sum_{j \ne k} \sum_i \pi_i P_{ij} = 1 - \sum_{j \ne k} \pi_j = \pi_k \]
and the proof is complete.

Note that Lemmas 5.14 and 5.15 imply the following.

Corollary 5.16 If one state of an irreducible Markov chain is positive recurrent then all states are positive recurrent.

We are now ready to prove Theorem 5.13.

Proof. All that remains to be proven is that if the chain is aperiodic, as well as irreducible and positive recurrent, then the stationary
probabilities are also limiting probabilities. To prove this, let $\pi_i, i \ge 0$, be stationary probabilities. Let $X_n, n \ge 0$, and $Y_n, n \ge 0$, be independent Markov chains, both with transition probabilities $P_{i,j}$, but with $X_0 = i$ and with $P(Y_0 = i) = \pi_i$. Let

$$N = \min(n : X_n = Y_n).$$

We first show that $P(N < \infty) = 1$. To do so, consider the Markov chain whose state at time $n$ is $(X_n, Y_n)$, and which thus has transition probabilities

$$P_{(i,j),(k,r)} = P_{ik} P_{jr}.$$

That this chain is irreducible can be seen by the following argument. Because $\{X_n\}$ is irreducible and aperiodic, it follows that for any state $i$ there are relatively prime integers $n, m$ such that $P_{ii}^{(n)} P_{ii}^{(m)} > 0$. But any sufficiently large integer can be expressed as a linear combination, with nonnegative integer coefficients, of relatively prime integers, implying that there is an integer $N_i$ such that

$$P_{ii}^{(n)} > 0 \quad \text{for all } n > N_i.$$

Because $i$ and $j$ communicate this implies the existence of an integer $N_{i,j}$ such that

$$P_{ij}^{(n)} > 0 \quad \text{for all } n > N_{i,j}.$$

Hence,

$$P_{(i,j),(k,r)}^{(n)} = P_{ik}^{(n)} P_{jr}^{(n)} > 0 \quad \text{for all sufficiently large } n,$$

which shows that the vector chain $(X_n, Y_n)$ is irreducible.

In addition, we claim that $\pi_{i,j} = \pi_i \pi_j$ is a stationary probability vector, which is seen from

$$\pi_i \pi_j = \sum_k \pi_k P_{k,i} \sum_r \pi_r P_{r,j} = \sum_{k,r} \pi_k \pi_r P_{k,i} P_{r,j}.$$

By Lemma 5.14 this shows that the vector Markov chain is positive recurrent and so $P(N < \infty) = 1$ and thus $\lim_n P(N > n) = 0$.

Now, let $Z_n = X_n$ if $n < N$ and let $Z_n = Y_n$ if $n \ge N$. It is easy to see that $Z_n, n \ge 0$, is also a Markov chain with transition probabilities $P_{i,j}$ and has $Z_0 = i$. Now

$$P_{i,j}^{(n)} = P(Z_n = j) = P(Z_n = j, N \le n) + P(Z_n = j, N > n) = P(Y_n = j, N \le n) + P(Z_n = j, N > n) \le P(Y_n = j) + P(N > n) = \pi_j + P(N > n). \qquad (5.2)$$

On the other hand

$$\pi_j = P(Y_n = j) = P(Y_n = j, N \le n) + P(Y_n = j, N > n) = P(Z_n = j, N \le n) + P(Y_n = j, N > n) \le P(Z_n = j) + P(N > n) = P_{ij}^{(n)} + P(N > n). \qquad (5.3)$$

Hence, from (5.2) and (5.3) we see that

$$\lim_{n\to\infty} P_{ij}^{(n)} = \pi_j.$$
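Theorem 5.13 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the three-state matrix `P` and the helper names `step`, `n_step_row` and `mean_return_time` are hypothetical and not from the text. It approximates $P_{ij}^{(n)}$ for a large $n$ from each starting state, and approximates $\mu_j = \sum_{n \ge 0} P_j(T_j > n)$ by truncating the series, so that the two claims $\lim_n P_{ij}^{(n)} = \pi_j$ and $\pi_j = 1/\mu_j$ can be compared.

```python
# A small irreducible, aperiodic chain chosen only for illustration.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

def step(dist, P):
    """One step of the chain: return the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def n_step_row(P, start, steps=500):
    """Approximate P_{start,j}^{(n)} for n = steps, for every j."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    for _ in range(steps):
        dist = step(dist, P)
    return dist

def mean_return_time(P, j, steps=5000):
    """Approximate mu_j = sum_{n>=0} P_j(T_j > n) by truncating the series."""
    n = len(P)
    # v[i] = P_j(X_m = i, T_j > m) for i != j, starting with m = 1.
    v = [P[j][i] if i != j else 0.0 for i in range(n)]
    total = 1.0  # the n = 0 term, P_j(T_j > 0) = 1
    for _ in range(steps):
        total += sum(v)
        # Advance one step while forbidding any visit to state j.
        v = [
            sum(v[l] * P[l][i] for l in range(n) if l != j) if i != j else 0.0
            for i in range(n)
        ]
    return total

rows = [n_step_row(P, i) for i in range(len(P))]
mu = [mean_return_time(P, j) for j in range(len(P))]
for i, row in enumerate(rows):
    print("start", i, "->", [round(x, 6) for x in row])
print("mu   =", [round(m, 6) for m in mu])
print("1/mu =", [round(1.0 / m, 6) for m in mu])
```

For this matrix the stationary equations can be solved by hand, giving $\pi = (1/4, 1/2, 1/4)$, so every row of the $n$-step matrix should approach that vector regardless of the starting state.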
Remark 5.17 It follows from Theorem 5.13 that if we have an irreducible Markov chain, and we can find a solution of the stationarity equations

$$\pi_j = \sum_i \pi_i P_{ij}, \quad j \ge 0, \qquad \sum_j \pi_j = 1,$$

then the Markov chain is positive recurrent, and the $\pi_j$ are the unique stationary probabilities. If, in addition, the chain is aperiodic then the $\pi_j$ are also limiting probabilities.

Remark 5.18 Because $\mu_i$ is the mean number of transitions between successive visits to state $i$, it is intuitive (and will be formally proven in the chapter on renewal theory) that the long run proportion of time that the chain spends in state $i$ is equal to $1/\mu_i$. Hence, the stationary probability $\pi_i$ is equal to the long run proportion of time that the chain spends in state $i$.

Definition 5.19 A positive recurrent, aperiodic, irreducible Markov chain is called an ergodic Markov chain.

Definition 5.20 A positive recurrent irreducible Markov chain whose initial state is distributed according to its stationary probabilities is called a stationary Markov chain.

5.6 Time Reversibility

A stationary Markov chain $X_n$ is called time reversible if

$$P(X_n = j \mid X_{n+1} = i) = P(X_{n+1} = j \mid X_n = i)$$

for all $i, j$. By the Markov property we know that the processes $X_{n+1}, X_{n+2}, \ldots$ and $X_{n-1}, X_{n-2}, \ldots$ are conditionally independent given $X_n$, and so it follows that the reversed process $X_{n-1}, X_{n-2}, \ldots$ will also be a Markov chain having transition probabilities

$$P(X_n = j \mid X_{n+1} = i) = \frac{P(X_n = j, X_{n+1} = i)}{P(X_{n+1} = i)} = \frac{\pi_j P_{ji}}{\pi_i},$$

where $\pi_i$ and $P_{ij}$ respectively denote the stationary probabilities and the transition probabilities for the Markov chain $X_n$. Thus an equivalent definition for $X_n$ being time reversible is if

$$\pi_i P_{ij} = \pi_j P_{ji}$$

for all $i, j$. Intuitively, a Markov chain is time reversible if it looks the same running backwards as it does running forwards. It also means that the rate of transitions from $i$ to $j$, namely $\pi_i P_{ij}$, is the same as the rate of transitions from $j$ to $i$, namely $\pi_j P_{ji}$.
This happens if there are no “loops” for which a Markov chain is more likely in the long tun to go in one direction compared with the other direction. We illustrate with an examples.158 Chapter 5 Markov Chains Example 5.21 Random walk on the circle. Consider a particle which moves around n. positions on a circle numbered 1,2,....n according to transition probabilities Piz: = p = 1— Piyii forl Si, m; = 44 = 1, that the Markov chain is time reversible with the given stationary probabilities. m 5.7 A Mean Passage Time Bound Consider now a Markov chain whose state space is the set of non- negative integers, and which is such that Py =0, 0Si 0, we have that E[Dj] > di, then BAN] < 1/4 Proof. The proof is by induction on n. It is true when n = 1, because Nj is geometric with mean 1 1 PMI Fe ~ EID]160 Chapter 5 Markov Chains So, assume that E[N,] < Ch, 1/dj, for all k > ElNn-j]P(Da = 3) j=l a as S1+ Pan E(Nnl + D> P(Dn = 5) D0 di jn fl = 1+ PanE[Nn] + YPw = aod a a i=l i=n-j+1 S14 Pan B[Nn) + YP, = = ay 1/di ~ 5/4) ja where the last line follows because d; is nondecreasing. Continuing from the previous line we get = 1+ PanE[Nal + (1 — Pan) x Udi - ELI j) =14 PanElNa] + (1 — Pan) Sodas — Ba = : S Pan E[Nn] + (1 — Pan) S/d i= which completes the proof. m Example 5.24 At each stage, each of a set of balls is independently put in one of n urns, with each ball being put in urn i with probability Pi, 8.1 pi = 1. After this is done, all of the balls in the same urn
1, and W, = 0%, Xi then Wp i=l satisfies the conditions of Theorem 2.20. Because n a DAP = nba = EGE 0 it follows from Theorem 2.20 that P(W, <2)—+ P(Z <2). @74 Chapter 2 Stein's Method and Central Limit Theorems 2.7 Exercises 1. If X ~ Poisson(a) and Y ~ Poisson(b), with b > a, use coupling to show that Y >a X. . Suppose a particle starts at position 5 on a number line and at each time period the particle moves one position to the right with probability p and, if the particle is above position 0, moves one position to left with probability 1 — p. Let X;,(p) be the position of the particle at time n for the given value of p. Use coupling to show that Xp(a) >st Xn(b) for any n if a > b. . Let X,Y be indicator variables with E[X] = a and B[Y] = 6. (a) Show how to construct a maximal coupling X,¥ for X and Y, and then compute P(X = Y) as a function of a,b. (b) Show how to construct a minimal coupling to minimize P(X = ¥). . In a room full of n people, let X be the number of people who share a birthday with at least one other person in the room. Then let Y be the number of pairs of people in the room having the same birthday. (a) Compute E[X] and Var(X) and E[Y] and Var(¥). (b) Which of the two variables X or Y do you believe will more closely follow @ Poisson distribution? Why? (c) Ina room of 51 people, it turns out there are 3 pairs with the same birthday and also a triplet (3 people) with the same birthday. This is a total of 9 people and also 6 pairs. Use a Poisson approximation to estimate P(X > 9) and P(Y > 6). Which of these two approximations do you think will be better? Have we observed a rare event here? . Compute a bound on the accuracy of the better approximation in the previous exercise part (c) using the Stein-Chen method. . For discrete X,Y prove dry(X,¥) = 45, |P(X = 2)-P(Y = 2) . For discrete X,Y show that P(X # Y) > drv(X,Y) and show also that there exists a coupling that yields equality. . 
Compute a bound on the accuracy of a normal approximation for a Poisson random variable with mean 100.2.7 Exercises 75 9. If X ~ Geometric(p), with q = 1 — p, then show that for any bounded function f with f(0) = 0, we have E[f(X) —af(X + )) =0. 10. Suppose X),X2,... are independent mean zero random vari- ables with |Xq| <1 for all n and Dicp, Var(X;)/n + 8 < 00 If Sn = Dign Xi and Z ~ N(0,s), show that Sp/ Vn +p Z. 11. Suppose m balls are placed among n urns, with each ball inde- pendently going in to urn i with probability p;. Assume m is much larger than n. Approximate the chance none of the urns are empty, and give a bound on the error of the approximation. 12. A ring with a circumference of c is cut into n pieces (where nis large) by cutting at n places chosen uniformly at random around the ring. Estimate the chance you get k pieces of length at least a, and give a bound on the error of the approximation. 13. Suppose Xj, i= 1,2,...,10 are iid U(0, 1). Give an approxima- tion for P(}0j2, Xi > 7), and give a bound on the error of this approximation. 14. Suppose Xi, i = 1,2,...,n are independent random variables with E[X;] = 0, and 0", Var(X;) =1. Let W = 52, X; and show that jew - 2 <3) E\X.)* a? i=Chapter 3 Conditional Expectation and Martingales 3.1 Introduction A generalization of a sequence of independent random variables oc- curs when we allow the variables of the sequence to be dependent on previous variables in the sequence. One example of this type of dependence is called a martingale, and its definition formalizes the concept of a fair gambling game. A number of results which hold for independent random variables also hold, under certain conditions, for martingales, and seemingly complex problems can be elegantly solved by reframing them in terms of a martingale. 
In section 2 we introduce the notion of conditional expectation, in section 3 we formally define a martingale, in section 4 we introduce the concept of stopping times and prove the martingale stopping theorem, in section 5 we give an approach for finding tail probabilities for martingales, and in section 6 we introduce supermartingales and submartingales, and prove the martingale convergence theorem.

3.2 Conditional Expectation

Let $X$ be such that $E[|X|] < \infty$. In a first course in probability, $E[X|Y]$, the conditional expectation of $X$ given $Y$, is defined to be that function of $Y$ that when $Y = y$ is equal to

$$E[X \mid Y = y] = \begin{cases} \sum_x x P(X = x \mid Y = y), & \text{if } X, Y \text{ are discrete} \\ \int x f_{X|Y}(x|y)\,dx, & \text{if } X, Y \text{ have joint density } f \end{cases}$$

where

$$f_{X|Y}(x|y) = \frac{f(x,y)}{\int f(x,y)\,dx} = \frac{f(x,y)}{f_Y(y)}. \qquad (3.1)$$

The important result, often called the tower property, that

$$E[X] = E[E[X|Y]]$$

is then proven. This result, which is often written as

$$E[X] = \begin{cases} \sum_y E[X \mid Y = y] P(Y = y), & \text{if } X, Y \text{ are discrete} \\ \int E[X \mid Y = y] f_Y(y)\,dy, & \text{if } X, Y \text{ are jointly continuous,} \end{cases}$$

is then gainfully employed in a variety of different calculations.

We now show how to give a more general definition of conditional expectation, one that reduces to the preceding cases when the random variables are discrete or continuous. To motivate our definition, suppose that whether or not $A$ occurs is determined by the value of $Y$. (That is, suppose that $A \in \sigma(Y)$.) Then, using material from our first course in probability, we see that

$$E[X I_A] = E[E[X I_A \mid Y]] = E[I_A E[X \mid Y]],$$

where the final equality holds because, given $Y$, $I_A$ is a constant random variable.

We are now ready for a general definition of conditional expectation.

Definition For random variables $X, Y$, let $E[X|Y]$, called the conditional expectation of $X$ given $Y$, denote that function $h(Y)$ having the property that for any $A \in \sigma(Y)$

$$E[X I_A] = E[h(Y) I_A].$$
By the Radon-Nikodym Theorem of measure theory, a function $h$ that makes $h(Y)$ a (measurable) random variable and satisfies the preceding always exists, and, as we show in the following, is unique in the sense that any two such functions of $Y$ must, with probability 1, be equal. The function $h$ is also referred to as a Radon-Nikodym derivative.

Proposition 3.1 If $h_1$ and $h_2$ are functions such that

$$E[h_1(Y) I_A] = E[h_2(Y) I_A]$$

for any $A \in \sigma(Y)$, then

$$P(h_1(Y) = h_2(Y)) = 1.$$

Proof. Let $A_n = \{h_1(Y) - h_2(Y) > 1/n\}$. Then,

$$0 = E[h_1(Y) I_{A_n}] - E[h_2(Y) I_{A_n}] = E[(h_1(Y) - h_2(Y)) I_{A_n}] \ge \frac{1}{n} P(A_n),$$

showing that $P(A_n) = 0$. Because the events $A_n$ are increasing in $n$, this yields

$$0 = \lim_n P(A_n) = P(\lim_n A_n) = P(\cup_n A_n) = P(h_1(Y) > h_2(Y)).$$

Similarly, we can show that $0 = P(h_1(Y) < h_2(Y))$, which proves the result. ∎

We now show that the preceding general definition of conditional expectation reduces to the usual ones when $X, Y$ are either jointly discrete or continuous.

Proposition 3.2 If $X$ and $Y$ are both discrete, then

$$E[X \mid Y = y] = \sum_x x P(X = x \mid Y = y),$$

whereas if they are jointly continuous with joint density $f$, then

$$E[X \mid Y = y] = \int x f_{X|Y}(x|y)\,dx,$$

where $f_{X|Y}(x|y) = \dfrac{f(x,y)}{\int f(x,y)\,dx}$.

Proof. Suppose $X$ and $Y$ are both discrete, and let

$$h(y) = \sum_x x P(X = x \mid Y = y).$$

For $A \in \sigma(Y)$, define

$$B = \{y : I_A = 1 \text{ when } Y = y\}.$$

Because $A \in \sigma(Y)$, it follows that $I_A = I_{\{Y \in B\}}$. Thus,

$$X I_A = \begin{cases} x, & \text{if } X = x, Y \in B \\ 0, & \text{otherwise} \end{cases}$$

and

$$h(Y) I_A = \begin{cases} \sum_x x P(X = x \mid Y = y), & \text{if } Y = y \in B \\ 0, & \text{otherwise.} \end{cases}$$

Thus,

$$E[X I_A] = \sum_x x P(X = x, Y \in B) = \sum_x x \sum_{y \in B} P(X = x, Y = y) = \sum_{y \in B} \sum_x x P(X = x \mid Y = y) P(Y = y) = E[h(Y) I_A].$$

The result thus follows by uniqueness. The proof in the continuous case is similar, and is left as an exercise. ∎

For any sigma field $\mathcal{F}$ we can define $E[X|\mathcal{F}]$ to be that random variable in $\mathcal{F}$ having the property that for all $A \in \mathcal{F}$

$$E[X I_A] = E[E[X|\mathcal{F}] I_A].$$

Intuitively, $E[X|\mathcal{F}]$ represents the conditional expectation of $X$ given that we know all of the events in $\mathcal{F}$ that occur.

Remark 3.3 It follows from the preceding definition that $E[X|Y] = E[X|\sigma(Y)]$.
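For discrete random variables, the defining property $E[X I_A] = E[h(Y) I_A]$ can be checked by brute force, because every $A \in \sigma(Y)$ has the form $\{Y \in B\}$ for some set $B$ of values of $Y$. The sketch below is an illustration only: the joint mass function `joint` and the helper names `cond_exp` and `defining_property_holds` are hypothetical, chosen for this example. It computes $h(y) = \sum_x x P(X = x \mid Y = y)$ exactly with rational arithmetic and tests the property for every subset $B$.

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical joint pmf P(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 0): F(1, 8), (1, 0): F(3, 8),
    (0, 1): F(1, 4), (2, 1): F(1, 8),
    (5, 2): F(1, 8),
}

def cond_exp(joint):
    """Return h as a dict with h[y] = sum_x x P(X = x | Y = y)."""
    p_y, num = {}, {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, F(0)) + p        # P(Y = y)
        num[y] = num.get(y, F(0)) + x * p    # sum_x x P(X = x, Y = y)
    return {y: num[y] / p_y[y] for y in p_y}

def defining_property_holds(joint):
    """Check E[X I_A] = E[h(Y) I_A] for every event A = {Y in B}."""
    h = cond_exp(joint)
    ys = sorted(h)
    for r in range(len(ys) + 1):
        for B in combinations(ys, r):
            lhs = sum(x * p for (x, y), p in joint.items() if y in B)
            rhs = sum(h[y] * p for (x, y), p in joint.items() if y in B)
            if lhs != rhs:
                return False
    return True

h = cond_exp(joint)
print(h)
print(defining_property_holds(joint))
```

Taking $B$ to be the full set of values of $Y$ gives $A = \Omega$, in which case the property is just the tower property $E[X] = E[E[X|Y]]$.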
For any random variables $X, X_1, \ldots, X_n$, define $E[X|X_1, \ldots, X_n]$ by

$$E[X|X_1, \ldots, X_n] = E[X|\sigma(X_1, \ldots, X_n)].$$

In other words, $E[X|X_1, \ldots, X_n]$ is that function $h(X_1, \ldots, X_n)$ for which

$$E[X I_A] = E[h(X_1, \ldots, X_n) I_A], \quad \text{for all } A \in \sigma(X_1, \ldots, X_n).$$

We now establish some important properties of conditional expectation.

Proposition 3.4
(a) The Tower Property: For any sigma field $\mathcal{F}$, $E[X] = E[E[X|\mathcal{F}]]$.
(b) For any $A \in \mathcal{F}$, $E[X I_A|\mathcal{F}] = I_A E[X|\mathcal{F}]$.
(c) If $X$ is independent of all $Y \in \mathcal{F}$, then $E[X|\mathcal{F}] = E[X]$.
(d) If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, then $E[\sum_{i=1}^n X_i|\mathcal{F}] = \sum_{i=1}^n E[X_i|\mathcal{F}]$.
(e) Jensen's Inequality: If $f$ is a convex function, then $E[f(X)|\mathcal{F}] \ge f(E[X|\mathcal{F}])$, provided the expectations exist.

Proof. Recall that $E[X|\mathcal{F}]$ is the unique random variable in $\mathcal{F}$ such that

$$E[X I_A] = E[E[X|\mathcal{F}] I_A] \quad \text{if } A \in \mathcal{F}.$$

Letting $A = \Omega$, $I_A = I_\Omega = 1$, and part (a) follows.

To prove (b), fix $A \in \mathcal{F}$ and let $X^* = X I_A$. Because $E[X^*|\mathcal{F}]$ is the unique random variable in $\mathcal{F}$ such that $E[X^* I_{A'}] = E[E[X^*|\mathcal{F}] I_{A'}]$ for all $A' \in \mathcal{F}$, to show that $E[X^*|\mathcal{F}] = I_A E[X|\mathcal{F}]$, it suffices to show that for $A' \in \mathcal{F}$

$$E[X^* I_{A'}] = E[I_A E[X|\mathcal{F}] I_{A'}].$$

That is, it suffices to show that

$$E[X I_A I_{A'}] = E[I_A E[X|\mathcal{F}] I_{A'}]$$

or, equivalently, that

$$E[X I_{AA'}] = E[E[X|\mathcal{F}] I_{AA'}],$$

which, since $AA' \in \mathcal{F}$, follows by the definition of conditional expectation.

Part (c) will follow if we can show that, for $A \in \mathcal{F}$,

$$E[X I_A] = E[E[X] I_A],$$

which follows because

$$E[X I_A] = E[X] E[I_A] \ \text{(by independence)} = E[E[X] I_A].$$

We will leave the proofs of (d) and (e) as exercises. ∎

Remark 3.5 It can be shown that $E[X|\mathcal{F}]$ satisfies all the properties of ordinary expectations, except that all probabilities are now computed conditional on knowing which events in $\mathcal{F}$ have occurred.
For instance, applying the tower property to $E[X|\mathcal{F}]$ yields that

$$E[X|\mathcal{F}] = E[E[X|\mathcal{F} \cup \mathcal{G}]\,|\,\mathcal{F}].$$

While the conditional expectation of $X$ given the sigma field $\mathcal{F}$ is defined to be that function satisfying

$$E[X I_A] = E[E[X|\mathcal{F}] I_A] \quad \text{for all } A \in \mathcal{F},$$

it can be shown, using the dominated convergence theorem, that

$$E[XW] = E[E[X|\mathcal{F}] W] \quad \text{for all } W \in \mathcal{F}. \qquad (3.2)$$

The following proposition is quite useful.

Proposition 3.6 If $W \in \mathcal{F}$, then

$$E[XW|\mathcal{F}] = W E[X|\mathcal{F}].$$

Before giving a proof, let us note that the result is quite intuitive. Because $W \in \mathcal{F}$, it follows that conditional on knowing which events of $\mathcal{F}$ occur (that is, conditional on $\mathcal{F}$) the random variable $W$ becomes a constant, and the expected value of a constant times a random variable is just the constant times the expected value of the random variable. Next we formally prove the result.

Proof. Let

$$Y = E[XW|\mathcal{F}] - W E[X|\mathcal{F}] = (X - E[X|\mathcal{F}])W - (XW - E[XW|\mathcal{F}])$$

and note that $Y \in \mathcal{F}$. Now, for $A \in \mathcal{F}$,

$$E[Y I_A] = E[(X - E[X|\mathcal{F}]) W I_A] - E[(XW - E[XW|\mathcal{F}]) I_A].$$

However, because $W I_A \in \mathcal{F}$, we use Equation (3.2) to get

$$E[(X - E[X|\mathcal{F}]) W I_A] = E[XW I_A] - E[E[X|\mathcal{F}] W I_A] = 0,$$

and by the definition of conditional expectation we have

$$E[(XW - E[XW|\mathcal{F}]) I_A] = E[XW I_A] - E[E[XW|\mathcal{F}] I_A] = 0.$$

Thus, we see that for $A \in \mathcal{F}$,

$$E[Y I_A] = 0.$$

Setting first $A = \{Y > 0\}$, and then $A = \{Y < 0\}$ (which are both in $\mathcal{F}$ because $Y \in \mathcal{F}$) shows that

$$P(Y > 0) = P(Y < 0) = 0.$$

Hence, $Y = 0$, which proves the result. ∎

3.3 Martingales

We say that the sequence of sigma fields $\mathcal{F}_1, \mathcal{F}_2, \ldots$ is a filtration if $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$. We say a sequence of random variables $X_1, X_2, \ldots$ is adapted to $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$.

To obtain a feel for these definitions it is useful to think of $n$ as representing time, with information being gathered as time progresses. With this interpretation, the sigma field $\mathcal{F}_n$ represents all events that are determined by what occurs up to time $n$, and thus contains $\mathcal{F}_{n-1}$.
The sequence $X_n, n \ge 1$, is adapted to the filtration $\mathcal{F}_n, n \ge 1$, when the value of $X_n$ is determined by what occurs by time $n$.

Definition 3.7 $Z_n$ is a martingale for filtration $\mathcal{F}_n$ if
1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $\mathcal{F}_n$
3. $E[Z_{n+1}|\mathcal{F}_n] = Z_n$

A bet is said to be fair if its expected gain is equal to 0. A martingale can be thought of as being a generalized version of a fair game. To see this, consider a gambling casino in which bets can be made concerning games played in sequence. Let $\mathcal{F}_n$ consist of all events whose outcome is determined by the results of the first $n$ games. Let $Z_n$ denote the fortune of a specified gambler after the first $n$ games. Then the martingale condition states that no matter what the results of the first $n$ games, the gambler's expected fortune after game $n+1$ is exactly what it was before that game. That is, no matter what has previously occurred, the gambler's expected winnings in any game is equal to 0.

It follows upon taking expectations of both sides of the final martingale condition that

$$E[Z_{n+1}] = E[Z_n],$$

implying that

$$E[Z_n] = E[Z_1].$$

We call $E[Z_1]$ the mean of the martingale.

Another useful martingale result is that

$$E[Z_{n+2}|\mathcal{F}_n] = E[E[Z_{n+2}|\mathcal{F}_{n+1} \cup \mathcal{F}_n]\,|\,\mathcal{F}_n] = E[E[Z_{n+2}|\mathcal{F}_{n+1}]\,|\,\mathcal{F}_n] = E[Z_{n+1}|\mathcal{F}_n] = Z_n.$$

Repeating this argument yields that

$$E[Z_{n+k}|\mathcal{F}_n] = Z_n, \quad k \ge 1.$$

Definition 3.8 We say that $Z_n, n \ge 1$, is a martingale (without specifying a filtration) if
1. $E[|Z_n|] < \infty$
2. $E[Z_{n+1}|Z_1, \ldots, Z_n] = Z_n$

If $\{Z_n\}$ is a martingale for the filtration $\mathcal{F}_n$, then it is a martingale. This follows from

$$E[Z_{n+1}|Z_1, \ldots, Z_n] = E[E[Z_{n+1}|Z_1, \ldots, Z_n, \mathcal{F}_n]\,|\,Z_1, \ldots, Z_n] = E[E[Z_{n+1}|\mathcal{F}_n]\,|\,Z_1, \ldots, Z_n] = E[Z_n|Z_1, \ldots, Z_n] = Z_n,$$

where the second equality followed because $Z_i \in \mathcal{F}_n$, for all $i = 1, \ldots, n$.

We now give some examples of martingales.

Example 3.9 If $X_i, i \ge 1$, are independent zero mean random variables, then $Z_n = \sum_{i=1}^n X_i, n \ge 1$, is a martingale with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$.
This follows because

$$E[Z_{n+1}|X_1, \ldots, X_n] = E[Z_n + X_{n+1}|X_1, \ldots, X_n] = E[Z_n|X_1, \ldots, X_n] + E[X_{n+1}|X_1, \ldots, X_n] = Z_n + E[X_{n+1}] \ \text{(by the independence of the } X_i) = Z_n. \ ∎$$

Example 3.10 Let $X_i, i \ge 1$, be independent and identically distributed with mean zero and variance $\sigma^2$. Let $S_n = \sum_{i=1}^n X_i$, and define

$$Z_n = S_n^2 - n\sigma^2, \quad n \ge 1.$$

Then $Z_n, n \ge 1$, is a martingale for the filtration $\sigma(X_1, \ldots, X_n)$. To verify this claim, note that

$$E[S_{n+1}^2|X_1, \ldots, X_n] = E[(S_n + X_{n+1})^2|X_1, \ldots, X_n]$$

$$= E[S_n^2|X_1, \ldots, X_n] + E[2 S_n X_{n+1}|X_1, \ldots, X_n] + E[X_{n+1}^2|X_1, \ldots, X_n]$$

$$= S_n^2 + 2 S_n E[X_{n+1}|X_1, \ldots, X_n] + E[X_{n+1}^2] = S_n^2 + 2 S_n E[X_{n+1}] + \sigma^2 = S_n^2 + \sigma^2.$$

Subtracting $(n+1)\sigma^2$ from both sides yields

$$E[Z_{n+1}|X_1, \ldots, X_n] = Z_n. \ ∎$$

Our next example introduces an important type of martingale, known as a Doob martingale.

Example 3.11 Let $Y$ be an arbitrary random variable with $E[|Y|] < \infty$, let $\mathcal{F}_n, n \ge 1$, be a filtration, and define

$$Z_n = E[Y|\mathcal{F}_n].$$

We claim that $Z_n, n \ge 1$, is a martingale with respect to the filtration $\mathcal{F}_n, n \ge 1$. To verify this, we must first show that $E[|Z_n|] < \infty$, which follows from Jensen's inequality applied to the convex function $|x|$:

$$E[|Z_n|] = E[|E[Y|\mathcal{F}_n]|] \le E[E[|Y|\,|\,\mathcal{F}_n]] = E[|Y|] < \infty.$$

The martingale condition holds because $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ gives

$$E[Z_{n+1}|\mathcal{F}_n] = E[E[Y|\mathcal{F}_{n+1}]\,|\,\mathcal{F}_n] = E[Y|\mathcal{F}_n] = Z_n.$$

The martingale $Z_n, n \ge 1$, is called a Doob martingale. ∎

Example 3.12 Our final example generalizes the result that the successive partial sums of independent zero mean random variables constitute a martingale. For any random variables $X_i, i \ge 1$, the random variables $X_i - E[X_i|X_1, \ldots, X_{i-1}], i \ge 1$, have mean 0. Even though they need not be independent, their partial sums constitute a martingale. That is, we claim that

$$Z_n = \sum_{i=1}^n (X_i - E[X_i|X_1, \ldots, X_{i-1}]), \quad n \ge 1,$$

is, provided that $E[|Z_n|] < \infty$, a martingale with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. To verify this claim, note that

$$Z_{n+1} = Z_n + X_{n+1} - E[X_{n+1}|X_1, \ldots, X_n].$$

Thus,

$$E[Z_{n+1}|X_1, \ldots, X_n] = Z_n + E[X_{n+1}|X_1, \ldots, X_n] - E[X_{n+1}|X_1, \ldots, X_n] = Z_n. \ ∎$$

3.4 The Martingale Stopping Theorem

The positive integer valued, possibly infinite, random variable $N$ is said to be a random time for the filtration $\mathcal{F}_n$ if $\{N = n\} \in \mathcal{F}_n$ or, equivalently, if $\{N \ge n\} \in \mathcal{F}_{n-1}$, for all $n$. If $P(N < \infty) = 1$, then the random time is said to be a stopping time for the filtration.
Thinking once again in terms of information being amassed over time, with F;, being the cumulative information as to all events that have occurred by time n, the random variable NV will be a stopping88 Chapter 3 Conditional Expectation and Martingales time for this filtration if the decision whether to stop at time n (so that N =n) depends only on what has occurred by time n. (That is, the decision to stop at time 7 is not allowed to depend on the results of future events.) It should be noted, however, that the decision to stop at time n need not be independent of future events, only that it must be conditionally independent of future events given all information up to the present. Lemma 3.13 Let Zn,n > 1, be a martingale for the filtration Fr. If N is a random time for this filtration, then the process Zn = Zmin(win)s 2 2 1, called the stopped process, is also a martingale for the filtration Fy. Proof Start with the identity Zn = Zn—1 + Tpw>n)(Zn — Zn-1) To verify the preceding, consider two cases: (1) N 2m: Here, Zn, = Zp, Zn1 = Zn—1, [qy>n} = 1, and the preceding is true (2) Nny(Zn — Zn—1)|Fn—1] Zn-1 + Upn>n}E|Zn — Zn-1|Fn-1] Ze Theorem 3.14 The Martingale Stopping Theorem Let Zn, n > 1, be a martingale for the filtration Fn, and suppose that N is a stopping time for this filtration. Then ElZy) = E[Z} if any of the following three sufficient conditions hold. (1) Z, are uniformly bounded; (2) N is bounded; (3) E|N] <0, and there exists M < oo such that El\Zns1 -— ZnllFn] -P(N >i) ia MEIN] < oo. IA90 Chapter 3 Conditional Expectation and Martingales A corollary of the martingale stopping theorem is Wald’s equa- tion. Corollary 3.15 Wald’s Equation If X;, Xo identically distributed with finite mean p stopping time for the filtration Fy = 0(X1,. EN] < 00, then . are independent and E[Xi], and if N is a +Xn),n 2 1, such that i BIS. Xd = wE(N i=1 Proof. 
$Z_n = \sum_{i=1}^n (X_i - \mu), n \ge 1$, being the successive partial sums of independent zero mean random variables, is a martingale with mean 0. Hence, assuming the martingale stopping theorem can be applied we have that

$$0 = E[Z_N] = E\Big[\sum_{i=1}^N (X_i - \mu)\Big] = E\Big[\sum_{i=1}^N X_i - N\mu\Big] = E\Big[\sum_{i=1}^N X_i\Big] - E[N]\mu.$$

To complete the proof, we verify sufficient condition (3) of the martingale stopping theorem.

$$E[|Z_{n+1} - Z_n|\,|\,\mathcal{F}_n] = E[|X_{n+1} - \mu|\,|\,\mathcal{F}_n] = E[|X_{n+1} - \mu|] \ \text{(by independence)} \le E[|X_1|] + |\mu|.$$

Thus, condition (3) is verified and the result proved. ∎

Example 3.16 Suppose independent and identically distributed discrete random variables $X_1, X_2, \ldots$ are observed in sequence. With $P(X_i = j) = p_j$, what is the expected number of random variables that must be observed until the subsequence 0, 1, 2, 0, 1 occurs? What is the variance?

Solution Consider a fair gambling casino, that is, one in which the expected casino winning for every bet is 0. Note that if a gambler bets her entire fortune of $a$ that the next outcome is $j$, then her fortune after the bet will either be 0 with probability $1 - p_j$, or $a/p_j$ with probability $p_j$. Now, imagine a sequence of gamblers betting at this casino. Each gambler starts with an initial fortune 1 and stops playing if his or her fortune ever becomes 0. Gambler $i$ bets 1 that $X_i = 0$; if she wins, she bets her entire fortune (of $1/p_0$) that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $(p_0^2 p_1^2 p_2)^{-1}$.

Let $Z_n$ denote the casino's winnings after the data value $X_n$ is observed; because it is a fair casino, $Z_n, n \ge 1$, is a martingale with mean 0 with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. Let $N$ denote the number of random variables that need be observed until the pattern 0, 1, 2, 0, 1 appears, so $(X_{N-4}, \ldots, X_N) = (0, 1, 2, 0, 1)$.
As it is easy to verify that $N$ is a stopping time for the filtration, and that condition (3) of the martingale stopping theorem is satisfied when $M = 4/(p_0^2 p_1^2 p_2)$, it follows that $E[Z_N] = 0$. However, after $X_N$ has been observed, each of the gamblers $1, \ldots, N-5$ would have lost 1; gambler $N-4$ would have won $(p_0^2 p_1^2 p_2)^{-1} - 1$; gamblers $N-3$ and $N-2$ would each have lost 1; gambler $N-1$ would have won $(p_0 p_1)^{-1} - 1$; gambler $N$ would have lost 1. Therefore,

$$Z_N = N - (p_0^2 p_1^2 p_2)^{-1} - (p_0 p_1)^{-1}.$$

Using that $E[Z_N] = 0$ yields the result

$$E[N] = (p_0^2 p_1^2 p_2)^{-1} + (p_0 p_1)^{-1}.$$

In the same manner we can compute the expected time until any specified pattern occurs in independent and identically distributed generated random data. For instance, when making independent flips of a coin that comes up heads with probability $p$, the mean number of flips until the pattern HHTTHH appears is $p^{-4} q^{-2} + p^{-2} + p^{-1}$, where $q = 1 - p$.

To determine $\mathrm{Var}(N)$, suppose now that gambler $i$ starts with an initial fortune $i$ and bets that amount that $X_i = 0$; if she wins, she bets her entire fortune that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of $i/(p_0^2 p_1^2 p_2)$. The casino's winnings at time $N$ is thus

$$Z_N = 1 + 2 + \cdots + N - \frac{N-4}{p_0^2 p_1^2 p_2} - \frac{N-1}{p_0 p_1} = \frac{N(N+1)}{2} - \frac{N-4}{p_0^2 p_1^2 p_2} - \frac{N-1}{p_0 p_1}.$$

Assuming that the martingale stopping theorem holds (although none of the three sufficient conditions hold, the stopping theorem can still be shown to be valid for this martingale), we obtain upon taking expectations that

$$E[N^2] + E[N] = 2\,\frac{E[N] - 4}{p_0^2 p_1^2 p_2} + 2\,\frac{E[N] - 1}{p_0 p_1}.$$

Using the previously obtained value of $E[N]$, the preceding can now be solved for $E[N^2]$, to obtain $\mathrm{Var}(N) = E[N^2] - (E[N])^2$.
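The gambler argument of Example 3.16 gives a simple recipe for $E[N]$: the only gamblers still holding money at time $N$ are those whose bets so far match a prefix of the pattern that is also a suffix of it, and each such gambler holds the reciprocal of the probability of that prefix. The sketch below is one way to encode that recipe for a general pattern. The function name `expected_wait` and its arguments are hypothetical, introduced only for this illustration; exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction

def expected_wait(pattern, probs):
    """Expected number of i.i.d. draws until `pattern` first appears.

    pattern: a string or tuple of symbols.
    probs:   dict mapping each symbol to its probability (as a Fraction).
    Sums 1 / P(prefix) over every prefix length k for which the first k
    symbols of the pattern equal its last k symbols (k = len(pattern)
    always qualifies).
    """
    total = Fraction(0)
    for k in range(1, len(pattern) + 1):
        if pattern[:k] == pattern[-k:]:
            prefix_prob = Fraction(1)
            for symbol in pattern[:k]:
                prefix_prob *= probs[symbol]
            total += 1 / prefix_prob
    return total

half = Fraction(1, 2)
third = Fraction(1, 3)
print(expected_wait("HHTTHH", {"H": half, "T": half}))
print(expected_wait((0, 1, 2, 0, 1), {0: third, 1: third, 2: third}))
```

For HHTTHH the qualifying prefixes are H, HH and the full pattern, which reproduces $p^{-1} + p^{-2} + p^{-4}q^{-2}$; for 0, 1, 2, 0, 1 they are (0, 1) and the full pattern, which reproduces $(p_0 p_1)^{-1} + (p_0^2 p_1^2 p_2)^{-1}$.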
∎

Example 3.17 The cards from a shuffled deck of 26 red and 26 black cards are to be turned over one at a time. At any time a player can say "next", and is a winner if the next card is red and is a loser otherwise. A player who has not yet said "next" when only a single card remains, is a winner if the final card is red, and is a loser otherwise. What is a good strategy for the player?

Solution Every strategy has probability 1/2 of resulting in a win. To see this, let $R_n$ denote the number of red cards remaining in the deck after $n$ cards have been shown. Then

$$E[R_{n+1}|R_1, \ldots, R_n] = R_n - \frac{R_n}{52 - n} = R_n\,\frac{51 - n}{52 - n}.$$

Hence, $\dfrac{R_n}{52 - n}, n \ge 0$, is a martingale. Because $R_0/52 = 1/2$, this martingale has mean 1/2. Now, consider any strategy, and let $N$ denote the number of cards that are turned over before "next" is said. Because $N$ is a bounded stopping time, it follows from the martingale stopping theorem that

$$E\Big[\frac{R_N}{52 - N}\Big] = 1/2.$$

Hence, with $I = I_{\{\text{win}\}}$,

$$E[I] = E[E[I|R_N]] = E\Big[\frac{R_N}{52 - N}\Big] = 1/2. \ ∎$$

Our next example involves the matching problem.

Example 3.18 Each member of a group of $n$ individuals throws his or her hat in a pile. The hats are then mixed together, and each person randomly selects a hat in such a manner that each of the $n!$ possible selections of the $n$ individuals are equally likely. Any individual who selects his or her own hat departs, and that is the end of round one. If any individuals remain, then each throws the hat they have in a pile, and each one then randomly chooses a hat. Those selecting their own hats leave, and that ends round two. Find $E[N]$, where $N$ is the number of rounds until everyone has departed.

Solution Let $X_i, i \ge 1$, denote the number of matches on round $i$ for $i = 1, \ldots, N$, and let it equal 1 for $i > N$.
To solve this example, we will use the zero mean martingale $Z_k, k \ge 1$, defined by

$$Z_k = \sum_{i=1}^k (X_i - E[X_i|X_1, \ldots, X_{i-1}]) = \sum_{i=1}^k X_i - k,$$

where the final equality follows because, for any number of remaining individuals, the expected number of matches in a round is 1 (which is seen by writing this as the sum of indicator variables for the events that each remaining person has a match). Because

$$N = \min\Big\{k : \sum_{i=1}^k X_i = n\Big\}$$

is a stopping time for this martingale, we obtain from the martingale stopping theorem that

$$0 = E[Z_N] = E\Big[\sum_{i=1}^N X_i - N\Big] = n - E[N]$$

and so $E[N] = n$. ∎

Example 3.19 If $X_1, X_2, \ldots$ is a sequence of independent and identically distributed random variables, $P(X_i = 0) \ne 1$, then the process

$$S_n = \sum_{i=1}^n X_i, \quad n \ge 1,$$

is said to be a random walk. For given positive constants $a, b$, let $p$ denote the probability that the random walk becomes as large as $a$ before it becomes as small as $-b$.

We now show how to use martingale theory to approximate $p$. In the case where we have a bound $|X_i| < c$ it can be shown there will be a value $\theta \ne 0$ such that

$$E[e^{\theta X}] = 1.$$

Then because

$$Z_n = e^{\theta S_n} = \prod_{i=1}^n e^{\theta X_i}$$

is the product of independent random variables with mean 1, it follows that $Z_n, n \ge 1$, is a martingale having mean 1. Let

$$N = \min(n : S_n \ge a \ \text{or} \ S_n \le -b).$$

Condition (3) of the martingale stopping theorem can be shown to hold, implying that

$$E[e^{\theta S_N}] = 1.$$

Thus,

$$1 = E[e^{\theta S_N}|S_N \ge a]\,p + E[e^{\theta S_N}|S_N \le -b]\,(1 - p).$$

Now, if $\theta > 0$, then

$$e^{\theta a} \le E[e^{\theta S_N}|S_N \ge a] \le e^{\theta(a+c)}$$

and

$$e^{-\theta(b+c)} \le E[e^{\theta S_N}|S_N \le -b] \le e^{-\theta b},$$

yielding the bounds

$$\frac{1 - e^{-\theta b}}{e^{\theta(a+c)} - e^{-\theta b}} \le p \le \frac{1 - e^{-\theta(b+c)}}{e^{\theta a} - e^{-\theta(b+c)}},$$

and motivating the approximation

$$p \approx \frac{1 - e^{-\theta b}}{e^{\theta a} - e^{-\theta b}}.$$

We leave it as an exercise to obtain bounds on $p$ when $\theta < 0$. ∎

Our next example involves Doob's backward martingale. Before defining this martingale, we need the following definition.

Definition 3.20 The random variables $X_1, \ldots, X_n$ are said to be exchangeable if $X_{i_1}, \ldots, X_{i_n}$ has the same probability distribution for every permutation $i_1, \ldots, i_n$ of $1, \ldots, n$.

Suppose that $X_1, \ldots, X_n$ are exchangeable.
Assume $E[|X_1|] < \infty$, let

$$S_j = \sum_{i=1}^j X_i, \quad j = 1, \ldots, n,$$

and consider the Doob martingale $Z_1, \ldots, Z_n$, given by

$$Z_1 = E[X_1|S_n], \qquad Z_j = E[X_1|S_n, S_{n-1}, \ldots, S_{n+1-j}], \quad j = 1, \ldots, n.$$

However,

$$S_{n+1-j} = E[S_{n+1-j}|S_{n+1-j}, X_{n+2-j}, \ldots, X_n] = \sum_{i=1}^{n+1-j} E[X_i|S_{n+1-j}, X_{n+2-j}, \ldots, X_n]$$

$$= (n+1-j)\,E[X_1|S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \ \text{(by exchangeability)} = (n+1-j)\,Z_j,$$

where the final equality follows since knowing $S_n, S_{n-1}, \ldots, S_{n+1-j}$ is equivalent to knowing $S_{n+1-j}, X_{n+2-j}, \ldots, X_n$.

The martingale

$$Z_j = \frac{S_{n+1-j}}{n+1-j}, \quad j = 1, \ldots, n,$$

is called the Doob backward martingale. We now apply it to solve the ballot problem.

Example 3.21 In an election between candidates A and B, candidate A receives $n$ votes and candidate B receives $m$ votes, where $n > m$. Assuming, in the count of the votes, that all orderings of the $n + m$ votes are equally likely, what is the probability that A is always ahead in the count of the votes?

Solution Let $X_i$ equal 1 if the $i$th vote counted is for A and let it equal $-1$ if that vote is for B. Because all orderings of the $n + m$ votes are assumed to be equally likely, it follows that the random variables $X_1, \ldots, X_{n+m}$ are exchangeable, and $Z_1, \ldots, Z_{n+m}$ is a Doob backward martingale when

$$Z_j = \frac{S_{n+m+1-j}}{n+m+1-j},$$

where $S_k = \sum_{i=1}^k X_i$. Because $Z_1 = S_{n+m}/(n+m) = (n-m)/(n+m)$, the mean of this martingale is $(n-m)/(n+m)$. Because $n > m$, A will always be ahead in the count of the vote unless there is a tie at some point, which will occur if one of the $S_j$ (or, equivalently, one of the $Z_j$) is equal to 0. Consequently, define the bounded stopping time $N$ by

$$N = \min\{j : Z_j = 0 \ \text{or} \ j = n + m\}.$$

Because $Z_{n+m} = X_1$, it follows that $Z_N$ will equal 0 if the candidates are ever tied, and will equal $X_1$ if A is always ahead.
However, if A is always abead, then A must receive the first vote; therefore, _ fl, if A is always ahead one { 0, otherwise By the martingale stopping theorem, E[Zw] = (n — m)/(n +m), yielding the result n-m P(A is always ahead) = 2—™ 3.5 The Hoeffding-Azuma Inequality Let Zn,n > 0, be a martingale with respect to the filtration Fy. If the differences Zn ~ Zn—1 can be shown to lie in a bounded ran- dom interval of the form [-Bn,—Bn + dn] where By € Fy_; and dy3.5 The Hoeffding-Azuma Inequality 97 is constant, then the Hoeffding-Azuma inequality often yields useful bounds on the tail probabilities of Z,. Before presenting the inequal- ity we will need a couple of lemmas. Lemma 3.22 If E[X] = 0, and P(—a < X < f) =1, then for any conver function f B a EU] S gy pia) + axel Proof Because f is convex it follows that, in the region -a 100). Thus, if the Lemma were false, then the function f(a, B) =F + e-F + ae? — e-F) — 2081071298 Chapter 3 Conditional Expectation and Martingales would assume a strictly positive maximum in the interior of the re- gion R= {(a,8) |a| <1, || < 10}. Setting the partial derivatives of f equal to 0, we obtain u FF pale +e) = 2aper+h"/2 (3.3) P—eF = rperh+0"/2 (3.4) We will now prove the lemma by showing that any solution of (3.3) and (3.4) must have 8 = 0. However, because f(a, 0) = 0, this would contradict the hypothesis that f assumes a strictly positive maximum in R,, thus establishing the lemma. So, assume that there is a solution of (3.3) and (3.4) in which B # 0. Now note that there is no solution of these equations for which a = 0 and 8 # 0. For if there were such a solution, then (3.4) would say that ef — 68 = 2pe/? (3.5) But expanding in a power series about 0 shows that (3.5) is equivalent to pret pitt 2 ae =, (28 +2)! =e ‘itat which (because (2i + oY > iat when i > 0) is clearly impossible when 8 # 0. Thus, any solution of (3.3) and (3.4) in which 8 # 0 will also have a # 0. 
Assuming such a solution gives, upon dividing (3.3) by (3.4), that
$$1 + \alpha\,\frac{e^{\beta} + e^{-\beta}}{e^{\beta} - e^{-\beta}} = 1 + \frac{\alpha}{\beta}$$
Because $\alpha \ne 0$, the preceding is equivalent to
$$\beta(e^{\beta} + e^{-\beta}) = e^{\beta} - e^{-\beta}$$
or, expanding in a Taylor series,
$$\sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i)!} = \sum_{i=0}^{\infty} \frac{\beta^{2i+1}}{(2i+1)!}$$
which is clearly not possible when $\beta \ne 0$. Thus, there is no solution of (3.3) and (3.4) in which $\beta \ne 0$, thus proving the result.

We are now ready for the Hoeffding-Azuma inequality.

Theorem 3.24 The Hoeffding-Azuma Inequality Let $Z_n, n \ge 1$, be a martingale with mean $\mu$ with respect to the filtration $\mathcal{F}_n$. Let $Z_0 = \mu$ and suppose there exist nonnegative random variables $B_n, n > 0$, where $B_n \in \mathcal{F}_{n-1}$, and positive constants $d_n, n > 0$, such that
$$-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$$
Then, for $n > 0$, $a > 0$,
$$\text{(i)} \quad P(Z_n - \mu \ge a) \le e^{-2a^2/\sum_{i=1}^{n} d_i^2}$$
$$\text{(ii)} \quad P(Z_n - \mu \le -a) \le e^{-2a^2/\sum_{i=1}^{n} d_i^2}$$

Proof Suppose that $\mu = 0$. For any $c > 0$
$$P(Z_n \ge a) = P(e^{cZ_n} \ge e^{ca}) \le e^{-ca}E[e^{cZ_n}] \qquad (3.6)$$
where the inequality follows from Markov's inequality. Let $W_n = e^{cZ_n}$. Note that $W_0 = 1$, and that for $n > 0$
$$E[W_n|\mathcal{F}_{n-1}] = E[e^{cZ_{n-1}}e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}]$$
$$= e^{cZ_{n-1}}E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}]$$
$$= W_{n-1}E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] \qquad (3.7)$$
where the second equality used that $Z_{n-1} \in \mathcal{F}_{n-1}$. Because

(i) $f(x) = e^{cx}$ is convex;

(ii) $E[Z_n - Z_{n-1}|\mathcal{F}_{n-1}] = E[Z_n|\mathcal{F}_{n-1}] - E[Z_{n-1}|\mathcal{F}_{n-1}] = Z_{n-1} - Z_{n-1} = 0$; and

(iii) $-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$

it follows from Lemma 3.22, with $\alpha = B_n$, $\beta = -B_n + d_n$, that
$$E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] \le E\left[\frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n}\,\Big|\,\mathcal{F}_{n-1}\right]$$
$$= \frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n}$$
where the final equality used that $B_n \in \mathcal{F}_{n-1}$. Hence, from Equation (3.7), we see that
$$E[W_n|\mathcal{F}_{n-1}] \le W_{n-1}\,\frac{(-B_n + d_n)e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \le W_{n-1}e^{c^2d_n^2/8}$$
where the final inequality used Lemma 3.23 (with $p = B_n/d_n$, $t = cd_n$). Taking expectations gives
$$E[W_n] \le E[W_{n-1}]e^{c^2d_n^2/8}$$
Using that $E[W_0] = 1$ yields, upon iterating this inequality, that
$$E[W_n] \le \prod_{i=1}^{n} e^{c^2d_i^2/8} = e^{c^2\sum_{i=1}^{n} d_i^2/8}$$
Therefore, from Equation (3.6), we obtain that for any $c > 0$
$$P(Z_n \ge a) \le \exp\left(-ca + c^2\sum_{i=1}^{n} d_i^2/8\right)$$
Letting $c = 4a/\sum_{i=1}^{n} d_i^2$ (which is the value of $c$ that minimizes $-ca + c^2\sum_{i=1}^{n} d_i^2/8$) gives that
$$P(Z_n \ge a) \le e^{-2a^2/\sum_{i=1}^{n} d_i^2}$$
Parts (i) and (ii) of the Hoeffding-Azuma inequality now follow from applying the preceding, first to the zero mean martingale $\{Z_n - \mu\}$, and secondly to the zero mean martingale $\{\mu - Z_n\}$.

Example 3.25 Let $X_i, i \ge 1$, be independent Bernoulli random variables with means $p_i$, $i = 1,\ldots,n$. Then
$$Z_n = \sum_{i=1}^{n}(X_i - p_i) = S_n - \sum_{i=1}^{n} p_i, \quad n \ge 0$$
is a martingale with mean 0. Because $Z_n - Z_{n-1} = X_n - p_n$, we see that
$$-p_n \le Z_n - Z_{n-1} \le 1 - p_n$$
Thus, by the Hoeffding-Azuma inequality ($B_n = p_n$, $d_n = 1$), we see that for $a > 0$
$$P\left(S_n - \sum_{i=1}^{n} p_i \ge a\right) \le e^{-2a^2/n}$$
$$P\left(S_n - \sum_{i=1}^{n} p_i \le -a\right) \le e^{-2a^2/n}$$
The preceding inequalities are often called Chernoff bounds.

The Hoeffding-Azuma inequality is often applied to a Doob type martingale. The following corollary is often used.

Corollary 3.26 Let $h$ be such that if the vectors $x = (x_1,\ldots,x_n)$ and $y = (y_1,\ldots,y_n)$ differ in at most one coordinate (that is, for some $k$, $x_i = y_i$ for all $i \ne k$) then
$$|h(x) - h(y)| \le 1$$
Then, for a vector of independent random variables $X = (X_1,\ldots,X_n)$, and $a > 0$
$$P(h(X) - E[h(X)] \ge a) \le e^{-2a^2/n}$$
$$P(h(X) - E[h(X)] \le -a) \le e^{-2a^2/n}$$

[...]

Example 3.28 Suppose that $n$ balls are to be placed in $m$ urns, with each ball independently going into urn $i$ with probability $p_i$, $\sum_{i=1}^{m} p_i = 1$. Find bounds on the tail probability of $Y_k$, equal to the number of urns that receive exactly $k$ balls.

Solution First note that
$$E[Y_k] = E\left[\sum_{i=1}^{m} I\{\text{urn } i \text{ has exactly } k \text{ balls}\}\right] = \sum_{i=1}^{m} \binom{n}{k} p_i^k(1-p_i)^{n-k}$$
Let $X_j$ denote the urn in which ball $j$ is put, $j = 1,\ldots,n$. Also, let $h_k(x_1,\ldots,x_n)$ denote the number of urns that receive exactly $k$ balls when $X_i = x_i$, $i = 1,\ldots,n$, and note that $Y_k = h_k(X_1,\ldots,X_n)$.

When $k = 0$ it is easy to see that $h_0$ satisfies the condition that if $x$ and $y$ differ in at most one coordinate, then $|h_0(x) - h_0(y)| \le 1$.
Therefore, from Corollary 3.26 we obtain, for $a > 0$, that
$$P\left(Y_0 - \sum_{i=1}^{m}(1-p_i)^n \ge a\right) \le e^{-2a^2/n}$$
$$P\left(Y_0 - \sum_{i=1}^{m}(1-p_i)^n \le -a\right) \le e^{-2a^2/n}$$
For $0 < k < n$ it is no longer true that if $x$ and $y$ differ in at most one coordinate, then $|h_k(x) - h_k(y)| \le 1$. This is so because the one different value could result in one of the vectors having one less and the other having one more urn with $k$ balls than would have resulted if that coordinate was not included. Thus, if $x$ and $y$ differ in at most one coordinate, then
$$|h_k(x) - h_k(y)| \le 2$$
showing that $h_k^*(x) = h_k(x)/2$ satisfies the condition of Corollary 3.26. Because
$$P(Y_k - E[Y_k] \ge a) = P(h_k^*(X) - E[h_k^*(X)] \ge a/2)$$
we obtain, for $a > 0$, $0 < k < n$, that
$$P\left(Y_k - \sum_{i=1}^{m}\binom{n}{k}p_i^k(1-p_i)^{n-k} \ge a\right) \le e^{-a^2/2n}$$
$$P\left(Y_k - \sum_{i=1}^{m}\binom{n}{k}p_i^k(1-p_i)^{n-k} \le -a\right) \le e^{-a^2/2n}$$

3.6 Sub and Supermartingales, and Convergence Theorem

[...] $Z_n, n \ge 1$, is said to be a submartingale for the filtration $\mathcal{F}_n$ if

1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $\mathcal{F}_n$
3. $E[Z_{n+1}|\mathcal{F}_n] \ge Z_n$

If 3 is replaced by $E[Z_{n+1}|\mathcal{F}_n] \le Z_n$, then $Z_n, n \ge 1$, is said to be a supermartingale.

It follows from the tower property that
$$E[Z_{n+1}] \ge E[Z_n]$$
for a submartingale, with the inequality reversed for a supermartingale. (Of course, if $Z_n, n \ge 1$, is a submartingale, then $-Z_n, n \ge 1$, is a supermartingale, and vice versa.)

The analogs of the martingale stopping theorem remain valid for submartingales and supermartingales. We leave the proof of the following theorem as an exercise.

Theorem 3.30 If $N$ is a stopping time for the filtration $\mathcal{F}_n$, then
$$E[Z_N] \ge E[Z_1] \quad \text{for a submartingale}$$
$$E[Z_N] \le E[Z_1] \quad \text{for a supermartingale}$$
provided that any of the sufficient conditions of Theorem 3.14 hold.

One of the most useful results about submartingales is the Kolmogorov inequality. Before presenting it, we need a couple of lemmas.

Lemma 3.31 If $Z_n, n \ge 1$, is a submartingale for the filtration $\mathcal{F}_n$, and $N$ is a stopping time for this filtration such that $P(N \le n) = 1$, then
$$E[Z_1] \le E[Z_N] \le E[Z_n]$$

Proof Because $N$ is bounded, it follows from the submartingale stopping theorem that $E[Z_N] \ge E[Z_1]$.
Now,
$$E[Z_n|\mathcal{F}_k, N = k] = E[Z_n|\mathcal{F}_k] \ge Z_k = Z_N$$
Taking expectations of this inequality completes the proof.

Lemma 3.32 If $Z_n, n \ge 1$, is a martingale with respect to the filtration $\mathcal{F}_n, n \ge 1$, and $f$ is a convex function for which $E[|f(Z_n)|] < \infty$, then $f(Z_n), n \ge 1$, is a submartingale with respect to the filtration $\mathcal{F}_n, n \ge 1$.

Proof.
$$E[f(Z_{n+1})|\mathcal{F}_n] \ge f(E[Z_{n+1}|\mathcal{F}_n]) \quad \text{by Jensen's inequality}$$
$$= f(Z_n)$$

Theorem 3.33 Kolmogorov's Inequality for Submartingales Suppose $Z_n, n \ge 1$, is a nonnegative submartingale. Then for $a > 0$
$$P(\max\{Z_1,\ldots,Z_n\} \ge a) \le E[Z_n]/a$$

Proof Let $N$ be the smallest $i$, $i \le n$, such that $Z_i \ge a$, and let it equal $n$ if $Z_i < a$ for all $i = 1,\ldots,n$. [...]

Corollary 3.34 If $Z_n, n \ge 1$, is a martingale, then for $a > 0$
$$P(\max\{|Z_1|,\ldots,|Z_n|\} \ge a) \le \min(E[|Z_n|]/a,\ E[Z_n^2]/a^2)$$

Proof Noting that
$$P(\max\{|Z_1|,\ldots,|Z_n|\} \ge a) = P(\max\{Z_1^2,\ldots,Z_n^2\} \ge a^2)$$
the corollary follows from Lemma 3.32 and Kolmogorov's inequality for submartingales upon using that $f(x) = |x|$ and $f(x) = x^2$ are convex functions.

Theorem 3.35 The Martingale Convergence Theorem Let $Z_n, n \ge 1$, be a martingale. If there is $M < \infty$ such that
$$E[|Z_n|] \le M \quad \text{for all } n$$
then, with probability 1, $\lim_{n\to\infty} Z_n$ exists and is finite.

Proof [...] $Z_n^2, n \ge 1$, is a submartingale, yielding that $E[Z_n^2]$ is nondecreasing. Because $E[Z_n^2]$ is bounded, it follows that it converges; let $\mu < \infty$ be given by
$$\mu = \lim_{n\to\infty} E[Z_n^2]$$
We now argue that $\lim_{n\to\infty} Z_n$ exists and is finite by showing that, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence. That is, we will show that, with probability 1,
$$|Z_{m+k} - Z_m| \to 0 \quad \text{as } m, k \to \infty$$
Using that $Z_{m+k} - Z_m, k \ge 1$, is a martingale, it follows that $(Z_{m+k} - Z_m)^2, k \ge 1$, is a submartingale. Thus, by Kolmogorov's inequality,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) = P\left(\max_{k=1,\ldots,n}(Z_{m+k} - Z_m)^2 > \epsilon^2\right)$$
$$\le E[(Z_{m+n} - Z_m)^2]/\epsilon^2$$
$$= E[Z_{n+m}^2 - 2Z_mZ_{n+m} + Z_m^2]/\epsilon^2$$
However,
$$E[Z_mZ_{n+m}] = E[E[Z_mZ_{n+m}|Z_m]] = E[Z_mE[Z_{n+m}|Z_m]] = E[Z_m^2]$$
Therefore,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) \le (E[Z_{n+m}^2] - E[Z_m^2])/\epsilon^2$$
Letting $n \to \infty$ now yields
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \le (\mu - E[Z_m^2])/\epsilon^2$$
Thus,
$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \to 0 \quad \text{as } m \to \infty$$
Therefore, with probability 1, $Z_n, n \ge 1$, is a Cauchy sequence, and so has a finite limit.

As a consequence of the martingale convergence theorem we obtain the following.

Corollary 3.36 If $Z_n, n \ge 1$, is a nonnegative martingale then, with probability 1, $\lim_{n\to\infty} Z_n$ exists and is finite.

Proof Because $Z_n$ is nonnegative,
$$E[|Z_n|] = E[Z_n] = E[Z_1] < \infty$$

Example 3.37 A branching process follows the size of a population over succeeding generations. It supposes that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j$, $j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n+1$. Let $X_n$ denote the number of individuals in generation $n$. Assuming that $m = \sum_j jp_j$, the mean number of offspring of an individual, is finite, it is easy to verify that $Z_n = X_n/m^n, n \ge 0$, is a martingale. Because it is nonnegative, the preceding corollary implies that $\lim_n X_n/m^n$ exists and is finite. But this implies, when $m < 1$, that $\lim_n X_n = 0$; or, equivalently, that $X_n = 0$ for all $n$ sufficiently large. When $m > 1$, the implication is that the generation size either becomes 0 or converges to infinity at an exponential rate.

3.7 Exercises

1. For $\mathcal{F} = \{\emptyset, \Omega\}$, show that $E[X|\mathcal{F}] = E[X]$.

2. Give the proof of Proposition 3.2 when $X$ and $Y$ are jointly continuous.

3. If $E[|X_i|] < \infty$, $i = 1,\ldots,n$, show that
$$E\left[\sum_{i=1}^{n} X_i\,\Big|\,\mathcal{F}\right] = \sum_{i=1}^{n} E[X_i|\mathcal{F}]$$

4. Prove that if $f$ is a convex function, then
$$E[f(X)|\mathcal{F}] \ge f(E[X|\mathcal{F}])$$
provided the expectations exist.

5. Let $X_1, X_2, \ldots$ be independent random variables with mean 1. Show that $Z_n = \prod_{i=1}^{n} X_i$, $n \ge 1$, is a martingale.

6. If $E[X_{n+1}|X_1,\ldots,X_n] = a_nX_n + b_n$ for constants $a_n, b_n$, $n \ge 0$, find constants $A_n, B_n$ so that $Z_n = A_nX_n + B_n$, $n \ge 0$, is a martingale with respect to the filtration $\sigma(X_0,\ldots,X_n)$.
7. Consider a population of individuals as it evolves over time, and suppose that, independent of what occurred in prior generations, each individual in generation $n$ independently has $j$ offspring with probability $p_j$, $j \ge 0$. The offspring of individuals of generation $n$ then make up generation $n+1$. Assume that $m = \sum_j jp_j < \infty$. Let $X_n$ denote the number of individuals in generation $n$, and define a martingale related to $X_n, n \ge 0$. The process $X_n, n \ge 0$, is called a branching process.

8. Suppose $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean zero and finite variance $\sigma^2$. If $T$ is a stopping time with finite mean, show that
$$\mathrm{Var}\left(\sum_{i=1}^{T} X_i\right) = \sigma^2E[T].$$

9. Suppose $X_1, X_2, \ldots$ are independent and identically distributed mean 0 random variables which each take value $+1$ with probability 1/2 and take value $-1$ with probability 1/2. Let $S_n = \sum_{i=1}^{n} X_i$. Which of the following are stopping times? Compute the mean for the ones that are stopping times.
(a) $T_1 = \min\{i \ge 5 : S_i = S_{i-5} + 5\}$
(b) $T_2 = T_1 - 5$
(c) $T_3 = T_2 + 10$.

10. Consider a sequence of independent flips of a coin, and let $P_h$ denote the probability of a head on any toss. Let $A$ be the hypothesis that $P_h = a$ and let $B$ be the hypothesis that $P_h = b$, for given values $0 < a, b < 1$. Let $X_i$ be the outcome of flip $i$, and set
$$Z_n = \frac{P(X_1,\ldots,X_n|A)}{P(X_1,\ldots,X_n|B)}$$
If $P_h = b$, show that $Z_n, n \ge 1$, is a martingale having mean 1.

11. Let $Z_n, n \ge 0$, be a martingale with $Z_0 = 0$. Show that
$$E[Z_n^2] = \sum_{i=1}^{n} E[(Z_i - Z_{i-1})^2]$$

12. Consider an individual who at each stage, independently of past movements, moves to the right with probability $p$ or to the left with probability $1-p$. Assuming that $p > 1/2$, find the expected number of stages it takes the person to move $i$ positions to the right from where she started.

13. In Example 3.19 obtain bounds on $p$ when $\theta < 0$.

14. Use Wald's equation to approximate the expected time it takes a random walk to either become as large as $a$ or as small as $-b$, for positive $a$ and $b$. Give the exact expression if $a$ and $b$ are integers, and at each stage the random walk either moves up 1 with probability $p$ or moves down 1 with probability $1-p$.

15. Consider a branching process that starts with a single individual. Let $\pi$ denote the probability this process eventually dies out. With $X_n$ denoting the number of individuals in generation $n$, argue that $\pi^{X_n}$, $n \ge 0$, is a martingale.

16. Given $X_1, X_2, \ldots$, let $S_n = \sum_{i=1}^{n} X_i$ and $\mathcal{F}_n = \sigma(X_1,\ldots,X_n)$. Suppose for all $n$ that $E|S_n| < \infty$ and $E[S_{n+1}|\mathcal{F}_n] = S_n$. Show $E[X_iX_j] = 0$ if $i \ne j$.

17. Suppose $n$ random points are chosen in a circle having diameter equal to 1, and let $X$ be the length of the shortest path connecting all of them. For $a > 0$, bound $P(X - E[X] \ge a)$.

18. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed discrete random variables, with $P(X_i = j) = p_j$. Obtain bounds on the tail probability of the number of times the pattern 0, 0, 0, 0 appears in the sequence.

19. Repeat Example 3.27, but now assuming that the $X_i$ are independent but not identically distributed. Let $P_{i,j} = P(X_i = j)$.

20. Let $Z_n, n \ge 0$, be a martingale with mean $Z_0 = 0$, and let $v_j, j \ge 0$, be a sequence of nondecreasing constants with $v_0 = 0$. Prove the Kolmogorov-Hajek-Renyi inequality:
$$P(|Z_j| \le v_j \text{ for all } j = 1,\ldots,n) \ge 1 - \sum_{j=1}^{n} E[(Z_j - Z_{j-1})^2]/v_j^2$$

21. Consider a gambler who plays at a fair casino. Suppose that the casino does not give any credit, so that the gambler must quit when his fortune is 0. Suppose further that on each bet made at least 1 is either won or lost. Argue that, with probability 1, a gambler who wants to play forever will eventually go broke.

22. What is the implication of the martingale convergence theorem to the scenario of Exercise 10?

23. Three gamblers each start with $a$, $b$, and $c$ chips respectively.
In each round of a game a gambler is selected uniformly at random to give up a chip, and one of the other gamblers is selected uniformly at random to receive that chip. The game ends when there are only two players remaining with chips. Let $X_n, Y_n$, and $Z_n$ respectively denote the number of chips the three players have after round $n$, so $(X_0, Y_0, Z_0) = (a, b, c)$.
(a) Compute $E[X_{n+1}Y_{n+1}Z_{n+1} \mid (X_n, Y_n, Z_n) = (x, y, z)]$.
(b) Show that $M_n = X_nY_nZ_n + n(a+b+c)/3$ is a martingale.
(c) Use the preceding to compute the expected length of the game.

Chapter 4 Bounding Probabilities and Expectations

4.1 Introduction

In this chapter we develop some approaches for bounding expectations and probabilities. We start in Section 2 with Jensen's inequality, which bounds the expected value of a convex function of a random variable. In Section 3 we develop the importance sampling identity and show how it can be used to yield bounds on tail probabilities. A specialization of this method results in the Chernoff bound, which is developed in Section 4. Section 5 deals with the Second Moment and the Conditional Expectation Inequalities, which bound the probability that a random variable is positive. Section 6 develops the Min-Max Identity and uses it to obtain bounds on the maximum of a set of random variables. Finally, in Section 7 we introduce some general stochastic order relations and explore their consequences.

4.2 Jensen's Inequality

Jensen's inequality yields a lower bound on the expected value of a convex function of a random variable.

Proposition 4.1 If $f$ is a convex function, then
$$E[f(X)] \ge f(E[X])$$
provided the expectations exist.

Proof. We give a proof under the assumption that $f$ has a Taylor series expansion.
Expanding $f$ about the value $\mu = E[X]$, and using the Taylor series expansion with a remainder term, yields that for some $a$
$$f(x) = f(\mu) + f'(\mu)(x - \mu) + f''(a)(x - \mu)^2/2 \ge f(\mu) + f'(\mu)(x - \mu)$$
where the preceding used that $f''(a) \ge 0$ by convexity. Hence,
$$f(X) \ge f(\mu) + f'(\mu)(X - \mu)$$
Taking expectations yields the result.

Remark 4.2 If
$$P(X = x_1) = \lambda = 1 - P(X = x_2)$$
then Jensen's inequality implies that for a convex function $f$
$$\lambda f(x_1) + (1 - \lambda)f(x_2) \ge f(\lambda x_1 + (1 - \lambda)x_2),$$
which is the definition of a convex function. Thus, Jensen's inequality can be thought of as extending the defining equation of convexity from random variables that take on only two possible values to arbitrary random variables.

4.3 Probability Bounds via the Importance Sampling Identity

Let $f$ and $g$ be probability density (or probability mass) functions; let $h$ be an arbitrary function, and suppose that $g(x) = 0$ implies that $f(x)h(x) = 0$. The following is known as the importance sampling identity.

Proposition 4.3 The Importance Sampling Identity
$$E_f[h(X)] = E_g\left[\frac{h(X)f(X)}{g(X)}\right]$$
where the subscript on the expectation indicates the density (or mass function) of the random variable $X$.

Proof We give the proof when $f$ and $g$ are density functions:
$$E_f[h(X)] = \int h(x)f(x)\,dx = \int \frac{h(x)f(x)}{g(x)}\,g(x)\,dx = E_g\left[\frac{h(X)f(X)}{g(X)}\right]$$

The importance sampling identity yields the following useful corollary concerning the tail probability of a random variable.

Corollary 4.4
$$P_f(X > c) = E_g\left[\frac{f(X)}{g(X)}\,\Big|\,X > c\right]P_g(X > c)$$

Proof.
$$P_f(X > c) = E_f[I_{\{X > c\}}]$$
$$= E_g\left[\frac{I_{\{X > c\}}f(X)}{g(X)}\right]$$
$$= E_g\left[\frac{I_{\{X > c\}}f(X)}{g(X)}\,\Big|\,X > c\right]P_g(X > c)$$
$$= E_g\left[\frac{f(X)}{g(X)}\,\Big|\,X > c\right]P_g(X > c)$$

Example 4.5 Bounding Standard Normal Tail Probabilities Let $f$ be the standard normal density function
$$f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}, \quad -\infty < x < \infty$$
For $c > 0$, consider $P_f(X > c)$, the probability that a standard normal random variable exceeds $c$.
With
$$g(x) = ce^{-cx}, \quad x > 0$$
we obtain from Corollary 4.4
$$P_f(X > c) = \frac{e^{-c^2}}{c\sqrt{2\pi}}E_g[e^{-X^2/2}e^{cX}\,|\,X > c]$$
$$= \frac{e^{-c^2}}{c\sqrt{2\pi}}E_g[e^{-(X+c)^2/2}e^{c(X+c)}]$$
where the first equality used that $P_g(X > c) = e^{-c^2}$, and the second the lack of memory property of exponential random variables to conclude that the conditional distribution of an exponential random variable $X$ given that it exceeds $c$ is the unconditional distribution of $X + c$. Thus the preceding yields
$$P_f(X > c) = \frac{e^{-c^2/2}}{c\sqrt{2\pi}}E_g[e^{-X^2/2}] \qquad (4.1)$$
Noting that, for $x > 0$,
$$1 - x < e^{-x} < 1 - x + x^2/2$$
we see that
$$1 - X^2/2 < e^{-X^2/2} < 1 - X^2/2 + X^4/8$$
Using that $E[X^2] = 2/c^2$ and $E[X^4] = 24/c^4$ when $X$ is exponential with rate $c$, the preceding inequality yields
$$1 - 1/c^2 < E_g[e^{-X^2/2}] < 1 - 1/c^2 + 3/c^4$$
Consequently, using (4.1), we obtain
$$(1 - 1/c^2)\frac{e^{-c^2/2}}{c\sqrt{2\pi}} < P_f(X > c) < (1 - 1/c^2 + 3/c^4)\frac{e^{-c^2/2}}{c\sqrt{2\pi}} \qquad (4.2)$$

Our next example uses the importance sampling identity to bound the probability that successive sums of a sequence of independent and identically distributed normal random variables with a negative mean ever cross some specified positive number.

Example 4.6 Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed normal random variables with mean $\mu < 0$ and variance 1. Let $S_k = \sum_{i=1}^{k} X_i$ and, for a fixed $A > 0$, consider
$$p = P(S_k > A \text{ for some } k)$$
Let $f_k(\mathbf{x}_k) = f_k(x_1,\ldots,x_k)$ be the joint density function of $\mathbf{X}_k = (X_1,\ldots,X_k)$. That is,
$$f_k(\mathbf{x}_k) = (2\pi)^{-k/2}e^{-\sum_{i=1}^{k}(x_i - \mu)^2/2}$$
Also, let $g_k$ be the joint density of $k$ independent and identically distributed normal random variables with mean $-\mu$ and variance 1.
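That is,
$$g_k(\mathbf{x}_k) = (2\pi)^{-k/2}e^{-\sum_{i=1}^{k}(x_i + \mu)^2/2}$$
Note that
$$\frac{f_k(\mathbf{x}_k)}{g_k(\mathbf{x}_k)} = e^{2\mu\sum_{i=1}^{k} x_i}$$
With
$$R_k = \left\{(x_1,\ldots,x_k) : \sum_{i=1}^{j} x_i \le A,\ j < k,\ \sum_{i=1}^{k} x_i > A\right\}$$
we have
$$p = \sum_{k=1}^{\infty} P(\mathbf{X}_k \in R_k)$$
$$= \sum_{k=1}^{\infty} E_{f_k}[I_{\{\mathbf{X}_k \in R_k\}}]$$
$$= \sum_{k=1}^{\infty} E_{g_k}\left[I_{\{\mathbf{X}_k \in R_k\}}\frac{f_k(\mathbf{X}_k)}{g_k(\mathbf{X}_k)}\right]$$
$$= \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}e^{2\mu S_k}]$$
Now, if $\mathbf{X}_k \in R_k$ then $S_k > A$, implying, because $\mu < 0$, that $e^{2\mu S_k} < e^{2\mu A}$. Because this implies that
$$I_{\{\mathbf{X}_k \in R_k\}}e^{2\mu S_k} \le I_{\{\mathbf{X}_k \in R_k\}}e^{2\mu A}$$
we obtain from the preceding that
$$p \le \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}e^{2\mu A}] = e^{2\mu A}\sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}]$$
Now, if $Y_i, i \ge 1$, is a sequence of independent normal random variables with mean $-\mu$ and variance 1, then
$$E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}] = P\left(\sum_{i=1}^{j} Y_i \le A,\ j < k,\ \sum_{i=1}^{k} Y_i > A\right)$$
Therefore, from the preceding
$$p \le e^{2\mu A}\sum_{k=1}^{\infty} P\left(\sum_{i=1}^{j} Y_i \le A,\ j < k,\ \sum_{i=1}^{k} Y_i > A\right)$$
$$= e^{2\mu A}P\left(\sum_{i=1}^{k} Y_i > A \text{ for some } k\right)$$
$$= e^{2\mu A}$$
where the final equality follows from the strong law of large numbers because $\lim_{n\to\infty}\sum_{i=1}^{n} Y_i/n = -\mu > 0$ implies $P(\lim_{n\to\infty}\sum_{i=1}^{n} Y_i = \infty) = 1$, and thus $P(\sum_{i=1}^{k} Y_i > A \text{ for some } k) = 1$.

The bound
$$p = P(S_k > A \text{ for some } k) \le e^{2\mu A}$$
is not very useful when $A$ is a small nonnegative number. In this case, we should condition on $X_1$, and then apply the preceding inequality. With $\Phi$ being the standard normal distribution function, this yields
$$p = \int_{-\infty}^{\infty} P(S_k > A \text{ for some } k\,|\,X_1 = x)\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx$$
$$= \int_{-\infty}^{A} P(S_k > A \text{ for some } k\,|\,X_1 = x)\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx + P(X_1 > A)$$
$$\le \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{A} e^{2\mu(A-x)}e^{-(x-\mu)^2/2}\,dx + 1 - \Phi(A - \mu)$$
$$= e^{2\mu A}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{A} e^{-(x+\mu)^2/2}\,dx + 1 - \Phi(A - \mu)$$
$$= e^{2\mu A}\Phi(A + \mu) + 1 - \Phi(A - \mu)$$
Thus, for instance
$$P(S_k > 0 \text{ for some } k) \le \Phi(\mu) + 1 - \Phi(-\mu) = 2\Phi(\mu)$$

4.4 Chernoff Bounds

Suppose that $X$ has probability density (or probability mass) function $f(x)$. For $t > 0$, let
$$g(x) = \frac{e^{tx}f(x)}{M(t)}$$
where $M(t) = E_f[e^{tX}]$ is the moment generating function of $X$. Corollary 4.4 yields, for $c > 0$, that
$$P_f(X \ge c) = E_g[M(t)e^{-tX}\,|\,X \ge c]\,P_g(X \ge c)$$
$$\le E_g[M(t)e^{-tX}\,|\,X \ge c]$$
$$\le M(t)e^{-tc}$$
Because the preceding holds for all $t > 0$, we can conclude that
$$P_f(X \ge c) \le \inf_{t > 0} M(t)e^{-tc} \qquad (4.3)$$
The inequality (4.3) is called the Chernoff bound.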
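As a rough numerical illustration of the Chernoff bound (4.3), the sketch below takes $X$ to be standard normal, for which $M(t) = e^{t^2/2}$, and minimizes $M(t)e^{-tc}$ over a grid of $t$ values. The function names and the grid are illustrative assumptions, not part of the text; the exact tail is computed from the complementary error function for comparison.

```python
import math


def chernoff_bound_normal(c, t_grid=None):
    """Approximate inf over t > 0 of M(t) * exp(-t*c) for a standard normal X.

    Uses M(t) = exp(t^2 / 2); the search grid is an illustrative assumption.
    """
    if t_grid is None:
        t_grid = [k / 1000 for k in range(1, 10001)]  # t in (0, 10]
    return min(math.exp(t * t / 2 - t * c) for t in t_grid)


def normal_tail(c):
    """Exact P(X >= c) for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2))


if __name__ == "__main__":
    for c in (0.5, 1.0, 2.0, 3.0):
        # The exact tail should never exceed the Chernoff bound.
        print(c, normal_tail(c), chernoff_bound_normal(c))
```

For the standard normal the infimum is attained at $t = c$, so the bound reduces to $e^{-c^2/2}$; the grid search is only there to show how the bound could be evaluated when no closed form is at hand.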
Rather than choosing the value of $t$ so as to obtain the best bound, it is often convenient to work with bounds that are more analytically tractable. The following inequality can be used to simplify the Chernoff bound for a sum of independent Bernoulli random variables.

Lemma 4.7 For $0 \le p \le 1$
$$pe^{t(1-p)} + (1-p)e^{-tp} \le e^{t^2/8}$$

[...] for $c > 0$
$$P(X - E[X] \ge c) \le e^{-2c^2/n} \qquad (4.4)$$
$$P(X - E[X] \le -c) \le e^{-2c^2/n} \qquad (4.5)$$

Proof. For $c > 0$, $t > 0$
$$P(X - E[X] \ge c) = P(e^{t(X - E[X])} \ge e^{tc})$$
$$\le e^{-tc}E[e^{t(X - E[X])}] \quad \text{by the Markov inequality}$$
$$= e^{-tc}E\left[\exp\left\{\sum_{i=1}^{n} t(X_i - E[X_i])\right\}\right]$$
$$= e^{-tc}E\left[\prod_{i=1}^{n} e^{t(X_i - E[X_i])}\right]$$
$$= e^{-tc}\prod_{i=1}^{n} E[e^{t(X_i - E[X_i])}]$$
However, if $Y$ is Bernoulli with parameter $p$, then
$$E[e^{t(Y - E[Y])}] = pe^{t(1-p)} + (1-p)e^{-tp} \le e^{t^2/8}$$
where the inequality follows from Lemma 4.7. Therefore,
$$P(X - E[X] \ge c) \le e^{-tc}e^{nt^2/8}$$
Letting $t = 4c/n$ yields the inequality (4.4).

The proof of the inequality (4.5) is obtained by writing it as
$$P(E[X] - X \ge c) \le e^{-2c^2/n}$$
[...]

[...] for $\epsilon > 0$, as $n \to \infty$ and $m \to \infty$,
$$P((1-\epsilon)m\alpha^w \le N_\alpha \le (1+\epsilon)m\alpha^w) \to 1$$

Proof. To prove the result, it is convenient to first formulate an equivalent continuous time model that results in the times at which the $n+m$ cells are killed being independent random variables. To do so, let $X_1,\ldots,X_{n+m}$ be independent exponential random variables, with $X_i$ having rate $w_i$, the weight of cell $i$, $i = 1,\ldots,n+m$. Note that $X_i$ will be the smallest of these exponentials with probability $w_i/\sum_j w_j$; further, given that $X_i$ is the smallest, $X_r$, $r \ne i$, will be the second smallest with probability $w_r/\sum_{j \ne i} w_j$; further, given that $X_i$ and $X_r$ are, respectively, the first and second smallest, $X_s$, $s \ne i, r$, will be the next smallest with probability $w_s/\sum_{j \ne i,r} w_j$; and so on. Consequently, if we imagine that cell $i$ is killed at time $X_i$, then the order in which the $n+m$ cells are killed has the same distribution as the order in which they are killed in the original model. So let us suppose that cell $i$ is killed at time $X_i$, $i \ge 1$. Now let $\tau_\alpha$ denote the time at which the number of surviving target cells first falls below $n\alpha$.
Also, let $N(t)$ denote the number of normal cells that are still alive at time $t$; so $N_\alpha = N(\tau_\alpha)$. We will first show that
$$P(N(\tau_\alpha) \le (1+\epsilon)m\alpha^w) \to 1 \quad \text{as } n, m \to \infty \qquad (4.6)$$
To prove the preceding, let $\epsilon^*$ be such that $0 < \epsilon^* < \epsilon$, and set $t = -\ln(\alpha(1+\epsilon^*)^{1/w})$. We will prove (4.6) by showing that as $n$ and $m$ approach $\infty$,

(i) $P(\tau_\alpha > t) \to 1$, and

(ii) $P(N(t) \le (1+\epsilon)m\alpha^w) \to 1$.

Because the events $\tau_\alpha > t$ and $N(t) \le (1+\epsilon)m\alpha^w$ together imply that $N(\tau_\alpha) \le N(t) \le (1+\epsilon)m\alpha^w$, the result (4.6) will be established.

To prove (i), note that the number, call it $Y$, of surviving target cells by time $t$ is a binomial random variable with parameters $n$ and $e^{-t} = \alpha(1+\epsilon^*)^{1/w}$. Hence, with $a = n\alpha[(1+\epsilon^*)^{1/w} - 1]$, we have
$$P(\tau_\alpha \le t) = P(Y \le n\alpha) = P(Y \le ne^{-t} - a) \le e^{-2a^2/n}$$
where the inequality follows from the Chernoff bound (4.5). This proves (i), because $a^2/n \to \infty$ as $n \to \infty$.

To prove (ii), note that $N(t)$ is a binomial random variable with parameters $m$ and $e^{-wt} = \alpha^w(1+\epsilon^*)$. Thus, by letting $b = m\alpha^w(\epsilon - \epsilon^*)$ and applying the Chernoff bound (4.4), we obtain
$$P(N(t) > (1+\epsilon)m\alpha^w) = P(N(t) > me^{-wt} + b) \le e^{-2b^2/m}$$
This proves (ii), because $b^2/m \to \infty$ as $m \to \infty$. Thus (4.6) is established.

It remains to prove that
$$P(N(\tau_\alpha) \ge (1-\epsilon)m\alpha^w) \to 1 \quad \text{as } n, m \to \infty \qquad (4.7)$$
However, (4.7) can be proven in a similar manner as was (4.6); a combination of these two results completes the proof of the theorem.

4.5 Second Moment and Conditional Expectation Inequalities

The second moment inequality gives a lower bound on the probability that a nonnegative random variable is positive.

Proposition 4.11 The Second Moment Inequality For a nonnegative random variable $X$
$$P(X > 0) \ge \frac{(E[X])^2}{E[X^2]}$$

Proof. Using Jensen's inequality in the second line below, we have
$$E[X^2] = E[X^2|X > 0]P(X > 0)$$
$$\ge (E[X|X > 0])^2P(X > 0)$$
$$= \frac{(E[X])^2}{P(X > 0)}$$

When $X$ is the sum of Bernoulli random variables, we can improve the bound of the second moment inequality.
So, suppose for the remainder of this section that
$$X = \sum_{i=1}^{n} X_i,$$
where $X_i$ is Bernoulli with $E[X_i] = p_i$, $i = 1,\ldots,n$.

We will need the following lemma.

Lemma 4.12 For any random variable $R$
$$E[XR] = \sum_{i=1}^{n} p_iE[R|X_i = 1]$$

Proof
$$E[XR] = E\left[\sum_{i=1}^{n} X_iR\right]$$
$$= \sum_{i=1}^{n} E[X_iR]$$
$$= \sum_{i=1}^{n}\{E[X_iR|X_i = 1]p_i + E[X_iR|X_i = 0](1 - p_i)\}$$
$$= \sum_{i=1}^{n} E[R|X_i = 1]p_i$$

Proposition 4.13 The Conditional Expectation Inequality
$$P(X > 0) \ge \sum_{i=1}^{n}\frac{p_i}{E[X|X_i = 1]}$$

Proof. Let $R = \frac{I_{\{X > 0\}}}{X}$. Noting that
$$E[XR] = P(X > 0), \qquad E[R|X_i = 1] = E\left[\frac{1}{X}\,\Big|\,X_i = 1\right]$$
we obtain, from Lemma 4.12,
$$P(X > 0) = \sum_{i=1}^{n} p_iE\left[\frac{1}{X}\,\Big|\,X_i = 1\right] \ge \sum_{i=1}^{n} p_i\frac{1}{E[X|X_i = 1]}$$
where the final inequality made use of Jensen's inequality.

Example 4.14 Consider a system consisting of $m$ components, each of which either works or not. Suppose, further, that for given subsets of components $S_j$, $j = 1,\ldots,n$, none of which is a subset of another, the system functions if all of the components of at least one of these subsets work. If component $j$ independently works with probability $\alpha_j$, derive a lower bound on the probability the system functions.

Solution Let $X_i$ equal 1 if all the components in $S_i$ work, and let it equal 0 otherwise, $i = 1,\ldots,n$. Also, let
$$p_i = P(X_i = 1) = \prod_{j \in S_i}\alpha_j$$
Then, with $X = \sum_{i=1}^{n} X_i$, we have
$$P(\text{system functions}) = P(X > 0)$$
$$\ge \sum_{i=1}^{n}\frac{p_i}{E[X|X_i = 1]}$$
$$= \sum_{i=1}^{n}\frac{p_i}{\sum_{j=1}^{n} P(X_j = 1|X_i = 1)}$$
$$= \sum_{i=1}^{n}\frac{p_i}{1 + \sum_{j \ne i}\prod_{k \in S_j - S_i}\alpha_k}$$
where $S_j - S_i$ consists of all components that are in $S_j$ but not in $S_i$.

Example 4.15 Consider a random graph on the set of vertices $\{1, 2, \ldots, n\}$, which is such that each of the $\binom{n}{2}$ pairs of vertices $i \ne j$ is, independently, an edge of the graph with probability $p$. We are interested in the probability that this graph will be connected, where by connected we mean that for each pair of distinct vertices $i \ne j$ there is a sequence of edges of the form $(i, i_1), (i_1, i_2), \ldots, (i_k, j)$. (That is, a graph is connected if for each pair of distinct vertices $i$ and $j$, there is a path from $i$ to $j$.)
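The random graph model just described can be explored by simulation. The sketch below is only an illustration under stated assumptions: the helper names `random_graph`, `count_isolated`, and `is_connected` are hypothetical, vertices are labeled 0 to n-1 rather than 1 to n, and connectivity is checked with a breadth-first search.

```python
import math
import random
from collections import deque


def random_graph(n, p, rng):
    """Adjacency lists of a graph on vertices 0..n-1 in which each pair is,
    independently, an edge with probability p."""
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def count_isolated(adj):
    """Number of vertices with no incident edges."""
    return sum(1 for neighbours in adj if not neighbours)


def is_connected(adj):
    """True if every vertex can be reached from vertex 0 (breadth-first search)."""
    n = len(adj)
    if n == 0:
        return True
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    reached = 1
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                reached += 1
                queue.append(v)
    return reached == n


if __name__ == "__main__":
    rng = random.Random(12345)  # fixed seed, chosen arbitrarily
    n = 200
    for c in (0.5, 1.5):
        p = c * math.log(n) / n
        trials = [random_graph(n, p, rng) for _ in range(20)]
        frac_connected = sum(is_connected(g) for g in trials) / len(trials)
        mean_isolated = sum(count_isolated(g) for g in trials) / len(trials)
        print(c, frac_connected, mean_isolated)
```

A graph with at least one isolated vertex (and more than one vertex) cannot be connected, which is why counting isolated vertices is a convenient way to bound the probability of connectivity from above.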
Suppose that
$$p = c\,\frac{\ln(n)}{n}$$
We will now show that if $c < 1$, then the probability that the graph is connected goes to 0 as $n \to \infty$. To verify this result, consider the number of isolated vertices, where vertex $i$ is said to be isolated if there are no edges of type $(i, j)$. Let $X_i$ be the indicator variable for the event that vertex $i$ is isolated, and let
$$X = \sum_{i=1}^{n} X_i$$
be the number of isolated vertices. Now,
$$P(X_i = 1) = (1-p)^{n-1}$$
Also,
$$E[X|X_i = 1] = \sum_{j=1}^{n} P(X_j = 1|X_i = 1)$$
$$= 1 + \sum_{j \ne i}(1-p)^{n-2}$$
$$= 1 + (n-1)(1-p)^{n-2}$$
Because
$$(1-p)^{n-1} = \left(1 - c\,\frac{\ln(n)}{n}\right)^{n-1} \approx e^{-c\ln(n)} = n^{-c}$$
the conditional expectation inequality yields that for $n$ large
$$P(X > 0) \ge \frac{n(1-p)^{n-1}}{1 + (n-1)(1-p)^{n-2}} \approx \frac{n^{1-c}}{1 + (n-1)^{1-c}}$$
Therefore,
$$c < 1 \Rightarrow P(X > 0) \to 1 \text{ as } n \to \infty$$
Because the graph is not connected if $X > 0$, it follows that the graph will almost certainly be disconnected when $n$ is large and $c < 1$. (It can be shown when $c > 1$ that the probability the graph is connected goes to 1 as $n \to \infty$.)

4.6 The Min-Max Identity and Bounds on the Maximum

In this section we will be interested in obtaining an upper bound on $E[\max_i X_i]$, when $X_1,\ldots,X_n$ are nonnegative random variables. To begin, note that for any nonnegative constant $c$,
$$\max_i X_i \le c + \sum_{i=1}^{n}(X_i - c)^+ \qquad (4.8)$$
where $x^+$, the positive part of $x$, is equal to $x$ if $x > 0$ and is equal to 0 otherwise. Taking expectations of the preceding inequality yields that
$$E[\max_i X_i] \le c + \sum_{i=1}^{n} E[(X_i - c)^+]$$
Because $(X_i - c)^+$ is a nonnegative random variable, we have
$$E[(X_i - c)^+] = \int_0^{\infty} P((X_i - c)^+ > x)\,dx$$
$$= \int_0^{\infty} P(X_i - c > x)\,dx$$
$$= \int_c^{\infty} P(X_i > y)\,dy$$
Therefore,
$$E[\max_i X_i] \le c + \sum_{i=1}^{n}\int_c^{\infty} P(X_i > y)\,dy \qquad (4.9)$$
Because the preceding is true for all $c \ge 0$, the best bound is obtained by choosing the $c$ that minimizes the right side of the preceding.
Differentiating, and setting the result equal to 0, shows that the best bound is obtained when $c$ is the value $c^*$ such that
$$\sum_{i=1}^{n} P(X_i > c^*) = 1$$
Because $\sum_{i=1}^{n} P(X_i > c)$ is a decreasing function of $c$, the value of $c^*$ can be easily approximated and then utilized in the inequality (4.9). It is interesting to note that $c^*$ is such that the expected number of the $X_i$ that exceed $c^*$ is equal to 1, which is interesting because the inequality (4.8) becomes an equality when exactly one of the $X_i$ exceeds $c$.

Example 4.16 Suppose the $X_i$ are independent exponential random variables with rates $\lambda_i$, $i = 1,\ldots,n$. Then the minimizing value $c^*$ is such that
$$1 = \sum_{i=1}^{n} e^{-\lambda_ic^*}$$
with resulting bound
$$E[\max_i X_i] \le c^* + \sum_{i=1}^{n}\int_{c^*}^{\infty} e^{-\lambda_iy}\,dy = c^* + \sum_{i=1}^{n}\frac{1}{\lambda_i}e^{-\lambda_ic^*}$$
In the special case where the rates are all equal, say $\lambda_i = 1$, then
$$1 = ne^{-c^*} \quad \text{or} \quad c^* = \ln(n)$$
and the bound becomes
$$E[\max_i X_i] \le \ln(n) + 1 \qquad (4.10)$$
However, it is easy to compute the expected maximum of a sequence of independent exponentials with rate 1. Interpreting these random variables as the failure times of $n$ components, we can write
$$\max_i X_i = \sum_{i=1}^{n} T_i$$
where $T_i$ is the time between the $(i-1)$st and the $i$th failure. Using the lack of memory property of exponentials, it follows that the $T_i$ are independent, with $T_i$ being an exponential random variable with rate $n - i + 1$. (This is because when the $(i-1)$st failure occurs, the time until the next failure is the minimum of the $n - i + 1$ remaining lifetimes, each of which is exponential with rate 1.) Therefore, in this case
$$E[\max_i X_i] = \sum_{i=1}^{n}\frac{1}{n - i + 1} = \sum_{i=1}^{n}\frac{1}{i}$$
As it is known that, for $n$ large
$$\sum_{i=1}^{n}\frac{1}{i} \approx \ln(n) + E$$
where $E \approx .5772$ is Euler's constant, we see that the bound yielded by the approach can be quite accurate. (Also, the bound (4.10) only requires that the $X_i$ are exponential with rate 1, and not that they are independent.)

The preceding bounds on $E[\max_i X_i]$ only involve the marginal distributions of the $X_i$. When we have additional knowledge about the joint distributions, we can often do better.
To illustrate this, we first need to establish an identity relating the maximum of a set of random variables to the minimums of all the partial sets.

For nonnegative random variables $X_1,\ldots,X_n$, fix $x$ and let $A_i$ denote the event that $X_i > x$. Let
$$X = \max(X_1,\ldots,X_n)$$
Noting that $X$ will be greater than $x$ if and only if at least one of the events $A_i$ occurs, we have
$$P(X > x) = P(\cup_{i=1}^{n} A_i)$$
and the inclusion-exclusion identity gives
$$P(X > x) = \sum_i P(A_i) - \sum_{i<j} P(A_iA_j) + \sum_{i<j<k} P(A_iA_jA_k) - \cdots + (-1)^{n+1}P(A_1 \cdots A_n)$$
which can be succinctly written
$$P(X > x) = \sum_{r=1}^{n}(-1)^{r+1}\sum_{i_1 < \cdots < i_r} P(A_{i_1} \cdots A_{i_r})$$
Now,
$$P(A_i) = P(X_i > x)$$
$$P(A_iA_j) = P(X_i > x, X_j > x) = P(\min(X_i, X_j) > x)$$
$$P(A_iA_jA_k) = P(X_i > x, X_j > x, X_k > x) = P(\min(X_i, X_j, X_k) > x)$$
and so on. Thus, we see that
$$P(X > x) = \sum_{r=1}^{n}(-1)^{r+1}\sum_{i_1 < \cdots < i_r} P(\min(X_{i_1},\ldots,X_{i_r}) > x)$$
Integrating both sides as $x$ goes from 0 to $\infty$ gives the result:
$$E[X] = \sum_{r=1}^{n}(-1)^{r+1}\sum_{i_1 < \cdots < i_r} E[\min(X_{i_1},\ldots,X_{i_r})]$$

[...]
$$E[X] \ge \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)]$$
$$E[X] \le \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)] + \sum_{i<j<k} E[\min(X_i, X_j, X_k)]$$

[...] $+ \cdots + (-1)^{n+1}\dfrac{1}{p_1 + \cdots + p_n}$

Using the preceding formula for the mean number of coupons needed to obtain a complete set requires summing over $2^n$ terms, and so is not practical when $n$ is large. Moreover, the bounds obtained by only going out a few steps in the formula for the expected value of a maximum generally turn out to be too loose to be beneficial. However, a useful bound can often be obtained by applying the max-min inequalities to an upper bound for $E[X]$ rather than directly to $E[X]$. We now develop the theory.

For nonnegative random variables $X_1,\ldots,X_n$, let
$$X = \max(X_1,\ldots,X_n)$$
Fix $c \ge 0$, and note the inequality
$$X \le c + \max((X_1 - c)^+,\ldots,(X_n - c)^+).$$
Now apply the max-min upper bound inequalities to the right side of the preceding, take expectations, and obtain that
$$E[X] \le c + \sum_{i=1}^{n} E[(X_i - c)^+]$$
$$E[X] \le c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)]$$
and so on.

Example 4.18 Consider the case of independent exponential random variables $X_1,\ldots,X_n$, all having rate 1.
Then, the preceding gives the bound
$$E[\max_i X_i] \le c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)]$$
To obtain the terms in the three sums, condition, respectively, on whether $X_i > c$, whether $\min(X_i, X_j) > c$, and whether $\min(X_i, X_j, X_k) > c$. This yields
$$E[(X_i - c)^+] = e^{-c}$$
$$E[\min((X_i - c)^+, (X_j - c)^+)] = \frac{e^{-2c}}{2}$$
$$E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] = \frac{e^{-3c}}{3}$$
Using the constant $c = \ln(n)$ yields the bound
$$E[\max_i X_i] \le \ln(n) + 1 - \frac{n(n-1)}{4n^2} + \frac{n(n-1)(n-2)}{18n^3} \approx \ln(n) + .806 \quad \text{for } n \text{ large}$$

Example 4.19 Let us reconsider Example 4.17, the coupon collector's problem. Let $c$ be an integer. To compute $E[(X_i - c)^+]$, condition on whether a type $i$ coupon is among the first $c$ collected.
$$E[(X_i - c)^+] = E[(X_i - c)^+|X_i \le c]P(X_i \le c) + E[(X_i - c)^+|X_i > c]P(X_i > c)$$
$$= E[(X_i - c)^+|X_i > c](1 - p_i)^c$$
$$= \frac{(1 - p_i)^c}{p_i}$$
Similarly,
$$E[\min((X_i - c)^+, (X_j - c)^+)] = \frac{(1 - p_i - p_j)^c}{p_i + p_j}$$
$$E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] = \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}$$
Therefore, for any nonnegative integer $c$
$$E[X] \le c + \sum_i\frac{(1 - p_i)^c}{p_i}$$
$$E[X] \le c + \sum_i\frac{(1 - p_i)^c}{p_i} - \sum_{i<j}\frac{(1 - p_i - p_j)^c}{p_i + p_j} + \sum_{i<j<k}\frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}$$

4.7 Stochastic Orderings

We say that $X$ is stochastically greater than $Y$, written $X \ge_{st} Y$, if
$$P(X > t) \ge P(Y > t) \quad \text{for all } t$$
In this section, we define and compare some other stochastic orderings of random variables.

If $X$ is a nonnegative continuous random variable with distribution function $F$ and density $f$ then the hazard rate function of $X$ is defined by
$$\lambda_X(t) = f(t)/\bar{F}(t)$$
where $\bar{F}(t) = 1 - F(t)$. Interpreting $X$ as the lifetime of an item, then for $\epsilon$ small
$$P(t \text{ year old item dies within an additional time } \epsilon) = P(X < t + \epsilon\,|\,X > t) \approx \lambda_X(t)\epsilon$$
If $Y$ is a nonnegative continuous random variable with distribution function $G$ and density $g$, say that $X$ is hazard rate order larger than $Y$, written $X \ge_{hr} Y$, if
$$\lambda_X(t) \le \lambda_Y(t) \quad \text{for all } t$$
Say that $X$ is likelihood ratio order larger than $Y$, written $X \ge_{lr} Y$, if
$$f(x)/g(x) \uparrow x$$
where $f$ and $g$ are the respective densities of $X$ and $Y$.

Proposition 4.20
$$X \ge_{lr} Y \Rightarrow X \ge_{hr} Y \Rightarrow X \ge_{st} Y$$

Proof. Let $X$ have density $f$, and $Y$ have density $g$.
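Suppose $X \ge_{lr} Y$. Then, for $y > x$,
$$f(y) \ge g(y)\frac{f(x)}{g(x)}$$
implying that
$$\int_x^\infty f(y)\,dy \ge \frac{f(x)}{g(x)} \int_x^\infty g(y)\,dy$$
or
$$\lambda_X(x) \le \lambda_Y(x)$$
To prove the final implication, note first that
$$\int_0^s \lambda_X(t)\,dt = \int_0^s \frac{f(t)}{\bar{F}(t)}\,dt = -\log \bar{F}(s)$$
implying that
$$\bar{F}(s) = e^{-\int_0^s \lambda_X(t)\,dt}$$
which immediately shows that $X \ge_{hr} Y \Rightarrow X \ge_{st} Y$. ∎

Define $r_X(t)$, the reversed hazard rate function of $X$, by
$$r_X(t) = f(t)/F(t)$$
and note that
$$\lim_{\epsilon \downarrow 0} \frac{P(t - \epsilon < X | X < t)}{\epsilon} = r_X(t)$$
Say that $X$ is reverse hazard rate order larger than $Y$, written $X \ge_{rh} Y$, if
$$r_X(t) \ge r_Y(t) \quad \text{for all } t$$
Our next theorem gives some representations of the orderings of $X$ and $Y$ in terms of stochastically larger relations between certain conditional distributions of $X$ and $Y$. Before presenting it, we introduce the following notation.

Notation For a random variable $X$ and event $A$, let $[X|A]$ denote a random variable whose distribution is that of the conditional distribution of $X$ given $A$.

Theorem 4.21
(i) $X \ge_{hr} Y \Leftrightarrow [X|X > t] \ge_{st} [Y|Y > t]$ for all $t$
(ii) $X \ge_{rh} Y \Leftrightarrow [X|X < t] \ge_{st} [Y|Y < t]$ for all $t$
(iii) $X \ge_{lr} Y \Leftrightarrow [X|s < X < t] \ge_{st} [Y|s < Y < t]$ for all $s, t$

Proof. (i) Let $X_t =_d [X|X > t]$ and $Y_t =_d [Y|Y > t]$. Noting that
$$\lambda_{X_t}(s) = \begin{cases} 0, & \text{if } s < t \\ \lambda_X(s), & \text{if } s \ge t \end{cases}$$
shows that
$$X \ge_{hr} Y \Rightarrow X_t \ge_{hr} Y_t \Rightarrow X_t \ge_{st} Y_t \Rightarrow [X|X > t] \ge_{st} [Y|Y > t]$$
To go the other way, use the identity
$$\bar{F}_{X_t}(t + \epsilon) = e^{-\int_t^{t+\epsilon} \lambda_X(s)\,ds}$$
to obtain that
$$[X|X > t] \ge_{st} [Y|Y > t] \Leftrightarrow X_t \ge_{st} Y_t \Rightarrow \lambda_X(t) \le \lambda_Y(t)$$
(ii) With $X_t =_d [t - X|X < t]$,
$$\lambda_{X_t}(y) = r_X(t - y), \quad 0 < y < t$$
Therefore,
$$X \ge_{rh} Y \Leftrightarrow \lambda_{X_t}(y) \ge \lambda_{Y_t}(y) \Rightarrow X_t \le_{st} Y_t \Leftrightarrow [t - X|X < t] \le_{st} [t - Y|Y < t] \Leftrightarrow [X|X < t] \ge_{st} [Y|Y < t]$$
On the other hand,
$$[X|X < t] \ge_{st} [Y|Y < t] \Leftrightarrow X_t \le_{st} Y_t \Rightarrow \int_0^\epsilon \lambda_{X_t}(y)\,dy \ge \int_0^\epsilon \lambda_{Y_t}(y)\,dy \Leftrightarrow \int_0^\epsilon r_X(t - y)\,dy \ge \int_0^\epsilon r_Y(t - y)\,dy \Rightarrow r_X(t) \ge r_Y(t)$$
(iii) Let $X$ and $Y$ have respective densities $f$ and $g$. Suppose $[X|s < X < t] \ge_{st} [Y|s < Y < t]$ for all $s < t$. Letting $s < v < t$, this implies that
$$P(X > v | s < X < t) \ge P(Y > v | s < Y < t)$$
or, equivalently, that
$$\frac{P(v < X < t)}{P(s < X \le v)} \ge \frac{P(v < Y < t)}{P(s < Y \le v)}$$
Letting $v \downarrow s$ in the preceding gives $P(s < X < t)/P(s < Y < t) \ge f(s)/g(s)$, whereas letting $v \uparrow t$ gives $f(t)/g(t) \ge P(s < X < t)/P(s < Y < t)$. Hence, for $s < t$,
$$\frac{f(t)}{g(t)} \ge \frac{f(s)}{g(s)}$$
showing that $X \ge_{lr} Y$.

Now suppose that $X \ge_{lr} Y$. Then clearly $[X|s < X < t] \ge_{lr} [Y|s < Y < t]$, implying, from Proposition 4.20, that $[X|s < X < t] \ge_{st} [Y|s < Y < t]$. ∎

Corollary 4.22
$$X \ge_{lr} Y \Rightarrow X \ge_{rh} Y \Rightarrow X \ge_{st} Y$$
Proof. The first implication immediately follows from parts (ii) and (iii) of Theorem 4.21. The second implication follows upon taking the limit as $t \to \infty$ in part (ii) of that theorem. ∎

Say that $X$ is an increasing hazard rate (or IHR) random variable if $\lambda_X(t)$ is nondecreasing in $t$.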
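(Other terminology is to say that $X$ has an increasing failure rate.)

Proposition 4.23 Let $X_t =_d [X - t|X > t]$. Then,
(a) $X_t \downarrow_{st}$ as $t \uparrow$ $\Leftrightarrow$ $X$ is IHR
(b) $X_t \downarrow_{lr}$ as $t \uparrow$ $\Leftrightarrow$ $\log f(x)$ is concave

Proof. (a) Let $\lambda(y)$ be the hazard rate function of $X$. Then $\lambda_t(y)$, the hazard rate function of $X_t$, is given by
$$\lambda_t(y) = \lambda(t + y), \quad y > 0$$
Hence, if $X$ is IHR then $X_t \downarrow_{hr} t$, implying that $X_t \downarrow_{st} t$. Now, let $s < t$, and suppose that $X_s \ge_{st} X_t$. Then,
$$e^{-\int_t^{t+\epsilon} \lambda(y)\,dy} = P(X_t > \epsilon) \le P(X_s > \epsilon) = e^{-\int_s^{s+\epsilon} \lambda(y)\,dy}$$
showing that $\lambda(t) \ge \lambda(s)$. Thus part (a) is proved.

(b) Using that the density of $X_t$ is $f_{X_t}(x) = f(x + t)/\bar{F}(t)$ yields, for $s < t$, that
$$X_s \ge_{lr} X_t \Leftrightarrow \frac{f(x + s)}{f(x + t)} \uparrow x \Leftrightarrow \log f(x + s) - \log f(x + t) \uparrow x \Leftrightarrow \frac{f'(x + s)}{f(x + s)} - \frac{f'(x + t)}{f(x + t)} \ge 0 \Leftrightarrow \frac{f'(y)}{f(y)} \downarrow y \Leftrightarrow \frac{d}{dy}\log f(y) \downarrow y \Leftrightarrow \log f(y) \text{ is concave}$$
∎

4.8 Exercises

1. For a nonnegative random variable $X$, show that $(E[X^n])^{1/n}$ is nondecreasing in $n$.

2. Let $X$ be a standard normal random variable. Use Corollary 4.4, along with the density
$$g(x) = xe^{-x^2/2}, \quad x > 0$$
to show, for $c > 0$, that
(a) $P(X > c) = \frac{1}{\sqrt{2\pi}} e^{-c^2/2} E_g[\frac{1}{X} | X > c]$
(b) Show, for any positive random variable $W$, that $E[\frac{1}{W} | W > c] \le E[\frac{1}{W}]$
(c) Show that $P(X > c) \le \frac{1}{2} e^{-c^2/2}$
(d) Show that $E_g[X | X > c] = c + e^{c^2/2} \sqrt{2\pi} P(X > c)$
(e) Use Jensen's inequality, along with the preceding, to show that
$$cP(X > c) + e^{c^2/2} \sqrt{2\pi} (P(X > c))^2 \ge \frac{1}{\sqrt{2\pi}} e^{-c^2/2}$$
(f) Argue from (e) that $P(X > c)$ must be at least as large as the positive root of the equation
$$cx + e^{c^2/2} \sqrt{2\pi} x^2 = \frac{1}{\sqrt{2\pi}} e^{-c^2/2}$$
(g) Conclude that
$$P(X > c) \ge \frac{1}{2\sqrt{2\pi}} (\sqrt{c^2 + 4} - c) e^{-c^2/2}$$

3. Let $X$ be a Poisson random variable with mean $\lambda$. Show that, for $n \ge \lambda$, the Chernoff bound yields that
$$P(X \ge n) \le \frac{e^{-\lambda} (\lambda e)^n}{n^n}$$

4. Let $m(t) = E[X^t]$. The moment bound states that for $c > 0$
$$P(X \ge c) \le m(t) c^{-t}$$
for all $t > 0$. Show that this result can be obtained from the importance sampling identity.

5. Fill in the details of the proof that, for independent Bernoulli random variables $X_1, \ldots, X_n$, and $c > 0$,
$$P(S - E[S] \le -c) \le e^{-2c^2/n}$$
where $S = \sum_{i=1}^n X_i$.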
6. If $X$ is a binomial random variable with parameters $n$ and $p$, show
(a) $P(|X - np| > c) \le 2e^{-2c^2/n}$
(b) $P(X - np \ge \alpha np) \le \exp\{-2np^2\alpha^2\}$

7. Give the details of the proof of (4.7).

8. Prove that
$$E[f(X)] \ge E[f(E[X|Y])] \ge f(E[X])$$
Suppose you want a lower bound on $E[f(X)]$ for a convex function $f$. The preceding shows that first conditioning on $Y$ and then applying Jensen's inequality to the individual terms $E[f(X)|Y = y]$ results in a larger lower bound than does an immediate application of Jensen's inequality.

9. Let $X_i$ be binary random variables with parameters $p_i$, $i = 1, \ldots, n$. Let $X = \sum_{i=1}^n X_i$, and also let $I$, independent of the variables $X_1, \ldots, X_n$, be equally likely to be any of the values $1, \ldots, n$. For $R$ independent of $I$, show that
(a) $P(I = i | X_I = 1) = p_i/E[X]$
(b) $E[XR] = E[X]E[R | X_I = 1]$
(c) $P(X > 0) = E[X] E[\frac{1}{X} | X_I = 1]$

10. For $X_i$ and $X$ as in Exercise 9, show that
$$\sum_i \frac{p_i}{E[X | X_i = 1]} \ge \frac{(E[X])^2}{E[X^2]}$$
Thus, for sums of binary variables, the conditional expectation inequality yields a stronger lower bound than does the second moment inequality. Hint: Make use of the results of Exercises 8 and 9.

11. Let $X_i$ be exponential with mean $8 + 2i$, for $i = 1, 2, 3$. Obtain an upper bound on $E[\max X_i]$, and compare it with the exact result when the $X_i$ are independent.

12. Let $U_i$, $i = 1, \ldots, n$ be uniform (0, 1) random variables. Obtain an upper bound on $E[\max U_i]$, and compare it with the exact result when the $U_i$ are independent.

13. Let $U_1$ and $U_2$ be uniform (0, 1) random variables. Obtain an upper bound on $E[\max(U_1, U_2)]$, and show this maximum is obtained when $U_1 = 1 - U_2$.

14. Show that $X \ge_{hr} Y$ if and only if
$$\frac{P(X > t)}{P(X > s)} \ge \frac{P(Y > t)}{P(Y > s)}$$
for all $s < t$.

15. Let $h(x, y)$ be a real valued function satisfying $h(x, y) \ge h(y, x)$ whenever $x \ge y$.
(a) Show that if $X$ and $Y$ are independent and $X \ge_{lr} Y$ then $h(X, Y) \ge_{st} h(Y, X)$.
(b) Show by a counterexample that the preceding is not valid under the weaker condition $X \ge_{st} Y$.

16. There are $n$ jobs, with job $i$ requiring a random time $X_i$ to process.
The jobs must be processed sequentially. Give a sufficient condition, the weaker the better, under which the policy of processing jobs in the order $1, 2, \ldots, n$ maximizes the probability that at least $k$ jobs are processed by time $t$ for all $k$ and $t$.

Chapter 5 Markov Chains

5.1 Introduction

This chapter introduces a natural generalization of a sequence of independent random variables, called a Markov chain, where a variable may depend on the immediately preceding variable in the sequence. Named after the 19th century Russian mathematician Andrei Andreyevich Markov, these chains are widely used as simple models of more complex real-world phenomena.

Given a sequence of discrete random variables $X_0, X_1, X_2, \ldots$ taking values in some finite or countably infinite set $S$, we say that $X_n$ is a Markov chain with respect to a filtration $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$ and, for all $B \subseteq S$, we have the Markov property
$$P(X_{n+1} \in B | \mathcal{F}_n) = P(X_{n+1} \in B | X_n)$$
If we interpret $X_n$ as the state of the chain at time $n$, then the preceding means that if you know the current state, nothing else from the past is relevant to the future of the Markov chain. That is, given the present state, the future states and the past states are independent. When we let $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ this definition reduces to
$$P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j | X_n = i)$$
If $P(X_{n+1} = j | X_n = i)$ is the same for all $n$, we say that the Markov chain has stationary transition probabilities, and we set
$$P_{ij} = P(X_{n+1} = j | X_n = i)$$
In this case the quantities $P_{ij}$ are called the transition probabilities, and specifying them along with a probability distribution for the starting state $X_0$ is enough to determine all probabilities concerning $X_0, \ldots, X_n$. We will assume from here on that all Markov chains considered have stationary transition probabilities.
In addition, unless otherwise noted, we will assume that $S$, the set of all possible states of the Markov chain, is the set of nonnegative integers.

Example 5.1 Reflected random walk. Suppose $Y_i$ are independent and identically distributed Bernoulli($p$) random variables, and let $X_0 = 0$ and $X_n = (X_{n-1} + 2Y_n - 1)^+$ for $n = 1, 2, \ldots$. The process $X_n$, called a reflected random walk, can be viewed as the position of a particle at time $n$ such that at each time the particle has a $p$ probability of moving one step to the right, a $1 - p$ probability of moving one step to the left, and is returned to position 0 if it ever attempts to move to the left of 0. It is immediate from its definition that $X_n$ is a Markov chain.

Example 5.2 A non-Markov chain. Again let $Y_i$ be independent and identically distributed Bernoulli($p$) random variables, let $X_0 = 0$, and this time let $X_n = Y_n + Y_{n-1}$ for $n = 1, 2, \ldots$. It's easy to see that $X_n$ is not a Markov chain because $P(X_{n+1} = 2 | X_n = 1, X_{n-1} = 2) = 0$ while on the other hand $P(X_{n+1} = 2 | X_n = 1, X_{n-1} = 0) = p$.

5.2 The Transition Matrix

The transition probabilities
$$P_{ij} = P(X_1 = j | X_0 = i)$$
are also called the one-step transition probabilities. We define the $n$-step transition probabilities by
$$P_{ij}^{(n)} = P(X_n = j | X_0 = i)$$
In addition, we define the transition probability matrix
$$P = \begin{pmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$
and the $n$-step transition probability matrix
$$P^{(n)} = \begin{pmatrix} P_{00}^{(n)} & P_{01}^{(n)} & P_{02}^{(n)} & \cdots \\ P_{10}^{(n)} & P_{11}^{(n)} & P_{12}^{(n)} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$
An interesting relation between these matrices is obtained by noting that
$$P_{ij}^{(n+m)} = \sum_k P(X_{n+m} = j | X_0 = i, X_n = k) P(X_n = k | X_0 = i) = \sum_k P_{kj}^{(m)} P_{ik}^{(n)}$$
The preceding are called the Chapman-Kolmogorov equations.

It follows from the Chapman-Kolmogorov equations that
$$P^{(n+m)} = P^{(n)} \times P^{(m)}$$
where here "$\times$" represents matrix multiplication. Hence,
$$P^{(2)} = P \times P$$
and, by induction,
$$P^{(n)} = P^n$$
where the right-hand side represents multiplying the matrix $P$ by itself $n$ times.

Example 5.3 Reflected random walk.
A particle starts at position zero and at each time moves one position to the right with probability $p$; with probability $1 - p$ it moves one position to the left if it is not in position zero, and otherwise remains in state 0. The position $X_n$ of the particle at time $n$ forms a Markov chain with transition matrix
$$P = \begin{pmatrix} 1-p & p & 0 & 0 & 0 & \cdots \\ 1-p & 0 & p & 0 & 0 & \cdots \\ 0 & 1-p & 0 & p & 0 & \cdots \\ \vdots & & & & & \end{pmatrix}$$

Example 5.4 The two state Markov chain. Consider a Markov chain with states 0 and 1 having transition probability matrix
$$P = \begin{pmatrix} \alpha & 1 - \alpha \\ \beta & 1 - \beta \end{pmatrix}$$
The two-step transition probability matrix is given by
$$P^{(2)} = \begin{pmatrix} \alpha^2 + (1 - \alpha)\beta & 1 - \alpha^2 - (1 - \alpha)\beta \\ \alpha\beta + \beta(1 - \beta) & 1 - \alpha\beta - \beta(1 - \beta) \end{pmatrix}$$

5.3 The Strong Markov Property

Consider a Markov chain $X_n$ having one-step transition probabilities $P_{ij}$, which means that if the Markov chain is in state $i$ at a fixed time $n$, then the next state will be $j$ with probability $P_{ij}$. However, it is not necessarily true that if the Markov chain is in state $i$ at a randomly distributed time $T$, the next state will be $j$ with probability $P_{ij}$. That is, if $T$ is an arbitrary nonnegative integer valued random variable, it is not necessarily true that $P(X_{T+1} = j | X_T = i) = P_{ij}$. For a simple counterexample suppose
$$T = \min(n : X_n = i, X_{n+1} = j)$$
Then, clearly,
$$P(X_{T+1} = j | X_T = i) = 1$$
The idea behind this counterexample is that a general random variable $T$ may depend not only on the states of the Markov chain up to time $T$ but also on future states after time $T$. Recalling that $T$ is a stopping time for a filtration $\mathcal{F}_n$ if $\{T = n\} \in \mathcal{F}_n$ for every $n$, we see that for a stopping time the value of $T$ can only depend on the states up to time $T$. We now show that $P(X_{T+n} = j | X_T = i)$ will equal $P_{ij}^{(n)}$ provided that $T$ is a stopping time.

This is usually called the strong Markov property, and essentially means a Markov chain "starts over" at stopping times. Below we define $\mathcal{F}_T = \{A : A \cap \{T = t\} \in \mathcal{F}_t \text{ for all } t\}$, which intuitively represents any information you would know by time $T$.

Proposition 5.5 The strong Markov property.
Let $X_n, n \ge 0$, be a Markov chain with respect to the filtration $\mathcal{F}_n$. If $T < \infty$ a.s. is a stopping time with respect to $\mathcal{F}_n$, then
$$P(X_{T+n} = j | X_T = i, \mathcal{F}_T) = P_{ij}^{(n)}$$
Proof.
$$P(X_{T+n} = j | X_T = i, \mathcal{F}_T, T = t) = P(X_{t+n} = j | X_t = i, \mathcal{F}_t, T = t) = P(X_{t+n} = j | X_t = i, \mathcal{F}_t) = P_{ij}^{(n)}$$
where the next to last equality used the fact that $T$ is a stopping time to give $\{T = t\} \in \mathcal{F}_t$. ∎

Example 5.6 Losses in queueing busy periods. Consider a queueing system where $X_n$ is the number of customers in the system at time $n$. At each time $n = 1, 2, \ldots$ there is a probability $p$ that a customer arrives and a probability $(1 - p)$, if there are any customers present, that one of them departs. Starting with $X_0 = 1$, let $T = \min\{t > 0 : X_t = 0\}$ be the length of a busy period. Suppose also there is only space for at most $m$ customers in the system. Whenever a customer arrives to find $m$ customers already in the system, the customer is lost and departs immediately. Letting $N_m$ be the number of customers lost during a busy period, compute $E[N_m]$.

Solution. Let $A$ be the event that the first arrival occurs before the first departure. We will obtain $E[N_m]$ by conditioning on whether $A$ occurs. Now, when $A$ happens, for the busy period to end we must first wait an interval of time until the system goes back to having a single customer, and then after that we must wait another interval of time until the system becomes completely empty. By the Markov property the number of losses during the first time interval has distribution $N_{m-1}$, because we are now starting with two customers and therefore with only $m - 1$ spaces for additional customers. The strong Markov property tells us that the number of losses in the second time interval has distribution $N_m$. We therefore have
$$E[N_m | A] = \begin{cases} E[N_{m-1}] + E[N_m] & \text{if } m > 1 \\ 1 + E[N_m] & \text{if } m = 1 \end{cases}$$
and using $P(A) = p$ and $E[N_m | A^c] = 0$ we have
$$E[N_m] = E[N_m | A]P(A) + E[N_m | A^c]P(A^c) = pE[N_{m-1}] + pE[N_m]$$
for $m > 1$ along with
$$E[N_1] = p + pE[N_1]$$
and thus
$$E[N_m] = \left(\frac{p}{1 - p}\right)^m$$
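The closed form for $E[N_m]$ can be checked numerically. The following is a minimal Monte Carlo sketch of the queue in Example 5.6, assuming the step rule stated there (arrival with probability $p$, otherwise a departure); the function names `simulate_losses` and `exact_losses`, the seed, and the trial count are illustrative choices and not part of the original text.

```python
import random


def simulate_losses(p, m, trials, seed=0):
    """Estimate E[N_m]: the mean number of customers lost in one busy period."""
    rng = random.Random(seed)
    total_lost = 0
    for _ in range(trials):
        customers = 1                    # X_0 = 1
        while customers > 0:             # the busy period ends when the system empties
            if rng.random() < p:         # a customer arrives
                if customers == m:
                    total_lost += 1      # system is full, so the arrival is lost
                else:
                    customers += 1
            else:                        # one of the customers present departs
                customers -= 1
    return total_lost / trials


def exact_losses(p, m):
    """Closed form derived in the example: (p / (1 - p)) ** m."""
    return (p / (1 - p)) ** m


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.6):
        for m in (1, 2, 4):
            est = simulate_losses(p, m, trials=50_000, seed=1)
            print(f"p={p} m={m} simulated={est:.3f} exact={exact_losses(p, m):.3f}")
```

Because the estimate is random, agreement with the formula should only be expected up to sampling error, which shrinks as `trials` grows.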
It's quite interesting to notice that $E[N_m]$ increases in $m$ when $p > 1/2$, decreases when $p < 1/2$, and stays constant for all $m$ when $p = 1/2$. The intuition for the case $p = 1/2$ is that when $m$ increases, losses become less frequent but the busy period becomes longer. ∎

We next apply the strong Markov property to obtain a result for the cover time, the time when all states of a Markov chain have been visited.

Proposition 5.7 Cover times. Given an $N$-state Markov chain $X_n$, let $T_i = \min\{n > 0 : X_n = i\}$ and let $C = \max_i T_i$ be the cover time. Then
$$E[C] \le \sum_{m=1}^N \frac{1}{m} \max_{i,j} E[T_j | X_0 = i]$$
Proof. Let $I_1, I_2, \ldots, I_N$ be a random permutation of the integers $1, 2, \ldots, N$ chosen so that all possible orderings are equally likely. Letting $T_{I_0} = 0$ and noting that $\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}$ is the additional time after all of the states $I_1, \ldots, I_{m-1}$ have been visited until all of the states $I_1, \ldots, I_m$ have been visited, we see that
$$C = \sum_{m=1}^N \left(\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}\right)$$
Thus we have
$$E[C] = \sum_{m=1}^N E\left[\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}\right]$$
$$= \sum_{m=1}^N \frac{1}{m} E\left[\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j} \,\Big|\, T_{I_m} > \max_{j \le m-1} T_{I_j}\right]$$
$$\le \sum_{m=1}^N \frac{1}{m} \max_{i,j} E[T_j | X_0 = i]$$
where the second line follows since all orderings are equally likely and thus $P(T_{I_m} > \max_{j \le m-1} T_{I_j}) = 1/m$, and the third from the strong Markov property. ∎

5.4 Classification of States

We say that states $i, j$ of a Markov chain communicate with each other, or are in the same class, if there are integers $n$ and $m$ such that both $P_{ij}^{(n)} > 0$ and $P_{ji}^{(m)} > 0$ hold. This means that it is possible for the chain to get from $i$ to $j$ and vice versa. A Markov chain is called irreducible if all states are in the same class.

For a Markov chain $X_n$, let
$$T_i = \min\{n > 0 : X_n = i\}$$
be the time until the Markov chain first makes a transition into state $i$. Using the notation $E_i[\cdots]$ and $P_i(\cdots)$ to denote that the Markov chain starts from state $i$, let
$$f_i = P_i(T_i < \infty)$$
be the probability that the chain ever makes a transition into state $i$ given that it starts in state $i$. We say that state $i$ is transient if $f_i < 1$ and recurrent if $f_i = 1$. Let
$$N_i = \sum_{n=1}^\infty I_{\{X_n = i\}}$$
be the total number of transitions into state $i$.
The strong Markov property tells us that, starting in state $i$,
$$N_i + 1 \sim \text{geometric}(1 - f_i)$$
since each time the chain makes a transition into state $i$ there is, independent of all else, a $(1 - f_i)$ chance it will never return. Consequently,
$$E_i[N_i] = \sum_{n=1}^\infty P_{ii}^{(n)}$$
is either infinite or finite depending on whether state $i$ is recurrent or transient.

Proposition 5.8 If state $i$ is recurrent and $i$ communicates with $j$, then $j$ is also recurrent.

Proof. Because $i$ and $j$ communicate, there exist values $n$ and $m$ such that $P_{ij}^{(n)} P_{ji}^{(m)} > 0$. But for any $k > 0$
$$P_{jj}^{(n+m+k)} \ge P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)}$$
where the preceding follows because $P_{jj}^{(n+m+k)}$ is the probability starting in state $j$ that the chain will be back in $j$ after $n + m + k$ transitions, whereas $P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)}$ is the probability of the same event occurring but with the additional condition that the chain must also be in $i$ after the first $m$ transitions and then back in $i$ after an additional $k$ transitions. Summing over $k$ shows that
$$E_j[N_j] \ge \sum_k P_{jj}^{(n+m+k)} \ge P_{ji}^{(m)} P_{ij}^{(n)} \sum_k P_{ii}^{(k)} = \infty$$
Thus, $j$ is also recurrent. ∎

Proposition 5.9 If $j$ is transient, then $\sum_{n=1}^\infty P_{ij}^{(n)} < \infty$.

Proof. Note that
$$E_i[N_j] = E_i\left[\sum_{n=1}^\infty I_{\{X_n = j\}}\right] = \sum_{n=1}^\infty P_{ij}^{(n)}$$
Let $f_{ij}$ denote the probability that the chain ever makes a transition into $j$ given that it starts at $i$. Then, conditioning on whether such a transition ever occurs yields, upon using the strong Markov property, that
$$E_i[N_j] = (1 + E_j[N_j]) f_{ij} < \infty$$
since $j$ is transient. ∎

If $i$ is recurrent, let
$$\mu_i = E_i[T_i]$$
denote the mean number of transitions it takes the chain to return to state $i$, given it starts in $i$. We say that a recurrent state $i$ is null if $\mu_i = \infty$ and positive if $\mu_i < \infty$. In the next section we will show that positive recurrence is a class property, meaning that if $i$ is positive recurrent and communicates with $j$ then $j$ is also positive recurrent. (This also implies, using that recurrence is a class property, that so is null recurrence.)
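For a finite chain, the communicating classes defined in this section can be found mechanically from the pattern of positive one-step probabilities. The sketch below is one way to do it, assuming the chain is given as a list of rows and that every state is taken to communicate with itself (the case $n = 0$); the function name `communicating_classes` and the example matrix are hypothetical.

```python
def communicating_classes(P):
    """Group the states 0..n-1 of a finite chain into communicating classes."""
    n = len(P)
    # reach[i][j] is True when j can be reached from i in zero or more steps.
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    # Warshall's transitive closure: allow paths that pass through state k.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        # j is in the class of i when each state can be reached from the other.
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        classes.append(cls)
        assigned.update(cls)
    return classes


if __name__ == "__main__":
    # States 0 and 1 communicate; state 2 is absorbing; state 3 can only leak into 2.
    P = [
        [0.5, 0.5, 0.0, 0.0],
        [0.3, 0.2, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.4, 0.6],
    ]
    classes = communicating_classes(P)
    print(classes)
    print("irreducible:", len(classes) == 1)
```

A chain is irreducible exactly when this function returns a single class.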
5.5 Stationary and Limiting Distributions

For a Markov chain $X_n$ starting in some given state $i$, we define the limiting probability of being in state $j$ to be
$$P_j = \lim_{n \to \infty} P_{ij}^{(n)}$$
if the limit exists and is the same for all $i$.

It is easy to see that not all Markov chains will have limiting probabilities. For instance, consider the two state Markov chain with $P_{01} = P_{10} = 1$. For this chain $P_{00}^{(n)}$ will equal 1 when $n$ is even and 0 when $n$ is odd, and so has no limit.

Definition 5.10 State $i$ of a Markov chain $X_n$ is said to have period $d$ if $d$ is the largest integer having the property that $P_{ii}^{(n)} = 0$ when $n$ is not a multiple of $d$.

Proposition 5.11 If states $i$ and $j$ communicate, then they have the same period.

Proof. Let $d_k$ be the period of state $k$. Let $n, m$ be such that $P_{ij}^{(n)} P_{ji}^{(m)} > 0$. Now, if $P_{ii}^{(r)} > 0$, then
$$P_{jj}^{(r+n+m)} \ge P_{ji}^{(m)} P_{ii}^{(r)} P_{ij}^{(n)} > 0$$
So $d_j$ divides $r + n + m$. Moreover, because
$$P_{ii}^{(2r)} \ge P_{ii}^{(r)} P_{ii}^{(r)} > 0$$
the same argument shows that $d_j$ also divides $2r + n + m$; therefore $d_j$ divides $2r + n + m - (r + n + m) = r$. Because $d_j$ divides $r$ whenever $P_{ii}^{(r)} > 0$, it follows that $d_j$ divides $d_i$. But the same argument can now be used to show that $d_i$ divides $d_j$. Hence, $d_i = d_j$. ∎

It follows from the preceding that all states of an irreducible Markov chain have the same period. If the period is 1, we say that the chain is aperiodic. It's easy to see that only aperiodic chains can have limiting probabilities.

Intimately linked to limiting probabilities are stationary probabilities. The probability vector $\pi_i, i \in S$ is said to be a stationary probability vector for the Markov chain if
$$\pi_j = \sum_i \pi_i P_{ij} \quad \text{for all } j$$
$$\sum_j \pi_j = 1$$
Its name arises from the fact that if $X_0$ is distributed according to a stationary probability vector $\{\pi_i\}$ then
$$P(X_1 = j) = \sum_i P(X_1 = j | X_0 = i)\pi_i = \sum_i \pi_i P_{ij} = \pi_j$$
and, by a simple induction argument,
$$P(X_n = j) = \sum_i P(X_n = j | X_{n-1} = i) P(X_{n-1} = i) = \sum_i \pi_i P_{ij} = \pi_j$$
Consequently, if we start the chain with a stationary probability vector then $X_n, X_{n+1}, \ldots$ has the same probability distribution for all $n$.
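For a finite chain, the stationarity equations are a linear system, so a stationary vector can be computed exactly. The sketch below assumes a finite irreducible chain, so that the solution is unique, and uses exact rational arithmetic; the names `stationary_vector` and `step` are illustrative, and the example is the two state chain of Example 5.4 with the assumed values $\alpha = 1/2$, $\beta = 1/4$.

```python
from fractions import Fraction


def stationary_vector(P):
    """Solve pi = pi P together with sum(pi) = 1 by Gauss-Jordan elimination.

    Assumes P is the transition matrix of a finite irreducible chain,
    so that the solution exists and is unique.
    """
    n = len(P)
    # Rows 0..n-2 encode sum_i pi_i * (P[i][j] - [i == j]) = 0 for j = 0..n-2.
    A = [
        [Fraction(P[i][j]) - (1 if i == j else 0) for i in range(n)] + [Fraction(0)]
        for j in range(n - 1)
    ]
    # The last row replaces the redundant equation with the normalization sum_i pi_i = 1.
    A.append([Fraction(1)] * n + [Fraction(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def step(pi, P):
    """Distribution of X_1 when X_0 has distribution pi, that is, the row vector pi P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


if __name__ == "__main__":
    alpha, beta = Fraction(1, 2), Fraction(1, 4)
    P = [[alpha, 1 - alpha], [beta, 1 - beta]]
    pi = stationary_vector(P)
    print(pi)                  # the stationary vector as exact fractions
    print(step(pi, P) == pi)   # one step of the chain leaves pi unchanged
```

Checking `step(pi, P) == pi` is a direct restatement of the defining equations, so it is a useful sanity test whatever method produced `pi`.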
The following result will be needed later.

Proposition 5.12 An irreducible transient Markov chain does not have a stationary probability vector.

Proof. Assume there is a stationary probability vector $\pi_i, i \ge 0$ and take it to be the probability mass function of $X_0$. Then, for any $j$
$$\pi_j = P(X_n = j) = \sum_i \pi_i P_{ij}^{(n)}$$
Consequently, for any $m$
$$\pi_j = \lim_{n \to \infty} \sum_i \pi_i P_{ij}^{(n)} \le \lim_{n \to \infty} \left(\sum_{i \le m} \pi_i P_{ij}^{(n)} + \sum_{i > m} \pi_i\right) = \sum_{i \le m} \pi_i \lim_{n \to \infty} P_{ij}^{(n)} + \sum_{i > m} \pi_i = \sum_{i > m} \pi_i$$
where the final equality used that $\sum_n P_{ij}^{(n)} < \infty$ because $j$ is transient, implying that $\lim_{n \to \infty} P_{ij}^{(n)} = 0$. Letting $m \to \infty$ shows that $\pi_j = 0$ for all $j$, contradicting the fact that $\sum_j \pi_j = 1$. Thus, assuming a stationary probability vector results in a contradiction, proving the result. ∎

The following theorem is of key importance.

Theorem 5.13 An irreducible Markov chain has a stationary probability vector $\{\pi_i\}$ if and only if all states are positive recurrent. The stationary probability vector is unique and satisfies
$$\pi_j = 1/\mu_j$$
Moreover, if the chain is aperiodic then
$$\pi_j = \lim_{n \to \infty} P_{ij}^{(n)}$$
To prove the preceding theorem, we will make use of a couple of lemmas.

Lemma 5.14 For an irreducible Markov chain, if there exists a stationary probability vector $\{\pi_i\}$ then all states are positive recurrent. Moreover, the stationary probability vector is unique and satisfies
$$\pi_j = 1/\mu_j$$
Proof. Let $\pi_j$ be stationary probabilities, and suppose that $P(X_0 = j) = \pi_j$ for all $j$. We first show that $\pi_i > 0$ for all $i$. To verify this, suppose that $\pi_k = 0$. Now for any state $j$, because the chain is irreducible, there is an $n$ such that $P_{jk}^{(n)} > 0$. Because $X_0$ is determined by the stationary probabilities,
$$\pi_k = P(X_n = k) = \sum_i \pi_i P_{ik}^{(n)} \ge \pi_j P_{jk}^{(n)}$$
Consequently, if $\pi_k = 0$ then so is $\pi_j$. Because $j$ was arbitrary, that means that if $\pi_i = 0$ for any $i$, then $\pi_i = 0$ for all $i$. But that would contradict the fact that $\sum_i \pi_i = 1$. Hence any stationary probability vector for an irreducible Markov chain must have all positive elements.
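Now, recall that $T_j = \min(n > 0 : X_n = j)$. So,
$$\mu_j = E[T_j | X_0 = j] = \sum_{n=1}^\infty P(T_j \ge n | X_0 = j) = \sum_{n=1}^\infty \frac{P(T_j \ge n, X_0 = j)}{P(X_0 = j)}$$
Because $X_0$ is chosen according to the stationary probability vector $\{\pi_i\}$, this gives
$$\pi_j \mu_j = \sum_{n=1}^\infty P(T_j \ge n, X_0 = j) \qquad (5.1)$$
Now,
$$P(T_j \ge 1, X_0 = j) = P(X_0 = j) = \pi_j$$
and, for $n \ge 2$
$$P(T_j \ge n, X_0 = j) = P(X_i \ne j, 1 \le i \le n - 1, X_0 = j)$$
$$= P(X_i \ne j, 1 \le i \le n - 1) - P(X_i \ne j, 0 \le i \le n - 1)$$
$$= P(X_i \ne j, 0 \le i \le n - 2) - P(X_i \ne j, 0 \le i \le n - 1)$$
where the final equality used that $X_0, \ldots, X_{n-2}$ has the same probability distribution as $X_1, \ldots, X_{n-1}$ when $X_0$ is chosen according to the stationary probabilities. Substituting these results into (5.1), the sum telescopes to give
$$\pi_j \mu_j = \pi_j + P(X_0 \ne j) - \lim_{n \to \infty} P(X_i \ne j, 0 \le i \le n - 1)$$
But, by Proposition 5.12, the existence of a stationary probability vector implies that the chain is recurrent, and thus that
$$\lim_{n \to \infty} P(X_i \ne j, 0 \le i \le n - 1) = P(X_i \ne j \text{ for all } i \ge 0) = 0$$
Because $P(X_0 \ne j) = 1 - \pi_j$, we thus obtain
$$\pi_j = 1/\mu_j$$
thus showing that there is at most one stationary probability vector. In addition, because all $\pi_j > 0$, we have that all $\mu_j < \infty$, showing that all states of the chain are positive recurrent. ∎

Lemma 5.15 If some state of an irreducible Markov chain is positive recurrent then there exists a stationary probability vector.

Proof. Suppose state $k$ is positive recurrent. Thus,
$$\mu_k = E_k[T_k] < \infty$$
Say that a new cycle begins every time the chain makes a transition into state $k$. For any state $j$, let $A_j$ denote the amount of time the chain spends in state $j$ during a cycle. Then
$$E[A_j] = E_k\left[\sum_{n=0}^\infty I_{\{X_n = j, T_k > n\}}\right] = \sum_{n=0}^\infty E_k[I_{\{X_n = j, T_k > n\}}] = \sum_{n=0}^\infty P_k(X_n = j, T_k > n)$$
We claim that $\pi_j = E[A_j]/\mu_k$, $j \ge 0$, is a stationary probability vector. Because $E[\sum_j A_j]$ is the expected time of a cycle, it must equal $E_k[T_k]$, showing that
$$\sum_j \pi_j = 1$$
Moreover, for $j \ne k$
$$\mu_k \pi_j = \sum_{n \ge 0} P_k(X_n = j, T_k > n)$$
$$= \sum_{n \ge 1} \sum_i P_k(X_n = j, T_k > n - 1, X_{n-1} = i)$$
$$= \sum_i \sum_{n \ge 1} P_k(T_k > n - 1, X_{n-1} = i) P_k(X_n = j | T_k > n - 1, X_{n-1} = i)$$
$$= \sum_i \sum_{n \ge 1} P_k(T_k > n - 1, X_{n-1} = i) P_{ij}$$
$$= \sum_i \sum_{n \ge 0} P_k(T_k > n, X_n = i) P_{ij}$$
$$= \sum_i E[A_i] P_{ij}$$
$$= \mu_k \sum_i \pi_i P_{ij}$$
Finally,
$$\sum_i \pi_i P_{ik} = \sum_i \pi_i \left(1 - \sum_{j \ne k} P_{ij}\right) = 1 - \sum_{j \ne k} \sum_i \pi_i P_{ij} = 1 - \sum_{j \ne k} \pi_j = \pi_k$$
and the proof is complete. ∎

Note that Lemmas 5.14 and 5.15 imply the following.

Corollary 5.16 If one state of an irreducible Markov chain is positive recurrent then all states are positive recurrent.

We are now ready to prove Theorem 5.13.

Proof. All that remains to be proven is that if the chain is aperiodic, as well as irreducible and positive recurrent, then the stationary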
probabilities are also limiting probabilities. To prove this, let $\pi_i, i \ge 0$, be stationary probabilities. Let $X_n, n \ge 0$ and $Y_n, n \ge 0$ be independent Markov chains, both with transition probabilities $P_{i,j}$, but with $X_0 = i$ and with $P(Y_0 = i) = \pi_i$. Let
$$N = \min(n : X_n = Y_n)$$
We first show that $P(N < \infty) = 1$. To do so, consider the Markov chain whose state at time $n$ is $(X_n, Y_n)$, and which thus has transition probabilities
$$P_{(i,j),(k,r)} = P_{ik} P_{jr}$$
That this chain is irreducible can be seen by the following argument. Because $\{X_n\}$ is irreducible and aperiodic, it follows that for any state $i$ there are relatively prime integers $n, m$ such that $P_{ii}^{(n)} P_{ii}^{(m)} > 0$. But any sufficiently large integer can be expressed as a nonnegative integer linear combination of relatively prime integers, implying that there is an integer $N_i$ such that
$$P_{ii}^{(n)} > 0 \quad \text{for all } n > N_i$$
Because $i$ and $j$ communicate this implies the existence of an integer $N_{i,j}$ such that
$$P_{ij}^{(n)} > 0 \quad \text{for all } n > N_{i,j}$$
Hence,
$$P_{(i,j),(k,r)}^{(n)} = P_{ik}^{(n)} P_{jr}^{(n)} > 0 \quad \text{for all sufficiently large } n$$
which shows that the vector chain $(X_n, Y_n)$ is irreducible.

In addition, we claim that $\pi_{i,j} = \pi_i \pi_j$ is a stationary probability vector, which is seen from
$$\pi_i \pi_j = \sum_k \pi_k P_{k,i} \sum_r \pi_r P_{r,j} = \sum_{k,r} \pi_k \pi_r P_{k,i} P_{r,j}$$
By Lemma 5.14 this shows that the vector Markov chain is positive recurrent and so $P(N < \infty) = 1$ and thus $\lim_n P(N > n) = 0$.

Now, let $Z_n = X_n$ if $n < N$ and let $Z_n = Y_n$ if $n \ge N$. It is easy to see that $Z_n, n \ge 0$ is also a Markov chain with transition probabilities $P_{i,j}$ and has $Z_0 = i$. Now
$$P_{i,j}^{(n)} = P(Z_n = j)$$
$$= P(Z_n = j, N \le n) + P(Z_n = j, N > n)$$
$$= P(Y_n = j, N \le n) + P(Z_n = j, N > n)$$
$$\le P(Y_n = j) + P(N > n)$$
$$= \pi_j + P(N > n) \qquad (5.2)$$
On the other hand
$$\pi_j = P(Y_n = j)$$
$$= P(Y_n = j, N \le n) + P(Y_n = j, N > n)$$
$$= P(Z_n = j, N \le n) + P(Y_n = j, N > n)$$
$$\le P(Z_n = j) + P(N > n)$$
$$= P_{i,j}^{(n)} + P(N > n) \qquad (5.3)$$
Hence, from (5.2) and (5.3) we see that
$$\lim_{n \to \infty} P_{i,j}^{(n)} = \pi_j$$
∎
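The role of aperiodicity in Theorem 5.13 can be seen numerically by raising a transition matrix to a high power. The sketch below uses plain lists and repeated squaring; the helper names `mat_mul` and `mat_pow` are illustrative, and the aperiodic example reuses the two state chain with the assumed values $\alpha = 1/2$, $\beta = 1/4$, whose stationary vector is $(1/3, 2/3)$.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """n-step transition matrix P^(n) = P^n, computed by repeated squaring."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in P]
    while n > 0:
        if n % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n //= 2
    return result


if __name__ == "__main__":
    # Aperiodic chain: every row of P^n approaches the stationary vector (1/3, 2/3).
    P = [[0.5, 0.5], [0.25, 0.75]]
    for n in (1, 2, 5, 20):
        print(n, mat_pow(P, n))

    # Periodic chain with P_01 = P_10 = 1: P^n alternates and has no limit.
    Q = [[0.0, 1.0], [1.0, 0.0]]
    print(mat_pow(Q, 10), mat_pow(Q, 11))
```

For the periodic chain the stationary vector $(1/2, 1/2)$ still exists, but the $n$-step probabilities oscillate between 0 and 1 instead of converging to it.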
Remark 5.17 It follows from Theorem 5.13 that if we have an irreducible Markov chain, and we can find a solution of the stationarity equations
$$\pi_j = \sum_i \pi_i P_{ij}, \quad j \ge 0$$
$$\sum_i \pi_i = 1$$
then the Markov chain is positive recurrent, and the $\pi_i$ are the unique stationary probabilities. If, in addition, the chain is aperiodic then the $\pi_i$ are also limiting probabilities.

Remark 5.18 Because $\mu_i$ is the mean number of transitions between successive visits to state $i$, it is intuitive (and will be formally proven in the chapter on renewal theory) that the long run proportion of time that the chain spends in state $i$ is equal to $1/\mu_i$. Hence, the stationary probability $\pi_i$ is equal to the long run proportion of time that the chain spends in state $i$.

Definition 5.19 A positive recurrent, aperiodic, irreducible Markov chain is called an ergodic Markov chain.

Definition 5.20 A positive recurrent irreducible Markov chain whose initial state is distributed according to its stationary probabilities is called a stationary Markov chain.

5.6 Time Reversibility

A stationary Markov chain $X_n$ is called time reversible if
$$P(X_n = j | X_{n+1} = i) = P(X_{n+1} = j | X_n = i)$$
for all $i, j$. By the Markov property we know that the processes $X_{n+1}, X_{n+2}, \ldots$ and $X_{n-1}, X_{n-2}, \ldots$ are conditionally independent given $X_n$, and so it follows that the reversed process $X_{n-1}, X_{n-2}, \ldots$ will also be a Markov chain having transition probabilities
$$P(X_n = j | X_{n+1} = i) = \frac{P(X_n = j, X_{n+1} = i)}{P(X_{n+1} = i)} = \frac{\pi_j P_{ji}}{\pi_i}$$
where $\pi_i$ and $P_{ij}$ respectively denote the stationary probabilities and the transition probabilities for the Markov chain $X_n$. Thus an equivalent definition for $X_n$ being time reversible is if
$$\pi_i P_{ij} = \pi_j P_{ji}$$
for all $i, j$. Intuitively, a Markov chain is time reversible if it looks the same running backwards as it does running forwards. It also means that the rate of transitions from $i$ to $j$, namely $\pi_i P_{ij}$, is the same as the rate of transitions from $j$ to $i$, namely $\pi_j P_{ji}$.
This happens if there are no "loops" for which a Markov chain is more likely in the long run to go in one direction compared with the other direction. We illustrate with examples.

Example 5.21 Random walk on the circle. Consider a particle which moves around $n$ positions on a circle numbered $1, 2, \ldots, n$ according to transition probabilities $P_{i,i+1} = p = 1 - P_{i+1,i}$ for $1 \le i$ … $\sum_i \pi_i = 1$, that the Markov chain is time reversible with the given stationary probabilities. ∎

5.7 A Mean Passage Time Bound

Consider now a Markov chain whose state space is the set of nonnegative integers, and which is such that
$$P_{ij} = 0, \quad 0 \le i < j$$
… $i > 0$, we have that $E[D_i] \ge d_i$, then
$$E[N_n] \le \sum_{i=1}^n 1/d_i$$
Proof. The proof is by induction on $n$. It is true when $n = 1$, because $N_1$ is geometric with mean
$$E[N_1] = \frac{1}{P_{1,0}} = \frac{1}{E[D_1]} \le \frac{1}{d_1}$$
So, assume that $E[N_k] \le \sum_{i=1}^k 1/d_i$, for all $k < n$. Then,
$$E[N_n] = 1 + \sum_{j=0}^n E[N_{n-j}] P(D_n = j)$$
$$\le 1 + P_{n,n} E[N_n] + \sum_{j=1}^n P(D_n = j) \sum_{i=1}^{n-j} 1/d_i$$
$$= 1 + P_{n,n} E[N_n] + \sum_{j=1}^n P(D_n = j) \left[\sum_{i=1}^n 1/d_i - \sum_{i=n-j+1}^n 1/d_i\right]$$
$$\le 1 + P_{n,n} E[N_n] + \sum_{j=1}^n P(D_n = j) \left[\sum_{i=1}^n 1/d_i - j/d_n\right]$$
where the last line follows because $d_i$ is nondecreasing. Continuing from the previous line we get
$$= 1 + P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^n 1/d_i - \frac{E[D_n]}{d_n}$$
$$\le P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^n 1/d_i$$
which completes the proof. ∎

Example 5.24 At each stage, each of a set of balls is independently put in one of $n$ urns, with each ball being put in urn $i$ with probability $p_i$, $\sum_{i=1}^n p_i = 1$. After this is done, all of the balls in the same urn
0
$$P(X - E[X] \ge c) \le e^{-2c^2/n} \qquad (4.4)$$
$$P(X - E[X] \le -c) \le e^{-2c^2/n} \qquad (4.5)$$
Proof. For $c > 0$, $t > 0$
$$P(X - E[X] \ge c) = P(e^{t(X - E[X])} \ge e^{tc})$$
$$\le e^{-tc} E[e^{t(X - E[X])}] \quad \text{by the Markov inequality}$$
$$= e^{-tc} E\left[\exp\left\{\sum_{i=1}^n t(X_i - E[X_i])\right\}\right]$$
$$= e^{-tc} E\left[\prod_{i=1}^n e^{t(X_i - E[X_i])}\right]$$
$$= e^{-tc} \prod_{i=1}^n E[e^{t(X_i - E[X_i])}]$$
However, if $Y$ is Bernoulli with parameter $p$, then
$$E[e^{t(Y - E[Y])}] = pe^{t(1-p)} + (1 - p)e^{-tp} \le e^{t^2/8}$$
where the inequality follows from Lemma 4.7. Therefore,
$$P(X - E[X] \ge c) \le e^{-tc} e^{nt^2/8}$$
Letting $t = 4c/n$ yields the inequality (4.4).

The proof of the inequality (4.5) is obtained by writing it as $P(E[X] - X \ge c)$ … $> 0$, as $n \to \infty$ and $m \to \infty$,
$$P((1 - \epsilon)m\alpha^w \le N_\alpha \le (1 + \epsilon)m\alpha^w) \to 1$$
Proof. To prove the result, it is convenient to first formulate an equivalent continuous time model that results in the times at which the $n + m$ cells are killed being independent random variables. To do so, let $X_1, \ldots, X_{n+m}$ be independent exponential random variables, with $X_i$ having rate $w_i$, $i = 1, \ldots, n + m$. Note that $X_i$ will be the smallest of these exponentials with probability $w_i/\sum_j w_j$; further, given that $X_i$ is the smallest, $X_r$, $r \ne i$, will be the second smallest with probability $w_r/\sum_{j \ne i} w_j$; further, given that $X_i$ and $X_r$ are, respectively, the first and second smallest, $X_s$, $s \ne i, r$, will be the next smallest with probability $w_s/\sum_{j \ne i, r} w_j$; and so on. Consequently, if we imagine that cell $i$ is killed at time $X_i$, then the order in which the $n + m$ cells are killed has the same distribution as the order in which they are killed in the original model. So let us suppose that cell $i$ is killed at time $X_i$, $i \ge 1$. Now let $\tau_\alpha$ denote the time at which the number of surviving target cells first falls below $n\alpha$. Also, let $N(t)$ denote the number of normal cells that are still alive at time $t$; so $N_\alpha = N(\tau_\alpha)$. We will first show that
$$P(N(\tau_\alpha) \le (1 + \epsilon)m\alpha^w) \to 1 \quad \text{as } n, m \to \infty \qquad (4.6)$$
To prove the preceding, let $\epsilon^*$ be such that $0 < \epsilon^* < \epsilon$, and set $t = -\ln(\alpha(1 + \epsilon^*)^{1/w})$. We will prove (4.6) by showing that as $n$ and $m$ approach $\infty$,
(i) $P(\tau_\alpha > t) \to 1$, and
(ii) $P(N(t) \le (1 + \epsilon)m\alpha^w) \to 1$.
Because the events $\tau_\alpha > t$ and $N(t) \le (1 + \epsilon)m\alpha^w$ together imply that $N(\tau_\alpha) \le N(t) \le (1 + \epsilon)m\alpha^w$, the result (4.6) will be established.

To prove (i), note that the number, call it $Y$, of surviving target cells by time $t$ is a binomial random variable with parameters $n$ and $e^{-t} = \alpha(1 + \epsilon^*)^{1/w}$. Hence, with $a = n\alpha[(1 + \epsilon^*)^{1/w} - 1]$, we have
$$P(\tau_\alpha \le t) = P(Y \le n\alpha) = P(Y \le ne^{-t} - a) \le e^{-2a^2/n}$$
where the inequality follows from the Chernoff bound (4.3). This proves (i), because $a^2/n \to \infty$ as $n \to \infty$.

To prove (ii), note that $N(t)$ is a binomial random variable with parameters $m$ and $e^{-wt} = \alpha^w(1 + \epsilon^*)$. Thus, by letting $b = m\alpha^w(\epsilon - \epsilon^*)$ and again applying the Chernoff bound (4.3), we obtain
$$P(N(t) > (1 + \epsilon)m\alpha^w) = P(N(t) > me^{-wt} + b) \le e^{-2b^2/m}$$
This proves (ii), because $b^2/m \to \infty$ as $m \to \infty$. Thus (4.6) is established.

It remains to prove that
$$P(N(\tau_\alpha) \ge (1 - \epsilon)m\alpha^w) \to 1 \quad \text{as } n, m \to \infty \qquad (4.7)$$
However, (4.7) can be proven in a similar manner as was (4.6); a combination of these two results completes the proof of the theorem. ∎

4.5 Second Moment and Conditional Expectation Inequalities

The second moment inequality gives a lower bound on the probability that a nonnegative random variable is positive.

Proposition 4.11 The Second Moment Inequality For a nonnegative random variable $X$
$$P(X > 0) \ge \frac{(E[X])^2}{E[X^2]}$$
Proof. Using Jensen's inequality in the second line below, we have
$$E[X^2] = E[X^2 | X > 0]P(X > 0)$$
$$\ge (E[X | X > 0])^2 P(X > 0)$$
$$= \frac{(E[X])^2}{P(X > 0)}$$
∎

When $X$ is the sum of Bernoulli random variables, we can improve the bound of the second moment inequality. So, suppose for the remainder of this section that
$$X = \sum_{i=1}^n X_i$$
where $X_i$ is Bernoulli with $E[X_i] = p_i$, $i = 1, \ldots, n$.

We will need the following lemma.
Lemma 4.12 For any random variable $R$
$$E[XR] = \sum_{i=1}^n p_i E[R | X_i = 1]$$
Proof
$$E[XR] = E\left[\sum_{i=1}^n X_i R\right]$$
$$= \sum_{i=1}^n E[X_i R]$$
$$= \sum_{i=1}^n \{E[X_i R | X_i = 1]p_i + E[X_i R | X_i = 0](1 - p_i)\}$$
$$= \sum_{i=1}^n E[R | X_i = 1]p_i$$
∎

Proposition 4.13 The Conditional Expectation Inequality
$$P(X > 0) \ge \sum_{i=1}^n \frac{p_i}{E[X | X_i = 1]}$$
Proof. Let $R = I_{\{X > 0\}}/X$. Noting that
$$E[XR] = P(X > 0), \quad E[R | X_i = 1] = E\left[\frac{1}{X} \Big| X_i = 1\right]$$
we obtain, from Lemma 4.12,
$$P(X > 0) = \sum_{i=1}^n p_i E\left[\frac{1}{X} \Big| X_i = 1\right] \ge \sum_{i=1}^n \frac{p_i}{E[X | X_i = 1]}$$
where the final inequality made use of Jensen's inequality. ∎

Example 4.14 Consider a system consisting of $m$ components, each of which either works or not. Suppose, further, that for given subsets of components $S_j$, $j = 1, \ldots, n$, none of which is a subset of another, the system functions if all of the components of at least one of these subsets work. If component $j$ independently works with probability $\alpha_j$, derive a lower bound on the probability the system functions.

Solution Let $X_i$ equal 1 if all the components in $S_i$ work, and let it equal 0 otherwise, $i = 1, \ldots, n$. Also, let
$$p_i = P(X_i = 1) = \prod_{j \in S_i} \alpha_j$$
Then, with $X = \sum_{i=1}^n X_i$, we have
$$P(\text{system functions}) = P(X > 0) \ge \sum_{i=1}^n \frac{p_i}{E[X | X_i = 1]} = \sum_{i=1}^n \frac{p_i}{\sum_{j=1}^n P(X_j = 1 | X_i = 1)} = \sum_{i=1}^n \frac{p_i}{1 + \sum_{j \ne i} \prod_{k \in S_j - S_i} \alpha_k}$$
where $S_j - S_i$ consists of all components that are in $S_j$ but not in $S_i$. ∎

Example 4.15 Consider a random graph on the set of vertices $\{1, 2, \ldots, n\}$, which is such that each of the $\binom{n}{2}$ pairs of vertices $i \ne j$ is, independently, an edge of the graph with probability $p$. We are interested in the probability that this graph will be connected, where by connected we mean that for each pair of distinct vertices $i \ne j$ there is a sequence of edges of the form $(i, i_1), (i_1, i_2), \ldots, (i_k, j)$. (That is, a graph is connected if for each pair of distinct vertices $i$ and $j$, there is a path from $i$ to $j$.) Suppose that
$$p = c\frac{\ln(n)}{n}$$
We will now show that if $c < 1$, then the probability that the graph is connected goes to 0 as $n \to \infty$.
To verify this result, consider the number of isolated vertices, where vertex $i$ is said to be isolated if there are no edges of type $(i, j)$. Let $X_i$ be the indicator variable for the event that vertex $i$ is isolated, and let
$$X = \sum_{i=1}^n X_i$$
be the number of isolated vertices. Now,
$$P(X_i = 1) = (1 - p)^{n-1}$$
Also,
$$\begin{aligned}
E[X \mid X_i = 1] &= \sum_{j=1}^n P(X_j = 1 \mid X_i = 1) \\
&= 1 + \sum_{j \ne i} (1 - p)^{n-2} \\
&= 1 + (n - 1)(1 - p)^{n-2}
\end{aligned}$$
Because
$$(1 - p)^{n-1} = \Big(1 - \frac{c \ln(n)}{n}\Big)^{n-1} \approx e^{-c \ln(n)} = n^{-c}$$
the conditional expectation inequality yields that for $n$ large
$$P(X > 0) \ge \frac{n^{1-c}}{1 + (n-1)^{1-c}}$$
Therefore,
$$c < 1 \implies P(X > 0) \to 1 \text{ as } n \to \infty$$
Because the graph is not connected if $X > 0$, it follows that the graph will almost certainly be disconnected when $n$ is large and $c < 1$. (It can be shown when $c > 1$ that the probability the graph is connected goes to 1 as $n \to \infty$.)

4.6 The Min-Max Identity and Bounds on the Maximum

In this section we will be interested in obtaining an upper bound on $E[\max_i X_i]$, when $X_1, \ldots, X_n$ are nonnegative random variables. To begin, note that for any nonnegative constant $c$,
$$\max_i X_i \le c + \sum_{i=1}^n (X_i - c)^+ \qquad (4.8)$$
where $x^+$, the positive part of $x$, is equal to $x$ if $x > 0$ and is equal to 0 otherwise. Taking expectations of the preceding inequality yields that
$$E[\max_i X_i] \le c + \sum_{i=1}^n E[(X_i - c)^+]$$
Because $(X_i - c)^+$ is a nonnegative random variable, we have
$$\begin{aligned}
E[(X_i - c)^+] &= \int_0^\infty P((X_i - c)^+ > x)\,dx \\
&= \int_0^\infty P(X_i - c > x)\,dx \\
&= \int_c^\infty P(X_i > y)\,dy
\end{aligned}$$
Therefore,
$$E[\max_i X_i] \le c + \sum_{i=1}^n \int_c^\infty P(X_i > y)\,dy \qquad (4.9)$$
Because the preceding is true for all $c \ge 0$, the best bound is obtained by choosing the $c$ that minimizes the right side of the preceding. Differentiating, and setting the result equal to 0, shows that the best bound is obtained when $c$ is the value $c^*$ such that
$$\sum_{i=1}^n P(X_i > c^*) = 1$$
Because $\sum_{i=1}^n P(X_i > c)$ is a decreasing function of $c$, the value of $c^*$ can be easily approximated and then utilized in the inequality (4.9).
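As a rough numerical sketch of how $c^*$ might be approximated, the following assumes, purely for illustration, that the $X_i$ are exponential with rates `rates`, so that $P(X_i > c) = e^{-\lambda_i c}$. The function names `best_c` and `max_bound` are hypothetical and are not part of the text; bisection is just one convenient root finder for a decreasing function.

```python
import math

def best_c(rates, iterations=200):
    """Approximate c* solving sum_i exp(-rate_i * c) = 1 by bisection.

    Assumes exponential marginals; the tail sum is decreasing in c,
    is at least 1 at c = 0, and is at most 1 at c = ln(n) / min(rates).
    """
    n = len(rates)
    lo, hi = 0.0, math.log(n) / min(rates)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(math.exp(-r * mid) for r in rates) > 1.0:
            lo = mid   # tail sum still too large, so c* lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def max_bound(rates):
    """Right side of (4.9) at c*, using the closed-form exponential tail integral."""
    c = best_c(rates)
    return c + sum(math.exp(-r * c) / r for r in rates)

rates = [1.0] * 10
bound = max_bound(rates)                              # close to ln(10) + 1
exact = sum(1.0 / i for i in range(1, 11))            # exact mean when independent
```

With ten rate-1 exponentials the bound should come out near $\ln(10) + 1$, somewhat above the exact expected maximum for the independent case.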
It is interesting to note that $c^*$ is such that the expected number of the $X_i$ that exceed $c^*$ is equal to 1. This is natural because the inequality (4.8) becomes an equality when exactly one of the $X_i$ exceeds $c$.

Example 4.16 Suppose the $X_i$ are independent exponential random variables with rates $\lambda_i$, $i = 1, \ldots, n$. Then the minimizing value $c^*$ is such that
$$1 = \sum_{i=1}^n e^{-\lambda_i c^*}$$
with resulting bound
$$E[\max_i X_i] \le c^* + \sum_{i=1}^n \int_{c^*}^\infty e^{-\lambda_i y}\,dy = c^* + \sum_{i=1}^n \frac{1}{\lambda_i} e^{-\lambda_i c^*}$$
In the special case where the rates are all equal, say $\lambda_i = 1$, then
$$1 = n e^{-c^*} \quad \text{or} \quad c^* = \ln(n)$$
and the bound becomes
$$E[\max_i X_i] \le \ln(n) + 1 \qquad (4.10)$$
However, it is easy to compute the expected maximum of a sequence of independent exponentials with rate 1. Interpreting these random variables as the failure times of $n$ components, we can write
$$\max_i X_i = \sum_{i=1}^n T_i$$
where $T_i$ is the time between the $(i-1)$st and the $i$th failure. Using the lack of memory property of exponentials, it follows that the $T_i$ are independent, with $T_i$ being an exponential random variable with rate $n - i + 1$. (This is because when the $(i-1)$st failure occurs, the time until the next failure is the minimum of the $n - i + 1$ remaining lifetimes, each of which is exponential with rate 1.) Therefore, in this case
$$E[\max_i X_i] = \sum_{i=1}^n \frac{1}{n - i + 1} = \sum_{i=1}^n \frac{1}{i}$$
As it is known that, for $n$ large,
$$\sum_{i=1}^n \frac{1}{i} \approx \ln(n) + E$$
where $E \approx .5772$ is Euler's constant, we see that the bound yielded by the approach can be quite accurate. (Also, the bound (4.10) only requires that the $X_i$ are exponential with rate 1, and not that they are independent.)

The preceding bounds on $E[\max_i X_i]$ only involve the marginal distributions of the $X_i$. When we have additional knowledge about the joint distributions, we can often do better. To illustrate this, we first need to establish an identity relating the maximum of a set of random variables to the minimums of all the partial sets. For nonnegative random variables $X_1, \ldots, X_n$, fix $x$ and let $A_i$ denote the event that $X_i > x$.
Let X =mex(X1,...,Xn) Noting that X will be greater than z if and only if at least one of the events A; occur, we have P(X > 2) = P(ULAi) and the inclusion-exclusion identity gives P(X >2) = S>P(A)— Do PAA) + SD P(AASAR) 7 ia)= Xe De eA) fncn2} P(AAs) = P(X; > 2, X; > 2} = P(min(X;, X;) > 2) P(A;AjAg) = P(X; > 2, Xj > 2, X_ > x} P(min(X;, X;, Xx) > 2) and so on. Thus, we see that : PIX >a)= 0-4 SO Pemin(Xi,,...,%,) > 2)) Integrating both sides as x goes from 0 to oo gives the result: EIX]= S(O Blmin(X%,,....¥:,)] r=1 tS YAAK — Blin 5) iG ElX] < TAK) — Fn 1+ YO Hlmin(Xi, X;, Xx) i 1 1 + ————— +... + (1)! —_—___. San ) Pit... +Dn Using the preceding formula for the mean number of coupons needed to obtain a complete set requires summing over 2" terms, and so is not practical when n is large. Moreover, the bounds obtained by only going out a few steps in the formula for the expected value of a maximum generally turn out to be too loose to be beneficial. However, a useful bound can often be obtained by applying the max- min inequalities to an upper bound for E[X] rather than directly to E[X]. We now develop the theory. For nonnegative random variables Xi,..., Xp, let X = max(X, Fix c > 0, and note the inequality X $c+max((X1—0)*,...,(Xn—¢)*).4.0 Ihe Min-Max Identity and Bounds on the Maximum 131 Now apply the max-min upper bound inequalities to the right side of the preceding, take expectations, and obtain that E[X] < c+ STIX 04] i=l E[X] < e+ > BUX — €)*] — > Blmin((X; — 0), (X; - )*)] 7 SG + YS Blmin((x: - 0), (Xj - (Xe - 0D] icjck and so on. Example 4.18 Consider the case of independent exponential ran- dom variables X;,...,Xn, all having rate 1. Then, the preceding gives the bound ElmexX;] Set DEUX - ot} ~ Y Elmin(x; - 0), (X; - )*)] 7 roa + YS Blmin((X; — 0), (Xj — 0)*, (Xe — 0)*)] i c, whether min(X;,X;) > ¢, and whether min(X;, Xj, Xx) > c. 
This yields
$$\begin{aligned}
E[(X_i - c)^+] &= e^{-c} \\
E[\min((X_i - c)^+, (X_j - c)^+)] &= \frac{e^{-2c}}{2} \\
E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] &= \frac{e^{-3c}}{3}
\end{aligned}$$
Using the constant $c = \ln(n)$ yields the bound
$$E[\max_i X_i] \le \ln(n) + 1 - \frac{n(n-1)}{4n^2} + \frac{n(n-1)(n-2)}{18n^3} \approx \ln(n) + .806 \quad \text{for } n \text{ large}$$

Example 4.19 Let us reconsider Example 4.17, the coupon collector's problem. Let $c$ be an integer. To compute $E[(X_i - c)^+]$, condition on whether a type $i$ coupon is among the first $c$ collected.
$$\begin{aligned}
E[(X_i - c)^+] &= E[(X_i - c)^+ \mid X_i \le c]\,P(X_i \le c) + E[(X_i - c)^+ \mid X_i > c]\,P(X_i > c) \\
&= E[(X_i - c)^+ \mid X_i > c]\,(1 - p_i)^c \\
&= \frac{(1 - p_i)^c}{p_i}
\end{aligned}$$
Similarly,
$$\begin{aligned}
E[\min((X_i - c)^+, (X_j - c)^+)] &= \frac{(1 - p_i - p_j)^c}{p_i + p_j} \\
E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] &= \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}
\end{aligned}$$
Therefore, for any nonnegative integer $c$,
$$E[X] \le c + \sum_i \frac{(1 - p_i)^c}{p_i} - \sum_{i<j} \frac{(1 - p_i - p_j)^c}{p_i + p_j} + \sum_{i<j<k} \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}$$

4.7 Stochastic Orderings

We say that $X$ is stochastically greater than $Y$, written $X \ge_{st} Y$, if
$$P(X > t) \ge P(Y > t) \quad \text{for all } t$$
In this section, we define and compare some other stochastic orderings of random variables.

If $X$ is a nonnegative continuous random variable with distribution function $F$ and density $f$, then the hazard rate function of $X$ is defined by
$$\lambda_X(t) = f(t)/\bar{F}(t)$$
where $\bar{F}(t) = 1 - F(t)$. Interpreting $X$ as the lifetime of an item, then for $\epsilon$ small
$$P(t \text{ year old item dies within an additional time } \epsilon) = P(X < t + \epsilon \mid X > t) \approx \lambda_X(t)\,\epsilon$$
If $Y$ is a nonnegative continuous random variable with distribution function $G$ and density $g$, say that $X$ is hazard rate order larger than $Y$, written $X \ge_{hr} Y$, if
$$\lambda_X(t) \le \lambda_Y(t) \quad \text{for all } t$$
Say that $X$ is likelihood ratio order larger than $Y$, written $X \ge_{lr} Y$, if
$$f(x)/g(x) \uparrow x$$
that is, if $f(x)/g(x)$ is nondecreasing in $x$, where $f$ and $g$ are the respective densities of $X$ and $Y$.

Proposition 4.20
$$X \ge_{lr} Y \implies X \ge_{hr} Y \implies X \ge_{st} Y$$

Proof. Let $X$ have density $f$, and $Y$ have density $g$.
Suppose X 21, Y Then, for y >< fly) £(z) Fy) = oy) Gy 2 9 (2) [P serane £2 PP owen Ax(z) < Ay(z) To prove the final implication, note first that : - Pf i ax(nat= Foy tt= hehe) implying that or or P(e) =e Bx134 Chapter 4 Bounding Probabilities and Expectations which immediately shows that X >pr Y => X Dot Y. w Define rx(t), the reversed hazard rate function of X, by rx(t) = f(Q)/FO) and note that _ P(t-e< XIX <2) SS ii a Say that X is reverse hazard rate order larger than Y,, written X 2n¥, if rx(t) > ry(t) for allt Our next theorem gives some representations of the orderings of X and Y in terms of stochastically larger relations between certain conditional distributions of X and Y Before presenting it, we intro- duce the following notation. Notation For a random variable X and event A, let [X|A] de- note a random variable whose distribution is that of the conditional distribution of X given A. Theorem 4.21 () X Zar ¥ @ [X|X >t] Dee (YIY > dl for all t (i) X Den ¥ 4 [XIX #] and ¥; =4 [Y-t|Y > ¢]. Noting that = 0; ifst shows that X Shr Y => Xt Dar Ye > Xe Det Ye > [XIX > t] Vee (VY > t] To go the other way, use the identity Fy, =e" Ser" Ax(s)ds4.1 stochastic Orderings 135 to obtain that [X|X >t] 28 ([YIY > 4 Xr 20% => Ax(t) S Av(#) (ii) With X; =a [t- X|X < ¢], Ax(y) =rx(t-y) O XiseX% [t-X|X a VY [ovtnray ° firxt-nav> [ore nney => rx(t) 2 rr(t) (iii) Let X and Y have respective densities f and g. Suppose [X|s< X <#] Su [Yls<¥<¢#] foralls v/s < X P(Y > v/s < ¥ £9, showing that X >i ¥. Now suppose that X >i, Y. Then clearly [X|s < X < t] >i [Yls < Y < ¢], implying, from Proposition 4.20, that [X|s < X < th>e(Y\s<¥ Y > X Den Y >X 2H Y Proof. The first implication immediately follows from parts (ii) and (iii) of Theorem 4.21. The second implication follows upon taking the limit as t + oo in part (ii) of that theorem. m Say that X is an increasing hazard rate (or IHR) random variable if Ax(t) is nondecreasing in t. 
(Other terminology is to say that X has an increasing failure rate.) Proposition 4.23 Let X; =a (X —t|X >t]. Then, (a) Xe ln ast? « XisJHR (b) Xi le ast? +> log f(z) is concave Proof. (a) Let A(y) be the hazard rate function of X. Then d,(y), the hazard rate function of X; is given by A(y) =AlE+y), y>O Hence, if X is THR then X; Ine ¢, implying that X_ Jee t. Now, let 8 s Xt. Then, eo MWY = P(X, > €) S$ P(Xp > 6) =e Fe Mey4.8 Excercises 137 showing that A(t) > (s). Thus part (a) is proved. (b) Using that the density of X; is fx,(z) = f(x +t)/F(t) yields, for s < t, that f(ct+s) far '* © log f(x+s)—log f(z+t)tz fats) _ fet). 4 Xs 2 Xt Fats) flatt) = fy) ° Fw tY ° five f @)ly © log f(y) is concave 4.8 Excercises 1. For a nonnegative random variable X, show that (E[X™})!/" is nondecreasing in n. 2. Let X be as standard normal random variable. Use Corollary 4.4, along with the density g(x) =ze""?, 2 >0 to show, for c > 0, that (a) P(X > 0) = Fee“ Eyl Z1X > (b) Show, for any positive random variable W, that 1 Bil > d < BL (©) Show that PUL > 0) | ote” ViRP(X > c)138 a Chapter 4 Bounding Probabilities and Expectations (e) Use Jensen’s inequality, along with the preceding, to show that eP(X > 0) +e? VIm(P(X > c))? > me (£) Argue from (e) that P(X > c) must be at least as large as the positive root of the equation crt ef VTma? = oe (g) Conclude that P(X >o)> Sr (V2 +4-c)e*? . Let X be a Poisson random variable with mean 4. Show that, for n > A, the Chernoff bound yields that e->(Ae)" OCS.) Se . Let m(t) = E[X*]. The moment bound states that for c > 0 P(X 2c) < m(t)c* for all t > 0. Show that this result can be obtained from the importance sampling identity. . Fill in the details of the proof that, for independent Bernoulli random variables X1,...,Xn, and c> 0, P(S - E[S] < -c) < e?"/" where $= 2, X, . 
If X is a binomial random variable with parameters n and p, show (a) P(X = np| > ce) < 2672 (b) P(X — np > anp) < exp{—2np*a*} Give the details of the proof of (4.7).4.8 Excercises 139 8. Prove that Elf(X)] 2 ELF(E[X1¥))] 2 F(ELX)) Suppose you want a lower bound on E[f(X)] for a convex function f. The preceding shows that first conditioning on Y and then applying Jensen's inequality to the individual terms Elf(X)|Y = y] results in 0 larger lower bound than does an immediate application of Jensen’s inequality. 9. Let X; be binary random variables with parameters p;,i = 1,...)2. Let X = 30%, Xi, and also let I, independent of the variables X;,...,Xn, be equally likely to be any of the values 1,...)n. For R independent of I, show that (a) PU = {Xr = 1) = pi/E[X] (b) E[XR] = E[X]E[R|Xz = 1] (©) P(X > 0) = E[X] E[X|X1 = 1] 10. For X; and X as in Exercise 9, show that Pi (E[x1? x BRIX = 1 = BUX? ‘Thus, for sums of binary variables, the conditional expectation inequality yields a stronger lower bound than does the second moment inequality. Hint: Make use of the results of Exercises 8 and 9. 11. Let X; be exponential with mean 8+ 2i, for i = 1,2,3. Obtain an upper bound on E[max X;], and compare it with the exact result when the X; are independent. 12, Let Uj, i =1,...,n be uniform (0, 1) random variables. Obtain an upper bound on E[max Uj], and compare it with the exact result when the U; are independent. 13. Let U; and Uz be uniform (0,1) random variables. Obtain an upper bound on E[max(Ui,Us)], and show this maximum is obtained when Uj = 1— Up.140 14, 15. 16. Chapter 4 Bounding Probabilities and Expectations Show that X >p, Y if and only if P(X >t), PY >#) P(X > s) ~ P(Y >s) for all s A(y,z) whenever > y (a) Show that if X and Y are independent and X >i, Y then A(X,Y) Dat h(Y,X). (b) Show by a counterexample that the preceding is not valid under the weaker condition X >yt Y. There are n jobs, with job i requiring a random time X; to process. 
The jobs must be processed sequentially. Give a sufficient condition, the weaker the better, under which the policy of processing jobs in the order $1, 2, \ldots, n$ maximizes the probability that at least $k$ jobs are processed by time $t$ for all $k$ and $t$.

Chapter 5 Markov Chains

5.1 Introduction

This chapter introduces a natural generalization of a sequence of independent random variables, called a Markov chain, where a variable may depend on the immediately preceding variable in the sequence. Named after the 19th century Russian mathematician Andrei Andreyevich Markov, these chains are widely used as simple models of more complex real-world phenomena.

Given a sequence of discrete random variables $X_0, X_1, X_2, \ldots$ taking values in some finite or countably infinite set $S$, we say that $X_n$ is a Markov chain with respect to a filtration $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$ and, for all $B \subseteq S$, we have the Markov property
$$P(X_{n+1} \in B \mid \mathcal{F}_n) = P(X_{n+1} \in B \mid X_n)$$
If we interpret $X_n$ as the state of the chain at time $n$, then the preceding means that if you know the current state, nothing else from the past is relevant to the future of the Markov chain. That is, given the present state, the future states and the past states are independent. When we let $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ this definition reduces to
$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$$
If $P(X_{n+1} = j \mid X_n = i)$ is the same for all $n$, we say that the Markov chain has stationary transition probabilities, and we set
$$P_{ij} = P(X_{n+1} = j \mid X_n = i)$$
In this case the quantities $P_{ij}$ are called the transition probabilities, and specifying them along with a probability distribution for the starting state $X_0$ is enough to determine all probabilities concerning $X_0, \ldots, X_n$. We will assume from here on that all Markov chains considered have stationary transition probabilities.
In addition, unless otherwise noted, we will assume that $S$, the set of all possible states of the Markov chain, is the set of nonnegative integers.

Example 5.1 Reflected random walk. Suppose $Y_i$ are independent and identically distributed Bernoulli($p$) random variables, and let $X_0 = 0$ and $X_n = (X_{n-1} + 2Y_n - 1)^+$ for $n = 1, 2, \ldots$. The process $X_n$, called a reflected random walk, can be viewed as the position of a particle at time $n$ such that at each time the particle has a $p$ probability of moving one step to the right, a $1 - p$ probability of moving one step to the left, and is returned to position 0 if it ever attempts to move to the left of 0. It is immediate from its definition that $X_n$ is a Markov chain.

Example 5.2 A non-Markov chain. Again let $Y_i$ be independent and identically distributed Bernoulli($p$) random variables, let $X_0 = 0$, and this time let $X_n = Y_n + Y_{n-1}$ for $n = 1, 2, \ldots$. It's easy to see that $X_n$ is not a Markov chain because $P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 2) = 0$ while on the other hand $P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 0) = p$.

5.2 The Transition Matrix

The transition probabilities
$$P_{ij} = P(X_1 = j \mid X_0 = i)$$
are also called the one-step transition probabilities. We define the $n$-step transition probabilities by
$$P^{(n)}_{ij} = P(X_n = j \mid X_0 = i)$$
In addition, we define the transition probability matrix
$$P = \begin{bmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix}$$
and the $n$-step transition probability matrix
$$P^{(n)} = \begin{bmatrix} P^{(n)}_{00} & P^{(n)}_{01} & P^{(n)}_{02} & \cdots \\ P^{(n)}_{10} & P^{(n)}_{11} & P^{(n)}_{12} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix}$$
An interesting relation between these matrices is obtained by noting that
$$\begin{aligned}
P^{(n+m)}_{ij} &= \sum_k P(X_{n+m} = j \mid X_0 = i, X_n = k)\,P(X_n = k \mid X_0 = i) \\
&= \sum_k P^{(m)}_{kj} P^{(n)}_{ik}
\end{aligned}$$
The preceding are called the Chapman-Kolmogorov equations. It follows from the Chapman-Kolmogorov equations that
$$P^{(n+m)} = P^{(n)} \times P^{(m)}$$
where here "$\times$" represents matrix multiplication. Hence,
$$P^{(2)} = P \times P$$
and, by induction,
$$P^{(n)} = P^n$$
where the right-hand side represents multiplying the matrix $P$ by itself $n$ times.

Example 5.3 Reflected random walk.
A particle starts at position zero and at each time moves one position to the right with probability $p$ and, if the particle is not in position zero, moves one position to the left (or remains in state 0) with probability $1 - p$. The position $X_n$ of the particle at time $n$ forms a Markov chain with transition matrix
$$P = \begin{bmatrix} 1-p & p & 0 & 0 & 0 & \cdots \\ 1-p & 0 & p & 0 & 0 & \cdots \\ 0 & 1-p & 0 & p & 0 & \cdots \\ \vdots & & & & & \end{bmatrix}$$

Example 5.4 The two state Markov chain. Consider a Markov chain with states 0 and 1 having transition probability matrix
$$P = \begin{bmatrix} \alpha & 1 - \alpha \\ \beta & 1 - \beta \end{bmatrix}$$
The two-step transition probability matrix is given by
$$P^{(2)} = \begin{bmatrix} \alpha^2 + (1 - \alpha)\beta & 1 - \alpha^2 - (1 - \alpha)\beta \\ \alpha\beta + \beta(1 - \beta) & 1 - \alpha\beta - \beta(1 - \beta) \end{bmatrix}$$

5.3 The Strong Markov Property

Consider a Markov chain $X_n$ having one-step transition probabilities $P_{ij}$, which means that if the Markov chain is in state $i$ at a fixed time $n$, then the next state will be $j$ with probability $P_{ij}$. However, it is not necessarily true that if the Markov chain is in state $i$ at a randomly distributed time $T$, the next state will be $j$ with probability $P_{ij}$. That is, if $T$ is an arbitrary nonnegative integer valued random variable, it is not necessarily true that $P(X_{T+1} = j \mid X_T = i) = P_{ij}$. For a simple counterexample suppose
$$T = \min(n : X_n = i, X_{n+1} = j)$$
Then, clearly,
$$P(X_{T+1} = j \mid X_T = i) = 1$$
The idea behind this counterexample is that a general random variable $T$ may depend not only on the states of the Markov chain up to time $T$ but also on future states after time $T$. Recalling that $T$ is a stopping time for a filtration $\mathcal{F}_n$ if $\{T = n\} \in \mathcal{F}_n$ for every $n$, we see that for a stopping time the value of $T$ can only depend on the states up to time $T$. We now show that $P(X_{T+n} = j \mid X_T = i)$ will equal $P^{(n)}_{ij}$ provided that $T$ is a stopping time. This is usually called the strong Markov property, and essentially means a Markov chain "starts over" at stopping times. Below we define $\mathcal{F}_T = \{A : A \cap \{T = t\} \in \mathcal{F}_t \text{ for all } t\}$, which intuitively represents any information you would know by time $T$.

Proposition 5.5 The strong Markov property.
Let $X_n$, $n \ge 0$, be a Markov chain with respect to the filtration $\mathcal{F}_n$. If $T < \infty$ a.s. is a stopping time with respect to $\mathcal{F}_n$, then
$$P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T) = P^{(n)}_{ij}$$

Proof.
$$\begin{aligned}
P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T, T = t) &= P(X_{t+n} = j \mid X_t = i, \mathcal{F}_t, T = t) \\
&= P(X_{t+n} = j \mid X_t = i, \mathcal{F}_t) \\
&= P^{(n)}_{ij}
\end{aligned}$$
where the next to last equality used the fact that $T$ is a stopping time to give $\{T = t\} \in \mathcal{F}_t$.

Example 5.6 Losses in queueing busy periods. Consider a queueing system where $X_n$ is the number of customers in the system at time $n$. At each time $n = 1, 2, \ldots$ there is a probability $p$ that a customer arrives and a probability $(1 - p)$, if there are any customers present, that one of them departs. Starting with $X_0 = 1$, let $T = \min\{t > 0 : X_t = 0\}$ be the length of a busy period. Suppose also there is only space for at most $m$ customers in the system. Whenever a customer arrives to find $m$ customers already in the system, the customer is lost and departs immediately. Letting $N_m$ be the number of customers lost during a busy period, compute $E[N_m]$.

Solution. Let $A$ be the event that the first arrival occurs before the first departure. We will obtain $E[N_m]$ by conditioning on whether $A$ occurs. Now, when $A$ happens, for the busy period to end we must first wait an interval of time until the system goes back to having a single customer, and then after that we must wait another interval of time until the system becomes completely empty. By the Markov property the number of losses during the first time interval has distribution $N_{m-1}$, because we are now starting with two customers and therefore with only $m - 1$ spaces for additional customers. The strong Markov property tells us that the number of losses in the second time interval has distribution $N_m$. We therefore have
$$E[N_m \mid A] = \begin{cases} E[N_{m-1}] + E[N_m] & \text{if } m > 1 \\ 1 + E[N_m] & \text{if } m = 1 \end{cases}$$
and using $P(A) = p$ and $E[N_m \mid A^c] = 0$ we have
$$\begin{aligned}
E[N_m] &= E[N_m \mid A]\,P(A) + E[N_m \mid A^c]\,P(A^c) \\
&= pE[N_{m-1}] + pE[N_m]
\end{aligned}$$
for $m > 1$, along with
$$E[N_1] = p + pE[N_1]$$
and thus
$$E[N_m] = \Big(\frac{p}{1 - p}\Big)^m$$
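The closed form for $E[N_m]$ can be cross-checked with a short sketch that does not use the conditioning argument above. It instead writes $e_k$ for the expected number of losses before the system empties when $k$ customers are present, solves the first-step equations $e_k = p\,e_{k+1} + (1-p)\,e_{k-1}$ for $0 < k < m$ together with the boundary equation $e_m = p(1 + e_m) + (1-p)\,e_{m-1}$ and $e_0 = 0$, and returns $e_1$. The function name `expected_losses` and the elimination scheme are illustrative assumptions, and $0 < p < 1$ is assumed.

```python
from fractions import Fraction

def expected_losses(p, m):
    """Expected losses in a busy period with capacity m, starting from 1 customer.

    Sweeps k = m, m-1, ..., 1 keeping the representation
    e_k = alpha * e_{k-1} + beta, then uses e_0 = 0.
    """
    p = Fraction(p)
    q = 1 - p
    # Boundary k = m: e_m = p*(1 + e_m) + q*e_{m-1}, so e_m = e_{m-1} + p/q.
    alpha, beta = Fraction(1), p / q
    for _ in range(m - 1):
        # Interior k: e_k = p*e_{k+1} + q*e_{k-1} with e_{k+1} = alpha*e_k + beta.
        denom = 1 - p * alpha
        alpha, beta = q / denom, p * beta / denom
    return beta  # e_1 = alpha * e_0 + beta with e_0 = 0

p = Fraction(1, 3)
value = expected_losses(p, 3)        # exact rational value
closed_form = (p / (1 - p)) ** 3     # (p / (1 - p))^m from the example
```

Exact rational arithmetic is used so that the comparison with the closed form is not blurred by rounding.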
It's quite interesting to notice that $E[N_m]$ increases in $m$ when $p > 1/2$, decreases when $p < 1/2$, and stays constant for all $m$ when $p = 1/2$. The intuition for the case $p = 1/2$ is that when $m$ increases, losses become less frequent but the busy period becomes longer.

We next apply the strong Markov property to obtain a result for the cover time, the time when all states of a Markov chain have been visited.

Proposition 5.7 Cover times. Given an $N$-state Markov chain $X_n$, let $T_i = \min\{n > 0 : X_n = i\}$ and let $C = \max_i T_i$ be the cover time. Then
$$E[C] \le \sum_{m=1}^N \frac{1}{m} \max_{i,j} E[T_j \mid X_0 = i]$$

Proof. Let $I_1, I_2, \ldots, I_N$ be a random permutation of the integers $1, 2, \ldots, N$ chosen so that all possible orderings are equally likely. Letting $T_{I_0} = 0$ and noting that $\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}$ is the additional time after all of the states $I_1, \ldots, I_{m-1}$ have been visited until all of the states $I_1, \ldots, I_m$ have been visited, we see that
$$C = \sum_{m=1}^N \Big(\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}\Big)$$
Thus,
$$\begin{aligned}
E[C] &= \sum_{m=1}^N E\Big[\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j}\Big] \\
&= \sum_{m=1}^N \frac{1}{m}\, E\Big[\max_{j \le m} T_{I_j} - \max_{j \le m-1} T_{I_j} \,\Big|\, T_{I_m} > \max_{j \le m-1} T_{I_j}\Big] \\
&\le \sum_{m=1}^N \frac{1}{m} \max_{i,j} E[T_j \mid X_0 = i]
\end{aligned}$$
where the second line follows since all orderings are equally likely and thus $P(T_{I_m} > \max_{j \le m-1} T_{I_j}) = 1/m$, and the third from the strong Markov property.

5.4 Classification of States

We say that states $i, j$ of a Markov chain communicate with each other, or are in the same class, if there are integers $n$ and $m$ such that both $P^{(n)}_{ij} > 0$ and $P^{(m)}_{ji} > 0$ hold. This means that it is possible for the chain to get from $i$ to $j$ and vice versa. A Markov chain is called irreducible if all states are in the same class.

For a Markov chain $X_n$, let
$$T_i = \min\{n > 0 : X_n = i\}$$
be the time until the Markov chain first makes a transition into state $i$. Using the notation $E_i[\cdots]$ and $P_i(\cdots)$ to denote that the Markov chain starts from state $i$, let
$$f_i = P_i(T_i < \infty)$$
be the probability that the chain ever makes a transition into state $i$ given that it starts in state $i$. We say that state $i$ is transient if $f_i < 1$ and recurrent if $f_i = 1$. Let
$$N_i = \sum_{n=1}^\infty I\{X_n = i\}$$
be the total number of transitions into state $i$.
The strong Markov property tells us that, starting in state $i$,
$$N_i + 1 \sim \text{geometric}(1 - f_i)$$
since each time the chain makes a transition into state $i$ there is, independent of all else, a $(1 - f_i)$ chance it will never return. Consequently,
$$E_i[N_i] = \sum_{n=1}^\infty P^{(n)}_{ii}$$
is infinite if state $i$ is recurrent and finite if state $i$ is transient.

Proposition 5.8 If state $i$ is recurrent and $i$ communicates with $j$, then $j$ is also recurrent.

Proof. Because $i$ and $j$ communicate, there exist values $n$ and $m$ such that $P^{(n)}_{ij} P^{(m)}_{ji} > 0$. But for any $k > 0$
$$P^{(n+m+k)}_{jj} \ge P^{(m)}_{ji} P^{(k)}_{ii} P^{(n)}_{ij}$$
where the preceding follows because $P^{(n+m+k)}_{jj}$ is the probability, starting in state $j$, that the chain will be back in $j$ after $n + m + k$ transitions, whereas $P^{(m)}_{ji} P^{(k)}_{ii} P^{(n)}_{ij}$ is the probability of the same event occurring but with the additional condition that the chain must also be in $i$ after the first $m$ transitions and then back in $i$ after an additional $k$ transitions. Summing over $k$ shows that
$$E_j[N_j] \ge \sum_k P^{(n+m+k)}_{jj} \ge P^{(m)}_{ji} P^{(n)}_{ij} \sum_k P^{(k)}_{ii} = \infty$$
Thus, $j$ is also recurrent.

Proposition 5.9 If $j$ is transient, then $\sum_{n=1}^\infty P^{(n)}_{ij} < \infty$.

Proof. Note that
$$E_i[N_j] = E_i\Big[\sum_{n=1}^\infty I\{X_n = j\}\Big] = \sum_{n=1}^\infty P^{(n)}_{ij}$$
Let $f_{ij}$ denote the probability that the chain ever makes a transition into $j$ given that it starts at $i$. Then, conditioning on whether such a transition ever occurs yields, upon using the strong Markov property, that
$$E_i[N_j] = (1 + E_j[N_j])\,f_{ij} < \infty$$
since $j$ is transient.

If $i$ is recurrent, let
$$\mu_i = E_i[T_i]$$
denote the mean number of transitions it takes the chain to return to state $i$, given it starts in $i$. We say that a recurrent state $i$ is null if $\mu_i = \infty$ and positive if $\mu_i < \infty$. In the next section we will show that positive recurrence is a class property, meaning that if $i$ is positive recurrent and communicates with $j$ then $j$ is also positive recurrent. (This also implies, using that recurrence is a class property, that so is null recurrence.)
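For a finite chain, the classes defined in this section can be found mechanically from the pattern of positive entries of the transition matrix. The sketch below is one possible way to do so, using Warshall's transitive closure on the relation "$j$ is reachable from $i$". It assumes, as a convention not spelled out in the text, that $n = 0$ is allowed in the definition, so every state communicates with itself; the function name `communicating_classes` and the example matrix are hypothetical.

```python
def communicating_classes(P):
    """Partition states 0..n-1 of a finite chain into communicating classes.

    P is a list of rows of one-step transition probabilities.
    """
    n = len(P)
    # reach[i][j]: j can be reached from i in zero or more transitions.
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):                      # Warshall's transitive closure
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    classes, seen = [], set()
    for i in range(n):
        if i not in seen:
            # States that reach i and are reached from i share i's class.
            cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
            seen.update(cls)
            classes.append(cls)
    return classes

# Hypothetical 3-state chain: state 2 can enter {0, 1} but never be re-entered.
P = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [0.3, 0.3, 0.4]]
classes = communicating_classes(P)
irreducible = len(classes) == 1
```

In this example state 2 forms a class by itself and is transient, since the chain leaves it with positive probability and cannot return.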
5.5 Stationary and Limiting Distributions

For a Markov chain $X_n$ starting in some given state $i$, we define the limiting probability of being in state $j$ to be
$$P_j = \lim_{n \to \infty} P^{(n)}_{ij}$$
if the limit exists and is the same for all $i$.

It is easy to see that not all Markov chains will have limiting probabilities. For instance, consider the two state Markov chain with $P_{01} = P_{10} = 1$. For this chain $P^{(n)}_{00}$ will equal 1 when $n$ is even and 0 when $n$ is odd, and so has no limit.

Definition 5.10 State $i$ of a Markov chain $X_n$ is said to have period $d$ if $d$ is the largest integer having the property that $P^{(n)}_{ii} = 0$ when $n$ is not a multiple of $d$.

Proposition 5.11 If states $i$ and $j$ communicate, then they have the same period.

Proof. Let $d_k$ be the period of state $k$. Let $n, m$ be such that $P^{(n)}_{ij} P^{(m)}_{ji} > 0$. Now, if $P^{(r)}_{ii} > 0$, then
$$P^{(r+n+m)}_{jj} \ge P^{(m)}_{ji} P^{(r)}_{ii} P^{(n)}_{ij} > 0$$
So $d_j$ divides $r + n + m$. Moreover, because
$$P^{(2r)}_{ii} \ge P^{(r)}_{ii} P^{(r)}_{ii} > 0$$
the same argument shows that $d_j$ also divides $2r + n + m$; therefore $d_j$ divides $2r + n + m - (r + n + m) = r$. Because $d_j$ divides $r$ whenever $P^{(r)}_{ii} > 0$, it follows that $d_j$ divides $d_i$. But the same argument can now be used to show that $d_i$ divides $d_j$. Hence, $d_i = d_j$.

It follows from the preceding that all states of an irreducible Markov chain have the same period. If the period is 1, we say that the chain is aperiodic. It's easy to see that only aperiodic chains can have limiting probabilities.

Intimately linked to limiting probabilities are stationary probabilities. The probability vector $\pi_i$, $i \in S$, is said to be a stationary probability vector for the Markov chain if
$$\pi_j = \sum_i \pi_i P_{ij} \text{ for all } j, \qquad \sum_j \pi_j = 1$$
Its name arises from the fact that if $X_0$ is distributed according to a stationary probability vector $\{\pi_i\}$ then
$$P(X_1 = j) = \sum_i P(X_1 = j \mid X_0 = i)\,\pi_i = \sum_i \pi_i P_{ij} = \pi_j$$
and, by a simple induction argument,
$$P(X_n = j) = \sum_i P(X_n = j \mid X_{n-1} = i)\,P(X_{n-1} = i) = \sum_i \pi_i P_{ij} = \pi_j$$
Consequently, if we start the chain with a stationary probability vector then $X_n, X_{n+1}, \ldots$ has the same probability distribution for all $n$.
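The stationarity equations are easy to check by hand for a two state chain, and the following sketch does so with exact arithmetic. It assumes the chain is parameterized as $P = [[\alpha, 1-\alpha], [\beta, 1-\beta]]$ with $1 - \alpha + \beta \ne 0$; the helper names `step` and `two_state_stationary` and the numerical values of $\alpha$ and $\beta$ are illustrative choices, not taken from the text.

```python
from fractions import Fraction

def step(dist, P):
    """Distribution of X_{n+1} when X_n has distribution dist: (dist P)_j."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def two_state_stationary(alpha, beta):
    """Solve pi_0 (1 - alpha) = pi_1 beta together with pi_0 + pi_1 = 1."""
    total = 1 - alpha + beta
    return [beta / total, (1 - alpha) / total]

alpha, beta = Fraction(7, 10), Fraction(2, 5)
P = [[alpha, 1 - alpha], [beta, 1 - beta]]
pi = two_state_stationary(alpha, beta)   # [4/7, 3/7] for these values

stays_put = step(pi, P) == pi            # stationary start is preserved
moved = step([Fraction(1), Fraction(0)], P)  # a non-stationary start changes
```

Starting from `pi`, one step of the chain returns `pi` again, whereas starting from state 0 with probability 1 gives a different distribution after one step.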
The following result will be needed later.

Proposition 5.12 An irreducible transient Markov chain does not have a stationary probability vector.

Proof. Assume there is a stationary probability vector $\pi_i$, $i \ge 0$, and take it to be the probability mass function of $X_0$. Then, for any $j$
$$\pi_j = P(X_n = j) = \sum_i \pi_i P^{(n)}_{ij}$$
Consequently, for any $m$
$$\begin{aligned}
\pi_j &= \lim_{n \to \infty} \Big(\sum_{i \le m} \pi_i P^{(n)}_{ij} + \sum_{i > m} \pi_i P^{(n)}_{ij}\Big) \\
&\le \lim_{n \to \infty} \Big(\sum_{i \le m} \pi_i P^{(n)}_{ij} + \sum_{i > m} \pi_i\Big) \\
&= \sum_{i \le m} \pi_i \lim_{n \to \infty} P^{(n)}_{ij} + \sum_{i > m} \pi_i \\
&= \sum_{i > m} \pi_i
\end{aligned}$$
where the final equality used that $\sum_n P^{(n)}_{ij} < \infty$ because $j$ is transient, implying that $\lim_{n \to \infty} P^{(n)}_{ij} = 0$. Letting $m \to \infty$ shows that $\pi_j = 0$ for all $j$, contradicting the fact that $\sum_j \pi_j = 1$. Thus, assuming a stationary probability vector results in a contradiction, proving the result.

The following theorem is of key importance.

Theorem 5.13 An irreducible Markov chain has a stationary probability vector $\{\pi_i\}$ if and only if all states are positive recurrent. The stationary probability vector is unique and satisfies
$$\pi_j = 1/\mu_j$$
Moreover, if the chain is aperiodic then
$$\pi_j = \lim_{n \to \infty} P^{(n)}_{ij}$$
To prove the preceding theorem, we will make use of a couple of lemmas.

Lemma 5.14 For an irreducible Markov chain, if there exists a stationary probability vector $\{\pi_i\}$ then all states are positive recurrent. Moreover, the stationary probability vector is unique and satisfies
$$\pi_j = 1/\mu_j$$

Proof. Let $\pi_j$ be stationary probabilities, and suppose that $P(X_0 = j) = \pi_j$ for all $j$. We first show that $\pi_i > 0$ for all $i$. To verify this, suppose that $\pi_k = 0$. Now for any state $j$, because the chain is irreducible, there is an $n$ such that $P^{(n)}_{jk} > 0$. Because $X_0$ is determined by the stationary probabilities,
$$\pi_k = P(X_n = k) = \sum_i \pi_i P^{(n)}_{ik} \ge \pi_j P^{(n)}_{jk}$$
Consequently, if $\pi_k = 0$ then so is $\pi_j$. Because $j$ was arbitrary, that means that if $\pi_i = 0$ for any $i$, then $\pi_i = 0$ for all $i$. But that would contradict the fact that $\sum_i \pi_i = 1$. Hence any stationary probability vector for an irreducible Markov chain must have all positive elements.
Now, recall that T; = min(n > 0: Xp, = j). So, ny = E(T;|Xo = J] co = YPC; = lx = 5) = = PPG 20 Xo= 5) ~ £4 P(X = 3) Because Xo is chosen according to the stationary probability vector {mi}, this gives oo TyHy = D> PUT; > 0, Xo = 3) (5.1) n=1 Now, P(T; 2 1, Xo = j) = P(Xo =) = 35 and, for n > 2 P(T; > ,X0=3) P(Xi #51 Si 1) = 0. Because P(X; # j) = 1-7, we thus obtain m5 = Wy thus showing that there is at most one stationary probability vector. In addition, because all 7; > 0, we have that all yj < 00, showing that all states of the chain are positive recurrent. ™ Lemma 5.15 If some state of an irreducible Markov chain is positive recurrent then there exists a stationary probability vector. Proof. Suppose state k is positive recurrent. Thus, Be = ExT] < 00 Say that a new cycle begins every time the chain makes a transition into state k. For any state j, let A; denote the amount of time the chain spends in state j during a cycle. Then = ExlQ exn=s.t.>0)] n=0 0 DY Fillexneynenyl n=0 e Do Pi(Xn = HT > 1) n=0 E[Aj] ' We claim that 7; = E[Aj)/u, 7 > 0, is a stationary probability vector. Because E[2, Aj] is the expected time of a cycle, it must equal Ey [Tx], showing that154 Chapter 5 Markov Chains Moreover, for j # k many = SO Pe(Xn = 5, Tk > 2) nz0 = VY Phe = 4,7 > 2-1, Xe =4) n21 i = VY Ph > 2-1, Xi =i) n2zli xPy(Xn = jlTk > 2-1, Xn-1 = 1) SY Pee > 2-1, Xn = Py 7 a3l = VY PelTe > 0, Xn = i) Py 7 n30 = EIAIP, 7 = be Py Finally, Mahe = Om - Py) 7 7 jek =1-mes tk t =1- SE 5 afk = tt and the proof is complete. ™ Note that Lemmas (5.14) and (5.15) imply the following. Corollary 5.16 If one state of an irreducible Markov chain is positive recurrent then all states are positive recurrent. ‘We are now ready to prove Theorem 5.13. Proof. All that remains to be proven is that if the chain is aperi- odic, as well as irreducible and positive recurrent, then the stationary5. 
probabilities are also limiting probabilities. To prove this, let $\pi_i$, $i \ge 0$, be stationary probabilities. Let $X_n$, $n \ge 0$, and $Y_n$, $n \ge 0$, be independent Markov chains, both with transition probabilities $P_{i,j}$, but with $X_0 = i$ and with $P(Y_0 = i) = \pi_i$. Let
$$N = \min(n : X_n = Y_n)$$
We first show that $P(N < \infty) = 1$. To do so, consider the Markov chain whose state at time $n$ is $(X_n, Y_n)$, and which thus has transition probabilities
$$P_{(i,j),(k,r)} = P_{ik} P_{jr}$$
That this chain is irreducible can be seen by the following argument. Because $\{X_n\}$ is irreducible and aperiodic, it follows that for any state $i$ there are relatively prime integers $n, m$ such that $P^{(n)}_{ii} P^{(m)}_{ii} > 0$. But any sufficiently large integer can be expressed as a linear combination of relatively prime integers, implying that there is an integer $N_i$ such that
$$P^{(n)}_{ii} > 0 \quad \text{for all } n > N_i$$
Because $i$ and $j$ communicate this implies the existence of an integer $N_{i,j}$ such that
$$P^{(n)}_{ij} > 0 \quad \text{for all } n > N_{i,j}$$
Hence,
$$P^{(n)}_{(i,j),(k,r)} = P^{(n)}_{ik} P^{(n)}_{jr} > 0 \quad \text{for all sufficiently large } n$$
which shows that the vector chain $(X_n, Y_n)$ is irreducible. In addition, we claim that $\pi_{i,j} = \pi_i \pi_j$ is a stationary probability vector, which is seen from
$$\pi_i \pi_j = \sum_k \pi_k P_{k,i} \sum_r \pi_r P_{r,j} = \sum_{k,r} \pi_k \pi_r P_{k,i} P_{r,j}$$
By Lemma 5.14 this shows that the vector Markov chain is positive recurrent and so $P(N < \infty) = 1$ and thus $\lim_n P(N > n) = 0$.

Now, let $Z_n = X_n$ if $n < N$ and let $Z_n = Y_n$ if $n \ge N$. It is easy to see that $Z_n$, $n \ge 0$, is also a Markov chain with transition probabilities $P_{i,j}$ and has $Z_0 = i$. Now
$$\begin{aligned}
P^{(n)}_{i,j} &= P(Z_n = j) \\
&= P(Z_n = j, N \le n) + P(Z_n = j, N > n) \\
&= P(Y_n = j, N \le n) + P(Z_n = j, N > n) \\
&\le P(Y_n = j) + P(N > n) \\
&= \pi_j + P(N > n) \qquad (5.2)
\end{aligned}$$
On the other hand
$$\begin{aligned}
\pi_j &= P(Y_n = j) \\
&= P(Y_n = j, N \le n) + P(Y_n = j, N > n) \\
&= P(Z_n = j, N \le n) + P(Y_n = j, N > n) \\
&\le P(Z_n = j) + P(N > n) \\
&= P^{(n)}_{ij} + P(N > n) \qquad (5.3)
\end{aligned}$$
Hence, from (5.2) and (5.3) we see that
$$\lim_{n \to \infty} P^{(n)}_{ij} = \pi_j$$
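The conclusion of the theorem, and the need for aperiodicity, can be seen numerically by raising a transition matrix to a high power. The sketch below uses plain Python lists; `mat_mul` and `mat_power` are hypothetical helper names, the matrix `aperiodic` is an arbitrary two state example whose stationary vector works out to $(4/7, 3/7)$, and `periodic` is the chain with $P_{01} = P_{10} = 1$.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(P, n):
    """n-step transition matrix P^(n) = P^n, by repeated multiplication."""
    size = len(P)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

aperiodic = [[0.7, 0.3], [0.4, 0.6]]   # irreducible and aperiodic
periodic = [[0.0, 1.0], [1.0, 0.0]]    # irreducible with period 2

limit = mat_power(aperiodic, 50)       # both rows are close to (4/7, 3/7)
even = mat_power(periodic, 10)         # identity matrix
odd = mat_power(periodic, 11)          # rows swapped, so no limit exists
```

For the aperiodic chain the rows of the high power agree with each other, reflecting that the limit does not depend on the starting state; for the periodic chain the powers alternate between two matrices.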
Remark 5.17 It follows from Theorem 5.13 that if we have an irreducible Markov chain, and we can find a solution of the stationarity equations

$$\pi_j = \sum_i \pi_i P_{ij}, \quad j \ge 0$$
$$\sum_j \pi_j = 1$$

then the Markov chain is positive recurrent, and the $\pi_i$ are the unique stationary probabilities. If, in addition, the chain is aperiodic then the $\pi_i$ are also limiting probabilities.

Remark 5.18 Because $\mu_i$ is the mean number of transitions between successive visits to state $i$, it is intuitive (and will be formally proven in the chapter on renewal theory) that the long run proportion of time that the chain spends in state $i$ is equal to $1/\mu_i$. Hence, the stationary probability $\pi_i$ is equal to the long run proportion of time that the chain spends in state $i$.

Definition 5.19 A positive recurrent, aperiodic, irreducible Markov chain is called an ergodic Markov chain.

Definition 5.20 A positive recurrent irreducible Markov chain whose initial state is distributed according to its stationary probabilities is called a stationary Markov chain.

5.6 Time Reversibility

A stationary Markov chain $X_n$ is called time reversible if

$$P(X_n = j \mid X_{n+1} = i) = P(X_{n+1} = j \mid X_n = i) \quad \text{for all } i, j$$

By the Markov property we know that the processes $X_{n+1}, X_{n+2}, \ldots$ and $X_{n-1}, X_{n-2}, \ldots$ are conditionally independent given $X_n$, and so it follows that the reversed process $X_{n-1}, X_{n-2}, \ldots$ will also be a Markov chain having transition probabilities

$$P(X_n = j \mid X_{n+1} = i) = \frac{P(X_n = j,\ X_{n+1} = i)}{P(X_{n+1} = i)} = \frac{\pi_j P_{ji}}{\pi_i}$$

where $\pi_i$ and $P_{ij}$ respectively denote the stationary probabilities and the transition probabilities for the Markov chain $X_n$. Thus an equivalent definition for $X_n$ being time reversible is if

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j$$

Intuitively, a Markov chain is time reversible if it looks the same running backwards as it does running forwards. It also means that the rate of transitions from $i$ to $j$, namely $\pi_i P_{ij}$, is the same as the rate of transitions from $j$ to $i$, namely $\pi_j P_{ji}$.
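As a small illustration of the condition $\pi_i P_{ij} = \pi_j P_{ji}$, here is a sketch in Python that is not part of the original text. The two matrices, their stationary vectors, and the function names are assumptions made for the example. `reversed_chain` builds the reversed transition probabilities $\pi_j P_{ji} / \pi_i$ derived above.

```python
def is_time_reversible(P, pi, tol=1e-12):
    """True when pi_i * P_ij equals pi_j * P_ji for every pair of states."""
    size = len(P)
    return all(
        abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
        for i in range(size)
        for j in range(size)
    )


def reversed_chain(P, pi):
    """Transition matrix of the reversed process: Q_ij = pi_j * P_ji / pi_i."""
    size = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(size)] for i in range(size)]


# Hypothetical examples, chosen for illustration only.
# A chain that only moves between neighbouring states on a line.
BIRTH_DEATH = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
BIRTH_DEATH_PI = [0.25, 0.50, 0.25]

# A chain on three states arranged in a cycle, biased toward one direction.
BIASED_CYCLE = [
    [0.0, 0.7, 0.3],
    [0.3, 0.0, 0.7],
    [0.7, 0.3, 0.0],
]
BIASED_CYCLE_PI = [1 / 3, 1 / 3, 1 / 3]

print(is_time_reversible(BIRTH_DEATH, BIRTH_DEATH_PI))
print(is_time_reversible(BIASED_CYCLE, BIASED_CYCLE_PI))
```

For the first chain the reversed matrix coincides with the original one. For the second it is the transpose, so the reversed process drifts around the cycle the opposite way and the two rates $\pi_i P_{ij}$ and $\pi_j P_{ji}$ differ. A chain is time reversible exactly when these rates agree for every pair of states.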
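This happens if there are no "loops" for which a Markov chain is more likely in the long run to go in one direction compared with the other direction. We illustrate with an example.

Example 5.21 Random walk on the circle. Consider a particle which moves around $n$ positions on a circle numbered $1, 2, \ldots, n$ according to transition probabilities $P_{i,i+1} = p = 1 - P_{i+1,i}$ for $1 \le i$ [...] $\sum_i \pi_i = 1$, that the Markov chain is time reversible with the given stationary probabilities. ∎

5.7 A Mean Passage Time Bound

Consider now a Markov chain whose state space is the set of nonnegative integers, and which is such that

$$P_{ij} = 0, \quad 0 \le i < j$$

so that the state can never increase. Let $D_i$ denote the amount by which the state decreases in a transition out of state $i$, and let $N_n$ denote the number of transitions the chain takes to go from state $n$ to state 0. [...] If, for some nondecreasing function $d_i$, $i > 0$, we have that $E[D_i] \ge d_i$, then

$$E[N_n] \le \sum_{i=1}^{n} 1/d_i$$

Proof. The proof is by induction on $n$. It is true when $n = 1$, because $N_1$ is geometric with mean

$$E[N_1] = \frac{1}{P_{1,0}} = \frac{1}{E[D_1]} \le \frac{1}{d_1}$$

So, assume that $E[N_k] \le \sum_{i=1}^{k} 1/d_i$, for all $k < n$. Conditioning on $D_n$, the decrease in the first transition out of state $n$, and using $N_0 = 0$, gives

$$E[N_n] = 1 + \sum_{j=0}^{n} E[N_{n-j}]\, P(D_n = j)$$
$$= 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} P(D_n = j)\, E[N_{n-j}]$$
$$\le 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} P(D_n = j) \sum_{i=1}^{n-j} 1/d_i$$
$$= 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} P(D_n = j) \Big[\sum_{i=1}^{n} 1/d_i - \sum_{i=n-j+1}^{n} 1/d_i\Big]$$
$$\le 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} P(D_n = j) \Big[\sum_{i=1}^{n} 1/d_i - j/d_n\Big]$$

where the last line follows because $d_i$ is nondecreasing. Continuing from the previous line we get

$$E[N_n] \le 1 + P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^{n} 1/d_i - \frac{1}{d_n} \sum_{j=1}^{n} j\, P(D_n = j)$$
$$= 1 + P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^{n} 1/d_i - \frac{E[D_n]}{d_n}$$
$$\le P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^{n} 1/d_i$$

Subtracting $P_{n,n} E[N_n]$ from both sides and dividing by $1 - P_{n,n}$ completes the proof. ∎

Example 5.24 At each stage, each of a set of balls is independently put in one of $n$ urns, with each ball being put in urn $i$ with probability $p_i$, $\sum_{i=1}^{n} p_i = 1$. After this is done, all of the balls in the same urn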