This action might not be possible to undo. Are you sure you want to continue?

Welcome to Scribd! Start your free trial and access books, documents and more.Find out more

**MARKOV CHAINS AND RANDOM WALKS
**

Takis Konstantopoulos∗ Autumn 2009

∗

c Takis Konstantopoulos 2006-2009

Contents

1 Introduction 2 Examples of Markov chains 2.1 2.2 2.3 2.4 2.5 A mouse in a cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bank account . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) in a city . . . . . . . . . . . . . . . . . Actuarial chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 3 4 4 5 6 6 6 7 9

3 The Markov property 3.1 3.2 3.3 Deﬁnition of the Markov property . . . . . . . . . . . . . . . . . . . . . . . . Transition probability and initial distribution . . . . . . . . . . . . . . . . . . Time-homogeneous chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 Stationarity 4.1

Finding a stationary distribution . . . . . . . . . . . . . . . . . . . . . . . . . 13 15

5 Topological structure 5.1 5.2 5.3 5.4

The graph of a chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “leads to” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “communicates with” . . . . . . . . . . . . . . . . . . . . . . . . 16 Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 20 24 27 28 29

6 Hitting times and ﬁrst-step analysis 7 Gambler’s ruin 8 Stopping times and the strong Markov property 9 Regenerative structure of a Markov chain 10 Recurrence and transience

10.1 First hitting time decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 31 10.2 Communicating classes and recurrence/transience . . . . . . . . . . . . . . . . 32 11 Positive recurrence 33

i

12 Law (=Theorem) of Large Numbers in Probability Theory 13 Law of Large Numbers for Markov chains 13.1 Functions of excursions

34 34

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

13.2 Average reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 14 Construction of a stationary distribution 15 Positive recurrence is a class property 16 Uniqueness of the stationary distribution 17 Structure of a stationary distribution 18 Coupling and stability 39 41 43 44 45

18.1 Deﬁnition of stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 18.2 The fundamental stability theorem for Markov chains . . . . . . . . . . . . . 47 18.3 Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 18.4 Proof of the fundamental stability theorem . . . . . . . . . . . . . . . . . . . 49 18.5 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 18.6 Periodic chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 19 Limiting theory for Markov chains* 53

19.1 Limiting probabilities for null recurrent states . . . . . . . . . . . . . . . . . . 53 19.2 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 20 Ergodic theorems 21 Finite chains 22 Time reversibility 23 New Markov chains from old ones* 56 57 58 62

23.1 Subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.2 Watching a Markov chain when it visits a set . . . . . . . . . . . . . . . . . . 62 23.3 Subordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.4 Deterministic function of a Markov chain . . . . . . . . . . . . . . . . . . . . 63 24 Applications 65

ii

85 27. . .5 First return to zero . . 90 30.2 First return to the origin .2 The ballot theorem . . .1 Branching processes: a population growth application . . . . . . .1 Asymptotic distribution of the maximum . . . 95 31. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 25 Random walk 26 The simple symmetric random walk in dimension 1: path counting 76 78 26. .4 The Wright-Fisher model: an example from biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 A storage or queueing application . . . . . . . . . . . . . . . . . . . . . . . . . . .3 The distribution of the ﬁrst return to the origin . . . . . 68 24. . . . . . . . . . . . . . . . . . .2 The ALOHA protocol: a computer communications application . . . .2 Paths that start and end at speciﬁc points and do not touch zero at all . . . . . . . 87 29 Urns and the ballot theorem 30 The asymmetric simple random walk in dimension 1 88 89 30.1 Paths that start and end at speciﬁc points . . . . . . .3 PageRank (trademark of Google): A World Wide Web application . . . . . . . . . . . . . . . . . . . . . . . . 92 30. . . . . . . . . . . 70 24. . . . . 86 28 The reﬂection principle for a simple random walk in dimension 1 86 28. . . . . 65 24. . . . . . . .4 Some remarkable identities . . . . . . . .1 Distribution of hitting time and maximum . . . . . . .3 Some conditional probabilities . . . . .2 Return probabilities . . . . . . . . . . . . . . . . . . . . . 79 26. . . . . . . . . . . . . . . . . . . .24. 92 31 Recurrence properties of simple random walks 94 31. . . . . . . . . . . .1 Distribution after n steps . . . . . . . . .3 Total number of visits to a state . . . 96 32 Duality 33 Amount of time that a SSRW is positive* 98 99 iii . . . . . . . . 71 24. . . . . . . . . . . . . . 95 31. . . . . 84 27. . . . . .1 First hitting time analysis . . . . . . . . . . . . . . . . . . 84 27. . . . . . . . . . . . . . 80 27 The simple symmetric random walk in dimension 1: simple probability calculations 83 27. . . 83 27. . . . . . . . . . . . . .

34 Simple symmetric random walk in two dimensions 35 Skorokhod embedding* 102 103 iv .

Greece and the U. dozens of good books on the topic.S. I have a little mathematical appendix. There are. Also. I plan to add a few more sections: – On algorithms and simulation – On criteria for positive recurrence – On doing stuﬀ with matrices – On ﬁnance applications – On a few more delicate computations for simple random walks – Reshape the appendix – Add more examples These are things to come.. There notes are still incomplete. A few starred sections should be considered “advanced” and can be omitted at ﬁrst reading.. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. The notes should be readable by someone who has taken a course in introductory (non-measuretheoretic) probability. They form the core material for an undergraduate course on Markov chains in discrete time. I introduce stationarity before even talking about state classiﬁcation.K. At the end. Of course. The second one is speciﬁcally for simple random walks. v . one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. of course. Also. “important” formulae are placed inside a coloured box. I have tried both and prefer the current ordering.. I tried to put all terms in blue small capitals whenever they are ﬁrst encountered.Preface These notes were written based on a number of courses I taught over the years in the U. The ﬁrst part is about Markov chains and some applications.

In the module following this one you will study continuous-time chains. Xn+2 . yn+1 ). In other words. The state of the particle at time t is described by the pair ˙ X(t) = (x(t). We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. X0 . Recall (from elementary school maths). the integers). Newton’s second law says that the acceleration is proportional to the force: m¨ = F. and evolving as xn+1 = yn yn+1 = rem(xn . Suppose that a particle of mass m is moving under the action of a force F in a straight line. . x(t)). the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke’s law). y0 = min(a. X2 . Assume that the force F depends only on the particle’s position ¨ dt and velocity. . a totally ordered set. . b). the future trajectory (X(s). that gcd(a. Its velocity is x = dx . b) = gcd(r. the real numbers). In Mathematics.e. b) between two positive integers a and b. x). n = 0. for any n. for any t. yn ). . . yn ). has the property that. we let our “state” be a pair of integers (xn . X1 . initialised by x0 = max(a. s ≥ t) ˙ 1 .e. and the other the greatest common divisor. 2. there is a function F that takes the pair Xn = (xn . . depend only on the pair Xn . An example from Arithmetic of a dynamical system in discrete time is the one which ﬁnds the greatest common divisor gcd(a. where r = rem(a. the future pairs Xn+1 . F = f (x. b). An example from Physics is provided by Newton’s laws of motion. one of which is 0. Formally. 1. The “time” can be discrete (i. For instance. x = x(t) denotes the position of the particle at time t. . . b). It is not hard to see that. a phenomenon which evolves with time in a way that only the present aﬀects the future is called a dynamical system. we end up with two numbers. where xn ≥ yn . By repeating the procedure. b) is the remainder of the division of a by b. more generally. continuous (i. The sequence of pairs thus produced.e. or. . x Here. yn ) into Xn+1 = (xn+1 . and its ˙ dt 2x acceleration is x = d 2 .PART I: MARKOV CHAINS 1 Introduction A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past aﬀects the future only through the present. i.

Count the number of 1’s and divide by N . instead of deterministic objects. We can summarise this information by the transition diagram: 1 Stochastic means random.1 Examples of Markov chains A mouse in a cage A mouse is in a cage with two cells.can be completely speciﬁed by the current state X(t). regardless of where it was at earlier times. Y0 ) of random numbers. Each time count 1 if Yn < f (Xn ) and 0 otherwise. . Yn ).1 They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large” such as ﬁnding the determinant of a matrix with 10. respectively. containing fresh and stinky cheese. by the software in your computer. 0 ≤ Y0 ≤ B.99. Markov chains generalise this concept of dependence of the future only on the present. past states are not needed. an eﬀective way for dealing with large problems is via the use of randomness. it does so. 1 and 2. uniformly. via the use of the tools of Probability. 1 2 Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0. analyse. but some are stochastic. it moves from 2 to 1 with probability β = 0. and use those mathematical models which are called Markov chains. In addition. The resulting ratio is an approximation b of a f (x)dx. Choose a pair (X0 . . Some of these algorithms are deterministic. Indeed.05. i.. we will see what kind of questions we can ask and what kind of answers we can hope for. at time n + 1 it is either still in 1 or has moved to 2.000 rows and 10. Similarly. In other words. Try this on the computer! 2 . 2. 2 2. a ≤ X0 ≤ b. An example of a random algorithm is the Monte Carlo method for the b approximate computation of the integral a f (x)dx of a complicated function f : suppose that f is positive and bounded below B. A scientist’s job is to record the position of the mouse every minute.000 columns or computing the integral of a complicated function of a large number of variables. Other examples of dynamical systems are the algorithms run. We will also see why they are useful and discuss how they are applied. We will develop a theory that tells us how to describe. Perform this a large number N of times. We will be dealing with random variables. The generalisation takes us into the realm of Randomness. A mouse lives in the cage. When the mouse is in cell 1 at time n (minutes) then. for steps n = 1. say. Repeat with (Xn .e. .

on the average. . he can’t. px. as follows: Xn+1 = max{Xn + Dn − Sn . D2 . respectively. 0}. the ﬁrst time that the mouse will move from 1 to 2 will have mean 1/α = 1/0. Assume also that X0 . We easily see that if y > 0 then ∞ x. S1 . are independent random variables. = z=0 ∞ = z=0 3 . y = 0. Assume that Dn . Dn − Sn = y − x) P (Sn = z. S0 . (This is the mean of the binomial distribution with parameter α.2 Bank account The amount of money in Mr Baxter’s bank account evolves. Xn is the money (in pounds) at the beginning of the month. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is deﬁned by: px. to move from cell 1 to cell 2? 2. How long does it take for the mouse.α 1−α 1 β 2 1−β Another way to summarise the information is by the 2×2 transition probability matrix P= 1−α α β 1−β = 0. S2 .01 Questions of interest: 1.05 0. FS . .) 2. So if he wishes to spend more than what is available. from month to month. How often is the mouse in room 1 ? Question 1 has an easy. . . are random variables with distributions FD . intuitive.99 0. 2. Sn . Dn is the money he deposits. and Sn is the money he wants to spend. answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell. D1 . 1. Dn = z + y − x) P (Sn = z)P (Dn = z + y − x) FS (z)FD (z + y − x).05 ≈ 20 minutes. . Here.y = P (Dn − Sn = y − x) = = z=0 ∞ z=0 ∞ P (Sn = z.95 0. . D0 .y := P (Xn+1 = y|Xn = x).

can be computed from FS and FD . and if so.something that. how often will he be visiting 0? 2.0 p0. how long will it take him. Of course. Are there any places which he will never visit? 3. if y = 0. on the average. Questions of interest: 1. to go from the pub to home? You might object to the assumption that the person is totally drunk and has no sense of direction.1 p0. in principle. So he may move forwards with equal probability that he moves backwards. being drunk. he has no sense of direction.1 p2. how fast? 2. The transition probability matrix is y=1 p0.2 p1. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).0 = 1 − ∞ px. If he starts from position 0. 2. we have px. Are there any places which he will visit inﬁnitely many times? 4.2 P= p2.4 Simple random walk (drunkard’s walk) in a city Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern: 4 .0 p2. How long will it take (if ever) for the bank account to empty? We shall develop methods for answering these and other questions.0 p1.1 p1. Will Mr Baxter’s account grow.y . 1/2 1/2 1/2 1/2 1/2 −2 1/2 −1 1/2 0 1/2 1 1/2 2 1/2 3 Questions: 1. 2.3 Simple random walk (drunkard’s walk) Consider a completely drunk person who walks along a street. If the pub is located in position 0 and his home in position 100.2 ··· ··· ··· ··· · · · · · · ··· and is an inﬁnite matrix.

01 S 0.S = 0. west.H .H − pS.D = 1.3 H 0. and let’s say he’s completely drunk. etc. and pS. mathematically imprecise. and pS.H = 1 − pH.S = 1 − pS. so we assign probability 1/4 for each move.S because pH. that since there are more degrees of freedom now. but they will become more precise in the sequel. So he can move east.D . We may again want to ask similar questions.3 that the person will become sick. Also. Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly. the answer to this is crucial for the policy determination.1 D Thus. These things are.S − pH. 5 .D is omitted.5 Actuarial chains A life insurance company wants to ﬁnd out how much money to charge its clients. pD. north or south. there is probability pH. for now. pD.Salvador Dali (1922) The Drunkard Suppose that the drunkard moves from corner to corner. the reason being that the company does not believe that its clients are subject to resurrection. Note that the diagram omits pH.8 0. there is clearly a higher chance that our man will be lost. Clearly. But.D . therefore. the company must have an idea on how long the clients will live. observe. It proposes the following model summarising the state of health of an individual on a monthly basis: 0. 2. given that he is currently healthy.

3. for any n. Let us now write the Markov property in an equivalent way. Proof. and so future is independent of past given the present. n ∈ T} is Markov instead of saying that it has the Markov property. . . almost exclusively (unless otherwise said explicitly). it is customary to call (Xn ) a Markov chain. (Xn . . . . X2 = i2 . m ∈ T) is independent of the past process (Xm .1 The Markov property Deﬁnition of the Markov property Consider a (ﬁnite or inﬁnite) sequence of random variables {Xn . if the claim holds. . n + m] Xk = ik | Xn = in ). 2. The elements of S are frequently called states. Lemma 1. Xn−1 ) given Xn . 3. On the other hand.} or the set Z of all integers (positive.3 3. . . Since S is countable. We say that this sequence has the Markov property if. 2. . . So we will omit referring to T explicitly. . m < n. Usually. for all n ∈ Z+ and all i0 . X0 = i0 ) = P (Xn+1 = in+1 | Xn = in ). . which precisely means that (Xn+1 . . 1. If (Xn ) is Markov then the claim holds by the conditional independence of Xn+1 and (X0 . Xn+m ) and (X0 . (1) 6 . . . . unless needed. We will also assume. . Xn−1 ) are independent conditionally on Xn .2 Transition probability and initial distribution We therefore see that joint probabilities can be expressed as P (X0 = i0 . called the state space. . . . . zero. conditionally on Xn . in+1 ∈ S. We focus almost exclusively to the ﬁrst choice. .} or N = {1. m ∈ T). X0 = i0 ) = P (∀ k ∈ [n + 1. X1 = i1 . n + m] Xk = ik | Xn = in . the Markov property. for any n ∈ T. in+m . Sometimes we say that {Xn . Xn = in ) = P (X0 = i0 )P (X1 = i1 | X0 = i0 )P (X2 = i2 | X1 = i1 ) · · · P (Xn = in | Xn−1 = in−1 ). n ∈ Z+ ) is Markov if and only if. . . . . where T is a subset of the integers. . . . that the Xn take values in some countable set S. . the future process (Xm . . we can see that by applying it several times we obtain P (∀ k ∈ [n + 1. . . . . T = Z+ = {0. . This is true for any n and m. . or negative). m and any choice of states i0 . 3. m > n. n ∈ T}. . P (Xn+1 = in+1 | Xn = in . viz.

j (n. pi. n). n). ℓ) P(ℓ.j (n. j ∈ S.i2 (1. 1) pi1 . m + 2) · · · P(n − 1. n) = i1 ∈S ··· in−1 ∈S pi. More generally. m + 1) · · · pin−1 .3 In this new notation. n) := [pi. a square matrix whose rows and columns are indexed by S (and this.j (n. If |S| = d then we have d × d matrices. 3. But we will circumvent them by using Probability only. . .. i∈S i. The function µ(i) is called initial distribution. pi. by a result in Probability. n + 1) do not depend on n .j (m. something that can be called semigroup property or Chapman-Kolmogorov equation. remembering a bit of Algebra. n) = P(m.i1 (0. Xn = in ) = µ(i0 ) pi0 . pi. will specify all joint distributions2 : P (X0 = i0 . requires some ordering on S) and whose entries are pi. . n) = 1(i = j) and. means that the transition probabilities pi. 2 3 And.which means that the two functions µ(i) := P (X0 = i) . we deﬁne.j∈S . of course. viz. X1 = i1 . n + 1) is called transition probability from state i at time n to state j at time n + 1. for each m ≤ n the matrix P(m.j (m. n) if m ≤ ℓ ≤ n.3 Time-homogeneous chains We will specialise to the case of time-homogeneous chains. which. Of course. n) := P (Xn = j | Xm = i) is the transition probability from state i at time m to state j time n ≥ m. .i1 (m. or as P(m. n)]i. by deﬁnition.in (n − 1. X2 = i2 . m + 1)P(m + 1. 2) · · · pin−1 .j (m. all probabilities of events associated with the Markov chain This causes some analytical diﬃculties. we can write (2) as P(m. n) = P(m. The function pi. n). by (1). Therefore.i (n − 1. n ≥ 0. n + 1) := P (Xn+1 = j | Xn = i) .j (n.j (m. n). pi. 7 . (2) which is reminiscent of matrix multiplication. but if S is an inﬁnite set. then the matrices are inﬁnitely large.

if µ assigns probability p to state i1 and 1 − p to state i2 . unless otherwise speciﬁed. arranged in a row vector. In other words. (3) We call δi the unit mass at i.j (n. Similarly.j . A useful convention used for time-homogeneous chains is to write Pi (A) := P (A | X0 = i). let µn := [µn (i) = P (Xn = i)]i∈S be the distribution of Xn . b ≥ 0. δi (j) := 1. Then P(m.and therefore we can simply write pi. Notice that any probability distribution on S can be written as a linear combination of unit masses. The matrix notation is useful. when we say “Markov chain” we will mean “timehomogeneous Markov chain”. if j = i otherwise. simply. for all integers a. Thus Pi is the law of the chain when the initial distribution is δi . Starting from a speciﬁc state i. For example. and the semigroup property is now even more obvious because Pa+b = Pa Pb . while P = [pi. if Y is a real random variable.j the (one-step) transition probability from state i to state j.j ]i. y y 8 . transition matrix. Notational conventions Unit mass. From now on. then µ = pδi1 + (1 − p)δi2 . n) = P × P × · · · · · · × P = Pn−m . It corresponds. as follows from (1) and the deﬁnition of P. For example. and call pi. n + 1) =: pi. 0. Then we have µ n = µ 0 Pn . of course. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi . we write Ei Y := E(Y | X0 = i) = yP (Y = y | X0 = i) = yPi (Y = y). to picking state i with probability 1. n−m times for all m ≤ n.j∈S is called the (one-step) transition probability matrix or.

Clearly.i−1 = P (Xn+1 N −i N i = i − 1 | Xn = i) = . n ≥ 0) will not be stationary: indeed. m + 1.4 Stationarity We say that the process (Xn . . at time n + 1 it moves to one of the adjacent vertices with equal probability.i. It is clear. 1. we indeed expect that we have N/2 molecules in each room. The gas is placed partly in room 1 and partly in room 2. Another guess then is: Place 9 . We can model this by saying that the chance that a molecule move from room 1 to room 2 is proportional to the number of molecules in room 1. random variables provides an example of a stationary process.d. selecting a vertex with probability 1/7. n = m. It should be clear that if initially place the bug at random on one of the vertices. .). This is a model of gas distributed between two identical communicating chambers. n = 0. Then a hole is opened in the separator and we watch what happens as the molecules diﬀuse between the two rooms. . say N = 1025 ) in a metallic chamber with a separator in the middle. the law of a stationary process does not depend on the origin of time. at any point of time n. that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. Then pi. We now investigate when a Markov chain is stationary. If at time n the bug is on a vertex. and vice versa. N If we start the chain with X0 = N (all molecules in room 1) then the process (Xn . dividing it into two identical rooms. On the average. the distribution of the bug will still be the same. . But this will not work.) is stationary if it has the same law as (Xn . for a while we will be able to tell how long ago the process started. . because it is impossible for the number of molecules to remain constant at all times. Imagine there are N molecules of gas (where N is a large number. as it takes some time before the molecules diﬀuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. Example: the Ehrenfest chain. then. then. i.i+1 = P (Xn+1 = i + 1 | Xn = i) = pi. To provide motivation and intuition. Hence the uniform probability distribution is stationary. In other words. let us look at the following example: Example: Motion on a polygon. . a sequence of i.e. for any m ∈ Z+ . Consider a canonical heptagon and let a bug move on its vertices. Let Xn be the number of molecules in room 1 at time n.

. which you should go through by yourselves. n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. If we can do so. Xm+2 = i2 . Since.i + π(i + 1)pi+1. a Markov chain is stationary if and only if the distribution of Xn does not change with n. 2−N i−1 i+1 N N (The dots signify some missing algebra. . i 0 ≤ i ≤ N. Xn = in ) = P (Xm = i0 . . Lemma 2.each molecule at random either in room 1 or in room 2. hence. . . Indeed. P (X1 = i) = j P (X0 = j)pj. Xm+1 = i1 . The equations (4) are called balance equations. Changing notation. the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn .i = P (X0 = i − 1)pi−1.) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. . X2 = i2 . We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. Proof. use (1) to see that P (X0 = i0 . P (Xn = i) = π(i) for all n. But we need to prove this. If we do so. (a binomial distribution). then we have created a stationary Markov chain.i = N N −i+1 N i+1 2−N + = · · · = π(i). then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . .i = π(i − 1)pi−1. by (3). we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. we are led to consider: The stationarity problem: Given [pij ]. stationary or invariant distribution. For the if part. (4) Such a π is called .i + P (X0 = i + 1)pi+1. . ﬁnd those distributions π such that πP = π. . . A Markov chain (Xn . X1 = i1 . . for all m ≥ 0 and i0 . in ∈ S. . 10 . In other words. The only if part is obvious. Xm+n = in ).

i ≥ 1. . write the equations (4): π(i) = π(j)pj. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. In addition to that. and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. To ﬁnd the stationary distribution. Then (Xn . 2 . j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). . then we run into trouble: indeed. After bus arrives at a bus stop. xd ). each x ∈ Sd is of the form x = (x1 . . .i = 1. in matrix notation. which in turn means that the normalisation condition cannot be satisﬁed. . can be written as P1 = 1. random variables. we must be careful.Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. 11 . i ≥ 0. and so each π(i) will be zero. Who tells us that the sum inside the parentheses is ﬁnite? If the sum is inﬁnite. i≥1 π(i) −1 To compute π(1). . p1. The state space is S = N = {1. i∈S The matrix P has the eigenvalue 1 always because j∈S pi. d}. if i > 0 i ∈ N. 2. Thus. But here.i = π(0)p(i) + π(i + 1).j = 1. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. . . . π must be a probability: π(i) = 1. the next one will arrive in i minutes with probability p(i). Example: card shuﬄing.. It is easy to see that the stationary distribution is uniform: π(x) = 1 . where 1 is a column whose entries are all 1. . Example: waiting for a bus. where all the xi are distinct. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. which gives p(j) . . This means that the solution to the balance equations is identically zero.d. Consider a deck of d = 52 cards and shuﬄe them in the following (stupid) way: pick two cards at random and swap their positions. which.i = p(i). π(1) will be zero. i = 1. d! x ∈ Sd .i. The times between successive buses are i.}. n ≥ 0) is a Markov chain with transition probabilities pi+1. Keep doing it continuously. . . we use the normalisation condition π(1) = i≥1 j≥i = 1.

1 − ξ3 . .. . Remarks: This is not a Markov chain with countably many states. i. ξ2 (n) = 1 − ε1 . . . i. So it goes beyond our theory. and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2 . ξr+1 (n) = εr ) + P (ξ1 (n) = 1. .d. . . You can check that . P (ξ1 (n + 1) = ε1 . . The Markov chain is obtained by applying the mapping ϕ again and again. So our assumption really means: ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is ﬁnite. ξr (n + 1) = εr ) = P (ξ1 (n) = 0. . . . So let us do so.Let us work out the sum to see if this assumption can be expressed in more “physical” terms: p(j) = 1(j ≥ i)p(j) = jp(j). ξ2 (n). Hence choosing the initial state at random in this manner results in a stationary Markov chain. . . . for any n = 0. . ξi ∈ {0. and the rule maps x into 2(1 − x). ..i.d. and any ε1 . ξr+1 (n) = 1 − εr ). . 1} for all i.e. Here ξ(n) = (ξ1 (n). We have thus deﬁned a mapping ϕ from binary vectors into binary vectors. . with P (ξr (0) = 1) = P (ξr (0) = 0) = 1/2 then the same will be true for all n. 1] (think of [0. . are i.) Consider an inﬁnite binary vector ξ = (ξ1 . if ξ1 = 1 then shift to the left and ﬂip all bits to obtain (1 − ξ2 .). . Example: bread kneading. But the function f (x) = min(2x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. . .). . any r = 1. . 2(1 − x)). what we have really shown is that this assumption is a necessary and suﬃcient condition for the existence and uniqueness of a stationary distribution. P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 The proof is by induction: if the ξ1 (n).). . But to even think about it will give you some strength.i. 1] as a ﬂexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle. ξ2 . . . . i≥1 j≥i j≥1 i≥1 j≥1 and the latter sum is the expected time between two successive bus arrivals. . As for the name of the chain (bread kneading) it can be justiﬁed as follows: to each ξ there corresponds a number x between 0 and 1 because ξ can be seen as the binary representation of the real number x. . ξ2 (n) = ε1 . . . . applied to the interval [0. . So. . . εr ∈ {0.* (This is slightly beyond the scope of these lectures and can be omitted. . ξ2 (n). from (5). We can show that if we start with the ξr (0). with P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 then. ξ(n + 1) = ϕ(ξ(n)). (5) 12 . Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. . 2. the same is true at n + 1. 1}. 2. 1. If ξ1 = 1 then x ≥ 1/2.) is the state at time n.. . . 2. r = 1. ξ3 .

Ac ) is the total current from A to Ac .i . When a Markov chain is stationary we refer to it as being in 4.j . j∈S i ∈ S. Ac ) = F (Ac .i = j∈A π(j)pj.j as a “current” ﬂowing from i to j. π(j)pj. π(i) = j∈S (balance equations) (normalisation condition). We have: Proposition 1. Ac ) := π(i)pi. The convenience is often suggested by the topological structure of the chain.j = j∈A π(j)pj. i∈A j∈Ac Think of π(i)pi. for all A ⊂ S. Writing the equations in this form is not always the best thing to do. But we only have d unknowns.Note on terminology: steady-state. Proof.i + j∈Ac π(j)pj. π is a stationary distribution if and only if π1 = 1 and F (A.i 13 . we introduce more. π(j) = 1. therefore one of them is always redundant. then choose A = {i} and see that you obtain the balance equations. expecting to be able to choose.j (which equals 1): π(i) j∈S pi. If the state space has d states. Fix a set A ⊂ S and write π(i) = j∈S π(j)pj. a stationary distribution π satisﬁes π = πP π1 = 1 In other words. Conversely.i Multiply π(i) by j∈S pi. then the above are d + 1 equations. A). because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. suppose that the balance equations hold. If the latter condition holds.1 Finding a stationary distribution As explained above. This is OK. So F (A. amongst them. Instead of eliminating equations.i + j∈Ac π(j)pj. as we will see in examples. a set of d linearly independent equations which can be more easily solved. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A.

p22 = 1 − β. while the remaining two terms give the equality we seek. The extra degree of freedom provided by the last result.j = j∈A π(j)pj. α+β π(2) = α . Thus.04 π(2) = 0. 14 i = 1.952.) Assume 0 < α. . A) Schematic representation of the ﬂow balance relations. We have π(1)α = π(2)β. 2. p21 = β.8% of the time in room 2.Now split the left sum also: π(i)pi. . The constant c is found from π(1) + π(2) = 1.99 ≈ 0.. we see that the ﬁrst term on the left cancels with the ﬁrst on the right. (Consequently. . we ﬁnd π(1) = 0. α+β π(2) = cα.04 room 1.05. β = 0. Summing over i ∈ A. gives us ﬂexibility. where p + q = 1: p p p 0 q 1 q 2 q 3 q 4 . in steady state.. here is an example: Example: Consider the 2-state Markov chain with p12 = α. p11 = 1 − α.i This is true for all i ∈ S. and 4. the mouse spends only 95. 1. If we take α = 0. 3. but we shall not do this here. . p q Consider the following chain. Example: a walk with a barrier.j + j∈A j∈Ac π(i)pi. That is. A c F(A . π(1) = β . A ) A c c F( A . One can ask how to choose a minimal complete set of linearly independent equations.99 (as in the mouse example).048. Instead. The balance equations are: π(i) = pπ(i − 1) + qπ(i + 1).i + j∈Ac π(j)pj. β < 1. which immediately tells us that π(1) is proportional to β and π(2) proportional to α: π(1) = cβ.2% of the time in 1.05 ≈ 0.

But we can make them simpler by choosing A = {0. Proof. i leads to j (written as i Notice that this relation is reflexive): ∀i ∈ S Furthermore: Theorem 1. This is i=0 possible if and only if p < q. A). (ii) pi. i.) 5 5. provided that ∞ π(i) = 1.i2 · · · pin−1 . i − 1}. If i = j there is nothing to prove. 1. 5. .e. For i ≥ 1. .1 Topological structure The graph of a chain This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. . Does this provide a stationary distribution? Yes. a pair (S. Since 0 < Pi (∃n ≥ 0 Xn = j) ≤ 15 Pi (Xn = j). These are much simpler than the previous ones. (If p ≥ 1/2 there is no stationary distribution. If i ≥ 1 we have F (A. . . . We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows. n≥0 . In fact. the chain will visit j at some ﬁnite time.2 The relation “leads to” We say that a state i leads to state j if. and E a set of directed edges. starting from i. j) ∈ E ⇐⇒ pi. in−1 . . (iii) Pi (Xn = j) > 0 for some n ≥ 0. This is often cast in terms of a digraph (directed graph). Ac ) = π(i − 1)p = π(i)q = F (Ac . π(i) = (p/q)i π(0). p < 1/2. . In graph-theoretic terminology.j > 0 for some n ∈ N and some states i1 . we can solve them immediately. So assume i = j.j > 0.i1 pi1 . S is a set of vertices. i. The following are equivalent: (i) i j i i j) ⇐⇒ Pi (∃n ≥ 0 Xn = j) > 0.e. In other words. • Suppose that i j. E) where S is the state space and E ⊂ S × S deﬁned by (i.

• Suppose that (ii) holds. It is reﬂexive and transitive because is so. However. (iii) holds.in−1 pi. is an equivalence relation. {2. • Suppose now that (iii) holds. 5} is closed: if the chain goes in there then it never moves out of it. In addition. {6}. and so (iii) holds. because Pi (Xn = j) ≤ Pi (∃m ≥ 0 Xm = j). The communicating class corresponding to the state i is.. the set of all states that communicate with i: [i] := {j ∈ S : j So. 16 . (6) j.3 The relation “communicates with” Deﬁne next the relation i communicates with j (written as i Obviously. Hence Pi (Xn = j) > 0. where the sum extends over all possible choices of states in S. it partitions S into equivalence classes known as communicating classes. The ﬁrst two classes diﬀer from the last two in character.i2 · · · pin−1 .. meaning that (ii) holds. We just observed it is symmetric. reﬂexive and transitive. this is symmetric (i Corollary 2. (i) holds. The class {4..the sum on the right must have a nonzero term. Just as any equivalence relation. 3}. the class {2. by deﬁnition. {4. we have that the sum is also positive and so one of its terms is positive. The relation is transitive: If i j and j k then i k. Look at the last display again. Two communicating classes are either 1 2 3 4 5 6 In the example of the ﬁgure. [i] = [j] if and only if i identical or completely disjoint. 3} is not closed. By assumption. we see that there are four communicating classes: {1}. Proof. j) ⇐⇒ i j and j i. the sum contains a nonzero term. Equivalence relation means that it is symmetric. 5. by deﬁnition. i}. Corollary 1. 5}.. Since Pi (Xn = j) > 0.i1 pi1 . The relation j ⇐⇒ j i). In other words. We have Pi (Xn = j) = i1 .j .

more manageable. in−1 such that pi. Closed communicating classes are particularly important because they decompose the chain into smaller. we have i1 ∈ C. we conclude that j ∈ C. Note: If we consider a Markov chain with transition matrix P and ﬁx n ∈ N then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. . Proof. 17 . we say that a set of states C ⊂ S is closed if pi. C5 in the ﬁgure above are closed communicating classes.j > 0. The classes C4 . Suppose that i j. Why? A state i is essential if it belongs to a closed communicating class. . the state space S is a closed communicating class. more precisely. Similarly. pi. We then can ﬁnd states i1 . If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. Since i ∈ C.i1 > 0. . then the chain (or.j = 1. its transition matrix P) is called irreducible. Note that there can be no arrows between closed communicating classes.i1 pi1 . and so on. There can be arrows between two nonclosed communicating classes but they are always in the same direction.i2 > 0. we can reach any other state. In graph terms. If all states communicate with all others. In other words still. . starting from any state. C3 are not closed. and C is closed. C1 C2 C3 C4 C5 The general structure of a Markov chain is indicated in this ﬁgure. A communicating class C is closed if and only if the following holds: If i ∈ C and if i j then j ∈ C. Note: An inessential state i is such that there exists a state j such that i j but j i. and so i2 ∈ C. Thus. The communicating classes C1 . If a state i is such that pi. To put it otherwise. pi1 . j∈C for all i ∈ C. Suppose ﬁrst that C is a closed communicating class.i2 · · · pin−1 . Lemma 3. A general Markov chain can have an inﬁnite number of communicating classes. parts.i = 1 then we say that i is an absorbing state. The converse is left as an exercise. Otherwise. C2 . the state is inessential. the single-element set {i} is a closed communicating class.More generally.

. . that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1. (Can you ﬁnd an example of a chain with period 3?) We will now give the deﬁnition of the period of a state of an arbitrary chain. and. 3} at some point of time then. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2 . X2 ∈ C1 . ℓ ∈ N. i. pii (ℓn) ≥ pii (n) ℓ . (n) Proof. we have So pii (2n) (n) pij ≥ pii (n) 2 (m+n) ≥ pik pkj . X4 ∈ C1 . Notice however. .5. Let us denote by pij the entries of Pn . If i j there is α ∈ N with pij > 0 and if 18 . Consider two distinct essential states i. (m) (n) for all m.e. d(i) = gcd Di . . The chain alternates between the two sets. If i j then d(i) = d(j). j. n ≥ 0. i. Dj = {n ∈ (n) (α) N : pjj > 0}.4 Period Consider the following chain: 1 2 4 3 The chain is irreducible. Therefore if pii > 0 then pii (n) (ℓn) > 0 for all ℓ ∈ N. X3 ∈ C2 . Since Pm+n = Pm Pn . whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. d(j) = gcd Dj . A state i is called aperiodic if d(i) = 1. The period of an essential state is deﬁned as the greatest common divisor of all natural numbers n with such that it is possible to return to i in n steps: d(i) := gcd{n ∈ N : Pi (Xn = i) > 0} . 4} at the next step and vice versa. We say that the chain has period 2. it will move to the set C2 = {2. j ∈ S. more generally. The period of an inessential state is not deﬁned. So. Theorem 2 (the period is a class property). all states form a single closed communicating class. Let Di = {n ∈ N : pii > 0}. Another way to say this is: (n) (m) If the integer n divides m (denote this by: n | m) and if pii > 0 then pii > 0.

(n) (n) (α+n+β) (α) (n) (β) (β) (α+β) (α) (β) This ﬁgure shows the internal structure of a closed communicating class with period d = 4. From the last two displays we get d(j) | n for all n ∈ Di . showing that α + β ∈ Dj . Of particular interest. Theorem 4. r = 0. C1 . In particular. where the remainder r ≤ n1 − 1. . 1. . are irreducible chains. If a number divides two other numbers then it also divides their diﬀerence. n2 ∈ Di . d(j) | d(i)). Divide n by n1 to obtain n = qn1 + r. chains where. C1 . Therefore n = qn1 + r(n2 − n1 ) = (q − r)n1 + rn2 . . More generally. . So d(i) = d(j). d − 1. An irreducible chain with ﬁnitely many states is aperiodic if and only if there (n) exists an n such that pij > 0 for all i. if an irreducible chain has period d then we can decompose the state space into d sets C0 . i. Since n1 . Proof. we have n ∈ Di as well. Cd−1 such that the chain moves cyclically between them. j∈Cr+1 i ∈ Cr . q − r > 0. If i is an aperiodic state then there exists n0 such that pii > 0 for all n ≥ n0 . Therefore d(j) | α + β. Corollary 3. Arguing symmetrically. n1 ∈ Di such that n2 − n1 = 1. if d = 1. So d(j) is a divisor of all elements of Di . as deﬁned earlier. all states communicate with one another. . (Here Cd := C0 . j ∈ S. An irreducible chain has period d if one (and hence all) of the states have period d. Consider an irreducible chain with period d. Theorem 3. . Let n be suﬃciently large. Therefore d(j) | α + β + n for all n ∈ Di . Cd−1 such that pij = 1.j i there is β ∈ N with pji > 0.) 19 . . So pjj ≥ pij pji > 0. Now let n ∈ Di . showing that α + n + β ∈ Dj . . . we obtain d(i) ≤ d(j) as well. Then we can uniquely partition the state space S into d disjoint subsets C0 . Because n is suﬃciently large.e. . Formally. Since d(i) is the greatest common divisor we have d(i) ≥ d(j) (in fact. . the chain is called aperiodic. Pick n2 . . Then pii > 0 and so pjj ≥ pij pii pji > 0. this is the content of the following theorem.

which contradicts the deﬁnition of d. i2 i0 i0 in a number of steps which is an integer multiple of d − 1 (why?). . Suppose pij > 0 but j ∈ C1 but. The reader is warned to be alert as to which of the two variables we are considering at each time. and by the assumptions that d d j. say. . We are interested in deriving formulae for the probability that this time is ﬁnite as well as the expectation of this time. Consider the path i0 → i → j → i2 → i3 → · · · → id → i0 . Take. i1 . We avoid excessive notation by using the same letter for both.. Hence it partitions the state space S into d disjoint equivalence classes.. Let C1 be the equivalence class of i1 . then. Assume d > 1. necessarily. It is easy to see that if we continue and pick a further state id with pid−1 .. . Pick a state i0 and let C0 be its equivalence class (i. If X0 ∈ A the two times coincide. after it takes one step from its current position. . Cd−1 . Then pick i1 such that pi0 i1 > 0. As an example. that is why we call it “first-step analysis”. We use a method that is based upon considering what the Markov chain does at time 1. Deﬁne the relation i d j ⇐⇒ pij (nd) > 0 for some n ≥ 0.. We show that these classes satisfy what we need. The existence of such a path implies that it is possible to go from i0 to i. C1 . consider the chain 1 1/3 1 0 4 1/2 1 1/6 2 We shall later consider the time inf{n ≥ 1 : Xn ∈ A} which diﬀers from TA simply by considering n ≥ 1 instead of n ≥ 0. j belongs to the next class. . with corresponding classes C0 . for example. Notice that this is an equivalence relation. We now show that if i belongs to one of these classes and if pij > 0 then. i ∈ C0 . necessarily.e the set of states j such that i j).. i..e. id−1 d 6 Hitting times and ﬁrst-step analysis Consider a Markov chain with transition probability matrix P.Proof. 20 . Such a path is possible because of the choice of the i0 .. i1 .id > 0. id ∈ C0 . Continue in this manner and deﬁne states i0 . j ∈ C2 . We deﬁne the hitting time4 of a set of states A by TA := inf{n ≥ 0 : Xn ∈ A}.

i ∈ A. ϕ(i) = j∈S i∈A pij ϕ(j). Combining the above. j∈S as needed. For the second part of the proof. If i ∈ A then TA = 0. X0 = i) = P (TA < ∞|X1 = j). conditionally on X1 . and deﬁne ϕ(i) := Pi (TA < ∞). Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). So TA = 1 + TA . We ﬁrst have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. . But the Markov chain is homogeneous. If i ∈ A.). if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S.e. let ϕ′ (i) be another solution. By self-feeding this equation. The function ϕ(i) satisﬁes ϕ(i) = 1.It is clear that P1 (T0 < ∞) < 1. a ′ function of X1 . ′ is the remaining time until A is hit). which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). . ﬁx a set of states A. we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A pjk ϕ′ (k) k∈A 21 . then TA ≥ 1. Furthermore. . because the chain may end up in state 2. and so ϕ(i) = 1. we have Pi (TA < ∞) = ϕ(j)pij . X2 . But what exactly is this probability equal to? To answer this in its generality. the event TA is independent of X0 . ′ Proof. We then have: Theorem 5. Therefore. Hence: ′ ′ P (TA < ∞|X1 = j.

let A = {0. If i ∈ A then. 3 2 so ϕ(1) = 3/4. By omitting the last term we obtain ϕ′ (i) ≥ Pi (TA ≤ 2). 3 22 . TA = 0 and so ψ(i) := Ei TA = 0. ψ(0) = ψ(2) = 0. and let ψ(i) = Ei TA . if ψ ′ (i) is any other solution of these equations then ψ ′ (i) ≥ ψ(i) for all i ∈ S. ′ E i TA = 1 + E i TA = 1 + j∈S pij Ej TA = 1 + j∈A pij ψ(j). We have: Theorem 6. 1. because it takes no time to hit A if the chain starts from A. On the other hand. we obtain ϕ′ (i) ≥ Pi (TA < ∞) = ϕ(i). then TA = 1 + TA . Clearly. 1 ψ(1) = 1 + ψ(1). Letting n → ∞. which is the second of the equations.) As for ϕ(1). the second equals Pi (TA = 2). The second part of the proof is omitted. If ′ i ∈ A. ψ(i) = 1 + i∈A pij ψ(j). Example: In the example of the previous ﬁgure. we choose A = {0}. we have 1 1 ϕ(1) = ϕ(1) + ϕ(0). j∈A i ∈ A. we obtain ϕ′ (i) ≥ Pi (TA ≤ n). ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Next consider the mean time (mean number of steps) until the set A is hit for the ﬁrst time. so the ﬁrst two terms together equal Pi (TA ≤ 2). We immediately have ϕ(0) = 1. and ϕ(2) = 0. By continuing self-feeding the above equation n times. ψ(i) := Ei TA . obviously. as above. The function ψ(i) satisﬁes ψ(i) = 0. Start the chain from i. the set that contains only state 0. 2. Example: Continuing the previous example. Proof. (If the chain starts from 0 it takes no time to hit 0. if the chain starts from 2 it will never hit 0. 2}.We recognise that the ﬁrst term equals Pi (TA = 1). i = 0. Furthermore. We let ϕ(i) = Pi (T0 < ∞). Therefore.

as argued earlier. . ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). . 23 . This happens up to the ﬁrst hitting time TA ﬁrst hitting time of the set A. Then ′ 1+TA x ∈ A. (This should have been obvious from the start. Clearly. of the random variables X1 . h(x) = f (x). we just changed index from n to n + 1. otherwise. ϕAB (i) = j∈A i∈B pij ϕAB (j). consider the following situation: Every time the state is x a reward f (x) is earned. ϕAB (i) = 0. B. Hence. We may ask to ﬁnd the probabilities ϕAB (i) = Pi (TA < TB ). until set A is reached. i. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the ﬁrst-step analysis method. after one step. It should be clear that the equations satisﬁed are: ϕAB (i) = 1. where. ′ where TA is the remaining time. by the Markov property. h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ). Now the last sum is a function of the future after time 1. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). TB for two disjoint sets of states A. X2 . therefore the mean number of ﬂips until the ﬁrst success is 3/2. Next. ϕAB is the minimal solution to these equations. in the last sum..which gives ψ(1) = 3/2. i∈A Furthermore. . ′ TA = 1 + TA .e. if x ∈ A. because the underlying experiment is the ﬂipping of a coin with “success” probability 2/3.) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ).

in which case he dies immediately. k ∈ N) form the strategy of the gambler. h(1) = 5. for i = 1. why?) If we let h(i) be the average total reward if he starts from room i. So. we have h(4) = 0 1 h(1) = 1 + h(2) 2 1 h(3) = 3 + h(2) 2 1 1 h(2) = 2 + h(1) + h(3). so 24 . until he dies in room 4. he collects i pounds. no gambler is assumed to have any information about the outcome of the upcoming toss.) 7 Gambler’s ruin Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. collecting “rewards”: when in room i. Clearly. he will die. 2 2 We solve and ﬁnd h(2) = 8. 1 2 4 3 The question is to compute the average total reward. You can remember the latter ones by thinking that the expectecd total reward equals the immediate reward f (x) plus the expected remaining reward. If heads come up then you win £1 per pound of stake.Thus. The successive stakes (Yk . (Sooner or later. (Also. Example: A thief enters sneaks in the following house and moves around the rooms. x∈A pxy h(y). the set of equations satisﬁed by h. unless he is in room 4. your fortune after the n-th toss is n Xn = X0 + k=1 Y k ξk . if just before the k-th toss you decide to bet Yk pounds then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. if he starts from room 2. 3. are h(x) = 0. Let p the probability of heads and q = 1 − p that of tails. h(3) = 7. y∈S h(x) = f (x) + x ∈ A. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails). 2.

The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). It can only depend on ξ1 . . b are absorbing. She will stop playing if her money fall below a (a < x). . This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and. in simplest case Yk = 1 for all k. In other words. b − 1. Then the probability that the gambler’s fortune will reach level b > x before reaching level a < x equals: x−a − 1 (q/p) (q/p)b−a − 1 . Ex ) for the probability (respectively. We obviously have ϕ(0) = 0. (Think why!) To simplify things. x − a . if p = q = 1/2 b−a Proof.i−1 = q = 1 − p. . . and where the states a. 0 ≤ x ≤ ℓ. she must base it on whatever outcomes have appeared thus far. P (tails) = 1 − p. Consider a gambler playing a simple coin tossing game. . . . The diﬀerence between the gambler’s ruin problem for a company and that for a mere mortal is that. as described above. betting £1 at each integer time against a coin which has P (heads) = p. The general solution will then obtained from Px (Tb < Ta ) = ϕ(x − a. ℓ). 25 ϕ(ℓ) = 1.i+1 = p. . in addition. if p = q Px (Tb < Ta ) = . Yk cannot depend on ξk . We are clearly dealing with a Markov chain in the state space {a. a + 1. is our next goal. ℓ) := Px (Tℓ < T0 ). We use the notation Px (respectively. Theorem 7. pi. Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. ξk−1 and–as is often the case in practice–on other considerations of e. a < i < b. Let £x be the initial fortune of the gambler. astrological nature. the company has control over the probability p.if the gambler wants to have a strategy at all. Let us consider a concrete problem. write ϕ(x) instead of ϕ(x. b} with transition probabilities pi. To solve the problem. We wish to consider the problem of ﬁnding the probability that the gambler will never lose her money. . expectation) under the initial condition X0 = x. b = ℓ > 0 and let ϕ(x. it will make enormous proﬁts). in the ﬁrst case. b − a). a ≤ x ≤ b. We ﬁrst solve the problem when a = 0. a + 2.g.

Corollary 4.and. ℓ→∞ λℓ − 1 x = 1 − 0 = 1. ℓ→∞ ℓ→∞ (q/p)x . if p > 1/2 if p ≤ 1/2. and so Px (T0 < ∞) = 1 − lim If p < 1/2. We have δ(x) = q δ(x − 1) = p q p 2 0 < x < ℓ. ℓ→∞ λℓ − 1 0−1 λx − 1 = 1 − 0 = 1. ϕ(x) = pϕ(x + 1) + qϕ(x − 1). λℓ − 1 0 ≤ x ≤ ℓ. The general case is obtained by replacing ℓ by b − a and x by x − a. Px (T0 < ∞) = 1 − lim 26 . Let δ(x) := ϕ(x + 1) − ϕ(x). and so λx − 1 λx − 1 =1− = λx . λℓ −1 x ℓ. 1. λ = 1. then λ = q/p < 1. λx − 1 δ(0). 0 ≤ x ≤ ℓ. ℓ) = λx −1 . λ−1 ϕ(x) = If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). if p = q. ℓ→∞ ℓ Px (T0 < ∞) = 1 − lim Finally. we ﬁnd δ(0) = 1/ℓ and so x ϕ(x) = . If p > 1/2. ℓ Summarising. by ﬁrst-step analysis. We ﬁnally ﬁnd λx − 1 . we have λℓ −1 λ−1 δ(0) = 1 which gives δ(0) = (λ − 1)/(λℓ − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). if λ = 1 if λ = 1. Since p + q = 1. The ruin probability of a gambler starting with £x is Px (T0 < ∞) = Proof. λ = 1. We have Px (T0 < ∞) = lim Px (T0 < Tℓ ) = 1 − lim Px (Tℓ < T0 ). we ﬁnd ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λx−1 + · · · + λ + 1]δ(0) = Since ϕ(ℓ) = 1. then λ = q/p > 1. we have obtained ϕ(x. Assuming that λ := q/p = 1. δ(x − 2) = · · · = q p x δ(0). we can write this as p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)]. Since ϕ(ℓ) = 1.

. . . . n ≥ 0) is a time-homogeneous Markov chain. . . A. .j2 · · · P (Xn = i. . Let τ be a stopping time. . In particular. Then. (d) where (a) is just logic. We are going to show that for any such A. . . τ < ∞)P (A | Xτ . Xn ) and the ordinary splitting property at n. Xτ ) and has the same law as the original process started from i. . . . . A.) is independent of the past process (X0 . . Xn ). . . A ∩ {τ = n} is determined by (X0 . . . P (Xτ +1 = j1 . . . τ = n) (c) = P (Xn+1 = j1 . . .8 Stopping times and the strong Markov property A random time is a random variable taking values in the time-set. τ < ∞) and that P (Xτ +1 = j1 . Xn+2 = j2 . . Theorem 8 (strong Markov property). possibly. . A | Xτ . τ = n)P (Xn = i. Xn ) for all n ∈ Z+ . n ≥ 0) is independent of the past process (X0 . . Xn+2 = j2 . Xn+2 = j2 . . . the future process (Xτ +n . . .j1 pj1 . τ = n) = P (Xn+1 = j1 . Xn = i. . . . . Xn+1 . for any n. conditional on τ < ∞ and Xτ = i.j1 pj1 . Proof. Let A be any event determined by the past (X0 . . τ < ∞) = P (Xn+1 = j1 . A. τ = n) (a) (b) = P (Xτ +1 = j1 . . τ = n). . . the future process (Xn . including. τ = n) = pi. A. Suppose (Xn . | Xn = i. . . Xτ +2 = j2 . . | Xτ = i. Xτ )1(τ < ∞) = n∈Z+ (X0 . | Xτ . . Xτ +2 = j2 . . for all n ∈ Z+ . | Xn = i)P (Xn = i. Xτ ) if τ < ∞. . . . A. considered earlier. Xτ +2 = j2 . . . . any such A must have the property that. and (d) is due to the fact that (Xn ) is time-homogeneous Markov chain with transition probabilities pij . τ < ∞). We will now prove that this remains true if we replace the deterministic n by a random stopping time τ . a stopping time τ is a random variable with values in Z+ ∪{+∞} such that 1(τ = n) is a deterministic function of (X0 .j2 · · · We have: P (Xτ +1 = j1 . Xn )1(τ = n). . An example of a random time which is not a stopping time is the last visit of a speciﬁc set. Xτ = i. . Since (X0 . . Recall that the Markov property states that.Xτ +2 = j2 . . This stronger property is referred to as the strong Markov property. the ﬁrst hitting time of the set A. Xn ) if we know the present Xn . An example of a stopping time is the time TA . 27 . . Let τ be a stopping time. . . (c) is due to the fact that A ∩ {τ = n} is determined by (X0 . τ < ∞) = pi. . the value +∞. Our assertions are now easily obtained from this by ﬁrst summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P (Xτ = i. A. (b) is by the deﬁnition of conditional probability. .

i. then the behaviour of the chain between successive visits cannot be inﬂuenced by the past or the future (and that is due to the strong Markov property). So Xi is the r-th excursion: the chain visits 28 . we will show that we can split the sequence into i. (Why?) Consider the “trajectory” of the chain between two successive visits to the state i: Xi Let also We think of (r) := Xn .. . Regardless. we let Ti be the time of the second (1) (3) visit. (1) r = 1. These excursions are independent! 11111111111 00000000000 1111111111 0000000000 11111111111 00000000000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 1111111111111111 0000000000000000 11111 11111111111111 00000000000000 1111111111 0000000000 i 00000 11111111111111111111111111111111111111111111 00000000000000000000000000000000000000000000 1111111111111111 0000000000000000 11111 00000 11111 00000 Ti = Ti (1) Ti (2) Ti (3) Ti (4) Fix a state i ∈ S and let Ti be the ﬁrst time that the chain will visit this state: Ti := inf{n ≥ 1 : Xn = i}. random trajectories. . However. we recursively deﬁne Ti (r) (2) := inf{n > Ti (r−1) : Xn = i}. Schematically. Xi (1) .d. in fact. Our Ti here is always positive. := Xn .d. Note the subtle diﬀerence between this Ti and the Ti of Section 6. The Ti of Section 6 was allowed to take the value 0. 2. r = 2. (And we set Ti := Ti . Formally. The random variables Xn comprising a general Markov chain are not independent. “chunks”.) We let Ti be the time of the third visit. They are. All these random times are stopping times. in a generalised sense. We (r) refer to them as excursions of the process. If X0 = i then the two deﬁnitions coincide. . by using the strong Markov property. 3. The idea is that if we can ﬁnd a state i that is visited again and again. Ti Xi (0) (r) ≤ n < Ti (r+1) . as “random variables”. Xi (0) . . Xi (2) .9 Regenerative structure of a Markov chain A sequence of i. State i may or may not be visited again.. 0 ≤ n < Ti . random variables is a (very) particular example of a Markov chain.i. and so on. . . .

). are independent random variables. ad inﬁnitum. Xi (1) . Λ2 . But (r) the state at Ti is. (Numerical) functions g(Xi independent random variables. of the excursions are Λr := (r) n=Ti 1(Xn = j) The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. . g(Xi (1) ). (r) . by deﬁnition. starting from i. The last corollary ensures us that Λ1 . Example: Deﬁne Ti (r+1) (0) ).. given the state at (r) (r) (r) the stopping time Ti .. except. Perhaps some states (inessential ones) will be visited ﬁnitely many times. . A trivial. Apply Strong Markov Property (SMP) at Ti : SMP says that. . we “watch” it up until the next time it returns to state i. The excursions Xi (0) . The claim that Xi . Xi (2) . 10 Recurrence and transience When a Markov chain has ﬁnitely many states. the future after Ti is independent of the past before Ti . . . all. In particular. we say that a state i is transient if. Therefore. it will behave in the same way as if it starts from the same state at another point of time.i. g(Xi (2) ). we say that a state i is recurrent if. and “goes for an excursion”. if it starts from state i at some point of time. possibly. the chain returns to i only ﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 0. have the same Proof. for time-homogeneous Markov chains. starting from i. Second. Xi have the same distribution follows from the fact that the Markov chain is time-homogeneous. 29 . An immediate consequence of the strong Markov property is: Theorem 9. Moreover. . they are also identical in law (i. but there are always states which will be visited again and again.state i for the r-th time. equal to i. but important corollary of the observation of this section is that: Corollary 5. (0) are independent random variables. We abstract this trivial observation and give a couple of deﬁnitions: First.d. the future is independent of the (1) (2) past. . Therefore. and this proves the ﬁrst claim. at least one state must be visited inﬁnitely many times. the chain returns to i inﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 1. Xi distribution.

every state is either recurrent or transient. We will show that such a possibility does not exist. Lemma 5.Important note: Notice that the two statements above are not negations of one another. Pi (Ji = ∞) = 1. we conclude what we asserted earlier. Consider the total number of visits to state i (excluding the possibility that X0 = i): ∞ Ji = n=1 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞} . Ei Ji = ∞. the event {Ji ≥ k} implies that Ti < ∞ and that Ji ≥ k − 1. then Pi (Ji < ∞) = 1. namely. Notice that: state i is visited inﬁnitely many times ⇐⇒ Ji = ∞. Indeed. we have Proof. 4. Indeed. If fii = 1 we have that Pi (Ji ≥ k) = 1 for all k. Therefore. therefore concluding that every state is either recurrent or transient. This means that the state i is transient. We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas. State i is recurrent. Starting from X0 = i. Because either fii = 1 or fii < 1. Notice also that fii := Pi (Ji ≥ 1) = Pi (Ti < ∞). 30 . Pi (Ji = ∞) = 1. We claim that: Lemma 4. Suppose that the Markov chain starts from state i. and that is why we get the product. 2 k Pi (Ji ≥ k) = fii Pi (Ji ≥ k − 1) = fii Pi (Ji ≥ k − 2) = · · · = fii . 2. k ≥ 0. .e. The following are equivalent: 1. in principle. If fii < 1. 3. By the strong Markov property at time Ti (k−1) Pi (Ji ≥ k) = Pi (Ji ≥ k − 1)Pi (Ji ≥ 1). fii = Pi (Ti < ∞) = 1. But the past (k−1) before Ti is independent of the future. (k−1) Notice that fii = Pi (Ti < ∞) can either be equal to 1 or smaller than 1. This means that the state i is recurrent. there could be yet another possibility: 0 < Pi (Xn = i for inﬁnitely many n) < 1. i. the random variable Ji is geometric: k Pi (Ji ≥ k) = fii .

If Ei Ji = ∞. this implies that Ei Ji < ∞. Clearly. assuming that the chain (n) starts from state i. but the two sequences are related in an exact way: Lemma 7. The following are equivalent: 1.e. Xn = j) Pi (Tj = m)Pi (Xn = j | Tj = m) 31 = m=1 . if we assume Ei Ji < ∞. is equivalent to Pi (Ji < ∞) = 1. On the other hand. clearly. by the deﬁnition of Ji . Pi (Ji < ∞) = 1. (n) But if a series is ﬁnite then the summands go to 0. then. 10. Ji cannot take value +∞ with positive probability. (n) pij n (m) (n−m) (n) (n) = m=1 fij pjj . State i is recurrent if. From Elementary Probability. by deﬁnition. This. Proof. The equivalence of the ﬁrst two statements was proved above.e. since Ji is a geometric random variable. it is visited inﬁnitely many times. we see that its expectation cannot be inﬁnity without Ji itself being inﬁnity with probability one. (n) i. fii = Pi (Ti < ∞) < 1. which. therefore Pi (Ji < ∞) = 1. The equivalence of the ﬁrst two statements was proved before the lemma. by deﬁnition. pii → 0. it is visited ﬁnitely many times with probability 1. If i is a transient state then pii → 0. Ei Ji < ∞. Hence Ei Ji = ∞ also. 4. But ∞ ∞ ∞ (n) Ei Ji = Ei n=1 1(Xn = i) = n=1 Pi (Xn = i) = n=1 (n) pii . Proof. Corollary 6. If i is transient then Ei Ji < ∞.Proof. Proof. State i is transient. Lemma 6. as n → ∞.1 Deﬁne First hitting time decomposition fij := Pi (Tj = n). as n → ∞. 2. owing to the fact that Ji is a geometric random variable. by deﬁnition of Ji is equivalent to Pi (Ji = ∞) = 1. i. The latter is the probability to be in state j at time n (not necessarily for the ﬁrst time) starting from i. fij ≤ pij . 3. Notice the diﬀerence with pij . State i is transient if. and. then. (n) pij n = Pi (Xn = j) = m=1 n Pi (Tj = m. n ≥ 1. the probability to hit state j for the ﬁrst time at time n ≥ 1.

and since they take the value 1 with positive probability.i. If j is a transient state then pij → 0. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. So the random variables δj. In particular.r := 1 j appears in excursion Xi (r) . . by deﬁnition. as n → ∞. Hence. we have n=1 pjj = Ej Jj < ∞. the probability to go to j in some ﬁnite time is positive. ( Remember: another property that was a class property was the property that the state have a certain period. inﬁnitely many of them will be 1 (with probability 1). . For the second term. not only fij > 0. Hence. if we let fij := Pi (Tj < ∞) . Xi . we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj . Start with X0 = i. 32 . Proof. ∞ Proof. for all states i.d.d. starting from i. . r = 0. we have i j ⇐⇒ fij > 0. and gij := Pi (Tj < Ti ) > 0. If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. excursions Xi . Corollary 7. either all states are recurrent or all states are transient. In a communicating class. In other words. The last statement is simply an expression in symbols of what we said above. showing that j will appear in inﬁnitely many of the excursions for sure. 1.i.The ﬁrst term equals fij . . and so the probability that j will appear in a speciﬁc excursion is positive. 10. j i. If j is transient.. . but also fij = 1. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). by summing over n the relation of Lemma 7. and hence pij as n → ∞. Indeed. We now show that recurrence is a class property. j i. are i. Corollary 8. Now.) Theorem 10. which is ﬁnite.2 Communicating classes and recurrence/transience Recall the deﬁnition of the relation “state i leads to state j”. denoted by i j: it means that. . the probability that j will appear in a speciﬁc excursion equals gij = Pi (Tj < Ti ).

and hence. P (B) = Pi (Xn = i for inﬁnitely many n) < 1. Every ﬁnite closed communicating class is recurrent. Let T := inf{n ≥ 1 : Un > U0 }. So if a communicating class is recurrent then it contains no inessential states and so it is closed. We now take a closer look at recurrence. all other states are also transient. Thus T represents the number of variables we need to draw to get a value larger than the initial one. What is the expectation of T ? First. By ﬁniteness.i. there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0. Theorem 12. every other j is also recurrent. n+1 33 = 0 . 1]. . Recall that a ﬁnite random variable may not necessarily have ﬁnite expectation. By closedness.e. be i. If i is recurrent then every other j in the same communicating class satisﬁes i j. Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x. It should be clear that this can. P (T > n) = P (U1 ≤ U0 . Proof.Proof. Let C be the class. U1 . but Pi (A ∩ B) = Pi (Xm = j. Suppose that i is inessential. and so i is a transient state. . i. But then. This state is recurrent. appear in a certain game. Un ≤ x | U0 = x)dx xn dx = 1 . . . . . Proof. random variables. If i is inessential then it is transient. some state i must be visited inﬁnitely many times. If a communicating class contains a state which is transient then. Hence all states are recurrent. .d. if started in C. Indeed. . Pi (B) = 0. . Example: Let U0 . X remains in C. by the above theorem. necessarily. j but j i. for example. . Hence 11 Positive recurrence Recall that if we ﬁx a state i and let Ti be the ﬁrst return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state). i. . Theorem 11. U2 . observe that T is ﬁnite. uniformly distributed in the unit interval [0.e. Hence. 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ). Xn = i for inﬁnitely many n) = 0.

(i) Suppose E|Z1 | < ∞ and let µ = EZ1 . When we have a Markov chain. an inﬁnite number of steps to exceed it! We now classify a recurrent state i as follows: We say that i is positive recurrent if Ei Ti < ∞. 0) = ∞. It is not a Law. of successive states is not an independent sequence. . Before doing that. be i. . the Fundamental Theorem of Probability Theory. X2 . Let Z1 . the sequence X0 . To our rescue comes the regenerative structure of a Markov chain identiﬁed at an earlier section. . It is traditional to refer to it as “Law”. Then P Z1 + · · · + Zn =∞ n→∞ n lim = 1.d. we shall discover what the form of this stationary distribution is by using the Law of Large Numbers 12 Law (=Theorem) of Large Numbers in Probability Theory The following result is.i. P (T < ∞) = lim P (T ≤ n) = lim (1 − n→∞ ∞ n→∞ 1 ) = 1. Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. n+1 But ET = ∞ P (T > n) = n=1 n=1 1 = ∞. 34 . n+1 You are invited to think about the meaning of this: If the initial value U0 is unknown. . but E min(Z1 . on the average. . (ii) Suppose E max(Z1 . 13 Law of Large Numbers for Markov chains The law of large numbers stated above requires independence. then it takes. Theorem 13. it is a Theorem that can be derived from the axioms of Probability Theory. arguably. random variables. 0) > −∞. X1 . . Then P Z1 + · · · + Zn =µ n→∞ n lim = 1. Z2 . We state it here without proof. To apply the law of large numbers we must create some kind of independence. But this is misleading. We say that i is null recurrent if Ei Ti = ∞.Therefore.

. for reasonable choices of the functions G taking values in R. for an excursion X . Xa(2) . assign a reward f (x). G(Xa(r) ) = (r) (r+1) Ta ≤t<Ta f (Xt ).2 Average reward Consider now a function f : S → R. Since functions of independent random variables are also independent. . as in Example 2. . Xa . Xa(3) . Example 2: Let f (x) be as in example 1. G(Xa(3) ). 35 .d. and because if we start with X0 = a. We are interested in the existence of the limit time-average reward = lim 1 t→∞ t t f (Xk ). because the excursions Xa . we can think of as a reward received when the chain is in state x. In particular. . Thus. (7) Whichever the choice of G. . Example 1: To each state x ∈ S. Note that EG(Xa(1) ) = EG(Xa(2) ) = · · · = E G(Xa(0) ) | X0 = a = Ea G(Xa(0) ). .d. we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.13. . (1) (2) (1) (n) (8) 13. the (0) initial excursion Xa also has the same law as the rest of them. we have that G(Xa(1) ). which. (r) (r+1) G(Xa(r) ) = max{f (Xt ) : Ta ≤ t < Ta }. forms an independent sequence of (generalised) random variables. The ﬁnal equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a. G(Xa(2) ). are i. in this example. random variables. with probability 1.1 Functions of excursions Namely.i.i. . the law of large numbers tells us that G(Xa ) + · · · + G(Xa ) → EG(Xa(1) ). . but deﬁne G(X ) to be the total reward received over the excursion X . given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been ﬁxed once and for all) the sequence of excursions Xa(1) . Speciﬁcally. Then.. k=0 which represents the ‘time-average reward’ received. n as n → ∞. deﬁne G(X ) to be the maximum reward received during this excursion. are i.

t t as t → ∞. Since P (Ta < ∞) = 1. Let f : S → R+ be a bounded ‘reward’ function. with probability one. Ea Ta < ∞) and assume that the distribution of X0 is such that P (Ta < ∞) = 1. Then. and so 1 ﬁrst G → 0. say |f (x)| ≤ B. Ea f= n=0 f (Xn ) E a Ta = x∈S f (x)π(x). A similar argument shows that 1 last G → 0. We assumed that f is t bounded.e. (r) Nt := max{r ≥ 1 : Ta ≤ t}. with probability one. we have BTa /t → 0. 0 Ta (1) Ta (2) T (3) a Ta(4) Ta (5) t We write the total reward as a sum of rewards over complete excursions plus a ﬁrst and a last term: t Nt −1 r=1 (1) f (Xn ) = n=0 Gr + Gﬁrst + Glast . Thus we need to keep track of the number. Gr is given by (7). t as t → ∞. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i. in the ﬁgure below. the ‘time-average reward’ f exists and is a deterministic quantity: P where Ta −1 1 t→∞ t lim t f (Xn ) = f n=0 =1 . t t In other words. The idea is this: Suppose that t is large enough. and Glast is the reward over the last incomplete cycle. Gﬁrst is the reward up to time Ta = Ta or t (whichever t is smaller). We break the sum into the total reward received over the initial excursion. Nt = 5. We are looking for the total reward received between 0 and t. We shall present the proof in a way that the constant f will reveal itself–it will be discovered. Proof. t t 36 . say. Nt . For example. Hence |Gﬁrst | ≤ BTa .Theorem 14. of complete excursions from the ground state a that occur between 0 and t. and so on. plus the total reward received over the next excursion.

it follows that the number Nt of complete cycles before time t will also be tending to ∞.i. = E a Ta . E a Ta 37 . r=1 with probability one. Write now 1 t Nt −1 Gr = r=1 Nt 1 t Nt Nt −1 Gr . we have that the visit with index r = Nt to the ground state a is the last before t (see also ﬁgure above): (N (N Ta t ) ≤ t < Ta t +1) . . are i. Hence Ta t t Ta t ≤ < Nt Nt Nt But (11) tells us that Ta t Ta t = lim t→∞ Nt t→∞ Nt lim Hence lim (N ) (N +1) (N ) (N +1) . lim t∞ 1 Nt Nt −1 Gr = EG1 . t→∞ t E a Ta (12) Using (10) and (12) in (9) we have 1 lim t→∞ t Thus 1 t→∞ t lim t Nt −1 Gr = r=1 EG1 . (11) By deﬁnition. So we are left with the sum of the rewards over complete cycles. Since Ta → ∞.also. r=1 (9) Look again what the Law of Large Numbers (8) tells us: that 1 n→∞ n lim (r) n Gr = EG1 . .. random variables with common expectation Ea Ta . r=1 (10) Now look at the sequence r (r) (1) Ta = Ta + k=1 (k) (k−1) ) (Ta − Ta (1) (k) (k−1) and see what the Law of Large Numbers again. and since Ta − Ta k = 1. 2. 1 Nt = . . = E a Ta . as r → ∞. Hence. as t → ∞. we have Ta r→∞ r lim (r) . Since Ta /r → 0.d. E a Ta f (Xn ) = n=0 EG1 =: f .

Ta −1 k=0 1 lim t→∞ t with probability 1. The physical interpretation of the latter should be clear: If. which communicate and are positive recurrent. Applying formula (14) with b in place of a we ﬁnd that the limit in (13) also equals 1/Eb Tb . b. it takes Ea Ta units of time between successive visits to the state a.To see that f is as claimed. on the average. E a Ta (14) with probability 1. The equality of these two limits gives: average no. from the Strong Markov Property. (13) 1(Xn = b) = n=1 Comment 3: If in the above we let b = a. of times state b is visited = Ea between two successive visits to state a Ta −1 1(Xk = b) = k=0 E a Ta . Note that we include only one endpoint of the excursion. f(X2 ) f(X1 ) f(X )= f(a) 0 f(T −1) a 0 T a Comment 2: A very important particular case occurs when we choose f (x) := 1(x = b). we let f to be 1 at the state b and 0 everywhere else. then the long-run proportion of visits to the state a should equal 1/Ea Ta . We can use b as a ground state instead of a. We may write EG1 = Ea 0≤t<Ta f (Xt ) or EG1 = Ea 0<t≤Ta f (Xt ). for all b ∈ S. i. The constant f is a sort of average over an excursion. then we see that the numerator equals 1. Hence t→∞ lim 1 t t 1(Xn = a) = n=1 1 . t Ea 1(Xk = b) E a Ta . a. The theorem above then says that. The denominator is the duration of the excursion. The numerator of f is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). Comment 4: Suppose that two states. the ground state. Comment 1: It is important to understand what the formula says.e. E b Tb 38 . just observe that EG1 = E (1) (1+1) Ta ≤t<Ta f (Xt ) = Ea 0≤t<Ta f (Xt ). not both.

Indeed. In view of this. x ∈ S. Both of them equal 1. . (15) should play an important role. by a function of X1 . X2 . i. The real reason for doing so is because we wanted to replace a function of X0 . Consider the function ν [a] deﬁned in (15). we have 0 < ν [a] (x) < ∞. notice that the choice of notation ν [a] is justiﬁable through the notation (6) for the communicating class of the state a. ν [a] (x) = y∈S ν [a] (y)pyx . 39 . . we can write ν (x) = Ea k=1 [a] Ta 1(Xn = x). X2 . Fix a ground state a ∈ S and assume that it is recurrent5 . so there is nothing lost in doing so.e. we saw (Comment 2 above) that if a is a positive recurrent state then ν [a] (x) E a Ta is the long-run fraction of times that state x is visited. for any state b such that a → x (a leads to x). where a is the ﬁxed recurrent ground state whose existence was assumed. Proof. . Moreover. . 5 We stress that we do not need the stronger assumption of positive recurrence here. we will soon realise that this dependence vanishes when the chain is irreducible. x ∈ S. X1 . Note: Although ν [a] depends on the choice of the ‘ground’ state a. Clearly. Hence the superscript [a] will soon be dropped. Since a = XT a . Recurrence tells us that Ta < ∞ with probability one. Proposition 2. . Then ν [a] satisﬁes the balance equations: ν [a] = ν [a] P. (16) This was a trivial but important step: We replaced the n = 0 term with the n = Ta term. . this should have something to do with the concept of stationary distribution discussed a few sections ago. X3 .14 Construction of a stationary distribution Our discussions on the law of large numbers for Markov chains revealed that the quantity Ta −1 ν (x) := Ea n=0 [a] 1(Xn = x) . We start the Markov chain with X0 = a.

for all n ∈ N. by deﬁnition. which are precisely the balance equations. x a. Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. 40 . So we have shown ν [a] (x) = y∈S pyx ν [a] (y). (17) This π is a stationary distribution. We are now going to assume more. n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). ax ax From Theorem 10. xa so. we used (16). We now have: Corollary 9. such that pax > 0. ν [a] = ν [a] Pn . which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. Xn−1 = y. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. is ﬁnite when a is positive recurrent. namely. Therefore. where. Note: The last result was shown under the assumption that a is a recurrent state. ν [a] (x) < ∞. Deﬁne Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . First note that. n ≤ Ta ) Pa (Xn = x. so there exists m ≥ 1. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). so there exists n ≥ 1. in this last step. Suppose that the Markov chain possesses a positive recurrent state a. x ∈ S. Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta . from (16). We just showed that ν [a] = ν [a] P. (n) Next. n ≤ Ta ) pyx Pa (Xn−1 = y. that a is positive recurrent. indeed.and thus be in a position to apply ﬁrst-step analysis. which. let x be such that a x. for all x ∈ S. such that pxa > 0.

if tails show up. Otherwise. 41 . If heads show up at the r-th excursion. then j occurs in the r-th excursion. Since a is positive recurrent. Let i be a positive recurrent state.d. the event that j occurs in a speciﬁc excursion is independent from the event that j occurs in the next excursion. and so π [a] is not trivially equal to zero. unique. in Proposition 2. Suppose that i recurrent. We already know that j is recurrent. x∈S Pause for thought: We have achieved something VERY IMPORTANT: if we have one. Hence j will occur in inﬁnitely many excursions. Let R1 (respectively. But π [a] is simply a multiple of ν [a] . then the r-th excursion does not contain state j. This is what we will try to do in the sequel.i. second) excursion that contains j. Start with X0 = i. since the excursions are independent. then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is. 15 Positive recurrence is a class property We now show that (just like recurrence) positive recurrence is also a class property. 1. Also. Theorem 15. π(y) = 1 y∈S x∈S have at least one solution. . we toss a coin with probability of heads gij > 0. . in general. we have Ea Ta < ∞. It only remains to show that π [a] is a probability. that it sums up to one. We have already shown. .e. no matter which one. 2. and the probability gij = Pi (Tj < ∞) that j occurs in a speciﬁc excursion is positive.In other words: If there is a positive recurrent state. to decide whether j will occur in a speciﬁc excursion. The coins are i. (r) j. π [a] P = π [a] . But π [a] (x) = x∈S 1 E a Ta ν [a] (x) = 1. R2 ) be the index of the ﬁrst (respectively. Therefore. Proof. Then j is positive Proof. positive recurrent state. then the set of linear equations π(x) = y∈S π(y)pyx . r = 0. i. In words. Consider the excursions Xi . that ν [a] P = ν [a] .

.i. . the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. by Wald’s lemma. 2. . or all states are null recurrent or all states are transient. IRREDUCIBLE CHAIN r = 1. An irreducible Markov chain is called transient if one (and hence all) states are transient. . with the same law as Ti . where the Zr are i. Let C be a communicating class. . Corollary 10. An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. . In other words.R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . Then either all states in C are positive recurrent. we just need to show that R2 −R1 +1 ζ := r=1 Zr has ﬁnite expectation. . R2 has ﬁnite expectation. But. RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 .d. Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞.

By a theorem of Analysis (Bounded Convergence Theorem). x ∈ S. (18) π(x)Px (Ta < ∞) = π(x) = 1. we can take expectations of both sides: E as t → ∞. Fix a ground state a. π(1) = 0. Then there is a unique stationary distribution. as t → ∞. we have. we have P (Xn = x) = π(x). there is a stationary distribution. Since the chain is irreducible and positive recurrent we have Pi (Tj < ∞) = 1. Consider the Markov chain (Xn . state 1 (and state 2) is positive recurrent. Let π be a stationary distribution. an absorbing state). Theorem 16. the π deﬁned by π(0) = α. what prohibits uniqueness. we start the Markov chain from the stationary distribution π. Hence. for all n. indeed. Ta −1 1 t t Ea 1(Xk = x) k=0 1(Xn = x) n=1 → E a Ta = π [a] (x). π(2) = 1 − α 2.16 Uniqueness of the stationary distribution At the risk of being too pedantic. for totally trivial reasons (it is. after all. x∈S By (13). But. Then. 1(Xn = x) n=1 = 1 t t P (Xn = x) = π(x). Proof. Example: Consider the chain we’ve seen before: 1 1/3 1 0 1/2 1 1/6 2 Clearly. we stress that a stationary distribution may not be unique. Then P (Ta < ∞) = x∈S In other words. (You should think why this is OBVIOUS!) Notice that 1 This lack of communication is. x ∈ S. n=1 43 . for all states i and j. with probability one. E 1 t t 1 t t 1(Xn = x) n=1 → π [a] (x). for all x ∈ S. following Theorem 14. But observe that there are many–inﬁnitely many–stationary distributions: For each 0 ≤ α ≤ 1. is a stationary distribution. by (18). recognising that the random variables on the left are nonnegative and bounded below 1. Consider positive recurrent irreducible Markov chain. n ≥ 0) and suppose that P (X0 = x) = π(x).

Consider a Markov chain. Consider a positive recurrent irreducible Markov chain and two states a. But π(i|C) > 0. Decompose it into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π C 44 . π(a) = π [a] (a) = 1/Ea Ta . Both π [a] and π[b] are stationary distributions. Let C be the communicating class of i. 1 π(a) = . In particular. for all x ∈ S. In particular. which is useful because it is often used as a criterion for positive recurrence: Corollary 13. Let i be such that π(i) > 0. Therefore Ei Ti < ∞. so they are equal. π(i|C) = Ei Ti . Hence π = π [a] . Thus. we obtain the cycle formula: Ta −1 Tb −1 Ea k=0 1(Xk = x) E a Ta = Eb k=0 1(Xk = x) E b Tb . In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. an arbitrary stationary distribution π must be equal to the speciﬁc one π [a] . i.Therefore π(x) = π [a] (x). delete all transition probabilities pxy with x ∈ C. Then. Consider an arbitrary Markov chain. Proof. Assume that a stationary distribution π exists. Consider the restriction of the chain to C. i is positive recurrent. b ∈ S. the only stationary distribution for the restricted chain because C is irreducible.e. There is only one stationary distribution. There is a unique stationary distribution. This is in fact. Then any state i such that π(i) > 0 is positive recurrent. for all a ∈ S. The cycle formula merely re-expresses this equality. The following is a sort of converse to that. x ∈ S. We can easily see that we thus obtain a Markov chain with state space C with stationary distribution π(·|C). Corollary 11. Proof. π(C) x ∈ C. where π(C) = y∈C π(y). Hence there is only one stationary distribution. Let π(·|C) be the distribution deﬁned by restricting π to C: π(x|C) := π(x) . E a Ta Proof. Then π [a] = π [b] . By the 1 above corollary. 17 Structure of a stationary distribution It is not hard to show now what an arbitrary stationary distribution looks like. Corollary 12. Namely. from the deﬁnition of π [a] . Consider a positive recurrent irreducible Markov chain with stationary distribution π. y ∈ C.

eventually. We all know. from physical experience. This is deﬁned by picking a state (any state!) a ∈ C and letting π C := π [a] . x ∈ C) is a stationary distribution. x∈C Then (π(x|C). π(x|C) ≡ π C (x). as n → ∞.1 Coupling and stability Deﬁnition of stability Stability refers to the convergence of the probabilities P (Xn = x) as n → ∞. Stability of a deterministic dynamical system means that the system eventually converges to a ﬁxed point (or. Proposition 3. without friction. more generally. A Markov chain is a simple model of a stochastic dynamical system. For example. A sequence of real numbers an k=1 1 may not converge but its Cesaro averages n (a1 + · · · + an ) may converge. it will settle to the vertical position.which assigns positive probability to each state in C. under irreducibility and positive recurrence conditions. Let π be an arbitrary stationary distribution. where π(C) := π(C) π(x). for example. So we have that the above decomposition holds with αC = π(C). that. To each positive recurrent communicating class C there corresponds a stationary distribution π C . In an ideal pendulum. the pendulum will perform an 45 . Proof. that it remains in a small region). For each such C deﬁne the conditional distribution π(x|C) := π(x) . If π(x) > 0 then x belongs to some positive recurrent class C. Any stationary distribution π is necessarily of the form π= C: C positive recurrent communicating class αC π C . So far we have seen. Why stability? There are physical reasons why we are interested in the stability of a Markov chain. consider the motion of a pendulum under the action of gravity. 18 18. There is dissipation of energy due to friction and that causes the settling down. that the pendulum will swing for a while and. This is the stable position. the so-called Cesaro 1 averages n n P (Xk = x) converge to π(x). An arbitrary stationary distribution π must be a linear combination of these π C . such that C αC = 1. where αC ≥ 0. take an = (−1)n . By our uniqueness result.

or whatever. 4 3 2 x* 1 0 0 1 2 t 3 t4 If we start with x(t = 0) = x∗ .e. are given. as t → ∞. ˙ There are myriads of applications giving physical interpretation to this. X2 . the one found by setting the right-hand side equal to zero: x∗ = b/a. a > 0. . simple because the ξn ’s depend on the time parameter n. The thing to observe here is that there is an equilibrium point i. Since we are prohibited by the syllabus to speak about continuous time stochastic systems. If. And we are in a happy situation here. random variables. by means of the recursion above. ······ 46 . The simplest diﬀerential equation As another example. . ξ0 . we may it is stable because it does not (it cannot!) swing too far away. for we can do it: X1 = ρX0 + ξ0 X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0 ) + ξ1 = ρ2 X0 + ρξ0 + ξ1 X3 = ρX2 + ξ2 = ρ3 X0 + ρ2 ξ0 + ρξ1 + ξ2 Xn = ρn X0 + ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . x ∈ R. consider the following diﬀerential equation x = −ax + b. we have x(t) → x∗ . ξ2 . then x(t) = x∗ for all t > 0. A linear stochastic system We now pass on to stochastic systems. mutually independent. . It follows that |ρ| < 1 is the condition for stability. But what does stability mean? In general. then regardless of the starting position. . We say that x∗ is a stable equilibrium. or an example in physics. Speciﬁcally. moreover. We shall assume that the ξn have the same law. ξ1 . let us consider a stochastic analogue of the above in discrete time: Xn+1 = ρXn + ξn . .oscillatory motion ad inﬁnitum. it cannot mean convergence to a ﬁxed equilibrium point. so I’ll leave it upon the reader to decide whether he or she wants to consider this as an ﬁnancial example. we here assume that X0 . Still. However notice what happens when we solve the recursion. . and deﬁne a stochastic sequence X1 .

i. In other words. which is a linear combination of ξ0 . while the limit of Xn does not exists. ξn−1 does not converge. A straightforward (often not so useful) construction is to assume that the random variables are independent and deﬁne P (X = x.g. for any initial distribution. assume that the ξn are bounded) Xn does converge to ˆ X∞ = k=0 ρk ξk . Y = y) is? 47 . In particular. under nice conditions. (e.Since |ρ| < 1. 18. . the n-step transition matrix Pn . Coupling refers to the various methods for constructing this joint law. ∞ ˆ The nice thing is that. to a matrix with rows all equal to π: (n) lim p n→∞ ij = π(j).2 The fundamental stability theorem for Markov chains Theorem 17. deﬁne stability in a way other than convergence of the distributions of the random variables. we need to understand the notion of coupling. Y but not their joint law. we have that the ﬁrst term goes to 0 as n → ∞. suppose you are given the laws of two random variables X. converges. for time-homogeneous Markov chains. . How should we then deﬁne what P (X = x. as n → ∞. ˜ Xn := ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . The second term. uniformly in x. we know f (x) = P (X = x). we have ˜ ˆ P (Xn ≤ x) = P (Xn ≤ x). but notice that if we let ˆ Xn := ρn−1 ξn−1 + ρn−2 ξn−2 + · · · + ρξ1 + ξ0 . but we are not given what P (X = x. 18. . to try to make the coincidence probability P (X = Y ) as large as possible. j ∈ S. Starting at an elementary level. Y = y) is. If a Markov chain is irreducible. the limit of its distribution does: ˜ ˆ ˆ lim P (Xn ≤ x) = lim P (Xn ≤ x) = lim P (Xn ≤ x) = P (X∞ ≤ x) n→∞ n→∞ n→∞ This is precisely the reason why we cannot.3 Coupling To understand why the fundamental stability theorem works. for instance. Y = y) = P (X = x)P (Y = y). . positive recurrent and aperiodic with (unique) stationary distribution π then n→∞ lim P (Xn = x) = π(x). g(y) = P (Y = y). But suppose that we have some further requirement. What this means is that.

n ≥ T ) = P (Xn = x. uniformly in x ∈ S. x∈S 48 . Find the joint law h(x. respectively. Y be random variables with values in Z and laws f (x) = P (X = x). and all x ∈ S. n ≥ 0) and the law of a stochastic process Y = (Yn . g(y) = P (Y = y). We have P (Xn = x) = P (Xn = x. If.e. A coupling of them refers to a joint construction on the same probability space. we can write it as sup |P (Xn = x) − P (Yn = x)| ≤ P (T > n). Combining the two inequalities we obtain the ﬁrst assertion: |P (Xn = x) − P (Yn = x)| ≤ P (T > n). and which maximises P (X = Y ). n ≥ 0). in particular. |P (Xn = x) − P (Yn = x)| ≤ P (T > n). This T is a random variables that takes values in the set of times or it may take values +∞. for all n ≥ 0. n ≥ 0. i. P (T < ∞) = 1. i. f (x) = y h(x. T is ﬁnite with probability one.e. precisely because of the assumption that the two processes have been constructed together. Note that it makes sense to speak of a meeting time. We say that ransom time T is meeting time of the two processes if Xn = Yn for all n ≥ T. Then. y) = P (X = x. Y . Repeating (19). we obtain P (Yn = x) − P (Xn = x) ≤ P (T > n). n ≥ T ) ≤ P (n < T ) + P (Yn = x).Exercise: Let X. (19) as n → ∞. g(y) = x h(x. The following result is the so-called coupling inequality: Proposition 4. but with the roles of X and Y interchanged. Y = y) which respects the marginals. on the event that the two processes never meet. Suppose that we have two stochastic processes as above and suppose that they have been coupled. Let T be a meeting of two coupled stochastic processes X. and hence P (Xn = x) − P (Yn = x) ≤ P (T > n). n ≥ 0. The coupling we are interested in is a coupling of processes. Proof. y). x ∈ S. then |P (Xn = x) − P (Yn = x)| → 0. but we are not told what the joint probabilities are. n < T ) + P (Xn = x. We are given the law of a stochastic process X = (Xn . y). n < T ) + P (Yn = x. Since this holds for all x ∈ S and since the right hand side does not depend on x.

The ﬁrst process. Y0 ) has distribution P (W0 = (x.y′ for all large n. (Indeed. It is very important to note that we have only deﬁned the law of X and the law of Y but we have not deﬁned a joint law. Notice that σ(x. We shall consider the laws of two processes. positive recurrent and aperiodic. and it also makes sense to consider their ﬁrst meeting time: T := inf{n ≥ 0 : Xn = Yn }. denoted by Y = (Yn .x′ ). Xm = y). Having coupled them.x′ ). for all n ≥ 1. and so (Wn .x′ ). Let pij denote. n ≥ 0) has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. We now couple them by doing the straightforward thing: we assume that X = (Xn .y′ . n ≥ 0) is independent of Y = (Yn . then. Yn ).x′ py. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities. the transition probabilities. had we started with both X0 and Y0 independent and distributed according to π. n ≥ 0) is a Markov chain with state space S × S. denoted by X = (Xn . Xn and Yn would be independent and distributed according to π. Therefore. From Theorem 3 and the aperiodicity assumption.(y. implying that q(x.Assume now that P (T < ∞) = 1. n ≥ 0) has the law of a Markov chain with transition probabilities pij and Y0 distributed according to the stationary distribution µ.y′ ) = px. is a stationary distribution for (Wn . Then (Wn . The second process. denoted by π. (x. as usual.y′ ) for all large n.y′ . x∈S as n → ∞. Its initial state W0 = (X0 . y ′ ) | Wn = (x. 49 (n) (n) (n) (n) (n) (n) = P (Xn+1 = x′ | Xn = x) P (Yn+1 = y ′ | Yn = y) = px. 18.(y. y) Its n-step transition probabilities are q(x.x′ py. n ≥ 0) is an irreducible chain. n ≥ 0). Then n→∞ lim P (T > n) = P (T = ∞) = 0.(y. π(x) > 0 for all x ∈ S. y) ∈ S × S. We know that there is a unique stationary distribution. we have not deﬁned joint probabilities such as P (Xn = x. y) := π(x)π(y).) By positive recurrence. y) = µ(x)π(y). review the assumptions of the theorem: the chain is irreducible. Its (1-step) transition probabilities are q(x.4 Proof of the fundamental stability theorem First. we have that px. in other words.x′ > 0 and py. n ≥ 0). . we can deﬁne joint probabilities.y′ ) := P Wn+1 = (x′ . they diﬀer only in how the initial states are chosen. sup |P (Xn = x) − P (Yn = x)| → 0. Consider the process Wn := (Xn .

0) = q(1. p01 = 0. say.1) = q(1. for all n and x. which converges to zero. ≥ 0) is a Markov chain with transition probabilities pij . P (T < ∞) = 1. we have |P (Xn = x) − π(x)| ≤ P (T > n). 0). P (Xn = x) = P (Zn = x). Zn = Yn . n ≥ 0) is identical in law to (Xn . (Zn .Therefore σ(x. and since P (Yn = x) = π(x). Deﬁne now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ). 0).0). n ≥ 0) is positive recurrent. In particular. 1} and the pij are p01 = p10 = 1.0) = q(0. X arbitrary distribution Z stationary distribution Y T Thus.1). 50 . Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X. p00 = 0.1) = 1. Hence (Zn . This is not irreducible. Then the product chain has state space S ×S = {(0. Hence (Wn . However.01. 1). (0.1).(0. |P (Zn = x) − P (Yn = x)| ≤ P (T > n). n ≥ 0). T is a meeting time between Z and Y .(0. Z0 = X0 which has an arbitrary distribution µ. uniformly in x. y) ∈ S × S. as n → ∞. Zn equals Xn before the meeting time of X and Y .0). In particular. y) > 0 for all (x.(1. For example. (1. suppose that S = {0. Moreover. Y ) may fail to be irreducible. 1)} and the transition probabilities are q(0. Since P (Zn = x) = P (Xn = x). if we modify the original chain and make it aperiodic by. Clearly. p10 = 1. Obviously.99. (1. after the meeting time. by assumption. precisely because P (T < ∞) = 1.(1. then the process W = (X. By the coupling inequality.

from Cd back to C0 .497 . n→∞ lim p00 p01 (n) (n) p10 p11 (n) (n) = 0. n→∞ (n) (n) lim p n→∞ 11 = lim p01 = π(1). However. Suppose X0 = 0. and this is expressed by our terminology.4 and. By the results of §5. from C1 to C2 . π(0) + π(1) = 1.f. 6 The reason is that the term “ergodic” means something speciﬁc in Mathematics (c. But this is “wrong”.503. It is.99. . p10 = 1. we can decompose the state space into d classes C0 . It is not diﬃcult to see that 6 We put “wrong” in quotes. π(1) ≈ 0. however. 0. Therefore p00 = 1 for even n (n) and = 0 for odd n. most people use the term “ergodic” for an irreducible. π(0) ≈ 0. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing.6 Periodic chains Suppose now that we have a positive recurrent irreducible chain. customary to have a consistent terminology.99π(0) = π(1). .e. p01 = 0. per se wrong.then the product chain becomes irreducible. Theorem 4. Thus. then (n) lim p n→∞ 00 (n) = lim p10 = π(0). n→∞ (n) where π(0). if we modify the chain by p00 = 0. i.497. 1981). Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. π(1) satisfy 0. positive recurrent and aperiodic Markov chain.01. and so on. . Indeed. Parry.497 18. .5 Terminology A Markov chain which is irreducible and positive recurrent will be called ergodic. Then Xn = 0 for even n and = 1 for odd n. C1 . Terminology cannot be. By the results of Section 14. The reader is asked to pay attention to this terminology which is diﬀerent from the one used in the Markov chain literature.503 0. in particular. 18. 51 . we know we have a unique stationary distribution π and that π(x) > 0 for all states x. In matrix notation.503 0. Cd−1 . limn→∞ p00 does not exist. such that the chain moves cyclically between them: from C0 to C1 . Suppose that the period of the chain is d.

Let αr := i∈Cr π(i). → dπ(j). thereafter. we have pij for all i. Then. each one must be equal to 1/d. i ∈ Cr . after reduction by d) and. Since the αr add up to 1. . Let i ∈ Ck . and the previous results apply. Since (Xnd . j ∈ Cℓ . Cd−1 . . . 52 . and let r = ℓ − k (modulo d). if sampled every d steps. as n → ∞. . Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d. Suppose i ∈ Cr . Then. starting from i. Then pji = 0 unless j ∈ Cr−1 . the (unique) stationary distribution of (Xnd . . i∈Cr for all r = 0. The chain (Xnd . n ≥ 0) consists of d closed communicating classes. i ∈ Cr . All states have period 1 for this chain. d − 1. .Lemma 8. Can we also ﬁnd the limit when i and j do not belong to the same class? Suppose that i ∈ Ck . it remains in Cℓ . j ∈ Cℓ . 1. Proof. n ≥ 0) is dπ(i). Suppose that X0 ∈ Cr . Summing up over i ∈ Cr we obtain αr = i∈Cr π(i) = j∈Cr−1 π(j) i∈Cr pji = j∈Cr−1 π(j) = αr−1 . as n → ∞. n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain. the original chain enters Cℓ in r = ℓ − k steps (where r is taken to be a number between 0 and d − 1. Then pij (r+nd) (nd) = Pi (Xnd = j) → dπ(j). where π is the stationary distribution. n ≥ 0) is aperiodic. So π(i) = j∈Cr−1 π(j)pji . namely C0 . π(i) = 1/d. We have that π satisﬁes the balance equations: π(i) = j∈S π(j)pji . This means that: Theorem 18. i ∈ S. j ∈ Cr . We shall show that Lemma 9. . Hence if we restrict the chain (Xnd . .

say. i. Since they are (n2 ) subsequence n2 of n1 such that limk→∞ p1 k k k (nk ) (n) as n → ∞. for each i. a probability distribution on N. the sequence n3 of the third row is a subsequence k k of n2 of the second row. p1 p1 (n1 ) 1 p1 (n1 ) 2 (n2 ) 2 p1 (n1 ) 3 (n2 ) 3 (n3 ) 3 (n2 ) 1 (n3 ) 1 ··· ··· ··· ··· → → → ··· p1 p2 p3 ··· p1 p1 ··· p1 p1 ··· (n3 ) 2 p1 ··· Now pick the numbers in the diagonal. for all i ∈ S. k = 1. 2. Recall that we know what happens when j is positive recurrent and i is in the same commu(n) nicating class as j If the period is 1. . 2.. . Now consider the contained between 0 and 1 there must be a exists and equals. We also saw in Corollary (n) 7 that. are contained between 0 and 1. . p(n) = (n) (n) (n) (n) (n) ∞ p1 . Consider. k = 1. schematically. We have. there must be a subsequence exists and equals. pi . . the sequence n2 of the second row k is a subsequence of n1 of the ﬁrst row. If j is a null recurrent state then pij → 0. 19. k = 1. . for all i ∈ S. then pij → 0 as n → ∞. such that limk→∞ pi k exists and equals. say. So the diagonal subsequence (nk . p2 . .1 Limiting probabilities for null recurrent states Theorem 19. e 53 . . and so on. It is clear how we (ni ) k continue. If the period is d then the limit exists only along a selected subsequence (Theorem 18). . p2 .) is a k k subsequence of each (ni . Then we can ﬁnd a i=1 pi sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi Proof. (nk ) k → pi . . For each i we have found subsequence ni . . . Since the numbers p1 n1 1 n1 2 (n) (n1 ) < < n1 < · · · . for each The second result is Fatou’s lemma.). . 2. such that limk→∞ p1 k 3 (n1 ) numbers p2 k . exists for all i ∈ N. . The proof will be based on two results from Analysis. . . for each n = 1. whose proof we omit7 : 7 See Br´maud (1999). p1 . then pij → π(j) (Theorems 17). By construction. where = 1 and pi ≥ 0 for all i. . k = 1. p3 . as k → ∞. as n → ∞. the ﬁrst of which is known as HellyBray Lemma: Lemma 10. 2. Hence pi k i. say.e.19 Limiting theory for Markov chains* We wish to complete our discussion about the possible limits of the n-step transition prob(n) abilities pij .. . . . if j is transient. 2.

If (yi .e. . pix k . We wish to show that these inequalities are. by assumption. . pix k Therefore. (xi . equalities. state j is positive recurrent: Contradiction. x∈S (n′ ) Since aj > 0.e.Lemma 11. αx ≤ lim inf k→∞ (n′ ) for all x ∈ S. i ∈ N). (n′ ) Hence αx ≥ αy pyx . π is a stationary distribution. i. and this is a contradiction (the number αx = αx cannot be strictly larger than itself). it is a strict inequality: αx0 > y∈S αy pyx0 . i=1 (n) Proof of Theorem 19. . for some x0 ∈ S. x ∈ S . and the last equality is obvious since we have a probability vector. x ∈ S. i ∈ N). (n′ ) lim pix k (n′ +1) ≥ y∈S k→∞ lim piy k pyx . y∈S x ∈ S. Hence we can ﬁnd a subsequence nk such that k→∞ lim pij k = αj > 0. we can pick subsequence (n′ ) of (nk ) such that k k→∞ lim pix k = αx . i. But Pn+1 = Pn P. 2. αy pyx . Now π(j) > 0 because αj > 0. y∈S Since x∈S αx ≤ 1. k→∞ (n′ +1) = y∈S piy k pyx . actually. we have 0< x∈S The inequality follows from Lemma 11. Thus. i. by Lemma 11. n = 1. pix k = 1. the sequence (n) pij does not converge to 0. 54 . Suppose that one of them is not. Let j be a null recurrent state.e. By Corollary 13. are sequences of nonnegative numbers then ∞ (n) yi lim inf xi i=1 n→∞ (n) ∞ ≤ lim inf n→∞ yi x i . we have that π(x) := αx y∈S αy satisﬁes the balance equations and has x∈S π(x) = 1. This would imply that αx > x∈S x∈S y∈S x∈S αy pyx = y∈S αy x∈S pyx = y∈S αy . Suppose that for some i. (n ) (n ) Consider the probability vector By Lemma 10.

the result can be obtained by the so-called Abel’s Lemma of Analysis. Then (n) lim p n→∞ ij = ϕC (i)π C (j). as n → ∞. which is what we need. so we will only look that the aperiodic case. This is a more old-fashioned way of getting the same result. as in §10. Let ϕC (i) := Pi (HC < ∞).2 The general case (n) It remains to see what the limit of pij is. that is. and fij = Pi (Tj < ∞). Example: Consider the following chain (0 < α. But Theorem 17 tells us that the limit exists regardless of the initial distribution. Let µ be the distribution of XHC given that X0 = i and that {HC < ∞}. we have Pi (Xn = j | HC = m) = Pµ (Xn−m = j). Theorem 20. starting with the ﬁrst hitting time decomposition relation of Lemma 7. Pµ (Xn−m = j) → π C (j). Indeed. E j Tj where Tj = inf{n ≥ 1 : Xn = j}. Let HC := inf{n ≥ 0 : Xn ∈ C}. γ < 1): γ 1−γ 1 1 2 1−α α 3 β 4 1−β 5 1 55 . The result when the period j is larger than 1 is a bit awkward to formulate. π C (j) = 1/Ej Tj .19. In general. Let i be an arbitrary state. as n → ∞. when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. From the Strong Markov Property.2. β. we have pij = Pi (Xn = j) = Pi (Xn = j. and fij = ϕC (j). Hence (n) lim p n→∞ ij ∞ = π (j) m=1 C Pi (HC = m) = π C (j)Pi (HC < ∞). Let j be a positive recurrent state with period 1. Proof. HC ≤ n) n (n) = m=1 Pi (HC = m)Pi (Xn = j | HC = m). Let C be the communicating class of j and let π C be the stationary distribution associated with C. If i also belongs to C then ϕC (i) = 1 and so the result follows from Theorem 17. In this form. Remark: We can write the result of Theorem 20 as (n) lim p n→∞ ij = fij .

We now generalise this. ϕ(5) = 0. as n → ∞. 1 − αβ 1 − αβ Both C1 . p42 → ϕ(4)π C1 (2). we assumed the existence of a positive recurrent state. the 19-th century French mathematicians Henri Poincar´ and e Joseph Liouville. Hence. from ﬁrst-step analysis. 20 Ergodic theorems An ergodic theorem is a result about time-averages of random processes. Consider the quantity ν [a] (x) deﬁned as the average number of visits to state x between two successive visits to state a–see 16. (n) (n) p33 → 0. For example. C1 = {1. whence 1−α β(1 − α) . 56 (20) x∈S . The stationary distributions associated with the two closed classes are: 1 γ π C1 (1) = . (n) (n) p32 → ϕ(3)π C1 (2). Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P (Ta < ∞) = 1. Similarly. (n) (n) p43 → 0. π C1 (2) = . Ergodic Theory is nowadays an area that combines many ﬁelds of Mathematics and Science. 4} (inessential states). Recall that. Let f be a reward function such that |f (x)|ν [a] (x) < ∞. (n) (n) p34 → 0. (n) (n) p35 → (1 − ϕ(3))π C2 (5). Pioneers in the are were. for example. The subject of Ergodic Theory has its origins in Physics. while. ϕ(4) = . in Theorem 14. π C2 (5) = 1. information about its velocity. ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. p45 → (1 − ϕ(4))π C2 (5). And so is the Law of Large Numbers itself (Theorem 13). ϕ(3) = (1 − α)ϕ(2) + αϕ(4) . p44 → 0. ϕ(3) = p31 → ϕ(3)π C1 (1). among others. We will avoid to give a precise deﬁnition of the term ‘ergodic’ but only mention what constitutes a result in the area. ϕ(4) = (1 − β)ϕ(5) + βϕ(3). p41 → ϕ(4)π C1 (1). Theorem 14 is a type of ergodic theorem. 2} (closed class). C2 = {5} (absorbing state). We have ϕ(1) = ϕ(2) = 1. C2 are aperiodic classes. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its “phase space” is proportional to the volume of this region. 1+γ 1+γ Let ϕ(i) be the probability that class C1 will be reached starting from state i. Theorem 21. The phase (or state) space of a particle is not just a physical space but it includes.The communicating classes are I = {3.

The Law of Large Numbers tells us that 1 lim n→∞ n n−1 Gr = EG1 . 57 . But this is precisely the condition (20). Note that the denominator t n=0 1(Xn = a) equals the number Nt of visits to state a up to time t. Remark 1: The diﬀerence between Theorem 14 and 21 is that. we obtain the result.5. we have t→∞ lim 1 last 1 G = lim Gﬁrst = 0. and this justiﬁes the terminology of §18. Proof. As in the proof of Theorem 14. we have Nt /t → 1/Ea Ta . which is the quantity denoted by f in Theorem 14. So there are always recurrent states. we have C := i∈S ν(i) < ∞. t t→∞ t t t with probability one. so the numerator converges to f /Ea Ta . we break the total reward as follows: t Nt −1 f (Xn ) = n=0 r=1 Gr + Gﬁrst + Glast . Then at least one of the states must be visited inﬁnitely often. Replacing n by the subsequence Nt . respectively.Then P where f := x∈S t→∞ lim t n=0 t n=0 f (Xn ) 1(Xn = a) =f =1 . we assumed positive recurrence of the ground state a. in the former. Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem Theorem 14 holds. If so. 21 Finite chains We take a brief look at some things that are speciﬁc to chains with ﬁnitely many states. t t where Gr = T (r) ≤t<T (r+1) f (Xt ). Glast are the total rewards over the ﬁrst and t t a a last incomplete cycles. we have that the denominator equals Nt /t and converges to 1/Ea Ta . As in Theorem 14. as we saw in the proof of Theorem 14. (21) f (x)ν [a] (x). So if we divide by t the numerator and denominator of the fraction in (21). then. The reader is asked to revisit the proof of Theorem 14. since the number of states is ﬁnite. We only need to check that EG1 = f . while Gﬁrst . Now. From proposition 2 we know that the balance equations ν = νP have at least one solution. r=0 with probability 1 as long as E|G1 | < ∞. and this is left as an exercise.

that is.Hence 1 ν(i). in addition. . the chain is aperiodic. Lemma 12. X1 . Taking n = 1 in the deﬁnition of reversibility. and run the ﬁlm backwards. Proof. But if we ﬁlm a man who walks and run the ﬁlm backwards. 22 Time reversibility Consider a Markov chain (Xn . Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. the stationary distribution is unique. it will look the same. (X0 . Then the chain is positive recurrent with a unique stationary distribution π. Xn has the same distribution as X0 for all n. Therefore π is a stationary distribution. like playing a ﬁlm backwards. . X1 . Theorem 23. If. . . Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. because positive recurrence is a class property (Theorem 15). by Theorem 16. . j ∈ S. in addition. . we see that (X0 . where P is the transition matrix and Π is a matrix all the rows of which are equal to π. . as n → ∞. . P (X0 = i. i.e. Xn ) has the same distribution as (Xn . for all n. 58 . Suppose ﬁrst that the chain is reversible. . the chain is irreducible. then we instantly know that the ﬁlm is played in reverse. Because the process is Markov. running backwards. without being able to tell that it is. At least some state i must have π(i) > 0. . X1 ) has the same distribution as (X1 . n ≥ 0). Xn−1 . . . actually. this state must be positive recurrent. a ﬁnite Markov chain always has positive recurrent states. X0 ) for all n. If. X1 = j) = P (X1 = i. simply. Xn−1 . reversible if it has the same distribution when the arrow of time is reversed. A reversible Markov chain is stationary. the ﬁrst components of the two vectors must have the same distribution. then all states are positive recurrent. this means that it is stationary. We say that it is time-reversible. We summarise the above discussion in: Theorem 22. . X0 ). X0 = j). X0 ). in a sense. Proof. Then the chain is reversible if and only if the detailed balance equations hold: π(i)pij = π(j)pji . we have Pn → Π. π(i) := Hence. More precisely. It is. Consider an irreducible Markov chain with ﬁnitely many states. C satisﬁes the balance equations and the normalisation condition i∈S π(i) = 1. In addition. In particular. if we ﬁlm a standing man who raises his arms up and down successively. For example. Xn ) has the same distribution as (Xn . By Corollary 13. Reversibility means that (X0 . . i. i ∈ S. . or. .

. k. Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. letting m → ∞. . for all n. X2 = k) = P (X0 = k. suppose the DBE hold. Theorem 24. . . X0 ). j. X1 = j) = π(i)pij . . X0 ). Conversely. The DBE imply immediately that (X0 . Hence cancelling the π(i) from the above display we obtain the KLC for m = 3. for all i. X1 = i) = π(j)pji . . we have separation of variables: pij = f (i)g(j). P (X0 = j. j. Summing up over all choices of states i3 . and the chain is reversible. i2 . pji 59 (m−3) → π(i1 ). using the DBE. X1 = j. we have π(i) > 0 for all i. So the detailed balance equations (DBE) hold. So the chain is reversible. In other words. (22) Proof. and pi1 i2 (m−3) → π(i2 ). X1 . X2 ) has the same distribution as (X2 . . im ∈ S and all m ≥ 3. X1 . we have pi2 i1 implying that π(i1 )pi1 i2 = π(i2 )pi2 i1 . Then the chain is reversible if and only if the Kolmogorov’s loop criterion holds: for all i1 . k. X1 ) has the same distribution as (X1 . . im−1 we obtain (m−3) (m−3) pi1 i2 pi2 i1 = pi1 i2 pi2 i1 If the chain is aperiodic. Conversely. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Xn ) has the same distribution as (Xn .for all i and j. Xn−1 . it is easy to show that. . . . π(i)pij pjk = [π(j)pji ]pjk = [π(j)pjk ]pji = π(k)pkj pji . But P (X0 = i. suppose that the KLC (22) holds. X0 ). The same logic works for any m and can be worked out by induction. . We will show the Kolmogorov’s loop criterion (KLC). if there is an arrow from i to j in the graph of the chain but no arrow from j to i. and this means that P (X0 = i. and hence (X0 . using DBE. then the chain cannot be reversible. X2 = i). X1 = j. pi1 i2 pi2 i3 · · · pim−1 im = pi1 im pim im−1 · · · pi2 i1 . X1 . . i4 . These are the DBE. . 2) The DBE are equivalent to saying that the ratio pij /pji can be written as a product of a function of i and a function of j. Now pick a state k and write. . (X0 . π(i)pij pjk pki = [π(j)pji ]pjk pki = [π(j)pjk ]pji pki = [π(k)pkj ]pji pki = [π(k)pki ]pji pkj = [π(i)pik ]pji pkj = π(i)pik pkj pji Since the chain is irreducible. Suppose ﬁrst that the chain is reversible. . Pick 3 states i. then. By induction. Hence the DBE hold. and write. .

that the chain is positive recurrent. pji g(i) Multiplying by g(i) and summing over j we ﬁnd g(i) = j∈S g(j)pji . namely the arrow from i to j. there isn’t any. then the chain is reversible. then we need. And hence π can be found very easily. meaning that it has no loops. and the arrow from j to j is the only one from Ac to A. A) (see §4.1) which gives π(i)pij = π(j)pji . The balance equations are written in the form F (A. for each pair of distinct states i. the two oriented edges from i to j and from j to i by a single edge without arrow. Ac ) = F (Ac . Form the graph by deleting all self-loops. the graph becomes disconnected. 60 . there would be a loop. Ac be the sets in the two connected components. i. There is obviously only one arrow from AtoAc . Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. The reason is as follows. For example. So g satisﬁes the balance equations. using another method. and.) Let A. Example 1: Chain with a tree-structure. j for which pij > 0. • If the chain has inﬁnitely many states. in addition. consider the chain: Replace it by: This is a tree.) Necessarily. (If it didn’t. to guarantee. If we remove the link between them. Hence the original chain is reversible.e. and by replacing. If the graph thus obtained is a tree. so we have pij g(j) = . by assumption. This gives the following method for showing that a ﬁnite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate.(Convention: we form this ratio only when pij > 0. f (i)g(i) must be 1 (why?). Hence the chain is reversible. Pick two distinct states i. j for which pij > 0. the detailed balance equations.

1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. If i. and a set E of undirected edges. i∈S Since δ(i) = 2|E|.Example 2: Random walk on a graph. 2. j are not neighbours then both sides are 0. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). In all cases. so both members are equal to C. π(i)pij = π(j)pji . To see this let us verify that the detailed balance equations hold. 61 . Each edge connects a pair of distinct vertices (states) without orientation. There are two cases: If i. π(i) = 1. Suppose the graph is connected. if j is a neighbour of i otherwise. the DBE hold. For example. The constant C is found by normalisation. where C is a constant. 0. we have that C = 1/2|E|. Now deﬁne the the following transition probabilities: pij = 1/δ(i). The stationary distribution assigns to a state probability which is proportional to its degree. States j are i are called neighbours if they are connected by an edge. i∈S where |E| is the total number of edges. Hence we conclude that: 1. We claim that π(i) = Cδ(i). a ﬁnite set. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. For each state i deﬁne its degree δ(i) = number of neighbours of i. The stationary distribution of a RWG is very easy to ﬁnd. i. p02 = 1/4. A RWG is a reversible chain. etc.e. Consider an undirected graph G with vertices S.

. random variables with values in N and distribution λ(n) = P (ξ0 = n).) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . . .23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). if nk = n0 + km. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). 1. The state space of (Yr ) is A. where pij are the m-step transition probabilities of the original chain. . Deﬁne Yr := XT (r) . Let (St . If the subsequence forms an arithmetic progression.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. Due to the Strong Markov Property. 2.e. 2. in addition. t ≥ 0) be independent from (Xn ) with S0 = 0. then (Yk . this process is also a Markov chain and it does have time-homogeneous transition probabilities. if. (m) (m) 23. k = 0. . In this case. 2. . k = 0. . . k = 0. where ξt are i. . t ≥ 0. A r = 1. n ≥ 0) be Markov with transition probabilities pij . 1. i. .i. π is a stationary distribution for the original chain. 2. The sequence (Yk . . . 1. .e.) has the Markov property but it is not time-homogeneous.3 Subordination Let (Xn . . irreducible and positive recurrent).1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. in general. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . Let A be (r) a set of states and let TA be the r-th visit to A.e.d. 2. how can we create new ones? 23. 23. . 62 n ∈ N. . then it is a stationary distribution for the new chain. . 1. St+1 = St + ξt . i. k = 0.

We need to show that P (Yn+1 = β | Yn = α. β). If. . .. Yn−1 ) = P (Yn+1 = β | Yn = α). (n) 23. f is a one-to-one function then (Yn ) is a Markov chain. . a Markov chain. is NOT. . α1 . . we have: Lemma 13.. j. ∞ λ(n)pij . αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . . in general. (Notation: We will denote the elements of S by i. . . Indeed. . . . is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). i.4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . β. n≥0 63 . Yn−1 = αn−1 }.Then Yt := XSt . . Then (Yn ) has the Markov property. then the process Yn = f (Xn ). however. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). knowledge of f (Xn ) implies knowledge of Xn and so. conditional on Yn the past is independent of the future. and the elements of S ′ by α. Fix α0 . . .) More generally. Proof.e.

β) P (Yn = α|Yn−1 ) i∈S = Q(α. α > 0 Q(α. β). conditional on {Xn = i}. Deﬁne Yn := |Xn |. then |Xn+1 | = 1 for sure. β) P (Yn = α. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. Yn−1 ) Q(f (i). for all i ∈ Z and all β ∈ Z+ . Example: Consider the symmetric drunkard’s random walk. i ∈ Z. α = 0. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. if we let 1/2. i. β) := 1. But P (Yn+1 = β | Yn = α. Yn = α. Thus (Yn ) is Markov with transition probabilities Q(α. if i = 0. Yn−1 )P (Xn = i | Yn = α. P (Xn+1 = i ± 1|Xn = i) = 1/2. We have that P (Yn+1 = β|Xn = i) = Q(|i|.for all choices of the variables involved. β). because the probability of going up is the same as the probability of going down. But we will show that P (Yn+1 = β|Xn = i). i ∈ Z. To see this. observe that the function | · | is not one-to-one. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Indeed. indeed. depends on i only through |i|. if β = α ± 1. β) i∈S P (Yn = α|Xn = i. On the other hand. regardless of whether i > 0 or i < 0. Then (Yn ) is a Markov chain. β ∈ Z+ . β). Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. a Markov chain with transition probabilities Q(α. β). β)P (Xn = i | Yn = α. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. and if i = 0. if β = 1. 64 . Yn−1 ) Q(f (i). In other words.e. and so (Yn ) is. β) P (Yn = α|Xn = i.

The parent dies immediately upon giving birth. . ξ2 . So assume that p1 = 1. (and independent of the initial population size X0 –a further assumption). The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of oﬀspring. Example: Suppose that p1 = p. . and 0 is always an absorbing state. since ξ (0) . Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. j In this case. . If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. We let pk be the probability that an individual will bear k oﬀspring. which has a binomial distribution: i j pij = p (1 − p)i−j . then we see that the above equation is of the form Xn+1 = f (Xn . . Hence all states i ≥ 1 are inessential. . Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed.24 24. 0 But 0 is an absorbing state. a hard problem. one question of interest is whether the process will become extinct or not. independently from one another. if the 65 (n) (n) . and. Hence all states i ≥ 1 are transient. in general. we have that (Xn . if the current population is i there is probability pi that everybody gives birth to zero oﬀspring. so we will ﬁnd some diﬀerent method to proceed. n ≥ 0) has the Markov property. 2. therefore transient. .d. and following the same distribution. Thus. Let ξk be the number of oﬀspring of the k-th individual in the n-th generation. The state 0 is irrelevant because it cannot be reached from anywhere. In other words. Back to the general case. are i. If we let ξ (n) = ξ1 . k = 0. . Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow.. . p0 = 1 − p. We will not attempt to compute the pij neither draw the transition diagram because they are both futile.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. It is a Markov chain with time-homogeneous transitions. . with probability 1. if 0 is never reached then Xn → ∞. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. But here is an interesting special case.i. Therefore. 0 ≤ j ≤ i. ξ (n) ). Then P (ξ = 0) + P (ξ ≥ 2) > 0. We ﬁnd the population (n) at time n + 1 by adding the number of oﬀspring of each individual. each individual gives birth to at one child with probability p or has children with probability 1 − p. the population cannot grow: it will eventually reach 0. ξ (1) . 1.

and. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. Indeed. This function is of interest because it gives information about the extinction probability ε. so P (D|X0 = i) = εi . Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. Obviously. Let ∞ µ = Eξ = k=1 kpk be the mean number of oﬀspring of an individual. when we start with X0 = i. Let ε be the extinction probability of the process. For each k = 1. Proof. ε(0) = 1. . For i ≥ 1. .population does not become extinct then it must grow to inﬁnity. we can argue that ε(i) = εi . The result is this: Proposition 5. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. We then have that D = D1 ∩ · · · ∩ Di . . we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . Indeed. each one behaves independently of one another. i. ϕn (0) = P (Xn = 0). 66 . if we start with X0 = i in ital ancestors. . where ε = P1 (D). We wish to compute the extinction probability ε(i) := Pi (D). If µ ≤ 1 then the ε = 1. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. Therefore the problem reduces to the study of the branching process starting with X0 = 1. and P (Dk ) = ε. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. since Xn = 0 implies Xm = 0 for all m ≥ n.

then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1.) 67 . ϕ(ϕn (0)) → ϕ(ε). So ε ≤ ε∗ < 1. all we need to show is that ε < 1. (See right ﬁgure below. if µ > 1. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). ϕn (0) ≤ ε∗ . and. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. But ϕn (0) → ε. In particular. ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). Since both ε and ε∗ satisfy z = ϕ(z). and so on. ε = ε∗ . ϕn+1 (0) = ϕ(ϕn (0)). This means that ε = 1 (see left ﬁgure below). But 0 ≤ ε∗ . if the slope µ at z = 1 is ≤ 1. Let us show that. On the other hand. Therefore. Therefore ε = ϕ(ε). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)).Note also that ϕ0 (z) = 1. Also ϕn (0) → ε. in fact. Therefore ε = ε∗ . But ϕn+1 (0) → ε as n → ∞. By induction. recall that ϕ is a convex function with ϕ(1) = 1. ϕn+1 (z) = ϕ(ϕn (z)). Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . and since the latter has only two solutions (z = 0 and z = ε∗ ). z=1 Also. since ϕ is a continuous function. Notice now that µ= d ϕ(z) dz .

periodically. Then ε = 1.i. . otherwise. are allocated to hosts 1. there are x + a packets. k ≥ 0. there is no time to make a prior query whether a host has something to transmit or not. If s ≥ 2. 24. 2. 0). 4. Abandoning this idea. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. During this time slot assume there are a new packets that need to be transmitted. we let hosts transmit whenever they like. 3. . then time slots 0. If collision happens. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. An the number of new packets on the same time slot.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). Let s be the number of selected packets. 5.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. The problem of communication over a common channel. . Nowadays. .d. random variables with some common distribution: P (An = k) = λk . So. in a simpliﬁed model. 68 . 3 hosts. 1. 3. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). We assume that the An are i. Case: µ > 1. Then ε < 1. then the transmission is not successful and the transmissions must be attempted anew. on the next times slot. there is collision and the transmission is unsuccessful. 1. using the same frequency. 2. . there is a collision and. and Sn the number of selected packets. Xn+1 = Xn + An . on the next times slot. One obvious solution is to allocate time slots to hosts in a speciﬁc order. there are x + a − 1 packets. 6. is that if more than one hosts attempt to transmit a packet. for a packet to go through. 3.. . 2. Since things happen fast in a computer network. Ethernet has replaced Aloha. if Sn = 1. 1. So if there are. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. If s = 1 there is a successful transmission and. then max(Xn + An − 1. say. Thus. only one of the hosts should be transmitting at a given time.

we have q < 1. selections are made amongst the Xn + An packets. An−1 . then the chain is transient. and so q x tends to 0 much faster than G(x) tends to ∞. we have ∆(x) = (−1)px. Indeed.x−1 + k≥1 kpx.x−1 when x > 0. we must have no new packets arriving. An = a. if Xn + An = 0. where G(x) is a function of the form α + βx . we should not subtract 1. k 0 ≤ k ≤ x + a. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). For x > 0. we have that the drift is positive for all large x and this proves transience.) = x + a k x−k p q .x are computed by subtracting the sum of the above from 1. and p. Since p > 0. We here only restrict ourselves to the computation of this expected change. Sn = 1|Xn = x) = λ0 xpq x−1 .e. The reason we take the maximum with 0 in the above equation is because. we think as follows: to have a reduction of x by 1. 69 .x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). k ≥ 1. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. The Aloha protocol is unstable. So P (Sn = k|Xn = x. Idea of proof.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k .x−1 = P (An = 0. where q = 1 − p. To compute px. So we can make q x G(x) as small as we like by choosing x suﬃciently large. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . Xn−1 . but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. . We also let µ := k≥1 kλk be the expectation of An . for example we can make it ≤ µ/2. Therefore ∆(x) ≥ µ/2 for all large x. The process (Xn . Unless µ = 0 (which is a trivial case: no arrivals!). The main result is: Proposition 6. So: px. Let us ﬁnd its transition probabilities.e. the Markov chain (Xn ) is transient. To compute px.x+k for k ≥ 1. i. The random variable Sn has. The probabilities px. . and exactly one selection: px.where λk ≥ 0 and k≥0 kλk = 1. . deﬁned by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. so q x G(x) tends to 0 as x → ∞. Proving this is beyond the scope of the lectures. n ≥ 0) is a Markov chain.

The way it does that is by computing the stationary distribution of a Markov chain that we now describe. Deﬁne transition probabilities as follows: 1/|L(i)|. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. In an Internet-dominated society. So this is how Google assigns ranks to pages: It uses an algorithm. because of the size of the problem. that allows someone to manipulate the PageRank of a web page and make it much higher. If L(i) = ∅. For each page i let L(i) be the set of pages it links to. if L(i) = ∅ 0. known as ‘302 Google jacking’. PageRank. otherwise. So if Xn = i is the current location of the chain.2) and let α pij := (1 − α)pij + . 2. It uses advanced techniques from linear algebra to speed up computations. A new use of PageRank is to rank academic doctoral programs based on their records of ﬁnding jobs for their graduates. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. then i is a ‘dangling page’. The idea is simple. It is not clear that the chain is irreducible. |V | These are new transition probabilities. unless i is a dangling page. neither that it is aperiodic. if j ∈ L(i) pij = 1/|V |. Its implementation though is one of the largest matrix computations in the world. There is a technique. Comments: 1. Then it says: page i has higher rank than j if π(i) > π(j). So we make the following modiﬁcation. unless he gets bored–and this happens with probability α–in which case he clicks on a random page.24. there are companies that sell high-rank pages. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. Academic departments link to each other by hiring their 70 . to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. Therefore.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. We pick α between 0 and 1 (typically. in the latter case. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. α ≈ 0. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. the chain moves to a random page. 3. All you need to do to get high rank is pay (a lot of money). the rank of a page may be very important for a company’s proﬁts.

it makes the evaluation procedure easy. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. We will assume that a child chooses an allele from each parent at random. a gene controlling a speciﬁc characteristic (e. We have a transition from x to y if y alleles of type A were produced. One of the circle alleles is selected twice. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). from the point of view of an administrator.faculty from each other (and from themselves). What is important is that the evaluation can be done without ever looking at the content of one’s research. colour of eyes) appears in two forms. we will assume that. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. Three of the square alleles are never selected. meaning how many characteristics we have. this allele is of type A with probability x/2N . generation after generation. then. Since we don’t care to keep track of the individuals’ identities. Thus. We are now ready to deﬁne the Markov chain Xn representing the number of alleles of type A in generation n. all we need is the information outlined in this ﬁgure. And so on. 4. We will also assume that the parents are replaced by the children. 24. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. 71 . their oﬀspring inherits an allele from each parent. in each generation. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their oﬀspring did not inherit this allele. Do the same thing 2N times. We would like to study the genetic diversity. In this ﬁgure there are 4 alleles of type A (circles) and 6 of type a (squares). a dominant (say A) and a recessive (say a). The process repeats. One of the square alleles is selected 4 times. Imagine that we have a population of N individuals who mate at random producing a number of oﬀspring. This number could be independent of the research quality but. This means that the external appearance of a characteristic is a function of the pair of alleles. But an AA mating with a Aa may produce a child of type AA or one of type Aa. two AA individuals will produce and AA child. When two individuals mate.4 The Wright-Fisher model: an example from biology In an organism.g. The alleles in an individual come in pairs and express the individual’s genotype for that gene. To simplify. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. If so. maintain this allele for the next generation. 5. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. the genetic diversity depends only on the total number of alleles of each type.

. and the population becomes one consisting of alleles of type a only. Let M = 2N . the chain gets absorbed by either 0 or 2N . 72 . The result is correct. but the argument needs further justiﬁcation because “the end of time” is not a deterministic time. These are: M ϕ(0) = 0. M Therefore. . ϕ(x) = x/M.d. Since selections are independent. ξM are i. we see that 2N x 2N −y x y pxy = . . y ≤ 2N − 1. X0 ) = E(Xn+1 |Xn ) = M This implies that. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. . . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. . . Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . conditionally on (X0 . . i. Clearly. X0 ) = 1 − Xn . Since pxy > 0 for all 1 ≤ x. EXn+1 = EXn Xn = Xn . so both 0 and 2N are absorbing states. the random variables ξ1 . X0 ) = Xn . ϕ(M ) = 1. the states {1. . . We show that our guess is correct by proceeding via the mundane way of verifying that it satisﬁes the ﬁrst-step analysis equations.e. since. . and the population becomes one consisting of alleles of type A only. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x).e. since. Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. Similarly. Clearly. ϕ(x) = y=0 pxy ϕ(y). n n where. (Here. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . . . p00 = p2N.i. as usual.Each allele of type A is selected with probability x/2N . . on the average.) One can argue that. .2N = 1. Xn ). M n P (ξk = 0|Xn . . . i. . 0 ≤ y ≤ 2N. Tx := inf{n ≥ 1 : Xn = x}. . the number of alleles of type A won’t change. E(Xn+1 |Xn . the expectation doesn’t change and. on the average. with n P (ξk = 1|Xn . 2N − 1} form a communicating class of inessential states. . . M and thus. . . for all n ≥ 0. . The fact that M is even is irrelevant for the mathematical model.

. 2. on the average. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . M −1 y−1 x M y−1 1− x . we ﬁnd X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. we will try to solve the recursion. Let ξn := An − Dn . larger than the supply per unit of time. Here are our assumptions: The pairs of random variables ((An .i. Xn electronic components at time n. Dn ). the demand is. so that Xn+1 = (Xn + ξn ) ∨ 0. We have that (Xn . provided that enough components are available.d. Working for n = 1. and the stock reached level zero at time n + 1. Since the Markov chain is represented by this very simple recursion. We will apply the so-called Loynes’ method. the demand is only partially satisﬁed. Otherwise. at time n + 1. there are Xn + An − Dn components in the stock. If the demand is for Dn components. M x M (M −1)−(y−1) x x +1− M M M −1 = 24. n ≥ 0) is a Markov chain. for all n ≥ 0. and E(An − Dn ) = −µ < 0. A number An of new components are ordered and added to the stock. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. say.5 A storage or queueing application A storage facility stores. b). We will show that this Markov chain is positive recurrent. . and so. 73 .The ﬁrst two are obvious. the demand is met instantaneously. Thus. n ≥ 0) are i. while a number of them are sold to the market. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. . We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a.

74 . . . using π(y) = limn→∞ P (Mn = y). Indeed. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). the random variables (ξn ) are i. (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . that d (ξ0 . for y ≥ 0. . in computing the distribution of Mn which depends on ξ0 . and. instead of considering them in their original order. . Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). we ﬁnd π(y) = x≥0 π(x)pxy . . In particular. . for all n. . . ξn−1 ) = (ξn−1 . ξ0 ). . . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy .Thus. . ξn . however. . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisﬁes the balance equations. so we can shuﬄe them in any way we like without altering their joint distribution. ξn−1 . .d. Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. .i. . ξn−1 ). where the latter equality follows from the fact that. = x≥0 Since the n-dependent terms are bounded. we reverse it. . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). Therefore. . we may take the limit of both sides as n → ∞. . Notice. . the event whose probability we want to compute is a function of (ξ0 . . we may replace the latter random variables by ξ1 . (n) . Hence.

} > y) is not identically equal to 1. . Z2 . Z2 . .} < ∞) = 1. . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. we will use the assumption that Eξn = −µ < 0. . n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . This will be the case if g(y) is not identically equal to 1. The case Eξn = 0 is delicate. as required. Here. n→∞ This implies that P ( lim Zn = −∞) = 1. . We must make sure that it also satisﬁes y π(y) = 1.So π satisﬁes the balance equations. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. n→∞ Hence g(y) = P (0 ∨ max{Z1 . 75 . Depending on the actual distribution of ξn . . the Markov chain may be transient or null recurrent.

.y+z for any translation z. A random walk of this type is called simple random walk or nearest neighbour random walk.g. but. where C is such that x p(x) = 1. d p(x) = qj . . . given a function p(x). j = 1. if x = −ej . 0. x ∈ Zd . so far. here. More general spaces (e. Our Markov chains.and space-homogeneous transition probabilities given by px. A random walk possesses an additional property. otherwise. otherwise where p + q = 1. we can let S = Zd . the skip-free property. If d = 1. . . then p(1) = p. where p1 + · · · + pd + q1 + · · · + qd = 1. namely. ±ed } otherwise 76 . p(−1) = q. . d 0. j = 1. besides time. To be able to say this. This means that the transition probability pxy should depend on x and y only through their relative positions in space. This is the drunkard’s walk we’ve seen before. this walk enjoys. Example 3: Deﬁne p(x) = 1 2d .y = px+z. we won’t go beyond Zd . a random walk is a Markov chain with time. . In dimension d = 1. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. we need to give more structure to the state space S. So. For example.y = p(y − x). . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. have been processes with timehomogeneous transition probabilities. that of spatial homogeneity. Deﬁne pj . if x ∈ {±e1 . . . groups) are allowed. the set of d-dimensional vectors with integer coordinates. p(x) = 0.and space-homogeneity. a structure that enables us to say that px. .PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Example 1: Let p(x) = C2−|x1 |−···−|xd | . . if x = ej .

and this is the symmetric drunkard’s walk. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). p ∗ δz = p. x + ξ2 = y) = y p(x)p(y − x). It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . in addition.d. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). . Obviously.) In this notation. p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). If d = 1. computing the n-th order convolution for all n is a notoriously hard problem. . in general. because all we have done is that we dressed the original problem in more fancy clothes. x ± ed . xy The random walk ξ1 + · · · + ξn is a random walk starting from 0.i. then p(1) = p(−1) = 1/2. . (When we write a sum without indicating explicitly the range of the summation variable. We shall resist the temptation to compute and look at other 77 . where ξ1 . But this would be nonsense. . p(x) = 0. We call it simple symmetric random walk. . otherwise. p′ (x).This is a simple random walk. The convolution of two probabilities p(x). The special structure of a random walk enables us to give more detailed formulae for its behaviour. y p(x)p′ (y − x). has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . (Recall that δz is a probability distribution that assigns probability 1 to the point z. are i. . ξ2 . We refer to ξn as the increments of the random walk. Furthermore. Indeed. we will mean a sum over all possible values. random variables in Zd with common distribution P (ξn = x) = p(x). x ∈ Zd . One could stop here and say that we have a formula for the n-step transition probability. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . We will let p∗n be the convolution of p with itself n times.) Now the operation in the last term is known as convolution. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. So. which. . we can reduce several questions to questions about this random walk. Furthermore. the initial state S0 is independent of the increments. Let us denote by Sn the state of the walk at time n. x ∈ Zd is deﬁned as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p.

for ﬁxed n. Events A that depend only on these the ﬁrst n random signs are subsets of this sample space. 1) → (2. Clearly.d. First. 1}n . we shall learn some counting methods. . but its practice may be hard. how long will that take? C. 1): (2. . all possible values of the vector are equally likely.i+1 = pi. and so #A P (A) = n . 0) → (1. 2 where #A is the number of elements of A. Therefore. ξn ). +1}n . After all. we are dealing with a uniform distribution on the sample space {−1. . we should assign. a random vector with values in {−1. 2) → (3. A ⊂ {−1.0) We can thus think of the elements of the sample space as paths.1) + (0. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . . . consider the following: 78 . ξn = εn ) = 2−n . +1. −1} then we have a path in the plane that joins the points (0. . Will the random walk ever return to where it started from? B. a path of length n. If not. If yes. Since there are 2n elements in {0. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that diﬀerent questions we need to ask or (b) that approximate methods are more informative than exact ones.i. and the sequence of signs is {+1. +1}n . The principle is trivial. we have P (ξ1 = ε1 .i−1 = 1/2 for all i ∈ Z.2) where the ξn are i. So if the initial position is S0 = 0.methods. As a warm-up. Look at the distribution of (ξ1 . +1.1) + − (5. So. +1}n . if we can count the number of elements of A we can compute its probability. to each initial position and to each sequence of n signs. + (1. −1. 2) → (5. . In the sequel.2) (4.1) − (3. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. 1) → (4. Here are a couple of meaningful questions: A. . random variables (random signs) with P (ξn = ±1) = 1/2.

n 79 . 0) and end up at (n. y)}.047. in this case. y)} = n−m .Example: Assume that S0 = 0. i). In if n + i is not an even number then there is no path that starts from (0.) The darker region shows all paths that start at (0. (There are 2n of them. x) (n. S6 < 0. S7 < 0}. x) and end up at (n. We use the notation {(m. i). 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. i).1 Paths that start and end at speciﬁc points Now consider a path that starts from a speciﬁc point and ends up at a speciﬁc point. (n − m + y − x)/2 (23) Remark. x) and end up at (n. The paths comprising this event are shown below. and compute the probability of the event A = {S2 ≥ 0. 0) and ends at (n. y) is given by: #{(0. the number of paths that start from (m. 0) and n ends at (n. 0) and end up at the speciﬁc point (n. y)} =: {all paths that start at (m. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. 0)). x) (n. S4 = −1. i)} = n n+i 2 More generally. Hence (n+i)/2 = 0. We can easily count that there are 6 paths in A.i) (0. Let us ﬁrst show that Lemma 14. Proof. 0) (n. y) is given by: #{(m. The number of paths that start from (0. Consider a path that starts from (0. (n.0) The lightly shaded region contains all paths of length n that start at (0.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the ﬁgure), there is no path that goes from (0, 0) to (3, 2). We shall thus deﬁne n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

**Paths that start and end at speciﬁc points and do not touch zero at all
**

(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =

1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the speciﬁc point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the ﬁrst time it hits zero and then reﬂecting it around the horizontal axis. In the ﬁgure below, follow the solid path starting at (m, x) up to the ﬁrst time it hits zero and then follow the dotted path: this is the reﬂected path.

(n, y)

(m, x)

(n, −y)

A path and its reﬂection around a horizontal line. Notice that the reﬂection starts only at the point where the path touches the horizontal line.

Note that if we know the reﬂected path we can recover the original path. Hence we may as well count the number of reﬂected paths. But each reﬂected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =

1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)

1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

**Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
**

1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n

n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n

n 2

= #{(0, 1) n

n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1

n−y 2

(n, y + 1); remain > 0}

− −

n

n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

−

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the ﬁrst one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

**Distribution after n steps
**

n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

Sn > 0 | Sn = y) = y . .2 The ballot theorem Theorem 25. .P (Sn = y|Sm = x) = In particular. We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). given that S0 = 0 and Sn = y > 0. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . the random walk never becomes zero up to time n is given by P0 (S1 > 0. 0) (n. for now. 0) (n. . y)} . we shall just compute it. 1) (n. . y)} . . n This is called the ballot theorem and the reason will be given in Section 29. (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . y) that do not touch zero} × 2−n #{paths (0.3 Some conditional probabilities Lemma 17. . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . n 2 #{paths (m. P0 (S1 ≥ 0 . S2 > 0. n 2−n . . (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. 27. But. n−m 2−n−m . . P0 (Sn = y) 2 (n + y) + 1 . x) 2n−m (n. 27. n/2 (29) #{(0. n (30) Proof. Sn ≥ 0 | Sn = y) = 1 − In particular. if n is even. The probability that. if n is even. The left side of (30) equals #{paths (1. Such a simple answer begs for a physical/intuitive explanation. . Sn ≥ 0 | Sn = 0) = 84 1 . If n. .

and ﬁnally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. •).Proof. . n/2 where we used (28) and (29). . . . . . . Sn ≥ 0). . . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . +1 27. . Sn < 0) (b) (a) = 2P0 (S1 > 0. ξ3 . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . Proof. . P0 (S1 ≥ 0. . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . . and Sn−1 > 0 implies that Sn ≥ 0. . Sn ≥ 0) = P0 (S1 = 0. Sn = 0) = P0 (S1 > 0. . . . Sn ≥ 0) = #{(0. . . . If n is even. . . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. . . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . . ξ2 . We use (27): P0 (Sn ≥ 0 . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . We next have P0 (S1 = 0.). . . . . 1 + ξ2 + ξ3 > 0. . .4 Some remarkable identities Theorem 26. . .) by (ξ1 . Sn = 0) = P0 (Sn = 0). . . (c) (d) (e) (f ) (g) where (a) is obvious. . . . ξ2 + ξ3 ≥ 0. . (f ) follows from the replacement of (ξ2 . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. 0) (n. . . . . . . ξ3 . . . For even n we have P0 (S1 ≥ 0. . 85 . . . . . . . (e) follows from the fact that ξ1 is independent of ξ2 . Sn ≥ 0. . (b) follows from symmetry. . . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. Sn > 0) + P0 (S1 < 0.

and if the particle is to the left of a then its mirror image is on the right. Sn−1 < 0.27. . the probability u(n) that the walk (started at 0) attains value 0 in n steps. 86 . If a particle is to the right of a then you see its image on the left. Sn−1 = 0. . So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. . n−1 Proof. So 1 f (n) = 2u(n) P0 (S1 > 0. So f (n) = 2P0 (S1 > 0. f (n) = P0 (S1 > 0. Sn−1 > 0. . given that Sn = 0. Sn = 0) = u(n) . . we either have Sn−1 = 1 or −1. Sn = 0 means Sn−1 = 1. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). Sn−2 > 0|Sn−1 = 1). . for even n. . P0 (Sn−1 > 0. Sn = 0) P0 (S1 > 0. . Then f (n) := P0 (S1 = 0. n−1 28 The reﬂection principle for a simple random walk in dimension 1 The reflection principle says the following. Sn−2 > 0|Sn−1 > 0. and u(n) = P0 (Sn = 0). . . . . Also. . Sn = 0) Now. We now want to ﬁnd the probability f (n) that the walk will return to 0 for the ﬁrst time in n steps. So u(n) f (n) = . Since the random walk cannot change sign before becoming zero. Fix a state a. we computed. . with equal probability. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. . Sn = 0. Sn = 0) + P0 (S1 < 0. we see that Sn−1 > 0. we can omit the last event. Sn = 0) = 2P0 (Sn−1 > 0. 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). To take care of the last term in the last expression for f (n). . Sn−1 > 0. Sn = 0). . the two terms must be equal. By the Markov property at time n − 1. Theorem 27. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . By symmetry. This is not so interesting. . . Let n be even. So the last term is 1/2. . . . . Put a two-sided mirror at a.5 First return to zero In (29). .

Theorem 28. Let (Sn . n < Ta . Then. Sn > a) = P0 (Ta ≤ n. deﬁne Sn = where is the ﬁrst hitting time of a. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). the process (STa +m − a. n ∈ Z+ ). Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). Sn = Sn . In other words. P0 (Sn ≥ a) = P0 (Ta ≤ n. we have n ≥ Ta . Proof. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). called (Sn . But P0 (Ta ≤ n) = P0 (Ta ≤ n. m ≥ 0) is a simple symmetric random walk. Then Ta = inf{n ≥ 0 : Sn = a} Sn . The last equality follows from the fact that Sn > a implies Ta ≤ n. Finally. 2a − Sn ≥ a) = P0 (Ta ≤ n. Trivially. independent of the past before Ta . Sn ≤ a) + P0 (Sn > a).1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . 2 Deﬁne now the running maximum Mn = max(S0 . 2 Proof. If Sn ≥ a then Ta ≤ n. observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). a + (Sn − a). Sn ≤ a) + P0 (Ta ≤ n. n ≥ Ta By the strong Markov property. 28. n ∈ Z+ ) has the same law as (Sn . n ≥ Ta . But a symmetric random walk has the same law as its negative. Thus. . Sn ≤ a). So: P0 (Sn ≥ a) = P0 (Ta ≤ n. 2a − Sn .What is more interesting is this: Run the particle till it hits a (it will. Sn ). we obtain 87 . n ≥ 0) be a simple symmetric random walk and a > 0. So Sn − a can be replaced by a − Sn in the last display and the resulting process. as a corollary to the above result. . so we may replace Sn by 2a − Sn (Theorem 28). . n < Ta . In the last event. S1 . Sn ≥ a). .

Since Mn < x is equivalent to Tx > n. This. usually a vase furnished with a foot or pedestal. Sn = 2x − y). 9 Bertrand. e e e Paris. we have P0 (Mn < x. gives the result. Sci. If Tx ≤ n. 8 88 . and anciently for holding ballots to be drawn. The original ballot problem. D. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. of which a are coloured azure and b black. Bertrand. Sn = y) = P0 (Tx > n. A sample from the urn is a sequence η = (η1 . Sci. Compt. e 10 Andr´. P0 (Mn < x. employed for diﬀerent purposes. J. An urn is a vessel. Sn = y) = P0 (Tx ≤ n. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. n = a + b. Paris. Solution directe du probl`me r´solu par M. It is the latter use we are interested in in Probability and Statistics. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. by applying the reﬂection principle (Theorem 28). 369. as for holding liquids. 2 Proof. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). . Sn = y). Acad. Rend. . Rend. ornamental objects. the ashes of the dead after cremation. 105. 105. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. (31) x > y. . posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. (1887).Corollary 19. then the probability that. Proof. Acad. Sn = y) = P0 (Sn = 2x − y). Compt. The answer is very simple: Theorem 31. (1887). Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . ηn ). 29 Urns and the ballot theorem Suppose that an urn8 contains n items. If an urn contains a azure items and b = n−a black items. The set of values of η contains n elements and η is uniformly distributed a in this set. 436-437. where ηi indicates the colour of the i-th item picked. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. as long as a ≥ b. the number of selected azure items is always ahead of the black ones equals (a − b)/n. . We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Solution d’un probl`me. which results into P0 (Tx ≤ n. during sampling without replacement. combined with (31). then.

a − b = y. p(n+k)/2 q (n−k)/2 . . . . . But if Sn = y then.We shall call (η1 . +1} such that ε1 + · · · + εn = y. we can translate the original question into a question about the random walk. . but which is not necessarily symmetric. Sn > 0} that the random walk remains positive. . . But in (30) we computed that y P0 (S1 > 0. so ‘azure’ items are +1 and ‘black’ items are −1. . Since the ‘azure’ items correspond to + signs (respectively. Proof. letting a. εn be elements of {−1. The values of each ξi are ±1. we have: P0 (ξ1 = ε1 . the result follows. . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. indeed. ηn ) an urn process with parameters n. . . conditional on {Sn = y}. using the deﬁnition of conditional probability and Lemma 16. n ≥ 0) with values in Z and let y be a positive integer. i ∈ Z. a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. ξn ) has a uniform distribution. . . . Sn > 0 | Sn = y) = . ξn ) is. b be deﬁned by a + b = n. . n n −n 2 (n + y)/2 a Thus. . Let ε1 . . . . . the ‘black’ items correspond to − signs). . Then. . . (ξ1 .i−1 = q = 1 − p. ξn ) is uniformly distributed in a set with n elements. . . An urn can be realised quite easily be a random walk: Theorem 32. . Since an urn with parameters n. . n . conditional on the event {Sn = y}. ξn = εn ) P0 (Sn = y) 2−n 1 = = . where y = a − (n − a) = 2a − n. a we have that a out of the n increments will be +1 and b will be −1. the sequence (ξ1 . (The word “asymmetric” will always mean “not necessarily symmetric”. We need to show that. Proof of Theorem 31. conditional on {Sn = y}. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. It remains to show that all values are equally a likely. . 89 . . . Then. We have P0 (Sn = k) = n n+k 2 pi.) So pi. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . . So the number of possible values of (ξ1 . always assuming (without any loss of generality) that a ≥ b = n − a. . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. a. the sequence (ξ1 . . −n ≤ k ≤ n. n Since y = a − (n − a) = a − b.i+1 = p. . . . Consider a simple symmetric random walk (Sn . . . .

2. plus a ′′ further time T1 required to go from 0 to +1 for the ﬁrst time. The rest is a consequence of the strong Markov property. Consider a simple random walk starting from 0. . For all x. Then the probability generating function of the ﬁrst time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . Consider a simple random walk starting from 0. x ∈ Z. Px (Ty = t) = P0 (Ty−x = t). If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. 2. . Theorem 33. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. Start with S0 = 0 and watch the process until the ﬁrst n = T1 such that Sn = 1. If x > 0.} ∪ {∞}.30. x − 1 in order to reach x. Proof. 90 .d. Corollary 20. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. See ﬁgure below. x ≥ 0) is also a random walk. If S1 = 1 then T1 = 1. . We will approach this using probability generating functions. for x > 0. (We tacitly assume that S0 = 0. only the relative position of y with respect to x is needed. Lemma 18. 2qs Proof. y ∈ Z. Let ϕ(s) := E0 sT1 . . The process (Tx . In view of this. Consider a simple random walk starting from 0.) Then. and all t ≥ 0. This random variable takes values in {0. . namely the obvious fact that a simple random walk starting from 0 must pass through levels 1.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. Lemma 19.i. random variables. . E0 sTx = ϕ(s)x . Proof. 1. In view of Lemma 19 we need to know the distribution of T1 . . it is no loss of generality to consider the random walk starting from 0. plus a time T1 required to go from −1 to 0 for the ﬁrst time. we make essential use of the skip-free property. To prove this.

Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. The formula is obtained by solving the quadratic. 2ps Proof. Hence the previous formula applies with the roles of p and q interchanged. Then the probability generating function of the ﬁrst time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Corollary 22. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. if x < 0 2ps Proof. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . 0 if x > 0 2qs Tx (32) E0 s = √ |x| E sT−1 |x| = 1− 1−4pqs2 0 . We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Then the probability generating function of the ﬁrst time that level x ∈ Z is reached is given by √ x E sT1 x = 1− 1−4pqs2 . from the strong Markov property. Consider a simple random walk starting from 0. 91 . the RW must ﬁrst hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Corollary 21.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. in order to hit 1. The ﬁrst time n such that Sn = −1 is the ﬁrst time n such that −Sn = 1. Consider a simple random walk starting from 0. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p.

E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . and simply write (1 + x)n = k≥0 n k x . Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof. the same as the time required for the walk to hit 1 starting ′ from 0. The latter is. For the general case.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . Similarly. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. k by the binomial theorem. 1 − 4pqs2 . If α = n is a positive integer then n (1 + x)n = k=0 n k x . how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. 2ps ′ =1− 30. this was found in §30. k 92 .2 First return to the origin Now let us ask: If the particle starts at 0. If we deﬁne n to be equal to 0 for k > n. Consider a simple random walk starting from 0. For a symmetric ′ random walk.30. if S1 = 1. in distribution. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . then we can omit the k upper limit in the sum. where α is a real number. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0.3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. We again do ﬁrst-step analysis.

Recall that n k = It turns out that the formula is formally true for general α. (−1)k = 2k − 1 k k and this is shown by direct computation. by deﬁnition. and it looks exactly the same: (1 + x)α = k≥0 n(n − 1) · · · (n − k + 1) n! = . k!(n − k)! k! α k x . k! The adverb ‘formally’ is supposed to mean that this formula holds for some x around 0. let us compute √ (−1)k (−x)k = k≥0 k≥0 k≥0 −1 (−x)k k = (−1)k k! (−1)(−2) · · · (−1 − k + 1) = = (−1)k . So √ 1−x= ′ k≥0 1 2k k x =1+ 2k − 1 k 1 − 4pqs2 : ′ k≥1 2k k 1 x 2k − 1 k We apply this to E0 sT0 = 1 − E0 s T 0 = k≥1 1 2k (pq)k s2k . Indeed. (1 − x)−1 = But. As another example. −1 k so (1 − x)−1 = as needed. √ 1 − x. k We can write this in more familiar symbols if we observe that 2k −k 1 1/2 4 . 2k − 1 k 93 . We have k≥0 1 − x = (1 − x)1/2 = 1/2 (−1)k xk . k! k! (−1)2k xk = k≥0 xk . For example. we will show that (1 − x)−1 = 1 + x + x2 + x3 + · · · which is of course the all too-familiar geometric series. but that we won’t worry about which ones. k as long as we deﬁne α k = α(α − 1) · · · (α − k + 1) .

Hence it is transient. 2k − 1 k 2k − 1 31 Recurrence properties of simple random walks Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. The quantity inside the square root becomes 1 − 4pq. We apply this to (32). That it is actually null-recurrent follows from Markov chain theory. |x| if x > 0 (33) . so 1 − 4pq = So P0 (Tx < ∞) = 1 − 4p + 4p2 = (1 − 2p)2 = |1 − 2p| = |p − q|. Hence it is again transient. x 1−|p−q| 2q 1−|p−q| 2p . We will see how the results about the generating functions of the hitting and return times help us answer these questions directly. by the deﬁnition of E0 sT0 . without appeal to the Markov chain theory. But remember that q = 1 − p. Comparing the two ‘inﬁnite polynomials’ we see that we must have n = 2k (we knew that– the random walk can return to 0 only in an even number of steps) and then ′ P0 (T0 = 2k) = 1 1 2k k k p q = P0 (S2k = 0). x if x > 0 . • If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. By applying the law of large numbers we have that P0 Sn =p−q n→∞ n lim = 1. • If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. s↑1 which is true for any probability generating function. P0 (Tx < ∞) = p q q p ∧1 ∧1 . if p = q then P0 (Tx < ∞) = 1 for all x ∈ Z. 94 . if x < 0 |x| In particular. • If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. if x < 0 This formula can be written more neatly as follows: Theorem 35.But. E0 s T 0 = n≥0 ′ ′ ′ P0 (T0 = n)sn . The key fact needed here is: P0 (Tx < ∞) = lim E0 sTx .

x ≥ 0. with some positive probability. but here it is < ∞. if x < 0 which is what we want. The last part of Theorem 35 tells us that it is recurrent. if x > 0 . . the sequence Mn is nondecreasing.) If we let Mn = max(S0 . . . . 31.1 Asymptotic distribution of the maximum Suppose now that p < 1/2. Then ′ P0 (T0 < ∞) = 2(p ∧ q). If p < q then |p − q| = q − p and. Consider a random walk with values in Z. Sn ). . we obtain what we are looking for. this probability is always strictly less than 1. then we have M∞ = limn→∞ Mn .Proof. 31. P0 (M∞ ≥ x) = lim P0 (Mn ≥ x) = lim P0 (Tx ≤ n) = P0 (Tx < ∞) = (p/q)x . If p > q then |p − q| = p − q and so (33) gives P0 (Tx < ∞) = 1. This value is random and is denoted by M∞ = sup(S0 . The simple symmetric random walk with values in Z is null-recurrent. S1 . Assume p < 1/2. Consider a simple random walk with values in Z. . S2 . This means that there is a largest value that will be ever attained by the random walk. Theorem 36. Then limn→∞ Sn = −∞ with probability 1. . What is this probability? Theorem 37. 95 . Proof. q p. We know that it cannot be positive recurrent. Unless p = 1/2. Corollary 23. every nondecreasing sequence has a limit.2 Return probabilities Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. Indeed. except that the limit may be ∞. Otherwise. with probability 1 because p < 1/2. Proof. and so it is null-recurrent. Then the overall maximum M∞ has a geometric distribution: P0 (M∞ ≥ x) = (p/q)x . Let ′ T0 = inf{n ≥ 1 : Sn = 0} be the ﬁrst return time to 0. again. n→∞ n→∞ x ≥ 0. it will never ever return to the point where it started from.

The probability generating function of T0 was obtained in Theorem 34. But if it is transient. 31. We now compute its exact distribution. 96 . even for p = 1/2. we obviously have that the same formulae apply for any i: Pi (Ji ≥ k) = (2(p ∧ q))k . as already known. And so P0 (J0 ≥ k) = P0 (J0 ≥ 1)k . where the last probability was computed in Theorem 37. In the latter case. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. Ei Ji = 1 − 2(p ∧ q) Remark 2: The formulae work for all values of p ∈ [0. (34) a random variable ﬁrst considered in Section 10 where it was also shown to be geometric. if a Markov chain starts from a state i then Ji is geometric. k = 0. Remark 1: By spatial homogeneity. For the second formula we have ∞ ∞ E0 J0 = k=1 P0 (J0 ≥ k) = k=1 (2(p ∧ q))k = 2(p ∧ q) . . We now look at the the number of visits to some state but if we start from a diﬀerent state. But J0 ≥ 1 if and only if the ﬁrst return time to zero is ﬁnite. So ′ P0 (J0 ≥ 1) = P0 (T0 < ∞) = 2(p ∧ q). But ′ P0 (T0 < ∞) = lim EsT0 = 1 − s↑1 ′ 1 − 4pq = 1 − |p − q|. 1. 1 − 2(p ∧ q) this being a geometric series. 2.3 Total number of visits to a state If the random walk is recurrent then it visits any state inﬁnitely many times. Then. 2. . And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. .′ Proof. Pi (Ji ≥ k) = 1 for all k. . 1 − 2(p ∧ q) Proof. 2(p ∧ q) E0 J0 = . Theorem 38. . k = 0. This proves the ﬁrst formula. . starting from zero. 2(p ∧ q) . 1. the total number of visits to a state will be ﬁnite. Consider a simple random walk with values in Z. . . 2. If p = q this gives 1. 1]. . the total number J0 of visits to state 0 is geometrically distributed with P0 (J0 ≥ k) = (2(p ∧ q))k . k = 0. In Lemma 4 we showed that. The total number of visits to state i is given by ∞ Ji = n=1 1(Sn = i). 1. . meaning that Pi (Ji = ∞) = 1.

starting from 0. The ﬁrst term was computed in Theorem 35. The reason that we replaced k by k − 1 is because. Apply now the strong Markov property at Tx . it will visit any point below 0 exactly the same number of times on the average. We repeat the argument for x < 0 and ﬁnd P0 (Jx ≥ k) = q ∧1 p |x| (2(p ∧ q))k−1 . If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (q/p)|x| (2q)k−1 . Jx ≥ k).e. there has already been one visit to state x and so we need at least k − 1 remaining visits in order to have at least k visits in total. Jx ≥ k) = P0 (Tx < ∞)Px (Jx ≥ k − 1). given that Tx < ∞. (q/p)(−x)∨0 1 − 2q k≥1 . If p ≤ 1/2 then P0 (Jx ≥ k) = (p/q)x∨0 (2p)k−1 . i. Remark 1: The expectation E0 Jx is ﬁnite if p = 1/2. we have E0 Jx = 1/(1−2p) regardless of x. Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae. Suppose x > 0. Consider the case p < 1/2. Then P0 (Tx < ∞. we apply E0 Jx = k=1 P0 (Jx ≥ k) and perform the sum of a geometric series. if the random walk “drifts down” (i. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (p/q)x (2p)k−1 . Py (Jx ≥ k) = P0 (Jx−y ≥ k) and Ey Jx = E0 Jx−y . For x > 0. k ≥ 1. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (2p)k−1 . ∞ As for the expectation. and the second in Theorem 38. Then P0 (Jx ≥ k) = P0 (Tx < ∞. But for x < 0. 97 .e. In other words. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (2q)k−1 . (2(p ∧ q))k−1 . then E0 Jx drops down to zero geometrically fast as x tends to inﬁnity. E0 Jx = If p ≥ 1/2 then P0 (Jx ≥ k) = (q/p)(−x)∨0 (2q)k−1 .Theorem 39. p < 1/2) then. Consider a random walk with values in Z. So we obtain P0 (Jx ≥ k) = p ∧1 q x (p/q)x∨0 1 − 2p k≥1 . E0 Jx = Proof. simply because if Jx ≥ k ≥ 1 then Tx < ∞.

Thus. . 0 ≤ k ≤ n − 1. (Sk . we derived using duality. ξn ) has the same distribution as (ξn . 1 ≤ k ≤ n | Sn = x) = P0 (Sk < x. a sum of i. 0 ≤ k ≤ n). 0 ≤ k ≤ n) is deﬁned by Sk := Sn − Sn−k . Theorem 40. Sn = x) P0 (Tx = n) = . .i. Proof. . 0 ≤ k ≤ n) of (Sk . every time we have a probability for S we can translate it into a probability for S. . . i=1 Then the dual (Sk . in Theorem 41. Fix an integer n. because the two have the same distribution. = 1− √ 1 − s2 s x . n (37) 98 . and thus derived the generating function of Tx : E0 sTx = Specialising to the symmetric case we have E0 s Tx 1− 1 − 4pqs2 2qs x . for a simple symmetric random walk. random variables. (35) In Theorem 29 we used the reﬂection principle and derived the distribution function of Tx for a simple symmetric random walk: 1 P0 (Tx ≤ n) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). . i. Pause for reﬂection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. . n Proof.32 Duality Consider random walk Sk = k ξi . for a simple random walk and for x > 0. ξ1 ). ξ2 . . In Lemma 19 we observed that. Rewrite the ballot theorem terms of the dual: P0 (S1 > 0. . the probabilities P0 (Tx = n) = x P0 (Sn = x). (36) 2 Lastly. Sn > 0 | Sn = x) = But the left-hand side is P0 (Sn −Sn−k > 0. ξn−1 . Tx is the sum of x i.e. Use the obvious fact that (ξ1 . n x > 0. . 0 ≤ k ≤ n) has the same distribution as (Sk . Let (Sk ) be a simple symmetric random walk and x a positive integer.d. random variables. Here is an application of this: Theorem 41.i. which is basically another probability for S. P0 (Sn = x) P0 (Sn = x) x .d. Let Tx = inf{n ≥ 1 : Sn = x} be the ﬁrst time that x will be visited. Then P0 (Tx = n) = x P0 (Sn = x) . each distributed as T1 . . .

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33

Amount of time that a SSRW is positive*

Consider a SSRW (Sn ) starting from 0. Deﬁne its trajectory as the polygonal line obtained by joining the points (n, Sn ) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+ (2n) be the number of sides of the polygonal line that are above the horizontal axis and T− (2n) = 2n − T+ (2n) the number of sides that are below. Clearly, T± (2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+ (2n) as the number of leads in 2n coin tosses, where we

0

T 0’

′ Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0 ,

it behaves again like a SSRW and is independent of the past.

have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads=tails). Most people take this statement and translate it into a statement about T+ (2n) arguing that P (T+ (2n) = m) is maximised when m = n, i.e. when T+ (2n) = T− (2n). This is wrong. In fact, as the theorem below shows, the most likely value of P (T+ (2n) = m) is when m = 0 or m = 2n. Theorem 42 ( distribution of the total time the random walk is positive ). P (T+ (2n) = 2k) = P (S2k = 0)P (S2n−2k = 0) , 0 ≤ k ≤ n.

′ Proof. The idea is to condition at the ﬁrst time T0 that the RW returns to 0: n ′ P (T+ (2n) = 2k, T0 = 2r). r=1

P (T+ (2n) = 2k) =

99

′ Between 0 and T0 the random walk is either above the horizontal axis or below; call the ﬁrst event A+ and the second A− and write n ′ ′ P (A+ , T0 = 2r)P (T+ (2n) = 2k | A+ , T0 = 2r) n ′ ′ P (A− , T0 = 2r)P (T+ (2n) = 2k | A− , T0 = 2r)

P (T+ (2n) = 2k) =

r=1

+

r=1

**In the ﬁrst case,
**

′ ′ P (T+ (2n) = 2k | A+ , T0 = 2r) = P (T+ (2n) = 2k−2r | A+ , T0 = 2r) = P (T+ (2n−2r) = 2k−2r), ′ because, if T0 = 2r and A+ occurs then the RW has already spent 2r units of time above the axis and so T+ (2n) is reduced by 2r. Furthermore, we may remove the conditioning ′ because the future evolution of the RW after T0 is independent of the past. In the second case, a similar argument shows ′ ′ P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n − 2r) = 2k).

**We also observe that, by symmetry, 1 ′ ′ ′ P (A+ , T0 = 2r) = P (A− , T0 = 2r) = P (T0 = 2r) = f2r . 2 Letting p2k,2n = P (T+ (2n) = 2k), we have shown that p2k,2n = 1 2
**

k

**f2r p2k−2r,2n−2r +
**

r=1

1 2

n−k

**f2r p2k,2n−2r
**

r=1

We want to show that p2k,2n = u2k u2n−2k , where u2k = P (S2k = 0). With this hypothesis, we can start playing with the recursion of the display above and using induction on n we will realise that p2k,2n 1 = u2n−2k 2

k r=1

1 f2r u2k−2r + u2k 2

n−k

f2r u2n−2k−2r .

r=1

Explicitly, we have P (T+ (2n) = 2k) = 2k k 2n − 2k −2n 2 , n−k 0 ≤ k ≤ n. (38)

With 2n = 100, we plot the function P (T100 = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

100

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0

10

20 k

30

40

50

**The arcsine law*
**

Use of Stirling’s approximation in formula (38) gives P (T+ (2n) = 2k) ∼ 1/π k(n − k) , as k → ∞ and n − k → ∞.

**Now consider the calculation of the following probability: P T+ (2n) ∈ [a, b] 2n =
**

na≤k≤nb

P

T+ (2n) k = 2n n 1/π k(n − k) =

a≤k/n≤b

∼

a≤k/n≤b

1 . k k (n − n ) n n

1/π

**This is easily seen to have a limit as n → ∞, the limit being
**

b

1/π t(1 − t)

dt,

a

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law: x √ T+ (2n) 2 1/π lim P dt = arcsin x. ≤x = n→∞ 2n π t(1 − t) 0 We thus have that the limiting distribution of T+ (2n) (the fraction of time that the RW is √ 2 positive), as n → ∞, is the distribution function π arcsin x, 0 < x < 1. This is known as the arcsine law. This distribution function has density √ 1/π , 0 < x < 1, the plot of

x(1−x)

which is amazingly close to the plot of the ﬁgure above. Now consider again a SSRW and consider the hitting times (Tx , x ≥ 0). This is another random walk because the summands in Tx = x (Ty − Ty−1 ) are i.i.d. all distributed as y=1 T1 : 2n −2n 1 P (T1 ≥ 2n) = P (S2n = 0) = 2 ∼√ . n πn 101

−1)) = 1/2. ξ2 . n ∈ Z+ ) showing that it. i=1 ξi i=1 ξi 2 1 Lemma 20. Indeed. The two random walks are not independent. . the two stochastic processes are not independent. we let x1 be its horizontal coordinate and x2 its vertical.i. We describe this motion as a random walk in Z2 = Z × Z: Let ξ1 . Sn = i=1 n ξi . n ∈ Z+ ) is a random walk. west. west. The corners are described by coordinates (x. However. . We then let S0 = (0. n ∈ Z+ ) is a sum of i. the latter being functions of the former. Many other quantities can be approximated in a similar manner. 1 P (ξn = 1) = P (ξn = (1. 1 Hence (Sn . n ≥ 1. Sn ) = n n 2 . When he reaches a corner he moves again in the same manner. Consider a change of coordinates: rotate the axes by 45o (and scale them). (Sn . . . ξ2 . random variables and hence a random walk. . 0)) = P (ξn = (−1. the random 1 2 variables ξn . 1)) + P (ξn = (0. 0). So Sn = (Sn . . So let 1 2 1 2 1 2 Rn = (Rn . . ξn are not independent simply because if one takes value 0 the other is nonzero. we immediately get that the expectation of T1 is inﬁnity: ET1 = ∞. x1 − x2 ). i.i. random vectors such that P (ξn = (0. They are also identically distributed with 1 P (ξn = 0) = P (ξn = (0.. is a random walk. 0)) = 1/4. Since ξ1 . Sn − Sn ). y ∈ Z. . being totally drunk. with equal probability (1/4). x2 ) into (x1 + x2 . too. .e.d. so are ξ1 . are independent. (Sn . We have 1. n ∈ Z+ ) is also a random walk. 0)) = P (ξn = (1. Rn ) := (Sn + Sn . he moves east. 0) and. He starts at (0. 0)) = 1/4 1 P (ξn = −1) = P (ξn = (−1. 2 1 If x ∈ Z2 . It is not a simple random walk because there is a possibility that the increment takes the value 0.d. 34 Simple symmetric random walk in two dimensions Consider a drunkard in a city whose roads are arranged in a rectangular grid. y) where x. i. 1)) = P (ξn = (−1. north or south. 0)) = 1/4. going at random east. 1 1 Proof. north or south. We have: 102 . ξ2 . 2 Same argument applies to (Sn . map (x1 .From this. for each n.

We have of each of the ηn n 1 2 P (ηn = 1. we need to show that ∞ P (Sn = 0) = ∞.1 Lemma 21. 1 where the last equality follows from the independence of the two random walks. . The possible values 1 . being obtained the vector ηn = (ξn 1 2 n n n by taking the same function on the elements of the sequence ξ1 . ηn = 1) = P (ξn = (0. 2 . To show that the last two are independent. So P (R2n = 0) = 2n 2−2n .. 0)) = 1/4 1 2 P (ηn = −1. (Rn + independent. in the sense that the ratio of the two sides goes to 1. And so 2 1 2 1 P (ηn = ε. Proof. is i. η . 35 Skorokhod embedding* We take a closer look at the gambler’s ruin problem for a simple symmetric random walk in one dimension. ηn = −1) = P (ξn = (−1. where c is a positive constant. 1 1 2 2 Hence P (ηn = 1) = P (ηn = −1) = 1/2. . 0)) = 1/4 1 2 P (ηn = 1. P (Tb < Ta ) = We can read this as follows: P (STa ∧Tb = b) = −a . 1 2 1 2 P (S2n = 0) = P (R2n = 0) = P (R2n = 0. n ∈ Z+ ) 1 is a random walk in two dimensions. n ∈ Z ) is a random walk in one dimension. as n → ∞. ξ 1 − ξ 2 ). b−a P (STa ∧Tb = a) = b . b−a 103 . The two-dimensional simple symmetric random walk is recurrent. n ∈ Z ) is a random walk in one dimension. +1}. By Lemma 5.d. P (S2n = 0) ∼ n . Similar argument shows that each of (Rn . η 2 are ±1.i. ηn = −1) = P (ξn = (0. Lemma 22. Rn = n (ξi − ξi ). P (S2n = 0) = Proof. . n ∈ Z+ ) is a random walk in two dimensions. b−a P (Ta < Tb ) = b . ξ2 . 2 1 2 2 1 1 Proof. ηn = 1) = P (ξn = (1. that for a simple symmetric random walk (started from S0 = 0). ε′ ∈ {−1. Let a < 0 < b. (Rn . But by the previous n=0 c lemma. (Rn + 2 2 1 2 1 1 we check that ηn = ξn + ξn . We have Rn = n (ξi + ξi ). n ∈ Z+ ). R2n = 0) = P (R2n = 0)(R2n = 0). n Corollary 24. and by Stirling’s approximation. The same is true for (Rn ). Hence P (ηn = 1) = P (ηn = −1) = 1/2. Hence (Rn . Recall. (Rn . n ∈ Z+ ) is a random 2 . = 1)) = 1/4. ηn = ξn − ξn are independent for each n. 2n n 2 2−4n . from Theorem 7. . . b−a −a . 1)) = 1/4 1 2 P (ηn = −1. ηn = ε′ ) = P (ηn = ε)P (ηn = ε′ ) for all choices of ε. . But (Rn ) 1 2 is a simple symmetric random walk. The last two are walk in one dimension. The sequence of vectors η . So Rn = n ηn where ηn is i=1 i=1 i=1 1 + ξ 2 .

i ∈ Z. using this. Indeed. 0)} ∪ (N × N) with the following distribution: P (A = i. Conversely.This last display tells us the distribution of STa ∧Tb .g. 1 − p). Then there exists a stopping time T such that P (ST = i) = pi . Proof. i ∈ Z). i ∈ Z. P (A = 0. and assuming that (A. i<0 j>0 This leads to: c−1 j>0 jpj i<0 pi + c−1 i<0 (−i)pi j>0 i>0 ipi pj = 1 − p0 Since i∈Z ipi = 0. Let (Sn ) be a simple symmetric random walk and (pi . The claim is that P (Z = k) = pk . that ESTa ∧Tb = 0. we have this common value: i<0 (−i)pi = and. M < N . B) taking values in {(0. given a 2-point probability distribution (p. B) are independent of (Sn ). we consider the random variable Z = STA ∧TB . b]. M. To see this. where c−1 is chosen by normalisation: P (A = 0. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the ﬁrst time it exits the interval [a. How about more complicated distributions? Let suppose we are given a probability distribution (pi . B = 0) = p0 . suppose ﬁrst k = 0. a = M − N . The probability it exits it from the left point is p. in particular. if p = M/N . i < 0 < j. and the probability it exits from the right point is 1 − p. Theorem 43. choose. we can always ﬁnd integers a. c= i<0 (−i)pi = i>0 Letting Ti = inf{n ≥ 0 : Sn = i}. e. b. k ∈ Z. N ∈ N. Deﬁne a pair of random variables (A. a < 0 < b such that pa + (1 − p)b = 0. B = j) = 1. Then Z = 0 if and only if A = B = 0 and this has 104 . i ∈ Z) a given probability distribution on Z with zero mean: i∈Z ipi = 0. B = j) = c−1 (j − i)pi pj . we get that c is precisely ipi . such that p is rational. Can we ﬁnd a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this. b = N . Note. B = 0) + P (A = i.

probability p0 . Suppose next k > 0. 105 . P (Z = k) = pk for k < 0. i<0 = c−1 Similarly. B = j) P (STi ∧Tk = k)P (A = i. We have P (Z = k) = P (STA ∧TB = k) = i<0 j>0 P (STi ∧Tj = k)P (A = i. B = k) i<0 = = i<0 −i −1 c (k − i)pi pk k−i (−i)pi pk = pk . by deﬁnition.

Zero is denoted by 0.) Notice that: gcd(a. If c | b and b | a then c | a. their negatives and zero. 38) = gcd(6. Then interchange the roles of the numbers and continue. Given integers a. 2. Indeed. with at least one of them nonzero. 106 . b) = gcd(a. the process of long division that one learns in elementary school). . The set Z of integers consists of N. and d = gcd(a. 3. In other words. b). Now. We will show that d = gcd(a. (ii) if c | a and c | b. 0. if a. as taught in high schools or universities. So (ii) holds. 26) = gcd(6. b − a). So (i) holds. if we also invoke Euclid’s algorithm (i. 2. They are not supposed to replace textbooks or basic courses. −1. d | b. d2 are two such numbers then d1 would divide d2 and vice-versa. 5. 14) = gcd(6. c | b. 32) = gcd(6.e. Then c | a + (b − a). 2) = 2. Now suppose that c | a and c | b − a. with b = 0. But 3 is the maximum number of times that 38 “ﬁts” into 120. (That it is unique is obvious from the fact that if d1 . Similarly. . · · · }. For instance. 2) = gcd(4. 38) = gcd(44. 38) = gcd(82. This gives us an algorithm for ﬁnding the gcd between two numbers: Suppose 0 < a < b are integers. We write this by b | a. Arithmetic Arithmetic deals with integers. 38) = gcd(6. Keep doing this until you obtain b − qa < a. we say that b divides a if there is an integer k such that a = kb. . 8) = gcd(6. leading to d1 = d2 . 2) = gcd(2. 1. Because c | a and c | b. So we work faster. Replace b by b − a. let d = gcd(a. If a | b and b | a then a = b.PART III: APPENDICES This section contains some sine qua non elements from standard Mathematics. b. Indeed 4 × 38 > 120. But d | a and d | b. i. we can deﬁne their greatest common divisor d = gcd(a. 20) = gcd(6. then d | c. 6. −2. replace b − a by b − 2a. we may summarise the ﬁrst line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. −3. b). Z = {· · · . b) as the unique positive integer such that (i) d | a. 4. The set N of natural numbers is the set of positive integers: N = {1.}. 3. they merely serve as reminders. with the second line. gcd(120. Therefore d | b − a. b are arbitrary integers. and we’re done.e. But in the ﬁrst line we subtracted 38 exactly 3 times. b − a). we have c | d. If b − a > a.

b) then there are integers x. we can show that. . This shows (ii) of the deﬁnition of a greatest common divisor. Second. 2) = 2. 38) = gcd(6. Working backwards. y = 6. a. . r = 18. Then c divides any linear combination of a. Suppose that c | a and c | b. We need to show (i). i. for integers a. to divide 3450 with 26 we do 132 —— 3450 26 — 85 78 — 70 52 — 18 26 | and obtain q = 26. The same logic works in general. let d be the right hand side. In the process. To see why this is true. for some x. Its proof is obvious. . let us follow the previous example again: gcd(120. To see this. 120) = 38x + 120y. 38) = gcd(6. gcd(38. 107 . we can ﬁnd an integer q and an integer r ∈ {0. we have the fact that if d = gcd(a. we have gcd(a. . y such that d = xa + yb.The theorem of division says the following: Given two positive integers b. it divides d. with x = 19. such that a = qb + r. we have 2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120. not both zero. with a > b. b ∈ Z. 1. For example. Then d = xa + yb. in particular. The long division algorithm is a process for ﬁnding the quotient q and the remainder r. y ∈ Z. xa + yb > 0}. b) = min{xa + yb : x. b. Thus. Here are some other facts about the greatest common divisor: First.e. we used the divisions 120 = 3 × 38 + 6 38 = 6 × 6 + 2. y ∈ Z. a − 1}.

m elements. f (x2 ). x2 ∈ X have diﬀerent values f (x1 ). the greatest common divisor of any subset of the integers also makes sense. For example. a relation can be thought of as a set of pairs of elements of the set A. 108 . z) it also contains (x. y) without contain its symmetric (y. Relations A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. that d does not divide b. yielding r = (−qx)a + (1 − qy)b.. If B ⊂ Y then f −1 (B) consists of all x ∈ X with f (x) ∈ B. and this is a contradiction. the relation < between real numbers can be thought of as consisting of pairs of numbers (x. A relation is called symmetric if it cannot contain a pair (x. we can divide b by d to ﬁnd b = qd + r. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. it can be thought as reﬂexive (unless we do not want to accept the fact that one may not be afriend of oneself). A function is called one-to-one (or injective) if two diﬀerent x1 . so d divides both a and b. The relation < is not reﬂexive.e. A relation is transitive if when it contains (x. The range of f is the set of its values. The set of functions from X to Y is denoted by Y X .e. The relation < is not symmetric.g. y) such that x is smaller than y. e. then the relation “x is a friend of y” is symmetric. A relation which possesses all the three properties is called equivalence relation. The equality relation = is reﬂexive. The equality relation is symmetric. i. for some 1 ≤ r ≤ d − 1. A relation is called reflexive if it contains pairs (x. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). We encountered the equivalence relation i theory. So d is not minimum as claimed. so r = 0. i. B be ﬁnite sets with n. x). x). Suppose this is not true. Finally. If we take A to be the set of students in a classroom. z). j (i communicates with j) in Markov chain Functions We a function f from a set X into a set Y is denoted by f : X → Y . But then b = q(xa + yb) + r. Since d > 0. the set {f (x) : x ∈ X}. The greatest common divisor of 3 integers can now be deﬁned in a similar manner. respectively. that d does not divide both numbers. but is not necessarily transitive. Counting Let A. Both < and = are transitive. A function is called onto (or surjective) if its range is Y . In general. say.that d | a and d | b. y) and (y.

e. a one-to-one function from the set A into A is called a permutation of A. We thus observe that there are as many subsets in A as number of elements in the set {0. Clearly. Indeed. k! k where the symbol equation. the k-th in n − k + 1 ways. and the third. In the special case m = n. the second in n − 1. The ﬁrst element can be chosen in n ways. The total number of subsets of A (regardless of their number of elements) is thus equal to n k=0 n k = (1 + 1)n = 2n . n! is the total number of permutations of a set of n elements. If A has n elements. so does the second. So we have (n)k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. 109 . and so on. Each assignment of this type is a one-to-one function from A to B. we denote (n)n also by n!. Any set A contains several subsets. and this follows from the binomial theorem. the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen.The number of functions from A to B (i. in other words. we need more names than animals: n ≤ m. A function is then an assignment of a name to each animal. Hence the total number of assignments is m × m × ··· × m n times Suppose now we are not allowed to give the same name to diﬀerent animals. Since the name of the ﬁrst animal can be chosen in m ways. n k is merely a name for the number computed on the left side of this n k = n n−k n 0 = n! . To be able to do this. Incidentally. that of the second in m − 1 ways. the total number of one-to-one functions from A to B is (m)n := m(m − 1)(m − 2) · · · (m − n + 1). Thus. n! stands for the product of the ﬁrst n natural numbers: n! = 1 × 2 × 3 × · · · × n.) So the ﬁrst animal can be assigned any of the m available names. and so on. 1}n of sequences of length n with elements 0 or 1. and so on. we may think of the elements of A as animals and those of B as names. (The same name may be used for diﬀerent animals. the cardinality of the set B A ) is mn . we have = 1. k!(n − k)! Since there is only one subset with no elements (it is the empty set ∅). If we don’t then we divide by the number k! of permutations of the k elements concluding that the number of subsets with k elements of a set with n elements equals (n)k n =: .

If the pig is included. by convention. by convention. 0). if x is a real number. b). we choose k animals from the remaining n ones and this can be done in n ways. Note that it is always true that a = a+ − a− . For example. c. The notation extends to more than one numbers. a− = − min(a. then there are k − 1 remaining k n animals to be chosen and this can be done in k−1 ways. we use the following notations: a+ = max(a. we deﬁne x m = x(x − 1) · · · (x − m + 1) . 110 . The maximum is denoted by a ∨ b = max(a.We shall let n = 0 if n < k. Thus a ∧ b = a if a ≤ b or = b if b ≤ a. In particular. b is denoted by a ∧ b = min(a. and a− is called the negative part of a. and we don’t care to put parantheses because the order of ﬁnding a maximum is irrelevant. If the pig is not included in the sample. say. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c). Maximum and minimum The minimum of two numbers a. we add the two numbers together in order to ﬁnd the total number of ways to pick k animals from n + 1 animals. Both quantities are nonnegative. b). b. a∨b∨c refers to the maximum of the three numbers a. The absolute value of a is deﬁned by |a| = a+ + a− . a+ is called the positive part of a. the pig. 0). Since the sentences “the pig is included” and “the pig is not included” cannot be simultaneously true. k Also. we let The Pascal triangle relation states that n+1 k = = 0. m! n y If y is not an integer then. consider a speciﬁc animal. + k−1 k Indeed. in choosing k animals from a set of n + 1 ones. n n .

This is very useful notation because conjunction of clauses is transformed into multiplication. Another example: Let X be a nonnegative random variable with values in N = {1. . An example of its application: Since 1A∪B = 1 − 1(A∪B)c = 1 − 1Ac ∩Bc = 1 − 1Ac 1Bc = 1 − (1 − 1A )(1 − 1B ) = 1A + 1B − 1A 1B . 111 . 1 − 1A = 1 A c . Thus. are sets or clauses then the number n≥1 1An is the number true clauses. P (X ≥ n). . meaning a function that takes value 1 on A and 0 on the complement Ac of A. A2 . For instance. Several quantities of interest are expressed in terms of indicators. . = 1A + 1B − 1A∩B . So E[X] = n≥1 E 1(X ≥ n) = n≥1 Matrices Let Rd be the set of vectors with d components. It is easy to see that 1A 1A∩B = 1A 1B . if A1 . . Therefore its expectation is E 1A = 1 × P (A) + 0 × P (Ac ) = P (A).Indicator function stands for the indicator function of a set A. If A is an event then 1A is a random variable. we have E[1A∪B ] = E[1A ] + E[1B ] − E[1A∩B ]. A linear function ϕ from Rn to Rm is such that ϕ(αx + βy) = αϕ(x) + βϕ(y).}. It takes value 1 with probability P (A) and 0 with probability P (Ac ). 2. Then X X= n=1 1 (the sum of X ones is X). It is worth remembering that. . We also use the notation 1(clause) for a number that is 1 whenever the clause is true or 0 otherwise. So X= n≥1 1(X ≥ n). if I go to a bank N times and Ai is the clause “I am robbed the i-th time” then N n=1 1Ai is the total number of times I am robbed. which means P (A ∪ B) = P (A) + P (B) − P (A ∩ B). .

Combining. β ∈ R. xn n 1 y := ϕ(x) = j=1 xj ϕ(uj ). . . But the ϕ(uj ) are vectors in Rm . . . . . . . y ∈ Rn . . vm in Rn we can express each ϕ(uj ) as a linear combination of the basis vectors: m ϕ(uj ) = i=1 aij vi . . xn the given basis. = ··· ··· ··· . . The matrix A of ϕ with respect to the choice of the two basis is deﬁned by a11 · · · a1n A := · · · · · · · · · am1 · · · amn y1 Then ϕ ◦ ψ is a linear function: Rn − → Rℓ − 112 ϕ◦ψ . The column . am1 · · · amn xn ym Rn − Rm − Rℓ → → ψ ϕ . ym v1 . Then any x ∈ Rn can be uniquely written as n x= j=1 x j uj . . . Let u1 . and all α. . . . x . Suppose that ψ. ϕ are linear functions: It is a m × n matrix. .for all x. . vm . . because ϕ is linear. The relation between the column representation of y and the column representation of x is written as 1 1 x y a11 · · · a1n . So if we choose a basis v1 . x1 . . . j=1 i = 1. un be a basis for Rn . . m. we have ϕ(x) = m n vi i=1 j=1 aij xj . The column . Let y i := n aij xj . is a column representation of y := ϕ(x) with respect to the given basis . where ∈ R. Then. . . is a column representation of x with respect to .

If I is empty then inf ∅ = +∞. j So A is a ℓ × m matrix. xn+2 . of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. . . . . . . The standard basis in Rd consists of the vectors 1 0 0 0 1 0 e 1 = . . a square matrix represents a linear transformation from Rn to Rn .} ≤ inf{x2 . inf I is characterised by: inf I ≤ x for all x ∈ I and. . since any increasing sequence has a limit (possibly +∞) we take this limit and call it lim sup xn (limit superior) of the sequence. where m (AB)ij = k=1 Aik Bkj .Fix three sets of basis on Rn . A matrix A is called square if it is a n × n matrix.} has no minimum. 1 ≤ j ≤ n. Likewise. a sequence a1 . a2 . B be the matrix representations of ϕ. ψ. . 1 ≤ i ≤ ℓ. If I has no lower bound then inf I = −∞. Rm and Rℓ and let A. the choice of the basis is the so-called standard basis. . e2 = . respectively. x4 . and it is this that we called inﬁmum. there is x ∈ I with x ≤ ε + inf I. standing for the “infimum” of I. . 113 . . . we deﬁne lim sup xn . Limsup and liminf If x1 . . The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ. · · · }. . Similarly. . . x2 . Similarly. 0 0 1 In other words. For instance I = {1. . . x3 . x2 . i. sup ∅ = −∞. . with respect to these bases. for any ε > 0. is a sequence of numbers then inf{x1 .} ≤ · · · and. xn+1 . 1/2. and AB is a ℓ × n matrix. we deﬁne the supremum sup I to be the minimum of all upper bounds.e. for any ε > 0. Supremum and inﬁmum When we have a set of numbers I (for instance. . Usually. In these cases we use the notation inf I. This minimum (or maximum) may not exist. it is its matrix representation with respect to the ﬁxed basis. . A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists. ei = 1(i = j). we deﬁne max I. So lim sup xn = limn→∞ inf{xn . . . . deﬁned as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. B is a m × n matrix. . there is x ∈ I with x ≥ ε + inf I. . . We note that lim inf(−xn ) = − lim sup xn .} ≤ inf{x3 . . If we ﬁx a basis in Rn . ed = . . sup I is characterised by: sup I ≥ x for all x ∈ I and. . Then the matrix representation of ϕ ◦ ψ is AB. If I has no upper bound then sup I = +∞. 1/3.

. are decreasing with A = ∩n An then P (A) = limn→∞ P (An ).e. we get P (A ∩ B) = P (A)P (B|A) = P (B)P (A|B) and this is called Bayes’ theorem. i. A2 . by not requiring this. In contrast. Note that 1A 1B = 1A∩B . . . If A1 . Linear Algebra that has many axioms is more ‘restrictive’. are increasing with A = ∪n An then P (A) = limn→∞ P (An ). which holds since X is positive. if he events A1 . Quite often. This is generalisable for a number of events: P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) If we interchange the roles of A. If A is an event then 1A or 1(A) is the indicator random variable. This is the inclusion-exclusion formula for two events. A2 . Probability Theory is exceptionally simple in terms of its axioms. For example. If An are events with P (An ) = 0 then P (∪n An ) = 0. in which case we write P (A) = n P (A ∩ Bn ) = P (A|Bn )P (Bn ). B in the deﬁnition. follows from this axiom. and 1 Ac = 1 − 1 A . . B2 . it is useful to ﬁnd disjoint events B1 .Probability Theory I do not require11 knowledge of axiomatic Probability Theory (with Measure Theory). Often. then P n≥1 An = P (An ). A is a subset of B). If the events A. if An are events with P (An ) = 1 then P (∩n An ) = 1.. That is why Probability Theory is very rich. 11 114 . Indeed. In Markov chain theory. n P (An ). I do sometimes “cheat” from a strict mathematical point of view. In contrast. (That is why Probability Theory is very rich. to compute P (A ∩ B) we read the deﬁnition as P (A ∩ B) = P (A)P (B|A). A2 . E 1A = P (A).. we work a lot with conditional probabilities. This follows from the logical statement X 1(X > x) ≤ X. Of course. to compute P (A). B are disjoint events then P (A ∪ B) = P (A) + P (B). n≥1 Usually. n If the events A1 . derive formulae. . to show that EX ≤ EY . for (besides P (Ω) = 1) there is essentially one axiom .e. such that ∪n Bn = Ω. Linear Algebra that has many axioms is more ‘restrictive’. Recall that P (B|A) is deﬁned by P (A ∩ B)/P (A) and is a very motivated deﬁnition. Similarly. . namely: Everything else. but not too much. . and everywhere else. Markov’s inequality for a positive random variable X states that P (X > x) ≤ EX/x. the function that takes value 1 on A and 0 on Ac . . But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems. . B are not disjoint then P (A ∪ B) = P (A) + P (B) − P (A ∩ B). showing that P (A) ≤ P (B) is more of a logical problem than a numerical one: we need to show that A implies B (i. Therefore. . . and apply them. are mutually disjoint events. This can be generalised to any ﬁnite number of events. we often need to argue that X ≤ Y . . in these notes.) If A. Similarly. P (∪n An ) ≤ Similarly.

. . 1 − ρs (39) Note that |G(s)| ≤ n |an ||s|n and this is ≤ n |an | if s ∈ [−1. is a probability distribution on Z+ = {0. . is an inﬁnite polynomial having the ai as coeﬃcients: G(s) = a0 + a1 s + a2 s2 + · · · . In case the random variable is ∞ absoultely continuous with density f . The formula holds for any positive random variable. For a random variable which takes positive and negative values. The generating function of the sequence pn = P (X = n). for s = 0 for which G(0) = 0. Suppose now that X is a random variable with values in Z+ ∪ {+∞}. . n ≥ 0 is ∞ ρn s n = n=0 1 . . 0). ϕ(s) = EsX = n=0 sn P (X = n) 115 . For a random variable that takes values in N we can write EX = ∞ P (X ≥ n). we deﬁne X + = max(X. is called probability generating function of X: ∞ as long as |ρs| < 1. x→∞ In Markov chain theory. so it is useful to keep this in mind.. We do not a priori know that this exists for any s other than. . The expectation of a nonnegative random variable X may be deﬁned by ∞ EX = 0 P (X > x)dx. . the generating function of the geometric sequence an = ρn .An application of this is: P (X < ∞) = lim P (X ≤ x). we encounter situations where P (X = ∞) may be positive. 0). a1 . obviously. 1]. . we have EX = −∞ xf (x)dx. . viz. As an example. which is motivated from the algebraic identity X = X + − X − . 2.} because then n an = 1. This deﬁnition makes sense only if EX + < ∞ or if EX − < ∞. Generating functions A generating function of a sequence a0 . . n = 0. In case the random variable is discrete. 1]. taught in calculus courses). This is due to the fact that n=1 P (X ≥ n) = E 1(X ≥ n). If it exists for some s1 then it exists uniformly for all s with |s| < |s1 | (and this needs a proof. which are both nonnegative and then deﬁne EX = EX + − EX − . 1.e. . 1. So if n an < ∞ then G(s) exists for all s ∈ [−1. the formula reduces to the known one. making it a useful object when the a0 . EX = x xP (X = x). |s| < 1/|ρ|. i. X − = − min(X. a1 . .

116 . take the value +∞. Note that ∞ n=0 pn ≤ 1. with x0 = x1 = 1. Since the term s∞ p∞ could formally appear in the sum deﬁning ϕ(s). consider the Fibonacci sequence deﬁned by the recursion xn+2 = xn+1 + xn . n ≥ 0) is n≥0 xn+2 sn = s−2 (X(s) − x0 − sx1 ). n=0 We do allow X to. possibly. xn sn be the generating of (xn . while ∞ ϕ(1) = n=0 P (X = n) = P (X < +∞). we can recover the expectation of X from EX = lim ϕ′ (s). s→1 something that can be formally seen from d d EsX = E sX = EXsX−1 . We can recover the probabilities pn from the formula pn = ϕ(n) (0) . n ≥ 0) is n≥0 xn+1 sn = s−1 (X(s) − x0 ) and the generating function of (xn+2 . Generating functions are useful for solving linear recursions. since |s| < 1. If we let X(s) = n≥0 n ≥ 0. and this is why ϕ is called probability generating function.This is deﬁned for all −1 < s < 1. As an example. n! as follows from Taylor’s theorem. n ≥ 0) then the generating function of (xn+1 . but. Also. we have s∞ = 0. ds ds and by letting s = 1. Note that ϕ(0) = P (X = 0). ∞ and p∞ := P (X = ∞) = 1 − pn .

Since x0 = x1 = 1. n ≥ 0) equals the sum of the generating functions of (xn+1 . the n-th term of the Fibonacci sequence.From the recursion xn+2 = xn+1 + xn we have (and this is were linearity is used) that the generating function of (xn+2 .e. for example. we speak of some function which enables us to specify the probabilities of the form P (X ∈ B) where B is a set of values of X. b−a is the solution to the recursion. This tells us that s−b is the generating function of (1/b)n+1 1 and that s−a is the generating function of (1/a)n+1 . 117 . is such an example. this number is always an integer (it has to be!) Distribution or law of a random object When we speak of the “distribution” or “law” of a random variable X with values in Z or R or Rn or even in a much larger space. x ∈ R. But I could very well use the function P (X > x). if X is a random variable with values in R. n ≥ 0) and (xn .e. b = −( 5 + 1)/2. n ≥ 0). we can solve for X(s) and ﬁnd X(s) = s2 −1 . we can also write this as xn = (−a)n+1 − (−b)n+1 . Thus (1/ρ)−s is the 1 generating function of ρn+1 . and so X(s) = 1 1 1 1 1 1 = . Hence s2 + s − 1 = (s − a)(s − b). Hence X(s) is the generating function of 1 xn = ((1/b)n+1 − (1/a)n+1 ). i. then the so-called distribution function. s−2 (X(s) − x0 − sx1 ) = s−1 (X(s) − x0 ) + X(s). − − a−b s−b s−a b−a b−s a−s 1/ρ 1 1 But (39) tells us that ρn has generating function 1−ρs = (1/ρ)−s . b−a Noting that ab = −1. Despite the square roots in the formula. the function P (X ≤ x). So. i. +s−1 But the polynomial s2 + s − 1 has two roots: √ √ a = ( 5 − 1)/2. namely.

instead. n! ≈ exp(n log n − n) = nn e−n . Even the generating function EsX of a Z+ -valued random variable X is an example. We frequently encounter random objects which belong to larger spaces. to compute a large sum we can. (b) The approximation is not good when we consider ratios of products. dx All these objects can be referred to as “distribution” or “law” of X. Warning: the verb “specify” should not be interpreted as “compute” in the sense of existence of an analytical formula or algorithm. it’s better to use logarithms: n n! = e log n = exp k=1 log k. I could use its density f (x) = d P (X ≤ x). compute an integral and obtain an approximation. in 2n fair coin tosses 118 . for example they could take values which are themselves functions. Specifying the distribution of such an object may be hard. But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. Since log x does not vary too much between two successive positive integers k and k + 1 it follows that k+1 k log x dx ≈ log k. For instance. for it does allow us to specify the probabilities P (X = n). Hence we have We call this the rough Stirling approximation. Large products: Stirling’s formula If we want to compute n! = 1 × 2 × · · · × n for large n then. 1 Since d dx (x log x − x) = log x. it follows that n 1 log x dx = n log n − n + 1 ≈ n log n − n. For example (try it!) the rough Stirling approximation give that the probability that. Such an excursion is a random object which should be thought of as a random function. If X is absolutely continuous. It turns out to be a good one. n Hence n k=1 log k ≈ log x dx.or the 2-parameter function P (a < X < b). n ∈ Z+ . since the number is so big. but using or merely talking about such an object presents no diﬃculty other than overcoming a psychological barrier. we encountered the concept of an excursion of a Markov chain from a state i. Now.

And if y < x then 1(y < x) = 0 and the display reads g(y) ≥ 0.we have exactly n heads equals 2n 2−2n ≈ 1. That was the random variable J0 = ∞ n=1 1(Sn = 0). and is based on the concept of monotonicity. by taking expectations of both sides (expectation is an increasing operator). 119 . then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any diﬀerent from X. because we know that the n probability goes to zero as n → ∞. Markov’s inequality Arguably this is the most useful inequality in Probability. and this is terrible. We baptise this geometric(λ) distribution. Surely. If we think of X as the time of occurrence of a random phenomenon. if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x). so this little section is a reminder. you have seen it before. In fact it was shown there that P (J0 ≥ k) = P (J0 ≥ 1)k . for all x. and so. y ∈ R g(y) ≥ g(x)1(y ≥ x). g(X) ≥ g(x)1(X ≥ x). Eg(X) ≥ g(x)P (X ≥ x) . expressing monotonicity. Then. We have already seen a geometric random variable. To remedy this. n ∈ Z+ . we shall prove Strirling’s approximation: √ n! ∼ nn e−n 2πn where the symbol an ∼ bn means that an /bn → 0. where λ = P (X ≥ 1). Indeed. We can case both properties of g neatly by writing For all x. The same process shows that a geometric random variable has a distribution of the form P (X ≥ n) = λn . expressing non-negativity. a geometric function of k. It was shown in Theorem 38 that J0 satisﬁes the memoryless property. as n → ∞. 0 ≤ n ≤ m. Consider an increasing and nonnegative function g : R → R+ . The geometric distribution We say that the random variable X with values in Z+ is geometric if it has the memoryless property: P (X − n ≥ m | X ≥ n) = P (X ≥ m). This completely trivial thought gives the very useful Markov’s inequality. deﬁned in (34). Let X be a random variable.

J L (1997) Introduction to Probability. Bremaud P (1997) An Introduction to Probabilistic Modeling. Springer. Springer. 120 . Parry.mac. Grinstead. W (1981) Topics in Ergodic Theory. 7. Chung. Norris. Bremaud P (1999) Markov Chains. Math. J R (1998) Markov Chains. C M and Snell. ´ 2. Wiley. 5. Cambridge University Press. 3. Feller.dartmouth.Bibliography ´ 1. 4. K L and Aitsahlia F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). W (1968) An Introduction to Probability Theory and its Applications. Springer. Soc.pdf 6. Cambridge University Press. Also available at: http://www.edu/~chance/teaching_aids/books_articles/ /probability_book/amsbook. Amer.

18 dual. 6 Markov property. 87 121 . 2 nearest neighbour random walk. 17 Aloha. 51 essential state. 108 relation. 16 convolution. 17 inﬁmum. 110 mixing. 20 increments. 114 balance equations. 98 dynamical system. 35 Helly-Bray Lemma. 48 cycle formula. 15. 7 invariant distribution. 117 division. 70 greatest common divisor. 53 hitting time. 110 meeting time. 77 indicator. 68 aperiodic. 86 reﬂexive. 119 matrix. 73 Markov chain. 115 geometric distribution. 109 positive recurrent. 111 maximum. 18 permutation. 111 inessential state. 116 ﬁrst-step analysis. 65 generating function. 16. 101 axiom of Probability Theory. 29 reﬂection. 59 law. 117 leads to. 17 communicates with. 34 probability ﬂow. 58 digraph (directed graph). 84 bank account. 20 function of a Markov chain. 3 branching process. 63 Galton-Watson process. 61 null recurrent. 48 memoryless property. 15 Loynes’ method. 106 ground state. 119 Google. 45 Chapman-Kolmogorov equation. 115 random walk on a graph. 1 equivalence classes. 68 Fatou’s lemma. 108 ruin probability. 17 Kolmogorov’s loop criterion. 15 distribution. 108 ergodic. 77 coupling. 10 ballot theorem. 70 period. 76 neighbours. 47 coupling inequality. 61 detailed balance equations. 13 probability generating function. 53 Fibonacci sequence. 65 Catalan numbers. 16 equivalence relation. 16 communicating classes. 81 reﬂection principle. 61 recurrent. 82 Cesaro averages. 44 degree. 10 irreducible.Index absorbing. 26 running maximum. 34 PageRank. 7 closed. 17 Ethernet. 119 minimum. 51 Monte Carlo. 6 Markov’s inequality. 113 initial distribution. 18 arcsine law.

113 symmetric. 10 steady-state. 7 time-reversible. 9 stationary distribution. 62 supremum. 62 Wright-Fisher model. 13 stopping time. 88 watching a chain in a set. 29 transition diagram:. 7 simple random walk. 62 subsequence of a Markov chain. 76 Skorokhod embedding. 77 skip-free property. 60 urn. 58 transient. 8 transition probability matrix. 108 tree. 7. 104 states. 108 time-homogeneous. 8 transitive. 6 stationary. 24 Strirling’s approximation. 16. 16. 27 storage chain. 119 strong Markov property. 2 transition probability. 76 simple symmetric random walk. 71 122 .semigroup property. 3. 73 strategy of the gambler. 6 statespace. 27 subordination.

- The Incredible Power of Post Selection (Scott Aaronson)
- markov-chains
- Lec27
- Markov Chains
- Comm Bremermann
- MarkovChains
- Exponential Growth
- 20752_0412055511MarkovChain
- 9781904534488
- markov chain chapter 6
- 2-2
- Bayesian Econometric Methods
- Mata23 Final 2012s
- (Ebook) Hidden Markov Models (Theory & Methods) Markov Chains Particle Filter Monte Carlo Hmm - Cappemoulinesryden
- UCLA Math 33A Midterm 2 Solutions
- Markov Coding
- Matrix
- ProblemsMarkovChains
- SVD_ans_you.pdf
- Markov Chain
- Matrizes, Eliminação Gaussiana, Equações Lineares e Espaços Vetoriais
- A First Course in Linear Algebra
- ROC
- Solved Problems
- lect14
- Statistics
- distance transform 21
- Introduction to Markov Chain Monte Carlo Simulation and Their Statistical Analysis
- 2-D Graphics Transformation -ecomputernotes.com
- Exercises for Stochastic Calculus

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

We've moved you to where you read on your other device.

Get the full title to continue

Get the full title to continue listening from where you left off, or restart the preview.

scribd