This action might not be possible to undo. Are you sure you want to continue?

**MARKOV CHAINS AND RANDOM WALKS
**

Takis Konstantopoulos∗ Autumn 2009

∗

c Takis Konstantopoulos 2006-2009

Contents

1 Introduction 2 Examples of Markov chains 2.1 2.2 2.3 2.4 2.5 A mouse in a cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bank account . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) in a city . . . . . . . . . . . . . . . . . Actuarial chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 3 4 4 5 6 6 6 7 9

3 The Markov property 3.1 3.2 3.3 Deﬁnition of the Markov property . . . . . . . . . . . . . . . . . . . . . . . . Transition probability and initial distribution . . . . . . . . . . . . . . . . . . Time-homogeneous chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 Stationarity 4.1

Finding a stationary distribution . . . . . . . . . . . . . . . . . . . . . . . . . 13 15

5 Topological structure 5.1 5.2 5.3 5.4

The graph of a chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “leads to” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “communicates with” . . . . . . . . . . . . . . . . . . . . . . . . 16 Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 20 24 27 28 29

6 Hitting times and ﬁrst-step analysis 7 Gambler’s ruin 8 Stopping times and the strong Markov property 9 Regenerative structure of a Markov chain 10 Recurrence and transience

10.1 First hitting time decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 31 10.2 Communicating classes and recurrence/transience . . . . . . . . . . . . . . . . 32 11 Positive recurrence 33

i

12 Law (=Theorem) of Large Numbers in Probability Theory 13 Law of Large Numbers for Markov chains 13.1 Functions of excursions

34 34

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

13.2 Average reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 14 Construction of a stationary distribution 15 Positive recurrence is a class property 16 Uniqueness of the stationary distribution 17 Structure of a stationary distribution 18 Coupling and stability 39 41 43 44 45

18.1 Deﬁnition of stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 18.2 The fundamental stability theorem for Markov chains . . . . . . . . . . . . . 47 18.3 Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 18.4 Proof of the fundamental stability theorem . . . . . . . . . . . . . . . . . . . 49 18.5 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 18.6 Periodic chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 19 Limiting theory for Markov chains* 53

19.1 Limiting probabilities for null recurrent states . . . . . . . . . . . . . . . . . . 53 19.2 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 20 Ergodic theorems 21 Finite chains 22 Time reversibility 23 New Markov chains from old ones* 56 57 58 62

23.1 Subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.2 Watching a Markov chain when it visits a set . . . . . . . . . . . . . . . . . . 62 23.3 Subordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.4 Deterministic function of a Markov chain . . . . . . . . . . . . . . . . . . . . 63 24 Applications 65

ii

. . . . . . . . . . . . . .3 Some conditional probabilities . . . . . . . . . . . .4 The Wright-Fisher model: an example from biology . . . . . . . . 95 31. . . . . . . . .24. . . .2 Paths that start and end at speciﬁc points and do not touch zero at all . . . . . . .2 First return to the origin . . .1 Distribution of hitting time and maximum . . . . 95 31. . . . .3 Total number of visits to a state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 PageRank (trademark of Google): A World Wide Web application . .5 A storage or queueing application . . . .2 The ballot theorem . .1 Branching processes: a population growth application . . . . . . . . . . . .2 The ALOHA protocol: a computer communications application . 90 30. . . . 84 27. . . . . . . . 73 25 Random walk 26 The simple symmetric random walk in dimension 1: path counting 76 78 26. . . . . . 96 32 Duality 33 Amount of time that a SSRW is positive* 98 99 iii . . 80 27 The simple symmetric random walk in dimension 1: simple probability calculations 83 27. . . . . . . . . . . . . . . . . . . 70 24. . 85 27. . . . . . . 65 24. . . . . . . . 86 28 The reﬂection principle for a simple random walk in dimension 1 86 28. . . . .5 First return to zero . . . . . . . . . . . . . . . . .1 Distribution after n steps . . 87 29 Urns and the ballot theorem 30 The asymmetric simple random walk in dimension 1 88 89 30. . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Asymptotic distribution of the maximum . . . 71 24. . . . . . . . . . 83 27. . . . .1 First hitting time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 27. . . . . . . . . . . . . . . . .1 Paths that start and end at speciﬁc points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 Some remarkable identities . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 24. .2 Return probabilities . . . . . . . . . . . . . . . . 79 26. . . . . . . . . . . . .3 The distribution of the ﬁrst return to the origin . . . . . . . 92 30. . . . . . . . . . . . . . . . . . . . . 92 31 Recurrence properties of simple random walks 94 31. . . . . . .

34 Simple symmetric random walk in two dimensions 35 Skorokhod embedding* 102 103 iv .

I plan to add a few more sections: – On algorithms and simulation – On criteria for positive recurrence – On doing stuﬀ with matrices – On ﬁnance applications – On a few more delicate computations for simple random walks – Reshape the appendix – Add more examples These are things to come. I introduce stationarity before even talking about state classiﬁcation.S. I have a little mathematical appendix. Greece and the U. The ﬁrst part is about Markov chains and some applications. dozens of good books on the topic. There are. Also.... The notes should be readable by someone who has taken a course in introductory (non-measuretheoretic) probability. Also. v .Preface These notes were written based on a number of courses I taught over the years in the U. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. Of course. They form the core material for an undergraduate course on Markov chains in discrete time. At the end. I tried to put all terms in blue small capitals whenever they are ﬁrst encountered.K. “important” formulae are placed inside a coloured box. The second one is speciﬁcally for simple random walks. There notes are still incomplete. one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. of course. I have tried both and prefer the current ordering. A few starred sections should be considered “advanced” and can be omitted at ﬁrst reading.

Recall (from elementary school maths). x). or. .e. initialised by x0 = max(a. b) between two positive integers a and b. Suppose that a particle of mass m is moving under the action of a force F in a straight line. we let our “state” be a pair of integers (xn . Its velocity is x = dx . yn ) into Xn+1 = (xn+1 . 2.PART I: MARKOV CHAINS 1 Introduction A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past aﬀects the future only through the present. b) = gcd(r. the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke’s law). . b). x Here. The “time” can be discrete (i. An example from Physics is provided by Newton’s laws of motion. . that gcd(a. where xn ≥ yn . a totally ordered set. where r = rem(a. i. n = 0. yn ). It is not hard to see that. the future pairs Xn+1 . one of which is 0. In Mathematics. For instance. . yn ). depend only on the pair Xn . The state of the particle at time t is described by the pair ˙ X(t) = (x(t). Xn+2 . for any n. more generally. In other words. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. 1. F = f (x. In the module following this one you will study continuous-time chains. the real numbers). has the property that. x(t)). . yn+1 ). and the other the greatest common divisor. b). By repeating the procedure. Formally. . there is a function F that takes the pair Xn = (xn . a phenomenon which evolves with time in a way that only the present aﬀects the future is called a dynamical system. for any t.e. Newton’s second law says that the acceleration is proportional to the force: m¨ = F. X2 . An example from Arithmetic of a dynamical system in discrete time is the one which ﬁnds the greatest common divisor gcd(a.e. The sequence of pairs thus produced. Assume that the force F depends only on the particle’s position ¨ dt and velocity. . and evolving as xn+1 = yn yn+1 = rem(xn . . the future trajectory (X(s). we end up with two numbers. b). X0 . continuous (i. x = x(t) denotes the position of the particle at time t. . b) is the remainder of the division of a by b. y0 = min(a. s ≥ t) ˙ 1 . and its ˙ dt 2x acceleration is x = d 2 . the integers). X1 .

000 rows and 10. but some are stochastic. Yn ). .can be completely speciﬁed by the current state X(t).05. In other words. Other examples of dynamical systems are the algorithms run.1 They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large” such as ﬁnding the determinant of a matrix with 10. we will see what kind of questions we can ask and what kind of answers we can hope for. 1 and 2. In addition. instead of deterministic objects. The resulting ratio is an approximation b of a f (x)dx. A scientist’s job is to record the position of the mouse every minute. regardless of where it was at earlier times. a ≤ X0 ≤ b. analyse. 0 ≤ Y0 ≤ B. Markov chains generalise this concept of dependence of the future only on the present. containing fresh and stinky cheese.e. at time n + 1 it is either still in 1 or has moved to 2. We will also see why they are useful and discuss how they are applied. Each time count 1 if Yn < f (Xn ) and 0 otherwise. Some of these algorithms are deterministic. A mouse lives in the cage. Perform this a large number N of times. The generalisation takes us into the realm of Randomness. Count the number of 1’s and divide by N . We will develop a theory that tells us how to describe. via the use of the tools of Probability. i. Try this on the computer! 2 . and use those mathematical models which are called Markov chains. . We can summarise this information by the transition diagram: 1 Stochastic means random.1 Examples of Markov chains A mouse in a cage A mouse is in a cage with two cells. an eﬀective way for dealing with large problems is via the use of randomness. past states are not needed. for steps n = 1. . Repeat with (Xn . say. Indeed. Choose a pair (X0 . Similarly. 1 2 Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0. We will be dealing with random variables. An example of a random algorithm is the Monte Carlo method for the b approximate computation of the integral a f (x)dx of a complicated function f : suppose that f is positive and bounded below B. 2 2.000 columns or computing the integral of a complicated function of a large number of variables. by the software in your computer. it moves from 2 to 1 with probability β = 0.99. respectively. it does so. When the mouse is in cell 1 at time n (minutes) then. 2.. uniformly. Y0 ) of random numbers.

Assume also that X0 . D1 .01 Questions of interest: 1.95 0. Here.05 0. . are random variables with distributions FD . the ﬁrst time that the mouse will move from 1 to 2 will have mean 1/α = 1/0. (This is the mean of the binomial distribution with parameter α. are independent random variables.α 1−α 1 β 2 1−β Another way to summarise the information is by the 2×2 transition probability matrix P= 1−α α β 1−β = 0. on the average.99 0. We easily see that if y > 0 then ∞ x. as follows: Xn+1 = max{Xn + Dn − Sn . from month to month.) 2. Dn − Sn = y − x) P (Sn = z.y := P (Xn+1 = y|Xn = x). Dn = z + y − x) P (Sn = z)P (Dn = z + y − x) FS (z)FD (z + y − x). 0}. intuitive. . y = 0. S2 . and Sn is the money he wants to spend. FS . Xn is the money (in pounds) at the beginning of the month. he can’t.y = P (Dn − Sn = y − x) = = z=0 ∞ z=0 ∞ P (Sn = z. Sn . How often is the mouse in room 1 ? Question 1 has an easy. 1. So if he wishes to spend more than what is available. . . D0 . Assume that Dn . Dn is the money he deposits. px. = z=0 ∞ = z=0 3 . . S0 .05 ≈ 20 minutes. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is deﬁned by: px. 2. How long does it take for the mouse. S1 . . answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell. to move from cell 1 to cell 2? 2. D2 . respectively.2 Bank account The amount of money in Mr Baxter’s bank account evolves.

2 ··· ··· ··· ··· · · · · · · ··· and is an inﬁnite matrix. if y = 0. How long will it take (if ever) for the bank account to empty? We shall develop methods for answering these and other questions. The transition probability matrix is y=1 p0.0 p0. Questions of interest: 1.3 Simple random walk (drunkard’s walk) Consider a completely drunk person who walks along a street. how fast? 2. If he starts from position 0.something that. 2. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).1 p0. in principle. and if so. he has no sense of direction. to go from the pub to home? You might object to the assumption that the person is totally drunk and has no sense of direction. how often will he be visiting 0? 2.0 p1.y .1 p2.2 P= p2.2 p1. 2. So he may move forwards with equal probability that he moves backwards. can be computed from FS and FD .4 Simple random walk (drunkard’s walk) in a city Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern: 4 . we have px. 1/2 1/2 1/2 1/2 1/2 −2 1/2 −1 1/2 0 1/2 1 1/2 2 1/2 3 Questions: 1. Of course. If the pub is located in position 0 and his home in position 100. on the average.0 p2.0 = 1 − ∞ px. how long will it take him. Will Mr Baxter’s account grow. Are there any places which he will visit inﬁnitely many times? 4.1 p1. Are there any places which he will never visit? 3. being drunk.

but they will become more precise in the sequel.1 D Thus.01 S 0.H . for now.8 0. Note that the diagram omits pH.H − pS. and pS. Also. pD. the company must have an idea on how long the clients will live. given that he is currently healthy. and let’s say he’s completely drunk. that since there are more degrees of freedom now.D = 1. west.D . But.S because pH. the reason being that the company does not believe that its clients are subject to resurrection.S − pH.S = 1 − pS.5 Actuarial chains A life insurance company wants to ﬁnd out how much money to charge its clients.H = 1 − pH. pD.S = 0.Salvador Dali (1922) The Drunkard Suppose that the drunkard moves from corner to corner. Clearly.3 that the person will become sick. So he can move east. there is clearly a higher chance that our man will be lost.3 H 0. the answer to this is crucial for the policy determination. observe.D . etc. mathematically imprecise. These things are. We may again want to ask similar questions. and pS. 5 . north or south. Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly.D is omitted. so we assign probability 1/4 for each move. therefore. there is probability pH. 2. It proposes the following model summarising the state of health of an individual on a monthly basis: 0.

. n + m] Xk = ik | Xn = in . m and any choice of states i0 . . . m ∈ T). . Xn+m ) and (X0 . . . . Proof. . n ∈ T}. . . Xn−1 ) given Xn . . X1 = i1 . Usually. 2. We focus almost exclusively to the ﬁrst choice. . n ∈ Z+ ) is Markov if and only if. which precisely means that (Xn+1 . We will also assume. for any n ∈ T. . m < n.} or the set Z of all integers (positive. X0 = i0 ) = P (Xn+1 = in+1 | Xn = in ). So we will omit referring to T explicitly. 2. (Xn . X0 = i0 ) = P (∀ k ∈ [n + 1. . n ∈ T} is Markov instead of saying that it has the Markov property. Lemma 1. viz. . . If (Xn ) is Markov then the claim holds by the conditional independence of Xn+1 and (X0 .2 Transition probability and initial distribution We therefore see that joint probabilities can be expressed as P (X0 = i0 . . we can see that by applying it several times we obtain P (∀ k ∈ [n + 1. m ∈ T) is independent of the past process (Xm . where T is a subset of the integers. X2 = i2 . and so future is independent of past given the present.1 The Markov property Deﬁnition of the Markov property Consider a (ﬁnite or inﬁnite) sequence of random variables {Xn . .} or N = {1. the Markov property. (1) 6 . . m > n. 3. . the future process (Xm . for all n ∈ Z+ and all i0 . 3. called the state space. . We say that this sequence has the Markov property if. 3. for any n. . . if the claim holds. . The elements of S are frequently called states. . P (Xn+1 = in+1 | Xn = in . T = Z+ = {0. . . . it is customary to call (Xn ) a Markov chain. conditionally on Xn . . Xn = in ) = P (X0 = i0 )P (X1 = i1 | X0 = i0 )P (X2 = i2 | X1 = i1 ) · · · P (Xn = in | Xn−1 = in−1 ). in+1 ∈ S. unless needed. in+m . Let us now write the Markov property in an equivalent way. . Since S is countable. This is true for any n and m. . n + m] Xk = ik | Xn = in ). almost exclusively (unless otherwise said explicitly). . . . . or negative). that the Xn take values in some countable set S. . .3 3. Xn−1 ) are independent conditionally on Xn . Sometimes we say that {Xn . zero. On the other hand. 1. .

by (1).j (n. we deﬁne.. viz.3 Time-homogeneous chains We will specialise to the case of time-homogeneous chains. . Of course.in (n − 1.j (n. requires some ordering on S) and whose entries are pi. by deﬁnition. 2) · · · pin−1 .i2 (1.i1 (0. n). by a result in Probability. a square matrix whose rows and columns are indexed by S (and this. something that can be called semigroup property or Chapman-Kolmogorov equation.j (m. pi. 7 . n + 1) := P (Xn+1 = j | Xn = i) . n) = P(m. but if S is an inﬁnite set. n) = P(m. n).i (n − 1. 1) pi1 .3 In this new notation. n)]i. 3. m + 2) · · · P(n − 1. If |S| = d then we have d × d matrices.j∈S . or as P(m. pi. n) = 1(i = j) and. then the matrices are inﬁnitely large. j ∈ S. n ≥ 0.j (m. pi. all probabilities of events associated with the Markov chain This causes some analytical diﬃculties. Xn = in ) = µ(i0 ) pi0 . i∈S i.which means that the two functions µ(i) := P (X0 = i) . The function pi. ℓ) P(ℓ. m + 1)P(m + 1.i1 (m.j (n. means that the transition probabilities pi. will specify all joint distributions2 : P (X0 = i0 . X2 = i2 . n). . Therefore. n + 1) do not depend on n . remembering a bit of Algebra. n) := [pi. But we will circumvent them by using Probability only. we can write (2) as P(m. m + 1) · · · pin−1 . . n) = i1 ∈S ··· in−1 ∈S pi.j (m. n) := P (Xn = j | Xm = i) is the transition probability from state i at time m to state j time n ≥ m. for each m ≤ n the matrix P(m. More generally. 2 3 And. X1 = i1 . pi. of course.j (n.j (m. n) if m ≤ ℓ ≤ n. The function µ(i) is called initial distribution. n + 1) is called transition probability from state i at time n to state j at time n + 1. which. . n). (2) which is reminiscent of matrix multiplication.

unless otherwise speciﬁed. n−m times for all m ≤ n. n) = P × P × · · · · · · × P = Pn−m . Then P(m.j∈S is called the (one-step) transition probability matrix or. In other words. A useful convention used for time-homogeneous chains is to write Pi (A) := P (A | X0 = i).j (n. and call pi. Similarly. let µn := [µn (i) = P (Xn = i)]i∈S be the distribution of Xn . For example. if Y is a real random variable. From now on.and therefore we can simply write pi.j ]i. of course. n + 1) =: pi. Then we have µ n = µ 0 Pn . For example. then µ = pδi1 + (1 − p)δi2 . Starting from a speciﬁc state i. 0. transition matrix. b ≥ 0. to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. (3) We call δi the unit mass at i. we write Ei Y := E(Y | X0 = i) = yP (Y = y | X0 = i) = yPi (Y = y). simply. The matrix notation is useful.j the (one-step) transition probability from state i to state j. and the semigroup property is now even more obvious because Pa+b = Pa Pb . if j = i otherwise. as follows from (1) and the deﬁnition of P. if µ assigns probability p to state i1 and 1 − p to state i2 . Notational conventions Unit mass. when we say “Markov chain” we will mean “timehomogeneous Markov chain”. arranged in a row vector. It corresponds. while P = [pi. Thus Pi is the law of the chain when the initial distribution is δi . y y 8 . δi (j) := 1. for all integers a.j . The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi .

because it is impossible for the number of molecules to remain constant at all times. To provide motivation and intuition.i+1 = P (Xn+1 = i + 1 | Xn = i) = pi. we indeed expect that we have N/2 molecules in each room. . Hence the uniform probability distribution is stationary. . at any point of time n. Example: the Ehrenfest chain. Imagine there are N molecules of gas (where N is a large number. selecting a vertex with probability 1/7. let us look at the following example: Example: Motion on a polygon. and vice versa. Then a hole is opened in the separator and we watch what happens as the molecules diﬀuse between the two rooms.d.). It is clear. We can model this by saying that the chance that a molecule move from room 1 to room 2 is proportional to the number of molecules in room 1. at time n + 1 it moves to one of the adjacent vertices with equal probability. . 1. n = m. The gas is placed partly in room 1 and partly in room 2.i.4 Stationarity We say that the process (Xn . But this will not work.e. If at time n the bug is on a vertex. then. It should be clear that if initially place the bug at random on one of the vertices. dividing it into two identical rooms. n ≥ 0) will not be stationary: indeed. Clearly.) is stationary if it has the same law as (Xn . Another guess then is: Place 9 . as it takes some time before the molecules diﬀuse from room to room. Let Xn be the number of molecules in room 1 at time n. i. Then pi. the law of a stationary process does not depend on the origin of time. . random variables provides an example of a stationary process. N If we start the chain with X0 = N (all molecules in room 1) then the process (Xn . then.i−1 = P (Xn+1 N −i N i = i − 1 | Xn = i) = . m + 1. that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. This is a model of gas distributed between two identical communicating chambers. the distribution of the bug will still be the same. In other words. . The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. We now investigate when a Markov chain is stationary. On the average. say N = 1025 ) in a metallic chamber with a separator in the middle. a sequence of i. Consider a canonical heptagon and let a bug move on its vertices. for any m ∈ Z+ . n = 0. for a while we will be able to tell how long ago the process started. .

(a binomial distribution).i + P (X0 = i + 1)pi+1. . . If we can do so. 2−N i−1 i+1 N N (The dots signify some missing algebra. 10 . Xm+2 = i2 . . . we are led to consider: The stationarity problem: Given [pij ]. In other words. ﬁnd those distributions π such that πP = π. P (Xn = i) = π(i) for all n. . stationary or invariant distribution. a Markov chain is stationary if and only if the distribution of Xn does not change with n. . Xn = in ) = P (Xm = i0 . If we do so.each molecule at random either in room 1 or in room 2. by (3). . for all m ≥ 0 and i0 . we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. (4) Such a π is called . Since. hence.i + π(i + 1)pi+1. n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . then we have created a stationary Markov chain. Changing notation. . . We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. For the if part. P (X1 = i) = j P (X0 = j)pj.) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. A Markov chain (Xn .i = N N −i+1 N i+1 2−N + = · · · = π(i). But we need to prove this. use (1) to see that P (X0 = i0 . Indeed. Proof. Xm+1 = i1 . The equations (4) are called balance equations. i 0 ≤ i ≤ N.i = P (X0 = i − 1)pi−1. Lemma 2. Xm+n = in ). then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . . . The only if part is obvious. X1 = i1 . . in ∈ S. which you should go through by yourselves.i = π(i − 1)pi−1. X2 = i2 .

in matrix notation. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. Example: waiting for a bus. 11 . Keep doing it continuously. and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. The times between successive buses are i.i = π(0)p(i) + π(i + 1). .i. write the equations (4): π(i) = π(j)pj. The state space is S = N = {1. . d}.i = p(i). . 2 . To ﬁnd the stationary distribution. Consider a deck of d = 52 cards and shuﬄe them in the following (stupid) way: pick two cards at random and swap their positions. which. j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). if i > 0 i ∈ N. then we run into trouble: indeed. the next one will arrive in i minutes with probability p(i). π(1) will be zero. i ≥ 0.j = 1. This means that the solution to the balance equations is identically zero. . each x ∈ Sd is of the form x = (x1 . we use the normalisation condition π(1) = i≥1 j≥i = 1. we must be careful. random variables. . Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. i∈S The matrix P has the eigenvalue 1 always because j∈S pi.}. Then (Xn . After bus arrives at a bus stop.. p1. Who tells us that the sum inside the parentheses is ﬁnite? If the sum is inﬁnite. . . It is easy to see that the stationary distribution is uniform: π(x) = 1 . π must be a probability: π(i) = 1. 2. Thus. Example: card shuﬄing. i = 1.d.i = 1. . i ≥ 1. . . . d! x ∈ Sd . where 1 is a column whose entries are all 1. where all the xi are distinct. . which gives p(j) . can be written as P1 = 1. But here. n ≥ 0) is a Markov chain with transition probabilities pi+1. In addition to that. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. which in turn means that the normalisation condition cannot be satisﬁed. xd ). and so each π(i) will be zero.Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. . i≥1 π(i) −1 To compute π(1).

applied to the interval [0.d. . ξi ∈ {0.. .). ξr+1 (n) = 1 − εr ). . εr ∈ {0.). Hence choosing the initial state at random in this manner results in a stationary Markov chain. i. . ξ(n + 1) = ϕ(ξ(n)). 2. for any n = 0.. with P (ξr (0) = 1) = P (ξr (0) = 0) = 1/2 then the same will be true for all n. . So our assumption really means: ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is ﬁnite. ξ2 . . . ξ2 (n). are i. 1 − ξ3 . . . 1} for all i. . with P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 then. from (5). . and any ε1 . .Let us work out the sum to see if this assumption can be expressed in more “physical” terms: p(j) = 1(j ≥ i)p(j) = jp(j). . If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. . The Markov chain is obtained by applying the mapping ϕ again and again. if ξ1 = 1 then shift to the left and ﬂip all bits to obtain (1 − ξ2 . . Remarks: This is not a Markov chain with countably many states.) Consider an inﬁnite binary vector ξ = (ξ1 . So. P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 The proof is by induction: if the ξ1 (n). If ξ1 = 1 then x ≥ 1/2. 1. . . . any r = 1. . ξ3 . . and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2 .e.) is the state at time n. But to even think about it will give you some strength.d. Here ξ(n) = (ξ1 (n). the same is true at n + 1. what we have really shown is that this assumption is a necessary and suﬃcient condition for the existence and uniqueness of a stationary distribution. . 1}. r = 1. 1] (think of [0. . . . We have thus deﬁned a mapping ϕ from binary vectors into binary vectors. We can show that if we start with the ξr (0). ξ2 (n) = 1 − ε1 . 2. As for the name of the chain (bread kneading) it can be justiﬁed as follows: to each ξ there corresponds a number x between 0 and 1 because ξ can be seen as the binary representation of the real number x. But the function f (x) = min(2x. Example: bread kneading. . . . . . . .* (This is slightly beyond the scope of these lectures and can be omitted. . ξ2 (n) = ε1 . . 2(1 − x)). Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. . .i. So let us do so. . (5) 12 . You can check that .). i≥1 j≥i j≥1 i≥1 j≥1 and the latter sum is the expected time between two successive bus arrivals. P (ξ1 (n + 1) = ε1 . ξ2 (n). 2. .i.. and the rule maps x into 2(1 − x). ξr (n + 1) = εr ) = P (ξ1 (n) = 0. i. So it goes beyond our theory. 1] as a ﬂexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle. . ξr+1 (n) = εr ) + P (ξ1 (n) = 1. . .

When a Markov chain is stationary we refer to it as being in 4.j (which equals 1): π(i) j∈S pi.j as a “current” ﬂowing from i to j. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A. as we will see in examples. we introduce more. π(j) = 1.j = j∈A π(j)pj. π(j)pj.1 Finding a stationary distribution As explained above. Writing the equations in this form is not always the best thing to do.i = j∈A π(j)pj.i + j∈Ac π(j)pj. Ac ) is the total current from A to Ac . If the latter condition holds. This is OK. So F (A. then the above are d + 1 equations. If the state space has d states.Note on terminology: steady-state. Proof.i . A). for all A ⊂ S. suppose that the balance equations hold. Conversely.j . expecting to be able to choose. We have: Proposition 1. then choose A = {i} and see that you obtain the balance equations. Ac ) = F (Ac . π(i) = j∈S (balance equations) (normalisation condition). The convenience is often suggested by the topological structure of the chain.i + j∈Ac π(j)pj.i 13 . j∈S i ∈ S. a stationary distribution π satisﬁes π = πP π1 = 1 In other words. therefore one of them is always redundant. Instead of eliminating equations. because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. Ac ) := π(i)pi.i Multiply π(i) by j∈S pi. i∈A j∈Ac Think of π(i)pi. But we only have d unknowns. Fix a set A ⊂ S and write π(i) = j∈S π(j)pj. a set of d linearly independent equations which can be more easily solved. amongst them. π is a stationary distribution if and only if π1 = 1 and F (A.

. p22 = 1 − β... .99 (as in the mouse example). here is an example: Example: Consider the 2-state Markov chain with p12 = α. A) Schematic representation of the ﬂow balance relations.j = j∈A π(j)pj.99 ≈ 0. The constant c is found from π(1) + π(2) = 1.05. β = 0. we see that the ﬁrst term on the left cancels with the ﬁrst on the right.8% of the time in room 2. A c F(A .04 π(2) = 0. The balance equations are: π(i) = pπ(i − 1) + qπ(i + 1). .j + j∈A j∈Ac π(i)pi.) Assume 0 < α. α+β π(2) = α . gives us ﬂexibility. while the remaining two terms give the equality we seek. (Consequently. Example: a walk with a barrier. The extra degree of freedom provided by the last result. in steady state.2% of the time in 1. One can ask how to choose a minimal complete set of linearly independent equations. If we take α = 0. and 4. π(1) = β . . we ﬁnd π(1) = 0. β < 1. A ) A c c F( A . α+β π(2) = cα. which immediately tells us that π(1) is proportional to β and π(2) proportional to α: π(1) = cβ. 3. Instead.05 ≈ 0.Now split the left sum also: π(i)pi. but we shall not do this here. p11 = 1 − α. p q Consider the following chain.04 room 1.952. 1. 2.048. p21 = β.i This is true for all i ∈ S. the mouse spends only 95. 14 i = 1. Thus. That is. Summing over i ∈ A. where p + q = 1: p p p 0 q 1 q 2 q 3 q 4 .i + j∈Ac π(j)pj. We have π(1)α = π(2)β.

. . 1. Since 0 < Pi (∃n ≥ 0 Xn = j) ≤ 15 Pi (Xn = j). . Does this provide a stationary distribution? Yes. This is i=0 possible if and only if p < q. i. In other words. S is a set of vertices. For i ≥ 1. the chain will visit j at some ﬁnite time. • Suppose that i j. and E a set of directed edges.2 The relation “leads to” We say that a state i leads to state j if.j > 0. we can solve them immediately. In graph-theoretic terminology. E) where S is the state space and E ⊂ S × S deﬁned by (i. in−1 . If i = j there is nothing to prove. 5. i − 1}.) 5 5. (iii) Pi (Xn = j) > 0 for some n ≥ 0. a pair (S. This is often cast in terms of a digraph (directed graph). starting from i. So assume i = j. . Proof.1 Topological structure The graph of a chain This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. If i ≥ 1 we have F (A. provided that ∞ π(i) = 1. (ii) pi. These are much simpler than the previous ones. In fact. i leads to j (written as i Notice that this relation is reflexive): ∀i ∈ S Furthermore: Theorem 1. π(i) = (p/q)i π(0).e. A). . Ac ) = π(i − 1)p = π(i)q = F (Ac .But we can make them simpler by choosing A = {0.i2 · · · pin−1 .e. (If p ≥ 1/2 there is no stationary distribution. . . We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows. n≥0 . .i1 pi1 . i.j > 0 for some n ∈ N and some states i1 . p < 1/2. j) ∈ E ⇐⇒ pi. The following are equivalent: (i) i j i i j) ⇐⇒ Pi (∃n ≥ 0 Xn = j) > 0.

Equivalence relation means that it is symmetric. The communicating class corresponding to the state i is. 16 . (6) j..i1 pi1 . It is reﬂexive and transitive because is so. Hence Pi (Xn = j) > 0. {2. by deﬁnition.in−1 pi. [i] = [j] if and only if i identical or completely disjoint. meaning that (ii) holds. Proof. Since Pi (Xn = j) > 0. the class {2. 5} is closed: if the chain goes in there then it never moves out of it. • Suppose now that (iii) holds.. The ﬁrst two classes diﬀer from the last two in character. reﬂexive and transitive. we have that the sum is also positive and so one of its terms is positive. this is symmetric (i Corollary 2. 5}. In addition. the sum contains a nonzero term. because Pi (Xn = j) ≤ Pi (∃m ≥ 0 Xm = j).the sum on the right must have a nonzero term. The class {4. (iii) holds. where the sum extends over all possible choices of states in S. it partitions S into equivalence classes known as communicating classes. The relation is transitive: If i j and j k then i k.i2 · · · pin−1 . the set of all states that communicate with i: [i] := {j ∈ S : j So.. {4. (i) holds.3 The relation “communicates with” Deﬁne next the relation i communicates with j (written as i Obviously. We just observed it is symmetric. j) ⇐⇒ i j and j i. We have Pi (Xn = j) = i1 . by deﬁnition. 3} is not closed. Just as any equivalence relation. we see that there are four communicating classes: {1}. is an equivalence relation. The relation j ⇐⇒ j i). Look at the last display again. • Suppose that (ii) holds. i}. However. 3}. {6}.. By assumption. and so (iii) holds.j . Two communicating classes are either 1 2 3 4 5 6 In the example of the ﬁgure. 5. Corollary 1. In other words.

parts. 17 . and so on. Suppose that i j. its transition matrix P) is called irreducible. There can be arrows between two nonclosed communicating classes but they are always in the same direction. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. starting from any state. more precisely. we have i1 ∈ C. pi. C1 C2 C3 C4 C5 The general structure of a Markov chain is indicated in this ﬁgure. in−1 such that pi. To put it otherwise. .i = 1 then we say that i is an absorbing state.i1 pi1 .i1 > 0.j > 0. and so i2 ∈ C. Similarly. we conclude that j ∈ C. The classes C4 . Suppose ﬁrst that C is a closed communicating class.i2 · · · pin−1 .j = 1. more manageable. In other words still. . If all states communicate with all others. The converse is left as an exercise. The communicating classes C1 . we say that a set of states C ⊂ S is closed if pi. we can reach any other state. Thus. Lemma 3. . and C is closed. C2 . the state space S is a closed communicating class.More generally. then the chain (or.i2 > 0. pi1 . A communicating class C is closed if and only if the following holds: If i ∈ C and if i j then j ∈ C. C5 in the ﬁgure above are closed communicating classes. Otherwise. C3 are not closed. Note that there can be no arrows between closed communicating classes. Note: If we consider a Markov chain with transition matrix P and ﬁx n ∈ N then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. the state is inessential. Closed communicating classes are particularly important because they decompose the chain into smaller. Proof. the single-element set {i} is a closed communicating class. . In graph terms. If a state i is such that pi. Since i ∈ C. We then can ﬁnd states i1 . Why? A state i is essential if it belongs to a closed communicating class. j∈C for all i ∈ C. Note: An inessential state i is such that there exists a state j such that i j but j i. A general Markov chain can have an inﬁnite number of communicating classes.

j ∈ S. pii (ℓn) ≥ pii (n) ℓ . 4} at the next step and vice versa. (m) (n) for all m. i. So. Theorem 2 (the period is a class property). So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2 . If i j then d(i) = d(j). that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1. Dj = {n ∈ (n) (α) N : pjj > 0}. Consider two distinct essential states i. . (Can you ﬁnd an example of a chain with period 3?) We will now give the deﬁnition of the period of a state of an arbitrary chain. X2 ∈ C1 . more generally. d(j) = gcd Dj . A state i is called aperiodic if d(i) = 1. Another way to say this is: (n) (m) If the integer n divides m (denote this by: n | m) and if pii > 0 then pii > 0. (n) Proof. d(i) = gcd Di . The chain alternates between the two sets. The period of an essential state is deﬁned as the greatest common divisor of all natural numbers n with such that it is possible to return to i in n steps: d(i) := gcd{n ∈ N : Pi (Xn = i) > 0} . n ≥ 0. ℓ ∈ N. j. 3} at some point of time then. . Since Pm+n = Pm Pn . .e. . Let us denote by pij the entries of Pn . X4 ∈ C1 . i. whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. we have So pii (2n) (n) pij ≥ pii (n) 2 (m+n) ≥ pik pkj . If i j there is α ∈ N with pij > 0 and if 18 . Therefore if pii > 0 then pii (n) (ℓn) > 0 for all ℓ ∈ N. Notice however. The period of an inessential state is not deﬁned. Let Di = {n ∈ N : pii > 0}. X3 ∈ C2 .5. and. all states form a single closed communicating class.4 Period Consider the following chain: 1 2 4 3 The chain is irreducible. We say that the chain has period 2. it will move to the set C2 = {2.

if an irreducible chain has period d then we can decompose the state space into d sets C0 . If a number divides two other numbers then it also divides their diﬀerence. n1 ∈ Di such that n2 − n1 = 1. . showing that α + n + β ∈ Dj . Since d(i) is the greatest common divisor we have d(i) ≥ d(j) (in fact. So d(i) = d(j). . chains where. i. n2 ∈ Di . More generally. Pick n2 . (Here Cd := C0 . we obtain d(i) ≤ d(j) as well. d − 1. . Because n is suﬃciently large. . we have n ∈ Di as well. 1. the chain is called aperiodic. Divide n by n1 to obtain n = qn1 + r. Cd−1 such that the chain moves cyclically between them. where the remainder r ≤ n1 − 1. C1 . Formally. .j i there is β ∈ N with pji > 0. Now let n ∈ Di . Theorem 4. (n) (n) (α+n+β) (α) (n) (β) (β) (α+β) (α) (β) This ﬁgure shows the internal structure of a closed communicating class with period d = 4. . In particular. Corollary 3. . Therefore d(j) | α + β + n for all n ∈ Di . all states communicate with one another.) 19 . C1 . Consider an irreducible chain with period d. j∈Cr+1 i ∈ Cr . Arguing symmetrically. Cd−1 such that pij = 1. So pjj ≥ pij pji > 0. this is the content of the following theorem. are irreducible chains. j ∈ S. r = 0. . An irreducible chain has period d if one (and hence all) of the states have period d. An irreducible chain with ﬁnitely many states is aperiodic if and only if there (n) exists an n such that pij > 0 for all i. Therefore n = qn1 + r(n2 − n1 ) = (q − r)n1 + rn2 . . So d(j) is a divisor of all elements of Di . . showing that α + β ∈ Dj .e. From the last two displays we get d(j) | n for all n ∈ Di . Then pii > 0 and so pjj ≥ pij pii pji > 0. . Of particular interest. Then we can uniquely partition the state space S into d disjoint subsets C0 . d(j) | d(i)). as deﬁned earlier. q − r > 0. Let n be suﬃciently large. Since n1 . if d = 1. Therefore d(j) | α + β. If i is an aperiodic state then there exists n0 such that pii > 0 for all n ≥ n0 . Proof. . Theorem 3.

The reader is warned to be alert as to which of the two variables we are considering at each time.. then. Pick a state i0 and let C0 be its equivalence class (i. We avoid excessive notation by using the same letter for both. .. id−1 d 6 Hitting times and ﬁrst-step analysis Consider a Markov chain with transition probability matrix P. consider the chain 1 1/3 1 0 4 1/2 1 1/6 2 We shall later consider the time inf{n ≥ 1 : Xn ∈ A} which diﬀers from TA simply by considering n ≥ 1 instead of n ≥ 0. Continue in this manner and deﬁne states i0 . Suppose pij > 0 but j ∈ C1 but. j ∈ C2 .e. i.Proof. We are interested in deriving formulae for the probability that this time is ﬁnite as well as the expectation of this time.. i2 i0 i0 in a number of steps which is an integer multiple of d − 1 (why?). Consider the path i0 → i → j → i2 → i3 → · · · → id → i0 . and by the assumptions that d d j. after it takes one step from its current position. We deﬁne the hitting time4 of a set of states A by TA := inf{n ≥ 0 : Xn ∈ A}. Such a path is possible because of the choice of the i0 . C1 . Hence it partitions the state space S into d disjoint equivalence classes. . Take. id ∈ C0 . i1 . . 20 . ... which contradicts the deﬁnition of d.. We show that these classes satisfy what we need. Assume d > 1. i ∈ C0 . As an example.e the set of states j such that i j). Let C1 be the equivalence class of i1 . j belongs to the next class. We now show that if i belongs to one of these classes and if pij > 0 then. Notice that this is an equivalence relation. It is easy to see that if we continue and pick a further state id with pid−1 . The existence of such a path implies that it is possible to go from i0 to i. necessarily. say. We use a method that is based upon considering what the Markov chain does at time 1. Then pick i1 such that pi0 i1 > 0. Deﬁne the relation i d j ⇐⇒ pij (nd) > 0 for some n ≥ 0. i1 . for example. that is why we call it “first-step analysis”. with corresponding classes C0 . necessarily. If X0 ∈ A the two times coincide. .id > 0. Cd−1 ..

X2 . So TA = 1 + TA . ′ Proof. But the Markov chain is homogeneous. . if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S. and deﬁne ϕ(i) := Pi (TA < ∞). X0 = i) = P (TA < ∞|X1 = j). We then have: Theorem 5. ′ is the remaining time until A is hit). Furthermore. we have Pi (TA < ∞) = ϕ(j)pij . let ϕ′ (i) be another solution. the event TA is independent of X0 . By self-feeding this equation. We ﬁrst have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). ϕ(i) = j∈S i∈A pij ϕ(j). conditionally on X1 . If i ∈ A then TA = 0. . Combining the above. But what exactly is this probability equal to? To answer this in its generality. Therefore. which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j).). a ′ function of X1 . For the second part of the proof. ﬁx a set of states A. . j∈S as needed. and so ϕ(i) = 1. we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A pjk ϕ′ (k) k∈A 21 . then TA ≥ 1. Hence: ′ ′ P (TA < ∞|X1 = j. If i ∈ A. because the chain may end up in state 2. The function ϕ(i) satisﬁes ϕ(i) = 1. i ∈ A.It is clear that P1 (T0 < ∞) < 1.e.

the second equals Pi (TA = 2). If ′ i ∈ A. if ψ ′ (i) is any other solution of these equations then ψ ′ (i) ≥ ψ(i) for all i ∈ S. which is the second of the equations. On the other hand. 2}. If i ∈ A then. The function ψ(i) satisﬁes ψ(i) = 0. 3 2 so ϕ(1) = 3/4. we have 1 1 ϕ(1) = ϕ(1) + ϕ(0). Furthermore. We let ϕ(i) = Pi (T0 < ∞). 2. let A = {0.We recognise that the ﬁrst term equals Pi (TA = 1). We immediately have ϕ(0) = 1. ψ(i) = 1 + i∈A pij ψ(j). Example: Continuing the previous example. as above. By omitting the last term we obtain ϕ′ (i) ≥ Pi (TA ≤ 2). ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Next consider the mean time (mean number of steps) until the set A is hit for the ﬁrst time.) As for ϕ(1). Start the chain from i. Clearly. The second part of the proof is omitted. Proof. if the chain starts from 2 it will never hit 0. because it takes no time to hit A if the chain starts from A. j∈A i ∈ A. i = 0. so the ﬁrst two terms together equal Pi (TA ≤ 2). obviously. ψ(0) = ψ(2) = 0. 1 ψ(1) = 1 + ψ(1). We have: Theorem 6. we choose A = {0}. and ϕ(2) = 0. TA = 0 and so ψ(i) := Ei TA = 0. Example: In the example of the previous ﬁgure. Therefore. Letting n → ∞. By continuing self-feeding the above equation n times. then TA = 1 + TA . we obtain ϕ′ (i) ≥ Pi (TA ≤ n). we obtain ϕ′ (i) ≥ Pi (TA < ∞) = ϕ(i). ′ E i TA = 1 + E i TA = 1 + j∈S pij Ej TA = 1 + j∈A pij ψ(j). 3 22 . (If the chain starts from 0 it takes no time to hit 0. and let ψ(i) = Ei TA . the set that contains only state 0. ψ(i) := Ei TA . 1.

i. . This happens up to the ﬁrst hitting time TA ﬁrst hitting time of the set A. because the underlying experiment is the ﬂipping of a coin with “success” probability 2/3. Hence. otherwise. It should be clear that the equations satisﬁed are: ϕAB (i) = 1. The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ).e. . (This should have been obvious from the start. after one step. . if x ∈ A. h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ). ϕAB (i) = 0. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the ﬁrst-step analysis method. until set A is reached. therefore the mean number of ﬂips until the ﬁrst success is 3/2. ϕAB (i) = j∈A i∈B pij ϕAB (j). i∈A Furthermore. where.. We may ask to ﬁnd the probabilities ϕAB (i) = Pi (TA < TB ). B. ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). Clearly. ′ TA = 1 + TA . ϕAB is the minimal solution to these equations. X2 . 23 . Then ′ 1+TA x ∈ A. in the last sum. as argued earlier. Now the last sum is a function of the future after time 1. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). we just changed index from n to n + 1. of the random variables X1 . by the Markov property. TB for two disjoint sets of states A. h(x) = f (x).which gives ψ(1) = 3/2.) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . ′ where TA is the remaining time. Next. consider the following situation: Every time the state is x a reward f (x) is earned.

(Sooner or later. k ∈ N) form the strategy of the gambler. so 24 . he collects i pounds. y∈S h(x) = f (x) + x ∈ A. So. Example: A thief enters sneaks in the following house and moves around the rooms. 2. your fortune after the n-th toss is n Xn = X0 + k=1 Y k ξk . If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails). The successive stakes (Yk . we have h(4) = 0 1 h(1) = 1 + h(2) 2 1 h(3) = 3 + h(2) 2 1 1 h(2) = 2 + h(1) + h(3). the set of equations satisﬁed by h. unless he is in room 4. Let p the probability of heads and q = 1 − p that of tails. in which case he dies immediately. if he starts from room 2. until he dies in room 4. no gambler is assumed to have any information about the outcome of the upcoming toss.) 7 Gambler’s ruin Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. collecting “rewards”: when in room i. he will die. x∈A pxy h(y). (Also. 3. Clearly. You can remember the latter ones by thinking that the expectecd total reward equals the immediate reward f (x) plus the expected remaining reward. h(3) = 7. If heads come up then you win £1 per pound of stake. for i = 1. 1 2 4 3 The question is to compute the average total reward. are h(x) = 0. 2 2 We solve and ﬁnd h(2) = 8. if just before the k-th toss you decide to bet Yk pounds then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared.Thus. why?) If we let h(i) be the average total reward if he starts from room i. h(1) = 5.

if the gambler wants to have a strategy at all. . We obviously have ϕ(0) = 0. a + 1. she must base it on whatever outcomes have appeared thus far. in simplest case Yk = 1 for all k. a ≤ x ≤ b. Let £x be the initial fortune of the gambler. pi. Ex ) for the probability (respectively. . if p = q Px (Tb < Ta ) = . P (tails) = 1 − p.i+1 = p. ξk−1 and–as is often the case in practice–on other considerations of e. . betting £1 at each integer time against a coin which has P (heads) = p. if p = q = 1/2 b−a Proof. a < i < b. Consider a gambler playing a simple coin tossing game. We wish to consider the problem of ﬁnding the probability that the gambler will never lose her money. Theorem 7. as described above. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and. 25 ϕ(ℓ) = 1. astrological nature. To solve the problem. It can only depend on ξ1 . in the ﬁrst case. Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. 0 ≤ x ≤ ℓ. . a + 2. x − a . the company has control over the probability p. The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). b are absorbing. Then the probability that the gambler’s fortune will reach level b > x before reaching level a < x equals: x−a − 1 (q/p) (q/p)b−a − 1 .i−1 = q = 1 − p. . We ﬁrst solve the problem when a = 0. In other words. (Think why!) To simplify things. . expectation) under the initial condition X0 = x. ℓ).g. . The general solution will then obtained from Px (Tb < Ta ) = ϕ(x − a. Let us consider a concrete problem. is our next goal. ℓ) := Px (Tℓ < T0 ). Yk cannot depend on ξk . b − 1. . in addition. We use the notation Px (respectively. write ϕ(x) instead of ϕ(x. . We are clearly dealing with a Markov chain in the state space {a. The diﬀerence between the gambler’s ruin problem for a company and that for a mere mortal is that. and where the states a. b} with transition probabilities pi. b − a). She will stop playing if her money fall below a (a < x). it will make enormous proﬁts). b = ℓ > 0 and let ϕ(x.

if p > 1/2 if p ≤ 1/2. we ﬁnd δ(0) = 1/ℓ and so x ϕ(x) = . If p > 1/2. 0 ≤ x ≤ ℓ. λ = 1. if λ = 1 if λ = 1. ℓ) = λx −1 . then λ = q/p > 1. ℓ→∞ ℓ→∞ (q/p)x . ϕ(x) = pϕ(x + 1) + qϕ(x − 1). if p = q. then λ = q/p < 1. we ﬁnd ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λx−1 + · · · + λ + 1]δ(0) = Since ϕ(ℓ) = 1. Corollary 4. Since p + q = 1. λ = 1. δ(x − 2) = · · · = q p x δ(0). ℓ Summarising. λℓ −1 x ℓ.and. Let δ(x) := ϕ(x + 1) − ϕ(x). we have λℓ −1 λ−1 δ(0) = 1 which gives δ(0) = (λ − 1)/(λℓ − 1). 1. λx − 1 δ(0). The ruin probability of a gambler starting with £x is Px (T0 < ∞) = Proof. ℓ→∞ ℓ Px (T0 < ∞) = 1 − lim Finally. We have δ(x) = q δ(x − 1) = p q p 2 0 < x < ℓ. by ﬁrst-step analysis. ℓ→∞ λℓ − 1 x = 1 − 0 = 1. Since ϕ(ℓ) = 1. and so Px (T0 < ∞) = 1 − lim If p < 1/2. We have Px (T0 < ∞) = lim Px (T0 < Tℓ ) = 1 − lim Px (Tℓ < T0 ). λℓ − 1 0 ≤ x ≤ ℓ. Assuming that λ := q/p = 1. Px (T0 < ∞) = 1 − lim 26 . and so λx − 1 λx − 1 =1− = λx . This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). We ﬁnally ﬁnd λx − 1 . The general case is obtained by replacing ℓ by b − a and x by x − a. λ−1 ϕ(x) = If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). ℓ→∞ λℓ − 1 0−1 λx − 1 = 1 − 0 = 1. we have obtained ϕ(x. we can write this as p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].

n ≥ 0) is a time-homogeneous Markov chain. Xn+2 = j2 . the ﬁrst hitting time of the set A. τ < ∞) = P (Xn+1 = j1 . Suppose (Xn . τ = n) (c) = P (Xn+1 = j1 . considered earlier. . . An example of a stopping time is the time TA . τ < ∞)P (A | Xτ .8 Stopping times and the strong Markov property A random time is a random variable taking values in the time-set. a stopping time τ is a random variable with values in Z+ ∪{+∞} such that 1(τ = n) is a deterministic function of (X0 . .j2 · · · We have: P (Xτ +1 = j1 . . . A. (c) is due to the fact that A ∩ {τ = n} is determined by (X0 . . . . τ < ∞). . Our assertions are now easily obtained from this by ﬁrst summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P (Xτ = i. . (d) where (a) is just logic. Xn ) for all n ∈ Z+ . any such A must have the property that. . Xτ +2 = j2 .j2 · · · P (Xn = i. . . A ∩ {τ = n} is determined by (X0 . P (Xτ +1 = j1 . . τ = n) (a) (b) = P (Xτ +1 = j1 . . A | Xτ . conditional on τ < ∞ and Xτ = i. Xn ) if we know the present Xn . Let τ be a stopping time. . . . for any n. the future process (Xn .j1 pj1 . τ = n). We are going to show that for any such A. for all n ∈ Z+ . | Xn = i)P (Xn = i. . Theorem 8 (strong Markov property). Xn = i. . τ = n)P (Xn = i. the future process (Xτ +n . A. the value +∞. Xτ ) if τ < ∞. Proof. . . possibly. . . τ < ∞) and that P (Xτ +1 = j1 . . . . .) is independent of the past process (X0 .j1 pj1 . An example of a random time which is not a stopping time is the last visit of a speciﬁc set. τ = n) = P (Xn+1 = j1 . . A. Xτ +2 = j2 . . 27 . Recall that the Markov property states that. . . τ < ∞) = pi. Xn+1 . . . . . . Let A be any event determined by the past (X0 . . (b) is by the deﬁnition of conditional probability. τ = n) = pi. . A. . We will now prove that this remains true if we replace the deterministic n by a random stopping time τ . Since (X0 . . . Let τ be a stopping time. Xτ = i. This stronger property is referred to as the strong Markov property. Xn )1(τ = n). . In particular. | Xτ . A. . A. Xn ). Xn ) and the ordinary splitting property at n. and (d) is due to the fact that (Xn ) is time-homogeneous Markov chain with transition probabilities pij . Then. . Xτ ) and has the same law as the original process started from i.Xτ +2 = j2 . Xτ +2 = j2 . | Xτ = i. . . . . . Xn+2 = j2 . Xn+2 = j2 . Xτ )1(τ < ∞) = n∈Z+ (X0 . . | Xn = i. . . . . . . including. . n ≥ 0) is independent of the past process (X0 . .

Formally. So Xi is the r-th excursion: the chain visits 28 . Xi (0) . in a generalised sense.i. as “random variables”. If X0 = i then the two deﬁnitions coincide. Schematically. random trajectories. := Xn . They are. State i may or may not be visited again. . .i. The random variables Xn comprising a general Markov chain are not independent. Xi (2) . r = 2. we will show that we can split the sequence into i. . “chunks”. 0 ≤ n < Ti . we recursively deﬁne Ti (r) (2) := inf{n > Ti (r−1) : Xn = i}. The Ti of Section 6 was allowed to take the value 0. These excursions are independent! 11111111111 00000000000 1111111111 0000000000 11111111111 00000000000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 1111111111111111 0000000000000000 11111 11111111111111 00000000000000 1111111111 0000000000 i 00000 11111111111111111111111111111111111111111111 00000000000000000000000000000000000000000000 1111111111111111 0000000000000000 11111 00000 11111 00000 Ti = Ti (1) Ti (2) Ti (3) Ti (4) Fix a state i ∈ S and let Ti be the ﬁrst time that the chain will visit this state: Ti := inf{n ≥ 1 : Xn = i}. All these random times are stopping times. Ti Xi (0) (r) ≤ n < Ti (r+1) .. .d. We (r) refer to them as excursions of the process. Our Ti here is always positive. .9 Regenerative structure of a Markov chain A sequence of i.) We let Ti be the time of the third visit. (1) r = 1. by using the strong Markov property. Xi (1) . 2. (Why?) Consider the “trajectory” of the chain between two successive visits to the state i: Xi Let also We think of (r) := Xn . random variables is a (very) particular example of a Markov chain. in fact. . 3. However. Regardless. and so on. .. we let Ti be the time of the second (1) (3) visit. The idea is that if we can ﬁnd a state i that is visited again and again. (And we set Ti := Ti . then the behaviour of the chain between successive visits cannot be inﬂuenced by the past or the future (and that is due to the strong Markov property). Note the subtle diﬀerence between this Ti and the Ti of Section 6.d.

starting from i. g(Xi (2) ). Apply Strong Markov Property (SMP) at Ti : SMP says that.). ad inﬁnitum. . except. We abstract this trivial observation and give a couple of deﬁnitions: First..d. Second. equal to i. g(Xi (1) ). starting from i. but there are always states which will be visited again and again. Example: Deﬁne Ti (r+1) (0) ). are independent random variables. In particular. . Xi distribution. the future is independent of the (1) (2) past. But (r) the state at Ti is. A trivial. by deﬁnition. all. Λ2 . . The last corollary ensures us that Λ1 . and “goes for an excursion”. (Numerical) functions g(Xi independent random variables. they are also identical in law (i. An immediate consequence of the strong Markov property is: Theorem 9. . Moreover. possibly. for time-homogeneous Markov chains. if it starts from state i at some point of time. The excursions Xi (0) . 29 . The claim that Xi . the chain returns to i only ﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 0. we say that a state i is recurrent if. given the state at (r) (r) (r) the stopping time Ti . Therefore. Therefore. (0) are independent random variables. and this proves the ﬁrst claim. but important corollary of the observation of this section is that: Corollary 5. Xi (1) . at least one state must be visited inﬁnitely many times. . the chain returns to i inﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 1. 10 Recurrence and transience When a Markov chain has ﬁnitely many states. it will behave in the same way as if it starts from the same state at another point of time. Xi have the same distribution follows from the fact that the Markov chain is time-homogeneous. Perhaps some states (inessential ones) will be visited ﬁnitely many times. of the excursions are Λr := (r) n=Ti 1(Xn = j) The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i.state i for the r-th time. .. we say that a state i is transient if. .i. the future after Ti is independent of the past before Ti . have the same Proof. Xi (2) . (r) . we “watch” it up until the next time it returns to state i.

Important note: Notice that the two statements above are not negations of one another. the event {Ji ≥ k} implies that Ti < ∞ and that Ji ≥ k − 1. . Therefore. The following are equivalent: 1. k ≥ 0. every state is either recurrent or transient. This means that the state i is recurrent. State i is recurrent. This means that the state i is transient. (k−1) Notice that fii = Pi (Ti < ∞) can either be equal to 1 or smaller than 1. Because either fii = 1 or fii < 1. i. 30 . We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas. Notice also that fii := Pi (Ji ≥ 1) = Pi (Ti < ∞). If fii = 1 we have that Pi (Ji ≥ k) = 1 for all k. therefore concluding that every state is either recurrent or transient. We will show that such a possibility does not exist. But the past (k−1) before Ti is independent of the future. and that is why we get the product. Indeed. If fii < 1. Pi (Ji = ∞) = 1. We claim that: Lemma 4. Suppose that the Markov chain starts from state i. then Pi (Ji < ∞) = 1. 3. Lemma 5. namely. 2 k Pi (Ji ≥ k) = fii Pi (Ji ≥ k − 1) = fii Pi (Ji ≥ k − 2) = · · · = fii . 4. 2. Starting from X0 = i. Pi (Ji = ∞) = 1. Consider the total number of visits to state i (excluding the possibility that X0 = i): ∞ Ji = n=1 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞} .e. we have Proof. there could be yet another possibility: 0 < Pi (Xn = i for inﬁnitely many n) < 1. in principle. Indeed. By the strong Markov property at time Ti (k−1) Pi (Ji ≥ k) = Pi (Ji ≥ k − 1)Pi (Ji ≥ 1). Notice that: state i is visited inﬁnitely many times ⇐⇒ Ji = ∞. we conclude what we asserted earlier. Ei Ji = ∞. fii = Pi (Ti < ∞) = 1. the random variable Ji is geometric: k Pi (Ji ≥ k) = fii .

clearly. as n → ∞. Clearly. but the two sequences are related in an exact way: Lemma 7. as n → ∞. If i is a transient state then pii → 0. Notice the diﬀerence with pij . (n) But if a series is ﬁnite then the summands go to 0. the probability to hit state j for the ﬁrst time at time n ≥ 1. since Ji is a geometric random variable. 4. If Ei Ji = ∞. we see that its expectation cannot be inﬁnity without Ji itself being inﬁnity with probability one. (n) i. State i is transient. If i is transient then Ei Ji < ∞.e. is equivalent to Pi (Ji < ∞) = 1. then. Xn = j) Pi (Tj = m)Pi (Xn = j | Tj = m) 31 = m=1 . Proof. The equivalence of the ﬁrst two statements was proved above. by deﬁnition of Ji is equivalent to Pi (Ji = ∞) = 1. by the deﬁnition of Ji . The equivalence of the ﬁrst two statements was proved before the lemma. From Elementary Probability. Pi (Ji < ∞) = 1.1 Deﬁne First hitting time decomposition fij := Pi (Tj = n).Proof. fii = Pi (Ti < ∞) < 1. assuming that the chain (n) starts from state i. (n) pij n (m) (n−m) (n) (n) = m=1 fij pjj . State i is transient if. this implies that Ei Ji < ∞. then. This. Proof. 3. 10. Lemma 6. by deﬁnition. it is visited ﬁnitely many times with probability 1. Corollary 6. owing to the fact that Ji is a geometric random variable. therefore Pi (Ji < ∞) = 1. fij ≤ pij . The following are equivalent: 1. n ≥ 1. 2. (n) pij n = Pi (Xn = j) = m=1 n Pi (Tj = m. by deﬁnition. if we assume Ei Ji < ∞.e. Hence Ei Ji = ∞ also. i. pii → 0. The latter is the probability to be in state j at time n (not necessarily for the ﬁrst time) starting from i. Proof. Ji cannot take value +∞ with positive probability. State i is recurrent if. which. and. it is visited inﬁnitely many times. On the other hand. Ei Ji < ∞. But ∞ ∞ ∞ (n) Ei Ji = Ei n=1 1(Xn = i) = n=1 Pi (Xn = i) = n=1 (n) pii .

d. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). In other words. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. for all states i.r := 1 j appears in excursion Xi (r) . . We now show that recurrence is a class property. 10. . Indeed.) Theorem 10. we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj . In a communicating class. and hence pij as n → ∞. Corollary 7. not only fij > 0. we have i j ⇐⇒ fij > 0. Now. If j is transient. Xi . which is ﬁnite. . ∞ Proof. 1. . inﬁnitely many of them will be 1 (with probability 1). . and since they take the value 1 with positive probability. excursions Xi . If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. r = 0. . 32 . In particular.. if we let fij := Pi (Tj < ∞) . Proof. by deﬁnition. as n → ∞. Corollary 8. by summing over n the relation of Lemma 7.i. Start with X0 = i. are i.i. and so the probability that j will appear in a speciﬁc excursion is positive. but also fij = 1. we have n=1 pjj = Ej Jj < ∞. For the second term. starting from i. ( Remember: another property that was a class property was the property that the state have a certain period. the probability to go to j in some ﬁnite time is positive. j i. The last statement is simply an expression in symbols of what we said above.The ﬁrst term equals fij . Hence.2 Communicating classes and recurrence/transience Recall the deﬁnition of the relation “state i leads to state j”. either all states are recurrent or all states are transient. j i. Hence.d. denoted by i j: it means that. So the random variables δj. showing that j will appear in inﬁnitely many of the excursions for sure. If j is a transient state then pij → 0. and gij := Pi (Tj < Ti ) > 0. the probability that j will appear in a speciﬁc excursion equals gij = Pi (Tj < Ti ).

We now take a closer look at recurrence. Hence. uniformly distributed in the unit interval [0. If i is inessential then it is transient. . appear in a certain game. . U2 . .i. By ﬁniteness. Hence all states are recurrent. . Theorem 11.Proof. and hence. and so i is a transient state. . Example: Let U0 . . be i. Proof. Theorem 12. Recall that a ﬁnite random variable may not necessarily have ﬁnite expectation. It should be clear that this can. Suppose that i is inessential. . Let T := inf{n ≥ 1 : Un > U0 }. n+1 33 = 0 . 1]. But then. but Pi (A ∩ B) = Pi (Xm = j. This state is recurrent. every other j is also recurrent. U1 . . Hence 11 Positive recurrence Recall that if we ﬁx a state i and let Ti be the ﬁrst return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state).d. Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x. Indeed. . random variables. Let C be the class. If a communicating class contains a state which is transient then. Thus T represents the number of variables we need to draw to get a value larger than the initial one. P (B) = Pi (Xn = i for inﬁnitely many n) < 1. observe that T is ﬁnite. Pi (B) = 0. some state i must be visited inﬁnitely many times. j but j i. What is the expectation of T ? First. by the above theorem. for example. Un ≤ x | U0 = x)dx xn dx = 1 . necessarily. By closedness. i. if started in C. 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ). all other states are also transient. Proof. X remains in C. Every ﬁnite closed communicating class is recurrent. P (T > n) = P (U1 ≤ U0 . i. . there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0. So if a communicating class is recurrent then it contains no inessential states and so it is closed.e. Xn = i for inﬁnitely many n) = 0. .e. If i is recurrent then every other j in the same communicating class satisﬁes i j.

34 . 13 Law of Large Numbers for Markov chains The law of large numbers stated above requires independence. be i. Then P Z1 + · · · + Zn =∞ n→∞ n lim = 1. . n+1 You are invited to think about the meaning of this: If the initial value U0 is unknown.Therefore. Before doing that. Z2 . . . Then P Z1 + · · · + Zn =µ n→∞ n lim = 1.i. To our rescue comes the regenerative structure of a Markov chain identiﬁed at an earlier section. random variables. But this is misleading. arguably. It is not a Law. It is traditional to refer to it as “Law”. 0) > −∞. . X2 . X1 . . P (T < ∞) = lim P (T ≤ n) = lim (1 − n→∞ ∞ n→∞ 1 ) = 1. (i) Suppose E|Z1 | < ∞ and let µ = EZ1 . then it takes. (ii) Suppose E max(Z1 . . the sequence X0 . we shall discover what the form of this stationary distribution is by using the Law of Large Numbers 12 Law (=Theorem) of Large Numbers in Probability Theory The following result is. Theorem 13. the Fundamental Theorem of Probability Theory. 0) = ∞. To apply the law of large numbers we must create some kind of independence. Let Z1 . but E min(Z1 . of successive states is not an independent sequence. When we have a Markov chain. n+1 But ET = ∞ P (T > n) = n=1 n=1 1 = ∞.d. Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. an inﬁnite number of steps to exceed it! We now classify a recurrent state i as follows: We say that i is positive recurrent if Ei Ti < ∞. We say that i is null recurrent if Ei Ti = ∞. it is a Theorem that can be derived from the axioms of Probability Theory. on the average. We state it here without proof.

. which. G(Xa(3) ). G(Xa(r) ) = (r) (r+1) Ta ≤t<Ta f (Xt ). In particular. Since functions of independent random variables are also independent.13. .2 Average reward Consider now a function f : S → R. . Then. with probability 1. we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit. n as n → ∞. . . the (0) initial excursion Xa also has the same law as the rest of them. We are interested in the existence of the limit time-average reward = lim 1 t→∞ t t f (Xk ). in this example. G(Xa(2) ). assign a reward f (x).d. Xa . . for an excursion X . . Xa(3) . because the excursions Xa . . for reasonable choices of the functions G taking values in R. forms an independent sequence of (generalised) random variables. we can think of as a reward received when the chain is in state x. Example 1: To each state x ∈ S. are i.i. Note that EG(Xa(1) ) = EG(Xa(2) ) = · · · = E G(Xa(0) ) | X0 = a = Ea G(Xa(0) ). deﬁne G(X ) to be the maximum reward received during this excursion. random variables. but deﬁne G(X ) to be the total reward received over the excursion X . the law of large numbers tells us that G(Xa ) + · · · + G(Xa ) → EG(Xa(1) ). Thus.d. . (1) (2) (1) (n) (8) 13. are i. as in Example 2. k=0 which represents the ‘time-average reward’ received. and because if we start with X0 = a. Example 2: Let f (x) be as in example 1. given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been ﬁxed once and for all) the sequence of excursions Xa(1) . 35 .i. The ﬁnal equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a.. Xa(2) . (7) Whichever the choice of G. we have that G(Xa(1) ). (r) (r+1) G(Xa(r) ) = max{f (Xt ) : Ta ≤ t < Ta }.1 Functions of excursions Namely. Speciﬁcally.

say. the ‘time-average reward’ f exists and is a deterministic quantity: P where Ta −1 1 t→∞ t lim t f (Xn ) = f n=0 =1 . (r) Nt := max{r ≥ 1 : Ta ≤ t}. plus the total reward received over the next excursion. and so 1 ﬁrst G → 0. with probability one. We break the sum into the total reward received over the initial excursion. Hence |Gﬁrst | ≤ BTa . with probability one. Since P (Ta < ∞) = 1. The idea is this: Suppose that t is large enough. t as t → ∞. Nt = 5. Ea f= n=0 f (Xn ) E a Ta = x∈S f (x)π(x). Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i. t t as t → ∞. Ea Ta < ∞) and assume that the distribution of X0 is such that P (Ta < ∞) = 1. Gﬁrst is the reward up to time Ta = Ta or t (whichever t is smaller). Proof. Nt . Gr is given by (7). Then. We assumed that f is t bounded. Let f : S → R+ be a bounded ‘reward’ function. A similar argument shows that 1 last G → 0. in the ﬁgure below. We shall present the proof in a way that the constant f will reveal itself–it will be discovered. For example. say |f (x)| ≤ B. we have BTa /t → 0.e. and so on. Thus we need to keep track of the number. of complete excursions from the ground state a that occur between 0 and t. 0 Ta (1) Ta (2) T (3) a Ta(4) Ta (5) t We write the total reward as a sum of rewards over complete excursions plus a ﬁrst and a last term: t Nt −1 r=1 (1) f (Xn ) = n=0 Gr + Gﬁrst + Glast . t t 36 .Theorem 14. t t In other words. and Glast is the reward over the last incomplete cycle. We are looking for the total reward received between 0 and t.

Hence Ta t t Ta t ≤ < Nt Nt Nt But (11) tells us that Ta t Ta t = lim t→∞ Nt t→∞ Nt lim Hence lim (N ) (N +1) (N ) (N +1) . Hence. we have Ta r→∞ r lim (r) . r=1 (9) Look again what the Law of Large Numbers (8) tells us: that 1 n→∞ n lim (r) n Gr = EG1 . . Since Ta → ∞. = E a Ta . r=1 (10) Now look at the sequence r (r) (1) Ta = Ta + k=1 (k) (k−1) ) (Ta − Ta (1) (k) (k−1) and see what the Law of Large Numbers again.i. E a Ta f (Xn ) = n=0 EG1 =: f . lim t∞ 1 Nt Nt −1 Gr = EG1 . E a Ta 37 .d. are i. (11) By deﬁnition.. . as t → ∞. . 2.also. we have that the visit with index r = Nt to the ground state a is the last before t (see also ﬁgure above): (N (N Ta t ) ≤ t < Ta t +1) . 1 Nt = . So we are left with the sum of the rewards over complete cycles. random variables with common expectation Ea Ta . Write now 1 t Nt −1 Gr = r=1 Nt 1 t Nt Nt −1 Gr . r=1 with probability one. = E a Ta . it follows that the number Nt of complete cycles before time t will also be tending to ∞. and since Ta − Ta k = 1. Since Ta /r → 0. t→∞ t E a Ta (12) Using (10) and (12) in (9) we have 1 lim t→∞ t Thus 1 t→∞ t lim t Nt −1 Gr = r=1 EG1 . as r → ∞.

The constant f is a sort of average over an excursion. We can use b as a ground state instead of a. it takes Ea Ta units of time between successive visits to the state a. we let f to be 1 at the state b and 0 everywhere else. The equality of these two limits gives: average no. f(X2 ) f(X1 ) f(X )= f(a) 0 f(T −1) a 0 T a Comment 2: A very important particular case occurs when we choose f (x) := 1(x = b). not both. of times state b is visited = Ea between two successive visits to state a Ta −1 1(Xk = b) = k=0 E a Ta . The denominator is the duration of the excursion. then the long-run proportion of visits to the state a should equal 1/Ea Ta . We may write EG1 = Ea 0≤t<Ta f (Xt ) or EG1 = Ea 0<t≤Ta f (Xt ). The physical interpretation of the latter should be clear: If. E a Ta (14) with probability 1.To see that f is as claimed. Applying formula (14) with b in place of a we ﬁnd that the limit in (13) also equals 1/Eb Tb . Hence t→∞ lim 1 t t 1(Xn = a) = n=1 1 . Note that we include only one endpoint of the excursion. then we see that the numerator equals 1. The theorem above then says that. for all b ∈ S. Comment 1: It is important to understand what the formula says. b. just observe that EG1 = E (1) (1+1) Ta ≤t<Ta f (Xt ) = Ea 0≤t<Ta f (Xt ). the ground state. i. which communicate and are positive recurrent. from the Strong Markov Property. on the average. Comment 4: Suppose that two states. a. (13) 1(Xn = b) = n=1 Comment 3: If in the above we let b = a. E b Tb 38 . t Ea 1(Xk = b) E a Ta . Ta −1 k=0 1 lim t→∞ t with probability 1.e. The numerator of f is computed by summing up the rewards over the excursion (making sure that not both endpoints are included).

X3 . Moreover. . Proposition 2. so there is nothing lost in doing so. by a function of X1 . . (16) This was a trivial but important step: We replaced the n = 0 term with the n = Ta term. notice that the choice of notation ν [a] is justiﬁable through the notation (6) for the communicating class of the state a. . x ∈ S. (15) should play an important role. 39 . Consider the function ν [a] deﬁned in (15). Since a = XT a . Indeed.14 Construction of a stationary distribution Our discussions on the law of large numbers for Markov chains revealed that the quantity Ta −1 ν (x) := Ea n=0 [a] 1(Xn = x) . Recurrence tells us that Ta < ∞ with probability one. X2 . In view of this. X1 . . i. ν [a] (x) = y∈S ν [a] (y)pyx . Clearly. we will soon realise that this dependence vanishes when the chain is irreducible. . Hence the superscript [a] will soon be dropped. we saw (Comment 2 above) that if a is a positive recurrent state then ν [a] (x) E a Ta is the long-run fraction of times that state x is visited.e. Fix a ground state a ∈ S and assume that it is recurrent5 . Proof. 5 We stress that we do not need the stronger assumption of positive recurrence here. Both of them equal 1. x ∈ S. Note: Although ν [a] depends on the choice of the ‘ground’ state a. we can write ν (x) = Ea k=1 [a] Ta 1(Xn = x). we have 0 < ν [a] (x) < ∞. where a is the ﬁxed recurrent ground state whose existence was assumed. this should have something to do with the concept of stationary distribution discussed a few sections ago. X2 . We start the Markov chain with X0 = a. for any state b such that a → x (a leads to x). Then ν [a] satisﬁes the balance equations: ν [a] = ν [a] P. . The real reason for doing so is because we wanted to replace a function of X0 .

which. n ≤ Ta ) pyx Pa (Xn−1 = y. So we have shown ν [a] (x) = y∈S pyx ν [a] (y). namely. so there exists n ≥ 1. x ∈ S. Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta . which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. Xn−1 = y. we used (16). x a. ν [a] (x) < ∞. in this last step. Deﬁne Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). 40 . from (16). such that pxa > 0. Suppose that the Markov chain possesses a positive recurrent state a. such that pax > 0. xa so. Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. where. indeed. for all n ∈ N. (n) Next. is ﬁnite when a is positive recurrent. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. which are precisely the balance equations. Note: The last result was shown under the assumption that a is a recurrent state. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). Therefore. n ≤ Ta ) Pa (Xn = x. by deﬁnition. First note that. (17) This π is a stationary distribution. ax ax From Theorem 10. that a is positive recurrent. We just showed that ν [a] = ν [a] P. We are now going to assume more. so there exists m ≥ 1. We now have: Corollary 9. ν [a] = ν [a] Pn . for all x ∈ S. let x be such that a x.and thus be in a position to apply ﬁrst-step analysis.

Also. Start with X0 = i. x∈S Pause for thought: We have achieved something VERY IMPORTANT: if we have one. (r) j.i. the event that j occurs in a speciﬁc excursion is independent from the event that j occurs in the next excursion. 1. The coins are i. Consider the excursions Xi . Suppose that i recurrent. . Since a is positive recurrent. since the excursions are independent. then j occurs in the r-th excursion. and the probability gij = Pi (Tj < ∞) that j occurs in a speciﬁc excursion is positive. we toss a coin with probability of heads gij > 0. then the r-th excursion does not contain state j. then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is. Theorem 15. 2. second) excursion that contains j. But π [a] (x) = x∈S 1 E a Ta ν [a] (x) = 1. We already know that j is recurrent. 41 . i. we have Ea Ta < ∞. In words. 15 Positive recurrence is a class property We now show that (just like recurrence) positive recurrence is also a class property. R2 ) be the index of the ﬁrst (respectively. Let i be a positive recurrent state.d. If heads show up at the r-th excursion.In other words: If there is a positive recurrent state. . Therefore. Then j is positive Proof. r = 0. Let R1 (respectively.e. that ν [a] P = ν [a] . Hence j will occur in inﬁnitely many excursions. unique. positive recurrent state. no matter which one. π [a] P = π [a] . This is what we will try to do in the sequel. in general. and so π [a] is not trivially equal to zero. Otherwise. . It only remains to show that π [a] is a probability. that it sums up to one. We have already shown. π(y) = 1 y∈S x∈S have at least one solution. But π [a] is simply a multiple of ν [a] . in Proposition 2. then the set of linear equations π(x) = y∈S π(y)pyx . if tails show up. to decide whether j will occur in a speciﬁc excursion. Proof.

. by Wald’s lemma. the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . R2 has ﬁnite expectation. . or all states are null recurrent or all states are transient. An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. An irreducible Markov chain is called transient if one (and hence all) states are transient. But.d. Then either all states in C are positive recurrent. . An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent.i. 2. Corollary 10. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. . . Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞. we just need to show that R2 −R1 +1 ζ := r=1 Zr has ﬁnite expectation.R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . where the Zr are i. . with the same law as Ti . IRREDUCIBLE CHAIN r = 1. Let C be a communicating class. RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 . . In other words.

indeed. Theorem 16. Proof. π(2) = 1 − α 2. (18) π(x)Px (Ta < ∞) = π(x) = 1. an absorbing state). we start the Markov chain from the stationary distribution π. x∈S By (13). after all. Then there is a unique stationary distribution. Fix a ground state a.16 Uniqueness of the stationary distribution At the risk of being too pedantic. Example: Consider the chain we’ve seen before: 1 1/3 1 0 1/2 1 1/6 2 Clearly. 1(Xn = x) n=1 = 1 t t P (Xn = x) = π(x). for all n. n ≥ 0) and suppose that P (X0 = x) = π(x). following Theorem 14. But observe that there are many–inﬁnitely many–stationary distributions: For each 0 ≤ α ≤ 1. we have. for all states i and j. E 1 t t 1 t t 1(Xn = x) n=1 → π [a] (x). what prohibits uniqueness. x ∈ S. we have P (Xn = x) = π(x). x ∈ S. with probability one. by (18). the π deﬁned by π(0) = α. there is a stationary distribution. By a theorem of Analysis (Bounded Convergence Theorem). Consider positive recurrent irreducible Markov chain. Consider the Markov chain (Xn . as t → ∞. is a stationary distribution. n=1 43 . we stress that a stationary distribution may not be unique. But. state 1 (and state 2) is positive recurrent. Then P (Ta < ∞) = x∈S In other words. π(1) = 0. Ta −1 1 t t Ea 1(Xk = x) k=0 1(Xn = x) n=1 → E a Ta = π [a] (x). for all x ∈ S. for totally trivial reasons (it is. Let π be a stationary distribution. Hence. we can take expectations of both sides: E as t → ∞. Then. Since the chain is irreducible and positive recurrent we have Pi (Tj < ∞) = 1. (You should think why this is OBVIOUS!) Notice that 1 This lack of communication is. recognising that the random variables on the left are nonnegative and bounded below 1.

Let i be such that π(i) > 0. But π(i|C) > 0. The following is a sort of converse to that. π(a) = π [a] (a) = 1/Ea Ta . Let π(·|C) be the distribution deﬁned by restricting π to C: π(x|C) := π(x) . Then. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. Consider a Markov chain.e.Therefore π(x) = π [a] (x). π(i|C) = Ei Ti . an arbitrary stationary distribution π must be equal to the speciﬁc one π [a] . π(C) x ∈ C. Therefore Ei Ti < ∞. Proof. Hence π = π [a] . Consider a positive recurrent irreducible Markov chain with stationary distribution π. Corollary 11. Consider the restriction of the chain to C. x ∈ S. which is useful because it is often used as a criterion for positive recurrence: Corollary 13. 17 Structure of a stationary distribution It is not hard to show now what an arbitrary stationary distribution looks like. 1 π(a) = . Let C be the communicating class of i. Consider a positive recurrent irreducible Markov chain and two states a. In particular. so they are equal. where π(C) = y∈C π(y). for all a ∈ S. for all x ∈ S. There is a unique stationary distribution. Both π [a] and π[b] are stationary distributions. We can easily see that we thus obtain a Markov chain with state space C with stationary distribution π(·|C). the only stationary distribution for the restricted chain because C is irreducible. Assume that a stationary distribution π exists. This is in fact. In particular. Proof. Hence there is only one stationary distribution. i is positive recurrent. from the deﬁnition of π [a] . The cycle formula merely re-expresses this equality. Decompose it into communicating classes. Corollary 12. Thus. delete all transition probabilities pxy with x ∈ C. i. Then π [a] = π [b] . Namely. There is only one stationary distribution. y ∈ C. Then any state i such that π(i) > 0 is positive recurrent. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π C 44 . By the 1 above corollary. E a Ta Proof. Consider an arbitrary Markov chain. we obtain the cycle formula: Ta −1 Tb −1 Ea k=0 1(Xk = x) E a Ta = Eb k=0 1(Xk = x) E b Tb . b ∈ S.

A sequence of real numbers an k=1 1 may not converge but its Cesaro averages n (a1 + · · · + an ) may converge. A Markov chain is a simple model of a stochastic dynamical system. For each such C deﬁne the conditional distribution π(x|C) := π(x) . eventually. x ∈ C) is a stationary distribution. the pendulum will perform an 45 . 18 18. Proposition 3. In an ideal pendulum. Proof. Any stationary distribution π is necessarily of the form π= C: C positive recurrent communicating class αC π C . We all know. more generally. x∈C Then (π(x|C). By our uniqueness result. under irreducibility and positive recurrence conditions. that it remains in a small region). To each positive recurrent communicating class C there corresponds a stationary distribution π C . it will settle to the vertical position. that. Why stability? There are physical reasons why we are interested in the stability of a Markov chain. for example. where αC ≥ 0. π(x|C) ≡ π C (x). Let π be an arbitrary stationary distribution. There is dissipation of energy due to friction and that causes the settling down. consider the motion of a pendulum under the action of gravity. This is the stable position. such that C αC = 1. So we have that the above decomposition holds with αC = π(C). An arbitrary stationary distribution π must be a linear combination of these π C . Stability of a deterministic dynamical system means that the system eventually converges to a ﬁxed point (or. that the pendulum will swing for a while and. where π(C) := π(C) π(x). For example. without friction. as n → ∞. from physical experience. the so-called Cesaro 1 averages n n P (Xk = x) converge to π(x).1 Coupling and stability Deﬁnition of stability Stability refers to the convergence of the probabilities P (Xn = x) as n → ∞. If π(x) > 0 then x belongs to some positive recurrent class C. take an = (−1)n .which assigns positive probability to each state in C. This is deﬁned by picking a state (any state!) a ∈ C and letting π C := π [a] . So far we have seen.

for we can do it: X1 = ρX0 + ξ0 X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0 ) + ξ1 = ρ2 X0 + ρξ0 + ξ1 X3 = ρX2 + ξ2 = ρ3 X0 + ρ2 ξ0 + ρξ1 + ξ2 Xn = ρn X0 + ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . it cannot mean convergence to a ﬁxed equilibrium point. However notice what happens when we solve the recursion. then x(t) = x∗ for all t > 0. Still. . are given. we here assume that X0 . A linear stochastic system We now pass on to stochastic systems. ˙ There are myriads of applications giving physical interpretation to this. It follows that |ρ| < 1 is the condition for stability. 4 3 2 x* 1 0 0 1 2 t 3 t4 If we start with x(t = 0) = x∗ . .oscillatory motion ad inﬁnitum. We shall assume that the ξn have the same law. . consider the following diﬀerential equation x = −ax + b. simple because the ξn ’s depend on the time parameter n. we may it is stable because it does not (it cannot!) swing too far away. a > 0. so I’ll leave it upon the reader to decide whether he or she wants to consider this as an ﬁnancial example. let us consider a stochastic analogue of the above in discrete time: Xn+1 = ρXn + ξn . . The thing to observe here is that there is an equilibrium point i. mutually independent. x ∈ R. or whatever. ······ 46 . If. and deﬁne a stochastic sequence X1 . And we are in a happy situation here. Speciﬁcally.e. we have x(t) → x∗ . ξ0 . But what does stability mean? In general. The simplest diﬀerential equation As another example. random variables. . as t → ∞. X2 . . by means of the recursion above. We say that x∗ is a stable equilibrium. ξ2 . the one found by setting the right-hand side equal to zero: x∗ = b/a. ξ1 . moreover. Since we are prohibited by the syllabus to speak about continuous time stochastic systems. then regardless of the starting position. or an example in physics.

How should we then deﬁne what P (X = x. In particular. the limit of its distribution does: ˜ ˆ ˆ lim P (Xn ≤ x) = lim P (Xn ≤ x) = lim P (Xn ≤ x) = P (X∞ ≤ x) n→∞ n→∞ n→∞ This is precisely the reason why we cannot. Coupling refers to the various methods for constructing this joint law.3 Coupling To understand why the fundamental stability theorem works. . j ∈ S. uniformly in x. If a Markov chain is irreducible. we have ˜ ˆ P (Xn ≤ x) = P (Xn ≤ x). but notice that if we let ˆ Xn := ρn−1 ξn−1 + ρn−2 ξn−2 + · · · + ρξ1 + ξ0 . while the limit of Xn does not exists. suppose you are given the laws of two random variables X. positive recurrent and aperiodic with (unique) stationary distribution π then n→∞ lim P (Xn = x) = π(x). ∞ ˆ The nice thing is that. we know f (x) = P (X = x). under nice conditions. the n-step transition matrix Pn .Since |ρ| < 1. we have that the ﬁrst term goes to 0 as n → ∞. . 18. What this means is that. which is a linear combination of ξ0 . But suppose that we have some further requirement. Starting at an elementary level. Y = y) = P (X = x)P (Y = y). for instance.2 The fundamental stability theorem for Markov chains Theorem 17. for any initial distribution. (e. converges. . Y = y) is? 47 . 18. to try to make the coincidence probability P (X = Y ) as large as possible. . assume that the ξn are bounded) Xn does converge to ˆ X∞ = k=0 ρk ξk . In other words. i. Y = y) is. for time-homogeneous Markov chains. A straightforward (often not so useful) construction is to assume that the random variables are independent and deﬁne P (X = x. ξn−1 does not converge.g. The second term. but we are not given what P (X = x. we need to understand the notion of coupling. Y but not their joint law. to a matrix with rows all equal to π: (n) lim p n→∞ ij = π(j). deﬁne stability in a way other than convergence of the distributions of the random variables. g(y) = P (Y = y). ˜ Xn := ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . as n → ∞.

|P (Xn = x) − P (Yn = x)| ≤ P (T > n). n < T ) + P (Xn = x. Repeating (19). x∈S 48 . Y . f (x) = y h(x. n ≥ T ) = P (Xn = x. Then. n ≥ 0. for all n ≥ 0. Y = y) which respects the marginals. but with the roles of X and Y interchanged. Proof. n ≥ 0. and all x ∈ S. Let T be a meeting of two coupled stochastic processes X. x ∈ S. Suppose that we have two stochastic processes as above and suppose that they have been coupled. i. This T is a random variables that takes values in the set of times or it may take values +∞. T is ﬁnite with probability one. A coupling of them refers to a joint construction on the same probability space. n < T ) + P (Yn = x. If. g(y) = x h(x. uniformly in x ∈ S. and which maximises P (X = Y ). we obtain P (Yn = x) − P (Xn = x) ≤ P (T > n). Find the joint law h(x. and hence P (Xn = x) − P (Yn = x) ≤ P (T > n). The following result is the so-called coupling inequality: Proposition 4. we can write it as sup |P (Xn = x) − P (Yn = x)| ≤ P (T > n). g(y) = P (Y = y). We are given the law of a stochastic process X = (Xn . n ≥ T ) ≤ P (n < T ) + P (Yn = x).e. n ≥ 0). We have P (Xn = x) = P (Xn = x. precisely because of the assumption that the two processes have been constructed together. y). n ≥ 0) and the law of a stochastic process Y = (Yn . P (T < ∞) = 1. (19) as n → ∞. Y be random variables with values in Z and laws f (x) = P (X = x). on the event that the two processes never meet. The coupling we are interested in is a coupling of processes. y). We say that ransom time T is meeting time of the two processes if Xn = Yn for all n ≥ T. then |P (Xn = x) − P (Yn = x)| → 0. Since this holds for all x ∈ S and since the right hand side does not depend on x. y) = P (X = x. but we are not told what the joint probabilities are. in particular. respectively.e.Exercise: Let X. i. Note that it makes sense to speak of a meeting time. Combining the two inequalities we obtain the ﬁrst assertion: |P (Xn = x) − P (Yn = x)| ≤ P (T > n).

x′ ). n ≥ 0) is a Markov chain with state space S × S. and so (Wn . Yn ). denoted by Y = (Yn .y′ for all large n. Xm = y).x′ py.y′ . 49 (n) (n) (n) (n) (n) (n) = P (Xn+1 = x′ | Xn = x) P (Yn+1 = y ′ | Yn = y) = px.y′ ) for all large n. π(x) > 0 for all x ∈ S.(y. positive recurrent and aperiodic. y) Its n-step transition probabilities are q(x. and it also makes sense to consider their ﬁrst meeting time: T := inf{n ≥ 0 : Xn = Yn }.y′ ) := P Wn+1 = (x′ .x′ py. It is very important to note that we have only deﬁned the law of X and the law of Y but we have not deﬁned a joint law.x′ ). sup |P (Xn = x) − P (Yn = x)| → 0.y′ ) = px. x∈S as n → ∞. Its initial state W0 = (X0 . then. as usual. denoted by π. we can deﬁne joint probabilities.(y. The second process. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities. for all n ≥ 1. . The ﬁrst process. y) ∈ S × S.x′ > 0 and py. n ≥ 0). Then (Wn .(y. y) = µ(x)π(y).x′ ). had we started with both X0 and Y0 independent and distributed according to π. they diﬀer only in how the initial states are chosen.) By positive recurrence. We now couple them by doing the straightforward thing: we assume that X = (Xn .y′ . y ′ ) | Wn = (x. in other words. Then n→∞ lim P (T > n) = P (T = ∞) = 0. (Indeed. Notice that σ(x. Consider the process Wn := (Xn . Let pij denote. denoted by X = (Xn . we have not deﬁned joint probabilities such as P (Xn = x.4 Proof of the fundamental stability theorem First. y) := π(x)π(y). n ≥ 0) has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ.Assume now that P (T < ∞) = 1. We know that there is a unique stationary distribution. n ≥ 0) has the law of a Markov chain with transition probabilities pij and Y0 distributed according to the stationary distribution µ. Having coupled them. the transition probabilities. n ≥ 0) is independent of Y = (Yn . we have that px. n ≥ 0) is an irreducible chain. We shall consider the laws of two processes. Therefore. 18. Xn and Yn would be independent and distributed according to π. (x. Its (1-step) transition probabilities are q(x. is a stationary distribution for (Wn . implying that q(x. Y0 ) has distribution P (W0 = (x. From Theorem 3 and the aperiodicity assumption. review the assumptions of the theorem: the chain is irreducible. n ≥ 0).

after the meeting time. n ≥ 0) is positive recurrent.(0. Then the product chain has state space S ×S = {(0.0) = q(1. Hence (Zn . which converges to zero. ≥ 0) is a Markov chain with transition probabilities pij . we have |P (Xn = x) − π(x)| ≤ P (T > n). Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X. n ≥ 0). (1. Moreover. Obviously.1). Zn equals Xn before the meeting time of X and Y . 50 .0). y) > 0 for all (x. Since P (Zn = x) = P (Xn = x).Therefore σ(x. 0). Clearly. However. (1. Hence (Wn . p10 = 1.1) = 1.(1. 1)} and the transition probabilities are q(0. T is a meeting time between Z and Y . P (T < ∞) = 1. In particular. For example.(1. precisely because P (T < ∞) = 1. say. if we modify the original chain and make it aperiodic by.0). 1). uniformly in x. p00 = 0. Z0 = X0 which has an arbitrary distribution µ. Y ) may fail to be irreducible. This is not irreducible.1) = q(1. Zn = Yn . for all n and x. y) ∈ S × S. Deﬁne now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ). then the process W = (X. n ≥ 0) is identical in law to (Xn . X arbitrary distribution Z stationary distribution Y T Thus. 1} and the pij are p01 = p10 = 1. |P (Zn = x) − P (Yn = x)| ≤ P (T > n). (Zn . by assumption.01.(0. 0). In particular. (0. as n → ∞. By the coupling inequality. and since P (Yn = x) = π(x). suppose that S = {0.1).0) = q(0. p01 = 0. P (Xn = x) = P (Zn = x).99.

It is.503 0. customary to have a consistent terminology. 0. p01 = 0.497 .503 0. π(0) ≈ 0. π(1) ≈ 0. most people use the term “ergodic” for an irreducible. we know we have a unique stationary distribution π and that π(x) > 0 for all states x. and so on. 18. 51 . By the results of Section 14. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing. Terminology cannot be. per se wrong. . C1 . then (n) lim p n→∞ 00 (n) = lim p10 = π(0). But this is “wrong”.01.99π(0) = π(1). if we modify the chain by p00 = 0. 6 The reason is that the term “ergodic” means something speciﬁc in Mathematics (c. Indeed.99. π(1) satisfy 0. Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. p10 = 1. Suppose X0 = 0. and this is expressed by our terminology.6 Periodic chains Suppose now that we have a positive recurrent irreducible chain. Suppose that the period of the chain is d. Then Xn = 0 for even n and = 1 for odd n.497. Parry. . i. n→∞ (n) (n) lim p n→∞ 11 = lim p01 = π(1). 1981). The reader is asked to pay attention to this terminology which is diﬀerent from the one used in the Markov chain literature. such that the chain moves cyclically between them: from C0 to C1 . however.503. In matrix notation. n→∞ lim p00 p01 (n) (n) p10 p11 (n) (n) = 0. . limn→∞ p00 does not exist. Cd−1 .then the product chain becomes irreducible. we can decompose the state space into d classes C0 . Thus. .f. positive recurrent and aperiodic Markov chain. n→∞ (n) where π(0). π(0) + π(1) = 1. from C1 to C2 . Therefore p00 = 1 for even n (n) and = 0 for odd n. Theorem 4. It is not diﬃcult to see that 6 We put “wrong” in quotes. However.4 and. in particular. from Cd back to C0 .e.5 Terminology A Markov chain which is irreducible and positive recurrent will be called ergodic. By the results of §5.497 18.

n ≥ 0) is aperiodic. Cd−1 . . . . Suppose that X0 ∈ Cr . 52 .Lemma 8. Summing up over i ∈ Cr we obtain αr = i∈Cr π(i) = j∈Cr−1 π(j) i∈Cr pji = j∈Cr−1 π(j) = αr−1 . Since the αr add up to 1. j ∈ Cℓ . i ∈ Cr . and the previous results apply. where π is the stationary distribution. This means that: Theorem 18. j ∈ Cr . as n → ∞. Then. We shall show that Lemma 9. n ≥ 0) consists of d closed communicating classes. we have pij for all i. n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain. and let r = ℓ − k (modulo d). The chain (Xnd . . . i ∈ S. Can we also ﬁnd the limit when i and j do not belong to the same class? Suppose that i ∈ Ck . thereafter. the (unique) stationary distribution of (Xnd . Then pij (r+nd) (nd) = Pi (Xnd = j) → dπ(j). it remains in Cℓ . Then. . Then pji = 0 unless j ∈ Cr−1 . Suppose i ∈ Cr . . So π(i) = j∈Cr−1 π(j)pji . Hence if we restrict the chain (Xnd . starting from i. . j ∈ Cℓ . if sampled every d steps. π(i) = 1/d. 1. We have that π satisﬁes the balance equations: π(i) = j∈S π(j)pji . → dπ(j). i∈Cr for all r = 0. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d. All states have period 1 for this chain. after reduction by d) and. d − 1. Since (Xnd . i ∈ Cr . Let αr := i∈Cr π(i). the original chain enters Cℓ in r = ℓ − k steps (where r is taken to be a number between 0 and d − 1. Let i ∈ Ck . Proof. namely C0 . as n → ∞. each one must be equal to 1/d. n ≥ 0) is dπ(i).

It is clear how we (ni ) k continue. Now consider the contained between 0 and 1 there must be a exists and equals. . for all i ∈ S. . such that limk→∞ p1 k 3 (n1 ) numbers p2 k . say. . The proof will be based on two results from Analysis. Then we can ﬁnd a i=1 pi sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi Proof. for each i. . p3 . the sequence n3 of the third row is a subsequence k k of n2 of the second row. p(n) = (n) (n) (n) (n) (n) ∞ p1 . . k = 1. Recall that we know what happens when j is positive recurrent and i is in the same commu(n) nicating class as j If the period is 1.19 Limiting theory for Markov chains* We wish to complete our discussion about the possible limits of the n-step transition prob(n) abilities pij . . e 53 . as n → ∞. If the period is d then the limit exists only along a selected subsequence (Theorem 18). . Since the numbers p1 n1 1 n1 2 (n) (n1 ) < < n1 < · · · . . 2. say. We have. pi . whose proof we omit7 : 7 See Br´maud (1999). a probability distribution on N. Consider. . . the sequence n2 of the second row k is a subsequence of n1 of the ﬁrst row. . the ﬁrst of which is known as HellyBray Lemma: Lemma 10. say. i. such that limk→∞ pi k exists and equals. 2. schematically. where = 1 and pi ≥ 0 for all i. k = 1. for all i ∈ S. . 19. For each i we have found subsequence ni . then pij → 0 as n → ∞. 2. So the diagonal subsequence (nk . k = 1. .1 Limiting probabilities for null recurrent states Theorem 19. By construction.. Hence pi k i. k = 1. 2. .) is a k k subsequence of each (ni . are contained between 0 and 1. for each n = 1. Since they are (n2 ) subsequence n2 of n1 such that limk→∞ p1 k k k (nk ) (n) as n → ∞. for each The second result is Fatou’s lemma. If j is a null recurrent state then pij → 0. exists for all i ∈ N. if j is transient. as k → ∞. there must be a subsequence exists and equals. . . p2 .. . 2. We also saw in Corollary (n) 7 that. (nk ) k → pi . . p1 p1 (n1 ) 1 p1 (n1 ) 2 (n2 ) 2 p1 (n1 ) 3 (n2 ) 3 (n3 ) 3 (n2 ) 1 (n3 ) 1 ··· ··· ··· ··· → → → ··· p1 p2 p3 ··· p1 p1 ··· p1 p1 ··· (n3 ) 2 p1 ··· Now pick the numbers in the diagonal.e. . p2 . then pij → π(j) (Theorems 17). p1 . and so on.).

i. and the last equality is obvious since we have a probability vector. . for some x0 ∈ S. This would imply that αx > x∈S x∈S y∈S x∈S αy pyx = y∈S αy x∈S pyx = y∈S αy . i. i=1 (n) Proof of Theorem 19. by assumption.Lemma 11. 54 . x ∈ S . Hence we can ﬁnd a subsequence nk such that k→∞ lim pij k = αj > 0. αy pyx . (n ) (n ) Consider the probability vector By Lemma 10. y∈S x ∈ S. actually. x ∈ S. we have that π(x) := αx y∈S αy satisﬁes the balance equations and has x∈S π(x) = 1. pix k . Now π(j) > 0 because αj > 0.e. the sequence (n) pij does not converge to 0. Thus. by Lemma 11. n = 1. . By Corollary 13. (n′ ) Hence αx ≥ αy pyx . αx ≤ lim inf k→∞ (n′ ) for all x ∈ S. pix k = 1.e. We wish to show that these inequalities are. we have 0< x∈S The inequality follows from Lemma 11. y∈S Since x∈S αx ≤ 1. (xi . .e. it is a strict inequality: αx0 > y∈S αy pyx0 . and this is a contradiction (the number αx = αx cannot be strictly larger than itself). If (yi . Suppose that for some i. Suppose that one of them is not. i ∈ N). state j is positive recurrent: Contradiction. i ∈ N). π is a stationary distribution. (n′ ) lim pix k (n′ +1) ≥ y∈S k→∞ lim piy k pyx . we can pick subsequence (n′ ) of (nk ) such that k k→∞ lim pix k = αx . But Pn+1 = Pn P. i. Let j be a null recurrent state. pix k Therefore. equalities. x∈S (n′ ) Since aj > 0. are sequences of nonnegative numbers then ∞ (n) yi lim inf xi i=1 n→∞ (n) ∞ ≤ lim inf n→∞ yi x i . k→∞ (n′ +1) = y∈S piy k pyx . 2.

and fij = ϕC (j). as n → ∞. Let µ be the distribution of XHC given that X0 = i and that {HC < ∞}. Indeed. This is a more old-fashioned way of getting the same result. Remark: We can write the result of Theorem 20 as (n) lim p n→∞ ij = fij . Hence (n) lim p n→∞ ij ∞ = π (j) m=1 C Pi (HC = m) = π C (j)Pi (HC < ∞). If i also belongs to C then ϕC (i) = 1 and so the result follows from Theorem 17. But Theorem 17 tells us that the limit exists regardless of the initial distribution. and fij = Pi (Tj < ∞). as in §10. HC ≤ n) n (n) = m=1 Pi (HC = m)Pi (Xn = j | HC = m).2. Let i be an arbitrary state. when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. which is what we need. The result when the period j is larger than 1 is a bit awkward to formulate. In this form. Pµ (Xn−m = j) → π C (j). π C (j) = 1/Ej Tj . so we will only look that the aperiodic case. E j Tj where Tj = inf{n ≥ 1 : Xn = j}. we have pij = Pi (Xn = j) = Pi (Xn = j. γ < 1): γ 1−γ 1 1 2 1−α α 3 β 4 1−β 5 1 55 . Let C be the communicating class of j and let π C be the stationary distribution associated with C. as n → ∞. Let HC := inf{n ≥ 0 : Xn ∈ C}. Then (n) lim p n→∞ ij = ϕC (i)π C (j). Let j be a positive recurrent state with period 1.19. Example: Consider the following chain (0 < α. β. Proof. starting with the ﬁrst hitting time decomposition relation of Lemma 7. Theorem 20. Let ϕC (i) := Pi (HC < ∞).2 The general case (n) It remains to see what the limit of pij is. From the Strong Markov Property. the result can be obtained by the so-called Abel’s Lemma of Analysis. that is. we have Pi (Xn = j | HC = m) = Pµ (Xn−m = j). In general.

(n) (n) p43 → 0. in Theorem 14. C1 = {1. ϕ(3) = (1 − α)ϕ(2) + αϕ(4) . we assumed the existence of a positive recurrent state. as n → ∞. 1 − αβ 1 − αβ Both C1 . whence 1−α β(1 − α) . (n) (n) p32 → ϕ(3)π C1 (2). 1+γ 1+γ Let ϕ(i) be the probability that class C1 will be reached starting from state i. information about its velocity. We now generalise this. the 19-th century French mathematicians Henri Poincar´ and e Joseph Liouville. from ﬁrst-step analysis. For example. Theorem 14 is a type of ergodic theorem. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P (Ta < ∞) = 1. (n) (n) p33 → 0. ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. C2 = {5} (absorbing state). π C1 (2) = . Similarly. (n) (n) p34 → 0. C2 are aperiodic classes. The stationary distributions associated with the two closed classes are: 1 γ π C1 (1) = .The communicating classes are I = {3. p42 → ϕ(4)π C1 (2). Ergodic Theory is nowadays an area that combines many ﬁelds of Mathematics and Science. (n) (n) p35 → (1 − ϕ(3))π C2 (5). Theorem 21. Hence. p44 → 0. We have ϕ(1) = ϕ(2) = 1. p45 → (1 − ϕ(4))π C2 (5). among others. The phase (or state) space of a particle is not just a physical space but it includes. 2} (closed class). ϕ(4) = (1 − β)ϕ(5) + βϕ(3). 20 Ergodic theorems An ergodic theorem is a result about time-averages of random processes. Let f be a reward function such that |f (x)|ν [a] (x) < ∞. 56 (20) x∈S . Pioneers in the are were. We will avoid to give a precise deﬁnition of the term ‘ergodic’ but only mention what constitutes a result in the area. p41 → ϕ(4)π C1 (1). for example. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its “phase space” is proportional to the volume of this region. ϕ(3) = p31 → ϕ(3)π C1 (1). ϕ(4) = . ϕ(5) = 0. Consider the quantity ν [a] (x) deﬁned as the average number of visits to state x between two successive visits to state a–see 16. Recall that. π C2 (5) = 1. while. 4} (inessential states). The subject of Ergodic Theory has its origins in Physics. And so is the Law of Large Numbers itself (Theorem 13).

we break the total reward as follows: t Nt −1 f (Xn ) = n=0 r=1 Gr + Gﬁrst + Glast . then. Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem Theorem 14 holds. respectively. Replacing n by the subsequence Nt . we obtain the result. we have Nt /t → 1/Ea Ta . t t where Gr = T (r) ≤t<T (r+1) f (Xt ). Then at least one of the states must be visited inﬁnitely often. Note that the denominator t n=0 1(Xn = a) equals the number Nt of visits to state a up to time t. in the former. If so. The Law of Large Numbers tells us that 1 lim n→∞ n n−1 Gr = EG1 . as we saw in the proof of Theorem 14. which is the quantity denoted by f in Theorem 14. 57 . 21 Finite chains We take a brief look at some things that are speciﬁc to chains with ﬁnitely many states. and this is left as an exercise.5. Proof. From proposition 2 we know that the balance equations ν = νP have at least one solution. we have t→∞ lim 1 last 1 G = lim Gﬁrst = 0. we have C := i∈S ν(i) < ∞. So if we divide by t the numerator and denominator of the fraction in (21). Now. and this justiﬁes the terminology of §18. The reader is asked to revisit the proof of Theorem 14. t t→∞ t t t with probability one. so the numerator converges to f /Ea Ta . while Gﬁrst . we assumed positive recurrence of the ground state a. (21) f (x)ν [a] (x). But this is precisely the condition (20). since the number of states is ﬁnite. So there are always recurrent states. As in Theorem 14. r=0 with probability 1 as long as E|G1 | < ∞. As in the proof of Theorem 14. Glast are the total rewards over the ﬁrst and t t a a last incomplete cycles. we have that the denominator equals Nt /t and converges to 1/Ea Ta .Then P where f := x∈S t→∞ lim t n=0 t n=0 f (Xn ) 1(Xn = a) =f =1 . Remark 1: The diﬀerence between Theorem 14 and 21 is that. We only need to check that EG1 = f .

X1 . Then the chain is positive recurrent with a unique stationary distribution π. π(i) := Hence. X1 ) has the same distribution as (X1 . In addition. or. (X0 . More precisely. this means that it is stationary. i. 58 . . Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. reversible if it has the same distribution when the arrow of time is reversed. In particular. Xn has the same distribution as X0 for all n. running backwards. because positive recurrence is a class property (Theorem 15). in addition. Reversibility means that (X0 . Xn ) has the same distribution as (Xn . i. the ﬁrst components of the two vectors must have the same distribution. the stationary distribution is unique. Suppose ﬁrst that the chain is reversible. n ≥ 0). We say that it is time-reversible. j ∈ S. Because the process is Markov. . and run the ﬁlm backwards. .Hence 1 ν(i). it will look the same. this state must be positive recurrent. . a ﬁnite Markov chain always has positive recurrent states. X0 ). as n → ∞. If. Consider an irreducible Markov chain with ﬁnitely many states. then we instantly know that the ﬁlm is played in reverse. Lemma 12. Xn−1 . by Theorem 16. . C satisﬁes the balance equations and the normalisation condition i∈S π(i) = 1. It is. the chain is aperiodic. then all states are positive recurrent. . . Then the chain is reversible if and only if the detailed balance equations hold: π(i)pij = π(j)pji . . But if we ﬁlm a man who walks and run the ﬁlm backwards. Xn−1 . Xn ) has the same distribution as (Xn . X1 . X0 ). X0 ) for all n. . without being able to tell that it is. . i ∈ S. A reversible Markov chain is stationary. . At least some state i must have π(i) > 0. . Taking n = 1 in the deﬁnition of reversibility. We summarise the above discussion in: Theorem 22. like playing a ﬁlm backwards. actually. Therefore π is a stationary distribution. simply. we see that (X0 . Proof. Proof. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. the chain is irreducible. If.e. where P is the transition matrix and Π is a matrix all the rows of which are equal to π. we have Pn → Π. in addition. in a sense. X1 = j) = P (X1 = i. Theorem 23. . 22 Time reversibility Consider a Markov chain (Xn . By Corollary 13. for all n. . For example. that is. P (X0 = i. . X0 = j). if we ﬁlm a standing man who raises his arms up and down successively. .

Hence cancelling the π(i) from the above display we obtain the KLC for m = 3. . Now pick a state k and write. 2) The DBE are equivalent to saying that the ratio pij /pji can be written as a product of a function of i and a function of j. In other words. Xn−1 . The same logic works for any m and can be worked out by induction. X2 = k) = P (X0 = k. Suppose ﬁrst that the chain is reversible. Summing up over all choices of states i3 . X2 = i). X1 = j. for all i. and pi1 i2 (m−3) → π(i2 ). k. By induction. X2 ) has the same distribution as (X2 . Theorem 24. X1 = i) = π(j)pji . im−1 we obtain (m−3) (m−3) pi1 i2 pi2 i1 = pi1 i2 pi2 i1 If the chain is aperiodic. i2 .for all i and j. So the chain is reversible. . X1 = j. Hence the DBE hold. X1 . X0 ). and this means that P (X0 = i. . Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. . . Pick 3 states i. we have separation of variables: pij = f (i)g(j). k. and hence (X0 . using the DBE. then the chain cannot be reversible. . (X0 . X1 . using DBE. i4 . . suppose the DBE hold. then. π(i)pij pjk = [π(j)pji ]pjk = [π(j)pjk ]pji = π(k)pkj pji . X0 ). suppose that the KLC (22) holds. pji 59 (m−3) → π(i1 ). Conversely. it is easy to show that. pi1 i2 pi2 i3 · · · pim−1 im = pi1 im pim im−1 · · · pi2 i1 . j. Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. So the detailed balance equations (DBE) hold. (22) Proof. Xn ) has the same distribution as (Xn . X1 = j) = π(i)pij . We will show the Kolmogorov’s loop criterion (KLC). we have pi2 i1 implying that π(i1 )pi1 i2 = π(i2 )pi2 i1 . and the chain is reversible. letting m → ∞. im ∈ S and all m ≥ 3. Conversely. for all n. The DBE imply immediately that (X0 . j. if there is an arrow from i to j in the graph of the chain but no arrow from j to i. . . . But P (X0 = i. X0 ). . X1 ) has the same distribution as (X1 . P (X0 = j. These are the DBE. . Then the chain is reversible if and only if the Kolmogorov’s loop criterion holds: for all i1 . . X1 . π(i)pij pjk pki = [π(j)pji ]pjk pki = [π(j)pjk ]pji pki = [π(k)pkj ]pji pki = [π(k)pki ]pji pkj = [π(i)pik ]pji pkj = π(i)pik pkj pji Since the chain is irreducible. and write. we have π(i) > 0 for all i. . . . .

This gives the following method for showing that a ﬁnite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. There is obviously only one arrow from AtoAc .(Convention: we form this ratio only when pij > 0. f (i)g(i) must be 1 (why?). meaning that it has no loops. for each pair of distinct states i. Ac be the sets in the two connected components. Hence the chain is reversible. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0.e. there would be a loop. Form the graph by deleting all self-loops. The balance equations are written in the form F (A. pji g(i) Multiplying by g(i) and summing over j we ﬁnd g(i) = j∈S g(j)pji . The reason is as follows. namely the arrow from i to j. If we remove the link between them. j for which pij > 0. Hence the original chain is reversible. the two oriented edges from i to j and from j to i by a single edge without arrow. the graph becomes disconnected. the detailed balance equations. i. by assumption. so we have pij g(j) = . And hence π can be found very easily. using another method.1) which gives π(i)pij = π(j)pji . then the chain is reversible. A) (see §4. Ac ) = F (Ac . So g satisﬁes the balance equations. and. and the arrow from j to j is the only one from Ac to A. to guarantee. then we need. (If it didn’t. Pick two distinct states i. in addition.) Necessarily. consider the chain: Replace it by: This is a tree. Example 1: Chain with a tree-structure. 60 . and by replacing. If the graph thus obtained is a tree. that the chain is positive recurrent. there isn’t any. • If the chain has inﬁnitely many states. For example.) Let A. j for which pij > 0.

i∈S where |E| is the total number of edges. and a set E of undirected edges. p02 = 1/4. States j are i are called neighbours if they are connected by an edge. A RWG is a reversible chain. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. 2. The stationary distribution of a RWG is very easy to ﬁnd. j are not neighbours then both sides are 0. etc. i. For each state i deﬁne its degree δ(i) = number of neighbours of i. Hence we conclude that: 1. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). To see this let us verify that the detailed balance equations hold. If i. Now deﬁne the the following transition probabilities: pij = 1/δ(i). Suppose the graph is connected. Each edge connects a pair of distinct vertices (states) without orientation. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C.Example 2: Random walk on a graph. The stationary distribution assigns to a state probability which is proportional to its degree. we have that C = 1/2|E|. if j is a neighbour of i otherwise. The constant C is found by normalisation.e. Consider an undirected graph G with vertices S. where C is a constant. so both members are equal to C. 61 . a ﬁnite set. the DBE hold. π(i)pij = π(j)pji . In all cases. i∈S Since δ(i) = 2|E|. 0. π(i) = 1. For example. We claim that π(i) = Cδ(i). There are two cases: If i.

in addition.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). . A r = 1. . k = 0. π is a stationary distribution for the original chain. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . 2. . 1. 1. 1. Let (St . i.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. 2. 2. The sequence (Yk . t ≥ 0) be independent from (Xn ) with S0 = 0. . St+1 = St + ξt . 2. where ξt are i. . if nk = n0 + km. random variables with values in N and distribution λ(n) = P (ξ0 = n). . (m) (m) 23. . 23. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). Deﬁne Yr := XT (r) .d.e. Due to the Strong Markov Property.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . . k = 0. i. . this process is also a Markov chain and it does have time-homogeneous transition probabilities. where pij are the m-step transition probabilities of the original chain.i. in general. If the subsequence forms an arithmetic progression. . 2. n ≥ 0) be Markov with transition probabilities pij . k = 0. if.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. The state space of (Yr ) is A. k = 0. . . In this case. 1. .3 Subordination Let (Xn .e. .) has the Markov property but it is not time-homogeneous. . . Let A be (r) a set of states and let TA be the r-th visit to A. how can we create new ones? 23. then (Yk . t ≥ 0. .e. irreducible and positive recurrent). then it is a stationary distribution for the new chain. 62 n ∈ N.

(Notation: We will denote the elements of S by i. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). in general. If. we have: Lemma 13. β). j. Proof. then the process Yn = f (Xn ). f is a one-to-one function then (Yn ) is a Markov chain. and the elements of S ′ by α.. . . that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). . n≥0 63 . ∞ λ(n)pij .Then Yt := XSt . . is NOT. . α1 . however.e. a Markov chain.) More generally. . . β. . Yn−1 ) = P (Yn+1 = β | Yn = α). . αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . We need to show that P (Yn+1 = β | Yn = α. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. . Indeed. . Fix α0 . i. . .. knowledge of f (Xn ) implies knowledge of Xn and so. Then (Yn ) has the Markov property. . Yn−1 = αn−1 }. conditional on Yn the past is independent of the future.4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . (n) 23.

Example: Consider the symmetric drunkard’s random walk. regardless of whether i > 0 or i < 0. β) P (Yn = α|Xn = i. conditional on {Xn = i}. then |Xn+1 | = 1 for sure. β ∈ Z+ . 64 . β). β). if i = 0. β) P (Yn = α. α > 0 Q(α. But P (Yn+1 = β | Yn = α. indeed. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. β) P (Yn = α|Yn−1 ) i∈S = Q(α.for all choices of the variables involved. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. i ∈ Z. if β = α ± 1. Then (Yn ) is a Markov chain. Yn = α. P (Xn+1 = i ± 1|Xn = i) = 1/2. if we let 1/2. In other words. i ∈ Z. Yn−1 ) Q(f (i). a Markov chain with transition probabilities Q(α. and so (Yn ) is. Yn−1 ) Q(f (i). β). Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. observe that the function | · | is not one-to-one. depends on i only through |i|. On the other hand. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). α = 0. To see this. β)P (Xn = i | Yn = α. β) := 1. because the probability of going up is the same as the probability of going down. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. if β = 1. Yn−1 )P (Xn = i | Yn = α. Thus (Yn ) is Markov with transition probabilities Q(α. β). i.e. Deﬁne Yn := |Xn |. Indeed. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. But we will show that P (Yn+1 = β|Xn = i). β) i∈S P (Yn = α|Xn = i. and if i = 0. for all i ∈ Z and all β ∈ Z+ . We have that P (Yn+1 = β|Xn = i) = Q(|i|.

In other words. which has a binomial distribution: i j pij = p (1 − p)i−j . The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of oﬀspring. we have that (Xn . . . . j In this case. 2. the population cannot grow: it will eventually reach 0. with probability 1. If we let ξ (n) = ξ1 . ξ2 . in general. if the current population is i there is probability pi that everybody gives birth to zero oﬀspring. So assume that p1 = 1. since ξ (0) . one question of interest is whether the process will become extinct or not. then we see that the above equation is of the form Xn+1 = f (Xn .i. therefore transient. Then P (ξ = 0) + P (ξ ≥ 2) > 0. a hard problem. The state 0 is irrelevant because it cannot be reached from anywhere.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. n ≥ 0) has the Markov property. . and following the same distribution. Hence all states i ≥ 1 are transient. so we will ﬁnd some diﬀerent method to proceed. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. It is a Markov chain with time-homogeneous transitions. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. 0 ≤ j ≤ i. We let pk be the probability that an individual will bear k oﬀspring. But here is an interesting special case. ξ (n) ). Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. Therefore. . p0 = 1 − p.d. Example: Suppose that p1 = p. are i. Thus. 1. if 0 is never reached then Xn → ∞. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. Back to the general case. each individual gives birth to at one child with probability p or has children with probability 1 − p. 0 But 0 is an absorbing state. and. We ﬁnd the population (n) at time n + 1 by adding the number of oﬀspring of each individual. independently from one another.24 24.. The parent dies immediately upon giving birth. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. . . Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . k = 0. ξ (1) . Hence all states i ≥ 1 are inessential. (and independent of the initial population size X0 –a further assumption). Let ξk be the number of oﬀspring of the k-th individual in the n-th generation. . if the 65 (n) (n) . and 0 is always an absorbing state. .

. when we start with X0 = i. The result is this: Proposition 5. and. ε(0) = 1. If µ ≤ 1 then the ε = 1. We then have that D = D1 ∩ · · · ∩ Di . For each k = 1. Proof. . If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. 66 . Let ε be the extinction probability of the process. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. . we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. if we start with X0 = i in ital ancestors.population does not become extinct then it must grow to inﬁnity. since Xn = 0 implies Xm = 0 for all m ≥ n. Obviously. we can argue that ε(i) = εi . Let ∞ µ = Eξ = k=1 kpk be the mean number of oﬀspring of an individual. . let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. i. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. so P (D|X0 = i) = εi . each one behaves independently of one another. This function is of interest because it gives information about the extinction probability ε. where ε = P1 (D). Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. We wish to compute the extinction probability ε(i) := Pi (D). Indeed. For i ≥ 1. Indeed. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . ϕn (0) = P (Xn = 0). Therefore the problem reduces to the study of the branching process starting with X0 = 1. and P (Dk ) = ε.

But ϕn+1 (0) → ε as n → ∞. Since both ε and ε∗ satisfy z = ϕ(z). and so on. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). Therefore ε = ϕ(ε). all we need to show is that ε < 1. But ϕn (0) → ε. then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. By induction. in fact. So ε ≤ ε∗ < 1. since ϕ is a continuous function. ε = ε∗ . Notice now that µ= d ϕ(z) dz . ϕn+1 (z) = ϕ(ϕn (z)). and. But 0 ≤ ε∗ . We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. ϕn (0) ≤ ε∗ . recall that ϕ is a convex function with ϕ(1) = 1. ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). ϕn+1 (0) = ϕ(ϕn (0)).Note also that ϕ0 (z) = 1.) 67 . Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . Therefore ε = ε∗ . Therefore. (See right ﬁgure below. Let us show that. if µ > 1. z=1 Also. ϕ(ϕn (0)) → ϕ(ε). Also ϕn (0) → ε. if the slope µ at z = 1 is ≤ 1. This means that ε = 1 (see left ﬁgure below). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). and since the latter has only two solutions (z = 0 and z = ε∗ ). On the other hand. In particular.

Ethernet has replaced Aloha.. . 2.d. is that if more than one hosts attempt to transmit a packet. Abandoning this idea.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. If s ≥ 2. in a simpliﬁed model. So. on the next times slot. k ≥ 0. One obvious solution is to allocate time slots to hosts in a speciﬁc order. 2. Thus. are allocated to hosts 1. there is collision and the transmission is unsuccessful. Since things happen fast in a computer network. . Case: µ > 1. 68 . there are x + a packets. 2. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. . periodically. 3 hosts. there is a collision and. then max(Xn + An − 1. 1. otherwise. Xn+1 = Xn + An . 0). on the next times slot. Then ε < 1. . We assume that the An are i. 24. there are x + a − 1 packets. So if there are. The problem of communication over a common channel.i. 3. and Sn the number of selected packets. 1. say. there is no time to make a prior query whether a host has something to transmit or not. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. An the number of new packets on the same time slot. During this time slot assume there are a new packets that need to be transmitted. if Sn = 1. If collision happens. only one of the hosts should be transmitting at a given time. Let s be the number of selected packets. using the same frequency.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. 3. 5. 6. . then the transmission is not successful and the transmissions must be attempted anew. If s = 1 there is a successful transmission and. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). . then time slots 0. random variables with some common distribution: P (An = k) = λk . we let hosts transmit whenever they like. for a packet to go through. 1. Then ε = 1. 3. 4. Nowadays.

Since p > 0.x−1 = P (An = 0. For x > 0. and exactly one selection: px. where G(x) is a function of the form α + βx . where q = 1 − p. if Xn + An = 0. . we must have no new packets arriving. The Aloha protocol is unstable.e.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . So P (Sn = k|Xn = x. we think as follows: to have a reduction of x by 1. The reason we take the maximum with 0 in the above equation is because. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). the Markov chain (Xn ) is transient. and p. The probabilities px. then the chain is transient. Idea of proof. We also let µ := k≥1 kλk be the expectation of An . we have q < 1.x+k for k ≥ 1. We here only restrict ourselves to the computation of this expected change. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). deﬁned by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. 69 . n ≥ 0) is a Markov chain. The random variable Sn has. and so q x tends to 0 much faster than G(x) tends to ∞. i. The main result is: Proposition 6. for example we can make it ≤ µ/2. Therefore ∆(x) ≥ µ/2 for all large x. So we can make q x G(x) as small as we like by choosing x suﬃciently large.) = x + a k x−k p q .x−1 + k≥1 kpx.x are computed by subtracting the sum of the above from 1. To compute px. we have that the drift is positive for all large x and this proves transience. we have ∆(x) = (−1)px. The process (Xn .e. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. To compute px. Xn−1 . selections are made amongst the Xn + An packets. . Indeed. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . Unless µ = 0 (which is a trivial case: no arrivals!). k 0 ≤ k ≤ x + a. so q x G(x) tends to 0 as x → ∞.where λk ≥ 0 and k≥0 kλk = 1.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . Sn = 1|Xn = x) = λ0 xpq x−1 . we should not subtract 1.x−1 when x > 0. k ≥ 1. So: px. Let us ﬁnd its transition probabilities. Proving this is beyond the scope of the lectures. An = a. . An−1 .

to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. A new use of PageRank is to rank academic doctoral programs based on their records of ﬁnding jobs for their graduates. If L(i) = ∅. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. the chain moves to a random page. Deﬁne transition probabilities as follows: 1/|L(i)|. 2. 3. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. PageRank. then i is a ‘dangling page’. Then it says: page i has higher rank than j if π(i) > π(j). Therefore.2) and let α pij := (1 − α)pij + . So this is how Google assigns ranks to pages: It uses an algorithm. unless i is a dangling page. The idea is simple. Comments: 1. known as ‘302 Google jacking’. α ≈ 0. otherwise. It uses advanced techniques from linear algebra to speed up computations. the rank of a page may be very important for a company’s proﬁts. if j ∈ L(i) pij = 1/|V |. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. Academic departments link to each other by hiring their 70 . the next location Xn+1 is picked uniformly at random amongst all pages that i links to. It is not clear that the chain is irreducible. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. neither that it is aperiodic.24. if L(i) = ∅ 0. There is a technique. So if Xn = i is the current location of the chain. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. |V | These are new transition probabilities. We pick α between 0 and 1 (typically. because of the size of the problem. All you need to do to get high rank is pay (a lot of money). In an Internet-dominated society. that allows someone to manipulate the PageRank of a web page and make it much higher. For each page i let L(i) be the set of pages it links to. So we make the following modiﬁcation. Its implementation though is one of the largest matrix computations in the world. in the latter case. there are companies that sell high-rank pages.

colour of eyes) appears in two forms. We are now ready to deﬁne the Markov chain Xn representing the number of alleles of type A in generation n. In this ﬁgure there are 4 alleles of type A (circles) and 6 of type a (squares). Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their oﬀspring did not inherit this allele. This means that the external appearance of a characteristic is a function of the pair of alleles. When two individuals mate. We will also assume that the parents are replaced by the children. the genetic diversity depends only on the total number of alleles of each type. Do the same thing 2N times. it makes the evaluation procedure easy. generation after generation. then. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). This number could be independent of the research quality but. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. We will assume that a child chooses an allele from each parent at random. meaning how many characteristics we have.g. a gene controlling a speciﬁc characteristic (e. We would like to study the genetic diversity.4 The Wright-Fisher model: an example from biology In an organism. a dominant (say A) and a recessive (say a). One of the circle alleles is selected twice. Thus. Since we don’t care to keep track of the individuals’ identities. maintain this allele for the next generation. And so on. One of the square alleles is selected 4 times. If so. this allele is of type A with probability x/2N . 24. their oﬀspring inherits an allele from each parent. 5. But an AA mating with a Aa may produce a child of type AA or one of type Aa. The alleles in an individual come in pairs and express the individual’s genotype for that gene. in each generation. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. Imagine that we have a population of N individuals who mate at random producing a number of oﬀspring. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. The process repeats. all we need is the information outlined in this ﬁgure. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. from the point of view of an administrator. Three of the square alleles are never selected. To simplify. 71 . We have a transition from x to y if y alleles of type A were produced. 4.faculty from each other (and from themselves). two AA individuals will produce and AA child. What is important is that the evaluation can be done without ever looking at the content of one’s research. we will assume that. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output.

d. Let M = 2N . the states {1.2N = 1. Similarly. . n n where. ξM are i. the expectation doesn’t change and. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected.e.Each allele of type A is selected with probability x/2N . Xn ). EXn+1 = EXn Xn = Xn . and the population becomes one consisting of alleles of type a only. . . The fact that M is even is irrelevant for the mathematical model. . Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . the number of alleles of type A won’t change. The result is correct. Clearly. . (Here. but the argument needs further justiﬁcation because “the end of time” is not a deterministic time. since. y ≤ 2N − 1. since. . 72 .) One can argue that. the random variables ξ1 . . . on the average. M n P (ξk = 0|Xn . . X0 ) = 1 − Xn . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. . ϕ(x) = x/M. with n P (ξk = 1|Xn . . so both 0 and 2N are absorbing states. . we see that 2N x 2N −y x y pxy = . . . . These are: M ϕ(0) = 0. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . . ϕ(x) = y=0 pxy ϕ(y). M and thus. . X0 ) = Xn . p00 = p2N. i. the chain gets absorbed by either 0 or 2N . We show that our guess is correct by proceeding via the mundane way of verifying that it satisﬁes the ﬁrst-step analysis equations. conditionally on (X0 . X0 ) = E(Xn+1 |Xn ) = M This implies that. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. . Since selections are independent. . ϕ(M ) = 1. E(Xn+1 |Xn . 2N − 1} form a communicating class of inessential states. Since pxy > 0 for all 1 ≤ x. Clearly. for all n ≥ 0. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). 0 ≤ y ≤ 2N.i. . . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1.e. on the average. . . as usual. M Therefore. i. . Tx := inf{n ≥ 1 : Xn = x}. and the population becomes one consisting of alleles of type A only.

we will try to solve the recursion. Since the Markov chain is represented by this very simple recursion.d. .i. Xn electronic components at time n. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. Otherwise. and so. and E(An − Dn ) = −µ < 0. on the average. . 2. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. the demand is met instantaneously. We will show that this Markov chain is positive recurrent. we ﬁnd X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. If the demand is for Dn components.5 A storage or queueing application A storage facility stores. the demand is only partially satisﬁed. 73 . and the stock reached level zero at time n + 1. the demand is. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. M −1 y−1 x M y−1 1− x . Working for n = 1. provided that enough components are available. Dn ). We have that (Xn . A number An of new components are ordered and added to the stock. larger than the supply per unit of time. Thus. for all n ≥ 0. n ≥ 0) is a Markov chain. We will apply the so-called Loynes’ method. n ≥ 0) are i. at time n + 1. Here are our assumptions: The pairs of random variables ((An . while a number of them are sold to the market. there are Xn + An − Dn components in the stock. . M x M (M −1)−(y−1) x x +1− M M M −1 = 24. b). Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y .The ﬁrst two are obvious. Let ξn := An − Dn . so that Xn+1 = (Xn + ξn ) ∨ 0. say.

ξn−1 .d. (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . . Therefore. we ﬁnd π(y) = x≥0 π(x)pxy . ξ0 ). ξn−1 ) = (ξn−1 . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisﬁes the balance equations. = x≥0 Since the n-dependent terms are bounded. so we can shuﬄe them in any way we like without altering their joint distribution. where the latter equality follows from the fact that. . for y ≥ 0. instead of considering them in their original order. using π(y) = limn→∞ P (Mn = y). in computing the distribution of Mn which depends on ξ0 . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). . and. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). . . 74 . . . . the event whose probability we want to compute is a function of (ξ0 . Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . In particular. Notice. . we may replace the latter random variables by ξ1 . . . ξn−1 ). we reverse it. . . . that d (ξ0 . however. (n) . Indeed. ξn . we may take the limit of both sides as n → ∞.Thus. for all n. . Hence. . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. . . . . the random variables (ξn ) are i.i.

So π satisﬁes the balance equations. n→∞ This implies that P ( lim Zn = −∞) = 1. Z2 . n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . This will be the case if g(y) is not identically equal to 1. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. the Markov chain may be transient or null recurrent. . as required. .} > y) is not identically equal to 1. we will use the assumption that Eξn = −µ < 0. We must make sure that it also satisﬁes y π(y) = 1. Here. Depending on the actual distribution of ξn . . Z2 . 75 . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. . n→∞ Hence g(y) = P (0 ∨ max{Z1 . .} < ∞) = 1. The case Eξn = 0 is delicate. .

A random walk possesses an additional property. that of spatial homogeneity. p(x) = 0. ±ed } otherwise 76 . . If d = 1. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. we can let S = Zd . we won’t go beyond Zd . d 0. if x = ej . otherwise where p + q = 1.and space-homogeneity. but. j = 1.g. Deﬁne pj . this walk enjoys. More general spaces (e. groups) are allowed. given a function p(x). here. d p(x) = qj . p(−1) = q. the set of d-dimensional vectors with integer coordinates. So. a structure that enables us to say that px. . namely. j = 1. .y = p(y − x). so far. This is the drunkard’s walk we’ve seen before. .and space-homogeneous transition probabilities given by px. To be able to say this. This means that the transition probability pxy should depend on x and y only through their relative positions in space. Our Markov chains. A random walk of this type is called simple random walk or nearest neighbour random walk. a random walk is a Markov chain with time. we need to give more structure to the state space S. if x ∈ {±e1 . x ∈ Zd .y = px+z. if x = −ej . 0. the skip-free property. . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. where C is such that x p(x) = 1. . where p1 + · · · + pd + q1 + · · · + qd = 1.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. have been processes with timehomogeneous transition probabilities. otherwise. . Example 3: Deﬁne p(x) = 1 2d . .y+z for any translation z. then p(1) = p. For example. . . In dimension d = 1. Example 1: Let p(x) = C2−|x1 |−···−|xd | . . besides time. .

x + ξ2 = y) = y p(x)p(y − x). which. the initial state S0 is independent of the increments. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . ξ2 . . Let us denote by Sn the state of the walk at time n.) Now the operation in the last term is known as convolution. because all we have done is that we dressed the original problem in more fancy clothes. . x ∈ Zd is deﬁned as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. But this would be nonsense. So.d. p ∗ δz = p.This is a simple random walk. If d = 1. Indeed. . Furthermore. We will let p∗n be the convolution of p with itself n times. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). then p(1) = p(−1) = 1/2. The special structure of a random walk enables us to give more detailed formulae for its behaviour. are i. random variables in Zd with common distribution P (ξn = x) = p(x). (When we write a sum without indicating explicitly the range of the summation variable.i. we will mean a sum over all possible values. we can reduce several questions to questions about this random walk. . Obviously. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . in general. otherwise. and this is the symmetric drunkard’s walk. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). We shall resist the temptation to compute and look at other 77 . p(x) = 0. The convolution of two probabilities p(x). x ∈ Zd . in addition. where ξ1 . . We call it simple symmetric random walk. .) In this notation. computing the n-th order convolution for all n is a notoriously hard problem. p′ (x). Furthermore. We refer to ξn as the increments of the random walk. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. (Recall that δz is a probability distribution that assigns probability 1 to the point z. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . . One could stop here and say that we have a formula for the n-step transition probability. x ± ed . y p(x)p′ (y − x).

we should assign. If not. . When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . In the sequel. to each initial position and to each sequence of n signs.i.1) − (3. but its practice may be hard.1) + − (5. and so #A P (A) = n . where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. −1. 2) → (5. Since there are 2n elements in {0. 2 where #A is the number of elements of A. Therefore.d. random variables (random signs) with P (ξn = ±1) = 1/2. . + (1. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that diﬀerent questions we need to ask or (b) that approximate methods are more informative than exact ones. A ⊂ {−1. if we can count the number of elements of A we can compute its probability. +1}n . 1) → (4. all possible values of the vector are equally likely.0) We can thus think of the elements of the sample space as paths.i−1 = 1/2 for all i ∈ Z. So. . 2) → (3.methods. 1) → (2. +1}n . If yes. . First. Events A that depend only on these the ﬁrst n random signs are subsets of this sample space. After all.i+1 = pi. 0) → (1. Here are a couple of meaningful questions: A. As a warm-up.1) + (0. . a random vector with values in {−1. ξn ). we are dealing with a uniform distribution on the sample space {−1. 1}n . for ﬁxed n. consider the following: 78 . we shall learn some counting methods. . we have P (ξ1 = ε1 . Clearly. and the sequence of signs is {+1. . −1} then we have a path in the plane that joins the points (0. . So if the initial position is S0 = 0. 1): (2. Will the random walk ever return to where it started from? B. The principle is trivial. Look at the distribution of (ξ1 . +1}n . how long will that take? C. ξn = εn ) = 2−n . +1. a path of length n. +1.2) where the ξn are i.2) (4.

y) is given by: #{(m. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. i)} = n n+i 2 More generally. Proof. i). in this case. i). 0) (n. We can easily count that there are 6 paths in A. 0) and end up at the speciﬁc point (n. y)}. i). We use the notation {(m. S4 = −1. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. x) and end up at (n. S7 < 0}. x) (n. (There are 2n of them.0) The lightly shaded region contains all paths of length n that start at (0. In if n + i is not an even number then there is no path that starts from (0. n 79 . Let us ﬁrst show that Lemma 14. Consider a path that starts from (0.1 Paths that start and end at speciﬁc points Now consider a path that starts from a speciﬁc point and ends up at a speciﬁc point. and compute the probability of the event A = {S2 ≥ 0.047. (n. x) and end up at (n. 0) and ends at (n. Hence (n+i)/2 = 0.Example: Assume that S0 = 0. The paths comprising this event are shown below. y)} = n−m . S6 < 0. x) (n. 0) and end up at (n.) The darker region shows all paths that start at (0.i) (0. y)} =: {all paths that start at (m. 0) and n ends at (n. y) is given by: #{(0. (n − m + y − x)/2 (23) Remark. the number of paths that start from (m. 0)). The number of paths that start from (0.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the ﬁgure), there is no path that goes from (0, 0) to (3, 2). We shall thus deﬁne n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

**Paths that start and end at speciﬁc points and do not touch zero at all
**

(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =

1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the speciﬁc point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the ﬁrst time it hits zero and then reﬂecting it around the horizontal axis. In the ﬁgure below, follow the solid path starting at (m, x) up to the ﬁrst time it hits zero and then follow the dotted path: this is the reﬂected path.

(n, y)

(m, x)

(n, −y)

A path and its reﬂection around a horizontal line. Notice that the reﬂection starts only at the point where the path touches the horizontal line.

Note that if we know the reﬂected path we can recover the original path. Hence we may as well count the number of reﬂected paths. But each reﬂected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =

1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)

1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

**Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
**

1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n

n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n

n 2

= #{(0, 1) n

n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1

n−y 2

(n, y + 1); remain > 0}

− −

n

n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

−

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the ﬁrst one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

**Distribution after n steps
**

n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

y)} . given that S0 = 0 and Sn = y > 0. .P (Sn = y|Sm = x) = In particular. The probability that. x) 2n−m (n. 27. P0 (S1 ≥ 0 . 1) (n. Sn ≥ 0 | Sn = 0) = 84 1 . Sn ≥ 0 | Sn = y) = 1 − In particular. . If n. n−m 2−n−m . . . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . Such a simple answer begs for a physical/intuitive explanation. 0) (n. n This is called the ballot theorem and the reason will be given in Section 29. if n is even. We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). S2 > 0. n 2−n . y) that do not touch zero} × 2−n #{paths (0. P0 (Sn = y) 2 (n + y) + 1 . . (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. . the random walk never becomes zero up to time n is given by P0 (S1 > 0. for now. n 2 #{paths (m. 27. 0) (n. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . But. y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . if n is even. n/2 (29) #{(0. y)} . . .3 Some conditional probabilities Lemma 17. n (30) Proof. we shall just compute it. .2 The ballot theorem Theorem 25. . The left side of (30) equals #{paths (1. Sn > 0 | Sn = y) = y .

. . Sn ≥ 0) = #{(0. . Sn ≥ 0. . . . and Sn−1 > 0 implies that Sn ≥ 0. Sn ≥ 0). . . Sn ≥ 0) = P0 (S1 = 0. . . . . 85 . . . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. . . (c) (d) (e) (f ) (g) where (a) is obvious.). Sn < 0) (b) (a) = 2P0 (S1 > 0. . and ﬁnally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. Proof. . ξ3 . . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0.4 Some remarkable identities Theorem 26. 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . . . . . If n is even. ξ3 . . . . . . Sn = 0) = P0 (Sn = 0). . . . For even n we have P0 (S1 ≥ 0. . . . . . (b) follows from symmetry. We use (27): P0 (Sn ≥ 0 . . . . . . ξ2 + ξ3 ≥ 0. . . . .) by (ξ1 . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . . P0 (S1 ≥ 0. (e) follows from the fact that ξ1 is independent of ξ2 . . . 0) (n. 1 + ξ2 + ξ3 > 0. . Sn = 0) = P0 (S1 > 0. . . ξ2 .Proof. We next have P0 (S1 = 0. (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . . n/2 where we used (28) and (29). Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . (f ) follows from the replacement of (ξ2 . . . +1 27. •). ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. Sn > 0) + P0 (S1 < 0. . .

Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). Sn = 0) P0 (S1 > 0. and u(n) = P0 (Sn = 0). for even n. Sn = 0) + P0 (S1 < 0. Let n be even. P0 (Sn−1 > 0. Then f (n) := P0 (S1 = 0. We now want to ﬁnd the probability f (n) that the walk will return to 0 for the ﬁrst time in n steps. . . with equal probability. Theorem 27. 86 . If a particle is to the right of a then you see its image on the left. So 1 f (n) = 2u(n) P0 (S1 > 0. . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). . Fix a state a. Also. n−1 28 The reﬂection principle for a simple random walk in dimension 1 The reflection principle says the following. given that Sn = 0. we computed. Put a two-sided mirror at a. . Sn = 0) Now. we see that Sn−1 > 0. Sn−2 > 0|Sn−1 > 0. . . Sn−1 < 0. Sn−2 > 0|Sn−1 = 1). . . Sn = 0) = u(n) . . we either have Sn−1 = 1 or −1. Sn = 0. So f (n) = 2P0 (S1 > 0. . To take care of the last term in the last expression for f (n). . It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. . n We know that P0 (Sn = 0) = u(n) = n/2 2−n . By symmetry. the probability u(n) that the walk (started at 0) attains value 0 in n steps. f (n) = P0 (S1 > 0. Sn = 0 means Sn−1 = 1.5 First return to zero In (29). . So u(n) f (n) = . . So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. This is not so interesting. Sn−1 = 0. . By the Markov property at time n − 1. . Sn = 0). the two terms must be equal. n−1 Proof. we can omit the last event. Sn−1 > 0. . . .27. Sn−1 > 0. . So the last term is 1/2. Since the random walk cannot change sign before becoming zero. . Sn = 0) = 2P0 (Sn−1 > 0. and if the particle is to the left of a then its mirror image is on the right. . .

n ≥ Ta By the strong Markov property. So: P0 (Sn ≥ a) = P0 (Ta ≤ n. Then Ta = inf{n ≥ 0 : Sn = a} Sn . 2a − Sn ≥ a) = P0 (Ta ≤ n. the process (STa +m − a. Sn ≤ a). Proof. Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). we have n ≥ Ta . n ∈ Z+ ). In the last event. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). . deﬁne Sn = where is the ﬁrst hitting time of a. observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). 2 Deﬁne now the running maximum Mn = max(S0 . But a symmetric random walk has the same law as its negative. m ≥ 0) is a simple symmetric random walk. Sn ). If Sn ≥ a then Ta ≤ n. 2 Proof. Theorem 28. we obtain 87 . .What is more interesting is this: Run the particle till it hits a (it will. P0 (Sn ≥ a) = P0 (Ta ≤ n. 2a − Sn . Trivially. n ≥ 0) be a simple symmetric random walk and a > 0. S1 . Then.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. . Sn ≤ a) + P0 (Sn > a). n ∈ Z+ ) has the same law as (Sn . So Sn − a can be replaced by a − Sn in the last display and the resulting process. . for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . independent of the past before Ta . The last equality follows from the fact that Sn > a implies Ta ≤ n. Let (Sn . Sn ≤ a) + P0 (Ta ≤ n. n < Ta . n < Ta . n ≥ Ta . Finally. as a corollary to the above result. Thus. In other words. called (Sn . so we may replace Sn by 2a − Sn (Theorem 28). But P0 (Ta ≤ n) = P0 (Ta ≤ n. Sn = Sn . 28. Sn > a) = P0 (Ta ≤ n. a + (Sn − a). Sn ≥ a).

we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. An urn is a vessel. Acad. e 10 Andr´. A sample from the urn is a sequence η = (η1 . posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. we have P0 (Mn < x. . during sampling without replacement. gives the result. 369. (1887). Sn = y). employed for diﬀerent purposes. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. Sci. then the probability that. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). 105. J. 9 Bertrand. . 2 Proof. and anciently for holding ballots to be drawn. . as long as a ≥ b. (31) x > y. Acad. n = a + b. Compt. Paris. The original ballot problem. Compt. The answer is very simple: Theorem 31. P0 (Mn < x. If an urn contains a azure items and b = n−a black items. combined with (31). Bertrand. 8 88 . Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . If Tx ≤ n.Corollary 19. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. ornamental objects. Since Mn < x is equivalent to Tx > n. usually a vase furnished with a foot or pedestal. Solution directe du probl`me r´solu par M. Sci. where ηi indicates the colour of the i-th item picked. It is the latter use we are interested in in Probability and Statistics. 105. 436-437. the ashes of the dead after cremation. This. Sn = 2x − y). Sn = y) = P0 (Sn = 2x − y). Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Rend. the number of selected azure items is always ahead of the black ones equals (a − b)/n. . D. of which a are coloured azure and b black. (1887). Proof. e e e Paris. which results into P0 (Tx ≤ n. by applying the reﬂection principle (Theorem 28). But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. ηn ). Sn = y) = P0 (Tx ≤ n. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Sn = y) = P0 (Tx > n. then. as for holding liquids. Rend. Solution d’un probl`me. The set of values of η contains n elements and η is uniformly distributed a in this set.

n Since y = a − (n − a) = a − b. . We have P0 (Sn = k) = n n+k 2 pi. Since an urn with parameters n. a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. . . . . a. . . −n ≤ k ≤ n. conditional on {Sn = y}.i−1 = q = 1 − p. ξn = εn ) P0 (Sn = y) 2−n 1 = = . . the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. Then. Consider a simple symmetric random walk (Sn . a − b = y. . p(n+k)/2 q (n−k)/2 . Proof of Theorem 31. . so ‘azure’ items are +1 and ‘black’ items are −1. b be deﬁned by a + b = n. . . . i ∈ Z. conditional on the event {Sn = y}. (ξ1 . indeed. Let ε1 . . . But in (30) we computed that y P0 (S1 > 0. Proof. . ξn ) is. Sn > 0 | Sn = y) = . .i+1 = p. the sequence (ξ1 . We need to show that. . . using the deﬁnition of conditional probability and Lemma 16. . . letting a. (The word “asymmetric” will always mean “not necessarily symmetric”. conditional on {Sn = y}. . +1} such that ε1 + · · · + εn = y. Then. . But if Sn = y then. . the result follows. n n −n 2 (n + y)/2 a Thus. where y = a − (n − a) = 2a − n. . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. . . we can translate the original question into a question about the random walk. ξn ) has a uniform distribution. always assuming (without any loss of generality) that a ≥ b = n − a. An urn can be realised quite easily be a random walk: Theorem 32. . . . . εn be elements of {−1. 89 . . .) So pi. . So the number of possible values of (ξ1 . we have: P0 (ξ1 = ε1 . the ‘black’ items correspond to − signs). Sn > 0} that the random walk remains positive. n . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. ξn = εn | Sn = y) = P0 (ξ1 = ε1 . n ≥ 0) with values in Z and let y be a positive integer. a we have that a out of the n increments will be +1 and b will be −1. the sequence (ξ1 .We shall call (η1 . The values of each ξi are ±1. Since the ‘azure’ items correspond to + signs (respectively. . ξn ) is uniformly distributed in a set with n elements. . . but which is not necessarily symmetric. . . . ηn ) an urn process with parameters n. It remains to show that all values are equally a likely.

Lemma 19. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. x ≥ 0) is also a random walk. plus a time T1 required to go from −1 to 0 for the ﬁrst time. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. x − 1 in order to reach x. . The rest is a consequence of the strong Markov property.d. If S1 = 1 then T1 = 1. . 90 . Let ϕ(s) := E0 sT1 . See ﬁgure below. We will approach this using probability generating functions. In view of Lemma 19 we need to know the distribution of T1 . Consider a simple random walk starting from 0. plus a ′′ further time T1 required to go from 0 to +1 for the ﬁrst time.} ∪ {∞}. In view of this. 2. for x > 0. For all x. Proof. . and all t ≥ 0. it is no loss of generality to consider the random walk starting from 0. Start with S0 = 0 and watch the process until the ﬁrst n = T1 such that Sn = 1. E0 sTx = ϕ(s)x . Then the probability generating function of the ﬁrst time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . Px (Ty = t) = P0 (Ty−x = t).i. Consider a simple random walk starting from 0. we make essential use of the skip-free property. only the relative position of y with respect to x is needed. . Proof. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. (We tacitly assume that S0 = 0.30. Corollary 20. random variables.) Then. x ∈ Z. This random variable takes values in {0.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. To prove this. If x > 0. 2. 1. y ∈ Z. Lemma 18. 2qs Proof. . The process (Tx . Theorem 33. . . Consider a simple random walk starting from 0.

if x < 0 2ps Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. The formula is obtained by solving the quadratic. 2ps Proof. Then the probability generating function of the ﬁrst time that level x ∈ Z is reached is given by √ x E sT1 x = 1− 1−4pqs2 . Consider a simple random walk starting from 0. The ﬁrst time n such that Sn = −1 is the ﬁrst time n such that −Sn = 1.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. 0 if x > 0 2qs Tx (32) E0 s = √ |x| E sT−1 |x| = 1− 1−4pqs2 0 . 91 . the RW must ﬁrst hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Then the probability generating function of the ﬁrst time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . in order to hit 1. Hence the previous formula applies with the roles of p and q interchanged. Consider a simple random walk starting from 0. from the strong Markov property. Corollary 21. Corollary 22.

this was found in §30. For a symmetric ′ random walk. If α = n is a positive integer then n (1 + x)n = k=0 n k x . E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . the same as the time required for the walk to hit 1 starting ′ from 0. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . and simply write (1 + x)n = k≥0 n k x .2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1.3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . The latter is. in distribution. We again do ﬁrst-step analysis. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . For the general case. 1 − 4pqs2 . then we can omit the k upper limit in the sum. if S1 = 1. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. k 92 . then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. If we deﬁne n to be equal to 0 for k > n. k by the binomial theorem. Similarly.2 First return to the origin Now let us ask: If the particle starts at 0. where α is a real number.30. how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. 2ps ′ =1− 30. Consider a simple random walk starting from 0. Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof.

−1 k so (1 − x)−1 = as needed. We have k≥0 1 − x = (1 − x)1/2 = 1/2 (−1)k xk . 2k − 1 k 93 . k! k! (−1)2k xk = k≥0 xk . but that we won’t worry about which ones. As another example. by deﬁnition. k! The adverb ‘formally’ is supposed to mean that this formula holds for some x around 0. and it looks exactly the same: (1 + x)α = k≥0 n(n − 1) · · · (n − k + 1) n! = . (−1)k = 2k − 1 k k and this is shown by direct computation. √ 1 − x. k as long as we deﬁne α k = α(α − 1) · · · (α − k + 1) . k We can write this in more familiar symbols if we observe that 2k −k 1 1/2 4 . (1 − x)−1 = But. let us compute √ (−1)k (−x)k = k≥0 k≥0 k≥0 −1 (−x)k k = (−1)k k! (−1)(−2) · · · (−1 − k + 1) = = (−1)k . For example. Indeed. So √ 1−x= ′ k≥0 1 2k k x =1+ 2k − 1 k 1 − 4pqs2 : ′ k≥1 2k k 1 x 2k − 1 k We apply this to E0 sT0 = 1 − E0 s T 0 = k≥1 1 2k (pq)k s2k . k!(n − k)! k! α k x . we will show that (1 − x)−1 = 1 + x + x2 + x3 + · · · which is of course the all too-familiar geometric series.Recall that n k = It turns out that the formula is formally true for general α.

By applying the law of large numbers we have that P0 Sn =p−q n→∞ n lim = 1. if x < 0 |x| In particular. We apply this to (32).But. E0 s T 0 = n≥0 ′ ′ ′ P0 (T0 = n)sn . by the deﬁnition of E0 sT0 . • If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient. That it is actually null-recurrent follows from Markov chain theory. The quantity inside the square root becomes 1 − 4pq. if p = q then P0 (Tx < ∞) = 1 for all x ∈ Z. 2k − 1 k 2k − 1 31 Recurrence properties of simple random walks Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. The key fact needed here is: P0 (Tx < ∞) = lim E0 sTx . Hence it is transient. without appeal to the Markov chain theory. s↑1 which is true for any probability generating function. x if x > 0 . if x < 0 This formula can be written more neatly as follows: Theorem 35. But remember that q = 1 − p. Comparing the two ‘inﬁnite polynomials’ we see that we must have n = 2k (we knew that– the random walk can return to 0 only in an even number of steps) and then ′ P0 (T0 = 2k) = 1 1 2k k k p q = P0 (S2k = 0). so 1 − 4pq = So P0 (Tx < ∞) = 1 − 4p + 4p2 = (1 − 2p)2 = |1 − 2p| = |p − q|. x 1−|p−q| 2q 1−|p−q| 2p . |x| if x > 0 (33) . 94 . We will see how the results about the generating functions of the hitting and return times help us answer these questions directly. • If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. • If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. P0 (Tx < ∞) = p q q p ∧1 ∧1 .

Consider a random walk with values in Z. the sequence Mn is nondecreasing. again. we obtain what we are looking for. . . Let ′ T0 = inf{n ≥ 1 : Sn = 0} be the ﬁrst return time to 0. Sn ). Assume p < 1/2. Proof. Otherwise. with some positive probability. it will never ever return to the point where it started from. and so it is null-recurrent. . Corollary 23. if x > 0 . but here it is < ∞. We know that it cannot be positive recurrent. if x < 0 which is what we want. every nondecreasing sequence has a limit. 31. Unless p = 1/2. 95 . Indeed. If p < q then |p − q| = q − p and.) If we let Mn = max(S0 . This value is random and is denoted by M∞ = sup(S0 . . . Then ′ P0 (T0 < ∞) = 2(p ∧ q). S2 . Then the overall maximum M∞ has a geometric distribution: P0 (M∞ ≥ x) = (p/q)x . . This means that there is a largest value that will be ever attained by the random walk. except that the limit may be ∞. this probability is always strictly less than 1. S1 . q p. P0 (M∞ ≥ x) = lim P0 (Mn ≥ x) = lim P0 (Tx ≤ n) = P0 (Tx < ∞) = (p/q)x . What is this probability? Theorem 37. then we have M∞ = limn→∞ Mn . 31. x ≥ 0.Proof. Consider a simple random walk with values in Z. The simple symmetric random walk with values in Z is null-recurrent.2 Return probabilities Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. Then limn→∞ Sn = −∞ with probability 1. Theorem 36. If p > q then |p − q| = p − q and so (33) gives P0 (Tx < ∞) = 1. n→∞ n→∞ x ≥ 0.1 Asymptotic distribution of the maximum Suppose now that p < 1/2. . with probability 1 because p < 1/2. Proof. The last part of Theorem 35 tells us that it is recurrent.

2. The probability generating function of T0 was obtained in Theorem 34. If p = q this gives 1. 31. where the last probability was computed in Theorem 37. The total number of visits to state i is given by ∞ Ji = n=1 1(Sn = i). Then. Remark 1: By spatial homogeneity. 2. . Theorem 38. This proves the ﬁrst formula. . We now compute its exact distribution. . 2(p ∧ q) E0 J0 = . And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. . k = 0. if a Markov chain starts from a state i then Ji is geometric. . 1. . So ′ P0 (J0 ≥ 1) = P0 (T0 < ∞) = 2(p ∧ q). k = 0. (34) a random variable ﬁrst considered in Section 10 where it was also shown to be geometric. . Ei Ji = 1 − 2(p ∧ q) Remark 2: The formulae work for all values of p ∈ [0. we obviously have that the same formulae apply for any i: Pi (Ji ≥ k) = (2(p ∧ q))k . meaning that Pi (Ji = ∞) = 1. For the second formula we have ∞ ∞ E0 J0 = k=1 P0 (J0 ≥ k) = k=1 (2(p ∧ q))k = 2(p ∧ q) . If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. 1. even for p = 1/2. . Consider a simple random walk with values in Z. 1 − 2(p ∧ q) this being a geometric series. But J0 ≥ 1 if and only if the ﬁrst return time to zero is ﬁnite. . 1. And so P0 (J0 ≥ k) = P0 (J0 ≥ 1)k . 1].′ Proof. as already known. 96 . the total number J0 of visits to state 0 is geometrically distributed with P0 (J0 ≥ k) = (2(p ∧ q))k . 2.3 Total number of visits to a state If the random walk is recurrent then it visits any state inﬁnitely many times. starting from zero. But ′ P0 (T0 < ∞) = lim EsT0 = 1 − s↑1 ′ 1 − 4pq = 1 − |p − q|. In the latter case. 2(p ∧ q) . But if it is transient. Pi (Ji ≥ k) = 1 for all k. 1 − 2(p ∧ q) Proof. k = 0. We now look at the the number of visits to some state but if we start from a diﬀerent state. the total number of visits to a state will be ﬁnite. . In Lemma 4 we showed that.

If p ≤ 1/2 then P0 (Jx ≥ k) = (p/q)x∨0 (2p)k−1 . then E0 Jx drops down to zero geometrically fast as x tends to inﬁnity. So we obtain P0 (Jx ≥ k) = p ∧1 q x (p/q)x∨0 1 − 2p k≥1 . we have E0 Jx = 1/(1−2p) regardless of x.e. But for x < 0. Suppose x > 0. For x > 0. there has already been one visit to state x and so we need at least k − 1 remaining visits in order to have at least k visits in total. We repeat the argument for x < 0 and ﬁnd P0 (Jx ≥ k) = q ∧1 p |x| (2(p ∧ q))k−1 . (2(p ∧ q))k−1 . If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (q/p)|x| (2q)k−1 . k ≥ 1. Consider a random walk with values in Z. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (2q)k−1 . E0 Jx = Proof. simply because if Jx ≥ k ≥ 1 then Tx < ∞. i. if the random walk “drifts down” (i. it will visit any point below 0 exactly the same number of times on the average. starting from 0. Apply now the strong Markov property at Tx . If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (p/q)x (2p)k−1 . we apply E0 Jx = k=1 P0 (Jx ≥ k) and perform the sum of a geometric series.e. Py (Jx ≥ k) = P0 (Jx−y ≥ k) and Ey Jx = E0 Jx−y . The ﬁrst term was computed in Theorem 35. E0 Jx = If p ≥ 1/2 then P0 (Jx ≥ k) = (q/p)(−x)∨0 (2q)k−1 . Consider the case p < 1/2.Theorem 39. Then P0 (Tx < ∞. The reason that we replaced k by k − 1 is because. 97 . In other words. given that Tx < ∞. and the second in Theorem 38. Remark 1: The expectation E0 Jx is ﬁnite if p = 1/2. Then P0 (Jx ≥ k) = P0 (Tx < ∞. Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae. p < 1/2) then. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (2p)k−1 . ∞ As for the expectation. Jx ≥ k). (q/p)(−x)∨0 1 − 2q k≥1 . Jx ≥ k) = P0 (Tx < ∞)Px (Jx ≥ k − 1).

ξn ) has the same distribution as (ξn . Then P0 (Tx = n) = x P0 (Sn = x) . random variables. for a simple random walk and for x > 0. Use the obvious fact that (ξ1 . (36) 2 Lastly. . 0 ≤ k ≤ n) has the same distribution as (Sk . 0 ≤ k ≤ n) of (Sk . 1 ≤ k ≤ n | Sn = x) = P0 (Sk < x. n Proof. and thus derived the generating function of Tx : E0 sTx = Specialising to the symmetric case we have E0 s Tx 1− 1 − 4pqs2 2qs x . each distributed as T1 . Sn = x) P0 (Tx = n) = . ξ2 . Theorem 40. n x > 0. 0 ≤ k ≤ n − 1.32 Duality Consider random walk Sk = k ξi . Fix an integer n. Rewrite the ballot theorem terms of the dual: P0 (S1 > 0. Here is an application of this: Theorem 41. the probabilities P0 (Tx = n) = x P0 (Sn = x). which is basically another probability for S. In Lemma 19 we observed that. . i=1 Then the dual (Sk .e. Thus. ξn−1 . Pause for reﬂection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. Let Tx = inf{n ≥ 1 : Sn = x} be the ﬁrst time that x will be visited. Let (Sk ) be a simple symmetric random walk and x a positive integer. in Theorem 41. = 1− √ 1 − s2 s x . . . . because the two have the same distribution. . n (37) 98 . 0 ≤ k ≤ n) is deﬁned by Sk := Sn − Sn−k . . for a simple symmetric random walk. Tx is the sum of x i. (35) In Theorem 29 we used the reﬂection principle and derived the distribution function of Tx for a simple symmetric random walk: 1 P0 (Tx ≤ n) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). i. Sn > 0 | Sn = x) = But the left-hand side is P0 (Sn −Sn−k > 0.i. . a sum of i. . (Sk . 0 ≤ k ≤ n).d. Proof. . we derived using duality. . P0 (Sn = x) P0 (Sn = x) x .d. every time we have a probability for S we can translate it into a probability for S. ξ1 ). .i. random variables.

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33

Amount of time that a SSRW is positive*

Consider a SSRW (Sn ) starting from 0. Deﬁne its trajectory as the polygonal line obtained by joining the points (n, Sn ) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+ (2n) be the number of sides of the polygonal line that are above the horizontal axis and T− (2n) = 2n − T+ (2n) the number of sides that are below. Clearly, T± (2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+ (2n) as the number of leads in 2n coin tosses, where we

0

T 0’

′ Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0 ,

it behaves again like a SSRW and is independent of the past.

have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads=tails). Most people take this statement and translate it into a statement about T+ (2n) arguing that P (T+ (2n) = m) is maximised when m = n, i.e. when T+ (2n) = T− (2n). This is wrong. In fact, as the theorem below shows, the most likely value of P (T+ (2n) = m) is when m = 0 or m = 2n. Theorem 42 ( distribution of the total time the random walk is positive ). P (T+ (2n) = 2k) = P (S2k = 0)P (S2n−2k = 0) , 0 ≤ k ≤ n.

′ Proof. The idea is to condition at the ﬁrst time T0 that the RW returns to 0: n ′ P (T+ (2n) = 2k, T0 = 2r). r=1

P (T+ (2n) = 2k) =

99

′ Between 0 and T0 the random walk is either above the horizontal axis or below; call the ﬁrst event A+ and the second A− and write n ′ ′ P (A+ , T0 = 2r)P (T+ (2n) = 2k | A+ , T0 = 2r) n ′ ′ P (A− , T0 = 2r)P (T+ (2n) = 2k | A− , T0 = 2r)

P (T+ (2n) = 2k) =

r=1

+

r=1

**In the ﬁrst case,
**

′ ′ P (T+ (2n) = 2k | A+ , T0 = 2r) = P (T+ (2n) = 2k−2r | A+ , T0 = 2r) = P (T+ (2n−2r) = 2k−2r), ′ because, if T0 = 2r and A+ occurs then the RW has already spent 2r units of time above the axis and so T+ (2n) is reduced by 2r. Furthermore, we may remove the conditioning ′ because the future evolution of the RW after T0 is independent of the past. In the second case, a similar argument shows ′ ′ P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n − 2r) = 2k).

**We also observe that, by symmetry, 1 ′ ′ ′ P (A+ , T0 = 2r) = P (A− , T0 = 2r) = P (T0 = 2r) = f2r . 2 Letting p2k,2n = P (T+ (2n) = 2k), we have shown that p2k,2n = 1 2
**

k

**f2r p2k−2r,2n−2r +
**

r=1

1 2

n−k

**f2r p2k,2n−2r
**

r=1

We want to show that p2k,2n = u2k u2n−2k , where u2k = P (S2k = 0). With this hypothesis, we can start playing with the recursion of the display above and using induction on n we will realise that p2k,2n 1 = u2n−2k 2

k r=1

1 f2r u2k−2r + u2k 2

n−k

f2r u2n−2k−2r .

r=1

Explicitly, we have P (T+ (2n) = 2k) = 2k k 2n − 2k −2n 2 , n−k 0 ≤ k ≤ n. (38)

With 2n = 100, we plot the function P (T100 = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

100

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0

10

20 k

30

40

50

**The arcsine law*
**

Use of Stirling’s approximation in formula (38) gives P (T+ (2n) = 2k) ∼ 1/π k(n − k) , as k → ∞ and n − k → ∞.

**Now consider the calculation of the following probability: P T+ (2n) ∈ [a, b] 2n =
**

na≤k≤nb

P

T+ (2n) k = 2n n 1/π k(n − k) =

a≤k/n≤b

∼

a≤k/n≤b

1 . k k (n − n ) n n

1/π

**This is easily seen to have a limit as n → ∞, the limit being
**

b

1/π t(1 − t)

dt,

a

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law: x √ T+ (2n) 2 1/π lim P dt = arcsin x. ≤x = n→∞ 2n π t(1 − t) 0 We thus have that the limiting distribution of T+ (2n) (the fraction of time that the RW is √ 2 positive), as n → ∞, is the distribution function π arcsin x, 0 < x < 1. This is known as the arcsine law. This distribution function has density √ 1/π , 0 < x < 1, the plot of

x(1−x)

which is amazingly close to the plot of the ﬁgure above. Now consider again a SSRW and consider the hitting times (Tx , x ≥ 0). This is another random walk because the summands in Tx = x (Ty − Ty−1 ) are i.i.d. all distributed as y=1 T1 : 2n −2n 1 P (T1 ≥ 2n) = P (S2n = 0) = 2 ∼√ . n πn 101

. the two stochastic processes are not independent. y) where x. 0)) = 1/4. ξ2 . going at random east. However. too. 0). north or south. x1 − x2 ). 2 1 If x ∈ Z2 . i=1 ξi i=1 ξi 2 1 Lemma 20. n ∈ Z+ ) is a random walk.. The two random walks are not independent. random variables and hence a random walk. 1 1 Proof. It is not a simple random walk because there is a possibility that the increment takes the value 0. The corners are described by coordinates (x. So let 1 2 1 2 1 2 Rn = (Rn . i. ξ2 . . We describe this motion as a random walk in Z2 = Z × Z: Let ξ1 .i. . Rn ) := (Sn + Sn . Indeed. 1 P (ξn = 1) = P (ξn = (1. 1)) = P (ξn = (−1. Consider a change of coordinates: rotate the axes by 45o (and scale them). 1)) + P (ξn = (0. Sn = i=1 n ξi . ξn are not independent simply because if one takes value 0 the other is nonzero. being totally drunk. 0) and. y ∈ Z. (Sn .e. west.d.From this. the random 1 2 variables ξn . (Sn . he moves east. is a random walk. . n ∈ Z+ ) is a sum of i. −1)) = 1/2. west. We have 1. for each n. 0)) = P (ξn = (1. So Sn = (Sn . n ≥ 1. we let x1 be its horizontal coordinate and x2 its vertical. 2 Same argument applies to (Sn . we immediately get that the expectation of T1 is inﬁnity: ET1 = ∞. Many other quantities can be approximated in a similar manner. the latter being functions of the former. Sn ) = n n 2 . n ∈ Z+ ) is also a random walk. ξ2 . 0)) = 1/4 1 P (ξn = −1) = P (ξn = (−1. so are ξ1 . Sn − Sn ). random vectors such that P (ξn = (0. 0)) = P (ξn = (−1. .d.i. are independent. We have: 102 . He starts at (0. Since ξ1 . . 0)) = 1/4. . 1 Hence (Sn . north or south. i. They are also identically distributed with 1 P (ξn = 0) = P (ξn = (0. . We then let S0 = (0. When he reaches a corner he moves again in the same manner. n ∈ Z+ ) showing that it. 34 Simple symmetric random walk in two dimensions Consider a drunkard in a city whose roads are arranged in a rectangular grid. x2 ) into (x1 + x2 . map (x1 . with equal probability (1/4). .

We have Rn = n (ξi + ξi ). in the sense that the ratio of the two sides goes to 1. Proof. . ηn = 1) = P (ξn = (0. b−a 103 . The same is true for (Rn ). n ∈ Z+ ) is a random walk in two dimensions. = 1)) = 1/4. . The possible values 1 . 1 where the last equality follows from the independence of the two random walks. The two-dimensional simple symmetric random walk is recurrent. Hence P (ηn = 1) = P (ηn = −1) = 1/2. n ∈ Z+ ). R2n = 0) = P (R2n = 0)(R2n = 0). n ∈ Z+ ) 1 is a random walk in two dimensions. being obtained the vector ηn = (ξn 1 2 n n n by taking the same function on the elements of the sequence ξ1 . b−a P (Ta < Tb ) = b . n ∈ Z+ ) is a random 2 . 0)) = 1/4 1 2 P (ηn = 1. . . The last two are walk in one dimension. is i.1 Lemma 21. (Rn . Lemma 22. (Rn . 1 2 1 2 P (S2n = 0) = P (R2n = 0) = P (R2n = 0. To show that the last two are independent. Let a < 0 < b. But by the previous n=0 c lemma. b−a P (STa ∧Tb = a) = b . 2n n 2 2−4n . 2 1 2 2 1 1 Proof. ξ 1 − ξ 2 ). ηn = ε′ ) = P (ηn = ε)P (ηn = ε′ ) for all choices of ε. (Rn + independent. The sequence of vectors η . n ∈ Z ) is a random walk in one dimension. And so 2 1 2 1 P (ηn = ε.i. ηn = −1) = P (ξn = (0. where c is a positive constant. Rn = n (ξi − ξi ). n ∈ Z ) is a random walk in one dimension. ηn = 1) = P (ξn = (1. . η . 2 . 35 Skorokhod embedding* We take a closer look at the gambler’s ruin problem for a simple symmetric random walk in one dimension. 1 1 2 2 Hence P (ηn = 1) = P (ηn = −1) = 1/2. Recall. ε′ ∈ {−1. P (S2n = 0) ∼ n . ξ2 . 1)) = 1/4 1 2 P (ηn = −1. as n → ∞. Similar argument shows that each of (Rn . By Lemma 5. from Theorem 7. ηn = −1) = P (ξn = (−1. n Corollary 24. η 2 are ±1. +1}. (Rn + 2 2 1 2 1 1 we check that ηn = ξn + ξn .d. So Rn = n ηn where ηn is i=1 i=1 i=1 1 + ξ 2 . . So P (R2n = 0) = 2n 2−2n . and by Stirling’s approximation. We have of each of the ηn n 1 2 P (ηn = 1. But (Rn ) 1 2 is a simple symmetric random walk. we need to show that ∞ P (Sn = 0) = ∞.. 0)) = 1/4 1 2 P (ηn = −1. Hence (Rn . P (S2n = 0) = Proof. that for a simple symmetric random walk (started from S0 = 0). b−a −a . ηn = ξn − ξn are independent for each n. P (Tb < Ta ) = We can read this as follows: P (STa ∧Tb = b) = −a .

P (A = 0. B = 0) = p0 . Then Z = 0 if and only if A = B = 0 and this has 104 . Note. i ∈ Z) a given probability distribution on Z with zero mean: i∈Z ipi = 0. i ∈ Z. and the probability it exits from the right point is 1 − p. The probability it exits it from the left point is p. c= i<0 (−i)pi = i>0 Letting Ti = inf{n ≥ 0 : Sn = i}. B = 0) + P (A = i. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the ﬁrst time it exits the interval [a. Indeed. How about more complicated distributions? Let suppose we are given a probability distribution (pi . where c−1 is chosen by normalisation: P (A = 0. and assuming that (A. N ∈ N. 1 − p). b. Can we ﬁnd a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this. To see this. a < 0 < b such that pa + (1 − p)b = 0.This last display tells us the distribution of STa ∧Tb . in particular. Proof. using this. i ∈ Z. i ∈ Z). Deﬁne a pair of random variables (A. B) taking values in {(0.g. e. we consider the random variable Z = STA ∧TB . that ESTa ∧Tb = 0. Theorem 43. a = M − N . b = N . B) are independent of (Sn ). suppose ﬁrst k = 0. if p = M/N . we can always ﬁnd integers a. 0)} ∪ (N × N) with the following distribution: P (A = i. we get that c is precisely ipi . Let (Sn ) be a simple symmetric random walk and (pi . i < 0 < j. B = j) = 1. k ∈ Z. b]. M < N . B = j) = c−1 (j − i)pi pj . The claim is that P (Z = k) = pk . such that p is rational. M. choose. we have this common value: i<0 (−i)pi = and. given a 2-point probability distribution (p. Conversely. Then there exists a stopping time T such that P (ST = i) = pi . i<0 j>0 This leads to: c−1 j>0 jpj i<0 pi + c−1 i<0 (−i)pi j>0 i>0 ipi pj = 1 − p0 Since i∈Z ipi = 0.

B = j) P (STi ∧Tk = k)P (A = i. P (Z = k) = pk for k < 0. We have P (Z = k) = P (STA ∧TB = k) = i<0 j>0 P (STi ∧Tj = k)P (A = i. B = k) i<0 = = i<0 −i −1 c (k − i)pi pk k−i (−i)pi pk = pk .probability p0 . 105 . i<0 = c−1 Similarly. Suppose next k > 0. by deﬁnition.

Now suppose that c | a and c | b − a. They are not supposed to replace textbooks or basic courses.}. So (ii) holds. b − a). Indeed 4 × 38 > 120.) Notice that: gcd(a. 3.e. c | b. Z = {· · · . In other words. 2) = gcd(2. and d = gcd(a. Then interchange the roles of the numbers and continue. with at least one of them nonzero. The set N of natural numbers is the set of positive integers: N = {1. If b − a > a.e. 8) = gcd(6. b). 38) = gcd(44. We will show that d = gcd(a. we say that b divides a if there is an integer k such that a = kb. But 3 is the maximum number of times that 38 “ﬁts” into 120. −3. (That it is unique is obvious from the fact that if d1 .PART III: APPENDICES This section contains some sine qua non elements from standard Mathematics. For instance. b are arbitrary integers. If a | b and b | a then a = b. Therefore d | b − a. 5. 6. Indeed. 38) = gcd(82. they merely serve as reminders. Given integers a. 1. Keep doing this until you obtain b − qa < a. 2) = gcd(4. b) = gcd(a. 4. if we also invoke Euclid’s algorithm (i. If c | b and b | a then c | a. 20) = gcd(6. (ii) if c | a and c | b. Because c | a and c | b. and we’re done. 106 . Replace b by b − a. i. . . 32) = gcd(6. 14) = gcd(6. we can deﬁne their greatest common divisor d = gcd(a. Zero is denoted by 0. if a. Now. replace b − a by b − 2a. But d | a and d | b. 2) = 2. with b = 0. let d = gcd(a. The set Z of integers consists of N. with the second line. But in the ﬁrst line we subtracted 38 exactly 3 times. the process of long division that one learns in elementary school). So we work faster. −1. 0. b). we may summarise the ﬁrst line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. we have c | d. −2. d | b. Then c | a + (b − a). 2. b. b − a). We write this by b | a. then d | c. . d2 are two such numbers then d1 would divide d2 and vice-versa. Arithmetic Arithmetic deals with integers. This gives us an algorithm for ﬁnding the gcd between two numbers: Suppose 0 < a < b are integers. 38) = gcd(6. 38) = gcd(6. their negatives and zero. 26) = gcd(6. So (i) holds. · · · }. 3. 2. gcd(120. as taught in high schools or universities. Similarly. leading to d1 = d2 . b) as the unique positive integer such that (i) d | a.

let d be the right hand side.e. Here are some other facts about the greatest common divisor: First. In the process. y ∈ Z. To see why this is true. such that a = qb + r. to divide 3450 with 26 we do 132 —— 3450 26 — 85 78 — 70 52 — 18 26 | and obtain q = 26. with a > b. it divides d. we have the fact that if d = gcd(a. b) = min{xa + yb : x. for some x. 38) = gcd(6. y such that d = xa + yb. in particular. we can ﬁnd an integer q and an integer r ∈ {0.The theorem of division says the following: Given two positive integers b. 1. Second. we have gcd(a. b ∈ Z. This shows (ii) of the deﬁnition of a greatest common divisor. b) then there are integers x. Thus. Working backwards. . Suppose that c | a and c | b. y ∈ Z. y = 6. let us follow the previous example again: gcd(120. not both zero. 2) = 2. xa + yb > 0}. a − 1}. . Then c divides any linear combination of a. 38) = gcd(6. To see this. i. For example. with x = 19. Then d = xa + yb. Its proof is obvious. r = 18. b. we used the divisions 120 = 3 × 38 + 6 38 = 6 × 6 + 2. 120) = 38x + 120y. gcd(38. a. The long division algorithm is a process for ﬁnding the quotient q and the remainder r. . The same logic works in general. we have 2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120. we can show that. . for integers a. 107 . We need to show (i).

z). x). For example. A function is called onto (or surjective) if its range is Y . Finally. Counting Let A. f (x2 ). m elements. B be ﬁnite sets with n. j (i communicates with j) in Markov chain Functions We a function f from a set X into a set Y is denoted by f : X → Y .that d | a and d | b. i. i. The set of functions from X to Y is denoted by Y X . yielding r = (−qx)a + (1 − qy)b. 108 . Suppose this is not true. and this is a contradiction. The range of f is the set of its values. A relation is transitive if when it contains (x. Since d > 0. but is not necessarily transitive. A relation which possesses all the three properties is called equivalence relation. the relation < between real numbers can be thought of as consisting of pairs of numbers (x. If B ⊂ Y then f −1 (B) consists of all x ∈ X with f (x) ∈ B. that d does not divide both numbers. that d does not divide b.e. say. The equality relation = is reﬂexive. x2 ∈ X have diﬀerent values f (x1 ). a relation can be thought of as a set of pairs of elements of the set A. the set {f (x) : x ∈ X}. The greatest common divisor of 3 integers can now be deﬁned in a similar manner. But then b = q(xa + yb) + r. Relations A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. y) and (y. A relation is called reflexive if it contains pairs (x. then the relation “x is a friend of y” is symmetric. it can be thought as reﬂexive (unless we do not want to accept the fact that one may not be afriend of oneself). respectively. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. A relation is called symmetric if it cannot contain a pair (x. In general. e. z) it also contains (x. for some 1 ≤ r ≤ d − 1.e. The relation < is not reﬂexive. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). so d divides both a and b. Both < and = are transitive. so r = 0. We encountered the equivalence relation i theory.g. the greatest common divisor of any subset of the integers also makes sense. y) without contain its symmetric (y. A function is called one-to-one (or injective) if two diﬀerent x1 . So d is not minimum as claimed. The relation < is not symmetric.. we can divide b by d to ﬁnd b = qd + r. The equality relation is symmetric. y) such that x is smaller than y. x). If we take A to be the set of students in a classroom.

the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen. In the special case m = n. Each assignment of this type is a one-to-one function from A to B. Any set A contains several subsets. the k-th in n − k + 1 ways. Hence the total number of assignments is m × m × ··· × m n times Suppose now we are not allowed to give the same name to diﬀerent animals. 109 . n k is merely a name for the number computed on the left side of this n k = n n−k n 0 = n! . and this follows from the binomial theorem. we have = 1. n! is the total number of permutations of a set of n elements. We thus observe that there are as many subsets in A as number of elements in the set {0. we need more names than animals: n ≤ m. Clearly. So we have (n)k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. and so on. Indeed. 1}n of sequences of length n with elements 0 or 1. we denote (n)n also by n!. in other words. and so on. If A has n elements. n! stands for the product of the ﬁrst n natural numbers: n! = 1 × 2 × 3 × · · · × n. A function is then an assignment of a name to each animal. the total number of one-to-one functions from A to B is (m)n := m(m − 1)(m − 2) · · · (m − n + 1). The ﬁrst element can be chosen in n ways. k! k where the symbol equation.e. k!(n − k)! Since there is only one subset with no elements (it is the empty set ∅). Since the name of the ﬁrst animal can be chosen in m ways. that of the second in m − 1 ways. To be able to do this. so does the second. The total number of subsets of A (regardless of their number of elements) is thus equal to n k=0 n k = (1 + 1)n = 2n . and so on.) So the ﬁrst animal can be assigned any of the m available names. we may think of the elements of A as animals and those of B as names.The number of functions from A to B (i. If we don’t then we divide by the number k! of permutations of the k elements concluding that the number of subsets with k elements of a set with n elements equals (n)k n =: . Incidentally. (The same name may be used for diﬀerent animals. the second in n − 1. a one-to-one function from the set A into A is called a permutation of A. Thus. and the third. the cardinality of the set B A ) is mn .

b is denoted by a ∧ b = min(a. In particular. The notation extends to more than one numbers. 0). we let The Pascal triangle relation states that n+1 k = = 0. a− = − min(a. Since the sentences “the pig is included” and “the pig is not included” cannot be simultaneously true. consider a speciﬁc animal. Maximum and minimum The minimum of two numbers a. by convention. b). say. 0). + k−1 k Indeed. we deﬁne x m = x(x − 1) · · · (x − m + 1) . we use the following notations: a+ = max(a. The absolute value of a is deﬁned by |a| = a+ + a− . if x is a real number. then there are k − 1 remaining k n animals to be chosen and this can be done in k−1 ways. Both quantities are nonnegative. a∨b∨c refers to the maximum of the three numbers a. 110 . n n . b. b). by convention. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c). m! n y If y is not an integer then. and a− is called the negative part of a. If the pig is included. c. the pig. Note that it is always true that a = a+ − a− . For example. k Also. a+ is called the positive part of a. The maximum is denoted by a ∨ b = max(a.We shall let n = 0 if n < k. in choosing k animals from a set of n + 1 ones. and we don’t care to put parantheses because the order of ﬁnding a maximum is irrelevant. we choose k animals from the remaining n ones and this can be done in n ways. we add the two numbers together in order to ﬁnd the total number of ways to pick k animals from n + 1 animals. Thus a ∧ b = a if a ≤ b or = b if b ≤ a. If the pig is not included in the sample.

For instance. 111 . if A1 . .Indicator function stands for the indicator function of a set A. It takes value 1 with probability P (A) and 0 with probability P (Ac ). So E[X] = n≥1 E 1(X ≥ n) = n≥1 Matrices Let Rd be the set of vectors with d components. we have E[1A∪B ] = E[1A ] + E[1B ] − E[1A∩B ]. which means P (A ∪ B) = P (A) + P (B) − P (A ∩ B). . If A is an event then 1A is a random variable. . Therefore its expectation is E 1A = 1 × P (A) + 0 × P (Ac ) = P (A). Thus. A linear function ϕ from Rn to Rm is such that ϕ(αx + βy) = αϕ(x) + βϕ(y). Several quantities of interest are expressed in terms of indicators. It is easy to see that 1A 1A∩B = 1A 1B . A2 . It is worth remembering that. meaning a function that takes value 1 on A and 0 on the complement Ac of A. An example of its application: Since 1A∪B = 1 − 1(A∪B)c = 1 − 1Ac ∩Bc = 1 − 1Ac 1Bc = 1 − (1 − 1A )(1 − 1B ) = 1A + 1B − 1A 1B . P (X ≥ n). This is very useful notation because conjunction of clauses is transformed into multiplication. . 2. 1 − 1A = 1 A c . Another example: Let X be a nonnegative random variable with values in N = {1.}. . We also use the notation 1(clause) for a number that is 1 whenever the clause is true or 0 otherwise. are sets or clauses then the number n≥1 1An is the number true clauses. Then X X= n=1 1 (the sum of X ones is X). = 1A + 1B − 1A∩B . . So X= n≥1 1(X ≥ n). if I go to a bank N times and Ai is the clause “I am robbed the i-th time” then N n=1 1Ai is the total number of times I am robbed.

. because ϕ is linear. . Let y i := n aij xj . . . . vm . . . . y ∈ Rn . is a column representation of y := ϕ(x) with respect to the given basis . xn the given basis. The matrix A of ϕ with respect to the choice of the two basis is deﬁned by a11 · · · a1n A := · · · · · · · · · am1 · · · amn y1 Then ϕ ◦ ψ is a linear function: Rn − → Rℓ − 112 ϕ◦ψ . . . β ∈ R. The column . So if we choose a basis v1 . . un be a basis for Rn . we have ϕ(x) = m n vi i=1 j=1 aij xj . . The column . ϕ are linear functions: It is a m × n matrix. . . . . Combining. vm in Rn we can express each ϕ(uj ) as a linear combination of the basis vectors: m ϕ(uj ) = i=1 aij vi . . x1 . Then. and all α. xn n 1 y := ϕ(x) = j=1 xj ϕ(uj ). . is a column representation of x with respect to . . Let u1 . = ··· ··· ··· . . where ∈ R. j=1 i = 1. Then any x ∈ Rn can be uniquely written as n x= j=1 x j uj . . m. Suppose that ψ. ym v1 . am1 · · · amn xn ym Rn − Rm − Rℓ → → ψ ϕ . . The relation between the column representation of y and the column representation of x is written as 1 1 x y a11 · · · a1n .for all x. . x . But the ϕ(uj ) are vectors in Rm . .

. it is its matrix representation with respect to the ﬁxed basis. The standard basis in Rd consists of the vectors 1 0 0 0 1 0 e 1 = . . 1/2. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ. . . . . we deﬁne the supremum sup I to be the minimum of all upper bounds. If I is empty then inf ∅ = +∞.} ≤ inf{x3 . a square matrix represents a linear transformation from Rn to Rn . Limsup and liminf If x1 . . Likewise. xn+2 .Fix three sets of basis on Rn . for any ε > 0. . . a sequence a1 . 113 . . . there is x ∈ I with x ≥ ε + inf I. . . of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. the choice of the basis is the so-called standard basis. If we ﬁx a basis in Rn . since any increasing sequence has a limit (possibly +∞) we take this limit and call it lim sup xn (limit superior) of the sequence. . x2 . B is a m × n matrix. e2 = . . we deﬁne lim sup xn . . . standing for the “infimum” of I. x2 . . for any ε > 0. . where m (AB)ij = k=1 Aik Bkj . . Similarly. ed = . . sup ∅ = −∞. .} has no minimum.} ≤ · · · and. So lim sup xn = limn→∞ inf{xn . xn+1 . . i. . 1 ≤ i ≤ ℓ. Similarly. and AB is a ℓ × n matrix. Usually. there is x ∈ I with x ≤ ε + inf I. Supremum and inﬁmum When we have a set of numbers I (for instance. . ei = 1(i = j). x4 . This minimum (or maximum) may not exist. A matrix A is called square if it is a n × n matrix. 1/3. respectively. . ψ. 1 ≤ j ≤ n. .e. Then the matrix representation of ϕ ◦ ψ is AB. . . sup I is characterised by: sup I ≥ x for all x ∈ I and. · · · }. Rm and Rℓ and let A. and it is this that we called inﬁmum. We note that lim inf(−xn ) = − lim sup xn .} ≤ inf{x2 . . For instance I = {1. If I has no upper bound then sup I = +∞. In these cases we use the notation inf I. inf I is characterised by: inf I ≤ x for all x ∈ I and. we deﬁne max I. j So A is a ℓ × m matrix. . a2 . A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists. 0 0 1 In other words. B be the matrix representations of ϕ. with respect to these bases. is a sequence of numbers then inf{x1 . If I has no lower bound then inf I = −∞. x3 . deﬁned as the maximum of all the numbers y with the property y ≤ x for all x ∈ I.

which holds since X is positive. . Linear Algebra that has many axioms is more ‘restrictive’. are mutually disjoint events. are decreasing with A = ∩n An then P (A) = limn→∞ P (An ). we work a lot with conditional probabilities. P (∪n An ) ≤ Similarly. Of course. In Markov chain theory. E 1A = P (A). it is useful to ﬁnd disjoint events B1 . and everywhere else. B in the deﬁnition. If An are events with P (An ) = 0 then P (∪n An ) = 0. i. Therefore. If A is an event then 1A or 1(A) is the indicator random variable. by not requiring this. In contrast. This follows from the logical statement X 1(X > x) ≤ X. to show that EX ≤ EY .Probability Theory I do not require11 knowledge of axiomatic Probability Theory (with Measure Theory). and apply them. A2 . B are disjoint events then P (A ∪ B) = P (A) + P (B). n≥1 Usually. derive formulae. For example. if An are events with P (An ) = 1 then P (∩n An ) = 1. If the events A. . That is why Probability Theory is very rich. namely: Everything else. But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems. in these notes. n P (An ). Recall that P (B|A) is deﬁned by P (A ∩ B)/P (A) and is a very motivated deﬁnition. . . This can be generalised to any ﬁnite number of events. . n If the events A1 .e. . B2 . we get P (A ∩ B) = P (A)P (B|A) = P (B)P (A|B) and this is called Bayes’ theorem. follows from this axiom. the function that takes value 1 on A and 0 on Ac .) If A. . to compute P (A ∩ B) we read the deﬁnition as P (A ∩ B) = P (A)P (B|A). Quite often.. A is a subset of B). B are not disjoint then P (A ∪ B) = P (A) + P (B) − P (A ∩ B). such that ∪n Bn = Ω. If A1 . then P n≥1 An = P (An ). Probability Theory is exceptionally simple in terms of its axioms. In contrast. . but not too much. showing that P (A) ≤ P (B) is more of a logical problem than a numerical one: we need to show that A implies B (i. Often.. . in which case we write P (A) = n P (A ∩ Bn ) = P (A|Bn )P (Bn ). if he events A1 . 11 114 . A2 . Note that 1A 1B = 1A∩B . This is the inclusion-exclusion formula for two events. for (besides P (Ω) = 1) there is essentially one axiom . Linear Algebra that has many axioms is more ‘restrictive’. This is generalisable for a number of events: P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) If we interchange the roles of A. we often need to argue that X ≤ Y . . . Indeed. Similarly. Similarly. are increasing with A = ∪n An then P (A) = limn→∞ P (An ). A2 . Markov’s inequality for a positive random variable X states that P (X > x) ≤ EX/x. (That is why Probability Theory is very rich. and 1 Ac = 1 − 1 A . to compute P (A).e. . I do sometimes “cheat” from a strict mathematical point of view.

|s| < 1/|ρ|. Suppose now that X is a random variable with values in Z+ ∪ {+∞}. taught in calculus courses). Generating functions A generating function of a sequence a0 . . As an example. . 1].An application of this is: P (X < ∞) = lim P (X ≤ x). 2. So if n an < ∞ then G(s) exists for all s ∈ [−1. 1 − ρs (39) Note that |G(s)| ≤ n |an ||s|n and this is ≤ n |an | if s ∈ [−1. n ≥ 0 is ∞ ρn s n = n=0 1 . x→∞ In Markov chain theory. the formula reduces to the known one. EX = x xP (X = x). The generating function of the sequence pn = P (X = n). which is motivated from the algebraic identity X = X + − X − . 1. ϕ(s) = EsX = n=0 sn P (X = n) 115 . is an inﬁnite polynomial having the ai as coeﬃcients: G(s) = a0 + a1 s + a2 s2 + · · · . a1 . viz.e. This deﬁnition makes sense only if EX + < ∞ or if EX − < ∞. X − = − min(X. we deﬁne X + = max(X. For a random variable which takes positive and negative values. 0). In case the random variable is ∞ absoultely continuous with density f . . . . The expectation of a nonnegative random variable X may be deﬁned by ∞ EX = 0 P (X > x)dx. a1 .. . . . . 0). is called probability generating function of X: ∞ as long as |ρs| < 1. making it a useful object when the a0 . i. 1]. we have EX = −∞ xf (x)dx.} because then n an = 1. . . n = 0. For a random variable that takes values in N we can write EX = ∞ P (X ≥ n). which are both nonnegative and then deﬁne EX = EX + − EX − . the generating function of the geometric sequence an = ρn . In case the random variable is discrete. 1. If it exists for some s1 then it exists uniformly for all s with |s| < |s1 | (and this needs a proof. . we encounter situations where P (X = ∞) may be positive. is a probability distribution on Z+ = {0. so it is useful to keep this in mind. We do not a priori know that this exists for any s other than. obviously. for s = 0 for which G(0) = 0. The formula holds for any positive random variable. This is due to the fact that n=1 P (X ≥ n) = E 1(X ≥ n).

If we let X(s) = n≥0 n ≥ 0. n ≥ 0) is n≥0 xn+1 sn = s−1 (X(s) − x0 ) and the generating function of (xn+2 . while ∞ ϕ(1) = n=0 P (X = n) = P (X < +∞). We can recover the probabilities pn from the formula pn = ϕ(n) (0) . since |s| < 1. Note that ϕ(0) = P (X = 0). As an example. but. 116 . Generating functions are useful for solving linear recursions. ∞ and p∞ := P (X = ∞) = 1 − pn . n! as follows from Taylor’s theorem. s→1 something that can be formally seen from d d EsX = E sX = EXsX−1 . n ≥ 0) is n≥0 xn+2 sn = s−2 (X(s) − x0 − sx1 ). ds ds and by letting s = 1. Also. xn sn be the generating of (xn . n ≥ 0) then the generating function of (xn+1 . and this is why ϕ is called probability generating function. consider the Fibonacci sequence deﬁned by the recursion xn+2 = xn+1 + xn . possibly. Note that ∞ n=0 pn ≤ 1. we have s∞ = 0.This is deﬁned for all −1 < s < 1. Since the term s∞ p∞ could formally appear in the sum deﬁning ϕ(s). with x0 = x1 = 1. we can recover the expectation of X from EX = lim ϕ′ (s). n=0 We do allow X to. take the value +∞.

then the so-called distribution function. +s−1 But the polynomial s2 + s − 1 has two roots: √ √ a = ( 5 − 1)/2. But I could very well use the function P (X > x). n ≥ 0). Since x0 = x1 = 1. is such an example. b = −( 5 + 1)/2. for example.e. the n-th term of the Fibonacci sequence. this number is always an integer (it has to be!) Distribution or law of a random object When we speak of the “distribution” or “law” of a random variable X with values in Z or R or Rn or even in a much larger space. − − a−b s−b s−a b−a b−s a−s 1/ρ 1 1 But (39) tells us that ρn has generating function 1−ρs = (1/ρ)−s . we speak of some function which enables us to specify the probabilities of the form P (X ∈ B) where B is a set of values of X.e. we can solve for X(s) and ﬁnd X(s) = s2 −1 . 117 . i. n ≥ 0) and (xn . Hence X(s) is the generating function of 1 xn = ((1/b)n+1 − (1/a)n+1 ). Despite the square roots in the formula. Hence s2 + s − 1 = (s − a)(s − b). i. b−a Noting that ab = −1. we can also write this as xn = (−a)n+1 − (−b)n+1 . n ≥ 0) equals the sum of the generating functions of (xn+1 . and so X(s) = 1 1 1 1 1 1 = . namely. This tells us that s−b is the generating function of (1/b)n+1 1 and that s−a is the generating function of (1/a)n+1 . So.From the recursion xn+2 = xn+1 + xn we have (and this is were linearity is used) that the generating function of (xn+2 . the function P (X ≤ x). b−a is the solution to the recursion. Thus (1/ρ)−s is the 1 generating function of ρn+1 . if X is a random variable with values in R. s−2 (X(s) − x0 − sx1 ) = s−1 (X(s) − x0 ) + X(s). x ∈ R.

compute an integral and obtain an approximation. in 2n fair coin tosses 118 . Even the generating function EsX of a Z+ -valued random variable X is an example. dx All these objects can be referred to as “distribution” or “law” of X.or the 2-parameter function P (a < X < b). to compute a large sum we can. Specifying the distribution of such an object may be hard. but using or merely talking about such an object presents no diﬃculty other than overcoming a psychological barrier. Such an excursion is a random object which should be thought of as a random function. 1 Since d dx (x log x − x) = log x. Warning: the verb “specify” should not be interpreted as “compute” in the sense of existence of an analytical formula or algorithm. instead. Hence we have We call this the rough Stirling approximation. (b) The approximation is not good when we consider ratios of products. But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. I could use its density f (x) = d P (X ≤ x). It turns out to be a good one. n Hence n k=1 log k ≈ log x dx. Now. For instance. it’s better to use logarithms: n n! = e log n = exp k=1 log k. we encountered the concept of an excursion of a Markov chain from a state i. n! ≈ exp(n log n − n) = nn e−n . it follows that n 1 log x dx = n log n − n + 1 ≈ n log n − n. for example they could take values which are themselves functions. since the number is so big. n ∈ Z+ . For example (try it!) the rough Stirling approximation give that the probability that. Since log x does not vary too much between two successive positive integers k and k + 1 it follows that k+1 k log x dx ≈ log k. If X is absolutely continuous. for it does allow us to specify the probabilities P (X = n). Large products: Stirling’s formula If we want to compute n! = 1 × 2 × · · · × n for large n then. We frequently encounter random objects which belong to larger spaces.

0 ≤ n ≤ m. we shall prove Strirling’s approximation: √ n! ∼ nn e−n 2πn where the symbol an ∼ bn means that an /bn → 0. Eg(X) ≥ g(x)P (X ≥ x) . then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any diﬀerent from X. And if y < x then 1(y < x) = 0 and the display reads g(y) ≥ 0.we have exactly n heads equals 2n 2−2n ≈ 1. Consider an increasing and nonnegative function g : R → R+ . by taking expectations of both sides (expectation is an increasing operator). Surely. if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x). The same process shows that a geometric random variable has a distribution of the form P (X ≥ n) = λn . you have seen it before. That was the random variable J0 = ∞ n=1 1(Sn = 0). for all x. g(X) ≥ g(x)1(X ≥ x). Indeed. where λ = P (X ≥ 1). Let X be a random variable. Then. and this is terrible. It was shown in Theorem 38 that J0 satisﬁes the memoryless property. Markov’s inequality Arguably this is the most useful inequality in Probability. We can case both properties of g neatly by writing For all x. 119 . In fact it was shown there that P (J0 ≥ k) = P (J0 ≥ 1)k . deﬁned in (34). expressing non-negativity. y ∈ R g(y) ≥ g(x)1(y ≥ x). n ∈ Z+ . because we know that the n probability goes to zero as n → ∞. and so. To remedy this. The geometric distribution We say that the random variable X with values in Z+ is geometric if it has the memoryless property: P (X − n ≥ m | X ≥ n) = P (X ≥ m). a geometric function of k. and is based on the concept of monotonicity. expressing monotonicity. If we think of X as the time of occurrence of a random phenomenon. We have already seen a geometric random variable. We baptise this geometric(λ) distribution. This completely trivial thought gives the very useful Markov’s inequality. as n → ∞. so this little section is a reminder.

Grinstead. J L (1997) Introduction to Probability. ´ 2. Feller. C M and Snell. 3.Bibliography ´ 1. Cambridge University Press. 7. W (1981) Topics in Ergodic Theory. 4. K L and Aitsahlia F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance).pdf 6.edu/~chance/teaching_aids/books_articles/ /probability_book/amsbook. Also available at: http://www. Springer. W (1968) An Introduction to Probability Theory and its Applications. Bremaud P (1999) Markov Chains. 5. Parry. Springer. Math. Wiley. Bremaud P (1997) An Introduction to Probabilistic Modeling. Amer.dartmouth. Norris. Cambridge University Press.mac. Soc. J R (1998) Markov Chains. 120 . Springer. Chung.

34 probability ﬂow. 108 ergodic. 61 recurrent. 15 Loynes’ method. 17 Kolmogorov’s loop criterion. 15 distribution. 108 relation. 13 probability generating function. 29 reﬂection. 81 reﬂection principle. 70 greatest common divisor. 51 essential state. 113 initial distribution. 77 coupling. 20 function of a Markov chain. 17 communicates with. 58 digraph (directed graph). 17 Ethernet. 111 maximum. 18 permutation. 116 ﬁrst-step analysis. 101 axiom of Probability Theory. 110 meeting time. 108 ruin probability.Index absorbing. 7 closed. 10 irreducible. 86 reﬂexive. 16 equivalence relation. 16 convolution. 70 period. 110 mixing. 106 ground state. 61 null recurrent. 73 Markov chain. 18 dual. 61 detailed balance equations. 6 Markov property. 109 positive recurrent. 15. 53 hitting time. 48 memoryless property. 115 geometric distribution. 17 Aloha. 114 balance equations. 3 branching process. 111 inessential state. 87 121 . 47 coupling inequality. 68 aperiodic. 6 Markov’s inequality. 26 running maximum. 35 Helly-Bray Lemma. 7 invariant distribution. 119 matrix. 117 leads to. 10 ballot theorem. 48 cycle formula. 65 generating function. 59 law. 77 indicator. 82 Cesaro averages. 45 Chapman-Kolmogorov equation. 119 minimum. 1 equivalence classes. 53 Fibonacci sequence. 20 increments. 34 PageRank. 119 Google. 2 nearest neighbour random walk. 17 inﬁmum. 63 Galton-Watson process. 16 communicating classes. 18 arcsine law. 16. 115 random walk on a graph. 44 degree. 68 Fatou’s lemma. 65 Catalan numbers. 76 neighbours. 117 division. 84 bank account. 51 Monte Carlo. 98 dynamical system.

27 storage chain. 62 supremum. 104 states. 8 transition probability matrix. 16. 76 simple symmetric random walk. 6 stationary. 60 urn. 62 Wright-Fisher model. 7 time-reversible. 77 skip-free property. 108 tree. 3. 24 Strirling’s approximation. 16. 88 watching a chain in a set. 58 transient. 71 122 . 7. 13 stopping time. 2 transition probability. 76 Skorokhod embedding. 8 transitive. 113 symmetric. 7 simple random walk. 29 transition diagram:.semigroup property. 10 steady-state. 27 subordination. 9 stationary distribution. 6 statespace. 73 strategy of the gambler. 62 subsequence of a Markov chain. 119 strong Markov property. 108 time-homogeneous.

Sign up to vote on this title

UsefulNot useful- markov-chains
- 20752_0412055511MarkovChain
- Markov Chain
- (Ebook) Hidden Markov Models (Theory & Methods) Markov Chains Particle Filter Monte Carlo Hmm - Cappemoulinesryden
- MarkovChains
- Introduction to Markov Chain Monte Carlo Simulation and Their Statistical Analysis
- Markov Chains
- Bayesian Econometric Methods
- Markov Coding
- ProblemsMarkovChains
- A First Course in Linear Algebra
- Statistics
- Exercises for Stochastic Calculus
- Markov Chain
- Markov
- markov chain chapter 6
- Stochastic Processes
- A Course in Linear Algebra .pdf
- Solved Problems
- An Introduction to Statistics
- Usig R for Introductory Statistics
- Basic Topology Armstrong
- Grimmett g.r., Stirzaker d.r. Probability and Random Processes (3ed., Oxford, 2001)(1)
- Weather Wax Ross Solutions
- Bayesian Econometrics
- ACTEX Exam C Spring 2009 Vol 1
- Stochastic Processes - Ross
- references.docx
- Programming_with_Solutions C
- Markov Chain Random Walk