
**MARKOV CHAINS AND RANDOM WALKS**

Takis Konstantopoulos∗ Autumn 2009

∗ © Takis Konstantopoulos 2006-2009

Contents

1 Introduction
2 Examples of Markov chains
    2.1 A mouse in a cage
    2.2 Bank account
    2.3 Simple random walk (drunkard's walk)
    2.4 Simple random walk (drunkard's walk) in a city
    2.5 Actuarial chains
3 The Markov property
    3.1 Definition of the Markov property
    3.2 Transition probability and initial distribution
    3.3 Time-homogeneous chains
4 Stationarity
    4.1 Finding a stationary distribution
5 Topological structure
    5.1 The graph of a chain
    5.2 The relation "leads to"
    5.3 The relation "communicates with"
    5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
    10.1 First hitting time decomposition
    10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
    13.1 Functions of excursions
    13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
    18.1 Definition of stability
    18.2 The fundamental stability theorem for Markov chains
    18.3 Coupling
    18.4 Proof of the fundamental stability theorem
    18.5 Terminology
    18.6 Periodic chains
19 Limiting theory for Markov chains*
    19.1 Limiting probabilities for null recurrent states
    19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
    23.1 Subsequences
    23.2 Watching a Markov chain when it visits a set
    23.3 Subordination
    23.4 Deterministic function of a Markov chain
24 Applications
    24.1 Branching processes: a population growth application
    24.2 The ALOHA protocol: a computer communications application
    24.3 PageRank (trademark of Google): A World Wide Web application
    24.4 The Wright-Fisher model: an example from biology
    24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
    26.1 Paths that start and end at specific points
    26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
    27.1 Distribution after n steps
    27.2 The ballot theorem
    27.3 Some conditional probabilities
    27.4 Some remarkable identities
    27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
    28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
    30.1 First hitting time analysis
    30.2 First return to the origin
    30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
    31.1 Asymptotic distribution of the maximum
    31.2 Return probabilities
    31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come...

A few starred sections should be considered "advanced" and can be omitted at first reading.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor $\gcd(a, b)$ between two positive integers $a$ and $b$. Recall (from elementary school maths) that $\gcd(a, b) = \gcd(r, b)$, where $r = \mathrm{rem}(a, b)$ is the remainder of the division of $a$ by $b$. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers $(x_n, y_n)$, where $x_n \ge y_n$, initialised by

$$x_0 = \max(a, b), \qquad y_0 = \min(a, b),$$

and evolving as

$$x_{n+1} = y_n, \qquad y_{n+1} = \mathrm{rem}(x_n, y_n), \qquad n = 0, 1, 2, \ldots$$

In other words, there is a function $F$ that takes the pair $X_n = (x_n, y_n)$ into $X_{n+1} = (x_{n+1}, y_{n+1})$. The sequence of pairs thus produced, $X_0, X_1, X_2, \ldots$, has the property that, for any $n$, the future pairs $X_{n+1}, X_{n+2}, \ldots$ depend only on the pair $X_n$.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass $m$ is moving under the action of a force $F$ in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force:

$$m\ddot{x} = F.$$

Here, $x = x(t)$ denotes the position of the particle at time $t$. Its velocity is $\dot{x} = \frac{dx}{dt}$, and its acceleration is $\ddot{x} = \frac{d^2x}{dt^2}$. Assume that the force $F$ depends only on the particle's position and velocity, i.e. $F = f(x, \dot{x})$. The state of the particle at time $t$ is described by the pair $X(t) = (x(t), \dot{x}(t))$. It is not hard to see that, for any $t$, the future trajectory $(X(s),\ s \ge t)$ can be completely specified by the current state $X(t)$. In other words, past states are not needed.

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness. We will be dealing with random variables, instead of deterministic objects.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability.

We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

¹ Stochastic means random. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral $\int_a^b f(x)\,dx$ of a complicated function $f$: suppose that $f$ is positive and bounded above by $B$. Choose a pair $(X_0, Y_0)$ of random numbers, $a \le X_0 \le b$, $0 \le Y_0 \le B$, uniformly. Repeat with $(X_n, Y_n)$, for steps $n = 1, 2, \ldots$ Each time count 1 if $Y_n < f(X_n)$ and 0 otherwise. Perform this a large number $N$ of times. Count the number of 1's and divide by $N$. The resulting ratio, multiplied by the area $B(b - a)$ of the sampling rectangle, is an approximation of $\int_a^b f(x)\,dx$. Try this on the computer!

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse lives in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time $n$ (minutes) then, at time $n + 1$, it is either still in 1 or has moved to 2.

Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability $\alpha = 0.05$; it does so regardless of where it was at earlier times. Similarly, it moves from 2 to 1 with probability $\beta = 0.99$.

We can summarise this information by the transition diagram:

[Figure: transition diagram on states 1 and 2, with arrows 1 → 2 labelled α and 2 → 1 labelled β, and self-loops labelled 1 − α and 1 − β.]

Another way to summarise the information is by the 2×2 transition probability matrix

$$\mathsf{P} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} = \begin{pmatrix} 0.95 & 0.05 \\ 0.99 & 0.01 \end{pmatrix}.$$

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean $1/\alpha = 1/0.05 = 20$ minutes. (This is the mean of the geometric distribution with parameter $\alpha$.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

$$X_{n+1} = \max\{X_n + D_n - S_n,\ 0\}.$$

Here, $X_n$ is the money (in pounds) at the beginning of the month, $D_n$ is the money he deposits, and $S_n$ is the money he wants to spend. So if he wishes to spend more than what is available, he can't.

Assume that $D_n$, $S_n$ are random variables with distributions $F_D$, $F_S$, respectively. Assume also that $X_0, D_0, S_0, D_1, S_1, D_2, S_2, \ldots$ are independent random variables.

The information described above by the evolution of the account together with the distributions for $D_n$ and $S_n$ leads to the (one-step) transition probability which is defined by:

$$p_{x,y} := P(X_{n+1} = y \mid X_n = x), \qquad x, y = 0, 1, 2, \ldots$$

We easily see that if $y > 0$ then

$$\begin{aligned}
p_{x,y} = P(D_n - S_n = y - x) &= \sum_{z=0}^{\infty} P(S_n = z,\ D_n - S_n = y - x) \\
&= \sum_{z=0}^{\infty} P(S_n = z,\ D_n = z + y - x) \\
&= \sum_{z=0}^{\infty} P(S_n = z)\,P(D_n = z + y - x) \\
&= \sum_{z=0}^{\infty} F_S(z)\,F_D(z + y - x),
\end{aligned}$$

something that, in principle, can be computed from $F_S$ and $F_D$. Of course, if $y = 0$, we have $p_{x,0} = 1 - \sum_{y=1}^{\infty} p_{x,y}$. The transition probability matrix is

$$\mathsf{P} = \begin{pmatrix} p_{0,0} & p_{0,1} & p_{0,2} & \cdots \\ p_{1,0} & p_{1,1} & p_{1,2} & \cdots \\ p_{2,0} & p_{2,1} & p_{2,2} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix}$$

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he moves forwards with the same probability as he moves backwards.

[Figure: transition diagram on the integers ..., −2, −1, 0, 1, 2, 3, ...; every arrow to a neighbouring integer is labelled 1/2.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability $p$ for the drunkard to go one step to the right (and probability $1 - p$ for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: Salvador Dali (1922), The Drunkard]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost.

These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Figure: transition diagram on the states H (healthy), S (sick) and D (dead), with edge labels 0.3, 0.8, 0.01 and 0.1.]

Thus, there is probability $p_{H,S} = 0.3$ that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits $p_{H,H}$ and $p_{S,S}$ because $p_{H,H} = 1 - p_{H,S} - p_{H,D}$, and $p_{S,S} = 1 - p_{S,H} - p_{S,D}$. Also, $p_{D,D}$ is omitted, the reason being that the company does not believe that its clients are subject to resurrection; therefore, $p_{D,D} = 1$.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual?

Clearly, the answer to this is crucial for the policy determination.
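Chains like the ones in this section are easy to simulate directly from their transition probabilities. The following is a minimal sketch, not part of the original notes, that uses the mouse chain of Section 2.1 (α = 0.05, β = 0.99) to estimate the mean time to move from cell 1 to cell 2. The function names, the number of runs and the seed are illustrative assumptions.

```python
import random

# Transition probabilities of the mouse chain from Section 2.1.
ALPHA = 0.05  # probability of moving from cell 1 to cell 2
BETA = 0.99   # probability of moving from cell 2 to cell 1
P = {
    1: {1: 1 - ALPHA, 2: ALPHA},
    2: {1: BETA, 2: 1 - BETA},
}


def step(state, rng):
    """Draw the next state given the current one, using row `state` of P."""
    u = rng.random()
    cumulative = 0.0
    nxt = state
    for nxt, prob in P[state].items():
        cumulative += prob
        if u < cumulative:
            return nxt
    return nxt  # guard against floating-point round-off in the row sum


def time_to_leave_cell_1(rng):
    """Minutes until the mouse, started in cell 1, is first seen in cell 2."""
    state, minutes = 1, 0
    while state == 1:
        state = step(state, rng)
        minutes += 1
    return minutes


def mean_time_to_leave_cell_1(n_runs=20000, seed=0):
    """Average of n_runs independent simulated exit times (hypothetical helper)."""
    rng = random.Random(seed)
    return sum(time_to_leave_cell_1(rng) for _ in range(n_runs)) / n_runs
```

Calling `mean_time_to_leave_cell_1()` should return a value close to 1/α = 20, in line with the answer to Question 1 of Section 2.1. The same `step` function works for any finite chain once `P` is replaced by the appropriate table of transition probabilities.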

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables $\{X_n,\ n \in T\}$, where $T$ is a subset of the integers. We say that this sequence has the Markov property if, for any $n \in T$, the future process $(X_m,\ m > n,\ m \in T)$ is independent of the past process $(X_m,\ m < n,\ m \in T)$, conditionally on $X_n$. Sometimes we say that $\{X_n,\ n \in T\}$ is Markov instead of saying that it has the Markov property.

Usually, $T = \mathbb{Z}_+ = \{0, 1, 2, 3, \ldots\}$ or $\mathbb{N} = \{1, 2, 3, \ldots\}$ or the set $\mathbb{Z}$ of all integers (positive, zero, or negative). We focus almost exclusively on the first choice. So we will omit referring to $T$ explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the $X_n$ take values in some countable set $S$, called the state space. The elements of $S$ are frequently called states. Since $S$ is countable, it is customary to call $(X_n)$ a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. $(X_n,\ n \in \mathbb{Z}_+)$ is Markov if and only if, for all $n \in \mathbb{Z}_+$ and all $i_0, \ldots, i_{n+1} \in S$,

$$P(X_{n+1} = i_{n+1} \mid X_n = i_n, \ldots, X_0 = i_0) = P(X_{n+1} = i_{n+1} \mid X_n = i_n).$$

Proof. If $(X_n)$ is Markov then the claim holds by the conditional independence of $X_{n+1}$ and $(X_0, \ldots, X_{n-1})$ given $X_n$. On the other hand, if the claim holds, we can see that by applying it several times we obtain

$$P(\forall\, k \in [n+1, n+m]\ X_k = i_k \mid X_n = i_n, \ldots, X_0 = i_0) = P(\forall\, k \in [n+1, n+m]\ X_k = i_k \mid X_n = i_n),$$

for any $n$, $m$ and any choice of states $i_0, \ldots, i_{n+m}$, which precisely means that $(X_{n+1}, \ldots, X_{n+m})$ and $(X_0, \ldots, X_{n-1})$ are independent conditionally on $X_n$. This is true for any $n$ and $m$, and so future is independent of past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = P(X_0 = i_0)\,P(X_1 = i_1 \mid X_0 = i_0)\,P(X_2 = i_2 \mid X_1 = i_1) \cdots P(X_n = i_n \mid X_{n-1} = i_{n-1}), \qquad (1)$$

which means that the two functions

$$\mu(i) := P(X_0 = i), \quad i \in S,$$
$$p_{i,j}(n, n+1) := P(X_{n+1} = j \mid X_n = i), \quad i, j \in S,\ n \ge 0,$$

will specify all joint distributions²:

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = \mu(i_0)\, p_{i_0,i_1}(0, 1)\, p_{i_1,i_2}(1, 2) \cdots p_{i_{n-1},i_n}(n-1, n).$$

The function $\mu(i)$ is called initial distribution. The function $p_{i,j}(n, n+1)$ is called transition probability from state $i$ at time $n$ to state $j$ at time $n + 1$. More generally,

$$p_{i,j}(m, n) := P(X_n = j \mid X_m = i)$$

is the transition probability from state $i$ at time $m$ to state $j$ at time $n \ge m$. Of course,

$$p_{i,j}(n, n) = \mathbf{1}(i = j)$$

and, by (1),

$$p_{i,j}(m, n) = \sum_{i_1 \in S} \cdots \sum_{i_{n-1} \in S} p_{i,i_1}(m, m+1) \cdots p_{i_{n-1},j}(n-1, n), \qquad (2)$$

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each $m \le n$, the matrix

$$\mathsf{P}(m, n) := [p_{i,j}(m, n)]_{i,j \in S},$$

viz., a square matrix whose rows and columns are indexed by $S$ (and this, of course, requires some ordering on $S$) and whose entries are $p_{i,j}(m, n)$. If $|S| = d$ then we have $d \times d$ matrices, but if $S$ is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

$$\mathsf{P}(m, n) = \mathsf{P}(m, m+1)\,\mathsf{P}(m+1, m+2) \cdots \mathsf{P}(n-1, n),$$

or as

$$\mathsf{P}(m, n) = \mathsf{P}(m, \ell)\,\mathsf{P}(\ell, n) \quad \text{if } m \le \ell \le n,$$

something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities $p_{i,j}(n, n+1)$ do not depend on $n$, and therefore we can simply write

$$p_{i,j}(n, n+1) =: p_{i,j},$$

and call $p_{i,j}$ the (one-step) transition probability from state $i$ to state $j$, while

$$\mathsf{P} = [p_{i,j}]_{i,j \in S}$$

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

$$\mathsf{P}(m, n) = \underbrace{\mathsf{P} \times \mathsf{P} \times \cdots \times \mathsf{P}}_{n-m \text{ times}} = \mathsf{P}^{n-m},$$

for all $m \le n$, and the semigroup property is now even more obvious because

$$\mathsf{P}^{a+b} = \mathsf{P}^a\,\mathsf{P}^b,$$

for all integers $a, b \ge 0$.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let

$$\mu_n := [\mu_n(i) = P(X_n = i)]_{i \in S}$$

be the distribution of $X_n$, arranged in a row vector. Then we have

$$\mu_n = \mu_0\,\mathsf{P}^n, \qquad (3)$$

as follows from (1) and the definition of $\mathsf{P}$.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state $i$ and 0 to all other states is denoted by $\delta_i$. In other words,

$$\delta_i(j) := \begin{cases} 1, & \text{if } j = i \\ 0, & \text{otherwise.} \end{cases}$$

We call $\delta_i$ the unit mass at $i$. It corresponds, of course, to picking state $i$ with probability 1. Notice that any probability distribution on $S$ can be written as a linear combination of unit masses. For example, if $\mu$ assigns probability $p$ to state $i_1$ and $1 - p$ to state $i_2$, then

$$\mu = p\,\delta_{i_1} + (1 - p)\,\delta_{i_2}.$$

Starting from a specific state $i$. A useful convention used for time-homogeneous chains is to write

$$P_i(A) := P(A \mid X_0 = i).$$

Thus $P_i$ is the law of the chain when the initial distribution is $\delta_i$. Similarly, if $Y$ is a real random variable, we write

$$E_i Y := E(Y \mid X_0 = i) = \sum_y y\,P(Y = y \mid X_0 = i) = \sum_y y\,P_i(Y = y).$$
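Relation (3) and the semigroup property can be checked numerically on a small chain. The sketch below is an illustration only: it uses plain Python lists, the mouse chain matrix of Section 2.1 as a test case, and helper names (`vec_mat`, `mat_mul`, `mat_pow`, `distribution_at`) that are assumptions of this example rather than notation from the notes.

```python
def vec_mat(mu, P):
    """Row vector times matrix: (mu P)(j) = sum_i mu(i) * P[i][j]."""
    return [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]


def mat_mul(A, B):
    """Ordinary matrix product of two square matrices given as lists of rows."""
    d = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]


def mat_pow(P, n):
    """n-th power of P, with P^0 equal to the identity matrix."""
    d = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def distribution_at(mu0, P, n):
    """Distribution of X_n, obtained by applying mu -> mu P exactly n times."""
    mu = list(mu0)
    for _ in range(n):
        mu = vec_mat(mu, P)
    return mu


# Mouse chain of Section 2.1; states are ordered as (cell 1, cell 2).
P = [[0.95, 0.05],
     [0.99, 0.01]]
delta_1 = [1.0, 0.0]  # unit mass at cell 1

mu_3 = distribution_at(delta_1, P, 3)  # law of X_3 under P_1
row_1 = mat_pow(P, 3)[0]               # first row of P^3
```

Here `mu_3` and `row_1` agree up to round-off, which is relation (3) with initial distribution δ₁, and `mat_pow(P, 5)` agrees with `mat_mul(mat_pow(P, 2), mat_pow(P, 3))`, which is the semigroup property with a = 2, b = 3.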

4 Stationarity

We say that the process $(X_n,\ n = 0, 1, \ldots)$ is stationary if it has the same law as $(X_n,\ n = m, m+1, \ldots)$, for any $m \in \mathbb{Z}_+$.

In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a regular heptagon and let a bug move on its vertices. If at time $n$ the bug is on a vertex, then, at time $n + 1$, it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, i.e. selecting a vertex with probability 1/7, then, at any point of time $n$, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are $N$ molecules of gas (where $N$ is a large number, say $N = 10^{25}$) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Let $X_n$ be the number of molecules in room 1 at time $n$. Then

$$p_{i,i+1} = P(X_{n+1} = i + 1 \mid X_n = i) = \frac{N - i}{N},$$
$$p_{i,i-1} = P(X_{n+1} = i - 1 \mid X_n = i) = \frac{i}{N}.$$

If we start the chain with $X_0 = N$ (all molecules in room 1) then the process $(X_n,\ n \ge 0)$ will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take $X_0 = N/2$. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, we indeed expect that we have $N/2$ molecules in each room. Another guess then is: Place each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be $i$ with probability

$$\pi(i) = \binom{N}{i} 2^{-N}, \qquad 0 \le i \le N$$

(a binomial distribution). We can now verify that if $P(X_0 = i) = \pi(i)$ then $P(X_1 = i) = \pi(i)$ and, hence, $P(X_n = i) = \pi(i)$ for all $n$. Indeed,

$$\begin{aligned}
P(X_1 = i) = \sum_j P(X_0 = j)\,p_{j,i} &= P(X_0 = i - 1)\,p_{i-1,i} + P(X_0 = i + 1)\,p_{i+1,i} \\
&= \pi(i - 1)\,p_{i-1,i} + \pi(i + 1)\,p_{i+1,i} \\
&= \binom{N}{i-1} 2^{-N}\,\frac{N - i + 1}{N} + \binom{N}{i+1} 2^{-N}\,\frac{i + 1}{N} = \cdots = \pi(i).
\end{aligned}$$

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution $\pi$ with the property that if $X_0$ has distribution $\pi$ then every other $X_n$ also has distribution $\pi$. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain $(X_n,\ n \in \mathbb{Z}_+)$ is stationary if and only if it is time-homogeneous and $X_n$ has the same distribution as $X_\ell$ for all $n$ and $\ell$.

Proof. The only if part is obvious. For the if part, use (1) to see that

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = P(X_m = i_0, X_{m+1} = i_1, X_{m+2} = i_2, \ldots, X_{m+n} = i_n),$$

for all $m \ge 0$ and $i_0, \ldots, i_n \in S$.

In other words, a Markov chain is stationary if and only if the distribution of $X_n$ does not change with $n$. Since, by (3), the distribution $\mu_n$ at time $n$ is related to the distribution at time 0 by $\mu_n = \mu_0 \mathsf{P}^n$, we see that the Markov chain is stationary if and only if the distribution $\mu_0$ is chosen so that

$$\mu_0 = \mu_0\,\mathsf{P}.$$

Changing notation, we are led to consider:

The stationarity problem: Given $[p_{ij}]$, find those distributions $\pi$ such that

$$\pi\,\mathsf{P} = \pi. \qquad (4)$$

Such a $\pi$ is called stationary or invariant distribution. The equations (4) are called balance equations.
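The "missing algebra" for the Ehrenfest chain can also be confirmed by exact arithmetic for a small number of molecules. The sketch below is an illustrative check, not part of the original notes: it builds the transition matrix for a given N, forms the binomial candidate π, and tests the balance equations (4) with rational numbers so that no round-off is involved. The function names and the choice N = 10 are assumptions of this example.

```python
from fractions import Fraction
from math import comb


def ehrenfest_matrix(N):
    """Transition matrix on states 0..N: p(i, i+1) = (N-i)/N and p(i, i-1) = i/N."""
    P = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        if i < N:
            P[i][i + 1] = Fraction(N - i, N)
        if i > 0:
            P[i][i - 1] = Fraction(i, N)
    return P


def binomial_pi(N):
    """Candidate stationary distribution pi(i) = C(N, i) / 2^N."""
    return [Fraction(comb(N, i), 2 ** N) for i in range(N + 1)]


def is_stationary(pi, P):
    """True if pi sums to 1 and satisfies pi(i) = sum_j pi(j) p(j, i) for every i."""
    d = len(pi)
    balanced = all(sum(pi[j] * P[j][i] for j in range(d)) == pi[i] for i in range(d))
    return sum(pi) == 1 and balanced


N = 10
P = ehrenfest_matrix(N)
pi = binomial_pi(N)
half = [Fraction(1) if i == N // 2 else Fraction(0) for i in range(N + 1)]
```

With these definitions `is_stationary(pi, P)` holds, while `is_stationary(half, P)` fails: starting with exactly N/2 molecules in room 1 is not stationary, as argued in the text.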

Eigenvector interpretation: The equation $\pi = \pi\mathsf{P}$ says that $\pi$ is a left eigenvector of the matrix $\mathsf{P}$ with eigenvalue 1. In addition to that, $\pi$ must be a probability:

$$\sum_{i \in S} \pi(i) = 1.$$

The matrix $\mathsf{P}$ has the eigenvalue 1 always because $\sum_{j \in S} p_{i,j} = 1$, which, in matrix notation, can be written as $\mathsf{P}\mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is a column whose entries are all 1, and this means that $\mathbf{1}$ is a (right) eigenvector of $\mathsf{P}$ corresponding to the eigenvalue 1.

Example: card shuffling. Consider a deck of $d = 52$ cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set $S_d$ of $d!$ permutations of $\{1, \ldots, d\}$. Thus, each $x \in S_d$ is of the form $x = (x_1, \ldots, x_d)$, where all the $x_i$ are distinct. It is easy to see that the stationary distribution is uniform:

$$\pi(x) = \frac{1}{d!}, \qquad x \in S_d.$$

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in $i$ minutes with probability $p(i)$, $i = 1, 2, \ldots$ The times between successive buses are i.i.d. random variables. Consider the following process: At time $n$ (measured in minutes) let $X_n$ be the time till the arrival of the next bus. Then $(X_n,\ n \ge 0)$ is a Markov chain with transition probabilities

$$p_{i+1,i} = 1, \quad \text{if } i > 0,$$
$$p_{1,i} = p(i), \quad i \in \mathbb{N}.$$

The state space is $S = \mathbb{N} = \{1, 2, \ldots\}$. To find the stationary distribution, write the equations (4):

$$\pi(i) = \sum_{j \ge 1} \pi(j)\,p_{j,i} = \pi(1)\,p(i) + \pi(i + 1), \qquad i \ge 1.$$

These can easily be solved:

$$\pi(i) = \pi(1) \sum_{j \ge i} p(j), \qquad i \ge 1.$$

To compute $\pi(1)$, we use the normalisation condition $\sum_{i \ge 1} \pi(i) = 1$, which gives

$$\pi(1) = \Big( \sum_{i \ge 1} \sum_{j \ge i} p(j) \Big)^{-1}.$$

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, $\pi(1)$ will be zero, and so each $\pi(i)$ will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

ASSUMPTION: $\sum_{i \ge 1} \sum_{j \ge i} p(j) < \infty$.
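The solution of the bus example can be verified exactly when the interarrival distribution has finite support, in which case the assumption above holds automatically. The following sketch is illustrative only; the finite support, the sample distribution p(1) = 1/2, p(2) = p(3) = 1/4 and the function names are all assumptions of this example.

```python
from fractions import Fraction


def bus_stationary(p):
    """Given p = [p(1), ..., p(m)], return pi(i) = pi(1) * sum_{j >= i} p(j), i = 1..m."""
    m = len(p)
    tails = [sum(p[i - 1:]) for i in range(1, m + 1)]  # tails[i-1] = sum_{j >= i} p(j)
    total = sum(tails)                                 # equals 1 / pi(1)
    return [t / total for t in tails]


def bus_matrix(p):
    """Transition matrix on states 1..m: p(1, i) = p(i) and p(i+1, i) = 1."""
    m = len(p)
    P = [[Fraction(0)] * m for _ in range(m)]
    for i in range(1, m + 1):
        P[0][i - 1] = p[i - 1]      # from state 1, a new interarrival time is drawn
    for i in range(1, m):
        P[i][i - 1] = Fraction(1)   # from state i+1 the chain moves down to state i
    return P


p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # hypothetical p(1), p(2), p(3)
pi = bus_stationary(p)
P = bus_matrix(p)
# Left-hand side of the balance equations: (pi P)(i) for each state i.
pi_P = [sum(pi[j] * P[j][i] for j in range(len(p))) for i in range(len(p))]
```

For this choice of p the tail sums are 1, 1/2, 1/4, so π = (4/7, 2/7, 1/7), and `pi_P` coincides with `pi`, confirming the balance equations π(i) = π(1)p(i) + π(i + 1).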

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

$$\sum_{i \ge 1} \sum_{j \ge i} p(j) = \sum_{j \ge 1} \sum_{i \ge 1} \mathbf{1}(j \ge i)\,p(j) = \sum_{j \ge 1} j\,p(j),$$

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

ASSUMPTION ⟺ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted.) Consider an infinite binary vector

$$\xi = (\xi_1, \xi_2, \ldots),$$

i.e. $\xi_i \in \{0, 1\}$ for all $i$, and transform it according to the following rule:

if $\xi_1 = 0$ then shift to the left to obtain $(\xi_2, \xi_3, \ldots)$;
if $\xi_1 = 1$ then shift to the left and flip all bits to obtain $(1 - \xi_2, 1 - \xi_3, \ldots)$.

We have thus defined a mapping $\varphi$ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping $\varphi$ again and again:

$$\xi(n + 1) = \varphi(\xi(n)).$$

Here $\xi(n) = (\xi_1(n), \xi_2(n), \ldots)$ is the state at time $n$. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. You can check that, for any $n = 0, 1, 2, \ldots$, any $r = 1, 2, \ldots$, and any $\varepsilon_1, \ldots, \varepsilon_r \in \{0, 1\}$,

$$\begin{aligned}
P(\xi_1(n+1) = \varepsilon_1, \ldots, \xi_r(n+1) = \varepsilon_r) &= P(\xi_1(n) = 0,\ \xi_2(n) = \varepsilon_1, \ldots, \xi_{r+1}(n) = \varepsilon_r) \\
&\quad + P(\xi_1(n) = 1,\ \xi_2(n) = 1 - \varepsilon_1, \ldots, \xi_{r+1}(n) = 1 - \varepsilon_r). \qquad (5)
\end{aligned}$$

We can show that if we start with the $\xi_r(0)$, $r = 1, 2, \ldots$, i.i.d. with

$$P(\xi_r(0) = 1) = P(\xi_r(0) = 0) = 1/2$$

then the same will be true for all $n$:

$$P(\xi_r(n) = 1) = P(\xi_r(n) = 0) = 1/2.$$

The proof is by induction: if the $\xi_1(n), \xi_2(n), \ldots$ are i.i.d. with $P(\xi_r(n) = 1) = P(\xi_r(n) = 0) = 1/2$ then, from (5), the same is true at $n + 1$. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states. So it goes beyond our theory. But to even think about it will give you some strength. As for the name of the chain (bread kneading) it can be justified as follows: to each $\xi$ there corresponds a number $x$ between 0 and 1 because $\xi$ can be seen as the binary representation of the real number $x$. If $\xi_1 = 0$ then $x < 1/2$ and our shifting rule maps $x$ into $2x$. If $\xi_1 = 1$ then $x \ge 1/2$, and the rule maps $x$ into $2(1 - x)$. But the function $f(x) = \min(2x,\ 2(1 - x))$, applied to the interval $[0, 1]$ (think of $[0, 1]$ as a flexible bread dough arranged in a long rod) kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.
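The correspondence between the bit rule φ and the folding function f(x) = min(2x, 2(1 − x)) can be tried out on finite bit strings. The sketch below is an illustration under an explicit assumption: it works with a finite prefix (ξ₁, ..., ξ_m) instead of an infinite vector, so in the flipping case the two sides differ by the truncation error 2^(−(m−1)) (the discarded tail of zeros would have been flipped to ones). Exact rational arithmetic is used, and all names are hypothetical.

```python
from fractions import Fraction
import random


def phi(bits):
    """One kneading step on a finite prefix (xi_1, ..., xi_m); returns m - 1 bits."""
    head, tail = bits[0], bits[1:]
    if head == 0:
        return list(tail)            # shift to the left
    return [1 - b for b in tail]     # shift to the left and flip all bits


def to_number(bits):
    """The number x = sum_k xi_k * 2^(-k) represented by the bits, computed exactly."""
    return sum(Fraction(b, 2 ** k) for k, b in enumerate(bits, start=1))


def tent(x):
    """The folding map f(x) = min(2x, 2(1 - x)) on [0, 1]."""
    return min(2 * x, 2 * (1 - x))


def max_discrepancy(m=20, n_trials=200, seed=0):
    """Largest |x(phi(xi)) - f(x(xi))| over random m-bit prefixes (illustrative helper)."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(n_trials):
        bits = [rng.randint(0, 1) for _ in range(m)]
        gap = abs(to_number(phi(bits)) - tent(to_number(bits)))
        worst = max(worst, gap)
    return worst
```

When the leading bit is 0 the two sides agree exactly; when it is 1 they differ by exactly 2^(−(m−1)), so `max_discrepancy(m)` never exceeds that bound and the discrepancy vanishes as the prefix length m grows. Floating-point numbers are deliberately avoided here, since repeated doubling of a binary float quickly exhausts its finitely many bits.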

A). Instead of eliminating equations.1 Finding a stationary distribution As explained above. for all A ⊂ S. suppose that the balance equations hold. If the state space has d states.j as a “current” ﬂowing from i to j. then the above are d + 1 equations. a stationary distribution π satisﬁes π = πP π1 = 1 In other words. When a Markov chain is stationary we refer to it as being in 4.i Multiply π(i) by j∈S pi. i∈A j∈Ac Think of π(i)pi. This is OK.Note on terminology: steady-state.i + j∈Ac π(j)pj. Ac ) = F (Ac . Ac ) := π(i)pi. If the latter condition holds.i . The convenience is often suggested by the topological structure of the chain. j∈S i ∈ S. Writing the equations in this form is not always the best thing to do. π(j)pj. therefore one of them is always redundant. Fix a set A ⊂ S and write π(i) = j∈S π(j)pj. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A.j = j∈A π(j)pj.i = j∈A π(j)pj. we introduce more. Ac ) is the total current from A to Ac . π is a stationary distribution if and only if π1 = 1 and F (A. because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. a set of d linearly independent equations which can be more easily solved.j (which equals 1): π(i) j∈S pi. We have: Proposition 1. as we will see in examples. Proof. then choose A = {i} and see that you obtain the balance equations.j . amongst them. Conversely. π(i) = j∈S (balance equations) (normalisation condition). expecting to be able to choose.i 13 . π(j) = 1.i + j∈Ac π(j)pj. But we only have d unknowns. So F (A.

Now split the left sum also:
\[ \sum_{j \in A} \pi(i) p_{i,j} + \sum_{j \in A^c} \pi(i) p_{i,j} = \sum_{j \in A} \pi(j) p_{j,i} + \sum_{j \in A^c} \pi(j) p_{j,i}. \]
This is true for all $i \in S$. Summing over $i \in A$, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

[Figure: Schematic representation of the flow balance relations, showing the flows $F(A, A^c)$ and $F(A^c, A)$ between $A$ and $A^c$.]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with $p_{12} = \alpha$, $p_{21} = \beta$. (Consequently, $p_{11} = 1 - \alpha$, $p_{22} = 1 - \beta$.) Assume $0 < \alpha, \beta < 1$. We have
\[ \pi(1)\alpha = \pi(2)\beta, \]
which immediately tells us that $\pi(1)$ is proportional to $\beta$ and $\pi(2)$ proportional to $\alpha$:
\[ \pi(1) = c\beta, \qquad \pi(2) = c\alpha. \]
The constant $c$ is found from $\pi(1) + \pi(2) = 1$. That is,
\[ \pi(1) = \frac{\beta}{\alpha + \beta}, \qquad \pi(2) = \frac{\alpha}{\alpha + \beta}. \]
If we take $\alpha = 0.05$, $\beta = 0.99$ (as in the mouse example), we find $\pi(1) = 0.99/1.04 \approx 0.952$, $\pi(2) = 0.05/1.04 \approx 0.048$. Thus, in steady state, the mouse spends only 95.2% of the time in room 1, and 4.8% of the time in room 2.

Example: a walk with a barrier. Consider the following chain, where $p + q = 1$:

[Figure: transition diagram on the states 0, 1, 2, 3, 4, ..., with arrows to the right labelled $p$ and arrows to the left labelled $q$.]

The balance equations are:
\[ \pi(i) = p\pi(i-1) + q\pi(i+1), \quad i = 1, 2, 3, \ldots \]

But we can make them simpler by choosing $A = \{0, 1, \ldots, i-1\}$. If $i \geq 1$ we have
\[ F(A, A^c) = \pi(i-1)p = \pi(i)q = F(A^c, A). \]
These are much simpler than the previous ones. In fact, we can solve them immediately. For $i \geq 1$,
\[ \pi(i) = (p/q)^i \pi(0). \]
Does this provide a stationary distribution? Yes, provided that $\sum_{i=0}^{\infty} \pi(i) = 1$. This is possible if and only if $p < q$, i.e. $p < 1/2$. (If $p \geq 1/2$ there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair $(S, E)$ where $S$ is the state space and $E \subset S \times S$ is defined by
\[ (i, j) \in E \iff p_{i,j} > 0. \]
In graph-theoretic terminology, $S$ is a set of vertices, and $E$ a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set $E$ as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state $i$ leads to state $j$ if, starting from $i$, the chain visits $j$ at some finite time with positive probability. In other words,
\[ i \text{ leads to } j \ (\text{written as } i \leadsto j) \iff P_i(\exists n \geq 0 \ X_n = j) > 0. \]
Notice that this relation is reflexive:
\[ \forall i \in S \quad i \leadsto i. \]
Furthermore:

Theorem 1. The following are equivalent:
(i) $i \leadsto j$;
(ii) $p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j} > 0$ for some $n \in \mathbb{N}$ and some states $i_1, \ldots, i_{n-1}$;
(iii) $P_i(X_n = j) > 0$ for some $n \geq 0$.

Proof. If $i = j$ there is nothing to prove. So assume $i \neq j$.
• Suppose that $i \leadsto j$. Since
\[ 0 < P_i(\exists n \geq 0 \ X_n = j) \leq \sum_{n \geq 0} P_i(X_n = j), \]

the sum on the right must have a nonzero term. In other words, (iii) holds.
• Suppose that (ii) holds. We have
\[ P_i(X_n = j) = \sum_{i_1, \ldots, i_{n-1}} p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j}, \]
where the sum extends over all possible choices of states in $S$. By assumption, the sum contains a nonzero term. Hence $P_i(X_n = j) > 0$, and so (iii) holds.
• Suppose now that (iii) holds. Look at the last display again. Since $P_i(X_n = j) > 0$, we have that the sum is also positive and so one of its terms is positive, meaning that (ii) holds. In addition, (i) holds, because $P_i(X_n = j) \leq P_i(\exists m \geq 0 \ X_m = j)$.

Corollary 1. The relation $\leadsto$ is transitive: If $i \leadsto j$ and $j \leadsto k$ then $i \leadsto k$.

5.3 The relation "communicates with"

Define next the relation
\[ i \text{ communicates with } j \ (\text{written as } i \leftrightsquigarrow j) \iff i \leadsto j \text{ and } j \leadsto i. \]
Obviously, this is symmetric ($i \leftrightsquigarrow j \iff j \leftrightsquigarrow i$).

Corollary 2. The relation $\leftrightsquigarrow$ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because $\leadsto$ is so.

Just as any equivalence relation, it partitions $S$ into equivalence classes known as communicating classes. The communicating class corresponding to the state $i$ is, by definition, the set of all states that communicate with $i$:
\[ [i] := \{ j \in S : j \leftrightsquigarrow i \}. \tag{6} \]
So, by definition, $[i] = [j]$ if and only if $i \leftrightsquigarrow j$. Two communicating classes are either identical or completely disjoint.

[Figure: transition diagram of a chain on the states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: $\{1\}$, $\{2, 3\}$, $\{4, 5\}$, $\{6\}$. The first two classes differ from the last two in character. The class $\{4, 5\}$ is closed: if the chain goes in there then it never moves out of it. However, the class $\{2, 3\}$ is not closed.

More generally, we say that a set of states $C \subset S$ is closed if
\[ \sum_{j \in C} p_{i,j} = 1, \quad \text{for all } i \in C. \]

Lemma 3. A communicating class $C$ is closed if and only if the following holds: If $i \in C$ and if $i \leadsto j$ then $j \in C$.

Proof. Suppose first that $C$ is a closed communicating class. Suppose that $i \leadsto j$. We then can find states $i_1, \ldots, i_{n-1}$ such that $p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j} > 0$. Thus, $p_{i,i_1} > 0$. Since $i \in C$, and $C$ is closed, we have $i_1 \in C$. Similarly, $p_{i_1,i_2} > 0$, and so $i_2 \in C$, and so on; we conclude that $j \in C$. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix $P$) is called irreducible. In graph terms, starting from any state, we can reach any other state. In other words still, the state space $S$ is a closed communicating class.

If a state $i$ is such that $p_{i,i} = 1$ then we say that $i$ is an absorbing state. To put it otherwise, the single-element set $\{i\}$ is a closed communicating class.

Note: If we consider a Markov chain with transition matrix $P$ and fix $n \in \mathbb{N}$ then the Markov chain with transition matrix $P^n$ has exactly the same closed communicating classes. Why?

A state $i$ is essential if it belongs to a closed communicating class. Otherwise, the state is inessential.

Note: An inessential state $i$ is such that there exists a state $j$ such that $i \leadsto j$ but $j \not\leadsto i$.

[Figure: five communicating classes $C_1, C_2, C_3, C_4, C_5$ with arrows between some of them.]

The general structure of a Markov chain is indicated in this figure. Note that there can be no arrows between closed communicating classes. The classes $C_4$, $C_5$ in the figure above are closed communicating classes. The communicating classes $C_1$, $C_2$, $C_3$ are not closed. There can be arrows between two nonclosed communicating classes but they are always in the same direction.

A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them.
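The definitions of this section depend only on which transition probabilities are positive, so they can be checked mechanically on the graph of a chain. The following is a minimal sketch, not part of the original notes. The edge list `EDGES` is a hypothetical six-state chain chosen to have the classes {1}, {2, 3}, {4, 5}, {6}; it is an assumption and not a transcription of any figure. The function names are likewise illustrative.

```python
def reachable_from(edges, start):
    """Set of states j with start ~> j (start itself included, via n = 0)."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j in edges[i]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def communicating_classes(edges):
    """Partition the states into classes of mutually reachable states."""
    reach = {i: reachable_from(edges, i) for i in edges}
    classes, assigned = [], set()
    for i in sorted(edges):
        if i in assigned:
            continue
        # j communicates with i iff i leads to j and j leads to i.
        cls = sorted(j for j in reach[i] if i in reach[j])
        classes.append(cls)
        assigned.update(cls)
    return classes


def is_closed(edges, cls):
    """A set is closed iff no edge leaves it (all mass stays inside)."""
    return all(j in cls for i in cls for j in edges[i])


# Hypothetical chain: edges[i] lists the states j with p_{i,j} > 0.
# Every state must appear as a key.
EDGES = {1: [1, 2], 2: [3], 3: [2, 4, 6], 4: [5], 5: [4], 6: [6]}

CLASSES = communicating_classes(EDGES)
CLOSED = [c for c in CLASSES if is_closed(EDGES, c)]
ESSENTIAL = sorted(i for c in CLOSED for i in c)
IRREDUCIBLE = len(CLASSES) == 1
```

In this sketch the essential states are exactly those lying in a closed class, and the chain is irreducible precisely when there is a single communicating class.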

5.4 Period

Consider the following chain:

[Figure: transition diagram of a chain on the four states 1, 2, 3, 4.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set $C_1 = \{1, 3\}$ at some point of time then it will move to the set $C_2 = \{2, 4\}$ at the next step and vice versa. The chain alternates between the two sets. So if the chain starts with $X_0 \in C_1$ then we know that $X_1 \in C_2$, $X_2 \in C_1$, $X_3 \in C_2$, $X_4 \in C_1$, .... We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by $p_{ij}^{(n)}$ the entries of $P^n$. Since $P^{m+n} = P^m P^n$, we have
\[ p_{ij}^{(m+n)} \geq p_{ik}^{(m)} p_{kj}^{(n)}, \quad \text{for all } m, n \geq 0, \ i, j, k \in S. \]
So $p_{ii}^{(2n)} \geq \left( p_{ii}^{(n)} \right)^2$, and, more generally,
\[ p_{ii}^{(\ell n)} \geq \left( p_{ii}^{(n)} \right)^{\ell}, \quad \ell \in \mathbb{N}. \]
Therefore if $p_{ii}^{(n)} > 0$ then $p_{ii}^{(\ell n)} > 0$ for all $\ell \in \mathbb{N}$. Another way to say this is: If the integer $n$ divides $m$ (denote this by: $n \mid m$) and if $p_{ii}^{(n)} > 0$ then $p_{ii}^{(m)} > 0$. So, whether it is possible to return to $i$ in $m$ steps can be decided by one of the integer divisors of $m$.

The period of an essential state is defined as the greatest common divisor of all natural numbers $n$ such that it is possible to return to $i$ in $n$ steps:
\[ d(i) := \gcd\{ n \in \mathbb{N} : P_i(X_n = i) > 0 \}. \]
A state $i$ is called aperiodic if $d(i) = 1$. The period of an inessential state is not defined.

Theorem 2 (the period is a class property). If $i \leftrightsquigarrow j$ then $d(i) = d(j)$.

Proof. Consider two distinct essential states $i, j$. Let $D_i = \{ n \in \mathbb{N} : p_{ii}^{(n)} > 0 \}$, $D_j = \{ n \in \mathbb{N} : p_{jj}^{(n)} > 0 \}$, $d(i) = \gcd D_i$, $d(j) = \gcd D_j$. If $i \leadsto j$ there is $\alpha \in \mathbb{N}$ with $p_{ij}^{(\alpha)} > 0$ and if

$j \leadsto i$ there is $\beta \in \mathbb{N}$ with $p_{ji}^{(\beta)} > 0$. So $p_{jj}^{(\alpha+\beta)} \geq p_{ij}^{(\alpha)} p_{ji}^{(\beta)} > 0$, showing that $\alpha + \beta \in D_j$. Therefore
\[ d(j) \mid \alpha + \beta. \]
Now let $n \in D_i$. Then $p_{ii}^{(n)} > 0$ and so $p_{jj}^{(\alpha+n+\beta)} \geq p_{ij}^{(\alpha)} p_{ii}^{(n)} p_{ji}^{(\beta)} > 0$, showing that $\alpha + n + \beta \in D_j$. Therefore
\[ d(j) \mid \alpha + \beta + n \quad \text{for all } n \in D_i. \]
If a number divides two other numbers then it also divides their difference. From the last two displays we get
\[ d(j) \mid n \quad \text{for all } n \in D_i. \]
So $d(j)$ is a divisor of all elements of $D_i$. Since $d(i)$ is the greatest common divisor we have $d(i) \geq d(j)$ (in fact, $d(j) \mid d(i)$). Arguing symmetrically, we obtain $d(i) \leq d(j)$ as well. So $d(i) = d(j)$.

Theorem 3. If $i$ is an aperiodic state then there exists $n_0$ such that $p_{ii}^{(n)} > 0$ for all $n \geq n_0$.

Proof. Pick $n_2, n_1 \in D_i$ such that $n_2 - n_1 = 1$. Let $n$ be sufficiently large. Divide $n$ by $n_1$ to obtain $n = q n_1 + r$, where the remainder $r \leq n_1 - 1$. Therefore $n = q n_1 + r(n_2 - n_1) = (q - r) n_1 + r n_2$. Because $n$ is sufficiently large, $q - r > 0$. Since $n_1, n_2 \in D_i$, we have $n \in D_i$ as well.

Of particular interest are irreducible chains, i.e. chains where, as defined earlier, all states communicate with one another. An irreducible chain has period $d$ if one (and hence all) of the states have period $d$. In particular, if $d = 1$, the chain is called aperiodic.

Corollary 3. An irreducible chain with finitely many states is aperiodic if and only if there exists an $n$ such that $p_{ij}^{(n)} > 0$ for all $i, j \in S$.

More generally, if an irreducible chain has period $d$ then we can decompose the state space into $d$ sets $C_0, C_1, \ldots, C_{d-1}$ such that the chain moves cyclically between them.

[Figure: This figure shows the internal structure of a closed communicating class with period $d = 4$.]

Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period $d$. Then we can uniquely partition the state space $S$ into $d$ disjoint subsets $C_0, C_1, \ldots, C_{d-1}$ such that
\[ \sum_{j \in C_{r+1}} p_{ij} = 1, \quad i \in C_r, \quad r = 0, 1, \ldots, d-1. \]
(Here $C_d := C_0$.)
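Before the proof, it may help to see the period and the cyclic classes computed on a small example. The sketch below is an illustration, not the construction used in the notes: it relies on the standard fact that, for an irreducible chain, the period equals the gcd of `level[u] + 1 - level[v]` over all edges `u -> v`, where `level` is the breadth-first distance from a fixed root. The chain `SQUARE` (a walk on a 4-cycle) is a hypothetical example with period 2, and all names are assumptions.

```python
from collections import deque
from math import gcd


def period_and_cyclic_classes(edges, root):
    """Period d and cyclic classes C_0, ..., C_{d-1} of an irreducible chain.

    edges[u] lists the states v with p_{u,v} > 0. Irreducibility is assumed
    and not checked. C_0 is the class containing `root`.
    """
    level = {root: 0}
    queue = deque([root])
    d = 0
    while queue:
        u = queue.popleft()
        for v in edges[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
            # Each edge contributes the discrepancy between the step count
            # along this edge and the breadth-first levels.
            d = gcd(d, level[u] + 1 - level[v])
    classes = [sorted(v for v in level if level[v] % d == r) for r in range(d)]
    return d, classes


# Hypothetical chain: from each corner of a square, move to a neighbouring corner.
SQUARE = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
D, CYCLIC = period_and_cyclic_classes(SQUARE, 1)
```

For `SQUARE` this gives period 2 with cyclic classes {1, 3} and {2, 4}, matching the alternation described at the start of this section. A directed 3-cycle gives period 3, and adding a self-loop to any state of an irreducible chain makes it aperiodic.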

Proof. Define the relation
\[ i \overset{d}{\sim} j \iff p_{ij}^{(nd)} > 0 \text{ for some } n \geq 0. \]
Notice that this is an equivalence relation. Hence it partitions the state space $S$ into $d$ disjoint equivalence classes. We show that these classes satisfy what we need. Assume $d > 1$. Pick a state $i_0$ and let $C_0$ be its equivalence class (i.e. the set of states $j$ such that $i_0 \overset{d}{\sim} j$). Then pick $i_1$ such that $p_{i_0 i_1} > 0$. Let $C_1$ be the equivalence class of $i_1$. Continue in this manner and define states
\[ i_0, i_1, \ldots, i_{d-1} \]
with corresponding classes
\[ C_0, C_1, \ldots, C_{d-1}. \]
It is easy to see that if we continue and pick a further state $i_d$ with $p_{i_{d-1}, i_d} > 0$, then, necessarily, $i_d \in C_0$. We now show that if $i$ belongs to one of these classes and if $p_{ij} > 0$ then, necessarily, $j$ belongs to the next class. Take, for example, $i \in C_0$. Suppose $p_{ij} > 0$ but $j \notin C_1$ but, say, $j \in C_2$. Consider the path
\[ i_0 \to i \to j \to i_2 \to i_3 \to \cdots \to i_d \to i_0. \]
Such a path is possible because of the choice of the $i_0, i_1, \ldots$, and by the assumptions that $i_0 \overset{d}{\sim} i$, $i_2 \overset{d}{\sim} j$, $i_d \overset{d}{\sim} i_0$. The existence of such a path implies that it is possible to go from $i_0$ to $i_0$ in a number of steps which is an integer multiple of $d$ plus $d - 1$, and hence not a multiple of $d$ (why?), which contradicts the definition of $d$.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix $P$. We define the hitting time$^4$ of a set of states $A$ by
\[ T_A := \inf\{ n \geq 0 : X_n \in A \}. \]
[Footnote 4: We shall later consider the time $\inf\{ n \geq 1 : X_n \in A \}$ which differs from $T_A$ simply by considering $n \geq 1$ instead of $n \geq 0$. If $X_0 \notin A$ the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.]

We are interested in deriving formulae for the probability that this time is finite as well as the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

[Figure: chain on the states 0, 1, 2. From state 1 the chain stays at 1 with probability 1/3, moves to 0 with probability 1/2, and moves to 2 with probability 1/6. States 0 and 2 are absorbing.]

It is clear that $P_1(T_0 < \infty) < 1$, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states $A$, and define
\[ \varphi(i) := P_i(T_A < \infty). \]
We then have:

Theorem 5. The function $\varphi(i)$ satisfies
\[ \varphi(i) = 1, \quad i \in A, \]
\[ \varphi(i) = \sum_{j \in S} p_{ij} \varphi(j), \quad i \notin A. \]
Furthermore, if $\varphi'(i)$ is any other solution of these equations then $\varphi'(i) \geq \varphi(i)$ for all $i \in S$.

Proof. If $i \in A$ then $T_A = 0$, and so $\varphi(i) = 1$. If $i \notin A$, then $T_A \geq 1$. So $T_A = 1 + T_A'$ (where $T_A'$ is the remaining time until $A$ is hit). We first have
\[ P_i(T_A < \infty) = \sum_{j \in S} P_i(1 + T_A' < \infty \mid X_1 = j) P_i(X_1 = j) = \sum_{j \in S} P_i(T_A' < \infty \mid X_1 = j) p_{ij}. \]
But observe that the random variable $T_A'$ is a function of the future after time 1 (i.e. a function of $X_1, X_2, \ldots$). Therefore, the event $\{T_A' < \infty\}$ is independent of $X_0$, conditionally on $X_1$. Hence:
\[ P(T_A' < \infty \mid X_1 = j, X_0 = i) = P(T_A' < \infty \mid X_1 = j). \]
But the Markov chain is homogeneous, which implies that
\[ P(T_A' < \infty \mid X_1 = j) = P(T_A < \infty \mid X_0 = j) = P_j(T_A < \infty) = \varphi(j). \]
Combining the above, we have
\[ P_i(T_A < \infty) = \sum_{j \in S} \varphi(j) p_{ij}, \]
as needed. For the second part of the proof, let $\varphi'(i)$ be another solution. Then
\[ \varphi'(i) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \varphi'(j). \]
By self-feeding this equation, we have
\[ \varphi'(i) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \Big( \sum_{k \in A} p_{jk} + \sum_{k \notin A} p_{jk} \varphi'(k) \Big) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \sum_{k \in A} p_{jk} + \sum_{j \notin A} p_{ij} \sum_{k \notin A} p_{jk} \varphi'(k). \]

We recognise that the first term equals $P_i(T_A = 1)$, the second equals $P_i(T_A = 2)$, so the first two terms together equal $P_i(T_A \leq 2)$. By omitting the last term we obtain $\varphi'(i) \geq P_i(T_A \leq 2)$. By continuing self-feeding the above equation $n$ times, we obtain
\[ \varphi'(i) \geq P_i(T_A \leq n). \]
Letting $n \to \infty$, we obtain
\[ \varphi'(i) \geq P_i(T_A < \infty) = \varphi(i). \]

Example: In the example of the previous figure, we choose $A = \{0\}$, the set that contains only state 0. We let $\varphi(i) = P_i(T_0 < \infty)$, $i = 0, 1, 2$. We immediately have $\varphi(0) = 1$, and $\varphi(2) = 0$. (If the chain starts from 0 it takes no time to hit 0; if the chain starts from 2 it will never hit 0.) As for $\varphi(1)$, we have
\[ \varphi(1) = \frac{1}{3} \varphi(1) + \frac{1}{2} \varphi(0), \]
so $\varphi(1) = 3/4$.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set $A$ is hit for the first time:
\[ \psi(i) := E_i T_A. \]
We have:

Theorem 6. The function $\psi(i)$ satisfies
\[ \psi(i) = 0, \quad i \in A, \]
\[ \psi(i) = 1 + \sum_{j \notin A} p_{ij} \psi(j), \quad i \notin A. \]
Furthermore, if $\psi'(i)$ is any other solution of these equations then $\psi'(i) \geq \psi(i)$ for all $i \in S$.

Proof. Start the chain from $i$. If $i \in A$ then, obviously, $T_A = 0$ and so $\psi(i) := E_i T_A = 0$. If $i \notin A$, then $T_A = 1 + T_A'$, as above. Therefore,
\[ E_i T_A = 1 + E_i T_A' = 1 + \sum_{j \in S} p_{ij} E_j T_A = 1 + \sum_{j \notin A} p_{ij} \psi(j), \]
which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let $A = \{0, 2\}$, and let $\psi(i) = E_i T_A$. Clearly, $\psi(0) = \psi(2) = 0$, because it takes no time to hit $A$ if the chain starts from $A$. On the other hand,
\[ \psi(1) = 1 + \frac{1}{3} \psi(1), \]

23 . .e. . until set A is reached. This happens up to the ﬁrst hitting time TA ﬁrst hitting time of the set A. where.) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . B. ′ where TA is the remaining time.which gives ψ(1) = 3/2. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the ﬁrst-step analysis method. X2 .. because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ). ϕAB (i) = j∈A i∈B pij ϕAB (j). Hence. ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). in the last sum. ′ TA = 1 + TA . TB for two disjoint sets of states A. therefore the mean number of ﬂips until the ﬁrst success is 3/2. ϕAB is the minimal solution to these equations. The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ). Now the last sum is a function of the future after time 1. Clearly. It should be clear that the equations satisﬁed are: ϕAB (i) = 1. i. h(x) = f (x). by the Markov property. Next. of the random variables X1 . we just changed index from n to n + 1. Then ′ 1+TA x ∈ A. ϕAB (i) = 0. if x ∈ A. as argued earlier. because the underlying experiment is the ﬂipping of a coin with “success” probability 2/3. (This should have been obvious from the start. consider the following situation: Every time the state is x a reward f (x) is earned. after one step. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). We may ask to ﬁnd the probabilities ϕAB (i) = Pi (TA < TB ). i∈A Furthermore. . otherwise.

Thus, the set of equations satisfied by $h$ are
\[ h(x) = f(x), \quad x \in A, \]
\[ h(x) = f(x) + \sum_{y \in S} p_{xy} h(y), \quad x \notin A. \]
You can remember the latter ones by thinking that the expected total reward equals the immediate reward $f(x)$ plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room $i$, he collects $i$ pounds, for $i = 1, 2, 3$, unless he is in room 4, in which case he dies immediately.

[Figure: floor plan of a house with rooms 1, 2, 3, 4.]

The question is to compute the average total reward, if he starts from room 2, until he dies in room 4. (Sooner or later, he will die; why?) If we let $h(i)$ be the average total reward if he starts from room $i$, and note that nothing is collected in room 4, we have
\[ h(4) = 0 \]
\[ h(1) = 1 + \frac{1}{2} h(2) \]
\[ h(3) = 3 + \frac{1}{2} h(2) \]
\[ h(2) = 2 + \frac{1}{2} h(1) + \frac{1}{2} h(3). \]
We solve and find $h(2) = 8$. (Also, $h(1) = 5$, $h(3) = 7$.)

7 Gambler's ruin

Consider the simplest game of chance of tossing a (possibly unfair) coin repeatedly and independently. Let $p$ be the probability of heads and $q = 1 - p$ that of tails. If heads come up then you win £1 per pound of stake. So, if just before the $k$-th toss you decide to bet $Y_k$ pounds then after the realisation of the toss you will have earned $Y_k$ pounds if heads came up or lost $Y_k$ pounds if tails appeared. If $\xi_k$ is the outcome of the $k$-th toss (where $\xi_k = +1$ means heads and $\xi_k = -1$ means tails), your fortune after the $n$-th toss is
\[ X_n = X_0 + \sum_{k=1}^{n} Y_k \xi_k. \]
The successive stakes $(Y_k, k \in \mathbb{N})$ form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss, so

if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, $Y_k$ cannot depend on $\xi_k$. It can only depend on $\xi_1, \ldots, \xi_{k-1}$ and, as is often the case in practice, on other considerations of, e.g., an astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability $p$.

Let us consider a concrete problem. The gambler starts with $X_0 = x$ pounds and has a target: To make $b$ pounds ($b > x$). She will stop playing if her money falls to $a$ ($a < x$). To solve this problem in the simplest case, $Y_k = 1$ for all $k$, is our next goal.

We use the notation $P_x$ (respectively, $E_x$) for the probability (respectively, expectation) under the initial condition $X_0 = x$. Consider the hitting time of state $y$:
\[ T_y := \inf\{ n \geq 0 : X_n = y \}. \]

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has $P(\text{heads}) = p$, $P(\text{tails}) = 1 - p$. Let £$x$ be the initial fortune of the gambler. Then the probability that the gambler's fortune will reach level $b > x$ before reaching level $a < x$ equals:
\[ P_x(T_b < T_a) = \begin{cases} \dfrac{(q/p)^{x-a} - 1}{(q/p)^{b-a} - 1}, & \text{if } p \neq q, \\[2ex] \dfrac{x - a}{b - a}, & \text{if } p = q = 1/2, \end{cases} \qquad a \leq x \leq b. \]

Proof. We are clearly dealing with a Markov chain in the state space $\{a, a+1, a+2, \ldots, b-1, b\}$ with transition probabilities
\[ p_{i,i+1} = p, \quad p_{i,i-1} = q = 1 - p, \quad a < i < b, \]
and where the states $a$, $b$ are absorbing. We first solve the problem when $a = 0$, $b = \ell > 0$ and let
\[ \varphi(x, \ell) := P_x(T_\ell < T_0), \quad 0 \leq x \leq \ell. \]
The general solution will then be obtained from
\[ P_x(T_b < T_a) = \varphi(x - a, b - a). \]
(Think why!) To simplify things, write $\varphi(x)$ instead of $\varphi(x, \ell)$. We obviously have
\[ \varphi(0) = 0, \quad \varphi(\ell) = 1, \]

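and, by first-step analysis,
\[ \varphi(x) = p\varphi(x+1) + q\varphi(x-1), \quad 0 < x < \ell. \]
Since $p + q = 1$, we can write this as
\[ p[\varphi(x+1) - \varphi(x)] = q[\varphi(x) - \varphi(x-1)]. \]
Let $\delta(x) := \varphi(x+1) - \varphi(x)$. We have
\[ \delta(x) = \frac{q}{p} \delta(x-1) = \left( \frac{q}{p} \right)^2 \delta(x-2) = \cdots = \left( \frac{q}{p} \right)^x \delta(0). \]
Assuming that $\lambda := q/p \neq 1$, we find
\[ \varphi(x) = \delta(x-1) + \delta(x-2) + \cdots + \delta(0) = [\lambda^{x-1} + \cdots + \lambda + 1] \delta(0) = \frac{\lambda^x - 1}{\lambda - 1} \delta(0). \]
Since $\varphi(\ell) = 1$, we have $\frac{\lambda^\ell - 1}{\lambda - 1} \delta(0) = 1$ which gives $\delta(0) = (\lambda - 1)/(\lambda^\ell - 1)$. We finally find
\[ \varphi(x) = \frac{\lambda^x - 1}{\lambda^\ell - 1}, \quad 0 \leq x \leq \ell, \quad \lambda \neq 1. \]
If $\lambda = 1$ then $q = p$ and so
\[ \varphi(x+1) - \varphi(x) = \varphi(x) - \varphi(x-1). \]
This gives $\delta(x) = \delta(0)$ for all $x$ and $\varphi(x) = x\delta(0)$. Since $\varphi(\ell) = 1$, we find $\delta(0) = 1/\ell$ and so
\[ \varphi(x) = \frac{x}{\ell}, \quad 0 \leq x \leq \ell, \quad \lambda = 1. \]
Summarising, we have obtained
\[ \varphi(x, \ell) = \begin{cases} \dfrac{\lambda^x - 1}{\lambda^\ell - 1}, & \text{if } \lambda \neq 1, \\[2ex] \dfrac{x}{\ell}, & \text{if } \lambda = 1. \end{cases} \]
The general case is obtained by replacing $\ell$ by $b - a$ and $x$ by $x - a$.

Corollary 4. The ruin probability of a gambler starting with £$x$ is
\[ P_x(T_0 < \infty) = \begin{cases} (q/p)^x, & \text{if } p > 1/2, \\ 1, & \text{if } p \leq 1/2. \end{cases} \]

Proof. We have
\[ P_x(T_0 < \infty) = \lim_{\ell \to \infty} P_x(T_0 < T_\ell) = 1 - \lim_{\ell \to \infty} P_x(T_\ell < T_0). \]
If $p > 1/2$, then $\lambda = q/p < 1$, and so
\[ P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{\lambda^x - 1}{\lambda^\ell - 1} = 1 - \frac{\lambda^x - 1}{0 - 1} = \lambda^x. \]
If $p < 1/2$, then $\lambda = q/p > 1$, and so
\[ P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{\lambda^x - 1}{\lambda^\ell - 1} = 1 - 0 = 1. \]
Finally, if $p = q$,
\[ P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{x}{\ell} = 1 - 0 = 1. \]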
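The closed forms of Theorem 7 and Corollary 4 are easy to evaluate exactly, and one can check that they really solve the first-step equations. The following is a small sketch using exact rational arithmetic; the function names are illustrative and not part of the notes, and $p$ is assumed to be supplied as a `Fraction` strictly between 0 and 1.

```python
from fractions import Fraction


def win_probability(x, a, b, p):
    """P_x(T_b < T_a) from Theorem 7, for integers a <= x <= b with a < b."""
    p = Fraction(p)
    q = 1 - p
    if p == q:
        return Fraction(x - a, b - a)
    lam = q / p
    return (lam ** (x - a) - 1) / (lam ** (b - a) - 1)


def ruin_probability(x, p):
    """P_x(T_0 < infinity) from Corollary 4, for an integer x >= 0."""
    p = Fraction(p)
    q = 1 - p
    return (q / p) ** x if p > q else Fraction(1)


def satisfies_first_step_equations(a, b, p):
    """Check phi(a) = 0, phi(b) = 1 and phi(x) = p phi(x+1) + q phi(x-1) inside."""
    p = Fraction(p)
    q = 1 - p
    phi = {x: win_probability(x, a, b, p) for x in range(a, b + 1)}
    interior = all(
        phi[x] == p * phi[x + 1] + q * phi[x - 1] for x in range(a + 1, b)
    )
    return phi[a] == 0 and phi[b] == 1 and interior
```

For instance, with $a = 0$, $b = 2$, $x = 1$ the gambler wins exactly when the first toss is heads, so `win_probability(1, 0, 2, p)` should equal $p$; this gives a quick sanity check on the formula.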

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value $+\infty$. In particular, a stopping time $\tau$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$ such that $1(\tau = n)$ is a deterministic function of $(X_0, \ldots, X_n)$ for all $n \in \mathbb{Z}_+$. An example of a stopping time is the time $T_A$, the first hitting time of the set $A$, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any $n$, the future process $(X_n, X_{n+1}, \ldots)$ is independent of the past process $(X_0, \ldots, X_n)$ if we know the present $X_n$. We will now prove that this remains true if we replace the deterministic $n$ by a random stopping time $\tau$. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose $(X_n, n \geq 0)$ is a time-homogeneous Markov chain. Let $\tau$ be a stopping time. Then, conditional on $\tau < \infty$ and $X_\tau = i$, the future process $(X_{\tau+n}, n \geq 0)$ is independent of the past process $(X_0, \ldots, X_\tau)$ and has the same law as the original process started from $i$.

Proof. Let $\tau$ be a stopping time. Let $A$ be any event determined by the past $(X_0, \ldots, X_\tau)$ if $\tau < \infty$. Since
\[ (X_0, \ldots, X_\tau) 1(\tau < \infty) = \sum_{n \in \mathbb{Z}_+} (X_0, \ldots, X_n) 1(\tau = n), \]
any such $A$ must have the property that, for all $n \in \mathbb{Z}_+$, $A \cap \{\tau = n\}$ is determined by $(X_0, \ldots, X_n)$. We are going to show that for any such $A$,
\[ P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots; A \mid X_\tau, \tau < \infty) = P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots \mid X_\tau, \tau < \infty) \, P(A \mid X_\tau, \tau < \infty) \]
and that
\[ P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots \mid X_\tau = i, \tau < \infty) = p_{i,j_1} p_{j_1,j_2} \cdots \]
We have:
\[ P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots; A, X_\tau = i, \tau = n) \]
\[ \overset{(a)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots; A, X_n = i, \tau = n) \]
\[ \overset{(b)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots \mid X_n = i, A, \tau = n) \, P(X_n = i, A, \tau = n) \]
\[ \overset{(c)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots \mid X_n = i) \, P(X_n = i, A, \tau = n) \]
\[ \overset{(d)}{=} p_{i,j_1} p_{j_1,j_2} \cdots P(X_n = i, A, \tau = n), \]
where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that $A \cap \{\tau = n\}$ is determined by $(X_0, \ldots, X_n)$ and the ordinary splitting property at $n$, and (d) is due to the fact that $(X_n)$ is a time-homogeneous Markov chain with transition probabilities $p_{ij}$. Our assertions are now easily obtained from this by first summing over all $n$ (whence the event $\{\tau < \infty\}$ appears) and then by dividing both sides by $P(X_\tau = i, \tau < \infty)$.
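The defining property of a stopping time, that the event $\{\tau = n\}$ can be decided from $(X_0, \ldots, X_n)$ alone, can be illustrated on finite sample paths. The sketch below is only an illustration under stated assumptions: paths are finite tuples standing in for trajectories, the two paths in `PATHS` are hypothetical, and the helper names are not from the notes. It contrasts the first hitting time of a set with the last visit to it.

```python
def first_hitting_time(path, A):
    """T_A = inf{n >= 0 : X_n in A}; None if A is not hit along the path."""
    for n, x in enumerate(path):
        if x in A:
            return n
    return None


def last_visit_time(path, A):
    """Largest n with X_n in A along the path; None if A is never visited."""
    last = None
    for n, x in enumerate(path):
        if x in A:
            last = n
    return last


def decided_by_prefix(random_time, paths, n, A):
    """True if, over the given paths, {random_time = n} depends only on X_0..X_n."""
    verdict = {}
    for path in paths:
        prefix = tuple(path[: n + 1])
        happened = random_time(path, A) == n
        # Two paths sharing the prefix must agree on whether the event occurred.
        if verdict.setdefault(prefix, happened) != happened:
            return False
    return True


# Two hypothetical paths that agree up to time 2 and differ afterwards.
PATHS = [(1, 2, 0, 1, 1), (1, 2, 0, 2, 0)]
TARGET = {0}
```

Both paths hit `TARGET` for the first time at time 2, and this is visible from the common prefix `(1, 2, 0)`. Whether time 2 is the last visit, however, depends on what happens later, so the two paths disagree although they share the same prefix.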

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables $X_n$ comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks".

The idea is that if we can find a state $i$ that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property). Schematically,

[Figure: a sample path with the successive visit times $T_i = T_i^{(1)}, T_i^{(2)}, T_i^{(3)}, T_i^{(4)}$ to the state $i$ marked on the time axis. These excursions are independent!]

Fix a state $i \in S$ and let $T_i$ be the first time that the chain will visit this state:
\[ T_i := \inf\{ n \geq 1 : X_n = i \}. \]
Note the subtle difference between this $T_i$ and the $T_i$ of Section 6. Our $T_i$ here is always positive. The $T_i$ of Section 6 was allowed to take the value 0. If $X_0 \neq i$ then the two definitions coincide.

State $i$ may or may not be visited again. Regardless, we let $T_i^{(2)}$ be the time of the second visit. (And we set $T_i^{(1)} := T_i$.) We let $T_i^{(3)}$ be the time of the third visit, and so on. Formally, we recursively define
\[ T_i^{(r)} := \inf\{ n > T_i^{(r-1)} : X_n = i \}, \quad r = 2, 3, \ldots \]
All these random times are stopping times. (Why?) Consider the "trajectory" of the chain between two successive visits to the state $i$:
\[ \mathcal{X}_i^{(r)} := \left( X_n, \ T_i^{(r)} \leq n < T_i^{(r+1)} \right), \quad r = 1, 2, \ldots \]
Let also
\[ \mathcal{X}_i^{(0)} := \left( X_n, \ 0 \leq n < T_i \right). \]
We think of
\[ \mathcal{X}_i^{(0)}, \ \mathcal{X}_i^{(1)}, \ \mathcal{X}_i^{(2)}, \ldots \]
as "random variables", in a generalised sense. They are, in fact, random trajectories. We refer to them as excursions of the process. So $\mathcal{X}_i^{(r)}$ is the $r$-th excursion: the chain visits

state $i$ for the $r$-th time, and "goes for an excursion"; we "watch" it up until the next time it returns to state $i$. An immediate consequence of the strong Markov property is:

Theorem 9. The excursions
\[ \mathcal{X}_i^{(0)}, \ \mathcal{X}_i^{(1)}, \ \mathcal{X}_i^{(2)}, \ldots \]
are independent random variables. In particular, all, except, possibly, $\mathcal{X}_i^{(0)}$, have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at $T_i^{(r)}$: SMP says that, given the state at the stopping time $T_i^{(r)}$, the future after $T_i^{(r)}$ is independent of the past before $T_i^{(r)}$. But the state at $T_i^{(r)}$ is, by definition, equal to $i$. Therefore, the future is independent of the past, and this proves the first claim. The claim that $\mathcal{X}_i^{(1)}, \mathcal{X}_i^{(2)}, \ldots$ have the same distribution follows from the fact that the Markov chain is time-homogeneous. Therefore, if it starts from state $i$ at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important corollary of the observation of this section is that:

Corollary 5. (Numerical) functions $g(\mathcal{X}_i^{(0)}), g(\mathcal{X}_i^{(1)}), g(\mathcal{X}_i^{(2)}), \ldots$ of the excursions are independent random variables.

Example: Define
\[ \Lambda_r := \sum_{n = T_i^{(r)}}^{T_i^{(r+1)}} 1(X_n = j). \]
The meaning of $\Lambda_r$ is this: it expresses the number of visits to a state $j$ between two successive visits to the state $i$. The last corollary ensures us that $\Lambda_1, \Lambda_2, \ldots$ are independent random variables. Moreover, for time-homogeneous Markov chains, they are also identical in law (i.i.d.).

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum.

We abstract this trivial observation and give a couple of definitions:

First, we say that a state $i$ is recurrent if, starting from $i$, the chain returns to $i$ infinitely many times:
\[ P_i(X_n = i \text{ for infinitely many } n) = 1. \]
Second, we say that a state $i$ is transient if, starting from $i$, the chain returns to $i$ only finitely many times:
\[ P_i(X_n = i \text{ for infinitely many } n) = 0. \]

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:
\[ 0 < P_i(X_n = i \text{ for infinitely many } n) < 1. \]
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state $i$ (excluding the possibility that $X_0 = i$):
\[ J_i = \sum_{n=1}^{\infty} 1(X_n = i) = \sup\{ r \geq 0 : T_i^{(r)} < \infty \}. \]
Suppose that the Markov chain starts from state $i$. Notice that:
\[ \text{state } i \text{ is visited infinitely many times} \iff J_i = \infty. \]
Notice also that
\[ f_{ii} := P_i(J_i \geq 1) = P_i(T_i < \infty). \]
We claim that:

Lemma 4. Starting from $X_0 = i$, the random variable $J_i$ is geometric:
\[ P_i(J_i \geq k) = f_{ii}^k, \quad k \geq 0. \]

Proof. By the strong Markov property at time $T_i^{(k-1)}$, we have
\[ P_i(J_i \geq k) = P_i(J_i \geq k-1) P_i(J_i \geq 1). \]
Indeed, the event $\{J_i \geq k\}$ implies that $T_i^{(k-1)} < \infty$ and that $J_i \geq k - 1$. But the past before $T_i^{(k-1)}$ is independent of the future, and that is why we get the product. Therefore,
\[ P_i(J_i \geq k) = f_{ii} P_i(J_i \geq k-1) = f_{ii}^2 P_i(J_i \geq k-2) = \cdots = f_{ii}^k. \]

Notice that $f_{ii} = P_i(T_i < \infty)$ can either be equal to 1 or smaller than 1. If $f_{ii} = 1$ we have that $P_i(J_i \geq k) = 1$ for all $k$, i.e. $P_i(J_i = \infty) = 1$. This means that the state $i$ is recurrent. If $f_{ii} < 1$, then $P_i(J_i < \infty) = 1$. This means that the state $i$ is transient. Because either $f_{ii} = 1$ or $f_{ii} < 1$, we conclude what we asserted earlier, namely, every state is either recurrent or transient.

We summarise this together with another useful characterisation of recurrence and transience in the following two lemmas.

Lemma 5. The following are equivalent:
1. State $i$ is recurrent.
2. $f_{ii} = P_i(T_i < \infty) = 1$.
3. $P_i(J_i = \infty) = 1$.
4. $E_i J_i = \infty$.

Proof. State $i$ is recurrent if, by definition, it is visited infinitely many times, which, by definition of $J_i$, is equivalent to $P_i(J_i = \infty) = 1$. Hence $E_i J_i = \infty$ also. If $E_i J_i = \infty$, then, owing to the fact that $J_i$ is a geometric random variable, we see that its expectation cannot be infinity without $J_i$ itself being infinity with probability one. The equivalence of the first two statements was proved before the lemma.

Lemma 6. The following are equivalent:
1. State $i$ is transient.
2. $f_{ii} = P_i(T_i < \infty) < 1$.
3. $P_i(J_i < \infty) = 1$.
4. $E_i J_i < \infty$.

Proof. State $i$ is transient if, by definition, it is visited finitely many times with probability 1. This, by the definition of $J_i$, is equivalent to $P_i(J_i < \infty) = 1$, and, since $J_i$ is a geometric random variable, this implies that $E_i J_i < \infty$. On the other hand, if we assume $E_i J_i < \infty$, then, clearly, $J_i$ cannot take value $+\infty$ with positive probability; therefore $P_i(J_i < \infty) = 1$. The equivalence of the first two statements was proved above.

Corollary 6. If $i$ is a transient state then $p_{ii}^{(n)} \to 0$, as $n \to \infty$.

Proof. If $i$ is transient then $E_i J_i < \infty$. But
\[ E_i J_i = E_i \sum_{n=1}^{\infty} 1(X_n = i) = \sum_{n=1}^{\infty} P_i(X_n = i) = \sum_{n=1}^{\infty} p_{ii}^{(n)}. \]
But if a series is finite then the summands go to 0, i.e. $p_{ii}^{(n)} \to 0$, as $n \to \infty$.

10.1 First hitting time decomposition

Define
\[ f_{ij}^{(n)} := P_i(T_j = n), \quad n \geq 1, \]
i.e. the probability to hit state $j$ for the first time at time $n \geq 1$, assuming that the chain starts from state $i$. Notice the difference with $p_{ij}^{(n)}$. The latter is the probability to be in state $j$ at time $n$ (not necessarily for the first time) starting from $i$. Clearly,
\[ f_{ij}^{(n)} \leq p_{ij}^{(n)}, \]
but the two sequences are related in an exact way:

Lemma 7.
\[ p_{ij}^{(n)} = \sum_{m=1}^{n} f_{ij}^{(m)} p_{jj}^{(n-m)}, \quad n \geq 1. \]

Proof.
\[ p_{ij}^{(n)} = P_i(X_n = j) = \sum_{m=1}^{n} P_i(T_j = m, X_n = j) = \sum_{m=1}^{n} P_i(T_j = m) P_i(X_n = j \mid T_j = m). \]

The first term equals $f_{ij}^{(m)}$, by definition. For the second term, we use the Strong Markov Property:
$$P_i(X_n = j \mid T_j = m) = P_i(X_n = j \mid X_m = j) = p_{jj}^{(n-m)}.$$

Corollary 7. If $j$ is a transient state then $p_{ij}^{(n)} \to 0$, as $n \to \infty$, for all states $i$.

Proof. If $j$ is transient, we have $\sum_{n=1}^{\infty} p_{jj}^{(n)} = E_j J_j < \infty$. Now, by summing over $n$ the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right hand side)
$$\sum_{n=1}^{\infty} p_{ij}^{(n)} = \sum_{m=1}^{\infty} f_{ij}^{(m)} \sum_{n=m}^{\infty} p_{jj}^{(n-m)} = (1 + E_j J_j) \sum_{m=1}^{\infty} f_{ij}^{(m)} = (1 + E_j J_j)\, P_i(T_j < \infty),$$
which is finite, and hence $p_{ij}^{(n)} \to 0$ as $n \to \infty$.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state $i$ leads to state $j$", denoted by $i \rightsquigarrow j$: it means that, starting from $i$, the probability to go to $j$ in some finite time is positive. In other words, if we let
$$f_{ij} := P_i(T_j < \infty),$$
we have
$$i \rightsquigarrow j \iff f_{ij} > 0.$$
We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the state have a certain period.)

Theorem 10. If $i$ is recurrent and if $i \rightsquigarrow j$ then $j$ is recurrent and
$$f_{ij} = P_i(T_j < \infty) = 1.$$
In particular, $j \rightsquigarrow i$, and
$$g_{ij} := P_i(T_j < T_i) > 0.$$

Proof. Start with $X_0 = i$. If $i$ is recurrent and $i \rightsquigarrow j$ then there is a positive probability ($= f_{ij}$) that $j$ will appear in one of the i.i.d. excursions $X_i^{(0)}, X_i^{(1)}, \ldots$, and so the probability that $j$ will appear in a specific excursion is positive. So the random variables
$$\delta_{j,r} := 1\big(j \text{ appears in excursion } X_i^{(r)}\big), \quad r = 0, 1, \ldots$$
are i.i.d. and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that $j$ will appear in infinitely many of the excursions for sure. Hence, not only $f_{ij} > 0$, but also $f_{ij} = 1$. Hence, $j \rightsquigarrow i$. The last statement is simply an expression in symbols of what we said above. Indeed, the probability that $j$ will appear in a specific excursion equals $g_{ij} = P_i(T_j < T_i)$.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Proof. If $i$ is recurrent then every other $j$ in the same communicating class satisfies $i \rightsquigarrow j$, and hence, by the above theorem, every other $j$ is also recurrent. If a communicating class contains a state which is transient then, necessarily, all other states are also transient.

Theorem 11. If $i$ is inessential then it is transient.

Proof. Suppose that $i$ is inessential, i.e. there is a state $j$ such that $i \rightsquigarrow j$ but $j \not\rightsquigarrow i$. Let $B := \{X_n = i \text{ for infinitely many } n\}$. There is $m$ such that the event $A := \{X_m = j\}$ satisfies
$$P_i(A) = P_i(X_m = j) > 0,$$
but
$$P_i(A \cap B) = P_i(X_m = j,\ X_n = i \text{ for infinitely many } n) = 0.$$
Hence,
$$0 < P_i(A) = P_i(A \cap B) + P_i(A \cap B^c) = P_i(A \cap B^c) \le P_i(B^c),$$
i.e.
$$P_i(B) = P_i(X_n = i \text{ for infinitely many } n) < 1.$$
Since this probability is either 0 or 1, we have $P_i(B) = 0$, and so $i$ is a transient state.

So if a communicating class is recurrent then it contains no inessential states and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let $C$ be the class. By closedness, if started in $C$, $X$ remains in $C$. By finiteness, some state $i$ must be visited infinitely many times. This state is recurrent. Hence all states are recurrent.

11 Positive recurrence

Recall that if we fix a state $i$ and let $T_i$ be the first return time to $i$ then either $P_i(T_i < \infty) = 1$ (in which case we say that $i$ is recurrent) or $P_i(T_i < \infty) < 1$ (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let $U_0, U_1, U_2, \ldots$ be i.i.d. random variables, uniformly distributed in the unit interval $[0, 1]$. Let
$$T := \inf\{n \ge 1 : U_n > U_0\}.$$
Thus $T$ represents the number of variables we need to draw to get a value larger than the initial one. It should be clear that this can, for example, appear in a certain game. What is the expectation of $T$? First, observe that $T$ is finite. Indeed,
$$P(T > n) = P(U_1 \le U_0, \ldots, U_n \le U_0) = \int_0^1 P(U_1 \le x, \ldots, U_n \le x \mid U_0 = x)\, dx = \int_0^1 x^n\, dx = \frac{1}{n+1}.$$

Z2 . . Let Z1 . the sequence X0 .Therefore. (i) Suppose E|Z1 | < ∞ and let µ = EZ1 . 34 . To our rescue comes the regenerative structure of a Markov chain identiﬁed at an earlier section. 0) > −∞. n+1 You are invited to think about the meaning of this: If the initial value U0 is unknown. an inﬁnite number of steps to exceed it! We now classify a recurrent state i as follows: We say that i is positive recurrent if Ei Ti < ∞. be i. 13 Law of Large Numbers for Markov chains The law of large numbers stated above requires independence. . random variables. P (T < ∞) = lim P (T ≤ n) = lim (1 − n→∞ ∞ n→∞ 1 ) = 1. of successive states is not an independent sequence. . but E min(Z1 . on the average. 0) = ∞. We say that i is null recurrent if Ei Ti = ∞. To apply the law of large numbers we must create some kind of independence. It is traditional to refer to it as “Law”. Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. It is not a Law. Theorem 13. then it takes. We state it here without proof. Then P Z1 + · · · + Zn =∞ n→∞ n lim = 1. Then P Z1 + · · · + Zn =µ n→∞ n lim = 1. (ii) Suppose E max(Z1 . Before doing that. X1 . But this is misleading. .i. When we have a Markov chain. arguably. the Fundamental Theorem of Probability Theory.d. it is a Theorem that can be derived from the axioms of Probability Theory. . X2 . . we shall discover what the form of this stationary distribution is by using the Law of Large Numbers 12 Law (=Theorem) of Large Numbers in Probability Theory The following result is. n+1 But ET = ∞ P (T > n) = n=1 n=1 1 = ∞.

13.1 Functions of excursions

Namely, given a recurrent state $a \in S$ (which will be referred to as the ground state for no good reason other than that it has been fixed once and for all) the sequence of excursions
$$X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, \ldots$$
forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that
$$G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), \ldots$$
are i.i.d. random variables, for reasonable choices of the functions $G$ taking values in $\mathbb{R}$.

Example 1: To each state $x \in S$, assign a reward $f(x)$. Then, for an excursion $\mathcal{X}$, define $G(\mathcal{X})$ to be the maximum reward received during this excursion. Specifically, in this example,
$$G(X_a^{(r)}) = \max\{f(X_t) : T_a^{(r)} \le t < T_a^{(r+1)}\}.$$

Example 2: Let $f(x)$ be as in Example 1, but define $G(\mathcal{X})$ to be the total reward received over the excursion $\mathcal{X}$. Thus,
$$G(X_a^{(r)}) = \sum_{T_a^{(r)} \le t < T_a^{(r+1)}} f(X_t). \qquad (7)$$

Whichever the choice of $G$, the law of large numbers tells us that
$$\frac{G(X_a^{(1)}) + \cdots + G(X_a^{(n)})}{n} \to EG(X_a^{(1)}), \qquad (8)$$
as $n \to \infty$, with probability 1. Note that
$$EG(X_a^{(1)}) = EG(X_a^{(2)}) = \cdots = E\big[G(X_a^{(0)}) \mid X_0 = a\big] = E_a G(X_a^{(0)}),$$
because the excursions $X_a^{(1)}, X_a^{(2)}, \ldots$ are i.i.d., and because if we start with $X_0 = a$, the initial excursion $X_a^{(0)}$ also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on $X_0 = a$.

13.2 Average reward

Consider now a function $f : S \to \mathbb{R}$, which we can think of as a reward received when the chain is in state $x$. We are interested in the existence of the limit
$$\text{time-average reward} = \lim_{t\to\infty} \frac{1}{t} \sum_{k=0}^{t} f(X_k),$$
which represents the 'time-average reward' received. In particular, we shall assume that $f$ is bounded and shall examine those conditions which yield a nontrivial limit.
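To make the excursion functionals of Examples 1 and 2 concrete, here is a minimal simulation sketch. The 3-state transition matrix `P`, the integer rewards `f`, the seed, and the helper names `simulate` and `excursion_rewards` are all assumptions chosen for illustration; none of them come from the notes.

```python
import random

def simulate(P, x0, steps, rng):
    """Simulate `steps` transitions of a chain with transition matrix P from x0."""
    path = [x0]
    for _ in range(steps):
        row = P[path[-1]]
        # Inverse-CDF sampling; the fallback state guards against rounding.
        u, acc, nxt = rng.random(), 0.0, len(row) - 1
        for y, p in enumerate(row):
            acc += p
            if u < acc:
                nxt = y
                break
        path.append(nxt)
    return path

def excursion_rewards(path, a, f, G=sum):
    """Apply G to the rewards f(X_t) collected in each complete excursion from a.

    An excursion runs from one visit to a (included) to the next (excluded).
    """
    visits = [t for t, x in enumerate(path) if x == a]
    return [G(f[path[t]] for t in range(s, e)) for s, e in zip(visits, visits[1:])]

# Hypothetical chain and rewards, for illustration only.
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
f = [1, 0, 3]
path = simulate(P, 0, 20000, random.Random(1))
totals = excursion_rewards(path, 0, f)         # Example 2: total reward
maxima = excursion_rewards(path, 0, f, G=max)  # Example 1: maximum reward
# Empirical averages over excursions, as in the law of large numbers (8).
print(sum(totals) / len(totals), sum(maxima) / len(maxima))
```

Because the simulated chain starts at the ground state, the initial excursion is counted along with the others, which is harmless since it has the same law.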

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state $a \in S$ (i.e. $E_a T_a < \infty$) and assume that the distribution of $X_0$ is such that
$$P(T_a < \infty) = 1.$$
Let $f : S \to \mathbb{R}_+$ be a bounded 'reward' function. Then, with probability one, the 'time-average reward' $\bar f$ exists and is a deterministic quantity:
$$P\Big(\lim_{t\to\infty} \frac{1}{t} \sum_{n=0}^{t} f(X_n) = \bar f\Big) = 1,$$
where
$$\bar f = \frac{E_a \sum_{n=0}^{T_a - 1} f(X_n)}{E_a T_a} = \sum_{x \in S} f(x)\pi(x).$$

Proof. We shall present the proof in a way that the constant $\bar f$ will reveal itself: it will be discovered. The idea is this: Suppose that $t$ is large enough. We are looking for the total reward received between 0 and $t$. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, $N_t$, say, of complete excursions from the ground state $a$ that occur between 0 and $t$:
$$N_t := \max\{r \ge 1 : T_a^{(r)} \le t\}.$$
For example, in the figure below, $N_t = 5$.

[Figure: time axis marked $0, T_a^{(1)}, T_a^{(2)}, T_a^{(3)}, T_a^{(4)}, T_a^{(5)}, t$.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
$$\sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t - 1} G_r + G_t^{\mathrm{first}} + G_t^{\mathrm{last}}.$$
In other words, $G_r$ is given by (7), $G_t^{\mathrm{first}}$ is the reward up to time $T_a^{(1)} = T_a$ or $t$ (whichever is smaller), and $G_t^{\mathrm{last}}$ is the reward over the last incomplete cycle. We assumed that $f$ is bounded, say $|f(x)| \le B$. Hence $|G_t^{\mathrm{first}}| \le B T_a$. Since $P(T_a < \infty) = 1$, we have $B T_a / t \to 0$, as $t \to \infty$, with probability one, and so
$$\frac{1}{t} G_t^{\mathrm{first}} \to 0,$$
as $t \to \infty$, with probability one. A similar argument shows that
$$\frac{1}{t} G_t^{\mathrm{last}} \to 0,$$

also. So we are left with the sum of the rewards over complete cycles. Write now
$$\frac{1}{t} \sum_{r=1}^{N_t - 1} G_r = \frac{N_t}{t} \cdot \frac{1}{N_t} \sum_{r=1}^{N_t - 1} G_r. \qquad (9)$$
Look again at what the Law of Large Numbers (8) tells us: that
$$\lim_{n\to\infty} \frac{1}{n} \sum_{r=1}^{n} G_r = EG_1,$$
with probability one. Since $T_a^{(r)} \to \infty$, as $r \to \infty$, it follows that the number $N_t$ of complete cycles before time $t$ will also be tending to $\infty$, as $t \to \infty$. Hence,
$$\lim_{t\to\infty} \frac{1}{N_t} \sum_{r=1}^{N_t - 1} G_r = EG_1. \qquad (10)$$
Now look at the sequence
$$T_a^{(r)} = T_a^{(1)} + \sum_{k=2}^{r} \big(T_a^{(k)} - T_a^{(k-1)}\big)$$
and see what the Law of Large Numbers tells us again. Since $T_a^{(1)}/r \to 0$, as $r \to \infty$, and since $T_a^{(k)} - T_a^{(k-1)}$, $k = 2, 3, \ldots$, are i.i.d. random variables with common expectation $E_a T_a$, we have
$$\lim_{r\to\infty} \frac{T_a^{(r)}}{r} = E_a T_a. \qquad (11)$$
By definition, we have that the visit with index $r = N_t$ to the ground state $a$ is the last before $t$ (see also the figure above):
$$T_a^{(N_t)} \le t < T_a^{(N_t + 1)}.$$
Hence
$$\frac{T_a^{(N_t)}}{N_t} \le \frac{t}{N_t} < \frac{T_a^{(N_t + 1)}}{N_t}.$$
But (11) tells us that
$$\lim_{t\to\infty} \frac{T_a^{(N_t)}}{N_t} = \lim_{t\to\infty} \frac{T_a^{(N_t + 1)}}{N_t} = E_a T_a.$$
Hence
$$\lim_{t\to\infty} \frac{N_t}{t} = \frac{1}{E_a T_a}. \qquad (12)$$
Using (10) and (12) in (9) we have
$$\lim_{t\to\infty} \frac{1}{t} \sum_{r=1}^{N_t - 1} G_r = \frac{EG_1}{E_a T_a}.$$
Thus
$$\lim_{t\to\infty} \frac{1}{t} \sum_{n=0}^{t} f(X_n) = \frac{EG_1}{E_a T_a} =: \bar f.$$

To see that $\bar f$ is as claimed, just observe that
$$EG_1 = E \sum_{T_a^{(1)} \le t < T_a^{(2)}} f(X_t) = E_a \sum_{0 \le t < T_a} f(X_t),$$
from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant $\bar f$ is a sort of average over an excursion. The numerator of $\bar f$ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). The denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both. We may write $EG_1 = E_a \sum_{0 \le t < T_a} f(X_t)$ or $EG_1 = E_a \sum_{0 < t \le T_a} f(X_t)$.

[Figure: the rewards $f(X_0) = f(a), f(X_1), f(X_2), \ldots, f(X_{T_a - 1})$ over one excursion, on a time axis from 0 to $T_a$.]

Comment 2: A very important particular case occurs when we choose
$$f(x) := 1(x = b),$$
i.e. we let $f$ be 1 at the state $b$ and 0 everywhere else. The theorem above then says that, for all $b \in S$,
$$\lim_{t\to\infty} \frac{1}{t} \sum_{n=1}^{t} 1(X_n = b) = \frac{E_a \sum_{k=0}^{T_a - 1} 1(X_k = b)}{E_a T_a}, \qquad (13)$$
with probability 1.

Comment 3: If in the above we let $b = a$, the ground state, then we see that the numerator equals 1. Hence
$$\lim_{t\to\infty} \frac{1}{t} \sum_{n=1}^{t} 1(X_n = a) = \frac{1}{E_a T_a}, \qquad (14)$$
with probability 1. The physical interpretation of the latter should be clear: If, on the average, it takes $E_a T_a$ units of time between successive visits to the state $a$, then the long-run proportion of visits to the state $a$ should equal $1/E_a T_a$.

Comment 4: Suppose that two states, $a$, $b$, communicate and are positive recurrent. We can use $b$ as a ground state instead of $a$. Applying formula (14) with $b$ in place of $a$ we find that the limit in (13) also equals $1/E_b T_b$. The equality of these two limits gives:
$$\text{average no. of times state } b \text{ is visited between two successive visits to state } a = E_a \sum_{k=0}^{T_a - 1} 1(X_k = b) = \frac{E_a T_a}{E_b T_b}.$$
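The identity of Comment 4 can be checked exactly on a small example. The sketch below is an illustration under assumptions: the 3-state matrix `P` is hypothetical, and the helpers `solve`, `taboo_system`, `mean_return_time` and `visits_per_excursion` are names introduced here. Both expectations are computed by first-step analysis in exact rational arithmetic.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A assumed invertible)."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                fac = M[r][c]
                M[r] = [vr - fac * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def taboo_system(P, a):
    """States other than a, and the matrix I - Q, with Q = P restricted to them."""
    others = [y for y in range(len(P)) if y != a]
    A = [[F(int(y == z)) - P[y][z] for z in others] for y in others]
    return others, A

def mean_return_time(P, a):
    """E_a T_a: one step from a, then the expected time h(y) to hit a from y."""
    others, A = taboo_system(P, a)
    h = solve(A, [F(1)] * len(others))  # h(y) = 1 + sum_{z != a} p_yz h(z)
    return 1 + sum(P[a][y] * hy for y, hy in zip(others, h))

def visits_per_excursion(P, a, b):
    """E_a sum_{k < T_a} 1(X_k = b): expected visits to b in one excursion from a."""
    if b == a:
        return F(1)
    others, A = taboo_system(P, a)
    v = solve(A, [F(int(y == b)) for y in others])  # visits to b before hitting a
    return sum(P[a][y] * vy for y, vy in zip(others, v))

# Hypothetical irreducible chain on {0, 1, 2}, for illustration only.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0), F(1, 2), F(1, 2)]]
for a in range(3):
    print(a, mean_return_time(P, a))
print(visits_per_excursion(P, 0, 1), mean_return_time(P, 0) / mean_return_time(P, 1))
```

For this particular chain a hand computation gives mean return times 4, 2 and 4, so an excursion from state 0 contains on average 4/2 = 2 visits to state 1.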

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
$$\nu^{[a]}(x) := E_a \sum_{n=0}^{T_a - 1} 1(X_n = x), \quad x \in S, \qquad (15)$$
should play an important role, where $a$ is the fixed recurrent ground state whose existence was assumed. Indeed, we saw (Comment 2 above) that if $a$ is a positive recurrent state then
$$\frac{\nu^{[a]}(x)}{E_a T_a}$$
is the long-run fraction of times that state $x$ is visited.

Note: Although $\nu^{[a]}$ depends on the choice of the 'ground' state $a$, we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript $[a]$ will soon be dropped. In view of this, notice that the choice of notation $\nu^{[a]}$ is justifiable through the notation (6) for the communicating class of the state $a$.

Clearly, this should have something to do with the concept of stationary distribution discussed a few sections ago.

Proposition 2. Fix a ground state $a \in S$ and assume that it is recurrent (we stress that we do not need the stronger assumption of positive recurrence here). Consider the function $\nu^{[a]}$ defined in (15). Then $\nu^{[a]}$ satisfies the balance equations:
$$\nu^{[a]} = \nu^{[a]} P,$$
i.e.
$$\nu^{[a]}(x) = \sum_{y \in S} \nu^{[a]}(y)\, p_{yx}, \quad x \in S.$$
Moreover, for any state $x$ such that $a \rightsquigarrow x$ ($a$ leads to $x$), we have
$$0 < \nu^{[a]}(x) < \infty.$$

Proof. We start the Markov chain with $X_0 = a$. Recurrence tells us that $T_a < \infty$ with probability one. Since $a = X_{T_a}$, we can write
$$\nu^{[a]}(x) = E_a \sum_{n=1}^{T_a} 1(X_n = x). \qquad (16)$$
This was a trivial but important step: We replaced the $n = 0$ term with the $n = T_a$ term. Both of them equal $1(x = a)$, so there is nothing lost in doing so. The real reason for doing so is because we wanted to replace a function of $X_0, X_1, X_2, \ldots$ by a function of $X_1, X_2, X_3, \ldots$,

and thus be in a position to apply first-step analysis, which is precisely what we do next:
$$\nu^{[a]}(x) = E_a \sum_{n=1}^{\infty} 1(X_n = x,\ n \le T_a) = \sum_{n=1}^{\infty} P_a(X_n = x,\ n \le T_a)$$
$$= \sum_{n=1}^{\infty} \sum_{y \in S} P_a(X_n = x,\ X_{n-1} = y,\ n \le T_a) = \sum_{n=1}^{\infty} \sum_{y \in S} p_{yx}\, P_a(X_{n-1} = y,\ n \le T_a)$$
$$= \sum_{y \in S} p_{yx}\, E_a \sum_{n=1}^{T_a} 1(X_{n-1} = y) = \sum_{y \in S} p_{yx}\, \nu^{[a]}(y),$$
where the first equality is (16), the fourth uses the Markov property (the event $\{n \le T_a\}$ is determined by $X_0, \ldots, X_{n-1}$), and the last step is the definition (15) after shifting the index $n$ by one. So we have shown
$$\nu^{[a]}(x) = \sum_{y \in S} p_{yx}\, \nu^{[a]}(y),$$
for all $x \in S$, which are precisely the balance equations.

Next, let $x$ be such that $a \rightsquigarrow x$, so there exists $n \ge 1$ such that $p_{ax}^{(n)} > 0$. We just showed that $\nu^{[a]} = \nu^{[a]} P$. Therefore, $\nu^{[a]} = \nu^{[a]} P^n$, for all $n \in \mathbb{N}$. Hence
$$\nu^{[a]}(x) \ge \nu^{[a]}(a)\, p_{ax}^{(n)} = p_{ax}^{(n)} > 0.$$
From Theorem 10, $x \rightsquigarrow a$, so there exists $m \ge 1$ such that $p_{xa}^{(m)} > 0$. Hence
$$1 = \nu^{[a]}(a) \ge \nu^{[a]}(x)\, p_{xa}^{(m)},$$
so, indeed, $\nu^{[a]}(x) \le 1/p_{xa}^{(m)} < \infty$.

Note: The last result was shown under the assumption that $a$ is a recurrent state. We are now going to assume more, namely, that $a$ is positive recurrent. First note that, from (16),
$$\sum_{x \in S} \nu^{[a]}(x) = E_a \sum_{n=1}^{T_a} \sum_{x \in S} 1(X_n = x) = E_a \sum_{n=1}^{T_a} 1 = E_a T_a,$$
which, by definition, is finite when $a$ is positive recurrent. We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state $a$. Define
$$\pi^{[a]}(x) := \frac{\nu^{[a]}(x)}{E_a T_a} = \frac{E_a \sum_{n=0}^{T_a - 1} 1(X_n = x)}{E_a T_a}, \quad x \in S. \qquad (17)$$
This $\pi^{[a]}$ is a stationary distribution.

Proof. We have already shown, in Proposition 2, that $\nu^{[a]} P = \nu^{[a]}$. Since $a$ is positive recurrent, we have $E_a T_a < \infty$, and so $\pi^{[a]}$ is not trivially equal to zero. But $\pi^{[a]}$ is simply a multiple of $\nu^{[a]}$. Therefore, $\pi^{[a]} P = \pi^{[a]}$. It only remains to show that $\pi^{[a]}$ is a probability, i.e. that it sums up to one. But
$$\sum_{x \in S} \pi^{[a]}(x) = \frac{1}{E_a T_a} \sum_{x \in S} \nu^{[a]}(x) = 1.$$

Pause for thought: We have achieved something VERY IMPORTANT: if we have one, no matter which one, positive recurrent state, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique. Uniqueness is what we will try to establish in the sequel.

In other words: If there is a positive recurrent state, then the set of linear equations
$$\pi(x) = \sum_{y \in S} \pi(y)\, p_{yx}, \quad x \in S, \qquad \sum_{y \in S} \pi(y) = 1$$
has at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let $i$ be a positive recurrent state. Suppose that $i \rightsquigarrow j$. Then $j$ is positive recurrent.

Proof. Start with $X_0 = i$. Consider the excursions $X_i^{(r)}$, $r = 0, 1, 2, \ldots$. We already know that $j$ is recurrent. Also, the probability $g_{ij} = P_i(T_j < T_i)$ that $j$ occurs in a specific excursion is positive. Hence $j$ will occur in infinitely many excursions. In words, to decide whether $j$ will occur in a specific excursion, we toss a coin with probability of heads $g_{ij} > 0$. The coins are i.i.d. since the excursions are independent: the event that $j$ occurs in a specific excursion is independent from the event that $j$ occurs in the next excursion. If heads show up at the $r$-th excursion, then $j$ occurs in the $r$-th excursion. Otherwise, if tails show up, then the $r$-th excursion does not contain state $j$. Let $R_1$ (respectively, $R_2$) be the index of the first (respectively, second) excursion that contains $j$.

[Figure: successive excursions from $i$ with indices $R_1, R_1 + 1, R_1 + 2, \ldots, R_2$; state $j$ appears in excursions $R_1$ and $R_2$, and $\zeta$ is the total duration of these excursions.]

We just need to show that the sum of the durations of excursions with indices $R_1, \ldots, R_2$ has finite expectation. In other words, we just need to show that
$$\zeta := \sum_{r=1}^{R_2 - R_1 + 1} Z_r$$
has finite expectation, where the $Z_r$ are i.i.d. with the same law as $T_i$. But, by Wald's lemma, the expectation of this random variable equals
$$E_i \zeta = (E_i T_i) \times E_i(R_2 - R_1 + 1) = (E_i T_i) \times (g_{ij}^{-1} + 1),$$
because $R_2 - R_1$ is a geometric random variable:
$$P_i(R_2 - R_1 = r) = (1 - g_{ij})^{r-1} g_{ij}, \quad r = 1, 2, \ldots$$
Since $g_{ij} > 0$ and $E_i T_i < \infty$ we have that $E_i \zeta < \infty$.

Corollary 10. Let $C$ be a communicating class. Then either all states in $C$ are positive recurrent, or all states are null recurrent or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent.
An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent.
An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.
An irreducible Markov chain is called transient if one (and hence all) states are transient.

[Diagram: an IRREDUCIBLE CHAIN is either RECURRENT or TRANSIENT; a RECURRENT chain is either POSITIVE RECURRENT or NULL RECURRENT.]
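Returning to the construction of Section 14, the following sketch computes $\nu^{[a]}$ numerically by summing the taboo probabilities $P_a(X_n = x,\ n < T_a)$ over $n$, mirroring the first-step argument in the proof of Proposition 2, and then checks the balance equations. The matrix `P`, the function name `nu`, the tolerance and the iteration cap are illustrative assumptions; the truncated series is only accurate when the chain returns to the ground state quickly, as it does for a small finite irreducible chain.

```python
def nu(P, a, tol=1e-15, max_iter=100_000):
    """Approximate nu^[a](x) = E_a sum_{n < T_a} 1(X_n = x) by a truncated series."""
    n = len(P)
    u = [1.0 if x == a else 0.0 for x in range(n)]  # n = 0 term: X_0 = a
    total = u[:]
    for _ in range(max_iter):
        # Advance one step and discard paths that have returned to a,
        # so that u[x] = P_a(X_n = x, n < T_a) for the current n.
        u = [0.0 if x == a else sum(u[y] * P[y][x] for y in range(n))
             for x in range(n)]
        total = [t + w for t, w in zip(total, u)]
        if sum(u) < tol:
            break
    return total

# Hypothetical irreducible chain on {0, 1, 2}, for illustration only.
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
v = nu(P, 0)
vP = [sum(v[y] * P[y][x] for y in range(3)) for x in range(3)]
pi_a = [w / sum(v) for w in v]  # normalise by sum(v), which approximates E_a T_a
print(v, vP, pi_a)
```

The printed vectors `v` and `vP` should agree up to rounding, which is the balance equation $\nu^{[a]} = \nu^{[a]} P$, and `pi_a` is the normalised version (17).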

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Figure: transition diagram on the states 0, 1, 2, in which states 1 and 2 are absorbing; the remaining transition probabilities shown are 1/3, 1/2 and 1/6.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). Hence, there is a stationary distribution. But observe that there are many (infinitely many) stationary distributions: For each $0 \le \alpha \le 1$, the $\pi$ defined by
$$\pi(0) = 0, \quad \pi(1) = \alpha, \quad \pi(2) = 1 - \alpha,$$
is a stationary distribution. (You should think why this is OBVIOUS!) Notice that $1 \not\rightsquigarrow 2$. This lack of communication is, indeed, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let $\pi$ be a stationary distribution. Consider the Markov chain $(X_n, n \ge 0)$ and suppose that
$$P(X_0 = x) = \pi(x), \quad x \in S.$$
In other words, we start the Markov chain from the stationary distribution $\pi$. Then, for all $n$, we have
$$P(X_n = x) = \pi(x), \quad x \in S. \qquad (18)$$
Since the chain is irreducible and positive recurrent we have $P_i(T_j < \infty) = 1$, for all states $i$ and $j$. Fix a ground state $a$. Then
$$P(T_a < \infty) = \sum_{x \in S} \pi(x)\, P_x(T_a < \infty) = \sum_{x \in S} \pi(x) = 1.$$
By (13), following Theorem 14, we have, for all $x \in S$,
$$\frac{1}{t} \sum_{n=1}^{t} 1(X_n = x) \to \frac{E_a \sum_{k=0}^{T_a - 1} 1(X_k = x)}{E_a T_a} = \pi^{[a]}(x),$$
as $t \to \infty$, with probability one. By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded above by 1, we can take expectations of both sides:
$$E\Big[\frac{1}{t} \sum_{n=1}^{t} 1(X_n = x)\Big] \to \pi^{[a]}(x),$$
as $t \to \infty$. But, by (18),
$$E\Big[\frac{1}{t} \sum_{n=1}^{t} 1(X_n = x)\Big] = \frac{1}{t} \sum_{n=1}^{t} P(X_n = x) = \pi(x).$$

Therefore $\pi(x) = \pi^{[a]}(x)$, for all $x \in S$. Thus, an arbitrary stationary distribution $\pi$ must be equal to the specific one $\pi^{[a]}$. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states $a, b \in S$. Then
$$\pi^{[a]} = \pi^{[b]}.$$
In particular, we obtain the cycle formula:
$$\frac{E_a \sum_{k=0}^{T_a - 1} 1(X_k = x)}{E_a T_a} = \frac{E_b \sum_{k=0}^{T_b - 1} 1(X_k = x)}{E_b T_b}, \quad x \in S.$$

Proof. There is a unique stationary distribution. Both $\pi^{[a]}$ and $\pi^{[b]}$ are stationary distributions, so they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution $\pi$. Then, for all $a \in S$,
$$\pi(a) = \frac{1}{E_a T_a}.$$

Proof. There is only one stationary distribution. Hence $\pi = \pi^{[a]}$. In particular, $\pi(a) = \pi^{[a]}(a) = 1/E_a T_a$, from the definition of $\pi^{[a]}$.

In Section 14 we assumed the existence of a positive recurrent state $a$ and proved the existence of a stationary distribution $\pi$. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain. Assume that a stationary distribution $\pi$ exists. Then any state $i$ such that $\pi(i) > 0$ is positive recurrent.

Proof. Let $i$ be such that $\pi(i) > 0$. Let $C$ be the communicating class of $i$. Let $\pi(\cdot \mid C)$ be the distribution defined by restricting $\pi$ to $C$:
$$\pi(x \mid C) := \frac{\pi(x)}{\pi(C)}, \quad x \in C,$$
where $\pi(C) = \sum_{y \in C} \pi(y)$. Consider the restriction of the chain to $C$. Namely, delete all transition probabilities $p_{xy}$ for which $x$ and $y$ are not both in $C$. We can easily see that we thus obtain a Markov chain with state space $C$ with stationary distribution $\pi(\cdot \mid C)$. This is, in fact, the only stationary distribution for the restricted chain because $C$ is irreducible. By the above corollary, $\pi(i \mid C) = 1/E_i T_i$. But $\pi(i \mid C) > 0$. Therefore $E_i T_i < \infty$, i.e. $i$ is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain. Decompose it into communicating classes. Then for each positive recurrent communicating class $C$ there corresponds a unique stationary distribution $\pi^C$

which assigns positive probability to each state in $C$. This is defined by picking a state (any state!) $a \in C$ and letting $\pi^C := \pi^{[a]}$. An arbitrary stationary distribution $\pi$ must be a linear combination of these $\pi^C$.

Proposition 3. To each positive recurrent communicating class $C$ there corresponds a stationary distribution $\pi^C$. Any stationary distribution $\pi$ is necessarily of the form
$$\pi = \sum_{C:\ C \text{ positive recurrent communicating class}} \alpha_C\, \pi^C,$$
where $\alpha_C \ge 0$, such that $\sum_C \alpha_C = 1$.

Proof. Let $\pi$ be an arbitrary stationary distribution. If $\pi(x) > 0$ then $x$ belongs to some positive recurrent class $C$. For each such $C$ define the conditional distribution
$$\pi(x \mid C) := \frac{\pi(x)}{\pi(C)}, \quad \text{where } \pi(C) := \sum_{x \in C} \pi(x).$$
Then $(\pi(x \mid C), x \in C)$ is a stationary distribution. By our uniqueness result, $\pi(x \mid C) \equiv \pi^C(x)$. So we have that the above decomposition holds with $\alpha_C = \pi(C)$.

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities $P(X_n = x)$ as $n \to \infty$. So far we have seen that, under irreducibility and positive recurrence conditions, the so-called Cesaro averages $\frac{1}{n} \sum_{k=1}^{n} P(X_k = x)$ converge to $\pi(x)$, as $n \to \infty$. A sequence of real numbers $a_n$ may not converge but its Cesaro averages $\frac{1}{n}(a_1 + \cdots + a_n)$ may converge; for example, take $a_n = (-1)^n$.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

or whatever. . . then regardless of the starting position. 4 3 2 x* 1 0 0 1 2 t 3 t4 If we start with x(t = 0) = x∗ . simple because the ξn ’s depend on the time parameter n. ······ 46 . let us consider a stochastic analogue of the above in discrete time: Xn+1 = ρXn + ξn . And we are in a happy situation here. mutually independent. random variables. If. But what does stability mean? In general. ˙ There are myriads of applications giving physical interpretation to this. However notice what happens when we solve the recursion. or an example in physics. so I’ll leave it upon the reader to decide whether he or she wants to consider this as an ﬁnancial example. . We shall assume that the ξn have the same law. we may it is stable because it does not (it cannot!) swing too far away. The thing to observe here is that there is an equilibrium point i. it cannot mean convergence to a ﬁxed equilibrium point. x ∈ R. ξ0 . Still. . we here assume that X0 . for we can do it: X1 = ρX0 + ξ0 X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0 ) + ξ1 = ρ2 X0 + ρξ0 + ξ1 X3 = ρX2 + ξ2 = ρ3 X0 + ρ2 ξ0 + ρξ1 + ξ2 Xn = ρn X0 + ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . by means of the recursion above.e. The simplest diﬀerential equation As another example. It follows that |ρ| < 1 is the condition for stability.oscillatory motion ad inﬁnitum. ξ2 . the one found by setting the right-hand side equal to zero: x∗ = b/a. Speciﬁcally. then x(t) = x∗ for all t > 0. ξ1 . A linear stochastic system We now pass on to stochastic systems. moreover. . as t → ∞. . We say that x∗ is a stable equilibrium. X2 . are given. consider the following diﬀerential equation x = −ax + b. and deﬁne a stochastic sequence X1 . we have x(t) → x∗ . Since we are prohibited by the syllabus to speak about continuous time stochastic systems. a > 0.

Since $|\rho| < 1$, we have that the first term goes to 0 as $n \to \infty$. The second term,
$$\tilde X_n := \rho^{n-1} \xi_0 + \rho^{n-2} \xi_1 + \cdots + \rho \xi_{n-2} + \xi_{n-1},$$
which is a linear combination of $\xi_0, \ldots, \xi_{n-1}$, does not converge, but notice that if we let
$$\hat X_n := \rho^{n-1} \xi_{n-1} + \rho^{n-2} \xi_{n-2} + \cdots + \rho \xi_1 + \xi_0,$$
we have
$$P(\tilde X_n \le x) = P(\hat X_n \le x).$$
The nice thing is that, under nice conditions (e.g. assume that the $\xi_n$ are bounded), $\hat X_n$ does converge to
$$\hat X_\infty = \sum_{k=0}^{\infty} \rho^k \xi_k.$$
What this means is that, while the limit of $X_n$ does not exist, the limit of its distribution does:
$$\lim_{n\to\infty} P(X_n \le x) = \lim_{n\to\infty} P(\tilde X_n \le x) = \lim_{n\to\infty} P(\hat X_n \le x) = P(\hat X_\infty \le x).$$
This is precisely the reason why we cannot, for time-homogeneous Markov chains, define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution $\pi$ then
$$\lim_{n\to\infty} P(X_n = x) = \pi(x),$$
uniformly in $x$, for any initial distribution. In particular, the $n$-step transition matrix $P^n$ converges, as $n \to \infty$, to a matrix with rows all equal to $\pi$:
$$\lim_{n\to\infty} p_{ij}^{(n)} = \pi(j), \quad i, j \in S.$$

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables $X$, $Y$ but not their joint law. In other words, we know $f(x) = P(X = x)$, $g(y) = P(Y = y)$, but we are not given what $P(X = x, Y = y)$ is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define
$$P(X = x, Y = y) = P(X = x)\, P(Y = y).$$
But suppose that we have some further requirement, for instance, to try to make the coincidence probability $P(X = Y)$ as large as possible. How should we then define what $P(X = x, Y = y)$ is?

Exercise: Let $X$, $Y$ be random variables with values in $\mathbb{Z}$ and laws $f(x) = P(X = x)$, $g(y) = P(Y = y)$, respectively. Find the joint law $h(x, y) = P(X = x, Y = y)$ which respects the marginals, i.e. $f(x) = \sum_y h(x, y)$, $g(y) = \sum_x h(x, y)$, and which maximises $P(X = Y)$.

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process $X = (X_n, n \ge 0)$ and the law of a stochastic process $Y = (Y_n, n \ge 0)$, but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

Suppose that we have two stochastic processes as above and suppose that they have been coupled. We say that a random time $T$ is a meeting time of the two processes if
$$X_n = Y_n \text{ for all } n \ge T.$$
This $T$ is a random variable that takes values in the set of times, or it may take the value $+\infty$ on the event that the two processes never meet. Note that it makes sense to speak of a meeting time, precisely because of the assumption that the two processes have been constructed together. The following result is the so-called coupling inequality:

Proposition 4. Let $T$ be a meeting time of two coupled stochastic processes $X$, $Y$. Then, for all $n \ge 0$, and all $x \in S$,
$$|P(X_n = x) - P(Y_n = x)| \le P(T > n).$$
If, in particular, $T$ is finite with probability one, i.e. $P(T < \infty) = 1$, then
$$|P(X_n = x) - P(Y_n = x)| \to 0,$$
uniformly in $x \in S$, as $n \to \infty$.

Proof. We have
$$P(X_n = x) = P(X_n = x,\ n < T) + P(X_n = x,\ n \ge T) = P(X_n = x,\ n < T) + P(Y_n = x,\ n \ge T) \le P(n < T) + P(Y_n = x), \qquad (19)$$
and hence
$$P(X_n = x) - P(Y_n = x) \le P(T > n).$$
Repeating (19), but with the roles of $X$ and $Y$ interchanged, we obtain
$$P(Y_n = x) - P(X_n = x) \le P(T > n).$$
Combining the two inequalities we obtain the first assertion:
$$|P(X_n = x) - P(Y_n = x)| \le P(T > n), \quad n \ge 0,\ x \in S.$$
Since this holds for all $x \in S$ and since the right hand side does not depend on $x$, we can write it as
$$\sup_{x \in S} |P(X_n = x) - P(Y_n = x)| \le P(T > n), \quad n \ge 0.$$

Assume now that $P(T < \infty) = 1$. Then
$$\lim_{n\to\infty} P(T > n) = P(T = \infty) = 0.$$
Therefore,
$$\sup_{x \in S} |P(X_n = x) - P(Y_n = x)| \to 0,$$
as $n \to \infty$.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by $\pi$. Let $p_{ij}$ denote, as usual, the transition probabilities. We shall consider the laws of two processes. The first process, denoted by $X = (X_n, n \ge 0)$, has the law of a Markov chain with transition probabilities $p_{ij}$ and $X_0$ distributed according to some arbitrary distribution $\mu$. The second process, denoted by $Y = (Y_n, n \ge 0)$, has the law of a Markov chain with transition probabilities $p_{ij}$ and $Y_0$ distributed according to the stationary distribution $\pi$. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen.

It is very important to note that we have only defined the law of $X$ and the law of $Y$ but we have not defined a joint law; in other words, we have not defined joint probabilities such as $P(X_n = x, Y_m = y)$. We now couple them by doing the straightforward thing: we assume that $X = (X_n, n \ge 0)$ is independent of $Y = (Y_n, n \ge 0)$. Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
$$T := \inf\{n \ge 0 : X_n = Y_n\}.$$
Consider the process
$$W_n := (X_n, Y_n).$$
Then $(W_n, n \ge 0)$ is a Markov chain with state space $S \times S$. Its initial state $W_0 = (X_0, Y_0)$ has distribution $P(W_0 = (x, y)) = \mu(x)\pi(y)$. Its (1-step) transition probabilities are
$$q_{(x,y),(x',y')} := P\big(W_{n+1} = (x', y') \mid W_n = (x, y)\big) = P(X_{n+1} = x' \mid X_n = x)\, P(Y_{n+1} = y' \mid Y_n = y) = p_{x,x'}\, p_{y,y'}.$$
Its $n$-step transition probabilities are
$$q^{(n)}_{(x,y),(x',y')} = p^{(n)}_{x,x'}\, p^{(n)}_{y,y'}.$$
From Theorem 3 and the aperiodicity assumption, we have that $p^{(n)}_{x,x'} > 0$ and $p^{(n)}_{y,y'} > 0$ for all large $n$, implying that $q^{(n)}_{(x,y),(x',y')} > 0$ for all large $n$, and so $(W_n, n \ge 0)$ is an irreducible chain. Notice that
$$\sigma(x, y) := \pi(x)\pi(y), \quad (x, y) \in S \times S,$$
is a stationary distribution for $(W_n, n \ge 0)$. (Indeed, had we started with both $X_0$ and $Y_0$ independent and distributed according to $\pi$, then, for all $n \ge 1$, $X_n$ and $Y_n$ would be independent and distributed according to $\pi$.) By positive recurrence, $\pi(x) > 0$ for all $x \in S$.

Therefore $\sigma(x, y) > 0$ for all $(x, y) \in S \times S$. Hence $(W_n, n \ge 0)$ is positive recurrent. In particular,
$$P(T < \infty) = 1.$$
Define now a third process by:
$$Z_n := X_n 1(n < T) + Y_n 1(n \ge T).$$

[Figure: $X$ started from an arbitrary distribution and $Y$ started from the stationary distribution; $Z$ follows $X$ up to the meeting time $T$ and $Y$ thereafter.]

Thus, $Z_n$ equals $X_n$ before the meeting time of $X$ and $Y$; after the meeting time, $Z_n = Y_n$. Obviously, $Z_0 = X_0$ which has an arbitrary distribution $\mu$, by assumption. Moreover, $(Z_n, n \ge 0)$ is a Markov chain with transition probabilities $p_{ij}$. Hence $(Z_n, n \ge 0)$ is identical in law to $(X_n, n \ge 0)$. In particular, for all $n$ and $x$,
$$P(X_n = x) = P(Z_n = x).$$
Clearly, $T$ is a meeting time between $Z$ and $Y$. By the coupling inequality,
$$|P(Z_n = x) - P(Y_n = x)| \le P(T > n).$$
Since $P(Z_n = x) = P(X_n = x)$, and since $P(Y_n = x) = \pi(x)$, we have
$$|P(X_n = x) - \pi(x)| \le P(T > n),$$
which converges to zero, as $n \to \infty$, uniformly in $x$, precisely because $P(T < \infty) = 1$.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If $p_{ij}$ are the transition probabilities of an irreducible chain $X$ and if $Y$ is an independent copy of $X$, then the process $W = (X, Y)$ may fail to be irreducible. For example, suppose that $S = \{0, 1\}$ and the $p_{ij}$ are
$$p_{01} = p_{10} = 1.$$
Then the product chain has state space $S \times S = \{(0,0), (0,1), (1,0), (1,1)\}$ and the transition probabilities are
$$q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.$$
This is not irreducible. However, if we modify the original chain and make it aperiodic by, say,
$$p_{00} = 0.01, \quad p_{01} = 0.99, \quad p_{10} = 1,$$

p10 = 1. In matrix notation. and this is expressed by our terminology. Terminology cannot be. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing. It is. Thus. n→∞ lim p00 p01 (n) (n) p10 p11 (n) (n) = 0. . customary to have a consistent terminology. π(1) ≈ 0.99π(0) = π(1).01. Theorem 4. Cd−1 . Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. and so on. most people use the term “ergodic” for an irreducible. we know we have a unique stationary distribution π and that π(x) > 0 for all states x. n→∞ (n) (n) lim p n→∞ 11 = lim p01 = π(1). such that the chain moves cyclically between them: from C0 to C1 . 6 The reason is that the term “ergodic” means something speciﬁc in Mathematics (c. Indeed.497 18.497. C1 . . Then Xn = 0 for even n and = 1 for odd n. By the results of §5. 18.503. π(0) ≈ 0. Suppose X0 = 0. Suppose that the period of the chain is d.503 0.4 and.503 0. then (n) lim p n→∞ 00 (n) = lim p10 = π(0). Therefore p00 = 1 for even n (n) and = 0 for odd n. It is not diﬃcult to see that 6 We put “wrong” in quotes. in particular. π(1) satisfy 0. π(0) + π(1) = 1. positive recurrent and aperiodic Markov chain. 1981).f. But this is “wrong”. we can decompose the state space into d classes C0 . p01 = 0.e. per se wrong.then the product chain becomes irreducible. 51 . .6 Periodic chains Suppose now that we have a positive recurrent irreducible chain.99. however. if we modify the chain by p00 = 0. However.5 Terminology A Markov chain which is irreducible and positive recurrent will be called ergodic. limn→∞ p00 does not exist. i. . 0. from Cd back to C0 . By the results of Section 14. n→∞ (n) where π(0). Parry.497 . The reader is asked to pay attention to this terminology which is diﬀerent from the one used in the Markov chain literature. from C1 to C2 .

Lemma 8. The chain $(X_{nd}, n \ge 0)$ consists of d closed communicating classes, namely $C_0, \ldots, C_{d-1}$. All states have period 1 for this chain.

Hence if we restrict the chain $(X_{nd}, n \ge 0)$ to any of the sets $C_r$ we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply. We shall show that

Lemma 9.
$$\sum_{i \in C_r} \pi(i) = 1/d, \quad \text{for all } r = 0, 1, \ldots, d-1.$$

Proof. Let $\alpha_r := \sum_{i \in C_r} \pi(i)$. We have that $\pi$ satisfies the balance equations:
$$\pi(i) = \sum_{j \in S} \pi(j) p_{ji}, \quad i \in S.$$
Suppose $i \in C_r$. Then $p_{ji} = 0$ unless $j \in C_{r-1}$. So
$$\pi(i) = \sum_{j \in C_{r-1}} \pi(j) p_{ji}, \quad i \in C_r.$$
Summing up over $i \in C_r$ we obtain
$$\alpha_r = \sum_{i \in C_r} \pi(i) = \sum_{j \in C_{r-1}} \pi(j) \sum_{i \in C_r} p_{ji} = \sum_{j \in C_{r-1}} \pi(j) = \alpha_{r-1}.$$
Since the $\alpha_r$ add up to 1, each one must be equal to 1/d.

Suppose that $X_0 \in C_r$. Then the (unique) stationary distribution of $(X_{nd}, n \ge 0)$ is $d\pi(i)$, $i \in C_r$. Since $(X_{nd}, n \ge 0)$ is aperiodic, we have
$$p^{(nd)}_{ij} = P_i(X_{nd} = j) \to d\pi(j), \quad \text{as } n \to \infty,$$
for all $i, j \in C_r$. Can we also find the limit when i and j do not belong to the same class? Suppose that $i \in C_k$, $j \in C_\ell$. Then, starting from i, the original chain enters $C_\ell$ in $r = \ell - k$ steps (where r is taken to be a number between 0 and d − 1, after reduction by d) and, thereafter, if sampled every d steps, it remains in $C_\ell$. This means that:

Theorem 18. Suppose that $p_{ij}$ are the transition probabilities of a positive recurrent irreducible chain with period d. Let $i \in C_k$, $j \in C_\ell$, and let $r = \ell - k$ (modulo d). Then
$$p^{(r+nd)}_{ij} \to d\pi(j), \quad \text{as } n \to \infty,$$
where $\pi$ is the stationary distribution.
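Lemma 9 and Theorem 18 can be checked numerically on a small example. The sketch below is only an illustration: the 4-state matrix `P`, its cyclic classes `cls`, and the helper names are all hypothetical choices, not taken from the notes. The stationary distribution is obtained from the "lazy" chain (I + P)/2, which is aperiodic and has the same stationary distribution as P.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """Return the n-step transition matrix P^n by repeated multiplication."""
    size = len(P)
    R = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        R = mat_mul(R, P)
    return R


# Hypothetical irreducible chain with period d = 2:
# cyclic classes C0 = {0, 1} and C1 = {2, 3}.
P = [
    [0.0, 0.0, 0.3, 0.7],
    [0.0, 0.0, 0.6, 0.4],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
]
d = 2
cls = [0, 0, 1, 1]  # cls[i] = index r of the class C_r containing state i

# Stationary distribution via the lazy chain (I + P)/2 (aperiodic, same pi).
lazy = [[0.5 * P[i][j] + (0.5 if i == j else 0.0) for j in range(4)]
        for i in range(4)]
pi = mat_pow(lazy, 200)[0]

# Lemma 9: each cyclic class carries stationary mass 1/d.
mass_C0 = pi[0] + pi[1]

# Theorem 18: p_ij^(r + n d) is close to d * pi(j) for large n,
# where r = (class of j) - (class of i) modulo d.
n = 50
max_error = max(
    abs(mat_pow(P, (cls[j] - cls[i]) % d + n * d)[i][j] - d * pi[j])
    for i in range(4) for j in range(4)
)
```

Here `mass_C0` should be 1/2 and `max_error` should be negligible; along the "wrong" residue class modulo d the same entries are exactly zero.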

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities $p^{(n)}_{ij}$, as $n \to \infty$. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j. If the period is 1, then $p^{(n)}_{ij} \to \pi(j)$ (Theorem 17). If the period is d then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then $p^{(n)}_{ij} \to 0$ as $n \to \infty$, for all $i \in S$.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then
$$p^{(n)}_{ij} \to 0, \quad \text{as } n \to \infty,$$
for all $i \in S$.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution on $\mathbb{N}$, i.e. $p^{(n)} = (p^{(n)}_1, p^{(n)}_2, p^{(n)}_3, \ldots)$, where $\sum_{i=1}^{\infty} p^{(n)}_i = 1$ and $p^{(n)}_i \ge 0$ for all i. Then we can find a sequence $n_1 < n_2 < n_3 < \cdots$ of integers such that $\lim_{k\to\infty} p^{(n_k)}_i$ exists for all $i \in \mathbb{N}$.

Proof. Since the numbers $p^{(n)}_1$ are contained between 0 and 1, there must be a subsequence $n^1_1 < n^1_2 < n^1_3 < \cdots$ such that $\lim_{k\to\infty} p^{(n^1_k)}_1$ exists and equals, say, $p_1$. Now consider the numbers $p^{(n^1_k)}_2$, k = 1, 2, .... Since they are contained between 0 and 1, there must be a subsequence $n^2_k$ of $n^1_k$ such that $\lim_{k\to\infty} p^{(n^2_k)}_2$ exists and equals, say, $p_2$. It is clear how we continue. For each i we have found a subsequence $n^i_k$, k = 1, 2, ..., such that $\lim_{k\to\infty} p^{(n^i_k)}_i$ exists and equals, say, $p_i$. We have, schematically,
$$\begin{array}{ccccc}
p^{(n^1_1)}_1 & p^{(n^1_2)}_1 & p^{(n^1_3)}_1 & \cdots & \to p_1 \\
p^{(n^2_1)}_2 & p^{(n^2_2)}_2 & p^{(n^2_3)}_2 & \cdots & \to p_2 \\
p^{(n^3_1)}_3 & p^{(n^3_2)}_3 & p^{(n^3_3)}_3 & \cdots & \to p_3 \\
\cdots & \cdots & \cdots & \cdots & \cdots
\end{array}$$
Now pick the numbers in the diagonal. By construction, the sequence $n^2_k$ of the second row is a subsequence of $n^1_k$ of the first row; the sequence $n^3_k$ of the third row is a subsequence of $n^2_k$ of the second row; and so on. So the diagonal subsequence $(n^k_k, k = 1, 2, \ldots)$ is, apart from finitely many initial terms, a subsequence of each $(n^i_k, k = 1, 2, \ldots)$. Hence $p^{(n^k_k)}_i \to p_i$, as $k \to \infty$, for each i.

The second result is Fatou's lemma, whose proof we omit (footnote 7: see Brémaud (1999)):

Lemma 11. If $(y_i, i \in \mathbb{N})$, $(x^{(n)}_i, i \in \mathbb{N})$, n = 1, 2, ..., are sequences of nonnegative numbers then
$$\sum_{i=1}^{\infty} y_i \liminf_{n\to\infty} x^{(n)}_i \le \liminf_{n\to\infty} \sum_{i=1}^{\infty} y_i x^{(n)}_i.$$

Proof of Theorem 19. Let j be a null recurrent state. Suppose that for some i, the sequence $p^{(n)}_{ij}$ does not converge to 0. Hence we can find a subsequence $n_k$ such that
$$\lim_{k\to\infty} p^{(n_k)}_{ij} = \alpha_j > 0.$$
Consider the probability vector $\big(p^{(n_k)}_{ix}, x \in S\big)$. By Lemma 10, we can pick a subsequence $(n'_k)$ of $(n_k)$ such that
$$\lim_{k\to\infty} p^{(n'_k)}_{ix} = \alpha_x, \quad \text{for all } x \in S.$$
Since $\alpha_j > 0$, we have
$$0 < \sum_{x\in S} \alpha_x \le \liminf_{k\to\infty} \sum_{x\in S} p^{(n'_k)}_{ix} = 1.$$
The inequality follows from Lemma 11, and the last equality is obvious since we have a probability vector. But $P^{n+1} = P^n P$, i.e.
$$p^{(n'_k+1)}_{ix} = \sum_{y\in S} p^{(n'_k)}_{iy} p_{yx}.$$
Therefore, by Lemma 11,
$$\lim_{k\to\infty} p^{(n'_k+1)}_{ix} \ge \sum_{y\in S} \lim_{k\to\infty} p^{(n'_k)}_{iy} p_{yx}, \quad x \in S.$$
Hence
$$\alpha_x \ge \sum_{y\in S} \alpha_y p_{yx}, \quad x \in S.$$
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. it is a strict inequality:
$$\alpha_{x_0} > \sum_{y\in S} \alpha_y p_{y x_0}, \quad \text{for some } x_0 \in S.$$
This would imply that
$$\sum_{x\in S} \alpha_x > \sum_{x\in S} \sum_{y\in S} \alpha_y p_{yx} = \sum_{y\in S} \alpha_y \sum_{x\in S} p_{yx} = \sum_{y\in S} \alpha_y,$$
and this is a contradiction (the number $\sum_{x\in S} \alpha_x = \sum_{y\in S} \alpha_y$ cannot be strictly larger than itself). Thus,
$$\alpha_x = \sum_{y\in S} \alpha_y p_{yx}, \quad x \in S.$$
Since $\sum_{x\in S} \alpha_x \le 1$, we have that
$$\pi(x) := \frac{\alpha_x}{\sum_{y\in S} \alpha_y}$$
satisfies the balance equations and has $\sum_{x\in S} \pi(x) = 1$, i.e. $\pi$ is a stationary distribution. Now $\pi(j) > 0$ because $\alpha_j > 0$, by assumption. By Corollary 13, state j is positive recurrent: Contradiction.
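As a concrete illustration of Theorem 19 one may take the simple symmetric random walk on the integers, for which every state is null recurrent (a standard fact, assumed here rather than proved in this section). Its return probabilities are $p^{(2n)}_{00} = \binom{2n}{n} 2^{-2n}$ and $p^{(2n+1)}_{00} = 0$. The sketch below, with the hypothetical helper name `return_prob`, simply evaluates them and shows the slow decay to 0, roughly like $1/\sqrt{\pi n}$.

```python
from math import comb


def return_prob(n):
    """n-step return probability p_00^(n) of the simple symmetric walk on Z."""
    if n % 2 == 1:
        return 0.0  # the walk cannot be back at 0 after an odd number of steps
    # Choose which n/2 of the n steps go up; each path has probability 2^(-n).
    return comb(n, n // 2) / 2 ** n


# The probabilities decrease to 0, but only at rate about 1/sqrt(n).
probs = [return_prob(n) for n in (10, 100, 1000, 10000)]
```

The decay is slow enough that the sum of the return probabilities diverges, which is consistent with recurrence, while the limit itself is 0 as Theorem 19 asserts.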

19.2 The general case

It remains to see what the limit of $p^{(n)}_{ij}$ is, as $n \to \infty$, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let $\pi^C$ be the stationary distribution associated with C. Let i be an arbitrary state. Let
$$H_C := \inf\{n \ge 0 : X_n \in C\}.$$
Let $\varphi_C(i) := P_i(H_C < \infty)$. Then
$$\lim_{n\to\infty} p^{(n)}_{ij} = \varphi_C(i)\, \pi^C(j).$$

Proof. If i also belongs to C then $\varphi_C(i) = 1$ and so the result follows from Theorem 17. In general, we have
$$p^{(n)}_{ij} = P_i(X_n = j) = P_i(X_n = j, H_C \le n) = \sum_{m=1}^{n} P_i(H_C = m)\, P_i(X_n = j \mid H_C = m).$$
Let $\mu$ be the distribution of $X_{H_C}$ given that $X_0 = i$ and that $\{H_C < \infty\}$. From the Strong Markov Property, we have $P_i(X_n = j \mid H_C = m) = P_\mu(X_{n-m} = j)$. But Theorem 17 tells us that the limit exists regardless of the initial distribution, that is, $P_\mu(X_{n-m} = j) \to \pi^C(j)$, as $n \to \infty$. Hence
$$\lim_{n\to\infty} p^{(n)}_{ij} = \pi^C(j) \sum_{m=1}^{\infty} P_i(H_C = m) = \pi^C(j)\, P_i(H_C < \infty),$$
which is what we need.

Remark: We can write the result of Theorem 20 as
$$\lim_{n\to\infty} p^{(n)}_{ij} = \frac{f_{ij}}{E_j T_j},$$
where $T_j = \inf\{n \ge 1 : X_n = j\}$, and $f_{ij} = P_i(T_j < \infty)$. Indeed, $\pi^C(j) = 1/E_j T_j$, and $f_{ij} = \varphi_C(i)$. In this form, the result can be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7, as in §10.2. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain ($0 < \alpha, \beta, \gamma < 1$):

[Figure: transition diagram on the states 1, 2, 3, 4, 5, with $p_{11} = 1-\gamma$, $p_{12} = \gamma$, $p_{21} = 1$, $p_{32} = 1-\alpha$, $p_{34} = \alpha$, $p_{43} = \beta$, $p_{45} = 1-\beta$, $p_{55} = 1$.]

The communicating classes are I = {3, 4} (inessential states), $C_1$ = {1, 2} (closed class), $C_2$ = {5} (absorbing state). The stationary distributions associated with the two closed classes are:
$$\pi^{C_1}(1) = \frac{1}{1+\gamma}, \quad \pi^{C_1}(2) = \frac{\gamma}{1+\gamma}, \quad \pi^{C_2}(5) = 1.$$
Let $\varphi(i)$ be the probability that class $C_1$ will be reached starting from state i. We have $\varphi(1) = \varphi(2) = 1$, $\varphi(5) = 0$, while, from first-step analysis,
$$\varphi(3) = (1-\alpha)\varphi(2) + \alpha\varphi(4), \qquad \varphi(4) = (1-\beta)\varphi(5) + \beta\varphi(3),$$
whence
$$\varphi(3) = \frac{1-\alpha}{1-\alpha\beta}, \qquad \varphi(4) = \frac{\beta(1-\alpha)}{1-\alpha\beta}.$$
Both $C_1$, $C_2$ are aperiodic classes. Hence, as $n \to \infty$,
$$p^{(n)}_{31} \to \varphi(3)\pi^{C_1}(1), \quad p^{(n)}_{32} \to \varphi(3)\pi^{C_1}(2), \quad p^{(n)}_{33} \to 0, \quad p^{(n)}_{34} \to 0, \quad p^{(n)}_{35} \to (1-\varphi(3))\pi^{C_2}(5).$$
Similarly,
$$p^{(n)}_{41} \to \varphi(4)\pi^{C_1}(1), \quad p^{(n)}_{42} \to \varphi(4)\pi^{C_1}(2), \quad p^{(n)}_{43} \to 0, \quad p^{(n)}_{44} \to 0, \quad p^{(n)}_{45} \to (1-\varphi(4))\pi^{C_2}(5).$$

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. The phase (or state) space of a particle is not just a physical space but it includes, for example, information about its velocity. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. We will avoid giving a precise definition of the term 'ergodic' and only mention what constitutes a result in the area. For example, Theorem 14 is a type of ergodic theorem. And so is the Law of Large Numbers itself (Theorem 13).

Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this.

Theorem 21. Suppose that a Markov chain possesses a recurrent state $a \in S$ and starts from an initial state $X_0$ such that $P(T_a < \infty) = 1$. Consider the quantity $\nu^{[a]}(x)$ defined as the average number of visits to state x between two successive visits to state a (see 16). Let f be a reward function such that
$$\sum_{x\in S} |f(x)|\, \nu^{[a]}(x) < \infty. \qquad (20)$$

Then
$$P\left( \lim_{t\to\infty} \frac{\sum_{n=0}^{t} f(X_n)}{\sum_{n=0}^{t} 1(X_n = a)} = \bar f \right) = 1, \qquad (21)$$
where $\bar f := \sum_{x\in S} f(x)\, \nu^{[a]}(x)$.

Proof. The reader is asked to revisit the proof of Theorem 14. Note that the denominator $\sum_{n=0}^{t} 1(X_n = a)$ equals the number $N_t$ of visits to state a up to time t. As in the proof of Theorem 14, we break the total reward as follows:
$$\sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t - 1} G_r + G^{\text{first}}_t + G^{\text{last}}_t,$$
where $G_r = \sum_{T^{(r)}_a \le n < T^{(r+1)}_a} f(X_n)$, while $G^{\text{first}}_t$, $G^{\text{last}}_t$ are the total rewards over the first and last incomplete cycles, respectively. As in Theorem 14, we have
$$\lim_{t\to\infty} \frac{1}{t} G^{\text{last}}_t = \lim_{t\to\infty} \frac{1}{t} G^{\text{first}}_t = 0,$$
with probability one. The Law of Large Numbers tells us that
$$\lim_{n\to\infty} \frac{1}{n} \sum_{r=0}^{n-1} G_r = E G_1,$$
with probability 1 as long as $E|G_1| < \infty$. But this is precisely the condition (20). Replacing n by the subsequence $N_t$, we obtain the result. We only need to check that $E G_1 = \bar f$, and this is left as an exercise.

Remark 1: The difference between Theorem 14 and 21 is that, in the former, we assumed positive recurrence of the ground state a. If so, then, as we saw in the proof of Theorem 14, we have $N_t / t \to 1/E_a T_a$. So if we divide by t the numerator and denominator of the fraction in (21), we have that the denominator equals $N_t/t$ and converges to $1/E_a T_a$, so the numerator converges to $\bar f / E_a T_a$, which is the quantity denoted by $\bar f$ in Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem (Theorem 14) holds, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Then at least one of the states must be visited infinitely often. So there are always recurrent states. From Proposition 2 we know that the balance equations $\nu = \nu P$ have at least one solution. Now, since the number of states is finite, we have $C := \sum_{i\in S} \nu(i) < \infty$.

If. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. Proof. In particular. . X1 . X0 ). More precisely. Therefore π is a stationary distribution. Lemma 12. that is. A reversible Markov chain is stationary. π(i) := Hence. Because the process is Markov. actually. Then the chain is positive recurrent with a unique stationary distribution π. for all n. We say that it is time-reversible. . By Corollary 13. in addition. Xn has the same distribution as X0 for all n. as n → ∞. running backwards. i. i ∈ S. Theorem 23. X1 = j) = P (X1 = i. . . Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. . . Xn−1 . Proof. Xn ) has the same distribution as (Xn . the stationary distribution is unique. . . Consider an irreducible Markov chain with ﬁnitely many states. without being able to tell that it is. Taking n = 1 in the deﬁnition of reversibility. We summarise the above discussion in: Theorem 22. Xn−1 . i. X0 ). if we ﬁlm a standing man who raises his arms up and down successively. we have Pn → Π. this state must be positive recurrent. Then the chain is reversible if and only if the detailed balance equations hold: π(i)pij = π(j)pji . in addition. because positive recurrence is a class property (Theorem 15). At least some state i must have π(i) > 0. in a sense. a ﬁnite Markov chain always has positive recurrent states. In addition. . Reversibility means that (X0 . X1 . . (X0 . C satisﬁes the balance equations and the normalisation condition i∈S π(i) = 1. . Xn ) has the same distribution as (Xn . 22 Time reversibility Consider a Markov chain (Xn . It is. then all states are positive recurrent. . . . X0 = j). like playing a ﬁlm backwards. Suppose ﬁrst that the chain is reversible. . it will look the same.Hence 1 ν(i). If. where P is the transition matrix and Π is a matrix all the rows of which are equal to π. . 
the chain is aperiodic. simply. But if we ﬁlm a man who walks and run the ﬁlm backwards. X0 ) for all n. this means that it is stationary. reversible if it has the same distribution when the arrow of time is reversed. the ﬁrst components of the two vectors must have the same distribution. P (X0 = i. then we instantly know that the ﬁlm is played in reverse. and run the ﬁlm backwards. 58 . For example. by Theorem 16. n ≥ 0). X1 ) has the same distribution as (X1 . or. the chain is irreducible.e. we see that (X0 . j ∈ S.

for all i and j. But $P(X_0 = i, X_1 = j) = \pi(i) p_{ij}$, $P(X_0 = j, X_1 = i) = \pi(j) p_{ji}$. So the detailed balance equations (DBE) hold. Conversely, suppose the DBE hold. The DBE imply immediately that $(X_0, X_1)$ has the same distribution as $(X_1, X_0)$. Now pick a state k and write, using the DBE,
$$\pi(i) p_{ij} p_{jk} = [\pi(j) p_{ji}] p_{jk} = [\pi(j) p_{jk}] p_{ji} = \pi(k) p_{kj} p_{ji},$$
and this means that
$$P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i),$$
for all i, j, k, and hence $(X_0, X_1, X_2)$ has the same distribution as $(X_2, X_1, X_0)$. By induction, it is easy to show that, for all n, $(X_0, X_1, \ldots, X_n)$ has the same distribution as $(X_n, X_{n-1}, \ldots, X_0)$, and the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities $p_{ij}$ and stationary distribution $\pi$. Then the chain is reversible if and only if Kolmogorov's loop criterion holds: for all $i_1, i_2, \ldots, i_m \in S$ and all $m \ge 3$,
$$p_{i_1 i_2}\, p_{i_2 i_3} \cdots p_{i_{m-1} i_m}\, p_{i_m i_1} = p_{i_1 i_m}\, p_{i_m i_{m-1}} \cdots p_{i_2 i_1}. \qquad (22)$$

Proof. Suppose first that the chain is reversible. Hence the DBE hold. We will show Kolmogorov's loop criterion (KLC). Pick 3 states i, j, k, and write, using the DBE,
$$\pi(i) p_{ij} p_{jk} p_{ki} = [\pi(j) p_{ji}] p_{jk} p_{ki} = [\pi(j) p_{jk}] p_{ji} p_{ki} = [\pi(k) p_{kj}] p_{ji} p_{ki} = [\pi(k) p_{ki}] p_{ji} p_{kj} = [\pi(i) p_{ik}] p_{ji} p_{kj} = \pi(i) p_{ik} p_{kj} p_{ji}.$$
Since the chain is irreducible, we have $\pi(i) > 0$ for all i. Hence cancelling the $\pi(i)$ from the above display we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Summing up over all choices of states $i_3, i_4, \ldots, i_m$ we obtain
$$p_{i_1 i_2}\, p^{(m-1)}_{i_2 i_1} = p^{(m-1)}_{i_1 i_2}\, p_{i_2 i_1}.$$
If the chain is aperiodic, then, letting $m \to \infty$, we have
$$p^{(m-1)}_{i_2 i_1} \to \pi(i_1), \quad \text{and} \quad p^{(m-1)}_{i_1 i_2} \to \pi(i_2),$$
implying that $\pi(i_1) p_{i_1 i_2} = \pi(i_2) p_{i_2 i_1}$. These are the DBE. So the chain is reversible.

Notes:
1) A necessary condition for reversibility is that $p_{ij} > 0$ if and only if $p_{ji} > 0$. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio $p_{ij}/p_{ji}$ can be written as a product of a function of i and a function of j; we have separation of variables:
$$\frac{p_{ij}}{p_{ji}} = f(i) g(j).$$

(Convention: we form this ratio only when $p_{ij} > 0$.) Necessarily, f(i)g(i) must be 1 (why?), so we have
$$\frac{p_{ij}}{p_{ji}} = \frac{g(j)}{g(i)}.$$
Multiplying by $g(i) p_{ji}$ and summing over j we find
$$g(i) = \sum_{j\in S} g(j) p_{ji}.$$
So g satisfies the balance equations. This gives the following method for showing that a finite irreducible chain is reversible:
• Form the ratio $p_{ij}/p_{ji}$ and show that the variables separate.
• If the chain has infinitely many states, then we need, in addition, to guarantee, using another method, that the chain is positive recurrent.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if $p_{ij} > 0$ then $p_{ji} > 0$. Form the graph by deleting all self-loops, and by replacing, for each pair of distinct states i, j for which $p_{ij} > 0$, the two oriented edges from i to j and from j to i by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible. For example, consider the chain:

[Figure: transition diagram of a chain with oriented edges in both directions between neighbouring states.]

Replace it by:

[Figure: the same diagram with each pair of opposite arrows replaced by a single undirected edge.]

This is a tree. Hence the original chain is reversible. The reason is as follows. Pick two distinct states i, j for which $p_{ij} > 0$. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop, and, by assumption, there isn't any.) Let $A$, $A^c$ be the sets in the two connected components, with $i \in A$ and $j \in A^c$. There is obviously only one arrow from $A$ to $A^c$, namely the arrow from i to j; and the arrow from j to i is the only one from $A^c$ to $A$. The balance equations are written in the form $F(A, A^c) = F(A^c, A)$ (see §4.1) which gives
$$\pi(i) p_{ij} = \pi(j) p_{ji},$$
i.e. the detailed balance equations. Hence the chain is reversible. And hence $\pi$ can be found very easily.
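To see how easily $\pi$ is found for a tree-structured chain, here is a minimal sketch on a hypothetical birth-death chain on {0, 1, 2, 3}, whose graph is a path and hence a tree. The matrix `P` is an arbitrary illustrative choice, not the chain drawn in the figure. The detailed balance equations are solved edge by edge, and the result is then checked against the full balance equations using exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical birth-death chain on {0, 1, 2, 3}; its graph is a path (a tree).
P = [
    [F(1, 2), F(1, 2), F(0), F(0)],
    [F(1, 4), F(1, 4), F(1, 2), F(0)],
    [F(0), F(1, 3), F(1, 3), F(1, 3)],
    [F(0), F(0), F(1, 2), F(1, 2)],
]

# Solve the detailed balance equations along the path:
# pi(i+1) * p_{i+1,i} = pi(i) * p_{i,i+1}.
weights = [F(1)]
for i in range(3):
    weights.append(weights[-1] * P[i][i + 1] / P[i + 1][i])

# Normalise so that the probabilities add up to 1.
total = sum(weights)
pi = [w / total for w in weights]

# Check that pi also satisfies the full balance equations pi = pi P.
balance_ok = all(
    sum(pi[j] * P[j][i] for j in range(4)) == pi[i] for i in range(4)
)
```

For this particular matrix the weights come out as 1, 2, 3, 2, so $\pi = (1/8, 1/4, 3/8, 1/4)$.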

Example 2: Random walk on a graph. Consider an undirected graph G with vertices S, a finite set, and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. Suppose the graph is connected. States i and j are called neighbours if they are connected by an edge. For each state i define its degree
$$\delta(i) = \text{number of neighbours of } i.$$
Now define the following transition probabilities:
$$p_{ij} = \begin{cases} 1/\delta(i), & \text{if } j \text{ is a neighbour of } i, \\ 0, & \text{otherwise.} \end{cases}$$
The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example,

[Figure: an undirected graph on the vertices 0, 1, 2, 3, 4.]

Here we have $p_{12} = p_{10} = p_{14} = 1/3$, $p_{02} = 1/4$, etc.

The stationary distribution of a RWG is very easy to find. We claim that
$$\pi(i) = C\delta(i),$$
where C is a constant. To see this let us verify that the detailed balance equations hold, i.e. $\pi(i) p_{ij} = \pi(j) p_{ji}$. There are two cases: If i, j are not neighbours then both sides are 0. If i, j are neighbours then $\pi(i) p_{ij} = C\delta(i) \frac{1}{\delta(i)} = C$ and $\pi(j) p_{ji} = C\delta(j) \frac{1}{\delta(j)} = C$, so both members are equal to C. In all cases, the DBE hold. The constant C is found by normalisation, i.e. $\sum_{i\in S} \pi(i) = 1$. Since
$$\sum_{i\in S} \delta(i) = 2|E|,$$
where |E| is the total number of edges, we have that $C = 1/(2|E|)$. Hence we conclude that:
1. A RWG is a reversible chain.
2. The stationary distribution assigns to a state a probability which is proportional to its degree.
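The two conclusions can be verified directly on a small graph. In the sketch below, the adjacency lists in `graph` are an assumption: they are chosen to agree with the probabilities quoted above ($p_{12} = p_{10} = p_{14} = 1/3$, $p_{02} = 1/4$), but the remaining edges of the pictured graph are guessed. The names `graph`, `p` and `pi` are illustrative only.

```python
from fractions import Fraction

# Assumed adjacency lists: vertex 0 has 4 neighbours, vertex 1 has 3,
# matching the probabilities quoted in the text; other edges are a guess.
graph = {
    0: [1, 2, 3, 4],
    1: [0, 2, 4],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [0, 1, 3],
}


def p(i, j):
    """Transition probability of the random walk on the graph."""
    return Fraction(1, len(graph[i])) if j in graph[i] else Fraction(0)


# Each edge is counted twice in the sum of the degrees.
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2

# Claimed stationary distribution: pi(i) = delta(i) / (2 |E|).
pi = {i: Fraction(len(graph[i]), 2 * num_edges) for i in graph}

# Detailed balance: pi(i) p_ij = pi(j) p_ji for all pairs of states.
dbe_ok = all(pi[i] * p(i, j) == pi[j] * p(j, i) for i in graph for j in graph)
```

With these 8 edges, $\pi(0) = 4/16$ and every other vertex gets 3/16.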

Let (St . In this case. 2. 1. t ≥ 0) be independent from (Xn ) with S0 = 0. i.e. where pij are the m-step transition probabilities of the original chain. k = 0. If the subsequence forms an arithmetic progression. . k = 0. The sequence (Yk . 23. how can we create new ones? 23. if. 1. 2. . . . Due to the Strong Markov Property. St+1 = St + ξt . .3 Subordination Let (Xn . 2. . If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). Let A be (r) a set of states and let TA be the r-th visit to A. k = 0. . . . π is a stationary distribution for the original chain. 2. Deﬁne Yr := XT (r) . then (Yk . a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . The state space of (Yr ) is A. in addition. 1. k = 0.i. .1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. irreducible and positive recurrent). n ≥ 0) be Markov with transition probabilities pij .) has the Markov property but it is not time-homogeneous. . (m) (m) 23. i.e. this process is also a Markov chain and it does have time-homogeneous transition probabilities.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . then it is a stationary distribution for the new chain. where ξt are i. .2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. . . if nk = n0 + km.e.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). 1. . t ≥ 0. random variables with values in N and distribution λ(n) = P (ξ0 = n). . 2. 62 n ∈ N. in general. A r = 1. .d.

Then
$$Y_t := X_{S_t}, \quad t \ge 0,$$
is a Markov chain with
$$P(Y_{t+1} = j \mid Y_t = i) = \sum_{n=1}^{\infty} \lambda(n)\, p^{(n)}_{ij}.$$

23.4 Deterministic function of a Markov chain

If $(X_n)$ is a Markov chain with state space S and if $f : S \to S'$ is some function from S to some countable set S′, then the process
$$Y_n = f(X_n), \quad n \ge 0,$$
is NOT, in general, a Markov chain. If, however, f is a one-to-one function then $(Y_n)$ is a Markov chain. Indeed, knowledge of $f(X_n)$ implies knowledge of $X_n$ and so, conditional on $Y_n$, the past is independent of the future. (Notation: We will denote the elements of S by i, j, ..., and the elements of S′ by $\alpha, \beta, \ldots$.) More generally, we have:

Lemma 13. Suppose that $P(Y_{n+1} = \beta \mid X_n = i)$ depends on i only through f(i), i.e. that there is a function $Q : S' \times S' \to \mathbb{R}$ such that
$$P(Y_{n+1} = \beta \mid X_n = i) = Q(f(i), \beta).$$
Then $(Y_n)$ has the Markov property.

Proof. Fix $\alpha_0, \alpha_1, \ldots, \alpha_{n-1} \in S'$ and let $\mathcal{Y}_{n-1}$ be the event
$$\mathcal{Y}_{n-1} := \{Y_0 = \alpha_0, \ldots, Y_{n-1} = \alpha_{n-1}\}.$$
We need to show that
$$P(Y_{n+1} = \beta \mid Y_n = \alpha, \mathcal{Y}_{n-1}) = P(Y_{n+1} = \beta \mid Y_n = \alpha),$$

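for all choices of the variables involved. But
$$\begin{aligned}
P(Y_{n+1} = \beta \mid Y_n = \alpha, \mathcal{Y}_{n-1})
&= \sum_{i\in S} P(Y_{n+1} = \beta \mid X_n = i, Y_n = \alpha, \mathcal{Y}_{n-1})\, P(X_n = i \mid Y_n = \alpha, \mathcal{Y}_{n-1}) \\
&= \sum_{i\in S} Q(f(i), \beta)\, P(X_n = i \mid Y_n = \alpha, \mathcal{Y}_{n-1}) \\
&= \sum_{i\in S} Q(f(i), \beta)\, \frac{P(Y_n = \alpha \mid X_n = i, \mathcal{Y}_{n-1})\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= \sum_{i\in S} Q(f(i), \beta)\, \frac{1(\alpha = f(i))\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= Q(\alpha, \beta) \sum_{i\in S} \frac{P(Y_n = \alpha \mid X_n = i, \mathcal{Y}_{n-1})\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= Q(\alpha, \beta) \sum_{i\in S} \frac{P(Y_n = \alpha, X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= Q(\alpha, \beta)\, \frac{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} = Q(\alpha, \beta).
\end{aligned}$$
Thus $(Y_n)$ is Markov with transition probabilities $Q(\alpha, \beta)$.

Example: Consider the symmetric drunkard's random walk, i.e.
$$P(X_{n+1} = i \pm 1 \mid X_n = i) = 1/2, \quad i \in \mathbb{Z}.$$
Define
$$Y_n := |X_n|.$$
Then $(Y_n)$ is a Markov chain. To see this, observe that the function $|\cdot|$ is not one-to-one. But we will show that
$$P(Y_{n+1} = \beta \mid X_n = i), \quad i \in \mathbb{Z},\ \beta \in \mathbb{Z}_+,$$
depends on i only through |i|. Indeed, conditional on $\{X_n = i\}$, and if $i \ne 0$, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, regardless of whether i > 0 or i < 0, because the probability of going up is the same as the probability of going down. On the other hand, if i = 0, then $|X_{n+1}| = 1$ for sure. In other words, if we let
$$Q(\alpha, \beta) := \begin{cases} 1/2, & \text{if } \beta = \alpha \pm 1,\ \alpha > 0, \\ 1, & \text{if } \beta = 1,\ \alpha = 0, \end{cases}$$
we have that
$$P(Y_{n+1} = \beta \mid X_n = i) = Q(|i|, \beta),$$
for all $i \in \mathbb{Z}$ and all $\beta \in \mathbb{Z}_+$, and so $(Y_n)$ is, indeed, a Markov chain with transition probabilities $Q(\alpha, \beta)$.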
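The condition of Lemma 13 is easy to test mechanically on a finite chain. The sketch below uses a hypothetical helper `lump`, which computes $P(f(X_{n+1}) = \beta \mid X_n = i)$ for every state i and checks that it depends on i only through f(i). To keep the state space finite, the symmetric walk is truncated to {−3, ..., 3} with reflecting barriers, which is an assumption made for the illustration and not part of the example above.

```python
from fractions import Fraction


def lump(states, p, f):
    """Return Q with Q[(alpha, beta)] = P(f(X_{n+1}) = beta | X_n = i)
    for any i with f(i) = alpha. Raise ValueError if this probability
    depends on i beyond f(i), i.e. if the condition of Lemma 13 fails."""
    Q = {}
    images = sorted({f(i) for i in states})
    for i in states:
        for beta in images:
            q = sum(p(i, j) for j in states if f(j) == beta)
            key = (f(i), beta)
            if key in Q and Q[key] != q:
                raise ValueError(f"condition fails at state {i}, target {beta}")
            Q[key] = q
    return Q


N = 3
states = list(range(-N, N + 1))


def p(i, j):
    """Symmetric walk on {-N, ..., N}; reflecting barriers are an assumption."""
    if abs(i) == N:
        inward = i - 1 if i > 0 else i + 1
        return Fraction(1) if j == inward else Fraction(0)
    return Fraction(1, 2) if abs(i - j) == 1 else Fraction(0)


# Y_n = |X_n| satisfies the condition, so Q is well defined.
Q = lump(states, p, abs)
```

Replacing `abs` by a function such as `lambda i: max(i, 0)` makes `lump` raise, since from the states 0 and −1 (both mapped to 0) the chances of moving to the image 1 differ.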

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: At each point of time n we have a number $X_n$ of individuals each of which gives birth to a random number of offspring, independently from one another, and following the same distribution. The parent dies immediately upon giving birth. We let $p_k$ be the probability that an individual will bear k offspring, k = 0, 1, 2, .... We find the population at time n + 1 by adding the number of offspring of each individual. Let $\xi^{(n)}_k$ be the number of offspring of the k-th individual in the n-th generation. Then the size of the (n + 1)-st generation is:
$$X_{n+1} = \xi^{(n)}_1 + \xi^{(n)}_2 + \cdots + \xi^{(n)}_{X_n}.$$
If we let $\xi^{(n)} = \big(\xi^{(n)}_1, \xi^{(n)}_2, \ldots\big)$ then we see that the above equation is of the form $X_{n+1} = f(X_n, \xi^{(n)})$, and, since $\xi^{(0)}, \xi^{(1)}, \ldots$ are i.i.d. (and independent of the initial population size $X_0$, a further assumption), we have that $(X_n, n \ge 0)$ has the Markov property. It is a Markov chain with time-homogeneous transitions, and 0 is always an absorbing state. Computing the transition probabilities $p_{ij} = P(X_{n+1} = j \mid X_n = i)$ is, in general, a hard problem. But here is an interesting special case.

Example: Suppose that $p_1 = p$, $p_0 = 1 - p$. Thus, each individual gives birth to one child with probability p or has no children with probability 1 − p. If we know that $X_n = i$ then $X_{n+1}$ is the number of successful births amongst i individuals, which has a binomial distribution:
$$p_{ij} = \binom{i}{j} p^j (1-p)^{i-j}, \quad 0 \le j \le i.$$
In this case, the population cannot grow: it will eventually reach 0, with probability 1.

Back to the general case, one question of interest is whether the process will become extinct or not. We will not attempt to compute the $p_{ij}$ nor draw the transition diagram because both are futile, so we will find some different method to proceed. If $p_1 = P(\xi = 1) = 1$ then $X_n = X_0$ for all n and this is a trivial case. So assume that $p_1 \ne 1$. Then $P(\xi = 0) + P(\xi \ge 2) > 0$. We distinguish two cases:

Case I: If $p_0 = P(\xi = 0) = 0$ then the population can only grow. Hence all states $i \ge 1$ are transient. The state 0 is irrelevant because it cannot be reached from anywhere.

Case II: If $p_0 = P(\xi = 0) > 0$ then state 0 can be reached from any other state: indeed, if the current population is i there is probability $p_0^i$ that everybody gives birth to zero offspring. But 0 is an absorbing state. Hence all states $i \ge 1$ are inessential, therefore transient. Therefore, with probability 1, if 0 is never reached then $X_n \to \infty$. In other words, if the

population does not become extinct then it must grow to infinity. Let D be the event of 'death' or extinction:
$$D := \{X_n = 0 \text{ for some } n \ge 0\}.$$
We wish to compute the extinction probability
$$\varepsilon(i) := P_i(D),$$
when we start with $X_0 = i$. Obviously, $\varepsilon(0) = 1$. For $i \ge 1$, we can argue that $\varepsilon(i) = \varepsilon^i$, where $\varepsilon = P_1(D)$. Indeed, if we start with $X_0 = i$ initial ancestors, each one behaves independently of one another. For each k = 1, ..., i, let $D_k$ be the event that the branching process corresponding to the k-th initial ancestor dies. We then have that $D = D_1 \cap \cdots \cap D_i$, and $P(D_k) = \varepsilon$, so $P(D \mid X_0 = i) = \varepsilon^i$. Therefore the problem reduces to the study of the branching process starting with $X_0 = 1$. The result is this:

Proposition 5. Consider a Galton-Watson process starting with $X_0 = 1$ initial ancestor. Let
$$\mu = E\xi = \sum_{k=1}^{\infty} k p_k$$
be the mean number of offspring of an individual. Let $\varepsilon$ be the extinction probability of the process. If $\mu \le 1$ then $\varepsilon = 1$. If $\mu > 1$ then $\varepsilon < 1$ is the unique solution of $\varphi(z) = z$ in [0, 1).

Proof. Let
$$\varphi(z) := E z^{\xi} = \sum_{k=0}^{\infty} z^k p_k$$
be the probability generating function of $\xi$. We let
$$\varphi_n(z) := E z^{X_n} = \sum_{k=0}^{\infty} z^k P(X_n = k)$$
be the probability generating function of $X_n$. This function is of interest because it gives information about the extinction probability $\varepsilon$. Indeed,
$$\varphi_n(0) = P(X_n = 0),$$
and, since $X_n = 0$ implies $X_m = 0$ for all $m \ge n$, we have
$$\lim_{n\to\infty} \varphi_n(0) = \lim_{n\to\infty} P(X_m = 0 \text{ for all } m \ge n) = P(\text{there exists } n \text{ such that } X_m = 0 \text{ for all } m \ge n) = P(D) = \varepsilon.$$

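Note also that
$$\varphi_0(z) = z, \qquad \varphi_1(z) = \varphi(z).$$
We then have
$$\begin{aligned}
\varphi_{n+1}(z) = E z^{X_{n+1}} &= E\Big[ z^{\xi^{(n)}_1} z^{\xi^{(n)}_2} \cdots z^{\xi^{(n)}_{X_n}} \Big] \\
&= \sum_{i=0}^{\infty} E\Big[ z^{\xi^{(n)}_1} z^{\xi^{(n)}_2} \cdots z^{\xi^{(n)}_{X_n}}\, 1(X_n = i) \Big] \\
&= \sum_{i=0}^{\infty} E\Big[ z^{\xi^{(n)}_1} z^{\xi^{(n)}_2} \cdots z^{\xi^{(n)}_i}\, 1(X_n = i) \Big] \\
&= \sum_{i=0}^{\infty} E\big[z^{\xi^{(n)}_1}\big]\, E\big[z^{\xi^{(n)}_2}\big] \cdots E\big[z^{\xi^{(n)}_i}\big]\, E[1(X_n = i)] \\
&= \sum_{i=0}^{\infty} \varphi(z)^i\, P(X_n = i) \\
&= E\, \varphi(z)^{X_n} = \varphi_n(\varphi(z)).
\end{aligned}$$
Hence $\varphi_2(z) = \varphi_1(\varphi(z)) = \varphi(\varphi(z))$, $\varphi_3(z) = \varphi_2(\varphi(z)) = \varphi(\varphi(\varphi(z))) = \varphi(\varphi_2(z))$, and so on. By induction,
$$\varphi_{n+1}(z) = \varphi(\varphi_n(z)).$$
In particular, $\varphi_{n+1}(0) = \varphi(\varphi_n(0))$. But $\varphi_{n+1}(0) \to \varepsilon$ as $n \to \infty$. Also $\varphi_n(0) \to \varepsilon$, and, since $\varphi$ is a continuous function, $\varphi(\varphi_n(0)) \to \varphi(\varepsilon)$. Therefore
$$\varepsilon = \varphi(\varepsilon).$$
Notice now that
$$\mu = \frac{d}{dz} \varphi(z) \Big|_{z=1}.$$
Also, recall that $\varphi$ is a convex function with $\varphi(1) = 1$. Therefore, if the slope $\mu$ at z = 1 is ≤ 1, the graph of $\varphi(z)$ lies above the diagonal (the function z) and so the equation $z = \varphi(z)$ has no solutions other than z = 1. This means that $\varepsilon = 1$ (see left figure below). On the other hand, if $\mu > 1$, then the graph of $\varphi(z)$ does intersect the diagonal at some point $\varepsilon^* < 1$. Let us show that, in fact, $\varepsilon = \varepsilon^*$. Since both $\varepsilon$ and $\varepsilon^*$ satisfy $z = \varphi(z)$, and since the latter has only two solutions (z = 1 and $z = \varepsilon^*$), all we need to show is that $\varepsilon < 1$. But $0 \le \varepsilon^*$. Hence $\varphi(0) \le \varphi(\varepsilon^*) = \varepsilon^*$. By induction, $\varphi_n(0) \le \varepsilon^*$. But $\varphi_n(0) \to \varepsilon$. So $\varepsilon \le \varepsilon^* < 1$. Therefore $\varepsilon = \varepsilon^*$. (See right figure below.)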
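The recursion $\varphi_{n+1}(0) = \varphi(\varphi_n(0))$ used in the proof also gives a practical way of computing $\varepsilon$: start from 0 and iterate $\varphi$. The sketch below does this for an offspring distribution given as a list; the function name `extinction_probability`, the tolerance and the two sample distributions are illustrative assumptions.

```python
def extinction_probability(offspring, tol=1e-12, max_iter=100000):
    """Approximate the extinction probability of a Galton-Watson process
    started from one ancestor, where offspring[k] = p_k.

    Iterates z -> phi(z) from z = 0, mirroring phi_{n+1}(0) = phi(phi_n(0)).
    """
    def phi(z):
        # Probability generating function of the offspring distribution.
        return sum(pk * z ** k for k, pk in enumerate(offspring))

    z = 0.0
    for _ in range(max_iter):
        z_next = phi(z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z  # not converged within max_iter (e.g. mu very close to 1)


# Supercritical case: mu = 1/4 + 2 * 1/2 = 1.25 > 1. Here phi(z) = z
# reads 2 z^2 - 3 z + 1 = 0, with roots 1/2 and 1, so epsilon = 1/2.
eps_super = extinction_probability([0.25, 0.25, 0.5])

# Subcritical case: mu = 1/4 + 2 * 1/4 = 0.75 <= 1, so epsilon = 1.
eps_sub = extinction_probability([0.5, 0.25, 0.25])
```

The iterates increase monotonically towards $\varepsilon$. When $\mu$ is exactly 1 the convergence is much slower, so the stopping rule above should be treated with care in that case.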

[Figure: graphs of $\varphi(z)$ against the diagonal on [0, 1]. Left panel: case $\mu \le 1$, then $\varepsilon = 1$. Right panel: case $\mu > 1$, then $\varepsilon < 1$.]

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). Nowadays, Ethernet has replaced Aloha. The problem of communication over a common channel, using the same frequency, is that if more than one host attempts to transmit a packet, there is a collision and the transmission is unsuccessful. Thus, for a packet to go through, only one of the hosts should be transmitting at a given time. One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 1, 2, 3, 4, 5, 6, ... are allocated to hosts 1, 2, 3, 1, 2, 3, 1, .... But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit. Since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea, we let hosts transmit whenever they like. If a collision happens, then the transmission is not successful and the transmissions must be attempted anew.

So, in a simplified model, assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot assume there are a new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Let s be the number of selected packets. If s = 1 there is a successful transmission and, on the next time slot, there are x + a − 1 packets. If s ≥ 2, there is a collision and, on the next time slot, there are x + a packets. So if we let $X_n$ be the number of backlogged packets at the beginning of the n-th time slot, $A_n$ the number of new packets on the same time slot, and $S_n$ the number of selected packets, then
$$X_{n+1} = \begin{cases} \max(X_n + A_n - 1, 0), & \text{if } S_n = 1, \\ X_n + A_n, & \text{otherwise.} \end{cases}$$
We assume that the $A_n$ are i.i.d. random variables with some common distribution:
$$P(A_n = k) = \lambda_k, \quad k \ge 0,$$

Sn = 1|Xn = x) = λ0 xpq x−1 .x−1 + k≥1 kpx. i. Idea of proof.e. 69 . To compute px. we have ∆(x) = (−1)px. then the chain is transient. we should not subtract 1. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). So we can make q x G(x) as small as we like by choosing x suﬃciently large. An−1 . for example we can make it ≤ µ/2. we have q < 1.x are computed by subtracting the sum of the above from 1. We also let µ := k≥1 kλk be the expectation of An . An = a. Since p > 0.e. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). The main result is: Proposition 6. To compute px. where G(x) is a function of the form α + βx .x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . . and p. where q = 1 − p. and so q x tends to 0 much faster than G(x) tends to ∞. Xn−1 . k 0 ≤ k ≤ x + a. Therefore ∆(x) ≥ µ/2 for all large x. Let us ﬁnd its transition probabilities.x+k for k ≥ 1. The process (Xn . if Xn + An = 0. we think as follows: to have a reduction of x by 1.where λk ≥ 0 and k≥0 kλk = 1. Indeed. Unless µ = 0 (which is a trivial case: no arrivals!). We here only restrict ourselves to the computation of this expected change. the Markov chain (Xn ) is transient. The random variable Sn has.x−1 = P (An = 0. For x > 0.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. so q x G(x) tends to 0 as x → ∞. and exactly one selection: px. selections are made amongst the Xn + An packets. we have that the drift is positive for all large x and this proves transience. k ≥ 1. we must have no new packets arriving.) = x + a k x−k p q . n ≥ 0) is a Markov chain. The Aloha protocol is unstable. The reason we take the maximum with 0 in the above equation is because. The probabilities px.x−1 when x > 0. . 
deﬁned by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. So P (Sn = k|Xn = x. So: px. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . . Proving this is beyond the scope of the lectures.
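The drift computation can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original notes: it takes a hypothetical arrival law with finite support (`lam`) and a hypothetical selection probability `p`, evaluates $\Delta(x)$ directly from the transition probabilities, and compares it with the simplified form $\mu - q^x G(x)$.

```python
def drift_from_transitions(x, lam, p):
    """Delta(x) = -p_{x,x-1} + sum_k k * p_{x,x+k}, for x > 0."""
    q = 1.0 - p

    def lam_at(k):
        # Arrival probabilities vanish outside the finite support.
        return lam[k] if 0 <= k < len(lam) else 0.0

    down = lam_at(0) * x * p * q ** (x - 1)
    up = 0.0
    for k in range(1, len(lam)):
        p_up = (lam_at(k) * (1 - (x + k) * p * q ** (x + k - 1))
                + lam_at(k + 1) * (x + k + 1) * p * q ** (x + k))
        up += k * p_up
    return -down + up


def drift_simplified(x, lam, p):
    """mu - q^x G(x), with q^x G(x) written out term by term."""
    q = 1.0 - p
    mu = sum(k * l for k, l in enumerate(lam))
    correction = lam[0] * x * p * q ** (x - 1) + lam[1] * (x + 1) * p * q ** x
    correction += sum(lam[k] * (x + k) * p * q ** (x + k - 1)
                      for k in range(2, len(lam)))
    return mu - correction


lam = [0.5, 0.3, 0.2]   # assumed arrival law, mean mu = 0.7
p = 0.1                 # assumed selection probability
for x in (1, 10, 50, 200):
    print(x, drift_from_transitions(x, lam, p), drift_simplified(x, lam, p))
```

The two columns should agree, and for large $x$ the drift should approach $\mu$, in line with the bound $\Delta(x) \ge \mu/2$ used in the idea of proof.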

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The web graph is a directed graph whose set of vertices $V$ consists of web pages and whose edges represent hyperlinks: thus, if page $i$ has a link to page $j$ then we put an arrow from $i$ to $j$ and consider it as a directed edge of the graph. For each page $i$ let $L(i)$ be the set of pages it links to. If $L(i) = \varnothing$, then $i$ is a 'dangling page'. Define transition probabilities as follows:

$$p_{ij} = \begin{cases} 1/|L(i)|, & \text{if } j \in L(i) \\ 1/|V|, & \text{if } L(i) = \varnothing \\ 0, & \text{otherwise.} \end{cases}$$

So if $X_n = i$ is the current location of the chain, the next location $X_{n+1}$ is picked uniformly at random amongst all pages that $i$ links to, unless $i$ is a dangling page; in the latter case, the chain moves to a random page. It is not clear that the chain is irreducible, nor that it is aperiodic. So we make the following modification. We pick $\alpha$ between 0 and 1 (typically, $\alpha \approx 0.2$) and let

$$\overline{p}_{ij} := (1 - \alpha) p_{ij} + \frac{\alpha}{|V|}.$$

These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored (and this happens with probability $\alpha$), in which case he clicks on a random page. So this is how Google assigns ranks to pages: it uses an algorithm, PageRank, to compute the stationary distribution $\pi$ corresponding to the Markov chain with transition matrix $P = [\overline{p}_{ij}]$:

$$\pi = \pi P.$$

Then it says: page $i$ has higher rank than $j$ if $\pi(i) > \pi(j)$. The idea is simple. Its implementation, though, is one of the largest matrix computations in the world, because of the size of the problem. The algorithm used (PageRank) starts with an initial guess $\pi^{(0)}$ for $\pi$ and updates it by

$$\pi^{(k)} = \pi^{(k-1)} P.$$

It uses advanced techniques from linear algebra to speed up computations.

Comments:
1. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
2. In an Internet-dominated society, the rank of a page may be very important for a company's profits. Therefore, there are companies that sell high-rank pages. All you need to do to get high rank is pay (a lot of money).
3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their faculty from each other (and from themselves).
4. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality but, from the point of view of an administrator, it makes the evaluation procedure easy. What is important is that the evaluation can be done without ever looking at the content of one's research.
5. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology.

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. This means that the external appearance of a characteristic is a function of the pair of alleles. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. When two individuals mate, their offspring inherits an allele from each parent. Thus, two AA individuals will produce an AA child. But an AA mating with an Aa may produce a child of type AA or one of type Aa. Imagine that we have a population of $N$ individuals who mate at random, producing a number of offspring. We will assume that a child chooses an allele from each parent at random. We will also assume that the parents are replaced by the children. The process repeats, generation after generation.

We would like to study the genetic diversity, meaning how many characteristics we have. To simplify, we will assume that, in each generation, the genetic diversity depends only on the total number of alleles of each type. So let $x$ be the number of alleles of type A (so that $2N - x$ is the number of those of type a). If so, then, at the next generation the number of alleles of type A can be found as follows: pick an allele at random; this allele is of type A with probability $x/2N$; maintain this allele for the next generation. Do the same thing $2N$ times.

In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice. Three of the square alleles are never selected. One of the square alleles is selected 4 times. And so on. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but whose offspring did not inherit this allele. Since we don't care to keep track of the individuals' identities, all we need is the information outlined in this figure.

We are now ready to define the Markov chain $X_n$ representing the number of alleles of type A in generation $n$. We have a transition from $x$ to $y$ if $y$ alleles of type A were produced.

Each allele of type A is selected with probability $x/2N$. Since selections are independent, we see that

$$p_{xy} = \binom{2N}{y} \left(\frac{x}{2N}\right)^{y} \left(1 - \frac{x}{2N}\right)^{2N-y}, \quad 0 \le y \le 2N.$$

Notice that there is a chance ($= (1 - \frac{x}{2N})^{2N}$) that no allele of type A is selected at all, and the population becomes one consisting of alleles of type a only. Similarly, there is a chance ($= (\frac{x}{2N})^{2N}$) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, $p_{00} = p_{2N,2N} = 1$, so both 0 and $2N$ are absorbing states. Since $p_{xy} > 0$ for all $1 \le x, y \le 2N - 1$, the states $\{1, \ldots, 2N - 1\}$ form a communicating class of inessential states. Therefore

$$P(X_n = 0 \text{ or } X_n = 2N \text{ for all large } n) = 1,$$

i.e. the chain gets absorbed by either 0 or $2N$. Let $M = 2N$. The fact that $M$ is even is irrelevant for the mathematical model. Another way to represent the chain is as follows:

$$X_{n+1} = \xi_1^{n} + \cdots + \xi_M^{n},$$

where, conditionally on $(X_0, \ldots, X_n)$, the random variables $\xi_1^{n}, \ldots, \xi_M^{n}$ are i.i.d. with

$$P(\xi_k^{n} = 1 \mid X_n, \ldots, X_0) = \frac{X_n}{M}, \qquad P(\xi_k^{n} = 0 \mid X_n, \ldots, X_0) = 1 - \frac{X_n}{M}.$$

Therefore,

$$E(X_{n+1} \mid X_n, \ldots, X_0) = E(X_{n+1} \mid X_n) = M \cdot \frac{X_n}{M} = X_n.$$

This implies that, for all $n \ge 0$, $E X_{n+1} = E X_n$, and thus, on the average, the number of alleles of type A won't change. Clearly, we would like to know the probability

$$\varphi(x) = P_x(T_0 > T_M) = P(T_0 > T_M \mid X_0 = x)$$

that the chain will eventually be absorbed by state $M$. (Here, as usual, $T_x := \inf\{n \ge 1 : X_n = x\}$.) One can argue that, since, on the average, the expectation doesn't change and, since, at "the end of time" the chain (started from $x$) will be either $M$ with probability $\varphi(x)$ or 0 with probability $1 - \varphi(x)$, we should have

$$M \times \varphi(x) + 0 \times (1 - \varphi(x)) = x,$$

i.e. $\varphi(x) = x/M$. The result is correct, but the argument needs further justification because "the end of time" is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. These are:

$$\varphi(0) = 0, \quad \varphi(M) = 1, \quad \varphi(x) = \sum_{y=0}^{M} p_{xy} \varphi(y).$$

The first two are obvious. The right-hand side of the last equals

$$
\begin{aligned}
\sum_{y=0}^{M} \binom{M}{y} \left(\frac{x}{M}\right)^{y} \left(1 - \frac{x}{M}\right)^{M-y} \frac{y}{M}
&= \sum_{y=1}^{M} \frac{M!}{y!(M-y)!} \left(\frac{x}{M}\right)^{y} \left(1 - \frac{x}{M}\right)^{M-y} \frac{y}{M} \\
&= \sum_{y=1}^{M} \frac{(M-1)!}{(y-1)!(M-y)!} \left(\frac{x}{M}\right)^{y-1+1} \left(1 - \frac{x}{M}\right)^{M-y-1+1} \\
&= \frac{x}{M} \sum_{y=1}^{M} \binom{M-1}{y-1} \left(\frac{x}{M}\right)^{y-1} \left(1 - \frac{x}{M}\right)^{(M-1)-(y-1)} \\
&= \frac{x}{M} \left(\frac{x}{M} + 1 - \frac{x}{M}\right)^{M-1} = \frac{x}{M},
\end{aligned}
$$

as needed.

24.5 A storage or queueing application

A storage facility stores, say, $X_n$ electronic components at time $n$. A number $A_n$ of new components are ordered and added to the stock, while a number of them are sold to the market. If the demand is for $D_n$ components, the demand is met instantaneously, provided that enough components are available, and so, at time $n + 1$, there are $X_n + A_n - D_n$ components in the stock. Otherwise, the demand is only partially satisfied, and the stock reaches level zero at time $n + 1$. We can express these requirements by the recursion

$$X_{n+1} = (X_n + A_n - D_n) \vee 0,$$

where $a \vee b := \max(a, b)$. Here are our assumptions: the pairs of random variables $((A_n, D_n), n \ge 0)$ are i.i.d. and

$$E(A_n - D_n) = -\mu < 0,$$

so that the demand is, on the average, larger than the supply per unit of time. We have that $(X_n, n \ge 0)$ is a Markov chain. We will show that this Markov chain is positive recurrent. We will apply the so-called Loynes' method. Let $\xi_n := A_n - D_n$, so that

$$X_{n+1} = (X_n + \xi_n) \vee 0,$$

for all $n \ge 0$. Since the Markov chain is represented by this very simple recursion, we will try to solve the recursion. Working for $n = 1, 2, \ldots$, we find

$$
\begin{aligned}
X_1 &= (X_0 + \xi_0) \vee 0 \\
X_2 &= (X_1 + \xi_1) \vee 0 = (X_0 + \xi_0 + \xi_1) \vee \xi_1 \vee 0 \\
X_3 &= (X_2 + \xi_2) \vee 0 = (X_0 + \xi_0 + \xi_1 + \xi_2) \vee (\xi_1 + \xi_2) \vee \xi_2 \vee 0 \\
&\cdots \\
X_n &= (X_0 + \xi_0 + \cdots + \xi_{n-1}) \vee (\xi_1 + \cdots + \xi_{n-1}) \vee \cdots \vee \xi_{n-1} \vee 0.
\end{aligned}
$$

The correctness of the latter can formally be proved by showing that it does satisfy the recursion. Let us consider

$$g^{(n)}_{xy} := P_x(X_n > y) = P\big( (x + \xi_0 + \cdots + \xi_{n-1}) \vee (\xi_1 + \cdots + \xi_{n-1}) \vee \cdots \vee \xi_{n-1} \vee 0 > y \big).$$

Thus, the event whose probability we want to compute is a function of $(\xi_0, \ldots, \xi_{n-1})$. Notice, however, that

$$(\xi_0, \ldots, \xi_{n-1}) \stackrel{d}{=} (\xi_{n-1}, \ldots, \xi_0).$$

Indeed, the random variables $(\xi_n)$ are i.i.d., so we can shuffle them in any way we like without altering their joint distribution. In particular, instead of considering them in their original order, we reverse it. Therefore,

$$g^{(n)}_{xy} = P\big( (x + \xi_{n-1} + \cdots + \xi_0) \vee (\xi_{n-2} + \cdots + \xi_0) \vee \cdots \vee \xi_0 \vee 0 > y \big).$$

Let

$$Z_n := \xi_0 + \cdots + \xi_{n-1}, \qquad M_n := Z_n \vee Z_{n-1} \vee \cdots \vee Z_1 \vee 0.$$

Then

$$g^{(n)}_{0y} = P(M_n > y).$$

Since $M_{n+1} \ge M_n$ we have $g^{(n)}_{0y} \le g^{(n+1)}_{0y}$, for all $n$. The limit of a bounded and increasing sequence certainly exists:

$$g(y) := \lim_{n \to \infty} g^{(n)}_{0y} = \lim_{n \to \infty} P_0(X_n > y).$$

We claim that

$$\pi(y) := g(y - 1) - g(y)$$

satisfies the balance equations. Note that $\pi(y) = \lim_{n \to \infty} P(M_n = y)$, since $P(M_n = y) = P(M_n > y - 1) - P(M_n > y)$. But

$$M_{n+1} = 0 \vee \max_{1 \le j \le n+1} Z_j = 0 \vee \Big( \max_{1 \le j \le n+1} (Z_j - Z_1) + \xi_0 \Big) \stackrel{d}{=} 0 \vee (M_n + \xi_0),$$

where the latter equality follows from the fact that, in computing the distribution of $M_n$, which depends on $\xi_0, \ldots, \xi_{n-1}$, we may replace the latter random variables by $\xi_1, \ldots, \xi_n$. Hence, for $y \ge 0$,

$$P(M_{n+1} = y) = P(0 \vee (M_n + \xi_0) = y) = \sum_{x \ge 0} P(M_n = x) \, P(0 \vee (x + \xi_0) = y) = \sum_{x \ge 0} P(M_n = x) \, p_{xy}.$$

Since the $n$-dependent terms are bounded, we may take the limit of both sides as $n \to \infty$, and, using $\pi(y) = \lim_{n \to \infty} P(M_n = y)$, we find

$$\pi(y) = \sum_{x \ge 0} \pi(x) \, p_{xy}.$$

So $\pi$ satisfies the balance equations. We must make sure that it also satisfies $\sum_y \pi(y) = 1$. This will be the case if $g(y)$ is not identically equal to 1. Here, we will use the assumption that $E\xi_n = -\mu < 0$. From the Law of Large Numbers we have

$$P\big( \lim_{n \to \infty} Z_n / n = -\mu \big) = 1.$$

This implies that

$$P\big( \lim_{n \to \infty} Z_n = -\infty \big) = 1.$$

But then

$$P\big( \lim_{n \to \infty} M_n = 0 \vee \max\{Z_1, Z_2, \ldots\} < \infty \big) = 1.$$

Hence

$$g(y) = P\big( 0 \vee \max\{Z_1, Z_2, \ldots\} > y \big)$$

is not identically equal to 1, as required.

It can also be shown (left as an exercise) that if $E\xi_n > 0$ then the Markov chain is transient. The case $E\xi_n = 0$ is delicate. Depending on the actual distribution of $\xi_n$, the Markov chain may be transient or null recurrent.
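The key step of Loynes' method, that $X_n$ started from 0 has the same distribution as the maximum $M_n$ of the partial sums, can be verified exactly for small $n$. The sketch below is an illustration only, not part of the original notes; it assumes increments $\xi_n = \pm 1$ with a hypothetical probability `up` of $+1$, and the function names are made up for this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def forward_state(xis, x0=0):
    """Run the recursion X_{k+1} = max(X_k + xi_k, 0) and return X_n."""
    x = x0
    for xi in xis:
        x = max(x + xi, 0)
    return x


def running_maximum(xis):
    """Return M_n = max(0, Z_1, ..., Z_n) with Z_k = xi_0 + ... + xi_{k-1}."""
    best = partial = 0
    for xi in xis:
        partial += xi
        best = max(best, partial)
    return best


def law(statistic, n, up):
    """Exact distribution of statistic(xi_0, ..., xi_{n-1}) for i.i.d. +-1 steps."""
    dist = defaultdict(Fraction)
    for xis in product((1, -1), repeat=n):
        ups = xis.count(1)
        dist[statistic(xis)] += up ** ups * (1 - up) ** (n - ups)
    return dict(dist)


up = Fraction(3, 10)  # assumed P(xi = +1); then E xi = -2/5 < 0
print(law(forward_state, 8, up) == law(running_maximum, 8, up))
```

The two statistics differ path by path (for the steps $+1, -1$ the recursion ends at 0 while the maximum is 1), yet their distributions coincide, which is exactly what the reversal argument asserts.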

PART II: RANDOM WALKS

25 Random walk

The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Our Markov chains, so far, have been processes with time-homogeneous transition probabilities. A random walk possesses an additional property, namely, that of spatial homogeneity. This means that the transition probability $p_{x,y}$ should depend on $x$ and $y$ only through their relative positions in space. To be able to say this, we need to give more structure to the state space $S$, a structure that enables us to say that

$$p_{x,y} = p_{x+z,y+z}$$

for any translation $z$. For example, we can let $S = \mathbb{Z}^d$, the set of $d$-dimensional vectors with integer coordinates. More general spaces (e.g. groups) are allowed, but, here, we won't go beyond $\mathbb{Z}^d$. So, given a function $p(x)$, $x \in \mathbb{Z}^d$, a random walk is a Markov chain with time- and space-homogeneous transition probabilities given by

$$p_{x,y} = p(y - x).$$

Example 1: Let

$$p(x) = C \, 2^{-|x_1| - \cdots - |x_d|},$$

where $C$ is such that $\sum_x p(x) = 1$.

Example 2: Let $e_j$ be the vector that has 1 in the $j$-th position and 0 everywhere else. Define

$$p(x) = \begin{cases} p_j, & \text{if } x = e_j, \; j = 1, \ldots, d \\ q_j, & \text{if } x = -e_j, \; j = 1, \ldots, d \\ 0, & \text{otherwise,} \end{cases}$$

where $p_1 + \cdots + p_d + q_1 + \cdots + q_d = 1$. A random walk of this type is called simple random walk or nearest neighbour random walk. If $d = 1$, then

$$p(1) = p, \quad p(-1) = q, \quad p(x) = 0 \text{ otherwise,}$$

where $p + q = 1$. This is the drunkard's walk we've seen before. In dimension $d = 1$, this walk enjoys, besides time- and space-homogeneity, the skip-free property, namely that to go from state $x$ to state $y$ it must pass through all intermediate states because its value can change by at most 1 at each step.

Example 3: Define

$$p(x) = \begin{cases} \frac{1}{2d}, & \text{if } x \in \{\pm e_1, \ldots, \pm e_d\} \\ 0, & \text{otherwise.} \end{cases}$$

This is a simple random walk, which, in addition, has equal probabilities to move from one state $x$ to one of its "neighbouring states" $x \pm e_1, \ldots, x \pm e_d$. We call it simple symmetric random walk. If $d = 1$, then

$$p(1) = p(-1) = 1/2, \quad p(x) = 0 \text{ otherwise,}$$

and this is the symmetric drunkard's walk.

The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by $S_n$ the state of the walk at time $n$. It is clear that $S_n$ can be represented as

$$S_n = S_0 + \xi_1 + \cdots + \xi_n,$$

where $\xi_1, \xi_2, \ldots$ are i.i.d. random variables in $\mathbb{Z}^d$ with common distribution

$$P(\xi_n = x) = p(x), \quad x \in \mathbb{Z}^d.$$

We refer to $\xi_n$ as the increments of the random walk. Furthermore, the initial state $S_0$ is independent of the increments. Obviously,

$$p^{(n)}_{xy} = P_x(S_n = y) = P(x + \xi_1 + \cdots + \xi_n = y) = P(\xi_1 + \cdots + \xi_n = y - x).$$

The random walk $\xi_1 + \cdots + \xi_n$ is a random walk starting from 0. So, we can reduce several questions to questions about this random walk. Furthermore, notice that

$$P(\xi_1 + \xi_2 = x) = \sum_y P(\xi_1 = y, \; y + \xi_2 = x) = \sum_y p(y) \, p(x - y).$$

(When we write a sum without indicating explicitly the range of the summation variable, we will mean a sum over all possible values.) Now the operation in the last term is known as convolution. The convolution of two probabilities $p(x)$, $p'(x)$, $x \in \mathbb{Z}^d$, is defined as a new probability $p * p'$ given by

$$(p * p')(x) = \sum_y p(y) \, p'(x - y).$$

The reader can easily check that

$$p * p' = p' * p, \qquad p * (p' * p'') = (p * p') * p'', \qquad p * \delta_0 = p.$$

(Recall that $\delta_z$ is a probability distribution that assigns probability 1 to the point $z$.) In this notation, we have

$$P(\xi_1 + \xi_2 = x) = (p * p)(x).$$

We will let $p^{*n}$ be the convolution of $p$ with itself $n$ times. We easily see that

$$P(\xi_1 + \cdots + \xi_n = x) = p^{*n}(x).$$

One could stop here and say that we have a formula for the $n$-step transition probability. But this would be nonsense, because all we have done is dress the original problem in more fancy clothes. Indeed, in general, computing the $n$-th order convolution for all $n$ is a notoriously hard problem. We shall resist the temptation to compute and look at other

methods. After all, who cares about the exact value of $p^{*n}(x)$ for all $n$ and $x$? Perhaps it is the case (a) that different questions need to be asked or (b) that approximate methods are more informative than exact ones. Here are a couple of meaningful questions:

A. Will the random walk ever return to where it started from?
B. If yes, how long will that take?
C. If not, where will it go?

26 The simple symmetric random walk in dimension 1: path counting

This is a random walk with values in $\mathbb{Z}$ and transition probabilities $p_{i,i+1} = p_{i,i-1} = 1/2$ for all $i \in \mathbb{Z}$. When the walk starts from 0 we can write

$$S_n = \xi_1 + \cdots + \xi_n,$$

where the $\xi_n$ are i.i.d. random variables (random signs) with $P(\xi_n = \pm 1) = 1/2$. Look at the distribution of $(\xi_1, \ldots, \xi_n)$, a random vector with values in $\{-1, +1\}^n$. Clearly, all possible values of the vector are equally likely. Since there are $2^n$ elements in $\{-1, +1\}^n$, we have

$$P(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n) = 2^{-n}.$$

So, for fixed $n$, we are dealing with a uniform distribution on the sample space $\{-1, +1\}^n$. Events $A$ that depend only on the first $n$ random signs are subsets of this sample space, $A \subset \{-1, +1\}^n$, and so

$$P(A) = \frac{\#A}{2^n},$$

where $\#A$ is the number of elements of $A$. Therefore, if we can count the number of elements of $A$ we can compute its probability. The principle is trivial, but its practice may be hard. In the sequel, we shall learn some counting methods.

First, we should assign, to each initial position and to each sequence of $n$ signs, a path of length $n$. So if the initial position is $S_0 = 0$, and the sequence of signs is $\{+1, +1, -1, +1, -1\}$, then we have a path in the plane that joins the points

$$(0, 0) \to (1, 1) \to (2, 2) \to (3, 1) \to (4, 2) \to (5, 1).$$

[Figure: the path through $(0,0)$, $(1,1)$, $(2,2)$, $(3,1)$, $(4,2)$, $(5,1)$, with its steps labelled $+, +, -, +, -$.]

We can thus think of the elements of the sample space as paths. As a warm-up, consider the following:

Example: Assume that $S_0 = 0$, and compute the probability of the event

$$A = \{S_2 \ge 0, \; S_4 = -1, \; S_6 < 0, \; S_7 < 0\}.$$

The paths comprising this event are shown below. We can easily count that there are 6 paths in $A$. So $P(A) = 6/2^7 = 6/128 = 3/64 \approx 0.047$.

[Figure: the paths comprising the event $A$, plotted for times 0 to 7.]

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation

$$\{(m, x) \rightsquigarrow (n, y)\} =: \{\text{all paths that start at } (m, x) \text{ and end up at } (n, y)\}.$$

Let us first show that

Lemma 14. The number of paths that start from $(0, 0)$ and end up at $(n, i)$ is given by:

$$\#\{(0, 0) \rightsquigarrow (n, i)\} = \binom{n}{\frac{n+i}{2}}.$$

More generally, the number of paths that start from $(m, x)$ and end up at $(n, y)$ is given by:

$$\#\{(m, x) \rightsquigarrow (n, y)\} = \binom{n-m}{(n - m + y - x)/2}. \qquad (23)$$

Remark. If $n + i$ is not an even number then there is no path that starts from $(0, 0)$ and ends at $(n, i)$. Hence $\binom{n}{(n+i)/2} = 0$, in this case.

[Figure: The lightly shaded region contains all paths of length $n$ that start at $(0, 0)$. (There are $2^n$ of them.) The darker region shows all paths that start at $(0, 0)$ and end up at the specific point $(n, i)$.]

Proof. Consider a path that starts from $(0, 0)$ and ends at $(n, i)$.

Let $u$ be the number of $+$'s in this path (the number of upward segments) and $d$ the number of $-$'s (the number of downward segments). Then

$$u + d = n, \qquad u - d = i.$$

Hence

$$u = (n + i)/2, \qquad d = (n - i)/2.$$

So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of $n$ signs out of which $u$ are $+$ and the rest are $-$, in how many ways can we arrange them? Well, in exactly $\binom{n}{u}$ ways, and this proves the formula.

Notice that $u = (n + i)/2$ must be an integer. This is true if $n + i$ is even or, equivalently, if $n$ and $i$ are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from $(0, 0)$ to $(n, i)$. For example (look at the figure), there is no path that goes from $(0, 0)$ to $(3, 2)$. We shall thus define $\binom{n}{u}$ to be zero if $u$ is not an integer.

More generally, the number of paths that start from $(m, x)$ and end up at $(n, y)$ equals the number of paths that start from $(0, 0)$ and end up at $(n - m, y - x)$. So, applying the previous formula, we obtain (23).
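The counting argument can be confirmed by brute force for small $n$. The following is a small illustrative sketch (the function names are assumptions made for this example): it enumerates all $2^n$ sign sequences and compares the number of those ending at height $i$ with the binomial coefficient, using the convention that the coefficient is zero when $(n+i)/2$ is not an integer.

```python
from itertools import product
from math import comb


def count_paths_brute(n, i):
    """Count sign sequences of length n whose sum is i."""
    return sum(1 for signs in product((1, -1), repeat=n) if sum(signs) == i)


def count_paths_formula(n, i):
    """Binomial count C(n, (n+i)/2), defined as 0 when no path exists."""
    if (n + i) % 2 != 0 or abs(i) > n:
        return 0
    return comb(n, (n + i) // 2)


for n in range(0, 8):
    for i in range(-n - 1, n + 2):
        assert count_paths_brute(n, i) == count_paths_formula(n, i)
print(count_paths_formula(7, 1))
```

In particular the check covers the case mentioned in the proof: there is no path from $(0, 0)$ to $(3, 2)$, and both functions return 0 there.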

26.2 **Paths that start and end at specific points and do not touch zero at all**

Lemma 15. Suppose that $x, y > 0$. The number of paths $(m, x) \rightsquigarrow (n, y)$ that never become zero (i.e. they do not touch the horizontal axis) is given by:

$$\#\{(m, x) \rightsquigarrow (n, y); \text{ remain} > 0\} = \binom{n-m}{\frac{1}{2}(n - m + y - x)} - \binom{n-m}{\frac{1}{2}(n - m - y - x)}. \qquad (24)$$

Proof.

[Figure: The shaded region contains all possible paths that start at $(m, x)$ and end up at the specific point $(n, y)$ which, in addition, never touch the horizontal axis.]

Consider, instead, a path that starts at $(m, x)$ and ends up at point $(n, y)$ which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at $(m, x)$ up to the first time it hits zero and then follow the dotted path: this is the reflected path.

[Figure: A path from $(m, x)$ to $(n, y)$ and its reflection, ending at $(n, -y)$, around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.]

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from $(m, x)$ and ends at $(n, -y)$ and so it necessarily crosses the horizontal axis. We thus apply (23) with $-y$ in place of $y$:

$$\#\{(m, x) \rightsquigarrow (n, y); \text{ touch zero}\} = \#\{(m, x) \rightsquigarrow (n, -y)\} = \binom{n-m}{(n - m - y - x)/2}.$$

So, for $x > 0$, $y > 0$,

$$
\begin{aligned}
\#\{(m, x) \rightsquigarrow (n, y); \text{ remain} > 0\} &= \#\{(m, x) \rightsquigarrow (n, y)\} - \#\{(m, x) \rightsquigarrow (n, y); \text{ touch zero}\} \\
&= \binom{n-m}{\frac{1}{2}(n - m + y - x)} - \binom{n-m}{\frac{1}{2}(n - m - y - x)}.
\end{aligned}
$$
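The reflection count (24) can also be checked by direct enumeration. This is a minimal sketch with made-up function names; `steps` plays the role of $n - m$, and the helper `paths` returns zero whenever the binomial index is not an integer or out of range, matching the convention adopted after Lemma 14.

```python
from itertools import product
from math import comb


def paths(steps, displacement):
    """Number of +-1 paths of the given length with the given net displacement."""
    if abs(displacement) > steps or (steps + displacement) % 2 != 0:
        return 0
    return comb(steps, (steps + displacement) // 2)


def count_positive_brute(steps, x, y):
    """Paths from height x to height y that never touch level 0."""
    count = 0
    for signs in product((1, -1), repeat=steps):
        position, ok = x, True
        for s in signs:
            position += s
            if position <= 0:
                ok = False
                break
        if ok and position == y:
            count += 1
    return count


def count_positive_formula(steps, x, y):
    """All paths x -> y minus the reflected paths x -> -y, as in (24)."""
    return paths(steps, y - x) - paths(steps, -y - x)


for steps in range(0, 9):
    for x in range(1, 5):
        for y in range(1, 5):
            assert count_positive_brute(steps, x, y) == count_positive_formula(steps, x, y)
print(count_positive_formula(8, 1, 3))
```

The subtraction in `count_positive_formula` is the reflection principle in code: paths that touch zero are in one-to-one correspondence with all paths ending at $-y$.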

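Corollary 14. For any $y > 0$,

$$\#\{(1, 1) \rightsquigarrow (n, y); \text{ remain} > 0\} = \frac{y}{n} \, \#\{(0, 0) \rightsquigarrow (n, y)\}. \qquad (25)$$

Proof. Let $m = 1$, $x = 1$ in (24):

$$
\begin{aligned}
\#\{(1, 1) \rightsquigarrow (n, y); \text{ remain} > 0\}
&= \binom{n-1}{\frac{1}{2}(n + y) - 1} - \binom{n-1}{\frac{1}{2}(n - y) - 1} \\
&= \frac{(n-1)!}{\left(\frac{n+y}{2} - 1\right)! \, \left(\frac{n-y}{2}\right)!} - \frac{(n-1)!}{\left(\frac{n-y}{2} - 1\right)! \, \left(\frac{n+y}{2}\right)!} \\
&= \left( \frac{1}{n} \, \frac{n+y}{2} - \frac{1}{n} \, \frac{n-y}{2} \right) \frac{n!}{\left(\frac{n+y}{2}\right)! \, \left(\frac{n-y}{2}\right)!} \\
&= \frac{y}{n} \binom{n}{\frac{n+y}{2}},
\end{aligned}
$$

and the latter term equals $y/n$ times the total number of paths from $(0, 0)$ to $(n, y)$.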

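Corollary 15. The number of paths from $(m, x)$ to $(n, y)$ that do not touch level $z < \min(x, y)$ is given by:

$$\#\{(m, x) \rightsquigarrow (n, y); \text{ remain} > z\} = \binom{n-m}{\frac{1}{2}(n - m + y - x)} - \binom{n-m}{\frac{1}{2}(n - m - y - x) + z}. \qquad (26)$$

Proof. Because it is only the relative position of $x$ and $y$ with respect to $z$ that plays any role, we have

$$\#\{(m, x) \rightsquigarrow (n, y); \text{ remain} > z\} = \#\{(m, x - z) \rightsquigarrow (n, y - z); \text{ remain} > 0\}$$

and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from $(0, 0)$ to $(n, y)$, where $y \ge 0$, that remain $\ge 0$ is given by:

$$\#\{(0, 0) \rightsquigarrow (n, y); \text{ remain} \ge 0\} = \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1}. \qquad (27)$$

Proof.

$$
\begin{aligned}
\#\{(0, 0) \rightsquigarrow (n, y); \text{ remain} \ge 0\} &= \#\{(0, 0) \rightsquigarrow (n, y); \text{ remain} > -1\} \\
&= \#\{(0, 1) \rightsquigarrow (n, y + 1); \text{ remain} > 0\} \\
&= \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n-y}{2} - 1} \\
&= \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1},
\end{aligned}
$$

where we used (26) with $z = -1$.

Corollary 17 (Catalan numbers). The number of paths from $(0, 0)$ to $(n, 0)$ that remain $\ge 0$ is given by:

$$\#\{(0, 0) \rightsquigarrow (n, 0); \text{ remain} \ge 0\} = \frac{1}{n+1} \binom{n+1}{n/2}.$$

Proof. Apply (27) with $y = 0$:

$$\#\{(0, 0) \rightsquigarrow (n, 0); \text{ remain} \ge 0\} = \binom{n}{\frac{n}{2}} - \binom{n}{\frac{n}{2} + 1} = \binom{n}{\frac{n}{2}} - \binom{n}{\frac{n}{2} - 1} = \frac{1}{n+1} \binom{n+1}{n/2},$$

where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from $(0, 0)$, have length $n$, and remain nonnegative is given by:

$$\#\{(0, 0) \rightsquigarrow (n, \bullet); \text{ remain} \ge 0\} = \binom{n}{\lceil n/2 \rceil}, \qquad (28)$$

where $\lceil x \rceil$ denotes the smallest integer that is $\ge x$.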

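Proof. To see this, just use (27) and sum over all $y \ge 0$. If $n = 2m$ then we need to set $y = 2z$ in (27) (otherwise (27) is just zero) and so we have

$$
\begin{aligned}
\#\{(0, 0) \rightsquigarrow (2m, \bullet); \text{ remain} \ge 0\}
&= \sum_{z \ge 0} \left[ \binom{2m}{m+z} - \binom{2m}{m+z+1} \right] \\
&= \binom{2m}{m} - \binom{2m}{m+1} + \binom{2m}{m+1} - \binom{2m}{m+2} + \cdots + \binom{2m}{2m-1} - \binom{2m}{2m} + \binom{2m}{2m} \pm \cdots \\
&= \binom{2m}{m} = \binom{n}{n/2},
\end{aligned}
$$

because all the terms that are denoted by $\cdots$ after $\binom{2m}{2m}$ are equal to 0, while the rest cancel with one another, and only the first one remains. If $n = 2m + 1$ then we need to set $y = 2z + 1$ in (27) (otherwise (27) is just zero) and so we have

$$
\begin{aligned}
\#\{(0, 0) \rightsquigarrow (2m + 1, \bullet); \text{ remain} \ge 0\}
&= \sum_{z \ge 0} \left[ \binom{2m+1}{m+z+1} - \binom{2m+1}{m+z+2} \right] \\
&= \binom{2m+1}{m+1} - \binom{2m+1}{m+2} + \binom{2m+1}{m+2} - \binom{2m+1}{m+3} + \cdots + \binom{2m+1}{2m} - \binom{2m+1}{2m+1} + \binom{2m+1}{2m+1} \pm \cdots \\
&= \binom{2m+1}{m+1} = \binom{n}{\lceil n/2 \rceil}.
\end{aligned}
$$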
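Formulas (27) and (28), and the Catalan count of Corollary 17, lend themselves to a brute-force check for small $n$. The sketch below is illustrative only; the function names are assumptions made for this example.

```python
from itertools import product
from math import comb


def count_nonnegative_brute(n, y=None):
    """Paths of length n from 0 that stay >= 0 (and end at y, if y is given)."""
    count = 0
    for signs in product((1, -1), repeat=n):
        position, ok = 0, True
        for s in signs:
            position += s
            if position < 0:
                ok = False
                break
        if ok and (y is None or position == y):
            count += 1
    return count


def formula_27(n, y):
    """C(n, (n+y)/2) - C(n, (n+y)/2 + 1), zero when n + y is odd."""
    if (n + y) % 2 != 0:
        return 0
    k = (n + y) // 2
    return comb(n, k) - comb(n, k + 1)


def formula_28(n):
    """C(n, ceil(n/2))."""
    return comb(n, (n + 1) // 2)


def catalan_count(n):
    """(1/(n+1)) * C(n+1, n/2) for even n."""
    return comb(n + 1, n // 2) // (n + 1)


for n in range(0, 11):
    assert count_nonnegative_brute(n) == formula_28(n)
    for y in range(0, n + 2):
        assert count_nonnegative_brute(n, y) == formula_27(n, y)
    if n % 2 == 0:
        assert count_nonnegative_brute(n, 0) == catalan_count(n)
print([catalan_count(n) for n in range(0, 11, 2)])
```

Summing `formula_27(n, y)` over $y \ge 0$ reproduces the telescoping sum in the proof of Corollary 18.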

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?


27 The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1 **Distribution after n steps**

Lemma 16.

$$P_0(S_n = y) = \binom{n}{(n + y)/2} 2^{-n}, \qquad P(S_n = y \mid S_m = x) = \binom{n-m}{(n - m + y - x)/2} 2^{-(n-m)}.$$

In particular, if $n$ is even,

$$u(n) := P_0(S_n = 0) = \binom{n}{n/2} 2^{-n}. \qquad (29)$$

Proof. We have

$$P_0(S_n = y) = \frac{\#\{(0, 0) \rightsquigarrow (n, y)\}}{2^n},$$

and

$$P(S_n = y \mid S_m = x) = \frac{\#\{\text{paths } (m, x) \rightsquigarrow (n, y)\}}{2^{n-m}},$$

and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that $S_0 = 0$ and $S_n = y > 0$, the random walk never becomes zero up to time $n$ is given by

$$P_0(S_1 > 0, S_2 > 0, \ldots, S_n > 0 \mid S_n = y) = \frac{y}{n}. \qquad (30)$$

Proof. Such a simple answer begs for a physical/intuitive explanation. But, for now, we shall just compute it. The left side of (30) equals

$$\frac{\#\{\text{paths } (1, 1) \rightsquigarrow (n, y) \text{ that do not touch zero}\} \times 2^{-n}}{\#\{\text{paths } (0, 0) \rightsquigarrow (n, y)\} \times 2^{-n}} = \frac{\binom{n-1}{\frac{1}{2}(n - 1 + y - 1)} - \binom{n-1}{\frac{1}{2}(n - 1 - y - 1)}}{\binom{n}{\frac{1}{2}(n + y)}} = \frac{y}{n}.$$

This is called the ballot theorem and the reason will be given in Section 29.

27.3 Some conditional probabilities

Lemma 17. If $n, y \ge 0$ are both even or both odd then

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = y) = 1 - \frac{P_0(S_n = y + 2)}{P_0(S_n = y)} = \frac{y + 1}{\frac{1}{2}(n + y) + 1}.$$

In particular, if $n$ is even,

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = 0) = \frac{1}{(n/2) + 1}.$$

Proof. We use (27):

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = y) = \frac{P_0(S_1 \ge 0, \ldots, S_n \ge 0, S_n = y)}{P_0(S_n = y)} = \frac{\binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1}}{\binom{n}{\frac{n+y}{2}}} = 1 - \frac{P_0(S_n = y + 2)}{P_0(S_n = y)}.$$

This is further equal to

$$1 - \frac{n!}{\left(\frac{n+y}{2} + 1\right)! \, \left(\frac{n-y}{2} - 1\right)!} \cdot \frac{\left(\frac{n+y}{2}\right)! \, \left(\frac{n-y}{2}\right)!}{n!} = 1 - \frac{\frac{n-y}{2}}{\frac{n+y}{2} + 1} = \frac{y + 1}{\frac{n+y}{2} + 1}.$$

27.4 Some remarkable identities

Theorem 26. If $n$ is even,

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0) = P_0(S_1 \ne 0, \ldots, S_n \ne 0) = P_0(S_n = 0).$$

Proof. For even $n$ we have

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0) = \frac{\#\{(0, 0) \rightsquigarrow (n, \bullet); \text{ remain} \ge 0\}}{2^n} = \binom{n}{n/2} 2^{-n} = P_0(S_n = 0),$$

where we used (28) and (29). We next have

$$
\begin{aligned}
P_0(S_1 \ne 0, \ldots, S_n \ne 0) &\stackrel{(a)}{=} P_0(S_1 > 0, \ldots, S_n > 0) + P_0(S_1 < 0, \ldots, S_n < 0) \\
&\stackrel{(b)}{=} 2 P_0(S_1 > 0, \ldots, S_n > 0) \\
&\stackrel{(c)}{=} 2 P_0(\xi_1 = 1) \, P_0(S_2 > 0, \ldots, S_n > 0 \mid \xi_1 = 1) \\
&\stackrel{(d)}{=} P_0(1 + \xi_2 > 0, \; 1 + \xi_2 + \xi_3 > 0, \ldots, 1 + \xi_2 + \cdots + \xi_n > 0 \mid \xi_1 = 1) \\
&\stackrel{(e)}{=} P_0(\xi_2 \ge 0, \; \xi_2 + \xi_3 \ge 0, \ldots, \xi_2 + \cdots + \xi_n \ge 0) \\
&\stackrel{(f)}{=} P_0(S_1 \ge 0, \ldots, S_{n-1} \ge 0) \\
&\stackrel{(g)}{=} P_0(S_1 \ge 0, \ldots, S_n \ge 0),
\end{aligned}
$$

where (a) is obvious, (b) follows from symmetry, (c) is just conditioning on $\{\xi_1 = 1\}$ and using the fact that $S_1 = \xi_1$, (d) follows from $P_0(\xi_1 = 1) = 1/2$ and from the substitution of the value of $\xi_1$ on the left of the conditioning, (e) follows from the fact that $\xi_1$ is independent of $\xi_2, \xi_3, \ldots$ and from the fact that if $m$ is an integer then $1 + m > 0$ is equivalent to $m \ge 0$, (f) follows from the replacement of $(\xi_2, \xi_3, \ldots)$ by $(\xi_1, \xi_2, \ldots)$, and finally (g) is trivial because $S_{n-1} \ge 0$ means $S_{n-1} > 0$, since $S_{n-1}$ cannot take the value 0 ($n - 1$ is odd), and $S_{n-1} > 0$ implies that $S_n \ge 0$.
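The three probabilities in Theorem 26 can be compared exactly for small even $n$ by enumerating all paths. The following is a minimal sketch (the function name is an assumption made for this example), using exact rational arithmetic so that equality can be tested without rounding.

```python
from fractions import Fraction
from itertools import product
from math import comb


def theorem_26_probabilities(n):
    """Return (P(stay >= 0), P(never 0 at times 1..n), P(S_n = 0)) exactly."""
    nonnegative = never_zero = ends_at_zero = 0
    for signs in product((1, -1), repeat=n):
        partial = 0
        stays_nonnegative = avoids_zero = True
        for s in signs:
            partial += s
            if partial < 0:
                stays_nonnegative = False
            if partial == 0:
                avoids_zero = False
        nonnegative += stays_nonnegative
        never_zero += avoids_zero
        ends_at_zero += (partial == 0)
    total = 2 ** n
    return (Fraction(nonnegative, total),
            Fraction(never_zero, total),
            Fraction(ends_at_zero, total))


for n in range(2, 13, 2):
    a, b, c = theorem_26_probabilities(n)
    assert a == b == c == Fraction(comb(n, n // 2), 2 ** n)
print(theorem_26_probabilities(4))
```

For $n = 4$, each of the three events contains 6 of the 16 paths, so each probability equals $3/8$.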

27.5 First return to zero

In (29), we computed, for even $n$, the probability $u(n)$ that the walk (started at 0) attains value 0 in $n$ steps. We now want to find the probability $f(n)$ that the walk will return to 0 for the first time in $n$ steps.

Theorem 27. Let $n$ be even, and $u(n) = P_0(S_n = 0)$. Then

$$f(n) := P_0(S_1 \ne 0, \ldots, S_{n-1} \ne 0, S_n = 0) = \frac{u(n)}{n - 1}.$$

Proof. Since the random walk cannot change sign before becoming zero,

$$f(n) = P_0(S_1 > 0, \ldots, S_{n-1} > 0, S_n = 0) + P_0(S_1 < 0, \ldots, S_{n-1} < 0, S_n = 0).$$

By symmetry, the two terms must be equal. So

$$
\begin{aligned}
f(n) &= 2 P_0(S_1 > 0, \ldots, S_{n-1} > 0, S_n = 0) \\
&= 2 P_0(S_n = 0) \, P(S_{n-1} > 0 \mid S_n = 0) \, P_0(S_1 > 0, \ldots, S_{n-2} > 0 \mid S_{n-1} > 0, S_n = 0).
\end{aligned}
$$

Now, given that $S_n = 0$, we either have $S_{n-1} = 1$ or $-1$, with equal probability. So the middle term is $1/2$. To take care of the last term in the last expression for $f(n)$, we see that $S_{n-1} > 0, S_n = 0$ means $S_{n-1} = 1, S_n = 0$. By the Markov property at time $n - 1$, we can omit the last event. So

$$f(n) = 2 u(n) \, \frac{1}{2} \, P_0(S_1 > 0, \ldots, S_{n-2} > 0 \mid S_{n-1} = 1),$$

and the last probability can be computed from the ballot theorem, see (30): it is equal to $1/(n - 1)$. So

$$f(n) = \frac{u(n)}{n - 1}.$$

28 The reflection principle for a simple random walk in dimension 1

The reflection principle says the following. Fix a state $a$. Put a two-sided mirror at $a$. If a particle is to the right of $a$ then you see its image on the left, and if the particle is to the left of $a$ then its mirror image is on the right. So if the particle is in position $s$ then its mirror image is in position $\tilde{s} = 2a - s$ because $s - a = a - \tilde{s}$. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at $2a$ and performs a simple symmetric random walk. This is not so interesting.

What is more interesting is this: run the particle till it hits $a$ (it will, for sure, by Theorem 35) and after that consider its mirror image $\tilde{S}_n = 2a - S_n$. In other words, define

$$\tilde{S}_n = \begin{cases} S_n, & n < T_a \\ 2a - S_n, & n \ge T_a, \end{cases}$$

where

$$T_a = \inf\{n \ge 0 : S_n = a\}$$

is the first hitting time of $a$. Then

Theorem 28. If $(S_n)$ is a simple symmetric random walk started at 0 then so is $(\tilde{S}_n)$.

Proof. Trivially,

$$S_n = \begin{cases} S_n, & n < T_a \\ a + (S_n - a), & n \ge T_a. \end{cases}$$

By the strong Markov property, the process $(S_{T_a + m} - a, \; m \ge 0)$ is a simple symmetric random walk, independent of the past before $T_a$. But a symmetric random walk has the same law as its negative. So $S_n - a$ can be replaced by $a - S_n$ in the last display and the resulting process, called $(\tilde{S}_n, n \in \mathbb{Z}_+)$, has the same law as $(S_n, n \in \mathbb{Z}_+)$.

28.1 Distribution of hitting time and maximum

We can take advantage of this principle to establish the distribution function of $T_a$:

Theorem 29. Let $(S_n, n \ge 0)$ be a simple symmetric random walk and $a > 0$. Then

$$P_0(T_a \le n) = 2 P_0(S_n \ge a) - P_0(S_n = a) = P_0(|S_n| \ge a) - \frac{1}{2} P_0(|S_n| = a).$$

Proof. If $S_n \ge a$ then $T_a \le n$. So:

$$P_0(S_n \ge a) = P_0(T_a \le n, S_n \ge a).$$

In the last event, we have $n \ge T_a$, so we may replace $S_n$ by $2a - S_n$ (Theorem 28). Thus,

$$P_0(S_n \ge a) = P_0(T_a \le n, \; 2a - S_n \ge a) = P_0(T_a \le n, S_n \le a).$$

But

$$P_0(T_a \le n) = P_0(T_a \le n, S_n \le a) + P_0(T_a \le n, S_n > a) = P_0(T_a \le n, S_n \le a) + P_0(S_n > a).$$

The last equality follows from the fact that $S_n > a$ implies $T_a \le n$. Combining the last two displays gives

$$P_0(T_a \le n) = P_0(S_n \ge a) + P_0(S_n > a) = P_0(S_n \ge a) + P_0(S_n \ge a) - P_0(S_n = a).$$

Finally, observing that $S_n$ has the same distribution as $-S_n$ we get that this is also equal to $P_0(|S_n| \ge a) - \frac{1}{2} P_0(|S_n| = a)$.

Define now the running maximum

$$M_n = \max(S_0, S_1, \ldots, S_n).$$

Then, as a corollary to the above result, we obtain

Compt. we have P0 (Mn < x. Acad. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Rend. by applying the reﬂection principle (Theorem 28). as long as a ≥ b. If Tx ≤ n. 9 Bertrand. (31) x > y. Acad. Rend. J. of which a are coloured azure and b black. Sci. 436-437. Sn = y) = P0 (Sn = 2x − y). An urn is a vessel. combined with (31). employed for diﬀerent purposes. It is the latter use we are interested in in Probability and Statistics. The set of values of η contains n elements and η is uniformly distributed a in this set. Sn = 2x − y). Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . . 2 Proof. Bertrand. Proof. Sn = y) = P0 (Tx ≤ n. D. If an urn contains a azure items and b = n−a black items. then. ornamental objects. The answer is very simple: Theorem 31. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. ηn ). which results into P0 (Tx ≤ n. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. (1887). This. gives the result. Since Mn < x is equivalent to Tx > n. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29.Corollary 19. The original ballot problem. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Sn = y) = P0 (Tx > n. e 10 Andr´. . Paris. e e e Paris. 105. . n = a + b. during sampling without replacement. . the number of selected azure items is always ahead of the black ones equals (a − b)/n. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). (1887). posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. P0 (Mn < x. Sn = y). Solution directe du probl`me r´solu par M. where ηi indicates the colour of the i-th item picked. Sci. then the probability that. the ashes of the dead after cremation. 
usually a vase furnished with a foot or pedestal. as for holding liquids. and anciently for holding ballots to be drawn. 105. Compt. 8 88 . Solution d’un probl`me. 369. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. A sample from the urn is a sequence η = (η1 .

We shall call $(\eta_1, \ldots, \eta_n)$ an urn process with parameters $n, a$, always assuming (without any loss of generality) that $a \geq b = n - a$. An urn can be realised quite easily by a random walk:

Theorem 32. Consider a simple symmetric random walk $(S_n, n \geq 0)$ with values in $\mathbb{Z}$ and let $y$ be a positive integer. Then, conditional on the event $\{S_n = y\}$, the sequence $(\xi_1, \ldots, \xi_n)$ of the increments is an urn process with parameters $n$ and $a = (n + y)/2$.

Proof. The values of each $\xi_i$ are $\pm 1$, so 'azure' items are $+1$ and 'black' items are $-1$. We need to show that, conditional on $\{S_n = y\}$, the sequence $(\xi_1, \ldots, \xi_n)$ is uniformly distributed in a set with $\binom{n}{a}$ elements. But if $S_n = y$ then, letting $a, b$ be defined by $a + b = n$, $a - b = y$, we have that $a$ out of the $n$ increments will be $+1$ and $b$ will be $-1$. So the number of possible values of $(\xi_1, \ldots, \xi_n)$ is, indeed, $\binom{n}{a}$. It remains to show that all values are equally likely. Let $\varepsilon_1, \ldots, \varepsilon_n$ be elements of $\{-1, +1\}$ such that $\varepsilon_1 + \cdots + \varepsilon_n = y$. Then, using the definition of conditional probability and Lemma 16, we have:
$$P_0(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n \mid S_n = y) = \frac{P_0(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n)}{P_0(S_n = y)} = \frac{2^{-n}}{\binom{n}{(n+y)/2} 2^{-n}} = \frac{1}{\binom{n}{a}}.$$
Thus, conditional on $\{S_n = y\}$, $(\xi_1, \ldots, \xi_n)$ has a uniform distribution.

Proof of Theorem 31. Since an urn with parameters $n, a$ can be realised by a simple symmetric random walk starting from 0 and ending at $S_n = y$, where $y = a - (n - a) = 2a - n$, we can translate the original question into a question about the random walk. Since the 'azure' items correspond to $+$ signs (respectively, the 'black' items correspond to $-$ signs), the event that the azure are always ahead of the black ones during sampling is precisely the event $\{S_1 > 0, \ldots, S_n > 0\}$ that the random walk remains positive. But in (30) we computed that
$$P_0(S_1 > 0, \ldots, S_n > 0 \mid S_n = y) = \frac{y}{n}.$$
Since $y = a - (n - a) = a - b$, the result follows.

30 The asymmetric simple random walk in dimension 1

Consider now a simple random walk with values in $\mathbb{Z}$, but which is not necessarily symmetric. (The word "asymmetric" will always mean "not necessarily symmetric".) So
$$p_{i,i+1} = p, \qquad p_{i,i-1} = q = 1 - p, \qquad i \in \mathbb{Z}.$$
We have
$$P_0(S_n = k) = \binom{n}{\frac{n+k}{2}} p^{(n+k)/2} q^{(n-k)/2}, \qquad -n \leq k \leq n.$$
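Both the random-walk form of the ballot theorem used in the proof of Theorem 31 and the formula for $P_0(S_n = k)$ can be checked by brute-force enumeration of all $2^n$ paths for small $n$. The following is a minimal sketch, not part of the original notes; the function names are hypothetical, and exact rational arithmetic is assumed so that equalities hold exactly.

```python
from fractions import Fraction
from itertools import product
from math import comb

def endpoint_law(n, p):
    """Exact law of S_n, obtained by summing the path weights p^(#ups) q^(#downs)."""
    q = 1 - p
    law = {}
    for steps in product((1, -1), repeat=n):
        ups = steps.count(1)
        k = sum(steps)
        law[k] = law.get(k, 0) + p**ups * q**(n - ups)
    return law

def binomial_formula(n, k, p):
    """The closed form P_0(S_n = k) = C(n, (n+k)/2) p^((n+k)/2) q^((n-k)/2)."""
    ups = (n + k) // 2
    return comb(n, ups) * p**ups * (1 - p)**(n - ups)

def stays_positive_given_end(n, y):
    """P_0(S_1 > 0, ..., S_n > 0 | S_n = y) for the symmetric walk, by counting paths."""
    total = good = 0
    for steps in product((1, -1), repeat=n):
        if sum(steps) != y:
            continue
        total += 1
        s, positive = 0, True
        for step in steps:
            s += step
            if s <= 0:
                positive = False
                break
        good += positive
    return Fraction(good, total)
```

With $a = 5$ azure and $b = 3$ black items, the walk has $n = 8$ and $y = 2$, and `stays_positive_given_end(8, 2)` can be compared with $(a - b)/n$.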

30.1 First hitting time analysis

We will use probabilistic + analytical methods to compute the distribution of the hitting times
$$T_x = \inf\{n \geq 0 : S_n = x\}, \qquad x \in \mathbb{Z}.$$
This random variable takes values in $\{0, 1, 2, \ldots\} \cup \{\infty\}$.

Lemma 18. For all $x, y \in \mathbb{Z}$, and all $t \geq 0$,
$$P_x(T_y = t) = P_0(T_{y-x} = t).$$

Proof. This follows from the spatial homogeneity of the transition probabilities: to reach $y$ starting from $x$, only the relative position of $y$ with respect to $x$ is needed.

In view of this, it is no loss of generality to consider the random walk starting from 0.

Lemma 19. Consider a simple random walk starting from 0. If $x > 0$, then the summands in $T_x = T_1 + (T_2 - T_1) + \cdots + (T_x - T_{x-1})$ are i.i.d. random variables.

Proof. To prove this, we make essential use of the skip-free property, namely the obvious fact that a simple random walk starting from 0 must pass through levels $1, 2, \ldots, x - 1$ in order to reach $x$. The rest is a consequence of the strong Markov property.

Corollary 20. Consider a simple random walk starting from 0. The process $(T_x, x \geq 0)$ is also a random walk.

In view of Lemma 19 we need to know the distribution of $T_1$. We will approach this using probability generating functions. Let
$$\varphi(s) := E_0 s^{T_1}.$$
(We tacitly assume that $S_0 = 0$.) Then, for $x > 0$,
$$E_0 s^{T_x} = \varphi(s)^x.$$

Theorem 33. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level 1 is reached is given by
$$\varphi(s) = E_0 s^{T_1} = \frac{1 - \sqrt{1 - 4pqs^2}}{2qs}.$$

Proof. Start with $S_0 = 0$ and watch the process until the first $n = T_1$ such that $S_n = 1$. If $S_1 = 1$ then $T_1 = 1$. If $S_1 = -1$ then $T_1$ is the sum of three times: one unit of time spent to go from 0 to $-1$, plus a time $T_1'$ required to go from $-1$ to 0 for the first time, plus a further time $T_1''$ required to go from 0 to $+1$ for the first time. See figure below.

[Figure: a sample path $S_n$ against $n$ with $S_1 = -1$, showing the successive times $T_1'$ and $T_1''$.]

If $S_1 = -1$ then, in order to hit 1, the RW must first hit 0 (it takes time $T_1'$ to do this) and then 1 (it takes a further time of $T_1''$). Hence
$$T_1 = \begin{cases} 1, & \text{if } S_1 = 1, \\ 1 + T_1' + T_1'', & \text{if } S_1 = -1. \end{cases}$$
This can be used as follows:
$$\varphi(s) = E_0\big[s\,\mathbf{1}(S_1 = 1) + s^{1 + T_1' + T_1''}\,\mathbf{1}(S_1 = -1)\big] = sP(S_1 = 1) + sE_0\big[s^{T_1'} s^{T_1''} \mid S_1 = -1\big] P(S_1 = -1) = sp + sE_0\big[s^{T_1'} s^{T_1''} \mid S_1 = -1\big] q.$$
But, from the strong Markov property, $T_1'$ is independent of $T_1''$ conditional on $S_1 = -1$ and each with distribution that of $T_1$. We thus have
$$\varphi(s) = ps + qs\varphi(s)^2.$$
The formula is obtained by solving the quadratic.

Corollary 21. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level $-1$ is reached is given by
$$E_0 s^{T_{-1}} = \frac{1 - \sqrt{1 - 4pqs^2}}{2ps}.$$

Proof. The first time $n$ such that $S_n = -1$ is the first time $n$ such that $-S_n = 1$. But note that $-S_n$ is a simple random walk with upward transition probability $q$ and downward transition probability $p$. Hence the previous formula applies with the roles of $p$ and $q$ interchanged.

Corollary 22. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level $x \in \mathbb{Z}$ is reached is given by
$$E_0 s^{T_x} = \begin{cases} \big(E_0 s^{T_1}\big)^{x} = \left(\dfrac{1 - \sqrt{1 - 4pqs^2}}{2qs}\right)^{x}, & \text{if } x > 0, \\[2ex] \big(E_0 s^{T_{-1}}\big)^{|x|} = \left(\dfrac{1 - \sqrt{1 - 4pqs^2}}{2ps}\right)^{|x|}, & \text{if } x < 0. \end{cases} \qquad (32)$$

Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21.
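The closed form of Theorem 33 can be compared numerically with the generating function computed directly from the distribution of $T_1$. The sketch below is an illustration, not the notes' own method: it propagates the walk killed on reaching level 1 to obtain $P_0(T_1 = n)$ for $n \leq n_{\max}$, and the truncated sum $\sum_n P_0(T_1 = n)s^n$ is then set against $\varphi(s)$. The function names and the truncation level are assumptions made for this example.

```python
from math import sqrt

def hitting_time_pmf(p, n_max):
    """P_0(T_1 = n) for n = 0..n_max, by propagating the walk killed at level 1."""
    q = 1 - p
    alive = {0: 1.0}              # mass of paths that have not yet reached 1
    pmf = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        nxt = {}
        for pos, mass in alive.items():
            if pos + 1 == 1:      # an up-step from 0 hits level 1 at time n
                pmf[n] += mass * p
            else:
                nxt[pos + 1] = nxt.get(pos + 1, 0.0) + mass * p
            nxt[pos - 1] = nxt.get(pos - 1, 0.0) + mass * q
        alive = nxt
    return pmf

def pgf_closed_form(s, p):
    """phi(s) = (1 - sqrt(1 - 4 p q s^2)) / (2 q s), as in Theorem 33."""
    q = 1 - p
    return (1 - sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)

def pgf_truncated(s, p, n_max=400):
    """Truncated series sum_n P_0(T_1 = n) s^n."""
    return sum(mass * s**n for n, mass in enumerate(hitting_time_pmf(p, n_max)))
```

For $0 < s < 1$ the truncation error is at most $s^{n_{\max}}$, so a few hundred terms suffice. The same closed form also satisfies the quadratic $\varphi = ps + qs\varphi^2$ from the proof.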

30.2 First return to the origin

Now let us ask: If the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time
$$T_0' := \inf\{n \geq 1 : S_n = 0\}.$$
(Pay attention: it is important to write $n \geq 1$ inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by
$$E_0 s^{T_0'} = 1 - \sqrt{1 - 4pqs^2}.$$

Proof. We again do first-step analysis.
$$E_0 s^{T_0'} = E_0\big[s^{T_0'}\,\mathbf{1}(S_1 = -1)\big] + E_0\big[s^{T_0'}\,\mathbf{1}(S_1 = 1)\big] = qE_0\big[s^{T_0'} \mid S_1 = -1\big] + pE_0\big[s^{T_0'} \mid S_1 = 1\big].$$
But if $S_1 = -1$ then $T_0'$ equals 1 plus the time required for the walk to hit 0 starting from $-1$. The latter is, in distribution, the same as the time required for the walk to hit 1 starting from 0. Similarly, if $S_1 = 1$, then $T_0'$ equals 1 plus a time whose distribution is that of the time required for the walk to hit $-1$ starting from 0. So
$$E_0 s^{T_0'} = qE_0\big(s^{1 + T_1}\big) + pE_0\big(s^{1 + T_{-1}}\big) = qs\,\frac{1 - \sqrt{1 - 4pqs^2}}{2qs} + ps\,\frac{1 - \sqrt{1 - 4pqs^2}}{2ps} = 1 - \sqrt{1 - 4pqs^2}.$$

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin. For a symmetric random walk, this was found in Theorem 27: $P_0(T_0' = n) = f(n) = u(n)/(n - 1)$ if $n$ is even. For the general case, we have computed the probability generating function $E_0 s^{T_0'}$ (see Theorem 34).

The tool we need here is Taylor's theorem for the function $f(x) = (1 + x)^{\alpha}$, where $\alpha$ is a real number. If $\alpha = n$ is a positive integer then
$$(1 + x)^n = \sum_{k=0}^{n} \binom{n}{k} x^k,$$
by the binomial theorem. If we define $\binom{n}{k}$ to be equal to 0 for $k > n$, then we can omit the upper limit in the sum, and simply write
$$(1 + x)^n = \sum_{k \geq 0} \binom{n}{k} x^k.$$

Recall that
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{k!}.$$
It turns out that the formula is formally true for general $\alpha$, and it looks exactly the same:
$$(1 + x)^{\alpha} = \sum_{k \geq 0} \binom{\alpha}{k} x^k,$$
as long as we define
$$\binom{\alpha}{k} = \frac{\alpha(\alpha - 1)\cdots(\alpha - k + 1)}{k!}.$$
The adverb 'formally' is supposed to mean that this formula holds for some $x$ around 0, but that we won't worry about which ones. For example, we will show that
$$(1 - x)^{-1} = 1 + x + x^2 + x^3 + \cdots$$
which is of course the all too-familiar geometric series. Indeed,
$$(1 - x)^{-1} = \sum_{k \geq 0} \binom{-1}{k} (-x)^k.$$
But, by definition,
$$\binom{-1}{k} = \frac{(-1)(-2)\cdots(-1 - k + 1)}{k!} = \frac{(-1)^k k!}{k!} = (-1)^k,$$
so
$$(1 - x)^{-1} = \sum_{k \geq 0} (-1)^k (-x)^k = \sum_{k \geq 0} (-1)^{2k} x^k = \sum_{k \geq 0} x^k,$$
as needed. As another example, let us compute $\sqrt{1 - x}$. We have
$$\sqrt{1 - x} = (1 - x)^{1/2} = \sum_{k \geq 0} \binom{1/2}{k} (-1)^k x^k.$$
We can write this in more familiar symbols if we observe that
$$\binom{1/2}{k} (-1)^k = -\frac{1}{2k - 1} \binom{2k}{k} 4^{-k},$$
and this is shown by direct computation. So
$$\sqrt{1 - x} = -\sum_{k \geq 0} \frac{1}{2k - 1} \binom{2k}{k} 4^{-k} x^k = 1 - \sum_{k \geq 1} \frac{1}{2k - 1} \binom{2k}{k} 4^{-k} x^k.$$
We apply this, with $x = 4pqs^2$, to $E_0 s^{T_0'} = 1 - \sqrt{1 - 4pqs^2}$:
$$E_0 s^{T_0'} = \sum_{k \geq 1} \frac{1}{2k - 1} \binom{2k}{k} (pq)^k s^{2k}.$$

But, by the definition of $E_0 s^{T_0'}$,
$$E_0 s^{T_0'} = \sum_{n \geq 0} P_0(T_0' = n) s^n.$$
Comparing the two 'infinite polynomials' we see that we must have $n = 2k$ (we knew that: the random walk can return to 0 only in an even number of steps) and then
$$P_0(T_0' = 2k) = \frac{1}{2k - 1} \binom{2k}{k} p^k q^k = \frac{1}{2k - 1} P_0(S_{2k} = 0).$$

31 Recurrence properties of simple random walks

Consider again a simple random walk $S_n$ with values in $\mathbb{Z}$ with upward probability $p$ and downward probability $q = 1 - p$. By applying the law of large numbers we have that
$$P_0\left(\lim_{n \to \infty} \frac{S_n}{n} = p - q\right) = 1.$$

• If $p > 1/2$ then $p - q > 0$ and so $S_n$ tends to $+\infty$ with probability one. Hence it is transient.
• If $p < 1/2$ then $p - q < 0$ and so $S_n$ tends to $-\infty$ with probability one. Hence it is again transient.
• If $p = 1/2$ then $p - q = 0$ and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is:
$$P_0(T_x < \infty) = \lim_{s \uparrow 1} E_0 s^{T_x},$$
which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes $1 - 4pq$. But remember that $q = 1 - p$, so
$$\sqrt{1 - 4pq} = \sqrt{1 - 4p + 4p^2} = \sqrt{(1 - 2p)^2} = |1 - 2p| = |p - q|.$$
So
$$P_0(T_x < \infty) = \begin{cases} \left(\dfrac{1 - |p - q|}{2q}\right)^{x}, & \text{if } x > 0, \\[2ex] \left(\dfrac{1 - |p - q|}{2p}\right)^{|x|}, & \text{if } x < 0. \end{cases} \qquad (33)$$
This formula can be written more neatly as follows:

Theorem 35.
$$P_0(T_x < \infty) = \begin{cases} \left(\dfrac{p}{q} \wedge 1\right)^{x}, & \text{if } x > 0, \\[2ex] \left(\dfrac{q}{p} \wedge 1\right)^{|x|}, & \text{if } x < 0. \end{cases}$$
In particular, if $p = q$ then $P_0(T_x < \infty) = 1$ for all $x \in \mathbb{Z}$.

Proof. If $p > q$ then $|p - q| = p - q$ and so (33) gives
$$P_0(T_x < \infty) = \begin{cases} 1, & \text{if } x > 0, \\ (q/p)^{|x|}, & \text{if } x < 0, \end{cases}$$
which is what we want. If $p < q$ then $|p - q| = q - p$ and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in $\mathbb{Z}$ is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that $p < 1/2$. Then $\lim_{n \to \infty} S_n = -\infty$ with probability 1. This means that there is a largest value that will be ever attained by the random walk. This value is random and is denoted by
$$M_\infty = \sup(S_0, S_1, S_2, \ldots).$$
If we let $M_n = \max(S_0, \ldots, S_n)$, then we have $M_\infty = \lim_{n \to \infty} M_n$. Indeed, the sequence $M_n$ is nondecreasing; every nondecreasing sequence has a limit, except that the limit may be $\infty$; but here it is $< \infty$, with probability 1 because $p < 1/2$.

Theorem 36. Consider a simple random walk with values in $\mathbb{Z}$. Assume $p < 1/2$. Then the overall maximum $M_\infty$ has a geometric distribution:
$$P_0(M_\infty \geq x) = (p/q)^x, \qquad x \geq 0.$$

Proof.
$$P_0(M_\infty \geq x) = \lim_{n \to \infty} P_0(M_n \geq x) = \lim_{n \to \infty} P_0(T_x \leq n) = P_0(T_x < \infty) = (p/q)^x, \qquad x \geq 0.$$

31.2 Return probabilities

Only in the symmetric case $p = q = 1/2$ will the random walk return to the point where it started from. Otherwise, with some positive probability, it will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in $\mathbb{Z}$. Let
$$T_0' = \inf\{n \geq 1 : S_n = 0\}$$
be the first return time to 0. Then
$$P_0(T_0' < \infty) = 2(p \wedge q).$$
Unless $p = 1/2$, this probability is always strictly less than 1.

Proof. The probability generating function of $T_0'$ was obtained in Theorem 34. But
$$P_0(T_0' < \infty) = \lim_{s \uparrow 1} E_0 s^{T_0'} = 1 - \sqrt{1 - 4pq} = 1 - |p - q|.$$
If $p = q$ this gives 1. If $p > q$ this gives $1 - (p - q) = 1 - (1 - q - q) = 2q$. And if $p < q$ this gives $1 - (q - p) = 1 - (1 - p - p) = 2p$.

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state $i$ is given by
$$J_i = \sum_{n=1}^{\infty} \mathbf{1}(S_n = i), \qquad (34)$$
a random variable first considered in Section 10 where it was also shown to be geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in $\mathbb{Z}$. Then, starting from zero, the total number $J_0$ of visits to state 0 is geometrically distributed with
$$P_0(J_0 \geq k) = (2(p \wedge q))^k, \quad k = 0, 1, 2, \ldots, \qquad E_0 J_0 = \frac{2(p \wedge q)}{1 - 2(p \wedge q)}.$$

Proof. In Lemma 4 we showed that, if a Markov chain starts from a state $i$ then $J_i$ is geometric. And so
$$P_0(J_0 \geq k) = P_0(J_0 \geq 1)^k, \qquad k = 0, 1, 2, \ldots.$$
But $J_0 \geq 1$ if and only if the first return time to zero is finite. So
$$P_0(J_0 \geq 1) = P_0(T_0' < \infty) = 2(p \wedge q),$$
where the last probability was computed in Theorem 37. This proves the first formula. For the second formula we have
$$E_0 J_0 = \sum_{k=1}^{\infty} P_0(J_0 \geq k) = \sum_{k=1}^{\infty} (2(p \wedge q))^k = \frac{2(p \wedge q)}{1 - 2(p \wedge q)},$$
this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any $i$:
$$P_i(J_i \geq k) = (2(p \wedge q))^k, \quad k = 0, 1, 2, \ldots, \qquad E_i J_i = \frac{2(p \wedge q)}{1 - 2(p \wedge q)}.$$

Remark 2: The formulae work for all values of $p \in [0, 1]$, even for $p = 1/2$. In the latter case, $P_i(J_i \geq k) = 1$ for all $k$, meaning that $P_i(J_i = \infty) = 1$, as already known.

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a random walk with values in $\mathbb{Z}$. If $p \leq 1/2$ then
$$P_0(J_x \geq k) = (p/q)^{x \vee 0} (2p)^{k-1}, \quad k \geq 1, \qquad E_0 J_x = \frac{(p/q)^{x \vee 0}}{1 - 2p}.$$
If $p \geq 1/2$ then
$$P_0(J_x \geq k) = (q/p)^{(-x) \vee 0} (2q)^{k-1}, \quad k \geq 1, \qquad E_0 J_x = \frac{(q/p)^{(-x) \vee 0}}{1 - 2q}.$$

Proof. Suppose $x > 0$, $k \geq 1$. Then
$$P_0(J_x \geq k) = P_0(T_x < \infty, J_x \geq k),$$
simply because if $J_x \geq k \geq 1$ then $T_x < \infty$. Apply now the strong Markov property at $T_x$. Then
$$P_0(T_x < \infty, J_x \geq k) = P_0(T_x < \infty)\, P_x(J_x \geq k - 1).$$
The reason that we replaced $k$ by $k - 1$ is because, given that $T_x < \infty$, there has already been one visit to state $x$ and so we need at least $k - 1$ remaining visits in order to have at least $k$ visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain
$$P_0(J_x \geq k) = \left(\frac{p}{q} \wedge 1\right)^{x} (2(p \wedge q))^{k-1}.$$
If $p \leq 1/2$ then this gives $P_0(J_x \geq k) = (p/q)^x (2p)^{k-1}$. If $p \geq 1/2$ then this gives $P_0(J_x \geq k) = (2q)^{k-1}$. We repeat the argument for $x < 0$ and find
$$P_0(J_x \geq k) = \left(\frac{q}{p} \wedge 1\right)^{|x|} (2(p \wedge q))^{k-1}.$$
If $p \leq 1/2$ then this gives $P_0(J_x \geq k) = (2p)^{k-1}$. If $p \geq 1/2$ then this gives $P_0(J_x \geq k) = (q/p)^{|x|} (2q)^{k-1}$. As for the expectation, we apply
$$E_0 J_x = \sum_{k=1}^{\infty} P_0(J_x \geq k)$$
and perform the sum of a geometric series.

Remark 1: The expectation $E_0 J_x$ is finite if $p \neq 1/2$. Consider the case $p < 1/2$. For $x > 0$, $E_0 J_x$ drops down to zero geometrically fast as $x$ tends to infinity. But for $x < 0$, we have $E_0 J_x = 1/(1 - 2p)$ regardless of $x$. In other words, if the random walk "drifts down" (i.e. $p < 1/2$) then, starting from 0, it will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae, i.e.
$$P_y(J_x \geq k) = P_0(J_{x-y} \geq k) \quad \text{and} \quad E_y J_x = E_0 J_{x-y}.$$
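The expectations in Theorem 39 can be checked numerically for $x \neq 0$. Taking expectations in definition (34) gives $E_0 J_x = \sum_{n \geq 1} P_0(S_n = x)$, and each term is given by the binomial formula of Section 30. The following is a minimal sketch under that observation; the function names and the truncation level `n_max` are assumptions made for this example, and logarithms are used only to avoid overflow for large $n$.

```python
from math import exp, lgamma, log

def prob_at(n, k, p):
    """P_0(S_n = k) for the simple random walk with up-probability p in (0, 1)."""
    if abs(k) > n or (n + k) % 2:
        return 0.0
    ups = (n + k) // 2
    downs = n - ups
    log_binom = lgamma(n + 1) - lgamma(ups + 1) - lgamma(downs + 1)
    return exp(log_binom + ups * log(p) + downs * log(1 - p))

def expected_visits(x, p, n_max=2000):
    """Truncated series for E_0 J_x = sum over n >= 1 of P_0(S_n = x)."""
    return sum(prob_at(n, x, p) for n in range(1, n_max + 1))

def theorem_39_mean(x, p):
    """Closed form of E_0 J_x from Theorem 39, for x != 0 and p != 1/2."""
    q = 1 - p
    if p < 0.5:
        return (p / q) ** max(x, 0) / (1 - 2 * p)
    return (q / p) ** max(-x, 0) / (1 - 2 * q)
```

With $p = 0.3$ the two agree for points on either side of the origin; for every $x < 0$ the value is $1/(1 - 2p) = 2.5$, which illustrates Remark 1.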

32 Duality

Consider random walk $S_k = \sum_{i=1}^{k} \xi_i$, i.e. a sum of i.i.d. random variables. Fix an integer $n$. Then the dual $(\hat{S}_k, 0 \leq k \leq n)$ of $(S_k, 0 \leq k \leq n)$ is defined by
$$\hat{S}_k := S_n - S_{n-k}.$$

Theorem 40. $(\hat{S}_k, 0 \leq k \leq n)$ has the same distribution as $(S_k, 0 \leq k \leq n)$.

Proof. Use the obvious fact that $(\xi_1, \xi_2, \ldots, \xi_n)$ has the same distribution as $(\xi_n, \xi_{n-1}, \ldots, \xi_1)$.

Thus, every time we have a probability for $S$ we can translate it into a probability for $\hat{S}$, which is basically another probability for $S$, because the two have the same distribution. Here is an application of this:

Theorem 41. Let $(S_k)$ be a simple symmetric random walk and $x$ a positive integer. Let $T_x = \inf\{n \geq 1 : S_n = x\}$ be the first time that $x$ will be visited. Then
$$P_0(T_x = n) = \frac{x}{n} P_0(S_n = x).$$

Proof. Rewrite the ballot theorem in terms of the dual:
$$P_0(\hat{S}_1 > 0, \ldots, \hat{S}_n > 0 \mid \hat{S}_n = x) = \frac{x}{n}, \qquad x > 0.$$
But the left-hand side is
$$P_0(S_n - S_{n-k} > 0,\ 1 \leq k \leq n \mid S_n = x) = P_0(S_k < x,\ 0 \leq k \leq n - 1 \mid S_n = x) = \frac{P_0(S_k < x,\ 0 \leq k \leq n - 1,\ S_n = x)}{P_0(S_n = x)} = \frac{P_0(T_x = n)}{P_0(S_n = x)}.$$

Pause for reflection: We have been dealing with the random times $T_x$ for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for $x > 0$, $T_x$ is the sum of $x$ i.i.d. random variables, each distributed as $T_1$, and thus derived the generating function of $T_x$:
$$E_0 s^{T_x} = \left(\frac{1 - \sqrt{1 - 4pqs^2}}{2qs}\right)^{x}.$$
Specialising to the symmetric case we have
$$E_0 s^{T_x} = \left(\frac{1 - \sqrt{1 - s^2}}{s}\right)^{x}. \qquad (35)$$
In Theorem 29 we used the reflection principle and derived the distribution function of $T_x$ for a simple symmetric random walk:
$$P_0(T_x \leq n) = P_0(|S_n| \geq x) - \frac{1}{2} P_0(|S_n| = x). \qquad (36)$$
Lastly, in Theorem 41, we derived using duality, for a simple symmetric random walk, the probabilities
$$P_0(T_x = n) = \frac{x}{n} P_0(S_n = x). \qquad (37)$$

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
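Although the algebraic verification is not easy, the compatibility of (35), (36) and (37) is straightforward to check numerically. The sketch below is an illustration only; the function names are hypothetical. It uses exact rational arithmetic to confirm that summing (37) reproduces (36), and floating point to compare the series $\sum_n P_0(T_x = n)s^n$ built from (37) with the closed form (35).

```python
from fractions import Fraction
from math import comb, sqrt

def p_s(n, k):
    """P_0(S_n = k) for the simple symmetric random walk."""
    if abs(k) > n or (n + k) % 2:
        return Fraction(0)
    return Fraction(comb(n, (n + k) // 2), 2**n)

def pmf_37(x, n):
    """Formula (37): P_0(T_x = n) = (x/n) P_0(S_n = x)."""
    return Fraction(x, n) * p_s(n, x)

def cdf_36(x, n):
    """Formula (36): P_0(T_x <= n) = P_0(|S_n| >= x) - (1/2) P_0(|S_n| = x)."""
    abs_ge = sum(p_s(n, k) for k in range(-n, n + 1) if abs(k) >= x)
    abs_eq = p_s(n, x) + p_s(n, -x)
    return abs_ge - abs_eq / 2

def pgf_35(x, s):
    """Formula (35): E_0 s^{T_x} in the symmetric case."""
    return ((1 - sqrt(1 - s * s)) / s) ** x

def pgf_from_37(x, s, n_max=200):
    """Truncated generating function built from the probabilities (37)."""
    return sum(float(pmf_37(x, n)) * s**n for n in range(1, n_max + 1))
```

For example, `sum(pmf_37(3, m) for m in range(1, 21))` and `cdf_36(3, 20)` are the same rational number.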

33 Amount of time that a SSRW is positive*

Consider a SSRW $(S_n)$ starting from 0. Define its trajectory as the polygonal line obtained by joining the points $(n, S_n)$ in the plane, as described earlier. Watch this trajectory up to time $2n$. Let $T_+(2n)$ be the number of sides of the polygonal line that are above the horizontal axis and $T_-(2n) = 2n - T_+(2n)$ the number of sides that are below. Clearly, $T_\pm(2n)$ are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of $T_+(2n)$ as the number of leads in $2n$ coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed $2n$ times, so the event that the number of heads equals $k$ has maximum probability when $k = n$ (heads = tails). Most people take this statement and translate it into a statement about $T_+(2n)$, arguing that $P(T_+(2n) = m)$ is maximised when $m = n$, i.e. when $T_+(2n) = T_-(2n)$. This is wrong. In fact, as the theorem below shows, $P(T_+(2n) = m)$ is maximised when $m = 0$ or $m = 2n$.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time $T_0'$, it behaves again like a SSRW and is independent of the past.]

Theorem 42 (distribution of the total time the random walk is positive).
$$P(T_+(2n) = 2k) = P(S_{2k} = 0)\, P(S_{2n-2k} = 0), \qquad 0 \leq k \leq n.$$

Proof. The idea is to condition at the first time $T_0'$ that the RW returns to 0:
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(T_+(2n) = 2k,\ T_0' = 2r).$$
Between 0 and $T_0'$ the random walk is either above the horizontal axis or below; call the first event $A_+$ and the second $A_-$ and write
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(A_+, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_+, T_0' = 2r) + \sum_{r=1}^{n} P(A_-, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_-, T_0' = 2r).$$

In the first case,
$$P(T_+(2n) = 2k \mid A_+, T_0' = 2r) = P(T_+(2n - 2r) = 2k - 2r),$$
because, if $T_0' = 2r$ and $A_+$ occurs then the RW has already spent $2r$ units of time above the axis and so $T_+(2n)$ is reduced by $2r$. Furthermore, we may remove the conditioning because the future evolution of the RW after $T_0'$ is independent of the past. In the second case, the RW has spent no time above the axis up to $T_0'$, and a similar argument shows
$$P(T_+(2n) = 2k \mid A_-, T_0' = 2r) = P(T_+(2n - 2r) = 2k).$$

We also observe that, by symmetry,
$$P(A_+, T_0' = 2r) = P(A_-, T_0' = 2r) = \frac{1}{2} P(T_0' = 2r) = \frac{1}{2} f_{2r}.$$
Letting $p_{2k,2n} = P(T_+(2n) = 2k)$, we have shown that
$$p_{2k,2n} = \frac{1}{2} \sum_{r=1}^{k} f_{2r}\, p_{2k-2r,\,2n-2r} + \frac{1}{2} \sum_{r=1}^{n-k} f_{2r}\, p_{2k,\,2n-2r}.$$

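We want to show that $p_{2k,2n} = u_{2k} u_{2n-2k}$, where $u_{2k} = P(S_{2k} = 0)$. With this hypothesis, we can start playing with the recursion of the display above and using induction on $n$ we will realise that
$$p_{2k,2n} = \frac{1}{2} u_{2n-2k} \sum_{r=1}^{k} f_{2r} u_{2k-2r} + \frac{1}{2} u_{2k} \sum_{r=1}^{n-k} f_{2r} u_{2n-2k-2r}.$$
Decomposing the event $\{S_{2m} = 0\}$ according to the time of the first return to 0 gives $u_{2m} = \sum_{r=1}^{m} f_{2r} u_{2m-2r}$ for $m \geq 1$. Hence, for $0 < k < n$, the first sum equals $u_{2k}$ and the second equals $u_{2n-2k}$, so that $p_{2k,2n} = u_{2k} u_{2n-2k}$.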

Explicitly, we have
$$P(T_+(2n) = 2k) = \binom{2k}{k} \binom{2n - 2k}{n - k} 2^{-2n}, \qquad 0 \leq k \leq n. \qquad (38)$$

With $2n = 100$, we plot the function $P(T_+(100) = 2k)$ for $0 \leq k \leq 50$ below. Notice that it is maximised at the ends of the interval.

[Figure: plot of $P(T_+(100) = 2k)$ against $k$, for $0 \leq k \leq 50$.]
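Formula (38) can be confirmed by exhaustive enumeration for small $n$, and evaluated for $2n = 100$ to see where the maximum sits. The following is a minimal sketch, not the notes' own computation; the function names are hypothetical. A side of the trajectory is counted as above the axis when at least one of its endpoints is strictly positive.

```python
from fractions import Fraction
from itertools import product
from math import comb

def formula_38(n, k):
    """P(T_+(2n) = 2k) according to formula (38)."""
    return Fraction(comb(2 * k, k) * comb(2 * n - 2 * k, n - k), 4**n)

def brute_force(n):
    """Exact law of T_+(2n)/2, found by enumerating all 2^(2n) paths."""
    counts = [0] * (n + 1)
    for steps in product((1, -1), repeat=2 * n):
        s = above = 0
        for step in steps:
            prev, s = s, s + step
            if prev > 0 or s > 0:   # this side lies above the horizontal axis
                above += 1
        counts[above // 2] += 1
    return [Fraction(c, 4**n) for c in counts]
```

For $2n = 100$, `max(range(51), key=lambda k: formula_38(50, k))` returns an endpoint, while the middle value $k = 25$ gives the smallest probability.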

The arcsine law*

Use of Stirling's approximation in formula (38) gives
$$P(T_+(2n) = 2k) \sim \frac{1/\pi}{\sqrt{k(n - k)}}, \qquad \text{as } k \to \infty \text{ and } n - k \to \infty.$$
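The quality of this approximation is easy to inspect numerically. The sketch below (an illustration only, with hypothetical function names) evaluates the exact expression (38) with integer binomial coefficients and compares it with $(1/\pi)/\sqrt{k(n-k)}$.

```python
from math import comb, pi, sqrt

def exact_38(n, k):
    """Exact value of P(T_+(2n) = 2k) from formula (38)."""
    return comb(2 * k, k) * comb(2 * n - 2 * k, n - k) / 4**n

def stirling_approx(n, k):
    """Stirling-based approximation (1/pi) / sqrt(k (n - k))."""
    return 1 / (pi * sqrt(k * (n - k)))

def ratio(n, k):
    """Exact value divided by the approximation; tends to 1 as k and n - k grow."""
    return exact_38(n, k) / stirling_approx(n, k)
```

The ratio is already close to 1 for moderate values, and closer still when both $k$ and $n - k$ are large, for example `ratio(1000, 400)`.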

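Now consider the calculation of the following probability:
$$P\left(\frac{T_+(2n)}{2n} \in [a, b]\right) = \sum_{na \leq k \leq nb} P\left(\frac{T_+(2n)}{2n} = \frac{k}{n}\right) \sim \sum_{a \leq k/n \leq b} \frac{1/\pi}{\sqrt{k(n - k)}} = \sum_{a \leq k/n \leq b} \frac{1/\pi}{\sqrt{\frac{k}{n}\left(1 - \frac{k}{n}\right)}}\, \frac{1}{n}.$$
This is easily seen to have a limit as $n \to \infty$, the limit being
$$\int_a^b \frac{1/\pi}{\sqrt{t(1 - t)}}\, dt,$$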

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
$$\lim_{n \to \infty} P\left(\frac{T_+(2n)}{2n} \leq x\right) = \int_0^x \frac{1/\pi}{\sqrt{t(1 - t)}}\, dt = \frac{2}{\pi} \arcsin \sqrt{x}.$$
We thus have that the limiting distribution of $T_+(2n)/(2n)$ (the fraction of time that the RW is positive), as $n \to \infty$, is the distribution function $\frac{2}{\pi} \arcsin \sqrt{x}$, $0 < x < 1$. This is known as the arcsine law. This distribution function has density $\frac{1/\pi}{\sqrt{x(1 - x)}}$, $0 < x < 1$, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times $(T_x, x \geq 0)$. This is another random walk because the summands in $T_x = \sum_{y=1}^{x} (T_y - T_{y-1})$ are i.i.d., all distributed as $T_1$:
$$P(T_1 \geq 2n) = P(S_{2n} = 0) = \binom{2n}{n} 2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$

From this, we immediately get that the expectation of $T_1$ is infinity: $ET_1 = \infty$. Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates $(x, y)$ where $x, y \in \mathbb{Z}$. He starts at $(0, 0)$ and, being totally drunk, he moves east, west, north or south, with equal probability ($1/4$). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in $\mathbb{Z}^2 = \mathbb{Z} \times \mathbb{Z}$: Let $\xi_1, \xi_2, \ldots$ be i.i.d. random vectors such that
$$P(\xi_n = (0, 1)) = P(\xi_n = (0, -1)) = P(\xi_n = (1, 0)) = P(\xi_n = (-1, 0)) = 1/4.$$
We then let
$$S_0 = (0, 0), \qquad S_n = \sum_{i=1}^{n} \xi_i, \quad n \geq 1.$$
If $x \in \mathbb{Z}^2$, we let $x^1$ be its horizontal coordinate and $x^2$ its vertical. So
$$S_n = (S_n^1, S_n^2) = \left(\sum_{i=1}^{n} \xi_i^1,\ \sum_{i=1}^{n} \xi_i^2\right).$$

Lemma 20. $(S_n^1, n \in \mathbb{Z}_+)$ is a random walk. $(S_n^2, n \in \mathbb{Z}_+)$ is also a random walk. The two random walks are not independent.

Proof. Since $\xi_1, \xi_2, \ldots$ are independent, so are $\xi_1^1, \xi_2^1, \ldots$, the latter being functions of the former. They are also identically distributed with
$$P(\xi_n^1 = 0) = P(\xi_n = (0, 1)) + P(\xi_n = (0, -1)) = 1/2,$$
$$P(\xi_n^1 = 1) = P(\xi_n = (1, 0)) = 1/4,$$
$$P(\xi_n^1 = -1) = P(\xi_n = (-1, 0)) = 1/4.$$
Hence $(S_n^1, n \in \mathbb{Z}_+)$ is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0. Same argument applies to $(S_n^2, n \in \mathbb{Z}_+)$ showing that it, too, is a random walk. However, the two stochastic processes are not independent. Indeed, for each $n$, the random variables $\xi_n^1, \xi_n^2$ are not independent simply because if one takes value 0 the other is nonzero.

Consider a change of coordinates: rotate the axes by 45° (and scale them), i.e. map $(x^1, x^2)$ into $(x^1 + x^2, x^1 - x^2)$. So let
$$R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2,\ S_n^1 - S_n^2).$$
We have:

Lemma 21. $(R_n, n \in \mathbb{Z}_+)$ is a random walk in two dimensions. $(R_n^1, n \in \mathbb{Z}_+)$ is a random walk in one dimension. $(R_n^2, n \in \mathbb{Z}_+)$ is a random walk in one dimension. The last two are independent.

Proof. We have $R_n^1 = \sum_{i=1}^{n} (\xi_i^1 + \xi_i^2)$, $R_n^2 = \sum_{i=1}^{n} (\xi_i^1 - \xi_i^2)$. So $R_n = \sum_{i=1}^{n} \eta_i$ where $\eta_n$ is the vector $\eta_n = (\xi_n^1 + \xi_n^2,\ \xi_n^1 - \xi_n^2)$. The sequence of vectors $\eta_1, \eta_2, \ldots$ is i.i.d., being obtained by taking the same function on the elements of the sequence $\xi_1, \xi_2, \ldots$. Hence $(R_n, n \in \mathbb{Z}_+)$ is a random walk in two dimensions. Similar argument shows that each of $(R_n^1, n \in \mathbb{Z}_+)$, $(R_n^2, n \in \mathbb{Z}_+)$ is a random walk in one dimension. To show that the last two are independent, we check that $\eta_n^1 = \xi_n^1 + \xi_n^2$, $\eta_n^2 = \xi_n^1 - \xi_n^2$ are independent for each $n$. The possible values of each of the $\eta_n^1, \eta_n^2$ are $\pm 1$. We have
$$P(\eta_n^1 = 1, \eta_n^2 = 1) = P(\xi_n = (1, 0)) = 1/4,$$
$$P(\eta_n^1 = -1, \eta_n^2 = -1) = P(\xi_n = (-1, 0)) = 1/4,$$
$$P(\eta_n^1 = 1, \eta_n^2 = -1) = P(\xi_n = (0, 1)) = 1/4,$$
$$P(\eta_n^1 = -1, \eta_n^2 = 1) = P(\xi_n = (0, -1)) = 1/4.$$
Hence $P(\eta_n^1 = 1) = P(\eta_n^1 = -1) = 1/2$, $P(\eta_n^2 = 1) = P(\eta_n^2 = -1) = 1/2$. And so $P(\eta_n^1 = \varepsilon, \eta_n^2 = \varepsilon') = P(\eta_n^1 = \varepsilon) P(\eta_n^2 = \varepsilon')$ for all choices of $\varepsilon, \varepsilon' \in \{-1, +1\}$.

Lemma 22.
$$P(S_{2n} = 0) = \binom{2n}{n}^2 2^{-4n}.$$

Proof.
$$P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0, R_{2n}^2 = 0) = P(R_{2n}^1 = 0)\, P(R_{2n}^2 = 0),$$
where the last equality follows from the independence of the two random walks. But $(R_n^1)$ is a simple symmetric random walk. So $P(R_{2n}^1 = 0) = \binom{2n}{n} 2^{-2n}$. The same is true for $(R_n^2)$.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that $\sum_{n=0}^{\infty} P(S_n = 0) = \infty$. But by the previous lemma, and by Stirling's approximation, $P(S_{2n} = 0) \sim \frac{c}{n}$, as $n \to \infty$, where $c$ is a positive constant, in the sense that the ratio of the two sides goes to 1.

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let $a < 0 < b$. Recall, from Theorem 7, that for a simple symmetric random walk (started from $S_0 = 0$),
$$P(T_a < T_b) = \frac{b}{b - a}, \qquad P(T_b < T_a) = \frac{-a}{b - a}.$$
We can read this as follows:
$$P(S_{T_a \wedge T_b} = a) = \frac{b}{b - a}, \qquad P(S_{T_a \wedge T_b} = b) = \frac{-a}{b - a}.$$

This last display tells us the distribution of $S_{T_a \wedge T_b}$. Note, in particular, that $E S_{T_a \wedge T_b} = 0$. Conversely, given a 2-point probability distribution $(p, 1 - p)$, such that $p$ is rational, we can always find integers $a, b$, $a < 0 < b$, such that $pa + (1 - p)b = 0$. Indeed, if $p = M/N$, $M < N$, $M, N \in \mathbb{N}$, choose, e.g., $a = M - N$, $b = M$. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval $[a, b]$. The probability it exits it from the left point is $p$, and the probability it exits from the right point is $1 - p$.

How about more complicated distributions? Let us suppose we are given a probability distribution $(p_i, i \in \mathbb{Z})$. Can we find a stopping time $T$ such that $S_T$ has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let $(S_n)$ be a simple symmetric random walk and $(p_i, i \in \mathbb{Z})$ a given probability distribution on $\mathbb{Z}$ with zero mean: $\sum_{i \in \mathbb{Z}} i p_i = 0$. Then there exists a stopping time $T$ such that
$$P(S_T = i) = p_i, \qquad i \in \mathbb{Z}.$$

Proof. Define a pair of random variables $(A, B)$ taking values in $\{(0, 0)\} \cup ((-\mathbb{N}) \times \mathbb{N})$ with the following distribution:
$$P(A = i, B = j) = c^{-1} (j - i) p_i p_j, \quad i < 0 < j, \qquad P(A = 0, B = 0) = p_0,$$
where $c^{-1}$ is chosen by normalisation:
$$P(A = 0, B = 0) + \sum_{i < 0} \sum_{j > 0} P(A = i, B = j) = 1.$$
This leads to:
$$c^{-1} \sum_{j > 0} j p_j \sum_{i < 0} p_i + c^{-1} \sum_{i < 0} (-i) p_i \sum_{j > 0} p_j = 1 - p_0.$$
Since $\sum_{i \in \mathbb{Z}} i p_i = 0$, we have $\sum_{i < 0} (-i) p_i = \sum_{i > 0} i p_i$ and, using this, we get that $c$ is precisely this common value:
$$c = \sum_{i < 0} (-i) p_i = \sum_{i > 0} i p_i.$$
Letting $T_i = \inf\{n \geq 0 : S_n = i\}$, $i \in \mathbb{Z}$, and assuming that $(A, B)$ are independent of $(S_n)$, we consider the random variable
$$Z = S_{T_A \wedge T_B}.$$
The claim is that
$$P(Z = k) = p_k, \qquad k \in \mathbb{Z}.$$
To see this, suppose first $k = 0$. Then $Z = 0$ if and only if $A = B = 0$ and this has probability $p_0$, by definition.

Suppose next $k > 0$. We have
$$P(Z = k) = P(S_{T_A \wedge T_B} = k) = \sum_{i < 0} \sum_{j > 0} P(S_{T_i \wedge T_j} = k)\, P(A = i, B = j) = \sum_{i < 0} P(S_{T_i \wedge T_k} = k)\, P(A = i, B = k)$$
$$= \sum_{i < 0} \frac{-i}{k - i}\, c^{-1} (k - i) p_i p_k = c^{-1} \sum_{i < 0} (-i) p_i\, p_k = p_k.$$
Similarly, $P(Z = k) = p_k$ for $k < 0$.
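The construction in the proof can be verified exactly for any zero-mean distribution with finite support. The sketch below is an illustration under that finite-support assumption, with hypothetical names: it builds the law of $(A, B)$, applies the gambler's ruin probabilities $P(S_{T_i \wedge T_j} = j) = -i/(j - i)$ and $P(S_{T_i \wedge T_j} = i) = j/(j - i)$, and returns the resulting law of $Z$.

```python
from fractions import Fraction

def embedded_law(p):
    """Law of Z = S at time T_A ^ T_B for a zero-mean law p given as {value: probability}."""
    assert sum(p.values()) == 1
    assert sum(i * pi for i, pi in p.items()) == 0
    c = sum(i * pi for i, pi in p.items() if i > 0)    # normalising constant
    law = {0: p.get(0, Fraction(0))}                   # Z = 0 iff A = B = 0
    for i, pi in p.items():
        if i >= 0:
            continue
        for j, pj in p.items():
            if j <= 0:
                continue
            weight = (j - i) * pi * pj / c             # P(A = i, B = j)
            law[j] = law.get(j, 0) + weight * Fraction(-i, j - i)  # exit at j
            law[i] = law.get(i, 0) + weight * Fraction(j, j - i)   # exit at i
    return {k: v for k, v in law.items() if v}         # drop zero entries

# A hypothetical target distribution with mean zero.
target = {
    -3: Fraction(1, 10), -1: Fraction(3, 10), 0: Fraction(1, 5),
    1: Fraction(1, 5), 2: Fraction(1, 5),
}
```

Calling `embedded_law(target)` returns a dictionary equal to `target`, as Theorem 43 asserts.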

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set $\mathbb{N}$ of natural numbers is the set of positive integers: $\mathbb{N} = \{1, 2, 3, \ldots\}$. The set $\mathbb{Z}$ of integers consists of $\mathbb{N}$, their negatives and zero. Zero is denoted by 0. Thus $\mathbb{Z} = \{\cdots, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, \cdots\}$.

Given integers $a$, $b$, with $b \neq 0$, we say that $b$ divides $a$ if there is an integer $k$ such that $a = kb$. We write this as $b \mid a$. If $c \mid b$ and $b \mid a$ then $c \mid a$. If $a \mid b$ and $b \mid a$ then $a = b$ or $a = -b$; in particular, $a = b$ if both are positive.

Now, if $a$, $b$ are arbitrary integers, with at least one of them nonzero, we can define their greatest common divisor $d = \gcd(a, b)$ as the unique positive integer such that
(i) $d \mid a$, $d \mid b$;
(ii) if $c \mid a$ and $c \mid b$, then $c \mid d$.
(That it is unique is obvious from the fact that if $d_1$, $d_2$ are two such numbers then $d_1$ would divide $d_2$ and vice-versa, leading to $d_1 = d_2$.)

Notice that:
$$
\gcd(a, b) = \gcd(a, b - a).
$$
Indeed, let $d = \gcd(a, b)$. We will show that $d = \gcd(a, b - a)$. But $d \mid a$ and $d \mid b$. Therefore $d \mid b - a$. So (i) holds. Now suppose that $c \mid a$ and $c \mid b - a$. Then $c \mid a + (b - a)$, i.e. $c \mid b$. Because $c \mid a$ and $c \mid b$, we have $c \mid d$. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd of two numbers: Suppose $0 < a < b$ are integers. Replace $b$ by $b - a$. If $b - a > a$, replace $b - a$ by $b - 2a$. Keep doing this until you obtain $b - qa < a$. Then interchange the roles of the numbers and continue. For instance,
$$
\begin{aligned}
\gcd(120, 38) &= \gcd(82, 38) = \gcd(44, 38) = \gcd(6, 38) \\
&= \gcd(6, 32) = \gcd(6, 26) = \gcd(6, 20) = \gcd(6, 14) = \gcd(6, 8) = \gcd(6, 2) \\
&= \gcd(4, 2) = \gcd(2, 2) = 2.
\end{aligned}
$$
But in the first line we subtracted 38 exactly 3 times, and 3 is the maximum number of times that 38 "fits" into 120. Indeed $4 \times 38 > 120$. In other words, we may summarise the first line by saying that we replace the larger of the two numbers by the remainder of the division of the larger by the smaller. Similarly with the second line. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school).
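The two procedures just described, repeated subtraction and its faster variant using remainders, can be sketched as follows. This is an illustrative sketch rather than code from the notes; the function names are hypothetical, and both functions assume positive integer inputs.

```python
def gcd_by_subtraction(a: int, b: int) -> int:
    """gcd of two positive integers using gcd(a, b) = gcd(a, b - a)."""
    while a != b:
        if a < b:
            b -= a      # replace the larger number by the difference
        else:
            a -= b
    return a


def gcd_by_remainder(a: int, b: int) -> int:
    """Same result, but each run of subtractions becomes one division."""
    while b != 0:
        a, b = b, a % b  # replace the larger number by the remainder
    return a
```

For the example in the text, `gcd_by_subtraction(120, 38)` passes through the pairs listed above, while `gcd_by_remainder(120, 38)` needs only the divisions $120 = 3 \times 38 + 6$, $38 = 6 \times 6 + 2$ and $6 = 3 \times 2 + 0$.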

The theorem of division says the following: Given two positive integers $b$, $a$, with $a > b$, we can find an integer $q$ and an integer $r \in \{0, 1, \ldots, b - 1\}$, such that
$$
a = qb + r.
$$
Its proof is obvious. The long division algorithm is a process for finding the quotient $q$ and the remainder $r$. For example, to divide 3450 by 26 we proceed digit by digit,
$$
34 = 1 \times 26 + 8, \qquad 85 = 3 \times 26 + 7, \qquad 70 = 2 \times 26 + 18,
$$
and obtain $q = 132$, $r = 18$, i.e. $3450 = 132 \times 26 + 18$.

Here are some other facts about the greatest common divisor. First, we have the fact that if $d = \gcd(a, b)$ then there are integers $x$, $y$ such that
$$
d = xa + yb.
$$
To see why this is true, let us follow the previous example again:
$$
\gcd(120, 38) = \gcd(6, 38) = \gcd(6, 2) = 2.
$$
In the process, we used the divisions
$$
120 = 3 \times 38 + 6, \qquad 38 = 6 \times 6 + 2.
$$
Working backwards, we have
$$
2 = 38 - 6 \times 6 = 38 - 6 \times (120 - 3 \times 38) = 19 \times 38 - 6 \times 120.
$$
Thus, $\gcd(38, 120) = 38x + 120y$, with $x = 19$, $y = -6$. The same logic works in general.

Second, we can show that, for integers $a, b \in \mathbb{Z}$, not both zero, we have
$$
\gcd(a, b) = \min\{xa + yb : x, y \in \mathbb{Z},\ xa + yb > 0\}.
$$
To see this, let $d$ be the right hand side. Then $d = xa + yb$, for some $x, y \in \mathbb{Z}$. Suppose that $c \mid a$ and $c \mid b$. Then $c$ divides any linear combination of $a$, $b$; in particular, it divides $d$. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e.

that $d \mid a$ and $d \mid b$. Suppose this is not true, i.e. that $d$ does not divide both numbers, say, e.g., that $d$ does not divide $b$. Since $d > 0$, we can divide $b$ by $d$ to find
$$
b = qd + r, \qquad \text{for some } 1 \le r \le d - 1.
$$
But then $b = q(xa + yb) + r$, yielding
$$
r = (-qx)a + (1 - qy)b.
$$
So $r$ is expressible as a linear combination of $a$ and $b$ and is strictly smaller than $d$ (being a remainder). So $d$ is not minimum as claimed, and this is a contradiction. Hence $r = 0$, so $d$ divides both $a$ and $b$.

The greatest common divisor of 3 integers can now be defined in a similar manner. Finally, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set $A$ is a mathematical notion that establishes a way of pairing elements of $A$. For example, the relation $<$ between real numbers can be thought of as consisting of pairs of numbers $(x, y)$ such that $x$ is smaller than $y$. In general, a relation can be thought of as a set of pairs of elements of the set $A$.

A relation is called reflexive if it contains all pairs $(x, x)$, $x \in A$. The equality relation $=$ is reflexive. The relation $<$ is not reflexive. A relation is called symmetric if it cannot contain a pair $(x, y)$ without containing its symmetric $(y, x)$. The equality relation is symmetric. The relation $<$ is not symmetric. A relation is transitive if, when it contains $(x, y)$ and $(y, z)$, it also contains $(x, z)$. Both $<$ and $=$ are transitive.

If we take $A$ to be the set of students in a classroom, then the relation "$x$ is a friend of $y$" is symmetric, it can be thought of as reflexive (unless we do not want to accept that one may be a friend of oneself), but is not necessarily transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of $x$ contains all $y$ that are related (are equivalent) to $x$. We encountered the equivalence relation "$i$ communicates with $j$" in Markov chain theory.

Functions

A function $f$ from a set $X$ into a set $Y$ is denoted by $f : X \to Y$. The range of $f$ is the set of its values, i.e. the set $\{f(x) : x \in X\}$. A function is called one-to-one (or injective) if two different $x_1, x_2 \in X$ have different values $f(x_1)$, $f(x_2)$. A function is called onto (or surjective) if its range is $Y$. If $B \subset Y$ then $f^{-1}(B)$ consists of all $x \in X$ with $f(x) \in B$. The set of functions from $X$ to $Y$ is denoted by $Y^X$.

Counting

Let $A$, $B$ be finite sets with $n$, $m$ elements, respectively.

The number of functions from $A$ to $B$ (i.e. the cardinality of the set $B^A$) is $m^n$. Indeed, we may think of the elements of $A$ as animals and those of $B$ as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the $m$ available names; so can the second; and the third, and so on. Hence the total number of assignments is
$$
\underbrace{m \times m \times \cdots \times m}_{n \text{ times}} = m^n .
$$

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need at least as many names as animals: $n \le m$. Each assignment of this type is a one-to-one function from $A$ to $B$. Since the name of the first animal can be chosen in $m$ ways, that of the second in $m - 1$ ways, and so on, the total number of one-to-one functions from $A$ to $B$ is
$$
(m)_n := m(m-1)(m-2) \cdots (m-n+1).
$$
In the special case $m = n$, we denote $(n)_n$ also by $n!$; in other words, $n!$ stands for the product of the first $n$ natural numbers:
$$
n! = 1 \times 2 \times 3 \times \cdots \times n.
$$
Incidentally, a one-to-one function from the set $A$ into $A$ is called a permutation of $A$. Thus, $n!$ is the total number of permutations of a set of $n$ elements.

Any set $A$ contains several subsets. If $A$ has $n$ elements, the number of subsets with $k$ elements each can be found by designating which of the $n$ elements of $A$ will be chosen. The first element can be chosen in $n$ ways, the second in $n - 1$, and so on, the $k$-th in $n - k + 1$ ways. So we have
$$
(n)_k = n(n-1)(n-2) \cdots (n-k+1)
$$
ways to pick $k$ elements if we care about their order. If we don't, then we divide by the number $k!$ of permutations of the $k$ elements, concluding that the number of subsets with $k$ elements of a set with $n$ elements equals
$$
\frac{(n)_k}{k!} =: \binom{n}{k},
$$
where the symbol $\binom{n}{k}$ is merely a name for the number computed on the left side of this equation. Clearly,
$$
\binom{n}{k} = \binom{n}{n-k} = \frac{n!}{k!\,(n-k)!} .
$$
Since there is only one subset with no elements (it is the empty set $\emptyset$), we have
$$
\binom{n}{0} = 1 .
$$
The total number of subsets of $A$ (regardless of their number of elements) is thus equal to
$$
\sum_{k=0}^{n} \binom{n}{k} = (1 + 1)^n = 2^n,
$$
and this follows from the binomial theorem. We thus observe that there are as many subsets of $A$ as there are elements in the set $\{0, 1\}^n$ of sequences of length $n$ with elements 0 or 1.
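The counting formulas above can be confirmed by brute-force enumeration for small sets. The following is a small sketch with hypothetical function names; it represents a function from $A$ to $B$ as a tuple of length $n$ with entries in `range(m)`, which is an illustrative choice and not notation from the notes.

```python
from itertools import combinations, product
from math import comb, factorial, perm


def count_functions(n: int, m: int) -> int:
    """Number of assignments of one of m names to each of n animals."""
    return sum(1 for _ in product(range(m), repeat=n))


def count_injections(n: int, m: int) -> int:
    """Number of assignments in which no name is used twice."""
    return sum(1 for f in product(range(m), repeat=n) if len(set(f)) == n)


def count_subsets(n: int, k: int) -> int:
    """Number of k-element subsets of an n-element set."""
    return sum(1 for _ in combinations(range(n), k))


def count_all_subsets(n: int) -> int:
    """Total number of subsets, summed over all sizes k = 0, ..., n."""
    return sum(count_subsets(n, k) for k in range(n + 1))
```

With $n = 3$ animals and $m = 4$ names, the enumeration can be compared against $m^n$ and $(m)_n$ (available in Python as `math.perm(m, n)`), and the subset counts against `math.comb(n, k)` and $2^n$.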

We shall let $\binom{n}{k} = 0$ if $n < k$. Also, by convention, if $x$ is a real number, we define
$$
\binom{x}{m} = \frac{x(x-1) \cdots (x-m+1)}{m!} .
$$
If $y$ is not an integer then, by convention, we let $\binom{n}{y} = 0$.

The Pascal triangle relation states that
$$
\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1} .
$$
Indeed, in choosing $k$ animals from a set of $n + 1$ ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose $k$ animals from the remaining $n$ ones and this can be done in $\binom{n}{k}$ ways. If the pig is included, then there are $k - 1$ remaining animals to be chosen and this can be done in $\binom{n}{k-1}$ ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick $k$ animals from $n + 1$ animals.

Maximum and minimum

The minimum of two numbers $a$, $b$ is denoted by
$$
a \wedge b = \min(a, b).
$$
Thus $a \wedge b = a$ if $a \le b$, or $= b$ if $b \le a$. The maximum is denoted by
$$
a \vee b = \max(a, b).
$$
The notation extends to more than two numbers. For example, $a \vee b \vee c$ refers to the maximum of the three numbers $a$, $b$, $c$, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that
$$
-(a \vee b) = (-a) \wedge (-b)
$$
and
$$
(a \wedge b) + c = (a + c) \wedge (b + c).
$$
In particular, we use the following notations:
$$
a^+ = \max(a, 0), \qquad a^- = -\min(a, 0).
$$
Both quantities are nonnegative. $a^+$ is called the positive part of $a$, and $a^-$ is called the negative part of $a$. Note that it is always true that
$$
a = a^+ - a^- .
$$
The absolute value of $a$ is defined by
$$
|a| = a^+ + a^- .
$$
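The identities of this section are easy to confirm on a few integers. The helper names `pos_part` and `neg_part` below are hypothetical; the sketch simply restates the definitions of $a^+$ and $a^-$ in code.

```python
def pos_part(a):
    """Positive part a+ = max(a, 0)."""
    return max(a, 0)


def neg_part(a):
    """Negative part a- = -min(a, 0); note that it is nonnegative."""
    return -min(a, 0)


def check_identities(a, b, c):
    """Return True if the identities stated in the text hold for a, b, c."""
    return (
        a == pos_part(a) - neg_part(a)               # a = a+ - a-
        and abs(a) == pos_part(a) + neg_part(a)      # |a| = a+ + a-
        and -max(a, b) == min(-a, -b)                # -(a v b) = (-a) ^ (-b)
        and min(a, b) + c == min(a + c, b + c)       # (a ^ b) + c = (a+c) ^ (b+c)
    )
```

For instance, `pos_part(-3)` is 0 and `neg_part(-3)` is 3, so that $-3 = 0 - 3$ and $|-3| = 0 + 3$.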

Indicator function

$1_A$ stands for the indicator function of a set $A$, meaning a function that takes value 1 on $A$ and 0 on the complement $A^c$ of $A$. We also use the notation $1(\text{clause})$ for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that
$$
1_{A \cap B} = 1_A 1_B, \qquad 1 - 1_A = 1_{A^c} .
$$
Several quantities of interest are expressed in terms of indicators. For instance, if $A_1, A_2, \ldots$ are sets or clauses then the number $\sum_{n \ge 1} 1_{A_n}$ is the number of true clauses. Thus, if I go to a bank $N$ times and $A_i$ is the clause "I am robbed the $i$-th time" then $\sum_{i=1}^{N} 1_{A_i}$ is the total number of times I am robbed.

It is worth remembering that, if $A$ is an event then $1_A$ is a random variable. It takes value 1 with probability $P(A)$ and 0 with probability $P(A^c)$. Therefore its expectation is
$$
E\, 1_A = 1 \times P(A) + 0 \times P(A^c) = P(A).
$$
An example of its application: Since
$$
1_{A \cup B} = 1 - 1_{(A \cup B)^c} = 1 - 1_{A^c \cap B^c} = 1 - 1_{A^c} 1_{B^c} = 1 - (1 - 1_A)(1 - 1_B) = 1_A + 1_B - 1_A 1_B = 1_A + 1_B - 1_{A \cap B},
$$
we have
$$
E[1_{A \cup B}] = E[1_A] + E[1_B] - E[1_{A \cap B}],
$$
which means
$$
P(A \cup B) = P(A) + P(B) - P(A \cap B).
$$
Another example: Let $X$ be a nonnegative random variable with values in $\mathbb{N} = \{1, 2, \ldots\}$. Then
$$
X = \sum_{n=1}^{X} 1
$$
(the sum of $X$ ones is $X$). So
$$
X = \sum_{n \ge 1} 1(X \ge n).
$$
So
$$
E[X] = \sum_{n \ge 1} E\, 1(X \ge n) = \sum_{n \ge 1} P(X \ge n).
$$

Matrices

Let $\mathbb{R}^d$ be the set of vectors with $d$ components. A linear function $\varphi$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ is such that
$$
\varphi(\alpha x + \beta y) = \alpha \varphi(x) + \beta \varphi(y),
$$

x1 . . Let y i := n aij xj . . But the ϕ(uj ) are vectors in Rm .for all x. . . The relation between the column representation of y and the column representation of x is written as 1 1 x y a11 · · · a1n . xn the given basis. Then. The column . . . where ∈ R. . vm . vm in Rn we can express each ϕ(uj ) as a linear combination of the basis vectors: m ϕ(uj ) = i=1 aij vi . j=1 i = 1. xn n 1 y := ϕ(x) = j=1 xj ϕ(uj ). un be a basis for Rn . . . = ··· ··· ··· . we have ϕ(x) = m n vi i=1 j=1 aij xj . x . y ∈ Rn . ϕ are linear functions: It is a m × n matrix. Then any x ∈ Rn can be uniquely written as n x= j=1 x j uj . . . am1 · · · amn xn ym Rn − Rm − Rℓ → → ψ ϕ . is a column representation of x with respect to . m. . Let u1 . . is a column representation of y := ϕ(x) with respect to the given basis . The matrix A of ϕ with respect to the choice of the two basis is deﬁned by a11 · · · a1n A := · · · · · · · · · am1 · · · amn y1 Then ϕ ◦ ψ is a linear function: Rn − → Rℓ − 112 ϕ◦ψ . Suppose that ψ. The column . . ym v1 . . . and all α. . . . . . because ϕ is linear. . . β ∈ R. . So if we choose a basis v1 . Combining.

Fix bases on $\mathbb{R}^n$, $\mathbb{R}^m$ and $\mathbb{R}^\ell$ and let $A$, $B$ be the matrix representations of $\varphi$, $\psi$, respectively, with respect to these bases. Then the matrix representation of $\varphi \circ \psi$ is $AB$, where
$$
(AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}, \qquad 1 \le i \le \ell, \quad 1 \le j \le n.
$$
So $A$ is an $\ell \times m$ matrix, $B$ is an $m \times n$ matrix, and $AB$ is an $\ell \times n$ matrix.

A matrix $A$ is called square if it is an $n \times n$ matrix. If we fix a basis in $\mathbb{R}^n$, a square matrix represents a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^n$, i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in $\mathbb{R}^d$ consists of the vectors
$$
e_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad
e_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad
e_d = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} .
$$
In other words, $e_i^j = 1(i = j)$.

Supremum and infimum

When we have a set of numbers $I$ (for instance, a sequence $a_1, a_2, \ldots$ of numbers) then the minimum of $I$ is denoted by $\min I$ and refers to a number $c \in I$ such that $c \le x$ for all the numbers $x \in I$. Similarly, we define $\max I$. This minimum (or maximum) may not exist. For instance $I = \{1, 1/2, 1/3, \cdots\}$ has no minimum. In these cases we use the notation $\inf I$, standing for the "infimum" of $I$, defined as the maximum of all the numbers $y$ with the property $y \le x$ for all $x \in I$. A property of real numbers says that if the set $I$ is bounded from below then this maximum of all lower bounds exists, and it is this that we call the infimum. Likewise, we define the supremum $\sup I$ to be the minimum of all upper bounds. If $I$ has no lower bound then $\inf I = -\infty$. If $I$ has no upper bound then $\sup I = +\infty$. If $I$ is empty then $\inf \emptyset = +\infty$, $\sup \emptyset = -\infty$.

$\inf I$ is characterised by: $\inf I \le x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \le \varepsilon + \inf I$. Similarly, $\sup I$ is characterised by: $\sup I \ge x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \ge \sup I - \varepsilon$.

Limsup and liminf

If $x_1, x_2, \ldots$ is a sequence of numbers then
$$
\inf\{x_1, x_2, \ldots\} \le \inf\{x_2, x_3, \ldots\} \le \inf\{x_3, x_4, \ldots\} \le \cdots
$$
and, since any increasing sequence has a limit (possibly $+\infty$), we take this limit and call it $\liminf x_n$ (limit inferior) of the sequence. So $\liminf x_n = \lim_{n \to \infty} \inf\{x_n, x_{n+1}, x_{n+2}, \ldots\}$. Similarly, we define $\limsup x_n$ as the limit of the decreasing sequence $\sup\{x_n, x_{n+1}, \ldots\}$. We note that $\liminf(-x_n) = -\limsup x_n$. The sequence $x_n$ has a limit $\ell$ if and only if $\limsup x_n = \liminf x_n = \ell$.
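As a worked illustration of these definitions (an example added here, not taken from the notes), consider a sequence that oscillates between values near $-1$ and values near $+1$. The tail infima increase, the tail suprema decrease, and the two limits differ, so the sequence has no limit.

```latex
\[
x_n = (-1)^n\Bigl(1 + \frac{1}{n}\Bigr), \qquad n \ge 1 .
\]
% Tail infimum: attained at the first odd index k_n >= n.
\[
\inf\{x_n, x_{n+1}, \ldots\} = -\Bigl(1 + \frac{1}{k_n}\Bigr) \uparrow -1,
\qquad k_n := \min\{k \ge n : k \text{ odd}\},
\]
% Tail supremum: attained at the first even index m_n >= n.
\[
\sup\{x_n, x_{n+1}, \ldots\} = 1 + \frac{1}{m_n} \downarrow 1,
\qquad m_n := \min\{m \ge n : m \text{ even}\},
\]
% The two limits differ, so (x_n) does not converge.
\[
\liminf_{n\to\infty} x_n = -1 \;<\; 1 = \limsup_{n\to\infty} x_n .
\]
```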

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). (Of course, by not requiring this, I do sometimes "cheat" from a strict mathematical point of view, but not too much.) But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides $P(\Omega) = 1$) there is essentially one axiom, namely: If $A_1, A_2, \ldots$ are mutually disjoint events, then
$$
P\Bigl(\bigcup_{n \ge 1} A_n\Bigr) = \sum_{n \ge 1} P(A_n).
$$
Everything else, in these notes, and everywhere else, follows from this axiom. That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.

If $A$, $B$ are disjoint events then $P(A \cup B) = P(A) + P(B)$. This can be generalised to any finite number of events. If the events $A$, $B$ are not disjoint then $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. This is the inclusion-exclusion formula for two events.

If $A$ is an event then $1_A$ or $1(A)$ is the indicator random variable, i.e. the function that takes value 1 on $A$ and 0 on $A^c$. Therefore, $E\, 1_A = P(A)$. Note that $1_A 1_B = 1_{A \cap B}$, and $1_{A^c} = 1 - 1_A$.

Usually, showing that $P(A) \le P(B)$ is more of a logical problem than a numerical one: we need to show that $A$ implies $B$ (i.e. $A$ is a subset of $B$). Similarly, to show that $EX \le EY$, we often need to argue that $X \le Y$. For example, Markov's inequality for a positive random variable $X$ states that $P(X > x) \le EX/x$. This follows from the logical statement $x\, 1(X > x) \le X$, which holds since $X$ is positive.

In Markov chain theory, we work a lot with conditional probabilities. Recall that $P(B|A)$ is defined by $P(A \cap B)/P(A)$ and is a well-motivated definition. Often, to compute $P(A \cap B)$ we read the definition as $P(A \cap B) = P(A) P(B|A)$. This is generalisable for a number of events:
$$
P(A_1 \cap A_2 \cap A_3) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 \cap A_2).
$$
If we interchange the roles of $A$, $B$ in the definition, we get
$$
P(A \cap B) = P(A) P(B|A) = P(B) P(A|B)
$$
and this is called Bayes' theorem. Quite often, to compute $P(A)$, it is useful to find disjoint events $B_1, B_2, \ldots$ such that $\cup_n B_n = \Omega$, in which case we write
$$
P(A) = \sum_n P(A \cap B_n) = \sum_n P(A | B_n) P(B_n).
$$

If $A_n$ are events with $P(A_n) = 0$ then $P(\cup_n A_n) = 0$. Indeed,
$$
P\Bigl(\bigcup_n A_n\Bigr) \le \sum_n P(A_n).
$$
Similarly, if $A_n$ are events with $P(A_n) = 1$ then $P(\cap_n A_n) = 1$.

If the events $A_1, A_2, \ldots$ are increasing with $A = \cup_n A_n$ then $P(A) = \lim_{n \to \infty} P(A_n)$. Similarly, if the events $A_1, A_2, \ldots$ are decreasing with $A = \cap_n A_n$ then $P(A) = \lim_{n \to \infty} P(A_n)$.

An application of this is:
$$
P(X < \infty) = \lim_{x \to \infty} P(X \le x).
$$
In Markov chain theory, we encounter situations where $P(X = \infty)$ may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable $X$ may be defined by
$$
EX = \int_0^\infty P(X > x)\, dx.
$$
The formula holds for any positive random variable. For a random variable which takes positive and negative values, we define $X^+ = \max(X, 0)$, $X^- = -\min(X, 0)$, which are both nonnegative, and then define
$$
EX = EX^+ - EX^-,
$$
which is motivated from the algebraic identity $X = X^+ - X^-$. This definition makes sense only if $EX^+ < \infty$ or if $EX^- < \infty$. In case the random variable is discrete, the formula reduces to the known one, viz.,
$$
EX = \sum_x x\, P(X = x).
$$
In case the random variable is absolutely continuous with density $f$, we have $EX = \int_{-\infty}^{\infty} x f(x)\, dx$. For a random variable that takes values in $\mathbb{N}$ we can write
$$
EX = \sum_{n=1}^{\infty} P(X \ge n).
$$
This is due to the fact that $X = \sum_{n=1}^{\infty} 1(X \ge n)$ and $P(X \ge n) = E\, 1(X \ge n)$.

Generating functions

A generating function of a sequence $a_0, a_1, \ldots$ is an infinite polynomial having the $a_i$ as coefficients:
$$
G(s) = a_0 + a_1 s + a_2 s^2 + \cdots .
$$
We do not a priori know that this exists for any $s$ other than, obviously, for $s = 0$, for which $G(0) = a_0$. If it exists for some $s_1$ then it exists uniformly for all $s$ with $|s| < |s_1|$ (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence $a_n = \rho^n$, $n \ge 0$, is
$$
\sum_{n=0}^{\infty} \rho^n s^n = \frac{1}{1 - \rho s}, \qquad (39)
$$
as long as $|\rho s| < 1$, i.e. $|s| < 1/|\rho|$.

Note that $|G(s)| \le \sum_n |a_n| |s|^n$ and this is $\le \sum_n |a_n|$ if $s \in [-1, 1]$. So if $\sum_n |a_n| < \infty$ then $G(s)$ exists for all $s \in [-1, 1]$, making it a useful object when $a_0, a_1, \ldots$ is a probability distribution on $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$ because then $\sum_n a_n = 1$.

Suppose now that $X$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$. The generating function of the sequence $p_n = P(X = n)$, $n = 0, 1, \ldots$, is called the probability generating function of $X$:
$$
\varphi(s) = E s^X = \sum_{n=0}^{\infty} s^n P(X = n).
$$

This is defined for all $-1 < s < 1$. Note that
$$
\sum_{n=0}^{\infty} p_n \le 1,
$$
and
$$
p_\infty := P(X = \infty) = 1 - \sum_{n=0}^{\infty} p_n .
$$
We do allow $X$ to, possibly, take the value $+\infty$. The term $s^\infty p_\infty$ could formally appear in the sum defining $\varphi(s)$ but, since $|s| < 1$, we have $s^\infty = 0$. Note that $\varphi(0) = P(X = 0)$, while
$$
\varphi(1) = \sum_{n=0}^{\infty} P(X = n) = P(X < +\infty).
$$
We can recover the probabilities $p_n$ from the formula
$$
p_n = \frac{\varphi^{(n)}(0)}{n!},
$$
as follows from Taylor's theorem, and this is why $\varphi$ is called the probability generating function. Also, we can recover the expectation of $X$ from
$$
EX = \lim_{s \to 1} \varphi'(s),
$$
something that can be formally seen from
$$
\frac{d}{ds} E s^X = E \frac{d}{ds} s^X = E X s^{X-1},
$$
and by letting $s = 1$.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
$$
x_{n+2} = x_{n+1} + x_n, \qquad n \ge 0,
$$
with $x_0 = x_1 = 1$. If we let
$$
X(s) = \sum_{n \ge 0} x_n s^n
$$
be the generating function of $(x_n, n \ge 0)$ then the generating function of $(x_{n+1}, n \ge 0)$ is
$$
\sum_{n \ge 0} x_{n+1} s^n = s^{-1} (X(s) - x_0)
$$
and the generating function of $(x_{n+2}, n \ge 0)$ is
$$
\sum_{n \ge 0} x_{n+2} s^n = s^{-2} (X(s) - x_0 - s x_1).
$$

From the recursion $x_{n+2} = x_{n+1} + x_n$ we have (and this is where linearity is used) that the generating function of $(x_{n+2}, n \ge 0)$ equals the sum of the generating functions of $(x_{n+1}, n \ge 0)$ and $(x_n, n \ge 0)$, namely,
$$
s^{-2} (X(s) - x_0 - s x_1) = s^{-1} (X(s) - x_0) + X(s).
$$
Since $x_0 = x_1 = 1$, we can solve for $X(s)$ and find
$$
X(s) = \frac{-1}{s^2 + s - 1} .
$$
But the polynomial $s^2 + s - 1$ has two roots:
$$
a = (\sqrt{5} - 1)/2, \qquad b = -(\sqrt{5} + 1)/2.
$$
Hence $s^2 + s - 1 = (s - a)(s - b)$, and so
$$
X(s) = \frac{1}{a - b} \Bigl( \frac{1}{s - b} - \frac{1}{s - a} \Bigr) = \frac{1}{b - a} \Bigl( \frac{1}{b - s} - \frac{1}{a - s} \Bigr).
$$
But (39) tells us that $\rho^n$ has generating function $\frac{1}{1 - \rho s} = \frac{1/\rho}{(1/\rho) - s}$. Thus $\frac{1}{(1/\rho) - s}$ is the generating function of $\rho^{n+1}$. This tells us that $\frac{1}{b - s}$ is the generating function of $(1/b)^{n+1}$ and that $\frac{1}{a - s}$ is the generating function of $(1/a)^{n+1}$. Hence $X(s)$ is the generating function of
$$
x_n = \frac{1}{b - a} \bigl( (1/b)^{n+1} - (1/a)^{n+1} \bigr).
$$
Noting that $ab = -1$, we can also write this as
$$
x_n = \frac{(-a)^{n+1} - (-b)^{n+1}}{b - a} .
$$
This is the solution to the recursion, i.e. the $n$-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable $X$ with values in $\mathbb{Z}$ or $\mathbb{R}$ or $\mathbb{R}^n$ or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form $P(X \in B)$ where $B$ is a set of values of $X$. So, for example, if $X$ is a random variable with values in $\mathbb{R}$, then the so-called distribution function, i.e. the function $P(X \le x)$, $x \in \mathbb{R}$, is such an example. But I could very well use the function $P(X > x)$,

or the 2-parameter function $P(a < X < b)$. If $X$ is absolutely continuous, I could use its density
$$
f(x) = \frac{d}{dx} P(X \le x).
$$
All these objects can be referred to as "distribution" or "law" of $X$. Even the generating function $E s^X$ of a $\mathbb{Z}_+$-valued random variable $X$ is an example, for it does allow us to specify the probabilities $P(X = n)$, $n \in \mathbb{Z}_+$.

Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state $i$. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute $n! = 1 \times 2 \times \cdots \times n$ for large $n$ then, since the number is so big, it's better to use logarithms:
$$
n! = e^{\log n!} = \exp \sum_{k=1}^{n} \log k.
$$
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since $\log x$ does not vary too much between two successive positive integers $k$ and $k + 1$ it follows that
$$
\int_k^{k+1} \log x \, dx \approx \log k.
$$
Hence
$$
\sum_{k=1}^{n} \log k \approx \int_1^n \log x \, dx.
$$
Since $\frac{d}{dx}(x \log x - x) = \log x$, it follows that
$$
\int_1^n \log x \, dx = n \log n - n + 1 \approx n \log n - n.
$$
Hence we have
$$
n! \approx \exp(n \log n - n) = n^n e^{-n} .
$$
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) We do not know in what sense the symbol $\approx$ should be interpreted. (b) The approximation is not good when we consider ratios of products. For example (try it!) the rough Stirling approximation gives that the probability that, in $2n$ fair coin tosses

we have exactly $n$ heads equals $\binom{2n}{n} 2^{-2n} \approx 1$, and this is terrible, because we know that the probability goes to zero as $n \to \infty$. To remedy this, we shall prove Stirling's approximation:
$$
n! \sim n^n e^{-n} \sqrt{2 \pi n},
$$
where the symbol $a_n \sim b_n$ means that $a_n / b_n \to 1$, as $n \to \infty$.

The geometric distribution

We say that the random variable $X$ with values in $\mathbb{Z}_+$ is geometric if it has the memoryless property:
$$
P(X - n \ge m \mid X \ge n) = P(X \ge m), \qquad 0 \le n \le m.
$$
If we think of $X$ as the time of occurrence of a random phenomenon, then $X$ is such that knowledge that $X$ has exceeded $n$ does not make the distribution of the remaining time $X - n$ any different from that of $X$. We have already seen a geometric random variable. That was the random variable $J_0 = \sum_{n=1}^{\infty} 1(S_n = 0)$, defined in (34). It was shown in Theorem 38 that $J_0$ satisfies the memoryless property. In fact it was shown there that $P(J_0 \ge k) = P(J_0 \ge 1)^k$, a geometric function of $k$. The same process shows that a geometric random variable has a distribution of the form
$$
P(X \ge n) = \lambda^n, \qquad n \in \mathbb{Z}_+,
$$
where $\lambda = P(X \ge 1)$. We baptise this the geometric($\lambda$) distribution.

Markov's inequality

Arguably this is the most useful inequality in Probability, and is based on the concept of monotonicity. Surely, you have seen it before, so this little section is a reminder. Consider an increasing and nonnegative function $g : \mathbb{R} \to \mathbb{R}_+$. We can express both properties of $g$ neatly by writing: for all $x, y \in \mathbb{R}$,
$$
g(y) \ge g(x)\, 1(y \ge x).
$$
Indeed, if $y \ge x$ then $1(y \ge x) = 1$ and the display tells us $g(y) \ge g(x)$, expressing monotonicity. And if $y < x$ then $1(y \ge x) = 0$ and the display reads $g(y) \ge 0$, expressing non-negativity. Let $X$ be a random variable. Then, for all $x$,
$$
g(X) \ge g(x)\, 1(X \ge x),
$$
and so, by taking expectations of both sides (expectation is an increasing operator),
$$
E g(X) \ge g(x)\, P(X \ge x).
$$
This completely trivial thought gives the very useful Markov's inequality.
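Markov's inequality can be checked numerically on the geometric($\lambda$) distribution of the previous section. The sketch below is illustrative only. It assumes the choice $g(x) = \max(x, 0)$, so that the inequality reads $P(X \ge x) \le EX / x$ for $x > 0$, and it computes $EX$ from the tail-sum formula $EX = \sum_{n \ge 1} P(X \ge n)$, truncated after a fixed number of terms. All function names are hypothetical.

```python
def geometric_tail(lam: float, n: int) -> float:
    """P(X >= n) = lam**n for a geometric(lam) random variable, n in Z+."""
    return lam ** n


def geometric_mean(lam: float, terms: int = 2000) -> float:
    """EX = sum over n >= 1 of P(X >= n), truncated after `terms` terms."""
    return sum(geometric_tail(lam, n) for n in range(1, terms + 1))


def markov_bound(lam: float, x: int) -> float:
    """Right-hand side EX / x of Markov's inequality with g(x) = max(x, 0)."""
    return geometric_mean(lam) / x


def markov_holds(lam: float, max_x: int = 60) -> bool:
    """Check P(X >= x) <= EX / x for x = 1, ..., max_x."""
    return all(
        geometric_tail(lam, x) <= markov_bound(lam, x)
        for x in range(1, max_x + 1)
    )
```

The tail sum is a geometric series, so `geometric_mean(lam)` should agree with the closed form $\lambda / (1 - \lambda)$ up to rounding. The bound is far from tight here: the true tail decays geometrically in $x$, whereas the bound decays only like $1/x$.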

Bibliography

1. Brémaud, P (1997) An Introduction to Probabilistic Modeling. Springer.
2. Brémaud, P (1999) Markov Chains. Springer.
3. Chung, K L and Aitsahlia, F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller, W (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead, C M and Snell, J L (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
6. Norris, J R (1998) Markov Chains. Cambridge University Press.
7. Parry, W (1981) Topics in Ergodic Theory. Cambridge University Press.

