This action might not be possible to undo. Are you sure you want to continue?

**MARKOV CHAINS AND RANDOM WALKS
**

Takis Konstantopoulos∗ Autumn 2009

∗

c Takis Konstantopoulos 2006-2009

Contents

1 Introduction 2 Examples of Markov chains 2.1 2.2 2.3 2.4 2.5 A mouse in a cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bank account . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) in a city . . . . . . . . . . . . . . . . . Actuarial chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 3 4 4 5 6 6 6 7 9

3 The Markov property 3.1 3.2 3.3 Deﬁnition of the Markov property . . . . . . . . . . . . . . . . . . . . . . . . Transition probability and initial distribution . . . . . . . . . . . . . . . . . . Time-homogeneous chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 Stationarity 4.1

Finding a stationary distribution . . . . . . . . . . . . . . . . . . . . . . . . . 13 15

5 Topological structure 5.1 5.2 5.3 5.4

The graph of a chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “leads to” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “communicates with” . . . . . . . . . . . . . . . . . . . . . . . . 16 Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 20 24 27 28 29

6 Hitting times and ﬁrst-step analysis 7 Gambler’s ruin 8 Stopping times and the strong Markov property 9 Regenerative structure of a Markov chain 10 Recurrence and transience

10.1 First hitting time decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 31 10.2 Communicating classes and recurrence/transience . . . . . . . . . . . . . . . . 32 11 Positive recurrence 33

i

12 Law (=Theorem) of Large Numbers in Probability Theory 13 Law of Large Numbers for Markov chains 13.1 Functions of excursions

34 34

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

13.2 Average reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 14 Construction of a stationary distribution 15 Positive recurrence is a class property 16 Uniqueness of the stationary distribution 17 Structure of a stationary distribution 18 Coupling and stability 39 41 43 44 45

18.1 Deﬁnition of stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 18.2 The fundamental stability theorem for Markov chains . . . . . . . . . . . . . 47 18.3 Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 18.4 Proof of the fundamental stability theorem . . . . . . . . . . . . . . . . . . . 49 18.5 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 18.6 Periodic chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 19 Limiting theory for Markov chains* 53

19.1 Limiting probabilities for null recurrent states . . . . . . . . . . . . . . . . . . 53 19.2 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 20 Ergodic theorems 21 Finite chains 22 Time reversibility 23 New Markov chains from old ones* 56 57 58 62

23.1 Subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.2 Watching a Markov chain when it visits a set . . . . . . . . . . . . . . . . . . 62 23.3 Subordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.4 Deterministic function of a Markov chain . . . . . . . . . . . . . . . . . . . . 63 24 Applications 65

ii

. . . . .24. . . . . . . . . . . . .2 First return to the origin . . . . . . . 79 26. . . . . . . . . . . . 80 27 The simple symmetric random walk in dimension 1: simple probability calculations 83 27. .5 First return to zero . . . . . . . .3 Some conditional probabilities . . . . . . . . . . . . . . . . .1 First hitting time analysis . . . . . . . 92 31 Recurrence properties of simple random walks 94 31. .4 The Wright-Fisher model: an example from biology . . . . . . . . . . . . . . . . . . . 68 24. . . . . . . . . 70 24. . . . . . . . . . . . . . . . . . . . .2 The ballot theorem . . . . . . . . .5 A storage or queueing application . 84 27. . . . . . . . . . . . .1 Paths that start and end at speciﬁc points .1 Distribution of hitting time and maximum . . . . . . . . . . . . . . 86 28 The reﬂection principle for a simple random walk in dimension 1 86 28. . . . . . . . . . . . . . . . . . . . . . . 65 24. . . . . . . . .3 The distribution of the ﬁrst return to the origin . . . . . . . .2 Paths that start and end at speciﬁc points and do not touch zero at all .1 Asymptotic distribution of the maximum . . . . . . . . . . . 95 31. . . . . . . . . . . . . . . . . . .4 Some remarkable identities . . . . . . . . . . . . . . . . 96 32 Duality 33 Amount of time that a SSRW is positive* 98 99 iii . . . . . . 73 25 Random walk 26 The simple symmetric random walk in dimension 1: path counting 76 78 26. . . . . . . . 71 24. . . . .1 Branching processes: a population growth application .3 Total number of visits to a state . . . . . 87 29 Urns and the ballot theorem 30 The asymmetric simple random walk in dimension 1 88 89 30. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 27. . . . . . . . . . .2 The ALOHA protocol: a computer communications application . . . . . . . . 92 30. . . . . . 83 27. . . . . . . . .1 Distribution after n steps . . . . . . . . . . . . . . 85 27. . . . . . . . . . . . . . . . . . . . 90 30. . . . . . . . . . . . . 95 31. . . . . . . . . . .3 PageRank (trademark of Google): A World Wide Web application . . . . . . . . . . . . . . . . .2 Return probabilities . . .

34 Simple symmetric random walk in two dimensions 35 Skorokhod embedding* 102 103 iv .

K. Of course. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. I have a little mathematical appendix. The only new thing here is that I give emphasis to probabilistic methods as soon as possible.. I plan to add a few more sections: – On algorithms and simulation – On criteria for positive recurrence – On doing stuﬀ with matrices – On ﬁnance applications – On a few more delicate computations for simple random walks – Reshape the appendix – Add more examples These are things to come. There are. They form the core material for an undergraduate course on Markov chains in discrete time. I tried to put all terms in blue small capitals whenever they are ﬁrst encountered. dozens of good books on the topic. v . At the end. I introduce stationarity before even talking about state classiﬁcation.. “important” formulae are placed inside a coloured box. Also.Preface These notes were written based on a number of courses I taught over the years in the U.S. A few starred sections should be considered “advanced” and can be omitted at ﬁrst reading. I have tried both and prefer the current ordering. There notes are still incomplete. Greece and the U.. one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. The notes should be readable by someone who has taken a course in introductory (non-measuretheoretic) probability. The second one is speciﬁcally for simple random walks. The ﬁrst part is about Markov chains and some applications. Also. of course.

depend only on the pair Xn .PART I: MARKOV CHAINS 1 Introduction A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past aﬀects the future only through the present. F = f (x. and its ˙ dt 2x acceleration is x = d 2 . the future pairs Xn+1 . for any t. we end up with two numbers. Formally. . Xn+2 . . b). . where xn ≥ yn . x(t)). y0 = min(a. X0 . initialised by x0 = max(a. The state of the particle at time t is described by the pair ˙ X(t) = (x(t). . yn ). that gcd(a. . more generally. In the module following this one you will study continuous-time chains. the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke’s law). An example from Physics is provided by Newton’s laws of motion. a totally ordered set. In Mathematics.e. It is not hard to see that. b). has the property that. the real numbers). . The sequence of pairs thus produced. there is a function F that takes the pair Xn = (xn . s ≥ t) ˙ 1 . yn ) into Xn+1 = (xn+1 . x Here. For instance. yn+1 ). b) = gcd(r. and evolving as xn+1 = yn yn+1 = rem(xn . b). x = x(t) denotes the position of the particle at time t. continuous (i. and the other the greatest common divisor. b) is the remainder of the division of a by b.e. yn ). for any n. Newton’s second law says that the acceleration is proportional to the force: m¨ = F. . X2 . An example from Arithmetic of a dynamical system in discrete time is the one which ﬁnds the greatest common divisor gcd(a. In other words. one of which is 0. we let our “state” be a pair of integers (xn . The “time” can be discrete (i. the integers). By repeating the procedure. Recall (from elementary school maths). a phenomenon which evolves with time in a way that only the present aﬀects the future is called a dynamical system. or. where r = rem(a. 2. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. x). . b) between two positive integers a and b. X1 . Suppose that a particle of mass m is moving under the action of a force F in a straight line. . i. n = 0.e. Its velocity is x = dx . Assume that the force F depends only on the particle’s position ¨ dt and velocity. the future trajectory (X(s). 1.

000 columns or computing the integral of a complicated function of a large number of variables. an eﬀective way for dealing with large problems is via the use of randomness. We can summarise this information by the transition diagram: 1 Stochastic means random. An example of a random algorithm is the Monte Carlo method for the b approximate computation of the integral a f (x)dx of a complicated function f : suppose that f is positive and bounded below B. it moves from 2 to 1 with probability β = 0. A mouse lives in the cage. A scientist’s job is to record the position of the mouse every minute. past states are not needed.99. . Other examples of dynamical systems are the algorithms run. i. Each time count 1 if Yn < f (Xn ) and 0 otherwise. Some of these algorithms are deterministic. containing fresh and stinky cheese. The generalisation takes us into the realm of Randomness. but some are stochastic. 2.can be completely speciﬁed by the current state X(t).1 They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large” such as ﬁnding the determinant of a matrix with 10. via the use of the tools of Probability. 1 and 2. say. respectively. . Repeat with (Xn . regardless of where it was at earlier times. Count the number of 1’s and divide by N . Similarly.05. The resulting ratio is an approximation b of a f (x)dx. We will develop a theory that tells us how to describe. Try this on the computer! 2 .1 Examples of Markov chains A mouse in a cage A mouse is in a cage with two cells. In addition. for steps n = 1.000 rows and 10. In other words. 2 2. Choose a pair (X0 . 1 2 Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0. by the software in your computer. we will see what kind of questions we can ask and what kind of answers we can hope for. Indeed.. 0 ≤ Y0 ≤ B. a ≤ X0 ≤ b. at time n + 1 it is either still in 1 or has moved to 2. We will also see why they are useful and discuss how they are applied. it does so. We will be dealing with random variables. Yn ). When the mouse is in cell 1 at time n (minutes) then. Markov chains generalise this concept of dependence of the future only on the present. uniformly.e. analyse. Perform this a large number N of times. and use those mathematical models which are called Markov chains. instead of deterministic objects. . Y0 ) of random numbers.

to move from cell 1 to cell 2? 2. are random variables with distributions FD .99 0.05 ≈ 20 minutes. 2. are independent random variables. . answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell. Assume that Dn .05 0. We easily see that if y > 0 then ∞ x. px. respectively. Dn − Sn = y − x) P (Sn = z. the ﬁrst time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.2 Bank account The amount of money in Mr Baxter’s bank account evolves. How often is the mouse in room 1 ? Question 1 has an easy. .) 2. = z=0 ∞ = z=0 3 . Xn is the money (in pounds) at the beginning of the month.y := P (Xn+1 = y|Xn = x). S0 . The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is deﬁned by: px. intuitive. So if he wishes to spend more than what is available. Dn is the money he deposits. S2 . FS . Dn = z + y − x) P (Sn = z)P (Dn = z + y − x) FS (z)FD (z + y − x). and Sn is the money he wants to spend. 0}. . D0 . as follows: Xn+1 = max{Xn + Dn − Sn . S1 .α 1−α 1 β 2 1−β Another way to summarise the information is by the 2×2 transition probability matrix P= 1−α α β 1−β = 0. . Sn .y = P (Dn − Sn = y − x) = = z=0 ∞ z=0 ∞ P (Sn = z.01 Questions of interest: 1. How long does it take for the mouse. Assume also that X0 .95 0. y = 0. Here. D1 . . he can’t. D2 . . 1. on the average. (This is the mean of the binomial distribution with parameter α. from month to month.

Questions of interest: 1. How long will it take (if ever) for the bank account to empty? We shall develop methods for answering these and other questions. we have px. in principle.2 ··· ··· ··· ··· · · · · · · ··· and is an inﬁnite matrix. 2. to go from the pub to home? You might object to the assumption that the person is totally drunk and has no sense of direction. So he may move forwards with equal probability that he moves backwards. The transition probability matrix is y=1 p0.1 p2. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left). how long will it take him. Will Mr Baxter’s account grow. if y = 0.4 Simple random walk (drunkard’s walk) in a city Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern: 4 . 1/2 1/2 1/2 1/2 1/2 −2 1/2 −1 1/2 0 1/2 1 1/2 2 1/2 3 Questions: 1.y .0 = 1 − ∞ px. If the pub is located in position 0 and his home in position 100.something that. If he starts from position 0. can be computed from FS and FD .0 p0. Are there any places which he will visit inﬁnitely many times? 4.0 p1.0 p2. how often will he be visiting 0? 2.2 P= p2.3 Simple random walk (drunkard’s walk) Consider a completely drunk person who walks along a street. Of course.1 p1.1 p0. 2. on the average.2 p1. Are there any places which he will never visit? 3. how fast? 2. and if so. being drunk. he has no sense of direction.

3 that the person will become sick.D = 1.H .Salvador Dali (1922) The Drunkard Suppose that the drunkard moves from corner to corner. the reason being that the company does not believe that its clients are subject to resurrection.D . Clearly. etc.S = 1 − pS. for now. observe. 5 .8 0.S because pH. and pS.D . But.01 S 0. the answer to this is crucial for the policy determination. and let’s say he’s completely drunk. north or south. but they will become more precise in the sequel.D is omitted. These things are.1 D Thus. and pS. given that he is currently healthy.5 Actuarial chains A life insurance company wants to ﬁnd out how much money to charge its clients. west. so we assign probability 1/4 for each move. So he can move east.S − pH. 2. Note that the diagram omits pH. pD. therefore. pD. Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly. there is clearly a higher chance that our man will be lost.3 H 0.S = 0.H − pS. It proposes the following model summarising the state of health of an individual on a monthly basis: 0. Also.H = 1 − pH. that since there are more degrees of freedom now. there is probability pH. We may again want to ask similar questions. the company must have an idea on how long the clients will live. mathematically imprecise.

2. 3.} or N = {1. . . viz. . in+m .} or the set Z of all integers (positive. X0 = i0 ) = P (Xn+1 = in+1 | Xn = in ). . m ∈ T) is independent of the past process (Xm . 3. Usually. . m and any choice of states i0 . n ∈ Z+ ) is Markov if and only if. for all n ∈ Z+ and all i0 . (Xn . X2 = i2 .1 The Markov property Deﬁnition of the Markov property Consider a (ﬁnite or inﬁnite) sequence of random variables {Xn . . . If (Xn ) is Markov then the claim holds by the conditional independence of Xn+1 and (X0 . We say that this sequence has the Markov property if.2 Transition probability and initial distribution We therefore see that joint probabilities can be expressed as P (X0 = i0 . 1. .3 3. it is customary to call (Xn ) a Markov chain. . X1 = i1 . n + m] Xk = ik | Xn = in . . m ∈ T). for any n ∈ T. where T is a subset of the integers. . . . So we will omit referring to T explicitly. or negative). We focus almost exclusively to the ﬁrst choice. m > n. . that the Xn take values in some countable set S. 2. Sometimes we say that {Xn . . Xn+m ) and (X0 . . Let us now write the Markov property in an equivalent way. This is true for any n and m. the Markov property. zero. . which precisely means that (Xn+1 . On the other hand. . . . n ∈ T} is Markov instead of saying that it has the Markov property. . . . Proof. . Xn−1 ) are independent conditionally on Xn . unless needed. P (Xn+1 = in+1 | Xn = in . . . . . almost exclusively (unless otherwise said explicitly). X0 = i0 ) = P (∀ k ∈ [n + 1. Since S is countable. . T = Z+ = {0. . . Xn−1 ) given Xn . (1) 6 . . and so future is independent of past given the present. m < n. . for any n. Lemma 1. n + m] Xk = ik | Xn = in ). if the claim holds. . 3. conditionally on Xn . Xn = in ) = P (X0 = i0 )P (X1 = i1 | X0 = i0 )P (X2 = i2 | X1 = i1 ) · · · P (Xn = in | Xn−1 = in−1 ). . . The elements of S are frequently called states. we can see that by applying it several times we obtain P (∀ k ∈ [n + 1. n ∈ T}. called the state space. in+1 ∈ S. the future process (Xm . . . We will also assume.

or as P(m. n + 1) := P (Xn+1 = j | Xn = i) .i2 (1.j (n. by a result in Probability. a square matrix whose rows and columns are indexed by S (and this. 7 . n) := P (Xn = j | Xm = i) is the transition probability from state i at time m to state j time n ≥ m.i1 (m.j (n. . But we will circumvent them by using Probability only. 2 3 And.in (n − 1.j (m. by deﬁnition.which means that the two functions µ(i) := P (X0 = i) . n) = P(m. The function pi. X1 = i1 . pi. . requires some ordering on S) and whose entries are pi. something that can be called semigroup property or Chapman-Kolmogorov equation. The function µ(i) is called initial distribution. will specify all joint distributions2 : P (X0 = i0 . n)]i. by (1). but if S is an inﬁnite set.i (n − 1.3 Time-homogeneous chains We will specialise to the case of time-homogeneous chains. of course. pi. viz. ℓ) P(ℓ. n).j (n. all probabilities of events associated with the Markov chain This causes some analytical diﬃculties. which. n). pi. n + 1) do not depend on n . Of course.j (n. Therefore. j ∈ S. we deﬁne. m + 2) · · · P(n − 1. n). X2 = i2 . n) = 1(i = j) and. m + 1)P(m + 1. then the matrices are inﬁnitely large.j (m. pi.. for each m ≤ n the matrix P(m. n) if m ≤ ℓ ≤ n. 1) pi1 . n + 1) is called transition probability from state i at time n to state j at time n + 1. n). More generally. . m + 1) · · · pin−1 . (2) which is reminiscent of matrix multiplication. n) := [pi. Xn = in ) = µ(i0 ) pi0 . means that the transition probabilities pi.j (m. n) = i1 ∈S ··· in−1 ∈S pi. i∈S i.3 In this new notation. If |S| = d then we have d × d matrices. n ≥ 0.i1 (0.j∈S . remembering a bit of Algebra. .j (m. we can write (2) as P(m. n) = P(m. 3. 2) · · · pin−1 .

Thus Pi is the law of the chain when the initial distribution is δi .j the (one-step) transition probability from state i to state j.j ]i.j∈S is called the (one-step) transition probability matrix or. if µ assigns probability p to state i1 and 1 − p to state i2 . Similarly. The matrix notation is useful. Notational conventions Unit mass. It corresponds. Then P(m. transition matrix. let µn := [µn (i) = P (Xn = i)]i∈S be the distribution of Xn .j . For example. if Y is a real random variable. Then we have µ n = µ 0 Pn . while P = [pi. From now on. δi (j) := 1. In other words. when we say “Markov chain” we will mean “timehomogeneous Markov chain”. to picking state i with probability 1. 0. A useful convention used for time-homogeneous chains is to write Pi (A) := P (A | X0 = i). as follows from (1) and the deﬁnition of P. if j = i otherwise. n + 1) =: pi. y y 8 . then µ = pδi1 + (1 − p)δi2 .and therefore we can simply write pi. Notice that any probability distribution on S can be written as a linear combination of unit masses. Starting from a speciﬁc state i. unless otherwise speciﬁed. n) = P × P × · · · · · · × P = Pn−m . simply. of course. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi . arranged in a row vector. and the semigroup property is now even more obvious because Pa+b = Pa Pb . b ≥ 0.j (n. we write Ei Y := E(Y | X0 = i) = yP (Y = y | X0 = i) = yPi (Y = y). (3) We call δi the unit mass at i. for all integers a. n−m times for all m ≤ n. For example. and call pi.

To provide motivation and intuition. This is a model of gas distributed between two identical communicating chambers. . It is clear. at any point of time n. for any m ∈ Z+ . Then pi.i−1 = P (Xn+1 N −i N i = i − 1 | Xn = i) = . N If we start the chain with X0 = N (all molecules in room 1) then the process (Xn . Imagine there are N molecules of gas (where N is a large number. It should be clear that if initially place the bug at random on one of the vertices.i+1 = P (Xn+1 = i + 1 | Xn = i) = pi. Let Xn be the number of molecules in room 1 at time n. the distribution of the bug will still be the same.d. Example: the Ehrenfest chain. that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. selecting a vertex with probability 1/7. n ≥ 0) will not be stationary: indeed. Hence the uniform probability distribution is stationary. .) is stationary if it has the same law as (Xn . Another guess then is: Place 9 . n = 0. then. We now investigate when a Markov chain is stationary. . m + 1. i. as it takes some time before the molecules diﬀuse from room to room. Consider a canonical heptagon and let a bug move on its vertices. at time n + 1 it moves to one of the adjacent vertices with equal probability. But this will not work.e. and vice versa. 1. we indeed expect that we have N/2 molecules in each room. We can model this by saying that the chance that a molecule move from room 1 to room 2 is proportional to the number of molecules in room 1. say N = 1025 ) in a metallic chamber with a separator in the middle.4 Stationarity We say that the process (Xn . If at time n the bug is on a vertex. because it is impossible for the number of molecules to remain constant at all times. the law of a stationary process does not depend on the origin of time. for a while we will be able to tell how long ago the process started. The gas is placed partly in room 1 and partly in room 2. a sequence of i.i. then. In other words. dividing it into two identical rooms.). let us look at the following example: Example: Motion on a polygon. . On the average. random variables provides an example of a stationary process. . . Then a hole is opened in the separator and we watch what happens as the molecules diﬀuse between the two rooms. Clearly. n = m.

we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. .each molecule at random either in room 1 or in room 2.i = N N −i+1 N i+1 2−N + = · · · = π(i). which you should go through by yourselves. . in ∈ S. i 0 ≤ i ≤ N. P (X1 = i) = j P (X0 = j)pj. In other words. Indeed. by (3). The equations (4) are called balance equations. Since. . . 2−N i−1 i+1 N N (The dots signify some missing algebra.i = π(i − 1)pi−1. use (1) to see that P (X0 = i0 . If we do so. then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . a Markov chain is stationary if and only if the distribution of Xn does not change with n. hence. for all m ≥ 0 and i0 .i = P (X0 = i − 1)pi−1. . Proof. we are led to consider: The stationarity problem: Given [pij ]. The only if part is obvious. ﬁnd those distributions π such that πP = π. . . P (Xn = i) = π(i) for all n. For the if part. stationary or invariant distribution. X1 = i1 . the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . then we have created a stationary Markov chain. . Xm+n = in ).) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. Xm+1 = i1 . But we need to prove this. We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and.i + π(i + 1)pi+1. If we can do so. X2 = i2 . Changing notation. . A Markov chain (Xn . (4) Such a π is called . . . 10 . Xm+2 = i2 .i + P (X0 = i + 1)pi+1. Xn = in ) = P (Xm = i0 . (a binomial distribution). n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. Lemma 2. .

p1. Then (Xn . d}. To ﬁnd the stationary distribution. n ≥ 0) is a Markov chain with transition probabilities pi+1. write the equations (4): π(i) = π(j)pj. 2 . . the next one will arrive in i minutes with probability p(i). Consider a deck of d = 52 cards and shuﬄe them in the following (stupid) way: pick two cards at random and swap their positions. and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. . π(1) will be zero. This means that the solution to the balance equations is identically zero. . 2.}.i = π(0)p(i) + π(i + 1).. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. . i = 1. In addition to that. then we run into trouble: indeed. Keep doing it continuously. i ≥ 1. . The state space is S = N = {1. i ≥ 0. each x ∈ Sd is of the form x = (x1 .i = p(i). which gives p(j) . 11 . . j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). After bus arrives at a bus stop. π must be a probability: π(i) = 1. . We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. Example: waiting for a bus. It is easy to see that the stationary distribution is uniform: π(x) = 1 . can be written as P1 = 1. Who tells us that the sum inside the parentheses is ﬁnite? If the sum is inﬁnite. if i > 0 i ∈ N. which in turn means that the normalisation condition cannot be satisﬁed. But here. where all the xi are distinct. . xd ). which. d! x ∈ Sd . . Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus.i = 1. i∈S The matrix P has the eigenvalue 1 always because j∈S pi. we use the normalisation condition π(1) = i≥1 j≥i = 1. in matrix notation. . we must be careful. where 1 is a column whose entries are all 1. . . The times between successive buses are i. Thus. .Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1.j = 1.d.i. random variables. i≥1 π(i) −1 To compute π(1). Example: card shuﬄing. and so each π(i) will be zero.

ξi ∈ {0. .. . . Here ξ(n) = (ξ1 (n). But the function f (x) = min(2x.. 2. any r = 1. ξ(n + 1) = ϕ(ξ(n)). .d. and the rule maps x into 2(1 − x). and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2 . .) Consider an inﬁnite binary vector ξ = (ξ1 . with P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 then. . . from (5). We have thus deﬁned a mapping ϕ from binary vectors into binary vectors. . As for the name of the chain (bread kneading) it can be justiﬁed as follows: to each ξ there corresponds a number x between 0 and 1 because ξ can be seen as the binary representation of the real number x. ξr+1 (n) = εr ) + P (ξ1 (n) = 1.i. . . 1] (think of [0. But to even think about it will give you some strength. what we have really shown is that this assumption is a necessary and suﬃcient condition for the existence and uniqueness of a stationary distribution. (5) 12 . . So our assumption really means: ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is ﬁnite. If ξ1 = 1 then x ≥ 1/2. 1 − ξ3 .. .* (This is slightly beyond the scope of these lectures and can be omitted. . . . If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. ξr (n + 1) = εr ) = P (ξ1 (n) = 0. .). . i≥1 j≥i j≥1 i≥1 j≥1 and the latter sum is the expected time between two successive bus arrivals. . Hence choosing the initial state at random in this manner results in a stationary Markov chain. . .i. for any n = 0. . . You can check that . . Example: bread kneading. . . . 2(1 − x)). εr ∈ {0. . if ξ1 = 1 then shift to the left and ﬂip all bits to obtain (1 − ξ2 . So. . 1] as a ﬂexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle.d. 2. r = 1. the same is true at n + 1. We can show that if we start with the ξr (0). . 2. ξ2 (n) = 1 − ε1 . P (ξ1 (n + 1) = ε1 . applied to the interval [0.e. .) is the state at time n. . . . . . Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. ξr+1 (n) = 1 − εr ). .Let us work out the sum to see if this assumption can be expressed in more “physical” terms: p(j) = 1(j ≥ i)p(j) = jp(j). ξ2 .). 1. ξ3 . The Markov chain is obtained by applying the mapping ϕ again and again. and any ε1 . 1}. ξ2 (n) = ε1 . So it goes beyond our theory. ξ2 (n). . with P (ξr (0) = 1) = P (ξr (0) = 0) = 1/2 then the same will be true for all n. . are i. . So let us do so. i. i. P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 The proof is by induction: if the ξ1 (n). ξ2 (n). 1} for all i. Remarks: This is not a Markov chain with countably many states. .).

j . a stationary distribution π satisﬁes π = πP π1 = 1 In other words. Conversely.i Multiply π(i) by j∈S pi.i = j∈A π(j)pj. as we will see in examples.i 13 . We have: Proposition 1. expecting to be able to choose.i + j∈Ac π(j)pj. Fix a set A ⊂ S and write π(i) = j∈S π(j)pj. we introduce more. The convenience is often suggested by the topological structure of the chain. therefore one of them is always redundant. Writing the equations in this form is not always the best thing to do. π(i) = j∈S (balance equations) (normalisation condition). Ac ) is the total current from A to Ac . π is a stationary distribution if and only if π1 = 1 and F (A.j = j∈A π(j)pj. because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. A). Instead of eliminating equations.Note on terminology: steady-state. then the above are d + 1 equations. When a Markov chain is stationary we refer to it as being in 4.j as a “current” ﬂowing from i to j. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A. If the state space has d states. π(j)pj. Ac ) := π(i)pi. amongst them.i + j∈Ac π(j)pj.i . i∈A j∈Ac Think of π(i)pi.1 Finding a stationary distribution As explained above. a set of d linearly independent equations which can be more easily solved. then choose A = {i} and see that you obtain the balance equations. This is OK. suppose that the balance equations hold. Proof. π(j) = 1.j (which equals 1): π(i) j∈S pi. j∈S i ∈ S. So F (A. If the latter condition holds. But we only have d unknowns. Ac ) = F (Ac . for all A ⊂ S.

i This is true for all i ∈ S.. A c F(A . A ) A c c F( A . . The extra degree of freedom provided by the last result.99 ≈ 0.05. The balance equations are: π(i) = pπ(i − 1) + qπ(i + 1). . p21 = β.8% of the time in room 2. Example: a walk with a barrier. Thus. Summing over i ∈ A. 1.05 ≈ 0.Now split the left sum also: π(i)pi. but we shall not do this here. 3. The constant c is found from π(1) + π(2) = 1.j = j∈A π(j)pj. p22 = 1 − β.2% of the time in 1.i + j∈Ac π(j)pj. we see that the ﬁrst term on the left cancels with the ﬁrst on the right. If we take α = 0. while the remaining two terms give the equality we seek. We have π(1)α = π(2)β. in steady state.04 π(2) = 0.j + j∈A j∈Ac π(i)pi.952.. where p + q = 1: p p p 0 q 1 q 2 q 3 q 4 . which immediately tells us that π(1) is proportional to β and π(2) proportional to α: π(1) = cβ. β < 1. α+β π(2) = cα. 14 i = 1. . . π(1) = β . p q Consider the following chain. here is an example: Example: Consider the 2-state Markov chain with p12 = α. Instead. gives us ﬂexibility. we ﬁnd π(1) = 0. α+β π(2) = α . (Consequently.048. 2. β = 0. and 4. p11 = 1 − α. One can ask how to choose a minimal complete set of linearly independent equations. That is.) Assume 0 < α. A) Schematic representation of the ﬂow balance relations.99 (as in the mouse example). the mouse spends only 95.04 room 1.

e.j > 0 for some n ∈ N and some states i1 . we can solve them immediately.1 Topological structure The graph of a chain This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. i − 1}. j) ∈ E ⇐⇒ pi.) 5 5.i2 · · · pin−1 . provided that ∞ π(i) = 1. • Suppose that i j. (If p ≥ 1/2 there is no stationary distribution. p < 1/2. . (ii) pi. In other words. π(i) = (p/q)i π(0). These are much simpler than the previous ones. i leads to j (written as i Notice that this relation is reflexive): ∀i ∈ S Furthermore: Theorem 1.i1 pi1 . i. The following are equivalent: (i) i j i i j) ⇐⇒ Pi (∃n ≥ 0 Xn = j) > 0. Proof. the chain will visit j at some ﬁnite time. . If i = j there is nothing to prove. and E a set of directed edges. S is a set of vertices.j > 0. This is often cast in terms of a digraph (directed graph). . In fact. E) where S is the state space and E ⊂ S × S deﬁned by (i.But we can make them simpler by choosing A = {0. (iii) Pi (Xn = j) > 0 for some n ≥ 0. Since 0 < Pi (∃n ≥ 0 Xn = j) ≤ 15 Pi (Xn = j). Does this provide a stationary distribution? Yes. a pair (S. Ac ) = π(i − 1)p = π(i)q = F (Ac . . In graph-theoretic terminology. This is i=0 possible if and only if p < q. . If i ≥ 1 we have F (A. in−1 . So assume i = j.e. i. 5. .2 The relation “leads to” We say that a state i leads to state j if. . . For i ≥ 1. n≥0 . We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows. 1. A). starting from i.

We just observed it is symmetric. (iii) holds. {4. 16 . 3} is not closed. However. (6) j.. {6}. we have that the sum is also positive and so one of its terms is positive. where the sum extends over all possible choices of states in S. Just as any equivalence relation. Look at the last display again. this is symmetric (i Corollary 2. Hence Pi (Xn = j) > 0. 3}.3 The relation “communicates with” Deﬁne next the relation i communicates with j (written as i Obviously. Equivalence relation means that it is symmetric. 5. The communicating class corresponding to the state i is. The class {4. 5}. • Suppose that (ii) holds. • Suppose now that (iii) holds. i}. meaning that (ii) holds. by deﬁnition.j . and so (iii) holds. by deﬁnition. Two communicating classes are either 1 2 3 4 5 6 In the example of the ﬁgure. It is reﬂexive and transitive because is so. the set of all states that communicate with i: [i] := {j ∈ S : j So. we see that there are four communicating classes: {1}.. reﬂexive and transitive. 5} is closed: if the chain goes in there then it never moves out of it. [i] = [j] if and only if i identical or completely disjoint. (i) holds. Proof. By assumption. because Pi (Xn = j) ≤ Pi (∃m ≥ 0 Xm = j). The relation is transitive: If i j and j k then i k. In other words. In addition. Corollary 1. it partitions S into equivalence classes known as communicating classes. Since Pi (Xn = j) > 0. the class {2.the sum on the right must have a nonzero term. j) ⇐⇒ i j and j i. {2.. The relation j ⇐⇒ j i).i2 · · · pin−1 .in−1 pi.. We have Pi (Xn = j) = i1 . The ﬁrst two classes diﬀer from the last two in character. is an equivalence relation. the sum contains a nonzero term.i1 pi1 .

. parts. C2 . In graph terms. C3 are not closed. we can reach any other state. . Note: An inessential state i is such that there exists a state j such that i j but j i. To put it otherwise. we say that a set of states C ⊂ S is closed if pi. the single-element set {i} is a closed communicating class.j > 0. Proof. If all states communicate with all others. The communicating classes C1 . the state is inessential. A communicating class C is closed if and only if the following holds: If i ∈ C and if i j then j ∈ C. and C is closed. A general Markov chain can have an inﬁnite number of communicating classes. Note that there can be no arrows between closed communicating classes.i2 · · · pin−1 . pi. Note: If we consider a Markov chain with transition matrix P and ﬁx n ∈ N then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. Lemma 3. C5 in the ﬁgure above are closed communicating classes. The converse is left as an exercise.i = 1 then we say that i is an absorbing state. and so on. the state space S is a closed communicating class. There can be arrows between two nonclosed communicating classes but they are always in the same direction. . pi1 . more manageable. Since i ∈ C. Why? A state i is essential if it belongs to a closed communicating class. then the chain (or.i1 > 0. Closed communicating classes are particularly important because they decompose the chain into smaller. and so i2 ∈ C.i1 pi1 . Similarly. starting from any state. we conclude that j ∈ C. in−1 such that pi.i2 > 0. j∈C for all i ∈ C. Thus. 17 . Suppose that i j.More generally. The classes C4 . its transition matrix P) is called irreducible. more precisely. we have i1 ∈ C. Suppose ﬁrst that C is a closed communicating class. . Otherwise.j = 1. If a state i is such that pi. We then can ﬁnd states i1 . C1 C2 C3 C4 C5 The general structure of a Markov chain is indicated in this ﬁgure. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. In other words still.

Consider two distinct essential states i. d(i) = gcd Di . (m) (n) for all m. Let us denote by pij the entries of Pn . i. If i j there is α ∈ N with pij > 0 and if 18 . If i j then d(i) = d(j). it will move to the set C2 = {2. Dj = {n ∈ (n) (α) N : pjj > 0}. 3} at some point of time then. Therefore if pii > 0 then pii (n) (ℓn) > 0 for all ℓ ∈ N. j ∈ S. more generally. 4} at the next step and vice versa. whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. .4 Period Consider the following chain: 1 2 4 3 The chain is irreducible. we have So pii (2n) (n) pij ≥ pii (n) 2 (m+n) ≥ pik pkj .e. We say that the chain has period 2. The period of an inessential state is not deﬁned. Notice however. all states form a single closed communicating class. Another way to say this is: (n) (m) If the integer n divides m (denote this by: n | m) and if pii > 0 then pii > 0. . The chain alternates between the two sets. Since Pm+n = Pm Pn . So. and. (n) Proof. i. .5. n ≥ 0. pii (ℓn) ≥ pii (n) ℓ . (Can you ﬁnd an example of a chain with period 3?) We will now give the deﬁnition of the period of a state of an arbitrary chain. j. The period of an essential state is deﬁned as the greatest common divisor of all natural numbers n with such that it is possible to return to i in n steps: d(i) := gcd{n ∈ N : Pi (Xn = i) > 0} . X2 ∈ C1 . d(j) = gcd Dj . X3 ∈ C2 . Theorem 2 (the period is a class property). A state i is called aperiodic if d(i) = 1. . X4 ∈ C1 . that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1. ℓ ∈ N. Let Di = {n ∈ N : pii > 0}. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2 .

1.) 19 . n1 ∈ Di such that n2 − n1 = 1. as deﬁned earlier. Because n is suﬃciently large. Since n1 . (Here Cd := C0 . Corollary 3. Then we can uniquely partition the state space S into d disjoint subsets C0 . . if d = 1. Then pii > 0 and so pjj ≥ pij pii pji > 0.e. . this is the content of the following theorem. Therefore d(j) | α + β. An irreducible chain with ﬁnitely many states is aperiodic if and only if there (n) exists an n such that pij > 0 for all i. the chain is called aperiodic. chains where. From the last two displays we get d(j) | n for all n ∈ Di . . n2 ∈ Di . . all states communicate with one another. . j∈Cr+1 i ∈ Cr . showing that α + β ∈ Dj .j i there is β ∈ N with pji > 0. . . d − 1. i. d(j) | d(i)). If a number divides two other numbers then it also divides their diﬀerence. if an irreducible chain has period d then we can decompose the state space into d sets C0 . Divide n by n1 to obtain n = qn1 + r. Proof. So d(i) = d(j). More generally. showing that α + n + β ∈ Dj . Theorem 4. Therefore d(j) | α + β + n for all n ∈ Di . Consider an irreducible chain with period d. Of particular interest. . Theorem 3. In particular. we have n ∈ Di as well. Arguing symmetrically. Cd−1 such that pij = 1. Therefore n = qn1 + r(n2 − n1 ) = (q − r)n1 + rn2 . where the remainder r ≤ n1 − 1. j ∈ S. . If i is an aperiodic state then there exists n0 such that pii > 0 for all n ≥ n0 . So d(j) is a divisor of all elements of Di . C1 . Formally. we obtain d(i) ≤ d(j) as well. . . So pjj ≥ pij pji > 0. Pick n2 . Since d(i) is the greatest common divisor we have d(i) ≥ d(j) (in fact. An irreducible chain has period d if one (and hence all) of the states have period d. . C1 . are irreducible chains. r = 0. Cd−1 such that the chain moves cyclically between them. Let n be suﬃciently large. q − r > 0. Now let n ∈ Di . (n) (n) (α+n+β) (α) (n) (β) (β) (α+β) (α) (β) This ﬁgure shows the internal structure of a closed communicating class with period d = 4.

. j ∈ C2 . say. If X0 ∈ A the two times coincide.. Hence it partitions the state space S into d disjoint equivalence classes. Suppose pij > 0 but j ∈ C1 but.. Pick a state i0 and let C0 be its equivalence class (i. Deﬁne the relation i d j ⇐⇒ pij (nd) > 0 for some n ≥ 0. and by the assumptions that d d j. We show that these classes satisfy what we need. We deﬁne the hitting time4 of a set of states A by TA := inf{n ≥ 0 : Xn ∈ A}. Let C1 be the equivalence class of i1 . The existence of such a path implies that it is possible to go from i0 to i.e. i ∈ C0 .id > 0. C1 . Such a path is possible because of the choice of the i0 . Take. Notice that this is an equivalence relation. .e the set of states j such that i j). We are interested in deriving formulae for the probability that this time is ﬁnite as well as the expectation of this time. i1 .. i1 . id ∈ C0 . after it takes one step from its current position. i.. with corresponding classes C0 . Cd−1 . .Proof. . 20 . Consider the path i0 → i → j → i2 → i3 → · · · → id → i0 . necessarily. Continue in this manner and deﬁne states i0 . which contradicts the deﬁnition of d. It is easy to see that if we continue and pick a further state id with pid−1 . . Assume d > 1. We use a method that is based upon considering what the Markov chain does at time 1. id−1 d 6 Hitting times and ﬁrst-step analysis Consider a Markov chain with transition probability matrix P. . We avoid excessive notation by using the same letter for both. necessarily.. then. As an example. j belongs to the next class. that is why we call it “first-step analysis”. for example. consider the chain 1 1/3 1 0 4 1/2 1 1/6 2 We shall later consider the time inf{n ≥ 1 : Xn ∈ A} which diﬀers from TA simply by considering n ≥ 1 instead of n ≥ 0. The reader is warned to be alert as to which of the two variables we are considering at each time.. We now show that if i belongs to one of these classes and if pij > 0 then. Then pick i1 such that pi0 i1 > 0. i2 i0 i0 in a number of steps which is an integer multiple of d − 1 (why?).

e.). ϕ(i) = j∈S i∈A pij ϕ(j). Furthermore. If i ∈ A. i ∈ A. ′ Proof. and deﬁne ϕ(i) := Pi (TA < ∞). if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S. because the chain may end up in state 2. we have Pi (TA < ∞) = ϕ(j)pij . Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). But the Markov chain is homogeneous.It is clear that P1 (T0 < ∞) < 1. We then have: Theorem 5. X2 . we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A pjk ϕ′ (k) k∈A 21 . ′ is the remaining time until A is hit). X0 = i) = P (TA < ∞|X1 = j). which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). Hence: ′ ′ P (TA < ∞|X1 = j. . a ′ function of X1 . If i ∈ A then TA = 0. Therefore. j∈S as needed. . The function ϕ(i) satisﬁes ϕ(i) = 1. By self-feeding this equation. But what exactly is this probability equal to? To answer this in its generality. . then TA ≥ 1. Combining the above. ﬁx a set of states A. let ϕ′ (i) be another solution. So TA = 1 + TA . conditionally on X1 . We ﬁrst have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. the event TA is independent of X0 . and so ϕ(i) = 1. For the second part of the proof.

and let ψ(i) = Ei TA .We recognise that the ﬁrst term equals Pi (TA = 1). ′ E i TA = 1 + E i TA = 1 + j∈S pij Ej TA = 1 + j∈A pij ψ(j). (If the chain starts from 0 it takes no time to hit 0. The second part of the proof is omitted. j∈A i ∈ A. we have 1 1 ϕ(1) = ϕ(1) + ϕ(0). the second equals Pi (TA = 2). On the other hand. Example: Continuing the previous example. let A = {0. obviously. then TA = 1 + TA . If i ∈ A then. and ϕ(2) = 0. Proof. Furthermore. the set that contains only state 0. 3 2 so ϕ(1) = 3/4. We have: Theorem 6. if the chain starts from 2 it will never hit 0. i = 0. if ψ ′ (i) is any other solution of these equations then ψ ′ (i) ≥ ψ(i) for all i ∈ S. as above. 1 ψ(1) = 1 + ψ(1).) As for ϕ(1). TA = 0 and so ψ(i) := Ei TA = 0. ψ(i) = 1 + i∈A pij ψ(j). which is the second of the equations. we obtain ϕ′ (i) ≥ Pi (TA < ∞) = ϕ(i). By omitting the last term we obtain ϕ′ (i) ≥ Pi (TA ≤ 2). Therefore. By continuing self-feeding the above equation n times. 3 22 . Clearly. ψ(i) := Ei TA . 2. we choose A = {0}. The function ψ(i) satisﬁes ψ(i) = 0. Example: In the example of the previous ﬁgure. We immediately have ϕ(0) = 1. because it takes no time to hit A if the chain starts from A. we obtain ϕ′ (i) ≥ Pi (TA ≤ n). 2}. If ′ i ∈ A. so the ﬁrst two terms together equal Pi (TA ≤ 2). ψ(0) = ψ(2) = 0. Letting n → ∞. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Next consider the mean time (mean number of steps) until the set A is hit for the ﬁrst time. We let ϕ(i) = Pi (T0 < ∞). 1. Start the chain from i.

23 . ′ TA = 1 + TA . We may ask to ﬁnd the probabilities ϕAB (i) = Pi (TA < TB ). ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the ﬁrst-step analysis method.. Clearly. because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). by the Markov property. consider the following situation: Every time the state is x a reward f (x) is earned. i. until set A is reached. as argued earlier. . ′ where TA is the remaining time. ϕAB (i) = 0. TB for two disjoint sets of states A. (This should have been obvious from the start.) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . Hence. otherwise.which gives ψ(1) = 3/2. B. if x ∈ A. . of the random variables X1 . It should be clear that the equations satisﬁed are: ϕAB (i) = 1. we just changed index from n to n + 1. Next. h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ). This happens up to the ﬁrst hitting time TA ﬁrst hitting time of the set A. i∈A Furthermore. ϕAB is the minimal solution to these equations.e. The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ). . therefore the mean number of ﬂips until the ﬁrst success is 3/2. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). ϕAB (i) = j∈A i∈B pij ϕAB (j). h(x) = f (x). X2 . where. because the underlying experiment is the ﬂipping of a coin with “success” probability 2/3. in the last sum. after one step. Then ′ 1+TA x ∈ A. ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). Now the last sum is a function of the future after time 1.

h(3) = 7. he collects i pounds. h(1) = 5.) 7 Gambler’s ruin Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently.Thus. if just before the k-th toss you decide to bet Yk pounds then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. Let p the probability of heads and q = 1 − p that of tails. until he dies in room 4. 2. 1 2 4 3 The question is to compute the average total reward. so 24 . The successive stakes (Yk . no gambler is assumed to have any information about the outcome of the upcoming toss. y∈S h(x) = f (x) + x ∈ A. for i = 1. Clearly. If heads come up then you win £1 per pound of stake. unless he is in room 4. are h(x) = 0. in which case he dies immediately. why?) If we let h(i) be the average total reward if he starts from room i. x∈A pxy h(y). 3. (Sooner or later. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails). k ∈ N) form the strategy of the gambler. your fortune after the n-th toss is n Xn = X0 + k=1 Y k ξk . You can remember the latter ones by thinking that the expectecd total reward equals the immediate reward f (x) plus the expected remaining reward. So. (Also. if he starts from room 2. collecting “rewards”: when in room i. 2 2 We solve and ﬁnd h(2) = 8. he will die. Example: A thief enters sneaks in the following house and moves around the rooms. we have h(4) = 0 1 h(1) = 1 + h(2) 2 1 h(3) = 3 + h(2) 2 1 1 h(2) = 2 + h(1) + h(3). the set of equations satisﬁed by h.

The general solution will then obtained from Px (Tb < Ta ) = ϕ(x − a. she must base it on whatever outcomes have appeared thus far. and where the states a. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and.i+1 = p.g. in simplest case Yk = 1 for all k. b − 1. To solve the problem. In other words. expectation) under the initial condition X0 = x. We obviously have ϕ(0) = 0. a + 1. . b} with transition probabilities pi. a < i < b. . write ϕ(x) instead of ϕ(x. We wish to consider the problem of ﬁnding the probability that the gambler will never lose her money. 0 ≤ x ≤ ℓ. Yk cannot depend on ξk . . It can only depend on ξ1 . Theorem 7. . b − a). in the ﬁrst case. if p = q Px (Tb < Ta ) = . .if the gambler wants to have a strategy at all. ξk−1 and–as is often the case in practice–on other considerations of e. ℓ). 25 ϕ(ℓ) = 1. Consider a gambler playing a simple coin tossing game. (Think why!) To simplify things. . a ≤ x ≤ b. b = ℓ > 0 and let ϕ(x. in addition. . Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. as described above. The diﬀerence between the gambler’s ruin problem for a company and that for a mere mortal is that. the company has control over the probability p. if p = q = 1/2 b−a Proof. it will make enormous proﬁts). Then the probability that the gambler’s fortune will reach level b > x before reaching level a < x equals: x−a − 1 (q/p) (q/p)b−a − 1 . pi. a + 2. ℓ) := Px (Tℓ < T0 ). . betting £1 at each integer time against a coin which has P (heads) = p. She will stop playing if her money fall below a (a < x). . P (tails) = 1 − p.i−1 = q = 1 − p. Let £x be the initial fortune of the gambler. x − a . b are absorbing. We are clearly dealing with a Markov chain in the state space {a. The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). astrological nature. We ﬁrst solve the problem when a = 0. is our next goal. Let us consider a concrete problem. Ex ) for the probability (respectively. We use the notation Px (respectively.

and so λx − 1 λx − 1 =1− = λx . then λ = q/p < 1. by ﬁrst-step analysis. We ﬁnally ﬁnd λx − 1 . λx − 1 δ(0). Px (T0 < ∞) = 1 − lim 26 . The general case is obtained by replacing ℓ by b − a and x by x − a. we have λℓ −1 λ−1 δ(0) = 1 which gives δ(0) = (λ − 1)/(λℓ − 1). Corollary 4. δ(x − 2) = · · · = q p x δ(0). ℓ→∞ λℓ − 1 0−1 λx − 1 = 1 − 0 = 1. if p = q. λℓ −1 x ℓ. We have Px (T0 < ∞) = lim Px (T0 < Tℓ ) = 1 − lim Px (Tℓ < T0 ). If p > 1/2. λ−1 ϕ(x) = If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). Since ϕ(ℓ) = 1. Assuming that λ := q/p = 1. λ = 1. ℓ→∞ ℓ→∞ (q/p)x .and. ℓ→∞ ℓ Px (T0 < ∞) = 1 − lim Finally. then λ = q/p > 1. ϕ(x) = pϕ(x + 1) + qϕ(x − 1). The ruin probability of a gambler starting with £x is Px (T0 < ∞) = Proof. we can write this as p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)]. λℓ − 1 0 ≤ x ≤ ℓ. 0 ≤ x ≤ ℓ. Since p + q = 1. We have δ(x) = q δ(x − 1) = p q p 2 0 < x < ℓ. and so Px (T0 < ∞) = 1 − lim If p < 1/2. ℓ) = λx −1 . ℓ→∞ λℓ − 1 x = 1 − 0 = 1. we have obtained ϕ(x. we ﬁnd ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λx−1 + · · · + λ + 1]δ(0) = Since ϕ(ℓ) = 1. if λ = 1 if λ = 1. This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). if p > 1/2 if p ≤ 1/2. λ = 1. ℓ Summarising. 1. we ﬁnd δ(0) = 1/ℓ and so x ϕ(x) = . Let δ(x) := ϕ(x + 1) − ϕ(x).

for all n ∈ Z+ . A. . (c) is due to the fact that A ∩ {τ = n} is determined by (X0 .j2 · · · We have: P (Xτ +1 = j1 . . . . . τ < ∞) = P (Xn+1 = j1 . A. Proof. A. . τ < ∞). . n ≥ 0) is a time-homogeneous Markov chain. . A ∩ {τ = n} is determined by (X0 . . . . A.) is independent of the past process (X0 . the future process (Xτ +n . . . . 27 . . . Xn = i. Xτ = i. a stopping time τ is a random variable with values in Z+ ∪{+∞} such that 1(τ = n) is a deterministic function of (X0 . Xn+2 = j2 . A. | Xn = i)P (Xn = i. Xn+1 . | Xn = i. . τ = n) = P (Xn+1 = j1 . τ = n) = pi. Xτ +2 = j2 . τ = n). . τ = n)P (Xn = i. Let A be any event determined by the past (X0 . . Theorem 8 (strong Markov property). (d) where (a) is just logic. τ = n) (a) (b) = P (Xτ +1 = j1 . An example of a random time which is not a stopping time is the last visit of a speciﬁc set. including. Let τ be a stopping time. An example of a stopping time is the time TA . Xn ). . . We are going to show that for any such A. .j1 pj1 . (b) is by the deﬁnition of conditional probability. . . . for any n. n ≥ 0) is independent of the past process (X0 . . . Xτ ) if τ < ∞. τ = n) (c) = P (Xn+1 = j1 . . . . Xn+2 = j2 . the future process (Xn . . . A | Xτ . Our assertions are now easily obtained from this by ﬁrst summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P (Xτ = i.Xτ +2 = j2 . | Xτ = i.j1 pj1 . Since (X0 . Xn ) if we know the present Xn . Xn+2 = j2 . . considered earlier. A. . and (d) is due to the fact that (Xn ) is time-homogeneous Markov chain with transition probabilities pij . . Xτ +2 = j2 . Xn ) and the ordinary splitting property at n. . . . | Xτ . Xn )1(τ = n). Xτ ) and has the same law as the original process started from i. . the value +∞.8 Stopping times and the strong Markov property A random time is a random variable taking values in the time-set. τ < ∞) = pi. Suppose (Xn . . . Recall that the Markov property states that. . In particular. the ﬁrst hitting time of the set A. . Then. . possibly. P (Xτ +1 = j1 . We will now prove that this remains true if we replace the deterministic n by a random stopping time τ . . . . . . This stronger property is referred to as the strong Markov property.j2 · · · P (Xn = i. τ < ∞)P (A | Xτ . Xτ )1(τ < ∞) = n∈Z+ (X0 . . . conditional on τ < ∞ and Xτ = i. . . . . Let τ be a stopping time. . . any such A must have the property that. Xn ) for all n ∈ Z+ . . τ < ∞) and that P (Xτ +1 = j1 . Xτ +2 = j2 . .

Formally. in a generalised sense. If X0 = i then the two deﬁnitions coincide. The random variables Xn comprising a general Markov chain are not independent. Regardless. and so on. we let Ti be the time of the second (1) (3) visit. (1) r = 1. They are. So Xi is the r-th excursion: the chain visits 28 . random trajectories. . in fact. r = 2. we recursively deﬁne Ti (r) (2) := inf{n > Ti (r−1) : Xn = i}. “chunks”. All these random times are stopping times. then the behaviour of the chain between successive visits cannot be inﬂuenced by the past or the future (and that is due to the strong Markov property). The Ti of Section 6 was allowed to take the value 0. . 2. (And we set Ti := Ti . . Note the subtle diﬀerence between this Ti and the Ti of Section 6. Schematically. 3. as “random variables”. Ti Xi (0) (r) ≤ n < Ti (r+1) .i. We (r) refer to them as excursions of the process.9 Regenerative structure of a Markov chain A sequence of i. Xi (1) .d. random variables is a (very) particular example of a Markov chain. ..) We let Ti be the time of the third visit. (Why?) Consider the “trajectory” of the chain between two successive visits to the state i: Xi Let also We think of (r) := Xn . 0 ≤ n < Ti . we will show that we can split the sequence into i. Xi (0) . Xi (2) . . by using the strong Markov property. State i may or may not be visited again. := Xn . . Our Ti here is always positive. . The idea is that if we can ﬁnd a state i that is visited again and again. However. These excursions are independent! 11111111111 00000000000 1111111111 0000000000 11111111111 00000000000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 1111111111111111 0000000000000000 11111 11111111111111 00000000000000 1111111111 0000000000 i 00000 11111111111111111111111111111111111111111111 00000000000000000000000000000000000000000000 1111111111111111 0000000000000000 11111 00000 11111 00000 Ti = Ti (1) Ti (2) Ti (3) Ti (4) Fix a state i ∈ S and let Ti be the ﬁrst time that the chain will visit this state: Ti := inf{n ≥ 1 : Xn = i}.i.d..

We abstract this trivial observation and give a couple of deﬁnitions: First.state i for the r-th time..i. 29 . the chain returns to i only ﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 0. all. . But (r) the state at Ti is. for time-homogeneous Markov chains. The excursions Xi (0) . and this proves the ﬁrst claim. An immediate consequence of the strong Markov property is: Theorem 9.). the chain returns to i inﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 1.d. Therefore. we say that a state i is transient if. . if it starts from state i at some point of time. possibly. The last corollary ensures us that Λ1 . of the excursions are Λr := (r) n=Ti 1(Xn = j) The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. we “watch” it up until the next time it returns to state i. but there are always states which will be visited again and again.. . they are also identical in law (i. (0) are independent random variables. the future is independent of the (1) (2) past. starting from i. equal to i. it will behave in the same way as if it starts from the same state at another point of time. ad inﬁnitum. (Numerical) functions g(Xi independent random variables. Example: Deﬁne Ti (r+1) (0) ). have the same Proof. Therefore. but important corollary of the observation of this section is that: Corollary 5. Xi distribution. except. In particular. and “goes for an excursion”. The claim that Xi . Xi have the same distribution follows from the fact that the Markov chain is time-homogeneous. Second. A trivial. g(Xi (1) ). at least one state must be visited inﬁnitely many times. . are independent random variables. Perhaps some states (inessential ones) will be visited ﬁnitely many times. Moreover. Xi (1) . (r) . . 10 Recurrence and transience When a Markov chain has ﬁnitely many states. . the future after Ti is independent of the past before Ti . Apply Strong Markov Property (SMP) at Ti : SMP says that. we say that a state i is recurrent if. Λ2 . by deﬁnition. given the state at (r) (r) (r) the stopping time Ti . Xi (2) . g(Xi (2) ). . starting from i.

We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas. This means that the state i is transient. we conclude what we asserted earlier. fii = Pi (Ti < ∞) = 1. Because either fii = 1 or fii < 1. Therefore. Notice also that fii := Pi (Ji ≥ 1) = Pi (Ti < ∞). in principle. Suppose that the Markov chain starts from state i. the random variable Ji is geometric: k Pi (Ji ≥ k) = fii . 3. We claim that: Lemma 4. Consider the total number of visits to state i (excluding the possibility that X0 = i): ∞ Ji = n=1 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞} . we have Proof. 2 k Pi (Ji ≥ k) = fii Pi (Ji ≥ k − 1) = fii Pi (Ji ≥ k − 2) = · · · = fii . Pi (Ji = ∞) = 1. If fii = 1 we have that Pi (Ji ≥ k) = 1 for all k. namely. and that is why we get the product. 2. State i is recurrent. every state is either recurrent or transient.e. This means that the state i is recurrent. Indeed.Important note: Notice that the two statements above are not negations of one another. We will show that such a possibility does not exist. But the past (k−1) before Ti is independent of the future. . therefore concluding that every state is either recurrent or transient. By the strong Markov property at time Ti (k−1) Pi (Ji ≥ k) = Pi (Ji ≥ k − 1)Pi (Ji ≥ 1). then Pi (Ji < ∞) = 1. Lemma 5. Pi (Ji = ∞) = 1. there could be yet another possibility: 0 < Pi (Xn = i for inﬁnitely many n) < 1. 4. k ≥ 0. Ei Ji = ∞. The following are equivalent: 1. Notice that: state i is visited inﬁnitely many times ⇐⇒ Ji = ∞. (k−1) Notice that fii = Pi (Ti < ∞) can either be equal to 1 or smaller than 1. If fii < 1. 30 . Starting from X0 = i. i. the event {Ji ≥ k} implies that Ti < ∞ and that Ji ≥ k − 1. Indeed.

assuming that the chain (n) starts from state i. Lemma 6. it is visited inﬁnitely many times. as n → ∞. Notice the diﬀerence with pij . State i is transient if. (n) pij n = Pi (Xn = j) = m=1 n Pi (Tj = m. if we assume Ei Ji < ∞. fij ≤ pij . Ei Ji < ∞. therefore Pi (Ji < ∞) = 1. The equivalence of the ﬁrst two statements was proved above. Ji cannot take value +∞ with positive probability. (n) But if a series is ﬁnite then the summands go to 0. by the deﬁnition of Ji . by deﬁnition of Ji is equivalent to Pi (Ji = ∞) = 1. State i is recurrent if. (n) i. If Ei Ji = ∞. but the two sequences are related in an exact way: Lemma 7. 2. 4. by deﬁnition. Clearly. But ∞ ∞ ∞ (n) Ei Ji = Ei n=1 1(Xn = i) = n=1 Pi (Xn = i) = n=1 (n) pii . This. Xn = j) Pi (Tj = m)Pi (Xn = j | Tj = m) 31 = m=1 . State i is transient. since Ji is a geometric random variable. we see that its expectation cannot be inﬁnity without Ji itself being inﬁnity with probability one. The latter is the probability to be in state j at time n (not necessarily for the ﬁrst time) starting from i.1 Deﬁne First hitting time decomposition fij := Pi (Tj = n). Corollary 6. i. Proof. pii → 0. the probability to hit state j for the ﬁrst time at time n ≥ 1. owing to the fact that Ji is a geometric random variable. If i is transient then Ei Ji < ∞. (n) pij n (m) (n−m) (n) (n) = m=1 fij pjj . this implies that Ei Ji < ∞.Proof. it is visited ﬁnitely many times with probability 1. 10. by deﬁnition. Pi (Ji < ∞) = 1. then. which. From Elementary Probability. On the other hand. as n → ∞. is equivalent to Pi (Ji < ∞) = 1. fii = Pi (Ti < ∞) < 1. The following are equivalent: 1. then. The equivalence of the ﬁrst two statements was proved before the lemma. Proof. Proof. If i is a transient state then pii → 0. and. 3. n ≥ 1.e. Hence Ei Ji = ∞ also.e. clearly.

. The last statement is simply an expression in symbols of what we said above. we have n=1 pjj = Ej Jj < ∞. We now show that recurrence is a class property. j i. Hence. In a communicating class. if we let fij := Pi (Tj < ∞) . Start with X0 = i. ( Remember: another property that was a class property was the property that the state have a certain period. the probability to go to j in some ﬁnite time is positive. Corollary 8. In other words. by deﬁnition. . Now. which is ﬁnite.i. 1. . Xi . . If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. 32 .i. but also fij = 1.) Theorem 10.r := 1 j appears in excursion Xi (r) . Hence.The ﬁrst term equals fij . are i. and since they take the value 1 with positive probability. So the random variables δj. starting from i. excursions Xi . showing that j will appear in inﬁnitely many of the excursions for sure. and so the probability that j will appear in a speciﬁc excursion is positive. denoted by i j: it means that. inﬁnitely many of them will be 1 (with probability 1). ∞ Proof. If j is transient. r = 0. For the second term.d.d. If j is a transient state then pij → 0. Corollary 7. . Proof. for all states i. as n → ∞. not only fij > 0. we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj . Indeed.. 10. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. and hence pij as n → ∞. either all states are recurrent or all states are transient.2 Communicating classes and recurrence/transience Recall the deﬁnition of the relation “state i leads to state j”. and gij := Pi (Tj < Ti ) > 0. the probability that j will appear in a speciﬁc excursion equals gij = Pi (Tj < Ti ). we have i j ⇐⇒ fij > 0. . j i. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). by summing over n the relation of Lemma 7. In particular.

What is the expectation of T ? First. Recall that a ﬁnite random variable may not necessarily have ﬁnite expectation. and hence. . if started in C. Pi (B) = 0. all other states are also transient. . It should be clear that this can. n+1 33 = 0 . . be i. So if a communicating class is recurrent then it contains no inessential states and so it is closed. Hence 11 Positive recurrence Recall that if we ﬁx a state i and let Ti be the ﬁrst return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state). random variables. Un ≤ x | U0 = x)dx xn dx = 1 . every other j is also recurrent.e. Let T := inf{n ≥ 1 : Un > U0 }. i. If a communicating class contains a state which is transient then. Hence all states are recurrent.Proof. .d. by the above theorem. We now take a closer look at recurrence. uniformly distributed in the unit interval [0. By closedness. Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x. Every ﬁnite closed communicating class is recurrent. i. Xn = i for inﬁnitely many n) = 0. If i is inessential then it is transient. U2 . But then. Indeed. . Let C be the class. Thus T represents the number of variables we need to draw to get a value larger than the initial one. appear in a certain game. for example. and so i is a transient state. . 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ). Theorem 11. Theorem 12. If i is recurrent then every other j in the same communicating class satisﬁes i j. 1]. there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0. j but j i. Suppose that i is inessential. Proof. This state is recurrent. but Pi (A ∩ B) = Pi (Xm = j. . P (T > n) = P (U1 ≤ U0 . . . necessarily. P (B) = Pi (Xn = i for inﬁnitely many n) < 1.i. Example: Let U0 .e. . U1 . By ﬁniteness. Proof. observe that T is ﬁnite. Hence. . some state i must be visited inﬁnitely many times. X remains in C.

. . random variables. 34 . of successive states is not an independent sequence. 0) = ∞. we shall discover what the form of this stationary distribution is by using the Law of Large Numbers 12 Law (=Theorem) of Large Numbers in Probability Theory The following result is. It is traditional to refer to it as “Law”. . an inﬁnite number of steps to exceed it! We now classify a recurrent state i as follows: We say that i is positive recurrent if Ei Ti < ∞. then it takes. the Fundamental Theorem of Probability Theory. Let Z1 . . arguably. Z2 . We say that i is null recurrent if Ei Ti = ∞. X1 . Then P Z1 + · · · + Zn =∞ n→∞ n lim = 1. Then P Z1 + · · · + Zn =µ n→∞ n lim = 1. (ii) Suppose E max(Z1 . To apply the law of large numbers we must create some kind of independence.i. X2 . be i. To our rescue comes the regenerative structure of a Markov chain identiﬁed at an earlier section. We state it here without proof. . Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. n+1 But ET = ∞ P (T > n) = n=1 n=1 1 = ∞. but E min(Z1 . the sequence X0 .Therefore. n+1 You are invited to think about the meaning of this: If the initial value U0 is unknown. It is not a Law. on the average. P (T < ∞) = lim P (T ≤ n) = lim (1 − n→∞ ∞ n→∞ 1 ) = 1. Theorem 13. 13 Law of Large Numbers for Markov chains The law of large numbers stated above requires independence.d. When we have a Markov chain. But this is misleading. it is a Theorem that can be derived from the axioms of Probability Theory. Before doing that. . 0) > −∞. (i) Suppose E|Z1 | < ∞ and let µ = EZ1 .

i. assign a reward f (x). . the law of large numbers tells us that G(Xa ) + · · · + G(Xa ) → EG(Xa(1) ). but deﬁne G(X ) to be the total reward received over the excursion X . the (0) initial excursion Xa also has the same law as the rest of them. Xa . The ﬁnal equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a. G(Xa(3) ). . and because if we start with X0 = a. forms an independent sequence of (generalised) random variables. which. Speciﬁcally.i. k=0 which represents the ‘time-average reward’ received. .d. we can think of as a reward received when the chain is in state x. for reasonable choices of the functions G taking values in R. Then. given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been ﬁxed once and for all) the sequence of excursions Xa(1) . . Note that EG(Xa(1) ) = EG(Xa(2) ) = · · · = E G(Xa(0) ) | X0 = a = Ea G(Xa(0) ). We are interested in the existence of the limit time-average reward = lim 1 t→∞ t t f (Xk ). In particular. we have that G(Xa(1) ). . are i. with probability 1. Xa(2) . because the excursions Xa . G(Xa(r) ) = (r) (r+1) Ta ≤t<Ta f (Xt ). (r) (r+1) G(Xa(r) ) = max{f (Xt ) : Ta ≤ t < Ta }..2 Average reward Consider now a function f : S → R. for an excursion X . .d. in this example. we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit. Since functions of independent random variables are also independent. as in Example 2.1 Functions of excursions Namely. random variables. Xa(3) . . n as n → ∞.13. . 35 . Thus. are i. Example 2: Let f (x) be as in example 1. deﬁne G(X ) to be the maximum reward received during this excursion. (7) Whichever the choice of G. . (1) (2) (1) (n) (8) 13. Example 1: To each state x ∈ S. G(Xa(2) ).

We are looking for the total reward received between 0 and t. with probability one. We break the sum into the total reward received over the initial excursion. Let f : S → R+ be a bounded ‘reward’ function. We assumed that f is t bounded. t t as t → ∞. t as t → ∞. with probability one. Gﬁrst is the reward up to time Ta = Ta or t (whichever t is smaller). say. of complete excursions from the ground state a that occur between 0 and t. Then. Nt . Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i. Ea Ta < ∞) and assume that the distribution of X0 is such that P (Ta < ∞) = 1. Since P (Ta < ∞) = 1. Nt = 5. in the ﬁgure below. t t 36 . t t In other words. 0 Ta (1) Ta (2) T (3) a Ta(4) Ta (5) t We write the total reward as a sum of rewards over complete excursions plus a ﬁrst and a last term: t Nt −1 r=1 (1) f (Xn ) = n=0 Gr + Gﬁrst + Glast . (r) Nt := max{r ≥ 1 : Ta ≤ t}.e. plus the total reward received over the next excursion. the ‘time-average reward’ f exists and is a deterministic quantity: P where Ta −1 1 t→∞ t lim t f (Xn ) = f n=0 =1 . and Glast is the reward over the last incomplete cycle. and so on. We shall present the proof in a way that the constant f will reveal itself–it will be discovered. Hence |Gﬁrst | ≤ BTa . Ea f= n=0 f (Xn ) E a Ta = x∈S f (x)π(x). A similar argument shows that 1 last G → 0. and so 1 ﬁrst G → 0. say |f (x)| ≤ B. Proof. For example. The idea is this: Suppose that t is large enough. Thus we need to keep track of the number. Gr is given by (7).Theorem 14. we have BTa /t → 0.

d. as t → ∞. = E a Ta . r=1 with probability one. are i. and since Ta − Ta k = 1.i. random variables with common expectation Ea Ta . t→∞ t E a Ta (12) Using (10) and (12) in (9) we have 1 lim t→∞ t Thus 1 t→∞ t lim t Nt −1 Gr = r=1 EG1 . So we are left with the sum of the rewards over complete cycles. lim t∞ 1 Nt Nt −1 Gr = EG1 . 2. Since Ta → ∞. r=1 (9) Look again what the Law of Large Numbers (8) tells us: that 1 n→∞ n lim (r) n Gr = EG1 . we have that the visit with index r = Nt to the ground state a is the last before t (see also ﬁgure above): (N (N Ta t ) ≤ t < Ta t +1) . Hence.. Write now 1 t Nt −1 Gr = r=1 Nt 1 t Nt Nt −1 Gr . we have Ta r→∞ r lim (r) . E a Ta f (Xn ) = n=0 EG1 =: f . Since Ta /r → 0. . r=1 (10) Now look at the sequence r (r) (1) Ta = Ta + k=1 (k) (k−1) ) (Ta − Ta (1) (k) (k−1) and see what the Law of Large Numbers again. Hence Ta t t Ta t ≤ < Nt Nt Nt But (11) tells us that Ta t Ta t = lim t→∞ Nt t→∞ Nt lim Hence lim (N ) (N +1) (N ) (N +1) .also. (11) By deﬁnition. . . as r → ∞. 1 Nt = . E a Ta 37 . it follows that the number Nt of complete cycles before time t will also be tending to ∞. = E a Ta .

Ta −1 k=0 1 lim t→∞ t with probability 1. E b Tb 38 . just observe that EG1 = E (1) (1+1) Ta ≤t<Ta f (Xt ) = Ea 0≤t<Ta f (Xt ). Comment 4: Suppose that two states. it takes Ea Ta units of time between successive visits to the state a. The theorem above then says that. for all b ∈ S. a. Applying formula (14) with b in place of a we ﬁnd that the limit in (13) also equals 1/Eb Tb . f(X2 ) f(X1 ) f(X )= f(a) 0 f(T −1) a 0 T a Comment 2: A very important particular case occurs when we choose f (x) := 1(x = b). E a Ta (14) with probability 1. The physical interpretation of the latter should be clear: If. Note that we include only one endpoint of the excursion. i. Comment 1: It is important to understand what the formula says. on the average. b. (13) 1(Xn = b) = n=1 Comment 3: If in the above we let b = a. the ground state. which communicate and are positive recurrent. The numerator of f is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). t Ea 1(Xk = b) E a Ta . from the Strong Markov Property. we let f to be 1 at the state b and 0 everywhere else. We can use b as a ground state instead of a. not both. The denominator is the duration of the excursion. We may write EG1 = Ea 0≤t<Ta f (Xt ) or EG1 = Ea 0<t≤Ta f (Xt ). then the long-run proportion of visits to the state a should equal 1/Ea Ta .To see that f is as claimed. then we see that the numerator equals 1. The constant f is a sort of average over an excursion. The equality of these two limits gives: average no. Hence t→∞ lim 1 t t 1(Xn = a) = n=1 1 .e. of times state b is visited = Ea between two successive visits to state a Ta −1 1(Xk = b) = k=0 E a Ta .

In view of this. Indeed. we have 0 < ν [a] (x) < ∞. (15) should play an important role. x ∈ S. Fix a ground state a ∈ S and assume that it is recurrent5 . . by a function of X1 . X2 . Clearly. x ∈ S. Recurrence tells us that Ta < ∞ with probability one. X2 . i. The real reason for doing so is because we wanted to replace a function of X0 . . notice that the choice of notation ν [a] is justiﬁable through the notation (6) for the communicating class of the state a. Hence the superscript [a] will soon be dropped. Proof. Proposition 2. 39 . we can write ν (x) = Ea k=1 [a] Ta 1(Xn = x). We start the Markov chain with X0 = a. (16) This was a trivial but important step: We replaced the n = 0 term with the n = Ta term. X1 . Then ν [a] satisﬁes the balance equations: ν [a] = ν [a] P. for any state b such that a → x (a leads to x).e. ν [a] (x) = y∈S ν [a] (y)pyx . Note: Although ν [a] depends on the choice of the ‘ground’ state a. this should have something to do with the concept of stationary distribution discussed a few sections ago. Since a = XT a . we will soon realise that this dependence vanishes when the chain is irreducible. X3 . Moreover. Both of them equal 1. we saw (Comment 2 above) that if a is a positive recurrent state then ν [a] (x) E a Ta is the long-run fraction of times that state x is visited. . so there is nothing lost in doing so. 5 We stress that we do not need the stronger assumption of positive recurrence here. . . Consider the function ν [a] deﬁned in (15). .14 Construction of a stationary distribution Our discussions on the law of large numbers for Markov chains revealed that the quantity Ta −1 ν (x) := Ea n=0 [a] 1(Xn = x) . where a is the ﬁxed recurrent ground state whose existence was assumed.

for all x ∈ S. So we have shown ν [a] (x) = y∈S pyx ν [a] (y). xa so. We are now going to assume more. n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). Note: The last result was shown under the assumption that a is a recurrent state. Suppose that the Markov chain possesses a positive recurrent state a. Xn−1 = y. Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. indeed. Therefore. so there exists m ≥ 1. n ≤ Ta ) pyx Pa (Xn−1 = y. ax ax From Theorem 10. which are precisely the balance equations. We just showed that ν [a] = ν [a] P. let x be such that a x. ν [a] (x) < ∞. Deﬁne Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. We now have: Corollary 9. that a is positive recurrent. from (16). x ∈ S. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. (17) This π is a stationary distribution. in this last step. is ﬁnite when a is positive recurrent. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). so there exists n ≥ 1. such that pxa > 0. we used (16). for all n ∈ N. Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta .and thus be in a position to apply ﬁrst-step analysis. x a. First note that. such that pax > 0. n ≤ Ta ) Pa (Xn = x. namely. ν [a] = ν [a] Pn . by deﬁnition. 40 . where. which. (n) Next.

. if tails show up. The coins are i. R2 ) be the index of the ﬁrst (respectively. to decide whether j will occur in a speciﬁc excursion. then j occurs in the r-th excursion. and the probability gij = Pi (Tj < ∞) that j occurs in a speciﬁc excursion is positive. We already know that j is recurrent. then the set of linear equations π(x) = y∈S π(y)pyx . Then j is positive Proof. Proof. But π [a] (x) = x∈S 1 E a Ta ν [a] (x) = 1. in general. and so π [a] is not trivially equal to zero. . second) excursion that contains j. Theorem 15. unique. 1. (r) j. in Proposition 2. Also. Otherwise.i. since the excursions are independent. Let R1 (respectively. We have already shown. Since a is positive recurrent. i. Start with X0 = i. Suppose that i recurrent. Consider the excursions Xi . Hence j will occur in inﬁnitely many excursions. Therefore. In words. This is what we will try to do in the sequel. that ν [a] P = ν [a] . 15 Positive recurrence is a class property We now show that (just like recurrence) positive recurrence is also a class property. that it sums up to one. 2. But π [a] is simply a multiple of ν [a] . then the r-th excursion does not contain state j. x∈S Pause for thought: We have achieved something VERY IMPORTANT: if we have one. positive recurrent state. then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is. the event that j occurs in a speciﬁc excursion is independent from the event that j occurs in the next excursion. no matter which one.In other words: If there is a positive recurrent state. 41 . we have Ea Ta < ∞. It only remains to show that π [a] is a probability. we toss a coin with probability of heads gij > 0. If heads show up at the r-th excursion. π [a] P = π [a] .d. π(y) = 1 y∈S x∈S have at least one solution. r = 0. . Let i be a positive recurrent state.e.

Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞. . 2. . An irreducible Markov chain is called transient if one (and hence all) states are transient. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 . . An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.i. . But. we just need to show that R2 −R1 +1 ζ := r=1 Zr has ﬁnite expectation. . In other words. with the same law as Ti . where the Zr are i. . IRREDUCIBLE CHAIN r = 1. by Wald’s lemma.d. Corollary 10. R2 has ﬁnite expectation.R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . . Then either all states in C are positive recurrent. or all states are null recurrent or all states are transient. Let C be a communicating class.

we have. Theorem 16. Hence. indeed. n=1 43 . after all. Since the chain is irreducible and positive recurrent we have Pi (Tj < ∞) = 1. π(2) = 1 − α 2. Then P (Ta < ∞) = x∈S In other words. (18) π(x)Px (Ta < ∞) = π(x) = 1. By a theorem of Analysis (Bounded Convergence Theorem). what prohibits uniqueness. But. we can take expectations of both sides: E as t → ∞. Consider positive recurrent irreducible Markov chain. (You should think why this is OBVIOUS!) Notice that 1 This lack of communication is. recognising that the random variables on the left are nonnegative and bounded below 1. for all n. an absorbing state). Ta −1 1 t t Ea 1(Xk = x) k=0 1(Xn = x) n=1 → E a Ta = π [a] (x). Then. the π deﬁned by π(0) = α. Example: Consider the chain we’ve seen before: 1 1/3 1 0 1/2 1 1/6 2 Clearly. Let π be a stationary distribution. we start the Markov chain from the stationary distribution π. x ∈ S. for totally trivial reasons (it is. we have P (Xn = x) = π(x). x∈S By (13). state 1 (and state 2) is positive recurrent. x ∈ S. with probability one. as t → ∞. Proof. for all states i and j. But observe that there are many–inﬁnitely many–stationary distributions: For each 0 ≤ α ≤ 1. Consider the Markov chain (Xn . E 1 t t 1 t t 1(Xn = x) n=1 → π [a] (x). for all x ∈ S.16 Uniqueness of the stationary distribution At the risk of being too pedantic. is a stationary distribution. following Theorem 14. by (18). we stress that a stationary distribution may not be unique. 1(Xn = x) n=1 = 1 t t P (Xn = x) = π(x). n ≥ 0) and suppose that P (X0 = x) = π(x). Then there is a unique stationary distribution. π(1) = 0. Fix a ground state a. there is a stationary distribution.

In particular. This is in fact. Corollary 11. Thus. Assume that a stationary distribution π exists. Let i be such that π(i) > 0. Then any state i such that π(i) > 0 is positive recurrent. delete all transition probabilities pxy with x ∈ C. an arbitrary stationary distribution π must be equal to the speciﬁc one π [a] . 17 Structure of a stationary distribution It is not hard to show now what an arbitrary stationary distribution looks like. Let C be the communicating class of i. we obtain the cycle formula: Ta −1 Tb −1 Ea k=0 1(Xk = x) E a Ta = Eb k=0 1(Xk = x) E b Tb . Therefore Ei Ti < ∞. Decompose it into communicating classes. Then. In particular. Proof. 1 π(a) = . from the deﬁnition of π [a] . Consider the restriction of the chain to C. y ∈ C. i. There is a unique stationary distribution. π(C) x ∈ C. Corollary 12. Both π [a] and π[b] are stationary distributions. i is positive recurrent. π(a) = π [a] (a) = 1/Ea Ta . Consider a positive recurrent irreducible Markov chain and two states a.e. There is only one stationary distribution. By the 1 above corollary. But π(i|C) > 0. π(i|C) = Ei Ti . x ∈ S. the only stationary distribution for the restricted chain because C is irreducible. Consider an arbitrary Markov chain. We can easily see that we thus obtain a Markov chain with state space C with stationary distribution π(·|C). Hence there is only one stationary distribution. where π(C) = y∈C π(y). Proof. which is useful because it is often used as a criterion for positive recurrence: Corollary 13. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. Then π [a] = π [b] . Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π C 44 . for all a ∈ S. Let π(·|C) be the distribution deﬁned by restricting π to C: π(x|C) := π(x) . The following is a sort of converse to that. The cycle formula merely re-expresses this equality. so they are equal. Hence π = π [a] . Namely. b ∈ S. Consider a Markov chain. E a Ta Proof.Therefore π(x) = π [a] (x). for all x ∈ S.

which assigns positive probability to each state in C. This is deﬁned by picking a state (any state!) a ∈ C and letting π C := π [a] . where π(C) := π(C) π(x). such that C αC = 1. for example. There is dissipation of energy due to friction and that causes the settling down.1 Coupling and stability Deﬁnition of stability Stability refers to the convergence of the probabilities P (Xn = x) as n → ∞. Proposition 3. An arbitrary stationary distribution π must be a linear combination of these π C . consider the motion of a pendulum under the action of gravity. For each such C deﬁne the conditional distribution π(x|C) := π(x) . from physical experience. Stability of a deterministic dynamical system means that the system eventually converges to a ﬁxed point (or. A Markov chain is a simple model of a stochastic dynamical system. Why stability? There are physical reasons why we are interested in the stability of a Markov chain. take an = (−1)n . A sequence of real numbers an k=1 1 may not converge but its Cesaro averages n (a1 + · · · + an ) may converge. 18 18. This is the stable position. In an ideal pendulum. We all know. that. So we have that the above decomposition holds with αC = π(C). the pendulum will perform an 45 . it will settle to the vertical position. Proof. By our uniqueness result. x ∈ C) is a stationary distribution. under irreducibility and positive recurrence conditions. x∈C Then (π(x|C). more generally. where αC ≥ 0. π(x|C) ≡ π C (x). If π(x) > 0 then x belongs to some positive recurrent class C. without friction. that the pendulum will swing for a while and. Any stationary distribution π is necessarily of the form π= C: C positive recurrent communicating class αC π C . To each positive recurrent communicating class C there corresponds a stationary distribution π C . So far we have seen. that it remains in a small region). the so-called Cesaro 1 averages n n P (Xk = x) converge to π(x). For example. as n → ∞. Let π be an arbitrary stationary distribution. eventually.

so I’ll leave it upon the reader to decide whether he or she wants to consider this as an ﬁnancial example. x ∈ R. as t → ∞. . for we can do it: X1 = ρX0 + ξ0 X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0 ) + ξ1 = ρ2 X0 + ρξ0 + ξ1 X3 = ρX2 + ξ2 = ρ3 X0 + ρ2 ξ0 + ρξ1 + ξ2 Xn = ρn X0 + ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 .e. The simplest diﬀerential equation As another example. . consider the following diﬀerential equation x = −ax + b. . mutually independent. . ξ1 . And we are in a happy situation here. Since we are prohibited by the syllabus to speak about continuous time stochastic systems. We shall assume that the ξn have the same law. ξ2 . We say that x∗ is a stable equilibrium. the one found by setting the right-hand side equal to zero: x∗ = b/a. 4 3 2 x* 1 0 0 1 2 t 3 t4 If we start with x(t = 0) = x∗ . However notice what happens when we solve the recursion. by means of the recursion above. we have x(t) → x∗ . ξ0 . X2 .oscillatory motion ad inﬁnitum. . Still. let us consider a stochastic analogue of the above in discrete time: Xn+1 = ρXn + ξn . A linear stochastic system We now pass on to stochastic systems. it cannot mean convergence to a ﬁxed equilibrium point. The thing to observe here is that there is an equilibrium point i. random variables. moreover. we may it is stable because it does not (it cannot!) swing too far away. But what does stability mean? In general. ˙ There are myriads of applications giving physical interpretation to this. . If. Speciﬁcally. are given. or whatever. ······ 46 . simple because the ξn ’s depend on the time parameter n. a > 0. then regardless of the starting position. we here assume that X0 . It follows that |ρ| < 1 is the condition for stability. then x(t) = x∗ for all t > 0. and deﬁne a stochastic sequence X1 . or an example in physics.

18. for instance. but we are not given what P (X = x. positive recurrent and aperiodic with (unique) stationary distribution π then n→∞ lim P (Xn = x) = π(x). we need to understand the notion of coupling. Y but not their joint law. In other words. 18. suppose you are given the laws of two random variables X. for any initial distribution. under nice conditions. . deﬁne stability in a way other than convergence of the distributions of the random variables. Y = y) is. the limit of its distribution does: ˜ ˆ ˆ lim P (Xn ≤ x) = lim P (Xn ≤ x) = lim P (Xn ≤ x) = P (X∞ ≤ x) n→∞ n→∞ n→∞ This is precisely the reason why we cannot. Y = y) = P (X = x)P (Y = y). ∞ ˆ The nice thing is that. while the limit of Xn does not exists. assume that the ξn are bounded) Xn does converge to ˆ X∞ = k=0 ρk ξk . which is a linear combination of ξ0 . Starting at an elementary level. i. A straightforward (often not so useful) construction is to assume that the random variables are independent and deﬁne P (X = x. . .3 Coupling To understand why the fundamental stability theorem works. (e. to try to make the coincidence probability P (X = Y ) as large as possible. we have ˜ ˆ P (Xn ≤ x) = P (Xn ≤ x). ξn−1 does not converge. we have that the ﬁrst term goes to 0 as n → ∞. the n-step transition matrix Pn .g. The second term. In particular. as n → ∞. Y = y) is? 47 . If a Markov chain is irreducible. ˜ Xn := ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . converges. What this means is that. . But suppose that we have some further requirement.2 The fundamental stability theorem for Markov chains Theorem 17. j ∈ S. to a matrix with rows all equal to π: (n) lim p n→∞ ij = π(j). for time-homogeneous Markov chains. but notice that if we let ˆ Xn := ρn−1 ξn−1 + ρn−2 ξn−2 + · · · + ρξ1 + ξ0 . we know f (x) = P (X = x). How should we then deﬁne what P (X = x.Since |ρ| < 1. g(y) = P (Y = y). Coupling refers to the various methods for constructing this joint law. uniformly in x.

e. respectively. y). in particular. g(y) = P (Y = y). y).Exercise: Let X. but with the roles of X and Y interchanged. A coupling of them refers to a joint construction on the same probability space. This T is a random variables that takes values in the set of times or it may take values +∞. (19) as n → ∞. x ∈ S. Suppose that we have two stochastic processes as above and suppose that they have been coupled. we obtain P (Yn = x) − P (Xn = x) ≤ P (T > n). and hence P (Xn = x) − P (Yn = x) ≤ P (T > n). n < T ) + P (Xn = x. Let T be a meeting of two coupled stochastic processes X. Since this holds for all x ∈ S and since the right hand side does not depend on x. T is ﬁnite with probability one. for all n ≥ 0. n ≥ 0) and the law of a stochastic process Y = (Yn . but we are not told what the joint probabilities are. n ≥ 0. n < T ) + P (Yn = x. we can write it as sup |P (Xn = x) − P (Yn = x)| ≤ P (T > n). then |P (Xn = x) − P (Yn = x)| → 0. Proof. The following result is the so-called coupling inequality: Proposition 4. |P (Xn = x) − P (Yn = x)| ≤ P (T > n). and which maximises P (X = Y ). i. g(y) = x h(x. and all x ∈ S. The coupling we are interested in is a coupling of processes. Y . on the event that the two processes never meet. Find the joint law h(x. n ≥ 0). Then. If.e. We are given the law of a stochastic process X = (Xn . Note that it makes sense to speak of a meeting time. y) = P (X = x. Combining the two inequalities we obtain the ﬁrst assertion: |P (Xn = x) − P (Yn = x)| ≤ P (T > n). n ≥ T ) = P (Xn = x. P (T < ∞) = 1. n ≥ 0. Y = y) which respects the marginals. n ≥ T ) ≤ P (n < T ) + P (Yn = x). precisely because of the assumption that the two processes have been constructed together. x∈S 48 . Repeating (19). i. We say that ransom time T is meeting time of the two processes if Xn = Yn for all n ≥ T. We have P (Xn = x) = P (Xn = x. uniformly in x ∈ S. f (x) = y h(x. Y be random variables with values in Z and laws f (x) = P (X = x).

y′ ) for all large n. Then n→∞ lim P (T > n) = P (T = ∞) = 0. they diﬀer only in how the initial states are chosen. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities. 49 (n) (n) (n) (n) (n) (n) = P (Xn+1 = x′ | Xn = x) P (Yn+1 = y ′ | Yn = y) = px. n ≥ 0). denoted by Y = (Yn . y) := π(x)π(y). y) = µ(x)π(y). then.x′ py. in other words. the transition probabilities. denoted by X = (Xn .(y.x′ ). positive recurrent and aperiodic.y′ ) := P Wn+1 = (x′ . It is very important to note that we have only deﬁned the law of X and the law of Y but we have not deﬁned a joint law. Yn ). Its (1-step) transition probabilities are q(x. Notice that σ(x. From Theorem 3 and the aperiodicity assumption. Consider the process Wn := (Xn .) By positive recurrence.y′ . Its initial state W0 = (X0 .(y. x∈S as n → ∞. n ≥ 0) has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. and it also makes sense to consider their ﬁrst meeting time: T := inf{n ≥ 0 : Xn = Yn }. . n ≥ 0) is a Markov chain with state space S × S. as usual.y′ .y′ for all large n. y) ∈ S × S. n ≥ 0). denoted by π. The ﬁrst process.y′ ) = px. The second process. Then (Wn .(y. is a stationary distribution for (Wn . n ≥ 0) has the law of a Markov chain with transition probabilities pij and Y0 distributed according to the stationary distribution µ. Therefore.x′ ). Y0 ) has distribution P (W0 = (x. for all n ≥ 1. sup |P (Xn = x) − P (Yn = x)| → 0. we have not deﬁned joint probabilities such as P (Xn = x. we have that px.4 Proof of the fundamental stability theorem First. We know that there is a unique stationary distribution. n ≥ 0) is an irreducible chain.x′ > 0 and py.x′ ). review the assumptions of the theorem: the chain is irreducible. Xn and Yn would be independent and distributed according to π. y) Its n-step transition probabilities are q(x. 18. had we started with both X0 and Y0 independent and distributed according to π. implying that q(x. (x.Assume now that P (T < ∞) = 1. y ′ ) | Wn = (x. n ≥ 0) is independent of Y = (Yn . Having coupled them. Xm = y). π(x) > 0 for all x ∈ S. (Indeed. We shall consider the laws of two processes. Let pij denote. and so (Wn .x′ py. we can deﬁne joint probabilities. We now couple them by doing the straightforward thing: we assume that X = (Xn .

For example.1).0). Z0 = X0 which has an arbitrary distribution µ. then the process W = (X. suppose that S = {0. Zn equals Xn before the meeting time of X and Y . Y ) may fail to be irreducible. 0). by assumption. 50 . Hence (Zn .1). n ≥ 0) is identical in law to (Xn . Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X. In particular. say. precisely because P (T < ∞) = 1. if we modify the original chain and make it aperiodic by. P (Xn = x) = P (Zn = x). Obviously.Therefore σ(x.0). However. uniformly in x. 1). |P (Zn = x) − P (Yn = x)| ≤ P (T > n).(1.99. p10 = 1. n ≥ 0) is positive recurrent. 1} and the pij are p01 = p10 = 1.(1. Then the product chain has state space S ×S = {(0. (1. and since P (Yn = x) = π(x).01. for all n and x. 0). after the meeting time. ≥ 0) is a Markov chain with transition probabilities pij .1) = 1. p00 = 0. (Zn .(0. Hence (Wn . n ≥ 0). y) > 0 for all (x. Deﬁne now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ). (1. Moreover. Since P (Zn = x) = P (Xn = x). By the coupling inequality. y) ∈ S × S.1) = q(1. T is a meeting time between Z and Y . as n → ∞. which converges to zero.(0. (0. we have |P (Xn = x) − π(x)| ≤ P (T > n). 1)} and the transition probabilities are q(0. Clearly. P (T < ∞) = 1. Zn = Yn . X arbitrary distribution Z stationary distribution Y T Thus.0) = q(0. p01 = 0. This is not irreducible. In particular.0) = q(1.

The reader is asked to pay attention to this terminology which is diﬀerent from the one used in the Markov chain literature. and this is expressed by our terminology. It is not diﬃcult to see that 6 We put “wrong” in quotes. . Then Xn = 0 for even n and = 1 for odd n.497. Therefore p00 = 1 for even n (n) and = 0 for odd n. . However. Parry. But this is “wrong”.01. π(0) ≈ 0. It is. however. Cd−1 .f. such that the chain moves cyclically between them: from C0 to C1 . then (n) lim p n→∞ 00 (n) = lim p10 = π(0). By the results of Section 14. we can decompose the state space into d classes C0 .503.503 0.4 and. i. n→∞ (n) where π(0). 0. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing. customary to have a consistent terminology. from C1 to C2 . most people use the term “ergodic” for an irreducible.e. 6 The reason is that the term “ergodic” means something speciﬁc in Mathematics (c. Theorem 4. Indeed. In matrix notation. from Cd back to C0 . Suppose X0 = 0. 18. Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. π(1) satisfy 0. π(1) ≈ 0.99. .then the product chain becomes irreducible. Suppose that the period of the chain is d. p01 = 0. Terminology cannot be. p10 = 1. per se wrong. and so on.503 0. we know we have a unique stationary distribution π and that π(x) > 0 for all states x.497 18. C1 .497 . if we modify the chain by p00 = 0. . 1981). 51 .5 Terminology A Markov chain which is irreducible and positive recurrent will be called ergodic. in particular. positive recurrent and aperiodic Markov chain. n→∞ (n) (n) lim p n→∞ 11 = lim p01 = π(1). π(0) + π(1) = 1. Thus. limn→∞ p00 does not exist. By the results of §5.99π(0) = π(1).6 Periodic chains Suppose now that we have a positive recurrent irreducible chain. n→∞ lim p00 p01 (n) (n) p10 p11 (n) (n) = 0.

we have pij for all i. Then pij (r+nd) (nd) = Pi (Xnd = j) → dπ(j). and the previous results apply. i ∈ S. Proof. j ∈ Cℓ . i ∈ Cr . Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d. . So π(i) = j∈Cr−1 π(j)pji . as n → ∞. n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain. .Lemma 8. each one must be equal to 1/d. Summing up over i ∈ Cr we obtain αr = i∈Cr π(i) = j∈Cr−1 π(j) i∈Cr pji = j∈Cr−1 π(j) = αr−1 . Hence if we restrict the chain (Xnd . Can we also ﬁnd the limit when i and j do not belong to the same class? Suppose that i ∈ Ck . namely C0 . i ∈ Cr . . . Since the αr add up to 1. . . Since (Xnd . where π is the stationary distribution. Cd−1 . Then. Suppose that X0 ∈ Cr . 52 . d − 1. if sampled every d steps. 1. Then pji = 0 unless j ∈ Cr−1 . All states have period 1 for this chain. j ∈ Cr . starting from i. Let αr := i∈Cr π(i). thereafter. . → dπ(j). j ∈ Cℓ . n ≥ 0) consists of d closed communicating classes. the original chain enters Cℓ in r = ℓ − k steps (where r is taken to be a number between 0 and d − 1. i∈Cr for all r = 0. . the (unique) stationary distribution of (Xnd . The chain (Xnd . We shall show that Lemma 9. as n → ∞. after reduction by d) and. and let r = ℓ − k (modulo d). n ≥ 0) is dπ(i). This means that: Theorem 18. Let i ∈ Ck . π(i) = 1/d. Suppose i ∈ Cr . We have that π satisﬁes the balance equations: π(i) = j∈S π(j)pji . it remains in Cℓ . Then. n ≥ 0) is aperiodic.

We also saw in Corollary (n) 7 that. Since they are (n2 ) subsequence n2 of n1 such that limk→∞ p1 k k k (nk ) (n) as n → ∞. for all i ∈ S. the ﬁrst of which is known as HellyBray Lemma: Lemma 10.e. . . 2. p2 . We have. for all i ∈ S. 2. 2. for each n = 1. whose proof we omit7 : 7 See Br´maud (1999). k = 1. as k → ∞. k = 1. i. if j is transient. . . p3 . such that limk→∞ pi k exists and equals. . the sequence n2 of the second row k is a subsequence of n1 of the ﬁrst row.) is a k k subsequence of each (ni . So the diagonal subsequence (nk . . Now consider the contained between 0 and 1 there must be a exists and equals. p1 p1 (n1 ) 1 p1 (n1 ) 2 (n2 ) 2 p1 (n1 ) 3 (n2 ) 3 (n3 ) 3 (n2 ) 1 (n3 ) 1 ··· ··· ··· ··· → → → ··· p1 p2 p3 ··· p1 p1 ··· p1 p1 ··· (n3 ) 2 p1 ··· Now pick the numbers in the diagonal. such that limk→∞ p1 k 3 (n1 ) numbers p2 k . The proof will be based on two results from Analysis. and so on. say. . By construction. . Recall that we know what happens when j is positive recurrent and i is in the same commu(n) nicating class as j If the period is 1. the sequence n3 of the third row is a subsequence k k of n2 of the second row. are contained between 0 and 1. . where = 1 and pi ≥ 0 for all i. pi . 19. schematically. Then we can ﬁnd a i=1 pi sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi Proof. Hence pi k i. k = 1. . e 53 . k = 1. a probability distribution on N. If the period is d then the limit exists only along a selected subsequence (Theorem 18). exists for all i ∈ N. there must be a subsequence exists and equals. say. 2. .. For each i we have found subsequence ni . as n → ∞. for each i. Since the numbers p1 n1 1 n1 2 (n) (n1 ) < < n1 < · · · . If j is a null recurrent state then pij → 0.. . (nk ) k → pi . . It is clear how we (ni ) k continue. p1 . say. . then pij → π(j) (Theorems 17).19 Limiting theory for Markov chains* We wish to complete our discussion about the possible limits of the n-step transition prob(n) abilities pij . . .1 Limiting probabilities for null recurrent states Theorem 19. p(n) = (n) (n) (n) (n) (n) ∞ p1 . Consider. p2 . .). . . 2. then pij → 0 as n → ∞. for each The second result is Fatou’s lemma.

Suppose that for some i. π is a stationary distribution. i=1 (n) Proof of Theorem 19. are sequences of nonnegative numbers then ∞ (n) yi lim inf xi i=1 n→∞ (n) ∞ ≤ lim inf n→∞ yi x i .e. pix k . αy pyx . and this is a contradiction (the number αx = αx cannot be strictly larger than itself). i. If (yi . (n ) (n ) Consider the probability vector By Lemma 10. Let j be a null recurrent state. i. pix k = 1. Now π(j) > 0 because αj > 0. (xi .e. we have 0< x∈S The inequality follows from Lemma 11. n = 1. state j is positive recurrent: Contradiction.e. i ∈ N). i. αx ≤ lim inf k→∞ (n′ ) for all x ∈ S. and the last equality is obvious since we have a probability vector. 2. (n′ ) lim pix k (n′ +1) ≥ y∈S k→∞ lim piy k pyx . y∈S x ∈ S. by Lemma 11. Hence we can ﬁnd a subsequence nk such that k→∞ lim pij k = αj > 0. . But Pn+1 = Pn P. for some x0 ∈ S.Lemma 11. . the sequence (n) pij does not converge to 0. x ∈ S. equalities. we can pick subsequence (n′ ) of (nk ) such that k k→∞ lim pix k = αx . By Corollary 13. it is a strict inequality: αx0 > y∈S αy pyx0 . we have that π(x) := αx y∈S αy satisﬁes the balance equations and has x∈S π(x) = 1. We wish to show that these inequalities are. Thus. 54 . This would imply that αx > x∈S x∈S y∈S x∈S αy pyx = y∈S αy x∈S pyx = y∈S αy . actually. (n′ ) Hence αx ≥ αy pyx . . x∈S (n′ ) Since aj > 0. y∈S Since x∈S αx ≤ 1. x ∈ S . i ∈ N). by assumption. Suppose that one of them is not. pix k Therefore. k→∞ (n′ +1) = y∈S piy k pyx .

and fij = ϕC (j). starting with the ﬁrst hitting time decomposition relation of Lemma 7. In general. But Theorem 17 tells us that the limit exists regardless of the initial distribution.2. as n → ∞.2 The general case (n) It remains to see what the limit of pij is. we have pij = Pi (Xn = j) = Pi (Xn = j. so we will only look that the aperiodic case. Example: Consider the following chain (0 < α. HC ≤ n) n (n) = m=1 Pi (HC = m)Pi (Xn = j | HC = m). In this form. Let µ be the distribution of XHC given that X0 = i and that {HC < ∞}.19. Let i be an arbitrary state. and fij = Pi (Tj < ∞). which is what we need. Let HC := inf{n ≥ 0 : Xn ∈ C}. Remark: We can write the result of Theorem 20 as (n) lim p n→∞ ij = fij . γ < 1): γ 1−γ 1 1 2 1−α α 3 β 4 1−β 5 1 55 . the result can be obtained by the so-called Abel’s Lemma of Analysis. Indeed. as in §10. Hence (n) lim p n→∞ ij ∞ = π (j) m=1 C Pi (HC = m) = π C (j)Pi (HC < ∞). Proof. From the Strong Markov Property. π C (j) = 1/Ej Tj . This is a more old-fashioned way of getting the same result. Pµ (Xn−m = j) → π C (j). we have Pi (Xn = j | HC = m) = Pµ (Xn−m = j). Then (n) lim p n→∞ ij = ϕC (i)π C (j). If i also belongs to C then ϕC (i) = 1 and so the result follows from Theorem 17. E j Tj where Tj = inf{n ≥ 1 : Xn = j}. Let ϕC (i) := Pi (HC < ∞). Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π C be the stationary distribution associated with C. β. Theorem 20. The result when the period j is larger than 1 is a bit awkward to formulate. that is. when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. as n → ∞.

The phase (or state) space of a particle is not just a physical space but it includes. C1 = {1. p41 → ϕ(4)π C1 (1). ϕ(3) = (1 − α)ϕ(2) + αϕ(4) . π C1 (2) = . we assumed the existence of a positive recurrent state. C2 = {5} (absorbing state). 56 (20) x∈S . in Theorem 14. Similarly. 4} (inessential states). 1+γ 1+γ Let ϕ(i) be the probability that class C1 will be reached starting from state i. 20 Ergodic theorems An ergodic theorem is a result about time-averages of random processes. We have ϕ(1) = ϕ(2) = 1. π C2 (5) = 1. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P (Ta < ∞) = 1. while. The subject of Ergodic Theory has its origins in Physics. We will avoid to give a precise deﬁnition of the term ‘ergodic’ but only mention what constitutes a result in the area. ϕ(4) = . The stationary distributions associated with the two closed classes are: 1 γ π C1 (1) = . C2 are aperiodic classes. (n) (n) p43 → 0.The communicating classes are I = {3. 2} (closed class). among others. (n) (n) p34 → 0. the 19-th century French mathematicians Henri Poincar´ and e Joseph Liouville. p44 → 0. Consider the quantity ν [a] (x) deﬁned as the average number of visits to state x between two successive visits to state a–see 16. We now generalise this. 1 − αβ 1 − αβ Both C1 . (n) (n) p33 → 0. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its “phase space” is proportional to the volume of this region. ϕ(3) = p31 → ϕ(3)π C1 (1). And so is the Law of Large Numbers itself (Theorem 13). for example. Recall that. (n) (n) p32 → ϕ(3)π C1 (2). whence 1−α β(1 − α) . ϕ(4) = (1 − β)ϕ(5) + βϕ(3). p42 → ϕ(4)π C1 (2). ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. as n → ∞. Theorem 21. For example. from ﬁrst-step analysis. p45 → (1 − ϕ(4))π C2 (5). Let f be a reward function such that |f (x)|ν [a] (x) < ∞. (n) (n) p35 → (1 − ϕ(3))π C2 (5). ϕ(5) = 0. Pioneers in the are were. Ergodic Theory is nowadays an area that combines many ﬁelds of Mathematics and Science. Theorem 14 is a type of ergodic theorem. information about its velocity. Hence.

and this is left as an exercise. Now. So if we divide by t the numerator and denominator of the fraction in (21). Replacing n by the subsequence Nt . Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem Theorem 14 holds. From proposition 2 we know that the balance equations ν = νP have at least one solution. r=0 with probability 1 as long as E|G1 | < ∞. So there are always recurrent states. while Gﬁrst . which is the quantity denoted by f in Theorem 14. 57 . We only need to check that EG1 = f . we obtain the result. as we saw in the proof of Theorem 14. and this justiﬁes the terminology of §18. If so. t t→∞ t t t with probability one. Remark 1: The diﬀerence between Theorem 14 and 21 is that. then.Then P where f := x∈S t→∞ lim t n=0 t n=0 f (Xn ) 1(Xn = a) =f =1 . we have Nt /t → 1/Ea Ta . Then at least one of the states must be visited inﬁnitely often. we assumed positive recurrence of the ground state a. in the former. (21) f (x)ν [a] (x). The reader is asked to revisit the proof of Theorem 14. so the numerator converges to f /Ea Ta . As in Theorem 14. The Law of Large Numbers tells us that 1 lim n→∞ n n−1 Gr = EG1 . t t where Gr = T (r) ≤t<T (r+1) f (Xt ).5. we have that the denominator equals Nt /t and converges to 1/Ea Ta . we have t→∞ lim 1 last 1 G = lim Gﬁrst = 0. But this is precisely the condition (20). 21 Finite chains We take a brief look at some things that are speciﬁc to chains with ﬁnitely many states. we break the total reward as follows: t Nt −1 f (Xn ) = n=0 r=1 Gr + Gﬁrst + Glast . Glast are the total rewards over the ﬁrst and t t a a last incomplete cycles. Note that the denominator t n=0 1(Xn = a) equals the number Nt of visits to state a up to time t. Proof. since the number of states is ﬁnite. respectively. we have C := i∈S ν(i) < ∞. As in the proof of Theorem 14.

and run the ﬁlm backwards. in a sense. Then the chain is reversible if and only if the detailed balance equations hold: π(i)pij = π(j)pji . the stationary distribution is unique. as n → ∞. It is. By Corollary 13. in addition. . X0 = j). At least some state i must have π(i) > 0. X1 = j) = P (X1 = i. Consider an irreducible Markov chain with ﬁnitely many states. X1 ) has the same distribution as (X1 . . running backwards. we have Pn → Π. simply. the chain is aperiodic. like playing a ﬁlm backwards. it will look the same. X1 . Therefore π is a stationary distribution. . . i. Xn ) has the same distribution as (Xn . because positive recurrence is a class property (Theorem 15). j ∈ S. X0 ).Hence 1 ν(i). Xn ) has the same distribution as (Xn . actually. or. We say that it is time-reversible. . we see that (X0 . For example. Xn−1 . then we instantly know that the ﬁlm is played in reverse. . . . . Theorem 23. If. More precisely. that is. X0 ) for all n. for all n. . reversible if it has the same distribution when the arrow of time is reversed. Proof. C satisﬁes the balance equations and the normalisation condition i∈S π(i) = 1. 58 . the chain is irreducible. where P is the transition matrix and Π is a matrix all the rows of which are equal to π. Taking n = 1 in the deﬁnition of reversibility. In particular. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. Proof. Reversibility means that (X0 . If. P (X0 = i. We summarise the above discussion in: Theorem 22. without being able to tell that it is. i. A reversible Markov chain is stationary. this state must be positive recurrent. (X0 . . Xn has the same distribution as X0 for all n. . Lemma 12. a ﬁnite Markov chain always has positive recurrent states. then all states are positive recurrent. . In addition. if we ﬁlm a standing man who raises his arms up and down successively. Xn−1 . π(i) := Hence.e. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is positive recurrent with a unique stationary distribution π. X1 . Because the process is Markov. X0 ). Suppose ﬁrst that the chain is reversible. But if we ﬁlm a man who walks and run the ﬁlm backwards. by Theorem 16. in addition. n ≥ 0). . . 22 Time reversibility Consider a Markov chain (Xn . i ∈ S. the ﬁrst components of the two vectors must have the same distribution. this means that it is stationary. .

Theorem 24. Xn ) has the same distribution as (Xn . X2 ) has the same distribution as (X2 . X1 = i) = π(j)pji . suppose that the KLC (22) holds. The DBE imply immediately that (X0 .for all i and j. X1 . So the detailed balance equations (DBE) hold. suppose the DBE hold. i2 . X0 ). We will show the Kolmogorov’s loop criterion (KLC). By induction. Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. π(i)pij pjk pki = [π(j)pji ]pjk pki = [π(j)pjk ]pji pki = [π(k)pkj ]pji pki = [π(k)pki ]pji pkj = [π(i)pik ]pji pkj = π(i)pik pkj pji Since the chain is irreducible. Conversely. Hence cancelling the π(i) from the above display we obtain the KLC for m = 3. then. X2 = i). . . . X2 = k) = P (X0 = k. Then the chain is reversible if and only if the Kolmogorov’s loop criterion holds: for all i1 . Pick 3 states i. . . Suppose ﬁrst that the chain is reversible. X1 = j. . Conversely. These are the DBE. X1 = j. pi1 i2 pi2 i3 · · · pim−1 im = pi1 im pim im−1 · · · pi2 i1 . Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. using the DBE. . So the chain is reversible. . X0 ). . X0 ). P (X0 = j. im−1 we obtain (m−3) (m−3) pi1 i2 pi2 i1 = pi1 i2 pi2 i1 If the chain is aperiodic. X1 . then the chain cannot be reversible. . The same logic works for any m and can be worked out by induction. But P (X0 = i. for all i. using DBE. j. and hence (X0 . we have π(i) > 0 for all i. X1 ) has the same distribution as (X1 . (X0 . Now pick a state k and write. pji 59 (m−3) → π(i1 ). and write. Hence the DBE hold. we have separation of variables: pij = f (i)g(j). i4 . and pi1 i2 (m−3) → π(i2 ). im ∈ S and all m ≥ 3. k. Summing up over all choices of states i3 . letting m → ∞. . Xn−1 . X1 = j) = π(i)pij . 2) The DBE are equivalent to saying that the ratio pij /pji can be written as a product of a function of i and a function of j. we have pi2 i1 implying that π(i1 )pi1 i2 = π(i2 )pi2 i1 . and the chain is reversible. π(i)pij pjk = [π(j)pji ]pjk = [π(j)pjk ]pji = π(k)pkj pji . . if there is an arrow from i to j in the graph of the chain but no arrow from j to i. k. j. . and this means that P (X0 = i. X1 . . for all n. (22) Proof. . . . it is easy to show that. In other words.

and by replacing. the graph becomes disconnected.e. and. j for which pij > 0. Hence the original chain is reversible. then we need. The reason is as follows. the two oriented edges from i to j and from j to i by a single edge without arrow. A) (see §4. Pick two distinct states i. And hence π can be found very easily. namely the arrow from i to j. i. and the arrow from j to j is the only one from Ac to A.1) which gives π(i)pij = π(j)pji . There is obviously only one arrow from AtoAc . meaning that it has no loops. Example 1: Chain with a tree-structure. If we remove the link between them. consider the chain: Replace it by: This is a tree. there would be a loop. by assumption. that the chain is positive recurrent. in addition. Hence the chain is reversible.) Necessarily.(Convention: we form this ratio only when pij > 0. j for which pij > 0. 60 . If the graph thus obtained is a tree. pji g(i) Multiplying by g(i) and summing over j we ﬁnd g(i) = j∈S g(j)pji . (If it didn’t. to guarantee. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. then the chain is reversible. so we have pij g(j) = . The balance equations are written in the form F (A. Form the graph by deleting all self-loops.) Let A. Ac ) = F (Ac . using another method. For example. the detailed balance equations. • If the chain has inﬁnitely many states. Ac be the sets in the two connected components. there isn’t any. for each pair of distinct states i. So g satisﬁes the balance equations. This gives the following method for showing that a ﬁnite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. f (i)g(i) must be 1 (why?).

so both members are equal to C. Consider an undirected graph G with vertices S. π(i) = 1. p02 = 1/4. For each state i deﬁne its degree δ(i) = number of neighbours of i. The stationary distribution assigns to a state probability which is proportional to its degree.Example 2: Random walk on a graph. We claim that π(i) = Cδ(i). A RWG is a reversible chain. If i. The stationary distribution of a RWG is very easy to ﬁnd. States j are i are called neighbours if they are connected by an edge. 0. etc. 61 . 2. For example. Now deﬁne the the following transition probabilities: pij = 1/δ(i). The constant C is found by normalisation.e. if j is a neighbour of i otherwise. There are two cases: If i. To see this let us verify that the detailed balance equations hold. we have that C = 1/2|E|. where C is a constant. Each edge connects a pair of distinct vertices (states) without orientation. Suppose the graph is connected. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). i∈S Since δ(i) = 2|E|. and a set E of undirected edges. j are not neighbours then both sides are 0. i∈S where |E| is the total number of edges. In all cases. π(i)pij = π(j)pji . Hence we conclude that: 1. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. i. a ﬁnite set. the DBE hold.

2. The sequence (Yk . in addition.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. 2. . . k = 0. .e.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . how can we create new ones? 23.i.e. i.) has the Markov property but it is not time-homogeneous. this process is also a Markov chain and it does have time-homogeneous transition probabilities. In this case. . . Deﬁne Yr := XT (r) . The state space of (Yr ) is A. Let (St . 1. . π is a stationary distribution for the original chain. 1. . i. random variables with values in N and distribution λ(n) = P (ξ0 = n). (m) (m) 23. 1. . . a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk .d. Let A be (r) a set of states and let TA be the r-th visit to A.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. . k = 0. St+1 = St + ξt . 1. 2. .3 Subordination Let (Xn . . If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). if nk = n0 + km. . A r = 1. irreducible and positive recurrent). .23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). where pij are the m-step transition probabilities of the original chain. If the subsequence forms an arithmetic progression. Due to the Strong Markov Property. 62 n ∈ N. . then it is a stationary distribution for the new chain. n ≥ 0) be Markov with transition probabilities pij . then (Yk . where ξt are i. t ≥ 0) be independent from (Xn ) with S0 = 0. . 2. k = 0. if. .e. k = 0. 23. t ≥ 0. 2. in general.

a Markov chain. . in general.) More generally. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. then the process Yn = f (Xn ). ∞ λ(n)pij . . α1 . Yn−1 ) = P (Yn+1 = β | Yn = α). β. .Then Yt := XSt .4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . (Notation: We will denote the elements of S by i. . . . Yn−1 = αn−1 }. and the elements of S ′ by α. . that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . . Indeed. we have: Lemma 13. Fix α0 .e. . i. n≥0 63 .. j. Proof. β). . knowledge of f (Xn ) implies knowledge of Xn and so. f is a one-to-one function then (Yn ) is a Markov chain. . Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). . (n) 23. If. We need to show that P (Yn+1 = β | Yn = α. Then (Yn ) has the Markov property. . conditional on Yn the past is independent of the future. is NOT. however. ..

Thus (Yn ) is Markov with transition probabilities Q(α. regardless of whether i > 0 or i < 0. if i = 0. β). β) i∈S P (Yn = α|Xn = i. α > 0 Q(α. β). a Markov chain with transition probabilities Q(α. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. β ∈ Z+ . β) P (Yn = α|Yn−1 ) i∈S = Q(α. Yn = α. On the other hand.e. conditional on {Xn = i}. β). observe that the function | · | is not one-to-one. i. Yn−1 ) Q(f (i). β) P (Yn = α. Then (Yn ) is a Markov chain. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. if we let 1/2. We have that P (Yn+1 = β|Xn = i) = Q(|i|. α = 0. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). To see this. But we will show that P (Yn+1 = β|Xn = i). because the probability of going up is the same as the probability of going down. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. indeed. Deﬁne Yn := |Xn |. if β = 1. But P (Yn+1 = β | Yn = α. Indeed. β)P (Xn = i | Yn = α. β). 64 . In other words. Example: Consider the symmetric drunkard’s random walk. Yn−1 )P (Xn = i | Yn = α. β) := 1. and so (Yn ) is. i ∈ Z.for all choices of the variables involved. and if i = 0. then |Xn+1 | = 1 for sure. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. if β = α ± 1. Yn−1 ) Q(f (i). β) P (Yn = α|Xn = i. i ∈ Z. depends on i only through |i|. for all i ∈ Z and all β ∈ Z+ . P (Xn+1 = i ± 1|Xn = i) = 1/2.

If we let ξ (n) = ξ1 . . Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. if 0 is never reached then Xn → ∞. are i. in general. In other words. a hard problem. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. Hence all states i ≥ 1 are transient. n ≥ 0) has the Markov property. (and independent of the initial population size X0 –a further assumption). j In this case. .. the population cannot grow: it will eventually reach 0. . The state 0 is irrelevant because it cannot be reached from anywhere. Then P (ξ = 0) + P (ξ ≥ 2) > 0. We ﬁnd the population (n) at time n + 1 by adding the number of oﬀspring of each individual.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. Thus. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. . each individual gives birth to at one child with probability p or has children with probability 1 − p. 0 But 0 is an absorbing state. ξ (n) ). 0 ≤ j ≤ i. Back to the general case. Example: Suppose that p1 = p. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of oﬀspring. if the 65 (n) (n) . we have that (Xn . . since ξ (0) . Hence all states i ≥ 1 are inessential.i. . So assume that p1 = 1. and following the same distribution. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. k = 0. It is a Markov chain with time-homogeneous transitions. which has a binomial distribution: i j pij = p (1 − p)i−j .24 24. ξ (1) . Let ξk be the number of oﬀspring of the k-th individual in the n-th generation. ξ2 . Therefore. The parent dies immediately upon giving birth. and. . independently from one another.d. . one question of interest is whether the process will become extinct or not. But here is an interesting special case. so we will ﬁnd some diﬀerent method to proceed. We let pk be the probability that an individual will bear k oﬀspring. therefore transient. and 0 is always an absorbing state. 1. 2. . if the current population is i there is probability pi that everybody gives birth to zero oﬀspring. with probability 1. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. then we see that the above equation is of the form Xn+1 = f (Xn . p0 = 1 − p.

Proof. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. For i ≥ 1. i. . when we start with X0 = i. ε(0) = 1. so P (D|X0 = i) = εi . The result is this: Proposition 5. This function is of interest because it gives information about the extinction probability ε. and. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. Obviously. since Xn = 0 implies Xm = 0 for all m ≥ n. Let ε be the extinction probability of the process. Indeed. ϕn (0) = P (Xn = 0). we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. if we start with X0 = i in ital ancestors. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. we can argue that ε(i) = εi . Consider a Galton-Watson process starting with X0 = 1 initial ancestor. 66 . each one behaves independently of one another.population does not become extinct then it must grow to inﬁnity. Therefore the problem reduces to the study of the branching process starting with X0 = 1. We wish to compute the extinction probability ε(i) := Pi (D). . Indeed. . and P (Dk ) = ε. . We then have that D = D1 ∩ · · · ∩ Di . let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. where ε = P1 (D). Let ∞ µ = Eξ = k=1 kpk be the mean number of oﬀspring of an individual. For each k = 1. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . If µ ≤ 1 then the ε = 1.

then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. Also ϕn (0) → ε. recall that ϕ is a convex function with ϕ(1) = 1. By induction. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . In particular. Let us show that. Since both ε and ε∗ satisfy z = ϕ(z). Therefore ε = ϕ(ε). ϕn+1 (0) = ϕ(ϕn (0)). if µ > 1. since ϕ is a continuous function. Therefore ε = ε∗ . the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. ϕ(ϕn (0)) → ϕ(ε). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). and so on. So ε ≤ ε∗ < 1. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). But ϕn (0) → ε. But ϕn+1 (0) → ε as n → ∞.) 67 . ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). This means that ε = 1 (see left ﬁgure below). ϕn (0) ≤ ε∗ . We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). and. (See right ﬁgure below. ε = ε∗ . all we need to show is that ε < 1. in fact. ϕn+1 (z) = ϕ(ϕn (z)). and since the latter has only two solutions (z = 0 and z = ε∗ ).Note also that ϕ0 (z) = 1. On the other hand. Therefore. if the slope µ at z = 1 is ≤ 1. But 0 ≤ ε∗ . Notice now that µ= d ϕ(z) dz . z=1 Also.

and Sn the number of selected packets. k ≥ 0. One obvious solution is to allocate time slots to hosts in a speciﬁc order. If s ≥ 2. Then ε < 1. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. 3. 3. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. say.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. in a simpliﬁed model. 1. random variables with some common distribution: P (An = k) = λk . if Sn = 1. .d. If collision happens. Case: µ > 1.. are allocated to hosts 1. we let hosts transmit whenever they like. Xn+1 = Xn + An . assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). 3 hosts. We assume that the An are i. . 3. there is a collision and. then the transmission is not successful and the transmissions must be attempted anew. there is collision and the transmission is unsuccessful. 2. 4. 2. . Nowadays. there are x + a packets. Abandoning this idea. Ethernet has replaced Aloha.i. 1. If s = 1 there is a successful transmission and. using the same frequency. Thus. So if there are. 2. periodically. is that if more than one hosts attempt to transmit a packet. then time slots 0. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. An the number of new packets on the same time slot. otherwise. . Let s be the number of selected packets. only one of the hosts should be transmitting at a given time. on the next times slot. then max(Xn + An − 1. The problem of communication over a common channel. During this time slot assume there are a new packets that need to be transmitted. 0). 5. . 1. 6. . Since things happen fast in a computer network. 68 . Then ε = 1. there is no time to make a prior query whether a host has something to transmit or not. 24.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). So. on the next times slot. there are x + a − 1 packets. for a packet to go through.

if Xn + An = 0. we think as follows: to have a reduction of x by 1. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . for example we can make it ≤ µ/2. An = a. k 0 ≤ k ≤ x + a. 69 . So P (Sn = k|Xn = x. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x.) = x + a k x−k p q . For x > 0.x−1 + k≥1 kpx. The random variable Sn has. The process (Xn . To compute px. Therefore ∆(x) ≥ µ/2 for all large x. i. so q x G(x) tends to 0 as x → ∞. The reason we take the maximum with 0 in the above equation is because.e. The main result is: Proposition 6. . . Unless µ = 0 (which is a trivial case: no arrivals!). To compute px. So: px. we have q < 1. An−1 . So we can make q x G(x) as small as we like by choosing x suﬃciently large. Sn = 1|Xn = x) = λ0 xpq x−1 . Indeed. . we have ∆(x) = (−1)px. we have that the drift is positive for all large x and this proves transience. selections are made amongst the Xn + An packets. We also let µ := k≥1 kλk be the expectation of An . Since p > 0. The Aloha protocol is unstable. where q = 1 − p. where G(x) is a function of the form α + βx . deﬁned by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. we should not subtract 1. n ≥ 0) is a Markov chain. we must have no new packets arriving. Idea of proof. Let us ﬁnd its transition probabilities. and so q x tends to 0 much faster than G(x) tends to ∞. Proving this is beyond the scope of the lectures. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x).x are computed by subtracting the sum of the above from 1. Xn−1 . and p.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k .e.x−1 when x > 0. the Markov chain (Xn ) is transient.x+k for k ≥ 1. and exactly one selection: px.x−1 = P (An = 0. We here only restrict ourselves to the computation of this expected change.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . The probabilities px. k ≥ 1.where λk ≥ 0 and k≥0 kλk = 1. then the chain is transient. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1).

The idea is simple. the chain moves to a random page. the rank of a page may be very important for a company’s proﬁts. known as ‘302 Google jacking’. if L(i) = ∅ 0. Its implementation though is one of the largest matrix computations in the world. 2. So this is how Google assigns ranks to pages: It uses an algorithm. if j ∈ L(i) pij = 1/|V |. |V | These are new transition probabilities. There is a technique. In an Internet-dominated society. neither that it is aperiodic. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. PageRank. unless i is a dangling page. If L(i) = ∅.2) and let α pij := (1 − α)pij + .24. We pick α between 0 and 1 (typically. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. Academic departments link to each other by hiring their 70 . So we make the following modiﬁcation. that allows someone to manipulate the PageRank of a web page and make it much higher. in the latter case.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. It is not clear that the chain is irreducible. otherwise. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. All you need to do to get high rank is pay (a lot of money). Therefore. A new use of PageRank is to rank academic doctoral programs based on their records of ﬁnding jobs for their graduates. Deﬁne transition probabilities as follows: 1/|L(i)|. α ≈ 0. there are companies that sell high-rank pages. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. It uses advanced techniques from linear algebra to speed up computations. then i is a ‘dangling page’. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. 3. For each page i let L(i) be the set of pages it links to. So if Xn = i is the current location of the chain. because of the size of the problem. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. Then it says: page i has higher rank than j if π(i) > π(j). Comments: 1.

When two individuals mate. all we need is the information outlined in this ﬁgure. then. But an AA mating with a Aa may produce a child of type AA or one of type Aa. In this ﬁgure there are 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice. Since we don’t care to keep track of the individuals’ identities. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output.4 The Wright-Fisher model: an example from biology In an organism. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). their oﬀspring inherits an allele from each parent. meaning how many characteristics we have. 5. We will assume that a child chooses an allele from each parent at random. We have a transition from x to y if y alleles of type A were produced. We will also assume that the parents are replaced by the children. And so on. from the point of view of an administrator. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. The alleles in an individual come in pairs and express the individual’s genotype for that gene.faculty from each other (and from themselves). Do the same thing 2N times. This means that the external appearance of a characteristic is a function of the pair of alleles. Three of the square alleles are never selected. it makes the evaluation procedure easy. What is important is that the evaluation can be done without ever looking at the content of one’s research. in each generation. Thus. 24. Imagine that we have a population of N individuals who mate at random producing a number of oﬀspring. If so. colour of eyes) appears in two forms. this allele is of type A with probability x/2N . This number could be independent of the research quality but. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their oﬀspring did not inherit this allele.g. a gene controlling a speciﬁc characteristic (e. We would like to study the genetic diversity. The process repeats. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. maintain this allele for the next generation. 4. generation after generation. the genetic diversity depends only on the total number of alleles of each type. two AA individuals will produce and AA child. a dominant (say A) and a recessive (say a). An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. One of the square alleles is selected 4 times. 71 . we will assume that. To simplify. We are now ready to deﬁne the Markov chain Xn representing the number of alleles of type A in generation n.

Since selections are independent. we see that 2N x 2N −y x y pxy = . Tx := inf{n ≥ 1 : Xn = x}. with n P (ξk = 1|Xn . . since. on the average. the number of alleles of type A won’t change. the states {1. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). Let M = 2N . i. . . X0 ) = E(Xn+1 |Xn ) = M This implies that. Clearly. ϕ(M ) = 1. Clearly. p00 = p2N. . Xn ). Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . . the expectation doesn’t change and. M n P (ξk = 0|Xn . and the population becomes one consisting of alleles of type A only. The result is correct. These are: M ϕ(0) = 0. . as usual. M and thus. Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. The fact that M is even is irrelevant for the mathematical model. M Therefore. .e. 2N − 1} form a communicating class of inessential states. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. so both 0 and 2N are absorbing states. 0 ≤ y ≤ 2N. We show that our guess is correct by proceeding via the mundane way of verifying that it satisﬁes the ﬁrst-step analysis equations. ϕ(x) = y=0 pxy ϕ(y).d. 72 . conditionally on (X0 . ξM are i. . the random variables ξ1 . X0 ) = 1 − Xn . and the population becomes one consisting of alleles of type a only.i. . (Here. but the argument needs further justiﬁcation because “the end of time” is not a deterministic time. . . . .Each allele of type A is selected with probability x/2N . for all n ≥ 0. y ≤ 2N − 1. the chain gets absorbed by either 0 or 2N . i. . Since pxy > 0 for all 1 ≤ x. . we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . .e. on the average.2N = 1. . . . Similarly. . . n n where.) One can argue that. . since. . 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. X0 ) = Xn . E(Xn+1 |Xn . . ϕ(x) = x/M. EXn+1 = EXn Xn = Xn .

. A number An of new components are ordered and added to the stock. . the demand is met instantaneously. we will try to solve the recursion. for all n ≥ 0. Since the Markov chain is represented by this very simple recursion. we ﬁnd X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. larger than the supply per unit of time. n ≥ 0) are i. provided that enough components are available. n ≥ 0) is a Markov chain. Thus. and so. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . while a number of them are sold to the market. Let ξn := An − Dn . . b). 2. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. so that Xn+1 = (Xn + ξn ) ∨ 0. We will show that this Markov chain is positive recurrent. and E(An − Dn ) = −µ < 0.The ﬁrst two are obvious. on the average. the demand is only partially satisﬁed. We have that (Xn . the demand is.d. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. Here are our assumptions: The pairs of random variables ((An . there are Xn + An − Dn components in the stock. at time n + 1. Xn electronic components at time n.5 A storage or queueing application A storage facility stores. 73 . and the stock reached level zero at time n + 1. Dn ). Otherwise. M −1 y−1 x M y−1 1− x . We will apply the so-called Loynes’ method. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. If the demand is for Dn components. say.i. Working for n = 1. M x M (M −1)−(y−1) x x +1− M M M −1 = 24.

Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. . (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . where the latter equality follows from the fact that. using π(y) = limn→∞ P (Mn = y).i. . for y ≥ 0. . . . ξn . . for all n. Hence. so we can shuﬄe them in any way we like without altering their joint distribution. ξn−1 ). . and. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). the event whose probability we want to compute is a function of (ξ0 . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). we ﬁnd π(y) = x≥0 π(x)pxy . we may replace the latter random variables by ξ1 . ξn−1 ) = (ξn−1 . we may take the limit of both sides as n → ∞. . . . Indeed. . ξ0 ). . 74 . we reverse it. . Therefore. . (n) . in computing the distribution of Mn which depends on ξ0 . Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). instead of considering them in their original order. ξn−1 . the random variables (ξn ) are i. P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . . . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisﬁes the balance equations.d.Thus. . however. . Notice. that d (ξ0 . In particular. . . = x≥0 Since the n-dependent terms are bounded.

n→∞ This implies that P ( lim Zn = −∞) = 1. Depending on the actual distribution of ξn .} > y) is not identically equal to 1. . n→∞ But then P ( lim Mn = 0 ∨ max{Z1 .So π satisﬁes the balance equations. . Z2 . We must make sure that it also satisﬁes y π(y) = 1. . we will use the assumption that Eξn = −µ < 0. . n→∞ Hence g(y) = P (0 ∨ max{Z1 . It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient.} < ∞) = 1. as required. the Markov chain may be transient or null recurrent. Here. . This will be the case if g(y) is not identically equal to 1. Z2 . The case Eξn = 0 is delicate. . 75 . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1.

j = 1. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. we won’t go beyond Zd . . so far. otherwise where p + q = 1. . To be able to say this. d p(x) = qj . A random walk possesses an additional property.y = px+z. .and space-homogeneous transition probabilities given by px. the set of d-dimensional vectors with integer coordinates. Example 1: Let p(x) = C2−|x1 |−···−|xd | . a structure that enables us to say that px. 0. This means that the transition probability pxy should depend on x and y only through their relative positions in space. d 0. the skip-free property. . . If d = 1. For example. we can let S = Zd . . x ∈ Zd . given a function p(x). if x ∈ {±e1 . that of spatial homogeneity. . Our Markov chains. then p(1) = p. Deﬁne pj . this walk enjoys. but. . here. groups) are allowed.and space-homogeneity. ±ed } otherwise 76 . More general spaces (e. In dimension d = 1. Example 3: Deﬁne p(x) = 1 2d . if x = ej . where p1 + · · · + pd + q1 + · · · + qd = 1. So. a random walk is a Markov chain with time. namely.g. j = 1. if x = −ej . This is the drunkard’s walk we’ve seen before. A random walk of this type is called simple random walk or nearest neighbour random walk. .y+z for any translation z. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else.y = p(y − x). otherwise. . where C is such that x p(x) = 1. besides time. p(x) = 0. . have been processes with timehomogeneous transition probabilities. .PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. p(−1) = q. we need to give more structure to the state space S.

has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . Furthermore. random variables in Zd with common distribution P (ξn = x) = p(x). (Recall that δz is a probability distribution that assigns probability 1 to the point z. ξ2 . otherwise. x ± ed . Furthermore. . p(x) = 0. But this would be nonsense.) Now the operation in the last term is known as convolution. p ∗ δz = p.i. then p(1) = p(−1) = 1/2. computing the n-th order convolution for all n is a notoriously hard problem. We call it simple symmetric random walk.This is a simple random walk. we will mean a sum over all possible values. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). in addition. because all we have done is that we dressed the original problem in more fancy clothes. . We will let p∗n be the convolution of p with itself n times. p′ (x). It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn .d. p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). . . y p(x)p′ (y − x). . If d = 1. in general. we can reduce several questions to questions about this random walk. . we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). x ∈ Zd is deﬁned as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. (When we write a sum without indicating explicitly the range of the summation variable. are i.) In this notation. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. the initial state S0 is independent of the increments. We shall resist the temptation to compute and look at other 77 . Let us denote by Sn the state of the walk at time n. which. Indeed. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . One could stop here and say that we have a formula for the n-step transition probability. So. Obviously. . and this is the symmetric drunkard’s walk. x + ξ2 = y) = y p(x)p(y − x). The convolution of two probabilities p(x). x ∈ Zd . The special structure of a random walk enables us to give more detailed formulae for its behaviour. where ξ1 . We refer to ξn as the increments of the random walk. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x.

who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that diﬀerent questions we need to ask or (b) that approximate methods are more informative than exact ones. Clearly. ξn = εn ) = 2−n . and so #A P (A) = n . 2) → (5. . . Here are a couple of meaningful questions: A. . for ﬁxed n. 1): (2. a random vector with values in {−1. In the sequel.i.i+1 = pi. how long will that take? C. 1) → (4. consider the following: 78 .d. . ξn ). −1} then we have a path in the plane that joins the points (0. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi.2) (4. Since there are 2n elements in {0. As a warm-up.0) We can thus think of the elements of the sample space as paths.1) − (3. +1. 1}n . + (1. and the sequence of signs is {+1. After all.i−1 = 1/2 for all i ∈ Z. . If not. First. all possible values of the vector are equally likely. +1}n . Will the random walk ever return to where it started from? B.1) + (0. −1. 2) → (3. to each initial position and to each sequence of n signs. we shall learn some counting methods. Therefore. +1}n . The principle is trivial. Look at the distribution of (ξ1 . we have P (ξ1 = ε1 . If yes. +1}n . A ⊂ {−1. a path of length n. we should assign. 1) → (2. random variables (random signs) with P (ξn = ±1) = 1/2. if we can count the number of elements of A we can compute its probability. 2 where #A is the number of elements of A. . So if the initial position is S0 = 0.2) where the ξn are i. . we are dealing with a uniform distribution on the sample space {−1. So. 0) → (1.methods. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . +1.1) + − (5. . Events A that depend only on these the ﬁrst n random signs are subsets of this sample space. but its practice may be hard.

Let us ﬁrst show that Lemma 14. 0) and end up at the speciﬁc point (n. 0)).1 Paths that start and end at speciﬁc points Now consider a path that starts from a speciﬁc point and ends up at a speciﬁc point. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. n 79 . Consider a path that starts from (0. i). In if n + i is not an even number then there is no path that starts from (0. y) is given by: #{(m. S7 < 0}. y)}.) The darker region shows all paths that start at (0. 0) (n. 0) and end up at (n. The paths comprising this event are shown below. x) and end up at (n. x) (n.i) (0. y) is given by: #{(0. i). We can easily count that there are 6 paths in A. (n − m + y − x)/2 (23) Remark.Example: Assume that S0 = 0. S4 = −1. i)} = n n+i 2 More generally. Proof. (There are 2n of them. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. (n. in this case. We use the notation {(m. 0) and ends at (n. x) (n. the number of paths that start from (m. S6 < 0. Hence (n+i)/2 = 0.047. y)} =: {all paths that start at (m. y)} = n−m . 0) and n ends at (n.0) The lightly shaded region contains all paths of length n that start at (0. x) and end up at (n. and compute the probability of the event A = {S2 ≥ 0. The number of paths that start from (0. i).

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the ﬁgure), there is no path that goes from (0, 0) to (3, 2). We shall thus deﬁne n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

**Paths that start and end at speciﬁc points and do not touch zero at all
**

(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =

1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the speciﬁc point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the ﬁrst time it hits zero and then reﬂecting it around the horizontal axis. In the ﬁgure below, follow the solid path starting at (m, x) up to the ﬁrst time it hits zero and then follow the dotted path: this is the reﬂected path.

(n, y)

(m, x)

(n, −y)

A path and its reﬂection around a horizontal line. Notice that the reﬂection starts only at the point where the path touches the horizontal line.

Note that if we know the reﬂected path we can recover the original path. Hence we may as well count the number of reﬂected paths. But each reﬂected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =

1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)

1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

**Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
**

1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n

n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n

n 2

= #{(0, 1) n

n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1

n−y 2

(n, y + 1); remain > 0}

− −

n

n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

−

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the ﬁrst one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

**Distribution after n steps
**

n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

0) (n.P (Sn = y|Sm = x) = In particular. n This is called the ballot theorem and the reason will be given in Section 29. if n is even. n 2−n . y) that do not touch zero} × 2−n #{paths (0. The left side of (30) equals #{paths (1. . If n. 27. . 0) (n. Sn ≥ 0 | Sn = y) = 1 − In particular. (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . . n (30) Proof. . But. Sn > 0 | Sn = y) = y . n−m 2−n−m . . . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). . . (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. Such a simple answer begs for a physical/intuitive explanation. . x) 2n−m (n. P0 (Sn = y) 2 (n + y) + 1 . . Sn ≥ 0 | Sn = 0) = 84 1 . S2 > 0. for now. given that S0 = 0 and Sn = y > 0. y)} . if n is even. The probability that. y)} . n/2 (29) #{(0. we shall just compute it. 1) (n. n 2 #{paths (m.3 Some conditional probabilities Lemma 17. 27. the random walk never becomes zero up to time n is given by P0 (S1 > 0. y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . P0 (S1 ≥ 0 . y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 .2 The ballot theorem Theorem 25.

. . Sn ≥ 0) = P0 (S1 = 0. Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . . . . . (f ) follows from the replacement of (ξ2 . . . and ﬁnally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. (c) (d) (e) (f ) (g) where (a) is obvious. If n is even. . . Sn > 0) + P0 (S1 < 0. Sn ≥ 0). Sn ≥ 0. . . . ξ3 . . . . . .4 Some remarkable identities Theorem 26. . . . . +1 27. . ξ2 + ξ3 ≥ 0. . . Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . Sn = 0) = P0 (S1 > 0. ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. 0) (n. . . . . .). . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. (e) follows from the fact that ξ1 is independent of ξ2 . .) by (ξ1 . We next have P0 (S1 = 0. P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning.Proof. . . . 85 . . . . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. Proof. . . . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . 1 + ξ2 + ξ3 > 0. Sn ≥ 0) = #{(0. . Sn < 0) (b) (a) = 2P0 (S1 > 0. . Sn = 0) = P0 (Sn = 0). . n/2 where we used (28) and (29). . ξ2 . . (b) follows from symmetry. . . . . P0 (S1 ≥ 0. . . •). and Sn−1 > 0 implies that Sn ≥ 0. . . . . . . . ξ3 . For even n we have P0 (S1 ≥ 0. Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . We use (27): P0 (Sn ≥ 0 .

and if the particle is to the left of a then its mirror image is on the right. If a particle is to the right of a then you see its image on the left. Sn = 0. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). Sn−1 = 0. the two terms must be equal. Theorem 27. . . . Let n be even. . . given that Sn = 0. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. Sn = 0) P0 (S1 > 0. We now want to ﬁnd the probability f (n) that the walk will return to 0 for the ﬁrst time in n steps. . f (n) = P0 (S1 > 0. n−1 28 The reﬂection principle for a simple random walk in dimension 1 The reflection principle says the following. Sn−2 > 0|Sn−1 > 0. . .5 First return to zero In (29). Put a two-sided mirror at a. . So the last term is 1/2. we see that Sn−1 > 0. for even n. we can omit the last event. So u(n) f (n) = . . Sn−1 < 0. Sn = 0) = 2P0 (Sn−1 > 0. Also. n−1 Proof. 86 . and u(n) = P0 (Sn = 0). we computed. . Fix a state a. Sn = 0 means Sn−1 = 1. . . This is not so interesting. . By the Markov property at time n − 1. we either have Sn−1 = 1 or −1. To take care of the last term in the last expression for f (n). . Sn = 0) = u(n) . Since the random walk cannot change sign before becoming zero. Sn−1 > 0. Sn = 0). P0 (Sn−1 > 0. Sn−1 > 0. . . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). with equal probability. Sn = 0) Now.27. Sn = 0) + P0 (S1 < 0. By symmetry. . n We know that P0 (Sn = 0) = u(n) = n/2 2−n . . . . Then f (n) := P0 (S1 = 0. . . So f (n) = 2P0 (S1 > 0. . the probability u(n) that the walk (started at 0) attains value 0 in n steps. So 1 f (n) = 2u(n) P0 (S1 > 0. Sn−2 > 0|Sn−1 = 1).

2 Proof. Sn ≤ a). Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). . 2 Deﬁne now the running maximum Mn = max(S0 . Trivially.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. Then. Thus. But P0 (Ta ≤ n) = P0 (Ta ≤ n. . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. 2a − Sn . Sn ≥ a). 28. observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). P0 (Sn ≥ a) = P0 (Ta ≤ n. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). we have n ≥ Ta . . But a symmetric random walk has the same law as its negative. S1 . 2a − Sn ≥ a) = P0 (Ta ≤ n. Let (Sn . n ≥ Ta . n ≥ Ta By the strong Markov property. In other words. independent of the past before Ta . Sn ≤ a) + P0 (Ta ≤ n. n ∈ Z+ ). as a corollary to the above result. n < Ta . If Sn ≥ a then Ta ≤ n. m ≥ 0) is a simple symmetric random walk. . Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). a + (Sn − a). Sn > a) = P0 (Ta ≤ n. n ≥ 0) be a simple symmetric random walk and a > 0. The last equality follows from the fact that Sn > a implies Ta ≤ n. n ∈ Z+ ) has the same law as (Sn . we obtain 87 . Theorem 28.What is more interesting is this: Run the particle till it hits a (it will. called (Sn . Then Ta = inf{n ≥ 0 : Sn = a} Sn . Finally. n < Ta . Proof. Sn ). So Sn − a can be replaced by a − Sn in the last display and the resulting process. Sn = Sn . so we may replace Sn by 2a − Sn (Theorem 28). for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . deﬁne Sn = where is the ﬁrst hitting time of a. Sn ≤ a) + P0 (Sn > a). the process (STa +m − a. In the last event.

But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. by applying the reﬂection principle (Theorem 28). ηn ). . 9 Bertrand. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. gives the result. then the probability that. It is the latter use we are interested in in Probability and Statistics. Sn = 2x − y). Acad. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. The set of values of η contains n elements and η is uniformly distributed a in this set. P0 (Mn < x. 436-437. 105. Compt. 2 Proof. A sample from the urn is a sequence η = (η1 . usually a vase furnished with a foot or pedestal. Rend. as long as a ≥ b. The answer is very simple: Theorem 31. Solution directe du probl`me r´solu par M. Solution d’un probl`me. Paris. An urn is a vessel. 105. e 10 Andr´. (1887). of which a are coloured azure and b black. and anciently for holding ballots to be drawn. If Tx ≤ n. Sci. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. during sampling without replacement. Bertrand. (31) x > y. Sn = y) = P0 (Tx ≤ n. Sn = y). This. the number of selected azure items is always ahead of the black ones equals (a − b)/n. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. as for holding liquids. D. (1887). which results into P0 (Tx ≤ n. . then.Corollary 19. combined with (31). the ashes of the dead after cremation. . where ηi indicates the colour of the i-th item picked. ornamental objects. . Compt. n = a + b. Proof. we have P0 (Mn < x. Rend. Sn = y) = P0 (Sn = 2x − y). Sn = y) = P0 (Tx > n. If an urn contains a azure items and b = n−a black items. Since Mn < x is equivalent to Tx > n. employed for diﬀerent purposes. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. e e e Paris. The original ballot problem. 369. Sci. J. Acad. 8 88 .

ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. . 89 . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . . An urn can be realised quite easily be a random walk: Theorem 32. ηn ) an urn process with parameters n. . . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. ξn ) has a uniform distribution. . but which is not necessarily symmetric. It remains to show that all values are equally a likely.) So pi. . always assuming (without any loss of generality) that a ≥ b = n − a. We have P0 (Sn = k) = n n+k 2 pi. . n . Since an urn with parameters n. n ≥ 0) with values in Z and let y be a positive integer. . . But in (30) we computed that y P0 (S1 > 0. . Let ε1 . . conditional on {Sn = y}. conditional on {Sn = y}. −n ≤ k ≤ n. . . (ξ1 . Consider a simple symmetric random walk (Sn . . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. a we have that a out of the n increments will be +1 and b will be −1. n n −n 2 (n + y)/2 a Thus. . Then. Proof of Theorem 31. But if Sn = y then. . We need to show that. the ‘black’ items correspond to − signs). p(n+k)/2 q (n−k)/2 . ξn ) is uniformly distributed in a set with n elements. . . the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. (The word “asymmetric” will always mean “not necessarily symmetric”. . the sequence (ξ1 . . letting a. Then. a − b = y. . . . Sn > 0 | Sn = y) = . ξn = εn ) P0 (Sn = y) 2−n 1 = = . Proof. So the number of possible values of (ξ1 . . . conditional on the event {Sn = y}. . Since the ‘azure’ items correspond to + signs (respectively. εn be elements of {−1. . Sn > 0} that the random walk remains positive. so ‘azure’ items are +1 and ‘black’ items are −1. the result follows. . . The values of each ξi are ±1. . ξn ) is. n Since y = a − (n − a) = a − b. . . . the sequence (ξ1 . i ∈ Z. +1} such that ε1 + · · · + εn = y. a.i+1 = p. where y = a − (n − a) = 2a − n. indeed. . . .We shall call (η1 . . using the deﬁnition of conditional probability and Lemma 16.i−1 = q = 1 − p. we can translate the original question into a question about the random walk. . b be deﬁned by a + b = n. . . we have: P0 (ξ1 = ε1 .

plus a ′′ further time T1 required to go from 0 to +1 for the ﬁrst time. Proof. and all t ≥ 0. 2. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. The process (Tx . . 2qs Proof. This random variable takes values in {0. Then the probability generating function of the ﬁrst time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . Proof. Consider a simple random walk starting from 0. Lemma 18. . If S1 = 1 then T1 = 1. . 2. E0 sTx = ϕ(s)x . See ﬁgure below. For all x. In view of this. 90 . . This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. we make essential use of the skip-free property. for x > 0. Px (Ty = t) = P0 (Ty−x = t). namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. Corollary 20. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. The rest is a consequence of the strong Markov property. . If x > 0. 1. We will approach this using probability generating functions. Consider a simple random walk starting from 0. random variables. Let ϕ(s) := E0 sT1 . In view of Lemma 19 we need to know the distribution of T1 . Consider a simple random walk starting from 0. To prove this. . x ∈ Z. only the relative position of y with respect to x is needed. x − 1 in order to reach x.30.i.} ∪ {∞}. Lemma 19. it is no loss of generality to consider the random walk starting from 0. x ≥ 0) is also a random walk. . y ∈ Z. Start with S0 = 0 and watch the process until the ﬁrst n = T1 such that Sn = 1.) Then. (We tacitly assume that S0 = 0.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. plus a time T1 required to go from −1 to 0 for the ﬁrst time. Theorem 33.d.

T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . 2ps Proof. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. 91 . if x < 0 2ps Proof. the RW must ﬁrst hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). The formula is obtained by solving the quadratic. Then the probability generating function of the ﬁrst time that level x ∈ Z is reached is given by √ x E sT1 x = 1− 1−4pqs2 . Hence the previous formula applies with the roles of p and q interchanged. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. Consider a simple random walk starting from 0. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Consider a simple random walk starting from 0.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. in order to hit 1. Corollary 21. The ﬁrst time n such that Sn = −1 is the ﬁrst time n such that −Sn = 1. Then the probability generating function of the ﬁrst time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . 0 if x > 0 2qs Tx (32) E0 s = √ |x| E sT−1 |x| = 1− 1−4pqs2 0 . Corollary 22. from the strong Markov property.

1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . k by the binomial theorem.3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. and simply write (1 + x)n = k≥0 n k x . this was found in §30. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . the same as the time required for the walk to hit 1 starting ′ from 0. k 92 . 1 − 4pqs2 .2 First return to the origin Now let us ask: If the particle starts at 0. We again do ﬁrst-step analysis. if S1 = 1. Similarly. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. 2ps ′ =1− 30. how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. Consider a simple random walk starting from 0. Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof. If we deﬁne n to be equal to 0 for k > n. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . then we can omit the k upper limit in the sum.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34.30. If α = n is a positive integer then n (1 + x)n = k=0 n k x . then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. The latter is. where α is a real number. in distribution. For the general case. For a symmetric ′ random walk.

k! k! (−1)2k xk = k≥0 xk .Recall that n k = It turns out that the formula is formally true for general α. (−1)k = 2k − 1 k k and this is shown by direct computation. (1 − x)−1 = But. k!(n − k)! k! α k x . √ 1 − x. k We can write this in more familiar symbols if we observe that 2k −k 1 1/2 4 . by deﬁnition. but that we won’t worry about which ones. and it looks exactly the same: (1 + x)α = k≥0 n(n − 1) · · · (n − k + 1) n! = . k! The adverb ‘formally’ is supposed to mean that this formula holds for some x around 0. k as long as we deﬁne α k = α(α − 1) · · · (α − k + 1) . Indeed. let us compute √ (−1)k (−x)k = k≥0 k≥0 k≥0 −1 (−x)k k = (−1)k k! (−1)(−2) · · · (−1 − k + 1) = = (−1)k . we will show that (1 − x)−1 = 1 + x + x2 + x3 + · · · which is of course the all too-familiar geometric series. −1 k so (1 − x)−1 = as needed. We have k≥0 1 − x = (1 − x)1/2 = 1/2 (−1)k xk . So √ 1−x= ′ k≥0 1 2k k x =1+ 2k − 1 k 1 − 4pqs2 : ′ k≥1 2k k 1 x 2k − 1 k We apply this to E0 sT0 = 1 − E0 s T 0 = k≥1 1 2k (pq)k s2k . As another example. For example. 2k − 1 k 93 .

Hence it is again transient.But. if p = q then P0 (Tx < ∞) = 1 for all x ∈ Z. Hence it is transient. if x < 0 |x| In particular. P0 (Tx < ∞) = p q q p ∧1 ∧1 . We apply this to (32). s↑1 which is true for any probability generating function. • If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. x if x > 0 . so 1 − 4pq = So P0 (Tx < ∞) = 1 − 4p + 4p2 = (1 − 2p)2 = |1 − 2p| = |p − q|. • If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. The key fact needed here is: P0 (Tx < ∞) = lim E0 sTx . |x| if x > 0 (33) . if x < 0 This formula can be written more neatly as follows: Theorem 35. • If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. But remember that q = 1 − p. without appeal to the Markov chain theory. E0 s T 0 = n≥0 ′ ′ ′ P0 (T0 = n)sn . by the deﬁnition of E0 sT0 . That it is actually null-recurrent follows from Markov chain theory. We will see how the results about the generating functions of the hitting and return times help us answer these questions directly. The quantity inside the square root becomes 1 − 4pq. By applying the law of large numbers we have that P0 Sn =p−q n→∞ n lim = 1. x 1−|p−q| 2q 1−|p−q| 2p . 94 . 2k − 1 k 2k − 1 31 Recurrence properties of simple random walks Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. Comparing the two ‘inﬁnite polynomials’ we see that we must have n = 2k (we knew that– the random walk can return to 0 only in an even number of steps) and then ′ P0 (T0 = 2k) = 1 1 2k k k p q = P0 (S2k = 0).

Consider a simple random walk with values in Z. . Indeed. . The last part of Theorem 35 tells us that it is recurrent. 95 . Sn ). This value is random and is denoted by M∞ = sup(S0 . . and so it is null-recurrent. with probability 1 because p < 1/2. every nondecreasing sequence has a limit. Let ′ T0 = inf{n ≥ 1 : Sn = 0} be the ﬁrst return time to 0.1 Asymptotic distribution of the maximum Suppose now that p < 1/2. S1 . .Proof. If p < q then |p − q| = q − p and. Otherwise. What is this probability? Theorem 37. q p. Then limn→∞ Sn = −∞ with probability 1. Proof. this probability is always strictly less than 1. Assume p < 1/2. P0 (M∞ ≥ x) = lim P0 (Mn ≥ x) = lim P0 (Tx ≤ n) = P0 (Tx < ∞) = (p/q)x . except that the limit may be ∞. Consider a random walk with values in Z. Then ′ P0 (T0 < ∞) = 2(p ∧ q). . with some positive probability. x ≥ 0.) If we let Mn = max(S0 . This means that there is a largest value that will be ever attained by the random walk.2 Return probabilities Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. the sequence Mn is nondecreasing. but here it is < ∞. The simple symmetric random walk with values in Z is null-recurrent. We know that it cannot be positive recurrent. S2 . Then the overall maximum M∞ has a geometric distribution: P0 (M∞ ≥ x) = (p/q)x . then we have M∞ = limn→∞ Mn . . . n→∞ n→∞ x ≥ 0. Corollary 23. Proof. it will never ever return to the point where it started from. Theorem 36. again. 31. Unless p = 1/2. 31. if x > 0 . If p > q then |p − q| = p − q and so (33) gives P0 (Tx < ∞) = 1. if x < 0 which is what we want. we obtain what we are looking for.

the total number of visits to a state will be ﬁnite. Remark 1: By spatial homogeneity.3 Total number of visits to a state If the random walk is recurrent then it visits any state inﬁnitely many times. In Lemma 4 we showed that. Consider a simple random walk with values in Z. . And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. 96 . . 1. But ′ P0 (T0 < ∞) = lim EsT0 = 1 − s↑1 ′ 1 − 4pq = 1 − |p − q|. 2. Ei Ji = 1 − 2(p ∧ q) Remark 2: The formulae work for all values of p ∈ [0. 31. For the second formula we have ∞ ∞ E0 J0 = k=1 P0 (J0 ≥ k) = k=1 (2(p ∧ q))k = 2(p ∧ q) . If p = q this gives 1. k = 0. k = 0. the total number J0 of visits to state 0 is geometrically distributed with P0 (J0 ≥ k) = (2(p ∧ q))k . We now look at the the number of visits to some state but if we start from a diﬀerent state. And so P0 (J0 ≥ k) = P0 (J0 ≥ 1)k . So ′ P0 (J0 ≥ 1) = P0 (T0 < ∞) = 2(p ∧ q). But J0 ≥ 1 if and only if the ﬁrst return time to zero is ﬁnite. This proves the ﬁrst formula. starting from zero. 1 − 2(p ∧ q) Proof. 1]. Pi (Ji ≥ k) = 1 for all k. . . 1. . as already known. The probability generating function of T0 was obtained in Theorem 34. meaning that Pi (Ji = ∞) = 1. even for p = 1/2. we obviously have that the same formulae apply for any i: Pi (Ji ≥ k) = (2(p ∧ q))k . . . Theorem 38. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. 2(p ∧ q) E0 J0 = . 2(p ∧ q) . Then. .′ Proof. 1 − 2(p ∧ q) this being a geometric series. if a Markov chain starts from a state i then Ji is geometric. The total number of visits to state i is given by ∞ Ji = n=1 1(Sn = i). . 1. We now compute its exact distribution. 2. But if it is transient. . In the latter case. (34) a random variable ﬁrst considered in Section 10 where it was also shown to be geometric. k = 0. 2. where the last probability was computed in Theorem 37.

(q/p)(−x)∨0 1 − 2q k≥1 . Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae. Suppose x > 0. i. Consider a random walk with values in Z. given that Tx < ∞. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (p/q)x (2p)k−1 . Then P0 (Jx ≥ k) = P0 (Tx < ∞. Py (Jx ≥ k) = P0 (Jx−y ≥ k) and Ey Jx = E0 Jx−y . we apply E0 Jx = k=1 P0 (Jx ≥ k) and perform the sum of a geometric series. The reason that we replaced k by k − 1 is because. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (q/p)|x| (2q)k−1 . there has already been one visit to state x and so we need at least k − 1 remaining visits in order to have at least k visits in total.e. and the second in Theorem 38. p < 1/2) then. Jx ≥ k) = P0 (Tx < ∞)Px (Jx ≥ k − 1). Consider the case p < 1/2. if the random walk “drifts down” (i. then E0 Jx drops down to zero geometrically fast as x tends to inﬁnity. we have E0 Jx = 1/(1−2p) regardless of x. If p ≤ 1/2 then P0 (Jx ≥ k) = (p/q)x∨0 (2p)k−1 . So we obtain P0 (Jx ≥ k) = p ∧1 q x (p/q)x∨0 1 − 2p k≥1 . The ﬁrst term was computed in Theorem 35. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (2q)k−1 . ∞ As for the expectation.Theorem 39. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (2p)k−1 .e. We repeat the argument for x < 0 and ﬁnd P0 (Jx ≥ k) = q ∧1 p |x| (2(p ∧ q))k−1 . k ≥ 1. starting from 0. E0 Jx = If p ≥ 1/2 then P0 (Jx ≥ k) = (q/p)(−x)∨0 (2q)k−1 . Apply now the strong Markov property at Tx . For x > 0. 97 . (2(p ∧ q))k−1 . But for x < 0. simply because if Jx ≥ k ≥ 1 then Tx < ∞. In other words. E0 Jx = Proof. Then P0 (Tx < ∞. Remark 1: The expectation E0 Jx is ﬁnite if p = 1/2. it will visit any point below 0 exactly the same number of times on the average. Jx ≥ k).

Then P0 (Tx = n) = x P0 (Sn = x) . random variables. 0 ≤ k ≤ n).d. a sum of i. 0 ≤ k ≤ n) is deﬁned by Sk := Sn − Sn−k . . .e. random variables. Let Tx = inf{n ≥ 1 : Sn = x} be the ﬁrst time that x will be visited. . . . . we derived using duality. Here is an application of this: Theorem 41. Theorem 40. (Sk .i.i. because the two have the same distribution. for a simple random walk and for x > 0. (36) 2 Lastly. . n x > 0. . Let (Sk ) be a simple symmetric random walk and x a positive integer. Rewrite the ballot theorem terms of the dual: P0 (S1 > 0. . ξ2 . Sn > 0 | Sn = x) = But the left-hand side is P0 (Sn −Sn−k > 0. Proof. which is basically another probability for S. Sn = x) P0 (Tx = n) = . (35) In Theorem 29 we used the reﬂection principle and derived the distribution function of Tx for a simple symmetric random walk: 1 P0 (Tx ≤ n) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). P0 (Sn = x) P0 (Sn = x) x . n Proof. and thus derived the generating function of Tx : E0 sTx = Specialising to the symmetric case we have E0 s Tx 1− 1 − 4pqs2 2qs x . . in Theorem 41. .d. Thus. 0 ≤ k ≤ n − 1. i. ξn ) has the same distribution as (ξn . for a simple symmetric random walk. 1 ≤ k ≤ n | Sn = x) = P0 (Sk < x. 0 ≤ k ≤ n) of (Sk . Fix an integer n. each distributed as T1 . ξn−1 .32 Duality Consider random walk Sk = k ξi . In Lemma 19 we observed that. n (37) 98 . . Use the obvious fact that (ξ1 . ξ1 ). the probabilities P0 (Tx = n) = x P0 (Sn = x). Tx is the sum of x i. Pause for reﬂection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. 0 ≤ k ≤ n) has the same distribution as (Sk . = 1− √ 1 − s2 s x . i=1 Then the dual (Sk . every time we have a probability for S we can translate it into a probability for S.

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33

Amount of time that a SSRW is positive*

Consider a SSRW (Sn ) starting from 0. Deﬁne its trajectory as the polygonal line obtained by joining the points (n, Sn ) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+ (2n) be the number of sides of the polygonal line that are above the horizontal axis and T− (2n) = 2n − T+ (2n) the number of sides that are below. Clearly, T± (2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+ (2n) as the number of leads in 2n coin tosses, where we

0

T 0’

′ Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0 ,

it behaves again like a SSRW and is independent of the past.

have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads=tails). Most people take this statement and translate it into a statement about T+ (2n) arguing that P (T+ (2n) = m) is maximised when m = n, i.e. when T+ (2n) = T− (2n). This is wrong. In fact, as the theorem below shows, the most likely value of P (T+ (2n) = m) is when m = 0 or m = 2n. Theorem 42 ( distribution of the total time the random walk is positive ). P (T+ (2n) = 2k) = P (S2k = 0)P (S2n−2k = 0) , 0 ≤ k ≤ n.

′ Proof. The idea is to condition at the ﬁrst time T0 that the RW returns to 0: n ′ P (T+ (2n) = 2k, T0 = 2r). r=1

P (T+ (2n) = 2k) =

99

′ Between 0 and T0 the random walk is either above the horizontal axis or below; call the ﬁrst event A+ and the second A− and write n ′ ′ P (A+ , T0 = 2r)P (T+ (2n) = 2k | A+ , T0 = 2r) n ′ ′ P (A− , T0 = 2r)P (T+ (2n) = 2k | A− , T0 = 2r)

P (T+ (2n) = 2k) =

r=1

+

r=1

**In the ﬁrst case,
**

′ ′ P (T+ (2n) = 2k | A+ , T0 = 2r) = P (T+ (2n) = 2k−2r | A+ , T0 = 2r) = P (T+ (2n−2r) = 2k−2r), ′ because, if T0 = 2r and A+ occurs then the RW has already spent 2r units of time above the axis and so T+ (2n) is reduced by 2r. Furthermore, we may remove the conditioning ′ because the future evolution of the RW after T0 is independent of the past. In the second case, a similar argument shows ′ ′ P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n − 2r) = 2k).

**We also observe that, by symmetry, 1 ′ ′ ′ P (A+ , T0 = 2r) = P (A− , T0 = 2r) = P (T0 = 2r) = f2r . 2 Letting p2k,2n = P (T+ (2n) = 2k), we have shown that p2k,2n = 1 2
**

k

**f2r p2k−2r,2n−2r +
**

r=1

1 2

n−k

**f2r p2k,2n−2r
**

r=1

We want to show that p2k,2n = u2k u2n−2k , where u2k = P (S2k = 0). With this hypothesis, we can start playing with the recursion of the display above and using induction on n we will realise that p2k,2n 1 = u2n−2k 2

k r=1

1 f2r u2k−2r + u2k 2

n−k

f2r u2n−2k−2r .

r=1

Explicitly, we have P (T+ (2n) = 2k) = 2k k 2n − 2k −2n 2 , n−k 0 ≤ k ≤ n. (38)

With 2n = 100, we plot the function P (T100 = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

100

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0

10

20 k

30

40

50

**The arcsine law*
**

Use of Stirling’s approximation in formula (38) gives P (T+ (2n) = 2k) ∼ 1/π k(n − k) , as k → ∞ and n − k → ∞.

**Now consider the calculation of the following probability: P T+ (2n) ∈ [a, b] 2n =
**

na≤k≤nb

P

T+ (2n) k = 2n n 1/π k(n − k) =

a≤k/n≤b

∼

a≤k/n≤b

1 . k k (n − n ) n n

1/π

**This is easily seen to have a limit as n → ∞, the limit being
**

b

1/π t(1 − t)

dt,

a

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law: x √ T+ (2n) 2 1/π lim P dt = arcsin x. ≤x = n→∞ 2n π t(1 − t) 0 We thus have that the limiting distribution of T+ (2n) (the fraction of time that the RW is √ 2 positive), as n → ∞, is the distribution function π arcsin x, 0 < x < 1. This is known as the arcsine law. This distribution function has density √ 1/π , 0 < x < 1, the plot of

x(1−x)

which is amazingly close to the plot of the ﬁgure above. Now consider again a SSRW and consider the hitting times (Tx , x ≥ 0). This is another random walk because the summands in Tx = x (Ty − Ty−1 ) are i.i.d. all distributed as y=1 T1 : 2n −2n 1 P (T1 ≥ 2n) = P (S2n = 0) = 2 ∼√ . n πn 101

ξ2 .d. too. for each n. 0)) = P (ξn = (1. 0)) = 1/4. with equal probability (1/4). . 0). Since ξ1 . However. The two random walks are not independent. Sn = i=1 n ξi . x2 ) into (x1 + x2 . 2 1 If x ∈ Z2 . we immediately get that the expectation of T1 is inﬁnity: ET1 = ∞. map (x1 . (Sn . random vectors such that P (ξn = (0. 0) and. n ∈ Z+ ) is a random walk. They are also identically distributed with 1 P (ξn = 0) = P (ξn = (0. Sn ) = n n 2 . . ξ2 . n ∈ Z+ ) showing that it. n ≥ 1. x1 − x2 ). So Sn = (Sn . he moves east. The corners are described by coordinates (x. the two stochastic processes are not independent. 1)) = P (ξn = (−1. we let x1 be its horizontal coordinate and x2 its vertical. n ∈ Z+ ) is a sum of i. i. west. . Rn ) := (Sn + Sn . We have 1. the random 1 2 variables ξn . 0)) = 1/4 1 P (ξn = −1) = P (ξn = (−1. y) where x. 34 Simple symmetric random walk in two dimensions Consider a drunkard in a city whose roads are arranged in a rectangular grid. (Sn .e. 1)) + P (ξn = (0. So let 1 2 1 2 1 2 Rn = (Rn .d. i. Indeed. north or south. We have: 102 . When he reaches a corner he moves again in the same manner. We then let S0 = (0. y ∈ Z. is a random walk. . ξn are not independent simply because if one takes value 0 the other is nonzero. . random variables and hence a random walk. i=1 ξi i=1 ξi 2 1 Lemma 20. Many other quantities can be approximated in a similar manner. . 0)) = 1/4. . west. so are ξ1 . are independent. −1)) = 1/2.i. Consider a change of coordinates: rotate the axes by 45o (and scale them). the latter being functions of the former. being totally drunk. We describe this motion as a random walk in Z2 = Z × Z: Let ξ1 . 1 Hence (Sn .From this. 1 1 Proof. north or south. ξ2 .. . 0)) = P (ξn = (−1. 2 Same argument applies to (Sn . He starts at (0. It is not a simple random walk because there is a possibility that the increment takes the value 0.i. going at random east. Sn − Sn ). n ∈ Z+ ) is also a random walk. 1 P (ξn = 1) = P (ξn = (1. .

(Rn + independent. n Corollary 24. +1}. b−a P (Ta < Tb ) = b . 1)) = 1/4 1 2 P (ηn = −1. (Rn . 2n n 2 2−4n . . By Lemma 5. 1 1 2 2 Hence P (ηn = 1) = P (ηn = −1) = 1/2. 2 1 2 2 1 1 Proof. (Rn + 2 2 1 2 1 1 we check that ηn = ξn + ξn . R2n = 0) = P (R2n = 0)(R2n = 0). 0)) = 1/4 1 2 P (ηn = 1. We have Rn = n (ξi + ξi ). Hence (Rn . ηn = 1) = P (ξn = (1. n ∈ Z+ ) is a random 2 . b−a P (STa ∧Tb = a) = b . ξ2 . The possible values 1 . Let a < 0 < b. b−a −a . from Theorem 7. P (Tb < Ta ) = We can read this as follows: P (STa ∧Tb = b) = −a . Hence P (ηn = 1) = P (ηn = −1) = 1/2. And so 2 1 2 1 P (ηn = ε. . ηn = ξn − ξn are independent for each n.d. and by Stirling’s approximation. Recall. . So P (R2n = 0) = 2n 2−2n . η 2 are ±1. The last two are walk in one dimension. where c is a positive constant. The same is true for (Rn ). n ∈ Z ) is a random walk in one dimension. η . 0)) = 1/4 1 2 P (ηn = −1. Rn = n (ξi − ξi ). The two-dimensional simple symmetric random walk is recurrent. being obtained the vector ηn = (ξn 1 2 n n n by taking the same function on the elements of the sequence ξ1 . ηn = ε′ ) = P (ηn = ε)P (ηn = ε′ ) for all choices of ε. (Rn . ηn = −1) = P (ξn = (0.1 Lemma 21. ηn = 1) = P (ξn = (0. 1 2 1 2 P (S2n = 0) = P (R2n = 0) = P (R2n = 0. . Lemma 22. = 1)) = 1/4. as n → ∞. ηn = −1) = P (ξn = (−1. n ∈ Z+ ) is a random walk in two dimensions. ξ 1 − ξ 2 ).i. ε′ ∈ {−1. b−a 103 . Similar argument shows that each of (Rn . 1 where the last equality follows from the independence of the two random walks. we need to show that ∞ P (Sn = 0) = ∞. that for a simple symmetric random walk (started from S0 = 0). n ∈ Z ) is a random walk in one dimension. 2 . . The sequence of vectors η . is i. To show that the last two are independent. But by the previous n=0 c lemma. P (S2n = 0) ∼ n . But (Rn ) 1 2 is a simple symmetric random walk. . 35 Skorokhod embedding* We take a closer look at the gambler’s ruin problem for a simple symmetric random walk in one dimension. n ∈ Z+ ). in the sense that the ratio of the two sides goes to 1.. So Rn = n ηn where ηn is i=1 i=1 i=1 1 + ξ 2 . Proof. n ∈ Z+ ) 1 is a random walk in two dimensions. We have of each of the ηn n 1 2 P (ηn = 1. P (S2n = 0) = Proof.

that ESTa ∧Tb = 0. c= i<0 (−i)pi = i>0 Letting Ti = inf{n ≥ 0 : Sn = i}. i ∈ Z). such that p is rational. choose. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the ﬁrst time it exits the interval [a. i<0 j>0 This leads to: c−1 j>0 jpj i<0 pi + c−1 i<0 (−i)pi j>0 i>0 ipi pj = 1 − p0 Since i∈Z ipi = 0. where c−1 is chosen by normalisation: P (A = 0. k ∈ Z. e. Theorem 43. i < 0 < j. Can we ﬁnd a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this. Conversely. we can always ﬁnd integers a. we have this common value: i<0 (−i)pi = and. B = j) = 1. B = j) = c−1 (j − i)pi pj . How about more complicated distributions? Let suppose we are given a probability distribution (pi . b = N . in particular. 1 − p). Indeed. M < N . To see this. M. B = 0) = p0 . a = M − N . The claim is that P (Z = k) = pk . The probability it exits it from the left point is p. i ∈ Z. and the probability it exits from the right point is 1 − p. Proof. Then there exists a stopping time T such that P (ST = i) = pi . using this. given a 2-point probability distribution (p. b. P (A = 0. B = 0) + P (A = i. if p = M/N . Note. 0)} ∪ (N × N) with the following distribution: P (A = i. i ∈ Z) a given probability distribution on Z with zero mean: i∈Z ipi = 0. Let (Sn ) be a simple symmetric random walk and (pi . B) taking values in {(0. i ∈ Z. b]. B) are independent of (Sn ).g. Deﬁne a pair of random variables (A. we consider the random variable Z = STA ∧TB . and assuming that (A. Then Z = 0 if and only if A = B = 0 and this has 104 . we get that c is precisely ipi . N ∈ N. a < 0 < b such that pa + (1 − p)b = 0.This last display tells us the distribution of STa ∧Tb . suppose ﬁrst k = 0.

i<0 = c−1 Similarly. We have P (Z = k) = P (STA ∧TB = k) = i<0 j>0 P (STi ∧Tj = k)P (A = i. B = k) i<0 = = i<0 −i −1 c (k − i)pi pk k−i (−i)pi pk = pk . P (Z = k) = pk for k < 0. Suppose next k > 0.probability p0 . B = j) P (STi ∧Tk = k)P (A = i. by deﬁnition. 105 .

We will show that d = gcd(a. 2. leading to d1 = d2 . b. Then interchange the roles of the numbers and continue. (ii) if c | a and c | b. they merely serve as reminders. replace b − a by b − 2a. we can deﬁne their greatest common divisor d = gcd(a. with b = 0. d2 are two such numbers then d1 would divide d2 and vice-versa. Because c | a and c | b.e. If a | b and b | a then a = b. 38) = gcd(6. They are not supposed to replace textbooks or basic courses. we say that b divides a if there is an integer k such that a = kb. Keep doing this until you obtain b − qa < a. with at least one of them nonzero. So (i) holds. So we work faster. gcd(120. d | b. Zero is denoted by 0. 14) = gcd(6. 6. In other words. and d = gcd(a. 2) = gcd(4. 26) = gcd(6. 8) = gcd(6. 32) = gcd(6. b) = gcd(a. For instance. 20) = gcd(6. Indeed. we have c | d. −1. b − a). 38) = gcd(6. So (ii) holds. 1. let d = gcd(a. if a. This gives us an algorithm for ﬁnding the gcd between two numbers: Suppose 0 < a < b are integers. 2) = gcd(2. But in the ﬁrst line we subtracted 38 exactly 3 times. b) as the unique positive integer such that (i) d | a. 0. 3. Indeed 4 × 38 > 120. b − a). we may summarise the ﬁrst line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. Arithmetic Arithmetic deals with integers. 2. The set Z of integers consists of N. b are arbitrary integers. But 3 is the maximum number of times that 38 “ﬁts” into 120. But d | a and d | b. Similarly. Therefore d | b − a. c | b. −2. i. Given integers a. their negatives and zero. (That it is unique is obvious from the fact that if d1 . 38) = gcd(82. Then c | a + (b − a). Z = {· · · . . 5. the process of long division that one learns in elementary school). 2) = 2. Now suppose that c | a and c | b − a.e. 4. The set N of natural numbers is the set of positive integers: N = {1. −3. 106 . . as taught in high schools or universities. with the second line. Replace b by b − a. . 38) = gcd(44. If c | b and b | a then c | a.PART III: APPENDICES This section contains some sine qua non elements from standard Mathematics. · · · }. We write this by b | a. then d | c. If b − a > a. if we also invoke Euclid’s algorithm (i. Now. and we’re done. b).}. 3.) Notice that: gcd(a. b).

we have gcd(a. Thus. we used the divisions 120 = 3 × 38 + 6 38 = 6 × 6 + 2. y such that d = xa + yb. To see why this is true. Working backwards. Then c divides any linear combination of a. not both zero. 2) = 2. In the process. . y ∈ Z.e. b) then there are integers x. The same logic works in general. i. Suppose that c | a and c | b. 107 . Here are some other facts about the greatest common divisor: First. 38) = gcd(6. . let d be the right hand side. . we have the fact that if d = gcd(a. y ∈ Z.The theorem of division says the following: Given two positive integers b. Then d = xa + yb. The long division algorithm is a process for ﬁnding the quotient q and the remainder r. b) = min{xa + yb : x. a − 1}. 120) = 38x + 120y. r = 18. To see this. with a > b. b ∈ Z. such that a = qb + r. it divides d. a. with x = 19. 1. for some x. Second. For example. b. This shows (ii) of the deﬁnition of a greatest common divisor. for integers a. y = 6. We need to show (i). we have 2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120. . we can ﬁnd an integer q and an integer r ∈ {0. gcd(38. 38) = gcd(6. to divide 3450 with 26 we do 132 —— 3450 26 — 85 78 — 70 52 — 18 26 | and obtain q = 26. we can show that. Its proof is obvious. xa + yb > 0}. in particular. let us follow the previous example again: gcd(120.

Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. the set {f (x) : x ∈ X}. e. A relation is transitive if when it contains (x. A function is called onto (or surjective) if its range is Y . for some 1 ≤ r ≤ d − 1. the relation < between real numbers can be thought of as consisting of pairs of numbers (x. B be ﬁnite sets with n. Relations A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. it can be thought as reﬂexive (unless we do not want to accept the fact that one may not be afriend of oneself). A function is called one-to-one (or injective) if two diﬀerent x1 . But then b = q(xa + yb) + r. z). A relation which possesses all the three properties is called equivalence relation. that d does not divide b. y) such that x is smaller than y. then the relation “x is a friend of y” is symmetric. 108 . Counting Let A. and this is a contradiction. Since d > 0. but is not necessarily transitive. respectively. y) without contain its symmetric (y. so d divides both a and b. m elements. x2 ∈ X have diﬀerent values f (x1 ).e. x). In general. say. j (i communicates with j) in Markov chain Functions We a function f from a set X into a set Y is denoted by f : X → Y . the greatest common divisor of any subset of the integers also makes sense. f (x2 ). a relation can be thought of as a set of pairs of elements of the set A. A relation is called symmetric if it cannot contain a pair (x. yielding r = (−qx)a + (1 − qy)b. that d does not divide both numbers. Suppose this is not true. so r = 0. The range of f is the set of its values. If we take A to be the set of students in a classroom. i. Finally. The set of functions from X to Y is denoted by Y X . i. For example. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). x).that d | a and d | b. The relation < is not reﬂexive. The relation < is not symmetric. A relation is called reflexive if it contains pairs (x. Both < and = are transitive. We encountered the equivalence relation i theory. If B ⊂ Y then f −1 (B) consists of all x ∈ X with f (x) ∈ B. we can divide b by d to ﬁnd b = qd + r..g. The equality relation = is reﬂexive. z) it also contains (x. y) and (y.e. The greatest common divisor of 3 integers can now be deﬁned in a similar manner. The equality relation is symmetric. So d is not minimum as claimed.

To be able to do this. the cardinality of the set B A ) is mn . Incidentally. n k is merely a name for the number computed on the left side of this n k = n n−k n 0 = n! . we have = 1. (The same name may be used for diﬀerent animals. If we don’t then we divide by the number k! of permutations of the k elements concluding that the number of subsets with k elements of a set with n elements equals (n)k n =: . The total number of subsets of A (regardless of their number of elements) is thus equal to n k=0 n k = (1 + 1)n = 2n . in other words. we may think of the elements of A as animals and those of B as names. the total number of one-to-one functions from A to B is (m)n := m(m − 1)(m − 2) · · · (m − n + 1). that of the second in m − 1 ways. the second in n − 1.) So the ﬁrst animal can be assigned any of the m available names. Hence the total number of assignments is m × m × ··· × m n times Suppose now we are not allowed to give the same name to diﬀerent animals. and so on. so does the second. the k-th in n − k + 1 ways. k! k where the symbol equation. and this follows from the binomial theorem. Since the name of the ﬁrst animal can be chosen in m ways. and the third. The ﬁrst element can be chosen in n ways. We thus observe that there are as many subsets in A as number of elements in the set {0. and so on.e. 109 . So we have (n)k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. n! is the total number of permutations of a set of n elements. A function is then an assignment of a name to each animal. n! stands for the product of the ﬁrst n natural numbers: n! = 1 × 2 × 3 × · · · × n. Any set A contains several subsets. Thus. and so on.The number of functions from A to B (i. we denote (n)n also by n!. k!(n − k)! Since there is only one subset with no elements (it is the empty set ∅). In the special case m = n. Each assignment of this type is a one-to-one function from A to B. 1}n of sequences of length n with elements 0 or 1. Clearly. we need more names than animals: n ≤ m. If A has n elements. a one-to-one function from the set A into A is called a permutation of A. Indeed. the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen.

by convention. b. Note that it is always true that a = a+ − a− . c. a− = − min(a. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c). m! n y If y is not an integer then. then there are k − 1 remaining k n animals to be chosen and this can be done in k−1 ways. by convention. and we don’t care to put parantheses because the order of ﬁnding a maximum is irrelevant. say. b). a∨b∨c refers to the maximum of the three numbers a. 0). the pig. If the pig is not included in the sample. 0). Both quantities are nonnegative. The notation extends to more than one numbers. b is denoted by a ∧ b = min(a. if x is a real number. Maximum and minimum The minimum of two numbers a. Since the sentences “the pig is included” and “the pig is not included” cannot be simultaneously true.We shall let n = 0 if n < k. a+ is called the positive part of a. Thus a ∧ b = a if a ≤ b or = b if b ≤ a. If the pig is included. we let The Pascal triangle relation states that n+1 k = = 0. k Also. The absolute value of a is deﬁned by |a| = a+ + a− . we add the two numbers together in order to ﬁnd the total number of ways to pick k animals from n + 1 animals. The maximum is denoted by a ∨ b = max(a. b). + k−1 k Indeed. n n . and a− is called the negative part of a. we choose k animals from the remaining n ones and this can be done in n ways. consider a speciﬁc animal. For example. in choosing k animals from a set of n + 1 ones. we deﬁne x m = x(x − 1) · · · (x − m + 1) . In particular. we use the following notations: a+ = max(a. 110 .

which means P (A ∪ B) = P (A) + P (B) − P (A ∩ B). If A is an event then 1A is a random variable. P (X ≥ n). A2 . are sets or clauses then the number n≥1 1An is the number true clauses. if I go to a bank N times and Ai is the clause “I am robbed the i-th time” then N n=1 1Ai is the total number of times I am robbed.}. Another example: Let X be a nonnegative random variable with values in N = {1. For instance. We also use the notation 1(clause) for a number that is 1 whenever the clause is true or 0 otherwise. It takes value 1 with probability P (A) and 0 with probability P (Ac ). meaning a function that takes value 1 on A and 0 on the complement Ac of A. It is worth remembering that. So X= n≥1 1(X ≥ n). . This is very useful notation because conjunction of clauses is transformed into multiplication. 111 . A linear function ϕ from Rn to Rm is such that ϕ(αx + βy) = αϕ(x) + βϕ(y). we have E[1A∪B ] = E[1A ] + E[1B ] − E[1A∩B ]. if A1 . An example of its application: Since 1A∪B = 1 − 1(A∪B)c = 1 − 1Ac ∩Bc = 1 − 1Ac 1Bc = 1 − (1 − 1A )(1 − 1B ) = 1A + 1B − 1A 1B . Therefore its expectation is E 1A = 1 × P (A) + 0 × P (Ac ) = P (A). So E[X] = n≥1 E 1(X ≥ n) = n≥1 Matrices Let Rd be the set of vectors with d components. It is easy to see that 1A 1A∩B = 1A 1B . Then X X= n=1 1 (the sum of X ones is X). Several quantities of interest are expressed in terms of indicators. 2. . . .Indicator function stands for the indicator function of a set A. = 1A + 1B − 1A∩B . . Thus. 1 − 1A = 1 A c . .

.for all x. So if we choose a basis v1 . Combining. . x . The relation between the column representation of y and the column representation of x is written as 1 1 x y a11 · · · a1n . Then any x ∈ Rn can be uniquely written as n x= j=1 x j uj . The matrix A of ϕ with respect to the choice of the two basis is deﬁned by a11 · · · a1n A := · · · · · · · · · am1 · · · amn y1 Then ϕ ◦ ψ is a linear function: Rn − → Rℓ − 112 ϕ◦ψ . . . Let u1 . is a column representation of x with respect to . β ∈ R. . y ∈ Rn . . . The column . . Suppose that ψ. . . ϕ are linear functions: It is a m × n matrix. am1 · · · amn xn ym Rn − Rm − Rℓ → → ψ ϕ . xn n 1 y := ϕ(x) = j=1 xj ϕ(uj ). where ∈ R. . . vm in Rn we can express each ϕ(uj ) as a linear combination of the basis vectors: m ϕ(uj ) = i=1 aij vi . is a column representation of y := ϕ(x) with respect to the given basis . . . vm . . Let y i := n aij xj . = ··· ··· ··· . . . un be a basis for Rn . . ym v1 . we have ϕ(x) = m n vi i=1 j=1 aij xj . because ϕ is linear. The column . . . . . and all α. x1 . But the ϕ(uj ) are vectors in Rm . j=1 i = 1. Then. . . m. xn the given basis.

Usually. j So A is a ℓ × m matrix. 1/2. there is x ∈ I with x ≤ ε + inf I. . . and AB is a ℓ × n matrix. . . a2 . inf I is characterised by: inf I ≤ x for all x ∈ I and. If I has no lower bound then inf I = −∞. .Fix three sets of basis on Rn . The standard basis in Rd consists of the vectors 1 0 0 0 1 0 e 1 = . . 1 ≤ j ≤ n. . If I is empty then inf ∅ = +∞. A matrix A is called square if it is a n × n matrix. there is x ∈ I with x ≥ ε + inf I. So lim sup xn = limn→∞ inf{xn . ed = . we deﬁne max I. 0 0 1 In other words. the choice of the basis is the so-called standard basis. In these cases we use the notation inf I. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ. . If I has no upper bound then sup I = +∞. . We note that lim inf(−xn ) = − lim sup xn . 1 ≤ i ≤ ℓ. Supremum and inﬁmum When we have a set of numbers I (for instance. a square matrix represents a linear transformation from Rn to Rn .} ≤ · · · and. . sup ∅ = −∞. for any ε > 0. . Likewise. For instance I = {1. standing for the “infimum” of I. . . Rm and Rℓ and let A. .e. . where m (AB)ij = k=1 Aik Bkj . 113 . . If we ﬁx a basis in Rn . This minimum (or maximum) may not exist. B is a m × n matrix. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists. . xn+1 . B be the matrix representations of ϕ. Then the matrix representation of ϕ ◦ ψ is AB. for any ε > 0. ψ. of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. .} ≤ inf{x3 . . Similarly. with respect to these bases. · · · }. we deﬁne the supremum sup I to be the minimum of all upper bounds. x2 . . x2 . . xn+2 . . deﬁned as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. respectively. we deﬁne lim sup xn . . 1/3. i. . . Limsup and liminf If x1 . . ei = 1(i = j). . Similarly. and it is this that we called inﬁmum. . a sequence a1 . . sup I is characterised by: sup I ≥ x for all x ∈ I and. x4 . x3 . . is a sequence of numbers then inf{x1 .} has no minimum. e2 = .} ≤ inf{x2 . it is its matrix representation with respect to the ﬁxed basis. . since any increasing sequence has a limit (possibly +∞) we take this limit and call it lim sup xn (limit superior) of the sequence.

if he events A1 . If the events A.Probability Theory I do not require11 knowledge of axiomatic Probability Theory (with Measure Theory). we work a lot with conditional probabilities. Markov’s inequality for a positive random variable X states that P (X > x) ≤ EX/x.. n If the events A1 . . Quite often. but not too much. are increasing with A = ∪n An then P (A) = limn→∞ P (An ). E 1A = P (A). namely: Everything else. it is useful to ﬁnd disjoint events B1 . If A1 . Linear Algebra that has many axioms is more ‘restrictive’. . That is why Probability Theory is very rich. B in the deﬁnition. . . we often need to argue that X ≤ Y . if An are events with P (An ) = 1 then P (∩n An ) = 1. showing that P (A) ≤ P (B) is more of a logical problem than a numerical one: we need to show that A implies B (i. . Therefore. in which case we write P (A) = n P (A ∩ Bn ) = P (A|Bn )P (Bn ). But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems.e. are decreasing with A = ∩n An then P (A) = limn→∞ P (An ). A is a subset of B). This is generalisable for a number of events: P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) If we interchange the roles of A. derive formulae. Linear Algebra that has many axioms is more ‘restrictive’. Recall that P (B|A) is deﬁned by P (A ∩ B)/P (A) and is a very motivated deﬁnition. Note that 1A 1B = 1A∩B . such that ∪n Bn = Ω. then P n≥1 An = P (An ). Similarly. i. by not requiring this. and 1 Ac = 1 − 1 A . . the function that takes value 1 on A and 0 on Ac . A2 . Similarly. (That is why Probability Theory is very rich. A2 . Indeed. and apply them.e. In contrast. . to show that EX ≤ EY . B are disjoint events then P (A ∪ B) = P (A) + P (B). in these notes. for (besides P (Ω) = 1) there is essentially one axiom . Of course. A2 . This is the inclusion-exclusion formula for two events. For example. This can be generalised to any ﬁnite number of events. n P (An ). to compute P (A ∩ B) we read the deﬁnition as P (A ∩ B) = P (A)P (B|A). . . P (∪n An ) ≤ Similarly. Often. I do sometimes “cheat” from a strict mathematical point of view. are mutually disjoint events. If A is an event then 1A or 1(A) is the indicator random variable. Probability Theory is exceptionally simple in terms of its axioms. which holds since X is positive.. 11 114 . If An are events with P (An ) = 0 then P (∪n An ) = 0.) If A. In Markov chain theory. B are not disjoint then P (A ∪ B) = P (A) + P (B) − P (A ∩ B). n≥1 Usually. follows from this axiom. . B2 . to compute P (A). . . In contrast. we get P (A ∩ B) = P (A)P (B|A) = P (B)P (A|B) and this is called Bayes’ theorem. and everywhere else. This follows from the logical statement X 1(X > x) ≤ X.

. This is due to the fact that n=1 P (X ≥ n) = E 1(X ≥ n). As an example. 1 − ρs (39) Note that |G(s)| ≤ n |an ||s|n and this is ≤ n |an | if s ∈ [−1. |s| < 1/|ρ|. . . . we deﬁne X + = max(X. which is motivated from the algebraic identity X = X + − X − . x→∞ In Markov chain theory. In case the random variable is ∞ absoultely continuous with density f . . is an inﬁnite polynomial having the ai as coeﬃcients: G(s) = a0 + a1 s + a2 s2 + · · · . making it a useful object when the a0 . So if n an < ∞ then G(s) exists for all s ∈ [−1. . so it is useful to keep this in mind. The formula holds for any positive random variable. 1]. obviously. . 1]. taught in calculus courses). The generating function of the sequence pn = P (X = n). n ≥ 0 is ∞ ρn s n = n=0 1 . 1. 1. a1 . 0).An application of this is: P (X < ∞) = lim P (X ≤ x). For a random variable that takes values in N we can write EX = ∞ P (X ≥ n). We do not a priori know that this exists for any s other than. Generating functions A generating function of a sequence a0 . The expectation of a nonnegative random variable X may be deﬁned by ∞ EX = 0 P (X > x)dx. . In case the random variable is discrete. is called probability generating function of X: ∞ as long as |ρs| < 1. we encounter situations where P (X = ∞) may be positive. For a random variable which takes positive and negative values. viz. a1 . for s = 0 for which G(0) = 0. the generating function of the geometric sequence an = ρn . 0). ϕ(s) = EsX = n=0 sn P (X = n) 115 . 2. the formula reduces to the known one.} because then n an = 1. is a probability distribution on Z+ = {0. . . Suppose now that X is a random variable with values in Z+ ∪ {+∞}. This deﬁnition makes sense only if EX + < ∞ or if EX − < ∞. . .e. we have EX = −∞ xf (x)dx. i. . X − = − min(X. n = 0. If it exists for some s1 then it exists uniformly for all s with |s| < |s1 | (and this needs a proof. EX = x xP (X = x). which are both nonnegative and then deﬁne EX = EX + − EX − .

As an example. n ≥ 0) is n≥0 xn+1 sn = s−1 (X(s) − x0 ) and the generating function of (xn+2 . 116 . and this is why ϕ is called probability generating function. n ≥ 0) is n≥0 xn+2 sn = s−2 (X(s) − x0 − sx1 ). possibly. we have s∞ = 0. ds ds and by letting s = 1. Generating functions are useful for solving linear recursions. Note that ∞ n=0 pn ≤ 1. Since the term s∞ p∞ could formally appear in the sum deﬁning ϕ(s). n ≥ 0) then the generating function of (xn+1 . We can recover the probabilities pn from the formula pn = ϕ(n) (0) . ∞ and p∞ := P (X = ∞) = 1 − pn . Note that ϕ(0) = P (X = 0). with x0 = x1 = 1. s→1 something that can be formally seen from d d EsX = E sX = EXsX−1 . but. consider the Fibonacci sequence deﬁned by the recursion xn+2 = xn+1 + xn . If we let X(s) = n≥0 n ≥ 0. since |s| < 1. take the value +∞. xn sn be the generating of (xn . n! as follows from Taylor’s theorem.This is deﬁned for all −1 < s < 1. we can recover the expectation of X from EX = lim ϕ′ (s). Also. while ∞ ϕ(1) = n=0 P (X = n) = P (X < +∞). n=0 We do allow X to.

e. is such an example. So. for example. we can also write this as xn = (−a)n+1 − (−b)n+1 . But I could very well use the function P (X > x).From the recursion xn+2 = xn+1 + xn we have (and this is were linearity is used) that the generating function of (xn+2 . we can solve for X(s) and ﬁnd X(s) = s2 −1 . Hence s2 + s − 1 = (s − a)(s − b). the function P (X ≤ x). 117 . This tells us that s−b is the generating function of (1/b)n+1 1 and that s−a is the generating function of (1/a)n+1 . i. namely. Hence X(s) is the generating function of 1 xn = ((1/b)n+1 − (1/a)n+1 ). − − a−b s−b s−a b−a b−s a−s 1/ρ 1 1 But (39) tells us that ρn has generating function 1−ρs = (1/ρ)−s . this number is always an integer (it has to be!) Distribution or law of a random object When we speak of the “distribution” or “law” of a random variable X with values in Z or R or Rn or even in a much larger space. i.e. n ≥ 0). the n-th term of the Fibonacci sequence. Despite the square roots in the formula. Since x0 = x1 = 1. if X is a random variable with values in R. x ∈ R. we speak of some function which enables us to specify the probabilities of the form P (X ∈ B) where B is a set of values of X. n ≥ 0) equals the sum of the generating functions of (xn+1 . then the so-called distribution function. Thus (1/ρ)−s is the 1 generating function of ρn+1 . s−2 (X(s) − x0 − sx1 ) = s−1 (X(s) − x0 ) + X(s). +s−1 But the polynomial s2 + s − 1 has two roots: √ √ a = ( 5 − 1)/2. n ≥ 0) and (xn . b−a is the solution to the recursion. and so X(s) = 1 1 1 1 1 1 = . b = −( 5 + 1)/2. b−a Noting that ab = −1.

But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. for example they could take values which are themselves functions. Such an excursion is a random object which should be thought of as a random function. n ∈ Z+ . dx All these objects can be referred to as “distribution” or “law” of X. but using or merely talking about such an object presents no diﬃculty other than overcoming a psychological barrier. Specifying the distribution of such an object may be hard. If X is absolutely continuous. 1 Since d dx (x log x − x) = log x. for it does allow us to specify the probabilities P (X = n). it’s better to use logarithms: n n! = e log n = exp k=1 log k. Now. to compute a large sum we can. We frequently encounter random objects which belong to larger spaces. n Hence n k=1 log k ≈ log x dx. instead. Large products: Stirling’s formula If we want to compute n! = 1 × 2 × · · · × n for large n then. Warning: the verb “specify” should not be interpreted as “compute” in the sense of existence of an analytical formula or algorithm. Even the generating function EsX of a Z+ -valued random variable X is an example. It turns out to be a good one. since the number is so big. Since log x does not vary too much between two successive positive integers k and k + 1 it follows that k+1 k log x dx ≈ log k. (b) The approximation is not good when we consider ratios of products. we encountered the concept of an excursion of a Markov chain from a state i. n! ≈ exp(n log n − n) = nn e−n . For example (try it!) the rough Stirling approximation give that the probability that. compute an integral and obtain an approximation.or the 2-parameter function P (a < X < b). For instance. Hence we have We call this the rough Stirling approximation. it follows that n 1 log x dx = n log n − n + 1 ≈ n log n − n. I could use its density f (x) = d P (X ≤ x). in 2n fair coin tosses 118 .

y ∈ R g(y) ≥ g(x)1(y ≥ x). so this little section is a reminder. 0 ≤ n ≤ m. It was shown in Theorem 38 that J0 satisﬁes the memoryless property. we shall prove Strirling’s approximation: √ n! ∼ nn e−n 2πn where the symbol an ∼ bn means that an /bn → 0. Surely. and is based on the concept of monotonicity. And if y < x then 1(y < x) = 0 and the display reads g(y) ≥ 0. We have already seen a geometric random variable. Then. Indeed. expressing non-negativity. you have seen it before. a geometric function of k. The geometric distribution We say that the random variable X with values in Z+ is geometric if it has the memoryless property: P (X − n ≥ m | X ≥ n) = P (X ≥ m). The same process shows that a geometric random variable has a distribution of the form P (X ≥ n) = λn . Let X be a random variable. To remedy this. n ∈ Z+ . Consider an increasing and nonnegative function g : R → R+ . expressing monotonicity. Markov’s inequality Arguably this is the most useful inequality in Probability. where λ = P (X ≥ 1). for all x. We baptise this geometric(λ) distribution. g(X) ≥ g(x)1(X ≥ x). This completely trivial thought gives the very useful Markov’s inequality. and so. because we know that the n probability goes to zero as n → ∞.we have exactly n heads equals 2n 2−2n ≈ 1. if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x). as n → ∞. That was the random variable J0 = ∞ n=1 1(Sn = 0). In fact it was shown there that P (J0 ≥ k) = P (J0 ≥ 1)k . Eg(X) ≥ g(x)P (X ≥ x) . then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any diﬀerent from X. 119 . If we think of X as the time of occurrence of a random phenomenon. We can case both properties of g neatly by writing For all x. and this is terrible. by taking expectations of both sides (expectation is an increasing operator). deﬁned in (34).

´ 2. Springer. Soc. Springer. C M and Snell. 3.Bibliography ´ 1. 120 . Grinstead. W (1968) An Introduction to Probability Theory and its Applications. Math. Norris. Cambridge University Press. 5.mac. 4. Amer. Feller. Cambridge University Press. K L and Aitsahlia F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance).dartmouth. Springer. 7. Wiley. Also available at: http://www.pdf 6. W (1981) Topics in Ergodic Theory. Bremaud P (1999) Markov Chains. J L (1997) Introduction to Probability.edu/~chance/teaching_aids/books_articles/ /probability_book/amsbook. Parry. Bremaud P (1997) An Introduction to Probabilistic Modeling. Chung. J R (1998) Markov Chains.

115 geometric distribution. 63 Galton-Watson process. 68 Fatou’s lemma. 61 null recurrent. 76 neighbours. 53 hitting time. 59 law. 10 ballot theorem. 119 Google. 44 degree. 47 coupling inequality. 18 permutation. 6 Markov’s inequality. 20 increments. 84 bank account. 7 invariant distribution. 48 cycle formula. 58 digraph (directed graph). 77 indicator. 111 maximum. 98 dynamical system. 119 matrix. 51 essential state. 15. 119 minimum. 16. 117 leads to. 20 function of a Markov chain. 16 communicating classes. 10 irreducible. 114 balance equations. 77 coupling. 106 ground state. 61 detailed balance equations. 15 Loynes’ method. 53 Fibonacci sequence. 115 random walk on a graph. 109 positive recurrent. 51 Monte Carlo. 61 recurrent. 1 equivalence classes. 18 arcsine law. 110 meeting time. 87 121 . 70 greatest common divisor. 17 Aloha. 82 Cesaro averages. 111 inessential state. 113 initial distribution. 110 mixing. 108 relation. 73 Markov chain. 34 PageRank. 6 Markov property. 17 inﬁmum. 3 branching process. 108 ruin probability. 108 ergodic. 65 Catalan numbers. 17 Kolmogorov’s loop criterion. 17 Ethernet.Index absorbing. 34 probability ﬂow. 117 division. 26 running maximum. 2 nearest neighbour random walk. 13 probability generating function. 81 reﬂection principle. 16 convolution. 18 dual. 7 closed. 17 communicates with. 101 axiom of Probability Theory. 35 Helly-Bray Lemma. 16 equivalence relation. 65 generating function. 29 reﬂection. 116 ﬁrst-step analysis. 45 Chapman-Kolmogorov equation. 70 period. 68 aperiodic. 86 reﬂexive. 48 memoryless property. 15 distribution.

104 states. 62 Wright-Fisher model. 62 subsequence of a Markov chain. 7. 88 watching a chain in a set. 9 stationary distribution. 6 stationary. 7 time-reversible. 73 strategy of the gambler. 2 transition probability. 6 statespace. 108 tree. 77 skip-free property. 27 subordination. 7 simple random walk. 119 strong Markov property. 24 Strirling’s approximation. 8 transitive. 76 Skorokhod embedding. 16. 60 urn. 62 supremum. 3. 10 steady-state.semigroup property. 108 time-homogeneous. 13 stopping time. 16. 27 storage chain. 113 symmetric. 8 transition probability matrix. 76 simple symmetric random walk. 58 transient. 71 122 . 29 transition diagram:.

Sign up to vote on this title

UsefulNot useful- markov-chains
- 20752_0412055511MarkovChain
- Markov Chain
- (Ebook) Hidden Markov Models (Theory & Methods) Markov Chains Particle Filter Monte Carlo Hmm - Cappemoulinesryden
- MarkovChains
- Markov Chains
- Bayesian Econometric Methods
- Solved Problems
- Markov Coding
- A First Course in Linear Algebra
- ProblemsMarkovChains
- An Introduction to Statistics
- Ash Basic Probability Theory (Dover)
- Statistics
- Exercises for Stochastic Calculus
- Introduction to Markov Chain Monte Carlo Simulation and Their Statistical Analysis
- Applied Maths
- Markov Chain
- Markov
- markov chain chapter 6
- Stochastic Processes
- Grimmett g.r., Stirzaker d.r. Probability and Random Processes (3ed., Oxford, 2001)(1)
- A Course in Linear Algebra .pdf
- Usig R for Introductory Statistics
- Basic Topology Armstrong
- MIR publisher
- Weather Wax Ross Solutions
- Math Olympiad
- Markov Chain Random Walk