
**MARKOV CHAINS AND RANDOM WALKS**

Takis Konstantopoulos∗
Autumn 2009

∗ © Takis Konstantopoulos 2006-2009

Contents

1 Introduction
2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains
3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains
4 Stationarity
  4.1 Finding a stationary distribution
5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains
19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain
24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

These notes are still incomplete. I plan to add a few more sections:
- On algorithms and simulation
- On criteria for positive recurrence
- On doing stuff with matrices
- On finance applications
- On a few more delicate computations for simple random walks
- Reshape the appendix
- Add more examples
These are things to come.

A few starred sections should be considered "advanced" and can be omitted at first reading.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor $\gcd(a, b)$ of two positive integers $a$ and $b$. Recall (from elementary school maths) that $\gcd(a, b) = \gcd(r, b)$, where $r = \operatorname{rem}(a, b)$ is the remainder of the division of $a$ by $b$. By repeating the procedure, we end up with two numbers, one of which is 0 and the other the greatest common divisor. Formally, we let our "state" be a pair of integers $(x_n, y_n)$, where $x_n \ge y_n$, initialised by

$$x_0 = \max(a, b), \qquad y_0 = \min(a, b),$$

and evolving as

$$x_{n+1} = y_n, \qquad y_{n+1} = \operatorname{rem}(x_n, y_n), \qquad n = 0, 1, 2, \ldots$$

In other words, there is a function $F$ that takes the pair $X_n = (x_n, y_n)$ into $X_{n+1} = (x_{n+1}, y_{n+1})$. The sequence of pairs thus produced, $X_0, X_1, X_2, \ldots$, has the property that, for any $n$, the future pairs $X_{n+1}, X_{n+2}, \ldots$ depend only on the pair $X_n$.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass $m$ is moving under the action of a force $F$ in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force:

$$m\ddot{x} = F.$$

Here, $x = x(t)$ denotes the position of the particle at time $t$. Its velocity is $\dot{x} = \frac{dx}{dt}$, and its acceleration is $\ddot{x} = \frac{d^2x}{dt^2}$. Assume that the force $F$ depends only on the particle's position and velocity, i.e. $F = f(x, \dot{x})$. The state of the particle at time $t$ is described by the pair $X(t) = (x(t), \dot{x}(t))$. It is not hard to see that, for any $t$, the future trajectory $(X(s), s \ge t)$ can be completely specified by the current state $X(t)$. In other words, past states are not needed.

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness. We will be dealing with random variables, instead of deterministic objects.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability.

¹ Stochastic means random.

An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral $\int_a^b f(x)\,dx$ of a complicated function $f$: suppose that $f$ is positive and bounded above by $B$. Choose a pair $(X_0, Y_0)$ of random numbers, $a \le X_0 \le b$, $0 \le Y_0 \le B$, uniformly. Repeat with $(X_n, Y_n)$, for steps $n = 1, 2, \ldots$. Each time count 1 if $Y_n < f(X_n)$ and 0 otherwise. Perform this a large number $N$ of times. Count the number of 1's and divide by $N$. The resulting ratio, multiplied by the area $B(b - a)$ of the rectangle from which the points are drawn, is an approximation of $\int_a^b f(x)\,dx$. Try this on the computer!

We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A mouse lives in the cage. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time $n$ (minutes) then, at time $n + 1$ it is either still in 1 or has moved to 2.

[Figure: the cage, with its two cells labelled 1 and 2.]

Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability $\alpha = 0.05$; it does so, regardless of where it was at earlier times. Similarly, it moves from 2 to 1 with probability $\beta = 0.99$. We can summarise this information by the transition diagram:

[Transition diagram: states 1 and 2; an arrow from 1 to 2 labelled $\alpha$, an arrow from 2 to 1 labelled $\beta$, and self-loops labelled $1 - \alpha$ at state 1 and $1 - \beta$ at state 2.]
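The scientist's record-keeping can be imitated on a computer. The following is a minimal sketch, not part of the original notes: it simulates the mouse minute by minute using the values of α and β given above, and then recovers both probabilities from the recorded positions as observed frequencies. The function names `simulate_mouse` and `estimate_move_probabilities`, the seed, and the run length are illustrative assumptions.

```python
import random

ALPHA = 0.05  # probability of moving from cell 1 to cell 2 (value from the text)
BETA = 0.99   # probability of moving from cell 2 to cell 1 (value from the text)


def simulate_mouse(steps, start=1, seed=0):
    """Return the recorded cell (1 or 2) at minutes 0, 1, ..., steps."""
    rng = random.Random(seed)
    cell = start
    record = [cell]
    for _ in range(steps):
        # The move depends only on the current cell, not on earlier positions.
        move_prob = ALPHA if cell == 1 else BETA
        if rng.random() < move_prob:
            cell = 3 - cell  # switch between cell 1 and cell 2
        record.append(cell)
    return record


def estimate_move_probabilities(record):
    """Estimate (alpha, beta) as the observed fraction of moves out of each cell."""
    visits = {1: 0, 2: 0}
    moves = {1: 0, 2: 0}
    for now, nxt in zip(record, record[1:]):
        visits[now] += 1
        if nxt != now:
            moves[now] += 1
    return moves[1] / visits[1], moves[2] / visits[2]


record = simulate_mouse(200_000)
alpha_hat, beta_hat = estimate_move_probabilities(record)
print(alpha_hat, beta_hat)
```

With a long enough record the two estimates should settle near 0.05 and 0.99, which is the sense in which "statistical observations" determine the model.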

Another way to summarise the information is by the 2×2 transition probability matrix

$$\mathsf{P} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} = \begin{pmatrix} 0.95 & 0.05 \\ 0.99 & 0.01 \end{pmatrix}.$$

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean $1/\alpha = 1/0.05 = 20$ minutes. (This is the mean of the geometric distribution with parameter $\alpha$.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

$$X_{n+1} = \max\{X_n + D_n - S_n,\ 0\}.$$

Here, $X_n$ is the money (in pounds) at the beginning of the month, $D_n$ is the money he deposits, and $S_n$ is the money he wants to spend. So if he wishes to spend more than what is available, he can't. Assume that $D_n$, $S_n$ are random variables with distributions $F_D$, $F_S$, respectively, where $F_D(z) := P(D_n = z)$ and $F_S(z) := P(S_n = z)$. Assume also that $X_0, D_0, S_0, D_1, S_1, D_2, S_2, \ldots$ are independent random variables. The information described above by the evolution of the account together with the distributions for $D_n$ and $S_n$ leads to the (one-step) transition probability which is defined by:

$$p_{x,y} := P(X_{n+1} = y \mid X_n = x), \qquad x, y = 0, 1, 2, \ldots$$

We easily see that if $y > 0$ then

$$\begin{aligned}
p_{x,y} = P(D_n - S_n = y - x) &= \sum_{z=0}^{\infty} P(S_n = z,\ D_n - S_n = y - x) \\
&= \sum_{z=0}^{\infty} P(S_n = z,\ D_n = z + y - x) \\
&= \sum_{z=0}^{\infty} P(S_n = z)\,P(D_n = z + y - x) \\
&= \sum_{z=0}^{\infty} F_S(z)\,F_D(z + y - x),
\end{aligned}$$

something that, in principle, can be computed from $F_S$ and $F_D$. Of course, if $y = 0$, we have $p_{x,0} = 1 - \sum_{y=1}^{\infty} p_{x,y}$. The transition probability matrix is

$$\mathsf{P} = \begin{pmatrix} p_{0,0} & p_{0,1} & p_{0,2} & \cdots \\ p_{1,0} & p_{1,1} & p_{1,2} & \cdots \\ p_{2,0} & p_{2,1} & p_{2,2} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

[Transition diagram: the integers $\ldots, -2, -1, 0, 1, 2, 3, \ldots$; from each position an arrow to each of its two neighbours, labelled 1/2.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability $p$ for the drunkard to go one step to the right (and probability $1 - p$ for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Transition diagram on the three states H (healthy), S (sick) and D (dead). The arrow from H to S is labelled 0.3; the other labels appearing in the figure are 0.8, 0.01 and 0.1.]

Thus, there is probability $p_{H,S} = 0.3$ that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits $p_{H,H}$ and $p_{S,S}$ because $p_{H,H} = 1 - p_{H,S} - p_{H,D}$ and $p_{S,S} = 1 - p_{S,H} - p_{S,D}$. Also, $p_{D,D}$ is omitted, the reason being that the company does not believe that its clients are subject to resurrection; therefore, $p_{D,D} = 1$.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
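Before leaving the examples, it is worth taking up the Introduction's invitation to try the Monte Carlo method on a computer. The sketch below is only an illustration, not the author's code: it draws points uniformly in the rectangle [a, b] × [0, B], counts the fraction that falls under the graph of f, and rescales that fraction by the rectangle's area B(b − a). The function name `monte_carlo_integral`, the test integrand sin on [0, π] (whose integral is exactly 2), the bound B = 1, the sample size and the seed are all assumptions chosen for the demonstration.

```python
import math
import random


def monte_carlo_integral(f, a, b, B, n_samples, seed=0):
    """Hit-or-miss estimate of the integral of f over [a, b], assuming 0 <= f <= B."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = rng.uniform(a, b)    # X_n, uniform on [a, b]
        y = rng.uniform(0.0, B)  # Y_n, uniform on [0, B]
        if y < f(x):             # count 1 if the point lies under the graph
            hits += 1
    # The hit fraction estimates (integral) / (area of the rectangle).
    return (hits / n_samples) * B * (b - a)


estimate = monte_carlo_integral(math.sin, 0.0, math.pi, 1.0, 200_000)
print(estimate)
```

The error of such an estimate shrinks like one over the square root of the number of samples, regardless of how complicated f is, which is why the method remains attractive for functions of many variables.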

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables $\{X_n, n \in T\}$, where $T$ is a subset of the integers. We say that this sequence has the Markov property if, for any $n \in T$, the future process $(X_m,\ m > n,\ m \in T)$ is independent of the past process $(X_m,\ m < n,\ m \in T)$, conditionally on $X_n$. Sometimes we say that $\{X_n, n \in T\}$ is Markov instead of saying that it has the Markov property.

Usually, $T = \mathbb{Z}_+ = \{0, 1, 2, 3, \ldots\}$ or $\mathbb{N} = \{1, 2, 3, \ldots\}$ or the set $\mathbb{Z}$ of all integers (positive, zero, or negative). We focus almost exclusively on the first choice. So we will omit referring to $T$ explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the $X_n$ take values in some countable set $S$, called the state space. The elements of $S$ are frequently called states. Since $S$ is countable, it is customary to call $(X_n)$ a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. $(X_n, n \in \mathbb{Z}_+)$ is Markov if and only if, for all $n \in \mathbb{Z}_+$ and all $i_0, \ldots, i_{n+1} \in S$,

$$P(X_{n+1} = i_{n+1} \mid X_n = i_n, \ldots, X_0 = i_0) = P(X_{n+1} = i_{n+1} \mid X_n = i_n).$$

Proof. If $(X_n)$ is Markov then the claim holds by the conditional independence of $X_{n+1}$ and $(X_0, \ldots, X_{n-1})$ given $X_n$. On the other hand, if the claim holds, we can see that by applying it several times we obtain

$$P(\forall\, k \in [n+1, n+m]\ X_k = i_k \mid X_n = i_n, \ldots, X_0 = i_0) = P(\forall\, k \in [n+1, n+m]\ X_k = i_k \mid X_n = i_n),$$

for any $n$, $m$ and any choice of states $i_0, \ldots, i_{n+m}$, which precisely means that $(X_{n+1}, \ldots, X_{n+m})$ and $(X_0, \ldots, X_{n-1})$ are independent conditionally on $X_n$. This is true for any $n$ and $m$, and so future is independent of past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = P(X_0 = i_0)\,P(X_1 = i_1 \mid X_0 = i_0)\,P(X_2 = i_2 \mid X_1 = i_1) \cdots P(X_n = i_n \mid X_{n-1} = i_{n-1}), \tag{1}$$

which means that the two functions

$$\mu(i) := P(X_0 = i),\quad i \in S, \qquad p_{i,j}(n, n+1) := P(X_{n+1} = j \mid X_n = i),\quad i, j \in S,\ n \ge 0,$$

will specify all joint distributions²:

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = \mu(i_0)\, p_{i_0,i_1}(0,1)\, p_{i_1,i_2}(1,2) \cdots p_{i_{n-1},i_n}(n-1,n).$$

² And, by a result in Probability, all probabilities of events associated with the Markov chain.

The function $\mu(i)$ is called initial distribution. The function $p_{i,j}(n, n+1)$ is called transition probability from state $i$ at time $n$ to state $j$ at time $n + 1$. More generally,

$$p_{i,j}(m, n) := P(X_n = j \mid X_m = i)$$

is the transition probability from state $i$ at time $m$ to state $j$ at time $n \ge m$. Of course, $p_{i,j}(n, n) = \mathbf{1}(i = j)$ and, by (1),

$$p_{i,j}(m, n) = \sum_{i_{m+1} \in S} \cdots \sum_{i_{n-1} \in S} p_{i,i_{m+1}}(m, m+1) \cdots p_{i_{n-1},j}(n-1, n), \tag{2}$$

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each $m \le n$, the matrix

$$\mathsf{P}(m, n) := [p_{i,j}(m, n)]_{i,j \in S},$$

viz., a square matrix whose rows and columns are indexed by $S$ (and this, of course, requires some ordering on $S$) and whose entries are $p_{i,j}(m, n)$. If $|S| = d$ then we have $d \times d$ matrices, but if $S$ is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

$$\mathsf{P}(m, n) = \mathsf{P}(m, m+1)\,\mathsf{P}(m+1, m+2) \cdots \mathsf{P}(n-1, n),$$

or as

$$\mathsf{P}(m, n) = \mathsf{P}(m, \ell)\,\mathsf{P}(\ell, n) \quad \text{if } m \le \ell \le n,$$

something that can be called semigroup property or Chapman-Kolmogorov equation.

³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities $p_{i,j}(n, n+1)$ do not depend on $n$, and therefore we can simply write

$$p_{i,j}(n, n+1) =: p_{i,j},$$

and call $p_{i,j}$ the (one-step) transition probability from state $i$ to state $j$, while $\mathsf{P} = [p_{i,j}]_{i,j \in S}$ is called the (one-step) transition probability matrix or, simply, transition matrix. Then

$$\mathsf{P}(m, n) = \underbrace{\mathsf{P} \times \mathsf{P} \times \cdots \times \mathsf{P}}_{n - m \text{ times}} = \mathsf{P}^{n-m},$$

for all $m \le n$, and the semigroup property is now even more obvious because

$$\mathsf{P}^{a+b} = \mathsf{P}^a\,\mathsf{P}^b,$$

for all integers $a, b \ge 0$. From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let

$$\mu_n := [\mu_n(i) = P(X_n = i)]_{i \in S}$$

be the distribution of $X_n$, arranged in a row vector. Then we have

$$\mu_n = \mu_0\,\mathsf{P}^n, \tag{3}$$

as follows from (1) and the definition of $\mathsf{P}$.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state $i$ and 0 to all other states is denoted by $\delta_i$. In other words,

$$\delta_i(j) := \begin{cases} 1, & \text{if } j = i \\ 0, & \text{otherwise.} \end{cases}$$

We call $\delta_i$ the unit mass at $i$. It corresponds, of course, to picking state $i$ with probability 1. Notice that any probability distribution on $S$ can be written as a linear combination of unit masses. For example, if $\mu$ assigns probability $p$ to state $i_1$ and $1 - p$ to state $i_2$, then

$$\mu = p\,\delta_{i_1} + (1 - p)\,\delta_{i_2}.$$

Starting from a specific state $i$. A useful convention used for time-homogeneous chains is to write

$$P_i(A) := P(A \mid X_0 = i).$$

Thus $P_i$ is the law of the chain when the initial distribution is $\delta_i$. Similarly, if $Y$ is a real random variable, we write

$$E_i Y := E(Y \mid X_0 = i) = \sum_y y\,P(Y = y \mid X_0 = i) = \sum_y y\,P_i(Y = y).$$
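To see the matrix notation at work, here is a small sketch (an illustration under stated assumptions, not part of the original notes) that computes $\mu_n = \mu_0 \mathsf{P}^n$ from (3) and checks the semigroup property $\mathsf{P}^{a+b} = \mathsf{P}^a \mathsf{P}^b$ for the mouse chain of Section 2.1. Exact rational arithmetic is used so that the identities hold without rounding; the helper names `mat_mul`, `mat_pow` and `distribution_at` are hypothetical.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, n):
    """n-th power of a square matrix P, with P^0 the identity."""
    d = len(P)
    result = [[Fraction(int(i == j)) for j in range(d)] for i in range(d)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def distribution_at(mu0, P, n):
    """Row vector mu_n = mu_0 P^n, as in equation (3)."""
    return mat_mul([mu0], mat_pow(P, n))[0]


# Transition matrix of the mouse chain (alpha = 0.05, beta = 0.99).
P = [[Fraction(95, 100), Fraction(5, 100)],
     [Fraction(99, 100), Fraction(1, 100)]]

delta_1 = [Fraction(1), Fraction(0)]   # unit mass at state 1
mu_2 = distribution_at(delta_1, P, 2)  # law of X_2 when X_0 = 1
semigroup_holds = mat_pow(P, 5) == mat_mul(mat_pow(P, 2), mat_pow(P, 3))
print([float(x) for x in mu_2], semigroup_holds)
```

Starting from the unit mass $\delta_1$, the vector `mu_2` is the first row of $\mathsf{P}^2$, in agreement with the convention $P_1(X_2 = j) = P(X_2 = j \mid X_0 = 1)$.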

4 Stationarity

We say that the process $(X_n, n = 0, 1, \ldots)$ is stationary if it has the same law as $(X_n, n = m, m+1, \ldots)$, for any $m \in \mathbb{Z}_+$. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a regular heptagon and let a bug move on its vertices. If at time $n$ the bug is on a vertex, then, at time $n + 1$ it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, i.e. selecting a vertex with probability 1/7, then, at any point of time $n$, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are $N$ molecules of gas (where $N$ is a large number, say $N = 10^{25}$) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Let $X_n$ be the number of molecules in room 1 at time $n$. Then

$$p_{i,i+1} = P(X_{n+1} = i + 1 \mid X_n = i) = \frac{N - i}{N}, \qquad p_{i,i-1} = P(X_{n+1} = i - 1 \mid X_n = i) = \frac{i}{N}.$$

If we start the chain with $X_0 = N$ (all molecules in room 1) then the process $(X_n, n \ge 0)$ will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take $X_0 = N/2$. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, we indeed expect that we have $N/2$ molecules in each room. Another guess then is: Place each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be $i$ with probability

$$\pi(i) = \binom{N}{i} 2^{-N}, \qquad 0 \le i \le N,$$

(a binomial distribution). We can now verify that if $P(X_0 = i) = \pi(i)$ then $P(X_1 = i) = \pi(i)$ and, hence, $P(X_n = i) = \pi(i)$ for all $n$. Indeed,

$$\begin{aligned}
P(X_1 = i) = \sum_j P(X_0 = j)\,p_{j,i} &= P(X_0 = i - 1)\,p_{i-1,i} + P(X_0 = i + 1)\,p_{i+1,i} \\
&= \pi(i - 1)\,p_{i-1,i} + \pi(i + 1)\,p_{i+1,i} \\
&= \binom{N}{i-1} 2^{-N}\,\frac{N - i + 1}{N} + \binom{N}{i+1} 2^{-N}\,\frac{i + 1}{N} = \cdots = \pi(i).
\end{aligned}$$

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution $\pi$ with the property that if $X_0$ has distribution $\pi$ then every other $X_n$ also has distribution $\pi$. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain $(X_n, n \in \mathbb{Z}_+)$ is stationary if and only if it is time-homogeneous and $X_n$ has the same distribution as $X_\ell$ for all $n$ and $\ell$.

Proof. The only if part is obvious. For the if part, use (1) to see that

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = P(X_m = i_0, X_{m+1} = i_1, X_{m+2} = i_2, \ldots, X_{m+n} = i_n),$$

for all $m \ge 0$ and $i_0, \ldots, i_n \in S$.

In other words, a Markov chain is stationary if and only if the distribution of $X_n$ does not change with $n$. Since, by (3), the distribution $\mu_n$ at time $n$ is related to the distribution at time 0 by $\mu_n = \mu_0 \mathsf{P}^n$, we see that the Markov chain is stationary if and only if the distribution $\mu_0$ is chosen so that

$$\mu_0 = \mu_0 \mathsf{P}.$$

Changing notation, we are led to consider:

The stationarity problem: Given $[p_{ij}]$, find those distributions $\pi$ such that

$$\pi \mathsf{P} = \pi. \tag{4}$$

Such a $\pi$ is called stationary or invariant distribution. The equations (4) are called balance equations.
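The "missing algebra" in the Ehrenfest example can also be checked mechanically. The sketch below is an illustration only (with hypothetical helper names and a small N = 10 chosen for the demonstration): it builds the Ehrenfest transition matrix with exact fractions and confirms that the binomial distribution satisfies the balance equations (4), while the first guess, the unit mass at N/2, does not.

```python
from fractions import Fraction
from math import comb


def ehrenfest_matrix(N):
    """Transition matrix on states 0..N: p(i,i+1) = (N-i)/N, p(i,i-1) = i/N."""
    P = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        if i < N:
            P[i][i + 1] = Fraction(N - i, N)
        if i > 0:
            P[i][i - 1] = Fraction(i, N)
    return P


def binomial_half(N):
    """pi(i) = C(N, i) 2^(-N): each molecule placed in room 1 with probability 1/2."""
    return [Fraction(comb(N, i), 2 ** N) for i in range(N + 1)]


def step(mu, P):
    """One step of the chain on distributions: returns the row vector mu P."""
    return [sum(mu[j] * P[j][i] for j in range(len(mu))) for i in range(len(mu))]


N = 10
P = ehrenfest_matrix(N)
pi = binomial_half(N)
binomial_is_stationary = step(pi, P) == pi

# The first guess, X_0 = N/2 with certainty, fails the same check.
point_mass = [Fraction(int(i == N // 2)) for i in range(N + 1)]
point_mass_is_stationary = step(point_mass, P) == point_mass
print(binomial_is_stationary, point_mass_is_stationary)
```

The check is exact rather than numerical, so equality of the two vectors is a genuine verification of $\pi \mathsf{P} = \pi$ for this value of N, though of course not a proof for general N.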

Eigenvector interpretation: The equation $\pi = \pi \mathsf{P}$ says that $\pi$ is a left eigenvector of the matrix $\mathsf{P}$ with eigenvalue 1. In addition to that, $\pi$ must be a probability:

$$\sum_{i \in S} \pi(i) = 1.$$

The matrix $\mathsf{P}$ has the eigenvalue 1 always because $\sum_{j \in S} p_{i,j} = 1$, which, in matrix notation, can be written as $\mathsf{P}\mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is a column whose entries are all 1, and this means that $\mathbf{1}$ is a (right) eigenvector of $\mathsf{P}$ corresponding to the eigenvalue 1.

Example: card shuffling. Consider a deck of $d = 52$ cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set $S_d$ of $d!$ permutations of $\{1, \ldots, d\}$. Thus, each $x \in S_d$ is of the form $x = (x_1, \ldots, x_d)$, where all the $x_i$ are distinct. It is easy to see that the stationary distribution is uniform:

$$\pi(x) = \frac{1}{d!}, \qquad x \in S_d.$$

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in $i$ minutes with probability $p(i)$, $i = 1, 2, \ldots$. The times between successive buses are i.i.d. random variables. Consider the following process: At time $n$ (measured in minutes) let $X_n$ be the time till the arrival of the next bus. Then $(X_n, n \ge 0)$ is a Markov chain with transition probabilities

$$p_{i+1,i} = 1, \quad \text{if } i > 0, \qquad p_{1,i} = p(i), \quad i \in \mathbb{N}.$$

The state space is $S = \mathbb{N} = \{1, 2, \ldots\}$. To find the stationary distribution, write the equations (4):

$$\pi(i) = \sum_{j \ge 1} \pi(j)\,p_{j,i} = \pi(1)\,p(i) + \pi(i + 1), \qquad i \ge 1.$$

These can easily be solved:

$$\pi(i) = \pi(1) \sum_{j \ge i} p(j), \qquad i \ge 1.$$

To compute $\pi(1)$, we use the normalisation condition $\sum_{i \ge 1} \pi(i) = 1$, which gives

$$\pi(1) = \Big( \sum_{i \ge 1} \sum_{j \ge i} p(j) \Big)^{-1}.$$

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, $\pi(1)$ will be zero, and so each $\pi(i)$ will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

ASSUMPTION: $\sum_{i \ge 1} \sum_{j \ge i} p(j) < \infty$.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

$$\sum_{i \ge 1} \sum_{j \ge i} p(j) = \sum_{j \ge 1} \sum_{i \ge 1} \mathbf{1}(j \ge i)\,p(j) = \sum_{j \ge 1} j\,p(j),$$

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

ASSUMPTION $\iff$ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted.) Consider an infinite binary vector

$$\xi = (\xi_1, \xi_2, \ldots),$$

i.e. $\xi_i \in \{0, 1\}$ for all $i$, and transform it according to the following rule: if $\xi_1 = 0$ then shift to the left to obtain $(\xi_2, \xi_3, \ldots)$; if $\xi_1 = 1$ then shift to the left and flip all bits to obtain $(1 - \xi_2, 1 - \xi_3, \ldots)$. We have thus defined a mapping $\varphi$ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping $\varphi$ again and again:

$$\xi(n + 1) = \varphi(\xi(n)).$$

Here $\xi(n) = (\xi_1(n), \xi_2(n), \ldots)$ is the state at time $n$. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. You can check that, for any $n = 0, 1, 2, \ldots$, any $r = 1, 2, \ldots$, and any $\varepsilon_1, \ldots, \varepsilon_r \in \{0, 1\}$,

$$\begin{aligned}
P(\xi_1(n+1) = \varepsilon_1, \ldots, \xi_r(n+1) = \varepsilon_r) ={}& P(\xi_1(n) = 0,\ \xi_2(n) = \varepsilon_1, \ldots, \xi_{r+1}(n) = \varepsilon_r) \\
&+ P(\xi_1(n) = 1,\ \xi_2(n) = 1 - \varepsilon_1, \ldots, \xi_{r+1}(n) = 1 - \varepsilon_r).
\end{aligned} \tag{5}$$

We can show that if we start with the $\xi_r(0)$, $r = 1, 2, \ldots$, i.i.d. with $P(\xi_r(0) = 1) = P(\xi_r(0) = 0) = 1/2$ then the same will be true for all $n$:

$$P(\xi_r(n) = 1) = P(\xi_r(n) = 0) = 1/2.$$

The proof is by induction: if the $\xi_1(n), \xi_2(n), \ldots$ are i.i.d. with $P(\xi_r(n) = 1) = P(\xi_r(n) = 0) = 1/2$ then, from (5), the same is true at $n + 1$. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states. So it goes beyond our theory. But to even think about it will give you some strength. As for the name of the chain (bread kneading) it can be justified as follows: to each $\xi$ there corresponds a number $x$ between 0 and 1 because $\xi$ can be seen as the binary representation of the real number $x$. If $\xi_1 = 0$ then $x < 1/2$ and our shifting rule maps $x$ into $2x$. If $\xi_1 = 1$ then $x \ge 1/2$, and the rule maps $x$ into $2(1 - x)$. But the function $f(x) = \min(2x, 2(1 - x))$, applied to the interval $[0, 1]$ (think of $[0, 1]$ as a flexible bread dough arranged in a long rod), kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.
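Returning to the bus example above, the closed-form solution can be checked on a concrete case. The sketch below is illustrative only: the inter-arrival distribution p (buses 1, 2 or 4 minutes apart) is made up for the demonstration, and `bus_stationary` is a hypothetical helper. It computes π(i) as the tail sum of p from i onwards divided by the mean inter-arrival time, then verifies the balance equations π(i) = π(1)p(i) + π(i+1) and the normalisation condition exactly.

```python
from fractions import Fraction


def bus_stationary(p):
    """Stationary distribution of the waiting-time chain.

    p maps each possible gap j (in minutes) to its probability p(j);
    a finite support is assumed, so the mean gap is automatically finite.
    """
    max_gap = max(p)
    mean_gap = sum(j * pj for j, pj in p.items())  # expected time between buses

    def tail(i):
        return sum(pj for j, pj in p.items() if j >= i)

    return {i: tail(i) / mean_gap for i in range(1, max_gap + 1)}


# Hypothetical inter-arrival distribution, chosen only for illustration.
p = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}
pi = bus_stationary(p)

# Balance equations: pi(i) = pi(1) p(i) + pi(i + 1), with pi and p zero off their support.
balance_ok = all(
    pi[i] == pi[1] * p.get(i, 0) + pi.get(i + 1, 0)
    for i in pi
)
print(pi, balance_ok, sum(pi.values()))
```

Here π(1) equals the reciprocal of the mean gap, which is exactly the normalisation derived in the example.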

i Multiply π(i) by j∈S pi.i + j∈Ac π(j)pj. Ac ) := π(i)pi.i + j∈Ac π(j)pj. If the state space has d states. π(j)pj. expecting to be able to choose. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A. i∈A j∈Ac Think of π(i)pi. So F (A. j∈S i ∈ S. When a Markov chain is stationary we refer to it as being in 4. for all A ⊂ S. a stationary distribution π satisﬁes π = πP π1 = 1 In other words. therefore one of them is always redundant.j (which equals 1): π(i) j∈S pi. Conversely. We have: Proposition 1.Note on terminology: steady-state. Ac ) is the total current from A to Ac . But we only have d unknowns. then choose A = {i} and see that you obtain the balance equations. amongst them.j as a “current” ﬂowing from i to j. The convenience is often suggested by the topological structure of the chain.j = j∈A π(j)pj.1 Finding a stationary distribution As explained above. Ac ) = F (Ac . π is a stationary distribution if and only if π1 = 1 and F (A. a set of d linearly independent equations which can be more easily solved. π(i) = j∈S (balance equations) (normalisation condition). as we will see in examples.i . π(j) = 1. Instead of eliminating equations. then the above are d + 1 equations. we introduce more. Fix a set A ⊂ S and write π(i) = j∈S π(j)pj. A). suppose that the balance equations hold.i = j∈A π(j)pj. Proof. Writing the equations in this form is not always the best thing to do.i 13 . because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. This is OK. If the latter condition holds.j .

Now split the left sum also:
$$\sum_{j \in A} \pi(i) p_{i,j} + \sum_{j \in A^c} \pi(i) p_{i,j} = \sum_{j \in A} \pi(j) p_{j,i} + \sum_{j \in A^c} \pi(j) p_{j,i}.$$
This is true for all $i \in S$. Summing over $i \in A$, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

[Figure: schematic representation of the flow balance relations, showing the flows $F(A, A^c)$ and $F(A^c, A)$ between $A$ and $A^c$.]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with $p_{12} = \alpha$, $p_{21} = \beta$. (Consequently, $p_{11} = 1 - \alpha$, $p_{22} = 1 - \beta$.) Assume $0 < \alpha, \beta < 1$. We have
$$\pi(1)\alpha = \pi(2)\beta,$$
which immediately tells us that $\pi(1)$ is proportional to $\beta$ and $\pi(2)$ proportional to $\alpha$:
$$\pi(1) = c\beta, \qquad \pi(2) = c\alpha.$$
The constant $c$ is found from $\pi(1) + \pi(2) = 1$. That is,
$$\pi(1) = \frac{\beta}{\alpha + \beta}, \qquad \pi(2) = \frac{\alpha}{\alpha + \beta}.$$
If we take $\alpha = 0.05$, $\beta = 0.99$ (as in the mouse example), we find $\pi(1) = 0.99/1.04 \approx 0.952$, $\pi(2) = 0.05/1.04 \approx 0.048$. Thus, in steady state, the mouse spends about 95.2% of the time in room 1, and only 4.8% of the time in room 2.

Example: a walk with a barrier. Consider the following chain, where $p + q = 1$:

[Figure: the chain on states 0, 1, 2, 3, 4, ..., with arrows labelled $p$ pointing to the right and arrows labelled $q$ pointing to the left.]

The balance equations are:
$$\pi(i) = p\,\pi(i-1) + q\,\pi(i+1), \quad i = 1, 2, 3, \ldots$$

But we can make them simpler by choosing $A = \{0, 1, \ldots, i - 1\}$. If $i \ge 1$ we have
$$F(A, A^c) = \pi(i-1)\,p = \pi(i)\,q = F(A^c, A).$$
These are much simpler than the previous ones. In fact, we can solve them immediately. For $i \ge 1$,
$$\pi(i) = (p/q)^i\, \pi(0).$$
Does this provide a stationary distribution? Yes, provided that $\sum_{i=0}^{\infty} \pi(i) = 1$. This is possible if and only if $p < q$, i.e. $p < 1/2$. (If $p \ge 1/2$ there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair $(S, E)$ where $S$ is the state space and $E \subset S \times S$ is defined by
$$(i, j) \in E \iff p_{i,j} > 0.$$
In graph-theoretic terminology, $S$ is a set of vertices, and $E$ a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set $E$ as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state $i$ leads to state $j$ if, starting from $i$, the chain will visit $j$ at some finite time with positive probability. In other words,
$$i \text{ leads to } j \ (\text{written as } i \rightsquigarrow j) \iff P_i(\exists n \ge 0 \ X_n = j) > 0.$$
Notice that this relation is reflexive:
$$\forall i \in S \quad i \rightsquigarrow i.$$
Furthermore:

Theorem 1. The following are equivalent:
(i) $i \rightsquigarrow j$;
(ii) $p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j} > 0$ for some $n \in \mathbb{N}$ and some states $i_1, \ldots, i_{n-1}$;
(iii) $P_i(X_n = j) > 0$ for some $n \ge 0$.

Proof. If $i = j$ there is nothing to prove. So assume $i \ne j$.
• Suppose that $i \rightsquigarrow j$. Since
$$0 < P_i(\exists n \ge 0 \ X_n = j) \le \sum_{n \ge 0} P_i(X_n = j),$$

the sum on the right must have a nonzero term. In other words, (iii) holds.
• Suppose that (ii) holds. We have
$$P_i(X_n = j) = \sum_{i_1, \ldots, i_{n-1}} p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j},$$
where the sum extends over all possible choices of states in $S$. By assumption, the sum contains a nonzero term. Hence $P_i(X_n = j) > 0$, and so (iii) holds.
• Suppose now that (iii) holds. Look at the last display again. Since $P_i(X_n = j) > 0$, we have that the sum is also positive and so one of its terms is positive, meaning that (ii) holds. In addition, (i) holds, because $P_i(X_n = j) \le P_i(\exists m \ge 0 \ X_m = j)$.

Corollary 1. The relation $\rightsquigarrow$ is transitive: If $i \rightsquigarrow j$ and $j \rightsquigarrow k$ then $i \rightsquigarrow k$.

5.3 The relation "communicates with"

Define next the relation
$$i \text{ communicates with } j \ (\text{written as } i \leftrightsquigarrow j) \iff i \rightsquigarrow j \text{ and } j \rightsquigarrow i.$$
Obviously, this is symmetric ($i \leftrightsquigarrow j \iff j \leftrightsquigarrow i$).

Corollary 2. The relation $\leftrightsquigarrow$ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because $\rightsquigarrow$ is so.

Just as any equivalence relation, it partitions $S$ into equivalence classes known as communicating classes. The communicating class corresponding to the state $i$ is, by definition, the set of all states that communicate with $i$:
$$[i] := \{j \in S : j \leftrightsquigarrow i\}. \qquad (6)$$
So, by definition, $[i] = [j]$ if and only if $i \leftrightsquigarrow j$. Two communicating classes are either identical or completely disjoint.

[Figure: a chain on the states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: $\{1\}$, $\{2, 3\}$, $\{4, 5\}$, $\{6\}$. The first two classes differ from the last two in character. The class $\{4, 5\}$ is closed: if the chain goes in there then it never moves out of it. However, the class $\{2, 3\}$ is not closed.
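The partition into communicating classes depends only on which transition probabilities are positive, so it can be computed from the graph alone. The following is a minimal sketch, not part of the original notes: the function name `communicating_classes` and the 6-state matrix are hypothetical (the arrows of the figure are not reproduced here, so the matrix is merely one chain whose classes happen to be $\{1\}$, $\{2,3\}$, $\{4,5\}$, $\{6\}$, written with 0-based indices).

```python
from itertools import product


def communicating_classes(P):
    """Return the communicating classes of the chain with transition matrix P.

    P is a list of rows; state k is index k. Only the positions of the
    positive entries of P are used.
    """
    n = len(P)
    # reach[i][j] is True when i leads to j; reflexive because n = 0 steps is allowed.
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    # Warshall's algorithm (k is the outermost index): transitive closure of "leads to".
    for k, i, j in product(range(n), repeat=3):
        if reach[i][k] and reach[k][j]:
            reach[i][j] = True
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        # States that communicate with i: i leads to j and j leads to i.
        cls = sorted(j for j in range(n) if reach[i][j] and reach[j][i])
        classes.append(cls)
        seen.update(cls)
    return classes


# Hypothetical chain (an assumption made for illustration only).
P_example = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
classes_example = communicating_classes(P_example)
```

The cubic-time closure is chosen for brevity; for large state spaces a strongly-connected-components algorithm would be the usual choice.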

More generally, we say that a set of states $C \subset S$ is closed if
$$\sum_{j \in C} p_{i,j} = 1, \quad \text{for all } i \in C.$$

Lemma 3. A communicating class $C$ is closed if and only if the following holds: If $i \in C$ and if $i \rightsquigarrow j$ then $j \in C$.

Proof. Suppose first that $C$ is a closed communicating class. Suppose that $i \rightsquigarrow j$. We then can find states $i_1, \ldots, i_{n-1}$ such that $p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j} > 0$. Thus, $p_{i,i_1} > 0$. Since $i \in C$, and $C$ is closed, we have $i_1 \in C$. Similarly, $p_{i_1,i_2} > 0$, and so $i_2 \in C$, and so on; we conclude that $j \in C$. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts. If all states communicate with all others, then the chain (or, more precisely, its transition matrix $P$) is called irreducible. In graph terms, starting from any state, we can reach any other state. In other words still, the state space $S$ is a closed communicating class.

If a state $i$ is such that $p_{i,i} = 1$ then we say that $i$ is an absorbing state. To put it otherwise, the single-element set $\{i\}$ is a closed communicating class.

Note: If we consider a Markov chain with transition matrix $P$ and fix $n \in \mathbb{N}$ then every set of states that is closed for $P$ is also closed for the Markov chain with transition matrix $P^n$. Why? (The communicating classes themselves need not coincide: for a periodic chain, a closed communicating class of $P$ can split into several closed classes of $P^n$.)

A state $i$ is essential if it belongs to a closed communicating class. Otherwise, the state is inessential.

Note: An inessential state $i$ is such that there exists a state $j$ such that $i \rightsquigarrow j$ but $j \not\rightsquigarrow i$.

[Figure: five communicating classes $C_1, C_2, C_3, C_4, C_5$ with arrows between some of them.]

The general structure of a Markov chain is indicated in this figure. Note that there can be no arrows between closed communicating classes. The classes $C_4$, $C_5$ in the figure above are closed communicating classes. The communicating classes $C_1$, $C_2$, $C_3$ are not closed. There can be arrows between two non-closed communicating classes but they are always in the same direction. A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them.
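The closedness condition $\sum_{j \in C} p_{i,j} = 1$ for all $i \in C$ can be checked mechanically for any candidate set. The sketch below is only an illustration: the helper names `is_closed` and `absorbing_states`, the tolerance, and the small matrix are assumptions, not taken from the notes.

```python
def is_closed(P, C, tol=1e-12):
    """True when sum_{j in C} p_ij = 1 for every i in C (up to rounding)."""
    return all(abs(sum(P[i][j] for j in C) - 1.0) <= tol for i in C)


def absorbing_states(P, tol=1e-12):
    """States i with p_ii = 1, i.e. {i} is a closed communicating class."""
    return [i for i in range(len(P)) if abs(P[i][i] - 1.0) <= tol]


# Hypothetical 4-state chain: state 0 leaks into {1, 2}; state 3 is absorbing.
P_small = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
closed_12 = is_closed(P_small, [1, 2])    # the class {1, 2} cannot be left
closed_0 = is_closed(P_small, [0])        # probability escapes from {0}
absorbing = absorbing_states(P_small)
```

Here state 0 is inessential (it leads to 1, but 1 does not lead back), while states 1, 2 and 3 are essential.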

5.4 Period

Consider the following chain:

[Figure: a chain on the four states 1, 2, 3, 4.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set $C_1 = \{1, 3\}$ at some point of time then it will move to the set $C_2 = \{2, 4\}$ at the next step and vice versa. The chain alternates between the two sets. So if the chain starts with $X_0 \in C_1$ then we know that $X_1 \in C_2$, $X_2 \in C_1$, $X_3 \in C_2$, $X_4 \in C_1$, ... We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by $p^{(n)}_{ij}$ the entries of $P^n$. Since $P^{m+n} = P^m P^n$, we have
$$p^{(m+n)}_{ij} \ge p^{(m)}_{ik}\, p^{(n)}_{kj}, \quad \text{for all } m, n \ge 0, \ i, j, k \in S.$$
So $p^{(2n)}_{ii} \ge \big(p^{(n)}_{ii}\big)^2$ and, more generally,
$$p^{(\ell n)}_{ii} \ge \big(p^{(n)}_{ii}\big)^{\ell}, \quad \ell \in \mathbb{N}.$$
Therefore if $p^{(n)}_{ii} > 0$ then $p^{(\ell n)}_{ii} > 0$ for all $\ell \in \mathbb{N}$. Another way to say this is: If the integer $n$ divides $m$ (denote this by: $n \mid m$) and if $p^{(n)}_{ii} > 0$ then $p^{(m)}_{ii} > 0$. So, whether it is possible to return to $i$ in $m$ steps can be decided by one of the integer divisors of $m$.

The period of an essential state is defined as the greatest common divisor of all natural numbers $n$ such that it is possible to return to $i$ in $n$ steps:
$$d(i) := \gcd\{n \in \mathbb{N} : P_i(X_n = i) > 0\}.$$
A state $i$ is called aperiodic if $d(i) = 1$. The period of an inessential state is not defined.

Theorem 2 (the period is a class property). If $i \leftrightsquigarrow j$ then $d(i) = d(j)$.

Proof. Consider two distinct essential states $i$, $j$. Let $D_i = \{n \in \mathbb{N} : p^{(n)}_{ii} > 0\}$, $D_j = \{n \in \mathbb{N} : p^{(n)}_{jj} > 0\}$, $d(i) = \gcd D_i$, $d(j) = \gcd D_j$. If $i \rightsquigarrow j$ there is $\alpha \in \mathbb{N}$ with $p^{(\alpha)}_{ij} > 0$ and if

$j \rightsquigarrow i$ there is $\beta \in \mathbb{N}$ with $p^{(\beta)}_{ji} > 0$. So $p^{(\alpha+\beta)}_{jj} \ge p^{(\alpha)}_{ij}\, p^{(\beta)}_{ji} > 0$, showing that $\alpha + \beta \in D_j$. Therefore
$$d(j) \mid \alpha + \beta.$$
Now let $n \in D_i$. Then $p^{(n)}_{ii} > 0$ and so $p^{(\alpha+n+\beta)}_{jj} \ge p^{(\alpha)}_{ij}\, p^{(n)}_{ii}\, p^{(\beta)}_{ji} > 0$, showing that $\alpha + n + \beta \in D_j$. Therefore
$$d(j) \mid \alpha + \beta + n \quad \text{for all } n \in D_i.$$
If a number divides two other numbers then it also divides their difference. From the last two displays we get
$$d(j) \mid n \quad \text{for all } n \in D_i.$$
So $d(j)$ is a divisor of all elements of $D_i$. Since $d(i)$ is the greatest common divisor we have $d(i) \ge d(j)$ (in fact, $d(j) \mid d(i)$). Arguing symmetrically, we obtain $d(i) \le d(j)$ as well. So $d(i) = d(j)$.

Theorem 3. If $i$ is an aperiodic state then there exists $n_0$ such that $p^{(n)}_{ii} > 0$ for all $n \ge n_0$.

Proof. Pick $n_2, n_1 \in D_i$ such that $n_2 - n_1 = 1$ (such a pair exists because $D_i$ is closed under addition and $\gcd D_i = 1$). Let $n$ be sufficiently large. Divide $n$ by $n_1$ to obtain $n = q n_1 + r$, where the remainder $r \le n_1 - 1$. Therefore
$$n = q n_1 + r(n_2 - n_1) = (q - r) n_1 + r n_2.$$
Because $n$ is sufficiently large, $q - r > 0$. Since $n_1, n_2 \in D_i$, we have $n \in D_i$ as well.

Of particular interest are irreducible chains, i.e. chains where, as defined earlier, all states communicate with one another. An irreducible chain has period $d$ if one (and hence all) of the states have period $d$. In particular, if $d = 1$, the chain is called aperiodic.

Corollary 3. An irreducible chain with finitely many states is aperiodic if and only if there exists an $n$ such that $p^{(n)}_{ij} > 0$ for all $i, j \in S$.

More generally, if an irreducible chain has period $d$ then we can decompose the state space into $d$ sets $C_0, C_1, \ldots, C_{d-1}$ such that the chain moves cyclically between them.

[Figure: the internal structure of a closed communicating class with period $d = 4$.]

Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period $d$. Then we can uniquely partition the state space $S$ into $d$ disjoint subsets $C_0, C_1, \ldots, C_{d-1}$ such that
$$\sum_{j \in C_{r+1}} p_{ij} = 1, \quad i \in C_r, \quad r = 0, 1, \ldots, d - 1.$$
(Here $C_d := C_0$.)
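As a small numerical illustration of the definition $d(i) = \gcd\{n : p^{(n)}_{ii} > 0\}$, the sketch below takes the gcd of the return times found up to a cut-off. Everything here is an assumption made for illustration: the function name `period`, the default cut-off `3 * len(P)` (a heuristic that is ample for the tiny examples shown, not a bound claimed by the notes), and the two 4-state matrices (a symmetric walk on a square, and the same walk with a self-loop added at state 0).

```python
from math import gcd


def period(P, i, max_n=None):
    """gcd of the return times n <= max_n with (P^n)[i][i] > 0; 0 if none is found."""
    n_states = len(P)
    if max_n is None:
        max_n = 3 * n_states  # heuristic cut-off, adequate for small examples
    # Only the positions of positive entries matter, so work with booleans.
    adj = [[P[a][b] > 0 for b in range(n_states)] for a in range(n_states)]
    # reach[b] is True when b can be reached from i in exactly n steps.
    reach = [b == i for b in range(n_states)]
    d = 0
    for n in range(1, max_n + 1):
        reach = [any(reach[a] and adj[a][b] for a in range(n_states))
                 for b in range(n_states)]
        if reach[i]:
            d = gcd(d, n)  # gcd(0, n) == n, so the first return time initialises d
    return d


# Hypothetical walk on a square: 0 - 1 - 2 - 3 - 0, each neighbour with probability 1/2.
P_square = [
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]
# Same walk, but state 0 may also stay put: a return in 1 step becomes possible.
P_lazy = [
    [1 / 3, 1 / 3, 0.0, 1 / 3],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]
periods_square = [period(P_square, i) for i in range(4)]
periods_lazy = [period(P_lazy, i) for i in range(4)]
```

In both examples all states of the (single) communicating class receive the same value, in line with Theorem 2.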

Proof. Define the relation
$$i \overset{d}{\sim} j \iff p^{(nd)}_{ij} > 0 \text{ for some } n \ge 0.$$
Notice that this is an equivalence relation. Hence it partitions the state space $S$ into $d$ disjoint equivalence classes. We show that these classes satisfy what we need. Assume $d > 1$. Pick a state $i_0$ and let $C_0$ be its equivalence class (i.e. the set of states $j$ such that $i_0 \overset{d}{\sim} j$). Then pick $i_1$ such that $p_{i_0 i_1} > 0$. Let $C_1$ be the equivalence class of $i_1$. Continue in this manner and define states $i_0, i_1, \ldots, i_{d-1}$ with corresponding classes $C_0, C_1, \ldots, C_{d-1}$. It is easy to see that if we continue and pick a further state $i_d$ with $p_{i_{d-1}, i_d} > 0$, then, necessarily, $i_d \in C_0$.

We now show that if $i$ belongs to one of these classes and if $p_{ij} > 0$ then, necessarily, $j$ belongs to the next class. Take, for example, $i \in C_0$. Suppose $p_{ij} > 0$ but $j \notin C_1$; say, $j \in C_2$. Consider the path
$$i_0 \to i \to j \to i_2 \to i_3 \to \cdots \to i_d \to i_0.$$
Such a path is possible because of the choice of the $i_0, i_1, \ldots$, and by the assumptions that $i_0 \overset{d}{\sim} i$, $i_2 \overset{d}{\sim} j$, $i_d \overset{d}{\sim} i_0$. The existence of such a path implies that it is possible to go from $i_0$ to $i_0$ in a number of steps which is congruent to $d - 1$ modulo $d$ (why?), and hence is not a multiple of $d$, which contradicts the definition of $d$.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix $P$. We define the hitting time⁴ of a set of states $A$ by
$$T_A := \inf\{n \ge 0 : X_n \in A\}.$$
(⁴ We shall later consider the time $\inf\{n \ge 1 : X_n \in A\}$, which differs from $T_A$ simply by considering $n \ge 1$ instead of $n \ge 0$. If $X_0 \notin A$ the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.)

We are interested in deriving formulae for the probability that this time is finite as well as the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

[Figure: a chain on the states 0, 1, 2; from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6; states 0 and 2 have self-loops of probability 1.]
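The definition $T_A = \inf\{n \ge 0 : X_n \in A\}$ can be explored by simulation before any formula is derived. The following is a rough sketch under stated assumptions: the transition probabilities of the three-state example are read off the figure labels and the equations used later in this section, and the names `step`, `hitting_time`, `estimate_hit_probability`, the finite `horizon`, and the seed are all choices made here for illustration.

```python
import random

# Assumed transition probabilities of the three-state example:
# 0 and 2 are absorbing; from 1 go to 0, 1, 2 with probabilities 1/2, 1/3, 1/6.
P = {
    0: [(0, 1.0)],
    1: [(0, 1 / 2), (1, 1 / 3), (2, 1 / 6)],
    2: [(2, 1.0)],
}


def step(state, rng):
    """Sample the next state from the row of P belonging to `state`."""
    u = rng.random()
    acc = 0.0
    for nxt, prob in P[state]:
        acc += prob
        if u < acc:
            return nxt
    return P[state][-1][0]  # guard against floating-point rounding


def hitting_time(start, A, rng, horizon=100):
    """T_A = inf{n >= 0 : X_n in A}, or None if A is not hit within `horizon` steps."""
    x = start
    for n in range(horizon + 1):
        if x in A:
            return n
        x = step(x, rng)
    return None


def estimate_hit_probability(start, A, runs=20_000, seed=0):
    """Monte Carlo estimate of P_start(T_A < infinity), truncated at the horizon."""
    rng = random.Random(seed)
    hits = sum(hitting_time(start, A, rng) is not None for _ in range(runs))
    return hits / runs


estimate_from_1 = estimate_hit_probability(1, {0})
```

The estimate for `start = 1`, `A = {0}` can be compared with the exact value that first-step analysis yields for this example. Truncating at a finite horizon is harmless here because the chain leaves state 1 geometrically fast.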

It is clear that $P_1(T_0 < \infty) < 1$, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states $A$, and define
$$\varphi(i) := P_i(T_A < \infty).$$
We then have:

Theorem 5. The function $\varphi(i)$ satisfies
$$\varphi(i) = 1, \quad i \in A,$$
$$\varphi(i) = \sum_{j \in S} p_{ij}\, \varphi(j), \quad i \notin A.$$
Furthermore, if $\varphi'(i)$ is any other nonnegative solution of these equations then $\varphi'(i) \ge \varphi(i)$ for all $i \in S$.

Proof. If $i \in A$ then $T_A = 0$, and so $\varphi(i) = 1$. If $i \notin A$, then $T_A \ge 1$. So $T_A = 1 + T'_A$ (where $T'_A$ is the remaining time until $A$ is hit). We first have
$$P_i(T_A < \infty) = \sum_{j \in S} P_i(1 + T'_A < \infty \mid X_1 = j)\, P_i(X_1 = j) = \sum_{j \in S} P_i(T'_A < \infty \mid X_1 = j)\, p_{ij}.$$
But observe that the random variable $T'_A$ is a function of the future after time 1 (i.e. a function of $X_1, X_2, \ldots$). Therefore, the event $\{T'_A < \infty\}$ is independent of $X_0$, conditionally on $X_1$. Hence:
$$P(T'_A < \infty \mid X_1 = j, X_0 = i) = P(T'_A < \infty \mid X_1 = j).$$
But the Markov chain is homogeneous, which implies that
$$P(T'_A < \infty \mid X_1 = j) = P(T_A < \infty \mid X_0 = j) = P_j(T_A < \infty) = \varphi(j).$$
Combining the above, we have
$$P_i(T_A < \infty) = \sum_{j \in S} \varphi(j)\, p_{ij},$$
as needed. For the second part of the proof, let $\varphi'(i)$ be another nonnegative solution. Then, for $i \notin A$,
$$\varphi'(i) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij}\, \varphi'(j).$$
By self-feeding this equation, we have
$$\varphi'(i) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \Big( \sum_{k \in A} p_{jk} + \sum_{k \notin A} p_{jk}\, \varphi'(k) \Big) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \sum_{k \in A} p_{jk} + \sum_{j \notin A} p_{ij} \sum_{k \notin A} p_{jk}\, \varphi'(k).$$

We recognise that the first term equals $P_i(T_A = 1)$, the second equals $P_i(T_A = 2)$, so the first two terms together equal $P_i(T_A \le 2)$. By omitting the last term we obtain $\varphi'(i) \ge P_i(T_A \le 2)$. By continuing self-feeding the above equation $n$ times, we obtain
$$\varphi'(i) \ge P_i(T_A \le n).$$
Letting $n \to \infty$, we obtain $\varphi'(i) \ge P_i(T_A < \infty) = \varphi(i)$.

Example: In the example of the previous figure, we choose $A = \{0\}$, the set that contains only state 0. We let $\varphi(i) = P_i(T_0 < \infty)$, $i = 0, 1, 2$. We immediately have $\varphi(0) = 1$, and $\varphi(2) = 0$. (If the chain starts from 0 it takes no time to hit 0; if the chain starts from 2 it will never hit 0.) As for $\varphi(1)$, we have
$$\varphi(1) = \tfrac{1}{3}\varphi(1) + \tfrac{1}{2}\varphi(0),$$
so $\varphi(1) = 3/4$.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set $A$ is hit for the first time:
$$\psi(i) := E_i T_A.$$
We have:

Theorem 6. The function $\psi(i)$ satisfies
$$\psi(i) = 0, \quad i \in A,$$
$$\psi(i) = 1 + \sum_{j \notin A} p_{ij}\, \psi(j), \quad i \notin A.$$
Furthermore, if $\psi'(i)$ is any other nonnegative solution of these equations then $\psi'(i) \ge \psi(i)$ for all $i \in S$.

Proof. Start the chain from $i$. If $i \in A$ then, obviously, $T_A = 0$ and so $\psi(i) := E_i T_A = 0$. If $i \notin A$, then $T_A = 1 + T'_A$, as above. Therefore,
$$E_i T_A = 1 + E_i T'_A = 1 + \sum_{j \in S} p_{ij}\, E_j T_A = 1 + \sum_{j \notin A} p_{ij}\, \psi(j),$$
which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let $A = \{0, 2\}$, and let $\psi(i) = E_i T_A$. Clearly, $\psi(0) = \psi(2) = 0$, because it takes no time to hit $A$ if the chain starts from $A$. On the other hand,
$$\psi(1) = 1 + \tfrac{1}{3}\psi(1),$$

which gives $\psi(1) = 3/2$. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3, therefore the mean number of flips until the first success is 3/2.)

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times $T_A$, $T_B$ for two disjoint sets of states $A$, $B$. We may ask to find the probabilities
$$\varphi_{AB}(i) = P_i(T_A < T_B).$$
It should be clear that the equations satisfied are:
$$\varphi_{AB}(i) = 1, \quad i \in A,$$
$$\varphi_{AB}(i) = 0, \quad i \in B,$$
$$\varphi_{AB}(i) = \sum_{j \in S} p_{ij}\, \varphi_{AB}(j), \quad \text{otherwise}.$$
Furthermore, $\varphi_{AB}$ is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is $x$ a reward $f(x)$ is earned. This happens up to the first hitting time $T_A$ of the set $A$. The total reward is $f(X_0) + f(X_1) + \cdots + f(X_{T_A})$. We are interested in the mean total reward
$$h(x) := E_x \sum_{n=0}^{T_A} f(X_n).$$
Clearly,
$$h(x) = f(x), \quad x \in A,$$
because when $X_0 \in A$ then $T_A = 0$ and so the total reward is $f(X_0)$. Next, if $x \notin A$, as argued earlier,
$$T_A = 1 + T'_A,$$
where $T'_A$ is the remaining time, after one step, until set $A$ is reached. Then
$$h(x) = E_x \sum_{n=0}^{1+T'_A} f(X_n) = f(x) + E_x \sum_{n=1}^{1+T'_A} f(X_n) = f(x) + E_x \sum_{n=0}^{T'_A} f(X_{n+1}),$$
where, in the last sum, we just changed index from $n$ to $n + 1$. Now the last sum is a function of the future after time 1, i.e. of the random variables $X_1, X_2, \ldots$. Hence, by the Markov property,
$$E_x \sum_{n=0}^{T'_A} f(X_{n+1}) = \sum_{y \in S} p_{xy}\, h(y).$$

Thus, the set of equations satisfied by $h$ are
$$h(x) = f(x), \quad x \in A,$$
$$h(x) = f(x) + \sum_{y \in S} p_{xy}\, h(y), \quad x \notin A.$$
You can remember the latter ones by thinking that the expected total reward equals the immediate reward $f(x)$ plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room $i$, he collects $i$ pounds, for $i = 1, 2, 3$, unless he is in room 4, in which case he dies immediately (so no reward is collected there, i.e. $f(4) = 0$).

[Figure: a house with four rooms, numbered 1, 2, 3, 4.]

The question is to compute the average total reward, if he starts from room 2, until he dies in room 4. (Sooner or later, he will die; why?) If we let $h(i)$ be the average total reward if he starts from room $i$, we have
$$h(4) = 0,$$
$$h(1) = 1 + \tfrac{1}{2} h(2),$$
$$h(3) = 3 + \tfrac{1}{2} h(2),$$
$$h(2) = 2 + \tfrac{1}{2} h(1) + \tfrac{1}{2} h(3).$$
We solve and find $h(2) = 8$. (Also, $h(1) = 5$, $h(3) = 7$.)

7 Gambler's ruin

Consider the simplest game of chance of tossing a (possibly unfair) coin repeatedly and independently. Let $p$ be the probability of heads and $q = 1 - p$ that of tails. If heads come up then you win £1 per pound of stake. So, if just before the $k$-th toss you decide to bet $Y_k$ pounds then after the realisation of the toss you will have earned $Y_k$ pounds if heads came up or lost $Y_k$ pounds if tails appeared. If $\xi_k$ is the outcome of the $k$-th toss (where $\xi_k = +1$ means heads and $\xi_k = -1$ means tails), your fortune after the $n$-th toss is
$$X_n = X_0 + \sum_{k=1}^{n} Y_k\, \xi_k.$$
The successive stakes $(Y_k, k \in \mathbb{N})$ form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss, so

if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, $Y_k$ cannot depend on $\xi_k$. It can only depend on $\xi_1, \ldots, \xi_{k-1}$ and, as is often the case in practice, on other considerations of e.g. astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability $p$.

Let us consider a concrete problem. The gambler starts with $X_0 = x$ pounds and has a target: To make $b$ pounds ($b > x$). She will stop playing if her money falls to $a$ ($a < x$). To solve the problem, in the simplest case $Y_k = 1$ for all $k$, is our next goal.

We use the notation $P_x$ (respectively, $E_x$) for the probability (respectively, expectation) under the initial condition $X_0 = x$. Consider the hitting time of state $y$:
$$T_y := \inf\{n \ge 0 : X_n = y\}.$$

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has $P(\text{heads}) = p$, $P(\text{tails}) = 1 - p$. Let £$x$ be the initial fortune of the gambler. Then the probability that the gambler's fortune will reach level $b > x$ before reaching level $a < x$ equals:
$$P_x(T_b < T_a) = \begin{cases} \dfrac{(q/p)^{x-a} - 1}{(q/p)^{b-a} - 1}, & \text{if } p \ne q, \\[2ex] \dfrac{x - a}{b - a}, & \text{if } p = q = 1/2, \end{cases} \qquad a \le x \le b.$$

We are clearly dealing with a Markov chain in the state space $\{a, a + 1, a + 2, \ldots, b - 1, b\}$ with transition probabilities
$$p_{i,i+1} = p, \quad p_{i,i-1} = q = 1 - p, \quad a < i < b,$$
and where the states $a$, $b$ are absorbing.

Proof. We first solve the problem when $a = 0$, $b = \ell > 0$ and let
$$\varphi(x, \ell) := P_x(T_\ell < T_0), \quad 0 \le x \le \ell.$$
The general solution will then be obtained from
$$P_x(T_b < T_a) = \varphi(x - a, b - a).$$
(Think why!) To simplify things, write $\varphi(x)$ instead of $\varphi(x, \ell)$. We obviously have
$$\varphi(0) = 0, \quad \varphi(\ell) = 1,$$

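and, by first-step analysis,
$$\varphi(x) = p\,\varphi(x + 1) + q\,\varphi(x - 1), \quad 0 < x < \ell.$$
Since $p + q = 1$, we can write this as
$$p[\varphi(x + 1) - \varphi(x)] = q[\varphi(x) - \varphi(x - 1)].$$
Let $\delta(x) := \varphi(x + 1) - \varphi(x)$. We have
$$\delta(x) = \frac{q}{p}\,\delta(x - 1) = \Big(\frac{q}{p}\Big)^2 \delta(x - 2) = \cdots = \Big(\frac{q}{p}\Big)^x \delta(0).$$
Assuming that $\lambda := q/p \ne 1$, we find
$$\varphi(x) = \delta(x - 1) + \delta(x - 2) + \cdots + \delta(0) = [\lambda^{x-1} + \cdots + \lambda + 1]\,\delta(0) = \frac{\lambda^x - 1}{\lambda - 1}\,\delta(0).$$
Since $\varphi(\ell) = 1$, we have $\frac{\lambda^\ell - 1}{\lambda - 1}\,\delta(0) = 1$, which gives $\delta(0) = (\lambda - 1)/(\lambda^\ell - 1)$. We finally find
$$\varphi(x) = \frac{\lambda^x - 1}{\lambda^\ell - 1}, \quad 0 \le x \le \ell, \quad \lambda \ne 1.$$
If $\lambda = 1$ then $q = p$ and so
$$\varphi(x + 1) - \varphi(x) = \varphi(x) - \varphi(x - 1).$$
This gives $\delta(x) = \delta(0)$ for all $x$ and $\varphi(x) = x\,\delta(0)$. Since $\varphi(\ell) = 1$, we find $\delta(0) = 1/\ell$ and so
$$\varphi(x) = \frac{x}{\ell}, \quad 0 \le x \le \ell, \quad \lambda = 1.$$
Summarising, we have obtained
$$\varphi(x, \ell) = \begin{cases} \dfrac{\lambda^x - 1}{\lambda^\ell - 1}, & \text{if } \lambda \ne 1, \\[2ex] \dfrac{x}{\ell}, & \text{if } \lambda = 1. \end{cases}$$
The general case is obtained by replacing $\ell$ by $b - a$ and $x$ by $x - a$.

Corollary 4. The ruin probability of a gambler starting with £$x$ is
$$P_x(T_0 < \infty) = \begin{cases} (q/p)^x, & \text{if } p > 1/2, \\ 1, & \text{if } p \le 1/2. \end{cases}$$

Proof. We have
$$P_x(T_0 < \infty) = \lim_{\ell \to \infty} P_x(T_0 < T_\ell) = 1 - \lim_{\ell \to \infty} P_x(T_\ell < T_0).$$
If $p > 1/2$, then $\lambda = q/p < 1$, and so
$$P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{\lambda^x - 1}{\lambda^\ell - 1} = 1 - \frac{\lambda^x - 1}{0 - 1} = \lambda^x.$$
If $p < 1/2$, then $\lambda = q/p > 1$, and so
$$P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{\lambda^x - 1}{\lambda^\ell - 1} = 1 - 0 = 1.$$
Finally, if $p = q$,
$$P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{x}{\ell} = 1 - 0 = 1.$$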
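The closed form of Theorem 7 can be cross-checked against a direct numerical solution of the first-step equations $\varphi(x) = p\varphi(x+1) + q\varphi(x-1)$ with $\varphi(a) = 0$, $\varphi(b) = 1$. The sketch below is an illustration only: the function names are hypothetical, and the tridiagonal (Thomas) elimination is simply a convenient dependency-free way to solve the linear system, not a method used in the notes.

```python
def ruin_closed_form(x, a, b, p):
    """P_x(T_b < T_a) as given by Theorem 7."""
    q = 1 - p
    if p == q:
        return (x - a) / (b - a)
    lam = q / p
    return (lam ** (x - a) - 1) / (lam ** (b - a) - 1)


def ruin_first_step(a, b, p):
    """Solve phi(x) = p*phi(x+1) + q*phi(x-1), phi(a) = 0, phi(b) = 1.

    Returns the list [phi(a), phi(a+1), ..., phi(b)].
    """
    q = 1 - p
    n = b - a - 1  # number of interior states a+1, ..., b-1
    if n <= 0:
        return [0.0, 1.0]
    # Row for each interior x: -q*phi(x-1) + phi(x) - p*phi(x+1) = rhs(x).
    rhs = [0.0] * n
    rhs[-1] = p  # the known boundary value phi(b) = 1 moves to the right-hand side
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = -p
    d_prime[0] = rhs[0]
    for i in range(1, n):  # forward elimination
        m = 1.0 + q * c_prime[i - 1]
        c_prime[i] = -p / m
        d_prime[i] = (rhs[i] + q * d_prime[i - 1]) / m
    phi = [0.0] * n
    phi[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        phi[i] = d_prime[i] - c_prime[i] * phi[i + 1]
    return [0.0] + phi + [1.0]


def max_discrepancy(a, b, p):
    """Largest gap between the numerical solution and the closed form."""
    phi = ruin_first_step(a, b, p)
    return max(abs(phi[x - a] - ruin_closed_form(x, a, b, p))
               for x in range(a, b + 1))
```

For instance, `max_discrepancy(2, 12, 0.3)` compares the two answers at every starting fortune between 2 and 12; agreement up to rounding error is what the theorem predicts.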

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value $+\infty$. In particular, a stopping time $\tau$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$ such that $\mathbf{1}(\tau = n)$ is a deterministic function of $(X_0, \ldots, X_n)$ for all $n \in \mathbb{Z}_+$.

An example of a stopping time is the time $T_A$, the first hitting time of the set $A$, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any $n$, the future process $(X_n, X_{n+1}, \ldots)$ is independent of the past process $(X_0, \ldots, X_n)$ if we know the present $X_n$. We will now prove that this remains true if we replace the deterministic $n$ by a random stopping time $\tau$. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose $(X_n, n \ge 0)$ is a time-homogeneous Markov chain. Let $\tau$ be a stopping time. Then, conditional on $\tau < \infty$ and $X_\tau = i$, the future process $(X_{\tau+n}, n \ge 0)$ is independent of the past process $(X_0, \ldots, X_\tau)$ and has the same law as the original process started from $i$.

Proof. Let $\tau$ be a stopping time. Let $A$ be any event determined by the past $(X_0, \ldots, X_\tau)$ if $\tau < \infty$. Since
$$(X_0, \ldots, X_\tau)\,\mathbf{1}(\tau < \infty) = \sum_{n \in \mathbb{Z}_+} (X_0, \ldots, X_n)\,\mathbf{1}(\tau = n),$$
any such $A$ must have the property that, for all $n \in \mathbb{Z}_+$, $A \cap \{\tau = n\}$ is determined by $(X_0, \ldots, X_n)$. We are going to show that for any such $A$,
$$P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots;\ A \mid X_\tau, \tau < \infty) = P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots \mid X_\tau, \tau < \infty)\, P(A \mid X_\tau, \tau < \infty)$$
and that
$$P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots \mid X_\tau = i, \tau < \infty) = p_{i,j_1}\, p_{j_1,j_2} \cdots$$
We have:
$$P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots;\ A,\ X_\tau = i,\ \tau = n)$$
$$\overset{(a)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots;\ A,\ X_n = i,\ \tau = n)$$
$$\overset{(b)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots \mid X_n = i,\ A,\ \tau = n)\, P(X_n = i,\ A,\ \tau = n)$$
$$\overset{(c)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots \mid X_n = i)\, P(X_n = i,\ A,\ \tau = n)$$
$$\overset{(d)}{=} p_{i,j_1}\, p_{j_1,j_2} \cdots P(X_n = i,\ A,\ \tau = n),$$
where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that $A \cap \{\tau = n\}$ is determined by $(X_0, \ldots, X_n)$ and the ordinary splitting property at $n$, and (d) is due to the fact that $(X_n)$ is a time-homogeneous Markov chain with transition probabilities $p_{ij}$. Our assertions are now easily obtained from this by first summing over all $n$ (whence the event $\{\tau < \infty\}$ appears) and then by dividing both sides by $P(X_\tau = i, \tau < \infty)$.

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables $X_n$ comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state $i$ that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property). Schematically,

[Figure: a trajectory of the chain cut into excursions at the successive visit times $T_i = T_i^{(1)}, T_i^{(2)}, T_i^{(3)}, T_i^{(4)}$ to the state $i$.]

These excursions are independent!

Fix a state $i \in S$ and let $T_i$ be the first time that the chain will visit this state:
$$T_i := \inf\{n \ge 1 : X_n = i\}.$$
Note the subtle difference between this $T_i$ and the $T_i$ of Section 6. Our $T_i$ here is always positive. The $T_i$ of Section 6 was allowed to take the value 0. If $X_0 \ne i$ then the two definitions coincide.

State $i$ may or may not be visited again. Regardless, we let $T_i^{(2)}$ be the time of the second visit. (And we set $T_i^{(1)} := T_i$.) We let $T_i^{(3)}$ be the time of the third visit, and so on. Formally, we recursively define
$$T_i^{(r)} := \inf\{n > T_i^{(r-1)} : X_n = i\}, \quad r = 2, 3, \ldots$$
All these random times are stopping times. (Why?) Consider the "trajectory" of the chain between two successive visits to the state $i$:
$$X_i^{(r)} := \big(X_n,\ T_i^{(r)} \le n < T_i^{(r+1)}\big), \quad r = 1, 2, \ldots$$
Let also
$$X_i^{(0)} := \big(X_n,\ 0 \le n < T_i\big).$$
We think of
$$X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, \ldots$$
as "random variables", in a generalised sense. They are, in fact, random trajectories. We refer to them as excursions of the process. So $X_i^{(r)}$ is the $r$-th excursion: the chain visits

state $i$ for the $r$-th time, and "goes for an excursion"; we "watch" it up until the next time it returns to state $i$.

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions
$$X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, \ldots$$
are independent random variables. In particular, all, except, possibly, $X_i^{(0)}$, have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at $T_i^{(r)}$: SMP says that, given the state at the stopping time $T_i^{(r)}$, the future after $T_i^{(r)}$ is independent of the past before $T_i^{(r)}$. But the state at $T_i^{(r)}$ is, by definition, equal to $i$. Therefore, the future is independent of the past, and this proves the first claim. The claim that $X_i^{(1)}, X_i^{(2)}, \ldots$ have the same distribution follows from the fact that the Markov chain is time-homogeneous. Therefore, if it starts from state $i$ at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important corollary of the observation of this section is that:

Corollary 5. (Numerical) functions $g(X_i^{(0)}), g(X_i^{(1)}), g(X_i^{(2)}), \ldots$ of the excursions are independent random variables.

Example: Define
$$\Lambda_r := \sum_{n = T_i^{(r)}}^{T_i^{(r+1)}} \mathbf{1}(X_n = j).$$
The meaning of $\Lambda_r$ is this: it expresses the number of visits to a state $j$ between two successive visits to the state $i$. The last corollary ensures us that $\Lambda_1, \Lambda_2, \ldots$ are independent random variables. Moreover, for time-homogeneous Markov chains, they are also identical in law (i.i.d.).

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum.

We abstract this trivial observation and give a couple of definitions. First, we say that a state $i$ is recurrent if, starting from $i$, the chain returns to $i$ infinitely many times:
$$P_i(X_n = i \text{ for infinitely many } n) = 1.$$
Second, we say that a state $i$ is transient if, starting from $i$, the chain returns to $i$ only finitely many times:
$$P_i(X_n = i \text{ for infinitely many } n) = 0.$$

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:
$$0 < P_i(X_n = i \text{ for infinitely many } n) < 1.$$
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state $i$ (excluding the possibility that $X_0 = i$):
$$J_i = \sum_{n=1}^{\infty} \mathbf{1}(X_n = i) = \sup\{r \ge 0 : T_i^{(r)} < \infty\}.$$
Suppose that the Markov chain starts from state $i$. Notice that:
$$\text{state } i \text{ is visited infinitely many times} \iff J_i = \infty.$$
Notice also that
$$f_{ii} := P_i(J_i \ge 1) = P_i(T_i < \infty).$$
We claim that:

Lemma 4. Starting from $X_0 = i$, the random variable $J_i$ is geometric:
$$P_i(J_i \ge k) = f_{ii}^k, \quad k \ge 0.$$

Proof. By the strong Markov property at time $T_i^{(k-1)}$, we have
$$P_i(J_i \ge k) = P_i(J_i \ge k - 1)\, P_i(J_i \ge 1).$$
Indeed, the event $\{J_i \ge k\}$ implies that $T_i^{(k-1)} < \infty$ and that $J_i \ge k - 1$. But the past before $T_i^{(k-1)}$ is independent of the future, and that is why we get the product. Therefore,
$$P_i(J_i \ge k) = f_{ii}\, P_i(J_i \ge k - 1) = f_{ii}^2\, P_i(J_i \ge k - 2) = \cdots = f_{ii}^k.$$

Notice that $f_{ii} = P_i(T_i < \infty)$ can either be equal to 1 or smaller than 1. If $f_{ii} = 1$ we have that $P_i(J_i \ge k) = 1$ for all $k$, i.e. $P_i(J_i = \infty) = 1$. This means that the state $i$ is recurrent. If $f_{ii} < 1$, then $P_i(J_i < \infty) = 1$. This means that the state $i$ is transient. Because either $f_{ii} = 1$ or $f_{ii} < 1$, we conclude what we asserted earlier, namely, every state is either recurrent or transient.

We summarise this together with another useful characterisation of recurrence and transience in the following two lemmas.

Lemma 5. The following are equivalent:
1. State $i$ is recurrent.
2. $f_{ii} = P_i(T_i < \infty) = 1$.
3. $P_i(J_i = \infty) = 1$.
4. $E_i J_i = \infty$.

Proof. State $i$ is recurrent if, by definition, it is visited infinitely many times, which, by definition of $J_i$, is equivalent to $P_i(J_i = \infty) = 1$. Hence $E_i J_i = \infty$ also. If $E_i J_i = \infty$, then, owing to the fact that $J_i$ is a geometric random variable, we see that its expectation cannot be infinity without $J_i$ itself being infinity with probability one. The equivalence of the first two statements was proved before the lemma.

Lemma 6. The following are equivalent:
1. State $i$ is transient.
2. $f_{ii} = P_i(T_i < \infty) < 1$.
3. $P_i(J_i < \infty) = 1$.
4. $E_i J_i < \infty$.

Proof. State $i$ is transient if, by definition, it is visited finitely many times with probability 1. This, by the definition of $J_i$, is equivalent to $P_i(J_i < \infty) = 1$, and, since $J_i$ is a geometric random variable, this implies that $E_i J_i < \infty$. On the other hand, if we assume $E_i J_i < \infty$, then, clearly, $J_i$ cannot take value $+\infty$ with positive probability; therefore $P_i(J_i < \infty) = 1$. The equivalence of the first two statements was proved above.

Corollary 6. If $i$ is a transient state then $p^{(n)}_{ii} \to 0$, as $n \to \infty$.

Proof. If $i$ is transient then $E_i J_i < \infty$. But
$$E_i J_i = E_i \sum_{n=1}^{\infty} \mathbf{1}(X_n = i) = \sum_{n=1}^{\infty} P_i(X_n = i) = \sum_{n=1}^{\infty} p^{(n)}_{ii}.$$
But if a series is finite then the summands go to 0, i.e. $p^{(n)}_{ii} \to 0$, as $n \to \infty$.

10.1 First hitting time decomposition

Define
$$f^{(n)}_{ij} := P_i(T_j = n), \quad n \ge 1,$$
i.e. the probability to hit state $j$ for the first time at time $n \ge 1$, assuming that the chain starts from state $i$. Notice the difference with $p^{(n)}_{ij}$. The latter is the probability to be in state $j$ at time $n$ (not necessarily for the first time) starting from $i$. Clearly, $f^{(n)}_{ij} \le p^{(n)}_{ij}$, but the two sequences are related in an exact way:

Lemma 7.
$$p^{(n)}_{ij} = \sum_{m=1}^{n} f^{(m)}_{ij}\, p^{(n-m)}_{jj}, \quad n \ge 1.$$

Proof. From Elementary Probability,
$$p^{(n)}_{ij} = P_i(X_n = j) = \sum_{m=1}^{n} P_i(T_j = m, X_n = j) = \sum_{m=1}^{n} P_i(T_j = m)\, P_i(X_n = j \mid T_j = m).$$

In particular. showing that j will appear in inﬁnitely many of the excursions for sure. Proof. excursions Xi . Indeed. 10. denoted by i j: it means that.2 Communicating classes and recurrence/transience Recall the deﬁnition of the relation “state i leads to state j”.The ﬁrst term equals fij . If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. We now show that recurrence is a class property. but also fij = 1. Now. if we let fij := Pi (Tj < ∞) .) Theorem 10. Hence. . we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). j i. Start with X0 = i. . For the second term. inﬁnitely many of them will be 1 (with probability 1). In a communicating class. ( Remember: another property that was a class property was the property that the state have a certain period.. the probability that j will appear in a speciﬁc excursion equals gij = Pi (Tj < Ti ). by summing over n the relation of Lemma 7.r := 1 j appears in excursion Xi (r) . 1. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. we have i j ⇐⇒ fij > 0. ∞ Proof. The last statement is simply an expression in symbols of what we said above. and gij := Pi (Tj < Ti ) > 0. for all states i. by deﬁnition. we have n=1 pjj = Ej Jj < ∞.i. Xi . and hence pij as n → ∞.i. Hence. r = 0. If j is transient. not only fij > 0. as n → ∞. either all states are recurrent or all states are transient. . If j is a transient state then pij → 0. which is ﬁnite. starting from i. Corollary 7. are i. In other words. and since they take the value 1 with positive probability. we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj . .d. . the probability to go to j in some ﬁnite time is positive. Corollary 8.d. j i. and so the probability that j will appear in a speciﬁc excursion is positive. . 

Proof. If $i$ is recurrent then every other $j$ in the same communicating class satisfies $i \leadsto j$, and hence, by the above theorem, every other $j$ is also recurrent. If a communicating class contains a state which is transient then, necessarily, all other states are also transient.

Theorem 11. If $i$ is inessential then it is transient.

Proof. Suppose that $i$ is inessential, i.e. there is a state $j$ such that $i \leadsto j$ but $j \not\leadsto i$. Hence there is $m$ such that, with $A := \{X_m = j\}$ and $B := \{X_n = i \text{ for infinitely many } n\}$,
$$P_i(A) = P_i(X_m = j) > 0,$$
but
$$P_i(A \cap B) = P_i(X_m = j,\ X_n = i \text{ for infinitely many } n) = 0.$$
Hence
$$0 < P_i(A) = P_i(A \cap B) + P_i(A \cap B^c) = P_i(A \cap B^c) \le P_i(B^c),$$
i.e. $P_i(B) = P_i(X_n = i \text{ for infinitely many } n) < 1$. Since this probability can only be 0 or 1, necessarily $P_i(B) = 0$, and so $i$ is a transient state.

So if a communicating class is recurrent then it contains no inessential states and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let $C$ be the class. By closedness, $X$ remains in $C$, if started in $C$. By finiteness, some state $i$ must be visited infinitely many times. This state is recurrent. Hence all states are recurrent.

11 Positive recurrence

Recall that if we fix a state $i$ and let $T_i$ be the first return time to $i$ then either $P_i(T_i < \infty) = 1$ (in which case we say that $i$ is recurrent) or $P_i(T_i < \infty) < 1$ (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let $U_0, U_1, U_2, \ldots$ be i.i.d. random variables, uniformly distributed in the unit interval $[0, 1]$. Let
$$T := \inf\{n \ge 1 : U_n > U_0\}.$$
Thus $T$ represents the number of variables we need to draw to get a value larger than the initial one. It should be clear that this can, for example, appear in a certain game. What is the expectation of $T$? First, observe that $T$ is finite. Indeed,
$$P(T > n) = P(U_1 \le U_0, \ldots, U_n \le U_0) = \int_0^1 P(U_1 \le x, \ldots, U_n \le x \mid U_0 = x)\,dx = \int_0^1 x^n\,dx = \frac{1}{n+1}.$$

Therefore,
$$P(T < \infty) = \lim_{n\to\infty} P(T \le n) = \lim_{n\to\infty} \Big(1 - \frac{1}{n+1}\Big) = 1.$$
But
$$ET = \sum_{n=0}^{\infty} P(T > n) = \sum_{n=0}^{\infty} \frac{1}{n+1} = \infty.$$
You are invited to think about the meaning of this: If the initial value $U_0$ is unknown, then it takes, on the average, an infinite number of steps to exceed it!

We now classify a recurrent state $i$ as follows: We say that $i$ is positive recurrent if $E_i T_i < \infty$. We say that $i$ is null recurrent if $E_i T_i = \infty$.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading. It is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let $Z_1, Z_2, \ldots$ be i.i.d. random variables.
(i) Suppose $E|Z_1| < \infty$ and let $\mu = EZ_1$. Then
$$P\Big(\lim_{n\to\infty} \frac{Z_1 + \cdots + Z_n}{n} = \mu\Big) = 1.$$
(ii) Suppose $E\max(Z_1, 0) = \infty$, but $E\min(Z_1, 0) > -\infty$. Then
$$P\Big(\lim_{n\to\infty} \frac{Z_1 + \cdots + Z_n}{n} = \infty\Big) = 1.$$

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence $X_0, X_1, X_2, \ldots$ of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state $a \in S$ (which will be referred to as the ground state for no good reason other than that it has been fixed once and for all) the sequence of excursions
$$X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, \ldots$$
forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that
$$G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), \ldots$$
are i.i.d. random variables, for reasonable choices of the functions $G$ taking values in $\mathbb{R}$.

Example 1: To each state $x \in S$, assign a reward $f(x)$. Then, for an excursion $\mathcal{X}$, define $G(\mathcal{X})$ to be the maximum reward received during this excursion. Specifically, in this example,
$$G(X_a^{(r)}) = \max\{f(X_t) : T_a^{(r)} \le t < T_a^{(r+1)}\}.$$

Example 2: Let $f(x)$ be as in Example 1, but define $G(\mathcal{X})$ to be the total reward received over the excursion $\mathcal{X}$. Thus,
$$G(X_a^{(r)}) = \sum_{T_a^{(r)} \le t < T_a^{(r+1)}} f(X_t). \qquad (7)$$

Whichever the choice of $G$, the law of large numbers tells us that
$$\frac{G(X_a^{(1)}) + \cdots + G(X_a^{(n)})}{n} \to EG(X_a^{(1)}), \qquad (8)$$
as $n \to \infty$, with probability 1. Note that
$$EG(X_a^{(1)}) = EG(X_a^{(2)}) = \cdots = E\big[G(X_a^{(0)}) \mid X_0 = a\big] = E_a G(X_a^{(0)}),$$
because the excursions $X_a^{(1)}, X_a^{(2)}, \ldots$ are i.i.d., and because if we start with $X_0 = a$, the initial excursion $X_a^{(0)}$ also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on $X_0 = a$.

13.2 Average reward

Consider now a function $f : S \to \mathbb{R}$, which, as in Example 2, we can think of as a reward received when the chain is in state $x$. We are interested in the existence of the limit
$$\text{time-average reward} = \lim_{t\to\infty} \frac{1}{t} \sum_{k=0}^{t} f(X_k),$$
which represents the 'time-average reward' received. In particular, we shall assume that $f$ is bounded and shall examine those conditions which yield a nontrivial limit.
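The two choices of $G$ in Examples 1 and 2 can be made concrete on a short path. The following is only a minimal sketch: the sample path, the reward `f` and the helper names (`return_times`, `excursions`, `G_max`, `G_sum`) are made up for illustration and are not part of these notes.

```python
def return_times(path, a):
    """Successive return times T_a^(1), T_a^(2), ...: the times n >= 1 with X_n = a."""
    return [n for n, x in enumerate(path) if n >= 1 and x == a]

def excursions(path, a):
    """Complete excursions r = 1, 2, ...: the states X_t for T_a^(r) <= t < T_a^(r+1)."""
    T = return_times(path, a)
    return [path[start:end] for start, end in zip(T, T[1:])]

def G_max(excursion, f):
    """Example 1: maximum reward received during the excursion."""
    return max(f(x) for x in excursion)

def G_sum(excursion, f):
    """Example 2: total reward received over the excursion, as in (7)."""
    return sum(f(x) for x in excursion)

# Hypothetical sample path started at the ground state a = 0, and a hypothetical reward.
path = [0, 1, 2, 0, 0, 1, 2, 0, 2]
f = lambda x: 10 * x

excs = excursions(path, 0)                   # return times are 3, 4, 7
max_rewards = [G_max(e, f) for e in excs]
sum_rewards = [G_sum(e, f) for e in excs]
```

The initial excursion (times 0 to 2) and the incomplete final one (times 7 and 8) are deliberately left out, since only the complete excursions form the i.i.d. sequence to which (8) applies.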

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state $a \in S$ (i.e. $E_a T_a < \infty$) and assume that the distribution of $X_0$ is such that $P(T_a < \infty) = 1$. Let $f : S \to \mathbb{R}_+$ be a bounded 'reward' function. Then, with probability one, the 'time-average reward' $\bar f$ exists and is a deterministic quantity:
$$P\Big(\lim_{t\to\infty} \frac{1}{t} \sum_{n=0}^{t} f(X_n) = \bar f\Big) = 1,$$
where
$$\bar f = \frac{E_a \sum_{n=0}^{T_a - 1} f(X_n)}{E_a T_a} = \sum_{x\in S} f(x)\pi(x).$$

Proof. We shall present the proof in a way that the constant $\bar f$ will reveal itself: it will be discovered. The idea is this: Suppose that $t$ is large enough. We are looking for the total reward received between 0 and $t$. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, $N_t$, say, of complete excursions from the ground state $a$ that occur between 0 and $t$:
$$N_t := \max\{r \ge 1 : T_a^{(r)} \le t\}.$$
For example, in the figure below, $N_t = 5$.

[Figure: time axis from 0 to $t$, marked with the successive return times $T_a^{(1)}, T_a^{(2)}, T_a^{(3)}, T_a^{(4)}, T_a^{(5)}$ to the ground state $a$.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
$$\sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t - 1} G_r + G_t^{\text{first}} + G_t^{\text{last}}.$$
In other words, $G_r$ is given by (7), $G_t^{\text{first}}$ is the reward up to time $T_a^{(1)} = T_a$ or $t$ (whichever is smaller), and $G_t^{\text{last}}$ is the reward over the last incomplete cycle. We assumed that $f$ is bounded, say $|f(x)| \le B$. Hence $|G_t^{\text{first}}| \le B T_a$. Since $P(T_a < \infty) = 1$, we have $B T_a / t \to 0$, as $t \to \infty$, with probability one, and so
$$\frac{1}{t}\, G_t^{\text{first}} \to 0,$$
as $t \to \infty$, with probability one. A similar argument shows that
$$\frac{1}{t}\, G_t^{\text{last}} \to 0,$$

also. So we are left with the sum of the rewards over complete cycles. Write now
$$\frac{1}{t} \sum_{r=1}^{N_t - 1} G_r = \frac{N_t}{t} \cdot \frac{1}{N_t} \sum_{r=1}^{N_t - 1} G_r. \qquad (9)$$
Look again at what the Law of Large Numbers (8) tells us: that
$$\lim_{n\to\infty} \frac{1}{n} \sum_{r=1}^{n} G_r = EG_1,$$
with probability one. Since $T_a^{(r)} \to \infty$, as $r \to \infty$, it follows that the number $N_t$ of complete cycles before time $t$ will also be tending to $\infty$, as $t \to \infty$. Hence,
$$\lim_{t\to\infty} \frac{1}{N_t} \sum_{r=1}^{N_t - 1} G_r = EG_1. \qquad (10)$$
Now look at the sequence
$$T_a^{(r)} = T_a^{(1)} + \sum_{k=2}^{r} \big(T_a^{(k)} - T_a^{(k-1)}\big)$$
and see what the Law of Large Numbers tells us again. Since $T_a^{(1)}/r \to 0$, as $r \to \infty$, and since $T_a^{(k)} - T_a^{(k-1)}$, $k = 2, 3, \ldots$, are i.i.d. random variables with common expectation $E_a T_a$, we have
$$\lim_{r\to\infty} \frac{T_a^{(r)}}{r} = E_a T_a. \qquad (11)$$
By definition, we have that the visit with index $r = N_t$ to the ground state $a$ is the last before $t$ (see also figure above):
$$T_a^{(N_t)} \le t < T_a^{(N_t + 1)}.$$
Hence
$$\frac{T_a^{(N_t)}}{N_t} \le \frac{t}{N_t} < \frac{T_a^{(N_t + 1)}}{N_t}.$$
But (11) tells us that
$$\lim_{t\to\infty} \frac{T_a^{(N_t)}}{N_t} = \lim_{t\to\infty} \frac{T_a^{(N_t + 1)}}{N_t} = E_a T_a.$$
Hence
$$\lim_{t\to\infty} \frac{N_t}{t} = \frac{1}{E_a T_a}. \qquad (12)$$
Using (10) and (12) in (9) we have
$$\lim_{t\to\infty} \frac{1}{t} \sum_{r=1}^{N_t - 1} G_r = \frac{EG_1}{E_a T_a}.$$
Thus
$$\lim_{t\to\infty} \frac{1}{t} \sum_{n=0}^{t} f(X_n) = \frac{EG_1}{E_a T_a} =: \bar f.$$

To see that $\bar f$ is as claimed, just observe that
$$EG_1 = E \sum_{T_a^{(1)} \le t < T_a^{(2)}} f(X_t) = E_a \sum_{0 \le t < T_a} f(X_t),$$
from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant $\bar f$ is a sort of average over an excursion. The numerator of $\bar f$ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). The denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both. We may write
$$EG_1 = E_a \sum_{0 \le t < T_a} f(X_t) \quad\text{or}\quad EG_1 = E_a \sum_{0 < t \le T_a} f(X_t).$$

[Figure: the rewards $f(X_0) = f(a), f(X_1), f(X_2), \ldots, f(X_{T_a - 1})$ collected over one excursion, plotted from time 0 to time $T_a$.]

Comment 2: A very important particular case occurs when we choose $f(x) := \mathbf{1}(x = b)$, i.e. we let $f$ be 1 at the state $b$ and 0 everywhere else. The theorem above then says that, for all $b \in S$,
$$\lim_{t\to\infty} \frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = b) = \frac{E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = b)}{E_a T_a}, \qquad (13)$$
with probability 1.

Comment 3: If in the above we let $b = a$, the ground state, then we see that the numerator equals 1. Hence
$$\lim_{t\to\infty} \frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = a) = \frac{1}{E_a T_a}, \qquad (14)$$
with probability 1. The physical interpretation of the latter should be clear: If, on the average, it takes $E_a T_a$ units of time between successive visits to the state $a$, then the long-run proportion of visits to the state $a$ should equal $1/E_a T_a$.

Comment 4: Suppose that two states, $a$, $b$, communicate and are positive recurrent. We can use $b$ as a ground state instead of $a$. Applying formula (14) with $b$ in place of $a$ we find that the limit in (13) also equals $1/E_b T_b$. The equality of these two limits gives:
$$\text{average no. of times state } b \text{ is visited between two successive visits to state } a \;=\; E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = b) = \frac{E_a T_a}{E_b T_b}.$$
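Formula (14) can be checked empirically. The sketch below simulates a two-state chain and compares the fraction of time spent in state 0 with the reciprocal of the average observed return time. The chain ($p_{00} = 0.01$, $p_{01} = 0.99$, $p_{10} = 1$), the seed, the run length and the function names are assumptions chosen for illustration.

```python
import random

def simulate(P, start, steps, rng):
    """Simulate a chain whose rows are dicts P[x] = {y: p_xy}; return X_0, ..., X_steps."""
    path = [start]
    x = start
    for _ in range(steps):
        states = list(P[x])
        weights = [P[x][y] for y in states]
        x = rng.choices(states, weights=weights)[0]
        path.append(x)
    return path

def visit_fraction(path, a):
    """Fraction of the times n = 1, ..., t at which X_n = a (left side of (14))."""
    t = len(path) - 1
    return sum(1 for x in path[1:] if x == a) / t

def mean_return_time(path, a):
    """Average gap between successive visits to a (an estimate of E_a T_a)."""
    visits = [n for n, x in enumerate(path) if x == a]
    gaps = [later - earlier for earlier, later in zip(visits, visits[1:])]
    return sum(gaps) / len(gaps)

# Illustrative chain: from 0 stay with probability 0.01, else move to 1; from 1 return to 0.
P = {0: {0: 0.01, 1: 0.99}, 1: {0: 1.0}}
rng = random.Random(2009)
path = simulate(P, 0, 200_000, rng)

frac = visit_fraction(path, 0)
mrt = mean_return_time(path, 0)
print(frac, 1 / mrt)
```

For this chain $T_0$ equals 1 with probability 0.01 and 2 with probability 0.99, so $E_0 T_0 = 1.99$ and both printed numbers should be close to $1/1.99 \approx 0.5025$.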

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
$$\nu^{[a]}(x) := E_a \sum_{n=0}^{T_a - 1} \mathbf{1}(X_n = x), \quad x \in S, \qquad (15)$$
should play an important role. Indeed, we saw (Comment 2 above) that if $a$ is a positive recurrent state then
$$\frac{\nu^{[a]}(x)}{E_a T_a}$$
is the long-run fraction of times that state $x$ is visited. In view of this, this should have something to do with the concept of stationary distribution discussed a few sections ago.

Note: Although $\nu^{[a]}$ depends on the choice of the 'ground' state $a$, we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript $[a]$ will soon be dropped. Moreover, notice that the choice of notation $\nu^{[a]}$ is justifiable through the notation (6) for the communicating class of the state $a$.

Proposition 2. Fix a ground state $a \in S$ and assume that it is recurrent (we stress that we do not need the stronger assumption of positive recurrence here). Consider the function $\nu^{[a]}$ defined in (15). Then $\nu^{[a]}$ satisfies the balance equations:
$$\nu^{[a]} = \nu^{[a]} P,$$
i.e.
$$\nu^{[a]}(x) = \sum_{y\in S} \nu^{[a]}(y)\, p_{yx}, \quad x \in S.$$
Moreover, for any state $x$ such that $a \leadsto x$ ($a$ leads to $x$), we have
$$0 < \nu^{[a]}(x) < \infty.$$

Proof. We start the Markov chain with $X_0 = a$, where $a$ is the fixed recurrent ground state whose existence was assumed. Recurrence tells us that $T_a < \infty$ with probability one. Since $a = X_{T_a}$, we can write
$$\nu^{[a]}(x) = E_a \sum_{n=1}^{T_a} \mathbf{1}(X_n = x). \qquad (16)$$
This was a trivial but important step: We replaced the $n = 0$ term with the $n = T_a$ term. Both of them equal $\mathbf{1}(x = a)$, so there is nothing lost in doing so. The real reason for doing so is because we wanted to replace a function of $X_0, X_1, X_2, \ldots$ by a function of $X_1, X_2, X_3, \ldots$

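and thus be in a position to apply first-step analysis, which is precisely what we do next:
$$\begin{aligned}
\nu^{[a]}(x) &= E_a \sum_{n=1}^{\infty} \mathbf{1}(X_n = x,\ n \le T_a) \\
&= \sum_{n=1}^{\infty} P_a(X_n = x,\ n \le T_a) \\
&= \sum_{n=1}^{\infty} \sum_{y\in S} P_a(X_n = x,\ X_{n-1} = y,\ n \le T_a) \\
&= \sum_{n=1}^{\infty} \sum_{y\in S} p_{yx}\, P_a(X_{n-1} = y,\ n \le T_a) \\
&= \sum_{y\in S} p_{yx}\, E_a \sum_{n=1}^{T_a} \mathbf{1}(X_{n-1} = y) \\
&= \sum_{y\in S} p_{yx}\, \nu^{[a]}(y),
\end{aligned}$$
where the first equality uses (16), the fourth uses the Markov property (the event $\{n \le T_a\}$ is determined by $X_0, \ldots, X_{n-1}$), and the last uses the definition (15). So we have shown
$$\nu^{[a]}(x) = \sum_{y\in S} p_{yx}\, \nu^{[a]}(y), \quad \text{for all } x \in S,$$
which are precisely the balance equations.

We just showed that $\nu^{[a]} = \nu^{[a]} P$. Therefore, $\nu^{[a]} = \nu^{[a]} P^n$, for all $n \in \mathbb{N}$. Recall also that $\nu^{[a]}(a) = 1$, by definition. Next, let $x$ be such that $a \leadsto x$; so there exists $n \ge 1$ such that $p_{ax}^{(n)} > 0$. Hence
$$\nu^{[a]}(x) \ge \nu^{[a]}(a)\, p_{ax}^{(n)} = p_{ax}^{(n)} > 0.$$
From Theorem 10, $x \leadsto a$; so there exists $m \ge 1$ such that $p_{xa}^{(m)} > 0$. Hence
$$1 = \nu^{[a]}(a) \ge \nu^{[a]}(x)\, p_{xa}^{(m)},$$
so, indeed, $\nu^{[a]}(x) \le 1/p_{xa}^{(m)} < \infty$.

Note: The last result was shown under the assumption that $a$ is a recurrent state. We are now going to assume more, namely, that $a$ is positive recurrent. First note that, from (16),
$$\sum_{x\in S} \nu^{[a]}(x) = E_a \sum_{n=1}^{T_a} \sum_{x\in S} \mathbf{1}(X_n = x) = E_a \sum_{n=1}^{T_a} 1 = E_a T_a,$$
which, by definition, is finite when $a$ is positive recurrent. We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state $a$. Define
$$\pi^{[a]}(x) := \frac{\nu^{[a]}(x)}{E_a T_a} = \frac{E_a \sum_{n=0}^{T_a - 1} \mathbf{1}(X_n = x)}{E_a T_a}, \quad x \in S. \qquad (17)$$
This $\pi^{[a]}$ is a stationary distribution.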

Proof. Since $a$ is positive recurrent, we have $E_a T_a < \infty$, and so $\pi^{[a]}$ is not trivially equal to zero. We have already shown, in Proposition 2, that $\nu^{[a]} P = \nu^{[a]}$. But $\pi^{[a]}$ is simply a multiple of $\nu^{[a]}$. Therefore, $\pi^{[a]} P = \pi^{[a]}$. It only remains to show that $\pi^{[a]}$ is a probability, i.e. that it sums up to one. But
$$\sum_{x\in S} \pi^{[a]}(x) = \frac{1}{E_a T_a} \sum_{x\in S} \nu^{[a]}(x) = 1.$$

Pause for thought: We have achieved something VERY IMPORTANT: if we have one, no matter which one, positive recurrent state, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique. This is what we will try to do in the sequel.

In other words: If there is a positive recurrent state, then the set of linear equations
$$\pi(x) = \sum_{y\in S} \pi(y)\, p_{yx}, \quad x \in S, \qquad \sum_{y\in S} \pi(y) = 1$$
has at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let $i$ be a positive recurrent state. Suppose that $i \leadsto j$. Then $j$ is positive recurrent.

Proof. Start with $X_0 = i$. Consider the excursions $X_i^{(r)}$, $r = 0, 1, 2, \ldots$. We already know that $j$ is recurrent. Hence $j$ will occur in infinitely many excursions, and the probability $g_{ij} = P_i(T_j < T_i)$ that $j$ occurs in a specific excursion is positive. Also, since the excursions are independent, the event that $j$ occurs in a specific excursion is independent from the event that $j$ occurs in the next excursion. In words, to decide whether $j$ will occur in a specific excursion, we toss a coin with probability of heads $g_{ij} > 0$. If heads show up at the $r$-th excursion, then $j$ occurs in the $r$-th excursion. Otherwise, if tails show up, then the $r$-th excursion does not contain state $j$. The coins are i.i.d. Let $R_1$ (respectively, $R_2$) be the index of the first (respectively, second) excursion that contains $j$.

[Figure: successive excursions from $i$ with indices $R_1, R_1 + 1, R_1 + 2, \ldots, R_2$, where $j$ is visited in excursions $R_1$ and $R_2$; the total duration $\zeta$ of these excursions is marked, and each excursion duration is labelled "This random variable has the same law as $T_i$ under $P_i$".]

We just need to show that the sum of the durations of excursions with indices $R_1, \ldots, R_2$ has finite expectation. In other words, we just need to show that
$$\zeta := \sum_{r=1}^{R_2 - R_1 + 1} Z_r$$
has finite expectation, where the $Z_r$ are i.i.d. with the same law as $T_i$. But, by Wald's lemma, the expectation of this random variable equals
$$E_i \zeta = (E_i T_i) \times E_i(R_2 - R_1 + 1) = (E_i T_i) \times (g_{ij}^{-1} + 1),$$
because $R_2 - R_1$ is a geometric random variable:
$$P_i(R_2 - R_1 = r) = (1 - g_{ij})^{r-1} g_{ij}, \quad r = 1, 2, \ldots$$
Since $g_{ij} > 0$ and $E_i T_i < \infty$ we have that $E_i \zeta < \infty$. Since the time between the first two visits to $j$ is at most $\zeta$, this shows that $j$ is positive recurrent.

Corollary 10. Let $C$ be a communicating class. Then either all states in $C$ are positive recurrent, or all states are null recurrent or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent.
An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent.
An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.
An irreducible Markov chain is called transient if one (and hence all) states are transient.

[Diagram: IRREDUCIBLE CHAIN splits into RECURRENT and TRANSIENT; RECURRENT splits into POSITIVE RECURRENT and NULL RECURRENT.]
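The construction of Section 14 can be carried out numerically for a small chain. The sketch below computes $\nu^{[a]}(x) = \sum_{n \ge 0} P_a(X_n = x,\ n < T_a)$ by propagating the distribution of the chain while discarding the paths that have already returned to $a$, then normalises by $E_a T_a = \sum_x \nu^{[a]}(x)$. The three-state transition matrix, the tolerance and the function name `nu` are assumptions made for illustration.

```python
def nu(P, a, tol=1e-12):
    """Approximate nu^[a](x) = E_a sum_{n=0}^{T_a - 1} 1(X_n = x) for a finite chain.

    v holds P_a(X_n = x, n < T_a); it is summed over n until its mass is below tol.
    """
    S = range(len(P))
    v = [1.0 if x == a else 0.0 for x in S]      # n = 0: the chain sits at a
    total = [0.0] * len(P)
    while sum(v) > tol:
        total = [acc + mass for acc, mass in zip(total, v)]
        # advance one step, dropping the paths that return to a at this step
        v = [0.0 if y == a else sum(v[x] * P[x][y] for x in S) for y in S]
    return total

# Illustrative irreducible chain on {0, 1, 2} (an assumption, not taken from the notes).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

nu0 = nu(P, 0)
mean_return = sum(nu0)                            # E_0 T_0 = sum over x of nu^[0](x)
pi0 = [v / mean_return for v in nu0]              # pi^[0] as in (17)
balance = [sum(nu0[y] * P[y][x] for y in range(3)) for x in range(3)]  # nu^[0] P
```

For this chain the balance equations can be solved by hand, giving $\pi = (1/4, 1/2, 1/4)$, so one expects `nu0` close to $(1, 2, 1)$, `mean_return` close to 4, and `balance` equal to `nu0` up to rounding.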

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Figure: transition diagram of a chain on the states 0, 1, 2 with arrow labels 1, 1/3, 1/2, 1, 1/6; states 1 and 2 are absorbing.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). Hence, there is a stationary distribution. But observe that there are many (infinitely many) stationary distributions: For each $0 \le \alpha \le 1$, the $\pi$ defined by
$$\pi(0) = 0, \quad \pi(1) = \alpha, \quad \pi(2) = 1 - \alpha$$
is a stationary distribution. (You should think why this is OBVIOUS!) Notice that $1 \not\leadsto 2$. This lack of communication is, indeed, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let $\pi$ be a stationary distribution. Consider the Markov chain $(X_n, n \ge 0)$ and suppose that
$$P(X_0 = x) = \pi(x), \quad x \in S.$$
In other words, we start the Markov chain from the stationary distribution $\pi$. Then, for all $n$, we have
$$P(X_n = x) = \pi(x), \quad x \in S. \qquad (18)$$
Since the chain is irreducible and positive recurrent we have $P_i(T_j < \infty) = 1$, for all states $i$ and $j$. Fix a ground state $a$. Then
$$P(T_a < \infty) = \sum_{x\in S} \pi(x)\, P_x(T_a < \infty) = \sum_{x\in S} \pi(x) = 1.$$
By (13), following Theorem 14, we have, for all $x \in S$,
$$\frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = x) \to \frac{E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = x)}{E_a T_a} = \pi^{[a]}(x),$$
as $t \to \infty$, with probability one. By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded above by 1, we can take expectations of both sides:
$$E\Big[\frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = x)\Big] \to \pi^{[a]}(x),$$
as $t \to \infty$. But, by (18),
$$E\Big[\frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = x)\Big] = \frac{1}{t} \sum_{n=1}^{t} P(X_n = x) = \pi(x).$$

Therefore $\pi(x) = \pi^{[a]}(x)$, for all $x \in S$. Thus, an arbitrary stationary distribution $\pi$ must be equal to the specific one $\pi^{[a]}$. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states $a, b \in S$. Then
$$\pi^{[a]} = \pi^{[b]}.$$
In particular, we obtain the cycle formula:
$$\frac{E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = x)}{E_a T_a} = \frac{E_b \sum_{k=0}^{T_b - 1} \mathbf{1}(X_k = x)}{E_b T_b}, \quad x \in S.$$

Proof. There is a unique stationary distribution. Both $\pi^{[a]}$ and $\pi^{[b]}$ are stationary distributions, so they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution $\pi$. Then, for all $a \in S$,
$$\pi(a) = \frac{1}{E_a T_a}.$$

Proof. There is only one stationary distribution. Hence $\pi = \pi^{[a]}$. In particular, $\pi(a) = \pi^{[a]}(a) = 1/E_a T_a$, from the definition of $\pi^{[a]}$.

In Section 14 we assumed the existence of a positive recurrent state $a$ and proved the existence of a stationary distribution $\pi$. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain. Assume that a stationary distribution $\pi$ exists. Then any state $i$ such that $\pi(i) > 0$ is positive recurrent.

Proof. Let $i$ be such that $\pi(i) > 0$. Let $C$ be the communicating class of $i$. Let $\pi(\cdot \mid C)$ be the distribution defined by restricting $\pi$ to $C$:
$$\pi(x \mid C) := \frac{\pi(x)}{\pi(C)}, \quad x \in C,$$
where $\pi(C) = \sum_{y\in C} \pi(y)$. Consider the restriction of the chain to $C$. Namely, delete all transition probabilities $p_{xy}$ with $x \notin C$ or $y \notin C$. We can easily see that we thus obtain a Markov chain with state space $C$ with stationary distribution $\pi(\cdot \mid C)$. This is, in fact, the only stationary distribution for the restricted chain because $C$ is irreducible. By the above corollary, $\pi(i \mid C) = 1/E_i T_i$. But $\pi(i \mid C) > 0$. Therefore $E_i T_i < \infty$, i.e. $i$ is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain. Decompose it into communicating classes. Then for each positive recurrent communicating class $C$ there corresponds a unique stationary distribution $\pi^C$

which assigns positive probability to each state in $C$. This is defined by picking a state (any state!) $a \in C$ and letting $\pi^C := \pi^{[a]}$. An arbitrary stationary distribution $\pi$ must be a linear combination of these $\pi^C$.

Proposition 3. To each positive recurrent communicating class $C$ there corresponds a stationary distribution $\pi^C$. Any stationary distribution $\pi$ is necessarily of the form
$$\pi = \sum_{C:\ C \text{ positive recurrent communicating class}} \alpha_C\, \pi^C,$$
where $\alpha_C \ge 0$, such that $\sum_C \alpha_C = 1$.

Proof. Let $\pi$ be an arbitrary stationary distribution. If $\pi(x) > 0$ then $x$ belongs to some positive recurrent class $C$. For each such $C$ define the conditional distribution
$$\pi(x \mid C) := \frac{\pi(x)}{\pi(C)}, \quad \text{where } \pi(C) := \sum_{x\in C} \pi(x).$$
Then $(\pi(x \mid C), x \in C)$ is a stationary distribution. By our uniqueness result, $\pi(x \mid C) \equiv \pi^C(x)$. So we have that the above decomposition holds with $\alpha_C = \pi(C)$.

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities $P(X_n = x)$ as $n \to \infty$. So far we have seen that, under irreducibility and positive recurrence conditions, the so-called Cesaro averages $\frac{1}{n} \sum_{k=1}^{n} P(X_k = x)$ converge to $\pi(x)$, as $n \to \infty$. A sequence of real numbers $a_n$ may not converge but its Cesaro averages $\frac{1}{n}(a_1 + \cdots + a_n)$ may converge; for example, take $a_n = (-1)^n$.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. Still, we may say it is stable because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation
$$\dot x = -ax + b, \quad x \in \mathbb{R}.$$
There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever. The thing to observe here is that there is an equilibrium point, i.e. the one found by setting the right-hand side equal to zero:
$$x^* = b/a.$$

[Figure: solutions $x(t)$ plotted against $t$ for several starting positions, all converging to the equilibrium level $x^*$.]

If we start with $x(t = 0) = x^*$, then $x(t) = x^*$ for all $t > 0$. If, moreover, $a > 0$, then regardless of the starting position, we have $x(t) \to x^*$, as $t \to \infty$. We say that $x^*$ is a stable equilibrium.

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
$$X_{n+1} = \rho X_n + \xi_n.$$
Specifically, we here assume that $X_0, \xi_0, \xi_1, \xi_2, \ldots$ are given, mutually independent, random variables, and define a stochastic sequence $X_1, X_2, \ldots$ by means of the recursion above. We shall assume that the $\xi_n$ have the same law. It will turn out that $|\rho| < 1$ is the condition for stability. But what does stability mean? In general, it cannot mean convergence to a fixed equilibrium point, simply because the $\xi_n$'s depend on the time parameter $n$. However, notice what happens when we solve the recursion. And we are in a happy situation here, for we can do it:
$$\begin{aligned}
X_1 &= \rho X_0 + \xi_0 \\
X_2 &= \rho X_1 + \xi_1 = \rho(\rho X_0 + \xi_0) + \xi_1 = \rho^2 X_0 + \rho \xi_0 + \xi_1 \\
X_3 &= \rho X_2 + \xi_2 = \rho^3 X_0 + \rho^2 \xi_0 + \rho \xi_1 + \xi_2 \\
&\ \ \vdots \\
X_n &= \rho^n X_0 + \rho^{n-1} \xi_0 + \rho^{n-2} \xi_1 + \cdots + \rho \xi_{n-2} + \xi_{n-1}.
\end{aligned}$$
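The closed-form solution of the recursion can be checked numerically. In the sketch below, the value of $\rho$, the starting point, the uniform noise and the function names are assumptions made for illustration; the code compares the iterated recursion with $\rho^n X_0 + \sum_{k=0}^{n-1} \rho^{n-1-k} \xi_k$.

```python
import random

def iterate(rho, x0, xi):
    """Run X_{n+1} = rho * X_n + xi_n over the whole noise list; return X_n, n = len(xi)."""
    x = x0
    for noise in xi:
        x = rho * x + noise
    return x

def closed_form(rho, x0, xi):
    """rho^n X_0 + rho^(n-1) xi_0 + ... + rho xi_(n-2) + xi_(n-1)."""
    n = len(xi)
    return rho**n * x0 + sum(rho**(n - 1 - k) * xi[k] for k in range(n))

rng = random.Random(0)
rho, x0 = 0.8, 5.0                       # illustrative values with |rho| < 1
xi = [rng.uniform(-1.0, 1.0) for _ in range(50)]

lhs = iterate(rho, x0, xi)
rhs = closed_form(rho, x0, xi)
first_term = rho**50 * x0                # contribution of the initial state X_0
```

With these values the two computations agree up to rounding, and the contribution of the initial state, $\rho^{50} X_0$, is already of order $10^{-4}$.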

Since $|\rho| < 1$, we have that the first term goes to 0 as $n \to \infty$. The second term,
$$\tilde X_n := \rho^{n-1} \xi_0 + \rho^{n-2} \xi_1 + \cdots + \rho \xi_{n-2} + \xi_{n-1},$$
which is a linear combination of $\xi_0, \ldots, \xi_{n-1}$, does not converge, but notice that if we let
$$\hat X_n := \rho^{n-1} \xi_{n-1} + \rho^{n-2} \xi_{n-2} + \cdots + \rho \xi_1 + \xi_0,$$
we have
$$P(\tilde X_n \le x) = P(\hat X_n \le x).$$
The nice thing is that, under nice conditions (e.g. assume that the $\xi_n$ are bounded), $\hat X_n$ does converge to
$$\hat X_\infty = \sum_{k=0}^{\infty} \rho^k \xi_k.$$
What this means is that, while the limit of $X_n$ does not exist, the limit of its distribution does:
$$\lim_{n\to\infty} P(X_n \le x) = \lim_{n\to\infty} P(\tilde X_n \le x) = \lim_{n\to\infty} P(\hat X_n \le x) = P(\hat X_\infty \le x).$$
This is precisely the reason why we cannot, for time-homogeneous Markov chains, define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution $\pi$ then
$$\lim_{n\to\infty} P(X_n = x) = \pi(x),$$
uniformly in $x$, for any initial distribution. In particular, the $n$-step transition matrix $P^n$ converges, as $n \to \infty$, to a matrix with rows all equal to $\pi$:
$$\lim_{n\to\infty} p_{ij}^{(n)} = \pi(j), \quad i, j \in S.$$

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables $X$, $Y$ but not their joint law. In other words, we know $f(x) = P(X = x)$, $g(y) = P(Y = y)$, but we are not given what $P(X = x, Y = y)$ is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define
$$P(X = x, Y = y) = P(X = x)\, P(Y = y).$$
But suppose that we have some further requirement, for instance, to try to make the coincidence probability $P(X = Y)$ as large as possible. How should we then define what $P(X = x, Y = y)$ is?

Exercise: Let $X$, $Y$ be random variables with values in $\mathbb{Z}$ and laws $f(x) = P(X = x)$, $g(y) = P(Y = y)$, respectively. Find the joint law $h(x, y) = P(X = x, Y = y)$ which respects the marginals, i.e.
$$f(x) = \sum_y h(x, y), \quad g(y) = \sum_x h(x, y),$$
and which maximises $P(X = Y)$.

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process $X = (X_n, n \ge 0)$ and the law of a stochastic process $Y = (Y_n, n \ge 0)$, but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

Suppose that we have two stochastic processes as above and suppose that they have been coupled. We say that a random time $T$ is a meeting time of the two processes if
$$X_n = Y_n \quad \text{for all } n \ge T.$$
This $T$ is a random variable that takes values in the set of times, or it may take the value $+\infty$, on the event that the two processes never meet. Note that it makes sense to speak of a meeting time, precisely because of the assumption that the two processes have been constructed together. The following result is the so-called coupling inequality:

Proposition 4. Let $T$ be a meeting time of two coupled stochastic processes $X$, $Y$. Then, for all $n \ge 0$, and all $x \in S$,
$$|P(X_n = x) - P(Y_n = x)| \le P(T > n).$$
If, in particular, $T$ is finite with probability one, i.e. $P(T < \infty) = 1$, then
$$|P(X_n = x) - P(Y_n = x)| \to 0,$$
uniformly in $x \in S$, as $n \to \infty$.

Proof. We have
$$\begin{aligned}
P(X_n = x) &= P(X_n = x,\ n < T) + P(X_n = x,\ n \ge T) \\
&= P(X_n = x,\ n < T) + P(Y_n = x,\ n \ge T) \\
&\le P(n < T) + P(Y_n = x),
\end{aligned}$$
and hence
$$P(X_n = x) - P(Y_n = x) \le P(T > n). \qquad (19)$$
Repeating (19), but with the roles of $X$ and $Y$ interchanged, we obtain
$$P(Y_n = x) - P(X_n = x) \le P(T > n).$$
Combining the two inequalities we obtain the first assertion:
$$|P(X_n = x) - P(Y_n = x)| \le P(T > n), \quad n \ge 0,\ x \in S.$$
Since this holds for all $x \in S$ and since the right hand side does not depend on $x$, we can write it as
$$\sup_{x\in S} |P(X_n = x) - P(Y_n = x)| \le P(T > n), \quad n \ge 0.$$
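The coupling inequality can be observed numerically for a small chain. The sketch below is an illustration under stated assumptions: the three-state matrix `P`, its stationary distribution `pi`, the initial law `mu` and the function names are all made up for the example. The two chains are run independently until the first time $T$ at which they coincide. If from time $T$ on the first chain is made to follow the second, its law is unchanged and $T$ becomes a meeting time in the sense defined above, so $P(T > n)$ must dominate $\sup_x |P(X_n = x) - \pi(x)|$.

```python
def step(dist, P):
    """One step of the chain: the row vector dist times the matrix P."""
    S = range(len(P))
    return [sum(dist[x] * P[x][y] for x in S) for y in S]

def not_met_yet(mu, pi, P, n_max):
    """P(T > n) for n = 0..n_max, with X_0 ~ mu and Y_0 ~ pi run independently
    and T the first time n >= 0 at which X_n = Y_n."""
    S = range(len(P))
    # w[(x, y)] = P(X_n = x, Y_n = y, T > n); only off-diagonal pairs are kept
    w = {(x, y): mu[x] * pi[y] for x in S for y in S if x != y}
    tail = [sum(w.values())]
    for _ in range(n_max):
        w = {(x2, y2): sum(m * P[x][x2] * P[y][y2] for (x, y), m in w.items())
             for x2 in S for y2 in S if x2 != y2}
        tail.append(sum(w.values()))
    return tail

# Illustrative irreducible aperiodic chain and its stationary distribution.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi = [0.25, 0.50, 0.25]
mu = [1.0, 0.0, 0.0]                      # X starts at state 0

tail = not_met_yet(mu, pi, P, 30)
gaps, dist = [], mu
for n in range(31):
    gaps.append(max(abs(p - q) for p, q in zip(dist, pi)))   # sup_x |P(X_n = x) - pi(x)|
    dist = step(dist, P)
```

Comparing `gaps[n]` with `tail[n]` entry by entry shows the bound at work; both sequences decrease to zero for this chain.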

Assume now that $P(T < \infty) = 1$. Then
$$\lim_{n\to\infty} P(T > n) = P(T = \infty) = 0.$$
Therefore,
$$\sup_{x\in S} |P(X_n = x) - P(Y_n = x)| \to 0,$$
as $n \to \infty$.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by $\pi$. Let $p_{ij}$ denote, as usual, the transition probabilities.

We shall consider the laws of two processes. The first process, denoted by $X = (X_n, n \ge 0)$, has the law of a Markov chain with transition probabilities $p_{ij}$ and $X_0$ distributed according to some arbitrary distribution $\mu$. The second process, denoted by $Y = (Y_n, n \ge 0)$, has the law of a Markov chain with transition probabilities $p_{ij}$ and $Y_0$ distributed according to the stationary distribution $\pi$. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen. It is very important to note that we have only defined the law of $X$ and the law of $Y$ but we have not defined a joint law; in other words, we have not defined joint probabilities such as $P(X_n = x, Y_m = y)$. We now couple them by doing the straightforward thing: we assume that $X = (X_n, n \ge 0)$ is independent of $Y = (Y_n, n \ge 0)$. Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
$$T := \inf\{n \ge 0 : X_n = Y_n\}.$$

Consider the process
$$W_n := (X_n, Y_n).$$
Then $(W_n, n \ge 0)$ is a Markov chain with state space $S \times S$. Its initial state $W_0 = (X_0, Y_0)$ has distribution $P(W_0 = (x, y)) = \mu(x)\pi(y)$. Its (1-step) transition probabilities are
$$q_{(x,y),(x',y')} := P\big(W_{n+1} = (x', y') \mid W_n = (x, y)\big) = P(X_{n+1} = x' \mid X_n = x)\, P(Y_{n+1} = y' \mid Y_n = y) = p_{x,x'}\, p_{y,y'}.$$
Its $n$-step transition probabilities are
$$q^{(n)}_{(x,y),(x',y')} = p^{(n)}_{x,x'}\, p^{(n)}_{y,y'}.$$
From Theorem 3 and the aperiodicity assumption, we have that $p^{(n)}_{x,x'} > 0$ and $p^{(n)}_{y,y'} > 0$ for all large $n$, implying that $q^{(n)}_{(x,y),(x',y')} > 0$ for all large $n$, and so $(W_n, n \ge 0)$ is an irreducible chain. Notice that
$$\sigma(x, y) := \pi(x)\pi(y), \quad (x, y) \in S \times S,$$
is a stationary distribution for $(W_n, n \ge 0)$. (Indeed, had we started with both $X_0$ and $Y_0$ independent and distributed according to $\pi$, then, for all $n \ge 1$, $X_n$ and $Y_n$ would be independent and distributed according to $\pi$.) By positive recurrence, $\pi(x) > 0$ for all $x \in S$.

For example. Hence (Zn .(0. Zn = Yn . p10 = 1.1) = q(1. suppose that S = {0. 0).1). after the meeting time. by assumption. (1. n ≥ 0) is positive recurrent.(1. Z0 = X0 which has an arbitrary distribution µ. for all n and x. 1)} and the transition probabilities are q(0.0). 1} and the pij are p01 = p10 = 1. In particular.(1. Then the product chain has state space S ×S = {(0. p01 = 0. Y ) may fail to be irreducible. ≥ 0) is a Markov chain with transition probabilities pij . y) > 0 for all (x. Clearly. Hence (Wn .Therefore σ(x.1). P (T < ∞) = 1. Deﬁne now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ).0) = q(0. By the coupling inequality. if we modify the original chain and make it aperiodic by. (1. In particular. (Zn . y) ∈ S × S.0). X arbitrary distribution Z stationary distribution Y T Thus. Since P (Zn = x) = P (Xn = x). (0. n ≥ 0) is identical in law to (Xn .99. Moreover. |P (Zn = x) − P (Yn = x)| ≤ P (T > n). However. as n → ∞. This is not irreducible.(0. and since P (Yn = x) = π(x). 0). say. Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X.1) = 1. 50 . precisely because P (T < ∞) = 1. then the process W = (X. Obviously. 1). n ≥ 0). p00 = 0. T is a meeting time between Z and Y . uniformly in x. which converges to zero. Zn equals Xn before the meeting time of X and Y .01.0) = q(1. P (Xn = x) = P (Zn = x). we have |P (Xn = x) − π(x)| ≤ P (T > n).

$p_{10} = 1$, then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: $p_{01} = p_{10} = 1$. Suppose $X_0 = 0$. Then $X_n = 0$ for even $n$ and $X_n = 1$ for odd $n$. Therefore $p_{00}^{(n)} = 1$ for even $n$ and $p_{00}^{(n)} = 0$ for odd $n$. Thus, $\lim_{n\to\infty} p_{00}^{(n)}$ does not exist. However, if we modify the chain by $p_{00} = 0.01$, $p_{01} = 0.99$, $p_{10} = 1$, then
$$\lim_{n\to\infty} p_{00}^{(n)} = \lim_{n\to\infty} p_{10}^{(n)} = \pi(0), \qquad \lim_{n\to\infty} p_{11}^{(n)} = \lim_{n\to\infty} p_{01}^{(n)} = \pi(1),$$
where $\pi(0), \pi(1)$ satisfy $0.99\,\pi(0) = \pi(1)$, $\pi(0) + \pi(1) = 1$, i.e. $\pi(0) \approx 0.503$, $\pi(1) \approx 0.497$. In matrix notation,
$$\lim_{n\to\infty} \begin{pmatrix} p_{00}^{(n)} & p_{01}^{(n)} \\ p_{10}^{(n)} & p_{11}^{(n)} \end{pmatrix} \approx \begin{pmatrix} 0.503 & 0.497 \\ 0.503 & 0.497 \end{pmatrix}.$$

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature. Indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong" (see footnote 6). The reason is that the term "ergodic" means something specific in Mathematics (cf. Parry, 1981), and this is expressed by our terminology.

6 We put "wrong" in quotes. Terminology cannot be, per se, wrong. It is, however, customary to have a consistent terminology.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. By the results of Section 14, we know we have a unique stationary distribution $\pi$ and that $\pi(x) > 0$ for all states $x$. Suppose that the period of the chain is $d$. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into $d$ classes $C_0, C_1, \ldots, C_{d-1}$, such that the chain moves cyclically between them: from $C_0$ to $C_1$, from $C_1$ to $C_2$, and so on, from $C_{d-1}$ back to $C_0$. It is not difficult to see that

Lemma 8. The chain $(X_{nd}, n \ge 0)$ consists of $d$ closed communicating classes, namely $C_0, \ldots, C_{d-1}$. All states have period 1 for this chain.

Hence if we restrict the chain $(X_{nd}, n \ge 0)$ to any of the sets $C_r$ we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply. We shall show that

Lemma 9.
$$\sum_{i \in C_r} \pi(i) = 1/d, \quad \text{for all } r = 0, 1, \ldots, d-1.$$

Proof. Let $\alpha_r := \sum_{i \in C_r} \pi(i)$. We have that $\pi$ satisfies the balance equations:
$$\pi(i) = \sum_{j \in S} \pi(j) p_{ji}, \quad i \in S.$$
Suppose $i \in C_r$. Then $p_{ji} = 0$ unless $j \in C_{r-1}$ (with the index $r - 1$ taken modulo $d$). So
$$\pi(i) = \sum_{j \in C_{r-1}} \pi(j) p_{ji}, \quad i \in C_r.$$
Summing up over $i \in C_r$ we obtain
$$\alpha_r = \sum_{i \in C_r} \pi(i) = \sum_{j \in C_{r-1}} \pi(j) \sum_{i \in C_r} p_{ji} = \sum_{j \in C_{r-1}} \pi(j) = \alpha_{r-1}.$$
Since the $\alpha_r$ add up to 1, each one must be equal to $1/d$.

Suppose that $X_0 \in C_r$. Then the (unique) stationary distribution of $(X_{nd}, n \ge 0)$ is $d\pi(i)$, $i \in C_r$. Since $(X_{nd}, n \ge 0)$ is aperiodic, we have
$$p_{ij}^{(nd)} = P_i(X_{nd} = j) \to d\pi(j), \quad \text{as } n \to \infty,$$
for all $i, j \in C_r$. Can we also find the limit when $i$ and $j$ do not belong to the same class? Suppose that $i \in C_k$, $j \in C_\ell$. Then, starting from $i$, the original chain enters $C_\ell$ in $r = \ell - k$ steps (where $r$ is taken to be a number between 0 and $d - 1$, after reduction by $d$) and, thereafter, if sampled every $d$ steps, it remains in $C_\ell$. This means that:

Theorem 18. Suppose that $p_{ij}$ are the transition probabilities of a positive recurrent irreducible chain with period $d$. Let $i \in C_k$, $j \in C_\ell$, and let $r = \ell - k$ (modulo $d$). Then
$$p_{ij}^{(r+nd)} \to d\pi(j), \quad \text{as } n \to \infty,$$
where $\pi$ is the stationary distribution.
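The limit in Theorem 18 can be checked numerically. The sketch below is only an illustration: the four-state chain `P` with period $d = 2$ (classes $C_0 = \{0, 1\}$, $C_1 = \{2, 3\}$) is a hypothetical example that does not come from the notes, and the helper names are ours. The stationary distribution is approximated through the lazy chain $(I + P)/2$, which is aperiodic and has the same stationary distribution as $P$.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, n):
    """Return the n-step transition matrix P^n (n >= 0) by repeated squaring."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in p]
    while n > 0:
        if n % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n //= 2
    return result


def stationary(p, steps=4096):
    """Approximate pi by a high power of the lazy chain (I + P)/2."""
    size = len(p)
    lazy = [[0.5 * p[i][j] + (0.5 if i == j else 0.0) for j in range(size)]
            for i in range(size)]
    return mat_pow(lazy, steps)[0]


# Hypothetical chain with period d = 2: C0 = {0, 1}, C1 = {2, 3}.
P = [
    [0.0, 0.0, 0.3, 0.7],
    [0.0, 0.0, 0.6, 0.4],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
]
d = 2
pi = stationary(P)
even = mat_pow(P, 200)  # r = 0: i and j in the same class
odd = mat_pow(P, 201)   # r = 1: j in the class that follows the class of i

for j in (0, 1):
    print(f"p_0{j}^(200) = {even[0][j]:.6f}   d*pi({j}) = {d * pi[j]:.6f}")
for j in (2, 3):
    print(f"p_0{j}^(201) = {odd[0][j]:.6f}   d*pi({j}) = {d * pi[j]:.6f}")
```

Along the other residue class the probabilities vanish (for instance `even[0][2]` is zero), which is why the limit exists only along the subsequence $r + nd$.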

. k = 1. the sequence n3 of the third row is a subsequence k k of n2 of the second row. Consider.). where = 1 and pi ≥ 0 for all i. the ﬁrst of which is known as HellyBray Lemma: Lemma 10. Hence pi k i. The proof will be based on two results from Analysis. For each i we have found subsequence ni . . . say. p1 p1 (n1 ) 1 p1 (n1 ) 2 (n2 ) 2 p1 (n1 ) 3 (n2 ) 3 (n3 ) 3 (n2 ) 1 (n3 ) 1 ··· ··· ··· ··· → → → ··· p1 p2 p3 ··· p1 p1 ··· p1 p1 ··· (n3 ) 2 p1 ··· Now pick the numbers in the diagonal. p(n) = (n) (n) (n) (n) (n) ∞ p1 . . Since the numbers p1 n1 1 n1 2 (n) (n1 ) < < n1 < · · · . k = 1. such that limk→∞ pi k exists and equals. So the diagonal subsequence (nk ... and so on. . are contained between 0 and 1. a probability distribution on N. there must be a subsequence exists and equals.19 Limiting theory for Markov chains* We wish to complete our discussion about the possible limits of the n-step transition prob(n) abilities pij . . exists for all i ∈ N. p2 . for each n = 1. . for each The second result is Fatou’s lemma. . the sequence n2 of the second row k is a subsequence of n1 of the ﬁrst row. i. 2. whose proof we omit7 : 7 See Br´maud (1999). 2. p3 .e. as n → ∞. 19. Since they are (n2 ) subsequence n2 of n1 such that limk→∞ p1 k k k (nk ) (n) as n → ∞. . It is clear how we (ni ) k continue. . then pij → 0 as n → ∞. By construction. for all i ∈ S. . 2. Recall that we know what happens when j is positive recurrent and i is in the same commu(n) nicating class as j If the period is 1. (nk ) k → pi . pi . say. k = 1. Then we can ﬁnd a i=1 pi sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi Proof. for all i ∈ S. then pij → π(j) (Theorems 17). k = 1. . Now consider the contained between 0 and 1 there must be a exists and equals. 2. p2 . We have. . .) is a k k subsequence of each (ni . for each i. if j is transient. . as k → ∞. p1 . If j is a null recurrent state then pij → 0. schematically. . . . e 53 . 2. We also saw in Corollary (n) 7 that. 
.1 Limiting probabilities for null recurrent states Theorem 19. If the period is d then the limit exists only along a selected subsequence (Theorem 18). such that limk→∞ p1 k 3 (n1 ) numbers p2 k . say.

Lemma 11. If $(y_i, i \in \mathbb{N})$, $(x_i^{(n)}, i \in \mathbb{N})$, $n = 1, 2, \ldots$, are sequences of nonnegative numbers then
$$\sum_{i=1}^{\infty} y_i \liminf_{n\to\infty} x_i^{(n)} \le \liminf_{n\to\infty} \sum_{i=1}^{\infty} y_i x_i^{(n)}.$$

Proof of Theorem 19. Let $j$ be a null recurrent state. Suppose that for some $i$, the sequence $p_{ij}^{(n)}$ does not converge to 0. Hence we can find a subsequence $n_k$ such that
$$\lim_{k\to\infty} p_{ij}^{(n_k)} = \alpha_j > 0.$$
Consider the probability vector $(p_{ix}^{(n_k)}, x \in S)$. By Lemma 10, we can pick a subsequence $(n'_k)$ of $(n_k)$ such that
$$\lim_{k\to\infty} p_{ix}^{(n'_k)} = \alpha_x, \quad \text{for all } x \in S.$$
Since $\alpha_j > 0$, we have
$$0 < \sum_{x \in S} \alpha_x \le \liminf_{k\to\infty} \sum_{x \in S} p_{ix}^{(n'_k)} = 1.$$
The inequality follows from Lemma 11, and the last equality is obvious since we have a probability vector. But $P^{n+1} = P^n P$, i.e.
$$p_{ix}^{(n'_k+1)} = \sum_{y \in S} p_{iy}^{(n'_k)} p_{yx}.$$
Therefore, by Lemma 11,
$$\lim_{k\to\infty} p_{ix}^{(n'_k+1)} \ge \sum_{y \in S} \lim_{k\to\infty} p_{iy}^{(n'_k)} p_{yx}, \quad x \in S.$$
Hence
$$\alpha_x \ge \sum_{y \in S} \alpha_y p_{yx}, \quad x \in S.$$
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. it is a strict inequality:
$$\alpha_{x_0} > \sum_{y \in S} \alpha_y p_{y x_0}, \quad \text{for some } x_0 \in S.$$
This would imply that
$$\sum_{x \in S} \alpha_x > \sum_{x \in S} \sum_{y \in S} \alpha_y p_{yx} = \sum_{y \in S} \alpha_y \sum_{x \in S} p_{yx} = \sum_{y \in S} \alpha_y,$$
and this is a contradiction (the number $\sum_{x \in S} \alpha_x = \sum_{y \in S} \alpha_y$ cannot be strictly larger than itself). Thus,
$$\alpha_x = \sum_{y \in S} \alpha_y p_{yx}, \quad x \in S.$$
Since $\sum_{x \in S} \alpha_x \le 1$, we have that
$$\pi(x) := \frac{\alpha_x}{\sum_{y \in S} \alpha_y}$$
satisfies the balance equations and has $\sum_{x \in S} \pi(x) = 1$, i.e. $\pi$ is a stationary distribution. Now $\pi(j) > 0$ because $\alpha_j > 0$, by assumption. By Corollary 13, state $j$ is positive recurrent: Contradiction.
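For a concrete instance of Theorem 19, one may take the simple symmetric random walk on the integers, which is a standard example of a null recurrent chain. The sketch below (the function name is ours) uses the fact that the walk is back at 0 after $2n$ steps exactly when $n$ of the steps go up, so that $p_{00}^{(2n)} = \binom{2n}{n} 4^{-n}$, and compares this with the Stirling approximation $1/\sqrt{\pi n}$, where $\pi$ here denotes the number 3.14159..., not a stationary distribution.

```python
import math


def return_probability(n):
    """p_00^(2n) for the simple symmetric random walk on the integers."""
    return math.comb(2 * n, n) / 4 ** n


for n in (1, 10, 100, 1000, 10000):
    exact = return_probability(n)
    approx = 1 / math.sqrt(math.pi * n)  # Stirling approximation
    print(f"n = {n:6d}   p_00^(2n) = {exact:.6f}   1/sqrt(pi n) = {approx:.6f}")
```

The return probabilities decrease to 0, although slowly, in agreement with the theorem; the odd-step probabilities $p_{00}^{(2n+1)}$ are identically 0.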

19.2 The general case

It remains to see what the limit of $p_{ij}^{(n)}$ is, as $n \to \infty$, when $j$ is a positive recurrent state but $i$ and $j$ do not necessarily belong to the same communicating class. The result when the period of $j$ is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let $j$ be a positive recurrent state with period 1. Let $C$ be the communicating class of $j$ and let $\pi^C$ be the stationary distribution associated with $C$. Let $i$ be an arbitrary state. Let
$$H_C := \inf\{n \ge 0 : X_n \in C\}.$$
Let $\varphi_C(i) := P_i(H_C < \infty)$. Then
$$\lim_{n\to\infty} p_{ij}^{(n)} = \varphi_C(i)\,\pi^C(j).$$

Proof. If $i$ also belongs to $C$ then $\varphi_C(i) = 1$ and so the result follows from Theorem 17. In general, we have
$$p_{ij}^{(n)} = P_i(X_n = j) = P_i(X_n = j, H_C \le n) = \sum_{m=1}^{n} P_i(H_C = m)\, P_i(X_n = j \mid H_C = m).$$
Let $\mu$ be the distribution of $X_{H_C}$ given that $X_0 = i$ and that $\{H_C < \infty\}$. From the Strong Markov Property, we have $P_i(X_n = j \mid H_C = m) = P_\mu(X_{n-m} = j)$. But Theorem 17 tells us that the limit exists regardless of the initial distribution, that is, $P_\mu(X_{n-m} = j) \to \pi^C(j)$, as $n \to \infty$. Hence
$$\lim_{n\to\infty} p_{ij}^{(n)} = \pi^C(j) \sum_{m=1}^{\infty} P_i(H_C = m) = \pi^C(j)\, P_i(H_C < \infty),$$
which is what we need.

Remark: We can write the result of Theorem 20 as
$$\lim_{n\to\infty} p_{ij}^{(n)} = \frac{f_{ij}}{E_j T_j},$$
where $T_j = \inf\{n \ge 1 : X_n = j\}$, and $f_{ij} = P_i(T_j < \infty)$, as in §10.2. Indeed, $\pi^C(j) = 1/E_j T_j$, and $f_{ij} = \varphi_C(i)$. In this form, the result can be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain ($0 < \alpha, \beta, \gamma < 1$):

[Figure: transition diagram on the states 1, 2, 3, 4, 5, with edge labels $\gamma$, $1-\gamma$, $1$, $1-\alpha$, $\alpha$, $\beta$, $1-\beta$, $1$.]

The communicating classes are $I = \{3, 4\}$ (inessential states), $C_1 = \{1, 2\}$ (closed class), $C_2 = \{5\}$ (absorbing state). The stationary distributions associated with the two closed classes are:
$$\pi^{C_1}(1) = \frac{1}{1+\gamma}, \quad \pi^{C_1}(2) = \frac{\gamma}{1+\gamma}, \quad \pi^{C_2}(5) = 1.$$
Let $\varphi(i)$ be the probability that class $C_1$ will be reached starting from state $i$. We have $\varphi(1) = \varphi(2) = 1$, $\varphi(5) = 0$, while, from first-step analysis,
$$\varphi(3) = (1-\alpha)\varphi(2) + \alpha\varphi(4), \qquad \varphi(4) = (1-\beta)\varphi(5) + \beta\varphi(3),$$
whence
$$\varphi(3) = \frac{1-\alpha}{1-\alpha\beta}, \qquad \varphi(4) = \frac{\beta(1-\alpha)}{1-\alpha\beta}.$$
Both $C_1$, $C_2$ are aperiodic classes. Hence, as $n \to \infty$,
$$p_{31}^{(n)} \to \varphi(3)\pi^{C_1}(1), \quad p_{32}^{(n)} \to \varphi(3)\pi^{C_1}(2), \quad p_{33}^{(n)} \to 0, \quad p_{34}^{(n)} \to 0, \quad p_{35}^{(n)} \to (1-\varphi(3))\pi^{C_2}(5).$$
Similarly,
$$p_{41}^{(n)} \to \varphi(4)\pi^{C_1}(1), \quad p_{42}^{(n)} \to \varphi(4)\pi^{C_1}(2), \quad p_{43}^{(n)} \to 0, \quad p_{44}^{(n)} \to 0, \quad p_{45}^{(n)} \to (1-\varphi(4))\pi^{C_2}(5).$$

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. The phase (or state) space of a particle is not just a physical space but it includes, for example, information about its velocity. Pioneers in the area were, among others, the 19th century French mathematicians Henri Poincaré and Joseph Liouville. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. We will avoid giving a precise definition of the term 'ergodic' and only mention what constitutes a result in the area. For example, Theorem 14 is a type of ergodic theorem. And so is the Law of Large Numbers itself (Theorem 13).

Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this.

Theorem 21. Suppose that a Markov chain possesses a recurrent state $a \in S$ and starts from an initial state $X_0$ such that $P(T_a < \infty) = 1$. Consider the quantity $\nu^{[a]}(x)$ defined as the average number of visits to state $x$ between two successive visits to state $a$ (see 16). Let $f$ be a reward function such that
$$\sum_{x \in S} |f(x)|\,\nu^{[a]}(x) < \infty. \tag{20}$$

Then
$$P\left( \lim_{t\to\infty} \frac{\sum_{n=0}^{t} f(X_n)}{\sum_{n=0}^{t} 1(X_n = a)} = \bar f \right) = 1, \tag{21}$$
where $\bar f := \sum_{x \in S} f(x)\,\nu^{[a]}(x)$.

Proof. Note that the denominator $\sum_{n=0}^{t} 1(X_n = a)$ equals the number $N_t$ of visits to state $a$ up to time $t$. As in the proof of Theorem 14, we break the total reward as follows:
$$\sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t - 1} G_r + G_t^{\text{first}} + G_t^{\text{last}},$$
where $G_r = \sum_{T_a^{(r)} \le n < T_a^{(r+1)}} f(X_n)$, while $G_t^{\text{first}}$, $G_t^{\text{last}}$ are the total rewards over the first and last incomplete cycles, respectively. As in Theorem 14, we have
$$\lim_{t\to\infty} \frac{1}{t} G_t^{\text{last}} = \lim_{t\to\infty} \frac{1}{t} G_t^{\text{first}} = 0,$$
with probability one. The Law of Large Numbers tells us that
$$\lim_{n\to\infty} \frac{1}{n} \sum_{r=0}^{n-1} G_r = E G_1,$$
with probability 1 as long as $E|G_1| < \infty$. But this is precisely the condition (20). Replacing $n$ by the subsequence $N_t$, we obtain the result. We only need to check that $E G_1 = \bar f$, and this is left as an exercise.

Remark 1: The difference between Theorem 14 and 21 is that, in the former, we assumed positive recurrence of the ground state $a$. If so, then, as we saw in the proof of Theorem 14, we have $N_t/t \to 1/E_a T_a$. So if we divide by $t$ the numerator and denominator of the fraction in (21), we have that the denominator equals $N_t/t$ and converges to $1/E_a T_a$, so the numerator converges to $\bar f / E_a T_a$, which is the quantity denoted by $f$ in Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem (Theorem 14) holds, and this justifies the terminology of §18.5. The reader is asked to revisit the proof of Theorem 14.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Then at least one of the states must be visited infinitely often. So there are always recurrent states. From Proposition 2 we know that the balance equations $\nu = \nu P$ have at least one solution. Now, since the number of states is finite, we have
$$C := \sum_{i \in S} \nu(i) < \infty.$$

Hence
$$\pi(i) := \frac{1}{C}\,\nu(i), \quad i \in S,$$
satisfies the balance equations and the normalisation condition $\sum_{i \in S} \pi(i) = 1$. Therefore $\pi$ is a stationary distribution. At least some state $i$ must have $\pi(i) > 0$. By Corollary 13, this state must be positive recurrent. Hence, a finite Markov chain always has positive recurrent states. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15). In addition, by Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution $\pi$. If, in addition, the chain is aperiodic, we have
$$P^n \to \Pi, \quad \text{as } n \to \infty,$$
where $P$ is the transition matrix and $\Pi$ is a matrix all the rows of which are equal to $\pi$.

22 Time reversibility

Consider a Markov chain $(X_n, n \ge 0)$. We say that it is time-reversible, or, simply, reversible if it has the same distribution when the arrow of time is reversed. More precisely, if $(X_0, X_1, \ldots, X_n)$ has the same distribution as $(X_n, X_{n-1}, \ldots, X_0)$ for all $n$. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. It is, in a sense, like playing a film backwards, without being able to tell that it is, actually, running backwards. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that $(X_0, X_1, \ldots, X_n)$ has the same distribution as $(X_n, X_{n-1}, \ldots, X_0)$ for all $n$. In particular, the first components of the two vectors must have the same distribution, that is, $X_n$ has the same distribution as $X_0$ for all $n$. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities $p_{ij}$ and stationary distribution $\pi$. Then the chain is reversible if and only if the detailed balance equations hold:
$$\pi(i) p_{ij} = \pi(j) p_{ji}, \quad i, j \in S.$$

Proof. Suppose first that the chain is reversible. Taking $n = 1$ in the definition of reversibility, we see that $(X_0, X_1)$ has the same distribution as $(X_1, X_0)$, i.e.
$$P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j),$$

for all $i$ and $j$. But
$$P(X_0 = i, X_1 = j) = \pi(i) p_{ij}, \qquad P(X_0 = j, X_1 = i) = \pi(j) p_{ji}.$$
So the detailed balance equations (DBE) hold. Conversely, suppose the DBE hold. The DBE imply immediately that $(X_0, X_1)$ has the same distribution as $(X_1, X_0)$. Now pick a state $k$ and write, using the DBE,
$$\pi(i) p_{ij} p_{jk} = [\pi(j) p_{ji}] p_{jk} = [\pi(j) p_{jk}] p_{ji} = \pi(k) p_{kj} p_{ji},$$
and this means that
$$P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i),$$
for all $i, j, k$, and hence $(X_0, X_1, X_2)$ has the same distribution as $(X_2, X_1, X_0)$. By induction, it is easy to show that, for all $n$, $(X_0, X_1, \ldots, X_n)$ has the same distribution as $(X_n, X_{n-1}, \ldots, X_0)$, and the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities $p_{ij}$ and stationary distribution $\pi$. Then the chain is reversible if and only if Kolmogorov's loop criterion holds: for all $i_1, i_2, \ldots, i_m \in S$ and all $m \ge 3$,
$$p_{i_1 i_2}\, p_{i_2 i_3} \cdots p_{i_{m-1} i_m}\, p_{i_m i_1} = p_{i_1 i_m}\, p_{i_m i_{m-1}} \cdots p_{i_2 i_1}. \tag{22}$$

Proof. Suppose first that the chain is reversible. Hence the DBE hold. We will show Kolmogorov's loop criterion (KLC). Pick 3 states $i, j, k$, and write, using the DBE,
$$\pi(i) p_{ij} p_{jk} p_{ki} = [\pi(j) p_{ji}] p_{jk} p_{ki} = [\pi(j) p_{jk}] p_{ji} p_{ki} = [\pi(k) p_{kj}] p_{ji} p_{ki} = [\pi(k) p_{ki}] p_{ji} p_{kj} = [\pi(i) p_{ik}] p_{ji} p_{kj} = \pi(i) p_{ik} p_{kj} p_{ji}.$$
Since the chain is irreducible, we have $\pi(i) > 0$ for all $i$. Hence cancelling the $\pi(i)$ from the above display we obtain the KLC for $m = 3$. The same logic works for any $m$ and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Summing up over all choices of states $i_3, i_4, \ldots, i_m$ we obtain
$$p_{i_1 i_2}\, p^{(m-1)}_{i_2 i_1} = p^{(m-1)}_{i_1 i_2}\, p_{i_2 i_1}.$$
If the chain is aperiodic, then, letting $m \to \infty$, we have
$$p^{(m-1)}_{i_2 i_1} \to \pi(i_1), \qquad p^{(m-1)}_{i_1 i_2} \to \pi(i_2),$$
implying that
$$\pi(i_1) p_{i_1 i_2} = \pi(i_2) p_{i_2 i_1}.$$
These are the DBE. So the chain is reversible.

Notes:
1) A necessary condition for reversibility is that $p_{ij} > 0$ if and only if $p_{ji} > 0$. In other words, if there is an arrow from $i$ to $j$ in the graph of the chain but no arrow from $j$ to $i$, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio $p_{ij}/p_{ji}$ can be written as a product of a function of $i$ and a function of $j$, i.e. we have separation of variables:
$$\frac{p_{ij}}{p_{ji}} = f(i)\,g(j).$$

(Convention: we form this ratio only when $p_{ij} > 0$.) Necessarily, $f(i)g(i)$ must be 1 (why?), so we have
$$\frac{p_{ij}}{p_{ji}} = \frac{g(j)}{g(i)}.$$
Multiplying by $g(i)\,p_{ji}$ and summing over $j$ we find
$$g(i) = \sum_{j \in S} g(j) p_{ji}.$$
So $g$ satisfies the balance equations. This gives the following method for showing that a finite irreducible chain is reversible:
• Form the ratio $p_{ij}/p_{ji}$ and show that the variables separate.
• If the chain has infinitely many states, then we need, in addition, to guarantee, using another method, that the chain is positive recurrent.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if $p_{ij} > 0$ then $p_{ji} > 0$. Form the graph by deleting all self-loops, and by replacing, for each pair of distinct states $i, j$ for which $p_{ij} > 0$, the two oriented edges from $i$ to $j$ and from $j$ to $i$ by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible. For example, consider the chain:

[Figure: transition diagram of the chain.]

Replace it by:

[Figure: the corresponding undirected graph.]

This is a tree. Hence the original chain is reversible. The reason is as follows. Pick two distinct states $i, j$ for which $p_{ij} > 0$. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop, and, by assumption, there isn't any.) Let $A$, $A^c$ be the sets in the two connected components. There is obviously only one arrow from $A$ to $A^c$, namely the arrow from $i$ to $j$; and the arrow from $j$ to $i$ is the only one from $A^c$ to $A$. The balance equations are written in the form $F(A, A^c) = F(A^c, A)$ (see §4.1), which gives
$$\pi(i) p_{ij} = \pi(j) p_{ji},$$
i.e. the detailed balance equations. Hence the chain is reversible. And hence $\pi$ can be found very easily.

Example 2: Random walk on a graph. Consider an undirected graph $G$ with vertices $S$, a finite set, and a set $E$ of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. Suppose the graph is connected. States $i$ and $j$ are called neighbours if they are connected by an edge. For each state $i$ define its degree
$$\delta(i) = \text{number of neighbours of } i.$$
Now define the following transition probabilities:
$$p_{ij} = \begin{cases} 1/\delta(i), & \text{if } j \text{ is a neighbour of } i, \\ 0, & \text{otherwise.} \end{cases}$$
The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example,

[Figure: an undirected graph on the vertices 0, 1, 2, 3, 4.]

Here we have $p_{12} = p_{10} = p_{14} = 1/3$, $p_{02} = 1/4$, etc.

The stationary distribution of a RWG is very easy to find. We claim that
$$\pi(i) = C\delta(i),$$
where $C$ is a constant. To see this let us verify that the detailed balance equations hold:
$$\pi(i) p_{ij} = \pi(j) p_{ji}.$$
There are two cases: If $i, j$ are not neighbours then both sides are 0. If $i, j$ are neighbours then $\pi(i) p_{ij} = C\delta(i)\frac{1}{\delta(i)} = C$ and $\pi(j) p_{ji} = C\delta(j)\frac{1}{\delta(j)} = C$, so both members are equal to $C$. In all cases, the DBE hold. The constant $C$ is found by normalisation: $\sum_{i \in S} \pi(i) = 1$. Since
$$\sum_{i \in S} \delta(i) = 2|E|,$$
where $|E|$ is the total number of edges, we have that $C = 1/(2|E|)$. Hence we conclude that:
1. A RWG is a reversible chain.
2. The stationary distribution assigns to a state probability which is proportional to its degree.
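The two conclusions can be verified mechanically on a small graph. In the sketch below the edge list `EDGES` is a hypothetical choice, picked only so that vertex 1 has the neighbours 0, 2, 4 and vertex 0 has degree 4, as in the transition probabilities quoted above; the actual graph of the figure may differ. Exact rational arithmetic is used so that the equalities can be tested without rounding.

```python
from fractions import Fraction

# Hypothetical edge list (an assumption, not taken from the figure).
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 4)]


def neighbours(edges):
    """Map each vertex to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def transition(adj, i, j):
    """p_ij = 1/delta(i) if j is a neighbour of i, and 0 otherwise."""
    return Fraction(1, len(adj[i])) if j in adj[i] else Fraction(0)


def stationary(adj, edges):
    """pi(i) = delta(i) / (2|E|)."""
    return {i: Fraction(len(adj[i]), 2 * len(edges)) for i in adj}


adj = neighbours(EDGES)
pi = stationary(adj, EDGES)

# Detailed balance: pi(i) p_ij = pi(j) p_ji for all pairs of states.
detailed_balance = all(
    pi[i] * transition(adj, i, j) == pi[j] * transition(adj, j, i)
    for i in adj for j in adj
)
# Balance equations: pi(j) = sum_i pi(i) p_ij for all states j.
balance = all(
    sum(pi[i] * transition(adj, i, j) for i in adj) == pi[j] for j in adj
)
print(dict(sorted(pi.items())), detailed_balance, balance)
```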

23 New Markov chains from old ones*

We now deal with the question: if we have a Markov chain $(X_n)$, how can we create new ones?

23.1 Subsequences

The most obvious way to create a new Markov chain is by picking a subsequence, i.e. a sequence of integers $(n_k)_{k \ge 0}$ such that
$$0 \le n_0 < n_1 < n_2 < \cdots$$
and by letting
$$Y_k := X_{n_k}, \quad k = 0, 1, 2, \ldots$$
The sequence $(Y_k, k = 0, 1, 2, \ldots)$ has the Markov property but it is not time-homogeneous, in general. If the subsequence forms an arithmetic progression, i.e. if
$$n_k = n_0 + km, \quad k = 0, 1, 2, \ldots,$$
then $(Y_k, k = 0, 1, 2, \ldots)$ has time-homogeneous transition probabilities:
$$P(Y_{k+1} = j \mid Y_k = i) = p_{ij}^{(m)},$$
where $p_{ij}^{(m)}$ are the $m$-step transition probabilities of the original chain. In this case, if, in addition, $\pi$ is a stationary distribution for the original chain, then it is a stationary distribution for the new chain.

23.2 Watching a Markov chain when it visits a set

Suppose that the original chain is ergodic (i.e. irreducible and positive recurrent). Let $A$ be a set of states and let $T_A^{(r)}$ be the time of the $r$-th visit to $A$. Define
$$Y_r := X_{T_A^{(r)}}, \quad r = 1, 2, \ldots$$
Due to the Strong Markov Property, this process is also a Markov chain and it does have time-homogeneous transition probabilities. The state space of $(Y_r)$ is $A$. If $\pi$ is the stationary distribution of the original chain then $\pi/\pi(A)$ is the stationary distribution of $(Y_r)$.

23.3 Subordination

Let $(X_n, n \ge 0)$ be Markov with transition probabilities $p_{ij}$. Let $(S_t, t \ge 0)$ be independent from $(X_n)$ with
$$S_0 = 0, \quad S_{t+1} = S_t + \xi_t, \quad t \ge 0,$$
where $\xi_t$ are i.i.d. random variables with values in $\mathbb{N}$ and distribution
$$\lambda(n) = P(\xi_0 = n), \quad n \in \mathbb{N}.$$

Then
$$Y_t := X_{S_t}, \quad t \ge 0,$$
is a Markov chain with
$$P(Y_{t+1} = j \mid Y_t = i) = \sum_{n=1}^{\infty} \lambda(n)\, p_{ij}^{(n)}.$$

23.4 Deterministic function of a Markov chain

If $(X_n)$ is a Markov chain with state space $S$ and if $f : S \to S'$ is some function from $S$ to some countable set $S'$, then the process
$$Y_n = f(X_n), \quad n \ge 0,$$
is NOT, in general, a Markov chain. If, however, $f$ is a one-to-one function then $(Y_n)$ is a Markov chain. Indeed, knowledge of $f(X_n)$ implies knowledge of $X_n$ and so, conditional on $Y_n$, the past is independent of the future.

(Notation: We will denote the elements of $S$ by $i, j, \ldots$, and the elements of $S'$ by $\alpha, \beta, \ldots$.)

More generally, we have:

Lemma 13. Suppose that $P(Y_{n+1} = \beta \mid X_n = i)$ depends on $i$ only through $f(i)$, i.e. that there is a function $Q : S' \times S' \to \mathbb{R}$ such that
$$P(Y_{n+1} = \beta \mid X_n = i) = Q(f(i), \beta).$$
Then $(Y_n)$ has the Markov property.

Proof. Fix $\alpha_0, \alpha_1, \ldots, \alpha_{n-1} \in S'$ and let $\mathcal{Y}_{n-1}$ be the event
$$\mathcal{Y}_{n-1} := \{Y_0 = \alpha_0, \ldots, Y_{n-1} = \alpha_{n-1}\}.$$
We need to show that
$$P(Y_{n+1} = \beta \mid Y_n = \alpha, \mathcal{Y}_{n-1}) = P(Y_{n+1} = \beta \mid Y_n = \alpha),$$

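for all choices of the variables involved. But
$$\begin{aligned}
P(Y_{n+1} = \beta \mid Y_n = \alpha, \mathcal{Y}_{n-1})
&= \sum_{i \in S} P(Y_{n+1} = \beta \mid X_n = i, Y_n = \alpha, \mathcal{Y}_{n-1})\, P(X_n = i \mid Y_n = \alpha, \mathcal{Y}_{n-1}) \\
&= \sum_{i \in S} Q(f(i), \beta)\, P(X_n = i \mid Y_n = \alpha, \mathcal{Y}_{n-1}) \\
&= \sum_{i \in S} Q(f(i), \beta)\, \frac{P(Y_n = \alpha, X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= \sum_{i \in S} Q(f(i), \beta)\, \frac{P(Y_n = \alpha \mid X_n = i, \mathcal{Y}_{n-1})\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= \sum_{i \in S} Q(f(i), \beta)\, \frac{1(\alpha = f(i))\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= Q(\alpha, \beta) \sum_{i \in S} \frac{P(Y_n = \alpha \mid X_n = i, \mathcal{Y}_{n-1})\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\
&= Q(\alpha, \beta)\, \frac{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} = Q(\alpha, \beta).
\end{aligned}$$
Thus $(Y_n)$ is Markov with transition probabilities $Q(\alpha, \beta)$.

Example: Consider the symmetric drunkard's random walk, i.e.
$$P(X_{n+1} = i \pm 1 \mid X_n = i) = 1/2, \quad i \in \mathbb{Z}.$$
Define $Y_n := |X_n|$. Then $(Y_n)$ is a Markov chain. To see this, observe that the function $|\cdot|$ is not one-to-one. But we will show that
$$P(Y_{n+1} = \beta \mid X_n = i), \quad i \in \mathbb{Z},\ \beta \in \mathbb{Z}_+,$$
depends on $i$ only through $|i|$. Indeed, conditional on $\{X_n = i\}$, and if $i \ne 0$, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, because the probability of going up is the same as the probability of going down, regardless of whether $i > 0$ or $i < 0$. On the other hand, if $i = 0$, then $|X_{n+1}| = 1$ for sure. In other words, if we let
$$Q(\alpha, \beta) := \begin{cases} 1/2, & \text{if } \beta = \alpha \pm 1,\ \alpha > 0, \\ 1, & \text{if } \beta = 1,\ \alpha = 0, \end{cases}$$
we have that
$$P(Y_{n+1} = \beta \mid X_n = i) = Q(|i|, \beta),$$
for all $i \in \mathbb{Z}$ and all $\beta \in \mathbb{Z}_+$, and so $(Y_n)$ is, indeed, a Markov chain with transition probabilities $Q(\alpha, \beta)$.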
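For a chain with finitely many states, the hypothesis of Lemma 13 can be tested by direct enumeration. The sketch below is a finite analogue of the example, under assumptions of our own: the walk lives on the cycle $\{0, \ldots, 5\}$ rather than on $\mathbb{Z}$, and the function `distance` (the distance from 0 around the cycle) plays the role of the absolute value. The helper `lumped_kernel` is a hypothetical name; it returns the function $Q$ as a nested dictionary when the condition holds and `None` otherwise.

```python
from fractions import Fraction

N = 6
STATES = range(N)


def distance(i):
    """Distance from state 0 around the cycle; a finite stand-in for |i|."""
    return min(i, N - i)


def walk(up):
    """Transition function of a walk on the cycle that moves +1 with
    probability `up` and -1 with probability 1 - up."""
    def p(i, j):
        if (j - i) % N == 1:
            return up
        if (i - j) % N == 1:
            return 1 - up
        return Fraction(0)
    return p


def lumped_kernel(states, p, f):
    """Return Q as {alpha: {beta: prob}} if P(f(X_{n+1}) = beta | X_n = i)
    depends on i only through f(i); return None otherwise."""
    rows = {}
    for i in states:
        row = {}
        for j in states:
            if p(i, j) != 0:
                row[f(j)] = row.get(f(j), 0) + p(i, j)
        alpha = f(i)
        if alpha in rows and rows[alpha] != row:
            return None  # two states with the same image disagree
        rows[alpha] = row
    return rows


symmetric = lumped_kernel(STATES, walk(Fraction(1, 2)), distance)
biased = lumped_kernel(STATES, walk(Fraction(2, 3)), distance)
print(symmetric)
print(biased)
```

For the symmetric walk the condition holds and the resulting $Q$ has the same shape as in the example: from 0 the image moves to 1 for sure, and from an interior value it moves up or down with probability 1/2. For the biased walk the condition of the lemma fails, since states 1 and 5 have the same image but different conditional laws. Failure of this sufficient condition does not by itself show that the image process is not Markov.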

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: At each point of time $n$ we have a number $X_n$ of individuals each of which gives birth to a random number of offspring, independently from one another, and following the same distribution. The parent dies immediately upon giving birth. We let $p_k$ be the probability that an individual will bear $k$ offspring, $k = 0, 1, 2, \ldots$. We find the population at time $n + 1$ by adding the number of offspring of each individual. Let $\xi_k^{(n)}$ be the number of offspring of the $k$-th individual in the $n$-th generation. Then the size of the $(n+1)$-st generation is:
$$X_{n+1} = \xi_1^{(n)} + \xi_2^{(n)} + \cdots + \xi_{X_n}^{(n)}.$$
If we let $\xi^{(n)} = (\xi_1^{(n)}, \xi_2^{(n)}, \ldots)$ then we see that the above equation is of the form $X_{n+1} = f(X_n, \xi^{(n)})$, and, since $\xi^{(0)}, \xi^{(1)}, \ldots$ are i.i.d. (and independent of the initial population size $X_0$, which is a further assumption), we have that $(X_n, n \ge 0)$ has the Markov property. It is a Markov chain with time-homogeneous transitions. Computing the transition probabilities $p_{ij} = P(X_{n+1} = j \mid X_n = i)$ is, in general, a hard problem. We will attempt neither to compute the $p_{ij}$ nor to draw the transition diagram, because both are futile. But here is an interesting special case.

Example: Suppose that $p_1 = p$, $p_0 = 1 - p$. Thus, each individual gives birth to one child with probability $p$ or has no children with probability $1 - p$. If we know that $X_n = i$ then $X_{n+1}$ is the number of successful births amongst $i$ individuals, which has a binomial distribution:
$$p_{ij} = \binom{i}{j} p^j (1-p)^{i-j}, \quad 0 \le j \le i.$$
In this case, the population cannot grow: it will eventually reach 0, with probability 1.

Back to the general case, one question of interest is whether the process will become extinct or not. If $p_1 = P(\xi = 1) = 1$ then $X_n = X_0$ for all $n$, and this is a trivial case. So assume that $p_1 \ne 1$. Then $P(\xi = 0) + P(\xi \ge 2) > 0$. We distinguish two cases:

Case I: If $p_0 = P(\xi = 0) = 0$ then the population can only grow. Hence all states $i \ge 1$ are transient. The state 0 is irrelevant because it cannot be reached from anywhere.

Case II: If $p_0 = P(\xi = 0) > 0$ then state 0 can be reached from any other state: indeed, if the current population is $i$ there is probability $p_0^i$ that everybody gives birth to zero offspring. But 0 is an absorbing state. Hence all states $i \ge 1$ are inessential, therefore transient. Therefore, with probability 1, if 0 is never reached then $X_n \to \infty$. In other words, if the

population does not become extinct then it must grow to infinity.

Let $D$ be the event of 'death' or extinction:
$$D := \{X_n = 0 \text{ for some } n \ge 0\}.$$
We wish to compute the extinction probability
$$\varepsilon(i) := P_i(D),$$
when we start with $X_0 = i$. Obviously, $\varepsilon(0) = 1$. For $i \ge 1$, we can argue that
$$\varepsilon(i) = \varepsilon^i,$$
where $\varepsilon = P_1(D)$. Indeed, if we start with $X_0 = i$ initial ancestors, each one behaves independently of one another. For each $k = 1, \ldots, i$, let $D_k$ be the event that the branching process corresponding to the $k$-th initial ancestor dies. We then have that $D = D_1 \cap \cdots \cap D_i$, and $P(D_k) = \varepsilon$, so $P(D \mid X_0 = i) = \varepsilon^i$. Therefore the problem reduces to the study of the branching process starting with $X_0 = 1$. The result is this:

Proposition 5. Consider a Galton-Watson process starting with $X_0 = 1$ initial ancestor. Let
$$\mu = E\xi = \sum_{k=1}^{\infty} k p_k$$
be the mean number of offspring of an individual. Let
$$\varphi(z) := E z^{\xi} = \sum_{k=0}^{\infty} z^k p_k$$
be the probability generating function of $\xi$. Let $\varepsilon$ be the extinction probability of the process. If $\mu \le 1$ then $\varepsilon = 1$. If $\mu > 1$ then $\varepsilon < 1$ is the unique solution of
$$\varphi(z) = z$$
that is smaller than 1.

Proof. We let
$$\varphi_n(z) := E z^{X_n} = \sum_{k=0}^{\infty} z^k P(X_n = k)$$
be the probability generating function of $X_n$. This function is of interest because it gives information about the extinction probability $\varepsilon$. Indeed,
$$\varphi_n(0) = P(X_n = 0),$$
and, since $X_n = 0$ implies $X_m = 0$ for all $m \ge n$, we have
$$\lim_{n\to\infty} \varphi_n(0) = \lim_{n\to\infty} P(X_m = 0 \text{ for all } m \ge n) = P(\text{there exists } n \text{ such that } X_m = 0 \text{ for all } m \ge n) = P(D) = \varepsilon.$$

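Note also that
$$\varphi_0(z) = z, \qquad \varphi_1(z) = \varphi(z).$$
We then have
$$\begin{aligned}
\varphi_{n+1}(z) = E z^{X_{n+1}} &= E\left[ z^{\xi_1^{(n)}} z^{\xi_2^{(n)}} \cdots z^{\xi_{X_n}^{(n)}} \right] \\
&= \sum_{i=0}^{\infty} E\left[ z^{\xi_1^{(n)}} z^{\xi_2^{(n)}} \cdots z^{\xi_{X_n}^{(n)}}\, 1(X_n = i) \right] \\
&= \sum_{i=0}^{\infty} E\left[ z^{\xi_1^{(n)}} z^{\xi_2^{(n)}} \cdots z^{\xi_i^{(n)}}\, 1(X_n = i) \right] \\
&= \sum_{i=0}^{\infty} E[z^{\xi_1^{(n)}}]\, E[z^{\xi_2^{(n)}}] \cdots E[z^{\xi_i^{(n)}}]\, E[1(X_n = i)] \\
&= \sum_{i=0}^{\infty} \varphi(z)^i\, P(X_n = i) = E\, \varphi(z)^{X_n} = \varphi_n(\varphi(z)).
\end{aligned}$$
Hence $\varphi_2(z) = \varphi_1(\varphi(z)) = \varphi(\varphi(z))$, $\varphi_3(z) = \varphi_2(\varphi(z)) = \varphi(\varphi(\varphi(z))) = \varphi(\varphi_2(z))$, and so on,
$$\varphi_{n+1}(z) = \varphi(\varphi_n(z)).$$
In particular,
$$\varphi_{n+1}(0) = \varphi(\varphi_n(0)).$$
But $\varphi_{n+1}(0) \to \varepsilon$ as $n \to \infty$. Also $\varphi_n(0) \to \varepsilon$, and, since $\varphi$ is a continuous function, $\varphi(\varphi_n(0)) \to \varphi(\varepsilon)$. Therefore
$$\varepsilon = \varphi(\varepsilon).$$
Notice now that
$$\mu = \left. \frac{d}{dz} \varphi(z) \right|_{z=1}.$$
Also, recall that $\varphi$ is a convex function with $\varphi(1) = 1$. Therefore, if the slope $\mu$ at $z = 1$ is $\le 1$, the graph of $\varphi(z)$ lies above the diagonal (the function $z$) and so the equation $z = \varphi(z)$ has no solutions other than $z = 1$. This means that $\varepsilon = 1$ (see left figure below). On the other hand, if $\mu > 1$, then the graph of $\varphi(z)$ does intersect the diagonal at some point $\varepsilon^* < 1$. Let us show that, in fact, $\varepsilon = \varepsilon^*$. Since both $\varepsilon$ and $\varepsilon^*$ satisfy $z = \varphi(z)$, and since the latter has only two solutions ($z = \varepsilon^*$ and $z = 1$), all we need to show is that $\varepsilon < 1$. But $0 \le \varepsilon^*$. Hence, since $\varphi$ is increasing, $\varphi(0) \le \varphi(\varepsilon^*) = \varepsilon^*$. By induction, $\varphi_n(0) \le \varepsilon^*$. But $\varphi_n(0) \to \varepsilon$. So $\varepsilon \le \varepsilon^* < 1$. Therefore $\varepsilon = \varepsilon^*$. (See right figure below.)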
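The recursion $\varphi_{n+1}(0) = \varphi(\varphi_n(0))$ also gives a simple way of approximating $\varepsilon$ numerically. The sketch below assumes an offspring distribution with finitely many values, given as a list `[p_0, p_1, ...]`; the two distributions used are hypothetical examples, and the function names are ours. For the first one, $\varphi(z) = \tfrac14 + \tfrac14 z + \tfrac12 z^2$, so that $\varphi(z) = z$ has the solutions $z = \tfrac12$ and $z = 1$.

```python
def pgf(offspring, z):
    """phi(z) = sum_k p_k z^k for a finite offspring distribution."""
    return sum(p * z ** k for k, p in enumerate(offspring))


def mean_offspring(offspring):
    """mu = sum_k k p_k."""
    return sum(k * p for k, p in enumerate(offspring))


def extinction_probability(offspring, iterations=500):
    """Iterate phi_{n+1}(0) = phi(phi_n(0)), starting from
    phi_0(0) = P(X_0 = 0) = 0 (one initial ancestor)."""
    z = 0.0
    for _ in range(iterations):
        z = pgf(offspring, z)
    return z


supercritical = [0.25, 0.25, 0.5]  # mu = 1.25 > 1
subcritical = [0.5, 0.25, 0.25]    # mu = 0.75 <= 1

for name, dist in (("supercritical", supercritical), ("subcritical", subcritical)):
    mu = mean_offspring(dist)
    eps = extinction_probability(dist)
    print(f"{name}: mu = {mu:.2f}, extinction probability ~ {eps:.6f}")
```

The iterates increase towards the smallest fixed point of $\varphi$ in $[0, 1]$, in line with the argument $\varphi_n(0) \le \varepsilon^*$ used in the proof. The fixed number of iterations is a crude stopping rule; convergence is slow when $\mu$ is close to 1.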

on the next times slot. are allocated to hosts 1. The problem of communication over a common channel. if Sn = 1. If s = 1 there is a successful transmission and. there is collision and the transmission is unsuccessful. in a simpliﬁed model.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). there are x + a − 1 packets. If s ≥ 2. we let hosts transmit whenever they like. there is no time to make a prior query whether a host has something to transmit or not. 3. Ethernet has replaced Aloha. there is a collision and. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Then ε < 1. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. then time slots 0. An the number of new packets on the same time slot.i. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. One obvious solution is to allocate time slots to hosts in a speciﬁc order. Nowadays. for a packet to go through. on the next times slot. We assume that the An are i. 3 hosts. Xn+1 = Xn + An . 3. Then ε = 1. 24. Case: µ > 1. only one of the hosts should be transmitting at a given time. 2. If collision happens.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. random variables with some common distribution: P (An = k) = λk . 2. . .d. 68 . . . assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). Thus. Since things happen fast in a computer network. and Sn the number of selected packets. So. then max(Xn + An − 1. using the same frequency. 5. 3. is that if more than one hosts attempt to transmit a packet. 6. otherwise. k ≥ 0. Abandoning this idea. 1. Let s be the number of selected packets. . . there are x + a packets. So if there are. 0). 1.. 2. 
During this time slot assume there are a new packets that need to be transmitted. 4. say. periodically. then the transmission is not successful and the transmissions must be attempted anew. 1.

where $\lambda_k \ge 0$ and $\sum_{k \ge 0} \lambda_k = 1$. We also let $\mu := \sum_{k \ge 1} k \lambda_k$ be the expectation of $A_n$. The reason we take the maximum with 0 in the above equation is because, if $X_n + A_n = 0$, we should not subtract 1. The random variable $S_n$ has, conditional on $X_n$ and $A_n$ (and all past states and arrivals), a distribution which is Binomial with parameters $X_n + A_n$ and $p$. So

$$P(S_n = k \mid X_n = x, A_n = a, X_{n-1}, A_{n-1}, \ldots) = \binom{x+a}{k} p^k q^{x+a-k}, \quad 0 \le k \le x + a,$$

where $q = 1 - p$. The process $(X_n, n \ge 0)$ is a Markov chain. Let us find its transition probabilities. To compute $p_{x,x-1}$ when $x > 0$, we think as follows: to have a reduction of $x$ by 1, we must have no new packets arriving, and exactly one selection:

$$p_{x,x-1} = P(A_n = 0, S_n = 1 \mid X_n = x) = \lambda_0 \, x p q^{x-1}.$$

To compute $p_{x,x+k}$ for $k \ge 1$, we think like this: to have an increase by $k$ we either need exactly $k$ new packets and no successful transmissions (i.e. $S_n$ should be 0 or $\ge 2$) or we need $k + 1$ new packets and one selection ($S_n = 1$). So:

$$p_{x,x+k} = \lambda_k \big(1 - (x+k) p q^{x+k-1}\big) + \lambda_{k+1} (x+k+1) p q^{x+k}, \quad k \ge 1.$$

The probabilities $p_{x,x}$ are computed by subtracting the sum of the above from 1. The main result is:

Proposition 6. The Aloha protocol is unstable, i.e. the Markov chain $(X_n)$ is transient.

Idea of proof. Proving this is beyond the scope of the lectures, but we mention that it is based on the so-called drift criterion which says that if the expected change $X_{n+1} - X_n$ conditional on $X_n = x$ is bounded below by a positive constant for all large $x$, then the chain is transient. We here restrict ourselves to the computation of this expected change, defined by

$$\Delta(x) := E[X_{n+1} - X_n \mid X_n = x].$$

For $x > 0$, we have

$$
\begin{aligned}
\Delta(x) &= (-1)\, p_{x,x-1} + \sum_{k \ge 1} k \, p_{x,x+k} \\
&= -\lambda_0 x p q^{x-1} + \sum_{k \ge 1} k \lambda_k - \sum_{k \ge 1} k \lambda_k (x+k) p q^{x+k-1} + \sum_{k \ge 1} k \lambda_{k+1} (x+k+1) p q^{x+k} \\
&= \mu - \lambda_0 x p q^{x-1} - \lambda_1 (x+1) p q^{x} - \sum_{k \ge 2} \lambda_k (x+k) p q^{x+k-1} \\
&= \mu - q^x G(x),
\end{aligned}
$$

where $G(x)$ is a function of the form $\alpha + \beta x$. Since $p > 0$, we have $q < 1$, and so $q^x$ tends to 0 much faster than $G(x)$ tends to $\infty$, so $q^x G(x)$ tends to 0 as $x \to \infty$. So we can make $q^x G(x)$ as small as we like by choosing $x$ sufficiently large, for example we can make it $\le \mu/2$. Therefore $\Delta(x) \ge \mu/2$ for all large $x$. Unless $\mu = 0$ (which is a trivial case: no arrivals!), we have that the drift is positive for all large $x$ and this proves transience.
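To make the drift computation concrete, the sketch below evaluates $\Delta(x)$ in two ways: from the closed form $\mu - \sum_{k \ge 0} \lambda_k (x+k) p q^{x+k-1}$, and directly from the transition probabilities $p_{x,x-1}$ and $p_{x,x+k}$. It is an illustration only; the arrival distribution `lam` (with finite support) and the value of `p` are assumptions, not values from the notes.

```python
def aloha_drift(x, lam, p):
    """Closed form: Delta(x) = mu - sum_k lam[k] * (x + k) * p * q**(x + k - 1), x > 0."""
    q = 1.0 - p
    mu = sum(k * l for k, l in enumerate(lam))
    one_selected = sum(l * (x + k) * p * q ** (x + k - 1) for k, l in enumerate(lam))
    return mu - one_selected


def aloha_drift_from_transitions(x, lam, p):
    """Same drift, computed as (-1) * p_{x,x-1} + sum_{k>=1} k * p_{x,x+k}."""
    q = 1.0 - p
    lam_ext = list(lam) + [0.0, 0.0]           # lambda_k = 0 beyond the support
    down = lam_ext[0] * x * p * q ** (x - 1)   # p_{x,x-1}
    up = 0.0
    for k in range(1, len(lam) + 1):
        p_up = (lam_ext[k] * (1 - (x + k) * p * q ** (x + k - 1))
                + lam_ext[k + 1] * (x + k + 1) * p * q ** (x + k))
        up += k * p_up
    return -down + up


# Hypothetical parameters: at most two arrivals per slot, selection probability 0.1.
lam = [0.7, 0.2, 0.1]
p = 0.1
for x in (1, 5, 20, 100):
    print(x, aloha_drift(x, lam, p), aloha_drift_from_transitions(x, lam, p))
```

With these assumed parameters $\mu = 0.4$, and the printed drift should approach $\mu$ as $x$ grows, which is the behaviour the drift criterion needs.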

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe.

The web graph is a graph whose set of vertices $V$ consists of web pages and whose edges represent hyperlinks: thus, if page $i$ has a link to page $j$ then we put an arrow from $i$ to $j$ and consider it as a directed edge of the graph. For each page $i$ let $L(i)$ be the set of pages it links to. If $L(i) = \emptyset$, then $i$ is a 'dangling page'. Define transition probabilities as follows:

$$
p_{ij} =
\begin{cases}
1/|L(i)|, & \text{if } j \in L(i), \\
1/|V|, & \text{if } L(i) = \emptyset, \\
0, & \text{otherwise.}
\end{cases}
$$

So if $X_n = i$ is the current location of the chain, the next location $X_{n+1}$ is picked uniformly at random amongst all pages that $i$ links to, unless $i$ is a dangling page; in the latter case, the chain moves to a random page.

It is not clear that the chain is irreducible, nor that it is aperiodic. So we make the following modification. We pick $\alpha$ between 0 and 1 (typically, $\alpha \approx 0.2$) and let

$$\tilde p_{ij} := (1 - \alpha)\, p_{ij} + \frac{\alpha}{|V|}.$$

These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored (and this happens with probability $\alpha$), in which case he clicks on a random page.

So this is how Google assigns ranks to pages: It uses an algorithm, PageRank, to compute the stationary distribution $\pi$ corresponding to the Markov chain with transition matrix $P = [\tilde p_{ij}]$:

$$\pi = \pi P.$$

Then it says: page $i$ has higher rank than $j$ if $\pi(i) > \pi(j)$.

The idea is simple. Its implementation though is one of the largest matrix computations in the world, because of the size of the problem. The algorithm used (PageRank) starts with an initial guess $\pi^{(0)}$ for $\pi$ and updates it by

$$\pi^{(k)} = \pi^{(k-1)} P.$$

It uses advanced techniques from linear algebra to speed up computations.

Comments:
1. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
2. In an Internet-dominated society, the rank of a page may be very important for a company's profits. Therefore, there are companies that sell high-rank pages. All you need to do to get high rank is pay (a lot of money).
3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their faculty from each other (and from themselves).
4. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality but, from the point of view of an administrator, it makes the evaluation procedure easy. What is important is that the evaluation can be done without ever looking at the content of one's research.
5. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology.

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. This means that the external appearance of a characteristic is a function of the pair of alleles. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa.

When two individuals mate, their offspring inherits an allele from each parent. Thus, two AA individuals will produce an AA child. But an AA mating with an Aa may produce a child of type AA or one of type Aa.

Imagine that we have a population of $N$ individuals who mate at random producing a number of offspring. We will assume that a child chooses an allele from each parent at random. We will also assume that the parents are replaced by the children. The process repeats, generation after generation. We would like to study the genetic diversity, meaning how many characteristics we have. To simplify, we will assume that, in each generation, the genetic diversity depends only on the total number of alleles of each type. So let $x$ be the number of alleles of type A (so $2N - x$ is the number of those of type a). If so, then, at the next generation the number of alleles of type A can be found as follows: Pick an allele at random; this allele is of type A with probability $x/2N$; maintain this allele for the next generation. Do the same thing $2N$ times.

Since we don't care to keep track of the individuals' identities, all we need is the information outlined in this figure.

[Figure: In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice. Three of the square alleles are never selected. One of the square alleles is selected 4 times. And so on.]

Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele.

We are now ready to define the Markov chain $X_n$ representing the number of alleles of type A in generation $n$. We have a transition from $x$ to $y$ if $y$ alleles of type A were produced.

Each allele of type A is selected with probability $x/2N$. Since selections are independent, we see that

$$p_{xy} = \binom{2N}{y} \Big(\frac{x}{2N}\Big)^{y} \Big(1 - \frac{x}{2N}\Big)^{2N-y}, \quad 0 \le y \le 2N.$$

Notice that there is a chance ($= (1 - \frac{x}{2N})^{2N}$) that no allele of type A is selected at all, and the population becomes one consisting of alleles of type a only. Similarly, there is a chance ($= (\frac{x}{2N})^{2N}$) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, $p_{00} = p_{2N,2N} = 1$, so both 0 and $2N$ are absorbing states. Since $p_{xy} > 0$ for all $1 \le x, y \le 2N - 1$, the states $\{1, \ldots, 2N - 1\}$ form a communicating class of inessential states. Therefore

$$P(X_n = 0 \text{ or } X_n = 2N \text{ for all large } n) = 1,$$

i.e. the chain gets absorbed by either 0 or $2N$.

Let $M = 2N$. The fact that $M$ is even is irrelevant for the mathematical model. Another way to represent the chain is as follows:

$$X_{n+1} = \xi_1^n + \cdots + \xi_M^n,$$

where, conditionally on $(X_0, \ldots, X_n)$, the random variables $\xi_1^n, \ldots, \xi_M^n$ are i.i.d. with

$$P(\xi_k^n = 1 \mid X_n, \ldots, X_0) = \frac{X_n}{M}, \qquad P(\xi_k^n = 0 \mid X_n, \ldots, X_0) = 1 - \frac{X_n}{M}.$$

Therefore,

$$E(X_{n+1} \mid X_n, \ldots, X_0) = E(X_{n+1} \mid X_n) = M \, \frac{X_n}{M} = X_n.$$

This implies that, for all $n \ge 0$, $E X_{n+1} = E X_n$, i.e., on the average, the number of alleles of type A won't change. Clearly, we would like to know the probability

$$\varphi(x) = P_x(T_0 > T_M) = P(T_0 > T_M \mid X_0 = x)$$

that the chain will eventually be absorbed by state $M$. (Here, as usual, $T_x := \inf\{n \ge 1 : X_n = x\}$.) One can argue that, since, on the average, the expectation doesn't change and, since, at "the end of time" the chain (started from $x$) will be either $M$ with probability $\varphi(x)$ or 0 with probability $1 - \varphi(x)$, we should have

$$M \times \varphi(x) + 0 \times (1 - \varphi(x)) = x,$$

and thus, $\varphi(x) = x/M$. The result is correct, but the argument needs further justification because "the end of time" is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. These are:

$$\varphi(0) = 0, \qquad \varphi(M) = 1, \qquad \varphi(x) = \sum_{y=0}^{M} p_{xy}\, \varphi(y).$$

The first two are obvious. The right-hand side of the last equals

$$
\begin{aligned}
\sum_{y=0}^{M} \binom{M}{y} \Big(\frac{x}{M}\Big)^{y} \Big(1 - \frac{x}{M}\Big)^{M-y} \frac{y}{M}
&= \sum_{y=1}^{M} \frac{M!}{y!\,(M-y)!} \Big(\frac{x}{M}\Big)^{y} \Big(1 - \frac{x}{M}\Big)^{M-y} \frac{y}{M} \\
&= \sum_{y=1}^{M} \frac{(M-1)!}{(y-1)!\,(M-y)!} \Big(\frac{x}{M}\Big)^{y-1+1} \Big(1 - \frac{x}{M}\Big)^{M-y-1+1} \\
&= \frac{x}{M} \sum_{y=1}^{M} \binom{M-1}{y-1} \Big(\frac{x}{M}\Big)^{y-1} \Big(1 - \frac{x}{M}\Big)^{(M-1)-(y-1)} \\
&= \frac{x}{M} \Big(\frac{x}{M} + 1 - \frac{x}{M}\Big)^{M-1} = \frac{x}{M},
\end{aligned}
$$

as needed.

24.5 A storage or queueing application

A storage facility stores, say, $X_n$ electronic components at time $n$. A number $A_n$ of new components are ordered and added to the stock, while a number of them are sold to the market. If the demand is for $D_n$ components, the demand is met instantaneously, provided that enough components are available, and so, at time $n + 1$, there are $X_n + A_n - D_n$ components in the stock. Otherwise, the demand is only partially satisfied, and the stock reaches level zero at time $n + 1$. We can express these requirements by the recursion

$$X_{n+1} = (X_n + A_n - D_n) \vee 0,$$

where $a \vee b := \max(a, b)$. Here are our assumptions: The pairs of random variables $((A_n, D_n), n \ge 0)$ are i.i.d. and

$$E(A_n - D_n) = -\mu < 0.$$

Thus, the demand is, on the average, larger than the supply per unit of time. We have that $(X_n, n \ge 0)$ is a Markov chain. We will show that this Markov chain is positive recurrent. We will apply the so-called Loynes' method. Let

$$\xi_n := A_n - D_n,$$

so that $X_{n+1} = (X_n + \xi_n) \vee 0$, for all $n \ge 0$. Since the Markov chain is represented by this very simple recursion, we will try to solve the recursion. Working for $n = 1, 2, \ldots$, we find

$$
\begin{aligned}
X_1 &= (X_0 + \xi_0) \vee 0 \\
X_2 &= (X_1 + \xi_1) \vee 0 = (X_0 + \xi_0 + \xi_1) \vee \xi_1 \vee 0 \\
X_3 &= (X_2 + \xi_2) \vee 0 = (X_0 + \xi_0 + \xi_1 + \xi_2) \vee (\xi_1 + \xi_2) \vee \xi_2 \vee 0 \\
&\cdots \\
X_n &= (X_0 + \xi_0 + \cdots + \xi_{n-1}) \vee (\xi_1 + \cdots + \xi_{n-1}) \vee \cdots \vee \xi_{n-1} \vee 0.
\end{aligned}
$$

The correctness of the latter can formally be proved by showing that it does satisfy the recursion. Let us consider

$$g_{xy}^{(n)} := P_x(X_n > y) = P\big( (x + \xi_0 + \cdots + \xi_{n-1}) \vee (\xi_1 + \cdots + \xi_{n-1}) \vee \cdots \vee \xi_{n-1} \vee 0 > y \big).$$

Thus, the event whose probability we want to compute is a function of $(\xi_0, \ldots, \xi_{n-1})$. Notice, however, that

$$(\xi_0, \ldots, \xi_{n-1}) \stackrel{d}{=} (\xi_{n-1}, \ldots, \xi_0).$$

Indeed, the random variables $(\xi_n)$ are i.i.d. so we can shuffle them in any way we like without altering their joint distribution. In particular, instead of considering them in their original order, we reverse it. Therefore,

$$g_{xy}^{(n)} = P\big( (x + \xi_{n-1} + \cdots + \xi_0) \vee (\xi_{n-2} + \cdots + \xi_0) \vee \cdots \vee \xi_0 \vee 0 > y \big).$$

Let

$$Z_n := \xi_0 + \cdots + \xi_{n-1}, \qquad M_n := Z_n \vee Z_{n-1} \vee \cdots \vee Z_1 \vee 0.$$

Then

$$g_{0y}^{(n)} = P(M_n > y).$$

Since $M_{n+1} \ge M_n$ we have $g_{0y}^{(n)} \le g_{0y}^{(n+1)}$, for all $n$. The limit of a bounded and increasing sequence certainly exists:

$$g(y) := \lim_{n \to \infty} g_{0y}^{(n)} = \lim_{n \to \infty} P_0(X_n > y).$$

We claim that

$$\pi(y) := g(y-1) - g(y)$$

satisfies the balance equations. (The sign is fixed by the identity $P(M_n = y) = P(M_n > y-1) - P(M_n > y)$, so that $\pi(y) = \lim_{n \to \infty} P(M_n = y) \ge 0$.) But

$$M_{n+1} = 0 \vee \max_{1 \le j \le n+1} Z_j = 0 \vee \Big( \max_{1 \le j \le n+1} (Z_j - Z_1) + \xi_0 \Big) \stackrel{d}{=} 0 \vee (M_n + \xi_0),$$

where the latter equality follows from the fact that, in computing the distribution of $M_n$ which depends on $\xi_0, \ldots, \xi_{n-1}$, we may replace the latter random variables by $\xi_1, \ldots, \xi_n$. Hence, for $y \ge 0$,

$$P(M_{n+1} = y) = P(0 \vee (M_n + \xi_0) = y) = \sum_{x \ge 0} P(M_n = x)\, P(0 \vee (x + \xi_0) = y) = \sum_{x \ge 0} P(M_n = x)\, p_{xy}.$$

Since the $n$-dependent terms are bounded, we may take the limit of both sides as $n \to \infty$, and, using $\pi(y) = \lim_{n \to \infty} P(M_n = y)$, we find

$$\pi(y) = \sum_{x \ge 0} \pi(x)\, p_{xy}.$$

So $\pi$ satisfies the balance equations. We must make sure that it also satisfies $\sum_y \pi(y) = 1$. This will be the case if $g(y)$ is not identically equal to 1. Here, we will use the assumption that $E\xi_n = -\mu < 0$. From the Law of Large Numbers we have

$$P\big(\lim_{n \to \infty} Z_n / n = -\mu\big) = 1.$$

This implies that

$$P\big(\lim_{n \to \infty} Z_n = -\infty\big) = 1.$$

But then

$$P\big(\lim_{n \to \infty} M_n = 0 \vee \max\{Z_1, Z_2, \ldots\} < \infty\big) = 1.$$

Hence

$$g(y) = P\big(0 \vee \max\{Z_1, Z_2, \ldots\} > y\big)$$

is not identically equal to 1, as required.

It can also be shown (left as an exercise) that if $E\xi_n > 0$ then the Markov chain is transient. The case $E\xi_n = 0$ is delicate. Depending on the actual distribution of $\xi_n$, the Markov chain may be transient or null recurrent.
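The explicit solution of the storage recursion used in Loynes' method can be checked mechanically. The sketch below compares the step-by-step recursion $X_{n+1} = (X_n + \xi_n) \vee 0$ with the closed-form maximum of partial sums; the function names and the integer-valued test increments are assumptions made only for this illustration.

```python
import random


def stock_by_recursion(x0, xi):
    """Run X_{k+1} = max(X_k + xi_k, 0) along the increments xi_0, ..., xi_{n-1}."""
    x = x0
    for step in xi:
        x = max(x + step, 0)
    return x


def stock_by_formula(x0, xi):
    """X_n = (x0 + xi_0 + ... + xi_{n-1}) v (xi_1 + ... + xi_{n-1}) v ... v xi_{n-1} v 0."""
    n = len(xi)
    candidates = [0, x0 + sum(xi)]
    # Sums that start at index j >= 1 and run to the end: xi_j + ... + xi_{n-1}.
    candidates += [sum(xi[j:]) for j in range(1, n)]
    return max(candidates)


# Hypothetical increments with negative mean, as in the assumption E(xi_n) < 0.
rng = random.Random(1)
xi = [rng.randint(-3, 2) for _ in range(10)]
print(xi, stock_by_recursion(4, xi), stock_by_formula(4, xi))
```

The two functions should agree for every starting level $x_0 \ge 0$ and every finite sequence of increments, which is what "it does satisfy the recursion" amounts to.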

PART II: RANDOM WALKS

25 Random walk

The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Our Markov chains, so far, have been processes with time-homogeneous transition probabilities. A random walk possesses an additional property, namely, that of spatial homogeneity. This means that the transition probability $p_{xy}$ should depend on $x$ and $y$ only through their relative positions in space. To be able to say this, we need to give more structure to the state space $S$, a structure that enables us to say that

$$p_{x,y} = p_{x+z,\,y+z}$$

for any translation $z$. For example, we can let $S = \mathbb{Z}^d$, the set of $d$-dimensional vectors with integer coordinates. More general spaces (e.g. groups) are allowed, but, here, we won't go beyond $\mathbb{Z}^d$. So, given a function $p(x)$, $x \in \mathbb{Z}^d$, a random walk is a Markov chain with time- and space-homogeneous transition probabilities given by

$$p_{x,y} = p(y - x).$$

Example 1: Let

$$p(x) = C\, 2^{-|x_1| - \cdots - |x_d|},$$

where $C$ is such that $\sum_x p(x) = 1$.

Example 2: Let $e_j$ be the vector that has 1 in the $j$-th position and 0 everywhere else. Define

$$
p(x) =
\begin{cases}
p_j, & \text{if } x = e_j, \ j = 1, \ldots, d, \\
q_j, & \text{if } x = -e_j, \ j = 1, \ldots, d, \\
0, & \text{otherwise,}
\end{cases}
$$

where $p_1 + \cdots + p_d + q_1 + \cdots + q_d = 1$. A random walk of this type is called simple random walk or nearest neighbour random walk. If $d = 1$, then

$$p(1) = p, \quad p(-1) = q, \quad p(x) = 0 \text{ otherwise},$$

where $p + q = 1$. This is the drunkard's walk we've seen before. In dimension $d = 1$, this walk enjoys, besides time- and space-homogeneity, the skip-free property, namely that to go from state $x$ to state $y$ it must pass through all intermediate states because its value can change by at most 1 at each step.

Example 3: Define

$$
p(x) =
\begin{cases}
\dfrac{1}{2d}, & \text{if } x \in \{\pm e_1, \ldots, \pm e_d\}, \\
0, & \text{otherwise.}
\end{cases}
$$

This is a simple random walk, which, in addition, has equal probabilities to move from one state $x$ to one of its "neighbouring states" $x \pm e_1, \ldots, x \pm e_d$. We call it simple symmetric random walk. If $d = 1$, then

$$p(1) = p(-1) = 1/2, \quad p(x) = 0 \text{ otherwise},$$

and this is the symmetric drunkard's walk.

The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by $S_n$ the state of the walk at time $n$. It is clear that $S_n$ can be represented as

$$S_n = S_0 + \xi_1 + \cdots + \xi_n,$$

where $\xi_1, \xi_2, \ldots$ are i.i.d. random variables in $\mathbb{Z}^d$ with common distribution

$$P(\xi_n = x) = p(x), \quad x \in \mathbb{Z}^d.$$

We refer to $\xi_n$ as the increments of the random walk. Furthermore, the initial state $S_0$ is independent of the increments. Obviously,

$$p^{(n)}_{xy} = P_x(S_n = y) = P(x + \xi_1 + \cdots + \xi_n = y) = P(\xi_1 + \cdots + \xi_n = y - x).$$

The random walk $\xi_1 + \cdots + \xi_n$ is a random walk starting from 0. So, we can reduce several questions to questions about this random walk. Furthermore, notice that

$$P(\xi_1 + \xi_2 = x) = \sum_y P(\xi_1 = y,\ \xi_2 = x - y) = \sum_y p(y)\, p(x - y).$$

(When we write a sum without indicating explicitly the range of the summation variable, we will mean a sum over all possible values.) Now the operation in the last term is known as convolution. The convolution of two probabilities $p(x)$, $p'(x)$, $x \in \mathbb{Z}^d$, is defined as a new probability $p * p'$ given by

$$(p * p')(x) = \sum_y p(y)\, p'(x - y).$$

The reader can easily check that

$$p * p' = p' * p, \qquad p * (p' * p'') = (p * p') * p'', \qquad p * \delta_0 = p.$$

(Recall that $\delta_z$ is a probability distribution that assigns probability 1 to the point $z$.) In this notation, we have

$$P(\xi_1 + \xi_2 = x) = (p * p)(x).$$

We will let $p^{*n}$ be the convolution of $p$ with itself $n$ times. We easily see that

$$P(\xi_1 + \cdots + \xi_n = x) = p^{*n}(x).$$

One could stop here and say that we have a formula for the $n$-step transition probability. But this would be nonsense, because all we have done is that we dressed the original problem in more fancy clothes. Indeed, computing the $n$-th order convolution for all $n$ is a notoriously hard problem, in general. We shall resist the temptation to compute and look at other

methods. After all, who cares about the exact value of $p^{*n}(x)$ for all $n$ and $x$? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. Here are a couple of meaningful questions:
A. Will the random walk ever return to where it started from?
B. If yes, how long will that take?
C. If not, where will it go?

26 The simple symmetric random walk in dimension 1: path counting

This is a random walk with values in $\mathbb{Z}$ and transition probabilities $p_{i,i+1} = p_{i,i-1} = 1/2$ for all $i \in \mathbb{Z}$. When the walk starts from 0 we can write

$$S_n = \xi_1 + \cdots + \xi_n,$$

where the $\xi_n$ are i.i.d. random variables (random signs) with $P(\xi_n = \pm 1) = 1/2$. Look at the distribution of $(\xi_1, \ldots, \xi_n)$, a random vector with values in $\{-1, +1\}^n$. Clearly, all possible values of the vector are equally likely. Since there are $2^n$ elements in $\{-1, +1\}^n$, we have

$$P(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n) = 2^{-n}.$$

So, for fixed $n$, we are dealing with a uniform distribution on the sample space $\{-1, +1\}^n$. Events $A$ that depend only on the first $n$ random signs are subsets of this sample space, $A \subset \{-1, +1\}^n$, and so

$$P(A) = \frac{\# A}{2^n},$$

where $\# A$ is the number of elements of $A$. Therefore, if we can count the number of elements of $A$ we can compute its probability. The principle is trivial, but its practice may be hard. In the sequel, we shall learn some counting methods.

First, we should assign, to each initial position and to each sequence of $n$ signs, a path of length $n$. So if the initial position is $S_0 = 0$, and the sequence of signs is $\{+1, +1, -1, +1, -1\}$ then we have a path in the plane that joins the points

$$(0, 0) \to (1, 1) \to (2, 2) \to (3, 1) \to (4, 2) \to (5, 1).$$

[Figure: the path through $(0,0)$, $(1,1)$, $(2,2)$, $(3,1)$, $(4,2)$, $(5,1)$, with each segment labelled $+$ or $-$.]

We can thus think of the elements of the sample space as paths. As a warm-up, consider the following:

Example: Assume that $S_0 = 0$, and compute the probability of the event

$$A = \{S_2 \ge 0,\ S_4 = -1,\ S_6 < 0,\ S_7 < 0\}.$$

The paths comprising this event are shown below.

[Figure: the paths comprising the event $A$, drawn over times 0 to 7.]

We can easily count that there are 6 paths in $A$. So $P(A) = 6/2^7 = 6/128 = 3/64 \approx 0.047$.

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation

$$\{(m, x) \rightsquigarrow (n, y)\} =: \{\text{all paths that start at } (m, x) \text{ and end up at } (n, y)\}.$$

Let us first show that

Lemma 14. The number of paths that start from $(0, 0)$ and end up at $(n, i)$ is given by:

$$\#\{(0, 0) \rightsquigarrow (n, i)\} = \binom{n}{\frac{n+i}{2}}.$$

More generally, the number of paths that start from $(m, x)$ and end up at $(n, y)$ is given by:

$$\#\{(m, x) \rightsquigarrow (n, y)\} = \binom{n-m}{(n - m + y - x)/2}. \qquad (23)$$

Remark. If $n + i$ is not an even number then there is no path that starts from $(0, 0)$ and ends at $(n, i)$. Hence $\binom{n}{(n+i)/2} = 0$, in this case.

[Figure: The lightly shaded region contains all paths of length $n$ that start at $(0, 0)$. (There are $2^n$ of them.) The darker region shows all paths that start at $(0, 0)$ and end up at the specific point $(n, i)$.]

Proof. Consider a path that starts from $(0, 0)$ and ends at $(n, i)$.

Let $u$ be the number of $+$'s in this path (the number of upward segments) and $d$ the number of $-$'s (the number of downward segments). Then

$$u + d = n, \qquad u - d = i.$$

Hence

$$u = (n + i)/2, \qquad d = (n - i)/2.$$

So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of $n$ signs out of which $u$ are $+$ and the rest are $-$, in how many ways can we arrange them? Well, in exactly $\binom{n}{u}$ ways, and this proves the formula.

Notice that $u = (n + i)/2$ must be an integer. This is true if $n + i$ is even or, equivalently, if $n$ and $i$ are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from $(0, 0)$ to $(n, i)$. For example (look at the figure), there is no path that goes from $(0, 0)$ to $(3, 2)$. We shall thus define $\binom{n}{u}$ to be zero if $u$ is not an integer.

More generally, the number of paths that start from $(m, x)$ and end up at $(n, y)$ equals the number of paths that start from $(0, 0)$ and end up at $(n - m, y - x)$. So, applying the previous formula, we obtain (23).
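For small $n$ the count in Lemma 14 can be confirmed by listing all $2^n$ sign sequences. The following sketch does exactly that; the helper names are illustrative, and the convention that the binomial coefficient is zero when $(n+i)/2$ is not an integer in $[0, n]$ is the one adopted in the text.

```python
from itertools import product
from math import comb


def count_paths(n, i):
    """Brute force: number of sign sequences of length n whose sum equals i."""
    return sum(1 for signs in product((-1, 1), repeat=n) if sum(signs) == i)


def lemma14(n, i):
    """binom(n, (n + i) / 2), taken to be 0 when (n + i) / 2 is not an integer in [0, n]."""
    if (n + i) % 2 != 0 or abs(i) > n:
        return 0
    return comb(n, (n + i) // 2)


for n, i in [(3, 1), (3, 2), (5, 1), (6, 0)]:
    print((n, i), count_paths(n, i), lemma14(n, i))
```

The pair $(3, 2)$ is the example from the text for which no path exists, so both columns should show zero there.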

26.2 **Paths that start and end at specific points and do not touch zero at all**

Lemma 15. Suppose that $x, y > 0$. The number of paths $(m, x) \rightsquigarrow (n, y)$ that never become zero (i.e. they do not touch the horizontal axis) is given by:

$$\#\{(m, x) \rightsquigarrow (n, y);\ \text{remain} > 0\} = \binom{n-m}{\frac{1}{2}(n - m + y - x)} - \binom{n-m}{\frac{1}{2}(n - m - y - x)}. \qquad (24)$$

Proof.

[Figure: The shaded region contains all possible paths that start at $(m, x)$ and end up at the specific point $(n, y)$ which, in addition, never touch the horizontal axis.]

Consider, instead, a path that starts at $(m, x)$ and ends up at point $(n, y)$ which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at $(m, x)$ up to the first time it hits zero and then follow the dotted path: this is the reflected path.

[Figure: A path from $(m, x)$ to $(n, y)$ and its reflection, ending at $(n, -y)$, around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.]

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from $(m, x)$ and ends at $(n, -y)$ and so it necessarily crosses the horizontal axis. We thus apply (23) with $-y$ in place of $y$:

$$\#\{(m, x) \rightsquigarrow (n, y);\ \text{touch zero}\} = \#\{(m, x) \rightsquigarrow (n, -y)\} = \binom{n-m}{(n - m - y - x)/2}.$$

So, for $x > 0$, $y > 0$,

$$
\begin{aligned}
\#\{(m, x) \rightsquigarrow (n, y);\ \text{remain} > 0\}
&= \#\{(m, x) \rightsquigarrow (n, y)\} - \#\{(m, x) \rightsquigarrow (n, y);\ \text{touch zero}\} \\
&= \binom{n-m}{\frac{1}{2}(n - m + y - x)} - \binom{n-m}{\frac{1}{2}(n - m - y - x)}.
\end{aligned}
$$
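The reflection count (24) can likewise be compared with a direct enumeration for short paths. This is only a sanity check under the stated convention that a binomial coefficient with a non-integer or out-of-range lower index is zero; `half_binom`, `count_positive_paths` and `lemma15` are hypothetical helper names, and `steps` plays the role of $n - m$.

```python
from itertools import product
from math import comb


def half_binom(n, twice_k):
    """binom(n, twice_k / 2), taken to be 0 unless twice_k / 2 is an integer in [0, n]."""
    if twice_k % 2 != 0 or twice_k < 0 or twice_k > 2 * n:
        return 0
    return comb(n, twice_k // 2)


def count_positive_paths(steps, x, y):
    """Brute force: paths of the given length from level x to level y that stay > 0."""
    count = 0
    for signs in product((-1, 1), repeat=steps):
        pos, stays_positive = x, True
        for s in signs:
            pos += s
            if pos <= 0:
                stays_positive = False
                break
        if stays_positive and pos == y:
            count += 1
    return count


def lemma15(steps, x, y):
    """Right-hand side of (24) with n - m = steps."""
    return half_binom(steps, steps + y - x) - half_binom(steps, steps - y - x)


print(count_positive_paths(4, 2, 2), lemma15(4, 2, 2))
```

For four steps from level 2 back to level 2 there are six paths in total, one of which (down, down, up, up) touches zero, so both numbers printed should be five.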

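Corollary 14. For any $y > 0$,

$$\#\{(1, 1) \rightsquigarrow (n, y);\ \text{remain} > 0\} = \frac{y}{n}\, \#\{(0, 0) \rightsquigarrow (n, y)\}. \qquad (25)$$

Proof. Let $m = 1$, $x = 1$ in (24):

$$
\begin{aligned}
\#\{(1, 1) \rightsquigarrow (n, y);\ \text{remain} > 0\}
&= \binom{n-1}{\frac{1}{2}(n + y) - 1} - \binom{n-1}{\frac{1}{2}(n - y) - 1} \\
&= \frac{(n-1)!}{\big(\frac{n+y}{2} - 1\big)!\, \big(\frac{n-y}{2}\big)!} - \frac{(n-1)!}{\big(\frac{n-y}{2} - 1\big)!\, \big(\frac{n+y}{2}\big)!} \\
&= \Big( \frac{1}{n}\, \frac{n+y}{2} - \frac{1}{n}\, \frac{n-y}{2} \Big) \frac{n!}{\big(\frac{n+y}{2}\big)!\, \big(\frac{n-y}{2}\big)!} \\
&= \frac{y}{n} \binom{n}{\frac{n+y}{2}},
\end{aligned}
$$

and the latter term equals the total number of paths from $(0, 0)$ to $(n, y)$.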

Corollary 15. The number of paths from $(m, x)$ to $(n, y)$ that do not touch level $z < \min(x, y)$ is given by:

$$\#\{(m, x) \rightsquigarrow (n, y);\ \text{remain} > z\} = \binom{n-m}{\frac{1}{2}(n - m + y - x)} - \binom{n-m}{\frac{1}{2}(n - m - y - x) + z}. \qquad (26)$$

Proof. Because it is only the relative position of $x$ and $y$ with respect to $z$ that plays any role, we have

$$\#\{(m, x) \rightsquigarrow (n, y);\ \text{remain} > z\} = \#\{(m, x - z) \rightsquigarrow (n, y - z);\ \text{remain} > 0\}$$

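and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from $(0, 0)$ to $(n, y)$, where $y \ge 0$, that remain $\ge 0$ is given by:

$$\#\{(0, 0) \rightsquigarrow (n, y);\ \text{remain} \ge 0\} = \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1}. \qquad (27)$$

Proof.

$$
\begin{aligned}
\#\{(0, 0) \rightsquigarrow (n, y);\ \text{remain} \ge 0\}
&= \#\{(0, 0) \rightsquigarrow (n, y);\ \text{remain} > -1\} \\
&= \#\{(0, 1) \rightsquigarrow (n, y + 1);\ \text{remain} > 0\} \\
&= \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n-y}{2} - 1} \\
&= \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1},
\end{aligned}
$$

where we used (26) with $z = -1$.

Corollary 17 (Catalan numbers). The number of paths from $(0, 0)$ to $(n, 0)$ that remain $\ge 0$ is given by:

$$\#\{(0, 0) \rightsquigarrow (n, 0);\ \text{remain} \ge 0\} = \frac{1}{n+1} \binom{n+1}{n/2}.$$

Proof. Apply (27) with $y = 0$:

$$\#\{(0, 0) \rightsquigarrow (n, 0);\ \text{remain} \ge 0\} = \binom{n}{\frac{n}{2}} - \binom{n}{\frac{n}{2} + 1} = \binom{n}{\frac{n}{2}} - \binom{n}{\frac{n}{2} - 1} = \frac{1}{n+1} \binom{n+1}{n/2},$$

where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from $(0, 0)$, have length $n$, and remain nonnegative is given by:

$$\#\{(0, 0) \rightsquigarrow (n, \bullet);\ \text{remain} \ge 0\} = \binom{n}{\lceil n/2 \rceil}, \qquad (28)$$

where $\lceil x \rceil$ denotes the smallest integer that is $\ge x$.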

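Proof. To see this, just use (27) and sum over all $y \ge 0$. If $n = 2m$ then we need to set $y = 2z$ in (27) (otherwise (27) is just zero) and so we have

$$
\begin{aligned}
\#\{(0, 0) \rightsquigarrow (2m, \bullet);\ \text{remain} \ge 0\}
&= \sum_{z \ge 0} \left[ \binom{2m}{m+z} - \binom{2m}{m+z+1} \right] \\
&= \binom{2m}{m} - \binom{2m}{m+1} + \binom{2m}{m+1} - \binom{2m}{m+2} + \cdots + \binom{2m}{2m-1} - \binom{2m}{2m} + \binom{2m}{2m} \pm \cdots \\
&= \binom{2m}{m} = \binom{n}{n/2},
\end{aligned}
$$

because all the terms that are denoted by $\cdots$ after $\binom{2m}{2m}$ are equal to 0, while the rest cancel with one another, and only the first one remains. If $n = 2m + 1$ then we need to set $y = 2z + 1$ in (27) (otherwise (27) is just zero) and so we have

$$
\begin{aligned}
\#\{(0, 0) \rightsquigarrow (2m + 1, \bullet);\ \text{remain} \ge 0\}
&= \sum_{z \ge 0} \left[ \binom{2m+1}{m+z+1} - \binom{2m+1}{m+z+2} \right] \\
&= \binom{2m+1}{m+1} - \binom{2m+1}{m+2} + \binom{2m+1}{m+2} - \binom{2m+1}{m+3} + \cdots + \binom{2m+1}{2m} - \binom{2m+1}{2m+1} + \binom{2m+1}{2m+1} \pm \cdots \\
&= \binom{2m+1}{m+1} = \binom{n}{\lceil n/2 \rceil}.
\end{aligned}
$$

[Figure: two shaded regions of paths of length $2m$. We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?]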
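Formulas (27) and (28), together with the Catalan count of Corollary 17, are easy to test against an exhaustive enumeration for small $n$. The sketch below is such a check; all function names are assumptions introduced here, and `math.comb` is relied upon to return 0 when the lower index exceeds the upper one.

```python
from itertools import accumulate, product
from math import comb


def count_nonneg_paths(n, y=None):
    """Brute force: paths of length n from 0 that stay >= 0 (ending at y, if y is given)."""
    count = 0
    for signs in product((-1, 1), repeat=n):
        partial_sums = list(accumulate(signs))
        if all(s >= 0 for s in partial_sums) and (y is None or sum(signs) == y):
            count += 1
    return count


def formula27(n, y):
    """binom(n, (n+y)/2) - binom(n, (n+y)/2 + 1) for y >= 0, zero if n + y is odd."""
    if (n + y) % 2 != 0:
        return 0
    k = (n + y) // 2
    return comb(n, k) - comb(n, k + 1)


def formula28(n):
    """binom(n, ceil(n / 2))."""
    return comb(n, (n + 1) // 2)


def catalan_count(n):
    """(1 / (n + 1)) * binom(n + 1, n / 2) for even n, as in Corollary 17."""
    return comb(n + 1, n // 2) // (n + 1)


for n in (2, 4, 6, 7):
    print(n, count_nonneg_paths(n), formula28(n))
```

Summing `formula27(n, y)` over $y \ge 0$ reproduces the telescoping argument in the proof, so the totals should match `formula28(n)`.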

27 The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1 **Distribution after n steps**

Lemma 16.

$$P_0(S_n = y) = \binom{n}{(n+y)/2}\, 2^{-n},$$

$$P(S_n = y \mid S_m = x) = \binom{n-m}{(n - m + y - x)/2}\, 2^{-(n-m)}.$$

In particular, if $n$ is even,

$$u(n) := P_0(S_n = 0) = \binom{n}{n/2}\, 2^{-n}. \qquad (29)$$

Proof. We have

$$P_0(S_n = y) = \frac{\#\{(0, 0) \rightsquigarrow (n, y)\}}{2^n},$$

and

$$P(S_n = y \mid S_m = x) = \frac{\#\{(m, x) \rightsquigarrow (n, y)\}}{2^{n-m}},$$

and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that $S_0 = 0$ and $S_n = y > 0$, the random walk never becomes zero up to time $n$ is given by

$$P_0(S_1 > 0, S_2 > 0, \ldots, S_n > 0 \mid S_n = y) = \frac{y}{n}. \qquad (30)$$

This is called the ballot theorem and the reason will be given in Section 29. Such a simple answer begs for a physical/intuitive explanation. But, for now, we shall just compute it.

Proof. The left side of (30) equals

$$\frac{\#\{\text{paths } (1, 1) \rightsquigarrow (n, y) \text{ that do not touch zero}\} \times 2^{-n}}{\#\{\text{paths } (0, 0) \rightsquigarrow (n, y)\} \times 2^{-n}} = \frac{\binom{n-1}{\frac{1}{2}(n - 1 + y - 1)} - \binom{n-1}{\frac{1}{2}(n - 1 - y - 1)}}{\binom{n}{\frac{1}{2}(n + y)}} = \frac{y}{n}.$$

27.3 Some conditional probabilities

Lemma 17. If $n, y \ge 0$ are both even or both odd then

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = y) = 1 - \frac{P_0(S_n = y + 2)}{P_0(S_n = y)} = \frac{y + 1}{\frac{1}{2}(n + y) + 1}.$$

In particular, if $n$ is even,

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = 0) = \frac{1}{(n/2) + 1}.$$

Proof. We use (27):

$$
P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = y) = \frac{P_0(S_1 \ge 0, \ldots, S_n \ge 0,\ S_n = y)}{P_0(S_n = y)} = \frac{\binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1}}{\binom{n}{\frac{n+y}{2}}} = 1 - \frac{P_0(S_n = y + 2)}{P_0(S_n = y)}.
$$

This is further equal to

$$1 - \frac{n!}{\big(\frac{n+y}{2} + 1\big)!\, \big(\frac{n-y}{2} - 1\big)!} \cdot \frac{\big(\frac{n+y}{2}\big)!\, \big(\frac{n-y}{2}\big)!}{n!} = 1 - \frac{\frac{n-y}{2}}{\frac{n+y}{2} + 1} = \frac{y + 1}{\frac{n+y}{2} + 1}.$$

27.4 Some remarkable identities

Theorem 26. If $n$ is even,

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0) = P_0(S_1 \ne 0, \ldots, S_n \ne 0) = P_0(S_n = 0).$$

Proof. For even $n$ we have

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0) = \frac{\#\{(0, 0) \rightsquigarrow (n, \bullet);\ \text{remain} \ge 0\}}{2^n} = \binom{n}{n/2}\, 2^{-n} = P_0(S_n = 0),$$

where we used (28) and (29). We next have

$$
\begin{aligned}
P_0(S_1 \ne 0, \ldots, S_n \ne 0)
&\overset{(a)}{=} P_0(S_1 > 0, \ldots, S_n > 0) + P_0(S_1 < 0, \ldots, S_n < 0) \\
&\overset{(b)}{=} 2 P_0(S_1 > 0, \ldots, S_n > 0) \\
&\overset{(c)}{=} 2 P_0(\xi_1 = 1)\, P_0(S_2 > 0, \ldots, S_n > 0 \mid \xi_1 = 1) \\
&\overset{(d)}{=} P_0(1 + \xi_2 > 0,\ 1 + \xi_2 + \xi_3 > 0, \ldots, 1 + \xi_2 + \cdots + \xi_n > 0 \mid \xi_1 = 1) \\
&\overset{(e)}{=} P_0(\xi_2 \ge 0,\ \xi_2 + \xi_3 \ge 0, \ldots, \xi_2 + \cdots + \xi_n \ge 0) \\
&\overset{(f)}{=} P_0(S_1 \ge 0, \ldots, S_{n-1} \ge 0) \\
&\overset{(g)}{=} P_0(S_1 \ge 0, \ldots, S_n \ge 0),
\end{aligned}
$$

where (a) is obvious, (b) follows from symmetry, (c) is just conditioning on $\{\xi_1 = 1\}$ and using the fact that $S_1 = \xi_1$, (d) follows from $P_0(\xi_1 = 1) = 1/2$ and from the substitution of the value of $\xi_1$ on the left of the conditioning, (e) follows from the fact that $\xi_1$ is independent of $\xi_2, \xi_3, \ldots$ and from the fact that if $m$ is an integer then $1 + m > 0$ is equivalent to $m \ge 0$, (f) follows from the replacement of $(\xi_2, \xi_3, \ldots)$ by $(\xi_1, \xi_2, \ldots)$, and finally (g) is trivial because $S_{n-1} \ge 0$ means $S_{n-1} > 0$ (since $n - 1$ is odd, $S_{n-1}$ cannot take the value 0), and $S_{n-1} > 0$ implies that $S_n \ge 0$.
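The three probabilities in Theorem 26 can be compared numerically by counting paths, since every path of length $n$ has probability $2^{-n}$. The sketch below counts, for a given even $n$, the paths that stay nonnegative, the paths that avoid zero, and the paths that end at zero; `walk_counts` is a hypothetical helper written for this check only.

```python
from itertools import accumulate, product
from math import comb


def walk_counts(n):
    """Return (#paths staying >= 0, #paths never hitting 0 at times 1..n, #paths with S_n = 0)."""
    nonneg = nonzero = at_zero = 0
    for signs in product((-1, 1), repeat=n):
        partial_sums = list(accumulate(signs))   # S_1, ..., S_n
        nonneg += all(s >= 0 for s in partial_sums)
        nonzero += all(s != 0 for s in partial_sums)
        at_zero += partial_sums[-1] == 0
    return nonneg, nonzero, at_zero


for n in (2, 4, 6, 8):
    print(n, walk_counts(n), comb(n, n // 2))
```

Dividing any of the three counts by $2^n$ gives $u(n)$ of (29), so for even $n$ all four printed numbers on a line should coincide.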

27.5 First return to zero

In (29), we computed, for even $n$, the probability $u(n)$ that the walk (started at 0) attains value 0 in $n$ steps. We now want to find the probability $f(n)$ that the walk will return to 0 for the first time in $n$ steps.

Theorem 27. Let $n$ be even, and $u(n) = P_0(S_n = 0)$. Then

$$f(n) := P_0(S_1 \ne 0, \ldots, S_{n-1} \ne 0,\ S_n = 0) = \frac{u(n)}{n-1}.$$

Proof. Since the random walk cannot change sign before becoming zero,

$$f(n) = P_0(S_1 > 0, \ldots, S_{n-1} > 0,\ S_n = 0) + P_0(S_1 < 0, \ldots, S_{n-1} < 0,\ S_n = 0).$$

By symmetry, the two terms must be equal. So

$$
\begin{aligned}
f(n) &= 2 P_0(S_1 > 0, \ldots, S_{n-1} > 0,\ S_n = 0) \\
&= 2 P_0(S_{n-1} > 0,\ S_n = 0)\; P_0(S_1 > 0, \ldots, S_{n-2} > 0 \mid S_{n-1} > 0,\ S_n = 0).
\end{aligned}
$$

Now,

$$P_0(S_{n-1} > 0,\ S_n = 0) = P_0(S_n = 0)\, P(S_{n-1} > 0 \mid S_n = 0).$$

We know that $P_0(S_n = 0) = u(n) = \binom{n}{n/2} 2^{-n}$. Also, given that $S_n = 0$, we either have $S_{n-1} = 1$ or $-1$, with equal probability. So the last term is $1/2$. To take care of the last term in the last expression for $f(n)$, we see that $S_{n-1} > 0$, $S_n = 0$ means $S_{n-1} = 1$, $S_n = 0$. By the Markov property at time $n - 1$, we can omit the last event. So

$$f(n) = 2 u(n)\, \frac{1}{2}\, P_0(S_1 > 0, \ldots, S_{n-2} > 0 \mid S_{n-1} = 1),$$

and the last probability can be computed from the ballot theorem, see (30): it is equal to $1/(n - 1)$. So

$$f(n) = \frac{u(n)}{n-1}.$$

28 The reflection principle for a simple random walk in dimension 1

The reflection principle says the following. Fix a state $a$. Put a two-sided mirror at $a$. If a particle is to the right of $a$ then you see its image on the left, and if the particle is to the left of $a$ then its mirror image is on the right. So if the particle is in position $s$ then its mirror image is in position $\tilde s = 2a - s$ because $s - a = a - \tilde s$. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at $2a$ and performs a simple symmetric random walk. This is not so interesting.

What is more interesting is this: Run the particle till it hits $a$ (it will, for sure, by Theorem 35) and after that consider its mirror image $\tilde S_n = 2a - S_n$. In other words, define

$$
\tilde S_n =
\begin{cases}
S_n, & n < T_a, \\
2a - S_n, & n \ge T_a,
\end{cases}
$$

where

$$T_a = \inf\{n \ge 0 : S_n = a\}$$

is the first hitting time of $a$. Then

Theorem 28. If $(S_n)$ is a simple symmetric random walk started at 0 then so is $(\tilde S_n)$.

Proof. Trivially,

$$
S_n =
\begin{cases}
S_n, & n < T_a, \\
a + (S_n - a), & n \ge T_a.
\end{cases}
$$

By the strong Markov property, the process $(S_{T_a + m} - a,\ m \ge 0)$ is a simple symmetric random walk, independent of the past before $T_a$. But a symmetric random walk has the same law as its negative. So $S_n - a$ can be replaced by $a - S_n$ in the last display and the resulting process, called $(\tilde S_n, n \in \mathbb{Z}_+)$, has the same law as $(S_n, n \in \mathbb{Z}_+)$.

28.1 Distribution of hitting time and maximum

We can take advantage of this principle to establish the distribution function of $T_a$:

Theorem 29. Let $(S_n, n \ge 0)$ be a simple symmetric random walk and $a > 0$. Then

$$P_0(T_a \le n) = 2 P_0(S_n \ge a) - P_0(S_n = a) = P_0(|S_n| \ge a) - \frac{1}{2} P_0(|S_n| = a).$$

Proof. If $S_n \ge a$ then $T_a \le n$. So:

$$P_0(S_n \ge a) = P_0(T_a \le n,\ S_n \ge a).$$

In the last event, we have $n \ge T_a$, so we may replace $S_n$ by $2a - S_n$ (Theorem 28). Thus,

$$P_0(S_n \ge a) = P_0(T_a \le n,\ 2a - S_n \ge a) = P_0(T_a \le n,\ S_n \le a).$$

But

$$P_0(T_a \le n) = P_0(T_a \le n,\ S_n \le a) + P_0(T_a \le n,\ S_n > a) = P_0(T_a \le n,\ S_n \le a) + P_0(S_n > a).$$

The last equality follows from the fact that $S_n > a$ implies $T_a \le n$. Combining the last two displays gives

$$P_0(T_a \le n) = P_0(S_n \ge a) + P_0(S_n > a) = P_0(S_n \ge a) + P_0(S_n \ge a) - P_0(S_n = a).$$

Finally, observing that $S_n$ has the same distribution as $-S_n$ we get that this is also equal to $P_0(|S_n| \ge a) - \frac{1}{2} P_0(|S_n| = a)$.

Define now the running maximum

$$M_n = \max(S_0, S_1, \ldots, S_n).$$

Then, as a corollary to the above result, we obtain

Corollary 19.
$$P_0(M_n \ge x) = P_0(T_x \le n) = 2P_0(S_n \ge x) - P_0(S_n = x) = P_0(|S_n| \ge x) - \frac{1}{2} P_0(|S_n| = x).$$

Proof. Observe that $M_n \ge x$ if and only if $S_k \ge x$ for some $k \le n$, which is equivalent to $T_x \le n$, and use Theorem 29.

We can do more: we can derive the joint distribution of $M_n$ and $S_n$:

Theorem 30.
$$P_0(M_n < x, S_n = y) = P_0(S_n = y) - P_0(S_n = 2x - y), \qquad x > y.$$

Proof. Since $M_n < x$ is equivalent to $T_x > n$, we have
$$P_0(M_n < x, S_n = y) = P_0(T_x > n, S_n = y) = P_0(S_n = y) - P_0(T_x \le n, S_n = y). \qquad (31)$$
If $T_x \le n$, then, by applying the reflection principle (Theorem 28), we can replace $S_n$ by $2x - S_n$ in the last part of the last display and get
$$P_0(T_x \le n, S_n = y) = P_0(T_x \le n, S_n = 2x - y).$$
But if $x > y$ then $2x - y > x$ and so $\{S_n = 2x - y\} \subset \{T_x \le n\}$, which results in
$$P_0(T_x \le n, S_n = y) = P_0(S_n = 2x - y).$$
This, combined with (31), gives the result.

29 Urns and the ballot theorem

Suppose that an urn [8] contains $n$ items, of which $a$ are coloured azure and $b$ black, $n = a + b$. A sample from the urn is a sequence $\eta = (\eta_1, \ldots, \eta_n)$, where $\eta_i$ indicates the colour of the $i$-th item picked. The set of values of $\eta$ contains $\binom{n}{a}$ elements and $\eta$ is uniformly distributed in this set.

The original ballot problem, posed in 1887 [9], asks the following: if we start picking the items from the urn one by one without replacement, what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved [10] in the same year. The answer is very simple:

Theorem 31. If an urn contains $a$ azure items and $b = n - a$ black items, then the probability that, during sampling without replacement, the number of selected azure items is always ahead of the black ones equals $(a - b)/n$, as long as $a \ge b$.

[8] An urn is a vessel, usually a vase furnished with a foot or pedestal, employed for different purposes, as for holding liquids, ornamental objects, the ashes of the dead after cremation, and anciently for holding ballots to be drawn. It is the latter use we are interested in in Probability and Statistics.
[9] Bertrand, J. (1887). Solution d'un problème. Compt. Rend. Acad. Sci. Paris, 105, 369.
[10] André, D. (1887). Solution directe du problème résolu par M. Bertrand. Compt. Rend. Acad. Sci. Paris, 105, 436-437.
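The statement of Theorem 31 can be checked by brute force for small urns. The following is only an illustrative sketch, not part of the original notes: the function name `ballot_probability` is hypothetical, and the code simply enumerates all $\binom{n}{a}$ equally likely samples and counts those in which azure stays strictly ahead.

```python
from fractions import Fraction
from itertools import combinations


def ballot_probability(a, b):
    """Exact probability that azure stays strictly ahead of black throughout."""
    n = a + b
    favourable = 0
    total = 0
    # A sample is determined by the set of draw positions holding azure items.
    for azure_positions in combinations(range(n), a):
        azure = set(azure_positions)
        lead = 0
        ahead = True
        for i in range(n):
            lead += 1 if i in azure else -1
            if lead <= 0:          # azure is no longer strictly ahead
                ahead = False
                break
        total += 1
        favourable += ahead
    return Fraction(favourable, total)


# Compare with (a - b)/n for a few small urns with a >= b.
for a, b in [(3, 2), (5, 3), (6, 1), (4, 4)]:
    assert ballot_probability(a, b) == Fraction(a - b, a + b)
```

Exact rational arithmetic is used so that the comparison with $(a-b)/n$ is an equality rather than a floating-point approximation.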

We shall call $(\eta_1, \ldots, \eta_n)$ an urn process with parameters $n, a$, always assuming (without any loss of generality) that $a \ge b = n - a$. An urn can be realised quite easily by a random walk:

Theorem 32. Consider a simple symmetric random walk $(S_n,\ n \ge 0)$ with values in $\mathbb{Z}$ and let $y$ be a positive integer. Then, conditional on the event $\{S_n = y\}$, the sequence $(\xi_1, \ldots, \xi_n)$ of the increments is an urn process with parameters $n$ and $a = (n + y)/2$.

Proof. The values of each $\xi_i$ are $\pm 1$, so 'azure' items are $+1$ and 'black' items are $-1$. We need to show that, conditional on $\{S_n = y\}$, the sequence $(\xi_1, \ldots, \xi_n)$ is uniformly distributed in a set with $\binom{n}{a}$ elements. But if $S_n = y$ then, letting $a, b$ be defined by $a + b = n$, $a - b = y$, we have that $a$ out of the $n$ increments will be $+1$ and $b$ will be $-1$. So the number of possible values of $(\xi_1, \ldots, \xi_n)$ is, indeed, $\binom{n}{a}$. It remains to show that all values are equally likely. Let $\varepsilon_1, \ldots, \varepsilon_n$ be elements of $\{-1, +1\}$ such that $\varepsilon_1 + \cdots + \varepsilon_n = y$. Then, using the definition of conditional probability and Lemma 16, we have:
$$P_0(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n \mid S_n = y) = \frac{P_0(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n)}{P_0(S_n = y)} = \frac{2^{-n}}{\binom{n}{(n+y)/2} 2^{-n}} = \frac{1}{\binom{n}{a}}.$$
Thus, conditional on $\{S_n = y\}$, $(\xi_1, \ldots, \xi_n)$ has a uniform distribution.

Proof of Theorem 31. Since an urn with parameters $n, a$ can be realised by a simple symmetric random walk starting from 0 and ending at $S_n = y$, where $y = a - (n - a) = 2a - n$, we can translate the original question into a question about the random walk. Since the 'azure' items correspond to $+$ signs (respectively, the 'black' items correspond to $-$ signs), the event that the azure are always ahead of the black ones during sampling is precisely the event $\{S_1 > 0, \ldots, S_n > 0\}$ that the random walk remains positive. But in (30) we computed that
$$P_0(S_1 > 0, \ldots, S_n > 0 \mid S_n = y) = \frac{y}{n}.$$
Since $y = a - (n - a) = a - b$, the result follows.

30 The asymmetric simple random walk in dimension 1

Consider now a simple random walk with values in $\mathbb{Z}$, but which is not necessarily symmetric. (The word "asymmetric" will always mean "not necessarily symmetric".) So
$$p_{i,i+1} = p, \qquad p_{i,i-1} = q = 1 - p, \qquad i \in \mathbb{Z}.$$
We have
$$P_0(S_n = k) = \binom{n}{\frac{n+k}{2}} p^{(n+k)/2} q^{(n-k)/2}, \qquad -n \le k \le n.$$

30.1 First hitting time analysis

We will use probabilistic + analytical methods to compute the distribution of the hitting times
$$T_x = \inf\{n \ge 0 : S_n = x\}, \qquad x \in \mathbb{Z}.$$
This random variable takes values in $\{0, 1, 2, \ldots\} \cup \{\infty\}$.

Lemma 18. For all $x, y \in \mathbb{Z}$, and all $t \ge 0$,
$$P_x(T_y = t) = P_0(T_{y-x} = t).$$

Proof. This follows from the spatial homogeneity of the transition probabilities: to reach $y$ starting from $x$, only the relative position of $y$ with respect to $x$ is needed.

In view of this, it is no loss of generality to consider the random walk starting from 0.

Lemma 19. Consider a simple random walk starting from 0. If $x > 0$, then the summands in
$$T_x = T_1 + (T_2 - T_1) + \cdots + (T_x - T_{x-1})$$
are i.i.d. random variables.

Proof. To prove this, we make essential use of the skip-free property, namely the obvious fact that a simple random walk starting from 0 must pass through levels $1, 2, \ldots, x - 1$ in order to reach $x$. The rest is a consequence of the strong Markov property.

Corollary 20. Consider a simple random walk starting from 0. The process $(T_x,\ x \ge 0)$ is also a random walk.

In view of Lemma 19 we need to know the distribution of $T_1$. We will approach this using probability generating functions. Let
$$\varphi(s) := E_0 s^{T_1}.$$
(We tacitly assume that $S_0 = 0$.) Then, for $x > 0$,
$$E_0 s^{T_x} = \varphi(s)^x.$$

Theorem 33. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level 1 is reached is given by
$$\varphi(s) = E_0 s^{T_1} = \frac{1 - \sqrt{1 - 4pqs^2}}{2qs}.$$

Proof. Start with $S_0 = 0$ and watch the process until the first $n = T_1$ such that $S_n = 1$. If $S_1 = 1$ then $T_1 = 1$. If $S_1 = -1$ then $T_1$ is the sum of three times: one unit of time spent to go from 0 to $-1$, plus a time $T_1'$ required to go from $-1$ to 0 for the first time, plus a further time $T_1''$ required to go from 0 to $+1$ for the first time. See figure below.

[Figure: a path of $S_n$ against $n$ through the levels $-1$, $0$, $1$, with the times $T_1'$ and $T_1''$ marked.]

If $S_1 = -1$ then, in order to hit 1, the RW must first hit 0 (it takes time $T_1'$ to do this) and then 1 (it takes a further time of $T_1''$). Hence
$$T_1 = \begin{cases} 1 & \text{if } S_1 = 1 \\ 1 + T_1' + T_1'' & \text{if } S_1 = -1. \end{cases}$$
This can be used as follows:
$$\varphi(s) = E_0\big[s\,\mathbf{1}(S_1 = 1) + s^{1 + T_1' + T_1''}\,\mathbf{1}(S_1 = -1)\big] = sP(S_1 = 1) + sE_0\big[s^{T_1'} s^{T_1''} \mid S_1 = -1\big]P(S_1 = -1) = sp + sE_0\big[s^{T_1'} s^{T_1''} \mid S_1 = -1\big]q.$$
But, from the strong Markov property, $T_1'$ is independent of $T_1''$ conditional on $S_1 = -1$, and each has the distribution of $T_1$. We thus have
$$\varphi(s) = ps + qs\varphi(s)^2.$$
The formula is obtained by solving the quadratic and keeping the root that remains bounded as $s \to 0$ (a probability generating function cannot exceed 1 on $[0, 1]$).

Corollary 21. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level $-1$ is reached is given by
$$E_0 s^{T_{-1}} = \frac{1 - \sqrt{1 - 4pqs^2}}{2ps}.$$

Proof. The first time $n$ such that $S_n = -1$ is the first time $n$ such that $-S_n = 1$. But note that $-S_n$ is a simple random walk with upward transition probability $q$ and downward transition probability $p$. Hence the previous formula applies with the roles of $p$ and $q$ interchanged.

Corollary 22. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level $x \in \mathbb{Z}$ is reached is given by
$$E_0 s^{T_x} = \begin{cases} \big(E_0 s^{T_1}\big)^x = \left(\dfrac{1 - \sqrt{1 - 4pqs^2}}{2qs}\right)^{x}, & \text{if } x > 0 \\[2ex] \big(E_0 s^{T_{-1}}\big)^{|x|} = \left(\dfrac{1 - \sqrt{1 - 4pqs^2}}{2ps}\right)^{|x|}, & \text{if } x < 0. \end{cases} \qquad (32)$$

Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21.
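The closed form of Theorem 33 can be compared numerically with the definition $E_0 s^{T_1} = \sum_n P_0(T_1 = n) s^n$. The sketch below is an illustration only: the helper names `hitting_time_pmf` and `phi_closed_form`, and the values $p = 0.4$, $s = 0.8$, are assumptions made for the example. It propagates the mass of paths that have not yet reached level 1 and records the mass that arrives there at each step.

```python
import math


def hitting_time_pmf(p, n_max):
    """Return [P_0(T_1 = n) for n = 0..n_max] for up-probability p."""
    q = 1.0 - p
    alive = {0: 1.0}            # mass of paths that have not yet reached level 1
    pmf = [0.0]                 # T_1 = 0 is impossible when starting from 0
    for _ in range(n_max):
        nxt = {}
        hit = 0.0
        for pos, mass in alive.items():
            if pos + 1 == 1:    # an up-step from 0 reaches level 1 for the first time
                hit += mass * p
            else:
                nxt[pos + 1] = nxt.get(pos + 1, 0.0) + mass * p
            nxt[pos - 1] = nxt.get(pos - 1, 0.0) + mass * q
        pmf.append(hit)
        alive = nxt
    return pmf


def phi_closed_form(p, s):
    """(1 - sqrt(1 - 4 p q s^2)) / (2 q s), the formula of Theorem 33."""
    q = 1.0 - p
    return (1.0 - math.sqrt(1.0 - 4.0 * p * q * s * s)) / (2.0 * q * s)


p, s = 0.4, 0.8
pmf = hitting_time_pmf(p, 400)
series = sum(prob * s**n for n, prob in enumerate(pmf))
phi = phi_closed_form(p, s)

assert abs(series - phi) < 1e-12                                  # series vs closed form
assert abs(phi - (p * s + (1 - p) * s * phi**2)) < 1e-12          # the quadratic
```

The series is truncated at $n = 400$; since $s < 1$ the neglected tail is at most $s^{400}$, which is far below the tolerance used.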

The latter is. 2ps ′ =1− 30.30. and simply write (1 + x)n = k≥0 n k x . we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α .2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. this was found in §30. if S1 = 1. For the general case. For a symmetric ′ random walk. where α is a real number. k by the binomial theorem. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . in distribution. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . We again do ﬁrst-step analysis. If we deﬁne n to be equal to 0 for k > n. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. k 92 . then we can omit the k upper limit in the sum. Similarly. If α = n is a positive integer then n (1 + x)n = k=0 n k x .3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. Consider a simple random walk starting from 0. the same as the time required for the walk to hit 1 starting ′ from 0. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof. how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}.2 First return to the origin Now let us ask: If the particle starts at 0. 1 − 4pqs2 . 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 .

Recall that
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{k!}.$$
It turns out that the formula is formally true for general $\alpha$, and it looks exactly the same:
$$(1 + x)^\alpha = \sum_{k \ge 0} \binom{\alpha}{k} x^k,$$
as long as we define
$$\binom{\alpha}{k} = \frac{\alpha(\alpha - 1)\cdots(\alpha - k + 1)}{k!}.$$
The adverb 'formally' is supposed to mean that this formula holds for some $x$ around 0, but that we won't worry about which ones.

For example, we will show that
$$(1 - x)^{-1} = 1 + x + x^2 + x^3 + \cdots$$
which is of course the all too-familiar geometric series. Indeed,
$$(1 - x)^{-1} = \sum_{k \ge 0} \binom{-1}{k} (-x)^k.$$
But, by definition,
$$\binom{-1}{k} = \frac{(-1)(-2)\cdots(-1 - k + 1)}{k!} = \frac{(-1)^k k!}{k!} = (-1)^k,$$
so
$$(1 - x)^{-1} = \sum_{k \ge 0} (-1)^k (-x)^k = \sum_{k \ge 0} (-1)^{2k} x^k = \sum_{k \ge 0} x^k,$$
as needed.

As another example, let us compute $\sqrt{1 - x}$. We have
$$\sqrt{1 - x} = (1 - x)^{1/2} = \sum_{k \ge 0} \binom{1/2}{k} (-1)^k x^k.$$
We can write this in more familiar symbols if we observe that
$$\binom{1/2}{k} (-1)^k = -\frac{1}{2k - 1} \binom{2k}{k} 4^{-k},$$
and this is shown by direct computation. So
$$\sqrt{1 - x} = -\sum_{k \ge 0} \frac{1}{2k - 1} \binom{2k}{k} \Big(\frac{x}{4}\Big)^k = 1 - \sum_{k \ge 1} \frac{1}{2k - 1} \binom{2k}{k} \Big(\frac{x}{4}\Big)^k.$$
We apply this to $E_0 s^{T_0'} = 1 - \sqrt{1 - 4pqs^2}$, taking $x = 4pqs^2$:
$$E_0 s^{T_0'} = \sum_{k \ge 1} \frac{1}{2k - 1} \binom{2k}{k} (pq)^k s^{2k}.$$

But, by the definition of $E_0 s^{T_0'}$,
$$E_0 s^{T_0'} = \sum_{n \ge 0} P_0(T_0' = n) s^n.$$
Comparing the two 'infinite polynomials' we see that we must have $n = 2k$ (we knew that: the random walk can return to 0 only in an even number of steps) and then
$$P_0(T_0' = 2k) = \frac{1}{2k - 1} \binom{2k}{k} p^k q^k = \frac{1}{2k - 1} P_0(S_{2k} = 0).$$

31 Recurrence properties of simple random walks

Consider again a simple random walk $S_n$ with values in $\mathbb{Z}$ with upward probability $p$ and downward probability $q = 1 - p$. By applying the law of large numbers we have that
$$P_0\Big(\lim_{n \to \infty} \frac{S_n}{n} = p - q\Big) = 1.$$

• If $p > 1/2$ then $p - q > 0$ and so $S_n$ tends to $+\infty$ with probability one. Hence it is transient.
• If $p < 1/2$ then $p - q < 0$ and so $S_n$ tends to $-\infty$ with probability one. Hence it is again transient.
• If $p = 1/2$ then $p - q = 0$ and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is:
$$P_0(T_x < \infty) = \lim_{s \uparrow 1} E_0 s^{T_x},$$
which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes $1 - 4pq$. But remember that $q = 1 - p$, so
$$\sqrt{1 - 4pq} = \sqrt{1 - 4p + 4p^2} = \sqrt{(1 - 2p)^2} = |1 - 2p| = |p - q|.$$
So
$$P_0(T_x < \infty) = \begin{cases} \left(\dfrac{1 - |p - q|}{2q}\right)^{x}, & \text{if } x > 0 \\[2ex] \left(\dfrac{1 - |p - q|}{2p}\right)^{|x|}, & \text{if } x < 0. \end{cases} \qquad (33)$$
This formula can be written more neatly as follows:

Theorem 35.
$$P_0(T_x < \infty) = \begin{cases} \left(\dfrac{p}{q} \wedge 1\right)^{x}, & \text{if } x > 0 \\[2ex] \left(\dfrac{q}{p} \wedge 1\right)^{|x|}, & \text{if } x < 0. \end{cases}$$
In particular, if $p = q$ then $P_0(T_x < \infty) = 1$ for all $x \in \mathbb{Z}$.

Proof. If $p > q$ then $|p - q| = p - q$ and so (33) gives
$$P_0(T_x < \infty) = \begin{cases} 1, & \text{if } x > 0 \\ \left(\dfrac{q}{p}\right)^{|x|}, & \text{if } x < 0, \end{cases}$$
which is what we want. If $p < q$ then $|p - q| = q - p$ and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in $\mathbb{Z}$ is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that $p < 1/2$. Then $\lim_{n \to \infty} S_n = -\infty$ with probability 1. This means that there is a largest value that will be ever attained by the random walk. This value is random and is denoted by
$$M_\infty = \sup(S_0, S_1, S_2, \ldots).$$
If we let $M_n = \max(S_0, \ldots, S_n)$, then we have $M_\infty = \lim_{n \to \infty} M_n$. Indeed, the sequence $M_n$ is nondecreasing; every nondecreasing sequence has a limit, except that the limit may be $\infty$; but here it is $< \infty$, with probability 1, because $p < 1/2$.

Theorem 36. Consider a simple random walk with values in $\mathbb{Z}$. Assume $p < 1/2$. Then the overall maximum $M_\infty$ has a geometric distribution:
$$P_0(M_\infty \ge x) = (p/q)^x, \qquad x \ge 0.$$

Proof.
$$P_0(M_\infty \ge x) = \lim_{n \to \infty} P_0(M_n \ge x) = \lim_{n \to \infty} P_0(T_x \le n) = P_0(T_x < \infty) = (p/q)^x, \qquad x \ge 0.$$

31.2 Return probabilities

Only in the symmetric case $p = q = 1/2$ will the random walk return to the point where it started from. Otherwise, with some positive probability, it will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in $\mathbb{Z}$. Let
$$T_0' = \inf\{n \ge 1 : S_n = 0\}$$
be the first return time to 0. Then
$$P_0(T_0' < \infty) = 2(p \wedge q).$$
Unless $p = 1/2$, this probability is always strictly less than 1.

Proof. The probability generating function of $T_0'$ was obtained in Theorem 34. But
$$P_0(T_0' < \infty) = \lim_{s \uparrow 1} E s^{T_0'} = 1 - \sqrt{1 - 4pq} = 1 - |p - q|.$$
If $p = q$ this gives 1. If $p > q$ this gives $1 - (p - q) = 1 - (1 - q - q) = 2q$. And if $p < q$ this gives $1 - (q - p) = 1 - (1 - p - p) = 2p$.

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state $i$ is given by
$$J_i = \sum_{n=1}^{\infty} \mathbf{1}(S_n = i), \qquad (34)$$
a random variable first considered in Section 10 where it was also shown to be geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in $\mathbb{Z}$. Then, starting from zero, the total number $J_0$ of visits to state 0 is geometrically distributed with
$$P_0(J_0 \ge k) = (2(p \wedge q))^k, \quad k = 0, 1, 2, \ldots, \qquad E_0 J_0 = \frac{2(p \wedge q)}{1 - 2(p \wedge q)}.$$

Proof. In Lemma 4 we showed that, if a Markov chain starts from a state $i$ then $J_i$ is geometric. And so
$$P_0(J_0 \ge k) = P_0(J_0 \ge 1)^k, \qquad k = 0, 1, 2, \ldots$$
But $J_0 \ge 1$ if and only if the first return time to zero is finite. So
$$P_0(J_0 \ge 1) = P_0(T_0' < \infty) = 2(p \wedge q),$$
where the last probability was computed in Theorem 37. This proves the first formula. For the second formula we have
$$E_0 J_0 = \sum_{k=1}^{\infty} P_0(J_0 \ge k) = \sum_{k=1}^{\infty} (2(p \wedge q))^k = \frac{2(p \wedge q)}{1 - 2(p \wedge q)},$$
this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any $i$:
$$P_i(J_i \ge k) = (2(p \wedge q))^k, \quad k = 0, 1, 2, \ldots, \qquad E_i J_i = \frac{2(p \wedge q)}{1 - 2(p \wedge q)}.$$

Remark 2: The formulae work for all values of $p \in [0, 1]$, even for $p = 1/2$. In the latter case, $P_i(J_i \ge k) = 1$ for all $k$, meaning that $P_i(J_i = \infty) = 1$, as already known.

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a random walk with values in $\mathbb{Z}$. If $p \le 1/2$ then
$$P_0(J_x \ge k) = (p/q)^{x \vee 0} (2p)^{k-1}, \quad k \ge 1, \qquad E_0 J_x = \frac{(p/q)^{x \vee 0}}{1 - 2p}.$$
If $p \ge 1/2$ then
$$P_0(J_x \ge k) = (q/p)^{(-x) \vee 0} (2q)^{k-1}, \quad k \ge 1, \qquad E_0 J_x = \frac{(q/p)^{(-x) \vee 0}}{1 - 2q}.$$

Proof. Suppose $x > 0$, $k \ge 1$. Then
$$P_0(J_x \ge k) = P_0(T_x < \infty, J_x \ge k),$$
simply because if $J_x \ge k \ge 1$ then $T_x < \infty$. Apply now the strong Markov property at $T_x$. Then
$$P_0(T_x < \infty, J_x \ge k) = P_0(T_x < \infty)\, P_x(J_x \ge k - 1).$$
The reason that we replaced $k$ by $k - 1$ is that, given that $T_x < \infty$, there has already been one visit to state $x$ and so we need at least $k - 1$ remaining visits in order to have at least $k$ visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain
$$P_0(J_x \ge k) = \Big(\frac{p}{q} \wedge 1\Big)^{x} (2(p \wedge q))^{k-1}.$$
If $p \le 1/2$ then this gives $P_0(J_x \ge k) = (p/q)^x (2p)^{k-1}$. If $p \ge 1/2$ then this gives $P_0(J_x \ge k) = (2q)^{k-1}$. We repeat the argument for $x < 0$ and find
$$P_0(J_x \ge k) = \Big(\frac{q}{p} \wedge 1\Big)^{|x|} (2(p \wedge q))^{k-1}.$$
If $p \le 1/2$ then this gives $P_0(J_x \ge k) = (2p)^{k-1}$. If $p \ge 1/2$ then this gives $P_0(J_x \ge k) = (q/p)^{|x|} (2q)^{k-1}$. As for the expectation, we apply
$$E_0 J_x = \sum_{k=1}^{\infty} P_0(J_x \ge k)$$
and perform the sum of a geometric series.

Remark 1: The expectation $E_0 J_x$ is finite if $p \neq 1/2$. Consider the case $p < 1/2$. For $x > 0$, $E_0 J_x$ drops down to zero geometrically fast as $x$ tends to infinity. But for $x < 0$, we have $E_0 J_x = 1/(1 - 2p)$ regardless of $x$. In other words, if the random walk "drifts down" (i.e. $p < 1/2$) then, starting from 0, it will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae, i.e. $P_y(J_x \ge k) = P_0(J_{x-y} \ge k)$ and $E_y J_x = E_0 J_{x-y}$.
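The hitting and return probabilities of Theorems 35 and 37, on which the visit counts above rest, can be approximated by computing $P_0(T_x \le n)$ exactly for a large horizon $n$. The sketch below is illustrative only: the helper `hit_within`, the value $p = 0.3$ and the horizon 400 are assumptions made for the example, not something taken from the notes.

```python
def hit_within(p, start, target, horizon):
    """P_start(T_target <= horizon) for a simple random walk with up-probability p."""
    if start == target:
        return 1.0
    q = 1.0 - p
    alive = {start: 1.0}        # mass of paths that have not yet visited target
    hit = 0.0
    for _ in range(horizon):
        nxt = {}
        for pos, mass in alive.items():
            for step, prob in ((1, p), (-1, q)):
                new = pos + step
                if new == target:
                    hit += mass * prob          # first visit to target
                else:
                    nxt[new] = nxt.get(new, 0.0) + mass * prob
        alive = nxt
    return hit


p, q, horizon = 0.3, 0.7, 400

# Theorem 35: upward levels are reached with probability (p/q)^x when p < q ...
for x in (1, 2, 3):
    assert abs(hit_within(p, 0, x, horizon) - (p / q) ** x) < 1e-9
# ... while downward levels are reached with probability 1.
assert abs(hit_within(p, 0, -2, horizon) - 1.0) < 1e-9

# Theorem 37: condition on the first step to get the return probability 2(p ^ q).
ret = p * hit_within(p, 1, 0, horizon - 1) + q * hit_within(p, -1, 0, horizon - 1)
assert abs(ret - 2 * min(p, q)) < 1e-9
```

For $p \ne 1/2$ the probability of a first visit after time $n$ decays geometrically in $n$, which is why a finite horizon already agrees with the limiting values to many digits; for $p = 1/2$ the convergence is much slower and this check would need a far larger horizon.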

(36) 2 Lastly. . Theorem 40. random variables. random variables. we derived using duality. = 1− √ 1 − s2 s x . in Theorem 41. ξn ) has the same distribution as (ξn . . Sn > 0 | Sn = x) = But the left-hand side is P0 (Sn −Sn−k > 0. a sum of i. ξ2 . which is basically another probability for S. ξ1 ). Pause for reﬂection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. Here is an application of this: Theorem 41. Thus. 0 ≤ k ≤ n). 0 ≤ k ≤ n) has the same distribution as (Sk . Rewrite the ballot theorem terms of the dual: P0 (S1 > 0. for a simple random walk and for x > 0. In Lemma 19 we observed that. n x > 0. . . . P0 (Sn = x) P0 (Sn = x) x . Let Tx = inf{n ≥ 1 : Sn = x} be the ﬁrst time that x will be visited.32 Duality Consider random walk Sk = k ξi . Then P0 (Tx = n) = x P0 (Sn = x) . (Sk . 1 ≤ k ≤ n | Sn = x) = P0 (Sk < x. Use the obvious fact that (ξ1 . 0 ≤ k ≤ n − 1.e. i. for a simple symmetric random walk. n Proof. ξn−1 . because the two have the same distribution. . Tx is the sum of x i. 0 ≤ k ≤ n) of (Sk . . . . i=1 Then the dual (Sk . (35) In Theorem 29 we used the reﬂection principle and derived the distribution function of Tx for a simple symmetric random walk: 1 P0 (Tx ≤ n) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). . 0 ≤ k ≤ n) is deﬁned by Sk := Sn − Sn−k . the probabilities P0 (Tx = n) = x P0 (Sn = x).d. . each distributed as T1 . Sn = x) P0 (Tx = n) = .i. Let (Sk ) be a simple symmetric random walk and x a positive integer. Proof.d. Fix an integer n. and thus derived the generating function of Tx : E0 sTx = Specialising to the symmetric case we have E0 s Tx 1− 1 − 4pqs2 2qs x . . n (37) 98 .i. every time we have a probability for S we can translate it into a probability for S.

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of $T_x$. This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of $s^n$ in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
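Although the algebraic verification is not easy, the compatibility of (35), (36) and (37) can at least be checked numerically. The following sketch is an illustration under stated assumptions: the function names (`prob_S`, `pmf_T`, `cdf_T`, `pgf_T`), the ranges of $x$ and $n$, and the value $s = 0.9$ are chosen here for the example. Differences of (36) are compared with (37) in exact rational arithmetic, and the power series built from (37) is compared with (35) in floating point.

```python
import math
from fractions import Fraction


def prob_S(n, y):
    """P_0(S_n = y) for the simple symmetric random walk, as an exact fraction."""
    if abs(y) > n or (n + y) % 2:
        return Fraction(0)
    return Fraction(math.comb(n, (n + y) // 2), 2**n)


def pmf_T(x, n):
    """Formula (37): P_0(T_x = n) = (x/n) P_0(S_n = x), for n >= 1."""
    return Fraction(x, n) * prob_S(n, x)


def cdf_T(x, n):
    """Formula (36): P_0(T_x <= n) = P_0(|S_n| >= x) - (1/2) P_0(|S_n| = x)."""
    at_least = sum(prob_S(n, y) for y in range(-n, n + 1) if abs(y) >= x)
    exactly = prob_S(n, x) + prob_S(n, -x)
    return at_least - exactly / 2


def pgf_T(x, s):
    """Formula (35): E_0 s^{T_x} in the symmetric case."""
    return ((1 - math.sqrt(1 - s * s)) / s) ** x


for x in (1, 2, 3):
    # Taking differences in (36) gives (37), exactly.
    for n in range(1, 41):
        assert cdf_T(x, n) - cdf_T(x, n - 1) == pmf_T(x, n)
    # (37) is the coefficient of s^n in (35); the series is truncated at n = 400.
    s = 0.9
    series = sum(float(pmf_T(x, n)) * s**n for n in range(1, 401))
    assert abs(series - pgf_T(x, s)) < 1e-12
```

The truncation at $n = 400$ leaves out a tail of size at most $s^{400}$, which is negligible at the tolerance used.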

33 Amount of time that a SSRW is positive*

Consider a SSRW $(S_n)$ starting from 0. Define its trajectory as the polygonal line obtained by joining the points $(n, S_n)$ in the plane, as described earlier. Watch this trajectory up to time $2n$. Let $T_+(2n)$ be the number of sides of the polygonal line that are above the horizontal axis and $T_-(2n) = 2n - T_+(2n)$ the number of sides that are below. Clearly, $T_\pm(2n)$ are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of $T_+(2n)$ as the number of leads in $2n$ coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time $T_0'$, it behaves again like a SSRW and is independent of the past.

We are dealing with a fair coin which is tossed $2n$ times, so the event that the number of heads equals $k$ has maximum probability when $k = n$ (heads = tails). Most people take this statement and translate it into a statement about $T_+(2n)$, arguing that $P(T_+(2n) = m)$ is maximised when $m = n$, i.e. when $T_+(2n) = T_-(2n)$. This is wrong. In fact, as the theorem below shows, $P(T_+(2n) = m)$ is largest when $m = 0$ or $m = 2n$.

Theorem 42 (distribution of the total time the random walk is positive).
$$P(T_+(2n) = 2k) = P(S_{2k} = 0)\, P(S_{2n-2k} = 0), \qquad 0 \le k \le n.$$

Proof. The idea is to condition at the first time $T_0'$ that the RW returns to 0:
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(T_+(2n) = 2k,\ T_0' = 2r).$$
Between 0 and $T_0'$ the random walk is either above the horizontal axis or below; call the first event $A_+$ and the second $A_-$ and write
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(A_+, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_+, T_0' = 2r) + \sum_{r=1}^{n} P(A_-, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_-, T_0' = 2r).$$

In the first case,
$$P(T_+(2n) = 2k \mid A_+, T_0' = 2r) = P(T_+(2n - 2r) = 2k - 2r),$$
because, if $T_0' = 2r$ and $A_+$ occurs then the RW has already spent $2r$ units of time above the axis and so $T_+(2n)$ is reduced by $2r$. Furthermore, we may remove the conditioning because the future evolution of the RW after $T_0'$ is independent of the past. In the second case, a similar argument shows
$$P(T_+(2n) = 2k \mid A_-, T_0' = 2r) = P(T_+(2n - 2r) = 2k).$$

We also observe that, by symmetry,
$$P(A_+, T_0' = 2r) = P(A_-, T_0' = 2r) = \frac{1}{2} P(T_0' = 2r) = \frac{1}{2} f_{2r}.$$
Letting $p_{2k,2n} = P(T_+(2n) = 2k)$, we have shown that
$$p_{2k,2n} = \frac{1}{2} \sum_{r=1}^{k} f_{2r}\, p_{2k-2r,\,2n-2r} + \frac{1}{2} \sum_{r=1}^{n-k} f_{2r}\, p_{2k,\,2n-2r}.$$

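We want to show that $p_{2k,2n} = u_{2k} u_{2n-2k}$, where $u_{2k} = P(S_{2k} = 0)$. With this hypothesis, we can start playing with the recursion of the display above and, using induction on $n$, we will realise that
$$p_{2k,2n} = \frac{1}{2} u_{2n-2k} \sum_{r=1}^{k} f_{2r} u_{2k-2r} + \frac{1}{2} u_{2k} \sum_{r=1}^{n-k} f_{2r} u_{2n-2k-2r}.$$
For $1 \le k \le n - 1$, decomposing according to the first return to 0 gives $u_{2m} = \sum_{r=1}^{m} f_{2r} u_{2m-2r}$ for $m \ge 1$, so the first sum equals $u_{2k}$ and the second equals $u_{2n-2k}$; the right-hand side is therefore $u_{2k} u_{2n-2k}$, as required. (In the boundary cases $k = 0$ and $k = n$ one must also account for the paths that do not return to 0 by time $2n$, which the sum over $r$ does not include.)

Explicitly, we have
$$P(T_+(2n) = 2k) = \binom{2k}{k} \binom{2n-2k}{n-k} 2^{-2n}, \qquad 0 \le k \le n. \qquad (38)$$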

With $2n = 100$, we plot the function $P(T_+(100) = 2k)$ for $0 \le k \le 50$ below. Notice that it is maximised at the ends of the interval.

[Figure: plot of $P(T_+(100) = 2k)$ against $k$ for $0 \le k \le 50$; the vertical axis runs from about 0.02 to 0.08.]
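Formula (38) can be checked directly for small $n$ by enumerating all $2^{2n}$ paths and counting the sides of the trajectory that lie above the axis. This is only a sketch for illustration: the function names `positive_time_distribution` and `formula_38` are hypothetical, and a side is counted as "above" when at least one of its endpoints is strictly positive.

```python
from fractions import Fraction
from itertools import product
from math import comb


def positive_time_distribution(n):
    """Exact law of T_+(2n), obtained by enumerating all 2^(2n) paths."""
    counts = [0] * (n + 1)          # counts[k] = number of paths with T_+(2n) = 2k
    for steps in product((1, -1), repeat=2 * n):
        pos = 0
        above = 0
        for step in steps:
            new = pos + step
            # A side lies above the axis when either endpoint is strictly positive.
            if pos > 0 or new > 0:
                above += 1
            pos = new
        counts[above // 2] += 1
    return [Fraction(c, 4**n) for c in counts]


def formula_38(n, k):
    """Right-hand side of (38)."""
    return Fraction(comb(2 * k, k) * comb(2 * n - 2 * k, n - k), 4**n)


for n in range(1, 6):
    assert positive_time_distribution(n) == [formula_38(n, k) for k in range(n + 1)]

# With 2n = 100 the distribution is largest at the two ends, as in the plot.
values = [formula_38(50, k) for k in range(51)]
assert max(values) == values[0] == values[50]
assert min(values) == values[25]
```

The enumeration is exponential in $n$ and is meant only as a sanity check for small cases; for $2n = 100$ the closed form is evaluated instead.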

The arcsine law*

Use of Stirling's approximation in formula (38) gives
$$P(T_+(2n) = 2k) \sim \frac{1/\pi}{\sqrt{k(n-k)}}, \qquad \text{as } k \to \infty \text{ and } n - k \to \infty.$$

Now consider the calculation of the following probability:
$$P\Big(\frac{T_+(2n)}{2n} \in [a, b]\Big) = \sum_{na \le k \le nb} P\Big(\frac{T_+(2n)}{2n} = \frac{k}{n}\Big) \sim \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{k(n-k)}} = \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{\frac{k}{n} \cdot \frac{n-k}{n}}}\, \frac{1}{n}.$$

This is easily seen to have a limit as $n \to \infty$, the limit being
$$\int_a^b \frac{1/\pi}{\sqrt{t(1-t)}}\, dt,$$
because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
$$\lim_{n \to \infty} P\Big(\frac{T_+(2n)}{2n} \le x\Big) = \int_0^x \frac{1/\pi}{\sqrt{t(1-t)}}\, dt = \frac{2}{\pi} \arcsin \sqrt{x}.$$
We thus have that the limiting distribution of $T_+(2n)/(2n)$ (the fraction of time that the RW is positive), as $n \to \infty$, is the distribution function $\frac{2}{\pi} \arcsin \sqrt{x}$, $0 < x < 1$. This is known as the arcsine law. This distribution function has density $\frac{1/\pi}{\sqrt{x(1-x)}}$, $0 < x < 1$, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times $(T_x,\ x \ge 0)$. This is another random walk because the summands in $T_x = \sum_{y=1}^{x} (T_y - T_{y-1})$ are i.i.d., all distributed as $T_1$:
$$P(T_1 \ge 2n) = P(S_{2n} = 0) = \binom{2n}{n} 2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$

From this, we immediately get that the expectation of $T_1$ is infinity: $ET_1 = \infty$. Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates $(x, y)$ where $x, y \in \mathbb{Z}$. He starts at $(0, 0)$ and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in $\mathbb{Z}^2 = \mathbb{Z} \times \mathbb{Z}$: Let $\xi_1, \xi_2, \ldots$ be i.i.d. random vectors such that
$$P(\xi_n = (0, 1)) = P(\xi_n = (0, -1)) = P(\xi_n = (1, 0)) = P(\xi_n = (-1, 0)) = 1/4.$$
We then let
$$S_0 = (0, 0), \qquad S_n = \sum_{i=1}^{n} \xi_i, \quad n \ge 1.$$
If $x \in \mathbb{Z}^2$, we let $x^1$ be its horizontal coordinate and $x^2$ its vertical. So
$$S_n = (S_n^1, S_n^2) = \Big(\sum_{i=1}^{n} \xi_i^1,\ \sum_{i=1}^{n} \xi_i^2\Big).$$
We have:

Lemma 20. $(S_n^1,\ n \in \mathbb{Z}_+)$ is a random walk. $(S_n^2,\ n \in \mathbb{Z}_+)$ is also a random walk. The two random walks are not independent.

Proof. Since $\xi_1, \xi_2, \ldots$ are independent, so are $\xi_1^1, \xi_2^1, \ldots$, the latter being functions of the former. They are also identically distributed with
$$P(\xi_n^1 = 0) = P(\xi_n = (0, 1)) + P(\xi_n = (0, -1)) = 1/2,$$
$$P(\xi_n^1 = 1) = P(\xi_n = (1, 0)) = 1/4,$$
$$P(\xi_n^1 = -1) = P(\xi_n = (-1, 0)) = 1/4.$$
Hence $(S_n^1,\ n \in \mathbb{Z}_+)$ is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0. The same argument applies to $(S_n^2,\ n \in \mathbb{Z}_+)$, showing that it, too, is a random walk. However, the two stochastic processes are not independent. Indeed, for each $n$, the random variables $\xi_n^1, \xi_n^2$ are not independent simply because if one takes value 0 the other is nonzero.

Consider a change of coordinates: rotate the axes by 45° (and scale them), i.e. map $(x^1, x^2)$ into $(x^1 + x^2,\ x^1 - x^2)$. So let
$$R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2,\ S_n^1 - S_n^2).$$
We have:

Lemma 21. $(R_n,\ n \in \mathbb{Z}_+)$ is a random walk in two dimensions. $(R_n^1,\ n \in \mathbb{Z}_+)$ is a random walk in one dimension. $(R_n^2,\ n \in \mathbb{Z}_+)$ is a random walk in one dimension. The last two are independent.

Proof. We have $R_n^1 = \sum_{i=1}^{n} (\xi_i^1 + \xi_i^2)$, $R_n^2 = \sum_{i=1}^{n} (\xi_i^1 - \xi_i^2)$. So $R_n = \sum_{i=1}^{n} \eta_i$ where $\eta_n$ is the vector $\eta_n = (\xi_n^1 + \xi_n^2,\ \xi_n^1 - \xi_n^2)$. The sequence of vectors $\eta_1, \eta_2, \ldots$ is i.i.d., being obtained by taking the same function on the elements of the sequence $\xi_1, \xi_2, \ldots$. Hence $(R_n,\ n \in \mathbb{Z}_+)$ is a random walk in two dimensions. A similar argument shows that each of $(R_n^1,\ n \in \mathbb{Z}_+)$, $(R_n^2,\ n \in \mathbb{Z}_+)$ is a random walk in one dimension. To show that the last two are independent, we check that $\eta_n^1 = \xi_n^1 + \xi_n^2$, $\eta_n^2 = \xi_n^1 - \xi_n^2$ are independent for each $n$. The possible values of each of the $\eta_n^1$, $\eta_n^2$ are $\pm 1$. We have
$$P(\eta_n^1 = 1, \eta_n^2 = 1) = P(\xi_n = (1, 0)) = 1/4,$$
$$P(\eta_n^1 = -1, \eta_n^2 = -1) = P(\xi_n = (-1, 0)) = 1/4,$$
$$P(\eta_n^1 = 1, \eta_n^2 = -1) = P(\xi_n = (0, 1)) = 1/4,$$
$$P(\eta_n^1 = -1, \eta_n^2 = 1) = P(\xi_n = (0, -1)) = 1/4.$$
Hence $P(\eta_n^1 = 1) = P(\eta_n^1 = -1) = 1/2$ and $P(\eta_n^2 = 1) = P(\eta_n^2 = -1) = 1/2$. And so $P(\eta_n^1 = \varepsilon, \eta_n^2 = \varepsilon') = P(\eta_n^1 = \varepsilon)\, P(\eta_n^2 = \varepsilon')$ for all choices of $\varepsilon, \varepsilon' \in \{-1, +1\}$.

Lemma 22.
$$P(S_{2n} = 0) = \binom{2n}{n}^2 2^{-4n}.$$

Proof.
$$P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0, R_{2n}^2 = 0) = P(R_{2n}^1 = 0)\, P(R_{2n}^2 = 0),$$
where the last equality follows from the independence of the two random walks. But $(R_n^1)$ is a simple symmetric random walk. So $P(R_{2n}^1 = 0) = \binom{2n}{n} 2^{-2n}$. The same is true for $(R_n^2)$.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that $\sum_{n=0}^{\infty} P(S_n = 0) = \infty$. But by the previous lemma, and by Stirling's approximation, $P(S_{2n} = 0) \sim \frac{c}{n}$, as $n \to \infty$, where $c$ is a positive constant, in the sense that the ratio of the two sides goes to 1. Since $\sum_n 1/n = \infty$, the series diverges.

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let $a < 0 < b$. Recall, from Theorem 7, that for a simple symmetric random walk (started from $S_0 = 0$),
$$P(T_a < T_b) = \frac{b}{b - a}, \qquad P(T_b < T_a) = \frac{-a}{b - a}.$$
We can read this as follows:
$$P(S_{T_a \wedge T_b} = a) = \frac{b}{b - a}, \qquad P(S_{T_a \wedge T_b} = b) = \frac{-a}{b - a}.$$

This last display tells us the distribution of $S_{T_a \wedge T_b}$. Note, in particular, that $E S_{T_a \wedge T_b} = 0$. Conversely, given a 2-point probability distribution $(p, 1 - p)$, such that $p$ is rational, we can always find integers $a, b$, $a < 0 < b$, such that $pa + (1 - p)b = 0$. Indeed, if $p = M/N$, $M < N$, $M, N \in \mathbb{N}$, choose, e.g., $a = M - N$, $b = M$. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval $[a, b]$. The probability it exits from the left point is $p$, and the probability it exits from the right point is $1 - p$.

How about more complicated distributions? Let us suppose we are given a probability distribution $(p_i,\ i \in \mathbb{Z})$. Can we find a stopping time $T$ such that $S_T$ has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let $(S_n)$ be a simple symmetric random walk and $(p_i,\ i \in \mathbb{Z})$ a given probability distribution on $\mathbb{Z}$ with zero mean: $\sum_{i \in \mathbb{Z}} i p_i = 0$. Then there exists a stopping time $T$ such that
$$P(S_T = i) = p_i, \qquad i \in \mathbb{Z}.$$

Proof. Define a pair of random variables $(A, B)$ taking values in $\{(0, 0)\} \cup ((-\mathbb{N}) \times \mathbb{N})$ with the following distribution:
$$P(A = i, B = j) = c^{-1} (j - i) p_i p_j, \quad i < 0 < j, \qquad P(A = 0, B = 0) = p_0,$$
where $c^{-1}$ is chosen by normalisation:
$$P(A = 0, B = 0) + \sum_{i < 0} \sum_{j > 0} P(A = i, B = j) = 1.$$
This leads to:
$$c^{-1} \sum_{j > 0} j p_j \sum_{i < 0} p_i + c^{-1} \sum_{i < 0} (-i) p_i \sum_{j > 0} p_j = 1 - p_0.$$
Since $\sum_{i \in \mathbb{Z}} i p_i = 0$, we have this common value:
$$\sum_{i < 0} (-i) p_i = \sum_{i > 0} i p_i,$$
and, using this, we get that $c$ is precisely
$$c = \sum_{i < 0} (-i) p_i = \sum_{i > 0} i p_i.$$
Letting $T_i = \inf\{n \ge 0 : S_n = i\}$, $i \in \mathbb{Z}$, and assuming that $(A, B)$ are independent of $(S_n)$, we consider the random variable
$$Z = S_{T_A \wedge T_B}.$$
The claim is that
$$P(Z = k) = p_k, \qquad k \in \mathbb{Z}.$$
To see this, suppose first $k = 0$. Then $Z = 0$ if and only if $A = B = 0$ and this has probability $p_0$.

Suppose next $k > 0$. We have
$$P(Z = k) = P(S_{T_A \wedge T_B} = k) = \sum_{i < 0} \sum_{j > 0} P(S_{T_i \wedge T_j} = k)\, P(A = i, B = j) = \sum_{i < 0} P(S_{T_i \wedge T_k} = k)\, P(A = i, B = k)$$
$$= \sum_{i < 0} \frac{-i}{k - i}\, c^{-1} (k - i) p_i p_k = c^{-1} \sum_{i < 0} (-i) p_i\, p_k = p_k,$$
by the definition of $c$. Similarly, $P(Z = k) = p_k$ for $k < 0$.
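The construction in the proof can be verified exactly for a distribution with finite support. The sketch below is an illustration, not part of the original argument: the function name `embedded_law` and the example distribution are assumptions. It builds the law of $(A, B)$, combines it with the gambler's ruin exit probabilities, and checks that the resulting law of $Z$ is the target distribution.

```python
from fractions import Fraction


def embedded_law(p):
    """Law of Z = S_{T_A ^ T_B} for a zero-mean distribution p = {i: p_i} on Z."""
    assert sum(p.values()) == 1
    assert sum(i * pi for i, pi in p.items()) == 0
    c = sum(i * pi for i, pi in p.items() if i > 0)     # normalising constant
    law = {0: p.get(0, Fraction(0))}                    # Z = 0 iff A = B = 0
    if c == 0:
        return law
    for i, pi in p.items():
        if i >= 0:
            continue
        for j, pj in p.items():
            if j <= 0:
                continue
            weight = (j - i) * pi * pj / c              # P(A = i, B = j)
            # Gambler's ruin: exit [i, j] at j w.p. -i/(j-i), at i w.p. j/(j-i).
            law[j] = law.get(j, 0) + weight * Fraction(-i, j - i)
            law[i] = law.get(i, 0) + weight * Fraction(j, j - i)
    return law


# A hypothetical zero-mean target distribution on {-2, -1, 0, 1, 3}.
target = {
    -2: Fraction(3, 10),
    -1: Fraction(1, 10),
    0: Fraction(1, 5),
    1: Fraction(1, 4),
    3: Fraction(3, 20),
}
assert embedded_law(target) == target
```

Only the law of $Z$ is computed here; simulating the walk itself until time $T_A \wedge T_B$ would give the same distribution empirically, at the cost of sampling error.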

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set $\mathbb{N}$ of natural numbers is the set of positive integers: $\mathbb{N} = \{1, 2, 3, \ldots\}$. The set $\mathbb{Z}$ of integers consists of $\mathbb{N}$, their negatives and zero. Zero is denoted by $0$.

$$\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, \ldots\}.$$

Given integers $a$, $b$, with $b \neq 0$, we say that $b$ divides $a$ if there is an integer $k$ such that $a = kb$. We write this as $b \mid a$. If $c \mid b$ and $b \mid a$ then $c \mid a$. If $a \mid b$ and $b \mid a$ then $a = \pm b$ (so $a = b$ when both are positive).

Now, if $a$, $b$ are arbitrary integers, with at least one of them nonzero, we can define their greatest common divisor $d = \gcd(a, b)$ as the unique positive integer such that
(i) $d \mid a$, $d \mid b$;
(ii) if $c \mid a$ and $c \mid b$, then $c \mid d$.
(That it is unique is obvious from the fact that if $d_1$, $d_2$ are two such numbers then $d_1$ would divide $d_2$ and vice-versa, leading to $d_1 = d_2$.)

Notice that:
$$\gcd(a, b) = \gcd(a, b - a).$$
Indeed, let $d = \gcd(a, b)$. We will show that $d = \gcd(a, b - a)$. We have $d \mid a$ and $d \mid b$. Therefore $d \mid b - a$. So (i) holds. Now suppose that $c \mid a$ and $c \mid b - a$. Then $c \mid a + (b - a)$, i.e. $c \mid b$. Because $c \mid a$ and $c \mid b$, and $d = \gcd(a, b)$, we have $c \mid d$. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd of two numbers: Suppose $0 < a < b$ are integers. Replace $b$ by $b - a$. If $b - a > a$, replace $b - a$ by $b - 2a$. Keep doing this until you obtain $b - qa < a$. Then interchange the roles of the numbers and continue. For instance,

$$\gcd(120, 38) = \gcd(82, 38) = \gcd(44, 38) = \gcd(6, 38)$$
$$= \gcd(6, 32) = \gcd(6, 26) = \gcd(6, 20) = \gcd(6, 14) = \gcd(6, 8) = \gcd(6, 2)$$
$$= \gcd(4, 2) = \gcd(2, 2) = 2.$$

But in the first line we subtracted 38 exactly 3 times, and 3 is the maximum number of times that 38 "fits" into 120. Indeed $4 \times 38 > 120$. In other words, we may summarise the first line by saying that we replace the larger of the two numbers by the remainder of the division of the larger by the smaller. Similarly with the second line. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school).
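The two procedures described above can be sketched in a few lines. This is only an illustration under the assumption of positive integer inputs; the function names are made up for this example. The first function uses nothing but the identity gcd(a, b) = gcd(a, b − a); the second replaces the larger number by the remainder of the division, as in the faster version.

```python
def gcd_by_subtraction(a: int, b: int) -> int:
    """gcd of two positive integers using only gcd(a, b) = gcd(a, b - a)."""
    while a != b:
        if a < b:
            b -= a          # replace the larger number by the difference
        else:
            a -= b
    return a


def gcd_by_remainder(a: int, b: int) -> int:
    """Same result, replacing the larger number by the remainder of the division."""
    while b != 0:
        a, b = b, a % b     # one step here stands for many subtractions
    return a


print(gcd_by_subtraction(120, 38))  # 2
print(gcd_by_remainder(120, 38))    # 2
```

On the example of the text, the remainder version passes through the pairs (38, 6), (6, 2), (2, 0), which are exactly the ends of the three lines of the subtraction chain.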

The theorem of division says the following: Given two positive integers $b$, $a$, with $a > b$, we can find an integer $q$ and an integer $r \in \{0, 1, \ldots, b - 1\}$ such that
$$a = qb + r.$$
Its proof is obvious. The long division algorithm is a process for finding the quotient $q$ and the remainder $r$. For example, to divide 3450 by 26 we do

          132
        ------
   26 | 3450
        26
        --
         85
         78
         --
          70
          52
          --
          18

and obtain $q = 132$, $r = 18$.

Here are some other facts about the greatest common divisor:

First, we have the fact that if $d = \gcd(a, b)$ then there are integers $x$, $y$ such that
$$d = xa + yb.$$
To see why this is true, let us follow the previous example again:
$$\gcd(120, 38) = \gcd(6, 38) = \gcd(6, 2) = 2.$$
In the process, we used the divisions
$$120 = 3 \times 38 + 6, \qquad 38 = 6 \times 6 + 2.$$
Working backwards, we have
$$2 = 38 - 6 \times 6 = 38 - 6 \times (120 - 3 \times 38) = 19 \times 38 - 6 \times 120.$$
Thus, $\gcd(38, 120) = 38x + 120y$, with $x = 19$, $y = -6$. The same logic works in general.

Second, we can show that, for integers $a, b \in \mathbb{Z}$, not both zero, we have
$$\gcd(a, b) = \min\{xa + yb : x, y \in \mathbb{Z},\ xa + yb > 0\}.$$
To see this, let $d$ be the right hand side. Then $d = xa + yb$, for some $x, y \in \mathbb{Z}$. Suppose that $c \mid a$ and $c \mid b$. Then $c$ divides any linear combination of $a$, $b$; in particular, it divides $d$. This shows (ii) of the definition of a greatest common divisor. We need to show (i),

i.e. that $d \mid a$ and $d \mid b$. Suppose this is not true, i.e. that $d$ does not divide both numbers, say, e.g., that $d$ does not divide $b$. Since $d > 0$, we can divide $b$ by $d$ to find
$$b = qd + r,$$
for some $1 \le r \le d - 1$. But then $b = q(xa + yb) + r$, yielding
$$r = (-qx)a + (1 - qy)b.$$
So $r$ is expressible as a linear combination of $a$ and $b$ and is strictly smaller than $d$ (being a remainder). So $d$ is not minimum as claimed, and this is a contradiction; so $r = 0$; so $d$ divides both $a$ and $b$.

The greatest common divisor of 3 integers can now be defined in a similar manner. Finally, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set $A$ is a mathematical notion that establishes a way of pairing elements of $A$. For example, the relation $<$ between real numbers can be thought of as consisting of pairs of numbers $(x, y)$ such that $x$ is smaller than $y$. In general, a relation can be thought of as a set of pairs of elements of the set $A$.

A relation is called reflexive if it contains all pairs $(x, x)$, $x \in A$. The equality relation $=$ is reflexive. The relation $<$ is not reflexive. A relation is called symmetric if it cannot contain a pair $(x, y)$ without containing its symmetric $(y, x)$. The equality relation is symmetric. The relation $<$ is not symmetric. A relation is transitive if, whenever it contains $(x, y)$ and $(y, z)$, it also contains $(x, z)$. Both $<$ and $=$ are transitive.

If we take $A$ to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric; it can be thought of as reflexive (unless we refuse to accept that one may be a friend of oneself), but it is not necessarily transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of $x$ contains all $y$ that are related (are equivalent) to $x$. We encountered the equivalence relation "i communicates with j" in Markov chain theory.

Functions

A function $f$ from a set $X$ into a set $Y$ is denoted by $f : X \to Y$. The range of $f$ is the set of its values, i.e. the set $\{f(x) : x \in X\}$. A function is called one-to-one (or injective) if two different $x_1, x_2 \in X$ have different values $f(x_1)$, $f(x_2)$. A function is called onto (or surjective) if its range is $Y$. If $B \subset Y$ then $f^{-1}(B)$ consists of all $x \in X$ with $f(x) \in B$. The set of functions from $X$ to $Y$ is denoted by $Y^X$.

Counting

Let $A$, $B$ be finite sets with $n$, $m$ elements, respectively.

The number of functions from $A$ to $B$ (i.e. the cardinality of the set $B^A$) is $m^n$. Indeed, we may think of the elements of $A$ as animals and those of $B$ as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the $m$ available names; so can the second; and the third, and so on. Hence the total number of assignments is
$$\underbrace{m \times m \times \cdots \times m}_{n \text{ times}} = m^n.$$

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need at least as many names as animals: $n \le m$. Each assignment of this type is a one-to-one function from $A$ to $B$. Since the name of the first animal can be chosen in $m$ ways, that of the second in $m - 1$ ways, and so on, the total number of one-to-one functions from $A$ to $B$ is
$$(m)_n := m(m - 1)(m - 2) \cdots (m - n + 1).$$
In the special case $m = n$, we denote $(n)_n$ also by $n!$; in other words, $n!$ stands for the product of the first $n$ natural numbers:
$$n! = 1 \times 2 \times 3 \times \cdots \times n.$$
Incidentally, a one-to-one function from the set $A$ into $A$ is called a permutation of $A$. Thus, $n!$ is the total number of permutations of a set of $n$ elements.

Any set $A$ contains several subsets. If $A$ has $n$ elements, the number of subsets with $k$ elements each can be found by designating which of the $n$ elements of $A$ will be chosen. The first element can be chosen in $n$ ways, the second in $n - 1$, and so on, the $k$-th in $n - k + 1$ ways. So we have
$$(n)_k = n(n - 1)(n - 2) \cdots (n - k + 1)$$
ways to pick $k$ elements if we care about their order. If we don't, then we divide by the number $k!$ of permutations of the $k$ elements, concluding that the number of subsets with $k$ elements of a set with $n$ elements equals
$$\frac{(n)_k}{k!} =: \binom{n}{k},$$
where the symbol $\binom{n}{k}$ is merely a name for the number computed on the left side of this equation. Clearly,
$$\binom{n}{k} = \binom{n}{n - k} = \frac{n!}{k!\,(n - k)!}.$$
Since there is only one subset with no elements (it is the empty set $\emptyset$), we have
$$\binom{n}{0} = 1.$$
The total number of subsets of $A$ (regardless of their number of elements) is thus equal to
$$\sum_{k=0}^{n} \binom{n}{k} = (1 + 1)^n = 2^n,$$
and this follows from the binomial theorem. We thus observe that there are as many subsets of $A$ as there are elements in the set $\{0, 1\}^n$ of sequences of length $n$ with elements 0 or 1.
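The counting formulas above can be checked by brute-force enumeration on a small case. The sketch below is only an illustration: the animal and name lists are made up, and it relies on the standard-library functions `itertools.product`, `itertools.permutations`, `itertools.combinations`, `math.perm` and `math.comb` (Python 3.8 or later).

```python
from itertools import combinations, permutations, product
from math import comb, factorial, perm

animals = ["cat", "dog", "pig"]            # the set A, n = 3
names = ["Ann", "Bob", "Cy", "Dee"]        # the set B, m = 4
n, m = len(animals), len(names)

# All functions A -> B: one name per animal, repetitions allowed.
functions = list(product(names, repeat=n))
assert len(functions) == m ** n

# One-to-one functions: no name is used twice.
injections = list(permutations(names, n))
assert len(injections) == perm(m, n)       # (m)_n = m(m-1)...(m-n+1)

# Permutations of A, and k-element subsets of A.
assert len(list(permutations(animals))) == factorial(n)
for k in range(n + 1):
    subsets = list(combinations(animals, k))
    assert len(subsets) == comb(n, k) == perm(n, k) // factorial(k)

# Subsets of A correspond to 0/1 sequences of length n.
zero_one_sequences = list(product([0, 1], repeat=n))
assert sum(comb(n, k) for k in range(n + 1)) == len(zero_one_sequences) == 2 ** n
```

Enumeration grows very quickly with n and m, so this is useful only as a sanity check on tiny sets; the formulas are the point.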

We shall let $\binom{n}{k} = 0$ if $n < k$. Also, by convention, if $x$ is a real number, we define
$$\binom{x}{m} = \frac{x(x - 1) \cdots (x - m + 1)}{m!}.$$
If $y$ is not an integer then, by convention, we let $\binom{n}{y} = 0$.

The Pascal triangle relation states that
$$\binom{n + 1}{k} = \binom{n}{k} + \binom{n}{k - 1}.$$
Indeed, in choosing $k$ animals from a set of $n + 1$ ones, consider a specific animal, the pig, say. If the pig is not included in the sample, we choose $k$ animals from the remaining $n$ ones and this can be done in $\binom{n}{k}$ ways. If the pig is included, then there are $k - 1$ remaining animals to be chosen and this can be done in $\binom{n}{k - 1}$ ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick $k$ animals from $n + 1$ animals.

Maximum and minimum

The minimum of two numbers $a$, $b$ is denoted by
$$a \wedge b = \min(a, b).$$
The maximum is denoted by
$$a \vee b = \max(a, b).$$
Thus $a \wedge b = a$ if $a \le b$, or $= b$ if $b \le a$. Note that
$$-(a \vee b) = (-a) \wedge (-b)$$
and
$$(a \wedge b) + c = (a + c) \wedge (b + c).$$
In particular, we use the following notations:
$$a^+ = \max(a, 0), \qquad a^- = -\min(a, 0).$$
Both quantities are nonnegative. $a^+$ is called the positive part of $a$, and $a^-$ is called the negative part of $a$. Note that it is always true that
$$a = a^+ - a^-.$$
The absolute value of $a$ is defined by
$$|a| = a^+ + a^-.$$
The notation extends to more than two numbers. For example, $a \vee b \vee c$ refers to the maximum of the three numbers $a$, $b$, $c$, and we don't care to put parentheses because the order of finding a maximum is irrelevant.
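The identities for minimum, maximum, positive part and negative part are easy to confirm numerically. The following is a minimal sketch; the helper names `pos` and `neg` and the sample triples are chosen here only for illustration.

```python
def pos(a: float) -> float:
    """Positive part a+ = max(a, 0)."""
    return max(a, 0)


def neg(a: float) -> float:
    """Negative part a- = -min(a, 0)."""
    return -min(a, 0)


for a, b, c in [(3, -2, 5), (-1.5, -1.5, 0), (0, 7, -4)]:
    assert -max(a, b) == min(-a, -b)              # -(a v b) = (-a) ^ (-b)
    assert min(a, b) + c == min(a + c, b + c)     # (a ^ b) + c = (a + c) ^ (b + c)
    assert pos(a) >= 0 and neg(a) >= 0            # both parts are nonnegative
    assert a == pos(a) - neg(a)                   # a = a+ - a-
    assert abs(a) == pos(a) + neg(a)              # |a| = a+ + a-
    assert max(max(a, b), c) == max(a, max(b, c)) # order of taking maxima is irrelevant
```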

Indicator function

$\mathbf{1}_A$ stands for the indicator function of a set $A$, meaning a function that takes value 1 on $A$ and 0 on the complement $A^c$ of $A$. We also use the notation $\mathbf{1}(\text{clause})$ for a number that is 1 whenever the clause is true or 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that
$$\mathbf{1}_{A \cap B} = \mathbf{1}_A \mathbf{1}_B, \qquad 1 - \mathbf{1}_A = \mathbf{1}_{A^c}.$$
Several quantities of interest are expressed in terms of indicators. For instance, if $A_1, A_2, \ldots$ are sets or clauses then the number $\sum_{n \ge 1} \mathbf{1}_{A_n}$ is the number of true clauses. Thus, if I go to a bank $N$ times and $A_i$ is the clause "I am robbed the $i$-th time" then $\sum_{i=1}^{N} \mathbf{1}_{A_i}$ is the total number of times I am robbed.

If $A$ is an event then $\mathbf{1}_A$ is a random variable. It takes value 1 with probability $P(A)$ and 0 with probability $P(A^c)$. Therefore its expectation is
$$E\,\mathbf{1}_A = 1 \times P(A) + 0 \times P(A^c) = P(A).$$
This is worth remembering. An example of its application: Since
$$\mathbf{1}_{A \cup B} = 1 - \mathbf{1}_{(A \cup B)^c} = 1 - \mathbf{1}_{A^c \cap B^c} = 1 - \mathbf{1}_{A^c} \mathbf{1}_{B^c} = 1 - (1 - \mathbf{1}_A)(1 - \mathbf{1}_B) = \mathbf{1}_A + \mathbf{1}_B - \mathbf{1}_A \mathbf{1}_B = \mathbf{1}_A + \mathbf{1}_B - \mathbf{1}_{A \cap B},$$
we have
$$E[\mathbf{1}_{A \cup B}] = E[\mathbf{1}_A] + E[\mathbf{1}_B] - E[\mathbf{1}_{A \cap B}],$$
which means
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Another example: Let $X$ be a random variable with values in $\mathbb{N} = \{1, 2, \ldots\}$. Then
$$X = \sum_{n=1}^{X} 1$$
(the sum of $X$ ones is $X$). So
$$X = \sum_{n \ge 1} \mathbf{1}(X \ge n).$$
So
$$E[X] = \sum_{n \ge 1} E\,\mathbf{1}(X \ge n) = \sum_{n \ge 1} P(X \ge n).$$

Matrices

Let $\mathbb{R}^d$ be the set of vectors with $d$ components. A linear function $\varphi$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ is such that
$$\varphi(\alpha x + \beta y) = \alpha \varphi(x) + \beta \varphi(y),$$

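for all $x, y \in \mathbb{R}^n$, and all $\alpha, \beta \in \mathbb{R}$. Let $u_1, \ldots, u_n$ be a basis for $\mathbb{R}^n$. Then any $x \in \mathbb{R}^n$ can be uniquely written as
$$x = \sum_{j=1}^{n} x^j u_j,$$
where $x^1, \ldots, x^n \in \mathbb{R}$. The column
$$\begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}$$
is a column representation of $x$ with respect to the given basis. Then, because $\varphi$ is linear,
$$y := \varphi(x) = \sum_{j=1}^{n} x^j \varphi(u_j).$$
But the $\varphi(u_j)$ are vectors in $\mathbb{R}^m$. So if we choose a basis $v_1, \ldots, v_m$ in $\mathbb{R}^m$ we can express each $\varphi(u_j)$ as a linear combination of the basis vectors:
$$\varphi(u_j) = \sum_{i=1}^{m} a_{ij} v_i.$$
Combining, we have
$$\varphi(x) = \sum_{i=1}^{m} v_i \sum_{j=1}^{n} a_{ij} x^j.$$
Let
$$y^i := \sum_{j=1}^{n} a_{ij} x^j, \qquad i = 1, \ldots, m.$$
The column
$$\begin{pmatrix} y^1 \\ \vdots \\ y^m \end{pmatrix}$$
is a column representation of $y := \varphi(x)$ with respect to the given basis $v_1, \ldots, v_m$. The matrix $A$ of $\varphi$ with respect to the choice of the two bases is defined by
$$A := \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}.$$
It is an $m \times n$ matrix. The relation between the column representation of $y$ and the column representation of $x$ is written as
$$\begin{pmatrix} y^1 \\ \vdots \\ y^m \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}.$$

Suppose that $\psi$, $\varphi$ are linear functions:
$$\mathbb{R}^n \xrightarrow{\ \psi\ } \mathbb{R}^m \xrightarrow{\ \varphi\ } \mathbb{R}^\ell.$$
Then $\varphi \circ \psi$ is a linear function:
$$\mathbb{R}^n \xrightarrow{\ \varphi \circ \psi\ } \mathbb{R}^\ell.$$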
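The construction of the matrix from the images of the basis vectors can be sketched as follows. The linear map `phi` below is a hypothetical example invented for this illustration, and both bases are taken to be the standard ones, so that coordinates and components coincide.

```python
from fractions import Fraction


def phi(x):
    """A hypothetical linear map R^3 -> R^2, chosen only for illustration."""
    x1, x2, x3 = x
    return [2 * x1 - x3, x1 + 3 * x2]


n, m = 3, 2
standard_basis = [[1 if i == j else 0 for i in range(n)] for j in range(n)]

# Column j of A holds the coordinates of phi(u_j) in the basis of R^m.
columns = [phi(u) for u in standard_basis]
A = [[columns[j][i] for j in range(n)] for i in range(m)]   # an m x n matrix


def apply_matrix(A, x):
    """Coordinates of y = phi(x): y^i = sum_j a_ij x^j."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


x = [Fraction(1, 2), -4, 7]
assert apply_matrix(A, x) == phi(x)
print(A)  # [[2, 0, -1], [1, 3, 0]]
```

With a different choice of bases the same map would be represented by a different matrix; only the recipe (express each image of a basis vector in the target basis) stays the same.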

Fix three sets of bases on $\mathbb{R}^n$, $\mathbb{R}^m$ and $\mathbb{R}^\ell$ and let $A$, $B$ be the matrix representations of $\varphi$, $\psi$, respectively, with respect to these bases. Then the matrix representation of $\varphi \circ \psi$ is $AB$, where
$$(AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}, \qquad 1 \le i \le \ell, \quad 1 \le j \le n.$$
So $A$ is an $\ell \times m$ matrix, $B$ is an $m \times n$ matrix, and $AB$ is an $\ell \times n$ matrix.

A matrix $A$ is called square if it is an $n \times n$ matrix. If we fix a basis in $\mathbb{R}^n$, a square matrix represents a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^n$, i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in $\mathbb{R}^d$ consists of the vectors
$$e_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad e_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad e_d = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$
In other words, $e_i^j = \mathbf{1}(i = j)$.

Supremum and infimum

When we have a set of numbers $I$ (for instance, a sequence $a_1, a_2, \ldots$ of numbers) then the minimum of $I$ is denoted by $\min I$ and refers to a number $c \in I$ such that $c \le x$ for all the numbers $x \in I$. Similarly, we define $\max I$. This minimum (or maximum) may not exist. For instance $I = \{1, 1/2, 1/3, \ldots\}$ has no minimum. In these cases we use the notation $\inf I$, standing for the "infimum" of $I$, defined as the maximum of all the numbers $y$ with the property $y \le x$ for all $x \in I$. A property of real numbers says that if the set $I$ is bounded from below then this maximum of all lower bounds exists, and it is this that we call the infimum. Likewise, we define the supremum $\sup I$ to be the minimum of all upper bounds. If $I$ has no lower bound then $\inf I = -\infty$. If $I$ has no upper bound then $\sup I = +\infty$. If $I$ is empty then $\inf \emptyset = +\infty$, $\sup \emptyset = -\infty$.

When finite, $\inf I$ is characterised by: $\inf I \le x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \le \varepsilon + \inf I$. Likewise, $\sup I$ is characterised by: $\sup I \ge x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \ge -\varepsilon + \sup I$.

Limsup and liminf

If $x_1, x_2, \ldots$ is a sequence of numbers then
$$\inf\{x_1, x_2, \ldots\} \le \inf\{x_2, x_3, \ldots\} \le \inf\{x_3, x_4, \ldots\} \le \cdots$$
and, since any increasing sequence has a limit (possibly $+\infty$), we take this limit and call it $\liminf x_n$ (limit inferior) of the sequence. So
$$\liminf x_n = \lim_{n \to \infty} \inf\{x_n, x_{n+1}, x_{n+2}, \ldots\}.$$
Similarly, we define $\limsup x_n = \lim_{n \to \infty} \sup\{x_n, x_{n+1}, x_{n+2}, \ldots\}$, the limit of a decreasing sequence. We note that $\liminf(-x_n) = -\limsup x_n$. The sequence $x_n$ has a limit $\ell$ if and only if $\limsup x_n = \liminf x_n = \ell$.

Probability Theory

I do not require$^{11}$ knowledge of axiomatic Probability Theory (with Measure Theory). But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides $P(\Omega) = 1$) there is essentially one axiom, namely:

If $A_1, A_2, \ldots$ are mutually disjoint events, then
$$P\Big(\bigcup_{n \ge 1} A_n\Big) = \sum_{n \ge 1} P(A_n).$$

Everything else, in these notes, and everywhere else, follows from this axiom. That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.

If $A$, $B$ are disjoint events then $P(A \cup B) = P(A) + P(B)$. This can be generalised to any finite number of events. If the events $A$, $B$ are not disjoint then $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. This is the inclusion-exclusion formula for two events.

If $A$ is an event then $\mathbf{1}_A$ or $\mathbf{1}(A)$ is the indicator random variable, i.e. the function that takes value 1 on $A$ and 0 on $A^c$. Therefore, $E\,\mathbf{1}_A = P(A)$. Note that $\mathbf{1}_A \mathbf{1}_B = \mathbf{1}_{A \cap B}$, and $\mathbf{1}_{A^c} = 1 - \mathbf{1}_A$.

If $A_n$ are events with $P(A_n) = 0$ then $P(\cup_n A_n) = 0$. Indeed, $P(\cup_n A_n) \le \sum_n P(A_n)$. Similarly, if $A_n$ are events with $P(A_n) = 1$ then $P(\cap_n A_n) = 1$.

Usually, showing that $P(A) \le P(B)$ is more of a logical problem than a numerical one: we need to show that $A$ implies $B$ (i.e. $A$ is a subset of $B$). Similarly, to show that $EX \le EY$, we often need to argue that $X \le Y$. For example, Markov's inequality for a positive random variable $X$ states that $P(X > x) \le EX/x$. This follows from the logical statement $x\,\mathbf{1}(X > x) \le X$, which holds since $X$ is positive.

In Markov chain theory, we work a lot with conditional probabilities. Recall that $P(B \mid A)$ is defined by $P(A \cap B)/P(A)$ (when $P(A) > 0$) and is a well-motivated definition. Often, to compute $P(A \cap B)$ we read the definition as $P(A \cap B) = P(A) P(B \mid A)$. This is generalisable to a number of events:
$$P(A_1 \cap A_2 \cap A_3) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 \cap A_2).$$
If we interchange the roles of $A$, $B$ in the definition, we get
$$P(A \cap B) = P(A) P(B \mid A) = P(B) P(A \mid B)$$
and this is called Bayes' theorem.

Quite often, to compute $P(A)$, it is useful to find disjoint events $B_1, B_2, \ldots$, such that $\cup_n B_n = \Omega$, in which case we write
$$P(A) = \sum_n P(A \cap B_n) = \sum_n P(A \mid B_n) P(B_n).$$

If the events $A_1, A_2, \ldots$ are increasing with $A = \cup_n A_n$ then $P(A) = \lim_{n \to \infty} P(A_n)$. Similarly, if the events $A_1, A_2, \ldots$ are decreasing with $A = \cap_n A_n$ then $P(A) = \lim_{n \to \infty} P(A_n)$.

$^{11}$ Of course, by not requiring this, I do sometimes "cheat" from a strict mathematical point of view, but not too much...
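Several of the identities on this page (inclusion-exclusion, the two factorisations of P(A ∩ B), and the decomposition of P(A) over disjoint events) can be verified on a tiny sample space. The sketch below assumes one roll of a fair die; the events A and B and the helper `P` are chosen purely for illustration.

```python
from fractions import Fraction

omega = range(1, 7)                               # one roll of a fair die


def P(event) -> Fraction:
    """Probability of an event, computed as the expectation of its indicator."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


A = lambda w: w % 2 == 0                          # "the outcome is even"
B = lambda w: w >= 4                              # "the outcome is at least 4"
A_and_B = lambda w: A(w) and B(w)
A_or_B = lambda w: A(w) or B(w)

# Inclusion-exclusion for two events.
assert P(A_or_B) == P(A) + P(B) - P(A_and_B)

# Conditional probabilities and the two factorisations of P(A and B).
P_B_given_A = P(A_and_B) / P(A)
P_A_given_B = P(A_and_B) / P(B)
assert P(A) * P_B_given_A == P(B) * P_A_given_B == P(A_and_B)

# Total probability over the disjoint events B_k = {outcome = k}, whose union is omega.
total = sum(P(lambda w, k=k: A(w) and w == k) for k in omega)
assert total == P(A)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equalities hold exactly rather than up to rounding.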

An application of this continuity property (for increasing events) is:
$$P(X < \infty) = \lim_{x \to \infty} P(X \le x).$$
In Markov chain theory, we encounter situations where $P(X = \infty)$ may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable $X$ may be defined by
$$EX = \int_0^\infty P(X > x)\,dx.$$
The formula holds for any positive random variable. For a random variable which takes positive and negative values, we define $X^+ = \max(X, 0)$, $X^- = -\min(X, 0)$, which are both nonnegative, and then define
$$EX = EX^+ - EX^-,$$
which is motivated by the algebraic identity $X = X^+ - X^-$. This definition makes sense only if $EX^+ < \infty$ or if $EX^- < \infty$. In case the random variable is discrete, the formula reduces to the known one, viz.,
$$EX = \sum_x x P(X = x).$$
In case the random variable is absolutely continuous with density $f$, we have $EX = \int_{-\infty}^{\infty} x f(x)\,dx$. For a random variable that takes values in $\mathbb{N}$ we can write $EX = \sum_{n=1}^{\infty} P(X \ge n)$. This is due to the fact that $\sum_{n=1}^{\infty} P(X \ge n) = E \sum_{n=1}^{\infty} \mathbf{1}(X \ge n)$, and the sum inside the expectation equals $X$.

Generating functions

A generating function of a sequence $a_0, a_1, \ldots$ is an infinite polynomial having the $a_i$ as coefficients:
$$G(s) = a_0 + a_1 s + a_2 s^2 + \cdots.$$
We do not a priori know that this exists for any $s$ other than, obviously, for $s = 0$, for which $G(0) = a_0$. If it exists for some $s_1$ then it exists for all $s$ with $|s| < |s_1|$ (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence $a_n = \rho^n$, $n \ge 0$, is
$$\sum_{n=0}^{\infty} \rho^n s^n = \frac{1}{1 - \rho s}, \tag{39}$$
as long as $|\rho s| < 1$, i.e. $|s| < 1/|\rho|$.

Note that $|G(s)| \le \sum_n |a_n| |s|^n$ and this is $\le \sum_n |a_n|$ if $s \in [-1, 1]$. So if $\sum_n |a_n| < \infty$ then $G(s)$ exists for all $s \in [-1, 1]$, making it a useful object when $a_0, a_1, \ldots$ is a probability distribution on $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, because then $\sum_n a_n = 1$.

Suppose now that $X$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$. The generating function of the sequence $p_n = P(X = n)$, $n = 0, 1, \ldots$ is called the probability generating function of $X$:
$$\varphi(s) = E s^X = \sum_{n=0}^{\infty} s^n P(X = n).$$

This is defined for all $-1 < s < 1$. Note that
$$\sum_{n=0}^{\infty} p_n \le 1,$$
and
$$p_\infty := P(X = \infty) = 1 - \sum_{n=0}^{\infty} p_n.$$
We do allow $X$ to, possibly, take the value $+\infty$. The term $s^\infty p_\infty$ could formally appear in the sum defining $\varphi(s)$ but, since $|s| < 1$, we have $s^\infty = 0$. Note that $\varphi(0) = P(X = 0)$, while
$$\varphi(1) = \sum_{n=0}^{\infty} P(X = n) = P(X < +\infty).$$
We can recover the probabilities $p_n$ from the formula
$$p_n = \frac{\varphi^{(n)}(0)}{n!},$$
as follows from Taylor's theorem, and this is why $\varphi$ is called the probability generating function. Also, we can recover the expectation of $X$ from
$$EX = \lim_{s \to 1} \varphi'(s),$$
something that can be formally seen from
$$\frac{d}{ds} E s^X = E \frac{d}{ds} s^X = E X s^{X-1},$$
and by letting $s = 1$.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
$$x_{n+2} = x_{n+1} + x_n, \qquad n \ge 0,$$
with $x_0 = x_1 = 1$. If we let
$$X(s) = \sum_{n \ge 0} x_n s^n$$
be the generating function of $(x_n, n \ge 0)$ then the generating function of $(x_{n+1}, n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+1} s^n = s^{-1}(X(s) - x_0)$$
and the generating function of $(x_{n+2}, n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+2} s^n = s^{-2}(X(s) - x_0 - s x_1).$$
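Before solving the recursion, the two shift identities just stated can be checked numerically on a truncated series. The sketch below is an illustration under stated assumptions: the truncation level N and the evaluation point s = 0.1 are arbitrary choices, and the last two checks use the value X(s) = 1/(1 − s − s²) and the closed form in terms of the roots a, b of s² + s − 1, both of which come out of continuing the argument with x_0 = x_1 = 1.

```python
from math import isclose, sqrt

N = 200
x = [1, 1]                                    # x_0 = x_1 = 1
for n in range(N):
    x.append(x[n + 1] + x[n])                 # x_{n+2} = x_{n+1} + x_n


def gf(seq, s):
    """Truncated generating function: sum over n of seq[n] * s^n."""
    return sum(c * s ** n for n, c in enumerate(seq))


s = 0.1                                       # well inside the radius of convergence
X = gf(x, s)
assert isclose(gf(x[1:], s), (X - x[0]) / s)                   # shift by one
assert isclose(gf(x[2:], s), (X - x[0] - s * x[1]) / s ** 2)   # shift by two
assert isclose(X, 1 / (1 - s - s * s))                         # solving for X(s)

# Closed form in terms of the roots a, b of s^2 + s - 1.
a, b = (sqrt(5) - 1) / 2, -(sqrt(5) + 1) / 2
for n in range(20):
    closed_form = ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)
    assert round(closed_form) == x[n]
```

The closed form is evaluated in floating point and rounded, which is reliable only for small n; the recursion itself uses exact integers.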

From the recursion $x_{n+2} = x_{n+1} + x_n$ we have (and this is where linearity is used) that the generating function of $(x_{n+2}, n \ge 0)$ equals the sum of the generating functions of $(x_{n+1}, n \ge 0)$ and $(x_n, n \ge 0)$, namely,
$$s^{-2}(X(s) - x_0 - s x_1) = s^{-1}(X(s) - x_0) + X(s).$$
Since $x_0 = x_1 = 1$, we can solve for $X(s)$ and find
$$X(s) = \frac{-1}{s^2 + s - 1}.$$
But the polynomial $s^2 + s - 1$ has two roots:
$$a = (\sqrt{5} - 1)/2, \qquad b = -(\sqrt{5} + 1)/2.$$
Hence $s^2 + s - 1 = (s - a)(s - b)$, and so
$$X(s) = \frac{-1}{(s - a)(s - b)} = \frac{1}{a - b}\left(\frac{1}{s - b} - \frac{1}{s - a}\right) = \frac{1}{b - a}\left(\frac{1}{b - s} - \frac{1}{a - s}\right).$$
But (39) tells us that $\rho^n$ has generating function $\frac{1}{1 - \rho s} = \frac{1/\rho}{(1/\rho) - s}$. Thus $\frac{1}{(1/\rho) - s}$ is the generating function of $\rho^{n+1}$. This tells us that $\frac{1}{b - s}$ is the generating function of $(1/b)^{n+1}$ and that $\frac{1}{a - s}$ is the generating function of $(1/a)^{n+1}$. Hence $X(s)$ is the generating function of
$$x_n = \frac{1}{b - a}\left((1/b)^{n+1} - (1/a)^{n+1}\right).$$
Noting that $ab = -1$, we can also write this as
$$x_n = \frac{(-a)^{n+1} - (-b)^{n+1}}{b - a}.$$
This is the solution to the recursion, i.e. the $n$-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable $X$ with values in $\mathbb{Z}$ or $\mathbb{R}$ or $\mathbb{R}^n$ or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form $P(X \in B)$ where $B$ is a set of values of $X$. So, for example, if $X$ is a random variable with values in $\mathbb{R}$, then the so-called distribution function, i.e. the function
$$P(X \le x), \qquad x \in \mathbb{R},$$
is such an example. But I could very well use the function
$$P(X > x),$$

or the 2-parameter function
$$P(a < X < b).$$
If $X$ is absolutely continuous, I could use its density
$$f(x) = \frac{d}{dx} P(X \le x).$$
All these objects can be referred to as "distribution" or "law" of $X$. Even the generating function $E s^X$ of a $\mathbb{Z}_+$-valued random variable $X$ is an example, for it does allow us to specify the probabilities $P(X = n)$, $n \in \mathbb{Z}_+$.

Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state $i$. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute $n! = 1 \times 2 \times \cdots \times n$ for large $n$ then, since the number is so big, it's better to use logarithms:
$$n! = e^{\log n!} = \exp \sum_{k=1}^{n} \log k.$$
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since $\log x$ does not vary too much between two successive positive integers $k$ and $k + 1$ it follows that
$$\int_k^{k+1} \log x\,dx \approx \log k.$$
Hence
$$\sum_{k=1}^{n} \log k \approx \int_1^n \log x\,dx.$$
Since $\frac{d}{dx}(x \log x - x) = \log x$, it follows that
$$\int_1^n \log x\,dx = n \log n - n + 1 \approx n \log n - n.$$
Hence we have
$$n! \approx \exp(n \log n - n) = n^n e^{-n}.$$
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems:
(a) We do not know in what sense the symbol $\approx$ should be interpreted.
(b) The approximation is not good when we consider ratios of products.
For example (try it!) the rough Stirling approximation gives that the probability that, in $2n$ fair coin tosses

Then. so this little section is a reminder. y ∈ R g(y) ≥ g(x)1(y ≥ x). That was the random variable J0 = ∞ n=1 1(Sn = 0). where λ = P (X ≥ 1). then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any diﬀerent from X. if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x). deﬁned in (34). n ∈ Z+ . Let X be a random variable. Eg(X) ≥ g(x)P (X ≥ x) . We baptise this geometric(λ) distribution. And if y < x then 1(y < x) = 0 and the display reads g(y) ≥ 0. We have already seen a geometric random variable. g(X) ≥ g(x)1(X ≥ x). 119 . If we think of X as the time of occurrence of a random phenomenon. and is based on the concept of monotonicity. by taking expectations of both sides (expectation is an increasing operator). Indeed. we shall prove Strirling’s approximation: √ n! ∼ nn e−n 2πn where the symbol an ∼ bn means that an /bn → 0. you have seen it before. because we know that the n probability goes to zero as n → ∞. We can case both properties of g neatly by writing For all x. The geometric distribution We say that the random variable X with values in Z+ is geometric if it has the memoryless property: P (X − n ≥ m | X ≥ n) = P (X ≥ m). a geometric function of k. Consider an increasing and nonnegative function g : R → R+ . 0 ≤ n ≤ m. expressing monotonicity. Markov’s inequality Arguably this is the most useful inequality in Probability.we have exactly n heads equals 2n 2−2n ≈ 1. as n → ∞. and so. The same process shows that a geometric random variable has a distribution of the form P (X ≥ n) = λn . To remedy this. expressing non-negativity. Surely. In fact it was shown there that P (J0 ≥ k) = P (J0 ≥ 1)k . It was shown in Theorem 38 that J0 satisﬁes the memoryless property. This completely trivial thought gives the very useful Markov’s inequality. and this is terrible. for all x.

Bibliography

1. Brémaud P (1997) An Introduction to Probabilistic Modeling. Springer.
2. Brémaud P (1999) Markov Chains. Springer.
3. Chung, K L and Aitsahlia F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller, W (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead, C M and Snell, J L (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
6. Norris, J R (1998) Markov Chains. Cambridge University Press.
7. Parry, W (1981) Topics in Ergodic Theory. Cambridge University Press.

