
MARKOV CHAINS AND RANDOM WALKS

Takis Konstantopoulos

Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents

1 Introduction
2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains
3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains
4 Stationarity
  4.1 Finding a stationary distribution
5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains
19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain
24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

A few starred sections should be considered "advanced" and can be omitted at first reading. Also, "important" formulae are placed inside a coloured box, and I tried to put all terms in blue small capitals whenever they are first encountered.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

    xn+1 = yn
    yn+1 = rem(xn, yn),    n = 0, 1, 2, . . .

In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and the force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: m ẍ = F. Here, x = x(t) denotes the position of the particle at time t, its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t). In other words, past states are not needed.
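Going back to the gcd example for a moment, here is a minimal sketch of that dynamical system on the computer, with the state written explicitly as the pair (x, y); the function name and the sample inputs are illustrative choices only.

    def gcd_chain(a, b):
        """Iterate (x, y) -> (y, rem(x, y)), starting from (max(a, b), min(a, b)).
        The orbit stops when the second component hits 0; the first is then gcd(a, b)."""
        x, y = max(a, b), min(a, b)
        while y != 0:
            x, y = y, x % y      # x_{n+1} = y_n,  y_{n+1} = rem(x_n, y_n)
        return x

    print(gcd_chain(1071, 462))  # prints 21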

1 2 Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0. A mouse lives in the cage.can be completely speciﬁed by the current state X(t). containing fresh and stinky cheese. The resulting ratio is an approximation b of a f (x)dx. but some are stochastic. Choose a pair (X0 . an eﬀective way for dealing with large problems is via the use of randomness.1 They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large” such as ﬁnding the determinant of a matrix with 10. . past states are not needed. Repeat with (Xn . Try this on the computer! 2 . Yn ). In other words. 2. Count the number of 1’s and divide by N . and use those mathematical models which are called Markov chains. it moves from 2 to 1 with probability β = 0. Other examples of dynamical systems are the algorithms run. Some of these algorithms are deterministic. it does so. say. a ≤ X0 ≤ b. The generalisation takes us into the realm of Randomness. analyse. We will develop a theory that tells us how to describe. regardless of where it was at earlier times. When the mouse is in cell 1 at time n (minutes) then. by the software in your computer. . 1 and 2.. 2 2.000 rows and 10. We will also see why they are useful and discuss how they are applied. for steps n = 1.99. . i. Indeed. Perform this a large number N of times. instead of deterministic objects. uniformly. Each time count 1 if Yn < f (Xn ) and 0 otherwise. We can summarise this information by the transition diagram: 1 Stochastic means random. Y0 ) of random numbers. We will be dealing with random variables. Similarly. we will see what kind of questions we can ask and what kind of answers we can hope for. via the use of the tools of Probability. Markov chains generalise this concept of dependence of the future only on the present. A scientist’s job is to record the position of the mouse every minute. An example of a random algorithm is the Monte Carlo method for the b approximate computation of the integral a f (x)dx of a complicated function f : suppose that f is positive and bounded below B.000 columns or computing the integral of a complicated function of a large number of variables.05. at time n + 1 it is either still in 1 or has moved to 2.e. 0 ≤ Y0 ≤ B.1 Examples of Markov chains A mouse in a cage A mouse is in a cage with two cells. In addition. respectively.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2, regardless of where it was at earlier times. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99. We can summarise this information by the transition diagram:

    (diagram: two states 1 and 2; arrow 1 → 2 labelled α, arrow 2 → 1 labelled β,
     self-loops labelled 1 − α and 1 − β)

Another way to summarise the information is by the 2 × 2 transition probability matrix

    P = ( 1−α   α  )  =  ( 0.95  0.05 )
        (  β   1−β )     ( 0.99  0.01 )

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in cell 1?

Question 1 has an easy, intuitive answer: since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)
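Question 1 (and, as a preview, Question 2) can be checked with a small simulation of this two-state chain; the numbers below are the α and β of the mouse example.

    import random

    alpha, beta = 0.05, 0.99          # 1 -> 2 w.p. alpha, 2 -> 1 w.p. beta

    def step(state):
        """One minute of the mouse's life in the two-cell cage."""
        if state == 1:
            return 2 if random.random() < alpha else 1
        return 1 if random.random() < beta else 2

    # Mean time to go from cell 1 to cell 2 (should be close to 1/alpha = 20).
    trials, total = 10_000, 0
    for _ in range(trials):
        state, t = 1, 0
        while state == 1:
            state, t = step(state), t + 1
        total += t
    print("mean time from 1 to 2:", total / trials)

    # Long-run fraction of time in cell 1 (Question 2; see the stationarity results later).
    state, in1, n = 1, 0, 200_000
    for _ in range(n):
        state = step(state)
        in1 += (state == 1)
    print("fraction of time in cell 1:", in1 / n)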

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

    Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, . . . and S0, S1, S2, . . . are random variables with distributions FD and FS, respectively, and that X0, D0, D1, D2, . . ., S0, S1, S2, . . . are independent random variables.

The information described above, namely the evolution of the account together with the distributions of Dn and Sn, leads to the (one-step) transition probability, defined by

    px,y := P(Xn+1 = y | Xn = x).

We easily see that if y > 0 then

    px,y = P(Dn − Sn = y − x)
         = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x)
         = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x)
         = Σ_{z=0}^∞ FS(z) FD(z + y − x),

something that, in principle, can be computed from FS and FD. If y = 0, we have

    px,0 = 1 − Σ_{y=1}^∞ px,y.

The transition probability matrix is

    P = ( p0,0  p0,1  p0,2  · · · )
        ( p1,0  p1,1  p1,2  · · · )
        ( p2,0  p2,1  p2,2  · · · )
        (  · · ·   · · ·   · · ·  )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction, so he may move forwards with the same probability that he moves backwards.

    (diagram: states . . ., −2, −1, 0, 1, 2, 3, . . ., with arrows of probability 1/2
     from each state to its left and right neighbours)

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
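A sketch of the drunkard on the computer: it generates one long path of the walk and looks at Question 1 empirically. The path length and the bias p are arbitrary choices (p = 1/2 is the completely drunk case).

    import random

    def walk(n, p=0.5):
        """n steps of a simple random walk started at 0: +1 w.p. p, -1 w.p. 1-p."""
        path = [0]
        for _ in range(n):
            path.append(path[-1] + (1 if random.random() < p else -1))
        return path

    path = walk(100_000)
    print("number of visits to 0:", sum(1 for x in path if x == 0))
    print("final position:", path[-1])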

S = 1 − pS.Salvador Dali (1922) The Drunkard Suppose that the drunkard moves from corner to corner. So he can move east. the reason being that the company does not believe that its clients are subject to resurrection.8 0.S = 0. It proposes the following model summarising the state of health of an individual on a monthly basis: 0. and let’s say he’s completely drunk. the answer to this is crucial for the policy determination. mathematically imprecise. But. there is clearly a higher chance that our man will be lost. and pS.D . pD.S − pH.S because pH.01 S 0. therefore. so we assign probability 1/4 for each move. Clearly.H . Note that the diagram omits pH. the company must have an idea on how long the clients will live. that since there are more degrees of freedom now. These things are. and pS. Also.3 H 0. for now.D . 2.D = 1.3 that the person will become sick. but they will become more precise in the sequel.1 D Thus. 5 . etc.5 Actuarial chains A life insurance company wants to ﬁnd out how much money to charge its clients. We may again want to ask similar questions.H = 1 − pH.H − pS. pD. Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly. west. observe. north or south.D is omitted. given that he is currently healthy. there is probability pH.

which precisely means that (Xn+1 . This is true for any n and m.1 The Markov property Deﬁnition of the Markov property Consider a (ﬁnite or inﬁnite) sequence of random variables {Xn . (1) 6 . . . m < n. . . almost exclusively (unless otherwise said explicitly). . . Xn−1 ) given Xn . Lemma 1. . . . Sometimes we say that {Xn . . 3. m ∈ T). 3. The elements of S are frequently called states. m ∈ T) is independent of the past process (Xm . . zero. where T is a subset of the integers. for all n ∈ Z+ and all i0 . 3. unless needed. .} or the set Z of all integers (positive. n ∈ Z+ ) is Markov if and only if. . Usually. . for any n. 2. viz. . . . On the other hand. .3 3. the future process (Xm . . . the Markov property.} or N = {1. Since S is countable. for any n ∈ T. . if the claim holds. in+m . X0 = i0 ) = P (Xn+1 = in+1 | Xn = in ). Xn = in ) = P (X0 = i0 )P (X1 = i1 | X0 = i0 )P (X2 = i2 | X1 = i1 ) · · · P (Xn = in | Xn−1 = in−1 ). We say that this sequence has the Markov property if. m > n. . . . . If (Xn ) is Markov then the claim holds by the conditional independence of Xn+1 and (X0 . 1. (Xn . We focus almost exclusively to the ﬁrst choice. . .2 Transition probability and initial distribution We therefore see that joint probabilities can be expressed as P (X0 = i0 . . m and any choice of states i0 . T = Z+ = {0. conditionally on Xn . X2 = i2 . n + m] Xk = ik | Xn = in . . . . that the Xn take values in some countable set S. it is customary to call (Xn ) a Markov chain. P (Xn+1 = in+1 | Xn = in . Proof. . Xn+m ) and (X0 . n + m] Xk = ik | Xn = in ). . or negative). we can see that by applying it several times we obtain P (∀ k ∈ [n + 1. 2. X0 = i0 ) = P (∀ k ∈ [n + 1. We will also assume. Let us now write the Markov property in an equivalent way. in+1 ∈ S. . n ∈ T}. n ∈ T} is Markov instead of saying that it has the Markov property. So we will omit referring to T explicitly. called the state space. . Xn−1 ) are independent conditionally on Xn . . X1 = i1 . and so future is independent of past given the present. . .

n). The function pi. all probabilities of events associated with the Markov chain This causes some analytical diﬃculties. The function µ(i) is called initial distribution.j (n.3 In this new notation. n + 1) is called transition probability from state i at time n to state j at time n + 1. n).. Xn = in ) = µ(i0 ) pi0 . n) = P(m. n) = i1 ∈S ··· in−1 ∈S pi.j (m. m + 2) · · · P(n − 1.3 Time-homogeneous chains We will specialise to the case of time-homogeneous chains. n) := P (Xn = j | Xm = i) is the transition probability from state i at time m to state j time n ≥ m.j (n. n)]i.i2 (1. i∈S i. If |S| = d then we have d × d matrices. .i1 (m. pi. 2 3 And.j (m.which means that the two functions µ(i) := P (X0 = i) . 2) · · · pin−1 . n) := [pi. n) = 1(i = j) and. (2) which is reminiscent of matrix multiplication. . pi. requires some ordering on S) and whose entries are pi.j (n. n) = P(m. Therefore.j (m. will specify all joint distributions2 : P (X0 = i0 . we can write (2) as P(m.j (m. n) if m ≤ ℓ ≤ n. m + 1)P(m + 1. viz. n). but if S is an inﬁnite set. we deﬁne. of course. 3.i (n − 1. n + 1) := P (Xn+1 = j | Xn = i) .j (n. pi. by (1). n). n + 1) do not depend on n . j ∈ S. a square matrix whose rows and columns are indexed by S (and this. for each m ≤ n the matrix P(m. But we will circumvent them by using Probability only. m + 1) · · · pin−1 . n ≥ 0. by deﬁnition. Of course. then the matrices are inﬁnitely large. remembering a bit of Algebra. More generally. by a result in Probability. X2 = i2 .i1 (0. . X1 = i1 .in (n − 1. or as P(m. which. .j∈S . pi. ℓ) P(ℓ. 1) pi1 . 7 . something that can be called semigroup property or Chapman-Kolmogorov equation. means that the transition probabilities pi.

n) = P × P × · · · · · · × P = Pn−m . if Y is a real random variable. while P = [pi. 0. simply. It corresponds.j . b ≥ 0. we write Ei Y := E(Y | X0 = i) = yP (Y = y | X0 = i) = yPi (Y = y). Starting from a speciﬁc state i. for all integers a. and the semigroup property is now even more obvious because Pa+b = Pa Pb . For example. to picking state i with probability 1. Thus Pi is the law of the chain when the initial distribution is δi . if µ assigns probability p to state i1 and 1 − p to state i2 . The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi . transition matrix. Notice that any probability distribution on S can be written as a linear combination of unit masses.j (n. From now on. δi (j) := 1. Similarly.j∈S is called the (one-step) transition probability matrix or. Then P(m. let µn := [µn (i) = P (Xn = i)]i∈S be the distribution of Xn . then µ = pδi1 + (1 − p)δi2 .and therefore we can simply write pi. when we say “Markov chain” we will mean “timehomogeneous Markov chain”. as follows from (1) and the deﬁnition of P. y y 8 .j ]i. and call pi.j the (one-step) transition probability from state i to state j. if j = i otherwise. arranged in a row vector. The matrix notation is useful. For example. of course. n + 1) =: pi. n−m times for all m ≤ n. A useful convention used for time-homogeneous chains is to write Pi (A) := P (A | X0 = i). unless otherwise speciﬁed. Notational conventions Unit mass. In other words. (3) We call δi the unit mass at i. Then we have µ n = µ 0 Pn .
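The relation µn = µ0 Pn is easy to experiment with numerically. A small sketch follows; the matrix used is the two-state mouse chain of Section 2.1 (any stochastic matrix would do).

    import numpy as np

    P = np.array([[0.95, 0.05],
                  [0.99, 0.01]])       # mouse chain, states 1 and 2

    mu0 = np.array([1.0, 0.0])         # the unit mass delta_1: start in cell 1

    for n in (1, 2, 10, 100):
        mun = mu0 @ np.linalg.matrix_power(P, n)   # mu_n = mu_0 P^n
        print(n, mun)

    # For large n the rows of P^n become (nearly) identical; this is the convergence
    # towards the stationary distribution discussed in the following sections.
    print(np.linalg.matrix_power(P, 100))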

i. In other words. because it is impossible for the number of molecules to remain constant at all times. for any m ∈ Z+ .d. and vice versa. On the average.4 Stationarity We say that the process (Xn . .i−1 = P (Xn+1 N −i N i = i − 1 | Xn = i) = . Clearly. . But this will not work. To provide motivation and intuition. Hence the uniform probability distribution is stationary. Example: the Ehrenfest chain. at any point of time n. n = 0. then.). n = m. we indeed expect that we have N/2 molecules in each room. Consider a canonical heptagon and let a bug move on its vertices. m + 1. If at time n the bug is on a vertex. that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. as it takes some time before the molecules diﬀuse from room to room. 1. It should be clear that if initially place the bug at random on one of the vertices. i. for a while we will be able to tell how long ago the process started. N If we start the chain with X0 = N (all molecules in room 1) then the process (Xn . at time n + 1 it moves to one of the adjacent vertices with equal probability. a sequence of i. Then pi. We can model this by saying that the chance that a molecule move from room 1 to room 2 is proportional to the number of molecules in room 1. It is clear. n ≥ 0) will not be stationary: indeed.i+1 = P (Xn+1 = i + 1 | Xn = i) = pi. the law of a stationary process does not depend on the origin of time. the distribution of the bug will still be the same. say N = 1025 ) in a metallic chamber with a separator in the middle. The gas is placed partly in room 1 and partly in room 2. Let Xn be the number of molecules in room 1 at time n. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. random variables provides an example of a stationary process. . . .e. This is a model of gas distributed between two identical communicating chambers.) is stationary if it has the same law as (Xn . Then a hole is opened in the separator and we watch what happens as the molecules diﬀuse between the two rooms. We now investigate when a Markov chain is stationary. . dividing it into two identical rooms. Imagine there are N molecules of gas (where N is a large number. selecting a vertex with probability 1/7. Another guess then is: Place 9 . let us look at the following example: Example: Motion on a polygon. then.

each molecule at random either in room 1 or in room 2. P (Xn = i) = π(i) for all n. 10 . The only if part is obvious. hence. . we are led to consider: The stationarity problem: Given [pij ]. for all m ≥ 0 and i0 . a Markov chain is stationary if and only if the distribution of Xn does not change with n. (4) Such a π is called . Proof.i = π(i − 1)pi−1. Xm+1 = i1 . . then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . Changing notation. . n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. (a binomial distribution). i 0 ≤ i ≤ N. X2 = i2 . We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. then we have created a stationary Markov chain. by (3).) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so. . . . Since. For the if part.i + π(i + 1)pi+1. .i + P (X0 = i + 1)pi+1. we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. use (1) to see that P (X0 = i0 . Xm+2 = i2 . X1 = i1 . But we need to prove this. The equations (4) are called balance equations. Indeed. which you should go through by yourselves. In other words. . Xm+n = in ). . . . . Lemma 2.i = N N −i+1 N i+1 2−N + = · · · = π(i).i = P (X0 = i − 1)pi−1. If we do so. P (X1 = i) = j P (X0 = j)pj. Xn = in ) = P (Xm = i0 . A Markov chain (Xn . stationary or invariant distribution. 2−N i−1 i+1 N N (The dots signify some missing algebra. the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . ﬁnd those distributions π such that πP = π. in ∈ S.
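For a finite chain, the stationarity problem is a linear-algebra problem: solve πP = π together with Σi π(i) = 1. Here is a sketch of one way to do this numerically, again with the mouse chain as the example matrix.

    import numpy as np

    P = np.array([[0.95, 0.05],
                  [0.99, 0.01]])

    d = P.shape[0]
    # pi P = pi  and  sum(pi) = 1, written as an overdetermined system solved by least squares.
    A = np.vstack([P.T - np.eye(d), np.ones((1, d))])
    b = np.zeros(d + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    print(pi)          # approximately [0.952, 0.048]
    print(pi @ P)      # equal to pi, up to rounding: the balance equations hold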

π must be a probability: π(i) = 1. and so each π(i) will be zero. i = 1. To ﬁnd the stationary distribution. In addition to that. Keep doing it continuously. write the equations (4): π(i) = π(j)pj. 2. Consider a deck of d = 52 cards and shuﬄe them in the following (stupid) way: pick two cards at random and swap their positions. The times between successive buses are i. .i = p(i). d! x ∈ Sd . It is easy to see that the stationary distribution is uniform: π(x) = 1 . which gives p(j) . . which in turn means that the normalisation condition cannot be satisﬁed. if i > 0 i ∈ N. . This means that the solution to the balance equations is identically zero.d. xd ). 2 . . where 1 is a column whose entries are all 1.i = π(0)p(i) + π(i + 1). each x ∈ Sd is of the form x = (x1 .i. . where all the xi are distinct. n ≥ 0) is a Markov chain with transition probabilities pi+1. 11 . i ≥ 0. . i∈S The matrix P has the eigenvalue 1 always because j∈S pi. The state space is S = N = {1. But here. Then (Xn . . i ≥ 1. can be written as P1 = 1. . Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. Thus.}. .Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. random variables.i = 1. which. d}. we use the normalisation condition π(1) = i≥1 j≥i = 1. . . p1. π(1) will be zero. After bus arrives at a bus stop.j = 1. we must be careful. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. then we run into trouble: indeed. in matrix notation. Who tells us that the sum inside the parentheses is ﬁnite? If the sum is inﬁnite.. . j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). Example: card shuﬄing. . and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. the next one will arrive in i minutes with probability p(i). i≥1 π(i) −1 To compute π(1). Example: waiting for a bus.

. ξr+1 (n) = εr ) + P (ξ1 (n) = 1. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength. . and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2 . . Here ξ(n) = (ξ1 (n). As for the name of the chain (bread kneading) it can be justiﬁed as follows: to each ξ there corresponds a number x between 0 and 1 because ξ can be seen as the binary representation of the real number x. 2.Let us work out the sum to see if this assumption can be expressed in more “physical” terms: p(j) = 1(j ≥ i)p(j) = jp(j). . any r = 1. . ξ2 (n). . . . . . for any n = 0. . You can check that . 1. . So it goes beyond our theory. .i. ξ2 . i≥1 j≥i j≥1 i≥1 j≥1 and the latter sum is the expected time between two successive bus arrivals. The Markov chain is obtained by applying the mapping ϕ again and again. i. are i. 1}. . 1 − ξ3 . with P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 then. ξ2 (n).). ξ(n + 1) = ϕ(ξ(n)). i. if ξ1 = 1 then shift to the left and ﬂip all bits to obtain (1 − ξ2 . . . ξr (n + 1) = εr ) = P (ξ1 (n) = 0. ξ3 . . the same is true at n + 1. .d. 1] as a ﬂexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle. with P (ξr (0) = 1) = P (ξr (0) = 0) = 1/2 then the same will be true for all n. Example: bread kneading. (5) 12 . If ξ1 = 1 then x ≥ 1/2. what we have really shown is that this assumption is a necessary and suﬃcient condition for the existence and uniqueness of a stationary distribution. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. 2. .. and any ε1 . r = 1. .. . ξr+1 (n) = 1 − εr ). εr ∈ {0. So let us do so. . So our assumption really means: ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is ﬁnite. ξ2 (n) = 1 − ε1 . But the function f (x) = min(2x. . . . 2(1 − x)). . . P (ξ1 (n + 1) = ε1 . ξ2 (n) = ε1 . from (5).) Consider an inﬁnite binary vector ξ = (ξ1 .d. applied to the interval [0. . . . We can show that if we start with the ξr (0). . . 1} for all i.i. . ξi ∈ {0. We have thus deﬁned a mapping ϕ from binary vectors into binary vectors. . 1] (think of [0. . . . and the rule maps x into 2(1 − x).e.).). . Remarks: This is not a Markov chain with countably many states. Hence choosing the initial state at random in this manner results in a stationary Markov chain. 2. . P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 The proof is by induction: if the ξ1 (n)..) is the state at time n. . So.

for all A ⊂ S. This is OK. π is a stationary distribution if and only if π1 = 1 and F (A. Conversely. Writing the equations in this form is not always the best thing to do. we introduce more. amongst them. The convenience is often suggested by the topological structure of the chain. then choose A = {i} and see that you obtain the balance equations.j (which equals 1): π(i) j∈S pi. i∈A j∈Ac Think of π(i)pi.i . Fix a set A ⊂ S and write π(i) = j∈S π(j)pj. π(j) = 1. a stationary distribution π satisﬁes π = πP π1 = 1 In other words. When a Markov chain is stationary we refer to it as being in 4. If the latter condition holds. Ac ) is the total current from A to Ac . Ac ) := π(i)pi.Note on terminology: steady-state. as we will see in examples. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A.i = j∈A π(j)pj.1 Finding a stationary distribution As explained above.j = j∈A π(j)pj.i + j∈Ac π(j)pj. But we only have d unknowns.j as a “current” ﬂowing from i to j.i Multiply π(i) by j∈S pi. then the above are d + 1 equations. Ac ) = F (Ac . j∈S i ∈ S. A). Proof. expecting to be able to choose.i + j∈Ac π(j)pj. a set of d linearly independent equations which can be more easily solved. We have: Proposition 1.j . π(j)pj. So F (A. therefore one of them is always redundant. because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. suppose that the balance equations hold. π(i) = j∈S (balance equations) (normalisation condition). If the state space has d states.i 13 . Instead of eliminating equations.

99 ≈ 0. A ) A c c F( A .048. p q Consider the following chain.Now split the left sum also: π(i)pi. Thus. Instead. 1.99 (as in the mouse example)..2% of the time in 1.05 ≈ 0.i + j∈Ac π(j)pj. Summing over i ∈ A.04 room 1. which immediately tells us that π(1) is proportional to β and π(2) proportional to α: π(1) = cβ. while the remaining two terms give the equality we seek.952. the mouse spends only 95. but we shall not do this here. .j = j∈A π(j)pj. One can ask how to choose a minimal complete set of linearly independent equations. (Consequently.j + j∈A j∈Ac π(i)pi.i This is true for all i ∈ S. A) Schematic representation of the ﬂow balance relations. .. β = 0. p21 = β. we ﬁnd π(1) = 0. we see that the ﬁrst term on the left cancels with the ﬁrst on the right. α+β π(2) = cα. and 4. π(1) = β .) Assume 0 < α. p22 = 1 − β. gives us ﬂexibility. The balance equations are: π(i) = pπ(i − 1) + qπ(i + 1).04 π(2) = 0. 14 i = 1.8% of the time in room 2. 2. That is. . where p + q = 1: p p p 0 q 1 q 2 q 3 q 4 . The constant c is found from π(1) + π(2) = 1. If we take α = 0. . 3. The extra degree of freedom provided by the last result. A c F(A . α+β π(2) = α . here is an example: Example: Consider the 2-state Markov chain with p12 = α. p11 = 1 − α. in steady state. β < 1. Example: a walk with a barrier. We have π(1)α = π(2)β.05.
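Before solving these balance equations in the continuation, one can already check them by simulation. The sketch below assumes p < 1/2 and, as a boundary detail chosen only for the simulation, that at the barrier the walk stays at 0 with probability q; it compares the empirical occupation frequencies with the geometric form π(i) = (1 − p/q)(p/q)^i that follows from the relations derived next.

    import random

    p = 0.3
    q = 1 - p

    x, counts, n = 0, {}, 1_000_000
    for _ in range(n):
        if random.random() < p:
            x += 1                   # step to the right w.p. p
        elif x > 0:
            x -= 1                   # step to the left w.p. q, unless at the barrier
        counts[x] = counts.get(x, 0) + 1

    for i in range(5):
        empirical = counts.get(i, 0) / n
        theoretical = (1 - p / q) * (p / q) ** i
        print(i, round(empirical, 4), round(theoretical, 4))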

If i = j there is nothing to prove. j) ∈ E ⇐⇒ pi.1 Topological structure The graph of a chain This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. 5. π(i) = (p/q)i π(0). The following are equivalent: (i) i j i i j) ⇐⇒ Pi (∃n ≥ 0 Xn = j) > 0. • Suppose that i j.) 5 5.e. p < 1/2. So assume i = j. A). starting from i. Ac ) = π(i − 1)p = π(i)q = F (Ac . For i ≥ 1. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows. i leads to j (written as i Notice that this relation is reflexive): ∀i ∈ S Furthermore: Theorem 1. Proof. Does this provide a stationary distribution? Yes. In other words. the chain will visit j at some ﬁnite time.i1 pi1 . . . provided that ∞ π(i) = 1. . This is i=0 possible if and only if p < q.i2 · · · pin−1 . 1. . These are much simpler than the previous ones. i. i. In fact. Since 0 < Pi (∃n ≥ 0 Xn = j) ≤ 15 Pi (Xn = j).But we can make them simpler by choosing A = {0.j > 0. (If p ≥ 1/2 there is no stationary distribution. in−1 . we can solve them immediately. This is often cast in terms of a digraph (directed graph). n≥0 . (ii) pi. E) where S is the state space and E ⊂ S × S deﬁned by (i. i − 1}.2 The relation “leads to” We say that a state i leads to state j if. If i ≥ 1 we have F (A. and E a set of directed edges. a pair (S. In graph-theoretic terminology. . . (iii) Pi (Xn = j) > 0 for some n ≥ 0. .e.j > 0 for some n ∈ N and some states i1 . . S is a set of vertices.

we see that there are four communicating classes: {1}. Corollary 1. By assumption. The relation j ⇐⇒ j i). • Suppose now that (iii) holds. However. {4. 5} is closed: if the chain goes in there then it never moves out of it.in−1 pi. In other words. 3} is not closed. The communicating class corresponding to the state i is. 3}. Equivalence relation means that it is symmetric. 16 .3 The relation “communicates with” Deﬁne next the relation i communicates with j (written as i Obviously. 5}.i2 · · · pin−1 . the class {2. we have that the sum is also positive and so one of its terms is positive. It is reﬂexive and transitive because is so. Just as any equivalence relation. and so (iii) holds. Two communicating classes are either 1 2 3 4 5 6 In the example of the ﬁgure. Since Pi (Xn = j) > 0. We just observed it is symmetric. by deﬁnition. it partitions S into equivalence classes known as communicating classes. j) ⇐⇒ i j and j i. • Suppose that (ii) holds. i}. The ﬁrst two classes diﬀer from the last two in character. The relation is transitive: If i j and j k then i k. (i) holds. is an equivalence relation. (iii) holds. Look at the last display again. {2. the set of all states that communicate with i: [i] := {j ∈ S : j So. {6}.. We have Pi (Xn = j) = i1 ..j . where the sum extends over all possible choices of states in S... Proof.the sum on the right must have a nonzero term. The class {4. In addition. (6) j.i1 pi1 . because Pi (Xn = j) ≤ Pi (∃m ≥ 0 Xm = j). [i] = [j] if and only if i identical or completely disjoint. meaning that (ii) holds. this is symmetric (i Corollary 2. 5. reﬂexive and transitive. by deﬁnition. the sum contains a nonzero term. Hence Pi (Xn = j) > 0.

pi. . The communicating classes C1 . The converse is left as an exercise.i1 pi1 .i2 > 0. C2 .i = 1 then we say that i is an absorbing state. Since i ∈ C. If all states communicate with all others. C1 C2 C3 C4 C5 The general structure of a Markov chain is indicated in this ﬁgure. Suppose that i j.More generally. we can reach any other state. Thus. Lemma 3. To put it otherwise. There can be arrows between two nonclosed communicating classes but they are always in the same direction. .i2 · · · pin−1 . . Otherwise. the state is inessential. In other words still. Proof. Closed communicating classes are particularly important because they decompose the chain into smaller. The classes C4 . then the chain (or. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. we conclude that j ∈ C. j∈C for all i ∈ C. 17 . Note: If we consider a Markov chain with transition matrix P and ﬁx n ∈ N then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. the single-element set {i} is a closed communicating class. Note: An inessential state i is such that there exists a state j such that i j but j i.j > 0. its transition matrix P) is called irreducible. In graph terms. pi1 . in−1 such that pi.i1 > 0. .j = 1. C3 are not closed. more manageable. and so i2 ∈ C. C5 in the ﬁgure above are closed communicating classes. starting from any state. more precisely. We then can ﬁnd states i1 . parts. and C is closed. A general Markov chain can have an inﬁnite number of communicating classes. the state space S is a closed communicating class. Similarly. Note that there can be no arrows between closed communicating classes. Suppose ﬁrst that C is a closed communicating class. If a state i is such that pi. A communicating class C is closed if and only if the following holds: If i ∈ C and if i j then j ∈ C. we say that a set of states C ⊂ S is closed if pi. and so on. Why? A state i is essential if it belongs to a closed communicating class. we have i1 ∈ C.

If i j then d(i) = d(j). ℓ ∈ N. (m) (n) for all m. The period of an inessential state is not deﬁned. We say that the chain has period 2. X3 ∈ C2 . more generally. . Notice however.5. we have So pii (2n) (n) pij ≥ pii (n) 2 (m+n) ≥ pik pkj . The period of an essential state is deﬁned as the greatest common divisor of all natural numbers n with such that it is possible to return to i in n steps: d(i) := gcd{n ∈ N : Pi (Xn = i) > 0} . So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2 . . n ≥ 0. X4 ∈ C1 . Let us denote by pij the entries of Pn . So. Theorem 2 (the period is a class property). The chain alternates between the two sets. i. (Can you ﬁnd an example of a chain with period 3?) We will now give the deﬁnition of the period of a state of an arbitrary chain. Since Pm+n = Pm Pn . Consider two distinct essential states i. pii (ℓn) ≥ pii (n) ℓ . 3} at some point of time then. A state i is called aperiodic if d(i) = 1. d(j) = gcd Dj . that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1. i. Let Di = {n ∈ N : pii > 0}. . j. X2 ∈ C1 . Dj = {n ∈ (n) (α) N : pjj > 0}. 4} at the next step and vice versa. . j ∈ S.4 Period Consider the following chain: 1 2 4 3 The chain is irreducible. all states form a single closed communicating class. d(i) = gcd Di . whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. If i j there is α ∈ N with pij > 0 and if 18 . Another way to say this is: (n) (m) If the integer n divides m (denote this by: n | m) and if pii > 0 then pii > 0. Therefore if pii > 0 then pii (n) (ℓn) > 0 for all ℓ ∈ N. (n) Proof.e. it will move to the set C2 = {2. and.

j∈Cr+1 i ∈ Cr . n1 ∈ Di such that n2 − n1 = 1. From the last two displays we get d(j) | n for all n ∈ Di . Cd−1 such that the chain moves cyclically between them. we obtain d(i) ≤ d(j) as well. Therefore d(j) | α + β. the chain is called aperiodic. C1 . i.e. are irreducible chains. Consider an irreducible chain with period d. . An irreducible chain with ﬁnitely many states is aperiodic if and only if there (n) exists an n such that pij > 0 for all i. 1. . showing that α + β ∈ Dj . j ∈ S. So d(i) = d(j). . d − 1. C1 .j i there is β ∈ N with pji > 0. .) 19 . If i is an aperiodic state then there exists n0 such that pii > 0 for all n ≥ n0 . . Divide n by n1 to obtain n = qn1 + r. . Therefore n = qn1 + r(n2 − n1 ) = (q − r)n1 + rn2 . In particular. where the remainder r ≤ n1 − 1. . More generally. we have n ∈ Di as well. d(j) | d(i)). Then we can uniquely partition the state space S into d disjoint subsets C0 . . Arguing symmetrically. q − r > 0. Because n is suﬃciently large. if d = 1. . . Theorem 3. Proof. showing that α + n + β ∈ Dj . Let n be suﬃciently large. Now let n ∈ Di . Therefore d(j) | α + β + n for all n ∈ Di . Since d(i) is the greatest common divisor we have d(i) ≥ d(j) (in fact. Pick n2 . chains where. Theorem 4. n2 ∈ Di . . r = 0. all states communicate with one another. (Here Cd := C0 . (n) (n) (α+n+β) (α) (n) (β) (β) (α+β) (α) (β) This ﬁgure shows the internal structure of a closed communicating class with period d = 4. An irreducible chain has period d if one (and hence all) of the states have period d. as deﬁned earlier. So d(j) is a divisor of all elements of Di . . this is the content of the following theorem. If a number divides two other numbers then it also divides their diﬀerence. Of particular interest. So pjj ≥ pij pji > 0. Cd−1 such that pij = 1. if an irreducible chain has period d then we can decompose the state space into d sets C0 . Then pii > 0 and so pjj ≥ pij pii pji > 0. Formally. Corollary 3. Since n1 .
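For a finite chain, the period d(i) = gcd{n ∈ N : Pi(Xn = i) > 0} can be found numerically by inspecting the diagonal entries of the powers Pn. A rough sketch: the gcd is taken only over n up to a cut-off, which suffices for small examples, and the matrix below is a 4-state chain that alternates between {1, 3} and {2, 4}, hence has period 2, as in the example that opened this subsection (its numerical values are chosen only for illustration).

    from functools import reduce
    from math import gcd

    import numpy as np

    def period(P, i, nmax=200):
        """gcd of all n <= nmax with (P^n)_{ii} > 0; None if the state is not revisited."""
        ns, Q = [], np.eye(len(P))
        for n in range(1, nmax + 1):
            Q = Q @ P
            if Q[i, i] > 1e-12:
                ns.append(n)
        return reduce(gcd, ns) if ns else None

    P = np.array([[0.0, 0.5, 0.0, 0.5],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.5, 0.0, 0.5, 0.0]])
    print(period(P, 0))    # prints 2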

id ∈ C0 .. Take. j belongs to the next class. We show that these classes satisfy what we need. Then pick i1 such that pi0 i1 > 0. and by the assumptions that d d j. We use a method that is based upon considering what the Markov chain does at time 1. . id−1 d 6 Hitting times and ﬁrst-step analysis Consider a Markov chain with transition probability matrix P. 20 . Suppose pij > 0 but j ∈ C1 but. We deﬁne the hitting time4 of a set of states A by TA := inf{n ≥ 0 : Xn ∈ A}. It is easy to see that if we continue and pick a further state id with pid−1 . We are interested in deriving formulae for the probability that this time is ﬁnite as well as the expectation of this time. i. C1 . . We now show that if i belongs to one of these classes and if pij > 0 then. As an example.. necessarily.. Hence it partitions the state space S into d disjoint equivalence classes. If X0 ∈ A the two times coincide. after it takes one step from its current position. We avoid excessive notation by using the same letter for both. j ∈ C2 .Proof. with corresponding classes C0 . Consider the path i0 → i → j → i2 → i3 → · · · → id → i0 .id > 0.e the set of states j such that i j). The existence of such a path implies that it is possible to go from i0 to i. Pick a state i0 and let C0 be its equivalence class (i.. then... Deﬁne the relation i d j ⇐⇒ pij (nd) > 0 for some n ≥ 0. Assume d > 1. for example. . Notice that this is an equivalence relation. i1 . . say. i ∈ C0 . necessarily. The reader is warned to be alert as to which of the two variables we are considering at each time. i2 i0 i0 in a number of steps which is an integer multiple of d − 1 (why?). Let C1 be the equivalence class of i1 . consider the chain 1 1/3 1 0 4 1/2 1 1/6 2 We shall later consider the time inf{n ≥ 1 : Xn ∈ A} which diﬀers from TA simply by considering n ≥ 1 instead of n ≥ 0.e.. Such a path is possible because of the choice of the i0 . that is why we call it “first-step analysis”. which contradicts the deﬁnition of d. Continue in this manner and deﬁne states i0 . Cd−1 . i1 . .

which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). j∈S as needed. But what exactly is this probability equal to? To answer this in its generality. if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S.e. . . ﬁx a set of states A. We ﬁrst have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. X0 = i) = P (TA < ∞|X1 = j). We then have: Theorem 5. Therefore. and so ϕ(i) = 1. ′ Proof. X2 . Hence: ′ ′ P (TA < ∞|X1 = j. ′ is the remaining time until A is hit). i ∈ A. let ϕ′ (i) be another solution. So TA = 1 + TA . then TA ≥ 1. a ′ function of X1 . the event TA is independent of X0 . . we have Pi (TA < ∞) = ϕ(j)pij . ϕ(i) = j∈S i∈A pij ϕ(j). conditionally on X1 . because the chain may end up in state 2. The function ϕ(i) satisﬁes ϕ(i) = 1.It is clear that P1 (T0 < ∞) < 1. For the second part of the proof.). Combining the above. Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). Furthermore. By self-feeding this equation. But the Markov chain is homogeneous. and deﬁne ϕ(i) := Pi (TA < ∞). If i ∈ A then TA = 0. we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A pjk ϕ′ (k) k∈A 21 . If i ∈ A.

if the chain starts from 2 it will never hit 0. Clearly. ′ E i TA = 1 + E i TA = 1 + j∈S pij Ej TA = 1 + j∈A pij ψ(j). (If the chain starts from 0 it takes no time to hit 0. TA = 0 and so ψ(i) := Ei TA = 0. 1 ψ(1) = 1 + ψ(1). let A = {0. On the other hand. 1. as above. By continuing self-feeding the above equation n times. we choose A = {0}. we obtain ϕ′ (i) ≥ Pi (TA ≤ n). obviously. the second equals Pi (TA = 2). then TA = 1 + TA .We recognise that the ﬁrst term equals Pi (TA = 1). ψ(i) = 1 + i∈A pij ψ(j). If i ∈ A then. Letting n → ∞. The function ψ(i) satisﬁes ψ(i) = 0. which is the second of the equations. The second part of the proof is omitted. we have 1 1 ϕ(1) = ϕ(1) + ϕ(0). By omitting the last term we obtain ϕ′ (i) ≥ Pi (TA ≤ 2). 3 2 so ϕ(1) = 3/4. 2. Therefore. Furthermore. We have: Theorem 6. We immediately have ϕ(0) = 1. so the ﬁrst two terms together equal Pi (TA ≤ 2). We let ϕ(i) = Pi (T0 < ∞). because it takes no time to hit A if the chain starts from A. the set that contains only state 0. Proof. Example: In the example of the previous ﬁgure. j∈A i ∈ A. Example: Continuing the previous example. if ψ ′ (i) is any other solution of these equations then ψ ′ (i) ≥ ψ(i) for all i ∈ S. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Next consider the mean time (mean number of steps) until the set A is hit for the ﬁrst time. 2}. ψ(0) = ψ(2) = 0. and ϕ(2) = 0.) As for ϕ(1). ψ(i) := Ei TA . 3 22 . we obtain ϕ′ (i) ≥ Pi (TA < ∞) = ϕ(i). If ′ i ∈ A. and let ψ(i) = Ei TA . i = 0. Start the chain from i.
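The first-step equations in the example just computed are small enough to solve by hand, but it is reassuring to check them on the computer. The sketch below uses the transition probabilities of the example chain (from state 1: to 0 with probability 1/2, stay at 1 with probability 1/3, to 2 with probability 1/6; states 0 and 2 absorbing) and reproduces ϕ(1) = 3/4 and ψ(1) = 3/2.

    import random

    p10, p11, p12 = 1/2, 1/3, 1/6        # transition probabilities out of state 1

    # First-step analysis, solved directly:
    phi1 = p10 / (1 - p11)               # phi(1) = p10 + p11*phi(1), since phi(2) = 0
    psi1 = 1 / (1 - p11)                 # psi(1) = 1 + p11*psi(1), with A = {0, 2}
    print(phi1, psi1)                    # 0.75 1.5

    # Simulation check of phi(1) = P_1(T_0 < infinity):
    hits, runs = 0, 100_000
    for _ in range(runs):
        s = 1
        while s == 1:
            u = random.random()
            s = 0 if u < p10 else (1 if u < p10 + p11 else 2)
        hits += (s == 0)
    print(hits / runs)                   # close to 0.75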

otherwise. 23 . ϕAB is the minimal solution to these equations. ϕAB (i) = j∈A i∈B pij ϕAB (j). because the underlying experiment is the ﬂipping of a coin with “success” probability 2/3.which gives ψ(1) = 3/2. TB for two disjoint sets of states A. (This should have been obvious from the start. in the last sum. we just changed index from n to n + 1. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ). . ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). It should be clear that the equations satisﬁed are: ϕAB (i) = 1. if x ∈ A. because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). ′ where TA is the remaining time. Now the last sum is a function of the future after time 1.e. after one step. . of the random variables X1 . as argued earlier. h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ). Clearly.. Hence. ′ TA = 1 + TA . Then ′ 1+TA x ∈ A. X2 . consider the following situation: Every time the state is x a reward f (x) is earned. by the Markov property. B. Next. i. We may ask to ﬁnd the probabilities ϕAB (i) = Pi (TA < TB ). ϕAB (i) = 0. until set A is reached. h(x) = f (x). ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the ﬁrst-step analysis method. This happens up to the ﬁrst hitting time TA ﬁrst hitting time of the set A. i∈A Furthermore. . therefore the mean number of ﬂips until the ﬁrst success is 3/2.) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . where.

we have h(4) = 0 1 h(1) = 1 + h(2) 2 1 h(3) = 3 + h(2) 2 1 1 h(2) = 2 + h(1) + h(3). in which case he dies immediately. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails). for i = 1. he collects i pounds. (Also. Clearly. if he starts from room 2. 3. You can remember the latter ones by thinking that the expectecd total reward equals the immediate reward f (x) plus the expected remaining reward. If heads come up then you win £1 per pound of stake.Thus. no gambler is assumed to have any information about the outcome of the upcoming toss. Let p the probability of heads and q = 1 − p that of tails. your fortune after the n-th toss is n Xn = X0 + k=1 Y k ξk . h(3) = 7. (Sooner or later. 2. if just before the k-th toss you decide to bet Yk pounds then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. So. 1 2 4 3 The question is to compute the average total reward. y∈S h(x) = f (x) + x ∈ A. until he dies in room 4. k ∈ N) form the strategy of the gambler. The successive stakes (Yk . collecting “rewards”: when in room i. 2 2 We solve and ﬁnd h(2) = 8. are h(x) = 0. why?) If we let h(i) be the average total reward if he starts from room i. so 24 . the set of equations satisﬁed by h.) 7 Gambler’s ruin Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. x∈A pxy h(y). Example: A thief enters sneaks in the following house and moves around the rooms. he will die. unless he is in room 4. h(1) = 5.
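Returning to the thief example above: its three equations form a small linear system which can be solved mechanically. A sketch (the matrix simply encodes the three displayed equations; room 4 is the absorbing room with h(4) = 0):

    import numpy as np

    # h(1) = 1 + (1/2) h(2)
    # h(2) = 2 + (1/2) h(1) + (1/2) h(3)
    # h(3) = 3 + (1/2) h(2)
    A = np.array([[ 1.0, -0.5,  0.0],
                  [-0.5,  1.0, -0.5],
                  [ 0.0, -0.5,  1.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(np.linalg.solve(A, b))   # [5. 8. 7.], i.e. h(1) = 5, h(2) = 8, h(3) = 7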

We use the notation Px (respectively. The diﬀerence between the gambler’s ruin problem for a company and that for a mere mortal is that. 0 ≤ x ≤ ℓ. the company has control over the probability p. as described above. Theorem 7.i+1 = p. b are absorbing. We are clearly dealing with a Markov chain in the state space {a. b} with transition probabilities pi. . it will make enormous proﬁts). Consider a gambler playing a simple coin tossing game. write ϕ(x) instead of ϕ(x. in addition. . . Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. in the ﬁrst case. It can only depend on ξ1 . . b − 1. The general solution will then obtained from Px (Tb < Ta ) = ϕ(x − a. if p = q = 1/2 b−a Proof. P (tails) = 1 − p. astrological nature. We wish to consider the problem of ﬁnding the probability that the gambler will never lose her money. . This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and. . We ﬁrst solve the problem when a = 0. . she must base it on whatever outcomes have appeared thus far. a ≤ x ≤ b. and where the states a.g. a + 2. To solve the problem. Ex ) for the probability (respectively. 25 ϕ(ℓ) = 1. Yk cannot depend on ξk . betting £1 at each integer time against a coin which has P (heads) = p. b = ℓ > 0 and let ϕ(x. a + 1. pi. . if p = q Px (Tb < Ta ) = . a < i < b.i−1 = q = 1 − p. expectation) under the initial condition X0 = x. Then the probability that the gambler’s fortune will reach level b > x before reaching level a < x equals: x−a − 1 (q/p) (q/p)b−a − 1 . In other words. The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). ℓ). . Let us consider a concrete problem. Let £x be the initial fortune of the gambler. in simplest case Yk = 1 for all k. ℓ) := Px (Tℓ < T0 ). (Think why!) To simplify things. ξk−1 and–as is often the case in practice–on other considerations of e. x − a .if the gambler wants to have a strategy at all. She will stop playing if her money fall below a (a < x). b − a). is our next goal. We obviously have ϕ(0) = 0.

and. λ−1 ϕ(x) = If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). we have λℓ −1 λ−1 δ(0) = 1 which gives δ(0) = (λ − 1)/(λℓ − 1). Px (T0 < ∞) = 1 − lim 26 . Corollary 4. δ(x − 2) = · · · = q p x δ(0). and so λx − 1 λx − 1 =1− = λx . λ = 1. then λ = q/p > 1. we ﬁnd ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λx−1 + · · · + λ + 1]δ(0) = Since ϕ(ℓ) = 1. ϕ(x) = pϕ(x + 1) + qϕ(x − 1). ℓ→∞ λℓ − 1 x = 1 − 0 = 1. If p > 1/2. λℓ − 1 0 ≤ x ≤ ℓ. we ﬁnd δ(0) = 1/ℓ and so x ϕ(x) = . and so Px (T0 < ∞) = 1 − lim If p < 1/2. λ = 1. The ruin probability of a gambler starting with £x is Px (T0 < ∞) = Proof. We ﬁnally ﬁnd λx − 1 . if p = q. We have δ(x) = q δ(x − 1) = p q p 2 0 < x < ℓ. This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). if p > 1/2 if p ≤ 1/2. The general case is obtained by replacing ℓ by b − a and x by x − a. ℓ Summarising. ℓ→∞ λℓ − 1 0−1 λx − 1 = 1 − 0 = 1. then λ = q/p < 1. we have obtained ϕ(x. by ﬁrst-step analysis. 0 ≤ x ≤ ℓ. we can write this as p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)]. λℓ −1 x ℓ. Let δ(x) := ϕ(x + 1) − ϕ(x). ℓ) = λx −1 . Since ϕ(ℓ) = 1. We have Px (T0 < ∞) = lim Px (T0 < Tℓ ) = 1 − lim Px (Tℓ < T0 ). if λ = 1 if λ = 1. Since p + q = 1. λx − 1 δ(0). ℓ→∞ ℓ Px (T0 < ∞) = 1 − lim Finally. 1. Assuming that λ := q/p = 1. ℓ→∞ ℓ→∞ (q/p)x .
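The closed-form probability of Theorem 7 is easy to check against a simulation. A sketch, with arbitrarily chosen parameters (the function names are, of course, not part of the notes):

    import random

    def win_prob(x, a, b, p):
        """P_x(T_b < T_a) for the simple walk with up-step probability p (Theorem 7)."""
        lam = (1 - p) / p
        if lam == 1:
            return (x - a) / (b - a)
        return (lam ** (x - a) - 1) / (lam ** (b - a) - 1)

    def win_prob_sim(x, a, b, p, runs=50_000):
        wins = 0
        for _ in range(runs):
            y = x
            while a < y < b:
                y += 1 if random.random() < p else -1
            wins += (y == b)
        return wins / runs

    x, a, b, p = 5, 0, 10, 0.48
    print(win_prob(x, a, b, p), win_prob_sim(x, a, b, p))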

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, . . . , Xn) for all n ∈ Z+. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit to a specific set.

Recall that the Markov property states that, for any n, the future process (Xn, Xn+1, . . .) is independent of the past process (X0, . . . , Xn) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain. Let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . . , Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, . . . , Xτ) if τ < ∞. Since τ is a stopping time, any such A must have the property that, for all n ∈ Z+, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A,

P(Xτ+1 = j1, Xτ+2 = j2, . . . | Xτ = i, A, τ < ∞) = pi,j1 pj1,j2 · · ·

and that

P(Xτ+1 = j1, Xτ+2 = j2, . . . , A | Xτ = i, τ < ∞) = P(Xτ+1 = j1, Xτ+2 = j2, . . . | Xτ = i, τ < ∞) P(A | Xτ = i, τ < ∞).

We have:

P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
  (a) = P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
  (b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
  (c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
  (d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn) and the ordinary splitting property at n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞).
9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the Ti of Section 6: the Ti of Section 6 was allowed to take the value 0, while our Ti here is always positive. If X0 ≠ i the two definitions coincide. (Why?) Regardless, all these random times are stopping times.

Consider the "trajectory" of the chain between two successive visits to the state i. We let Ti(1) := Ti, and we recursively define

Ti(r) := inf{n > Ti(r−1) : Xn = i}, r = 2, 3, . . . ,

so that Ti(2) is the time of the second visit, Ti(3) the time of the third visit, and so on. Define

Xi(0) := (Xn, 0 ≤ n < Ti(1)),   Xi(r) := (Xn, Ti(r) ≤ n < Ti(r+1)), r = 1, 2, . . .

So Xi(r) is the r-th excursion: the chain visits state i for the r-th time and we "watch" it up until the next time it returns to state i. We refer to the Xi(r) as excursions of the process. They are, in fact, random trajectories, i.e. "random variables" in a generalised sense. State i may or may not be visited again.

[Figure: a trajectory split into excursions, delimited by the successive visit times Ti = Ti(1), Ti(2), Ti(3), Ti(4). These excursions are independent!]

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions Xi(0), Xi(1), Xi(2), . . . are independent random variables. Moreover, Xi(1), Xi(2), . . . have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at Ti(r): SMP says that, given the state at the stopping time Ti(r), the future after Ti(r) is independent of the past before Ti(r). But the state at Ti(r) is, by definition, equal to i. Therefore the future is independent of the past, and this proves the first claim. The claim that Xi(1), Xi(2), . . . have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time, and then "goes for an excursion".

A trivial, but important, corollary of this observation is:

Corollary 5. (Numerical) functions g(Xi(0)), g(Xi(1)), g(Xi(2)), . . . of the excursions are independent random variables, and g(Xi(1)), g(Xi(2)), . . . are identical in law.

Example: Define

Λr := Σ_{Ti(r) ≤ n < Ti(r+1)} 1(Xn = j).

The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ1, Λ2, . . . are i.i.d.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this observation and give a couple of definitions. First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:

Pi(Xn = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:

Pi(Xn = i for infinitely many n) = 0.
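The regenerative picture can be made concrete with a small simulation. The sketch below is not from the notes; the 3-state chain is made up purely for illustration. It cuts a long simulated trajectory at the successive return times Ti(r) and compares the mean excursion length over the first and second halves of the run: by Theorem 9 the lengths Ti(r+1) − Ti(r) are i.i.d., so the two halves should agree closely.

    import random

    P = {0: [(0, 0.5), (1, 0.5)],        # illustrative transition probabilities
         1: [(0, 0.3), (2, 0.7)],
         2: [(0, 0.6), (1, 0.4)]}

    def step(x, rng):
        u, acc = rng.random(), 0.0
        for y, p in P[x]:
            acc += p
            if u < acc:
                return y
        return P[x][-1][0]

    def excursion_lengths(i=0, n_steps=200_000, seed=1):
        """Lengths T_i(r+1) - T_i(r) of successive excursions from state i."""
        rng, x, lengths, last_visit = random.Random(seed), i, [], 0
        for n in range(1, n_steps + 1):
            x = step(x, rng)
            if x == i:
                lengths.append(n - last_visit)
                last_visit = n
        return lengths

    L = excursion_lengths()
    half = len(L) // 2
    print(sum(L[:half]) / half, sum(L[half:]) / (len(L) - half))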
Important note: Notice that the two statements above are not negations of one another. In principle, there could be yet another possibility: 0 < Pi(Xn = i for infinitely many n) < 1. We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Suppose that the Markov chain starts from state i. Consider the total number of visits to state i (excluding the possibility that X0 = i):

Ji = Σ_{n=1}^∞ 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞}.

Notice that: state i is visited infinitely many times ⇐⇒ Ji = ∞. Notice also that fii := Pi(Ji ≥ 1) = Pi(Ti < ∞), which can either be equal to 1 or smaller than 1. We claim that:

Lemma 4. Starting from X0 = i, the random variable Ji is geometric:

Pi(Ji ≥ k) = fii^k, k ≥ 0.

Proof. Indeed, the event {Ji ≥ k} implies that Ti < ∞ and that, after time Ti, the chain visits i at least k − 1 further times. But the past before Ti(k−1) is independent of the future, so, by the strong Markov property at time Ti(k−1),

Pi(Ji ≥ k) = Pi(Ji ≥ k − 1) Pi(Ji ≥ 1),

and that is why we get the product. Therefore

Pi(Ji ≥ k) = fii Pi(Ji ≥ k − 1) = fii^2 Pi(Ji ≥ k − 2) = · · · = fii^k.

Because either fii = 1 or fii < 1, we conclude what we asserted earlier: every state is either recurrent or transient. Indeed, if fii = 1 we have Pi(Ji ≥ k) = 1 for all k, i.e. Pi(Ji = ∞) = 1; this means that the state i is recurrent. If fii < 1 then, since Ji is a geometric random variable, Pi(Ji < ∞) = 1; this means that the state i is transient.

We summarise this, together with another useful characterisation of recurrence and transience, in the following two lemmas.

Lemma 5. The following are equivalent: 1. State i is recurrent. 2. fii = Pi(Ti < ∞) = 1. 3. Pi(Ji = ∞) = 1. 4. Ei Ji = ∞.
Proof. The equivalence of the first two statements was proved above. The equivalence of the first and third is by the definition of Ji. If Pi(Ji = ∞) = 1 then clearly Ei Ji = ∞. Conversely, since Ji is a geometric random variable, its expectation cannot be infinite without Ji itself being infinite with probability one.

Lemma 6. The following are equivalent: 1. State i is transient. 2. fii = Pi(Ti < ∞) < 1. 3. Pi(Ji < ∞) = 1. 4. Ei Ji < ∞.
Proof. The equivalence of the first two statements was proved before the lemma. Statement 3, by the definition of Ji, says that, starting from i, state i is visited finitely many times with probability 1. If fii < 1 then, owing to the fact that Ji is geometric, Ei Ji < ∞; and if Ei Ji < ∞ then Ji cannot take the value +∞ with positive probability, so Pi(Ji < ∞) = 1.

In other words: state i is recurrent if, starting from i, it is visited infinitely many times; it is transient if, starting from i, it is visited finitely many times with probability 1. But

Ei Ji = Ei Σ_{n=1}^∞ 1(Xn = i) = Σ_{n=1}^∞ Pi(Xn = i) = Σ_{n=1}^∞ pii^(n).

Hence:

Corollary 6. If i is a transient state then pii^(n) → 0, as n → ∞.
Proof. If i is transient then Ei Ji < ∞, i.e. the series Σ_n pii^(n) is finite. But if a series is finite then the summands go to 0.

10.1 First hitting time decomposition

Define

fij^(n) := Pi(Tj = n), n ≥ 1,

the probability to hit state j for the first time at time n ≥ 1, starting from i. Notice the difference with pij^(n): the latter is the probability to be in state j at time n (not necessarily for the first time) starting from i. Clearly fij^(n) ≤ pij^(n), but the two sequences are related in an exact way:

Lemma 7. pij^(n) = Σ_{m=1}^n fij^(m) pjj^(n−m), n ≥ 1.
Proof. We have

pij^(n) = Pi(Xn = j) = Σ_{m=1}^n Pi(Tj = m, Xn = j) = Σ_{m=1}^n Pi(Tj = m) Pi(Xn = j | Tj = m).

For the second factor we use the Strong Markov Property: Pi(Xn = j | Tj = m) = Pi(Xn = j | Xm = j) = pjj^(n−m).

Corollary 7. If j is a transient state then pij^(n) → 0, as n → ∞, for all states i.
Proof. If j is transient, we have Σ_{n=1}^∞ pjj^(n) = Ej Jj < ∞. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums on the right-hand side)

Σ_{n=1}^∞ pij^(n) = Σ_{m=1}^∞ fij^(m) Σ_{n=m}^∞ pjj^(n−m) = (1 + Ej Jj) Σ_{m=1}^∞ fij^(m) = (1 + Ej Jj) Pi(Tj < ∞),

which is finite. Hence pij^(n) → 0 as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let fij := Pi(Tj < ∞), we have i ⇝ j ⇐⇒ fij > 0. We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and fij = Pi(Tj < ∞) = 1.
Proof. Start with X0 = i. If i is recurrent and i ⇝ j then there is a positive probability (= fij) that j will appear in one of the i.i.d. excursions Xi(0), Xi(1), . . ., and so the probability that j will appear in a specific excursion equals gij := Pi(Tj < Ti) > 0. The random variables

δj,r := 1(j appears in excursion Xi(r)), r = 0, 1, 2, . . . ,

are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence, not only fij > 0, but also fij = 1. Moreover, starting from i, state j is then visited infinitely many times with probability one; applying the strong Markov property at the first visit to j, we see that Pj(Xn = j for infinitely many n) = 1, i.e. j is recurrent.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient. Moreover, if i and j belong to a recurrent communicating class then fij = 1. The last statement is simply an expression in symbols of what we said above.
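The identities of Lemmas 4 to 6, namely Ei Ji = Σ_n pii^(n) and Ei Ji = fii/(1 − fii) when fii < 1, can be checked numerically on any finite chain with a transient state. The sketch below is not part of the notes; the 3-state chain is made up, with state 2 absorbing, so that state 0 is transient.

    import numpy as np

    # Toy chain (illustrative): states 0, 1 are transient, state 2 is absorbing.
    P = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.0, 0.0, 1.0]])

    # E_0 J_0 = sum_{n >= 1} p_00^(n); the series converges, so truncation suffices.
    Pn, EJ = P.copy(), 0.0
    for _ in range(5000):
        EJ += Pn[0, 0]
        Pn = Pn @ P

    # f_00 = P_0(T_0 < infinity) by first-step analysis:
    # h_1 = P_1(ever hit 0) solves h_1 = p_10 + p_11 * h_1; state 2 never hits 0.
    h1 = P[1, 0] / (1.0 - P[1, 1])
    f00 = P[0, 0] + P[0, 1] * h1
    print(EJ, f00 / (1.0 - f00))   # the two numbers agree, as Lemma 4 predicts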
Theorem 11. If i is inessential then it is transient.
Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j does not lead back to i. Then there is m such that Pi(A) > 0, where A := {Xm = j}. Let B := {Xn = i for infinitely many n}. Since j does not lead back to i, Pi(A ∩ B) = Pi(Xm = j, Xn = i for infinitely many n) = 0. But then

0 < Pi(A) = Pi(A ∩ B) + Pi(A ∩ B^c) = Pi(A ∩ B^c) ≤ Pi(B^c),

so Pi(B) = Pi(Xn = i for infinitely many n) < 1, and so i is a transient state.

If a communicating class contains a state which is transient then, by Theorem 10 (applied in contrapositive form), necessarily all other states in the class are also transient. So if a communicating class is recurrent then it contains no inessential states and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.
Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i ∈ C must be visited infinitely many times with positive probability. But a transient state is visited only finitely many times with probability one, no matter where the chain starts; hence this state i is recurrent. By Theorem 10, every other j ∈ C is also recurrent. Hence all states in C are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let Ti be the first return time to i then either Pi(Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi(Ti < ∞) < 1 (transient state). Recall also that a finite random variable may not necessarily have finite expectation.

Example: Let U0, U1, U2, . . . be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. (It should be clear that such variables can, for example, appear in a certain game.) Let

T := inf{n ≥ 1 : Un > U0}.

Thus T represents the number of variables we need to draw to get a value larger than the initial one. First, observe that T is finite:

P(T > n) = P(U1 ≤ U0, . . . , Un ≤ U0) = ∫_0^1 P(U1 ≤ x, . . . , Un ≤ x | U0 = x) dx = ∫_0^1 x^n dx = 1/(n + 1).

Therefore

P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n + 1)) = 1.

What is the expectation of T? We have

ET = Σ_{n≥0} P(T > n) = Σ_{n≥0} 1/(n + 1) = ∞.

You are invited to think about the meaning of this: if the initial value U0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!
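A quick simulation (not in the notes) illustrates this: T is finite in every run, yet the sample averages keep growing as the sample size increases, which is exactly what ET = ∞ looks like in practice.

    import random

    def draw_T(rng):
        """T = inf{n >= 1 : U_n > U_0} for i.i.d. uniforms."""
        u0, n = rng.random(), 0
        while True:
            n += 1
            if rng.random() > u0:
                return n

    rng = random.Random(0)
    for sample_size in (10**3, 10**4, 10**5, 10**6):
        total = sum(draw_T(rng) for _ in range(sample_size))
        print(sample_size, total / sample_size)   # the running averages keep growing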
We now classify a recurrent state i as follows: we say that i is positive recurrent if Ei Ti < ∞, and we say that i is null recurrent if Ei Ti = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law", but this is misleading: it is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z1, Z2, . . . be i.i.d. random variables.
(i) Suppose E|Z1| < ∞ and let µ = EZ1. Then

P( lim_{n→∞} (Z1 + · · · + Zn)/n = µ ) = 1.

(ii) Suppose E max(Z1, 0) = ∞ but E min(Z1, 0) > −∞. Then

P( lim_{n→∞} (Z1 + · · · + Zn)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence, but the sequence X0, X1, X2, . . . of successive states of a Markov chain is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been fixed once and for all), the sequence of excursions

Xa(1), Xa(2), Xa(3), . . .

forms an independent sequence of (generalised) random variables, and they are identically distributed; if, moreover, we start with X0 = a, the initial excursion Xa(0) also has the same law as the rest of them. Since functions of independent random variables are also independent, we have that, for reasonable choices of a function G taking values in R,

G(Xa(1)), G(Xa(2)), G(Xa(3)), . . .

are i.i.d. random variables. Note that

EG(Xa(1)) = EG(Xa(2)) = · · · = E[ G(Xa(0)) | X0 = a ] = Ea G(Xa(0)),

where the final equality is merely our notation for the expectation of a random variable conditional on X0 = a. Thus, the law of large numbers tells us that

(G(Xa(1)) + · · · + G(Xa(n)))/n → EG(Xa(1)), as n → ∞, with probability 1.   (8)

13.2 Average reward

Consider now a function f : S → R: to each state x ∈ S we assign a reward f(x), which we can think of as a reward received when the chain is in state x. We are interested in the existence of the limit

time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(Xk),

which represents the "time-average reward" received. In what follows we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Example 1: For an excursion X, define G(X) to be the maximum reward received during this excursion:

G(Xa(r)) = max{ f(Xt) : Ta(r) ≤ t < Ta(r+1) }.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:

G(Xa(r)) = Σ_{Ta(r) ≤ t < Ta(r+1)} f(Xt).   (7)

Whichever the choice of G, the limit (8) holds.
Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. Ea Ta < ∞) and assume that the distribution of X0 is such that P(Ta < ∞) = 1. Let f : S → R+ be a bounded "reward" function. Then the "time-average reward" exists and is a deterministic quantity:

P( lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = f̄ ) = 1,

where

f̄ = Ea[ Σ_{n=0}^{Ta−1} f(Xn) ] / Ea Ta = Σ_{x∈S} f(x) π(x),

the second expression being in terms of the stationary distribution π constructed in the next section. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered.

Proof. We assumed that f is bounded, say |f(x)| ≤ B. The idea is this: suppose that t is large. We are looking for the total reward received between 0 and t. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, Nt, of complete excursions from the ground state a that occur between 0 and t:

Nt := max{ r ≥ 1 : Ta(r) ≤ t }.

[Figure: the return times Ta(1), . . . , Ta(5) ≤ t on the time axis; in this picture Nt = 5.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:

Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + Gfirst + Glast,   (9)

where Gr is given by (7), Gfirst is the reward up to time Ta = Ta(1) or t (whichever is smaller), and Glast is the reward over the last incomplete cycle. Since P(Ta < ∞) = 1, and since |Gfirst| ≤ B Ta, we have B Ta / t → 0, and so (1/t) Gfirst → 0 as t → ∞, with probability one. A similar argument shows that (1/t) Glast → 0 as t → ∞, with probability one. So we are left with the sum of the rewards over complete cycles. Write now

(1/t) Σ_{r=1}^{Nt−1} Gr = (Nt / t) · (1/Nt) Σ_{r=1}^{Nt−1} Gr.

Look again at what the Law of Large Numbers (8) tells us:

(1/n) Σ_{r=1}^n Gr → EG1, as n → ∞, with probability one.   (10)

Now look at the sequence Ta(r) = Ta(1) + Σ_{k=2}^r (Ta(k) − Ta(k−1)) and see what the Law of Large Numbers again says: the differences Ta(k) − Ta(k−1), k = 2, 3, . . ., are i.i.d. random variables with common expectation Ea Ta, and Ta(1)/r → 0, so

Ta(r)/r → Ea Ta, as r → ∞, with probability one.   (11)

Since Ta(r) → ∞, it follows that the number Nt of complete cycles before time t also tends to ∞ as t → ∞. Hence, replacing n by the subsequence Nt in (10), we have

(1/Nt) Σ_{r=1}^{Nt−1} Gr → EG1, as t → ∞.

By definition of Nt, the visit with index r = Nt to the ground state a is the last before t (see also the figure above):

Ta(Nt) ≤ t < Ta(Nt+1).

Hence Ta(Nt)/Nt ≤ t/Nt < Ta(Nt+1)/Nt. But (11) tells us that both Ta(Nt)/Nt and Ta(Nt+1)/Nt converge to Ea Ta. Hence

lim_{t→∞} t/Nt = Ea Ta, i.e. lim_{t→∞} Nt/t = 1/Ea Ta.   (12)

Using (10) and (12) in (9) we have

lim_{t→∞} (1/t) Σ_{r=1}^{Nt−1} Gr = EG1 / Ea Ta.

Thus

lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = EG1 / Ea Ta =: f̄.

To see that f̄ is as claimed, just observe that, from the Strong Markov Property,

EG1 = E[ Σ_{Ta(1) ≤ t < Ta(2)} f(Xt) ] = Ea[ Σ_{0 ≤ t < Ta} f(Xt) ].
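Theorem 14 is easy to test numerically. The sketch below is not part of the notes; the 3-state chain and the reward f are made up. It compares the long-run time average of f(Xn) with the excursion ratio Ea[Σ_{n<Ta} f(Xn)] / Ea Ta, both estimated from a single simulated path with ground state a = 0.

    import random

    P = [[0.5, 0.5, 0.0],      # illustrative transition matrix (rows sum to 1)
         [0.2, 0.3, 0.5],
         [0.4, 0.0, 0.6]]
    f = [1.0, 0.0, 2.0]         # reward per state

    def step(x, rng):
        u, acc = rng.random(), 0.0
        for y in range(3):
            acc += P[x][y]
            if u < acc:
                return y
        return 2

    rng, a, x, T = random.Random(0), 0, 0, 10**6
    time_avg = 0.0
    exc_reward = exc_len = 0.0
    cur_reward = cur_len = 0.0
    for _ in range(T):
        time_avg += f[x]
        cur_reward += f[x]; cur_len += 1
        x = step(x, rng)
        if x == a:                              # an excursion from a just finished
            exc_reward += cur_reward; exc_len += cur_len
            cur_reward = cur_len = 0.0
    print(time_avg / T, exc_reward / exc_len)   # both close to the same f-bar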
Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion: the numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included), and the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion: we may write EG1 = Ea Σ_{0 ≤ t < Ta} f(Xt) or EG1 = Ea Σ_{0 < t ≤ Ta} f(Xt); both of them are equal, since X0 = XTa = a under Pa.

[Figure: the rewards f(X0) = f(a), f(X1), f(X2), . . . , f(XTa−1) collected over one excursion from a.]

Comment 2: A very important particular case occurs when we choose f(x) := 1(x = b), i.e. the reward is 1 for visiting a particular state b and 0 otherwise. Then the theorem says

lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = b) = Ea[ Σ_{k=0}^{Ta−1} 1(Xk = b) ] / Ea Ta,   (13)

with probability 1. In words: the long-run proportion of time spent at b equals the average number of times state b is visited between two successive visits to state a, divided by the average duration of an excursion from a.

Comment 3: If in the above we let b = a, then the numerator equals 1 (the state a is visited exactly once per excursion, at its start). Hence

lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = a) = 1 / Ea Ta,   (14)

with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes Ea Ta units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/Ea Ta.

Comment 4: Suppose that two states, a and b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/Eb Tb. The equality of these two limits gives:

(average no. of times state b is visited between two successive visits to state a) / Ea Ta = 1 / Eb Tb,

i.e.

Ea[ Σ_{k=0}^{Ta−1} 1(Xk = b) ] / Ea Ta = 1 / Eb Tb.
14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

ν[a](x) := Ea Σ_{n=0}^{Ta−1} 1(Xn = x), x ∈ S,   (15)

should play an important role, where a is the fixed recurrent ground state whose existence was assumed. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then ν[a](x)/Ea Ta is the long-run fraction of times that state x is visited. In view of this, this quantity should have something to do with the concept of stationary distribution discussed a few sections ago. Note: although ν[a] depends on the choice of the ground state a, we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped. Notice also that the choice of notation ν[a] is justifiable through the notation (6) for the communicating class of the state a.

Fix a ground state a ∈ S and assume that it is recurrent.⁵ (⁵ We stress that we do not need the stronger assumption of positive recurrence here.) We start the Markov chain with X0 = a. Recurrence tells us that Ta < ∞ with probability one. Since a = XTa, we can write

ν[a](x) = Ea Σ_{n=1}^{Ta} 1(Xn = x).   (16)

This was a trivial but important step: we replaced the n = 0 term with the n = Ta term. The real reason for doing so is that we wanted to replace a function of X0, X1, X2, . . . by a function of X1, X2, X3, . . ., and thus be in a position to apply first-step analysis, which is precisely what we do next.

Proposition 2. Consider the function ν[a] defined in (15). Then ν[a] satisfies the balance equations:

ν[a] = ν[a] P, i.e. ν[a](x) = Σ_{y∈S} ν[a](y) pyx, x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν[a](x) < ∞.

Proof. First note that, from (16),

ν[a](x) = Ea Σ_{n=1}^∞ 1(Xn = x, n ≤ Ta)
        = Σ_{n=1}^∞ Pa(Xn = x, n ≤ Ta)
        = Σ_{n=1}^∞ Σ_{y∈S} Pa(Xn = x, Xn−1 = y, n ≤ Ta)
        = Σ_{n=1}^∞ Σ_{y∈S} pyx Pa(Xn−1 = y, n ≤ Ta)
        = Σ_{y∈S} pyx Ea Σ_{n=1}^{Ta} 1(Xn−1 = y)
        = Σ_{y∈S} pyx ν[a](y),

where, in the fourth equality, we used the Markov property together with the fact that the event {n ≤ Ta} is determined by X0, . . . , Xn−1. So we have shown ν[a](x) = Σ_{y∈S} pyx ν[a](y), x ∈ S, which are precisely the balance equations; in matrix notation, ν[a] = ν[a] P, and therefore ν[a] = ν[a] Pⁿ for all n ∈ N.

Next, note that ν[a](a) = 1, by definition (within an excursion, a is visited only at time 0). Let x be such that a ⇝ x, so there exists n ≥ 1 such that pax^(n) > 0. Hence ν[a](x) ≥ ν[a](a) pax^(n) = pax^(n) > 0. From Theorem 10, x ⇝ a as well, so there exists m ≥ 1 such that pxa^(m) > 0. Hence 1 = ν[a](a) ≥ ν[a](x) pxa^(m), so ν[a](x) ≤ 1/pxa^(m) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then

Σ_{x∈S} ν[a](x) = Ea Σ_{n=1}^{Ta} Σ_{x∈S} 1(Xn = x) = Ea Σ_{n=1}^{Ta} 1 = Ea Ta,

which is finite when a is positive recurrent. We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define

π[a](x) := ν[a](x) / Ea Ta = Ea[ Σ_{n=0}^{Ta−1} 1(Xn = x) ] / Ea Ta, x ∈ S.   (17)

Then π[a] is a stationary distribution.
Proof. We have already shown, in Proposition 2, that ν[a] P = ν[a]; hence π[a] P = π[a]. It only remains to show that π[a] is a probability. Since a is positive recurrent, we have Ea Ta < ∞, and

Σ_{x∈S} π[a](x) = (1/Ea Ta) Σ_{x∈S} ν[a](x) = Ea Ta / Ea Ta = 1,

and so π[a] is not trivially equal to zero.
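Proposition 2 and Corollary 9 can also be checked numerically: estimate ν[a](x) by averaging visit counts over many simulated excursions, verify that ν[a] ≈ ν[a]P, and compare ν[a]/Ea Ta with the stationary distribution obtained by solving the balance equations directly. A sketch (not from the notes; the 3-state chain is the same made-up example used earlier):

    import numpy as np

    P = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.4, 0.0, 0.6]])
    a, rng = 0, np.random.default_rng(0)

    # Estimate nu[a](x) = E_a sum_{n=0}^{T_a - 1} 1(X_n = x) over many excursions.
    n_exc, visits, total_len = 50_000, np.zeros(3), 0
    for _ in range(n_exc):
        x = a
        while True:
            visits[x] += 1
            total_len += 1
            x = rng.choice(3, p=P[x])
            if x == a:
                break
    nu = visits / n_exc
    print(nu, nu @ P)                     # balance: nu is close to nu P

    # Stationary distribution from the balance equations, for comparison:
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    pi = np.linalg.lstsq(A, np.array([0., 0., 0., 1.]), rcond=None)[0]
    print(nu / (total_len / n_exc), pi)   # nu[a] / E_a T_a  versus  pi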
Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! In other words, if there is a positive recurrent state, then the set of linear equations

π(x) = Σ_{y∈S} π(y) pyx, x ∈ S,   Σ_{y∈S} π(y) = 1,

has at least one solution. We did NOT claim that this distribution is, in general, unique; this is what we will try to establish in the sequel.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.
Proof. Start with X0 = i. We already know that j is recurrent (Theorem 10). Consider the excursions Xi(r), r = 0, 1, 2, . . .. Since i ⇝ j, the probability gij = Pi(Tj < Ti) that j occurs in a specific excursion is positive, and, since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. To decide whether j will occur in a specific excursion, we may imagine tossing a coin with probability of heads gij > 0: if heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. The coins are i.i.d., so j will occur in infinitely many excursions. Let R1 (respectively, R2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the excursions from i; the first visit to j occurs in excursion R1, the second in excursion R2; the time between these two visits is contained in the stretch of excursions with indices R1, R1 + 1, . . . , R2.]

By the strong Markov property, the time between the first and the second visit to j has the same law as Tj under Pj, and it is bounded above by

ζ := Σ_{r=R1}^{R2} Zr,

where Zr denotes the duration of the r-th excursion; the Zr are i.i.d. with the same law as Ti under Pi. So we just need to show that ζ has finite expectation. By Wald's lemma, the expectation of this random variable equals

Ei ζ = (Ei Ti) × Ei(R2 − R1 + 1) = (Ei Ti) × (gij^{-1} + 1),

because R2 − R1 is a geometric random variable: Pi(R2 − R1 = r) = (1 − gij)^{r−1} gij, r = 1, 2, . . .. Since gij > 0 and Ei Ti < ∞, we have Ei ζ < ∞. Hence Ej Tj < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

Accordingly, an irreducible Markov chain is called recurrent if one (and hence all) states are recurrent; it is called positive recurrent if one (and hence all) states are positive recurrent; it is called null recurrent if one (and hence all) states are null recurrent; and it is called transient if one (and hence all) states are transient.

[Diagram: an IRREDUCIBLE CHAIN is either RECURRENT or TRANSIENT; a recurrent chain is either POSITIVE RECURRENT or NULL RECURRENT.]
16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we have seen before:

[Figure: states 0, 1, 2; state 1 jumps to 0 with probability 1/3, stays at 1 with probability 1/2, and jumps to 2 with probability 1/6; states 0 and 2 are absorbing.]

Clearly, state 0 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many (infinitely many) stationary distributions: for each 0 ≤ α ≤ 1, the π defined by π(0) = α, π(1) = 0, π(2) = 1 − α is a stationary distribution. (You should think why this is OBVIOUS!) Notice that states 0 and 2 do not communicate; this lack of communication is, after all, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.
Proof. Fix a ground state a. We already know (Corollary 9) that a stationary distribution exists, namely π[a]. Let π be an arbitrary stationary distribution; we show π = π[a]. Since the chain is irreducible and positive recurrent, we have Pi(Tj < ∞) = 1 for all states i and j. Hence, if we start the Markov chain from the stationary distribution π,

P(Ta < ∞) = Σ_{x∈S} π(x) Px(Ta < ∞) = Σ_{x∈S} π(x) = 1,   (18)

so the hypotheses of Theorem 14 are satisfied. By (13), with probability one,

(1/t) Σ_{n=1}^t 1(Xn = x) → ν[a](x)/Ea Ta = π[a](x), as t → ∞, for all x ∈ S.

Recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides; by a theorem of Analysis (Bounded Convergence Theorem),

E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] → π[a](x), as t → ∞.

But, by (18) and stationarity, we have P(Xn = x) = π(x) for all n, so

E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] = (1/t) Σ_{n=1}^t P(Xn = x) = π(x).

Therefore π(x) = π[a](x) for all x ∈ S. Hence there is only one stationary distribution.

In particular, an arbitrary stationary distribution π must be equal to the specific one π[a], for any choice of ground state a. Hence:

Corollary 11. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,

π(a) = π[a](a) = 1/Ea Ta.

Corollary 12. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π[a] = π[b]. In particular, we obtain the cycle formula:

Ea[ Σ_{k=0}^{Ta−1} 1(Xk = x) ] / Ea Ta = Eb[ Σ_{k=0}^{Tb−1} 1(Xk = x) ] / Eb Tb, x ∈ S.

Proof. Both π[a] and π[b] are stationary distributions, so, by Theorem 16, they are equal. The cycle formula merely re-expresses this equality.

The following is a sort of converse, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.
Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C, i.e. delete all transition probabilities pxy with x ∉ C or y ∉ C. Let π(·|C) be the distribution defined by restricting π to C:

π(x|C) := π(x)/π(C), x ∈ C, where π(C) := Σ_{y∈C} π(y).

We can easily see that we thus obtain a Markov chain with state space C, which is irreducible, and that π(·|C) is a stationary distribution for it; it is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By Corollary 11 applied to the restricted chain, π(i|C) = 1/Ei Ti. But π(i|C) > 0. Therefore Ei Ti < ∞, i.e. i is positive recurrent.
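Corollary 11 invites a direct numerical check: solve the balance equations for π and compare 1/π(a) with a simulation estimate of Ea Ta. A sketch (not from the notes; same illustrative 3-state chain as before):

    import numpy as np

    P = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.4, 0.0, 0.6]])

    # Solve pi = pi P together with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    pi = np.linalg.lstsq(A, np.array([0., 0., 0., 1.]), rcond=None)[0]

    # Estimate E_a T_a by simulating return times to a = 0.
    rng, a, n_exc, total = np.random.default_rng(1), 0, 20_000, 0
    for _ in range(n_exc):
        x, t = a, 0
        while True:
            x = rng.choice(3, p=P[x])
            t += 1
            if x == a:
                break
        total += t
    print(1.0 / pi[a], total / n_exc)   # the two numbers agree (Corollary 11)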
17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain and decompose its state space into communicating classes. To each positive recurrent communicating class C there corresponds a unique stationary distribution π^C which assigns positive probability to each state in C: it is defined by picking a state (any state!) a ∈ C and letting π^C := π[a]. Also, for a stationary distribution π and a class C with π(C) := Σ_{x∈C} π(x) > 0, define the conditional distribution

π(x|C) := π(x)/π(C), x ∈ C.

Then (π(x|C), x ∈ C) is a stationary distribution of the chain restricted to C. An arbitrary stationary distribution π must be a linear combination of the π^C:

Proposition 3. Any stationary distribution π is necessarily of the form

π = Σ_{C: C positive recurrent communicating class} αC π^C,

where αC ≥ 0 and Σ_C αC = 1.
Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then x belongs to some positive recurrent class C (Corollary 13). By our uniqueness result, applied to the chain restricted to C, π(x|C) ≡ π^C(x), x ∈ C. So the above decomposition holds with αC = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(Xn = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesàro averages (1/n) Σ_{k=1}^n P(Xk = x) converge to π(x). Convergence of the probabilities themselves is a stronger requirement: a sequence of real numbers an may fail to converge while its Cesàro averages (a1 + · · · + an)/n converge; take, for example, an = (−1)^n.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. In an ideal pendulum, without friction, the pendulum would instead perform an oscillatory motion ad infinitum.
In the real world there is dissipation of energy due to friction, and that is what causes the settling down.

The simplest differential equation. As another example, consider the differential equation

ẋ = −ax + b, x ∈ R, a > 0.

There are myriads of applications giving physical interpretation to this. The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero: x* = b/a. If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If we start from anywhere else then, regardless of the starting position, we have x(t) → x* as t → ∞. We say that x* is a stable equilibrium.

[Figure: solution curves x(t) for several initial conditions, all converging to the equilibrium level x*.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:

Xn+1 = ρ Xn + ξn.

Here X0, ξ0, ξ1, ξ2, . . . are given, mutually independent, random variables, and we define the sequence X1, X2, . . . by means of the recursion above; we shall assume that the ξn have the same law. (There are applications of such recursions in finance and elsewhere, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example or an example in physics, or whatever.) The system is not "simple", because the ξn depend on the time parameter n. However, notice what happens when we solve the recursion, for we can do it:

X1 = ρX0 + ξ0
X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0) + ξ1 = ρ²X0 + ρξ0 + ξ1
X3 = ρX2 + ξ2 = ρ³X0 + ρ²ξ0 + ρξ1 + ξ2
· · · · · ·
Xn = ρⁿX0 + ρⁿ⁻¹ξ0 + ρⁿ⁻²ξ1 + · · · + ρξn−2 + ξn−1.

But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point. Assume |ρ| < 1. Then the first term, ρⁿX0, goes to 0 as n → ∞. The second term,

X̃n := ρⁿ⁻¹ξ0 + ρⁿ⁻²ξ1 + · · · + ρξn−2 + ξn−1,

does not converge: it keeps being perturbed by fresh noise. But notice that if we let

X̂n := ρⁿ⁻¹ξn−1 + ρⁿ⁻²ξn−2 + · · · + ρξ1 + ξ0,

which uses the same coefficients in reverse order, then, for each n, X̃n and X̂n have the same law:

P(X̃n ≤ x) = P(X̂n ≤ x), uniformly in x.

Moreover, under nice conditions (e.g. if we assume that the ξn are bounded), X̂n does converge, with probability one, to

X̂∞ = Σ_{k=0}^∞ ρᵏ ξk.

Hence, while the limit of Xn does not exist, the limit of its distribution does:

lim_{n→∞} P(Xn ≤ x) = lim_{n→∞} P(X̃n ≤ x) = lim_{n→∞} P(X̂n ≤ x) = P(X̂∞ ≤ x).

It follows that |ρ| < 1 is the condition for stability. This is precisely the reason why we cannot define stability in a way other than convergence of the distributions of the random variables.
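The "forgetting of the initial condition" that underlies this notion of stability (and the fundamental stability theorem below) is already visible in the linear recursion. The following sketch is not in the notes; the noise is taken uniform on [0, 1] purely for illustration. Two trajectories started from very different initial values, driven by the same noise, merge geometrically fast when |ρ| < 1.

    import random

    rho, n_steps = 0.8, 60
    rng = random.Random(0)
    xi = [rng.random() for _ in range(n_steps)]   # one common noise sequence

    x, y = 0.0, 100.0                             # two very different initial values
    for n in range(n_steps):
        x = rho * x + xi[n]
        y = rho * y + xi[n]
        if n % 10 == 0:
            print(n, abs(x - y))                  # difference shrinks like |rho|**n

Indeed, with a common noise sequence, Xn − Yn = ρⁿ(X0 − Y0), so the two copies forget where they started; this is a first, deterministic-looking, example of a coupling.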
18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution π, then, for any initial distribution,

lim_{n→∞} P(Xn = x) = π(x), x ∈ S.

What this means is that, for time-homogeneous Markov chains, the n-step transition matrix Pⁿ converges, as n → ∞, to a matrix with rows all equal to π:

lim_{n→∞} pij^(n) = π(j), i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, i.e. we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define P(X = x, Y = y) = P(X = x)P(Y = y). But suppose that we have some further requirement, for instance, to try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y). Find a joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

f(x) = Σ_y h(x, y),   g(y) = Σ_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (Xn, n ≥ 0) and the law of a stochastic process Y = (Yn, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space. We say that a random time T is a meeting time of the two processes if

Xn = Yn for all n ≥ T.

This T is a random variable that takes values in the set of times or, possibly, on the event that the two processes never meet, the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,

|P(Xn = x) − P(Yn = x)| ≤ P(T > n).

Since this holds for all x ∈ S and since the right-hand side does not depend on x, we can write it as

sup_{x∈S} |P(Xn = x) − P(Yn = x)| ≤ P(T > n).   (19)

If, in particular, P(T < ∞) = 1, then

sup_{x∈S} |P(Xn = x) − P(Yn = x)| → 0, as n → ∞.

Proof. We have

P(Xn = x) = P(Xn = x, n < T) + P(Xn = x, n ≥ T) = P(Xn = x, n < T) + P(Yn = x, n ≥ T) ≤ P(n < T) + P(Yn = x),

and hence P(Xn = x) − P(Yn = x) ≤ P(T > n). Repeating the argument with the roles of X and Y interchanged, we obtain P(Yn = x) − P(Xn = x) ≤ P(T > n). Combining the two inequalities we obtain the first assertion. If P(T < ∞) = 1 then P(T > n) → 0 as n → ∞, which gives the last assertion.
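For the exercise stated above, the best possible coincidence probability is Σ_x min{f(x), g(x)}, and it is achieved by the so-called maximal coupling. The sketch below is not from the notes; it is the standard construction: with probability p = Σ min{f, g} draw one common value from the normalised overlap, otherwise draw X and Y independently from the normalised remainders.

    import random

    def maximal_coupling(f, g, rng):
        """One sample (X, Y) with marginals f, g and P(X = Y) = sum_x min(f, g)."""
        keys = list(set(f) | set(g))
        overlap = {x: min(f.get(x, 0.0), g.get(x, 0.0)) for x in keys}
        p = sum(overlap.values())
        if rng.random() < p:
            x = rng.choices(keys, weights=[overlap[k] for k in keys])[0]
            return x, x
        fr = [max(f.get(k, 0.0) - overlap[k], 0.0) for k in keys]
        gr = [max(g.get(k, 0.0) - overlap[k], 0.0) for k in keys]
        return rng.choices(keys, weights=fr)[0], rng.choices(keys, weights=gr)[0]

    f = {0: 0.5, 1: 0.5}
    g = {0: 0.2, 1: 0.3, 2: 0.5}
    rng = random.Random(0)
    samples = [maximal_coupling(f, g, rng) for _ in range(100_000)]
    print(sum(x == y for x, y in samples) / len(samples))   # close to 0.5 = sum of mins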
18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let pij denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (Xn, n ≥ 0), has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Yn, n ≥ 0), has the law of a Markov chain with the same transition probabilities pij but with Y0 distributed according to the stationary distribution π. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen. It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(Xn = x, Xm = y). We now couple them by doing the straightforward thing: we assume that X = (Xn, n ≥ 0) is independent of Y = (Yn, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

T := inf{n ≥ 0 : Xn = Yn}.

Consider the process Wn := (Xn, Yn). Then (Wn, n ≥ 0) is a Markov chain with state space S × S. Its initial state W0 = (X0, Y0) has distribution P(W0 = (x, y)) = µ(x)π(y). Its (1-step) transition probabilities are

q(x,y),(x′,y′) := P(Wn+1 = (x′, y′) | Wn = (x, y)) = P(Xn+1 = x′ | Xn = x) P(Yn+1 = y′ | Yn = y) = p_{xx′} p_{yy′},

and its n-step transition probabilities are q(x,y),(x′,y′)^(n) = p_{xx′}^(n) p_{yy′}^(n). From Theorem 3 and the aperiodicity assumption, we have p_{xx′}^(n) > 0 and p_{yy′}^(n) > 0 for all large n, implying that q(x,y),(x′,y′)^(n) > 0 for all large n. Therefore (Wn, n ≥ 0) is an irreducible chain. Notice also that

σ(x, y) := π(x)π(y), (x, y) ∈ S × S,

is a stationary distribution for (Wn, n ≥ 0). (Indeed, had we started with both X0 and Y0 independent and distributed according to π, then, for all n, Xn and Yn would be independent and distributed according to π.) Since σ(x, y) > 0 for all (x, y) ∈ S × S, (Wn, n ≥ 0) is positive recurrent (Corollary 13). T is bounded above by the first hitting time of any fixed diagonal state (x, x), which is finite with probability one by recurrence. Hence

P(T < ∞) = 1.

Define now a third process by

Zn := Xn 1(n < T) + Yn 1(n ≥ T);

that is, Zn equals Xn before the meeting time of X and Y and equals Yn thereafter. Obviously Z0 = X0, which has the arbitrary distribution µ. By the strong Markov property (applied to W at the time T), (Zn, n ≥ 0) is a Markov chain with transition probabilities pij; hence (Zn, n ≥ 0) is identical in law to (Xn, n ≥ 0). In particular, P(Xn = x) = P(Zn = x) for all n and x.

[Figure: X starts from the arbitrary distribution µ, Y from the stationary distribution π; after the meeting time T the process Z follows Y.]

Clearly, T is a meeting time between Z and Y. By the coupling inequality,

|P(Zn = x) − P(Yn = x)| ≤ P(T > n),

uniformly in x. Since P(Zn = x) = P(Xn = x) and, by assumption, P(Yn = x) = π(x), we have

|P(Xn = x) − π(x)| ≤ P(T > n),

which converges to zero, as n → ∞, precisely because P(T < ∞) = 1. This proves Theorem 17.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X, then the product chain W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and p01 = p10 = 1. Then the product chain has state space S × S = {(0, 0), (0, 1), (1, 0), (1, 1)} and the transition probabilities are q(0,0),(1,1) = q(1,1),(0,0) = q(0,1),(1,0) = q(1,0),(0,1) = 1. This is not irreducible: for instance, starting from (0, 1) the product chain can never reach (0, 0). However, if we modify the original chain and make it aperiodic by setting, say, p00 = 0.01, p01 = 0.99, p10 = 1, then the product chain becomes irreducible.
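The coupling proof can be watched in action. The sketch below is not from the notes; the small 3-state ergodic chain is made up for illustration. It runs two independent copies, one from an arbitrary state and one started from π, records the first coincidence time T, and compares sup_x |P(Xn = x) − π(x)| with P(T > n); by the argument above the first quantity is dominated by the second.

    import numpy as np

    P = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.4, 0.0, 0.6]])
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    pi = np.linalg.lstsq(A, np.array([0., 0., 0., 1.]), rcond=None)[0]

    rng, n_runs, horizon = np.random.default_rng(0), 100_000, 8
    hitsX = np.zeros((horizon + 1, 3))   # empirical P(X_n = x), with X_0 = 0 fixed
    not_met = np.zeros(horizon + 1)      # empirical P(T > n)
    for _ in range(n_runs):
        x, y, met = 0, rng.choice(3, p=pi), False
        for n in range(horizon + 1):
            met = met or (x == y)
            hitsX[n, x] += 1
            not_met[n] += (not met)
            x = rng.choice(3, p=P[x])
            y = rng.choice(3, p=P[y])
    # (at n = 0 the bound is an equality, so we print from n = 1 onwards)
    for n in range(1, horizon + 1):
        lhs = np.max(np.abs(hitsX[n] / n_runs - pi))
        print(n, round(lhs, 4), "<=", round(not_met[n] / n_runs, 4))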
Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: S = {0, 1}, p01 = p10 = 1. Suppose X0 = 0. Then Xn = 0 for even n and Xn = 1 for odd n. Therefore p00^(n) = 1 for even n and = 0 for odd n, and so lim_{n→∞} p00^(n) does not exist. However, if we modify the chain by setting p00 = 0.01, p01 = 0.99, p10 = 1, then

lim_{n→∞} p00^(n) = lim_{n→∞} p10^(n) = π(0),   lim_{n→∞} p11^(n) = lim_{n→∞} p01^(n) = π(1),

where π(0), π(1) satisfy 0.99 π(0) = π(1), π(0) + π(1) = 1, i.e. π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,

lim_{n→∞} [ p00^(n)  p01^(n) ; p10^(n)  p11^(n) ] = [ 0.503  0.497 ; 0.503  0.497 ].

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong".⁶ It is, however, customary to have a consistent terminology, and this is expressed by our terminology. (⁶ We put "wrong" in quotes: terminology cannot be, per se, wrong. The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981).)

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain, and suppose that the period of the chain is d. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into d classes C0, C1, . . . , Cd−1, such that the chain moves cyclically between them: from C0 to C1, from C1 to C2, and so on, and from Cd−1 back to C0. It is not difficult to see that the following holds.
Lemma 8. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d and stationary distribution π, with cyclic classes C0, C1, . . . , Cd−1. Then

αr := Σ_{i∈Cr} π(i) = 1/d, for all r = 0, 1, . . . , d − 1.

Proof. We have that π satisfies the balance equations: π(i) = Σ_{j∈S} π(j) pji, i ∈ S. Suppose i ∈ Cr. Then pji = 0 unless j ∈ Cr−1. So π(i) = Σ_{j∈Cr−1} π(j) pji. Summing up over i ∈ Cr we obtain

αr = Σ_{i∈Cr} π(i) = Σ_{j∈Cr−1} π(j) Σ_{i∈Cr} pji = Σ_{j∈Cr−1} π(j) = αr−1.

So all the αr are equal and, since they add up to 1, each one must be equal to 1/d.

We shall show that:

Lemma 9. Let i, j ∈ Cr. Then pij^(nd) = Pi(Xnd = j) → dπ(j), as n → ∞.
Proof. Suppose that X0 ∈ Cr. If we restrict the chain (Xnd, n ≥ 0), i.e. the chain sampled every d steps, to the set Cr, we obtain a positive recurrent irreducible aperiodic chain, and, by Lemma 8, its (unique) stationary distribution is dπ(i), i ∈ Cr. Theorem 17 then applies to the sampled chain and gives the result.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (modulo d). This means that:

Theorem 18. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d and stationary distribution π. Let i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (modulo d). Then

pij^(r+nd) → dπ(j), as n → ∞.

Proof. Starting from i, the original chain enters Cℓ in r steps (where r is taken to be a number between 0 and d − 1, after reduction by d) and, thereafter, it returns to Cℓ every d steps. Hence

pij^(r+nd) = Σ_{m∈Cℓ} pim^(r) pmj^(nd) → Σ_{m∈Cℓ} pim^(r) · dπ(j) = dπ(j),

by Lemma 9 and the fact that Σ_{m∈Cℓ} pim^(r) = 1.

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities pij^(n), as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then pij^(n) → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then pij^(n) → 0 as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then pij^(n) → 0, as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly–Bray Lemma:

Lemma 10. For each n = 1, 2, . . ., let p^(n) = (p1^(n), p2^(n), p3^(n), . . .) be a probability distribution on N, i.e. Σ_{i=1}^∞ pi^(n) = 1 and pi^(n) ≥ 0 for all i. Then we can find a sequence n1 < n2 < n3 < · · · of integers and numbers pi, i ∈ N, such that lim_{k→∞} pi^(nk) = pi, for each i ∈ N.
Proof. Since the numbers p1^(n), n = 1, 2, . . ., are contained between 0 and 1, there must be a subsequence (n_k^1, k = 1, 2, . . .) such that lim_{k→∞} p1^{(n_k^1)} exists and equals, say, p1. Since the numbers p2^{(n_k^1)} are contained between 0 and 1, there must be a further subsequence (n_k^2) of (n_k^1) such that lim_{k→∞} p2^{(n_k^2)} exists and equals, say, p2. It is clear how we continue: the sequence (n_k^3) of the third row is a subsequence of (n_k^2) of the second row, and so on; for each i we have found a subsequence (n_k^i) along which pi^(·) converges. Schematically:

p1^{(n_1^1)}, p1^{(n_2^1)}, p1^{(n_3^1)}, · · · → p1
p2^{(n_1^2)}, p2^{(n_2^2)}, p2^{(n_3^2)}, · · · → p2
p3^{(n_1^3)}, p3^{(n_2^3)}, p3^{(n_3^3)}, · · · → p3
· · ·

Now pick the numbers in the diagonal, nk := n_k^k, k = 1, 2, . . .. By construction, (nk, k ≥ i) is a subsequence of each (n_k^i, k = 1, 2, . . .), so pi^{(nk)} → pi, as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit⁷ (⁷ See Brémaud (1999)):

Lemma 11. If (yi, i ∈ N) and (xi^(n), i ∈ N), n = 1, 2, . . ., are sequences of nonnegative numbers then

Σ_{i=1}^∞ yi lim inf_{n→∞} xi^(n) ≤ lim inf_{n→∞} Σ_{i=1}^∞ yi xi^(n).

Proof of Theorem 19. Let j be a null recurrent state and suppose that, for some i, the sequence pij^(n) does not converge to 0. Then we can find a subsequence nk such that

lim_{k→∞} pij^{(nk)} = αj > 0.

Consider the probability vectors (pix^{(nk)}, x ∈ S). By Lemma 10 (applied, via a further diagonal argument, to these vectors and to their one-step shifts), we can pick a subsequence (n′k) of (nk) such that

lim_{k→∞} pix^{(n′k)} = αx, for all x ∈ S,

and such that the shifted probabilities pix^{(n′k + 1)} also converge, for every x, to the same limits αx. Since Σ_{x∈S} pix^{(n′k)} = 1, Lemma 11 gives 0 < Σ_{x∈S} αx ≤ 1. But Pⁿ⁺¹ = Pⁿ P, i.e.

pix^{(n′k + 1)} = Σ_{y∈S} piy^{(n′k)} pyx,

and hence, by Lemma 11,

αx = lim_{k→∞} pix^{(n′k + 1)} ≥ Σ_{y∈S} ( lim_{k→∞} piy^{(n′k)} ) pyx = Σ_{y∈S} αy pyx, x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that for some x0 ∈ S it is a strict inequality: αx0 > Σ_{y∈S} αy pyx0. This would imply

Σ_{x∈S} αx > Σ_{x∈S} Σ_{y∈S} αy pyx = Σ_{y∈S} αy Σ_{x∈S} pyx = Σ_{y∈S} αy,

and this is a contradiction (the number Σ_x αx cannot be strictly larger than itself). Hence

αx = Σ_{y∈S} αy pyx, x ∈ S.

Since 0 < Σ_{x∈S} αx ≤ 1, we have that

π(x) := αx / Σ_{y∈S} αy, x ∈ S,

satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0, because αj > 0. By Corollary 13, state j is positive recurrent: contradiction. Therefore pij^(n) → 0.
19.2 The general case

It remains to see what the limit of pij^(n) is, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Let

HC := inf{n ≥ 0 : Xn ∈ C},   ϕC(i) := Pi(HC < ∞),

and let i be an arbitrary state. Then

lim_{n→∞} pij^(n) = ϕC(i) π^C(j).

Proof. If i also belongs to C then ϕC(i) = 1 and so the result follows from Theorem 17. In general, we have

pij^(n) = Pi(Xn = j) = Pi(Xn = j, HC ≤ n) = Σ_{m=1}^n Pi(HC = m) Pi(Xn = j | HC = m).

Let µ be the distribution of X_{HC} given that X0 = i and that HC < ∞. From the Strong Markov Property, Pi(Xn = j | HC = m) = Pµ(Xn−m = j). But Theorem 17 tells us that the limit exists regardless of the initial distribution: Pµ(Xn−m = j) → π^C(j), as n → ∞. Hence

lim_{n→∞} pij^(n) = π^C(j) Σ_{m=1}^∞ Pi(HC = m) = π^C(j) Pi(HC < ∞),

which is what we need.

Remark: We can write the result of Theorem 20 as

lim_{n→∞} pij^(n) = fij / Ej Tj,

where Tj = inf{n ≥ 1 : Xn = j} and fij = Pi(Tj < ∞). Indeed, π^C(j) = 1/Ej Tj and fij = ϕC(i), since once the chain enters the recurrent class C it reaches j with probability 1. In this form, starting with the first hitting time decomposition relation of Lemma 7, the result can also be obtained by the so-called Abel's Lemma of Analysis; this is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: states 1, 2, 3, 4, 5. From 1: to 2 with probability γ, stay with probability 1 − γ; from 2: to 1 with probability 1; from 3: to 2 with probability 1 − α, to 4 with probability α; from 4: to 3 with probability β, to 5 with probability 1 − β; state 5 is absorbing.]

The communicating classes are I = {3, 4} (inessential states), C1 = {1, 2} (closed class), C2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are:

π^{C1}(1) = 1/(1 + γ),   π^{C1}(2) = γ/(1 + γ),   π^{C2}(5) = 1.

Both C1 and C2 are aperiodic classes. Let ϕ(i) be the probability that class C1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, and, from first-step analysis,

ϕ(3) = (1 − α)ϕ(2) + αϕ(4),   ϕ(4) = (1 − β)ϕ(5) + βϕ(3),

whence

ϕ(3) = (1 − α)/(1 − αβ),   ϕ(4) = β(1 − α)/(1 − αβ).

Hence, as n → ∞,

p31^(n) → ϕ(3)π^{C1}(1), p32^(n) → ϕ(3)π^{C1}(2), p33^(n) → 0, p34^(n) → 0, p35^(n) → (1 − ϕ(3))π^{C2}(5),
p41^(n) → ϕ(4)π^{C1}(1), p42^(n) → ϕ(4)π^{C1}(2), p43^(n) → 0, p44^(n) → 0, p45^(n) → (1 − ϕ(4))π^{C2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. The phase (or state) space of a particle is not just a physical space but it includes, for example, information about its velocity. Pioneers in the area were, among others, the 19-th century French mathematicians Henri Poincaré and Joseph Liouville. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics.

We will avoid giving a precise definition of the term "ergodic" and only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem, and so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state; here we only assume recurrence. Consider the quantity ν[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P(Ta < ∞) = 1. Let f be a reward function such that

Σ_{x∈S} |f(x)| ν[a](x) < ∞.   (20)
Then

P( lim_{t→∞} [ Σ_{n=0}^t f(Xn) ] / [ Σ_{n=0}^t 1(Xn = a) ] = f̄ ) = 1,   (21)

where

f̄ := Σ_{x∈S} f(x) ν[a](x).

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a; here we only assume recurrence, and we normalise by the number of visits to a rather than by t.

Remark 2: If the chain is irreducible and positive recurrent then the Ergodic Theorem (Theorem 14) follows from Theorem 21. Note that the denominator Σ_{n=0}^t 1(Xn = a) equals the number Nt of visits to state a up to time t and, as we saw in the proof of Theorem 14, Nt/t → 1/Ea Ta. So if we divide the numerator and the denominator of the fraction in (21) by t, the denominator converges to 1/Ea Ta and the numerator converges to f̄/Ea Ta, which is the quantity denoted by f̄ in Theorem 14; this also justifies the terminology of §18.5.

Proof. As in Theorem 14, we break the total reward into rewards over complete excursions plus a first and a last term:

Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + Gfirst + Glast,

where Gr = Σ_{Ta(r) ≤ s < Ta(r+1)} f(Xs), while Gfirst and Glast are the total rewards over the first and last incomplete cycles, respectively. Since P(Ta < ∞) = 1 and a is recurrent, Nt → ∞ as t → ∞, and, as in the proof of Theorem 14, the contributions of the incomplete cycles are negligible:

(1/Nt) Gfirst → 0, (1/Nt) Glast → 0, with probability one.

The Law of Large Numbers tells us that

lim_{n→∞} (1/n) Σ_{r=1}^n Gr = EG1,

with probability 1, as long as E|G1| < ∞. But this is precisely the condition (20), since E|G1| ≤ Ea Σ_{0 ≤ s < Ta} |f(Xs)| = Σ_{x∈S} |f(x)| ν[a](x). We only need to check that EG1 = f̄, and this is left as an exercise. Replacing n by the subsequence Nt and dividing the numerator and the denominator of the fraction in (21) by Nt, we obtain the result. The reader is asked to revisit the proof of Theorem 14.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Since the number of states is finite, at least one of the states must be visited infinitely often, whatever the initial state. So there are always recurrent states. From Proposition 2 we know that the balance equations ν = νP have at least one nontrivial solution (take ν = ν[a] for a recurrent ground state a). Since the number of states is finite, we have

C := Σ_{i∈S} ν(i) < ∞.

Hence

π(i) := ν(i)/C, i ∈ S,

satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0; by Corollary 13, this state must be positive recurrent. In particular, a finite Markov chain always has positive recurrent states. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15), and, by Theorem 16, the stationary distribution is unique. If, in addition, the chain is aperiodic, then, as n → ∞, we have Pⁿ → Π, where P is the transition matrix and Π is a matrix all the rows of which are equal to π. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution π. If, in addition, the chain is aperiodic, then Pⁿ → Π, as n → ∞.
22 Time reversibility

Consider a Markov chain (Xn, n ≥ 0). Watching it for a while and then running the film backwards produces another random process. We say that the chain is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, the chain is reversible if, for all n, (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0). Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, without being able to tell that it is running backwards. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.
Proof. Reversibility means that (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0) for all n. In particular, the first components of the two vectors must have the same distribution, i.e. Xn has the same distribution as X0 for all n. Because the process is Markov, this means that it is, actually, stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations (DBE) hold:

π(i) pij = π(j) pji, i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X0, X1) has the same distribution as (X1, X0), and this means that

P(X0 = i, X1 = j) = P(X1 = i, X0 = j).

But, by Lemma 12 the chain is stationary, so P(X0 = i, X1 = j) = π(i)pij and P(X0 = j, X1 = i) = π(j)pji. Hence the DBE hold.

Conversely, suppose the DBE hold. Since the chain is irreducible, we have π(i) > 0 for all i. Using the DBE, for any three states i, j, k,

π(i) pij pjk = [π(j) pji] pjk = [π(j) pjk] pji = π(k) pkj pji,

and hence (X0, X1, X2) has the same distribution as (X2, X1, X0). The same logic works for any n and can be worked out by induction: (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0) for all n. So the chain is reversible.
Reversibility can also be characterised without explicit reference to π. We will show the Kolmogorov loop criterion (KLC).

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if the Kolmogorov loop criterion holds: for all i1, i2, . . . , im ∈ S and all m ≥ 3,

p_{i1 i2} p_{i2 i3} · · · p_{i_{m−1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m−1}} · · · p_{i_2 i_1}.   (22)

Proof. Suppose first that the chain is reversible, so that, by Theorem 23, the DBE hold. Pick 3 states i, j, k and write, using the DBE,

π(i) pij pjk pki = [π(j) pji] pjk pki = [π(j) pjk] pji pki = [π(k) pkj] pji pki = [π(k) pki] pji pkj = [π(i) pik] pji pkj = π(i) pik pkj pji.

Since the chain is irreducible, we have π(i) > 0 for all i; cancelling the π(i) from the above display we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix two states i1, i2 and apply (22) to the loop i1 → i2 → i3 → · · · → i_m → i1. Summing up over all choices of the intermediate states i3, i4, . . . , i_m we obtain

p_{i1 i2} p_{i2 i1}^{(m−1)} = p_{i1 i2}^{(m−1)} p_{i2 i1}.

If the chain is aperiodic, then, letting m → ∞ and using Theorem 17, we have p_{i2 i1}^{(m−1)} → π(i1) and p_{i1 i2}^{(m−1)} → π(i2), implying that

π(i1) p_{i1 i2} = π(i2) p_{i2 i1}.

These are the DBE, and hence, by Theorem 23, the chain is reversible.

Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j.
so we have pij g(j) = . So g satisﬁes the balance equations. namely the arrow from i to j. (If it didn’t.e. There is obviously only one arrow from AtoAc .(Convention: we form this ratio only when pij > 0. Pick two distinct states i. pji g(i) Multiplying by g(i) and summing over j we ﬁnd g(i) = j∈S g(j)pji . And hence π can be found very easily. If we remove the link between them. Ac ) = F (Ac . and by replacing. A) (see §4. Form the graph by deleting all self-loops. Hence the original chain is reversible. j for which pij > 0.1) which gives π(i)pij = π(j)pji . consider the chain: Replace it by: This is a tree. for each pair of distinct states i. 60 . there isn’t any. there would be a loop. then we need. For example. The balance equations are written in the form F (A. Example 1: Chain with a tree-structure. the graph becomes disconnected. the detailed balance equations. in addition. f (i)g(i) must be 1 (why?). Hence the chain is reversible. by assumption.) Necessarily. meaning that it has no loops. and. i. to guarantee. The reason is as follows. and the arrow from j to j is the only one from Ac to A. • If the chain has inﬁnitely many states. j for which pij > 0.) Let A. that the chain is positive recurrent. the two oriented edges from i to j and from j to i by a single edge without arrow. Ac be the sets in the two connected components. using another method. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. If the graph thus obtained is a tree. This gives the following method for showing that a ﬁnite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. then the chain is reversible.

Example 2: Random walk on a graph. Consider an undirected graph G with vertices S, a finite set, and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. States i and j are called neighbours if they are connected by an edge. For each state i define its degree δ(i) = number of neighbours of i. Now define the following transition probabilities:

pij = 1/δ(i), if j is a neighbour of i;   pij = 0, otherwise.

The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example:

[Figure: a small graph on the vertices 0, 1, 2, 3, 4. Here we have p12 = p10 = p14 = 1/3, p02 = 1/4, etc.]

Suppose the graph is connected. The stationary distribution of a RWG is very easy to find. We claim that π(i) = Cδ(i), where C is a constant. To see this let us verify that the detailed balance equations hold: π(i)pij = π(j)pji. There are two cases: If i, j are neighbours then π(i)pij = Cδ(i)·(1/δ(i)) = C and π(j)pji = Cδ(j)·(1/δ(j)) = C, so both members are equal to C. If i, j are not neighbours then both sides are 0. In all cases, the DBE hold. The constant C is found by normalisation, i.e. Σi∈S π(i) = 1. Since Σi∈S δ(i) = 2|E|, where |E| is the total number of edges, we have that C = 1/(2|E|). Hence we conclude that:
1. A RWG is a reversible chain.
2. The stationary distribution assigns to a state probability which is proportional to its degree.
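To illustrate (a sketch, not from the notes; the five-vertex graph below is an arbitrary connected example, not the one in the figure): build the RWG transition matrix from adjacency lists and check that π(i) = δ(i)/2|E| satisfies both πP = π and the detailed balance equations.

import numpy as np

# An arbitrary small connected undirected graph, given by adjacency lists.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1], 4: [2]}
n = len(graph)

# Random walk on the graph: from i, jump to a uniformly chosen neighbour.
P = np.zeros((n, n))
for i, nbrs in graph.items():
    for j in nbrs:
        P[i, j] = 1.0 / len(nbrs)

degrees = np.array([len(graph[i]) for i in range(n)])
pi = degrees / degrees.sum()            # = delta(i) / (2|E|)

print("pi P = pi:", np.allclose(pi @ P, pi))
F = pi[:, None] * P                     # detailed balance check
print("reversible:", np.allclose(F, F.T))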

Let A be (r) a set of states and let TA be the r-th visit to A. . then it is a stationary distribution for the new chain. . The sequence (Yk . 1. . (m) (m) 23. . 1. k = 0. where ξt are i. 2. . where pij are the m-step transition probabilities of the original chain. k = 0. if nk = n0 + km.e. The state space of (Yr ) is A. . . i. A r = 1. k = 0. 2. irreducible and positive recurrent). 23. . if. 2. then (Yk . Deﬁne Yr := XT (r) . t ≥ 0. 1. Due to the Strong Markov Property.d.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). n ≥ 0) be Markov with transition probabilities pij .3 Subordination Let (Xn . in addition. .e. this process is also a Markov chain and it does have time-homogeneous transition probabilities. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . . . how can we create new ones? 23.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence.e.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. in general. Let (St .) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij .) has the Markov property but it is not time-homogeneous. St+1 = St + ξt . π is a stationary distribution for the original chain. If the subsequence forms an arithmetic progression. . t ≥ 0) be independent from (Xn ) with S0 = 0. k = 0. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ).i. . i. . random variables with values in N and distribution λ(n) = P (ξ0 = n). . 1. 2. In this case. 2. . 62 n ∈ N. .
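A quick simulation check of the subsequence construction (a sketch, not from the notes; the two-state chain and the value m = 3 are my own choices): sampling the chain along the arithmetic progression nk = km should produce a Markov chain whose transition matrix is the m-step matrix P^m.

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])              # an arbitrary two-state chain
m, steps = 3, 300_000

# Simulate the original chain (X_n), started at state 0.
X = np.zeros(steps, dtype=int)
for t in range(1, steps):
    X[t] = rng.choice(2, p=P[X[t - 1]])

# Observe it along the arithmetic progression 0, m, 2m, ...
Y = X[::m]

# Empirical transition matrix of (Y_k) against the m-step matrix P^m.
counts = np.zeros((2, 2))
for a, b in zip(Y[:-1], Y[1:]):
    counts[a, b] += 1
print(np.round(counts / counts.sum(axis=1, keepdims=True), 3))
print(np.round(np.linalg.matrix_power(P, m), 3))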

. β). . in general. . is NOT. Fix α0 . Indeed. . n≥0 63 . . Then (Yn ) has the Markov property. . a Markov chain. Yn−1 ) = P (Yn+1 = β | Yn = α). conditional on Yn the past is independent of the future. . Yn−1 = αn−1 }. however. . .) More generally. ∞ λ(n)pij . knowledge of f (Xn ) implies knowledge of Xn and so. (n) 23. β. then the process Yn = f (Xn ). . f is a one-to-one function then (Yn ) is a Markov chain. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0.e. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). We need to show that P (Yn+1 = β | Yn = α.Then Yt := XSt . α1 . that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). . If. Proof.. . (Notation: We will denote the elements of S by i. i. . j. we have: Lemma 13. .. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . and the elements of S ′ by α.4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ .

In other words. Thus (Yn ) is Markov with transition probabilities Q(α. regardless of whether i > 0 or i < 0. α > 0 Q(α. conditional on {Xn = i}. if we let 1/2. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Example: Consider the symmetric drunkard’s random walk. observe that the function | · | is not one-to-one. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i.e. Deﬁne Yn := |Xn |. i ∈ Z. But we will show that P (Yn+1 = β|Xn = i). P (Xn+1 = i ± 1|Xn = i) = 1/2. Indeed. Yn−1 )P (Xn = i | Yn = α. α = 0. 64 . Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). β). We have that P (Yn+1 = β|Xn = i) = Q(|i|. β) P (Yn = α|Yn−1 ) i∈S = Q(α. if β = α ± 1. β) P (Yn = α|Xn = i. Yn−1 ) Q(f (i). On the other hand. if i = 0. i ∈ Z. β). β) i∈S P (Yn = α|Xn = i. then |Xn+1 | = 1 for sure. and so (Yn ) is. β) := 1. β)P (Xn = i | Yn = α. and if i = 0. Yn−1 ) Q(f (i). But P (Yn+1 = β | Yn = α. To see this. Then (Yn ) is a Markov chain. i. for all i ∈ Z and all β ∈ Z+ . if β = 1. a Markov chain with transition probabilities Q(α. β) P (Yn = α. because the probability of going up is the same as the probability of going down. indeed. Yn = α. β).for all choices of the variables involved. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β ∈ Z+ . β). β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. depends on i only through |i|. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2.
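The lumping of the symmetric drunkard's walk can also be checked by simulation (a sketch, not from the notes): the empirical transition frequencies of Yn = |Xn| should be close to Q(0, 1) = 1 and Q(α, α ± 1) = 1/2 for α > 0.

import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
steps = 200_000
X = np.concatenate(([0], np.cumsum(rng.choice([-1, 1], size=steps))))  # X_0 = 0
Y = np.abs(X)                                                          # Y_n = |X_n|

# Empirical transition frequencies of (Y_n).
trans = Counter(zip(Y[:-1], Y[1:]))
for alpha in (0, 1, 2):
    total = sum(c for (a, _), c in trans.items() if a == alpha)
    for beta in (alpha - 1, alpha + 1):
        if beta >= 0 and total > 0:
            print(f"Q({alpha},{beta}) ~ {trans.get((alpha, beta), 0) / total:.3f}")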

then we see that the above equation is of the form Xn+1 = f (Xn . We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. the population cannot grow: it will eventually reach 0. . Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. The parent dies immediately upon giving birth. if 0 is never reached then Xn → ∞. one question of interest is whether the process will become extinct or not.. with probability 1. . We will not attempt to compute the pij neither draw the transition diagram because they are both futile. So assume that p1 = 1. 0 ≤ j ≤ i. But here is an interesting special case. so we will ﬁnd some diﬀerent method to proceed. and following the same distribution. We let pk be the probability that an individual will bear k oﬀspring. a hard problem. If we let ξ (n) = ξ1 . Back to the general case. ξ (n) ). Thus.24 24. . 0 But 0 is an absorbing state. since ξ (0) . We ﬁnd the population (n) at time n + 1 by adding the number of oﬀspring of each individual. Hence all states i ≥ 1 are transient. In other words. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . . independently from one another. . therefore transient. Example: Suppose that p1 = p. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. Then P (ξ = 0) + P (ξ ≥ 2) > 0. and 0 is always an absorbing state.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. if the current population is i there is probability pi that everybody gives birth to zero oﬀspring. It is a Markov chain with time-homogeneous transitions.d. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of oﬀspring. each individual gives birth to at one child with probability p or has children with probability 1 − p. (and independent of the initial population size X0 –a further assumption). . if the 65 (n) (n) . If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. p0 = 1 − p. n ≥ 0) has the Markov property. in general. Let ξk be the number of oﬀspring of the k-th individual in the n-th generation. . ξ2 . 2.i. k = 0. we have that (Xn . Hence all states i ≥ 1 are inessential. and. . ξ (1) . which has a binomial distribution: i j pij = p (1 − p)i−j . . Therefore. j In this case. are i. The state 0 is irrelevant because it cannot be reached from anywhere. 1.

i. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. 66 . Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. We then have that D = D1 ∩ · · · ∩ Di . Indeed. Let ε be the extinction probability of the process. The result is this: Proposition 5. . if we start with X0 = i in ital ancestors. For each k = 1. we can argue that ε(i) = εi . since Xn = 0 implies Xm = 0 for all m ≥ n. when we start with X0 = i. ϕn (0) = P (Xn = 0). so P (D|X0 = i) = εi . We wish to compute the extinction probability ε(i) := Pi (D). Proof. Obviously. If µ ≤ 1 then the ε = 1. Therefore the problem reduces to the study of the branching process starting with X0 = 1. This function is of interest because it gives information about the extinction probability ε. For i ≥ 1. where ε = P1 (D). ε(0) = 1. each one behaves independently of one another. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. . . and P (Dk ) = ε. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. Indeed. and. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. Let ∞ µ = Eξ = k=1 kpk be the mean number of oﬀspring of an individual. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn .population does not become extinct then it must grow to inﬁnity. .

Since both ε and ε∗ satisfy z = ϕ(z). all we need to show is that ε < 1. ϕ(ϕn (0)) → ϕ(ε).Note also that ϕ0 (z) = 1. Also ϕn (0) → ε. In particular. Let us show that. ϕn (0) ≤ ε∗ . and so on. (See right ﬁgure below. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). in fact. But ϕn+1 (0) → ε as n → ∞. On the other hand. ϕn+1 (z) = ϕ(ϕn (z)). since ϕ is a continuous function. and. ε = ε∗ . then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. Notice now that µ= d ϕ(z) dz . the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . Therefore. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). Therefore ε = ε∗ . Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). But ϕn (0) → ε. This means that ε = 1 (see left ﬁgure below). if the slope µ at z = 1 is ≤ 1. recall that ϕ is a convex function with ϕ(1) = 1. Therefore ε = ϕ(ε). By induction. But 0 ≤ ε∗ .) 67 . So ε ≤ ε∗ < 1. z=1 Also. if µ > 1. ϕn+1 (0) = ϕ(ϕn (0)). and since the latter has only two solutions (z = 0 and z = ε∗ ).
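The characterisation of ε suggests a direct computation (a sketch, not from the notes; the offspring distribution p0 = 0.2, p1 = 0.3, p2 = 0.5 is an arbitrary example with µ > 1): iterate ϕn+1(0) = ϕ(ϕn(0)) and compare the limit with the smallest root of ϕ(z) = z.

# Offspring distribution p_0 = 0.2, p_1 = 0.3, p_2 = 0.5 (arbitrary; mu = 1.3 > 1).
p = [0.2, 0.3, 0.5]
mu = sum(k * pk for k, pk in enumerate(p))

def phi(z):
    # probability generating function phi(z) = sum_k p_k z^k
    return sum(pk * z ** k for k, pk in enumerate(p))

# phi_n(0) = P(X_n = 0) increases to the extinction probability epsilon.
eps = 0.0
for _ in range(100):
    eps = phi(eps)

print("mu =", mu, " extinction probability ~", round(eps, 6))
# For this quadratic phi, the equation phi(z) = z reads 0.5 z^2 - 0.7 z + 0.2 = 0,
# whose roots are z = 0.4 and z = 1; the iteration converges to the smaller one.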

[Figure: the graph of ϕ(z) against the diagonal z. Case: µ ≤ 1. Then ε = 1. Case: µ > 1. Then ε < 1.]

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). The problem of communication over a common channel, using the same frequency, is that if more than one host attempts to transmit a packet, there is collision and the transmission is unsuccessful. Thus, for a packet to go through, only one of the hosts should be transmitting at a given time. One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 3, 6, . . . are allocated to host 1, slots 1, 4, . . . to host 2, and slots 2, 5, . . . to host 3. But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit. Since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea, we let hosts transmit whenever they like. If collision happens, then the transmission is not successful and the transmissions must be attempted anew. Nowadays, Ethernet has replaced Aloha.

So, in a simplified model, assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot assume there are a new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Let s be the number of selected packets. If s = 1 there is a successful transmission and, on the next time slot, there are x + a − 1 packets. If s ≥ 2, there is a collision and, on the next time slot, there are x + a packets. So if we let Xn be the number of backlogged packets at the beginning of the n-th time slot, An the number of new packets on the same time slot, and Sn the number of selected packets, then

Xn+1 = max(Xn + An − 1, 0), if Sn = 1;   Xn+1 = Xn + An, otherwise.

We assume that the An are i.i.d. random variables with some common distribution: P(An = k) = λk, k ≥ 0.

. The probabilities px. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. Let us ﬁnd its transition probabilities. Unless µ = 0 (which is a trivial case: no arrivals!). we have that the drift is positive for all large x and this proves transience. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . . Sn = 1|Xn = x) = λ0 xpq x−1 . we have ∆(x) = (−1)px. deﬁned by‘ ∆(x) := E[Xn+1 − Xn |Xn = x].) = x + a k x−k p q . To compute px. i. An−1 . We also let µ := k≥1 kλk be the expectation of An . Proving this is beyond the scope of the lectures. selections are made amongst the Xn + An packets. k 0 ≤ k ≤ x + a. To compute px. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. So P (Sn = k|Xn = x. We here only restrict ourselves to the computation of this expected change. where q = 1 − p. and p. .x+k for k ≥ 1. k ≥ 1.e.x−1 + k≥1 kpx. n ≥ 0) is a Markov chain. The reason we take the maximum with 0 in the above equation is because. we think as follows: to have a reduction of x by 1. where G(x) is a function of the form α + βx .x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . The random variable Sn has. 69 . Therefore ∆(x) ≥ µ/2 for all large x. For x > 0. The main result is: Proposition 6. and so q x tends to 0 much faster than G(x) tends to ∞.e. we should not subtract 1. The process (Xn . x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x).x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . The Aloha protocol is unstable.x−1 = P (An = 0. for example we can make it ≤ µ/2. so q x G(x) tends to 0 as x → ∞. the Markov chain (Xn ) is transient.where λk ≥ 0 and k≥0 kλk = 1. then the chain is transient.x−1 when x > 0. Xn−1 . Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). So: px. we have q < 1. Indeed. Since p > 0.x are computed by subtracting the sum of the above from 1. Idea of proof. An = a. we must have no new packets arriving. So we can make q x G(x) as small as we like by choosing x suﬃciently large. if Xn + An = 0. and exactly one selection: px.
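A short simulation makes the instability visible (a sketch; the Poisson arrival distribution and the values of p and the arrival rate are my own choices, the notes only assume i.i.d. arrivals): the backlog Xn keeps growing.

import numpy as np

rng = np.random.default_rng(2)
p, mean_arrivals, slots = 0.1, 0.5, 20_000   # illustrative parameter choices

x = 0                                        # backlog at the start of the slot
for slot in range(1, slots + 1):
    a = rng.poisson(mean_arrivals)           # new packets (Poisson is an assumption)
    s = rng.binomial(x + a, p)               # packets that attempt transmission
    x = max(x + a - 1, 0) if s == 1 else x + a
    if slot % 5000 == 0:
        print("slot", slot, "backlog", x)    # the backlog keeps growing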

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The idea is simple. The web graph is a graph with set of vertices V consisting of web pages and edges representing hyperlinks: thus, if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. For each page i let L(i) be the set of pages it links to. If L(i) = ∅, then i is a 'dangling page'. Define transition probabilities as follows:

pij = 1/|L(i)|, if j ∈ L(i);   pij = 1/|V|, if L(i) = ∅;   pij = 0, otherwise.

So if Xn = i is the current location of the chain, the next location Xn+1 is picked uniformly at random amongst all pages that i links to, unless i is a dangling page; in the latter case, the chain moves to a random page.

It is not clear that the chain is irreducible, neither that it is aperiodic. So we make the following modification. We pick α between 0 and 1 (typically, α ≈ 0.2) and replace pij by (1 − α)pij + α/|V|. These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored–and this happens with probability α–in which case he clicks on a random page.

So this is how Google assigns ranks to pages: It uses an algorithm, PageRank, to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij]: π = πP. Then it says: page i has higher rank than j if π(i) > π(j).

Comments:
1. Its implementation, because of the size of the problem, is one of the largest matrix computations in the world. It uses advanced techniques from linear algebra to speed up computations. The algorithm used (PageRank) starts with an initial guess π^(0) for π and updates it by π^(k) = π^(k−1) P.
2. In an Internet-dominated society, the rank of a page may be very important for a company's profits. Therefore, there are companies that sell high-rank pages. All you need to do to get high rank is pay (a lot of money). There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their
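To make the update π^(k) = π^(k−1) P concrete, here is a toy sketch (a five-page web graph of my own invention and α = 0.2; the real computation is, of course, of an entirely different scale): build the 'bored surfer' matrix and power-iterate.

import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2, 4], 4: []}   # page 4 is dangling
V, alpha = len(links), 0.2

P = np.zeros((V, V))
for i, L in links.items():
    if L:
        for j in L:
            P[i, j] = 1 / len(L)      # uniform over the pages i links to
    else:
        P[i, :] = 1 / V               # dangling page: jump anywhere

P = (1 - alpha) * P + alpha / V       # 'bored surfer' modification

pi = np.ones(V) / V                   # initial guess pi^(0)
for _ in range(100):                  # pi^(k) = pi^(k-1) P
    pi = pi @ P

print(np.round(pi, 4), "ranking:", list(np.argsort(-pi)))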

two AA individuals will produce and AA child. This means that the external appearance of a characteristic is a function of the pair of alleles. we will assume that. The alleles in an individual come in pairs and express the individual’s genotype for that gene. We would like to study the genetic diversity. We will also assume that the parents are replaced by the children. The process repeats. What is important is that the evaluation can be done without ever looking at the content of one’s research. from the point of view of an administrator. We will assume that a child chooses an allele from each parent at random. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. all we need is the information outlined in this ﬁgure. their oﬀspring inherits an allele from each parent. meaning how many characteristics we have. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. 24. the genetic diversity depends only on the total number of alleles of each type. When two individuals mate. in each generation. One of the circle alleles is selected twice. Thus. To simplify. We have a transition from x to y if y alleles of type A were produced. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. then. generation after generation. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. We are now ready to deﬁne the Markov chain Xn representing the number of alleles of type A in generation n. a dominant (say A) and a recessive (say a). In this ﬁgure there are 4 alleles of type A (circles) and 6 of type a (squares). One of the square alleles is selected 4 times. it makes the evaluation procedure easy. 71 . And so on. 5. colour of eyes) appears in two forms. But an AA mating with a Aa may produce a child of type AA or one of type Aa. maintain this allele for the next generation. Since we don’t care to keep track of the individuals’ identities. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. Imagine that we have a population of N individuals who mate at random producing a number of oﬀspring. This number could be independent of the research quality but.faculty from each other (and from themselves).4 The Wright-Fisher model: an example from biology In an organism. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their oﬀspring did not inherit this allele. 4. a gene controlling a speciﬁc characteristic (e. this allele is of type A with probability x/2N .g. Three of the square alleles are never selected. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). If so. Do the same thing 2N times.

and the population becomes one consisting of alleles of type A only.) One can argue that. Since selections are independent. . there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. . Since pxy > 0 for all 1 ≤ x.e. as usual. .e. M n P (ξk = 0|Xn . n n where. Clearly.d. on the average. . but the argument needs further justiﬁcation because “the end of time” is not a deterministic time. . p00 = p2N. Xn ). y ≤ 2N − 1. 0 ≤ y ≤ 2N. Tx := inf{n ≥ 1 : Xn = x}. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. on the average. Clearly. 72 . . the states {1. X0 ) = 1 − Xn . we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . X0 ) = Xn . . . for all n ≥ 0. . conditionally on (X0 .2N = 1. the expectation doesn’t change and. i. ϕ(x) = y=0 pxy ϕ(y). . and the population becomes one consisting of alleles of type a only. These are: M ϕ(0) = 0. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). . .i.Each allele of type A is selected with probability x/2N . The result is correct. EXn+1 = EXn Xn = Xn . ϕ(x) = x/M. . the chain gets absorbed by either 0 or 2N . X0 ) = E(Xn+1 |Xn ) = M This implies that. . The fact that M is even is irrelevant for the mathematical model. Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. Similarly. ξM are i. so both 0 and 2N are absorbing states. M and thus. M Therefore. with n P (ξk = 1|Xn . ϕ(M ) = 1. we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. We show that our guess is correct by proceeding via the mundane way of verifying that it satisﬁes the ﬁrst-step analysis equations. . . since. . Let M = 2N . 2N − 1} form a communicating class of inessential states. we see that 2N x 2N −y x y pxy = . . i. . . the number of alleles of type A won’t change. . E(Xn+1 |Xn . . since. . the random variables ξ1 . . (Here.
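A simulation sketch (not from the notes; M = 20 and x = 6 are arbitrary values) estimates the absorption probability at M and compares it with the first-step-analysis answer ϕ(x) = x/M.

import numpy as np

rng = np.random.default_rng(3)
M, x0, runs = 20, 6, 20_000           # arbitrary: M = 2N alleles, x0 of type A

absorbed_at_M = 0
for _ in range(runs):
    x = x0
    while 0 < x < M:
        x = rng.binomial(M, x / M)    # next generation's number of type-A alleles
    absorbed_at_M += (x == M)

print("simulation:", absorbed_at_M / runs, "  first-step analysis x/M:", x0 / M)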

. n ≥ 0) are i. we will try to solve the recursion. the demand is only partially satisﬁed. and so. . If the demand is for Dn components. b). M −1 y−1 x M y−1 1− x . we ﬁnd X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0.5 A storage or queueing application A storage facility stores. A number An of new components are ordered and added to the stock. say. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. there are Xn + An − Dn components in the stock. Here are our assumptions: The pairs of random variables ((An . M x M (M −1)−(y−1) x x +1− M M M −1 = 24. the demand is. provided that enough components are available. and E(An − Dn ) = −µ < 0. while a number of them are sold to the market. Working for n = 1. and the stock reached level zero at time n + 1. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. so that Xn+1 = (Xn + ξn ) ∨ 0. Dn ). Let ξn := An − Dn . Xn electronic components at time n. 73 .The ﬁrst two are obvious.i. at time n + 1. We have that (Xn . We will apply the so-called Loynes’ method.d. Thus. larger than the supply per unit of time. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . We will show that this Markov chain is positive recurrent. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. n ≥ 0) is a Markov chain. the demand is met instantaneously. Since the Markov chain is represented by this very simple recursion. on the average. 2. Otherwise. for all n ≥ 0. .
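The recursion and its unrolled maximum form can be checked numerically (a sketch; the distribution chosen for ξn = An − Dn is my own example with Eξn < 0): starting from X0 = 0, the iterated recursion agrees with the maximum of 0 and the partial sums ξk + · · · + ξn−1.

import numpy as np

rng = np.random.default_rng(4)
n = 1000
# xi_t = A_t - D_t; Poisson(1.0) arrivals and Poisson(1.3) demands give E xi < 0.
xi = rng.poisson(1.0, n) - rng.poisson(1.3, n)

# Iterate the recursion X_{t+1} = max(X_t + xi_t, 0) with X_0 = 0.
X = 0
for t in range(n):
    X = max(X + xi[t], 0)

# Unrolled form: X_n = max(0, xi_{n-1}, xi_{n-2}+xi_{n-1}, ..., xi_0+...+xi_{n-1}).
suffix_sums = np.cumsum(xi[::-1])
print(X, max(0, suffix_sums.max()), X == max(0, suffix_sums.max()))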

. Notice. ξn−1 ). . (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . instead of considering them in their original order. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). . . we may replace the latter random variables by ξ1 . . where the latter equality follows from the fact that. Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). = x≥0 Since the n-dependent terms are bounded. . . we may take the limit of both sides as n → ∞. for all n. we ﬁnd π(y) = x≥0 π(x)pxy . using π(y) = limn→∞ P (Mn = y). the event whose probability we want to compute is a function of (ξ0 . . Indeed. . . so we can shuﬄe them in any way we like without altering their joint distribution. ξ0 ). ξn−1 ) = (ξn−1 . Therefore. 74 . .d. . . ξn . ξn−1 . Hence. for y ≥ 0. the random variables (ξn ) are i. we reverse it. (n) . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . and. . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). . in computing the distribution of Mn which depends on ξ0 .Thus. In particular. . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisﬁes the balance equations. . that d (ξ0 .i. . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. however. . .

.} > y) is not identically equal to 1. From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. . n→∞ This implies that P ( lim Zn = −∞) = 1. the Markov chain may be transient or null recurrent. as required. n→∞ Hence g(y) = P (0 ∨ max{Z1 .} < ∞) = 1.So π satisﬁes the balance equations. . This will be the case if g(y) is not identically equal to 1. The case Eξn = 0 is delicate. Depending on the actual distribution of ξn . . 75 . we will use the assumption that Eξn = −µ < 0. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. We must make sure that it also satisﬁes y π(y) = 1. Z2 . Here. . Z2 . n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . .

otherwise where p + q = 1. given a function p(x). Example 1: Let p(x) = C2−|x1 |−···−|xd | . If d = 1. Deﬁne pj . To be able to say this. So. then p(1) = p. . For example. j = 1. but. we can let S = Zd . In dimension d = 1. namely. This means that the transition probability pxy should depend on x and y only through their relative positions in space. the skip-free property. . if x = −ej . A random walk possesses an additional property. Example 3: Deﬁne p(x) = 1 2d . namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. 0.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. we won’t go beyond Zd . .y = p(y − x). .and space-homogeneous transition probabilities given by px. d 0. here. . a structure that enables us to say that px. j = 1. that of spatial homogeneity. x ∈ Zd . otherwise. the set of d-dimensional vectors with integer coordinates. have been processes with timehomogeneous transition probabilities. . where p1 + · · · + pd + q1 + · · · + qd = 1. . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else.y = px+z. if x = ej . a random walk is a Markov chain with time. . p(−1) = q. . More general spaces (e. p(x) = 0. ±ed } otherwise 76 . . where C is such that x p(x) = 1. .and space-homogeneity. this walk enjoys. . Our Markov chains. This is the drunkard’s walk we’ve seen before. A random walk of this type is called simple random walk or nearest neighbour random walk.y+z for any translation z. if x ∈ {±e1 . groups) are allowed. besides time. d p(x) = qj .g. we need to give more structure to the state space S. so far.

We will let p∗n be the convolution of p with itself n times. and this is the symmetric drunkard’s walk. which.i. (Recall that δz is a probability distribution that assigns probability 1 to the point z. Furthermore. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . where ξ1 . . x ∈ Zd is deﬁned as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p.) Now the operation in the last term is known as convolution. p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). Indeed. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). We call it simple symmetric random walk. If d = 1. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. are i. Obviously. x ∈ Zd . random variables in Zd with common distribution P (ξn = x) = p(x). We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). Let us denote by Sn the state of the walk at time n. p(x) = 0. Furthermore.This is a simple random walk. in addition. But this would be nonsense. We refer to ξn as the increments of the random walk. ξ2 . notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. We shall resist the temptation to compute and look at other 77 . otherwise. computing the n-th order convolution for all n is a notoriously hard problem. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . then p(1) = p(−1) = 1/2. The special structure of a random walk enables us to give more detailed formulae for its behaviour. . The convolution of two probabilities p(x). in general. p ∗ δz = p. .d. So. we can reduce several questions to questions about this random walk. . we will mean a sum over all possible values. . y p(x)p′ (y − x). has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . p′ (x). the initial state S0 is independent of the increments. One could stop here and say that we have a formula for the n-step transition probability. x + ξ2 = y) = y p(x)p(y − x). because all we have done is that we dressed the original problem in more fancy clothes. . . (When we write a sum without indicating explicitly the range of the summation variable.) In this notation. x ± ed .

−1} then we have a path in the plane that joins the points (0. and so #A P (A) = n .0) We can thus think of the elements of the sample space as paths. . Events A that depend only on these the ﬁrst n random signs are subsets of this sample space. we should assign.methods.1) + (0. 1) → (4. . +1}n . After all. The principle is trivial. ξn ). if we can count the number of elements of A we can compute its probability. . but its practice may be hard. + (1. Clearly. ξn = εn ) = 2−n . how long will that take? C. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that diﬀerent questions we need to ask or (b) that approximate methods are more informative than exact ones. we shall learn some counting methods. Since there are 2n elements in {0. Here are a couple of meaningful questions: A. 2) → (3.i+1 = pi. . a path of length n. +1. consider the following: 78 . to each initial position and to each sequence of n signs. 1): (2. we are dealing with a uniform distribution on the sample space {−1. +1. 1) → (2.i−1 = 1/2 for all i ∈ Z. .2) (4. . where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi.1) + − (5. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn .i. If yes. Look at the distribution of (ξ1 . 1}n . −1. Therefore. .1) − (3. for ﬁxed n. random variables (random signs) with P (ξn = ±1) = 1/2. +1}n . So. In the sequel. and the sequence of signs is {+1. +1}n . So if the initial position is S0 = 0. 2) → (5. . First. we have P (ξ1 = ε1 . all possible values of the vector are equally likely. Will the random walk ever return to where it started from? B. 2 where #A is the number of elements of A.2) where the ξn are i. a random vector with values in {−1. As a warm-up. A ⊂ {−1. If not. 0) → (1.d.

In if n + i is not an even number then there is no path that starts from (0. We use the notation {(m. i). x) (n. in this case. (n. y) is given by: #{(0. 0) (n. y)} = n−m . Let us ﬁrst show that Lemma 14.047. x) and end up at (n. Proof. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. 0) and end up at (n. 0)). So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. We can easily count that there are 6 paths in A. y) is given by: #{(m. x) (n. (n − m + y − x)/2 (23) Remark. i). 0) and ends at (n. S4 = −1. 0) and end up at the speciﬁc point (n.1 Paths that start and end at speciﬁc points Now consider a path that starts from a speciﬁc point and ends up at a speciﬁc point.0) The lightly shaded region contains all paths of length n that start at (0. The number of paths that start from (0.Example: Assume that S0 = 0. Consider a path that starts from (0. 0) and n ends at (n. and compute the probability of the event A = {S2 ≥ 0. i). i)} = n n+i 2 More generally. the number of paths that start from (m. (There are 2n of them. y)}. x) and end up at (n.) The darker region shows all paths that start at (0. n 79 . The paths comprising this event are shown below. y)} =: {all paths that start at (m.i) (0. S7 < 0}. Hence (n+i)/2 = 0. S6 < 0.

Let u be the number of +'s in this path (the number of upward segments) and d the number of −'s (the number of downward segments). Then u + d = n, u − d = i. Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly $\binom{n}{u}$ ways, and this proves the formula.

Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently, if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define $\binom{n}{u}$ to be zero if u is not an integer.

More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23).
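Formula (23) is easy to test by brute force for small n (a sketch, not from the notes): enumerate all sign sequences of length n and count those whose sum is i.

from itertools import product
from math import comb

n = 8
for i in range(-n, n + 1):
    # brute-force count of paths from (0, 0) to (n, i)
    count = sum(1 for signs in product((-1, 1), repeat=n) if sum(signs) == i)
    # formula (23): binomial(n, (n+i)/2), zero unless n + i is even
    formula = comb(n, (n + i) // 2) if (n + i) % 2 == 0 else 0
    assert count == formula
print("formula (23) checked for all endpoints i with n =", n)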

26.2

**Paths that start and end at speciﬁc points and do not touch zero at all
**

(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =

1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the speciﬁc point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the ﬁrst time it hits zero and then reﬂecting it around the horizontal axis. In the ﬁgure below, follow the solid path starting at (m, x) up to the ﬁrst time it hits zero and then follow the dotted path: this is the reﬂected path.

(n, y)

(m, x)

(n, −y)

A path and its reﬂection around a horizontal line. Notice that the reﬂection starts only at the point where the path touches the horizontal line.

Note that if we know the reﬂected path we can recover the original path. Hence we may as well count the number of reﬂected paths. But each reﬂected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =

1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)

1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

**Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
**

1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n

n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n

n 2

= #{(0, 1) n

n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1

n−y 2

(n, y + 1); remain > 0}

− −

n

n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

−

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the ﬁrst one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

**Distribution after n steps
**

n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

y)} . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . . n/2 (29) #{(0. Such a simple answer begs for a physical/intuitive explanation. 0) (n.P (Sn = y|Sm = x) = In particular. y)} . we shall just compute it. 27. n−m 2−n−m . . y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . n (30) Proof. if n is even. n 2 #{paths (m. . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). . The left side of (30) equals #{paths (1. y) that do not touch zero} × 2−n #{paths (0. P0 (S1 ≥ 0 . . Sn ≥ 0 | Sn = 0) = 84 1 . x) 2n−m (n.2 The ballot theorem Theorem 25. 27.3 Some conditional probabilities Lemma 17. The probability that. Sn ≥ 0 | Sn = y) = 1 − In particular. n 2−n . If n. 1) (n. . . if n is even. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. P0 (Sn = y) 2 (n + y) + 1 . . given that S0 = 0 and Sn = y > 0. . 0) (n. But. the random walk never becomes zero up to time n is given by P0 (S1 > 0. Sn > 0 | Sn = y) = y . for now. S2 > 0. . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . n This is called the ballot theorem and the reason will be given in Section 29.

We use (27): P0 (Sn ≥ 0 . +1 27. Sn = 0) = P0 (S1 > 0. . . . . . . Sn > 0) + P0 (S1 < 0. Sn ≥ 0).4 Some remarkable identities Theorem 26.Proof. . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . . 1 + ξ2 + ξ3 > 0. . n/2 where we used (28) and (29). . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. . . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . . Sn < 0) (b) (a) = 2P0 (S1 > 0. . . Proof. . . . •). Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . . . (b) follows from symmetry. . . . ξ2 + ξ3 ≥ 0. ξ3 .) by (ξ1 . . . If n is even. 85 . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . ξ3 . . . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . Sn ≥ 0) = P0 (S1 = 0. . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. (f ) follows from the replacement of (ξ2 . . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . . 0) (n. . (c) (d) (e) (f ) (g) where (a) is obvious. . . Sn ≥ 0) = #{(0. . . For even n we have P0 (S1 ≥ 0. . .). ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. . . . . . . P0 (S1 ≥ 0. Sn = 0) = P0 (Sn = 0). . and ﬁnally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. . We next have P0 (S1 = 0. . . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . and Sn−1 > 0 implies that Sn ≥ 0. . . . (e) follows from the fact that ξ1 is independent of ξ2 . . . . ξ2 . Sn ≥ 0.

. Sn−2 > 0|Sn−1 > 0. the two terms must be equal. and u(n) = P0 (Sn = 0). and if the particle is to the left of a then its mirror image is on the right. . Sn = 0) = 2P0 (Sn−1 > 0. we computed. . Also. . So u(n) f (n) = . It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. Sn = 0 means Sn−1 = 1. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. . with equal probability. we either have Sn−1 = 1 or −1. . for even n. Sn = 0) P0 (S1 > 0. f (n) = P0 (S1 > 0. Let n be even. Sn = 0) = u(n) . . Sn−1 > 0. Sn = 0). Sn−1 = 0. . . By symmetry. . Since the random walk cannot change sign before becoming zero. Sn−2 > 0|Sn−1 = 1). Then f (n) := P0 (S1 = 0. 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). . the probability u(n) that the walk (started at 0) attains value 0 in n steps. we can omit the last event. . Sn−1 < 0. This is not so interesting. n−1 Proof. Sn−1 > 0. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). .5 First return to zero In (29). Sn = 0) Now. . . P0 (Sn−1 > 0. To take care of the last term in the last expression for f (n). So 1 f (n) = 2u(n) P0 (S1 > 0. . If a particle is to the right of a then you see its image on the left. Sn = 0) + P0 (S1 < 0. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . Put a two-sided mirror at a. . So f (n) = 2P0 (S1 > 0. . given that Sn = 0. . We now want to ﬁnd the probability f (n) that the walk will return to 0 for the ﬁrst time in n steps.27. . So the last term is 1/2. Theorem 27. Fix a state a. we see that Sn−1 > 0. . Sn = 0. n−1 28 The reﬂection principle for a simple random walk in dimension 1 The reflection principle says the following. 86 . By the Markov property at time n − 1. . . .

independent of the past before Ta . the process (STa +m − a. Sn = Sn . Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). n ≥ 0) be a simple symmetric random walk and a > 0. as a corollary to the above result. n < Ta . S1 . P0 (Sn ≥ a) = P0 (Ta ≤ n. for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . we have n ≥ Ta . n ∈ Z+ ). Finally. Sn ≤ a). Then. n ≥ Ta . Sn ≤ a) + P0 (Ta ≤ n. . observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). The last equality follows from the fact that Sn > a implies Ta ≤ n. 28. Thus. In the last event. 2 Proof. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). Let (Sn . deﬁne Sn = where is the ﬁrst hitting time of a. so we may replace Sn by 2a − Sn (Theorem 28). .What is more interesting is this: Run the particle till it hits a (it will. 2a − Sn ≥ a) = P0 (Ta ≤ n. Trivially. n < Ta . But a symmetric random walk has the same law as its negative. n ∈ Z+ ) has the same law as (Sn . If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). So Sn − a can be replaced by a − Sn in the last display and the resulting process. 2 Deﬁne now the running maximum Mn = max(S0 . we obtain 87 . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. called (Sn . Sn > a) = P0 (Ta ≤ n. 2a − Sn . Proof. n ≥ Ta By the strong Markov property. Then Ta = inf{n ≥ 0 : Sn = a} Sn . But P0 (Ta ≤ n) = P0 (Ta ≤ n. Sn ). . Theorem 28. If Sn ≥ a then Ta ≤ n. Sn ≤ a) + P0 (Sn > a). a + (Sn − a).1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. m ≥ 0) is a simple symmetric random walk. In other words. Sn ≥ a). .

105. by applying the reﬂection principle (Theorem 28). The answer is very simple: Theorem 31. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. gives the result. If an urn contains a azure items and b = n−a black items. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. Sn = y) = P0 (Tx ≤ n. usually a vase furnished with a foot or pedestal. Compt. Rend. Bertrand. This. which results into P0 (Tx ≤ n. Proof. 8 88 . J. It is the latter use we are interested in in Probability and Statistics. as for holding liquids. D. ornamental objects. ηn ). 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). Sn = 2x − y). 369. 105. 436-437. (1887). Acad. If Tx ≤ n. . of which a are coloured azure and b black. Acad. e 10 Andr´. . The original ballot problem. 2 Proof. employed for diﬀerent purposes. (1887). e e e Paris. Since Mn < x is equivalent to Tx > n. (31) x > y. An urn is a vessel. and anciently for holding ballots to be drawn. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. Rend. 9 Bertrand. then. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . Solution directe du probl`me r´solu par M. the number of selected azure items is always ahead of the black ones equals (a − b)/n. n = a + b. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. Sn = y) = P0 (Sn = 2x − y). the ashes of the dead after cremation. A sample from the urn is a sequence η = (η1 . The set of values of η contains n elements and η is uniformly distributed a in this set. Solution d’un probl`me. Compt. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. Paris. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. combined with (31).Corollary 19. Sn = y). . We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. we have P0 (Mn < x. as long as a ≥ b. Sn = y) = P0 (Tx > n. then the probability that. during sampling without replacement. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. P0 (Mn < x. where ηi indicates the colour of the i-th item picked. Sci. Sci. .

But in (30) we computed that y P0 (S1 > 0. . . . where y = a − (n − a) = 2a − n. but which is not necessarily symmetric. letting a. Proof. (ξ1 . . The values of each ξi are ±1. the ‘black’ items correspond to − signs). . the sequence (ξ1 . Then. (The word “asymmetric” will always mean “not necessarily symmetric”. ξn = εn ) P0 (Sn = y) 2−n 1 = = . . n .We shall call (η1 . the result follows. indeed. εn be elements of {−1. Since the ‘azure’ items correspond to + signs (respectively. . . But if Sn = y then. . It remains to show that all values are equally a likely. . So the number of possible values of (ξ1 . . An urn can be realised quite easily be a random walk: Theorem 32. conditional on {Sn = y}. . ξn ) is uniformly distributed in a set with n elements. . . Proof of Theorem 31. a we have that a out of the n increments will be +1 and b will be −1. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. a − b = y. we have: P0 (ξ1 = ε1 . . ξn ) is. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . Consider a simple symmetric random walk (Sn . 89 . i ∈ Z. . so ‘azure’ items are +1 and ‘black’ items are −1. the sequence (ξ1 . ξn ) has a uniform distribution. . conditional on the event {Sn = y}. ηn ) an urn process with parameters n. −n ≤ k ≤ n. Since an urn with parameters n. . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. . . n ≥ 0) with values in Z and let y be a positive integer. b be deﬁned by a + b = n. . . . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. Then. always assuming (without any loss of generality) that a ≥ b = n − a. . . . We have P0 (Sn = k) = n n+k 2 pi. we can translate the original question into a question about the random walk. Sn > 0 | Sn = y) = . . p(n+k)/2 q (n−k)/2 . . . Sn > 0} that the random walk remains positive. We need to show that. . . . . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. n n −n 2 (n + y)/2 a Thus. . . using the deﬁnition of conditional probability and Lemma 16. a. +1} such that ε1 + · · · + εn = y. Let ε1 .i+1 = p. . . conditional on {Sn = y}. . .) So pi.i−1 = q = 1 − p. n Since y = a − (n − a) = a − b.

then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. In view of this. . x − 1 in order to reach x. random variables. Theorem 33. . (We tacitly assume that S0 = 0. and all t ≥ 0.i. We will approach this using probability generating functions. Lemma 19. E0 sTx = ϕ(s)x . Then the probability generating function of the ﬁrst time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . .) Then. 2. . y ∈ Z. Consider a simple random walk starting from 0. Consider a simple random walk starting from 0. In view of Lemma 19 we need to know the distribution of T1 . If S1 = 1 then T1 = 1. only the relative position of y with respect to x is needed. x ∈ Z. . 90 . Lemma 18. Start with S0 = 0 and watch the process until the ﬁrst n = T1 such that Sn = 1. plus a ′′ further time T1 required to go from 0 to +1 for the ﬁrst time. 2. If x > 0. . Proof.} ∪ {∞}. Proof. Let ϕ(s) := E0 sT1 .d. Corollary 20. For all x. plus a time T1 required to go from −1 to 0 for the ﬁrst time. .30. Px (Ty = t) = P0 (Ty−x = t). 2qs Proof. Consider a simple random walk starting from 0. To prove this. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. x ≥ 0) is also a random walk. See ﬁgure below. we make essential use of the skip-free property. The rest is a consequence of the strong Markov property. 1. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. it is no loss of generality to consider the random walk starting from 0. The process (Tx . This random variable takes values in {0.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. for x > 0.

Hence the previous formula applies with the roles of p and q interchanged. The formula is obtained by solving the quadratic. the RW must ﬁrst hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Corollary 22. 2ps Proof. from the strong Markov property. Then the probability generating function of the ﬁrst time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Consider a simple random walk starting from 0. Then the probability generating function of the ﬁrst time that level x ∈ Z is reached is given by √ x E sT1 x = 1− 1−4pqs2 .Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. Consider a simple random walk starting from 0. The ﬁrst time n such that Sn = −1 is the ﬁrst time n such that −Sn = 1. if x < 0 2ps Proof. Corollary 21. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. in order to hit 1. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . 91 . T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . 0 if x > 0 2qs Tx (32) E0 s = √ |x| E sT−1 |x| = 1− 1−4pqs2 0 . But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p.

′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. then we can omit the k upper limit in the sum. and simply write (1 + x)n = k≥0 n k x . We again do ﬁrst-step analysis. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34.30. The latter is.2 First return to the origin Now let us ask: If the particle starts at 0. how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. k 92 . 2ps ′ =1− 30. If we deﬁne n to be equal to 0 for k > n. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . If α = n is a positive integer then n (1 + x)n = k=0 n k x . 1 − 4pqs2 . Consider a simple random walk starting from 0. in distribution. the same as the time required for the walk to hit 1 starting ′ from 0. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . k by the binomial theorem. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . if S1 = 1. where α is a real number. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . For a symmetric ′ random walk.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof.3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. For the general case. Similarly. this was found in §30.

Recall that

C(n, k) = n(n − 1) · · · (n − k + 1) / k! = n! / (k!(n − k)!).

It turns out that the formula is formally true for general α, and it looks exactly the same:

(1 + x)^α = Σ_{k≥0} C(α, k) x^k,

as long as we define

C(α, k) = α(α − 1) · · · (α − k + 1) / k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. For example, we will show that

(1 − x)^{−1} = 1 + x + x^2 + x^3 + · · ·,

which is of course the all too-familiar geometric series. Indeed,

(1 − x)^{−1} = Σ_{k≥0} C(−1, k) (−x)^k.

But, by definition,

C(−1, k) = (−1)(−2) · · · (−1 − k + 1) / k! = (−1)^k k! / k! = (−1)^k,

so

(1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^{2k} x^k = Σ_{k≥0} x^k,

as needed. As another example, let us compute √(1 − x). We have

√(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2, k) (−1)^k x^k.

We can write this in more familiar symbols if we observe that

C(1/2, k) (−1)^k = − (1/(2k − 1)) C(2k, k) 4^{−k},

and this is shown by direct computation. So

√(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) (x/4)^k.

We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs^2):

E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^{2k}.
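A small numerical check of this expansion (an illustrative sketch, not part of the notes; the chosen values of p, s and the truncation point are arbitrary): a partial sum of the series should reproduce 1 − √(1 − 4pqs^2).

import math

def series(s, p, kmax=200):
    # partial sum of  sum_{k>=1} (1/(2k-1)) C(2k,k) (pq)^k s^(2k)
    q = 1 - p
    return sum(math.comb(2 * k, k) * (p * q) ** k * s ** (2 * k) / (2 * k - 1)
               for k in range(1, kmax + 1))

def closed_form(s, p):
    q = 1 - p
    return 1 - math.sqrt(1 - 4 * p * q * s * s)

for p in (0.3, 0.5):
    print(closed_form(0.95, p), series(0.95, p))   # should agree to several digits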

But, by the definition of E0 s^{T0′},

E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps), and then

P0(T0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P0(S2k = 0).

31  Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

P0( lim_{n→∞} Sn/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence it is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers alone does not settle whether the walk is recurrent or transient; that it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is

P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

1 − 4pq = 1 − 4p + 4p^2 = (1 − 2p)^2,   hence   √(1 − 4pq) = |1 − 2p| = |p − q|.

So

P0(Tx < ∞) = [ (1 − |p − q|) / (2q) ]^x,        if x > 0,
P0(Tx < ∞) = [ (1 − |p − q|) / (2p) ]^{|x|},     if x < 0.        (33)

This formula can be written more neatly as follows.

Theorem 35.
P0(Tx < ∞) = (p/q ∧ 1)^x,        if x > 0,
P0(Tx < ∞) = (q/p ∧ 1)^{|x|},     if x < 0.
In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z.
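The distribution of T0′ just obtained is easy to test numerically. The Python sketch below is mine, not the notes'; the sample size and horizon are arbitrary (paths that never return are cut off at the horizon, which does not affect the small values of k compared here).

import random, math
from collections import Counter

def first_return_time(p, horizon=500):
    pos, t = 0, 0
    while True:
        pos += 1 if random.random() < p else -1
        t += 1
        if pos == 0 or t >= horizon:
            return t if pos == 0 else None

p, trials = 0.3, 20_000
q = 1 - p
counts = Counter(first_return_time(p) for _ in range(trials))
for k in range(1, 5):
    exact = math.comb(2 * k, k) * p ** k * q ** k / (2 * k - 1)
    print(2 * k, counts[2 * k] / trials, exact)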

Proof. If p > q then |p − q| = p − q and so (33) gives P0(Tx < ∞) = 1 if x > 0, and P0(Tx < ∞) = (q/p)^{|x|} if x < 0, which is what we want. If p < q then |p − q| = q − p and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.
Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1  Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

M∞ = sup(S0, S1, S2, ...).

If we let Mn = max(S0, ..., Sn), then M∞ = lim_{n→∞} Mn. Indeed, the sequence Mn is nondecreasing, and every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞ with probability 1, because p < 1/2.

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:

P0(M∞ ≥ x) = (p/q)^x,   x ≥ 0.

Proof.
P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x,   x ≥ 0.

31.2  Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return, with probability 1, to the point where it started from. Otherwise this probability is always strictly less than 1: unless p = 1/2, with some positive probability the walk will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in Z. Let T0′ = inf{n ≥ 1 : Sn = 0} be the first return time to 0. Then

P0(T0′ < ∞) = 2(p ∧ q).
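Theorem 36 is also easy to check by simulation (a rough sketch, not part of the notes; the horizon truncates M∞, which is harmless because the walk drifts to −∞ and rarely improves its maximum late on):

import random

def overall_max(p, horizon=1000):
    pos, best = 0, 0
    for _ in range(horizon):
        pos += 1 if random.random() < p else -1
        best = max(best, pos)
    return best

p, q, trials = 0.35, 0.65, 5000
maxima = [overall_max(p) for _ in range(trials)]
for x in range(5):
    emp = sum(m >= x for m in maxima) / trials
    print(x, emp, (p / q) ** x)      # empirical tail vs (p/q)^x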

Proof. The probability generating function of T0′ was obtained in Theorem 34. But

P0(T0′ < ∞) = lim_{s↑1} E0 s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. If p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. If p = q this gives 1. In all cases, P0(T0′ < ∞) = 2(p ∧ q).

31.3  Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by

Ji = Σ_{n=1}^{∞} 1(Sn = i),        (34)

a random variable first considered in Section 10, where it was also shown to be geometric. In Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric; in the recurrent case Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1. We now compute its exact distribution for the simple random walk.

Theorem 38. Consider a simple random walk with values in Z, starting from zero. Then the total number J0 of visits to state 0 is geometrically distributed:

P0(J0 ≥ k) = (2(p ∧ q))^k,   k = 0, 1, 2, ...,
E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J0 ≥ 1 if and only if the first return time to zero is finite. So

P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so, since J0 is geometric (as already known),

P0(J0 ≥ k) = P0(J0 ≥ 1)^k = (2(p ∧ q))^k,   k = 0, 1, 2, ....

This proves the first formula. For the second formula we have

E0 J0 = Σ_{k=1}^{∞} P0(J0 ≥ k) = Σ_{k=1}^{∞} (2(p ∧ q))^k = 2(p ∧ q) / (1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, the same formulae apply for any starting state i: Pi(Ji ≥ k) = (2(p ∧ q))^k and Ei Ji = 2(p ∧ q)/(1 − 2(p ∧ q)).
Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1, as already known.

We now look at the number of visits to some state when we start from a different state.
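A quick simulation check of Theorem 38 (a sketch under my own parameter choices; the time horizon truncates J0, which only matters when p is close to 1/2):

import random

def visits_to_zero(p, horizon=1000):
    pos, visits = 0, 0
    for _ in range(horizon):
        pos += 1 if random.random() < p else -1
        visits += (pos == 0)
    return visits

p, trials = 0.3, 10_000
r = 2 * min(p, 1 - p)                       # 2(p ∧ q)
sample = [visits_to_zero(p) for _ in range(trials)]
for k in range(4):
    print(k, sum(j >= k for j in sample) / trials, r ** k)
print("mean", sum(sample) / trials, "exact", r / (1 - r))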

Theorem 39. Consider a random walk with values in Z, starting from 0. Then, for k ≥ 1,

if p ≥ 1/2:  P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},   E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q),
if p ≤ 1/2:  P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1},       E0 Jx = (p/q)^{x∨0} / (1 − 2p).

Proof. Suppose x > 0. Then, simply because Jx ≥ k ≥ 1 implies Tx < ∞,

P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k).

Apply now the strong Markov property at Tx:

P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).

The reason that we replaced k by k − 1 is that, given Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain, for x > 0,

P0(Jx ≥ k) = (p/q ∧ 1)^x (2(p ∧ q))^{k−1}.

If p ≤ 1/2 then this gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}. If p ≥ 1/2 then this gives P0(Jx ≥ k) = (2q)^{k−1}. We repeat the argument for x < 0 and find

P0(Jx ≥ k) = (q/p ∧ 1)^{|x|} (2(p ∧ q))^{k−1};

if p ≤ 1/2 this gives P0(Jx ≥ k) = (2p)^{k−1}, and if p ≥ 1/2 it gives P0(Jx ≥ k) = (q/p)^{|x|} (2q)^{k−1}. As for the expectation, we apply E0 Jx = Σ_{k=1}^{∞} P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2. In other words, if the random walk 'drifts down' (i.e. p < 1/2) then, for x > 0, E0 Jx drops down to zero geometrically fast as x tends to infinity; but for x < 0 we have E0 Jx = 1/(1 − 2p) regardless of x, i.e. the walk will visit any point below 0 the same number of times on the average.

Remark 2: If the initial state is not 0, use spatial homogeneity to obtain the analogous formulae: Py(Jx ≥ k) = P0(J_{x−y} ≥ k) and Ey Jx = E0 J_{x−y}.

32  Duality

Consider a random walk Sk = Σ_{i=1}^{k} ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (Ŝk, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by

Ŝk := Sn − S_{n−k},   0 ≤ k ≤ n.

Theorem 40. (Ŝk, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).
Proof. Use the obvious fact that (ξ1, ξ2, ..., ξn) has the same distribution as (ξn, ξ_{n−1}, ..., ξ1).

Thus, every time we have a probability for S we can translate it into a probability for Ŝ, which is basically another probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk and x a positive integer. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then

P0(Tx = n) = (x/n) P0(Sn = x),   n ≥ 1.

Proof. Fix an integer n. Rewrite the ballot theorem in terms of the dual (this is legitimate by Theorem 40):

P0(Ŝ1 > 0, ..., Ŝn > 0 | Ŝn = x) = x/n,   x > 0.

But the left-hand side equals

P0(Sn − S_{n−k} > 0, 1 ≤ k ≤ n | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1 | Sn = x)
= P0(Tx = n, Sn = x) / P0(Sn = x) = P0(Tx = n) / P0(Sn = x).

Pause for reflection: we have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:

E0 s^{Tx} = [ (1 − √(1 − 4pqs^2)) / (2qs) ]^x.        (35)

Specialising to the symmetric case we have

E0 s^{Tx} = [ (1 − √(1 − s^2)) / s ]^x.

In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:

P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).        (36)

Lastly, in Theorem 41, we derived, using duality, the probabilities

P0(Tx = n) = (x/n) P0(Sn = x).        (37)
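Formula (37) can be verified for small n by exhaustive enumeration rather than by simulation. The sketch below is mine, not the notes'; it simply walks over all 2^n equally likely sign sequences.

from itertools import product
from math import comb

def hitting_time_prob(x, n):
    # P0(Tx = n) for the simple symmetric walk, by enumerating all 2^n paths
    hits = 0
    for steps in product((1, -1), repeat=n):
        pos, first_hit = 0, None
        for i, step in enumerate(steps, start=1):
            pos += step
            if pos == x:
                first_hit = i
                break
        if first_hit == n:
            hits += 1
    return hits / 2 ** n

x, n = 2, 10
lhs = hitting_time_prob(x, n)
rhs = (x / n) * comb(n, (n + x) // 2) / 2 ** n   # (x/n) * P0(Sn = x)
print(lhs, rhs)   # the two numbers agree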

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33  Amount of time that a SSRW is positive*

Consider a SSRW (Sn) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+(2n) be the number of sides of the polygonal line that are above the horizontal axis, and T−(2n) = 2n − T+(2n) the number of sides that are below. Clearly, T±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, P(T+(2n) = m) is largest when m = 0 or m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

P(T+(2n) = 2k) = P(S2k = 0) P(S_{2n−2k} = 0),   0 ≤ k ≤ n.

Proof. The idea is to condition on the first time T0′ that the RW returns to 0:

P(T+(2n) = 2k) = Σ_{r=1}^{n} P(T+(2n) = 2k, T0′ = 2r).

Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write

P(T+(2n) = 2k) = Σ_{r=1}^{n} P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
               + Σ_{r=1}^{n} P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).

In the first case,

P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) − 2r = 2k − 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),

because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis, and so T+(2n) is reduced by 2r; furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case a similar argument (the first 2r units of time are spent below the axis and contribute nothing to T+) shows

P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).

We also observe that, by symmetry,

P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f_{2r}.

Letting p_{2k,2n} = P(T+(2n) = 2k), we have shown that

p_{2k,2n} = (1/2) Σ_{r=1}^{k} f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S2k = 0). With this hypothesis we can start playing with the recursion of the display above and, using induction on n (the recursion only involves smaller time horizons), we realise that

p_{2k,2n} = u_{2n−2k} · (1/2) Σ_{r=1}^{k} f_{2r} u_{2k−2r} + u_{2k} · (1/2) Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r},

and the right-hand side equals u_{2k} u_{2n−2k}, because Σ_{r=1}^{m} f_{2r} u_{2m−2r} = u_{2m} (decompose a visit to 0 at time 2m according to the time 2r of the first return to 0).

Explicitly, we have

P(T+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^{−2n},   0 ≤ k ≤ n.        (38)
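A small script (a sketch of mine, not part of the notes) confirms both that (38) sums to 1 over k and that the distribution is U-shaped, i.e. largest at k = 0 and k = n:

from math import comb

def time_positive_pmf(n):
    # P(T+(2n) = 2k) = C(2k,k) C(2n-2k,n-k) 2^(-2n),  k = 0..n
    return [comb(2 * k, k) * comb(2 * n - 2 * k, n - k) / 4 ** n for k in range(n + 1)]

pmf = time_positive_pmf(50)            # the 2n = 100 case plotted below
print(sum(pmf))                        # 1.0 up to rounding
print(pmf[0], pmf[25], pmf[50])        # the endpoints are the most likely values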

With 2n = 100, we plot the function P(T+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Plot of P(T+(100) = 2k) against k, 0 ≤ k ≤ 50: the values are about 0.08 at k = 0 and k = 50 and dip to about 0.02 near the middle of the range.]

The arcsine law*

Use of Stirling's approximation in formula (38) gives

P(T+(2n) = 2k) ∼ 1 / (π √(k(n − k))),   as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

P( T+(2n)/2n ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T+(2n)/2n = k/n )
                        ∼ Σ_{a ≤ k/n ≤ b} 1 / (π √(k(n − k)))
                        = Σ_{a ≤ k/n ≤ b} (1/n) · 1 / (π √((k/n)(1 − k/n))).

This is easily seen to have a limit as n → ∞, the limit being

∫_a^b dt / (π √(t(1 − t))),

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:

lim_{n→∞} P( T+(2n)/2n ≤ x ) = ∫_0^x dt / (π √(t(1 − t))) = (2/π) arcsin √x.

We thus have that the limiting distribution of T+(2n)/2n (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk, because the summands in Tx = Σ_{y=1}^{x} (Ty − T_{y−1}) are i.i.d., all distributed as T1:

P(T1 ≥ 2n) = P(S2n = 0) = C(2n, n) 2^{−2n} ∼ 1/√(πn).
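The arcsine law is striking enough to be worth simulating. In the sketch below (mine, not the notes'), 'time positive' is counted by sides of the path lying above the axis, exactly as in the definition of T+(2n); sample sizes are arbitrary.

import random, math

def fraction_positive(two_n):
    pos, above = 0, 0
    for _ in range(two_n):
        prev = pos
        pos += 1 if random.random() < 0.5 else -1
        # a side lies above the axis if at least one endpoint is strictly positive
        if prev > 0 or pos > 0:
            above += 1
    return above / two_n

trials, two_n = 5000, 1000
fractions = [fraction_positive(two_n) for _ in range(trials)]
for x in (0.1, 0.25, 0.5, 0.9):
    emp = sum(f <= x for f in fractions) / trials
    print(x, emp, 2 / math.pi * math.asin(math.sqrt(x)))   # empirical CDF vs arcsine CDF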

From this, we immediately get that the expectation of T1 is infinity: E T1 = ∞. Many other quantities can be approximated in a similar manner.

34  Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y), where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south.

We describe this motion as a random walk in Z^2 = Z × Z: let ξ1, ξ2, ... be i.i.d. random vectors such that

P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4.

We then let S0 = (0, 0) and Sn = Σ_{i=1}^{n} ξi, n ≥ 1. If x ∈ Z^2, we let x1 be its horizontal coordinate and x2 its vertical one, and write Sn = (Sn^1, Sn^2).

Lemma 20. (Sn^1, n ∈ Z+) and (Sn^2, n ∈ Z+) are random walks in one dimension.
Proof. Since ξ1, ξ2, ... are independent, so are ξ1^1, ξ2^1, ..., the latter being functions of the former. Hence (Sn^1, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. The same argument applies to (Sn^2, n ∈ Z+), showing that it, too, is a random walk.

The increments are identically distributed with

P(ξn^1 = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
P(ξn^1 = 1) = P(ξn = (1, 0)) = 1/4,   P(ξn^1 = −1) = P(ξn = (−1, 0)) = 1/4.

However, (Sn^1) is not a simple random walk, because there is a possibility that the increment takes the value 0. Moreover, the two random walks (Sn^1), (Sn^2) are not independent: for each n, the random variables ξn^1, ξn^2 are not independent, simply because if one takes the value 0 the other is nonzero; hence the two stochastic processes are not independent.

Consider a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x1, x2) into (x1 + x2, x1 − x2). So let

Rn = (Rn^1, Rn^2) := (Sn^1 + Sn^2, Sn^1 − Sn^2).

We have:

Lemma 21. (Rn, n ∈ Z+) is a random walk in two dimensions. Moreover, (Rn^1, n ∈ Z+) and (Rn^2, n ∈ Z+) are independent simple symmetric random walks in one dimension.
Proof. We have Rn^1 = Σ_{i=1}^{n} (ξi^1 + ξi^2) and Rn^2 = Σ_{i=1}^{n} (ξi^1 − ξi^2), i.e. Rn = Σ_{i=1}^{n} ηi, where ηn = (ηn^1, ηn^2) := (ξn^1 + ξn^2, ξn^1 − ξn^2). The sequence of vectors η1, η2, ... is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, ....; hence (Rn, n ∈ Z+) is a random walk in two dimensions, and a similar argument shows that each of (Rn^1, n ∈ Z+), (Rn^2, n ∈ Z+) is a random walk in one dimension. The possible values of each of ηn^1, ηn^2 are ±1. We have

P(ηn^1 = 1, ηn^2 = 1) = P(ξn = (1, 0)) = 1/4,
P(ηn^1 = 1, ηn^2 = −1) = P(ξn = (0, 1)) = 1/4,
P(ηn^1 = −1, ηn^2 = 1) = P(ξn = (0, −1)) = 1/4,
P(ηn^1 = −1, ηn^2 = −1) = P(ξn = (−1, 0)) = 1/4.

Hence P(ηn^1 = 1) = P(ηn^1 = −1) = 1/2 and P(ηn^2 = 1) = P(ηn^2 = −1) = 1/2, so each of (Rn^1), (Rn^2) is a simple symmetric random walk. To show that the last two are independent, we check that

P(ηn^1 = ε, ηn^2 = ε′) = P(ηn^1 = ε) P(ηn^2 = ε′)   for all choices of ε, ε′ ∈ {−1, +1},

which follows from the display above.

Lemma 22. P(S2n = 0) = [ C(2n, n) 2^{−2n} ]^2.
Proof. Since Sn = 0 if and only if Rn = (0, 0),

P(S2n = 0) = P(R2n = 0) = P(R2n^1 = 0, R2n^2 = 0) = P(R2n^1 = 0) P(R2n^2 = 0) = [ C(2n, n) 2^{−2n} ]^2,

where the last but one equality follows from the independence of the two random walks (Rn^1), (Rn^2), and the last one from the fact that each is a simple symmetric random walk. By Stirling's approximation, P(S2n = 0) ∼ c/n as n → ∞, in the sense that the ratio of the two sides goes to 1, where c is a positive constant.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.
Proof. By Lemma 5, we need to show that Σ_{n=0}^{∞} P(Sn = 0) = ∞. But, by the previous lemma, P(S2n = 0) ∼ c/n, and Σ_n c/n = ∞.

35  Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Recall, from Theorem 7, that for a simple symmetric random walk (started from S0 = 0) and a < 0 < b,

P(Ta < Tb) = b/(b − a),   P(Tb < Ta) = −a/(b − a).

We can read this as follows:

P(S_{Ta∧Tb} = a) = b/(b − a),   P(S_{Ta∧Tb} = b) = −a/(b − a).
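Lemma 22 lends itself to a direct numerical check (a sketch of mine with modest sample sizes):

import random
from math import comb

def returns_to_origin_2d(n_steps, trials):
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    count = 0
    for _ in range(trials):
        x = y = 0
        for _ in range(n_steps):
            dx, dy = random.choice(moves)
            x += dx
            y += dy
        count += (x == 0 and y == 0)
    return count / trials

n = 8                                      # look at time 2n = 16
exact = (comb(2 * n, n) / 4 ** n) ** 2
print(returns_to_origin_2d(2 * n, 100_000), exact)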

This last display tells us the distribution of S_{Ta∧Tb}: the walk exits the interval [a, b] from the left point with probability b/(b − a) and from the right point with probability −a/(b − a). Note, in particular, that E S_{Ta∧Tb} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b with a < 0 < b such that pa + (1 − p)b = 0, so that the walk exits [a, b] from the left point with probability p and from the right point with probability 1 − p. Indeed, if p = M/N with M, N ∈ N and M < N, choose a = M − N and b = M. This means that we can simulate the 2-point distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b].

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z) with zero mean: Σ_{i∈Z} i pi = 0. Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that P(ST = i) = pi, i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ {(i, j) : i < 0 < j} with the following distribution:

P(A = 0, B = 0) = p0,
P(A = i, B = j) = c^{−1} (j − i) pi pj,   i < 0 < j,

where c^{−1} is chosen by normalisation:

P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.

This leads to

c^{−1} Σ_{j>0} j pj Σ_{i<0} pi + c^{−1} Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.

Since Σ_{i∈Z} i pi = 0, we have the common value

Σ_{i<0} (−i) pi = Σ_{i>0} i pi,

and, using this, we get that c is precisely this common value:

c = Σ_{i<0} (−i) pi = Σ_{i>0} i pi.

Letting Ti = inf{n ≥ 0 : Sn = i}, i ∈ Z, and assuming that (A, B) is independent of (Sn), we consider the random variable

Z = S_{TA ∧ TB}.

The claim is that P(Z = k) = pk, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has

probability p0, by definition. Suppose next k > 0. We have

P(Z = k) = P(S_{TA∧TB} = k) = Σ_{i<0} Σ_{j>0} P(S_{Ti∧Tj} = k) P(A = i, B = j)
         = Σ_{i<0} P(S_{Ti∧Tk} = k) P(A = i, B = k)
         = Σ_{i<0} ( (−i)/(k − i) ) · c^{−1} (k − i) pi pk
         = c^{−1} Σ_{i<0} (−i) pi pk = pk.

Similarly, P(Z = k) = pk for k < 0.
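The construction in the proof is easy to turn into a small simulation. The sketch below is mine, not the notes'; the zero-mean target distribution and all names are arbitrary. It samples (A, B) as above, runs the walk until it exits (A, B), and compares the empirical law of S_{TA∧TB} with the target.

import random
from collections import Counter

# an arbitrary zero-mean target distribution on Z (probabilities sum to 1)
target = {-3: 0.1, -1: 0.3, 0: 0.2, 1: 0.2, 2: 0.2}

# law of (A, B):  P(A=0,B=0) = p0,  P(A=i,B=j) = c^{-1} (j - i) p_i p_j  for i < 0 < j
c = sum(-i * p for i, p in target.items() if i < 0)
pairs, weights = [(0, 0)], [target.get(0, 0.0)]
for i, pi in target.items():
    if i >= 0:
        continue
    for j, pj in target.items():
        if j <= 0:
            continue
        pairs.append((i, j))
        weights.append((j - i) * pi * pj / c)

def embedded_sample():
    a, b = random.choices(pairs, weights)[0]
    if a == b == 0:
        return 0
    pos = 0
    while a < pos < b:                  # run the SSRW until it exits (a, b)
        pos += random.choice((-1, 1))
    return pos

trials = 100_000
freq = Counter(embedded_sample() for _ in range(trials))
for i in sorted(target):
    print(i, freq[i] / trials, target[i])   # empirical law vs target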

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers: N = {1, 2, 3, ...}. The set Z of integers consists of N, their negatives and zero: Z = {· · · , −3, −2, −1, 0, 1, 2, 3, · · · }. Zero is denoted by 0.

Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Now, if a, b are arbitrary integers, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a and d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.)

Notice that gcd(a, b) = gcd(a, b − a). Indeed, let d = gcd(a, b). Then d | a and d | b, therefore d | b − a; so (i) holds for the pair (a, b − a). Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds, and we're done: d = gcd(a, b − a).

This gives us an algorithm for finding the gcd between two numbers. Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. In other words, we may summarise each stage by saying that we replace the largest of the two numbers by the remainder of the division of the largest by the smallest. For instance,

gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26) = gcd(6, 20)
             = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.

But in the first line we subtracted 38 exactly 3 times; indeed 4 × 38 > 120, so 3 is the maximum number of times that 38 'fits' into 120, and 6 is the remainder of the division of 120 by 38. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school):

gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

The theorem of division says the following: given two positive integers b, a, we can find an integer q and an integer r ∈ {0, 1, ..., a − 1} such that b = qa + r. Its proof is obvious. The long division algorithm is a process for finding the quotient q and the remainder r. For example, long division of 3450 by 26 gives quotient q = 132 and remainder r = 18, i.e. 3450 = 132 × 26 + 18.

Here are some other facts about the greatest common divisor. First, if d = gcd(a, b) then there are integers x, y such that d = xa + yb. To see this, let us follow the previous example again. In the process we used the divisions

120 = 3 × 38 + 6,
 38 = 6 × 6 + 2.

Working backwards, we have

2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120,

i.e. gcd(38, 120) = 38x + 120y with x = 19, y = −6. The same logic works in general.

Second, we can show that, for integers a, b, not both zero,

gcd(a, b) = min{ xa + yb : x, y ∈ Z, xa + yb > 0 }.

To see why this is true, let d be the right-hand side. Then d = xa + yb for some x, y ∈ Z. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that d | a and d | b. Suppose this is not true, say, that d does not divide b. Since d > 0, we can divide b by d to find b = qd + r for some 1 ≤ r ≤ d − 1. But then b = q(xa + yb) + r, yielding r = (−qx)a + (1 − qy)b. So r is expressible as a linear combination of a and b, is strictly positive, and is strictly smaller than d (being a remainder). So d is not minimum as claimed, and this is a contradiction. Hence d divides both a and b.
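The two facts above translate directly into code (a sketch, for illustration only): Euclid's algorithm computes gcd(a, b), and running it backwards (the 'extended' version) produces integers x, y with gcd(a, b) = xa + yb.

def gcd(a, b):
    # Euclid's algorithm: replace the larger number by the remainder of its
    # division by the smaller one, until the remainder is zero.
    while b:
        a, b = b, a % b
    return a

def extended_gcd(a, b):
    # returns (d, x, y) with d = gcd(a, b) = x*a + y*b
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    return d, y, x - (a // b) * y

print(gcd(120, 38))             # 2
print(extended_gcd(120, 38))    # (2, -6, 19), i.e. 2 = -6*120 + 19*38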

The greatest common divisor of three integers can now be defined in a similar manner; finally, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. In general, a relation can be thought of as a set of pairs of elements of the set A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y.

A relation is called reflexive if it contains the pairs (x, x). The relation < is not reflexive; the equality relation = is reflexive. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The relation < is not symmetric; the equality relation is symmetric. A relation is transitive if, whenever it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive. If we take A to be the set of students in a classroom, then the relation 'x is a friend of y' is symmetric, and it can be thought of as reflexive (unless we do not want to accept the fact that one may not be a friend of oneself), but it is not necessarily transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation 'i communicates with j' in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively. We may think of the elements of A as animals and those of B as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names, so does the second, and the third, and so on. Hence the total number of assignments is

m × m × · · · × m   (n times),

i.e. the number of functions from A to B (in other words, the cardinality of the set B^A) is m^n. Incidentally, we thus observe that there are as many subsets of A as elements in the set {0, 1}^n of sequences of length n with elements 0 or 1, namely 2^n.

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is

(m)_n := m(m − 1)(m − 2) · · · (m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A, and we denote (n)_n also by n!; in other words, n! stands for the product of the first n natural numbers,

n! = 1 × 2 × 3 × · · · × n,

and n! is the total number of permutations of a set of n elements.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen. The first element can be chosen in n ways, the second in n − 1 ways, and the k-th in n − k + 1 ways. So we have (n)_k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

(n)_k / k! =: C(n, k),

where the symbol C(n, k) is merely a name for the number computed on the left side of this equation. Clearly,

C(n, k) = n! / (k!(n − k)!) = C(n, n − k).

Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1. We shall let C(n, k) = 0 if n < k. The total number of subsets of A (regardless of their number of elements) is thus equal to

Σ_{k=0}^{n} C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem.

The Pascal triangle relation states that

C(n + 1, k) = C(n, k) + C(n, k − 1).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose k animals from the remaining n ones, and this can be done in C(n, k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen from the other n, and this can be done in C(n, k − 1) ways. Since the sentences 'the pig is included' and 'the pig is not included' cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Also, if x is a real number, we define

C(x, m) = x(x − 1) · · · (x − m + 1) / m!.

If y is not an integer then, by convention, we let C(n, y) = 0.

Maximum and minimum

The minimum of two numbers a, b is denoted by a ∧ b = min(a, b). Thus a ∧ b = a if a ≤ b, and = b if b ≤ a. The maximum is denoted by a ∨ b = max(a, b). The notation extends to more than two numbers; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c).

In particular, we use the following notations:

a+ = max(a, 0),   a− = −min(a, 0).

Both quantities are nonnegative; a+ is called the positive part of a, and a− the negative part of a. Note that it is always true that a = a+ − a−. The absolute value of a is defined by |a| = a+ + a−.
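As an illustration of the Pascal triangle relation (a sketch, not from the notes): build C(n, k) from the recursion alone and compare with the factorial formula.

from math import factorial

def binomial_table(n_max):
    # C(n+1, k) = C(n, k) + C(n, k-1), with C(n, 0) = 1 and C(n, k) = 0 for k > n
    table = [[1]]
    for n in range(1, n_max + 1):
        prev = table[-1] + [0]
        table.append([1] + [prev[k] + prev[k - 1] for k in range(1, n + 1)])
    return table

C = binomial_table(10)
print(C[10][3], factorial(10) // (factorial(3) * factorial(7)))   # both equal 120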

Indicator function

1A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. It is easy to see that

1_{A∩B} = 1A 1B,   1 − 1A = 1_{A^c}.

This is very useful notation because conjunction of clauses is transformed into multiplication. Therefore, if A1, A2, ... are sets or clauses, then the number Σ_{n≥1} 1_{An} is the number of true clauses. For instance, if I go to a bank N times and Ai is the clause 'I am robbed the i-th time', then Σ_{i=1}^{N} 1_{Ai} is the total number of times I am robbed.

If A is an event then 1A is a random variable. It takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is

E 1A = 1 × P(A) + 0 × P(A^c) = P(A).

An example of its application: since

1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1A)(1 − 1B) = 1A + 1B − 1A 1B = 1A + 1B − 1_{A∩B},

we have, by taking expectations,

E[1_{A∪B}] = E[1A] + E[1B] − E[1_{A∩B}],

which means P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Several quantities of interest are expressed in terms of indicators. Another example: let X be a nonnegative random variable with values in N = {1, 2, ...}. Then

X = Σ_{n=1}^{X} 1   (the sum of X ones is X),   so   X = Σ_{n≥1} 1(X ≥ n),

and therefore

E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that

ϕ(αx + βy) = αϕ(x) + βϕ(y),

for all x, y ∈ R^n and all α, β ∈ R. Let u1, ..., un be a basis for R^n. Then any x ∈ R^n can be uniquely written as

x = Σ_{j=1}^{n} x^j uj,

and the column (x^1, ..., x^n)^T is a column representation of x with respect to the given basis. Since ϕ is linear,

y := ϕ(x) = Σ_{j=1}^{n} x^j ϕ(uj).

But the ϕ(uj) are vectors in R^m, so if we choose a basis v1, ..., vm in R^m we can express each ϕ(uj) as a linear combination of the basis vectors:

ϕ(uj) = Σ_{i=1}^{m} a_{ij} vi.

Combining, we have

ϕ(x) = Σ_{i=1}^{m} vi Σ_{j=1}^{n} a_{ij} x^j.

The matrix A of ϕ with respect to the choice of the two bases is defined as the m × n array A := (a_{ij}). If we let y^i := Σ_{j=1}^{n} a_{ij} x^j, i = 1, ..., m, then the column (y^1, ..., y^m)^T is a column representation of y := ϕ(x) with respect to the basis v1, ..., vm, and the relation between the column representations of y and of x is written as y = Ax.

Suppose that ψ, ϕ are linear functions

ψ : R^n → R^m,   ϕ : R^m → R^ℓ.

Then ϕ ∘ ψ is a linear function from R^n to R^ℓ. Fix three sets of bases on R^n, R^m and R^ℓ and let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. Then the matrix representation of ϕ ∘ ψ is AB, where

(AB)_{ij} = Σ_{k=1}^{m} A_{ik} B_{kj},   1 ≤ i ≤ ℓ, 1 ≤ j ≤ n.

So A is an ℓ × m matrix, B is an m × n matrix, and AB is an ℓ × n matrix.

A matrix A is called square if it is an n × n matrix; i.e. a square matrix represents a linear transformation from R^n to R^n. If we fix a basis in R^n, the matrix is its representation with respect to the fixed basis. Usually the choice of basis is the so-called standard basis: the standard basis in R^d consists of the vectors

e1 = (1, 0, ..., 0)^T,   e2 = (0, 1, 0, ..., 0)^T,   ...,   ed = (0, ..., 0, 1)^T,

i.e. the j-th component of ei equals 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence a1, a2, ... of numbers), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, ...} has no minimum. A property of real numbers says that if the set I is bounded from below then the maximum of all lower bounds exists; it is this that we call the infimum of I, written inf I, defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. It is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Similarly, if I is bounded from above, we define the supremum sup I to be the minimum of all upper bounds; it is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε. If I has no upper bound then sup I = +∞; if I has no lower bound then inf I = −∞. If I is empty then, by convention, inf ∅ = +∞ and sup ∅ = −∞.

Limsup and liminf

If x1, x2, ... is a sequence of numbers then

inf{x1, x2, ...} ≤ inf{x2, x3, ...} ≤ inf{x3, x4, ...} ≤ · · ·,

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it lim inf xn (limit inferior) of the sequence. Likewise, lim sup xn = lim_{n→∞} sup{xn, xn+1, xn+2, ...} (limit superior). We note that lim inf(−xn) = −lim sup xn. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory), but I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely: if A1, A2, ... are mutually disjoint events, then

P( ∪_{n≥1} An ) = Σ_{n≥1} P(An).

Everything else follows from this axiom. That is why Probability Theory is very rich; in contrast, Linear Algebra, which has many axioms, is more 'restrictive'. (I do sometimes 'cheat' from a strict mathematical point of view in these notes, but not too much.)

Some consequences. If A, B are disjoint events then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. If A1, A2, ... are increasing with A = ∪n An then P(A) = lim_{n→∞} P(An); similarly, if the events A1, A2, ... are decreasing with A = ∩n An then P(A) = lim_{n→∞} P(An). In general, P(∪n An) ≤ Σn P(An). If An are events with P(An) = 0 then P(∪n An) = 0; if An are events with P(An) = 1 then P(∩n An) = 1.

Often, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B). Similarly, to show that EX ≤ EY, we often need to argue that X ≤ Y. Quite often, to compute P(A), it is useful to find disjoint events B1, B2, ... such that ∪n Bn = Ω, in which case we write

P(A) = Σn P(A ∩ Bn) = Σn P(A | Bn) P(Bn).

In Markov chain theory we work a lot with conditional probabilities. Recall that P(B | A) is defined by P(A ∩ B)/P(A), and this is a very motivated definition. To compute P(A ∩ B) we read the definition as

P(A ∩ B) = P(A) P(B | A).

This is generalisable to any finite number of events:

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2).

If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A) P(B | A) = P(B) P(A | B), and this is called Bayes' theorem.

If A is an event then 1A, or 1(A), is the indicator random variable, i.e. the function that takes value 1 on A and 0 everywhere else. Note that 1A 1B = 1_{A∩B}, 1_{A^c} = 1 − 1A, and E 1A = P(A). Markov's inequality for a positive random variable X states that

P(X > x) ≤ EX/x.

This follows from the logical statement X 1(X > x) ≤ X, which holds since X is positive, by taking expectations.

The expectation of a nonnegative random variable X may be defined by

EX = ∫_0^∞ P(X > x) dx.

The formula holds for any positive random variable. In case the random variable is discrete, it reduces to the known one, viz. EX = Σ_x x P(X = x); in case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^{∞} x f(x) dx. For a random variable which takes positive and negative values, we define X+ = max(X, 0) and X− = −min(X, 0), which are both nonnegative, and then define EX = EX+ − EX−, which is motivated by the algebraic identity X = X+ − X−. This definition makes sense only if EX+ < ∞ or EX− < ∞. For a random variable that takes values in N we can write EX = Σ_{n=1}^{∞} P(X ≥ n); this is due to the fact that Σ_{n=1}^{∞} 1(X ≥ n) = X. An application of this is: P(X < ∞) = lim_{x→∞} P(X ≤ x). In Markov chain theory we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

Generating functions

A generating function of a sequence a0, a1, ... is an infinite polynomial having the ai as coefficients:

G(s) = a0 + a1 s + a2 s^2 + · · ·.

Note that |G(s)| ≤ Σn |an| |s|^n, and this is ≤ Σn |an| if s ∈ [−1, 1]; so if Σn |an| < ∞ then G(s) exists for all s ∈ [−1, 1]. We do not a priori know that G(s) exists for any s other than, obviously, s = 0, for which G(0) = a0. If it exists for some s1 then it exists for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is

Σ_{n=0}^{∞} ρ^n s^n = 1/(1 − ρs),   |s| < 1/|ρ|.        (39)

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence pn = P(X = n), n = 0, 1, ..., is called the probability generating function of X:

ϕ(s) = E s^X = Σ_{n=0}^{∞} s^n P(X = n).

This is defined for all −1 < s < 1. Note that Σ_{n=0}^{∞} pn ≤ 1, because we do allow X to, possibly, take the value +∞, with p∞ := P(X = ∞) = 1 − Σ_{n} pn. Since the term s^∞ p∞ could formally appear in the sum defining ϕ(s), we note that, since |s| < 1, we have s^∞ = 0. Note that

ϕ(0) = P(X = 0),   while   ϕ(1) = Σ_{n=0}^{∞} P(X = n) = P(X < +∞),

and this is why ϕ is called probability generating function. We can recover the probabilities pn from the formula

pn = ϕ^{(n)}(0) / n!,

as follows from Taylor's theorem. Also, we can recover the expectation of X from

EX = lim_{s↑1} ϕ′(s),

something that can be formally seen from (d/ds) E s^X = E (d/ds) s^X = E X s^{X−1}, by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

x_{n+2} = x_{n+1} + xn,   n ≥ 0,   with x0 = x1 = 1.

If we let X(s) = Σ_{n≥0} xn s^n be the generating function of (xn, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x0), and the generating function of (x_{n+2}, n ≥ 0) is Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x0 − s x1). From the recursion x_{n+2} = x_{n+1} + xn we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (xn, n ≥ 0):

s^{−2}(X(s) − x0 − s x1) = s^{−1}(X(s) − x0) + X(s).

Since x0 = x1 = 1, we can solve for X(s) and find

X(s) = −1 / (s^2 + s − 1).

But the polynomial s^2 + s − 1 has two roots:

a = (√5 − 1)/2,   b = −(√5 + 1)/2.

Hence s^2 + s − 1 = (s − a)(s − b), and so

X(s) = −1/((s − a)(s − b)) = (1/(a − b)) [ 1/(s − b) − 1/(s − a) ] = (1/(b − a)) [ 1/(b − s) − 1/(a − s) ].

But (39) tells us that ρ^n has generating function 1/(1 − ρs) = (1/ρ)/((1/ρ) − s); thus 1/((1/ρ) − s) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1} and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of

xn = ( (1/b)^{n+1} − (1/a)^{n+1} ) / (b − a).

Noting that ab = −1, we can also write this as

xn = ( (−a)^{n+1} − (−b)^{n+1} ) / (b − a),

which is the solution to the recursion, i.e. the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
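A two-line check of the closed form against the recursion (a sketch, not part of the notes):

from math import sqrt

def fib_closed(n):
    a, b = (sqrt(5) - 1) / 2, -(sqrt(5) + 1) / 2
    return ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)

x = [1, 1]
for n in range(2, 12):
    x.append(x[-1] + x[-2])
print(x)
print([round(fib_closed(n)) for n in range(12)])   # the two lists coincide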

Distribution or law of a random object

When we speak of the 'distribution' or 'law' of a random variable X with values in Z or R or R^n, or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. If, for example, X is a random variable with values in R, then the so-called distribution function, namely the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density f(x) = (d/dx) P(X ≤ x). All these objects can be referred to as 'distribution' or 'law' of X. Even the generating function E s^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i: such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier. Warning: the verb 'specify' should not be interpreted as 'compute' in the sense of existence of an analytical formula or algorithm.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × · · · × n for large n then, since the number is so big, it's better to use logarithms:

n! = e^{log n!} = exp Σ_{k=1}^{n} log k.

Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that

∫_k^{k+1} log x dx ≈ log k,   hence   Σ_{k=1}^{n} log k ≈ ∫_1^n log x dx.

Since (d/dx)(x log x − x) = log x, it follows that

∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted, and (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads equals

C(2n, n) 2^{−2n} ≈ 1,

and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, one uses the full Stirling approximation

n! ∼ n^n e^{−n} √(2πn),

where the symbol an ∼ bn means that an/bn → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:

P(X − n ≥ m | X ≥ n) = P(X ≥ m),   0 ≤ n ≤ m.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. A geometric random variable has a distribution of the form

P(X ≥ n) = λ^n,   n ∈ Z+,

where λ = P(X ≥ 1), i.e. a geometric function of n. We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the total number of visits to a state, J0 = Σ_{n=1}^{∞} 1(Sn = 0), defined in (34). It was shown in Theorem 38 that J0 satisfies the memoryless property; in fact it was shown there that P(J0 ≥ k) = P(J0 ≥ 1)^k, a geometric function of k.

Markov's inequality

Arguably this is the most useful inequality in Probability, and it is based on the concept of monotonicity. Surely, you have seen it before, so this little section is a reminder. Consider an increasing and nonnegative function g : R → R+. Then, for all x, y ∈ R,

g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. We can thus capture both properties of g neatly by the single display above. Let X be a random variable. Then

g(X) ≥ g(x) 1(X ≥ x),

and so, by taking expectations of both sides (expectation is an increasing operator),

E g(X) ≥ g(x) P(X ≥ x).

This completely trivial thought gives the very useful Markov's inequality.

