This action might not be possible to undo. Are you sure you want to continue?

BooksAudiobooksComicsSheet Music### Categories

### Categories

### Categories

Editors' Picks Books

Hand-picked favorites from

our editors

our editors

Editors' Picks Audiobooks

Hand-picked favorites from

our editors

our editors

Editors' Picks Comics

Hand-picked favorites from

our editors

our editors

Editors' Picks Sheet Music

Hand-picked favorites from

our editors

our editors

Top Books

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Audiobooks

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Comics

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Sheet Music

What's trending, bestsellers,

award-winners & more

award-winners & more

Welcome to Scribd! Start your free trial and access books, documents and more.Find out more

**MARKOV CHAINS AND RANDOM WALKS
**

Takis Konstantopoulos∗ Autumn 2009

∗

c Takis Konstantopoulos 2006-2009

Contents

1 Introduction 2 Examples of Markov chains 2.1 2.2 2.3 2.4 2.5 A mouse in a cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bank account . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) . . . . . . . . . . . . . . . . . . . . . . Simple random walk (drunkard’s walk) in a city . . . . . . . . . . . . . . . . . Actuarial chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 3 4 4 5 6 6 6 7 9

3 The Markov property 3.1 3.2 3.3 Deﬁnition of the Markov property . . . . . . . . . . . . . . . . . . . . . . . . Transition probability and initial distribution . . . . . . . . . . . . . . . . . . Time-homogeneous chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 Stationarity 4.1

Finding a stationary distribution . . . . . . . . . . . . . . . . . . . . . . . . . 13 15

5 Topological structure 5.1 5.2 5.3 5.4

The graph of a chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “leads to” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 The relation “communicates with” . . . . . . . . . . . . . . . . . . . . . . . . 16 Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 20 24 27 28 29

6 Hitting times and ﬁrst-step analysis 7 Gambler’s ruin 8 Stopping times and the strong Markov property 9 Regenerative structure of a Markov chain 10 Recurrence and transience

10.1 First hitting time decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 31 10.2 Communicating classes and recurrence/transience . . . . . . . . . . . . . . . . 32 11 Positive recurrence 33

i

12 Law (=Theorem) of Large Numbers in Probability Theory 13 Law of Large Numbers for Markov chains 13.1 Functions of excursions

34 34

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

13.2 Average reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 14 Construction of a stationary distribution 15 Positive recurrence is a class property 16 Uniqueness of the stationary distribution 17 Structure of a stationary distribution 18 Coupling and stability 39 41 43 44 45

18.1 Deﬁnition of stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 18.2 The fundamental stability theorem for Markov chains . . . . . . . . . . . . . 47 18.3 Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 18.4 Proof of the fundamental stability theorem . . . . . . . . . . . . . . . . . . . 49 18.5 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 18.6 Periodic chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 19 Limiting theory for Markov chains* 53

19.1 Limiting probabilities for null recurrent states . . . . . . . . . . . . . . . . . . 53 19.2 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 20 Ergodic theorems 21 Finite chains 22 Time reversibility 23 New Markov chains from old ones* 56 57 58 62

23.1 Subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.2 Watching a Markov chain when it visits a set . . . . . . . . . . . . . . . . . . 62 23.3 Subordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 23.4 Deterministic function of a Markov chain . . . . . . . . . . . . . . . . . . . . 63 24 Applications 65

ii

. . . 71 24. . . . . .3 The distribution of the ﬁrst return to the origin . . . . 83 27. 92 30. . . . .2 Paths that start and end at speciﬁc points and do not touch zero at all . . . . . . . . . . . . . . . . . 84 27. . . . . . . . . . . . . . 92 31 Recurrence properties of simple random walks 94 31. . . . . . . . .2 The ALOHA protocol: a computer communications application . . . . . . . . . 90 30. .1 First hitting time analysis . . . . . . . . . . . . . .5 A storage or queueing application . . . .2 Return probabilities . . . . . . . . . . . . . . . . .1 Asymptotic distribution of the maximum . . . . . . 86 28 The reﬂection principle for a simple random walk in dimension 1 86 28. . . . .1 Branching processes: a population growth application . . . . 95 31. . . . . . . . . . . 85 27. . .2 The ballot theorem . . . . . . . . . . . . . . .1 Distribution after n steps . . 84 27. . . . . . . . . . . 65 24. . . . . . . 68 24. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 First return to zero . . . . .1 Paths that start and end at speciﬁc points . . . . . . . . . . . .4 The Wright-Fisher model: an example from biology . . . .3 Total number of visits to a state . . . . . . 79 26. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Distribution of hitting time and maximum . . . . . . . . . . . .3 Some conditional probabilities . . . . . . . . . . . . . . . . . . . 87 29 Urns and the ballot theorem 30 The asymmetric simple random walk in dimension 1 88 89 30. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 PageRank (trademark of Google): A World Wide Web application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 24. . . . . . . . . . . . . . . . . . . . . . . 73 25 Random walk 26 The simple symmetric random walk in dimension 1: path counting 76 78 26. . . 95 31. .4 Some remarkable identities . . . .24.2 First return to the origin . . . . . . . . . . . . . . . . . . . . . . . 80 27 The simple symmetric random walk in dimension 1: simple probability calculations 83 27. 96 32 Duality 33 Amount of time that a SSRW is positive* 98 99 iii . . . . .

34 Simple symmetric random walk in two dimensions 35 Skorokhod embedding* 102 103 iv .

I have a little mathematical appendix. one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. Also. I have tried both and prefer the current ordering. Also. The ﬁrst part is about Markov chains and some applications. There notes are still incomplete. The second one is speciﬁcally for simple random walks. A few starred sections should be considered “advanced” and can be omitted at ﬁrst reading.K.Preface These notes were written based on a number of courses I taught over the years in the U. There are. Of course.. The notes should be readable by someone who has taken a course in introductory (non-measuretheoretic) probability. I tried to put all terms in blue small capitals whenever they are ﬁrst encountered. dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible.S. I introduce stationarity before even talking about state classiﬁcation. At the end.. They form the core material for an undergraduate course on Markov chains in discrete time.. of course. I plan to add a few more sections: – On algorithms and simulation – On criteria for positive recurrence – On doing stuﬀ with matrices – On ﬁnance applications – On a few more delicate computations for simple random walks – Reshape the appendix – Add more examples These are things to come. v . Greece and the U. “important” formulae are placed inside a coloured box.

An example from Arithmetic of a dynamical system in discrete time is the one which ﬁnds the greatest common divisor gcd(a. In Mathematics. yn ) into Xn+1 = (xn+1 . and its ˙ dt 2x acceleration is x = d 2 . For instance. we let our “state” be a pair of integers (xn . . for any t. x(t)). By repeating the procedure. X1 . . Suppose that a particle of mass m is moving under the action of a force F in a straight line. 2. where xn ≥ yn . x). The sequence of pairs thus produced. that gcd(a. . where r = rem(a. Recall (from elementary school maths). i. the future trajectory (X(s). has the property that. . the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke’s law). An example from Physics is provided by Newton’s laws of motion. b) between two positive integers a and b. continuous (i. one of which is 0. Newton’s second law says that the acceleration is proportional to the force: m¨ = F.e.e. 1. X0 . x Here. X2 . F = f (x. b). b). yn ). initialised by x0 = max(a. and evolving as xn+1 = yn yn+1 = rem(xn . n = 0. there is a function F that takes the pair Xn = (xn . . Its velocity is x = dx . Assume that the force F depends only on the particle’s position ¨ dt and velocity. The “time” can be discrete (i. It is not hard to see that. depend only on the pair Xn . more generally. b) = gcd(r. yn ). a totally ordered set. . b) is the remainder of the division of a by b. Formally. x = x(t) denotes the position of the particle at time t. b). yn+1 ). for any n. s ≥ t) ˙ 1 . We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. . . The state of the particle at time t is described by the pair ˙ X(t) = (x(t). In the module following this one you will study continuous-time chains. and the other the greatest common divisor. a phenomenon which evolves with time in a way that only the present aﬀects the future is called a dynamical system. In other words. y0 = min(a. . the integers). or.PART I: MARKOV CHAINS 1 Introduction A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past aﬀects the future only through the present. the future pairs Xn+1 . Xn+2 .e. the real numbers). we end up with two numbers.

The resulting ratio is an approximation b of a f (x)dx. 2 2.can be completely speciﬁed by the current state X(t). respectively.000 rows and 10. Y0 ) of random numbers. We will also see why they are useful and discuss how they are applied.. A scientist’s job is to record the position of the mouse every minute. containing fresh and stinky cheese. it moves from 2 to 1 with probability β = 0.05. Count the number of 1’s and divide by N . Indeed. for steps n = 1. analyse. Similarly. via the use of the tools of Probability. . In addition. When the mouse is in cell 1 at time n (minutes) then. Repeat with (Xn . . 0 ≤ Y0 ≤ B. say. but some are stochastic. The generalisation takes us into the realm of Randomness. We can summarise this information by the transition diagram: 1 Stochastic means random.1 They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large” such as ﬁnding the determinant of a matrix with 10. Yn ).99. Each time count 1 if Yn < f (Xn ) and 0 otherwise. a ≤ X0 ≤ b. Try this on the computer! 2 . past states are not needed. it does so. 1 and 2. Other examples of dynamical systems are the algorithms run. regardless of where it was at earlier times. instead of deterministic objects. Markov chains generalise this concept of dependence of the future only on the present. and use those mathematical models which are called Markov chains. In other words. 2. A mouse lives in the cage. Some of these algorithms are deterministic. we will see what kind of questions we can ask and what kind of answers we can hope for. . at time n + 1 it is either still in 1 or has moved to 2. by the software in your computer. We will develop a theory that tells us how to describe.e. Choose a pair (X0 . An example of a random algorithm is the Monte Carlo method for the b approximate computation of the integral a f (x)dx of a complicated function f : suppose that f is positive and bounded below B.000 columns or computing the integral of a complicated function of a large number of variables. We will be dealing with random variables. an eﬀective way for dealing with large problems is via the use of randomness. i. 1 2 Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0. Perform this a large number N of times. uniformly.1 Examples of Markov chains A mouse in a cage A mouse is in a cage with two cells.

The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is deﬁned by: px. and Sn is the money he wants to spend. Assume also that X0 . as follows: Xn+1 = max{Xn + Dn − Sn . 2. the ﬁrst time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.) 2. We easily see that if y > 0 then ∞ x. from month to month. he can’t. 0}.2 Bank account The amount of money in Mr Baxter’s bank account evolves. on the average. Sn . FS . How long does it take for the mouse. .05 ≈ 20 minutes. S1 . How often is the mouse in room 1 ? Question 1 has an easy. S0 . Dn − Sn = y − x) P (Sn = z.01 Questions of interest: 1. Xn is the money (in pounds) at the beginning of the month. (This is the mean of the binomial distribution with parameter α. Assume that Dn . D0 . . Here. . are random variables with distributions FD . Dn = z + y − x) P (Sn = z)P (Dn = z + y − x) FS (z)FD (z + y − x). Dn is the money he deposits. to move from cell 1 to cell 2? 2. y = 0. .99 0. px.y := P (Xn+1 = y|Xn = x). answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell. S2 . respectively. intuitive. D2 .05 0. So if he wishes to spend more than what is available. = z=0 ∞ = z=0 3 . D1 . are independent random variables. 1.α 1−α 1 β 2 1−β Another way to summarise the information is by the 2×2 transition probability matrix P= 1−α α β 1−β = 0.95 0. .y = P (Dn − Sn = y − x) = = z=0 ∞ z=0 ∞ P (Sn = z. .

3 Simple random walk (drunkard’s walk) Consider a completely drunk person who walks along a street.2 P= p2. to go from the pub to home? You might object to the assumption that the person is totally drunk and has no sense of direction.1 p1.2 ··· ··· ··· ··· · · · · · · ··· and is an inﬁnite matrix.0 = 1 − ∞ px. being drunk.1 p0. So he may move forwards with equal probability that he moves backwards. Are there any places which he will visit inﬁnitely many times? 4. he has no sense of direction. If he starts from position 0.y . and if so. in principle. Are there any places which he will never visit? 3. how long will it take him. How long will it take (if ever) for the bank account to empty? We shall develop methods for answering these and other questions.something that. can be computed from FS and FD . if y = 0. on the average. Of course. If the pub is located in position 0 and his home in position 100.0 p1. 1/2 1/2 1/2 1/2 1/2 −2 1/2 −1 1/2 0 1/2 1 1/2 2 1/2 3 Questions: 1. Will Mr Baxter’s account grow.1 p2.2 p1. 2. The transition probability matrix is y=1 p0. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left). how often will he be visiting 0? 2.0 p2. 2.0 p0.4 Simple random walk (drunkard’s walk) in a city Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern: 4 . we have px. how fast? 2. Questions of interest: 1.

the answer to this is crucial for the policy determination. north or south. that since there are more degrees of freedom now.S because pH.3 H 0. mathematically imprecise.H − pS. Note that the diagram omits pH. pD.D .8 0. etc. We may again want to ask similar questions.01 S 0.H = 1 − pH. 5 . These things are.1 D Thus. It proposes the following model summarising the state of health of an individual on a monthly basis: 0. so we assign probability 1/4 for each move. Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly. So he can move east.S = 0. therefore. the company must have an idea on how long the clients will live. there is clearly a higher chance that our man will be lost.5 Actuarial chains A life insurance company wants to ﬁnd out how much money to charge its clients. but they will become more precise in the sequel. for now.Salvador Dali (1922) The Drunkard Suppose that the drunkard moves from corner to corner.H . Also. observe.3 that the person will become sick. the reason being that the company does not believe that its clients are subject to resurrection. 2. given that he is currently healthy. west. But. and pS.S − pH. there is probability pH. and pS.D = 1.D .D is omitted. pD. Clearly. and let’s say he’s completely drunk.S = 1 − pS.

. almost exclusively (unless otherwise said explicitly).3 3. m and any choice of states i0 . This is true for any n and m. n + m] Xk = ik | Xn = in . So we will omit referring to T explicitly. for any n. and so future is independent of past given the present. n ∈ T} is Markov instead of saying that it has the Markov property. . m < n. it is customary to call (Xn ) a Markov chain. m ∈ T). . . 1. Let us now write the Markov property in an equivalent way. . P (Xn+1 = in+1 | Xn = in . X1 = i1 . for all n ∈ Z+ and all i0 . . . We say that this sequence has the Markov property if. n ∈ Z+ ) is Markov if and only if. viz. . Since S is countable. if the claim holds. We will also assume. m ∈ T) is independent of the past process (Xm . (1) 6 . . . 3. Usually. called the state space. Xn−1 ) are independent conditionally on Xn . . . in+1 ∈ S. . . . . Xn+m ) and (X0 . we can see that by applying it several times we obtain P (∀ k ∈ [n + 1. conditionally on Xn . . X0 = i0 ) = P (Xn+1 = in+1 | Xn = in ). We focus almost exclusively to the ﬁrst choice. . the Markov property. The elements of S are frequently called states. . . n + m] Xk = ik | Xn = in ).} or the set Z of all integers (positive. or negative). (Xn . . . Sometimes we say that {Xn . the future process (Xm . . . . . . Lemma 1. where T is a subset of the integers. 2.} or N = {1. in+m . . X2 = i2 . . If (Xn ) is Markov then the claim holds by the conditional independence of Xn+1 and (X0 . Xn−1 ) given Xn . 2. n ∈ T}. . . 3. T = Z+ = {0. which precisely means that (Xn+1 . unless needed. 3.2 Transition probability and initial distribution We therefore see that joint probabilities can be expressed as P (X0 = i0 . m > n. zero. . for any n ∈ T. . . . Xn = in ) = P (X0 = i0 )P (X1 = i1 | X0 = i0 )P (X2 = i2 | X1 = i1 ) · · · P (Xn = in | Xn−1 = in−1 ). On the other hand. Proof.1 The Markov property Deﬁnition of the Markov property Consider a (ﬁnite or inﬁnite) sequence of random variables {Xn . . that the Xn take values in some countable set S. X0 = i0 ) = P (∀ k ∈ [n + 1. . .

n). all probabilities of events associated with the Markov chain This causes some analytical diﬃculties. n) if m ≤ ℓ ≤ n. If |S| = d then we have d × d matrices. Of course. (2) which is reminiscent of matrix multiplication. n + 1) do not depend on n . 2 3 And. n). m + 1) · · · pin−1 .j∈S . 2) · · · pin−1 . we deﬁne. remembering a bit of Algebra. by a result in Probability. n) = i1 ∈S ··· in−1 ∈S pi.in (n − 1. m + 2) · · · P(n − 1. viz.j (m. X2 = i2 . Xn = in ) = µ(i0 ) pi0 . X1 = i1 . for each m ≤ n the matrix P(m. then the matrices are inﬁnitely large. by deﬁnition. n) = P(m. n + 1) is called transition probability from state i at time n to state j at time n + 1. will specify all joint distributions2 : P (X0 = i0 .which means that the two functions µ(i) := P (X0 = i) .i1 (m. n). j ∈ S.j (n. m + 1)P(m + 1. requires some ordering on S) and whose entries are pi. but if S is an inﬁnite set. The function µ(i) is called initial distribution. by (1). or as P(m. n).j (m.. n) = P(m. pi. n ≥ 0.j (m. .j (n. More generally. .i (n − 1.j (m.i2 (1. n + 1) := P (Xn+1 = j | Xn = i) . which.i1 (0.j (n.j (n. i∈S i. n) := [pi. pi. . n) := P (Xn = j | Xm = i) is the transition probability from state i at time m to state j time n ≥ m. But we will circumvent them by using Probability only. a square matrix whose rows and columns are indexed by S (and this. ℓ) P(ℓ. n)]i.3 Time-homogeneous chains We will specialise to the case of time-homogeneous chains. The function pi. we can write (2) as P(m. 1) pi1 . of course. means that the transition probabilities pi. 7 . n) = 1(i = j) and. something that can be called semigroup property or Chapman-Kolmogorov equation. . pi.3 In this new notation. pi. 3. Therefore.

(3) We call δi the unit mass at i. The matrix notation is useful. 0. to picking state i with probability 1.j ]i. we write Ei Y := E(Y | X0 = i) = yP (Y = y | X0 = i) = yPi (Y = y). b ≥ 0. then µ = pδi1 + (1 − p)δi2 . Starting from a speciﬁc state i. δi (j) := 1. For example. Notice that any probability distribution on S can be written as a linear combination of unit masses. let µn := [µn (i) = P (Xn = i)]i∈S be the distribution of Xn . Then we have µ n = µ 0 Pn . A useful convention used for time-homogeneous chains is to write Pi (A) := P (A | X0 = i). transition matrix. unless otherwise speciﬁed. when we say “Markov chain” we will mean “timehomogeneous Markov chain”. Thus Pi is the law of the chain when the initial distribution is δi . n + 1) =: pi. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi .j∈S is called the (one-step) transition probability matrix or. From now on. if j = i otherwise. Then P(m. simply. n−m times for all m ≤ n. It corresponds.j (n.j the (one-step) transition probability from state i to state j. and the semigroup property is now even more obvious because Pa+b = Pa Pb . and call pi. as follows from (1) and the deﬁnition of P.j . Similarly.and therefore we can simply write pi. For example. arranged in a row vector. y y 8 . of course. while P = [pi. for all integers a. Notational conventions Unit mass. if Y is a real random variable. n) = P × P × · · · · · · × P = Pn−m . if µ assigns probability p to state i1 and 1 − p to state i2 . In other words.

e. This is a model of gas distributed between two identical communicating chambers. 1. dividing it into two identical rooms. i. Imagine there are N molecules of gas (where N is a large number.). at any point of time n. we indeed expect that we have N/2 molecules in each room.i−1 = P (Xn+1 N −i N i = i − 1 | Xn = i) = . because it is impossible for the number of molecules to remain constant at all times.d. selecting a vertex with probability 1/7. random variables provides an example of a stationary process. the distribution of the bug will still be the same. But this will not work. that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. If at time n the bug is on a vertex. say N = 1025 ) in a metallic chamber with a separator in the middle. . . . m + 1. Another guess then is: Place 9 . The gas is placed partly in room 1 and partly in room 2. then. On the average.4 Stationarity We say that the process (Xn . for any m ∈ Z+ . . Clearly. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. n ≥ 0) will not be stationary: indeed. a sequence of i. Example: the Ehrenfest chain. then. It is clear.) is stationary if it has the same law as (Xn . the law of a stationary process does not depend on the origin of time. . It should be clear that if initially place the bug at random on one of the vertices. . n = 0. Consider a canonical heptagon and let a bug move on its vertices.i. Then a hole is opened in the separator and we watch what happens as the molecules diﬀuse between the two rooms. Let Xn be the number of molecules in room 1 at time n. To provide motivation and intuition. We can model this by saying that the chance that a molecule move from room 1 to room 2 is proportional to the number of molecules in room 1. n = m.i+1 = P (Xn+1 = i + 1 | Xn = i) = pi. In other words. for a while we will be able to tell how long ago the process started. and vice versa. Hence the uniform probability distribution is stationary. N If we start the chain with X0 = N (all molecules in room 1) then the process (Xn . at time n + 1 it moves to one of the adjacent vertices with equal probability. let us look at the following example: Example: Motion on a polygon. as it takes some time before the molecules diﬀuse from room to room. We now investigate when a Markov chain is stationary. Then pi.

Proof. The only if part is obvious.i + π(i + 1)pi+1. .) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. In other words. A Markov chain (Xn . i 0 ≤ i ≤ N. . we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. 2−N i−1 i+1 N N (The dots signify some missing algebra. ﬁnd those distributions π such that πP = π. But we need to prove this. then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . P (X1 = i) = j P (X0 = j)pj. . X2 = i2 . We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. If we do so. hence. (4) Such a π is called .i = P (X0 = i − 1)pi−1. n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. 10 . for all m ≥ 0 and i0 . then we have created a stationary Markov chain. Xm+n = in ). Lemma 2. stationary or invariant distribution. . .each molecule at random either in room 1 or in room 2. If we can do so.i + P (X0 = i + 1)pi+1. Since. which you should go through by yourselves. by (3). . . . For the if part. Xm+1 = i1 . X1 = i1 . Xm+2 = i2 . . we are led to consider: The stationarity problem: Given [pij ]. Changing notation. Xn = in ) = P (Xm = i0 . the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . . use (1) to see that P (X0 = i0 . P (Xn = i) = π(i) for all n. The equations (4) are called balance equations.i = N N −i+1 N i+1 2−N + = · · · = π(i). Indeed. . in ∈ S. a Markov chain is stationary if and only if the distribution of Xn does not change with n. . (a binomial distribution).i = π(i − 1)pi−1.

}. . . In addition to that. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus.i. . can be written as P1 = 1. p1. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). xd ). . . . if i > 0 i ∈ N. and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. each x ∈ Sd is of the form x = (x1 . 2 . The state space is S = N = {1.. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. This means that the solution to the balance equations is identically zero. then we run into trouble: indeed. . 11 . . Consider a deck of d = 52 cards and shuﬄe them in the following (stupid) way: pick two cards at random and swap their positions. which gives p(j) . i ≥ 1. in matrix notation. .i = 1. i ≥ 0. which. The times between successive buses are i.d. we use the normalisation condition π(1) = i≥1 j≥i = 1. Example: waiting for a bus. π must be a probability: π(i) = 1. and so each π(i) will be zero. Who tells us that the sum inside the parentheses is ﬁnite? If the sum is inﬁnite. i = 1. i≥1 π(i) −1 To compute π(1). where 1 is a column whose entries are all 1. Thus. . the next one will arrive in i minutes with probability p(i). Example: card shuﬄing.Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. i∈S The matrix P has the eigenvalue 1 always because j∈S pi. Keep doing it continuously. write the equations (4): π(i) = π(j)pj. we must be careful.i = π(0)p(i) + π(i + 1). random variables. But here. To ﬁnd the stationary distribution.i = p(i). After bus arrives at a bus stop. π(1) will be zero. It is easy to see that the stationary distribution is uniform: π(x) = 1 . . d! x ∈ Sd . 2. . which in turn means that the normalisation condition cannot be satisﬁed. Then (Xn . d}. where all the xi are distinct. n ≥ 0) is a Markov chain with transition probabilities pi+1.j = 1. .

So. We can show that if we start with the ξr (0). . εr ∈ {0. As for the name of the chain (bread kneading) it can be justiﬁed as follows: to each ξ there corresponds a number x between 0 and 1 because ξ can be seen as the binary representation of the real number x. for any n = 0. .. 2. . . from (5). if ξ1 = 1 then shift to the left and ﬂip all bits to obtain (1 − ξ2 . . . ξr+1 (n) = 1 − εr ). ξ2 (n).d. 1} for all i. ξ2 (n). applied to the interval [0. . 1] (think of [0. any r = 1. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. ξr+1 (n) = εr ) + P (ξ1 (n) = 1. But the function f (x) = min(2x. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. . 2. . 2(1 − x)). 2. . . and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2 . with P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 then. . . . . ξ2 (n) = ε1 . Hence choosing the initial state at random in this manner results in a stationary Markov chain. (5) 12 . . Remarks: This is not a Markov chain with countably many states. . i. .). 1] as a ﬂexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle. with P (ξr (0) = 1) = P (ξr (0) = 0) = 1/2 then the same will be true for all n.). ξr (n + 1) = εr ) = P (ξ1 (n) = 0. ξ2 . . i. r = 1. . If ξ1 = 1 then x ≥ 1/2.d. Here ξ(n) = (ξ1 (n).i.) Consider an inﬁnite binary vector ξ = (ξ1 . We have thus deﬁned a mapping ϕ from binary vectors into binary vectors. . But to even think about it will give you some strength.i. P (ξ1 (n + 1) = ε1 . . ξ3 . .Let us work out the sum to see if this assumption can be expressed in more “physical” terms: p(j) = 1(j ≥ i)p(j) = jp(j). . .).* (This is slightly beyond the scope of these lectures and can be omitted. i≥1 j≥i j≥1 i≥1 j≥1 and the latter sum is the expected time between two successive bus arrivals.. . So let us do so. 1. . . the same is true at n + 1. You can check that . .. .) is the state at time n. ξ2 (n) = 1 − ε1 . . . . .e. . So our assumption really means: ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is ﬁnite. . and the rule maps x into 2(1 − x). P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 The proof is by induction: if the ξ1 (n). The Markov chain is obtained by applying the mapping ϕ again and again. ξ(n + 1) = ϕ(ξ(n)). . 1 − ξ3 . . 1}. So it goes beyond our theory. what we have really shown is that this assumption is a necessary and suﬃcient condition for the existence and uniqueness of a stationary distribution. ξi ∈ {0. are i. Example: bread kneading. and any ε1 . . .

amongst them. then choose A = {i} and see that you obtain the balance equations.j as a “current” ﬂowing from i to j. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A. So F (A.Note on terminology: steady-state.j (which equals 1): π(i) j∈S pi. i∈A j∈Ac Think of π(i)pi.i . because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. π(i) = j∈S (balance equations) (normalisation condition).i = j∈A π(j)pj. But we only have d unknowns. The convenience is often suggested by the topological structure of the chain. Fix a set A ⊂ S and write π(i) = j∈S π(j)pj. as we will see in examples. We have: Proposition 1. π(j) = 1.i + j∈Ac π(j)pj.i Multiply π(i) by j∈S pi. expecting to be able to choose. If the latter condition holds. Instead of eliminating equations. Proof.j .j = j∈A π(j)pj. j∈S i ∈ S. Conversely. π(j)pj. This is OK. Ac ) := π(i)pi. Writing the equations in this form is not always the best thing to do. Ac ) = F (Ac . for all A ⊂ S. then the above are d + 1 equations. a set of d linearly independent equations which can be more easily solved. Ac ) is the total current from A to Ac . therefore one of them is always redundant.1 Finding a stationary distribution As explained above. If the state space has d states. we introduce more. A).i 13 . π is a stationary distribution if and only if π1 = 1 and F (A. suppose that the balance equations hold. When a Markov chain is stationary we refer to it as being in 4. a stationary distribution π satisﬁes π = πP π1 = 1 In other words.i + j∈Ac π(j)pj.

here is an example: Example: Consider the 2-state Markov chain with p12 = α. β < 1.8% of the time in room 2. gives us ﬂexibility. The constant c is found from π(1) + π(2) = 1. If we take α = 0. we ﬁnd π(1) = 0.05 ≈ 0.i This is true for all i ∈ S. while the remaining two terms give the equality we seek. p11 = 1 − α. A) Schematic representation of the ﬂow balance relations. The extra degree of freedom provided by the last result.j = j∈A π(j)pj. which immediately tells us that π(1) is proportional to β and π(2) proportional to α: π(1) = cβ. α+β π(2) = cα.j + j∈A j∈Ac π(i)pi. One can ask how to choose a minimal complete set of linearly independent equations.. p22 = 1 − β. .i + j∈Ac π(j)pj.. Instead.05.04 π(2) = 0.2% of the time in 1. Summing over i ∈ A.) Assume 0 < α. That is. 2. the mouse spends only 95. Example: a walk with a barrier. 3. but we shall not do this here. 1. β = 0.048. Thus.952. α+β π(2) = α . A c F(A . p21 = β. .Now split the left sum also: π(i)pi. (Consequently. The balance equations are: π(i) = pπ(i − 1) + qπ(i + 1). 14 i = 1. in steady state. π(1) = β . We have π(1)α = π(2)β. . A ) A c c F( A . where p + q = 1: p p p 0 q 1 q 2 q 3 q 4 .04 room 1. and 4.99 ≈ 0. p q Consider the following chain.99 (as in the mouse example). . we see that the ﬁrst term on the left cancels with the ﬁrst on the right.

This is i=0 possible if and only if p < q. provided that ∞ π(i) = 1.e. E) where S is the state space and E ⊂ S × S deﬁned by (i.j > 0 for some n ∈ N and some states i1 . If i = j there is nothing to prove. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows. the chain will visit j at some ﬁnite time.But we can make them simpler by choosing A = {0. . These are much simpler than the previous ones. . i leads to j (written as i Notice that this relation is reflexive): ∀i ∈ S Furthermore: Theorem 1. Does this provide a stationary distribution? Yes. π(i) = (p/q)i π(0). i.e. p < 1/2. Ac ) = π(i − 1)p = π(i)q = F (Ac . n≥0 .i2 · · · pin−1 . j) ∈ E ⇐⇒ pi. Since 0 < Pi (∃n ≥ 0 Xn = j) ≤ 15 Pi (Xn = j).i1 pi1 . . S is a set of vertices. 1. Proof. In graph-theoretic terminology. in−1 . A). (ii) pi. and E a set of directed edges. In other words. (iii) Pi (Xn = j) > 0 for some n ≥ 0. The following are equivalent: (i) i j i i j) ⇐⇒ Pi (∃n ≥ 0 Xn = j) > 0. 5. .2 The relation “leads to” We say that a state i leads to state j if.) 5 5. we can solve them immediately. . In fact.1 Topological structure The graph of a chain This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. So assume i = j. • Suppose that i j. If i ≥ 1 we have F (A. . . a pair (S. i. For i ≥ 1. (If p ≥ 1/2 there is no stationary distribution. starting from i.j > 0. This is often cast in terms of a digraph (directed graph). . i − 1}.

Proof. It is reﬂexive and transitive because is so. i}. and so (iii) holds.. 5. the sum contains a nonzero term.3 The relation “communicates with” Deﬁne next the relation i communicates with j (written as i Obviously. (6) j.j . [i] = [j] if and only if i identical or completely disjoint. {4.. Look at the last display again. the class {2.in−1 pi. The communicating class corresponding to the state i is. by deﬁnition. we have that the sum is also positive and so one of its terms is positive.i2 · · · pin−1 . meaning that (ii) holds. 16 . {6}. The relation j ⇐⇒ j i). Just as any equivalence relation. The ﬁrst two classes diﬀer from the last two in character. because Pi (Xn = j) ≤ Pi (∃m ≥ 0 Xm = j). it partitions S into equivalence classes known as communicating classes. j) ⇐⇒ i j and j i. Hence Pi (Xn = j) > 0. However. Since Pi (Xn = j) > 0. (i) holds. The class {4. Corollary 1. By assumption. • Suppose now that (iii) holds. the set of all states that communicate with i: [i] := {j ∈ S : j So. • Suppose that (ii) holds. In addition. The relation is transitive: If i j and j k then i k. where the sum extends over all possible choices of states in S. 5} is closed: if the chain goes in there then it never moves out of it. 3}. reﬂexive and transitive.. (iii) holds. {2.the sum on the right must have a nonzero term. We have Pi (Xn = j) = i1 . Equivalence relation means that it is symmetric. 3} is not closed. In other words. we see that there are four communicating classes: {1}. We just observed it is symmetric.i1 pi1 . 5}. is an equivalence relation.. by deﬁnition. this is symmetric (i Corollary 2. Two communicating classes are either 1 2 3 4 5 6 In the example of the ﬁgure.

The classes C4 . A communicating class C is closed if and only if the following holds: If i ∈ C and if i j then j ∈ C. C5 in the ﬁgure above are closed communicating classes.i1 pi1 . the state is inessential. j∈C for all i ∈ C. Similarly. Lemma 3. C3 are not closed. Proof. more precisely. If a state i is such that pi. The converse is left as an exercise. Since i ∈ C. The communicating classes C1 . If all states communicate with all others. A general Markov chain can have an inﬁnite number of communicating classes.j > 0. more manageable. pi. then the chain (or. C1 C2 C3 C4 C5 The general structure of a Markov chain is indicated in this ﬁgure. Note: If we consider a Markov chain with transition matrix P and ﬁx n ∈ N then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. and C is closed. we can reach any other state. . In graph terms.i2 > 0. C2 . parts.j = 1. we say that a set of states C ⊂ S is closed if pi. and so on.i2 · · · pin−1 . . Why? A state i is essential if it belongs to a closed communicating class. and so i2 ∈ C. Suppose that i j. Note: An inessential state i is such that there exists a state j such that i j but j i. we have i1 ∈ C. Suppose ﬁrst that C is a closed communicating class. . Otherwise. There can be arrows between two nonclosed communicating classes but they are always in the same direction. To put it otherwise. In other words still. pi1 . Note that there can be no arrows between closed communicating classes. Thus.i = 1 then we say that i is an absorbing state. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. starting from any state. . the single-element set {i} is a closed communicating class. 17 .More generally. its transition matrix P) is called irreducible. in−1 such that pi.i1 > 0. We then can ﬁnd states i1 . the state space S is a closed communicating class. we conclude that j ∈ C. Closed communicating classes are particularly important because they decompose the chain into smaller.

Therefore if pii > 0 then pii (n) (ℓn) > 0 for all ℓ ∈ N. The period of an inessential state is not deﬁned. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2 . j. i. So. pii (ℓn) ≥ pii (n) ℓ . The chain alternates between the two sets. Consider two distinct essential states i. (m) (n) for all m. X4 ∈ C1 . i. (Can you ﬁnd an example of a chain with period 3?) We will now give the deﬁnition of the period of a state of an arbitrary chain. n ≥ 0. . ℓ ∈ N.e. d(j) = gcd Dj . all states form a single closed communicating class. whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. . Another way to say this is: (n) (m) If the integer n divides m (denote this by: n | m) and if pii > 0 then pii > 0. Let Di = {n ∈ N : pii > 0}.5. If i j then d(i) = d(j). d(i) = gcd Di . we have So pii (2n) (n) pij ≥ pii (n) 2 (m+n) ≥ pik pkj . A state i is called aperiodic if d(i) = 1. (n) Proof. . 3} at some point of time then. that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1. If i j there is α ∈ N with pij > 0 and if 18 .4 Period Consider the following chain: 1 2 4 3 The chain is irreducible. We say that the chain has period 2. The period of an essential state is deﬁned as the greatest common divisor of all natural numbers n with such that it is possible to return to i in n steps: d(i) := gcd{n ∈ N : Pi (Xn = i) > 0} . . Since Pm+n = Pm Pn . Dj = {n ∈ (n) (α) N : pjj > 0}. j ∈ S. X2 ∈ C1 . and. more generally. Let us denote by pij the entries of Pn . Notice however. Theorem 2 (the period is a class property). X3 ∈ C2 . 4} at the next step and vice versa. it will move to the set C2 = {2.

Pick n2 . i. . . Therefore d(j) | α + β. So d(i) = d(j). . Now let n ∈ Di . Of particular interest. if d = 1. (n) (n) (α+n+β) (α) (n) (β) (β) (α+β) (α) (β) This ﬁgure shows the internal structure of a closed communicating class with period d = 4. . If a number divides two other numbers then it also divides their diﬀerence. Because n is suﬃciently large. Cd−1 such that pij = 1. where the remainder r ≤ n1 − 1. all states communicate with one another. Arguing symmetrically. Therefore n = qn1 + r(n2 − n1 ) = (q − r)n1 + rn2 . Then we can uniquely partition the state space S into d disjoint subsets C0 . . Cd−1 such that the chain moves cyclically between them. n1 ∈ Di such that n2 − n1 = 1. Theorem 4. So d(j) is a divisor of all elements of Di . 1. Since n1 . So pjj ≥ pij pji > 0. Let n be suﬃciently large. q − r > 0. . Then pii > 0 and so pjj ≥ pij pii pji > 0. we obtain d(i) ≤ d(j) as well. showing that α + β ∈ Dj . From the last two displays we get d(j) | n for all n ∈ Di . Proof. Therefore d(j) | α + β + n for all n ∈ Di . d(j) | d(i)). More generally. j ∈ S. we have n ∈ Di as well. . Consider an irreducible chain with period d. Divide n by n1 to obtain n = qn1 + r. r = 0. if an irreducible chain has period d then we can decompose the state space into d sets C0 . showing that α + n + β ∈ Dj . In particular. this is the content of the following theorem. j∈Cr+1 i ∈ Cr . . . . (Here Cd := C0 . If i is an aperiodic state then there exists n0 such that pii > 0 for all n ≥ n0 . as deﬁned earlier.) 19 . . C1 . Since d(i) is the greatest common divisor we have d(i) ≥ d(j) (in fact. Formally. An irreducible chain has period d if one (and hence all) of the states have period d.e. n2 ∈ Di . An irreducible chain with ﬁnitely many states is aperiodic if and only if there (n) exists an n such that pij > 0 for all i. chains where. . Theorem 3. Corollary 3. d − 1.j i there is β ∈ N with pji > 0. C1 . are irreducible chains. the chain is called aperiodic.

We now show that if i belongs to one of these classes and if pij > 0 then.. j belongs to the next class.id > 0. for example. id−1 d 6 Hitting times and ﬁrst-step analysis Consider a Markov chain with transition probability matrix P. Assume d > 1. We avoid excessive notation by using the same letter for both. i1 . Then pick i1 such that pi0 i1 > 0. Hence it partitions the state space S into d disjoint equivalence classes. If X0 ∈ A the two times coincide. Notice that this is an equivalence relation. As an example. i ∈ C0 . We use a method that is based upon considering what the Markov chain does at time 1. necessarily. i1 . Consider the path i0 → i → j → i2 → i3 → · · · → id → i0 . The reader is warned to be alert as to which of the two variables we are considering at each time. say.. i2 i0 i0 in a number of steps which is an integer multiple of d − 1 (why?). after it takes one step from its current position. We are interested in deriving formulae for the probability that this time is ﬁnite as well as the expectation of this time. i. consider the chain 1 1/3 1 0 4 1/2 1 1/6 2 We shall later consider the time inf{n ≥ 1 : Xn ∈ A} which diﬀers from TA simply by considering n ≥ 1 instead of n ≥ 0. with corresponding classes C0 . We deﬁne the hitting time4 of a set of states A by TA := inf{n ≥ 0 : Xn ∈ A}. Cd−1 . . necessarily. Suppose pij > 0 but j ∈ C1 but. Deﬁne the relation i d j ⇐⇒ pij (nd) > 0 for some n ≥ 0.. which contradicts the deﬁnition of d. We show that these classes satisfy what we need. 20 ..e the set of states j such that i j). It is easy to see that if we continue and pick a further state id with pid−1 .. Such a path is possible because of the choice of the i0 .e. Pick a state i0 and let C0 be its equivalence class (i. C1 . then. Continue in this manner and deﬁne states i0 . . . Let C1 be the equivalence class of i1 .. that is why we call it “first-step analysis”. and by the assumptions that d d j.Proof.. Take. . j ∈ C2 . . The existence of such a path implies that it is possible to go from i0 to i. id ∈ C0 .

which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). X0 = i) = P (TA < ∞|X1 = j). ′ is the remaining time until A is hit). If i ∈ A then TA = 0. . For the second part of the proof.It is clear that P1 (T0 < ∞) < 1. ′ Proof. i ∈ A. Furthermore. then TA ≥ 1. let ϕ′ (i) be another solution.). We then have: Theorem 5. Hence: ′ ′ P (TA < ∞|X1 = j. If i ∈ A. Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). . because the chain may end up in state 2. We ﬁrst have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. . So TA = 1 + TA . ϕ(i) = j∈S i∈A pij ϕ(j). we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A pjk ϕ′ (k) k∈A 21 . The function ϕ(i) satisﬁes ϕ(i) = 1. But the Markov chain is homogeneous. and deﬁne ϕ(i) := Pi (TA < ∞). Therefore. j∈S as needed. we have Pi (TA < ∞) = ϕ(j)pij . if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S. Combining the above. a ′ function of X1 . But what exactly is this probability equal to? To answer this in its generality. ﬁx a set of states A. and so ϕ(i) = 1. conditionally on X1 . the event TA is independent of X0 . By self-feeding this equation.e. X2 .

j∈A i ∈ A. i = 0. We immediately have ϕ(0) = 1. ′ E i TA = 1 + E i TA = 1 + j∈S pij Ej TA = 1 + j∈A pij ψ(j). if ψ ′ (i) is any other solution of these equations then ψ ′ (i) ≥ ψ(i) for all i ∈ S. we have 1 1 ϕ(1) = ϕ(1) + ϕ(0). The function ψ(i) satisﬁes ψ(i) = 0. Proof. as above. if the chain starts from 2 it will never hit 0. and let ψ(i) = Ei TA . 2. ψ(i) = 1 + i∈A pij ψ(j). and ϕ(2) = 0. 3 2 so ϕ(1) = 3/4. Letting n → ∞. we choose A = {0}. Example: Continuing the previous example. because it takes no time to hit A if the chain starts from A. We let ϕ(i) = Pi (T0 < ∞). TA = 0 and so ψ(i) := Ei TA = 0. Clearly. Start the chain from i. let A = {0. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Next consider the mean time (mean number of steps) until the set A is hit for the ﬁrst time.We recognise that the ﬁrst term equals Pi (TA = 1).) As for ϕ(1). the second equals Pi (TA = 2). If i ∈ A then. then TA = 1 + TA . By continuing self-feeding the above equation n times. 2}. We have: Theorem 6. 1. ψ(0) = ψ(2) = 0. the set that contains only state 0. 1 ψ(1) = 1 + ψ(1). If ′ i ∈ A. we obtain ϕ′ (i) ≥ Pi (TA < ∞) = ϕ(i). The second part of the proof is omitted. Example: In the example of the previous ﬁgure. which is the second of the equations. we obtain ϕ′ (i) ≥ Pi (TA ≤ n). so the ﬁrst two terms together equal Pi (TA ≤ 2). (If the chain starts from 0 it takes no time to hit 0. By omitting the last term we obtain ϕ′ (i) ≥ Pi (TA ≤ 2). ψ(i) := Ei TA . obviously. Furthermore. On the other hand. 3 22 . Therefore.

B. i. because the underlying experiment is the ﬂipping of a coin with “success” probability 2/3. (This should have been obvious from the start.e. ϕAB is the minimal solution to these equations. This happens up to the ﬁrst hitting time TA ﬁrst hitting time of the set A. ′ TA = 1 + TA . after one step. ϕAB (i) = j∈A i∈B pij ϕAB (j).) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . It should be clear that the equations satisﬁed are: ϕAB (i) = 1. otherwise. ϕAB (i) = 0. i∈A Furthermore. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the ﬁrst-step analysis method. TB for two disjoint sets of states A. in the last sum. ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). Clearly. therefore the mean number of ﬂips until the ﬁrst success is 3/2. h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ). until set A is reached. h(x) = f (x). as argued earlier. by the Markov property. of the random variables X1 . The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ). where. consider the following situation: Every time the state is x a reward f (x) is earned. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). we just changed index from n to n + 1. Next. if x ∈ A. Now the last sum is a function of the future after time 1. ′ where TA is the remaining time. Then ′ 1+TA x ∈ A. We may ask to ﬁnd the probabilities ϕAB (i) = Pi (TA < TB ).. . . . Hence.which gives ψ(1) = 3/2. because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). 23 . X2 .

h(1) = 5. 1 2 4 3 The question is to compute the average total reward.Thus.) 7 Gambler’s ruin Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. until he dies in room 4. in which case he dies immediately. unless he is in room 4. he collects i pounds. If heads come up then you win £1 per pound of stake. x∈A pxy h(y). collecting “rewards”: when in room i. 2. The successive stakes (Yk . You can remember the latter ones by thinking that the expectecd total reward equals the immediate reward f (x) plus the expected remaining reward. are h(x) = 0. so 24 . Example: A thief enters sneaks in the following house and moves around the rooms. 3. y∈S h(x) = f (x) + x ∈ A. k ∈ N) form the strategy of the gambler. we have h(4) = 0 1 h(1) = 1 + h(2) 2 1 h(3) = 3 + h(2) 2 1 1 h(2) = 2 + h(1) + h(3). if just before the k-th toss you decide to bet Yk pounds then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails). he will die. Let p the probability of heads and q = 1 − p that of tails. 2 2 We solve and ﬁnd h(2) = 8. (Also. Clearly. if he starts from room 2. for i = 1. (Sooner or later. h(3) = 7. why?) If we let h(i) be the average total reward if he starts from room i. your fortune after the n-th toss is n Xn = X0 + k=1 Y k ξk . the set of equations satisﬁed by h. So. no gambler is assumed to have any information about the outcome of the upcoming toss.

pi. if p = q Px (Tb < Ta ) = . betting £1 at each integer time against a coin which has P (heads) = p. Consider a gambler playing a simple coin tossing game. We wish to consider the problem of ﬁnding the probability that the gambler will never lose her money. and where the states a. in simplest case Yk = 1 for all k. write ϕ(x) instead of ϕ(x. In other words. We ﬁrst solve the problem when a = 0. We are clearly dealing with a Markov chain in the state space {a. . It can only depend on ξ1 . b = ℓ > 0 and let ϕ(x. P (tails) = 1 − p. she must base it on whatever outcomes have appeared thus far. . ξk−1 and–as is often the case in practice–on other considerations of e. astrological nature. Let us consider a concrete problem. in addition. a + 2. The general solution will then obtained from Px (Tb < Ta ) = ϕ(x − a.if the gambler wants to have a strategy at all. Theorem 7. .i+1 = p. a ≤ x ≤ b. the company has control over the probability p. We use the notation Px (respectively. Ex ) for the probability (respectively. ℓ) := Px (Tℓ < T0 ). The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). To solve the problem. ℓ). expectation) under the initial condition X0 = x. Then the probability that the gambler’s fortune will reach level b > x before reaching level a < x equals: x−a − 1 (q/p) (q/p)b−a − 1 . a < i < b. She will stop playing if her money fall below a (a < x). . x − a . . Let £x be the initial fortune of the gambler.g. We obviously have ϕ(0) = 0. The diﬀerence between the gambler’s ruin problem for a company and that for a mere mortal is that. 0 ≤ x ≤ ℓ. in the ﬁrst case. . b − a). . Yk cannot depend on ξk . b − 1.i−1 = q = 1 − p. (Think why!) To simplify things. . This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and. a + 1. . b} with transition probabilities pi. it will make enormous proﬁts). 25 ϕ(ℓ) = 1. as described above. is our next goal. b are absorbing. Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. if p = q = 1/2 b−a Proof.

we can write this as p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)]. then λ = q/p < 1. λ−1 ϕ(x) = If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). we ﬁnd δ(0) = 1/ℓ and so x ϕ(x) = . ℓ→∞ ℓ→∞ (q/p)x . Since p + q = 1. This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). λ = 1. ℓ Summarising. ℓ→∞ λℓ − 1 0−1 λx − 1 = 1 − 0 = 1. Assuming that λ := q/p = 1. We ﬁnally ﬁnd λx − 1 . ℓ→∞ ℓ Px (T0 < ∞) = 1 − lim Finally.and. We have δ(x) = q δ(x − 1) = p q p 2 0 < x < ℓ. Since ϕ(ℓ) = 1. if p > 1/2 if p ≤ 1/2. we have obtained ϕ(x. Corollary 4. ϕ(x) = pϕ(x + 1) + qϕ(x − 1). 1. If p > 1/2. The ruin probability of a gambler starting with £x is Px (T0 < ∞) = Proof. we ﬁnd ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λx−1 + · · · + λ + 1]δ(0) = Since ϕ(ℓ) = 1. Let δ(x) := ϕ(x + 1) − ϕ(x). if λ = 1 if λ = 1. we have λℓ −1 λ−1 δ(0) = 1 which gives δ(0) = (λ − 1)/(λℓ − 1). δ(x − 2) = · · · = q p x δ(0). by ﬁrst-step analysis. ℓ) = λx −1 . We have Px (T0 < ∞) = lim Px (T0 < Tℓ ) = 1 − lim Px (Tℓ < T0 ). λℓ −1 x ℓ. 0 ≤ x ≤ ℓ. and so λx − 1 λx − 1 =1− = λx . Px (T0 < ∞) = 1 − lim 26 . and so Px (T0 < ∞) = 1 − lim If p < 1/2. The general case is obtained by replacing ℓ by b − a and x by x − a. if p = q. λℓ − 1 0 ≤ x ≤ ℓ. λx − 1 δ(0). ℓ→∞ λℓ − 1 x = 1 − 0 = 1. then λ = q/p > 1. λ = 1.

. . .8 Stopping times and the strong Markov property A random time is a random variable taking values in the time-set. the future process (Xn . Xτ +2 = j2 . . . . . Xτ +2 = j2 . . . . conditional on τ < ∞ and Xτ = i. . . a stopping time τ is a random variable with values in Z+ ∪{+∞} such that 1(τ = n) is a deterministic function of (X0 . . τ = n) (c) = P (Xn+1 = j1 . . We are going to show that for any such A. . . | Xτ . . the future process (Xτ +n . Suppose (Xn . . A. . . . . τ < ∞) = P (Xn+1 = j1 . . . An example of a stopping time is the time TA . τ < ∞). τ = n). .j2 · · · We have: P (Xτ +1 = j1 . Xn+2 = j2 . In particular. . Xτ )1(τ < ∞) = n∈Z+ (X0 .j1 pj1 . Xn+2 = j2 . any such A must have the property that. . (d) where (a) is just logic. Theorem 8 (strong Markov property). . A. . and (d) is due to the fact that (Xn ) is time-homogeneous Markov chain with transition probabilities pij . (b) is by the deﬁnition of conditional probability. . Xn = i. | Xn = i. . An example of a random time which is not a stopping time is the last visit of a speciﬁc set. for any n. . We will now prove that this remains true if we replace the deterministic n by a random stopping time τ . the value +∞. Then. Xn )1(τ = n). τ = n) = P (Xn+1 = j1 . . . Xn ) if we know the present Xn . considered earlier. | Xτ = i. τ < ∞)P (A | Xτ . . Xn+1 . Xn+2 = j2 . . Recall that the Markov property states that. . A. the ﬁrst hitting time of the set A. . Xτ ) and has the same law as the original process started from i. A ∩ {τ = n} is determined by (X0 . τ = n) = pi. This stronger property is referred to as the strong Markov property. . A. A | Xτ . | Xn = i)P (Xn = i. Let τ be a stopping time. . . . Let A be any event determined by the past (X0 . Xτ ) if τ < ∞.Xτ +2 = j2 . A.) is independent of the past process (X0 . . . for all n ∈ Z+ . . (c) is due to the fact that A ∩ {τ = n} is determined by (X0 . possibly. Proof. Xn ).j2 · · · P (Xn = i. . . . . . . τ = n) (a) (b) = P (Xτ +1 = j1 . . n ≥ 0) is a time-homogeneous Markov chain. Xn ) and the ordinary splitting property at n. 27 . A. Let τ be a stopping time. Our assertions are now easily obtained from this by ﬁrst summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P (Xτ = i. Xτ +2 = j2 . τ < ∞) = pi. Since (X0 . n ≥ 0) is independent of the past process (X0 . τ < ∞) and that P (Xτ +1 = j1 . . P (Xτ +1 = j1 . . . . . . . Xn ) for all n ∈ Z+ . Xτ = i. including.j1 pj1 . τ = n)P (Xn = i.

3. The idea is that if we can ﬁnd a state i that is visited again and again. The Ti of Section 6 was allowed to take the value 0. in fact. So Xi is the r-th excursion: the chain visits 28 . random trajectories. All these random times are stopping times. r = 2.) We let Ti be the time of the third visit. 0 ≤ n < Ti . If X0 = i then the two deﬁnitions coincide. .i. (1) r = 1. State i may or may not be visited again. (Why?) Consider the “trajectory” of the chain between two successive visits to the state i: Xi Let also We think of (r) := Xn . . They are. Xi (2) . Ti Xi (0) (r) ≤ n < Ti (r+1) . Schematically. Note the subtle diﬀerence between this Ti and the Ti of Section 6. Formally. then the behaviour of the chain between successive visits cannot be inﬂuenced by the past or the future (and that is due to the strong Markov property). we will show that we can split the sequence into i. Xi (1) . random variables is a (very) particular example of a Markov chain. by using the strong Markov property. Regardless. “chunks”. := Xn . we recursively deﬁne Ti (r) (2) := inf{n > Ti (r−1) : Xn = i}. (And we set Ti := Ti .i.d. 2.d. . We (r) refer to them as excursions of the process. However. . we let Ti be the time of the second (1) (3) visit. These excursions are independent! 11111111111 00000000000 1111111111 0000000000 11111111111 00000000000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 1111111111111111 0000000000000000 11111 11111111111111 00000000000000 1111111111 0000000000 i 00000 11111111111111111111111111111111111111111111 00000000000000000000000000000000000000000000 1111111111111111 0000000000000000 11111 00000 11111 00000 Ti = Ti (1) Ti (2) Ti (3) Ti (4) Fix a state i ∈ S and let Ti be the ﬁrst time that the chain will visit this state: Ti := inf{n ≥ 1 : Xn = i}. as “random variables”. . and so on. The random variables Xn comprising a general Markov chain are not independent. in a generalised sense... Our Ti here is always positive. . Xi (0) .9 Regenerative structure of a Markov chain A sequence of i. .

if it starts from state i at some point of time. 29 . Xi have the same distribution follows from the fact that the Markov chain is time-homogeneous.. we say that a state i is recurrent if. 10 Recurrence and transience When a Markov chain has ﬁnitely many states. Second. (Numerical) functions g(Xi independent random variables. Example: Deﬁne Ti (r+1) (0) ). Therefore. we say that a state i is transient if.i. they are also identical in law (i. all. The last corollary ensures us that Λ1 . by deﬁnition.. but there are always states which will be visited again and again. have the same Proof. given the state at (r) (r) (r) the stopping time Ti . An immediate consequence of the strong Markov property is: Theorem 9. The excursions Xi (0) . The claim that Xi . equal to i. are independent random variables. . starting from i. except. A trivial. the future after Ti is independent of the past before Ti . the chain returns to i only ﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 0. . but important corollary of the observation of this section is that: Corollary 5. Xi distribution. starting from i. and “goes for an excursion”. at least one state must be visited inﬁnitely many times. g(Xi (1) ). (r) . Xi (2) . Perhaps some states (inessential ones) will be visited ﬁnitely many times. In particular. possibly. of the excursions are Λr := (r) n=Ti 1(Xn = j) The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. . (0) are independent random variables. Apply Strong Markov Property (SMP) at Ti : SMP says that. . for time-homogeneous Markov chains. We abstract this trivial observation and give a couple of deﬁnitions: First.d. . But (r) the state at Ti is. .state i for the r-th time. we “watch” it up until the next time it returns to state i. g(Xi (2) ).). Λ2 . Moreover. and this proves the ﬁrst claim. ad inﬁnitum. . Therefore. it will behave in the same way as if it starts from the same state at another point of time. Xi (1) . the chain returns to i inﬁnitely many times: Pi (Xn = i for inﬁnitely many n) = 1. the future is independent of the (1) (2) past.

Notice that: state i is visited inﬁnitely many times ⇐⇒ Ji = ∞. 2 k Pi (Ji ≥ k) = fii Pi (Ji ≥ k − 1) = fii Pi (Ji ≥ k − 2) = · · · = fii . Suppose that the Markov chain starts from state i. The following are equivalent: 1. Starting from X0 = i. we conclude what we asserted earlier. 2. State i is recurrent. If fii < 1. By the strong Markov property at time Ti (k−1) Pi (Ji ≥ k) = Pi (Ji ≥ k − 1)Pi (Ji ≥ 1). in principle. Therefore. we have Proof. the random variable Ji is geometric: k Pi (Ji ≥ k) = fii . We claim that: Lemma 4. k ≥ 0. Because either fii = 1 or fii < 1. Notice also that fii := Pi (Ji ≥ 1) = Pi (Ti < ∞).Important note: Notice that the two statements above are not negations of one another. Lemma 5. 3. This means that the state i is transient.e. then Pi (Ji < ∞) = 1. therefore concluding that every state is either recurrent or transient. there could be yet another possibility: 0 < Pi (Xn = i for inﬁnitely many n) < 1. i. Consider the total number of visits to state i (excluding the possibility that X0 = i): ∞ Ji = n=1 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞} . This means that the state i is recurrent. We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas. . (k−1) Notice that fii = Pi (Ti < ∞) can either be equal to 1 or smaller than 1. Ei Ji = ∞. and that is why we get the product. every state is either recurrent or transient. If fii = 1 we have that Pi (Ji ≥ k) = 1 for all k. fii = Pi (Ti < ∞) = 1. Pi (Ji = ∞) = 1. But the past (k−1) before Ti is independent of the future. We will show that such a possibility does not exist. 30 . 4. the event {Ji ≥ k} implies that Ti < ∞ and that Ji ≥ k − 1. Pi (Ji = ∞) = 1. namely. Indeed. Indeed.

Proof. Lemma 6. if we assume Ei Ji < ∞. n ≥ 1. Proof. it is visited inﬁnitely many times. as n → ∞. Notice the diﬀerence with pij . The following are equivalent: 1. (n) i. it is visited ﬁnitely many times with probability 1. which. since Ji is a geometric random variable. fii = Pi (Ti < ∞) < 1. we see that its expectation cannot be inﬁnity without Ji itself being inﬁnity with probability one. The equivalence of the ﬁrst two statements was proved before the lemma. Proof. But ∞ ∞ ∞ (n) Ei Ji = Ei n=1 1(Xn = i) = n=1 Pi (Xn = i) = n=1 (n) pii . 3. then. The equivalence of the ﬁrst two statements was proved above. Pi (Ji < ∞) = 1.e. On the other hand. (n) pij n (m) (n−m) (n) (n) = m=1 fij pjj . as n → ∞. If i is transient then Ei Ji < ∞. pii → 0.1 Deﬁne First hitting time decomposition fij := Pi (Tj = n). by deﬁnition. is equivalent to Pi (Ji < ∞) = 1. 4. but the two sequences are related in an exact way: Lemma 7. State i is recurrent if. Clearly. Ji cannot take value +∞ with positive probability. therefore Pi (Ji < ∞) = 1. Xn = j) Pi (Tj = m)Pi (Xn = j | Tj = m) 31 = m=1 . Proof. Corollary 6. by the deﬁnition of Ji . 2. assuming that the chain (n) starts from state i.e. fij ≤ pij . From Elementary Probability. This. then. State i is transient. State i is transient if. Ei Ji < ∞. clearly. If Ei Ji = ∞. and. owing to the fact that Ji is a geometric random variable. i. The latter is the probability to be in state j at time n (not necessarily for the ﬁrst time) starting from i. 10. by deﬁnition of Ji is equivalent to Pi (Ji = ∞) = 1. Hence Ei Ji = ∞ also. (n) But if a series is ﬁnite then the summands go to 0. the probability to hit state j for the ﬁrst time at time n ≥ 1. If i is a transient state then pii → 0. this implies that Ei Ji < ∞. (n) pij n = Pi (Xn = j) = m=1 n Pi (Tj = m. by deﬁnition.

but also fij = 1. and since they take the value 1 with positive probability. ∞ Proof. and so the probability that j will appear in a speciﬁc excursion is positive. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). For the second term.) Theorem 10. . for all states i.d.2 Communicating classes and recurrence/transience Recall the deﬁnition of the relation “state i leads to state j”.r := 1 j appears in excursion Xi (r) . the probability to go to j in some ﬁnite time is positive. In other words. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. excursions Xi . Xi . and hence pij as n → ∞.i. Start with X0 = i. as n → ∞. inﬁnitely many of them will be 1 (with probability 1).. . . and gij := Pi (Tj < Ti ) > 0. r = 0. the probability that j will appear in a speciﬁc excursion equals gij = Pi (Tj < Ti ). If j is transient. Hence. are i. In particular. Corollary 8. not only fij > 0. So the random variables δj. 1. denoted by i j: it means that. The last statement is simply an expression in symbols of what we said above. If j is a transient state then pij → 0.The ﬁrst term equals fij . We now show that recurrence is a class property. ( Remember: another property that was a class property was the property that the state have a certain period. If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. Corollary 7. showing that j will appear in inﬁnitely many of the excursions for sure. starting from i. which is ﬁnite. Now. if we let fij := Pi (Tj < ∞) . Proof. . by deﬁnition. we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj . . 32 . 10.i. Indeed. . j i. we have i j ⇐⇒ fij > 0. Hence. j i. In a communicating class. we have n=1 pjj = Ej Jj < ∞. either all states are recurrent or all states are transient. by summing over n the relation of Lemma 7.d.

necessarily. If a communicating class contains a state which is transient then. . Theorem 12. U2 . So if a communicating class is recurrent then it contains no inessential states and so it is closed. Theorem 11. But then. some state i must be visited inﬁnitely many times. . P (T > n) = P (U1 ≤ U0 . Un ≤ x | U0 = x)dx xn dx = 1 . Recall that a ﬁnite random variable may not necessarily have ﬁnite expectation. but Pi (A ∩ B) = Pi (Xm = j. . 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ). . Pi (B) = 0. Hence. appear in a certain game. .e. If i is recurrent then every other j in the same communicating class satisﬁes i j. U1 . Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x.i. . Let C be the class. Xn = i for inﬁnitely many n) = 0. every other j is also recurrent. random variables. If i is inessential then it is transient. uniformly distributed in the unit interval [0. Thus T represents the number of variables we need to draw to get a value larger than the initial one.e. By ﬁniteness. What is the expectation of T ? First. Proof. there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0. X remains in C. Indeed. Every ﬁnite closed communicating class is recurrent. all other states are also transient. by the above theorem. . Example: Let U0 . . Hence all states are recurrent. This state is recurrent. 1]. i. Hence 11 Positive recurrence Recall that if we ﬁx a state i and let Ti be the ﬁrst return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state). . for example. i.d. By closedness. if started in C. . observe that T is ﬁnite. Proof. j but j i. Let T := inf{n ≥ 1 : Un > U0 }. It should be clear that this can.Proof. and so i is a transient state. We now take a closer look at recurrence. . n+1 33 = 0 . and hence. Suppose that i is inessential. P (B) = Pi (Xn = i for inﬁnitely many n) < 1. be i.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. 0) = ∞. Before doing that. the sequence X0 . arguably. we shall discover what the form of this stationary distribution is by using the Law of Large Numbers 12 Law (=Theorem) of Large Numbers in Probability Theory The following result is. 13 Law of Large Numbers for Markov chains The law of large numbers stated above requires independence. . Theorem 13. We say that i is null recurrent if Ei Ti = ∞. To apply the law of large numbers we must create some kind of independence. . (ii) Suppose E max(Z1 . When we have a Markov chain.i. To our rescue comes the regenerative structure of a Markov chain identiﬁed at an earlier section. P (T < ∞) = lim P (T ≤ n) = lim (1 − n→∞ ∞ n→∞ 1 ) = 1. We state it here without proof. but E min(Z1 . on the average.d. be i. random variables. X2 . It is traditional to refer to it as “Law”. n+1 But ET = ∞ P (T > n) = n=1 n=1 1 = ∞. It is not a Law. n+1 You are invited to think about the meaning of this: If the initial value U0 is unknown. But this is misleading. Then P Z1 + · · · + Zn =µ n→∞ n lim = 1. Z2 . Let Z1 . 0) > −∞. it is a Theorem that can be derived from the axioms of Probability Theory. then it takes.Therefore. . 34 . the Fundamental Theorem of Probability Theory. Then P Z1 + · · · + Zn =∞ n→∞ n lim = 1. of successive states is not an independent sequence. X1 . an inﬁnite number of steps to exceed it! We now classify a recurrent state i as follows: We say that i is positive recurrent if Ei Ti < ∞. . . . (i) Suppose E|Z1 | < ∞ and let µ = EZ1 .

. the (0) initial excursion Xa also has the same law as the rest of them. We are interested in the existence of the limit time-average reward = lim 1 t→∞ t t f (Xk ). The ﬁnal equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a.2 Average reward Consider now a function f : S → R. . . Speciﬁcally. as in Example 2. are i. . because the excursions Xa . the law of large numbers tells us that G(Xa ) + · · · + G(Xa ) → EG(Xa(1) ). Example 1: To each state x ∈ S. Example 2: Let f (x) be as in example 1. (r) (r+1) G(Xa(r) ) = max{f (Xt ) : Ta ≤ t < Ta }. k=0 which represents the ‘time-average reward’ received. Thus. .d. but deﬁne G(X ) to be the total reward received over the excursion X . we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit. . assign a reward f (x). In particular. in this example.i. we can think of as a reward received when the chain is in state x. . Then. G(Xa(r) ) = (r) (r+1) Ta ≤t<Ta f (Xt ).i. 35 . with probability 1. Since functions of independent random variables are also independent. random variables. Xa(3) .13. G(Xa(2) ). forms an independent sequence of (generalised) random variables.d. deﬁne G(X ) to be the maximum reward received during this excursion. . for reasonable choices of the functions G taking values in R. . Xa . Note that EG(Xa(1) ) = EG(Xa(2) ) = · · · = E G(Xa(0) ) | X0 = a = Ea G(Xa(0) ). which. for an excursion X . G(Xa(3) ).1 Functions of excursions Namely. n as n → ∞. (7) Whichever the choice of G. and because if we start with X0 = a. Xa(2) . we have that G(Xa(1) ). (1) (2) (1) (n) (8) 13. are i. given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been ﬁxed once and for all) the sequence of excursions Xa(1) . .

We are looking for the total reward received between 0 and t.e. We assumed that f is t bounded. t t In other words. in the ﬁgure below. we have BTa /t → 0. Proof. Ea Ta < ∞) and assume that the distribution of X0 is such that P (Ta < ∞) = 1. Ea f= n=0 f (Xn ) E a Ta = x∈S f (x)π(x). t as t → ∞. plus the total reward received over the next excursion. Thus we need to keep track of the number. of complete excursions from the ground state a that occur between 0 and t. t t as t → ∞. Nt . t t 36 . Let f : S → R+ be a bounded ‘reward’ function. Then. and so 1 ﬁrst G → 0. with probability one. Nt = 5.Theorem 14. A similar argument shows that 1 last G → 0. and so on. 0 Ta (1) Ta (2) T (3) a Ta(4) Ta (5) t We write the total reward as a sum of rewards over complete excursions plus a ﬁrst and a last term: t Nt −1 r=1 (1) f (Xn ) = n=0 Gr + Gﬁrst + Glast . say. say |f (x)| ≤ B. with probability one. For example. We shall present the proof in a way that the constant f will reveal itself–it will be discovered. We break the sum into the total reward received over the initial excursion. Hence |Gﬁrst | ≤ BTa . Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i. The idea is this: Suppose that t is large enough. Since P (Ta < ∞) = 1. Gﬁrst is the reward up to time Ta = Ta or t (whichever t is smaller). (r) Nt := max{r ≥ 1 : Ta ≤ t}. and Glast is the reward over the last incomplete cycle. the ‘time-average reward’ f exists and is a deterministic quantity: P where Ta −1 1 t→∞ t lim t f (Xn ) = f n=0 =1 . Gr is given by (7).

= E a Ta . Since Ta → ∞.i. as t → ∞. 1 Nt = .d. So we are left with the sum of the rewards over complete cycles.also. = E a Ta . (11) By deﬁnition. Write now 1 t Nt −1 Gr = r=1 Nt 1 t Nt Nt −1 Gr . 2.. . r=1 (10) Now look at the sequence r (r) (1) Ta = Ta + k=1 (k) (k−1) ) (Ta − Ta (1) (k) (k−1) and see what the Law of Large Numbers again. and since Ta − Ta k = 1. . E a Ta 37 . we have that the visit with index r = Nt to the ground state a is the last before t (see also ﬁgure above): (N (N Ta t ) ≤ t < Ta t +1) . lim t∞ 1 Nt Nt −1 Gr = EG1 . are i. r=1 (9) Look again what the Law of Large Numbers (8) tells us: that 1 n→∞ n lim (r) n Gr = EG1 . E a Ta f (Xn ) = n=0 EG1 =: f . it follows that the number Nt of complete cycles before time t will also be tending to ∞. Hence Ta t t Ta t ≤ < Nt Nt Nt But (11) tells us that Ta t Ta t = lim t→∞ Nt t→∞ Nt lim Hence lim (N ) (N +1) (N ) (N +1) . as r → ∞. random variables with common expectation Ea Ta . r=1 with probability one. t→∞ t E a Ta (12) Using (10) and (12) in (9) we have 1 lim t→∞ t Thus 1 t→∞ t lim t Nt −1 Gr = r=1 EG1 . Since Ta /r → 0. we have Ta r→∞ r lim (r) . . Hence.

Comment 4: Suppose that two states. Note that we include only one endpoint of the excursion. which communicate and are positive recurrent. not both. The equality of these two limits gives: average no. E b Tb 38 . Ta −1 k=0 1 lim t→∞ t with probability 1. on the average. b.To see that f is as claimed. Hence t→∞ lim 1 t t 1(Xn = a) = n=1 1 . The physical interpretation of the latter should be clear: If. just observe that EG1 = E (1) (1+1) Ta ≤t<Ta f (Xt ) = Ea 0≤t<Ta f (Xt ). The constant f is a sort of average over an excursion. (13) 1(Xn = b) = n=1 Comment 3: If in the above we let b = a. t Ea 1(Xk = b) E a Ta . We may write EG1 = Ea 0≤t<Ta f (Xt ) or EG1 = Ea 0<t≤Ta f (Xt ). of times state b is visited = Ea between two successive visits to state a Ta −1 1(Xk = b) = k=0 E a Ta . We can use b as a ground state instead of a. then we see that the numerator equals 1. Applying formula (14) with b in place of a we ﬁnd that the limit in (13) also equals 1/Eb Tb . f(X2 ) f(X1 ) f(X )= f(a) 0 f(T −1) a 0 T a Comment 2: A very important particular case occurs when we choose f (x) := 1(x = b).e. The theorem above then says that. for all b ∈ S. we let f to be 1 at the state b and 0 everywhere else. The denominator is the duration of the excursion. a. the ground state. Comment 1: It is important to understand what the formula says. from the Strong Markov Property. it takes Ea Ta units of time between successive visits to the state a. The numerator of f is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). i. then the long-run proportion of visits to the state a should equal 1/Ea Ta . E a Ta (14) with probability 1.

x ∈ S. X3 .e. we will soon realise that this dependence vanishes when the chain is irreducible. where a is the ﬁxed recurrent ground state whose existence was assumed. so there is nothing lost in doing so. i. (15) should play an important role. we have 0 < ν [a] (x) < ∞.14 Construction of a stationary distribution Our discussions on the law of large numbers for Markov chains revealed that the quantity Ta −1 ν (x) := Ea n=0 [a] 1(Xn = x) . 39 . by a function of X1 . Moreover. Proof. Both of them equal 1. In view of this. X1 . . Proposition 2. for any state b such that a → x (a leads to x). Fix a ground state a ∈ S and assume that it is recurrent5 . Note: Although ν [a] depends on the choice of the ‘ground’ state a. Consider the function ν [a] deﬁned in (15). x ∈ S. . Clearly. . We start the Markov chain with X0 = a. . X2 . we saw (Comment 2 above) that if a is a positive recurrent state then ν [a] (x) E a Ta is the long-run fraction of times that state x is visited. Hence the superscript [a] will soon be dropped. . Then ν [a] satisﬁes the balance equations: ν [a] = ν [a] P. Recurrence tells us that Ta < ∞ with probability one. The real reason for doing so is because we wanted to replace a function of X0 . (16) This was a trivial but important step: We replaced the n = 0 term with the n = Ta term. this should have something to do with the concept of stationary distribution discussed a few sections ago. ν [a] (x) = y∈S ν [a] (y)pyx . 5 We stress that we do not need the stronger assumption of positive recurrence here. X2 . notice that the choice of notation ν [a] is justiﬁable through the notation (6) for the communicating class of the state a. Indeed. we can write ν (x) = Ea k=1 [a] Ta 1(Xn = x). . Since a = XT a .

and thus be in a position to apply ﬁrst-step analysis. Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. 40 . Xn−1 = y. let x be such that a x. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. So we have shown ν [a] (x) = y∈S pyx ν [a] (y). First note that. (17) This π is a stationary distribution. which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. ν [a] = ν [a] Pn . Note: The last result was shown under the assumption that a is a recurrent state. so there exists m ≥ 1. We just showed that ν [a] = ν [a] P. that a is positive recurrent. indeed. namely. Deﬁne Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . n ≤ Ta ) pyx Pa (Xn−1 = y. where. Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta . for all x ∈ S. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). in this last step. for all n ∈ N. ax ax From Theorem 10. such that pax > 0. We now have: Corollary 9. x ∈ S. xa so. so there exists n ≥ 1. n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). which are precisely the balance equations. we used (16). ν [a] (x) < ∞. by deﬁnition. (n) Next. We are now going to assume more. is ﬁnite when a is positive recurrent. Suppose that the Markov chain possesses a positive recurrent state a. Therefore. such that pxa > 0. which. n ≤ Ta ) Pa (Xn = x. from (16). x a.

2. Start with X0 = i. If heads show up at the r-th excursion. we have Ea Ta < ∞.i. Then j is positive Proof. π(y) = 1 y∈S x∈S have at least one solution. It only remains to show that π [a] is a probability. 41 . then the r-th excursion does not contain state j. The coins are i. second) excursion that contains j.d. since the excursions are independent. then j occurs in the r-th excursion. . then the set of linear equations π(x) = y∈S π(y)pyx . π [a] P = π [a] . in general. unique. Since a is positive recurrent. Let R1 (respectively. But π [a] is simply a multiple of ν [a] . . we toss a coin with probability of heads gij > 0. i. x∈S Pause for thought: We have achieved something VERY IMPORTANT: if we have one. Let i be a positive recurrent state. r = 0. and the probability gij = Pi (Tj < ∞) that j occurs in a speciﬁc excursion is positive. (r) j. and so π [a] is not trivially equal to zero. Consider the excursions Xi . no matter which one. But π [a] (x) = x∈S 1 E a Ta ν [a] (x) = 1. Hence j will occur in inﬁnitely many excursions. In words. to decide whether j will occur in a speciﬁc excursion.In other words: If there is a positive recurrent state. This is what we will try to do in the sequel. R2 ) be the index of the ﬁrst (respectively. Therefore. the event that j occurs in a speciﬁc excursion is independent from the event that j occurs in the next excursion. Proof. We already know that j is recurrent. We have already shown. Theorem 15. 1. Also.e. that it sums up to one. then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is. that ν [a] P = ν [a] . positive recurrent state. if tails show up. in Proposition 2. Otherwise. . 15 Positive recurrence is a class property We now show that (just like recurrence) positive recurrence is also a class property. Suppose that i recurrent.

But. . In other words. where the Zr are i. An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. . we just need to show that R2 −R1 +1 ζ := r=1 Zr has ﬁnite expectation. the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . . An irreducible Markov chain is called transient if one (and hence all) states are transient. IRREDUCIBLE CHAIN r = 1. with the same law as Ti . 2. Let C be a communicating class. RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 .i. R2 has ﬁnite expectation. by Wald’s lemma. Corollary 10. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. . or all states are null recurrent or all states are transient. Then either all states in C are positive recurrent. Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞.d. .R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . . An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. .

indeed. following Theorem 14. x ∈ S. Then there is a unique stationary distribution. (18) π(x)Px (Ta < ∞) = π(x) = 1. Example: Consider the chain we’ve seen before: 1 1/3 1 0 1/2 1 1/6 2 Clearly. for all n. an absorbing state).16 Uniqueness of the stationary distribution At the risk of being too pedantic. (You should think why this is OBVIOUS!) Notice that 1 This lack of communication is. for all states i and j. we have. π(2) = 1 − α 2. for all x ∈ S. we stress that a stationary distribution may not be unique. Since the chain is irreducible and positive recurrent we have Pi (Tj < ∞) = 1. x ∈ S. n=1 43 . 1(Xn = x) n=1 = 1 t t P (Xn = x) = π(x). Consider positive recurrent irreducible Markov chain. Hence. as t → ∞. Then. the π deﬁned by π(0) = α. Then P (Ta < ∞) = x∈S In other words. there is a stationary distribution. state 1 (and state 2) is positive recurrent. Proof. by (18). Fix a ground state a. π(1) = 0. By a theorem of Analysis (Bounded Convergence Theorem). for totally trivial reasons (it is. Consider the Markov chain (Xn . But observe that there are many–inﬁnitely many–stationary distributions: For each 0 ≤ α ≤ 1. we start the Markov chain from the stationary distribution π. x∈S By (13). is a stationary distribution. Theorem 16. recognising that the random variables on the left are nonnegative and bounded below 1. E 1 t t 1 t t 1(Xn = x) n=1 → π [a] (x). But. what prohibits uniqueness. we can take expectations of both sides: E as t → ∞. n ≥ 0) and suppose that P (X0 = x) = π(x). with probability one. Let π be a stationary distribution. Ta −1 1 t t Ea 1(Xk = x) k=0 1(Xn = x) n=1 → E a Ta = π [a] (x). we have P (Xn = x) = π(x). after all.

Thus. Therefore Ei Ti < ∞. for all a ∈ S. Proof. By the 1 above corollary.Therefore π(x) = π [a] (x). Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π C 44 . b ∈ S. Decompose it into communicating classes. Let π(·|C) be the distribution deﬁned by restricting π to C: π(x|C) := π(x) . an arbitrary stationary distribution π must be equal to the speciﬁc one π [a] . Hence π = π [a] . 17 Structure of a stationary distribution It is not hard to show now what an arbitrary stationary distribution looks like. Consider an arbitrary Markov chain. There is only one stationary distribution. Namely. In particular. Consider a positive recurrent irreducible Markov chain and two states a. where π(C) = y∈C π(y). In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. which is useful because it is often used as a criterion for positive recurrence: Corollary 13. Assume that a stationary distribution π exists. 1 π(a) = . π(C) x ∈ C. But π(i|C) > 0. for all x ∈ S. we obtain the cycle formula: Ta −1 Tb −1 Ea k=0 1(Xk = x) E a Ta = Eb k=0 1(Xk = x) E b Tb . Consider a positive recurrent irreducible Markov chain with stationary distribution π. Both π [a] and π[b] are stationary distributions. π(a) = π [a] (a) = 1/Ea Ta . We can easily see that we thus obtain a Markov chain with state space C with stationary distribution π(·|C). Consider a Markov chain. The following is a sort of converse to that. Proof. π(i|C) = Ei Ti . i is positive recurrent. so they are equal. Corollary 11. There is a unique stationary distribution. Corollary 12. Hence there is only one stationary distribution. from the deﬁnition of π [a] . The cycle formula merely re-expresses this equality. the only stationary distribution for the restricted chain because C is irreducible. E a Ta Proof. This is in fact. delete all transition probabilities pxy with x ∈ C. Let i be such that π(i) > 0. y ∈ C. In particular. Then. Then π [a] = π [b] . Consider the restriction of the chain to C. Let C be the communicating class of i. Then any state i such that π(i) > 0 is positive recurrent. x ∈ S.e. i.

eventually. So far we have seen. that. We all know. it will settle to the vertical position. x ∈ C) is a stationary distribution. where π(C) := π(C) π(x). For each such C deﬁne the conditional distribution π(x|C) := π(x) . take an = (−1)n . By our uniqueness result. consider the motion of a pendulum under the action of gravity. from physical experience. Any stationary distribution π is necessarily of the form π= C: C positive recurrent communicating class αC π C . π(x|C) ≡ π C (x). Proposition 3.1 Coupling and stability Deﬁnition of stability Stability refers to the convergence of the probabilities P (Xn = x) as n → ∞. Why stability? There are physical reasons why we are interested in the stability of a Markov chain. Proof. Stability of a deterministic dynamical system means that the system eventually converges to a ﬁxed point (or. the so-called Cesaro 1 averages n n P (Xk = x) converge to π(x). Let π be an arbitrary stationary distribution. that it remains in a small region). x∈C Then (π(x|C). 18 18. To each positive recurrent communicating class C there corresponds a stationary distribution π C . under irreducibility and positive recurrence conditions. A Markov chain is a simple model of a stochastic dynamical system. where αC ≥ 0.which assigns positive probability to each state in C. For example. such that C αC = 1. for example. as n → ∞. So we have that the above decomposition holds with αC = π(C). more generally. that the pendulum will swing for a while and. There is dissipation of energy due to friction and that causes the settling down. A sequence of real numbers an k=1 1 may not converge but its Cesaro averages n (a1 + · · · + an ) may converge. This is deﬁned by picking a state (any state!) a ∈ C and letting π C := π [a] . This is the stable position. An arbitrary stationary distribution π must be a linear combination of these π C . the pendulum will perform an 45 . In an ideal pendulum. without friction. If π(x) > 0 then x belongs to some positive recurrent class C.

random variables. we may it is stable because it does not (it cannot!) swing too far away. If. Since we are prohibited by the syllabus to speak about continuous time stochastic systems. ······ 46 . It follows that |ρ| < 1 is the condition for stability. The thing to observe here is that there is an equilibrium point i. We say that x∗ is a stable equilibrium. However notice what happens when we solve the recursion. moreover. Still. And we are in a happy situation here. ˙ There are myriads of applications giving physical interpretation to this.e. Speciﬁcally. simple because the ξn ’s depend on the time parameter n. and deﬁne a stochastic sequence X1 . ξ0 . we have x(t) → x∗ . then x(t) = x∗ for all t > 0. the one found by setting the right-hand side equal to zero: x∗ = b/a. . . ξ1 . a > 0. mutually independent. . But what does stability mean? In general. We shall assume that the ξn have the same law. . x ∈ R. 4 3 2 x* 1 0 0 1 2 t 3 t4 If we start with x(t = 0) = x∗ . The simplest diﬀerential equation As another example. as t → ∞. or an example in physics.oscillatory motion ad inﬁnitum. it cannot mean convergence to a ﬁxed equilibrium point. for we can do it: X1 = ρX0 + ξ0 X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0 ) + ξ1 = ρ2 X0 + ρξ0 + ξ1 X3 = ρX2 + ξ2 = ρ3 X0 + ρ2 ξ0 + ρξ1 + ξ2 Xn = ρn X0 + ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . so I’ll leave it upon the reader to decide whether he or she wants to consider this as an ﬁnancial example. are given. or whatever. let us consider a stochastic analogue of the above in discrete time: Xn+1 = ρXn + ξn . . ξ2 . then regardless of the starting position. we here assume that X0 . by means of the recursion above. X2 . A linear stochastic system We now pass on to stochastic systems. . consider the following diﬀerential equation x = −ax + b.

If a Markov chain is irreducible. for any initial distribution.Since |ρ| < 1. 18. but we are not given what P (X = x. Y = y) is? 47 . g(y) = P (Y = y). In other words. . suppose you are given the laws of two random variables X. we know f (x) = P (X = x). the n-step transition matrix Pn . we have that the ﬁrst term goes to 0 as n → ∞. deﬁne stability in a way other than convergence of the distributions of the random variables. for instance. Y = y) = P (X = x)P (Y = y). the limit of its distribution does: ˜ ˆ ˆ lim P (Xn ≤ x) = lim P (Xn ≤ x) = lim P (Xn ≤ x) = P (X∞ ≤ x) n→∞ n→∞ n→∞ This is precisely the reason why we cannot. In particular. converges. but notice that if we let ˆ Xn := ρn−1 ξn−1 + ρn−2 ξn−2 + · · · + ρξ1 + ξ0 . What this means is that. . Y = y) is. for time-homogeneous Markov chains. (e. to try to make the coincidence probability P (X = Y ) as large as possible.3 Coupling To understand why the fundamental stability theorem works. Coupling refers to the various methods for constructing this joint law. under nice conditions. . as n → ∞. which is a linear combination of ξ0 . The second term. A straightforward (often not so useful) construction is to assume that the random variables are independent and deﬁne P (X = x. uniformly in x. . ∞ ˆ The nice thing is that. assume that the ξn are bounded) Xn does converge to ˆ X∞ = k=0 ρk ξk . we have ˜ ˆ P (Xn ≤ x) = P (Xn ≤ x). j ∈ S. But suppose that we have some further requirement. ξn−1 does not converge. Starting at an elementary level. we need to understand the notion of coupling. Y but not their joint law. positive recurrent and aperiodic with (unique) stationary distribution π then n→∞ lim P (Xn = x) = π(x). ˜ Xn := ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . 18. i.2 The fundamental stability theorem for Markov chains Theorem 17. while the limit of Xn does not exists. How should we then deﬁne what P (X = x.g. to a matrix with rows all equal to π: (n) lim p n→∞ ij = π(j).

then |P (Xn = x) − P (Yn = x)| → 0. uniformly in x ∈ S. Find the joint law h(x. This T is a random variables that takes values in the set of times or it may take values +∞. n ≥ T ) ≤ P (n < T ) + P (Yn = x). We are given the law of a stochastic process X = (Xn .e. x ∈ S. Since this holds for all x ∈ S and since the right hand side does not depend on x. for all n ≥ 0. Y be random variables with values in Z and laws f (x) = P (X = x). precisely because of the assumption that the two processes have been constructed together. Note that it makes sense to speak of a meeting time. Let T be a meeting of two coupled stochastic processes X. Y = y) which respects the marginals. n ≥ 0. n < T ) + P (Yn = x. in particular. Combining the two inequalities we obtain the ﬁrst assertion: |P (Xn = x) − P (Yn = x)| ≤ P (T > n). n ≥ 0).Exercise: Let X. P (T < ∞) = 1. The following result is the so-called coupling inequality: Proposition 4. and all x ∈ S. g(y) = P (Y = y). and which maximises P (X = Y ). we obtain P (Yn = x) − P (Xn = x) ≤ P (T > n). If. The coupling we are interested in is a coupling of processes. y). T is ﬁnite with probability one. f (x) = y h(x. i. |P (Xn = x) − P (Yn = x)| ≤ P (T > n). y) = P (X = x. Then. Y . Proof. n ≥ T ) = P (Xn = x. respectively. i. n ≥ 0. x∈S 48 . g(y) = x h(x. n ≥ 0) and the law of a stochastic process Y = (Yn . We have P (Xn = x) = P (Xn = x. A coupling of them refers to a joint construction on the same probability space. Suppose that we have two stochastic processes as above and suppose that they have been coupled. n < T ) + P (Xn = x. Repeating (19).e. we can write it as sup |P (Xn = x) − P (Yn = x)| ≤ P (T > n). but with the roles of X and Y interchanged. on the event that the two processes never meet. but we are not told what the joint probabilities are. y). and hence P (Xn = x) − P (Yn = x) ≤ P (T > n). We say that ransom time T is meeting time of the two processes if Xn = Yn for all n ≥ T. (19) as n → ∞.

y) Its n-step transition probabilities are q(x.y′ ) := P Wn+1 = (x′ . The second process. Consider the process Wn := (Xn .(y.) By positive recurrence.x′ ). From Theorem 3 and the aperiodicity assumption.y′ for all large n.y′ ) = px. 49 (n) (n) (n) (n) (n) (n) = P (Xn+1 = x′ | Xn = x) P (Yn+1 = y ′ | Yn = y) = px. Its (1-step) transition probabilities are q(x. then. y) = µ(x)π(y). It is very important to note that we have only deﬁned the law of X and the law of Y but we have not deﬁned a joint law. We now couple them by doing the straightforward thing: we assume that X = (Xn . The ﬁrst process. is a stationary distribution for (Wn . Its initial state W0 = (X0 . in other words. y ′ ) | Wn = (x.Assume now that P (T < ∞) = 1. (x. n ≥ 0). sup |P (Xn = x) − P (Yn = x)| → 0.4 Proof of the fundamental stability theorem First. denoted by Y = (Yn . implying that q(x. y) ∈ S × S.y′ ) for all large n. Notice that σ(x. and so (Wn . we have not deﬁned joint probabilities such as P (Xn = x. π(x) > 0 for all x ∈ S. Xm = y). We know that there is a unique stationary distribution. n ≥ 0) is a Markov chain with state space S × S.(y. we have that px. We shall consider the laws of two processes. positive recurrent and aperiodic. had we started with both X0 and Y0 independent and distributed according to π.x′ py. y) := π(x)π(y). we can deﬁne joint probabilities. n ≥ 0) has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ.x′ py. n ≥ 0) is independent of Y = (Yn . . Xn and Yn would be independent and distributed according to π. review the assumptions of the theorem: the chain is irreducible. for all n ≥ 1. the transition probabilities. n ≥ 0) is an irreducible chain. (Indeed.y′ . n ≥ 0).(y. n ≥ 0) has the law of a Markov chain with transition probabilities pij and Y0 distributed according to the stationary distribution µ. Then n→∞ lim P (T > n) = P (T = ∞) = 0. Therefore. Y0 ) has distribution P (W0 = (x. Then (Wn . So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities. as usual. Yn ). Having coupled them. they diﬀer only in how the initial states are chosen.x′ > 0 and py. and it also makes sense to consider their ﬁrst meeting time: T := inf{n ≥ 0 : Xn = Yn }.x′ ). x∈S as n → ∞. denoted by X = (Xn .y′ .x′ ). Let pij denote. 18. denoted by π.

Hence (Wn . 1).0). if we modify the original chain and make it aperiodic by.0) = q(0. 0). However. 0). P (T < ∞) = 1. (1. p10 = 1. |P (Zn = x) − P (Yn = x)| ≤ P (T > n). By the coupling inequality. This is not irreducible. Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X. n ≥ 0) is identical in law to (Xn . n ≥ 0). and since P (Yn = x) = π(x). X arbitrary distribution Z stationary distribution Y T Thus. Zn = Yn . P (Xn = x) = P (Zn = x). 1} and the pij are p01 = p10 = 1. uniformly in x. Since P (Zn = x) = P (Xn = x). Hence (Zn . Then the product chain has state space S ×S = {(0. Deﬁne now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ).1).1) = q(1. In particular. Y ) may fail to be irreducible.(1.(0. (0. Z0 = X0 which has an arbitrary distribution µ.0). y) ∈ S × S. Clearly. In particular. Obviously. then the process W = (X. ≥ 0) is a Markov chain with transition probabilities pij .1). (Zn . For example. n ≥ 0) is positive recurrent. as n → ∞. 1)} and the transition probabilities are q(0.01. after the meeting time.1) = 1. p00 = 0. Moreover.0) = q(1. T is a meeting time between Z and Y . by assumption.(0.99. suppose that S = {0. say. which converges to zero. for all n and x. (1.(1. p01 = 0. 50 . we have |P (Xn = x) − π(x)| ≤ P (T > n). precisely because P (T < ∞) = 1.Therefore σ(x. y) > 0 for all (x. Zn equals Xn before the meeting time of X and Y .

we can decompose the state space into d classes C0 . Parry. π(1) satisfy 0. such that the chain moves cyclically between them: from C0 to C1 .503 0. But this is “wrong”. Cd−1 .503. By the results of §5. π(0) + π(1) = 1. most people use the term “ergodic” for an irreducible. π(0) ≈ 0.497 18. customary to have a consistent terminology. 0. we know we have a unique stationary distribution π and that π(x) > 0 for all states x. in particular. .e. Theorem 4.497 . 6 The reason is that the term “ergodic” means something speciﬁc in Mathematics (c.99π(0) = π(1). n→∞ lim p00 p01 (n) (n) p10 p11 (n) (n) = 0. Therefore p00 = 1 for even n (n) and = 0 for odd n. positive recurrent and aperiodic Markov chain. Suppose X0 = 0. In matrix notation. per se wrong. . p01 = 0. n→∞ (n) where π(0).01. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing.f. limn→∞ p00 does not exist.6 Periodic chains Suppose now that we have a positive recurrent irreducible chain.497. Indeed. π(1) ≈ 0. however.99. However. By the results of Section 14. and this is expressed by our terminology. 18. 51 . and so on. . then (n) lim p n→∞ 00 (n) = lim p10 = π(0). Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. It is. 1981). if we modify the chain by p00 = 0. The reader is asked to pay attention to this terminology which is diﬀerent from the one used in the Markov chain literature.4 and.503 0. from Cd back to C0 . Terminology cannot be. Then Xn = 0 for even n and = 1 for odd n. Thus. from C1 to C2 .then the product chain becomes irreducible. It is not diﬃcult to see that 6 We put “wrong” in quotes. . Suppose that the period of the chain is d. C1 . p10 = 1. n→∞ (n) (n) lim p n→∞ 11 = lim p01 = π(1). i.5 Terminology A Markov chain which is irreducible and positive recurrent will be called ergodic.

as n → ∞. π(i) = 1/d. Let αr := i∈Cr π(i). → dπ(j). Can we also ﬁnd the limit when i and j do not belong to the same class? Suppose that i ∈ Ck . . Since the αr add up to 1. d − 1. Then pij (r+nd) (nd) = Pi (Xnd = j) → dπ(j). if sampled every d steps.Lemma 8. . Since (Xnd . namely C0 . . We shall show that Lemma 9. the (unique) stationary distribution of (Xnd . Then. i∈Cr for all r = 0. n ≥ 0) consists of d closed communicating classes. All states have period 1 for this chain. 52 . and let r = ℓ − k (modulo d). . i ∈ S. Hence if we restrict the chain (Xnd . j ∈ Cℓ . . . Then pji = 0 unless j ∈ Cr−1 . j ∈ Cr . i ∈ Cr . Proof. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d. it remains in Cℓ . i ∈ Cr . The chain (Xnd . . where π is the stationary distribution. n ≥ 0) is dπ(i). We have that π satisﬁes the balance equations: π(i) = j∈S π(j)pji . n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain. as n → ∞. Then. This means that: Theorem 18. Let i ∈ Ck . So π(i) = j∈Cr−1 π(j)pji . starting from i. j ∈ Cℓ . we have pij for all i. the original chain enters Cℓ in r = ℓ − k steps (where r is taken to be a number between 0 and d − 1. n ≥ 0) is aperiodic. thereafter. and the previous results apply. after reduction by d) and. Cd−1 . Suppose that X0 ∈ Cr . 1. each one must be equal to 1/d. . Summing up over i ∈ Cr we obtain αr = i∈Cr π(i) = j∈Cr−1 π(j) i∈Cr pji = j∈Cr−1 π(j) = αr−1 . Suppose i ∈ Cr .

Now consider the contained between 0 and 1 there must be a exists and equals. for each i. schematically. Hence pi k i. k = 1. as n → ∞. the sequence n3 of the third row is a subsequence k k of n2 of the second row. Since they are (n2 ) subsequence n2 of n1 such that limk→∞ p1 k k k (nk ) (n) as n → ∞. . . e 53 . If j is a null recurrent state then pij → 0. We have. . (nk ) k → pi . 19. . p3 . . We also saw in Corollary (n) 7 that.) is a k k subsequence of each (ni .. say. . By construction. . 2. . exists for all i ∈ N. So the diagonal subsequence (nk . and so on. for each n = 1. 2. . . p(n) = (n) (n) (n) (n) (n) ∞ p1 . . p1 p1 (n1 ) 1 p1 (n1 ) 2 (n2 ) 2 p1 (n1 ) 3 (n2 ) 3 (n3 ) 3 (n2 ) 1 (n3 ) 1 ··· ··· ··· ··· → → → ··· p1 p2 p3 ··· p1 p1 ··· p1 p1 ··· (n3 ) 2 p1 ··· Now pick the numbers in the diagonal. such that limk→∞ p1 k 3 (n1 ) numbers p2 k . p2 . . if j is transient.19 Limiting theory for Markov chains* We wish to complete our discussion about the possible limits of the n-step transition prob(n) abilities pij . If the period is d then the limit exists only along a selected subsequence (Theorem 18).e. for all i ∈ S. pi . for each The second result is Fatou’s lemma. p1 . then pij → π(j) (Theorems 17). It is clear how we (ni ) k continue.). Since the numbers p1 n1 1 n1 2 (n) (n1 ) < < n1 < · · · . . . as k → ∞. The proof will be based on two results from Analysis. say. are contained between 0 and 1. there must be a subsequence exists and equals. for all i ∈ S. Then we can ﬁnd a i=1 pi sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi Proof. . i. k = 1. . where = 1 and pi ≥ 0 for all i. a probability distribution on N. the sequence n2 of the second row k is a subsequence of n1 of the ﬁrst row. say. 2. Recall that we know what happens when j is positive recurrent and i is in the same commu(n) nicating class as j If the period is 1. k = 1. k = 1. 2. the ﬁrst of which is known as HellyBray Lemma: Lemma 10. . . p2 .. such that limk→∞ pi k exists and equals. whose proof we omit7 : 7 See Br´maud (1999). then pij → 0 as n → ∞. 2. For each i we have found subsequence ni . . Consider.1 Limiting probabilities for null recurrent states Theorem 19.

αx ≤ lim inf k→∞ (n′ ) for all x ∈ S.e. i ∈ N).e. x ∈ S . we have 0< x∈S The inequality follows from Lemma 11. By Corollary 13. Hence we can ﬁnd a subsequence nk such that k→∞ lim pij k = αj > 0. . i. αy pyx . actually. We wish to show that these inequalities are. But Pn+1 = Pn P. y∈S x ∈ S. we have that π(x) := αx y∈S αy satisﬁes the balance equations and has x∈S π(x) = 1.e. (n′ ) Hence αx ≥ αy pyx . i. we can pick subsequence (n′ ) of (nk ) such that k k→∞ lim pix k = αx .Lemma 11. Let j be a null recurrent state. (n ) (n ) Consider the probability vector By Lemma 10. state j is positive recurrent: Contradiction. n = 1. and this is a contradiction (the number αx = αx cannot be strictly larger than itself). by Lemma 11. . Suppose that one of them is not. i. π is a stationary distribution. Now π(j) > 0 because αj > 0. k→∞ (n′ +1) = y∈S piy k pyx . equalities. y∈S Since x∈S αx ≤ 1. i ∈ N). pix k Therefore. and the last equality is obvious since we have a probability vector. x ∈ S. pix k . . by assumption. Thus. Suppose that for some i. If (yi . are sequences of nonnegative numbers then ∞ (n) yi lim inf xi i=1 n→∞ (n) ∞ ≤ lim inf n→∞ yi x i . pix k = 1. the sequence (n) pij does not converge to 0. 54 . 2. This would imply that αx > x∈S x∈S y∈S x∈S αy pyx = y∈S αy x∈S pyx = y∈S αy . (xi . it is a strict inequality: αx0 > y∈S αy pyx0 . x∈S (n′ ) Since aj > 0. for some x0 ∈ S. (n′ ) lim pix k (n′ +1) ≥ y∈S k→∞ lim piy k pyx . i=1 (n) Proof of Theorem 19.

Then (n) lim p n→∞ ij = ϕC (i)π C (j). The result when the period j is larger than 1 is a bit awkward to formulate. as in §10. β. HC ≤ n) n (n) = m=1 Pi (HC = m)Pi (Xn = j | HC = m). Indeed. Let C be the communicating class of j and let π C be the stationary distribution associated with C.2 The general case (n) It remains to see what the limit of pij is. starting with the ﬁrst hitting time decomposition relation of Lemma 7. Let j be a positive recurrent state with period 1. as n → ∞. π C (j) = 1/Ej Tj . we have Pi (Xn = j | HC = m) = Pµ (Xn−m = j). we have pij = Pi (Xn = j) = Pi (Xn = j. E j Tj where Tj = inf{n ≥ 1 : Xn = j}. when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. the result can be obtained by the so-called Abel’s Lemma of Analysis. Let µ be the distribution of XHC given that X0 = i and that {HC < ∞}. But Theorem 17 tells us that the limit exists regardless of the initial distribution. so we will only look that the aperiodic case. Let ϕC (i) := Pi (HC < ∞). In general. Remark: We can write the result of Theorem 20 as (n) lim p n→∞ ij = fij . This is a more old-fashioned way of getting the same result. Theorem 20. γ < 1): γ 1−γ 1 1 2 1−α α 3 β 4 1−β 5 1 55 . Pµ (Xn−m = j) → π C (j). Let i be an arbitrary state. Proof. which is what we need. that is. as n → ∞.2. From the Strong Markov Property. Let HC := inf{n ≥ 0 : Xn ∈ C}. and fij = Pi (Tj < ∞). and fij = ϕC (j). If i also belongs to C then ϕC (i) = 1 and so the result follows from Theorem 17. Hence (n) lim p n→∞ ij ∞ = π (j) m=1 C Pi (HC = m) = π C (j)Pi (HC < ∞). In this form. Example: Consider the following chain (0 < α.19.

Theorem 14 is a type of ergodic theorem. Recall that. ϕ(5) = 0. (n) (n) p33 → 0. 4} (inessential states). We now generalise this. for example. 1 − αβ 1 − αβ Both C1 . Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P (Ta < ∞) = 1. And so is the Law of Large Numbers itself (Theorem 13). (n) (n) p43 → 0. p44 → 0. The phase (or state) space of a particle is not just a physical space but it includes. whence 1−α β(1 − α) . p42 → ϕ(4)π C1 (2). among others. Hence. ϕ(4) = (1 − β)ϕ(5) + βϕ(3). p45 → (1 − ϕ(4))π C2 (5). Consider the quantity ν [a] (x) deﬁned as the average number of visits to state x between two successive visits to state a–see 16. ϕ(4) = . the 19-th century French mathematicians Henri Poincar´ and e Joseph Liouville. from ﬁrst-step analysis. 1+γ 1+γ Let ϕ(i) be the probability that class C1 will be reached starting from state i. The subject of Ergodic Theory has its origins in Physics. we assumed the existence of a positive recurrent state. Theorem 21. C1 = {1. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its “phase space” is proportional to the volume of this region. We have ϕ(1) = ϕ(2) = 1. C2 are aperiodic classes. π C2 (5) = 1. (n) (n) p34 → 0. We will avoid to give a precise deﬁnition of the term ‘ergodic’ but only mention what constitutes a result in the area. 20 Ergodic theorems An ergodic theorem is a result about time-averages of random processes. ϕ(3) = (1 − α)ϕ(2) + αϕ(4) . ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. π C1 (2) = . Pioneers in the are were. Ergodic Theory is nowadays an area that combines many ﬁelds of Mathematics and Science. information about its velocity. (n) (n) p35 → (1 − ϕ(3))π C2 (5). 56 (20) x∈S . 2} (closed class).The communicating classes are I = {3. as n → ∞. For example. The stationary distributions associated with the two closed classes are: 1 γ π C1 (1) = . while. C2 = {5} (absorbing state). p41 → ϕ(4)π C1 (1). Similarly. (n) (n) p32 → ϕ(3)π C1 (2). Let f be a reward function such that |f (x)|ν [a] (x) < ∞. in Theorem 14. ϕ(3) = p31 → ϕ(3)π C1 (1).

Proof. Now. We only need to check that EG1 = f . 21 Finite chains We take a brief look at some things that are speciﬁc to chains with ﬁnitely many states. If so. we obtain the result. Then at least one of the states must be visited inﬁnitely often. Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem Theorem 14 holds. As in the proof of Theorem 14. So if we divide by t the numerator and denominator of the fraction in (21). we break the total reward as follows: t Nt −1 f (Xn ) = n=0 r=1 Gr + Gﬁrst + Glast .5. while Gﬁrst . we have Nt /t → 1/Ea Ta . 57 . we have that the denominator equals Nt /t and converges to 1/Ea Ta . we have t→∞ lim 1 last 1 G = lim Gﬁrst = 0. From proposition 2 we know that the balance equations ν = νP have at least one solution. Note that the denominator t n=0 1(Xn = a) equals the number Nt of visits to state a up to time t. The Law of Large Numbers tells us that 1 lim n→∞ n n−1 Gr = EG1 . So there are always recurrent states. respectively. in the former. then. (21) f (x)ν [a] (x). r=0 with probability 1 as long as E|G1 | < ∞. t t where Gr = T (r) ≤t<T (r+1) f (Xt ). As in Theorem 14. and this is left as an exercise. since the number of states is ﬁnite. But this is precisely the condition (20). t t→∞ t t t with probability one. as we saw in the proof of Theorem 14. we have C := i∈S ν(i) < ∞. Remark 1: The diﬀerence between Theorem 14 and 21 is that. Replacing n by the subsequence Nt . and this justiﬁes the terminology of §18. so the numerator converges to f /Ea Ta . The reader is asked to revisit the proof of Theorem 14. Glast are the total rewards over the ﬁrst and t t a a last incomplete cycles. we assumed positive recurrence of the ground state a. which is the quantity denoted by f in Theorem 14.Then P where f := x∈S t→∞ lim t n=0 t n=0 f (Xn ) 1(Xn = a) =f =1 .

it will look the same. a ﬁnite Markov chain always has positive recurrent states. X0 ). where P is the transition matrix and Π is a matrix all the rows of which are equal to π. . Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. X0 ) for all n. as n → ∞. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. running backwards. . C satisﬁes the balance equations and the normalisation condition i∈S π(i) = 1. in a sense. X1 ) has the same distribution as (X1 . If. Lemma 12. the ﬁrst components of the two vectors must have the same distribution. Xn−1 . .e. Xn−1 . in addition. By Corollary 13. j ∈ S. X1 . P (X0 = i. Xn has the same distribution as X0 for all n. i ∈ S. . Consider an irreducible Markov chain with ﬁnitely many states. . But if we ﬁlm a man who walks and run the ﬁlm backwards. For example. . this means that it is stationary. if we ﬁlm a standing man who raises his arms up and down successively. i. or. . in addition. Therefore π is a stationary distribution. the chain is aperiodic. then all states are positive recurrent. reversible if it has the same distribution when the arrow of time is reversed. n ≥ 0). We say that it is time-reversible. for all n. Proof. Proof. . 22 Time reversibility Consider a Markov chain (Xn . simply. the stationary distribution is unique. the chain is irreducible. It is. . At least some state i must have π(i) > 0. i. We summarise the above discussion in: Theorem 22. (X0 . Suppose ﬁrst that the chain is reversible. In particular. and run the ﬁlm backwards. . In addition. X1 . . this state must be positive recurrent. Taking n = 1 in the deﬁnition of reversibility. actually. that is. π(i) := Hence. X0 = j). 58 . Because the process is Markov. because positive recurrence is a class property (Theorem 15). Xn ) has the same distribution as (Xn . . A reversible Markov chain is stationary. . . Theorem 23. X0 ). If. without being able to tell that it is. Then the chain is reversible if and only if the detailed balance equations hold: π(i)pij = π(j)pji . . X1 = j) = P (X1 = i. Then the chain is positive recurrent with a unique stationary distribution π.Hence 1 ν(i). Xn ) has the same distribution as (Xn . we see that (X0 . we have Pn → Π. . like playing a ﬁlm backwards. by Theorem 16. Reversibility means that (X0 . More precisely. then we instantly know that the ﬁlm is played in reverse.

. X2 = i). X1 = j. we have separation of variables: pij = f (i)g(j). . we have π(i) > 0 for all i. But P (X0 = i. for all n. X1 .for all i and j. The DBE imply immediately that (X0 . In other words. Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. . X1 = j. P (X0 = j. . . (X0 . π(i)pij pjk pki = [π(j)pji ]pjk pki = [π(j)pjk ]pji pki = [π(k)pkj ]pji pki = [π(k)pki ]pji pkj = [π(i)pik ]pji pkj = π(i)pik pkj pji Since the chain is irreducible. X0 ). Xn ) has the same distribution as (Xn . j. The same logic works for any m and can be worked out by induction. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. pji 59 (m−3) → π(i1 ). Summing up over all choices of states i3 . . for all i. So the detailed balance equations (DBE) hold. 2) The DBE are equivalent to saying that the ratio pij /pji can be written as a product of a function of i and a function of j. using DBE. Now pick a state k and write. Hence cancelling the π(i) from the above display we obtain the KLC for m = 3. if there is an arrow from i to j in the graph of the chain but no arrow from j to i. i4 . im ∈ S and all m ≥ 3. These are the DBE. X1 ) has the same distribution as (X1 . We will show the Kolmogorov’s loop criterion (KLC). and hence (X0 . By induction. Conversely. suppose that the KLC (22) holds. Hence the DBE hold. j. π(i)pij pjk = [π(j)pji ]pjk = [π(j)pjk ]pji = π(k)pkj pji . X1 . it is easy to show that. X1 = i) = π(j)pji . we have pi2 i1 implying that π(i1 )pi1 i2 = π(i2 )pi2 i1 . . i2 . . So the chain is reversible. Suppose ﬁrst that the chain is reversible. Pick 3 states i. X0 ). letting m → ∞. X1 . . X2 ) has the same distribution as (X2 . . and pi1 i2 (m−3) → π(i2 ). . then. then the chain cannot be reversible. . X0 ). . and the chain is reversible. Xn−1 . Theorem 24. using the DBE. and write. (22) Proof. . X2 = k) = P (X0 = k. . Conversely. k. X1 = j) = π(i)pij . suppose the DBE hold. . and this means that P (X0 = i. . k. pi1 i2 pi2 i3 · · · pim−1 im = pi1 im pim im−1 · · · pi2 i1 . im−1 we obtain (m−3) (m−3) pi1 i2 pi2 i1 = pi1 i2 pi2 i1 If the chain is aperiodic. Then the chain is reversible if and only if the Kolmogorov’s loop criterion holds: for all i1 .

the detailed balance equations. Ac ) = F (Ac . then the chain is reversible. • If the chain has inﬁnitely many states. A) (see §4. The reason is as follows. meaning that it has no loops. there would be a loop. 60 . then we need. i. and the arrow from j to j is the only one from Ac to A. that the chain is positive recurrent. Example 1: Chain with a tree-structure.e. Hence the original chain is reversible. in addition. Form the graph by deleting all self-loops. and.1) which gives π(i)pij = π(j)pji . So g satisﬁes the balance equations. so we have pij g(j) = . If the graph thus obtained is a tree. Pick two distinct states i. f (i)g(i) must be 1 (why?). and by replacing. The balance equations are written in the form F (A. namely the arrow from i to j. j for which pij > 0.(Convention: we form this ratio only when pij > 0.) Necessarily. For example. the two oriented edges from i to j and from j to i by a single edge without arrow. Ac be the sets in the two connected components. for each pair of distinct states i. (If it didn’t. This gives the following method for showing that a ﬁnite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. there isn’t any. j for which pij > 0. by assumption. If we remove the link between them. There is obviously only one arrow from AtoAc . the graph becomes disconnected. And hence π can be found very easily. Hence the chain is reversible. using another method. pji g(i) Multiplying by g(i) and summing over j we ﬁnd g(i) = j∈S g(j)pji . to guarantee. consider the chain: Replace it by: This is a tree. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0.) Let A.

i. If i. A RWG is a reversible chain. so both members are equal to C. i∈S where |E| is the total number of edges. The stationary distribution of a RWG is very easy to ﬁnd. the DBE hold. etc. p02 = 1/4.e. For each state i deﬁne its degree δ(i) = number of neighbours of i. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example. There are two cases: If i. 61 .Example 2: Random walk on a graph. 2. a ﬁnite set. States j are i are called neighbours if they are connected by an edge. We claim that π(i) = Cδ(i). Each edge connects a pair of distinct vertices (states) without orientation. To see this let us verify that the detailed balance equations hold. In all cases. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. we have that C = 1/2|E|. if j is a neighbour of i otherwise. The stationary distribution assigns to a state probability which is proportional to its degree. i∈S Since δ(i) = 2|E|. π(i)pij = π(j)pji . j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. Consider an undirected graph G with vertices S. j are not neighbours then both sides are 0. Hence we conclude that: 1. π(i) = 1. 0. and a set E of undirected edges. Now deﬁne the the following transition probabilities: pij = 1/δ(i). where C is a constant. Suppose the graph is connected. The constant C is found by normalisation.

If the subsequence forms an arithmetic progression.e. .3 Subordination Let (Xn . a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . The sequence (Yk . . 2. . In this case. k = 0. .e.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). t ≥ 0. random variables with values in N and distribution λ(n) = P (ξ0 = n). . St+1 = St + ξt . 62 n ∈ N. . i. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). k = 0. 2. if nk = n0 + km. k = 0. this process is also a Markov chain and it does have time-homogeneous transition probabilities. The state space of (Yr ) is A.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . . where ξt are i. Let A be (r) a set of states and let TA be the r-th visit to A. . . . then (Yk . 1. 2. A r = 1. where pij are the m-step transition probabilities of the original chain. . π is a stationary distribution for the original chain. . k = 0. . if. irreducible and positive recurrent).1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. 1. Deﬁne Yr := XT (r) . (m) (m) 23. 2.d. in addition. 1. i. t ≥ 0) be independent from (Xn ) with S0 = 0. .i. 23. in general. . then it is a stationary distribution for the new chain. how can we create new ones? 23. Let (St . n ≥ 0) be Markov with transition probabilities pij .) has the Markov property but it is not time-homogeneous. . 2. 1.e. .2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. Due to the Strong Markov Property.

β. ∞ λ(n)pij . a Markov chain. . then the process Yn = f (Xn ). β). . in general. . Yn−1 ) = P (Yn+1 = β | Yn = α). j. . We need to show that P (Yn+1 = β | Yn = α. we have: Lemma 13. is NOT. Indeed. that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i).. however. . . .. . Fix α0 . . Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). . . knowledge of f (Xn ) implies knowledge of Xn and so. α1 . If. .) More generally. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 .Then Yt := XSt .e. (Notation: We will denote the elements of S by i. Yn−1 = αn−1 }. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. Proof. . conditional on Yn the past is independent of the future. i. n≥0 63 .4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . f is a one-to-one function then (Yn ) is a Markov chain. Then (Yn ) has the Markov property. (n) 23. and the elements of S ′ by α. .

β). indeed. Then (Yn ) is a Markov chain. Yn−1 ) Q(f (i). β) P (Yn = α. β ∈ Z+ . if i = 0. i. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α.for all choices of the variables involved. and if i = 0. regardless of whether i > 0 or i < 0. for all i ∈ Z and all β ∈ Z+ . β). Yn = α. We have that P (Yn+1 = β|Xn = i) = Q(|i|. β) := 1. if we let 1/2. β) i∈S P (Yn = α|Xn = i. β) P (Yn = α|Xn = i. β) P (Yn = α|Yn−1 ) i∈S = Q(α. Yn−1 )P (Xn = i | Yn = α. On the other hand. P (Xn+1 = i ± 1|Xn = i) = 1/2. β). Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. In other words. β). α = 0. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). Yn−1 ) Q(f (i). then |Xn+1 | = 1 for sure. Thus (Yn ) is Markov with transition probabilities Q(α. Example: Consider the symmetric drunkard’s random walk. and so (Yn ) is. Deﬁne Yn := |Xn |.e. β)P (Xn = i | Yn = α. i ∈ Z. observe that the function | · | is not one-to-one. a Markov chain with transition probabilities Q(α. if β = 1. because the probability of going up is the same as the probability of going down. depends on i only through |i|. if β = α ± 1. Indeed. conditional on {Xn = i}. 64 . i ∈ Z. To see this. α > 0 Q(α. But P (Yn+1 = β | Yn = α. But we will show that P (Yn+1 = β|Xn = i).

a hard problem. and. 0 But 0 is an absorbing state. independently from one another. If we let ξ (n) = ξ1 . since ξ (0) . It is a Markov chain with time-homogeneous transitions. . Example: Suppose that p1 = p.24 24. The parent dies immediately upon giving birth. with probability 1. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. . are i. Hence all states i ≥ 1 are inessential. which has a binomial distribution: i j pij = p (1 − p)i−j . j In this case. ξ (1) . 1. the population cannot grow: it will eventually reach 0. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is.d. Back to the general case. in general. 2. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of oﬀspring. . ξ (n) ).i. Thus. if the current population is i there is probability pi that everybody gives birth to zero oﬀspring. . The state 0 is irrelevant because it cannot be reached from anywhere. ξ2 . . . Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. But here is an interesting special case. so we will ﬁnd some diﬀerent method to proceed. and following the same distribution. p0 = 1 − p. So assume that p1 = 1.. . we have that (Xn . therefore transient.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. one question of interest is whether the process will become extinct or not. Therefore. Let ξk be the number of oﬀspring of the k-th individual in the n-th generation. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . . . We let pk be the probability that an individual will bear k oﬀspring. n ≥ 0) has the Markov property. (and independent of the initial population size X0 –a further assumption). then we see that the above equation is of the form Xn+1 = f (Xn . Hence all states i ≥ 1 are transient. 0 ≤ j ≤ i. and 0 is always an absorbing state. Then P (ξ = 0) + P (ξ ≥ 2) > 0. k = 0. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. We ﬁnd the population (n) at time n + 1 by adding the number of oﬀspring of each individual. In other words. if the 65 (n) (n) . if 0 is never reached then Xn → ∞. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. each individual gives birth to at one child with probability p or has children with probability 1 − p.

if we start with X0 = i in ital ancestors. . 66 . when we start with X0 = i. ϕn (0) = P (Xn = 0). Obviously. For each k = 1. Proof. ε(0) = 1. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. For i ≥ 1. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn .population does not become extinct then it must grow to inﬁnity. We then have that D = D1 ∩ · · · ∩ Di . Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. If µ ≤ 1 then the ε = 1. We wish to compute the extinction probability ε(i) := Pi (D). If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. and. . and P (Dk ) = ε. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. where ε = P1 (D). we can argue that ε(i) = εi . Let ε be the extinction probability of the process. each one behaves independently of one another. Indeed. This function is of interest because it gives information about the extinction probability ε. i. Indeed. so P (D|X0 = i) = εi . . Let ∞ µ = Eξ = k=1 kpk be the mean number of oﬀspring of an individual. . since Xn = 0 implies Xm = 0 for all m ≥ n. Therefore the problem reduces to the study of the branching process starting with X0 = 1. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. The result is this: Proposition 5.

(See right ﬁgure below. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. and. On the other hand. since ϕ is a continuous function. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). recall that ϕ is a convex function with ϕ(1) = 1. ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)).Note also that ϕ0 (z) = 1. and since the latter has only two solutions (z = 0 and z = ε∗ ). Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . z=1 Also. and so on. if the slope µ at z = 1 is ≤ 1. By induction. all we need to show is that ε < 1. Therefore ε = ϕ(ε). then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. Therefore. So ε ≤ ε∗ < 1. ϕn+1 (z) = ϕ(ϕn (z)). if µ > 1. Therefore ε = ε∗ . Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). ϕn (0) ≤ ε∗ . ϕ(ϕn (0)) → ϕ(ε). ϕn+1 (0) = ϕ(ϕn (0)). ε = ε∗ . Also ϕn (0) → ε. But ϕn+1 (0) → ε as n → ∞. Notice now that µ= d ϕ(z) dz .) 67 . In particular. in fact. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). Since both ε and ε∗ satisfy z = ϕ(z). But ϕn (0) → ε. Let us show that. This means that ε = 1 (see left ﬁgure below). But 0 ≤ ε∗ .

2. say. If s = 1 there is a successful transmission and. . Let s be the number of selected packets. . for a packet to go through. are allocated to hosts 1. The problem of communication over a common channel. . 1.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel).d. only one of the hosts should be transmitting at a given time. One obvious solution is to allocate time slots to hosts in a speciﬁc order. If collision happens. there are x + a packets. there is a collision and. random variables with some common distribution: P (An = k) = λk . assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). in a simpliﬁed model. . 6. So. using the same frequency. if Sn = 1. on the next times slot. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. 1. 3 hosts. we let hosts transmit whenever they like. . 5. During this time slot assume there are a new packets that need to be transmitted. 1.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. 2. then max(Xn + An − 1. Case: µ > 1.i. 3. then the transmission is not successful and the transmissions must be attempted anew. 68 . and Sn the number of selected packets. Ethernet has replaced Aloha. there are x + a − 1 packets. k ≥ 0. . Thus. periodically. otherwise. 2. Abandoning this idea. there is no time to make a prior query whether a host has something to transmit or not. Then ε = 1. 24. 4. If s ≥ 2. 3. Xn+1 = Xn + An . An the number of new packets on the same time slot. is that if more than one hosts attempt to transmit a packet. then time slots 0. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. We assume that the An are i. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. Nowadays. 0). So if there are. on the next times slot. Since things happen fast in a computer network. there is collision and the transmission is unsuccessful.. Then ε < 1. 3.

We here only restrict ourselves to the computation of this expected change. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). The reason we take the maximum with 0 in the above equation is because.x+k for k ≥ 1.x are computed by subtracting the sum of the above from 1. and p. so q x G(x) tends to 0 as x → ∞. we have that the drift is positive for all large x and this proves transience. We also let µ := k≥1 kλk be the expectation of An . if Xn + An = 0. To compute px. where G(x) is a function of the form α + βx .e. Proving this is beyond the scope of the lectures.x−1 + k≥1 kpx.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . An−1 . Let us ﬁnd its transition probabilities. Idea of proof. we have q < 1. k 0 ≤ k ≤ x + a. So P (Sn = k|Xn = x. The probabilities px. selections are made amongst the Xn + An packets. k ≥ 1. For x > 0. Sn = 1|Xn = x) = λ0 xpq x−1 . The process (Xn . Xn−1 . Therefore ∆(x) ≥ µ/2 for all large x. then the chain is transient. we think as follows: to have a reduction of x by 1. . So we can make q x G(x) as small as we like by choosing x suﬃciently large. and so q x tends to 0 much faster than G(x) tends to ∞. The Aloha protocol is unstable. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. To compute px.e. we have ∆(x) = (−1)px. for example we can make it ≤ µ/2.) = x + a k x−k p q . where q = 1 − p. and exactly one selection: px. An = a.x−1 when x > 0. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). we must have no new packets arriving. The random variable Sn has. the Markov chain (Xn ) is transient. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . i. Since p > 0. . Indeed. .x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . n ≥ 0) is a Markov chain. 69 . we should not subtract 1. The main result is: Proposition 6.x−1 = P (An = 0. So: px. deﬁned by‘ ∆(x) := E[Xn+1 − Xn |Xn = x].where λk ≥ 0 and k≥0 kλk = 1. Unless µ = 0 (which is a trivial case: no arrivals!).

the next location Xn+1 is picked uniformly at random amongst all pages that i links to. Therefore. Comments: 1. α ≈ 0. the rank of a page may be very important for a company’s proﬁts. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. there are companies that sell high-rank pages. PageRank. Its implementation though is one of the largest matrix computations in the world. unless i is a dangling page. otherwise. For each page i let L(i) be the set of pages it links to. There is a technique. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. if j ∈ L(i) pij = 1/|V |. The idea is simple. neither that it is aperiodic. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. In an Internet-dominated society. known as ‘302 Google jacking’. that allows someone to manipulate the PageRank of a web page and make it much higher. All you need to do to get high rank is pay (a lot of money).2) and let α pij := (1 − α)pij + . 2. So we make the following modiﬁcation. |V | These are new transition probabilities. because of the size of the problem. It uses advanced techniques from linear algebra to speed up computations. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. Then it says: page i has higher rank than j if π(i) > π(j). then i is a ‘dangling page’. Deﬁne transition probabilities as follows: 1/|L(i)|. in the latter case. if L(i) = ∅ 0. We pick α between 0 and 1 (typically. the chain moves to a random page.24. So if Xn = i is the current location of the chain. 3.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. It is not clear that the chain is irreducible. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. A new use of PageRank is to rank academic doctoral programs based on their records of ﬁnding jobs for their graduates. Academic departments link to each other by hiring their 70 . So this is how Google assigns ranks to pages: It uses an algorithm. If L(i) = ∅.

We will assume that a child chooses an allele from each parent at random. And so on. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. their oﬀspring inherits an allele from each parent. colour of eyes) appears in two forms. it makes the evaluation procedure easy. Thus. If so. in each generation. In this ﬁgure there are 4 alleles of type A (circles) and 6 of type a (squares). So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). This number could be independent of the research quality but. 24. We would like to study the genetic diversity. We have a transition from x to y if y alleles of type A were produced. Since we don’t care to keep track of the individuals’ identities. Three of the square alleles are never selected. from the point of view of an administrator. To simplify. a gene controlling a speciﬁc characteristic (e. 71 . But an AA mating with a Aa may produce a child of type AA or one of type Aa. meaning how many characteristics we have. The process repeats. When two individuals mate. a dominant (say A) and a recessive (say a). Do the same thing 2N times. One of the circle alleles is selected twice. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. Imagine that we have a population of N individuals who mate at random producing a number of oﬀspring.g. generation after generation. What is important is that the evaluation can be done without ever looking at the content of one’s research. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their oﬀspring did not inherit this allele.4 The Wright-Fisher model: an example from biology In an organism. 5. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. The alleles in an individual come in pairs and express the individual’s genotype for that gene. We are now ready to deﬁne the Markov chain Xn representing the number of alleles of type A in generation n. all we need is the information outlined in this ﬁgure. One of the square alleles is selected 4 times. we will assume that. two AA individuals will produce and AA child. then. This means that the external appearance of a characteristic is a function of the pair of alleles. We will also assume that the parents are replaced by the children. this allele is of type A with probability x/2N . 4. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology.faculty from each other (and from themselves). maintain this allele for the next generation. the genetic diversity depends only on the total number of alleles of each type. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random.

since. so both 0 and 2N are absorbing states. . Clearly. ϕ(x) = y=0 pxy ϕ(y). we see that 2N x 2N −y x y pxy = . ξM are i. . the expectation doesn’t change and. These are: M ϕ(0) = 0. Since pxy > 0 for all 1 ≤ x. . but the argument needs further justiﬁcation because “the end of time” is not a deterministic time. . .e. the states {1. Since selections are independent. conditionally on (X0 . We show that our guess is correct by proceeding via the mundane way of verifying that it satisﬁes the ﬁrst-step analysis equations. for all n ≥ 0. X0 ) = E(Xn+1 |Xn ) = M This implies that. ϕ(x) = x/M. Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. E(Xn+1 |Xn . the chain gets absorbed by either 0 or 2N . . . and the population becomes one consisting of alleles of type A only. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . M Therefore. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M .2N = 1.) One can argue that. . . . Similarly. .i. The result is correct. p00 = p2N. . (Here. the random variables ξ1 . the number of alleles of type A won’t change. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected.d. and the population becomes one consisting of alleles of type a only. n n where. . . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. y ≤ 2N − 1. M and thus. 0 ≤ y ≤ 2N. . . i. Tx := inf{n ≥ 1 : Xn = x}. on the average. M n P (ξk = 0|Xn .Each allele of type A is selected with probability x/2N .e. i. since. 2N − 1} form a communicating class of inessential states. 72 . with n P (ξk = 1|Xn . on the average. . The fact that M is even is irrelevant for the mathematical model. . . Xn ). ϕ(M ) = 1. . Clearly. . at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). X0 ) = Xn . X0 ) = 1 − Xn . as usual. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. EXn+1 = EXn Xn = Xn . . . . Let M = 2N .

. A number An of new components are ordered and added to the stock. Xn electronic components at time n. Dn ). We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. . Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . Here are our assumptions: The pairs of random variables ((An . The correctness of the latter can formally be proved by showing that it does satisfy the recursion. while a number of them are sold to the market.The ﬁrst two are obvious.5 A storage or queueing application A storage facility stores. Otherwise. the demand is met instantaneously. . Let ξn := An − Dn . for all n ≥ 0. and so. at time n + 1. If the demand is for Dn components. there are Xn + An − Dn components in the stock. Thus. the demand is. so that Xn+1 = (Xn + ξn ) ∨ 0. provided that enough components are available. We will show that this Markov chain is positive recurrent. 2.d. say. and E(An − Dn ) = −µ < 0. we ﬁnd X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. on the average. and the stock reached level zero at time n + 1. Working for n = 1. n ≥ 0) are i. M x M (M −1)−(y−1) x x +1− M M M −1 = 24. Since the Markov chain is represented by this very simple recursion. M −1 y−1 x M y−1 1− x . We have that (Xn . 73 . larger than the supply per unit of time. We will apply the so-called Loynes’ method. b).i. the demand is only partially satisﬁed. n ≥ 0) is a Markov chain. we will try to solve the recursion.

. . . . ξn−1 . . Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). ξn−1 ) = (ξn−1 . . the random variables (ξn ) are i.d. . . we may replace the latter random variables by ξ1 . . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. 74 . the event whose probability we want to compute is a function of (ξ0 . Notice. Hence. in computing the distribution of Mn which depends on ξ0 . however. . ξn . . . . where the latter equality follows from the fact that. = x≥0 Since the n-dependent terms are bounded. Indeed. Therefore. . we reverse it. . ξ0 ). P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . . for all n. so we can shuﬄe them in any way we like without altering their joint distribution. . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y).Thus.i. using π(y) = limn→∞ P (Mn = y). . instead of considering them in their original order. . (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). for y ≥ 0. ξn−1 ). we ﬁnd π(y) = x≥0 π(x)pxy . that d (ξ0 . we may take the limit of both sides as n → ∞. and. . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisﬁes the balance equations. In particular. (n) .

We must make sure that it also satisﬁes y π(y) = 1. . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. 75 . . Z2 . Depending on the actual distribution of ξn . as required.} < ∞) = 1. the Markov chain may be transient or null recurrent. .So π satisﬁes the balance equations. . n→∞ This implies that P ( lim Zn = −∞) = 1. The case Eξn = 0 is delicate. we will use the assumption that Eξn = −µ < 0. . It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. Here. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . n→∞ Hence g(y) = P (0 ∨ max{Z1 . Z2 . This will be the case if g(y) is not identically equal to 1.} > y) is not identically equal to 1. .

If d = 1. groups) are allowed. . if x = ej . . we won’t go beyond Zd . the skip-free property. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. 0. we can let S = Zd . .g. ±ed } otherwise 76 .y = p(y − x). d p(x) = qj . For example. otherwise where p + q = 1. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. x ∈ Zd . This means that the transition probability pxy should depend on x and y only through their relative positions in space. This is the drunkard’s walk we’ve seen before. if x ∈ {±e1 . so far. this walk enjoys.y = px+z. if x = −ej . namely. then p(1) = p. In dimension d = 1. where C is such that x p(x) = 1. To be able to say this. we need to give more structure to the state space S. d 0.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. given a function p(x). Example 3: Deﬁne p(x) = 1 2d . Our Markov chains. A random walk of this type is called simple random walk or nearest neighbour random walk. where p1 + · · · + pd + q1 + · · · + qd = 1. Example 1: Let p(x) = C2−|x1 |−···−|xd | . .and space-homogeneous transition probabilities given by px. Deﬁne pj .y+z for any translation z. j = 1. besides time. . p(−1) = q. have been processes with timehomogeneous transition probabilities. . p(x) = 0. . the set of d-dimensional vectors with integer coordinates. .and space-homogeneity. that of spatial homogeneity. j = 1. . a structure that enables us to say that px. More general spaces (e. So. but. here. otherwise. . A random walk possesses an additional property. a random walk is a Markov chain with time. . .

computing the n-th order convolution for all n is a notoriously hard problem. random variables in Zd with common distribution P (ξn = x) = p(x). It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . Obviously. Indeed. x ± ed . . p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x).) In this notation. One could stop here and say that we have a formula for the n-step transition probability. are i. the initial state S0 is independent of the increments. where ξ1 . ξ2 . in general. . The convolution of two probabilities p(x). notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. . But this would be nonsense. we will mean a sum over all possible values. x + ξ2 = y) = y p(x)p(y − x). We shall resist the temptation to compute and look at other 77 . Furthermore. If d = 1. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x).) Now the operation in the last term is known as convolution. in addition. because all we have done is that we dressed the original problem in more fancy clothes. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . We refer to ξn as the increments of the random walk.i. then p(1) = p(−1) = 1/2. (When we write a sum without indicating explicitly the range of the summation variable. . We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). x ∈ Zd . p(x) = 0. y p(x)p′ (y − x). Furthermore. and this is the symmetric drunkard’s walk. p ∗ δz = p. We will let p∗n be the convolution of p with itself n times. x ∈ Zd is deﬁned as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . . p′ (x). which.d. (Recall that δz is a probability distribution that assigns probability 1 to the point z. So. . xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. . we can reduce several questions to questions about this random walk. We call it simple symmetric random walk. The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by Sn the state of the walk at time n.This is a simple random walk. otherwise.

i+1 = pi.0) We can thus think of the elements of the sample space as paths. all possible values of the vector are equally likely. Therefore. + (1.methods. 2) → (5.1) + (0. Look at the distribution of (ξ1 . ξn ). Clearly. . . First. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn .i. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. we have P (ξ1 = ε1 . If yes. In the sequel. we should assign. but its practice may be hard. A ⊂ {−1.1) − (3. we shall learn some counting methods. .2) where the ξn are i. 1) → (2. +1. and so #A P (A) = n . Here are a couple of meaningful questions: A.i−1 = 1/2 for all i ∈ Z. a path of length n. 2 where #A is the number of elements of A. random variables (random signs) with P (ξn = ±1) = 1/2. 1) → (4. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that diﬀerent questions we need to ask or (b) that approximate methods are more informative than exact ones. As a warm-up. . If not. for ﬁxed n. So. −1} then we have a path in the plane that joins the points (0. So if the initial position is S0 = 0. +1}n . Events A that depend only on these the ﬁrst n random signs are subsets of this sample space. ξn = εn ) = 2−n . and the sequence of signs is {+1. 1}n . . we are dealing with a uniform distribution on the sample space {−1. 0) → (1. +1}n . . −1. . a random vector with values in {−1. +1. 1): (2. to each initial position and to each sequence of n signs. Since there are 2n elements in {0.1) + − (5. .d. Will the random walk ever return to where it started from? B. The principle is trivial. +1}n . how long will that take? C. if we can count the number of elements of A we can compute its probability.2) (4. After all. consider the following: 78 . 2) → (3.

The paths comprising this event are shown below.i) (0. 0) (n. We use the notation {(m.0) The lightly shaded region contains all paths of length n that start at (0. y) is given by: #{(0.1 Paths that start and end at speciﬁc points Now consider a path that starts from a speciﬁc point and ends up at a speciﬁc point. Hence (n+i)/2 = 0. x) and end up at (n. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. the number of paths that start from (m. S7 < 0}. (n − m + y − x)/2 (23) Remark. Let us ﬁrst show that Lemma 14. y)}. We can easily count that there are 6 paths in A.Example: Assume that S0 = 0. x) and end up at (n. y)} =: {all paths that start at (m. i). (There are 2n of them. i)} = n n+i 2 More generally. 0)). in this case. 0) and end up at the speciﬁc point (n. 0) and ends at (n. S6 < 0. In if n + i is not an even number then there is no path that starts from (0. Proof. The number of paths that start from (0. and compute the probability of the event A = {S2 ≥ 0.) The darker region shows all paths that start at (0. i). i). Consider a path that starts from (0. x) (n. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. x) (n. 0) and end up at (n. y) is given by: #{(m. 0) and n ends at (n. S4 = −1. n 79 . y)} = n−m .047. (n.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the ﬁgure), there is no path that goes from (0, 0) to (3, 2). We shall thus deﬁne n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

**Paths that start and end at speciﬁc points and do not touch zero at all
**

(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =

1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the speciﬁc point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the ﬁrst time it hits zero and then reﬂecting it around the horizontal axis. In the ﬁgure below, follow the solid path starting at (m, x) up to the ﬁrst time it hits zero and then follow the dotted path: this is the reﬂected path.

(n, y)

(m, x)

(n, −y)

A path and its reﬂection around a horizontal line. Notice that the reﬂection starts only at the point where the path touches the horizontal line.

Note that if we know the reﬂected path we can recover the original path. Hence we may as well count the number of reﬂected paths. But each reﬂected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =

1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)

1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

**Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
**

1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n

n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n

n 2

= #{(0, 1) n

n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1

n−y 2

(n, y + 1); remain > 0}

− −

n

n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

−

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the ﬁrst one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

**Distribution after n steps
**

n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

1) (n. The left side of (30) equals #{paths (1.P (Sn = y|Sm = x) = In particular. . . . n 2 #{paths (m. Sn > 0 | Sn = y) = y . . n 2−n . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . y)} . y)} . P0 (Sn = y) 2 (n + y) + 1 . . 0) (n. n−m 2−n−m . for now. . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). If n. Such a simple answer begs for a physical/intuitive explanation. n This is called the ballot theorem and the reason will be given in Section 29. . we shall just compute it. The probability that. the random walk never becomes zero up to time n is given by P0 (S1 > 0. . . n/2 (29) #{(0. 27. y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . 27. given that S0 = 0 and Sn = y > 0. . if n is even. P0 (S1 ≥ 0 . (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof.3 Some conditional probabilities Lemma 17. 0) (n. y) that do not touch zero} × 2−n #{paths (0. Sn ≥ 0 | Sn = 0) = 84 1 . S2 > 0. x) 2n−m (n. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . Sn ≥ 0 | Sn = y) = 1 − In particular.2 The ballot theorem Theorem 25. n (30) Proof. But. if n is even.

. 0) (n. ξ2 + ξ3 ≥ 0. . . . . . If n is even. . 85 . and Sn−1 > 0 implies that Sn ≥ 0. . . (b) follows from symmetry. For even n we have P0 (S1 ≥ 0. . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . Sn ≥ 0. . . (f ) follows from the replacement of (ξ2 . (e) follows from the fact that ξ1 is independent of ξ2 . . . . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . . . +1 27. . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . . . Sn ≥ 0). Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. Sn ≥ 0) = #{(0. . ξ2 . . Sn < 0) (b) (a) = 2P0 (S1 > 0. . . .). Sn ≥ 0) = P0 (S1 = 0. . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . and ﬁnally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . . We next have P0 (S1 = 0. . . . . . . . 1 + ξ2 + ξ3 > 0. . . Proof. . . ξ3 . Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . Sn = 0) = P0 (S1 > 0. . We use (27): P0 (Sn ≥ 0 . . .4 Some remarkable identities Theorem 26. and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. . P0 (S1 ≥ 0. . . . . •). ξ3 . . . . .) by (ξ1 . . (c) (d) (e) (f ) (g) where (a) is obvious. . Sn = 0) = P0 (Sn = 0). . . . Sn > 0) + P0 (S1 < 0. (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. . n/2 where we used (28) and (29). .Proof.

Sn = 0. we either have Sn−1 = 1 or −1. So u(n) f (n) = . we can omit the last event. . So 1 f (n) = 2u(n) P0 (S1 > 0. Sn = 0) Now. . . Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). . So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. the two terms must be equal. for even n. . . We now want to ﬁnd the probability f (n) that the walk will return to 0 for the ﬁrst time in n steps. This is not so interesting. Then f (n) := P0 (S1 = 0. with equal probability. Sn = 0) P0 (S1 > 0. Sn−1 > 0. Fix a state a. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. So the last term is 1/2. . Sn = 0) = u(n) . Theorem 27. Sn = 0) + P0 (S1 < 0. Sn = 0 means Sn−1 = 1. If a particle is to the right of a then you see its image on the left. Sn = 0). and if the particle is to the left of a then its mirror image is on the right. Sn−2 > 0|Sn−1 = 1). . and u(n) = P0 (Sn = 0). . Sn = 0) = 2P0 (Sn−1 > 0. Sn−1 < 0. . 86 . . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). n−1 28 The reﬂection principle for a simple random walk in dimension 1 The reflection principle says the following. f (n) = P0 (S1 > 0. . . . So f (n) = 2P0 (S1 > 0. . Sn−2 > 0|Sn−1 > 0. . Sn−1 > 0. . . the probability u(n) that the walk (started at 0) attains value 0 in n steps. P0 (Sn−1 > 0. . . given that Sn = 0. Put a two-sided mirror at a. we see that Sn−1 > 0. we computed. . Sn−1 = 0. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . By symmetry. Since the random walk cannot change sign before becoming zero. . Also. .27. n−1 Proof. By the Markov property at time n − 1. . To take care of the last term in the last expression for f (n). Let n be even.5 First return to zero In (29).

Let (Sn . . n ∈ Z+ ). 2 Deﬁne now the running maximum Mn = max(S0 .What is more interesting is this: Run the particle till it hits a (it will. Sn ≤ a) + P0 (Sn > a). Sn = Sn . P0 (Sn ≥ a) = P0 (Ta ≤ n. called (Sn . 2 Proof. m ≥ 0) is a simple symmetric random walk. we have n ≥ Ta . n < Ta . Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). Sn ≥ a). So Sn − a can be replaced by a − Sn in the last display and the resulting process. a + (Sn − a). If Sn ≥ a then Ta ≤ n. Then Ta = inf{n ≥ 0 : Sn = a} Sn . Finally. independent of the past before Ta . n ≥ 0) be a simple symmetric random walk and a > 0. S1 . . 2a − Sn . The last equality follows from the fact that Sn > a implies Ta ≤ n. the process (STa +m − a. we obtain 87 . Proof. as a corollary to the above result. So: P0 (Sn ≥ a) = P0 (Ta ≤ n. n ≥ Ta By the strong Markov property. for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . Then. 2a − Sn ≥ a) = P0 (Ta ≤ n. 28. In the last event. But P0 (Ta ≤ n) = P0 (Ta ≤ n. n ≥ Ta . . n ∈ Z+ ) has the same law as (Sn . Sn ≤ a) + P0 (Ta ≤ n. Thus. so we may replace Sn by 2a − Sn (Theorem 28). In other words. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). Theorem 28. Sn ≤ a).1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. But a symmetric random walk has the same law as its negative. Trivially. Sn > a) = P0 (Ta ≤ n. deﬁne Sn = where is the ﬁrst hitting time of a. . n < Ta . Sn ).

If an urn contains a azure items and b = n−a black items. usually a vase furnished with a foot or pedestal. If Tx ≤ n. then. J. 105. of which a are coloured azure and b black. gives the result. which results into P0 (Tx ≤ n. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. . 9 Bertrand. the number of selected azure items is always ahead of the black ones equals (a − b)/n. An urn is a vessel. D. then the probability that. Acad.Corollary 19. It is the latter use we are interested in in Probability and Statistics. Solution d’un probl`me. Rend. P0 (Mn < x. Sn = 2x − y). Acad. This. Compt. by applying the reﬂection principle (Theorem 28). Sn = y) = P0 (Tx ≤ n. The set of values of η contains n elements and η is uniformly distributed a in this set. . Paris. (1887). 2 Proof. The original ballot problem. Compt. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. Sn = y). Since Mn < x is equivalent to Tx > n. Sn = y) = P0 (Sn = 2x − y). posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. combined with (31). we have P0 (Mn < x. Sci. where ηi indicates the colour of the i-th item picked. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. Sn = y) = P0 (Tx > n. Proof. as long as a ≥ b. A sample from the urn is a sequence η = (η1 . 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). (31) x > y. The answer is very simple: Theorem 31. Bertrand. 8 88 . Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. 369. 436-437. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. e 10 Andr´. Rend. (1887). . ηn ). e e e Paris. Solution directe du probl`me r´solu par M. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . ornamental objects. as for holding liquids. employed for diﬀerent purposes. 105. the ashes of the dead after cremation. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Sci. during sampling without replacement. and anciently for holding ballots to be drawn. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. . n = a + b.

. conditional on {Sn = y}. ξn ) is uniformly distributed in a set with n elements. 89 . . . . .We shall call (η1 . where y = a − (n − a) = 2a − n.i−1 = q = 1 − p. . . Sn > 0 | Sn = y) = . But if Sn = y then. . n Since y = a − (n − a) = a − b. the result follows. . . . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . conditional on {Sn = y}. . the ‘black’ items correspond to − signs). we can translate the original question into a question about the random walk. −n ≤ k ≤ n. . always assuming (without any loss of generality) that a ≥ b = n − a. An urn can be realised quite easily be a random walk: Theorem 32. . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. using the deﬁnition of conditional probability and Lemma 16. . . . We have P0 (Sn = k) = n n+k 2 pi. εn be elements of {−1. . .i+1 = p. . i ∈ Z. the sequence (ξ1 . Consider a simple symmetric random walk (Sn . Proof. Since the ‘azure’ items correspond to + signs (respectively. So the number of possible values of (ξ1 . . we have: P0 (ξ1 = ε1 . . . but which is not necessarily symmetric. Then. It remains to show that all values are equally a likely. ξn ) is. a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. n . b be deﬁned by a + b = n. . ξn = εn ) P0 (Sn = y) 2−n 1 = = . . The values of each ξi are ±1. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. Sn > 0} that the random walk remains positive. so ‘azure’ items are +1 and ‘black’ items are −1. . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. n ≥ 0) with values in Z and let y be a positive integer. n n −n 2 (n + y)/2 a Thus. a. the sequence (ξ1 . conditional on the event {Sn = y}. Let ε1 . . . +1} such that ε1 + · · · + εn = y. . . Proof of Theorem 31. . Then. a we have that a out of the n increments will be +1 and b will be −1. Since an urn with parameters n. . indeed. p(n+k)/2 q (n−k)/2 . letting a. a − b = y. . (The word “asymmetric” will always mean “not necessarily symmetric”. But in (30) we computed that y P0 (S1 > 0. ξn ) has a uniform distribution. . We need to show that. . . . . (ξ1 .) So pi. . . ηn ) an urn process with parameters n.

If S1 = 1 then T1 = 1. Lemma 19. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. we make essential use of the skip-free property.} ∪ {∞}. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. random variables. 1. 2. 90 . it is no loss of generality to consider the random walk starting from 0.d. Corollary 20. plus a time T1 required to go from −1 to 0 for the ﬁrst time. . for x > 0. The rest is a consequence of the strong Markov property. 2qs Proof. Lemma 18. . (We tacitly assume that S0 = 0. Px (Ty = t) = P0 (Ty−x = t). x − 1 in order to reach x. .) Then. We will approach this using probability generating functions. Start with S0 = 0 and watch the process until the ﬁrst n = T1 such that Sn = 1.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. Theorem 33. E0 sTx = ϕ(s)x . The process (Tx . Then the probability generating function of the ﬁrst time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . only the relative position of y with respect to x is needed. In view of Lemma 19 we need to know the distribution of T1 . Let ϕ(s) := E0 sT1 . then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. Consider a simple random walk starting from 0. plus a ′′ further time T1 required to go from 0 to +1 for the ﬁrst time. For all x. To prove this.30. 2. x ∈ Z. Consider a simple random walk starting from 0. . This random variable takes values in {0. See ﬁgure below. . y ∈ Z. . Proof. . Proof. x ≥ 0) is also a random walk. Consider a simple random walk starting from 0.i. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. If x > 0. In view of this. and all t ≥ 0.

We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Then the probability generating function of the ﬁrst time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Corollary 21. The formula is obtained by solving the quadratic.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. 0 if x > 0 2qs Tx (32) E0 s = √ |x| E sT−1 |x| = 1− 1−4pqs2 0 . from the strong Markov property. Then the probability generating function of the ﬁrst time that level x ∈ Z is reached is given by √ x E sT1 x = 1− 1−4pqs2 . in order to hit 1. Corollary 22. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . 91 . Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. 2ps Proof. the RW must ﬁrst hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Consider a simple random walk starting from 0. Hence the previous formula applies with the roles of p and q interchanged. if x < 0 2ps Proof. Consider a simple random walk starting from 0. The ﬁrst time n such that Sn = −1 is the ﬁrst time n such that −Sn = 1.

We again do ﬁrst-step analysis. If α = n is a positive integer then n (1 + x)n = k=0 n k x . then we can omit the k upper limit in the sum. and simply write (1 + x)n = k≥0 n k x . 2ps ′ =1− 30.3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. Similarly. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34.2 First return to the origin Now let us ask: If the particle starts at 0.30. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . in distribution. 1 − 4pqs2 . k 92 . For the general case. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. if S1 = 1. how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. For a symmetric ′ random walk. The latter is. this was found in §30. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . where α is a real number. the same as the time required for the walk to hit 1 starting ′ from 0. k by the binomial theorem. If we deﬁne n to be equal to 0 for k > n. Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof. Consider a simple random walk starting from 0.

let us compute √ (−1)k (−x)k = k≥0 k≥0 k≥0 −1 (−x)k k = (−1)k k! (−1)(−2) · · · (−1 − k + 1) = = (−1)k . (1 − x)−1 = But. and it looks exactly the same: (1 + x)α = k≥0 n(n − 1) · · · (n − k + 1) n! = .Recall that n k = It turns out that the formula is formally true for general α. we will show that (1 − x)−1 = 1 + x + x2 + x3 + · · · which is of course the all too-familiar geometric series. k! The adverb ‘formally’ is supposed to mean that this formula holds for some x around 0. by deﬁnition. 2k − 1 k 93 . Indeed. k!(n − k)! k! α k x . (−1)k = 2k − 1 k k and this is shown by direct computation. For example. We have k≥0 1 − x = (1 − x)1/2 = 1/2 (−1)k xk . −1 k so (1 − x)−1 = as needed. k as long as we deﬁne α k = α(α − 1) · · · (α − k + 1) . k! k! (−1)2k xk = k≥0 xk . So √ 1−x= ′ k≥0 1 2k k x =1+ 2k − 1 k 1 − 4pqs2 : ′ k≥1 2k k 1 x 2k − 1 k We apply this to E0 sT0 = 1 − E0 s T 0 = k≥1 1 2k (pq)k s2k . but that we won’t worry about which ones. k We can write this in more familiar symbols if we observe that 2k −k 1 1/2 4 . √ 1 − x. As another example.

x 1−|p−q| 2q 1−|p−q| 2p . by the deﬁnition of E0 sT0 . Hence it is again transient. • If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. if x < 0 |x| In particular. By applying the law of large numbers we have that P0 Sn =p−q n→∞ n lim = 1. • If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. |x| if x > 0 (33) . E0 s T 0 = n≥0 ′ ′ ′ P0 (T0 = n)sn . if x < 0 This formula can be written more neatly as follows: Theorem 35. x if x > 0 . so 1 − 4pq = So P0 (Tx < ∞) = 1 − 4p + 4p2 = (1 − 2p)2 = |1 − 2p| = |p − q|. P0 (Tx < ∞) = p q q p ∧1 ∧1 . Hence it is transient. • If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. 2k − 1 k 2k − 1 31 Recurrence properties of simple random walks Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. 94 . That it is actually null-recurrent follows from Markov chain theory.But. s↑1 which is true for any probability generating function. But remember that q = 1 − p. Comparing the two ‘inﬁnite polynomials’ we see that we must have n = 2k (we knew that– the random walk can return to 0 only in an even number of steps) and then ′ P0 (T0 = 2k) = 1 1 2k k k p q = P0 (S2k = 0). We will see how the results about the generating functions of the hitting and return times help us answer these questions directly. The key fact needed here is: P0 (Tx < ∞) = lim E0 sTx . We apply this to (32). if p = q then P0 (Tx < ∞) = 1 for all x ∈ Z. The quantity inside the square root becomes 1 − 4pq. without appeal to the Markov chain theory.

Let ′ T0 = inf{n ≥ 1 : Sn = 0} be the ﬁrst return time to 0. . Indeed. Then limn→∞ Sn = −∞ with probability 1. Sn ). every nondecreasing sequence has a limit. Proof. except that the limit may be ∞. . n→∞ n→∞ x ≥ 0. with some positive probability. P0 (M∞ ≥ x) = lim P0 (Mn ≥ x) = lim P0 (Tx ≤ n) = P0 (Tx < ∞) = (p/q)x . S2 .2 Return probabilities Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. Consider a simple random walk with values in Z. we obtain what we are looking for. Unless p = 1/2. again. but here it is < ∞. if x < 0 which is what we want. 31. Then ′ P0 (T0 < ∞) = 2(p ∧ q). Theorem 36. We know that it cannot be positive recurrent. . then we have M∞ = limn→∞ Mn . If p > q then |p − q| = p − q and so (33) gives P0 (Tx < ∞) = 1. Assume p < 1/2.Proof. if x > 0 . . . with probability 1 because p < 1/2. Proof.1 Asymptotic distribution of the maximum Suppose now that p < 1/2. the sequence Mn is nondecreasing. q p. The simple symmetric random walk with values in Z is null-recurrent. and so it is null-recurrent.) If we let Mn = max(S0 . Consider a random walk with values in Z. This means that there is a largest value that will be ever attained by the random walk. S1 . . Then the overall maximum M∞ has a geometric distribution: P0 (M∞ ≥ x) = (p/q)x . this probability is always strictly less than 1. The last part of Theorem 35 tells us that it is recurrent. 31. If p < q then |p − q| = q − p and. Otherwise. What is this probability? Theorem 37. it will never ever return to the point where it started from. Corollary 23. x ≥ 0. 95 . . This value is random and is denoted by M∞ = sup(S0 .

2. the total number of visits to a state will be ﬁnite. Consider a simple random walk with values in Z. And so P0 (J0 ≥ k) = P0 (J0 ≥ 1)k . But J0 ≥ 1 if and only if the ﬁrst return time to zero is ﬁnite.3 Total number of visits to a state If the random walk is recurrent then it visits any state inﬁnitely many times. (34) a random variable ﬁrst considered in Section 10 where it was also shown to be geometric. . . 1 − 2(p ∧ q) this being a geometric series. 1. In the latter case. . If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. The probability generating function of T0 was obtained in Theorem 34. meaning that Pi (Ji = ∞) = 1. . Then. . 2. 1 − 2(p ∧ q) Proof. 31. as already known. Theorem 38. the total number J0 of visits to state 0 is geometrically distributed with P0 (J0 ≥ k) = (2(p ∧ q))k . . Pi (Ji ≥ k) = 1 for all k. 96 . 2(p ∧ q) E0 J0 = . 1. Remark 1: By spatial homogeneity. k = 0. If p = q this gives 1. 2. . This proves the ﬁrst formula. But if it is transient. . 2(p ∧ q) . But ′ P0 (T0 < ∞) = lim EsT0 = 1 − s↑1 ′ 1 − 4pq = 1 − |p − q|. We now look at the the number of visits to some state but if we start from a diﬀerent state. . The total number of visits to state i is given by ∞ Ji = n=1 1(Sn = i). 1. k = 0. . we obviously have that the same formulae apply for any i: Pi (Ji ≥ k) = (2(p ∧ q))k .′ Proof. We now compute its exact distribution. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. For the second formula we have ∞ ∞ E0 J0 = k=1 P0 (J0 ≥ k) = k=1 (2(p ∧ q))k = 2(p ∧ q) . where the last probability was computed in Theorem 37. even for p = 1/2. starting from zero. Ei Ji = 1 − 2(p ∧ q) Remark 2: The formulae work for all values of p ∈ [0. if a Markov chain starts from a state i then Ji is geometric. In Lemma 4 we showed that. 1]. So ′ P0 (J0 ≥ 1) = P0 (T0 < ∞) = 2(p ∧ q). k = 0.

e. If p ≤ 1/2 then P0 (Jx ≥ k) = (p/q)x∨0 (2p)k−1 . Then P0 (Tx < ∞. Jx ≥ k) = P0 (Tx < ∞)Px (Jx ≥ k − 1). If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (2p)k−1 . (q/p)(−x)∨0 1 − 2q k≥1 . we apply E0 Jx = k=1 P0 (Jx ≥ k) and perform the sum of a geometric series. Py (Jx ≥ k) = P0 (Jx−y ≥ k) and Ey Jx = E0 Jx−y . If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (p/q)x (2p)k−1 . Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae. p < 1/2) then. Jx ≥ k). then E0 Jx drops down to zero geometrically fast as x tends to inﬁnity. i. Apply now the strong Markov property at Tx . 97 . (2(p ∧ q))k−1 . starting from 0. So we obtain P0 (Jx ≥ k) = p ∧1 q x (p/q)x∨0 1 − 2p k≥1 . Remark 1: The expectation E0 Jx is ﬁnite if p = 1/2. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (2q)k−1 . The reason that we replaced k by k − 1 is because. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (q/p)|x| (2q)k−1 . For x > 0. Consider the case p < 1/2. it will visit any point below 0 exactly the same number of times on the average. In other words. ∞ As for the expectation. But for x < 0. E0 Jx = Proof. The ﬁrst term was computed in Theorem 35. simply because if Jx ≥ k ≥ 1 then Tx < ∞. if the random walk “drifts down” (i. we have E0 Jx = 1/(1−2p) regardless of x. and the second in Theorem 38. E0 Jx = If p ≥ 1/2 then P0 (Jx ≥ k) = (q/p)(−x)∨0 (2q)k−1 .e. Consider a random walk with values in Z. Then P0 (Jx ≥ k) = P0 (Tx < ∞. given that Tx < ∞. k ≥ 1. there has already been one visit to state x and so we need at least k − 1 remaining visits in order to have at least k visits in total. Suppose x > 0.Theorem 39. We repeat the argument for x < 0 and ﬁnd P0 (Jx ≥ k) = q ∧1 p |x| (2(p ∧ q))k−1 .

. i=1 Then the dual (Sk . because the two have the same distribution. ξ2 . Let Tx = inf{n ≥ 1 : Sn = x} be the ﬁrst time that x will be visited. Sn > 0 | Sn = x) = But the left-hand side is P0 (Sn −Sn−k > 0. 0 ≤ k ≤ n) of (Sk . . random variables. .32 Duality Consider random walk Sk = k ξi . and thus derived the generating function of Tx : E0 sTx = Specialising to the symmetric case we have E0 s Tx 1− 1 − 4pqs2 2qs x . (Sk . for a simple random walk and for x > 0. which is basically another probability for S. In Lemma 19 we observed that. 1 ≤ k ≤ n | Sn = x) = P0 (Sk < x. 0 ≤ k ≤ n) has the same distribution as (Sk . random variables. (35) In Theorem 29 we used the reﬂection principle and derived the distribution function of Tx for a simple symmetric random walk: 1 P0 (Tx ≤ n) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). ξn−1 . . Fix an integer n. Theorem 40. every time we have a probability for S we can translate it into a probability for S. = 1− √ 1 − s2 s x . Sn = x) P0 (Tx = n) = . n x > 0. Thus.i. 0 ≤ k ≤ n − 1. . . . (36) 2 Lastly. ξn ) has the same distribution as (ξn . Here is an application of this: Theorem 41. . . a sum of i. n (37) 98 . each distributed as T1 . 0 ≤ k ≤ n). Then P0 (Tx = n) = x P0 (Sn = x) . Proof.d. Pause for reﬂection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. 0 ≤ k ≤ n) is deﬁned by Sk := Sn − Sn−k . Rewrite the ballot theorem terms of the dual: P0 (S1 > 0. the probabilities P0 (Tx = n) = x P0 (Sn = x). i. we derived using duality. P0 (Sn = x) P0 (Sn = x) x .i. n Proof.e.d. Use the obvious fact that (ξ1 . ξ1 ). for a simple symmetric random walk. in Theorem 41. Let (Sk ) be a simple symmetric random walk and x a positive integer. . . . Tx is the sum of x i.

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33

Amount of time that a SSRW is positive*

Consider a SSRW (Sn ) starting from 0. Deﬁne its trajectory as the polygonal line obtained by joining the points (n, Sn ) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+ (2n) be the number of sides of the polygonal line that are above the horizontal axis and T− (2n) = 2n − T+ (2n) the number of sides that are below. Clearly, T± (2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+ (2n) as the number of leads in 2n coin tosses, where we

0

T 0’

′ Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0 ,

it behaves again like a SSRW and is independent of the past.

have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads=tails). Most people take this statement and translate it into a statement about T+ (2n) arguing that P (T+ (2n) = m) is maximised when m = n, i.e. when T+ (2n) = T− (2n). This is wrong. In fact, as the theorem below shows, the most likely value of P (T+ (2n) = m) is when m = 0 or m = 2n. Theorem 42 ( distribution of the total time the random walk is positive ). P (T+ (2n) = 2k) = P (S2k = 0)P (S2n−2k = 0) , 0 ≤ k ≤ n.

′ Proof. The idea is to condition at the ﬁrst time T0 that the RW returns to 0: n ′ P (T+ (2n) = 2k, T0 = 2r). r=1

P (T+ (2n) = 2k) =

99

′ Between 0 and T0 the random walk is either above the horizontal axis or below; call the ﬁrst event A+ and the second A− and write n ′ ′ P (A+ , T0 = 2r)P (T+ (2n) = 2k | A+ , T0 = 2r) n ′ ′ P (A− , T0 = 2r)P (T+ (2n) = 2k | A− , T0 = 2r)

P (T+ (2n) = 2k) =

r=1

+

r=1

**In the ﬁrst case,
**

′ ′ P (T+ (2n) = 2k | A+ , T0 = 2r) = P (T+ (2n) = 2k−2r | A+ , T0 = 2r) = P (T+ (2n−2r) = 2k−2r), ′ because, if T0 = 2r and A+ occurs then the RW has already spent 2r units of time above the axis and so T+ (2n) is reduced by 2r. Furthermore, we may remove the conditioning ′ because the future evolution of the RW after T0 is independent of the past. In the second case, a similar argument shows ′ ′ P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n − 2r) = 2k).

**We also observe that, by symmetry, 1 ′ ′ ′ P (A+ , T0 = 2r) = P (A− , T0 = 2r) = P (T0 = 2r) = f2r . 2 Letting p2k,2n = P (T+ (2n) = 2k), we have shown that p2k,2n = 1 2
**

k

**f2r p2k−2r,2n−2r +
**

r=1

1 2

n−k

**f2r p2k,2n−2r
**

r=1

We want to show that p2k,2n = u2k u2n−2k , where u2k = P (S2k = 0). With this hypothesis, we can start playing with the recursion of the display above and using induction on n we will realise that p2k,2n 1 = u2n−2k 2

k r=1

1 f2r u2k−2r + u2k 2

n−k

f2r u2n−2k−2r .

r=1

Explicitly, we have P (T+ (2n) = 2k) = 2k k 2n − 2k −2n 2 , n−k 0 ≤ k ≤ n. (38)

With 2n = 100, we plot the function P (T100 = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

100

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0

10

20 k

30

40

50

**The arcsine law*
**

Use of Stirling’s approximation in formula (38) gives P (T+ (2n) = 2k) ∼ 1/π k(n − k) , as k → ∞ and n − k → ∞.

**Now consider the calculation of the following probability: P T+ (2n) ∈ [a, b] 2n =
**

na≤k≤nb

P

T+ (2n) k = 2n n 1/π k(n − k) =

a≤k/n≤b

∼

a≤k/n≤b

1 . k k (n − n ) n n

1/π

**This is easily seen to have a limit as n → ∞, the limit being
**

b

1/π t(1 − t)

dt,

a

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law: x √ T+ (2n) 2 1/π lim P dt = arcsin x. ≤x = n→∞ 2n π t(1 − t) 0 We thus have that the limiting distribution of T+ (2n) (the fraction of time that the RW is √ 2 positive), as n → ∞, is the distribution function π arcsin x, 0 < x < 1. This is known as the arcsine law. This distribution function has density √ 1/π , 0 < x < 1, the plot of

x(1−x)

which is amazingly close to the plot of the ﬁgure above. Now consider again a SSRW and consider the hitting times (Tx , x ≥ 0). This is another random walk because the summands in Tx = x (Ty − Ty−1 ) are i.i.d. all distributed as y=1 T1 : 2n −2n 1 P (T1 ≥ 2n) = P (S2n = 0) = 2 ∼√ . n πn 101

When he reaches a corner he moves again in the same manner. The corners are described by coordinates (x. Rn ) := (Sn + Sn . west. Since ξ1 . 0)) = 1/4. is a random walk. 1 P (ξn = 1) = P (ξn = (1. Sn ) = n n 2 . Indeed. ξ2 . Many other quantities can be approximated in a similar manner. n ∈ Z+ ) is a sum of i. map (x1 . 0)) = 1/4. are independent. 0). 0)) = P (ξn = (−1. 2 Same argument applies to (Sn . 1)) = P (ξn = (−1. Sn − Sn ). We describe this motion as a random walk in Z2 = Z × Z: Let ξ1 . 1 1 Proof. We have 1. ξ2 . The two random walks are not independent. i=1 ξi i=1 ξi 2 1 Lemma 20. 0)) = P (ξn = (1. y ∈ Z. north or south. west. Consider a change of coordinates: rotate the axes by 45o (and scale them). It is not a simple random walk because there is a possibility that the increment takes the value 0. However. x2 ) into (x1 + x2 . . y) where x.d. i. .d. random variables and hence a random walk. (Sn . the two stochastic processes are not independent. −1)) = 1/2. 34 Simple symmetric random walk in two dimensions Consider a drunkard in a city whose roads are arranged in a rectangular grid. He starts at (0. n ∈ Z+ ) showing that it. 1)) + P (ξn = (0. ξ2 . They are also identically distributed with 1 P (ξn = 0) = P (ξn = (0. 1 Hence (Sn . . Sn = i=1 n ξi . x1 − x2 ). 0)) = 1/4 1 P (ξn = −1) = P (ξn = (−1. n ∈ Z+ ) is also a random walk. north or south. . So let 1 2 1 2 1 2 Rn = (Rn . 0) and. We have: 102 . . 2 1 If x ∈ Z2 .. he moves east. being totally drunk. . the random 1 2 variables ξn . . n ≥ 1.e.From this.i. We then let S0 = (0. so are ξ1 . So Sn = (Sn . (Sn . we let x1 be its horizontal coordinate and x2 its vertical. with equal probability (1/4). random vectors such that P (ξn = (0. going at random east. too. for each n. i. ξn are not independent simply because if one takes value 0 the other is nonzero.i. n ∈ Z+ ) is a random walk. . . we immediately get that the expectation of T1 is inﬁnity: ET1 = ∞. the latter being functions of the former.

P (S2n = 0) ∼ n . Hence P (ηn = 1) = P (ηn = −1) = 1/2. 1 where the last equality follows from the independence of the two random walks. But (Rn ) 1 2 is a simple symmetric random walk. Hence (Rn . Proof. P (S2n = 0) = Proof. 35 Skorokhod embedding* We take a closer look at the gambler’s ruin problem for a simple symmetric random walk in one dimension. b−a −a . 1 2 1 2 P (S2n = 0) = P (R2n = 0) = P (R2n = 0. The last two are walk in one dimension. where c is a positive constant. ηn = 1) = P (ξn = (1. The possible values 1 . in the sense that the ratio of the two sides goes to 1. n ∈ Z+ ) 1 is a random walk in two dimensions. n Corollary 24. ηn = −1) = P (ξn = (−1. (Rn + 2 2 1 2 1 1 we check that ηn = ξn + ξn . ηn = ξn − ξn are independent for each n. . ηn = ε′ ) = P (ηn = ε)P (ηn = ε′ ) for all choices of ε. ξ2 . The same is true for (Rn ). and by Stirling’s approximation. . ηn = −1) = P (ξn = (0. By Lemma 5. . The sequence of vectors η .. η . We have Rn = n (ξi + ξi ). Rn = n (ξi − ξi ). n ∈ Z+ ) is a random 2 . 2 . 1 1 2 2 Hence P (ηn = 1) = P (ηn = −1) = 1/2. The two-dimensional simple symmetric random walk is recurrent. P (Tb < Ta ) = We can read this as follows: P (STa ∧Tb = b) = −a . n ∈ Z+ ). 0)) = 1/4 1 2 P (ηn = −1.1 Lemma 21. from Theorem 7. Lemma 22. (Rn . η 2 are ±1. . +1}. (Rn . b−a 103 . n ∈ Z ) is a random walk in one dimension.i. ηn = 1) = P (ξn = (0.d. . ε′ ∈ {−1. So Rn = n ηn where ηn is i=1 i=1 i=1 1 + ξ 2 . that for a simple symmetric random walk (started from S0 = 0). n ∈ Z ) is a random walk in one dimension. 2 1 2 2 1 1 Proof. . (Rn + independent. We have of each of the ηn n 1 2 P (ηn = 1. To show that the last two are independent. Let a < 0 < b. But by the previous n=0 c lemma. we need to show that ∞ P (Sn = 0) = ∞. Similar argument shows that each of (Rn . as n → ∞. And so 2 1 2 1 P (ηn = ε. b−a P (STa ∧Tb = a) = b . ξ 1 − ξ 2 ). So P (R2n = 0) = 2n 2−2n . n ∈ Z+ ) is a random walk in two dimensions. Recall. being obtained the vector ηn = (ξn 1 2 n n n by taking the same function on the elements of the sequence ξ1 . 1)) = 1/4 1 2 P (ηn = −1. is i. 0)) = 1/4 1 2 P (ηn = 1. b−a P (Ta < Tb ) = b . 2n n 2 2−4n . = 1)) = 1/4. R2n = 0) = P (R2n = 0)(R2n = 0).

k ∈ Z. Then Z = 0 if and only if A = B = 0 and this has 104 . if p = M/N . i ∈ Z). B = 0) + P (A = i. and assuming that (A. using this. Theorem 43. B = 0) = p0 . i < 0 < j. The probability it exits it from the left point is p. i ∈ Z. To see this.This last display tells us the distribution of STa ∧Tb . a = M − N . b]. we consider the random variable Z = STA ∧TB . Conversely. b. Can we ﬁnd a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this. 1 − p). This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the ﬁrst time it exits the interval [a. e. B = j) = c−1 (j − i)pi pj . The claim is that P (Z = k) = pk . given a 2-point probability distribution (p. where c−1 is chosen by normalisation: P (A = 0. B) taking values in {(0. Proof. in particular. a < 0 < b such that pa + (1 − p)b = 0. B) are independent of (Sn ). b = N . N ∈ N. such that p is rational. we can always ﬁnd integers a. i<0 j>0 This leads to: c−1 j>0 jpj i<0 pi + c−1 i<0 (−i)pi j>0 i>0 ipi pj = 1 − p0 Since i∈Z ipi = 0. Deﬁne a pair of random variables (A. we get that c is precisely ipi . c= i<0 (−i)pi = i>0 Letting Ti = inf{n ≥ 0 : Sn = i}. Then there exists a stopping time T such that P (ST = i) = pi . How about more complicated distributions? Let suppose we are given a probability distribution (pi . Note. 0)} ∪ (N × N) with the following distribution: P (A = i. and the probability it exits from the right point is 1 − p. i ∈ Z) a given probability distribution on Z with zero mean: i∈Z ipi = 0.g. we have this common value: i<0 (−i)pi = and. suppose ﬁrst k = 0. choose. Indeed. M. P (A = 0. that ESTa ∧Tb = 0. B = j) = 1. i ∈ Z. M < N . Let (Sn ) be a simple symmetric random walk and (pi .

105 . B = j) P (STi ∧Tk = k)P (A = i.probability p0 . We have P (Z = k) = P (STA ∧TB = k) = i<0 j>0 P (STi ∧Tj = k)P (A = i. i<0 = c−1 Similarly. Suppose next k > 0. B = k) i<0 = = i<0 −i −1 c (k − i)pi pk k−i (−i)pi pk = pk . P (Z = k) = pk for k < 0. by deﬁnition.

But 3 is the maximum number of times that 38 “ﬁts” into 120. We write this by b | a. we may summarise the ﬁrst line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. So we work faster. with at least one of them nonzero. 38) = gcd(6. 6. i. So (i) holds. we can deﬁne their greatest common divisor d = gcd(a. −3. This gives us an algorithm for ﬁnding the gcd between two numbers: Suppose 0 < a < b are integers. if a. (That it is unique is obvious from the fact that if d1 .) Notice that: gcd(a. We will show that d = gcd(a. Zero is denoted by 0. with the second line. Indeed 4 × 38 > 120. 106 . If b − a > a. If c | b and b | a then c | a. 5. b). . Indeed. In other words. let d = gcd(a. Z = {· · · . 1. The set Z of integers consists of N.e. as taught in high schools or universities. The set N of natural numbers is the set of positive integers: N = {1. their negatives and zero. 3. 2) = gcd(4. b − a). with b = 0. Given integers a. replace b − a by b − 2a. the process of long division that one learns in elementary school).e. if we also invoke Euclid’s algorithm (i. They are not supposed to replace textbooks or basic courses. Then c | a + (b − a). c | b. 38) = gcd(44. . b). Now suppose that c | a and c | b − a. So (ii) holds. 0. −2. they merely serve as reminders. · · · }. Then interchange the roles of the numbers and continue. 14) = gcd(6. b) as the unique positive integer such that (i) d | a.PART III: APPENDICES This section contains some sine qua non elements from standard Mathematics. then d | c. 8) = gcd(6. (ii) if c | a and c | b. b. But in the ﬁrst line we subtracted 38 exactly 3 times. Now. we have c | d. b − a). 38) = gcd(6. d | b. and d = gcd(a. 2) = 2. and we’re done. 38) = gcd(82. 32) = gcd(6. d2 are two such numbers then d1 would divide d2 and vice-versa. 4. Keep doing this until you obtain b − qa < a. leading to d1 = d2 . Because c | a and c | b. Therefore d | b − a. b are arbitrary integers. . −1. Replace b by b − a. 20) = gcd(6. b) = gcd(a. 3. If a | b and b | a then a = b. we say that b divides a if there is an integer k such that a = kb. 26) = gcd(6. 2. gcd(120. 2. 2) = gcd(2. Arithmetic Arithmetic deals with integers. For instance. Similarly. But d | a and d | b.}.

Thus. we can ﬁnd an integer q and an integer r ∈ {0. 38) = gcd(6. b ∈ Z. we have 2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120. Then c divides any linear combination of a. a − 1}. we used the divisions 120 = 3 × 38 + 6 38 = 6 × 6 + 2. y ∈ Z.e. 38) = gcd(6. such that a = qb + r. b) then there are integers x. . 120) = 38x + 120y. with a > b. y = 6. y ∈ Z. For example. b) = min{xa + yb : x. not both zero. Its proof is obvious. we have gcd(a. it divides d. we can show that. . This shows (ii) of the deﬁnition of a greatest common divisor. Suppose that c | a and c | b. We need to show (i). Then d = xa + yb. b. to divide 3450 with 26 we do 132 —— 3450 26 — 85 78 — 70 52 — 18 26 | and obtain q = 26. in particular. To see this. The long division algorithm is a process for ﬁnding the quotient q and the remainder r. To see why this is true. 1. for integers a. let d be the right hand side. 107 . xa + yb > 0}. we have the fact that if d = gcd(a. gcd(38. Here are some other facts about the greatest common divisor: First. for some x. . let us follow the previous example again: gcd(120. 2) = 2. a. . with x = 19. r = 18. i. y such that d = xa + yb. Second.The theorem of division says the following: Given two positive integers b. In the process. The same logic works in general. Working backwards.

yielding r = (−qx)a + (1 − qy)b. that d does not divide b. x2 ∈ X have diﬀerent values f (x1 ). so d divides both a and b. A relation which possesses all the three properties is called equivalence relation. the set {f (x) : x ∈ X}. The range of f is the set of its values. for some 1 ≤ r ≤ d − 1. Since d > 0. We encountered the equivalence relation i theory. For example. y) such that x is smaller than y.. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). B be ﬁnite sets with n. Finally. The relation < is not symmetric. Relations A relation on a set A is a mathematical notion that establishes a way of pairing elements of A.g. j (i communicates with j) in Markov chain Functions We a function f from a set X into a set Y is denoted by f : X → Y . z) it also contains (x. But then b = q(xa + yb) + r. i.e. The equality relation = is reﬂexive.that d | a and d | b. and this is a contradiction. say. A relation is called reflexive if it contains pairs (x. In general. y) without contain its symmetric (y. f (x2 ). 108 . x). A function is called one-to-one (or injective) if two diﬀerent x1 . y) and (y. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x.e. z). e. m elements. If we take A to be the set of students in a classroom. it can be thought as reﬂexive (unless we do not want to accept the fact that one may not be afriend of oneself). The relation < is not reﬂexive. but is not necessarily transitive. Suppose this is not true. A function is called onto (or surjective) if its range is Y . that d does not divide both numbers. the greatest common divisor of any subset of the integers also makes sense. Both < and = are transitive. A relation is transitive if when it contains (x. then the relation “x is a friend of y” is symmetric. a relation can be thought of as a set of pairs of elements of the set A. i. we can divide b by d to ﬁnd b = qd + r. respectively. The equality relation is symmetric. If B ⊂ Y then f −1 (B) consists of all x ∈ X with f (x) ∈ B. the relation < between real numbers can be thought of as consisting of pairs of numbers (x. so r = 0. So d is not minimum as claimed. x). A relation is called symmetric if it cannot contain a pair (x. Counting Let A. The set of functions from X to Y is denoted by Y X . The greatest common divisor of 3 integers can now be deﬁned in a similar manner.

) So the ﬁrst animal can be assigned any of the m available names. and so on. If we don’t then we divide by the number k! of permutations of the k elements concluding that the number of subsets with k elements of a set with n elements equals (n)k n =: . k! k where the symbol equation. n k is merely a name for the number computed on the left side of this n k = n n−k n 0 = n! . Thus. The ﬁrst element can be chosen in n ways. the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen.The number of functions from A to B (i. If A has n elements. we denote (n)n also by n!. in other words. To be able to do this. In the special case m = n. we have = 1. and this follows from the binomial theorem. that of the second in m − 1 ways. We thus observe that there are as many subsets in A as number of elements in the set {0. 1}n of sequences of length n with elements 0 or 1. n! is the total number of permutations of a set of n elements. we need more names than animals: n ≤ m. k!(n − k)! Since there is only one subset with no elements (it is the empty set ∅). A function is then an assignment of a name to each animal. and so on. the total number of one-to-one functions from A to B is (m)n := m(m − 1)(m − 2) · · · (m − n + 1). So we have (n)k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. Since the name of the ﬁrst animal can be chosen in m ways. the second in n − 1. the cardinality of the set B A ) is mn .e. Any set A contains several subsets. (The same name may be used for diﬀerent animals. Each assignment of this type is a one-to-one function from A to B. a one-to-one function from the set A into A is called a permutation of A. Incidentally. 109 . we may think of the elements of A as animals and those of B as names. n! stands for the product of the ﬁrst n natural numbers: n! = 1 × 2 × 3 × · · · × n. Clearly. so does the second. Hence the total number of assignments is m × m × ··· × m n times Suppose now we are not allowed to give the same name to diﬀerent animals. the k-th in n − k + 1 ways. and the third. The total number of subsets of A (regardless of their number of elements) is thus equal to n k=0 n k = (1 + 1)n = 2n . and so on. Indeed.

Thus a ∧ b = a if a ≤ b or = b if b ≤ a. we add the two numbers together in order to ﬁnd the total number of ways to pick k animals from n + 1 animals. 0). b is denoted by a ∧ b = min(a. 110 . by convention. b). we use the following notations: a+ = max(a. a∨b∨c refers to the maximum of the three numbers a. consider a speciﬁc animal. In particular. by convention. say. and we don’t care to put parantheses because the order of ﬁnding a maximum is irrelevant. c. and a− is called the negative part of a. m! n y If y is not an integer then. Since the sentences “the pig is included” and “the pig is not included” cannot be simultaneously true. a+ is called the positive part of a. Note that it is always true that a = a+ − a− . then there are k − 1 remaining k n animals to be chosen and this can be done in k−1 ways. if x is a real number. 0). Both quantities are nonnegative. the pig. For example. we deﬁne x m = x(x − 1) · · · (x − m + 1) . we let The Pascal triangle relation states that n+1 k = = 0. b. k Also. If the pig is included. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c). The notation extends to more than one numbers. we choose k animals from the remaining n ones and this can be done in n ways. + k−1 k Indeed. n n . b).We shall let n = 0 if n < k. If the pig is not included in the sample. Maximum and minimum The minimum of two numbers a. in choosing k animals from a set of n + 1 ones. The absolute value of a is deﬁned by |a| = a+ + a− . The maximum is denoted by a ∨ b = max(a. a− = − min(a.

}. It is worth remembering that. Another example: Let X be a nonnegative random variable with values in N = {1. An example of its application: Since 1A∪B = 1 − 1(A∪B)c = 1 − 1Ac ∩Bc = 1 − 1Ac 1Bc = 1 − (1 − 1A )(1 − 1B ) = 1A + 1B − 1A 1B . . are sets or clauses then the number n≥1 1An is the number true clauses. .Indicator function stands for the indicator function of a set A. . meaning a function that takes value 1 on A and 0 on the complement Ac of A. For instance. Therefore its expectation is E 1A = 1 × P (A) + 0 × P (Ac ) = P (A). we have E[1A∪B ] = E[1A ] + E[1B ] − E[1A∩B ]. . . We also use the notation 1(clause) for a number that is 1 whenever the clause is true or 0 otherwise. 1 − 1A = 1 A c . . A2 . A linear function ϕ from Rn to Rm is such that ϕ(αx + βy) = αϕ(x) + βϕ(y). Thus. If A is an event then 1A is a random variable. 2. P (X ≥ n). It takes value 1 with probability P (A) and 0 with probability P (Ac ). So X= n≥1 1(X ≥ n). It is easy to see that 1A 1A∩B = 1A 1B . which means P (A ∪ B) = P (A) + P (B) − P (A ∩ B). if I go to a bank N times and Ai is the clause “I am robbed the i-th time” then N n=1 1Ai is the total number of times I am robbed. if A1 . = 1A + 1B − 1A∩B . This is very useful notation because conjunction of clauses is transformed into multiplication. 111 . Several quantities of interest are expressed in terms of indicators. Then X X= n=1 1 (the sum of X ones is X). So E[X] = n≥1 E 1(X ≥ n) = n≥1 Matrices Let Rd be the set of vectors with d components.

. we have ϕ(x) = m n vi i=1 j=1 aij xj . x1 . . β ∈ R. Then. where ∈ R. Then any x ∈ Rn can be uniquely written as n x= j=1 x j uj . So if we choose a basis v1 . But the ϕ(uj ) are vectors in Rm . . . is a column representation of x with respect to . . . . . The column . . ym v1 . j=1 i = 1. Let u1 . . . . because ϕ is linear. . = ··· ··· ··· . . Let y i := n aij xj . xn the given basis. . ϕ are linear functions: It is a m × n matrix.for all x. . . and all α. The matrix A of ϕ with respect to the choice of the two basis is deﬁned by a11 · · · a1n A := · · · · · · · · · am1 · · · amn y1 Then ϕ ◦ ψ is a linear function: Rn − → Rℓ − 112 ϕ◦ψ . y ∈ Rn . xn n 1 y := ϕ(x) = j=1 xj ϕ(uj ). am1 · · · amn xn ym Rn − Rm − Rℓ → → ψ ϕ . . x . Combining. m. . . vm in Rn we can express each ϕ(uj ) as a linear combination of the basis vectors: m ϕ(uj ) = i=1 aij vi . . The column . Suppose that ψ. . un be a basis for Rn . . . The relation between the column representation of y and the column representation of x is written as 1 1 x y a11 · · · a1n . vm . is a column representation of y := ϕ(x) with respect to the given basis .

Likewise. Rm and Rℓ and let A.} has no minimum. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ. ed = . is a sequence of numbers then inf{x1 . . Usually. We note that lim inf(−xn ) = − lim sup xn . 0 0 1 In other words. Then the matrix representation of ϕ ◦ ψ is AB. i. For instance I = {1. respectively. . 1/2. . ei = 1(i = j).e. The standard basis in Rd consists of the vectors 1 0 0 0 1 0 e 1 = . a2 . A matrix A is called square if it is a n × n matrix. with respect to these bases. . inf I is characterised by: inf I ≤ x for all x ∈ I and. the choice of the basis is the so-called standard basis. xn+1 . So lim sup xn = limn→∞ inf{xn . . . we deﬁne lim sup xn . xn+2 . . we deﬁne max I. . x2 . . If I has no upper bound then sup I = +∞. deﬁned as the maximum of all the numbers y with the property y ≤ x for all x ∈ I.} ≤ · · · and. 113 . it is its matrix representation with respect to the ﬁxed basis. . . . for any ε > 0. j So A is a ℓ × m matrix. there is x ∈ I with x ≤ ε + inf I. a square matrix represents a linear transformation from Rn to Rn . . where m (AB)ij = k=1 Aik Bkj . e2 = . x3 . and it is this that we called inﬁmum. If we ﬁx a basis in Rn . . since any increasing sequence has a limit (possibly +∞) we take this limit and call it lim sup xn (limit superior) of the sequence. . . 1/3. . we deﬁne the supremum sup I to be the minimum of all upper bounds. a sequence a1 . B is a m × n matrix. This minimum (or maximum) may not exist. sup I is characterised by: sup I ≥ x for all x ∈ I and. Similarly. . . standing for the “infimum” of I. 1 ≤ j ≤ n. ψ. x4 . A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists. Supremum and inﬁmum When we have a set of numbers I (for instance. x2 . If I is empty then inf ∅ = +∞. .Fix three sets of basis on Rn . of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. .} ≤ inf{x2 . · · · }. sup ∅ = −∞. In these cases we use the notation inf I. . there is x ∈ I with x ≥ ε + inf I. Limsup and liminf If x1 . . . and AB is a ℓ × n matrix. . If I has no lower bound then inf I = −∞. Similarly. . . B be the matrix representations of ϕ. . .} ≤ inf{x3 . . for any ε > 0. 1 ≤ i ≤ ℓ. .

n≥1 Usually. it is useful to ﬁnd disjoint events B1 . Linear Algebra that has many axioms is more ‘restrictive’.e. Markov’s inequality for a positive random variable X states that P (X > x) ≤ EX/x. B in the deﬁnition. and everywhere else. . derive formulae. This is generalisable for a number of events: P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) If we interchange the roles of A. I do sometimes “cheat” from a strict mathematical point of view. to compute P (A). we work a lot with conditional probabilities. A2 . and 1 Ac = 1 − 1 A . A2 . namely: Everything else. showing that P (A) ≤ P (B) is more of a logical problem than a numerical one: we need to show that A implies B (i.. In Markov chain theory. . we often need to argue that X ≤ Y .) If A. In contrast. in which case we write P (A) = n P (A ∩ Bn ) = P (A|Bn )P (Bn ). A is a subset of B). That is why Probability Theory is very rich. Linear Algebra that has many axioms is more ‘restrictive’. P (∪n An ) ≤ Similarly. i.Probability Theory I do not require11 knowledge of axiomatic Probability Theory (with Measure Theory). n If the events A1 . But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems. then P n≥1 An = P (An ). This can be generalised to any ﬁnite number of events. . For example. . . which holds since X is positive. If the events A. A2 . if An are events with P (An ) = 1 then P (∩n An ) = 1. 11 114 . if he events A1 . . Recall that P (B|A) is deﬁned by P (A ∩ B)/P (A) and is a very motivated deﬁnition. Similarly. . If A is an event then 1A or 1(A) is the indicator random variable. . we get P (A ∩ B) = P (A)P (B|A) = P (B)P (A|B) and this is called Bayes’ theorem. to compute P (A ∩ B) we read the deﬁnition as P (A ∩ B) = P (A)P (B|A). Similarly. Note that 1A 1B = 1A∩B . If A1 . Often. are decreasing with A = ∩n An then P (A) = limn→∞ P (An ). n P (An ). B2 . are mutually disjoint events. B are disjoint events then P (A ∪ B) = P (A) + P (B). are increasing with A = ∪n An then P (A) = limn→∞ P (An ). and apply them. Therefore. E 1A = P (A). Probability Theory is exceptionally simple in terms of its axioms. but not too much. . the function that takes value 1 on A and 0 on Ac . such that ∪n Bn = Ω. in these notes. to show that EX ≤ EY . Of course. follows from this axiom.e. by not requiring this. If An are events with P (An ) = 0 then P (∪n An ) = 0. B are not disjoint then P (A ∪ B) = P (A) + P (B) − P (A ∩ B). This is the inclusion-exclusion formula for two events. .. . for (besides P (Ω) = 1) there is essentially one axiom . (That is why Probability Theory is very rich. Quite often. This follows from the logical statement X 1(X > x) ≤ X. In contrast. . Indeed.

So if n an < ∞ then G(s) exists for all s ∈ [−1. For a random variable that takes values in N we can write EX = ∞ P (X ≥ n). EX = x xP (X = x). . If it exists for some s1 then it exists uniformly for all s with |s| < |s1 | (and this needs a proof. . n ≥ 0 is ∞ ρn s n = n=0 1 . . we encounter situations where P (X = ∞) may be positive. . As an example. . taught in calculus courses). The expectation of a nonnegative random variable X may be deﬁned by ∞ EX = 0 P (X > x)dx. . viz. 1. for s = 0 for which G(0) = 0. This is due to the fact that n=1 P (X ≥ n) = E 1(X ≥ n). 1 − ρs (39) Note that |G(s)| ≤ n |an ||s|n and this is ≤ n |an | if s ∈ [−1. The generating function of the sequence pn = P (X = n). is a probability distribution on Z+ = {0. 2. which are both nonnegative and then deﬁne EX = EX + − EX − . . i. the generating function of the geometric sequence an = ρn . a1 . 1].} because then n an = 1. For a random variable which takes positive and negative values. . x→∞ In Markov chain theory. X − = − min(X. is called probability generating function of X: ∞ as long as |ρs| < 1. We do not a priori know that this exists for any s other than. we have EX = −∞ xf (x)dx. In case the random variable is discrete. 1]. which is motivated from the algebraic identity X = X + − X − . so it is useful to keep this in mind. In case the random variable is ∞ absoultely continuous with density f . 0). Generating functions A generating function of a sequence a0 . a1 . 0). making it a useful object when the a0 . The formula holds for any positive random variable. ϕ(s) = EsX = n=0 sn P (X = n) 115 .e. obviously. is an inﬁnite polynomial having the ai as coeﬃcients: G(s) = a0 + a1 s + a2 s2 + · · · . .An application of this is: P (X < ∞) = lim P (X ≤ x). . we deﬁne X + = max(X. This deﬁnition makes sense only if EX + < ∞ or if EX − < ∞. Suppose now that X is a random variable with values in Z+ ∪ {+∞}.. 1. . . n = 0. the formula reduces to the known one. |s| < 1/|ρ|.

s→1 something that can be formally seen from d d EsX = E sX = EXsX−1 . possibly. while ∞ ϕ(1) = n=0 P (X = n) = P (X < +∞). n ≥ 0) is n≥0 xn+1 sn = s−1 (X(s) − x0 ) and the generating function of (xn+2 . we can recover the expectation of X from EX = lim ϕ′ (s).This is deﬁned for all −1 < s < 1. ∞ and p∞ := P (X = ∞) = 1 − pn . with x0 = x1 = 1. but. take the value +∞. If we let X(s) = n≥0 n ≥ 0. n ≥ 0) is n≥0 xn+2 sn = s−2 (X(s) − x0 − sx1 ). As an example. xn sn be the generating of (xn . n ≥ 0) then the generating function of (xn+1 . ds ds and by letting s = 1. we have s∞ = 0. n! as follows from Taylor’s theorem. 116 . Note that ∞ n=0 pn ≤ 1. and this is why ϕ is called probability generating function. Generating functions are useful for solving linear recursions. Note that ϕ(0) = P (X = 0). Also. n=0 We do allow X to. We can recover the probabilities pn from the formula pn = ϕ(n) (0) . since |s| < 1. consider the Fibonacci sequence deﬁned by the recursion xn+2 = xn+1 + xn . Since the term s∞ p∞ could formally appear in the sum deﬁning ϕ(s).

− − a−b s−b s−a b−a b−s a−s 1/ρ 1 1 But (39) tells us that ρn has generating function 1−ρs = (1/ρ)−s . 117 . Despite the square roots in the formula. n ≥ 0) equals the sum of the generating functions of (xn+1 . This tells us that s−b is the generating function of (1/b)n+1 1 and that s−a is the generating function of (1/a)n+1 .From the recursion xn+2 = xn+1 + xn we have (and this is were linearity is used) that the generating function of (xn+2 .e. then the so-called distribution function. i. the n-th term of the Fibonacci sequence.e. So. namely. if X is a random variable with values in R. Hence X(s) is the generating function of 1 xn = ((1/b)n+1 − (1/a)n+1 ). the function P (X ≤ x). Hence s2 + s − 1 = (s − a)(s − b). Since x0 = x1 = 1. Thus (1/ρ)−s is the 1 generating function of ρn+1 . +s−1 But the polynomial s2 + s − 1 has two roots: √ √ a = ( 5 − 1)/2. we can solve for X(s) and ﬁnd X(s) = s2 −1 . n ≥ 0) and (xn . for example. b−a Noting that ab = −1. this number is always an integer (it has to be!) Distribution or law of a random object When we speak of the “distribution” or “law” of a random variable X with values in Z or R or Rn or even in a much larger space. But I could very well use the function P (X > x). is such an example. and so X(s) = 1 1 1 1 1 1 = . n ≥ 0). i. b = −( 5 + 1)/2. we can also write this as xn = (−a)n+1 − (−b)n+1 . we speak of some function which enables us to specify the probabilities of the form P (X ∈ B) where B is a set of values of X. b−a is the solution to the recursion. s−2 (X(s) − x0 − sx1 ) = s−1 (X(s) − x0 ) + X(s). x ∈ R.

in 2n fair coin tosses 118 . for it does allow us to specify the probabilities P (X = n). If X is absolutely continuous. We frequently encounter random objects which belong to larger spaces. Such an excursion is a random object which should be thought of as a random function. Since log x does not vary too much between two successive positive integers k and k + 1 it follows that k+1 k log x dx ≈ log k. instead. for example they could take values which are themselves functions. dx All these objects can be referred to as “distribution” or “law” of X. For instance. But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. n Hence n k=1 log k ≈ log x dx. it follows that n 1 log x dx = n log n − n + 1 ≈ n log n − n. but using or merely talking about such an object presents no diﬃculty other than overcoming a psychological barrier. to compute a large sum we can. we encountered the concept of an excursion of a Markov chain from a state i. Hence we have We call this the rough Stirling approximation. it’s better to use logarithms: n n! = e log n = exp k=1 log k. n! ≈ exp(n log n − n) = nn e−n . (b) The approximation is not good when we consider ratios of products. n ∈ Z+ . Now. For example (try it!) the rough Stirling approximation give that the probability that. Large products: Stirling’s formula If we want to compute n! = 1 × 2 × · · · × n for large n then. 1 Since d dx (x log x − x) = log x. I could use its density f (x) = d P (X ≤ x). Specifying the distribution of such an object may be hard. since the number is so big. Warning: the verb “specify” should not be interpreted as “compute” in the sense of existence of an analytical formula or algorithm. compute an integral and obtain an approximation.or the 2-parameter function P (a < X < b). Even the generating function EsX of a Z+ -valued random variable X is an example. It turns out to be a good one.

Then. then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any diﬀerent from X. The same process shows that a geometric random variable has a distribution of the form P (X ≥ n) = λn . expressing non-negativity. That was the random variable J0 = ∞ n=1 1(Sn = 0). deﬁned in (34). We can case both properties of g neatly by writing For all x. 0 ≤ n ≤ m. And if y < x then 1(y < x) = 0 and the display reads g(y) ≥ 0. To remedy this. where λ = P (X ≥ 1). by taking expectations of both sides (expectation is an increasing operator). and this is terrible. g(X) ≥ g(x)1(X ≥ x). because we know that the n probability goes to zero as n → ∞. and is based on the concept of monotonicity. 119 . If we think of X as the time of occurrence of a random phenomenon. In fact it was shown there that P (J0 ≥ k) = P (J0 ≥ 1)k . y ∈ R g(y) ≥ g(x)1(y ≥ x). if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x). for all x. so this little section is a reminder. The geometric distribution We say that the random variable X with values in Z+ is geometric if it has the memoryless property: P (X − n ≥ m | X ≥ n) = P (X ≥ m). Let X be a random variable. we shall prove Strirling’s approximation: √ n! ∼ nn e−n 2πn where the symbol an ∼ bn means that an /bn → 0. Markov’s inequality Arguably this is the most useful inequality in Probability. It was shown in Theorem 38 that J0 satisﬁes the memoryless property. This completely trivial thought gives the very useful Markov’s inequality. as n → ∞. Indeed. Eg(X) ≥ g(x)P (X ≥ x) . We have already seen a geometric random variable. Consider an increasing and nonnegative function g : R → R+ . n ∈ Z+ . We baptise this geometric(λ) distribution. a geometric function of k. you have seen it before. and so. expressing monotonicity. Surely.we have exactly n heads equals 2n 2−2n ≈ 1.

C M and Snell. Amer. Chung. Springer. 3. Springer. Parry. W (1981) Topics in Ergodic Theory. Grinstead. Cambridge University Press. Cambridge University Press. J R (1998) Markov Chains. J L (1997) Introduction to Probability. 4. Wiley.Bibliography ´ 1. ´ 2. Soc. Feller. Bremaud P (1999) Markov Chains. Math. Bremaud P (1997) An Introduction to Probabilistic Modeling. K L and Aitsahlia F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance).edu/~chance/teaching_aids/books_articles/ /probability_book/amsbook. W (1968) An Introduction to Probability Theory and its Applications.pdf 6. 7.mac. Norris. 5. Also available at: http://www. Springer.dartmouth. 120 .

58 digraph (directed graph). 17 communicates with. 65 Catalan numbers. 77 indicator. 16 convolution. 44 degree. 111 inessential state. 77 coupling. 76 neighbours. 45 Chapman-Kolmogorov equation. 16. 35 Helly-Bray Lemma. 17 Aloha. 61 null recurrent. 70 greatest common divisor. 53 hitting time. 34 PageRank. 82 Cesaro averages. 7 invariant distribution. 34 probability ﬂow. 109 positive recurrent. 17 inﬁmum. 29 reﬂection. 106 ground state. 18 permutation. 15. 48 cycle formula. 110 mixing.Index absorbing. 6 Markov property. 16 communicating classes. 84 bank account. 48 memoryless property. 70 period. 2 nearest neighbour random walk. 81 reﬂection principle. 20 increments. 108 relation. 15 Loynes’ method. 110 meeting time. 51 Monte Carlo. 18 arcsine law. 68 Fatou’s lemma. 17 Kolmogorov’s loop criterion. 98 dynamical system. 15 distribution. 6 Markov’s inequality. 87 121 . 119 Google. 119 minimum. 10 ballot theorem. 86 reﬂexive. 20 function of a Markov chain. 65 generating function. 117 division. 114 balance equations. 73 Markov chain. 101 axiom of Probability Theory. 26 running maximum. 63 Galton-Watson process. 111 maximum. 10 irreducible. 68 aperiodic. 108 ruin probability. 61 detailed balance equations. 16 equivalence relation. 108 ergodic. 116 ﬁrst-step analysis. 59 law. 47 coupling inequality. 51 essential state. 7 closed. 1 equivalence classes. 113 initial distribution. 13 probability generating function. 17 Ethernet. 115 random walk on a graph. 61 recurrent. 53 Fibonacci sequence. 18 dual. 115 geometric distribution. 119 matrix. 117 leads to. 3 branching process.

6 statespace. 2 transition probability. 108 time-homogeneous. 88 watching a chain in a set. 7. 24 Strirling’s approximation. 77 skip-free property. 13 stopping time. 16. 16. 76 Skorokhod embedding. 9 stationary distribution. 104 states. 76 simple symmetric random walk. 71 122 . 60 urn.semigroup property. 27 storage chain. 62 Wright-Fisher model. 8 transition probability matrix. 73 strategy of the gambler. 62 subsequence of a Markov chain. 10 steady-state. 62 supremum. 113 symmetric. 8 transitive. 3. 6 stationary. 119 strong Markov property. 29 transition diagram:. 27 subordination. 7 simple random walk. 108 tree. 58 transient. 7 time-reversible.

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

We've moved you to where you read on your other device.

Get the full title to continue

Get the full title to continue listening from where you left off, or restart the preview.

scribd