
**MARKOV CHAINS AND RANDOM WALKS**

Takis Konstantopoulos∗ Autumn 2009

∗ © Takis Konstantopoulos 2006-2009

Contents

1 Introduction
2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains
3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains
4 Stationarity
  4.1 Finding a stationary distribution
5 Topological structure
  5.1 The graph of a chain
  5.2 The relation “leads to”
  5.3 The relation “communicates with”
  5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains
19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain
24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. A few starred sections should be considered “advanced” and can be omitted at first reading. Also, I tried to put all terms in blue small capitals whenever they are first encountered; “important” formulae are placed inside a coloured box. At the end, I have a little mathematical appendix.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The “time” can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: mẍ = F. Here, x = x(t) denotes the position of the particle at time t, its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our “state” be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

xn+1 = yn,
yn+1 = rem(xn, yn).

In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large”, such as finding the determinant of a matrix with 10,000 rows and 10,000 columns or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫_a^b f(x) dx of a complicated function f: suppose that f is positive and bounded above by B. Choose a pair (X0, Y0) of random numbers, a ≤ X0 ≤ b, 0 ≤ Y0 ≤ B, uniformly. Count 1 if Y0 < f(X0) and 0 otherwise. Repeat with (Xn, Yn), for steps n = 1, 2, . . .; each time count 1 if Yn < f(Xn) and 0 otherwise. Perform this a large number N of times. Count the number of 1's and divide by N. The resulting ratio is an approximation of ∫_a^b f(x) dx. Try this on the computer!

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables, instead of deterministic objects. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively.

[Figure: the cage, with its two cells labelled 1 and 2.]

A mouse lives in the cage. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1 it is either still in 1 or has moved to 2; it does so regardless of where it was at earlier times, i.e. past states are not needed. Similarly when it is in cell 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05, and it moves from 2 to 1 with probability β = 0.99. We can summarise this information by the transition diagram:

¹ Stochastic means random.
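The hit-or-miss Monte Carlo method just described is easy to try on a computer. Below is a minimal sketch in Python (mine, not part of the notes); the choice of f, of the interval [a, b] and of the bound B is purely illustrative. Note that the ratio of 1's must be scaled by the area (b − a)B of the bounding rectangle to approximate the integral.

```python
import random

def monte_carlo_integral(f, a, b, B, N=100_000):
    """Hit-or-miss Monte Carlo: estimate the integral of f over [a, b],
    assuming 0 <= f(x) <= B on [a, b]."""
    hits = 0
    for _ in range(N):
        x = random.uniform(a, b)   # X_n uniform on [a, b]
        y = random.uniform(0, B)   # Y_n uniform on [0, B]
        if y < f(x):               # count 1 if the point falls below the graph of f
            hits += 1
    # fraction of hits, scaled by the area of the bounding rectangle
    return (hits / N) * (b - a) * B

if __name__ == "__main__":
    # example: the integral of x^2 over [0, 1] equals 1/3
    print(monte_carlo_integral(lambda x: x * x, a=0.0, b=1.0, B=1.0))
```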

[Transition diagram: from state 1 to state 2 with probability α (stay at 1 with probability 1 − α); from state 2 to state 1 with probability β (stay at 2 with probability 1 − β).]

Another way to summarise the information is by the 2×2 transition probability matrix

P = ( 1−α    α  ) = ( 0.95  0.05 )
    (  β    1−β )   ( 0.99  0.01 )

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 ≈ 20 minutes. (This is the mean of the geometric distribution with parameter α.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So if he wishes to spend more than what is available, he can't. Assume that Dn, Sn are random variables with distributions FD, FS, respectively. Assume also that X0, D0, D1, D2, . . ., S0, S1, S2, . . . are independent random variables. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is defined by:

px,y := P(Xn+1 = y | Xn = x).

We easily see that if y > 0 then

px,y = P(Dn − Sn = y − x)
     = Σz≥0 P(Sn = z, Dn = z + y − x)
     = Σz≥0 P(Sn = z) P(Dn = z + y − x)
     = Σz≥0 FS(z) FD(z + y − x).
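The last formula can be evaluated numerically once FS and FD are specified. A small sketch (not from the notes); the two geometric distributions used below are illustrative choices of mine, not the notes', and the infinite sums are truncated.

```python
def p(x, y, FD, FS, z_max=500):
    """p_{x,y} for X_{n+1} = max(X_n + D_n - S_n, 0); FD(k) = P(D_n = k), FS(k) = P(S_n = k).
    Sums are truncated at z_max, so the values are very slightly approximate."""
    if y > 0:
        # formula from the text: sum over z >= 0 of FS(z) FD(z + y - x)
        return sum(FS(z) * FD(z + y - x) for z in range(z_max) if z + y - x >= 0)
    # y = 0 happens exactly when D_n - S_n <= -x, i.e. the account is emptied
    return sum(FS(z) * sum(FD(d) for d in range(z - x + 1)) for z in range(x, z_max))

if __name__ == "__main__":
    # assumed (illustrative) distributions: geometric on {0, 1, 2, ...}
    FD = lambda k: 0.5 * 0.5 ** k
    FS = lambda k: 0.3 * 0.7 ** k
    x = 10
    row_sum = sum(p(x, y, FD, FS) for y in range(0, x + 200))
    print("p_{10,0} =", p(x, 0, FD, FS), "  row sum =", row_sum)  # row sum should be close to 1
```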

This is something that, in principle, can be computed from FS and FD. Of course, if y = 0, we have

px,0 = 1 − Σy≥1 px,y.

The transition probability matrix is

    ( p0,0  p0,1  p0,2  ··· )
P = ( p1,0  p1,1  p1,2  ··· )
    ( p2,0  p2,1  p2,2  ··· )
    (  ···   ···   ···  ··· )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?
We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

[Transition diagram: the states are the integers . . . , −2, −1, 0, 1, 2, 3, . . .; from each one the walker moves one step to the left or one step to the right, each with probability 1/2.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
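The questions just listed can be explored numerically before any theory is developed. A minimal sketch (not part of the notes) that simulates the walk; p = 1/2 gives the symmetric walk of the diagram, p ≠ 1/2 the biased variant.

```python
import random

def walk(p=0.5, steps=10_000, start=0):
    """Simulate a simple random walk: step +1 with probability p, else -1.
    Returns the final position and the number of visits to 0."""
    x = start
    visits_to_zero = 0
    for _ in range(steps):
        x += 1 if random.random() < p else -1
        if x == 0:
            visits_to_zero += 1
    return x, visits_to_zero

if __name__ == "__main__":
    final_position, zeros = walk(p=0.5)
    print("final position:", final_position, "returns to 0:", zeros)
```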

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Transition diagram on the states H (healthy), S (sick) and D (dead); the arrows carry the probabilities 0.3 (H to S), 0.8 (S to H), 0.01 (H to D) and 0.1 (S to D).]

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H, pS,S and pD,D. We have pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection. Also, pD,H = pD,S = 0.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
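The lifetime question can be explored numerically once the missing diagonal entries are filled in. The sketch below is my own (not part of the notes), and the numerical values are my reading of the garbled diagram, so they should be treated as assumptions; it simulates a currently healthy individual until absorption in D.

```python
import random
from collections import Counter

# transition probabilities of the health chain (H = healthy, S = sick, D = dead);
# the off-diagonal values are assumptions read off the diagram above
P = {
    "H": {"H": 1 - 0.3 - 0.01, "S": 0.3, "D": 0.01},
    "S": {"H": 0.8, "S": 1 - 0.8 - 0.1, "D": 0.1},
    "D": {"D": 1.0},
}

def step(state):
    """One step of the chain from the given state."""
    u, acc = random.random(), 0.0
    for nxt, prob in P[state].items():
        acc += prob
        if u < acc:
            return nxt
    return state

def lifetime():
    """Months until a currently healthy individual reaches state D."""
    state, months = "H", 0
    while state != "D":
        state = step(state)
        months += 1
    return months

if __name__ == "__main__":
    samples = [lifetime() for _ in range(20_000)]
    print("mean lifetime (months):", sum(samples) / len(samples))
    print("most common lifetimes :", Counter(samples).most_common(5))
```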

the future process (Xm . 2. . Since S is countable. it is customary to call (Xn ) a Markov chain. 3. .3 3. So we will omit referring to T explicitly. . n + m] Xk = ik | Xn = in . 3. . in+1 ∈ S. Proof. We will also assume. Xn−1 ) given Xn . (1) 6 . . 3. We focus almost exclusively to the ﬁrst choice. . . . Usually.1 The Markov property Deﬁnition of the Markov property Consider a (ﬁnite or inﬁnite) sequence of random variables {Xn . . Lemma 1. . We say that this sequence has the Markov property if. .} or N = {1. m < n. if the claim holds. n ∈ T}. where T is a subset of the integers. T = Z+ = {0. Xn+m ) and (X0 . X0 = i0 ) = P (∀ k ∈ [n + 1. Sometimes we say that {Xn . 1. . the Markov property. . . . almost exclusively (unless otherwise said explicitly). (Xn . which precisely means that (Xn+1 . . Let us now write the Markov property in an equivalent way. Xn = in ) = P (X0 = i0 )P (X1 = i1 | X0 = i0 )P (X2 = i2 | X1 = i1 ) · · · P (Xn = in | Xn−1 = in−1 ). . and so future is independent of past given the present. . . . viz. . n + m] Xk = ik | Xn = in ). . m ∈ T) is independent of the past process (Xm . . m ∈ T). in+m . for all n ∈ Z+ and all i0 . This is true for any n and m. If (Xn ) is Markov then the claim holds by the conditional independence of Xn+1 and (X0 . that the Xn take values in some countable set S. . The elements of S are frequently called states. . On the other hand. called the state space. .} or the set Z of all integers (positive. . . unless needed. m > n. X0 = i0 ) = P (Xn+1 = in+1 | Xn = in ). for any n. or negative). for any n ∈ T. zero. n ∈ Z+ ) is Markov if and only if. . . X1 = i1 . 2. . X2 = i2 . we can see that by applying it several times we obtain P (∀ k ∈ [n + 1. Xn−1 ) are independent conditionally on Xn . . . . . n ∈ T} is Markov instead of saying that it has the Markov property. . conditionally on Xn . . P (Xn+1 = in+1 | Xn = in . m and any choice of states i0 . .2 Transition probability and initial distribution We therefore see that joint probabilities can be expressed as P (X0 = i0 .

. Therefore. or as P(m. (2) which is reminiscent of matrix multiplication. we deﬁne. n) if m ≤ ℓ ≤ n. n) = P(m. n). viz.i1 (m. n + 1) is called transition probability from state i at time n to state j at time n + 1. then the matrices are inﬁnitely large. j ∈ S.. . . n) = 1(i = j) and.which means that the two functions µ(i) := P (X0 = i) . n) = P(m. n + 1) do not depend on n . we can write (2) as P(m. Of course. n). .3 Time-homogeneous chains We will specialise to the case of time-homogeneous chains.j (m.i (n − 1. 7 . X2 = i2 . pi. n) := P (Xn = j | Xm = i) is the transition probability from state i at time m to state j time n ≥ m. ℓ) P(ℓ.in (n − 1.3 In this new notation. of course. pi. i∈S i. 2 3 And. remembering a bit of Algebra. But we will circumvent them by using Probability only. If |S| = d then we have d × d matrices. n) = i1 ∈S ··· in−1 ∈S pi.i1 (0.j (n. all probabilities of events associated with the Markov chain This causes some analytical diﬃculties. X1 = i1 .j (n. m + 1) · · · pin−1 . n). pi. n) := [pi. 2) · · · pin−1 . 1) pi1 .j∈S . m + 2) · · · P(n − 1. by a result in Probability. pi. n + 1) := P (Xn+1 = j | Xn = i) . will specify all joint distributions2 : P (X0 = i0 . which. by (1). m + 1)P(m + 1.j (n. The function µ(i) is called initial distribution. for each m ≤ n the matrix P(m. More generally. but if S is an inﬁnite set.i2 (1. 3. requires some ordering on S) and whose entries are pi.j (m. n)]i.j (n.j (m. n ≥ 0. Xn = in ) = µ(i0 ) pi0 . something that can be called semigroup property or Chapman-Kolmogorov equation. a square matrix whose rows and columns are indexed by S (and this. The function pi. n). means that the transition probabilities pi. by deﬁnition.j (m.

transition matrix. we write Ei Y := E(Y | X0 = i) = yP (Y = y | X0 = i) = yPi (Y = y). δi (j) := 1. if j = i otherwise.j∈S is called the (one-step) transition probability matrix or. y y 8 . if Y is a real random variable. n) = P × P × · · · · · · × P = Pn−m . of course. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi . if µ assigns probability p to state i1 and 1 − p to state i2 . For example. From now on. while P = [pi. and the semigroup property is now even more obvious because Pa+b = Pa Pb . For example. arranged in a row vector.j the (one-step) transition probability from state i to state j. n−m times for all m ≤ n. Notational conventions Unit mass. to picking state i with probability 1. as follows from (1) and the deﬁnition of P. Then P(m. (3) We call δi the unit mass at i. The matrix notation is useful. Then we have µ n = µ 0 Pn .j (n. b ≥ 0. It corresponds.j ]i. In other words.j . let µn := [µn (i) = P (Xn = i)]i∈S be the distribution of Xn . Thus Pi is the law of the chain when the initial distribution is δi .and therefore we can simply write pi. and call pi. Notice that any probability distribution on S can be written as a linear combination of unit masses. Starting from a speciﬁc state i. A useful convention used for time-homogeneous chains is to write Pi (A) := P (A | X0 = i). n + 1) =: pi. when we say “Markov chain” we will mean “timehomogeneous Markov chain”. Similarly. simply. then µ = pδi1 + (1 − p)δi2 . for all integers a. 0. unless otherwise speciﬁed.
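For a finite chain, the relation µn = µ0 Pn from this section can be exercised directly with a numerical linear algebra package. A small sketch (not part of the notes), using the 2×2 matrix of the mouse example as input:

```python
import numpy as np

# transition matrix of the mouse example (alpha = 0.05, beta = 0.99)
P = np.array([[0.95, 0.05],
              [0.99, 0.01]])

mu0 = np.array([1.0, 0.0])   # unit mass delta_1: start in cell 1

for n in (1, 2, 5, 50):
    mun = mu0 @ np.linalg.matrix_power(P, n)   # mu_n = mu_0 P^n
    print(f"distribution after {n} steps:", mun)
```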

Let Xn be the number of molecules in room 1 at time n.i. Imagine there are N molecules of gas (where N is a large number. It should be clear that if initially place the bug at random on one of the vertices. We now investigate when a Markov chain is stationary.i+1 = P (Xn+1 = i + 1 | Xn = i) = pi. Clearly. 1. for a while we will be able to tell how long ago the process started. let us look at the following example: Example: Motion on a polygon. at time n + 1 it moves to one of the adjacent vertices with equal probability. It is clear. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. we indeed expect that we have N/2 molecules in each room. i. In other words. at any point of time n. . n ≥ 0) will not be stationary: indeed. say N = 1025 ) in a metallic chamber with a separator in the middle. On the average. dividing it into two identical rooms. To provide motivation and intuition. The gas is placed partly in room 1 and partly in room 2. as it takes some time before the molecules diﬀuse from room to room.e.i−1 = P (Xn+1 N −i N i = i − 1 | Xn = i) = . Example: the Ehrenfest chain. Hence the uniform probability distribution is stationary. Then pi. n = m. the law of a stationary process does not depend on the origin of time. then. . because it is impossible for the number of molecules to remain constant at all times. that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. . We can model this by saying that the chance that a molecule move from room 1 to room 2 is proportional to the number of molecules in room 1. selecting a vertex with probability 1/7. and vice versa.) is stationary if it has the same law as (Xn . n = 0. then. m + 1. If at time n the bug is on a vertex. . the distribution of the bug will still be the same. a sequence of i. . But this will not work. Then a hole is opened in the separator and we watch what happens as the molecules diﬀuse between the two rooms. for any m ∈ Z+ . This is a model of gas distributed between two identical communicating chambers. random variables provides an example of a stationary process.d. Consider a canonical heptagon and let a bug move on its vertices. Another guess then is: Place 9 .4 Stationarity We say that the process (Xn . N If we start the chain with X0 = N (all molecules in room 1) then the process (Xn . .).

X1 = i1 .i + P (X0 = i + 1)pi+1. we are led to consider: The stationarity problem: Given [pij ]. .) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. . P (Xn = i) = π(i) for all n. 2−N i−1 i+1 N N (The dots signify some missing algebra. n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. .i + π(i + 1)pi+1. the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . i 0 ≤ i ≤ N. X2 = i2 . which you should go through by yourselves. . stationary or invariant distribution. We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. use (1) to see that P (X0 = i0 . . Proof. a Markov chain is stationary if and only if the distribution of Xn does not change with n. in ∈ S. Changing notation. P (X1 = i) = j P (X0 = j)pj. (a binomial distribution). . Xn = in ) = P (Xm = i0 . Since.i = P (X0 = i − 1)pi−1. Xm+1 = i1 . . If we can do so. by (3). hence. for all m ≥ 0 and i0 . Lemma 2. . 10 . . The equations (4) are called balance equations. ﬁnd those distributions π such that πP = π. .each molecule at random either in room 1 or in room 2. A Markov chain (Xn .i = N N −i+1 N i+1 2−N + = · · · = π(i). Indeed. For the if part. . If we do so. then we have created a stationary Markov chain. Xm+2 = i2 .i = π(i − 1)pi−1. In other words. Xm+n = in ). . then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. (4) Such a π is called . But we need to prove this. The only if part is obvious.

j = 1. d! x ∈ Sd . we use the normalisation condition π(1) = i≥1 j≥i = 1. i≥1 π(i) −1 To compute π(1). which in turn means that the normalisation condition cannot be satisﬁed. Then (Xn . This means that the solution to the balance equations is identically zero. Example: card shuﬄing. π must be a probability: π(i) = 1. i ≥ 0. i = 1. . Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. i ≥ 1. In addition to that. . . p1. . To ﬁnd the stationary distribution. in matrix notation. Example: waiting for a bus. . After bus arrives at a bus stop. we must be careful. . . xd ). Consider a deck of d = 52 cards and shuﬄe them in the following (stupid) way: pick two cards at random and swap their positions. then we run into trouble: indeed.d. The times between successive buses are i. 2 .. .i.i = π(0)p(i) + π(i + 1). . 11 . and so each π(i) will be zero. . 2. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. Thus.Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. which. . and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. It is easy to see that the stationary distribution is uniform: π(x) = 1 . where 1 is a column whose entries are all 1. .i = 1. n ≥ 0) is a Markov chain with transition probabilities pi+1. where all the xi are distinct. But here. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. . random variables.i = p(i). write the equations (4): π(i) = π(j)pj. Keep doing it continuously. i∈S The matrix P has the eigenvalue 1 always because j∈S pi. Who tells us that the sum inside the parentheses is ﬁnite? If the sum is inﬁnite. each x ∈ Sd is of the form x = (x1 . The state space is S = N = {1. can be written as P1 = 1. which gives p(j) .}. d}. if i > 0 i ∈ N. the next one will arrive in i minutes with probability p(i). j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). π(1) will be zero.
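For a chain with finitely many states, the balance equations πP = π together with the normalisation condition are just a linear system, so a stationary distribution can be computed numerically. A sketch of one way to do this (mine, not from the notes):

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi together with sum(pi) = 1 for a finite transition matrix P."""
    n = P.shape[0]
    # stack (P^T - I) with a row of ones enforcing the normalisation condition
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

if __name__ == "__main__":
    # the 2-state mouse chain; the text gives pi approximately (0.952, 0.048)
    P = np.array([[0.95, 0.05],
                  [0.99, 0.01]])
    print(stationary_distribution(P))
```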

Example: bread kneading. 1}. So our assumption really means: ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is ﬁnite. .. . Hence choosing the initial state at random in this manner results in a stationary Markov chain. ξ2 (n) = 1 − ε1 .* (This is slightly beyond the scope of these lectures and can be omitted. P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 The proof is by induction: if the ξ1 (n).i. . 1] as a ﬂexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle. 1} for all i. We can show that if we start with the ξr (0). Here ξ(n) = (ξ1 (n). ξi ∈ {0. i. . . But the function f (x) = min(2x. . . what we have really shown is that this assumption is a necessary and suﬃcient condition for the existence and uniqueness of a stationary distribution. . . So it goes beyond our theory. . Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. ξr+1 (n) = 1 − εr ). .) Consider an inﬁnite binary vector ξ = (ξ1 . . . .i. are i. If ξ1 = 1 then x ≥ 1/2. ξ(n + 1) = ϕ(ξ(n)). . and the rule maps x into 2(1 − x). . . . P (ξ1 (n + 1) = ε1 . We have thus deﬁned a mapping ϕ from binary vectors into binary vectors. Remarks: This is not a Markov chain with countably many states. . .. and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2 .) is the state at time n. . 1] (think of [0. . . from (5). 2(1 − x)). .d. .).).. The Markov chain is obtained by applying the mapping ϕ again and again. So. . As for the name of the chain (bread kneading) it can be justiﬁed as follows: to each ξ there corresponds a number x between 0 and 1 because ξ can be seen as the binary representation of the real number x. r = 1. any r = 1. i≥1 j≥i j≥1 i≥1 j≥1 and the latter sum is the expected time between two successive bus arrivals. . . . . with P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 then. (5) 12 . If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. ξ2 (n).e. 1. . . 2. 1 − ξ3 . ξ2 (n). if ξ1 = 1 then shift to the left and ﬂip all bits to obtain (1 − ξ2 .). ξr (n + 1) = εr ) = P (ξ1 (n) = 0.d. with P (ξr (0) = 1) = P (ξr (0) = 0) = 1/2 then the same will be true for all n.Let us work out the sum to see if this assumption can be expressed in more “physical” terms: p(j) = 1(j ≥ i)p(j) = jp(j). ξ3 . i. . for any n = 0. . the same is true at n + 1. . applied to the interval [0. . 2. . You can check that . . . 2. εr ∈ {0. But to even think about it will give you some strength. ξ2 . ξr+1 (n) = εr ) + P (ξ1 (n) = 1. So let us do so. . and any ε1 . ξ2 (n) = ε1 .

therefore one of them is always redundant.i + j∈Ac π(j)pj.i Multiply π(i) by j∈S pi. then choose A = {i} and see that you obtain the balance equations. then the above are d + 1 equations. Ac ) is the total current from A to Ac .Note on terminology: steady-state. If the latter condition holds. A). When a Markov chain is stationary we refer to it as being in 4. So F (A. Instead of eliminating equations.j = j∈A π(j)pj.i 13 . for all A ⊂ S. Ac ) = F (Ac . Ac ) := π(i)pi.j . π is a stationary distribution if and only if π1 = 1 and F (A. Conversely. because the ﬁrst d equations are linearly dependent: the sum of both sides equals 1. as we will see in examples.i = j∈A π(j)pj.i . This is OK.j as a “current” ﬂowing from i to j. we introduce more. If the state space has d states.j (which equals 1): π(i) j∈S pi. π(j)pj. The convenience is often suggested by the topological structure of the chain. Fix a set A ⊂ S and write π(i) = j∈S π(j)pj.i + j∈Ac π(j)pj.1 Finding a stationary distribution As explained above. Proof. But we only have d unknowns. amongst them. Writing the equations in this form is not always the best thing to do. We have: Proposition 1. a stationary distribution π satisﬁes π = πP π1 = 1 In other words. i∈A j∈Ac Think of π(i)pi. suppose that the balance equations hold. π(i) = j∈S (balance equations) (normalisation condition). expecting to be able to choose. j∈S i ∈ S. Let us introduce the notion of “probability flow” from a set A of states to its complement under a distribution π: F (A. π(j) = 1. a set of d linearly independent equations which can be more easily solved.

π(1) = β . 1.2% of the time in 1. we ﬁnd π(1) = 0. 2. . Example: a walk with a barrier. Thus.j + j∈A j∈Ac π(i)pi. α+β π(2) = cα.04 room 1.04 π(2) = 0. the mouse spends only 95. 3. The constant c is found from π(1) + π(2) = 1.048. α+β π(2) = α . gives us ﬂexibility. 14 i = 1.952. p22 = 1 − β. Instead. .05. . One can ask how to choose a minimal complete set of linearly independent equations.) Assume 0 < α. The balance equations are: π(i) = pπ(i − 1) + qπ(i + 1). in steady state. A ) A c c F( A . That is. We have π(1)α = π(2)β.8% of the time in room 2. while the remaining two terms give the equality we seek. Summing over i ∈ A.i This is true for all i ∈ S. here is an example: Example: Consider the 2-state Markov chain with p12 = α. β < 1. we see that the ﬁrst term on the left cancels with the ﬁrst on the right. The extra degree of freedom provided by the last result.. and 4.i + j∈Ac π(j)pj. p11 = 1 − α.99 (as in the mouse example). (Consequently.99 ≈ 0. A) Schematic representation of the ﬂow balance relations.Now split the left sum also: π(i)pi. but we shall not do this here. which immediately tells us that π(1) is proportional to β and π(2) proportional to α: π(1) = cβ.05 ≈ 0. p21 = β. . where p + q = 1: p p p 0 q 1 q 2 q 3 q 4 .j = j∈A π(j)pj. A c F(A . p q Consider the following chain.. If we take α = 0. β = 0.

These are much simpler than the previous ones. .e. This is i=0 possible if and only if p < q. If i ≥ 1 we have F (A. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows. • Suppose that i j. i leads to j (written as i Notice that this relation is reflexive): ∀i ∈ S Furthermore: Theorem 1. .j > 0.1 Topological structure The graph of a chain This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. we can solve them immediately. E) where S is the state space and E ⊂ S × S deﬁned by (i. For i ≥ 1. (iii) Pi (Xn = j) > 0 for some n ≥ 0. In other words.j > 0 for some n ∈ N and some states i1 . a pair (S. . provided that ∞ π(i) = 1. π(i) = (p/q)i π(0). (ii) pi. Does this provide a stationary distribution? Yes. (If p ≥ 1/2 there is no stationary distribution. in−1 . .i1 pi1 . 1. The following are equivalent: (i) i j i i j) ⇐⇒ Pi (∃n ≥ 0 Xn = j) > 0. Ac ) = π(i − 1)p = π(i)q = F (Ac . and E a set of directed edges.e. If i = j there is nothing to prove. This is often cast in terms of a digraph (directed graph). In fact. . . S is a set of vertices.) 5 5. i − 1}. starting from i.But we can make them simpler by choosing A = {0. the chain will visit j at some ﬁnite time. p < 1/2. j) ∈ E ⇐⇒ pi. . In graph-theoretic terminology. Proof. Since 0 < Pi (∃n ≥ 0 Xn = j) ≤ 15 Pi (Xn = j). So assume i = j. i. n≥0 . A).i2 · · · pin−1 . i.2 The relation “leads to” We say that a state i leads to state j if. . 5.

[i] = [j] if and only if i identical or completely disjoint. i}. The class {4. Corollary 1.. By assumption. The communicating class corresponding to the state i is.3 The relation “communicates with” Deﬁne next the relation i communicates with j (written as i Obviously. we see that there are four communicating classes: {1}. In addition. The relation is transitive: If i j and j k then i k. it partitions S into equivalence classes known as communicating classes. Two communicating classes are either 1 2 3 4 5 6 In the example of the ﬁgure. j) ⇐⇒ i j and j i. Since Pi (Xn = j) > 0. • Suppose that (ii) holds. meaning that (ii) holds. we have that the sum is also positive and so one of its terms is positive.j . the set of all states that communicate with i: [i] := {j ∈ S : j So. 5} is closed: if the chain goes in there then it never moves out of it. However. In other words. reﬂexive and transitive. where the sum extends over all possible choices of states in S. by deﬁnition. {4. 3}. Look at the last display again. The ﬁrst two classes diﬀer from the last two in character.i1 pi1 . Equivalence relation means that it is symmetric. We just observed it is symmetric.. and so (iii) holds. 5. 16 . 3} is not closed. {2.. the sum contains a nonzero term. We have Pi (Xn = j) = i1 . {6}.. the class {2. It is reﬂexive and transitive because is so. by deﬁnition. The relation j ⇐⇒ j i). (6) j. 5}. (iii) holds. • Suppose now that (iii) holds. is an equivalence relation.the sum on the right must have a nonzero term. this is symmetric (i Corollary 2. Just as any equivalence relation. because Pi (Xn = j) ≤ Pi (∃m ≥ 0 Xm = j). Hence Pi (Xn = j) > 0. Proof. (i) holds.i2 · · · pin−1 .in−1 pi.
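The relations “leads to” and “communicates with” depend only on which transition probabilities are positive, so the communicating classes can be computed mechanically by graph reachability. A sketch (not from the notes); the positivity pattern used in the demo is a hypothetical one, not the chain of the figure.

```python
def communicating_classes(adj):
    """adj[i] = set of states j with p_ij > 0.  Returns the communicating classes."""
    states = list(adj)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in adj[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        return seen

    reach = {i: reachable(i) for i in states}
    classes = []
    for i in states:
        cls = {j for j in states if j in reach[i] and i in reach[j]}   # i <-> j
        if cls not in classes:
            classes.append(cls)
    return classes

if __name__ == "__main__":
    # hypothetical positivity pattern (an assumption, for illustration only)
    adj = {1: {1, 2}, 2: {3}, 3: {2, 4}, 4: {5}, 5: {4}, 6: {1, 6}}
    print(communicating_classes(adj))
```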

Closed communicating classes are particularly important because they decompose the chain into smaller. Proof. The communicating classes C1 .i2 · · · pin−1 . C3 are not closed.i = 1 then we say that i is an absorbing state. . Thus. in−1 such that pi. . we conclude that j ∈ C.More generally. . parts. pi1 . If all states communicate with all others. we say that a set of states C ⊂ S is closed if pi. In other words still. its transition matrix P) is called irreducible. C2 . the state space S is a closed communicating class. the single-element set {i} is a closed communicating class. In graph terms. The classes C4 .j > 0.i1 > 0. starting from any state. To put it otherwise. Otherwise. and C is closed. Note that there can be no arrows between closed communicating classes. we can reach any other state. 17 . Suppose ﬁrst that C is a closed communicating class. C1 C2 C3 C4 C5 The general structure of a Markov chain is indicated in this ﬁgure. Lemma 3. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. The converse is left as an exercise. A communicating class C is closed if and only if the following holds: If i ∈ C and if i j then j ∈ C. Note: An inessential state i is such that there exists a state j such that i j but j i. Note: If we consider a Markov chain with transition matrix P and ﬁx n ∈ N then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. the state is inessential. more manageable. Similarly. If a state i is such that pi. pi. C5 in the ﬁgure above are closed communicating classes. Suppose that i j. . more precisely. Why? A state i is essential if it belongs to a closed communicating class. and so i2 ∈ C.i1 pi1 . then the chain (or. j∈C for all i ∈ C. and so on. We then can ﬁnd states i1 . Since i ∈ C.i2 > 0. A general Markov chain can have an inﬁnite number of communicating classes.j = 1. There can be arrows between two nonclosed communicating classes but they are always in the same direction. we have i1 ∈ C.

. Therefore if pii > 0 then pii (n) (ℓn) > 0 for all ℓ ∈ N. X3 ∈ C2 . So. Theorem 2 (the period is a class property). If i j there is α ∈ N with pij > 0 and if 18 . (Can you ﬁnd an example of a chain with period 3?) We will now give the deﬁnition of the period of a state of an arbitrary chain. 4} at the next step and vice versa. d(j) = gcd Dj . n ≥ 0. ℓ ∈ N. we have So pii (2n) (n) pij ≥ pii (n) 2 (m+n) ≥ pik pkj . Consider two distinct essential states i. Let us denote by pij the entries of Pn . The chain alternates between the two sets.e. The period of an inessential state is not deﬁned. i. (n) Proof. 3} at some point of time then. Let Di = {n ∈ N : pii > 0}. it will move to the set C2 = {2. X2 ∈ C1 . If i j then d(i) = d(j). whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. j ∈ S. A state i is called aperiodic if d(i) = 1. (m) (n) for all m. all states form a single closed communicating class. Dj = {n ∈ (n) (α) N : pjj > 0}. Since Pm+n = Pm Pn . . d(i) = gcd Di . more generally. We say that the chain has period 2. pii (ℓn) ≥ pii (n) ℓ . that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1. i. Another way to say this is: (n) (m) If the integer n divides m (denote this by: n | m) and if pii > 0 then pii > 0. The period of an essential state is deﬁned as the greatest common divisor of all natural numbers n with such that it is possible to return to i in n steps: d(i) := gcd{n ∈ N : Pi (Xn = i) > 0} . Notice however. X4 ∈ C1 .4 Period Consider the following chain: 1 2 4 3 The chain is irreducible. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2 . j.5. . . and.

Theorem 3.) 19 . Because n is suﬃciently large. In particular. (Here Cd := C0 . Corollary 3. Cd−1 such that pij = 1. An irreducible chain has period d if one (and hence all) of the states have period d. Of particular interest. we have n ∈ Di as well. Pick n2 . . i. Then we can uniquely partition the state space S into d disjoint subsets C0 . where the remainder r ≤ n1 − 1. Now let n ∈ Di . the chain is called aperiodic. So pjj ≥ pij pji > 0. . . we obtain d(i) ≤ d(j) as well. Proof. 1. So d(i) = d(j). r = 0. Since n1 . . Divide n by n1 to obtain n = qn1 + r. C1 . Therefore d(j) | α + β.e. as deﬁned earlier. . if d = 1. From the last two displays we get d(j) | n for all n ∈ Di . . If a number divides two other numbers then it also divides their diﬀerence. all states communicate with one another. n2 ∈ Di . Formally. n1 ∈ Di such that n2 − n1 = 1. Theorem 4. (n) (n) (α+n+β) (α) (n) (β) (β) (α+β) (α) (β) This ﬁgure shows the internal structure of a closed communicating class with period d = 4. d − 1. More generally. Therefore n = qn1 + r(n2 − n1 ) = (q − r)n1 + rn2 . showing that α + β ∈ Dj . Cd−1 such that the chain moves cyclically between them. . chains where. q − r > 0. showing that α + n + β ∈ Dj . C1 . . Consider an irreducible chain with period d. So d(j) is a divisor of all elements of Di . . if an irreducible chain has period d then we can decompose the state space into d sets C0 . Arguing symmetrically.j i there is β ∈ N with pji > 0. Since d(i) is the greatest common divisor we have d(i) ≥ d(j) (in fact. j ∈ S. Then pii > 0 and so pjj ≥ pij pii pji > 0. d(j) | d(i)). Therefore d(j) | α + β + n for all n ∈ Di . Let n be suﬃciently large. An irreducible chain with ﬁnitely many states is aperiodic if and only if there (n) exists an n such that pij > 0 for all i. j∈Cr+1 i ∈ Cr . . . If i is an aperiodic state then there exists n0 such that pii > 0 for all n ≥ n0 . . are irreducible chains. this is the content of the following theorem.
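The period d(i) = gcd{n ∈ N : pii(n) > 0} of a state of a finite chain can be computed from powers of the transition matrix. A small sketch (mine, not the notes'); the 4-state matrix below is an assumed example of a chain with period 2, in the spirit of the figure at the start of Section 5.4.

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, max_n=None):
    """gcd of all n <= max_n with (P^n)_{ii} > 0; max_n is a heuristic cut-off."""
    n_states = P.shape[0]
    max_n = max_n or n_states * n_states
    Pn = np.eye(n_states)
    return_times = []
    for n in range(1, max_n + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            return_times.append(n)
    return reduce(gcd, return_times) if return_times else None

if __name__ == "__main__":
    # assumed chain alternating between {1, 3} and {2, 4} (states 0..3 here): period 2
    P = np.array([[0.0, 0.5, 0.0, 0.5],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.5, 0.0, 0.5, 0.0]])
    print(period(P, 0))   # prints 2
```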

20 . id−1 d 6 Hitting times and ﬁrst-step analysis Consider a Markov chain with transition probability matrix P. for example. i. If X0 ∈ A the two times coincide. As an example. Then pick i1 such that pi0 i1 > 0.. . necessarily. We now show that if i belongs to one of these classes and if pij > 0 then..e the set of states j such that i j). Suppose pij > 0 but j ∈ C1 but. Hence it partitions the state space S into d disjoint equivalence classes. id ∈ C0 . The reader is warned to be alert as to which of the two variables we are considering at each time. necessarily. Pick a state i0 and let C0 be its equivalence class (i. . after it takes one step from its current position.id > 0. i2 i0 i0 in a number of steps which is an integer multiple of d − 1 (why?). then. j ∈ C2 . The existence of such a path implies that it is possible to go from i0 to i.. . i1 . We deﬁne the hitting time4 of a set of states A by TA := inf{n ≥ 0 : Xn ∈ A}. i ∈ C0 . and by the assumptions that d d j. Assume d > 1.. We use a method that is based upon considering what the Markov chain does at time 1. that is why we call it “first-step analysis”. i1 . We are interested in deriving formulae for the probability that this time is ﬁnite as well as the expectation of this time. Continue in this manner and deﬁne states i0 . which contradicts the deﬁnition of d.Proof. Let C1 be the equivalence class of i1 . We show that these classes satisfy what we need. It is easy to see that if we continue and pick a further state id with pid−1 . Deﬁne the relation i d j ⇐⇒ pij (nd) > 0 for some n ≥ 0. Take. C1 . say. . with corresponding classes C0 .. consider the chain 1 1/3 1 0 4 1/2 1 1/6 2 We shall later consider the time inf{n ≥ 1 : Xn ∈ A} which diﬀers from TA simply by considering n ≥ 1 instead of n ≥ 0. Such a path is possible because of the choice of the i0 . j belongs to the next class. Notice that this is an equivalence relation. Consider the path i0 → i → j → i2 → i3 → · · · → id → i0 . We avoid excessive notation by using the same letter for both. .e... Cd−1 .

′ is the remaining time until A is hit).e. Therefore. Combining the above. We ﬁrst have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. let ϕ′ (i) be another solution. We then have: Theorem 5. if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S. conditionally on X1 . . But the Markov chain is homogeneous. then TA ≥ 1. ϕ(i) = j∈S i∈A pij ϕ(j). Furthermore. By self-feeding this equation.). we have Pi (TA < ∞) = ϕ(j)pij . we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A pjk ϕ′ (k) k∈A 21 . X2 . a ′ function of X1 . j∈S as needed. Hence: ′ ′ P (TA < ∞|X1 = j. which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). ′ Proof.It is clear that P1 (T0 < ∞) < 1. because the chain may end up in state 2. If i ∈ A. . But what exactly is this probability equal to? To answer this in its generality. and so ϕ(i) = 1. Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). For the second part of the proof. . The function ϕ(i) satisﬁes ϕ(i) = 1. the event TA is independent of X0 . So TA = 1 + TA . If i ∈ A then TA = 0. X0 = i) = P (TA < ∞|X1 = j). i ∈ A. and deﬁne ϕ(i) := Pi (TA < ∞). ﬁx a set of states A.

the second equals Pi (TA = 2). j∈A i ∈ A. i = 0. Furthermore. 3 22 . so the ﬁrst two terms together equal Pi (TA ≤ 2). if the chain starts from 2 it will never hit 0. If ′ i ∈ A. Proof. we choose A = {0}. If i ∈ A then. Example: Continuing the previous example. as above. and let ψ(i) = Ei TA . ′ E i TA = 1 + E i TA = 1 + j∈S pij Ej TA = 1 + j∈A pij ψ(j). Letting n → ∞. let A = {0. Example: In the example of the previous ﬁgure. we obtain ϕ′ (i) ≥ Pi (TA < ∞) = ϕ(i). ψ(0) = ψ(2) = 0. obviously. 2}. we have 1 1 ϕ(1) = ϕ(1) + ϕ(0). ψ(i) = 1 + i∈A pij ψ(j). We have: Theorem 6. TA = 0 and so ψ(i) := Ei TA = 0. The second part of the proof is omitted. Start the chain from i. Therefore. By continuing self-feeding the above equation n times. We let ϕ(i) = Pi (T0 < ∞). (If the chain starts from 0 it takes no time to hit 0. 2.) As for ϕ(1). and ϕ(2) = 0. The function ψ(i) satisﬁes ψ(i) = 0. Clearly. because it takes no time to hit A if the chain starts from A. By omitting the last term we obtain ϕ′ (i) ≥ Pi (TA ≤ 2). if ψ ′ (i) is any other solution of these equations then ψ ′ (i) ≥ ψ(i) for all i ∈ S. the set that contains only state 0. 1. we obtain ϕ′ (i) ≥ Pi (TA ≤ n). ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Next consider the mean time (mean number of steps) until the set A is hit for the ﬁrst time. On the other hand. then TA = 1 + TA . 3 2 so ϕ(1) = 3/4.We recognise that the ﬁrst term equals Pi (TA = 1). which is the second of the equations. 1 ψ(1) = 1 + ψ(1). ψ(i) := Ei TA . We immediately have ϕ(0) = 1.
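For a finite chain, the first-step equations of Theorems 5 and 6 are linear systems, so ϕ and ψ can be obtained by a linear solve rather than by hand. A sketch (not from the notes) for the mean hitting time ψ(i) = Ei TA; the test chain, a symmetric walk on {0, . . . , 10} absorbed at both ends, is my own choice of example.

```python
import numpy as np

def mean_hitting_times(P, A):
    """psi(i) = E_i T_A: psi = 0 on A and psi(i) = 1 + sum_j p_ij psi(j) off A.
    Assumes the target set A is reached with probability 1 from every state."""
    n = P.shape[0]
    outside = [i for i in range(n) if i not in A]
    Q = P[np.ix_(outside, outside)]     # transitions among the states outside A
    psi = np.zeros(n)
    psi[outside] = np.linalg.solve(np.eye(len(outside)) - Q, np.ones(len(outside)))
    return psi

if __name__ == "__main__":
    # symmetric walk on {0, ..., 10}, absorbed at 0 and 10 (an assumed test chain)
    n = 11
    P = np.zeros((n, n))
    P[0, 0] = P[n - 1, n - 1] = 1.0
    for x in range(1, n - 1):
        P[x, x - 1] = P[x, x + 1] = 0.5
    psi = mean_hitting_times(P, A={0, n - 1})
    print(psi)   # should equal x * (10 - x) at position x
```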

Now the last sum is a function of the future after time 1. ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). . if x ∈ A. (This should have been obvious from the start. B. consider the following situation: Every time the state is x a reward f (x) is earned. i∈A Furthermore.e. we just changed index from n to n + 1. Clearly. TB for two disjoint sets of states A. X2 . as argued earlier. The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ). ′ TA = 1 + TA . h(x) = f (x). in the last sum. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). i. after one step. Hence. therefore the mean number of ﬂips until the ﬁrst success is 3/2.which gives ψ(1) = 3/2. until set A is reached. of the random variables X1 .. by the Markov property. ϕAB (i) = 0. h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ).) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the ﬁrst-step analysis method. because the underlying experiment is the ﬂipping of a coin with “success” probability 2/3. 23 . This happens up to the ﬁrst hitting time TA ﬁrst hitting time of the set A. It should be clear that the equations satisﬁed are: ϕAB (i) = 1. ′ where TA is the remaining time. We may ask to ﬁnd the probabilities ϕAB (i) = Pi (TA < TB ). . . Then ′ 1+TA x ∈ A. ϕAB is the minimal solution to these equations. because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). otherwise. where. Next. ϕAB (i) = j∈A i∈B pij ϕAB (j).

The successive stakes (Yk . y∈S h(x) = f (x) + x ∈ A. (Also. we have h(4) = 0 1 h(1) = 1 + h(2) 2 1 h(3) = 3 + h(2) 2 1 1 h(2) = 2 + h(1) + h(3). until he dies in room 4. Example: A thief enters sneaks in the following house and moves around the rooms.) 7 Gambler’s ruin Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. Let p the probability of heads and q = 1 − p that of tails. k ∈ N) form the strategy of the gambler. so 24 . collecting “rewards”: when in room i. unless he is in room 4. 2. If heads come up then you win £1 per pound of stake. 1 2 4 3 The question is to compute the average total reward. x∈A pxy h(y). h(1) = 5. 2 2 We solve and ﬁnd h(2) = 8. if just before the k-th toss you decide to bet Yk pounds then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared.Thus. for i = 1. if he starts from room 2. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails). So. he will die. in which case he dies immediately. are h(x) = 0. (Sooner or later. why?) If we let h(i) be the average total reward if he starts from room i. no gambler is assumed to have any information about the outcome of the upcoming toss. your fortune after the n-th toss is n Xn = X0 + k=1 Y k ξk . Clearly. 3. h(3) = 7. he collects i pounds. the set of equations satisﬁed by h. You can remember the latter ones by thinking that the expectecd total reward equals the immediate reward f (x) plus the expected remaining reward.

a ≤ x ≤ b. a + 1. astrological nature.i+1 = p. We use the notation Px (respectively. expectation) under the initial condition X0 = x. ℓ) := Px (Tℓ < T0 ).i−1 = q = 1 − p. Let £x be the initial fortune of the gambler. . if p = q Px (Tb < Ta ) = . and where the states a. Let us consider a concrete problem. as described above. b − a). This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and. Then the probability that the gambler’s fortune will reach level b > x before reaching level a < x equals: x−a − 1 (q/p) (q/p)b−a − 1 . Consider a gambler playing a simple coin tossing game. betting £1 at each integer time against a coin which has P (heads) = p. 25 ϕ(ℓ) = 1. . ℓ).if the gambler wants to have a strategy at all. is our next goal. To solve the problem. b} with transition probabilities pi. The diﬀerence between the gambler’s ruin problem for a company and that for a mere mortal is that. x − a . The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). she must base it on whatever outcomes have appeared thus far. ξk−1 and–as is often the case in practice–on other considerations of e. if p = q = 1/2 b−a Proof. Ex ) for the probability (respectively. Theorem 7. Yk cannot depend on ξk . b are absorbing. We obviously have ϕ(0) = 0. (Think why!) To simplify things. b = ℓ > 0 and let ϕ(x. in the ﬁrst case.g. b − 1. the company has control over the probability p. P (tails) = 1 − p. We wish to consider the problem of ﬁnding the probability that the gambler will never lose her money. She will stop playing if her money fall below a (a < x). pi. a + 2. We ﬁrst solve the problem when a = 0. . It can only depend on ξ1 . Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. . it will make enormous proﬁts). The general solution will then obtained from Px (Tb < Ta ) = ϕ(x − a. In other words. write ϕ(x) instead of ϕ(x. We are clearly dealing with a Markov chain in the state space {a. 0 ≤ x ≤ ℓ. . . . in addition. in simplest case Yk = 1 for all k. . a < i < b. .

ℓ→∞ ℓ→∞ (q/p)x . ℓ→∞ ℓ Px (T0 < ∞) = 1 − lim Finally. then λ = q/p > 1. λℓ − 1 0 ≤ x ≤ ℓ. then λ = q/p < 1. If p > 1/2. Since p + q = 1. Since ϕ(ℓ) = 1. ϕ(x) = pϕ(x + 1) + qϕ(x − 1). Corollary 4. we have λℓ −1 λ−1 δ(0) = 1 which gives δ(0) = (λ − 1)/(λℓ − 1). Let δ(x) := ϕ(x + 1) − ϕ(x). we have obtained ϕ(x. if λ = 1 if λ = 1. λ = 1. λℓ −1 x ℓ. λ = 1. we ﬁnd ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λx−1 + · · · + λ + 1]δ(0) = Since ϕ(ℓ) = 1. This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). ℓ Summarising. by ﬁrst-step analysis. if p = q. We have δ(x) = q δ(x − 1) = p q p 2 0 < x < ℓ. 0 ≤ x ≤ ℓ. we ﬁnd δ(0) = 1/ℓ and so x ϕ(x) = . we can write this as p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)]. if p > 1/2 if p ≤ 1/2. λx − 1 δ(0). δ(x − 2) = · · · = q p x δ(0). We have Px (T0 < ∞) = lim Px (T0 < Tℓ ) = 1 − lim Px (Tℓ < T0 ). λ−1 ϕ(x) = If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). Assuming that λ := q/p = 1. ℓ→∞ λℓ − 1 0−1 λx − 1 = 1 − 0 = 1. The ruin probability of a gambler starting with £x is Px (T0 < ∞) = Proof.and. 1. Px (T0 < ∞) = 1 − lim 26 . and so Px (T0 < ∞) = 1 − lim If p < 1/2. The general case is obtained by replacing ℓ by b − a and x by x − a. ℓ) = λx −1 . and so λx − 1 λx − 1 =1− = λx . We ﬁnally ﬁnd λx − 1 . ℓ→∞ λℓ − 1 x = 1 − 0 = 1.
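The target probabilities of Theorem 7 are easy to check against simulation. A short sketch (mine, not part of the notes):

```python
import random

def prob_reach_b_before_a(x, a, b, p, trials=20_000):
    """Monte Carlo estimate of P_x(T_b < T_a) for the +1/-1 gambling walk."""
    wins = 0
    for _ in range(trials):
        pos = x
        while a < pos < b:
            pos += 1 if random.random() < p else -1
        wins += (pos == b)
    return wins / trials

def formula(x, a, b, p):
    """Closed form from Theorem 7."""
    q = 1 - p
    if p == q:
        return (x - a) / (b - a)
    lam = q / p
    return (lam ** (x - a) - 1) / (lam ** (b - a) - 1)

if __name__ == "__main__":
    x, a, b, p = 5, 0, 10, 0.45
    print("simulation:", prob_reach_b_before_a(x, a, b, p))
    print("formula   :", formula(x, a, b, p))
```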

A. . Recall that the Markov property states that. . . . τ = n) (c) = P (Xn+1 = j1 . Xn ). . This stronger property is referred to as the strong Markov property. . | Xτ = i. the ﬁrst hitting time of the set A. . . . the value +∞. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ . Xτ +2 = j2 .j2 · · · P (Xn = i. A | Xτ . . τ < ∞) and that P (Xτ +1 = j1 .j1 pj1 . P (Xτ +1 = j1 . . . . An example of a random time which is not a stopping time is the last visit of a speciﬁc set. Xn ) and the ordinary splitting property at n. A. τ = n) = P (Xn+1 = j1 . .j1 pj1 . A. . the future process (Xn . Xτ ) if τ < ∞. τ < ∞)P (A | Xτ . . . Our assertions are now easily obtained from this by ﬁrst summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P (Xτ = i. . the future process (Xτ +n . Xτ +2 = j2 . Xτ = i. A. . conditional on τ < ∞ and Xτ = i. . . including. . In particular. Suppose (Xn . τ < ∞) = P (Xn+1 = j1 . . . Xn )1(τ = n). τ < ∞). Xn ) for all n ∈ Z+ . Xτ +2 = j2 . Let τ be a stopping time. for any n. Then. . a stopping time τ is a random variable with values in Z+ ∪{+∞} such that 1(τ = n) is a deterministic function of (X0 . | Xτ . . | Xn = i.8 Stopping times and the strong Markov property A random time is a random variable taking values in the time-set. . 27 . Xn+1 . Xn+2 = j2 . . . . (d) where (a) is just logic.) is independent of the past process (X0 . . Theorem 8 (strong Markov property). Xn ) if we know the present Xn . | Xn = i)P (Xn = i. . . (c) is due to the fact that A ∩ {τ = n} is determined by (X0 .j2 · · · We have: P (Xτ +1 = j1 . Let τ be a stopping time. and (d) is due to the fact that (Xn ) is time-homogeneous Markov chain with transition probabilities pij . . . A. An example of a stopping time is the time TA . . τ < ∞) = pi. . Xn+2 = j2 . . (b) is by the deﬁnition of conditional probability. Xτ ) and has the same law as the original process started from i. . . any such A must have the property that. τ = n)P (Xn = i. for all n ∈ Z+ . . τ = n) = pi. . . . Xn+2 = j2 . τ = n) (a) (b) = P (Xτ +1 = j1 . . n ≥ 0) is a time-homogeneous Markov chain. .Xτ +2 = j2 . τ = n). Xn = i. n ≥ 0) is independent of the past process (X0 . Xτ )1(τ < ∞) = n∈Z+ (X0 . . . . Proof. . . . considered earlier. We are going to show that for any such A. . Since (X0 . . possibly. A. . Let A be any event determined by the past (X0 . . . . . A ∩ {τ = n} is determined by (X0 .

then the behaviour of the chain between successive visits cannot be inﬂuenced by the past or the future (and that is due to the strong Markov property). we let Ti be the time of the second (1) (3) visit. The Ti of Section 6 was allowed to take the value 0. “chunks”. (And we set Ti := Ti . If X0 = i then the two deﬁnitions coincide. in a generalised sense. All these random times are stopping times.d.9 Regenerative structure of a Markov chain A sequence of i. Xi (2) .. in fact. Xi (0) . Our Ti here is always positive.i.i. . State i may or may not be visited again. They are. 3. Ti Xi (0) (r) ≤ n < Ti (r+1) . Schematically. These excursions are independent! 11111111111 00000000000 1111111111 0000000000 11111111111 00000000000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111111111 00000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 11111 00000 11111111111111 00000000000000 1111111111 0000000000 1111111111111111 0000000000000000 11111 11111111111111 00000000000000 1111111111 0000000000 i 00000 11111111111111111111111111111111111111111111 00000000000000000000000000000000000000000000 1111111111111111 0000000000000000 11111 00000 11111 00000 Ti = Ti (1) Ti (2) Ti (3) Ti (4) Fix a state i ∈ S and let Ti be the ﬁrst time that the chain will visit this state: Ti := inf{n ≥ 1 : Xn = i}. . The random variables Xn comprising a general Markov chain are not independent. random trajectories. Formally.d. . (Why?) Consider the “trajectory” of the chain between two successive visits to the state i: Xi Let also We think of (r) := Xn . . we will show that we can split the sequence into i. So Xi is the r-th excursion: the chain visits 28 . (1) r = 1. We (r) refer to them as excursions of the process.. and so on. Note the subtle diﬀerence between this Ti and the Ti of Section 6. However. r = 2.) We let Ti be the time of the third visit. Xi (1) . as “random variables”. := Xn . 0 ≤ n < Ti . 2. by using the strong Markov property. . . random variables is a (very) particular example of a Markov chain. Regardless. we recursively deﬁne Ti (r) (2) := inf{n > Ti (r−1) : Xn = i}. . The idea is that if we can ﬁnd a state i that is visited again and again.

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions Xi^(0), Xi^(1), Xi^(2), . . . are independent random variables. Moreover, except, possibly, for the initial one Xi^(0), they are also identical in law, i.e. Xi^(1), Xi^(2), . . . have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at Ti^(r): SMP says that, given the state at the stopping time Ti^(r), the future after Ti^(r) is independent of the past before Ti^(r). But the state at Ti^(r) is, by definition, equal to i. Therefore the future is independent of the past, and this proves the first claim. The claim that Xi^(1), Xi^(2), . . . have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(Xi^(0)), g(Xi^(1)), g(Xi^(2)), . . . of the excursions are independent random variables.

Example: Define
Λr := Σ_{Ti^(r) ≤ n < Ti^(r+1)} 1(Xn = j).
The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ1, Λ2, . . . are independent random variables.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions:

First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:
Pi(Xn = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:
Pi(Xn = i for infinitely many n) = 0.

fii = Pi (Ti < ∞) = 1. Pi (Ji = ∞) = 1. we conclude what we asserted earlier. Notice also that fii := Pi (Ji ≥ 1) = Pi (Ti < ∞). By the strong Markov property at time Ti (k−1) Pi (Ji ≥ k) = Pi (Ji ≥ k − 1)Pi (Ji ≥ 1). 30 . Consider the total number of visits to state i (excluding the possibility that X0 = i): ∞ Ji = n=1 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞} . there could be yet another possibility: 0 < Pi (Xn = i for inﬁnitely many n) < 1. This means that the state i is transient. Starting from X0 = i. Indeed. Indeed.e. State i is recurrent. then Pi (Ji < ∞) = 1. . the random variable Ji is geometric: k Pi (Ji ≥ k) = fii . 2 k Pi (Ji ≥ k) = fii Pi (Ji ≥ k − 1) = fii Pi (Ji ≥ k − 2) = · · · = fii . i. Ei Ji = ∞. If fii = 1 we have that Pi (Ji ≥ k) = 1 for all k. in principle. We will show that such a possibility does not exist. The following are equivalent: 1. therefore concluding that every state is either recurrent or transient. We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas. 2. Lemma 5. we have Proof. Notice that: state i is visited inﬁnitely many times ⇐⇒ Ji = ∞. But the past (k−1) before Ti is independent of the future. Therefore. Suppose that the Markov chain starts from state i. This means that the state i is recurrent. the event {Ji ≥ k} implies that Ti < ∞ and that Ji ≥ k − 1. every state is either recurrent or transient. Pi (Ji = ∞) = 1. We claim that: Lemma 4.Important note: Notice that the two statements above are not negations of one another. Because either fii = 1 or fii < 1. 4. (k−1) Notice that fii = Pi (Ti < ∞) can either be equal to 1 or smaller than 1. k ≥ 0. and that is why we get the product. 3. If fii < 1. namely.

it is visited ﬁnitely many times with probability 1. and. Xn = j) Pi (Tj = m)Pi (Xn = j | Tj = m) 31 = m=1 . n ≥ 1. i. Ji cannot take value +∞ with positive probability. 4. owing to the fact that Ji is a geometric random variable. by the deﬁnition of Ji . If i is a transient state then pii → 0. if we assume Ei Ji < ∞. by deﬁnition of Ji is equivalent to Pi (Ji = ∞) = 1.e. therefore Pi (Ji < ∞) = 1. If i is transient then Ei Ji < ∞. (n) pij n (m) (n−m) (n) (n) = m=1 fij pjj . assuming that the chain (n) starts from state i. then. If Ei Ji = ∞. Hence Ei Ji = ∞ also. we see that its expectation cannot be inﬁnity without Ji itself being inﬁnity with probability one. Ei Ji < ∞. (n) But if a series is ﬁnite then the summands go to 0. State i is transient. the probability to hit state j for the ﬁrst time at time n ≥ 1.Proof. fii = Pi (Ti < ∞) < 1. it is visited inﬁnitely many times.1 Deﬁne First hitting time decomposition fij := Pi (Tj = n). Proof. but the two sequences are related in an exact way: Lemma 7. Pi (Ji < ∞) = 1. Clearly. as n → ∞. On the other hand. pii → 0. since Ji is a geometric random variable. Notice the diﬀerence with pij . Proof. The latter is the probability to be in state j at time n (not necessarily for the ﬁrst time) starting from i. which. (n) pij n = Pi (Xn = j) = m=1 n Pi (Tj = m. 2. (n) i. clearly. State i is transient if. The equivalence of the ﬁrst two statements was proved above. The following are equivalent: 1. 10.e. This. as n → ∞. From Elementary Probability. by deﬁnition. Corollary 6. fij ≤ pij . The equivalence of the ﬁrst two statements was proved before the lemma. Lemma 6. by deﬁnition. is equivalent to Pi (Ji < ∞) = 1. Proof. this implies that Ei Ji < ∞. then. But ∞ ∞ ∞ (n) Ei Ji = Ei n=1 1(Xn = i) = n=1 Pi (Xn = i) = n=1 (n) pii . 3. State i is recurrent if.
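The series criterion just established (state i is transient if and only if Σn pii^(n) = Ei Ji < ∞) is easy to probe numerically. For the simple random walk (drunkard's walk) of Section 2 one has p00^(2n) = C(2n, n) p^n (1−p)^n; the sketch below (the checkpoints and the value p = 0.6 are our own choices) shows the partial sums diverging in the symmetric case and converging in the biased case.

```python
def partial_sums(p, checkpoints):
    """Partial sums of sum_{n>=1} p_00^{(2n)} for the simple random walk on Z,
    where p_00^{(2n)} = C(2n, n) (p(1-p))^n, reported at the given checkpoints."""
    pq = p * (1 - p)
    term, total, out, n = 1.0, 0.0, [], 0
    for N in checkpoints:
        while n < N:
            # ratio p_00^{(2(n+1))} / p_00^{(2n)} = (2n+1)(2n+2)/(n+1)^2 * p(1-p)
            term *= (2 * n + 1) * (2 * n + 2) / (n + 1) ** 2 * pq
            total += term
            n += 1
        out.append(round(total, 2))
    return out

for p in (0.5, 0.6):
    print(f"p = {p}:", partial_sums(p, (100, 1_000, 10_000)))
# p = 1/2: the sums keep growing (E_0 J_0 = infinity, state 0 recurrent);
# p = 0.6: they settle near 4 (E_0 J_0 < infinity, state 0 transient).
```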

we have i j ⇐⇒ fij > 0. .. as n → ∞. j i. Now. Hence. Corollary 7. 1. and so the probability that j will appear in a speciﬁc excursion is positive. the probability that j will appear in a speciﬁc excursion equals gij = Pi (Tj < Ti ). Start with X0 = i. for all states i.) Theorem 10. not only fij > 0. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). In other words. and hence pij as n → ∞. The last statement is simply an expression in symbols of what we said above. . If j is transient. In particular. are i.i. ( Remember: another property that was a class property was the property that the state have a certain period. and since they take the value 1 with positive probability. we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj .i. 10. Xi . we have n=1 pjj = Ej Jj < ∞. and gij := Pi (Tj < Ti ) > 0. . inﬁnitely many of them will be 1 (with probability 1). starting from i. Proof. We now show that recurrence is a class property. but also fij = 1. showing that j will appear in inﬁnitely many of the excursions for sure.The ﬁrst term equals fij . by summing over n the relation of Lemma 7. either all states are recurrent or all states are transient. if we let fij := Pi (Tj < ∞) . ∞ Proof. 32 . Indeed. the probability to go to j in some ﬁnite time is positive. Corollary 8.r := 1 j appears in excursion Xi (r) .d. Hence. denoted by i j: it means that.d.2 Communicating classes and recurrence/transience Recall the deﬁnition of the relation “state i leads to state j”. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. . If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. which is ﬁnite. by deﬁnition. In a communicating class. . excursions Xi . r = 0. If j is a transient state then pij → 0. j i. For the second term. So the random variables δj. .

What is the expectation of T ? First. U1 . Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x. necessarily. be i. By ﬁniteness. . . P (T > n) = P (U1 ≤ U0 .e.e. Suppose that i is inessential. It should be clear that this can. . Hence all states are recurrent. If i is recurrent then every other j in the same communicating class satisﬁes i j. P (B) = Pi (Xn = i for inﬁnitely many n) < 1. Proof. . . Proof. if started in C. . Theorem 12.i. Every ﬁnite closed communicating class is recurrent. Xn = i for inﬁnitely many n) = 0. all other states are also transient. If a communicating class contains a state which is transient then. Let C be the class. and so i is a transient state. some state i must be visited inﬁnitely many times. Thus T represents the number of variables we need to draw to get a value larger than the initial one. . . 1]. But then. So if a communicating class is recurrent then it contains no inessential states and so it is closed. by the above theorem. Hence. Example: Let U0 . but Pi (A ∩ B) = Pi (Xm = j. there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0. and hence. random variables. This state is recurrent. j but j i. U2 .d. We now take a closer look at recurrence. Pi (B) = 0. . . If i is inessential then it is transient. Hence 11 Positive recurrence Recall that if we ﬁx a state i and let Ti be the ﬁrst return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state). Theorem 11. Indeed. i. 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ).Proof. uniformly distributed in the unit interval [0. Let T := inf{n ≥ 1 : Un > U0 }. for example. . every other j is also recurrent. X remains in C. i. appear in a certain game. observe that T is ﬁnite. By closedness. Un ≤ x | U0 = x)dx xn dx = 1 . Recall that a ﬁnite random variable may not necessarily have ﬁnite expectation. n+1 33 = 0 .

n+1 But ET = ∞ P (T > n) = n=1 n=1 1 = ∞. Theorem 13. but E min(Z1 . We state it here without proof. (ii) Suppose E max(Z1 . . But this is misleading. the sequence X0 . be i. of successive states is not an independent sequence.d. When we have a Markov chain. X2 . (i) Suppose E|Z1 | < ∞ and let µ = EZ1 . n+1 You are invited to think about the meaning of this: If the initial value U0 is unknown. . 0) > −∞. 34 . It is not a Law. Let Z1 . It is traditional to refer to it as “Law”. then it takes. the Fundamental Theorem of Probability Theory.i. Then P Z1 + · · · + Zn =∞ n→∞ n lim = 1. on the average. . 0) = ∞. random variables. . it is a Theorem that can be derived from the axioms of Probability Theory. 13 Law of Large Numbers for Markov chains The law of large numbers stated above requires independence. . Z2 . arguably. To our rescue comes the regenerative structure of a Markov chain identiﬁed at an earlier section. Then P Z1 + · · · + Zn =µ n→∞ n lim = 1.Therefore. an inﬁnite number of steps to exceed it! We now classify a recurrent state i as follows: We say that i is positive recurrent if Ei Ti < ∞. Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. X1 . we shall discover what the form of this stationary distribution is by using the Law of Large Numbers 12 Law (=Theorem) of Large Numbers in Probability Theory The following result is. To apply the law of large numbers we must create some kind of independence. We say that i is null recurrent if Ei Ti = ∞. P (T < ∞) = lim P (T ≤ n) = lim (1 − n→∞ ∞ n→∞ 1 ) = 1. Before doing that. .
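The gap between "T is finite with probability one" and "ET is finite" in this example shows up clearly in a simulation (the cap below, which only guards against extremely long inner loops, is our own device): the empirical mean of T keeps drifting upward as the sample grows, exactly as one expects when ET = ∞.

```python
import random

random.seed(0)

def draw_T(cap=10**7):
    """Sample T = inf{n >= 1 : U_n > U_0}; the cap only guards against huge draws."""
    u0 = random.random()
    n = 1
    while random.random() <= u0 and n < cap:
        n += 1
    return n

for sample_size in (10**3, 10**4, 10**5):
    mean = sum(draw_T() for _ in range(sample_size)) / sample_size
    print(f"{sample_size:>6} samples: empirical mean of T = {mean:8.1f}")
```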

as in Example 2. Note that EG(Xa(1) ) = EG(Xa(2) ) = · · · = E G(Xa(0) ) | X0 = a = Ea G(Xa(0) ).i. Xa . Thus. are i. the (0) initial excursion Xa also has the same law as the rest of them. for reasonable choices of the functions G taking values in R. k=0 which represents the ‘time-average reward’ received. in this example.d. . We are interested in the existence of the limit time-average reward = lim 1 t→∞ t t f (Xk ). for an excursion X .i. with probability 1. Then. G(Xa(2) ). Speciﬁcally. we can think of as a reward received when the chain is in state x. . forms an independent sequence of (generalised) random variables. because the excursions Xa . we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit. (7) Whichever the choice of G. the law of large numbers tells us that G(Xa ) + · · · + G(Xa ) → EG(Xa(1) ). Xa(2) . G(Xa(r) ) = (r) (r+1) Ta ≤t<Ta f (Xt ). (r) (r+1) G(Xa(r) ) = max{f (Xt ) : Ta ≤ t < Ta }. Since functions of independent random variables are also independent. and because if we start with X0 = a. . G(Xa(3) ). Example 1: To each state x ∈ S. . given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been ﬁxed once and for all) the sequence of excursions Xa(1) . assign a reward f (x).1 Functions of excursions Namely.13. 35 . but deﬁne G(X ) to be the total reward received over the excursion X . . . deﬁne G(X ) to be the maximum reward received during this excursion. The ﬁnal equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a. Xa(3) . which. In particular. . (1) (2) (1) (n) (8) 13. .d. are i..2 Average reward Consider now a function f : S → R. we have that G(Xa(1) ). n as n → ∞. Example 2: Let f (x) be as in example 1. . random variables.

Hence |Gﬁrst | ≤ BTa . We break the sum into the total reward received over the initial excursion. we have BTa /t → 0. 0 Ta (1) Ta (2) T (3) a Ta(4) Ta (5) t We write the total reward as a sum of rewards over complete excursions plus a ﬁrst and a last term: t Nt −1 r=1 (1) f (Xn ) = n=0 Gr + Gﬁrst + Glast . We shall present the proof in a way that the constant f will reveal itself–it will be discovered. t t In other words. plus the total reward received over the next excursion. Ea f= n=0 f (Xn ) E a Ta = x∈S f (x)π(x). Gﬁrst is the reward up to time Ta = Ta or t (whichever t is smaller). Gr is given by (7).e. For example. Since P (Ta < ∞) = 1. Proof. with probability one. t as t → ∞. The idea is this: Suppose that t is large enough. Let f : S → R+ be a bounded ‘reward’ function. the ‘time-average reward’ f exists and is a deterministic quantity: P where Ta −1 1 t→∞ t lim t f (Xn ) = f n=0 =1 . We are looking for the total reward received between 0 and t. t t as t → ∞. (r) Nt := max{r ≥ 1 : Ta ≤ t}. We assumed that f is t bounded.Theorem 14. of complete excursions from the ground state a that occur between 0 and t. A similar argument shows that 1 last G → 0. Nt . and Glast is the reward over the last incomplete cycle. and so 1 ﬁrst G → 0. Ea Ta < ∞) and assume that the distribution of X0 is such that P (Ta < ∞) = 1. say. Thus we need to keep track of the number. t t 36 . Then. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i. Nt = 5. in the ﬁgure below. with probability one. say |f (x)| ≤ B. and so on.

(11) By deﬁnition. t→∞ t E a Ta (12) Using (10) and (12) in (9) we have 1 lim t→∞ t Thus 1 t→∞ t lim t Nt −1 Gr = r=1 EG1 . we have Ta r→∞ r lim (r) . Since Ta → ∞.also. 1 Nt = . Since Ta /r → 0. it follows that the number Nt of complete cycles before time t will also be tending to ∞. r=1 with probability one. So we are left with the sum of the rewards over complete cycles.i. . E a Ta 37 . and since Ta − Ta k = 1. r=1 (9) Look again what the Law of Large Numbers (8) tells us: that 1 n→∞ n lim (r) n Gr = EG1 . are i. E a Ta f (Xn ) = n=0 EG1 =: f .d. as r → ∞. random variables with common expectation Ea Ta . Hence. lim t∞ 1 Nt Nt −1 Gr = EG1 . Write now 1 t Nt −1 Gr = r=1 Nt 1 t Nt Nt −1 Gr . r=1 (10) Now look at the sequence r (r) (1) Ta = Ta + k=1 (k) (k−1) ) (Ta − Ta (1) (k) (k−1) and see what the Law of Large Numbers again. we have that the visit with index r = Nt to the ground state a is the last before t (see also ﬁgure above): (N (N Ta t ) ≤ t < Ta t +1) . as t → ∞.. . 2. = E a Ta . . = E a Ta . Hence Ta t t Ta t ≤ < Nt Nt Nt But (11) tells us that Ta t Ta t = lim t→∞ Nt t→∞ Nt lim Hence lim (N ) (N +1) (N ) (N +1) .

We may write EG1 = Ea 0≤t<Ta f (Xt ) or EG1 = Ea 0<t≤Ta f (Xt ). f(X2 ) f(X1 ) f(X )= f(a) 0 f(T −1) a 0 T a Comment 2: A very important particular case occurs when we choose f (x) := 1(x = b). t Ea 1(Xk = b) E a Ta . The denominator is the duration of the excursion. the ground state. Comment 4: Suppose that two states. b. The equality of these two limits gives: average no.To see that f is as claimed. Note that we include only one endpoint of the excursion. Applying formula (14) with b in place of a we ﬁnd that the limit in (13) also equals 1/Eb Tb . then we see that the numerator equals 1. Hence t→∞ lim 1 t t 1(Xn = a) = n=1 1 . on the average. a. of times state b is visited = Ea between two successive visits to state a Ta −1 1(Xk = b) = k=0 E a Ta . We can use b as a ground state instead of a. i. (13) 1(Xn = b) = n=1 Comment 3: If in the above we let b = a. E b Tb 38 . not both. E a Ta (14) with probability 1. which communicate and are positive recurrent. we let f to be 1 at the state b and 0 everywhere else. then the long-run proportion of visits to the state a should equal 1/Ea Ta . Comment 1: It is important to understand what the formula says. The constant f is a sort of average over an excursion.e. Ta −1 k=0 1 lim t→∞ t with probability 1. The numerator of f is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). The theorem above then says that. just observe that EG1 = E (1) (1+1) Ta ≤t<Ta f (Xt ) = Ea 0≤t<Ta f (Xt ). for all b ∈ S. from the Strong Markov Property. The physical interpretation of the latter should be clear: If. it takes Ea Ta units of time between successive visits to the state a.
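Formula (14) and Comment 3 can be checked by direct simulation (the three-state transition matrix and the run length are our own illustrative choices): the long-run fraction of time spent in a state b agrees with the reciprocal of the empirical mean return time to b.

```python
import numpy as np

rng = np.random.default_rng(2)

# Our own illustrative irreducible chain on {0, 1, 2}.
P = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])

b, n_steps = 1, 200_000
x, counts, last_visit, gaps = 0, np.zeros(3), None, []
for t in range(n_steps):
    x = rng.choice(3, p=P[x])
    counts[x] += 1
    if x == b:
        if last_visit is not None:
            gaps.append(t - last_visit)    # a return time T_b
        last_visit = t

print("long-run fraction of time in b       :", round(counts[b] / n_steps, 4))
print("1 / (empirical mean return time to b):", round(1 / np.mean(gaps), 4))
```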

Consider the function ν [a] deﬁned in (15). notice that the choice of notation ν [a] is justiﬁable through the notation (6) for the communicating class of the state a. X3 .14 Construction of a stationary distribution Our discussions on the law of large numbers for Markov chains revealed that the quantity Ta −1 ν (x) := Ea n=0 [a] 1(Xn = x) . we saw (Comment 2 above) that if a is a positive recurrent state then ν [a] (x) E a Ta is the long-run fraction of times that state x is visited.e. Recurrence tells us that Ta < ∞ with probability one. . this should have something to do with the concept of stationary distribution discussed a few sections ago. Note: Although ν [a] depends on the choice of the ‘ground’ state a. we will soon realise that this dependence vanishes when the chain is irreducible. Proposition 2. x ∈ S. X2 . . . 5 We stress that we do not need the stronger assumption of positive recurrence here. Proof. Since a = XT a . we can write ν (x) = Ea k=1 [a] Ta 1(Xn = x). (15) should play an important role. Clearly. ν [a] (x) = y∈S ν [a] (y)pyx . Both of them equal 1. Moreover. for any state b such that a → x (a leads to x). 39 . so there is nothing lost in doing so. The real reason for doing so is because we wanted to replace a function of X0 . X1 . we have 0 < ν [a] (x) < ∞. We start the Markov chain with X0 = a. Hence the superscript [a] will soon be dropped. where a is the ﬁxed recurrent ground state whose existence was assumed. X2 . In view of this. i. by a function of X1 . . . (16) This was a trivial but important step: We replaced the n = 0 term with the n = Ta term. Fix a ground state a ∈ S and assume that it is recurrent5 . Indeed. Then ν [a] satisﬁes the balance equations: ν [a] = ν [a] P. . x ∈ S.

Suppose that the Markov chain possesses a positive recurrent state a. Xn−1 = y. n ≤ Ta ) pyx Pa (Xn−1 = y. such that pxa > 0. So we have shown ν [a] (x) = y∈S pyx ν [a] (y). namely. (17) This π is a stationary distribution. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. n ≤ Ta ) Pa (Xn = x. Therefore. let x be such that a x. for all x ∈ S. for all n ∈ N. Deﬁne Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . We just showed that ν [a] = ν [a] P. by deﬁnition. xa so. in this last step. indeed. First note that. we used (16). ν [a] (x) < ∞. which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. x ∈ S. is ﬁnite when a is positive recurrent. x a. n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). where. We now have: Corollary 9. from (16). ν [a] = ν [a] Pn . We are now going to assume more. which are precisely the balance equations. which. so there exists m ≥ 1. Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta . ax ax From Theorem 10. (n) Next. that a is positive recurrent. so there exists n ≥ 1. Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). Note: The last result was shown under the assumption that a is a recurrent state. such that pax > 0. 40 .and thus be in a position to apply ﬁrst-step analysis.
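The definition of ν[a], together with π[a] = ν[a]/Ea Ta, suggests a direct Monte Carlo construction: simulate excursions from the ground state a, count visits to each state, and compare with the solution of the balance equations. A sketch, with the chain and the number of excursions as our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

P = np.array([[0.2, 0.5, 0.3],      # our own illustrative transition matrix
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])
a, n_excursions = 0, 50_000

visit_counts = np.zeros(3)          # accumulates 1(X_n = x) for 0 <= n < T_a over excursions
total_length = 0                    # accumulates T_a over excursions
for _ in range(n_excursions):
    x = a
    while True:
        visit_counts[x] += 1
        total_length += 1
        x = rng.choice(3, p=P[x])
        if x == a:
            break

nu = visit_counts / n_excursions              # estimate of nu^[a](x)
pi_hat = nu / (total_length / n_excursions)   # nu^[a](x) / E_a T_a

# Exact stationary distribution from pi = pi P together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
pi_exact = np.linalg.lstsq(A, np.array([0., 0., 0., 1.]), rcond=None)[0]

print("pi^[a] from excursions :", np.round(pi_hat, 4))
print("pi from balance eqns   :", np.round(pi_exact, 4))
```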

i. Therefore. Proof.d. Since a is positive recurrent. It only remains to show that π [a] is a probability. then j occurs in the r-th excursion. π [a] P = π [a] . we toss a coin with probability of heads gij > 0. Hence j will occur in inﬁnitely many excursions. then the set of linear equations π(x) = y∈S π(y)pyx . r = 0. if tails show up. But π [a] is simply a multiple of ν [a] . the event that j occurs in a speciﬁc excursion is independent from the event that j occurs in the next excursion. Otherwise. We have already shown. . second) excursion that contains j. . This is what we will try to do in the sequel. i. Start with X0 = i. and so π [a] is not trivially equal to zero. then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is. unique. The coins are i. We already know that j is recurrent. 1. no matter which one. x∈S Pause for thought: We have achieved something VERY IMPORTANT: if we have one. In words. π(y) = 1 y∈S x∈S have at least one solution. and the probability gij = Pi (Tj < ∞) that j occurs in a speciﬁc excursion is positive. (r) j. Suppose that i recurrent. 15 Positive recurrence is a class property We now show that (just like recurrence) positive recurrence is also a class property. that ν [a] P = ν [a] . Consider the excursions Xi . positive recurrent state. If heads show up at the r-th excursion. But π [a] (x) = x∈S 1 E a Ta ν [a] (x) = 1. since the excursions are independent. that it sums up to one. 2. .In other words: If there is a positive recurrent state. Theorem 15. Let R1 (respectively. we have Ea Ta < ∞. then the r-th excursion does not contain state j. 41 . in general.e. to decide whether j will occur in a speciﬁc excursion. Let i be a positive recurrent state. Also. R2 ) be the index of the ﬁrst (respectively. Then j is positive Proof. in Proposition 2.

An irreducible Markov chain is called transient if one (and hence all) states are transient. where the Zr are i. But. . Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞.i. or all states are null recurrent or all states are transient.d.R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . In other words. . An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. 2. with the same law as Ti . An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. R2 has ﬁnite expectation. we just need to show that R2 −R1 +1 ζ := r=1 Zr has ﬁnite expectation. Then either all states in C are positive recurrent. by Wald’s lemma. . . Let C be a communicating class. IRREDUCIBLE CHAIN r = 1. . . . RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 . An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . Corollary 10.

Fix a ground state a. recognising that the random variables on the left are nonnegative and bounded below 1. there is a stationary distribution. x ∈ S. Consider positive recurrent irreducible Markov chain. (18) π(x)Px (Ta < ∞) = π(x) = 1. π(2) = 1 − α 2. n ≥ 0) and suppose that P (X0 = x) = π(x). Hence. 1(Xn = x) n=1 = 1 t t P (Xn = x) = π(x). we stress that a stationary distribution may not be unique. Example: Consider the chain we’ve seen before: 1 1/3 1 0 1/2 1 1/6 2 Clearly. Then P (Ta < ∞) = x∈S In other words. x∈S By (13). Consider the Markov chain (Xn . what prohibits uniqueness. for totally trivial reasons (it is. after all. following Theorem 14. an absorbing state). we can take expectations of both sides: E as t → ∞. Since the chain is irreducible and positive recurrent we have Pi (Tj < ∞) = 1. Let π be a stationary distribution. we start the Markov chain from the stationary distribution π. Ta −1 1 t t Ea 1(Xk = x) k=0 1(Xn = x) n=1 → E a Ta = π [a] (x). Then there is a unique stationary distribution. for all states i and j. (You should think why this is OBVIOUS!) Notice that 1 This lack of communication is. Proof. we have. is a stationary distribution. for all n. π(1) = 0. By a theorem of Analysis (Bounded Convergence Theorem). But observe that there are many–inﬁnitely many–stationary distributions: For each 0 ≤ α ≤ 1. indeed. x ∈ S. Then. n=1 43 . for all x ∈ S. the π deﬁned by π(0) = α. by (18). we have P (Xn = x) = π(x). Theorem 16. E 1 t t 1 t t 1(Xn = x) n=1 → π [a] (x). state 1 (and state 2) is positive recurrent. with probability one. But. as t → ∞.16 Uniqueness of the stationary distribution At the risk of being too pedantic.
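For this reducible example one can verify directly that every mixture of the point masses at the two absorbing states is stationary. The sketch below uses one particular reading of the diagram (state 1 stays put with probability 1/3 and is absorbed at 0 and 2 with probabilities 1/2 and 1/6); the exact split does not affect the point being made.

```python
import numpy as np

# One reading of the diagram: states 0 and 2 absorbing, state 1 transient.
P = np.array([[1.0, 0.0, 0.0],
              [1/2, 1/3, 1/6],
              [0.0, 0.0, 1.0]])

for alpha in (0.0, 0.3, 0.7, 1.0):
    pi = np.array([alpha, 0.0, 1.0 - alpha])
    print(f"alpha = {alpha}: pi P == pi ?", bool(np.allclose(pi @ P, pi)))
```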

Then. π(a) = π [a] (a) = 1/Ea Ta . Corollary 11. Let π(·|C) be the distribution deﬁned by restricting π to C: π(x|C) := π(x) . Therefore Ei Ti < ∞. E a Ta Proof. delete all transition probabilities pxy with x ∈ C. we obtain the cycle formula: Ta −1 Tb −1 Ea k=0 1(Xk = x) E a Ta = Eb k=0 1(Xk = x) E b Tb .Therefore π(x) = π [a] (x). Both π [a] and π[b] are stationary distributions. the only stationary distribution for the restricted chain because C is irreducible. for all a ∈ S. Consider a Markov chain. In particular. which is useful because it is often used as a criterion for positive recurrence: Corollary 13. In particular. Consider an arbitrary Markov chain. Let i be such that π(i) > 0. The following is a sort of converse to that. i. Proof. x ∈ S. Hence there is only one stationary distribution. Then any state i such that π(i) > 0 is positive recurrent. Assume that a stationary distribution π exists. b ∈ S. Namely. Hence π = π [a] . Consider a positive recurrent irreducible Markov chain and two states a. π(i|C) = Ei Ti . But π(i|C) > 0. There is only one stationary distribution. from the deﬁnition of π [a] . Proof. where π(C) = y∈C π(y). 17 Structure of a stationary distribution It is not hard to show now what an arbitrary stationary distribution looks like. Corollary 12. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π C 44 . so they are equal. y ∈ C. Then π [a] = π [b] . We can easily see that we thus obtain a Markov chain with state space C with stationary distribution π(·|C). There is a unique stationary distribution. for all x ∈ S. By the 1 above corollary. i is positive recurrent. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. Consider the restriction of the chain to C. 1 π(a) = . This is in fact. Consider a positive recurrent irreducible Markov chain with stationary distribution π. an arbitrary stationary distribution π must be equal to the speciﬁc one π [a] .e. The cycle formula merely re-expresses this equality. π(C) x ∈ C. Let C be the communicating class of i. Thus. Decompose it into communicating classes.

Stability of a deterministic dynamical system means that the system eventually converges to a ﬁxed point (or. To each positive recurrent communicating class C there corresponds a stationary distribution π C .which assigns positive probability to each state in C. from physical experience. This is deﬁned by picking a state (any state!) a ∈ C and letting π C := π [a] . Proposition 3. π(x|C) ≡ π C (x). For example. We all know. 18 18. such that C αC = 1. There is dissipation of energy due to friction and that causes the settling down. where αC ≥ 0. the so-called Cesaro 1 averages n n P (Xk = x) converge to π(x). eventually. A Markov chain is a simple model of a stochastic dynamical system. Let π be an arbitrary stationary distribution. that the pendulum will swing for a while and. By our uniqueness result.1 Coupling and stability Deﬁnition of stability Stability refers to the convergence of the probabilities P (Xn = x) as n → ∞. An arbitrary stationary distribution π must be a linear combination of these π C . for example. take an = (−1)n . that. A sequence of real numbers an k=1 1 may not converge but its Cesaro averages n (a1 + · · · + an ) may converge. In an ideal pendulum. where π(C) := π(C) π(x). So far we have seen. as n → ∞. Proof. x∈C Then (π(x|C). under irreducibility and positive recurrence conditions. This is the stable position. without friction. Why stability? There are physical reasons why we are interested in the stability of a Markov chain. x ∈ C) is a stationary distribution. For each such C deﬁne the conditional distribution π(x|C) := π(x) . Any stationary distribution π is necessarily of the form π= C: C positive recurrent communicating class αC π C . more generally. it will settle to the vertical position. consider the motion of a pendulum under the action of gravity. So we have that the above decomposition holds with αC = π(C). the pendulum will perform an 45 . If π(x) > 0 then x belongs to some positive recurrent class C. that it remains in a small region).

oscillatory motion ad infinitum. Still, we may say that it is stable because it does not (it cannot!) swing too far away.

The simplest differential equation

As another example, consider the following differential equation:
ẋ = −ax + b,  x ∈ R,  a > 0.
There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever. The thing to observe here is that there is an equilibrium point, i.e. the one found by setting the right-hand side equal to zero: x∗ = b/a. If we start with x(t = 0) = x∗, then x(t) = x∗ for all t > 0. If we start from anywhere else then, regardless of the starting position, we have x(t) → x∗, as t → ∞. We say that x∗ is a stable equilibrium.

[Figure: solution curves x(t) approaching the equilibrium value x∗ as t grows.]

A linear stochastic system

We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
Xn+1 = ρXn + ξn,
where ξ0, ξ1, ξ2, . . . are given random variables, and we define a stochastic sequence X1, X2, . . . by means of the recursion above. Specifically, we shall assume that the ξn have the same law and, moreover, we here assume that X0, ξ0, ξ1, ξ2, . . . are mutually independent random variables.

But what does stability mean? In general, it cannot mean convergence to a fixed equilibrium point, because the ξn's depend on the time parameter n. However, notice what happens when we solve the recursion. And we are in a happy situation here, for we can do it:
X1 = ρX0 + ξ0
X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0) + ξ1 = ρ²X0 + ρξ0 + ξ1
X3 = ρX2 + ξ2 = ρ³X0 + ρ²ξ0 + ρξ1 + ξ2
· · · · · ·
Xn = ρⁿX0 + ρⁿ⁻¹ξ0 + ρⁿ⁻²ξ1 + · · · + ρξn−2 + ξn−1.
It follows that |ρ| < 1 is the condition for stability.
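A quick simulation makes this notion of stability visible (the standard normal noise, the value ρ = 0.8 and the two starting points are our own assumptions): whatever X0 is, after many steps the distribution of Xn looks the same.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, n_steps, n_paths = 0.8, 200, 100_000

def simulate(x0):
    """Run X_{n+1} = rho X_n + xi_n for n_steps steps with i.i.d. N(0,1) noise,
    over n_paths independent copies, all started at x0."""
    x = np.full(n_paths, float(x0))
    for _ in range(n_steps):
        x = rho * x + rng.standard_normal(n_paths)
    return x

a, b = simulate(0.0), simulate(50.0)
print("means    :", round(a.mean(), 3), round(b.mean(), 3))   # both near 0
print("std devs :", round(a.std(), 3), round(b.std(), 3))     # both near 1/sqrt(1 - rho^2)
```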

we have that the ﬁrst term goes to 0 as n → ∞. which is a linear combination of ξ0 . Y = y) is? 47 . but we are not given what P (X = x. A straightforward (often not so useful) construction is to assume that the random variables are independent and deﬁne P (X = x. to try to make the coincidence probability P (X = Y ) as large as possible. deﬁne stability in a way other than convergence of the distributions of the random variables. converges. 18. while the limit of Xn does not exists. Y but not their joint law. under nice conditions. .g. but notice that if we let ˆ Xn := ρn−1 ξn−1 + ρn−2 ξn−2 + · · · + ρξ1 + ξ0 . . 18. ξn−1 does not converge. . for time-homogeneous Markov chains. ˜ Xn := ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . j ∈ S. If a Markov chain is irreducible. Coupling refers to the various methods for constructing this joint law. we know f (x) = P (X = x). for instance. for any initial distribution. Y = y) is. What this means is that. to a matrix with rows all equal to π: (n) lim p n→∞ ij = π(j). In other words. assume that the ξn are bounded) Xn does converge to ˆ X∞ = k=0 ρk ξk . i. ∞ ˆ The nice thing is that. the limit of its distribution does: ˜ ˆ ˆ lim P (Xn ≤ x) = lim P (Xn ≤ x) = lim P (Xn ≤ x) = P (X∞ ≤ x) n→∞ n→∞ n→∞ This is precisely the reason why we cannot. The second term. But suppose that we have some further requirement. suppose you are given the laws of two random variables X. g(y) = P (Y = y).3 Coupling To understand why the fundamental stability theorem works. Starting at an elementary level. we have ˜ ˆ P (Xn ≤ x) = P (Xn ≤ x). . Y = y) = P (X = x)P (Y = y). positive recurrent and aperiodic with (unique) stationary distribution π then n→∞ lim P (Xn = x) = π(x). the n-step transition matrix Pn . uniformly in x.2 The fundamental stability theorem for Markov chains Theorem 17. In particular.Since |ρ| < 1. (e. How should we then deﬁne what P (X = x. we need to understand the notion of coupling. as n → ∞.

we obtain P (Yn = x) − P (Xn = x) ≤ P (T > n). we can write it as sup |P (Xn = x) − P (Yn = x)| ≤ P (T > n). P (T < ∞) = 1. Y be random variables with values in Z and laws f (x) = P (X = x).e. Let T be a meeting of two coupled stochastic processes X. n ≥ T ) ≤ P (n < T ) + P (Yn = x). n ≥ 0. Find the joint law h(x. n ≥ 0) and the law of a stochastic process Y = (Yn . x ∈ S. This T is a random variables that takes values in the set of times or it may take values +∞. Repeating (19). (19) as n → ∞. n ≥ 0. respectively. Combining the two inequalities we obtain the ﬁrst assertion: |P (Xn = x) − P (Yn = x)| ≤ P (T > n). y). The following result is the so-called coupling inequality: Proposition 4. Suppose that we have two stochastic processes as above and suppose that they have been coupled. f (x) = y h(x. n < T ) + P (Yn = x. precisely because of the assumption that the two processes have been constructed together. for all n ≥ 0. but we are not told what the joint probabilities are. but with the roles of X and Y interchanged.e. and which maximises P (X = Y ). T is ﬁnite with probability one.Exercise: Let X. The coupling we are interested in is a coupling of processes. on the event that the two processes never meet. i. y) = P (X = x. and hence P (Xn = x) − P (Yn = x) ≤ P (T > n). then |P (Xn = x) − P (Yn = x)| → 0. in particular. Y . y). We say that ransom time T is meeting time of the two processes if Xn = Yn for all n ≥ T. Since this holds for all x ∈ S and since the right hand side does not depend on x. n ≥ 0). uniformly in x ∈ S. |P (Xn = x) − P (Yn = x)| ≤ P (T > n). Note that it makes sense to speak of a meeting time. Proof. and all x ∈ S. We are given the law of a stochastic process X = (Xn . n < T ) + P (Xn = x. x∈S 48 . Then. Y = y) which respects the marginals. n ≥ T ) = P (Xn = x. i. g(y) = x h(x. We have P (Xn = x) = P (Xn = x. A coupling of them refers to a joint construction on the same probability space. g(y) = P (Y = y). If.

n ≥ 0) is a Markov chain with state space S × S. then.4 Proof of the fundamental stability theorem First. we can deﬁne joint probabilities.x′ ).y′ ) = px. (x. 49 (n) (n) (n) (n) (n) (n) = P (Xn+1 = x′ | Xn = x) P (Yn+1 = y ′ | Yn = y) = px.x′ py. Let pij denote. the transition probabilities. n ≥ 0) is an irreducible chain.(y.y′ for all large n. n ≥ 0) has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. Xn and Yn would be independent and distributed according to π. The ﬁrst process. and it also makes sense to consider their ﬁrst meeting time: T := inf{n ≥ 0 : Xn = Yn }. Notice that σ(x. y ′ ) | Wn = (x.x′ ). had we started with both X0 and Y0 independent and distributed according to π. x∈S as n → ∞.Assume now that P (T < ∞) = 1. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities.x′ ). The second process. . as usual. for all n ≥ 1. they diﬀer only in how the initial states are chosen. and so (Wn . Having coupled them. n ≥ 0). y) Its n-step transition probabilities are q(x.y′ ) := P Wn+1 = (x′ . positive recurrent and aperiodic. y) ∈ S × S. We shall consider the laws of two processes.y′ ) for all large n. review the assumptions of the theorem: the chain is irreducible. Therefore. Consider the process Wn := (Xn . 18.) By positive recurrence. Y0 ) has distribution P (W0 = (x. It is very important to note that we have only deﬁned the law of X and the law of Y but we have not deﬁned a joint law. y) := π(x)π(y).y′ . n ≥ 0) is independent of Y = (Yn . is a stationary distribution for (Wn . in other words. denoted by π. Xm = y).x′ py. we have that px.x′ > 0 and py. We now couple them by doing the straightforward thing: we assume that X = (Xn . implying that q(x. (Indeed. We know that there is a unique stationary distribution. n ≥ 0) has the law of a Markov chain with transition probabilities pij and Y0 distributed according to the stationary distribution µ. we have not deﬁned joint probabilities such as P (Xn = x.(y. denoted by Y = (Yn . Then (Wn . y) = µ(x)π(y). Its initial state W0 = (X0 . Its (1-step) transition probabilities are q(x. n ≥ 0). Yn ). π(x) > 0 for all x ∈ S. denoted by X = (Xn . From Theorem 3 and the aperiodicity assumption.(y.y′ . Then n→∞ lim P (T > n) = P (T = ∞) = 0. sup |P (Xn = x) − P (Yn = x)| → 0.
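The coupling in this proof can be imitated numerically (the chain below is our own; for simplicity the two copies are run independently until they first agree and then move together, rather than as the full product chain): the discrepancy between P(Xn = ·) and π is dominated by the estimate of P(T > n).

```python
import numpy as np

rng = np.random.default_rng(5)

P = np.array([[0.2, 0.5, 0.3],     # our own illustrative ergodic chain
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])
A = np.vstack([P.T - np.eye(3), np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0., 0., 0., 1.]), rcond=None)[0]

n, n_paths = 3, 50_000
x_end, not_met = np.zeros(3), 0
for _ in range(n_paths):
    x, y = 0, rng.choice(3, p=pi)                # X_0 = 0 (arbitrary), Y_0 ~ pi (stationary)
    met = (x == y)
    for _ in range(n):
        x = rng.choice(3, p=P[x])
        y = x if met else rng.choice(3, p=P[y])  # independent until they meet, together after
        met = met or (x == y)
    x_end[x] += 1
    not_met += (0 if met else 1)

# Coupling inequality: sup_x |P(X_n = x) - pi(x)| <= P(T > n); crude, but it holds.
print("sup_x |P(X_n = x) - pi(x)| :", round(float(np.max(np.abs(x_end / n_paths - pi))), 4))
print("estimate of P(T > n)       :", round(not_met / n_paths, 4))
```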

0).(0. T is a meeting time between Z and Y . Hence (Wn .99.1). Y ) may fail to be irreducible.(1. Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X. P (T < ∞) = 1. uniformly in x. p00 = 0. However. ≥ 0) is a Markov chain with transition probabilities pij . 50 .1) = 1. Moreover. In particular. as n → ∞. say. Zn equals Xn before the meeting time of X and Y . which converges to zero. 0). after the meeting time. 1} and the pij are p01 = p10 = 1.0) = q(0. By the coupling inequality.0).0) = q(1. (0. and since P (Yn = x) = π(x). This is not irreducible. 0). by assumption. Hence (Zn . Then the product chain has state space S ×S = {(0. then the process W = (X.(1. 1)} and the transition probabilities are q(0.1) = q(1. p10 = 1. For example.1). (1. P (Xn = x) = P (Zn = x). (1. n ≥ 0) is identical in law to (Xn . suppose that S = {0. precisely because P (T < ∞) = 1. p01 = 0.(0. we have |P (Xn = x) − π(x)| ≤ P (T > n).01.Therefore σ(x. Obviously. Deﬁne now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ). n ≥ 0). n ≥ 0) is positive recurrent. y) ∈ S × S. Clearly. y) > 0 for all (x. In particular. (Zn . for all n and x. 1). if we modify the original chain and make it aperiodic by. Z0 = X0 which has an arbitrary distribution µ. Since P (Zn = x) = P (Xn = x). |P (Zn = x) − P (Yn = x)| ≤ P (T > n). Zn = Yn . X arbitrary distribution Z stationary distribution Y T Thus.

Theorem 4. It is not diﬃcult to see that 6 We put “wrong” in quotes. . we know we have a unique stationary distribution π and that π(x) > 0 for all states x. n→∞ (n) (n) lim p n→∞ 11 = lim p01 = π(1). customary to have a consistent terminology.6 Periodic chains Suppose now that we have a positive recurrent irreducible chain.4 and.f. if we modify the chain by p00 = 0. then (n) lim p n→∞ 00 (n) = lim p10 = π(0). Terminology cannot be.01.then the product chain becomes irreducible. π(0) + π(1) = 1.99π(0) = π(1). p10 = 1. p01 = 0. . 51 . we can decompose the state space into d classes C0 . Thus.497 18. n→∞ lim p00 p01 (n) (n) p10 p11 (n) (n) = 0. By the results of §5. Parry. But this is “wrong”. and so on. In matrix notation. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing. However. n→∞ (n) where π(0).99. Therefore p00 = 1 for even n (n) and = 0 for odd n. 1981). By the results of Section 14. . 18.503 0. Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1.503 0. from C1 to C2 . positive recurrent and aperiodic Markov chain. It is.e. in particular. 0. Then Xn = 0 for even n and = 1 for odd n. and this is expressed by our terminology. Cd−1 . π(0) ≈ 0.497. Suppose that the period of the chain is d.5 Terminology A Markov chain which is irreducible and positive recurrent will be called ergodic. π(1) ≈ 0. per se wrong. Suppose X0 = 0. limn→∞ p00 does not exist. . from Cd back to C0 . C1 . The reader is asked to pay attention to this terminology which is diﬀerent from the one used in the Markov chain literature. Indeed. however. i. such that the chain moves cyclically between them: from C0 to C1 .497 . π(1) satisfy 0. 6 The reason is that the term “ergodic” means something speciﬁc in Mathematics (c.503. most people use the term “ergodic” for an irreducible.
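Both two-state examples can be checked with matrix powers (this numerical check is our own addition): p00^(n) for the periodic chain keeps oscillating between 0 and 1, while the n-step probabilities of the modified chain settle to approximately (0.503, 0.497).

```python
import numpy as np

P_per = np.array([[0.0, 1.0],        # p01 = p10 = 1: period 2
                  [1.0, 0.0]])
P_aper = np.array([[0.01, 0.99],     # modified chain: aperiodic
                   [1.00, 0.00]])

for name, P in (("periodic ", P_per), ("aperiodic", P_aper)):
    for n in (10, 11, 1000, 1001):
        print(name, f"n = {n:4d}, first row of P^n:",
              np.round(np.linalg.matrix_power(P, n)[0], 3))
```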

. Let αr := i∈Cr π(i). n ≥ 0) consists of d closed communicating classes. Cd−1 . Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d. Let i ∈ Ck . Then. Since (Xnd . So π(i) = j∈Cr−1 π(j)pji . . after reduction by d) and. i ∈ S. Summing up over i ∈ Cr we obtain αr = i∈Cr π(i) = j∈Cr−1 π(j) i∈Cr pji = j∈Cr−1 π(j) = αr−1 . Then pji = 0 unless j ∈ Cr−1 . 1. The chain (Xnd . where π is the stationary distribution. the original chain enters Cℓ in r = ℓ − k steps (where r is taken to be a number between 0 and d − 1. . . j ∈ Cr . This means that: Theorem 18. each one must be equal to 1/d. n ≥ 0) is aperiodic. the (unique) stationary distribution of (Xnd . Suppose that X0 ∈ Cr . i ∈ Cr . i ∈ Cr . All states have period 1 for this chain. as n → ∞. Can we also ﬁnd the limit when i and j do not belong to the same class? Suppose that i ∈ Ck . we have pij for all i.Lemma 8. π(i) = 1/d. and the previous results apply. n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain. → dπ(j). j ∈ Cℓ . d − 1. if sampled every d steps. namely C0 . . We have that π satisﬁes the balance equations: π(i) = j∈S π(j)pji . as n → ∞. n ≥ 0) is dπ(i). Suppose i ∈ Cr . . . We shall show that Lemma 9. and let r = ℓ − k (modulo d). . Then. i∈Cr for all r = 0. it remains in Cℓ . Hence if we restrict the chain (Xnd . starting from i. Then pij (r+nd) (nd) = Pi (Xnd = j) → dπ(j). thereafter. 52 . Proof. Since the αr add up to 1. j ∈ Cℓ .

. Now consider the contained between 0 and 1 there must be a exists and equals. as n → ∞. (nk ) k → pi . Hence pi k i. p(n) = (n) (n) (n) (n) (n) ∞ p1 . schematically. . whose proof we omit7 : 7 See Br´maud (1999)..e. p2 . there must be a subsequence exists and equals. for all i ∈ S. . . 2. 2. say. .19 Limiting theory for Markov chains* We wish to complete our discussion about the possible limits of the n-step transition prob(n) abilities pij . then pij → 0 as n → ∞. If the period is d then the limit exists only along a selected subsequence (Theorem 18). .) is a k k subsequence of each (ni . and so on.1 Limiting probabilities for null recurrent states Theorem 19. for each n = 1. then pij → π(j) (Theorems 17). . p3 . So the diagonal subsequence (nk . . . Consider. for each i. exists for all i ∈ N. e 53 . Since the numbers p1 n1 1 n1 2 (n) (n1 ) < < n1 < · · · . . Recall that we know what happens when j is positive recurrent and i is in the same commu(n) nicating class as j If the period is 1. a probability distribution on N. k = 1. such that limk→∞ pi k exists and equals. . 19. We have. . say. the sequence n3 of the third row is a subsequence k k of n2 of the second row. the sequence n2 of the second row k is a subsequence of n1 of the ﬁrst row. For each i we have found subsequence ni . say. the ﬁrst of which is known as HellyBray Lemma: Lemma 10. . . .). . We also saw in Corollary (n) 7 that. k = 1. as k → ∞. pi . 2. It is clear how we (ni ) k continue. p1 . If j is a null recurrent state then pij → 0. for all i ∈ S. are contained between 0 and 1. where = 1 and pi ≥ 0 for all i. 2. p1 p1 (n1 ) 1 p1 (n1 ) 2 (n2 ) 2 p1 (n1 ) 3 (n2 ) 3 (n3 ) 3 (n2 ) 1 (n3 ) 1 ··· ··· ··· ··· → → → ··· p1 p2 p3 ··· p1 p1 ··· p1 p1 ··· (n3 ) 2 p1 ··· Now pick the numbers in the diagonal. i. 2. . for each The second result is Fatou’s lemma. if j is transient.. k = 1. p2 . . k = 1. Since they are (n2 ) subsequence n2 of n1 such that limk→∞ p1 k k k (nk ) (n) as n → ∞. By construction. . Then we can ﬁnd a i=1 pi sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi Proof. such that limk→∞ p1 k 3 (n1 ) numbers p2 k . The proof will be based on two results from Analysis.

i. are sequences of nonnegative numbers then ∞ (n) yi lim inf xi i=1 n→∞ (n) ∞ ≤ lim inf n→∞ yi x i . (n ) (n ) Consider the probability vector By Lemma 10. If (yi . x ∈ S . .e. actually. Suppose that for some i. y∈S x ∈ S. we have 0< x∈S The inequality follows from Lemma 11. αx ≤ lim inf k→∞ (n′ ) for all x ∈ S. 54 . pix k . equalities. x∈S (n′ ) Since aj > 0. for some x0 ∈ S. i ∈ N). and this is a contradiction (the number αx = αx cannot be strictly larger than itself). x ∈ S. αy pyx . y∈S Since x∈S αx ≤ 1. pix k = 1. state j is positive recurrent: Contradiction. (n′ ) Hence αx ≥ αy pyx . By Corollary 13. i. by Lemma 11. i ∈ N). But Pn+1 = Pn P. we can pick subsequence (n′ ) of (nk ) such that k k→∞ lim pix k = αx . We wish to show that these inequalities are. i. 2. i=1 (n) Proof of Theorem 19. Suppose that one of them is not. n = 1. pix k Therefore. Let j be a null recurrent state. it is a strict inequality: αx0 > y∈S αy pyx0 .e. Hence we can ﬁnd a subsequence nk such that k→∞ lim pij k = αj > 0. π is a stationary distribution. Now π(j) > 0 because αj > 0. by assumption. Thus. . . (xi .e. This would imply that αx > x∈S x∈S y∈S x∈S αy pyx = y∈S αy x∈S pyx = y∈S αy .Lemma 11. the sequence (n) pij does not converge to 0. (n′ ) lim pix k (n′ +1) ≥ y∈S k→∞ lim piy k pyx . we have that π(x) := αx y∈S αy satisﬁes the balance equations and has x∈S π(x) = 1. k→∞ (n′ +1) = y∈S piy k pyx . and the last equality is obvious since we have a probability vector.

This is a more old-fashioned way of getting the same result. Remark: We can write the result of Theorem 20 as (n) lim p n→∞ ij = fij . Then (n) lim p n→∞ ij = ϕC (i)π C (j). From the Strong Markov Property. when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class.2 The general case (n) It remains to see what the limit of pij is. that is. Let µ be the distribution of XHC given that X0 = i and that {HC < ∞}. E j Tj where Tj = inf{n ≥ 1 : Xn = j}. The result when the period j is larger than 1 is a bit awkward to formulate. Let HC := inf{n ≥ 0 : Xn ∈ C}. HC ≤ n) n (n) = m=1 Pi (HC = m)Pi (Xn = j | HC = m).2. Example: Consider the following chain (0 < α. Let ϕC (i) := Pi (HC < ∞). as in §10. the result can be obtained by the so-called Abel’s Lemma of Analysis. In this form. starting with the ﬁrst hitting time decomposition relation of Lemma 7. γ < 1): γ 1−γ 1 1 2 1−α α 3 β 4 1−β 5 1 55 . as n → ∞. Indeed. Let i be an arbitrary state. Let C be the communicating class of j and let π C be the stationary distribution associated with C. If i also belongs to C then ϕC (i) = 1 and so the result follows from Theorem 17. In general. and fij = Pi (Tj < ∞). π C (j) = 1/Ej Tj .19. β. and fij = ϕC (j). Pµ (Xn−m = j) → π C (j). Let j be a positive recurrent state with period 1. Theorem 20. as n → ∞. which is what we need. Hence (n) lim p n→∞ ij ∞ = π (j) m=1 C Pi (HC = m) = π C (j)Pi (HC < ∞). Proof. so we will only look that the aperiodic case. we have pij = Pi (Xn = j) = Pi (Xn = j. we have Pi (Xn = j | HC = m) = Pµ (Xn−m = j). But Theorem 17 tells us that the limit exists regardless of the initial distribution.

The phase (or state) space of a particle is not just a physical space but it includes. Theorem 14 is a type of ergodic theorem. ϕ(3) = (1 − α)ϕ(2) + αϕ(4) . Theorem 21. And so is the Law of Large Numbers itself (Theorem 13). The stationary distributions associated with the two closed classes are: 1 γ π C1 (1) = . ϕ(3) = p31 → ϕ(3)π C1 (1). p41 → ϕ(4)π C1 (1). Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P (Ta < ∞) = 1. ϕ(4) = (1 − β)ϕ(5) + βϕ(3). we assumed the existence of a positive recurrent state. Consider the quantity ν [a] (x) deﬁned as the average number of visits to state x between two successive visits to state a–see 16. while. 1 − αβ 1 − αβ Both C1 . 1+γ 1+γ Let ϕ(i) be the probability that class C1 will be reached starting from state i. (n) (n) p32 → ϕ(3)π C1 (2). Hence. (n) (n) p34 → 0. (n) (n) p33 → 0. p45 → (1 − ϕ(4))π C2 (5). We now generalise this. Pioneers in the are were. C2 = {5} (absorbing state). Similarly. p42 → ϕ(4)π C1 (2). Let f be a reward function such that |f (x)|ν [a] (x) < ∞. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its “phase space” is proportional to the volume of this region. C1 = {1. For example. (n) (n) p43 → 0. as n → ∞. for example. 56 (20) x∈S . p44 → 0. We have ϕ(1) = ϕ(2) = 1. 4} (inessential states). the 19-th century French mathematicians Henri Poincar´ and e Joseph Liouville. ϕ(4) = .The communicating classes are I = {3. C2 are aperiodic classes. ϕ(5) = 0. 20 Ergodic theorems An ergodic theorem is a result about time-averages of random processes. in Theorem 14. information about its velocity. among others. π C1 (2) = . Ergodic Theory is nowadays an area that combines many ﬁelds of Mathematics and Science. 2} (closed class). from ﬁrst-step analysis. whence 1−α β(1 − α) . ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. We will avoid to give a precise deﬁnition of the term ‘ergodic’ but only mention what constitutes a result in the area. Recall that. π C2 (5) = 1. (n) (n) p35 → (1 − ϕ(3))π C2 (5). The subject of Ergodic Theory has its origins in Physics.

As in Theorem 14. we have C := i∈S ν(i) < ∞. we have that the denominator equals Nt /t and converges to 1/Ea Ta . and this justiﬁes the terminology of §18. then. we obtain the result. The Law of Large Numbers tells us that 1 lim n→∞ n n−1 Gr = EG1 . since the number of states is ﬁnite. Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem Theorem 14 holds. 21 Finite chains We take a brief look at some things that are speciﬁc to chains with ﬁnitely many states. which is the quantity denoted by f in Theorem 14. as we saw in the proof of Theorem 14. Glast are the total rewards over the ﬁrst and t t a a last incomplete cycles. So if we divide by t the numerator and denominator of the fraction in (21). while Gﬁrst . and this is left as an exercise. Proof. respectively. we have t→∞ lim 1 last 1 G = lim Gﬁrst = 0. So there are always recurrent states. From proposition 2 we know that the balance equations ν = νP have at least one solution. We only need to check that EG1 = f . 57 . Then at least one of the states must be visited inﬁnitely often. If so.5. in the former. we assumed positive recurrence of the ground state a. The reader is asked to revisit the proof of Theorem 14. we have Nt /t → 1/Ea Ta . Remark 1: The diﬀerence between Theorem 14 and 21 is that. Replacing n by the subsequence Nt .Then P where f := x∈S t→∞ lim t n=0 t n=0 f (Xn ) 1(Xn = a) =f =1 . (21) f (x)ν [a] (x). so the numerator converges to f /Ea Ta . But this is precisely the condition (20). t t→∞ t t t with probability one. t t where Gr = T (r) ≤t<T (r+1) f (Xt ). we break the total reward as follows: t Nt −1 f (Xn ) = n=0 r=1 Gr + Gﬁrst + Glast . As in the proof of Theorem 14. Note that the denominator t n=0 1(Xn = a) equals the number Nt of visits to state a up to time t. r=0 with probability 1 as long as E|G1 | < ∞. Now.

22 Time reversibility Consider a Markov chain (Xn . We summarise the above discussion in: Theorem 22. n ≥ 0). Then the chain is reversible if and only if the detailed balance equations hold: π(i)pij = π(j)pji . By Corollary 13. Xn−1 . π(i) := Hence. X0 ). Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. In addition. the chain is irreducible.e. . because positive recurrence is a class property (Theorem 15). . Theorem 23. if we ﬁlm a standing man who raises his arms up and down successively. simply. in a sense. like playing a ﬁlm backwards. without being able to tell that it is. the chain is aperiodic. Lemma 12. i ∈ S. i. X0 ) for all n. . If. . reversible if it has the same distribution when the arrow of time is reversed. 58 . X1 = j) = P (X1 = i. X0 = j). . j ∈ S. actually. we have Pn → Π. . then all states are positive recurrent. Xn has the same distribution as X0 for all n. It is. . the stationary distribution is unique. But if we ﬁlm a man who walks and run the ﬁlm backwards. . Because the process is Markov. More precisely. X1 ) has the same distribution as (X1 . or. this means that it is stationary. Proof. and run the ﬁlm backwards. for all n. Reversibility means that (X0 . . as n → ∞. a ﬁnite Markov chain always has positive recurrent states. the ﬁrst components of the two vectors must have the same distribution. where P is the transition matrix and Π is a matrix all the rows of which are equal to π. Then the chain is positive recurrent with a unique stationary distribution π. X1 . . running backwards. in addition. that is. Xn−1 . by Theorem 16. We say that it is time-reversible. Proof. Consider an irreducible Markov chain with ﬁnitely many states. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. i. X0 ). we see that (X0 . P (X0 = i. . In particular. (X0 . Taking n = 1 in the deﬁnition of reversibility. Xn ) has the same distribution as (Xn . A reversible Markov chain is stationary. Xn ) has the same distribution as (Xn . At least some state i must have π(i) > 0. Therefore π is a stationary distribution. then we instantly know that the ﬁlm is played in reverse. . it will look the same. in addition. . . . C satisﬁes the balance equations and the normalisation condition i∈S π(i) = 1.Hence 1 ν(i). . For example. this state must be positive recurrent. X1 . Suppose ﬁrst that the chain is reversible. If.

using DBE. Suppose ﬁrst that the chain is reversible. . for all i. X1 = i) = π(j)pji . Now pick a state k and write. The same logic works for any m and can be worked out by induction. j. X2 = i). So the chain is reversible. But P (X0 = i. pji 59 (m−3) → π(i1 ). . Hence the DBE hold. . suppose that the KLC (22) holds. X1 = j) = π(i)pij . pi1 i2 pi2 i3 · · · pim−1 im = pi1 im pim im−1 · · · pi2 i1 . then.for all i and j. and pi1 i2 (m−3) → π(i2 ). . . suppose the DBE hold. We will show the Kolmogorov’s loop criterion (KLC). im ∈ S and all m ≥ 3. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. im−1 we obtain (m−3) (m−3) pi1 i2 pi2 i1 = pi1 i2 pi2 i1 If the chain is aperiodic. for all n. Hence cancelling the π(i) from the above display we obtain the KLC for m = 3. . . Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. we have π(i) > 0 for all i. P (X0 = j. Then the chain is reversible if and only if the Kolmogorov’s loop criterion holds: for all i1 . and the chain is reversible. and this means that P (X0 = i. X0 ). . X1 = j. and write. letting m → ∞. X2 = k) = P (X0 = k. k. Theorem 24. (X0 . . i4 . . it is easy to show that. The DBE imply immediately that (X0 . X2 ) has the same distribution as (X2 . These are the DBE. X1 = j. . In other words. π(i)pij pjk pki = [π(j)pji ]pjk pki = [π(j)pjk ]pji pki = [π(k)pkj ]pji pki = [π(k)pki ]pji pkj = [π(i)pik ]pji pkj = π(i)pik pkj pji Since the chain is irreducible. then the chain cannot be reversible. (22) Proof. Xn ) has the same distribution as (Xn . k. . X1 . By induction. we have pi2 i1 implying that π(i1 )pi1 i2 = π(i2 )pi2 i1 . Conversely. Summing up over all choices of states i3 . . X1 ) has the same distribution as (X1 . i2 . Xn−1 . X1 . π(i)pij pjk = [π(j)pji ]pjk = [π(j)pjk ]pji = π(k)pkj pji . X0 ). X1 . and hence (X0 . . Pick 3 states i. if there is an arrow from i to j in the graph of the chain but no arrow from j to i. using the DBE. we have separation of variables: pij = f (i)g(j). . So the detailed balance equations (DBE) hold. . . X0 ). j. 2) The DBE are equivalent to saying that the ratio pij /pji can be written as a product of a function of i and a function of j. Conversely.

so we have pij g(j) = . If the graph thus obtained is a tree. i. • If the chain has inﬁnitely many states. meaning that it has no loops. j for which pij > 0. then we need. that the chain is positive recurrent. pji g(i) Multiplying by g(i) and summing over j we ﬁnd g(i) = j∈S g(j)pji . then the chain is reversible. If we remove the link between them. j for which pij > 0.(Convention: we form this ratio only when pij > 0. So g satisﬁes the balance equations. Hence the chain is reversible.) Necessarily. and the arrow from j to j is the only one from Ac to A. There is obviously only one arrow from AtoAc .e. in addition. the graph becomes disconnected. The balance equations are written in the form F (A. and. the two oriented edges from i to j and from j to i by a single edge without arrow. namely the arrow from i to j. f (i)g(i) must be 1 (why?). A) (see §4. the detailed balance equations. consider the chain: Replace it by: This is a tree. to guarantee. for each pair of distinct states i. and by replacing. Ac be the sets in the two connected components. by assumption. Hence the original chain is reversible. Pick two distinct states i. For example. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. there isn’t any. using another method. Form the graph by deleting all self-loops.) Let A. This gives the following method for showing that a ﬁnite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. Ac ) = F (Ac . 60 . there would be a loop. And hence π can be found very easily. (If it didn’t.1) which gives π(i)pij = π(j)pji . The reason is as follows. Example 1: Chain with a tree-structure.
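Both routes, checking detailed balance against a candidate π or checking Kolmogorov's loop criterion (which needs no π at all), are mechanical on a small chain; the two matrices below are our own illustrations.

```python
import numpy as np

P_sym = np.array([[0.50, 0.25, 0.25],    # symmetric matrix: reversible w.r.t. the uniform pi
                  [0.25, 0.50, 0.25],
                  [0.25, 0.25, 0.50]])
P_rot = np.array([[0.1, 0.6, 0.3],       # a "rotating" chain: not reversible
                  [0.3, 0.1, 0.6],
                  [0.6, 0.3, 0.1]])

def loop_products(P):
    """Kolmogorov's loop criterion on the loop 1 -> 2 -> 3 -> 1 and its reversal."""
    return P[0, 1] * P[1, 2] * P[2, 0], P[0, 2] * P[2, 1] * P[1, 0]

for name, P in (("symmetric", P_sym), ("rotating ", P_rot)):
    fwd, bwd = loop_products(P)
    print(name, "loop products:", round(fwd, 4), round(bwd, 4),
          " equal?", bool(np.isclose(fwd, bwd)))
```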

Example 2: Random walk on a graph. Consider an undirected graph G with vertices S, a finite set, and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. States i and j are called neighbours if they are connected by an edge. For each state i define its degree δ(i) = number of neighbours of i. Now define the following transition probabilities:
pij = 1/δ(i), if j is a neighbour of i;  pij = 0, otherwise.
The Markov chain with these transition probabilities is called a random walk on a graph (RWG). Suppose the graph is connected.
[Figure: a graph on the states 0, 1, 2, 3, 4; here p12 = p10 = p14 = 1/3, p02 = 1/4, etc.]
The stationary distribution of a RWG is very easy to find. We claim that π(i) = Cδ(i), where C is a constant. To see this, let us verify that the detailed balance equations hold, i.e. π(i)pij = π(j)pji. There are two cases: if i, j are neighbours then π(i)pij = Cδ(i) · (1/δ(i)) = C and π(j)pji = Cδ(j) · (1/δ(j)) = C, so both members are equal to C; if i, j are not neighbours then both sides are 0. In all cases, the DBE hold. The constant C is found by normalisation: Σ_{i∈S} π(i) = 1, i.e. C Σ_{i∈S} δ(i) = 1. Since Σ_{i∈S} δ(i) = 2|E|, where |E| is the total number of edges, we have that C = 1/2|E|. Hence we conclude that:
1. A RWG is a reversible chain.
2. The stationary distribution assigns to a state probability which is proportional to its degree.
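A quick numerical check of these two conclusions (a minimal sketch; the edge set below is an assumed example, not necessarily the graph in the figure):

```python
from fractions import Fraction

# Undirected graph on the states {0, 1, 2, 3, 4} (assumed edge set).
edges = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 3)]
S = range(5)

deg = {i: sum(1 for e in edges if i in e) for i in S}   # degree of each state
num_edges = len(edges)

# Stationary distribution of the random walk on the graph: pi(i) = deg(i) / (2|E|)
pi = {i: Fraction(deg[i], 2 * num_edges) for i in S}
print(pi, sum(pi.values()))   # proportional to degrees, summing to 1

# Detailed balance: pi(i) p_ij = pi(j) p_ji = 1/(2|E|) for every edge {i, j}
for i, j in edges:
    assert pi[i] * Fraction(1, deg[i]) == pi[j] * Fraction(1, deg[j]) == Fraction(1, 2 * num_edges)
```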

this process is also a Markov chain and it does have time-homogeneous transition probabilities. . Let A be (r) a set of states and let TA be the r-th visit to A. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). t ≥ 0. . in general. k = 0.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ).2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. random variables with values in N and distribution λ(n) = P (ξ0 = n). 2.e. .1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. . . then (Yk .) has the Markov property but it is not time-homogeneous. 2. if nk = n0 + km. . The sequence (Yk . 1. 62 n ∈ N. 1.d. k = 0. . Let (St . where pij are the m-step transition probabilities of the original chain. If the subsequence forms an arithmetic progression. . . 1. . In this case.e. then it is a stationary distribution for the new chain. 2. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . . The state space of (Yr ) is A. irreducible and positive recurrent). 2.3 Subordination Let (Xn . . St+1 = St + ξt . n ≥ 0) be Markov with transition probabilities pij . t ≥ 0) be independent from (Xn ) with S0 = 0. 2. how can we create new ones? 23. k = 0. if. . k = 0. Due to the Strong Markov Property.i. in addition.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . i. i. . . . (m) (m) 23. 23. where ξt are i. Deﬁne Yr := XT (r) . A r = 1.e. . 1. π is a stationary distribution for the original chain.
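For the subsequence construction, the m-step transition probabilities of the new chain are just the entries of the m-th matrix power, and a stationary distribution of the original chain remains stationary for it. A small numerical illustration (the 3-state matrix is an arbitrary assumption):

```python
import numpy as np

# Transition matrix of the original chain (assumed example).
P = np.array([[0.1, 0.9, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])

m = 3
P_m = np.linalg.matrix_power(P, m)   # m-step transition probabilities p_ij^(m)

# Y_k = X_{n0 + k m} has time-homogeneous transitions P^m, and any stationary
# distribution pi of P is also stationary for P^m:
pi = np.full(3, 1 / 3.0)
for _ in range(5_000):
    pi = pi @ P
print(np.allclose(pi @ P_m, pi))   # True
```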

. in general. however. n≥0 63 . Proof. . i. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . (Notation: We will denote the elements of S by i.e. knowledge of f (Xn ) implies knowledge of Xn and so. . Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). Fix α0 . is NOT. . Indeed. Yn−1 ) = P (Yn+1 = β | Yn = α). If. . then the process Yn = f (Xn ). .. f is a one-to-one function then (Yn ) is a Markov chain. .4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . . that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). We need to show that P (Yn+1 = β | Yn = α. . a Markov chain. we have: Lemma 13. Yn−1 = αn−1 }. β. . α1 . . .) More generally. Then (Yn ) has the Markov property. .Then Yt := XSt . . β). conditional on Yn the past is independent of the future. (n) 23.. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. and the elements of S ′ by α. j. ∞ λ(n)pij .

In other words. β)P (Xn = i | Yn = α. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Yn−1 ) Q(f (i). i. β). α > 0 Q(α. Yn = α. Example: Consider the symmetric drunkard’s random walk. conditional on {Xn = i}. β ∈ Z+ . if β = α ± 1. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. We have that P (Yn+1 = β|Xn = i) = Q(|i|. β) := 1. for all i ∈ Z and all β ∈ Z+ . To see this. a Markov chain with transition probabilities Q(α. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Yn−1 )P (Xn = i | Yn = α. On the other hand. β). the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. and so (Yn ) is. β). if β = 1. indeed. α = 0.for all choices of the variables involved. and if i = 0. β) P (Yn = α. depends on i only through |i|. regardless of whether i > 0 or i < 0. because the probability of going up is the same as the probability of going down. Then (Yn ) is a Markov chain. Thus (Yn ) is Markov with transition probabilities Q(α. β) P (Yn = α|Yn−1 ) i∈S = Q(α. β).e. 64 . P (Xn+1 = i ± 1|Xn = i) = 1/2. if i = 0. i ∈ Z. Indeed. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. i ∈ Z. observe that the function | · | is not one-to-one. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). if we let 1/2. But we will show that P (Yn+1 = β|Xn = i). But P (Yn+1 = β | Yn = α. then |Xn+1 | = 1 for sure. Deﬁne Yn := |Xn |. β) P (Yn = α|Xn = i. β) i∈S P (Yn = α|Xn = i. Yn−1 ) Q(f (i).
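This example is easy to check by simulation. The sketch below (illustrative only, not part of the notes) estimates the empirical transition frequencies of Yn = |Xn| for the symmetric drunkard's walk; they come out close to Q(α, α ± 1) = 1/2 for α > 0 and Q(0, 1) = 1.

```python
import random
from collections import Counter

random.seed(0)
X, counts = 0, Counter()
for _ in range(200_000):
    step = random.choice((-1, 1))        # symmetric drunkard's walk
    X_new = X + step
    counts[(abs(X), abs(X_new))] += 1    # observed transition of Y_n = |X_n|
    X = X_new

# Empirical Q(alpha, beta) for a few small alpha:
for a in range(3):
    total = sum(c for (x, _), c in counts.items() if x == a)
    print(a, {b: round(c / total, 3) for (x, b), c in counts.items() if x == a})
```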

p0 = 1 − p. n ≥ 0) has the Markov property. It is a Markov chain with time-homogeneous transitions. and. 1. Thus. then we see that the above equation is of the form Xn+1 = f (Xn . we have that (Xn . Back to the general case. . j In this case. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. Then P (ξ = 0) + P (ξ ≥ 2) > 0. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. So assume that p1 = 1. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . We will not attempt to compute the pij neither draw the transition diagram because they are both futile. We let pk be the probability that an individual will bear k oﬀspring. . the population cannot grow: it will eventually reach 0. In other words. ξ (1) . if the current population is i there is probability pi that everybody gives birth to zero oﬀspring. k = 0.d. 0 ≤ j ≤ i. if 0 is never reached then Xn → ∞. so we will ﬁnd some diﬀerent method to proceed.. Hence all states i ≥ 1 are transient. We ﬁnd the population (n) at time n + 1 by adding the number of oﬀspring of each individual. ξ (n) ). .i. Let ξk be the number of oﬀspring of the k-th individual in the n-th generation. and 0 is always an absorbing state. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. 0 But 0 is an absorbing state. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. a hard problem. which has a binomial distribution: i j pij = p (1 − p)i−j .24 24. if the 65 (n) (n) . each individual gives birth to at one child with probability p or has children with probability 1 − p. The state 0 is irrelevant because it cannot be reached from anywhere. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. therefore transient. ξ2 . . independently from one another. one question of interest is whether the process will become extinct or not. . . since ξ (0) . . Example: Suppose that p1 = p.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. Therefore. . The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of oﬀspring. and following the same distribution. with probability 1. in general. . 2. But here is an interesting special case. are i. (and independent of the initial population size X0 –a further assumption). If we let ξ (n) = ξ1 . Hence all states i ≥ 1 are inessential. The parent dies immediately upon giving birth.

ϕn (0) = P (Xn = 0). Consider a Galton-Watson process starting with X0 = 1 initial ancestor. For each k = 1. Let ∞ µ = Eξ = k=1 kpk be the mean number of oﬀspring of an individual. . if we start with X0 = i in ital ancestors. Therefore the problem reduces to the study of the branching process starting with X0 = 1. . when we start with X0 = i. we can argue that ε(i) = εi . i. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. For i ≥ 1. . We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. The result is this: Proposition 5. since Xn = 0 implies Xm = 0 for all m ≥ n. ε(0) = 1. 66 . we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. This function is of interest because it gives information about the extinction probability ε. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. We wish to compute the extinction probability ε(i) := Pi (D). where ε = P1 (D). Indeed. Indeed. . so P (D|X0 = i) = εi . Obviously. Let ε be the extinction probability of the process.population does not become extinct then it must grow to inﬁnity. We then have that D = D1 ∩ · · · ∩ Di . each one behaves independently of one another. and. and P (Dk ) = ε. Proof. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. If µ ≤ 1 then the ε = 1.

Therefore. and so on. So ε ≤ ε∗ < 1. Let us show that. By induction. On the other hand. ϕn+1 (z) = ϕ(ϕn (z)). ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). and. Since both ε and ε∗ satisfy z = ϕ(z). recall that ϕ is a convex function with ϕ(1) = 1. then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. Therefore ε = ε∗ . ε = ε∗ . since ϕ is a continuous function. in fact. ϕn (0) ≤ ε∗ . the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. But ϕn+1 (0) → ε as n → ∞. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . if the slope µ at z = 1 is ≤ 1. But 0 ≤ ε∗ . all we need to show is that ε < 1. and since the latter has only two solutions (z = 0 and z = ε∗ ). Therefore ε = ϕ(ε). if µ > 1. This means that ε = 1 (see left ﬁgure below). (See right ﬁgure below. In particular. ϕn+1 (0) = ϕ(ϕn (0)). Also ϕn (0) → ε. Notice now that µ= d ϕ(z) dz . ϕ(ϕn (0)) → ϕ(ε). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)).Note also that ϕ0 (z) = 1. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). But ϕn (0) → ε.) 67 . z=1 Also.
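In practice ε can be computed exactly as in the proof, by iterating ϕ starting from 0 (ϕ_{n+1}(0) = ϕ(ϕ_n(0)) → ε). A minimal sketch with an assumed offspring distribution:

```python
def extinction_probability(p, tol=1e-12, max_iter=10_000):
    """Smallest fixed point of phi(z) = sum_k p_k z^k, obtained as the limit of phi_n(0).
    `p` is the offspring distribution: p[k] = P(xi = k)."""
    phi = lambda z: sum(pk * z**k for k, pk in enumerate(p))
    eps = 0.0
    for _ in range(max_iter):
        new = phi(eps)            # eps_{n+1} = phi(eps_n), i.e. phi_{n+1}(0)
        if abs(new - eps) < tol:
            break
        eps = new
    return eps

# Assumed distribution: mu = 0*0.25 + 1*0.25 + 2*0.5 = 1.25 > 1, so extinction prob < 1.
print(extinction_probability([0.25, 0.25, 0.5]))   # ~0.5, the smaller root of phi(z) = z
```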

24. If s ≥ 2. then time slots 0. Then ε < 1. So if there are. there are x + a packets. are allocated to hosts 1. . If collision happens. 2. . otherwise.. there is a collision and.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. is that if more than one hosts attempt to transmit a packet. During this time slot assume there are a new packets that need to be transmitted. if Sn = 1. . . One obvious solution is to allocate time slots to hosts in a speciﬁc order. periodically. there is no time to make a prior query whether a host has something to transmit or not. 3. only one of the hosts should be transmitting at a given time. Xn+1 = Xn + An . 1. random variables with some common distribution: P (An = k) = λk . 1. . 4. So. 3. Case: µ > 1. on the next times slot. 0). The problem of communication over a common channel. Ethernet has replaced Aloha. using the same frequency. 2. for a packet to go through. then the transmission is not successful and the transmissions must be attempted anew. Abandoning this idea. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. Let s be the number of selected packets. We assume that the An are i. on the next times slot. Then ε = 1. 5. 6. we let hosts transmit whenever they like. If s = 1 there is a successful transmission and. there is collision and the transmission is unsuccessful. 3. 1. Nowadays.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). 3 hosts. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. 68 . then max(Xn + An − 1. and Sn the number of selected packets.i.d. . assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). there are x + a − 1 packets. say. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. k ≥ 0. Thus. An the number of new packets on the same time slot. Since things happen fast in a computer network. 2. in a simpliﬁed model.

To compute px. 69 . So P (Sn = k|Xn = x. Since p > 0. i. Idea of proof. So we can make q x G(x) as small as we like by choosing x suﬃciently large. we have q < 1. . we should not subtract 1. To compute px. An = a. if Xn + An = 0. deﬁned by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. and p.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . then the chain is transient. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. and so q x tends to 0 much faster than G(x) tends to ∞. we think as follows: to have a reduction of x by 1. k ≥ 1.where λk ≥ 0 and k≥0 kλk = 1. the Markov chain (Xn ) is transient. n ≥ 0) is a Markov chain. Unless µ = 0 (which is a trivial case: no arrivals!). for example we can make it ≤ µ/2. selections are made amongst the Xn + An packets.) = x + a k x−k p q . Therefore ∆(x) ≥ µ/2 for all large x.x are computed by subtracting the sum of the above from 1.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. The probabilities px. The process (Xn . Indeed.x−1 when x > 0. Xn−1 .e. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). . The random variable Sn has. where G(x) is a function of the form α + βx . conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . Let us ﬁnd its transition probabilities.x−1 + k≥1 kpx. and exactly one selection: px. . For x > 0. An−1 . The reason we take the maximum with 0 in the above equation is because.x+k for k ≥ 1. where q = 1 − p. So: px. k 0 ≤ k ≤ x + a. we must have no new packets arriving. we have ∆(x) = (−1)px. We also let µ := k≥1 kλk be the expectation of An . Sn = 1|Xn = x) = λ0 xpq x−1 . We here only restrict ourselves to the computation of this expected change. The main result is: Proposition 6.e. Proving this is beyond the scope of the lectures. we have that the drift is positive for all large x and this proves transience. so q x G(x) tends to 0 as x → ∞.x−1 = P (An = 0. The Aloha protocol is unstable.
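A short simulation illustrates the instability. With an assumed arrival distribution (λ0, λ1, λ2) and a small retransmission probability p, the backlog Xn keeps growing; only the event {Sn = 1} matters for the dynamics, and its conditional probability given Xn + An = k is kpq^{k−1}.

```python
import random

random.seed(1)
p = 0.1                    # retransmission probability (assumed)
lam = [0.5, 0.3, 0.2]      # P(A_n = k) for k = 0, 1, 2; mean mu = 0.7 (assumed)

def new_arrivals():
    u, c = random.random(), 0.0
    for k, pk in enumerate(lam):
        c += pk
        if u < c:
            return k
    return len(lam) - 1

X = 0
for _ in range(100_000):
    A = new_arrivals()
    k = X + A
    # S_n ~ Binomial(k, p); only whether S_n = 1 matters:
    p_success = k * p * (1 - p) ** (k - 1) if k >= 1 else 0.0
    X = max(k - (random.random() < p_success), 0)
print(X)   # typically tens of thousands: the backlog drifts to infinity (transience)
```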

there are companies that sell high-rank pages. Comments: 1. in the latter case. Therefore. PageRank. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. 2. All you need to do to get high rank is pay (a lot of money). There is a technique. 3. A new use of PageRank is to rank academic doctoral programs based on their records of ﬁnding jobs for their graduates. If L(i) = ∅. if L(i) = ∅ 0. Deﬁne transition probabilities as follows: 1/|L(i)|. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. So if Xn = i is the current location of the chain. neither that it is aperiodic. Then it says: page i has higher rank than j if π(i) > π(j). the rank of a page may be very important for a company’s proﬁts. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. because of the size of the problem. otherwise. then i is a ‘dangling page’. Academic departments link to each other by hiring their 70 . The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus.2) and let α pij := (1 − α)pij + . unless he gets bored–and this happens with probability α–in which case he clicks on a random page. It is not clear that the chain is irreducible. Its implementation though is one of the largest matrix computations in the world. if j ∈ L(i) pij = 1/|V |. In an Internet-dominated society. It uses advanced techniques from linear algebra to speed up computations. the chain moves to a random page. We pick α between 0 and 1 (typically. unless i is a dangling page. α ≈ 0.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. So we make the following modiﬁcation. For each page i let L(i) be the set of pages it links to. |V | These are new transition probabilities.24. that allows someone to manipulate the PageRank of a web page and make it much higher. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. known as ‘302 Google jacking’. The idea is simple. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. So this is how Google assigns ranks to pages: It uses an algorithm.

a dominant (say A) and a recessive (say a). Do the same thing 2N times. in each generation. their oﬀspring inherits an allele from each parent. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. maintain this allele for the next generation.g. all we need is the information outlined in this ﬁgure. If so. 5. The process repeats. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. We are now ready to deﬁne the Markov chain Xn representing the number of alleles of type A in generation n. And so on. This means that the external appearance of a characteristic is a function of the pair of alleles. One of the circle alleles is selected twice. Three of the square alleles are never selected. we will assume that. from the point of view of an administrator. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). two AA individuals will produce and AA child. One of the square alleles is selected 4 times. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their oﬀspring did not inherit this allele. We will assume that a child chooses an allele from each parent at random. 24. Thus. colour of eyes) appears in two forms. a gene controlling a speciﬁc characteristic (e.faculty from each other (and from themselves). 71 . Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. the genetic diversity depends only on the total number of alleles of each type. 4. When two individuals mate. We will also assume that the parents are replaced by the children. Since we don’t care to keep track of the individuals’ identities. What is important is that the evaluation can be done without ever looking at the content of one’s research. meaning how many characteristics we have. Imagine that we have a population of N individuals who mate at random producing a number of oﬀspring.4 The Wright-Fisher model: an example from biology In an organism. it makes the evaluation procedure easy. We have a transition from x to y if y alleles of type A were produced. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. The alleles in an individual come in pairs and express the individual’s genotype for that gene. generation after generation. This number could be independent of the research quality but. But an AA mating with a Aa may produce a child of type AA or one of type Aa. We would like to study the genetic diversity. then. this allele is of type A with probability x/2N . To simplify. In this ﬁgure there are 4 alleles of type A (circles) and 6 of type a (squares).

The fact that M is even is irrelevant for the mathematical model. . as usual. since. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . ξM are i. The result is correct. but the argument needs further justiﬁcation because “the end of time” is not a deterministic time. X0 ) = E(Xn+1 |Xn ) = M This implies that.i. Since pxy > 0 for all 1 ≤ x. . conditionally on (X0 . Clearly. i. since. p00 = p2N. . .e. X0 ) = 1 − Xn . . . Similarly. ϕ(x) = y=0 pxy ϕ(y). y ≤ 2N − 1. M n P (ξk = 0|Xn . ϕ(M ) = 1. the expectation doesn’t change and. Let M = 2N . on the average. with n P (ξk = 1|Xn . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. . the number of alleles of type A won’t change. These are: M ϕ(0) = 0.) One can argue that. . the random variables ξ1 . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. n n where. X0 ) = Xn . 2N − 1} form a communicating class of inessential states. . . 0 ≤ y ≤ 2N. . E(Xn+1 |Xn . . at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). M and thus. .2N = 1. M Therefore. . . ϕ(x) = x/M. .d. We show that our guess is correct by proceeding via the mundane way of verifying that it satisﬁes the ﬁrst-step analysis equations. . Since selections are independent.e. we see that 2N x 2N −y x y pxy = . Tx := inf{n ≥ 1 : Xn = x}. 72 . Clearly. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . . so both 0 and 2N are absorbing states. . .Each allele of type A is selected with probability x/2N . and the population becomes one consisting of alleles of type a only. and the population becomes one consisting of alleles of type A only. Xn ). . for all n ≥ 0. on the average. . the states {1. i. (Here. the chain gets absorbed by either 0 or 2N . EXn+1 = EXn Xn = Xn . . .
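A simulation check of the absorption probability ϕ(x) = x/M (a sketch with assumed values of N and the initial number of A alleles):

```python
import random

random.seed(2)
N = 10
M = 2 * N     # total number of alleles
x0 = 6        # initial number of A alleles (assumed)

def run(x):
    # X_{n+1} ~ Binomial(M, X_n / M): each allele of the next generation is of
    # type A with probability X_n / M, independently, until absorption at 0 or M.
    while 0 < x < M:
        x = sum(random.random() < x / M for _ in range(M))
    return x

trials = 10_000
fixed_at_M = sum(run(x0) == M for _ in range(trials))
print(fixed_at_M / trials, x0 / M)   # estimate of phi(x0) vs x0/M = 0.3
```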

Here are our assumptions: The pairs of random variables ((An . and the stock reached level zero at time n + 1. so that Xn+1 = (Xn + ξn ) ∨ 0. the demand is met instantaneously. larger than the supply per unit of time.d. b). . We will apply the so-called Loynes’ method. on the average. the demand is only partially satisﬁed. the demand is. Xn electronic components at time n.The ﬁrst two are obvious. M x M (M −1)−(y−1) x x +1− M M M −1 = 24. say. M −1 y−1 x M y−1 1− x . A number An of new components are ordered and added to the stock. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. at time n + 1. We have that (Xn . and E(An − Dn ) = −µ < 0. If the demand is for Dn components. there are Xn + An − Dn components in the stock. n ≥ 0) are i. we will try to solve the recursion. Let ξn := An − Dn . n ≥ 0) is a Markov chain. we ﬁnd X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. Working for n = 1. 73 . The correctness of the latter can formally be proved by showing that it does satisfy the recursion. provided that enough components are available. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. We will show that this Markov chain is positive recurrent.5 A storage or queueing application A storage facility stores. Dn ). Since the Markov chain is represented by this very simple recursion. Otherwise. for all n ≥ 0. . Thus. 2. . and so. while a number of them are sold to the market.i.

. . where the latter equality follows from the fact that.i. Hence. 74 . . for y ≥ 0.d. the random variables (ξn ) are i. . and. In particular. Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. . . ξn . that d (ξ0 . ξ0 ). . . . . for all n. ξn−1 ) = (ξn−1 . we reverse it. . we may replace the latter random variables by ξ1 . . = x≥0 Since the n-dependent terms are bounded. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). . . Notice. . ξn−1 . . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). in computing the distribution of Mn which depends on ξ0 . Therefore. so we can shuﬄe them in any way we like without altering their joint distribution. ξn−1 ). (n) . instead of considering them in their original order. . however. . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . using π(y) = limn→∞ P (Mn = y). (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . . . we ﬁnd π(y) = x≥0 π(x)pxy . we may take the limit of both sides as n → ∞. Indeed. the event whose probability we want to compute is a function of (ξ0 . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisﬁes the balance equations.Thus. Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y).

Here. n→∞ Hence g(y) = P (0 ∨ max{Z1 . . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. . we will use the assumption that Eξn = −µ < 0. n→∞ This implies that P ( lim Zn = −∞) = 1.So π satisﬁes the balance equations. as required. the Markov chain may be transient or null recurrent. . .} < ∞) = 1. Z2 . Depending on the actual distribution of ξn . This will be the case if g(y) is not identically equal to 1. The case Eξn = 0 is delicate. We must make sure that it also satisﬁes y π(y) = 1. 75 . .} > y) is not identically equal to 1. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . Z2 . .
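The positive recurrence when E(An − Dn) < 0 shows up clearly in a simulation of the recursion Xn+1 = (Xn + An − Dn) ∨ 0; the arrival and demand distributions below are assumptions chosen only to make the mean drift negative.

```python
import random
from collections import Counter

random.seed(3)

def step(x):
    a = random.randint(0, 2)   # arrivals A_n, mean 1     (assumed)
    d = random.randint(0, 3)   # demand   D_n, mean 1.5   (assumed), so E(A_n - D_n) = -0.5 < 0
    return max(x + a - d, 0)

x, hist = 0, Counter()
for n in range(200_000):
    x = step(x)
    if n >= 100_000:           # crude burn-in, then collect the empirical law of X_n
        hist[x] += 1
total = sum(hist.values())
print({y: round(hist[y] / total, 3) for y in sorted(hist)[:6]})   # stabilises
```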

Our Markov chains. A random walk possesses an additional property.and space-homogeneity. . .y = p(y − x). have been processes with timehomogeneous transition probabilities. j = 1. otherwise where p + q = 1. p(x) = 0. we need to give more structure to the state space S. If d = 1. Example 1: Let p(x) = C2−|x1 |−···−|xd | . this walk enjoys. . Deﬁne pj . . if x = ej . x ∈ Zd . but. the skip-free property. that of spatial homogeneity. .PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk.g. besides time.y = px+z. if x = −ej . . we won’t go beyond Zd . A random walk of this type is called simple random walk or nearest neighbour random walk. For example. namely. . To be able to say this. . here. d 0. . ±ed } otherwise 76 . 0. if x ∈ {±e1 . . a random walk is a Markov chain with time. groups) are allowed. the set of d-dimensional vectors with integer coordinates. given a function p(x). otherwise.and space-homogeneous transition probabilities given by px. Example 3: Deﬁne p(x) = 1 2d . a structure that enables us to say that px. In dimension d = 1. p(−1) = q. . d p(x) = qj . More general spaces (e. so far. where p1 + · · · + pd + q1 + · · · + qd = 1.y+z for any translation z. . So. j = 1. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. we can let S = Zd . This is the drunkard’s walk we’ve seen before. This means that the transition probability pxy should depend on x and y only through their relative positions in space. where C is such that x p(x) = 1. then p(1) = p.
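A few lines of code suffice to generate such walks. The sketch below simulates a simple symmetric random walk on Z^d (here d = 2), moving to each of the 2d neighbouring states with probability 1/2d:

```python
import random

random.seed(4)
d = 2
# the 2d unit steps +/- e_j
steps = [tuple(1 if k == j else 0 for k in range(d)) for j in range(d)] \
      + [tuple(-1 if k == j else 0 for k in range(d)) for j in range(d)]

pos = (0,) * d
path = [pos]
for _ in range(10):
    step = random.choice(steps)                     # each neighbour with prob 1/(2d)
    pos = tuple(a + b for a, b in zip(pos, step))
    path.append(pos)
print(path)
```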

p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . in addition. which. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . We will let p∗n be the convolution of p with itself n times. y p(x)p′ (y − x). x ∈ Zd is deﬁned as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. . But this would be nonsense. . p(x) = 0. Furthermore. One could stop here and say that we have a formula for the n-step transition probability. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x).This is a simple random walk. ξ2 . are i.) Now the operation in the last term is known as convolution. then p(1) = p(−1) = 1/2. We call it simple symmetric random walk. and this is the symmetric drunkard’s walk. Let us denote by Sn the state of the walk at time n. . p′ (x). . (When we write a sum without indicating explicitly the range of the summation variable. We shall resist the temptation to compute and look at other 77 . p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). we can reduce several questions to questions about this random walk. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. . x ∈ Zd . We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x).) In this notation. p ∗ δz = p. the initial state S0 is independent of the increments. where ξ1 . If d = 1. Obviously. We refer to ξn as the increments of the random walk. The convolution of two probabilities p(x). So. because all we have done is that we dressed the original problem in more fancy clothes. . .i. Indeed. The special structure of a random walk enables us to give more detailed formulae for its behaviour. computing the n-th order convolution for all n is a notoriously hard problem. random variables in Zd with common distribution P (ξn = x) = p(x). x + ξ2 = y) = y p(x)p(y − x). x ± ed . Furthermore. otherwise. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . (Recall that δz is a probability distribution that assigns probability 1 to the point z. in general.d. we will mean a sum over all possible values.

After all. . we are dealing with a uniform distribution on the sample space {−1.0) We can thus think of the elements of the sample space as paths. we shall learn some counting methods. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. 2) → (3. a random vector with values in {−1. In the sequel. 1) → (4. +1}n . we have P (ξ1 = ε1 .2) (4. random variables (random signs) with P (ξn = ±1) = 1/2. +1}n . 2 where #A is the number of elements of A. . So. If yes. Clearly. .1) − (3. but its practice may be hard. +1}n . . + (1. ξn ).i−1 = 1/2 for all i ∈ Z. A ⊂ {−1. 1): (2. 0) → (1. −1. for ﬁxed n. a path of length n. to each initial position and to each sequence of n signs. So if the initial position is S0 = 0.2) where the ξn are i. Here are a couple of meaningful questions: A. . all possible values of the vector are equally likely. . 1) → (2. As a warm-up. The principle is trivial. Since there are 2n elements in {0. First.1) + (0.i+1 = pi. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . consider the following: 78 . 2) → (5.d. 1}n . Therefore.i. . Look at the distribution of (ξ1 . +1. −1} then we have a path in the plane that joins the points (0. if we can count the number of elements of A we can compute its probability. how long will that take? C.methods. we should assign. and so #A P (A) = n . If not. ξn = εn ) = 2−n . +1.1) + − (5. Will the random walk ever return to where it started from? B. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that diﬀerent questions we need to ask or (b) that approximate methods are more informative than exact ones. . Events A that depend only on these the ﬁrst n random signs are subsets of this sample space. and the sequence of signs is {+1.

Proof. y)}. Let us ﬁrst show that Lemma 14. x) (n. y)} = n−m . x) (n.047. y) is given by: #{(m. y)} =: {all paths that start at (m. n 79 .) The darker region shows all paths that start at (0. Hence (n+i)/2 = 0. x) and end up at (n. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. 0) and end up at (n. i). S7 < 0}. 0) (n. In if n + i is not an even number then there is no path that starts from (0. (n − m + y − x)/2 (23) Remark.i) (0. and compute the probability of the event A = {S2 ≥ 0. Consider a path that starts from (0. i). x) and end up at (n.1 Paths that start and end at speciﬁc points Now consider a path that starts from a speciﬁc point and ends up at a speciﬁc point. We can easily count that there are 6 paths in A. 0) and ends at (n. the number of paths that start from (m. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26.0) The lightly shaded region contains all paths of length n that start at (0. The number of paths that start from (0.Example: Assume that S0 = 0. The paths comprising this event are shown below. S6 < 0. 0) and n ends at (n. 0) and end up at the speciﬁc point (n. i). We use the notation {(m. S4 = −1. 0)). (n. y) is given by: #{(0. i)} = n n+i 2 More generally. (There are 2n of them. in this case.

Let u be the number of +'s in this path (the number of upward segments) and d the number of −'s (the number of downward segments). Then
u + d = n,  u − d = i.
Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at, we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly $\binom{n}{u}$ ways, and this proves the formula.
Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently, if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define $\binom{n}{u}$ to be zero if u is not an integer.
More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23).
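For small n, the count in Lemma 14 (and hence formula (23)) can be verified by brute-force enumeration; the following sketch is illustrative only.

```python
from itertools import product
from math import comb

def count_paths(n, i):
    """Brute-force count of +/-1 paths of length n from 0 to i."""
    return sum(1 for signs in product((-1, 1), repeat=n) if sum(signs) == i)

for n, i in [(4, 0), (5, 1), (6, 2), (3, 2)]:
    formula = comb(n, (n + i) // 2) if (n + i) % 2 == 0 else 0
    print(n, i, count_paths(n, i), formula)   # the two counts agree; (3, 2) gives 0
```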

26.2 Paths that start and end at specific points and do not touch zero at all

(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =

1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the speciﬁc point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the ﬁrst time it hits zero and then reﬂecting it around the horizontal axis. In the ﬁgure below, follow the solid path starting at (m, x) up to the ﬁrst time it hits zero and then follow the dotted path: this is the reﬂected path.

(n, y)

(m, x)

(n, −y)

A path and its reﬂection around a horizontal line. Notice that the reﬂection starts only at the point where the path touches the horizontal line.

Note that if we know the reﬂected path we can recover the original path. Hence we may as well count the number of reﬂected paths. But each reﬂected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =

1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)

1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

**Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
**

1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n

n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81
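The ballot-type count (25) can also be verified by brute force for small n (a throwaway sketch, not part of the notes):

```python
from itertools import product
from math import comb
from fractions import Fraction

def positive_paths(n, y):
    """Paths (1,1) -> (n,y) of +/-1 steps that stay strictly positive."""
    count = 0
    for signs in product((-1, 1), repeat=n - 1):   # steps at times 2, ..., n
        s, ok = 1, True
        for e in signs:
            s += e
            if s <= 0:
                ok = False
                break
        count += ok and s == y
    return count

for n, y in [(7, 3), (8, 2), (9, 1)]:
    print(n, y, positive_paths(n, y), Fraction(y, n) * comb(n, (n + y) // 2))
```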

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by:
#{(m, x) → (n, y); remain > z} = $\binom{n-m}{\frac{1}{2}(n-m+y-x)} - \binom{n-m}{\frac{1}{2}(n-m-y-x)+z}$.   (26)

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have
#{(m, x) → (n, y); remain > z} = #{(m, x − z) → (n, y − z); remain > 0},
and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by:
#{(0, 0) → (n, y); remain ≥ 0} = $\binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2}+1}$.   (27)

Proof.
#{(0, 0) → (n, y); remain ≥ 0} = #{(0, 0) → (n, y); remain > −1}
= #{(0, 1) → (n, y + 1); remain > 0}
= $\binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n-y}{2}-1}$ = $\binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2}+1}$,
where we used (26) with z = −1.

Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by:
#{(0, 0) → (n, 0); remain ≥ 0} = $\frac{1}{n+1}\binom{n+1}{n/2}$.

Proof. Apply (27) with y = 0:
#{(0, 0) → (n, 0); remain ≥ 0} = $\binom{n}{\frac{n}{2}} - \binom{n}{\frac{n}{2}+1}$ = $\frac{1}{n+1}\binom{n+1}{n/2}$,
where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by:
#{(0, 0) → (n, •); remain ≥ 0} = $\binom{n}{\lceil n/2 \rceil}$,   (28)
where ⌈x⌉ denotes the smallest integer that is ≥ x.

Proof. To see this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have
#{(0, 0) → (2m, •); remain ≥ 0} = $\sum_{z\ge 0}\left[\binom{2m}{m+z} - \binom{2m}{m+z+1}\right] = \binom{2m}{m} = \binom{n}{n/2}$,
because the terms cancel with one another (the sum telescopes) and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have
#{(0, 0) → (2m + 1, •); remain ≥ 0} = $\sum_{z\ge 0}\left[\binom{2m+1}{m+z+1} - \binom{2m+1}{m+z+2}\right] = \binom{2m+1}{m+1} = \binom{n}{\lceil n/2\rceil}$.

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?


27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

**Distribution after n steps
**

Lemma 16. $P_0(S_n = y) = \binom{n}{(n+y)/2}\,2^{-n}$.

we shall just compute it. The probability that. P0 (S1 ≥ 0 . n 2−n . n 2 #{paths (m. . 27. if n is even. If n. 27. .2 The ballot theorem Theorem 25. Sn ≥ 0 | Sn = y) = 1 − In particular. The left side of (30) equals #{paths (1. n−m 2−n−m . 0) (n. . 0) (n. y)} . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . . Such a simple answer begs for a physical/intuitive explanation. x) 2n−m (n. . the random walk never becomes zero up to time n is given by P0 (S1 > 0. . . Sn ≥ 0 | Sn = 0) = 84 1 . 1) (n. But. y) that do not touch zero} × 2−n #{paths (0.3 Some conditional probabilities Lemma 17. Sn > 0 | Sn = y) = y . y)} . n/2 (29) #{(0. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). n (30) Proof. if n is even. . n This is called the ballot theorem and the reason will be given in Section 29. P0 (Sn = y) 2 (n + y) + 1 . for now.P (Sn = y|Sm = x) = In particular. (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . given that S0 = 0 and Sn = y > 0. S2 > 0. . . (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof.

. ξ3 . •).4 Some remarkable identities Theorem 26. . (f ) follows from the replacement of (ξ2 . (c) (d) (e) (f ) (g) where (a) is obvious. Sn ≥ 0). . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 .Proof. +1 27. . .) by (ξ1 . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. ξ2 . . . Sn > 0) + P0 (S1 < 0. Proof. . . . . . . . Sn < 0) (b) (a) = 2P0 (S1 > 0. 1 + ξ2 + ξ3 > 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . ξ3 . . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . . 85 . . . 0) (n. . . . . . . and Sn−1 > 0 implies that Sn ≥ 0. . . . ξ2 + ξ3 ≥ 0. . . . . We next have P0 (S1 = 0. and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . Sn ≥ 0) = #{(0. . . n/2 where we used (28) and (29). . . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . Sn ≥ 0. If n is even. . P0 (S1 ≥ 0. . We use (27): P0 (Sn ≥ 0 . . . . . . and ﬁnally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . . . Sn = 0) = P0 (S1 > 0. (e) follows from the fact that ξ1 is independent of ξ2 . . . Sn ≥ 0) = P0 (S1 = 0. Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . . . (b) follows from symmetry. Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . For even n we have P0 (S1 ≥ 0. . .). . Sn = 0) = P0 (Sn = 0). .

the two terms must be equal. Sn−2 > 0|Sn−1 > 0. with equal probability. Sn = 0. we computed. n−1 Proof. given that Sn = 0. . the probability u(n) that the walk (started at 0) attains value 0 in n steps. . If a particle is to the right of a then you see its image on the left. and if the particle is to the left of a then its mirror image is on the right. So the last term is 1/2. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. Theorem 27. Let n be even. . So u(n) f (n) = . Sn−2 > 0|Sn−1 = 1). f (n) = P0 (S1 > 0. . Fix a state a. for even n. P0 (Sn−1 > 0. . Sn−1 > 0. we either have Sn−1 = 1 or −1. We now want to ﬁnd the probability f (n) that the walk will return to 0 for the ﬁrst time in n steps. By symmetry. Sn = 0) + P0 (S1 < 0. . By the Markov property at time n − 1. Also. Sn−1 = 0. . . n We know that P0 (Sn = 0) = u(n) = n/2 2−n . 86 . It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. Sn = 0). Sn = 0 means Sn−1 = 1. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). . . So f (n) = 2P0 (S1 > 0. we see that Sn−1 > 0. 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). we can omit the last event. . . . . Sn = 0) = 2P0 (Sn−1 > 0. . Sn = 0) = u(n) .27. Put a two-sided mirror at a. . . . . Then f (n) := P0 (S1 = 0. .5 First return to zero In (29). . So 1 f (n) = 2u(n) P0 (S1 > 0. and u(n) = P0 (Sn = 0). This is not so interesting. Since the random walk cannot change sign before becoming zero. Sn = 0) P0 (S1 > 0. . Sn = 0) Now. Sn−1 > 0. n−1 28 The reﬂection principle for a simple random walk in dimension 1 The reflection principle says the following. To take care of the last term in the last expression for f (n). . . Sn−1 < 0.

Theorem 28. n < Ta . The last equality follows from the fact that Sn > a implies Ta ≤ n. Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). . 2 Proof. n ≥ Ta . So Sn − a can be replaced by a − Sn in the last display and the resulting process. Then Ta = inf{n ≥ 0 : Sn = a} Sn . n ∈ Z+ ). In the last event. So: P0 (Sn ≥ a) = P0 (Ta ≤ n. Then. so we may replace Sn by 2a − Sn (Theorem 28). called (Sn . . Sn > a) = P0 (Ta ≤ n. n ∈ Z+ ) has the same law as (Sn . deﬁne Sn = where is the ﬁrst hitting time of a. Thus. Sn ≤ a) + P0 (Ta ≤ n. we obtain 87 . 2a − Sn ≥ a) = P0 (Ta ≤ n. Sn ≤ a) + P0 (Sn > a). as a corollary to the above result. . If Sn ≥ a then Ta ≤ n. 2a − Sn . n < Ta . n ≥ 0) be a simple symmetric random walk and a > 0. the process (STa +m − a. observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a).1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. 2 Deﬁne now the running maximum Mn = max(S0 . Trivially. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). Let (Sn . In other words. Sn ≥ a). . Proof. But P0 (Ta ≤ n) = P0 (Ta ≤ n.What is more interesting is this: Run the particle till it hits a (it will. a + (Sn − a). independent of the past before Ta . Sn ). Finally. n ≥ Ta By the strong Markov property. m ≥ 0) is a simple symmetric random walk. P0 (Sn ≥ a) = P0 (Ta ≤ n. S1 . But a symmetric random walk has the same law as its negative. Sn ≤ a). Sn = Sn . we have n ≥ Ta . Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). 28. for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn .

436-437. by applying the reﬂection principle (Theorem 28). what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. . Sn = y) = P0 (Sn = 2x − y). This. the number of selected azure items is always ahead of the black ones equals (a − b)/n. . ornamental objects. Compt. It is the latter use we are interested in in Probability and Statistics. of which a are coloured azure and b black. (1887). Sci. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) .Corollary 19. . Rend. P0 (Mn < x. Acad. Proof. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. then the probability that. (31) x > y. The answer is very simple: Theorem 31. as long as a ≥ b. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. Bertrand. The set of values of η contains n elements and η is uniformly distributed a in this set. Since Mn < x is equivalent to Tx > n. where ηi indicates the colour of the i-th item picked. 105. Solution d’un probl`me. 369. D. Sn = y). Acad. n = a + b. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. e 10 Andr´. J. Sn = y) = P0 (Tx > n. combined with (31). during sampling without replacement. ηn ). Solution directe du probl`me r´solu par M. 105. e e e Paris. 2 Proof. usually a vase furnished with a foot or pedestal. we have P0 (Mn < x. the ashes of the dead after cremation. 8 88 . 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). Sn = y) = P0 (Tx ≤ n. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. and anciently for holding ballots to be drawn. which results into P0 (Tx ≤ n. A sample from the urn is a sequence η = (η1 . We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Sci. gives the result. An urn is a vessel. (1887). Sn = 2x − y). 9 Bertrand. The original ballot problem. . employed for diﬀerent purposes. then. If Tx ≤ n. Paris. Compt. Rend. If an urn contains a azure items and b = n−a black items. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. as for holding liquids. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement.

Then. ξn ) is. using the deﬁnition of conditional probability and Lemma 16. But in (30) we computed that y P0 (S1 > 0. . a we have that a out of the n increments will be +1 and b will be −1. . . +1} such that ε1 + · · · + εn = y. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . Sn > 0} that the random walk remains positive. . .i−1 = q = 1 − p. εn be elements of {−1. An urn can be realised quite easily be a random walk: Theorem 32. . i ∈ Z. . a.) So pi. n ≥ 0) with values in Z and let y be a positive integer. . 89 . Proof of Theorem 31. indeed. where y = a − (n − a) = 2a − n. . . letting a. but which is not necessarily symmetric.i+1 = p. −n ≤ k ≤ n. Since the ‘azure’ items correspond to + signs (respectively. . But if Sn = y then. . Proof. . . Then. . . the result follows. the ‘black’ items correspond to − signs). . always assuming (without any loss of generality) that a ≥ b = n − a. . conditional on the event {Sn = y}. . . n n −n 2 (n + y)/2 a Thus. . the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. so ‘azure’ items are +1 and ‘black’ items are −1. So the number of possible values of (ξ1 . . . . . conditional on {Sn = y}. . ηn ) an urn process with parameters n. b be deﬁned by a + b = n. . . . n . Since an urn with parameters n. we can translate the original question into a question about the random walk. We have P0 (Sn = k) = n n+k 2 pi. ξn = εn ) P0 (Sn = y) 2−n 1 = = . a − b = y. . the sequence (ξ1 . ξn ) has a uniform distribution. the sequence (ξ1 . We need to show that. n Since y = a − (n − a) = a − b. 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. conditional on {Sn = y}. It remains to show that all values are equally a likely. Consider a simple symmetric random walk (Sn . . . Sn > 0 | Sn = y) = .We shall call (η1 . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. The values of each ξi are ±1. . (The word “asymmetric” will always mean “not necessarily symmetric”. (ξ1 . Let ε1 . . . ξn ) is uniformly distributed in a set with n elements. a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. . . p(n+k)/2 q (n−k)/2 . we have: P0 (ξ1 = ε1 . . .

Consider a simple random walk starting from 0. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. 1. Lemma 18. To prove this. x ∈ Z. it is no loss of generality to consider the random walk starting from 0. . Then the probability generating function of the ﬁrst time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . Theorem 33. E0 sTx = ϕ(s)x . 2. only the relative position of y with respect to x is needed. Start with S0 = 0 and watch the process until the ﬁrst n = T1 such that Sn = 1. Consider a simple random walk starting from 0.} ∪ {∞}. Lemma 19. x − 1 in order to reach x. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. For all x. . and all t ≥ 0. we make essential use of the skip-free property. In view of this. . Consider a simple random walk starting from 0. Px (Ty = t) = P0 (Ty−x = t).) Then. (We tacitly assume that S0 = 0.i. plus a time T1 required to go from −1 to 0 for the ﬁrst time. 2qs Proof. The rest is a consequence of the strong Markov property. for x > 0. In view of Lemma 19 we need to know the distribution of T1 . Proof. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. We will approach this using probability generating functions. Proof. This random variable takes values in {0. . If S1 = 1 then T1 = 1. 90 .30. 2. See ﬁgure below.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. plus a ′′ further time T1 required to go from 0 to +1 for the ﬁrst time. . then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. .d. . y ∈ Z. random variables. x ≥ 0) is also a random walk. The process (Tx . Let ϕ(s) := E0 sT1 . If x > 0. Corollary 20.

0 if x > 0 2qs Tx (32) E0 s = √ |x| E sT−1 |x| = 1− 1−4pqs2 0 . Corollary 21. The formula is obtained by solving the quadratic. 91 . T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . The ﬁrst time n such that Sn = −1 is the ﬁrst time n such that −Sn = 1. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. Corollary 22. Then the probability generating function of the ﬁrst time that level x ∈ Z is reached is given by √ x E sT1 x = 1− 1−4pqs2 . from the strong Markov property. the RW must ﬁrst hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). in order to hit 1. Then the probability generating function of the ﬁrst time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . if x < 0 2ps Proof. Consider a simple random walk starting from 0. Consider a simple random walk starting from 0. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. Hence the previous formula applies with the roles of p and q interchanged. 2ps Proof.

and simply write (1 + x)n = k≥0 n k x . how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. We again do ﬁrst-step analysis. the same as the time required for the walk to hit 1 starting ′ from 0. The latter is. k 92 . k by the binomial theorem. Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof. For a symmetric ′ random walk. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . where α is a real number.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. in distribution. this was found in §30. Consider a simple random walk starting from 0. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . Similarly.3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. If we deﬁne n to be equal to 0 for k > n.30. For the general case. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1.2 First return to the origin Now let us ask: If the particle starts at 0. then we can omit the k upper limit in the sum. 1 − 4pqs2 . E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . 2ps ′ =1− 30. If α = n is a positive integer then n (1 + x)n = k=0 n k x . if S1 = 1.

k as long as we deﬁne α k = α(α − 1) · · · (α − k + 1) . k We can write this in more familiar symbols if we observe that 2k −k 1 1/2 4 . −1 k so (1 − x)−1 = as needed. 2k − 1 k 93 . Indeed. and it looks exactly the same: (1 + x)α = k≥0 n(n − 1) · · · (n − k + 1) n! = . For example. we will show that (1 − x)−1 = 1 + x + x2 + x3 + · · · which is of course the all too-familiar geometric series.Recall that n k = It turns out that the formula is formally true for general α. but that we won’t worry about which ones. As another example. k! k! (−1)2k xk = k≥0 xk . (1 − x)−1 = But. We have k≥0 1 − x = (1 − x)1/2 = 1/2 (−1)k xk . k! The adverb ‘formally’ is supposed to mean that this formula holds for some x around 0. by deﬁnition. √ 1 − x. So √ 1−x= ′ k≥0 1 2k k x =1+ 2k − 1 k 1 − 4pqs2 : ′ k≥1 2k k 1 x 2k − 1 k We apply this to E0 sT0 = 1 − E0 s T 0 = k≥1 1 2k (pq)k s2k . k!(n − k)! k! α k x . (−1)k = 2k − 1 k k and this is shown by direct computation. let us compute √ (−1)k (−x)k = k≥0 k≥0 k≥0 −1 (−x)k k = (−1)k k! (−1)(−2) · · · (−1 − k + 1) = = (−1)k .

x if x > 0 . We will see how the results about the generating functions of the hitting and return times help us answer these questions directly. The quantity inside the square root becomes 1 − 4pq. • If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. so 1 − 4pq = So P0 (Tx < ∞) = 1 − 4p + 4p2 = (1 − 2p)2 = |1 − 2p| = |p − q|. P0 (Tx < ∞) = p q q p ∧1 ∧1 . |x| if x > 0 (33) . The key fact needed here is: P0 (Tx < ∞) = lim E0 sTx . if p = q then P0 (Tx < ∞) = 1 for all x ∈ Z.But. That it is actually null-recurrent follows from Markov chain theory. 2k − 1 k 2k − 1 31 Recurrence properties of simple random walks Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that P0 Sn =p−q n→∞ n lim = 1. • If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. 94 . if x < 0 |x| In particular. We apply this to (32). without appeal to the Markov chain theory. Hence it is again transient. x 1−|p−q| 2q 1−|p−q| 2p . But remember that q = 1 − p. Hence it is transient. by the deﬁnition of E0 sT0 . if x < 0 This formula can be written more neatly as follows: Theorem 35. • If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. s↑1 which is true for any probability generating function. Comparing the two ‘inﬁnite polynomials’ we see that we must have n = 2k (we knew that– the random walk can return to 0 only in an even number of steps) and then ′ P0 (T0 = 2k) = 1 1 2k k k p q = P0 (S2k = 0). E0 s T 0 = n≥0 ′ ′ ′ P0 (T0 = n)sn .

31 Recurrence properties of simple random walks

Consider again a simple random walk $S_n$ with values in $\mathbb{Z}$, with upward probability $p$ and downward probability $q = 1 - p$. By applying the law of large numbers we have that
$$P_0\Big(\lim_{n\to\infty} \frac{S_n}{n} = p - q\Big) = 1.$$
• If $p > 1/2$ then $p - q > 0$ and so $S_n$ tends to $+\infty$ with probability one. Hence it is transient.
• If $p < 1/2$ then $p - q < 0$ and so $S_n$ tends to $-\infty$ with probability one. Hence it is again transient.
• If $p = 1/2$ then $p - q = 0$ and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to Markov chain theory. The key fact needed here is
$$P_0(T_x < \infty) = \lim_{s\uparrow 1} E_0 s^{T_x},$$
which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes $1 - 4pq$. But remember that $q = 1 - p$, so
$$\sqrt{1-4pq} = \sqrt{1 - 4p + 4p^2} = \sqrt{(1-2p)^2} = |1-2p| = |p-q|.$$
So
$$P_0(T_x < \infty) = \begin{cases} \big(\tfrac{1-|p-q|}{2q}\big)^{x}, & \text{if } x > 0,\\[4pt] \big(\tfrac{1-|p-q|}{2p}\big)^{|x|}, & \text{if } x < 0.\end{cases} \qquad (33)$$
This formula can be written more neatly as follows:

Theorem 35.
$$P_0(T_x < \infty) = \begin{cases} \big(\tfrac{p}{q}\wedge 1\big)^{x}, & \text{if } x > 0,\\[4pt] \big(\tfrac{q}{p}\wedge 1\big)^{|x|}, & \text{if } x < 0.\end{cases}$$
In particular, if $p = q$ then $P_0(T_x < \infty) = 1$ for all $x \in \mathbb{Z}$.

Proof. If $p > q$ then $|p-q| = p - q$ and so (33) gives $P_0(T_x < \infty) = 1$ for $x > 0$ and $(q/p)^{|x|}$ for $x < 0$, which is what we want. If $p < q$ then $|p-q| = q - p$ and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in $\mathbb{Z}$ is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that $p < 1/2$. Then $\lim_{n\to\infty} S_n = -\infty$ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by
$$M_\infty = \sup(S_0, S_1, S_2, \ldots).$$

Theorem 36. Consider a simple random walk with values in $\mathbb{Z}$ and assume $p < 1/2$. Then the overall maximum $M_\infty$ has a geometric distribution:
$$P_0(M_\infty \ge x) = (p/q)^x, \qquad x \ge 0.$$

Proof. If we let $M_n = \max(S_0, \ldots, S_n)$, the sequence $M_n$ is nondecreasing. Every nondecreasing sequence has a limit, except that the limit may be $\infty$; but here it is $< \infty$ with probability 1, because $p < 1/2$. So we have $M_\infty = \lim_{n\to\infty} M_n$, and
$$P_0(M_\infty \ge x) = \lim_{n\to\infty} P_0(M_n \ge x) = \lim_{n\to\infty} P_0(T_x \le n) = P_0(T_x < \infty) = (p/q)^x, \qquad x \ge 0.$$
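Theorem 36 can also be checked by simulation. The sketch below is an illustration (not from the notes); it assumes $p = 0.3$ and approximates $M_\infty$ by the running maximum over a long but finite trajectory, which is reasonable because the walk drifts to $-\infty$.

```python
import random

def overall_max(p, steps=5000):
    """Approximate M_infinity by the running maximum of a long trajectory
    of a downward-drifting simple random walk started at 0."""
    s = m = 0
    for _ in range(steps):
        s += 1 if random.random() < p else -1
        m = max(m, s)
    return m

p, q, trials = 0.3, 0.7, 20_000
maxima = [overall_max(p) for _ in range(trials)]
for x in range(5):
    empirical = sum(m >= x for m in maxima) / trials
    print(f"P(M >= {x}): empirical {empirical:.4f}, formula {(p/q)**x:.4f}")
```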

31.2 Return probabilities

Only in the symmetric case $p = q = 1/2$ will the random walk return, with probability 1, to the point where it started from. Unless $p = 1/2$, this probability is always strictly less than 1: with some positive probability, the walk will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in $\mathbb{Z}$ and let $T_0' = \inf\{n \ge 1 : S_n = 0\}$ be the first return time to 0. Then
$$P_0(T_0' < \infty) = 2(p \wedge q).$$

Proof. The probability generating function of $T_0'$ was obtained in Theorem 34. But
$$P_0(T_0' < \infty) = \lim_{s\uparrow 1} E_0 s^{T_0'} = 1 - \sqrt{1-4pq} = 1 - |p-q|.$$
If $p > q$ this gives $1 - (p - q) = 1 - (1 - q - q) = 2q$. If $p < q$ this gives $1 - (q - p) = 1 - (1 - p - p) = 2p$. If $p = q$ this gives 1.

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times, i.e. $P_i(J_i \ge k) = 1$ for all $k$, meaning that $P_i(J_i = \infty) = 1$. But if it is transient, the total number of visits to a state will be finite. Here the total number of visits to state $i$ is given by
$$J_i = \sum_{n=1}^\infty 1(S_n = i), \qquad (34)$$
a random variable first considered in Section 10, where it was also shown to be geometric; in Lemma 4 we showed that, if a Markov chain starts from a state $i$, then $J_i$ is geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in $\mathbb{Z}$. Then, starting from zero, the total number $J_0$ of visits to state 0 is geometrically distributed:
$$P_0(J_0 \ge k) = (2(p\wedge q))^k, \qquad k = 0, 1, 2, \ldots, \qquad\quad E_0 J_0 = \frac{2(p\wedge q)}{1 - 2(p\wedge q)}.$$

Proof. $J_0 \ge 1$ if and only if the first return time to zero is finite. So
$$P_0(J_0 \ge 1) = P_0(T_0' < \infty) = 2(p\wedge q),$$
where the last probability was computed in Theorem 37. And so
$$P_0(J_0 \ge k) = P_0(J_0 \ge 1)^k = (2(p\wedge q))^k, \qquad k = 0, 1, 2, \ldots$$
This proves the first formula. For the second formula we have
$$E_0 J_0 = \sum_{k=1}^\infty P_0(J_0 \ge k) = \sum_{k=1}^\infty (2(p\wedge q))^k = \frac{2(p\wedge q)}{1 - 2(p\wedge q)},$$
this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any $i$:
$$P_i(J_i \ge k) = (2(p\wedge q))^k, \qquad E_i J_i = \frac{2(p\wedge q)}{1 - 2(p\wedge q)}.$$
Remark 2: The formulae work for all values of $p \in [0,1]$, even for $p = 1/2$, as already known.
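As a quick numerical illustration of Theorem 38 (a sketch added here, not in the notes; $p = 0.3$ and the horizon of 2000 steps are arbitrary choices, the horizon being long enough that later returns are negligible for a transient walk):

```python
import random

def visits_to_zero(p, steps=2000):
    """Count the returns to 0 of a simple random walk within `steps` steps."""
    s, visits = 0, 0
    for _ in range(steps):
        s += 1 if random.random() < p else -1
        visits += (s == 0)
    return visits

p, trials = 0.3, 20_000
rho = 2 * min(p, 1 - p)                      # 2(p ^ q)
samples = [visits_to_zero(p) for _ in range(trials)]
for k in range(4):
    empirical = sum(j >= k for j in samples) / trials
    print(f"P(J0 >= {k}): empirical {empirical:.4f}, formula {rho**k:.4f}")
print("E J0: empirical", sum(samples) / trials, "formula", rho / (1 - rho))
```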

We now look at the number of visits to some state, but starting from a different state.

Theorem 39. Consider a random walk with values in $\mathbb{Z}$. Then, starting from 0, for $k \ge 1$:
If $p \ge 1/2$,
$$P_0(J_x \ge k) = (q/p)^{(-x)\vee 0}\,(2q)^{k-1}, \qquad E_0 J_x = \frac{(q/p)^{(-x)\vee 0}}{1 - 2q}.$$
If $p \le 1/2$,
$$P_0(J_x \ge k) = (p/q)^{x\vee 0}\,(2p)^{k-1}, \qquad E_0 J_x = \frac{(p/q)^{x\vee 0}}{1 - 2p}.$$

Proof. We have
$$P_0(J_x \ge k) = P_0(T_x < \infty,\ J_x \ge k),$$
simply because if $J_x \ge k \ge 1$ then $T_x < \infty$. Apply now the strong Markov property at $T_x$: given that $T_x < \infty$, there has already been one visit to state $x$ and so we need at least $k-1$ remaining visits in order to have at least $k$ visits in total. This is the reason we replace $k$ by $k-1$:
$$P_0(T_x < \infty,\ J_x \ge k) = P_0(T_x < \infty)\, P_x(J_x \ge k-1).$$
The first term was computed in Theorem 35, and the second in Theorem 38. Suppose $x > 0$. We obtain
$$P_0(J_x \ge k) = \Big(\frac{p}{q}\wedge 1\Big)^x (2(p\wedge q))^{k-1}.$$
If $p \ge 1/2$ this gives $P_0(J_x \ge k) = (2q)^{k-1}$; if $p \le 1/2$ it gives $P_0(J_x \ge k) = (p/q)^x (2p)^{k-1}$. We repeat the argument for $x < 0$ and find
$$P_0(J_x \ge k) = \Big(\frac{q}{p}\wedge 1\Big)^{|x|} (2(p\wedge q))^{k-1}.$$
If $p \ge 1/2$ this gives $P_0(J_x \ge k) = (q/p)^{|x|} (2q)^{k-1}$; if $p \le 1/2$ it gives $P_0(J_x \ge k) = (2p)^{k-1}$. As for the expectation, we apply $E_0 J_x = \sum_{k\ge 1} P_0(J_x \ge k)$ and perform the sum of a geometric series.

Remark 1: The expectation $E_0 J_x$ is finite if $p \neq 1/2$. Consider the case $p < 1/2$: if the random walk "drifts down" (i.e. $p < 1/2$) then, starting from 0, $E_0 J_x$ drops down to zero geometrically fast as $x$ tends to infinity. But for $x < 0$ we have $E_0 J_x = 1/(1-2p)$ regardless of $x$; in other words, the walk will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae, i.e. $P_y(J_x \ge k) = P_0(J_{x-y} \ge k)$ and $E_y J_x = E_0 J_{x-y}$.

32 Duality

Consider a random walk $S_k = \sum_{i=1}^k \xi_i$, a sum of i.i.d. random variables. Fix an integer $n$. Then the dual $(\hat S_k,\ 0 \le k \le n)$ of $(S_k,\ 0 \le k \le n)$ is defined by
$$\hat S_k := S_n - S_{n-k}, \qquad 0 \le k \le n.$$

Theorem 40. $(\hat S_k,\ 0 \le k \le n)$ has the same distribution as $(S_k,\ 0 \le k \le n)$.

Proof. Use the obvious fact that $(\xi_1, \xi_2, \ldots, \xi_n)$ has the same distribution as $(\xi_n, \xi_{n-1}, \ldots, \xi_1)$.

Thus, every time we have a probability for $S$ we can translate it into a probability for $\hat S$, which is basically another probability for $S$, because the two have the same distribution. Here is an application of this:

Theorem 41. Let $(S_k)$ be a simple symmetric random walk and $x$ a positive integer. Then
$$P_0(T_x = n) = \frac{x}{n} P_0(S_n = x).$$

Proof. Rewrite the ballot theorem in terms of the dual:
$$P_0(\hat S_1 > 0, \ldots, \hat S_n > 0 \mid \hat S_n = x) = \frac{x}{n}.$$
But the left-hand side equals
$$P_0(S_n - S_{n-k} > 0,\ 1 \le k \le n \mid S_n = x) = \frac{P_0(S_k < x,\ 0 \le k \le n-1,\ S_n = x)}{P_0(S_n = x)} = \frac{P_0(T_x = n)}{P_0(S_n = x)}.$$

Pause for reflection: we have been dealing with the random times $T_x$ for simple random walks (symmetric or asymmetric) from the beginning of the lectures. Let $T_x = \inf\{n \ge 1 : S_n = x\}$ be the first time that $x$ will be visited. In Lemma 19 we observed that, for a simple random walk and for $x > 0$, $T_x$ is the sum of $x$ i.i.d. random variables, each distributed as $T_1$, and thus derived the generating function of $T_x$:
$$E_0 s^{T_x} = \left(\frac{1 - \sqrt{1-4pqs^2}}{2qs}\right)^x. \qquad (35)$$
Specialising to the symmetric case we have
$$E_0 s^{T_x} = \left(\frac{1 - \sqrt{1-s^2}}{s}\right)^x.$$
In Theorem 29 we used the reflection principle and derived the distribution function of $T_x$ for a simple symmetric random walk:
$$P_0(T_x \le n) = P_0(|S_n| \ge x) - \tfrac12 P_0(|S_n| = x). \qquad (36)$$
Lastly, in Theorem 41, we derived, using duality, the probabilities
$$P_0(T_x = n) = \frac{x}{n} P_0(S_n = x), \qquad x > 0. \qquad (37)$$

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
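Verifying the compatibility numerically, however, is straightforward. The sketch below is an illustration added here (not in the notes; $x = 3$ and $n = 25$ are arbitrary choices): for the simple symmetric random walk it checks that summing the probabilities (37) up to time $n$ gives the distribution function (36).

```python
from math import comb

def prob_S_eq(n, y):
    """P_0(S_n = y) for the simple symmetric random walk."""
    if (n + y) % 2 or abs(y) > n:
        return 0.0
    return comb(n, (n + y) // 2) * 0.5**n

x, n = 3, 25

# (37) summed over m <= n: sum of (x/m) P_0(S_m = x)
lhs = sum(x / m * prob_S_eq(m, x) for m in range(1, n + 1))

# (36): P_0(T_x <= n) = P_0(|S_n| >= x) - (1/2) P_0(|S_n| = x)
rhs = sum(prob_S_eq(n, y) for y in range(-n, n + 1) if abs(y) >= x) \
      - 0.5 * (prob_S_eq(n, x) + prob_S_eq(n, -x))

print(lhs, rhs)   # the two numbers agree up to rounding
```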

33 Amount of time that a SSRW is positive*

Consider a SSRW $(S_n)$ starting from 0. Define its trajectory as the polygonal line obtained by joining the points $(n, S_n)$ in the plane, as described earlier, and watch this trajectory up to time $2n$. Let $T_+(2n)$ be the number of sides of the polygonal line that are above the horizontal axis and $T_-(2n) = 2n - T_+(2n)$ the number of sides that are below. Clearly, $T_\pm(2n)$ are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time $T_0'$, it behaves again like a SSRW and is independent of the past.]

We can think of $T_+(2n)$ as the number of leads in $2n$ coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed $2n$ times, so the event that the number of heads equals $k$ has maximum probability when $k = n$ (heads = tails). Most people take this statement and translate it into a statement about $T_+(2n)$, arguing that $P(T_+(2n) = m)$ is maximised when $m = n$, i.e. when $T_+(2n) = T_-(2n)$. This is wrong: in fact, as the theorem below shows, the most likely values of $T_+(2n)$ are $m = 0$ and $m = 2n$.

Theorem 42 (distribution of the total time the random walk is positive).
$$P(T_+(2n) = 2k) = P(S_{2k} = 0)\, P(S_{2n-2k} = 0), \qquad 0 \le k \le n.$$

Proof. The idea is to condition on the first time $T_0'$ that the RW returns to 0:
$$P(T_+(2n) = 2k) = \sum_{r=1}^n P(T_+(2n) = 2k,\ T_0' = 2r).$$
Between 0 and $T_0'$ the random walk is either above the horizontal axis or below; call the first event $A_+$ and the second $A_-$ and write
$$P(T_+(2n) = 2k) = \sum_{r=1}^n P(A_+,\ T_0' = 2r)\,P(T_+(2n) = 2k \mid A_+,\ T_0' = 2r) + \sum_{r=1}^n P(A_-,\ T_0' = 2r)\,P(T_+(2n) = 2k \mid A_-,\ T_0' = 2r).$$
In the first case,
$$P(T_+(2n) = 2k \mid A_+,\ T_0' = 2r) = P(T_+(2n) = 2k-2r \mid A_+,\ T_0' = 2r) = P(T_+(2n-2r) = 2k-2r),$$
because, if $T_0' = 2r$ and $A_+$ occurs, then the RW has already spent $2r$ units of time above the axis and so $T_+(2n)$ is reduced by $2r$; furthermore, we may remove the conditioning because the future evolution of the RW after $T_0'$ is independent of the past. In the second case, a similar argument shows
$$P(T_+(2n) = 2k \mid A_-,\ T_0' = 2r) = P(T_+(2n-2r) = 2k).$$
We also observe that, by symmetry,
$$P(A_+,\ T_0' = 2r) = P(A_-,\ T_0' = 2r) = \tfrac12 P(T_0' = 2r) = \tfrac12 f_{2r}.$$
Letting $p_{2k,2n} = P(T_+(2n) = 2k)$, we have shown that
$$p_{2k,2n} = \frac12 \sum_{r=1}^{k} f_{2r}\, p_{2k-2r,\,2n-2r} + \frac12 \sum_{r=1}^{n-k} f_{2r}\, p_{2k,\,2n-2r}.$$
We want to show that $p_{2k,2n} = u_{2k}\, u_{2n-2k}$, where $u_{2k} = P(S_{2k} = 0)$. With this hypothesis, we can start playing with the recursion of the display above and, using induction on $n$, we realise that
$$p_{2k,2n} = \frac12 u_{2n-2k} \sum_{r=1}^{k} f_{2r}\, u_{2k-2r} + \frac12 u_{2k} \sum_{r=1}^{n-k} f_{2r}\, u_{2n-2k-2r},$$
and the two sums equal $u_{2k}$ and $u_{2n-2k}$ respectively, by the first-return decomposition $u_{2m} = \sum_{r=1}^m f_{2r} u_{2m-2r}$; hence $p_{2k,2n} = u_{2k} u_{2n-2k}$, as claimed. Explicitly, we have
$$P(T_+(2n) = 2k) = \binom{2k}{k}\binom{2n-2k}{n-k}\, 2^{-2n}, \qquad 0 \le k \le n. \qquad (38)$$

With $2n = 100$, we plot the function $P(T_+(100) = 2k)$ for $0 \le k \le 50$ below. Notice that it is maximised at the ends of the interval.

[Plot of $P(T_+(100) = 2k)$ against $k$, $0 \le k \le 50$; the probability is largest at the two endpoints $k = 0$ and $k = 50$.]
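The plotted values can be reproduced directly from formula (38); the short sketch below is an illustration added here, not part of the notes.

```python
from math import comb

n = 50                                   # 2n = 100 steps
def p_time_positive(k, n):
    """P(T_+(2n) = 2k) from formula (38)."""
    return comb(2 * k, k) * comb(2 * n - 2 * k, n - k) * 2.0**(-2 * n)

values = [p_time_positive(k, n) for k in range(n + 1)]
print("ends:  ", values[0], values[-1])      # the two maxima
print("middle:", values[n // 2])             # the minimum, at k = n/2
print("total: ", sum(values))                # sanity check: sums to 1
```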

The arcsine law*

Use of Stirling's approximation in formula (38) gives
$$P(T_+(2n) = 2k) \sim \frac{1}{\pi\sqrt{k(n-k)}}, \qquad \text{as } k \to \infty \text{ and } n-k \to \infty.$$

Now consider the calculation of the following probability:
$$P\Big(\frac{T_+(2n)}{2n} \in [a,b]\Big) = \sum_{na \le k \le nb} P\Big(\frac{T_+(2n)}{2n} = \frac{k}{n}\Big) \sim \sum_{a \le k/n \le b} \frac{1}{\pi\sqrt{k(n-k)}} = \sum_{a \le k/n \le b} \frac{1}{n}\,\frac{1}{\pi\sqrt{\frac{k}{n}\big(1-\frac{k}{n}\big)}}.$$
This is easily seen to have a limit as $n \to \infty$, the limit being
$$\int_a^b \frac{dt}{\pi\sqrt{t(1-t)}},$$
because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
$$\lim_{n\to\infty} P\Big(\frac{T_+(2n)}{2n} \le x\Big) = \int_0^x \frac{dt}{\pi\sqrt{t(1-t)}} = \frac{2}{\pi}\arcsin\sqrt{x}.$$
We thus have that the limiting distribution of $T_+(2n)/2n$ (the fraction of time that the RW is positive), as $n \to \infty$, is the distribution function $\frac{2}{\pi}\arcsin\sqrt{x}$, $0 < x < 1$. This is known as the arcsine law. This distribution function has density $\frac{1}{\pi\sqrt{x(1-x)}}$, $0 < x < 1$, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times $(T_x,\ x \ge 0)$. This is another random walk, because the summands in $T_x = \sum_{y=1}^x (T_y - T_{y-1})$ are i.i.d., all distributed as $T_1$:
$$P(T_1 \ge 2n) = P(S_{2n} = 0) = \binom{2n}{n} 2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$
From this, we immediately get that the expectation of $T_1$ is infinity: $ET_1 = \infty$. Many other quantities can be approximated in a similar manner.
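As a sanity check of the arcsine law, here is a simulation sketch (an illustration, not part of the notes; $2n = 1000$ and the number of trials are arbitrary choices) comparing the empirical distribution of the fraction of time spent positive with $\frac{2}{\pi}\arcsin\sqrt{x}$.

```python
import random
from math import asin, pi, sqrt

def fraction_positive(n):
    """Simulate 2n steps of a SSRW and return T_+(2n)/(2n); a side of the
    trajectory counts as positive if it lies above the horizontal axis."""
    s, positive = 0, 0
    for _ in range(2 * n):
        prev = s
        s += 1 if random.random() < 0.5 else -1
        positive += (prev > 0 or s > 0)   # consecutive values differ by 1
    return positive / (2 * n)

n, trials = 500, 5000
fractions = [fraction_positive(n) for _ in range(trials)]
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    empirical = sum(f <= x for f in fractions) / trials
    print(f"P(T+/2n <= {x}): empirical {empirical:.3f}, arcsine {2/pi*asin(sqrt(x)):.3f}")
```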

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates $(x, y)$, $x, y \in \mathbb{Z}$. He starts at $(0,0)$ and, being totally drunk, he moves east, west, north or south with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in $\mathbb{Z}^2 = \mathbb{Z}\times\mathbb{Z}$: let $\xi_1, \xi_2, \ldots$ be i.i.d. random vectors such that
$$P(\xi_n = (0,1)) = P(\xi_n = (0,-1)) = P(\xi_n = (1,0)) = P(\xi_n = (-1,0)) = 1/4.$$
We then let $S_0 = (0,0)$ and $S_n = \sum_{i=1}^n \xi_i$, $n \ge 1$. Since $\xi_1, \xi_2, \ldots$ are i.i.d., $(S_n,\ n \in \mathbb{Z}_+)$ is a random walk. If $x \in \mathbb{Z}^2$, we let $x^1$ be its horizontal coordinate and $x^2$ its vertical, so $S_n = (S_n^1, S_n^2)$.

Lemma 20. Each of $(S_n^1,\ n \in \mathbb{Z}_+)$ and $(S_n^2,\ n \in \mathbb{Z}_+)$ is a random walk. The two random walks are not independent.

Proof. $(S_n^1,\ n \in \mathbb{Z}_+)$ is a sum of i.i.d. random variables, namely the $\xi_n^1$, and hence a random walk; the same argument applies to $(S_n^2,\ n \in \mathbb{Z}_+)$. It is not a simple random walk, because there is a possibility that the increment takes the value 0:
$$P(\xi_n^1 = 1) = P(\xi_n = (1,0)) = 1/4, \quad P(\xi_n^1 = -1) = P(\xi_n = (-1,0)) = 1/4, \quad P(\xi_n^1 = 0) = P(\xi_n = (0,1)) + P(\xi_n = (0,-1)) = 1/2.$$
The $\xi_n^2$ are also identically distributed, with the same law. However, for each $n$, the random variables $\xi_n^1, \xi_n^2$ are not independent, simply because if one takes the value 0 the other is nonzero. Hence the two stochastic processes, the latter being functions of the former, are not independent.

Consider now a change of coordinates: rotate the axes by $45^\circ$ (and scale them), i.e. map $(x^1, x^2)$ into $(x^1 + x^2,\ x^1 - x^2)$. So let
$$R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2,\ S_n^1 - S_n^2).$$

Lemma 21. $(R_n,\ n \in \mathbb{Z}_+)$ is a random walk in two dimensions. Moreover, each of $(R_n^1,\ n \in \mathbb{Z}_+)$ and $(R_n^2,\ n \in \mathbb{Z}_+)$ is a simple symmetric random walk in one dimension, and the last two are independent.

Proof. We have $R_n^1 = \sum_{i=1}^n (\xi_i^1 + \xi_i^2)$ and $R_n^2 = \sum_{i=1}^n (\xi_i^1 - \xi_i^2)$. So $R_n = \sum_{i=1}^n \eta_i$, where $\eta_n := (\xi_n^1 + \xi_n^2,\ \xi_n^1 - \xi_n^2)$. The sequence of vectors $\eta_1, \eta_2, \ldots$ is i.i.d. (by Lemma 5), being obtained by taking the same function on the elements of the i.i.d. sequence $\xi_1, \xi_2, \ldots$. Hence $(R_n,\ n \in \mathbb{Z}_+)$ is a random walk in two dimensions, and a similar argument shows that each of $(R_n^1)$, $(R_n^2)$ is a random walk in one dimension. The possible values of each of $\eta_n^1 = \xi_n^1 + \xi_n^2$ and $\eta_n^2 = \xi_n^1 - \xi_n^2$ are $\pm 1$, and
$$P(\eta_n^1 = 1,\ \eta_n^2 = 1) = P(\xi_n = (1,0)) = 1/4, \qquad P(\eta_n^1 = -1,\ \eta_n^2 = -1) = P(\xi_n = (-1,0)) = 1/4,$$
$$P(\eta_n^1 = 1,\ \eta_n^2 = -1) = P(\xi_n = (0,1)) = 1/4, \qquad P(\eta_n^1 = -1,\ \eta_n^2 = 1) = P(\xi_n = (0,-1)) = 1/4.$$
Hence $P(\eta_n^1 = 1) = P(\eta_n^1 = -1) = 1/2$ and $P(\eta_n^2 = 1) = P(\eta_n^2 = -1) = 1/2$, so each of $(R_n^1)$, $(R_n^2)$ is a simple symmetric random walk in one dimension. To show that the last two are independent, we check that
$$P(\eta_n^1 = \varepsilon,\ \eta_n^2 = \varepsilon') = P(\eta_n^1 = \varepsilon)\,P(\eta_n^2 = \varepsilon')$$
for all choices of $\varepsilon, \varepsilon' \in \{-1, +1\}$, and this follows from the display above.

Lemma 22. $P(S_{2n} = 0) = \binom{2n}{n}^2 2^{-4n}$.

Proof. Since the map $(x^1, x^2) \mapsto (x^1 + x^2,\ x^1 - x^2)$ sends 0 to 0 and only 0 to 0,
$$P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0,\ R_{2n}^2 = 0) = P(R_{2n}^1 = 0)\,P(R_{2n}^2 = 0),$$
where the last equality follows from the independence of the two random walks. But $(R_n^1)$ is a simple symmetric random walk, so $P(R_{2n}^1 = 0) = \binom{2n}{n} 2^{-2n}$, and the same holds for $(R_n^2)$.

By Stirling's approximation, $P(S_{2n} = 0) \sim c/n$ as $n \to \infty$, where $c$ is a positive constant (in the sense that the ratio of the two sides goes to 1).

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. To prove recurrence we need to show that $\sum_{n=0}^\infty P(S_n = 0) = \infty$. But, by the previous lemma and the Stirling estimate, $P(S_{2n} = 0) \sim c/n$, and $\sum_n c/n = \infty$.
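Lemma 22 lends itself to a quick Monte Carlo check; the sketch below is an illustration added here (not in the notes; $n = 5$ and the number of trials are arbitrary choices).

```python
import random
from math import comb

def back_at_origin(n):
    """Simulate 2n steps of the 2D simple symmetric random walk;
    return True if it is back at (0, 0)."""
    x = y = 0
    for _ in range(2 * n):
        dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = x + dx, y + dy
    return x == 0 and y == 0

n, trials = 5, 200_000
empirical = sum(back_at_origin(n) for _ in range(trials)) / trials
exact = (comb(2 * n, n) * 2.0**(-2 * n)) ** 2     # Lemma 22
print(empirical, exact)
```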

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let $a < 0 < b$. Recall, from Theorem 7, that for a simple symmetric random walk (started from $S_0 = 0$),
$$P(T_a < T_b) = \frac{b}{b-a}, \qquad P(T_b < T_a) = \frac{-a}{b-a}.$$
We can read this as follows:
$$P(S_{T_a\wedge T_b} = a) = \frac{b}{b-a}, \qquad P(S_{T_a\wedge T_b} = b) = \frac{-a}{b-a}.$$
This last display tells us the distribution of $S_{T_a\wedge T_b}$. Note, in particular, that $E S_{T_a\wedge T_b} = 0$. This means that we can simulate a 2-point distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval $[a, b]$: the probability that it exits from the left point is $p := b/(b-a)$ and the probability that it exits from the right point is $1-p$. Conversely, given a 2-point probability distribution $(p, 1-p)$ such that $p$ is rational, we can always find integers $a, b$ with $a < 0 < b$ such that $pa + (1-p)b = 0$: e.g. if $p = M/N$, with $M, N \in \mathbb{N}$ and $M < N$, choose $a = M - N$, $b = M$.

How about more complicated distributions? Suppose we are given a probability distribution $(p_i,\ i \in \mathbb{Z})$ on $\mathbb{Z}$ with zero mean: $\sum_{i\in\mathbb{Z}} i p_i = 0$. Can we find a stopping time $T$ such that $S_T$ has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let $(S_n)$ be a simple symmetric random walk and $(p_i,\ i \in \mathbb{Z})$ a given probability distribution on $\mathbb{Z}$ with zero mean: $\sum_i i p_i = 0$. Then there exists a stopping time $T$ such that
$$P(S_T = i) = p_i, \qquad i \in \mathbb{Z}.$$

Proof. Define a pair of random variables $(A, B)$, independent of $(S_n)$, taking values in $\{(0,0)\} \cup \{(i,j) : i < 0 < j\}$, with the following distribution:
$$P(A = i, B = j) = c^{-1}(j - i)\, p_i p_j, \quad i < 0 < j, \qquad P(A = 0, B = 0) = p_0,$$
where $c^{-1}$ is chosen by normalisation:
$$P(A = 0, B = 0) + \sum_{i<0}\sum_{j>0} P(A = i, B = j) = 1.$$
This leads to
$$c^{-1}\sum_{j>0} j p_j \sum_{i<0} p_i + c^{-1}\sum_{i<0} (-i) p_i \sum_{j>0} p_j = 1 - p_0.$$
Since $\sum_{i\in\mathbb{Z}} i p_i = 0$, we have $\sum_{i<0}(-i)p_i = \sum_{i>0} i p_i$ and, using this, we get that $c$ is precisely this common value:
$$c = \sum_{i<0}(-i)p_i = \sum_{i>0} i p_i.$$
Letting $T_i = \inf\{n \ge 0 : S_n = i\}$, we consider the random variable
$$Z = S_{T_A \wedge T_B}.$$
The claim is that $P(Z = k) = p_k$, $k \in \mathbb{Z}$, so that $T = T_A \wedge T_B$ is the required stopping time. To see this, suppose first $k = 0$. Then $Z = 0$ if and only if $A = B = 0$, and this has probability $p_0$.

Suppose next $k > 0$. We have
$$P(Z = k) = P(S_{T_A\wedge T_B} = k) = \sum_{i<0}\sum_{j>0} P(S_{T_i\wedge T_j} = k)\, P(A = i, B = j) = \sum_{i<0} P(S_{T_i\wedge T_k} = k)\, P(A = i, B = k)$$
$$= \sum_{i<0} \frac{-i}{k-i}\; c^{-1}(k-i)\, p_i p_k = c^{-1}\sum_{i<0} (-i) p_i\; p_k = p_k.$$
Similarly, $P(Z = k) = p_k$ for $k < 0$.
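The construction in the proof can be turned into a small simulation. The sketch below is an illustration added here (not in the notes); the target distribution, uniform on $\{-2,\ldots,2\}$, is an arbitrary zero-mean example. We sample $(A, B)$ with the stated distribution, run a simple symmetric random walk until it exits $(A, B)$, and check that the exit point has the prescribed law.

```python
import random

# An arbitrary zero-mean target distribution on Z, chosen for illustration
p = {-2: 0.2, -1: 0.2, 0: 0.2, 1: 0.2, 2: 0.2}
assert abs(sum(i * pi for i, pi in p.items())) < 1e-12

c = sum(i * pi for i, pi in p.items() if i > 0)   # c = sum_{i>0} i p_i

# Distribution of (A, B) as in Theorem 43
pairs, weights = [(0, 0)], [p.get(0, 0.0)]
for i in [i for i in p if i < 0]:
    for j in [j for j in p if j > 0]:
        pairs.append((i, j))
        weights.append((j - i) * p[i] * p[j] / c)

def sample_Z():
    a, b = random.choices(pairs, weights)[0]
    if a == b == 0:
        return 0
    s = 0
    while True:                      # SSRW until it exits (a, b)
        s += 1 if random.random() < 0.5 else -1
        if s == a or s == b:
            return s

trials = 100_000
counts = {}
for _ in range(trials):
    z = sample_Z()
    counts[z] = counts.get(z, 0) + 1
for i in sorted(p):
    print(i, counts.get(i, 0) / trials, p[i])    # empirical vs target
```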

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers: N = {1, 2, 3, ...}. Zero is denoted by 0. The set Z of integers consists of N, their negatives and zero: Z = {..., -3, -2, -1, 0, 1, 2, 3, ...}.

Given integers a, b, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a and d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.)

We will show that gcd(a, b) = gcd(a, b - a). Indeed, let d = gcd(a, b). But d | a and d | b, therefore d | b - a; so (i) holds for the pair (a, b - a). Now suppose that c | a and c | b - a. Then c | a + (b - a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: suppose 0 < a < b are integers. Replace b by b - a. If b - a > a, replace b - a by b - 2a. Keep doing this until you obtain b - qa < a. Then interchange the roles of the numbers and continue. For instance,
gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.
We may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. So we work faster, if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school):
gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.
Indeed 4 × 38 > 120, but 3 is the maximum number of times that 38 "fits" into 120: in the first line we subtracted 38 exactly 3 times.

The theorem of division says the following: given two positive integers a, b, with a > b, we can find an integer q and an integer r ∈ {0, 1, ..., b - 1} such that a = qb + r. Its proof is obvious. The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 by 26 we perform the long division and obtain q = 132, r = 18 (that is, 3450 = 132 × 26 + 18). In the gcd example above, we used the divisions
120 = 3 × 38 + 6,   38 = 6 × 6 + 2.

Here are some other facts about the greatest common divisor. First, if d = gcd(a, b) then there are integers x, y such that d = xa + yb. To see this, let us follow the previous example again, working backwards:
2 = 38 - 6 × 6 = 38 - 6 × (120 - 3 × 38) = 19 × 38 - 6 × 120.
Thus, gcd(38, 120) = 38x + 120y, with x = 19, y = -6. The same logic works in general.

Second, for integers a, b, not both zero, we can show that
gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.
To see why this is true, let d be the right-hand side. Then d = xa + yb, for some x, y ∈ Z. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that d divides both a and b. Suppose this is not true, say, that d does not divide b. Since d > 0, we can divide b by d to find b = qd + r, for some 1 ≤ r ≤ d - 1. But then b = q(xa + yb) + r, yielding r = (-qx)a + (1 - qy)b. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder), so d is not minimum as claimed, and this is a contradiction. Therefore r = 0, and d divides both a and b.

Finally, the greatest common divisor of 3 integers can be defined in a similar manner; in general, the greatest common divisor of any subset of the integers also makes sense.
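Euclid's algorithm, and the backward substitution that produces x, y with d = xa + yb, are easy to express in code. The following sketch is an illustration added here (not part of the notes).

```python
def gcd(a, b):
    """Euclid's algorithm: repeatedly replace the larger number by the remainder."""
    while b:
        a, b = b, a % b
    return a

def extended_gcd(a, b):
    """Return (d, x, y) with d = gcd(a, b) = x*a + y*b (backward substitution)."""
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    return d, y, x - (a // b) * y

print(gcd(120, 38))            # 2
print(extended_gcd(120, 38))   # (2, -6, 19): 2 = -6*120 + 19*38
```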

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. In general, a relation can be thought of as a set of pairs of elements of the set A. A relation is called reflexive if it contains pairs (x, x); the relation < is not reflexive, while the equality relation = is reflexive. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x); the relation < is not symmetric, while the equality relation is symmetric. A relation is transitive if, when it contains (x, y) and (y, z), it also contains (x, z); both < and = are transitive. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric, and it can be thought of as reflexive (if we accept that one is a friend of oneself), but it is not necessarily transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation "i communicates with j" in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f⁻¹(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names; a function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names, the second also any of the m names, and so on; hence the total number of assignments is m × m × ··· × m (n times), i.e. m^n.

Suppose now we are not allowed to give the same name to different animals; each assignment of this type is a one-to-one function from A to B. To be able to do this, we need more names than animals: n ≤ m. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is
$$(m)_n := m(m-1)(m-2)\cdots(m-n+1).$$
In the special case m = n, we denote $(n)_n$ also by n!; n! stands for the product of the first n natural numbers: n! = 1 × 2 × 3 × ··· × n. Incidentally, a one-to-one function from the set A into A is called a permutation of A, so n! is the total number of permutations of a set of n elements.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen. The first element can be chosen in n ways, the second in n − 1 ways, and so on, the k-th in n − k + 1 ways. So we have $(n)_k = n(n-1)(n-2)\cdots(n-k+1)$ ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals
$$\frac{(n)_k}{k!} =: \binom{n}{k},$$
where the symbol $\binom{n}{k}$ is merely a name for the number computed on the left side of this equation. Clearly,
$$\binom{n}{k} = \frac{n!}{k!(n-k)!}.$$
Since there is only one subset with no elements (it is the empty set ∅), we have $\binom{n}{0} = 1$. The total number of subsets of A (regardless of their number of elements) is thus equal to
$$\sum_{k=0}^n \binom{n}{k} = (1+1)^n = 2^n,$$
and this follows from the binomial theorem. We thus observe that there are as many subsets of A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.
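The counting formulas above translate directly into code; the small sketch below (an illustration, not part of the notes) computes $(m)_n$ and $\binom{n}{k}$ from their definitions.

```python
def falling(m, n):
    """(m)_n = m(m-1)...(m-n+1): number of one-to-one functions from an n-set to an m-set."""
    out = 1
    for i in range(n):
        out *= m - i
    return out

def binom(n, k):
    """n choose k = (n)_k / k!: number of k-element subsets of an n-element set."""
    if k < 0 or k > n:
        return 0
    return falling(n, k) // falling(k, k)

print(falling(5, 3))                        # 60 ordered choices of 3 out of 5
print(binom(5, 3))                          # 10 subsets of size 3
print(sum(binom(5, k) for k in range(6)))   # 2**5 = 32 subsets in total
```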

We shall let $\binom{n}{k} = 0$ if n < k. Also, if y is not an integer then, by convention, we let $\binom{n}{y} = 0$. If x is a real number, we define
$$\binom{x}{m} = \frac{x(x-1)\cdots(x-m+1)}{m!}.$$
The Pascal triangle relation states that
$$\binom{n+1}{k} = \binom{n}{k-1} + \binom{n}{k}.$$
Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is included, then there are k − 1 remaining animals to be chosen and this can be done in $\binom{n}{k-1}$ ways. If the pig is not included in the sample, we choose k animals from the remaining n ones and this can be done in $\binom{n}{k}$ ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Maximum and minimum

The minimum of two numbers a, b is denoted by a ∧ b = min(a, b). Thus a ∧ b = a if a ≤ b, or = b if b ≤ a. The maximum is denoted by a ∨ b = max(a, b). The notation extends to more than two numbers; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant.

Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c).

In particular, if a is a real number, we use the following notations: a⁺ = max(a, 0), a⁻ = −min(a, 0). Both quantities are nonnegative. a⁺ is called the positive part of a, and a⁻ is called the negative part of a. Note that it is always true that a = a⁺ − a⁻. The absolute value of a is defined by |a| = a⁺ + a⁻.

Indicator function

$1_A$ stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. It is worth remembering that
$$1_A 1_B = 1_{A\cap B}, \qquad 1 - 1_A = 1_{A^c}.$$
This is very useful notation because conjunction of clauses is transformed into multiplication. Thus, if A1, A2, ... are sets or clauses, then the number $\sum_{n\ge 1} 1_{A_n}$ is the number of true clauses. For instance, if I go to a bank N times and A_i is the clause "I am robbed the i-th time", then $\sum_{i=1}^N 1_{A_i}$ is the total number of times I am robbed.

If A is an event then $1_A$ is a random variable. It takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is
$$E\,1_A = 1 \times P(A) + 0 \times P(A^c) = P(A).$$
Several quantities of interest are expressed in terms of indicators. An example of its application: since
$$1_{A\cup B} = 1 - 1_{(A\cup B)^c} = 1 - 1_{A^c\cap B^c} = 1 - 1_{A^c} 1_{B^c} = 1 - (1-1_A)(1-1_B) = 1_A + 1_B - 1_A 1_B = 1_A + 1_B - 1_{A\cap B},$$
we have
$$E[1_{A\cup B}] = E[1_A] + E[1_B] - E[1_{A\cap B}],$$
which means P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Another example: let X be a nonnegative random variable with values in N = {1, 2, ...}. Then
$$X = \sum_{n=1}^X 1 \quad \text{(the sum of } X \text{ ones is } X\text{)}, \qquad \text{so} \qquad X = \sum_{n\ge 1} 1(X \ge n).$$
So $EX = \sum_{n\ge 1} E\,1(X \ge n) = \sum_{n\ge 1} P(X \ge n)$.

Matrices

Let $\mathbb{R}^d$ be the set of vectors with d components. A linear function ϕ from $\mathbb{R}^n$ to $\mathbb{R}^m$ is such that
$$\phi(\alpha x + \beta y) = \alpha\phi(x) + \beta\phi(y), \qquad \text{for all } x, y \in \mathbb{R}^n \text{ and all } \alpha, \beta \in \mathbb{R}.$$

Let $u_1, \ldots, u_n$ be a basis for $\mathbb{R}^n$. Then any $x \in \mathbb{R}^n$ can be uniquely written as
$$x = \sum_{j=1}^n x_j u_j,$$
and the column $(x_1, \ldots, x_n)^{\mathsf T}$ is a column representation of x with respect to the given basis. Because ϕ is linear,
$$y := \phi(x) = \sum_{j=1}^n x_j \phi(u_j).$$
But the $\phi(u_j)$ are vectors in $\mathbb{R}^m$. So if we choose a basis $v_1, \ldots, v_m$ in $\mathbb{R}^m$ we can express each $\phi(u_j)$ as a linear combination of the basis vectors:
$$\phi(u_j) = \sum_{i=1}^m a_{ij} v_i.$$
The matrix A of ϕ with respect to the choice of the two bases is defined by the $m \times n$ array $A := (a_{ij})$. Combining, we have
$$\phi(x) = \sum_{i=1}^m v_i \sum_{j=1}^n a_{ij} x_j.$$
Let $y_i := \sum_{j=1}^n a_{ij} x_j$, $i = 1, \ldots, m$. Then the column $(y_1, \ldots, y_m)^{\mathsf T}$ is a column representation of $y := \phi(x)$ with respect to the basis $v_1, \ldots, v_m$, and the relation between the column representation of y and the column representation of x is written as
$$(y_1, \ldots, y_m)^{\mathsf T} = A\,(x_1, \ldots, x_n)^{\mathsf T}.$$
Suppose that ψ, ϕ are linear functions
$$\mathbb{R}^n \xrightarrow{\ \psi\ } \mathbb{R}^m \xrightarrow{\ \phi\ } \mathbb{R}^\ell.$$
Then ϕ ∘ ψ is a linear function $\mathbb{R}^n \to \mathbb{R}^\ell$. Let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. Then the matrix representation of ϕ ∘ ψ is AB, where
$$(AB)_{ij} = \sum_{k=1}^m A_{ik} B_{kj}, \qquad 1 \le i \le \ell,\ 1 \le j \le n.$$
So A is a ℓ × m matrix, B is a m × n matrix, and AB is a ℓ × n matrix.

A matrix A is called square if it is an n × n matrix. In other words, a square matrix represents a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^n$; if we fix a basis in $\mathbb{R}^n$, it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis: the standard basis in $\mathbb{R}^d$ consists of the vectors
$$e_1 = (1, 0, \ldots, 0)^{\mathsf T},\ e_2 = (0, 1, 0, \ldots, 0)^{\mathsf T},\ \ldots,\ e_d = (0, \ldots, 0, 1)^{\mathsf T},$$
i.e. the i-th component of $e_j$ equals 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence $a_1, a_2, \ldots$ of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I; similarly, we define max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, ...} has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists, and it is this that we call infimum. Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞. If I is empty then inf ∅ = +∞ and sup ∅ = −∞. inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Similarly, sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If $x_1, x_2, \ldots$ is a sequence of numbers then
$$\inf\{x_1, x_2, \ldots\} \le \inf\{x_2, x_3, \ldots\} \le \inf\{x_3, x_4, \ldots\} \le \cdots$$
and, since any nondecreasing sequence has a limit (possibly +∞), we take this limit and call it lim inf $x_n$ (limit inferior) of the sequence:
$$\liminf_{n\to\infty} x_n = \lim_{n\to\infty} \inf\{x_n, x_{n+1}, \ldots\}.$$
Likewise, the limit superior is defined via the nonincreasing sequence of suprema:
$$\limsup_{n\to\infty} x_n = \lim_{n\to\infty} \sup\{x_n, x_{n+1}, \ldots\}.$$
We note that lim inf(−x_n) = − lim sup x_n. The sequence $x_n$ has a limit ℓ if and only if lim sup $x_n$ = lim inf $x_n$ = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely: if the events A1, A2, ... are mutually disjoint, then
$$P\Big(\bigcup_{n\ge 1} A_n\Big) = \sum_{n\ge 1} P(A_n).$$
Everything else follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.) Of course, I do sometimes "cheat" from a strict mathematical point of view in these notes, by not requiring this, but not too much.

If the events A, B are disjoint then P(A ∪ B) = P(A) + P(B). If the events A, B are not disjoint then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events. If An are events with P(An) = 0 then P(∪n An) = 0; more generally, P(∪n An) ≤ Σn P(An). Similarly, if An are events with P(An) = 1 then P(∩n An) = 1.

In Markov chain theory, and everywhere else, we work a lot with conditional probabilities. Recall that P(B|A) is defined by P(A ∩ B)/P(A) and is a very motivated definition. Often, to compute P(A ∩ B), we read the definition as P(A ∩ B) = P(A)P(B|A). This can be generalised to any finite number of events:
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2).$$
If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B), and this is called Bayes' theorem. Quite often, to compute P(A), it is useful to find disjoint events B1, B2, ... such that ∪n Bn = Ω, in which case we write
$$P(A) = \sum_n P(A \cap B_n) = \sum_n P(A \mid B_n) P(B_n).$$

If A is an event then 1A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 on A^c. Note that 1A 1B = 1_{A∩B}, 1_{A^c} = 1 − 1A, and E 1A = P(A). Quite often we need to argue that X ≤ Y in order to show that EX ≤ EY; similarly, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B). Markov's inequality for a positive random variable X states that
$$P(X > x) \le \frac{EX}{x}.$$
This follows from the inequality x 1(X > x) ≤ X 1(X > x) ≤ X, which holds since X is positive, by taking expectations.

If A1, A2, ... are increasing events with A = ∪n An then P(A) = lim_{n→∞} P(An); similarly, if they are decreasing with A = ∩n An then P(A) = lim_{n→∞} P(An).

An application of this is: P(X < ∞) = lim_{x→∞} P(X ≤ x). In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable X may be defined by
$$EX = \int_0^\infty P(X > x)\, dx.$$
The formula holds for any positive random variable. In case the random variable is discrete, it reduces to the known one, EX = Σ_x x P(X = x); in case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^{∞} x f(x) dx. For a random variable that takes values in N we can write EX = Σ_{n=1}^∞ P(X ≥ n); this is due to the fact that Σ_{n≥1} P(X ≥ n) = Σ_{n≥1} E 1(X ≥ n) = E Σ_{n≥1} 1(X ≥ n) = EX.

For a random variable which takes positive and negative values, we define X⁺ = max(X, 0), X⁻ = −min(X, 0), which are both nonnegative, and then define
$$EX = EX^+ - EX^-,$$
which is motivated from the algebraic identity X = X⁺ − X⁻. This definition makes sense only if EX⁺ < ∞ or if EX⁻ < ∞. Suppose now that X is a random variable with values in Z₊ ∪ {+∞}; its distribution will be handled via generating functions, which we discuss next.

Generating functions

A generating function of a sequence a0, a1, ... is an infinite polynomial having the ai as coefficients:
$$G(s) = a_0 + a_1 s + a_2 s^2 + \cdots.$$
We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. Note that |G(s)| ≤ Σn |an||s|^n and this is ≤ Σn |an| if s ∈ [−1, 1]. So if Σn an < ∞ then G(s) exists for all s ∈ [−1, 1]. If it exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is
$$\sum_{n=0}^\infty \rho^n s^n = \frac{1}{1-\rho s}, \qquad \text{as long as } |\rho s| < 1, \text{ i.e. } |s| < 1/|\rho|. \qquad (39)$$
The sequence a0, a1, ... is particularly interesting when it is a probability distribution on Z₊ = {0, 1, 2, ...}, because then Σn an = 1, making it a useful object. The generating function of the sequence pn = P(X = n), n = 0, 1, ..., is called the probability generating function of X:
$$\varphi(s) = E s^X = \sum_{n=0}^\infty s^n P(X = n).$$

This is defined for all −1 < s < 1. Note that Σ_{n=0}^∞ pn ≤ 1: we do allow X to, possibly, take the value +∞, and
$$p_\infty := P(X = \infty) = 1 - \sum_{n=0}^\infty p_n.$$
Since the term s^∞ p_∞ could formally appear in the sum defining ϕ(s), but |s| < 1 gives s^∞ = 0, it contributes nothing. We can recover the probabilities pn from the formula
$$p_n = \frac{\varphi^{(n)}(0)}{n!},$$
as follows from Taylor's theorem. Note that ϕ(0) = P(X = 0), while
$$\varphi(1) = \sum_{n=0}^\infty P(X = n) = P(X < +\infty),$$
and this is why ϕ is called probability generating function. Also, we can recover the expectation of X from
$$EX = \lim_{s\uparrow 1} \varphi'(s),$$
something that can be formally seen from $\frac{d}{ds} E s^X = E \frac{d}{ds} s^X = E X s^{X-1}$ and by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
$$x_{n+2} = x_{n+1} + x_n, \qquad \text{with } x_0 = x_1 = 1.$$
If we let $X(s) = \sum_{n\ge 0} x_n s^n$ be the generating function of $(x_n, n \ge 0)$, then the generating function of $(x_{n+1}, n \ge 0)$ is $\sum_{n\ge 0} x_{n+1} s^n = s^{-1}(X(s) - x_0)$, and the generating function of $(x_{n+2}, n \ge 0)$ is $\sum_{n\ge 0} x_{n+2} s^n = s^{-2}(X(s) - x_0 - s x_1)$.

From the recursion $x_{n+2} = x_{n+1} + x_n$ we have (and this is where linearity is used) that the generating function of $(x_{n+2}, n \ge 0)$ equals the sum of the generating functions of $(x_{n+1}, n \ge 0)$ and $(x_n, n \ge 0)$:
$$s^{-2}(X(s) - x_0 - s x_1) = s^{-1}(X(s) - x_0) + X(s).$$
Since $x_0 = x_1 = 1$, we can solve for X(s) and find
$$X(s) = \frac{-1}{s^2 + s - 1}.$$
But the polynomial $s^2 + s - 1$ has two roots:
$$a = (\sqrt 5 - 1)/2, \qquad b = -(\sqrt 5 + 1)/2.$$
Hence $s^2 + s - 1 = (s-a)(s-b)$, and so
$$X(s) = \frac{-1}{(s-a)(s-b)} = \frac{1}{b-a}\Big(\frac{1}{b-s} - \frac{1}{a-s}\Big).$$
But (39) tells us that $\rho^n$ has generating function $\frac{1}{1-\rho s}$; equivalently, $\frac{1}{(1/\rho)-s}$ is the generating function of $\rho^{n+1}$. This tells us that $\frac{1}{b-s}$ is the generating function of $(1/b)^{n+1}$ and that $\frac{1}{a-s}$ is the generating function of $(1/a)^{n+1}$. Hence X(s) is the generating function of
$$x_n = \frac{(1/b)^{n+1} - (1/a)^{n+1}}{b-a},$$
and this is the solution to the recursion, namely, the n-th term of the Fibonacci sequence. Noting that $ab = -1$, we can also write this as
$$x_n = \frac{(-a)^{n+1} - (-b)^{n+1}}{b-a}.$$
Despite the square roots in the formula, this number is always an integer (it has to be!).
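Despite the square roots, the closed form does produce the integer Fibonacci numbers, as the following sketch confirms by comparing it with the recursion (an illustration added here, not part of the notes).

```python
from math import sqrt

def fib_closed(n):
    """x_n from the generating-function solution: ((1/b)^(n+1) - (1/a)^(n+1)) / (b - a)."""
    a = (sqrt(5) - 1) / 2
    b = -(sqrt(5) + 1) / 2
    return ((1 / b) ** (n + 1) - (1 / a) ** (n + 1)) / (b - a)

# Compare with the recursion x_{n+2} = x_{n+1} + x_n, x_0 = x_1 = 1
x = [1, 1]
for _ in range(18):
    x.append(x[-1] + x[-2])
print([round(fib_closed(n)) for n in range(20)] == x)   # True: the formula gives integers
```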

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, then the so-called distribution function, i.e. the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density f(x) = (d/dx) P(X ≤ x). All these objects can be referred to as "distribution" or "law" of X. Even the generating function E s^X of a Z₊-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z₊.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier. (Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.)

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × ··· × n for large n then, since the number is so big, it's better to use logarithms:
$$n! = e^{\log n!} = \exp \sum_{k=1}^n \log k.$$
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that
$$\int_k^{k+1} \log x\, dx \approx \log k, \qquad \text{hence} \qquad \sum_{k=1}^n \log k \approx \int_1^n \log x\, dx.$$
Since $\frac{d}{dx}(x\log x - x) = \log x$, it follows that
$$\int_1^n \log x\, dx = n\log n - n + 1 \approx n\log n - n.$$
Hence we have
$$n! \approx \exp(n\log n - n) = n^n e^{-n}.$$
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!) the rough Stirling approximation gives that the probability that, in 2n fair coin tosses,

we have exactly n heads equals $\binom{2n}{n} 2^{-2n} \approx 1$, and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:
$$n! \sim n^n e^{-n} \sqrt{2\pi n},$$
where the symbol $a_n \sim b_n$ means that $a_n/b_n \to 1$, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z₊ is geometric if it has the memoryless property:
$$P(X - n \ge m \mid X \ge n) = P(X \ge m), \qquad m, n \ge 0.$$
If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. Iterating the memoryless property shows that a geometric random variable has a distribution of the form
$$P(X \ge n) = \lambda^n, \qquad \text{where } \lambda = P(X \ge 1).$$
We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the random variable $J_0 = \sum_{n=1}^\infty 1(S_n = 0)$, defined in (34). It was shown in Theorem 38 that J₀ satisfies the memoryless property; in fact it was shown there that $P(J_0 \ge k) = P(J_0 \ge 1)^k$, a geometric function of k.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely, you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Let X be a random variable and consider an increasing and nonnegative function g : R → R₊. We can capture both properties of g neatly by writing:
$$\text{for all } x, y \in \mathbb{R}: \quad g(y) \ge g(x)\, 1(y \ge x).$$
Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. Hence
$$g(X) \ge g(x)\, 1(X \ge x),$$
and so, by taking expectations of both sides (expectation is an increasing operator),
$$E g(X) \ge g(x)\, P(X \ge x).$$
This completely trivial thought gives the very useful Markov's inequality.
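Both facts are easy to see in a simulation. The sketch below (an illustration with made-up parameters, not part of the notes; λ = 0.6 and the sample size are arbitrary) samples a geometric(λ) variable and checks the tail formula together with the version of Markov's inequality P(X > x) ≤ EX/x stated earlier.

```python
import random

lam, trials = 0.6, 50_000
samples = []
for _ in range(trials):
    n = 0
    while random.random() < lam:   # construction gives P(X >= n) = lam**n
        n += 1
    samples.append(n)

mean = sum(samples) / trials       # should be close to lam / (1 - lam)
for x in (1, 2, 5, 10):
    tail = sum(s >= x for s in samples) / trials
    print(f"x={x}: P(X >= x) empirical {tail:.4f}, formula {lam**x:.4f}, "
          f"Markov bound E[X]/x = {mean / x:.4f}")
```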


