
**MARKOV CHAINS AND RANDOM WALKS**

Takis Konstantopoulos∗ Autumn 2009

∗ © Takis Konstantopoulos 2006-2009

Contents

1 Introduction
2 Examples of Markov chains
    2.1 A mouse in a cage
    2.2 Bank account
    2.3 Simple random walk (drunkard's walk)
    2.4 Simple random walk (drunkard's walk) in a city
    2.5 Actuarial chains
3 The Markov property
    3.1 Definition of the Markov property
    3.2 Transition probability and initial distribution
    3.3 Time-homogeneous chains
4 Stationarity
    4.1 Finding a stationary distribution
5 Topological structure
    5.1 The graph of a chain
    5.2 The relation "leads to"
    5.3 The relation "communicates with"
    5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
    10.1 First hitting time decomposition
    10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
    13.1 Functions of excursions
    13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
    18.1 Definition of stability
    18.2 The fundamental stability theorem for Markov chains
    18.3 Coupling
    18.4 Proof of the fundamental stability theorem
    18.5 Terminology
    18.6 Periodic chains
19 Limiting theory for Markov chains*
    19.1 Limiting probabilities for null recurrent states
    19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
    23.1 Subsequences
    23.2 Watching a Markov chain when it visits a set
    23.3 Subordination
    23.4 Deterministic function of a Markov chain
24 Applications

    24.1 Branching processes: a population growth application
    24.2 The ALOHA protocol: a computer communications application
    24.3 PageRank (trademark of Google): A World Wide Web application
    24.4 The Wright-Fisher model: an example from biology
    24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
    26.1 Paths that start and end at specific points
    26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
    27.1 Distribution after n steps
    27.2 The ballot theorem
    27.3 Some conditional probabilities
    27.4 Some remarkable identities
    27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
    28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
    30.1 First hitting time analysis
    30.2 First return to the origin
    30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
    31.1 Asymptotic distribution of the maximum
    31.2 Return probabilities
    31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

A few starred sections should be considered "advanced" and can be omitted at first reading. I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor $\gcd(a, b)$ between two positive integers $a$ and $b$. Recall (from elementary school maths) that $\gcd(a, b) = \gcd(r, b)$, where $r = \mathrm{rem}(a, b)$ is the remainder of the division of $a$ by $b$. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers $(x_n, y_n)$, where $x_n \ge y_n$, initialised by

$$x_0 = \max(a, b), \qquad y_0 = \min(a, b),$$

and evolving as

$$x_{n+1} = y_n, \qquad y_{n+1} = \mathrm{rem}(x_n, y_n), \qquad n = 0, 1, 2, \ldots$$

In other words, there is a function $F$ that takes the pair $X_n = (x_n, y_n)$ into $X_{n+1} = (x_{n+1}, y_{n+1})$. The sequence of pairs thus produced, $X_0, X_1, X_2, \ldots$, has the property that, for any $n$, the future pairs $X_{n+1}, X_{n+2}, \ldots$ depend only on the pair $X_n$.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass $m$ is moving under the action of a force $F$ in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force:

$$m\ddot{x} = F.$$

Here, $x = x(t)$ denotes the position of the particle at time $t$. Its velocity is $\dot{x} = \frac{dx}{dt}$, and its acceleration is $\ddot{x} = \frac{d^2x}{dt^2}$. Assume that the force $F$ depends only on the particle's position and velocity, i.e. $F = f(x, \dot{x})$. The state of the particle at time $t$ is described by the pair $X(t) = (x(t), \dot{x}(t))$. It is not hard to see that, for any $t$, the future trajectory $(X(s),\ s \ge t)$

can be completely specified by the current state $X(t)$. In other words, past states are not needed.

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness. We will be dealing with random variables, instead of deterministic objects.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large" such as finding the determinant of a matrix with 10,000 rows and 10,000 columns or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability.

We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time $n$ (minutes) then, at time $n + 1$ it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability $\alpha = 0.05$; it does so, regardless of where it was at earlier times. Similarly, it moves from 2 to 1 with probability $\beta = 0.99$. We can summarise this information by the transition diagram:

¹ Stochastic means random. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral $\int_a^b f(x)\,dx$ of a complicated function $f$: suppose that $f$ is positive and bounded above by $B$. Choose a pair $(X_0, Y_0)$ of random numbers, $a \le X_0 \le b$, $0 \le Y_0 \le B$, uniformly. Repeat with $(X_n, Y_n)$, for steps $n = 1, 2, \ldots$. Each time count 1 if $Y_n < f(X_n)$ and 0 otherwise. Perform this a large number $N$ of times. Count the number of 1's and divide by $N$. The resulting ratio, multiplied by the area $B(b - a)$ of the rectangle from which the points are drawn, is an approximation of $\int_a^b f(x)\,dx$. Try this on the computer!
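The Monte Carlo recipe in the footnote can be tried along the following lines. This is only a sketch and not part of the original notes: the function name `monte_carlo_integral`, the fixed random seed, and the test case $f(x) = x^2$ on $[0, 1]$ with bound $B = 1$ are illustrative assumptions. The fraction of points falling under the graph of $f$ is rescaled by the area $B(b - a)$ of the sampling rectangle.

```python
import random

def monte_carlo_integral(f, a, b, B, N, seed=0):
    """Hit-or-miss estimate of the integral of f over [a, b], assuming 0 <= f <= B."""
    rng = random.Random(seed)  # fixed seed (an assumption) so that runs are reproducible
    hits = 0
    for _ in range(N):
        x = rng.uniform(a, b)    # X_n, uniform on [a, b]
        y = rng.uniform(0.0, B)  # Y_n, uniform on [0, B]
        if y < f(x):             # count 1 if the point lies under the graph of f
            hits += 1
    # Fraction of hits times the area of the rectangle [a, b] x [0, B].
    return (hits / N) * B * (b - a)

# Illustrative test case: the exact value of the integral of x^2 over [0, 1] is 1/3.
estimate = monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0, 200_000)
print(estimate)
```

The error of such an estimate typically shrinks like $1/\sqrt{N}$, so many samples are needed for each extra digit of accuracy.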

[Transition diagram: two states, 1 and 2. Arrow from 1 to 2 with probability $\alpha$, arrow from 2 to 1 with probability $\beta$, self-loops with probabilities $1 - \alpha$ at state 1 and $1 - \beta$ at state 2.]

Another way to summarise the information is by the 2×2 transition probability matrix

$$\mathsf{P} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} = \begin{pmatrix} 0.95 & 0.05 \\ 0.99 & 0.01 \end{pmatrix}.$$

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean $1/\alpha = 1/0.05 = 20$ minutes. (This is the mean of the geometric distribution with parameter $\alpha$.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

$$X_{n+1} = \max\{X_n + D_n - S_n,\ 0\}.$$

Here, $X_n$ is the money (in pounds) at the beginning of the month, $D_n$ is the money he deposits, and $S_n$ is the money he wants to spend. So if he wishes to spend more than what is available, he can't.

Assume that $D_n$, $S_n$ are random variables with distributions $F_D$, $F_S$, respectively. Assume also that $X_0, D_0, S_0, D_1, S_1, D_2, S_2, \ldots$ are independent random variables.

The information described above by the evolution of the account together with the distributions for $D_n$ and $S_n$ leads to the (one-step) transition probability which is defined by:

$$p_{x,y} := P(X_{n+1} = y \mid X_n = x), \qquad x, y = 0, 1, 2, \ldots$$

We easily see that if $y > 0$ then

$$\begin{aligned}
p_{x,y} = P(D_n - S_n = y - x) &= \sum_{z=0}^{\infty} P(S_n = z,\ D_n - S_n = y - x) \\
&= \sum_{z=0}^{\infty} P(S_n = z,\ D_n = z + y - x) \\
&= \sum_{z=0}^{\infty} P(S_n = z)\,P(D_n = z + y - x) \\
&= \sum_{z=0}^{\infty} F_S(z)\,F_D(z + y - x),
\end{aligned}$$

where $F_S(z)$ and $F_D(z)$ are read as the probabilities $P(S_n = z)$ and $P(D_n = z)$.
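As a sanity check on the formula $p_{x,y} = \sum_z F_S(z) F_D(z + y - x)$ for $y > 0$, the sketch below computes one row of the transition matrix exactly. The two probability mass functions are hypothetical (finite support, chosen only for illustration), and the value at $y = 0$ is obtained by direct enumeration of the outcomes with $x + D_n - S_n \le 0$.

```python
from fractions import Fraction

# Hypothetical pmfs (illustrative assumptions, not taken from the notes):
# F_D[z] = P(D_n = z) and F_S[z] = P(S_n = z), both with finite support.
F_D = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
F_S = {0: Fraction(1, 3), 1: Fraction(1, 3), 3: Fraction(1, 3)}

def p(x, y):
    """One-step transition probability P(X_{n+1} = y | X_n = x)."""
    if y > 0:
        # Sum over z of F_S(z) * F_D(z + y - x), as in the displayed formula.
        return sum(ps * F_D.get(z + y - x, Fraction(0)) for z, ps in F_S.items())
    # y = 0 collects every outcome in which x + D_n - S_n <= 0.
    return sum(ps * pd
               for z, ps in F_S.items()
               for d, pd in F_D.items()
               if x + d - z <= 0)

x = 1
max_y = x + max(F_D)  # largest balance reachable from x in one month
row = [p(x, y) for y in range(max_y + 1)]
print(row, sum(row))
```

Exact rational arithmetic is used so that the row sum can be compared with 1 without rounding error.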

In principle, these transition probabilities can be computed from $F_S$ and $F_D$. Of course, if $y = 0$, we have

$$p_{x,0} = 1 - \sum_{y=1}^{\infty} p_{x,y}.$$

The transition probability matrix is

$$\mathsf{P} = \begin{pmatrix}
p_{0,0} & p_{0,1} & p_{0,2} & \cdots \\
p_{1,0} & p_{1,1} & p_{1,2} & \cdots \\
p_{2,0} & p_{2,1} & p_{2,2} & \cdots \\
\cdots & \cdots & \cdots & \cdots
\end{pmatrix}$$

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

[Transition diagram: states $\ldots, -2, -1, 0, 1, 2, 3, \ldots$ arranged on a line; from each state the walker moves one step to the right or one step to the left, each with probability 1/2.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability $p$ for the drunkard to go one step to the right (and probability $1 - p$ for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Transition diagram: three states H (healthy), S (sick) and D (dead), with arrow labels 0.3, 0.8, 0.01 and 0.1.]

Thus, there is probability $p_{H,S} = 0.3$ that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits $p_{H,H}$ and $p_{S,S}$ because $p_{H,H} = 1 - p_{H,S} - p_{H,D}$, and $p_{S,S} = 1 - p_{S,H} - p_{S,D}$. Also, $p_{D,D}$ is omitted, the reason being that the company does not believe that its clients are subject to resurrection; therefore, $p_{D,D} = 1$.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
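The lifetime question can already be explored numerically, before any theory is developed. The sketch below is one possible approach, under stated assumptions: only $p_{H,S} = 0.3$ is confirmed by the text, while the assignment $p_{S,H} = 0.8$, $p_{H,D} = 0.01$, $p_{S,D} = 0.1$ of the remaining diagram labels is an assumption made here for illustration. The lifetime $T$ is taken to be the first month in which the individual is in state D, and the function name `lifetime_pmf` is hypothetical.

```python
# Assumed transition probabilities out of the two "alive" states.
# Only p["H"]["S"] = 0.3 is stated in the text; the rest is an assumption.
p = {
    "H": {"H": 0.69, "S": 0.30, "D": 0.01},
    "S": {"H": 0.80, "S": 0.10, "D": 0.10},
}

def lifetime_pmf(n_max, start="H"):
    """Return [P(T = 1), ..., P(T = n_max)], where T is the month of death."""
    alive = {"H": 0.0, "S": 0.0}
    alive[start] = 1.0  # probability of being alive in each state at the current month
    pmf = []
    for _ in range(n_max):
        # Probability of dying in the next step, given the current alive mass.
        pmf.append(sum(alive[s] * p[s]["D"] for s in alive))
        # Propagate the mass that stays alive by one month.
        alive = {t: sum(alive[s] * p[s][t] for s in alive) for t in ("H", "S")}
    return pmf

pmf = lifetime_pmf(5000)
mean_lifetime = sum(n * q for n, q in enumerate(pmf, start=1))
print(sum(pmf), mean_lifetime)
```

Under these assumed numbers the mean lifetime works out to $400/13 \approx 30.8$ months, which can be confirmed by solving the two linear equations $m_H = 1 + 0.69\,m_H + 0.3\,m_S$ and $m_S = 1 + 0.8\,m_H + 0.1\,m_S$ by hand.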

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables $\{X_n,\ n \in T\}$, where $T$ is a subset of the integers. We say that this sequence has the Markov property if, for any $n \in T$, the future process $(X_m,\ m > n,\ m \in T)$ is independent of the past process $(X_m,\ m < n,\ m \in T)$, conditionally on $X_n$. Sometimes we say that $\{X_n,\ n \in T\}$ is Markov instead of saying that it has the Markov property.

Usually, $T = \mathbb{Z}_+ = \{0, 1, 2, 3, \ldots\}$ or $\mathbb{N} = \{1, 2, 3, \ldots\}$ or the set $\mathbb{Z}$ of all integers (positive, zero, or negative). We focus almost exclusively on the first choice. So we will omit referring to $T$ explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the $X_n$ take values in some countable set $S$, called the state space. The elements of $S$ are frequently called states. Since $S$ is countable, it is customary to call $(X_n)$ a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. $(X_n,\ n \in \mathbb{Z}_+)$ is Markov if and only if, for all $n \in \mathbb{Z}_+$ and all $i_0, \ldots, i_{n+1} \in S$,

$$P(X_{n+1} = i_{n+1} \mid X_n = i_n, \ldots, X_0 = i_0) = P(X_{n+1} = i_{n+1} \mid X_n = i_n).$$

Proof. If $(X_n)$ is Markov then the claim holds by the conditional independence of $X_{n+1}$ and $(X_0, \ldots, X_{n-1})$ given $X_n$. On the other hand, if the claim holds, we can see that by applying it several times we obtain

$$P(\forall\, k \in [n+1, n+m]\ X_k = i_k \mid X_n = i_n, \ldots, X_0 = i_0) = P(\forall\, k \in [n+1, n+m]\ X_k = i_k \mid X_n = i_n),$$

for any $n, m$ and any choice of states $i_0, \ldots, i_{n+m}$, which precisely means that $(X_{n+1}, \ldots, X_{n+m})$ and $(X_0, \ldots, X_{n-1})$ are independent conditionally on $X_n$. This is true for any $n$ and $m$, and so future is independent of past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = P(X_0 = i_0)\,P(X_1 = i_1 \mid X_0 = i_0)\,P(X_2 = i_2 \mid X_1 = i_1) \cdots P(X_n = i_n \mid X_{n-1} = i_{n-1}), \tag{1}$$

which means that the two functions

$$\mu(i) := P(X_0 = i), \quad i \in S,$$
$$p_{i,j}(n, n+1) := P(X_{n+1} = j \mid X_n = i), \quad i, j \in S,\ n \ge 0,$$

will specify all joint distributions²:

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = \mu(i_0)\, p_{i_0,i_1}(0, 1)\, p_{i_1,i_2}(1, 2) \cdots p_{i_{n-1},i_n}(n-1, n).$$

The function $\mu(i)$ is called initial distribution. The function $p_{i,j}(n, n+1)$ is called transition probability from state $i$ at time $n$ to state $j$ at time $n + 1$. More generally,

$$p_{i,j}(m, n) := P(X_n = j \mid X_m = i)$$

is the transition probability from state $i$ at time $m$ to state $j$ at time $n \ge m$. Of course,

$$p_{i,j}(n, n) = \mathbf{1}(i = j)$$

and, by (1),

$$p_{i,j}(m, n) = \sum_{i_{m+1} \in S} \cdots \sum_{i_{n-1} \in S} p_{i,i_{m+1}}(m, m+1) \cdots p_{i_{n-1},j}(n-1, n), \tag{2}$$

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each $m \le n$ the matrix

$$\mathsf{P}(m, n) := [p_{i,j}(m, n)]_{i,j \in S},$$

viz., a square matrix whose rows and columns are indexed by $S$ (and this, of course, requires some ordering on $S$) and whose entries are $p_{i,j}(m, n)$. If $|S| = d$ then we have $d \times d$ matrices, but if $S$ is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

$$\mathsf{P}(m, n) = \mathsf{P}(m, m+1)\,\mathsf{P}(m+1, m+2) \cdots \mathsf{P}(n-1, n),$$

or as

$$\mathsf{P}(m, n) = \mathsf{P}(m, \ell)\,\mathsf{P}(\ell, n) \quad \text{if } m \le \ell \le n,$$

something that can be called semigroup property or Chapman-Kolmogorov equation.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities $p_{i,j}(n, n+1)$ do not depend on $n$,

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

and therefore we can simply write

$$p_{i,j}(n, n+1) =: p_{i,j},$$

and call $p_{i,j}$ the (one-step) transition probability from state $i$ to state $j$, while

$$\mathsf{P} = [p_{i,j}]_{i,j \in S}$$

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

$$\mathsf{P}(m, n) = \underbrace{\mathsf{P} \times \mathsf{P} \times \cdots \times \mathsf{P}}_{n-m \text{ times}} = \mathsf{P}^{n-m},$$

for all $m \le n$, and the semigroup property is now even more obvious because

$$\mathsf{P}^{a+b} = \mathsf{P}^a \mathsf{P}^b,$$

for all integers $a, b \ge 0$.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let

$$\mu_n := [\mu_n(i) = P(X_n = i)]_{i \in S}$$

be the distribution of $X_n$, arranged in a row vector. Then we have

$$\mu_n = \mu_0 \mathsf{P}^n, \tag{3}$$

as follows from (1) and the definition of $\mathsf{P}$.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state $i$ and 0 to all other states is denoted by $\delta_i$. In other words,

$$\delta_i(j) := \begin{cases} 1, & \text{if } j = i \\ 0, & \text{otherwise.} \end{cases}$$

We call $\delta_i$ the unit mass at $i$. It corresponds, of course, to picking state $i$ with probability 1. Notice that any probability distribution on $S$ can be written as a linear combination of unit masses. For example, if $\mu$ assigns probability $p$ to state $i_1$ and $1 - p$ to state $i_2$, then

$$\mu = p\,\delta_{i_1} + (1 - p)\,\delta_{i_2}.$$

Starting from a specific state $i$. A useful convention used for time-homogeneous chains is to write

$$P_i(A) := P(A \mid X_0 = i).$$

Thus $P_i$ is the law of the chain when the initial distribution is $\delta_i$. Similarly, if $Y$ is a real random variable, we write

$$E_i Y := E(Y \mid X_0 = i) = \sum_y y\,P(Y = y \mid X_0 = i) = \sum_y y\,P_i(Y = y).$$
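Formula (3) and the semigroup property can be checked on the two-state mouse chain of Section 2.1 ($\alpha = 0.05$, $\beta = 0.99$). The sketch below is illustrative only: the helper names `mat_mul` and `mat_pow` are hypothetical, and exact rational arithmetic is an implementation choice made so that equalities hold without rounding.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def mat_pow(P, n):
    """n-th power of a square matrix P (n >= 0), by repeated multiplication."""
    d = len(P)
    R = [[Fraction(int(i == j)) for j in range(d)] for i in range(d)]  # identity matrix
    for _ in range(n):
        R = mat_mul(R, P)
    return R

alpha, beta = Fraction(5, 100), Fraction(99, 100)
P = [[1 - alpha, alpha],
     [beta, 1 - beta]]

mu0 = [[Fraction(1), Fraction(0)]]    # unit mass at state 1, written as a row vector
mu2 = mat_mul(mu0, mat_pow(P, 2))[0]  # distribution of X_2, i.e. mu_0 P^2
print(mu2, sum(mu2))

# Semigroup property: P^(a+b) = P^a P^b.
a, b = 2, 3
semigroup_ok = mat_pow(P, a + b) == mat_mul(mat_pow(P, a), mat_pow(P, b))
print(semigroup_ok)
```

Starting from the unit mass $\delta_1$ corresponds to computing probabilities under $P_1$ in the notation above.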

4 Stationarity

We say that the process $(X_n,\ n = 0, 1, \ldots)$ is stationary if it has the same law as $(X_n,\ n = m, m+1, \ldots)$, for any $m \in \mathbb{Z}_+$. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a regular heptagon and let a bug move on its vertices. If at time $n$ the bug is on a vertex, then, at time $n + 1$ it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, i.e. selecting a vertex with probability 1/7, then, at any point of time $n$, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are $N$ molecules of gas (where $N$ is a large number, say $N = 10^{25}$) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Let $X_n$ be the number of molecules in room 1 at time $n$. Then

$$p_{i,i+1} = P(X_{n+1} = i + 1 \mid X_n = i) = \frac{N - i}{N},$$
$$p_{i,i-1} = P(X_{n+1} = i - 1 \mid X_n = i) = \frac{i}{N}.$$

If we start the chain with $X_0 = N$ (all molecules in room 1) then the process $(X_n,\ n \ge 0)$ will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take $X_0 = N/2$. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, we indeed expect that we have $N/2$ molecules in each room. Another guess then is: Place

each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be $i$ with probability

$$\pi(i) = \binom{N}{i} 2^{-N}, \quad 0 \le i \le N,$$

(a binomial distribution). We can now verify that if $P(X_0 = i) = \pi(i)$ then $P(X_1 = i) = \pi(i)$ and, hence, $P(X_n = i) = \pi(i)$ for all $n$. Indeed,

$$\begin{aligned}
P(X_1 = i) &= \sum_j P(X_0 = j)\,p_{j,i} = P(X_0 = i-1)\,p_{i-1,i} + P(X_0 = i+1)\,p_{i+1,i} \\
&= \pi(i-1)\,p_{i-1,i} + \pi(i+1)\,p_{i+1,i} \\
&= \binom{N}{i-1} 2^{-N}\,\frac{N - i + 1}{N} + \binom{N}{i+1} 2^{-N}\,\frac{i + 1}{N} = \cdots = \pi(i).
\end{aligned}$$

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution $\pi$ with the property that if $X_0$ has distribution $\pi$ then every other $X_n$ also has distribution $\pi$. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain $(X_n,\ n \in \mathbb{Z}_+)$ is stationary if and only if it is time-homogeneous and $X_n$ has the same distribution as $X_\ell$ for all $n$ and $\ell$.

Proof. The only if part is obvious. For the if part, use (1) to see that

$$P(X_0 = i_0, X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = P(X_m = i_0, X_{m+1} = i_1, X_{m+2} = i_2, \ldots, X_{m+n} = i_n),$$

for all $m \ge 0$ and $i_0, \ldots, i_n \in S$.

In other words, a Markov chain is stationary if and only if the distribution of $X_n$ does not change with $n$. Since, by (3), the distribution $\mu_n$ at time $n$ is related to the distribution at time 0 by $\mu_n = \mu_0 \mathsf{P}^n$, we see that the Markov chain is stationary if and only if the distribution $\mu_0$ is chosen so that

$$\mu_0 = \mu_0 \mathsf{P}.$$

Changing notation, we are led to consider:

The stationarity problem: Given $[p_{ij}]$, find those distributions $\pi$ such that

$$\pi \mathsf{P} = \pi. \tag{4}$$

Such a $\pi$ is called stationary or invariant distribution. The equations (4) are called balance equations.
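The claim that the binomial distribution solves the balance equations (4) for the Ehrenfest chain can be checked exactly for a small number of molecules. The sketch below is illustrative: the value $N = 10$ and the function name `ehrenfest_matrix` are assumptions made here, and the check does not replace the algebra left as an exercise above.

```python
from fractions import Fraction
from math import comb

def ehrenfest_matrix(N):
    """Transition matrix on states 0..N: p(i, i+1) = (N - i)/N, p(i, i-1) = i/N."""
    P = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        if i < N:
            P[i][i + 1] = Fraction(N - i, N)
        if i > 0:
            P[i][i - 1] = Fraction(i, N)
    return P

N = 10  # small illustrative value; the notes have N of order 10^25 in mind
P = ehrenfest_matrix(N)

# Candidate stationary distribution: pi(i) = C(N, i) 2^(-N).
pi = [Fraction(comb(N, i), 2 ** N) for i in range(N + 1)]

# Left multiplication: (pi P)(i) = sum_j pi(j) p(j, i).
pi_P = [sum(pi[j] * P[j][i] for j in range(N + 1)) for i in range(N + 1)]
print(pi_P == pi)

# By contrast, starting with all molecules in room 1 is not stationary.
delta_N = [Fraction(int(i == N)) for i in range(N + 1)]
delta_N_P = [sum(delta_N[j] * P[j][i] for j in range(N + 1)) for i in range(N + 1)]
print(delta_N_P == delta_N)
```

The second check mirrors the remark that the chain started from $X_0 = N$ is not stationary.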

Eigenvector interpretation: The equation $\pi = \pi\mathsf{P}$ says that $\pi$ is a left eigenvector of the matrix $\mathsf{P}$ with eigenvalue 1. In addition to that, $\pi$ must be a probability:

$$\sum_{i \in S} \pi(i) = 1.$$

The matrix $\mathsf{P}$ has the eigenvalue 1 always because $\sum_{j \in S} p_{i,j} = 1$, which, in matrix notation, can be written as $\mathsf{P}\mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is a column whose entries are all 1, and this means that $\mathbf{1}$ is a (right) eigenvector of $\mathsf{P}$ corresponding to the eigenvalue 1.

Example: card shuffling. Consider a deck of $d = 52$ cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set $S_d$ of $d!$ permutations of $\{1, \ldots, d\}$. Thus, each $x \in S_d$ is of the form $x = (x_1, \ldots, x_d)$, where all the $x_i$ are distinct. It is easy to see that the stationary distribution is uniform:

$$\pi(x) = \frac{1}{d!}, \quad x \in S_d.$$

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in $i$ minutes with probability $p(i)$, $i = 1, 2, \ldots$. The times between successive buses are i.i.d. random variables. Consider the following process: At time $n$ (measured in minutes) let $X_n$ be the time till the arrival of the next bus. Then $(X_n,\ n \ge 0)$ is a Markov chain with transition probabilities

$$p_{i+1,i} = 1, \quad \text{if } i > 0,$$
$$p_{1,i} = p(i), \quad i \in \mathbb{N}.$$

The state space is $S = \mathbb{N} = \{1, 2, \ldots\}$. To find the stationary distribution, write the equations (4):

$$\pi(i) = \sum_{j \ge 1} \pi(j)\,p_{j,i} = \pi(1)\,p(i) + \pi(i + 1), \quad i \ge 1.$$

These can easily be solved:

$$\pi(i) = \pi(1) \sum_{j \ge i} p(j), \quad i \ge 1.$$

To compute $\pi(1)$, we use the normalisation condition $\sum_{i \ge 1} \pi(i) = 1$, which gives

$$\pi(1) = \Big( \sum_{i \ge 1} \sum_{j \ge i} p(j) \Big)^{-1}.$$

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, $\pi(1)$ will be zero, and so each $\pi(i)$ will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

ASSUMPTION: $\sum_{i \ge 1} \sum_{j \ge i} p(j) < \infty$.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

$$\sum_{i \ge 1} \sum_{j \ge i} p(j) = \sum_{j \ge 1} \sum_{i \ge 1} \mathbf{1}(j \ge i)\,p(j) = \sum_{j \ge 1} j\,p(j),$$

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

ASSUMPTION ⟺ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted.) Consider an infinite binary vector

$$\xi = (\xi_1, \xi_2, \ldots),$$

i.e. $\xi_i \in \{0, 1\}$ for all $i$, and transform it according to the following rule: if $\xi_1 = 0$ then shift to the left to obtain $(\xi_2, \xi_3, \ldots)$; if $\xi_1 = 1$ then shift to the left and flip all bits to obtain $(1 - \xi_2, 1 - \xi_3, \ldots)$. We have thus defined a mapping $\varphi$ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping $\varphi$ again and again:

$$\xi(n + 1) = \varphi(\xi(n)).$$

Here $\xi(n) = (\xi_1(n), \xi_2(n), \ldots)$ is the state at time $n$. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. You can check that, for any $n = 0, 1, 2, \ldots$, any $r = 1, 2, \ldots$, and any $\varepsilon_1, \ldots, \varepsilon_r \in \{0, 1\}$,

$$\begin{aligned}
P(\xi_1(n+1) = \varepsilon_1, \ldots, \xi_r(n+1) = \varepsilon_r) ={}& P(\xi_1(n) = 0,\ \xi_2(n) = \varepsilon_1, \ldots, \xi_{r+1}(n) = \varepsilon_r) \\
&+ P(\xi_1(n) = 1,\ \xi_2(n) = 1 - \varepsilon_1, \ldots, \xi_{r+1}(n) = 1 - \varepsilon_r).
\end{aligned} \tag{5}$$

We can show that if we start with the $\xi_r(0)$, $r = 1, 2, \ldots$, i.i.d. with $P(\xi_r(0) = 1) = P(\xi_r(0) = 0) = 1/2$ then the same will be true for all $n$:

$$P(\xi_r(n) = 1) = P(\xi_r(n) = 0) = 1/2.$$

The proof is by induction: if the $\xi_1(n), \xi_2(n), \ldots$ are i.i.d. with $P(\xi_r(n) = 1) = P(\xi_r(n) = 0) = 1/2$ then, from (5), the same is true at $n + 1$. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states. So it goes beyond our theory. But to even think about it will give you some strength. As for the name of the chain (bread kneading) it can be justified as follows: to each $\xi$ there corresponds a number $x$ between 0 and 1 because $\xi$ can be seen as the binary representation of the real number $x$. If $\xi_1 = 0$ then $x < 1/2$ and our shifting rule maps $x$ into $2x$. If $\xi_1 = 1$ then $x \ge 1/2$, and the rule maps $x$ into $2(1 - x)$. But the function $f(x) = \min(2x,\ 2(1 - x))$, applied to the interval $[0, 1]$ (think of $[0, 1]$ as a flexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle.
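The correspondence between the shift-and-flip rule $\varphi$ and the function $f(x) = \min(2x,\ 2(1 - x))$ can be explored on truncated bit strings. The sketch below is an illustration under an explicit assumption: only the first $r$ bits of $\xi$ are kept, so the two sides agree only up to the truncation error $2^{-(r-1)}$ (the discarded tail of zeros would turn into ones after a flip). The names `phi`, `to_number` and `tent` are hypothetical.

```python
from fractions import Fraction
import random

def phi(bits):
    """Shift left; additionally flip every remaining bit if the first bit was 1."""
    head, tail = bits[0], bits[1:]
    return tuple(tail) if head == 0 else tuple(1 - b for b in tail)

def to_number(bits):
    """Number in [0, 1) whose binary expansion starts with the given bits."""
    return sum(Fraction(b, 2 ** (k + 1)) for k, b in enumerate(bits))

def tent(x):
    """The kneading function f(x) = min(2x, 2(1 - x))."""
    return min(2 * x, 2 * (1 - x))

rng = random.Random(1)  # fixed seed, an assumption made for reproducibility
r = 40                  # number of bits kept (truncation level)
xi = tuple(rng.randint(0, 1) for _ in range(r))

# Compare "apply phi, then read off the number" with "read off the number, then apply f".
gap = abs(to_number(phi(xi)) - tent(to_number(xi)))
print(gap <= Fraction(1, 2 ** (r - 1)))
```

When the first bit is 0 the two sides agree exactly; when it is 1 they differ by exactly the truncation error.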

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution $\pi$ satisfies

$$\pi = \pi\mathsf{P} \quad \text{(balance equations)}$$
$$\pi\mathbf{1} = 1 \quad \text{(normalisation condition).}$$

In other words,

$$\pi(i) = \sum_{j \in S} \pi(j)\,p_{j,i}, \quad i \in S,$$
$$\sum_{j \in S} \pi(j) = 1.$$

If the state space has $d$ states, then the above are $d + 1$ equations. But we only have $d$ unknowns. This is OK, because the first $d$ equations are linearly dependent: summing each side over $i \in S$ gives the same total, $\sum_{i \in S} \pi(i)$, on both sides, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of $d$ linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set $A$ of states to its complement under a distribution $\pi$:

$$F(A, A^c) := \sum_{i \in A} \sum_{j \in A^c} \pi(i)\,p_{i,j}.$$

Think of $\pi(i)\,p_{i,j}$ as a "current" flowing from $i$ to $j$. So $F(A, A^c)$ is the total current from $A$ to $A^c$. We have:

Proposition 1. $\pi$ is a stationary distribution if and only if $\pi\mathbf{1} = 1$ and

$$F(A, A^c) = F(A^c, A), \quad \text{for all } A \subset S.$$

Proof. If the latter condition holds, then choose $A = \{i\}$ and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set $A \subset S$ and write

$$\pi(i) = \sum_{j \in S} \pi(j)\,p_{j,i} = \sum_{j \in A} \pi(j)\,p_{j,i} + \sum_{j \in A^c} \pi(j)\,p_{j,i}.$$

Multiply $\pi(i)$ by $\sum_{j \in S} p_{i,j}$ (which equals 1):

$$\pi(i) \sum_{j \in S} p_{i,j} = \sum_{j \in A} \pi(j)\,p_{j,i} + \sum_{j \in A^c} \pi(j)\,p_{j,i}.$$

Now split the left sum also:
\[ \sum_{j \in A} \pi(i) p_{i,j} + \sum_{j \in A^c} \pi(i) p_{i,j} = \sum_{j \in A} \pi(j) p_{j,i} + \sum_{j \in A^c} \pi(j) p_{j,i}. \]
This is true for all $i \in S$. Summing over $i \in A$, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

[Figure: schematic representation of the flow balance relations, with the flow $F(A, A^c)$ from $A$ to $A^c$ and the flow $F(A^c, A)$ from $A^c$ to $A$.]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with $p_{12} = \alpha$, $p_{21} = \beta$. (Consequently, $p_{11} = 1 - \alpha$, $p_{22} = 1 - \beta$.) Assume $0 < \alpha, \beta < 1$. We have
\[ \pi(1) \alpha = \pi(2) \beta, \]
which immediately tells us that $\pi(1)$ is proportional to $\beta$ and $\pi(2)$ proportional to $\alpha$:
\[ \pi(1) = c\beta, \qquad \pi(2) = c\alpha. \]
The constant $c$ is found from $\pi(1) + \pi(2) = 1$. That is,
\[ \pi(1) = \frac{\beta}{\alpha + \beta}, \qquad \pi(2) = \frac{\alpha}{\alpha + \beta}. \]
If we take $\alpha = 0.05$, $\beta = 0.99$ (as in the mouse example), we find $\pi(1) = 0.99/1.04 \approx 0.952$, $\pi(2) = 0.05/1.04 \approx 0.048$. Thus, in steady state, the mouse spends 95.2% of the time in room 1, and 4.8% of the time in room 2.

Example: a walk with a barrier. Consider the following chain, where $p + q = 1$:

[Figure: transition diagram on the states 0, 1, 2, 3, 4, ..., with arrows labelled $p$ pointing to the right and arrows labelled $q$ pointing to the left.]

The balance equations are:
\[ \pi(i) = p\pi(i-1) + q\pi(i+1), \quad i = 1, 2, 3, \ldots \]
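For the two-state example, the closed form can be checked against the balance equations exactly. The sketch below is an illustration rather than part of the original notes; the function names `two_state_stationary` and `step` are hypothetical, and exact rational arithmetic is an arbitrary choice made to avoid rounding.

```python
from fractions import Fraction


def two_state_stationary(alpha, beta):
    """Closed form pi = (beta, alpha) / (alpha + beta) for the 2-state chain."""
    total = alpha + beta
    return (beta / total, alpha / total)


def step(pi, P):
    """One application of pi -> pi P, i.e. the right side of the balance equations."""
    n = len(P)
    return tuple(sum(pi[j] * P[j][i] for j in range(n)) for i in range(n))


alpha, beta = Fraction(5, 100), Fraction(99, 100)   # the mouse example
P = [[1 - alpha, alpha], [beta, 1 - beta]]
pi = two_state_stationary(alpha, beta)

print(pi)                   # (Fraction(99, 104), Fraction(5, 104))
print(step(pi, P) == pi)    # True: pi satisfies pi = pi P
```

The two fractions are 0.99/1.04 and 0.05/1.04 written with integer numerators and denominators.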

The balance equations of the walk with a barrier become simpler if we choose $A = \{0, 1, \ldots, i-1\}$. If $i \geq 1$ we have
\[ F(A, A^c) = \pi(i-1) p = \pi(i) q = F(A^c, A). \]
These are much simpler than the previous ones. In fact, we can solve them immediately. For $i \geq 1$,
\[ \pi(i) = (p/q)^i \pi(0). \]
Does this provide a stationary distribution? Yes, provided that $\sum_{i=0}^{\infty} \pi(i) = 1$. This is possible if and only if $p < q$, i.e. $p < 1/2$, in which case summing the geometric series gives $\pi(0) = 1 - p/q$. (If $p \geq 1/2$ there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair $(S, E)$ where $S$ is the state space and $E \subset S \times S$ is defined by
\[ (i, j) \in E \iff p_{i,j} > 0. \]
In graph-theoretic terminology, $S$ is a set of vertices, and $E$ a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set $E$ as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state $i$ leads to state $j$ if, starting from $i$, the chain visits $j$ at some finite time with positive probability. In other words,
\[ i \text{ leads to } j \ (\text{written as } i \leadsto j) \iff P_i(\exists n \geq 0 : X_n = j) > 0. \]
Notice that this relation is reflexive:
\[ \forall i \in S \quad i \leadsto i. \]
Furthermore:

Theorem 1. The following are equivalent:
(i) $i \leadsto j$;
(ii) $p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j} > 0$ for some $n \in \mathbb{N}$ and some states $i_1, \ldots, i_{n-1}$;
(iii) $P_i(X_n = j) > 0$ for some $n \geq 0$.

Proof. If $i = j$ there is nothing to prove. So assume $i \neq j$.
• Suppose that $i \leadsto j$. Since
\[ 0 < P_i(\exists n \geq 0 : X_n = j) \leq \sum_{n \geq 0} P_i(X_n = j), \]

the sum on the right must have a nonzero term. In other words, (iii) holds.
• Suppose that (ii) holds. We have
\[ P_i(X_n = j) = \sum_{i_1, \ldots, i_{n-1}} p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j}, \]
where the sum extends over all possible choices of states in $S$. By assumption, the sum contains a nonzero term. Hence $P_i(X_n = j) > 0$, and so (iii) holds.
• Suppose now that (iii) holds. Look at the last display again. Since $P_i(X_n = j) > 0$, we have that the sum is also positive and so one of its terms is positive, meaning that (ii) holds. In addition, (i) holds, because
\[ P_i(X_n = j) \leq P_i(\exists m \geq 0 : X_m = j). \]

Corollary 1. The relation $\leadsto$ is transitive: If $i \leadsto j$ and $j \leadsto k$ then $i \leadsto k$.

5.3 The relation "communicates with"

Define next the relation
\[ i \text{ communicates with } j \ (\text{written as } i \leftrightsquigarrow j) \iff i \leadsto j \text{ and } j \leadsto i. \]
Obviously, this is symmetric ($i \leftrightsquigarrow j \iff j \leftrightsquigarrow i$).

Corollary 2. The relation $\leftrightsquigarrow$ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because $\leadsto$ is so.

Just as any equivalence relation, it partitions $S$ into equivalence classes known as communicating classes. The communicating class corresponding to the state $i$ is, by definition, the set of all states that communicate with $i$:
\[ [i] := \{ j \in S : j \leftrightsquigarrow i \}. \qquad (6) \]
So, by definition, $[i] = [j]$ if and only if $i \leftrightsquigarrow j$. Two communicating classes are either identical or completely disjoint.

[Figure: transition diagram of a chain on the states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes:
\[ \{1\}, \ \{2, 3\}, \ \{4, 5\}, \ \{6\}. \]
The first two classes differ from the last two in character. The class $\{4, 5\}$ is closed: if the chain goes in there then it never moves out of it. However, the class $\{2, 3\}$ is not closed.

More generally, we say that a set of states $C \subset S$ is closed if
\[ \sum_{j \in C} p_{i,j} = 1, \quad \text{for all } i \in C. \]

Lemma 3. A communicating class $C$ is closed if and only if the following holds: If $i \in C$ and if $i \leadsto j$ then $j \in C$.

Proof. Suppose first that $C$ is a closed communicating class. Suppose that $i \leadsto j$. We then can find states $i_1, \ldots, i_{n-1}$ such that
\[ p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},j} > 0. \]
Thus, $p_{i,i_1} > 0$. Since $i \in C$, and $C$ is closed, we have $i_1 \in C$. Similarly, $p_{i_1,i_2} > 0$, and so $i_2 \in C$, and so on; we conclude that $j \in C$. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix $P$) is called irreducible. In graph terms, starting from any state, we can reach any other state. In other words still, the state space $S$ is a closed communicating class.

If a state $i$ is such that $p_{i,i} = 1$ then we say that $i$ is an absorbing state. To put it otherwise, the single-element set $\{i\}$ is a closed communicating class.

Note: If we consider a Markov chain with transition matrix $P$ and fix $n \in \mathbb{N}$ then the Markov chain with transition matrix $P^n$ has exactly the same closed communicating classes. Why?

A state $i$ is essential if it belongs to a closed communicating class. Otherwise, the state is inessential.

Note: An inessential state $i$ is such that there exists a state $j$ such that $i \leadsto j$ but $j \not\leadsto i$.

[Figure: five communicating classes $C_1, C_2, C_3, C_4, C_5$ with arrows between some of them.]

The general structure of a Markov chain is indicated in this figure. Note that there can be no arrows between closed communicating classes. The classes $C_4$, $C_5$ in the figure above are closed communicating classes. The communicating classes $C_1$, $C_2$, $C_3$ are not closed. There can be arrows between two non-closed communicating classes but they are always in the same direction.

A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them.
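The definitions of "leads to", communicating class and closed class translate directly into a graph computation. The sketch below is only an illustration: the function names are hypothetical, and the edge set `edges` is an assumed graph chosen to be consistent with the four classes $\{1\}, \{2,3\}, \{4,5\}, \{6\}$ named in the text (the actual arrows of the figure are not recoverable here).

```python
def reachable(edges, start):
    """All states j with start ~> j; includes start itself (reflexivity)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def communicating_classes(edges, states):
    """Partition the states into classes [i] = {j : i ~> j and j ~> i}."""
    reach = {s: reachable(edges, s) for s in states}
    classes = []
    for s in states:
        cls = frozenset(t for t in reach[s] if s in reach[t])
        if cls not in classes:
            classes.append(cls)
    return classes, reach


def is_closed(cls, reach):
    """Lemma 3: a class is closed iff everything reachable from it stays inside it."""
    return all(reach[s] <= cls for s in cls)


# Hypothetical edge set (an assumption, not read off the figure).
edges = {1: [2], 2: [3, 6], 3: [2, 4], 4: [5], 5: [4], 6: [6]}
classes, reach = communicating_classes(edges, [1, 2, 3, 4, 5, 6])
for cls in classes:
    print(sorted(cls), "closed" if is_closed(cls, reach) else "not closed")
```

In this assumed graph the states of the closed classes are the essential ones, and the chain is irreducible exactly when the function returns a single class.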

5.4 Period

Consider the following chain:

[Figure: transition diagram of a chain on the four states 1, 2, 3, 4.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set $C_1 = \{1, 3\}$ at some point of time then it will move to the set $C_2 = \{2, 4\}$ at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with $X_0 \in C_1$ then we know that $X_1 \in C_2$, $X_2 \in C_1$, $X_3 \in C_2$, $X_4 \in C_1$, ... We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by $p_{ij}^{(n)}$ the entries of $P^n$. Since
\[ P^{m+n} = P^m P^n, \]
we have
\[ p_{ij}^{(m+n)} \geq p_{ik}^{(m)} p_{kj}^{(n)}, \quad \text{for all } m, n \geq 0, \ i, j, k \in S. \]
So $p_{ii}^{(2n)} \geq \bigl(p_{ii}^{(n)}\bigr)^2$, and, more generally,
\[ p_{ii}^{(\ell n)} \geq \bigl(p_{ii}^{(n)}\bigr)^{\ell}, \quad \ell \in \mathbb{N}. \]
Therefore if $p_{ii}^{(n)} > 0$ then $p_{ii}^{(\ell n)} > 0$ for all $\ell \in \mathbb{N}$. Another way to say this is:

If the integer $n$ divides $m$ (denote this by: $n \mid m$) and if $p_{ii}^{(n)} > 0$ then $p_{ii}^{(m)} > 0$.

So, whether it is possible to return to $i$ in $m$ steps can be decided by one of the integer divisors of $m$.

The period of an essential state is defined as the greatest common divisor of all natural numbers $n$ such that it is possible to return to $i$ in $n$ steps:
\[ d(i) := \gcd\{ n \in \mathbb{N} : P_i(X_n = i) > 0 \}. \]
A state $i$ is called aperiodic if $d(i) = 1$. The period of an inessential state is not defined.

Theorem 2 (the period is a class property). If $i \leftrightsquigarrow j$ then $d(i) = d(j)$.

Proof. Consider two distinct essential states $i$, $j$. Let $D_i = \{ n \in \mathbb{N} : p_{ii}^{(n)} > 0 \}$, $D_j = \{ n \in \mathbb{N} : p_{jj}^{(n)} > 0 \}$, $d(i) = \gcd D_i$, $d(j) = \gcd D_j$. If $i \leadsto j$ there is $\alpha \in \mathbb{N}$ with $p_{ij}^{(\alpha)} > 0$ and if

$j \leadsto i$ there is $\beta \in \mathbb{N}$ with $p_{ji}^{(\beta)} > 0$. So $p_{jj}^{(\alpha+\beta)} \geq p_{ji}^{(\beta)} p_{ij}^{(\alpha)} > 0$, showing that $\alpha + \beta \in D_j$. Therefore
\[ d(j) \mid \alpha + \beta. \]
Now let $n \in D_i$. Then $p_{ii}^{(n)} > 0$ and so $p_{jj}^{(\alpha+n+\beta)} \geq p_{ji}^{(\beta)} p_{ii}^{(n)} p_{ij}^{(\alpha)} > 0$, showing that $\alpha + n + \beta \in D_j$. Therefore
\[ d(j) \mid \alpha + \beta + n \quad \text{for all } n \in D_i. \]
If a number divides two other numbers then it also divides their difference. From the last two displays we get
\[ d(j) \mid n \quad \text{for all } n \in D_i. \]
So $d(j)$ is a divisor of all elements of $D_i$. Since $d(i)$ is the greatest common divisor we have $d(i) \geq d(j)$ (in fact, $d(j) \mid d(i)$). Arguing symmetrically, we obtain $d(i) \leq d(j)$ as well. So $d(i) = d(j)$.

Theorem 3. If $i$ is an aperiodic state then there exists $n_0$ such that $p_{ii}^{(n)} > 0$ for all $n \geq n_0$.

Proof. Pick $n_2, n_1 \in D_i$ such that $n_2 - n_1 = 1$. Let $n$ be sufficiently large. Divide $n$ by $n_1$ to obtain $n = q n_1 + r$, where the remainder $r \leq n_1 - 1$. Therefore
\[ n = q n_1 + r(n_2 - n_1) = (q - r) n_1 + r n_2. \]
Because $n$ is sufficiently large, $q - r > 0$. Since $n_1, n_2 \in D_i$, and $D_i$ is closed under addition (because $p_{ii}^{(m+n)} \geq p_{ii}^{(m)} p_{ii}^{(n)}$), we have $n \in D_i$ as well.

Of particular interest are irreducible chains, i.e. chains where, as defined earlier, all states communicate with one another. An irreducible chain has period $d$ if one (and hence all) of the states have period $d$. In particular, if $d = 1$, the chain is called aperiodic.

Corollary 3. An irreducible chain with finitely many states is aperiodic if and only if there exists an $n$ such that $p_{ij}^{(n)} > 0$ for all $i, j \in S$.

More generally, if an irreducible chain has period $d$ then we can decompose the state space into $d$ sets $C_0, C_1, \ldots, C_{d-1}$ such that the chain moves cyclically between them.

[Figure: the internal structure of a closed communicating class with period $d = 4$.]

Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period $d$. Then we can uniquely partition the state space $S$ into $d$ disjoint subsets $C_0, C_1, \ldots, C_{d-1}$ such that
\[ \sum_{j \in C_{r+1}} p_{ij} = 1, \quad i \in C_r, \quad r = 0, 1, \ldots, d-1. \]
(Here $C_d := C_0$.)
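The definition $d(i) = \gcd D_i$ can be explored numerically. The sketch below is illustrative only: the function names are hypothetical, the horizon `n_max` is an assumption (a finite horizon can overestimate the period if short return times are rare), and the two matrices are assumed examples, namely a walk on a 4-cycle that moves to either neighbour with probability 1/2 and a deterministic 3-cycle.

```python
from math import gcd


def return_times(P, i, n_max):
    """All n in 1..n_max with p_ii^(n) > 0, tracking only which entries are positive."""
    n_states = len(P)
    current = {i}                      # states reachable from i in exactly n steps
    times = []
    for n in range(1, n_max + 1):
        current = {k for j in current for k in range(n_states) if P[j][k] > 0}
        if i in current:
            times.append(n)
    return times


def period(P, i, n_max=50):
    """gcd of the return times found up to n_max (0 if no return was seen)."""
    d = 0
    for n in return_times(P, i, n_max):
        d = gcd(d, n)
    return d


# Assumed example: states 0..3 on a cycle, step to either neighbour w.p. 1/2.
P4 = [[0, 0.5, 0, 0.5], [0.5, 0, 0.5, 0], [0, 0.5, 0, 0.5], [0.5, 0, 0.5, 0]]
# Assumed example: deterministic cycle 0 -> 1 -> 2 -> 0.
P3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

print(period(P4, 0), period(P3, 0))    # 2 3
```

The second matrix is one possible answer to the question about a chain with period 3.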

Proof. Define the relation
\[ i \overset{d}{\sim} j \iff p_{ij}^{(nd)} > 0 \text{ for some } n \geq 0. \]
Notice that this is an equivalence relation. Hence it partitions the state space $S$ into $d$ disjoint equivalence classes. We show that these classes satisfy what we need. Assume $d > 1$. Pick a state $i_0$ and let $C_0$ be its equivalence class (i.e. the set of states $j$ such that $i_0 \overset{d}{\sim} j$). Then pick $i_1$ such that $p_{i_0 i_1} > 0$. Let $C_1$ be the equivalence class of $i_1$. Continue in this manner and define states
\[ i_0, i_1, \ldots, i_{d-1} \]
with corresponding classes
\[ C_0, C_1, \ldots, C_{d-1}. \]
It is easy to see that if we continue and pick a further state $i_d$ with $p_{i_{d-1}, i_d} > 0$, then, necessarily, $i_d \in C_0$. We now show that if $i$ belongs to one of these classes and if $p_{ij} > 0$ then, necessarily, $j$ belongs to the next class. Take, for example, $i \in C_0$. Suppose that $p_{ij} > 0$ but $j \notin C_1$ and, say, $j \in C_2$. Consider the path
\[ i_0 \to i \to j \to i_2 \to i_3 \to \cdots \to i_d \to i_0. \]
Such a path is possible because of the choice of the $i_0, i_1, \ldots$, and by the assumptions that $i_0 \overset{d}{\sim} i$, $i_2 \overset{d}{\sim} j$. The existence of such a path implies that it is possible to go from $i_0$ to $i_0$ in a number of steps which is congruent to $d - 1$ modulo $d$ (why?), hence not a multiple of $d$, which contradicts the definition of $d$.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix $P$. We define the hitting time (see footnote 4) of a set of states $A$ by
\[ T_A := \inf\{ n \geq 0 : X_n \in A \}. \]
We are interested in deriving formulae for the probability that this time is finite as well as the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

[Figure: a chain on the states 0, 1, 2 in which states 0 and 2 are absorbing and, from state 1, the chain stays at 1 with probability 1/3, moves to 0 with probability 1/2, and moves to 2 with probability 1/6.]

Footnote 4: We shall later consider the time $\inf\{ n \geq 1 : X_n \in A \}$ which differs from $T_A$ simply by considering $n \geq 1$ instead of $n \geq 0$. If $X_0 \notin A$ the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that $P_1(T_0 < \infty) < 1$, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states $A$, and define
\[ \varphi(i) := P_i(T_A < \infty). \]
We then have:

Theorem 5. The function $\varphi(i)$ satisfies
\[ \varphi(i) = 1, \quad i \in A, \qquad \varphi(i) = \sum_{j \in S} p_{ij} \varphi(j), \quad i \notin A. \]
Furthermore, if $\varphi'(i)$ is any other non-negative solution of these equations then $\varphi'(i) \geq \varphi(i)$ for all $i \in S$.

Proof. If $i \in A$ then $T_A = 0$, and so $\varphi(i) = 1$. If $i \notin A$, then $T_A \geq 1$. So $T_A = 1 + T_A'$ (where $T_A'$ is the remaining time until $A$ is hit). We first have
\[ P_i(T_A < \infty) = \sum_{j \in S} P_i(1 + T_A' < \infty \mid X_1 = j) P_i(X_1 = j) = \sum_{j \in S} P_i(T_A' < \infty \mid X_1 = j) p_{ij}. \]
But observe that the random variable $T_A'$ is a function of the future after time 1 (i.e. a function of $X_1, X_2, \ldots$). Therefore, the event $\{T_A' < \infty\}$ is independent of $X_0$, conditionally on $X_1$. Hence:
\[ P(T_A' < \infty \mid X_1 = j, X_0 = i) = P(T_A' < \infty \mid X_1 = j). \]
But the Markov chain is homogeneous, which implies that
\[ P(T_A' < \infty \mid X_1 = j) = P(T_A < \infty \mid X_0 = j) = P_j(T_A < \infty) = \varphi(j). \]
Combining the above, we have
\[ P_i(T_A < \infty) = \sum_{j \in S} \varphi(j) p_{ij}, \]
as needed. For the second part of the proof, let $\varphi'(i)$ be another solution. Then, for $i \notin A$,
\[ \varphi'(i) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \varphi'(j). \]
By self-feeding this equation, we have
\[ \varphi'(i) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \Bigl( \sum_{k \in A} p_{jk} + \sum_{k \notin A} p_{jk} \varphi'(k) \Bigr) = \sum_{j \in A} p_{ij} + \sum_{j \notin A} p_{ij} \sum_{k \in A} p_{jk} + \sum_{j \notin A} p_{ij} \sum_{k \notin A} p_{jk} \varphi'(k). \]

We recognise that the first term equals $P_i(T_A = 1)$, the second equals $P_i(T_A = 2)$, so the first two terms together equal $P_i(T_A \leq 2)$. By omitting the last term (which is non-negative) we obtain $\varphi'(i) \geq P_i(T_A \leq 2)$. By continuing self-feeding the above equation $n$ times, we obtain
\[ \varphi'(i) \geq P_i(T_A \leq n). \]
Letting $n \to \infty$, we obtain
\[ \varphi'(i) \geq P_i(T_A < \infty) = \varphi(i). \]

Example: In the example of the previous figure, we choose $A = \{0\}$, the set that contains only state 0. We let $\varphi(i) = P_i(T_0 < \infty)$, $i = 0, 1, 2$. We immediately have $\varphi(0) = 1$, and $\varphi(2) = 0$. (If the chain starts from 0 it takes no time to hit 0; if the chain starts from 2 it will never hit 0.) As for $\varphi(1)$, we have
\[ \varphi(1) = \frac{1}{3} \varphi(1) + \frac{1}{2} \varphi(0), \]
so $\varphi(1) = 3/4$.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set $A$ is hit for the first time:
\[ \psi(i) := E_i T_A. \]
We have:

Theorem 6. The function $\psi(i)$ satisfies
\[ \psi(i) = 0, \quad i \in A, \qquad \psi(i) = 1 + \sum_{j \notin A} p_{ij} \psi(j), \quad i \notin A. \]
Furthermore, if $\psi'(i)$ is any other non-negative solution of these equations then $\psi'(i) \geq \psi(i)$ for all $i \in S$.

Proof. Start the chain from $i$. If $i \in A$ then, obviously, $T_A = 0$ and so $\psi(i) := E_i T_A = 0$. If $i \notin A$, then $T_A = 1 + T_A'$, as above. Therefore,
\[ E_i T_A = 1 + E_i T_A' = 1 + \sum_{j \in S} p_{ij} E_j T_A = 1 + \sum_{j \notin A} p_{ij} \psi(j), \]
which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let $A = \{0, 2\}$, and let $\psi(i) = E_i T_A$. Clearly, $\psi(0) = \psi(2) = 0$, because it takes no time to hit $A$ if the chain starts from $A$. On the other hand,
\[ \psi(1) = 1 + \frac{1}{3} \psi(1), \]
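The first-step equations are linear, so for a finite chain they can be solved mechanically. The sketch below is an illustration under stated assumptions: the helper names `solve` and `first_step` are hypothetical, the boundary values are supplied by hand in `known` (including $\varphi(2) = 0$, which selects the minimal solution), and the transition matrix is the one implied by the equations of the example (from state 1: stay with probability 1/3, go to 0 with probability 1/2, go to 2 with probability 1/6).

```python
from fractions import Fraction


def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination over Fractions (M assumed invertible)."""
    n = len(M)
    A = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [row[n] for row in A]


def first_step(P, unknown, known, const):
    """Solve x(i) = const + sum_j P[i][j] x(j) for i in `unknown`, x fixed on `known`."""
    idx = {s: k for k, s in enumerate(unknown)}
    M, b = [], []
    for i in unknown:
        row = [Fraction(0)] * len(unknown)
        row[idx[i]] += 1
        rhs = Fraction(const)
        for j, pij in enumerate(P[i]):
            if j in idx:
                row[idx[j]] -= pij        # unknown value: move to the left side
            else:
                rhs += pij * known[j]     # known boundary value: right side
        M.append(row)
        b.append(rhs)
    return dict(zip(unknown, solve(M, b)))


P = [[1, 0, 0], [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)], [0, 0, 1]]
phi = first_step(P, [1], {0: 1, 2: 0}, 0)   # hitting probability of A = {0}
psi = first_step(P, [1], {0: 0, 2: 0}, 1)   # mean hitting time of A = {0, 2}
print(phi[1], psi[1])                       # 3/4 3/2
```

With `const=0` the function solves the equations for $\varphi$, and with `const=1` those for $\psi$; only the boundary values differ.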

Solving the equation for $\psi(1)$ gives $\psi(1) = 3/2$. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3, therefore the mean number of flips until the first success is 3/2.)

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times $T_A$, $T_B$ for two disjoint sets of states $A$, $B$. We may ask to find the probabilities
\[ \varphi_{AB}(i) = P_i(T_A < T_B). \]
It should be clear that the equations satisfied are:
\[ \varphi_{AB}(i) = 1, \quad i \in A, \qquad \varphi_{AB}(i) = 0, \quad i \in B, \qquad \varphi_{AB}(i) = \sum_{j \in S} p_{ij} \varphi_{AB}(j), \quad \text{otherwise}. \]
Furthermore, $\varphi_{AB}$ is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is $x$ a reward $f(x)$ is earned. This happens up to the first hitting time $T_A$ of the set $A$. The total reward is $f(X_0) + f(X_1) + \cdots + f(X_{T_A})$. We are interested in the mean total reward
\[ h(x) := E_x \sum_{n=0}^{T_A} f(X_n). \]
Clearly,
\[ h(x) = f(x), \quad x \in A, \]
because when $X_0 \in A$ then $T_A = 0$ and so the total reward is $f(X_0)$. Next, if $x \notin A$, as argued earlier,
\[ T_A = 1 + T_A', \]
where $T_A'$ is the remaining time, after one step, until set $A$ is reached. Then
\[ h(x) = E_x \sum_{n=0}^{1+T_A'} f(X_n) = f(x) + E_x \sum_{n=1}^{1+T_A'} f(X_n) = f(x) + E_x \sum_{n=0}^{T_A'} f(X_{n+1}), \]
where, in the last sum, we just changed index from $n$ to $n + 1$. Now the last sum is a function of the future after time 1, i.e. of the random variables $X_1, X_2, \ldots$. Hence, by the Markov property,
\[ E_x \sum_{n=0}^{T_A'} f(X_{n+1}) = \sum_{y \in S} p_{xy} h(y). \]

Thus, the set of equations satisfied by $h$ is
\[ h(x) = f(x), \quad x \in A, \qquad h(x) = f(x) + \sum_{y \in S} p_{xy} h(y), \quad x \notin A. \]
You can remember the latter ones by thinking that the expected total reward equals the immediate reward $f(x)$ plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room $i$, he collects $i$ pounds, for $i = 1, 2, 3$, unless he is in room 4, in which case he dies immediately (so $f(4) = 0$).

[Figure: plan of a house with four rooms, numbered 1, 2, 3, 4.]

The question is to compute the average total reward, if he starts from room 2, until he dies in room 4. (Sooner or later, he will die; why?) If we let $h(i)$ be the average total reward if he starts from room $i$, we have
\[ h(4) = 0, \qquad h(1) = 1 + \tfrac{1}{2} h(2), \qquad h(3) = 3 + \tfrac{1}{2} h(2), \qquad h(2) = 2 + \tfrac{1}{2} h(1) + \tfrac{1}{2} h(3). \]
We solve and find $h(2) = 8$. (Also, $h(1) = 5$, $h(3) = 7$.)

7 Gambler's ruin

Consider the simplest game of chance of tossing a (possibly unfair) coin repeatedly and independently. Let $p$ be the probability of heads and $q = 1 - p$ that of tails. If heads come up then you win £1 per pound of stake. So, if just before the $k$-th toss you decide to bet $Y_k$ pounds then after the realisation of the toss you will have earned $Y_k$ pounds if heads came up or lost $Y_k$ pounds if tails appeared. If $\xi_k$ is the outcome of the $k$-th toss (where $\xi_k = +1$ means heads and $\xi_k = -1$ means tails), your fortune after the $n$-th toss is
\[ X_n = X_0 + \sum_{k=1}^{n} Y_k \xi_k. \]
The successive stakes $(Y_k, k \in \mathbb{N})$ form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss, so

if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, $Y_k$ cannot depend on $\xi_k$. It can only depend on $\xi_1, \ldots, \xi_{k-1}$ and (as is often the case in practice) on other considerations of e.g. astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability $p$.

Let us consider a concrete problem. The gambler starts with $X_0 = x$ pounds and has a target: to make $b$ pounds ($b > x$). She will stop playing if her money falls to the level $a$ ($a < x$). To solve the problem in the simplest case, $Y_k = 1$ for all $k$, is our next goal.

We use the notation $P_x$ (respectively, $E_x$) for the probability (respectively, expectation) under the initial condition $X_0 = x$. Consider the hitting time of state $y$:
\[ T_y := \inf\{ n \geq 0 : X_n = y \}. \]
We are clearly dealing with a Markov chain in the state space
\[ \{a, a+1, a+2, \ldots, b-1, b\} \]
with transition probabilities
\[ p_{i,i+1} = p, \qquad p_{i,i-1} = q = 1 - p, \qquad a < i < b, \]
and where the states $a$, $b$ are absorbing.

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has $P(\text{heads}) = p$, $P(\text{tails}) = 1 - p$. Let £$x$ be the initial fortune of the gambler. Then the probability that the gambler's fortune will reach level $b > x$ before reaching level $a < x$ equals:
\[ P_x(T_b < T_a) = \begin{cases} \dfrac{(q/p)^{x-a} - 1}{(q/p)^{b-a} - 1}, & \text{if } p \neq q, \\[2ex] \dfrac{x - a}{b - a}, & \text{if } p = q = 1/2, \end{cases} \qquad a \leq x \leq b. \]

Proof. We first solve the problem when $a = 0$, $b = \ell > 0$ and let
\[ \varphi(x, \ell) := P_x(T_\ell < T_0), \quad 0 \leq x \leq \ell. \]
The general solution will then be obtained from
\[ P_x(T_b < T_a) = \varphi(x - a, b - a). \]
(Think why!) To simplify things, write $\varphi(x)$ instead of $\varphi(x, \ell)$. We obviously have
\[ \varphi(0) = 0, \qquad \varphi(\ell) = 1, \]

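and, by first-step analysis,
\[ \varphi(x) = p\varphi(x+1) + q\varphi(x-1), \quad 0 < x < \ell. \]
Since $p + q = 1$, we can write this as
\[ p[\varphi(x+1) - \varphi(x)] = q[\varphi(x) - \varphi(x-1)]. \]
Let $\delta(x) := \varphi(x+1) - \varphi(x)$. We have
\[ \delta(x) = \frac{q}{p} \delta(x-1) = \Bigl(\frac{q}{p}\Bigr)^2 \delta(x-2) = \cdots = \Bigl(\frac{q}{p}\Bigr)^x \delta(0). \]
Assuming that $\lambda := q/p \neq 1$, we find
\[ \varphi(x) = \delta(x-1) + \delta(x-2) + \cdots + \delta(0) = [\lambda^{x-1} + \cdots + \lambda + 1] \delta(0) = \frac{\lambda^x - 1}{\lambda - 1} \delta(0). \]
Since $\varphi(\ell) = 1$, we have $\frac{\lambda^\ell - 1}{\lambda - 1} \delta(0) = 1$ which gives $\delta(0) = (\lambda - 1)/(\lambda^\ell - 1)$. We finally find
\[ \varphi(x) = \frac{\lambda^x - 1}{\lambda^\ell - 1}, \quad 0 \leq x \leq \ell, \quad \lambda \neq 1. \]
If $\lambda = 1$ then $q = p$ and so
\[ \varphi(x+1) - \varphi(x) = \varphi(x) - \varphi(x-1). \]
This gives $\delta(x) = \delta(0)$ for all $x$ and $\varphi(x) = x\delta(0)$. Since $\varphi(\ell) = 1$, we find $\delta(0) = 1/\ell$ and so
\[ \varphi(x) = \frac{x}{\ell}, \quad 0 \leq x \leq \ell, \quad \lambda = 1. \]
Summarising, we have obtained
\[ \varphi(x, \ell) = \begin{cases} \dfrac{\lambda^x - 1}{\lambda^\ell - 1}, & \text{if } \lambda \neq 1, \\[2ex] \dfrac{x}{\ell}, & \text{if } \lambda = 1. \end{cases} \]
The general case is obtained by replacing $\ell$ by $b - a$ and $x$ by $x - a$.

Corollary 4. The ruin probability of a gambler starting with £$x$ is
\[ P_x(T_0 < \infty) = \begin{cases} (q/p)^x, & \text{if } p > 1/2, \\ 1, & \text{if } p \leq 1/2. \end{cases} \]

Proof. We have
\[ P_x(T_0 < \infty) = \lim_{\ell \to \infty} P_x(T_0 < T_\ell) = 1 - \lim_{\ell \to \infty} P_x(T_\ell < T_0). \]
If $p > 1/2$, then $\lambda = q/p < 1$, and so
\[ P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{\lambda^x - 1}{\lambda^\ell - 1} = 1 - \frac{\lambda^x - 1}{0 - 1} = \lambda^x. \]
If $p < 1/2$, then $\lambda = q/p > 1$, and so
\[ P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{\lambda^x - 1}{\lambda^\ell - 1} = 1 - 0 = 1. \]
Finally, if $p = q$, then $\lambda = 1$, and
\[ P_x(T_0 < \infty) = 1 - \lim_{\ell \to \infty} \frac{x}{\ell} = 1 - 0 = 1. \]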
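The closed form of Theorem 7 can be checked directly against the first-step equations it was derived from. The following sketch is illustrative, with hypothetical function names; it assumes $p$ is passed as a `Fraction` so that all comparisons are exact.

```python
from fractions import Fraction


def reach_b_before_a(x, a, b, p):
    """P_x(T_b < T_a) for unit stakes, using the closed form of Theorem 7."""
    q = 1 - p
    if p == q:
        return Fraction(x - a, b - a)
    lam = q / p
    return (lam ** (x - a) - 1) / (lam ** (b - a) - 1)


def satisfies_first_step(a, b, p):
    """Check phi(a) = 0, phi(b) = 1 and phi(x) = p phi(x+1) + q phi(x-1) in between."""
    q = 1 - p
    phi = {x: reach_b_before_a(x, a, b, p) for x in range(a, b + 1)}
    interior = all(phi[x] == p * phi[x + 1] + q * phi[x - 1]
                   for x in range(a + 1, b))
    return phi[a] == 0 and phi[b] == 1 and interior


print(reach_b_before_a(3, 0, 10, Fraction(1, 2)))   # 3/10
print(reach_b_before_a(1, 0, 2, Fraction(1, 3)))    # 1/3
print(satisfies_first_step(2, 9, Fraction(2, 5)))   # True
```

The second call is a sanity check: starting from 1 with targets 0 and 2, the game ends after one toss, so the answer must equal $p$.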

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value $+\infty$. In particular, a stopping time $\tau$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$ such that $1(\tau = n)$ is a deterministic function of $(X_0, \ldots, X_n)$ for all $n \in \mathbb{Z}_+$.

An example of a stopping time is the time $T_A$, the first hitting time of the set $A$, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any $n$, the future process $(X_n, X_{n+1}, \ldots)$ is independent of the past process $(X_0, \ldots, X_n)$ if we know the present $X_n$. We will now prove that this remains true if we replace the deterministic $n$ by a random stopping time $\tau$. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose $(X_n, n \geq 0)$ is a time-homogeneous Markov chain. Let $\tau$ be a stopping time. Then, conditional on $\tau < \infty$ and $X_\tau = i$, the future process $(X_{\tau+n}, n \geq 0)$ is independent of the past process $(X_0, \ldots, X_\tau)$ and has the same law as the original process started from $i$.

Proof. Let $\tau$ be a stopping time. Let $A$ be any event determined by the past $(X_0, \ldots, X_\tau)$ if $\tau < \infty$. Since
\[ (X_0, \ldots, X_\tau) 1(\tau < \infty) = \sum_{n \in \mathbb{Z}_+} (X_0, \ldots, X_n) 1(\tau = n), \]
any such $A$ must have the property that, for all $n \in \mathbb{Z}_+$, $A \cap \{\tau = n\}$ is determined by $(X_0, \ldots, X_n)$. We are going to show that for any such $A$,
\[ P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots ; A \mid X_\tau, \tau < \infty) = P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots \mid X_\tau, \tau < \infty) \, P(A \mid X_\tau, \tau < \infty) \]
and that
\[ P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots \mid X_\tau = i, \tau < \infty) = p_{i,j_1} p_{j_1,j_2} \cdots \]
We have:
\begin{align*}
P(X_{\tau+1} = j_1, X_{\tau+2} = j_2, \ldots ; A, X_\tau = i, \tau = n)
&\overset{(a)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots ; A, X_n = i, \tau = n) \\
&\overset{(b)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots \mid X_n = i, A, \tau = n) \, P(X_n = i, A, \tau = n) \\
&\overset{(c)}{=} P(X_{n+1} = j_1, X_{n+2} = j_2, \ldots \mid X_n = i) \, P(X_n = i, A, \tau = n) \\
&\overset{(d)}{=} p_{i,j_1} p_{j_1,j_2} \cdots P(X_n = i, A, \tau = n),
\end{align*}
where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that $A \cap \{\tau = n\}$ is determined by $(X_0, \ldots, X_n)$ and the ordinary splitting property at $n$, and (d) is due to the fact that $(X_n)$ is a time-homogeneous Markov chain with transition probabilities $p_{ij}$. Our assertions are now easily obtained from this by first summing over all $n$ (whence the event $\{\tau < \infty\}$ appears) and then by dividing both sides by $P(X_\tau = i, \tau < \infty)$.

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables $X_n$ comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state $i$ that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property). Schematically,

[Figure: a trajectory of the chain with the successive visit times $T_i = T_i^{(1)}, T_i^{(2)}, T_i^{(3)}, T_i^{(4)}$ to state $i$ marked on the time axis.]

These excursions are independent!

Fix a state $i \in S$ and let $T_i$ be the first time that the chain will visit this state:
\[ T_i := \inf\{ n \geq 1 : X_n = i \}. \]
Note the subtle difference between this $T_i$ and the $T_i$ of Section 6. Our $T_i$ here is always positive. The $T_i$ of Section 6 was allowed to take the value 0. If $X_0 \neq i$ then the two definitions coincide.

State $i$ may or may not be visited again. Regardless, we let $T_i^{(2)}$ be the time of the second visit. (And we set $T_i^{(1)} := T_i$.) We let $T_i^{(3)}$ be the time of the third visit, and so on. Formally, we recursively define
\[ T_i^{(r)} := \inf\{ n > T_i^{(r-1)} : X_n = i \}, \quad r = 2, 3, \ldots \]
All these random times are stopping times. (Why?) Consider the "trajectory" of the chain between two successive visits to the state $i$:
\[ \mathcal{X}_i^{(r)} := \bigl( X_n, \ T_i^{(r)} \leq n < T_i^{(r+1)} \bigr), \quad r = 1, 2, \ldots \]
Let also
\[ \mathcal{X}_i^{(0)} := \bigl( X_n, \ 0 \leq n < T_i \bigr). \]
We think of
\[ \mathcal{X}_i^{(0)}, \ \mathcal{X}_i^{(1)}, \ \mathcal{X}_i^{(2)}, \ldots \]
as "random variables", in a generalised sense. They are, in fact, random trajectories. We refer to them as excursions of the process. So $\mathcal{X}_i^{(r)}$ is the $r$-th excursion: the chain visits

state $i$ for the $r$-th time, and "goes for an excursion"; we "watch" it up until the next time it returns to state $i$.

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions
\[ \mathcal{X}_i^{(0)}, \ \mathcal{X}_i^{(1)}, \ \mathcal{X}_i^{(2)}, \ldots \]
are independent random variables. In particular, all, except, possibly, $\mathcal{X}_i^{(0)}$, have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at $T_i^{(r)}$: SMP says that, given the state at the stopping time $T_i^{(r)}$, the future after $T_i^{(r)}$ is independent of the past before $T_i^{(r)}$. But the state at $T_i^{(r)}$ is, by definition, equal to $i$. Therefore, the future is independent of the past, and this proves the first claim. The claim that $\mathcal{X}_i^{(1)}, \mathcal{X}_i^{(2)}, \ldots$ have the same distribution follows from the fact that the Markov chain is time-homogeneous. Therefore, if it starts from state $i$ at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of the observation of this section is that:

Corollary 5. (Numerical) functions $g(\mathcal{X}_i^{(0)}), g(\mathcal{X}_i^{(1)}), g(\mathcal{X}_i^{(2)}), \ldots$ of the excursions are independent random variables. Moreover, except, possibly, for the first one, they are also identical in law (i.i.d.).

Example: Define
\[ \Lambda_r := \sum_{n = T_i^{(r)}}^{T_i^{(r+1)}} 1(X_n = j). \]
The meaning of $\Lambda_r$ is this: it expresses the number of visits to a state $j$ between two successive visits to the state $i$. The last corollary assures us that $\Lambda_1, \Lambda_2, \ldots$ are independent random variables. Moreover, for time-homogeneous Markov chains, they are also identical in law (i.i.d.).

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum.

We abstract this trivial observation and give a couple of definitions:

First, we say that a state $i$ is recurrent if, starting from $i$, the chain returns to $i$ infinitely many times:
\[ P_i(X_n = i \text{ for infinitely many } n) = 1. \]
Second, we say that a state $i$ is transient if, starting from $i$, the chain returns to $i$ only finitely many times:
\[ P_i(X_n = i \text{ for infinitely many } n) = 0. \]

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:
\[ 0 < P_i(X_n = i \text{ for infinitely many } n) < 1. \]
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state $i$ (excluding the possibility that $X_0 = i$):
\[ J_i = \sum_{n=1}^{\infty} 1(X_n = i) = \sup\{ r \geq 0 : T_i^{(r)} < \infty \}. \]
Suppose that the Markov chain starts from state $i$. Notice that:
\[ \text{state } i \text{ is visited infinitely many times} \iff J_i = \infty. \]
Notice also that
\[ f_{ii} := P_i(J_i \geq 1) = P_i(T_i < \infty). \]
We claim that:

Lemma 4. Starting from $X_0 = i$, the random variable $J_i$ is geometric:
\[ P_i(J_i \geq k) = f_{ii}^k, \quad k \geq 0. \]

Proof. By the strong Markov property at time $T_i^{(k-1)}$, we have
\[ P_i(J_i \geq k) = P_i(J_i \geq k-1) P_i(J_i \geq 1). \]
Indeed, the event $\{J_i \geq k\}$ implies that $T_i^{(k-1)} < \infty$ and that $J_i \geq k - 1$. But the past before $T_i^{(k-1)}$ is independent of the future, and that is why we get the product. Therefore,
\[ P_i(J_i \geq k) = f_{ii} P_i(J_i \geq k-1) = f_{ii}^2 P_i(J_i \geq k-2) = \cdots = f_{ii}^k. \]

Notice that $f_{ii} = P_i(T_i < \infty)$ can either be equal to 1 or smaller than 1. If $f_{ii} = 1$ we have that $P_i(J_i \geq k) = 1$ for all $k$, i.e. $P_i(J_i = \infty) = 1$. This means that the state $i$ is recurrent. If $f_{ii} < 1$, then $P_i(J_i < \infty) = 1$. This means that the state $i$ is transient. Because either $f_{ii} = 1$ or $f_{ii} < 1$, we conclude what we asserted earlier, namely, every state is either recurrent or transient. We summarise this together with another useful characterisation of recurrence and transience in the following two lemmas.

Lemma 5. The following are equivalent:
1. State $i$ is recurrent.
2. $f_{ii} = P_i(T_i < \infty) = 1$.
3. $P_i(J_i = \infty) = 1$.
4. $E_i J_i = \infty$.
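The geometric law of Lemma 4 gives the mean number of returns in closed form, which makes the link between $f_{ii}$ and $E_i J_i$ in Lemma 5 explicit. The short derivation below is an added illustration; it uses only the standard tail-sum formula $E J = \sum_{k \geq 1} P(J \geq k)$ for random variables with values in $\{0, 1, 2, \ldots\} \cup \{\infty\}$.

```latex
E_i J_i \;=\; \sum_{k=1}^{\infty} P_i(J_i \ge k)
        \;=\; \sum_{k=1}^{\infty} f_{ii}^{\,k}
        \;=\;
\begin{cases}
  \dfrac{f_{ii}}{1 - f_{ii}} \;<\; \infty, & \text{if } f_{ii} < 1, \\[2ex]
  \infty, & \text{if } f_{ii} = 1.
\end{cases}
```

So the expectation is infinite exactly when $f_{ii} = 1$, and finite exactly when $f_{ii} < 1$.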

Proof. The equivalence of the first two statements was proved before the lemma. State $i$ is recurrent if, by definition, it is visited infinitely many times, which, by definition of $J_i$, is equivalent to $P_i(J_i = \infty) = 1$. Hence $E_i J_i = \infty$ also. If $E_i J_i = \infty$, then, owing to the fact that $J_i$ is a geometric random variable, we see that its expectation cannot be infinity without $J_i$ itself being infinity with probability one.

Lemma 6. The following are equivalent:
1. State $i$ is transient.
2. $f_{ii} = P_i(T_i < \infty) < 1$.
3. $P_i(J_i < \infty) = 1$.
4. $E_i J_i < \infty$.

Proof. The equivalence of the first two statements was proved above. State $i$ is transient if, by definition, it is visited finitely many times with probability 1. This, by the definition of $J_i$, is equivalent to $P_i(J_i < \infty) = 1$, and, since $J_i$ is a geometric random variable, this implies that $E_i J_i < \infty$. On the other hand, if we assume $E_i J_i < \infty$, then, clearly, $J_i$ cannot take value $+\infty$ with positive probability; therefore $P_i(J_i < \infty) = 1$.

Corollary 6. If $i$ is a transient state then $p_{ii}^{(n)} \to 0$, as $n \to \infty$.

Proof. If $i$ is transient then $E_i J_i < \infty$. But
\[ E_i J_i = E_i \sum_{n=1}^{\infty} 1(X_n = i) = \sum_{n=1}^{\infty} P_i(X_n = i) = \sum_{n=1}^{\infty} p_{ii}^{(n)}. \]
But if a series is finite then the summands go to 0, i.e. $p_{ii}^{(n)} \to 0$, as $n \to \infty$.

10.1 First hitting time decomposition

Define
\[ f_{ij}^{(n)} := P_i(T_j = n), \quad n \geq 1, \]
i.e. the probability to hit state $j$ for the first time at time $n \geq 1$, assuming that the chain starts from state $i$. Notice the difference with $p_{ij}^{(n)}$. The latter is the probability to be in state $j$ at time $n$ (not necessarily for the first time) starting from $i$. Clearly, $f_{ij}^{(n)} \leq p_{ij}^{(n)}$, but the two sequences are related in an exact way:

Lemma 7.
\[ p_{ij}^{(n)} = \sum_{m=1}^{n} f_{ij}^{(m)} p_{jj}^{(n-m)}, \quad n \geq 1. \]

Proof. From Elementary Probability,
\[ p_{ij}^{(n)} = P_i(X_n = j) = \sum_{m=1}^{n} P_i(T_j = m, X_n = j) = \sum_{m=1}^{n} P_i(T_j = m) P_i(X_n = j \mid T_j = m). \]

d. Hence. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). showing that j will appear in inﬁnitely many of the excursions for sure. 32 . Now.d. the probability that j will appear in a speciﬁc excursion equals gij = Pi (Tj < Ti ). but also fij = 1. j i. j i. Corollary 8. Start with X0 = i. . .i. excursions Xi . we have i j ⇐⇒ fij > 0. as n → ∞.r := 1 j appears in excursion Xi (r) . Corollary 7. In other words. Hence. . inﬁnitely many of them will be 1 (with probability 1). ∞ Proof. If j is a transient state then pij → 0. are i. In particular. which is ﬁnite. Proof. either all states are recurrent or all states are transient. and since they take the value 1 with positive probability. by deﬁnition.. If j is transient. and gij := Pi (Tj < Ti ) > 0.i. 10. Xi . . if we let fij := Pi (Tj < ∞) . we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj .) Theorem 10. For the second term. denoted by i j: it means that. The last statement is simply an expression in symbols of what we said above.2 Communicating classes and recurrence/transience Recall the deﬁnition of the relation “state i leads to state j”. So the random variables δj. 1. . by summing over n the relation of Lemma 7. starting from i. not only fij > 0. We now show that recurrence is a class property. Indeed. ( Remember: another property that was a class property was the property that the state have a certain period. for all states i. . r = 0. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. and hence pij as n → ∞. In a communicating class. and so the probability that j will appear in a speciﬁc excursion is positive. If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. we have n=1 pjj = Ej Jj < ∞. 
the probability to go to j in some ﬁnite time is positive.The ﬁrst term equals fij .
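Looking back at Lemma 7, the first hitting time decomposition can be verified exactly on a small chain. The sketch below is an illustration only: the three-state matrix `P` and all function names are assumptions made for the example. The first passage probabilities are computed by the first-step recursion $f_{ij}^{(1)} = p_{ij}$, $f_{ij}^{(n)} = \sum_{k \ne j} p_{ik} f_{kj}^{(n-1)}$, and the identity $p_{ij}^{(n)} = \sum_{m=1}^{n} f_{ij}^{(m)} p_{jj}^{(n-m)}$ is then checked with exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical transition matrix on S = {0, 1, 2}, used only for illustration.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0),    F(1, 2), F(1, 2)]]
S = range(len(P))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in S) for j in S] for i in S]

def n_step(n_max):
    """powers[n][i][j] = p_ij^(n) for n = 0..n_max (powers[0] is the identity)."""
    powers = [[[F(int(i == j)) for j in S] for i in S]]
    for _ in range(n_max):
        powers.append(matmul(powers[-1], P))
    return powers

def first_passage(j, n_max):
    """f[n][i] = P_i(T_j = n) for n = 1..n_max; f[0] is an unused row of zeros."""
    f = [[F(0) for _ in S], [P[i][j] for i in S]]
    for _ in range(2, n_max + 1):
        prev = f[-1]
        # First step to some k != j, then hit j for the first time n-1 steps later.
        f.append([sum(P[i][k] * prev[k] for k in S if k != j) for i in S])
    return f

def lemma7_holds(i, j, n_max):
    p = n_step(n_max)
    f = first_passage(j, n_max)
    return all(
        p[n][i][j] == sum(f[m][i] * p[n - m][j][j] for m in range(1, n + 1))
        for n in range(1, n_max + 1)
    )
```

Calling `lemma7_holds(i, j, 8)` for every pair of states compares both sides of the lemma for $n = 1, \ldots, 8$.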

Proof. If a communicating class contains a state which is transient then, necessarily, all other states are also transient. If $i$ is recurrent then every other $j$ in the same communicating class satisfies $i \rightsquigarrow j$, and hence, by the above theorem, every other $j$ is also recurrent.

Theorem 11. If $i$ is inessential then it is transient.

Proof. Suppose that $i$ is inessential, i.e. there is a state $j$ such that $i \rightsquigarrow j$ but $j \not\rightsquigarrow i$. Hence there is $m$ such that
$$P_i(A) = P_i(X_m = j) > 0,$$
but
$$P_i(A \cap B) = P_i(X_m = j,\ X_n = i \text{ for infinitely many } n) = 0.$$
Hence
$$0 < P_i(A) = P_i(A \cap B) + P_i(A \cap B^c) = P_i(A \cap B^c) \le P_i(B^c),$$
i.e.
$$P_i(B) = P_i(X_n = i \text{ for infinitely many } n) < 1.$$
Since this probability can only be 0 or 1, $P_i(B) = 0$, and so $i$ is a transient state.

So if a communicating class is recurrent then it contains no inessential states and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let $C$ be the class. By closedness, $X$ remains in $C$, if started in $C$. By finiteness, some state $i$ must be visited infinitely many times. This state is recurrent. Hence all states are recurrent.

11 Positive recurrence

Recall that if we fix a state $i$ and let $T_i$ be the first return time to $i$ then either $P_i(T_i < \infty) = 1$ (in which case we say that $i$ is recurrent) or $P_i(T_i < \infty) < 1$ (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let $U_0, U_1, U_2, \ldots$ be i.i.d. random variables, uniformly distributed in the unit interval $[0, 1]$. Let
$$T := \inf\{n \ge 1 : U_n > U_0\}.$$
Thus $T$ represents the number of variables we need to draw to get a value larger than the initial one. It should be clear that this can, for example, appear in a certain game. What is the expectation of $T$? First, observe that $T$ is finite. Indeed,
$$P(T > n) = P(U_1 \le U_0, \ldots, U_n \le U_0) = \int_0^1 P(U_1 \le x, \ldots, U_n \le x \mid U_0 = x)\, dx = \int_0^1 x^n\, dx = \frac{1}{n+1}.$$

Therefore,
$$P(T < \infty) = \lim_{n \to \infty} P(T \le n) = \lim_{n \to \infty} \Big(1 - \frac{1}{n+1}\Big) = 1.$$
But
$$E T = \sum_{n=0}^{\infty} P(T > n) = \sum_{n=0}^{\infty} \frac{1}{n+1} = \infty.$$
You are invited to think about the meaning of this: If the initial value $U_0$ is unknown, then it takes, on the average, an infinite number of steps to exceed it!

We now classify a recurrent state $i$ as follows:
We say that $i$ is positive recurrent if $E_i T_i < \infty$.
We say that $i$ is null recurrent if $E_i T_i = \infty$.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading. It is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let $Z_1, Z_2, \ldots$ be i.i.d. random variables.
(i) Suppose $E|Z_1| < \infty$ and let $\mu = E Z_1$. Then
$$P\Big(\lim_{n \to \infty} \frac{Z_1 + \cdots + Z_n}{n} = \mu\Big) = 1.$$
(ii) Suppose $E \max(Z_1, 0) = \infty$, but $E \min(Z_1, 0) > -\infty$. Then
$$P\Big(\lim_{n \to \infty} \frac{Z_1 + \cdots + Z_n}{n} = \infty\Big) = 1.$$

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence $X_0, X_1, X_2, \ldots$ of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.
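The example with $T = \inf\{n \ge 1 : U_n > U_0\}$ can be checked without any simulation. The sketch below rests on one assumption that should be stated loudly: for i.i.d. continuous random variables every ranking of $U_0, \ldots, U_n$ is equally likely, so $P(T > n)$ is the fraction of rankings in which $U_0$ is the largest. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def prob_T_exceeds(n):
    """P(T > n): the chance that U_0 is the largest of U_0, ..., U_n.

    Counts the rankings of n + 1 values in which index 0 carries the top rank.
    """
    rankings = list(permutations(range(n + 1)))
    favourable = sum(1 for r in rankings if r[0] == n)
    return Fraction(favourable, len(rankings))

def truncated_mean(N):
    """E min(T, N) = sum_{n=0}^{N-1} P(T > n), a harmonic partial sum."""
    return sum(Fraction(1, n + 1) for n in range(N))
```

The counting argument reproduces $1/(n+1)$, and `truncated_mean(N)` keeps growing with `N` (roughly like $\log N$), which is the concrete meaning of $ET = \infty$ for a random variable that is finite with probability one.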

13.1 Functions of excursions

Namely, given a recurrent state $a \in S$ (which will be referred to as the ground state for no good reason other than that it has been fixed once and for all) the sequence of excursions
$$\mathcal{X}_a^{(1)}, \mathcal{X}_a^{(2)}, \mathcal{X}_a^{(3)}, \ldots$$
forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that
$$G(\mathcal{X}_a^{(1)}), G(\mathcal{X}_a^{(2)}), G(\mathcal{X}_a^{(3)}), \ldots$$
are i.i.d. random variables, for reasonable choices of the functions $G$ taking values in $\mathbb{R}$.

Example 1: To each state $x \in S$, assign a reward $f(x)$. Then, for an excursion $\mathcal{X}$, define $G(\mathcal{X})$ to be the maximum reward received during this excursion. Specifically, in this example,
$$G(\mathcal{X}_a^{(r)}) = \max\{f(X_t) : T_a^{(r)} \le t < T_a^{(r+1)}\}.$$

Example 2: Let $f(x)$ be as in Example 1, but define $G(\mathcal{X})$ to be the total reward received over the excursion $\mathcal{X}$. Thus,
$$G(\mathcal{X}_a^{(r)}) = \sum_{T_a^{(r)} \le t < T_a^{(r+1)}} f(X_t). \qquad (7)$$

Whichever the choice of $G$, the law of large numbers tells us that
$$\frac{G(\mathcal{X}_a^{(1)}) + \cdots + G(\mathcal{X}_a^{(n)})}{n} \to E\, G(\mathcal{X}_a^{(1)}), \qquad (8)$$
as $n \to \infty$, with probability 1. Note that
$$E\, G(\mathcal{X}_a^{(1)}) = E\, G(\mathcal{X}_a^{(2)}) = \cdots = E\big[G(\mathcal{X}_a^{(0)}) \mid X_0 = a\big] = E_a G(\mathcal{X}_a^{(0)}),$$
because the excursions $\mathcal{X}_a^{(1)}, \mathcal{X}_a^{(2)}, \ldots$ are i.i.d., and because if we start with $X_0 = a$, the initial excursion $\mathcal{X}_a^{(0)}$ also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on $X_0 = a$.

13.2 Average reward

Consider now a function $f : S \to \mathbb{R}$, which, as in Example 2, we can think of as a reward received when the chain is in state $x$. We are interested in the existence of the limit
$$\text{time-average reward} = \lim_{t \to \infty} \frac{1}{t} \sum_{k=0}^{t} f(X_k),$$
which represents the 'time-average reward' received. In particular, we shall assume that $f$ is bounded and shall examine those conditions which yield a nontrivial limit.
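The two functionals of Examples 1 and 2 in Section 13.1 are easy to compute on a recorded trajectory. The following is a minimal sketch in which the sample path, the reward function and the helper names are all hypothetical. An excursion here covers the times $T_a^{(r)} \le t < T_a^{(r+1)}$, so it contains its starting visit to $a$ but not the next one.

```python
def excursions(path, a):
    """Split a path into the complete excursions that start at visits to a."""
    visits = [t for t, x in enumerate(path) if x == a]
    # Each excursion runs from one visit (included) to the next visit (excluded);
    # the unfinished piece after the last visit is dropped.
    return [path[s:e] for s, e in zip(visits, visits[1:])]

def max_reward(excursion, f):
    """Example 1: maximum reward received during the excursion."""
    return max(f(x) for x in excursion)

def total_reward(excursion, f):
    """Example 2, formula (7): total reward received over the excursion."""
    return sum(f(x) for x in excursion)

path = [0, 2, 1, 0, 0, 1, 3, 3, 0, 2]   # hypothetical sample path, ground state 0
reward = lambda x: x * x                 # hypothetical reward function f
```

For this path the complete excursions are `[0, 2, 1]`, `[0]` and `[0, 1, 3, 3]`; averaging `total_reward` over many such excursions is what (8) refers to.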

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state $a \in S$ (i.e. $E_a T_a < \infty$) and assume that the distribution of $X_0$ is such that $P(T_a < \infty) = 1$. Let $f : S \to \mathbb{R}_+$ be a bounded 'reward' function. Then, with probability one, the 'time-average reward' $\bar f$ exists and is a deterministic quantity:
$$P\Big(\lim_{t \to \infty} \frac{1}{t} \sum_{n=0}^{t} f(X_n) = \bar f\Big) = 1,$$
where
$$\bar f = \frac{E_a \sum_{n=0}^{T_a - 1} f(X_n)}{E_a T_a} = \sum_{x \in S} f(x)\, \pi(x).$$

Proof. We shall present the proof in a way that the constant $\bar f$ will reveal itself; it will be discovered. The idea is this: Suppose that $t$ is large enough. We are looking for the total reward received between 0 and $t$. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, $N_t$, say, of complete excursions from the ground state $a$ that occur between 0 and $t$:
$$N_t := \max\{r \ge 1 : T_a^{(r)} \le t\}.$$
For example, in the figure below, $N_t = 5$.

[Figure: time axis from 0 to $t$ with the successive return times $T_a^{(1)}, T_a^{(2)}, T_a^{(3)}, T_a^{(4)}, T_a^{(5)}$ marked before $t$.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
$$\sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t - 1} G_r + G_t^{\text{first}} + G_t^{\text{last}}.$$
In other words, $G_r$ is given by (7), $G_t^{\text{first}}$ is the reward up to time $T_a = T_a^{(1)}$ or $t$ (whichever is smaller), and $G_t^{\text{last}}$ is the reward over the last incomplete cycle. We assumed that $f$ is bounded, say $|f(x)| \le B$. Hence $|G_t^{\text{first}}| \le B T_a$. Since $P(T_a < \infty) = 1$, we have $B T_a / t \to 0$, as $t \to \infty$, with probability one, and so
$$\frac{1}{t} G_t^{\text{first}} \to 0,$$
as $t \to \infty$, with probability one. A similar argument shows that
$$\frac{1}{t} G_t^{\text{last}} \to 0,$$

also. So we are left with the sum of the rewards over complete cycles. Write now
$$\frac{1}{t} \sum_{r=1}^{N_t - 1} G_r = \frac{N_t}{t} \cdot \frac{1}{N_t} \sum_{r=1}^{N_t - 1} G_r. \qquad (9)$$
Look again at what the Law of Large Numbers (8) tells us: that
$$\lim_{n \to \infty} \frac{1}{n} \sum_{r=1}^{n} G_r = E G_1,$$
with probability one. Since $T_a^{(r)} \to \infty$, as $r \to \infty$, it follows that the number $N_t$ of complete cycles before time $t$ will also be tending to $\infty$, as $t \to \infty$. Hence,
$$\lim_{t \to \infty} \frac{1}{N_t} \sum_{r=1}^{N_t - 1} G_r = E G_1. \qquad (10)$$
Now look at the sequence
$$T_a^{(r)} = T_a^{(1)} + \sum_{k=2}^{r} \big(T_a^{(k)} - T_a^{(k-1)}\big)$$
and see what the Law of Large Numbers says again. Since $T_a^{(1)}/r \to 0$, and since $T_a^{(k)} - T_a^{(k-1)}$, $k = 2, 3, \ldots$, are i.i.d. random variables with common expectation $E_a T_a$, we have
$$\lim_{r \to \infty} \frac{T_a^{(r)}}{r} = E_a T_a. \qquad (11)$$
By definition, we have that the visit with index $r = N_t$ to the ground state $a$ is the last before $t$ (see also figure above):
$$T_a^{(N_t)} \le t < T_a^{(N_t + 1)}.$$
Hence
$$\frac{T_a^{(N_t)}}{N_t} \le \frac{t}{N_t} < \frac{T_a^{(N_t + 1)}}{N_t}.$$
But (11) tells us that
$$\lim_{t \to \infty} \frac{T_a^{(N_t)}}{N_t} = \lim_{t \to \infty} \frac{T_a^{(N_t + 1)}}{N_t} = E_a T_a.$$
Hence
$$\lim_{t \to \infty} \frac{N_t}{t} = \frac{1}{E_a T_a}. \qquad (12)$$
Using (10) and (12) in (9) we have
$$\lim_{t \to \infty} \frac{1}{t} \sum_{r=1}^{N_t - 1} G_r = \frac{E G_1}{E_a T_a}.$$
Thus
$$\lim_{t \to \infty} \frac{1}{t} \sum_{n=0}^{t} f(X_n) = \frac{E G_1}{E_a T_a} =: \bar f.$$

To see that $\bar f$ is as claimed, just observe that
$$E G_1 = E \sum_{T_a^{(1)} \le t < T_a^{(2)}} f(X_t) = E_a \sum_{0 \le t < T_a} f(X_t),$$
from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant $\bar f$ is a sort of average over an excursion. The numerator of $\bar f$ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). The denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both. We may write $E G_1 = E_a \sum_{0 \le t < T_a} f(X_t)$ or $E G_1 = E_a \sum_{0 < t \le T_a} f(X_t)$.

[Figure: the rewards $f(X_0) = f(a), f(X_1), f(X_2), \ldots, f(X_{T_a - 1})$ collected over one excursion, on a time axis from 0 to $T_a$.]

Comment 2: A very important particular case occurs when we choose
$$f(x) := \mathbf{1}(x = b),$$
i.e. we let $f$ be 1 at the state $b$ and 0 everywhere else. The theorem above then says that, for all $b \in S$,
$$\lim_{t \to \infty} \frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = b) = \frac{E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = b)}{E_a T_a}, \qquad (13)$$
with probability 1.

Comment 3: If in the above we let $b = a$, the ground state, then we see that the numerator equals 1. Hence
$$\lim_{t \to \infty} \frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = a) = \frac{1}{E_a T_a}, \qquad (14)$$
with probability 1. The physical interpretation of the latter should be clear: If, on the average, it takes $E_a T_a$ units of time between successive visits to the state $a$, then the long-run proportion of visits to the state $a$ should equal $1/E_a T_a$.

Comment 4: Suppose that two states, $a$, $b$, communicate and are positive recurrent. We can use $b$ as a ground state instead of $a$. Applying formula (14) with $b$ in place of $a$ we find that the limit in (13) also equals $1/E_b T_b$. The equality of these two limits gives:
$$\text{average no. of times state } b \text{ is visited between two successive visits to state } a = E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = b) = \frac{E_a T_a}{E_b T_b}.$$
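Comments 3 and 4 can be checked exactly on a small example. The sketch below is an illustration under stated assumptions: the three-state matrix `P` is hypothetical, and the expected numbers of visits per excursion are obtained from a standard linear-algebra identity that the notes do not use, namely that the row vector $\nu$ with $\nu(x) = E_a \sum_{n=0}^{T_a - 1} \mathbf{1}(X_n = x)$ solves $\nu (I - Q) = e_a$, where $Q$ is $P$ with the column of $a$ set to zero (transitions into $a$ end the excursion).

```python
from fractions import Fraction as F

# Hypothetical irreducible chain on {0, 1, 2}, used only for illustration.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0),    F(1, 2), F(1, 2)]]

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def visits_per_excursion(P, a):
    """nu[x] = expected number of times n in [0, T_a) with X_n = x, starting at a."""
    n = len(P)
    # Q forbids entering a; nu solves nu (I - Q) = e_a, i.e. (I - Q)^T nu^T = e_a^T.
    Q = [[P[x][y] if y != a else F(0) for y in range(n)] for x in range(n)]
    A = [[F(int(x == y)) - Q[y][x] for y in range(n)] for x in range(n)]
    return solve(A, [F(int(x == a)) for x in range(n)])

def mean_return_time(P, a):
    """E_a T_a, obtained by summing the expected visits over all states."""
    return sum(visits_per_excursion(P, a))
```

For this chain `visits_per_excursion(P, 0)` is `[1, 2, 1]`, so $E_0 T_0 = 4$, while $E_1 T_1 = 2$; the expected number of visits to state 1 between two visits to state 0 is indeed $E_0 T_0 / E_1 T_1 = 2$.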

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
$$\nu^{[a]}(x) := E_a \sum_{n=0}^{T_a - 1} \mathbf{1}(X_n = x), \quad x \in S, \qquad (15)$$
where $a$ is the fixed recurrent ground state whose existence was assumed, should play an important role. Indeed, we saw (Comment 2 above) that if $a$ is a positive recurrent state then
$$\frac{\nu^{[a]}(x)}{E_a T_a}$$
is the long-run fraction of times that state $x$ is visited.

Note: Although $\nu^{[a]}$ depends on the choice of the 'ground' state $a$, we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript $[a]$ will soon be dropped. In view of this, notice that the choice of notation $\nu^{[a]}$ is justifiable through the notation (6) for the communicating class of the state $a$.

Clearly, this should have something to do with the concept of stationary distribution discussed a few sections ago.

Proposition 2. Fix a ground state $a \in S$ and assume that it is recurrent$^5$. Consider the function $\nu^{[a]}$ defined in (15). Then $\nu^{[a]}$ satisfies the balance equations:
$$\nu^{[a]} = \nu^{[a]} P,$$
i.e.
$$\nu^{[a]}(x) = \sum_{y \in S} \nu^{[a]}(y)\, p_{yx}, \quad x \in S.$$
Moreover, for any state $x$ such that $a \rightsquigarrow x$ ($a$ leads to $x$), we have
$$0 < \nu^{[a]}(x) < \infty.$$

$^5$ We stress that we do not need the stronger assumption of positive recurrence here.

Proof. We start the Markov chain with $X_0 = a$. Recurrence tells us that $T_a < \infty$ with probability one. Since $a = X_{T_a}$, we can write
$$\nu^{[a]}(x) = E_a \sum_{n=1}^{T_a} \mathbf{1}(X_n = x). \qquad (16)$$
This was a trivial but important step: We replaced the $n = 0$ term with the $n = T_a$ term. Both terms are equal (each is $\mathbf{1}(x = a)$), so there is nothing lost in doing so. The real reason for doing so is because we wanted to replace a function of $X_0, X_1, X_2, \ldots$ by a function of $X_1, X_2, X_3, \ldots$

and thus be in a position to apply first-step analysis, which is precisely what we do next:
$$\begin{aligned}
\nu^{[a]}(x) &= E_a \sum_{n=1}^{\infty} \mathbf{1}(X_n = x,\ n \le T_a) \\
&= \sum_{n=1}^{\infty} P_a(X_n = x,\ n \le T_a) \\
&= \sum_{n=1}^{\infty} \sum_{y \in S} P_a(X_n = x,\ X_{n-1} = y,\ n \le T_a) \\
&= \sum_{n=1}^{\infty} \sum_{y \in S} p_{yx}\, P_a(X_{n-1} = y,\ n \le T_a) \\
&= \sum_{y \in S} p_{yx}\, E_a \sum_{n=1}^{T_a} \mathbf{1}(X_{n-1} = y) \\
&= \sum_{y \in S} p_{yx}\, \nu^{[a]}(y),
\end{aligned}$$
where the first equality is (16) and the last step uses the definition (15). So we have shown
$$\nu^{[a]}(x) = \sum_{y \in S} p_{yx}\, \nu^{[a]}(y),$$
for all $x \in S$, which are precisely the balance equations.

We just showed that $\nu^{[a]} = \nu^{[a]} P$. Therefore, $\nu^{[a]} = \nu^{[a]} P^n$, for all $n \in \mathbb{N}$.

Next, let $x$ be such that $a \rightsquigarrow x$, so there exists $n \ge 1$ such that $p_{ax}^{(n)} > 0$. Hence
$$\nu^{[a]}(x) \ge \nu^{[a]}(a)\, p_{ax}^{(n)} = p_{ax}^{(n)} > 0.$$
From Theorem 10, $x \rightsquigarrow a$, so there exists $m \ge 1$ such that $p_{xa}^{(m)} > 0$. Hence
$$1 = \nu^{[a]}(a) \ge \nu^{[a]}(x)\, p_{xa}^{(m)},$$
so $\nu^{[a]}(x) \le 1/p_{xa}^{(m)} < \infty$.

Note: The last result was shown under the assumption that $a$ is a recurrent state. We are now going to assume more, namely, that $a$ is positive recurrent. First note that, from (16),
$$\sum_{x \in S} \nu^{[a]}(x) = E_a \sum_{n=1}^{T_a} \sum_{x \in S} \mathbf{1}(X_n = x) = E_a \sum_{n=1}^{T_a} 1 = E_a T_a,$$
which, by definition, is finite when $a$ is positive recurrent. We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state $a$. Define
$$\pi^{[a]}(x) := \frac{\nu^{[a]}(x)}{E_a T_a} = \frac{E_a \sum_{n=0}^{T_a - 1} \mathbf{1}(X_n = x)}{E_a T_a}, \quad x \in S. \qquad (17)$$
This $\pi^{[a]}$ is a stationary distribution.

Proof. Since $a$ is positive recurrent, we have $E_a T_a < \infty$, and so $\pi^{[a]}$ is not trivially equal to zero. We have already shown, in Proposition 2, that $\nu^{[a]} P = \nu^{[a]}$. But $\pi^{[a]}$ is simply a multiple of $\nu^{[a]}$. Therefore, $\pi^{[a]} P = \pi^{[a]}$. It only remains to show that $\pi^{[a]}$ is a probability, i.e. that it sums up to one. But
$$\sum_{x \in S} \pi^{[a]}(x) = \frac{1}{E_a T_a} \sum_{x \in S} \nu^{[a]}(x) = 1.$$

Pause for thought: We have achieved something VERY IMPORTANT: if we have one, no matter which one, positive recurrent state, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique. This is what we will try to do in the sequel.

In other words: If there is a positive recurrent state, then the set of linear equations
$$\pi(x) = \sum_{y \in S} \pi(y)\, p_{yx}, \quad x \in S, \qquad \sum_{y \in S} \pi(y) = 1$$
has at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let $i$ be a positive recurrent state. Suppose that $i \rightsquigarrow j$. Then $j$ is positive recurrent.

Proof. Start with $X_0 = i$. Consider the excursions $\mathcal{X}_i^{(r)}$, $r = 0, 1, 2, \ldots$ We already know that $j$ is recurrent. Hence $j$ will occur in infinitely many excursions. Also, since the excursions are independent, the event that $j$ occurs in a specific excursion is independent from the event that $j$ occurs in the next excursion, and the probability $g_{ij} = P_i(T_j < T_i)$ that $j$ occurs in a specific excursion is positive. In words, to decide whether $j$ will occur in a specific excursion, we toss a coin with probability of heads $g_{ij} > 0$. If heads show up at the $r$-th excursion, then $j$ occurs in the $r$-th excursion. Otherwise, if tails show up, then the $r$-th excursion does not contain state $j$. The coins are i.i.d. Let $R_1$ (respectively, $R_2$) be the index of the first (respectively, second) excursion that contains $j$.

[Figure: successive excursions from $i$ with indices $R_1, R_1 + 1, R_1 + 2, \ldots, R_2$; state $j$ is visited during excursions $R_1$ and $R_2$. The duration of each excursion has the same law as $T_i$ under $P_i$, and $\zeta$ denotes the total duration of the excursions $R_1, \ldots, R_2$.]

We just need to show that the sum of the durations of excursions with indices $R_1, \ldots, R_2$ has finite expectation. In other words, we just need to show that
$$\zeta := \sum_{r=1}^{R_2 - R_1 + 1} Z_r$$
has finite expectation, where the $Z_r$ are i.i.d. with the same law as $T_i$. But, by Wald's lemma, the expectation of this random variable equals
$$E_i \zeta = (E_i T_i) \times E_i(R_2 - R_1 + 1) = (E_i T_i) \times (g_{ij}^{-1} + 1),$$
because $R_2 - R_1$ is a geometric random variable:
$$P_i(R_2 - R_1 = r) = (1 - g_{ij})^{r-1} g_{ij}, \quad r = 1, 2, \ldots$$
Since $g_{ij} > 0$ and $E_i T_i < \infty$ we have that $E_i \zeta < \infty$.

Corollary 10. Let $C$ be a communicating class. Then either all states in $C$ are positive recurrent, or all states are null recurrent or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent.
An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent.
An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.
An irreducible Markov chain is called transient if one (and hence all) states are transient.

[Diagram: an IRREDUCIBLE CHAIN is either RECURRENT or TRANSIENT; a RECURRENT chain is either POSITIVE RECURRENT or NULL RECURRENT.]

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Figure: transition diagram on the states 0, 1, 2, with edge labels 1, 1/3, 1/2, 1/6 and 1.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). Hence, there is a stationary distribution. But observe that there are many (infinitely many) stationary distributions: For each $0 \le \alpha \le 1$, the $\pi$ defined by
$$\pi(0) = \alpha, \quad \pi(1) = 0, \quad \pi(2) = 1 - \alpha$$
is a stationary distribution. (You should think why this is OBVIOUS!) Notice that states 1 and 2 do not communicate. This lack of communication is, indeed, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let $\pi$ be a stationary distribution. Consider the Markov chain $(X_n, n \ge 0)$ and suppose that
$$P(X_0 = x) = \pi(x), \quad x \in S.$$
In other words, we start the Markov chain from the stationary distribution $\pi$. Then, for all $n$, we have
$$P(X_n = x) = \pi(x), \quad x \in S. \qquad (18)$$
Since the chain is irreducible and positive recurrent we have $P_i(T_j < \infty) = 1$, for all states $i$ and $j$. Fix a ground state $a$. Then
$$P(T_a < \infty) = \sum_{x \in S} \pi(x)\, P_x(T_a < \infty) = \sum_{x \in S} \pi(x) = 1.$$
By (13), following Theorem 14, we have, for all $x \in S$,
$$\frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = x) \to \frac{E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = x)}{E_a T_a} = \pi^{[a]}(x),$$
as $t \to \infty$, with probability one. By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded above by 1, we can take expectations of both sides:
$$E\Big[\frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = x)\Big] \to \pi^{[a]}(x),$$
as $t \to \infty$. But, by (18),
$$E\Big[\frac{1}{t} \sum_{n=1}^{t} \mathbf{1}(X_n = x)\Big] = \frac{1}{t} \sum_{n=1}^{t} P(X_n = x) = \pi(x).$$

Therefore $\pi(x) = \pi^{[a]}(x)$, for all $x \in S$. Thus, an arbitrary stationary distribution $\pi$ must be equal to the specific one $\pi^{[a]}$. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states $a, b \in S$. Then
$$\pi^{[a]} = \pi^{[b]}.$$
In particular, we obtain the cycle formula:
$$\frac{E_a \sum_{k=0}^{T_a - 1} \mathbf{1}(X_k = x)}{E_a T_a} = \frac{E_b \sum_{k=0}^{T_b - 1} \mathbf{1}(X_k = x)}{E_b T_b}, \quad x \in S.$$

Proof. There is a unique stationary distribution. Both $\pi^{[a]}$ and $\pi^{[b]}$ are stationary distributions, so they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution $\pi$. Then, for all $a \in S$,
$$\pi(a) = \frac{1}{E_a T_a}.$$

Proof. There is only one stationary distribution. Hence $\pi = \pi^{[a]}$. In particular, $\pi(a) = \pi^{[a]}(a) = 1/E_a T_a$, from the definition of $\pi^{[a]}$.

In Section 14 we assumed the existence of a positive recurrent state $a$ and proved the existence of a stationary distribution $\pi$. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain. Assume that a stationary distribution $\pi$ exists. Then any state $i$ such that $\pi(i) > 0$ is positive recurrent.

Proof. Let $i$ be such that $\pi(i) > 0$. Let $C$ be the communicating class of $i$. Let $\pi(\cdot \mid C)$ be the distribution defined by restricting $\pi$ to $C$:
$$\pi(x \mid C) := \frac{\pi(x)}{\pi(C)}, \quad x \in C,$$
where $\pi(C) = \sum_{y \in C} \pi(y)$. Consider the restriction of the chain to $C$. Namely, delete all transition probabilities $p_{xy}$ with $x \notin C$, $y \in C$. We can easily see that we thus obtain a Markov chain with state space $C$ with stationary distribution $\pi(\cdot \mid C)$. This is, in fact, the only stationary distribution for the restricted chain because $C$ is irreducible. By the above corollary, $\pi(i \mid C) = 1/E_i T_i$. But $\pi(i \mid C) > 0$. Therefore $E_i T_i < \infty$, i.e. $i$ is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain. Decompose it into communicating classes. Then for each positive recurrent communicating class $C$ there corresponds a unique stationary distribution $\pi^C$

which assigns positive probability to each state in $C$. This is defined by picking a state (any state!) $a \in C$ and letting $\pi^C := \pi^{[a]}$. An arbitrary stationary distribution $\pi$ must be a linear combination of these $\pi^C$:

Proposition 3. To each positive recurrent communicating class $C$ there corresponds a stationary distribution $\pi^C$. Any stationary distribution $\pi$ is necessarily of the form
$$\pi = \sum_{C:\ C \text{ positive recurrent communicating class}} \alpha_C\, \pi^C,$$
where $\alpha_C \ge 0$, such that $\sum_C \alpha_C = 1$.

Proof. Let $\pi$ be an arbitrary stationary distribution. If $\pi(x) > 0$ then $x$ belongs to some positive recurrent class $C$. For each such $C$ define the conditional distribution
$$\pi(x \mid C) := \frac{\pi(x)}{\pi(C)}, \quad \text{where } \pi(C) := \sum_{x \in C} \pi(x).$$
Then $(\pi(x \mid C),\ x \in C)$ is a stationary distribution. By our uniqueness result, $\pi(x \mid C) \equiv \pi^C(x)$. So we have that the above decomposition holds with $\alpha_C = \pi(C)$.

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities $P(X_n = x)$ as $n \to \infty$. So far we have seen that, under irreducibility and positive recurrence conditions, the so-called Cesaro averages $\frac{1}{n} \sum_{k=1}^{n} P(X_k = x)$ converge to $\pi(x)$, as $n \to \infty$. A sequence of real numbers $a_n$ may not converge but its Cesaro averages $\frac{1}{n}(a_1 + \cdots + a_n)$ may converge; for example, take $a_n = (-1)^n$.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. Still, we may say it is stable because it does not (it cannot!) swing too far away.

The simplest differential equation

As another example, consider the following differential equation
$$\dot x = -a x + b, \quad x \in \mathbb{R}.$$
There are myriads of applications giving physical interpretation to this, so I'll leave it to the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever. The thing to observe here is that there is an equilibrium point, i.e. the one found by setting the right-hand side equal to zero:
$$x^* = b/a.$$

[Figure: plot of $x(t)$ against $t$, with the equilibrium level $x^*$ marked.]

If we start with $x(t = 0) = x^*$, then $x(t) = x^*$ for all $t > 0$. If, moreover, $a > 0$, then regardless of the starting position, we have $x(t) \to x^*$, as $t \to \infty$. We say that $x^*$ is a stable equilibrium.

A linear stochastic system

We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
$$X_{n+1} = \rho X_n + \xi_n.$$
Specifically, we here assume that $X_0, \xi_0, \xi_1, \xi_2, \ldots$ are given, mutually independent, random variables, and define a stochastic sequence $X_1, X_2, \ldots$ by means of the recursion above. We shall assume that the $\xi_n$ have the same law. It follows that $|\rho| < 1$ is the condition for stability. But what does stability mean? In general, it cannot mean convergence to a fixed equilibrium point, simply because the $\xi_n$'s depend on the time parameter $n$. However, notice what happens when we solve the recursion. And we are in a happy situation here, for we can do it:
$$\begin{aligned}
X_1 &= \rho X_0 + \xi_0 \\
X_2 &= \rho X_1 + \xi_1 = \rho(\rho X_0 + \xi_0) + \xi_1 = \rho^2 X_0 + \rho \xi_0 + \xi_1 \\
X_3 &= \rho X_2 + \xi_2 = \rho^3 X_0 + \rho^2 \xi_0 + \rho \xi_1 + \xi_2 \\
&\ \ \vdots \\
X_n &= \rho^n X_0 + \rho^{n-1} \xi_0 + \rho^{n-2} \xi_1 + \cdots + \rho \xi_{n-2} + \xi_{n-1}.
\end{aligned}$$
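The solved form of the recursion $X_{n+1} = \rho X_n + \xi_n$ can be confirmed by comparing it with direct iteration. This is a minimal sketch; the numerical values of $\rho$, $X_0$ and the noise sequence are hypothetical, and one fixed realisation of the $\xi_n$ stands in for the random variables.

```python
from fractions import Fraction as F

def by_recursion(rho, x0, xi):
    """Iterate X_{n+1} = rho * X_n + xi_n and return X_n for n = len(xi)."""
    x = x0
    for noise in xi:
        x = rho * x + noise
    return x

def by_formula(rho, x0, xi):
    """X_n = rho^n X_0 + sum_{k=0}^{n-1} rho^(n-1-k) xi_k."""
    n = len(xi)
    return rho ** n * x0 + sum(rho ** (n - 1 - k) * xi[k] for k in range(n))

rho, x0 = F(1, 2), F(8)                  # hypothetical values with |rho| < 1
xi = [F(1), F(-1), F(3), F(0), F(2)]     # one fixed realisation of the noise
```

Both functions return the same value for any input, and the contribution $\rho^n X_0$ of the initial state shrinks geometrically when $|\rho| < 1$, which is the sense in which the system forgets where it started.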

Since $|\rho| < 1$, we have that the first term goes to 0 as $n \to \infty$. The second term,
$$\tilde X_n := \rho^{n-1} \xi_0 + \rho^{n-2} \xi_1 + \cdots + \rho \xi_{n-2} + \xi_{n-1},$$
which is a linear combination of $\xi_0, \ldots, \xi_{n-1}$, does not converge, but notice that if we let
$$\hat X_n := \rho^{n-1} \xi_{n-1} + \rho^{n-2} \xi_{n-2} + \cdots + \rho \xi_1 + \xi_0,$$
we have
$$P(\tilde X_n \le x) = P(\hat X_n \le x).$$
The nice thing is that, under nice conditions (e.g. assume that the $\xi_n$ are bounded), $\hat X_n$ does converge to
$$\hat X_\infty = \sum_{k=0}^{\infty} \rho^k \xi_k.$$
What this means is that, while the limit of $\tilde X_n$ does not exist, the limit of its distribution does:
$$\lim_{n \to \infty} P(\tilde X_n \le x) = \lim_{n \to \infty} P(\hat X_n \le x) = P(\hat X_\infty \le x).$$
This is precisely the reason why we cannot, for time-homogeneous Markov chains, define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution $\pi$ then
$$\lim_{n \to \infty} P(X_n = x) = \pi(x),$$
uniformly in $x$, for any initial distribution. In particular, the $n$-step transition matrix $P^n$ converges, as $n \to \infty$, to a matrix with rows all equal to $\pi$:
$$\lim_{n \to \infty} p_{ij}^{(n)} = \pi(j), \quad i, j \in S.$$

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables $X$, $Y$ but not their joint law. In other words, we know $f(x) = P(X = x)$, $g(y) = P(Y = y)$, but we are not given what $P(X = x, Y = y)$ is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define
$$P(X = x, Y = y) = P(X = x)\, P(Y = y).$$
But suppose that we have some further requirement, for instance, to try to make the coincidence probability $P(X = Y)$ as large as possible. How should we then define what $P(X = x, Y = y)$ is?

Exercise: Let X. The coupling we are interested in is a coupling of processes. |P (Xn = x) − P (Yn = x)| ≤ P (T > n). precisely because of the assumption that the two processes have been constructed together. i. f (x) = y h(x. but we are not told what the joint probabilities are. and all x ∈ S. uniformly in x ∈ S. x ∈ S. Let T be a meeting of two coupled stochastic processes X. We say that ransom time T is meeting time of the two processes if Xn = Yn for all n ≥ T. but with the roles of X and Y interchanged. A coupling of them refers to a joint construction on the same probability space. in particular. then |P (Xn = x) − P (Yn = x)| → 0. n ≥ 0. n ≥ T ) = P (Xn = x. T is ﬁnite with probability one. Y . and which maximises P (X = Y ). y). Find the joint law h(x. n ≥ 0. Y = y) which respects the marginals. g(y) = P (Y = y). on the event that the two processes never meet. We are given the law of a stochastic process X = (Xn . i. Note that it makes sense to speak of a meeting time. Proof. Repeating (19). n < T ) + P (Xn = x. Suppose that we have two stochastic processes as above and suppose that they have been coupled. If. n ≥ T ) ≤ P (n < T ) + P (Yn = x). Combining the two inequalities we obtain the ﬁrst assertion: |P (Xn = x) − P (Yn = x)| ≤ P (T > n). Y be random variables with values in Z and laws f (x) = P (X = x). respectively.e. we obtain P (Yn = x) − P (Xn = x) ≤ P (T > n). This T is a random variables that takes values in the set of times or it may take values +∞. x∈S 48 . n < T ) + P (Yn = x. (19) as n → ∞. Then. we can write it as sup |P (Xn = x) − P (Yn = x)| ≤ P (T > n).e. for all n ≥ 0. and hence P (Xn = x) − P (Yn = x) ≤ P (T > n). n ≥ 0) and the law of a stochastic process Y = (Yn . P (T < ∞) = 1. n ≥ 0). Since this holds for all x ∈ S and since the right hand side does not depend on x. The following result is the so-called coupling inequality: Proposition 4. g(y) = x h(x. y). y) = P (X = x. We have P (Xn = x) = P (Xn = x.

Assume now that $P(T < \infty) = 1$. Then
$$\lim_{n \to \infty} P(T > n) = P(T = \infty) = 0.$$
Therefore,
$$\sup_{x \in S} |P(X_n = x) - P(Y_n = x)| \to 0,$$
as $n \to \infty$.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let $p_{ij}$ denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by $\pi$. We shall consider the laws of two processes. The first process, denoted by $X = (X_n, n \ge 0)$, has the law of a Markov chain with transition probabilities $p_{ij}$ and $X_0$ distributed according to some arbitrary distribution $\mu$. The second process, denoted by $Y = (Y_n, n \ge 0)$, has the law of a Markov chain with transition probabilities $p_{ij}$ and $Y_0$ distributed according to the stationary distribution $\pi$. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen. It is very important to note that we have only defined the law of $X$ and the law of $Y$ but we have not defined a joint law; in other words, we have not defined joint probabilities such as $P(X_n = x, Y_m = y)$. We now couple them by doing the straightforward thing: we assume that $X = (X_n, n \ge 0)$ is independent of $Y = (Y_n, n \ge 0)$. Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
$$T := \inf\{n \ge 0 : X_n = Y_n\}.$$
Consider the process
$$W_n := (X_n, Y_n).$$
Then $(W_n, n \ge 0)$ is a Markov chain with state space $S \times S$. Its initial state $W_0 = (X_0, Y_0)$ has distribution $P(W_0 = (x, y)) = \mu(x)\pi(y)$. Its (1-step) transition probabilities are
$$\begin{aligned}
q_{(x,y),(x',y')} &:= P\big(W_{n+1} = (x', y') \mid W_n = (x, y)\big) \\
&= P(X_{n+1} = x' \mid X_n = x)\, P(Y_{n+1} = y' \mid Y_n = y) = p_{x,x'}\, p_{y,y'}.
\end{aligned}$$
Its $n$-step transition probabilities are
$$q^{(n)}_{(x,y),(x',y')} = p^{(n)}_{x,x'}\, p^{(n)}_{y,y'}.$$
From Theorem 3 and the aperiodicity assumption, we have that $p^{(n)}_{x,x'} > 0$ and $p^{(n)}_{y,y'} > 0$ for all large $n$, implying that $q^{(n)}_{(x,y),(x',y')} > 0$ for all large $n$, and so $(W_n, n \ge 0)$ is an irreducible chain. Notice that
$$\sigma(x, y) := \pi(x)\pi(y), \quad (x, y) \in S \times S,$$
is a stationary distribution for $(W_n, n \ge 0)$. (Indeed, had we started with both $X_0$ and $Y_0$ independent and distributed according to $\pi$, then, for all $n \ge 1$, $X_n$ and $Y_n$ would be independent and distributed according to $\pi$.) By positive recurrence, $\pi(x) > 0$ for all $x \in S$.

Therefore $\sigma(x, y) > 0$ for all $(x, y) \in S \times S$. Hence $(W_n, n \ge 0)$ is positive recurrent. In particular,
$$P(T < \infty) = 1.$$
Define now a third process by:
$$Z_n := X_n \mathbf{1}(n < T) + Y_n \mathbf{1}(n \ge T).$$

[Figure: sample paths of $X$ (started from an arbitrary distribution) and $Y$ (started from the stationary distribution); $Z$ follows $X$ up to the meeting time $T$ and $Y$ thereafter.]

Thus, $Z_n$ equals $X_n$ before the meeting time of $X$ and $Y$; after the meeting time, $Z_n = Y_n$. Obviously, $Z_0 = X_0$ which has an arbitrary distribution $\mu$, by assumption. Moreover, $(Z_n, n \ge 0)$ is a Markov chain with transition probabilities $p_{ij}$. Hence $(Z_n, n \ge 0)$ is identical in law to $(X_n, n \ge 0)$. In particular,
$$P(X_n = x) = P(Z_n = x),$$
for all $n$ and $x$. Clearly, $T$ is a meeting time between $Z$ and $Y$. By the coupling inequality,
$$|P(Z_n = x) - P(Y_n = x)| \le P(T > n).$$
Since $P(Z_n = x) = P(X_n = x)$, and since $P(Y_n = x) = \pi(x)$, we have
$$|P(X_n = x) - \pi(x)| \le P(T > n),$$
which converges to zero, as $n \to \infty$, uniformly in $x$, precisely because $P(T < \infty) = 1$.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If $p_{ij}$ are the transition probabilities of an irreducible chain $X$ and if $Y$ is an independent copy of $X$, then the process $W = (X, Y)$ may fail to be irreducible. For example, suppose that $S = \{0, 1\}$ and the $p_{ij}$ are
$$p_{01} = p_{10} = 1.$$
Then the product chain has state space $S \times S = \{(0,0), (0,1), (1,0), (1,1)\}$ and the transition probabilities are
$$q_{(0,0),(1,1)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = q_{(1,1),(0,0)} = 1.$$
This is not irreducible. However, if we modify the original chain and make it aperiodic by, say,
$$p_{00} = 0.01, \quad p_{01} = 0.99, \quad p_{10} = 1,$$

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: $p_{01} = p_{10} = 1$. Suppose $X_0 = 0$. Then $X_n = 0$ for even $n$ and $= 1$ for odd $n$. Therefore $p^{(n)}_{00} = 1$ for even $n$ and $= 0$ for odd $n$. Thus, $\lim_{n \to \infty} p^{(n)}_{00}$ does not exist. However, if we modify the chain by
\[ p_{00} = 0.01, \quad p_{01} = 0.99, \quad p_{10} = 1, \]
then
\[ \lim_{n \to \infty} p^{(n)}_{00} = \lim_{n \to \infty} p^{(n)}_{10} = \pi(0), \qquad \lim_{n \to \infty} p^{(n)}_{11} = \lim_{n \to \infty} p^{(n)}_{01} = \pi(1), \]
where $\pi(0), \pi(1)$ satisfy $0.99\,\pi(0) = \pi(1)$, $\pi(0) + \pi(1) = 1$, i.e. $\pi(0) \approx 0.503$, $\pi(1) \approx 0.497$. In matrix notation,
\[ \lim_{n \to \infty} \begin{pmatrix} p^{(n)}_{00} & p^{(n)}_{01} \\ p^{(n)}_{10} & p^{(n)}_{11} \end{pmatrix} = \begin{pmatrix} 0.503 & 0.497 \\ 0.503 & 0.497 \end{pmatrix}. \]

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology which is different from the one used in the Markov chain literature. Indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong" (footnote 6). The reason is that the term "ergodic" means something specific in Mathematics (cf. Parry, 1981), and this is expressed by our terminology.

Footnote 6: We put "wrong" in quotes. Terminology cannot be, per se, wrong. It is, however, customary to have a consistent terminology.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. By the results of Section 14, we know we have a unique stationary distribution $\pi$ and that $\pi(x) > 0$ for all states $x$. Suppose that the period of the chain is $d$. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into $d$ classes $C_0, C_1, \ldots, C_{d-1}$, such that the chain moves cyclically between them: from $C_0$ to $C_1$, from $C_1$ to $C_2$, and so on, from $C_{d-1}$ back to $C_0$. It is not difficult to see that the following holds.

Lemma 8. The chain $(X_{nd}, n \ge 0)$ consists of $d$ closed communicating classes, namely $C_0, \ldots, C_{d-1}$. All states have period 1 for this chain.

Hence if we restrict the chain $(X_{nd}, n \ge 0)$ to any of the sets $C_r$ we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply. We shall show that

Lemma 9. Suppose that $p_{ij}$ are the transition probabilities of a positive recurrent irreducible chain with period $d$. Then
\[ \sum_{i \in C_r} \pi(i) = 1/d, \quad \text{for all } r = 0, 1, \ldots, d-1. \]

Proof. Let $\alpha_r := \sum_{i \in C_r} \pi(i)$. We have that $\pi$ satisfies the balance equations:
\[ \pi(i) = \sum_{j \in S} \pi(j) p_{ji}, \quad i \in S. \]
Suppose $i \in C_r$. Then $p_{ji} = 0$ unless $j \in C_{r-1}$ (indices taken modulo $d$). So
\[ \pi(i) = \sum_{j \in C_{r-1}} \pi(j) p_{ji}, \quad i \in C_r. \]
Summing up over $i \in C_r$ we obtain
\[ \alpha_r = \sum_{i \in C_r} \pi(i) = \sum_{j \in C_{r-1}} \pi(j) \sum_{i \in C_r} p_{ji} = \sum_{j \in C_{r-1}} \pi(j) = \alpha_{r-1}. \]
Since the $\alpha_r$ add up to 1, each one must be equal to $1/d$.

Suppose that $X_0 \in C_r$. Then, the (unique) stationary distribution of $(X_{nd}, n \ge 0)$ is $d\pi(i)$, $i \in C_r$. Since $(X_{nd}, n \ge 0)$ is aperiodic, we have
\[ p^{(nd)}_{ij} = P_i(X_{nd} = j) \to d\pi(j), \quad \text{as } n \to \infty, \]
for all $i, j \in C_r$. Can we also find the limit when $i$ and $j$ do not belong to the same class? Suppose that $i \in C_k$, $j \in C_\ell$. Then, starting from $i$, the original chain enters $C_\ell$ in $r = \ell - k$ steps (where $r$ is taken to be a number between 0 and $d-1$, after reduction by $d$) and, thereafter, if sampled every $d$ steps, it remains in $C_\ell$. This means that:

Theorem 18. Suppose that $p_{ij}$ are the transition probabilities of a positive recurrent irreducible chain with period $d$. Let $i \in C_k$, $j \in C_\ell$, and let $r = \ell - k$ (modulo $d$). Then
\[ p^{(r+nd)}_{ij} \to d\pi(j), \quad \text{as } n \to \infty, \]
where $\pi$ is the stationary distribution.
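Lemma 9 and Theorem 18 can be illustrated numerically. The sketch below uses a hypothetical 4-state chain of period $d = 2$ with cyclic classes $C_0 = \{0, 1\}$ and $C_1 = \{2, 3\}$; the transition probabilities are chosen only for illustration, and `matmul`, `matpow` are small helpers written for this example.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matpow(P, n):
    """n-th power of a square matrix by repeated multiplication."""
    result = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        result = matmul(result, P)
    return result

# Hypothetical chain: C0 = {0, 1} always jumps into C1 = {2, 3}, and vice versa.
P = [
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.2, 0.8],
    [0.3, 0.7, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
]
d = 2

even = matpow(P, 200)   # P^(nd) with n = 100
odd = matmul(even, P)   # P^(1 + nd)

# Candidate stationary distribution: average of an even and an odd power, started from 0.
pi = [(even[0][j] + odd[0][j]) / d for j in range(4)]
stationary_gap = max(abs(a - b) for a, b in zip(pi, matmul([pi], P)[0]))

class_mass = [pi[0] + pi[1], pi[2] + pi[3]]   # Lemma 9: each should be 1/d

print(stationary_gap, class_mass)
print(even[1][0], d * pi[0])   # i = 1, j = 0, same class, r = 0
print(odd[1][2], d * pi[2])    # i = 1 in C0, j = 2 in C1, r = 1
```

The comparisons start the chain from state 1 or 2 rather than from state 0, so they do not merely restate how `pi` was computed.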

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the $n$-step transition probabilities $p^{(n)}_{ij}$, as $n \to \infty$. Recall that we know what happens when $j$ is positive recurrent and $i$ is in the same communicating class as $j$. If the period is 1, then $p^{(n)}_{ij} \to \pi(j)$ (Theorem 17). If the period is $d$ then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if $j$ is transient, then $p^{(n)}_{ij} \to 0$ as $n \to \infty$, for all $i \in S$.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If $j$ is a null recurrent state then
\[ p^{(n)}_{ij} \to 0, \quad \text{as } n \to \infty, \]
for all $i \in S$.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each $n = 1, 2, \ldots$, a probability distribution on $\mathbb{N}$, i.e. $p^{(n)} = \big(p^{(n)}_1, p^{(n)}_2, p^{(n)}_3, \ldots\big)$, where $\sum_{i=1}^{\infty} p^{(n)}_i = 1$ and $p^{(n)}_i \ge 0$ for all $i$. Then we can find a sequence $n_1 < n_2 < n_3 < \cdots$ of integers such that $\lim_{k \to \infty} p^{(n_k)}_i$ exists for all $i \in \mathbb{N}$.

Proof. Since the numbers $p^{(n)}_1$ are contained between 0 and 1, there must be a subsequence $n^1_1 < n^1_2 < n^1_3 < \cdots$ such that $\lim_{k \to \infty} p^{(n^1_k)}_1$ exists and equals, say, $p_1$. Now consider the numbers $p^{(n^1_k)}_2$, $k = 1, 2, \ldots$. Since they are contained between 0 and 1 there must be a subsequence $n^2_k$ of $n^1_k$ such that $\lim_{k \to \infty} p^{(n^2_k)}_2$ exists and equals, say, $p_2$. It is clear how we continue. For each $i$ we have found a subsequence $n^i_k$, $k = 1, 2, \ldots$, such that $\lim_{k \to \infty} p^{(n^i_k)}_i$ exists and equals, say, $p_i$. We have, schematically,
\[ \begin{array}{ccccl} p^{(n^1_1)}_1 & p^{(n^1_2)}_1 & p^{(n^1_3)}_1 & \cdots & \to p_1 \\ p^{(n^2_1)}_2 & p^{(n^2_2)}_2 & p^{(n^2_3)}_2 & \cdots & \to p_2 \\ p^{(n^3_1)}_3 & p^{(n^3_2)}_3 & p^{(n^3_3)}_3 & \cdots & \to p_3 \\ \cdots & \cdots & \cdots & \cdots & \cdots \end{array} \]
Now pick the numbers in the diagonal. By construction, the sequence $n^2_k$ of the second row is a subsequence of $n^1_k$ of the first row; the sequence $n^3_k$ of the third row is a subsequence of $n^2_k$ of the second row; and so on. So the diagonal subsequence $(n^k_k, k = 1, 2, \ldots)$ is a subsequence of each $(n^i_k, k = 1, 2, \ldots)$, for each $i$. Hence $p^{(n^k_k)}_i \to p_i$, as $k \to \infty$, for each $i$.

The second result is Fatou's lemma, whose proof we omit (footnote 7):

Footnote 7: See Brémaud (1999).

Lemma 11. If $(y_i, i \in \mathbb{N})$, $(x^{(n)}_i, i \in \mathbb{N})$, $n = 1, 2, \ldots$, are sequences of nonnegative numbers then
\[ \sum_{i=1}^{\infty} y_i \liminf_{n \to \infty} x^{(n)}_i \le \liminf_{n \to \infty} \sum_{i=1}^{\infty} y_i x^{(n)}_i. \]

Proof of Theorem 19. Let $j$ be a null recurrent state. Suppose that for some $i$, the sequence $p^{(n)}_{ij}$ does not converge to 0. Hence we can find a subsequence $n_k$ such that
\[ \lim_{k \to \infty} p^{(n_k)}_{ij} = \alpha_j > 0. \]
Consider the probability vector $\big(p^{(n_k)}_{ix}, x \in S\big)$. By Lemma 10, we can pick a subsequence $(n'_k)$ of $(n_k)$ such that
\[ \lim_{k \to \infty} p^{(n'_k)}_{ix} = \alpha_x, \quad \text{for all } x \in S. \]
Since $\alpha_j > 0$, we have
\[ 0 < \sum_{x \in S} \alpha_x \le \liminf_{k \to \infty} \sum_{x \in S} p^{(n'_k)}_{ix} = 1. \]
The inequality follows from Lemma 11, and the last equality is obvious since we have a probability vector. But $P^{n+1} = P^n P$, i.e.
\[ p^{(n'_k + 1)}_{ix} = \sum_{y \in S} p^{(n'_k)}_{iy} p_{yx}. \]
Therefore, by Lemma 11,
\[ \lim_{k \to \infty} p^{(n'_k + 1)}_{ix} \ge \sum_{y \in S} \lim_{k \to \infty} p^{(n'_k)}_{iy} p_{yx}. \]
Hence
\[ \alpha_x \ge \sum_{y \in S} \alpha_y p_{yx}, \quad x \in S. \]
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. it is a strict inequality:
\[ \alpha_{x_0} > \sum_{y \in S} \alpha_y p_{y x_0}, \quad \text{for some } x_0 \in S. \]
This would imply that
\[ \sum_{x \in S} \alpha_x > \sum_{x \in S} \sum_{y \in S} \alpha_y p_{yx} = \sum_{y \in S} \alpha_y \sum_{x \in S} p_{yx} = \sum_{y \in S} \alpha_y, \]
and this is a contradiction (the number $\sum_{x \in S} \alpha_x = \sum_{y \in S} \alpha_y$ cannot be strictly larger than itself). Thus,
\[ \alpha_x = \sum_{y \in S} \alpha_y p_{yx}, \quad x \in S. \]
Since $\sum_{x \in S} \alpha_x \le 1$, we have that
\[ \pi(x) := \frac{\alpha_x}{\sum_{y \in S} \alpha_y} \]
satisfies the balance equations and has $\sum_{x \in S} \pi(x) = 1$, i.e. $\pi$ is a stationary distribution. Now $\pi(j) > 0$ because $\alpha_j > 0$, by assumption. By Corollary 13, state $j$ is positive recurrent: Contradiction.
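A standard example of Theorem 19 is the simple symmetric random walk on the integers, which is null recurrent and has return probabilities $p^{(2n)}_{00} = \binom{2n}{n} 2^{-2n}$. The sketch below, added here as an illustration, evaluates this formula and compares it with the asymptotic size $1/\sqrt{\pi n}$ given by Stirling's formula (here $\pi$ is the number 3.14..., not a stationary distribution). The function name `return_probability` is made up for this example.

```python
from math import comb, pi, sqrt

def return_probability(n):
    """p_00^(2n) for the simple symmetric random walk on the integers."""
    return comb(2 * n, n) / 4**n

values = {n: return_probability(n) for n in (1, 10, 100, 1000)}
for n, v in values.items():
    # the second column is the Stirling approximation 1 / sqrt(pi * n)
    print(n, v, 1 / sqrt(pi * n))
```

The probabilities tend to 0, as the theorem requires, but only at rate $n^{-1/2}$.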

19.2 The general case

It remains to see what the limit of $p^{(n)}_{ij}$ is, as $n \to \infty$, when $j$ is a positive recurrent state but $i$ and $j$ do not necessarily belong to the same communicating class. The result when the period of $j$ is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let $j$ be a positive recurrent state with period 1. Let $C$ be the communicating class of $j$ and let $\pi^C$ be the stationary distribution associated with $C$. Let $i$ be an arbitrary state. Let
\[ H_C := \inf\{n \ge 0 : X_n \in C\}. \]
Let $\varphi_C(i) := P_i(H_C < \infty)$. Then
\[ \lim_{n \to \infty} p^{(n)}_{ij} = \varphi_C(i)\, \pi^C(j). \]

Proof. If $i$ also belongs to $C$ then $\varphi_C(i) = 1$ and so the result follows from Theorem 17. In general, we have
\[ p^{(n)}_{ij} = P_i(X_n = j) = P_i(X_n = j, H_C \le n) = \sum_{m=1}^{n} P_i(H_C = m)\, P_i(X_n = j \mid H_C = m). \]
Let $\mu$ be the distribution of $X_{H_C}$ given that $X_0 = i$ and that $\{H_C < \infty\}$. From the Strong Markov Property, we have $P_i(X_n = j \mid H_C = m) = P_\mu(X_{n-m} = j)$. But Theorem 17 tells us that the limit exists regardless of the initial distribution, that is, $P_\mu(X_{n-m} = j) \to \pi^C(j)$, as $n \to \infty$. Hence
\[ \lim_{n \to \infty} p^{(n)}_{ij} = \pi^C(j) \sum_{m=1}^{\infty} P_i(H_C = m) = \pi^C(j)\, P_i(H_C < \infty), \]
which is what we need.

Remark: We can write the result of Theorem 20 as
\[ \lim_{n \to \infty} p^{(n)}_{ij} = \frac{f_{ij}}{E_j T_j}, \]
where $T_j = \inf\{n \ge 1 : X_n = j\}$, and $f_{ij} = P_i(T_j < \infty)$, as in §10.2. Indeed, $\pi^C(j) = 1/E_j T_j$, and $f_{ij} = \varphi_C(i)$. In this form, the result can be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain ($0 < \alpha, \beta, \gamma < 1$):

[Figure: transition diagram on the states 1, 2, 3, 4, 5, with edge labels $\gamma$, $1-\gamma$, $1$, $1-\alpha$, $\alpha$, $\beta$, $1-\beta$, $1$.]

The communicating classes are $I = \{3, 4\}$ (inessential states), $C_1 = \{1, 2\}$ (closed class), $C_2 = \{5\}$ (absorbing state). The stationary distributions associated with the two closed classes are:
\[ \pi^{C_1}(1) = \frac{1}{1+\gamma}, \quad \pi^{C_1}(2) = \frac{\gamma}{1+\gamma}, \quad \pi^{C_2}(5) = 1. \]
Let $\varphi(i)$ be the probability that class $C_1$ will be reached starting from state $i$. We have $\varphi(1) = \varphi(2) = 1$, $\varphi(5) = 0$, while, from first-step analysis,
\[ \varphi(3) = (1-\alpha)\varphi(2) + \alpha\varphi(4), \qquad \varphi(4) = (1-\beta)\varphi(5) + \beta\varphi(3), \]
whence
\[ \varphi(3) = \frac{1-\alpha}{1-\alpha\beta}, \qquad \varphi(4) = \frac{\beta(1-\alpha)}{1-\alpha\beta}. \]
Both $C_1$, $C_2$ are aperiodic classes. Hence, as $n \to \infty$,
\[ p^{(n)}_{31} \to \varphi(3)\pi^{C_1}(1), \quad p^{(n)}_{32} \to \varphi(3)\pi^{C_1}(2), \quad p^{(n)}_{33} \to 0, \quad p^{(n)}_{34} \to 0, \quad p^{(n)}_{35} \to (1-\varphi(3))\pi^{C_2}(5). \]
Similarly,
\[ p^{(n)}_{41} \to \varphi(4)\pi^{C_1}(1), \quad p^{(n)}_{42} \to \varphi(4)\pi^{C_1}(2), \quad p^{(n)}_{43} \to 0, \quad p^{(n)}_{44} \to 0, \quad p^{(n)}_{45} \to (1-\varphi(4))\pi^{C_2}(5). \]

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. The phase (or state) space of a particle is not just a physical space but it includes, for example, information about its velocity. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. We will avoid giving a precise definition of the term 'ergodic' but only mention what constitutes a result in the area. For example, Theorem 14 is a type of ergodic theorem. And so is the Law of Large Numbers itself (Theorem 13).

Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this.

Theorem 21. Suppose that a Markov chain possesses a recurrent state $a \in S$ and starts from an initial state $X_0$ such that $P(T_a < \infty) = 1$. Consider the quantity $\nu^{[a]}(x)$ defined as the average number of visits to state $x$ between two successive visits to state $a$ (see 16). Let $f$ be a reward function such that
\[ \sum_{x \in S} |f(x)|\, \nu^{[a]}(x) < \infty. \tag{20} \]

Then
\[ P\left( \lim_{t \to \infty} \frac{\sum_{n=0}^{t} f(X_n)}{\sum_{n=0}^{t} \mathbf{1}(X_n = a)} = \bar{f} \right) = 1, \tag{21} \]
where $\bar{f} := \sum_{x \in S} f(x)\, \nu^{[a]}(x)$.

Proof. The reader is asked to revisit the proof of Theorem 14. Note that the denominator $\sum_{n=0}^{t} \mathbf{1}(X_n = a)$ equals the number $N_t$ of visits to state $a$ up to time $t$. As in the proof of Theorem 14, we break the total reward as follows:
\[ \sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t - 1} G_r + G^{\text{first}}_t + G^{\text{last}}_t, \]
where $G_r = \sum_{T_a^{(r)} \le t < T_a^{(r+1)}} f(X_t)$, while $G^{\text{first}}_t$, $G^{\text{last}}_t$ are the total rewards over the first and last incomplete cycles, respectively. As in Theorem 14, we have
\[ \lim_{t \to \infty} \frac{1}{t} G^{\text{last}}_t = \lim_{t \to \infty} \frac{1}{t} G^{\text{first}}_t = 0, \]
with probability one. The Law of Large Numbers tells us that
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{r=0}^{n-1} G_r = E G_1, \]
with probability 1 as long as $E|G_1| < \infty$. But this is precisely the condition (20). Replacing $n$ by the subsequence $N_t$, we obtain the result. We only need to check that $E G_1 = \bar{f}$, and this is left as an exercise.

Remark 1: The difference between Theorem 14 and 21 is that, in the former, we assumed positive recurrence of the ground state $a$. If so, then, as we saw in the proof of Theorem 14, we have $N_t/t \to 1/E_a T_a$. So if we divide by $t$ the numerator and denominator of the fraction in (21), we have that the denominator equals $N_t/t$ and converges to $1/E_a T_a$, so the numerator converges to $\bar{f}/E_a T_a$, which is the quantity denoted by $\bar{f}$ in Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem (Theorem 14) holds, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. In such a chain, at least one of the states must be visited infinitely often. So there are always recurrent states. From Proposition 2 we know that the balance equations
\[ \nu = \nu P \]
have at least one solution. Now, since the number of states is finite, we have
\[ C := \sum_{i \in S} \nu(i) < \infty. \]
Hence
\[ \pi(i) := \frac{1}{C} \nu(i), \quad i \in S, \]
satisfies the balance equations and the normalisation condition $\sum_{i \in S} \pi(i) = 1$. Therefore $\pi$ is a stationary distribution. At least some state $i$ must have $\pi(i) > 0$. By Corollary 13, this state must be positive recurrent. Hence, a finite Markov chain always has positive recurrent states. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15). In addition, by Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution $\pi$. If, in addition, the chain is aperiodic, we have
\[ P^n \to \Pi, \quad \text{as } n \to \infty, \]
where $P$ is the transition matrix and $\Pi$ is a matrix all the rows of which are equal to $\pi$.

22 Time reversibility

Consider a Markov chain $(X_n, n \ge 0)$. We say that it is time-reversible, or, simply, reversible if it has the same distribution when the arrow of time is reversed. More precisely, $(X_0, X_1, \ldots, X_n)$ has the same distribution as $(X_n, X_{n-1}, \ldots, X_0)$, for all $n$. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. It is, in a sense, like playing a film backwards, without being able to tell that it is, actually, running backwards. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that $(X_0, X_1, \ldots, X_n)$ has the same distribution as $(X_n, X_{n-1}, \ldots, X_0)$ for all $n$. In particular, the first components of the two vectors must have the same distribution, that is, $X_n$ has the same distribution as $X_0$ for all $n$. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities $p_{ij}$ and stationary distribution $\pi$. Then the chain is reversible if and only if the detailed balance equations hold:
\[ \pi(i) p_{ij} = \pi(j) p_{ji}, \quad i, j \in S. \]

Proof. Suppose first that the chain is reversible. Taking $n = 1$ in the definition of reversibility, we see that $(X_0, X_1)$ has the same distribution as $(X_1, X_0)$, i.e.
\[ P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j), \]

for all $i$ and $j$. But $P(X_0 = i, X_1 = j) = \pi(i) p_{ij}$, $P(X_0 = j, X_1 = i) = \pi(j) p_{ji}$. So the detailed balance equations (DBE) hold. Conversely, suppose the DBE hold. The DBE imply immediately that $(X_0, X_1)$ has the same distribution as $(X_1, X_0)$. Now pick a state $k$ and write, using the DBE,
\[ \pi(i) p_{ij} p_{jk} = [\pi(j) p_{ji}] p_{jk} = [\pi(j) p_{jk}] p_{ji} = \pi(k) p_{kj} p_{ji}, \]
and this means that
\[ P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i), \]
for all $i, j, k$, and hence $(X_0, X_1, X_2)$ has the same distribution as $(X_2, X_1, X_0)$. By induction, it is easy to show that, for all $n$, $(X_0, X_1, \ldots, X_n)$ has the same distribution as $(X_n, X_{n-1}, \ldots, X_0)$, and the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities $p_{ij}$ and stationary distribution $\pi$. Then the chain is reversible if and only if Kolmogorov's loop criterion holds: for all $i_1, i_2, \ldots, i_m \in S$ and all $m \ge 3$,
\[ p_{i_1 i_2}\, p_{i_2 i_3} \cdots p_{i_{m-1} i_m}\, p_{i_m i_1} = p_{i_1 i_m}\, p_{i_m i_{m-1}} \cdots p_{i_2 i_1}. \tag{22} \]

Proof. Suppose first that the chain is reversible. Hence the DBE hold. We will show Kolmogorov's loop criterion (KLC). Pick 3 states $i, j, k$, and write, using the DBE,
\[ \pi(i) p_{ij} p_{jk} p_{ki} = [\pi(j) p_{ji}] p_{jk} p_{ki} = [\pi(j) p_{jk}] p_{ji} p_{ki} = [\pi(k) p_{kj}] p_{ji} p_{ki} = [\pi(k) p_{ki}] p_{ji} p_{kj} = [\pi(i) p_{ik}] p_{ji} p_{kj} = \pi(i) p_{ik} p_{kj} p_{ji}. \]
Since the chain is irreducible, we have $\pi(i) > 0$ for all $i$. Hence cancelling the $\pi(i)$ from the above display we obtain the KLC for $m = 3$. The same logic works for any $m$ and can be worked out by induction. Conversely, suppose that the KLC (22) holds. Summing up over all choices of states $i_3, i_4, \ldots, i_m$ we obtain
\[ p_{i_1 i_2}\, p^{(m-1)}_{i_2 i_1} = p^{(m-1)}_{i_1 i_2}\, p_{i_2 i_1}. \]
If the chain is aperiodic, then, letting $m \to \infty$, we have
\[ p^{(m-1)}_{i_2 i_1} \to \pi(i_1), \qquad p^{(m-1)}_{i_1 i_2} \to \pi(i_2), \]
implying that
\[ \pi(i_1) p_{i_1 i_2} = \pi(i_2) p_{i_2 i_1}. \]
These are the DBE. So the chain is reversible.

Notes:
1) A necessary condition for reversibility is that $p_{ij} > 0$ if and only if $p_{ji} > 0$. In other words, if there is an arrow from $i$ to $j$ in the graph of the chain but no arrow from $j$ to $i$, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio $p_{ij}/p_{ji}$ can be written as a product of a function of $i$ and a function of $j$, i.e. we have separation of variables:
\[ \frac{p_{ij}}{p_{ji}} = f(i) g(j). \]

(Convention: we form this ratio only when $p_{ij} > 0$.) Necessarily, $f(i)g(i)$ must be 1 (why?), so we have
\[ \frac{p_{ij}}{p_{ji}} = \frac{g(j)}{g(i)}. \]
Multiplying by $g(i)$ and summing over $j$ we find
\[ g(i) = \sum_{j \in S} g(j) p_{ji}. \]
So $g$ satisfies the balance equations. This gives the following method for showing that a finite irreducible chain is reversible:
• Form the ratio $p_{ij}/p_{ji}$ and show that the variables separate.
• If the chain has infinitely many states, then we need, in addition, to guarantee, using another method, that the chain is positive recurrent.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if $p_{ij} > 0$ then $p_{ji} > 0$. Form the graph by deleting all self-loops, and by replacing, for each pair of distinct states $i, j$ for which $p_{ij} > 0$, the two oriented edges from $i$ to $j$ and from $j$ to $i$ by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible. For example, consider the chain:

[Figure: transition diagram of the chain.]

Replace it by:

[Figure: the corresponding undirected graph.]

This is a tree. Hence the original chain is reversible. The reason is as follows. Pick two distinct states $i, j$ for which $p_{ij} > 0$. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop, and, by assumption, there isn't any.) Let $A$, $A^c$ be the sets in the two connected components. There is obviously only one arrow from $A$ to $A^c$, namely the arrow from $i$ to $j$; and the arrow from $j$ to $i$ is the only one from $A^c$ to $A$. The balance equations are written in the form $F(A, A^c) = F(A^c, A)$ (see §4.1) which gives
\[ \pi(i) p_{ij} = \pi(j) p_{ji}, \]
i.e. the detailed balance equations. Hence the chain is reversible. And hence $\pi$ can be found very easily.

Example 2: Random walk on a graph. Consider an undirected graph $G$ with vertices $S$, a finite set, and a set $E$ of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. Suppose the graph is connected. States $j$ and $i$ are called neighbours if they are connected by an edge. For each state $i$ define its degree
\[ \delta(i) = \text{number of neighbours of } i. \]
Now define the following transition probabilities:
\[ p_{ij} = \begin{cases} 1/\delta(i), & \text{if } j \text{ is a neighbour of } i, \\ 0, & \text{otherwise.} \end{cases} \]
The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example,

[Figure: an undirected graph on the vertices 0, 1, 2, 3, 4.]

Here we have $p_{12} = p_{10} = p_{14} = 1/3$, $p_{02} = 1/4$, etc.

The stationary distribution of a RWG is very easy to find. We claim that
\[ \pi(i) = C\delta(i), \]
where $C$ is a constant. To see this let us verify that the detailed balance equations hold:
\[ \pi(i) p_{ij} = \pi(j) p_{ji}. \]
There are two cases: If $i, j$ are not neighbours then both sides are 0. If $i, j$ are neighbours then $\pi(i) p_{ij} = C\delta(i) \frac{1}{\delta(i)} = C$ and $\pi(j) p_{ji} = C\delta(j) \frac{1}{\delta(j)} = C$, so both members are equal to $C$. In all cases, the DBE hold. The constant $C$ is found by normalisation, i.e. $\sum_{i \in S} \pi(i) = 1$. Since
\[ \sum_{i \in S} \delta(i) = 2|E|, \]
where $|E|$ is the total number of edges, we have that $C = 1/2|E|$. Hence we conclude that:
1. A RWG is a reversible chain.
2. The stationary distribution assigns to a state probability which is proportional to its degree.
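The two conclusions can be verified exactly on a small graph. The edge list below is an assumption made for this sketch: it is one graph on the vertices 0, ..., 4 that is consistent with the values $p_{12} = p_{10} = p_{14} = 1/3$ and $p_{02} = 1/4$ quoted above, not necessarily the graph in the original figure.

```python
from fractions import Fraction

# Hypothetical edge list (assumed; chosen to match p12 = p10 = p14 = 1/3 and p02 = 1/4).
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 4)]

vertices = sorted({v for e in edges for v in e})
neighbours = {v: set() for v in vertices}
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)
degree = {v: len(neighbours[v]) for v in vertices}

def p(i, j):
    """Transition probability of the random walk on the graph."""
    return Fraction(1, degree[i]) if j in neighbours[i] else Fraction(0)

# Claimed stationary distribution: pi(i) = delta(i) / (2 |E|).
pi = {v: Fraction(degree[v], 2 * len(edges)) for v in vertices}

# Detailed balance: pi(i) p_ij = pi(j) p_ji for every pair of states.
dbe_holds = all(pi[i] * p(i, j) == pi[j] * p(j, i) for i in vertices for j in vertices)

print(pi, dbe_holds)
```

Exact rational arithmetic via `Fraction` means the detailed balance check is an identity, not a floating-point approximation.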

23 New Markov chains from old ones*

We now deal with the question: if we have a Markov chain $(X_n)$, how can we create new ones?

23.1 Subsequences

The most obvious way to create a new Markov chain is by picking a subsequence, i.e. a sequence of integers $(n_k)_{k \ge 0}$ such that
\[ 0 \le n_0 < n_1 < n_2 < \cdots \]
and by letting
\[ Y_k := X_{n_k}, \quad k = 0, 1, 2, \ldots \]
The sequence $(Y_k, k = 0, 1, 2, \ldots)$ has the Markov property but it is not time-homogeneous, in general. If the subsequence forms an arithmetic progression, i.e. if
\[ n_k = n_0 + km, \quad k = 0, 1, 2, \ldots, \]
then $(Y_k, k = 0, 1, 2, \ldots)$ has time-homogeneous transition probabilities:
\[ P(Y_{k+1} = j \mid Y_k = i) = p^{(m)}_{ij}, \]
where $p^{(m)}_{ij}$ are the $m$-step transition probabilities of the original chain. In this case, if, in addition, $\pi$ is a stationary distribution for the original chain, then it is a stationary distribution for the new chain.

23.2 Watching a Markov chain when it visits a set

Suppose that the original chain is ergodic (i.e. irreducible and positive recurrent). Let $A$ be a set of states and let $T^{(r)}_A$ be the $r$-th visit to $A$. Define
\[ Y_r := X_{T^{(r)}_A}, \quad r = 1, 2, \ldots \]
Due to the Strong Markov Property, this process is also a Markov chain and it does have time-homogeneous transition probabilities. The state space of $(Y_r)$ is $A$. If $\pi$ is the stationary distribution of the original chain then $\pi/\pi(A)$ is the stationary distribution of $(Y_r)$.

23.3 Subordination

Let $(X_n, n \ge 0)$ be Markov with transition probabilities $p_{ij}$. Let $(S_t, t \ge 0)$ be independent from $(X_n)$ with
\[ S_0 = 0, \quad S_{t+1} = S_t + \xi_t, \quad t \ge 0, \]
where $\xi_t$ are i.i.d. random variables with values in $\mathbb{N}$ and distribution
\[ \lambda(n) = P(\xi_0 = n), \quad n \in \mathbb{N}. \]

Then
\[ Y_t := X_{S_t}, \quad t \ge 0, \]
is a Markov chain with
\[ P(Y_{t+1} = j \mid Y_t = i) = \sum_{n=1}^{\infty} \lambda(n)\, p^{(n)}_{ij}. \]

23.4 Deterministic function of a Markov chain

If $(X_n)$ is a Markov chain with state space $S$ and if $f : S \to S'$ is some function from $S$ to some countable set $S'$, then the process
\[ Y_n = f(X_n), \quad n \ge 0, \]
is NOT, in general, a Markov chain. If, however, $f$ is a one-to-one function then $(Y_n)$ is a Markov chain. Indeed, knowledge of $f(X_n)$ implies knowledge of $X_n$ and so, conditional on $Y_n$ the past is independent of the future. (Notation: We will denote the elements of $S$ by $i, j, \ldots$, and the elements of $S'$ by $\alpha, \beta, \ldots$) More generally, we have:

Lemma 13. Suppose that $P(Y_{n+1} = \beta \mid X_n = i)$ depends on $i$ only through $f(i)$, i.e. that there is a function $Q : S' \times S' \to \mathbb{R}$ such that
\[ P(Y_{n+1} = \beta \mid X_n = i) = Q(f(i), \beta). \]
Then $(Y_n)$ has the Markov property.

Proof. Fix $\alpha_0, \alpha_1, \ldots, \alpha_{n-1} \in S'$ and let $\mathcal{Y}_{n-1}$ be the event
\[ \mathcal{Y}_{n-1} := \{Y_0 = \alpha_0, \ldots, Y_{n-1} = \alpha_{n-1}\}. \]
We need to show that
\[ P(Y_{n+1} = \beta \mid Y_n = \alpha, \mathcal{Y}_{n-1}) = P(Y_{n+1} = \beta \mid Y_n = \alpha), \]

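for all choices of the variables involved. But
\[ \begin{aligned} P(Y_{n+1} = \beta \mid Y_n = \alpha, \mathcal{Y}_{n-1}) &= \sum_{i \in S} P(Y_{n+1} = \beta \mid X_n = i, Y_n = \alpha, \mathcal{Y}_{n-1})\, P(X_n = i \mid Y_n = \alpha, \mathcal{Y}_{n-1}) \\ &= \sum_{i \in S} Q(f(i), \beta)\, P(X_n = i \mid Y_n = \alpha, \mathcal{Y}_{n-1}) \\ &= \sum_{i \in S} Q(f(i), \beta)\, \frac{P(Y_n = \alpha \mid X_n = i, \mathcal{Y}_{n-1})\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\ &= \sum_{i \in S} Q(f(i), \beta)\, \frac{\mathbf{1}(\alpha = f(i))\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\ &= Q(\alpha, \beta) \sum_{i \in S} \frac{P(Y_n = \alpha \mid X_n = i, \mathcal{Y}_{n-1})\, P(X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\ &= Q(\alpha, \beta) \sum_{i \in S} \frac{P(Y_n = \alpha, X_n = i \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} \\ &= Q(\alpha, \beta)\, \frac{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})}{P(Y_n = \alpha \mid \mathcal{Y}_{n-1})} = Q(\alpha, \beta). \end{aligned} \]
Thus $(Y_n)$ is Markov with transition probabilities $Q(\alpha, \beta)$.

Example: Consider the symmetric drunkard's random walk, i.e.
\[ P(X_{n+1} = i \pm 1 \mid X_n = i) = 1/2, \quad i \in \mathbb{Z}. \]
Define
\[ Y_n := |X_n|. \]
Then $(Y_n)$ is a Markov chain. To see this, observe that the function $|\cdot|$ is not one-to-one. But we will show that
\[ P(Y_{n+1} = \beta \mid X_n = i), \quad i \in \mathbb{Z},\ \beta \in \mathbb{Z}_+, \]
depends on $i$ only through $|i|$. Indeed, conditional on $\{X_n = i\}$, and if $i \ne 0$, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, regardless of whether $i > 0$ or $i < 0$, because the probability of going up is the same as the probability of going down. On the other hand, if $i = 0$, then $|X_{n+1}| = 1$ for sure. In other words, if we let
\[ Q(\alpha, \beta) := \begin{cases} 1/2, & \text{if } \beta = \alpha \pm 1,\ \alpha > 0, \\ 1, & \text{if } \beta = 1,\ \alpha = 0, \end{cases} \]
we have that
\[ P(Y_{n+1} = \beta \mid X_n = i) = Q(|i|, \beta), \]
for all $i \in \mathbb{Z}$ and all $\beta \in \mathbb{Z}_+$, and so $(Y_n)$ is, indeed, a Markov chain with transition probabilities $Q(\alpha, \beta)$.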
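The hypothesis of Lemma 13 for this example can be checked by brute force over a finite range of states. This is a minimal sketch: the function names `Q` and `law_of_next_abs` and the range $-10 \le i \le 10$ are choices made for the illustration.

```python
from fractions import Fraction

def Q(alpha, beta):
    """Candidate transition probabilities of Y_n = |X_n|."""
    if alpha == 0:
        return Fraction(1) if beta == 1 else Fraction(0)
    return Fraction(1, 2) if abs(beta - alpha) == 1 else Fraction(0)

def law_of_next_abs(i):
    """Distribution of |X_{n+1}| given X_n = i for the symmetric walk."""
    law = {}
    for step in (-1, 1):
        beta = abs(i + step)
        law[beta] = law.get(beta, Fraction(0)) + Fraction(1, 2)
    return law

# P(Y_{n+1} = beta | X_n = i) should equal Q(|i|, beta): it depends on i only through |i|.
lumpable = all(
    law_of_next_abs(i).get(beta, Fraction(0)) == Q(abs(i), beta)
    for i in range(-10, 11)
    for beta in range(0, 13)
)
print(lumpable)
```

If the walk were asymmetric (up-probability different from 1/2), the same check would fail for $i \ne 0$, which is why symmetry matters in the example.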

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: At each point of time $n$ we have a number $X_n$ of individuals each of which gives birth to a random number of offspring, independently from one another, and following the same distribution. The parent dies immediately upon giving birth. We let $p_k$ be the probability that an individual will bear $k$ offspring, $k = 0, 1, 2, \ldots$ We find the population at time $n + 1$ by adding the number of offspring of each individual. Let $\xi^{(n)}_k$ be the number of offspring of the $k$-th individual in the $n$-th generation. Then the size of the $(n+1)$-st generation is:
\[ X_{n+1} = \xi^{(n)}_1 + \xi^{(n)}_2 + \cdots + \xi^{(n)}_{X_n}. \]
If we let $\xi^{(n)} = \big(\xi^{(n)}_1, \xi^{(n)}_2, \ldots\big)$ then we see that the above equation is of the form $X_{n+1} = f(X_n, \xi^{(n)})$, and, since $\xi^{(0)}, \xi^{(1)}, \ldots$ are i.i.d. (and independent of the initial population size $X_0$, a further assumption), we have that $(X_n, n \ge 0)$ has the Markov property. It is a Markov chain with time-homogeneous transitions, and 0 is always an absorbing state.

Computing the transition probabilities $p_{ij} = P(X_{n+1} = j \mid X_n = i)$ is, in general, a hard problem. But here is an interesting special case.

Example: Suppose that $p_1 = p$, $p_0 = 1 - p$. Thus, each individual gives birth to one child with probability $p$ or has no children with probability $1 - p$. If we know that $X_n = i$ then $X_{n+1}$ is the number of successful births amongst $i$ individuals, which has a binomial distribution:
\[ p_{ij} = \binom{i}{j} p^j (1-p)^{i-j}, \quad 0 \le j \le i. \]
In this case, the population cannot grow: it will eventually reach 0, with probability 1.

Back to the general case, one question of interest is whether the process will become extinct or not. We will not attempt to compute the $p_{ij}$, nor to draw the transition diagram, because both are futile, so we will find a different method to proceed.

If $p_1 = P(\xi = 1) = 1$ then $X_n = X_0$ for all $n$ and this is a trivial case. So assume that $p_1 \ne 1$. Then $P(\xi = 0) + P(\xi \ge 2) > 0$. We distinguish two cases:

Case I: If $p_0 = P(\xi = 0) = 0$ then the population can only grow. Hence all states $i \ge 1$ are transient. The state 0 is irrelevant because it cannot be reached from anywhere.

Case II: If $p_0 = P(\xi = 0) > 0$ then state 0 can be reached from any other state: indeed, if the current population is $i$ there is probability $p_0^i$ that everybody gives birth to zero offspring. But 0 is an absorbing state. Hence all states $i \ge 1$ are inessential, therefore transient.

Therefore, with probability 1, if 0 is never reached then $X_n \to \infty$. In other words, if the population does not become extinct then it must grow to infinity.
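For the special case $p_1 = p$, $p_0 = 1 - p$ discussed above, the transition probabilities are explicit, so a few rows of the transition matrix can be tabulated directly. The sketch below is an illustration only: the value $p = 0.3$ and the helper name `binomial_row` are arbitrary choices.

```python
from math import comb

def binomial_row(i, p):
    """Row i of the transition matrix: p_ij = C(i, j) p^j (1 - p)^(i - j), 0 <= j <= i."""
    return [comb(i, j) * p**j * (1 - p) ** (i - j) for j in range(i + 1)]

p = 0.3  # illustrative value of P(one child)
rows = [binomial_row(i, p) for i in range(6)]

row_sums = [sum(r) for r in rows]
# Mean of X_{n+1} given X_n = i; for a binomial law this should be i * p.
means = [sum(j * r[j] for j in range(len(r))) for r in rows]

print(rows[0])    # state 0 is absorbing: the only entry is p_00
print(row_sums)
print(means)
```

Since the conditional mean is $ip < i$, the population shrinks on average, in line with the remark that it eventually reaches 0.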

Let $D$ be the event of 'death' or extinction:
\[ D := \{X_n = 0 \text{ for some } n \ge 0\}. \]
We wish to compute the extinction probability
\[ \varepsilon(i) := P_i(D), \]
when we start with $X_0 = i$. Obviously, $\varepsilon(0) = 1$. For $i \ge 1$, we can argue that
\[ \varepsilon(i) = \varepsilon^i, \]
where $\varepsilon = P_1(D)$. Indeed, if we start with $X_0 = i$ initial ancestors, each one behaves independently of one another. For each $k = 1, \ldots, i$, let $D_k$ be the event that the branching process corresponding to the $k$-th initial ancestor dies. We then have that $D = D_1 \cap \cdots \cap D_i$, and $P(D_k) = \varepsilon$, so $P(D \mid X_0 = i) = \varepsilon^i$. Therefore the problem reduces to the study of the branching process starting with $X_0 = 1$. The result is this:

Proposition 5. Consider a Galton-Watson process starting with $X_0 = 1$ initial ancestor. Let
\[ \mu = E\xi = \sum_{k=1}^{\infty} k p_k \]
be the mean number of offspring of an individual. Let
\[ \varphi(z) := E z^{\xi} = \sum_{k=0}^{\infty} z^k p_k \]
be the probability generating function of $\xi$. Let $\varepsilon$ be the extinction probability of the process. If $\mu \le 1$ then $\varepsilon = 1$. If $\mu > 1$ then $\varepsilon < 1$ is the unique solution of
\[ \varphi(z) = z, \quad 0 \le z < 1. \]

Proof. We let
\[ \varphi_n(z) := E z^{X_n} = \sum_{k=0}^{\infty} z^k P(X_n = k) \]
be the probability generating function of $X_n$. This function is of interest because it gives information about the extinction probability $\varepsilon$. Indeed,
\[ \varphi_n(0) = P(X_n = 0), \]
and, since $X_n = 0$ implies $X_m = 0$ for all $m \ge n$, we have
\[ \lim_{n \to \infty} \varphi_n(0) = \lim_{n \to \infty} P(X_m = 0 \text{ for all } m \ge n) = P(\text{there exists } n \text{ such that } X_m = 0 \text{ for all } m \ge n) = P(D) = \varepsilon. \]

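Note also that
\[ \varphi_0(z) = z, \quad \varphi_1(z) = \varphi(z). \]
We then have
\[ \begin{aligned} \varphi_{n+1}(z) = E z^{X_{n+1}} &= E\Big[ z^{\xi^{(n)}_1} z^{\xi^{(n)}_2} \cdots z^{\xi^{(n)}_{X_n}} \Big] \\ &= \sum_{i=0}^{\infty} E\Big[ z^{\xi^{(n)}_1} z^{\xi^{(n)}_2} \cdots z^{\xi^{(n)}_{X_n}}\, \mathbf{1}(X_n = i) \Big] \\ &= \sum_{i=0}^{\infty} E\Big[ z^{\xi^{(n)}_1} z^{\xi^{(n)}_2} \cdots z^{\xi^{(n)}_i}\, \mathbf{1}(X_n = i) \Big] \\ &= \sum_{i=0}^{\infty} E\big[z^{\xi^{(n)}_1}\big]\, E\big[z^{\xi^{(n)}_2}\big] \cdots E\big[z^{\xi^{(n)}_i}\big]\, E[\mathbf{1}(X_n = i)] \\ &= \sum_{i=0}^{\infty} \varphi(z)^i\, P(X_n = i) = E\big[\varphi(z)^{X_n}\big] = \varphi_n(\varphi(z)). \end{aligned} \]
Hence $\varphi_2(z) = \varphi_1(\varphi(z)) = \varphi(\varphi(z))$, $\varphi_3(z) = \varphi_2(\varphi(z)) = \varphi(\varphi(\varphi(z))) = \varphi(\varphi_2(z))$, and so on,
\[ \varphi_{n+1}(z) = \varphi(\varphi_n(z)). \]
In particular, $\varphi_{n+1}(0) = \varphi(\varphi_n(0))$. But $\varphi_{n+1}(0) \to \varepsilon$ as $n \to \infty$. Also $\varphi_n(0) \to \varepsilon$, and, since $\varphi$ is a continuous function, $\varphi(\varphi_n(0)) \to \varphi(\varepsilon)$. Therefore
\[ \varepsilon = \varphi(\varepsilon). \]
Notice now that
\[ \mu = \frac{d}{dz} \varphi(z) \Big|_{z=1}. \]
Also, recall that $\varphi$ is a convex function with $\varphi(1) = 1$. Therefore, if the slope $\mu$ at $z = 1$ is $\le 1$, the graph of $\varphi(z)$ lies above the diagonal (the function $z$) and so the equation $z = \varphi(z)$ has no solutions other than $z = 1$. This means that $\varepsilon = 1$ (see left figure below). On the other hand, if $\mu > 1$, then the graph of $\varphi(z)$ does intersect the diagonal at some point $\varepsilon^* < 1$. Let us show that, in fact, $\varepsilon = \varepsilon^*$. Since both $\varepsilon$ and $\varepsilon^*$ satisfy $z = \varphi(z)$, and since the latter has only two solutions ($z = 1$ and $z = \varepsilon^*$), all we need to show is that $\varepsilon < 1$. But $0 \le \varepsilon^*$. Hence $\varphi(0) \le \varphi(\varepsilon^*) = \varepsilon^*$. By induction, $\varphi_n(0) \le \varepsilon^*$. But $\varphi_n(0) \to \varepsilon$. So $\varepsilon \le \varepsilon^* < 1$. Therefore $\varepsilon = \varepsilon^*$. (See right figure below.)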
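The proof also suggests a way of computing $\varepsilon$: iterate $\varphi_{n+1}(0) = \varphi(\varphi_n(0))$ starting from $\varphi_0(0) = 0$. The sketch below does this for two hypothetical offspring distributions, one with $\mu = 1.25$ and one with $\mu = 0.75$. The distributions, the iteration count and the function name `extinction_probability` are choices made for the illustration.

```python
def extinction_probability(offspring_pmf, iterations=200):
    """Approximate the extinction probability by iterating phi_{n+1}(0) = phi(phi_n(0)).

    offspring_pmf[k] is p_k, the probability of k offspring.
    """
    def phi(z):
        return sum(pk * z**k for k, pk in enumerate(offspring_pmf))

    eps = 0.0  # phi_0(0) = P(X_0 = 0) = 0 when X_0 = 1
    for _ in range(iterations):
        eps = phi(eps)
    return eps

supercritical = [0.25, 0.25, 0.5]   # mu = 0.25 + 2 * 0.5 = 1.25
subcritical = [0.5, 0.25, 0.25]     # mu = 0.25 + 2 * 0.25 = 0.75

eps_super = extinction_probability(supercritical)
eps_sub = extinction_probability(subcritical)
print(eps_super, eps_sub)
```

For the first distribution, $\varphi(z) = z$ reads $2z^2 - 3z + 1 = 0$, with roots $z = 1/2$ and $z = 1$, so the iteration should settle at the smaller root $1/2$; for the second it should approach 1. In the critical case $\mu = 1$ the same iteration still converges to 1, but only slowly.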

[Figure: two plots of $\varphi(z)$ against $z$ on $[0, 1]$, together with the diagonal. Left panel, case $\mu \le 1$: then $\varepsilon = 1$. Right panel, case $\mu > 1$: then $\varepsilon < 1$.]

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). Nowadays, Ethernet has replaced Aloha. The problem of communication over a common channel, using the same frequency, is that if more than one host attempts to transmit a packet, there is a collision and the transmission is unsuccessful. Thus, for a packet to go through, only one of the hosts should be transmitting at a given time.

One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 1, 2, 3, 4, 5, 6, ... are allocated to hosts 1, 2, 3, 1, 2, 3, 1, ... But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit. Since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea, we let hosts transmit whenever they like. If a collision happens, then the transmission is not successful and the transmissions must be attempted anew.

So, in a simplified model, assume that on a time slot there are $x$ packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot assume there are $a$ new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability $p$. Let $s$ be the number of selected packets. If $s = 1$ there is a successful transmission and, on the next time slot, there are $x + a - 1$ packets. If $s \ge 2$, there is a collision and, on the next time slot, there are $x + a$ packets. So if we let $X_n$ be the number of backlogged packets at the beginning of the $n$-th time slot, $A_n$ the number of new packets on the same time slot, and $S_n$ the number of selected packets, then
\[ X_{n+1} = \begin{cases} \max(X_n + A_n - 1, 0), & \text{if } S_n = 1, \\ X_n + A_n, & \text{otherwise.} \end{cases} \]
We assume that the $A_n$ are i.i.d. random variables with some common distribution:
\[ P(A_n = k) = \lambda_k, \quad k \ge 0, \]

where $\lambda_k \ge 0$ and $\sum_{k \ge 0} \lambda_k = 1$. We also let $\mu := \sum_{k \ge 1} k\lambda_k$ be the expectation of $A_n$. The random variable $S_n$ has, conditional on $X_n$ and $A_n$ (and all past states and arrivals), a distribution which is Binomial with parameters $X_n + A_n$ and p. So

$$P(S_n = k \mid X_n = x, A_n = a, X_{n-1}, A_{n-1}, \ldots) = \binom{x+a}{k} p^k q^{x+a-k}, \quad 0 \le k \le x + a,$$

where q = 1 − p. The reason we take the maximum with 0 in the above equation is because, if $X_n + A_n = 0$, we should not subtract 1. The process $(X_n, n \ge 0)$ is a Markov chain. Let us find its transition probabilities. To compute $p_{x,x-1}$ when x > 0, we think as follows: to have a reduction of x by 1, we must have no new packets arriving, and exactly one selection:

$$p_{x,x-1} = P(A_n = 0, S_n = 1 \mid X_n = x) = \lambda_0\, x p q^{x-1}.$$

To compute $p_{x,x+k}$ for k ≥ 1, we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i.e. $S_n$ should be 0 or ≥ 2) or we need k + 1 new packets and one selection ($S_n = 1$). So:

$$p_{x,x+k} = \lambda_k \big(1 - (x+k) p q^{x+k-1}\big) + \lambda_{k+1} (x+k+1) p q^{x+k}, \quad k \ge 1.$$

The probabilities $p_{x,x}$ are computed by subtracting the sum of the above from 1. The main result is:

Proposition 6. The Aloha protocol is unstable, i.e. the Markov chain $(X_n)$ is transient.

Idea of proof. Proving this is beyond the scope of the lectures, but we mention that it is based on the so-called drift criterion which says that if the expected change $X_{n+1} - X_n$ conditional on $X_n = x$ is bounded below by a positive constant for all large x, then the chain is transient. We here only restrict ourselves to the computation of this expected change, defined by

$$\Delta(x) := E[X_{n+1} - X_n \mid X_n = x].$$

For x > 0, we have

$$
\begin{aligned}
\Delta(x) &= (-1)\, p_{x,x-1} + \sum_{k \ge 1} k\, p_{x,x+k} \\
&= -\lambda_0 x p q^{x-1} + \sum_{k \ge 1} k\lambda_k - \sum_{k \ge 1} k\lambda_k (x+k) p q^{x+k-1} + \sum_{k \ge 1} k\lambda_{k+1} (x+k+1) p q^{x+k} \\
&= -\lambda_0 x p q^{x-1} + \mu - \lambda_1 (x+1) p q^{x} - \sum_{k \ge 2} \lambda_k (x+k) p q^{x+k-1} \\
&= \mu - q^x G(x),
\end{aligned}
$$

where G(x) is a function of the form α + βx. Since p > 0, we have q < 1, and so $q^x$ tends to 0 much faster than G(x) tends to ∞, so $q^x G(x)$ tends to 0 as x → ∞. So we can make $q^x G(x)$ as small as we like by choosing x sufficiently large, for example we can make it ≤ µ/2. Therefore ∆(x) ≥ µ/2 for all large x. Unless µ = 0 (which is a trivial case: no arrivals!), we have that the drift is positive for all large x and this proves transience.
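The drift computation can be checked numerically. The sketch below is an illustration under stated assumptions, not the notes' own method: the arrival distribution is given as a finite list `lam` with `lam[k]` = λ_k, and the function names are hypothetical. `aloha_drift` conditions on the number of arrivals and uses the fact that the backlog drops by one exactly when one packet is selected, while `closed_form_drift` evaluates µ − qˣG(x) with G(x) = Σ_k λ_k (x + k) p q^(k−1), which is the explicit form of G read off from the last two lines of the display.

```python
def aloha_drift(x, lam, p):
    """Delta(x) = E[X_{n+1} - X_n | X_n = x], conditioning on A_n = a."""
    q = 1.0 - p
    drift = 0.0
    for a, lam_a in enumerate(lam):
        total = x + a  # packets competing in this slot
        # P(S_n = 1) when S_n is Binomial(total, p); zero if there are no packets.
        p_one = total * p * q ** (total - 1) if total > 0 else 0.0
        drift += lam_a * (a - p_one)
    return drift


def closed_form_drift(x, lam, p):
    """mu - q**x * G(x), with G(x) = sum_k lam[k] * (x + k) * p * q**(k - 1)."""
    q = 1.0 - p
    mu = sum(k * lam_k for k, lam_k in enumerate(lam))
    g = sum(lam_k * (x + k) * p * q ** (k - 1) for k, lam_k in enumerate(lam))
    return mu - q ** x * g


# Hypothetical parameters, for illustration only.
example_lam = [0.7, 0.2, 0.1]  # mu = 0.4
example_p = 0.1
```

With these example values the drift is already close to µ for a backlog of a few hundred packets, which is the behaviour the drift criterion needs.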

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The web graph is a graph with a set of vertices V consisting of web pages and edges representing hyperlinks: thus, if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. For each page i let L(i) be the set of pages it links to. If L(i) = ∅, then i is a 'dangling page'. Define transition probabilities as follows:

$$
p_{ij} =
\begin{cases}
1/|L(i)|, & \text{if } j \in L(i), \\
1/|V|, & \text{if } L(i) = \emptyset, \\
0, & \text{otherwise.}
\end{cases}
$$

So if $X_n = i$ is the current location of the chain, the next location $X_{n+1}$ is picked uniformly at random amongst all pages that i links to, unless i is a dangling page; in the latter case, the chain moves to a random page. It is not clear that the chain is irreducible, neither that it is aperiodic. So we make the following modification. We pick α between 0 and 1 (typically, α ≈ 0.2) and let

$$\tilde p_{ij} := (1 - \alpha)\, p_{ij} + \frac{\alpha}{|V|}.$$

These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored (this happens with probability α), in which case he clicks on a random page. So this is how Google assigns ranks to pages: It uses an algorithm, PageRank, to compute the stationary distribution π corresponding to the Markov chain with transition matrix $P = [\tilde p_{ij}]$:

$$\pi = \pi P.$$

Then it says: page i has higher rank than j if π(i) > π(j). The idea is simple. Its implementation though is one of the largest matrix computations in the world, because of the size of the problem. The algorithm used (PageRank) starts with an initial guess $\pi^{(0)}$ for π and updates it by

$$\pi^{(k)} = \pi^{(k-1)} P.$$

It uses advanced techniques from linear algebra to speed up computations.

Comments:
1. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
2. In an Internet-dominated society, the rank of a page may be very important for a company's profits. Therefore, there are companies that sell high-rank pages. All you need to do to get high rank is pay (a lot of money).
3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their

faculty from each other (and from themselves).
4. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality but, from the point of view of an administrator, it makes the evaluation procedure easy. What is important is that the evaluation can be done without ever looking at the content of one's research.
5. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology.

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. This means that the external appearance of a characteristic is a function of the pair of alleles. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. When two individuals mate, their offspring inherits an allele from each parent. Thus, two AA individuals will produce an AA child. But an AA mating with an Aa may produce a child of type AA or one of type Aa.

Imagine that we have a population of N individuals who mate at random producing a number of offspring. We will assume that a child chooses an allele from each parent at random. We will also assume that the parents are replaced by the children. The process repeats, generation after generation. We would like to study the genetic diversity, meaning how many characteristics we have. To simplify, we will assume that, in each generation, the genetic diversity depends only on the total number of alleles of each type. So let x be the number of alleles of type A (so 2N − x is the number of those of type a). If so, then, at the next generation the number of alleles of type A can be found as follows: Pick an allele at random; this allele is of type A with probability x/2N; maintain this allele for the next generation. Do the same thing 2N times.

[Figure: two generations of alleles, with arrows showing which allele of the first generation each allele of the second generation was copied from.]

In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice. Three of the square alleles are never selected. One of the square alleles is selected 4 times. And so on. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. Since we don't care to keep track of the individuals' identities, all we need is the information outlined in this figure.

We are now ready to define the Markov chain $X_n$ representing the number of alleles of type A in generation n. We have a transition from x to y if y alleles of type A were produced.

Each allele of type A is selected with probability x/2N. Since selections are independent, we see that

$$p_{xy} = \binom{2N}{y} \left(\frac{x}{2N}\right)^{y} \left(1 - \frac{x}{2N}\right)^{2N-y}, \quad 0 \le y \le 2N.$$

Notice that there is a chance ($= (1 - \frac{x}{2N})^{2N}$) that no allele of type A is selected at all, and the population becomes one consisting of alleles of type a only. Similarly, there is a chance ($= (\frac{x}{2N})^{2N}$) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, $p_{00} = p_{2N,2N} = 1$, so both 0 and 2N are absorbing states. Since $p_{xy} > 0$ for all 1 ≤ x, y ≤ 2N − 1, the states {1, . . . , 2N − 1} form a communicating class of inessential states. Therefore

$$P(X_n = 0 \text{ or } X_n = 2N \text{ for all large } n) = 1,$$

i.e. the chain gets absorbed by either 0 or 2N.

Let M = 2N. The fact that M is even is irrelevant for the mathematical model. Another way to represent the chain is as follows:

$$X_{n+1} = \xi_1^n + \cdots + \xi_M^n,$$

where, conditionally on $(X_0, \ldots, X_n)$, the random variables $\xi_1^n, \ldots, \xi_M^n$ are i.i.d. with

$$P(\xi_k^n = 1 \mid X_n, \ldots, X_0) = \frac{X_n}{M}, \qquad P(\xi_k^n = 0 \mid X_n, \ldots, X_0) = 1 - \frac{X_n}{M}.$$

Therefore,

$$E(X_{n+1} \mid X_n, \ldots, X_0) = E(X_{n+1} \mid X_n) = M\, \frac{X_n}{M} = X_n.$$

This implies that, for all n ≥ 0,

$$E X_{n+1} = E X_n,$$

i.e., on the average, the number of alleles of type A won't change. Clearly, we would like to know the probability

$$\varphi(x) = P_x(T_0 > T_M) = P(T_0 > T_M \mid X_0 = x)$$

that the chain will eventually be absorbed by state M. (Here, as usual, $T_x := \inf\{n \ge 1 : X_n = x\}$.) One can argue that, since, on the average, the expectation doesn't change and, since, at "the end of time" the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x), we should have

$$M \times \varphi(x) + 0 \times (1 - \varphi(x)) = x,$$

and thus,

$$\varphi(x) = x/M.$$

The result is correct, but the argument needs further justification because "the end of time" is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. These are:

$$\varphi(0) = 0, \qquad \varphi(M) = 1, \qquad \varphi(x) = \sum_{y=0}^{M} p_{xy}\, \varphi(y).$$

The first two are obvious. The right-hand side of the last equals

$$
\begin{aligned}
\sum_{y=0}^{M} \binom{M}{y} \left(\frac{x}{M}\right)^{y} \left(1 - \frac{x}{M}\right)^{M-y} \frac{y}{M}
&= \sum_{y=1}^{M} \frac{M!}{y!\,(M-y)!} \left(\frac{x}{M}\right)^{y} \left(1 - \frac{x}{M}\right)^{M-y} \frac{y}{M} \\
&= \sum_{y=1}^{M} \frac{(M-1)!}{(y-1)!\,(M-y)!} \left(\frac{x}{M}\right)^{y-1+1} \left(1 - \frac{x}{M}\right)^{M-y-1+1} \\
&= \frac{x}{M} \sum_{y=1}^{M} \binom{M-1}{y-1} \left(\frac{x}{M}\right)^{y-1} \left(1 - \frac{x}{M}\right)^{(M-1)-(y-1)} \\
&= \frac{x}{M} \left(\frac{x}{M} + 1 - \frac{x}{M}\right)^{M-1} = \frac{x}{M},
\end{aligned}
$$

as needed.

24.5 A storage or queueing application

A storage facility stores, say, $X_n$ electronic components at time n. A number $A_n$ of new components are ordered and added to the stock, while a number of them are sold to the market. If the demand is for $D_n$ components, the demand is met instantaneously, provided that enough components are available, and so, at time n + 1, there are $X_n + A_n - D_n$ components in the stock. Otherwise, the demand is only partially satisfied, and the stock reaches level zero at time n + 1. We can express these requirements by the recursion

$$X_{n+1} = (X_n + A_n - D_n) \vee 0,$$

where a ∨ b := max(a, b). Here are our assumptions: The pairs of random variables $((A_n, D_n), n \ge 0)$ are i.i.d. and

$$E(A_n - D_n) = -\mu < 0.$$

Thus, the demand is, on the average, larger than the supply per unit of time. We have that $(X_n, n \ge 0)$ is a Markov chain. We will show that this Markov chain is positive recurrent. We will apply the so-called Loynes' method. Let $\xi_n := A_n - D_n$, so that

$$X_{n+1} = (X_n + \xi_n) \vee 0,$$

for all n ≥ 0. Since the Markov chain is represented by this very simple recursion, we will try to solve the recursion. Working for n = 1, 2, . . ., we find

$$
\begin{aligned}
X_1 &= (X_0 + \xi_0) \vee 0 \\
X_2 &= (X_1 + \xi_1) \vee 0 = (X_0 + \xi_0 + \xi_1) \vee \xi_1 \vee 0 \\
X_3 &= (X_2 + \xi_2) \vee 0 = (X_0 + \xi_0 + \xi_1 + \xi_2) \vee (\xi_1 + \xi_2) \vee \xi_2 \vee 0 \\
&\;\;\vdots \\
X_n &= (X_0 + \xi_0 + \cdots + \xi_{n-1}) \vee (\xi_1 + \cdots + \xi_{n-1}) \vee \cdots \vee \xi_{n-1} \vee 0.
\end{aligned}
$$

The correctness of the latter can formally be proved by showing that it does satisfy the recursion. Let us consider

$$g_{xy}^{(n)} := P_x(X_n > y) = P\big( (x + \xi_0 + \cdots + \xi_{n-1}) \vee (\xi_1 + \cdots + \xi_{n-1}) \vee \cdots \vee \xi_{n-1} \vee 0 > y \big).$$

Thus, the event whose probability we want to compute is a function of $(\xi_0, \ldots, \xi_{n-1})$. Notice, however, that

$$(\xi_0, \ldots, \xi_{n-1}) \stackrel{d}{=} (\xi_{n-1}, \ldots, \xi_0).$$

Indeed, the random variables $(\xi_n)$ are i.i.d. so we can shuffle them in any way we like without altering their joint distribution. In particular, instead of considering them in their original order, we reverse it. Therefore,

$$g_{xy}^{(n)} = P\big( (x + \xi_{n-1} + \cdots + \xi_0) \vee (\xi_{n-2} + \cdots + \xi_0) \vee \cdots \vee \xi_0 \vee 0 > y \big).$$

Let

$$Z_n := \xi_0 + \cdots + \xi_{n-1}, \qquad M_n := Z_n \vee Z_{n-1} \vee \cdots \vee Z_1 \vee 0.$$

Then

$$g_{0y}^{(n)} = P(M_n > y).$$

Since $M_{n+1} \ge M_n$ we have $g_{0y}^{(n)} \le g_{0y}^{(n+1)}$, for all n. The limit of a bounded and increasing sequence certainly exists:

$$g(y) := \lim_{n\to\infty} g_{0y}^{(n)} = \lim_{n\to\infty} P_0(X_n > y).$$

We claim that

$$\pi(y) := g(y-1) - g(y)$$

satisfies the balance equations. But

$$M_{n+1} = 0 \vee \max_{1 \le j \le n+1} Z_j = 0 \vee \Big( \max_{1 \le j \le n+1} (Z_j - Z_1) + \xi_0 \Big) \stackrel{d}{=} 0 \vee (M_n + \xi_0),$$

where the latter equality follows from the fact that, in computing the distribution of $M_n$ which depends on $\xi_0, \ldots, \xi_{n-1}$, we may replace the latter random variables by $\xi_1, \ldots, \xi_n$. Hence, for y ≥ 0,

$$
\begin{aligned}
P(M_{n+1} = y) = P(0 \vee (M_n + \xi_0) = y)
&= \sum_{x \ge 0} P(M_n = x)\, P(0 \vee (x + \xi_0) = y) \\
&= \sum_{x \ge 0} P(M_n = x)\, p_{xy}.
\end{aligned}
$$

Since the n-dependent terms are bounded, we may take the limit of both sides as n → ∞, and, using $\pi(y) = \lim_{n\to\infty} P(M_n = y)$, we find

$$\pi(y) = \sum_{x \ge 0} \pi(x)\, p_{xy}.$$

So π satisfies the balance equations. We must make sure that it also satisfies $\sum_y \pi(y) = 1$. This will be the case if g(y) is not identically equal to 1. Here, we will use the assumption that $E\xi_n = -\mu < 0$. From the Law of Large Numbers we have

$$P\big( \lim_{n\to\infty} Z_n/n = -\mu \big) = 1.$$

This implies that

$$P\big( \lim_{n\to\infty} Z_n = -\infty \big) = 1.$$

But then

$$P\big( \lim_{n\to\infty} M_n = 0 \vee \max\{Z_1, Z_2, \ldots\} < \infty \big) = 1.$$

Hence

$$g(y) = P(0 \vee \max\{Z_1, Z_2, \ldots\} > y)$$

is not identically equal to 1, as required.

It can also be shown (left as an exercise) that if $E\xi_n > 0$ then the Markov chain is transient. The case $E\xi_n = 0$ is delicate. Depending on the actual distribution of $\xi_n$, the Markov chain may be transient or null recurrent.
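The solved form of the storage recursion, on which Loynes' method rests, can be verified mechanically for short increment sequences. The following sketch is an illustration only; the function names and the small ranges of increments and initial stocks are assumptions, not part of the notes. It compares the step-by-step recursion X ← (X + ξ) ∨ 0 with the closed expression (X₀ + ξ₀ + · · · + ξₙ₋₁) ∨ (ξ₁ + · · · + ξₙ₋₁) ∨ · · · ∨ ξₙ₋₁ ∨ 0.

```python
from itertools import product


def stock_by_recursion(x0, xi):
    """Iterate X_{k+1} = max(X_k + xi_k, 0) starting from X_0 = x0 >= 0."""
    x = x0
    for step in xi:
        x = max(x + step, 0)
    return x


def stock_by_formula(x0, xi):
    """Evaluate the maximum of 0, x0 + (sum of all xi), and every tail sum xi_j + ... + xi_{n-1}, j >= 1."""
    candidates = [0, x0 + sum(xi)]
    tail = 0
    for j in range(len(xi) - 1, 0, -1):
        tail += xi[j]  # tail = xi_j + ... + xi_{n-1}
        candidates.append(tail)
    return max(candidates)


def formulas_agree(max_len=5, increments=(-2, -1, 0, 1), starts=(0, 1, 3)):
    """Compare both expressions over every increment sequence of length <= max_len."""
    return all(
        stock_by_recursion(x0, xi) == stock_by_formula(x0, xi)
        for n in range(1, max_len + 1)
        for xi in product(increments, repeat=n)
        for x0 in starts
    )
```

An exhaustive comparison over short sequences is of course no substitute for the induction mentioned in the text, but it catches indexing slips in the formula quickly.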

here. j = 1. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. namely. then p(1) = p. we won’t go beyond Zd . this walk enjoys. otherwise. j = 1.and space-homogeneous transition probabilities given by px. . if x = ej . groups) are allowed. This is the drunkard’s walk we’ve seen before.g. we need to give more structure to the state space S. Example 1: Let p(x) = C2−|x1 |−···−|xd | . . otherwise where p + q = 1. d p(x) = qj . More general spaces (e. but. . so far. For example. the skip-free property. This means that the transition probability pxy should depend on x and y only through their relative positions in space.y = p(y − x). given a function p(x). . have been processes with timehomogeneous transition probabilities. . . A random walk of this type is called simple random walk or nearest neighbour random walk. a structure that enables us to say that px. besides time. 0. To be able to say this. ±ed } otherwise 76 . we can let S = Zd . Example 3: Deﬁne p(x) = 1 2d . the set of d-dimensional vectors with integer coordinates. . a random walk is a Markov chain with time. if x = −ej . . . Deﬁne pj .y+z for any translation z. . that of spatial homogeneity. . x ∈ Zd . . In dimension d = 1. p(x) = 0. where p1 + · · · + pd + q1 + · · · + qd = 1.and space-homogeneity. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. So. Our Markov chains. where C is such that x p(x) = 1. A random walk possesses an additional property. d 0. p(−1) = q.y = px+z.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. If d = 1. if x ∈ {±e1 .

This is a simple random walk, which, in addition, has equal probabilities to move from one state x to one of its "neighbouring states" $x \pm e_1, \ldots, x \pm e_d$. We call it simple symmetric random walk. If d = 1, then

$$p(1) = p(-1) = 1/2, \quad p(x) = 0 \text{ otherwise,}$$

and this is the symmetric drunkard's walk.

The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by $S_n$ the state of the walk at time n. It is clear that $S_n$ can be represented as

$$S_n = S_0 + \xi_1 + \cdots + \xi_n,$$

where $\xi_1, \xi_2, \ldots$ are i.i.d. random variables in $\mathbb{Z}^d$ with common distribution

$$P(\xi_n = x) = p(x), \quad x \in \mathbb{Z}^d.$$

We refer to $\xi_n$ as the increments of the random walk. Furthermore, the initial state $S_0$ is independent of the increments. Obviously,

$$p_{xy}^{(n)} = P_x(S_n = y) = P(x + \xi_1 + \cdots + \xi_n = y) = P(\xi_1 + \cdots + \xi_n = y - x).$$

The random walk $\xi_1 + \cdots + \xi_n$ is a random walk starting from 0. So, we can reduce several questions to questions about this random walk. Furthermore, notice that

$$P(\xi_1 + \xi_2 = x) = \sum_y P(\xi_1 = y,\ y + \xi_2 = x) = \sum_y p(y)\, p(x - y).$$

(When we write a sum without indicating explicitly the range of the summation variable, we will mean a sum over all possible values.) Now the operation in the last term is known as convolution. The convolution of two probabilities p(x), p′(x), $x \in \mathbb{Z}^d$, is defined as a new probability p ∗ p′ given by

$$(p * p')(x) = \sum_y p(y)\, p'(x - y).$$

The reader can easily check that

$$p * p' = p' * p, \qquad p * (p' * p'') = (p * p') * p'', \qquad p * \delta_0 = p.$$

(Recall that $\delta_z$ is a probability distribution that assigns probability 1 to the point z.) In this notation, we have

$$P(\xi_1 + \xi_2 = x) = (p * p)(x).$$

We will let $p^{*n}$ be the convolution of p with itself n times. We easily see that

$$P(\xi_1 + \cdots + \xi_n = x) = p^{*n}(x).$$

One could stop here and say that we have a formula for the n-step transition probability. But this would be nonsense, because all we have done is that we dressed the original problem in more fancy clothes. Indeed, in general, computing the n-th order convolution for all n is a notoriously hard problem. We shall resist the temptation to compute and look at other

+1}n . 1}n . we have P (ξ1 = ε1 . In the sequel.1) − (3. Will the random walk ever return to where it started from? B. Events A that depend only on these the ﬁrst n random signs are subsets of this sample space. 2 where #A is the number of elements of A. + (1. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn .d. +1}n . we shall learn some counting methods.0) We can thus think of the elements of the sample space as paths. The principle is trivial. −1. 0) → (1. if we can count the number of elements of A we can compute its probability. a random vector with values in {−1. . As a warm-up. to each initial position and to each sequence of n signs.methods. . Therefore. we are dealing with a uniform distribution on the sample space {−1. If not. random variables (random signs) with P (ξn = ±1) = 1/2. +1. . . who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that diﬀerent questions we need to ask or (b) that approximate methods are more informative than exact ones. A ⊂ {−1.1) + (0. After all.2) (4. +1. we should assign. . Here are a couple of meaningful questions: A. First. So. consider the following: 78 .2) where the ξn are i. . a path of length n. 2) → (5. ξn ). but its practice may be hard. 1) → (2. 2) → (3. If yes. Since there are 2n elements in {0. Look at the distribution of (ξ1 . how long will that take? C. all possible values of the vector are equally likely.i+1 = pi. ξn = εn ) = 2−n . . So if the initial position is S0 = 0.i. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. . and so #A P (A) = n . Clearly. for ﬁxed n. +1}n .i−1 = 1/2 for all i ∈ Z. 1) → (4. −1} then we have a path in the plane that joins the points (0.1) + − (5. and the sequence of signs is {+1. 1): (2.

Example: Assume that $S_0 = 0$, and compute the probability of the event A = {S2 ≥ 0, S4 = −1, S6 < 0, S7 < 0}. The paths comprising this event are shown below. We can easily count that there are 6 paths in A. So $P(A) = 6/2^7 = 6/128 = 3/64 \approx 0.047$.

[Figure: the paths comprising A, drawn over times 0 to 7.]

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation

{(m, x) ⇝ (n, y)} =: {all paths that start at (m, x) and end up at (n, y)}.

[Figure: paths from (0,0) to (n,i). The lightly shaded region contains all paths of length n that start at (0, 0). (There are $2^n$ of them.) The darker region shows all paths that start at (0, 0) and end up at the specific point (n, i).]

Let us first show that

Lemma 14. The number of paths that start from (0, 0) and end up at (n, i) is given by:

$$\#\{(0,0) \rightsquigarrow (n,i)\} = \binom{n}{\frac{n+i}{2}}.$$

More generally, the number of paths that start from (m, x) and end up at (n, y) is given by:

$$\#\{(m,x) \rightsquigarrow (n,y)\} = \binom{n-m}{(n-m+y-x)/2}. \tag{23}$$

Remark. If n + i is not an even number then there is no path that starts from (0, 0) and ends at (n, i). Hence $\binom{n}{(n+i)/2} = 0$, in this case.

Proof. Consider a path that starts from (0, 0) and ends at (n, i).

Let u be the number of +'s in this path (the number of upward segments) and d the number of −'s (the number of downward segments). Then

u + d = n, u − d = i.

Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly $\binom{n}{u}$ ways, and this proves the formula.

Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently, if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define $\binom{n}{u}$ to be zero if u is not an integer.

More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23).
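Formula (23) is easy to confirm by brute force for small n. The sketch below is an illustration with hypothetical function names: it enumerates every sequence of ±1 steps and compares the count with the binomial coefficient, using the convention that the coefficient is zero when (n − m + y − x)/2 is not an integer between 0 and n − m.

```python
from itertools import product
from math import comb


def count_paths_brute(m, x, n, y):
    """Count sign sequences of length n - m that take the walk from height x to height y."""
    return sum(1 for signs in product((-1, 1), repeat=n - m) if x + sum(signs) == y)


def count_paths_formula(m, x, n, y):
    """Formula (23), with the binomial coefficient read as 0 when it is not defined."""
    steps, rise = n - m, y - x
    if (steps + rise) % 2 != 0 or abs(rise) > steps:
        return 0
    return comb(steps, (steps + rise) // 2)


def formula_23_holds(max_steps=8):
    """Check (23) for all start and end heights reachable within max_steps steps."""
    return all(
        count_paths_brute(0, x, n, y) == count_paths_formula(0, x, n, y)
        for n in range(max_steps + 1)
        for x in range(-2, 3)
        for y in range(-max_steps - 2, max_steps + 3)
    )
```

For instance, `count_paths_formula(0, 0, 3, 2)` returns 0, matching the remark that no path goes from (0, 0) to (3, 2).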

26.2

**Paths that start and end at specific points and do not touch zero at all**

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) ⇝ (n, y) that never become zero (i.e. they do not touch the horizontal axis) is given by:

$$\#\{(m,x) \rightsquigarrow (n,y);\ \text{remain} > 0\} = \binom{n-m}{\frac{1}{2}(n-m+y-x)} - \binom{n-m}{\frac{1}{2}(n-m-y-x)}. \tag{24}$$

Proof.

[Figure: paths from (m, x) to (n, y), spanning n − m steps.]

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick,

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the ﬁrst time it hits zero and then reﬂecting it around the horizontal axis. In the ﬁgure below, follow the solid path starting at (m, x) up to the ﬁrst time it hits zero and then follow the dotted path: this is the reﬂected path.

[Figure: a solid path from (m, x) to (n, y) and its dotted reflection ending at (n, −y).]

A path and its reﬂection around a horizontal line. Notice that the reﬂection starts only at the point where the path touches the horizontal line.

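Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y); conversely, every path from (m, x) to (n, −y) necessarily crosses the horizontal axis, so it is the reflection of exactly one path of the kind we are counting. We thus apply (23) with −y in place of y:

$$\#\{(m,x) \rightsquigarrow (n,y);\ \text{touch zero}\} = \#\{(m,x) \rightsquigarrow (n,-y)\} = \binom{n-m}{(n-m-y-x)/2}.$$

So, for x > 0, y > 0,

$$
\begin{aligned}
\#\{(m,x) \rightsquigarrow (n,y);\ \text{remain} > 0\}
&= \#\{(m,x) \rightsquigarrow (n,y)\} - \#\{(m,x) \rightsquigarrow (n,y);\ \text{touch zero}\} \\
&= \binom{n-m}{\frac{1}{2}(n-m+y-x)} - \binom{n-m}{\frac{1}{2}(n-m-y-x)}.
\end{aligned}
$$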
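The reflection count (24) can also be confirmed by enumeration. This is a small sketch under the same zero convention for binomial coefficients; the helper names are hypothetical and the check only covers small path lengths.

```python
from itertools import accumulate, product
from math import comb


def half_binom(n, twice_k):
    """binom(n, twice_k / 2), read as 0 unless twice_k / 2 is an integer in [0, n]."""
    if twice_k % 2 != 0 or not 0 <= twice_k // 2 <= n:
        return 0
    return comb(n, twice_k // 2)


def positive_paths_brute(m, x, n, y):
    """Count paths from (m, x) to (n, y) whose heights are all strictly positive."""
    count = 0
    for signs in product((-1, 1), repeat=n - m):
        heights = list(accumulate(signs, initial=x))  # heights at times m, ..., n
        if heights[-1] == y and min(heights) > 0:
            count += 1
    return count


def positive_paths_formula(m, x, n, y):
    """Formula (24): all paths minus those that touch zero (counted by reflection)."""
    steps = n - m
    return half_binom(steps, steps + y - x) - half_binom(steps, steps - y - x)


def formula_24_holds(max_steps=8, max_height=4):
    return all(
        positive_paths_brute(0, x, n, y) == positive_paths_formula(0, x, n, y)
        for n in range(max_steps + 1)
        for x in range(1, max_height + 1)
        for y in range(1, max_height + max_steps + 1)
    )
```

As a small usage example, from (0, 1) to (2, 1) there are two paths in total and one of them touches zero, so both functions return 1.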

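**Corollary 14.** For any y > 0,

$$\#\{(1,1) \rightsquigarrow (n,y);\ \text{remain} > 0\} = \frac{y}{n}\, \#\{(0,0) \rightsquigarrow (n,y)\}. \tag{25}$$

Proof. Let m = 1, x = 1 in (24):

$$
\begin{aligned}
\#\{(1,1) \rightsquigarrow (n,y);\ \text{remain} > 0\}
&= \binom{n-1}{\frac{1}{2}(n+y) - 1} - \binom{n-1}{\frac{1}{2}(n-y) - 1} \\
&= \frac{(n-1)!}{\left(\frac{n+y}{2} - 1\right)!\, \left(\frac{n-y}{2}\right)!} - \frac{(n-1)!}{\left(\frac{n-y}{2} - 1\right)!\, \left(\frac{n+y}{2}\right)!} \\
&= \left( \frac{1}{n}\,\frac{n+y}{2} - \frac{1}{n}\,\frac{n-y}{2} \right) \frac{n!}{\left(\frac{n+y}{2}\right)!\, \left(\frac{n-y}{2}\right)!} \\
&= \frac{y}{n} \binom{n}{\frac{n+y}{2}},
\end{aligned}
$$

and the latter term equals the total number of paths from (0, 0) to (n, y).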


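Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by:

$$\#\{(m,x) \rightsquigarrow (n,y);\ \text{remain} > z\} = \binom{n-m}{\frac{1}{2}(n-m+y-x)} - \binom{n-m}{\frac{1}{2}(n-m-y-x) + z}. \tag{26}$$

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have

$$\#\{(m,x) \rightsquigarrow (n,y);\ \text{remain} > z\} = \#\{(m, x-z) \rightsquigarrow (n, y-z);\ \text{remain} > 0\}$$

and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by:

$$\#\{(0,0) \rightsquigarrow (n,y);\ \text{remain} \ge 0\} = \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1}. \tag{27}$$

Proof.

$$
\begin{aligned}
\#\{(0,0) \rightsquigarrow (n,y);\ \text{remain} \ge 0\}
&= \#\{(0,0) \rightsquigarrow (n,y);\ \text{remain} > -1\} \\
&= \#\{(0,1) \rightsquigarrow (n,y+1);\ \text{remain} > 0\} \\
&= \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n-y}{2} - 1} \\
&= \binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2} + 1},
\end{aligned}
$$

where we used (26) with z = −1.

Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by:

$$\#\{(0,0) \rightsquigarrow (n,0);\ \text{remain} \ge 0\} = \frac{1}{n+1} \binom{n+1}{n/2}.$$

Proof. Apply (27) with y = 0:

$$\#\{(0,0) \rightsquigarrow (n,0);\ \text{remain} \ge 0\} = \binom{n}{\frac{n}{2}} - \binom{n}{\frac{n}{2} + 1} = \binom{n}{\frac{n}{2}} - \binom{n}{\frac{n}{2} - 1} = \frac{1}{n+1} \binom{n+1}{n/2},$$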

where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by:

$$\#\{(0,0) \rightsquigarrow (n,\bullet);\ \text{remain} \ge 0\} = \binom{n}{\lceil n/2 \rceil}, \tag{28}$$

where ⌈x⌉ denotes the smallest integer that is ≥ x.

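Proof. To see this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have

$$
\begin{aligned}
\#\{(0,0) \rightsquigarrow (2m,\bullet);\ \text{remain} \ge 0\}
&= \sum_{z \ge 0} \left[ \binom{2m}{m+z} - \binom{2m}{m+z+1} \right] \\
&= \binom{2m}{m} - \binom{2m}{m+1} + \binom{2m}{m+1} - \binom{2m}{m+2} + \cdots + \binom{2m}{2m-1} - \binom{2m}{2m} + \binom{2m}{2m} \pm \cdots \\
&= \binom{2m}{m} = \binom{n}{n/2},
\end{aligned}
$$

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have

$$
\begin{aligned}
\#\{(0,0) \rightsquigarrow (2m+1,\bullet);\ \text{remain} \ge 0\}
&= \sum_{z \ge 0} \left[ \binom{2m+1}{m+z+1} - \binom{2m+1}{m+z+2} \right] \\
&= \binom{2m+1}{m+1} - \binom{2m+1}{m+2} + \binom{2m+1}{m+2} - \binom{2m+1}{m+3} + \cdots + \binom{2m+1}{2m} - \binom{2m+1}{2m+1} + \binom{2m+1}{2m+1} \pm \cdots \\
&= \binom{2m+1}{m+1} = \binom{n}{\lceil n/2 \rceil}.
\end{aligned}
$$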

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

[Figure: two shaded regions of paths, each drawn up to time 2m.]
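Corollaries 17 and 18 lend themselves to a direct check by enumeration for small n. The sketch below is only an illustration with hypothetical function names: it counts the paths of length n from 0 that stay nonnegative, and how many of them end at 0, and compares these with (28) and with the Catalan-number formula.

```python
from itertools import accumulate, product
from math import comb


def nonnegative_paths(n):
    """Return (number of length-n paths from 0 staying >= 0, number of those ending at 0)."""
    total = ending_at_zero = 0
    for signs in product((-1, 1), repeat=n):
        heights = list(accumulate(signs, initial=0))  # S_0, ..., S_n
        if min(heights) >= 0:
            total += 1
            if heights[-1] == 0:
                ending_at_zero += 1
    return total, ending_at_zero


def formula_28(n):
    """binom(n, ceil(n / 2))."""
    return comb(n, (n + 1) // 2)


def catalan_count(n):
    """binom(n + 1, n / 2) / (n + 1), for even n."""
    return comb(n + 1, n // 2) // (n + 1)


def corollaries_hold(max_n=12):
    for n in range(max_n + 1):
        total, ending_at_zero = nonnegative_paths(n)
        if total != formula_28(n):
            return False
        if n % 2 == 0 and ending_at_zero != catalan_count(n):
            return False
    return True
```

The enumeration visits all 2ⁿ sign sequences, so it is meant for small n only; the formulas themselves are what one would use in practice.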

27 The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

**Distribution after n steps**

Lemma 16.

$$P_0(S_n = y) = \binom{n}{(n+y)/2}\, 2^{-n}, \qquad P(S_n = y \mid S_m = x) = \binom{n-m}{(n-m+y-x)/2}\, 2^{-(n-m)}.$$

In particular, if n is even,

$$u(n) := P_0(S_n = 0) = \binom{n}{n/2}\, 2^{-n}. \tag{29}$$

Proof. We have

$$P_0(S_n = y) = \frac{\#\{(0,0) \rightsquigarrow (n,y)\}}{2^n}$$

and

$$P(S_n = y \mid S_m = x) = \frac{\#\{\text{paths } (m,x) \rightsquigarrow (n,y)\}}{2^{n-m}},$$

and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that $S_0 = 0$ and $S_n = y > 0$, the random walk never becomes zero up to time n is given by

$$P_0(S_1 > 0, S_2 > 0, \ldots, S_n > 0 \mid S_n = y) = \frac{y}{n}. \tag{30}$$

This is called the ballot theorem and the reason will be given in Section 29. Such a simple answer begs for a physical/intuitive explanation. But, for now, we shall just compute it.

Proof. The left side of (30) equals

$$\frac{\#\{\text{paths } (1,1) \rightsquigarrow (n,y) \text{ that do not touch zero}\} \times 2^{-n}}{\#\{\text{paths } (0,0) \rightsquigarrow (n,y)\} \times 2^{-n}} = \frac{\binom{n-1}{\frac{1}{2}(n-1+y-1)} - \binom{n-1}{\frac{1}{2}(n-1-y-1)}}{\binom{n}{\frac{1}{2}(n+y)}} = \frac{y}{n}.$$

27.3 Some conditional probabilities

Lemma 17. If n, y ≥ 0 are both even or both odd then

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = y) = 1 - \frac{P_0(S_n = y+2)}{P_0(S_n = y)} = \frac{y+1}{\frac{1}{2}(n+y) + 1}.$$

In particular, if n is even,

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = 0) = \frac{1}{(n/2) + 1}.$$

Proof. We use (27):

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0 \mid S_n = y) = \frac{P_0(S_1 \ge 0, \ldots, S_n \ge 0,\ S_n = y)}{P_0(S_n = y)} = \frac{\binom{n}{\frac{n+y}{2}} - \binom{n}{\frac{n+y}{2}+1}}{\binom{n}{\frac{n+y}{2}}} = 1 - \frac{P_0(S_n = y+2)}{P_0(S_n = y)}.$$

This is further equal to

$$1 - \frac{n!}{\left(\frac{n+y}{2} + 1\right)!\, \left(\frac{n-y}{2} - 1\right)!} \cdot \frac{\left(\frac{n+y}{2}\right)!\, \left(\frac{n-y}{2}\right)!}{n!} = 1 - \frac{\frac{n-y}{2}}{\frac{n+y}{2} + 1} = \frac{y+1}{\frac{n+y}{2} + 1}.$$

27.4 Some remarkable identities

Theorem 26. If n is even,

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0) = P_0(S_1 \ne 0, \ldots, S_n \ne 0) = P_0(S_n = 0).$$

Proof. For even n we have

$$P_0(S_1 \ge 0, \ldots, S_n \ge 0) = \frac{\#\{(0,0) \rightsquigarrow (n,\bullet);\ \text{remain} \ge 0\}}{2^n} = \binom{n}{n/2}\, 2^{-n} = P_0(S_n = 0),$$

where we used (28) and (29). We next have

$$
\begin{aligned}
P_0(S_1 \ne 0, \ldots, S_n \ne 0)
&\stackrel{(a)}{=} P_0(S_1 > 0, \ldots, S_n > 0) + P_0(S_1 < 0, \ldots, S_n < 0) \\
&\stackrel{(b)}{=} 2 P_0(S_1 > 0, \ldots, S_n > 0) \\
&\stackrel{(c)}{=} 2 P_0(\xi_1 = 1)\, P_0(S_2 > 0, \ldots, S_n > 0 \mid \xi_1 = 1) \\
&\stackrel{(d)}{=} P_0(1 + \xi_2 > 0,\ 1 + \xi_2 + \xi_3 > 0,\ \ldots,\ 1 + \xi_2 + \cdots + \xi_n > 0 \mid \xi_1 = 1) \\
&\stackrel{(e)}{=} P_0(\xi_2 \ge 0,\ \xi_2 + \xi_3 \ge 0,\ \ldots,\ \xi_2 + \cdots + \xi_n \ge 0) \\
&\stackrel{(f)}{=} P_0(S_1 \ge 0, \ldots, S_{n-1} \ge 0) \\
&\stackrel{(g)}{=} P_0(S_1 \ge 0, \ldots, S_n \ge 0),
\end{aligned}
$$

where (a) is obvious, (b) follows from symmetry, (c) is just conditioning on $\{\xi_1 = 1\}$ and using the fact that $S_1 = \xi_1$, (d) follows from $P_0(\xi_1 = 1) = 1/2$ and from the substitution of the value of $\xi_1$ on the left of the conditioning, (e) follows from the fact that $\xi_1$ is independent of $\xi_2, \xi_3, \ldots$ and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0, (f) follows from the replacement of $(\xi_2, \xi_3, \ldots)$ by $(\xi_1, \xi_2, \ldots)$, and finally (g) is trivial because $S_{n-1} \ge 0$ means $S_{n-1} > 0$ because $S_{n-1}$ cannot take the value 0, and $S_{n-1} > 0$ implies that $S_n \ge 0$.
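Both the ballot theorem (30) and the identities of Theorem 26 can be confirmed exactly for small n by enumerating all 2ⁿ equally likely sign sequences. The sketch below uses exact fractions; the function names are hypothetical and the code is meant as a sanity check rather than an efficient method.

```python
from fractions import Fraction
from itertools import accumulate, product


def walk_probabilities(n):
    """Exact P0(S_1..S_n all >= 0), P0(S_1..S_n all != 0) and P0(S_n = 0), for n >= 1."""
    stay_nonneg = never_zero = end_zero = 0
    for signs in product((-1, 1), repeat=n):
        path = list(accumulate(signs))  # S_1, ..., S_n
        stay_nonneg += min(path) >= 0
        never_zero += 0 not in path
        end_zero += path[-1] == 0
    total = 2 ** n
    return (Fraction(stay_nonneg, total),
            Fraction(never_zero, total),
            Fraction(end_zero, total))


def ballot_probability(n, y):
    """Exact P0(S_1 > 0, ..., S_n > 0 | S_n = y); needs y reachable in n steps."""
    hits = positive = 0
    for signs in product((-1, 1), repeat=n):
        path = list(accumulate(signs))
        if path[-1] == y:
            hits += 1
            positive += min(path) > 0
    return Fraction(positive, hits)
```

For example, `ballot_probability(7, 3)` should equal 3/7, and for even n the three values returned by `walk_probabilities(n)` should coincide.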

27.5 First return to zero

In (29), we computed, for even n, the probability u(n) that the walk (started at 0) attains value 0 in n steps. We now want to find the probability f(n) that the walk will return to 0 for the first time in n steps.

Theorem 27. Let n be even, and $u(n) = P_0(S_n = 0)$. Then

$$f(n) := P_0(S_1 \ne 0, \ldots, S_{n-1} \ne 0,\ S_n = 0) = \frac{u(n)}{n-1}.$$

Proof. Since the random walk cannot change sign before becoming zero,

$$f(n) = P_0(S_1 > 0, \ldots, S_{n-1} > 0,\ S_n = 0) + P_0(S_1 < 0, \ldots, S_{n-1} < 0,\ S_n = 0).$$

By symmetry, the two terms must be equal. So

$$
\begin{aligned}
f(n) &= 2 P_0(S_1 > 0, \ldots, S_{n-1} > 0,\ S_n = 0) \\
&= 2 P_0(S_{n-1} > 0,\ S_n = 0)\; P_0(S_1 > 0, \ldots, S_{n-2} > 0 \mid S_{n-1} > 0,\ S_n = 0).
\end{aligned}
$$

Now,

$$P_0(S_{n-1} > 0,\ S_n = 0) = P_0(S_n = 0)\, P(S_{n-1} > 0 \mid S_n = 0).$$

We know that $P_0(S_n = 0) = u(n) = \binom{n}{n/2} 2^{-n}$. Also, given that $S_n = 0$, we either have $S_{n-1} = 1$ or −1, with equal probability. So the last term is 1/2. To take care of the last term in the last expression for f(n), we see that $S_{n-1} > 0$, $S_n = 0$ means $S_{n-1} = 1$, $S_n = 0$. By the Markov property at time n − 1, we can omit the last event. So

$$f(n) = 2 u(n)\, \frac{1}{2}\, P_0(S_1 > 0, \ldots, S_{n-2} > 0 \mid S_{n-1} = 1),$$

and the last probability can be computed from the ballot theorem, see (30): it is equal to 1/(n − 1). So

$$f(n) = \frac{u(n)}{n-1}.$$

28 The reflection principle for a simple random walk in dimension 1

The reflection principle says the following. Fix a state a. Put a two-sided mirror at a. If a particle is to the right of a then you see its image on the left, and if the particle is to the left of a then its mirror image is on the right. So if the particle is in position s then its mirror image is in position $\tilde s = 2a - s$ because $s - a = a - \tilde s$. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at 2a and performs a simple symmetric random walk. This is not so interesting.

What is more interesting is this: Run the particle till it hits a (it will, for sure, by Theorem 35) and after that consider its mirror image $\tilde S_n = 2a - S_n$. In other words, define

$$
\tilde S_n =
\begin{cases}
S_n, & n < T_a, \\
2a - S_n, & n \ge T_a,
\end{cases}
$$

where

$$T_a = \inf\{n \ge 0 : S_n = a\}$$

is the first hitting time of a. Then

Theorem 28. If $(S_n)$ is a simple symmetric random walk started at 0 then so is $(\tilde S_n)$.

Proof. Trivially,

$$
S_n =
\begin{cases}
S_n, & n < T_a, \\
a + (S_n - a), & n \ge T_a.
\end{cases}
$$

By the strong Markov property, the process $(S_{T_a + m} - a,\ m \ge 0)$ is a simple symmetric random walk, independent of the past before $T_a$. But a symmetric random walk has the same law as its negative. So $S_n - a$ can be replaced by $a - S_n$ in the last display and the resulting process, called $(\tilde S_n, n \in \mathbb{Z}_+)$, has the same law as $(S_n, n \in \mathbb{Z}_+)$.

28.1 Distribution of hitting time and maximum

We can take advantage of this principle to establish the distribution function of $T_a$:

Theorem 29. Let $(S_n, n \ge 0)$ be a simple symmetric random walk and a > 0. Then

$$P_0(T_a \le n) = 2 P_0(S_n \ge a) - P_0(S_n = a) = P_0(|S_n| \ge a) - \frac{1}{2} P_0(|S_n| = a).$$

Proof. If $S_n \ge a$ then $T_a \le n$. So:

$$P_0(S_n \ge a) = P_0(T_a \le n,\ S_n \ge a).$$

In the last event, we have $n \ge T_a$, so we may replace $S_n$ by $2a - S_n$ (Theorem 28). Thus,

$$P_0(S_n \ge a) = P_0(T_a \le n,\ 2a - S_n \ge a) = P_0(T_a \le n,\ S_n \le a).$$

But

$$P_0(T_a \le n) = P_0(T_a \le n,\ S_n \le a) + P_0(T_a \le n,\ S_n > a) = P_0(T_a \le n,\ S_n \le a) + P_0(S_n > a).$$

The last equality follows from the fact that $S_n > a$ implies $T_a \le n$. Combining the last two displays gives

$$P_0(T_a \le n) = P_0(S_n \ge a) + P_0(S_n > a) = P_0(S_n \ge a) + P_0(S_n \ge a) - P_0(S_n = a).$$

Finally, observing that $S_n$ has the same distribution as $-S_n$ we get that this is also equal to $P_0(|S_n| \ge a) - \frac{1}{2} P_0(|S_n| = a)$.

Define now the running maximum

$$M_n = \max(S_0, S_1, \ldots, S_n).$$

Then, as a corollary to the above result, we obtain

Corollary 19.
$$P_0(M_n \ge x) = P_0(T_x \le n) = 2P_0(S_n \ge x) - P_0(S_n = x) = P_0(|S_n| \ge x) - \frac{1}{2} P_0(|S_n| = x).$$

Proof. Observe that $M_n \ge x$ if and only if $S_k \ge x$ for some $k \le n$, which is equivalent to $T_x \le n$, and use Theorem 29.

We can do more: we can derive the joint distribution of $M_n$ and $S_n$:

Theorem 30.
$$P_0(M_n < x, S_n = y) = P_0(S_n = y) - P_0(S_n = 2x - y), \qquad x > y.$$

Proof. Since $M_n < x$ is equivalent to $T_x > n$, we have
$$P_0(M_n < x, S_n = y) = P_0(T_x > n, S_n = y) = P_0(S_n = y) - P_0(T_x \le n, S_n = y). \qquad (31)$$
If $T_x \le n$, then, by applying the reflection principle (Theorem 28), we can replace $S_n$ by $2x - S_n$ in the last part of the last display and get
$$P_0(T_x \le n, S_n = y) = P_0(T_x \le n, S_n = 2x - y).$$
But if $x > y$ then $2x - y > x$ and so $\{S_n = 2x - y\} \subset \{T_x \le n\}$, which results in
$$P_0(T_x \le n, S_n = y) = P_0(S_n = 2x - y).$$
This, combined with (31), gives the result.

29 Urns and the ballot theorem

Suppose that an urn [8] contains $n$ items, of which $a$ are coloured azure and $b$ black, $n = a + b$. A sample from the urn is a sequence $\eta = (\eta_1, \ldots, \eta_n)$, where $\eta_i$ indicates the colour of the $i$-th item picked. The set of values of $\eta$ contains $\binom{n}{a}$ elements and $\eta$ is uniformly distributed in this set.

The original ballot problem, posed in 1887 [9], asks the following: if we start picking the items from the urn one by one without replacement, what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved [10] in the same year. The answer is very simple:

Theorem 31. If an urn contains $a$ azure items and $b = n - a$ black items, then the probability that, during sampling without replacement, the number of selected azure items is always ahead of the black ones equals $(a - b)/n$, as long as $a \ge b$.

[8] An urn is a vessel, usually a vase furnished with a foot or pedestal, employed for different purposes, as for holding liquids, ornamental objects, the ashes of the dead after cremation, and anciently for holding ballots to be drawn. It is the latter use we are interested in in Probability and Statistics.
[9] Bertrand, J. (1887). Solution d'un problème. Compt. Rend. Acad. Sci. Paris, 105, 369.
[10] André, D. (1887). Solution directe du problème résolu par M. Bertrand. Compt. Rend. Acad. Sci. Paris, 105, 436-437.
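The ballot theorem can be checked by brute force for small urns. The sketch below is only an illustration, not part of the original notes: the function name `ballot_probability` is an assumption, and the enumeration over all $\binom{n}{a}$ colour arrangements is feasible only for small $n$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def ballot_probability(a, b):
    """Probability that azure is strictly ahead of black throughout the draw.

    Every placement of the a azure items among the n = a + b draws is
    equally likely, so we count the favourable placements (hypothetical
    helper written for illustration).
    """
    n = a + b
    favourable = 0
    for azure_positions in combinations(range(n), a):
        azure = set(azure_positions)
        lead = 0  # (number of azure drawn) - (number of black drawn)
        ahead = True
        for i in range(n):
            lead += 1 if i in azure else -1
            if lead <= 0:  # azure is not strictly ahead at this draw
                ahead = False
                break
        favourable += ahead
    return Fraction(favourable, comb(n, a))


if __name__ == "__main__":
    for a, b in [(5, 3), (4, 2), (6, 1)]:
        print(a, b, ballot_probability(a, b), Fraction(a - b, a + b))
```

Exact rational arithmetic is used so that the comparison with $(a-b)/n$ is an equality rather than a floating-point approximation.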

We shall call $(\eta_1, \ldots, \eta_n)$ an urn process with parameters $n, a$, always assuming (without any loss of generality) that $a \ge b = n - a$. An urn can be realised quite easily by a random walk:

Theorem 32. Consider a simple symmetric random walk $(S_n, n \ge 0)$ with values in $\mathbb{Z}$ and let $y$ be a positive integer. Then, conditional on the event $\{S_n = y\}$, the sequence $(\xi_1, \ldots, \xi_n)$ of the increments is an urn process with parameters $n$ and $a = (n + y)/2$.

Proof. The values of each $\xi_i$ are $\pm 1$, so 'azure' items are $+1$ and 'black' items are $-1$. We need to show that, conditional on $\{S_n = y\}$, the sequence $(\xi_1, \ldots, \xi_n)$ is uniformly distributed in a set with $\binom{n}{a}$ elements. But if $S_n = y$ then, letting $a, b$ be defined by $a + b = n$, $a - b = y$, we have that $a$ out of the $n$ increments will be $+1$ and $b$ will be $-1$. So the number of possible values of $(\xi_1, \ldots, \xi_n)$ is, indeed, $\binom{n}{a}$. It remains to show that all values are equally likely. Let $\varepsilon_1, \ldots, \varepsilon_n$ be elements of $\{-1, +1\}$ such that $\varepsilon_1 + \cdots + \varepsilon_n = y$. Then, using the definition of conditional probability and Lemma 16, we have:
$$P_0(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n \mid S_n = y) = \frac{P_0(\xi_1 = \varepsilon_1, \ldots, \xi_n = \varepsilon_n)}{P_0(S_n = y)} = \frac{2^{-n}}{\binom{n}{(n+y)/2} 2^{-n}} = \frac{1}{\binom{n}{a}}.$$
Thus, conditional on $\{S_n = y\}$, the sequence $(\xi_1, \ldots, \xi_n)$ has a uniform distribution.

Proof of Theorem 31. Since an urn with parameters $n, a$ can be realised by a simple symmetric random walk starting from 0 and ending at $S_n = y$, where $y = a - (n - a) = 2a - n$, we can translate the original question into a question about the random walk. Since the 'azure' items correspond to $+$ signs (respectively, the 'black' items correspond to $-$ signs), the event that the azure are always ahead of the black ones during sampling is precisely the event $\{S_1 > 0, \ldots, S_n > 0\}$ that the random walk remains positive. But in (30) we computed that
$$P_0(S_1 > 0, \ldots, S_n > 0 \mid S_n = y) = \frac{y}{n}.$$
Since $y = a - (n - a) = a - b$, the result follows.

30 The asymmetric simple random walk in dimension 1

Consider now a simple random walk with values in $\mathbb{Z}$, but which is not necessarily symmetric. (The word "asymmetric" will always mean "not necessarily symmetric".) So
$$p_{i,i+1} = p, \qquad p_{i,i-1} = q = 1 - p, \qquad i \in \mathbb{Z}.$$
We have
$$P_0(S_n = k) = \binom{n}{\frac{n+k}{2}} p^{(n+k)/2} q^{(n-k)/2}, \qquad -n \le k \le n.$$
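The formula for $P_0(S_n = k)$ can be compared with a direct step-by-step computation of the law of $S_n$. The following is a minimal sketch under stated assumptions: the function names are hypothetical, $p$ is passed as an exact `Fraction`, and the formula is read as zero when $n + k$ is odd (the walk cannot then be at $k$).

```python
from fractions import Fraction
from math import comb


def pmf_formula(n, k, p):
    """P_0(S_n = k) = C(n, (n+k)/2) p^((n+k)/2) q^((n-k)/2), zero if n+k is odd."""
    if abs(k) > n or (n + k) % 2:
        return Fraction(0)
    ups = (n + k) // 2          # number of +1 steps
    return comb(n, ups) * p ** ups * (1 - p) ** (n - ups)


def pmf_by_steps(n, p):
    """Law of S_n obtained by pushing the distribution forward one step at a time."""
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for s, pr in dist.items():
            new[s + 1] = new.get(s + 1, Fraction(0)) + pr * p
            new[s - 1] = new.get(s - 1, Fraction(0)) + pr * (1 - p)
        dist = new
    return dist


if __name__ == "__main__":
    p = Fraction(1, 3)
    law = pmf_by_steps(6, p)
    for k in sorted(law):
        print(k, law[k], pmf_formula(6, k, p))
```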

If x > 0. E0 sTx = ϕ(s)x . For all x.30. x ∈ Z. We will approach this using probability generating functions. 2qs Proof. Theorem 33. Corollary 20.i. x ≥ 0) is also a random walk.} ∪ {∞}. This random variable takes values in {0. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. In view of this. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. The process (Tx . 1. To prove this. . random variables. . Let ϕ(s) := E0 sT1 . See ﬁgure below. y ∈ Z. . The rest is a consequence of the strong Markov property. Proof.d. for x > 0. it is no loss of generality to consider the random walk starting from 0. 2. 90 . . only the relative position of y with respect to x is needed. Consider a simple random walk starting from 0. . plus a ′′ further time T1 required to go from 0 to +1 for the ﬁrst time. Lemma 18. Proof. 2. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. and all t ≥ 0. Then the probability generating function of the ﬁrst time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . x − 1 in order to reach x. In view of Lemma 19 we need to know the distribution of T1 . Start with S0 = 0 and watch the process until the ﬁrst n = T1 such that Sn = 1. If S1 = 1 then T1 = 1.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. . . Consider a simple random walk starting from 0. Px (Ty = t) = P0 (Ty−x = t). Lemma 19. Consider a simple random walk starting from 0.) Then. (We tacitly assume that S0 = 0. plus a time T1 required to go from −1 to 0 for the ﬁrst time. we make essential use of the skip-free property.

Then the probability generating function of the ﬁrst time that level x ∈ Z is reached is given by √ x E sT1 x = 1− 1−4pqs2 . 91 . if x < 0 2ps Proof. from the strong Markov property. Corollary 21. The ﬁrst time n such that Sn = −1 is the ﬁrst time n such that −Sn = 1. Then the probability generating function of the ﬁrst time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Consider a simple random walk starting from 0. in order to hit 1. the RW must ﬁrst hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Consider a simple random walk starting from 0. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Corollary 22. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. 0 if x > 0 2qs Tx (32) E0 s = √ |x| E sT−1 |x| = 1− 1−4pqs2 0 . But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. 2ps Proof. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. The formula is obtained by solving the quadratic. Hence the previous formula applies with the roles of p and q interchanged.

For a symmetric ′ random walk. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34.2 First return to the origin Now let us ask: If the particle starts at 0. If we deﬁne n to be equal to 0 for k > n. Similarly. The latter is. k 92 . E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 .2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. For the general case. 2ps ′ =1− 30. if S1 = 1. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 .30. and simply write (1 + x)n = k≥0 n k x . we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . where α is a real number. then we can omit the k upper limit in the sum. how long will it take till it ﬁrst returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. this was found in §30. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . We again do ﬁrst-step analysis. in distribution.3 The distribution of the ﬁrst return to the origin We wish to ﬁnd the exact distribution of the ﬁrst return time to the origin. k by the binomial theorem. Consider a simple random walk starting from 0. the same as the time required for the walk to hit 1 starting ′ from 0. Then the probability generating function of the ﬁrst return time to 0 is given by E0 s T 0 = 1 − Proof. If α = n is a positive integer then n (1 + x)n = k=0 n k x . 1 − 4pqs2 .

let us compute √ (−1)k (−x)k = k≥0 k≥0 k≥0 −1 (−x)k k = (−1)k k! (−1)(−2) · · · (−1 − k + 1) = = (−1)k . 2k − 1 k 93 . but that we won’t worry about which ones. Indeed. (−1)k = 2k − 1 k k and this is shown by direct computation. k! k! (−1)2k xk = k≥0 xk .Recall that n k = It turns out that the formula is formally true for general α. k as long as we deﬁne α k = α(α − 1) · · · (α − k + 1) . So √ 1−x= ′ k≥0 1 2k k x =1+ 2k − 1 k 1 − 4pqs2 : ′ k≥1 2k k 1 x 2k − 1 k We apply this to E0 sT0 = 1 − E0 s T 0 = k≥1 1 2k (pq)k s2k . by deﬁnition. k!(n − k)! k! α k x . For example. −1 k so (1 − x)−1 = as needed. As another example. We have k≥0 1 − x = (1 − x)1/2 = 1/2 (−1)k xk . k We can write this in more familiar symbols if we observe that 2k −k 1 1/2 4 . (1 − x)−1 = But. we will show that (1 − x)−1 = 1 + x + x2 + x3 + · · · which is of course the all too-familiar geometric series. k! The adverb ‘formally’ is supposed to mean that this formula holds for some x around 0. √ 1 − x. and it looks exactly the same: (1 + x)α = k≥0 n(n − 1) · · · (n − k + 1) n! = .

Hence it is transient. x if x > 0 . • If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. without appeal to the Markov chain theory. But remember that q = 1 − p.But. 2k − 1 k 2k − 1 31 Recurrence properties of simple random walks Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. That it is actually null-recurrent follows from Markov chain theory. The quantity inside the square root becomes 1 − 4pq. |x| if x > 0 (33) . so 1 − 4pq = So P0 (Tx < ∞) = 1 − 4p + 4p2 = (1 − 2p)2 = |1 − 2p| = |p − q|. • If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. by the deﬁnition of E0 sT0 . • If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. E0 s T 0 = n≥0 ′ ′ ′ P0 (T0 = n)sn . By applying the law of large numbers we have that P0 Sn =p−q n→∞ n lim = 1. Comparing the two ‘inﬁnite polynomials’ we see that we must have n = 2k (we knew that– the random walk can return to 0 only in an even number of steps) and then ′ P0 (T0 = 2k) = 1 1 2k k k p q = P0 (S2k = 0). We will see how the results about the generating functions of the hitting and return times help us answer these questions directly. s↑1 which is true for any probability generating function. 94 . if p = q then P0 (Tx < ∞) = 1 for all x ∈ Z. The key fact needed here is: P0 (Tx < ∞) = lim E0 sTx . We apply this to (32). if x < 0 This formula can be written more neatly as follows: Theorem 35. P0 (Tx < ∞) = p q q p ∧1 ∧1 . x 1−|p−q| 2q 1−|p−q| 2p . if x < 0 |x| In particular. Hence it is again transient.

. . This means that there is a largest value that will be ever attained by the random walk. P0 (M∞ ≥ x) = lim P0 (Mn ≥ x) = lim P0 (Tx ≤ n) = P0 (Tx < ∞) = (p/q)x . 31. Unless p = 1/2. . 95 . The simple symmetric random walk with values in Z is null-recurrent. Proof. Indeed. 31. If p > q then |p − q| = p − q and so (33) gives P0 (Tx < ∞) = 1.2 Return probabilities Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. . Proof.1 Asymptotic distribution of the maximum Suppose now that p < 1/2.) If we let Mn = max(S0 . and so it is null-recurrent. it will never ever return to the point where it started from. Otherwise. The last part of Theorem 35 tells us that it is recurrent. again. S1 . every nondecreasing sequence has a limit. Consider a simple random walk with values in Z. This value is random and is denoted by M∞ = sup(S0 . Sn ). Then ′ P0 (T0 < ∞) = 2(p ∧ q).Proof. with probability 1 because p < 1/2. . Then the overall maximum M∞ has a geometric distribution: P0 (M∞ ≥ x) = (p/q)x . q p. we obtain what we are looking for. Let ′ T0 = inf{n ≥ 1 : Sn = 0} be the ﬁrst return time to 0. but here it is < ∞. We know that it cannot be positive recurrent. this probability is always strictly less than 1. Then limn→∞ Sn = −∞ with probability 1. Corollary 23. if x < 0 which is what we want. the sequence Mn is nondecreasing. Theorem 36. . If p < q then |p − q| = q − p and. except that the limit may be ∞. with some positive probability. . What is this probability? Theorem 37. then we have M∞ = limn→∞ Mn . if x > 0 . S2 . n→∞ n→∞ x ≥ 0. Consider a random walk with values in Z. x ≥ 0. Assume p < 1/2.

. as already known. 2. Ei Ji = 1 − 2(p ∧ q) Remark 2: The formulae work for all values of p ∈ [0. This proves the ﬁrst formula. In the latter case. meaning that Pi (Ji = ∞) = 1. . 2. Theorem 38. k = 0. 1]. 1 − 2(p ∧ q) Proof. even for p = 1/2. 1. We now compute its exact distribution. For the second formula we have ∞ ∞ E0 J0 = k=1 P0 (J0 ≥ k) = k=1 (2(p ∧ q))k = 2(p ∧ q) . k = 0. 2. 2(p ∧ q) . where the last probability was computed in Theorem 37. . But ′ P0 (T0 < ∞) = lim EsT0 = 1 − s↑1 ′ 1 − 4pq = 1 − |p − q|. The probability generating function of T0 was obtained in Theorem 34. In Lemma 4 we showed that. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. Then. 2(p ∧ q) E0 J0 = . The total number of visits to state i is given by ∞ Ji = n=1 1(Sn = i). 31. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. . (34) a random variable ﬁrst considered in Section 10 where it was also shown to be geometric. the total number of visits to a state will be ﬁnite. k = 0. But J0 ≥ 1 if and only if the ﬁrst return time to zero is ﬁnite.3 Total number of visits to a state If the random walk is recurrent then it visits any state inﬁnitely many times. starting from zero. we obviously have that the same formulae apply for any i: Pi (Ji ≥ k) = (2(p ∧ q))k . if a Markov chain starts from a state i then Ji is geometric. Pi (Ji ≥ k) = 1 for all k.′ Proof. We now look at the the number of visits to some state but if we start from a diﬀerent state. . If p = q this gives 1. So ′ P0 (J0 ≥ 1) = P0 (T0 < ∞) = 2(p ∧ q). . . 1 − 2(p ∧ q) this being a geometric series. 1. the total number J0 of visits to state 0 is geometrically distributed with P0 (J0 ≥ k) = (2(p ∧ q))k . Remark 1: By spatial homogeneity. And so P0 (J0 ≥ k) = P0 (J0 ≥ 1)k . Consider a simple random walk with values in Z. . . 96 . 1. But if it is transient. .

If p ≤ 1/2 then P0 (Jx ≥ k) = (p/q)x∨0 (2p)k−1 . given that Tx < ∞. E0 Jx = If p ≥ 1/2 then P0 (Jx ≥ k) = (q/p)(−x)∨0 (2q)k−1 . Remark 1: The expectation E0 Jx is ﬁnite if p = 1/2. it will visit any point below 0 exactly the same number of times on the average. Apply now the strong Markov property at Tx . we have E0 Jx = 1/(1−2p) regardless of x. But for x < 0. k ≥ 1. we apply E0 Jx = k=1 P0 (Jx ≥ k) and perform the sum of a geometric series. then E0 Jx drops down to zero geometrically fast as x tends to inﬁnity. 97 . and the second in Theorem 38.Theorem 39. Then P0 (Jx ≥ k) = P0 (Tx < ∞. We repeat the argument for x < 0 and ﬁnd P0 (Jx ≥ k) = q ∧1 p |x| (2(p ∧ q))k−1 . Suppose x > 0. The ﬁrst term was computed in Theorem 35. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (2p)k−1 .e. Jx ≥ k). Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae. ∞ As for the expectation. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (2q)k−1 . simply because if Jx ≥ k ≥ 1 then Tx < ∞. Py (Jx ≥ k) = P0 (Jx−y ≥ k) and Ey Jx = E0 Jx−y . Consider a random walk with values in Z. starting from 0. E0 Jx = Proof.e. Then P0 (Tx < ∞. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (p/q)x (2p)k−1 . there has already been one visit to state x and so we need at least k − 1 remaining visits in order to have at least k visits in total. The reason that we replaced k by k − 1 is because. If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (q/p)|x| (2q)k−1 . (q/p)(−x)∨0 1 − 2q k≥1 . For x > 0. Consider the case p < 1/2. i. if the random walk “drifts down” (i. So we obtain P0 (Jx ≥ k) = p ∧1 q x (p/q)x∨0 1 − 2p k≥1 . In other words. Jx ≥ k) = P0 (Tx < ∞)Px (Jx ≥ k − 1). (2(p ∧ q))k−1 . p < 1/2) then.

32 Duality

Consider random walk $S_k = \sum_{i=1}^{k} \xi_i$, i.e. a sum of i.i.d. random variables. Fix an integer $n$. Then the dual $(\hat S_k, 0 \le k \le n)$ of $(S_k, 0 \le k \le n)$ is defined by
$$\hat S_k := S_n - S_{n-k}.$$

Theorem 40. $(\hat S_k, 0 \le k \le n)$ has the same distribution as $(S_k, 0 \le k \le n)$.

Proof. Use the obvious fact that $(\xi_1, \xi_2, \ldots, \xi_n)$ has the same distribution as $(\xi_n, \xi_{n-1}, \ldots, \xi_1)$.

Thus, every time we have a probability for $S$ we can translate it into a probability for $\hat S$, which is basically another probability for $S$, because the two have the same distribution. Here is an application of this:

Theorem 41. Let $(S_k)$ be a simple symmetric random walk and $x$ a positive integer. Let $T_x = \inf\{n \ge 1 : S_n = x\}$ be the first time that $x$ will be visited. Then
$$P_0(T_x = n) = \frac{x}{n} P_0(S_n = x).$$

Proof. Rewrite the ballot theorem in terms of the dual:
$$P_0(\hat S_1 > 0, \ldots, \hat S_n > 0 \mid \hat S_n = x) = \frac{x}{n}, \qquad x > 0.$$
But the left-hand side is
$$P_0(S_n - S_{n-k} > 0,\ 1 \le k \le n \mid S_n = x) = P_0(S_k < x,\ 0 \le k \le n-1 \mid S_n = x) = \frac{P_0(S_k < x,\ 0 \le k \le n-1,\ S_n = x)}{P_0(S_n = x)} = \frac{P_0(T_x = n)}{P_0(S_n = x)}.$$

Pause for reflection: We have been dealing with the random times $T_x$ for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for $x > 0$, $T_x$ is the sum of $x$ i.i.d. random variables, each distributed as $T_1$, and thus derived the generating function of $T_x$:
$$E_0 s^{T_x} = \left( \frac{1 - \sqrt{1 - 4pqs^2}}{2qs} \right)^x.$$
Specialising to the symmetric case we have
$$E_0 s^{T_x} = \left( \frac{1 - \sqrt{1 - s^2}}{s} \right)^x. \qquad (35)$$
In Theorem 29 we used the reflection principle and derived the distribution function of $T_x$ for a simple symmetric random walk:
$$P_0(T_x \le n) = P_0(|S_n| \ge x) - \frac{1}{2} P_0(|S_n| = x). \qquad (36)$$
Lastly, in Theorem 41, we derived using duality, for a simple symmetric random walk, the probabilities
$$P_0(T_x = n) = \frac{x}{n} P_0(S_n = x). \qquad (37)$$

We remark here something methodological: It was completely diﬀerent techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coeﬃcient of sn in (35); summing up (37) should give (36), or taking diﬀerences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33 Amount of time that a SSRW is positive*

Consider a SSRW $(S_n)$ starting from 0. Define its trajectory as the polygonal line obtained by joining the points $(n, S_n)$ in the plane, as described earlier. Watch this trajectory up to time $2n$. Let $T_+(2n)$ be the number of sides of the polygonal line that are above the horizontal axis and $T_-(2n) = 2n - T_+(2n)$ the number of sides that are below. Clearly, $T_\pm(2n)$ are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of $T_+(2n)$ as the number of leads in $2n$ coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time $T_0'$, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed $2n$ times, so the event that the number of heads equals $k$ has maximum probability when $k = n$ (heads = tails). Most people take this statement and translate it into a statement about $T_+(2n)$, arguing that $P(T_+(2n) = m)$ is maximised when $m = n$, i.e. when $T_+(2n) = T_-(2n)$. This is wrong. In fact, as the theorem below shows, $P(T_+(2n) = m)$ is largest when $m = 0$ or $m = 2n$.

Theorem 42 (distribution of the total time the random walk is positive).
$$P(T_+(2n) = 2k) = P(S_{2k} = 0)\, P(S_{2n-2k} = 0), \qquad 0 \le k \le n.$$

Proof. The idea is to condition on the first time $T_0'$ that the RW returns to 0 (for $0 < k < n$ the walk spends time on both sides of the axis, so such a return must occur by time $2n$):
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(T_+(2n) = 2k,\ T_0' = 2r).$$
Between 0 and $T_0'$ the random walk is either above the horizontal axis or below; call the first event $A_+$ and the second $A_-$ and write
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(A_+, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_+, T_0' = 2r) + \sum_{r=1}^{n} P(A_-, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_-, T_0' = 2r).$$
In the first case,
$$P(T_+(2n) = 2k \mid A_+, T_0' = 2r) = P(T_+(2n - 2r) = 2k - 2r),$$
because, if $T_0' = 2r$ and $A_+$ occurs, then the RW has already spent $2r$ units of time above the axis and so the remaining time it must spend above the axis is reduced by $2r$. Furthermore, we may remove the conditioning because the future evolution of the RW after $T_0'$ is independent of the past. In the second case, a similar argument shows
$$P(T_+(2n) = 2k \mid A_-, T_0' = 2r) = P(T_+(2n - 2r) = 2k).$$
We also observe that, by symmetry,
$$P(A_+, T_0' = 2r) = P(A_-, T_0' = 2r) = \frac{1}{2} P(T_0' = 2r) = \frac{1}{2} f_{2r}.$$
Letting $p_{2k,2n} = P(T_+(2n) = 2k)$, we have shown that
$$p_{2k,2n} = \frac{1}{2} \sum_{r=1}^{k} f_{2r}\, p_{2k-2r,\,2n-2r} + \frac{1}{2} \sum_{r=1}^{n-k} f_{2r}\, p_{2k,\,2n-2r}.$$
We want to show that $p_{2k,2n} = u_{2k} u_{2n-2k}$, where $u_{2k} = P(S_{2k} = 0)$. With this hypothesis, we can start playing with the recursion of the display above and, using induction on $n$, we will realise that
$$p_{2k,2n} = \frac{1}{2} u_{2n-2k} \sum_{r=1}^{k} f_{2r} u_{2k-2r} + \frac{1}{2} u_{2k} \sum_{r=1}^{n-k} f_{2r} u_{2n-2k-2r}.$$
Since $u_{2m} = \sum_{r=1}^{m} f_{2r} u_{2m-2r}$ for $m \ge 1$ (decompose the event $\{S_{2m} = 0\}$ according to the time of the first return to 0), the first sum equals $u_{2k}$ and the second equals $u_{2n-2k}$, so the right-hand side is $u_{2k} u_{2n-2k}$, as required.

Explicitly, we have
$$P(T_+(2n) = 2k) = \binom{2k}{k} \binom{2n-2k}{n-k} 2^{-2n}, \qquad 0 \le k \le n. \qquad (38)$$
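As a small worked instance of (38), offered only as an illustration, take $2n = 4$. With $u_0 = 1$, $u_2 = \binom{2}{1} 2^{-2} = \tfrac12$ and $u_4 = \binom{4}{2} 2^{-4} = \tfrac38$, formula (38) gives

$$P(T_+(4) = 0) = u_0 u_4 = \tfrac{3}{8}, \qquad P(T_+(4) = 2) = u_2 u_2 = \tfrac{1}{4}, \qquad P(T_+(4) = 4) = u_4 u_0 = \tfrac{3}{8}.$$

The three values sum to 1, and the balanced outcome $T_+(4) = T_-(4) = 2$ is already the least likely one.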

With $2n = 100$, we plot the function $P(T_+(100) = 2k)$ for $0 \le k \le 50$ below. Notice that it is maximised at the ends of the interval.

[Figure: plot of $P(T_+(100) = 2k)$ against $k$, for $0 \le k \le 50$.]
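The values behind such a plot can be regenerated directly from formula (38). The sketch below is illustrative; the function name `positive_time_pmf` is an assumption and no plotting library is used, only the numbers are produced.

```python
from math import comb


def positive_time_pmf(n):
    """List of P(T_+(2n) = 2k), k = 0, ..., n, computed from formula (38)."""
    return [comb(2 * k, k) * comb(2 * (n - k), n - k) / 4 ** n
            for k in range(n + 1)]


if __name__ == "__main__":
    pmf = positive_time_pmf(50)          # the case 2n = 100
    print("sum of probabilities:", sum(pmf))
    print("k = 0 :", pmf[0])
    print("k = 25:", pmf[25])
    print("k = 50:", pmf[50])
```

The list is symmetric in $k \leftrightarrow n - k$, largest at the two ends and smallest in the middle, which is the U-shape described in the text.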

The arcsine law*

Use of Stirling's approximation in formula (38) gives
$$P(T_+(2n) = 2k) \sim \frac{1/\pi}{\sqrt{k(n-k)}}, \qquad \text{as } k \to \infty \text{ and } n - k \to \infty.$$
Now consider the calculation of the following probability:
$$P\left( \frac{T_+(2n)}{2n} \in [a, b] \right) = \sum_{na \le k \le nb} P\left( \frac{T_+(2n)}{2n} = \frac{k}{n} \right) \sim \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{k(n-k)}} = \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{\frac{k}{n}\left(1 - \frac{k}{n}\right)}}\, \frac{1}{n}.$$
This is easily seen to have a limit as $n \to \infty$, the limit being
$$\int_a^b \frac{1/\pi}{\sqrt{t(1-t)}}\, dt,$$
because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
$$\lim_{n \to \infty} P\left( \frac{T_+(2n)}{2n} \le x \right) = \int_0^x \frac{1/\pi}{\sqrt{t(1-t)}}\, dt = \frac{2}{\pi} \arcsin \sqrt{x}.$$
We thus have that the limiting distribution of $T_+(2n)/(2n)$ (the fraction of time that the RW is positive), as $n \to \infty$, is the distribution function $\frac{2}{\pi} \arcsin \sqrt{x}$, $0 < x < 1$. This is known as the arcsine law. This distribution function has density $\frac{1/\pi}{\sqrt{x(1-x)}}$, $0 < x < 1$, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times $(T_x, x \ge 0)$. This is another random walk because the summands in $T_x = \sum_{y=1}^{x} (T_y - T_{y-1})$ are i.i.d., all distributed as $T_1$:
$$P(T_1 \ge 2n) = P(S_{2n} = 0) = \binom{2n}{n} 2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$
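The quality of the arcsine approximation can be gauged by comparing the exact distribution function obtained from (38) with $\frac{2}{\pi}\arcsin\sqrt{x}$. The following is a minimal sketch, not part of the original notes: the function names and the choice $n = 500$ are assumptions made for illustration.

```python
from math import asin, comb, pi, sqrt


def positive_fraction_cdf(n, x):
    """Exact P(T_+(2n)/(2n) <= x), summing formula (38) over k <= n*x."""
    k_max = int(n * x)  # largest k with k/n <= x (floor, for 0 <= x <= 1)
    total = sum(comb(2 * k, k) * comb(2 * (n - k), n - k)
                for k in range(k_max + 1))
    return total / 4 ** n


def arcsine_cdf(x):
    """Limiting distribution function (2/pi) * arcsin(sqrt(x)), 0 <= x <= 1."""
    return 2 / pi * asin(sqrt(x))


if __name__ == "__main__":
    n = 500
    for x in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(x, positive_fraction_cdf(n, x), arcsine_cdf(x))
```

Integer arithmetic is kept until the final division so that the large binomial coefficients do not lose precision.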

y ∈ Z. We describe this motion as a random walk in Z2 = Z × Z: Let ξ1 . Sn − Sn ). 1 Hence (Sn . 0) and. 1)) = P (ξn = (−1. 0)) = 1/4. he moves east. . random variables and hence a random walk. we immediately get that the expectation of T1 is inﬁnity: ET1 = ∞. i=1 ξi i=1 ξi 2 1 Lemma 20. Consider a change of coordinates: rotate the axes by 45o (and scale them). −1)) = 1/2. north or south. When he reaches a corner he moves again in the same manner. 1 1 Proof. the random 1 2 variables ξn . n ∈ Z+ ) is a random walk. . n ∈ Z+ ) is a sum of i. ξ2 . . . we let x1 be its horizontal coordinate and x2 its vertical. going at random east. (Sn . for each n. It is not a simple random walk because there is a possibility that the increment takes the value 0. x1 − x2 ). 2 Same argument applies to (Sn . 0). x2 ) into (x1 + x2 . So let 1 2 1 2 1 2 Rn = (Rn .d. 0)) = P (ξn = (1. ξ2 .i. He starts at (0. too. Sn ) = n n 2 . The two random walks are not independent. the two stochastic processes are not independent. Many other quantities can be approximated in a similar manner.i. so are ξ1 . We then let S0 = (0. . Sn = i=1 n ξi . is a random walk.From this. are independent. . So Sn = (Sn . We have: 102 . However. west. . Since ξ1 .. The corners are described by coordinates (x. (Sn . map (x1 . We have 1. n ∈ Z+ ) is also a random walk. . 0)) = 1/4.d. 0)) = 1/4 1 P (ξn = −1) = P (ξn = (−1. n ≥ 1. ξn are not independent simply because if one takes value 0 the other is nonzero. random vectors such that P (ξn = (0. They are also identically distributed with 1 P (ξn = 0) = P (ξn = (0. with equal probability (1/4). being totally drunk. ξ2 . . 0)) = P (ξn = (−1. Rn ) := (Sn + Sn . 2 1 If x ∈ Z2 . i. the latter being functions of the former. y) where x. i.e. 34 Simple symmetric random walk in two dimensions Consider a drunkard in a city whose roads are arranged in a rectangular grid. 1)) + P (ξn = (0. Indeed. n ∈ Z+ ) showing that it. west. 1 P (ξn = 1) = P (ξn = (1. north or south.

is i. b−a −a . 0)) = 1/4 1 2 P (ηn = 1. By Lemma 5. b−a P (STa ∧Tb = a) = b .1 Lemma 21.. Proof. . Hence (Rn . 1 1 2 2 Hence P (ηn = 1) = P (ηn = −1) = 1/2. Hence P (ηn = 1) = P (ηn = −1) = 1/2. ηn = −1) = P (ξn = (−1. . in the sense that the ratio of the two sides goes to 1. being obtained the vector ηn = (ξn 1 2 n n n by taking the same function on the elements of the sequence ξ1 . where c is a positive constant. ηn = ξn − ξn are independent for each n. n ∈ Z ) is a random walk in one dimension. Similar argument shows that each of (Rn . ε′ ∈ {−1. So Rn = n ηn where ηn is i=1 i=1 i=1 1 + ξ 2 . 2 1 2 2 1 1 Proof. (Rn .i. The possible values 1 . 2n n 2 2−4n . We have of each of the ηn n 1 2 P (ηn = 1. 1 where the last equality follows from the independence of the two random walks. = 1)) = 1/4. ηn = ε′ ) = P (ηn = ε)P (ηn = ε′ ) for all choices of ε. Rn = n (ξi − ξi ). R2n = 0) = P (R2n = 0)(R2n = 0). n ∈ Z+ ) is a random walk in two dimensions. that for a simple symmetric random walk (started from S0 = 0). The sequence of vectors η . 0)) = 1/4 1 2 P (ηn = −1. (Rn + independent. The two-dimensional simple symmetric random walk is recurrent. b−a P (Ta < Tb ) = b . The last two are walk in one dimension. . (Rn . . The same is true for (Rn ). n ∈ Z+ ). P (S2n = 0) ∼ n . We have Rn = n (ξi + ξi ). And so 2 1 2 1 P (ηn = ε. n ∈ Z+ ) is a random 2 . But by the previous n=0 c lemma. ηn = 1) = P (ξn = (1. Let a < 0 < b. P (Tb < Ta ) = We can read this as follows: P (STa ∧Tb = b) = −a . To show that the last two are independent. and by Stirling’s approximation. n ∈ Z ) is a random walk in one dimension. . Lemma 22. +1}. as n → ∞. 2 . ηn = −1) = P (ξn = (0. b−a 103 .d. 35 Skorokhod embedding* We take a closer look at the gambler’s ruin problem for a simple symmetric random walk in one dimension. 1 2 1 2 P (S2n = 0) = P (R2n = 0) = P (R2n = 0. n Corollary 24. But (Rn ) 1 2 is a simple symmetric random walk. ξ 1 − ξ 2 ). (Rn + 2 2 1 2 1 1 we check that ηn = ξn + ξn . 
P (S2n = 0) = Proof. η 2 are ±1. 1)) = 1/4 1 2 P (ηn = −1. So P (R2n = 0) = 2n 2−2n . η . . from Theorem 7. we need to show that ∞ P (Sn = 0) = ∞. ξ2 . ηn = 1) = P (ξn = (0. Recall. n ∈ Z+ ) 1 is a random walk in two dimensions.

Let (Sn ) be a simple symmetric random walk and (pi . How about more complicated distributions? Let suppose we are given a probability distribution (pi . Note. i ∈ Z. Can we ﬁnd a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this. and the probability it exits from the right point is 1 − p. The claim is that P (Z = k) = pk . using this. k ∈ Z. b]. we consider the random variable Z = STA ∧TB . B = j) = c−1 (j − i)pi pj . B) taking values in {(0. B = 0) = p0 . P (A = 0. where c−1 is chosen by normalisation: P (A = 0. i ∈ Z. choose. M. Then Z = 0 if and only if A = B = 0 and this has 104 .g. M < N . Conversely. such that p is rational. Proof. B) are independent of (Sn ). we can always ﬁnd integers a. that ESTa ∧Tb = 0. we get that c is precisely ipi . a < 0 < b such that pa + (1 − p)b = 0. if p = M/N . i ∈ Z). The probability it exits it from the left point is p. given a 2-point probability distribution (p. B = j) = 1. B = 0) + P (A = i. b = N . This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the ﬁrst time it exits the interval [a.This last display tells us the distribution of STa ∧Tb . 1 − p). i<0 j>0 This leads to: c−1 j>0 jpj i<0 pi + c−1 i<0 (−i)pi j>0 i>0 ipi pj = 1 − p0 Since i∈Z ipi = 0. i ∈ Z) a given probability distribution on Z with zero mean: i∈Z ipi = 0. suppose ﬁrst k = 0. To see this. N ∈ N. c= i<0 (−i)pi = i>0 Letting Ti = inf{n ≥ 0 : Sn = i}. Theorem 43. we have this common value: i<0 (−i)pi = and. a = M − N . b. Then there exists a stopping time T such that P (ST = i) = pi . i < 0 < j. Deﬁne a pair of random variables (A. in particular. 0)} ∪ (N × N) with the following distribution: P (A = i. and assuming that (A. Indeed. e.

105 . i<0 = c−1 Similarly. by deﬁnition. B = k) i<0 = = i<0 −i −1 c (k − i)pi pk k−i (−i)pi pk = pk . Suppose next k > 0.probability p0 . We have P (Z = k) = P (STA ∧TB = k) = i<0 j>0 P (STi ∧Tj = k)P (A = i. P (Z = k) = pk for k < 0. B = j) P (STi ∧Tk = k)P (A = i.

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set $\mathbb{N}$ of natural numbers is the set of positive integers: $\mathbb{N} = \{1, 2, 3, \ldots\}$. The set $\mathbb{Z}$ of integers consists of $\mathbb{N}$, their negatives and zero. Zero is denoted by 0.
$$\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots\}.$$
Given integers $a$, $b$, with $b \neq 0$, we say that $b$ divides $a$ if there is an integer $k$ such that $a = kb$. We write this by $b \mid a$. If $c \mid b$ and $b \mid a$ then $c \mid a$. If $a \mid b$ and $b \mid a$ then $|a| = |b|$ (so $a = b$ when both are positive).

Now, if $a$, $b$ are arbitrary integers, with at least one of them nonzero, we can define their greatest common divisor $d = \gcd(a, b)$ as the unique positive integer such that
(i) $d \mid a$, $d \mid b$;
(ii) if $c \mid a$ and $c \mid b$, then $c \mid d$.
(That it is unique is obvious from the fact that if $d_1$, $d_2$ are two such numbers then $d_1$ would divide $d_2$ and vice-versa, leading to $d_1 = d_2$.)

Notice that:
$$\gcd(a, b) = \gcd(a, b - a).$$
Indeed, let $d = \gcd(a, b)$. We will show that $d = \gcd(a, b - a)$. But $d \mid a$ and $d \mid b$. Therefore $d \mid b - a$. So (i) holds. Now suppose that $c \mid a$ and $c \mid b - a$. Then $c \mid a + (b - a)$, i.e. $c \mid b$. Because $c \mid a$ and $c \mid b$, and $d = \gcd(a, b)$, we have $c \mid d$. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: Suppose $0 < a < b$ are integers. Replace $b$ by $b - a$. If $b - a > a$, replace $b - a$ by $b - 2a$. Keep doing this until you obtain $b - qa < a$. Then interchange the roles of the numbers and continue. For instance,
$$\gcd(120, 38) = \gcd(82, 38) = \gcd(44, 38) = \gcd(6, 38)$$
$$= \gcd(6, 32) = \gcd(6, 26) = \gcd(6, 20) = \gcd(6, 14) = \gcd(6, 8) = \gcd(6, 2)$$
$$= \gcd(4, 2) = \gcd(2, 2) = 2.$$
But in the first line we subtracted 38 exactly 3 times. But 3 is the maximum number of times that 38 "fits" into 120. Indeed $4 \times 38 > 120$. In other words, we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. Similarly, with the second line. So we work faster, if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school).
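The two procedures described above (repeated subtraction, and its faster version using remainders) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `gcd_by_subtraction` and `gcd_by_division` are made up for this example and do not come from the notes.

```python
def gcd_by_subtraction(a: int, b: int) -> int:
    """gcd of two positive integers using only gcd(a, b) = gcd(a, b - a)."""
    if a <= 0 or b <= 0:
        raise ValueError("both arguments must be positive")
    while a != b:
        # Always subtract the smaller number from the larger one.
        if a < b:
            b -= a
        else:
            a -= b
    return a


def gcd_by_division(a: int, b: int) -> int:
    """Same result, but a whole run of subtractions becomes one remainder."""
    if a <= 0 or b <= 0:
        raise ValueError("both arguments must be positive")
    while b != 0:
        a, b = b, a % b
    return a


print(gcd_by_subtraction(120, 38))  # 2
print(gcd_by_division(120, 38))     # 2
```

On the example of the text, the division version passes through the pairs (120, 38), (38, 6), (6, 2), (2, 0), which are exactly the line breaks of the subtraction chain.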

The theorem of division says the following: Given two positive integers $b$, $a$, with $a > b$, we can find an integer $q$ and an integer $r \in \{0, 1, \ldots, b - 1\}$, such that
$$a = qb + r.$$
Its proof is obvious. The long division algorithm is a process for finding the quotient $q$ and the remainder $r$. For example, to divide 3450 with 26 we bring down one digit at a time,
$$34 = 1 \times 26 + 8, \qquad 85 = 3 \times 26 + 7, \qquad 70 = 2 \times 26 + 18,$$
and obtain $q = 132$, $r = 18$.

Here are some other facts about the greatest common divisor:

First, we have the fact that if $d = \gcd(a, b)$ then there are integers $x$, $y$ such that
$$d = xa + yb.$$
To see why this is true, let us follow the previous example again:
$$\gcd(120, 38) = \gcd(6, 38) = \gcd(6, 2) = 2.$$
In the process, we used the divisions
$$120 = 3 \times 38 + 6, \qquad 38 = 6 \times 6 + 2.$$
Working backwards, we have
$$2 = 38 - 6 \times 6 = 38 - 6 \times (120 - 3 \times 38) = 19 \times 38 - 6 \times 120.$$
Thus, $\gcd(38, 120) = 38x + 120y$, with $x = 19$, $y = -6$. The same logic works in general.

Second, we can show that, for integers $a, b \in \mathbb{Z}$, not both zero, we have
$$\gcd(a, b) = \min\{xa + yb : x, y \in \mathbb{Z},\ xa + yb > 0\}.$$
To see this, let $d$ be the right hand side. Then $d = xa + yb$, for some $x, y \in \mathbb{Z}$. Suppose that $c \mid a$ and $c \mid b$. Then $c$ divides any linear combination of $a$, $b$; in particular, it divides $d$. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that $d \mid a$ and $d \mid b$.

Suppose (i) is not true, i.e. that $d$ does not divide both numbers, say, e.g., that $d$ does not divide $b$. Since $d > 0$, we can divide $b$ by $d$ to find
$$b = qd + r,$$
for some $1 \le r \le d - 1$. But then $b = q(xa + yb) + r$, yielding
$$r = (-qx)a + (1 - qy)b.$$
So $r$ is expressible as a linear combination of $a$ and $b$ and is strictly smaller than $d$ (being a remainder). So $d$ is not minimum as claimed, and this is a contradiction; so $r = 0$; so $d$ divides both $a$ and $b$.

The greatest common divisor of 3 integers can now be defined in a similar manner. Finally, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set $A$ is a mathematical notion that establishes a way of pairing elements of $A$. For example, the relation $<$ between real numbers can be thought of as consisting of pairs of numbers $(x, y)$ such that $x$ is smaller than $y$. In general, a relation can be thought of as a set of pairs of elements of the set $A$. A relation is called reflexive if it contains all pairs $(x, x)$, $x \in A$. The equality relation $=$ is reflexive. The relation $<$ is not reflexive. A relation is called symmetric if it cannot contain a pair $(x, y)$ without containing its symmetric $(y, x)$. The equality relation is symmetric. The relation $<$ is not symmetric. A relation is transitive if, when it contains $(x, y)$ and $(y, z)$, it also contains $(x, z)$. Both $<$ and $=$ are transitive. If we take $A$ to be the set of students in a classroom, then the relation "$x$ is a friend of $y$" is symmetric, it can be thought of as reflexive (unless we do not want to accept that one may be a friend of oneself), but is not necessarily transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of $x$ contains all $y$ that are related (are equivalent) to $x$. We encountered the equivalence relation "$i$ communicates with $j$" in Markov chain theory.

Functions

A function $f$ from a set $X$ into a set $Y$ is denoted by $f : X \to Y$. The range of $f$ is the set of its values, i.e. the set $\{f(x) : x \in X\}$. A function is called one-to-one (or injective) if two different $x_1, x_2 \in X$ have different values $f(x_1)$, $f(x_2)$. A function is called onto (or surjective) if its range is $Y$. If $B \subset Y$ then $f^{-1}(B)$ consists of all $x \in X$ with $f(x) \in B$. The set of functions from $X$ to $Y$ is denoted by $Y^X$.

Counting

Let $A$, $B$ be finite sets with $n$, $m$ elements, respectively.

The number of functions from $A$ to $B$ (i.e. the cardinality of the set $B^A$) is $m^n$. Indeed, we may think of the elements of $A$ as animals and those of $B$ as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the $m$ available names; so does the second; and the third, and so on. Hence the total number of assignments is
$$\underbrace{m \times m \times \cdots \times m}_{n \text{ times}}.$$
Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need at least as many names as animals: $n \le m$. Each assignment of this type is a one-to-one function from $A$ to $B$. Since the name of the first animal can be chosen in $m$ ways, that of the second in $m - 1$ ways, and so on, the total number of one-to-one functions from $A$ to $B$ is
$$(m)_n := m(m - 1)(m - 2) \cdots (m - n + 1).$$
In the special case $m = n$, we denote $(n)_n$ also by $n!$; in other words, $n!$ stands for the product of the first $n$ natural numbers:
$$n! = 1 \times 2 \times 3 \times \cdots \times n.$$
Incidentally, a one-to-one function from the set $A$ into $A$ is called a permutation of $A$. Thus, $n!$ is the total number of permutations of a set of $n$ elements.

Any set $A$ contains several subsets. If $A$ has $n$ elements, the number of subsets with $k$ elements each can be found by designating which of the $n$ elements of $A$ will be chosen. The first element can be chosen in $n$ ways, the second in $n - 1$, and so on, the $k$-th in $n - k + 1$ ways. So we have
$$(n)_k = n(n - 1)(n - 2) \cdots (n - k + 1)$$
ways to pick $k$ elements if we care about their order. If we don't, then we divide by the number $k!$ of permutations of the $k$ elements, concluding that the number of subsets with $k$ elements of a set with $n$ elements equals
$$\frac{(n)_k}{k!} =: \binom{n}{k},$$
where the symbol $\binom{n}{k}$ is merely a name for the number computed on the left side of this equation. Clearly,
$$\binom{n}{k} = \binom{n}{n-k} = \frac{n!}{k!\,(n - k)!}.$$
Since there is only one subset with no elements (it is the empty set $\emptyset$), we have $\binom{n}{0} = 1$. The total number of subsets of $A$ (regardless of their number of elements) is thus equal to
$$\sum_{k=0}^{n} \binom{n}{k} = (1 + 1)^n = 2^n,$$
and this follows from the binomial theorem. We thus observe that there are as many subsets in $A$ as number of elements in the set $\{0, 1\}^n$ of sequences of length $n$ with elements 0 or 1.
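For small sets, the three counts above ($m^n$ functions, $(m)_n$ one-to-one functions, $2^n$ subsets) can be checked by brute-force enumeration. The following is a minimal sketch using Python's standard `itertools`; the particular animals and names are invented for the illustration.

```python
from itertools import combinations, permutations, product

animals = ["cat", "dog", "pig"]        # the set A, with n = 3 elements
names = ["Ann", "Bob", "Cy", "Dee"]    # the set B, with m = 4 elements
n, m = len(animals), len(names)

# A function A -> B is a choice of one name per animal (repeats allowed).
all_functions = list(product(names, repeat=n))

# A one-to-one function is a choice of n distinct names, in order.
one_to_one = list(permutations(names, n))

# Number of subsets of A with exactly k elements, for k = 0, ..., n.
subsets_by_size = [len(list(combinations(animals, k))) for k in range(n + 1)]

print(len(all_functions), m ** n)                      # 64 64
print(len(one_to_one), m * (m - 1) * (m - 2))          # 24 24
print(subsets_by_size, sum(subsets_by_size), 2 ** n)   # [1, 3, 3, 1] 8 8
```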

We shall let $\binom{n}{k} = 0$ if $n < k$. Also, by convention, if $x$ is a real number, we define
$$\binom{x}{m} = \frac{x(x - 1) \cdots (x - m + 1)}{m!}.$$
If $y$ is not an integer then, by convention, we let $\binom{n}{y} = 0$.

The Pascal triangle relation states that
$$\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1}.$$
Indeed, in choosing $k$ animals from a set of $n + 1$ ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose $k$ animals from the remaining $n$ ones and this can be done in $\binom{n}{k}$ ways. If the pig is included, then there are $k - 1$ remaining animals to be chosen and this can be done in $\binom{n}{k-1}$ ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick $k$ animals from $n + 1$ animals.

Maximum and minimum

The minimum of two numbers $a$, $b$ is denoted by
$$a \wedge b = \min(a, b).$$
Thus $a \wedge b = a$ if $a \le b$ or $= b$ if $b \le a$. The maximum is denoted by
$$a \vee b = \max(a, b).$$
The notation extends to more than two numbers. For example, $a \vee b \vee c$ refers to the maximum of the three numbers $a$, $b$, $c$, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that
$$-(a \vee b) = (-a) \wedge (-b)$$
and
$$(a \wedge b) + c = (a + c) \wedge (b + c).$$
In particular, we use the following notations:
$$a^+ = \max(a, 0), \qquad a^- = -\min(a, 0).$$
Both quantities are nonnegative. $a^+$ is called the positive part of $a$, and $a^-$ is called the negative part of $a$. Note that it is always true that
$$a = a^+ - a^-.$$
The absolute value of $a$ is defined by
$$|a| = a^+ + a^-.$$
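The Pascal triangle relation and the identities for positive and negative parts stated above are easy to check numerically. Below is a small sketch; the helper names `binomial`, `positive_part` and `negative_part` are hypothetical, and returning 0 for $k < 0$ is an extra convention assumed here so that the recursion terminates cleanly.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def binomial(n: int, k: int) -> int:
    """Number of k-element subsets of an n-element set, via Pascal's relation."""
    if k < 0 or k > n:
        return 0  # convention: zero when n < k (and, as assumed here, when k < 0)
    if k == 0:
        return 1  # only the empty set
    # Either the "pig" is left out (first term) or it is included (second term).
    return binomial(n - 1, k) + binomial(n - 1, k - 1)


def positive_part(a: float) -> float:
    return max(a, 0)


def negative_part(a: float) -> float:
    return -min(a, 0)


# First rows of the Pascal triangle.
for row in range(5):
    print([binomial(row, k) for k in range(row + 1)])

# a = a+ - a-  and  |a| = a+ + a-  for a few sample values.
for a in (-3, 0, 2.5):
    print(a, positive_part(a) - negative_part(a), positive_part(a) + negative_part(a))
```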

Indicator function

$1_A$ stands for the indicator function of a set $A$, meaning a function that takes value 1 on $A$ and 0 on the complement $A^c$ of $A$. We also use the notation $1(\text{clause})$ for a number that is 1 whenever the clause is true or 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that
$$1_{A \cap B} = 1_A 1_B, \qquad 1 - 1_A = 1_{A^c}.$$
Several quantities of interest are expressed in terms of indicators. For instance, if $A_1, A_2, \ldots$ are sets or clauses then the number $\sum_{n \ge 1} 1_{A_n}$ is the number of true clauses. For example, if I go to a bank $N$ times and $A_i$ is the clause "I am robbed the $i$-th time" then $\sum_{i=1}^{N} 1_{A_i}$ is the total number of times I am robbed. This is worth remembering.

An example of its application: Since
$$A \cup B = (A^c \cap B^c)^c,$$
we have
$$1_{A \cup B} = 1 - 1_{(A \cup B)^c} = 1 - 1_{A^c \cap B^c} = 1 - 1_{A^c} 1_{B^c} = 1 - (1 - 1_A)(1 - 1_B) = 1_A + 1_B - 1_A 1_B = 1_A + 1_B - 1_{A \cap B}.$$
If $A$ is an event then $1_A$ is a random variable. It takes value 1 with probability $P(A)$ and 0 with probability $P(A^c)$. Therefore its expectation is
$$E 1_A = 1 \times P(A) + 0 \times P(A^c) = P(A).$$
Thus, for instance, we have $E[1_{A \cup B}] = E[1_A] + E[1_B] - E[1_{A \cap B}]$, which means
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
Another example: Let $X$ be a nonnegative random variable with values in $\mathbb{N} = \{1, 2, \ldots\}$. Then
$$X = \sum_{n=1}^{X} 1$$
(the sum of $X$ ones is $X$). So
$$X = \sum_{n \ge 1} 1(X \ge n).$$
So
$$E[X] = \sum_{n \ge 1} E\, 1(X \ge n) = \sum_{n \ge 1} P(X \ge n).$$

Matrices

Let $\mathbb{R}^d$ be the set of vectors with $d$ components. A linear function $\varphi$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ is such that
$$\varphi(\alpha x + \beta y) = \alpha \varphi(x) + \beta \varphi(y),$$

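for all $x, y \in \mathbb{R}^n$, and all $\alpha, \beta \in \mathbb{R}$. Let $u_1, \ldots, u_n$ be a basis for $\mathbb{R}^n$. Then any $x \in \mathbb{R}^n$ can be uniquely written as
$$x = \sum_{j=1}^{n} x^j u_j,$$
where $x^1, \ldots, x^n \in \mathbb{R}$. The column
$$\begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}$$
is a column representation of $x$ with respect to the given basis. Then, because $\varphi$ is linear,
$$y := \varphi(x) = \sum_{j=1}^{n} x^j \varphi(u_j).$$
But the $\varphi(u_j)$ are vectors in $\mathbb{R}^m$. So if we choose a basis $v_1, \ldots, v_m$ in $\mathbb{R}^m$ we can express each $\varphi(u_j)$ as a linear combination of the basis vectors:
$$\varphi(u_j) = \sum_{i=1}^{m} a_{ij} v_i.$$
Combining, we have
$$\varphi(x) = \sum_{i=1}^{m} v_i \sum_{j=1}^{n} a_{ij} x^j.$$
Let
$$y^i := \sum_{j=1}^{n} a_{ij} x^j, \qquad i = 1, \ldots, m.$$
The column
$$\begin{pmatrix} y^1 \\ \vdots \\ y^m \end{pmatrix}$$
is a column representation of $y := \varphi(x)$ with respect to the given basis $v_1, \ldots, v_m$. The relation between the column representation of $y$ and the column representation of $x$ is written as
$$\begin{pmatrix} y^1 \\ \vdots \\ y^m \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}.$$
The matrix $A$ of $\varphi$ with respect to the choice of the two bases is defined by
$$A := \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}.$$
It is an $m \times n$ matrix. Suppose that $\psi$, $\varphi$ are linear functions:
$$\mathbb{R}^n \xrightarrow{\ \psi\ } \mathbb{R}^m \xrightarrow{\ \varphi\ } \mathbb{R}^\ell.$$
Then $\varphi \circ \psi$ is a linear function:
$$\mathbb{R}^n \xrightarrow{\ \varphi \circ \psi\ } \mathbb{R}^\ell.$$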

Similarly. 1/2. . since any increasing sequence has a limit (possibly +∞) we take this limit and call it lim sup xn (limit superior) of the sequence. ψ. a2 . 0 0 1 In other words. sup I is characterised by: sup I ≥ x for all x ∈ I and. x2 . . . . If I has no lower bound then inf I = −∞. a sequence a1 . So lim sup xn = limn→∞ inf{xn . j So A is a ℓ × m matrix. Limsup and liminf If x1 . with respect to these bases.Fix three sets of basis on Rn . i. . Usually. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ. If we ﬁx a basis in Rn . . sup ∅ = −∞. . . . . inf I is characterised by: inf I ≤ x for all x ∈ I and.} ≤ inf{x3 . 1 ≤ j ≤ n. Similarly. . . 1 ≤ i ≤ ℓ. the choice of the basis is the so-called standard basis. . there is x ∈ I with x ≤ ε + inf I. If I is empty then inf ∅ = +∞. respectively. . of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. a square matrix represents a linear transformation from Rn to Rn . we deﬁne max I. . . . · · · }. for any ε > 0. .} ≤ · · · and. there is x ∈ I with x ≥ ε + inf I. deﬁned as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. . ei = 1(i = j). . Then the matrix representation of ϕ ◦ ψ is AB. we deﬁne lim sup xn . Supremum and inﬁmum When we have a set of numbers I (for instance. standing for the “infimum” of I. We note that lim inf(−xn ) = − lim sup xn . Rm and Rℓ and let A. and it is this that we called inﬁmum. and AB is a ℓ × n matrix. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists. xn+1 .} has no minimum. . is a sequence of numbers then inf{x1 . it is its matrix representation with respect to the ﬁxed basis.} ≤ inf{x2 . . The standard basis in Rd consists of the vectors 1 0 0 0 1 0 e 1 = . . where m (AB)ij = k=1 Aik Bkj . we deﬁne the supremum sup I to be the minimum of all upper bounds. . This minimum (or maximum) may not exist. . 1/3. 
113 . for any ε > 0. B is a m × n matrix. . . B be the matrix representations of ϕ. Likewise. In these cases we use the notation inf I. xn+2 . . . If I has no upper bound then sup I = +∞. x2 . x4 . ed = . x3 . A matrix A is called square if it is a n × n matrix. . e2 = . For instance I = {1.e. .
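The tail infima and suprema that define liminf and limsup can be explored on a finite stretch of a sequence. The sketch below is only an approximation device, since a computer sees finitely many terms; the function names and the example sequence $x_n = (-1)^n (1 + 1/n)$ are assumptions made for the illustration. For that sequence, limsup is 1 and liminf is $-1$, so it has no limit.

```python
def tail_infima(xs):
    """inf{x_n, x_{n+1}, ...} for each n, computed within the finite list xs."""
    result, running = [], float("inf")   # inf of the empty set is +infinity
    for x in reversed(xs):
        running = min(running, x)
        result.append(running)
    return result[::-1]


def tail_suprema(xs):
    """sup{x_n, x_{n+1}, ...} for each n, computed within the finite list xs."""
    result, running = [], float("-inf")  # sup of the empty set is -infinity
    for x in reversed(xs):
        running = max(running, x)
        result.append(running)
    return result[::-1]


xs = [(-1) ** n * (1 + 1 / n) for n in range(1, 2001)]
infima, suprema = tail_infima(xs), tail_suprema(xs)

# Far out in the sequence the tail infima are close to -1, the suprema close to 1.
print(infima[1000], suprema[1000])
```

The tail infima form an increasing sequence and the tail suprema a decreasing one, which is why both limits exist.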

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). (Of course, by not requiring this, I do sometimes "cheat" from a strict mathematical point of view, but not too much.) But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides $P(\Omega) = 1$) there is essentially one axiom, namely:

If $A_1, A_2, \ldots$ are mutually disjoint events, then
$$P\Big(\bigcup_{n \ge 1} A_n\Big) = \sum_{n \ge 1} P(A_n).$$
Everything else, in these notes, and everywhere else, follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra that has many axioms is more 'restrictive'.)

If $A$, $B$ are disjoint events then $P(A \cup B) = P(A) + P(B)$. This can be generalised to any finite number of events. If the events $A$, $B$ are not disjoint then $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. This is the inclusion-exclusion formula for two events.

If $A$ is an event then $1_A$ or $1(A)$ is the indicator random variable, i.e. the function that takes value 1 on $A$ and 0 on $A^c$. Therefore, $E 1_A = P(A)$. Note that $1_A 1_B = 1_{A \cap B}$, and $1_{A^c} = 1 - 1_A$.

If $A_n$ are events with $P(A_n) = 0$ then $P(\cup_n A_n) = 0$. Indeed, $P(\cup_n A_n) \le \sum_n P(A_n)$. Similarly, if $A_n$ are events with $P(A_n) = 1$ then $P(\cap_n A_n) = 1$.

Usually, showing that $P(A) \le P(B)$ is more of a logical problem than a numerical one: we need to show that $A$ implies $B$ (i.e. $A$ is a subset of $B$). Similarly, to show that $EX \le EY$, we often need to argue that $X \le Y$. For example, Markov's inequality for a positive random variable $X$ states that $P(X > x) \le EX/x$. This follows from the logical statement $x\, 1(X > x) \le X$, which holds since $X$ is positive.

In Markov chain theory, we work a lot with conditional probabilities. Recall that $P(B|A)$ is defined by $P(A \cap B)/P(A)$ and is a very motivated definition. Often, to compute $P(A \cap B)$ we read the definition as $P(A \cap B) = P(A) P(B|A)$. This is generalisable for a number of events:
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\, P(A_2 | A_1)\, P(A_3 | A_1 \cap A_2).$$
If we interchange the roles of $A$, $B$ in the definition, we get
$$P(A \cap B) = P(A) P(B|A) = P(B) P(A|B)$$
and this is called Bayes' theorem.

Quite often, to compute $P(A)$, it is useful to find disjoint events $B_1, B_2, \ldots$, such that $\cup_n B_n = \Omega$, in which case we write
$$P(A) = \sum_n P(A \cap B_n) = \sum_n P(A | B_n) P(B_n).$$
If the events $A_1, A_2, \ldots$ are increasing with $A = \cup_n A_n$ then $P(A) = \lim_{n \to \infty} P(A_n)$. Similarly, if the events $A_1, A_2, \ldots$ are decreasing with $A = \cap_n A_n$ then $P(A) = \lim_{n \to \infty} P(A_n)$.
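The partition formula $P(A) = \sum_n P(A|B_n) P(B_n)$ and Bayes' theorem stated above combine into a short computation. The sketch below uses exact fractions; the three events $B_1, B_2, B_3$ and all the numbers are invented for the example and carry no meaning beyond it.

```python
from fractions import Fraction as F

# Hypothetical partition B1, B2, B3 of the sample space: P(Bn).
prior = {"B1": F(1, 2), "B2": F(1, 3), "B3": F(1, 6)}
# Hypothetical conditional probabilities P(A | Bn).
likelihood = {"B1": F(1, 10), "B2": F(1, 2), "B3": F(9, 10)}

# P(A) = sum_n P(A | Bn) P(Bn)
p_a = sum(likelihood[b] * prior[b] for b in prior)

# Bayes' theorem: P(Bn | A) = P(A | Bn) P(Bn) / P(A)
posterior = {b: likelihood[b] * prior[b] / p_a for b in prior}

print(p_a)                       # 11/30
print(posterior)
print(sum(posterior.values()))   # 1
```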

An application of the continuity property for increasing events is:
$$P(X < \infty) = \lim_{x \to \infty} P(X \le x).$$
In Markov chain theory, we encounter situations where $P(X = \infty)$ may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable $X$ may be defined by
$$EX = \int_0^\infty P(X > x)\, dx.$$
The formula holds for any positive random variable. For a random variable which takes positive and negative values, we define $X^+ = \max(X, 0)$, $X^- = -\min(X, 0)$, which are both nonnegative, and then define
$$EX = EX^+ - EX^-,$$
which is motivated from the algebraic identity $X = X^+ - X^-$. This definition makes sense only if $EX^+ < \infty$ or if $EX^- < \infty$. In case the random variable is absolutely continuous with density $f$, we have
$$EX = \int_{-\infty}^{\infty} x f(x)\, dx.$$
In case the random variable is discrete, the formula reduces to the known one, viz.,
$$EX = \sum_x x P(X = x).$$
For a random variable that takes values in $\mathbb{N}$ we can write
$$EX = \sum_{n=1}^{\infty} P(X \ge n).$$
This is due to the fact that $P(X \ge n) = E\, 1(X \ge n)$.

Generating functions

A generating function of a sequence $a_0, a_1, \ldots$ is an infinite polynomial having the $a_i$ as coefficients:
$$G(s) = a_0 + a_1 s + a_2 s^2 + \cdots.$$
We do not a priori know that this exists for any $s$ other than, obviously, for $s = 0$ for which $G(0) = a_0$. If it exists for some $s_1$ then it exists uniformly for all $s$ with $|s| < |s_1|$ (and this needs a proof, taught in calculus courses).

As an example, the generating function of the geometric sequence $a_n = \rho^n$, $n \ge 0$ is
$$\sum_{n=0}^{\infty} \rho^n s^n = \frac{1}{1 - \rho s}, \qquad (39)$$
as long as $|\rho s| < 1$, i.e. $|s| < 1/|\rho|$.

Note that $|G(s)| \le \sum_n |a_n| |s|^n$ and this is $\le \sum_n |a_n|$ if $s \in [-1, 1]$. So if $\sum_n |a_n| < \infty$ then $G(s)$ exists for all $s \in [-1, 1]$, making it a useful object when $a_0, a_1, \ldots$ is a probability distribution on $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$ because then $\sum_n a_n = 1$.

Suppose now that $X$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$. The generating function of the sequence $p_n = P(X = n)$, $n = 0, 1, \ldots$ is called probability generating function of $X$:
$$\varphi(s) = E s^X = \sum_{n=0}^{\infty} s^n P(X = n).$$

The function $\varphi(s)$ is defined for all $-1 < s < 1$. Note that
$$\sum_{n=0}^{\infty} p_n \le 1,$$
and
$$p_\infty := P(X = \infty) = 1 - \sum_{n=0}^{\infty} p_n.$$
We do allow $X$ to, possibly, take the value $+\infty$. The term $s^\infty p_\infty$ could formally appear in the sum defining $\varphi(s)$, but, since $|s| < 1$, we have $s^\infty = 0$.

Note that $\varphi(0) = P(X = 0)$, while
$$\varphi(1) = \sum_{n=0}^{\infty} P(X = n) = P(X < +\infty).$$
We can recover the probabilities $p_n$ from the formula
$$p_n = \frac{\varphi^{(n)}(0)}{n!},$$
as follows from Taylor's theorem, and this is why $\varphi$ is called probability generating function. Also, we can recover the expectation of $X$ from
$$EX = \lim_{s \uparrow 1} \varphi'(s),$$
something that can be formally seen from
$$\frac{d}{ds} E s^X = E \frac{d}{ds} s^X = E X s^{X-1},$$
and by letting $s = 1$.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
$$x_{n+2} = x_{n+1} + x_n, \qquad n \ge 0,$$
with $x_0 = x_1 = 1$. If we let
$$X(s) = \sum_{n \ge 0} x_n s^n$$
be the generating function of $(x_n, n \ge 0)$ then the generating function of $(x_{n+1}, n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+1} s^n = s^{-1}(X(s) - x_0)$$
and the generating function of $(x_{n+2}, n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+2} s^n = s^{-2}(X(s) - x_0 - s x_1).$$

From the recursion $x_{n+2} = x_{n+1} + x_n$ we have (and this is where linearity is used) that the generating function of $(x_{n+2}, n \ge 0)$ equals the sum of the generating functions of $(x_{n+1}, n \ge 0)$ and $(x_n, n \ge 0)$, namely,
$$s^{-2}(X(s) - x_0 - s x_1) = s^{-1}(X(s) - x_0) + X(s).$$
Since $x_0 = x_1 = 1$, we can solve for $X(s)$ and find
$$X(s) = \frac{-1}{s^2 + s - 1}.$$
But the polynomial $s^2 + s - 1$ has two roots:
$$a = (\sqrt{5} - 1)/2, \qquad b = -(\sqrt{5} + 1)/2.$$
Hence $s^2 + s - 1 = (s - a)(s - b)$, and so
$$X(s) = \frac{1}{a - b}\left(\frac{1}{s - b} - \frac{1}{s - a}\right) = \frac{1}{b - a}\left(\frac{1}{b - s} - \frac{1}{a - s}\right).$$
But (39) tells us that $\rho^n$ has generating function $\frac{1}{1 - \rho s} = \frac{1/\rho}{(1/\rho) - s}$. Thus $\frac{1}{(1/\rho) - s}$ is the generating function of $\rho^{n+1}$. This tells us that $\frac{1}{b - s}$ is the generating function of $(1/b)^{n+1}$ and that $\frac{1}{a - s}$ is the generating function of $(1/a)^{n+1}$. Hence $X(s)$ is the generating function of
$$x_n = \frac{1}{b - a}\left((1/b)^{n+1} - (1/a)^{n+1}\right).$$
Noting that $ab = -1$, we can also write this as
$$x_n = \frac{(-a)^{n+1} - (-b)^{n+1}}{b - a}.$$
This is the solution to the recursion, i.e. the $n$-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!)

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable $X$ with values in $\mathbb{Z}$ or $\mathbb{R}$ or $\mathbb{R}^n$ or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form $P(X \in B)$ where $B$ is a set of values of $X$. So, for example, if $X$ is a random variable with values in $\mathbb{R}$, then the so-called distribution function, i.e. the function
$$P(X \le x), \qquad x \in \mathbb{R},$$
is such an example. But I could very well use the function
$$P(X > x), \qquad x \in \mathbb{R},$$
instead.

Another choice is the 2-parameter function $P(a < X < b)$. If $X$ is absolutely continuous, I could use its density
$$f(x) = \frac{d}{dx} P(X \le x).$$
All these objects can be referred to as "distribution" or "law" of $X$. Even the generating function $E s^X$ of a $\mathbb{Z}_+$-valued random variable $X$ is an example, for it does allow us to specify the probabilities $P(X = n)$, $n \in \mathbb{Z}_+$.

Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces, for example they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state $i$. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute $n! = 1 \times 2 \times \cdots \times n$ for large $n$ then, since the number is so big, it's better to use logarithms:
$$n! = e^{\log n!} = \exp \sum_{k=1}^{n} \log k.$$
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since $\log x$ does not vary too much between two successive positive integers $k$ and $k + 1$ it follows that
$$\int_k^{k+1} \log x\, dx \approx \log k.$$
Hence
$$\sum_{k=1}^{n} \log k \approx \int_1^n \log x\, dx.$$
Since $\frac{d}{dx}(x \log x - x) = \log x$, it follows that
$$\int_1^n \log x\, dx = n \log n - n + 1 \approx n \log n - n.$$
Hence we have
$$n! \approx \exp(n \log n - n) = n^n e^{-n}.$$
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) We do not know in what sense the symbol $\approx$ should be interpreted. (b) The approximation is not good when we consider ratios of products. For example (try it!) the rough Stirling approximation gives that the probability that, in $2n$ fair coin tosses, we have exactly $n$ heads equals $\binom{2n}{n} 2^{-2n} \approx 1$.
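The "try it!" above can be taken literally. The following sketch compares the rough Stirling approximation $n! \approx n^n e^{-n}$ with exact values, first for $\log n!$ and then for the probability of exactly $n$ heads in $2n$ fair coin tosses. The function names are made up for this illustration; `math.lgamma(n + 1)` is used as a reference value for $\log n!$.

```python
import math


def rough_log_factorial(n: int) -> float:
    """log of the rough Stirling approximation n^n e^{-n}."""
    return n * math.log(n) - n


def exact_probability(n: int) -> float:
    """P(exactly n heads in 2n fair tosses) = C(2n, n) / 4^n, computed exactly."""
    return math.comb(2 * n, n) / 4 ** n


def rough_probability(n: int) -> float:
    """The same probability with every factorial replaced by n^n e^{-n}."""
    log_p = rough_log_factorial(2 * n) - 2 * rough_log_factorial(n) - 2 * n * math.log(2)
    return math.exp(log_p)


for n in (10, 100, 1000):
    relative_error = 1 - rough_log_factorial(n) / math.lgamma(n + 1)
    print(n, relative_error, exact_probability(n), rough_probability(n))
```

On the logarithmic scale the relative error shrinks as $n$ grows, while the rough estimate of the coin-tossing probability stays at 1 (up to rounding) even though the exact probability keeps decreasing.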

The estimate $\binom{2n}{n} 2^{-2n} \approx 1$ is terrible, because we know that the probability goes to zero as $n \to \infty$. To remedy this, we shall prove Stirling's approximation:
$$n! \sim n^n e^{-n} \sqrt{2 \pi n},$$
where the symbol $a_n \sim b_n$ means that $a_n / b_n \to 1$, as $n \to \infty$.

The geometric distribution

We say that the random variable $X$ with values in $\mathbb{Z}_+$ is geometric if it has the memoryless property:
$$P(X - n \ge m \mid X \ge n) = P(X \ge m), \qquad 0 \le n \le m.$$
If we think of $X$ as the time of occurrence of a random phenomenon, then $X$ is such that knowledge that $X$ has exceeded $n$ does not make the distribution of the remaining time $X - n$ any different from $X$.

We have already seen a geometric random variable. That was the random variable $J_0 = \sum_{n=1}^{\infty} 1(S_n = 0)$, defined in (34). It was shown in Theorem 38 that $J_0$ satisfies the memoryless property. In fact it was shown there that $P(J_0 \ge k) = P(J_0 \ge 1)^k$, a geometric function of $k$.

The same process shows that a geometric random variable has a distribution of the form
$$P(X \ge n) = \lambda^n, \qquad n \in \mathbb{Z}_+,$$
where $\lambda = P(X \ge 1)$. We baptise this geometric($\lambda$) distribution.

Markov's inequality

Arguably this is the most useful inequality in Probability, and is based on the concept of monotonicity. Surely, you have seen it before, so this little section is a reminder.

Consider an increasing and nonnegative function $g : \mathbb{R} \to \mathbb{R}_+$. We can cast both properties of $g$ neatly by writing:
$$\text{For all } x, y \in \mathbb{R}, \qquad g(y) \ge g(x)\, 1(y \ge x).$$
Indeed, if $y \ge x$ then $1(y \ge x) = 1$ and the display tells us $g(y) \ge g(x)$, expressing monotonicity. And if $y < x$ then $1(y \ge x) = 0$ and the display reads $g(y) \ge 0$, expressing non-negativity. Let $X$ be a random variable. Then, for all $x$,
$$g(X) \ge g(x)\, 1(X \ge x),$$
and so, by taking expectations of both sides (expectation is an increasing operator),
$$E g(X) \ge g(x)\, P(X \ge x).$$
This completely trivial thought gives the very useful Markov's inequality.
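As a sanity check, Markov's inequality and the memoryless property can both be verified exactly for a geometric($\lambda$) random variable. The sketch below assumes the hypothetical value $\lambda = 1/2$ and takes $g(y) = \max(y, 0)$, which is increasing and nonnegative, so that $E g(X) = EX = \sum_{n \ge 1} P(X \ge n) = \lambda/(1 - \lambda)$.

```python
from fractions import Fraction

lam = Fraction(1, 2)  # hypothetical parameter, lambda = P(X >= 1)


def tail(n: int) -> Fraction:
    """P(X >= n) = lambda^n for a geometric(lambda) random variable."""
    return lam ** n


# EX = sum over n >= 1 of lambda^n, a geometric series.
expectation = lam / (1 - lam)

for x in range(0, 8):
    lower_bound = x * tail(x)  # g(x) P(X >= x) with g(y) = max(y, 0)
    print(x, lower_bound, lower_bound <= expectation)

# Memoryless property: P(X - n >= m | X >= n) = P(X >= n + m) / P(X >= n).
n, m = 3, 5
print(tail(n + m) / tail(n) == tail(m))  # True
```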


