Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1 Introduction
2 Examples of Markov chains
   2.1 A mouse in a cage
   2.2 Bank account
   2.3 Simple random walk (drunkard's walk)
   2.4 Simple random walk (drunkard's walk) in a city
   2.5 Actuarial chains
3 The Markov property
   3.1 Definition of the Markov property
   3.2 Transition probability and initial distribution
   3.3 Time-homogeneous chains
4 Stationarity
   4.1 Finding a stationary distribution
5 Topological structure
   5.1 The graph of a chain
   5.2 The relation "leads to"
   5.3 The relation "communicates with"
   5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1 First hitting time decomposition
   10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1 Functions of excursions
   13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1 Definition of stability
   18.2 The fundamental stability theorem for Markov chains
   18.3 Coupling
   18.4 Proof of the fundamental stability theorem
   18.5 Terminology
   18.6 Periodic chains
19 Limiting theory for Markov chains*
   19.1 Limiting probabilities for null recurrent states
   19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1 Subsequences
   23.2 Watching a Markov chain when it visits a set
   23.3 Subordination
   23.4 Deterministic function of a Markov chain
24 Applications
   24.1 Branching processes: a population growth application
   24.2 The ALOHA protocol: a computer communications application
   24.3 PageRank (trademark of Google): A World Wide Web application
   24.4 The Wright-Fisher model: an example from biology
   24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1 Paths that start and end at specific points
   26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1 Distribution after n steps
   27.2 The ballot theorem
   27.3 Some conditional probabilities
   27.4 Some remarkable identities
   27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1 First hitting time analysis
   30.2 First return to the origin
   30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1 Asymptotic distribution of the maximum
   31.2 Return probabilities
   31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix. A few starred sections should be considered "advanced" and can be omitted at first reading.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: mẍ = F. Here, x = x(t) denotes the position of the particle at time t. Its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It has the property that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t). In other words, past states are not needed.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as
    xn+1 = yn
    yn+1 = rem(xn, yn),   n = 0, 1, 2, . . .
In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn. It is not hard to see that, by repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor.
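For readers who want to try this on a computer, here is a minimal Python sketch of the gcd dynamical system; the function name is ours, and the code is an illustration rather than part of the formal development:

    def gcd_state_sequence(a, b):
        """Iterate the deterministic dynamical system
        (x_{n+1}, y_{n+1}) = (y_n, rem(x_n, y_n)),
        started from (max(a, b), min(a, b)), until one coordinate is 0."""
        x, y = max(a, b), min(a, b)
        states = [(x, y)]
        while y != 0:
            x, y = y, x % y   # x % y is the remainder rem(x, y)
            states.append((x, y))
        return states

    # The nonzero coordinate of the final state is gcd(a, b):
    print(gcd_state_sequence(252, 105))  # [(252, 105), (105, 42), (42, 21), (21, 0)]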
Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables, instead of deterministic objects.

Other examples of dynamical systems are the algorithms run by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic, or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫_a^b f(x)dx of a complicated function f: suppose that f is positive and bounded above by a constant B. Choose a pair (X0, Y0) of random numbers uniformly, with a ≤ X0 ≤ b and 0 ≤ Y0 ≤ B. Count 1 if Y0 < f(X0) and 0 otherwise. Repeat with (Xn, Yn), for steps n = 1, 2, . . .. Perform this a large number N of times, count the number of 1's and divide by N. The resulting ratio is an approximation of the fraction of the rectangle lying below the graph of f; multiplying it by the area (b − a)B of the rectangle gives an approximation of ∫_a^b f(x)dx. Try this on the computer!

¹ Stochastic means random.
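Here is one possible Python sketch of the Monte Carlo method just described (all names and parameters are ours):

    import random

    def monte_carlo_integral(f, a, b, B, N=100_000):
        """Estimate the integral of f over [a, b], assuming 0 <= f <= B there,
        by throwing N uniform points into the rectangle [a, b] x [0, B] and
        counting the fraction that falls below the graph of f."""
        count = sum(1 for _ in range(N)
                    if random.uniform(0, B) < f(random.uniform(a, b)))
        return (count / N) * (b - a) * B   # rescale by the rectangle's area

    # Example: the integral of x^2 over [0, 1] is 1/3.
    print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))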
We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains, via the use of the tools of Probability. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; it does so regardless of where it was at earlier times. Similarly, it moves from 2 to 1 with probability β = 0.99. We can summarise this information by the transition diagram:

[transition diagram: states 1 and 2; arrow 1 → 2 with probability α, arrow 2 → 1 with probability β, self-loops with probabilities 1 − α and 1 − β]

Another way to summarise the information is by the 2×2 transition probability matrix

    P = ( 1−α    α  )  =  ( 0.95  0.05 )
        (  β    1−β )     ( 0.99  0.01 )

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:
    Xn+1 = max{Xn + Dn − Sn, 0}.
Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, . . . and S0, S1, S2, . . . are random variables with distributions FD, FS, respectively, i.e. P(Dn = z) = FD(z) and P(Sn = z) = FS(z). Assume also that X0, D0, D1, D2, . . ., S0, S1, S2, . . . are independent random variables. The information described above by the evolution of the account, together with the distributions for Dn and Sn, leads to the (one-step) transition probability which is defined by:
    px,y := P(Xn+1 = y | Xn = x),   x, y = 0, 1, 2, . . .
We easily see that if y > 0 then
    px,y = P(Dn − Sn = y − x)
         = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x)
         = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x)
         = Σ_{z=0}^∞ FS(z) FD(z + y − x),
while, if y = 0,
    px,0 = 1 − Σ_{y=1}^∞ px,y,
something that, in principle, can be computed from FS and FD.
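If FD and FS have finite support, the convolution formula for px,y can be evaluated mechanically. A small illustrative Python sketch, with made-up monthly distributions:

    def transition_prob(x, y, FS, FD):
        """p_{x,y} = P(X_{n+1} = y | X_n = x) for the bank account chain,
        where FS and FD are dicts mapping values to probabilities (pmfs)."""
        if y > 0:
            # p_{x,y} = sum_z P(S_n = z) P(D_n = z + y - x)
            return sum(ps * FD.get(z + y - x, 0.0) for z, ps in FS.items())
        # y = 0 collects all outcomes where x + D - S <= 0.
        return sum(ps * pd for z, ps in FS.items()
                   for w, pd in FD.items() if x + w - z <= 0)

    # Hypothetical distributions: deposit 0, 1 or 2 pounds; spend 0 or 1.
    FD = {0: 0.2, 1: 0.5, 2: 0.3}
    FS = {0: 0.6, 1: 0.4}
    print(transition_prob(0, 1, FS, FD))  # P(D - S = 1) = 0.5*0.6 + 0.3*0.4 = 0.42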

The transition probability matrix is

    P = ( p0,0  p0,1  p0,2  · · · )
        ( p1,0  p1,1  p1,2  · · · )
        ( p2,0  p2,1  p2,2  · · · )
        (  · · ·  · · ·  · · ·    )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?
We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with the same probability that he moves backwards.

[diagram: the integers . . . , −2, −1, 0, 1, 2, 3, . . . in a row, with an arrow of probability 1/2 to the left and to the right at each site]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
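Question 4 invites experimentation on a computer. One warning: for the completely drunk walker (p = 1/2) the mean time to reach home turns out to be infinite, so the illustrative sketch below (names and parameters ours) uses a slightly sober walker with p = 0.6:

    import random

    def time_to_reach(target=100, p=0.6, start=0):
        """Simulate a simple random walk (step +1 w.p. p, -1 w.p. 1-p)
        and return the number of steps until it first hits `target`."""
        pos, steps = start, 0
        while pos != target:
            pos += 1 if random.random() < p else -1
            steps += 1
        return steps

    # Estimate the mean pub-to-home time for p = 0.6.
    trials = [time_to_reach() for _ in range(1000)]
    print(sum(trials) / len(trials))   # should be close to 100/(2p-1) = 500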

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[figure: a square grid of streets; illustration: Salvador Dali (1922), The Drunkard]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and, let's say he's completely drunk, so we assign probability 1/4 to each move. We may again want to ask similar questions. But observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[transition diagram: states H (healthy), S (sick), D (dead); arrows H → S with probability 0.3, S → H with probability 0.8, H → D with probability 0.01, S → D with probability 0.1]

Thus, given that he is currently healthy, there is probability pH,S = 0.3 that the person will become sick. Note that the diagram omits pH,H, pS,S and pD,D. Clearly, pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
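To get a feeling for the model, one can simulate lifetimes directly. The sketch below hardwires the transition probabilities as we have read them off the diagram, so treat the numbers as assumptions:

    import random

    # Transition rows for the health chain (states H, S; D is absorbing).
    # Off-diagonal entries as in the diagram; self-loops make rows sum to 1.
    P = {
        'H': {'H': 0.69, 'S': 0.30, 'D': 0.01},
        'S': {'H': 0.80, 'S': 0.10, 'D': 0.10},
    }

    def lifetime(start='H'):
        """Number of months until absorption in D, starting from `start`."""
        state, months = start, 0
        while state != 'D':
            nxt, weights = zip(*P[state].items())
            state = random.choices(nxt, weights=weights)[0]
            months += 1
        return months

    samples = [lifetime() for _ in range(10_000)]
    print(sum(samples) / len(samples))   # estimated mean lifetime in months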

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property. Usually, T = Z+ = {0, 1, 2, 3, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . . , in+1 ∈ S,
    P(Xn+1 = in+1 | Xn = in, . . . , X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, . . . , Xn−1) given Xn. On the other hand, if the claim holds, we can see, by applying it several times, that we obtain
    P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),
for any m, which precisely means that (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) are independent conditionally on Xn. This is true for any n and m, and so the future is independent of the past given the present.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as
    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in)
      = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions
    µ(i) := P(X0 = i),   i ∈ S,
    pi,j(n, n + 1) := P(Xn+1 = j | Xn = i),   i, j ∈ S, n ≥ 0,
will specify all joint distributions²:
    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n − 1, n).
The function µ(i) is called initial distribution. The function pi,j(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally,
    pi,j(m, n) := P(Xn = j | Xm = i)
is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by a result in Probability,
    pi,j(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} pi,i1(m, m + 1) pi1,i2(m + 1, m + 2) · · · pin−1,j(n − 1, n),   (2)
which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix
    P(m, n) := [pi,j(m, n)]i,j∈S,
a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as
    P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),
or as
    P(m, n) = P(m, ℓ) P(ℓ, n)   if m ≤ ℓ ≤ n,
something that can be called semigroup property or Chapman-Kolmogorov equation.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n.

² And, of course, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

In that case we can simply write
    pi,j(n, n + 1) =: pi,j,
and then, for all m ≤ n,
    P(m, n) = P × P × · · · × P = P^(n−m)   (n − m times).
The semigroup property is now even more obvious, because
    P^(a+b) = P^a P^b,   for all integers a, b ≥ 0.
From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain". The matrix P = [pi,j]i,j∈S is called the (one-step) transition probability matrix or, simply, transition matrix, and we call pi,j the (one-step) transition probability from state i to state j.

The matrix notation is useful. For example, let µn := [µn(i) = P(Xn = i)]i∈S be the distribution of Xn, arranged in a row vector. Then we have
    µn = µ0 P^n,
as follows from (1) and the definition of P.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi:
    δi(j) := 1, if j = i;   0, otherwise.   (3)
We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1; starting from a specific state i means taking δi as initial distribution. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = pδi1 + (1 − p)δi2.

A useful convention used for time-homogeneous chains is to write
    Pi(A) := P(A | X0 = i).
Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write
    Ei Y := E(Y | X0 = i) = Σ_y y P(Y = y | X0 = i) = Σ_y y Pi(Y = y).
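These matrix formulas are easy to play with numerically. A small numpy sketch for the mouse chain of Section 2.1:

    import numpy as np

    alpha, beta = 0.05, 0.99
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])     # one-step transition matrix

    mu0 = np.array([1.0, 0.0])           # unit mass delta_1: start in cell 1

    # Distribution of X_n is mu_n = mu_0 P^n.
    for n in (1, 10, 100):
        print(n, mu0 @ np.linalg.matrix_power(P, n))

    # Semigroup (Chapman-Kolmogorov) check: P^(5+7) equals P^5 P^7.
    assert np.allclose(np.linalg.matrix_power(P, 12),
                       np.linalg.matrix_power(P, 5) @ np.linalg.matrix_power(P, 7))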

4 Stationarity

We say that the process (Xn, n = 0, 1, . . .) is stationary if it has the same law as (Xn, n = m, m + 1, . . .), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following examples:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Let Xn be the number of molecules in room 1 at time n. Then
    pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
    pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.
If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. On the average, we indeed expect that we have N/2 molecules in each room. But this will not work, because it is impossible for the number of molecules to remain constant at all times.

Another guess then is: place each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be i with probability
    π(i) = (N choose i) 2^(−N),   0 ≤ i ≤ N
(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,
    P(X1 = i) = Σ_j P(X0 = j) pj,i = P(X0 = i − 1) pi−1,i + P(X0 = i + 1) pi+1,i
             = π(i − 1) (N − i + 1)/N + π(i + 1) (i + 1)/N = · · · = π(i).
(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: we found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that
    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = P(Xm = i0, Xm+1 = i1, Xm+2 = i2, . . . , Xm+n = in),
for all m ≥ 0 and i0, . . . , in ∈ S.

Since the distribution µn at time n is related to the distribution at time 0 by µn = µ0 P^n, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that
    πP = π.   (4)
Such a π is called a stationary or invariant distribution. The equations (4) are called balance equations.
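Before developing the theory, you can check such guesses numerically. A sketch verifying that the binomial distribution solves the balance equations of the Ehrenfest chain for a small N:

    import numpy as np
    from math import comb

    N = 10
    # Ehrenfest transition matrix on states 0, 1, ..., N.
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N:
            P[i, i + 1] = (N - i) / N
        if i > 0:
            P[i, i - 1] = i / N

    # Candidate stationary distribution: binomial(N, 1/2).
    pi = np.array([comb(N, i) * 2.0**(-N) for i in range(N + 1)])

    print(np.allclose(pi @ P, pi))   # True: pi satisfies pi P = pi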

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P has the eigenvalue 1 always, because Σ_{j∈S} pi,j = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column vector whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to that, π must be a probability:
    Σ_{i∈S} π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, . . . , d}. Each x ∈ Sd is of the form x = (x1, . . . , xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:
    π(x) = 1/d!,   x ∈ Sd.

Example: waiting for a bus. The times between successive buses at a bus stop are i.i.d. random variables: after a bus arrives, the next one will arrive in i minutes with probability p(i), i = 1, 2, . . .. Consider the following process: at time n (measured in minutes) let Xn be the time till the arrival of the next bus. The state space is S = N = {1, 2, . . .}. Then (Xn, n ≥ 0) is a Markov chain with transition probabilities
    pi+1,i = 1,   i ≥ 1,
    p1,i = p(i),   i ∈ N.
To find the stationary distribution, write the equations (4):
    π(i) = Σ_{j≥1} π(j) pj,i = π(1)p(i) + π(i + 1),   i ≥ 1.
These can easily be solved:
    π(i) = π(1) Σ_{j≥i} p(j),   i ≥ 1.
To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives
    π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^(−1).
But here we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:
    Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),
and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:
    ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.
Incidentally, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength.) Consider an infinite binary vector ξ = (ξ1, ξ2, . . .), ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:
    ξ(n + 1) = ϕ(ξ(n)).
Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:
    P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2,   for any n = 0, 1, 2, . . . and any r = 1, 2, . . .
The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε1, . . . , εr ∈ {0, 1},
    P(ξ1(n + 1) = ε1, . . . , ξr(n + 1) = εr)
      = P(ξ1(n) = 0, ξ2(n) = ε1, . . . , ξr+1(n) = εr)
      + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . . , ξr+1(n) = 1 − εr).   (5)
You can check that the claim follows from (5). Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread: it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: when a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies
    π = πP   (balance equations)
    π1 = 1   (normalisation condition).
In other words,
    π(i) = Σ_{j∈S} π(j) pj,i,   i ∈ S,
    Σ_{j∈S} π(j) = 1.
If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement, under a distribution π:
    F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j.
Think of π(i) pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and
    F(A, Ac) = F(Ac, A),   for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write
    π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.
Multiply π(i) by Σ_{j∈S} pi,j (which equals 1):
    π(i) Σ_{j∈S} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Now split the left sum also:
    Σ_{j∈A} π(i) pi,j + Σ_{j∈Ac} π(i) pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.
Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

[figure: schematic representation of the flow balance relations: the current F(A, Ac) out of A equals the current F(Ac, A) into A]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β. (Consequently, p11 = 1 − α, p22 = 1 − β.) Assume 0 < α, β < 1. We have
    π(1)α = π(2)β,
which immediately tells us that π(1) is proportional to β and π(2) proportional to α:
    π(1) = cβ,   π(2) = cα.
The constant c is found from π(1) + π(2) = 1. Thus,
    π(1) = β/(α + β),   π(2) = α/(α + β).
If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952 and π(2) = 0.05/1.04 ≈ 0.048. That is, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

[diagram: states 0, 1, 2, 3, 4, . . . in a row; from each state i an arrow to i + 1 with probability p, and from each state i ≥ 1 an arrow to i − 1 with probability q]

The balance equations are:
    π(i) = pπ(i − 1) + qπ(i + 1),   i = 1, 2, 3, . . .

But we can make them simpler by choosing A = {0, 1, . . . , i − 1}. If i ≥ 1 we have
    F(A, Ac) = π(i − 1)p = π(i)q = F(Ac, A).
These are much simpler than the previous ones. In fact, we can solve them immediately: for i ≥ 1,
    π(i) = (p/q)^i π(0).
Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space, regarded as a set of vertices, and E is a set of directed edges, E ⊂ S × S, defined by
    (i, j) ∈ E ⇐⇒ pi,j > 0.
We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time with positive probability. In other words,
    i leads to j (written as i ⇝ j) ⇐⇒ Pi(∃ n ≥ 0 Xn = j) > 0.
Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, . . . , in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.

• Suppose that i ⇝ j. Since
    0 < Pi(∃ n ≥ 0 Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),
the sum on the right must have a nonzero term, and so (iii) holds.
• Suppose that (iii) holds. We have
    Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 · · · pin−1,j,   (6)
where the sum extends over all possible choices of states in S. By assumption, we have that the sum is positive, and so one of its terms is positive, meaning that (ii) holds.
• Suppose now that (ii) holds. Look at the last display again: by (6), Pi(Xn = j) is at least as large as the positive term, so Pi(Xn = j) > 0. Because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j), (i) holds.

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation
    i communicates with j (written as i ↭ j) ⇐⇒ i ⇝ j and j ⇝ i.
Obviously, this relation is symmetric (i ↭ j ⇐⇒ j ↭ i).

Corollary 2. The relation ↭ is an equivalence relation. (Equivalence relation means that it is reflexive, symmetric and transitive.)

Proof. It is reflexive and transitive because ⇝ is so, and we just observed that it is symmetric.

Just as any equivalence relation, ↭ partitions S into equivalence classes, known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:
    [i] := {j ∈ S : j ↭ i}.
So, by definition, [i] = [j] if and only if i ↭ j. Two communicating classes are either identical or completely disjoint.

[figure: a chain with states 1, 2, 3, 4, 5, 6]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if
    Σ_{j∈C} pi,j = 1,   for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: if i ∈ C and if i ⇝ j, then j ∈ C.

Proof. Suppose first that C is a closed communicating class, that i ∈ C, and that i ⇝ j. We can then find states i1, . . . , in−1 such that pi,i1 pi1,i2 · · · pin−1,j > 0. Since i ∈ C and pi,i1 > 0, we have i1 ∈ C. Similarly, pi1,i2 > 0, and so i2 ∈ C, and so on. Thus, we conclude that j ∈ C. The converse is left as an exercise.

Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.

[figure: the general structure of a Markov chain, with communicating classes C1, C2, C3, C4, C5]

The classes C4, C5 in the figure above are closed communicating classes; the communicating classes C1, C2, C3 are not closed.

If a state i is such that pi,i = 1 then we say that i is an absorbing state. In other words, the single-element set {i} is then a closed communicating class.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In graph terms: starting from any state, we can reach any other state. In other words, the state space S is then a single closed communicating class. A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes, then we obtain a collection of disjoint closed communicating classes with no arrows between them.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. (Why?)

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. An inessential state i is such that there exists a state j with i ⇝ j but such that j does not lead back to i.
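Communicating classes can be computed mechanically for a finite chain: i ⇝ j is reachability in the digraph, and i ↭ j is mutual reachability. A Python sketch (the edge set below encodes the 6-state figure as we read it, so treat it as an assumption):

    def reachable(edges, i):
        """Set of states j with i 'leads to' j (including i itself)."""
        seen, stack = {i}, [i]
        while stack:
            u = stack.pop()
            for v in edges.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def communicating_classes(edges, states):
        """Partition `states` into classes of mutually reachable states."""
        reach = {i: reachable(edges, i) for i in states}
        classes = []
        for i in states:
            cls = frozenset(j for j in states if j in reach[i] and i in reach[j])
            if cls not in classes:
                classes.append(cls)
        return classes

    # Hypothetical edge set for the 6-state figure.
    edges = {1: [2], 2: [2, 3], 3: [2, 4], 4: [5], 5: [4], 6: [6]}
    for c in communicating_classes(edges, range(1, 7)):
        closed = all(v in c for u in c for v in edges.get(u, []))
        print(sorted(c), "closed" if closed else "not closed")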

5.4 Period

Consider the following chain:

[figure: four states 1, 2, 3, 4 arranged in a cycle, with two-way arrows 1 ↔ 2 ↔ 3 ↔ 4 ↔ 1]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by observing that if the chain is in the set C1 = {1, 3} at some point of time, then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1, then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by p^(n)ij the entries of P^n. Since P^(m+n) = P^m P^n, we have
    p^(m+n)ij ≥ p^(m)ik p^(n)kj,   for all m, n ≥ 0 and all i, j, k ∈ S.
So
    p^(2n)ii ≥ (p^(n)ii)^2,
and, more generally,
    p^(ℓn)ii ≥ (p^(n)ii)^ℓ,   ℓ ∈ N.
Therefore if p^(n)ii > 0 then p^(ℓn)ii > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if p^(n)ii > 0, then p^(m)ii > 0. In other words, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

The period of an essential state i is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:
    d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.
A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.
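The period can be computed by brute force from the definition, by checking which powers of P have a positive (i, i) entry. A sketch (the example matrix encodes the 4-state chain of the figure as we read it):

    import numpy as np
    from math import gcd
    from functools import reduce

    def period(P, i, nmax=100):
        """gcd of all n <= nmax with (P^n)_{ii} > 0 -- the period d(i),
        provided nmax is large enough for the gcd to stabilise."""
        returns = []
        Q = np.eye(len(P))
        for n in range(1, nmax + 1):
            Q = Q @ P
            if Q[i, i] > 0:
                returns.append(n)
        return reduce(gcd, returns) if returns else None

    # Two-way cycle 1 <-> 2 <-> 3 <-> 4 <-> 1, probability 1/2 each way.
    P = np.array([[0, .5, 0, .5],
                  [.5, 0, .5, 0],
                  [0, .5, 0, .5],
                  [.5, 0, .5, 0]])
    print(period(P, 0))   # 2: returns are only possible in an even number of steps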

Theorem 2 (the period is a class property). If i ↭ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let
    Di = {n ∈ N : p^(n)ii > 0},   Dj = {n ∈ N : p^(n)jj > 0},
so that d(i) = gcd Di and d(j) = gcd Dj. Since i ↭ j, there is α ∈ N with p^(α)ij > 0 and there is β ∈ N with p^(β)ji > 0. Then
    p^(α+β)jj ≥ p^(β)ji p^(α)ij > 0,
showing that α + β ∈ Dj. Furthermore, for any n ∈ Di,
    p^(β+n+α)jj ≥ p^(β)ji p^(n)ii p^(α)ij > 0,
showing that α + n + β ∈ Dj. Therefore d(j) | α + β and d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference; from the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state, then there exists n0 such that p^(n)ii > 0 for all n ≥ n0.

Proof. Pick n1, n2 ∈ Di such that n2 − n1 = 1. (Such a pair exists because d(i) = 1 and Di is closed under addition. Why?) Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder satisfies r ≤ n1 − 1. Therefore
    n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.
Because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di, we have n ∈ Di as well.

Of particular interest are irreducible chains, i.e. chains where, as defined earlier, all states communicate with one another.

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Theorem 4. An irreducible chain with finitely many states is aperiodic if and only if there exists an n such that p^(n)ij > 0 for all i, j ∈ S.

More generally, if an irreducible chain has period d, then we can decompose the state space into d sets C0, C1, . . . , Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, . . . , Cd−1 such that
    Σ_{j∈Cr+1} pij = 1,   i ∈ Cr,   r = 0, 1, . . . , d − 1.
(Here Cd := C0.)

[figure: the internal structure of a closed communicating class with period d = 4]

Proof. Define the relation
    i ~d~ j ⇐⇒ p^(nd)ij > 0 for some n ≥ 0.
Notice that this is an equivalence relation. Hence it partitions the state space S into d disjoint equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d~ j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, . . . , id−1 with corresponding classes C0, C1, . . . , Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0, then, necessarily, id ∈ C0. We now show that if i belongs to one of these classes and if pij > 0, then, necessarily, j belongs to the next class. Take, for example, i ∈ C0, and suppose pij > 0 but, say, j ∈ C2. Consider the path
    i0 → i → j → i2 → i3 → · · · → id → i0.
Such a path is possible because of the choice of the i0, i1, . . ., and by the assumptions that i0 ~d~ i and j ~d~ i2. The existence of such a path implies that it is possible to go from i0 to i0 in a number of steps which is an integer multiple of d, minus 1 (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by
    TA := inf{n ≥ 0 : Xn ∈ A}.
We are interested in deriving formulae for the probability that this time is finite, as well as for its expectation. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis".

As an example, consider the chain

[figure: states 0, 1, 2; states 0 and 2 are absorbing (self-loop probability 1); from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define
    ϕ(i) := Pi(TA < ∞).
We then have:

Theorem 5. The function ϕ(i) satisfies
    ϕ(i) = 1,   i ∈ A,
    ϕ(i) = Σ_{j∈S} pij ϕ(j),   i ∉ A.
Furthermore, if ϕ′(i) is any other solution of these equations, then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A, then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + T′A, where T′A is the remaining time until A is hit, i.e. the first hitting time of A by the process after time 1. We first have
    Pi(TA < ∞) = Σ_{j∈S} Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σ_{j∈S} Pi(T′A < ∞ | X1 = j) pij.
But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, . . .). Hence, conditionally on X1, the event {T′A < ∞} is independent of X0:
    P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j).
But the Markov chain is homogeneous, which implies that
    P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).
Combining the above, we have
    Pi(TA < ∞) = Σ_{j∈S} ϕ(j) pij,
as needed.

For the second part of the proof, let ϕ′(i) be another solution. Then
    ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ϕ′(j).
By self-feeding this equation, we have
    ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ( Σ_{k∈A} pjk + Σ_{k∉A} pjk ϕ′(k) )
          = Σ_{j∈A} pij + Σ_{j∉A} Σ_{k∈A} pij pjk + Σ_{j∉A} Σ_{k∉A} pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and we let ϕ(i) = Pi(T0 < ∞). We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0 it takes no time to hit 0; if the chain starts from 2 it will never hit 0.) As for ϕ(1), we have
    ϕ(1) = (1/3)ϕ(1) + (1/2)ϕ(0),
so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time:
    ψ(i) := Ei TA.
We have:

Theorem 6. The function ψ(i) satisfies
    ψ(i) = 0,   i ∈ A,
    ψ(i) = 1 + Σ_{j∉A} pij ψ(j),   i ∉ A.
Furthermore, if ψ′(i) is any other solution of these equations, then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. Start the chain from i. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0. If i ∉ A, then TA = 1 + T′A, where T′A is the remaining time, as above. Therefore,
    Ei TA = 1 + Ei T′A = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),
which is the second of the equations (the terms with j ∈ A vanish because Ej TA = 0 there). The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}, and let ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, because it takes no time to hit A if the chain starts from A. On the other hand,
    ψ(1) = 1 + (1/3)ψ(1),
which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3, "success" meaning that the chain leaves state 1; therefore the mean number of flips until the first success is 3/2.)
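Both examples reduce to one-variable linear equations; here is the same arithmetic done in Python, as a sanity check (a sketch, not part of the formal development):

    # Transition probabilities of the running example (states 0, 1, 2):
    # 0 and 2 absorbing; from 1: to 0 w.p. 1/2, stay w.p. 1/3, to 2 w.p. 1/6.

    # phi(1) = P_1(T_0 < infinity) solves phi(1) = (1/3)phi(1) + (1/2)*1 + (1/6)*0:
    phi1 = (1/2) / (1 - 1/3)
    print(phi1)   # 0.75, i.e. phi(1) = 3/4

    # psi(1) = E_1 T_{{0,2}} solves psi(1) = 1 + (1/3)psi(1):
    psi1 = 1 / (1 - 1/3)
    print(psi1)   # 1.5, i.e. psi(1) = 3/2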

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities
    ϕAB(i) = Pi(TA < TB).
It should be clear that the equations satisfied are:
    ϕAB(i) = 1,   i ∈ A,
    ϕAB(i) = 0,   i ∈ B,
    ϕAB(i) = Σ_{j∈S} pij ϕAB(j),   otherwise.
Furthermore, ϕAB is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: every time the state is x, a reward f(x) is earned. This happens up to the first hitting time TA of a set A. The total reward is thus f(X0) + f(X1) + · · · + f(XTA). We are interested in the mean total reward
    h(x) := Ex Σ_{n=0}^{TA} f(Xn).
Clearly, h(x) = f(x) for x ∈ A, because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then TA = 1 + T′A, as argued earlier, and, by the Markov property,
    h(x) = Ex Σ_{n=0}^{TA} f(Xn) = f(x) + Ex Σ_{n=1}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=0}^{T′A} f(Xn+1),
where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, . . .. Hence, as argued earlier,
    Ex Σ_{n=0}^{T′A} f(Xn+1) = Σ_{y∈S} pxy h(y).

Thus, the set of equations satisfied by h is:
    h(x) = f(x),   x ∈ A,
    h(x) = f(x) + Σ_{y∈S} pxy h(y),   x ∉ A.
You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds. Every time, he moves to one of the adjacent rooms with equal probability, until he enters room 4, in which case he dies immediately. (Sooner or later, he will die; why?)

[figure: a house with four rooms arranged in a square: 1 (top left), 2 (top right), 3 (bottom right), 4 (bottom left); doors connect adjacent rooms]

The question is to compute the average total reward if he starts from room i, for i = 1, 2, 3. If we let h(i) be the average total reward if he starts from room i, we have
    h(4) = 0,
    h(1) = 1 + (1/2)h(2),
    h(3) = 3 + (1/2)h(2),
    h(2) = 2 + (1/2)h(1) + (1/2)h(3).
We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
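The thief's equations form a small linear system (h(4) = 0 is known); a sketch solving it mechanically:

    import numpy as np

    # Unknowns h(1), h(2), h(3). Rearranging
    #   h(1) = 1 + h(2)/2,  h(2) = 2 + h(1)/2 + h(3)/2,  h(3) = 3 + h(2)/2
    # into the form A h = f:
    A = np.array([[1, -1/2, 0],
                  [-1/2, 1, -1/2],
                  [0, -1/2, 1]])
    f = np.array([1, 2, 3])
    print(np.linalg.solve(A, f))   # [5. 8. 7.] -> h(1) = 5, h(2) = 8, h(3) = 7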

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up, then you win £1 per pound of stake; that is, if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up, or lost Yk pounds if tails appeared. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), your fortune after the n-th toss is
    Xn = X0 + Σ_{k=1}^n Yk ξk.
The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss. So, if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk; it can only depend on ξ1, . . . , ξk−1 and — as is often the case in practice — on other considerations of, e.g., astrological nature. (Think why!)

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: an insurance company must make sure that the strategies it employs are such that — at least — the company will not go bankrupt (and, in addition, that it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem. The gambler starts with X0 = x pounds and has a target: to make b pounds (b > x). She will stop playing if her money falls below a (a < x). To simplify things, we consider the simplest case, Yk = 1 for all k, i.e. betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p, as described above. We are clearly dealing with a Markov chain in the state space
    {a, a + 1, a + 2, . . . , b}
with transition probabilities
    pi,i+1 = p,   pi,i−1 = q = 1 − p,   a < i < b,
and where the states a, b are absorbing.

Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x. Our next goal is to compute the probability that the gambler's fortune will reach her target before she is ruined.

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, with a ≤ x ≤ b. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:
    Px(Tb < Ta) = ((q/p)^(x−a) − 1) / ((q/p)^(b−a) − 1),   if p ≠ q,
    Px(Tb < Ta) = (x − a)/(b − a),   if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let
    ϕ(x, ℓ) := Px(Tℓ < T0),   0 ≤ x ≤ ℓ.
The general solution will then be obtained from
    Px(Tb < Ta) = ϕ(x − a, b − a).
To simplify notation, write ϕ(x) instead of ϕ(x, ℓ). We obviously have
    ϕ(0) = 0,   ϕ(ℓ) = 1.

By first-step analysis,
    ϕ(x) = pϕ(x + 1) + qϕ(x − 1),   0 < x < ℓ.
Since p + q = 1, we can write this as
    p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].
Let δ(x) := ϕ(x + 1) − ϕ(x). Assuming that λ := q/p ≠ 1, we have
    δ(x) = (q/p) δ(x − 1) = (q/p)^2 δ(x − 2) = · · · = (q/p)^x δ(0),   0 < x < ℓ,
and so we find
    ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^(x−1) + · · · + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).
Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find
    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),   0 ≤ x ≤ ℓ,   λ ≠ 1.
If λ = 1, then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so
    ϕ(x) = x/ℓ,   0 ≤ x ≤ ℓ,   λ = 1.
Summarising, we have obtained
    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1),   if λ ≠ 1;
    ϕ(x, ℓ) = x/ℓ,   if λ = 1.
The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is
    Px(T0 < ∞) = (q/p)^x,   if p > 1/2;
    Px(T0 < ∞) = 1,   if p ≤ 1/2.

Proof. We have
    Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).
If p > 1/2, then λ = q/p < 1, and so
    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.
If p < 1/2, then λ = q/p > 1, and so
    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.
Finally, if p = q = 1/2,
    Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
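Theorem 7 is easy to validate by simulation; a sketch comparing the formula with a Monte Carlo estimate (parameters are ours):

    import random

    def p_reach_b_before_a(x, a, b, p):
        """Theorem 7: probability of hitting b before a, starting from x."""
        lam = (1 - p) / p
        if lam == 1:
            return (x - a) / (b - a)
        return (lam**(x - a) - 1) / (lam**(b - a) - 1)

    def simulate(x, a, b, p):
        """Run one game of unit bets until the fortune hits a or b."""
        while a < x < b:
            x += 1 if random.random() < p else -1
        return x == b

    x, a, b, p = 5, 0, 10, 0.45
    est = sum(simulate(x, a, b, p) for _ in range(20_000)) / 20_000
    print(p_reach_b_before_a(x, a, b, p), est)   # the two numbers should be close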

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A stopping time is a random time τ with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, . . . , Xn) for all n ∈ Z+. In other words, whether τ = n or not can be decided by looking at the process up to time n only. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for all n ∈ Z+, the future process (Xn+1, Xn+2, . . .) is independent of the past process (X0, . . . , Xn−1), conditionally on the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain and let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . . , Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, . . . , Xτ) if τ < ∞. Formally, any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A,
    P(Xτ+1 = j1, Xτ+2 = j2, . . . , A | Xτ = i, τ < ∞) = pi,j1 pj1,j2 · · · P(A | Xτ = i, τ < ∞),
which gives both assertions at once. We have:
    P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
      (a)= P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
      (b)= P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
      (c)= P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
      (d)= pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),
where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn), together with the ordinary Markov (splitting) property at time n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks"; in other words, a Markov chain is i.i.d. in a generalised sense. The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property). Schematically:

[figure: a sample path of the chain, split into excursions between the successive visit times Ti = Ti^(1), Ti^(2), Ti^(3), Ti^(4) to state i; these excursions are independent]

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:
    Ti := inf{n ≥ 1 : Xn = i}.
Note the subtle difference between this Ti and the Ti of Section 6: the Ti of Section 6 was allowed to take the value 0, whereas our Ti here is always positive. If X0 ≠ i, the two definitions coincide. We let Ti^(1) := Ti and, for r = 2, 3, . . ., we recursively define
    Ti^(r) := inf{n > Ti^(r−1) : Xn = i}.
So Ti^(2) is the time of the second visit, Ti^(3) is the time of the third visit, and so on. State i may or may not be visited again. Regardless, all these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:
    Xi^(r) := (Xn, Ti^(r) ≤ n < Ti^(r+1)),   r = 1, 2, . . .
Let also
    Xi^(0) := (Xn, 0 ≤ n < Ti^(1)).
We think of these as "random variables" valued in the space of trajectories; they are, in fact, random trajectories, or "chunks". We refer to them as excursions of the process.

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions Xi^(0), Xi^(1), Xi^(2), ... are independent random variables. Moreover, for time-homogeneous Markov chains, the excursions Xi^(1), Xi^(2), ... are identical in law, i.e. they all have the same distribution, except, possibly, for the initial one, Xi^(0).

Proof. Apply the Strong Markov Property (SMP) at Ti^(r): SMP says that, given the state at the stopping time Ti^(r), the future after Ti^(r) is independent of the past before Ti^(r). But the state at Ti^(r) is, by definition, equal to i. Therefore the future is independent of the past, and this proves the first claim. The claim that Xi^(1), Xi^(2), ... have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of the observation of this section is:

Corollary 5. (Numerical) functions g(Xi^(0)), g(Xi^(1)), g(Xi^(2)), ... of the excursions are independent random variables.

Example: Define

Λr := Σ_{Ti^(r) ≤ n < Ti^(r+1)} 1(Xn = j).

The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ1, Λ2, ... are independent random variables.
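As an illustration of the excursion decomposition, here is a small simulation sketch (the chain is again a hypothetical three-state example): we cut a long trajectory at the successive visit times Ti^(1), Ti^(2), ... to a fixed state i and observe that the excursion lengths behave like i.i.d. draws; for instance, the first and second halves of the sample have nearly the same mean.

import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])   # hypothetical chain

i = 0                              # the state whose visits cut the path
path = [i]
for _ in range(200_000):
    path.append(int(rng.choice(3, p=P[path[-1]])))

visits = [n for n, s in enumerate(path) if s == i and n >= 1]   # Ti^(1), Ti^(2), ...
excursions = [path[a:b] for a, b in zip(visits[:-1], visits[1:])]
lengths = np.array([len(e) for e in excursions])
half = len(lengths) // 2
print(lengths[:half].mean(), lengths[half:].mean())   # approximately equal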

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions.

First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:

Pi(Xn = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:

Pi(Xn = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:

0 < Pi(Xn = i for infinitely many n) < 1.

We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state i (excluding the time 0, when, possibly, X0 = i):

Ji = Σ_{n=1}^∞ 1(Xn = i) = sup{r ≥ 0 : Ti^(r) < ∞}.

Notice that: state i is visited infinitely many times ⟺ Ji = ∞. Define

fii := Pi(Ti < ∞),

and notice that fii = Pi(Ji ≥ 1). We claim:

Lemma 4. Suppose that the Markov chain starts from state i. Then the random variable Ji is geometric:

Pi(Ji ≥ k) = fii^k, k ≥ 0.

Proof. Indeed, the event {Ji ≥ k} implies that Ti < ∞ and that, counted from the (k−1)-st return, there is at least one more visit. By the strong Markov property at time Ti^(k−1),

Pi(Ji ≥ k) = Pi(Ji ≥ k − 1) Pi(Ji ≥ 1);

the past before Ti^(k−1) is independent of the future, and that is why we get the product. Hence

Pi(Ji ≥ k) = fii Pi(Ji ≥ k − 1) = fii^2 Pi(Ji ≥ k − 2) = ··· = fii^k.

Notice that fii = Pi(Ti < ∞) can either be equal to 1 or smaller than 1. If fii = 1, we have Pi(Ji ≥ k) = 1 for all k, i.e. Pi(Ji = ∞) = 1; this means that the state i is recurrent. If fii < 1, then Pi(Ji < ∞) = 1; this means that the state i is transient. Because either fii = 1 or fii < 1, we conclude what we asserted earlier, namely: every state is either recurrent or transient.

We summarise this, together with another useful characterisation of recurrence and transience, in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. Pi(Ji = ∞) = 1.
3. Ei Ji = ∞.
4. fii = Pi(Ti < ∞) = 1.

Proof. The equivalence of the first two statements was proved above: state i is recurrent if, by definition, it is visited infinitely many times, which, by the definition of Ji, is equivalent to Pi(Ji = ∞) = 1. Clearly, Pi(Ji = ∞) = 1 implies Ei Ji = ∞. Conversely, if Ei Ji = ∞ then, owing to the fact that Ji is a geometric random variable, its expectation cannot be infinite without Ji itself being infinite with probability one; hence Pi(Ji = ∞) = 1. Finally, by Lemma 4, Pi(Ji = ∞) = 1 holds if and only if fii = 1.

Lemma 6. The following are equivalent:
1. State i is transient.
2. Pi(Ji < ∞) = 1.
3. Ei Ji < ∞.
4. fii = Pi(Ti < ∞) < 1.

Proof. The equivalence of the first two statements was proved before the lemma: state i is transient if, by definition, it is visited finitely many times with probability 1, which, by the definition of Ji, is equivalent to Pi(Ji < ∞) = 1. If fii < 1 then, since Ji is geometric, Ei Ji = Σ_{k≥1} Pi(Ji ≥ k) = fii/(1 − fii) < ∞. Conversely, if we assume Ei Ji < ∞, then, clearly, Ji cannot take the value +∞ with positive probability, and therefore Pi(Ji < ∞) = 1.

Corollary 6. If i is a transient state then pii^(n) → 0, as n → ∞.

Proof. If i is transient then Ei Ji < ∞. But

Ei Ji = Ei Σ_{n=1}^∞ 1(Xn = i) = Σ_{n=1}^∞ Pi(Xn = i) = Σ_{n=1}^∞ pii^(n),

and if a series is finite then the summands go to 0, i.e. pii^(n) → 0, as n → ∞.

10.1 First hitting time decomposition

Define

fij^(n) := Pi(Tj = n), n ≥ 1,

the probability to hit state j for the first time at time n, assuming that the chain starts from state i. Notice the difference with pij^(n) = Pi(Xn = j): the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, fij^(n) ≤ pij^(n), but the two sequences are related in an exact way:

Lemma 7.

pij^(n) = Σ_{m=1}^n fij^(m) pjj^(n−m), n ≥ 1.

Proof. We have

pij^(n) = Pi(Xn = j) = Σ_{m=1}^n Pi(Tj = m, Xn = j) = Σ_{m=1}^n Pi(Tj = m) Pi(Xn = j | Tj = m).

The first factor equals fij^(m), by definition. For the second factor, we use the Strong Markov Property:

Pi(Xn = j | Tj = m) = Pi(Xn = j | Xm = j) = pjj^(n−m).

Corollary 7. If j is transient, then

Σ_{n=1}^∞ pij^(n) = (1 + Ej Jj) Pi(Tj < ∞),

which is finite.

Proof. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums on the right-hand side)

Σ_{n=1}^∞ pij^(n) = Σ_{m=1}^∞ fij^(m) Σ_{n=m}^∞ pjj^(n−m) = (1 + Ej Jj) Σ_{m=1}^∞ fij^(m) = (1 + Ej Jj) Pi(Tj < ∞),

where we used that Σ_{n=1}^∞ pjj^(n) = Ej Jj < ∞ (the extra 1 accounts for the n = m term, since pjj^(0) = 1) and that Σ_{m=1}^∞ fij^(m) = Pi(Tj < ∞), by definition.

Corollary 8. If j is a transient state then pij^(n) → 0, as n → ∞, for all states i.

Proof. By Corollary 7 the series Σ_n pij^(n) is finite, hence its summands go to 0.
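Lemma 7 can be checked numerically. The sketch below (the chain is a hypothetical three-state example) computes the first-passage probabilities fij^(m) by first-step analysis, via the recursion fkj^(1) = pkj and fkj^(m) = Σ_{l ≠ j} pkl flj^(m−1), and compares Σ_m fij^(m) pjj^(n−m) with pij^(n).

import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])   # hypothetical chain
i, j, N = 0, 2, 12

mask = np.ones(3, dtype=bool)
mask[j] = False                    # taboo the target state j
f = np.zeros((N + 1, 3))           # f[m][k] = P_k(T_j = m)
f[1] = P[:, j]
for m in range(2, N + 1):
    f[m] = P[:, mask] @ f[m - 1][mask]

Pn = [np.linalg.matrix_power(P, n) for n in range(N + 1)]
lhs = Pn[N][i, j]
rhs = sum(f[m][i] * Pn[N - m][j, j] for m in range(1, N + 1))
print(lhs, rhs)                    # the two numbers agree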

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let fij := Pi(Tj < ∞), we have

i ⇝ j ⟺ fij > 0.

We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j, then j is recurrent and, in fact, not only fij > 0, but fij = Pi(Tj < ∞) = 1. In particular, j ⇝ i. Hence, in a communicating class, either all states are recurrent or all states are transient.

Proof. Start with X0 = i. If i is recurrent and i ⇝ j, then there is a positive probability that j will appear in one of the i.i.d. excursions Xi^(0), Xi^(1), ...; in other words, the probability

gij := Pi(Tj < Ti)

that j appears in a specific excursion is positive. So the random variables

δ_{j,r} := 1(j appears in excursion Xi^(r)), r = 0, 1, 2, ...,

are i.i.d., and, since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence not only fij > 0, but fij = 1; moreover, j is visited infinitely many times, so j is recurrent, and, since the chain returns from j to i within each excursion, j ⇝ i.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential: there is a state j such that i ⇝ j but j does not lead back to i. So there is m such that Pi(A) := Pi(Xm = j) > 0. Let B := {Xn = i for infinitely many n}. Then

Pi(A ∩ B) = Pi(Xm = j, Xn = i for infinitely many n) = 0,

because, from j, the chain can never return to i. Hence

0 < Pi(A) = Pi(A ∩ B) + Pi(A ∩ B^c) = Pi(A ∩ B^c) ≤ Pi(B^c),

so Pi(B) = Pi(Xn = i for infinitely many n) < 1, and so i is a transient state.

So if a communicating class is recurrent, then it contains no inessential states, and so it is closed. Conversely, if a communicating class contains a state which is transient then, necessarily, all its states are transient.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i must be visited infinitely many times with positive probability; hence i is not transient, i.e. this state is recurrent. By Theorem 10, every other j in C is also recurrent. Hence all states are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let Ti be the first return time to i, then either Pi(Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi(Ti < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U0, U1, U2, ... be i.i.d. random variables, uniformly distributed in the unit interval [0, 1], and let

T := inf{n ≥ 1 : Un > U0}.

Thus T represents the number of variables we need to draw to get a value larger than the initial one. (It should be clear that this can, for example, appear in a certain game.) What is the expectation of T? First, observe that T is finite. Indeed,

P(T > n) = P(U1 ≤ U0, ..., Un ≤ U0) = ∫_0^1 P(U1 ≤ x, ..., Un ≤ x | U0 = x) dx = ∫_0^1 x^n dx = 1/(n+1),

and therefore

P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n+1)) = 1.

But

ET = Σ_{n≥0} P(T > n) = Σ_{n≥0} 1/(n+1) = ∞.

You are invited to think about the meaning of this: if the initial value U0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!
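A quick numerical check of this example (a sketch; the sample size is an arbitrary choice): the empirical tail P(T > n) matches 1/(n+1), while the sample mean of T is large and unstable, as befits a finite random variable with infinite expectation.

import numpy as np

rng = np.random.default_rng(2)

def draw_T():
    u0, t = rng.random(), 0
    while True:
        t += 1
        if rng.random() > u0:
            return t

T = np.array([draw_T() for _ in range(100_000)])
for n in (1, 2, 5, 10):
    print(n, (T > n).mean(), 1 / (n + 1))   # empirical P(T > n) vs 1/(n+1)
print(T.mean())                             # large and unstable: ET = infinity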

We now classify a recurrent state i as follows. We say that i is positive recurrent if

Ei Ti < ∞,

and that i is null recurrent if

Ei Ti = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. In fact, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as a "Law". But this is misleading: it is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z1, Z2, ... be i.i.d. random variables.
(i) Suppose E|Z1| < ∞ and let µ = EZ1. Then

P( lim_{n→∞} (Z1 + ··· + Zn)/n = µ ) = 1.

(ii) Suppose E max(Z1, 0) = ∞, but E min(Z1, 0) > −∞. Then

P( lim_{n→∞} (Z1 + ··· + Zn)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X0, X1, X2, ... of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

Specifically, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions

Xa^(1), Xa^(2), Xa^(3), ...

forms an independent sequence of (generalised) random variables.

13.1 Functions of excursions

Namely, for reasonable choices of functions G taking values in R, the random variables

G(Xa^(1)), G(Xa^(2)), G(Xa^(3)), ...

are i.i.d.: this is because the excursions Xa^(1), Xa^(2), ... are i.i.d., and functions of independent random variables are also independent. Thus the law of large numbers tells us that, with probability 1,

( G(Xa^(1)) + ··· + G(Xa^(n)) ) / n → E G(Xa^(1)),   (8)

as n → ∞. Note that

E G(Xa^(1)) = E G(Xa^(2)) = ··· = E[ G(Xa^(0)) | X0 = a ] = Ea G(Xa^(0)),

because, if we start with X0 = a, the initial excursion Xa^(0) also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a.

Example 1: To each state x ∈ S, assign a reward f(x), which we can think of as a reward received when the chain is in state x. For an excursion X, define G(X) to be the maximum reward received during this excursion:

G(Xa^(r)) = max{ f(Xt) : Ta^(r) ≤ t < Ta^(r+1) }.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:

G(Xa^(r)) = Σ_{Ta^(r) ≤ t < Ta^(r+1)} f(Xt).   (7)

Whichever the choice of G, the limit (8) holds.

13.2 Average reward

Consider now a function f : S → R which, as in the examples above, we can think of as a reward received when the chain is in state x. We are interested in the existence of the limit

time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(Xk),

which represents the 'time-average reward' received. In fact, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. Ea Ta < ∞), and assume that the distribution of X0 is such that P(Ta < ∞) = 1. Let f : S → R+ be a bounded 'reward' function, say |f(x)| ≤ B for all x. Then, with probability one, the 'time-average reward' f̄ exists and is a deterministic quantity:

P( lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = f̄ ) = 1,

where

f̄ = Ea[ Σ_{n=0}^{Ta−1} f(Xn) ] / Ea Ta = Σ_{x∈S} f(x) π(x),

with π as in (17) below.

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this: suppose that t is large enough; we are looking for the total reward received between 0 and t. Thus we need to keep track of the number, Nt, of complete excursions from the ground state a that occur between 0 and t:

Nt := max{ r ≥ 1 : Ta^(r) ≤ t }.

For example, in the figure below, Nt = 5.

[Figure: a time axis marked 0, Ta^(1), Ta^(2), Ta^(3), Ta^(4), Ta^(5), t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term: the reward over the initial excursion, the rewards over the complete excursions, and the reward over the last incomplete cycle:

Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + G_first + G_last,

where Gr is given by (7), G_first is the reward up to time Ta = Ta^(1) or t (whichever is smaller), and G_last is the reward over the last incomplete cycle. Since P(Ta < ∞) = 1, and since we assumed that f is bounded, we have |G_first| ≤ B Ta, and so, with probability one,

(1/t) G_first → 0, as t → ∞,

because B Ta/t → 0. A similar argument shows that

(1/t) G_last → 0, as t → ∞,

with probability one.

So we are left with the sum of the rewards over complete cycles. Write

(1/t) Σ_{r=1}^{Nt−1} Gr = (Nt/t) · (1/Nt) Σ_{r=1}^{Nt−1} Gr.   (9)

Look again at what the Law of Large Numbers (8) tells us:

lim_{n→∞} (1/n) Σ_{r=1}^n Gr = EG1,   (10)

with probability one. Since Ta^(r) → ∞ as r → ∞, the number Nt of complete cycles before time t also tends to ∞ as t → ∞. Hence, by (10),

lim_{t→∞} (1/Nt) Σ_{r=1}^{Nt−1} Gr = EG1.

Now look at the sequence

Ta^(r) = Ta^(1) + Σ_{k=2}^r ( Ta^(k) − Ta^(k−1) ),

and see what the Law of Large Numbers tells us again: the differences Ta^(k) − Ta^(k−1), k = 2, 3, ..., are i.i.d. random variables with common expectation Ea Ta. Since Ta^(1)/r → 0, we have

lim_{r→∞} Ta^(r)/r = Ea Ta.   (11)

By definition, the visit with index r = Nt to the ground state a is the last one before t (see also the figure above):

Ta^(Nt) ≤ t < Ta^(Nt+1).

Hence

Ta^(Nt)/Nt ≤ t/Nt < Ta^(Nt+1)/Nt.

But (11) tells us that

lim_{t→∞} Ta^(Nt)/Nt = lim_{t→∞} Ta^(Nt+1)/Nt = Ea Ta,

and hence lim_{t→∞} t/Nt = Ea Ta, i.e.

lim_{t→∞} Nt/t = 1/Ea Ta.   (12)

Using (10) and (12) in (9), we have

lim_{t→∞} (1/t) Σ_{r=1}^{Nt−1} Gr = EG1 / Ea Ta,

and thus

lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = EG1 / Ea Ta =: f̄.

To see that f̄ is as claimed, just observe that, from the Strong Markov Property,

EG1 = E[ Σ_{Ta^(1) ≤ t < Ta^(2)} f(Xt) ] = Ea[ Σ_{0 ≤ t < Ta} f(Xt) ].

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion: the numerator of f̄ is computed by summing up the rewards over the excursion, and the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both: we may write EG1 = Ea Σ_{0 ≤ t < Ta} f(Xt) or EG1 = Ea Σ_{0 < t ≤ Ta} f(Xt), making sure that not both endpoints are included.

[Figure: the rewards f(X0) = f(a), f(X1), f(X2), ..., f(X_{Ta−1}) collected over a single excursion from the ground state a.]

Comment 2: A very important particular case occurs when we choose

f(x) := 1(x = b),

i.e. we let f be 1 at the state b and 0 everywhere else. Then the theorem says that, with probability 1,

lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = b) = Ea[ Σ_{k=0}^{Ta−1} 1(Xk = b) ] / Ea Ta.   (13)

The numerator here is the average number of times state b is visited between two successive visits to state a.

Comment 3: If, in the above, we let b = a, the ground state, then we see that the numerator equals 1, and hence, with probability 1,

lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = a) = 1 / Ea Ta.   (14)

The physical interpretation of the latter should be clear: if, on the average, it takes Ea Ta units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/Ea Ta.

Comment 4: Suppose that two states a, b communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/Eb Tb. The equality of these two limits gives:

(average no. of times state b is visited between two successive visits to state a) = Ea[ Σ_{k=0}^{Ta−1} 1(Xk = b) ] = Ea Ta / Eb Tb.
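Here is a simulation sketch of Theorem 14 and of Comment 3, for a hypothetical two-state chain whose stationary distribution π = (5/6, 1/6) is easy to compute by hand: the time-average of a bounded reward f approaches Σ_x f(x)π(x), and the empirical frequency of visits to state 0 approaches π(0) = 1/E_0 T_0.

import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])            # hypothetical chain; pi = (5/6, 1/6)
f = np.array([1.0, 4.0])              # bounded reward

x, total, visits0, t = 0, 0.0, 0, 300_000
for _ in range(t):
    total += f[x]
    visits0 += (x == 0)
    x = int(rng.choice(2, p=P[x]))

pi = np.array([5 / 6, 1 / 6])
print(total / t, f @ pi)              # time-average reward vs sum_x f(x) pi(x)
print(visits0 / t, pi[0])             # long-run fraction of visits to state 0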

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

ν[a](x) := Ea Σ_{n=0}^{Ta−1} 1(Xn = x), x ∈ S,   (15)

should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state, then ν[a](x)/Ea Ta is the long-run fraction of times that state x is visited. In view of this, it should have something to do with the concept of stationary distribution discussed a few sections ago.

Note: Although ν[a] depends on the choice of the 'ground' state a (notice that the choice of notation ν[a] is justifiable through the notation (6) for the communicating class of the state a), we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped, so there is nothing lost in keeping it for now.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent⁵. Consider the function ν[a] defined in (15). Then ν[a] satisfies the balance equations

ν[a] = ν[a] P, i.e. ν[a](x) = Σ_{y∈S} ν[a](y) pyx, x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν[a](x) < ∞.

⁵ We stress that we do not need the stronger assumption of positive recurrence here.

Proof. We start the Markov chain with X0 = a. Recurrence tells us that Ta < ∞ with probability one. Since a = X_{Ta}, we can write

ν[a](x) = Ea Σ_{n=1}^{Ta} 1(Xn = x).   (16)

This was a trivial but important step: we replaced the n = 0 term with the n = Ta term; both of them equal 1(x = a), since X0 = X_{Ta} = a. The real reason for doing so is that we wanted to replace a function of X0, X1, X2, ... by a function of X1, X2, X3, ...,

40 . for all n ∈ N. n ≤ Ta ) pyx Pa (Xn−1 = y. for all x ∈ S. so there exists m ≥ 1. indeed. so there exists n ≥ 1. that a is positive recurrent. (n) Next. which. n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). We are now going to assume more. such that pax > 0. from (16). which are precisely the balance equations. such that pxa > 0. x a. in this last step. Define Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . ν [a] = ν [a] Pn . Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta . ax ax From Theorem 10. (17) This π is a stationary distribution.and thus be in a position to apply first-step analysis. by definition. So we have shown ν [a] (x) = y∈S pyx ν [a] (y). We now have: Corollary 9. is finite when a is positive recurrent. where. xa so. n ≤ Ta ) Pa (Xn = x. x ∈ S. let x be such that a x. which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. We just showed that ν [a] = ν [a] P. First note that. Therefore. Note: The last result was shown under the assumption that a is a recurrent state. namely. Xn−1 = y. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). Suppose that the Markov chain possesses a positive recurrent state a. ν [a] (x) < ∞. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. we used (16).

Proof. We have already shown, in Proposition 2, that ν[a] P = ν[a]. But π[a] is simply a multiple of ν[a]; therefore π[a] P = π[a]. It only remains to show that π[a] is a probability, i.e. that it sums up to one:

Σ_{x∈S} π[a](x) = (1/Ea Ta) Σ_{x∈S} ν[a](x) = 1.

Also, since a is positive recurrent, Ea Ta < ∞, and so π[a] is not trivially equal to zero.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! In other words, the set of linear equations

π(x) = Σ_{y∈S} π(y) pyx, x ∈ S, Σ_{y∈S} π(y) = 1,

has at least one solution. We DID NOT claim that this distribution is, in general, unique; this is what we will try to do in the sequel.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X0 = i. We already know that j is recurrent (Theorem 10). Consider the excursions Xi^(r), r = 0, 1, 2, .... Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion, and the probability

gij = Pi(Tj < Ti)

that j occurs in a specific excursion is positive. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads gij > 0; the coins are i.i.d.; if heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R1 (respectively, R2) be the index of the first (respectively, second) excursion that contains j.

But. . Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞. we just need to show that R2 −R1 +1 ζ := r=1 Zr has finite expectation. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.i.R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 .d. or all states are null recurrent or all states are transient. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . . where the Zr are i. . . . Then either all states in C are positive recurrent. An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. Corollary 10. 2. with the same law as Ti . An irreducible Markov chain is called transient if one (and hence all) states are transient. R2 has finite expectation. . by Wald’s lemma. Let C be a communicating class. IRREDUCIBLE CHAIN r = 1. . In other words.

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we have seen before:

[Figure: states 0, 1, 2 with p00 = 1, p10 = 1/3, p11 = 1/2, p12 = 1/6, p22 = 1.]

Clearly, state 0 (and, likewise, state 2) is positive recurrent for totally trivial reasons: it is, after all, an absorbing state. But observe that there are many, infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by

π(0) = α, π(1) = 0, π(2) = 1 − α,

is a stationary distribution. (You should think why this is OBVIOUS!) Notice that this lack of communication is, after all, what prohibits uniqueness. Indeed:

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution. Fix a ground state a. Since the chain is irreducible and positive recurrent, we have Pi(Tj < ∞) = 1 for all states i and j. Hence, if the initial state is distributed according to π,

P(Ta < ∞) = Σ_{x∈S} π(x) Px(Ta < ∞) = Σ_{x∈S} π(x) = 1.   (18)

Consider the Markov chain (Xn, n ≥ 0) with P(X0 = x) = π(x), x ∈ S, i.e. we start the Markov chain from the stationary distribution π. Then, for all n, we have P(Xn = x) = π(x), for all x ∈ S. By (18), Theorem 14 applies, and so, by (13), with probability one,

(1/t) Σ_{n=1}^t 1(Xn = x) → π[a](x), as t → ∞.

By a theorem of Analysis (the Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded above by 1, we can take expectations of both sides:

E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] → π[a](x), as t → ∞.

But, since P(Xn = x) = π(x) for all n,

E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] = (1/t) Σ_{n=1}^t P(Xn = x) = π(x).

Therefore π(x) = π[a](x) for all x ∈ S, i.e. π = π[a]. In other words, an arbitrary stationary distribution π must be equal to the specific one π[a], for any choice of the ground state a. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π[a] = π[b].

Proof. Both π[a] and π[b] are stationary distributions, so, by the above theorem, they are equal.

In particular, from the definition of π[a], we obtain the cycle formula:

Ea[ Σ_{k=0}^{Ta−1} 1(Xk = x) ] / Ea Ta = Eb[ Σ_{k=0}^{Tb−1} 1(Xk = x) ] / Eb Tb, x ∈ S.

The cycle formula merely re-expresses the equality π[a] = π[b].

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,

π(a) = 1 / Ea Ta.

Proof. There is a unique stationary distribution, namely π(a) = π[a](a) = ν[a](a)/Ea Ta = 1/Ea Ta, since ν[a](a) = 1.

The following is a sort of converse, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C, i.e. delete all transition probabilities pxy with x ∉ C or y ∉ C, and let π(·|C) be the distribution defined by restricting π to C:

π(x|C) := π(x)/π(C), x ∈ C, where π(C) = Σ_{y∈C} π(y).

We can easily see that we thus obtain a Markov chain with state space C and stationary distribution π(·|C). But π(i|C) > 0, so, by the above corollary applied to the restricted chain, π(i|C) = 1/Ei Ti. Therefore Ei Ti < ∞, i.e. i is positive recurrent.
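Corollary 12 lends itself to a direct numerical test (a sketch; the chain is hypothetical): solve the balance equations for π and compare π(a) with 1/E_a T_a estimated from simulated return times.

import numpy as np

rng = np.random.default_rng(5)
P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.0, 0.7],
              [0.5, 0.5, 0.0]])   # hypothetical irreducible chain

# unique stationary distribution: solve pi (P - I) = 0 together with sum(pi) = 1
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]

a, returns, steps = 0, 20_000, 0
for _ in range(returns):
    x = int(rng.choice(3, p=P[a])); steps += 1
    while x != a:
        x = int(rng.choice(3, p=P[x])); steps += 1
print(pi[a], 1 / (steps / returns))   # pi(a) vs 1 / E_a T_a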

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider an arbitrary Markov chain and decompose it into communicating classes. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution; the same construction works within each class, since the chain restricted to a positive recurrent class satisfies the irreducibility and positive recurrence conditions. Thus, to each positive recurrent communicating class C there corresponds a unique stationary distribution π^C, which assigns positive probability to each state in C (and zero elsewhere). It is defined by picking a state (any state!) a ∈ C and letting π^C := π[a]; by our uniqueness result, π^C does not depend on the choice of a ∈ C.

Proposition 3. Any stationary distribution π is necessarily of the form

π = Σ_{C: C positive recurrent communicating class} αC π^C,

where αC ≥ 0 and Σ_C αC = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then, by Corollary 13, x belongs to some positive recurrent class C. For each such C, define the conditional distribution

π(x|C) := π(x)/π(C), where π(C) := Σ_{x∈C} π(x).

Then (π(x|C), x ∈ C) is a stationary distribution for the chain restricted to C, so, by uniqueness, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with αC = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(Xn = x) as n → ∞. So far we have seen, following Theorem 14, that the so-called Cesaro averages (1/n) Σ_{k=1}^n P(Xk = x) converge to π(x), as n → ∞. But this is a weaker statement: a sequence of real numbers an may fail to converge while its Cesaro averages (1/n)(a1 + ··· + an) converge; take, for example, an = (−1)^n.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain: a Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum would perform an oscillatory motion ad infinitum; we may still call it stable, but only because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:

ẋ = −ax + b, x ∈ R, a > 0.

There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever. The thing to observe here is that there is an equilibrium point, namely the one found by setting the right-hand side equal to zero:

x* = b/a.

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If, moreover, we start from anywhere else, then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: solution curves x(t) for several initial conditions, all converging to the equilibrium value x*.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:

Xn+1 = ρ Xn + ξn,

where ξ0, ξ1, ξ2, ... are given random variables. We shall assume that the ξn have the same law and that X0, ξ0, ξ1, ξ2, ... are mutually independent; we then define a stochastic sequence X1, X2, ... by means of the recursion above. It will turn out that |ρ| < 1 is the condition for stability.

But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point. However, notice what happens when we solve the recursion; and we are in a happy situation here, for we can do it:

X1 = ρ X0 + ξ0
X2 = ρ X1 + ξ1 = ρ(ρ X0 + ξ0) + ξ1 = ρ² X0 + ρ ξ0 + ξ1
X3 = ρ X2 + ξ2 = ρ³ X0 + ρ² ξ0 + ρ ξ1 + ξ2
······
Xn = ρⁿ X0 + ρ^{n−1} ξ0 + ρ^{n−2} ξ1 + ··· + ρ ξ_{n−2} + ξ_{n−1}.

Since |ρ| < 1, the first term goes to 0 as n → ∞. The second term,

X̃n := ρ^{n−1} ξ0 + ρ^{n−2} ξ1 + ··· + ρ ξ_{n−2} + ξ_{n−1},

which is a linear combination of ξ0, ..., ξ_{n−1}, does not converge, simply because the coefficient of each ξk depends on the time parameter n. But notice that if we let

X̂n := ρ^{n−1} ξ_{n−1} + ρ^{n−2} ξ_{n−2} + ··· + ρ ξ1 + ξ0,

i.e. reverse the order of the coefficients, then, since the ξn are identically distributed and mutually independent, we have

P(X̃n ≤ x) = P(X̂n ≤ x), for all x.

The nice thing is that, under nice conditions (e.g. if we assume that the ξn are bounded), X̂n does converge:

X̂n → X̂∞ = Σ_{k=0}^∞ ρ^k ξk.

In other words, while the limit of X̃n does not exist, the limit of its distribution does:

lim_{n→∞} P(X̃n ≤ x) = lim_{n→∞} P(X̂n ≤ x) = P(X̂∞ ≤ x).

This is precisely the reason why, for such systems, we cannot define stability as convergence of the random variables themselves; we define it, instead, as convergence of their distributions.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,

lim_{n→∞} P(Xn = x) = π(x), uniformly in x.

In particular, for time-homogeneous Markov chains, the n-step transition matrix Pⁿ converges, as n → ∞, to a matrix with rows all equal to π:

lim_{n→∞} pij^(n) = π(j), i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, i.e. we know

f(x) = P(X = x), g(y) = P(Y = y),

but not their joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define P(X = x, Y = y) = P(X = x)P(Y = y). But suppose that we have some further requirement; for instance, we may try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is? Coupling refers to the various methods for constructing this joint law.

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y). Find a joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

f(x) = Σ_y h(x, y), g(y) = Σ_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (Xn, n ≥ 0) and the law of a stochastic process Y = (Yn, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space. We say that a random time T is a meeting time of the two processes if

Xn = Yn for all n ≥ T.

This T is a random variable that takes values in the set of times, or the value +∞ on the event that the two processes never meet. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together. The following result is the so-called coupling inequality:

Proposition 4. Suppose that we have two stochastic processes as above, suppose that they have been coupled, and let T be a meeting time of X and Y. Then, for all n ≥ 0 and all x ∈ S,

|P(Xn = x) − P(Yn = x)| ≤ P(T > n).

Since this holds for all x ∈ S, and since the right-hand side does not depend on x, we can write it as

sup_{x∈S} |P(Xn = x) − P(Yn = x)| ≤ P(T > n).   (19)

If, in particular, P(T < ∞) = 1, i.e. T is finite with probability one, then

|P(Xn = x) − P(Yn = x)| → 0, as n → ∞, uniformly in x ∈ S.

Proof. We have

P(Xn = x) = P(Xn = x, n < T) + P(Xn = x, n ≥ T)
= P(Xn = x, n < T) + P(Yn = x, n ≥ T)
≤ P(n < T) + P(Yn = x),

where the middle equality holds because Xn = Yn on the event {n ≥ T}. Hence P(Xn = x) − P(Yn = x) ≤ P(T > n). Repeating the argument with the roles of X and Y interchanged, we obtain P(Yn = x) − P(Xn = x) ≤ P(T > n). Combining the two inequalities we obtain the first assertion; the remaining assertions are immediate, since P(T > n) → P(T = ∞) = 0 when P(T < ∞) = 1.
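For the exercise above, one standard answer, usually called the maximal coupling, achieves P(X = Y) = Σ_x min(f(x), g(x)), which is the largest value compatible with the marginals. Below is a sketch of the construction (the laws f, g are arbitrary made-up examples): with probability p = Σ_x min(f(x), g(x)) we draw X = Y from the common part of the two laws, and otherwise we draw X and Y independently from the normalised residuals.

import numpy as np

rng = np.random.default_rng(7)
f = np.array([0.5, 0.3, 0.2])      # law of X (hypothetical)
g = np.array([0.2, 0.2, 0.6])      # law of Y (hypothetical)

m = np.minimum(f, g)
p = m.sum()                         # best possible P(X = Y)

def sample():
    if rng.random() < p:            # put as much mass as possible on {X = Y}
        x = int(rng.choice(3, p=m / p))
        return x, x
    x = int(rng.choice(3, p=(f - m) / (1 - p)))   # residuals have disjoint supports
    y = int(rng.choice(3, p=(g - m) / (1 - p)))
    return x, y

draws = [sample() for _ in range(100_000)]
print(np.mean([x == y for x, y in draws]), p)     # empirical P(X = Y) vs sum of minima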

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S. Let pij denote, as usual, the transition probabilities.

We shall consider the laws of two processes. The first process, denoted by X = (Xn, n ≥ 0), has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Yn, n ≥ 0), has the law of a Markov chain with the same transition probabilities pij, but with Y0 distributed according to the stationary distribution π. In other words, they differ only in how the initial states are chosen. It is very important to note that we have only defined the law of X and the law of Y; we have not defined a joint law, in other words, we have not defined joint probabilities such as P(Xn = x, Xm = y).

We now couple them by doing the straightforward thing: we assume that X = (Xn, n ≥ 0) is independent of Y = (Yn, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

T := inf{n ≥ 0 : Xn = Yn}.

Consider the process

Wn := (Xn, Yn), n ≥ 0.

Then (Wn, n ≥ 0) is a Markov chain with state space S × S. Its initial state W0 = (X0, Y0) has distribution

P(W0 = (x, y)) = µ(x) π(y),

and its (1-step) transition probabilities are

q_{(x,y),(x′,y′)} := P(Wn+1 = (x′, y′) | Wn = (x, y)) = P(Xn+1 = x′ | Xn = x) P(Yn+1 = y′ | Yn = y) = p_{x,x′} p_{y,y′}.

Its n-step transition probabilities are, likewise,

q^{(n)}_{(x,y),(x′,y′)} = p^{(n)}_{x,x′} p^{(n)}_{y,y′}.

From Theorem 3 and the aperiodicity assumption, we have p^{(n)}_{x,x′} > 0 and p^{(n)}_{y,y′} > 0 for all large n, implying that q^{(n)}_{(x,y),(x′,y′)} > 0 for all large n. Therefore (Wn, n ≥ 0) is an irreducible chain. Notice, next, that

σ(x, y) := π(x) π(y), (x, y) ∈ S × S,

is a stationary distribution for (Wn, n ≥ 0). (Indeed, had we started with both X0 and Y0 independent and distributed according to π, then, for all n, Xn and Yn would be independent and distributed according to π.) Since σ(x, y) > 0 for all (x, y), Corollary 13 implies that every state of the W-chain is positive recurrent, and so (Wn, n ≥ 0) is positive recurrent.

By recurrence of the irreducible chain W, the diagonal set {(x, x) : x ∈ S} is reached in finite time with probability one, i.e. P(T < ∞) = 1. Then

lim_{n→∞} P(T > n) = P(T = ∞) = 0.

Define now a third process by

Zn := Xn 1(n < T) + Yn 1(n ≥ T).

Thus Zn equals Xn before the meeting time of X and Y and, after the meeting time, Zn = Yn. Obviously, Z0 = X0, which has the arbitrary distribution µ. Moreover, (Zn, n ≥ 0) is a Markov chain with transition probabilities pij; to see this, apply the strong Markov property at T: at time T both chains are at the same state, and both move with the same transition probabilities. Hence (Zn, n ≥ 0) is identical in law to (Xn, n ≥ 0); in particular,

P(Xn = x) = P(Zn = x), for all n and x.

[Figure: X starts from an arbitrary distribution and Y from the stationary distribution; Z follows X up to the meeting time T and follows Y thereafter.]

Clearly, T is a meeting time between Z and Y. By the coupling inequality,

|P(Zn = x) − P(Yn = x)| ≤ P(T > n).

Since P(Zn = x) = P(Xn = x), and since P(Yn = x) = π(x) for all n, we have

|P(Xn = x) − π(x)| ≤ P(T > n),

which converges to zero, as n → ∞, uniformly in x. This proves Theorem 17.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X, and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and p01 = p10 = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and transition probabilities

q_{(0,0),(1,1)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = q_{(1,1),(0,0)} = 1.

This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, p00 = 0.01, p01 = 0.99, p10 = 1, then the product chain becomes irreducible.
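The proof above can be watched in action. In the sketch below (a hypothetical aperiodic two-state chain with π = (5/6, 1/6)), we run X from a point mass and Y from π, independently, record whether they have met by time k, and check the coupling inequality |P(Xk = 0) − π(0)| ≤ P(T > k) term by term.

import numpy as np

rng = np.random.default_rng(8)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])          # hypothetical chain; pi = (5/6, 1/6)
pi = np.array([5 / 6, 1 / 6])

n, trials = 10, 100_000
hitsX = np.zeros(n + 1)             # counts of {X_k = 0}
notmet = np.zeros(n + 1)            # counts of {T > k}
for _ in range(trials):
    x, y, met = 1, int(rng.random() < pi[1]), False
    for k in range(n + 1):
        met = met or (x == y)
        hitsX[k] += (x == 0)
        notmet[k] += (not met)
        x = int(rng.random() < P[x, 1])   # step X
        y = int(rng.random() < P[y, 1])   # step Y, independently
print(np.abs(hitsX / trials - pi[0]))     # |P(X_k = 0) - pi(0)|
print(notmet / trials)                    # P(T > k): dominates, term by term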

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above, p01 = p10 = 1, and suppose X0 = 0. Then Xn = 0 for even n and Xn = 1 for odd n, so p00^(n) = 1 for even n and = 0 for odd n. Thus lim_{n→∞} p00^(n) does not exist. However, if we modify the chain by p00 = 0.01, p01 = 0.99, p10 = 1, then

lim_{n→∞} p00^(n) = lim_{n→∞} p10^(n) = π(0), lim_{n→∞} p11^(n) = lim_{n→∞} p01^(n) = π(1),

where π(0), π(1) satisfy

0.99 π(0) = π(1), π(0) + π(1) = 1,

whence π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,

lim_{n→∞} [ p00^(n) p01^(n) ; p10^(n) p11^(n) ] = [ 0.503 0.497 ; 0.503 0.497 ].

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"⁶. It is, however, customary to have a consistent terminology, and this is expressed by our choice.

⁶ We put "wrong" in quotes: terminology cannot be, per se, wrong. The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981).

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain, so that, by the results of Sections 14 and 16, we know we have a unique stationary distribution π and that π(x) > 0 for all states x. Suppose that the period of the chain is d. By the results of §5.4 (Theorem 4), we can decompose the state space into d classes C0, C1, ..., Cd−1, such that the chain moves cyclically between them: from C0 to C1, from C1 to C2, and so on, and from Cd−1 back to C0. It is not difficult to see that:

Lemma 8. The chain (Xnd, n ≥ 0), i.e. the original chain sampled every d steps, has the sets C0, ..., Cd−1 as closed communicating classes, and all its states have period 1. Hence, if we restrict the chain (Xnd, n ≥ 0) to any of the sets Cr, we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply.

Lemma 9. Let αr := Σ_{i∈Cr} π(i), where π is the stationary distribution. Then

αr = 1/d, for all r = 0, 1, ..., d − 1.

Proof. We have that π satisfies the balance equations π(i) = Σ_{j∈S} π(j) pji, i ∈ S. Suppose i ∈ Cr. Then pji = 0 unless j ∈ Cr−1, so

π(i) = Σ_{j∈Cr−1} π(j) pji.

Summing up over i ∈ Cr, we obtain

αr = Σ_{i∈Cr} π(i) = Σ_{j∈Cr−1} π(j) Σ_{i∈Cr} pji = Σ_{j∈Cr−1} π(j) = αr−1.

So all the αr are equal and, since they add up to 1, each one must be equal to 1/d.

This means that:

Theorem 18. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d and stationary distribution π. Let i, j belong to the same class Cr. Then

pij^(nd) = Pi(Xnd = j) → d π(j), as n → ∞.

Proof. By Lemma 8, the chain (Xnd, n ≥ 0), restricted to Cr and started from i ∈ Cr, is a positive recurrent irreducible aperiodic chain, and, by Lemma 9, its (unique) stationary distribution is d π(j), j ∈ Cr (indeed, Σ_{j∈Cr} d π(j) = d αr = 1). The result then follows from Theorem 17.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (where r is taken to be a number between 0 and d − 1, after reduction modulo d). Then, starting from i, the original chain enters Cℓ in r steps and, thereafter, returns to Cℓ every d steps. We then have

pij^(r+nd) → d π(j), as n → ∞, i ∈ Ck, j ∈ Cℓ.

Indeed, pij^(r+nd) = Σ_{i′∈Cℓ} pii′^(r) pi′j^(nd), and each factor pi′j^(nd) → d π(j) by the theorem above, while Σ_{i′∈Cℓ} pii′^(r) = 1.
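A numerical sketch of Theorem 18 (the four-state chain below is a made-up example with period d = 2 and cyclic classes C0 = {0, 1}, C1 = {2, 3}): along even times, the transition probabilities converge to dπ(j) within the class, and along odd times they converge to dπ(j) across classes.

import numpy as np

P = np.array([[0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.4, 0.6],
              [0.5, 0.5, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0]])   # period 2: C0 = {0,1}, C1 = {2,3}

# stationary distribution: solve pi (P - I) = 0 with sum(pi) = 1
A = np.vstack([P.T - np.eye(4), np.ones(4)])
pi = np.linalg.lstsq(A, np.array([0, 0, 0, 0, 1.0]), rcond=None)[0]

Pn = np.linalg.matrix_power(P, 1000)    # even power: within-class limits
print(Pn[0, :2], 2 * pi[:2])            # p_{0j}^{(2n)} -> d pi(j), j in C0
Pn1 = Pn @ P                            # odd power: cross-class limits (r = 1)
print(Pn1[0, 2:], 2 * pi[2:])           # p_{0j}^{(2n+1)} -> d pi(j), j in C1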

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities pij^(n), as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then pij^(n) → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw, in Corollary 7, that, if j is transient, then pij^(n) → 0 as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state, then pij^(n) → 0, as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution on N, i.e. a vector p^(n) = (p1^(n), p2^(n), ...), where Σ_{i=1}^∞ pi^(n) = 1 and pi^(n) ≥ 0 for all i. Then we can find a sequence n1 < n2 < n3 < ··· of integers, and numbers p1, p2, ..., such that

lim_{k→∞} pi^(nk) = pi, for all i ∈ N.

Proof. Since the numbers p1^(n), n = 1, 2, ..., are contained between 0 and 1, there must be a subsequence (n1_k) such that lim_{k→∞} p1^(n1_k) exists and equals, say, p1. Now consider the numbers p2^(n1_k), k = 1, 2, .... Since they are contained between 0 and 1, there must be a subsequence (n2_k) of (n1_k) such that lim_{k→∞} p2^(n2_k) exists and equals, say, p2. It is clear how we continue: for each i we have found a subsequence (ni_k, k = 1, 2, ...) such that lim_{k→∞} pi^(ni_k) exists and equals, say, pi, and, by construction, the sequence (n2_k) of the second row is a subsequence of (n1_k) of the first row, the sequence (n3_k) of the third row is a subsequence of (n2_k) of the second row, and so on. Schematically:

p1^(n1_1), p1^(n1_2), p1^(n1_3), ... → p1
p2^(n2_1), p2^(n2_2), p2^(n2_3), ... → p2
p3^(n3_1), p3^(n3_2), p3^(n3_3), ... → p3
···

Now pick the numbers in the diagonal. The diagonal subsequence (nk_k, k = 1, 2, ...) is a subsequence of each (ni_k, k = 1, 2, ...), and hence

pi^(nk_k) → pi, as k → ∞, for each i.

(Note that the limit numbers pi are nonnegative, but they need not form a probability distribution: some mass may escape to infinity.)

The second result is Fatou's lemma, whose proof we omit⁷:

⁷ See Brémaud (1999).

Lemma 11. If (yi, i ∈ N) and (xi^(n), i ∈ N), n = 1, 2, ..., are sequences of nonnegative numbers, then

Σ_{i=1}^∞ yi lim inf_{n→∞} xi^(n) ≤ lim inf_{n→∞} Σ_{i=1}^∞ yi xi^(n).

Proof of Theorem 19. Let j be a null recurrent state and suppose, by way of contradiction, that for some i the sequence pij^(n) does not converge to 0. Then we can find a subsequence (nk) such that

lim_{k→∞} pij^(nk) = αj > 0.

Consider the probability vectors (pix^(nk), x ∈ S), k = 1, 2, .... By Lemma 10, we can pick a subsequence (n′k) of (nk) such that

lim_{k→∞} pix^(n′k) = αx, for all x ∈ S.

Since Σ_{x∈S} pix^(n′k) = 1 for every k, Lemma 11 gives

0 < Σ_{x∈S} αx ≤ 1,

where the left inequality holds because αj > 0. Now use P^{n+1} = P^n P, i.e.

pix^(n′k + 1) = Σ_{y∈S} piy^(n′k) pyx.

By Lemma 11,

lim pix^(n′k + 1) ≥ Σ_{y∈S} ( lim_{k→∞} piy^(n′k) ) pyx = Σ_{y∈S} αy pyx,

whence (suppressing the routine further diagonal extraction that identifies the limit on the left with αx)

αx ≥ Σ_{y∈S} αy pyx, for all x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that for some x0 ∈ S it is a strict inequality:

αx0 > Σ_{y∈S} αy pyx0.

This would imply that

Σ_{x∈S} αx > Σ_{x∈S} Σ_{y∈S} αy pyx = Σ_{y∈S} αy Σ_{x∈S} pyx = Σ_{y∈S} αy,

and this is a contradiction: the number Σ_x αx cannot be strictly larger than itself. Therefore

αx = Σ_{y∈S} αy pyx, for all x ∈ S.

Since 0 < Σ_x αx ≤ 1, we have that

π(x) := αx / Σ_{y∈S} αy, x ∈ S,

satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0, because αj > 0. By Corollary 13, state j is positive recurrent: a contradiction, since j was assumed to be null recurrent.

19.2 The general case

It remains to see what the limit of pij^(n) is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Let j be a positive recurrent state with period 1, let C be the communicating class of j, and let π^C be the stationary distribution associated with C; recall that π^C(j) = 1/Ej Tj, where Tj = inf{n ≥ 1 : Xn = j}. Let i be an arbitrary state, let

HC := inf{n ≥ 0 : Xn ∈ C},

and define ϕC(i) := Pi(HC < ∞).

Theorem 20. With the notation above,

lim_{n→∞} pij^(n) = ϕC(i) π^C(j).

Proof. If i belongs to C, then ϕC(i) = 1 and the result follows from Theorem 17. In general, since the chain cannot be at j at time n without having entered C by time n, we have

pij^(n) = Pi(Xn = j) = Pi(Xn = j, HC ≤ n) = Σ_{m=1}^n Pi(HC = m) Pi(Xn = j | HC = m).

Let µ be the distribution of X_{HC}, given that X0 = i and that {HC < ∞}. From the Strong Markov Property,

Pi(Xn = j | HC = m) = Pµ(Xn−m = j).

But Theorem 17 tells us that the limit exists regardless of the initial distribution: Pµ(Xn−m = j) → π^C(j), as n → ∞. Hence

lim_{n→∞} pij^(n) = π^C(j) Σ_{m=1}^∞ Pi(HC = m) = π^C(j) Pi(HC < ∞) = ϕC(i) π^C(j).

Remark: Since fij = Pi(Tj < ∞) = ϕC(i) (once the chain enters the recurrent class C, it will visit j with probability one), we can write the result of Theorem 20 as

lim_{n→∞} pij^(n) = fij / Ej Tj.

In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7; this is a more old-fashioned way of getting the same result.
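Theorem 20 is easy to check numerically on a small example. In the sketch below (the chain is hypothetical: one transient state 0 feeding two absorbing states 1 and 2), first-step analysis gives ϕC(0) = 0.25/(1 − 0.5) = 0.5 for C = {1}, and π^C(1) = 1, so the predicted limit of p01^(n) is 0.5.

import numpy as np

P = np.array([[0.5, 0.25, 0.25],
              [0.0, 1.0,  0.0 ],
              [0.0, 0.0,  1.0 ]])   # state 0 transient; 1 and 2 absorbing
Pn = np.linalg.matrix_power(P, 200)
print(Pn[0])                        # approximately [0, 0.5, 0.5], as predicted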

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: five states with p11 = 1 − γ, p12 = γ, p21 = 1, p32 = 1 − α, p34 = α, p43 = β, p45 = 1 − β, p55 = 1.]

The communicating classes are I = {3, 4} (inessential states), C1 = {1, 2} (closed class) and C2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are:

π^{C1}(1) = 1/(1+γ), π^{C1}(2) = γ/(1+γ), π^{C2}(5) = 1.

Both C1, C2 are aperiodic classes. Let ϕ(i) be the probability that class C1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, and, from first-step analysis,

ϕ(3) = (1 − α) ϕ(2) + α ϕ(4), ϕ(4) = (1 − β) ϕ(5) + β ϕ(3),

whence

ϕ(3) = (1 − α)/(1 − αβ), ϕ(4) = β(1 − α)/(1 − αβ).

Hence, as n → ∞,

p31^(n) → ϕ(3) π^{C1}(1), p32^(n) → ϕ(3) π^{C1}(2), p33^(n) → 0, p34^(n) → 0, p35^(n) → (1 − ϕ(3)) π^{C2}(5).

Similarly,

p41^(n) → ϕ(4) π^{C1}(1), p42^(n) → ϕ(4) π^{C1}(2), p43^(n) → 0, p44^(n) → 0, p45^(n) → (1 − ϕ(4)) π^{C2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, for example, information about its velocity.) Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. We will avoid giving a precise definition of the term 'ergodic', and only mention what constitutes a result in the area.

The Law of Large Numbers itself (Theorem 13) is a type of ergodic theorem, and so is Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this. Recall also the quantity ν[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P(Ta < ∞) = 1. Let f be a reward function such that

Σ_{x∈S} |f(x)| ν[a](x) < ∞.   (20)

Then

P( lim_{t→∞} [ Σ_{n=0}^t f(Xn) / Σ_{n=0}^t 1(Xn = a) ] = f̄ ) = 1,   (21)

where

f̄ := Σ_{x∈S} f(x) ν[a](x).

Proof. The reader is asked to revisit the proof of Theorem 14. As there, we break the total reward into pieces:

Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + G_first + G_last,

where Gr = Σ_{Ta^(r) ≤ s < Ta^(r+1)} f(Xs), while G_first and G_last are the total rewards over the first and last incomplete cycles, respectively. As in the proof of Theorem 14, the contributions of G_first and G_last are negligible in the limit, and this is left as an exercise. The Law of Large Numbers tells us that

lim_{n→∞} (1/n) Σ_{r=1}^n Gr = EG1,

with probability 1, as long as E|G1| < ∞; but this is precisely the condition (20). Note that the denominator Σ_{n=0}^t 1(Xn = a) in (21) equals the number Nt of visits to state a up to time t; since Nt → ∞, replacing n by the subsequence Nt in the last display and dividing the numerator and the denominator of the fraction in (21) by Nt, we obtain the result, once we check that EG1 = f̄; and this is done exactly as in the proof of Theorem 14.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a; here a need only be recurrent, but we normalise by the number of visits to a instead of by t.

Remark 2: If the chain is irreducible and positive recurrent, then the Ergodic Theorem in the form of Theorem 14 holds: if we divide by t the numerator and the denominator of the fraction in (21), then, as we saw in the proof of Theorem 14, the denominator Nt/t converges to 1/Ea Ta, so the numerator converges to f̄/Ea Ta, which is the quantity denoted by f̄ in Theorem 14. This also justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Consider an irreducible chain with finitely many states. Then at least one of the states must be visited infinitely often, so there are always recurrent states; pick a recurrent ground state a. From Proposition 2 we know that the balance equations ν = νP have at least one solution, namely ν = ν[a]. Since the number of states is finite, we have

C := Σ_{i∈S} ν(i) < ∞.

Hence

π(i) := ν(i)/C, i ∈ S,

satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0; by Corollary 13, this state must be positive recurrent; and then, because positive recurrence is a class property (Theorem 15), all states are positive recurrent. By Theorem 16, the stationary distribution is unique. If, in addition, the chain is aperiodic, we have Pⁿ → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent, with a unique stationary distribution π. If, in addition, the chain is aperiodic, then Pⁿ → Π, as n → ∞, where Π is a matrix all the rows of which are equal to π.

22 Time reversibility

Consider a Markov chain (Xn, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, the chain is reversible if

(X0, X1, ..., Xn) has the same distribution as (Xn, Xn−1, ..., X0), for all n.

In a sense, running a reversible chain backwards is like playing a film backwards without being able to tell that it is. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same. But if we film a man who walks, and run the film backwards, then we instantly know that the film is played in reverse. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X0, X1, ..., Xn) has the same distribution as (Xn, Xn−1, ..., X0) for all n. In particular, the first components of the two vectors must have the same distribution: Xn has the same distribution as X0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π, started from π. Then the chain is reversible if and only if the detailed balance equations (DBE) hold:

π(i) pij = π(j) pji, i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X0, X1) has the same distribution as (X1, X0), i.e.

P(X0 = i, X1 = j) = P(X1 = i, X0 = j),

for all i and j, and this means that π(i) pij = π(j) pji. So the DBE hold.

Conversely, suppose the DBE hold. Pick 3 states i, j, k, and write, using the DBE,

π(i) pij pjk = [π(j) pji] pjk = [π(j) pjk] pji = [π(k) pkj] pji;

this means that P(X0 = i, X1 = j, X2 = k) = P(X0 = k, X1 = j, X2 = i), i.e. (X0, X1, X2) has the same distribution as (X2, X1, X0). The same logic works for any number of steps and can be worked out by induction, so that, for all n, (X0, X1, ..., Xn) has the same distribution as (Xn, Xn−1, ..., X0). Hence the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i1, i2, ..., im ∈ S and all m ≥ 3,

p_{i1 i2} p_{i2 i3} ··· p_{i_{m−1} i_m} p_{i_m i1} = p_{i1 i_m} p_{i_m i_{m−1}} ··· p_{i2 i1}.   (22)

Proof. Suppose first that the chain is reversible; we will show the KLC. Since the chain is irreducible, we have π(i) > 0 for all i. For m = 3, pick 3 states i, j, k and write, using the DBE,

π(i) pij pjk pki = [π(j) pji] pjk pki = [π(j) pjk] pji pki = [π(k) pkj] pji pki = [π(k) pki] pji pkj = [π(i) pik] pji pkj = π(i) pik pkj pji.

Hence, cancelling the π(i) from the two ends of the above display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Applying it to the loop i1 → i2 → i3 → ··· → i_m → i1 and summing over all choices of the intermediate states i3, i4, ..., i_m, we obtain

p_{i1 i2} p^{(m−1)}_{i2 i1} = p^{(m−1)}_{i1 i2} p_{i2 i1}.

If the chain is aperiodic, then, letting m → ∞, we have p^{(m−1)}_{i2 i1} → π(i1) and p^{(m−1)}_{i1 i2} → π(i2), implying that

π(i1) p_{i1 i2} = π(i2) p_{i2 i1}.

These are the DBE, so the chain is reversible.

Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. In particular, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible.

2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j; in other words, that we have separation of variables:

pij / pji = f(i) g(j).

) Let A.e. i. Ac be the sets in the two connected components. so we have pij g(j) = . pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . there would be a loop. Ac ) = F (Ac . to guarantee. j for which pij > 0. and the arrow from j to j is the only one from Ac to A. for each pair of distinct states i. The balance equations are written in the form F (A. There is obviously only one arrow from AtoAc . the detailed balance equations. and. f (i)g(i) must be 1 (why?). So g satisfies the balance equations. Hence the original chain is reversible. the two oriented edges from i to j and from j to i by a single edge without arrow.1) which gives π(i)pij = π(j)pji . and by replacing. If we remove the link between them. (If it didn’t. using another method. that the chain is positive recurrent. j for which pij > 0. Example 1: Chain with a tree-structure. namely the arrow from i to j.(Convention: we form this ratio only when pij > 0. 60 . consider the chain: Replace it by: This is a tree. If the graph thus obtained is a tree. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. Pick two distinct states i. meaning that it has no loops. then the chain is reversible. The reason is as follows. A) (see §4. And hence π can be found very easily. • If the chain has infinitely many states. there isn’t any. Form the graph by deleting all self-loops. For example.) Necessarily. then we need. the graph becomes disconnected. Hence the chain is reversible. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. in addition. by assumption.

where C is a constant. For each state i define its degree δ(i) = number of neighbours of i. States j are i are called neighbours if they are connected by an edge. 2. If i. π(i) = 1. We claim that π(i) = Cδ(i). The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example. The constant C is found by normalisation. 0. Consider an undirected graph G with vertices S. The stationary distribution of a RWG is very easy to find. 61 . j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. we have that C = 1/2|E|. Hence we conclude that: 1. Suppose the graph is connected. if j is a neighbour of i otherwise. a finite set.e. There are two cases: If i. Each edge connects a pair of distinct vertices (states) without orientation. etc. so both members are equal to C. i∈S Since δ(i) = 2|E|. p02 = 1/4. the DBE hold. i. A RWG is a reversible chain. i∈S where |E| is the total number of edges. In all cases. and a set E of undirected edges. Now define the the following transition probabilities: pij = 1/δ(i). π(i)pij = π(j)pji . 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. To see this let us verify that the detailed balance equations hold. The stationary distribution assigns to a state probability which is proportional to its degree. j are not neighbours then both sides are 0.Example 2: Random walk on a graph.

2. Due to the Strong Markov Property.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. π is a stationary distribution for the original chain. . irreducible and positive recurrent). Let A be (r) a set of states and let TA be the r-th visit to A. . .) has the Markov property but it is not time-homogeneous. 1. in general. k = 0. 2. i.3 Subordination Let (Xn . The sequence (Yk . if nk = n0 + km. then (Yk . k = 0. k = 0. . (m) (m) 23.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. n ≥ 0) be Markov with transition probabilities pij . 1. .23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). where pij are the m-step transition probabilities of the original chain. where ξt are i. 1. random variables with values in N and distribution λ(n) = P (ξ0 = n). St+1 = St + ξt . then it is a stationary distribution for the new chain.e. . 2. t ≥ 0) be independent from (Xn ) with S0 = 0. this process is also a Markov chain and it does have time-homogeneous transition probabilities. Let (St . . . Define Yr := XT (r) . . t ≥ 0. 2. In this case. . k = 0. A r = 1.e. . If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). .) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . If the subsequence forms an arithmetic progression.e. .d. . if. 23. . i. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . . how can we create new ones? 23. 62 n ∈ N. 1. The state space of (Yr ) is A. 2. in addition. .i.

We need to show that P (Yn+1 = β | Yn = α. . α1 . If. β).4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ .. . Proof. that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). we have: Lemma 13. . Then (Yn ) has the Markov property. a Markov chain. and the elements of S ′ by α. j. .Then Yt := XSt . Yn−1 = αn−1 }.. ∞ λ(n)pij . Fix α0 . f is a one-to-one function then (Yn ) is a Markov chain. i. n≥0 63 . Yn−1 ) = P (Yn+1 = β | Yn = α). in general. conditional on Yn the past is independent of the future. is NOT. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). . . . is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. . then the process Yn = f (Xn ). β. (n) 23. . αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . however. (Notation: We will denote the elements of S by i. . Indeed. . . . knowledge of f (Xn ) implies knowledge of Xn and so.e.) More generally. .

Indeed. Yn−1 )P (Xn = i | Yn = α. β ∈ Z+ . the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. if we let 1/2. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Example: Consider the symmetric drunkard’s random walk. and so (Yn ) is. β) i∈S P (Yn = α|Xn = i. because the probability of going up is the same as the probability of going down. β). We have that P (Yn+1 = β|Xn = i) = Q(|i|. i ∈ Z. then |Xn+1 | = 1 for sure. regardless of whether i > 0 or i < 0. and if i = 0. On the other hand. for all i ∈ Z and all β ∈ Z+ . β) P (Yn = α. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. 64 . Yn = α.e. β). conditional on {Xn = i}. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. β) P (Yn = α|Yn−1 ) i∈S = Q(α. Yn−1 ) Q(f (i). if β = α ± 1. depends on i only through |i|. i ∈ Z. Yn−1 ) Q(f (i). if β = 1. β)P (Xn = i | Yn = α. indeed. α = 0. observe that the function | · | is not one-to-one.for all choices of the variables involved. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). β). β) P (Yn = α|Xn = i. Define Yn := |Xn |. In other words. α > 0 Q(α. Then (Yn ) is a Markov chain. a Markov chain with transition probabilities Q(α. if i = 0. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. P (Xn+1 = i ± 1|Xn = i) = 1/2. Thus (Yn ) is Markov with transition probabilities Q(α. i. But P (Yn+1 = β | Yn = α. β) := 1. But we will show that P (Yn+1 = β|Xn = i). To see this. β).

a hard problem. 1. j In this case. 0 ≤ j ≤ i. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. But here is an interesting special case. 2. if 0 is never reached then Xn → ∞. we have that (Xn . . n ≥ 0) has the Markov property. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. ξ (n) ). so we will find some different method to proceed.24 24. In other words. which has a binomial distribution: i j pij = p (1 − p)i−j . It is a Markov chain with time-homogeneous transitions. We let pk be the probability that an individual will bear k offspring. with probability 1. . If we let ξ (n) = ξ1 . k = 0. . . (and independent of the initial population size X0 –a further assumption). independently from one another. Hence all states i ≥ 1 are inessential. We find the population (n) at time n + 1 by adding the number of offspring of each individual.. 0 But 0 is an absorbing state. then we see that the above equation is of the form Xn+1 = f (Xn . Back to the general case. and. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. Therefore. Thus. . We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. The state 0 is irrelevant because it cannot be reached from anywhere. Hence all states i ≥ 1 are transient. one question of interest is whether the process will become extinct or not. The parent dies immediately upon giving birth. So assume that p1 = 1. if the current population is i there is probability pi that everybody gives birth to zero offspring. if the 65 (n) (n) . p0 = 1 − p. . . Let ξk be the number of offspring of the k-th individual in the n-th generation. Example: Suppose that p1 = p. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. since ξ (0) . in general. ξ (1) . therefore transient. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . the population cannot grow: it will eventually reach 0. and 0 is always an absorbing state. and following the same distribution.i. are i. . ξ2 . . We will not attempt to compute the pij neither draw the transition diagram because they are both futile. Then P (ξ = 0) + P (ξ ≥ 2) > 0. each individual gives birth to at one child with probability p or has children with probability 1 − p.d.

If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. Obviously.population does not become extinct then it must grow to infinity. Let ε be the extinction probability of the process. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. . if we start with X0 = i in ital ancestors. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. We then have that D = D1 ∩ · · · ∩ Di . so P (D|X0 = i) = εi . Proof. Therefore the problem reduces to the study of the branching process starting with X0 = 1. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . . This function is of interest because it gives information about the extinction probability ε. . . ε(0) = 1. Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. where ε = P1 (D). We wish to compute the extinction probability ε(i) := Pi (D). If µ ≤ 1 then the ε = 1. each one behaves independently of one another. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. Indeed. when we start with X0 = i. and. 66 . Indeed. since Xn = 0 implies Xm = 0 for all m ≥ n. we can argue that ε(i) = εi . The result is this: Proposition 5. For each k = 1. i. For i ≥ 1. ϕn (0) = P (Xn = 0). and P (Dk ) = ε. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ.

ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). (See right figure below. Since both ε and ε∗ satisfy z = ϕ(z). all we need to show is that ε < 1. By induction. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. This means that ε = 1 (see left figure below). But 0 ≤ ε∗ . Therefore. and since the latter has only two solutions (z = 0 and z = ε∗ ). Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . ε = ε∗ . Also ϕn (0) → ε. ϕn (0) ≤ ε∗ . since ϕ is a continuous function. if the slope µ at z = 1 is ≤ 1. and so on. In particular. z=1 Also. Therefore ε = ε∗ . in fact. But ϕn+1 (0) → ε as n → ∞. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). ϕn+1 (z) = ϕ(ϕn (z)). Let us show that. and. So ε ≤ ε∗ < 1. Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). Therefore ε = ϕ(ε). Notice now that µ= d ϕ(z) dz .Note also that ϕ0 (z) = 1. then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. ϕn+1 (0) = ϕ(ϕn (0)). But ϕn (0) → ε. if µ > 1. recall that ϕ is a convex function with ϕ(1) = 1.) 67 . ϕ(ϕn (0)) → ϕ(ε). On the other hand. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)).

then time slots 0. . using the same frequency. 3. An the number of new packets on the same time slot.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. Case: µ > 1. . So. If collision happens. 68 . 2. Ethernet has replaced Aloha. there is collision and the transmission is unsuccessful. Since things happen fast in a computer network. . If s ≥ 2.. Xn+1 = Xn + An . Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. 3. . only one of the hosts should be transmitting at a given time. periodically. If s = 1 there is a successful transmission and. then the transmission is not successful and the transmissions must be attempted anew. are allocated to hosts 1. We assume that the An are i. 24. 4. One obvious solution is to allocate time slots to hosts in a specific order.i. 3.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). So if there are. . Let s be the number of selected packets. in a simplified model. The problem of communication over a common channel. and Sn the number of selected packets. 2. there is a collision and. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. Thus. Abandoning this idea. . otherwise. if Sn = 1. on the next times slot. then max(Xn + An − 1. k ≥ 0. 1. is that if more than one hosts attempt to transmit a packet. 0). 5. say. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. for a packet to go through. During this time slot assume there are a new packets that need to be transmitted. 1. Then ε < 1. Nowadays. there are x + a packets. we let hosts transmit whenever they like. random variables with some common distribution: P (An = k) = λk . on the next times slot. 6. there are x + a − 1 packets.d. Then ε = 1. 3 hosts. 2. 1. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). there is no time to make a prior query whether a host has something to transmit or not.

Indeed. The random variable Sn has. So we can make q x G(x) as small as we like by choosing x sufficiently large. n ≥ 0) is a Markov chain. . 69 .where λk ≥ 0 and k≥0 kλk = 1. Let us find its transition probabilities. then the chain is transient.) = x + a k x−k p q . the Markov chain (Xn ) is transient. for example we can make it ≤ µ/2. and p. Since p > 0. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). The Aloha protocol is unstable. An−1 . Xn−1 . So: px. The main result is: Proposition 6.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . i. we have that the drift is positive for all large x and this proves transience. To compute px. . k ≥ 1.x+k for k ≥ 1. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x.x−1 when x > 0. and exactly one selection: px. Sn = 1|Xn = x) = λ0 xpq x−1 . . Therefore ∆(x) ≥ µ/2 for all large x. where q = 1 − p. we have ∆(x) = (−1)px. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. we must have no new packets arriving. The probabilities px. Unless µ = 0 (which is a trivial case: no arrivals!). so q x G(x) tends to 0 as x → ∞.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). if Xn + An = 0. we have q < 1. An = a. Proving this is beyond the scope of the lectures.e.x−1 + k≥1 kpx. We also let µ := k≥1 kλk be the expectation of An . defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. k 0 ≤ k ≤ x + a. we think as follows: to have a reduction of x by 1. We here only restrict ourselves to the computation of this expected change. To compute px. The reason we take the maximum with 0 in the above equation is because. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An .x are computed by subtracting the sum of the above from 1.x−1 = P (An = 0. we should not subtract 1. and so q x tends to 0 much faster than G(x) tends to ∞. where G(x) is a function of the form α + βx . For x > 0. Idea of proof.e. The process (Xn . So P (Sn = k|Xn = x. selections are made amongst the Xn + An packets.

We pick α between 0 and 1 (typically. Comments: 1.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. because of the size of the problem. It uses advanced techniques from linear algebra to speed up computations. In an Internet-dominated society. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. if j ∈ L(i)  pij = 1/|V |. All you need to do to get high rank is pay (a lot of money). 2. in the latter case. PageRank. then i is a ‘dangling page’. there are companies that sell high-rank pages. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. that allows someone to manipulate the PageRank of a web page and make it much higher. α ≈ 0.24. So we make the following modification. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. If L(i) = ∅. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. otherwise. the chain moves to a random page. Therefore. Academic departments link to each other by hiring their 70 . So this is how Google assigns ranks to pages: It uses an algorithm. |V | These are new transition probabilities. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. For each page i let L(i) be the set of pages it links to. Then it says: page i has higher rank than j if π(i) > π(j). the rank of a page may be very important for a company’s profits. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. So if Xn = i is the current location of the chain. The idea is simple. Its implementation though is one of the largest matrix computations in the world. unless i is a dangling page. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. Define transition probabilities as follows:  1/|L(i)|. There is a technique. known as ‘302 Google jacking’. neither that it is aperiodic. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P.2) and let α pij := (1 − α)pij + . 3. if L(i) = ∅   0. It is not clear that the chain is irreducible.

24. 4. all we need is the information outlined in this figure. colour of eyes) appears in two forms. a dominant (say A) and a recessive (say a). 5.g. If so. But an AA mating with a Aa may produce a child of type AA or one of type Aa. Three of the square alleles are never selected. this allele is of type A with probability x/2N . at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. a gene controlling a specific characteristic (e. two AA individuals will produce and AA child. generation after generation. Thus. it makes the evaluation procedure easy.4 The Wright-Fisher model: an example from biology In an organism. the genetic diversity depends only on the total number of alleles of each type. The alleles in an individual come in pairs and express the individual’s genotype for that gene. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. maintain this allele for the next generation. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). We will also assume that the parents are replaced by the children. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. Do the same thing 2N times. This means that the external appearance of a characteristic is a function of the pair of alleles.faculty from each other (and from themselves). We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. their offspring inherits an allele from each parent. Imagine that we have a population of N individuals who mate at random producing a number of offspring. Since we don’t care to keep track of the individuals’ identities. We have a transition from x to y if y alleles of type A were produced. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). in each generation. To simplify. we will assume that. One of the square alleles is selected 4 times. meaning how many characteristics we have. The process repeats. And so on. When two individuals mate. We would like to study the genetic diversity. What is important is that the evaluation can be done without ever looking at the content of one’s research. We will assume that a child chooses an allele from each parent at random. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. 71 . from the point of view of an administrator. This number could be independent of the research quality but. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. then. One of the circle alleles is selected twice.

ϕ(x) = x/M. i. as usual. . . p00 = p2N. so both 0 and 2N are absorbing states. The fact that M is even is irrelevant for the mathematical model. Xn ). Tx := inf{n ≥ 1 : Xn = x}. the states {1. Clearly. . . Clearly. y ≤ 2N − 1. E(Xn+1 |Xn . i. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM .) One can argue that. with n P (ξk = 1|Xn . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x.2N = 1. Similarly. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. on the average. . . Since selections are independent.Each allele of type A is selected with probability x/2N . . and the population becomes one consisting of alleles of type a only. EXn+1 = EXn Xn = Xn . These are: M ϕ(0) = 0.e. X0 ) = E(Xn+1 |Xn ) = M This implies that. . the number of alleles of type A won’t change. Since pxy > 0 for all 1 ≤ x. the expectation doesn’t change and. we see that 2N x 2N −y x y pxy = . 2N − 1} form a communicating class of inessential states. . the chain gets absorbed by either 0 or 2N . (Here.d. . . The result is correct. for all n ≥ 0. n n where. . and the population becomes one consisting of alleles of type A only. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). Let M = 2N . ξM are i. M n P (ξk = 0|Xn .e. on the average. . conditionally on (X0 . 72 . ϕ(x) = y=0 pxy ϕ(y). since. . 0 ≤ y ≤ 2N. . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. but the argument needs further justification because “the end of time” is not a deterministic time. . the random variables ξ1 . M and thus. .i. X0 ) = Xn . We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. ϕ(M ) = 1. . . since. . . 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . . M Therefore. . . X0 ) = 1 − Xn .

The correctness of the latter can formally be proved by showing that it does satisfy the recursion. on the average. Otherwise.The first two are obvious. there are Xn + An − Dn components in the stock. Here are our assumptions: The pairs of random variables ((An . b). Xn electronic components at time n. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. Working for n = 1. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. Thus. 73 . M −1 y−1 x M y−1 1− x . . A number An of new components are ordered and added to the stock. say. 2. and so. the demand is met instantaneously. Dn ). and the stock reached level zero at time n + 1. . M x M (M −1)−(y−1) x x +1− M M M −1 = 24. If the demand is for Dn components. We will apply the so-called Loynes’ method. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. and E(An − Dn ) = −µ < 0. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . . for all n ≥ 0.i.d. Let ξn := An − Dn .5 A storage or queueing application A storage facility stores. the demand is only partially satisfied. n ≥ 0) are i. while a number of them are sold to the market. larger than the supply per unit of time. at time n + 1. n ≥ 0) is a Markov chain. provided that enough components are available. We have that (Xn . the demand is. We will show that this Markov chain is positive recurrent. we will try to solve the recursion. Since the Markov chain is represented by this very simple recursion. so that Xn+1 = (Xn + ξn ) ∨ 0.

. . that d (ξ0 . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. In particular. the random variables (ξn ) are i. . . ξn−1 ). . the event whose probability we want to compute is a function of (ξ0 . . so we can shuffle them in any way we like without altering their joint distribution. . (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . . = x≥0 Since the n-dependent terms are bounded. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). . . for y ≥ 0. ξ0 ).i. Therefore. we reverse it. Notice. instead of considering them in their original order. P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . . however. where the latter equality follows from the fact that. .d. . ξn−1 . . 74 . . in computing the distribution of Mn which depends on ξ0 . and. n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. we may replace the latter random variables by ξ1 . . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). . using π(y) = limn→∞ P (Mn = y). for all n. . Hence.Thus. . we may take the limit of both sides as n → ∞. ξn−1 ) = (ξn−1 . we find π(y) = x≥0 π(x)pxy . ξn . (n) . Indeed. .

This will be the case if g(y) is not identically equal to 1. Here. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . Z2 .} < ∞) = 1. .So π satisfies the balance equations. . Depending on the actual distribution of ξn . Z2 . n→∞ Hence g(y) = P (0 ∨ max{Z1 . the Markov chain may be transient or null recurrent. . n→∞ This implies that P ( lim Zn = −∞) = 1. . We must make sure that it also satisfies y π(y) = 1. The case Eξn = 0 is delicate. as required. From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. . 75 . .} > y) is not identically equal to 1. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. we will use the assumption that Eξn = −µ < 0.

This is the drunkard’s walk we’ve seen before. x ∈ Zd . Define  pj . a structure that enables us to say that px. but. the skip-free property. This means that the transition probability pxy should depend on x and y only through their relative positions in space. . 0.and space-homogeneity. More general spaces (e. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. ±ed } otherwise 76 . that of spatial homogeneity. . . p(−1) = q. If d = 1.g. For example. here. have been processes with timehomogeneous transition probabilities. . so far. we won’t go beyond Zd . . d  p(x) = qj . this walk enjoys.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. we need to give more structure to the state space S. .and space-homogeneous transition probabilities given by px. a random walk is a Markov chain with time. Example 3: Define p(x) = 1 2d . besides time. In dimension d = 1.y = px+z. Example 1: Let p(x) = C2−|x1 |−···−|xd | . we can let S = Zd . . Our Markov chains. where p1 + · · · + pd + q1 + · · · + qd = 1. . if x = −ej . . if x ∈ {±e1 . j = 1. otherwise where p + q = 1.y = p(y − x). . groups) are allowed. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. To be able to say this.y+z for any translation z. if x = ej . . p(x) = 0. d   0. So. given a function p(x). j = 1. where C is such that x p(x) = 1. then p(1) = p. otherwise. namely. the set of d-dimensional vectors with integer coordinates. A random walk of this type is called simple random walk or nearest neighbour random walk. A random walk possesses an additional property. .

d. y p(x)p′ (y − x). We refer to ξn as the increments of the random walk. Indeed. then p(1) = p(−1) = 1/2. One could stop here and say that we have a formula for the n-step transition probability. x + ξ2 = y) = y p(x)p(y − x). Let us denote by Sn the state of the walk at time n. . The special structure of a random walk enables us to give more detailed formulae for its behaviour. If d = 1. Obviously. The convolution of two probabilities p(x). p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). . x ± ed . But this would be nonsense. otherwise. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ .) In this notation. . Furthermore. We call it simple symmetric random walk. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. . (When we write a sum without indicating explicitly the range of the summation variable. the initial state S0 is independent of the increments. . x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. in addition. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. random variables in Zd with common distribution P (ξn = x) = p(x).i. and this is the symmetric drunkard’s walk. are i. we will mean a sum over all possible values. . p(x) = 0. ξ2 . (Recall that δz is a probability distribution that assigns probability 1 to the point z. computing the n-th order convolution for all n is a notoriously hard problem. We will let p∗n be the convolution of p with itself n times. where ξ1 . we can reduce several questions to questions about this random walk. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). p′ (x). Furthermore. p ∗ δz = p. . in general. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 .) Now the operation in the last term is known as convolution. x ∈ Zd .This is a simple random walk. which. We shall resist the temptation to compute and look at other 77 . We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). So. because all we have done is that we dressed the original problem in more fancy clothes.

. Will the random walk ever return to where it started from? B. Events A that depend only on these the first n random signs are subsets of this sample space. After all. . a random vector with values in {−1. and so #A P (A) = n . . First. we have P (ξ1 = ε1 . So if the initial position is S0 = 0. 1) → (4. .1) + (0. . 1): (2. 0) → (1. +1}n . all possible values of the vector are equally likely. we are dealing with a uniform distribution on the sample space {−1. . The principle is trivial. +1.2) where the ξn are i.2) (4. but its practice may be hard. Here are a couple of meaningful questions: A. Clearly. If yes. + (1. −1} then we have a path in the plane that joins the points (0. Therefore.methods. +1. for fixed n.i+1 = pi. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. ξn = εn ) = 2−n . 2) → (5. . Look at the distribution of (ξ1 . we should assign. As a warm-up. a path of length n. and the sequence of signs is {+1. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. if we can count the number of elements of A we can compute its probability. .1) + − (5. So.i.i−1 = 1/2 for all i ∈ Z.1) − (3. A ⊂ {−1. In the sequel.0) We can thus think of the elements of the sample space as paths. consider the following: 78 . −1. to each initial position and to each sequence of n signs. we shall learn some counting methods.d. If not. random variables (random signs) with P (ξn = ±1) = 1/2. 1) → (2. +1}n . 2) → (3. +1}n . 2 where #A is the number of elements of A. how long will that take? C. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . ξn ). Since there are 2n elements in {0. 1}n .

) The darker region shows all paths that start at (0. 0) and end up at the specific point (n. (n − m + y − x)/2 (23) Remark. Proof. y)} =: {all paths that start at (m. i). The paths comprising this event are shown below. Consider a path that starts from (0. 0) and n ends at (n. and compute the probability of the event A = {S2 ≥ 0. S7 < 0}. the number of paths that start from (m. y)} = n−m . y) is given by: #{(m. 0) and end up at (n. 0) (n.i) (0. S4 = −1. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. x) and end up at (n. 0)). n 79 . x) and end up at (n. i). Let us first show that Lemma 14. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. i)} = n n+i 2 More generally. (There are 2n of them. in this case.Example: Assume that S0 = 0. The number of paths that start from (0.1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point.0) The lightly shaded region contains all paths of length n that start at (0. x) (n. i). S6 < 0. y)}.047. x) (n. Hence (n+i)/2 = 0. We use the notation {(m. (n. 0) and ends at (n. In if n + i is not an even number then there is no path that starts from (0. We can easily count that there are 6 paths in A. y) is given by: #{(0.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

2 The ballot theorem Theorem 25. . The probability that. . . . y)} . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . n This is called the ballot theorem and the reason will be given in Section 29. n/2 (29) #{(0. 0) (n. if n is even. n 2 #{paths (m. if n is even. for now. The left side of (30) equals #{paths (1. we shall just compute it. . . y) that do not touch zero} × 2−n #{paths (0. the random walk never becomes zero up to time n is given by P0 (S1 > 0. P0 (Sn = y) 2 (n + y) + 1 . y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . If n. . 27. . 1) (n. But. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. given that S0 = 0 and Sn = y > 0. . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). Such a simple answer begs for a physical/intuitive explanation. x) 2n−m (n.P (Sn = y|Sm = x) = In particular. 27. Sn ≥ 0 | Sn = 0) = 84 1 . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . 0) (n. n (30) Proof. n 2−n . S2 > 0. . n−m 2−n−m . P0 (S1 ≥ 0 . Sn ≥ 0 | Sn = y) = 1 − In particular.3 Some conditional probabilities Lemma 17. y)} . Sn > 0 | Sn = y) = y .

. . Sn = 0) = P0 (S1 > 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . ξ3 .) by (ξ1 . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. If n is even. . . . . . Sn = 0) = P0 (Sn = 0). . (b) follows from symmetry. . . . . . . . . . . . . . Proof.). We use (27): P0 (Sn ≥ 0 . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . . 0) (n. and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . P0 (S1 ≥ 0. . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . . n/2 where we used (28) and (29). P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . 1 + ξ2 + ξ3 > 0. . . Sn < 0) (b) (a) = 2P0 (S1 > 0. and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. ξ2 . •). . and Sn−1 > 0 implies that Sn ≥ 0. (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . 85 . (e) follows from the fact that ξ1 is independent of ξ2 . . . Sn ≥ 0. . . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . We next have P0 (S1 = 0. . . . .4 Some remarkable identities Theorem 26. . . . Sn ≥ 0) = P0 (S1 = 0. . . (c) (d) (e) (f ) (g) where (a) is obvious. . . . . . . . . . +1 27. ξ2 + ξ3 ≥ 0. Sn ≥ 0) = #{(0. Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. Sn ≥ 0). . Sn > 0) + P0 (S1 < 0. . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0.Proof. . . . (f ) follows from the replacement of (ξ2 . . ξ3 . . For even n we have P0 (S1 ≥ 0.

It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. . . we computed. We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. . P0 (Sn−1 > 0. Sn−1 < 0. n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. .5 First return to zero In (29). By the Markov property at time n − 1. To take care of the last term in the last expression for f (n). Sn−2 > 0|Sn−1 = 1).27. the two terms must be equal. Sn = 0) = 2P0 (Sn−1 > 0. . 86 . . f (n) = P0 (S1 > 0. . for even n. Sn = 0) + P0 (S1 < 0. Sn−1 = 0. If a particle is to the right of a then you see its image on the left. . Also. . Sn = 0). Sn = 0) P0 (S1 > 0. . This is not so interesting. . n We know that P0 (Sn = 0) = u(n) = n/2 2−n . Theorem 27. . and u(n) = P0 (Sn = 0). . . Sn−1 > 0. Put a two-sided mirror at a. . with equal probability. Sn−1 > 0. Let n be even. So the last term is 1/2. . . we either have Sn−1 = 1 or −1. Sn = 0 means Sn−1 = 1. Sn = 0) = u(n) . Sn−2 > 0|Sn−1 > 0. By symmetry. Then f (n) := P0 (S1 = 0. the probability u(n) that the walk (started at 0) attains value 0 in n steps. given that Sn = 0. Sn = 0) Now. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). . So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. Sn = 0. and if the particle is to the left of a then its mirror image is on the right. Fix a state a. Since the random walk cannot change sign before becoming zero. we can omit the last event. . . So 1 f (n) = 2u(n) P0 (S1 > 0. we see that Sn−1 > 0. . So u(n) f (n) = . So f (n) = 2P0 (S1 > 0. . n−1 Proof. .

m ≥ 0) is a simple symmetric random walk. define Sn = where is the first hitting time of a. But a symmetric random walk has the same law as its negative. Then. we have n ≥ Ta . as a corollary to the above result. . Proof. n ∈ Z+ ) has the same law as (Sn . we obtain 87 . Then Ta = inf{n ≥ 0 : Sn = a} Sn . Theorem 28. n ∈ Z+ ). So Sn − a can be replaced by a − Sn in the last display and the resulting process. . Thus. 2a − Sn ≥ a) = P0 (Ta ≤ n. Sn ≤ a). P0 (Sn ≥ a) = P0 (Ta ≤ n. the process (STa +m − a. observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). n ≥ 0) be a simple symmetric random walk and a > 0. In other words. independent of the past before Ta . . S1 . Sn ≤ a) + P0 (Sn > a). for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). a + (Sn − a). Sn ≤ a) + P0 (Ta ≤ n. n < Ta . 2 Proof. Sn = Sn . n ≥ Ta By the strong Markov property.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. 2 Define now the running maximum Mn = max(S0 . so we may replace Sn by 2a − Sn (Theorem 28). 2a − Sn . Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). called (Sn . Sn > a) = P0 (Ta ≤ n. Sn ≥ a). But P0 (Ta ≤ n) = P0 (Ta ≤ n. n < Ta . Sn ).What is more interesting is this: Run the particle till it hits a (it will. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). n ≥ Ta . The last equality follows from the fact that Sn > a implies Ta ≤ n. In the last event. . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. Trivially. Let (Sn . Finally. 28. If Sn ≥ a then Ta ≤ n.

. P0 (Mn < x. Rend. 436-437. (31) x > y. This. If an urn contains a azure items and b = n−a black items. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Since Mn < x is equivalent to Tx > n. which results into P0 (Tx ≤ n. then. Sci. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. where ηi indicates the colour of the i-th item picked. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. as long as a ≥ b. The original ballot problem. of which a are coloured azure and b black. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). Rend. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. employed for different purposes. combined with (31).Corollary 19. D. Paris. and anciently for holding ballots to be drawn. during sampling without replacement. Sn = y). gives the result. The set of values of η contains n elements and η is uniformly distributed a in this set. J. Sn = y) = P0 (Sn = 2x − y). Sci. 2 Proof. ηn ). e 10 Andr´. the ashes of the dead after cremation. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. (1887). Sn = 2x − y). . usually a vase furnished with a foot or pedestal. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. 8 88 . Acad. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. . (1887). 105. Sn = y) = P0 (Tx ≤ n. 9 Bertrand. by applying the reflection principle (Theorem 28). Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . 369. An urn is a vessel. Acad. It is the latter use we are interested in in Probability and Statistics. ornamental objects. Compt. Solution d’un probl`me. If Tx ≤ n. as for holding liquids. The answer is very simple: Theorem 31. 105. Solution directe du probl`me r´solu par M. e e e Paris. Bertrand. we have P0 (Mn < x. Sn = y) = P0 (Tx > n. then the probability that. . Proof. the number of selected azure items is always ahead of the black ones equals (a − b)/n. n = a + b. Compt. A sample from the urn is a sequence η = (η1 .

We need to show that. But in (30) we computed that y P0 (S1 > 0. Sn > 0 | Sn = y) = . . i ∈ Z. Since the ‘azure’ items correspond to + signs (respectively. so ‘azure’ items are +1 and ‘black’ items are −1. conditional on {Sn = y}. b be defined by a + b = n. a − b = y. . the sequence (ξ1 . conditional on {Sn = y}. the result follows. . Proof of Theorem 31.) So pi. Then. . . .i+1 = p. . . . .We shall call (η1 . we can translate the original question into a question about the random walk. where y = a − (n − a) = 2a − n. indeed. but which is not necessarily symmetric. . εn be elements of {−1. . conditional on the event {Sn = y}. ηn ) an urn process with parameters n. +1} such that ε1 + · · · + εn = y. . . But if Sn = y then. the sequence (ξ1 . letting a. We have P0 (Sn = k) = n n+k 2 pi. (ξ1 . Proof. 89 . Since an urn with parameters n. ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. the ‘black’ items correspond to − signs). . . the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. . . The values of each ξi are ±1. Sn > 0} that the random walk remains positive. Consider a simple symmetric random walk (Sn . −n ≤ k ≤ n. . . It remains to show that all values are equally a likely. a. . . . always assuming (without any loss of generality) that a ≥ b = n − a. ξn ) is uniformly distributed in a set with n elements. . n Since y = a − (n − a) = a − b. . n n −n 2 (n + y)/2 a Thus. . . ξn = εn ) P0 (Sn = y) 2−n 1 = = . . ξn ) is. An urn can be realised quite easily be a random walk: Theorem 32.i−1 = q = 1 − p. a we have that a out of the n increments will be +1 and b will be −1. 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. . n . . . p(n+k)/2 q (n−k)/2 . ξn ) has a uniform distribution. we have: P0 (ξ1 = ε1 . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. . So the number of possible values of (ξ1 . n ≥ 0) with values in Z and let y be a positive integer. ξn = εn | Sn = y) = P0 (ξ1 = ε1 . . . . (The word “asymmetric” will always mean “not necessarily symmetric”. using the definition of conditional probability and Lemma 16. . . . Then. . Let ε1 . .

If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. . x ∈ Z. If S1 = 1 then T1 = 1. we make essential use of the skip-free property. To prove this. The process (Tx . Px (Ty = t) = P0 (Ty−x = t).1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}.} ∪ {∞}. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . and all t ≥ 0. x − 1 in order to reach x. x ≥ 0) is also a random walk. For all x. In view of Lemma 19 we need to know the distribution of T1 . Let ϕ(s) := E0 sT1 . plus a time T1 required to go from −1 to 0 for the first time. (We tacitly assume that S0 = 0. 2. See figure below. . . . 2qs Proof. The rest is a consequence of the strong Markov property. . Corollary 20. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. E0 sTx = ϕ(s)x . In view of this. Consider a simple random walk starting from 0. . . Consider a simple random walk starting from 0. 2. Consider a simple random walk starting from 0. Lemma 18. Lemma 19. This random variable takes values in {0. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. Theorem 33. 1. plus a ′′ further time T1 required to go from 0 to +1 for the first time. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x.d. We will approach this using probability generating functions. y ∈ Z.30. random variables. for x > 0. only the relative position of y with respect to x is needed.) Then. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. Proof. Proof.i. it is no loss of generality to consider the random walk starting from 0. If x > 0. 90 .

 0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . The first time n such that Sn = −1 is the first time n such that −Sn = 1. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Consider a simple random walk starting from 0. Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 .Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. 91 . if x < 0 2ps Proof. Consider a simple random walk starting from 0. The formula is obtained by solving the quadratic. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . 2ps Proof. Corollary 22. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . from the strong Markov property. in order to hit 1. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Hence the previous formula applies with the roles of p and q interchanged. Corollary 21.

3 The distribution of the first return to the origin We wish to find the exact distribution of the first return time to the origin. then we can omit the k upper limit in the sum. If α = n is a positive integer then n (1 + x)n = k=0 n k x . and simply write (1 + x)n = k≥0 n k x . E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . the same as the time required for the walk to hit 1 starting ′ from 0.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. this was found in §30. Then the probability generating function of the first return time to 0 is given by E0 s T 0 = 1 − Proof. Similarly. The latter is. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. 2ps ′ =1− 30. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . If we define n to be equal to 0 for k > n. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . how long will it take till it first returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. where α is a real number. 1 − 4pqs2 .2 First return to the origin Now let us ask: If the particle starts at 0. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α .30. k 92 . in distribution. We again do first-step analysis. For the general case. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. k by the binomial theorem. For a symmetric ′ random walk. Consider a simple random walk starting from 0. if S1 = 1.

Recall that
    (n choose k) = n(n − 1)···(n − k + 1)/k! = n!/(k!(n − k)!).
It turns out that the formula is formally true for general α, and it looks exactly the same:
    (1 + x)^α = Σ_{k≥0} (α choose k) x^k,
as long as we define
    (α choose k) = α(α − 1)···(α − k + 1)/k!.
The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. As an example, we will show that
    (1 − x)^{−1} = 1 + x + x² + x³ + ···,
which is of course the all too-familiar geometric series. Indeed, by definition,
    (−1 choose k) = (−1)(−2)···(−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,
so
    (1 − x)^{−1} = Σ_{k≥0} (−1 choose k)(−x)^k = Σ_{k≥0} (−1)^k (−1)^k x^k = Σ_{k≥0} x^k,
as needed. As another example, let us compute √(1 − x). We have
    √(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} (1/2 choose k)(−1)^k x^k.
We can write this in more familiar symbols if we observe that
    (1/2 choose k)(−1)^k = −(1/(2k − 1)) (2k choose k) 4^{−k},
and this is shown by direct computation. So
    √(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) (2k choose k) (x/4)^k.
We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs²):
    E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) (2k choose k) (pq)^k s^{2k}.

But, by the definition of E0 s^{T0′},
    E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.
Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that — the random walk can return to 0 only in an even number of steps) and then
    P0(T0′ = 2k) = (1/(2k − 1)) (2k choose k) p^k q^k = (1/(2k − 1)) P0(S2k = 0).

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that
    P0( lim_{n→∞} Sn/n = p − q ) = 1.
• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence it is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is:
    P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},
which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so
    1 − 4pq = 1 − 4p + 4p² = (1 − 2p)²,  hence  √(1 − 4pq) = |1 − 2p| = |p − q|.
So
    P0(Tx < ∞) = ((1 − |p − q|)/(2q))^x,      if x > 0,
    P0(Tx < ∞) = ((1 − |p − q|)/(2p))^{|x|},  if x < 0.        (33)
This formula can be written more neatly as follows:

Theorem 35.
    P0(Tx < ∞) = (p/q ∧ 1)^x,      if x > 0,
    P0(Tx < ∞) = (q/p ∧ 1)^{|x|},  if x < 0.
In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z.

Proof. If p > q then |p − q| = p − q, and so (33) gives P0(Tx < ∞) = 1 if x > 0, and = (q/p)^{|x|} if x < 0. If p < q then |p − q| = q − p and, again, we obtain what we are looking for.
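The key fact and Theorem 35 can be seen numerically by evaluating the generating function (32) at s close to 1. A small Python sketch (the choice p = 0.3, x = 4 is arbitrary):

import math

def pgf_Tx(s, p, x):
    # formula (32) for x > 0
    q = 1.0 - p
    return ((1.0 - math.sqrt(1.0 - 4.0 * p * q * s * s)) / (2.0 * q * s)) ** x

p, x = 0.3, 4
q = 1.0 - p
for s in (0.9, 0.99, 0.999, 0.9999):
    print(s, pgf_Tx(s, p, x))
print("Theorem 35 predicts", min(p / q, 1.0) ** x)  # the limit as s -> 1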

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by
    M∞ = sup(S0, S1, S2, ...).

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:
    P0(M∞ ≥ x) = (p/q)^x,  x ≥ 0.

Proof. If we let Mn = max(S0, ..., Sn), then, since the sequence Mn is nondecreasing, we have M∞ = lim_{n→∞} Mn. (Indeed, every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞ with probability 1, because p < 1/2.) So
    P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x,  x ≥ 0,
which is what we want.

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. Otherwise, with some positive probability, it will never ever return to the point where it started from; unless p = 1/2, this probability is always strictly less than 1. What is this probability?

Theorem 37. Consider a random walk with values in Z. Let T0′ = inf{n ≥ 1 : Sn = 0} be the first return time to 0. Then
    P0(T0′ < ∞) = 2(p ∧ q).

Proof. The probability generating function of T0′ was obtained in Theorem 34. But
    P0(T0′ < ∞) = lim_{s↑1} E s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.
If p = q this gives 1. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p.
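Theorems 36 and 37 both concern infinite-horizon quantities, so a simulation can only approximate them over a long finite horizon; for p < 1/2 the walk drifts to −∞, so the truncation bias is small. A sketch (the values of p, the horizon and the sample size are our arbitrary choices):

import random

def one_walk(p, horizon=5000):
    s, m, returned = 0, 0, False
    for _ in range(horizon):
        s += 1 if random.random() < p else -1
        if s > m:
            m = s
        if s == 0:
            returned = True
    return m, returned

p, q, trials = 0.35, 0.65, 20000
runs = [one_walk(p) for _ in range(trials)]
for x in range(4):
    print("P(M >= %d) ~" % x, sum(m >= x for m, _ in runs) / trials,
          "vs (p/q)^x =", (p / q) ** x)
print("P(T0' < oo) ~", sum(r for _, r in runs) / trials,
      "vs 2(p^q) =", 2 * min(p, q))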

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by
    Ji = Σ_{n=1}^{∞} 1(Sn = i),        (34)
a random variable first considered in Section 10, where it was also shown to be geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in Z, starting from zero. Then the total number J0 of visits to state 0 is geometrically distributed:
    P0(J0 ≥ k) = (2(p ∧ q))^k,  k = 0, 1, 2, ...,
    E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. In Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric. We now compute its exact distribution. But J0 ≥ 1 if and only if the first return time to zero is finite. So
    P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),
where the last probability was computed in Theorem 37. And so
    P0(J0 ≥ k) = P0(J0 ≥ 1)^k = (2(p ∧ q))^k.
This proves the first formula. For the second formula we have
    E0 J0 = Σ_{k=1}^{∞} P0(J0 ≥ k) = Σ_{k=1}^{∞} (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),
this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any initial state i:
    Pi(Ji ≥ k) = (2(p ∧ q))^k,  Ei Ji = 2(p ∧ q)/(1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case, as already known, Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1.

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a random walk with values in Z, starting from 0. If p ≤ 1/2 then
    P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1},  k ≥ 1,    E0 Jx = (p/q)^{x∨0} / (1 − 2p).
If p ≥ 1/2 then
    P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},  k ≥ 1,    E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q).

Proof. Apply now the strong Markov property at Tx. Suppose x > 0. Then
    P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),
simply because if Jx ≥ k ≥ 1 then Tx < ∞. Moreover,
    P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).
The reason that we replaced k by k − 1 is because, given that Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain
    P0(Jx ≥ k) = (p/q ∧ 1)^x (2(p ∧ q))^{k−1}.
If p ≤ 1/2 then this gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}; if p ≥ 1/2 then this gives P0(Jx ≥ k) = (2q)^{k−1}. We repeat the argument for x < 0 and find
    P0(Jx ≥ k) = (q/p ∧ 1)^{|x|} (2(p ∧ q))^{k−1}.
If p ≥ 1/2 then this gives P0(Jx ≥ k) = (q/p)^{|x|} (2q)^{k−1}; if p ≤ 1/2 then this gives P0(Jx ≥ k) = (2p)^{k−1}. As for the expectation, we apply E0 Jx = Σ_{k=1}^{∞} P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2. In other words, if the random walk "drifts down" (i.e. p < 1/2), then E0 Jx drops down to zero geometrically fast as x tends to infinity. But for x < 0, we have E0 Jx = 1/(1 − 2p) regardless of x: starting from 0, the walk will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0, then use spatial homogeneity to obtain the analogous formulae:
    Py(Jx ≥ k) = P0(Jx−y ≥ k),  Ey Jx = E0 Jx−y.
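The same kind of experiment checks Theorem 39; since the walk is transient for p ≠ 1/2, the total number of visits is captured by a long finite horizon. A sketch with the arbitrary choices p = 0.3 and x = 2:

import random

def visits_to(p, x, horizon=20000):
    s, j = 0, 0
    for _ in range(horizon):
        s += 1 if random.random() < p else -1
        if s == x:
            j += 1
    return j

p, x, trials = 0.3, 2, 20000
q = 1.0 - p
js = [visits_to(p, x) for _ in range(trials)]
for k in (1, 2, 3):
    print("P(Jx >= %d) ~" % k, sum(j >= k for j in js) / trials,
          "vs", (p / q) ** x * (2 * p) ** (k - 1))
print("E Jx ~", sum(js) / trials, "vs", (p / q) ** x / (1 - 2 * p))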

32 Duality

Consider a random walk Sk = Σ_{i=1}^{k} ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. The dual (Ŝk, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by
    Ŝk := Sn − Sn−k,  0 ≤ k ≤ n.

Theorem 40. (Ŝk, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ1, ξ2, ..., ξn) has the same distribution as (ξn, ξn−1, ..., ξ1).

Thus, every time we have a probability for S we can translate it into a probability for Ŝ, which is basically another probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk and x a positive integer. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then
    P0(Tx = n) = (x/n) P0(Sn = x),  n ≥ 1.

Proof. Rewrite the ballot theorem in terms of the dual:
    P0(Ŝ1 > 0, ..., Ŝn > 0 | Ŝn = x) = x/n,  x > 0.
But the left-hand side is
    P0(Sn − Sn−k > 0, 1 ≤ k ≤ n | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1, Sn = x) / P0(Sn = x) = P0(Tx = n) / P0(Sn = x).

Pause for reflection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:
    E0 s^{Tx} = ((1 − √(1 − 4pqs²)) / (2qs))^x.
Specialising to the symmetric case we have
    E0 s^{Tx} = ((1 − √(1 − s²)) / s)^x.        (35)
In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:
    P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).        (36)
Lastly, in Theorem 41, we derived, using duality, the probabilities
    P0(Tx = n) = (x/n) P0(Sn = x).        (37)

We remark here something methodological: it was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx. This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of s^n in (35); summing up (37) should give (36); or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
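Although the algebraic verification is not easy, a numerical one is. The sketch below checks formula (37) exactly (in rational arithmetic) against first-passage probabilities computed by elementary dynamic programming; nothing in it is specific to the notes beyond the formula itself.

from math import comb
from fractions import Fraction

def P_Sn_eq(n, x):
    # P0(Sn = x) = C(n, (n+x)/2) 2^{-n} when n + x is even, else 0
    if (n + x) % 2 != 0 or abs(x) > n:
        return Fraction(0)
    return Fraction(comb(n, (n + x) // 2), 2 ** n)

def P_Tx_eq(n, x):
    # distribution at time m of the walk killed on first hitting x
    f = {0: Fraction(1)}
    for m in range(1, n + 1):
        g = {}
        for site, pr in f.items():
            for step in (1, -1):
                g[site + step] = g.get(site + step, Fraction(0)) + pr / 2
        hit = g.pop(x, Fraction(0))   # mass hitting x for the first time at m
        if m == n:
            return hit
        f = g

x = 3
for n in (3, 5, 7, 9):
    assert P_Tx_eq(n, x) == Fraction(x, n) * P_Sn_eq(n, x)
print("formula (37) verified for x = 3 and n = 3, 5, 7, 9")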

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+(2n) be the number of sides of the polygonal line that are above the horizontal axis and T−(2n) = 2n − T+(2n) the number of sides that are below. Clearly, T±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.]

We can think of T+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong: in fact, as the theorem below shows, P(T+(2n) = m) is largest when m = 0 or m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).
    P(T+(2n) = 2k) = P(S2k = 0) P(S2n−2k = 0),  0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:
    P(T+(2n) = 2k) = Σ_{r=1}^{n} P(T+(2n) = 2k, T0′ = 2r).

Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write
    P(T+(2n) = 2k) = Σ_{r=1}^{n} P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
                   + Σ_{r=1}^{n} P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).

In the first case,
    P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) − 2r = 2k − 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),
because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis, and so T+(2n) is reduced by 2r. Furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows
    P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).

We also observe that, by symmetry,
    P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f2r.
Letting p2k,2n = P(T+(2n) = 2k), we have shown that
    p2k,2n = (1/2) Σ_{r=1}^{k} f2r p2k−2r,2n−2r + (1/2) Σ_{r=1}^{n−k} f2r p2k,2n−2r.

We want to show that p2k,2n = u2k u2n−2k, where u2k = P(S2k = 0). With this hypothesis, we can start playing with the recursion of the display above: applying the induction hypothesis (on n) to the terms on the right-hand side, we realise that
    p2k,2n = u2n−2k (1/2) Σ_{r=1}^{k} f2r u2k−2r + u2k (1/2) Σ_{r=1}^{n−k} f2r u2n−2k−2r.
By the first-return decomposition u2m = Σ_{r=1}^{m} f2r u2m−2r, each of the two sums collapses, and the right-hand side equals (1/2) u2n−2k u2k + (1/2) u2k u2n−2k = u2k u2n−2k, as required.

Explicitly, we have
    P(T+(2n) = 2k) = (2k choose k)(2n−2k choose n−k) 2^{−2n},  0 ≤ k ≤ n.        (38)

With 2n = 100, we plot the function P(T+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Figure: plot of P(T+(100) = 2k) against k, 0 ≤ k ≤ 50; the values range between about 0.02 and 0.08 and are largest at the two ends of the interval.]
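The plotted values come straight from formula (38); a few of them can be printed with the following sketch (replacing the figure by numbers):

from math import comb

def p_plus(n, k):
    # formula (38): P(T+(2n) = 2k) = C(2k, k) C(2n-2k, n-k) 2^{-2n}
    return comb(2 * k, k) * comb(2 * n - 2 * k, n - k) / 4 ** n

n = 50  # so that 2n = 100, as in the figure
for k in (0, 5, 15, 25, 35, 45, 50):
    print(2 * k, round(p_plus(n, k), 5))
# largest at k = 0 and k = n, smallest at k = n/2, as the figure shows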

The arcsine law*
Use of Stirling's approximation in formula (38) gives
    P(T+(2n) = 2k) ∼ 1 / (π √(k(n − k))),
as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:
    P( T+(2n)/(2n) ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T+(2n)/(2n) = k/n )
                              ∼ Σ_{a ≤ k/n ≤ b} 1 / (π √(k(n − k)))
                              = Σ_{a ≤ k/n ≤ b} (1/n) · 1 / (π √((k/n)(1 − k/n))).
This is easily seen to have a limit as n → ∞, the limit being
    ∫_a^b  1 / (π √(t(1 − t)))  dt,

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
    lim_{n→∞} P( T+(2n)/(2n) ≤ x ) = ∫_0^x  1 / (π √(t(1 − t)))  dt = (2/π) arcsin √x.
We thus have that the limiting distribution of T+(2n)/(2n) (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1 / (π √(x(1 − x))), 0 < x < 1,

the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk, because the summands in Tx = Σ_{y=1}^{x} (Ty − Ty−1) are i.i.d., all distributed as T1:
    P(T1 ≥ 2n) = P(S2n = 0) = (2n choose n) 2^{−2n} ∼ 1/√(πn).
From this, we immediately get that the expectation of T1 is infinity:
    ET1 = ∞.
Many other quantities can be approximated in a similar manner.
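The arcsine law itself is pleasant to see by simulation. In the sketch below (a side of the trajectory is counted as "above the axis" when at least one of its endpoints is positive; the sample sizes are arbitrary), the empirical distribution function of T+(2n)/(2n) is compared with (2/π) arcsin √x:

import math
import random

def frac_positive(two_n):
    s, above = 0, 0
    for _ in range(two_n):
        prev = s
        s += 1 if random.random() < 0.5 else -1
        if prev > 0 or s > 0:   # this side of the polygonal line lies above the axis
            above += 1
    return above / two_n

trials, two_n = 4000, 1000
fracs = [frac_positive(two_n) for _ in range(trials)]
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    emp = sum(f <= x for f in fracs) / trials
    print(x, round(emp, 3), round(2 / math.pi * math.asin(math.sqrt(x)), 3))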

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z: let ξ1, ξ2, ... be i.i.d. random vectors such that
    P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4.
We then let S0 = (0, 0) and Sn = Σ_{i=1}^{n} ξi. Since (Sn, n ∈ Z+) is a sum of i.i.d. random vectors, it is a random walk. If x ∈ Z², we let x¹ be its horizontal coordinate and x² its vertical one, so that Sn = (S¹n, S²n).

Lemma 20. (S¹n, n ∈ Z+) is a random walk. The same argument applies to (S²n, n ∈ Z+), showing that it, too, is a random walk.

Proof. Since ξ1, ξ2, ... are independent, so are ξ¹1, ξ¹2, ..., the latter being functions of the former. They are also identically distributed, with
    P(ξ¹n = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
    P(ξ¹n = 1) = P(ξn = (1, 0)) = 1/4,
    P(ξ¹n = −1) = P(ξn = (−1, 0)) = 1/4.
Hence (S¹n, n ∈ Z+) is a sum of i.i.d. random variables, and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0.

The two random walks (S¹n), (S²n) are not independent: for each n, the random variables ξ¹n, ξ²n are not independent, simply because if one takes value 0 the other is nonzero. However, consider a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x¹, x²) into (x¹ + x², x¹ − x²). So let
    (R¹n, R²n) := (S¹n + S²n, S¹n − S²n).
We have:

Lemma 21. (Rn, n ∈ Z+) is a random walk in two dimensions, and each of (R¹n, n ∈ Z+), (R²n, n ∈ Z+) is a random walk in one dimension.

Proof. We have R¹n = Σ_{i=1}^{n} (ξ¹i + ξ²i) and R²n = Σ_{i=1}^{n} (ξ¹i − ξ²i). So Rn = Σ_{i=1}^{n} ηi, where ηn is the vector ηn = (η¹n, η²n) = (ξ¹n + ξ²n, ξ¹n − ξ²n). The sequence of vectors η1, η2, ... is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, .... Hence (Rn, n ∈ Z+) is a random walk in two dimensions, and a similar argument shows that each of (R¹n, n ∈ Z+), (R²n, n ∈ Z+) is a random walk in one dimension.

Lemma 22. (R¹n, n ∈ Z+) and (R²n, n ∈ Z+) are independent simple symmetric random walks.

Proof. The possible values of each of the η¹n, η²n are ±1. We have
    P(η¹n = 1) = P(ξn = (1, 0)) + P(ξn = (0, 1)) = 1/2,
and hence P(η¹n = 1) = P(η¹n = −1) = 1/2; similarly for η²n. To show that the last two walks are independent, we check that η¹n, η²n are independent for each n:
    P(η¹n = 1, η²n = 1) = P(ξn = (1, 0)) = 1/4,
    P(η¹n = 1, η²n = −1) = P(ξn = (0, 1)) = 1/4,
    P(η¹n = −1, η²n = 1) = P(ξn = (0, −1)) = 1/4,
    P(η¹n = −1, η²n = −1) = P(ξn = (−1, 0)) = 1/4,
and so P(η¹n = ε, η²n = ε′) = P(η¹n = ε)P(η²n = ε′) for all choices of ε, ε′ ∈ {−1, +1}.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n} P(Sn = 0) = ∞. Recall, from Theorem 7, that for a simple symmetric random walk (started from S0 = 0), P(S2n = 0) = (2n choose n) 2^{−2n}. Now
    P(S2n = 0) = P(R2n = 0) = P(R¹2n = 0, R²2n = 0) = P(R¹2n = 0) P(R²2n = 0),
where the last equality follows from the independence of the two random walks. Since each of (R¹n), (R²n) is a simple symmetric random walk,
    P(S2n = 0) = ((2n choose n))² 2^{−4n},
and, by Stirling's approximation, P(S2n = 0) ∼ c/n as n → ∞, where c is a positive constant. Hence Σ_{n} P(Sn = 0) = ∞, and the walk is recurrent.
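Corollary 24 rests on the product formula P(S2n = 0) = ((2n choose n) 2^{−2n})², which can be checked by simulating the planar walk directly (a sketch; the value of n and the sample size are arbitrary choices):

import random
from math import comb

def back_at_origin(two_n):
    x = y = 0
    for _ in range(two_n):
        dx, dy = random.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        x += dx
        y += dy
    return x == 0 and y == 0

two_n, trials = 10, 200000
emp = sum(back_at_origin(two_n) for _ in range(trials)) / trials
exact = (comb(two_n, two_n // 2) / 2 ** two_n) ** 2
print(emp, exact)   # the two numbers should be close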

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let a < 0 < b. Recall that
    P(Ta < Tb) = b/(b − a),  P(Tb < Ta) = −a/(b − a).
We can read this as follows:
    P(S_{Ta∧Tb} = a) = b/(b − a),  P(S_{Ta∧Tb} = b) = −a/(b − a).
This last display tells us the distribution of S_{Ta∧Tb}. Note, in particular, that E S_{Ta∧Tb} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b, a < 0 < b, such that pa + (1 − p)b = 0. Indeed, if p = M/N, with M < N and M, N ∈ N, choose a = M − N and b = M. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability that it exits from the left point is p, and the probability that it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z) with zero mean: Σ_{i∈Z} i pi = 0. Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that
    P(ST = i) = pi,  i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ {(i, j) : i < 0 < j} with the following distribution:
    P(A = i, B = j) = c^{−1} (j − i) pi pj,  i < 0 < j,
    P(A = 0, B = 0) = p0,
where c^{−1} is chosen by normalisation, i.e. so that P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1. This leads to
    c^{−1} Σ_{j>0} j pj Σ_{i<0} pi + c^{−1} Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.
Since Σ_{i∈Z} i pi = 0, we have the common value
    Σ_{i<0} (−i) pi = Σ_{i>0} i pi,
and, using this, we get that c is precisely this common value:
    c = Σ_{i<0} (−i) pi = Σ_{i>0} i pi.
Now, assuming that (A, B) is independent of (Sn), we consider the random variable
    Z := S_{T_A ∧ T_B}.
The claim is that
    P(Z = k) = pk,  k ∈ Z.
To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p0.

probability p0 . P (Z = k) = pk for k < 0. Suppose next k > 0. 105 . B = j) P (STi ∧Tk = k)P (A = i. by definition. i<0 = c−1 Similarly. We have P (Z = k) = P (STA ∧TB = k) = i<0 j>0 P (STi ∧Tj = k)P (A = i. B = k) i<0 = = i<0 −i −1 c (k − i)pi pk k−i (−i)pi pk = pk .

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:
    N = {1, 2, 3, ...}.
The set Z of integers consists of N, their negatives and zero (zero is denoted by 0):
    Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}.
Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Given integers a, b, with at least one of them nonzero, we define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a and d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.) Notice that
    gcd(a, b) = gcd(a, b − a).
Indeed, let d = gcd(a, b). Then d | a and d | b, and therefore d | b − a; so (i) holds for the pair (a, b − a). Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds too, and we're done.

This gives us an algorithm for finding the gcd of two numbers. Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a; then interchange the roles of the two numbers and continue. For instance,
    gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38)
                 = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2)
                 = gcd(4, 2) = gcd(2, 2) = 2.
In the first line we subtracted 38 exactly 3 times; i.e. we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. But 3 is the maximum number of times that 38 "fits" into 120 (indeed, 4 × 38 > 120). So, if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school), we work faster:
    gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.
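The procedure just described is Euclid's algorithm; a one-function Python sketch:

def gcd(a, b):
    # replace the larger number by the remainder of the division, and repeat
    while b > 0:
        a, b = b, a % b
    return a

print(gcd(120, 38))   # prints 2, as in the example above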

The theorem of division says the following: given two positive integers a, b, with a > b, we can find an integer q and an integer r ∈ {0, 1, ..., b − 1} such that
    a = qb + r.
The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 by 26 we do

         132
        -----
    26 | 3450
         26
         ---
          85
          78
          --
           70
           52
           --
           18

and obtain q = 132, r = 18.

Here are some other facts about the greatest common divisor. First, if d = gcd(a, b), then there are integers x, y such that
    d = xa + yb.
To see why this is true, let us follow the previous example again. In the process, we used the divisions
    120 = 3 × 38 + 6,   38 = 6 × 6 + 2.
Working backwards, we have
    2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.
Thus gcd(38, 120) = 38x + 120y, with x = 19, y = −6. The same logic works in general.
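Running the divisions backwards is exactly what the extended Euclidean algorithm does; a standard recursive sketch:

def extended_gcd(a, b):
    # return (d, x, y) with d = gcd(a, b) = x*a + y*b
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    return d, y, x - (a // b) * y

d, x, y = extended_gcd(38, 120)
print(d, x, y, 38 * x + 120 * y)   # 2 19 -6 2, as in the example above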

Second, we can show that, for integers a, b, not both zero,
    gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.
To see this, let d be the right-hand side. Then d = xa + yb for some x, y ∈ Z. Suppose that c | a and c | b. Then c divides any linear combination of a and b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), that d divides both numbers. Suppose this is not true — say that d does not divide b. Since d > 0, we can divide b by d to find
    b = qd + r,  for some 1 ≤ r ≤ d − 1.
But then b = q(xa + yb) + r, yielding r = (−qx)a + (1 − qy)b. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder); so d is not minimum as claimed, and this is a contradiction. Hence r = 0, so d divides both a and b.

The greatest common divisor of three integers can now be defined in a similar manner; in fact, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. In general, a relation can be thought of as a set of pairs of elements of the set A. For example, the relation < between real numbers can be thought of as consisting of the pairs of numbers (x, y) such that x is smaller than y.

A relation is called reflexive if it contains the pairs (x, x); the equality relation = is reflexive, while the relation < is not. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x); the equality relation is symmetric, while < is not. A relation is called transitive if, when it contains (x, y) and (y, z), it also contains (x, z); both < and = are transitive. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric, and it can be thought of as reflexive (unless we do not want to accept the fact that one may not be a friend of oneself), but it is not necessarily transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names; a function is then an assignment of a name to each animal. (The same name may be used for different animals.) The first animal can be assigned any of the m available names, and so can the second, and the third, and so on. Hence the total number of assignments is
    m × m × ··· × m (n times) = m^n.
Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is
    (m)n := m(m − 1)(m − 2)···(m − n + 1).
In the special case m = n, a one-to-one function from the set A into A is called a permutation of A; thus, (n)n is the total number of permutations of a set of n elements. Incidentally, we denote (n)n also by n!; in other words, n! stands for the product of the first n natural numbers:
    n! = 1 × 2 × 3 × ··· × n.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen. The first element can be chosen in n ways, the second in n − 1 ways, and so on, the k-th in n − k + 1 ways. So we have (n)k = n(n − 1)(n − 2)···(n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals
    (n)k / k! =: (n choose k),
where the symbol (n choose k) is merely a name for the number computed on the left side of this equation. Clearly,
    (n choose k) = n! / (k!(n − k)!) = (n choose n−k).
Since there is only one subset with no elements (it is the empty set ∅), we have (n choose 0) = 1. The total number of subsets of A (regardless of their number of elements) is thus equal to
    Σ_{k=0}^{n} (n choose k) = (1 + 1)^n = 2^n,
and this follows from the binomial theorem. We thus observe that there are as many subsets in A as there are elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.

We shall let (n choose k) = 0 if n < k. Also, if x is a real number, we define
    (x choose m) = x(x − 1)···(x − m + 1) / m!.
If y is not an integer then, by convention, we let (n choose y) = 0.

The Pascal triangle relation states that
    (n+1 choose k) = (n choose k) + (n choose k−1).
Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say the pig. If the pig is not included in the sample, we choose k animals from the remaining n ones, and this can be done in (n choose k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen, and this can be done in (n choose k−1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Maximum and minimum

The minimum of two numbers a, b is denoted by
    a ∧ b = min(a, b).
Thus a ∧ b = a if a ≤ b, and = b if b ≤ a. The maximum is denoted by
    a ∨ b = max(a, b).
The notation extends to more than two numbers; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that
    −(a ∨ b) = (−a) ∧ (−b),   (a ∧ b) + c = (a + c) ∧ (b + c).
In particular, we use the following notations:
    a+ = max(a, 0),   a− = −min(a, 0).
a+ is called the positive part of a, and a− is called the negative part of a. Both quantities are nonnegative. Note that it is always true that
    a = a+ − a−,
and the absolute value of a is defined by
    |a| = a+ + a−.

Indicator function

1A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that
    1A∩B = 1A 1B,   1A∪B = 1A + 1B − 1A∩B,   1A^c = 1 − 1A.
An example of its application: since
    1A∪B = 1 − 1(A∪B)^c = 1 − 1A^c∩B^c = 1 − 1A^c 1B^c = 1 − (1 − 1A)(1 − 1B) = 1A + 1B − 1A 1B,
we have, upon taking expectations,
    E[1A∪B] = E[1A] + E[1B] − E[1A∩B],  which means  P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Indeed, if A is an event then 1A is a random variable: it takes value 1 with probability P(A) and 0 with probability P(A^c), and therefore its expectation is
    E1A = 1 × P(A) + 0 × P(A^c) = P(A).

Several quantities of interest are expressed in terms of indicators. For instance, if I go to a bank N times and Ai is the clause "I am robbed the i-th time", then Σ_{i=1}^{N} 1Ai is the total number of times I am robbed. In general, if A1, A2, ... are sets or clauses, then Σ_{n≥1} 1An is the number of true clauses. Another example: let X be a nonnegative random variable with values in N = {1, 2, ...}. Then
    X = Σ_{n=1}^{X} 1 (the sum of X ones is X),
so
    X = Σ_{n≥1} 1(X ≥ n),
and hence
    E[X] = Σ_{n≥1} E1(X ≥ n) = Σ_{n≥1} P(X ≥ n),
a formula worth remembering.

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is one such that
    ϕ(αx + βy) = αϕ(x) + βϕ(y)

for all x, y ∈ R^n and all α, β ∈ R.

Let u1, ..., un be a basis for R^n. Then any x ∈ R^n can be uniquely written as
    x = Σ_{j=1}^{n} x^j uj,
and the column (x¹, ..., x^n)^T is a column representation of x with respect to the given basis. Because ϕ is linear,
    y := ϕ(x) = Σ_{j=1}^{n} x^j ϕ(uj).
But the ϕ(uj) are vectors in R^m; so, if we choose a basis v1, ..., vm in R^m, we can express each ϕ(uj) as a linear combination of the basis vectors:
    ϕ(uj) = Σ_{i=1}^{m} aij vi.
The matrix A of ϕ with respect to the choice of the two bases is defined as the m × n array
    A := (aij, 1 ≤ i ≤ m, 1 ≤ j ≤ n).
Combining, we have
    ϕ(x) = Σ_{i=1}^{m} ( Σ_{j=1}^{n} aij x^j ) vi.
So, if we let y^i := Σ_{j=1}^{n} aij x^j, i = 1, ..., m, then the column (y¹, ..., y^m)^T is a column representation of y = ϕ(x) with respect to the basis v1, ..., vm, and the relation between the column representation of y and the column representation of x is the usual matrix–vector multiplication
    y^i = Σ_{j=1}^{n} aij x^j,  i = 1, ..., m.
Suppose now that ψ : R^n → R^m and ϕ : R^m → R^ℓ are linear functions. Then ϕ ∘ ψ is a linear function from R^n to R^ℓ.

Fix three bases on R^n, R^m and R^ℓ, and let B, A be the matrix representations of ψ, ϕ, respectively, with respect to these bases. Then the matrix representation of ϕ ∘ ψ is AB, where
    (AB)ij = Σ_{k=1}^{m} Aik Bkj,  1 ≤ i ≤ ℓ, 1 ≤ j ≤ n.
So A is an ℓ × m matrix, B is an m × n matrix, and AB is an ℓ × n matrix.

A matrix A is called square if it is an n × n matrix. If we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n, namely its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in R^d consists of the vectors
    e1 = (1, 0, ..., 0)^T,  e2 = (0, 1, ..., 0)^T,  ...,  ed = (0, 0, ..., 1)^T;
in other words, the i-th component of ej is 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence a1, a2, ... of numbers), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, ...} has no minimum. A property of real numbers says that if the set I is bounded from below, then the maximum of all its lower bounds exists, and it is this that we call the infimum of I, standing for the "infimum" of I and denoted inf I: it is the maximum of all the numbers y with the property y ≤ x for all x ∈ I. Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞; and if I is empty then, by convention, inf ∅ = +∞, sup ∅ = −∞. inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Likewise, sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If x1, x2, ... is a sequence of numbers, then
    inf{x1, x2, ...} ≤ inf{x2, x3, ...} ≤ inf{x3, x4, ...} ≤ ···,
and, since any nondecreasing sequence has a limit (possibly +∞), we take this limit and call it lim inf xn (the limit inferior) of the sequence. Similarly, sup{xn, xn+1, xn+2, ...} is nonincreasing in n, and we define
    lim sup xn = lim_{n→∞} sup{xn, xn+1, xn+2, ...}
(the limit superior). We note that lim inf(−xn) = −lim sup xn. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory); indeed, I do sometimes "cheat" from a strict mathematical point of view in these notes, by not requiring this — but not too much. But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely:

    If A1, A2, ... are mutually disjoint events, then P(∪_{n≥1} An) = Σ_{n≥1} P(An).

Everything else follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more "restrictive".)

If A, B are disjoint events then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events.

If the events A1, A2, ... are increasing with A = ∪n An, then P(A) = lim_{n→∞} P(An). Similarly, if A1, A2, ... are decreasing with A = ∩n An, then P(A) = lim_{n→∞} P(An). Also, P(∪n An) ≤ Σn P(An); in particular, if An are events with P(An) = 0, then P(∪n An) = 0, and, similarly, if An are events with P(An) = 1, then P(∩n An) = 1.

In Markov chain theory, we work a lot with conditional probabilities. Recall that P(B|A) is defined by P(A ∩ B)/P(A), and this is a very motivated definition. Quite often, to compute P(A ∩ B), we read the definition as P(A ∩ B) = P(A)P(B|A). This is generalisable for a number of events:
    P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2).
If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B), and this is called Bayes' theorem.

Often, to compute P(A), it is useful to find disjoint events B1, B2, ..., such that ∪n Bn = Ω, in which case we write
    P(A) = Σn P(A ∩ Bn) = Σn P(A|Bn) P(Bn).

If A is an event, then 1A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 everywhere else. Note that 1A 1B = 1A∩B, 1A^c = 1 − 1A, and E1A = P(A).

In these notes, and everywhere else, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. that A is a subset of B). Similarly, to show that EX ≤ EY, we often need to argue that X ≤ Y.

Markov's inequality for a positive random variable X states that
    P(X > x) ≤ EX/x.
This follows, by taking expectations, from the logical statement x 1(X > x) ≤ X, which holds since X is positive.

An application of this is:
    P(X < ∞) = lim_{x→∞} P(X ≤ x).

The expectation of a nonnegative random variable X may be defined by
    EX = ∫_0^∞ P(X > x) dx.
The formula holds for any positive random variable. In case the random variable is discrete, the formula reduces to the known one, EX = Σ_x x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^{∞} x f(x) dx. For a random variable that takes values in N we can write EX = Σ_{n=1}^{∞} P(X ≥ n); this is due to the fact that P(X ≥ n) = E1(X ≥ n).

For a random variable which takes positive and negative values, we define X+ = max(X, 0) and X− = −min(X, 0), which are both nonnegative, and then define EX = EX+ − EX−, which is motivated from the algebraic identity X = X+ − X−. This definition makes sense only if EX+ < ∞ or EX− < ∞. In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

Generating functions

A generating function of a sequence a0, a1, ... is an infinite polynomial having the ai as coefficients:
    G(s) = a0 + a1 s + a2 s² + ···.
We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. If it exists for some s1, then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). Note that |G(s)| ≤ Σn |an||s|^n, and this is ≤ Σn |an| if s ∈ [−1, 1]. So if Σn |an| < ∞, then G(s) exists for all s ∈ [−1, 1]; this is the case, in particular, when (a0, a1, ...) is a probability distribution on Z+ = {0, 1, 2, ...}, because then Σn an = 1, making the generating function a useful object.

As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is
    Σ_{n=0}^{∞} ρ^n s^n = 1 / (1 − ρs),  as long as |ρs| < 1,        (39)
i.e. for |s| < 1/|ρ|.

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence pn = P(X = n), n = 0, 1, ..., is called the probability generating function of X:
    ϕ(s) = Es^X = Σ_{n=0}^{∞} s^n P(X = n)

and p∞ := P(X = ∞) = 1 − Σ_{n=0}^{∞} pn. This is defined for all −1 < s < 1. Note that Σ_{n=0}^{∞} pn ≤ 1: we do allow X to, possibly, take the value +∞. The term s^∞ p∞ could formally appear in the sum defining ϕ(s); but, since |s| < 1, we have s^∞ = 0. Note that
    ϕ(0) = P(X = 0),  while  ϕ(1) = Σ_{n=0}^{∞} P(X = n) = P(X < +∞),
and this is why ϕ is called probability generating function.

We can recover the probabilities pn from the formula
    pn = ϕ^{(n)}(0) / n!,
as follows from Taylor's theorem. Also, we can recover the expectation of X from
    EX = lim_{s↑1} ϕ′(s),
something that can be formally seen from
    (d/ds) Es^X = E (d/ds) s^X = E X s^{X−1},
by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
    x_{n+2} = x_{n+1} + x_n,  n ≥ 0,  with x0 = x1 = 1.
If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x0), and the generating function of (x_{n+2}, n ≥ 0) is Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x0 − sx1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0):
    s^{−2}(X(s) − x0 − sx1) = s^{−1}(X(s) − x0) + X(s).
Since x0 = x1 = 1, we can solve for X(s) and find
    X(s) = −1 / (s² + s − 1).
But the polynomial s² + s − 1 has two roots:
    a = (√5 − 1)/2,  b = −(√5 + 1)/2.
Hence s² + s − 1 = (s − a)(s − b), and so
    X(s) = (1/(a − b)) (1/(s − b) − 1/(s − a)) = (1/(b − a)) (1/(b − s) − 1/(a − s)).
But (39) tells us that ρ^n has generating function 1/(1 − ρs) = (1/ρ)/((1/ρ) − s); thus, 1/((1/ρ) − s) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1} and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of
    x_n = (1/(b − a)) ((1/b)^{n+1} − (1/a)^{n+1}).
Noting that ab = −1, we can also write this as
    x_n = ((−a)^{n+1} − (−b)^{n+1}) / (b − a).
This is the solution to the recursion: the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
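A quick check that the closed form really produces integers — the Fibonacci numbers (a sketch; rounding removes the floating-point noise):

import math

a = (math.sqrt(5) - 1) / 2
b = -(math.sqrt(5) + 1) / 2
closed = [((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a) for n in range(12)]

x = [1, 1]
while len(x) < 12:
    x.append(x[-1] + x[-2])
print([round(v) for v in closed])   # same list twice:
print(x)                            # 1, 1, 2, 3, 5, 8, 13, ...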

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. If X is a random variable with values in R, then the so-called distribution function, namely the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), for example, or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density f(x) = (d/dx) P(X ≤ x). All these objects can be referred to as "distribution" or "law" of X. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Even the generating function Es^X of a Z+-valued random variable X is an example. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × ··· × n for large n then, since the number is so big, it's better to use logarithms:
    n! = e^{log n!} = exp Σ_{k=1}^{n} log k.
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that
    ∫_k^{k+1} log x dx ≈ log k.
Hence
    Σ_{k=1}^{n} log k ≈ ∫_1^n log x dx.
Since (d/dx)(x log x − x) = log x, it follows that
    ∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.
Hence we have
    n! ≈ exp(n log n − n) = n^n e^{−n}.
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads equals (2n choose n) 2^{−2n} ≈ 1, and this is terrible, because we know that the probability goes to zero as n → ∞.
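Both points are visible numerically; the sketch below compares n! with the rough approximation n^n e^{−n} and with the full Stirling formula that we are about to prove:

import math

for n in (5, 10, 20, 40):
    exact = math.factorial(n)
    rough = n ** n * math.exp(-n)
    full = rough * math.sqrt(2 * math.pi * n)
    print(n, exact / rough, exact / full)
# exact/rough grows like sqrt(2 pi n), while exact/full tends to 1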

To remedy this, we shall prove Stirling's approximation:
    n! ∼ n^n e^{−n} √(2πn),
where the symbol an ∼ bn means that an/bn → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:
    P(X − n ≥ m | X ≥ n) = P(X ≥ m),  m, n ≥ 0.
If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. The memoryless property shows that a geometric random variable has a distribution of the form
    P(X ≥ n) = λ^n,  n ∈ Z+,
where λ = P(X ≥ 1); a geometric function of n. We baptise this the geometric(λ) distribution. Surely, you have seen it before. We have already seen a geometric random variable, namely
    J0 = Σ_{n=1}^{∞} 1(Sn = 0),
defined in (34). It was shown in Theorem 38 that J0 satisfies the memoryless property; in fact it was shown there that P(J0 ≥ k) = P(J0 ≥ 1)^k.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely, you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Let X be a random variable, and consider an increasing and nonnegative function g : R → R+. Then, for all x, y ∈ R,
    g(y) ≥ g(x) 1(y ≥ x).
Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. The single display thus captures both properties of g neatly. Hence
    g(X) ≥ g(x) 1(X ≥ x),
and so, by taking expectations of both sides (expectation is an increasing operator),
    E g(X) ≥ g(x) P(X ≥ x).
This completely trivial thought gives the very useful Markov's inequality.

Bibliography

1. Bremaud, P. (1997) An Introduction to Probabilistic Modeling. Springer.
2. Bremaud, P. (1999) Markov Chains. Springer.
3. Chung, K.L. and AitSahlia, F. (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller, W. (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead, C.M. and Snell, J.L. (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
6. Norris, J.R. (1998) Markov Chains. Cambridge University Press.
7. Parry, W. (1981) Topics in Ergodic Theory. Cambridge University Press.


