Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos, Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1  Introduction
2  Examples of Markov chains
   2.1  A mouse in a cage
   2.2  Bank account
   2.3  Simple random walk (drunkard's walk)
   2.4  Simple random walk (drunkard's walk) in a city
   2.5  Actuarial chains
3  The Markov property
   3.1  Definition of the Markov property
   3.2  Transition probability and initial distribution
   3.3  Time-homogeneous chains
4  Stationarity
   4.1  Finding a stationary distribution
5  Topological structure
   5.1  The graph of a chain
   5.2  The relation "leads to"
   5.3  The relation "communicates with"
   5.4  Period
6  Hitting times and first-step analysis
7  Gambler's ruin
8  Stopping times and the strong Markov property
9  Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1 First hitting time decomposition
   10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1 Functions of excursions
   13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1 Definition of stability
   18.2 The fundamental stability theorem for Markov chains
   18.3 Coupling
   18.4 Proof of the fundamental stability theorem
   18.5 Terminology
   18.6 Periodic chains
19 Limiting theory for Markov chains*
   19.1 Limiting probabilities for null recurrent states
   19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1 Subsequences
   23.2 Watching a Markov chain when it visits a set
   23.3 Subordination
   23.4 Deterministic function of a Markov chain
24 Applications
   24.1 Branching processes: a population growth application
   24.2 The ALOHA protocol: a computer communications application
   24.3 PageRank (trademark of Google): A World Wide Web application
   24.4 The Wright-Fisher model: an example from biology
   24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1 Paths that start and end at specific points
   26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1 Distribution after n steps
   27.2 The ballot theorem
   27.3 Some conditional probabilities
   27.4 Some remarkable identities
   27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1 First hitting time analysis
   30.2 First return to the origin
   30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1 Asymptotic distribution of the maximum
   31.2 Return probabilities
   31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. Also, I introduce stationarity before even talking about state classification. A few starred sections should be considered "advanced" and can be omitted at first reading. At the end, I have a little mathematical appendix.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: m ẍ = F. Here, x = x(t) denotes the position of the particle at time t, its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)), and it has the property that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

xn+1 = yn,
yn+1 = rem(xn, yn),   n = 0, 1, 2, ...

In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, ..., has the property that, for any n, the future pairs Xn+1, Xn+2, ... depend only on the pair Xn.
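Though the notes prescribe no particular software, the iteration is easy to try on a computer. Here is a minimal Python sketch (the function name gcd_orbit is ours) of the map (x, y) ↦ (y, rem(x, y)):

    def gcd_orbit(a, b):
        # State is the pair (x, y) with x >= y, as in the text.
        x, y = max(a, b), min(a, b)
        orbit = [(x, y)]
        while y != 0:
            # One step of the dynamical system: (x, y) -> (y, rem(x, y)).
            x, y = y, x % y
            orbit.append((x, y))
        return orbit  # the last state is (gcd(a, b), 0)

    print(gcd_orbit(60, 42))  # [(60, 42), (42, 18), (18, 6), (6, 0)]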

In a dynamical system, then, the future, regardless of where the system was at earlier times, depends only on the present state: past states are not needed. Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic (stochastic means random). They are stochastic either because the problem they are solving is stochastic, or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability.

An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫ab f(x)dx of a complicated function f: suppose that f is positive and bounded above by a constant B. Choose a pair (X0, Y0) of random numbers, uniformly, with a ≤ X0 ≤ b and 0 ≤ Y0 ≤ B. Count 1 if Y0 < f(X0) and 0 otherwise. Repeat with fresh pairs (Xn, Yn) for steps n = 1, 2, .... Perform this a large number N of times, count the number of 1's and divide by N. The resulting ratio, multiplied by the area B(b − a) of the enclosing rectangle, is an approximation of ∫ab f(x)dx. Try this on the computer!
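Taking up that invitation, here is a minimal Python sketch of the hit-or-miss Monte Carlo method just described; the test function f(x) = x² and the bound B = 1 are our choices for illustration:

    import random

    def monte_carlo_integral(f, a, b, B, N=100_000):
        # Throw N uniform points into the rectangle [a, b] x [0, B]
        # and count the fraction that falls below the graph of f.
        hits = 0
        for _ in range(N):
            x = random.uniform(a, b)
            y = random.uniform(0, B)
            if y < f(x):
                hits += 1
        return (hits / N) * B * (b - a)

    # Example: the integral of f(x) = x^2 over [0, 1] is 1/3; B = 1 bounds f.
    print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))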

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables instead of deterministic objects. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse lives in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99.

We can summarise this information by the transition diagram: the mouse stays at 1 with probability 1 − α, jumps from 1 to 2 with probability α, jumps from 2 to 1 with probability β, and stays at 2 with probability 1 − β. Another way to summarise the information is by the 2 × 2 transition probability matrix

P = [ 1−α    α  ] = [ 0.95  0.05 ]
    [  β   1−β  ]   [ 0.99  0.01 ]

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive answer: since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 has mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)
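Question 1 can also be checked by simulation; the sketch below (the function name is ours) samples the time of the first move out of cell 1 and averages. Question 2 is taken up again when we discuss stationary distributions:

    import random

    ALPHA = 0.05  # P(move from cell 1 to cell 2 in one minute)

    def time_to_leave_cell_1():
        # Number of minutes until the mouse first moves from cell 1 to cell 2.
        n = 0
        while True:
            n += 1
            if random.random() < ALPHA:
                return n

    trials = [time_to_leave_cell_1() for _ in range(100_000)]
    print(sum(trials) / len(trials))  # close to 1/ALPHA = 20 minutes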

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, ... are i.i.d. random variables with distribution FD, that S0, S1, S2, ... are i.i.d. random variables with distribution FS, and that X0, (Dn), (Sn) are independent.

The information described above, namely the evolution of the account together with the distributions of Dn and Sn, leads to the (one-step) transition probability, which is defined by:

px,y := P(Xn+1 = y | Xn = x),   x, y = 0, 1, 2, ...

We easily see that if y > 0 then

px,y = P(Dn − Sn = y − x)
     = Σ_{z≥0} P(Sn = z, Dn − Sn = y − x)
     = Σ_{z≥0} P(Sn = z, Dn = z + y − x)
     = Σ_{z≥0} P(Sn = z) P(Dn = z + y − x),

something that, in principle, can be computed from FS and FD. Of course, if y = 0, we have px,0 = 1 − Σ_{y≥1} px,y. The transition probability matrix is

    [ p0,0  p0,1  p0,2  ··· ]
P = [ p1,0  p1,1  p1,2  ··· ]
    [ p2,0  p2,1  p2,2  ··· ]
    [  ···   ···   ···  ··· ]

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How often will he be visiting 0?
3. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction, so he may move forwards with equal probability that he moves backwards: from any integer position he steps one to the right with probability 1/2 and one to the left with probability 1/2.

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
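Here is a quick simulation sketch in the spirit of question 4 (pub at 0, home at 100). For the totally drunk case p = 1/2 the mean journey time is actually infinite, as the later sections on random walks explain, so the sketch uses a slightly biased walk, p = 0.6, for which the mean is 100/(p − q) = 500 steps:

    import random

    def steps_from_pub_to_home(p, home=100):
        # Biased drunkard: one step right w.p. p, one step left w.p. 1 - p.
        position, n = 0, 0
        while position < home:
            position += 1 if random.random() < p else -1
            n += 1
        return n

    walks = [steps_from_pub_to_home(0.6) for _ in range(2000)]
    print(sum(walks) / len(walks))  # close to 100/(0.6 - 0.4) = 500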

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern.

[Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost.

These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Thus, the company must have an idea on how long the clients will live. It proposes the following model, summarising the state of health of an individual on a monthly basis: the states are H (healthy), S (sick) and D (dead), with monthly transition probabilities pH,S = 0.3, pS,H = 0.8, pS,D = 0.1 and pH,D = 0.01. Thus, for example, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H, pS,S and pD,D; these are pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1. Also, pD,H = pD,S = 0, the reason being that the company does not believe that its clients are subject to resurrection; therefore D is an absorbing state.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. In other words: conditionally on the present, the future is independent of the past. Sometimes we say that {Xn, n ∈ T} is Markov, instead of saying that it has the Markov property.

Usually, T = Z+ = {0, 1, 2, 3, ...} or T = N = {1, 2, 3, ...} or T is the set Z of all integers (positive, zero, or negative). We focus, almost exclusively, on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space; the elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, ..., in+1 ∈ S,

P(Xn+1 = in+1 | Xn = in, ..., X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov, then the claim holds by the conditional independence of Xn+1 and (X0, ..., Xn−1) given Xn. On the other hand, if the claim holds, we can see, by applying it several times, that

P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, ..., X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),

for any m and any choice of states. This is true for any n and m, and so the future is independent of the past, given the present, which precisely means that (Xn+1, ..., Xn+m) and (X0, ..., Xn−1) are independent conditionally on Xn. This is the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

P(X0 = i0, X1 = i1, X2 = i2, ..., Xn = in)
  = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) ··· P(Xn = in | Xn−1 = in−1).   (1)

Equation (1) means that the two functions

μ(i) := P(X0 = i), i ∈ S,

and

pi,j(n, n + 1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,

will specify all joint distributions (and, more generally, by a result in Probability, all probabilities of events associated with the Markov chain):

P(X0 = i0, X1 = i1, ..., Xn = in) = μ(i0) pi0,i1(0, 1) pi1,i2(1, 2) ··· pin−1,in(n − 1, n).

The function μ(i) is called the initial distribution, and the function pi,j(n, n + 1) is called the transition probability from state i at time n to state j at time n + 1. More generally,

pi,j(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by (1),

pi,j(m, n) = Σ_{i1∈S} ··· Σ_{in−1∈S} pi,i1(m, m + 1) ··· pin−1,j(n − 1, n),   (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

P(m, n) := [pi,j(m, n)]i,j∈S,

a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are the pi,j(m, n). If |S| = d, then we have d × d matrices; if S is an infinite set, then the matrices are infinitely large. (This causes some analytical difficulties, but we will circumvent them by using Probability only.) In this new notation, we can write (2) as

P(m, n) = P(m, m + 1) P(m + 1, m + 2) ··· P(n − 1, n),

or as

P(m, n) = P(m, ℓ) P(ℓ, n) if m ≤ ℓ ≤ n,

something that can be called the semigroup property or Chapman-Kolmogorov equation.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n.

For example. n) = P × P × · · · · · · × P = Pn−m . if Y is a real random variable. when we say “Markov chain” we will mean “timehomogeneous Markov chain”. for all integers a. let µn := [µn (i) = P (Xn = i)]i∈S be the distribution of Xn . unless otherwise specified. as follows from (1) and the definition of P. The matrix notation is useful.j∈S is called the (one-step) transition probability matrix or. Similarly.and therefore we can simply write pi. A useful convention used for time-homogeneous chains is to write Pi (A) := P (A | X0 = i). From now on. if µ assigns probability p to state i1 and 1 − p to state i2 . In other words. Starting from a specific state i. For example. Thus Pi is the law of the chain when the initial distribution is δi . δi (j) := 1. It corresponds. then µ = pδi1 + (1 − p)δi2 . Then P(m. we write Ei Y := E(Y | X0 = i) = yP (Y = y | X0 = i) = yPi (Y = y). transition matrix. (3) We call δi the unit mass at i. while P = [pi. n + 1) =: pi. and call pi.j the (one-step) transition probability from state i to state j. and the semigroup property is now even more obvious because Pa+b = Pa Pb . n−m times for all m ≤ n.j . arranged in a row vector. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi . simply. Notational conventions Unit mass.j (n. y y 8 . b ≥ 0. of course. to picking state i with probability 1. if j = i otherwise.j ]i. Notice that any probability distribution on S can be written as a linear combination of unit masses. 0. Then we have µ n = µ 0 Pn .

4 Stationarity

We say that the process (Xn, n = 0, 1, ...) is stationary if it has the same law as (Xn, n = m, m + 1, ...), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following examples.

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2, then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa.

Let Xn be the number of molecules in room 1 at time n. Then

pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1), then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room.

The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2, since, on the average, we indeed expect to have N/2 molecules in each room. But this will not work, because it is impossible for the number of molecules to remain constant at all times. Another guess, then, is the following.

Place each molecule at random either in room 1 or in room 2, with probability 1/2, independently of the other molecules. If we do so, then the number of molecules in room 1 will be i with probability

π(i) = (N choose i) 2^(−N),   0 ≤ i ≤ N

(a binomial distribution). We can now verify that if P(X0 = i) = π(i), then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

P(X1 = i) = Σ_j P(X0 = j) pj,i
          = π(i − 1) pi−1,i + π(i + 1) pi+1,i
          = (N choose i−1) 2^(−N) (N − i + 1)/N + (N choose i+1) 2^(−N) (i + 1)/N
          = ··· = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: we found (by guessing!) some distribution π with the property that if X0 has distribution π, then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

P(X0 = i0, X1 = i1, ..., Xn = in) = P(Xm = i0, Xm+1 = i1, ..., Xm+n = in),

for all m ≥ 0 and i0, ..., in ∈ S.

Since, by (3), the distribution μn at time n is related to the distribution at time 0 by μn = μ0 P^n, we see that the Markov chain is stationary if and only if the distribution μ0 is chosen so that μ0 = μ0 P. In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that

πP = π.   (4)

Such a π is called a stationary or invariant distribution, and the equations (4) are called the balance equations.
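The "missing algebra" can also be side-stepped numerically. This sketch (with a deliberately small N of our choosing) builds the Ehrenfest transition matrix and verifies that the binomial π satisfies πP = π:

    import numpy as np
    from math import comb

    N = 10
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N:
            P[i, i + 1] = (N - i) / N   # a molecule moves into room 1
        if i > 0:
            P[i, i - 1] = i / N         # a molecule moves out of room 1
    pi = np.array([comb(N, i) * 2.0**(-N) for i in range(N + 1)])

    print(np.allclose(pi @ P, pi))  # True: the binomial distribution is stationary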

. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. i∈S The matrix P has the eigenvalue 1 always because j∈S pi. The times between successive buses are i. . 2 . π must be a probability: π(i) = 1. d! x ∈ Sd . random variables. write the equations (4): π(i) = π(j)pj. which. It is easy to see that the stationary distribution is uniform: π(x) = 1 . Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus.d. The state space is S = N = {1. . i ≥ 0. To find the stationary distribution. we use the normalisation condition π(1) = i≥1 j≥i = 1. π(1) will be zero.Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. After bus arrives at a bus stop. d}. 11 . then we run into trouble: indeed. can be written as P1 = 1. the next one will arrive in i minutes with probability p(i). .i = π(0)p(i) + π(i + 1). and so each π(i) will be zero. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions.i. where 1 is a column whose entries are all 1.. . Who tells us that the sum inside the parentheses is finite? If the sum is infinite. Then (Xn . . Thus. which in turn means that the normalisation condition cannot be satisfied. Example: waiting for a bus. i = 1. if i > 0 i ∈ N. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. .}. . 2.i = p(i).j = 1. which gives p(j) . p1. In addition to that. . i ≥ 1. in matrix notation. xd ). we must be careful. . each x ∈ Sd is of the form x = (x1 . Keep doing it continuously. j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). Example: card shuffling. This means that the solution to the balance equations is identically zero. . and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. where all the xi are distinct. n ≥ 0) is a Markov chain with transition probabilities pi+1. But here. i≥1 π(i) −1 To compute π(1).i = 1. . .

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted; but to even think about it will give you some strength.) Consider an infinite binary vector ξ = (ξ1, ξ2, ...), i.e. ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0, then shift to the left to obtain (ξ2, ξ3, ...); if ξ1 = 1, then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, ...). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

ξ(n + 1) = ϕ(ξ(n)),

where ξ(n) = (ξ1(n), ξ2(n), ...) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. We can show that if we start with the ξr(0), r = 1, 2, ..., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:

P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, for any n = 0, 1, 2, ..., and any r = 1, 2, ....   (5)

The proof is by induction: if ξ1(n), ξ2(n), ... are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then, from (5), the same is true at n + 1. Indeed, for any ε1, ..., εr ∈ {0, 1},

P(ξ1(n+1) = ε1, ..., ξr(n+1) = εr)
  = P(ξ1(n) = 0, ξ2(n) = ε1, ..., ξr+1(n) = εr) + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, ..., ξr+1(n) = 1 − εr).

Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0, then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1, then x ≥ 1/2, and the rule maps x into 2(1 − x). The function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread: it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: when a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

π = πP (balance equations),
π1 = 1 (normalisation condition).

In other words,

π(i) = Σ_{j∈S} π(j) pj,i,   i ∈ S,
Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations, but we only have d unknowns. This is OK, because the first d equations are linearly dependent (the sum of both sides equals 1), therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement, under a distribution π:

F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j.

Think of π(i) pi,j as a "current" flowing from i to j; then F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

F(A, Ac) = F(Ac, A), for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Multiply π(i) by Σ_{j∈S} pi,j (which equals 1):

π(i) Σ_{j∈S} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Now split the left sum also:

Σ_{j∈A} π(i) pi,j + Σ_{j∈Ac} π(i) pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek, F(A, Ac) = F(Ac, A).

[Figure: schematic representation of the flow balance relations F(A, Ac) = F(Ac, A).]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example.

Example: Consider the 2-state Markov chain with p12 = α, p21 = β, p11 = 1 − α, p22 = 1 − β. Assume 0 < α, β < 1. We have

π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

π(1) = cβ,   π(2) = cα.

The constant c is found from π(1) + π(2) = 1, giving c = 1/(α + β), and so

π(1) = β/(α + β),   π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952 and π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
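The balance equations can, of course, also be solved mechanically. This sketch recovers the mouse-chain answer by solving πP = π together with the normalisation condition as an (overdetermined but consistent) linear system:

    import numpy as np

    alpha, beta = 0.05, 0.99
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    # Stack pi (P - I) = 0 with pi 1 = 1 and solve by least squares.
    A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(pi)  # about [0.952, 0.048] = [beta, alpha]/(alpha + beta)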

Example: a walk with a barrier. Consider the following chain, where p + q = 1: the states are 0, 1, 2, 3, ...; from each state i ≥ 1 the chain moves to i + 1 with probability p and to i − 1 with probability q, while from the barrier 0 it moves to 1 with probability p and stays at 0 with probability q.

The balance equations are

π(i) = p π(i − 1) + q π(i + 1),   i = 1, 2, 3, ....

But we can make them simpler by choosing A = {0, 1, ..., i − 1}: if i ≥ 1 we have

F(A, Ac) = π(i − 1) p = π(i) q = F(Ac, A).

These are much simpler than the previous ones; in fact, we can solve them immediately: for i ≥ 1,

π(i) = (p/q)^i π(0).

Does this provide a stationary distribution? Yes, provided that Σ_{i≥0} π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2, there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities, but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is a set of vertices (here, the state space) and E ⊂ S × S is a set of directed edges, defined by

(i, j) ∈ E ⇐⇒ pi,j > 0.

We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time with positive probability. In other words,

i leads to j (written as i ⇝ j) ⇐⇒ Pi(∃ n ≥ 0 : Xn = j) > 0.

Notice that this relation is reflexive: i ⇝ i for all i ∈ S. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 ··· pin−1,j > 0 for some n ∈ N and some states i1, ..., in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove, so assume i ≠ j.

• Suppose that i ⇝ j. Since

0 < Pi(∃ n ≥ 0 : Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),

the sum on the right must have a nonzero term; hence Pi(Xn = j) > 0 for some n, and (iii) holds.

• Suppose that (iii) holds. We have

Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 ··· pin−1,j,

where the sum extends over all possible choices of states in S. Since the sum is positive, one of its terms is positive, meaning that (ii) holds.

• Suppose now that (ii) holds. Look at the last display again: the sum contains a nonzero term, so Pi(Xn = j) > 0. Because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 : Xm = j), (i) holds.

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k, then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

i communicates with j (written as i ↔ j) ⇐⇒ i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ↔ j ⇐⇒ j ↔ i).

Corollary 2. The relation ↔ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric; it is reflexive and transitive because ⇝ is so.

Just as any equivalence relation, it partitions S into equivalence classes, known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

[i] := {j ∈ S : j ↔ i}.   (6)

So, by definition, [i] = [j] if and only if i ↔ j. Two communicating classes are either identical or completely disjoint.

[Figure: a chain with states 1, ..., 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character: the class {4, 5} is closed (if the chain goes in there, then it never moves out of it), whereas the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

Σ_{j∈C} pi,j = 1, for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: if i ∈ C and i ⇝ j, then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and that i ∈ C, i ⇝ j. We can then find states i1, ..., in−1 such that pi,i1 > 0, pi1,i2 > 0, ..., pin−1,j > 0. Since i ∈ C and C is closed, we have i1 ∈ C; since i1 ∈ C and C is closed, we have i2 ∈ C; and so on, until we conclude that j ∈ C. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts. If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In graph terms: starting from any state, we can reach any other state. In other words still, the state space S is a closed communicating class.

If a state i is such that pi,i = 1, then we say that i is an absorbing state. To put it otherwise, the single-element set {i} is a closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. (Why?)

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: an inessential state i is such that there exists a state j such that i ⇝ j but j does not lead back to i.

[Figure: the general structure of a Markov chain; the communicating classes C1, C2, C3 are not closed, while C4, C5 are closed.]

The general structure of a Markov chain is indicated in this figure. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction. A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes, then we obtain a collection of disjoint closed communicating classes with no arrows between them.

5.4 Period

Consider the following chain: four states 1, 2, 3, 4 arranged in a square, with 1 adjacent to 2 and 3, and 4 adjacent to 2 and 3; at each step the chain moves to one of the two neighbouring states with probability 1/2 each.

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time, then it will move to the set C2 = {2, 4} at the next step, and vice versa: the chain alternates between the two sets. So if the chain starts with X0 ∈ C1, then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by pij(n) the entries of P^n. Since P^(m+n) = P^m P^n, we have

pij(m+n) ≥ pik(m) pkj(n), for all m, n ≥ 0 and all i, j, k ∈ S.

So pii(2n) ≥ (pii(n))² and, more generally, pii(ℓn) ≥ (pii(n))^ℓ for all ℓ ∈ N. Therefore, if pii(n) > 0, then pii(ℓn) > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by n | m) and if pii(n) > 0, then pii(m) > 0. So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

The period of an essential state i is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.

A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.
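The definition of d(i) invites a small computation: mark which powers P^n have a positive (i, i) entry and take the gcd of those n (over a finite range, so this is a check, not a proof). A sketch for the four-cycle drawn above, assuming the two-way square structure we described:

    import numpy as np
    from math import gcd
    from functools import reduce

    # States 1, 2, 3, 4 with edges 1-2, 1-3, 2-4, 3-4; step to a
    # neighbour with probability 1/2 each.
    P = np.array([[0, .5, .5, 0],
                  [.5, 0, 0, .5],
                  [.5, 0, 0, .5],
                  [0, .5, .5, 0]])

    returns = [n for n in range(1, 21)
               if np.linalg.matrix_power(P, n)[0, 0] > 0]
    print(reduce(gcd, returns))  # 2: the period of state 1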

Theorem 2 (the period is a class property). If i ↔ j, then d(i) = d(j).

Proof. Consider two distinct essential states i, j, and let

Di = {n ∈ N : pii(n) > 0},   Dj = {n ∈ N : pjj(n) > 0},

so that d(i) = gcd Di, d(j) = gcd Dj. If i ↔ j, there are α, β ∈ N with pij(α) > 0 and pji(β) > 0. Then

pjj(α+β) ≥ pji(β) pij(α) > 0,

showing that α + β ∈ Dj. Moreover, for any n ∈ Di,

pjj(β+n+α) ≥ pji(β) pii(n) pij(α) > 0,

showing that α + n + β ∈ Dj. Therefore d(j) | α + β and d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers, then it also divides their difference; from the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di, and since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state, then there exists n0 such that pii(n) > 0 for all n ≥ n0.

Proof. Pick n1, n2 ∈ Di such that n2 − n1 = 1. (Such n1, n2 exist: Di is closed under addition, and a set of positive integers which is closed under addition and has greatest common divisor 1 must contain two consecutive integers.) Let n be sufficiently large, and divide n by n1 to obtain n = qn1 + r, where the remainder satisfies r ≤ n1 − 1. Therefore

n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.

Because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di and Di is closed under addition, we have n ∈ Di as well.

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Of particular interest are aperiodic chains. Note that an irreducible chain with finitely many states is aperiodic if and only if there exists an n such that pij(n) > 0 for all i, j ∈ S.

More generally, if an irreducible chain has period d, then we can decompose the state space into d sets C0, C1, ..., Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, ..., Cd−1 such that

Σ_{j∈Cr+1} pij = 1,   i ∈ Cr, r = 0, 1, ..., d − 1.

(Here Cd := C0.)

[Figure: the internal structure of a closed communicating class with period d = 4.]

Proof. Define the relation

i ~d j ⇐⇒ pij(nd) > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into disjoint equivalence classes. We show that these classes satisfy what we need. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, ..., id−1 with corresponding classes C0, C1, ..., Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0, then, necessarily, id ∈ C0.

We now show that if i belongs to one of these classes and if pij > 0, then, necessarily, j belongs to the next class. Take, for example, i ∈ C0, and suppose pij > 0 but j ∉ C1, say j ∈ C2. Consider the path

i0 → i → j → i2 → i3 → ··· → id → i0.

Such a path is possible because of the choice of the i0, i1, ..., and by the assumptions that i0 ~d i and j ~d i2. The existence of such a path implies that it is possible to go from i0 back to i0 in a number of steps which is congruent to d − 1 modulo d (why?), i.e. not a multiple of d, which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time of a set of states A by

TA := inf{n ≥ 0 : Xn ∈ A}.

(We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0; the two times coincide unless X0 ∈ A. We avoid excessive notation by using the same letter for both, but the reader is warned to be alert as to which of the two variables we are considering at each time.)

We are interested in deriving formulae for the probability that this time is finite, as well as for its expectation. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis".

As an example, consider the chain with states 0, 1, 2, where states 0 and 2 are absorbing, and where, from state 1, the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6.

. which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). X2 . ϕ(i) = j∈S i∈A pij ϕ(j). Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). .e. But the Markov chain is homogeneous. we have Pi (TA < ∞) = ϕ(j)pij . the event TA is independent of X0 . Furthermore. By self-feeding this equation. because the chain may end up in state 2. Therefore.). we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij   pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A  pjk ϕ′ (k) k∈A 21 . We then have: Theorem 5. then TA ≥ 1. let ϕ′ (i) be another solution. conditionally on X1 . i ∈ A. Hence: ′ ′ P (TA < ∞|X1 = j. If i ∈ A then TA = 0. and define ϕ(i) := Pi (TA < ∞). So TA = 1 + TA . For the second part of the proof. X0 = i) = P (TA < ∞|X1 = j). ′ is the remaining time until A is hit). fix a set of states A. ′ Proof. j∈S as needed.It is clear that P1 (T0 < ∞) < 1. . We first have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. and so ϕ(i) = 1. Combining the above. But what exactly is this probability equal to? To answer this in its generality. If i ∈ A. a ′ function of X1 . if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S. The function ϕ(i) satisfies ϕ(i) = 1.

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term, we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and let ϕ(i) = Pi(T0 < ∞). We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0, it takes no time to hit 0; if the chain starts from 2, it will never hit 0.) As for ϕ(1), we have

ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

Next, consider the mean time (mean number of steps) until the set A is hit for the first time,

ψ(i) := Ei TA.

We have:

Theorem 6. The function ψ(i) satisfies

ψ(i) = 0, i ∈ A,
ψ(i) = 1 + Σ_{j∉A} pij ψ(j), i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations, then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0, because it takes no time to hit A if the chain starts from A. If i ∉ A, then, as above, TA = 1 + T′A, and

Ei TA = 1 + Ei T′A = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),

which is the second of the equations. (In the last step we dropped the terms with j ∈ A, since ψ(j) = 0 for such j.) The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2} and ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, while

ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been obvious from the start: the underlying experiment is the flipping of a coin with "success" probability 2/3, and the mean number of flips until the first success is 3/2.)
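Both systems in Theorems 5 and 6 are linear, so for a finite chain they can be solved mechanically. The following sketch redoes the two examples above by restricting the equations to the non-boundary states (the indexing and helper names are ours):

    import numpy as np

    # The example chain: states 0, 1, 2; 0 and 2 absorbing;
    # from 1: to 0 w.p. 1/2, stay w.p. 1/3, to 2 w.p. 1/6.
    P = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1/3, 1/6],
                  [0.0, 0.0, 1.0]])
    rest = [1]                       # the only non-boundary state
    Prr = P[np.ix_(rest, rest)]

    # phi on {1} solves (I - Prr) phi = P[1, A] with A = {0}, phi(2) = 0.
    phi = np.linalg.solve(np.eye(1) - Prr, P[np.ix_(rest, [0])] @ np.ones(1))
    print(phi)   # [0.75], i.e. phi(1) = 3/4

    # psi for A = {0, 2} solves (I - Prr) psi = 1.
    psi = np.linalg.solve(np.eye(1) - Prr, np.ones(1))
    print(psi)   # [1.5], i.e. psi(1) = 3/2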

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities

ϕAB(i) = Pi(TA < TB).

It should be clear that the equations satisfied are:

ϕAB(i) = 1, i ∈ A,
ϕAB(i) = 0, i ∈ B,
ϕAB(i) = Σ_j pij ϕAB(j), otherwise.

Furthermore, ϕAB is the minimal solution to these equations.

As yet another application of the first-step analysis method, consider the following situation: every time the state is x, a reward f(x) is earned. This happens up to the first hitting time TA of a set A. The total reward is f(X0) + f(X1) + ··· + f(XTA), and we are interested in the mean total reward

h(x) := Ex Σ_{n=0}^{TA} f(Xn).

Clearly,

h(x) = f(x), x ∈ A,

because when X0 ∈ A, then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then, after one step, TA = 1 + T′A, where T′A is the remaining time, and

h(x) = Ex Σ_{n=0}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=1}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=0}^{T′A} f(Xn+1),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, .... Hence, by the Markov property, as argued earlier,

Ex Σ_{n=0}^{T′A} f(Xn+1) = Σ_{y∈S} pxy h(y).

Thus, the set of equations satisfied by h are

h(x) = f(x), x ∈ A,
h(x) = f(x) + Σ_{y∈S} pxy h(y), x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i he collects i pounds, for i = 1, 2, 3, unless he is in room 4, in which case he dies immediately. The rooms are arranged in a square (1 and 2 on one side, 4 and 3 on the other); through the doorways, he moves from room 1 to room 2 or room 4 with probability 1/2 each, from room 2 to room 1 or room 3 with probability 1/2 each, and from room 3 to room 2 or room 4 with probability 1/2 each.

The question is to compute the average total reward if he starts from room 2. (Sooner or later, he will die in room 4; why?) If we let h(i) be the average total reward when he starts from room i, we have

h(4) = 0,
h(1) = 1 + (1/2) h(2),
h(3) = 3 + (1/2) h(2),
h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
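The thief's system is again a small linear system; here is a sketch that checks the answer (rooms 1..4 indexed 0..3, room 4 being the absorbing "death" state):

    import numpy as np

    P = np.array([[0, .5, 0, .5],
                  [.5, 0, .5, 0],
                  [0, .5, 0, .5],
                  [0, 0, 0, 1]])
    f = np.array([1.0, 2.0, 3.0, 0.0])  # reward i in room i; 0 in room 4

    # h = f + P h on the live rooms, with h(4) = f(4) = 0.
    live = [0, 1, 2]
    Pll = P[np.ix_(live, live)]
    h_live = np.linalg.solve(np.eye(3) - Pll, f[live])
    print(h_live)  # [5. 8. 7.]: h(1) = 5, h(2) = 8, h(3) = 7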

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up, then you win £1 per pound of stake. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), and if just before the k-th toss you decide to bet Yk pounds, then, after the realisation of the toss, you will have earned Yk pounds if heads came up, or lost Yk pounds if tails appeared. Your fortune after the n-th toss is

Xn = X0 + Σ_{k=1}^n Yk ξk.

The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Of course, no gambler is assumed to have any information about the outcome of the upcoming toss, so, if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk; it can only depend on ξ1, ..., ξk−1 and, as is often the case in practice, on other considerations of, e.g., astrological nature. (Think why!)

This has profound actuarial applications: an insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, that it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem, in the simplest case Yk = 1 for all k. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler. The gambler has a target: to make b pounds (b > x); she will stop playing if her money falls below a (a < x). We wish to consider the problem of finding the probability that she reaches the target before her fortune falls to a, and, as a corollary, the probability that the gambler will never lose her money.

We are clearly dealing with a Markov chain on the state space {a, a + 1, a + 2, ..., b − 1, b}, with transition probabilities

pi,i+1 = p,   pi,i−1 = q = 1 − p,   a < i < b,

and where the states a, b are absorbing. Consider the hitting time of state y:

Ty := inf{n ≥ 0 : Xn = y}.

We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x.

Theorem 7. The probability that the gambler's fortune will reach level b > x before reaching level a < x equals

Px(Tb < Ta) = ((q/p)^(x−a) − 1) / ((q/p)^(b−a) − 1), if p ≠ q,
Px(Tb < Ta) = (x − a)/(b − a), if p = q = 1/2,

for a ≤ x ≤ b.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let

ϕ(x, ℓ) := Px(Tℓ < T0),   0 ≤ x ≤ ℓ.

The general solution will then be obtained from

Px(Tb < Ta) = ϕ(x − a, b − a).

To simplify things, write ϕ(x) instead of ϕ(x, ℓ). We obviously have ϕ(0) = 0 and ϕ(ℓ) = 1.

By first-step analysis,

ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1),   0 < x < ℓ.

Since p + q = 1, we can write this as

p [ϕ(x + 1) − ϕ(x)] = q [ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = ··· = (q/p)^x δ(0),   0 < x < ℓ.

Assuming that λ := q/p ≠ 1, we find

ϕ(x) = δ(x − 1) + δ(x − 2) + ··· + δ(0) = [λ^(x−1) + ··· + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).

Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),   0 ≤ x ≤ ℓ, if λ ≠ 1.

If λ = 1, then q = p, and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so ϕ(x) = x/ℓ.

Summarising, we have obtained

ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1;   ϕ(x, ℓ) = x/ℓ, if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

Px(T0 < ∞) = (q/p)^x, if p > 1/2;   Px(T0 < ∞) = 1, if p ≤ 1/2.

Proof. We have

Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2,

Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
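Theorem 7 is easy to test against simulation; here is a sketch with the (arbitrary) choices p = 0.45, a = 0, b = 10, x = 5:

    import random

    def prob_reach_b_first(x, a, b, p):
        lam = (1 - p) / p
        if lam == 1:
            return (x - a) / (b - a)
        return (lam**(x - a) - 1) / (lam**(b - a) - 1)

    def simulate(x, a, b, p, trials=100_000):
        wins = 0
        for _ in range(trials):
            pos = x
            while a < pos < b:
                pos += 1 if random.random() < p else -1
            wins += (pos == b)
        return wins / trials

    print(prob_reach_b_first(5, 0, 10, 0.45))  # about 0.268
    print(simulate(5, 0, 10, 0.45))            # close to the formula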

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, ..., Xn) for all n ∈ Z+; in other words, we can tell whether τ has occurred by time n by looking only at the chain up to time n. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit to a specific set.

Recall that the Markov property states that, for any n, the future process (Xn+1, Xn+2, ...) is independent of the past process (X0, ..., Xn−1) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain and let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, ..., Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, ..., Xτ) if τ < ∞. Any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, ..., Xn). We are going to show that, for any such A,

P(Xτ+1 = j1, Xτ+2 = j2, ... | Xτ = i, A, τ < ∞) = pi,j1 pj1,j2 ···

and that

P(Xτ+1 = j1, Xτ+2 = j2, ..., A | Xτ, τ < ∞) = P(Xτ+1 = j1, Xτ+2 = j2, ... | Xτ, τ < ∞) P(A | Xτ, τ < ∞).

We have:

P(Xτ+1 = j1, Xτ+2 = j2, ..., A, Xτ = i, τ = n)
  (a) = P(Xn+1 = j1, Xn+2 = j2, ..., A, Xn = i, τ = n)
  (b) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
  (c) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i) P(Xn = i, A, τ = n)
  (d) = pi,j1 pj1,j2 ··· P(Xn = i, A, τ = n),

where (a) is just logic; (b) is by the definition of conditional probability; (c) is due to the fact that A ∩ {τ = n} is determined by (X0, ..., Xn), together with the ordinary splitting (Markov) property at time n; and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then dividing both sides by P(Xτ = i, A, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain: the random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain visits this state:

Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the Ti of Section 6: the Ti of Section 6 was allowed to take the value 0, whereas our Ti here is always positive; unless X0 = i, the two definitions coincide. (And we set Ti(1) := Ti.) We let Ti(2) be the time of the second visit to i, Ti(3) the time of the third visit, and so on. Formally, we recursively define

Ti(r) := inf{n > Ti(r−1) : Xn = i},   r = 2, 3, ....

All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

Xi(r) := (Xn, Ti(r) ≤ n < Ti(r+1)),   r = 1, 2, 3, ...,

and let also

Xi(0) := (Xn, 0 ≤ n < Ti(1)).

We think of the Xi(r) as "random variables" in a generalised sense: they are, in fact, random trajectories, i.e. "chunks" of the process. We refer to them as the excursions of the process.

[Figure: a trajectory split at the successive visit times Ti(1), Ti(2), Ti(3), Ti(4); these excursions are independent!]

. we say that a state i is recurrent if. Perhaps some states (inessential ones) will be visited finitely many times. Xi (1) . but important corollary of the observation of this section is that: Corollary 5. the future after Ti is independent of the past before Ti . In particular. Therefore.. We abstract this trivial observation and give a couple of definitions: First. g(Xi (1) ). Xi distribution. we say that a state i is transient if. But (r) the state at Ti is. . for time-homogeneous Markov chains. .. we “watch” it up until the next time it returns to state i. if it starts from state i at some point of time. 10 Recurrence and transience When a Markov chain has finitely many states. . The last corollary ensures us that Λ1 .i. given the state at (r) (r) (r) the stopping time Ti . (0) are independent random variables. (r) . they are also identical in law (i. The claim that Xi . Apply Strong Markov Property (SMP) at Ti : SMP says that. . all. . Λ2 . starting from i. A trivial. and “goes for an excursion”. The excursions Xi (0) . the future is independent of the (1) (2) past. the chain returns to i only finitely many times: Pi (Xn = i for infinitely many n) = 0.d. equal to i. except. by definition. are independent random variables. Xi (2) . but there are always states which will be visited again and again. Therefore. Example: Define Ti (r+1) (0) ). g(Xi (2) ). Moreover.). at least one state must be visited infinitely many times. ad infinitum. . Xi have the same distribution follows from the fact that the Markov chain is time-homogeneous. of the excursions are Λr := (r) n=Ti 1(Xn = j) The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. and this proves the first claim.state i for the r-th time. An immediate consequence of the strong Markov property is: Theorem 9. it will behave in the same way as if it starts from the same state at another point of time. the chain returns to i infinitely many times: Pi (Xn = i for infinitely many n) = 1. starting from i. (Numerical) functions g(Xi independent random variables. Second. possibly. have the same Proof. 29 .

Lemma 5. State i is recurrent. in principle. The following are equivalent: 1. 4. 2. This means that the state i is recurrent. We will show that such a possibility does not exist. (k−1) Notice that fii = Pi (Ti < ∞) can either be equal to 1 or smaller than 1. This means that the state i is transient. Therefore. k ≥ 0. Suppose that the Markov chain starts from state i. therefore concluding that every state is either recurrent or transient. the event {Ji ≥ k} implies that Ti < ∞ and that Ji ≥ k − 1. and that is why we get the product. we conclude what we asserted earlier. 2 k Pi (Ji ≥ k) = fii Pi (Ji ≥ k − 1) = fii Pi (Ji ≥ k − 2) = · · · = fii . then Pi (Ji < ∞) = 1. namely. Pi (Ji = ∞) = 1.Important note: Notice that the two statements above are not negations of one another. 3. we have Proof. fii = Pi (Ti < ∞) = 1. Because either fii = 1 or fii < 1. But the past (k−1) before Ti is independent of the future. If fii < 1. there could be yet another possibility: 0 < Pi (Xn = i for infinitely many n) < 1. Notice also that fii := Pi (Ji ≥ 1) = Pi (Ti < ∞). i. Starting from X0 = i.e. Indeed. Ei Ji = ∞. Notice that: state i is visited infinitely many times ⇐⇒ Ji = ∞. every state is either recurrent or transient. 30 . Consider the total number of visits to state i (excluding the possibility that X0 = i): ∞ Ji = n=1 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞} . If fii = 1 we have that Pi (Ji ≥ k) = 1 for all k. By the strong Markov property at time Ti (k−1) Pi (Ji ≥ k) = Pi (Ji ≥ k − 1)Pi (Ji ≥ 1). . We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas. Indeed. the random variable Ji is geometric: k Pi (Ji ≥ k) = fii . Pi (Ji = ∞) = 1. We claim that: Lemma 4.

Proof. The equivalence of the first two statements was proved above. The third statement is, by the definition of Ji, equivalent to the statement that state i is visited infinitely many times with probability 1, i.e. to recurrence. If Pi(Ji = ∞) = 1 then, clearly, Ei Ji = ∞. Conversely, since Ji is a geometric random variable, its expectation cannot be infinite unless Ji itself is infinite with probability one; hence Ei Ji = ∞ implies Pi(Ji = ∞) = 1.

Lemma 6. The following are equivalent:
1. State i is transient.
2. fii = Pi(Ti < ∞) < 1.
3. Pi(Ji < ∞) = 1.
4. Ei Ji < ∞.

Proof. The equivalence of the first two statements was proved before the lemma. The third statement is, by definition of Ji, equivalent to the statement that state i is visited finitely many times with probability 1. If i is transient then Ei Ji < ∞, owing to the fact that Ji is a geometric random variable with fii < 1. On the other hand, if we assume Ei Ji < ∞, then Ji cannot take the value +∞ with positive probability, and therefore Pi(Ji < ∞) = 1.

Corollary 6. If i is a transient state then p_ii^(n) → 0, as n → ∞.

Proof. If i is transient then Ei Ji < ∞. But

Ei Ji = Ei Σ_{n=1}^∞ 1(Xn = i) = Σ_{n=1}^∞ Pi(Xn = i) = Σ_{n=1}^∞ p_ii^(n),

and if a series is finite then its summands go to 0; hence p_ii^(n) → 0.

10.1 First hitting time decomposition

Define

f_ij^(n) := Pi(Tj = n), n ≥ 1,

i.e. the probability, assuming that the chain starts from state i, to hit state j for the first time at time n. Notice the difference with p_ij^(n): the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, f_ij^(n) ≤ p_ij^(n), but the two sequences are related in an exact way:

Lemma 7.

p_ij^(n) = Σ_{m=1}^n f_ij^(m) p_jj^(n−m), n ≥ 1.

Proof. By definition,

p_ij^(n) = Pi(Xn = j) = Σ_{m=1}^n Pi(Tj = m, Xn = j) = Σ_{m=1}^n Pi(Tj = m) Pi(Xn = j | Tj = m).

The first factor in the m-th summand equals f_ij^(m), by definition. For the second factor, we use the Strong Markov Property:

Pi(Xn = j | Tj = m) = Pi(Xn = j | Xm = j) = p_jj^(n−m).
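Lemma 7 is easy to verify numerically. The sketch below (not from the notes; the matrix is an arbitrary illustration) computes the first-passage probabilities f_ij^(m) via the standard taboo recursion f_ij^(1) = p_ij, f_ij^(m) = Σ_{k≠j} p_ik f_kj^(m−1), and compares the two sides of the decomposition.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])   # illustrative transition matrix
i, j, n = 0, 2, 6

# First-passage probabilities f_kj^(m): paths must avoid j before time m.
Q = P.copy()
Q[:, j] = 0                       # taboo matrix: kill transitions into j
f = [None] * (n + 1)
f[1] = P[:, j]                    # f_kj^(1) = p_kj
for m in range(2, n + 1):
    f[m] = Q @ f[m - 1]           # f_kj^(m) = sum over l != j of p_kl f_lj^(m-1)

lhs = np.linalg.matrix_power(P, n)[i, j]
rhs = sum(f[m][i] * np.linalg.matrix_power(P, n - m)[j, j] for m in range(1, n + 1))
print(lhs, rhs)                   # the two numbers agree, as Lemma 7 asserts
```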

Corollary 7. If j is a transient state then p_ij^(n) → 0, as n → ∞, for all states i.

Proof. If j is transient then, by Lemma 6, Σ_{n=1}^∞ p_jj^(n) = Ej Jj < ∞. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right hand side)

Σ_{n=1}^∞ p_ij^(n) = Σ_{m=1}^∞ f_ij^(m) Σ_{n=m}^∞ p_jj^(n−m) = (1 + Ej Jj) Σ_{m=1}^∞ f_ij^(m) = (1 + Ej Jj) Pi(Tj < ∞),

which is finite. But if a series is finite then its summands go to 0, and hence p_ij^(n) → 0 as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let

fij := Pi(Tj < ∞),

we have

i ⇝ j ⟺ fij > 0.

We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and

fij = Pi(Tj < ∞) = 1.

In particular, j ⇝ i.

Proof. Start with X0 = i. If i is recurrent and i ⇝ j then there is a positive probability (= fij) that j will appear in one of the i.i.d. excursions Xi^(0), Xi^(1), . . ., and so the probability that j will appear in a specific excursion is positive:

gij := Pi(Tj < Ti) > 0.

So the random variables

δ_{j,r} := 1(j appears in excursion Xi^(r)), r = 0, 1, . . .,

are i.i.d., and, since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence not only fij > 0 but, in fact, fij = 1; in particular, j is recurrent and j ⇝ i. The last statement of the theorem is simply an expression in symbols of what we said above.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Hence 11 Positive recurrence Recall that if we fix a state i and let Ti be the first return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state). What is the expectation of T ? First. . Recall that a finite random variable may not necessarily have finite expectation. i. Thus T represents the number of variables we need to draw to get a value larger than the initial one.e.e. Every finite closed communicating class is recurrent. and hence. U2 . . . there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0.d. If a communicating class contains a state which is transient then. Hence all states are recurrent.i. If i is inessential then it is transient. all other states are also transient. 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ). if started in C. Theorem 11. appear in a certain game. Hence. But then. Let C be the class. Example: Let U0 . . . be i. necessarily. Proof. 1]. Proof. X remains in C. Un ≤ x | U0 = x)dx xn dx = 1 . Let T := inf{n ≥ 1 : Un > U0 }. Suppose that i is inessential. i. . . Pi (B) = 0. for example. uniformly distributed in the unit interval [0. P (T > n) = P (U1 ≤ U0 . So if a communicating class is recurrent then it contains no inessential states and so it is closed. some state i must be visited infinitely many times. By finiteness. . Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x. P (B) = Pi (Xn = i for infinitely many n) < 1. j but j i. and so i is a transient state. . . We now take a closer look at recurrence. Indeed. If i is recurrent then every other j in the same communicating class satisfies i j. Xn = i for infinitely many n) = 0. By closedness. every other j is also recurrent. U1 . n+1 33 = 0 . It should be clear that this can. Theorem 12. but Pi (A ∩ B) = Pi (Xm = j.Proof. observe that T is finite. This state is recurrent. . random variables. by the above theorem.

We now classify a recurrent state i as follows. We say that i is positive recurrent if

Ei Ti < ∞.

We say that i is null recurrent if

Ei Ti = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as a "Law". But this is misleading: it is not a Law; it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z1, Z2, . . . be i.i.d. random variables.
(i) Suppose E|Z1| < ∞ and let µ = EZ1. Then

P( lim_{n→∞} (Z1 + · · · + Zn)/n = µ ) = 1.

(ii) Suppose E max(Z1, 0) = ∞, but E min(Z1, 0) > −∞. Then

P( lim_{n→∞} (Z1 + · · · + Zn)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X0, X1, X2, . . . of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions

Xa^(1), Xa^(2), Xa^(3), . . .

forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that

G(Xa^(1)), G(Xa^(2)), G(Xa^(3)), . . .

are i.i.d. random variables, for reasonable choices of the functions G taking values in R. Note that

EG(Xa^(1)) = EG(Xa^(2)) = · · · = E[ G(Xa^(0)) | X0 = a ] = Ea G(Xa^(0)).

The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a.

Example 1: To each state x ∈ S, assign a reward f(x). Then, for an excursion X, define G(X) to be the maximum reward received during this excursion. Specifically,

G(Xa^(r)) = max{ f(Xt) : Ta^(r) ≤ t < Ta^(r+1) }.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:

G(Xa^(r)) = Σ_{Ta^(r) ≤ t < Ta^(r+1)} f(Xt). (7)

Whichever the choice of G, because the excursions Xa^(1), Xa^(2), . . . are i.i.d., and because, if we start with X0 = a, the initial excursion Xa^(0) also has the same law as the rest of them, the law of large numbers tells us that

( G(Xa^(1)) + · · · + G(Xa^(n)) ) / n → EG(Xa^(1)), (8)

with probability 1, as n → ∞.

13.2 Average reward

Consider now a function f : S → R, which, as in Example 2, we can think of as a reward received when the chain is in state x. We are interested in the existence of the limit

time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(Xk),

which represents the 'time-average reward' received. In what follows, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.
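As an illustration of (8) with the choice (7), the following sketch (not from the notes; the two-state chain and the rewards are assumptions) collects the total reward over each excursion from a ground state a and averages over excursions. For this particular chain one can compute Ea G(Xa^(0)) = f(0)·1 + f(1)·p01/p10 ≈ 7.67 by hand, which the simulation should reproduce.

```python
import random

P = {0: [0.6, 0.4], 1: [0.3, 0.7]}   # illustrative two-state chain
f = {0: 1.0, 1: 5.0}                 # reward function
a = 0                                # ground state

def excursion_rewards(n_excursions):
    # G_r = total reward over the r-th excursion from a (endpoint at a included
    # once, at the start of the excursion, as in (7)).
    x, G, total = a, [], f[a]
    while len(G) < n_excursions:
        x = random.choices([0, 1], weights=P[x])[0]
        if x == a:                   # excursion completed at the return to a
            G.append(total)
            total = f[a]             # start the next excursion
        else:
            total += f[x]
    return G

G = excursion_rewards(50_000)
print(sum(G) / len(G))               # close to E_a G, cf. (8)
```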

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. Ea Ta < ∞) and assume that the distribution of X0 is such that P(Ta < ∞) = 1. Let f : S → R+ be a bounded 'reward' function, say |f(x)| ≤ B. Then the 'time-average reward' f̄ exists and is a deterministic quantity:

P( lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = f̄ ) = 1,

where

f̄ = Ea Σ_{n=0}^{Ta−1} f(Xn) / Ea Ta = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this. Suppose that t is large enough. We are looking for the total reward received between 0 and t. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, Nt, of complete excursions from the ground state a that occur between 0 and t:

Nt := max{r ≥ 1 : Ta^(r) ≤ t}.

For example, in the figure below, Nt = 5.

[Figure: a time axis showing the visit times 0, Ta^(1), Ta^(2), Ta^(3), Ta^(4), Ta^(5) and the current time t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:

Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + Gfirst + Glast,

where Gr is given by (7), Gfirst is the reward up to time Ta = Ta^(1) or t (whichever is smaller), and Glast is the reward over the last incomplete cycle. Since P(Ta < ∞) = 1 and f is bounded, we have |Gfirst| ≤ B Ta, and, with probability one, B Ta/t → 0, so that

(1/t) Gfirst → 0, as t → ∞, with probability one.

A similar argument shows that

(1/t) Glast → 0, as t → ∞, with probability one.

So we are left with the sum of the rewards over complete cycles:

(1/t) Σ_{r=1}^{Nt−1} Gr. (9)

Write now

(1/t) Σ_{r=1}^{Nt−1} Gr = (Nt/t) · (1/Nt) Σ_{r=1}^{Nt−1} Gr.

Look again at what the Law of Large Numbers (8) tells us: that

lim_{n→∞} (1/n) Σ_{r=1}^n Gr = EG1, (10)

with probability one. Since Ta^(r) → ∞ as r → ∞, it follows that the number Nt of complete cycles before time t will also be tending to ∞, as t → ∞. Hence, also,

lim_{t→∞} (1/Nt) Σ_{r=1}^{Nt−1} Gr = EG1,

with probability one. Now look at the sequence

Ta^(r) = Ta^(1) + Σ_{k=2}^r ( Ta^(k) − Ta^(k−1) ),

and see what the Law of Large Numbers tells us again: since the differences Ta^(k) − Ta^(k−1), k = 2, 3, . . ., are i.i.d. random variables with common expectation Ea Ta, and since Ta^(1)/r → 0, we have

lim_{r→∞} Ta^(r)/r = Ea Ta. (11)

By definition, the visit with index r = Nt to the ground state a is the last before t (see also the figure above):

Ta^(Nt) ≤ t < Ta^(Nt+1).

Hence

Ta^(Nt)/Nt ≤ t/Nt < Ta^(Nt+1)/Nt.

But (11) tells us that

lim_{t→∞} Ta^(Nt)/Nt = lim_{t→∞} Ta^(Nt+1)/Nt = Ea Ta.

Hence

lim_{t→∞} Nt/t = 1/Ea Ta. (12)

Using (10) and (12) in (9) we have

lim_{t→∞} (1/t) Σ_{r=1}^{Nt−1} Gr = EG1 / Ea Ta.

Thus

lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = EG1 / Ea Ta =: f̄.

To see that f̄ is as claimed, just observe that

EG1 = E Σ_{Ta^(1) ≤ t < Ta^(2)} f(Xt) = Ea Σ_{0 ≤ t < Ta} f(Xt),

from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion: the numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included); the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both; we may write

EG1 = Ea Σ_{0 ≤ t < Ta} f(Xt) or EG1 = Ea Σ_{0 < t ≤ Ta} f(Xt).

[Figure: the rewards f(X0) = f(a), f(X1), f(X2), . . . , f(X_{Ta−1}) collected over one excursion from a, between times 0 and Ta.]

Comment 2: A very important particular case occurs when we choose

f(x) := 1(x = b),

i.e. we let f be 1 at the state b and 0 everywhere else. Then the numerator equals

Ea Σ_{k=0}^{Ta−1} 1(Xk = b) = Ea [ number of times state b is visited between two successive visits to state a ],

and the theorem above then says that, for all b ∈ S,

lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = b) = Ea Σ_{k=0}^{Ta−1} 1(Xk = b) / Ea Ta, (13)

with probability 1.

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence

lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = a) = 1/Ea Ta, (14)

with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes Ea Ta units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/Ea Ta.

Comment 4: Suppose that a and b are two states which communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/Eb Tb. The equality of these two limits gives:

[ average no. of times state b is visited between two successive visits to state a ] / Ea Ta = 1/Eb Tb.
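Formula (14) can be checked directly by simulation. The sketch below (not from the notes; the chain is an illustrative assumption) compares the empirical fraction of time spent at the ground state with the reciprocal of the empirical mean return time; the two estimates should agree.

```python
import random

P = {0: [0.6, 0.4], 1: [0.3, 0.7]}   # illustrative two-state chain
a, steps = 0, 200_000
x, visits, returns, last = a, 0, [], 0
for t in range(1, steps + 1):
    x = random.choices([0, 1], weights=P[x])[0]
    if x == a:
        visits += 1
        returns.append(t - last)     # duration of the completed excursion
        last = t
print(visits / steps)                     # long-run fraction of time at a
print(1 / (sum(returns) / len(returns)))  # 1 / (empirical E_a T_a): matches
```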

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

ν[a](x) := Ea Σ_{n=0}^{Ta−1} 1(Xn = x), x ∈ S, (15)

should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then

ν[a](x) / Ea Ta

is the long-run fraction of times that state x is visited. In view of this, ν[a] should have something to do with the concept of stationary distribution discussed a few sections ago. Notice that the choice of notation ν[a] is justifiable through the notation (6) for the communicating class of the state a. Note also that, although ν[a] depends on the choice of the 'ground' state a, we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent⁵. Consider the function ν[a] defined in (15). Then ν[a] satisfies the balance equations:

ν[a] = ν[a] P,

i.e.

ν[a](x) = Σ_{y∈S} ν[a](y) p_yx, x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν[a](x) < ∞.

⁵ We stress that we do not need the stronger assumption of positive recurrence here.

Proof. We start the Markov chain with X0 = a. Recurrence tells us that Ta < ∞ with probability one. Since a = X_{Ta}, we can write

ν[a](x) = Ea Σ_{n=1}^{Ta} 1(Xn = x). (16)

This was a trivial but important step: we replaced the n = 0 term with the n = Ta term. Both of them equal 1(x = a), so there is nothing lost in doing so. The real reason for doing so is because we wanted to replace a function of X0, X1, X2, . . . by a function of X1, X2, X3, . . .

and thus be in a position to apply first-step analysis, which is precisely what we do next:

ν[a](x) = Ea Σ_{n=1}^∞ 1(Xn = x, n ≤ Ta)
= Σ_{n=1}^∞ Pa(Xn = x, n ≤ Ta)
= Σ_{n=1}^∞ Σ_{y∈S} Pa(Xn = x, Xn−1 = y, n ≤ Ta)
= Σ_{n=1}^∞ Σ_{y∈S} p_yx Pa(Xn−1 = y, n ≤ Ta)
= Σ_{y∈S} p_yx Ea Σ_{n=1}^{Ta} 1(Xn−1 = y)
= Σ_{y∈S} p_yx ν[a](y),

where, in this last step, we used (16). So we have shown

ν[a](x) = Σ_{y∈S} p_yx ν[a](y), for all x ∈ S,

which are precisely the balance equations.

Next we show that 0 < ν[a](x) < ∞ for all x such that a ⇝ x. First note that, by definition, ν[a](a) = 1. We just showed that ν[a] = ν[a]P; hence, by induction,

ν[a] = ν[a] P^n, for all n ∈ N.

Let x be such that a ⇝ x, so there exists n ≥ 1 such that p_ax^(n) > 0. Hence

ν[a](x) ≥ ν[a](a) p_ax^(n) = p_ax^(n) > 0.

From Theorem 10, x ⇝ a, so there exists m ≥ 1 such that p_xa^(m) > 0. Hence

1 = ν[a](a) ≥ ν[a](x) p_xa^(m),

so ν[a](x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then

Σ_{x∈S} ν[a](x) = Ea Σ_{n=1}^{Ta} Σ_{x∈S} 1(Xn = x) = Ea Σ_{n=1}^{Ta} 1 = Ea Ta,

which is finite when a is positive recurrent.
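Proposition 2 can also be checked by simulation. The sketch below (not from the notes; the transition matrix is an arbitrary illustration) estimates ν[a](x) by averaging the visit counts over many excursions from a, with the convention (15), and then verifies the balance equations by comparing ν[a] with ν[a]P.

```python
import numpy as np
import random

P = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])    # illustrative chain
a, n_exc = 0, 100_000
counts = np.zeros(3)
for _ in range(n_exc):
    x = a
    counts[a] += 1                 # the n = 0 term of the excursion, as in (15)
    while True:
        x = random.choices(range(3), weights=P[x])[0]
        if x == a:                 # excursion ends; the return visit is not counted
            break
        counts[x] += 1
nu = counts / n_exc                # Monte Carlo estimate of nu^[a]
print(nu)
print(nu @ P)                      # balance equations: nu P should equal nu
```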

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define

π[a](x) := ν[a](x) / Ea Ta = Ea Σ_{n=0}^{Ta−1} 1(Xn = x) / Ea Ta, x ∈ S. (17)

Then π[a] is a stationary distribution.

Proof. Since a is positive recurrent, we have Ea Ta < ∞. We have already shown, in Proposition 2, that ν[a]P = ν[a]. But π[a] is simply a multiple of ν[a], therefore

π[a] P = π[a].

It only remains to show that π[a] is a probability, i.e. that it sums up to one. But

Σ_{x∈S} π[a](x) = (1/Ea Ta) Σ_{x∈S} ν[a](x) = 1.

Also, π[a](a) > 0, and so π[a] is not trivially equal to zero.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! In other words, if there is a positive recurrent state, then the set of linear equations

π(x) = Σ_{y∈S} π(y) p_yx, x ∈ S, Σ_{y∈S} π(y) = 1,

has at least one solution. We DID NOT claim that this distribution is, in general, unique; this is something we will examine in the sequel.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X0 = i. We already know that j is recurrent. Consider the excursions Xi^(r), r = 0, 1, 2, . . . Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion, and the probability gij = Pi(Tj < Ti) that j occurs in a specific excursion is positive. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads gij > 0; the coins are i.i.d.; if heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R1 (respectively, R2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the excursions from i with indices R1, R1 + 1, R1 + 2, . . . , R2; the stretch ζ between the first and the second occurrence of j has the same law as Tj under Pj.]

We just need to show that the sum of the durations of the excursions with indices R1, R1 + 1, . . . , R2, namely

ζ := Σ_{r=1}^{R2−R1+1} Zr,

has finite expectation, where the Zr are i.i.d. with the same law as Ti. But, by Wald's lemma, the expectation of this random variable equals

Ei ζ = (Ei Ti) × Ei(R2 − R1 + 1) = (Ei Ti) × (gij^{−1} + 1),

because R2 − R1 is a geometric random variable:

Pi(R2 − R1 = r) = (1 − gij)^{r−1} gij, r = 1, 2, . . .

Since gij > 0 and Ei Ti < ∞, we have that Ei ζ < ∞. Hence Ej Tj < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

In other words:
An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent.
An irreducible Markov chain is called transient if one (and hence all) states are transient.
An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent.
An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.

[Diagram: an IRREDUCIBLE CHAIN is either RECURRENT (and then either POSITIVE RECURRENT or NULL RECURRENT) or TRANSIENT.]
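For a finite chain, this classification can be automated. The sketch below (not from the notes) computes the communicating classes from a reachability matrix and labels each class: by Theorems 11 and 12, a closed class of a finite chain is recurrent, and a non-closed class consists of inessential, hence transient, states. The three-state matrix is the chain used in the uniqueness example of the next section.

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0],
              [1/3, 1/2, 1/6],
              [0.0, 0.0, 1.0]])
n = P.shape[0]
# R[i, j] is True iff j is reachable from i in at most n steps (or j == i).
R = np.linalg.matrix_power(np.eye(n) + P, n) > 0

classes = []
for i in range(n):
    C = frozenset(j for j in range(n) if R[i, j] and R[j, i])
    if C not in classes:
        classes.append(C)
for C in classes:
    closed = all(P[i, j] == 0 for i in C for j in range(n) if j not in C)
    print(sorted(C), "recurrent (closed)" if closed else "transient (not closed)")
```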

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Diagram: states 0, 1, 2; p00 = 1; p10 = 1/3, p11 = 1/2, p12 = 1/6; p22 = 1.]

Clearly, state 0 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many, in fact infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by

π(0) = α, π(1) = 0, π(2) = 1 − α,

is a stationary distribution. (You should think why this is OBVIOUS!) Notice that this lack of communication is, indeed, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution. Fix a ground state a. Since the chain is irreducible and positive recurrent, we have Pi(Tj < ∞) = 1 for all states i and j. Hence

P(Ta < ∞) = Σ_{x∈S} π(x) Px(Ta < ∞) = Σ_{x∈S} π(x) = 1. (18)

By (13), following Theorem 14, we then have, with probability one,

(1/t) Σ_{n=1}^t 1(Xn = x) → Ea Σ_{k=0}^{Ta−1} 1(Xk = x) / Ea Ta = π[a](x), as t → ∞, for all x ∈ S.

Consider the Markov chain (Xn, n ≥ 0) and suppose that P(X0 = x) = π(x), x ∈ S, i.e. we start the Markov chain from the stationary distribution π. Then, for all n, we have P(Xn = x) = π(x), for all x ∈ S. By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides:

E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] → π[a](x), as t → ∞.

But

E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] = (1/t) Σ_{n=1}^t P(Xn = x) = π(x).

Therefore π(x) = π[a](x), for all x ∈ S, i.e. π = π[a]. In other words, an arbitrary stationary distribution π must be equal to the specific one π[a]. Hence there is a unique stationary distribution. In particular, π(a) = π[a](a), for all a ∈ S. This is, in fact, the useful formula recorded next.

Corollary 11. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,

π(a) = 1/Ea Ta.

Corollary 12. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π[a] = π[b]. Thus, we obtain the cycle formula:

Ea Σ_{k=0}^{Ta−1} 1(Xk = x) / Ea Ta = Eb Σ_{k=0}^{Tb−1} 1(Xk = x) / Eb Tb, x ∈ S.

Proof. Both π[a] and π[b] are stationary distributions. There is a unique stationary distribution, so they are equal. The cycle formula merely re-expresses this equality.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Let π(·|C) be the distribution defined by restricting π to C:

π(x|C) := π(x)/π(C), x ∈ C,

where π(C) = Σ_{y∈C} π(y). Consider the restriction of the chain to C, i.e. delete all transition probabilities p_xy with x ∈ C, y ∉ C. We can easily see that we thus obtain a Markov chain with state space C and stationary distribution π(·|C). This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By the above corollary, applied to the restricted chain,

π(i|C) = 1/Ei Ti.

But π(i|C) > 0. Therefore Ei Ti < ∞, i.e. i is positive recurrent.

Now consider an arbitrary Markov chain and decompose it into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π^C

which assigns positive probability to each state in C. This distribution is defined by picking a state (any state!) a ∈ C and letting π^C := π[a].

Proposition 3. Any stationary distribution π is necessarily of the form

π = Σ_{C: C positive recurrent communicating class} αC π^C,

where αC ≥ 0, with Σ_C αC = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then, by Corollary 13, x belongs to some positive recurrent class C. For each such C define the conditional distribution

π(x|C) := π(x)/π(C), where π(C) := Σ_{x∈C} π(x).

Then (π(x|C), x ∈ C) is a stationary distribution for the chain restricted to C. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with αC = π(C).

To summarise: to each positive recurrent communicating class C there corresponds a stationary distribution π^C, and an arbitrary stationary distribution must be a linear combination of these π^C.

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(Xn = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesaro averages (1/n) Σ_{k=1}^n P(Xk = x) converge to π(x), as n → ∞. (A sequence of real numbers an may fail to converge while its Cesaro averages (1/n)(a1 + · · · + an) converge; for example, take an = (−1)^n.)

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, we all know, from physical experience, that if we consider the motion of a pendulum under the action of gravity, the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum.

The simplest differential equation. As another example, consider the following differential equation:

ẋ = −ax + b, x ∈ R, a > 0.

The thing to observe here is that there is an equilibrium point, namely the one found by setting the right-hand side equal to zero:

x* = b/a.

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If we start from any other position then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: several solution curves x(t), started at different times and initial values, all converging to the level x*.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:

Xn+1 = ρXn + ξn,

where ρ ∈ R and ξ0, ξ1, ξ2, . . . are given, mutually independent, random variables; we shall assume that the ξn have the same law. (There are myriads of applications giving physical interpretation to this; it could be a financial example, or an example in physics, or whatever, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example or not.) We here assume that X0, ξ0, ξ1, ξ2, . . . are mutually independent, and define a stochastic sequence X1, X2, . . . by means of the recursion above. But what does stability mean? In general, it cannot mean convergence to a fixed equilibrium point. However, notice what happens when we solve the recursion, for we can do it:

X1 = ρX0 + ξ0
X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0) + ξ1 = ρ²X0 + ρξ0 + ξ1
X3 = ρX2 + ξ2 = ρ³X0 + ρ²ξ0 + ρξ1 + ξ2
······
Xn = ρⁿX0 + ρⁿ⁻¹ξ0 + ρⁿ⁻²ξ1 + · · · + ρξn−2 + ξn−1.
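Before analysing this solution, a short simulation (not from the notes; ρ and the law of the ξn are illustrative choices) shows what stability will mean here: for |ρ| < 1 the sample path Xn keeps fluctuating, but the mean and variance of its law settle down to the values of a limiting distribution.

```python
import random

def x_n(rho, n, x0=10.0):
    # One realisation of the recursion X_{k+1} = rho X_k + xi_k.
    x = x0
    for _ in range(n):
        x = rho * x + random.uniform(-1, 1)   # xi_k i.i.d. uniform on (-1, 1)
    return x

rho = 0.8
for n in (5, 50, 500):
    s = [x_n(rho, n) for _ in range(20_000)]
    m = sum(s) / len(s)
    v = sum((z - m) ** 2 for z in s) / len(s)
    print(n, round(m, 3), round(v, 3))
# The mean tends to 0 and the variance to (1/3)/(1 - rho**2) ~ 0.93:
# the law of X_n converges although X_n itself never settles down.
```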

It follows that |ρ| < 1 is the condition for stability. Indeed, since |ρ| < 1, the first term ρⁿX0 goes to 0 as n → ∞. The second term,

X̃n := ρⁿ⁻¹ξ0 + ρⁿ⁻²ξ1 + · · · + ρξn−2 + ξn−1,

which is a linear combination of ξ0, . . . , ξn−1, does not converge. But notice that, if we let

X̂n := ρⁿ⁻¹ξn−1 + ρⁿ⁻²ξn−2 + · · · + ρξ1 + ξ0,

then, since the ξn have the same law, we have

P(X̃n ≤ x) = P(X̂n ≤ x), for all n and x.

The nice thing is that, under nice conditions (e.g. assume that the ξn are bounded), X̂n does converge to

X̂∞ = Σ_{k=0}^∞ ρᵏ ξk.

In other words, while the limit of Xn does not exist, the limit of its distribution does:

lim_{n→∞} P(Xn ≤ x) = lim_{n→∞} P(X̃n ≤ x) = lim_{n→∞} P(X̂n ≤ x) = P(X̂∞ ≤ x).

This is precisely the reason why we cannot, for time-homogeneous Markov chains, define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,

lim_{n→∞} P(Xn = x) = π(x).

In particular, the n-step transition matrix P^n converges, as n → ∞, to a matrix with rows all equal to π:

lim_{n→∞} p_ij^(n) = π(j), for all i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law. What this means is that we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define

P(X = x, Y = y) = P(X = x) P(Y = y).

But suppose that we have some further requirement, for instance, to try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find a joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

f(x) = Σ_y h(x, y), g(y) = Σ_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (Xn, n ≥ 0) and the law of a stochastic process Y = (Yn, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

Suppose that we have two stochastic processes as above and suppose that they have been coupled. We say that a random time T is a meeting time of the two processes if

Xn = Yn for all n ≥ T.

This T is a random variable that takes values in the set of times, or it may take the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,

|P(Xn = x) − P(Yn = x)| ≤ P(T > n). (19)

If, in particular, P(T < ∞) = 1, i.e. T is finite with probability one, then

|P(Xn = x) − P(Yn = x)| → 0, uniformly in x ∈ S, as n → ∞.

Proof. We have

P(Xn = x) = P(Xn = x, n < T) + P(Xn = x, n ≥ T)
= P(Xn = x, n < T) + P(Yn = x, n ≥ T)
≤ P(n < T) + P(Yn = x),

and hence

P(Xn = x) − P(Yn = x) ≤ P(T > n).

Repeating the argument with the roles of X and Y interchanged, we obtain

P(Yn = x) − P(Xn = x) ≤ P(T > n).

Combining the two inequalities we obtain the first assertion:

|P(Xn = x) − P(Yn = x)| ≤ P(T > n), n ≥ 0, x ∈ S.

Since this holds for all x ∈ S, and since the right hand side does not depend on x, we can write it as

sup_{x∈S} |P(Xn = x) − P(Yn = x)| ≤ P(T > n).
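Here is an empirical illustration of Proposition 4 (not from the notes; the chain and the initial states are assumptions). Two copies are run independently until they first meet and are glued together from then on, so that the first meeting time is a genuine meeting time; the estimated gap |P(Xn = x) − P(Yn = x)| is then checked against the estimated P(T > n).

```python
import random

P = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # illustrative aperiodic chain

def step(x):
    return random.choices([0, 1], weights=P[x])[0]

n, trials, gap, not_met = 5, 50_000, 0, 0
for _ in range(trials):
    x, y, met = 0, 1, False          # X0 = 0, Y0 = 1
    for _ in range(n):
        x = step(x)
        y = x if met else step(y)    # after meeting, the copies move together
        met = met or (x == y)
    gap += (x == 0) - (y == 0)
    not_met += not met
print(abs(gap) / trials, "<=", not_met / trials)   # inequality (19) at time n
```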

Assume now that P(T < ∞) = 1. Then

lim_{n→∞} P(T > n) = P(T = ∞) = 0,

and so

sup_{x∈S} |P(Xn = x) − P(Yn = x)| → 0, as n → ∞.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let pij denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by π, and, by positive recurrence, π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (Xn, n ≥ 0), has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Yn, n ≥ 0), has the law of a Markov chain with the same transition probabilities pij and Y0 distributed according to the stationary distribution π. The two processes differ only in how the initial states are chosen; both are positive recurrent aperiodic Markov chains with the same transition probabilities. It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(Xn = x, Xm = y).

We now couple them by doing the straightforward thing: we assume that X = (Xn, n ≥ 0) is independent of Y = (Yn, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

T := inf{n ≥ 0 : Xn = Yn}.

Consider the process

Wn := (Xn, Yn).

Then (Wn, n ≥ 0) is a Markov chain with state space S × S. Its initial state W0 = (X0, Y0) has distribution

P(W0 = (x, y)) = µ(x) π(y).

Its (1-step) transition probabilities are

q_{(x,y),(x′,y′)} := P(Wn+1 = (x′, y′) | Wn = (x, y)) = P(Xn+1 = x′ | Xn = x) P(Yn+1 = y′ | Yn = y) = p_{x,x′} p_{y,y′},

and its n-step transition probabilities are

q_{(x,y),(x′,y′)}^(n) = p_{x,x′}^(n) p_{y,y′}^(n).

From Theorem 3 and the aperiodicity assumption, we have that p_{x,x′}^(n) > 0 and p_{y,y′}^(n) > 0 for all large n, implying that q_{(x,y),(x′,y′)}^(n) > 0 for all large n. Therefore (Wn, n ≥ 0) is an irreducible chain. Notice, next, that

σ(x, y) := π(x) π(y), (x, y) ∈ S × S,

is a stationary distribution for (Wn, n ≥ 0). (Indeed, had we started with both X0 and Y0 independent and distributed according to π, then, for all n, Xn and Yn would be independent and distributed according to π.)

Moreover, σ(x, y) = π(x)π(y) > 0 for all (x, y) ∈ S × S. Hence (Wn, n ≥ 0) is positive recurrent. In particular,

P(T < ∞) = 1.

Define now a third process by:

Zn := Xn 1(n < T) + Yn 1(n ≥ T),

i.e. Zn equals Xn before the meeting time of X and Y, and Zn = Yn thereafter.

[Figure: the path of X (arbitrary initial distribution µ) and the path of Y (stationary distribution); Z follows X up to the meeting time T and follows Y afterwards.]

Clearly, (Zn, n ≥ 0) is a Markov chain with transition probabilities pij, and Z0 = X0, which has an arbitrary distribution µ. Hence (Zn, n ≥ 0) is identical in law to (Xn, n ≥ 0). In particular,

P(Xn = x) = P(Zn = x), for all n and x.

Obviously, T is a meeting time between Z and Y. By the coupling inequality,

|P(Zn = x) − P(Yn = x)| ≤ P(T > n),

uniformly in x, and the right-hand side converges to zero, precisely because P(T < ∞) = 1. Since P(Zn = x) = P(Xn = x), and since P(Yn = x) = π(x), we have

|P(Xn = x) − π(x)| ≤ P(T > n),

which converges to zero, uniformly in x. This proves the theorem.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and the pij are

p01 = p10 = 1.

Then the product chain has state space S × S = {(0, 0), (0, 1), (1, 0), (1, 1)} and the transition probabilities are

q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.

This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, setting p00 = 0.01, p01 = 0.99, p10 = 1,

then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. Suppose X0 = 0. Then Xn = 0 for even n and Xn = 1 for odd n. Therefore p00^(n) = 1 for even n and = 0 for odd n, so limn→∞ p00^(n) does not exist. However, if we modify the chain by setting p00 = 0.01, p01 = 0.99, p10 = 1, then

lim_{n→∞} p00^(n) = lim_{n→∞} p10^(n) = π(0), lim_{n→∞} p11^(n) = lim_{n→∞} p01^(n) = π(1),

where π(0), π(1) satisfy

0.99 π(0) = π(1), π(0) + π(1) = 1,

i.e. π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,

lim_{n→∞} ( p00^(n) p01^(n) ; p10^(n) p11^(n) ) = ( 0.503 0.497 ; 0.503 0.497 ).

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"⁶. It is, however, customary to have a consistent terminology, and this is expressed by our choice.

⁶ We put "wrong" in quotes. Terminology cannot be, per se, wrong. The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981).

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain, and suppose that the period of the chain is d. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into d classes C0, C1, . . . , Cd−1, such that the chain moves cyclically between them: from C0 to C1, from C1 to C2, and so on, and from Cd−1 back to C0. By the results of Section 14, we know we have a unique stationary distribution π and that π(x) > 0 for all states x. It is not difficult to see that

the chain (Xnd, n ≥ 0) consists of d closed communicating classes, namely C0, . . . , Cd−1, and that all states have period 1 for this chain. Hence, if we restrict the chain (Xnd, n ≥ 0) to any of the sets Cr, we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply.

We shall show that:

Lemma 9. Let

αr := Σ_{i∈Cr} π(i), r = 0, 1, . . . , d − 1,

where π is the stationary distribution. Then αr = 1/d for all r.

Proof. We have that π satisfies the balance equations:

π(i) = Σ_{j∈S} π(j) p_ji, i ∈ S.

Suppose i ∈ Cr. Then p_ji = 0 unless j ∈ Cr−1. So

π(i) = Σ_{j∈Cr−1} π(j) p_ji, i ∈ Cr.

Summing up over i ∈ Cr we obtain

αr = Σ_{i∈Cr} π(i) = Σ_{j∈Cr−1} π(j) Σ_{i∈Cr} p_ji = Σ_{j∈Cr−1} π(j) = αr−1.

Since the αr add up to 1, each one must be equal to 1/d.

It follows that the (unique) stationary distribution of (Xnd, n ≥ 0) restricted to Cr assigns to a state i ∈ Cr the value dπ(i). This means that:

Theorem 18. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d, and suppose that X0 ∈ Cr. Then, for i, j ∈ Cr,

p_ij^(nd) = Pi(Xnd = j) → dπ(j), as n → ∞.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (modulo d). Then, starting from i, the original chain enters Cℓ in r = ℓ − k steps (where r is taken to be a number between 0 and d − 1, after reduction by d) and, thereafter, it remains in Cℓ if sampled every d steps. Hence we have, for all i ∈ Ck, j ∈ Cℓ,

p_ij^(r+nd) → dπ(j), as n → ∞.
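Theorem 18 can be checked numerically. The sketch below (not from the notes; the 4-state matrix is an illustrative chain of period 2 with cyclic classes C0 = {0, 1} and C1 = {2, 3}) computes π as the left eigenvector of P for the eigenvalue 1 and inspects a high even power of P.

```python
import numpy as np

# Illustrative chain of period d = 2: C0 = {0, 1}, C1 = {2, 3}.
P = np.array([[0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.4, 0.6],
              [0.5, 0.5, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0]])

# Stationary distribution: left eigenvector of P for the eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

Pn = np.linalg.matrix_power(P, 200)   # an even power, i.e. nd with d = 2
print(Pn[0, 1], 2 * pi[1])            # p_{01}^{(nd)} -> d*pi(1): same class
print(Pn[0, 2])                       # exactly 0: other class at even times
```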

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_ij^(n), as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_ij^(n) → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw, in Corollary 7, that, if j is transient, then p_ij^(n) → 0 as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then

p_ij^(n) → 0, as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. For each n = 1, 2, . . ., let p^(n) = (p1^(n), p2^(n), p3^(n), . . .) be a probability distribution on N, i.e. Σ_{i=1}^∞ pi^(n) = 1 and pi^(n) ≥ 0 for all i. Then we can find a sequence n1 < n2 < n3 < · · · of integers such that lim_{k→∞} pi^(nk) exists and equals, say, pi, for each i ∈ N, where pi ≥ 0.

Proof. Since the numbers p1^(n), n = 1, 2, . . ., are contained between 0 and 1, there must be a subsequence n1_1 < n1_2 < n1_3 < · · · such that lim_{k→∞} p1^(n1_k) exists and equals, say, p1. Now consider the numbers p2^(n1_k), k = 1, 2, . . . Since they are contained between 0 and 1, there must be a subsequence (n2_k) of (n1_k) such that lim_{k→∞} p2^(n2_k) exists and equals, say, p2. It is clear how we continue: for each i, we find a subsequence (ni_k) of (ni−1_k) such that lim_{k→∞} pi^(ni_k) exists and equals, say, pi. Schematically,

p1^(n1_1), p1^(n1_2), p1^(n1_3), · · · → p1
p2^(n2_1), p2^(n2_2), p2^(n2_3), · · · → p2
p3^(n3_1), p3^(n3_2), p3^(n3_3), · · · → p3
· · ·

Now pick the numbers in the diagonal, nk := nk_k, k = 1, 2, . . . By construction, the sequence (n2_k) of the second row is a subsequence of (n1_k) of the first row, the sequence (n3_k) of the third row is a subsequence of (n2_k) of the second row, and so on. So the diagonal subsequence (nk_k, k = 1, 2, . . .) is a subsequence of each (ni_k, k = 1, 2, . . .). Hence pi^(nk) → pi, as k → ∞, for all i ∈ N.

The second result is Fatou's lemma, whose proof we omit⁷:

⁷ See Brémaud (1999).

Lemma 11. If (yi, i ∈ N) and (xi^(n), i ∈ N), n = 1, 2, . . ., are sequences of nonnegative numbers then

Σ_{i=1}^∞ yi lim inf_{n→∞} xi^(n) ≤ lim inf_{n→∞} Σ_{i=1}^∞ yi xi^(n).

Proof of Theorem 19. Suppose that, for some i, the sequence p_ij^(n) does not converge to 0. Then we can find a subsequence (nk) such that

lim_{k→∞} p_ij^(nk) = αj > 0.

Consider the probability vector (p_ix^(nk), x ∈ S). By Lemma 10, we can pick a subsequence (n′k) of (nk) such that

lim_{k→∞} p_ix^(n′k) = αx, for all x ∈ S.

Since Σ_{x∈S} p_ix^(n′k) = 1, we have, by Lemma 11,

Σ_{x∈S} αx ≤ lim inf_{k→∞} Σ_{x∈S} p_ix^(n′k) = 1,

and, since αj > 0,

0 < Σ_{x∈S} αx ≤ 1.

The inequality in the last display but one follows from Lemma 11. But P^{n+1} = P^n P, i.e.

p_ix^(n′k+1) = Σ_{y∈S} p_iy^(n′k) p_yx,

and hence, by Lemma 11,

lim_{k→∞} p_ix^(n′k+1) ≥ Σ_{y∈S} lim_{k→∞} p_iy^(n′k) p_yx.

Hence

αx ≥ Σ_{y∈S} αy p_yx, x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that, for some x0 ∈ S, it is a strict inequality:

αx0 > Σ_{y∈S} αy p_yx0.

This would imply that

Σ_{x∈S} αx > Σ_{x∈S} Σ_{y∈S} αy p_yx = Σ_{y∈S} αy Σ_{x∈S} p_yx = Σ_{y∈S} αy,

and this is a contradiction (the number Σ_x αx cannot be strictly larger than itself). Thus,

αx = Σ_{y∈S} αy p_yx, for all x ∈ S.

Since 0 < Σ_{x∈S} αx ≤ 1, we have that

π(x) := αx / Σ_{y∈S} αy

satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0 because αj > 0. By Corollary 13, state j is positive recurrent: contradiction. Therefore p_ij^(n) → 0, for all i.

19.2 The general case

It remains to see what the limit of p_ij^(n) is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j, and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, and define

HC := inf{n ≥ 0 : Xn ∈ C}, ϕC(i) := Pi(HC < ∞).

Then

lim_{n→∞} p_ij^(n) = ϕC(i) π^C(j).

Proof. If i also belongs to C then ϕC(i) = 1 and the result follows from Theorem 17. In general, we have

p_ij^(n) = Pi(Xn = j) = Pi(Xn = j, HC ≤ n) = Σ_{m=1}^n Pi(HC = m) Pi(Xn = j | HC = m).

Let µ be the distribution of X_{HC}, given that X0 = i and that {HC < ∞}. From the Strong Markov Property, we have

Pi(Xn = j | HC = m) = Pµ(Xn−m = j).

But Theorem 17 tells us that the limit exists regardless of the initial distribution:

Pµ(Xn−m = j) → π^C(j), as n → ∞.

Hence

lim_{n→∞} p_ij^(n) = π^C(j) Σ_{m=1}^∞ Pi(HC = m) = π^C(j) Pi(HC < ∞) = ϕC(i) π^C(j).

Remark: We can write the result of Theorem 20 as

lim_{n→∞} p_ij^(n) = fij / Ej Tj,

where Tj = inf{n ≥ 1 : Xn = j} and fij = Pi(Tj < ∞). Indeed, π^C(j) = 1/Ej Tj, and fij = ϕC(i), because, once the chain hits the class C, it hits the recurrent state j with probability one. In this form, the result can be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7, as in §10.2; this is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Diagram: states 1, 2, 3, 4, 5, with p12 = 1; p21 = γ, p22 = 1 − γ; p32 = 1 − α, p34 = α; p43 = β, p45 = 1 − β; p55 = 1.]

The communicating classes are I = {3, 4} (inessential states), C1 = {1, 2} (closed class), C2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are:

π^{C1}(1) = γ/(1 + γ), π^{C1}(2) = 1/(1 + γ), π^{C2}(5) = 1.

Both C1 and C2 are aperiodic classes. Let ϕ(i) be the probability that class C1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, while, from first-step analysis,

ϕ(3) = (1 − α)ϕ(2) + αϕ(4), ϕ(4) = (1 − β)ϕ(5) + βϕ(3),

whence

ϕ(3) = (1 − α)/(1 − αβ), ϕ(4) = β(1 − α)/(1 − αβ).

Hence, as n → ∞,

p31^(n) → ϕ(3)π^{C1}(1), p32^(n) → ϕ(3)π^{C1}(2), p33^(n) → 0, p34^(n) → 0, p35^(n) → (1 − ϕ(3))π^{C2}(5).

Similarly,

p41^(n) → ϕ(4)π^{C1}(1), p42^(n) → ϕ(4)π^{C1}(2), p43^(n) → 0, p44^(n) → 0, p45^(n) → (1 − ϕ(4))π^{C2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. The phase (or state) space of a particle is not just a physical space: it includes, for example, information about its velocity. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville.

We will avoid giving a precise definition of the term 'ergodic' and only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem, and so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. Recall also the quantity ν[a](x), defined as the average number of visits to state x between two successive visits to state a (see (16)).

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P(Ta < ∞) = 1. Let f be a reward function such that

Σ_{x∈S} |f(x)| ν[a](x) < ∞. (20)

Then

P( lim_{t→∞} ( Σ_{n=0}^t f(Xn) ) / ( Σ_{n=0}^t 1(Xn = a) ) = f̄ ) = 1, (21)

where

f̄ := Σ_{x∈S} f(x) ν[a](x).

Proof. As in Theorem 14, we break the total reward as follows:

Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + Gfirst + Glast,

where

Gr = Σ_{Ta^(r) ≤ t < Ta^(r+1)} f(Xt),

while Gfirst, Glast are the total rewards over the first and last incomplete cycles, respectively. As in the proof of Theorem 14, we have

lim_{t→∞} (1/t) Glast = lim_{t→∞} (1/t) Gfirst = 0,

with probability one. The Law of Large Numbers tells us that

lim_{n→∞} (1/n) Σ_{r=1}^n Gr = EG1,

with probability 1, as long as E|G1| < ∞. But this is precisely the condition (20). We only need to check that EG1 = f̄, and this is left as an exercise. Note that the denominator Σ_{n=0}^t 1(Xn = a) equals the number Nt of visits to state a up to time t. Replacing n by the subsequence Nt, we obtain the result.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a.

Remark 2: If the chain is irreducible and positive recurrent then, as we saw in the proof of Theorem 14, we have Nt/t → 1/Ea Ta. So if we divide by t the numerator and the denominator of the fraction in (21), then the denominator equals Nt/t and converges to 1/Ea Ta, while the numerator converges to f̄/Ea Ta, which is the quantity denoted by f̄ in Theorem 14; so the Ergodic Theorem (Theorem 14) holds.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Suppose the chain has finitely many states. Then at least one of the states must be visited infinitely often. So there are always recurrent states. From Proposition 2 we know that the balance equations ν = νP have at least one solution. Now, since the number of states is finite, we have

C := Σ_{i∈S} ν(i) < ∞.

Hence

π(i) := (1/C) ν(i), i ∈ S,

satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0, and, by Corollary 13, this state must be positive recurrent. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15), and, by Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution π. (In particular, a finite irreducible chain is always ergodic, and this justifies the terminology of §18.5.) If, in addition, the chain is aperiodic, then we have P^n → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

22 Time reversibility

Consider a Markov chain (Xn, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that

(X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0), for all n.

In other words, if we watch the process backwards, it will look the same, like playing a film backwards. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, we will not be able to tell that the film is running in reverse. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0), for all n. In particular, the first components of the two vectors must have the same distribution; that is, Xn has the same distribution as X0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:

π(i) pij = π(j) pji, i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X0, X1) has the same distribution as (X1, X0), i.e.

P(X0 = i, X1 = j) = P(X1 = i, X0 = j), for all i, j. By Lemma 12, the chain is stationary, so

P(X0 = i, X1 = j) = π(i) pij, P(X0 = j, X1 = i) = π(j) pji.

Hence π(i)pij = π(j)pji, i.e. the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. The DBE imply immediately that (X0, X1) has the same distribution as (X1, X0). By induction, it is easy to show, using the DBE, that (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0), for all n. So the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i1, i2, . . . , im ∈ S and all m ≥ 3,

p_{i1 i2} p_{i2 i3} · · · p_{im−1 im} = p_{i1 im} p_{im im−1} · · · p_{i2 i1}. (22)

Proof. Suppose first that the chain is reversible. Pick 3 states i, j, k, and write, using the DBE,

π(i) pij pjk pki = [π(j) pji] pjk pki = [π(j) pjk] pji pki = [π(k) pkj] pji pki = [π(k) pki] pji pkj = [π(i) pik] pji pkj = π(i) pik pkj pji.

Since the chain is irreducible, we have π(i) > 0 for all i. Hence, cancelling the π(i) from the above display, we obtain the KLC for m = 3:

pij pjk pki = pik pkj pji.

The same logic works for any m and can be worked out by induction. So the KLC holds.

Conversely, suppose that the KLC (22) holds. Fix two states i1, i2. Summing up (22) over all choices of the states i3, i4, . . . , im−1, we obtain

p_{i1 i2} p_{i2 i1}^{(m−3)} = p_{i1 i2}^{(m−3)} p_{i2 i1}.

If the chain is aperiodic, then, letting m → ∞,

p_{i2 i1}^{(m−3)} → π(i1) and p_{i1 i2}^{(m−3)} → π(i2),

implying that

π(i1) p_{i1 i2} = π(i2) p_{i2 i1},

which are the DBE. Hence the chain is reversible.

Notes:
1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j, i.e. that we have separation of variables:

pij/pji = f(i) g(j).

(Convention: we form this ratio only when pij > 0, in which case, necessarily, pji > 0 as well.) Necessarily, f(i)g(i) must be 1 (why?), so we have

    pij/pji = g(j)/g(i),

i.e. g(i)pij = g(j)pji. Summing over j we find

    g(i) = Σ_{j∈S} g(j)pji.

So g satisfies the balance equations, and hence π can be found very easily: it is proportional to g. This gives the following method for showing that a finite irreducible chain is reversible:

• Form the ratio pij/pji and show that the variables separate.
• If the chain has infinitely many states, then we need, in addition, to guarantee that Σ_i g(i) < ∞.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. Form the graph by deleting all self-loops, and by replacing, for each pair of distinct states i, j for which pij > 0, the two oriented edges from i to j and from j to i by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible.

The reason is as follows. Pick two distinct states i, j for which pij > 0. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop.) Let A, Ac be the sets in the two connected components. There is obviously only one arrow from A to Ac, namely the arrow from i to j, and the arrow from j to i is the only one from Ac to A. The balance equations are written in the form F(A, Ac) = F(Ac, A) (see §4.1), which, in this case, gives precisely

    π(i)pij = π(j)pji,

the detailed balance equations. Hence the chain is reversible.

For example, consider the chain: [figure], and replace it by: [figure]. This is a tree: there isn't any loop. Hence the original chain is reversible.

Example 2: Random walk on a graph. Consider an undirected graph G with vertices S, a finite set, and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. States i and j are called neighbours if they are connected by an edge. For each state i define its degree

    δ(i) = number of neighbours of i,

and define the following transition probabilities:

    pij = 1/δ(i), if j is a neighbour of i;   pij = 0, otherwise.

The Markov chain with these transition probabilities is called a random walk on a graph (RWG).

[Figure. A graph on the five states 0, 1, 2, 3, 4. Here we have p12 = p10 = p14 = 1/3, p02 = 1/4, etc.]

Suppose the graph is connected. The stationary distribution of a RWG is very easy to find. We claim that

    π(i) = Cδ(i),   i ∈ S,

where C is a constant. To see this, let us verify that the detailed balance equations π(i)pij = π(j)pji hold. There are two cases. If i, j are neighbours, then π(i)pij = Cδ(i) · (1/δ(i)) = C and π(j)pji = Cδ(j) · (1/δ(j)) = C, so both members are equal to C. If i, j are not neighbours, then both sides are 0. In all cases, the DBE hold. The constant C is found by normalisation: Σ_{i∈S} π(i) = 1. Since Σ_{i∈S} δ(i) = 2|E|, where |E| is the total number of edges, we have C = 1/2|E|. Hence we conclude that:

1. A RWG is a reversible chain.
2. The stationary distribution assigns to a state a probability which is proportional to its degree.
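As a quick numerical illustration (a sketch of ours, not part of the original development; the edge set below is a hypothetical small example, not the one in the figure), the following Python snippet builds a RWG, checks the detailed balance equations exactly, and confirms π(i) = δ(i)/2|E|:

```python
from fractions import Fraction

# Hypothetical edge set of a small connected graph on states 0..4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4), (2, 3)]

nbrs = {}
for i, j in edges:
    nbrs.setdefault(i, set()).add(j)
    nbrs.setdefault(j, set()).add(i)

deg = {i: len(nbrs[i]) for i in nbrs}
two_E = 2 * len(edges)

# Claimed stationary distribution: pi(i) = degree(i) / 2|E|.
pi = {i: Fraction(deg[i], two_E) for i in nbrs}

def p(i, j):
    # Transition probabilities of the random walk on the graph.
    return Fraction(1, deg[i]) if j in nbrs[i] else Fraction(0)

# Detailed balance: pi(i) p_ij == pi(j) p_ji for all pairs.
assert all(pi[i] * p(i, j) == pi[j] * p(j, i) for i in nbrs for j in nbrs)
# Balance equations and normalisation.
assert all(pi[j] == sum(pi[i] * p(i, j) for i in nbrs) for j in nbrs)
assert sum(pi.values()) == 1
print({i: str(pi[i]) for i in sorted(nbrs)})
```

Using exact rationals (Fraction) makes the detailed balance check an identity rather than a floating-point approximation.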

23 New Markov chains from old ones*

We now deal with the question: if we have a Markov chain (Xn), how can we create new ones?

23.1 Subsequences

The most obvious way to create a new Markov chain is by picking a subsequence, i.e. a sequence of integers (nk)k≥0 such that 0 ≤ n0 < n1 < n2 < ···, and by letting

    Yk := X_{nk},   k = 0, 1, 2, ....

The sequence (Yk, k = 0, 1, 2, ...) has the Markov property but it is not, in general, time-homogeneous. If, however, the subsequence forms an arithmetic progression, i.e. if nk = n0 + km, then this process is also a Markov chain and it does have time-homogeneous transition probabilities:

    P(Yk+1 = j | Yk = i) = p^{(m)}_{ij},

where p^{(m)}_{ij} are the m-step transition probabilities of the original chain. If, in addition, π is a stationary distribution for the original chain, then it is a stationary distribution for the new chain as well.

23.2 Watching a Markov chain when it visits a set

Suppose that the original chain is ergodic (i.e. irreducible and positive recurrent). Let A be a set of states and let T_A^{(r)} be the r-th visit to A. Define

    Yr := X_{T_A^{(r)}},   r = 1, 2, ....

Due to the Strong Markov Property, (Yr) is a Markov chain. The state space of (Yr) is A. If π is the stationary distribution of the original chain, then π/π(A), restricted to A, is the stationary distribution of (Yr).

23.3 Subordination

Let (Xn, n ≥ 0) be Markov with transition probabilities pij. Let (St, t ≥ 0) be independent from (Xn) with S0 = 0 and

    St+1 = St + ξt,   t ≥ 0,

where the ξt are i.i.d. random variables with values in N and distribution

    λ(n) = P(ξ0 = n),   n ∈ N.

Then

    Yt := X_{St},   t ≥ 0,

is a Markov chain with

    P(Yt+1 = j | Yt = i) = \sum_{n=1}^{\infty} λ(n) p^{(n)}_{ij}.

23.4 Deterministic function of a Markov chain

If (Xn) is a Markov chain with state space S, and if f : S → S′ is some function from S to some countable set S′, then the process

    Yn = f(Xn),   n ≥ 0,

is NOT, in general, a Markov chain. If, however, f is a one-to-one function, then (Yn) is a Markov chain: indeed, knowledge of f(Xn) implies knowledge of Xn, and so, conditional on Yn, the past is independent of the future. (Notation: We will denote the elements of S by i, j, ..., and the elements of S′ by α, β, ....)

More generally, we have:

Lemma 13. Suppose that there is a function Q : S′ × S′ → R such that

    P(Yn+1 = β | Xn = i) = Q(f(i), β),

i.e. that P(Yn+1 = β | Xn = i) depends on i only through f(i). Then (Yn) has the Markov property.

Proof. Fix α0, α1, ..., αn−1 ∈ S′ and let Yn−1 denote the event

    Yn−1 := {Y0 = α0, ..., Yn−1 = αn−1}.

We need to show that

    P(Yn+1 = β | Yn = α, Yn−1) = P(Yn+1 = β | Yn = α),

for all choices of the variables involved. We have

    P(Yn+1 = β | Yn = α, Yn−1)
    = \sum_{i∈S} P(Yn+1 = β | Xn = i, Yn = α, Yn−1) P(Xn = i | Yn = α, Yn−1)
    = \sum_{i∈S} Q(f(i), β) P(Xn = i | Yn = α, Yn−1)
    = \sum_{i∈S} Q(f(i), β) \frac{P(Yn = α | Xn = i, Yn−1) P(Xn = i | Yn−1)}{P(Yn = α | Yn−1)}
    = \sum_{i∈S} Q(f(i), β) \frac{1(α = f(i)) P(Xn = i | Yn−1)}{P(Yn = α | Yn−1)}
    = Q(α, β) \sum_{i∈S} \frac{1(α = f(i)) P(Xn = i | Yn−1)}{P(Yn = α | Yn−1)}
    = Q(α, β) \frac{P(Yn = α | Yn−1)}{P(Yn = α | Yn−1)} = Q(α, β).

This does not depend on Yn−1, so (Yn) is Markov with transition probabilities Q(α, β).

Example: Consider the symmetric drunkard's random walk, i.e.

    P(Xn+1 = i ± 1 | Xn = i) = 1/2,   i ∈ Z,

and define Yn := |Xn|. To see that (Yn) is, indeed, a Markov chain, observe first that the function |·| is not one-to-one, so the simple criterion above does not apply. But we will show that P(Yn+1 = β | Xn = i), i ∈ Z, β ∈ Z+, depends on i only through |i|. Indeed, conditional on {Xn = i} with i ≠ 0, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, because the probability of going up is the same as the probability of going down, regardless of whether i > 0 or i < 0. On the other hand, if i = 0, then |Xn+1| = 1 for sure. In other words, if we let

    Q(α, β) := 1/2, if β = α ± 1, α > 0;   Q(0, 1) := 1;   Q(α, β) := 0, otherwise   (α, β ∈ Z+),

then we have that

    P(Yn+1 = β | Xn = i) = Q(|i|, β),

for all i ∈ Z and all β ∈ Z+. Thus, by Lemma 13, (Yn) is a Markov chain with transition probabilities Q(α, β).
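A small simulation (our own sketch, added for illustration) makes this concrete: we estimate the empirical transition frequencies of Yn = |Xn| from one long trajectory and compare them with Q(α, β):

```python
import random

random.seed(0)

def q(alpha, beta):
    # The kernel Q claimed in the text for Y_n = |X_n|.
    if alpha == 0:
        return 1.0 if beta == 1 else 0.0
    return 0.5 if beta in (alpha - 1, alpha + 1) else 0.0

counts, visits = {}, {}
x = 0
for _ in range(200_000):
    a = abs(x)
    x += random.choice((-1, 1))      # one step of the symmetric drunkard's walk
    b = abs(x)
    visits[a] = visits.get(a, 0) + 1
    counts[(a, b)] = counts.get((a, b), 0) + 1

for alpha in range(4):
    for beta in range(alpha + 2):
        if visits.get(alpha, 0) > 1000:
            est = counts.get((alpha, beta), 0) / visits[alpha]
            print(alpha, beta, round(est, 3), "vs Q =", q(alpha, beta))
```

The empirical frequencies should match Q up to Monte Carlo error, as Lemma 13 predicts.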

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: At each point of time n we have a number Xn of individuals, each of which gives birth to a random number of offspring, independently from one another, and all following the same distribution. The parent dies immediately upon giving birth. We let pk be the probability that an individual will bear k offspring, k = 0, 1, 2, .... Let ξ_k^{(n)} be the number of offspring of the k-th individual in the n-th generation. Then the size of the (n + 1)-st generation is:

    Xn+1 = ξ_1^{(n)} + ξ_2^{(n)} + ··· + ξ_{Xn}^{(n)},

i.e. we find the population at time n + 1 by adding the number of offspring of each individual. If we let ξ^{(n)} = (ξ_1^{(n)}, ξ_2^{(n)}, ...), then we see that the above equation is of the form Xn+1 = f(Xn, ξ^{(n)}), and, since ξ^{(0)}, ξ^{(1)}, ... are i.i.d. (and independent of the initial population size X0, a further assumption), we have that (Xn, n ≥ 0) has the Markov property. It is a Markov chain with time-homogeneous transitions. Computing the transition probabilities pij = P(Xn+1 = j | Xn = i) is, in general, a hard problem, and we will not attempt to compute the pij nor draw the transition diagram, because they are both futile. But here is an interesting special case.

Example: Suppose that p1 = p, p0 = 1 − p: each individual gives birth to at most one child, i.e. to one child with probability p or to no children with probability 1 − p. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals, which has a binomial distribution:

    pij = \binom{i}{j} p^j (1 − p)^{i−j},   0 ≤ j ≤ i.

Back to the general case. Note that 0 is always an absorbing state, and one question of interest is whether the process will become extinct or not. If p1 = P(ξ = 1) = 1 then Xn = X0 for all n, and this is a trivial case. So assume that p1 ≠ 1. Then P(ξ = 0) + P(ξ ≥ 2) > 0. We distinguish two cases:

Case I: If p0 = P(ξ = 0) = 0 then the population can only grow, and, with probability 1, Xn → ∞. Hence all states i ≥ 1 are transient. The state 0 is irrelevant because it cannot be reached from anywhere.

Case II: If p0 = P(ξ = 0) > 0 then state 0 can be reached from any other state: indeed, if the current population is i, there is probability p0^i that everybody gives birth to zero offspring. Hence all states i ≥ 1 are inessential, and therefore transient. But 0 is an absorbing state. Therefore, with probability 1, the population either eventually reaches 0, or, if 0 is never reached, Xn → ∞. Since the pij are hard to compute, we will find some different method to proceed.
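Before turning to generating functions, note that, even though the pij are hard to write down, the chain itself is trivial to simulate: just add one offspring count per individual. A minimal sketch of ours, with a hypothetical offspring law pk:

```python
import random

random.seed(1)
pk = [0.25, 0.25, 0.5]   # hypothetical offspring law: P(xi = k) = pk[k]

def next_generation(x):
    # X_{n+1} = xi_1 + ... + xi_{X_n}, the xi's i.i.d. with law pk.
    return sum(random.choices(range(len(pk)), weights=pk)[0] for _ in range(x))

x, path = 1, [1]
for _ in range(20):
    x = next_generation(x)
    path.append(x)
    if x == 0:            # 0 is absorbing: extinction
        break
print(path)
```

For this hypothetical law the mean offspring number is µ = 0.25 + 2 · 0.5 = 1.25 > 1, so, as we are about to see, the population survives forever with positive probability.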

Let

    µ = Eξ = \sum_{k=1}^{\infty} k p_k

be the mean number of offspring of an individual, and let

    ϕ(z) := E z^ξ = \sum_{k=0}^{\infty} z^k p_k

be the probability generating function of ξ.

Consider a Galton-Watson process and let D be the event of 'death' or extinction:

    D := {Xn = 0 for some n ≥ 0}.

We wish to compute the extinction probability ε(i) := Pi(D) when we start with X0 = i initial ancestors. Obviously, ε(0) = 1. For i ≥ 1, we can argue that ε(i) = ε^i, where ε = P1(D). Indeed, if we start with X0 = i initial ancestors, each one behaves independently of the others. For each k = 1, ..., i, let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. We then have that D = D1 ∩ ··· ∩ Di, and P(Dk) = ε, so P(D | X0 = i) = ε^i. Therefore the problem reduces to the study of the branching process starting with X0 = 1 initial ancestor.

We let

    ϕn(z) := E z^{Xn} = \sum_{k=0}^{\infty} z^k P(Xn = k)

be the probability generating function of Xn. This function is of interest because it gives information about the extinction probability ε. Indeed, ϕn(0) = P(Xn = 0), and, since Xn = 0 implies Xm = 0 for all m ≥ n, we have

    lim_{n→∞} ϕn(0) = lim_{n→∞} P(Xm = 0 for all m ≥ n)
    = P(there exists n such that Xm = 0 for all m ≥ n) = P(D) = ε.

The result is this:

Proposition 5. If µ ≤ 1 then ε = 1. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z in [0, 1).

Proof.

Note also that ϕ0(z) = z and ϕ1(z) = ϕ(z). We then have

    ϕn+1(z) = E z^{Xn+1} = E[z^{ξ_1^{(n)}} z^{ξ_2^{(n)}} ··· z^{ξ_{Xn}^{(n)}}]
    = \sum_{i=0}^{\infty} E[z^{ξ_1^{(n)}} ··· z^{ξ_i^{(n)}} 1(Xn = i)]
    = \sum_{i=0}^{\infty} E[z^{ξ_1^{(n)}}] ··· E[z^{ξ_i^{(n)}}] E[1(Xn = i)]
    = \sum_{i=0}^{\infty} ϕ(z)^i P(Xn = i) = E[ϕ(z)^{Xn}] = ϕn(ϕ(z)).

Hence ϕ2(z) = ϕ1(ϕ(z)) = ϕ(ϕ(z)), ϕ3(z) = ϕ2(ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2(z)), and so on. By induction,

    ϕn+1(z) = ϕ(ϕn(z)).

In particular, ϕn+1(0) = ϕ(ϕn(0)). But ϕn(0) → ε as n → ∞, and, since ϕ is a continuous function, ϕ(ϕn(0)) → ϕ(ε). Therefore,

    ε = ϕ(ε).

Notice now that

    µ = dϕ(z)/dz at z = 1,

and recall that ϕ is a convex function with ϕ(1) = 1.

Let us show that if the slope µ at z = 1 is ≤ 1 then the graph of ϕ(z) lies above the diagonal (the function z), and so the equation z = ϕ(z) has no solutions other than z = 1. This means that ε = 1 (see left figure below).

On the other hand, if µ > 1, then the graph of ϕ(z) does intersect the diagonal at some point ε* < 1. Since both ε and ε* satisfy z = ϕ(z), and since the latter has only the two solutions z = ε* and z = 1, all we need to show is that ε < 1. But 0 ≤ ε*, hence ϕ(0) ≤ ϕ(ε*) = ε*, and, by induction, ϕn(0) ≤ ε* for all n. But ϕn(0) → ε. So ε ≤ ε* < 1, and therefore, in fact, ε = ε*. (See right figure below.)

[Figure. The graph of ϕ(z) against the diagonal z. Left panel (case µ ≤ 1): the only intersection in [0, 1] is at z = 1, so ε = 1. Right panel (case µ > 1): the graphs intersect at ε < 1.]
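The proof also gives a practical recipe (our own illustration): since ε = lim ϕn(0), iterating z ← ϕ(z) from z = 0 converges to the extinction probability. A sketch, using the same hypothetical offspring law as above:

```python
pk = [0.25, 0.25, 0.5]                    # hypothetical offspring distribution
phi = lambda z: sum(p * z**k for k, p in enumerate(pk))
mu = sum(k * p for k, p in enumerate(pk))

z = 0.0
for _ in range(200):                       # phi_{n+1}(0) = phi(phi_n(0)) -> epsilon
    z = phi(z)
print("mu =", mu, " extinction probability ~", z)
```

For this law, ϕ(z) = 0.25 + 0.25z + 0.5z², whose fixed points are z = 1 and z = 1/2; since µ = 1.25 > 1, the iteration converges to ε = 1/2, as Proposition 5 predicts.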

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel), using the same frequency. The problem of communication over a common channel is that, if more than one host attempts to transmit a packet at the same time, there is a collision and the transmission is unsuccessful.

One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 1, 2, 3, 4, 5, 6, ... are allocated to hosts 1, 2, 3, 1, 2, 3, 1, .... But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit. Since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. (Nowadays, Ethernet has replaced Aloha.)

Abandoning this idea, we let hosts transmit whenever they like. Thus, for a packet to go through, only one of the hosts should be transmitting at a given time. If collision happens, then the transmission is not successful and the transmissions must be attempted anew.

So, in a simplified model, assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot, assume there are a new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Let s be the number of selected packets. If s = 1 there is a successful transmission and, on the next time slot, there are x + a − 1 packets. If s ≥ 2, there is a collision and, on the next time slot, there are x + a packets.

So if we let Xn be the number of backlogged packets at the beginning of the n-th time slot, An the number of new packets arriving on the same time slot, and Sn the number of selected packets, then

    Xn+1 = max(Xn + An − 1, 0), if Sn = 1;   Xn+1 = Xn + An, otherwise.

We assume that the An are i.i.d. random variables with some common distribution:

    P(An = k) = λk,   k ≥ 0,

where λk ≥ 0 and Σ_{k≥0} λk = 1. We also let µ := Σ_{k≥1} k λk be the expectation of An. The process (Xn, n ≥ 0) is a Markov chain.

Let us find its transition probabilities. Conditional on Xn and An (and all past states and arrivals, Xn−1, An−1, ...), selections are made independently amongst the Xn + An packets, so Sn has a Binomial distribution with parameters Xn + An and p:

    P(Sn = k | Xn = x, An = a, ...) = \binom{x+a}{k} p^k q^{x+a−k},   0 ≤ k ≤ x + a,

where q = 1 − p. (The reason we take the maximum with 0 in the recursion for Xn+1 is that, if Xn + An = 0, we should not subtract 1.) To compute px,x−1, we think as follows: to have a reduction of x by 1, we must have no new packets arriving, and exactly one selection:

    px,x−1 = P(An = 0, Sn = 1 | Xn = x) = λ0 x p q^{x−1},

when x > 0. To compute px,x+k for k ≥ 1, we think like this: to have an increase by k, we either need exactly k new packets and no successful transmission (i.e. Sn should be 0 or ≥ 2), or we need k + 1 new packets and one selection (Sn = 1). So:

    px,x+k = λk (1 − (x + k) p q^{x+k−1}) + λ_{k+1} (x + k + 1) p q^{x+k},   k ≥ 1.

The probabilities px,x are computed by subtracting the sum of the above from 1.

The main result is:

Proposition 6. Unless µ = 0 (which is a trivial case: no arrivals!), the Aloha protocol is unstable, i.e. the Markov chain (Xn) is transient.

Idea of proof. We here only restrict ourselves to the computation of the expected change (drift), defined by

    ∆(x) := E[Xn+1 − Xn | Xn = x].

We have

    ∆(x) = (−1) px,x−1 + \sum_{k≥1} k px,x+k
    = −λ0 x p q^{x−1} + \sum_{k≥1} k λk − \sum_{k≥1} k λk (x + k) p q^{x+k−1} + \sum_{k≥1} k λ_{k+1} (x + k + 1) p q^{x+k}
    = µ − q^x G(x),

where G(x) is a function of the form α + βx. Since p > 0, we have q < 1, so q^x tends to 0 much faster than G(x) tends to ∞, i.e. q^x G(x) tends to 0 as x → ∞. So we can make q^x G(x) as small as we like by choosing x sufficiently large, for example ≤ µ/2. Therefore ∆(x) ≥ µ/2 for all large x, i.e. the drift is positive for all large x, and this suggests transience. Proving this is beyond the scope of the lectures, but we mention that it is based on the so-called drift criterion, which says that if the expected change Xn+1 − Xn, conditional on Xn = x, is bounded below by a positive constant for all large x, then the chain is transient.
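One can also watch the instability directly (a simulation sketch of ours, with hypothetical parameters p and (λk)): the backlog keeps growing, since for large x the success probability per slot vanishes while new packets keep arriving at rate µ.

```python
import random

random.seed(2)
p, lam = 0.1, [0.5, 0.3, 0.2]   # hypothetical selection prob. and arrival law

x = 0
for n in range(1, 2001):
    a = random.choices(range(len(lam)), weights=lam)[0]  # arrivals A_n
    s = sum(random.random() < p for _ in range(x + a))   # Binomial(x+a, p) selections
    x = max(x + a - 1, 0) if s == 1 else x + a
    if n % 500 == 0:
        print("slot", n, "backlog", x)   # grows roughly linearly: transience
```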

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The web graph is a graph with a set of vertices V consisting of web pages, and edges representing hyperlinks: if page i has a link to page j, then we put an arrow from i to j and consider it as a directed edge of the graph. For each page i let L(i) be the set of pages it links to. If L(i) = ∅, then i is a 'dangling page'. Define transition probabilities as follows:

    pij = 1/|L(i)|, if j ∈ L(i);   pij = 1/|V|, if L(i) = ∅;   pij = 0, otherwise.

So if Xn = i is the current location of the chain, the next location Xn+1 is picked uniformly at random amongst all pages that i links to, unless i is a dangling page, in which case the chain moves to a random page.

It is not clear that the chain is irreducible, neither that it is aperiodic. So we make the following modification. We pick α between 0 and 1 (typically, α ≈ 0.2) and let

    p̃ij := (1 − α) pij + α/|V|.

These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored (and this happens with probability α), in which case he clicks on a random page.

So this is how Google assigns ranks to pages: it computes the stationary distribution π corresponding to the Markov chain with transition matrix P = [p̃ij]:

    π = πP.

Then it says: page i has higher rank than page j if π(i) > π(j). The idea is simple. Its implementation, though, is one of the largest matrix computations in the world, because of the size of the problem. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by

    π(k) = π(k−1) P.

It uses advanced techniques from linear algebra to speed up computations.

Comments:
1. In an Internet-dominated society, the rank of a page may be very important for a company's profits. Therefore, there are companies that sell high-rank pages: all you need to do to get high rank is pay (a lot of money).
2. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their faculty from each other (and from themselves).
4. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality but, from the point of view of an administrator, it makes the evaluation procedure easy. What is important is that the evaluation can be done without ever looking at the content of one's research.
5. Needless to say, these uses of PageRank and PageRank-like systems represent corruption at the deepest level. But it is a kind of corruption that is hard to trace, since it is unanimously equated with modern use of Information Technology.
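For illustration only, here is a minimal sketch of the power iteration π(k) = π(k−1)P̃ on a tiny hypothetical web graph (this is our toy code, not Google's implementation):

```python
# Hypothetical toy web graph: L(i) = pages that page i links to; page 3 is dangling.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
V = len(links)
alpha = 0.2

def step(pi):
    # One update pi <- pi * P~, where p~_ij = (1 - alpha) p_ij + alpha/|V|.
    new = [alpha / V] * V
    for i, out in links.items():
        targets = out if out else list(range(V))   # dangling page: uniform jump
        for j in targets:
            new[j] += (1 - alpha) * pi[i] / len(targets)
    return new

pi = [1.0 / V] * V
for _ in range(100):
    pi = step(pi)
print([round(x, 4) for x in pi])   # pages ordered by stationary probability
```

The bored-surfer term α/|V| makes the chain irreducible and aperiodic, so the iteration converges regardless of the starting guess.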

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. An individual can be of type AA (meaning that all its alleles are A), or aa, or Aa. This means that the external appearance of a characteristic is a function of the pair of alleles. Thus, two AA individuals will produce an AA child, but an AA mating with an Aa may produce a child of type AA or one of type Aa.

When two individuals mate, their offspring inherits an allele from each parent. We will assume that a child chooses an allele from each parent at random. Imagine that we have a population of N individuals who mate at random, producing a number of offspring. We will also assume that the parents are replaced by the children. We would like to study the genetic diversity, meaning how many characteristics we have.

Since we don't care to keep track of the individuals' identities, the genetic diversity depends only on the total number of alleles of each type. So let x be the number of alleles of type A (so that 2N − x is the number of those of type a). Then, at the next generation, the number of alleles of type A can be found as follows: Pick an allele at random; this allele is of type A with probability x/2N; maintain this allele for the next generation. Do the same thing 2N times. The process repeats, generation after generation.

[Figure. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice. One of the square alleles is selected 4 times. Three of the square alleles are never selected.]

Alleles that are never picked up may belong to individuals who never mate, or may belong to individuals who mate but whose offspring did not inherit this allele. And so on. To simplify, all we need is the information outlined in this figure.

We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. We have a transition from x to y if y alleles of type A were produced.

Since each allele of type A is selected with probability x/2N, and since selections are independent, we see that

    pxy = \binom{2N}{y} \left(\frac{x}{2N}\right)^y \left(1 − \frac{x}{2N}\right)^{2N−y},   0 ≤ y ≤ 2N.

Notice that there is a chance (= (1 − x/2N)^{2N}) that no allele of type A is selected at all, and the population becomes one consisting of alleles of type a only; and, similarly, there is a chance (= (x/2N)^{2N}) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, p00 = p_{2N,2N} = 1, so both 0 and 2N are absorbing states: once there, the number of alleles of type A won't change. Since pxy > 0 for all 1 ≤ x, y ≤ 2N − 1, the states {1, ..., 2N − 1} form a communicating class of inessential states. Therefore

    P(Xn = 0 or Xn = 2N for all large n) = 1,

i.e. the chain gets absorbed by either 0 or 2N.

Let M = 2N. (The fact that M is even is irrelevant for the mathematical model.) With Tx := inf{n ≥ 1 : Xn = x}, as usual, we would like to know the probability

    ϕ(x) = Px(T0 > TM) = P(T0 > TM | X0 = x)

that the chain will eventually be absorbed by state M.

Another way to represent the chain is as follows:

    Xn+1 = ξ_1^n + ··· + ξ_M^n,

where, conditionally on (X0, ..., Xn), the random variables ξ_1^n, ..., ξ_M^n are i.i.d. with

    P(ξ_k^n = 1 | Xn, ..., X0) = Xn/M,   P(ξ_k^n = 0 | Xn, ..., X0) = 1 − Xn/M.

Therefore,

    E(Xn+1 | Xn, ..., X0) = E(Xn+1 | Xn) = M · (Xn/M) = Xn.

This implies that EXn+1 = EXn, i.e., on the average, the expectation doesn't change, and so EXn = EX0 = x for all n ≥ 0.

One can argue that, at "the end of time", the chain (started from x) will be either at M with probability ϕ(x) or at 0 with probability 1 − ϕ(x). Since, on the average, the expectation doesn't change, we should have

    M × ϕ(x) + 0 × (1 − ϕ(x)) = x,

and thus we guess

    ϕ(x) = x/M.

The result is correct, but the argument needs further justification, because "the end of time" is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. These are:

    ϕ(0) = 0,   ϕ(M) = 1,   ϕ(x) = \sum_{y=0}^{M} pxy ϕ(y).

The first two are obvious. For the third, the right-hand side equals, with ϕ(y) = y/M,

    \sum_{y=0}^{M} \binom{M}{y} \left(\frac{x}{M}\right)^y \left(1 − \frac{x}{M}\right)^{M−y} \frac{y}{M}
    = \sum_{y=1}^{M} \frac{M!}{y!(M−y)!} \left(\frac{x}{M}\right)^y \left(1 − \frac{x}{M}\right)^{M−y} \frac{y}{M}
    = \frac{x}{M} \sum_{y=1}^{M} \frac{(M−1)!}{(y−1)!(M−y)!} \left(\frac{x}{M}\right)^{y−1} \left(1 − \frac{x}{M}\right)^{(M−1)−(y−1)}
    = \frac{x}{M} \left( \frac{x}{M} + 1 − \frac{x}{M} \right)^{M−1} = \frac{x}{M},

as needed.
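The guess ϕ(x) = x/M is also easy to test by simulation (our own sketch, with a hypothetical population size):

```python
import random

random.seed(3)
M = 20                      # M = 2N alleles (hypothetical size)

def absorbed_at_M(x):
    # Run the Wright-Fisher chain until absorption at 0 or M.
    while 0 < x < M:
        x = sum(random.random() < x / M for _ in range(M))  # Binomial(M, x/M)
    return x == M

x0, runs = 7, 20_000
est = sum(absorbed_at_M(x0) for _ in range(runs)) / runs
print(est, "vs x/M =", x0 / M)   # estimate should be close to 0.35
```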

24.5 A storage or queueing application

A storage facility stores, say, Xn electronic components at time n. A number An of new components are ordered and added to the stock, while a number of them are sold to the market. If the demand is for Dn components, the demand is met instantaneously provided that enough components are available; otherwise, the demand is only partially satisfied, and the stock reaches level zero at time n + 1. Thus, at time n + 1 there are (Xn + An − Dn) ∨ 0 components in the stock, i.e. we can express these requirements by the recursion

    Xn+1 = (Xn + An − Dn) ∨ 0,

where a ∨ b := max(a, b). Here are our assumptions: the pairs of random variables ((An, Dn), n ≥ 0) are i.i.d., and E(An − Dn) = −µ < 0, i.e. the demand is, on the average, larger than the supply per unit of time. Let ξn := An − Dn, so that Xn+1 = (Xn + ξn) ∨ 0. We have that (Xn, n ≥ 0) is a Markov chain. We will show that this Markov chain is positive recurrent. We will apply the so-called Loynes' method: since the Markov chain is represented by this very simple recursion, we will try to solve the recursion. Working for n = 1, 2, ..., we find

    X1 = (X0 + ξ0) ∨ 0
    X2 = (X1 + ξ1) ∨ 0 = (X0 + ξ0 + ξ1) ∨ ξ1 ∨ 0
    X3 = (X2 + ξ2) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2) ∨ (ξ1 + ξ2) ∨ ξ2 ∨ 0
    ···
    Xn = (X0 + ξ0 + ··· + ξn−1) ∨ (ξ1 + ··· + ξn−1) ∨ ··· ∨ ξn−1 ∨ 0.

The correctness of the latter can formally be proved by showing that it does satisfy the recursion. Let us consider

    g^{(n)}_{xy} := Px(Xn > y) = P\big( (x + ξ0 + ··· + ξn−1) ∨ (ξ1 + ··· + ξn−1) ∨ ··· ∨ ξn−1 ∨ 0 > y \big).

Thus, the event whose probability we want to compute is a function of (ξ0, ..., ξn−1). Notice, however, that the random variables (ξn) are i.i.d., so we can shuffle them in any way we like without altering their joint distribution; in particular,

    (ξ0, ..., ξn−1) =(d) (ξn−1, ..., ξ0).

Hence, in computing the distribution of Xn, which depends on ξ0, ..., ξn−1, instead of considering the variables in their original order, we may reverse them: for y ≥ 0,

    g^{(n)}_{xy} = P\big( (x + ξn−1 + ··· + ξ0) ∨ (ξn−2 + ··· + ξ0) ∨ ··· ∨ ξ0 ∨ 0 > y \big).

Let

    Zn := ξ0 + ··· + ξn−1,   Mn := Zn ∨ Zn−1 ∨ ··· ∨ Z1 ∨ 0.

In particular, g^{(n)}_{0y} = P(Mn > y). Since Mn+1 ≥ Mn, we have g^{(n)}_{0y} ≤ g^{(n+1)}_{0y}. The limit of a bounded and increasing sequence certainly exists:

    g(y) := lim_{n→∞} g^{(n)}_{0y} = lim_{n→∞} P0(Xn > y).

We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. Indeed,

    Mn+1 = 0 ∨ max_{1≤j≤n+1} Zj = 0 ∨ \big( max_{1≤j≤n+1} (Zj − Z1) + ξ0 \big) =(d) 0 ∨ (Mn + ξ0),

where the latter equality follows from the fact that, for all n, (ξ1, ..., ξn) has the same distribution as (ξ0, ..., ξn−1), i.e. we may replace the latter random variables by ξ1, ..., ξn. Hence

    P(Mn+1 = y) = P(0 ∨ (Mn + ξ0) = y) = \sum_{x≥0} P(Mn = x) P(0 ∨ (x + ξ0) = y) = \sum_{x≥0} P(Mn = x) pxy.

Since the n-dependent terms are bounded, we may take the limit of both sides as n → ∞, and, using π(y) = lim_{n→∞} P(Mn = y), we find

    π(y) = \sum_{x≥0} π(x) pxy.

So π satisfies the balance equations. We must make sure that it also satisfies Σ_y π(y) = 1. This will be the case if g(y) is not identically equal to 1, i.e. if

    P(0 ∨ max{Z1, Z2, ...} > y)

is not identically equal to 1. Here, we will use the assumption that Eξn = −µ < 0. From the Law of Large Numbers we have

    P( lim_{n→∞} Zn/n = −µ ) = 1.

This implies that P(lim_{n→∞} Zn = −∞) = 1. But then

    P( lim_{n→∞} Mn = 0 ∨ max{Z1, Z2, ...} < ∞ ) = 1.

Hence g(y) = P(0 ∨ max{Z1, Z2, ...} > y) is not identically equal to 1, as required.

Depending on the actual distribution of ξn, if Eξn ≥ 0 the Markov chain may be transient or null recurrent. It can be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. The case Eξn = 0 is delicate.
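A simulation sketch of ours, contrasting the recursion with Loynes' reversed maximum (the net-input law below is a hypothetical choice with negative mean): both give the same distribution for Xn started from 0, and the empirical law stabilises, reflecting positive recurrence.

```python
import random

random.seed(4)

def xi():
    # Hypothetical net input A_n - D_n, with mean -0.25 < 0.
    return random.choice((1, -1, -1, 0))

def X(n):
    # Direct recursion X_{k+1} = (X_k + xi_k) v 0, started from X_0 = 0.
    x = 0
    for _ in range(n):
        x = max(x + xi(), 0)
    return x

def M(n):
    # Loynes: M_n = 0 v Z_1 v ... v Z_n has the same law as X_n under X_0 = 0.
    z, m = 0, 0
    for _ in range(n):
        z += xi()
        m = max(m, z)
    return m

n, runs = 200, 5000
for tag, f in (("X_n", X), ("M_n", M)):
    samples = [f(n) for _ in range(runs)]
    print(tag, " P(>0) ~", sum(s > 0 for s in samples) / runs,
          " mean ~", round(sum(samples) / runs, 3))
```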

PART II: RANDOM WALKS

25 Random walk

The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Our Markov chains, so far, have been processes with time-homogeneous transition probabilities. A random walk possesses an additional property, besides time-homogeneity: that of spatial homogeneity. This means that the transition probability pxy should depend on x and y only through their relative positions in space. To be able to say this, we need to give more structure to the state space S, a structure that enables us to say that

    px,y = px+z,y+z for any translation z.

More general spaces (e.g. groups) are allowed, but, here, we won't go beyond Zd, the set of d-dimensional vectors with integer coordinates. So, given a function p(x), x ∈ Zd, with Σ_x p(x) = 1, a random walk is a Markov chain with time- and space-homogeneous transition probabilities given by

    px,y = p(y − x).

Example 1: Let

    p(x) = C 2^{−|x1|−···−|xd|},

where C is such that Σ_x p(x) = 1.

Example 2: Let ej be the vector that has 1 in the j-th position and 0 everywhere else, and define

    p(x) = pj, if x = ej;   p(x) = qj, if x = −ej (j = 1, ..., d);   p(x) = 0, otherwise,

where p1 + ··· + pd + q1 + ··· + qd = 1. A random walk of this type is called simple random walk or nearest neighbour random walk. If d = 1, then p(1) = p, p(−1) = q, p(x) = 0 otherwise, where p + q = 1. This is the drunkard's walk we've seen before. In dimension d = 1, besides time- and space-homogeneity, this walk enjoys, in addition, the skip-free property, namely that to go from state x to state y it must pass through all intermediate states, because its value can change by at most 1 at each step.

Example 3: Define

    p(x) = 1/2d, if x ∈ {±e1, ..., ±ed};   p(x) = 0, otherwise.

This is a simple random walk which, in addition, has equal probabilities to move from one state x to one of its "neighbouring states" x ± e1, ..., x ± ed. We call it simple symmetric random walk. If d = 1, then p(1) = p(−1) = 1/2, and this is the symmetric drunkard's walk.

The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by Sn the state of the walk at time n. It is clear that Sn can be represented as

    Sn = S0 + ξ1 + ··· + ξn,

where ξ1, ξ2, ... are i.i.d. random variables in Zd with common distribution P(ξn = x) = p(x), and the initial state S0 is independent of the increments. We refer to the ξn as the increments of the random walk. Obviously,

    p^{(n)}_{xy} = Px(Sn = y) = P(x + ξ1 + ··· + ξn = y) = P(ξ1 + ··· + ξn = y − x).

The random walk ξ1 + ··· + ξn is a random walk starting from 0, so we can reduce several questions to questions about this random walk. Furthermore, notice that

    P(ξ1 + ξ2 = x) = \sum_y P(ξ1 = y, x + ξ2 = x + y) = \sum_y p(y) p(x − y).

(When we write a sum without indicating explicitly the range of the summation variable, we will mean a sum over all possible values.) The operation in the last term is known as convolution. The convolution of two probabilities p(x), p′(x), x ∈ Zd, is defined as a new probability p ∗ p′ given by

    (p ∗ p′)(x) = \sum_y p(y) p′(x − y).

The reader can easily check that

    p ∗ p′ = p′ ∗ p,   p ∗ (p′ ∗ p′′) = (p ∗ p′) ∗ p′′,   p ∗ δ0 = p.

(Recall that δz is the probability distribution that assigns probability 1 to the point z.) In this notation, we have P(ξ1 + ξ2 = x) = (p ∗ p)(x). We will let p∗n be the convolution of p with itself n times, and we easily see that

    P(ξ1 + ··· + ξn = x) = p∗n(x),

so that p^{(n)}_{xy} = Px(Sn = y) = p∗n(y − x).

One could stop here and say that we have a formula for the n-step transition probability. But this would be nonsense, because all we have done is that we dressed the original problem in more fancy clothes: computing the n-th order convolution for all n is, in general, a notoriously hard problem. We shall resist the temptation to compute and look at other methods.
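For what it's worth, the convolution formula is easy to implement; here is a small sketch of ours, computing p∗n for the symmetric drunkard's walk with distributions represented as dicts:

```python
def convolve(p, q):
    # (p * q)(x) = sum_y p(y) q(x - y); distributions as {point: prob} dicts.
    r = {}
    for y, py in p.items():
        for z, qz in q.items():
            r[y + z] = r.get(y + z, 0.0) + py * qz
    return r

p = {-1: 0.5, 1: 0.5}      # one step of the symmetric drunkard's walk
pn = {0: 1.0}              # p*0 = delta_0
for _ in range(4):
    pn = convolve(pn, p)   # after the loop: p*4
print(pn)  # {-4: 0.0625, -2: 0.25, 0: 0.375, 2: 0.25, 4: 0.0625}
```

The output agrees with the binomial formula below: for instance, p∗4(0) = \binom{4}{2} 2^{−4} = 6/16.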

After all, who cares about the exact value of p∗n(x) for all n and x? Perhaps it is the case (a) that there are different questions we need to ask, or (b) that approximate methods are more informative than exact ones. Here are a couple of meaningful questions:

A. Will the random walk ever return to where it started from?
B. If yes, how long will that take?
C. If not, where will it go?

26 The simple symmetric random walk in dimension 1: path counting

This is a random walk with values in Z and transition probabilities pi,i+1 = pi,i−1 = 1/2 for all i ∈ Z. When the walk starts from 0 we can write

    Sn = ξ1 + ··· + ξn,

where the ξn are i.i.d. random variables (random signs) with P(ξn = ±1) = 1/2. Look at the distribution of (ξ1, ..., ξn), a random vector with values in {−1, +1}^n. Clearly, all possible values of the vector are equally likely. Since there are 2^n elements in {−1, +1}^n, we have

    P(ξ1 = ε1, ..., ξn = εn) = 2^{−n}.

So, for fixed n, we are dealing with a uniform distribution on the sample space {−1, +1}^n. So if the initial position is S0 = 0, and the sequence of signs is, say, {+1, +1, −1, +1, −1, ...}, then we have a path in the plane that joins the points

    (0, 0) → (1, 1) → (2, 2) → (3, 1) → (4, 2) → (5, 1) → ···.

We can thus think of the elements of the sample space as paths.

[Figure. A path of length 5, with + and − marking the upward and downward segments.]

Events A that depend only on the first n random signs are subsets of this sample space, and so

    P(A) = #A / 2^n,

where #A is the number of elements of A. Therefore, if we can count the number of elements of A, we can compute its probability. The principle is trivial, but its practice may be hard. In the sequel, we shall learn some counting methods. First, as a warm-up, consider the following:

Example: Assume that S0 = 0, and compute the probability of the event A = {S2 ≥ 0, S4 = −1, S6 < 0, S7 < 0}. The paths comprising this event are shown below. We can easily count that there are 6 paths in A. So P(A) = 6/2^7 = 6/128 = 3/64 ≈ 0.047.

[Figure. The 6 paths of length 7 comprising the event A, drawn over times 0 to 7 and levels −4 to 2.]

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation

    {(m, x) → (n, y)} := {all paths that start at (m, x) and end up at (n, y)}.

Let us first show that:

Lemma 14. The number of paths that start from (0, 0) and end up at (n, i) is given by:

    #{(0, 0) → (n, i)} = \binom{n}{\frac{n+i}{2}}.

More generally, the number of paths that start from (m, x) and end up at (n, y) is given by:

    #{(m, x) → (n, y)} = \binom{n−m}{\frac{1}{2}(n − m + y − x)}.    (23)

Remark. If n + i is not an even number, then there is no path that starts from (0, 0) and ends at (n, i); in this case, the binomial coefficient is interpreted as zero.

Proof.


Let u be the number of +'s in this path (the number of upward segments) and d the number of −'s (the number of downward segments). Then

    u + d = n,   u − d = i.

Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at, we know how many times it has gone up and how many times down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly \binom{n}{u} ways, and this proves the formula. Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently, if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define \binom{n}{u} to be zero if u is not an integer. More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23).

26.2 Paths that start and end at specific points and do not touch zero at all
Lemma 15. Suppose that x, y > 0. The number of paths (m, x) → (n, y) that never become zero (i.e. they do not touch the horizontal axis) is given by:

    #{(m, x) → (n, y); remain > 0} = \binom{n−m}{\frac{1}{2}(n − m + y − x)} − \binom{n−m}{\frac{1}{2}(n − m − y − x)}.    (24)

Proof.

[Figure. The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.]

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick,

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
[Figure. A path from (m, x) to (n, y) and its reflection, ending at (n, −y), around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.]

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y), and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y:

    #{(m, x) → (n, y); touch zero} = #{(m, x) → (n, −y)} = \binom{n−m}{\frac{1}{2}(n − m − y − x)}.

So, for x > 0, y > 0,

    #{(m, x) → (n, y); remain > 0} = #{(m, x) → (n, y)} − #{(m, x) → (n, y); touch zero}
    = \binom{n−m}{\frac{1}{2}(n − m + y − x)} − \binom{n−m}{\frac{1}{2}(n − m − y − x)}.
Corollary 14. For any y > 0,

    #{(1, 1) → (n, y); remain > 0} = \frac{y}{n} · #{(0, 0) → (n, y)}.    (25)

Proof. Let m = 1, x = 1 in (24):

    #{(1, 1) → (n, y); remain > 0} = \binom{n−1}{\frac{1}{2}(n−1+y−1)} − \binom{n−1}{\frac{1}{2}(n−1−y−1)}
    = \frac{(n−1)!}{(\frac{n+y}{2} − 1)! \, (\frac{n−y}{2})!} − \frac{(n−1)!}{(\frac{n+y}{2})! \, (\frac{n−y}{2} − 1)!}
    = \frac{n!}{(\frac{n+y}{2})! \, (\frac{n−y}{2})!} \left( \frac{1}{n} \frac{n+y}{2} − \frac{1}{n} \frac{n−y}{2} \right)
    = \frac{y}{n} \binom{n}{\frac{n+y}{2}},

and the latter term equals the total number of paths from (0, 0) to (n, y).


Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by:

    #{(m, x) → (n, y); remain > z} = \binom{n−m}{\frac{1}{2}(n − m + y − x)} − \binom{n−m}{\frac{1}{2}(n − m − y − x) + z}.    (26)

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have

    #{(m, x) → (n, y); remain > z} = #{(m, x − z) → (n, y − z); remain > 0},

and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by:

    #{(0, 0) → (n, y); remain ≥ 0} = \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2} + 1}.    (27)

Proof.

    #{(0, 0) → (n, y); remain ≥ 0} = #{(0, 0) → (n, y); remain > −1}
    = #{(0, 1) → (n, y + 1); remain > 0}
    = \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2} + 1},

where we used (26) with z = −1.

Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by:

    #{(0, 0) → (n, 0); remain ≥ 0} = \frac{1}{n+1} \binom{n+1}{n/2}.

Proof. Apply (27) with y = 0:

    #{(0, 0) → (n, 0); remain ≥ 0} = \binom{n}{n/2} − \binom{n}{\frac{n}{2} + 1} = \frac{1}{n+1} \binom{n+1}{n/2},

where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by:

    #{(0, 0) → (n, •); remain ≥ 0} = \binom{n}{⌈n/2⌉},    (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x.

Proof. To see this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero), and so we have

    #{(0, 0) → (2m, •); remain ≥ 0} = \sum_{z≥0} \left[ \binom{2m}{m+z} − \binom{2m}{m+z+1} \right]
    = \left[ \binom{2m}{m} − \binom{2m}{m+1} \right] + \left[ \binom{2m}{m+1} − \binom{2m}{m+2} \right] + ··· = \binom{2m}{m} = \binom{n}{n/2},

because the sum telescopes: all the intermediate terms cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero), and so we have

    #{(0, 0) → (2m + 1, •); remain ≥ 0} = \sum_{z≥0} \left[ \binom{2m+1}{m+z+1} − \binom{2m+1}{m+z+2} \right] = \binom{2m+1}{m+1} = \binom{n}{⌈n/2⌉},

for the same reason.

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

[Figure. Two diagrams of paths of length 2m whose shaded regions, nonnegative paths and paths counted by \binom{2m}{m}, contain exactly the same number of paths.]
27 The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1 Distribution after n steps

Lemma 16.

    P(Sn = y | Sm = x) = \binom{n−m}{\frac{1}{2}(n − m + y − x)} 2^{−(n−m)}.

In particular, P0(Sn = y) = \binom{n}{\frac{n+y}{2}} 2^{−n}, and, if n is even,

    u(n) := P0(Sn = 0) = \binom{n}{n/2} 2^{−n}.    (29)

Proof. We have

    P0(Sn = y) = #{(0, 0) → (n, y)} × 2^{−n},   P(Sn = y | Sm = x) = #{(m, x) → (n, y)} × 2^{−(n−m)},

and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n, is given by

    P0(S1 > 0, S2 > 0, ..., Sn > 0 | Sn = y) = y/n.    (30)

Proof. The left side of (30) equals

    \frac{#\{paths (1, 1) → (n, y) that do not touch zero\} × 2^{−n}}{#\{paths (0, 0) → (n, y)\} × 2^{−n}}
    = \frac{\binom{n−1}{\frac{1}{2}(n − 1 + y − 1)} − \binom{n−1}{\frac{1}{2}(n − 1 − y − 1)}}{\binom{n}{\frac{1}{2}(n + y)}} = \frac{y}{n},

where the last equality is the computation carried out in the proof of (25). This is called the ballot theorem, and the reason will be given in Section 29. Such a simple answer begs for a physical/intuitive explanation; but, for now, we shall just compute it.

27.3 Some conditional probabilities

Lemma 17. If n, y ≥ 0 are both even or both odd, then

    P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = 1 − \frac{P0(Sn = y + 2)}{P0(Sn = y)} = \frac{y + 1}{\frac{1}{2}(n + y) + 1}.

In particular, if n is even,

    P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = 0) = \frac{1}{(n/2) + 1}.

Proof. We use (27):

    P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = \frac{P0(S1 ≥ 0, ..., Sn ≥ 0, Sn = y)}{P0(Sn = y)}
    = \frac{\binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2} + 1}}{\binom{n}{\frac{n+y}{2}}} = 1 − \frac{P0(Sn = y + 2)}{P0(Sn = y)}.

This is further equal to

    1 − \frac{n!}{(\frac{n+y}{2} + 1)! \, (\frac{n−y}{2} − 1)!} \cdot \frac{(\frac{n+y}{2})! \, (\frac{n−y}{2})!}{n!} = 1 − \frac{\frac{n−y}{2}}{\frac{n+y}{2} + 1} = \frac{y + 1}{\frac{n+y}{2} + 1}.

27.4 Some remarkable identities

Theorem 26. If n is even,

    P0(S1 ≥ 0, ..., Sn ≥ 0) = P0(S1 ≠ 0, ..., Sn ≠ 0) = P0(Sn = 0).

Proof. For even n we have, using (28) and (29),

    P0(S1 ≥ 0, ..., Sn ≥ 0) = #{(0, 0) → (n, •); remain ≥ 0} × 2^{−n} = \binom{n}{n/2} 2^{−n} = P0(Sn = 0).

We next have

    P0(S1 ≠ 0, ..., Sn ≠ 0)
    (a) = P0(S1 > 0, ..., Sn > 0) + P0(S1 < 0, ..., Sn < 0)
    (b) = 2 P0(S1 > 0, ..., Sn > 0)
    (c) = 2 P0(ξ1 = 1) P0(S2 > 0, ..., Sn > 0 | ξ1 = 1)
    (d) = P0(1 + ξ2 > 0, 1 + ξ2 + ξ3 > 0, ..., 1 + ξ2 + ··· + ξn > 0)
    (e) = P0(ξ2 ≥ 0, ξ2 + ξ3 ≥ 0, ..., ξ2 + ··· + ξn ≥ 0)
    (f) = P0(S1 ≥ 0, S2 ≥ 0, ..., Sn−1 ≥ 0)
    (g) = P0(S1 ≥ 0, S2 ≥ 0, ..., Sn ≥ 0),

where (a) is obvious, (b) follows from symmetry, (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1, (d) follows from P0(ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning, (e) follows from the fact that if m is an integer then 1 + m > 0 is equivalent to m ≥ 0, (f) follows from the fact that ξ1 is independent of ξ2, ξ3, ..., so that we may replace (ξ2, ξ3, ...) by (ξ1, ξ2, ...), and, finally, (g) is trivial, because Sn−1 ≥ 0 means Sn−1 > 0 (Sn−1 cannot take the value 0 when n − 1 is odd), and Sn−1 > 0 implies Sn ≥ 0.
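A quick Monte Carlo check of the ballot theorem (30) (our own sketch): generate walks, condition on Sn = y, and measure how often the walk stayed strictly positive.

```python
import random

random.seed(5)
n, y = 10, 4
kept, total = 0, 0
for _ in range(300_000):
    s, positive = 0, True
    for _ in range(n):
        s += random.choice((-1, 1))
        if s <= 0:
            positive = False
    if s == y:              # condition on the event {S_n = y}
        total += 1
        kept += positive
print(round(kept / total, 3), "vs y/n =", y / n)
```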

. and Sn−1 > 0 implies that Sn ≥ 0. We use (27): P0 (Sn ≥ 0 . . ξ2 + ξ3 ≥ 0. . . (f ) follows from the replacement of (ξ2 . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . . . (c) (d) (e) (f ) (g) where (a) is obvious. and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). If n is even. . . Sn = 0) = P0 (Sn = 0). Sn ≥ 0). . (e) follows from the fact that ξ1 is independent of ξ2 . . . •). . . We next have P0 (S1 = 0. . . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . . . . . Sn < 0) (b) (a) = 2P0 (S1 > 0. 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . Sn = 0) = P0 (S1 > 0. Sn ≥ 0) = #{(0. . . . .) by (ξ1 . . . . 0) (n. 1 + ξ2 + ξ3 > 0. . . . . . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . . Sn ≥ 0.Proof. . ξ2 . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. Sn ≥ 0) = P0 (S1 = 0. (b) follows from symmetry. . . . . . . . . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0.). (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . n/2 where we used (28) and (29). . Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . ξ3 . For even n we have P0 (S1 ≥ 0. Proof. .4 Some remarkable identities Theorem 26. . . +1 27. ξ3 . . . . Sn > 0) + P0 (S1 < 0. . . 85 . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . . . . P0 (S1 ≥ 0.

the two terms must be equal. Sn = 0) P0 (S1 > 0. Sn−2 > 0|Sn−1 = 1). . . and if the particle is to the left of a then its mirror image is on the right. f (n) = P0 (S1 > 0. . Put a two-sided mirror at a. n−1 Proof. So 1 f (n) = 2u(n) P0 (S1 > 0. we see that Sn−1 > 0. . . .5 First return to zero In (29). Sn = 0) = u(n) . Then f (n) := P0 (S1 = 0. Sn = 0) = 2P0 (Sn−1 > 0. 86 . we computed. So u(n) f (n) = . Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). . . P0 (Sn−1 > 0. the probability u(n) that the walk (started at 0) attains value 0 in n steps. 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). . . . Sn = 0). By symmetry. . Sn = 0) + P0 (S1 < 0. we either have Sn−1 = 1 or −1. with equal probability. . . n We know that P0 (Sn = 0) = u(n) = n/2 2−n .27. To take care of the last term in the last expression for f (n). If a particle is to the right of a then you see its image on the left. Sn−1 = 0. Since the random walk cannot change sign before becoming zero. . . Sn = 0) Now. Sn−1 > 0. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. and u(n) = P0 (Sn = 0). . we can omit the last event. . . . given that Sn = 0. So the last term is 1/2. By the Markov property at time n − 1. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. Sn = 0. Sn−1 < 0. n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. for even n. . Sn = 0 means Sn−1 = 1. Theorem 27. Fix a state a. . Also. Let n be even. Sn−2 > 0|Sn−1 > 0. Sn−1 > 0. . . So f (n) = 2P0 (S1 > 0. This is not so interesting.

called (Sn . But P0 (Ta ≤ n) = P0 (Ta ≤ n. n ≥ Ta . Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). we obtain 87 . m ≥ 0) is a simple symmetric random walk. In other words. 28. . . n ≥ Ta By the strong Markov property. n < Ta . n ≥ 0) be a simple symmetric random walk and a > 0. we have n ≥ Ta . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. The last equality follows from the fact that Sn > a implies Ta ≤ n. the process (STa +m − a. Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). n ∈ Z+ ) has the same law as (Sn . Sn ≥ a). Thus. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). Trivially. Sn = Sn . Sn ≤ a). n < Ta . define Sn = where is the first hitting time of a. n ∈ Z+ ). Sn ). 2a − Sn . . Sn ≤ a) + P0 (Sn > a). observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). P0 (Sn ≥ a) = P0 (Ta ≤ n. so we may replace Sn by 2a − Sn (Theorem 28). Theorem 28.What is more interesting is this: Run the particle till it hits a (it will. 2a − Sn ≥ a) = P0 (Ta ≤ n. 2 Proof. Sn > a) = P0 (Ta ≤ n. Let (Sn . Sn ≤ a) + P0 (Ta ≤ n. But a symmetric random walk has the same law as its negative.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. Then Ta = inf{n ≥ 0 : Sn = a} Sn . independent of the past before Ta . Proof. S1 . a + (Sn − a). In the last event. So Sn − a can be replaced by a − Sn in the last display and the resulting process. 2 Define now the running maximum Mn = max(S0 . as a corollary to the above result. . for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . Finally. If Sn ≥ a then Ta ≤ n. Then.

A sample from the urn is a sequence η = (η1 . Proof. The set of values of η contains n elements and η is uniformly distributed a in this set. 436-437. Rend. Sn = y) = P0 (Sn = 2x − y). Paris. Sci. n = a + b. by applying the reflection principle (Theorem 28). we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. 8 88 . as long as a ≥ b. where ηi indicates the colour of the i-th item picked. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . then. as for holding liquids. Sn = 2x − y). 105. and anciently for holding ballots to be drawn. P0 (Mn < x. Sci. 2 Proof. (1887). Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. usually a vase furnished with a foot or pedestal. If an urn contains a azure items and b = n−a black items. If Tx ≤ n. The original ballot problem. we have P0 (Mn < x. the number of selected azure items is always ahead of the black ones equals (a − b)/n. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. Solution d’un probl`me. combined with (31). ηn ). (1887). 105. . . Compt. It is the latter use we are interested in in Probability and Statistics. ornamental objects. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. employed for different purposes. which results into P0 (Tx ≤ n. Bertrand. gives the result.Corollary 19. This. e 10 Andr´. . 29 Urns and the ballot theorem Suppose that an urn8 contains n items. Sn = y) = P0 (Tx > n. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Sn = y) = P0 (Tx ≤ n. (31) x > y. Acad. of which a are coloured azure and b black. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). e e e Paris. Since Mn < x is equivalent to Tx > n. An urn is a vessel. . Sn = y). Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. Solution directe du probl`me r´solu par M. then the probability that. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. 369. Rend. J. 9 Bertrand. Compt. Acad. D. the ashes of the dead after cremation. during sampling without replacement. The answer is very simple: Theorem 31.

ξn = εn | Sn = y) = P0 (ξ1 = ε1 . a.We shall call (η1 . the sequence (ξ1 . . always assuming (without any loss of generality) that a ≥ b = n − a. the ‘black’ items correspond to − signs). . . . Since the ‘azure’ items correspond to + signs (respectively. conditional on the event {Sn = y}. i ∈ Z. . . letting a. where y = a − (n − a) = 2a − n. But in (30) we computed that y P0 (S1 > 0. a we have that a out of the n increments will be +1 and b will be −1. so ‘azure’ items are +1 and ‘black’ items are −1. Then. . . (The word “asymmetric” will always mean “not necessarily symmetric”. ηn ) an urn process with parameters n. . It remains to show that all values are equally a likely. . n Since y = a − (n − a) = a − b. . we can translate the original question into a question about the random walk. We have P0 (Sn = k) = n n+k 2 pi. conditional on {Sn = y}. . Consider a simple symmetric random walk (Sn . n ≥ 0) with values in Z and let y be a positive integer. . n . b be defined by a + b = n. . −n ≤ k ≤ n. but which is not necessarily symmetric. . the result follows. . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. 89 . (ξ1 . . . . indeed.) So pi. ξn ) is. . . We need to show that. An urn can be realised quite easily be a random walk: Theorem 32. conditional on {Sn = y}. n n −n 2 (n + y)/2 a Thus. a − b = y. using the definition of conditional probability and Lemma 16. . . . . .i−1 = q = 1 − p. . . Then. . we have: P0 (ξ1 = ε1 . Let ε1 . But if Sn = y then. . +1} such that ε1 + · · · + εn = y. . the sequence (ξ1 . The values of each ξi are ±1. ξn ) is uniformly distributed in a set with n elements. . . Proof. ξn = εn ) P0 (Sn = y) 2−n 1 = = . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. Sn > 0} that the random walk remains positive. .i+1 = p. Proof of Theorem 31. ξn ) has a uniform distribution. So the number of possible values of (ξ1 . . p(n+k)/2 q (n−k)/2 . εn be elements of {−1. 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. . the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. Sn > 0 | Sn = y) = . . Since an urn with parameters n. . . .

namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. random variables. E0 sTx = ϕ(s)x . and all t ≥ 0. . Px (Ty = t) = P0 (Ty−x = t). Theorem 33. . 2. . Let ϕ(s) := E0 sT1 . 2. .d. If x > 0. See figure below. 90 . x ≥ 0) is also a random walk.30. In view of this. 2qs Proof. . If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. If S1 = 1 then T1 = 1. x − 1 in order to reach x. The process (Tx . Corollary 20. To prove this. Consider a simple random walk starting from 0.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. we make essential use of the skip-free property. plus a ′′ further time T1 required to go from 0 to +1 for the first time. This random variable takes values in {0.i. We will approach this using probability generating functions. Lemma 19. Consider a simple random walk starting from 0.) Then. Lemma 18. In view of Lemma 19 we need to know the distribution of T1 . only the relative position of y with respect to x is needed. For all x. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . y ∈ Z. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1.} ∪ {∞}. x ∈ Z. it is no loss of generality to consider the random walk starting from 0. for x > 0. Proof. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. 1. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. (We tacitly assume that S0 = 0. plus a time T1 required to go from −1 to 0 for the first time. The rest is a consequence of the strong Markov property. Consider a simple random walk starting from 0. . Proof. .

Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. Consider a simple random walk starting from 0. 91 . The first time n such that Sn = −1 is the first time n such that −Sn = 1. The formula is obtained by solving the quadratic. Corollary 22. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But.  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . if x < 0 2ps Proof. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. Corollary 21. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . in order to hit 1. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Consider a simple random walk starting from 0. 2ps Proof. from the strong Markov property. Hence the previous formula applies with the roles of p and q interchanged.

Then the probability generating function of the first return time to 0 is given by E0 s T 0 = 1 − Proof. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. If α = n is a positive integer then n (1 + x)n = k=0 n k x . 2ps ′ =1− 30. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . k by the binomial theorem. We again do first-step analysis. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . k 92 .3 The distribution of the first return to the origin We wish to find the exact distribution of the first return time to the origin. 1 − 4pqs2 . in distribution. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0.2 First return to the origin Now let us ask: If the particle starts at 0.30. For a symmetric ′ random walk. then we can omit the k upper limit in the sum. if S1 = 1.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. this was found in §30. The latter is. how long will it take till it first returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. For the general case. and simply write (1 + x)n = k≥0 n k x . the same as the time required for the walk to hit 1 starting ′ from 0. If we define n to be equal to 0 for k > n. Consider a simple random walk starting from 0. Similarly. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . where α is a real number.

Recall that

    (n choose k) = n! / (k!(n − k)!) = n(n − 1) · · · (n − k + 1) / k!.

It turns out that the formula is formally true for general α, and it looks exactly the same:

    (1 + x)^α = Σ_{k≥0} (α choose k) x^k,

as long as we define

    (α choose k) = α(α − 1) · · · (α − k + 1) / k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. For example, we will show that

    (1 − x)^{−1} = 1 + x + x² + x³ + · · · ,

which is of course the all too familiar geometric series. Indeed,

    (1 − x)^{−1} = Σ_{k≥0} (−1 choose k) (−x)^k,

and, by definition,

    (−1 choose k) = (−1)(−2) · · · (−1 − k + 1) / k! = (−1)^k k!/k! = (−1)^k,

so (1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} x^k, as needed. As another example, let us compute √(1 − x) = (1 − x)^{1/2}. We have

    √(1 − x) = Σ_{k≥0} (1/2 choose k) (−1)^k x^k.

We can write this in more familiar symbols if we observe that

    (1/2 choose k) (−1)^k = − (1/(2k − 1)) (2k choose k) 4^{−k},

and this is shown by direct computation. So

    √(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) (2k choose k) (x/4)^k.

We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs²):

    E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) (2k choose k) (pq)^k s^{2k}.

The probabilities P0(T0′ = n) can now be read off; we do this next.
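The identity relating (1/2 choose k) to the central binomial coefficient can be confirmed numerically; here is a small check of the first few values (mine, not from the notes).

from math import comb

def binom_general(alpha, k):
    # generalized binomial coefficient alpha(alpha-1)...(alpha-k+1)/k!
    num = 1.0
    for j in range(k):
        num *= (alpha - j)
    for j in range(1, k + 1):
        num /= j
    return num

for k in range(8):
    lhs = binom_general(0.5, k) * (-1)**k
    rhs = -comb(2*k, k) / ((2*k - 1) * 4**k)
    print(k, lhs, rhs)   # the two columns agree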

By the definition of E0 s^{T0′},

    E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps), and then

    P0(T0′ = 2k) = (1/(2k − 1)) (2k choose k) p^k q^k = (1/(2k − 1)) P0(S_{2k} = 0).

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. We will see how the results about the generating functions of the hitting and return times help us answer recurrence questions directly, without appeal to the Markov chain theory. By applying the law of large numbers we have that

    P0( lim_{n→∞} Sn/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence the walk is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that Sn/n converges to 0 with probability one, which does not decide the matter. The last part of Theorem 35 below tells us that the walk is recurrent; that it is actually null recurrent follows from Markov chain theory.

The key fact needed here is:

    P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

    1 − 4pq = 1 − 4p + 4p² = (1 − 2p)²,  hence  √(1 − 4pq) = |1 − 2p| = |p − q|.

So

    P0(Tx < ∞) = [ (1 − |p − q|)/(2q) ]^x       if x > 0,
    P0(Tx < ∞) = [ (1 − |p − q|)/(2p) ]^{|x|}   if x < 0.     (33)

This formula can be written more neatly as follows:

Theorem 35.

    P0(Tx < ∞) = (p/q ∧ 1)^x       if x > 0,
    P0(Tx < ∞) = (q/p ∧ 1)^{|x|}   if x < 0.

In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z.

Proof. Take x > 0. If p > q then |p − q| = p − q and (33) gives (1 − (p − q))/(2q) = (q + q)/(2q) = 1, so P0(Tx < ∞) = 1. If p < q then |p − q| = q − p and the same expression equals (2p)/(2q) = p/q. The case x < 0 follows by interchanging the roles of p and q, which is what we want.
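As a cross-check of the formula for P0(T0′ = 2k) (an illustration of mine, not in the notes), one can enumerate all ±1 paths of a given length and compare exact path counts with the closed form; p = 0.3 is an arbitrary test value.

from itertools import product
from math import comb

def first_return_prob(n, p):
    # exact P0(T0' = n), by enumerating all +-1 step sequences of length n
    q, total = 1 - p, 0.0
    for steps in product((1, -1), repeat=n):
        pos, hit = 0, False
        for i, d in enumerate(steps):
            pos += d
            if pos == 0:
                hit = (i == n - 1)   # first visit to 0 happens exactly at step n?
                break
        if hit:
            total += p**steps.count(1) * q**steps.count(-1)
    return total

p = 0.3
for k in (1, 2, 3, 4):
    closed = comb(2*k, k) * p**k * (1-p)**k / (2*k - 1)
    print(2*k, first_return_prob(2*k, p), closed)   # the columns match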

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

    M∞ = sup(S0, S1, S2, . . .).

If we let Mn = max(S0, . . . , Sn), then, since the sequence Mn is nondecreasing, we have M∞ = lim_{n→∞} Mn. Indeed, every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞ with probability 1, because p < 1/2.

Theorem 36. Consider a simple random walk with values in Z and p < 1/2. Then the overall maximum M∞ has a geometric distribution:

    P0(M∞ ≥ x) = (p/q)^x,   x ≥ 0.

Proof. P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x, x ≥ 0, by Theorem 35.

Corollary 23. The simple symmetric random walk with values in Z is null recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null recurrent.

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from with probability 1. Otherwise, this probability is always strictly less than 1: with some positive probability, the walk will never ever return to the point where it started from. What is this probability?

Theorem 37. Let T0′ = inf{n ≥ 1 : Sn = 0} be the first return time to 0. Then

    P0(T0′ < ∞) = 2(p ∧ q).

Proof. The probability generating function of T0′ was obtained in Theorem 34. But

    P0(T0′ < ∞) = lim_{s↑1} E0 s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. If p = q it gives 1. In all cases, 1 − |p − q| = 2(p ∧ q).

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by

    Ji = Σ_{n=1}^{∞} 1(Sn = i),     (34)

a random variable first considered in Section 10, where it was also shown to be geometric: in Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J0 of visits to state 0 is geometrically distributed:

    P0(J0 ≥ k) = (2(p ∧ q))^k,   k = 0, 1, 2, . . . ,
    E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J0 ≥ 1 if and only if the first return time to zero is finite. So

    P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so, by the strong Markov property applied at the successive return times,

    P0(J0 ≥ k) = P0(J0 ≥ 1)^k,   k = 0, 1, 2, . . . .

This proves the first formula. For the second formula we have

    E0 J0 = Σ_{k=1}^{∞} P0(J0 ≥ k) = Σ_{k=1}^{∞} (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, the same formulae apply for any starting state i:

    Pi(Ji ≥ k) = (2(p ∧ q))^k,   Ei Ji = 2(p ∧ q)/(1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2, in which case 2(p ∧ q) = 1, meaning that Pi(Ji ≥ k) = 1 for all k, i.e. Pi(Ji = ∞) = 1, as already known.

We now look at the number of visits to some state x when we start from a different state, say 0.

Theorem 39. Consider a simple random walk with values in Z, starting from 0, and let x ≠ 0. If p ≥ 1/2 then, for k ≥ 1,

    P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},   E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q).

If p ≤ 1/2 then, for k ≥ 1,

    P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1},   E0 Jx = (p/q)^{x∨0} / (1 − 2p).

Proof. Suppose x > 0. Then

    P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),

simply because if Jx ≥ k ≥ 1 then Tx < ∞. Apply now the strong Markov property at Tx:

    P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).

The reason that we replaced k by k − 1 is that, given that Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35 and the second in Theorem 38 (Remark 1). So we obtain

    P0(Jx ≥ k) = (p/q ∧ 1)^x (2(p ∧ q))^{k−1}.

If p ≥ 1/2 this gives P0(Jx ≥ k) = (2q)^{k−1}; if p ≤ 1/2 it gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}. We repeat the argument for x < 0 and find

    P0(Jx ≥ k) = (q/p ∧ 1)^{|x|} (2(p ∧ q))^{k−1};

if p ≥ 1/2 this gives (q/p)^{|x|} (2q)^{k−1}, and if p ≤ 1/2 it gives (2p)^{k−1}. The stated formulae combine the two cases using x∨0 and (−x)∨0. As for the expectation, we apply E0 Jx = Σ_{k=1}^{∞} P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2. In other words, if the random walk "drifts down" (i.e. p < 1/2), then E0 Jx = (p/q)^x / (1 − 2p) drops down to zero geometrically fast as x tends to infinity; but for x < 0 we have E0 Jx = 1/(1 − 2p) regardless of x, i.e. the walk will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae:

    Py(Jx ≥ k) = P0(Jx−y ≥ k),   Ey Jx = E0 Jx−y.
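To see Theorem 38 in action, the following sketch (my addition; p = 0.4, the horizon and the number of trials are arbitrary) estimates the law of J0 by simulation and compares P0(J0 ≥ k) with (2(p ∧ q))^k.

import random

def visits_to_zero(p, rng, horizon=2_000):
    # count returns to 0 along one trajectory; for p != 1/2 the walk is
    # transient, so a moderate horizon already captures J0 almost surely
    pos, count = 0, 0
    for _ in range(horizon):
        pos += 1 if rng.random() < p else -1
        if pos == 0:
            count += 1
    return count

rng = random.Random(1)
p = 0.4
sims = [visits_to_zero(p, rng) for _ in range(5_000)]
lam = 2 * min(p, 1 - p)   # = 2(p ∧ q) = 0.8 here
for k in range(5):
    emp = sum(j >= k for j in sims) / len(sims)
    print(k, emp, lam**k)   # empirical tail vs geometric tail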

32 Duality

Consider a random walk Sk = Σ_{i=1}^{k} ξi, a sum of i.i.d. random variables. Fix an integer n. Then the dual (S̃k, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by

    S̃k := Sn − Sn−k,   0 ≤ k ≤ n.

Theorem 40. (S̃k, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ1, ξ2, . . . , ξn) has the same distribution as (ξn, ξn−1, . . . , ξ1), because the ξi are i.i.d.; S̃k is built from the last k increments in reverse order.

Thus, every time we have a probability for S̃ we can translate it into a probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk, x a positive integer, and Tx = inf{n ≥ 1 : Sn = x} the first time that x is visited. Then

    P0(Tx = n) = (x/n) P0(Sn = x),   n ≥ 1.

Proof. Since {Tx = n} = {Sk < x, 0 ≤ k ≤ n − 1, Sn = x}, we have

    P0(Tx = n) / P0(Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1 | Sn = x).

But the right-hand side equals P0(Sn − Sn−k > 0, 1 ≤ k ≤ n | Sn = x), which, by duality, is P0(S1 > 0, . . . , Sn > 0 | Sn = x), i.e. the conditional probability that the walk stays strictly positive given that it ends at x. Rewriting the ballot theorem in terms of the dual walk,

    P0(S1 > 0, . . . , Sn > 0 | Sn = x) = x/n,

which proves the claim.

Pause for reflection: we have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:

    E0 s^{Tx} = [ (1 − √(1 − 4pqs²)) / (2qs) ]^x.     (35)

Specialising to the symmetric case, E0 s^{Tx} = [ (1 − √(1 − s²)) / s ]^x. In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:

    P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).     (36)

Lastly, in Theorem 41, we derived, using duality, the probabilities

    P0(Tx = n) = (x/n) P0(Sn = x),   n ≥ 1.     (37)

We remark here something methodological: it was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx. This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of s^n in (35); summing up (37) over n should give (36); and taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
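Verifying the compatibility numerically, however, is easy. The sketch below (my addition, not in the notes) computes P0(Tx = n) for the symmetric walk by dynamic programming, i.e. directly from the definition of Tx, and compares it with the duality formula (37); x = 3 and n ≤ 15 are arbitrary choices.

from math import comb

def first_passage(x, N):
    # P0(Tx = n), n = 1..N, simple symmetric walk; level x is absorbing
    probs, out = {0: 1.0}, []
    for n in range(1, N + 1):
        new, hit = {}, 0.0
        for s, pr in probs.items():
            for d in (1, -1):
                t = s + d
                if t == x:
                    hit += pr / 2
                else:
                    new[t] = new.get(t, 0.0) + pr / 2
        probs = new
        out.append(hit)
    return out

x, N = 3, 15
dp = first_passage(x, N)
for n in range(1, N + 1):
    if n >= x and (n + x) % 2 == 0:
        ballot = (x / n) * comb(n, (n + x) // 2) / 2**n   # (x/n) P0(Sn = x)
        print(n, dp[n - 1], ballot)   # the two columns agree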

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn ) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn ) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+ (2n) be the number of sides of the polygonal line that are above the horizontal axis and T− (2n) = 2n − T+ (2n) the number of sides that are below. Clearly, T± (2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+ (2n) as the number of leads in 2n coin tosses, where we

Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.

have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, the most likely values of T+(2n) are the two extremes, m = 0 and m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

    P(T+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0),   0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:

    P(T+(2n) = 2k) = Σ_{r=1}^{n} P(T+(2n) = 2k, T0′ = 2r).

Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write

    P(T+(2n) = 2k) = Σ_{r=1}^{n} P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
                   + Σ_{r=1}^{n} P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).

In the first case,

    P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) − 2r = 2k − 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),

because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis, and so T+(2n) is reduced by 2r. Furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows

    P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).

We also observe that, by symmetry,

    P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f_{2r}.

Letting p_{2k,2n} = P(T+(2n) = 2k), we have shown that

    p_{2k,2n} = (1/2) Σ_{r=1}^{k} f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we realise that

    p_{2k,2n} = (1/2) u_{2n−2k} Σ_{r=1}^{k} f_{2r} u_{2k−2r} + (1/2) u_{2k} Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r}.

Now recall the first-return decomposition u_{2m} = Σ_{r=1}^{m} f_{2r} u_{2m−2r} (decompose the event S_{2m} = 0 according to the time of the first return to 0). It makes both sums collapse: the first equals u_{2k} and the second equals u_{2n−2k}, so the right-hand side equals

    (1/2) u_{2n−2k} u_{2k} + (1/2) u_{2k} u_{2n−2k} = u_{2k} u_{2n−2k},

completing the induction. Explicitly, we have

    P(T+(2n) = 2k) = (2k choose k) (2n−2k choose n−k) 2^{−2n},   0 ≤ k ≤ n.     (38)

With 2n = 100, we plot the function P(T+(2n) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.
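A few lines of code reproduce the numbers behind the plot; this is an illustration of formula (38) that I have added, not part of the original notes.

from math import comb

def time_positive_pmf(n):
    # P(T+(2n) = 2k) for k = 0..n, via formula (38)
    return [comb(2*k, k) * comb(2*(n-k), n-k) / 4**n for k in range(n + 1)]

pmf = time_positive_pmf(50)           # 2n = 100
print(pmf[0], pmf[25], pmf[50])       # ends ~ 0.0796, middle ~ 0.0127
print(abs(sum(pmf) - 1) < 1e-12)      # it really is a probability distribution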

[Figure: plot of P(T+(2n) = 2k), 2n = 100, for 0 ≤ k ≤ 50; the probabilities fall from about 0.08 at k = 0 to about 0.013 in the middle and rise back up, a U shape maximised at the two ends of the interval.]

The arcsine law*
Use of Stirling's approximation in formula (38) gives

    P(T+(2n) = 2k) ∼ 1 / (π √(k(n − k))),   as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

    P( T+(2n)/2n ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T+(2n)/2n = k/n )
                            ∼ Σ_{a ≤ k/n ≤ b} 1 / (π √(k(n − k)))
                            = Σ_{a ≤ k/n ≤ b} (1/n) · 1 / (π √( (k/n)(1 − k/n) )).

This is easily seen to have a limit as n → ∞, the limit being

    ∫_a^b 1/(π √(t(1 − t))) dt,

because the last sum is simply a Riemann approximation to this integral. This shows the celebrated arcsine law:

    lim_{n→∞} P( T+(2n)/2n ≤ x ) = ∫_0^x 1/(π √(t(1 − t))) dt = (2/π) arcsin √x.

We thus have that the limiting distribution of T+(2n)/2n (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.
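A simulation makes the arcsine law vivid: long stretches on one side of the axis are typical, not exceptional. The sketch below (mine; 2n = 1000 and 10000 trials are arbitrary choices) compares the empirical distribution function of T+(2n)/2n with (2/π) arcsin √x.

import random, math

def fraction_positive(n2, rng):
    # fraction of the 2n sides lying above the axis; a side (k-1, k) is
    # positive when S_{k-1} > 0 or S_k > 0 (the two differ by 1)
    s_prev, s, pos = 0, 0, 0
    for _ in range(n2):
        s_prev, s = s, s + (1 if rng.random() < 0.5 else -1)
        if s_prev > 0 or s > 0:
            pos += 1
    return pos / n2

rng = random.Random(2)
data = sorted(fraction_positive(1000, rng) for _ in range(10_000))
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    emp = sum(d <= x for d in data) / len(data)
    print(x, emp, 2 / math.pi * math.asin(math.sqrt(x)))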

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk, because the summands in Tx = Σ_{y=1}^{x} (Ty − Ty−1) are i.i.d., all distributed as T1, and

    P(T1 ≥ 2n) = P(S_{2n} = 0) = (2n choose n) 2^{−2n} ∼ 1/√(πn).

From this we immediately get that the expectation of T1 is infinite: ET1 = ∞. Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y), where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z: let ξ1, ξ2, . . . be i.i.d. random vectors such that

    P(ξn = (1,0)) = P(ξn = (−1,0)) = P(ξn = (0,1)) = P(ξn = (0,−1)) = 1/4.

We then let S0 = (0, 0) and Sn = Σ_{i=1}^{n} ξi, so (Sn, n ∈ Z+) is a random walk. If x ∈ Z², we let x¹ be its horizontal coordinate and x² its vertical one, so Sn = (Sn¹, Sn²). Since ξ1, ξ2, . . . are independent, so are ξ1¹, ξ2¹, . . .; hence (Sn¹, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk, because there is a possibility that the increment takes the value 0: the ξn¹ are identically distributed with

    P(ξn¹ = 0) = P(ξn = (0,1)) + P(ξn = (0,−1)) = 1/2,
    P(ξn¹ = 1) = P(ξn = (1,0)) = 1/4,
    P(ξn¹ = −1) = P(ξn = (−1,0)) = 1/4.

The same argument applies to (Sn², n ∈ Z+), showing that it, too, is a random walk. However, for each n, the random variables ξn¹, ξn² are not independent, simply because if one takes the value 0 the other is nonzero. Consequently, the two random walks (Sn¹), (Sn²) are not independent either.

Consider now a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x¹, x²) into (x¹ + x², x¹ − x²). So let

    Rn = (Rn¹, Rn²) := (Sn¹ + Sn², Sn¹ − Sn²).

Lemma 20. (Rn, n ∈ Z+) is a random walk in two dimensions.

Proof. We have Rn¹ = Σ_{i=1}^{n} (ξi¹ + ξi²) and Rn² = Σ_{i=1}^{n} (ξi¹ − ξi²). So Rn = Σ_{i=1}^{n} ηi, where ηn is the vector ηn = (ξn¹ + ξn², ξn¹ − ξn²). The sequence of vectors η1, η2, . . . is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, . . .. Hence (Rn) is a sum of i.i.d. random vectors, i.e. a random walk. We have:

Lemma 21. Each of (Rn¹, n ∈ Z+) and (Rn², n ∈ Z+) is a simple symmetric random walk in one dimension, and the two of them are independent.

Proof. The possible values of each of ηn¹ = ξn¹ + ξn², ηn² = ξn¹ − ξn² are ±1, and

    P(ηn¹ = 1,  ηn² = 1)  = P(ξn = (1,0))  = 1/4,
    P(ηn¹ = −1, ηn² = −1) = P(ξn = (−1,0)) = 1/4,
    P(ηn¹ = 1,  ηn² = −1) = P(ξn = (0,1))  = 1/4,
    P(ηn¹ = −1, ηn² = 1)  = P(ξn = (0,−1)) = 1/4.

Hence P(ηn¹ = 1) = P(ηn¹ = −1) = 1/2 and P(ηn² = 1) = P(ηn² = −1) = 1/2, and so

    P(ηn¹ = ε, ηn² = ε′) = P(ηn¹ = ε) P(ηn² = ε′)   for all choices of ε, ε′ ∈ {−1, +1},

i.e. ηn¹, ηn² are independent for each n. Since (Rn¹) and (Rn²) are built from the independent i.i.d. sequences (ηn¹), (ηn²), each is a simple symmetric random walk and the two are independent.

Lemma 22.

    P(S_{2n} = 0) = [ (2n choose n) 2^{−2n} ]² = (2n choose n)² 2^{−4n}.

Proof. Since the map (x¹, x²) ↦ (x¹ + x², x¹ − x²) is one-to-one, S_{2n} = 0 if and only if R_{2n} = 0, i.e. if and only if R_{2n}¹ = 0 and R_{2n}² = 0. Hence

    P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}¹ = 0) P(R_{2n}² = 0) = [ (2n choose n) 2^{−2n} ]²,

where the middle equality follows from the independence of the two random walks. By Stirling's approximation,

    P(S_{2n} = 0) ∼ c/n,  as n → ∞,

where c = 1/π is a positive constant, in the sense that the ratio of the two sides goes to 1.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^{∞} P(Sn = 0) = ∞. But, by the previous lemma, P(S_{2n} = 0) ∼ c/n, and Σ_n c/n = ∞.
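The factorisation in Lemma 22 is easy to test; the sketch below (mine, with arbitrary n = 8 and 200000 trials) estimates P(S_{2n} = (0,0)) for the two-dimensional walk and compares it with the square of the one-dimensional return probability.

import random
from math import comb

rng = random.Random(3)
n, trials = 8, 200_000
steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
hits = 0
for _ in range(trials):
    x = y = 0
    for _ in range(2 * n):
        dx, dy = rng.choice(steps)
        x, y = x + dx, y + dy
    hits += (x == 0 and y == 0)
u = comb(2*n, n) / 4**n           # 1D SSRW return probability at time 2n
print(hits / trials, u * u)       # the two estimates agree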

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let a < 0 < b. Recall, from Theorem 7, that for a simple symmetric random walk (started from S0 = 0),

    P(Ta < Tb) = b/(b − a),   P(Tb < Ta) = −a/(b − a).

We can read this as follows:

    P(S_{Ta∧Tb} = a) = b/(b − a),   P(S_{Ta∧Tb} = b) = −a/(b − a).

This last display tells us the distribution of S_{Ta∧Tb}. Note, in particular, that

    E S_{Ta∧Tb} = a · b/(b − a) + b · (−a)/(b − a) = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b, with a < 0 < b, such that pa + (1 − p)b = 0. Indeed, if p = M/N, with M, N ∈ N and M < N, choose a = M − N and b = M. This means that we can simulate the 2-point distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability that it exits from the left point is p, and the probability that it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z) with zero mean: Σ_{i∈Z} i pi = 0. Can we find a stopping time T such that S_T has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that

    P(S_T = i) = pi,   i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ {(i, j) : i < 0 < j}, with the following distribution:

    P(A = 0, B = 0) = p0,
    P(A = i, B = j) = c^{−1} (j − i) pi pj,   i < 0 < j,

where c^{−1} is chosen by normalisation:

    P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1,

i.e.

    c^{−1} Σ_{j>0} j pj Σ_{i<0} pi + c^{−1} Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.

Since Σ_{i∈Z} i pi = 0, we have the common value

    c := Σ_{i<0} (−i) pi = Σ_{i>0} i pi,

and, using this, the left-hand side above equals c^{−1} · c (Σ_{i<0} pi + Σ_{j>0} pj) = 1 − p0, so this c is precisely the right normalisation. Letting Ti = inf{n ≥ 0 : Sn = i}, i ∈ Z, and assuming that (A, B) is independent of (Sn), we consider the random variable

    Z = S_{T_A ∧ T_B}.

The claim is that P(Z = k) = pk, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p0.

Suppose next k > 0. We have

    P(Z = k) = P(S_{T_A ∧ T_B} = k) = Σ_{i<0} Σ_{j>0} P(S_{Ti ∧ Tj} = k) P(A = i, B = j)
             = Σ_{i<0} P(S_{Ti ∧ Tk} = k) P(A = i, B = k),

since, for j > 0, the walk stopped at Ti ∧ Tj can equal k only if j = k. By the gambler's ruin formula, P(S_{Ti ∧ Tk} = k) = (−i)/(k − i), so

    P(Z = k) = Σ_{i<0} [(−i)/(k − i)] c^{−1} (k − i) pi pk = c^{−1} Σ_{i<0} (−i) pi pk = pk,

by the definition of c. Similarly, P(Z = k) = pk for k < 0. Hence T = T_A ∧ T_B is a stopping time with the required property.
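The proof is constructive, so it doubles as an algorithm. Below is a sketch of it (my own illustration; the zero-mean test distribution is an arbitrary choice): sample (A, B) as in the proof, then run the walk to time T_A ∧ T_B.

import random

def skorokhod_sample(p, rng):
    # sample (A, B) per Theorem 43, then run a SSRW until it hits A or B
    neg = {i: w for i, w in p.items() if i < 0}
    pos = {j: w for j, w in p.items() if j > 0}
    pairs = [(0, 0)] + [(i, j) for i in neg for j in pos]
    c = sum(-i * w for i, w in neg.items())        # = sum_{j>0} j p_j
    weights = [p.get(0, 0.0)] + [(j - i) * p[i] * p[j] / c
                                 for i in neg for j in pos]
    a, b = rng.choices(pairs, weights=weights)[0]
    if a == b == 0:
        return 0
    s = 0
    while s != a and s != b:                       # stop at T_A ∧ T_B
        s += 1 if rng.random() < 0.5 else -1
    return s

p = {-2: 0.2, -1: 0.1, 0: 0.25, 1: 0.4, 2: 0.05}   # zero-mean test law
rng = random.Random(4)
draws = [skorokhod_sample(p, rng) for _ in range(100_000)]
for i in sorted(p):
    print(i, draws.count(i) / len(draws), p[i])    # empirical vs target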

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

    N = {1, 2, 3, . . .}.

The set Z of integers consists of N, their negatives and zero:

    Z = {· · · , −3, −2, −1, 0, 1, 2, 3, · · · }.

If a, b are arbitrary integers, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If c | b and b | a then c | a. If a | b and b | a then a = ±b (and a = b when both are positive).

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a and d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice versa, leading to d1 = d2.) Notice that:

    gcd(a, b) = gcd(a, b − a).

Indeed, let d = gcd(a, b). We will show that d = gcd(a, b − a). But d | a and d | b, therefore d | b − a; so (i) holds. Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, and d = gcd(a, b), we have c | d. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. For instance,

    gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26)
                 = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.

But in the first line we subtracted 38 exactly 3 times; in other words, we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest by the smallest. Indeed 4 × 38 > 120, so 3 is the maximum number of times that 38 "fits" into 120. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school):

    gcd(120, 38) = gcd(38, 6) = gcd(6, 2) = gcd(2, 0) = 2.
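Euclid's algorithm is a two-line program; here it is (a sketch of mine), checked against the worked example.

def gcd(a, b):
    # replace the pair (a, b) by (b, a mod b) until the remainder vanishes
    while b:
        a, b = b, a % b
    return a

print(gcd(120, 38))   # 2, as in the text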

i. y ∈ Z. a.e. we have the fact that if d = gcd(a. For example. The long division algorithm is a process for finding the quotient q and the remainder r. y such that d = xa + yb. 38) = gcd(6. .The theorem of division says the following: Given two positive integers b. 107 . Then d = xa + yb. This shows (ii) of the definition of a greatest common divisor. in particular. y ∈ Z. Here are some other facts about the greatest common divisor: First. for some x. 1. b) = min{xa + yb : x. Its proof is obvious. let d be the right hand side. with x = 19. r = 18. let us follow the previous example again: gcd(120. we have 2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120. In the process. The same logic works in general. Then c divides any linear combination of a. we can find an integer q and an integer r ∈ {0. . it divides d. for integers a. 38) = gcd(6. y = 6. gcd(38. Second. xa + yb > 0}. . such that a = qb + r. we used the divisions 120 = 3 × 38 + 6 38 = 6 × 6 + 2. 2) = 2. b ∈ Z. a − 1}. Working backwards. Suppose that c | a and c | b. We need to show (i). 120) = 38x + 120y. To see why this is true. to divide 3450 with 26 we do 132 —— 3450 26 — 85 78 — 70 52 — 18 26 | and obtain q = 26. with a > b. Thus. not both zero. we have gcd(a. we can show that. To see this. . b) then there are integers x. b.

The greatest common divisor of 3 integers can now be defined in a similar manner; in general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. In general, a relation can be thought of as a set of pairs of elements of the set A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. A relation is called reflexive if it contains all pairs (x, x). The equality relation = is reflexive; the relation < is not. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The equality relation is symmetric; the relation < is not. A relation is transitive if, whenever it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric, and it can be thought of as reflexive (unless we do not want to accept the fact that one may not be a friend of oneself), but it is not necessarily transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) The first animal can be assigned any of the m available names; since its name imposes no constraint, so can the second; and so on. Hence the total number of assignments is

    m × m × · · · × m   (n times).

Suppose now we are not allowed to give the same name to different animals. Each assignment of this type is a one-to-one function from A to B. To be able to do this, we need more names than animals: n ≤ m. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is

    (m)_n := m(m − 1)(m − 2) · · · (m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A, and we denote (n)_n also by n!. Thus n! is the total number of permutations of a set of n elements; incidentally, n! stands for the product of the first n natural numbers:

    n! = 1 × 2 × 3 × · · · × n.

Any set A contains several subsets. The number of subsets with k elements each can be found by designating which of the n elements of A will be chosen: the first element can be chosen in n ways, the second in n − 1, and so on, the k-th in n − k + 1 ways. So we have (n)_k = n(n − 1) · · · (n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k chosen elements, concluding that the number of subsets with k elements of a set with n elements equals

    (n)_k / k! =: (n choose k),

where the symbol (n choose k) is merely a name for the number computed on the left side of this equation:

    (n choose k) = n! / (k!(n − k)!),  so that  (n choose k) = (n choose n−k),  (n choose 0) = 1.

Since there is only one subset with no elements (it is the empty set ∅), the total number of subsets of A (regardless of their number of elements) is thus equal to

    Σ_{k=0}^{n} (n choose k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem. We thus observe that there are as many subsets in A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.
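These counting facts are easy to spot-check with Python's itertools; a quick sanity script (mine, with small set sizes chosen arbitrarily):

from itertools import product, permutations, combinations
from math import comb, perm

n, m = 3, 5   # |A| = n animals, |B| = m names
assert len(list(product(range(m), repeat=n))) == m**n         # all functions
assert len(list(permutations(range(m), n))) == perm(m, n)     # one-to-one ones
assert len(list(combinations(range(n), 2))) == comb(n, 2)     # 2-element subsets
assert sum(comb(n, k) for k in range(n + 1)) == 2**n          # all subsets
print("all counting identities check out")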

We shall let (n choose k) = 0 if n < k; also, by convention, (n choose y) = 0 if y is not a nonnegative integer. The Pascal triangle relation states that

    (n+1 choose k) = (n choose k−1) + (n choose k).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is included in the sample, then there are k − 1 remaining animals to be chosen, and this can be done in (n choose k−1) ways. If the pig is not included in the sample, we choose k animals from the remaining n ones, and this can be done in (n choose k) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

More generally, if x is a real number, we define

    (x choose m) = x(x − 1) · · · (x − m + 1) / m!,

for m a nonnegative integer.

Maximum and minimum

The minimum of two numbers a, b is denoted by

    a ∧ b = min(a, b).

Thus a ∧ b = a if a ≤ b, or = b if b ≤ a. The maximum is denoted by

    a ∨ b = max(a, b).

The notation extends to more than two numbers; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c). In particular, we use the following notations:

    a+ = max(a, 0),   a− = −min(a, 0).

a+ is called the positive part of a, and a− is called the negative part of a. Both quantities are nonnegative. Note that it is always true that

    a = a+ − a−,

and the absolute value of a is given by

    |a| = a+ + a−.

Indicator function

1_A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is worth remembering that

    1_A 1_B = 1_{A∩B},   1 − 1_A = 1_{A^c}.

An example of its application: since

    1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1_A)(1 − 1_B) = 1_A + 1_B − 1_A 1_B,

we have 1_{A∪B} = 1_A + 1_B − 1_{A∩B}. If A is an event then 1_A is a random variable: it takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is

    E 1_A = 1 × P(A) + 0 × P(A^c) = P(A).

Taking expectations in the display above,

    E[1_{A∪B}] = E[1_A] + E[1_B] − E[1_{A∩B}],

which means P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Several quantities of interest are expressed in terms of indicators. For instance, if A1, A2, . . . are sets or clauses, then the number Σ_{n≥1} 1_{A_n} is the number of true clauses: if I go to a bank N times and Ai is the clause "I am robbed the i-th time", then Σ_{i=1}^{N} 1_{A_i} is the total number of times I am robbed. Another example: let X be a nonnegative random variable with values in N = {1, 2, . . .}. Then

    X = Σ_{n=1}^{X} 1   (the sum of X ones is X),

so

    X = Σ_{n≥1} 1(X ≥ n),

and hence

    E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that

    ϕ(αx + βy) = αϕ(x) + βϕ(y),

for all x, y ∈ R^n and all α, β ∈ R. Let u1, . . . , un be a basis for R^n. Then any x ∈ R^n can be uniquely written as

    x = Σ_{j=1}^{n} x^j uj,

and the column (x^1, . . . , x^n) is a column representation of x with respect to the given basis. Then, because ϕ is linear,

    y := ϕ(x) = Σ_{j=1}^{n} x^j ϕ(uj).

But the ϕ(uj) are vectors in R^m. So if we choose a basis v1, . . . , vm in R^m, we can express each ϕ(uj) as a linear combination of the basis vectors:

    ϕ(uj) = Σ_{i=1}^{m} a_{ij} vi.

The matrix A of ϕ with respect to the choice of the two bases is defined by the m × n array

    A := ( a_{ij} ),   1 ≤ i ≤ m,  1 ≤ j ≤ n.

Combining, we have

    ϕ(x) = Σ_{i=1}^{m} vi Σ_{j=1}^{n} a_{ij} x^j.

Let y^i := Σ_{j=1}^{n} a_{ij} x^j, i = 1, . . . , m. Then the column (y^1, . . . , y^m) is a column representation of y = ϕ(x) with respect to v1, . . . , vm, and the relation between the two column representations is written as

    (y^1, . . . , y^m)^T = A (x^1, . . . , x^n)^T.

Suppose now that ψ, ϕ are linear functions

    R^n → R^m → R^ℓ
        ψ       ϕ

so that ϕ ∘ ψ is a linear function R^n → R^ℓ. Fix three bases on R^n, R^m and R^ℓ, and let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. Then the matrix representation of ϕ ∘ ψ is AB, where

    (AB)_{ij} = Σ_{k=1}^{m} A_{ik} B_{kj},   1 ≤ i ≤ ℓ,  1 ≤ j ≤ n.

So A is an ℓ × m matrix, B is an m × n matrix, and AB is an ℓ × n matrix.
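A tiny check (mine, not in the notes) that composition of linear maps corresponds to matrix multiplication; the 2×3 and 3×2 matrices below are arbitrary.

def matmul(A, B):
    # (AB)_ij = sum_k A_ik B_kj
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def apply(M, x):
    # column representation of the image of x under the map with matrix M
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

A = [[1, 2, 0], [0, 1, -1]]    # phi : R^3 -> R^2
B = [[1, 0], [2, 1], [0, 3]]   # psi : R^2 -> R^3
x = [5, -7]
print(apply(A, apply(B, x)))   # phi(psi(x))
print(apply(matmul(A, B), x))  # (AB)x, the same vector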

Usually, the choice of the basis is the so-called standard basis. The standard basis in R^d consists of the vectors

    e1 = (1, 0, . . . , 0),  e2 = (0, 1, . . . , 0),  . . . ,  ed = (0, 0, . . . , 1);

in other words, the j-th component of ei is 1(i = j). A matrix A is called square if it is an n × n matrix. If we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n, i.e. it is its matrix representation with respect to the fixed basis.

Supremum and infimum

When we have a set of numbers I (for instance, a set of real numbers), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all x ∈ I. This minimum (or maximum) may not exist; for instance, I = {1, 1/2, 1/3, · · · } has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all lower bounds, i.e. of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists, and it is this that we call the infimum. Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞. If I is empty then, by convention, inf ∅ = +∞ and sup ∅ = −∞. inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Similarly, sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If x1, x2, . . . is a sequence of numbers, then the tail infima satisfy

    inf{x1, x2, . . .} ≤ inf{x2, x3, . . .} ≤ inf{x3, x4, . . .} ≤ · · ·

and, since any nondecreasing sequence has a limit (possibly +∞), we take this limit and call it the limit inferior of the sequence:

    lim inf xn = lim_{n→∞} inf{xn, xn+1, . . .}.

Similarly, the tail suprema form a nonincreasing sequence, whose limit (possibly −∞) we call the limit superior:

    lim sup xn = lim_{n→∞} sup{xn, xn+1, . . .}.

We note that lim inf(−xn) = − lim sup xn. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory), but I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely:

    if A1, A2, . . . are mutually disjoint events, then P( ∪_{n≥1} An ) = Σ_{n≥1} P(An).

Everything else follows from this axiom. That is why Probability Theory is very rich. (In contrast, Linear Algebra, which has many axioms, is more "restrictive".) Of course, in these notes, and everywhere else, I do sometimes "cheat" from a strict mathematical point of view by not requiring measure-theoretic care, but not too much.

Note some immediate consequences. If A, B are disjoint events then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint then

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B);

this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. If A1, A2, . . . are increasing with A = ∪n An then P(A) = lim_{n→∞} P(An); similarly, if they are decreasing with A = ∩n An then P(A) = lim_{n→∞} P(An). For arbitrary events, P(∪n An) ≤ Σn P(An). In particular, if An are events with P(An) = 0 then P(∪n An) = 0, and if An are events with P(An) = 1 then P(∩n An) = 1.

Quite often, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. that A is a subset of B). Similarly, to show that EX ≤ EY, we often need to argue that X ≤ Y.

In Markov chain theory, and everywhere else, we work a lot with conditional probabilities. Recall that P(B | A) is defined by P(A ∩ B)/P(A), and this is a very motivated definition. Often, to compute P(A ∩ B), we read the definition as

    P(A ∩ B) = P(A) P(B | A).

This is generalisable to any finite number of events:

    P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2).

If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A)P(B | A) = P(B)P(A | B), and this is called Bayes' theorem. Also, to compute P(A), it is useful to find disjoint events B1, B2, . . . such that ∪n Bn = Ω, in which case we write

    P(A) = Σn P(A ∩ Bn) = Σn P(A | Bn) P(Bn).

If A is an event then 1_A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 on A^c. Note that 1_A 1_B = 1_{A∩B}, 1_{A^c} = 1 − 1_A, and E 1_A = P(A).

Markov's inequality for a positive random variable X states that

    P(X > x) ≤ EX/x.

This follows from the logical statement x 1(X > x) ≤ X 1(X > x) ≤ X, which holds since X is positive; taking expectations gives x P(X > x) ≤ EX.

An application of this is: P(X < ∞) = lim_{x→∞} P(X ≤ x).

The expectation of a nonnegative random variable X may be defined by

    EX = ∫_0^∞ P(X > x) dx.

In case the random variable is discrete, the formula reduces to the known one, EX = Σ_x x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^{∞} x f(x) dx. For a random variable which takes positive and negative values, we define X+ = max(X, 0) and X− = −min(X, 0), which are both nonnegative, and then define

    EX = EX+ − EX−,

which is motivated from the algebraic identity X = X+ − X−. This definition makes sense only if EX+ < ∞ or EX− < ∞. For a random variable that takes values in N we can write

    EX = Σ_{n=1}^{∞} P(X ≥ n);

this is due to the fact that Σ_{n≥1} 1(X ≥ n) = X and E 1(X ≥ n) = P(X ≥ n). In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

Generating functions

A generating function of a sequence a0, a1, a2, . . . is an infinite polynomial having the ai as coefficients:

    G(s) = a0 + a1 s + a2 s² + · · · .

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. If it exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). Note that |G(s)| ≤ Σn |an||s|^n, and this is ≤ Σn |an| if s ∈ [−1, 1]; so if Σn |an| < ∞ then G(s) exists for all s ∈ [−1, 1]. As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is

    Σ_{n=0}^{∞} ρ^n s^n = 1/(1 − ρs),   |s| < 1/|ρ|.     (39)

The generating function of the sequence pn = P(X = n), n = 0, 1, . . . , where X is a random variable with values in Z+ ∪ {+∞}, is called the probability generating function of X:

    φ(s) = E s^X = Σ_{n=0}^{∞} s^n P(X = n).

This is defined for all −1 < s < 1, making it a useful object when a0, a1, . . . is a probability distribution on Z+ = {0, 1, 2, . . .}, because then Σn an = 1.

Note that φ(0) = P(X = 0). We do allow X to, possibly, take the value +∞, with

    p∞ := P(X = ∞) = 1 − Σ_{n=0}^{∞} pn,

so that Σ_{n=0}^{∞} pn ≤ 1. Since the term s^∞ p∞ could formally appear in the sum defining φ(s), note that, since |s| < 1, we have s^∞ = 0, so this term vanishes; and

    φ(1) = Σ_{n=0}^{∞} P(X = n) = P(X < +∞).

We can recover the probabilities pn from the formula

    pn = φ^{(n)}(0) / n!,

as follows from Taylor's theorem. Also, we can recover the expectation of X from

    EX = lim_{s↑1} φ′(s),

something that can be formally seen from (d/ds) E s^X = E (d/ds) s^X = E X s^{X−1} and letting s = 1; this is why φ is called the probability generating function.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

    x_{n+2} = x_{n+1} + x_n,   n ≥ 0,

with x0 = x1 = 1. If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is

    Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x0),

and the generating function of (x_{n+2}, n ≥ 0) is

    Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x0 − s x1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0):

    s^{−2}(X(s) − x0 − s x1) = s^{−1}(X(s) − x0) + X(s).

Since x0 = x1 = 1, we can solve for X(s) and find

    X(s) = −1 / (s² + s − 1).

But the polynomial s² + s − 1 has two roots:

    a = (√5 − 1)/2,   b = −(√5 + 1)/2.

Hence s² + s − 1 = (s − a)(s − b), and partial fractions give

    X(s) = −1/((s − a)(s − b)) = (1/(b − a)) [ 1/(b − s) − 1/(a − s) ].

But (39) tells us that 1/(a − s) = (1/a) · 1/(1 − s/a) is the generating function of (1/a)^{n+1}, and likewise 1/(b − s) is the generating function of (1/b)^{n+1}. Hence X(s) is the generating function of

    x_n = ( (1/b)^{n+1} − (1/a)^{n+1} ) / (b − a),

which is the solution to the recursion, i.e. the n-th term of the Fibonacci sequence. Noting that ab = −1, so that 1/a = −b and 1/b = −a, we can also write this as

    x_n = ( (−a)^{n+1} − (−b)^{n+1} ) / (b − a).

Despite the square roots in the formula, this number is always an integer (it has to be!).
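A short check (mine, not in the notes) that the closed form really produces the integer Fibonacci numbers:

import math

phi = (1 + math.sqrt(5)) / 2   # = 1/a = -b, the golden ratio

def fib_closed(n):
    # x_n = (phi^(n+1) - (-1/phi)^(n+1)) / sqrt(5)
    return round((phi**(n + 1) - (-1/phi)**(n + 1)) / math.sqrt(5))

x = [1, 1]
for n in range(2, 12):
    x.append(x[-1] + x[-2])
print(x)
print([fib_closed(n) for n in range(12)])   # identical lists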

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, then the so-called distribution function, namely the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density

    f(x) = (d/dx) P(X ≤ x).

All these objects can be referred to as "distribution" or "law" of X. Even the generating function E s^X of a Z+-valued random variable X is an example. We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i: such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier. Warning: the verb "specify" should not be interpreted as "compute" in the sense of the existence of an analytical formula or algorithm.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × · · · × n for large n then, since the number is so big, it's better to use logarithms:

    n! = e^{log n!} = exp Σ_{k=1}^{n} log k.

Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that ∫_k^{k+1} log x dx ≈ log k, hence

    Σ_{k=1}^{n} log k ≈ ∫_1^n log x dx.

Since (d/dx)(x log x − x) = log x, it follows that

    ∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

    n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses,

we have exactly n heads equals

    (2n choose n) 2^{−2n} ≈ 1,

and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, one proves Stirling's approximation:

    n! ∼ n^n e^{−n} √(2πn),

where the symbol an ∼ bn means that an/bn → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:

    P(X − n ≥ m | X ≥ n) = P(X ≥ m),   m, n ≥ 0.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X itself. The memoryless property forces the distribution to be of the form

    P(X ≥ n) = λ^n,   n ∈ Z+,

where λ = P(X ≥ 1). We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the total number of visits

    J0 = Σ_{n=1}^{∞} 1(Sn = 0),

defined in (34). It was shown in Theorem 38 that J0 satisfies the memoryless property, for it was shown there that P(J0 ≥ k) = P(J0 ≥ 1)^k.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely, you have seen it before, so this little section is a reminder; it is based on the concept of monotonicity. Consider an increasing and nonnegative function g : R → R+. Then, for all x, y ∈ R,

    g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. We can capture both properties of g neatly by this single inequality. This completely trivial thought gives the very useful Markov's inequality: let X be a random variable; then g(X) ≥ g(x) 1(X ≥ x) for all x, and so, by taking expectations of both sides (expectation is an increasing operator),

    E g(X) ≥ g(x) P(X ≥ x).
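As a closing illustration (my addition, not in the notes), here is Markov's inequality checked on simulated data; the exponential test distribution, with EX = 1, is an arbitrary choice.

import random

rng = random.Random(5)
xs = [rng.expovariate(1.0) for _ in range(100_000)]   # EX = 1
mean = sum(xs) / len(xs)
for x in (1, 2, 4, 8):
    tail = sum(v >= x for v in xs) / len(xs)
    print(x, tail, mean / x)   # P(X >= x) <= EX/x holds, with room to spare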

