Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006–2009

Contents
1 Introduction
2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains
3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains
4 Stationarity
  4.1 Finding a stationary distribution
5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains
19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain
24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

A few starred sections should be considered "advanced" and can be omitted at first reading. Also, "important" formulae are placed inside a coloured box, and I tried to put all terms in blue small capitals whenever they are first encountered.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: mẍ = F. Here, x = x(t) denotes the position of the particle at time t. Its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

    xn+1 = yn
    yn+1 = rem(xn, yn),    n = 0, 1, 2, ....

The sequence of pairs thus produced, X0, X1, X2, ..., has the property that, for any n, the future pairs Xn+1, Xn+2, ... depend only on the pair Xn. In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1).
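This deterministic evolution is easy to watch on a computer; here is a minimal sketch in Python (the language and the sample input are our choices, not part of the notes) of the state sequence of the gcd dynamical system.

    # Euclid's algorithm viewed as a dynamical system on pairs of integers:
    # the next state (x_{n+1}, y_{n+1}) depends only on the current state (x_n, y_n).

    def gcd_orbit(a, b):
        """Return the sequence of states (x_n, y_n) until y_n = 0."""
        x, y = max(a, b), min(a, b)      # initial state
        states = [(x, y)]
        while y != 0:
            x, y = y, x % y              # x_{n+1} = y_n, y_{n+1} = rem(x_n, y_n)
            states.append((x, y))
        return states                    # the last state is (gcd(a, b), 0)

    print(gcd_orbit(1071, 462))          # ends with (21, 0): gcd(1071, 462) = 21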

In other words, past states are not needed: the present state contains all the information required to determine the future evolution. Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables, instead of deterministic objects.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫ab f(x)dx of a complicated function f: suppose that f is positive and bounded above by a constant B. Choose a pair (X0, Y0) of random numbers, a ≤ X0 ≤ b, 0 ≤ Y0 ≤ B, uniformly. Repeat with (Xn, Yn), for steps n = 1, 2, .... Each time count 1 if Yn < f(Xn) and 0 otherwise. Perform this a large number N of times. Count the number of 1's and divide by N. The resulting ratio, multiplied by the area (b − a)B of the enclosing rectangle, is an approximation of ∫ab f(x)dx. Try this on the computer!

¹ Stochastic means random.
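Here is one way to try it: a minimal Python sketch of the Monte Carlo method just described, with f(x) = x² on [0, 1] and the bound B = 1 chosen purely for illustration (these choices are ours, not the notes').

    import random

    def monte_carlo_integral(f, a, b, B, N=100_000):
        """Estimate the integral of f over [a, b], assuming 0 <= f <= B."""
        count = 0
        for _ in range(N):
            x = random.uniform(a, b)    # X_n, uniform on [a, b]
            y = random.uniform(0, B)    # Y_n, uniform on [0, B]
            if y < f(x):                # count 1 if the point falls below the graph
                count += 1
        # fraction of points under the graph, times the area of the rectangle
        return (count / N) * (b - a) * B

    print(monte_carlo_integral(lambda x: x**2, 0, 1, 1))   # about 1/3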

We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1 it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; it moves from 2 to 1 with probability β = 0.99, regardless of where it was at earlier times. We can summarise this information by the transition diagram:

    [Transition diagram: cells 1 and 2; 1 → 2 with probability α, 1 → 1 with 1 − α; 2 → 1 with probability β, 2 → 2 with 1 − β.]

Another way to summarise the information is by the 2×2 transition probability matrix

    P = ( 1−α    α  )  =  ( 0.95  0.05 )
        (  β    1−β )     ( 0.99  0.01 )

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)
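A quick simulation (a sketch of ours, not part of the notes; sample sizes are arbitrary) confirms the intuitive answer to Question 1 and gives a numerical hint for Question 2, which will be answered properly later.

    import random

    alpha, beta = 0.05, 0.99          # P(1 -> 2) and P(2 -> 1)

    def step(cell):
        if cell == 1:
            return 2 if random.random() < alpha else 1
        return 1 if random.random() < beta else 2

    # Question 1: mean time to move from cell 1 to cell 2 (about 1/alpha = 20)
    times = []
    for _ in range(10_000):
        cell, t = 1, 0
        while cell == 1:
            cell, t = step(cell), t + 1
        times.append(t)
    print(sum(times) / len(times))    # about 20 minutes

    # Question 2: long-run fraction of time spent in cell 1
    cell, visits, n = 1, 0, 1_000_000
    for _ in range(n):
        cell = step(cell)
        visits += (cell == 1)
    print(visits / n)                 # about 0.952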

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

    Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, ... are independent random variables with distribution FD, that S0, S1, S2, ... are independent random variables with distribution FS, and that X0 and all the Dn, Sn are independent. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is defined by:

    px,y := P(Xn+1 = y | Xn = x),    x, y = 0, 1, 2, ....

We easily see that if y > 0 then

    px,y = P(Dn − Sn = y − x)
         = Σz=0..∞ P(Sn = z, Dn − Sn = y − x)
         = Σz=0..∞ P(Sn = z, Dn = z + y − x)
         = Σz=0..∞ P(Sn = z) P(Dn = z + y − x)
         = Σz=0..∞ FS(z) FD(z + y − x),

something that, in principle, can be computed from FS and FD. Of course, if y = 0, we have

    px,0 = 1 − Σy=1..∞ px,y.

The transition probability matrix

    P = ( p0,0  p0,1  p0,2  ··· )
        ( p1,0  p1,1  p1,2  ··· )
        ( p2,0  p2,1  p2,2  ··· )
        (  ···   ···   ···  ··· )

is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

    [Diagram: the integers ..., −2, −1, 0, 1, 2, 3, ..., with probability 1/2 of stepping to each of the two neighbours.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
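A simulation sketch of ours (not part of the notes) for Question 4. For p = 1/2 the mean trip time from 0 to 100 turns out to be infinite, so single runs fluctuate wildly; the drifted case p > 1/2 is the instructive one to try, where repeated runs average roughly 100/(p − (1 − p)) steps.

    import random

    def steps_to_reach(target, p=0.5, start=0, max_steps=10**7):
        """Number of steps for the walk to reach `target`, or None if we give up."""
        x, n = start, 0
        while x != target:
            x += 1 if random.random() < p else -1
            n += 1
            if n >= max_steps:
                return None      # truncated: the walk can take a very long time
        return n

    # from the pub (0) to home (100) with a slight sense of direction:
    print(steps_to_reach(100, p=0.6))   # about 500 on average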

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

    [Figure: a square street grid. Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

    [Transition diagram: states H (healthy), S (sick), D (dead); H → S with probability 0.3, S → H with probability 0.8, H → D with probability 0.01, S → D with probability 0.1.]

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H, pS,S and pD,D. This is because pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
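The lifetime distribution can be computed numerically: the lifetime exceeds n months exactly when the chain stays within the living states {H, S} for n steps. Below is a sketch of ours (NumPy assumed; the probabilities are our reading of the diagram above) based on powers of the sub-matrix Q of transition probabilities among {H, S}.

    import numpy as np

    # Transition probabilities among the living states {H, S};
    # the missing mass in each row is the probability of jumping to D.
    Q = np.array([[0.69, 0.30],     # from H: stay H, become S   (p_HD = 0.01)
                  [0.80, 0.10]])    # from S: recover, stay S    (p_SD = 0.10)

    # P(lifetime > n | start healthy) = sum of the H-row of Q^n
    start = np.array([1.0, 0.0])
    surv = [start @ np.linalg.matrix_power(Q, n) @ np.ones(2) for n in range(600)]

    # P(lifetime = n) = P(lifetime > n-1) - P(lifetime > n); also,
    # E[lifetime] = sum over n of P(lifetime > n) (the tail-sum formula)
    print(surv[12])     # probability of surviving the first year
    print(sum(surv))    # approximate mean lifetime in months (truncated sum)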

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property. Usually, T = Z+ = {0, 1, 2, 3, ...} or N = {1, 2, 3, ...} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, ..., in+1 ∈ S,

    P(Xn+1 = in+1 | Xn = in, ..., X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, ..., Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain

    P(∀ k ∈ [n+1, n+m] Xk = ik | Xn = in, ..., X0 = i0) = P(∀ k ∈ [n+1, n+m] Xk = ik | Xn = in),

for any m and any choice of states, which precisely means that (Xn+1, ..., Xn+m) and (X0, ..., Xn−1) are independent conditionally on Xn. This is true for any n and m, and so the future is independent of the past given the present, viz. the Markov property. □

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

    P(X0 = i0, X1 = i1, X2 = i2, ..., Xn = in)
        = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) ··· P(Xn = in | Xn−1 = in−1),    (1)

which means that the two functions

    µ(i) := P(X0 = i), i ∈ S,    and    pi,j(n, n+1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,

will specify all joint distributions²:

    P(X0 = i0, X1 = i1, X2 = i2, ..., Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) ··· pin−1,in(n−1, n).

The function µ(i) is called initial distribution. The function pi,j(n, n+1) is called transition probability from state i at time n to state j at time n+1. More generally,

    pi,j(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by (1),

    pi,j(m, n) = Σi1∈S ··· Σin−1∈S pi,i1(m, m+1) pi1,i2(m+1, m+2) ··· pin−1,j(n−1, n),    (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

    P(m, n) := [pi,j(m, n)]i,j∈S,

i.e., by definition, a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

    P(m, n) = P(m, m+1) P(m+1, m+2) ··· P(n−1, n),

or as

    P(m, n) = P(m, ℓ) P(ℓ, n),    if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n+1) do not depend on n, and therefore we can simply write

    pi,j(n, n+1) =: pi,j,

and call pi,j the (one-step) transition probability from state i to state j, while

    P = [pi,j]i,j∈S

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

    P(m, n) = P × P × ··· × P (n − m times) = P^(n−m)

for all m ≤ n, as follows from (1) and the definition of P. The matrix notation is useful. For example, let µn := [µn(i) = P(Xn = i)]i∈S be the distribution of Xn, arranged in a row vector. Then we have

    µn = µ0 P^n,    (3)

and the semigroup property is now even more obvious, because

    P^(a+b) = P^a P^b,

for all integers a, b ≥ 0. From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

Notational conventions. Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi. In other words,

    δi(j) := 1, if j = i;  0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = pδi1 + (1 − p)δi2.

Starting from a specific state i. A useful convention used for time-homogeneous chains is to write

    Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write

    Ei Y := E(Y | X0 = i) = Σy y P(Y = y | X0 = i) = Σy y Pi(Y = y).
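These matrix formulas are easy to try numerically. A small sketch of ours (NumPy assumed), using the mouse chain of Section 2.1:

    import numpy as np

    P = np.array([[0.95, 0.05],
                  [0.99, 0.01]])       # transition matrix of the mouse chain

    mu0 = np.array([1.0, 0.0])         # the unit mass delta_1: start in cell 1

    # the distribution of X_n is the row vector mu_n = mu_0 P^n, as in (3)
    print(mu0 @ np.linalg.matrix_power(P, 10))

    # the semigroup property P^(a+b) = P^a P^b
    a, b = 3, 7
    lhs = np.linalg.matrix_power(P, a + b)
    rhs = np.linalg.matrix_power(P, a) @ np.linalg.matrix_power(P, b)
    print(np.allclose(lhs, rhs))       # True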

4 Stationarity

We say that the process (Xn, n = 0, 1, ...) is stationary if it has the same law as (Xn, n = m, m + 1, ...), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, i.e. selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa. Let Xn be the number of molecules in room 1 at time n. Then

    pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
    pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. (On the average, we indeed expect that we have N/2 molecules in each room, as it takes some time before the molecules diffuse from room to room.) But this will not work, because it is impossible for the number of molecules to remain constant at all times. Another guess then is: Place

10 . by (3). in ∈ S.i = P (X0 = i − 1)pi−1. find those distributions π such that πP = π.i + P (X0 = i + 1)pi+1. . i 0 ≤ i ≤ N. P (Xn = i) = π(i) for all n. Xm+2 = i2 . X2 = i2 . X1 = i1 . . . then we have created a stationary Markov chain. In other words. . 2−N i−1 i+1 N N (The dots signify some missing algebra. we are led to consider: The stationarity problem: Given [pij ].i = π(i − 1)pi−1. For the if part. . If we can do so. hence. P (X1 = i) = j P (X0 = j)pj. . Lemma 2.i = N N −i+1 N i+1 2−N + = · · · = π(i). . use (1) to see that P (X0 = i0 . . Indeed. the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . Xn = in ) = P (Xm = i0 . Xm+1 = i1 . . . The equations (4) are called balance equations. Since. a Markov chain is stationary if and only if the distribution of Xn does not change with n. . (a binomial distribution). But we need to prove this. n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. Xm+n = in ). for all m ≥ 0 and i0 . then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . stationary or invariant distribution. A Markov chain (Xn . .i + π(i + 1)pi+1.) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. Proof. The only if part is obvious. Changing notation. we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. (4) Such a π is called . which you should go through by yourselves. If we do so.each molecule at random either in room 1 or in room 2.

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1, because Σj∈S pi,j = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to that, π must be a probability:

    Σi∈S π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, 2, ..., d}. Formally, each x ∈ Sd is of the form x = (x1, ..., xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

    π(x) = 1/d!,    x ∈ Sd.

Example: waiting for a bus. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. The state space is S = N = {1, 2, ...}. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i ≥ 1; the times between successive buses are i.i.d. random variables. Then (Xn, n ≥ 0) is a Markov chain with transition probabilities

    pi+1,i = 1,    i ≥ 1,
    p1,i = p(i),    i ≥ 1.

To find the stationary distribution, write the equations (4):

    π(i) = Σj≥1 π(j) pj,i = π(1) p(i) + π(i+1),    i ≥ 1.

These can easily be solved:

    π(i) = π(1) Σj≥i p(j),    i ≥ 1.

To compute π(1), we use the normalisation condition Σi≥1 π(i) = 1, which gives

    π(1) = ( Σi≥1 Σj≥i p(j) )^(−1).

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

ASSUMPTION: Σi≥1 Σj≥i p(j) < ∞.

2. . So it goes beyond our theory. .e. We have thus defined a mapping ϕ from binary vectors into binary vectors. . So.). what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution. . any r = 1. So our assumption really means: ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite. the same is true at n + 1. Here ξ(n) = (ξ1 (n). . . But the function f (x) = min(2x. with P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 then. . .) Consider an infinite binary vector ξ = (ξ1 . . and any ε1 . 2. . for any n = 0. with P (ξr (0) = 1) = P (ξr (0) = 0) = 1/2 then the same will be true for all n. εr ∈ {0. i≥1 j≥i j≥1 i≥1 j≥1 and the latter sum is the expected time between two successive bus arrivals. 1] as a flexible bread dough arranged in a long rod) kneads the bread for it stretches it till it doubles in length and then folds it back in the middle. 2(1 − x)).d. . . . ξi ∈ {0. ξ2 . Remarks: This is not a Markov chain with countably many states. if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2 . As for the name of the chain (bread kneading) it can be justified as follows: to each ξ there corresponds a number x between 0 and 1 because ξ can be seen as the binary representation of the real number x. ξ2 (n).. ξ2 (n) = 1 − ε1 . . . ξ3 . .). . applied to the interval [0. . 1}. . . But to even think about it will give you some strength. We can show that if we start with the ξr (0). ξ(n + 1) = ϕ(ξ(n)). If ξ1 = 1 then x ≥ 1/2. (5) 12 . ξ2 (n) = ε1 .Let us work out the sum to see if this assumption can be expressed in more “physical” terms: p(j) = 1(j ≥ i)p(j) = jp(j). ξr (n + 1) = εr ) = P (ξ1 (n) = 0.i. .* (This is slightly beyond the scope of these lectures and can be omitted.) is the state at time n. r = 1. The Markov chain is obtained by applying the mapping ϕ again and again.). . . . . . 1} for all i. . So let us do so. i. and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2 . i. P (ξr (n) = 1) = P (ξr (n) = 0) = 1/2 The proof is by induction: if the ξ1 (n). ξr+1 (n) = εr ) + P (ξ1 (n) = 1. . from (5). . . 1. . 1 − ξ3 .i. are i. and the rule maps x into 2(1 − x). . You can check that . If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. P (ξ1 (n + 1) = ε1 . ξ2 (n). . . . Hence choosing the initial state at random in this manner results in a stationary Markov chain. . 1] (think of [0. . Example: bread kneading. . 2.d. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random... . ξr+1 (n) = 1 − εr ). .

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

    π = πP    (balance equations)
    π1 = 1    (normalisation condition).

In other words,

    π(i) = Σj∈S π(j) pj,i,    i ∈ S,
    Σj∈S π(j) = 1.

If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:

    F(A, Ac) := Σi∈A Σj∈Ac π(i) pi,j.

Think of π(i)pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

    F(A, Ac) = F(Ac, A),    for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

    π(i) = Σj∈S π(j) pj,i = Σj∈A π(j) pj,i + Σj∈Ac π(j) pj,i.

Multiply π(i) by Σj∈S pi,j (which equals 1):

    π(i) Σj∈A pi,j + π(i) Σj∈Ac pi,j = Σj∈A π(j) pj,i + Σj∈Ac π(j) pj,i.

This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek. □

    [Figure: schematic representation of the flow balance relations, F(A, Ac) = F(Ac, A).]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β. (Consequently, p11 = 1 − α, p22 = 1 − β.) Assume 0 < α, β < 1. We have

    π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

    π(1) = cβ,    π(2) = cα.

The constant c is found from π(1) + π(2) = 1. Thus,

    π(1) = β/(α + β),    π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952, π(2) = 0.05/1.04 ≈ 0.048. That is, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

    [Diagram: states 0, 1, 2, 3, 4, ...; from each state the chain moves one step to the right with probability p and one step to the left with probability q, with a barrier at 0.]

The balance equations are:

    π(i) = p π(i − 1) + q π(i + 1),    i = 1, 2, 3, ....

But we can make them simpler by choosing A = {0, 1, ..., i − 1}. If i ≥ 1 we have

    F(A, Ac) = π(i − 1) p = π(i) q = F(Ac, A).

These are much simpler than the previous ones. In fact, we can solve them immediately: for i ≥ 1,

    π(i) = (p/q)^i π(0).

Does this provide a stationary distribution? Yes, provided that Σi=0..∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is a set of vertices, and E a set of directed edges. In graph-theoretic terminology, S is the state space and E ⊂ S × S is defined by

    (i, j) ∈ E  ⇐⇒  pi,j > 0.

We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time. In other words,

    i leads to j (written as i ⇝ j)  ⇐⇒  Pi(∃ n ≥ 0 Xn = j) > 0.

Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 ··· pin−1,j > 0 for some n ∈ N and some states i1, ..., in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since

    0 < Pi(∃ n ≥ 0 Xn = j) ≤ Σn≥0 Pi(Xn = j),

the sum on the right must have a nonzero term, i.e. Pi(Xn = j) > 0 for some n, and so (iii) holds.
• Suppose that (iii) holds. Then (i) holds as well, because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j). Moreover, we have

    Pi(Xn = j) = Σi1,...,in−1 pi,i1 pi1,i2 ··· pin−1,j,

where the sum extends over all possible choices of states in S. Since Pi(Xn = j) > 0, we have that the sum is also positive, and so one of its terms is positive, meaning that (ii) holds.
• Suppose that (ii) holds. Look at the last display again: by assumption, the sum contains a nonzero term, hence Pi(Xn = j) > 0, and so (iii) holds. □

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

    i communicates with j (written as i ↭ j)  ⇐⇒  i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ↭ j ⇐⇒ j ↭ i).

Corollary 2. The relation ↭ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because ⇝ is so. □

Just as any equivalence relation, ↭ partitions S into equivalence classes known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

    [i] := {j ∈ S : j ↭ i}.    (6)

So, by definition, [i] = [j] if and only if i ↭ j. Two communicating classes are either identical or completely disjoint.

    [Figure: a chain with states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

    Σj∈C pi,j = 1,    for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: If i ∈ C and if i ⇝ j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ∈ C and i ⇝ j. We can then find states i1, ..., in−1 such that pi,i1 pi1,i2 ··· pin−1,j > 0. In particular, pi,i1 > 0. Since i ∈ C and C is closed, we have i1 ∈ C. Similarly, pi1,i2 > 0, and so i2 ∈ C, and so on. Hence, we conclude that j ∈ C. The converse is left as an exercise. □

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts. If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In graph terms: starting from any state, we can reach any other state. In other words, the state space S is a closed communicating class.

If a state i is such that pi,i = 1 then we say that i is an absorbing state. To put it otherwise, the single-element set {i} is a closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. Why?

A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. There can be arrows between two non-closed communicating classes, but they are always in the same direction.

    [Figure: the general structure of a Markov chain, with communicating classes C1, ..., C5. The classes C4, C5 are closed; the classes C1, C2, C3 are not.]

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i ⇝ j but not j ⇝ i.

5.4 Period

Consider the following chain:

    [Figure: four states 1, 2, 3, 4 arranged on a square; from each state the chain moves to one of its two neighbours.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by pij^(n) the entries of P^n. The period of an essential state i is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

    d(i) := gcd{n ∈ N : Pi(Xn = i) > 0} = gcd Di,    where Di := {n ∈ N : pii^(n) > 0}.

The period of an inessential state is not defined. A state i is called aperiodic if d(i) = 1.

Since P^(m+n) = P^m P^n, we have

    pij^(m+n) ≥ pik^(m) pkj^(n),    for all m, n ≥ 0 and all states i, j, k.

In particular, pii^(2n) ≥ (pii^(n))^2 and, more generally, pii^(ℓn) ≥ (pii^(n))^ℓ, ℓ ∈ N. Therefore, if pii^(n) > 0 then pii^(ℓn) > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if pii^(n) > 0, then pii^(m) > 0. So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

Theorem 2 (the period is a class property). If i ↭ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j, and let Dj = {n ∈ N : pjj^(n) > 0}, so that d(j) = gcd Dj. Since i ⇝ j there is α ∈ N with pij^(α) > 0, and since

j ⇝ i there is β ∈ N with pji^(β) > 0. So

    pjj^(α+β) ≥ pji^(β) pij^(α) > 0,

showing that α + β ∈ Dj, and therefore d(j) | α + β. Now let n ∈ Di. Then pii^(n) > 0, and so

    pjj^(α+n+β) ≥ pji^(β) pii^(n) pij^(α) > 0,

showing that α + n + β ∈ Dj. Therefore d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference. From the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j). □

Theorem 3. If i is an aperiodic state then there exists n0 such that pii^(n) > 0 for all n ≥ n0.

Proof. Pick n1, n2 ∈ Di such that n2 − n1 = 1 (such elements exist because gcd Di = 1 and Di is closed under addition). Let n be sufficiently large, and divide n by n1 to obtain n = q n1 + r, where the remainder r ≤ n1 − 1. Because n is sufficiently large, q − r > 0. Therefore

    n = q n1 + r(n2 − n1) = (q − r) n1 + r n2,

and since n1, n2 ∈ Di and Di is closed under addition, we have n ∈ Di as well. □

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic. It follows, for instance, that an irreducible chain with finitely many states is aperiodic if and only if there exists an n such that pij^(n) > 0 for all i, j ∈ S.

More generally, if an irreducible chain has period d then we can decompose the state space into d sets C0, C1, ..., Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, ..., Cd−1 such that

    Σj∈Cr+1 pij = 1,    i ∈ Cr,    r = 0, 1, ..., d − 1.

(Here Cd := C0.)

    [Figure: the internal structure of a closed communicating class with period d = 4; the chain moves cyclically through the classes C0, C1, C2, C3.]

Proof. Define the relation

    i ~d j  ⇐⇒  pij^(nd) > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into d disjoint equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, ..., id−1, with corresponding classes C0, C1, ..., Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0, then, necessarily, id ∈ C0.

We now show that if i belongs to one of these classes and if pij > 0 then, necessarily, j belongs to the next class. Take, for example, i ∈ C0, and suppose pij > 0 but j ∉ C1; say, j ∈ C2. Consider the path

    i0 → i → j → i2 → i3 → ··· → id → i0.

Such a path is possible because of the choice of the i0, i1, ..., id, and by the assumptions that i0 ~d i and j ~d i2. The existence of such a path implies that it is possible to go from i0 back to i0 in a number of steps which is congruent to d − 1 (mod d), hence not a multiple of d (why?), which contradicts the definition of d. □

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by

    TA := inf{n ≥ 0 : Xn ∈ A}.

We are interested in deriving formulae for the probability that this time is finite, as well as for its expectation. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

    [Figure: states 0, 1, 2; states 0 and 2 are absorbing (self-loop probability 1); from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6.]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define

    ϕ(i) := Pi(TA < ∞).

We then have:

Theorem 5. The function ϕ(i) satisfies

    ϕ(i) = 1,    i ∈ A,
    ϕ(i) = Σj∈S pij ϕ(j),    i ∉ A.

Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + T′A, where T′A is the remaining time until A is hit. We first have

    Pi(TA < ∞) = Σj∈S Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σj∈S Pi(T′A < ∞ | X1 = j) pij.

But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, ...). Therefore, conditionally on X1, the event {T′A < ∞} is independent of X0. Hence:

    P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j).

But the Markov chain is homogeneous, which implies that

    P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).

Combining the above, we have

    Pi(TA < ∞) = Σj∈S ϕ(j) pij,

as needed. For the second part of the proof, let ϕ′(i) be another solution. Then

    ϕ′(i) = Σj∈A pij + Σj∉A pij ϕ′(j).

By self-feeding this equation, we have

    ϕ′(i) = Σj∈A pij + Σj∉A pij ( Σk∈A pjk + Σk∉A pjk ϕ′(k) )
          = Σj∈A pij + Σj∉A Σk∈A pij pjk + Σj∉A Σk∉A pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i). □

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and we let ϕ(i) = Pi(T0 < ∞), i = 0, 1, 2. We immediately have ϕ(0) = 1, because it takes no time to hit A if the chain starts from A, and ϕ(2) = 0, because if the chain starts from 2 it will never hit 0. As for ϕ(1), we have

    ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

Next consider the mean time (mean number of steps) until the set A is hit for the first time,

    ψ(i) := Ei TA.

We have:

Theorem 6. The function ψ(i) satisfies

    ψ(i) = 0,    i ∈ A,
    ψ(i) = 1 + Σj∉A pij ψ(j),    i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0. If i ∉ A, then, as above, TA = 1 + T′A, where T′A is the remaining time until A is hit. Therefore,

    Ei TA = 1 + Ei T′A = 1 + Σj∈S pij Ej TA = 1 + Σj∉A pij ψ(j),

which is the second of the equations. The second part of the proof is omitted. □

Example: Continuing the previous example, let A = {0, 2}, and let ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, while

    ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3: therefore the mean number of flips until the first success is 3/2.)
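First-step analysis reduces to solving linear systems, which a computer does readily. A sketch of ours (NumPy assumed), redoing both examples for the three-state chain above:

    import numpy as np

    # states 0, 1, 2 of the example: 0 and 2 absorbing,
    # from 1: to 0 w.p. 1/2, stay at 1 w.p. 1/3, to 2 w.p. 1/6
    P = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1/3, 1/6],
                  [0.0, 0.0, 1.0]])
    n = 3

    # phi(i) = P_i(T_0 < infty): fix phi = 1 on A = {0} and phi = 0 at state 2
    # (which cannot lead to 0), and solve the first-step equation at state 1.
    lhs, rhs = np.eye(n), np.zeros(n)
    rhs[0] = 1.0
    lhs[1] = np.eye(n)[1] - P[1]      # phi(1) - sum_j p_1j phi(j) = 0
    print(np.linalg.solve(lhs, rhs))  # [1.0, 0.75, 0.0]

    # psi(i) = E_i T_A with A = {0, 2}: psi = 0 on A, psi(1) = 1 + sum ...
    lhs, rhs = np.eye(n), np.zeros(n)
    lhs[1] = np.eye(n)[1] - P[1]
    rhs[1] = 1.0
    print(np.linalg.solve(lhs, rhs))  # [0.0, 1.5, 0.0]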

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities

    ϕAB(i) := Pi(TA < TB).

It should be clear that the equations satisfied are:

    ϕAB(i) = 1,    i ∈ A,
    ϕAB(i) = 0,    i ∈ B,
    ϕAB(i) = Σj pij ϕAB(j),    otherwise.

Furthermore, ϕAB is the minimal solution to these equations.

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x a reward f(x) is earned. This happens up to the first hitting time TA of the set A. The total reward is f(X0) + f(X1) + ··· + f(XTA). We are interested in the mean total reward

    h(x) := Ex Σn=0..TA f(Xn).

Clearly,

    h(x) = f(x),    x ∈ A,

because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then, as argued earlier, TA = 1 + T′A, where T′A is the remaining time until A is hit. Hence

    h(x) = Ex Σn=0..1+T′A f(Xn) = f(x) + Ex Σn=1..1+T′A f(Xn) = f(x) + Ex Σn=0..T′A f(Xn+1),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, .... Hence, by the Markov property,

    Ex Σn=0..T′A f(Xn+1) = Σy∈S pxy h(y).

Thus, the set of equations satisfied by h is:

    h(x) = f(x),    x ∈ A,
    h(x) = f(x) + Σy∈S pxy h(y),    x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds. He does so unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die; why?)

    [Figure: a house with four rooms, 1, 2, 4, 3; room 2 communicates with rooms 1 and 3, and rooms 1 and 3 each open to the deadly room 4.]

The question is to compute the average total reward if he starts from room 2. If we let h(i) be the average total reward if he starts from room i, we have (each available door being chosen with probability 1/2):

    h(4) = 0,
    h(1) = 1 + (1/2) h(2),
    h(3) = 3 + (1/2) h(2),
    h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
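This linear system, too, can be handed to a computer. A minimal sketch of ours (NumPy assumed; the room-to-room transition probabilities below are our reading of the figure, consistent with the displayed equations):

    import numpy as np

    # rooms 1..4: room 2 leads to 1 or 3; rooms 1 and 3 lead to 2 or to the
    # fatal room 4; each available door is chosen with probability 1/2.
    P = np.array([[0.0, 0.5, 0.0, 0.5],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.0, 0.0, 0.0, 1.0]])
    f = np.array([1.0, 2.0, 3.0, 0.0])   # reward i in room i, nothing in room 4

    # h = f + P h on the live rooms {1, 2, 3}, with h(4) = 0:
    # solve (I - P_live) h = f_live
    live = [0, 1, 2]
    A = np.eye(3) - P[np.ix_(live, live)]
    print(np.linalg.solve(A, f[live]))   # [5. 8. 7.]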

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), and if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. Your fortune after the n-th toss is

    Xn = X0 + Σk=1..n Yk ξk.

The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss, so if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk. (Think why!) It can only depend on ξ1, ..., ξk−1 and, as is often the case in practice, on other considerations of, e.g., astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: an insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem, in the simplest case where Yk = 1 for all k: the gambler bets £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler. The gambler starts with X0 = x pounds and has a target: to make b pounds (b > x). She will stop playing if her money falls below a (a < x). We are clearly dealing with a Markov chain in the state space

    {a, a + 1, a + 2, ..., b}

with transition probabilities

    pi,i+1 = p,    pi,i−1 = q = 1 − p,    a < i < b,

and where the states a, b are absorbing. We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x. Consider the hitting time of state y:

    Ty := inf{n ≥ 0 : Xn = y}.

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

    Px(Tb < Ta) = ((q/p)^(x−a) − 1) / ((q/p)^(b−a) − 1),    if p ≠ q;
    Px(Tb < Ta) = (x − a)/(b − a),    if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let

    ϕ(x, ℓ) := Px(Tℓ < T0),    0 ≤ x ≤ ℓ.

The general solution will then be obtained from

    Px(Tb < Ta) = ϕ(x − a, b − a).

(Think why!) To solve the problem, write ϕ(x) instead of ϕ(x, ℓ). We obviously have

    ϕ(0) = 0,    ϕ(ℓ) = 1,

and, for 0 < x < ℓ, by first-step analysis,

    ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1),    0 < x < ℓ.

Since p + q = 1, we can write this as

    p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

    δ(x) = (q/p) δ(x − 1) = (q/p)^2 δ(x − 2) = ··· = (q/p)^x δ(0),    0 < x < ℓ.

Assuming that λ := q/p ≠ 1, we find

    ϕ(x) = δ(x − 1) + δ(x − 2) + ··· + δ(0) = [λ^(x−1) + ··· + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).

Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),    0 ≤ x ≤ ℓ,    λ ≠ 1.

If λ = 1 then q = p, and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so ϕ(x) = x/ℓ, 0 ≤ x ≤ ℓ. Summarising, we have obtained

    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1;    ϕ(x, ℓ) = x/ℓ, if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a. □

Corollary 4. The ruin probability of a gambler starting with £x is

    Px(T0 < ∞) = (q/p)^x, if p > 1/2;    Px(T0 < ∞) = 1, if p ≤ 1/2.

Proof. We have

    Px(T0 < ∞) = lim(ℓ→∞) Px(T0 < Tℓ) = 1 − lim(ℓ→∞) Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

    Px(T0 < ∞) = 1 − lim(ℓ→∞) (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

    Px(T0 < ∞) = 1 − lim(ℓ→∞) (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2, then

    Px(T0 < ∞) = 1 − lim(ℓ→∞) x/ℓ = 1 − 0 = 1. □
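The formula is easy to sanity-check by simulation; a quick sketch of ours (the numerical parameters are arbitrary choices):

    import random

    def prob_reach_b_first(x, a, b, p, trials=100_000):
        """Estimate P_x(T_b < T_a) for the simple gambler's chain."""
        wins = 0
        for _ in range(trials):
            y = x
            while a < y < b:
                y += 1 if random.random() < p else -1
            wins += (y == b)
        return wins / trials

    x, a, b, p = 5, 0, 10, 0.45
    lam = (1 - p) / p
    exact = (lam**(x - a) - 1) / (lam**(b - a) - 1)
    print(prob_reach_b_first(x, a, b, p), exact)   # both about 0.27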

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, ..., Xn) for all n ∈ Z+; in other words, whether {τ = n} has occurred can be decided by looking at the chain up to time n only. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, conditional on the present Xn, the future process (Xn+1, Xn+2, ...) is independent of the past process (X0, ..., Xn−1). We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain. Let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, ..., Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, ..., Xτ) if τ < ∞. Any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, ..., Xn). We are going to show that, for any such A,

    P(Xτ+1 = j1, Xτ+2 = j2, ..., A | Xτ = i, τ < ∞) = pi,j1 pj1,j2 ··· P(A | Xτ = i, τ < ∞).

We have:

    P(Xτ+1 = j1, Xτ+2 = j2, ..., A, Xτ = i, τ = n)
      (a) = P(Xn+1 = j1, Xn+2 = j2, ..., A, Xn = i, τ = n)
      (b) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
      (c) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i) P(Xn = i, A, τ = n)
      (d) = pi,j1 pj1,j2 ··· P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, ..., Xn), together with the Markov property (the ordinary splitting property at the deterministic time n), and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then dividing both sides by P(Xτ = i, τ < ∞). □

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain; the random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

    Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the hitting time of Section 6: our Ti here is always positive, whereas the one of Section 6 was allowed to take the value 0. If X0 ≠ i the two definitions coincide.

State i may or may not be visited again. Regardless, we set Ti^(1) := Ti, we let Ti^(2) be the time of the second visit, Ti^(3) the time of the third visit, and so on. Formally, we recursively define

    Ti^(r) := inf{n > Ti^(r−1) : Xn = i},    r = 2, 3, ....

All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

    Xi^(r) := (Xn, Ti^(r) ≤ n < Ti^(r+1)),    r = 1, 2, ....

Let also

    Xi^(0) := (Xn, 0 ≤ n < Ti^(1)).

We think of them as random trajectories, i.e. as "random variables" in a generalised sense, and we refer to them as excursions of the process.

    [Figure: a trajectory of the chain split at the successive visit times Ti^(1), Ti^(2), Ti^(3), Ti^(4) to state i. These excursions are independent!]

So Xi^(r) is the r-th excursion: the chain visits

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, ... are independent random variables. Moreover, for time-homogeneous Markov chains, the excursions X_i^{(1)}, X_i^{(2)}, ... have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at T_i^{(r)}: the SMP says that, given the state at the stopping time T_i^{(r)}, the future after T_i^{(r)} is independent of the past before T_i^{(r)}. But the state at T_i^{(r)} is, by definition, equal to i. Therefore, the future is independent of the past, and this proves the first claim. The claim that X_i^{(1)}, X_i^{(2)}, ... have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(X_i^{(0)}), g(X_i^{(1)}), g(X_i^{(2)}), ... of the excursions are independent random variables. In particular, g(X_i^{(1)}), g(X_i^{(2)}), ... are independent, and they are also identical in law (i.i.d.).

Example: Fix a state j and define
Λ_r := Σ_{T_i^{(r)} ≤ n < T_i^{(r+1)}} 1(X_n = j).
The meaning of Λ_r is this: it expresses the number of visits to the state j between two successive visits to the state i. The last corollary ensures us that Λ_1, Λ_2, ... are independent random variables.
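The regenerative structure shows up readily in simulation. The sketch below (the 3-state chain is an arbitrary illustration of ours) cuts a long trajectory at the successive visit times to i and computes the functionals Λ_r of the example above; their sample lag-1 correlation should be near zero, consistent with independence.

    import numpy as np

    rng = np.random.default_rng(1)
    P = np.array([[0.1, 0.6, 0.3],
                  [0.4, 0.2, 0.4],
                  [0.5, 0.3, 0.2]])   # illustrative 3-state chain

    x, path = 0, [0]
    for _ in range(100000):
        x = rng.choice(3, p=P[x])
        path.append(x)
    path = np.array(path)

    i, j = 0, 2
    visits = np.flatnonzero(path == i)    # the successive visit times to i
    # Lambda_r = number of visits to j during the r-th excursion from i
    lam = np.array([np.sum(path[a:b] == j)
                    for a, b in zip(visits[:-1], visits[1:])])

    print("number of excursions:", len(lam))
    print("mean visits to j per excursion:", lam.mean())
    # independence of excursions: the lag-1 sample correlation is near 0
    print("lag-1 correlation:", np.corrcoef(lam[:-1], lam[1:])[0, 1])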

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions. First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:
P_i(X_n = i for infinitely many n) = 1.
Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:
P_i(X_n = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, there could be, in principle, yet another possibility:
0 < P_i(X_n = i for infinitely many n) < 1.
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Suppose that the Markov chain starts from state i. Consider the total number of visits to state i (excluding the possibility that X_0 = i):
J_i = Σ_{n=1}^∞ 1(X_n = i) = sup{r ≥ 0 : T_i^{(r)} < ∞}.
Notice that: state i is visited infinitely many times ⟺ J_i = ∞. Notice also that
f_{ii} := P_i(J_i ≥ 1) = P_i(T_i < ∞).
We claim that:

Lemma 4. Starting from X_0 = i, the random variable J_i is geometric:
P_i(J_i ≥ k) = f_{ii}^k,   k ≥ 0.

Proof. Indeed, the event {J_i ≥ k} implies that T_i < ∞ and that, after T_i, there are at least k − 1 further visits to i. But the past before T_i is independent of the future, and that is why we get the product: by the strong Markov property at time T_i^{(k−1)},
P_i(J_i ≥ k) = P_i(J_i ≥ k − 1) P_i(J_i ≥ 1).
Therefore,
P_i(J_i ≥ k) = f_{ii} P_i(J_i ≥ k − 1) = f_{ii}^2 P_i(J_i ≥ k − 2) = ··· = f_{ii}^k.

Notice that f_{ii} = P_i(T_i < ∞) can either be equal to 1 or smaller than 1. If f_{ii} = 1, we have that P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1. This means that the state i is recurrent. If f_{ii} < 1, then P_i(J_i < ∞) = 1. This means that the state i is transient. Because either f_{ii} = 1 or f_{ii} < 1, we conclude what we asserted earlier, namely, that every state is either recurrent or transient.

We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. f_{ii} = P_i(T_i < ∞) = 1.
3. P_i(J_i = ∞) = 1.
4. E_i J_i = ∞.

Proof. The equivalence of the first two statements was proved above. The equivalence of the second and third statements follows from Lemma 4: P_i(J_i = ∞) = lim_{k→∞} f_{ii}^k, which equals 1 if and only if f_{ii} = 1. Finally, if P_i(J_i = ∞) = 1 then, clearly, E_i J_i = ∞; conversely, owing to the fact that J_i is a geometric random variable, its expectation cannot be infinite without J_i itself being infinite with probability one.

Lemma 6. The following are equivalent:
1. State i is transient.
2. f_{ii} = P_i(T_i < ∞) < 1.
3. P_i(J_i < ∞) = 1.
4. E_i J_i < ∞.

Proof. The equivalence of the first two statements was proved before the lemma. If f_{ii} < 1 then, since J_i is a geometric random variable, J_i cannot take the value +∞ with positive probability, therefore P_i(J_i < ∞) = 1, and, in fact, E_i J_i < ∞. Conversely, if E_i J_i < ∞ then J_i cannot be infinite with positive probability, i.e. P_i(J_i < ∞) = 1, which, by the definition of J_i, is equivalent to state i being visited finitely many times with probability 1, i.e. to transience.

Corollary 6. If i is a transient state then p_{ii}^{(n)} → 0, as n → ∞.

Proof. If i is transient then E_i J_i < ∞. But
E_i J_i = E_i Σ_{n=1}^∞ 1(X_n = i) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ p_{ii}^{(n)}.
But if a series is finite then the summands go to 0, i.e. p_{ii}^{(n)} → 0, as n → ∞.

10.1 First hitting time decomposition

Define
f_{ij}^{(n)} := P_i(T_j = n),   n ≥ 1,
i.e. the probability, assuming that the chain starts from state i, to hit state j for the first time at time n. Notice the difference with p_{ij}^{(n)}, the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, f_{ij}^{(n)} ≤ p_{ij}^{(n)}, but the two sequences are related in an exact way:

Lemma 7.
p_{ij}^{(n)} = Σ_{m=1}^n f_{ij}^{(m)} p_{jj}^{(n−m)},   n ≥ 1.

Proof.
p_{ij}^{(n)} = P_i(X_n = j) = Σ_{m=1}^n P_i(T_j = m, X_n = j) = Σ_{m=1}^n P_i(T_j = m) P_i(X_n = j | T_j = m).
The first factor equals f_{ij}^{(m)}, by definition. For the second factor, we use the Strong Markov Property:
P_i(X_n = j | T_j = m) = P_i(X_n = j | X_m = j) = p_{jj}^{(n−m)}.

Corollary 7. If j is a transient state then p_{ij}^{(n)} → 0, as n → ∞, for all states i.

Proof. If j is transient, we have Σ_{n=1}^∞ p_{jj}^{(n)} = E_j J_j < ∞. Now, by summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right hand side)
Σ_{n=1}^∞ p_{ij}^{(n)} = Σ_{m=1}^∞ f_{ij}^{(m)} Σ_{n=m}^∞ p_{jj}^{(n−m)} = (1 + E_j J_j) Σ_{m=1}^∞ f_{ij}^{(m)} = (1 + E_j J_j) P_i(T_j < ∞),
which is finite. Hence p_{ij}^{(n)} → 0, as n → ∞.
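Lemma 7 can be checked numerically. In the sketch below (the 3×3 matrix is an arbitrary illustration of ours) the first-passage probabilities f_{ij}^{(m)} are computed by a small dynamic program, using the recursion f_{kj}^{(m)} = Σ_{l ≠ j} p_{kl} f_{lj}^{(m−1)}, and the convolution Σ_m f_{ij}^{(m)} p_{jj}^{(n−m)} is compared against the matrix-power entry p_{ij}^{(n)}.

    import numpy as np

    P = np.array([[0.1, 0.6, 0.3],
                  [0.4, 0.2, 0.4],
                  [0.5, 0.3, 0.2]])   # illustrative chain
    i, j, N = 0, 2, 30

    # f[m, k] = P_k(T_j = m), via first-step analysis avoiding j
    f = np.zeros((N + 1, 3))
    f[1] = P[:, j]
    for m in range(2, N + 1):
        for k in range(3):
            f[m, k] = sum(P[k, l] * f[m - 1, l] for l in range(3) if l != j)

    # n-step transition matrices P^0, P^1, ..., P^N
    Pn = [np.eye(3)]
    for _ in range(N):
        Pn.append(Pn[-1] @ P)

    n = 20
    lhs = Pn[n][i, j]                                     # p_ij^(n)
    rhs = sum(f[m, i] * Pn[n - m][j, j] for m in range(1, n + 1))
    print(lhs, rhs)   # the two numbers agree, as Lemma 7 asserts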

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let f_{ij} := P_i(T_j < ∞), we have
i ⇝ j ⟺ f_{ij} > 0.
The last equivalence is simply an expression in symbols of what we said above.

We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j, then j is recurrent and, moreover,
f_{ij} = P_i(T_j < ∞) = 1.
In particular, j ⇝ i.

Proof. Start with X_0 = i. If i is recurrent and i ⇝ j, then there is a positive probability (= f_{ij}) that j will appear in one of the i.i.d. excursions X_i^{(0)}, X_i^{(1)}, ..., and so the probability that j will appear in a specific excursion is positive:
g_{ij} := P_i(T_j < T_i) > 0.
So the random variables
δ_{j,r} := 1(j appears in excursion X_i^{(r)}),   r = 0, 1, 2, ...,
are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence, not only f_{ij} > 0, but also f_{ij} = 1. In particular, starting from i, the chain visits j and then returns to i (each visit to j occurs inside an excursion from i), so j ⇝ i; and, considering the excursions between successive visits to j, the same argument shows that j is visited infinitely many times starting from j, i.e. j is recurrent.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ̸⇝ i. Since i ⇝ j, there is m such that P_i(A) > 0, where A := {X_m = j}. Let B := {X_n = i for infinitely many n}. Since j ̸⇝ i, we have
P_i(A ∩ B) = P_i(X_m = j, X_n = i for infinitely many n) = 0.
But then
0 < P_i(A) = P_i(A ∩ B) + P_i(A ∩ B^c) = P_i(A ∩ B^c) ≤ P_i(B^c).
Hence P_i(B) = P_i(X_n = i for infinitely many n) < 1, and so i is a transient state.

So if a communicating class is recurrent, then it contains no inessential states and so it is closed. Equivalently, if a communicating class contains a state which is transient then, necessarily, all other states are also transient. (Recall also that, by Theorem 10, if i is recurrent then every other j in the same communicating class satisfies i ⇝ j and j ⇝ i.)

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state, i say, must be visited infinitely many times. This state is recurrent. Hence, by the above, every other j ∈ C is also recurrent. Hence all states are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let T_i be the first return time to i, then either P_i(T_i < ∞) = 1 (in which case we say that i is recurrent) or P_i(T_i < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U_0, U_1, U_2, ... be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. (It should be clear that such variables can, for example, appear in a certain game.) Let
T := inf{n ≥ 1 : U_n > U_0}.
Thus T represents the number of variables we need to draw to get a value larger than the initial one. First, observe that T is finite. What is the expectation of T? We have
P(T > n) = P(U_1 ≤ U_0, ..., U_n ≤ U_0)
= ∫_0^1 P(U_1 ≤ x, ..., U_n ≤ x | U_0 = x) dx
= ∫_0^1 x^n dx = 1/(n+1).

Therefore,
P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n+1)) = 1,
so T is indeed finite with probability one. But
ET = Σ_{n=0}^∞ P(T > n) = Σ_{n=0}^∞ 1/(n+1) = ∞.
You are invited to think about the meaning of this: if the initial value U_0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!
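A quick simulation (ours, purely for illustration) makes this example vivid: each sampled T is finite, and the empirical tails match P(T > n) = 1/(n+1), yet the running sample means keep drifting upwards, as one expects from a finite random variable with infinite mean.

    import numpy as np

    rng = np.random.default_rng(2)

    def draw_T():
        u0, n = rng.random(), 1
        while rng.random() <= u0:   # keep drawing until we exceed U0
            n += 1
        return n

    samples = np.array([draw_T() for _ in range(200000)])
    # tails: P(T > n) should be close to 1/(n+1)
    for n in (1, 2, 5, 10):
        print(n, (samples > n).mean(), 1 / (n + 1))
    # sample means grow without settling: ET = infinity
    for k in (10**3, 10**4, 2 * 10**5):
        print(k, samples[:k].mean())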

We now classify a recurrent state i as follows. We say that i is positive recurrent if
E_i T_i < ∞.
We say that i is null recurrent if
E_i T_i = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is, by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading: it is not a Law; it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z_1, Z_2, ... be i.i.d. random variables.
(i) Suppose E|Z_1| < ∞ and let µ = EZ_1. Then
P( lim_{n→∞} (Z_1 + ··· + Z_n)/n = µ ) = 1.
(ii) Suppose E max(Z_1, 0) = ∞, but E min(Z_1, 0) > −∞. Then
P( lim_{n→∞} (Z_1 + ··· + Z_n)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X_0, X_1, X_2, ... of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions
X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, ...
forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that
G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), ...
are i.i.d. random variables, for reasonable choices of the functions G taking values in R. Note that
E G(X_a^{(1)}) = E G(X_a^{(2)}) = ··· = E[ G(X_a^{(0)}) | X_0 = a ] = E_a G(X_a^{(0)}),
because, if we start with X_0 = a, the initial excursion X_a^{(0)} also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X_0 = a. Whichever the choice of G, the law of large numbers tells us that
( G(X_a^{(1)}) + ··· + G(X_a^{(n)}) ) / n → E G(X_a^{(1)}),   (8)
with probability 1, as n → ∞.

Example 1: To each state x ∈ S, assign a reward f(x), which we can think of as a reward received when the chain is in state x. For an excursion X, define G(X) to be the maximum reward received during this excursion. Specifically,
G(X_a^{(r)}) = max{ f(X_t) : T_a^{(r)} ≤ t < T_a^{(r+1)} }.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:
G(X_a^{(r)}) = Σ_{T_a^{(r)} ≤ t < T_a^{(r+1)}} f(X_t).   (7)

13.2 Average reward

Consider now a function f : S → R, which, as in Example 1, we can think of as a reward received when the chain is in state x. We are interested in the existence of the limit
time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(X_k),
which represents the "time-average reward" received. In the sequel, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a T_a < ∞) and assume that the distribution of X_0 is such that P(T_a < ∞) = 1. Let f : S → R+ be a bounded "reward" function, say |f(x)| ≤ B. Then, with probability one, the "time-average reward" f̄ exists and is a deterministic quantity:
P( lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = f̄ ) = 1,
where
f̄ = E_a Σ_{n=0}^{T_a−1} f(X_n) / E_a T_a = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this. Suppose that t is large enough. We are looking for the total reward received between 0 and t. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, N_t, of complete excursions from the ground state a that occur between 0 and t:
N_t := max{ r ≥ 1 : T_a^{(r)} ≤ t }.
For example, in the figure below, N_t = 5.

[Figure: a timeline marked 0, T_a^{(1)}, T_a^{(2)}, T_a^{(3)}, T_a^{(4)}, T_a^{(5)}, t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_t^{first} + G_t^{last},
where G_r is given by (7), G_t^{first} is the reward up to time T_a = T_a^{(1)} or up to t (whichever is smaller), and G_t^{last} is the reward over the last incomplete cycle. Since P(T_a < ∞) = 1, and since we assumed that f is bounded, say |f(x)| ≤ B, we have |G_t^{first}| ≤ B T_a, and B T_a / t → 0, so that, with probability one,
(1/t) G_t^{first} → 0,   as t → ∞.
A similar argument shows that
(1/t) G_t^{last} → 0,   as t → ∞,
with probability one.

So we are left with the sum of the rewards over complete cycles:
(1/t) Σ_{r=1}^{N_t−1} G_r.   (9)

Look again at what the Law of Large Numbers (8) tells us: that
lim_{n→∞} (1/n) Σ_{r=1}^n G_r = E G_1,   (10)
with probability one. Now look at the sequence
T_a^{(r)} = T_a^{(1)} + Σ_{k=2}^r ( T_a^{(k)} − T_a^{(k−1)} ),
and see what the Law of Large Numbers tells us again: since the differences T_a^{(k)} − T_a^{(k−1)} are i.i.d. random variables with common expectation E_a T_a, and since T_a^{(1)}/r → 0, we have
lim_{r→∞} T_a^{(r)} / r = E_a T_a.   (11)

Since T_a^{(r)} → ∞, as r → ∞, it follows that the number N_t of complete cycles before time t will also be tending to ∞, as t → ∞. By definition, the visit with index r = N_t to the ground state a is the last one before t (see also the figure above):
T_a^{(N_t)} ≤ t < T_a^{(N_t+1)}.
Hence
T_a^{(N_t)} / N_t ≤ t / N_t < T_a^{(N_t+1)} / N_t.
But (11) tells us that
lim_{t→∞} T_a^{(N_t)} / N_t = lim_{t→∞} T_a^{(N_t+1)} / N_t = E_a T_a.
Hence
lim_{t→∞} N_t / t = 1 / E_a T_a.   (12)

Write now
(1/t) Σ_{r=1}^{N_t−1} G_r = (N_t / t) · (1/N_t) Σ_{r=1}^{N_t−1} G_r.
Replacing n by the subsequence N_t in (10), we have
lim_{t→∞} (1/N_t) Σ_{r=1}^{N_t−1} G_r = E G_1,
with probability one. Using this and (12) in (9), we have
lim_{t→∞} (1/t) Σ_{r=1}^{N_t−1} G_r = E G_1 / E_a T_a.
Thus
lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = E G_1 / E_a T_a =: f̄.

To see that f̄ is as claimed, just observe that
E G_1 = E Σ_{T_a^{(1)} ≤ t < T_a^{(2)}} f(X_t) = E_a Σ_{0 ≤ t < T_a} f(X_t),
from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion. The numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included); the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both: we may write
E G_1 = E_a Σ_{0 ≤ t < T_a} f(X_t)   or   E G_1 = E_a Σ_{0 < t ≤ T_a} f(X_t).

[Figure: the rewards f(X_0) = f(a), f(X_1), f(X_2), ..., f(X_{T_a−1}) collected over one excursion, between times 0 and T_a.]

Comment 2: A very important particular case occurs when we choose
f(x) := 1(x = b),
i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, for all b ∈ S,
lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = b) = E_a Σ_{k=0}^{T_a−1} 1(X_k = b) / E_a T_a,   (13)
with probability 1.

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence
lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = a) = 1 / E_a T_a,   (14)
with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes E_a T_a units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/E_a T_a.

Comment 4: Suppose that two states, b and a, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b T_b. The equality of these two limits gives:
(average no. of times state b is visited between two successive visits to state a) = E_a Σ_{k=0}^{T_a−1} 1(X_k = b) = E_a T_a / E_b T_b.
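The statements of Comments 2 and 3 are easy to confirm by simulation. The sketch below (an arbitrary irreducible 3-state chain of our choosing) compares the long-run fraction of time spent in a state b with 1/E_b T_b, the latter estimated from the observed return times to b.

    import numpy as np

    rng = np.random.default_rng(3)
    P = np.array([[0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.1, 0.3]])   # illustrative irreducible chain

    T = 200000
    x, path = 0, np.empty(T, dtype=int)
    for n in range(T):
        path[n] = x
        x = rng.choice(3, p=P[x])

    b = 1
    frac = (path == b).mean()             # long-run fraction of time at b
    visits = np.flatnonzero(path == b)
    mean_return = np.diff(visits).mean()  # estimate of E_b T_b
    print("fraction of time at b :", frac)
    print("1 / (mean return time):", 1 / mean_return)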

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
ν^[a](x) := E_a Σ_{n=0}^{T_a−1} 1(X_n = x),   x ∈ S,   (15)
where a is the fixed recurrent ground state whose existence was assumed⁵, should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then ν^[a](x)/E_a T_a is the long-run fraction of times that state x is visited. In view of this, ν^[a] should have something to do with the concept of stationary distribution discussed a few sections ago. Notice also that the choice of notation ν^[a] is justifiable through the notation (6) for the communicating class of the state a. Note: although ν^[a] depends on the choice of the "ground" state a, we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent⁵. Consider the function ν^[a] defined in (15). Then ν^[a] satisfies the balance equations:
ν^[a] = ν^[a] P,
i.e.
ν^[a](x) = Σ_{y∈S} ν^[a](y) p_{yx},   x ∈ S.
Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^[a](x) < ∞.

Proof. We start the Markov chain with X_0 = a. Recurrence tells us that T_a < ∞ with probability one. Since a = X_{T_a}, we can write
ν^[a](x) = E_a Σ_{n=1}^{T_a} 1(X_n = x).   (16)
This was a trivial but important step: we replaced the n = 0 term with the n = T_a term (both count a visit to x exactly when x = a, so there is nothing lost in doing so). The real reason for doing so is because we wanted to replace a function of X_0, X_1, X_2, ... by a function of X_1, X_2, X_3, ...,

⁵We stress that we do not need the stronger assumption of positive recurrence here.

and thus be in a position to apply first-step analysis, which is precisely what we do next:

ν^[a](x) = E_a Σ_{n=1}^∞ 1(X_n = x, n ≤ T_a)
= Σ_{n=1}^∞ P_a(X_n = x, n ≤ T_a)
= Σ_{n=1}^∞ Σ_{y∈S} P_a(X_n = x, X_{n−1} = y, n ≤ T_a)
= Σ_{n=1}^∞ Σ_{y∈S} p_{yx} P_a(X_{n−1} = y, n ≤ T_a)
= Σ_{y∈S} p_{yx} E_a Σ_{n=1}^{T_a} 1(X_{n−1} = y)
= Σ_{y∈S} p_{yx} ν^[a](y),

where, in the last step, we used (16): the sum Σ_{n=1}^{T_a} 1(X_{n−1} = y) counts the visits to y at times 0, 1, ..., T_a − 1, which is exactly ν^[a](y) in the form (15). So we have shown
ν^[a](x) = Σ_{y∈S} p_{yx} ν^[a](y),
which are precisely the balance equations.

Next, we show that ν^[a](x) is positive and finite when a ⇝ x. First note that, by definition, ν^[a](a) = 1 (between two successive visits to a, the state a itself is counted exactly once, at time 0). We just showed that ν^[a] = ν^[a] P; therefore
ν^[a] = ν^[a] P^n,   for all n ∈ N.
Since a ⇝ x, there exists n ≥ 1 such that p_{ax}^{(n)} > 0. Hence
ν^[a](x) ≥ ν^[a](a) p_{ax}^{(n)} = p_{ax}^{(n)} > 0.
From Theorem 10, x ⇝ a, so there exists m ≥ 1 such that p_{xa}^{(m)} > 0. Hence
1 = ν^[a](a) ≥ ν^[a](x) p_{xa}^{(m)},
so, indeed, ν^[a](x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then the total mass of ν^[a] is finite; indeed,
Σ_{x∈S} ν^[a](x) = E_a Σ_{n=1}^{T_a} Σ_{x∈S} 1(X_n = x) = E_a Σ_{n=1}^{T_a} 1 = E_a T_a,
which, by definition, is finite when a is positive recurrent.
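A minimal sketch (chain and ground state chosen arbitrarily for illustration) that estimates ν^[a] by Monte Carlo, averaging the occupation counts of excursions from a, and then checks the balance equations ν^[a] = ν^[a] P numerically.

    import numpy as np

    rng = np.random.default_rng(4)
    P = np.array([[0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.1, 0.3]])   # illustrative chain; ground state a = 0
    a, reps = 0, 30000

    counts = np.zeros(3)
    for _ in range(reps):
        # one excursion: count visits at times 0, 1, ..., Ta - 1
        x = a
        while True:
            counts[x] += 1
            x = rng.choice(3, p=P[x])
            if x == a:
                break

    nu = counts / reps              # estimate of nu^[a](x)
    print("nu     :", nu)           # note nu(a) = 1 by construction
    print("nu P   :", nu @ P)       # approximately equal to nu
    print("pi^[a] :", nu / nu.sum())  # = nu / Ea Ta, a stationary distribution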

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define
π^[a](x) := ν^[a](x) / E_a T_a = E_a Σ_{n=0}^{T_a−1} 1(X_n = x) / E_a T_a,   x ∈ S.   (17)
Then π^[a] is a stationary distribution.

Proof. We have already shown, in Proposition 2, that ν^[a] P = ν^[a]. But π^[a] is simply a multiple of ν^[a]; therefore, π^[a] P = π^[a]. It only remains to show that π^[a] is a probability, i.e. that it sums up to one. But
Σ_{x∈S} π^[a](x) = (1/E_a T_a) Σ_{x∈S} ν^[a](x) = 1,
since Σ_{x∈S} ν^[a](x) = E_a T_a. (Since a is positive recurrent, we have E_a T_a < ∞, and so π^[a] is not trivially equal to zero.)

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique. In other words: if there is a positive recurrent state, then the set of linear equations
π(x) = Σ_{y∈S} π(y) p_{yx},   x ∈ S,      Σ_{y∈S} π(y) = 1,
have at least one solution. Whether the solution is unique is a separate matter; this is what we will try to settle in the sequel.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state. Suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X_0 = i. We already know that j is recurrent (Theorem 10). Consider the excursions X_i^{(r)}, r = 0, 1, 2, .... The probability g_{ij} = P_i(T_j < T_i) that j occurs in a specific excursion is positive, and, since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads g_{ij} > 0; the coins are i.i.d.; if heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R_1 (respectively, R_2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the path from i; state j is visited during the excursions with indices R_1 and R_2; the stretch of path between the first and the second visit to j is covered by the excursions with indices R_1, R_1 + 1, R_1 + 2, ..., R_2.]

The time between the first and the second visit to j has the same law as T_j under P_j, and it is at most the sum of the durations of the excursions with indices R_1, ..., R_2. So we just need to show that
ζ := Σ_{r=1}^{R_2−R_1+1} Z_r
has finite expectation, where the Z_r are i.i.d. with the same law as T_i under P_i. But, by Wald's lemma, the expectation of this random variable equals
E_i ζ = (E_i T_i) × E_i(R_2 − R_1 + 1) = (E_i T_i) × (g_{ij}^{−1} + 1),
because R_2 − R_1 is a geometric random variable:
P_i(R_2 − R_1 = r) = (1 − g_{ij})^{r−1} g_{ij},   r = 1, 2, ....
Since g_{ij} > 0 and E_i T_i < ∞, we have that E_i ζ < ∞, and hence E_j T_j < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

An irreducible Markov chain is called
• recurrent, if one (and hence all) states are recurrent;
• positive recurrent, if one (and hence all) states are positive recurrent;
• null recurrent, if one (and hence all) states are null recurrent;
• transient, if one (and hence all) states are transient.

[Diagram: IRREDUCIBLE CHAIN splits into RECURRENT and TRANSIENT; RECURRENT splits into POSITIVE RECURRENT and NULL RECURRENT.]

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Diagram: three states 0, 1, 2; states 0 and 2 are absorbing (p_{00} = p_{22} = 1), while state 1 moves to 0 with probability 1/3, stays at 1 with probability 1/2, and moves to 2 with probability 1/6.]

Clearly, state 0 (and, likewise, state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many, infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by
π(0) = α,   π(1) = 0,   π(2) = 1 − α,
is a stationary distribution. This lack of communication between the states is, indeed, what prohibits uniqueness. (You should think why this is OBVIOUS!)
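The claim that every such π is stationary is a one-line computation; here is the check in code, for a few values of α.

    import numpy as np

    # the absorbing chain of the example: states 0 and 2 absorbing
    P = np.array([[1.0, 0.0, 0.0],
                  [1/3, 1/2, 1/6],
                  [0.0, 0.0, 1.0]])

    for alpha in (0.0, 0.3, 0.7, 1.0):
        pi = np.array([alpha, 0.0, 1.0 - alpha])
        print(alpha, np.allclose(pi @ P, pi))   # True: pi is stationary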

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution. Fix a ground state a. Since the chain is irreducible and positive recurrent, we have P_i(T_j < ∞) = 1, for all states i and j. Then
P(T_a < ∞) = Σ_{x∈S} π(x) P_x(T_a < ∞) = Σ_{x∈S} π(x) = 1.
In other words, the assumption of Theorem 14 holds if we start the chain from π. So consider the Markov chain (X_n, n ≥ 0) and suppose that P(X_0 = x) = π(x), i.e. we start the Markov chain from the stationary distribution π. Then we have P(X_n = x) = π(x), for all n and all x ∈ S. By (13), with probability one,
(1/t) Σ_{n=1}^t 1(X_n = x) → E_a Σ_{k=0}^{T_a−1} 1(X_k = x) / E_a T_a = π^[a](x),
as t → ∞, following Theorem 14. Hence, recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides, by a theorem of Analysis (Bounded Convergence Theorem):
E (1/t) Σ_{n=1}^t 1(X_n = x) → π^[a](x),   as t → ∞.   (18)
But
E (1/t) Σ_{n=1}^t 1(X_n = x) = (1/t) Σ_{n=1}^t P(X_n = x) = π(x),
for all t. Therefore, by (18), π(x) = π^[a](x), for all x ∈ S. Hence there is only one stationary distribution: an arbitrary stationary distribution π must be equal to the specific one π^[a].

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π^[a] = π^[b]. In particular, we obtain the cycle formula:
E_a Σ_{k=0}^{T_a−1} 1(X_k = x) / E_a T_a = E_b Σ_{k=0}^{T_b−1} 1(X_k = x) / E_b T_b,   x ∈ S.

Proof. Both π^[a] and π^[b] are stationary distributions, so they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,
π(a) = 1 / E_a T_a.

Proof. There is only one stationary distribution; hence π(a) = π^[a](a) = 1/E_a T_a, from the definition (17) of π^[a] (recall that ν^[a](a) = 1).

In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain. Assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C. Let π(·|C) be the distribution defined by restricting π to C:
π(x|C) := π(x)/π(C),   x ∈ C,
where π(C) = Σ_{y∈C} π(y). We can easily see that we thus obtain a Markov chain with state space C, with stationary distribution π(·|C). This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By the above corollary, π(i|C) = 1/E_i T_i. But π(i|C) > 0. Therefore, E_i T_i < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain and decompose its state space into communicating classes. Then to each positive recurrent communicating class C there corresponds a unique stationary distribution π^C, which assigns positive probability to each state in C.

This π^C is defined by picking a state (any state!) a ∈ C and letting π^C := π^[a].

Proposition 3. Any stationary distribution π is necessarily of the form
π = Σ_{C: C positive recurrent communicating class} α_C π^C,
where α_C ≥ 0, such that Σ_C α_C = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then x belongs to some positive recurrent class C (Corollary 13). For each such C, define the conditional distribution
π(x|C) := π(x)/π(C),   where π(C) := Σ_{x∈C} π(x).
Then (π(x|C), x ∈ C) is a stationary distribution for the chain restricted to C. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with α_C = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(X_n = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesaro averages (1/n) Σ_{k=1}^n P(X_k = x) converge to π(x). This is a weaker statement: a sequence of real numbers a_n may not converge, while its Cesaro averages (1/n)(a_1 + ··· + a_n) may converge; for example, take a_n = (−1)^n.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum; even so, we may say it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:
ẋ = −ax + b,   a > 0,   x ∈ R.
There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as an example in physics, or a financial example, or whatever. The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero:
x* = b/a.
If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If we start from some other initial condition then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: solution curves x(t) converging to the equilibrium level x* as t grows.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
X_{n+1} = ρ X_n + ξ_n,
where ξ_0, ξ_1, ξ_2, ... are given random variables. Specifically, we here assume that X_0, ξ_0, ξ_1, ... are mutually independent, that the ξ_n have the same law, and we define the stochastic sequence X_1, X_2, ... by means of the recursion above.

But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point. However, notice what happens when we solve the recursion; and we are in a happy situation here, for we can do it:
X_1 = ρX_0 + ξ_0
X_2 = ρX_1 + ξ_1 = ρ(ρX_0 + ξ_0) + ξ_1 = ρ²X_0 + ρξ_0 + ξ_1
X_3 = ρX_2 + ξ_2 = ρ³X_0 + ρ²ξ_0 + ρξ_1 + ξ_2
······
X_n = ρⁿX_0 + ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + ··· + ρξ_{n−2} + ξ_{n−1}.

Assume |ρ| < 1. Then the first term, ρⁿX_0, goes to 0 as n → ∞. The second term,
X̃_n := ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + ··· + ρξ_{n−2} + ξ_{n−1},
which is a linear combination of ξ_0, ..., ξ_{n−1}, does not converge. But notice that if we let
X̂_n := ρ^{n−1}ξ_{n−1} + ρ^{n−2}ξ_{n−2} + ··· + ρξ_1 + ξ_0
(the same combination, with the order of the ξ's reversed), then, since the ξ's are i.i.d., we have
P(X̃_n ≤ x) = P(X̂_n ≤ x),   uniformly in x.
The nice thing is that, under nice conditions (e.g. assume that the ξ_n are bounded), X̂_n does converge, as n → ∞, to
X̂_∞ = Σ_{k=0}^∞ ρ^k ξ_k.
What this means is that, while the limit of X_n does not exist, the limit of its distribution does:
lim_{n→∞} P(X_n ≤ x) = lim_{n→∞} P(X̃_n ≤ x) = lim_{n→∞} P(X̂_n ≤ x) = P(X̂_∞ ≤ x).
It follows that |ρ| < 1 is the condition for stability. This is precisely the reason why we cannot, for such systems, define stability in a way other than convergence of the distributions of the random variables.
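A short simulation (our illustration; Gaussian noise and ρ = 0.8 are arbitrary choices) shows exactly this behaviour: individual trajectories keep fluctuating, but the empirical distribution of X_n freezes for large n.

    import numpy as np

    rng = np.random.default_rng(5)
    rho, n_steps, n_paths = 0.8, 200, 50000

    # many independent runs of X_{n+1} = rho * X_n + xi_n, with X_0 = 0
    X = np.zeros(n_paths)
    snapshots = {}
    for n in range(1, n_steps + 1):
        X = rho * X + rng.normal(size=n_paths)
        if n in (10, 100, 200):
            snapshots[n] = (X.mean(), X.var())

    # for Gaussian noise, X_n converges in law to N(0, 1/(1 - rho^2))
    for n, (m, v) in snapshots.items():
        print(n, round(m, 3), round(v, 3),
              "limit variance:", round(1 / (1 - rho**2), 3))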

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,
lim_{n→∞} P(X_n = x) = π(x),   x ∈ S.

In other words, for time-homogeneous Markov chains, the n-step transition matrix Pⁿ converges, as n → ∞, to a matrix with rows all equal to π:
lim_{n→∞} p_{ij}^{(n)} = π(j),   i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law. In other words, we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define
P(X = x, Y = y) = P(X = x) P(Y = y).
But suppose that we have some further requirement; for instance, to try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.
f(x) = Σ_y h(x, y),   g(y) = Σ_x h(x, y),
and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (X_n, n ≥ 0) and the law of a stochastic process Y = (Y_n, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

We say that a random time T is a meeting time of the two processes if
X_n = Y_n   for all n ≥ T.
This T is a random variable that takes values in the set of times, or it may take the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,
|P(X_n = x) − P(Y_n = x)| ≤ P(T > n).   (19)
If, in particular, T is finite with probability one, i.e. P(T < ∞) = 1, then
|P(X_n = x) − P(Y_n = x)| → 0,   uniformly in x ∈ S, as n → ∞.

Proof. Suppose that we have two stochastic processes as above and suppose that they have been coupled. We have
P(X_n = x) = P(X_n = x, n < T) + P(X_n = x, n ≥ T)
= P(X_n = x, n < T) + P(Y_n = x, n ≥ T)
≤ P(n < T) + P(Y_n = x),
and hence
P(X_n = x) − P(Y_n = x) ≤ P(T > n).
Repeating the argument, but with the roles of X and Y interchanged, we obtain
P(Y_n = x) − P(X_n = x) ≤ P(T > n).
Combining the two inequalities, we obtain the first assertion:
|P(X_n = x) − P(Y_n = x)| ≤ P(T > n),   for all n ≥ 0 and x ∈ S.
Since this holds for all x ∈ S, and since the right hand side does not depend on x, we can write it as
sup_{x∈S} |P(X_n = x) − P(Y_n = x)| ≤ P(T > n).

Assume now that P(T < ∞) = 1. Then
lim_{n→∞} P(T > n) = P(T = ∞) = 0.
Therefore,
sup_{x∈S} |P(X_n = x) − P(Y_n = x)| → 0,   as n → ∞.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let p_{ij} denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (X_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_{ij} and X_0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Y_n, n ≥ 0), has the law of a Markov chain with the same transition probabilities p_{ij}, but with Y_0 distributed according to the stationary distribution π. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen.

It is very important to note that we have only defined the law of X and the law of Y; we have not defined a joint law. In particular, we have not defined joint probabilities such as P(X_n = x, Y_m = y). We now couple them by doing the straightforward thing: we assume that X = (X_n, n ≥ 0) is independent of Y = (Y_n, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
T := inf{n ≥ 0 : X_n = Y_n}.

Consider the process
W_n := (X_n, Y_n).
Then (W_n, n ≥ 0) is a Markov chain with state space S × S. Its initial state W_0 = (X_0, Y_0) has distribution
P(W_0 = (x, y)) = µ(x) π(y).
Its (1-step) transition probabilities are
q_{(x,y),(x′,y′)} := P( W_{n+1} = (x′, y′) | W_n = (x, y) )
= P(X_{n+1} = x′ | X_n = x) P(Y_{n+1} = y′ | Y_n = y) = p_{x,x′} p_{y,y′},
and its n-step transition probabilities are
q^{(n)}_{(x,y),(x′,y′)} = p^{(n)}_{x,x′} p^{(n)}_{y,y′}.

(W_n, n ≥ 0) is an irreducible chain: from Theorem 3 and the aperiodicity assumption, we have that p^{(n)}_{x,x′} > 0 and p^{(n)}_{y,y′} > 0 for all large n, implying that q^{(n)}_{(x,y),(x′,y′)} > 0 for all large n.

Notice that
σ(x, y) := π(x) π(y),   (x, y) ∈ S × S,
is a stationary distribution for (W_n, n ≥ 0). (Indeed, had we started with both X_0 and Y_0 independent and distributed according to π, then, for all n, X_n and Y_n would be independent and distributed according to π.)

Since π(x) > 0 for all x ∈ S, we have σ(x, y) > 0 for all (x, y) ∈ S × S. Hence (W_n, n ≥ 0), being an irreducible chain possessing a stationary distribution, is positive recurrent (Corollary 13). In particular, it reaches any fixed diagonal state (x, x) in finite time with probability one, and so
P(T < ∞) = 1.

Define now a third process by:
Z_n := X_n 1(n < T) + Y_n 1(n ≥ T),
i.e. Z_n equals X_n before the meeting time of X and Y, and, after the meeting time, Z_n = Y_n. Obviously, Z_0 = X_0, which has an arbitrary distribution µ. Moreover, (Z_n, n ≥ 0) is a Markov chain with transition probabilities p_{ij}, identical in law to (X_n, n ≥ 0): at time T the two processes are at the same state, so switching from X to Y does not alter the law. In particular,
P(Z_n = x) = P(X_n = x),   for all n and x.

Clearly, T is a meeting time between Z and Y. By the coupling inequality,
|P(Z_n = x) − P(Y_n = x)| ≤ P(T > n),
which converges to zero, precisely because P(T < ∞) = 1. Since P(Z_n = x) = P(X_n = x), and since P(Y_n = x) = π(x) for all n, we conclude that
|P(X_n = x) − π(x)| ≤ P(T > n) → 0,   uniformly in x, as n → ∞,
and this proves the theorem.

[Figure: X starts from an arbitrary distribution, Y from the stationary distribution; after the meeting time T, the third process Z follows Y.]
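The proof suggests an experiment, sketched below with an arbitrary aperiodic chain of ours: run two independent copies, one from a point mass and one from π, record their first meeting time T, and check that the total variation distance between the law of X_n and π is dominated by P(T > n), as the coupling inequality asserts.

    import numpy as np

    rng = np.random.default_rng(6)
    P = np.array([[0.5,  0.5,  0.0],
                  [0.25, 0.5,  0.25],
                  [0.0,  0.5,  0.5]])   # illustrative aperiodic chain

    # stationary distribution: left eigenvector of P for eigenvalue 1
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi = pi / pi.sum()

    N, horizon = 5000, 30
    tail = np.zeros(horizon)              # estimates of P(T > n)
    for _ in range(N):
        x, y, T = 0, rng.choice(3, p=pi), None
        for n in range(horizon):
            if T is None and x == y:
                T = n
            if T is None:
                tail[n] += 1              # event {T > n}
            x = rng.choice(3, p=P[x])
            y = rng.choice(3, p=P[y])
    tail /= N

    Pn = np.linalg.matrix_power(P, 10)
    tv = 0.5 * np.abs(Pn[0] - pi).sum()   # TV distance at n = 10, from state 0
    print("TV(law of X_10, pi) =", tv, "  P(T > 10) ~", tail[10])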

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_{ij} are the transition probabilities of an irreducible chain X, and if Y is an independent copy of X, then the product chain W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and p_{01} = p_{10} = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and its transition probabilities are
q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.
This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, setting p_{00} = 0.01, p_{01} = 0.99, p_{10} = 1, then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p_{01} = p_{10} = 1. Suppose X_0 = 0. Then X_n = 0 for even n and X_n = 1 for odd n. Therefore p_{00}^{(n)} = 1 for even n and = 0 for odd n. Thus, lim_{n→∞} p_{00}^{(n)} does not exist. However, if we modify the chain by p_{00} = 0.01, p_{01} = 0.99, p_{10} = 1, then
lim_{n→∞} p_{00}^{(n)} = lim_{n→∞} p_{10}^{(n)} = π(0),   lim_{n→∞} p_{11}^{(n)} = lim_{n→∞} p_{01}^{(n)} = π(1),
where π(0), π(1) satisfy
0.99 π(0) = π(1),   π(0) + π(1) = 1,
i.e. π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,
lim_{n→∞} [ p_{00}^{(n)}  p_{01}^{(n)} ; p_{10}^{(n)}  p_{11}^{(n)} ] = [ 0.503  0.497 ; 0.503  0.497 ].
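Both regimes are visible immediately from matrix powers; a two-line check (our own) follows.

    import numpy as np

    periodic = np.array([[0.0, 1.0],
                         [1.0, 0.0]])
    aperiodic = np.array([[0.01, 0.99],
                          [1.0,  0.0]])

    for n in (10, 11, 50, 51):
        print(n,
              np.linalg.matrix_power(periodic, n)[0, 0],
              np.round(np.linalg.matrix_power(aperiodic, n)[0, 0], 4))
    # the first column oscillates between 0 and 1; the second settles near 0.503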

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong".⁶ The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981), and this is expressed by our terminology. It is, however, customary to have a consistent terminology.

⁶We put "wrong" in quotes. Terminology cannot be, per se, wrong.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. By the results of Section 14, we know we have a unique stationary distribution π, and that π(x) > 0 for all states x. Suppose that the period of the chain is d. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into d classes C_0, C_1, ..., C_{d−1}, such that the chain moves cyclically between them: from C_0 to C_1, from C_1 to C_2, ..., and from C_{d−1} back to C_0. It is not difficult to see that:

Lemma 8. The chain (X_{nd}, n ≥ 0) is a Markov chain whose state space consists of the d closed communicating classes C_0, ..., C_{d−1}, and all states have period 1 for this chain, i.e. (X_{nd}, n ≥ 0) is aperiodic.

Hence, if we restrict the chain (X_{nd}, n ≥ 0) to any of the sets C_r, we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply. We shall show that the (unique) stationary distribution of (X_{nd}, n ≥ 0), restricted to C_r, is dπ(i), i ∈ C_r, where π is the stationary distribution of the original chain. The key step is:

Lemma 9. Let α_r := Σ_{i∈C_r} π(i). Then α_r = 1/d, for all r = 0, 1, ..., d − 1.

Proof. We have that π satisfies the balance equations:
π(i) = Σ_{j∈S} π(j) p_{ji},   i ∈ S.
Suppose i ∈ C_r. Then p_{ji} = 0 unless j ∈ C_{r−1}. So
π(i) = Σ_{j∈C_{r−1}} π(j) p_{ji},   i ∈ C_r.
Summing up over i ∈ C_r, we obtain
α_r = Σ_{i∈C_r} π(i) = Σ_{j∈C_{r−1}} π(j) Σ_{i∈C_r} p_{ji} = Σ_{j∈C_{r−1}} π(j) = α_{r−1}.
Since the α_r add up to 1, each one must be equal to 1/d.

Consequently, restricted to C_r, the probabilities π(i)/α_r = dπ(i), i ∈ C_r, form the (unique) stationary distribution of the sampled chain. This means that:

Theorem 18. Suppose that p_{ij} are the transition probabilities of a positive recurrent irreducible chain with period d and stationary distribution π. Let i, j ∈ C_r. Then
p_{ij}^{(nd)} = P_i(X_{nd} = j) → dπ(j),   as n → ∞.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ C_k, j ∈ C_ℓ, and let r = ℓ − k (where r is taken to be a number between 0 and d − 1, after reduction modulo d). Then, starting from i, the original chain enters C_ℓ in r steps and, thereafter, if sampled every d steps, it remains in C_ℓ. Hence
p_{ij}^{(r+nd)} → dπ(j),   as n → ∞.

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_{ij}^{(n)}, as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_{ij}^{(n)} → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_{ij}^{(n)} → 0 as n → ∞, for all i.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state, then
p_{ij}^{(n)} → 0,   as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution on N,
p^(n) = (p_1^{(n)}, p_2^{(n)}, p_3^{(n)}, ...),
where Σ_{i=1}^∞ p_i^{(n)} = 1 and p_i^{(n)} ≥ 0 for all i. Then we can find a sequence n_1 < n_2 < n_3 < ··· of integers such that lim_{k→∞} p_i^{(n_k)} exists for all i ∈ N.

Proof. Since the numbers p_1^{(n)}, n = 1, 2, ..., are contained between 0 and 1, there must be a subsequence n_k^1, k = 1, 2, ..., such that lim_{k→∞} p_1^{(n_k^1)} exists and equals, say, p_1. Now consider the numbers p_2^{(n_k^1)}, k = 1, 2, .... Since they are contained between 0 and 1, there must be a subsequence n_k^2 of n_k^1 such that lim_{k→∞} p_2^{(n_k^2)} exists and equals, say, p_2. It is clear how we continue: the sequence n_k^3 of the third row is a subsequence of the sequence n_k^2 of the second row, chosen so that p_3^{(n_k^3)} converges to, say, p_3; and so on. Schematically:

p_1^{(n_k^1)} → p_1
p_2^{(n_k^2)} → p_2   (n_k^2 a subsequence of n_k^1)
p_3^{(n_k^3)} → p_3   (n_k^3 a subsequence of n_k^2)
···

Now pick the numbers in the diagonal. By construction, the diagonal subsequence (n_k^k, k = 1, 2, ...) is a subsequence of each (n_k^i, k = 1, 2, ...), for each i. Hence
p_i^{(n_k^k)} → p_i,   as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit⁷:

⁷See Brémaud (1999).

Lemma 11. If (y_i, i ∈ N) and (x_i^{(n)}, i ∈ N), n = 1, 2, ..., are sequences of nonnegative numbers, then
Σ_{i=1}^∞ y_i lim inf_{n→∞} x_i^{(n)} ≤ lim inf_{n→∞} Σ_{i=1}^∞ y_i x_i^{(n)}.

Proof of Theorem 19. Let j be a null recurrent state. Suppose that, for some i, the sequence p_{ij}^{(n)} does not converge to 0. Then we can find a subsequence n_k such that
lim_{k→∞} p_{ij}^{(n_k)} = α_j > 0.
Consider the probability vector (p_{ix}^{(n_k)}, x ∈ S). By Lemma 10, we can pick a subsequence (n'_k) of (n_k) such that
lim_{k→∞} p_{ix}^{(n'_k)} = α_x,   for all x ∈ S.
Since Σ_{x∈S} p_{ix}^{(n'_k)} = 1 for each k, we have, by Lemma 11,
0 < Σ_{x∈S} α_x ≤ 1.
But P^{n+1} = Pⁿ P, i.e.
p_{ix}^{(n'_k+1)} = Σ_{y∈S} p_{iy}^{(n'_k)} p_{yx},
so, by Lemma 11,
lim_{k→∞} p_{ix}^{(n'_k+1)} ≥ Σ_{y∈S} lim_{k→∞} p_{iy}^{(n'_k)} p_{yx} = Σ_{y∈S} α_y p_{yx}.
Hence
α_x ≥ Σ_{y∈S} α_y p_{yx},   x ∈ S.
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that it is a strict inequality:
α_{x_0} > Σ_{y∈S} α_y p_{yx_0},   for some x_0 ∈ S.
This would imply that
Σ_{x∈S} α_x > Σ_{x∈S} Σ_{y∈S} α_y p_{yx} = Σ_{y∈S} α_y Σ_{x∈S} p_{yx} = Σ_{y∈S} α_y,
and this is a contradiction (the number Σ_x α_x cannot be strictly larger than itself). Thus,
α_x = Σ_{y∈S} α_y p_{yx},   x ∈ S.
Since 0 < Σ_{x∈S} α_x ≤ 1, we have that
π(x) := α_x / Σ_{y∈S} α_y,   x ∈ S,
satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0, because α_j > 0. By Corollary 13, state j is positive recurrent: contradiction. Hence p_{ij}^{(n)} → 0 for all i.

19.2 The general case

It remains to see what the limit of p_{ij}^{(n)} is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Let j be a positive recurrent state with period 1. Let C be the communicating class of j, and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, and let
H_C := inf{n ≥ 0 : X_n ∈ C}
be the first hitting time of the set C. Let
ϕ^C(i) := P_i(H_C < ∞).

Theorem 20. Under the above assumptions,
lim_{n→∞} p_{ij}^{(n)} = ϕ^C(i) π^C(j).

Proof. If i also belongs to C, then ϕ^C(i) = 1 and the result follows from Theorem 17. In general, since the chain must enter C before it can be at j, we have
p_{ij}^{(n)} = P_i(X_n = j) = P_i(X_n = j, H_C ≤ n) = Σ_{m=1}^n P_i(H_C = m) P_i(X_n = j | H_C = m).
Let µ be the distribution of X_{H_C}, given that X_0 = i and that {H_C < ∞}. From the Strong Markov Property, we have
P_i(X_n = j | H_C = m) = P_µ(X_{n−m} = j).
But Theorem 17 tells us that the limit exists regardless of the initial distribution:
P_µ(X_{n−m} = j) → π^C(j),   as n → ∞.
Hence
lim_{n→∞} p_{ij}^{(n)} = π^C(j) Σ_{m=1}^∞ P_i(H_C = m) = π^C(j) P_i(H_C < ∞),
which is what we need.

Remark: We can write the result of Theorem 20 as
lim_{n→∞} p_{ij}^{(n)} = f_{ij} / E_j T_j,
where T_j = inf{n ≥ 1 : X_n = j} and f_{ij} = P_i(T_j < ∞). Indeed, π^C(j) = 1/E_j T_j, and f_{ij} = ϕ^C(i). In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: five states. States 1 and 2 form a closed class, with p_{11} = 1 − γ, p_{12} = γ, p_{21} = 1. State 3 moves to 2 with probability 1 − α and to 4 with probability α; state 4 moves to 3 with probability β and to 5 with probability 1 − β; state 5 is absorbing.]

The communicating classes are C_1 = {1, 2} (closed class), C_2 = {5} (absorbing state), and I = {3, 4} (inessential states). Both C_1 and C_2 are aperiodic classes. The stationary distributions associated with the two closed classes are:
π^{C_1}(1) = 1/(1+γ),   π^{C_1}(2) = γ/(1+γ),   π^{C_2}(5) = 1.

Let ϕ(i) := ϕ^{C_1}(i) be the probability that class C_1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1 and ϕ(5) = 0, while, from first-step analysis,
ϕ(3) = (1 − α)ϕ(2) + αϕ(4),
ϕ(4) = (1 − β)ϕ(5) + βϕ(3),
whence
ϕ(3) = (1 − α)/(1 − αβ),   ϕ(4) = β(1 − α)/(1 − αβ).
Hence, by Theorem 20, as n → ∞,
p_{31}^{(n)} → ϕ(3)π^{C_1}(1),   p_{32}^{(n)} → ϕ(3)π^{C_1}(2),   p_{33}^{(n)} → 0,   p_{34}^{(n)} → 0,   p_{35}^{(n)} → (1 − ϕ(3))π^{C_2}(5).
Similarly,
p_{41}^{(n)} → ϕ(4)π^{C_1}(1),   p_{42}^{(n)} → ϕ(4)π^{C_1}(2),   p_{43}^{(n)} → 0,   p_{44}^{(n)} → 0,   p_{45}^{(n)} → (1 − ϕ(4))π^{C_2}(5).
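These limits are easily confirmed by raising the transition matrix to a high power. The sketch below uses α = β = γ = 1/2 (an arbitrary choice of ours) and the transition probabilities as read off the diagram above.

    import numpy as np

    a = b = g = 0.5   # alpha, beta, gamma
    # states 1..5 mapped to indices 0..4
    P = np.array([[1-g,   g, 0, 0,   0],
                  [1,     0, 0, 0,   0],
                  [0,   1-a, 0, a,   0],
                  [0,     0, b, 0, 1-b],
                  [0,     0, 0, 0,   1]])

    phi3 = (1 - a) / (1 - a*b)
    phi4 = b * (1 - a) / (1 - a*b)
    pi1, pi2 = 1/(1+g), g/(1+g)

    Pn = np.linalg.matrix_power(P, 200)
    print("row 3 :", np.round(Pn[2], 4))
    print("theory:", np.round([phi3*pi1, phi3*pi2, 0, 0, 1-phi3], 4))
    print("row 4 :", np.round(Pn[3], 4))
    print("theory:", np.round([phi4*pi1, phi4*pi2, 0, 0, 1-phi4], 4))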

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, for example, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville.

We will avoid giving a precise definition of the term "ergodic", and only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem; and so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state; here we only assume recurrence. Consider the quantity ν^[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X_0 such that P(T_a < ∞) = 1. Let f be a reward function such that
Σ_{x∈S} |f(x)| ν^[a](x) < ∞.   (20)
Then
P( lim_{t→∞} Σ_{n=0}^t f(X_n) / Σ_{n=0}^t 1(X_n = a) = f̄ ) = 1,
where
f̄ := Σ_{x∈S} f(x) ν^[a](x).   (21)

Proof. As in Theorem 14, we break the total reward into rewards over complete excursions plus a first and a last term:
Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_t^{first} + G_t^{last},
where G_r = Σ_{T_a^{(r)} ≤ s < T_a^{(r+1)}} f(X_s), while G_t^{first}, G_t^{last} are the total rewards over the first and the last incomplete cycles, respectively. As in the proof of Theorem 14, we have
lim_{t→∞} (1/t) G_t^{first} = lim_{t→∞} (1/t) G_t^{last} = 0,
with probability one; the proof of this is left as an exercise (the reader is asked to revisit the proof of Theorem 14). The Law of Large Numbers tells us that
lim_{n→∞} (1/n) Σ_{r=1}^n G_r = E G_1,
with probability 1, as long as E|G_1| < ∞; but this is precisely guaranteed by the condition (20). We only need to check that E G_1 = f̄, and this is left as an exercise. Finally, note that the denominator Σ_{n=0}^t 1(X_n = a) equals the number N_t of visits to state a up to time t; since N_t → ∞, replacing n by the subsequence N_t, we obtain the result:
Σ_{n=0}^t f(X_n) / Σ_{n=0}^t 1(X_n = a) → E G_1 = f̄.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a. If a is positive recurrent then, as we saw in the proof of Theorem 14, N_t/t → 1/E_a T_a, with probability one. So if we divide by t the numerator and the denominator of the ratio in the theorem, the denominator converges to 1/E_a T_a and the numerator to f̄/E_a T_a = Σ_x f(x) ν^[a](x)/E_a T_a, which is the quantity denoted by f̄ in Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent, then the Ergodic Theorem (Theorem 14) holds, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. First: there are always recurrent states. Indeed, since the number of states is finite, at least one of the states must be visited infinitely often. Next, from Proposition 2 we know that the balance equations ν = νP have at least one nontrivial solution, namely ν = ν^[a] for a recurrent state a. Since the number of states is finite, we have
C := Σ_{i∈S} ν(i) < ∞.

Therefore,
π(i) := ν(i)/C,   i ∈ S,
satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1, i.e. π is a stationary distribution. Hence, by Corollary 13, a finite Markov chain always has positive recurrent states. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent, with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have Pⁿ → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

Proof. By the above, there is a stationary distribution π. At least some state i must have π(i) > 0, and, by Corollary 13, this state must be positive recurrent. Because positive recurrence is a class property (Theorem 15) and the chain is irreducible, all states are positive recurrent. By Theorem 16, the stationary distribution is unique. If, in addition, the chain is aperiodic, the convergence Pⁿ → Π follows from Theorem 17.

22 Time reversibility

Consider a Markov chain (X_n, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that
(X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0),   for all n.
Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. In other words, watching a reversible chain is like playing a film backwards and noticing no difference: if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse; but if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, without us being able to tell that it is, actually, running backwards.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0), for all n. In particular, the first components of the two vectors must have the same distribution: X_n has the same distribution as X_0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:
π(i) p_{ij} = π(j) p_{ji},   i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X_0, X_1) has the same distribution as (X_1, X_0), i.e.
P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j),

for all i and j. But, since the chain is stationary (Lemma 12), P(X_0 = i, X_1 = j) = π(i)p_{ij} and P(X_0 = j, X_1 = i) = π(j)p_{ji}. So the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. The DBE imply immediately that (X_0, X_1) has the same distribution as (X_1, X_0). Next, using the DBE repeatedly,
π(i) p_{ij} p_{jk} = [π(j) p_{ji}] p_{jk} = [π(j) p_{jk}] p_{ji} = [π(k) p_{kj}] p_{ji} = π(k) p_{kj} p_{ji},
and this means that
P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i),
i.e. (X_0, X_1, X_2) has the same distribution as (X_2, X_1, X_0). The same logic works for any n and can be worked out by induction: (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0), for all n. Hence the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i_1, i_2, ..., i_m ∈ S and all m ≥ 3,
p_{i_1 i_2} p_{i_2 i_3} ··· p_{i_{m−1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m−1}} ··· p_{i_2 i_1}.   (22)

Proof. Suppose first that the chain is reversible. Then the DBE hold. Pick 3 states i, j, k and write, using the DBE repeatedly:
π(i) p_{ij} p_{jk} p_{ki} = [π(j) p_{ji}] p_{jk} p_{ki} = [π(j) p_{jk}] p_{ji} p_{ki} = [π(k) p_{kj}] p_{ji} p_{ki} = [π(k) p_{ki}] p_{ji} p_{kj} = [π(i) p_{ik}] p_{ji} p_{kj} = π(i) p_{ik} p_{kj} p_{ji}.
Since the chain is irreducible, we have π(i) > 0 for all i. Hence, cancelling the π(i) from the above display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix two states i_1, i_2 and sum up (22) over all choices of the intermediate states i_3, i_4, ..., i_m; we obtain
p_{i_1 i_2} p^{(m−1)}_{i_2 i_1} = p^{(m−1)}_{i_1 i_2} p_{i_2 i_1}.
If the chain is aperiodic, then, letting m → ∞,
p^{(m−1)}_{i_2 i_1} → π(i_1),   p^{(m−1)}_{i_1 i_2} → π(i_2),
implying that
π(i_1) p_{i_1 i_2} = π(i_2) p_{i_2 i_1}.
These are the DBE, and hence, by Theorem 23, the chain is reversible.

Notes:
1) A necessary condition for reversibility is that p_{ij} > 0 if and only if p_{ji} > 0. In other words, if there is an arrow from i to j in the graph of the chain, but no arrow from j to i, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio p_{ij}/p_{ji} can be written as a product of a function of i and a function of j, i.e. that we have separation of variables:
p_{ij}/p_{ji} = f(i) g(j).

) Necessarily. Hence the original chain is reversible. Ac ) = F (Ac . And hence π can be found very easily.1) which gives π(i)pij = π(j)pji . then we need. so we have pij g(j) = . j for which pij > 0. and by replacing. The balance equations are written in the form F (A. to guarantee. i. for each pair of distinct states i.(Convention: we form this ratio only when pij > 0. The reason is as follows. j for which pij > 0. and the arrow from j to j is the only one from Ac to A. the detailed balance equations. and. There is obviously only one arrow from AtoAc . the two oriented edges from i to j and from j to i by a single edge without arrow. Form the graph by deleting all self-loops. using another method. Hence the chain is reversible. If we remove the link between them. f (i)g(i) must be 1 (why?). • If the chain has infinitely many states. Example 1: Chain with a tree-structure. there would be a loop. So g satisfies the balance equations. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. in addition. pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . namely the arrow from i to j. If the graph thus obtained is a tree. 60 . Pick two distinct states i. consider the chain: Replace it by: This is a tree. then the chain is reversible. that the chain is positive recurrent. A) (see §4. the graph becomes disconnected. (If it didn’t. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. Ac be the sets in the two connected components. there isn’t any.e. For example.) Let A. meaning that it has no loops. by assumption.

i∈S where |E| is the total number of edges. j are not neighbours then both sides are 0. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). Suppose the graph is connected. 0. so both members are equal to C. We claim that π(i) = Cδ(i). the DBE hold. a finite set. For example. where C is a constant. If i. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. π(i)pij = π(j)pji . etc.Example 2: Random walk on a graph. To see this let us verify that the detailed balance equations hold. 61 . There are two cases: If i. and a set E of undirected edges. States j are i are called neighbours if they are connected by an edge. i∈S Since δ(i) = 2|E|. Now define the the following transition probabilities: pij = 1/δ(i). Consider an undirected graph G with vertices S. For each state i define its degree δ(i) = number of neighbours of i. A RWG is a reversible chain. p02 = 1/4. if j is a neighbour of i otherwise. 2.e. Hence we conclude that: 1. The constant C is found by normalisation. i. In all cases. we have that C = 1/2|E|. Each edge connects a pair of distinct vertices (states) without orientation. The stationary distribution of a RWG is very easy to find. π(i) = 1. The stationary distribution assigns to a state probability which is proportional to its degree. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3.

1. t ≥ 0) be independent from (Xn ) with S0 = 0. . i. then it is a stationary distribution for the new chain. this process is also a Markov chain and it does have time-homogeneous transition probabilities. where pij are the m-step transition probabilities of the original chain. k = 0. in addition. if nk = n0 + km. Let (St . π is a stationary distribution for the original chain. . how can we create new ones? 23. random variables with values in N and distribution λ(n) = P (ξ0 = n).e. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . .) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . (m) (m) 23. n ≥ 0) be Markov with transition probabilities pij . 23. k = 0.i.e. Let A be (r) a set of states and let TA be the r-th visit to A. . . . 1. 2. then (Yk . k = 0. . where ξt are i. . .2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. in general. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). .1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. 2. 1.) has the Markov property but it is not time-homogeneous.d. . . k = 0. . In this case. . . 2. Define Yr := XT (r) .3 Subordination Let (Xn . St+1 = St + ξt . t ≥ 0. irreducible and positive recurrent). . . The state space of (Yr ) is A. 1. 2. A r = 1. if.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). If the subsequence forms an arithmetic progression. The sequence (Yk . 62 n ∈ N. 2. Due to the Strong Markov Property. i.e.

We need to show that P (Yn+1 = β | Yn = α. . that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). n≥0 63 . Indeed. α1 .) More generally. knowledge of f (Xn ) implies knowledge of Xn and so.. . . j.4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . conditional on Yn the past is independent of the future. .. .e. .Then Yt := XSt . a Markov chain. . however. f is a one-to-one function then (Yn ) is a Markov chain. . in general. and the elements of S ′ by α. ∞ λ(n)pij . . Fix α0 . . Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). . . is NOT. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. (n) 23. . If. β. β). Yn−1 ) = P (Yn+1 = β | Yn = α). Proof. we have: Lemma 13. then the process Yn = f (Xn ). αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . Yn−1 = αn−1 }. (Notation: We will denote the elements of S by i. i. Then (Yn ) has the Markov property. .

and so (Yn ) is. β ∈ Z+ . Yn−1 )P (Xn = i | Yn = α. and if i = 0. depends on i only through |i|. In other words. Thus (Yn ) is Markov with transition probabilities Q(α. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. regardless of whether i > 0 or i < 0. Yn−1 ) Q(f (i). β)P (Xn = i | Yn = α. β). observe that the function | · | is not one-to-one. β) P (Yn = α|Yn−1 ) i∈S = Q(α. if β = α ± 1.e. Indeed. We have that P (Yn+1 = β|Xn = i) = Q(|i|. On the other hand. P (Xn+1 = i ± 1|Xn = i) = 1/2. 64 . because the probability of going up is the same as the probability of going down. i. then |Xn+1 | = 1 for sure. β) P (Yn = α|Xn = i. But P (Yn+1 = β | Yn = α. β) i∈S P (Yn = α|Xn = i. if i = 0. Yn = α. if β = 1. β). Yn−1 ) = i∈S P (Yn+1 = β | Xn = i.for all choices of the variables involved. α > 0 Q(α. indeed. To see this. i ∈ Z. α = 0. i ∈ Z. β). Yn−1 ) Q(f (i). for all i ∈ Z and all β ∈ Z+ . Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). β) P (Yn = α. Example: Consider the symmetric drunkard’s random walk. Then (Yn ) is a Markov chain. Define Yn := |Xn |. β) := 1. a Markov chain with transition probabilities Q(α. if we let 1/2. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. But we will show that P (Yn+1 = β|Xn = i). the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. conditional on {Xn = i}. β).

(and independent of the initial population size X0 –a further assumption). Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. . independently from one another. Hence all states i ≥ 1 are inessential. which has a binomial distribution: i j pij = p (1 − p)i−j .d. Thus. Let ξk be the number of offspring of the k-th individual in the n-th generation. . 0 But 0 is an absorbing state. If we let ξ (n) = ξ1 . we have that (Xn . and 0 is always an absorbing state. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring. . 1. the population cannot grow: it will eventually reach 0. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. therefore transient. Therefore. and. n ≥ 0) has the Markov property. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn .1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. .24 24. so we will find some different method to proceed. p0 = 1 − p. We find the population (n) at time n + 1 by adding the number of offspring of each individual. . a hard problem. with probability 1. ξ2 . in general. Back to the general case. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. . The parent dies immediately upon giving birth. j In this case. if the current population is i there is probability pi that everybody gives birth to zero offspring. . Then P (ξ = 0) + P (ξ ≥ 2) > 0. each individual gives birth to at one child with probability p or has children with probability 1 − p. Example: Suppose that p1 = p. then we see that the above equation is of the form Xn+1 = f (Xn . Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. But here is an interesting special case.i. ξ (n) ). . 2. We let pk be the probability that an individual will bear k offspring. one question of interest is whether the process will become extinct or not. . In other words. if the 65 (n) (n) . if 0 is never reached then Xn → ∞. k = 0.. are i. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. 0 ≤ j ≤ i. and following the same distribution. So assume that p1 = 1. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. Hence all states i ≥ 1 are transient. The state 0 is irrelevant because it cannot be reached from anywhere. It is a Markov chain with time-homogeneous transitions. since ξ (0) . ξ (1) .

. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. i. each one behaves independently of one another. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. Obviously. For each k = 1. Let ε be the extinction probability of the process. and P (Dk ) = ε. Proof. . let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. where ε = P1 (D). Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. when we start with X0 = i. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. We then have that D = D1 ∩ · · · ∩ Di . . If µ ≤ 1 then the ε = 1. since Xn = 0 implies Xm = 0 for all m ≥ n. Indeed. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. we can argue that ε(i) = εi . so P (D|X0 = i) = εi .population does not become extinct then it must grow to infinity. Indeed. ϕn (0) = P (Xn = 0). We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . if we start with X0 = i in ital ancestors. We wish to compute the extinction probability ε(i) := Pi (D). This function is of interest because it gives information about the extinction probability ε. Therefore the problem reduces to the study of the branching process starting with X0 = 1. and. For i ≥ 1. Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. . ε(0) = 1. The result is this: Proposition 5. 66 .

Therefore ε = ϕ(ε). then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. ϕn (0) ≤ ε∗ . ϕn+1 (0) = ϕ(ϕn (0)). and since the latter has only two solutions (z = 0 and z = ε∗ ). Since both ε and ε∗ satisfy z = ϕ(z). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). all we need to show is that ε < 1. But ϕn (0) → ε. Therefore ε = ε∗ . So ε ≤ ε∗ < 1. But 0 ≤ ε∗ . in fact. ε = ε∗ . if µ > 1. Let us show that. if the slope µ at z = 1 is ≤ 1. and so on. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). since ϕ is a continuous function. By induction.Note also that ϕ0 (z) = 1. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). Also ϕn (0) → ε. Notice now that µ= d ϕ(z) dz . Therefore. and. In particular. recall that ϕ is a convex function with ϕ(1) = 1. This means that ε = 1 (see left figure below). ϕ(ϕn (0)) → ϕ(ε).) 67 . ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). ϕn+1 (z) = ϕ(ϕn (z)). the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . z=1 Also. On the other hand. But ϕn+1 (0) → ε as n → ∞. (See right figure below.

there is no time to make a prior query whether a host has something to transmit or not. then time slots 0. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. So. Xn+1 = Xn + An . During this time slot assume there are a new packets that need to be transmitted. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. on the next times slot. is that if more than one hosts attempt to transmit a packet. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Case: µ > 1. 68 . Ethernet has replaced Aloha. only one of the hosts should be transmitting at a given time. . . 2. 3. k ≥ 0. If collision happens. 3 hosts. 1. say. there is collision and the transmission is unsuccessful. then the transmission is not successful and the transmissions must be attempted anew. We assume that the An are i. If s ≥ 2. Then ε = 1. there are x + a − 1 packets. Then ε < 1. periodically. 5.d. The problem of communication over a common channel. using the same frequency. . So if there are. on the next times slot.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel).i. 4. and Sn the number of selected packets. One obvious solution is to allocate time slots to hosts in a specific order. for a packet to go through. there is a collision and. 1. 0). 3. Abandoning this idea. .1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. Since things happen fast in a computer network. 6. if Sn = 1. there are x + a packets. otherwise. Let s be the number of selected packets. 2. random variables with some common distribution: P (An = k) = λk . 24. If s = 1 there is a successful transmission and. in a simplified model. 1. Nowadays. then max(Xn + An − 1. Thus. . we let hosts transmit whenever they like. 2. are allocated to hosts 1. 3. An the number of new packets on the same time slot. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). ..

then the chain is transient. we have that the drift is positive for all large x and this proves transience. We here only restrict ourselves to the computation of this expected change. we think as follows: to have a reduction of x by 1. Unless µ = 0 (which is a trivial case: no arrivals!). we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. So: px. So we can make q x G(x) as small as we like by choosing x sufficiently large.x−1 + k≥1 kpx. The probabilities px. The main result is: Proposition 6. selections are made amongst the Xn + An packets. . We also let µ := k≥1 kλk be the expectation of An . . and p.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . k 0 ≤ k ≤ x + a. Therefore ∆(x) ≥ µ/2 for all large x.x+k for k ≥ 1. Xn−1 .e. and exactly one selection: px. the Markov chain (Xn ) is transient. where G(x) is a function of the form α + βx .where λk ≥ 0 and k≥0 kλk = 1. So P (Sn = k|Xn = x. where q = 1 − p. for example we can make it ≤ µ/2.) = x + a k x−k p q . we must have no new packets arriving. if Xn + An = 0.x are computed by subtracting the sum of the above from 1. Let us find its transition probabilities.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . The process (Xn . so q x G(x) tends to 0 as x → ∞. To compute px. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). . The reason we take the maximum with 0 in the above equation is because. Sn = 1|Xn = x) = λ0 xpq x−1 . defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. An−1 .x−1 when x > 0. Proving this is beyond the scope of the lectures.x−1 = P (An = 0. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. For x > 0. n ≥ 0) is a Markov chain. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). Idea of proof. we have ∆(x) = (−1)px. The Aloha protocol is unstable.e. The random variable Sn has. and so q x tends to 0 much faster than G(x) tends to ∞. we have q < 1. i. Indeed. An = a. 69 . Since p > 0. we should not subtract 1. To compute px. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . k ≥ 1.

The idea is simple. For each page i let L(i) be the set of pages it links to. So we make the following modification. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. then i is a ‘dangling page’. if L(i) = ∅   0. There is a technique. Then it says: page i has higher rank than j if π(i) > π(j). So if Xn = i is the current location of the chain. the chain moves to a random page. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. If L(i) = ∅. It uses advanced techniques from linear algebra to speed up computations. In an Internet-dominated society. We pick α between 0 and 1 (typically. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their 70 . otherwise. Comments: 1. if j ∈ L(i)  pij = 1/|V |. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. neither that it is aperiodic. |V | These are new transition probabilities. 2. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. Its implementation though is one of the largest matrix computations in the world. because of the size of the problem. known as ‘302 Google jacking’. Therefore. So this is how Google assigns ranks to pages: It uses an algorithm. Define transition probabilities as follows:  1/|L(i)|. the rank of a page may be very important for a company’s profits. that allows someone to manipulate the PageRank of a web page and make it much higher. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. All you need to do to get high rank is pay (a lot of money). 3. unless i is a dangling page.24. there are companies that sell high-rank pages.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. PageRank.2) and let α pij := (1 − α)pij + . α ≈ 0. in the latter case. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. It is not clear that the chain is irreducible.

all we need is the information outlined in this figure. a gene controlling a specific characteristic (e. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). colour of eyes) appears in two forms. This means that the external appearance of a characteristic is a function of the pair of alleles. We will also assume that the parents are replaced by the children. This number could be independent of the research quality but. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. To simplify. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. from the point of view of an administrator. we will assume that. When two individuals mate. then. We would like to study the genetic diversity. 4. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. Imagine that we have a population of N individuals who mate at random producing a number of offspring.g. The alleles in an individual come in pairs and express the individual’s genotype for that gene. We have a transition from x to y if y alleles of type A were produced. it makes the evaluation procedure easy. We will assume that a child chooses an allele from each parent at random. What is important is that the evaluation can be done without ever looking at the content of one’s research. the genetic diversity depends only on the total number of alleles of each type. 5. in each generation. And so on.4 The Wright-Fisher model: an example from biology In an organism. 24. Since we don’t care to keep track of the individuals’ identities. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. But an AA mating with a Aa may produce a child of type AA or one of type Aa. 71 . two AA individuals will produce and AA child. meaning how many characteristics we have. this allele is of type A with probability x/2N . One of the circle alleles is selected twice. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). generation after generation. If so. a dominant (say A) and a recessive (say a). Thus. The process repeats. One of the square alleles is selected 4 times. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. maintain this allele for the next generation. their offspring inherits an allele from each parent. Three of the square alleles are never selected. Do the same thing 2N times.faculty from each other (and from themselves).

p00 = p2N. . since. The result is correct. These are: M ϕ(0) = 0. so both 0 and 2N are absorbing states. . . E(Xn+1 |Xn . . but the argument needs further justification because “the end of time” is not a deterministic time. M n P (ξk = 0|Xn . . . i. . . X0 ) = 1 − Xn . there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. .e. . The fact that M is even is irrelevant for the mathematical model.d. Since selections are independent.Each allele of type A is selected with probability x/2N .e. . . X0 ) = Xn . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. .i. . M and thus. Let M = 2N . EXn+1 = EXn Xn = Xn . as usual. and the population becomes one consisting of alleles of type a only. we see that 2N x 2N −y x y pxy = . . the random variables ξ1 . Clearly. . Clearly. 2N − 1} form a communicating class of inessential states. ϕ(x) = x/M. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . Since pxy > 0 for all 1 ≤ x. . . . with n P (ξk = 1|Xn . . for all n ≥ 0. X0 ) = E(Xn+1 |Xn ) = M This implies that. ϕ(M ) = 1. and the population becomes one consisting of alleles of type A only. Similarly. on the average. (Here. n n where. M Therefore. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). ξM are i. the expectation doesn’t change and. . conditionally on (X0 . 0 ≤ y ≤ 2N. Xn ). . ϕ(x) = y=0 pxy ϕ(y). 72 .) One can argue that. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. y ≤ 2N − 1. on the average. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. the states {1. . .2N = 1. the number of alleles of type A won’t change. i. Tx := inf{n ≥ 1 : Xn = x}. since. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. the chain gets absorbed by either 0 or 2N .

We will show that this Markov chain is positive recurrent. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. say. provided that enough components are available. A number An of new components are ordered and added to the stock. the demand is met instantaneously.d. at time n + 1. larger than the supply per unit of time. Thus. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. and so.The first two are obvious. . M x M (M −1)−(y−1) x x +1− M M M −1 = 24. while a number of them are sold to the market. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. 2. .5 A storage or queueing application A storage facility stores. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . If the demand is for Dn components. Working for n = 1. Here are our assumptions: The pairs of random variables ((An . so that Xn+1 = (Xn + ξn ) ∨ 0.i. Otherwise. the demand is only partially satisfied. and the stock reached level zero at time n + 1. the demand is. We will apply the so-called Loynes’ method. we will try to solve the recursion. We have that (Xn . there are Xn + An − Dn components in the stock. n ≥ 0) are i. Xn electronic components at time n. 73 . on the average. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. Dn ). b). M −1 y−1 x M y−1 1− x . Since the Markov chain is represented by this very simple recursion. . and E(An − Dn ) = −µ < 0. for all n ≥ 0. Let ξn := An − Dn . n ≥ 0) is a Markov chain.

. . so we can shuffle them in any way we like without altering their joint distribution. . Indeed. (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . ξn . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). ξn−1 . we find π(y) = x≥0 π(x)pxy . where the latter equality follows from the fact that. we reverse it. . instead of considering them in their original order. .i. . ξ0 ). however. the random variables (ξn ) are i. ξn−1 ) = (ξn−1 . . . we may replace the latter random variables by ξ1 . . . that d (ξ0 . . . for y ≥ 0. . Therefore. in computing the distribution of Mn which depends on ξ0 . we may take the limit of both sides as n → ∞. . using π(y) = limn→∞ P (Mn = y). 74 . and. the event whose probability we want to compute is a function of (ξ0 . = x≥0 Since the n-dependent terms are bounded.d. In particular. Notice. . (n) . . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy .Thus. . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). ξn−1 ). Hence. Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). for all n. . . .

From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. . Z2 . . we will use the assumption that Eξn = −µ < 0. Z2 . . Depending on the actual distribution of ξn . Here. This will be the case if g(y) is not identically equal to 1.} > y) is not identically equal to 1. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient.} < ∞) = 1. We must make sure that it also satisfies y π(y) = 1.So π satisfies the balance equations. The case Eξn = 0 is delicate. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . as required. the Markov chain may be transient or null recurrent. 75 . n→∞ Hence g(y) = P (0 ∨ max{Z1 . . . n→∞ This implies that P ( lim Zn = −∞) = 1. .

if x = −ej .and space-homogeneous transition probabilities given by px. . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. namely. we need to give more structure to the state space S. p(x) = 0. given a function p(x). groups) are allowed. where p1 + · · · + pd + q1 + · · · + qd = 1. but. d  p(x) = qj . . the set of d-dimensional vectors with integer coordinates. p(−1) = q.y+z for any translation z. .and space-homogeneity. . Our Markov chains. . . otherwise. . d   0. This means that the transition probability pxy should depend on x and y only through their relative positions in space. . then p(1) = p. j = 1.y = p(y − x). 0. To be able to say this. if x ∈ {±e1 .g. j = 1. x ∈ Zd . where C is such that x p(x) = 1. that of spatial homogeneity. this walk enjoys. otherwise where p + q = 1. This is the drunkard’s walk we’ve seen before.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Define  pj . we won’t go beyond Zd . a structure that enables us to say that px. so far. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. In dimension d = 1. A random walk possesses an additional property. Example 3: Define p(x) = 1 2d . here. For example. . More general spaces (e. if x = ej . A random walk of this type is called simple random walk or nearest neighbour random walk. ±ed } otherwise 76 . If d = 1. Example 1: Let p(x) = C2−|x1 |−···−|xd | . the skip-free property. . we can let S = Zd . a random walk is a Markov chain with time. . besides time. So. have been processes with timehomogeneous transition probabilities. .y = px+z.

We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). we will mean a sum over all possible values. x ± ed . It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . and this is the symmetric drunkard’s walk. which. Let us denote by Sn the state of the walk at time n. random variables in Zd with common distribution P (ξn = x) = p(x). y p(x)p′ (y − x). . We will let p∗n be the convolution of p with itself n times. (Recall that δz is a probability distribution that assigns probability 1 to the point z. Obviously. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 .This is a simple random walk. . p′ (x). ξ2 . we can reduce several questions to questions about this random walk. Furthermore.i. . p ∗ δz = p. The convolution of two probabilities p(x). . where ξ1 . notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. are i. We refer to ξn as the increments of the random walk. If d = 1. We call it simple symmetric random walk. So.) In this notation. then p(1) = p(−1) = 1/2. the initial state S0 is independent of the increments. (When we write a sum without indicating explicitly the range of the summation variable. computing the n-th order convolution for all n is a notoriously hard problem.) Now the operation in the last term is known as convolution. Furthermore. p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. . in addition. x + ξ2 = y) = y p(x)p(y − x). Indeed. because all we have done is that we dressed the original problem in more fancy clothes. . p(x) = 0. We shall resist the temptation to compute and look at other 77 . . otherwise.d. x ∈ Zd . in general. One could stop here and say that we have a formula for the n-step transition probability. x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. But this would be nonsense. The special structure of a random walk enables us to give more detailed formulae for its behaviour.

all possible values of the vector are equally likely.1) + − (5.i. . Will the random walk ever return to where it started from? B. In the sequel. The principle is trivial.1) − (3. 1): (2. +1}n . 2) → (5. Events A that depend only on these the first n random signs are subsets of this sample space. to each initial position and to each sequence of n signs.0) We can thus think of the elements of the sample space as paths. ξn = εn ) = 2−n . 2 where #A is the number of elements of A. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. As a warm-up. a random vector with values in {−1. +1}n . If not. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi.2) (4. . random variables (random signs) with P (ξn = ±1) = 1/2. and the sequence of signs is {+1. Therefore. but its practice may be hard. . . So. −1. 1}n . 2) → (3. . +1. +1.i−1 = 1/2 for all i ∈ Z. for fixed n. If yes. . we should assign. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn .1) + (0. consider the following: 78 . +1}n . + (1. Look at the distribution of (ξ1 . a path of length n. and so #A P (A) = n . we shall learn some counting methods.methods. After all. So if the initial position is S0 = 0. −1} then we have a path in the plane that joins the points (0. . 0) → (1. 1) → (2.2) where the ξn are i. we have P (ξ1 = ε1 . First. . we are dealing with a uniform distribution on the sample space {−1. A ⊂ {−1. Clearly.d. 1) → (4.i+1 = pi. Here are a couple of meaningful questions: A. ξn ). if we can count the number of elements of A we can compute its probability. Since there are 2n elements in {0. how long will that take? C.

x) (n. y)} = n−m . i). 0) and n ends at (n. Consider a path that starts from (0.i) (0. Proof. (n. i)} = n n+i 2 More generally. and compute the probability of the event A = {S2 ≥ 0. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. The number of paths that start from (0. in this case. 0)). Hence (n+i)/2 = 0. the number of paths that start from (m. We use the notation {(m. y)}. y) is given by: #{(m. Let us first show that Lemma 14. S4 = −1. i).1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point.0) The lightly shaded region contains all paths of length n that start at (0. y) is given by: #{(0. S6 < 0. 0) and ends at (n.) The darker region shows all paths that start at (0. n 79 . y)} =: {all paths that start at (m. 0) and end up at (n. i). x) (n.047. In if n + i is not an even number then there is no path that starts from (0.Example: Assume that S0 = 0. The paths comprising this event are shown below. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. 0) and end up at the specific point (n. S7 < 0}. x) and end up at (n. (n − m + y − x)/2 (23) Remark. x) and end up at (n. (There are 2n of them. 0) (n. We can easily count that there are 6 paths in A.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

. 0) (n. y)} . The left side of (30) equals #{paths (1. . 27. We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). . y)} . . . The probability that. P0 (S1 ≥ 0 . If n. for now. . Sn ≥ 0 | Sn = 0) = 84 1 . . if n is even. .3 Some conditional probabilities Lemma 17. Such a simple answer begs for a physical/intuitive explanation. 1) (n. . 27. n (30) Proof. the random walk never becomes zero up to time n is given by P0 (S1 > 0. x) 2n−m (n. given that S0 = 0 and Sn = y > 0.2 The ballot theorem Theorem 25. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. 0) (n. P0 (Sn = y) 2 (n + y) + 1 . S2 > 0. n 2−n .P (Sn = y|Sm = x) = In particular. if n is even. Sn > 0 | Sn = y) = y . y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . y) that do not touch zero} × 2−n #{paths (0. n/2 (29) #{(0. y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . n 2 #{paths (m. n−m 2−n−m . we shall just compute it. (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . n This is called the ballot theorem and the reason will be given in Section 29. . Sn ≥ 0 | Sn = y) = 1 − In particular. But.

. . . If n is even. ξ2 . . . and Sn−1 > 0 implies that Sn ≥ 0. n/2 where we used (28) and (29). We use (27): P0 (Sn ≥ 0 . . Sn > 0) + P0 (S1 < 0. Sn ≥ 0) = P0 (S1 = 0. . . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . Sn = 0) = P0 (S1 > 0. . . 1 + ξ2 + ξ3 > 0. . . . . Sn ≥ 0). . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . .Proof. . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. ξ3 . 85 . . . and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. . . . . . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . . . . Proof. .). . . 0) (n. P0 (S1 ≥ 0. . . For even n we have P0 (S1 ≥ 0. •). . . . . . . . (b) follows from symmetry. ξ2 + ξ3 ≥ 0. . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. .4 Some remarkable identities Theorem 26. We next have P0 (S1 = 0. . . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. Sn = 0) = P0 (Sn = 0). . ξ3 . . . . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . Sn ≥ 0.) by (ξ1 . (f ) follows from the replacement of (ξ2 . (c) (d) (e) (f ) (g) where (a) is obvious. . . (e) follows from the fact that ξ1 is independent of ξ2 . . . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . +1 27. . Sn ≥ 0) = #{(0. . . Sn < 0) (b) (a) = 2P0 (S1 > 0. .

Sn−1 > 0. By the Markov property at time n − 1. the two terms must be equal. Sn = 0 means Sn−1 = 1.5 First return to zero In (29). Sn = 0) = 2P0 (Sn−1 > 0. By symmetry. . . So f (n) = 2P0 (S1 > 0. Fix a state a. for even n. given that Sn = 0. Sn−2 > 0|Sn−1 = 1). . n We know that P0 (Sn = 0) = u(n) = n/2 2−n . Sn−1 > 0. Sn = 0) = u(n) . . .27. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. we can omit the last event. . n−1 Proof. . we either have Sn−1 = 1 or −1. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. . Theorem 27. . and if the particle is to the left of a then its mirror image is on the right. . . To take care of the last term in the last expression for f (n). P0 (Sn−1 > 0. Then f (n) := P0 (S1 = 0. Also. So 1 f (n) = 2u(n) P0 (S1 > 0. . Put a two-sided mirror at a. We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. we computed. 86 . we see that Sn−1 > 0. . Let n be even. So the last term is 1/2. Since the random walk cannot change sign before becoming zero. . . with equal probability. . . . . n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. Sn = 0). Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). Sn−2 > 0|Sn−1 > 0. Sn = 0) Now. f (n) = P0 (S1 > 0. Sn = 0) + P0 (S1 < 0. If a particle is to the right of a then you see its image on the left. . This is not so interesting. . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). the probability u(n) that the walk (started at 0) attains value 0 in n steps. Sn−1 = 0. Sn = 0) P0 (S1 > 0. . . Sn−1 < 0. Sn = 0. So u(n) f (n) = . and u(n) = P0 (Sn = 0). .

28. Then Ta = inf{n ≥ 0 : Sn = a} Sn . n ∈ Z+ ) has the same law as (Sn . observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . Let (Sn . Sn = Sn . P0 (Sn ≥ a) = P0 (Ta ≤ n. Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). Finally. But P0 (Ta ≤ n) = P0 (Ta ≤ n. n ≥ Ta . we obtain 87 . Sn ). n ≥ Ta By the strong Markov property. Sn ≥ a). . 2 Proof. . In other words. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). we have n ≥ Ta . Proof. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). If Sn ≥ a then Ta ≤ n. The last equality follows from the fact that Sn > a implies Ta ≤ n. a + (Sn − a). So Sn − a can be replaced by a − Sn in the last display and the resulting process. called (Sn .1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. Thus. independent of the past before Ta . Then. Sn ≤ a) + P0 (Ta ≤ n. But a symmetric random walk has the same law as its negative.What is more interesting is this: Run the particle till it hits a (it will. . Trivially. n < Ta . Sn > a) = P0 (Ta ≤ n. define Sn = where is the first hitting time of a. . Sn ≤ a). Theorem 28. the process (STa +m − a. m ≥ 0) is a simple symmetric random walk. Sn ≤ a) + P0 (Sn > a). 2 Define now the running maximum Mn = max(S0 . n ∈ Z+ ). 2a − Sn . S1 . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. so we may replace Sn by 2a − Sn (Theorem 28). n ≥ 0) be a simple symmetric random walk and a > 0. 2a − Sn ≥ a) = P0 (Ta ≤ n. n < Ta . In the last event. as a corollary to the above result.

This. employed for different purposes. Acad. 8 88 . The set of values of η contains n elements and η is uniformly distributed a in this set. (1887). the number of selected azure items is always ahead of the black ones equals (a − b)/n. Sci. as for holding liquids. . Acad. during sampling without replacement. usually a vase furnished with a foot or pedestal. e 10 Andr´. Sn = y) = P0 (Sn = 2x − y). then the probability that. Compt. P0 (Mn < x. Sn = y) = P0 (Tx ≤ n. Solution directe du probl`me r´solu par M. Compt. Proof. Rend. J. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . 9 Bertrand. 2 Proof. A sample from the urn is a sequence η = (η1 . . An urn is a vessel. ornamental objects. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. we have P0 (Mn < x. then. It is the latter use we are interested in in Probability and Statistics. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. gives the result. by applying the reflection principle (Theorem 28). If Tx ≤ n. Since Mn < x is equivalent to Tx > n. the ashes of the dead after cremation. 369. Solution d’un probl`me. 105. combined with (31). We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. If an urn contains a azure items and b = n−a black items. ηn ). The answer is very simple: Theorem 31. .Corollary 19. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. n = a + b. (31) x > y. (1887). Paris. D. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). e e e Paris. of which a are coloured azure and b black. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. 436-437. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. 105. Sci. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. Rend. Sn = y) = P0 (Tx > n. which results into P0 (Tx ≤ n. as long as a ≥ b. Sn = 2x − y). and anciently for holding ballots to be drawn. Bertrand. The original ballot problem. . Sn = y). where ηi indicates the colour of the i-th item picked.

. .We shall call (η1 . Sn > 0} that the random walk remains positive. . 89 . n Since y = a − (n − a) = a − b. . . . . Since an urn with parameters n. . the ‘black’ items correspond to − signs). a − b = y. we can translate the original question into a question about the random walk. Then. −n ≤ k ≤ n. always assuming (without any loss of generality) that a ≥ b = n − a. We have P0 (Sn = k) = n n+k 2 pi. . n . But in (30) we computed that y P0 (S1 > 0. So the number of possible values of (ξ1 . a we have that a out of the n increments will be +1 and b will be −1. ξn ) has a uniform distribution. 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . (ξ1 . Let ε1 .) So pi. But if Sn = y then. . but which is not necessarily symmetric. n n −n 2 (n + y)/2 a Thus. +1} such that ε1 + · · · + εn = y. We need to show that. . b be defined by a + b = n. Proof. . . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. letting a.i−1 = q = 1 − p. . ξn ) is. . . i ∈ Z. . . n ≥ 0) with values in Z and let y be a positive integer. Consider a simple symmetric random walk (Sn . a. ξn ) is uniformly distributed in a set with n elements. εn be elements of {−1. indeed. ξn = εn ) P0 (Sn = y) 2−n 1 = = . conditional on {Sn = y}. we have: P0 (ξ1 = ε1 . . . . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. where y = a − (n − a) = 2a − n. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. . Since the ‘azure’ items correspond to + signs (respectively. An urn can be realised quite easily be a random walk: Theorem 32. . ηn ) an urn process with parameters n. so ‘azure’ items are +1 and ‘black’ items are −1. . the sequence (ξ1 . . . the result follows. It remains to show that all values are equally a likely. . using the definition of conditional probability and Lemma 16. . Sn > 0 | Sn = y) = . . . . The values of each ξi are ±1. (The word “asymmetric” will always mean “not necessarily symmetric”. . conditional on the event {Sn = y}. . Then. . . p(n+k)/2 q (n−k)/2 . . Proof of Theorem 31. the sequence (ξ1 . conditional on {Sn = y}. . . .i+1 = p.

Consider a simple random walk starting from 0. For all x. Proof. x ≥ 0) is also a random walk. 2. for x > 0. If S1 = 1 then T1 = 1. and all t ≥ 0. . . This random variable takes values in {0. .30. Corollary 20. random variables. In view of this. See figure below. Consider a simple random walk starting from 0. The rest is a consequence of the strong Markov property. . it is no loss of generality to consider the random walk starting from 0. 2qs Proof. plus a ′′ further time T1 required to go from 0 to +1 for the first time.d. x − 1 in order to reach x.) Then. (We tacitly assume that S0 = 0. The process (Tx . we make essential use of the skip-free property. plus a time T1 required to go from −1 to 0 for the first time. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1.} ∪ {∞}. 90 . Proof. Px (Ty = t) = P0 (Ty−x = t). If x > 0. E0 sTx = ϕ(s)x . namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. We will approach this using probability generating functions. x ∈ Z.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. In view of Lemma 19 we need to know the distribution of T1 . . To prove this. . only the relative position of y with respect to x is needed. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1.i. Lemma 19. Lemma 18. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. Theorem 33. Let ϕ(s) := E0 sT1 . 2. . y ∈ Z. Consider a simple random walk starting from 0. 1. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 .

Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . Consider a simple random walk starting from 0. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 .Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then.  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . Corollary 21. The first time n such that Sn = −1 is the first time n such that −Sn = 1. Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . if x < 0 2ps Proof. in order to hit 1. from the strong Markov property. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . 2ps Proof. The formula is obtained by solving the quadratic. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Consider a simple random walk starting from 0. Hence the previous formula applies with the roles of p and q interchanged. 91 . But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. Corollary 22.

Then the probability generating function of the first return time to 0 is given by E0 s T 0 = 1 − Proof. where α is a real number. If we define n to be equal to 0 for k > n. For the general case. 1 − 4pqs2 . in distribution. The latter is. how long will it take till it first returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. Consider a simple random walk starting from 0. We again do first-step analysis. then we can omit the k upper limit in the sum. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. 2ps ′ =1− 30.2 First return to the origin Now let us ask: If the particle starts at 0. If α = n is a positive integer then n (1 + x)n = k=0 n k x . we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . k by the binomial theorem. and simply write (1 + x)n = k≥0 n k x . E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . this was found in §30.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. Similarly. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0.30. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1.3 The distribution of the first return to the origin We wish to find the exact distribution of the first return time to the origin. if S1 = 1. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . For a symmetric ′ random walk. k 92 . the same as the time required for the walk to hit 1 starting ′ from 0.

Recall that

    C(n, k) = n(n − 1) · · · (n − k + 1)/k! = n!/(k!(n − k)!).

It turns out that the formula is formally true for general α, and it looks exactly the same:

    (1 + x)^α = Σ_{k≥0} C(α, k) x^k,

as long as we define

    C(α, k) = α(α − 1) · · · (α − k + 1)/k!.

The adverb "formally" is supposed to mean that this formula holds for some x around 0, but that we won't worry about which ones. For example, we will show that

    (1 − x)^{−1} = 1 + x + x^2 + x^3 + · · · ,

which is of course the all too-familiar geometric series. Indeed,

    (1 − x)^{−1} = Σ_{k≥0} C(−1, k) (−x)^k.

But, by definition,

    C(−1, k) = (−1)(−2) · · · (−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,

so

    (1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^{2k} x^k = Σ_{k≥0} x^k,

as needed. As another example, let us compute √(1 − x). We have

    √(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2, k) (−1)^k x^k.

We can write this in more familiar symbols if we observe that

    C(1/2, k) (−1)^k = −(1/(2k − 1)) C(2k, k) 4^{−k},

and this is shown by direct computation. So

    √(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) (x/4)^k.

We apply this to E_0 s^{T_0′} = 1 − √(1 − 4pqs^2):

    E_0 s^{T_0′} = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^{2k}.

But, by the definition of E_0 s^{T_0′},

    E_0 s^{T_0′} = Σ_{n≥0} P_0(T_0′ = n) s^n.

Comparing the two "infinite polynomials" we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps), and then

    P_0(T_0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P_0(S_{2k} = 0).
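A quick numerical consistency check of this formula (mine, not part of the notes): the probabilities P_0(T_0′ = 2k) must sum to P_0(T_0′ < ∞), which Theorem 37 below identifies as 1 − |p − q| = 2(p ∧ q).

    from math import comb

    def f(k, p):
        # P_0(T_0' = 2k) according to the formula just derived
        q = 1 - p
        return comb(2 * k, k) * (p * q) ** k / (2 * k - 1)

    p = 0.3
    total = sum(f(k, p) for k in range(1, 400))
    print(total, 1 - abs(p - (1 - p)))   # both approximately 0.6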

31 Recurrence properties of simple random walks

Consider again a simple random walk S_n with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

    P_0( lim_{n→∞} S_n/n = p − q ) = 1.

• If p > 1/2 then p − q > 0, and so S_n tends to +∞ with probability one. Hence it is transient.
• If p < 1/2 then p − q < 0, and so S_n tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0, and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to Markov chain theory. The key fact needed here is

    P_0(T_x < ∞) = lim_{s↑1} E_0 s^{T_x},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

    1 − 4pq = 1 − 4p + 4p^2 = (1 − 2p)^2,

whence √(1 − 4pq) = |1 − 2p| = |p − q|. So

    P_0(T_x < ∞) = ((1 − |p − q|)/(2q))^x       if x > 0,
    P_0(T_x < ∞) = ((1 − |p − q|)/(2p))^{|x|}   if x < 0.

This formula can be written more neatly as follows.

Theorem 35.

    P_0(T_x < ∞) = ((p/q) ∧ 1)^x       if x > 0,
    P_0(T_x < ∞) = ((q/p) ∧ 1)^{|x|}   if x < 0.   (33)

In particular, if p = q then P_0(T_x < ∞) = 1 for all x ∈ Z.

Proof. If p > q then |p − q| = p − q, and the previous display gives (1 − (p − q))/(2q) = 2q/(2q) = 1 for x > 0, and (1 − (p − q))/(2p) = 2q/(2p) = q/p for x < 0; that is,

    P_0(T_x < ∞) = 1 if x > 0,  and  P_0(T_x < ∞) = (q/p)^{|x|} if x < 0,

which is what we want. If p < q then |p − q| = q − p and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} S_n = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

    M_∞ = sup(S_0, S_1, S_2, . . .).

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M_∞ has a geometric distribution:

    P_0(M_∞ ≥ x) = (p/q)^x,  x ≥ 0.

Proof. If we let M_n = max(S_0, . . . , S_n), then the sequence M_n is nondecreasing, so it has a limit; indeed, every nondecreasing sequence has a limit, except that the limit may be ∞, but here it is < ∞ with probability 1 because p < 1/2. Hence M_∞ = lim_{n→∞} M_n and, for x ≥ 0,

    P_0(M_∞ ≥ x) = lim_{n→∞} P_0(M_n ≥ x) = lim_{n→∞} P_0(T_x ≤ n) = P_0(T_x < ∞) = (p/q)^x,

the last equality by Theorem 35.
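Theorem 36 is easy to see in simulation. The sketch below is illustrative only; the finite horizon is a pragmatic truncation of mine, justified because the walk drifts to −∞ and new records become extremely unlikely.

    import random

    def max_of_walk(p, horizon=2000):
        # running maximum of a negatively drifting simple random walk
        s = m = 0
        for _ in range(horizon):
            s += 1 if random.random() < p else -1
            m = max(m, s)
        return m

    p, q = 0.3, 0.7
    samples = [max_of_walk(p) for _ in range(5000)]
    for x in range(5):
        emp = sum(m >= x for m in samples) / len(samples)
        print(x, emp, (p / q) ** x)   # empirical vs. (p/q)^x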

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return with probability 1 to the point where it started from. Otherwise, this probability is always strictly less than 1: unless p = 1/2, with some positive probability, it will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a simple random walk with values in Z. Let T_0′ = inf{n ≥ 1 : S_n = 0} be the first return time to 0. Then

    P_0(T_0′ < ∞) = 2(p ∧ q).

Proof. The probability generating function of T_0′ was obtained in Theorem 34. But

    P_0(T_0′ < ∞) = lim_{s↑1} E_0 s^{T_0′} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p = q this gives 1. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. In all cases, this equals 2(p ∧ q).

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. In the latter case, we now compute its exact distribution. The total number of visits to state i is given by

    J_i = Σ_{n=1}^∞ 1(S_n = i),   (34)

a random variable first considered in Section 10, where it was also shown to be geometric: in Lemma 4 we showed that, if a Markov chain starts from a state i, then J_i is geometric.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J_0 of visits to state 0 is geometrically distributed:

    P_0(J_0 ≥ k) = (2(p ∧ q))^k,  k = 0, 1, 2, . . . ,
    E_0 J_0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. But J_0 ≥ 1 if and only if the first return time to zero is finite. So

    P_0(J_0 ≥ 1) = P_0(T_0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so, as already known,

    P_0(J_0 ≥ k) = P_0(J_0 ≥ 1)^k = (2(p ∧ q))^k,  k = 0, 1, 2, . . .

This proves the first formula. For the second formula we have

    E_0 J_0 = Σ_{k=1}^∞ P_0(J_0 ≥ k) = Σ_{k=1}^∞ (2(p ∧ q))^k = 2(p ∧ q) / (1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any i:

    P_i(J_i ≥ k) = (2(p ∧ q))^k,  E_i J_i = 2(p ∧ q) / (1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case, P_i(J_i ≥ k) = 1 for all k, meaning that P_i(J_i = ∞) = 1.
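A simulation sketch for Theorem 38 (illustrative; the horizon truncation is my choice and is harmless because, for p ≠ 1/2, the transient walk makes essentially all of its visits early):

    import random

    def visits_to_zero(p, horizon=2000):
        # number of returns to 0 within the horizon
        s, j = 0, 0
        for _ in range(horizon):
            s += 1 if random.random() < p else -1
            j += (s == 0)
        return j

    p = 0.3
    samples = [visits_to_zero(p) for _ in range(5000)]
    lam = 2 * min(p, 1 - p)   # = 0.6 here
    for k in range(1, 5):
        print(k, sum(j >= k for j in samples) / len(samples), lam ** k)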

We now look at the number of visits to some state x when the walk starts from a different state, here taken to be 0.

Theorem 39. Consider a random walk with values in Z. If p ≤ 1/2 then

    P_0(J_x ≥ k) = (p/q)^{x∨0} (2p)^{k−1},  k ≥ 1,    E_0 J_x = (p/q)^{x∨0} / (1 − 2p).

If p ≥ 1/2 then

    P_0(J_x ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},  k ≥ 1,    E_0 J_x = (q/p)^{(−x)∨0} / (1 − 2q).

Proof. Suppose x > 0. Then

    P_0(J_x ≥ k) = P_0(T_x < ∞, J_x ≥ k),

simply because if J_x ≥ k ≥ 1 then T_x < ∞. Apply now the strong Markov property at T_x:

    P_0(T_x < ∞, J_x ≥ k) = P_0(T_x < ∞) P_x(J_x ≥ k − 1).

The reason that we replaced k by k − 1 is because, given that T_x < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain

    P_0(J_x ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^{k−1},  k ≥ 1.

If p ≤ 1/2 then this gives P_0(J_x ≥ k) = (p/q)^x (2p)^{k−1}. If p ≥ 1/2 then this gives P_0(J_x ≥ k) = (2q)^{k−1}. We repeat the argument for x < 0 and find

    P_0(J_x ≥ k) = ((q/p) ∧ 1)^{|x|} (2(p ∧ q))^{k−1},  k ≥ 1.

If p ≥ 1/2 then this gives P_0(J_x ≥ k) = (q/p)^{|x|} (2q)^{k−1}. If p ≤ 1/2 then this gives P_0(J_x ≥ k) = (2p)^{k−1}. As for the expectation, we apply E_0 J_x = Σ_{k=1}^∞ P_0(J_x ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E_0 J_x is finite if p ≠ 1/2. In other words, if the random walk "drifts down" (i.e. p < 1/2) then E_0 J_x drops down to zero geometrically fast as x tends to infinity; but for x < 0 we have E_0 J_x = 1/(1 − 2p) regardless of x. Starting from 0, the walk will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae, i.e.

    P_y(J_x ≥ k) = P_0(J_{x−y} ≥ k),  E_y J_x = E_0 J_{x−y}.

32 Duality

Consider a random walk S_k = Σ_{i=1}^k ξ_i, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (Ŝ_k, 0 ≤ k ≤ n) of (S_k, 0 ≤ k ≤ n) is defined by

    Ŝ_k := S_n − S_{n−k}.

Theorem 40. (Ŝ_k, 0 ≤ k ≤ n) has the same distribution as (S_k, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ_1, ξ_2, . . . , ξ_n) has the same distribution as (ξ_n, ξ_{n−1}, . . . , ξ_1).

Thus, every time we have a probability for S we can translate it into a probability for Ŝ, which is basically another probability for S. Here is an application of this.

Theorem 41. Let (S_k) be a simple symmetric random walk and x a positive integer. Let T_x = inf{n ≥ 1 : S_n = x} be the first time that x will be visited. Then

    P_0(T_x = n) = (x/n) P_0(S_n = x).

Proof. Rewrite the ballot theorem in terms of the dual:

    P_0(S_1 > 0, . . . , S_n > 0 | S_n = x) = x/n.

But the left-hand side is

    P_0(S_n − S_{n−k} > 0, 1 ≤ k ≤ n | S_n = x)
        = P_0(Ŝ_k < x, 0 ≤ k ≤ n − 1, Ŝ_n = x) / P_0(S_n = x)
        = P_0(T_x = n) / P_0(S_n = x),

because (S_k, 0 ≤ k ≤ n) and (Ŝ_k, 0 ≤ k ≤ n) have the same distribution.

Pause for reflection: we have been dealing with the random times T_x for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that T_x is the sum of x i.i.d. random variables, each distributed as T_1, and thus derived the generating function of T_x:

    E_0 s^{T_x} = ((1 − √(1 − 4pqs^2)) / (2qs))^x,  x > 0.   (35)

Specialising to the symmetric case, we have

    E_0 s^{T_x} = ((1 − √(1 − s^2)) / s)^x.

In Theorem 29 we used the reflection principle and derived the distribution function of T_x for a simple symmetric random walk:

    P_0(T_x ≤ n) = P_0(|S_n| ≥ x) − (1/2) P_0(|S_n| = x).   (36)

Lastly, in Theorem 41, we derived, using duality, for a simple symmetric random walk, the probabilities

    P_0(T_x = n) = (x/n) P_0(S_n = x),  x > 0.   (37)

We remark here something methodological: it was completely different techniques that led to results about, essentially, the same thing: the distribution of T_x. This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of s^n in (35); summing up (37) should give (36); or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
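A numerical cross-check is, however, immediate. The sketch below (an illustration of mine, with ad hoc function names) computes P_0(T_x = n) for a SSRW in two independent ways: by dynamic programming over paths that have not yet hit x, and via formula (37).

    from math import comb

    def first_passage(x, nmax):
        # P_0(T_x = n), n <= nmax, for a SSRW, by propagating mass below level x
        probs, out = {0: 1.0}, {}
        for n in range(1, nmax + 1):
            new = {}
            for s, pr in probs.items():
                for t in (s - 1, s + 1):
                    if t == x:
                        out[n] = out.get(n, 0.0) + pr / 2
                    else:
                        new[t] = new.get(t, 0.0) + pr / 2
            probs = new
        return out

    x, n = 3, 11
    dp = first_passage(x, n)[n]
    formula = (x / n) * comb(n, (n + x) // 2) * 2 ** (-n)
    print(dp, formula)   # identical up to rounding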

33 Amount of time that a SSRW is positive*

Consider a SSRW (S_n) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, S_n) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T_+(2n) be the number of sides of the polygonal line that are above the horizontal axis and T_−(2n) = 2n − T_+(2n) the number of sides that are below. Clearly, T_±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0.

    [Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T_0′, it behaves again like a SSRW and is independent of the past.]

We can think of T_+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T_+(2n), arguing that P(T_+(2n) = m) is maximised when m = n, i.e. when T_+(2n) = T_−(2n). This is wrong. In fact, as the theorem below shows, the most likely values of T_+(2n) are m = 0 and m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

    P(T_+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0),  0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T_0′ that the RW returns to 0:

    P(T_+(2n) = 2k) = Σ_{r=1}^n P(T_+(2n) = 2k, T_0′ = 2r).

Between 0 and T_0′ the random walk is either above the horizontal axis or below; call the first event A_+ and the second A_−, and write

    P(T_+(2n) = 2k) = Σ_{r=1}^n P(A_+, T_0′ = 2r) P(T_+(2n) = 2k | A_+, T_0′ = 2r)
                    + Σ_{r=1}^n P(A_−, T_0′ = 2r) P(T_+(2n) = 2k | A_−, T_0′ = 2r).

In the first case,

    P(T_+(2n) = 2k | A_+, T_0′ = 2r) = P(T_+(2n) − 2r = 2k − 2r | A_+, T_0′ = 2r) = P(T_+(2n − 2r) = 2k − 2r),

because, if T_0′ = 2r and A_+ occurs, then the RW has already spent 2r units of time above the axis and so T_+(2n) is reduced by 2r; furthermore, we may remove the conditioning because the future evolution of the RW after T_0′ is independent of the past. In the second case, a similar argument shows

    P(T_+(2n) = 2k | A_−, T_0′ = 2r) = P(T_+(2n − 2r) = 2k).

We also observe that, by symmetry,

    P(A_+, T_0′ = 2r) = P(A_−, T_0′ = 2r) = (1/2) P(T_0′ = 2r) = (1/2) f_{2r}.

Letting p_{2k,2n} = P(T_+(2n) = 2k), we have shown that

    p_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we will realise that

    p_{2k,2n} = (1/2) u_{2n−2k} Σ_{r=1}^k f_{2r} u_{2k−2r} + (1/2) u_{2k} Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r}.

Since Σ_{r=1}^k f_{2r} u_{2k−2r} = u_{2k} (decompose the event {S_{2k} = 0} according to the first return time to 0), and likewise the second sum equals u_{2n−2k}, the right-hand side equals u_{2k} u_{2n−2k}, which completes the induction.

Explicitly, we have

    P(T_+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^{−2n},  0 ≤ k ≤ n.   (38)

With 2n = 100, we plot the function P(T_+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

    [Figure: plot of P(T_+(100) = 2k) against k, 0 ≤ k ≤ 50. The values are about 0.08 at the endpoints k = 0 and k = 50 and dip to their minimum in the middle of the range.]
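For readers who want to reproduce the figure, here is a minimal sketch using the explicit formula (38) (printing values rather than plotting; any plotting library could be attached):

    from math import comb

    def p_plus(n, k):
        # P(T_+(2n) = 2k) = u_{2k} u_{2n-2k}, formula (38)
        u = lambda m: comb(2 * m, m) * 4.0 ** (-m)
        return u(k) * u(n - k)

    n = 50
    row = [p_plus(n, k) for k in range(n + 1)]
    print(row[0], row[n // 2], row[n])   # ~0.0796 at the ends, ~0.0126 in the middle
    print(abs(sum(row) - 1) < 1e-9)      # the probabilities sum to 1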

The arcsine law*
Use of Stirling's approximation in formula (38) gives

    P(T_+(2n) = 2k) ∼ 1 / (π √(k(n − k))),  as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

    P( T_+(2n)/(2n) ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T_+(2n)/(2n) = k/n )
        ∼ Σ_{a ≤ k/n ≤ b} 1 / (π √(k(n − k)))
        = Σ_{a ≤ k/n ≤ b} (1/n) · 1 / (π √((k/n)(1 − k/n))).

This is easily seen to have a limit as n → ∞, the limit being

    ∫_a^b dt / (π √(t(1 − t))),

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:

    lim_{n→∞} P( T_+(2n)/(2n) ≤ x ) = ∫_0^x dt / (π √(t(1 − t))) = (2/π) arcsin √x.

We thus have that the limiting distribution of T_+(2n)/(2n) (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (T_x, x ≥ 0). This is another random walk because the summands in T_x = Σ_{y=1}^x (T_y − T_{y−1}) are i.i.d., all distributed as T_1:

    P(T_1 ≥ 2n) = P(S_{2n} = 0) = C(2n, n) 2^{−2n} ∼ 1/√(πn).

From this, we immediately get that the expectation of T_1 is infinity:

    E T_1 = ∞.

Many other quantities can be approximated in a similar manner.
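As an illustration of the arcsine law, here is a simulation sketch (mine, with arbitrarily chosen sample sizes) comparing the empirical distribution of the fraction of time positive with (2/π) arcsin √x:

    import random, math

    def frac_positive(n):
        # fraction of 2n sides of the trajectory lying above the axis
        s, pos = 0, 0
        for _ in range(2 * n):
            prev = s
            s += 1 if random.random() < 0.5 else -1
            pos += (prev + s > 0)   # the side from prev to s is above the axis
        return pos / (2 * n)

    samples = sorted(frac_positive(500) for _ in range(4000))
    for x in (0.1, 0.25, 0.5, 0.9):
        emp = sum(f <= x for f in samples) / len(samples)
        print(x, emp, 2 / math.pi * math.asin(math.sqrt(x)))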

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south.

We describe this motion as a random walk in Z^2 = Z × Z: let ξ_1, ξ_2, . . . be i.i.d. random vectors such that

    P(ξ_n = (0, 1)) = P(ξ_n = (0, −1)) = P(ξ_n = (1, 0)) = P(ξ_n = (−1, 0)) = 1/4.

We then let S_0 = (0, 0) and, for n ≥ 1,

    S_n = Σ_{i=1}^n ξ_i.

If x ∈ Z^2, we let x^1 be its horizontal coordinate and x^2 its vertical. So S_n = (S_n^1, S_n^2), where S_n^1 = Σ_{i=1}^n ξ_i^1 and S_n^2 = Σ_{i=1}^n ξ_i^2.

Lemma 20. (S_n^1, n ∈ Z+) is a random walk. So is (S_n^2, n ∈ Z+).

Proof. (S_n^1, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. Indeed, since ξ_1, ξ_2, . . . are independent, so are ξ_1^1, ξ_2^1, . . . . They are also identically distributed, with

    P(ξ_n^1 = 0) = P(ξ_n = (0, 1)) + P(ξ_n = (0, −1)) = 1/2,
    P(ξ_n^1 = 1) = P(ξ_n = (1, 0)) = 1/4,
    P(ξ_n^1 = −1) = P(ξ_n = (−1, 0)) = 1/4.

The same argument applies to (S_n^2, n ∈ Z+), showing that it, too, is a random walk.

It is not a simple random walk because there is a possibility that the increment takes the value 0. Moreover, the two random walks are not independent. Indeed, for each n, the random variables ξ_n^1, ξ_n^2 are not independent, simply because if one takes value 0 the other is nonzero; hence the two stochastic processes, being functions of the ξ's, are not independent either.

However, consider a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x^1, x^2) into (x^1 + x^2, x^1 − x^2). So let

    R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2, S_n^1 − S_n^2).

We have:

Lemma 21. Each of (R_n^1, n ∈ Z+) and (R_n^2, n ∈ Z+) is a random walk in one dimension; hence (R_n, n ∈ Z+) is a random walk in two dimensions.

Proof. We have R_n^1 = Σ_{i=1}^n (ξ_i^1 + ξ_i^2) and R_n^2 = Σ_{i=1}^n (ξ_i^1 − ξ_i^2). So R_n^1 = Σ_{i=1}^n η_i^1 and R_n^2 = Σ_{i=1}^n η_i^2, where

    η_n^1 := ξ_n^1 + ξ_n^2,  η_n^2 := ξ_n^1 − ξ_n^2.

The sequence of vectors η_1 = (η_1^1, η_1^2), η_2, . . . is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ_1, ξ_2, . . . . Hence (R_n, n ∈ Z+) is a random walk in two dimensions, and a similar argument shows that each of (R_n^1, n ∈ Z+), (R_n^2, n ∈ Z+) is a random walk in one dimension.

Lemma 22. (R_n^1, n ∈ Z+) and (R_n^2, n ∈ Z+) are independent simple random walks.

Proof. The possible values of each of the η_n^1, η_n^2 are ±1. We have

    P(η_n^1 = 1) = P(ξ_n = (1, 0)) + P(ξ_n = (0, 1)) = 1/2.

Hence P(η_n^1 = 1) = P(η_n^1 = −1) = 1/2, and the same is true for η_n^2. So each of (R_n^1), (R_n^2) is a simple symmetric random walk. To show that the two are independent, we check that η_n^1 = ξ_n^1 + ξ_n^2 and η_n^2 = ξ_n^1 − ξ_n^2 are independent for each n. And so

    P(η_n^1 = 1, η_n^2 = 1) = P(ξ_n = (1, 0)) = 1/4,
    P(η_n^1 = 1, η_n^2 = −1) = P(ξ_n = (0, 1)) = 1/4,
    P(η_n^1 = −1, η_n^2 = 1) = P(ξ_n = (0, −1)) = 1/4,
    P(η_n^1 = −1, η_n^2 = −1) = P(ξ_n = (−1, 0)) = 1/4,

whence P(η_n^1 = ε, η_n^2 = ε′) = P(η_n^1 = ε) P(η_n^2 = ε′) for all choices of ε, ε′ ∈ {−1, +1}.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. Since S_{2n} = 0 if and only if R_{2n} = 0,

    P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0, R_{2n}^2 = 0) = P(R_{2n}^1 = 0) P(R_{2n}^2 = 0),

where the last equality follows from the independence of the two random walks. Recall, from Theorem 7, that for a simple symmetric random walk (started from S_0 = 0), P(S_{2n} = 0) = C(2n, n) 2^{−2n}. So P(R_{2n}^1 = 0) = C(2n, n) 2^{−2n}, and the same is true for (R_{2n}^2). Hence

    P(S_{2n} = 0) = (C(2n, n))^2 2^{−4n},

and, by Stirling's approximation, P(S_{2n} = 0) ∼ c/n as n → ∞, where c is a positive constant (in the sense that the ratio of the two sides goes to 1). By Lemma 5, we need to show that Σ_{n=0}^∞ P(S_n = 0) = ∞. But, by the previous computation, Σ_n c/n = ∞, and we are done.
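The computation in the proof can be checked numerically; in fact, since (C(2n, n) 2^{−2n})^2 ∼ 1/(πn), the constant c equals 1/π. A small illustration:

    from math import comb, pi

    def p_return_2d(n):
        # P(S_{2n} = 0) for the planar SSRW, via the 45-degree rotation trick:
        # the square of the one-dimensional return probability
        u = comb(2 * n, n) * 4.0 ** (-n)
        return u * u

    for n in (10, 100, 1000):
        print(n, p_return_2d(n), 1 / (pi * n))   # the two columns converge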

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let a < 0 < b. Recall that

    P(T_a < T_b) = b/(b − a),  P(T_b < T_a) = −a/(b − a).

We can read this as follows:

    P(S_{T_a ∧ T_b} = a) = b/(b − a),  P(S_{T_a ∧ T_b} = b) = −a/(b − a).

This last display tells us the distribution of S_{T_a ∧ T_b}. Note, in particular, that

    E S_{T_a ∧ T_b} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p), such that p is rational, we can always find integers a, b with a < 0 < b such that pa + (1 − p)b = 0: if p = M/N, with M, N ∈ N and M < N, choose a = M − N and b = M. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability it exits it from the left point is p, and the probability it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (p_i, i ∈ Z) with zero mean: Σ_{i∈Z} i p_i = 0. Can we find a stopping time T such that S_T has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (S_n) be a simple symmetric random walk and (p_i, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i p_i = 0. Then there exists a stopping time T such that

    P(S_T = i) = p_i,  i ∈ Z.

Proof. Since Σ_{i∈Z} i p_i = 0, the two sums Σ_{i<0} (−i) p_i and Σ_{i>0} i p_i have a common value; call it

    c := Σ_{i<0} (−i) p_i = Σ_{i>0} i p_i.

Letting T_i = inf{n ≥ 0 : S_n = i}, i ∈ Z, define a pair of random variables (A, B), taking values in {(0, 0)} ∪ ((−N) × N), with the following distribution:

    P(A = i, B = j) = c^{−1} (j − i) p_i p_j,  i < 0 < j,
    P(A = 0, B = 0) = p_0,

where c^{−1} is chosen by normalisation:

    P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.

This leads to

    c^{−1} Σ_{j>0} j p_j Σ_{i<0} p_i + c^{−1} Σ_{i<0} (−i) p_i Σ_{j>0} p_j = 1 − p_0,

which indeed holds: each of Σ_{j>0} j p_j and Σ_{i<0} (−i) p_i equals c, so the left-hand side is Σ_{i<0} p_i + Σ_{j>0} p_j = 1 − p_0.

Now assume that (A, B) is independent of (S_n), and consider the random variable

    Z = S_{T_A ∧ T_B}.

The claim is that

    P(Z = k) = p_k,  k ∈ Z.

To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p_0, by definition.

Suppose next k > 0. We have

    P(Z = k) = P(S_{T_A ∧ T_B} = k) = Σ_{i<0} Σ_{j>0} P(S_{T_i ∧ T_j} = k) P(A = i, B = j)
             = Σ_{i<0} P(S_{T_i ∧ T_k} = k) P(A = i, B = k)
             = Σ_{i<0} ((−i)/(k − i)) · c^{−1} (k − i) p_i p_k
             = c^{−1} Σ_{i<0} (−i) p_i p_k = p_k,

since, for k > 0, the event {S_{T_i ∧ T_j} = k} is possible only when j = k, and P(S_{T_i ∧ T_k} = k) = (−i)/(k − i) by the gambler's ruin formula. Similarly, P(Z = k) = p_k for k < 0.
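Since the proof of Theorem 43 is constructive, it can be run as a simulation. The sketch below follows the theorem's notation; the target distribution is an arbitrary zero-mean example of mine.

    import random

    def skorokhod_sample(p):
        # p: dict i -> p_i on Z with zero mean; returns one sample of S_T
        c = sum(-i * pi for i, pi in p.items() if i < 0)
        pairs, weights = [(0, 0)], [p.get(0, 0.0)]
        for i, pi in p.items():
            if i >= 0:
                continue
            for j, pj in p.items():
                if j <= 0:
                    continue
                pairs.append((i, j))
                weights.append((j - i) * pi * pj / c)
        a, b = random.choices(pairs, weights)[0]
        if (a, b) == (0, 0):
            return 0
        s = 0
        while s != a and s != b:           # run the SSRW until it exits (a, b)
            s += 1 if random.random() < 0.5 else -1
        return s

    target = {-2: 0.3, 0: 0.2, 1: 0.4, 2: 0.1}   # zero mean: -0.6 + 0.4 + 0.2 = 0
    samples = [skorokhod_sample(target) for _ in range(20000)]
    for i in sorted(target):
        print(i, samples.count(i) / len(samples), target[i])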

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

    N = {1, 2, 3, · · · }.

The set Z of integers consists of N, their negatives and zero (zero is denoted by 0):

    Z = {· · · , −3, −2, −1, 0, 1, 2, 3, · · · }.

Given integers a, b, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Now, if a, b are integers, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a, d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d_1, d_2 are two such numbers then d_1 would divide d_2 and vice-versa, leading to d_1 = d_2.)

Notice that:

    gcd(a, b) = gcd(a, b − a).

Indeed, let d = gcd(a, b). We will show that d = gcd(a, b − a). But d | a and d | b. Therefore d | b − a. So (i) holds. Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. But d = gcd(a, b), so we have c | d. So (ii) holds.

This gives us an algorithm for finding the gcd between two numbers: suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. For instance,

    gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26)
                 = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2)
                 = gcd(2, 2) = 2.

But in the first line we subtracted 38 exactly 3 times, and 3 is the maximum number of times that 38 "fits" into 120: indeed, 4 × 38 > 120. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school): we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest:

    gcd(120, 38) = gcd(6, 38).

Similarly, with the second line, gcd(38, 6) = gcd(2, 6), then gcd(6, 2) = gcd(0, 2) = 2, and we're done: when we reach a pair with b = 0, the other number is the gcd.
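Both procedures translate directly into code; here is a minimal sketch (an illustration — Python's built-in math.gcd performs the same computation):

    def gcd_subtract(a, b):
        # repeated subtraction, as in the first chain of equalities above
        while b > 0:
            a, b = min(a, b), max(a, b) - min(a, b)
        return a

    def gcd_euclid(a, b):
        # Euclid's algorithm: replace the larger number by the remainder
        while b != 0:
            a, b = b, a % b
        return a

    print(gcd_subtract(120, 38), gcd_euclid(120, 38))   # both print 2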

The theorem of division says the following: given two positive integers b, a, with a > b, we can find an integer q and an integer r ∈ {0, 1, . . . , b − 1} such that

    a = qb + r.

The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 with 26 we do

           132
          -----
     26 | 3450
          26
          --
           85
           78
           --
            70
            52
            --
            18

and obtain q = 132, r = 18.

Here are some other facts about the greatest common divisor. First, if d = gcd(a, b) then there are integers x, y such that d = xa + yb. To see why this is true, let us follow the previous example again:

    gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

In the process, we used the divisions

    120 = 3 × 38 + 6,
    38 = 6 × 6 + 2.

Working backwards, we have

    2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.

Thus, gcd(38, 120) = 38x + 120y, with x = 19, y = −6. The same logic works in general.

Second, we can show that, for integers a, b, not both zero,

    gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.

To see this, let d be the right-hand side. Then d = xa + yb, for some x, y ∈ Z. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that d | a and d | b. Suppose this is not true; say that d does not divide b. Since d > 0, we can divide b by d to find b = qd + r, for some 1 ≤ r ≤ d − 1. But then b = q(xa + yb) + r, yielding r = (−qx)a + (1 − qy)b. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). So d is not minimum as claimed, and this is a contradiction. Hence d divides both a and b.
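The "working backwards" computation can be mechanised by the extended Euclidean algorithm; the standard sketch below returns d = gcd(a, b) together with integers x, y such that d = xa + yb:

    def extended_gcd(a, b):
        # returns (d, x, y) with d = gcd(a, b) = x*a + y*b
        if b == 0:
            return a, 1, 0
        d, x, y = extended_gcd(b, a % b)
        return d, y, x - (a // b) * y

    print(extended_gcd(38, 120))   # (2, 19, -6): indeed 19*38 - 6*120 = 2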

The greatest common divisor of 3 integers can now be defined in a similar manner; in general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. In general, a relation can be thought of as a set of pairs of elements of the set A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y.

A relation is called reflexive if it contains the pairs (x, x). The relation < is not reflexive; the equality relation = is reflexive. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The relation < is not symmetric; the equality relation is symmetric. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric; it can be thought of as reflexive (unless we do not want to accept the fact that one may be a friend of oneself), but it is not necessarily transitive. A relation is transitive if, when it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i <-> j (i communicates with j) in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x_1, x_2 ∈ X have different values f(x_1), f(x_2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names; so can the second, and so on. Hence the total number of assignments is

    m × m × · · · × m  (n times)  = m^n.

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is

    (m)_n := m(m − 1)(m − 2) · · · (m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A, and we denote (n)_n also by n!; in other words, n! stands for the product of the first n natural numbers: n! = 1 × 2 × 3 × · · · × n. Thus, n! is the total number of permutations of a set of n elements.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen: the first element can be chosen in n ways, the second in n − 1 ways, and so on, the k-th in n − k + 1 ways. So we have (n)_k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

    (n)_k / k! =: C(n, k),

where the symbol C(n, k) ("n choose k") is merely a name for the number computed on the left side of this equation. Clearly,

    C(n, k) = n! / (k!(n − k)!) = C(n, n − k).

Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1. Incidentally, we can observe that there are as many subsets of A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1. The total number of subsets of A (regardless of their number of elements) is thus equal to

    Σ_{k=0}^n C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem.

We shall let C(n, k) = 0 if n < k. The Pascal triangle relation states that

    C(n + 1, k) = C(n, k − 1) + C(n, k).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is included in the sample, then there are k − 1 remaining animals to be chosen and this can be done in C(n, k − 1) ways. If the pig is not included in the sample, we choose k animals from the remaining n ones and this can be done in C(n, k) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Also, if x is a real number, we define

    C(x, m) = x(x − 1) · · · (x − m + 1) / m!.

If y is not an integer then, by convention, we let C(n, y) = 0.

Maximum and minimum

The minimum of two numbers a, b is denoted by a ∧ b = min(a, b). Thus a ∧ b = a if a ≤ b, and = b if b ≤ a. The maximum is denoted by a ∨ b = max(a, b). The notation extends to more than one number: a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant.

In particular, if a is a real number, we use the following notations:

    a^+ = max(a, 0),  a^− = −min(a, 0).

a^+ is called the positive part of a, and a^− is called the negative part of a. Both quantities are nonnegative. Note that it is always true that a = a^+ − a^−. The absolute value of a is defined by

    |a| = a^+ + a^−.

Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c).

Indicator function

1_A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. If A is an event then 1_A is a random variable: it takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is

    E 1_A = 1 × P(A) + 0 × P(A^c) = P(A).

Several quantities of interest are expressed in terms of indicators. For instance, if I go to a bank N times and A_i is the clause "I am robbed the i-th time", then Σ_{i=1}^N 1_{A_i} is the total number of times I am robbed. Thus, if A_1, A_2, . . . are sets or clauses, then Σ_{n≥1} 1_{A_n} is the number of true clauses.

This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that

    1_{A∩B} = 1_A 1_B,  1_{A^c} = 1 − 1_A.

An example of its application: since

    1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c}
            = 1 − (1 − 1_A)(1 − 1_B) = 1_A + 1_B − 1_A 1_B = 1_A + 1_B − 1_{A∩B},

we have, by taking expectations, E[1_{A∪B}] = E[1_A] + E[1_B] − E[1_{A∩B}], which means

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Another example: let X be a nonnegative random variable with values in N = {1, 2, . . .}. Then

    X = Σ_{n=1}^X 1  (the sum of X ones is X),

so

    X = Σ_{n≥1} 1(X ≥ n),

and hence E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that

    ϕ(αx + βy) = αϕ(x) + βϕ(y),

for all x, y ∈ R^n and all α, β ∈ R.

Let u_1, . . . , u_n be a basis for R^n. Then any x ∈ R^n can be uniquely written as

    x = Σ_{j=1}^n x^j u_j.

The column (x^1, . . . , x^n)^T is a column representation of x with respect to the given basis. Then, because ϕ is linear,

    y := ϕ(x) = Σ_{j=1}^n x^j ϕ(u_j).

But the ϕ(u_j) are vectors in R^m. So if we choose a basis v_1, . . . , v_m in R^m, we can express each ϕ(u_j) as a linear combination of the basis vectors:

    ϕ(u_j) = Σ_{i=1}^m a_{ij} v_i.

Combining, we have

    ϕ(x) = Σ_{i=1}^m v_i Σ_{j=1}^n a_{ij} x^j.

Let y^i := Σ_{j=1}^n a_{ij} x^j, i = 1, . . . , m. The column (y^1, . . . , y^m)^T is a column representation of y = ϕ(x) with respect to the basis v_1, . . . , v_m. The matrix A of ϕ with respect to the choice of the two bases is defined by

    A := [ a_{11} · · · a_{1n} ]
         [  · · ·  · · ·  · · · ]
         [ a_{m1} · · · a_{mn} ]

It is an m × n matrix. The relation between the column representation of y and the column representation of x is written as

    [ y^1 ]   [ a_{11} · · · a_{1n} ] [ x^1 ]
    [  ⋮  ] = [  · · ·  · · ·  · · · ] [  ⋮  ]
    [ y^m ]   [ a_{m1} · · · a_{mn} ] [ x^n ]

Suppose that ψ, ϕ are linear functions:

    R^n --ψ--> R^m --ϕ--> R^ℓ.

Then ϕ ∘ ψ is a linear function:

    R^n --ϕ∘ψ--> R^ℓ.

Fix three sets of bases on R^n, R^m and R^ℓ, and let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. Then the matrix representation of ϕ ∘ ψ is AB, where

    (AB)_{ij} = Σ_{k=1}^m A_{ik} B_{kj},  1 ≤ i ≤ ℓ,  1 ≤ j ≤ n.

So A is an ℓ × m matrix, B is an m × n matrix, and AB is an ℓ × n matrix.

A matrix A is called square if it is an n × n matrix. In other words, a square matrix represents a linear transformation from R^n to R^n. If we fix a basis in R^n, it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in R^d consists of the vectors

    e_1 = (1, 0, . . . , 0)^T,  e_2 = (0, 1, . . . , 0)^T,  . . . ,  e_d = (0, 0, . . . , 1)^T,

i.e. the i-th component of e_j is 1(i = j).

Supremum and infimum

When we have a set I of numbers, the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, · · · } has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that if the set I is bounded from below, then this maximum of all lower bounds exists, and it is this that we call infimum. Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞. If I is empty then, by convention, inf ∅ = +∞ and sup ∅ = −∞.

inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If x_1, x_2, . . . is a sequence of numbers, then

    inf{x_1, x_2, . . .} ≤ inf{x_2, x_3, . . .} ≤ inf{x_3, x_4, . . .} ≤ · · ·

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it lim inf x_n (limit inferior) of the sequence:

    lim inf x_n = lim_{n→∞} inf{x_n, x_{n+1}, x_{n+2}, . . .}.

Similarly, we define lim sup x_n (limit superior) as the limit of the nonincreasing sequence sup{x_n, x_{n+1}, . . .}. We note that lim inf(−x_n) = −lim sup x_n. The sequence x_n has a limit ℓ if and only if lim sup x_n = lim inf x_n = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory).(11) But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely:

    if A_1, A_2, . . . are mutually disjoint events, then P( ∪_{n≥1} A_n ) = Σ_{n≥1} P(A_n).

Everything else follows from this axiom. That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more "restrictive".

Indeed, if A, B are disjoint events then P(A ∪ B) = P(A) + P(B). If the events A, B are not disjoint, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. If A_1, A_2, . . . are increasing with A = ∪_n A_n, then P(A) = lim_{n→∞} P(A_n). Similarly, if A_1, A_2, . . . are decreasing with A = ∩_n A_n, then P(A) = lim_{n→∞} P(A_n). If A_n are events with P(A_n) = 0, then P(∪_n A_n) = 0; similarly, if A_n are events with P(A_n) = 1, then P(∩_n A_n) = 1. Also, P(∪_n A_n) ≤ Σ_n P(A_n).

In Markov chain theory, and everywhere else, we often need to compare probabilities or expectations. Quite often, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B). Similarly, to show that EX ≤ EY, we often argue that X ≤ Y.

Usually, to compute P(A), it is useful to find disjoint events B_1, B_2, . . . such that ∪_n B_n = Ω, in which case we write

    P(A) = Σ_n P(A ∩ B_n) = Σ_n P(A | B_n) P(B_n).

We work a lot with conditional probabilities. Recall that P(B|A) is defined by P(A ∩ B)/P(A) and is a very motivated definition. Often, to compute P(A ∩ B), we read the definition as

    P(A ∩ B) = P(A) P(B|A).

This is generalisable to a number of events:

    P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 ∩ A_2).

If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B), and this is called Bayes' theorem.

If A is an event, then 1_A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 on A^c. Note that 1_A 1_B = 1_{A∩B}, 1_{A^c} = 1 − 1_A, and E 1_A = P(A).

Markov's inequality for a positive random variable X states that P(X > x) ≤ EX/x. This follows from the logical statement X 1(X > x) ≤ X, which holds since X is positive, by taking expectations.

(11) Of course, in these notes, by not requiring this, I do sometimes "cheat" from a strict mathematical point of view.

An application of this is:

    P(X < ∞) = lim_{x→∞} P(X ≤ x).

In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable X may be defined by

    EX = ∫_0^∞ P(X > x) dx.

The formula holds for any positive random variable. In case the random variable is discrete, the formula reduces to the known one: EX = Σ_x x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^{∞} x f(x) dx.

For a random variable which takes positive and negative values, we define X^+ = max(X, 0) and X^− = −min(X, 0), which are both nonnegative, and then define EX = EX^+ − EX^−, which is motivated from the algebraic identity X = X^+ − X^−. This definition makes sense only if EX^+ < ∞ or EX^− < ∞.

For a random variable that takes values in N we can write EX = Σ_{n=1}^∞ P(X ≥ n). This is due to the fact that X = Σ_{n≥1} 1(X ≥ n) and E 1(X ≥ n) = P(X ≥ n).

Generating functions

A generating function of a sequence a_0, a_1, . . . is an infinite polynomial having the a_i as coefficients:

    G(s) = a_0 + a_1 s + a_2 s^2 + · · · .

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a_0. Note that |G(s)| ≤ Σ_n |a_n| |s|^n, and this is ≤ Σ_n |a_n| if s ∈ [−1, 1]. So if Σ_n |a_n| < ∞, then G(s) exists for all s ∈ [−1, 1], making it a useful object when a_0, a_1, . . . is a probability distribution on Z+ = {0, 1, 2, . . .}, because then Σ_n a_n = 1. If G exists for some s_1, then it exists uniformly for all s with |s| < |s_1| (and this needs a proof, taught in calculus courses).

As an example, the generating function of the geometric sequence a_n = ρ^n, n ≥ 0, is

    Σ_{n=0}^∞ ρ^n s^n = 1/(1 − ρs),  as long as |ρs| < 1,   (39)

i.e. it exists for |s| < 1/|ρ|.

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence p_n = P(X = n), n = 0, 1, . . ., is called the probability generating function of X:

    ϕ(s) = Es^X = Σ_{n=0}^∞ s^n P(X = n).

This is defined for all −1 < s < 1. Note that Σ_{n=0}^∞ p_n ≤ 1; we do allow X to, possibly, take the value +∞, and

    p_∞ := P(X = ∞) = 1 − Σ_{n=0}^∞ p_n.

The term s^∞ p_∞ could formally appear in the sum defining ϕ(s), but, since |s| < 1, we have s^∞ = 0. Note that

    ϕ(0) = P(X = 0),  while  ϕ(1) = Σ_{n=0}^∞ P(X = n) = P(X < +∞),

and this is why ϕ is called probability generating function.

We can recover the probabilities p_n from the formula

    p_n = ϕ^{(n)}(0) / n!,

as follows from Taylor's theorem. Also, we can recover the expectation of X from

    EX = lim_{s↑1} ϕ′(s),

something that can be formally seen from (d/ds) Es^X = E (d/ds) s^X = E X s^{X−1}, by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

    x_{n+2} = x_{n+1} + x_n,  n ≥ 0,  with x_0 = x_1 = 1.

If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is

    Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x_0),

and the generating function of (x_{n+2}, n ≥ 0) is

    Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x_0 − s x_1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0), i.e.

    s^{−2}(X(s) − x_0 − s x_1) = s^{−1}(X(s) − x_0) + X(s).

Since x_0 = x_1 = 1, we can solve for X(s) and find

    X(s) = −1 / (s^2 + s − 1).

But the polynomial s^2 + s − 1 has two roots:

    a = (√5 − 1)/2,  b = −(√5 + 1)/2.

Hence s^2 + s − 1 = (s − a)(s − b), and so

    X(s) = −1/((s − a)(s − b)) = (1/(a − b)) (1/(s − b) − 1/(s − a)) = (1/(b − a)) (1/(b − s) − 1/(a − s)).

But (39) tells us that 1/(1 − ρs) is the generating function of ρ^n; equivalently, 1/((1/ρ) − s) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1}, and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of

    x_n = (1/(b − a)) ((1/b)^{n+1} − (1/a)^{n+1}),

which is the solution to the recursion. Noting that ab = −1, so that 1/a = −b and 1/b = −a, we can also write this as

    x_n = ((−a)^{n+1} − (−b)^{n+1}) / (b − a),

the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
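As a quick check (an illustration, not part of the notes), the closed form can be compared with the recursion:

    import math

    def fib_closed(n):
        # x_n = ((-a)^{n+1} - (-b)^{n+1}) / (b - a), with a, b the roots above
        a = (math.sqrt(5) - 1) / 2
        b = -(math.sqrt(5) + 1) / 2
        return ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)

    x = [1, 1]
    for n in range(18):
        x.append(x[-1] + x[-2])
    print([round(fib_closed(n)) for n in range(20)] == x)   # True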

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. If X is a random variable with values in R, then the so-called distribution function, namely the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density

    f(x) = (d/dx) P(X ≤ x).

All these objects can be referred to as "distribution" or "law" of X. Even the generating function Es^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+.

Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm. We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × · · · × n for large n then, since the number is so big, it's better to use logarithms:

    n! = exp( Σ_{k=1}^n log k ).

Since log x does not vary too much between two successive positive integers k and k + 1, it follows that

    ∫_k^{k+1} log x dx ≈ log k.

Hence

    Σ_{k=1}^n log k ≈ ∫_1^n log x dx.

Since (d/dx)(x log x − x) = log x, it follows that

    ∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

    n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads equals

    C(2n, n) 2^{−2n} ≈ 1,

and this is terrible, because we know that the probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:

    n! ∼ n^n e^{−n} √(2πn),

where the symbol a_n ∼ b_n means that a_n/b_n → 1, as n → ∞.
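Both approximations are easy to inspect numerically (a small illustration):

    import math

    for n in (10, 50, 100):
        exact = math.factorial(n)
        rough = n ** n * math.exp(-n)
        full = rough * math.sqrt(2 * math.pi * n)
        # the first ratio grows like sqrt(2*pi*n); the second tends to 1
        print(n, exact / rough, exact / full)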

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Consider an increasing and nonnegative function g : R → R+. We can capture both properties of g neatly by writing:

    for all x, y ∈ R:  g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x, then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x, then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. Now let X be a random variable. Then

    g(X) ≥ g(x) 1(X ≥ x),

and so, by taking expectations of both sides (expectation is an increasing operator),

    E g(X) ≥ g(x) P(X ≥ x),  for all x.

This completely trivial thought gives the very useful Markov's inequality.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:

    P(X − n ≥ m | X ≥ n) = P(X ≥ m),  for all m, n ≥ 0.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. The memoryless property shows that a geometric random variable has a distribution of the form

    P(X ≥ n) = λ^n,  n ∈ Z+,  where λ = P(X ≥ 1).

We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the random variable J_0 = Σ_{n=1}^∞ 1(S_n = 0), defined in (34) and first considered in Section 10. It was shown in Theorem 38 that J_0 satisfies the memoryless property; in fact, it was shown there that P(J_0 ≥ k) = P(J_0 ≥ 1)^k, a geometric function of k.

Bibliography

1. Bremaud P (1997) An Introduction to Probabilistic Modeling. Springer.
2. Bremaud P (1999) Markov Chains. Springer.
3. Chung K L and Aitsahlia F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller W (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead C M and Snell J L (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.pdf
6. Norris J R (1998) Markov Chains. Cambridge University Press.
7. Parry W (1981) Topics in Ergodic Theory. Cambridge University Press.

