Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos, Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1 Introduction

2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains

3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains

4 Stationarity
  4.1 Finding a stationary distribution

5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period

6 Hitting times and first-step analysis

7 Gambler's ruin

8 Stopping times and the strong Markov property

9 Regenerative structure of a Markov chain

10 Recurrence and transience

  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience

11 Positive recurrence


12 Law (=Theorem) of Large Numbers in Probability Theory

13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward

14 Construction of a stationary distribution

15 Positive recurrence is a class property

16 Uniqueness of the stationary distribution

17 Structure of a stationary distribution

18 Coupling and stability

  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains

19 Limiting theory for Markov chains*

  19.1 Limiting probabilities for null recurrent states
  19.2 The general case

20 Ergodic theorems

21 Finite chains

22 Time reversibility

23 New Markov chains from old ones*

  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain

24 Applications


  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application

25 Random walk

26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all

27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero

28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum

29 Urns and the ballot theorem

30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin

31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state

32 Duality

33 Amount of time that a SSRW is positive*

34 Simple symmetric random walk in two dimensions

35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

A few starred sections should be considered "advanced" and can be omitted at first reading. I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: m ẍ = F. Here, x = x(t) denotes the position of the particle at time t. Its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t). In other words, past states are not needed.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other is the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

xn+1 = yn
yn+1 = rem(xn, yn),    n = 0, 1, 2, ...

In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, ..., has the property that, for any n, the future pairs Xn+1, Xn+2, ... depend only on the pair Xn.
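For concreteness, here is the same dynamical system written as a short Python sketch (the starting pair (1071, 462) is only an illustrative choice):

def gcd_chain(a, b):
    # iterate the map (x, y) -> (y, rem(x, y)), started from (max(a,b), min(a,b))
    x, y = max(a, b), min(a, b)
    while y != 0:
        x, y = y, x % y      # X_{n+1} = (y_n, rem(x_n, y_n))
    return x                 # the other component is 0; x is the gcd

print(gcd_chain(1071, 462))  # prints 21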

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability.

An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫_a^b f(x) dx of a complicated function f: suppose that f is positive and bounded above by a constant B. Choose a pair (X0, Y0) of random numbers, a ≤ X0 ≤ b, 0 ≤ Y0 ≤ B, uniformly. Each time count 1 if Yn < f(Xn) and 0 otherwise. Repeat with (Xn, Yn), for steps n = 1, 2, .... Perform this a large number N of times. Count the number of 1's and divide by N. The resulting ratio, rescaled by the area B(b − a) of the enclosing box, is an approximation of ∫_a^b f(x) dx. Try this on the computer!

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness. We will be dealing with random variables, instead of deterministic objects. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In other words, we will see what kind of questions we can ask and what kind of answers we can hope for.

¹ Stochastic means random.
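A minimal Python sketch of the Monte Carlo method just described (the bound 0 ≤ f ≤ B is assumed; the function f and the constants a, b, B below are placeholders chosen only for the example):

import random

def monte_carlo_integral(f, a, b, B, N=100_000):
    # estimate the integral of f over [a, b], assuming 0 <= f <= B there
    count = 0
    for _ in range(N):
        x = random.uniform(a, b)   # the pair (X_n, Y_n), chosen uniformly
        y = random.uniform(0, B)
        if y < f(x):               # count 1 if Y_n < f(X_n)
            count += 1
    # count/N estimates the fraction of the box [a,b] x [0,B] under the
    # graph of f, so we rescale by the area B*(b - a) of the box
    return (count / N) * B * (b - a)

print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))  # about 1/3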

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05, and from 2 to 1 with probability β = 0.99, regardless of where it was at earlier times. We can summarise this information by the transition diagram:

[Transition diagram: an arrow 1 → 2 labelled α, an arrow 2 → 1 labelled β, and self-loops labelled 1 − α at state 1 and 1 − β at state 2.]

Another way to summarise the information is by the 2×2 transition probability matrix

P = [ 1−α    α  ]   =   [ 0.95  0.05 ]
    [  β    1−β ]       [ 0.99  0.01 ]

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that Dn and Sn are random variables with distributions FD and FS, respectively. Assume also that X0, D0, D1, D2, ..., S0, S1, S2, ... are independent random variables. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability, which is defined by:

px,y := P(Xn+1 = y | Xn = x).

We easily see that if y > 0 then

px,y = P(Dn − Sn = y − x) = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x) = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x) = Σ_{z=0}^∞ FS(z) FD(z + y − x),

while, if y = 0, we have px,0 = 1 − Σ_{y≥1} px,y, something that, in principle, can be computed from FS and FD.
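As an illustrative sketch of this computation in Python (the geometric choices for FD and FS are made up for the example, and the infinite sums are truncated at zmax):

def transition_prob(x, y, FD, FS, zmax=500):
    # p_{x,y} for the bank account chain, from the convolution formula;
    # FD(k) = P(D_n = k) and FS(k) = P(S_n = k)
    if y > 0:
        return sum(FS(z) * FD(z + y - x)
                   for z in range(zmax + 1) if z + y - x >= 0)
    # y = 0 collects all the remaining probability
    return 1.0 - sum(transition_prob(x, yy, FD, FS, zmax)
                     for yy in range(1, x + zmax + 1))

FD = lambda k: 0.5 * 0.5 ** k    # hypothetical deposit distribution
FS = lambda k: 0.7 * 0.3 ** k    # hypothetical spending distribution
print(transition_prob(2, 3, FD, FS))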

The transition probability matrix is then the infinite matrix

P = [ p0,0  p0,1  p0,2  ··· ]
    [ p1,0  p1,1  p1,2  ··· ]
    [ p2,0  p2,1  p2,2  ··· ]
    [  ···   ···   ···  ··· ]

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?
We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction, so he may move forwards with equal probability that he moves backwards.

[Diagram: the integer positions ..., −2, −1, 0, 1, 2, 3, ..., with each step to the left or to the right having probability 1/2.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, if ever, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
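Question 1 is easy to explore empirically; the following sketch simulates the walk for a fixed number of steps and counts the visits to 0 (p is the probability of a step to the right):

import random

def visits_to_zero(p=0.5, steps=10_000):
    # simulate the walk started at 0 and count how often it revisits 0
    position, visits = 0, 0
    for _ in range(steps):
        position += 1 if random.random() < p else -1
        if position == 0:
            visits += 1
    return visits

print(visits_to_zero())   # for p = 1/2, typically of order sqrt(steps)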

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost.

These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Transition diagram on the states H (healthy), S (sick), D (dead), with pH,S = 0.3, pS,H = 0.8, pH,D = 0.01 and pS,D = 0.1.]

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H, pS,S and pD,D. Clearly, pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property.

Usually, T = Z+ = {0, 1, 2, 3, ...} or N = {1, 2, 3, ...} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, ..., in+1 ∈ S,
P(Xn+1 = in+1 | Xn = in, ..., X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, ..., Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain
P(∀ k ∈ [n+1, n+m] Xk = ik | Xn = in, ..., X0 = i0) = P(∀ k ∈ [n+1, n+m] Xk = ik | Xn = in),
for any m and any choice of states i0, ..., in+m, which precisely means that (Xn+1, ..., Xn+m) and (X0, ..., Xn−1) are independent conditionally on Xn. This is true for any n and m, and so the future is independent of the past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as
P(X0 = i0, X1 = i1, X2 = i2, ..., Xn = in) = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) ··· P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions
µ(i) := P(X0 = i), i ∈ S,
pi,j(n, n+1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,
will specify all joint distributions²:
P(X0 = i0, X1 = i1, ..., Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) ··· pin−1,in(n−1, n).
The function µ(i) is called initial distribution. The function pi,j(n, n+1) is called transition probability from state i at time n to state j at time n+1. More generally,
pi,j(m, n) := P(Xn = j | Xm = i)
is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by a result in Probability,
pi,j(m, n) = Σ_{i1∈S} ··· Σ_{in−1∈S} pi,i1(m, m+1) pi1,i2(m+1, m+2) ··· pin−1,j(n−1, n),   (2)
which is reminiscent of matrix multiplication. Indeed, remembering a bit of Algebra, we define, for each m ≤ n, the matrix
P(m, n) := [pi,j(m, n)]i,j∈S,
i.e. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as
P(m, n) = P(m, m+1) P(m+1, m+2) ··· P(n−1, n),
or as
P(m, n) = P(m, ℓ) P(ℓ, n) if m ≤ ℓ ≤ n,
something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by (1), all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n+1) do not depend on n,

and therefore we can simply write
pi,j(n, n+1) =: pi,j,
and call pi,j the (one-step) transition probability from state i to state j, while
P = [pi,j]i,j∈S
is called the (one-step) transition probability matrix or, simply, transition matrix. Then
P(m, n) = P × P × ··· × P (n − m times) = P^{n−m},
for all m ≤ n, as follows from (1) and the definition of P. The matrix notation is useful. For example, let
µn := [µn(i) = P(Xn = i)]i∈S
be the distribution of Xn, arranged in a row vector. Then we have
µn = µ0 P^n,   (3)
and the semigroup property is now even more obvious because
P^{a+b} = P^a P^b, for all integers a, b ≥ 0.
From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

Notational conventions. Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi:
δi(j) := 1, if j = i; 0, otherwise.
We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = pδi1 + (1 − p)δi2.

A useful convention used for time-homogeneous chains is to write
Pi(A) := P(A | X0 = i).
Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write
Ei Y := E(Y | X0 = i) = Σy y P(Y = y | X0 = i) = Σy y Pi(Y = y).
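For a finite state space, (3) is directly computable; a minimal sketch in Python, using the two-state mouse chain of Section 2.1 as the example:

import numpy as np

P = np.array([[0.95, 0.05],
              [0.99, 0.01]])       # the mouse chain of Section 2.1

mu0 = np.array([1.0, 0.0])         # initial distribution delta_1

mu = mu0
for _ in range(50):                # mu_n = mu_0 P^n, built up step by step
    mu = mu @ P
print(mu)                          # the distribution settles down quickly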

4 Stationarity

We say that the process (Xn, n = 0, 1, ...) is stationary if it has the same law as (Xn, n = m, m+1, ...), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule move from room 1 to room 2 is proportional to the number of molecules in room 1. Let Xn be the number of molecules in room 1 at time n. Then
pi,i+1 = P(Xn+1 = i+1 | Xn = i) = (N − i)/N,
pi,i−1 = P(Xn+1 = i−1 | Xn = i) = i/N.
If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. On the average, we indeed expect to have N/2 molecules in each room. But this will not work, because it is impossible for the number of molecules to remain constant at all times. Another guess then is: place each molecule at random, either in room 1 or in room 2.

If we do so, then the number of molecules in room 1 will be i with probability
π(i) = (N choose i) 2^{−N}, 0 ≤ i ≤ N,
(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,
P(X1 = i) = Σj P(X0 = j) pj,i = P(X0 = i−1) pi−1,i + P(X0 = i+1) pi+1,i
= π(i−1) pi−1,i + π(i+1) pi+1,i
= (N choose i−1) 2^{−N} (N − i + 1)/N + (N choose i+1) 2^{−N} (i + 1)/N = ··· = π(i).
(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. But we need to prove this.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that
P(X0 = i0, X1 = i1, ..., Xn = in) = P(Xm = i0, Xm+1 = i1, ..., Xm+n = in),
for all m ≥ 0 and i0, ..., in ∈ S.

Since, by (3), the distribution µn at time n is related to the distribution at time 0 by µn = µ0 P^n, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that
πP = π.   (4)
Such a π is called a stationary or invariant distribution. The equations (4) are called balance equations.
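For a finite chain, the stationarity problem is a linear-algebra problem; a sketch of one standard way to solve it numerically (replace one redundant balance equation by the normalisation condition):

import numpy as np

def stationary_distribution(P):
    # solve pi P = pi together with sum(pi) = 1:
    # stack (P^T - I) with a row of ones and solve in the least-squares sense
    d = P.shape[0]
    A = np.vstack([P.T - np.eye(d), np.ones(d)])
    b = np.zeros(d + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.95, 0.05], [0.99, 0.01]])
print(stationary_distribution(P))   # about [0.952, 0.048]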

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1, because Σ_{j∈S} pi,j = 1 for all i ∈ S, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to that, π must be a probability: Σ_{i∈S} π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, 2, ..., d}, i.e. each x ∈ Sd is of the form x = (x1, ..., xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:
π(x) = 1/d!, x ∈ Sd.

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, .... The times between successive buses are i.i.d. random variables. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. Then (Xn, n ≥ 0) is a Markov chain with transition probabilities
pi+1,i = 1, if i ≥ 1,
p1,i = p(i), i ≥ 1.
The state space is S = N = {1, 2, ...}. To find the stationary distribution, write the equations (4):
π(i) = Σ_{j≥1} π(j) pj,i = π(1) p(i) + π(i+1), i ≥ 1.
These can easily be solved:
π(i) = π(1) Σ_{j≥i} p(j), i ≥ 1.
To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives
π(1) = (Σ_{i≥1} Σ_{j≥i} p(j))^{−1}.
But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:
ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:
Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),
and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:
ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.
So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted.) Consider an infinite binary vector ξ = (ξ1, ξ2, ...), ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, ...); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, ...). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:
ξ(n+1) = ϕ(ξ(n)).
Here ξ(n) = (ξ1(n), ξ2(n), ...) is the state at time n.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. But to even think about it will give you some strength. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.

Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, ..., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:
P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, for any n = 0, 1, 2, ... and any r = 1, 2, ....
The proof is by induction: if the ξ1(n), ξ2(n), ... are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε1, ..., εr ∈ {0, 1},
P(ξ1(n+1) = ε1, ..., ξr(n+1) = εr) = P(ξ1(n) = 0, ξ2(n) = ε1, ..., ξr+1(n) = εr) + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, ..., ξr+1(n) = 1 − εr).   (5)
You can check that, by the induction hypothesis, each term on the right of (5) equals (1/2)^{r+1}, so the left-hand side equals (1/2)^r, as claimed. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies
π = πP (balance equations)
π1 = 1 (normalisation condition),
where 1 is a column whose entries are all 1. In other words,
π(i) = Σ_{j∈S} π(j) pj,i, i ∈ S,
Σ_{j∈S} π(j) = 1.
If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement, under a distribution π:
F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j.
Think of π(i)pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and
F(A, Ac) = F(Ac, A), for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write
π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.
Multiply π(i) by Σ_{j∈S} pi,j (which equals 1):
π(i) Σ_{j∈A} pi,j + π(i) Σ_{j∈Ac} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

[Figure: schematic representation of the flow balance relations F(A, Ac) and F(Ac, A).]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β. (Consequently, p11 = 1 − α, p22 = 1 − β.) Assume 0 < α, β < 1. We have
π(1)α = π(2)β,
which immediately tells us that π(1) is proportional to β and π(2) proportional to α:
π(1) = cβ, π(2) = cα.
The constant c is found from π(1) + π(2) = 1. That is,
π(1) = β/(α + β), π(2) = α/(α + β).
If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952, π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends 95.2% of its time in room 1 and only 4.8% of its time in room 2.

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

[Diagram: states 0, 1, 2, 3, 4, ..., with pi,i+1 = p, pi,i−1 = q, and a barrier at state 0.]

The balance equations are:
π(i) = p π(i−1) + q π(i+1), i = 1, 2, ....
But we can make them simpler by choosing A = {0, 1, ..., i−1}. If i ≥ 1 we have
F(A, Ac) = π(i−1)p = π(i)q = F(Ac, A).
These are much simpler than the previous ones. In fact, we can solve them immediately: for i ≥ 1,
π(i) = (p/q)^i π(0).
Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)
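A quick numerical check of the barrier example, as a sketch (the choice p = 0.3 is arbitrary, subject to p < 1/2):

p = 0.3
q = 1.0 - p
rho = p / q                 # here p < 1/2, so rho < 1

pi0 = 1.0 - rho             # normalisation: the geometric series sums to 1
pi = [pi0 * rho**i for i in range(6)]
print(pi)

# flow balance across the cut A = {0, ..., i-1}: pi(i-1) p = pi(i) q
for i in range(1, 6):
    assert abs(pi[i - 1] * p - pi[i] * q) < 1e-12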

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by
(i, j) ∈ E ⇐⇒ pi,j > 0.
In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time. In other words,
i leads to j (written as i ⇝ j) ⇐⇒ Pi(∃ n ≥ 0 Xn = j) > 0.
Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 ··· pin−1,j > 0 for some n ∈ N and some states i1, ..., in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since
0 < Pi(∃ n ≥ 0 Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),
the sum on the right must have a nonzero term, and so (iii) holds.

• Suppose that (ii) holds. We have
Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 ··· pin−1,j,   (6)
where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term. Hence Pi(Xn = j) > 0, and so (iii) holds.
• Suppose now that (iii) holds. Look at the last display again. Since Pi(Xn = j) > 0, we have that the sum is also positive and so one of its terms is positive, meaning that (ii) holds. In addition, (i) holds, because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j).

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation
i communicates with j (written as i ↔ j) ⇐⇒ i ⇝ j and j ⇝ i.
Obviously, this relation is symmetric (i ↔ j ⇐⇒ j ↔ i).

Corollary 2. The relation ↔ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because ⇝ is so.

Just as any equivalence relation, it partitions S into equivalence classes known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:
[i] := {j ∈ S : j ↔ i}.
So, by definition, [i] = [j] if and only if i ↔ j. Two communicating classes are either identical or completely disjoint.

[Figure: a chain on the states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if
Σ_{j∈C} pi,j = 1, for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: if i ∈ C and if i ⇝ j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ∈ C and i ⇝ j. We then can find states i1, ..., in−1 such that pi,i1 pi1,i2 ··· pin−1,j > 0. Since pi,i1 > 0, and C is closed, we have i1 ∈ C. Similarly, pi1,i2 > 0, and so i2 ∈ C, and so on, and we conclude that j ∈ C. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts. If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In graph terms: starting from any state, we can reach any other state. In other words still, the state space S is a closed communicating class.

If a state i is such that pi,i = 1 then we say that i is an absorbing state. To put it otherwise, the single-element set {i} is a closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. Why?

A state i is essential if it belongs to a closed communicating class. Otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i ⇝ j but j does not lead to i.

[Figure: the general structure of a Markov chain, with communicating classes C1, C2, C3 (not closed) and C4, C5 (closed).]

The general structure of a Markov chain is indicated in this figure: the communicating classes C1, C2, C3 are not closed, while C4, C5 are closed communicating classes. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction.

A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them.

5.4 Period

Consider the following chain:

[Figure: the four states 1, 2, 3, 4 arranged on a square, the chain moving between neighbouring states.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. The period of an essential state is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:
d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.
The period of an inessential state is not defined. A state i is called aperiodic if d(i) = 1.
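For a finite chain, the period of a state can be computed directly from the definition by examining powers of P; a finite-horizon sketch (the horizon nmax and the example chain are illustrative choices):

import math
import numpy as np

def period(P, i, nmax=100):
    # gcd of all n <= nmax with P^n[i, i] > 0 (a finite-horizon sketch)
    d = 0
    Pn = np.eye(P.shape[0])
    for n in range(1, nmax + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            d = math.gcd(d, n)
    return d

# symmetric walk on the 4-cycle of the figure
P = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])
print(period(P, 0))   # prints 2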

Let us denote by p^(n)_ij the entries of P^n, and let Di := {n ∈ N : p^(n)_ii > 0}, so that d(i) = gcd Di. Since P^{m+n} = P^m P^n, we have
p^(m+n)_ij ≥ p^(m)_ik p^(n)_kj, for all m, n ≥ 0 and all states i, j, k.
So p^(2n)_ii ≥ (p^(n)_ii)² and, more generally,
p^(ℓn)_ii ≥ (p^(n)_ii)^ℓ, ℓ ∈ N.
Therefore if p^(n)_ii > 0 then p^(ℓn)_ii > 0 for all ℓ ∈ N. Another way to say this is: whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

Theorem 2 (the period is a class property). If i ↔ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j, and let Dj = {n ∈ N : p^(n)_jj > 0}, so that d(j) = gcd Dj. If i ↔ j there is α ∈ N with p^(α)_ij > 0 and there is β ∈ N with p^(β)_ji > 0. So
p^(α+β)_jj ≥ p^(β)_ji p^(α)_ij > 0,
showing that α + β ∈ Dj, and therefore d(j) | α + β. Now let n ∈ Di. Then p^(n)_ii > 0 and so
p^(β+n+α)_jj ≥ p^(β)_ji p^(n)_ii p^(α)_ij > 0,
showing that α + n + β ∈ Dj; therefore d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference. From the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state then there exists n0 such that p^(n)_ii > 0 for all n ≥ n0.

Proof. Pick n2, n1 ∈ Di such that n2 − n1 = 1. Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Therefore
n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.
Because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di, we have n ∈ Di as well.

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic. Note also that an irreducible chain with finitely many states is aperiodic if and only if there exists an n such that p^(n)_ij > 0 for all i, j ∈ S.

Of particular interest is the following fact: if an irreducible chain has period d then we can decompose the state space into d sets C0, C1, ..., Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

[Figure: the internal structure of a closed communicating class with period d = 4.]

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, ..., Cd−1 such that
Σ_{j∈Cr+1} pij = 1, i ∈ Cr, r = 0, 1, ..., d − 1.
(Here Cd := C0.)

Proof. Define the relation
i ~d j ⇐⇒ p^(nd)_ij > 0 for some n ≥ 0.
Notice that this is an equivalence relation. Hence it partitions the state space S into d disjoint equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d j). Then pick i1 such that pi0 i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, ..., id−1 with corresponding classes C0, C1, ..., Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0 then, necessarily, id ∈ C0. We now show that if i belongs to one of these classes and if pij > 0 then, necessarily, j belongs to the next class. Take, for example, i ∈ C0, and suppose pij > 0 but, say, j ∈ C2 rather than j ∈ C1. Consider the path i0 → i → j → i2 → i3 → ··· → id → i0. Such a path is possible because of the choice of the i0, i1, ..., and by the assumptions that i0 ~d i and i2 ~d j. The existence of such a path implies that it is possible to go from i0 to i0 in a number of steps which is an integer multiple of d − 1 (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by
TA := inf{n ≥ 0 : Xn ∈ A}.
We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis".

As an example, consider the chain

[Figure: a chain on the states 0, 1, 2, with p00 = 1, p10 = 1/2, p11 = 1/3, p12 = 1/6, p22 = 1.]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define
ϕ(i) := Pi(TA < ∞).
We then have:

Theorem 5. The function ϕ(i) satisfies
ϕ(i) = 1, i ∈ A,
ϕ(i) = Σ_{j∈S} pij ϕ(j), i ∉ A.
Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + T′A, where T′A is the remaining time until A is hit. We first have
Pi(TA < ∞) = Σ_{j∈S} Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σ_{j∈S} Pi(T′A < ∞ | X1 = j) pij.
But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, ...). Hence, conditionally on X1, the event {T′A < ∞} is independent of X0. But the Markov chain is homogeneous. Hence:
P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).
Combining the above, we have
Pi(TA < ∞) = Σ_{j∈S} ϕ(j) pij,
as needed. For the second part of the proof, let ϕ′(i) be another solution. Then
ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ϕ′(j).
By self-feeding this equation, we have
ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ( Σ_{k∈A} pjk + Σ_{k∉A} pjk ϕ′(k) )
= Σ_{j∈A} pij + Σ_{j∉A} pij Σ_{k∈A} pjk + Σ_{j∉A} Σ_{k∉A} pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1), and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and we let ϕ(i) = Pi(T0 < ∞), i = 0, 1, 2. We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0 it takes no time to hit 0; if the chain starts from 2 it will never hit 0.) As for ϕ(1), we have
ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),
so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time:
ψ(i) := Ei TA.
We have:

Theorem 6. The function ψ(i) satisfies
ψ(i) = 0, i ∈ A,
ψ(i) = 1 + Σ_{j∉A} pij ψ(j), i ∉ A.
Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0. If i ∉ A, then, as above, TA = 1 + T′A, so
Ei TA = 1 + Ei T′A = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),
which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}, and let ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, because it takes no time to hit A if the chain starts from A. Therefore,
ψ(1) = 1 + (1/3) ψ(1),
which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3; therefore the mean number of flips until the first success is 3/2.)
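Both ϕ and ψ solve linear equations, so for a finite chain first-step analysis is mechanical; a sketch for the three-state example above (states 0, 1, 2):

import numpy as np

P = np.array([[1.0, 0.0, 0.0],
              [1/2, 1/3, 1/6],
              [0.0, 0.0, 1.0]])   # the three-state chain of the figure

# phi(1) = P_1(T_0 < infinity): phi(0) = 1, phi(2) = 0 and
# phi(1) = p_{10} + p_{11} phi(1), a one-unknown linear equation
phi1 = P[1, 0] / (1 - P[1, 1])
print(phi1)                       # 0.75

# psi(1) = E_1 T_A with A = {0, 2}: psi(1) = 1 + p_{11} psi(1)
psi1 = 1 / (1 - P[1, 1])
print(psi1)                       # 1.5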

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities
ϕAB(i) = Pi(TA < TB).
It should be clear that the equations satisfied are:
ϕAB(i) = 1, i ∈ A,
ϕAB(i) = 0, i ∈ B,
ϕAB(i) = Σ_j pij ϕAB(j), otherwise.
Furthermore, ϕAB is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x a reward f(x) is earned. This happens up to the first hitting time TA of the set A. We are interested in the mean total reward
h(x) := Ex Σ_{n=0}^{TA} f(Xn).
Clearly, h(x) = f(x) for x ∈ A, because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, the total reward is f(X0) + f(X1) + ··· + f(XTA). As argued earlier, TA = 1 + T′A, where T′A is the remaining time, after one step, until set A is reached. Hence
h(x) = Ex Σ_{n=0}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=1}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=0}^{T′A} f(Xn+1),
where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, .... Hence, by the Markov property,
Ex Σ_{n=0}^{T′A} f(Xn+1) = Σ_{y∈S} pxy h(y).

Thus, the set of equations satisfied by h are
h(x) = f(x), x ∈ A,
h(x) = f(x) + Σ_{y∈S} pxy h(y), x ∉ A.
You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die; why?)

[Figure: a house with four rooms, 1, 2, 3, 4, and the doors between them.]

The question is to compute the average total reward if he starts from room 2. If we let h(i) be the average total reward if he starts from room i, for i = 1, 2, 3 we have
h(4) = 0
h(1) = 1 + (1/2) h(2)
h(3) = 3 + (1/2) h(2)
h(2) = 2 + (1/2) h(1) + (1/2) h(3).
We solve and find h(2) = 8. (Also, h(1) = 5, h(3) = 7.)
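These are again linear equations; a sketch that solves the thief's system directly:

import numpy as np

# unknowns h(1), h(2), h(3); h(4) = 0 is known:
#   h(1) = 1 + h(2)/2,  h(2) = 2 + h(1)/2 + h(3)/2,  h(3) = 3 + h(2)/2
A = np.array([[1.0, -0.5, 0.0],
              [-0.5, 1.0, -0.5],
              [0.0, -0.5, 1.0]])
b = np.array([1.0, 2.0, 3.0])
print(np.linalg.solve(A, b))   # [5. 8. 7.]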

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. The successive stakes (Yk, k ∈ N) form the strategy of the gambler: if ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), and if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. So your fortune after the n-th toss is
Xn = X0 + Σ_{k=1}^n Yk ξk.
Of course, no gambler is assumed to have any information about the outcome of the upcoming toss, so if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk. It can only depend on ξ1, ..., ξk−1 and, as is often the case in practice, on other considerations of, e.g., astrological nature. (Think why!)

To simplify things, consider the simplest case, Yk = 1 for all k: the gambler is betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler: she starts with X0 = x pounds and has a target, to make b pounds (b > x). She will stop playing if her money fall below a (a < x). We wish to consider the problem of finding the probability that the gambler will never lose her money.

The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p. This has profound actuarial applications: an insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits).

Let us consider a concrete problem: a gambler playing the simple coin tossing game described above. We are clearly dealing with a Markov chain in the state space {a, a+1, a+2, ..., b−1, b} with transition probabilities
pi,i+1 = p, pi,i−1 = q = 1 − p, a < i < b,
and where the states a, b are absorbing. Consider the hitting time of state y:
Ty := inf{n ≥ 0 : Xn = y}.
We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x. Computing the probability that the target is reached before ruin is our next goal.

Theorem 7. The probability that the gambler's fortune will reach level b > x before reaching level a < x equals:
Px(Tb < Ta) = ((q/p)^{x−a} − 1)/((q/p)^{b−a} − 1), if p ≠ q,
Px(Tb < Ta) = (x − a)/(b − a), if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let
ϕ(x, ℓ) := Px(Tℓ < T0), 0 ≤ x ≤ ℓ.
The general solution will then be obtained from
Px(Tb < Ta) = ϕ(x − a, b − a).
(To simplify things, write ϕ(x) instead of ϕ(x, ℓ).) We obviously have
ϕ(0) = 0, ϕ(ℓ) = 1,

and, by first-step analysis,
ϕ(x) = p ϕ(x+1) + q ϕ(x−1), 0 < x < ℓ.
Since p + q = 1, we can write this as
p[ϕ(x+1) − ϕ(x)] = q[ϕ(x) − ϕ(x−1)].
Let δ(x) := ϕ(x+1) − ϕ(x). We have
δ(x) = (q/p) δ(x−1) = (q/p)² δ(x−2) = ··· = (q/p)^x δ(0).
Assuming that λ := q/p ≠ 1, we find
ϕ(x) = δ(x−1) + δ(x−2) + ··· + δ(0) = [λ^{x−1} + ··· + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).
Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find
ϕ(x) = (λ^x − 1)/(λ^ℓ − 1), 0 ≤ x ≤ ℓ, λ ≠ 1.
If λ = 1 then q = p and so ϕ(x+1) − ϕ(x) = ϕ(x) − ϕ(x−1). This gives δ(x) = δ(0) for all x, and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ, and so
ϕ(x) = x/ℓ, 0 ≤ x ≤ ℓ, λ = 1.
Summarising, we have obtained
ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1; x/ℓ, if λ = 1.
The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is
Px(T0 < ∞) = (q/p)^x, if p > 1/2; 1, if p ≤ 1/2.

Proof. We have
Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).
If p > 1/2, then λ = q/p < 1, and so
Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x.
If p < 1/2, then λ = q/p > 1, and so
Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.
Finally, if p = q = 1/2,
Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
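The closed-form answer is easy to check against simulation; a sketch (the cap plays the role of a large ℓ, and the parameters are illustrative):

import random

def ruin_probability(x, p):
    # P_x(T_0 < infinity) from Corollary 4
    return ((1 - p) / p) ** x if p > 0.5 else 1.0

def simulate_ruin(x, p, trials=5_000, cap=200):
    # fraction of paths that hit 0 before the (large) cap
    ruined = 0
    for _ in range(trials):
        pos = x
        while 0 < pos < cap:
            pos += 1 if random.random() < p else -1
        ruined += (pos == 0)
    return ruined / trials

print(ruin_probability(3, 0.6), simulate_ruin(3, 0.6))  # both near 0.296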

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A stopping time is a random time whose value, at each n, can be decided by looking at the process up to time n only. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, ..., Xn) for all n ∈ Z+. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, the future process (Xn, Xn+1, ...) is independent of the past process (X0, ..., Xn−1), conditionally on Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain and let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, ..., Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, ..., Xτ) if τ < ∞. Since
(X0, ..., Xτ)1(τ < ∞) = Σ_{n∈Z+} (X0, ..., Xn)1(τ = n),
any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, ..., Xn). We are going to show that, for any such A and any states j1, j2, ...,
P(Xτ+1 = j1, Xτ+2 = j2, ... | Xτ = i, A, τ < ∞) = pi,j1 pj1,j2 ···
and that
P(Xτ+1 = j1, Xτ+2 = j2, ..., A | Xτ = i, τ < ∞) = pi,j1 pj1,j2 ··· P(A | Xτ = i, τ < ∞).
We have:
P(Xτ+1 = j1, Xτ+2 = j2, ..., A, Xτ = i, τ = n)
(a) = P(Xn+1 = j1, Xn+2 = j2, ..., A, Xn = i, τ = n)
(b) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
(c) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i) P(Xn = i, A, τ = n)
(d) = pi,j1 pj1,j2 ··· P(Xn = i, A, τ = n),
where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, ..., Xn) and to the ordinary splitting (Markov) property at time n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain: the random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property). Schematically, the trajectory breaks into independent excursions:

[Figure: a trajectory of the chain, cut at the successive visit times Ti = Ti(1), Ti(2), Ti(3), Ti(4) of the state i.]

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:
Ti := inf{n ≥ 1 : Xn = i}.
Note the subtle difference between this Ti and the Ti of Section 6: our Ti here is always positive, while the Ti of Section 6 was allowed to take the value 0. (If X0 ≠ i the two definitions coincide.) State i may or may not be visited again. We let Ti(2) be the time of the second visit, Ti(3) the time of the third visit, and so on. Formally, we recursively define
Ti(r) := inf{n > Ti(r−1) : Xn = i}, r = 2, 3, ...
(And we set Ti(1) := Ti.) All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:
Xi(r) := (Xn, Ti(r) ≤ n < Ti(r+1)), r = 1, 2, ...
Let also
Xi(0) := (Xn, 0 ≤ n < Ti(1)).
We think of them as random variables, in a generalised sense: random trajectories. We refer to them as excursions of the process. So Xi(r) is the r-th excursion: the chain visits state i for the r-th time and "goes for an excursion".
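The regenerative picture is easy to see in simulation; a sketch using the two-state mouse chain and recording the excursion lengths between successive visits to cell 1 (by the theorem below, these lengths are i.i.d.):

import random

alpha, beta = 0.05, 0.99
state, last_visit, lengths = 1, 0, []
for n in range(1, 100_000):
    if state == 1:
        state = 2 if random.random() < alpha else 1
    else:
        state = 1 if random.random() < beta else 2
    if state == 1:
        lengths.append(n - last_visit)   # one excursion completed
        last_visit = n

print(sum(lengths) / len(lengths))       # the common mean of the lengths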

Apply the Strong Markov Property (SMP) at T_i^{(r)}: SMP says that, given the state at the stopping time T_i^{(r)}, the future is independent of the past. But the state at T_i^{(r)} is, by definition, equal to i. Therefore, the future after T_i^{(r)} is independent of the past before T_i^{(r)}. An immediate consequence of the strong Markov property is:

Theorem 9. The excursions X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, ... are independent random variables. Moreover, for time-homogeneous Markov chains, all, except, possibly, X_i^{(0)}, are identical in law, i.e. X_i^{(1)}, X_i^{(2)}, ... have the same distribution.

Proof. The independence is the strong Markov property applied at the stopping times T_i^{(r)}, as above. The claim that X_i^{(1)}, X_i^{(2)}, ... have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(X_i^{(1)}), g(X_i^{(2)}), ... of the excursions are independent random variables.

Example: Define

Λ_r := Σ_{T_i^{(r)} ≤ n < T_i^{(r+1)}} 1(X_n = j).

The meaning of Λ_r is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ_1, Λ_2, ... are independent random variables.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions. First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:

P_i(X_n = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:

P_i(X_n = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:

0 < P_i(X_n = i for infinitely many n) < 1.

We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state i (excluding the possibility that X_0 = i):

J_i = Σ_{n=1}^∞ 1(X_n = i) = sup{r ≥ 0 : T_i^{(r)} < ∞}.

Notice that: state i is visited infinitely many times ⟺ J_i = ∞. Notice also that

f_{ii} := P_i(J_i ≥ 1) = P_i(T_i < ∞).

We claim that:

Lemma 4. Starting from X_0 = i, the random variable J_i is geometric:

P_i(J_i ≥ k) = f_{ii}^k, k ≥ 0.

Proof. Indeed, the event {J_i ≥ k} implies that J_i ≥ k − 1 (i.e. T_i^{(k−1)} < ∞) and that, after time T_i^{(k−1)}, the chain visits i at least once more. By the strong Markov property at time T_i^{(k−1)}, the past before T_i^{(k−1)} is independent of the future, and that is why we get the product:

P_i(J_i ≥ k) = P_i(J_i ≥ k − 1) P_i(J_i ≥ 1).

Therefore,

P_i(J_i ≥ k) = f_{ii} P_i(J_i ≥ k − 1) = f_{ii}^2 P_i(J_i ≥ k − 2) = ··· = f_{ii}^k.

Notice that f_{ii} = P_i(T_i < ∞) can either be equal to 1 or smaller than 1. If f_{ii} = 1 we have that P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1. This means that the state i is recurrent. If f_{ii} < 1, then P_i(J_i < ∞) = 1. This means that the state i is transient. Because either f_{ii} = 1 or f_{ii} < 1, we conclude what we asserted earlier, namely, that every state is either recurrent or transient.

We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas.

Lemma 5. Suppose that the Markov chain starts from state i. The following are equivalent: 1. State i is recurrent. 2. f_{ii} = P_i(T_i < ∞) = 1. 3. P_i(J_i = ∞) = 1. 4. E_i J_i = ∞.

Proof. The equivalence of the first two statements was proved before the lemma. If f_{ii} = 1 then, by Lemma 4, P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1, and then E_i J_i = ∞ as well. Conversely, if E_i J_i = ∞, then, owing to the fact that J_i is a geometric random variable, we see that its expectation cannot be infinity without J_i itself being infinity with probability one.

Lemma 6. Suppose that the Markov chain starts from state i. The following are equivalent: 1. State i is transient. 2. f_{ii} = P_i(T_i < ∞) < 1. 3. P_i(J_i < ∞) = 1. 4. E_i J_i < ∞.

Proof. The equivalence of the first two statements was proved above. If i is transient then J_i is geometric with f_{ii} < 1, and this implies that E_i J_i < ∞. On the other hand, if we assume E_i J_i < ∞, then, clearly, J_i cannot take the value +∞ with positive probability, therefore P_i(J_i < ∞) = 1, which, by the definition of J_i, is equivalent to transience.

But, from Elementary Probability,

E_i J_i = E_i Σ_{n=1}^∞ 1(X_n = i) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ p_{ii}^{(n)}.

Hence: state i is recurrent if and only if Σ_{n=1}^∞ p_{ii}^{(n)} = ∞; state i is transient if and only if Σ_{n=1}^∞ p_{ii}^{(n)} < ∞.

Corollary 6. If i is a transient state then p_{ii}^{(n)} → 0, as n → ∞.

Proof. If i is transient then E_i J_i = Σ_{n=1}^∞ p_{ii}^{(n)} < ∞. But if a series is finite then the summands go to 0.

10.1 First hitting time decomposition

Define

f_{ij}^{(n)} := P_i(T_j = n), n ≥ 1,

the probability to hit state j for the first time at time n ≥ 1, assuming that the chain starts from state i. Notice the difference with p_{ij}^{(n)}: the latter is the probability to be in state j at time n (not necessarily for the first time) starting from i. Clearly, f_{ij}^{(n)} ≤ p_{ij}^{(n)}, but the two sequences are related in an exact way:

Lemma 7.

p_{ij}^{(n)} = Σ_{m=1}^n f_{ij}^{(m)} p_{jj}^{(n−m)}, n ≥ 1.

Proof.

p_{ij}^{(n)} = P_i(X_n = j) = Σ_{m=1}^n P_i(T_j = m, X_n = j) = Σ_{m=1}^n P_i(T_j = m) P_i(X_n = j | T_j = m).

The first term equals f_{ij}^{(m)}, by definition. For the second term, we use the Strong Markov Property:

P_i(X_n = j | T_j = m) = P_i(X_n = j | X_m = j) = p_{jj}^{(n−m)}.

The lemma is simply an expression in symbols of what we said above: to be at j at time n, the chain must visit j for the first time at some time m ≤ n and then be back at j after n − m more steps.

Corollary 7. If j is a transient state then p_{ij}^{(n)} → 0, as n → ∞, for all states i.

Proof. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right-hand side)

Σ_{n=1}^∞ p_{ij}^{(n)} = Σ_{m=1}^∞ f_{ij}^{(m)} Σ_{n=m}^∞ p_{jj}^{(n−m)} = (1 + E_j J_j) Σ_{m=1}^∞ f_{ij}^{(m)} = (1 + E_j J_j) P_i(T_j < ∞).

If j is transient, we have Σ_{n=1}^∞ p_{jj}^{(n)} = E_j J_j < ∞, so the right-hand side, and hence the left-hand side, is finite. But if a series is finite then the summands go to 0, and hence p_{ij}^{(n)} → 0, as n → ∞.

In particular, if we let f_{ij} := P_i(T_j < ∞), we have, by definition,

i ⇝ j ⟺ f_{ij} > 0.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and

f_{ij} = P_i(T_j < ∞) = 1.

In other words, not only f_{ij} > 0, but also f_{ij} = 1. Hence, also j ⇝ i.

Proof. Start with X_0 = i. If i is recurrent and i ⇝ j then there is a positive probability (= f_{ij}) that j will appear in one of the i.i.d. excursions X_i^{(0)}, X_i^{(1)}, ..., and, by the strong Markov property, the probability that j will appear in a specific excursion equals

g_{ij} := P_i(T_j < T_i) > 0.

So the random variables

δ_{j,r} := 1(j appears in excursion X_i^{(r)}), r = 0, 1, 2, ...,

are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence f_{ij} = 1; and j, being visited infinitely often, is recurrent, with j ⇝ i.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Theorem 11. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i must be visited infinitely many times. This state is recurrent. Hence, by the above theorem, every other j in the same communicating class is also recurrent. Hence all states are recurrent.

Theorem 12. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j does not lead back to i. There is m such that P_i(A) = P_i(X_m = j) > 0, where A = {X_m = j}. Let B = {X_n = i for infinitely many n}. Since j does not lead to i, necessarily

P_i(A ∩ B) = P_i(X_m = j, X_n = i for infinitely many n) = 0.

But then

0 < P_i(A) = P_i(A ∩ B) + P_i(A ∩ B^c) = P_i(A ∩ B^c) ≤ P_i(B^c),

and hence P_i(B) = P_i(X_n = i for infinitely many n) < 1, and so i is a transient state.

So if a communicating class is recurrent then it contains no inessential states and so it is closed.

11 Positive recurrence

Recall that if we fix a state i and let T_i be the first return time to i then either P_i(T_i < ∞) = 1 (in which case we say that i is recurrent) or P_i(T_i < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U_0, U_1, U_2, ... be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. (It should be clear that this can appear, for example, in a certain game.) Let

T := inf{n ≥ 1 : U_n > U_0}.

Thus T represents the number of variables we need to draw to get a value larger than the initial one. What is the expectation of T? First, observe that T is finite:

P(T > n) = P(U_1 ≤ U_0, ..., U_n ≤ U_0) = ∫_0^1 P(U_1 ≤ x, ..., U_n ≤ x | U_0 = x) dx = ∫_0^1 x^n dx = 1/(n+1).
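A quick Monte Carlo check of this computation (a sketch in Python with numpy; all names are ours): the empirical tail P(T > n) should track 1/(n + 1), while the sample mean of T keeps growing with the sample size, as befits a finite random variable with infinite expectation.

    import numpy as np

    rng = np.random.default_rng(1)

    def draw_T():
        """T = inf{n >= 1 : U_n > U_0} for i.i.d. uniform U_0, U_1, ..."""
        u0 = rng.random()
        n = 1
        while rng.random() <= u0:   # U_n <= U_0: keep drawing
            n += 1
        return n

    T = np.array([draw_T() for _ in range(100_000)])
    for n in (1, 2, 5, 10):
        print(n, (T > n).mean(), 1 / (n + 1))   # empirical vs exact P(T > n)
    print(T.mean())   # grows without bound as the sample size increases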

Therefore,

P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n+1)) = 1.

But

E T = Σ_{n=0}^∞ P(T > n) = Σ_{n=0}^∞ 1/(n+1) = ∞.

You are invited to think about the meaning of this: if the initial value U_0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!

We now classify a recurrent state i as follows: We say that i is positive recurrent if E_i T_i < ∞. We say that i is null recurrent if E_i T_i = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading: it is not a Law; it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z_1, Z_2, ... be i.i.d. random variables.
(i) Suppose E|Z_1| < ∞ and let µ = E Z_1. Then

P( lim_{n→∞} (Z_1 + ··· + Z_n)/n = µ ) = 1.

(ii) Suppose E max(Z_1, 0) = ∞, but E min(Z_1, 0) > −∞. Then

P( lim_{n→∞} (Z_1 + ··· + Z_n)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X_0, X_1, X_2, ... of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.

13.1 Functions of excursions

Specifically, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions

X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, ...

forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that

G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), ...

are i.i.d. random variables, for reasonable choices of the functions G taking values in R. Note that

E G(X_a^{(1)}) = E G(X_a^{(2)}) = ··· = E[G(X_a^{(0)}) | X_0 = a] = E_a G(X_a^{(0)}),

because the excursions X_a^{(1)}, X_a^{(2)}, ... are i.i.d., and because if we start with X_0 = a, the initial excursion X_a^{(0)} also has the same law as the rest of them. (The final equality in the last display is merely our notation for the expectation of a random variable conditional on X_0 = a.)

Example 1: To each state x ∈ S, assign a reward f(x), which, in this example, we can think of as a reward received when the chain is in state x. Then, for an excursion X, define G(X) to be the maximum reward received during this excursion:

G(X_a^{(r)}) = max{f(X_t) : T_a^{(r)} ≤ t < T_a^{(r+1)}}.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:

G(X_a^{(r)}) = Σ_{T_a^{(r)} ≤ t < T_a^{(r+1)}} f(X_t). (7)

Whichever the choice of G, the law of large numbers tells us that

(1/n) [G(X_a^{(1)}) + ··· + G(X_a^{(n)})] → E G(X_a^{(1)}), (8)

as n → ∞, with probability 1.

13.2 Average reward

Consider now a function f : S → R, which, as in Example 2, we can think of as a reward received when the chain is in state x. We are interested in the existence of the limit

time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(X_k),

which represents the "time-average reward" received. Thus, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a T_a < ∞) and assume that the distribution of X_0 is such that P(T_a < ∞) = 1. Let f : S → R+ be a bounded "reward" function. Then, with probability one, the "time-average reward" f̄ exists and is a deterministic quantity:

P( lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = f̄ ) = 1,

where

f̄ = E_a [ Σ_{n=0}^{T_a−1} f(X_n) ] / E_a T_a = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this: Suppose that t is large enough. We are looking for the total reward received between 0 and t. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, N_t, of complete excursions from the ground state a that occur between 0 and t:

N_t := max{r ≥ 1 : T_a^{(r)} ≤ t}.

For example, in the figure below, N_t = 5.

[Figure: a time axis with the points 0, T_a^{(1)}, T_a^{(2)}, T_a^{(3)}, T_a^{(4)}, T_a^{(5)}, t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:

Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_t^{first} + G_t^{last},

where G_r is given by (7), G_t^{first} is the reward up to time T_a = T_a^{(1)} or t (whichever is smaller), and G_t^{last} is the reward over the last incomplete cycle.

We assumed that f is bounded, say |f(x)| ≤ B. Hence |G_t^{first}| ≤ B T_a. Since P(T_a < ∞) = 1, we have B T_a / t → 0, and so

(1/t) G_t^{first} → 0, as t → ∞,

with probability one. A similar argument shows that

(1/t) G_t^{last} → 0, as t → ∞,

with probability one.

So we are left with the sum of the rewards over complete cycles. Write now

(1/t) Σ_{r=1}^{N_t−1} G_r = (N_t / t) · (1/N_t) Σ_{r=1}^{N_t−1} G_r. (9)

Look again at what the Law of Large Numbers (8) tells us:

lim_{n→∞} (1/n) Σ_{r=1}^n G_r = E G_1, (10)

with probability one. Since T_a^{(r)} → ∞, as r → ∞, it follows that the number N_t of complete cycles before time t will also be tending to ∞, as t → ∞. Hence

lim_{t→∞} (1/N_t) Σ_{r=1}^{N_t−1} G_r = E G_1.

Now look at the sequence

T_a^{(r)} = T_a^{(1)} + Σ_{k=2}^r (T_a^{(k)} − T_a^{(k−1)})

and see what the Law of Large Numbers says again: since T_a^{(1)}/r → 0, and since the differences T_a^{(k)} − T_a^{(k−1)}, k = 2, 3, ..., are i.i.d. random variables with common expectation E_a T_a, we have

lim_{r→∞} T_a^{(r)} / r = E_a T_a. (11)

By definition, the visit with index r = N_t to the ground state a is the last before t (see also the figure above):

T_a^{(N_t)} ≤ t < T_a^{(N_t+1)}.

Hence

T_a^{(N_t)} / N_t ≤ t / N_t < T_a^{(N_t+1)} / N_t.

But (11) tells us that

lim_{t→∞} T_a^{(N_t)} / N_t = lim_{t→∞} T_a^{(N_t+1)} / N_t = E_a T_a.

Hence

lim_{t→∞} N_t / t = 1 / E_a T_a. (12)

Using (10) and (12) in (9) we have

lim_{t→∞} (1/t) Σ_{r=1}^{N_t−1} G_r = E G_1 / E_a T_a.

Thus

lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = E G_1 / E_a T_a =: f̄.

To see that f̄ is as claimed, just observe that

E G_1 = E Σ_{T_a^{(1)} ≤ t < T_a^{(2)}} f(X_t) = E_a Σ_{0 ≤ t < T_a} f(X_t),

from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion. The numerator of f̄ is computed by summing up the rewards over the excursion, making sure that not both endpoints are included: we may write E G_1 = E_a Σ_{0 ≤ t < T_a} f(X_t) or E G_1 = E_a Σ_{0 < t ≤ T_a} f(X_t), i.e. we include only one endpoint of the excursion, not both. The denominator is the duration of the excursion.

[Figure: the rewards f(X_0) = f(a), f(X_1), f(X_2), ..., f(X_{T_a−1}) collected over one excursion, from time 0 to time T_a.]

Comment 2: A very important particular case occurs when we choose

f(x) := 1(x = b),

i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, for all b ∈ S,

lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = b) = E_a [ Σ_{k=0}^{T_a−1} 1(X_k = b) ] / E_a T_a, (13)

with probability 1. The physical interpretation of the latter should be clear: it is the average number of times state b is visited between two successive visits to state a, divided by the average duration of an excursion.

Comment 3: If in the above we let b = a, then we see that the numerator equals 1. If, on the average, it takes E_a T_a units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/E_a T_a. Hence

lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = a) = 1 / E_a T_a, (14)

with probability 1.

Comment 4: Suppose that two states, a and b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b T_b. The equality of these two limits gives:

E_a [ Σ_{k=0}^{T_a−1} 1(X_k = b) ] / E_a T_a = 1 / E_b T_b.
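Formula (13) can be checked numerically. In the following sketch (Python/numpy; the chain and the states a, b are illustrative choices, not from the notes) we estimate the left-hand side by the empirical fraction of time spent at b, and the right-hand side by averaging visits to b and durations over the a-excursions:

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative irreducible chain; a = 0 is the ground state, b = 1.
    P = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.4, 0.4, 0.2]])

    def simulate(P, x0, n):
        X = np.empty(n + 1, dtype=int)
        X[0] = x0
        for t in range(n):
            X[t + 1] = rng.choice(len(P), p=P[X[t]])
        return X

    X = simulate(P, x0=0, n=200_000)
    a, b = 0, 1

    lhs = (X[1:] == b).mean()                 # long-run fraction of time at b

    visits = np.flatnonzero(X == a)           # successive visit times to a
    counts = [np.count_nonzero(X[s:t] == b)   # visits to b per a-excursion
              for s, t in zip(visits, visits[1:])]
    durations = np.diff(visits)               # excursion lengths
    print(lhs, np.mean(counts) / np.mean(durations))   # both sides of (13)

Note that each excursion slice X[s:t] contains exactly one endpoint, in accordance with Comment 1.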

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

ν^[a](x) := E_a Σ_{n=0}^{T_a−1} 1(X_n = x), x ∈ S, (15)

should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then ν^[a](x) / E_a T_a is the long-run fraction of times that state x is visited. In view of this, it should have something to do with the concept of stationary distribution discussed a few sections ago.

Note: Although ν^[a] depends on the choice of the "ground" state a (the choice of notation ν^[a] is justifiable through the notation (6) for the communicating class of the state a), we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent^5. Consider the function ν^[a] defined in (15). Then ν^[a] satisfies the balance equations:

ν^[a] = ν^[a] P,

i.e.

ν^[a](x) = Σ_{y∈S} ν^[a](y) p_{yx}, x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^[a](x) < ∞.

Proof. We start the Markov chain with X_0 = a. Recurrence tells us that T_a < ∞ with probability one. Since a = X_{T_a}, we can write

ν^[a](x) = E_a Σ_{n=1}^{T_a} 1(X_n = x). (16)

This was a trivial but important step: we replaced the n = 0 term with the n = T_a term. The real reason for doing so is because we wanted to replace a function of X_0, X_1, X_2, ... by a function of X_1, X_2, ...,

5 We stress that we do not need the stronger assumption of positive recurrence here.

and thus be in a position to apply first-step analysis, which is precisely what we do next:

ν^[a](x) = E_a Σ_{n=1}^∞ 1(X_n = x, n ≤ T_a)
= Σ_{n=1}^∞ P_a(X_n = x, n ≤ T_a)
= Σ_{n=1}^∞ Σ_{y∈S} P_a(X_n = x, X_{n−1} = y, n ≤ T_a)
= Σ_{n=1}^∞ Σ_{y∈S} p_{yx} P_a(X_{n−1} = y, n ≤ T_a)
= Σ_{y∈S} p_{yx} E_a Σ_{n=1}^{T_a} 1(X_{n−1} = y)
= Σ_{y∈S} p_{yx} ν^[a](y),

where, in the last step, we used (16). So we have shown

ν^[a](x) = Σ_{y∈S} p_{yx} ν^[a](y), x ∈ S,

which are precisely the balance equations.

Next, let x be such that a ⇝ x, so there exists n ≥ 1 such that p_{ax}^{(n)} > 0. Since ν^[a] = ν^[a] P, we have ν^[a] = ν^[a] P^n for all n ∈ N. Hence

ν^[a](x) ≥ ν^[a](a) p_{ax}^{(n)} = p_{ax}^{(n)} > 0.

From Theorem 10, x ⇝ a as well, so there exists m ≥ 1 such that p_{xa}^{(m)} > 0. Hence

1 = ν^[a](a) ≥ ν^[a](x) p_{xa}^{(m)},

so, indeed, ν^[a](x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. First note that

Σ_{x∈S} ν^[a](x) = E_a Σ_{n=1}^{T_a} Σ_{x∈S} 1(X_n = x) = E_a Σ_{n=1}^{T_a} 1 = E_a T_a,

which, by definition, is finite when a is positive recurrent. We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define

π^[a](x) := ν^[a](x) / E_a T_a = E_a [ Σ_{n=0}^{T_a−1} 1(X_n = x) ] / E_a T_a, x ∈ S. (17)

Then π^[a] is a stationary distribution.

Proof. We have already shown, in Proposition 2, that ν^[a] P = ν^[a]. But π^[a] is simply a multiple of ν^[a], so π^[a] P = π^[a]. It only remains to show that π^[a] is a probability, i.e. that it sums up to one. But

Σ_{x∈S} π^[a](x) = (1/E_a T_a) Σ_{x∈S} ν^[a](x) = 1.

Also, since a is positive recurrent, we have E_a T_a < ∞, and so π^[a] is not trivially equal to zero.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one, no matter which one, positive recurrent state, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique; this is what we will try to establish in the sequel. In other words: if there is a positive recurrent state, then the set of linear equations

π(x) = Σ_{y∈S} π(y) p_{yx}, x ∈ S, Σ_{y∈S} π(y) = 1,

has at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state. Suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X_0 = i. We already know that j is recurrent (Theorem 10). Consider the excursions X_i^{(r)}, r = 0, 1, 2, .... The probability g_{ij} = P_i(T_j < T_i) that j occurs in a specific excursion is positive, so j will occur in infinitely many excursions. Also, since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads g_{ij} > 0. The coins are i.i.d. If heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. Let R_1 (respectively, R_2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the excursions with indices R_1, R_1 + 1, R_1 + 2, ..., R_2; their total duration is ζ, a random variable whose summands have the same law as T_i under P_i.]

The time between the first and the second occurrence of j is bounded by the total duration of the excursions with indices R_1, ..., R_2. So we just need to show that

ζ := Σ_{r=1}^{R_2−R_1+1} Z_r

has finite expectation, where the Z_r are i.i.d. with the same law as T_i (each Z_r is the duration of one excursion). But, by Wald's lemma, the expectation of this random variable equals

E_i ζ = (E_i T_i) × E_i(R_2 − R_1 + 1) = (E_i T_i) × (g_{ij}^{−1} + 1),

because R_2 − R_1 is a geometric random variable:

P_i(R_2 − R_1 = r) = (1 − g_{ij})^{r−1} g_{ij}, r = 1, 2, ....

Since g_{ij} > 0 and E_i T_i < ∞, we have that E_i ζ < ∞. Hence E_j T_j < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. An irreducible Markov chain is called transient if one (and hence all) states are transient. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.

[Diagram: an IRREDUCIBLE CHAIN is either RECURRENT or TRANSIENT; a recurrent chain is either POSITIVE RECURRENT or NULL RECURRENT.]

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Figure: states 0, 1, 2; from state 0 the chain moves to 1 with probability 1/3, to 2 with probability 1/6, and stays at 0 with probability 1/2; states 1 and 2 have self-loops with probability 1.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many (infinitely many) stationary distributions: for each 0 ≤ α ≤ 1, the π defined by

π(0) = 0, π(1) = α, π(2) = 1 − α,

is a stationary distribution. (You should think why this is OBVIOUS!) Notice that states 1 and 2 do not communicate. This lack of communication is, indeed, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution. Consider the Markov chain (X_n, n ≥ 0) and suppose that

P(X_0 = x) = π(x), x ∈ S,

i.e. we start the Markov chain from the stationary distribution π. Then, for all n, we have P(X_n = x) = π(x), for all x ∈ S. Fix a ground state a. Since the chain is irreducible and positive recurrent, we have P_i(T_j < ∞) = 1 for all states i and j. Then

P(T_a < ∞) = Σ_{x∈S} π(x) P_x(T_a < ∞) = Σ_{x∈S} π(x) = 1. (18)

In other words, by (18) and Theorem 14, we have, with probability one,

(1/t) Σ_{n=1}^t 1(X_n = x) → E_a [ Σ_{k=0}^{T_a−1} 1(X_k = x) ] / E_a T_a = π^[a](x), as t → ∞,

following (13). By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides:

E [ (1/t) Σ_{n=1}^t 1(X_n = x) ] → π^[a](x), as t → ∞.

But, by stationarity,

E [ (1/t) Σ_{n=1}^t 1(X_n = x) ] = (1/t) Σ_{n=1}^t P(X_n = x) = π(x).

Hence π = π^[a]. Therefore π(x) = π^[a](x) for all x ∈ S, i.e. an arbitrary stationary distribution π must be equal to the specific one π^[a]. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π^[a] = π^[b]. In particular, we obtain the cycle formula:

E_a [ Σ_{k=0}^{T_a−1} 1(X_k = x) ] / E_a T_a = E_b [ Σ_{k=0}^{T_b−1} 1(X_k = x) ] / E_b T_b, for all x ∈ S.

Proof. Both π^[a] and π^[b] are stationary distributions. There is only one stationary distribution, so they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,

π(a) = 1 / E_a T_a.

Proof. There is a unique stationary distribution; namely, from the definition of π^[a],

π(a) = π^[a](a) = 1 / E_a T_a.

In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain. Assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0. Let C be the communicating class of i. Consider the restriction of the chain to C, i.e. delete all transition probabilities p_{xy} from states x ∈ C to states y outside C. Let π(·|C) be the distribution defined by restricting π to C:

π(x|C) := π(x) / π(C), x ∈ C,

where π(C) = Σ_{y∈C} π(y). We can easily see that we thus obtain a Markov chain with state space C whose stationary distribution is π(·|C). (This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible.) But π(i|C) > 0, and, by the above corollary,

1/π(i|C) = E_i T_i.

Therefore E_i T_i < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain and decompose it into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π^C which assigns positive probability to each state in C.
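For a finite irreducible chain, the unique stationary distribution can be computed directly by solving the balance equations together with the normalisation condition, and Corollary 12 can then be read off. A sketch (Python/numpy; the matrix P is an illustrative example):

    import numpy as np

    P = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.4, 0.4, 0.2]])

    # pi = pi P and sum(pi) = 1: replace one balance equation by the
    # normalisation condition and solve the linear system.
    n = len(P)
    A = P.T - np.eye(n)
    A[-1, :] = 1.0
    b = np.zeros(n); b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    print(pi)        # the unique stationary distribution
    print(1 / pi)    # = (E_a T_a, a in S), by Corollary 12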

This distribution is defined by picking a state (any state!) a ∈ C and letting π^C := π^[a]; by our uniqueness result, the choice of a within C does not matter.

Proposition 3. Any stationary distribution π is necessarily of the form

π = Σ_{C: C positive recurrent communicating class} α_C π^C,

where α_C ≥ 0, such that Σ_C α_C = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then x belongs to some positive recurrent class C. For each such C define the conditional distribution

π(x|C) := π(x) / π(C), where π(C) := Σ_{x∈C} π(x).

Then (π(x|C), x ∈ C) is a stationary distribution for the chain restricted to C. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with α_C = π(C).

So far we have seen, under irreducibility and positive recurrence conditions, that to each positive recurrent communicating class C there corresponds a stationary distribution π^C, and that an arbitrary stationary distribution π must be a linear combination of these π^C.

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(X_n = x) as n → ∞. So far we have seen, for example, that the so-called Cesàro averages (1/n) Σ_{k=1}^n P(X_k = x) converge to π(x), as n → ∞. (A sequence of real numbers a_n may not converge but its Cesàro averages (1/n)(a_1 + ··· + a_n) may converge; for example, take a_n = (−1)^n.)

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. Still, we may say that it is stable, because it does not (it cannot!) swing too far away. But what does stability mean, then? In general, it cannot mean convergence to a fixed equilibrium point.

The simplest differential equation. As another example, consider the following differential equation:

ẋ = −ax + b, x ∈ R, a > 0.

The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero: x* = b/a. If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If, moreover, we start from any other position, then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: solution curves x(t) starting from different initial values, all converging to x*.]

A linear stochastic system. We now pass on to stochastic systems. Specifically, let us consider a stochastic analogue of the above in discrete time (since we are prohibited by the syllabus to speak about continuous-time stochastic systems):

X_{n+1} = ρ X_n + ξ_n.

There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever. We here assume that X_0, ξ_0, ξ_1, ξ_2, ... are given, mutually independent, random variables, and define a stochastic sequence X_1, X_2, ... by means of the recursion above. We shall assume that the ξ_n have the same law.

But what does stability mean here? The question is not simple, because the ξ_n keep perturbing the system at every time n. However, notice what happens when we solve the recursion (and we are in a happy situation here, for we can do it):

X_1 = ρ X_0 + ξ_0
X_2 = ρ X_1 + ξ_1 = ρ(ρ X_0 + ξ_0) + ξ_1 = ρ² X_0 + ρ ξ_0 + ξ_1
X_3 = ρ X_2 + ξ_2 = ρ³ X_0 + ρ² ξ_0 + ρ ξ_1 + ξ_2
······
X_n = ρⁿ X_0 + ρ^{n−1} ξ_0 + ρ^{n−2} ξ_1 + ··· + ρ ξ_{n−2} + ξ_{n−1}.

It follows that |ρ| < 1 is the condition for stability.
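A two-line experiment (Python/numpy sketch; ρ = 0.8 and the bounded uniform noise are illustrative choices) shows what |ρ| < 1 buys us: two solutions driven by the same noise but started far apart become indistinguishable, because their difference is ρⁿ(X_0 − X_0′):

    import numpy as np

    rng = np.random.default_rng(3)

    rho, n = 0.8, 200                 # |rho| < 1: the stability condition
    xi = rng.uniform(-1, 1, size=n)   # bounded i.i.d. noise

    def run(x0):
        x = x0
        for t in range(n):
            x = rho * x + xi[t]       # X_{t+1} = rho X_t + xi_t
        return x

    print(run(100.0), run(-100.0))    # nearly identical after 200 steps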

Indeed, since |ρ| < 1, the first term, ρⁿ X_0, goes to 0 as n → ∞. The second term,

X̃_n := ρ^{n−1} ξ_0 + ρ^{n−2} ξ_1 + ··· + ρ ξ_{n−2} + ξ_{n−1},

which is a linear combination of ξ_0, ..., ξ_{n−1}, does not converge. But notice that if we let

X̂_n := ρ^{n−1} ξ_{n−1} + ρ^{n−2} ξ_{n−2} + ··· + ρ ξ_1 + ξ_0,

then, since the ξ's are i.i.d., we have

P(X̃_n ≤ x) = P(X̂_n ≤ x), for all n and x.

The nice thing is that, under nice conditions (e.g. assume that the ξ_n are bounded), X̂_n does converge to

X̂_∞ = Σ_{k=0}^∞ ρ^k ξ_k.

In other words, while the limit of X̃_n does not exist, the limit of its distribution does:

lim_{n→∞} P(X̃_n ≤ x) = lim_{n→∞} P(X̂_n ≤ x) = P(X̂_∞ ≤ x).

This is precisely the reason why we define stability via convergence of the distributions of the random variables and not, for instance, via convergence of the random variables themselves.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution π then, for any initial distribution,

lim_{n→∞} P(X_n = x) = π(x), uniformly in x.

In particular, the n-step transition matrix Pⁿ converges, as n → ∞, to a matrix with rows all equal to π:

lim_{n→∞} p_{ij}^{(n)} = π(j), for all i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law. In other words, we know f(x) = P(X = x), g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define P(X = x, Y = y) = P(X = x)P(Y = y). But suppose that we have some further requirement, for example, to try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

f(x) = Σ_y h(x, y), g(y) = Σ_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (X_n, n ≥ 0) and the law of a stochastic process Y = (Y_n, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

We say that a random time T is a meeting time of the two processes if

X_n = Y_n for all n ≥ T.

This T is a random variable that takes values in the set of times or, on the event that the two processes never meet, the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0, and all x ∈ S,

|P(X_n = x) − P(Y_n = x)| ≤ P(T > n).

Since this holds for all x ∈ S and since the right-hand side does not depend on x, we can write it as

sup_{x∈S} |P(X_n = x) − P(Y_n = x)| ≤ P(T > n). (19)

If, in particular, P(T < ∞) = 1, i.e. T is finite with probability one, then

|P(X_n = x) − P(Y_n = x)| → 0, uniformly in x ∈ S, as n → ∞.

Proof. We have

P(X_n = x) = P(X_n = x, n < T) + P(X_n = x, n ≥ T)
= P(X_n = x, n < T) + P(Y_n = x, n ≥ T)
≤ P(n < T) + P(Y_n = x),

and hence

P(X_n = x) − P(Y_n = x) ≤ P(T > n).

Repeating the argument with the roles of X and Y interchanged, we obtain

P(Y_n = x) − P(X_n = x) ≤ P(T > n).

Combining the two inequalities we obtain the first assertion.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let p_{ij} denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (X_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_{ij} and X_0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Y_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_{ij} and Y_0 distributed according to the stationary distribution π.

It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(X_n = x, Y_m = y). We now couple them by doing the straightforward thing: we assume that X = (X_n, n ≥ 0) is independent of Y = (Y_n, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

T := inf{n ≥ 0 : X_n = Y_n}.

Consider the process

W_n := (X_n, Y_n).

Then (W_n, n ≥ 0) is a Markov chain with state space S × S. Its initial state W_0 = (X_0, Y_0) has distribution

P(W_0 = (x, y)) = µ(x) π(y).

Its (1-step) transition probabilities are

q_{(x,y),(x′,y′)} := P(W_{n+1} = (x′, y′) | W_n = (x, y)) = P(X_{n+1} = x′ | X_n = x) P(Y_{n+1} = y′ | Y_n = y) = p_{x,x′} p_{y,y′},

and its n-step transition probabilities are

q^{(n)}_{(x,y),(x′,y′)} = p^{(n)}_{x,x′} p^{(n)}_{y,y′}.

From Theorem 3 and the aperiodicity assumption, we have that p^{(n)}_{x,x′} > 0 and p^{(n)}_{y,y′} > 0 for all large n, implying that q^{(n)}_{(x,y),(x′,y′)} > 0 for all large n. Therefore (W_n, n ≥ 0) is an irreducible chain. Notice that

σ(x, y) := π(x) π(y), (x, y) ∈ S × S,

is a stationary distribution for (W_n, n ≥ 0). (Indeed, had we started with both X_0 and Y_0 independent and distributed according to π, then, for all n, X_n and Y_n would be independent and distributed according to π.) Since σ(x, y) > 0 for all (x, y), Corollary 13 implies that (W_n, n ≥ 0) is positive recurrent.

In particular, P(T < ∞) = 1: T is bounded above by the first hitting time of any fixed diagonal state (x_0, x_0) by the recurrent chain W.

Define now a third process by:

Z_n := X_n 1(n < T) + Y_n 1(n ≥ T).

Obviously, Z_0 = X_0, which has an arbitrary distribution µ. Z_n equals X_n before the meeting time of X and Y; after the meeting time, Z_n = Y_n.

[Figure: the paths of X (arbitrary initial distribution) and Y (stationary initial distribution); Z follows X up to the meeting time T and Y afterwards.]

Moreover, (Z_n, n ≥ 0) is a Markov chain with transition probabilities p_{ij}; hence (Z_n, n ≥ 0) is identical in law to (X_n, n ≥ 0). Clearly, T is a meeting time between Z and Y. By the coupling inequality,

|P(Z_n = x) − P(Y_n = x)| ≤ P(T > n).

Since P(Z_n = x) = P(X_n = x), and since P(Y_n = x) = π(x), we have

|P(X_n = x) − π(x)| ≤ P(T > n),

uniformly in x, and the right-hand side converges to zero, as n → ∞, precisely because P(T < ∞) = 1:

lim_{n→∞} P(T > n) = P(T = ∞) = 0.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_{ij} are the transition probabilities of an irreducible chain X and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and the p_{ij} are p_{01} = p_{10} = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and the transition probabilities are

q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.

This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, p_{00} = 0.01, p_{01} = 0.99, p_{10} = 1,
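The coupling argument can be watched at work. In the sketch below (Python/numpy; for simplicity Y is started from the fixed state 1 rather than from π, which is enough for the illustration) we run two independent copies of the modified, aperiodic two-state chain and record their meeting time T; by the coupling inequality, the printed tails P(T > n) bound |P(X_n = x) − P(Y_n = x)| uniformly in x:

    import numpy as np

    rng = np.random.default_rng(4)

    # The modified two-state chain of the example: irreducible and aperiodic.
    P = np.array([[0.01, 0.99],
                  [1.00, 0.00]])

    def meeting_time(x, y, max_n=10_000):
        """T = inf{n >= 0 : X_n = Y_n} for two independent copies."""
        n = 0
        while x != y and n < max_n:
            x = rng.choice(2, p=P[x])
            y = rng.choice(2, p=P[y])
            n += 1
        return n

    T = np.array([meeting_time(0, 1) for _ in range(10_000)])
    for n in (1, 5, 10, 20):
        print(n, (T > n).mean())   # P(T > n): the coupling inequality bound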

then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p_{01} = p_{10} = 1. Suppose X_0 = 0. Then X_n = 0 for even n and X_n = 1 for odd n. Therefore, p_{00}^{(n)} = 1 for even n and = 0 for odd n. Thus, lim_{n→∞} p_{00}^{(n)} does not exist. However, if we modify the chain by p_{00} = 0.01, p_{01} = 0.99, p_{10} = 1, then, by the results of §14, π(0), π(1) satisfy

0.99 π(0) = π(1), π(0) + π(1) = 1,

whence π(0) ≈ 0.503, π(1) ≈ 0.497, and

lim_{n→∞} p_{00}^{(n)} = lim_{n→∞} p_{10}^{(n)} = π(0), lim_{n→∞} p_{11}^{(n)} = lim_{n→∞} p_{01}^{(n)} = π(1).

In matrix notation,

lim_{n→∞} ( p_{00}^{(n)} p_{01}^{(n)} ; p_{10}^{(n)} p_{11}^{(n)} ) = ( 0.503 0.497 ; 0.503 0.497 ).

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature. Indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"^6. It is, however, customary to have a consistent terminology, and this is expressed by our choice. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing.

6 We put "wrong" in quotes. Terminology cannot be, per se, wrong. The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981).

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. Suppose that the period of the chain is d. By the results of §5.4, we can decompose the state space into d classes C_0, C_1, ..., C_{d−1}, such that the chain moves cyclically between them: from C_0 to C_1, from C_1 to C_2, and so on, from C_{d−1} back to C_0. It is not difficult to see that

the chain (X_{nd}, n ≥ 0), obtained by sampling every d steps, is again a Markov chain, but it is not irreducible:

Lemma 8. The chain (X_{nd}, n ≥ 0) consists of d closed communicating classes, namely C_0, ..., C_{d−1}, and all its states have period 1. Hence if we restrict the chain (X_{nd}, n ≥ 0) to any of the sets C_r, we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply.

We shall show that:

Lemma 9. Let α_r := Σ_{i∈C_r} π(i), r = 0, 1, ..., d − 1, where π is the stationary distribution. Then α_r = 1/d for all r.

Proof. We have that π satisfies the balance equations:

π(i) = Σ_{j∈S} π(j) p_{ji}, i ∈ S.

Suppose that i ∈ C_r. Then p_{ji} = 0 unless j ∈ C_{r−1}. So

π(i) = Σ_{j∈C_{r−1}} π(j) p_{ji}, i ∈ C_r.

Summing up over i ∈ C_r we obtain

α_r = Σ_{i∈C_r} π(i) = Σ_{j∈C_{r−1}} π(j) Σ_{i∈C_r} p_{ji} = Σ_{j∈C_{r−1}} π(j) = α_{r−1}.

Since the α_r add up to 1, each one must be equal to 1/d.

Since (X_{nd}, n ≥ 0), restricted to C_r, is a positive recurrent irreducible aperiodic chain, it has a unique stationary distribution, which, by Lemma 9, is (d π(i), i ∈ C_r). This means that:

Theorem 18. Suppose that p_{ij} are the transition probabilities of a positive recurrent irreducible chain with period d. Then, for i, j in the same class C_r,

p_{ij}^{(nd)} → d π(j), as n → ∞.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ C_k, j ∈ C_ℓ, and let r = ℓ − k (modulo d, taken to be a number between 0 and d − 1). Then, starting from i, the original chain enters C_ℓ in r steps and, thereafter, if sampled every d steps, it remains in C_ℓ. Then

p_{ij}^{(r+nd)} = P_i(X_{r+nd} = j) → d π(j), as n → ∞.
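Theorem 18 is easy to test numerically. The following sketch (Python/numpy; the 3-state chain with period d = 2 is an illustrative example, with classes C_0 = {0} and C_1 = {1, 2}) computes π and checks that even powers of P converge to d π(j) within each class:

    import numpy as np

    # Illustrative chain with period d = 2: C0 = {0}, C1 = {1, 2}.
    P = np.array([[0.0, 0.5, 0.5],
                  [1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])

    A = P.T - np.eye(3); A[-1, :] = 1.0
    pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))
    print(pi)                              # (1/2, 1/4, 1/4): each class carries mass 1/d

    Pn = np.linalg.matrix_power(P, 1000)   # an even power, i.e. a multiple of d = 2
    print(Pn[0, 0], 2 * pi[0])             # p_00^(nd) -> d pi(0) = 1
    print(Pn[1, 1:], 2 * pi[1:])           # within C1: p_ij^(nd) -> d pi(j)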

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_{ij}^{(n)}, as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_{ij}^{(n)} → π(j) (Theorem 17); if the period is d then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_{ij}^{(n)} → 0, as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then

p_{ij}^{(n)} → 0, as n → ∞,

for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution on N,

p^{(n)} = (p_1^{(n)}, p_2^{(n)}, p_3^{(n)}, ...),

i.e. Σ_i p_i^{(n)} = 1 and p_i^{(n)} ≥ 0 for all i. Then we can find a sequence n_1 < n_2 < n_3 < ··· of integers such that lim_{k→∞} p_i^{(n_k)} exists for all i ∈ N.

Proof. Since the numbers p_1^{(n)}, n = 1, 2, ..., are contained between 0 and 1, there must be a subsequence (n_k^1) such that lim_{k→∞} p_1^{(n_k^1)} exists and equals, say, p_1. Since the numbers p_2^{(n_k^1)} are also contained between 0 and 1, there must be a subsequence (n_k^2) of (n_k^1) such that lim_{k→∞} p_2^{(n_k^2)} exists and equals, say, p_2. It is clear how we continue: the sequence (n_k^3) of the third row is a subsequence of (n_k^2) of the second row, and so on. Schematically:

p_1^{(n_1^1)}, p_1^{(n_2^1)}, p_1^{(n_3^1)}, ··· → p_1
p_2^{(n_1^2)}, p_2^{(n_2^2)}, p_2^{(n_3^2)}, ··· → p_2
p_3^{(n_1^3)}, p_3^{(n_2^3)}, p_3^{(n_3^3)}, ··· → p_3
···

Now pick the numbers in the diagonal. By construction, the diagonal subsequence (n_k^k, k = 1, 2, ...) is a subsequence of each (n_k^i, k = 1, 2, ...), for each i. Hence p_i^{(n_k^k)} → p_i, as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit^7:

7 See Brémaud (1999).

Lemma 11. If (y_i, i ∈ N) and (x_i^{(n)}, i ∈ N), n = 1, 2, ..., are sequences of nonnegative numbers then

Σ_{i=1}^∞ y_i lim inf_{n→∞} x_i^{(n)} ≤ lim inf_{n→∞} Σ_{i=1}^∞ y_i x_i^{(n)}.

Proof of Theorem 19. Let j be a null recurrent state. Suppose that, for some i, the sequence p_{ij}^{(n)} does not converge to 0. Hence we can find a subsequence (n_k) such that

lim_{k→∞} p_{ij}^{(n_k)} = α_j > 0.

Consider the probability vector (p_{ix}^{(n_k)}, x ∈ S). By Lemma 10, we can pick a subsequence (n′_k) of (n_k) such that

lim_{k→∞} p_{ix}^{(n′_k)} = α_x, for all x ∈ S.

Since Σ_{x∈S} p_{ix}^{(n′_k)} = 1, Lemma 11 gives Σ_{x∈S} α_x ≤ 1. Next, since P^{n+1} = P^n P, we have

p_{ix}^{(n′_k+1)} = Σ_{y∈S} p_{iy}^{(n′_k)} p_{yx},

and, by Lemma 11 (applied with a further subsequence extraction, if necessary, so that p_{ix}^{(n′_k+1)} also converges to α_x for all x),

α_x ≥ Σ_{y∈S} ( lim_{k→∞} p_{iy}^{(n′_k)} ) p_{yx} = Σ_{y∈S} α_y p_{yx}, x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that for some x_0 ∈ S it is a strict inequality:

α_{x_0} > Σ_{y∈S} α_y p_{yx_0}.

This would imply that

Σ_{x∈S} α_x > Σ_{x∈S} Σ_{y∈S} α_y p_{yx} = Σ_{y∈S} α_y Σ_{x∈S} p_{yx} = Σ_{y∈S} α_y,

and this is a contradiction (the number Σ_x α_x cannot be strictly larger than itself). Thus,

α_x = Σ_{y∈S} α_y p_{yx}, for all x ∈ S.

Since 0 < Σ_{x∈S} α_x ≤ 1 (the first inequality because α_j > 0), we have that

π(x) := α_x / Σ_{y∈S} α_y, x ∈ S,

satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0 because α_j > 0. By Corollary 13, state j is positive recurrent: contradiction.

19.2 The general case

It remains to see what the limit of p_{ij}^{(n)} is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Recall that π^C(j) = 1/E_j T_j, where T_j = inf{n ≥ 1 : X_n = j}. Let i be an arbitrary state, and let

H_C := inf{n ≥ 0 : X_n ∈ C}, ϕ^C(i) := P_i(H_C < ∞).

Theorem 20.

lim_{n→∞} p_{ij}^{(n)} = ϕ^C(i) π^C(j).

Proof. If i also belongs to C then ϕ^C(i) = 1 and so the result follows from Theorem 17. In general, we have

p_{ij}^{(n)} = P_i(X_n = j) = P_i(X_n = j, H_C ≤ n) = Σ_{m=1}^n P_i(H_C = m) P_i(X_n = j | H_C = m).

Let µ be the distribution of X_{H_C} given that X_0 = i and that {H_C < ∞}. From the Strong Markov Property, we have

P_i(X_n = j | H_C = m) = P_µ(X_{n−m} = j).

But Theorem 17 tells us that the limit exists regardless of the initial distribution:

P_µ(X_{n−m} = j) → π^C(j), as n → ∞.

Hence

lim_{n→∞} p_{ij}^{(n)} = π^C(j) Σ_{m=1}^∞ P_i(H_C = m) = π^C(j) P_i(H_C < ∞),

which is what we need.

Remark: We can write the result of Theorem 20 as

lim_{n→∞} p_{ij}^{(n)} = f_{ij} / E_j T_j,

where f_{ij} = P_i(T_j < ∞); indeed, since j belongs to the positive recurrent class C, we have f_{ij} = ϕ^C(i). This is a more old-fashioned way of getting the same result: starting with the first hitting time decomposition relation of Lemma 7, the result can be obtained by the so-called Abel's Lemma of Analysis.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: states 1, ..., 5; state 1 stays at 1 with probability 1 − γ and moves to 2 with probability γ; state 2 moves to 1 with probability 1; state 3 moves to 2 with probability 1 − α and to 4 with probability α; state 4 moves to 3 with probability β and to 5 with probability 1 − β; state 5 is absorbing.]

The communicating classes are I = {3, 4} (inessential states), C_1 = {1, 2} (closed class), C_2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are:

π^{C_1}(1) = 1/(1+γ), π^{C_1}(2) = γ/(1+γ), π^{C_2}(5) = 1.

Let ϕ(i) be the probability that class C_1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, and, from first-step analysis,

ϕ(3) = (1 − α) ϕ(2) + α ϕ(4), ϕ(4) = (1 − β) ϕ(5) + β ϕ(3),

whence

ϕ(3) = (1 − α)/(1 − αβ), ϕ(4) = β(1 − α)/(1 − αβ).

Both C_1, C_2 are aperiodic classes. Hence, as n → ∞,

p_{31}^{(n)} → ϕ(3) π^{C_1}(1), p_{32}^{(n)} → ϕ(3) π^{C_1}(2), p_{33}^{(n)} → 0, p_{34}^{(n)} → 0, p_{35}^{(n)} → (1 − ϕ(3)) π^{C_2}(5),

and, similarly,

p_{41}^{(n)} → ϕ(4) π^{C_1}(1), p_{42}^{(n)} → ϕ(4) π^{C_1}(2), p_{43}^{(n)} → 0, p_{44}^{(n)} → 0, p_{45}^{(n)} → (1 − ϕ(4)) π^{C_2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, for example, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19-th century French mathematicians Henri Poincaré and Joseph Liouville.

We will avoid giving a precise definition of the term "ergodic", but only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem. And so is the Law of Large Numbers itself (Theorem 13). Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this. Consider the quantity ν^[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X_0 such that P(T_a < ∞) = 1. Let f be a reward function such that

Σ_{x∈S} |f(x)| ν^[a](x) < ∞. (20)
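The limits in this example can be confirmed by raising the transition matrix to a large power. A sketch (Python/numpy), assuming the transition structure described in the figure above, i.e. p_{11} = 1 − γ, p_{12} = γ, p_{21} = 1, p_{32} = 1 − α, p_{34} = α, p_{43} = β, p_{45} = 1 − β, p_{55} = 1, with illustrative numerical values for α, β, γ:

    import numpy as np

    alpha, beta, gamma = 0.3, 0.5, 0.4    # illustrative values in (0, 1)

    # States 1..5 are indices 0..4.
    P = np.zeros((5, 5))
    P[0, 0], P[0, 1] = 1 - gamma, gamma   # closed class C1 = {1, 2}
    P[1, 0] = 1.0
    P[2, 1], P[2, 3] = 1 - alpha, alpha   # inessential states 3, 4
    P[3, 2], P[3, 4] = beta, 1 - beta
    P[4, 4] = 1.0                         # absorbing state 5

    phi3 = (1 - alpha) / (1 - alpha * beta)
    Pn = np.linalg.matrix_power(P, 500)
    print(Pn[2, 0], phi3 / (1 + gamma))   # p_31^(n) -> phi(3) pi^{C1}(1)
    print(Pn[2, 4], 1 - phi3)             # p_35^(n) -> (1 - phi(3)) pi^{C2}(5)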

Then

P( lim_{t→∞} [ Σ_{n=0}^t f(X_n) / Σ_{n=0}^t 1(X_n = a) ] = f̄ ) = 1, (21)

where

f̄ := Σ_{x∈S} f(x) ν^[a](x).

Proof. The reader is asked to revisit the proof of Theorem 14. As there, we break the total reward as follows:

Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_t^{first} + G_t^{last},

where G_r = Σ_{T_a^{(r)} ≤ s < T_a^{(r+1)}} f(X_s), while G_t^{first}, G_t^{last} are the total rewards over the first and last incomplete cycles, respectively, and these are handled as in the proof of Theorem 14. The Law of Large Numbers tells us that

lim_{n→∞} (1/n) Σ_{r=1}^n G_r = E G_1,

with probability 1, as long as E|G_1| < ∞. But this is precisely the condition (20). We only need to check that E G_1 = f̄, and this is left as an exercise. Note that the denominator Σ_{n=0}^t 1(X_n = a) equals the number N_t of visits to state a up to time t. So, dividing the numerator and the denominator of the fraction in (21) by N_t, and replacing n by the subsequence N_t in the Law of Large Numbers above, we obtain the result.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a, while here we only assume recurrence. Under positive recurrence, we have N_t/t → 1/E_a T_a. If so, and if we divide by t instead, we have that the denominator equals N_t/t and converges to 1/E_a T_a, while the numerator converges to f̄/E_a T_a, which is the quantity denoted by f̄ in Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem holds with any state a as the ground state, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Since the number of states is finite, at least one of the states must be visited infinitely often. So there are always recurrent states. From Proposition 2 we know that the balance equations

ν = νP

have at least one solution.

Since the number of states is finite, C := Σ_{i∈S} ν(i) < ∞, and we can define

π(i) := ν(i)/C, i ∈ S.

Hence π satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0. By Corollary 13, this state must be positive recurrent. In particular, a finite Markov chain always has positive recurrent states. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15), and, by Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have

Pⁿ → Π, as n → ∞,

where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

22 Time reversibility

Consider a Markov chain (X_n, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that

(X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0), for all n.

It is, in a sense, like playing a film backwards. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, without us being able to tell that it is running backwards. But if we film a man who walks, and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Taking n = 1 in the definition of reversibility, we see that (X_0, X_1) has the same distribution as (X_1, X_0). In particular, the first components of the two vectors must have the same distribution: X_1 has the same distribution as X_0; by induction, X_n has the same distribution as X_0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:

π(i) p_{ij} = π(j) p_{ji}, i, j ∈ S.

Proof. Suppose first that the chain is reversible. Then (X_0, X_1) has the same distribution as (X_1, X_0), and this means that

P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j), for all i and j.

But P(X_0 = i, X_1 = j) = π(i) p_{ij} and P(X_0 = j, X_1 = i) = π(j) p_{ji}. So the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. The DBE imply immediately that (X_0, X_1) has the same distribution as (X_1, X_0). By induction, using the DBE, it is easy to show that, for all n, (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0); for instance, (X_0, X_1, X_2) has the same distribution as (X_2, X_1, X_0). Hence the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i_1, i_2, ..., i_m ∈ S and all m ≥ 3,

p_{i_1 i_2} p_{i_2 i_3} ··· p_{i_{m−1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m−1}} ··· p_{i_2 i_1}. (22)

Proof. Suppose first that the chain is reversible. Pick 3 states i, j, k, and write, using the DBE,

π(i) p_{ij} p_{jk} p_{ki} = [π(j) p_{ji}] p_{jk} p_{ki} = [π(j) p_{jk}] p_{ji} p_{ki} = [π(k) p_{kj}] p_{ji} p_{ki} = [π(k) p_{ki}] p_{ji} p_{kj} = [π(i) p_{ik}] p_{ji} p_{kj} = π(i) p_{ik} p_{kj} p_{ji}.

Since the chain is irreducible, we have π(i) > 0 for all i. Hence, cancelling the π(i) from the above display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix two states i_1, i_2 and sum (22) over all choices of the intermediate states i_3, i_4, ..., i_m. We obtain

p_{i_1 i_2} p_{i_2 i_1}^{(m−1)} = p_{i_1 i_2}^{(m−1)} p_{i_2 i_1}.

If the chain is aperiodic, then, letting m → ∞,

p_{i_2 i_1}^{(m−1)} → π(i_1), and p_{i_1 i_2}^{(m−1)} → π(i_2),

implying that π(i_1) p_{i_1 i_2} = π(i_2) p_{i_2 i_1}. Hence the DBE hold, and the chain is reversible.

Notes: 1) A necessary condition for reversibility is that p_{ij} > 0 if and only if p_{ji} > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible. 2) The DBE are equivalent to saying that the ratio p_{ij}/p_{ji} can be written as a product of a function of i and a function of j: we have separation of variables,

p_{ij}/p_{ji} = f(i) g(j).

pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji .1) which gives π(i)pij = π(j)pji . that the chain is positive recurrent. consider the chain: Replace it by: This is a tree. Example 1: Chain with a tree-structure. and the arrow from j to j is the only one from Ac to A. Form the graph by deleting all self-loops.e.) Necessarily. The reason is as follows. Pick two distinct states i. there would be a loop. f (i)g(i) must be 1 (why?). j for which pij > 0. in addition. The balance equations are written in the form F (A. Ac be the sets in the two connected components. meaning that it has no loops. Ac ) = F (Ac . and. • If the chain has infinitely many states. to guarantee. And hence π can be found very easily. Hence the chain is reversible. the detailed balance equations. j for which pij > 0. Hence the original chain is reversible. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. then we need. the graph becomes disconnected. There is obviously only one arrow from AtoAc . the two oriented edges from i to j and from j to i by a single edge without arrow. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. If we remove the link between them. so we have pij g(j) = .) Let A. then the chain is reversible. i. by assumption. A) (see §4. (If it didn’t. So g satisfies the balance equations. namely the arrow from i to j. For example. for each pair of distinct states i. 60 . and by replacing. If the graph thus obtained is a tree. there isn’t any. using another method.(Convention: we form this ratio only when pij > 0.

If i. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). if j is a neighbour of i otherwise. Suppose the graph is connected. and a set E of undirected edges. States j are i are called neighbours if they are connected by an edge. π(i) = 1. Now define the the following transition probabilities: pij = 1/δ(i). Consider an undirected graph G with vertices S. where C is a constant. i. For each state i define its degree δ(i) = number of neighbours of i. The stationary distribution of a RWG is very easy to find. the DBE hold. For example. we have that C = 1/2|E|. π(i)pij = π(j)pji . We claim that π(i) = Cδ(i). A RWG is a reversible chain. so both members are equal to C. etc. In all cases. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. 0. j are not neighbours then both sides are 0. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. 2.e. To see this let us verify that the detailed balance equations hold. Hence we conclude that: 1. The stationary distribution assigns to a state probability which is proportional to its degree. 61 . There are two cases: If i. Each edge connects a pair of distinct vertices (states) without orientation. a finite set. i∈S Since δ(i) = 2|E|.Example 2: Random walk on a graph. i∈S where |E| is the total number of edges. p02 = 1/4. The constant C is found by normalisation.

irreducible and positive recurrent). random variables with values in N and distribution λ(n) = P (ξ0 = n). k = 0. this process is also a Markov chain and it does have time-homogeneous transition probabilities.i. i. . . . .e. 1. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . . if nk = n0 + km. . t ≥ 0) be independent from (Xn ) with S0 = 0.d. (m) (m) 23. 1. The sequence (Yk . 23. in addition. . St+1 = St + ξt . Let A be (r) a set of states and let TA be the r-th visit to A. i. 2. 2. where ξt are i. 2. Let (St .1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). Due to the Strong Markov Property. 62 n ∈ N. . .) has the Markov property but it is not time-homogeneous. If the subsequence forms an arithmetic progression. 2. . n ≥ 0) be Markov with transition probabilities pij . in general. how can we create new ones? 23. The state space of (Yr ) is A. . k = 0.e. where pij are the m-step transition probabilities of the original chain.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . A r = 1. Define Yr := XT (r) .2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). . In this case. .e. . π is a stationary distribution for the original chain. . k = 0. 1. then (Yk . . if. . k = 0. then it is a stationary distribution for the new chain. 2.3 Subordination Let (Xn . 1. t ≥ 0.

. (n) 23.. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . . Yn−1 = αn−1 }. .4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . f is a one-to-one function then (Yn ) is a Markov chain. . knowledge of f (Xn ) implies knowledge of Xn and so. n≥0 63 . ∞ λ(n)pij . conditional on Yn the past is independent of the future.Then Yt := XSt . . . j. Then (Yn ) has the Markov property. we have: Lemma 13. a Markov chain. . in general.e.) More generally. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. then the process Yn = f (Xn ). Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). .. . and the elements of S ′ by α. i. . α1 . Proof. Yn−1 ) = P (Yn+1 = β | Yn = α). is NOT. Fix α0 . . . Indeed. If. that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). β. β). . however. (Notation: We will denote the elements of S by i. We need to show that P (Yn+1 = β | Yn = α. .

β). Yn = α. Define Yn := |Xn |. depends on i only through |i|. Yn−1 ) Q(f (i). In other words. On the other hand. β)P (Xn = i | Yn = α. To see this. β) := 1. Yn−1 ) Q(f (i). and so (Yn ) is. But P (Yn+1 = β | Yn = α. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. P (Xn+1 = i ± 1|Xn = i) = 1/2. regardless of whether i > 0 or i < 0. i ∈ Z. α = 0. indeed. β) i∈S P (Yn = α|Xn = i. α > 0 Q(α. a Markov chain with transition probabilities Q(α. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). Then (Yn ) is a Markov chain. and if i = 0. observe that the function | · | is not one-to-one. if β = 1. β) P (Yn = α|Yn−1 ) i∈S = Q(α. for all i ∈ Z and all β ∈ Z+ . β) P (Yn = α. i. β). Yn−1 ) = i∈S P (Yn+1 = β | Xn = i.e. Yn−1 )P (Xn = i | Yn = α. Indeed. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. β).for all choices of the variables involved. Example: Consider the symmetric drunkard’s random walk. β ∈ Z+ . i ∈ Z. 64 . Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Thus (Yn ) is Markov with transition probabilities Q(α. But we will show that P (Yn+1 = β|Xn = i). if i = 0. β). β) P (Yn = α|Xn = i. We have that P (Yn+1 = β|Xn = i) = Q(|i|. conditional on {Xn = i}. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. if β = α ± 1. then |Xn+1 | = 1 for sure. if we let 1/2. because the probability of going up is the same as the probability of going down.

and following the same distribution. n ≥ 0) has the Markov property. k = 0. the population cannot grow: it will eventually reach 0. 2. and 0 is always an absorbing state. . . So assume that p1 = 1.d. But here is an interesting special case.i. a hard problem. which has a binomial distribution: i j pij = p (1 − p)i−j . Back to the general case. are i. independently from one another. Thus. Then P (ξ = 0) + P (ξ ≥ 2) > 0. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. therefore transient. we have that (Xn . so we will find some different method to proceed. since ξ (0) . We find the population (n) at time n + 1 by adding the number of offspring of each individual. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. . The parent dies immediately upon giving birth. 0 But 0 is an absorbing state. j In this case. It is a Markov chain with time-homogeneous transitions. with probability 1. Therefore. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. We let pk be the probability that an individual will bear k offspring.. The state 0 is irrelevant because it cannot be reached from anywhere. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . if 0 is never reached then Xn → ∞. 1. ξ2 .24 24. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. ξ (n) ). p0 = 1 − p. if the 65 (n) (n) . Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. ξ (1) . . then we see that the above equation is of the form Xn+1 = f (Xn . . . . Hence all states i ≥ 1 are inessential. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. In other words. one question of interest is whether the process will become extinct or not. (and independent of the initial population size X0 –a further assumption).1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. in general. if the current population is i there is probability pi that everybody gives birth to zero offspring. Let ξk be the number of offspring of the k-th individual in the n-th generation. Hence all states i ≥ 1 are transient. Example: Suppose that p1 = p. . and. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring. 0 ≤ j ≤ i. . each individual gives birth to at one child with probability p or has children with probability 1 − p. If we let ξ (n) = ξ1 .

Consider a Galton-Watson process starting with X0 = 1 initial ancestor. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. For i ≥ 1. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. . let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. where ε = P1 (D). . ϕn (0) = P (Xn = 0). . If µ ≤ 1 then the ε = 1. The result is this: Proposition 5. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . Indeed. Indeed. Obviously. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. Let ε be the extinction probability of the process. ε(0) = 1. when we start with X0 = i. so P (D|X0 = i) = εi . We wish to compute the extinction probability ε(i) := Pi (D). we can argue that ε(i) = εi . since Xn = 0 implies Xm = 0 for all m ≥ n. each one behaves independently of one another. i. Proof. This function is of interest because it gives information about the extinction probability ε. if we start with X0 = i in ital ancestors. and.population does not become extinct then it must grow to infinity. We then have that D = D1 ∩ · · · ∩ Di . and P (Dk ) = ε. For each k = 1. Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. 66 . Therefore the problem reduces to the study of the branching process starting with X0 = 1. .

But ϕn (0) → ε.Note also that ϕ0 (z) = 1. if the slope µ at z = 1 is ≤ 1. By induction. (See right figure below. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). Also ϕn (0) → ε. all we need to show is that ε < 1. ε = ε∗ . But 0 ≤ ε∗ . On the other hand. Since both ε and ε∗ satisfy z = ϕ(z). ϕn (0) ≤ ε∗ . since ϕ is a continuous function. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . if µ > 1. recall that ϕ is a convex function with ϕ(1) = 1. and so on. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. in fact. then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. Therefore. Let us show that. Therefore ε = ε∗ . So ε ≤ ε∗ < 1. Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). Notice now that µ= d ϕ(z) dz . and since the latter has only two solutions (z = 0 and z = ε∗ ). ϕ(ϕn (0)) → ϕ(ε). ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). This means that ε = 1 (see left figure below). Therefore ε = ϕ(ε). and. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). ϕn+1 (z) = ϕ(ϕn (z)). z=1 Also. ϕn+1 (0) = ϕ(ϕn (0)).) 67 . In particular. But ϕn+1 (0) → ε as n → ∞.

But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. 3 hosts. is that if more than one hosts attempt to transmit a packet. Xn+1 = Xn + An . An the number of new packets on the same time slot. in a simplified model. 6. for a packet to go through. So if there are. otherwise. The problem of communication over a common channel. 4. there are x + a − 1 packets. and Sn the number of selected packets. 3. 24. 1. on the next times slot. then max(Xn + An − 1. .2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). if Sn = 1. If s = 1 there is a successful transmission and. . Then ε < 1. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’).1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. 1. periodically. we let hosts transmit whenever they like. only one of the hosts should be transmitting at a given time.d. random variables with some common distribution: P (An = k) = λk .i. 68 . If s ≥ 2. Then ε = 1. there is collision and the transmission is unsuccessful. Let s be the number of selected packets. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot.. Since things happen fast in a computer network. Thus. 2. there is a collision and. using the same frequency. Ethernet has replaced Aloha. . . are allocated to hosts 1. So. then the transmission is not successful and the transmissions must be attempted anew. During this time slot assume there are a new packets that need to be transmitted. 3. Abandoning this idea. . Case: µ > 1. . say. 1. If collision happens. there are x + a packets. 0). 5. on the next times slot. there is no time to make a prior query whether a host has something to transmit or not. k ≥ 0. then time slots 0. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Nowadays. 2. One obvious solution is to allocate time slots to hosts in a specific order. 2. 3. We assume that the An are i.

e. we must have no new packets arriving.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. Xn−1 . Since p > 0. where q = 1 − p. The probabilities px. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. Therefore ∆(x) ≥ µ/2 for all large x. . i. so q x G(x) tends to 0 as x → ∞. An−1 . Proving this is beyond the scope of the lectures.x−1 = P (An = 0. we have ∆(x) = (−1)px. and p. An = a. To compute px. we have q < 1. We here only restrict ourselves to the computation of this expected change. for example we can make it ≤ µ/2.x+k for k ≥ 1. Idea of proof.e. The main result is: Proposition 6. if Xn + An = 0.x−1 when x > 0. For x > 0. So P (Sn = k|Xn = x. Sn = 1|Xn = x) = λ0 xpq x−1 . selections are made amongst the Xn + An packets. To compute px. So: px. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). Indeed. the Markov chain (Xn ) is transient. So we can make q x G(x) as small as we like by choosing x sufficiently large. we should not subtract 1. k 0 ≤ k ≤ x + a. Let us find its transition probabilities. k ≥ 1. then the chain is transient. We also let µ := k≥1 kλk be the expectation of An . The Aloha protocol is unstable.) = x + a k x−k p q .x are computed by subtracting the sum of the above from 1. The reason we take the maximum with 0 in the above equation is because. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). Unless µ = 0 (which is a trivial case: no arrivals!). conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . where G(x) is a function of the form α + βx . we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. . n ≥ 0) is a Markov chain. The random variable Sn has. we think as follows: to have a reduction of x by 1. 69 . and so q x tends to 0 much faster than G(x) tends to ∞. The process (Xn . we have that the drift is positive for all large x and this proves transience. .x−1 + k≥1 kpx.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k .where λk ≥ 0 and k≥0 kλk = 1. and exactly one selection: px.

the rank of a page may be very important for a company’s profits. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. It uses advanced techniques from linear algebra to speed up computations. All you need to do to get high rank is pay (a lot of money). Comments: 1. There is a technique. So this is how Google assigns ranks to pages: It uses an algorithm. that allows someone to manipulate the PageRank of a web page and make it much higher. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. In an Internet-dominated society. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. Then it says: page i has higher rank than j if π(i) > π(j). So we make the following modification. Define transition probabilities as follows:  1/|L(i)|. there are companies that sell high-rank pages. It is not clear that the chain is irreducible. The way it does that is by computing the stationary distribution of a Markov chain that we now describe.24. otherwise. So if Xn = i is the current location of the chain.2) and let α pij := (1 − α)pij + . α ≈ 0. For each page i let L(i) be the set of pages it links to. the chain moves to a random page. 3. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. known as ‘302 Google jacking’.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. Its implementation though is one of the largest matrix computations in the world. Academic departments link to each other by hiring their 70 . We pick α between 0 and 1 (typically. The idea is simple. 2. unless i is a dangling page. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. neither that it is aperiodic. because of the size of the problem. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. |V | These are new transition probabilities. PageRank. then i is a ‘dangling page’. if j ∈ L(i)  pij = 1/|V |. in the latter case. Therefore. if L(i) = ∅   0. If L(i) = ∅.

One of the square alleles is selected 4 times. The alleles in an individual come in pairs and express the individual’s genotype for that gene. all we need is the information outlined in this figure. from the point of view of an administrator. We will assume that a child chooses an allele from each parent at random. two AA individuals will produce and AA child. a gene controlling a specific characteristic (e. We will also assume that the parents are replaced by the children. But an AA mating with a Aa may produce a child of type AA or one of type Aa. We would like to study the genetic diversity. this allele is of type A with probability x/2N . in each generation. This means that the external appearance of a characteristic is a function of the pair of alleles. When two individuals mate. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). meaning how many characteristics we have.faculty from each other (and from themselves). We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. What is important is that the evaluation can be done without ever looking at the content of one’s research. a dominant (say A) and a recessive (say a). This number could be independent of the research quality but. We have a transition from x to y if y alleles of type A were produced. And so on. 5. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology.g. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. generation after generation.4 The Wright-Fisher model: an example from biology In an organism. colour of eyes) appears in two forms. Thus. then. The process repeats. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). Three of the square alleles are never selected. 4. Imagine that we have a population of N individuals who mate at random producing a number of offspring. we will assume that. One of the circle alleles is selected twice. their offspring inherits an allele from each parent. maintain this allele for the next generation. If so. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. To simplify. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. Do the same thing 2N times. it makes the evaluation procedure easy. 71 . the genetic diversity depends only on the total number of alleles of each type. Since we don’t care to keep track of the individuals’ identities. 24.

0 ≤ y ≤ 2N.e. . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. since. Since selections are independent. . ϕ(x) = y=0 pxy ϕ(y). . (Here. so both 0 and 2N are absorbing states. . . the number of alleles of type A won’t change. Let M = 2N . Since pxy > 0 for all 1 ≤ x. conditionally on (X0 . there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. Similarly. The fact that M is even is irrelevant for the mathematical model. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). E(Xn+1 |Xn . we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. i. with n P (ξk = 1|Xn . the states {1. . the chain gets absorbed by either 0 or 2N . 72 . . . X0 ) = 1 − Xn . . the random variables ξ1 . ϕ(M ) = 1. M n P (ξk = 0|Xn . Clearly.Each allele of type A is selected with probability x/2N .d. . for all n ≥ 0. Tx := inf{n ≥ 1 : Xn = x}.i. n n where. M and thus. . M Therefore. i.2N = 1. y ≤ 2N − 1. . Xn ). We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. . the expectation doesn’t change and. on the average. we see that 2N x 2N −y x y pxy = . ξM are i. These are: M ϕ(0) = 0. . p00 = p2N. . .) One can argue that. since. but the argument needs further justification because “the end of time” is not a deterministic time. EXn+1 = EXn Xn = Xn . . as usual. The result is correct. and the population becomes one consisting of alleles of type A only. on the average. Clearly. . . . . and the population becomes one consisting of alleles of type a only. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . . X0 ) = E(Xn+1 |Xn ) = M This implies that. X0 ) = Xn . ϕ(x) = x/M. . 2N − 1} form a communicating class of inessential states. we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x.e. .

Dn ). A number An of new components are ordered and added to the stock. . for all n ≥ 0. provided that enough components are available. Thus. b). on the average. Otherwise. and the stock reached level zero at time n + 1.i. say. Let ξn := An − Dn . Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . Here are our assumptions: The pairs of random variables ((An . The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. the demand is only partially satisfied. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. there are Xn + An − Dn components in the stock.The first two are obvious. We have that (Xn . M x M (M −1)−(y−1) x x +1− M M M −1 = 24. Since the Markov chain is represented by this very simple recursion. We will show that this Markov chain is positive recurrent. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. n ≥ 0) is a Markov chain.5 A storage or queueing application A storage facility stores. at time n + 1. larger than the supply per unit of time. so that Xn+1 = (Xn + ξn ) ∨ 0. and so. the demand is met instantaneously. the demand is. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. If the demand is for Dn components. we will try to solve the recursion. M −1 y−1 x M y−1 1− x .d. and E(An − Dn ) = −µ < 0. . 73 . 2. Xn electronic components at time n. Working for n = 1. n ≥ 0) are i. while a number of them are sold to the market. . We will apply the so-called Loynes’ method.

Thus. The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). ξn−1 . where the latter equality follows from the fact that. . . . . . the event whose probability we want to compute is a function of (ξ0 . But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). we may replace the latter random variables by ξ1 . Notice. = x≥0 Since the n-dependent terms are bounded. however. ξn−1 ).i. n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . . instead of considering them in their original order. . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. . we find π(y) = x≥0 π(x)pxy . Hence. and. 74 . . Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). using π(y) = limn→∞ P (Mn = y). the random variables (ξn ) are i. we reverse it. .d. . In particular. . . . we may take the limit of both sides as n → ∞. for y ≥ 0. so we can shuffle them in any way we like without altering their joint distribution. (n) . . Therefore. . (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . ξn−1 ) = (ξn−1 . for all n. in computing the distribution of Mn which depends on ξ0 . . that d (ξ0 . Indeed. ξn . . . ξ0 ).

we will use the assumption that Eξn = −µ < 0. From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. . . n→∞ This implies that P ( lim Zn = −∞) = 1. Depending on the actual distribution of ξn . This will be the case if g(y) is not identically equal to 1.} < ∞) = 1. We must make sure that it also satisfies y π(y) = 1. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . n→∞ Hence g(y) = P (0 ∨ max{Z1 . as required. . . Z2 . . 75 . The case Eξn = 0 is delicate. . the Markov chain may be transient or null recurrent. Z2 . Here.} > y) is not identically equal to 1.So π satisfies the balance equations.

j = 1. we won’t go beyond Zd . . p(−1) = q.g. d  p(x) = qj . the set of d-dimensional vectors with integer coordinates. otherwise. groups) are allowed. Example 1: Let p(x) = C2−|x1 |−···−|xd | . but. we need to give more structure to the state space S. if x = −ej .and space-homogeneous transition probabilities given by px. besides time. that of spatial homogeneity. have been processes with timehomogeneous transition probabilities. .y+z for any translation z. Example 3: Define p(x) = 1 2d . If d = 1. j = 1. otherwise where p + q = 1. . x ∈ Zd . ±ed } otherwise 76 . here. . so far. In dimension d = 1. .PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. a structure that enables us to say that px. 0. namely. if x ∈ {±e1 . So. we can let S = Zd . . . given a function p(x). For example. A random walk of this type is called simple random walk or nearest neighbour random walk. . . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. d   0. . A random walk possesses an additional property. where C is such that x p(x) = 1.and space-homogeneity.y = p(y − x). Define  pj . p(x) = 0. where p1 + · · · + pd + q1 + · · · + qd = 1. . then p(1) = p. the skip-free property. More general spaces (e. This means that the transition probability pxy should depend on x and y only through their relative positions in space. a random walk is a Markov chain with time. To be able to say this. this walk enjoys. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. Our Markov chains. . This is the drunkard’s walk we’ve seen before. if x = ej .y = px+z.

has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ .) In this notation. x ± ed . Obviously. random variables in Zd with common distribution P (ξn = x) = p(x). Indeed. We refer to ξn as the increments of the random walk.) Now the operation in the last term is known as convolution. where ξ1 . ξ2 . . (When we write a sum without indicating explicitly the range of the summation variable. Furthermore.i. p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). But this would be nonsense. We will let p∗n be the convolution of p with itself n times. we will mean a sum over all possible values. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x).d. and this is the symmetric drunkard’s walk. Furthermore. because all we have done is that we dressed the original problem in more fancy clothes. otherwise. p ∗ δz = p. We shall resist the temptation to compute and look at other 77 . p(x) = 0. x ∈ Zd . . . (Recall that δz is a probability distribution that assigns probability 1 to the point z. . p′ (x). The special structure of a random walk enables us to give more detailed formulae for its behaviour. x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. So. then p(1) = p(−1) = 1/2. . notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. the initial state S0 is independent of the increments. are i. in addition. which. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). We call it simple symmetric random walk. y p(x)p′ (y − x). x + ξ2 = y) = y p(x)p(y − x). If d = 1. Let us denote by Sn the state of the walk at time n.This is a simple random walk. One could stop here and say that we have a formula for the n-step transition probability. The convolution of two probabilities p(x). we can reduce several questions to questions about this random walk. . computing the n-th order convolution for all n is a notoriously hard problem. . It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . in general.

1}n . + (1. we shall learn some counting methods. we should assign. First. . .2) where the ξn are i. +1}n . When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . Here are a couple of meaningful questions: A. . we have P (ξ1 = ε1 .0) We can thus think of the elements of the sample space as paths. 2) → (5. +1.1) + (0.1) − (3. 1): (2.i+1 = pi. ξn = εn ) = 2−n . where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. . Clearly.2) (4. +1}n . After all. . +1}n . we are dealing with a uniform distribution on the sample space {−1.methods. ξn ). if we can count the number of elements of A we can compute its probability.d. 1) → (4. A ⊂ {−1. Since there are 2n elements in {0. . and the sequence of signs is {+1. So if the initial position is S0 = 0. Events A that depend only on these the first n random signs are subsets of this sample space. +1. Look at the distribution of (ξ1 . . −1} then we have a path in the plane that joins the points (0. all possible values of the vector are equally likely. 2) → (3. −1. If not. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. 0) → (1.1) + − (5. 2 where #A is the number of elements of A. So.i. to each initial position and to each sequence of n signs. how long will that take? C. The principle is trivial.i−1 = 1/2 for all i ∈ Z. random variables (random signs) with P (ξn = ±1) = 1/2. for fixed n. consider the following: 78 . 1) → (2. a random vector with values in {−1. but its practice may be hard. and so #A P (A) = n . . Therefore. In the sequel. As a warm-up. a path of length n. Will the random walk ever return to where it started from? B. If yes.

the number of paths that start from (m. S4 = −1. S7 < 0}. 0)). The paths comprising this event are shown below. 0) (n. Hence (n+i)/2 = 0. 0) and end up at the specific point (n. y) is given by: #{(m. x) and end up at (n.) The darker region shows all paths that start at (0. (n − m + y − x)/2 (23) Remark. 0) and end up at (n. y) is given by: #{(0. Consider a path that starts from (0. and compute the probability of the event A = {S2 ≥ 0. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. (There are 2n of them. The number of paths that start from (0. x) (n. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. 0) and n ends at (n. n 79 . We use the notation {(m.1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. y)} =: {all paths that start at (m.i) (0.0) The lightly shaded region contains all paths of length n that start at (0. i). y)}. i). i). x) and end up at (n. S6 < 0. In if n + i is not an even number then there is no path that starts from (0. Let us first show that Lemma 14. i)} = n n+i 2 More generally. (n.047. y)} = n−m . in this case. Proof. We can easily count that there are 6 paths in A.Example: Assume that S0 = 0. x) (n. 0) and ends at (n.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

0) (n. we shall just compute it. the random walk never becomes zero up to time n is given by P0 (S1 > 0. Sn > 0 | Sn = y) = y . .3 Some conditional probabilities Lemma 17. Such a simple answer begs for a physical/intuitive explanation. 1) (n. . S2 > 0. y)} . y)} . if n is even. The probability that. . P0 (S1 ≥ 0 . n This is called the ballot theorem and the reason will be given in Section 29. . 27. n 2−n . y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . But. n/2 (29) #{(0. for now. n (30) Proof. Sn ≥ 0 | Sn = 0) = 84 1 . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . given that S0 = 0 and Sn = y > 0. If n. (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . .2 The ballot theorem Theorem 25. 27. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. n 2 #{paths (m. P0 (Sn = y) 2 (n + y) + 1 . Sn ≥ 0 | Sn = y) = 1 − In particular. The left side of (30) equals #{paths (1. . . x) 2n−m (n. n−m 2−n−m . .P (Sn = y|Sm = x) = In particular. We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). y) that do not touch zero} × 2−n #{paths (0. 0) (n. if n is even. . .

. +1 27. .4 Some remarkable identities Theorem 26. . . . Sn = 0) = P0 (Sn = 0). . . 85 . . . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . and Sn−1 > 0 implies that Sn ≥ 0. If n is even. . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . . . . . . . . . . •). (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . . ξ2 . . . .) by (ξ1 . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . Sn < 0) (b) (a) = 2P0 (S1 > 0. ξ3 . (c) (d) (e) (f ) (g) where (a) is obvious. . . (f ) follows from the replacement of (ξ2 . . . Sn ≥ 0) = #{(0. . For even n we have P0 (S1 ≥ 0. . . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0.). ξ2 + ξ3 ≥ 0. . . Sn ≥ 0) = P0 (S1 = 0. . . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). Sn = 0) = P0 (S1 > 0. . n/2 where we used (28) and (29). Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. 0) (n. . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. We next have P0 (S1 = 0. P0 (S1 ≥ 0. 1 + ξ2 + ξ3 > 0. Sn ≥ 0). Sn ≥ 0. . ξ3 . (b) follows from symmetry. Sn > 0) + P0 (S1 < 0. . . . . . .Proof. . . . . and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. . . We use (27): P0 (Sn ≥ 0 . . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . (e) follows from the fact that ξ1 is independent of ξ2 . . . . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . Proof. . . . .

we can omit the last event. . . . and if the particle is to the left of a then its mirror image is on the right.27. . . . Since the random walk cannot change sign before becoming zero. for even n. Sn = 0 means Sn−1 = 1. Sn = 0) = 2P0 (Sn−1 > 0. 86 . we either have Sn−1 = 1 or −1. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. . . . . So the last term is 1/2. f (n) = P0 (S1 > 0. . . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). Also. . Theorem 27. We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. . Then f (n) := P0 (S1 = 0. . To take care of the last term in the last expression for f (n). Sn−1 > 0. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . . . If a particle is to the right of a then you see its image on the left. This is not so interesting. . Sn = 0). we see that Sn−1 > 0. n−1 Proof. So f (n) = 2P0 (S1 > 0. n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. Fix a state a. we computed. So u(n) f (n) = . given that Sn = 0.5 First return to zero In (29). Sn−2 > 0|Sn−1 > 0. the two terms must be equal. Sn = 0) P0 (S1 > 0. Put a two-sided mirror at a. P0 (Sn−1 > 0. Sn = 0) Now. . Sn−1 = 0. By symmetry. Sn = 0) = u(n) . So 1 f (n) = 2u(n) P0 (S1 > 0. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. . By the Markov property at time n − 1. . and u(n) = P0 (Sn = 0). the probability u(n) that the walk (started at 0) attains value 0 in n steps. Sn−2 > 0|Sn−1 = 1). Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). . with equal probability. . Sn−1 > 0. Let n be even. Sn = 0) + P0 (S1 < 0. Sn = 0. . Sn−1 < 0.

we have n ≥ Ta . Theorem 28. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). Sn ≥ a). Sn ≤ a) + P0 (Ta ≤ n. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ).What is more interesting is this: Run the particle till it hits a (it will. Sn ≤ a) + P0 (Sn > a). In the last event. 2a − Sn . Proof. Trivially. . we obtain 87 . Let (Sn . m ≥ 0) is a simple symmetric random walk. Then. Sn > a) = P0 (Ta ≤ n. . If Sn ≥ a then Ta ≤ n. Sn ≤ a). for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . 2a − Sn ≥ a) = P0 (Ta ≤ n. n ≥ Ta By the strong Markov property. define Sn = where is the first hitting time of a. But P0 (Ta ≤ n) = P0 (Ta ≤ n. P0 (Sn ≥ a) = P0 (Ta ≤ n. n ∈ Z+ ) has the same law as (Sn . n ≥ 0) be a simple symmetric random walk and a > 0. 2 Define now the running maximum Mn = max(S0 . observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). S1 . a + (Sn − a). Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). Then Ta = inf{n ≥ 0 : Sn = a} Sn . 28. . So Sn − a can be replaced by a − Sn in the last display and the resulting process. The last equality follows from the fact that Sn > a implies Ta ≤ n. independent of the past before Ta . Thus.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. n < Ta . 2 Proof. as a corollary to the above result. But a symmetric random walk has the same law as its negative. Sn = Sn . the process (STa +m − a. Finally. called (Sn . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. so we may replace Sn by 2a − Sn (Theorem 28). Sn ). n < Ta . n ∈ Z+ ). . n ≥ Ta . In other words.

usually a vase furnished with a foot or pedestal. Sci. then the probability that. Compt. If an urn contains a azure items and b = n−a black items. Paris. An urn is a vessel. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. Rend. n = a + b. 105. If Tx ≤ n. as long as a ≥ b. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. Sn = y) = P0 (Sn = 2x − y). what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. The answer is very simple: Theorem 31. Sn = y). during sampling without replacement. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. Since Mn < x is equivalent to Tx > n. employed for different purposes. A sample from the urn is a sequence η = (η1 . 105. The set of values of η contains n elements and η is uniformly distributed a in this set. where ηi indicates the colour of the i-th item picked. (1887). P0 (Mn < x. It is the latter use we are interested in in Probability and Statistics. of which a are coloured azure and b black. ηn ). we have P0 (Mn < x. Sn = y) = P0 (Tx > n. e 10 Andr´. Acad. which results into P0 (Tx ≤ n. combined with (31). Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . . Bertrand. then. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). as for holding liquids. Solution directe du probl`me r´solu par M. 369. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. . This. Acad. 8 88 . Solution d’un probl`me. J. The original ballot problem. 2 Proof. 436-437. Compt. Sn = 2x − y).Corollary 19. 9 Bertrand. and anciently for holding ballots to be drawn. gives the result. Sci. Sn = y) = P0 (Tx ≤ n. e e e Paris. ornamental objects. the number of selected azure items is always ahead of the black ones equals (a − b)/n. (31) x > y. Rend. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. the ashes of the dead after cremation. . We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. D. Proof. (1887). by applying the reflection principle (Theorem 28). .

. using the definition of conditional probability and Lemma 16. . . so ‘azure’ items are +1 and ‘black’ items are −1.We shall call (η1 . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. −n ≤ k ≤ n. . . ξn ) is. n ≥ 0) with values in Z and let y be a positive integer. We need to show that. . . the ‘black’ items correspond to − signs). letting a. but which is not necessarily symmetric. . . b be defined by a + b = n. . . Then. +1} such that ε1 + · · · + εn = y. . . Proof of Theorem 31. we can translate the original question into a question about the random walk. . the result follows. . Since the ‘azure’ items correspond to + signs (respectively. The values of each ξi are ±1. n n −n 2 (n + y)/2 a Thus. . Then. Consider a simple symmetric random walk (Sn . . . indeed. . An urn can be realised quite easily be a random walk: Theorem 32. . . conditional on {Sn = y}. always assuming (without any loss of generality) that a ≥ b = n − a. conditional on {Sn = y}. ξn = εn ) P0 (Sn = y) 2−n 1 = = . Let ε1 . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. ξn = εn | Sn = y) = P0 (ξ1 = ε1 . . . ηn ) an urn process with parameters n. a − b = y. Since an urn with parameters n. . p(n+k)/2 q (n−k)/2 . . But if Sn = y then. . conditional on the event {Sn = y}. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. where y = a − (n − a) = 2a − n. ξn ) is uniformly distributed in a set with n elements. ξn ) has a uniform distribution. we have: P0 (ξ1 = ε1 . 89 .) So pi. the sequence (ξ1 . . n Since y = a − (n − a) = a − b.i+1 = p. . a. It remains to show that all values are equally a likely. . . But in (30) we computed that y P0 (S1 > 0. Sn > 0} that the random walk remains positive. . . . εn be elements of {−1. . (The word “asymmetric” will always mean “not necessarily symmetric”. . We have P0 (Sn = k) = n n+k 2 pi. a we have that a out of the n increments will be +1 and b will be −1. . . . So the number of possible values of (ξ1 . the sequence (ξ1 . Proof. (ξ1 . .i−1 = q = 1 − p. i ∈ Z. n . . Sn > 0 | Sn = y) = .

. For all x. Theorem 33. In view of Lemma 19 we need to know the distribution of T1 . (We tacitly assume that S0 = 0. x ∈ Z. If S1 = 1 then T1 = 1. . namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. for x > 0. Proof. 2. . This random variable takes values in {0. See figure below. The process (Tx . x − 1 in order to reach x. it is no loss of generality to consider the random walk starting from 0. E0 sTx = ϕ(s)x . Consider a simple random walk starting from 0. In view of this. y ∈ Z. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. Lemma 18. x ≥ 0) is also a random walk. 2qs Proof. plus a ′′ further time T1 required to go from 0 to +1 for the first time. We will approach this using probability generating functions. Proof.i.d. Lemma 19. and all t ≥ 0. Consider a simple random walk starting from 0. random variables. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. . Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . only the relative position of y with respect to x is needed.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. 2.30. we make essential use of the skip-free property. . To prove this. . 90 . plus a time T1 required to go from −1 to 0 for the first time. The rest is a consequence of the strong Markov property.) Then. Px (Ty = t) = P0 (Ty−x = t). . 1. Consider a simple random walk starting from 0. Corollary 20. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1.} ∪ {∞}. If x > 0. Let ϕ(s) := E0 sT1 .

Corollary 21. 2ps Proof. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. The formula is obtained by solving the quadratic. Corollary 22. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). 91 . if x < 0 2ps Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . in order to hit 1. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 .Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. Hence the previous formula applies with the roles of p and q interchanged. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . from the strong Markov property. Consider a simple random walk starting from 0. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 .  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. The first time n such that Sn = −1 is the first time n such that −Sn = 1.

30.2 First return to the origin

Now let us ask: if the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time

T0′ := inf{n ≥ 1 : Sn = 0}.

(Pay attention: it is important to write n ≥ 1 inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by

E0 s^T0′ = 1 − √(1 − 4pqs²).

Proof. We again do first-step analysis:

E0 s^T0′ = E0[s^T0′ 1(S1 = −1)] + E0[s^T0′ 1(S1 = 1)]
         = q E0[s^T0′ | S1 = −1] + p E0[s^T0′ | S1 = 1].

But if S1 = −1 then T0′ equals 1 plus the time required for the walk to hit 0 starting from −1. The latter is, in distribution, the same as the time required for the walk to hit 1 starting from 0. Similarly, if S1 = 1, then T0′ equals 1 plus a time whose distribution is that of the time required for the walk to hit −1 starting from 0. Hence

E0 s^T0′ = q E0(s^(1+T1)) + p E0(s^(1+T−1))
         = qs · (1 − √(1 − 4pqs²))/(2qs) + ps · (1 − √(1 − 4pqs²))/(2ps)
         = 1 − √(1 − 4pqs²).

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin, i.e. the probabilities P0(T0′ = n). In §30.2 we computed the probability generating function E0 s^T0′ (Theorem 34). The tool we need here is Taylor's theorem for the function f(x) = (1 + x)^α, where α is a real number. If α = n is a positive integer then

(1 + x)^n = Σ_{k=0}^n C(n, k) x^k,

by the binomial theorem, where C(n, k) denotes the binomial coefficient. If we define C(n, k) to be equal to 0 for k > n, then we can omit the upper limit in the sum and simply write

(1 + x)^n = Σ_{k≥0} C(n, k) x^k.

Recall that

C(n, k) = n(n − 1) ··· (n − k + 1)/k! = n!/(k!(n − k)!).

It turns out that the formula is formally true for general α, and it looks exactly the same:

(1 + x)^α = Σ_{k≥0} C(α, k) x^k,

as long as we define

C(α, k) = α(α − 1) ··· (α − k + 1)/k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. For example, we will show that

(1 − x)^(−1) = 1 + x + x² + x³ + ···,

which is of course the all too-familiar geometric series. Indeed,

(1 − x)^(−1) = Σ_{k≥0} C(−1, k)(−x)^k.

But

C(−1, k) = (−1)(−2) ··· (−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,

so

(1 − x)^(−1) = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^(2k) x^k = Σ_{k≥0} x^k,

as needed.

As another example, let us compute √(1 − x). We have

√(1 − x) = (1 − x)^(1/2) = Σ_{k≥0} C(1/2, k)(−1)^k x^k.

We can write this in more familiar symbols if we observe that

C(1/2, k)(−1)^k = −(1/(2k − 1)) C(2k, k) 4^(−k),

and this is shown by direct computation. So

√(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) 4^(−k) x^k.

We apply this to E0 s^T0′ = 1 − √(1 − 4pqs²):

E0 s^T0′ = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^(2k).

But, by the definition of E0 s^T0′,

E0 s^T0′ = Σ_{n≥0} P0(T0′ = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps) and then

P0(T0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P0(S2k = 0).

In other words, writing f(n) = P0(T0′ = n) and u(n) = P0(Sn = 0), we have f(n) = u(n)/(n − 1) if n is even, and f(n) = 0 if n is odd.
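The first-return law is easy to confirm by brute force for small k. The sketch below (not from the original notes; p = 0.7 and k ≤ 4 are arbitrary choices) enumerates all ±1 paths of length 2k and adds up the probabilities of those whose first return to 0 occurs exactly at time 2k.

    import itertools, math

    def first_return_prob(p, k):
        """P0(T0' = 2k) by enumerating all +-1 paths of length 2k."""
        q, total = 1 - p, 0.0
        for steps in itertools.product((1, -1), repeat=2 * k):
            pos, first = 0, None
            for n, step in enumerate(steps, start=1):
                pos += step
                if pos == 0:
                    first = n
                    break
            if first == 2 * k:
                total += p ** steps.count(1) * q ** steps.count(-1)
        return total

    p, q = 0.7, 0.3
    for k in (1, 2, 3, 4):
        exact = math.comb(2 * k, k) * p ** k * q ** k / (2 * k - 1)
        print(k, round(first_return_prob(p, k), 10), round(exact, 10))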

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

P0( lim_{n→∞} Sn/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence the walk is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to Markov chain theory. The key fact needed here is

P0(Tx < ∞) = lim_{s↑1} E0 s^Tx,

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

1 − 4pq = 1 − 4p + 4p² = (1 − 2p)²,  i.e.  √(1 − 4pq) = |1 − 2p| = |p − q|.

So

P0(Tx < ∞) = ((1 − |p − q|)/(2q))^x,    if x > 0;
P0(Tx < ∞) = ((1 − |p − q|)/(2p))^|x|,  if x < 0.      (33)

This formula can be written more neatly as follows:

Theorem 35.

P0(Tx < ∞) = ((p/q) ∧ 1)^x,    if x > 0;
P0(Tx < ∞) = ((q/p) ∧ 1)^|x|,  if x < 0.

In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z.

Proof. If p > q then |p − q| = p − q, and so (33) gives P0(Tx < ∞) = 1 if x > 0, and = (q/p)^|x| if x < 0, which is what we want. If p < q then |p − q| = q − p and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.
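Theorem 35 can be checked by simulation. The sketch below (not in the original notes; p = 0.3, x = 2 and the cutoff are arbitrary choices) estimates P0(Tx < ∞) for a downward-drifting walk; since P0(Tx < ∞) = P0(M∞ ≥ x) for x > 0, it also previews the law of the overall maximum studied next.

    import random

    def hits_level(p, x, drop=80):
        """Does a simple random walk started at 0 ever reach level x > 0?
        With p < 1/2 the walk drifts down; once it is `drop` below its running
        maximum, a later visit to x is exponentially unlikely, so we stop."""
        pos = best = 0
        while pos > best - drop:
            pos += 1 if random.random() < p else -1
            best = max(best, pos)
            if pos == x:
                return True
        return False

    p, q, x, N = 0.3, 0.7, 2, 20000
    print(sum(hits_level(p, x) for _ in range(N)) / N, (p / q) ** x)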

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

M∞ = sup(S0, S1, S2, ...).

Indeed, if we let Mn = max(S0, S1, ..., Sn), the sequence Mn is nondecreasing, and every nondecreasing sequence has a limit, except that the limit may be ∞; so M∞ = lim_{n→∞} Mn, and here the limit is < ∞ with probability 1 because p < 1/2. What is the distribution of M∞?

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:

P0(M∞ ≥ x) = (p/q)^x,  x ≥ 0.

Proof. By Theorem 35,

P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x,  x ≥ 0.

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return with certainty to the point where it started from. Unless p = 1/2, this probability is always strictly less than 1: with some positive probability, the walk will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in Z. Let T0′ = inf{n ≥ 1 : Sn = 0} be the first return time to 0. Then

P0(T0′ < ∞) = 2(p ∧ q).

Proof. The probability generating function of T0′ was obtained in Theorem 34. But

P0(T0′ < ∞) = lim_{s↑1} E0 s^T0′ = 1 − √(1 − 4pq) = 1 − |p − q|.

If p = q this gives 1. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p.
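A quick simulation sketch of Theorem 37 (not in the original notes; p = 0.7 and the horizon are arbitrary choices). For p ≠ 1/2 the walk drifts away from 0, so returns after a long horizon are negligible.

    import random

    def returns_to_zero(p, horizon=1000):
        """Does the walk return to 0 within a long horizon?"""
        pos = 0
        for _ in range(horizon):
            pos += 1 if random.random() < p else -1
            if pos == 0:
                return True
        return False

    p, N = 0.7, 10000
    print(sum(returns_to_zero(p) for _ in range(N)) / N, 2 * min(p, 1 - p))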

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by

Ji = Σ_{n=1}^∞ 1(Sn = i),      (34)

a random variable first considered in Section 10, where it was also shown to be geometric: in Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric, and in the recurrent case Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1, as already known. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J0 of visits to state 0 is geometrically distributed:

P0(J0 ≥ k) = (2(p ∧ q))^k,  k = 0, 1, 2, ...,
E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J0 ≥ 1 if and only if the first return time to zero is finite. So

P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so

P0(J0 ≥ k) = P0(J0 ≥ 1)^k = (2(p ∧ q))^k.

This proves the first formula. For the second formula we have

E0 J0 = Σ_{k=1}^∞ P0(J0 ≥ k) = Σ_{k=1}^∞ (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, the same formulae apply for any i:

Pi(Ji ≥ k) = (2(p ∧ q))^k,  Ei Ji = 2(p ∧ q)/(1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2.

We now look at the number of visits to some state x when we start from a different state.

Theorem 39. Consider a random walk with values in Z, starting from 0. If p ≤ 1/2 then

P0(Jx ≥ k) = (p/q)^(x∨0) (2p)^(k−1), k ≥ 1,    E0 Jx = (p/q)^(x∨0) / (1 − 2p).

If p ≥ 1/2 then

P0(Jx ≥ k) = (q/p)^((−x)∨0) (2q)^(k−1), k ≥ 1,    E0 Jx = (q/p)^((−x)∨0) / (1 − 2q).

Proof. Suppose x > 0. Then

P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),

simply because if Jx ≥ k ≥ 1 then Tx < ∞. Apply now the strong Markov property at Tx:

P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).

The reason that we replaced k by k − 1 is that, given Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35 and the second in Theorem 38 (with Remark 1). So we obtain

P0(Jx ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^(k−1).

If p ≤ 1/2 this gives P0(Jx ≥ k) = (p/q)^x (2p)^(k−1); if p ≥ 1/2 it gives P0(Jx ≥ k) = (2q)^(k−1).

We repeat the argument for x < 0 and find

P0(Jx ≥ k) = ((q/p) ∧ 1)^|x| (2(p ∧ q))^(k−1).

If p ≥ 1/2 this gives P0(Jx ≥ k) = (q/p)^|x| (2q)^(k−1); if p ≤ 1/2 it gives P0(Jx ≥ k) = (2p)^(k−1).

As for the expectation, we apply E0 Jx = Σ_{k=1}^∞ P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2. If the random walk "drifts down" (i.e. p < 1/2) then E0 Jx drops down to zero geometrically fast as x tends to +∞. But for x < 0 we have E0 Jx = 1/(1 − 2p) regardless of x. In other words, on the average, the walk visits any point below 0 exactly the same number of times.

Remark 2: If the initial state is not 0, use spatial homogeneity to obtain the analogous formulae, i.e. Py(Jx ≥ k) = P0(Jx−y ≥ k) and Ey Jx = E0 Jx−y.
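A simulation sketch of Theorem 38 (not in the original notes; p = 0.7, the horizon and the sample size are arbitrary choices). Because the walk is transient for p ≠ 1/2, a long finite horizon captures essentially all visits to 0.

    import random

    def visits_to_zero(p, horizon=1000):
        """Total number of visits of the walk to 0 within a long horizon."""
        pos, j = 0, 0
        for _ in range(horizon):
            pos += 1 if random.random() < p else -1
            j += (pos == 0)
        return j

    p, N = 0.7, 5000
    samples = [visits_to_zero(p) for _ in range(N)]
    lam = 2 * min(p, 1 - p)                 # = 0.6 here
    for k in range(4):
        print(k, sum(j >= k for j in samples) / N, lam ** k)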

32 Duality

Consider a random walk Sk = Σ_{i=1}^k ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (S̃k, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by

S̃k := Sn − Sn−k,  0 ≤ k ≤ n.

Theorem 40. (S̃k, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ1, ξ2, ..., ξn) has the same distribution as (ξn, ξn−1, ..., ξ1).

Thus, every time we have a probability for S we can translate it into a probability for S̃, which is basically another probability for S. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk and x a positive integer. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then

P0(Tx = n) = (x/n) P0(Sn = x),  x > 0.

Proof. Rewrite the ballot theorem in terms of the dual:

P0(S̃1 > 0, ..., S̃n > 0 | S̃n = x) = x/n.

But, because the two processes have the same distribution, the left-hand side equals

P0(Sn − Sn−k > 0, 1 ≤ k ≤ n | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1, Sn = x) / P0(Sn = x)
                                       = P0(Tx = n) / P0(Sn = x).

Pause for reflection: we have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:

E0 s^Tx = ((1 − √(1 − 4pqs²))/(2qs))^x.      (35)

Specialising to the symmetric case we have

E0 s^Tx = ((1 − √(1 − s²))/s)^x.

In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:

P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).      (36)

Lastly, in Theorem 41, we derived, using duality, the exact probabilities for a simple symmetric random walk:

P0(Tx = n) = (x/n) P0(Sn = x),  x > 0.      (37)

We remark here something methodological: it was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx. This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of s^n in (35); summing up (37) should give (36); and taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
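Numerically, however, the compatibility is easy to confirm. A sketch (not in the original notes; x = 3 and n = 21 are arbitrary choices): summing the point probabilities (37) up to time n reproduces the distribution function (36).

    import math

    def pS(n, y):
        """P0(Sn = y) for a simple symmetric random walk."""
        if (n + y) % 2 or abs(y) > n:
            return 0.0
        return math.comb(n, (n + y) // 2) * 2.0 ** (-n)

    x, n = 3, 21
    sum_37 = sum(x / m * pS(m, x) for m in range(1, n + 1))        # summing (37)
    p_abs_ge = sum(pS(n, y) + pS(n, -y) for y in range(x, n + 1))  # P0(|Sn| >= x)
    formula_36 = p_abs_ge - 0.5 * (pS(n, x) + pS(n, -x))
    print(sum_37, formula_36)   # both equal P0(Tx <= n)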

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+(2n) be the number of sides of the polygonal line that are above the horizontal axis and T−(2n) = 2n − T+(2n) the number of sides that are below. Clearly, T±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.]

We can think of T+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, P(T+(2n) = m) is largest when m = 0 or m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

P(T+(2n) = 2k) = P(S2k = 0) P(S2n−2k = 0),  0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:

P(T+(2n) = 2k) = Σ_{r=1}^n P(T+(2n) = 2k, T0′ = 2r).

Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write

P(T+(2n) = 2k) = Σ_{r=1}^n P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
               + Σ_{r=1}^n P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).

In the first case,

P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) − 2r = 2k − 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),

because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis and so T+(2n) is reduced by 2r; furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows

P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).

We also observe that, by symmetry,

P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f_2r.

Letting p_{2k,2n} = P(T+(2n) = 2k), we have shown that

p_{2k,2n} = (1/2) Σ_{r=1}^k f_2r p_{2k−2r,2n−2r} + (1/2) Σ_{r=1}^{n−k} f_2r p_{2k,2n−2r}.

We want to show that p_{2k,2n} = u_2k u_{2n−2k}, where u_2k = P(S2k = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we realise that

p_{2k,2n} = (1/2) u_{2n−2k} Σ_{r=1}^k f_2r u_{2k−2r} + (1/2) u_2k Σ_{r=1}^{n−k} f_2r u_{2n−2k−2r};

since u_2m = Σ_{r=1}^m f_2r u_{2m−2r} (decomposition at the first return to 0), each of the two sums collapses, giving p_{2k,2n} = u_2k u_{2n−2k}, as claimed. Explicitly, we have

P(T+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^(−2n),  0 ≤ k ≤ n.      (38)

With 2n = 100, we plot the function P(T+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Plot: the points P(T+(100) = 2k) against k = 0, 1, ..., 50; the values fall from about 0.08 at k = 0 to about 0.02 near the middle and rise back to about 0.08 at k = 50, a U-shaped curve maximised at the two ends.]
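Formula (38) is easy to evaluate directly. The sketch below (not in the original notes) reproduces the numbers behind the plot and checks that they sum to 1.

    import math

    def p_plus(n, k):
        """P(T+(2n) = 2k) from formula (38)."""
        return math.comb(2 * k, k) * math.comb(2 * n - 2 * k, n - k) * 4.0 ** (-n)

    n = 50
    row = [p_plus(n, k) for k in range(n + 1)]
    print(round(sum(row), 12))            # total probability is 1
    print(row[0], row[n // 2], row[n])    # the endpoints beat the middle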

The arcsine law*
Use of Stirling's approximation in formula (38) gives

P(T+(2n) = 2k) ∼ 1/(π √(k(n − k))),  as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

P(T+(2n)/2n ∈ [a, b]) = Σ_{na ≤ k ≤ nb} P(T+(2n)/2n = k/n)
                      ∼ Σ_{a ≤ k/n ≤ b} 1/(π √(k(n − k)))
                      = Σ_{a ≤ k/n ≤ b} 1/(π √((k/n)(1 − k/n))) · (1/n).

This is easily seen to have a limit as n → ∞, the limit being

∫_a^b dt/(π √(t(1 − t))),

because the last sum is simply a Riemann approximation to this integral. This shows the celebrated arcsine law:

lim_{n→∞} P(T+(2n)/2n ≤ x) = ∫_0^x dt/(π √(t(1 − t))) = (2/π) arcsin √x.

We thus have that the limiting distribution of T+(2n)/2n (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution has density 1/(π √(x(1 − x))), 0 < x < 1, whose plot is amazingly close to the plot of the figure above.

Now consider again a SSRW and the hitting times (Tx, x ≥ 0). This is another random walk, because the summands in Tx = Σ_{y=1}^x (Ty − Ty−1) are i.i.d., all distributed as T1:

P(T1 ≥ 2n) = P(S2n = 0) = C(2n, n) 2^(−2n) ∼ 1/√(πn).
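A numerical sketch of the arcsine law (not in the original notes; n = 500 and x = 0.25 are arbitrary choices): the exact finite-n distribution function computed from (38) is already close to the limit (2/π) arcsin √x, which equals 1/3 here.

    import math

    def p_plus(n, k):                      # formula (38) again
        return math.comb(2 * k, k) * math.comb(2 * n - 2 * k, n - k) * 4.0 ** (-n)

    n, x = 500, 0.25
    finite_n = sum(p_plus(n, k) for k in range(int(n * x) + 1))  # P(T+(2n)/2n <= x)
    print(finite_n, 2 / math.pi * math.asin(math.sqrt(x)))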

From the tail estimate P(T1 ≥ 2n) ∼ 1/√(πn) we immediately get that the expectation of T1 is infinite: ET1 = ∞. Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south.

We describe this motion as a random walk in Z² = Z × Z. Let ξ1, ξ2, ... be i.i.d. random vectors such that

P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4.

We then let S0 = (0, 0) and Sn = Σ_{i=1}^n ξi, n ≥ 1. If x ∈ Z², we let x^1 be its horizontal coordinate and x^2 its vertical one. So Sn = (S^1_n, S^2_n), where S^1_n = Σ_{i=1}^n ξ^1_i and S^2_n = Σ_{i=1}^n ξ^2_i.

Lemma 20. (S^1_n, n ∈ Z+) is a random walk. (S^2_n, n ∈ Z+) is also a random walk. The two random walks are not independent.

Proof. Since ξ1, ξ2, ... are i.i.d., so are ξ^1_1, ξ^1_2, ..., the latter being functions of the former. Hence (S^1_n, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0. The increments are identically distributed with

P(ξ^1_n = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
P(ξ^1_n = 1) = P(ξn = (1, 0)) = 1/4,
P(ξ^1_n = −1) = P(ξn = (−1, 0)) = 1/4.

The same argument applies to (S^2_n, n ∈ Z+), showing that it, too, is a random walk. But the two stochastic processes are not independent: for each n, the random variables ξ^1_n, ξ^2_n are not independent, simply because if one takes the value 0 the other is nonzero.

Consider now a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x^1, x^2) into (x^1 + x^2, x^1 − x^2). So let

Rn = (R^1_n, R^2_n) := (S^1_n + S^2_n, S^1_n − S^2_n).

We have:

Lemma 21. Each of (R^1_n, n ∈ Z+) and (R^2_n, n ∈ Z+) is a random walk in one dimension, and (Rn, n ∈ Z+) is a random walk in two dimensions. Moreover, the two random walks (R^1_n) and (R^2_n) are independent.

Proof. We have R^1_n = Σ_{i=1}^n (ξ^1_i + ξ^2_i) and R^2_n = Σ_{i=1}^n (ξ^1_i − ξ^2_i). So R^1_n = Σ_{i=1}^n η^1_i and R^2_n = Σ_{i=1}^n η^2_i, where η^1_n = ξ^1_n + ξ^2_n and η^2_n = ξ^1_n − ξ^2_n. The sequence of vectors ηn = (η^1_n, η^2_n), n ≥ 1, is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, .... Hence (Rn, n ∈ Z+) is a random walk in two dimensions, and a similar argument shows that each of (R^1_n, n ∈ Z+), (R^2_n, n ∈ Z+) is a random walk in one dimension. The possible values of each of η^1_n, η^2_n are ±1, and

P(η^1_n = 1, η^2_n = 1) = P(ξn = (1, 0)) = 1/4,
P(η^1_n = −1, η^2_n = −1) = P(ξn = (−1, 0)) = 1/4,
P(η^1_n = 1, η^2_n = −1) = P(ξn = (0, 1)) = 1/4,
P(η^1_n = −1, η^2_n = 1) = P(ξn = (0, −1)) = 1/4.

Hence P(η^1_n = 1) = P(η^1_n = −1) = 1/2 and P(η^2_n = 1) = P(η^2_n = −1) = 1/2, and so

P(η^1_n = ε, η^2_n = ε′) = P(η^1_n = ε) P(η^2_n = ε′),  for all choices of ε, ε′ ∈ {−1, +1},

i.e. η^1_n, η^2_n are independent for each n. Since the two walks are built from these independent coordinates, they are themselves independent. In particular, each of (R^1_n), (R^2_n) is a simple symmetric random walk.

Lemma 22.

P(S2n = 0) = (C(2n, n))² 2^(−4n).

Proof. Recall, from Theorem 7, that for a simple symmetric random walk (started from S0 = 0), P(S2n = 0) = C(2n, n) 2^(−2n); the same is true for each of (R^1_n), (R^2_n). Now,

P(S2n = 0) = P(R2n = 0) = P(R^1_2n = 0, R^2_2n = 0) = P(R^1_2n = 0) P(R^2_2n = 0) = (C(2n, n) 2^(−2n))²,

where the third equality follows from the independence of the two random walks.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^∞ P(Sn = 0) = ∞. But, by the previous lemma and Stirling's approximation,

P(S2n = 0) = (C(2n, n) 2^(−2n))² ∼ c/n,  as n → ∞,

where c is a positive constant, in the sense that the ratio of the two sides goes to 1. Since Σ_n c/n = ∞, the walk is recurrent.
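A small check of Lemma 22 by simulation (not in the original notes; n = 4 and the number of trials are arbitrary choices):

    import math, random

    def back_at_origin_freq(steps, trials=200000):
        """Fraction of 2-D walks that sit at the origin after `steps` steps."""
        moves = ((0, 1), (0, -1), (1, 0), (-1, 0))
        hits = 0
        for _ in range(trials):
            x = y = 0
            for _ in range(steps):
                dx, dy = random.choice(moves)
                x += dx
                y += dy
            hits += (x == 0 and y == 0)
        return hits / trials

    n = 4
    print(back_at_origin_freq(2 * n), (math.comb(2 * n, n) * 2.0 ** (-2 * n)) ** 2)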

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let a < 0 < b, and recall that

P(Ta < Tb) = b/(b − a),  P(Tb < Ta) = −a/(b − a).

We can read this as follows:

P(S_{Ta∧Tb} = a) = b/(b − a),  P(S_{Ta∧Tb} = b) = −a/(b − a).

This last display tells us the distribution of S_{Ta∧Tb}. Note, in particular, that E S_{Ta∧Tb} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b with a < 0 < b such that pa + (1 − p)b = 0: e.g. if p = M/N, with M, N ∈ N and M < N, choose a = M − N and b = M. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability that it exits from the left point is p, and the probability that it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z) with zero mean: Σ_{i∈Z} i pi = 0. Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that

P(ST = i) = pi,  i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ ((−N) × N) with the following distribution:

P(A = i, B = j) = c^(−1) (j − i) pi pj,  i < 0 < j,
P(A = 0, B = 0) = p0,

where c^(−1) is chosen by normalisation:

P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.

This leads to

c^(−1) Σ_{j>0} j pj Σ_{i<0} pi + c^(−1) Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.

Since Σ_{i∈Z} i pi = 0, the quantities Σ_{i<0} (−i) pi and Σ_{i>0} i pi have a common value, and, using this, we get that c is precisely this common value:

c = Σ_{i<0} (−i) pi = Σ_{i>0} i pi.

Letting Ti = inf{n ≥ 0 : Sn = i}, i ∈ Z, and assuming that (A, B) is independent of (Sn), we consider the random variable

Z = S_{T_A ∧ T_B}.

The claim is that P(Z = k) = pk, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p0, by definition.

Suppose next k > 0. We have

P(Z = k) = P(S_{T_A ∧ T_B} = k) = Σ_{i<0} Σ_{j>0} P(S_{T_i ∧ T_j} = k) P(A = i, B = j)
         = Σ_{i<0} P(S_{T_i ∧ T_k} = k) P(A = i, B = k)
         = Σ_{i<0} ((−i)/(k − i)) · c^(−1) (k − i) pi pk
         = c^(−1) Σ_{i<0} (−i) pi pk = pk.

Similarly, P(Z = k) = pk for k < 0.
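Here is a runnable sketch of this construction (not part of the original notes; the test distribution is an arbitrary zero-mean example). The pair (0, 0) is weighted by c·p0 so that a single normalisation, done internally by random.choices, reproduces the law of (A, B).

    import random
    from collections import Counter

    def skorokhod_sample(p):
        """One draw of S at the stopping time min(T_A, T_B) from Theorem 43.
        p is a zero-mean probability distribution {i: p_i} on the integers."""
        c = sum(-i * q for i, q in p.items() if i < 0)   # the normalising constant
        pairs = [((0, 0), c * p.get(0, 0.0))] + \
                [((i, j), (j - i) * p[i] * p[j])
                 for i in p if i < 0 for j in p if j > 0]
        (a, b) = random.choices([ab for ab, _ in pairs],
                                weights=[w for _, w in pairs])[0]
        if (a, b) == (0, 0):
            return 0
        pos = 0
        while a < pos < b:                  # run the SSRW until it exits (a, b)
            pos += random.choice((-1, 1))
        return pos

    dist = {-2: 0.3, 0: 0.3, 1: 0.2, 2: 0.2}    # zero mean: -0.6 + 0.2 + 0.4 = 0
    freq = Counter(skorokhod_sample(dist) for _ in range(20000))
    print({i: round(freq[i] / 20000, 3) for i in sorted(dist)})   # close to dist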

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

N = {1, 2, 3, ...}.

The set Z of integers consists of N, their negatives and zero:

Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}.

Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a and d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice versa, leading to d1 = d2.)

We will show that

gcd(a, b) = gcd(a, b − a).

Indeed, let d = gcd(a, b). Then d | a and d | b, so d | b − a; hence (i) holds for the pair (a, b − a). Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds too, and therefore d = gcd(a, b − a).

This gives us an algorithm for finding the gcd of two numbers. Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the two numbers and continue. For instance,

gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38)
             = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2)
             = gcd(4, 2) = gcd(2, 2) = 2.

In the first line we subtracted 38 exactly 3 times; indeed 4 × 38 > 120, so 3 is the maximum number of times that 38 "fits" into 120. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school): we may summarise each line by replacing the largest of the two numbers by the remainder of the division of the largest by the smallest. Thus,

gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

The theorem of division says the following: given two positive integers a, b, with a > b, we can find an integer q and an integer r ∈ {0, 1, ..., b − 1} such that

a = qb + r.

The long division algorithm is a process for finding the quotient q and the remainder r. For example, dividing 3450 by 26, long division gives 3450 = 132 × 26 + 18, i.e. q = 132 and r = 18.

Here are some other facts about the greatest common divisor. First, if d = gcd(a, b) then there are integers x, y such that

d = xa + yb.

To see why this is true, let us follow the previous example again: gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2. In the process, we used the divisions

120 = 3 × 38 + 6,
38 = 6 × 6 + 2.

Working backwards, we have

2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.

Thus, gcd(38, 120) = 38x + 120y, with x = 19, y = −6. The same logic works in general.
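Both procedures are one-liners to code. A sketch (not part of the original notes): gcd implements Euclid's algorithm, and extended_gcd works backwards through the divisions, as in the example above, to produce integers x, y with gcd(a, b) = xa + yb.

    def gcd(a, b):
        """Euclid's algorithm: replace the pair by (smaller, remainder)."""
        while b:
            a, b = b, a % b
        return a

    def extended_gcd(a, b):
        """Return (d, x, y) with d = gcd(a, b) and d = x*a + y*b."""
        if b == 0:
            return a, 1, 0
        d, x, y = extended_gcd(b, a % b)
        return d, y, x - (a // b) * y     # back-substitution step

    print(gcd(120, 38))            # 2
    print(extended_gcd(38, 120))   # (2, 19, -6): 19*38 - 6*120 = 2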

Second, we can show that, for integers a, b not both zero,

gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.

To see this, let d be the right-hand side. Then d = xa + yb, for some x, y ∈ Z. We need to show (i) that d | a and d | b, and (ii) that if c | a and c | b then c | d. For (ii): if c | a and c | b, then c divides any linear combination of a, b; in particular, it divides d. For (i): suppose this is not true, say that d does not divide b. Since d > 0, we can divide b by d to find b = qd + r, for some 1 ≤ r ≤ d − 1. But then

b = q(xa + yb) + r,

yielding r = (−qx)a + (1 − qy)b. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder); so d is not minimum as claimed, and this is a contradiction. Hence r = 0, i.e. d | b, and similarly d | a; so d divides both a and b.

Finally, the greatest common divisor of 3 integers can now be defined in a similar manner, and, in general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. In general, a relation can be thought of as a set of pairs of elements of the set A. For example, the relation < between real numbers can be thought of as consisting of the pairs of numbers (x, y) such that x is smaller than y.

A relation is called reflexive if it contains all pairs (x, x). The equality relation = is reflexive; the relation < is not. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The equality relation is symmetric; < is not. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric; it can also be thought of as reflexive (unless we do not want to accept that one is a friend of oneself). A relation is transitive if, whenever it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive, but the friendship relation is not necessarily transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different elements x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^(−1)(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names; a function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names; so can the second, and so on. Hence the total number of assignments is

m × m × ··· × m  (n times) = m^n.

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is

(m)_n := m(m − 1)(m − 2) ··· (m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A. Incidentally, we denote (n)_n also by n!; in other words,

n! = 1 × 2 × 3 × ··· × n

stands for the product of the first n natural numbers, and n! is the total number of permutations of a set of n elements.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found as follows. Designating which of the n elements of A will be chosen, the first element can be chosen in n ways, the second in n − 1 ways, and so on, the k-th in n − k + 1 ways; so we have

(n)_k = n(n − 1)(n − 2) ··· (n − k + 1)

ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

(n)_k / k! =: C(n, k),

where the symbol C(n, k) (the binomial coefficient) is merely a name for the number computed on the left side of this equation. Clearly,

C(n, k) = n!/(k!(n − k)!) = C(n, n − k).

Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1. The total number of subsets of A (regardless of their number of elements) is thus equal to

Σ_{k=0}^n C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem. We thus observe that there are as many subsets of A as elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.

We shall let C(n, k) = 0 if n < k. Also, if x is a real number, we define

C(x, m) = x(x − 1) ··· (x − m + 1)/m!,

and, if y is not an integer, then, by convention, we let C(n, y) = 0.

The Pascal triangle relation states that

C(n + 1, k) = C(n, k − 1) + C(n, k).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, the pig, say. If the pig is not included in the sample, we choose k animals from the remaining n ones, and this can be done in C(n, k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen, and this can be done in C(n, k − 1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Maximum and minimum

The minimum of two numbers a, b is denoted by

a ∧ b = min(a, b).

Thus a ∧ b = a if a ≤ b, and = b if b ≤ a. The maximum is denoted by

a ∨ b = max(a, b).

The notation extends to more than two numbers. For example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that

−(a ∨ b) = (−a) ∧ (−b)  and  (a ∧ b) + c = (a + c) ∧ (b + c).

In particular, we use the following notations:

a+ = max(a, 0),  a− = −min(a, 0).

a+ is called the positive part of a, and a− is called the negative part of a. Both quantities are nonnegative. Note that it is always true that

a = a+ − a−.

The absolute value of a is defined by

|a| = a+ + a−.

 is a column representation of y := ϕ(x) with respect to the given basis . x1 . ym v1 .    . . . ϕ are linear functions: It is a m × n matrix. . y ∈ Rn . and all α. x  . . . . xn n   1 y := ϕ(x) = j=1 xj ϕ(uj ). xn the given basis. So if we choose a basis v1 . . . Let u1 . . Then any x ∈ Rn can be uniquely written as n x= j=1 x j uj . . . . we have ϕ(x) = m n vi i=1 j=1 aij xj . But the ϕ(uj ) are vectors in Rm .   . .for all x. . . The relation between the column representation of y and the column representation of x is written as  1    1 x y a11 · · · a1n  .  The column  .  .  = ··· ··· ···  .  is a column representation of x with respect to . The matrix A of ϕ with respect to the choice of the two basis is defined by   a11 · · · a1n A :=  · · · · · · · · ·  am1 · · · amn  y1  Then ϕ ◦ ψ is a linear function: Rn − → Rℓ − 112 ϕ◦ψ . . Suppose that ψ. m. am1 · · · amn xn ym Rn − Rm − Rℓ → → ψ ϕ  . . β ∈ R. vm . j=1 i = 1. Then.  where ∈ R. Combining. vm in Rn we can express each ϕ(uj ) as a linear combination of the basis vectors: m ϕ(uj ) = i=1 aij vi . un be a basis for Rn . . The column  . Let y i := n aij xj . because ϕ is linear. . . .

Let u1, ..., un be a basis for R^n. Then any x ∈ R^n can be uniquely written as

x = Σ_{j=1}^n x^j u_j,

where x^j ∈ R; the column (x^1, ..., x^n)^T is a column representation of x with respect to the given basis. Because φ is linear,

y := φ(x) = Σ_{j=1}^n x^j φ(u_j).

But the φ(u_j) are vectors in R^m, so if we choose a basis v1, ..., vm in R^m we can express each φ(u_j) as a linear combination of the basis vectors:

φ(u_j) = Σ_{i=1}^m a_ij v_i.

The matrix A of φ with respect to the choice of the two bases is defined as the m × n array

A := (a_ij, 1 ≤ i ≤ m, 1 ≤ j ≤ n).

Combining, we have

φ(x) = Σ_{i=1}^m v_i Σ_{j=1}^n a_ij x^j.

Let y^i := Σ_{j=1}^n a_ij x^j, i = 1, ..., m. Then the column (y^1, ..., y^m)^T is a column representation of y = φ(x) with respect to the basis v1, ..., vm, and the relation between the column representation of y and the column representation of x is the usual matrix-vector multiplication.

Suppose now that ψ : R^n → R^m and φ : R^m → R^ℓ are linear functions. Then φ ∘ ψ : R^n → R^ℓ is a linear function. Fix three bases on R^n, R^m and R^ℓ and let A, B be the matrix representations of φ, ψ, respectively, with respect to these bases. Then the matrix representation of φ ∘ ψ is AB, where

(AB)_ij = Σ_{k=1}^m A_ik B_kj,  1 ≤ i ≤ ℓ,  1 ≤ j ≤ n.

So A is an ℓ × m matrix, B is an m × n matrix, and AB is an ℓ × n matrix.

A matrix A is called square if it is an n × n matrix. If we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n, i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis: the standard basis in R^d consists of the vectors

e1 = (1, 0, ..., 0)^T,  e2 = (0, 1, 0, ..., 0)^T,  ...,  ed = (0, ..., 0, 1)^T;

in other words, e_i^j = 1(i = j).

Supremum and infimum

When we have a set I of numbers, the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, ...} has no minimum. In these cases we use the notation inf I, standing for the infimum of I, defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that, if the set I is bounded from below, then this maximum of all lower bounds exists, and it is this that we call the infimum. Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞. If I is empty then inf ∅ = +∞ and sup ∅ = −∞.

inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ inf I + ε. Likewise, sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If x1, x2, ... is a sequence of numbers then

inf{x1, x2, ...} ≤ inf{x2, x3, ...} ≤ inf{x3, x4, ...} ≤ ···

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it lim inf xn (limit inferior) of the sequence:

lim inf xn = lim_{n→∞} inf{xn, xn+1, xn+2, ...}.

Similarly, the numbers sup{xn, xn+1, ...} are decreasing in n, and their limit is called lim sup xn (limit superior). We note that

lim inf(−xn) = − lim sup xn.

The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely:

If A1, A2, ... are mutually disjoint events, then P(∪_{n≥1} An) = Σ_{n≥1} P(An).

Everything else follows from this axiom. That is why Probability Theory is very rich: by not requiring too much, it leaves room for a lot to be derived. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'. (Of course, in these notes, and everywhere else, I do sometimes "cheat" from a strict mathematical point of view, but not too much.)

Some standard consequences of the axiom, used throughout: if the events A, B are disjoint, then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B);

this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. In general, P(∪n An) ≤ Σn P(An); so, if An are events with P(An) = 0, then P(∪n An) = 0 and, similarly, if An are events with P(An) = 1, then P(∩n An) = 1. If A1, A2, ... are increasing with A = ∪n An, then P(A) = lim_{n→∞} P(An); if they are decreasing with A = ∩n An, then again P(A) = lim_{n→∞} P(An).

Quite often, to compute P(A), it is useful to find disjoint events B1, B2, ..., such that ∪n Bn = Ω, in which case we write

P(A) = Σn P(A ∩ Bn) = Σn P(A | Bn) P(Bn).

Recall that P(B | A) is defined by P(A ∩ B)/P(A) and is a very well-motivated definition. Often, to compute P(A ∩ B), we read the definition as

P(A ∩ B) = P(A) P(B | A).

This is generalisable for a number of events:

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2).

If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A) P(B | A) = P(B) P(A | B), and this is called Bayes' theorem. In Markov chain theory, we work a lot with conditional probabilities.

If A is an event, then 1A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 on A^c. Note that 1A 1B = 1A∩B, 1A^c = 1 − 1A, and E 1A = P(A).

To show that EX ≤ EY, we often need to argue that X ≤ Y. Similarly, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. that A is a subset of B). For example, Markov's inequality for a positive random variable X states that

P(X > x) ≤ EX/x.

This follows from the logical statement x 1(X > x) ≤ X, which holds since X is positive, by taking expectations of both sides (expectation is an increasing operator).

An application of this is:

P(X < ∞) = lim_{x→∞} P(X ≤ x).

In case the random variable X is discrete, EX = Σx x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫ x f(x) dx over the whole real line. The expectation of a nonnegative random variable X may also be defined by

EX = ∫_0^∞ P(X > x) dx.

For a random variable that takes values in N we can write EX = Σ_{n=1}^∞ P(X ≥ n); this is due to the fact that P(X ≥ n) = E 1(X ≥ n) and X = Σ_{n≥1} 1(X ≥ n). For a random variable which takes positive and negative values, we define X+ = max(X, 0) and X− = −min(X, 0), which are both nonnegative, and then define

EX = EX+ − EX−,

which is motivated by the algebraic identity X = X+ − X−. This definition makes sense only if EX+ < ∞ or EX− < ∞. In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

Generating functions

A generating function of a sequence a0, a1, ..., is an infinite polynomial having the ai as coefficients:

G(s) = a0 + a1 s + a2 s² + ···.

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. If it exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). Note that |G(s)| ≤ Σn |an| |s|^n, and this is ≤ Σn |an| if s ∈ [−1, 1]; so if Σn |an| < ∞ then G(s) exists for all s ∈ [−1, 1]. As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is

Σ_{n=0}^∞ ρ^n s^n = 1/(1 − ρs),  as long as |ρs| < 1, i.e. |s| < 1/|ρ|.      (39)

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence pn = P(X = n), n = 0, 1, ..., is called the probability generating function of X:

φ(s) = E s^X = Σ_{n=0}^∞ s^n P(X = n).

This is defined for all −1 < s < 1. We do allow X to, possibly, take the value +∞, with

p∞ := P(X = ∞) = 1 − Σ_{n=0}^∞ pn.

Note that Σ_{n=0}^∞ pn ≤ 1, with equality if and only if (pn, n ≥ 0) is a probability distribution on Z+ = {0, 1, 2, ...}. The term s^∞ p∞ could formally appear in the sum defining φ(s), but, since |s| < 1, we have s^∞ = 0. Note that

φ(0) = P(X = 0),  while  φ(1) = Σ_{n=0}^∞ P(X = n) = P(X < +∞),

and this is why φ is called probability generating function. We can recover the probabilities pn from the formula

pn = φ^(n)(0)/n!,

as follows from Taylor's theorem. We can also recover the expectation of X from

EX = lim_{s↑1} φ′(s),

something that can be formally seen from

(d/ds) E s^X = E (d/ds) s^X = E X s^(X−1),

by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

x_{n+2} = x_{n+1} + x_n, n ≥ 0,  with x0 = x1 = 1.

If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is

Σ_{n≥0} x_{n+1} s^n = s^(−1)(X(s) − x0),

and the generating function of (x_{n+2}, n ≥ 0) is

Σ_{n≥0} x_{n+2} s^n = s^(−2)(X(s) − x0 − s x1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0):

s^(−2)(X(s) − x0 − s x1) = s^(−1)(X(s) − x0) + X(s).

Since x0 = x1 = 1, we can solve for X(s) and find

X(s) = −1/(s² + s − 1).

But the polynomial s² + s − 1 has two roots:

a = (√5 − 1)/2,  b = −(√5 + 1)/2.

Hence s² + s − 1 = (s − a)(s − b), and so

X(s) = −1/((s − a)(s − b)) = (1/(b − a)) (1/(b − s) − 1/(a − s)).

But (39) tells us that 1/((1/ρ) − s) = ρ/(1 − ρs) is the generating function of (ρ^(n+1), n ≥ 0). This tells us that 1/(b − s) is the generating function of ((1/b)^(n+1), n ≥ 0) and that 1/(a − s) is the generating function of ((1/a)^(n+1), n ≥ 0). Hence X(s) is the generating function of

x_n = (1/(b − a)) ((1/b)^(n+1) − (1/a)^(n+1)),

which is the solution to the recursion: the n-th term of the Fibonacci sequence. Noting that ab = −1, so that 1/b = −a and 1/a = −b, we can also write this as

x_n = ((−a)^(n+1) − (−b)^(n+1))/(b − a).

Despite the square roots in the formula, this number is always an integer (it has to be!).
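A quick check of the closed form (a sketch, not in the original notes), comparing it with the recursion for the first twenty terms; with the common notation φ = −b and ψ = −a, the formula reads x_n = (φ^(n+1) − ψ^(n+1))/√5.

    import math

    def fib_closed(n):
        """x_n from the generating-function solution; phi = -b, psi = -a."""
        phi = (1 + math.sqrt(5)) / 2
        psi = (1 - math.sqrt(5)) / 2
        return round((phi ** (n + 1) - psi ** (n + 1)) / math.sqrt(5))

    x = [1, 1]
    while len(x) < 20:
        x.append(x[-1] + x[-2])
    print(all(fib_closed(n) == x[n] for n in range(20)))   # True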

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, the so-called distribution function, namely the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density

f(x) = (d/dx) P(X ≤ x).

All these objects can be referred to as the "distribution" or "law" of X. Even the generating function E s^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+. Warning: the verb "specify" should not be interpreted as "compute", in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using, or merely talking about, such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × ··· × n for large n then, since the number is so big, it's better to use logarithms:

n! = e^(log n!) = exp Σ_{k=1}^n log k.

Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that

∫_k^(k+1) log x dx ≈ log k.

Hence

Σ_{k=1}^n log k ≈ ∫_1^n log x dx.

Since (d/dx)(x log x − x) = log x, it follows that

∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

n! ≈ exp(n log n − n) = n^n e^(−n).

We call this the rough Stirling approximation.
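A short numerical sketch (not in the original notes) comparing n! with the rough approximation and with the refined one discussed next, which multiplies it by √(2πn):

    import math

    for n in (1, 5, 10, 50, 100):
        rough = n ** n * math.exp(-n)
        refined = rough * math.sqrt(2 * math.pi * n)
        exact = math.factorial(n)
        print(n, exact / rough, exact / refined)   # second ratio tends to 1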

119 . This completely trivial thought gives the very useful Markov’s inequality. It was shown in Theorem 38 that J0 satisfies the memoryless property. defined in (34). as n → ∞. expressing monotonicity. by taking expectations of both sides (expectation is an increasing operator). n ∈ Z+ . because we know that the n probability goes to zero as n → ∞. g(X) ≥ g(x)1(X ≥ x). expressing non-negativity. The geometric distribution We say that the random variable X with values in Z+ is geometric if it has the memoryless property: P (X − n ≥ m | X ≥ n) = P (X ≥ m). and this is terrible. And if y < x then 1(y < x) = 0 and the display reads g(y) ≥ 0. we shall prove Strirling’s approximation: √ n! ∼ nn e−n 2πn where the symbol an ∼ bn means that an /bn → 0. We can case both properties of g neatly by writing For all x. so this little section is a reminder. We baptise this geometric(λ) distribution. a geometric function of k. Consider an increasing and nonnegative function g : R → R+ . In fact it was shown there that P (J0 ≥ k) = P (J0 ≥ 1)k . Eg(X) ≥ g(x)P (X ≥ x) . Markov’s inequality Arguably this is the most useful inequality in Probability. if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x). We have already seen a geometric random variable. 0 ≤ n ≤ m. That was the random variable J0 = ∞ n=1 1(Sn = 0). for all x. If we think of X as the time of occurrence of a random phenomenon. Then. Surely. Indeed. Let X be a random variable. you have seen it before.we have exactly n heads equals 2n 2−2n ≈ 1. and is based on the concept of monotonicity. and so. then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from X. To remedy this. where λ = P (X ≥ 1). y ∈ R g(y) ≥ g(x)1(y ≥ x). The same process shows that a geometric random variable has a distribution of the form P (X ≥ n) = λn .

Bibliography

1. Bremaud, P (1997) An Introduction to Probabilistic Modeling. Springer.
2. Bremaud, P (1999) Markov Chains. Springer.
3. Chung, K L and Aitsahlia, F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller, W (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead, C M and Snell, J L (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
6. Norris, J R (1998) Markov Chains. Cambridge University Press.
7. Parry, W (1981) Topics in Ergodic Theory. Cambridge University Press.

Index

absorbing, Aloha, aperiodic, arcsine law, axiom of Probability Theory, balance equations, ballot theorem, bank account, branching process, Catalan numbers, Cesaro averages, Chapman-Kolmogorov equation, closed, communicates with, communicating classes, convolution, coupling, coupling inequality, cycle formula, degree, detailed balance equations, digraph (directed graph), distribution, division, dual, dynamical system, equivalence classes, equivalence relation, ergodic, essential state, Ethernet, Fatou's lemma, Fibonacci sequence, first-step analysis, function of a Markov chain, Galton-Watson process, generating function, geometric distribution, Google, greatest common divisor, ground state, Helly-Bray Lemma, hitting time, increments, indicator, inessential state, infimum, initial distribution, invariant distribution, irreducible, Kolmogorov's loop criterion, law, leads to, Loynes' method, Markov chain, Markov property, Markov's inequality, matrix, maximum, meeting time, memoryless property, minimum, mixing, Monte Carlo, nearest neighbour random walk, neighbours, null recurrent, PageRank, period, permutation, positive recurrent, probability flow, probability generating function, random walk on a graph, recurrent, reflection, reflection principle, relation, ruin probability, running maximum, semigroup property, simple random walk, simple symmetric random walk, skip-free property, Skorokhod embedding, states, statespace, stationary, stationary distribution, steady-state, Stirling's approximation, stopping time, storage chain, strategy of the gambler, strong Markov property, subordination, subsequence of a Markov chain, supremum, symmetric, time-homogeneous, time-reversible, transient, transition diagram, transition probability, transition probability matrix, transitive, tree, urn, watching a chain in a set, Wright-Fisher model.
