Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1  Introduction
2  Examples of Markov chains
   2.1  A mouse in a cage
   2.2  Bank account
   2.3  Simple random walk (drunkard's walk)
   2.4  Simple random walk (drunkard's walk) in a city
   2.5  Actuarial chains
3  The Markov property
   3.1  Definition of the Markov property
   3.2  Transition probability and initial distribution
   3.3  Time-homogeneous chains
4  Stationarity
   4.1  Finding a stationary distribution
5  Topological structure
   5.1  The graph of a chain
   5.2  The relation "leads to"
   5.3  The relation "communicates with"
   5.4  Period
6  Hitting times and first-step analysis
7  Gambler's ruin
8  Stopping times and the strong Markov property
9  Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1 First hitting time decomposition
   10.2 Communicating classes and recurrence/transience
11 Positive recurrence


12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1 Functions of excursions
   13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1 Definition of stability
   18.2 The fundamental stability theorem for Markov chains
   18.3 Coupling
   18.4 Proof of the fundamental stability theorem
   18.5 Terminology
   18.6 Periodic chains
19 Limiting theory for Markov chains*
   19.1 Limiting probabilities for null recurrent states
   19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1 Subsequences
   23.2 Watching a Markov chain when it visits a set
   23.3 Subordination
   23.4 Deterministic function of a Markov chain
24 Applications


   24.1 Branching processes: a population growth application
   24.2 The ALOHA protocol: a computer communications application
   24.3 PageRank (trademark of Google): A World Wide Web application
   24.4 The Wright-Fisher model: an example from biology
   24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1 Paths that start and end at specific points
   26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1 Distribution after n steps
   27.2 The ballot theorem
   27.3 Some conditional probabilities
   27.4 Some remarkable identities
   27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1 First hitting time analysis
   30.2 First return to the origin
   30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1 Asymptotic distribution of the maximum
   31.2 Return probabilities
   31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*

34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I have tried both orderings and prefer the current one. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. At the end, I have a little mathematical appendix. A few starred sections should be considered "advanced" and can be omitted at first reading.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Here, x = x(t) denotes the position of the particle at time t. Its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, F = f(x, ẋ). Newton's second law says that the acceleration is proportional to the force: mẍ = F. The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

    xn+1 = yn
    yn+1 = rem(xn, yn),    n = 0, 1, 2, . . .

In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn.
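To make the arithmetic example concrete, here is a minimal sketch in Python (the function name gcd_pairs is mine, not from the notes) iterating the map (xn, yn) → (yn, rem(xn, yn)):

```python
def gcd_pairs(a, b):
    # State: a pair (x, y) with x >= y, initialised as in the text.
    x, y = max(a, b), min(a, b)
    while y != 0:
        # One step of the dynamical system:
        # x_{n+1} = y_n,  y_{n+1} = rem(x_n, y_n)
        x, y = y, x % y
    return x  # the other component is 0; x is gcd(a, b)

print(gcd_pairs(1071, 462))  # prints 21
```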

Other examples of dynamical systems are the algorithms run, by the software in your computer, when you ask it to solve a large computational problem. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic, or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral of a complicated function f over [a, b]: suppose that f is positive and bounded above by B. Choose a pair (X0, Y0) of random numbers, uniformly, so that a ≤ X0 ≤ b and 0 ≤ Y0 ≤ B. Count 1 if Y0 < f(X0) and 0 otherwise. Repeat with (Xn, Yn), for steps n = 1, 2, . . .. Perform this a large number N of times. Count the number of 1's and divide by N. The resulting ratio is an approximation of the integral of f over [a, b], up to the normalising factor B(b − a). Try this on the computer!

The generalisation of a dynamical system takes us into the realm of Randomness: instead of deterministic objects, we will be dealing with random variables. Markov chains generalise this concept of dependence of the future only on the present: given the present, the future does not depend on past states, so past states are not needed. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains, via the use of the tools of Probability. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; it moves from 2 to 1 with probability β = 0.99, regardless of where it was at earlier times.

¹ Stochastic means random.
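Returning to the Monte Carlo method described above, a minimal sketch (the test function f(x) = x² on [0, 1] with bound B = 1 is an arbitrary illustration, not from the text):

```python
import random

def monte_carlo_integral(f, a, b, B, N=100_000):
    """Hit-or-miss Monte Carlo: estimate the integral of f over [a, b],
    assuming 0 <= f(x) <= B there."""
    hits = 0
    for _ in range(N):
        X = random.uniform(a, b)   # a <= X <= b
        Y = random.uniform(0, B)   # 0 <= Y <= B
        hits += Y < f(X)           # count 1 if the point falls under the graph
    # the fraction of hits approximates (integral) / (B (b - a))
    return (hits / N) * B * (b - a)

print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))  # about 1/3
```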

We can summarise this information by the transition diagram:

  [diagram: two states 1 and 2; 1 → 2 with probability α, 1 → 1 with probability 1 − α; 2 → 1 with probability β, 2 → 2 with probability 1 − β]

Another way to summarise the information is by the 2×2 transition probability matrix

  P = ( 1−α    α  ) = ( 0.95  0.05 )
      (  β    1−β )   ( 0.99  0.01 ).

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

  Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, . . . and S0, S1, S2, . . . are random variables with distributions FD and FS, respectively. Assume also that X0, D0, D1, . . . , S0, S1, . . . are independent random variables. The information described above by the evolution of the account, together with the distributions for Dn and Sn, leads to the (one-step) transition probability, which is defined by:

  px,y := P(Xn+1 = y | Xn = x),    x, y = 0, 1, 2, . . .

We easily see that if y > 0 then

  px,y = P(Dn − Sn = y − x)
       = Σ_{z=0}^∞ P(Sn = z, Dn − Sn = y − x)
       = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x)
       = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x)
       = Σ_{z=0}^∞ FS(z) FD(z + y − x).
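As a sketch of how such a transition probability can be evaluated numerically, suppose, purely for illustration (the text does not fix FD or FS), that deposits and spending are Poisson distributed; the case y = 0 is handled as in the text right below:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k) if k >= 0 else 0.0

def p_xy(x, y, lam_D=2.0, lam_S=1.5, zmax=50):
    """p_{x,y} for the bank-account chain, truncating the sum over z
    (the Poisson tails beyond zmax are negligible for these parameters)."""
    if y > 0:
        return sum(poisson_pmf(z, lam_S) * poisson_pmf(z + y - x, lam_D)
                   for z in range(zmax))
    # y = 0: everything that is not a move to some y > 0.
    return 1.0 - sum(p_xy(x, yy, lam_D, lam_S, zmax)
                     for yy in range(1, x + zmax))

print(p_xy(3, 4))  # one-month transition probability from 3 pounds to 4
```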

If y = 0, on the other hand, px,0 = 1 − Σ_{y=1}^∞ px,y, something that, in principle, can be computed from FS and FD. The transition probability matrix is

      ( p0,0  p0,1  p0,2  ··· )
  P = ( p1,0  p1,1  p1,2  ··· )
      ( p2,0  p2,1  p2,2  ··· )
      (  ···   ···   ···  ··· )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards or backwards with equal probability:

  [diagram: the integers . . . , −2, −1, 0, 1, 2, 3, . . . ; from each position the walker steps to either neighbour with probability 1/2]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

  [picture: Salvador Dali (1922), The Drunkard]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and, let's say, he is completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost.

These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

  [diagram: states H (healthy), S (sick), D (dead); H → S with probability 0.3, S → H with probability 0.8, H → D with probability 0.01, S → D with probability 0.1]

Thus, given that he is currently healthy, there is probability pH,S = 0.3 that the person will become sick, etc. Note that the diagram omits pH,H and pS,S, because pH,H = 1 − pH,S − pH,D and pS,S = 1 − pS,H − pS,D, and that pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
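The lifetime question can be answered numerically by iterating the transition matrix; a sketch, assuming the numbers shown in the diagram above:

```python
import numpy as np

# States in the order H, S, D, with the transition probabilities of the diagram.
P = np.array([[0.69, 0.30, 0.01],   # pH,H = 1 - 0.3 - 0.01
              [0.80, 0.10, 0.10],   # pS,S = 1 - 0.8 - 0.1
              [0.00, 0.00, 1.00]])  # D is absorbing

# P(lifetime = n) for a currently healthy individual: the probability of
# being dead by month n minus that of being dead by month n - 1.
dead_by = [np.linalg.matrix_power(P, n)[0, 2] for n in range(1, 13)]
pmf = np.diff([0.0] + dead_by)
print(pmf)  # distribution of the lifetime over the first 12 months
```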

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m ∈ T, m > n) is independent of the past process (Xm, m ∈ T, m < n), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property. Usually, T = Z+ = {0, 1, 2, 3, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . . , in+1 ∈ S,

  P(Xn+1 = in+1 | Xn = in, . . . , X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, . . . , Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain, for any m and any choice of states i0, . . . , in+m,

  P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),

which precisely means that (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) are independent conditionally on Xn. This is true for any n and m, and so future is independent of past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

  P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in)
    = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions

  µ(i) := P(X0 = i), i ∈ S,   and   pi,j(n, n + 1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,

will specify all joint distributions²:

  P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n − 1, n).

The function µ(i) is called initial distribution. The function pi,j(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally,

  pi,j(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by (1),

  pi,j(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} pi,i1(m, m + 1) pi1,i2(m + 1, m + 2) · · · pin−1,j(n − 1, n),   (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

  P(m, n) := [pi,j(m, n)]i,j∈S ,

a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

  P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),

or as

  P(m, n) = P(m, ℓ) P(ℓ, n)   if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n,

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

and therefore we can simply write

  pi,j(n, n + 1) =: pi,j ,

and call pi,j the (one-step) transition probability from state i to state j, while

  P = [pi,j]i,j∈S

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

  P(m, n) = P × P × · · · × P  (n − m times)  = P^(n−m)

for all m ≤ n, as follows from (1) and the definition of P. The matrix notation is useful. For example, let

  µn := [µn(i) = P(Xn = i)]i∈S

be the distribution of Xn, arranged in a row vector. Then we have

  µn = µ0 P^n ,   (3)

and the semigroup property is now even more obvious because

  P^(a+b) = P^a P^b , for all integers a, b ≥ 0.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi. In other words,

  δi(j) := 1, if j = i;  0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = p δi1 + (1 − p) δi2 .

A useful convention used for time-homogeneous chains is to write

  Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write

  Ei Y := E(Y | X0 = i) = Σ_y y P(Y = y | X0 = i) = Σ_y y Pi(Y = y).
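A short sketch of (3) in Python (the two-state mouse chain of Section 2.1 is used purely as an illustration):

```python
import numpy as np

P = np.array([[0.95, 0.05],
              [0.99, 0.01]])      # mouse chain: alpha = 0.05, beta = 0.99
mu0 = np.array([1.0, 0.0])        # start in cell 1, i.e. mu0 = delta_1

# mu_n = mu_0 P^n, and the semigroup property P^(a+b) = P^a P^b
mu5 = mu0 @ np.linalg.matrix_power(P, 5)
P2, P3 = np.linalg.matrix_power(P, 2), np.linalg.matrix_power(P, 3)
assert np.allclose(np.linalg.matrix_power(P, 5), P2 @ P3)
print(mu5)  # distribution of X_5
```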

4 Stationarity

We say that the process (Xn, n = 0, 1, . . .) is stationary if it has the same law as (Xn, n = m, m + 1, . . .), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if initially we place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. Let Xn be the number of molecules in room 1 at time n. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa. Then

  pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N ,
  pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N .

On the average, we indeed expect that we have N/2 molecules in each room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. Another guess, then, is: place each molecule at random either in room 1 or in room 2.

If we do so, then the number of molecules in room 1 will be i with probability

  π(i) = C(N, i) 2^(−N) ,  0 ≤ i ≤ N

(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

  P(X1 = i) = Σ_j P(X0 = j) pj,i
            = P(X0 = i − 1) pi−1,i + P(X0 = i + 1) pi+1,i
            = π(i − 1) pi−1,i + π(i + 1) pi+1,i
            = 2^(−N) [ C(N, i − 1) (N − i + 1)/N + C(N, i + 1) (i + 1)/N ] = · · · = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this. Since, by (3), the distribution µn at time n is related to the distribution at time 0 by µn = µ0 P^n, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

  P(X0 = i0, X1 = i1, . . . , Xn = in) = P(Xm = i0, Xm+1 = i1, . . . , Xm+n = in),

for all m ≥ 0 and i0, . . . , in ∈ S.

In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that

  πP = π.   (4)

Such a π is called stationary or invariant distribution. The equations (4) are called balance equations.
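A sketch checking the Ehrenfest guess numerically, for a small N (N = 10 is an arbitrary illustrative choice):

```python
import numpy as np
from math import comb

N = 10
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i < N:
        P[i, i + 1] = (N - i) / N   # a molecule moves from room 2 to room 1
    if i > 0:
        P[i, i - 1] = i / N         # a molecule moves from room 1 to room 2

pi = np.array([comb(N, i) for i in range(N + 1)]) / 2**N  # pi(i) = C(N,i) 2^-N
assert np.allclose(pi @ P, pi)      # the balance equations pi P = pi hold
print(pi)
```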

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1, because Σ_{j∈S} pi,j = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1.

In addition to (4), π must be a probability:

  Σ_{i∈S} π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, 2, . . . , d}. Indeed, each x ∈ Sd is of the form x = (x1, . . . , xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

  π(x) = 1/d! ,  x ∈ Sd .

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, . . .. The times between successive buses are i.i.d. random variables. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. Then (Xn, n ≥ 0) is a Markov chain with state space S = N = {1, 2, . . .} and transition probabilities

  pi+1,i = 1,  i ≥ 1,
  p1,i = p(i),  i ≥ 1.

To find the stationary distribution, write the equations (4):

  π(i) = Σ_{j≥1} π(j) pj,i = π(1) p(i) + π(i + 1),  i ≥ 1.

These can easily be solved:

  π(i) = π(1) Σ_{j≥i} p(j),  i ≥ 1.

To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives

  π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^(−1) .

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

  ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

  Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

  ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength.) Consider an infinite binary vector ξ = (ξ1, ξ2, . . .), ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

  ξ(n + 1) = ϕ(ξ(n)).   (5)

Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:

  P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2,  for any n = 0, 1, 2, . . . and any r = 1, 2, . . .

The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε1, . . . , εr ∈ {0, 1},

  P(ξ1(n + 1) = ε1, . . . , ξr(n + 1) = εr)
    = P(ξ1(n) = 0, ξ2(n) = ε1, . . . , ξr+1(n) = εr) + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . . , ξr+1(n) = 1 − εr).

You can check that, from (5), this gives the required value. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

  π = πP   (balance equations)
  π1 = 1   (normalisation condition).

In other words,

  π(i) = Σ_{j∈S} π(j) pj,i ,  i ∈ S,
  Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement, under a distribution π:

  F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j .

Think of π(i) pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

  F(A, Ac) = F(Ac, A),  for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

  π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i .

Multiply π(i) by Σ_{j∈S} pi,j (which equals 1):

  π(i) Σ_{j∈S} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i .

Now split the left sum also:

  Σ_{j∈A} π(i) pi,j + Σ_{j∈Ac} π(i) pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i .

Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

  [figure: schematic representation of the flow balance relations, F(A, Ac) = F(Ac, A)]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β, p11 = 1 − α, p22 = 1 − β. (Assume 0 < α, β < 1.) Choosing A = {1}, the flow balance relation reads

  π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

  π(1) = cβ,  π(2) = cα.

The constant c is found from π(1) + π(2) = 1. That is,

  π(1) = β/(α + β),  π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952 and π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
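A sketch verifying the two-state computation numerically, by solving the balance equations together with the normalisation condition:

```python
import numpy as np

alpha, beta = 0.05, 0.99
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Solve pi P = pi with sum(pi) = 1: one balance equation is redundant,
# so replace it with the normalisation condition.
A = np.vstack([(P.T - np.eye(2))[:-1], np.ones(2)])
pi = np.linalg.solve(A, np.array([0.0, 1.0]))
print(pi)  # approximately [0.952, 0.048]
```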

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

  [diagram: states 0, 1, 2, 3, 4, . . . ; from each state i the chain moves one step to the right with probability p; from each i ≥ 1 it moves one step to the left with probability q; from 0 it stays at 0 with probability q]

The balance equations are:

  π(i) = p π(i − 1) + q π(i + 1),  i = 1, 2, 3, . . .

But we can make them simpler by choosing A = {0, 1, . . . , i − 1}. If i ≥ 1 we have

  F(A, Ac) = π(i − 1)p = π(i)q = F(Ac, A).

These are much simpler than the previous ones. In fact, we can solve them immediately. For i ≥ 1,

  π(i) = (p/q)^i π(0).

Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by

  (i, j) ∈ E ⇐⇒ pi,j > 0.

In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time with positive probability. In other words, i leads to j (written as i ⇝ j) if Pi(∃ n ≥ 0 Xn = j) > 0. Notice that this relation is reflexive:

  ∀ i ∈ S,  i ⇝ i.

Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, . . . , in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.

• Suppose that i ⇝ j. Since

  0 < Pi(∃ n ≥ 0 Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),

the sum on the right must have a nonzero term. In other words, (iii) holds.

• Suppose that (ii) holds. We have

  Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 · · · pin−1,j ,   (6)

where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term. Hence Pi(Xn = j) > 0, and so (iii) holds.

• Suppose now that (iii) holds. Look at (6) again. Since Pi(Xn = j) > 0, the sum is positive and so one of its terms is positive, meaning that (ii) holds. In addition, (i) holds, because

  Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j).

Corollary 1. The relation ⇝ is transitive: If i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

  i communicates with j (written as i ⇌ j) ⇐⇒ i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ⇌ j ⇐⇒ j ⇌ i). It is reflexive and transitive because ⇝ is so. Hence:

Corollary 2. The relation ⇌ is an equivalence relation.

(Equivalence relation means that it is symmetric, reflexive and transitive.) Just as any equivalence relation, it partitions S into equivalence classes, known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

  [i] := {j ∈ S : j ⇌ i}.

So, by definition, [i] = [j] if and only if i ⇌ j. Two communicating classes are either identical or completely disjoint.

  [figure: a chain on states 1, 2, 3, 4, 5, 6]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

  Σ_{j∈C} pi,j = 1,  for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: If i ∈ C and i ⇝ j, then j ∈ C.

Proof. Suppose first that C is a closed communicating class, that i ∈ C, and that i ⇝ j. We then can find states i1, . . . , in−1 such that pi,i1 > 0, pi1,i2 > 0, . . . , pin−1,j > 0. Since i ∈ C and C is closed, we have i1 ∈ C. Similarly, since i1 ∈ C, we have i2 ∈ C, and so on; we conclude that j ∈ C. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.

  [figure: the general structure of a Markov chain, with communicating classes C1, C2, C3, C4, C5]

The general structure of a Markov chain is indicated in this figure. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction. In the figure, the communicating classes C1, C2, C3 are not closed, while the classes C4, C5 are closed. If we remove the non-closed communicating classes, then we obtain a collection of disjoint closed communicating classes with no arrows between them. A general Markov chain can have an infinite number of communicating classes.

If a state i is such that pi,i = 1, then we say that i is an absorbing state. To put it otherwise, the single-element set {i} is a closed communicating class.

Note: A state i is essential if it belongs to a closed communicating class. Otherwise, the state is inessential. An inessential state i is such that there exists a state j with i ⇝ j but j ̸⇝ i. (Why?)

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In other words, starting from any state, we can reach any other state; in other words still, the state space S is a single closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. Why?
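Communicating classes can be computed mechanically: they are the strongly connected components of the digraph of the chain. A sketch (the 4-state matrix is an arbitrary illustration):

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.3, 0.2, 0.5, 0.0],
              [0.0, 0.6, 0.2, 0.2],
              [0.0, 0.0, 0.0, 1.0]])
n = len(P)

# R[i, j] = 1 iff i leads to j: transitive closure of the edge relation.
R = ((P > 0) | np.eye(n, dtype=bool)).astype(int)
for _ in range(n):
    R = ((R @ R) > 0).astype(int)
comm = (R > 0) & (R.T > 0)          # i communicates with j

classes = {tuple(np.flatnonzero(row)) for row in comm}
for C in sorted(classes):
    # A class C is closed iff no probability leaks out of it.
    closed = np.isclose(P[np.ix_(C, C)].sum(axis=1), 1.0).all()
    print(list(C), "closed" if closed else "not closed")
```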

5.4 Period

Consider the following chain:

  [figure: an irreducible chain on the four states 1, 2, 3, 4, arranged on a square]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time, then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, . . .. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by pij^(n) the entries of P^n. Since P^(m+n) = P^m P^n, we have

  pij^(m+n) ≥ pik^(m) pkj^(n),  for all m, n ≥ 0 and all i, j, k ∈ S.

So pii^(2n) ≥ (pii^(n))² and, more generally, pii^(ℓn) ≥ (pii^(n))^ℓ, ℓ ∈ N. Therefore, if pii^(n) > 0 then pii^(ℓn) > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if pii^(n) > 0, then pii^(m) > 0. So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

The period of an essential state is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

  d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.

A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.

Theorem 2 (the period is a class property). If i ⇌ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let Di = {n ∈ N : pii^(n) > 0}, Dj = {n ∈ N : pjj^(n) > 0}, so that d(i) = gcd Di, d(j) = gcd Dj. If i ⇌ j, there is α ∈ N with pij^(α) > 0 and, since also j ⇝ i,

there is β ∈ N with pji^(β) > 0. Then pjj^(α+β) ≥ pji^(β) pij^(α) > 0, showing that α + β ∈ Dj. Now let n ∈ Di. Then pii^(n) > 0 and so

  pjj^(α+n+β) ≥ pji^(β) pii^(n) pij^(α) > 0,

showing that α + n + β ∈ Dj. Therefore d(j) | α + β and d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference; from the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state then there exists n0 such that pii^(n) > 0 for all n ≥ n0.

Proof. Pick n2, n1 ∈ Di such that n2 − n1 = 1. Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Therefore

  n = qn1 + r(n2 − n1) = (q − r)n1 + rn2 ,

and, because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di, we have n ∈ Di as well.

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Of particular interest are irreducible chains, i.e. chains where, as defined earlier, all states communicate with one another. If an irreducible chain has period d, then we can decompose the state space into d sets C0, C1, . . . , Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, . . . , Cd−1 such that

  Σ_{j∈Cr+1} pij = 1,  i ∈ Cr,  r = 0, 1, . . . , d − 1.

(Here Cd := C0.)

  [figure: the internal structure of a closed communicating class with period d = 4]
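A sketch computing d(i) directly from the definition, truncated at a finite horizon (the truncation, and the example chain, a walk on the 4-cycle consistent with the figure at the start of this section, are illustrative choices):

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, nmax=200):
    """d(i) = gcd{n : p_ii^(n) > 0}, computed up to a finite horizon nmax
    (for a small finite chain a horizon of a few hundred steps is ample)."""
    Pn = np.eye(len(P))
    D = []
    for n in range(1, nmax + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            D.append(n)
    return reduce(gcd, D)

# Walk on the 4-cycle: the chain alternates between {1, 3} and {2, 4},
# and every state has period 2.
P = 0.5 * np.array([[0, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [1, 0, 1, 0]], dtype=float)
print(period(P, 0))  # prints 2
```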

Proof (of Theorem 4). Define the relation

  i ~d j ⇐⇒ pij^(nd) > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into disjoint equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, . . . , id−1 with pi0,i1 > 0, pi1,i2 > 0, . . ., with corresponding classes C0, C1, . . . , Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0, then, necessarily, id ∈ C0.

We now show that if i belongs to one of these classes and if pij > 0 then, necessarily, j belongs to the next class. Take, for example, i ∈ C0 and suppose pij > 0 but j ∉ C1, say j ∈ C2. Consider the path i0 → i → j → i2 → i3 → · · · → id → i0. Such a path is possible because of the choice of the i0, i1, . . . and by the assumptions that i0 ~d i and j ~d i2. The existence of such a path implies that it is possible to go from i0 to i0 in a number of steps which is an integer multiple of d − 1 (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by

  TA := inf{n ≥ 0 : Xn ∈ A}.

We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

  [figure: states 0, 1, 2; from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6; states 0 and 2 are absorbing]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∈ A the two times coincide. The reader is warned to be alert as to which of the two variables we are considering at each time. We avoid excessive notation by using the same letter for both.

Therefore. X0 = i) = P (TA < ∞|X1 = j). Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j).It is clear that P1 (T0 < ∞) < 1. But the Markov chain is homogeneous. if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S. j∈S as needed. and define ϕ(i) := Pi (TA < ∞). because the chain may end up in state 2. We then have: Theorem 5. By self-feeding this equation. We first have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. a ′ function of X1 . For the second part of the proof. Hence: ′ ′ P (TA < ∞|X1 = j. and so ϕ(i) = 1. we have Pi (TA < ∞) = ϕ(j)pij . conditionally on X1 . i ∈ A. the event TA is independent of X0 . .). let ϕ′ (i) be another solution. we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij   pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A  pjk ϕ′ (k) k∈A 21 . Combining the above. The function ϕ(i) satisfies ϕ(i) = 1. So TA = 1 + TA . If i ∈ A. which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). ′ Proof. But what exactly is this probability equal to? To answer this in its generality. . Furthermore. X2 . .e. fix a set of states A. then TA ≥ 1. If i ∈ A then TA = 0. ′ is the remaining time until A is hit). ϕ(i) = j∈S i∈A pij ϕ(j).

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and let ϕ(i) = Pi(T0 < ∞), i = 0, 1, 2. We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0 it takes no time to hit 0; on the other hand, if the chain starts from 2 it will never hit 0.) As for ϕ(1), we have

  ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time:

  ψ(i) := Ei TA .

We have:

Theorem 6. The function ψ(i) satisfies

  ψ(i) = 0,  i ∈ A,
  ψ(i) = 1 + Σ_{j∉A} pij ψ(j),  i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0. If i ∉ A, then, as above, TA = 1 + TA′, where TA′ is the remaining time until A is hit, and therefore

  Ei TA = 1 + Ei TA′ = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),

which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}, and let ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, because it takes no time to hit A if the chain starts from A. Therefore,

  ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3, so the mean number of flips until the first success is 3/2.)
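A sketch computing the minimal solutions of Theorems 5 and 6 by iteration (self-feeding the equations, as in the proofs); it reproduces ϕ(1) = 3/4 and ψ(1) = 3/2 for the example above:

```python
import numpy as np

def hitting_probability(P, A, iters=5000):
    """phi(i) = P_i(T_A < infinity): iterate the equations of Theorem 5,
    starting from 0 off A; the iterates increase to the minimal solution."""
    phi = np.array([1.0 if i in A else 0.0 for i in range(len(P))])
    for _ in range(iters):
        phi = P @ phi
        phi[list(A)] = 1.0
    return phi

def mean_hitting_time(P, A, iters=5000):
    """psi(i) = E_i T_A: same idea, iterating the equations of Theorem 6."""
    psi = np.zeros(len(P))
    for _ in range(iters):
        psi = 1.0 + P @ psi
        psi[list(A)] = 0.0
    return psi

# The three-state example: 0 and 2 absorbing; from 1 the chain moves
# to 0 w.p. 1/2, stays at 1 w.p. 1/3, moves to 2 w.p. 1/6.
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 1/3, 1/6],
              [0.0, 0.0, 1.0]])
print(hitting_probability(P, {0}))   # [1.   0.75 0.  ]
print(mean_hitting_time(P, {0, 2}))  # [0.  1.5 0. ]
```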

in the last sum. where. we just changed index from n to n + 1.) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA . (This should have been obvious from the start. if x ∈ A. ϕAB (i) = 0. otherwise.which gives ψ(1) = 3/2. ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). Next. 23 . as argued earlier. by the Markov property. B. Clearly. of the random variables X1 . This happens up to the first hitting time TA first hitting time of the set A. The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ). i∈A Furthermore. because the underlying experiment is the flipping of a coin with “success” probability 2/3. We may ask to find the probabilities ϕAB (i) = Pi (TA < TB ). h(x) = f (x). consider the following situation: Every time the state is x a reward f (x) is earned. ′ TA = 1 + TA . Then ′ 1+TA x ∈ A. ϕAB (i) = j∈A i∈B pij ϕAB (j). . after one step. . ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the first-step analysis method. It should be clear that the equations satisfied are: ϕAB (i) = 1. X2 . TB for two disjoint sets of states A. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ). until set A is reached. therefore the mean number of flips until the first success is 3/2. . Now the last sum is a function of the future after time 1. ′ where TA is the remaining time..e. i. ϕAB is the minimal solution to these equations. because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). Hence. h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ).

Thus, the set of equations satisfied by h are

  h(x) = f(x),  x ∈ A,
  h(x) = f(x) + Σ_{y∈S} pxy h(y),  x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, for i = 1, 2, 3, unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die. Why?)

  [figure: four rooms arranged in a square, 1 and 2 on top, 4 and 3 below; doors connect rooms 1–2, 2–3, 3–4 and 4–1, and the thief leaves his current room through one of its two doors with probability 1/2 each]

The question is to compute the average total reward until he dies in room 4. If we let h(i) be the average total reward if he starts from room i, the equations above give

  h(4) = 0,
  h(1) = 1 + (1/2) h(2),
  h(3) = 3 + (1/2) h(2),
  h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake; i.e. if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up, or lost Yk pounds if tails appeared. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), your fortune after the n-th toss is

  Xn = X0 + Σ_{k=1}^n Yk ξk .

The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Of course, no gambler is assumed to have any information about the outcome of the upcoming toss.
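Returning to the thief example, a sketch solving its reward equations as a linear system (it reproduces h = (5, 8, 7, 0)):

```python
import numpy as np

# Rooms 1..4 (indices 0..3); room 4 is absorbing, with reward 0.
P = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
f = np.array([1.0, 2.0, 3.0, 0.0])

# h = f + P h on the non-absorbing rooms, with h(4) = 0.
T = [0, 1, 2]                       # the transient rooms 1, 2, 3
h = np.zeros(4)
h[T] = np.linalg.solve(np.eye(3) - P[np.ix_(T, T)], f[T])
print(h)  # [5. 8. 7. 0.]
```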

So, if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk; it can only depend on ξ1, . . . , ξk−1 and, as is often the case in practice, on other considerations of, e.g., astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: an insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem. The gambler starts with X0 = x pounds and has a target: to make b pounds (b > x). She will stop playing if her money falls below a (a < x). To simplify things, we consider the simplest case, Yk = 1 for all k, i.e. the gambler is betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. We are clearly dealing with a Markov chain in the state space {a, a + 1, a + 2, . . . , b} with transition probabilities

  pi,i+1 = p,  pi,i−1 = q = 1 − p,  a < i < b,

and where the states a, b are absorbing. Consider the hitting time of state y:

  Ty := inf{n ≥ 0 : Xn = y}.

We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x. Our next goal is to prove:

Theorem 7. Let £x be the initial fortune of the gambler, as described above. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

  Px(Tb < Ta) = [(q/p)^(x−a) − 1] / [(q/p)^(b−a) − 1],  if p ≠ q,
  Px(Tb < Ta) = (x − a)/(b − a),  if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let

  ϕ(x, ℓ) := Px(Tℓ < T0),  0 ≤ x ≤ ℓ.

The general solution will then be obtained from Px(Tb < Ta) = ϕ(x − a, b − a). (Think why!) To simplify notation, write ϕ(x) instead of ϕ(x, ℓ). We obviously have

  ϕ(0) = 0,  ϕ(ℓ) = 1,

and, by first-step analysis,

  ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1),  0 < x < ℓ.

Since p + q = 1, we can write this as

  p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

  δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0).

Assuming that λ := q/p ≠ 1, we find

  ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^(x−1) + · · · + λ + 1] δ(0) = (λ^x − 1)/(λ − 1) · δ(0).

Since ϕ(ℓ) = 1, we have (λ^ℓ − 1)/(λ − 1) · δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

  ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),  0 ≤ x ≤ ℓ,  λ ≠ 1.

If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so ϕ(x) = x/ℓ. Summarising, we have obtained

  ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1;  x/ℓ, if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

  Px(T0 < ∞) = (q/p)^x, if p > 1/2;  1, if p ≤ 1/2.

Proof. We have

  Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

  Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

  Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q, then

  Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
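A sketch comparing the formula of Theorem 7 with a direct simulation (the parameters are arbitrary illustrations):

```python
import random

def ruin_formula(x, a, b, p):
    """P_x(T_b < T_a) from Theorem 7."""
    if p == 0.5:
        return (x - a) / (b - a)
    lam = (1 - p) / p
    return (lam**(x - a) - 1) / (lam**(b - a) - 1)

def simulate(x, a, b, p, trials=100_000):
    wins = 0
    for _ in range(trials):
        X = x
        while a < X < b:
            X += 1 if random.random() < p else -1
        wins += (X == b)
    return wins / trials

print(ruin_formula(5, 0, 10, 0.45), simulate(5, 0, 10, 0.45))
```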

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, . . . , Xn) for all n ∈ Z+. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, the future process (Xn+1, Xn+2, . . .) is independent of the past process (X0, . . . , Xn) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . . , Xτ) and has the same law as the original process started from i.

Proof. Let τ be a stopping time and let A be any event determined by the past (X0, . . . , Xτ) if τ < ∞. Since

  (X0, . . . , Xτ) 1(τ < ∞) = Σ_{n∈Z+} (X0, . . . , Xn) 1(τ = n),

any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A,

  P(Xτ+1 = j1, Xτ+2 = j2, . . . , A | Xτ = i, τ < ∞) = pi,j1 pj1,j2 · · · P(A | Xτ = i, τ < ∞).

We have:

  P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
    (a) = P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
    (b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
    (c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
    (d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn) and the ordinary splitting property at n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

  Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the TA of Section 6: the latter was allowed to take the value 0, whereas our Ti here is always positive. If X0 ≠ i the two definitions coincide. We avoid excessive notation by using similar letters for both.

We let Ti^(1) := Ti, and let Ti^(2) be the time of the second visit, Ti^(3) the time of the third visit, and so on. Formally, we recursively define

  Ti^(r) := inf{n > Ti^(r−1) : Xn = i},  r = 2, 3, . . .

All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

  Xi^(r) := (Xn, Ti^(r) ≤ n < Ti^(r+1)),  r = 1, 2, . . .

Let also

  Xi^(0) := (Xn, 0 ≤ n < Ti^(1)).

We think of these as random trajectories or, in a generalised sense, as "random variables". We refer to them as excursions of the process; these excursions are independent!

  [figure: a sample path of the chain split into excursions between the successive visits to state i, at times Ti^(1), Ti^(2), Ti^(3), Ti^(4)]

Schematically, Xi^(r) is the r-th excursion: the chain visits

So X_i^{(r)} is the r-th excursion: the chain visits state i for the r-th time, we "watch" it up until the next time it returns to state i, and the chain "goes for an excursion"; and so on, ad infinitum.

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions
X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, ...
are independent random variables. Moreover, for time-homogeneous Markov chains, they are also identical in law (i.i.d.), except, possibly, for the initial one X_i^{(0)}.

Proof. Apply the Strong Markov Property (SMP) at T_i^{(r)}: SMP says that, given the state at the stopping time T_i^{(r)}, the future after T_i^{(r)} is independent of the past before T_i^{(r)}. But the state at T_i^{(r)} is, by definition, equal to i. Therefore, the future is independent of the past, and this proves the first claim. The claim that X_i^{(1)}, X_i^{(2)}, ... have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(X_i^{(0)}), g(X_i^{(1)}), g(X_i^{(2)}), ... of the excursions are independent random variables.

Example: Define
Λr := ∑_{T_i^{(r)} ≤ n < T_i^{(r+1)}} 1(Xn = j).
The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ1, Λ2, ... are independent random variables.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions. First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:
P_i(Xn = i for infinitely many n) = 1.
Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:
P_i(Xn = i for infinitely many n) = 0.
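The excursion picture above, and Theorem 9 in particular, can be explored by simulation: the following sketch (same kind of hypothetical two-state chain, numpy assumed) chops a long path into excursions from a ground state and checks that the excursion lengths behave like i.i.d. variables, e.g. by estimating their lag-one correlation.

import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
rng = np.random.default_rng(1)

# Simulate a long path started from the ground state i = 0.
i, n = 0, 200_000
path = np.empty(n, dtype=int); path[0] = i
for t in range(1, n):
    path[t] = rng.choice(2, p=P[path[t-1]])

# Successive visit times T_i^(1) < T_i^(2) < ... to state i.
returns = np.flatnonzero(path == i)

# Excursion r is the segment path[T^(r) : T^(r+1)); by Theorem 9 the
# excursion lengths T^(r+1) - T^(r) are i.i.d.
lengths = np.diff(returns)
print("mean length:", lengths.mean())   # estimates E_i T_i
print("lag-1 corr :", np.corrcoef(lengths[:-1], lengths[1:])[0, 1])  # ~ 0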

Important note: Notice that the two statements above are not negations of one another. Indeed, there could, in principle, be yet another possibility:
0 < P_i(Xn = i for infinitely many n) < 1.
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state i (excluding the possibility that X0 = i):
Ji = ∑_{n=1}^∞ 1(Xn = i) = sup{r ≥ 0 : T_i^{(r)} < ∞}.
Notice that: state i is visited infinitely many times ⟺ Ji = ∞. Notice also that
fii := P_i(Ji ≥ 1) = P_i(Ti < ∞).
We claim that:

Lemma 4. Suppose that the Markov chain starts from state i. Then the random variable Ji is geometric:
P_i(Ji ≥ k) = fii^k,   k ≥ 0.

Proof. Indeed, the event {Ji ≥ k} implies that T_i^{(k−1)} < ∞ and that Ji ≥ k − 1. By the strong Markov property at time T_i^{(k−1)}, we have
P_i(Ji ≥ k) = P_i(Ji ≥ k − 1) P_i(Ji ≥ 1).
(The past before T_i^{(k−1)} is independent of the future, and that is why we get the product.) Hence
P_i(Ji ≥ k) = fii P_i(Ji ≥ k − 1) = fii^2 P_i(Ji ≥ k − 2) = ··· = fii^k.

Notice that fii = P_i(Ti < ∞) can either be equal to 1 or smaller than 1. If fii = 1, we have that P_i(Ji ≥ k) = 1 for all k, i.e. P_i(Ji = ∞) = 1; this means that the state i is recurrent. If fii < 1, then P_i(Ji < ∞) = 1; this means that the state i is transient. Because either fii = 1 or fii < 1, we conclude what we asserted earlier, namely that every state is either recurrent or transient.

We summarise this, together with another useful characterisation of recurrence & transience, in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. fii = P_i(Ti < ∞) = 1.
3. P_i(Ji = ∞) = 1.
4. E_i Ji = ∞.

Proof. The equivalence of the first two statements was proved above. If fii = 1 then P_i(Ji = ∞) = 1 and, clearly, E_i Ji = ∞.

Lemma 6. The following are equivalent:
1. State i is transient.
2. fii = P_i(Ti < ∞) < 1.
3. P_i(Ji < ∞) = 1.
4. E_i Ji < ∞.

Proof. The equivalence of the first two statements was proved before the lemma. State i is transient if, by definition, it is visited finitely many times with probability 1, which, by the definition of Ji, is equivalent to P_i(Ji < ∞) = 1. Since Ji is a geometric random variable, this holds precisely when fii < 1, in which case Ji also has finite expectation: E_i Ji < ∞. (Recall that a finite random variable need not have finite expectation; a geometric one does.) Conversely, if we assume E_i Ji < ∞, then Ji cannot take the value +∞ with positive probability, and so P_i(Ji < ∞) = 1.

But
E_i Ji = E_i ∑_{n=1}^∞ 1(Xn = i) = ∑_{n=1}^∞ P_i(Xn = i) = ∑_{n=1}^∞ p_{ii}^{(n)}.
Hence we have:

Corollary 6. If i is a transient state then p_{ii}^{(n)} → 0, as n → ∞.

Proof. If i is transient then E_i Ji < ∞, i.e. the series ∑_n p_{ii}^{(n)} is finite. But if a series is finite then the summands go to 0: p_{ii}^{(n)} → 0, as n → ∞.

10.1 First hitting time decomposition

Define
f_{ij}^{(n)} := P_i(Tj = n),   n ≥ 1,
i.e. the probability, assuming that the chain starts from state i, to hit state j for the first time at time n ≥ 1. Notice the difference with p_{ij}^{(n)}: the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, f_{ij}^{(n)} ≤ p_{ij}^{(n)}, but the two sequences are related in an exact way:

Lemma 7. For all states i, j and all n ≥ 1,
p_{ij}^{(n)} = ∑_{m=1}^n f_{ij}^{(m)} p_{jj}^{(n−m)}.

Proof. We have
p_{ij}^{(n)} = P_i(Xn = j) = ∑_{m=1}^n P_i(Tj = m, Xn = j) = ∑_{m=1}^n P_i(Tj = m) P_i(Xn = j | Tj = m).
The first factor equals f_{ij}^{(m)}, by definition. For the second factor, we use the Strong Markov Property:
P_i(Xn = j | Tj = m) = P_i(Xn = j | Xm = j) = p_{jj}^{(n−m)}.
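Lemma 7 is easy to verify numerically: compute p_{ij}^{(n)} from matrix powers and f_{ij}^{(m)} by a taboo recursion that forbids earlier visits to j. A minimal sketch, for a hypothetical 3-state chain:

import numpy as np

# Hypothetical 3-state chain, for illustration only.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
i, j, N = 0, 2, 20

# f[m] = f_ij^(m) = P_i(T_j = m): first-passage probabilities, computed
# by propagating the chain while forbidding passage through j.
Q = P.copy(); Q[:, j] = 0.0        # taboo on state j
f = np.zeros(N + 1)
f[1] = P[i, j]
vec = Q[i].copy()                  # P_i(X_{m-1} = k, T_j > m-1)
for m in range(2, N + 1):
    f[m] = vec @ P[:, j]
    vec = vec @ Q

# Check p_ij^(n) = sum_{m=1}^{n} f_ij^(m) p_jj^(n-m) for each n.
Pn = [np.eye(3)]
for _ in range(N):
    Pn.append(Pn[-1] @ P)
for n in range(1, N + 1):
    lhs = Pn[n][i, j]
    rhs = sum(f[m] * Pn[n - m][j, j] for m in range(1, n + 1))
    assert abs(lhs - rhs) < 1e-12
print("Lemma 7 verified for n = 1, ...,", N)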

Corollary 7. If j is a transient state then p_{ij}^{(n)} → 0, as n → ∞, for all states i.

Proof. If j is transient, we have ∑_{n=1}^∞ p_{jj}^{(n)} = E_j Jj < ∞. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums on the right-hand side)
∑_{n=1}^∞ p_{ij}^{(n)} = ∑_{m=1}^∞ f_{ij}^{(m)} ∑_{n=m}^∞ p_{jj}^{(n−m)} = (1 + E_j Jj) ∑_{m=1}^∞ f_{ij}^{(m)} = (1 + E_j Jj) P_i(Tj < ∞),
which is finite. But if a series is finite then the summands go to 0, and hence p_{ij}^{(n)} → 0, as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In particular, if we let fij := P_i(Tj < ∞), we have
i ⇝ j ⟺ fij > 0.
We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j, then j is recurrent and
fij = P_i(Tj < ∞) = 1.
In other words, not only is fij > 0, but, in fact, fij = 1. In particular, j ⇝ i.

Proof. Start with X0 = i. If i is recurrent and i ⇝ j, then there is a positive probability (= fij) that j will appear in one of the i.i.d. excursions X_i^{(0)}, X_i^{(1)}, ..., and so the probability that j will appear in a specific excursion is positive:
gij := P_i(Tj < Ti) > 0.
So the random variables
δ_{j,r} := 1(j appears in excursion X_i^{(r)}),   r = 0, 1, 2, ...,
are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. In particular, fij = 1, every occurrence of j is followed by a return to i (the excursion containing it ends at i), so j ⇝ i, and j itself is visited infinitely many times, i.e. j is recurrent. The last statement of the theorem is simply an expression in symbols of what we said above.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Proof. If i is recurrent, then every other j in the same communicating class satisfies i ⇝ j and is therefore, by the above theorem, also recurrent. Equivalently, if a communicating class contains a state which is transient, then, necessarily, all other states are also transient.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ⇝̸ i. Then there is m such that P_i(A) := P_i(Xm = j) > 0. Let B := {Xn = i for infinitely many n}. Since j ⇝̸ i, we have P_i(A ∩ B) = P_i(Xm = j, Xn = i for infinitely many n) = 0. Hence
0 < P_i(A) = P_i(A ∩ B) + P_i(A ∩ B^c) = P_i(A ∩ B^c) ≤ P_i(B^c),
i.e.
P_i(B) = P_i(Xn = i for infinitely many n) < 1,
and so i is a transient state.

So if a communicating class is recurrent then it contains no inessential states, and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i must be visited infinitely many times. This state is recurrent. Hence, by the above results, every other j in C is also recurrent. Hence all states are recurrent.

11 Positive recurrence

We now take a closer look at recurrence. Recall that if we fix a state i and let Ti be the first return time to i, then either P_i(Ti < ∞) = 1 (in which case we say that i is recurrent) or P_i(Ti < ∞) < 1 (transient state). Recall also that a finite random variable may not necessarily have finite expectation.

Example: Let U0, U1, U2, ... be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. (It should be clear that such variables can, for example, appear in a certain game.) Let
T := inf{n ≥ 1 : Un > U0}.
Thus T represents the number of variables we need to draw to get a value larger than the initial one. What is the expectation of T? First, observe that T is finite. Indeed,
P(T > n) = P(U1 ≤ U0, ..., Un ≤ U0) = ∫_0^1 P(U1 ≤ x, ..., Un ≤ x | U0 = x) dx = ∫_0^1 x^n dx = 1/(n+1),
and therefore
P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n+1)) = 1.
But
ET = ∑_{n=0}^∞ P(T > n) = ∑_{n=0}^∞ 1/(n+1) = ∞.
You are invited to think about the meaning of this: if the initial value U0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!
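A quick simulation illustrates the example: T is finite in every sample, yet its sample mean keeps growing with the sample size, which is the hallmark of an infinite expectation. A sketch, using the fact that, given U0 = u, the variable T is geometric with parameter 1 − u:

import numpy as np

rng = np.random.default_rng(2)

# Given U_0 = u, T = inf{n >= 1 : U_n > u} is Geometric(1 - u),
# so we can sample T exactly without an inner loop.
u0 = rng.random(100_000)
T = rng.geometric(1 - u0)

# Check P(T > n) ~ 1/(n+1):
for n in (1, 2, 5, 10, 100):
    print(n, (T > n).mean(), 1 / (n + 1))

# T is finite in every sample, but the sample mean is large and keeps
# growing with the sample size, consistent with ET = infinity.
print("sample mean:", T.mean())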

We now classify a recurrent state i as follows. We say that i is positive recurrent if
E_i Ti < ∞,
and we say that i is null recurrent if
E_i Ti = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as a "Law". But this is misleading: it is not a Law; it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z1, Z2, ... be i.i.d. random variables.
(i) Suppose E|Z1| < ∞ and let µ = EZ1. Then
P( lim_{n→∞} (Z1 + ··· + Zn)/n = µ ) = 1.
(ii) Suppose E max(Z1, 0) = ∞, but E min(Z1, 0) > −∞. Then
P( lim_{n→∞} (Z1 + ··· + Zn)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X0, X1, X2, ... of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions
X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, ...
forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that
G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), ...
are i.i.d. random variables, for reasonable choices of the functions G taking values in R. Note that
E G(X_a^{(1)}) = E G(X_a^{(2)}) = ··· = E[ G(X_a^{(0)}) | X0 = a ] = E_a G(X_a^{(0)}),
because the excursions X_a^{(1)}, X_a^{(2)}, ... are i.i.d., and because, if we start with X0 = a, the initial excursion X_a^{(0)} also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a. Then, with probability 1, the law of large numbers tells us that
( G(X_a^{(1)}) + ··· + G(X_a^{(n)}) ) / n → E G(X_a^{(1)}),    (8)
as n → ∞.

13.2 Average reward

Consider now a function f : S → R, which we can think of as a reward: to each state x ∈ S we assign a reward f(x), received when the chain is in state x. We are interested in the existence of the limit
time-average reward = lim_{t→∞} (1/t) ∑_{k=0}^t f(Xk),
which represents the 'time-average reward' received. In particular, we shall assume that f is bounded, and shall examine those conditions which yield a nontrivial limit.

Example 1: For an excursion X, define G(X) to be the total reward received over the excursion:
G(X_a^{(r)}) = ∑_{T_a^{(r)} ≤ t < T_a^{(r+1)}} f(Xt).    (7)

Example 2: Let f(x) be as in Example 1, but define G(X) to be the maximum reward received during the excursion:
G(X_a^{(r)}) = max{ f(Xt) : T_a^{(r)} ≤ t < T_a^{(r+1)} }.

Whichever the choice of G, the law of large numbers (8) applies.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a Ta < ∞), and assume that the distribution of X0 is such that
P(Ta < ∞) = 1.
Let f : S → R+ be a bounded 'reward' function, say |f(x)| ≤ B. Then the 'time-average reward' f̄ exists and is a deterministic quantity:
P( lim_{t→∞} (1/t) ∑_{n=0}^t f(Xn) = f̄ ) = 1,
where
f̄ = E_a[ ∑_{n=0}^{Ta−1} f(Xn) ] / E_a Ta = ∑_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. We are looking for the total reward received between 0 and t. The idea is this: suppose that t is large enough. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, Nt, of complete excursions from the ground state a that occur between 0 and t:
Nt := max{r ≥ 1 : T_a^{(r)} ≤ t}.
(For example, in the figure below, Nt = 5.)

[Figure: timeline showing 0, T_a^{(1)}, T_a^{(2)}, T_a^{(3)}, T_a^{(4)}, T_a^{(5)} and t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
∑_{n=0}^t f(Xn) = ∑_{r=1}^{Nt−1} Gr + G^first + G^last,
where Gr is given by (7), G^first is the reward up to time Ta = T_a^{(1)} or t (whichever is smaller), and G^last is the reward over the last incomplete cycle.

Since P(Ta < ∞) = 1, and since we assumed that f is bounded, say |f(x)| ≤ B, we have |G^first| ≤ B Ta, whence B Ta/t → 0, and so
(1/t) G^first → 0,   as t → ∞, with probability one.
A similar argument shows that
(1/t) G^last → 0,   as t → ∞, with probability one.

So we are left with the sum of the rewards over complete cycles:
(1/t) ∑_{r=1}^{Nt−1} Gr.    (9)
Look again at what the Law of Large Numbers (8) tells us:
lim_{n→∞} (1/n) ∑_{r=1}^n Gr = EG1,    (10)
with probability one. Since the return times T_a^{(r)} increase to ∞, the number Nt of complete cycles before time t also tends to ∞, as t → ∞; hence, also,
lim_{t→∞} (1/Nt) ∑_{r=1}^{Nt−1} Gr = EG1.
Write now
(1/t) ∑_{r=1}^{Nt−1} Gr = (Nt/t) · (1/Nt) ∑_{r=1}^{Nt−1} Gr.
Now look at the sequence
T_a^{(r)} = T_a^{(1)} + ∑_{k=2}^r ( T_a^{(k)} − T_a^{(k−1)} ),
and see what the Law of Large Numbers tells us again: the differences T_a^{(k)} − T_a^{(k−1)}, k = 2, 3, ..., are i.i.d. random variables with common expectation E_a Ta, and, since T_a^{(1)}/r → 0, we have
lim_{r→∞} T_a^{(r)}/r = E_a Ta.    (11)
By definition, the visit with index r = Nt to the ground state a is the last before t (see also the figure above):
T_a^{(Nt)} ≤ t < T_a^{(Nt+1)}.
Hence
T_a^{(Nt)}/Nt ≤ t/Nt < T_a^{(Nt+1)}/Nt.
But (11) tells us that
lim_{t→∞} T_a^{(Nt)}/Nt = lim_{t→∞} T_a^{(Nt+1)}/Nt = E_a Ta.
Hence
lim_{t→∞} t/Nt = E_a Ta,   i.e.   lim_{t→∞} Nt/t = 1/E_a Ta.    (12)
Using (10) and (12) in (9), we have
lim_{t→∞} (1/t) ∑_{r=1}^{Nt−1} Gr = EG1 / E_a Ta.
Thus
lim_{t→∞} (1/t) ∑_{n=0}^t f(Xn) = EG1 / E_a Ta =: f̄.

To see that f̄ is as claimed, just observe that
EG1 = E ∑_{T_a^{(1)} ≤ t < T_a^{(2)}} f(Xt) = E_a ∑_{0 ≤ t < Ta} f(Xt),
from the Strong Markov Property. Note that we include only one endpoint of the excursion, not both: we may write
EG1 = E_a ∑_{0 ≤ t < Ta} f(Xt)   or   EG1 = E_a ∑_{0 < t ≤ Ta} f(Xt).

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion: the numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included), while the denominator is the duration of the excursion.

[Figure: the rewards f(X0) = f(a), f(X1), f(X2), ..., f(X_{Ta−1}) collected over one excursion from the ground state a.]

Comment 2: A very important particular case occurs when we choose
f(x) := 1(x = b),
i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, with probability 1,
lim_{t→∞} (1/t) ∑_{n=1}^t 1(Xn = b) = E_a[ ∑_{k=0}^{Ta−1} 1(Xk = b) ] / E_a Ta,    (13)
for all b ∈ S. In words: the long-run fraction of time spent at b equals the average number of times state b is visited between two successive visits to state a, divided by E_a Ta.

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence, with probability 1,
lim_{t→∞} (1/t) ∑_{n=1}^t 1(Xn = a) = 1/E_a Ta.    (14)
The physical interpretation of the latter should be clear: if, on the average, it takes E_a Ta units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/E_a Ta.

Comment 4: Suppose that two states, a, b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b Tb. The equality of these two limits gives:
E_a[ ∑_{k=0}^{Ta−1} 1(Xk = b) ] / E_a Ta = 1/E_b Tb.
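Formula (13) lends itself to a direct check: the long-run fraction of time spent at b should match the expected number of visits to b per a-cycle divided by the mean cycle length. A sketch, with a hypothetical 3-state chain:

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
rng = np.random.default_rng(3)
a, b, n = 0, 2, 500_000

path = np.empty(n, dtype=int); path[0] = a
for t in range(1, n):
    path[t] = rng.choice(3, p=P[path[t-1]])

lhs = (path == b).mean()           # long-run fraction of time at b

# Right-hand side of (13): E_a[# visits to b before T_a] / E_a T_a,
# estimated by chopping the path into a-cycles.
visits_a = np.flatnonzero(path == a)
cycle_lengths = np.diff(visits_a)
visits_b_per_cycle = [np.count_nonzero(path[s:e] == b)
                      for s, e in zip(visits_a[:-1], visits_a[1:])]
rhs = np.mean(visits_b_per_cycle) / np.mean(cycle_lengths)
print(lhs, rhs)                    # both estimate pi(b)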

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
ν^{[a]}(x) := E_a ∑_{n=0}^{Ta−1} 1(Xn = x),   x ∈ S,    (15)
should play an important role; here a is the fixed recurrent ground state whose existence was assumed. Indeed, we saw (Comment 2 above) that, if a is a positive recurrent state, then
ν^{[a]}(x) / E_a Ta
is the long-run fraction of times that state x is visited. Clearly, this should have something to do with the concept of stationary distribution discussed a few sections ago. Note: ν^{[a]} depends on the choice of the 'ground' state a (notice that the choice of notation ν^{[a]} is justifiable through the notation (6) for the communicating class of the state a), but we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped.

Fix a ground state a ∈ S and assume that it is recurrent⁵. Recurrence tells us that Ta < ∞ with probability one. We start the Markov chain with X0 = a.

Proposition 2. Consider the function ν^{[a]} defined in (15). Then ν^{[a]} satisfies the balance equations:
ν^{[a]} = ν^{[a]} P,
i.e.
ν^{[a]}(x) = ∑_{y∈S} ν^{[a]}(y) p_{yx},   x ∈ S.
Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^{[a]}(x) < ∞.

Proof. Since a = X_{Ta}, we can write
ν^{[a]}(x) = E_a ∑_{n=1}^{Ta} 1(Xn = x).    (16)
This was a trivial but important step: we replaced the n = 0 term with the n = Ta term, so there is nothing lost in doing so. The real reason for doing so is that we wanted to replace a function of X0, X1, X2, ... by a function of X1, X2, X3, ...,

⁵ We stress that we do not need the stronger assumption of positive recurrence here.

and thus be in a position to apply first-step analysis, which is precisely what we do next:

ν^{[a]}(x) = E_a ∑_{n=1}^∞ 1(Xn = x, n ≤ Ta)
   = ∑_{n=1}^∞ P_a(Xn = x, n ≤ Ta)
   = ∑_{n=1}^∞ ∑_{y∈S} P_a(Xn = x, X_{n−1} = y, n ≤ Ta)
   = ∑_{n=1}^∞ ∑_{y∈S} p_{yx} P_a(X_{n−1} = y, n ≤ Ta)
   = ∑_{y∈S} p_{yx} E_a ∑_{n=1}^{Ta} 1(X_{n−1} = y)
   = ∑_{y∈S} p_{yx} ν^{[a]}(y),

where, in this last step, we used (16). So we have shown
ν^{[a]}(x) = ∑_{y∈S} p_{yx} ν^{[a]}(y),
which are precisely the balance equations: ν^{[a]} = ν^{[a]} P and hence, by iterating,
ν^{[a]} = ν^{[a]} P^n,   for all n ∈ N.

Next, let x be such that a ⇝ x, so there exists n ≥ 1 such that p_{ax}^{(n)} > 0. Since, by definition, ν^{[a]}(a) = 1, we obtain
ν^{[a]}(x) ≥ ν^{[a]}(a) p_{ax}^{(n)} = p_{ax}^{(n)} > 0.
From Theorem 10, x ⇝ a, so there exists m ≥ 1 such that p_{xa}^{(m)} > 0. Hence
1 = ν^{[a]}(a) ≥ ν^{[a]}(x) p_{xa}^{(m)},
so that, indeed, ν^{[a]}(x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Observe that, summing (16) over x,
∑_{x∈S} ν^{[a]}(x) = E_a ∑_{n=1}^{Ta} ∑_{x∈S} 1(Xn = x) = E_a ∑_{n=1}^{Ta} 1 = E_a Ta,
which, by definition, is finite when a is positive recurrent.
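Proposition 2 can be confirmed numerically: estimate ν^{[a]}(x) by counting visits per a-cycle along a simulated path, and compare ν^{[a]}/E_a Ta with the stationary vector obtained by linear algebra. A sketch, with a hypothetical 3-state chain:

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
rng = np.random.default_rng(4)
a, n = 0, 500_000

path = np.empty(n, dtype=int); path[0] = a
for t in range(1, n):
    path[t] = rng.choice(3, p=P[path[t-1]])

# nu[x] = E_a sum_{n=0}^{T_a - 1} 1(X_n = x): average visits per a-cycle.
visits_a = np.flatnonzero(path == a)
num_cycles = len(visits_a) - 1
nu = np.array([np.count_nonzero(path[visits_a[0]:visits_a[-1]] == x)
               for x in range(3)]) / num_cycles

# pi = nu / E_a T_a should solve pi = pi P with sum(pi) = 1.
pi_hat = nu / nu.sum()             # nu.sum() estimates E_a T_a
# Exact stationary distribution: left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
print(pi_hat, pi)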

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define
π^{[a]}(x) := ν^{[a]}(x)/E_a Ta = E_a[ ∑_{n=0}^{Ta−1} 1(Xn = x) ] / E_a Ta,   x ∈ S.    (17)
This π^{[a]} is, indeed, a stationary distribution.

Proof. We have already shown, in Proposition 2, that ν^{[a]} P = ν^{[a]}. But π^{[a]} is simply a multiple of ν^{[a]}; therefore π^{[a]} P = π^{[a]}. Since a is positive recurrent, we have E_a Ta < ∞, and so π^{[a]} is not trivially equal to zero. It only remains to show that π^{[a]} is a probability, i.e. that it sums up to one. But
∑_{x∈S} π^{[a]}(x) = (1/E_a Ta) ∑_{x∈S} ν^{[a]}(x) = 1.

In other words: if there is a positive recurrent state, then we immediately have a stationary (probability) distribution!

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then the set of linear equations
π(x) = ∑_{y∈S} π(y) p_{yx},   ∑_{y∈S} π(y) = 1,
have at least one solution. We DID NOT claim that this distribution is, in general, unique; this is what we will try to establish in the sequel.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state, and suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X0 = i. We already know that j is recurrent. Consider the excursions X_i^{(r)}, r = 0, 1, 2, .... Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion, and the probability gij = P_i(Tj < Ti) that j occurs in a specific excursion is positive. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads gij > 0; the coins are i.i.d. If heads show up at the r-th excursion, then j occurs in the r-th excursion; otherwise, if tails show up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R1 (respectively, R2) be the index of the first (respectively, second) excursion that contains j.

[Figure: excursions from i; R1 and R2 mark the first and second excursions that contain j; the stretch ζ between the two occurrences of j has the same law as Tj under P_j.]

The time between the first and the second occurrence of j has the same law as Tj under P_j, and it is bounded above by the sum of the durations of the excursions with indices R1, R1 + 1, ..., R2. So we just need to show that
ζ := ∑_{r=1}^{R2−R1+1} Zr
has finite expectation, where the Zr are i.i.d. with the same law as Ti (the common law of the excursion durations). But, by Wald's lemma, the expectation of this random variable equals
E_i ζ = (E_i Ti) × E_i(R2 − R1 + 1) = (E_i Ti) × (g_{ij}^{−1} + 1),
because R2 − R1 is a geometric random variable:
P_i(R2 − R1 = r) = (1 − gij)^{r−1} gij,   r = 1, 2, ....
Since gij > 0 and E_i Ti < ∞, we have E_i ζ < ∞; in particular, R2 has finite expectation. Hence E_j Tj < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent; transient if one (and hence all) states are transient; positive recurrent if one (and hence all) states are positive recurrent; and null recurrent if one (and hence all) states are null recurrent.

[Diagram: an irreducible chain is either transient or recurrent and, in the latter case, either positive recurrent or null recurrent.]

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Figure: states 0, 1, 2, with p_{00} = 1/3, p_{01} = 1/2, p_{02} = 1/6, p_{11} = 1, p_{22} = 1.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many (infinitely many) stationary distributions: for each 0 ≤ α ≤ 1, the π defined by
π(0) = 0,   π(1) = α,   π(2) = 1 − α,
is a stationary distribution. (You should think why this is OBVIOUS!) This lack of communication is, in fact, what prohibits uniqueness.
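A two-line computation confirms the one-parameter family of stationary distributions in the example (transition matrix as read off the figure above):

import numpy as np

# The reducible example: states 1 and 2 are absorbing; state 0 moves
# to itself, to 1, or to 2 with probabilities 1/3, 1/2, 1/6.
P = np.array([[1/3, 1/2, 1/6],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

for alpha in (0.0, 0.3, 1.0):
    pi = np.array([0.0, alpha, 1 - alpha])
    # Every such pi satisfies the balance equations pi = pi P:
    print(alpha, np.allclose(pi @ P, pi))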

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution. Fix a ground state a. Since the chain is irreducible and positive recurrent, we have P_i(Tj < ∞) = 1 for all states i and j. Then
P(Ta < ∞) = ∑_{x∈S} π(x) P_x(Ta < ∞) = ∑_{x∈S} π(x) = 1.    (18)
Consider the Markov chain (Xn, n ≥ 0) with P(X0 = x) = π(x), x ∈ S; in other words, we start the Markov chain from the stationary distribution π. Then, for all n,
P(Xn = x) = π(x),   x ∈ S,
and hence
(1/t) ∑_{n=1}^t P(Xn = x) = π(x).
By (13), which applies thanks to (18), with probability one, as t → ∞,
(1/t) ∑_{n=1}^t 1(Xn = x) → E_a[ ∑_{k=0}^{Ta−1} 1(Xk = x) ] / E_a Ta = π^{[a]}(x).
Recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides; by a theorem of Analysis (Bounded Convergence Theorem),
E[ (1/t) ∑_{n=1}^t 1(Xn = x) ] = (1/t) ∑_{n=1}^t P(Xn = x) → π^{[a]}(x),   as t → ∞.
Therefore π(x) = π^{[a]}(x), for all x ∈ S; i.e. π = π^{[a]}. Hence there is only one stationary distribution: an arbitrary stationary distribution π must be equal to the specific one π^{[a]}, for any choice of ground state a. In particular, we obtain:

Corollary 11. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,
π(a) = 1/E_a Ta.

Proof. There is only one stationary distribution, so π = π^{[a]}, from the definition of which
π(a) = π^{[a]}(a) = ν^{[a]}(a)/E_a Ta = 1/E_a Ta.

Corollary 12. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then we obtain the cycle formula:
E_a[ ∑_{k=0}^{Ta−1} 1(Xk = x) ] / E_a Ta = E_b[ ∑_{k=0}^{Tb−1} 1(Xk = x) ] / E_b Tb,   for all x ∈ S.

Proof. Both π^{[a]} and π^{[b]} are stationary distributions; by Theorem 16 they are both equal to the unique one, so they are equal. The cycle formula merely re-expresses this equality.

The following is a sort of converse to Corollary 9, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain. Assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C, i.e. delete all states outside C and all transition probabilities p_{xy} leading out of C. Let π(·|C) be the distribution defined by restricting π to C:
π(x|C) := π(x)/π(C),   x ∈ C,   where π(C) = ∑_{y∈C} π(y).
We can easily see that we thus obtain a Markov chain with state space C, with stationary distribution π(·|C). This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By Corollary 11, applied to the restricted chain,
π(i|C) = 1/E_i Ti.
But π(i|C) > 0. Therefore E_i Ti < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain and decompose its state space into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π^C

which assigns positive probability to each state in C. This is defined by picking a state (any state!) a ∈ C and letting π^C := π^{[a]}. An arbitrary stationary distribution π must be a linear combination of these π^C:

Proposition 3. Any stationary distribution π is necessarily of the form
π = ∑_{C: C positive recurrent communicating class} α_C π^C,
where α_C ≥ 0 and ∑_C α_C = 1.

Proof. If π(x) > 0 then x belongs to some positive recurrent class C. For each such C define the conditional distribution
π(x|C) := π(x)/π(C),   where π(C) := ∑_{x∈C} π(x).
Then (π(x|C), x ∈ C) is a stationary distribution. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with α_C = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(Xn = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesaro averages
(1/n) ∑_{k=1}^n P(Xk = x)
converge to π(x), as n → ∞. (A sequence of real numbers an may fail to converge while its Cesaro averages (1/n)(a1 + ··· + an) converge: for example, take an = (−1)^n.)

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. Still, we may say that it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:
ẋ = −ax + b,   x ∈ R,   a > 0.
There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, an example in physics, or whatever. The thing to observe here is that there is an equilibrium point, namely the one found by setting the right-hand side equal to zero:
x* = b/a.

[Figure: solutions x(t) approaching the equilibrium x* as t grows.]

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If we start from any other initial position then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous-time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
X_{n+1} = ρXn + ξn.
Specifically, we here assume that X0, ξ0, ξ1, ξ2, ... are given, mutually independent random variables, and we define a stochastic sequence X1, X2, ... by means of the recursion above. We shall assume that the ξn all have the same law. It will follow that |ρ| < 1 is the condition for stability. And we are in a happy situation here, for we can solve the recursion:
X1 = ρX0 + ξ0
X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0) + ξ1 = ρ²X0 + ρξ0 + ξ1
X3 = ρX2 + ξ2 = ρ³X0 + ρ²ξ0 + ρξ1 + ξ2
······
Xn = ρⁿX0 + ρ^{n−1}ξ0 + ρ^{n−2}ξ1 + ··· + ρξ_{n−2} + ξ_{n−1}.

Since |ρ| < 1, the first term goes to 0, as n → ∞. The second term,
X̃n := ρ^{n−1}ξ0 + ρ^{n−2}ξ1 + ··· + ρξ_{n−2} + ξ_{n−1},
does not converge. But notice that, if we let
X̂n := ρ^{n−1}ξ_{n−1} + ρ^{n−2}ξ_{n−2} + ··· + ρξ1 + ξ0
(the same linear combination, with the ξ's taken in reverse order), then, since the ξ's are i.i.d.,
P(X̃n ≤ x) = P(X̂n ≤ x),   for all x.
The nice thing is that, under nice conditions (e.g. assume that the ξn are bounded), X̂n does converge to
X̂∞ = ∑_{k=0}^∞ ρ^k ξk.
So, while the limit of X̃n does not exist, the limit of its distribution does:
lim_{n→∞} P(X̃n ≤ x) = lim_{n→∞} P(X̂n ≤ x) = P(X̂∞ ≤ x).
This is precisely the reason why we cannot, for a stochastic system, define stability in a way other than convergence of the distributions of the random variables.
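A simulation of the linear recursion makes the discussion concrete: for |ρ| < 1, the distribution of Xn settles down even though Xn itself keeps fluctuating. A sketch, assuming standard normal noise (an assumption made purely for illustration):

import numpy as np

rng = np.random.default_rng(5)
rho, n, reps = 0.8, 200, 10_000

# Simulate X_{n+1} = rho X_n + xi_n over many independent runs.
X = np.zeros(reps)                 # X_0 = 0 in every run
for _ in range(n):
    X = rho * X + rng.standard_normal(reps)

# For |rho| < 1, X_n converges in distribution to sum_k rho^k xi_k,
# here a normal law with variance 1 / (1 - rho^2).
print("sample var:", X.var(), " limit var:", 1 / (1 - rho**2))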

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,
lim_{n→∞} P(Xn = x) = π(x),   uniformly in x.
In particular, for time-homogeneous Markov chains, the n-step transition matrix P^n converges to a matrix with rows all equal to π:
lim_{n→∞} p_{ij}^{(n)} = π(j),   i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law: we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. How should we then define what P(X = x, Y = y) is? Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define
P(X = x, Y = y) = P(X = x) P(Y = y).
But suppose that we have some further requirement: for instance, we may wish to make the coincidence probability P(X = Y) as large as possible.

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.
f(x) = ∑_y h(x, y),   g(y) = ∑_x h(x, y),
and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (Xn, n ≥ 0) and the law of a stochastic process Y = (Yn, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space. Note that it makes sense to speak of a meeting time, precisely because of the assumption that the two processes have been constructed together. We say that a random time T is a meeting time of the two processes if
Xn = Yn for all n ≥ T.
This T is a random variable that takes values in the set of times or, possibly, the value +∞.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,
|P(Xn = x) − P(Yn = x)| ≤ P(T > n).
If, in particular, T is finite with probability one, i.e. P(T < ∞) = 1, then
|P(Xn = x) − P(Yn = x)| → 0,   uniformly in x ∈ S, as n → ∞.

Proof. We have
P(Xn = x) = P(Xn = x, n < T) + P(Xn = x, n ≥ T)
   = P(Xn = x, n < T) + P(Yn = x, n ≥ T)
   ≤ P(n < T) + P(Yn = x),
where we used that Xn = Yn on the event {n ≥ T}. Hence
P(Xn = x) − P(Yn = x) ≤ P(T > n).    (19)
Repeating (19), but with the roles of X and Y interchanged, we obtain
P(Yn = x) − P(Xn = x) ≤ P(T > n).
Combining the two inequalities, we obtain the first assertion:
|P(Xn = x) − P(Yn = x)| ≤ P(T > n).
Since this holds for all x ∈ S, and since the right-hand side does not depend on x, we can write it as
sup_{x∈S} |P(Xn = x) − P(Yn = x)| ≤ P(T > n).
Assume now that P(T < ∞) = 1. Then
lim_{n→∞} P(T > n) = P(T = ∞) = 0,
and the second assertion follows.
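For the exercise above, the answer is the classical maximal coupling, which achieves P(X = Y) = ∑_x min(f(x), g(x)). A sketch of the construction, for laws on a finite set (an assumption made to keep the sampler simple):

import numpy as np

rng = np.random.default_rng(6)

def maximal_coupling(f, g):
    """Sample (X, Y) with X ~ f, Y ~ g and P(X = Y) = sum_x min(f, g).
    Assumes f != g (otherwise the residual branch is never needed)."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    m = np.minimum(f, g)
    p = m.sum()                           # overlap sum_x min(f(x), g(x))
    if rng.random() < p:
        x = rng.choice(len(f), p=m / p)           # X = Y on the overlap
        return x, x
    x = rng.choice(len(f), p=(f - m) / (1 - p))   # residual law of X
    y = rng.choice(len(g), p=(g - m) / (1 - p))   # residual law of Y
    return x, y                                   # disjoint supports: X != Y

f = [0.5, 0.3, 0.2]
g = [0.2, 0.3, 0.5]
draws = [maximal_coupling(f, g) for _ in range(50_000)]
print("P(X = Y) ~", np.mean([x == y for x, y in draws]))  # ~ 0.7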

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S. Let p_{ij} denote, as usual, the transition probabilities.

We shall consider the laws of two processes. The first process, denoted by X = (Xn, n ≥ 0), has the law of a Markov chain with transition probabilities p_{ij} and X0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Yn, n ≥ 0), has the law of a Markov chain with the same transition probabilities p_{ij} and Y0 distributed according to the stationary distribution π. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen. It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(Xn = x, Ym = y).

We now couple them by doing the straightforward thing: we assume that X = (Xn, n ≥ 0) is independent of Y = (Yn, n ≥ 0). (In particular, had we started with both X0 and Y0 independent and distributed according to π, then, for all n, Xn and Yn would be independent and distributed according to π.) Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
T := inf{n ≥ 0 : Xn = Yn}.

Consider the process
Wn := (Xn, Yn).
Then (Wn, n ≥ 0) is a Markov chain with state space S × S. Its initial state W0 = (X0, Y0) has distribution
P(W0 = (x, y)) = µ(x)π(y).
Its (1-step) transition probabilities are
q_{(x,y),(x′,y′)} := P(W_{n+1} = (x′, y′) | Wn = (x, y)) = P(X_{n+1} = x′ | Xn = x) P(Y_{n+1} = y′ | Yn = y) = p_{x,x′} p_{y,y′},
and its n-step transition probabilities are
q_{(x,y),(x′,y′)}^{(n)} = p_{x,x′}^{(n)} p_{y,y′}^{(n)}.
From Theorem 3 and the aperiodicity assumption, we have that p_{x,x′}^{(n)} > 0 and p_{y,y′}^{(n)} > 0 for all large n, implying that q_{(x,y),(x′,y′)}^{(n)} > 0 for all large n. Therefore (Wn, n ≥ 0) is an irreducible chain. Notice also that
σ(x, y) := π(x)π(y),   (x, y) ∈ S × S,
is a stationary distribution for (Wn, n ≥ 0). (Indeed, this is easily checked.)

Hence, by Corollary 13, (Wn, n ≥ 0) is positive recurrent. In particular, it visits every state of S × S; hence
P(T < ∞) = 1,
where T = inf{n ≥ 0 : Xn = Yn} is the hitting time of the diagonal set {(x, x) : x ∈ S}.

Define now a third process by:
Zn := Xn 1(n < T) + Yn 1(n ≥ T).
In other words, Zn equals Xn before the meeting time of X and Y and, after the meeting time, Zn = Yn.

[Figure: Z follows the X-path (arbitrary initial distribution µ) up to the meeting time T, and the Y-path (stationary distribution) thereafter.]

Obviously, Z0 = X0, which has an arbitrary distribution µ. Moreover, (Zn, n ≥ 0) is a Markov chain with transition probabilities p_{ij}; hence (Zn, n ≥ 0) is identical in law to (Xn, n ≥ 0). In particular,
P(Xn = x) = P(Zn = x),   for all n and x.
Clearly, T is a meeting time between Z and Y. By the coupling inequality,
|P(Zn = x) − P(Yn = x)| ≤ P(T > n),   uniformly in x,
which converges to zero, precisely because P(T < ∞) = 1. Since P(Zn = x) = P(Xn = x), and since P(Yn = x) = π(x), we have
|P(Xn = x) − π(x)| ≤ P(T > n) → 0,   as n → ∞, uniformly in x.
This proves the theorem.
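The construction in the proof is easy to animate: run two independent copies of the chain, one from an arbitrary state and one from π, record their first meeting time T, and check the coupling bound |P(Xn = x) − π(x)| ≤ P(T > n). A sketch, with a hypothetical 3-state chain:

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
rng = np.random.default_rng(7)

def meeting_time(x0, n_max=10_000):
    """First time two independent copies (one from x0, one from pi) meet."""
    x = x0
    y = rng.choice(3, p=pi)
    for t in range(n_max):
        if x == y:
            return t
        x = rng.choice(3, p=P[x])
        y = rng.choice(3, p=P[y])
    return n_max

T = np.array([meeting_time(0) for _ in range(20_000)])
# Coupling inequality: |P(X_n = x) - pi(x)| <= P(T > n) for all x
# (up to Monte Carlo error in the estimated bound).
for n in (1, 2, 5, 10):
    bound = (T > n).mean()
    actual = np.abs(np.linalg.matrix_power(P, n)[0] - pi).max()
    print(n, actual, "<=", bound)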

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_{ij} are the transition probabilities of an irreducible chain X, and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and that p01 = p10 = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)}, and its transitions are
(0,0) → (1,1) → (0,0)   and   (0,1) → (1,0) → (0,1),   each with probability 1.
This is not irreducible: {(0,0), (1,1)} and {(0,1), (1,0)} are two separate closed classes. However, if we modify the original chain and make it aperiodic by setting, say, p00 = 0.01, p01 = 0.99, p10 = 1, then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. Suppose X0 = 0. Then Xn = 0 for even n and Xn = 1 for odd n. Therefore p00^{(n)} = 1 for even n and p00^{(n)} = 0 for odd n, and limn→∞ p00^{(n)} does not exist. Indeed, if we modify the chain by p00 = 0.01, p01 = 0.99, p10 = 1, then
lim_{n→∞} p00^{(n)} = lim_{n→∞} p10^{(n)} = π(0),   lim_{n→∞} p11^{(n)} = lim_{n→∞} p01^{(n)} = π(1),
where π(0), π(1) satisfy
0.99 π(0) = π(1),   π(0) + π(1) = 1;
in particular, π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,
lim_{n→∞} [ p00^{(n)}  p01^{(n)} ; p10^{(n)}  p11^{(n)} ] = [ 0.503  0.497 ; 0.503  0.497 ].

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"⁶. The reason is that the term "ergodic" means something specific in Mathematics (cf., in particular, Parry (1981)), and this is expressed by our terminology. Terminology cannot be, per se, wrong; it is, however, customary to have a consistent terminology.

⁶ We put "wrong" in quotes.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain, and suppose that the period of the chain is d. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into d classes C0, C1, ..., C_{d−1}, such that the chain moves cyclically between them: from C0 to C1, from C1 to C2, and so on, and from C_{d−1} back to C0. By the results of Sections 14-16, we know we have a unique stationary distribution π and that π(x) > 0 for all states x.
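The failure of convergence for the two-state periodic chain discussed above, and the rescue by Cesaro averaging, can both be seen directly from matrix powers. A sketch:

import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # period 2: p01 = p10 = 1

Pn = np.eye(2)
avg = np.zeros((2, 2))
for n in range(1, 101):
    Pn = Pn @ P
    avg += Pn
# p00^(n) alternates between 1 and 0, so lim p00^(n) does not exist...
print("P^100[0,0] =", Pn[0, 0], " P^101[0,0] =", (Pn @ P)[0, 0])
# ...but the Cesaro averages converge to pi = (1/2, 1/2):
print("Cesaro average:", (avg / 100)[0])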

It is not difficult to see that:

Lemma 8. Let αr := ∑_{i∈Cr} π(i), r = 0, 1, ..., d − 1. Then αr = 1/d for all r, i.e.
∑_{i∈Cr} π(i) = 1/d,   r = 0, 1, ..., d − 1.

Proof. We have that π satisfies the balance equations:
π(i) = ∑_{j∈S} π(j) p_{ji},   i ∈ S.
Suppose i ∈ Cr. Then p_{ji} = 0 unless j ∈ C_{r−1} (indices modulo d). So
π(i) = ∑_{j∈C_{r−1}} π(j) p_{ji}.
Summing up over i ∈ Cr, we obtain
αr = ∑_{i∈Cr} π(i) = ∑_{j∈C_{r−1}} π(j) ∑_{i∈Cr} p_{ji} = ∑_{j∈C_{r−1}} π(j) = α_{r−1}.
Since the αr add up to 1, each one must be equal to 1/d.

We shall show that:

Lemma 9. Let i, j belong to the same class Cr. Then
p_{ij}^{(nd)} → dπ(j),   as n → ∞.

Proof. Suppose that X0 ∈ Cr. Then, sampled every d steps, the chain remains in Cr: the chain (X_{nd}, n ≥ 0) consists of d closed communicating classes, namely C0, ..., C_{d−1}. If we restrict (X_{nd}, n ≥ 0) to any one of the sets Cr, we obtain a positive recurrent irreducible aperiodic chain (all states have period 1 for this chain), and the previous results apply. Since, by Lemma 8, π(Cr) = 1/d, the (unique) stationary distribution of (X_{nd}, n ≥ 0), restricted to Cr, assigns to state i the probability dπ(i), i ∈ Cr. Hence, by Theorem 17, p_{ij}^{(nd)} → dπ(j), as n → ∞.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (where r is taken to be a number between 0 and d − 1, after reduction modulo d). Then, starting from i, the original chain enters Cℓ in r steps and, thereafter, it remains in Cℓ at the times r, r + d, r + 2d, .... This means that:

Theorem 18. Suppose that p_{ij} are the transition probabilities of a positive recurrent irreducible chain with period d, stationary distribution π and cyclic classes C0, ..., C_{d−1}. Let i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (modulo d). Then
p_{ij}^{(r+nd)} = P_i(X_{r+nd} = j) → dπ(j),   as n → ∞.

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_{ij}^{(n)}, as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_{ij}^{(n)} → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_{ij}^{(n)} → 0, as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then
p_{ij}^{(n)} → 0,   as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution on N,
p^{(n)} = (p_1^{(n)}, p_2^{(n)}, p_3^{(n)}, ...),   where ∑_{i=1}^∞ p_i^{(n)} = 1 and p_i^{(n)} ≥ 0 for all i.
Then we can find a sequence n1 < n2 < n3 < ··· of integers such that lim_{k→∞} p_i^{(nk)} exists for all i ∈ N.

Proof. Since the numbers p_1^{(n)}, n = 1, 2, ..., are contained between 0 and 1, there must be a subsequence n^1_1 < n^1_2 < ···, such that lim_{k→∞} p_1^{(n^1_k)} exists and equals, say, p_1. Now consider the numbers p_2^{(n^1_k)}, k = 1, 2, .... Since they are also contained between 0 and 1, there must be a subsequence (n^2_k) of (n^1_k) such that lim_{k→∞} p_2^{(n^2_k)} exists and equals, say, p_2. It is clear how we continue: for each i, we find a subsequence (n^i_k, k = 1, 2, ...) of the previous one, such that lim_{k→∞} p_i^{(n^i_k)} exists and equals, say, p_i. Schematically:
p_1^{(n^1_1)}, p_1^{(n^1_2)}, p_1^{(n^1_3)}, ··· → p_1
p_2^{(n^2_1)}, p_2^{(n^2_2)}, p_2^{(n^2_3)}, ··· → p_2
p_3^{(n^3_1)}, p_3^{(n^3_2)}, p_3^{(n^3_3)}, ··· → p_3
···
Now pick the numbers in the diagonal, nk := n^k_k. By construction, the sequence (n^2_k) of the second row is a subsequence of (n^1_k) of the first row, the sequence (n^3_k) of the third row is a subsequence of (n^2_k) of the second row, and so on. So the diagonal subsequence (n^k_k, k = 1, 2, ...) is, eventually, a subsequence of each (n^i_k, k = 1, 2, ...). Hence p_i^{(nk)} → p_i, as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit⁷:

⁷ See Brémaud (1999).

Lemma 11. If (yi, i ∈ N) and (x_i^{(n)}, i ∈ N), n = 1, 2, ..., are sequences of nonnegative numbers, then
∑_{i=1}^∞ yi lim inf_{n→∞} x_i^{(n)} ≤ lim inf_{n→∞} ∑_{i=1}^∞ yi x_i^{(n)}.

Proof of Theorem 19. Let j be a null recurrent state, and suppose that, for some i, the sequence p_{ij}^{(n)} does not converge to 0. Hence we can find a subsequence (nk) such that
lim_{k→∞} p_{ij}^{(nk)} = αj > 0.
Consider the probability vectors (p_{ix}^{(nk)}, x ∈ S). By Lemma 10, we can pick a subsequence (n′k) of (nk) such that
lim_{k→∞} p_{ix}^{(n′k)} = αx,   for all x ∈ S.
Since ∑_{x∈S} p_{ix}^{(n′k)} = 1 for each k, Lemma 11 gives
0 < αj ≤ ∑_{x∈S} αx ≤ 1.
But P^{n+1} = P^n P, i.e.
p_{ix}^{(n′k+1)} = ∑_{y∈S} p_{iy}^{(n′k)} p_{yx},
so, by Lemma 11,
lim inf_{k→∞} p_{ix}^{(n′k+1)} ≥ ∑_{y∈S} lim_{k→∞} p_{iy}^{(n′k)} p_{yx} = ∑_{y∈S} αy p_{yx}.
Hence
αx ≥ ∑_{y∈S} αy p_{yx},   for all x ∈ S.
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that, for some x0 ∈ S, it is a strict inequality:
αx0 > ∑_{y∈S} αy p_{yx0}.
This would imply that
∑_{x∈S} αx > ∑_{x∈S} ∑_{y∈S} αy p_{yx} = ∑_{y∈S} αy ∑_{x∈S} p_{yx} = ∑_{y∈S} αy,
and this is a contradiction (the number ∑_x αx cannot be strictly larger than itself). Therefore
αx = ∑_{y∈S} αy p_{yx},   for all x ∈ S.
Since 0 < ∑_y αy ≤ 1, we have that
π(x) := αx / ∑_{y∈S} αy
satisfies the balance equations and has ∑_{x∈S} π(x) = 1; i.e. π is a stationary distribution. Now π(j) > 0 because αj > 0. By Corollary 13, state j is positive recurrent: contradiction. Thus p_{ij}^{(n)} → 0, as n → ∞, for all i.

19.2 The general case

It remains to see what the limit of p_{ij}^{(n)} is when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. In general, fij = P_i(Tj < ∞) is not necessarily equal to 1. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j, and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, let
H_C := inf{n ≥ 0 : Xn ∈ C}
be the first hitting time of the set C, and let ϕ^C(i) := P_i(H_C < ∞). Then
lim_{n→∞} p_{ij}^{(n)} = ϕ^C(i) π^C(j).

Proof. If i also belongs to C, then ϕ^C(i) = 1 and the result follows from Theorem 17. In general, we have
p_{ij}^{(n)} = P_i(Xn = j) = P_i(Xn = j, H_C ≤ n) = ∑_{m=1}^n P_i(H_C = m) P_i(Xn = j | H_C = m).
Let µ be the distribution of X_{H_C}, given that X0 = i and that {H_C < ∞}. From the Strong Markov Property,
P_i(Xn = j | H_C = m) = P_µ(X_{n−m} = j).
But Theorem 17 tells us that
P_µ(X_{n−m} = j) → π^C(j),   as n → ∞,
regardless of the initial distribution µ. Hence
lim_{n→∞} p_{ij}^{(n)} = π^C(j) ∑_{m=1}^∞ P_i(H_C = m) = π^C(j) P_i(H_C < ∞) = ϕ^C(i) π^C(j),
which is what we need.

Remark: We can write the result of Theorem 20 as
lim_{n→∞} p_{ij}^{(n)} = fij / E_j Tj,
where Tj = inf{n ≥ 1 : Xn = j} and fij = P_i(Tj < ∞). Indeed, fij = ϕ^C(i) (once the chain enters C, it hits j with probability 1, because C is recurrent), and π^C(j) = 1/E_j Tj. In this form, the result can also be obtained from the first hitting time decomposition relation of Lemma 7, by means of the so-called Abel's Lemma of Analysis; this is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: chain on the states 1, ..., 5, with p12 = 1, p21 = γ, p22 = 1 − γ, p32 = 1 − α, p34 = α, p43 = β, p45 = 1 − β, p55 = 1.]

The communicating classes are I = {3, 4} (inessential states), C1 = {1, 2} (closed class), and C2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are:
π^{C1}(1) = γ/(1+γ),   π^{C1}(2) = 1/(1+γ),   π^{C2}(5) = 1.
Let ϕ(i) be the probability that class C1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1 and ϕ(5) = 0, while, from first-step analysis,
ϕ(3) = (1 − α)ϕ(2) + αϕ(4),   ϕ(4) = (1 − β)ϕ(5) + βϕ(3),
whence
ϕ(3) = (1 − α)/(1 − αβ),   ϕ(4) = β(1 − α)/(1 − αβ).
Both C1, C2 are aperiodic classes. Hence, as n → ∞,
p31^{(n)} → ϕ(3)π^{C1}(1),   p32^{(n)} → ϕ(3)π^{C1}(2),   p33^{(n)} → 0,   p34^{(n)} → 0,   p35^{(n)} → (1 − ϕ(3))π^{C2}(5),
p41^{(n)} → ϕ(4)π^{C1}(1),   p42^{(n)} → ϕ(4)π^{C1}(2),   p43^{(n)} → 0,   p44^{(n)} → 0,   p45^{(n)} → (1 − ϕ(4))π^{C2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, among others, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19-th century French mathematicians Henri Poincaré and Joseph Liouville.

We will avoid giving a precise definition of the term 'ergodic', and only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem; and so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state; here we relax this. Consider the quantity ν^{[a]}(x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P(Ta < ∞) = 1. Let f be a reward function such that
∑_{x∈S} |f(x)| ν^{[a]}(x) < ∞.    (20)

Then
P( lim_{t→∞} [ ∑_{n=0}^t f(Xn) ] / [ ∑_{n=0}^t 1(Xn = a) ] = f̄ ) = 1,    (21)
where
f̄ := ∑_{x∈S} f(x) ν^{[a]}(x).

Proof. The reader is asked to revisit the proof of Theorem 14. As there, we break the total reward as follows:
∑_{n=0}^t f(Xn) = ∑_{r=1}^{Nt−1} Gr + G^first + G^last,
where
Gr = ∑_{T_a^{(r)} ≤ t < T_a^{(r+1)}} f(Xt),
while G^first, G^last are the total rewards over the first and last incomplete cycles, respectively. The Law of Large Numbers tells us that
lim_{n→∞} (1/n) ∑_{r=0}^{n−1} Gr = EG1,
with probability 1, as long as E|G1| < ∞. But this is precisely the condition (20). We only need to check that EG1 = f̄, the quantity denoted by f̄ in Theorem 14; this is done as in the proof of Theorem 14 and is left as an exercise. Note that the denominator ∑_{n=0}^t 1(Xn = a) equals the number Nt of visits to state a up to time t. So, replacing n by the subsequence Nt and dividing the numerator and the denominator of the fraction in (21) by Nt, the denominator becomes 1, while the numerator converges to EG1 = f̄ (the first and last incomplete cycles contribute nothing in the limit, as in the proof of Theorem 14). This yields the result.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a; here a need only be recurrent.

Remark 2: If the chain is irreducible and positive recurrent, then the Ergodic Theorem of Theorem 14 follows: if we divide by t the numerator and denominator of the fraction in (21), then, as we saw in the proof of Theorem 14, the denominator Nt/t converges to 1/E_a Ta, with probability one; hence the numerator converges to f̄/E_a Ta, and we obtain Theorem 14. In particular, the ergodic theorem holds for any ergodic chain, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. First: run the chain forever; since the number of states is finite, at least one of the states must be visited infinitely often. So there are always recurrent states. If a is such a recurrent state then, from Proposition 2, we know that the balance equations ν = νP have at least one solution, namely ν = ν^{[a]}. Now, since the number of states is finite, we have
C := ∑_{i∈S} ν(i) < ∞.

Hence
π(i) := ν(i)/C,   i ∈ S,
satisfies the balance equations and the normalisation condition ∑_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0; by Corollary 13, this state must be positive recurrent. In particular, a finite Markov chain always has positive recurrent states. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent, with a unique stationary distribution π. If, in addition, the chain is aperiodic, then we have P^n → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

Proof. By the above, some state is positive recurrent; because positive recurrence is a class property (Theorem 15) and the chain is irreducible, all states are positive recurrent. The stationary distribution is unique by Theorem 16, and the convergence P^n → Π, in the aperiodic case, follows from Theorem 17.

22 Time reversibility

Consider a Markov chain (Xn, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that
(X0, X1, ..., Xn) has the same distribution as (Xn, X_{n−1}, ..., X0),   for all n.
Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. It is, in a sense, like playing a film backwards. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, without us being able to tell that it is running backwards. But if we film a man who walks, and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Taking n = 1 in the definition of reversibility, we see that (X0, X1) has the same distribution as (X1, X0). In particular, the first components of the two vectors must have the same distribution: X1 has the same distribution as X0. More generally, since (X0, ..., Xn) has the same distribution as (Xn, ..., X0), Xn has the same distribution as X0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π, started from π. Then the chain is reversible if and only if the detailed balance equations hold:
π(i) p_{ij} = π(j) p_{ji},   i, j ∈ S.

Proof. Suppose first that the chain is reversible. Then (X0, X1) has the same distribution as (X1, X0), and this means that
P(X0 = i, X1 = j) = P(X1 = i, X0 = j).
But P(X0 = i, X1 = j) = π(i) p_{ij} and P(X0 = j, X1 = i) = π(j) p_{ji}. So the detailed balance equations (DBE) hold. Conversely, suppose the DBE hold. Then, reading the same display backwards, (X0, X1) has the same distribution as (X1, X0).

For three states i, j, k, using the DBE twice,
π(i) p_{ij} p_{jk} = [π(j) p_{ji}] p_{jk} = [π(j) p_{jk}] p_{ji} = π(k) p_{kj} p_{ji},
and this means that
P(X0 = i, X1 = j, X2 = k) = P(X0 = k, X1 = j, X2 = i),
i.e. (X0, X1, X2) has the same distribution as (X2, X1, X0). The same logic works for any m and can be worked out by induction. Hence (X0, X1, ..., Xn) has the same distribution as (Xn, X_{n−1}, ..., X0), for all n, and the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all m ≥ 3 and all i1, i2, ..., im ∈ S,
p_{i1 i2} p_{i2 i3} ··· p_{i_{m−1} i_m} p_{i_m i1} = p_{i1 i_m} p_{i_m i_{m−1}} ··· p_{i2 i1}.    (22)
In words: the probability of traversing any loop of states equals the probability of traversing it in the reverse direction.

Proof. Suppose first that the chain is reversible; then the DBE hold (Theorem 23). Pick 3 states i, j, k, and write, using the DBE repeatedly:
π(i) p_{ij} p_{jk} p_{ki} = [π(j) p_{ji}] p_{jk} p_{ki} = [π(j) p_{jk}] p_{ji} p_{ki} = [π(k) p_{kj}] p_{ji} p_{ki} = [π(k) p_{ki}] p_{ji} p_{kj} = [π(i) p_{ik}] p_{ji} p_{kj} = π(i) p_{ik} p_{kj} p_{ji}.
Since the chain is irreducible, we have π(i) > 0 for all i. Hence, cancelling π(i) from the above display, we obtain the KLC for m = 3. By induction, the same argument gives the KLC for any m.

Conversely, suppose that the KLC (22) holds. Fix two states i1, i2 and sum (22) over all choices of the intermediate states i3, ..., im. The left-hand side becomes p_{i1 i2} p_{i2 i1}^{(m−1)}, and the right-hand side becomes p_{i2 i1} p_{i1 i2}^{(m−1)}:
p_{i1 i2} p_{i2 i1}^{(m−1)} = p_{i1 i2}^{(m−1)} p_{i2 i1}.
If the chain is aperiodic then, by Theorem 17, p_{i2 i1}^{(m−1)} → π(i1) and p_{i1 i2}^{(m−1)} → π(i2), as m → ∞. Hence, letting m → ∞,
π(i1) p_{i1 i2} = π(i2) p_{i2 i1}.
These are the DBE, and the chain is reversible by Theorem 23.
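Both criteria are mechanical to check on a finite chain: detailed balance against the stationary distribution, and the KLC over all loops up to a given length. A sketch (the birth-death example is hypothetical; such chains always satisfy both criteria):

import numpy as np
from itertools import product

def reversible_via_dbe(P, pi):
    """Check detailed balance pi(i) p_ij = pi(j) p_ji for all i, j."""
    D = pi[:, None] * P
    return np.allclose(D, D.T)

def klc_holds(P, m=4):
    """Check Kolmogorov's loop criterion for all loops of length <= m."""
    n = len(P)
    for L in range(3, m + 1):
        for loop in product(range(n), repeat=L):
            fwd = np.prod([P[loop[k], loop[(k + 1) % L]] for k in range(L)])
            bwd = np.prod([P[loop[(k + 1) % L], loop[k]] for k in range(L)])
            if not np.isclose(fwd, bwd):
                return False
    return True

# A birth-death chain on {0, 1, 2} is always reversible:
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.6, 0.4]])
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
print(reversible_via_dbe(P, pi), klc_holds(P))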

Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible. 2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j, i.e. that we have separation of variables: pij/pji = f(i) g(j). (Convention: we form this ratio only for each pair of distinct states i, j for which pij > 0; by the previous note, pji > 0 as well.) Necessarily, f(i) g(i) must be 1 (why?), so we have
  pij/pji = g(j)/g(i).
Multiplying by g(i) and summing over j we find g(i) = ∑_{j∈S} g(j) pji. So g satisfies the balance equations, and hence π can be found very easily (by normalising g).

This gives the following method for showing that a finite irreducible chain is reversible:
• Form the ratio pij/pji and show that the variables separate.
• If they do, the chain is reversible, and π is obtained by normalising g.
• If the chain has infinitely many states, then we need, in addition, to guarantee that the chain is positive recurrent.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. Form the graph of the chain by deleting all self-loops and by replacing, for each such pair i, j, the two oriented edges from i to j and from j to i by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible. For example, consider the chain in the figure below and replace it by its undirected graph: this is a tree; there isn't any loop.

The reason is as follows. Pick two distinct states i, j for which pij > 0. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop.) Let A, Ac be the sets in the two connected components. There is obviously only one arrow from A to Ac, namely the arrow from i to j, and the arrow from j to i is the only one from Ac to A. The balance equations are written in the form F(A, Ac) = F(Ac, A) (see §4.1), which gives π(i) pij = π(j) pji. So the detailed balance equations hold, and hence the original chain is reversible.

Example 2: Random walk on a graph. Consider an undirected graph G with vertices S, a finite set, and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. States i and j are called neighbours if they are connected by an edge. For each state i define its degree
  δ(i) = number of neighbours of i.
Now define the following transition probabilities:
  pij = 1/δ(i), if j is a neighbour of i; pij = 0, otherwise.
The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example, in the graph with vertices 0, 1, 2, 3, 4 shown in the figure, we have p12 = p10 = p14 = 1/3, p02 = 1/4, etc. Suppose the graph is connected.

The stationary distribution of a RWG is very easy to find. We claim that π(i) = Cδ(i), where C is a constant. To see this, let us verify that the detailed balance equations π(i) pij = π(j) pji hold. There are two cases: If i, j are neighbours then π(i) pij = Cδ(i) · (1/δ(i)) = C and π(j) pji = Cδ(j) · (1/δ(j)) = C, so both members are equal to C. If i, j are not neighbours then both sides are 0. In all cases the DBE hold. The constant C is found by normalisation, ∑_{i∈S} π(i) = 1. Since ∑_{i∈S} δ(i) = 2|E|, where |E| is the total number of edges, we have C = 1/2|E|. Hence we conclude that:
1. A RWG is a reversible chain.
2. The stationary distribution assigns to a state a probability which is proportional to its degree.
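A short computational check of the claim π(i) = δ(i)/2|E|: the sketch below builds a RWG on a small graph of our own choosing and verifies stationarity numerically.

import numpy as np

# An illustrative undirected graph on 5 vertices.
edges = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 3), (3, 4)]
n = 5
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1
P = np.zeros((n, n))
for i, j in edges:
    P[i, j] = 1 / deg[i]
    P[j, i] = 1 / deg[j]

pi = deg / deg.sum()        # the claim: pi(i) = delta(i) / 2|E|
print(np.allclose(pi @ P, pi))  # True: pi solves the balance equations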

23 New Markov chains from old ones*

We now deal with the question: if we have a Markov chain (Xn), how can we create new ones?

23.1 Subsequences

The most obvious way to create a new Markov chain is by picking a subsequence, i.e. a sequence of integers (nk)_{k≥0} such that 0 ≤ n0 < n1 < n2 < · · ·, and by letting Yk := X_{nk}. The sequence (Yk, k = 0, 1, 2, ...) has the Markov property, but it is not, in general, time-homogeneous. If the subsequence forms an arithmetic progression, i.e. if nk = n0 + km, then (Yk, k = 0, 1, 2, ...) has time-homogeneous transition probabilities:
  P(Y_{k+1} = j | Yk = i) = p^{(m)}_{ij},
where p^{(m)}_{ij} are the m-step transition probabilities of the original chain. If π is a stationary distribution for the original chain, then it is a stationary distribution for the new chain as well.

23.2 Watching a Markov chain when it visits a set

Suppose that the original chain is ergodic (i.e. irreducible and positive recurrent). Let A be a set of states and let T_A^{(r)} be the r-th visit to A. Define
  Yr := X_{T_A^{(r)}},  r = 1, 2, ....
Due to the Strong Markov Property, this process is also a Markov chain and it does have time-homogeneous transition probabilities. The state space of (Yr) is A. If π is the stationary distribution of the original chain then π/π(A), restricted to A, is the stationary distribution of (Yr).

23.3 Subordination

Let (Xn, n ≥ 0) be Markov with transition probabilities pij. Let (St, t ≥ 0) be independent from (Xn) with
  S0 = 0, S_{t+1} = St + ξt, t ≥ 0,
where the ξt are i.i.d. random variables with values in N and distribution λ(n) = P(ξ0 = n), n ∈ N.

Then Yt := X_{St}, t ≥ 0, is a Markov chain with
  P(Y_{t+1} = j | Yt = i) = ∑_{n=1}^{∞} λ(n) p^{(n)}_{ij}.

23.4 Deterministic function of a Markov chain

If (Xn) is a Markov chain with state space S and if f : S → S′ is some function from S to some countable set S′, then the process Yn = f(Xn), n ≥ 0, is NOT, in general, a Markov chain. If, however, f is a one-to-one function, then (Yn) is a Markov chain: knowledge of f(Xn) implies knowledge of Xn, and so, conditional on Yn, the past is independent of the future. (Notation: we will denote the elements of S by i, j, ..., and the elements of S′ by α, β, ....) More generally, we have:

Lemma 13. Suppose that P(Y_{n+1} = β | Xn = i) depends on i only through f(i), i.e. that there is a function Q : S′ × S′ → R such that
  P(Y_{n+1} = β | Xn = i) = Q(f(i), β).
Then (Yn) has the Markov property.

Proof. Fix α0, α1, ..., α_{n−1}, α, β ∈ S′ and let Y_{n−1} denote the event Y_{n−1} := {Y0 = α0, ..., Y_{n−1} = α_{n−1}}. We need to show that
  P(Y_{n+1} = β | Yn = α, Y_{n−1}) = P(Y_{n+1} = β | Yn = α),
for all choices of the variables involved. We have
  P(Y_{n+1} = β | Yn = α, Y_{n−1})
  = ∑_{i∈S} P(Y_{n+1} = β | Xn = i, Yn = α, Y_{n−1}) P(Xn = i | Yn = α, Y_{n−1})
  = ∑_{i∈S} Q(f(i), β) P(Yn = α | Xn = i, Y_{n−1}) P(Xn = i | Y_{n−1}) / P(Yn = α | Y_{n−1})
  = ∑_{i∈S} Q(f(i), β) 1(α = f(i)) P(Xn = i | Y_{n−1}) / P(Yn = α | Y_{n−1})
  = Q(α, β) P(Yn = α | Y_{n−1}) / P(Yn = α | Y_{n−1}) = Q(α, β),
which does not depend on Y_{n−1}. Thus (Yn) is, indeed, a Markov chain with transition probabilities Q(α, β).

Example: Consider the symmetric drunkard's random walk, i.e.
  P(X_{n+1} = i ± 1 | Xn = i) = 1/2, i ∈ Z.
Define Yn := |Xn|. Then (Yn) is a Markov chain. To see this, first observe that the function |·| is not one-to-one, so the remark above does not apply; instead we will show that P(Y_{n+1} = β | Xn = i), i ∈ Z, β ∈ Z+, depends on i only through |i|, and appeal to Lemma 13. Indeed, conditional on {Xn = i} with i ≠ 0, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, because the probability of going up is the same as the probability of going down, regardless of whether i > 0 or i < 0; and if i = 0, then |X_{n+1}| = 1 for sure. In other words, P(Y_{n+1} = β | Xn = i) = Q(|i|, β) for all i ∈ Z and β ∈ Z+, if we let
  Q(α, β) := 1/2, if α > 0 and β = α ± 1; Q(0, 1) := 1; Q(α, β) := 0, otherwise.
Thus (Yn) is, indeed, a Markov chain with transition probabilities Q(α, β).
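An empirical check of the example: the sketch below simulates the walk, tabulates the observed transitions of Yn = |Xn|, and compares the frequencies with Q (the random seed and run length are arbitrary choices).

import random
from collections import Counter

random.seed(1)
x, counts, visits = 0, Counter(), Counter()
for _ in range(10**6):
    y_old = abs(x)
    x += random.choice((-1, 1))
    visits[y_old] += 1
    counts[(y_old, abs(x))] += 1

for (a, b), c in sorted(counts.items())[:6]:
    print(a, "->", b, round(c / visits[a], 3))  # ~ Q(a,b): 1/2, except Q(0,1) = 1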

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: At each point of time n we have a number Xn of individuals, each of which gives birth to a random number of offspring, independently of one another and all following the same distribution. The parent dies immediately upon giving birth. We let pk be the probability that an individual will bear k offspring, k = 0, 1, 2, .... Let ξ_k^{(n)} be the number of offspring of the k-th individual in the n-th generation. Then the size of the (n + 1)-st generation is:
  X_{n+1} = ξ_1^{(n)} + ξ_2^{(n)} + · · · + ξ_{Xn}^{(n)},
i.e. we find the population at time n + 1 by adding the number of offspring of each individual. If we let ξ^{(n)} = (ξ_1^{(n)}, ξ_2^{(n)}, ...), then we see that the above equation is of the form X_{n+1} = f(Xn, ξ^{(n)}), and, since ξ^{(0)}, ξ^{(1)}, ... are i.i.d. (and independent of the initial population size X0 — a further assumption), we have that (Xn, n ≥ 0) has the Markov property. It is a Markov chain with time-homogeneous transitions. Computing the transition probabilities pij = P(X_{n+1} = j | Xn = i) is, in general, a hard problem. We will not attempt to compute the pij nor draw the transition diagram, because they are both futile; so we will find some different method to proceed.

But here is an interesting special case. Example: Suppose that each individual gives birth to at most one child: one child with probability p, or no children with probability 1 − p; i.e. p1 = p, p0 = 1 − p. If we know that Xn = i then X_{n+1} is the number of successful births amongst i individuals, which has a binomial distribution:
  pij = \binom{i}{j} p^j (1 − p)^{i−j},  0 ≤ j ≤ i.
In this case, the population cannot grow: it will eventually reach 0, with probability 1.

Back to the general case. If p1 = P(ξ = 1) = 1 then Xn = X0 for all n, and this is a trivial case. So assume that p1 ≠ 1. Then P(ξ = 0) + P(ξ ≥ 2) > 0, and 0 is always an absorbing state. We distinguish two cases:
Case I: If p0 = P(ξ = 0) = 0 then the population can only grow. The state 0 is irrelevant because it cannot be reached from anywhere, and all states i ≥ 1 are inessential, therefore transient.
Case II: If p0 = P(ξ = 0) > 0 then state 0 can be reached from any other state: indeed, if the current population is i, there is probability p0^i that everybody gives birth to zero offspring. But 0 is an absorbing state. Hence all states i ≥ 1 are inessential, therefore transient.
Therefore, with probability 1, either the population becomes extinct or, if 0 is never reached, Xn → ∞: if the population does not become extinct then it must grow to infinity. Thus, one question of interest is whether the process will become extinct or not.

Let D be the event of 'death' or extinction:
  D := {Xn = 0 for some n ≥ 0}.
We wish to compute the extinction probability ε(i) := Pi(D), when we start with X0 = i. Obviously, ε(0) = 1. For i ≥ 1, we can argue that ε(i) = ε^i, where ε = P1(D). Indeed, if we start with X0 = i initial ancestors, each one behaves independently of the others. Let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. We then have D = D1 ∩ · · · ∩ Di, and P(Dk) = ε, so P(D | X0 = i) = ε^i. Therefore the problem reduces to the study of the branching process starting with X0 = 1; let ε be its extinction probability.

Let
  ϕ(z) := E z^ξ = ∑_{k=0}^∞ z^k pk
be the probability generating function of ξ, and let
  µ = Eξ = ∑_{k=1}^∞ k pk
be the mean number of offspring of an individual. We also let
  ϕn(z) := E z^{Xn} = ∑_{k=0}^∞ z^k P(Xn = k)
be the probability generating function of Xn. This function is of interest because it gives information about the extinction probability ε. Indeed, ϕn(0) = P(Xn = 0), and, since Xn = 0 implies Xm = 0 for all m ≥ n, we have
  lim_{n→∞} ϕn(0) = lim_{n→∞} P(Xm = 0 for all m ≥ n) = P(there exists n such that Xm = 0 for all m ≥ n) = P(D) = ε.

The result is this:

Proposition 5. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. If µ ≤ 1 then ε = 1. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z in [0, 1).

Proof. Note that ϕ0(z) = z^1·P(X0=1) gives ϕ1(z) = ϕ(z), and
  ϕ_{n+1}(z) = E z^{X_{n+1}} = E[ z^{ξ_1^{(n)}} z^{ξ_2^{(n)}} · · · z^{ξ_{Xn}^{(n)}} ]
  = ∑_{i=0}^∞ E[ z^{ξ_1^{(n)}} · · · z^{ξ_i^{(n)}} 1(Xn = i) ]
  = ∑_{i=0}^∞ E[z^{ξ_1^{(n)}}] · · · E[z^{ξ_i^{(n)}}] E[1(Xn = i)]
  = ∑_{i=0}^∞ ϕ(z)^i P(Xn = i) = E ϕ(z)^{Xn} = ϕn(ϕ(z)).
Hence ϕ2(z) = ϕ1(ϕ(z)) = ϕ(ϕ(z)), ϕ3(z) = ϕ2(ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2(z)), and so on; by induction,
  ϕ_{n+1}(z) = ϕ(ϕn(z)), and in particular ϕ_{n+1}(0) = ϕ(ϕn(0)).
But ϕ_{n+1}(0) → ε as n → ∞, and also ϕn(0) → ε; since ϕ is a continuous function, ϕ(ϕn(0)) → ϕ(ε). Therefore ε = ϕ(ε).

Notice now that
  µ = dϕ(z)/dz at z = 1,
and recall that ϕ is a convex function with ϕ(1) = 1. If the slope µ at z = 1 is ≤ 1, the graph of ϕ(z) lies above the diagonal (the function z), and so the equation z = ϕ(z) has no solutions other than z = 1. This means that ε = 1 (see left figure below). On the other hand, if µ > 1, then the graph of ϕ(z) does intersect the diagonal at some point ε* < 1. Let us show that, in fact, ε = ε*. Since both ε and ε* satisfy z = ϕ(z), and since the latter has only two solutions (z = ε* and z = 1), all we need to show is that ε < 1. But 0 ≤ ε*, so ϕ(0) ≤ ϕ(ε*) = ε*, and, by induction, ϕn(0) ≤ ε* for all n. But ϕn(0) → ε, so ε ≤ ε* < 1. Therefore ε = ε*. (See right figure below.)

[Figure: graphs of ϕ(z) against the diagonal z, for 0 ≤ z ≤ 1. Left: case µ ≤ 1, where ε = 1. Right: case µ > 1, where ε < 1 is the smaller intersection point.]
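The proof's iteration ϕ_{n+1}(0) = ϕ(ϕn(0)) is directly computable. The sketch below runs it for an offspring law of our own choosing with µ = 1.4 > 1; one can check by hand that the fixed points of ϕ here are 1/3 and 1, so the iteration should converge to ε = 1/3.

# Offspring distribution (illustrative): p0 = 0.2, p1 = 0.2, p2 = 0.6.
p = {0: 0.2, 1: 0.2, 2: 0.6}                  # mean mu = 0.2 + 1.2 = 1.4
phi = lambda z: sum(pk * z**k for k, pk in p.items())

z = 0.0
for _ in range(200):
    z = phi(z)                                # phi_{n+1}(0) = phi(phi_n(0))
print(z, abs(phi(z) - z) < 1e-12)             # ~ 1/3, and a fixed point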

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel), using the same frequency. The problem of communication over a common channel is that, for a packet to go through, only one of the hosts should be transmitting at a given time: if more than one host attempts to transmit a packet, there is a collision and the transmissions must be attempted anew. One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 1, 2, 3, 4, 5, 6, ... are allocated to hosts 1, 2, 3, 1, 2, 3, .... But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit, and, since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea, we let hosts transmit whenever they like. (Nowadays, Ethernet has replaced Aloha.)

In a simplified model, assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot assume there are a new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Let s be the number of selected packets. If s = 1 there is a successful transmission and, on the next time slot, there are x + a − 1 packets. If s ≥ 2, there is a collision, the transmission is unsuccessful, and on the next time slot there are x + a packets. So if we let Xn be the number of backlogged packets at the beginning of the n-th time slot, An the number of new packets on the same time slot, and Sn the number of selected packets, then
  X_{n+1} = max(Xn + An − 1, 0), if Sn = 1; X_{n+1} = Xn + An, otherwise.
We assume that the An are i.i.d. random variables with some common distribution:
  P(An = k) = λk, k ≥ 0,

69 . We here only restrict ourselves to the computation of this expected change. So we can make q x G(x) as small as we like by choosing x sufficiently large. we should not subtract 1. we must have no new packets arriving. selections are made amongst the Xn + An packets. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i.) = x + a k x−k p q . and so q x tends to 0 much faster than G(x) tends to ∞. An = a. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). we have ∆(x) = (−1)px. Xn−1 .x−1 + k≥1 kpx. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). The Aloha protocol is unstable. To compute px. then the chain is transient. i. . To compute px. We also let µ := k≥1 kλk be the expectation of An . where G(x) is a function of the form α + βx . For x > 0. An−1 . we think as follows: to have a reduction of x by 1. . . if Xn + An = 0. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. Since p > 0. So: px. The process (Xn . k 0 ≤ k ≤ x + a. The main result is: Proposition 6. Idea of proof. and p. we have that the drift is positive for all large x and this proves transience.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . The probabilities px. The random variable Sn has.x−1 when x > 0. Indeed. the Markov chain (Xn ) is transient. where q = 1 − p. So P (Sn = k|Xn = x. we have q < 1. Therefore ∆(x) ≥ µ/2 for all large x.e. Proving this is beyond the scope of the lectures. and exactly one selection: px.x are computed by subtracting the sum of the above from 1.e.x−1 = P (An = 0.x+k for k ≥ 1.where λk ≥ 0 and k≥0 kλk = 1.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . n ≥ 0) is a Markov chain. Unless µ = 0 (which is a trivial case: no arrivals!). Sn = 1|Xn = x) = λ0 xpq x−1 . so q x G(x) tends to 0 as x → ∞. The reason we take the maximum with 0 in the above equation is because. Let us find its transition probabilities. defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. for example we can make it ≤ µ/2. k ≥ 1.

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The web graph is a graph with a set of vertices V consisting of web pages and edges representing hyperlinks: if page i has a link to page j, then we put an arrow from i to j and consider it as a directed edge of the graph. For each page i let L(i) be the set of pages it links to. If L(i) = ∅, then i is a 'dangling page'. Define transition probabilities as follows:
  pij = 1/|L(i)|, if j ∈ L(i); pij = 1/|V|, if L(i) = ∅; pij = 0, otherwise.
So if Xn = i is the current location of the chain, the next location X_{n+1} is picked uniformly at random amongst all pages that i links to, unless i is a dangling page, in which case the chain moves to a random page.

It is not clear that the chain is irreducible, nor that it is aperiodic. So we make the following modification. We pick α between 0 and 1 (typically, α ≈ 0.2) and let
  p̃ij := (1 − α) pij + α/|V|.
These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored — and this happens with probability α — in which case he clicks on a random page.

Google computes the stationary distribution π corresponding to the Markov chain with transition matrix P = [p̃ij]: π = πP. Then it says: page i has higher rank than j if π(i) > π(j). The idea is simple; its implementation, though, is one of the largest matrix computations in the world, because of the size of the problem. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by
  π(k) = π(k−1) P,
and it uses advanced techniques from linear algebra to speed up computations.

Comments:
1. In an Internet-dominated society, the rank of a page may be very important for a company's profits. Therefore, there are companies that sell high-rank pages: all you need to do to get high rank is pay (a lot of money).
2. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their faculty from each other (and from themselves).
4. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality, but, from the point of view of an administrator, it makes the evaluation procedure easy. What is important is that the evaluation can be done without ever looking at the content of one's research.
5. Needless to say, these uses of PageRank and PageRank-like systems represent corruption at the deepest level. But it is a kind of corruption that is hard to trace, since it is unanimously equated with modern use of Information Technology.
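A minimal sketch of the iteration π(k) = π(k−1)P on a toy web graph (the four-page link structure below is our own illustration; page 3 is dangling):

import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: []}
N, alpha = 4, 0.2
P = np.zeros((N, N))
for i, L in links.items():
    P[i, :] = 1 / N if not L else 0.0   # dangling page: jump anywhere
    for j in L:
        P[i, j] = 1 / len(L)
P = (1 - alpha) * P + alpha / N         # the 'bored surfer' modification

pi = np.full(N, 1 / N)                  # initial guess pi(0)
for _ in range(100):
    pi = pi @ P                         # pi(k) = pi(k-1) P
print(pi)                               # PageRank scores; they sum to 1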

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. An individual can be of type AA (meaning that all its alleles are A), or aa, or Aa. This means that the external appearance of a characteristic is a function of the pair of alleles. When two individuals mate, their offspring inherits an allele from each parent; thus, two AA individuals will produce an AA child, but an AA mating with an Aa may produce a child of type AA or one of type Aa.

Imagine that we have a population of N individuals who mate at random, producing a number of offspring. We will assume that a child chooses an allele from each parent at random. We will also assume that the parents are replaced by the children, and the process repeats, generation after generation. We would like to study the genetic diversity, meaning how many characteristics are maintained.

Since we don't care to keep track of the individuals' identities, the genetic diversity depends only on the total number of alleles of each type; all we need is the information outlined in the figure below. So let x be the number of alleles of type A (so 2N − x is the number of those of type a). To simplify, we will assume that, at the next generation, the number of alleles of type A can be found as follows: Pick an allele at random; this allele is of type A with probability x/2N. Maintain this allele for the next generation, and do the same thing 2N times, independently. Alleles that are never picked may belong to individuals who never mate, or may belong to individuals who mate but whose offspring did not inherit this allele.

[Figure: 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice; one of the square alleles is selected 4 times; three of the square alleles are never selected. And so on.]

We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. We have a transition from x to y if y alleles of type A are produced. Each allele of type A is selected with probability x/2N. Since selections are independent,

we see that
  p_{xy} = \binom{2N}{y} (x/2N)^y (1 − x/2N)^{2N−y}, 0 ≤ y ≤ 2N.
Notice that there is a chance (= (1 − x/2N)^{2N}) that no allele of type A is selected at all, in which case the population becomes one consisting of alleles of type a only; similarly, there is a chance (= (x/2N)^{2N}) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, p00 = p_{2N,2N} = 1, so both 0 and 2N are absorbing states. Since p_{xy} > 0 for all 1 ≤ x, y ≤ 2N − 1, the states {1, ..., 2N − 1} form a communicating class of inessential states. Therefore
  P(Xn = 0 or Xn = 2N for all large n) = 1,
i.e. the chain gets absorbed by either 0 or 2N. Let M = 2N. (The fact that M is even is irrelevant for the mathematical model.) We would like to know the probability
  ϕ(x) = Px(T0 > TM) = P(T0 > TM | X0 = x)
that the chain will eventually be absorbed by state M, where, as usual, Tx := inf{n ≥ 1 : Xn = x}.

Another way to represent the chain is as follows:
  X_{n+1} = ξ_1^n + · · · + ξ_M^n,
where, conditionally on (X0, ..., Xn), the random variables ξ_1^n, ..., ξ_M^n are i.i.d. with
  P(ξ_k^n = 1 | Xn, ..., X0) = Xn/M, P(ξ_k^n = 0 | Xn, ..., X0) = 1 − Xn/M.
Therefore
  E(X_{n+1} | Xn, ..., X0) = E(X_{n+1} | Xn) = M · (Xn/M) = Xn,
and thus E X_{n+1} = E Xn for all n ≥ 0: on the average, the expectation doesn't change and, on the average, the number of alleles of type A won't change. One can argue that, since at "the end of time" the chain (started from x) will be either at M with probability ϕ(x) or at 0 with probability 1 − ϕ(x), we should have
  M × ϕ(x) + 0 × (1 − ϕ(x)) = x, i.e. ϕ(x) = x/M.
The result is correct, but the argument needs further justification, because "the end of time" is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. These are:
  ϕ(x) = ∑_{y=0}^{M} p_{xy} ϕ(y), ϕ(0) = 0, ϕ(M) = 1.
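Before the verification, a quick simulation of the claim ϕ(x) = x/M (the population size and run counts below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
M = 20
for x0 in (5, 10, 15):
    hits = 0
    for _ in range(20000):
        x = x0
        while 0 < x < M:
            x = rng.binomial(M, x / M)   # next generation's count of A alleles
        hits += (x == M)
    print(x0, hits / 20000, x0 / M)      # estimate vs. the claim x/M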

The first two are obvious. For the third, the right-hand side equals, with ϕ(y) = y/M,
  ∑_{y=0}^{M} \binom{M}{y} (x/M)^y (1 − x/M)^{M−y} (y/M)
  = (x/M) ∑_{y=1}^{M} \binom{M−1}{y−1} (x/M)^{y−1} (1 − x/M)^{(M−1)−(y−1)} = x/M,
as needed, where we used y \binom{M}{y} = M!/( (y−1)! (M−y)! ) = M \binom{M−1}{y−1} and the binomial theorem.

24.5 A storage or queueing application

A storage facility stores, say, Xn electronic components at time n. A number An of new components are ordered and added to the stock, while a number of them are sold to the market. If the demand is for Dn components, the demand is met instantaneously, provided that enough components are available; otherwise, the demand is only partially satisfied and the stock reaches level zero at time n + 1. We can express these requirements by the recursion
  X_{n+1} = (Xn + An − Dn) ∨ 0,
where a ∨ b := max(a, b). Here are our assumptions: the pairs of random variables ((An, Dn), n ≥ 0) are i.i.d., and
  E(An − Dn) = −µ < 0,
i.e., on the average, the demand is larger than the supply per unit of time. We have that (Xn, n ≥ 0) is a Markov chain; we will show that this Markov chain is positive recurrent. Let ξn := An − Dn, so that
  X_{n+1} = (Xn + ξn) ∨ 0, for all n ≥ 0.

Since the Markov chain is represented by this very simple recursion, we will try to solve the recursion; we will apply the so-called Loynes' method. Working for n = 1, 2, 3, ..., we find
  X1 = (X0 + ξ0) ∨ 0
  X2 = (X1 + ξ1) ∨ 0 = (X0 + ξ0 + ξ1) ∨ ξ1 ∨ 0
  X3 = (X2 + ξ2) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2) ∨ (ξ1 + ξ2) ∨ ξ2 ∨ 0
and, in general,
  Xn = (X0 + ξ0 + · · · + ξ_{n−1}) ∨ (ξ1 + · · · + ξ_{n−1}) ∨ · · · ∨ ξ_{n−1} ∨ 0.
The correctness of the latter can formally be proved by showing that it does satisfy the recursion.

Thus, the event whose probability we want to compute is a function of (ξ0, ξ1, ..., ξ_{n−1}). Notice, however, that the random variables (ξn) are i.i.d., so we can shuffle them in any way we like without altering their joint distribution; in particular,
  (ξ0, ξ1, ..., ξ_{n−1}) =d (ξ_{n−1}, ..., ξ1, ξ0).
Hence, in computing the distribution of Xn, which depends on ξ0, ..., ξ_{n−1}, instead of considering them in their original order, we reverse it: we may replace the latter random variables by ξ_{n−1}, ..., ξ0. Let us consider
  g^{(n)}_{xy} := Px(Xn > y) = P( (x + ξ0 + · · · + ξ_{n−1}) ∨ (ξ1 + · · · + ξ_{n−1}) ∨ · · · ∨ ξ_{n−1} ∨ 0 > y ).
By the reversal,
  g^{(n)}_{xy} = P( (x + ξ_{n−1} + · · · + ξ0) ∨ (ξ_{n−2} + · · · + ξ0) ∨ · · · ∨ ξ0 ∨ 0 > y ).
Let
  Zn := ξ0 + · · · + ξ_{n−1}, Mn := Zn ∨ Z_{n−1} ∨ · · · ∨ Z1 ∨ 0.
Then g^{(n)}_{0y} = P(Mn > y), and, since M_{n+1} ≥ Mn, we have g^{(n)}_{0y} ≤ g^{(n+1)}_{0y}. The limit of a bounded and increasing sequence certainly exists:
  g(y) := lim_{n→∞} g^{(n)}_{0y} = lim_{n→∞} P0(Xn > y).
We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations, for y ≥ 0. Indeed,
  M_{n+1} = 0 ∨ max_{1≤j≤n+1} Zj =d 0 ∨ ( max_{1≤j≤n+1} (Zj − Z1) + ξ0 ) = 0 ∨ (Mn + ξ0),
where the equality in distribution follows from the fact that (Zj − Z1, 2 ≤ j ≤ n+1) has the same law as (Zj, 1 ≤ j ≤ n) and is independent of ξ0 = Z1. Hence
  P(M_{n+1} = y) = P(0 ∨ (Mn + ξ0) = y) = ∑_{x≥0} P(Mn = x) P(0 ∨ (x + ξ0) = y) = ∑_{x≥0} P(Mn = x) p_{xy}.
Since the n-dependent terms are bounded, we may take the limit of both sides as n → ∞ and, using π(y) = lim_{n→∞} P(Mn = y), we find
  π(y) = ∑_{x≥0} π(x) p_{xy}.

So π satisfies the balance equations. We must make sure that it also satisfies ∑_y π(y) = 1. This will be the case if g(y) is not identically equal to 1, i.e. if
  P( lim_{n→∞} Mn = 0 ∨ max{Z1, Z2, ...} < ∞ ) = 1.
Here we will use the assumption that Eξn = −µ < 0. From the Law of Large Numbers we have
  P( lim_{n→∞} Zn/n = −µ ) = 1.
This implies that P(lim_{n→∞} Zn = −∞) = 1. But then
  P( 0 ∨ max{Z1, Z2, ...} < ∞ ) = 1,
so g(y) = P(0 ∨ max{Z1, Z2, ...} > y) is not identically equal to 1, as required. Hence the Markov chain is positive recurrent.

It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. The case Eξn = 0 is delicate: depending on the actual distribution of ξn, the Markov chain may be transient or null recurrent.
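Loynes' identity Xn =d Mn (for X0 = 0) can be checked by simulation; the increment distribution below is an illustrative choice with negative mean.

import numpy as np

rng = np.random.default_rng(2)
n, runs = 200, 20000
xs, ms = [], []
for _ in range(runs):
    xi = rng.choice([-1, 0, 1], size=n, p=[0.5, 0.3, 0.2])  # E xi = -0.3 < 0
    x = 0
    for z in xi:
        x = max(x + z, 0)                   # the storage recursion
    xs.append(x)
    ms.append(max(0, np.max(np.cumsum(xi))))  # M_n = 0 v max Z_j
print(np.mean(xs), np.mean(ms))             # close: X_n and M_n agree in law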

and space-homogeneity. given a function p(x). . More general spaces (e. x ∈ Zd . p(x) = 0. . . if x = −ej . A random walk of this type is called simple random walk or nearest neighbour random walk. where p1 + · · · + pd + q1 + · · · + qd = 1. For example. Example 1: Let p(x) = C2−|x1 |−···−|xd | . j = 1. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. . if x ∈ {±e1 . .y = px+z. we need to give more structure to the state space S. d  p(x) = qj . a structure that enables us to say that px. 0. . so far. the skip-free property. d   0. j = 1. where C is such that x p(x) = 1. have been processes with timehomogeneous transition probabilities. this walk enjoys. otherwise where p + q = 1. p(−1) = q. then p(1) = p.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk.y = p(y − x). that of spatial homogeneity. In dimension d = 1. .y+z for any translation z. namely. . the set of d-dimensional vectors with integer coordinates. we won’t go beyond Zd . Our Markov chains. . a random walk is a Markov chain with time. ±ed } otherwise 76 . if x = ej . Example 3: Define p(x) = 1 2d . So. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. . we can let S = Zd . Define  pj . A random walk possesses an additional property.g. . This means that the transition probability pxy should depend on x and y only through their relative positions in space.and space-homogeneous transition probabilities given by px. otherwise. groups) are allowed. besides time. here. but. If d = 1. This is the drunkard’s walk we’ve seen before. . To be able to say this.

p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). The special structure of a random walk enables us to give more detailed formulae for its behaviour. we will mean a sum over all possible values.i.d. But this would be nonsense. in general. Indeed. Obviously. . in addition. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. . So. then p(1) = p(−1) = 1/2. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . x + ξ2 = y) = y p(x)p(y − x). random variables in Zd with common distribution P (ξn = x) = p(x). . p(x) = 0. Furthermore. . . If d = 1. otherwise. x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. p′ (x). because all we have done is that we dressed the original problem in more fancy clothes. . p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . (Recall that δz is a probability distribution that assigns probability 1 to the point z.) In this notation. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). computing the n-th order convolution for all n is a notoriously hard problem.) Now the operation in the last term is known as convolution. We refer to ξn as the increments of the random walk. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). Let us denote by Sn the state of the walk at time n. which. One could stop here and say that we have a formula for the n-step transition probability. . x ± ed . The convolution of two probabilities p(x). p ∗ δz = p. and this is the symmetric drunkard’s walk. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . y p(x)p′ (y − x). where ξ1 . We shall resist the temptation to compute and look at other 77 . notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x.This is a simple random walk. we can reduce several questions to questions about this random walk. We will let p∗n be the convolution of p with itself n times. are i. the initial state S0 is independent of the increments. x ∈ Zd . (When we write a sum without indicating explicitly the range of the summation variable. ξ2 . We call it simple symmetric random walk. Furthermore.

+1. After all. 2 where #A is the number of elements of A. ξn ). 1}n . So if the initial position is S0 = 0. Look at the distribution of (ξ1 . who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. + (1.2) where the ξn are i.0) We can thus think of the elements of the sample space as paths. . a path of length n. A ⊂ {−1. If yes. In the sequel. Clearly. to each initial position and to each sequence of n signs. +1}n .1) + − (5. a random vector with values in {−1.2) (4. . 1) → (4. and so #A P (A) = n .methods. and the sequence of signs is {+1. Here are a couple of meaningful questions: A. If not. . random variables (random signs) with P (ξn = ±1) = 1/2. how long will that take? C. consider the following: 78 . Events A that depend only on these the first n random signs are subsets of this sample space. all possible values of the vector are equally likely. 0) → (1.i. +1}n . where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. Will the random walk ever return to where it started from? B.i+1 = pi. As a warm-up. Therefore. −1} then we have a path in the plane that joins the points (0. +1}n . .d. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . 1): (2.i−1 = 1/2 for all i ∈ Z. −1. 2) → (3. So. we have P (ξ1 = ε1 . if we can count the number of elements of A we can compute its probability. .1) + (0. . The principle is trivial. for fixed n. ξn = εn ) = 2−n . we are dealing with a uniform distribution on the sample space {−1. we should assign. +1. . .1) − (3. 1) → (2. First. Since there are 2n elements in {0. we shall learn some counting methods. 2) → (5. but its practice may be hard.

) The darker region shows all paths that start at (0. S4 = −1. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. i). x) (n. y)} = n−m .Example: Assume that S0 = 0.047. x) and end up at (n. 0) and end up at the specific point (n. i). 0) and end up at (n. Proof. Let us first show that Lemma 14. (n − m + y − x)/2 (23) Remark. In if n + i is not an even number then there is no path that starts from (0. We can easily count that there are 6 paths in A. i)} = n n+i 2 More generally.0) The lightly shaded region contains all paths of length n that start at (0.i) (0. y) is given by: #{(0. The paths comprising this event are shown below. in this case. (n. 0) and ends at (n. S7 < 0}. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. 0) (n. 0)). We use the notation {(m. Hence (n+i)/2 = 0. 0) and n ends at (n. y)} =: {all paths that start at (m. S6 < 0. x) and end up at (n. Consider a path that starts from (0. x) (n.1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. n 79 . i). and compute the probability of the event A = {S2 ≥ 0. (There are 2n of them. the number of paths that start from (m. y)}. y) is given by: #{(m. The number of paths that start from (0.

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation
  {(m, x) → (n, y)} := {all paths that start at (m, x) and end up at (n, y)}.
[Figure: the lightly shaded region contains all paths of length n that start at (0, 0); the darker region shows all those that end up at the specific point (n, i).]

Let us first show that:

Lemma 14. The number of paths that start from (0, 0) and end up at (n, i) is given by:
  #{(0, 0) → (n, i)} = \binom{n}{(n+i)/2}.
More generally, the number of paths that start from (m, x) and end up at (n, y) is given by:
  #{(m, x) → (n, y)} = \binom{n−m}{(n−m+y−x)/2}.   (23)

Remark. If n + i is not an even number, then there is no path that starts from (0, 0) and ends at (n, i); in this case the binomial coefficient is defined to be zero (see the proof).

Proof. Consider a path that starts from (0, 0) and ends at (n, i). Let u be the number of +'s in this path (the number of upward segments) and d the number of −'s (the number of downward segments). Then
  u + d = n, u − d = i,
hence u = (n + i)/2 and d = (n − i)/2. So, if we know where the path starts from and where it ends at, we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? In exactly \binom{n}{u} ways, and this proves the formula. Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently, if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i); for example, there is no path that goes from (0, 0) to (3, 2). We shall thus define \binom{n}{u} to be zero if u is not an integer. More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23).

26.2 Paths that start and end at specific points and do not touch zero at all

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) → (n, y) that never become zero (i.e. they do not touch the horizontal axis) is given by:
  #{(m, x) → (n, y); remain > 0} = \binom{n−m}{\frac{1}{2}(n−m+y−x)} − \binom{n−m}{\frac{1}{2}(n−m−y−x)}.   (24)

Proof. [Figure: the shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) and, in addition, never touch the horizontal axis.]

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick,

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y), and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y:
  #{(m, x) → (n, y); touch zero} = #{(m, x) → (n, −y)} = \binom{n−m}{\frac{1}{2}(n−m−y−x)}.
So, for x > 0, y > 0,
  #{(m, x) → (n, y); remain > 0} = #{(m, x) → (n, y)} − #{(m, x) → (n, y); touch zero}
  = \binom{n−m}{\frac{1}{2}(n−m+y−x)} − \binom{n−m}{\frac{1}{2}(n−m−y−x)}.
Corollary 14. For any y > 0,
  #{(1, 1) → (n, y); remain > 0} = (y/n) #{(0, 0) → (n, y)}.   (25)

Proof. Let m = 1, x = 1 in (24):
  #{(1, 1) → (n, y); remain > 0} = \binom{n−1}{\frac{1}{2}(n−1+y)−\frac{1}{2}} − \binom{n−1}{\frac{1}{2}(n−y)−1}
  = \frac{(n−1)!}{(\frac{n+y}{2}−1)! \, (\frac{n−y}{2})!} − \frac{(n−1)!}{(\frac{n+y}{2})! \, (\frac{n−y}{2}−1)!}
  = \frac{n!}{(\frac{n+y}{2})! \, (\frac{n−y}{2})!} \Big( \frac{1}{n}\,\frac{n+y}{2} − \frac{1}{n}\,\frac{n−y}{2} \Big)
  = \frac{y}{n} \binom{n}{\frac{n+y}{2}},
and the latter term equals the total number of paths from (0, 0) to (n, y).

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by:
  #{(m, x) → (n, y); remain > z} = \binom{n−m}{\frac{1}{2}(n−m+y−x)} − \binom{n−m}{\frac{1}{2}(n−m−y−x)+z}.   (26)

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have
  #{(m, x) → (n, y); remain > z} = #{(m, x−z) → (n, y−z); remain > 0},
and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by:
  #{(0, 0) → (n, y); remain ≥ 0} = \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2}+1}.   (27)

Proof.
  #{(0, 0) → (n, y); remain ≥ 0} = #{(0, 0) → (n, y); remain > −1} = #{(0, 1) → (n, y+1); remain > 0}
  = \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n−y}{2}−1} = \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2}+1},
where we used (26) with z = −1.

Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by:
  #{(0, 0) → (n, 0); remain ≥ 0} = \frac{1}{n+1} \binom{n+1}{n/2}.

Proof. Apply (27) with y = 0:
  #{(0, 0) → (n, 0); remain ≥ 0} = \binom{n}{n/2} − \binom{n}{\frac{n}{2}+1} = \frac{1}{n+1} \binom{n+1}{n/2},
where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by:
  #{(0, 0) → (n, •); remain ≥ 0} = \binom{n}{\lceil n/2 \rceil},   (28)
where ⌈x⌉ denotes the smallest integer that is ≥ x.

Proof. To see this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero), and so we have
  #{(0, 0) → (2m, •); remain ≥ 0} = ∑_{z≥0} \Big[ \binom{2m}{m+z} − \binom{2m}{m+z+1} \Big]
  = \binom{2m}{m} − \binom{2m}{m+1} + \binom{2m}{m+1} − \binom{2m}{m+2} ± · · · + \binom{2m}{2m−1} − \binom{2m}{2m} + \binom{2m}{2m}
  = \binom{2m}{m} = \binom{n}{n/2},
because all the terms cancel with one another and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero), and so we have
  #{(0, 0) → (2m+1, •); remain ≥ 0} = ∑_{z≥0} \Big[ \binom{2m+1}{m+z+1} − \binom{2m+1}{m+z+2} \Big]
  = \binom{2m+1}{m+1} = \binom{n}{\lceil n/2 \rceil},
by the same telescoping.

[Figure: two shaded regions of paths of length 2m.] We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?
27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps

Lemma 16. For y such that n + y is even,
  P0(Sn = y) = \binom{n}{(n+y)/2} 2^{−n}.

More generally,
  P(Sn = y | Sm = x) = \binom{n−m}{\frac{1}{2}(n−m+y−x)} 2^{−(n−m)},
and, in particular, if n is even,
  u(n) := P0(Sn = 0) = \binom{n}{n/2} 2^{−n}.   (29)

Proof. We have
  P0(Sn = y) = #{(0, 0) → (n, y)} × 2^{−n} and P(Sn = y | Sm = x) = #{(m, x) → (n, y)} × 2^{−(n−m)},
and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n is given by
  P0(S1 > 0, S2 > 0, ..., Sn > 0 | Sn = y) = y/n.   (30)

Proof. The left side of (30) equals
  #{paths (1, 1) → (n, y) that do not touch zero} × 2^{−n} / ( #{paths (0, 0) → (n, y)} × 2^{−n} )
  = \Big[ \binom{n−1}{\frac{1}{2}(n−1+y−1)} − \binom{n−1}{\frac{1}{2}(n−1−y−1)} \Big] / \binom{n}{\frac{1}{2}(n+y)} = y/n,
where the calculation is as in the proof of (25).

This is called the ballot theorem, and the reason will be given in Section 29. Such a simple answer begs for a physical/intuitive explanation; but, for now, we shall just compute it.
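The ballot theorem is another statement that exhaustive enumeration confirms immediately; the sketch below checks (30) for n = 8 and each admissible y > 0.

from itertools import product

n = 8
for y in (2, 4, 6, 8):
    tot = pos = 0
    for signs in product((-1, 1), repeat=n):
        S, good = 0, True
        for e in signs:
            S += e
            good = good and S > 0
        if S == y:
            tot += 1
            pos += good
    print(y, pos / tot, y / n)   # the two columns agree exactly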

27.3 Some conditional probabilities

Lemma 17. If n, y ≥ 0 are both even or both odd, then
  P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = 1 − P0(Sn = y + 2)/P0(Sn = y) = \frac{y+1}{\frac{1}{2}(n+y)+1}.
In particular, if n is even,
  P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = 0) = \frac{1}{(n/2)+1}.

Proof. We use (27):
  P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = #{(0, 0) → (n, y); remain ≥ 0} / #{(0, 0) → (n, y)}
  = \Big[ \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2}+1} \Big] / \binom{n}{\frac{n+y}{2}} = 1 − \frac{P0(Sn = y + 2)}{P0(Sn = y)}.
This is further equal to
  1 − \frac{n!}{(\frac{n+y}{2}+1)! \, (\frac{n−y}{2}−1)!} \cdot \frac{(\frac{n+y}{2})! \, (\frac{n−y}{2})!}{n!} = 1 − \frac{\frac{n−y}{2}}{\frac{n+y}{2}+1} = \frac{y+1}{\frac{n+y}{2}+1}.

27.4 Some remarkable identities

Theorem 26. If n is even,
  P0(S1 ≥ 0, ..., Sn ≥ 0) = P0(S1 ≠ 0, ..., Sn ≠ 0) = P0(Sn = 0).

Proof. For even n we have, using (28) and (29),
  P0(S1 ≥ 0, ..., Sn ≥ 0) = #{(0, 0) → (n, •); remain ≥ 0} × 2^{−n} = \binom{n}{n/2} 2^{−n} = P0(Sn = 0).
We next have
  P0(S1 ≠ 0, ..., Sn ≠ 0)
  (a)= P0(S1 > 0, ..., Sn > 0) + P0(S1 < 0, ..., Sn < 0)
  (b)= 2 P0(S1 > 0, ..., Sn > 0)
  (c)= 2 P0(ξ1 = 1) P0(1 + ξ2 > 0, 1 + ξ2 + ξ3 > 0, ..., 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1)
  (d)= P0(1 + ξ2 > 0, 1 + ξ2 + ξ3 > 0, ..., 1 + ξ2 + · · · + ξn > 0)
  (e)= P0(ξ2 ≥ 0, ξ2 + ξ3 ≥ 0, ..., ξ2 + · · · + ξn ≥ 0)
  (f)= P0(S1 ≥ 0, ..., S_{n−1} ≥ 0)
  (g)= P0(S1 ≥ 0, ..., Sn ≥ 0),
where (a) is obvious (the walk cannot change sign without passing through 0), (b) follows from symmetry, (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1, (d) follows from P0(ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning, (e) follows from the fact that ξ1 is independent of ξ2, ξ3, ... together with the fact that, if m is an integer, then 1 + m > 0 is equivalent to m ≥ 0, (f) follows from the replacement of (ξ2, ξ3, ...) by (ξ1, ξ2, ...), and finally (g) is trivial, because S_{n−1} ≥ 0 means S_{n−1} > 0 (S_{n−1} cannot take the value 0, since n − 1 is odd), and S_{n−1} > 0 implies Sn ≥ 0.

27.5 First return to zero

In (29), we computed, for even n, the probability u(n) that the walk (started at 0) attains value 0 in n steps. We now want to find the probability f(n) that the walk will return to 0 for the first time in n steps.

Theorem 27. Let n be even, u(n) = P0(Sn = 0), and
  f(n) := P0(S1 ≠ 0, ..., S_{n−1} ≠ 0, Sn = 0).
Then
  f(n) = u(n)/(n − 1).

Proof. Since the random walk cannot change sign before becoming zero,
  f(n) = P0(S1 > 0, ..., S_{n−1} > 0, Sn = 0) + P0(S1 < 0, ..., S_{n−1} < 0, Sn = 0) = 2 P0(S1 > 0, ..., S_{n−1} > 0, Sn = 0).
Now, in the last event, S_{n−1} > 0 and Sn = 0 together mean S_{n−1} = 1. By the Markov property at time n − 1,
  P0(S1 > 0, ..., S_{n−2} > 0, S_{n−1} = 1, Sn = 0) = P0(S1 > 0, ..., S_{n−2} > 0, S_{n−1} = 1) · P(Sn = 0 | S_{n−1} = 1),
and the last term is 1/2. So
  f(n) = P0(S_{n−1} = 1) · P0(S1 > 0, ..., S_{n−2} > 0 | S_{n−1} = 1).
The last probability can be computed from the ballot theorem — see (30) applied with n − 1 in place of n and y = 1: it is equal to 1/(n − 1). Also, given Sn = 0, we either have S_{n−1} = 1 or −1, with equal probability, so u(n) = P0(Sn = 0) = P0(S_{n−1} = 1). Hence
  f(n) = u(n)/(n − 1).

28 The reflection principle for a simple random walk in dimension 1

The reflection principle says the following. Fix a state a and put a two-sided mirror at a. If a particle is to the right of a, then you see its image on the left, and if the particle is to the left of a, then its mirror image is on the right. So if the particle is in position s, then its mirror image is in position s̃ = 2a − s, because s − a = a − s̃. It is clear that if a particle starts at 0 and performs a simple symmetric random walk, then its mirror image starts at 2a and performs a simple symmetric random walk. This is not so interesting.

What is more interesting is this: Run the particle till it hits a (it will, for sure — Theorem 35), and after that consider its mirror image S̃n = 2a − Sn. In other words, define
  S̃n := Sn, for n < Ta; S̃n := 2a − Sn, for n ≥ Ta,
where Ta = inf{n ≥ 0 : Sn = a} is the first hitting time of a.

Theorem 28. Let (Sn, n ≥ 0) be a simple symmetric random walk and a > 0. If (Sn) is a simple symmetric random walk started at 0, then so is (S̃n).

Proof. By the strong Markov property, the process (S_{Ta+m} − a, m ≥ 0) is a simple symmetric random walk, independent of the past before Ta. But a symmetric random walk has the same law as its negative. So Sn − a can be replaced by a − Sn after time Ta (note that, for n ≥ Ta, S̃n = a + (a − Sn)), and the resulting process, called (S̃n, n ∈ Z+), has the same law as (Sn, n ∈ Z+).

28.1 Distribution of hitting time and maximum

We can take advantage of this principle to establish the distribution function of Ta:

Theorem 29. Let a > 0. Then
  P0(Ta ≤ n) = 2 P0(Sn ≥ a) − P0(Sn = a) = P0(|Sn| ≥ a) − \frac{1}{2} P0(|Sn| = a).

Proof. If Sn ≥ a, then Ta ≤ n. Trivially, then,
  P0(Sn ≥ a) = P0(Ta ≤ n, Sn ≥ a).
In the last event, we have n ≥ Ta, so we may replace Sn by 2a − Sn (Theorem 28):
  P0(Ta ≤ n, Sn ≥ a) = P0(Ta ≤ n, 2a − Sn ≥ a) = P0(Ta ≤ n, Sn ≤ a).
But
  P0(Ta ≤ n) = P0(Ta ≤ n, Sn ≤ a) + P0(Ta ≤ n, Sn > a) = P0(Ta ≤ n, Sn ≤ a) + P0(Sn > a),
where the last equality follows from the fact that Sn > a implies Ta ≤ n. Combining the last two displays gives
  P0(Ta ≤ n) = P0(Sn ≥ a) + P0(Sn > a) = P0(Sn ≥ a) + P0(Sn ≥ a) − P0(Sn = a).
Finally, observing that Sn has the same distribution as −Sn, we get that this is also equal to P0(|Sn| ≥ a) − \frac{1}{2} P0(|Sn| = a).

Define now the running maximum
  Mn = max(S0, S1, ..., Sn).
Trivially, Mn ≥ x if and only if Sk ≥ x for some k ≤ n, which is equivalent to Tx ≤ n. Hence, as a corollary to the above result, we obtain:

Corollary 19.
  P0(Mn ≥ x) = P0(Tx ≤ n) = 2 P0(Sn ≥ x) − P0(Sn = x) = P0(|Sn| ≥ x) − \frac{1}{2} P0(|Sn| = x).
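Theorem 29 is also amenable to brute-force verification over all 2^n paths; the values n = 10, a = 3 below are arbitrary.

from itertools import product

n, a = 10, 3
hit = ge = eq = 0
for signs in product((-1, 1), repeat=n):
    S, m = 0, 0
    for e in signs:
        S += e
        m = max(m, S)
    hit += (m >= a)        # T_a <= n  iff  M_n >= a
    ge += (S >= a)
    eq += (S == a)
print(hit, 2 * ge - eq)    # equal counts, as Theorem 29 predicts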

We can do more: we can derive the joint distribution of Mn and Sn:

Theorem 30. For x > y,
  P0(Mn < x, Sn = y) = P0(Sn = y) − P0(Sn = 2x − y).

Proof. Since Mn < x is equivalent to Tx > n, we have
  P0(Mn < x, Sn = y) = P0(Tx > n, Sn = y) = P0(Sn = y) − P0(Tx ≤ n, Sn = y).
If Tx ≤ n, then, by applying the reflection principle (Theorem 28), we can replace Sn by 2x − Sn in the last part of the last display and get
  P0(Tx ≤ n, Sn = y) = P0(Tx ≤ n, Sn = 2x − y).   (31)
But if x > y, then 2x − y > x, and so {Sn = 2x − y} ⊂ {Tx ≤ n}, which results into
  P0(Tx ≤ n, Sn = 2x − y) = P0(Sn = 2x − y).
This, combined with (31), gives the result.

29 Urns and the ballot theorem

Suppose that an urn⁸ contains n items, of which a are coloured azure and b black; n = a + b. A sample from the urn is a sequence η = (η1, ..., ηn), where ηi indicates the colour of the i-th item picked, during sampling without replacement. The set of values of η contains \binom{n}{a} elements, and η is uniformly distributed in this set. The original ballot problem, posed in 1887⁹, asks the following: if we start picking the items from the urn one by one without replacement, what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved¹⁰ in the same year. The answer is very simple:

Theorem 31. If an urn contains a azure items and b = n − a black items, where a ≥ b, then the probability that, during sampling without replacement, the number of selected azure items is always ahead of the black ones equals (a − b)/n.

⁸ An urn is a vessel, usually a vase furnished with a foot or pedestal, employed for different purposes, as for holding liquids, ornamental objects, the ashes of the dead after cremation, and anciently for holding ballots to be drawn. It is the latter use we are interested in in Probability and Statistics.
⁹ Bertrand, J. (1887). Solution d'un problème. Compt. Rend. Acad. Sci. Paris, 105, 369.
¹⁰ André, D. (1887). Solution directe du problème résolu par M. Bertrand. Compt. Rend. Acad. Sci. Paris, 105, 436-437.

We shall call (η1, ..., ηn) an urn process with parameters n, a, always assuming (without any loss of generality) that a ≥ b = n − a. An urn process can be realised quite easily by a random walk:

Theorem 32. Consider a simple symmetric random walk (Sn, n ≥ 0) with values in Z and let y be a positive integer. Then, conditional on the event {Sn = y}, the sequence (ξ1, ..., ξn) of the increments is an urn process with parameters n and a = (n + y)/2.

Proof. The values of each ξi are ±1, so 'azure' items are +1 and 'black' items are −1. If Sn = y, then, letting a, b be defined by a + b = n, a − b = y, we have that a out of the n increments will be +1 and b will be −1; so the number of possible values of (ξ1, ..., ξn) is \binom{n}{a}. It remains to show that, conditional on {Sn = y}, all values are equally likely. Let ε1, ..., εn be elements of {−1, +1} such that ε1 + · · · + εn = y. Then, using the definition of conditional probability and Lemma 16, we have:
  P0(ξ1 = ε1, ..., ξn = εn | Sn = y) = P0(ξ1 = ε1, ..., ξn = εn) / P0(Sn = y)
  = 2^{−n} / \Big( \binom{n}{(n+y)/2} 2^{−n} \Big) = 1 / \binom{n}{a}.
Thus, conditional on {Sn = y}, the sequence (ξ1, ..., ξn) is, indeed, uniformly distributed on a set with \binom{n}{a} elements.

Proof of Theorem 31. Since an urn with parameters n, a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y, where y = a − (n − a) = a − b, we can translate the original question into a question about the random walk. Since the 'azure' items correspond to + signs (respectively, the 'black' items correspond to − signs), the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0, ..., Sn > 0} that the random walk remains positive. But in (30) we computed that
  P0(S1 > 0, ..., Sn > 0 | Sn = y) = y/n.
Since y = a − b, the result follows.

30 The asymmetric simple random walk in dimension 1

Consider now a simple random walk with values in Z, but which is not necessarily symmetric. (The word "asymmetric" will always mean "not necessarily symmetric".) So
  p_{i,i+1} = p, p_{i,i−1} = q = 1 − p, i ∈ Z.
We have
  P0(Sn = k) = \binom{n}{\frac{n+k}{2}} p^{(n+k)/2} q^{(n−k)/2}, −n ≤ k ≤ n.

30.1 First hitting time analysis

We will use probabilistic and analytical methods to compute the distribution of the hitting times
  Tx = inf{n ≥ 0 : Sn = x}, x ∈ Z.
(We tacitly assume that S0 = 0.)

Lemma 18. For all x, y ∈ Z, and all t ≥ 0,
  Px(Ty = t) = P0(T_{y−x} = t).

Proof. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x, only the relative position of y with respect to x is needed. In view of this, it is no loss of generality to consider the random walk starting from 0.

Lemma 19. If x > 0, then the summands in Tx = T1 + (T2 − T1) + · · · + (Tx − T_{x−1}) are i.i.d. random variables. In particular, the process (Tx, x ≥ 0) is also a random walk.

Proof. To prove this, we make essential use of the skip-free property, namely the obvious fact that a simple random walk starting from 0 must pass through levels 1, 2, ..., x − 1 in order to reach x. The rest is a consequence of the strong Markov property.

Corollary 20. Let ϕ(s) := E0 s^{T1}. Then, for x > 0,
  E0 s^{Tx} = ϕ(s)^x.

We will approach the computation of ϕ using probability generating functions. In view of Lemma 19 and Corollary 20, we need to know the distribution of T1. This random variable takes values in {1, 3, 5, ...} ∪ {∞}.

Theorem 33. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level 1 is reached is given by
  ϕ(s) = E0 s^{T1} = \frac{1 − \sqrt{1 − 4pqs^2}}{2qs}.

Proof. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. If S1 = 1 then T1 = 1. If S1 = −1 then T1 is the sum of three times: one unit of time spent to go from 0 to −1, plus a time T1′ required to go from −1 to 0 for the first time, plus a further time T1′′ required to go from 0 to +1 for the first time. See figure below.

[Figure: a trajectory Sn with S1 = −1, showing the decomposition of T1 into the times T1′ and T1′′.]

Indeed, if S1 = −1 then, in order to hit 1, the walk must first hit 0 (it takes time T1′ to do this) and then 1 (it takes a further time T1′′). We thus have
  T1 = 1, if S1 = 1; T1 = 1 + T1′ + T1′′, if S1 = −1.
From the strong Markov property, T1′ is independent of T1′′, conditional on S1 = −1, and each has the same distribution as T1. This can be used as follows:
  ϕ(s) = E0[s 1(S1 = 1) + s^{1+T1′+T1′′} 1(S1 = −1)]
       = s P(S1 = 1) + s E0[s^{T1′} s^{T1′′} | S1 = −1] P(S1 = −1)
       = ps + qs ϕ(s)^2.
The formula is obtained by solving the quadratic.

Corollary 21. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level −1 is reached is given by
  E0 s^{T_{−1}} = \frac{1 − \sqrt{1 − 4pqs^2}}{2ps}.

Proof. The first time n such that Sn = −1 is the first time n such that −Sn = 1. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Hence the previous formula applies with the roles of p and q interchanged.

Corollary 22. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level x ∈ Z is reached is given by
  E0 s^{Tx} = (E0 s^{T1})^x = \Big( \frac{1 − \sqrt{1 − 4pqs^2}}{2qs} \Big)^x, if x > 0;
  E0 s^{Tx} = (E0 s^{T_{−1}})^{|x|} = \Big( \frac{1 − \sqrt{1 − 4pqs^2}}{2ps} \Big)^{|x|}, if x < 0.   (32)

Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21.
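The fixed-point equation ϕ = ps + qsϕ² and the closed form are easy to sanity-check numerically; the parameter values below are illustrative, and the Monte Carlo estimate of E0 s^{T1} uses a truncation whose contribution is negligible since s < 1.

import random
from math import sqrt

p, q, s = 0.6, 0.4, 0.9
phi = (1 - sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)
print(abs(phi - (p * s + q * s * phi**2)) < 1e-12)   # True: fixed point

random.seed(3)
est, N = 0.0, 100000
for _ in range(N):
    x, t = 0, 0
    while x < 1 and t < 10000:
        x += 1 if random.random() < p else -1
        t += 1
    est += s**t if x == 1 else 0.0
print(est / N, phi)   # the two values agree to Monte Carlo accuracy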

30.2 First return to the origin

Now let us ask: if the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time
  T0′ := inf{n ≥ 1 : Sn = 0}.
(Pay attention: it is important to write n ≥ 1 inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by
  E0 s^{T0′} = 1 − \sqrt{1 − 4pqs^2}.

Proof. We again do first-step analysis:
  E0 s^{T0′} = E0[s^{T0′} 1(S1 = −1)] + E0[s^{T0′} 1(S1 = 1)] = q E0[s^{T0′} | S1 = −1] + p E0[s^{T0′} | S1 = 1].
But if S1 = −1, then T0′ equals 1 plus the time required for the walk to hit 0 starting from −1; this is the same as the time required for the walk to hit 1 starting from 0. Similarly, if S1 = 1, then T0′ equals 1 plus a time whose distribution is that of the time required for the walk to hit −1 starting from 0. Hence
  E0 s^{T0′} = q E0(s^{1+T1}) + p E0(s^{1+T_{−1}}) = qs \frac{1 − \sqrt{1 − 4pqs^2}}{2qs} + ps \frac{1 − \sqrt{1 − 4pqs^2}}{2ps} = 1 − \sqrt{1 − 4pqs^2}.

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin. For a symmetric random walk, this was found in §27.5: P0(T0′ = n) = f(n) = u(n)/(n − 1), if n is even. For the general case, we have computed the probability generating function E0 s^{T0′} — see Theorem 34. The tool we need here is Taylor's theorem for the function f(x) = (1 + x)^α, where α is a real number. If α = n is a positive integer, then
  (1 + x)^n = ∑_{k=0}^{n} \binom{n}{k} x^k,
by the binomial theorem. If we define \binom{n}{k} to be equal to 0 for k > n, then we can omit the upper limit in the sum and simply write
  (1 + x)^n = ∑_{k≥0} \binom{n}{k} x^k.

Recall that

    C(n, k) = n!/(k!(n − k)!) = n(n − 1) · · · (n − k + 1)/k!.

It turns out that the formula is formally true for general α, and it looks exactly the same:

    (1 + x)^α = Σ_{k≥0} C(α, k) x^k,

as long as we define

    C(α, k) = α(α − 1) · · · (α − k + 1)/k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. For example, we will show that

    (1 − x)^{−1} = 1 + x + x² + x³ + · · · ,

which is of course the all-too-familiar geometric series. Indeed, by definition,

    (1 − x)^{−1} = Σ_{k≥0} C(−1, k) (−x)^k.

But

    C(−1, k) = (−1)(−2) · · · (−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,

so

    (1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} x^k,

as needed. As another example, let us compute √(1 − x). We have

    √(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2, k) (−1)^k x^k.

We can write this in more familiar symbols if we observe that

    C(1/2, k) (−1)^k = −(1/(2k − 1)) C(2k, k) 4^{−k},

and this is shown by direct computation. So

    √(1 − x) = −Σ_{k≥0} (1/(2k − 1)) C(2k, k) (x/4)^k = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) (x/4)^k.

We apply this to E_0 s^{T_0′} = 1 − √(1 − 4pqs²):

    E_0 s^{T_0′} = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^{2k}.
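The identity C(1/2, k)(−1)^k = −C(2k, k) 4^{−k}/(2k − 1), stated above as "shown by direct computation", can also be checked by machine in exact rational arithmetic. A sketch (ours; the range k ≤ 7 is arbitrary):

    from fractions import Fraction
    from math import comb

    def binom_alpha(alpha, k):
        """Generalised binomial coefficient alpha*(alpha-1)*...*(alpha-k+1) / k!."""
        out = Fraction(1)
        for j in range(k):
            out *= alpha - j
        for j in range(1, k + 1):
            out /= j
        return out

    half = Fraction(1, 2)
    for k in range(8):
        lhs = binom_alpha(half, k) * (-1) ** k
        rhs = -Fraction(1, 2 * k - 1) * comb(2 * k, k) * Fraction(1, 4 ** k)
        assert lhs == rhs, k
    print("identity verified for k = 0..7")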

On the other hand, by the definition of E_0 s^{T_0′},

    E_0 s^{T_0′} = Σ_{n≥0} P_0(T_0′ = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps) and then

    P_0(T_0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P_0(S_{2k} = 0).
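For small k this can be confirmed by brute force, enumerating all 2^{2k} paths. A sketch (ours; p = 0.3 is an arbitrary choice):

    from itertools import product
    from math import comb

    def first_return_prob(k, p):
        """P0(T0' = 2k) by exhaustive enumeration of paths of length 2k."""
        q, total = 1 - p, 0.0
        for steps in product((1, -1), repeat=2 * k):
            pos, hit = 0, False
            for i, step in enumerate(steps):
                pos += step
                if pos == 0:
                    hit = (i == 2 * k - 1)   # first visit to 0 is exactly at time 2k
                    break
            if hit:
                total += p ** steps.count(1) * q ** steps.count(-1)
        return total

    p = 0.3
    for k in (1, 2, 3, 4):
        exact = comb(2 * k, k) * p ** k * (1 - p) ** k / (2 * k - 1)
        print(k, first_return_prob(k, p), exact)   # each pair agrees exactly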

31 Recurrence properties of simple random walks

Consider again a simple random walk S_n with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

    P_0( lim_{n→∞} S_n/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so S_n tends to +∞ with probability one. Hence the walk is transient.
• If p < 1/2 then p − q < 0 and so S_n tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to Markov chain theory. The key fact needed here is

    P_0(T_x < ∞) = lim_{s↑1} E_0 s^{T_x},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

    1 − 4pq = 1 − 4p + 4p² = (1 − 2p)²,   whence   √(1 − 4pq) = |1 − 2p| = |p − q|.

So

    P_0(T_x < ∞) = ((1 − |p − q|)/(2q))^x        if x > 0,
    P_0(T_x < ∞) = ((1 − |p − q|)/(2p))^{|x|}    if x < 0.

This formula can be written more neatly as follows:

Theorem 35.

    P_0(T_x < ∞) = ((p/q) ∧ 1)^x        if x > 0,
    P_0(T_x < ∞) = ((q/p) ∧ 1)^{|x|}    if x < 0.    (33)

In particular, if p = q then P_0(T_x < ∞) = 1 for all x ∈ Z.

Proof. If p > q then |p − q| = p − q, so (1 − |p − q|)/(2q) = 2q/(2q) = 1 and (1 − |p − q|)/(2p) = 2q/(2p) = q/p. If p < q then |p − q| = q − p and, again, the two fractions become p/q and 1, respectively. This is what we want.

Theorem 36. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} S_n = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

    M_∞ = sup(S_0, S_1, S_2, . . .).

If we let M_n = max(S_0, . . . , S_n), then we have M_∞ = lim_{n→∞} M_n. Indeed, the sequence M_n is nondecreasing, and every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞ with probability 1, because p < 1/2.

Corollary 23. Consider a simple random walk with values in Z and assume p < 1/2. Then the overall maximum M_∞ has a geometric distribution:

    P_0(M_∞ ≥ x) = (p/q)^x,   x ≥ 0.

Proof. By (33),

    P_0(M_∞ ≥ x) = lim_{n→∞} P_0(M_n ≥ x) = lim_{n→∞} P_0(T_x ≤ n) = P_0(T_x < ∞) = (p/q)^x,   x ≥ 0.
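A quick simulation sketch (ours; p = 0.3 and the run lengths are arbitrary, and the finite horizon is harmless because the maximum is attained early when the drift is negative):

    import random

    def max_of_walk(p, steps=2000):
        pos = best = 0
        for _ in range(steps):
            pos += 1 if random.random() < p else -1
            best = max(best, pos)
        return best

    p, q, runs = 0.3, 0.7, 2000
    samples = [max_of_walk(p) for _ in range(runs)]
    for x in range(5):
        emp = sum(m >= x for m in samples) / runs
        print(x, emp, (p / q) ** x)   # empirical tail vs geometric tail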

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. Otherwise, with some positive probability, it will never ever return to the point where it started from. Unless p = 1/2, this probability is always strictly less than 1. What is this probability?

Theorem 37. Consider a random walk with values in Z and let T_0′ = inf{n ≥ 1 : S_n = 0} be the first return time to 0. Then

    P_0(T_0′ < ∞) = 2(p ∧ q).

Proof. The probability generating function of T_0′ was obtained in Theorem 34. But

    P_0(T_0′ < ∞) = lim_{s↑1} E_0 s^{T_0′} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. If p = q this gives 1.

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. We now compute its exact distribution. The total number of visits to state i is given by

    J_i = Σ_{n=1}^{∞} 1(S_n = i),    (34)

a random variable first considered in Section 10, where it was also shown to be geometric; in Lemma 4 we showed that, if a Markov chain starts from a state i, then J_i is geometric.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J_0 of visits to state 0 is geometrically distributed with

    P_0(J_0 ≥ k) = (2(p ∧ q))^k,   k = 0, 1, 2, . . . ,
    E_0 J_0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J_0 ≥ 1 if and only if the first return time to zero is finite. So

    P_0(J_0 ≥ 1) = P_0(T_0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so, since J_0 is geometric,

    P_0(J_0 ≥ k) = P_0(J_0 ≥ 1)^k = (2(p ∧ q))^k,   k = 0, 1, 2, . . . .

This proves the first formula. For the second formula we have

    E_0 J_0 = Σ_{k=1}^{∞} P_0(J_0 ≥ k) = Σ_{k=1}^{∞} (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, the same formulae apply for any i:

    P_i(J_i ≥ k) = (2(p ∧ q))^k,   E_i J_i = 2(p ∧ q)/(1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case, as already known, P_i(J_i ≥ k) = 1 for all k, meaning that P_i(J_i = ∞) = 1.

We now look at the number of visits to some state when we start from a different state.
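Theorem 38 lends itself to the same kind of spot check (a sketch of ours; the truncation at 2000 steps is safe because, with p = 0.3, the walk drifts away from 0 quickly):

    import random

    def visits_to_zero(p, steps=2000):
        pos, count = 0, 0
        for _ in range(steps):
            pos += 1 if random.random() < p else -1
            count += (pos == 0)
        return count

    p, runs = 0.3, 2000
    samples = [visits_to_zero(p) for _ in range(runs)]
    for k in range(4):
        emp = sum(j >= k for j in samples) / runs
        print(k, emp, (2 * min(p, 1 - p)) ** k)   # geometric with ratio 2(p ^ q)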

Theorem 39. Consider a simple random walk with values in Z. For every x ∈ Z and k ≥ 1:

If p ≥ 1/2 then

    P_0(J_x ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},   E_0 J_x = (q/p)^{(−x)∨0} / (1 − 2q).

If p ≤ 1/2 then

    P_0(J_x ≥ k) = (p/q)^{x∨0} (2p)^{k−1},   E_0 J_x = (p/q)^{x∨0} / (1 − 2p).

Proof. Suppose x > 0. Then

    P_0(J_x ≥ k) = P_0(T_x < ∞, J_x ≥ k),

simply because if J_x ≥ k ≥ 1 then T_x < ∞. Apply now the strong Markov property at T_x:

    P_0(T_x < ∞, J_x ≥ k) = P_0(T_x < ∞) P_x(J_x ≥ k − 1).

The reason that we replaced k by k − 1 is because, given that T_x < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain

    P_0(J_x ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^{k−1}.

If p ≥ 1/2 this gives P_0(J_x ≥ k) = (2q)^{k−1}; if p ≤ 1/2 this gives P_0(J_x ≥ k) = (p/q)^x (2p)^{k−1}. We repeat the argument for x < 0 and find

    P_0(J_x ≥ k) = ((q/p) ∧ 1)^{|x|} (2(p ∧ q))^{k−1}.

If p ≥ 1/2 this gives P_0(J_x ≥ k) = (q/p)^{|x|} (2q)^{k−1}; if p ≤ 1/2 this gives P_0(J_x ≥ k) = (2p)^{k−1}. As for the expectation, we apply E_0 J_x = Σ_{k=1}^{∞} P_0(J_x ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E_0 J_x is finite if p ≠ 1/2. In other words, if the random walk "drifts down" (i.e. p < 1/2) then, for x > 0, E_0 J_x = (p/q)^x/(1 − 2p) drops down to zero geometrically fast as x tends to infinity. But for x < 0 we have E_0 J_x = 1/(1 − 2p) regardless of x: starting from 0, the walk will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0, use spatial homogeneity to obtain the analogous formulae:

    P_y(J_x ≥ k) = P_0(J_{x−y} ≥ k),   E_y J_x = E_0 J_{x−y}.
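The expectation formula for x of either sign can be checked the same way (a sketch of ours, with the same illustrative p = 0.3 and the same harmless truncation):

    import random

    def visits_to(x, p, steps=2000):
        pos, count = 0, 0
        for _ in range(steps):
            pos += 1 if random.random() < p else -1
            count += (pos == x)
        return count

    p, q, runs = 0.3, 0.7, 2000
    for x in (2, -2):
        emp = sum(visits_to(x, p) for _ in range(runs)) / runs
        exact = (p / q) ** max(x, 0) / (1 - 2 * p)
        print(x, emp, exact)   # note the answer for x = -2 does not depend on x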

32 Duality

Consider a random walk S_k = Σ_{i=1}^{k} ξ_i, a sum of i.i.d. random variables. Fix an integer n. The dual (Ŝ_k, 0 ≤ k ≤ n) of (S_k, 0 ≤ k ≤ n) is defined by

    Ŝ_k := S_n − S_{n−k},   0 ≤ k ≤ n.

Theorem 40. (Ŝ_k, 0 ≤ k ≤ n) has the same distribution as (S_k, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ_1, ξ_2, . . . , ξ_n) has the same distribution as (ξ_n, ξ_{n−1}, . . . , ξ_1).

Thus, every time we have a probability for S we can translate it into a probability for Ŝ, which is basically another probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (S_k) be a simple symmetric random walk and x a positive integer. Let T_x = inf{n ≥ 1 : S_n = x} be the first time that x will be visited. Then

    P_0(T_x = n) = (x/n) P_0(S_n = x).

Proof. Rewrite the ballot theorem in terms of the dual:

    P_0(S_1 > 0, . . . , S_n > 0 | S_n = x) = x/n.

But the left-hand side is

    P_0(S_n − S_{n−k} > 0, 1 ≤ k ≤ n | S_n = x) = P_0(S_k < x, 0 ≤ k ≤ n − 1 | S_n = x)
                                               = P_0(T_x = n)/P_0(S_n = x),

and the claim follows.

Pause for reflection: we have been dealing with the random times T_x for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that T_x is the sum of x i.i.d. random variables, each distributed as T_1, and thus derived the generating function of T_x:

    E_0 s^{T_x} = ((1 − √(1 − 4pqs²))/(2qs))^x.    (35)

Specialising to the symmetric case we have

    E_0 s^{T_x} = ((1 − √(1 − s²))/s)^x.

In Theorem 29 we used the reflection principle and derived the distribution function of T_x for a simple symmetric random walk:

    P_0(T_x ≤ n) = P_0(|S_n| ≥ x) − (1/2) P_0(|S_n| = x).    (36)

Lastly, in Theorem 41 we derived, using duality, for a simple random walk and for x > 0, the probabilities

    P_0(T_x = n) = (x/n) P_0(S_n = x).    (37)

We remark here something methodological: it was completely different techniques that led to results about, essentially, the same thing, namely the distribution of T_x. This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of s^n in (35); summing up (37) should give (36); and taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
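Numerically, however, the compatibility is painless to confirm. A sketch (ours; x = 3 and the horizon n ≤ 15 are arbitrary) sums up (37) and compares with (36) for the symmetric walk:

    from math import comb

    def prob_Sn(n, y):
        """P0(Sn = y) for the simple symmetric random walk."""
        if (n + y) % 2 or abs(y) > n:
            return 0.0
        return comb(n, (n + y) // 2) * 0.5 ** n

    x, cdf = 3, 0.0
    for n in range(1, 16):
        cdf += (x / n) * prob_Sn(n, x)                      # summing up (37)
        refl = (sum(prob_Sn(n, y) for y in range(x, n + 1))
                + sum(prob_Sn(n, y) for y in range(-n, -x + 1))
                - 0.5 * (prob_Sn(n, x) + prob_Sn(n, -x)))   # formula (36)
        print(n, round(cdf, 6), round(refl, 6))             # the two columns agree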

33 Amount of time that a SSRW is positive*

Consider a SSRW (S_n) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, S_n) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T_+(2n) be the number of sides of the polygonal line that are above the horizontal axis, and T_−(2n) = 2n − T_+(2n) the number of sides that are below. Clearly, T_±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T_+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T_0′, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T_+(2n), arguing that P(T_+(2n) = m) is maximised when m = n, i.e. when T_+(2n) = T_−(2n). This is wrong. In fact, as the theorem below shows, P(T_+(2n) = m) is maximised when m = 0 or m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

    P(T_+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0),   0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T_0′ that the RW returns to 0:

    P(T_+(2n) = 2k) = Σ_{r=1}^{n} P(T_+(2n) = 2k, T_0′ = 2r).

Between 0 and T_0′ the random walk is either above the horizontal axis or below; call the first event A_+ and the second A_−, and write

    P(T_+(2n) = 2k) = Σ_{r=1}^{n} P(A_+, T_0′ = 2r) P(T_+(2n) = 2k | A_+, T_0′ = 2r)
                    + Σ_{r=1}^{n} P(A_−, T_0′ = 2r) P(T_+(2n) = 2k | A_−, T_0′ = 2r).

In the first case,

    P(T_+(2n) = 2k | A_+, T_0′ = 2r) = P(T_+(2n − 2r) = 2k − 2r),

because, if T_0′ = 2r and A_+ occurs, then the RW has already spent 2r units of time above the axis and so T_+(2n) is reduced by 2r; furthermore, we may remove the conditioning because the future evolution of the RW after T_0′ is independent of the past. In the second case, a similar argument shows

    P(T_+(2n) = 2k | A_−, T_0′ = 2r) = P(T_+(2n − 2r) = 2k).

We also observe that, by symmetry,

    P(A_+, T_0′ = 2r) = P(A_−, T_0′ = 2r) = (1/2) P(T_0′ = 2r) = (1/2) f_{2r}.

Letting p_{2k,2n} = P(T_+(2n) = 2k), we have shown that

    p_{2k,2n} = (1/2) Σ_{r=1}^{k} f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we will realise that

    p_{2k,2n} = (1/2) u_{2n−2k} Σ_{r=1}^{k} f_{2r} u_{2k−2r} + (1/2) u_{2k} Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r};

the two inner sums collapse via the first-return decomposition u_{2m} = Σ_{r=1}^{m} f_{2r} u_{2m−2r}, giving p_{2k,2n} = u_{2k} u_{2n−2k}, as claimed. Explicitly, we have

    P(T_+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^{−2n},   0 ≤ k ≤ n.    (38)

With 2n = 100, we plot the function P(T_+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Plot: P(T_+(100) = 2k) against k = 0, . . . , 50. The curve is U-shaped, dipping to roughly 0.02 near the middle and rising to roughly 0.08 at the two endpoints.]
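Both Theorem 42 and formula (38) can be checked against a direct enumeration for small n (a sketch of ours; "sides above the axis" are steps with both endpoints ≥ 0, as in the text):

    from itertools import product
    from math import comb

    def enumerate_T_plus(n):
        """Distribution of T+(2n)/2 over all 2^(2n) equally likely paths."""
        dist = [0.0] * (n + 1)
        for steps in product((1, -1), repeat=2 * n):
            pos, above = 0, 0
            for step in steps:
                new = pos + step
                if pos >= 0 and new >= 0:   # this side lies above the axis
                    above += 1
                pos = new
            dist[above // 2] += 0.5 ** (2 * n)
        return dist

    n = 4
    formula = [comb(2 * k, k) * comb(2 * n - 2 * k, n - k) * 0.5 ** (2 * n)
               for k in range(n + 1)]
    print(enumerate_T_plus(n))
    print(formula)   # the two lists coincide, and both are largest at the ends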

The arcsine law*

Use of Stirling's approximation in formula (38) gives

    P(T_+(2n) = 2k) ∼ 1/(π √(k(n − k))),   as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

    P( T_+(2n)/(2n) ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T_+(2n)/(2n) = k/n )
                               ∼ Σ_{a ≤ k/n ≤ b} 1/(π √(k(n − k)))
                               = Σ_{a ≤ k/n ≤ b} (1/n) · 1/(π √((k/n)(1 − k/n))).

This is easily seen to have a limit as n → ∞, the limit being

    ∫_a^b dt/(π √(t(1 − t))),

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:

    lim_{n→∞} P( T_+(2n)/(2n) ≤ x ) = ∫_0^x dt/(π √(t(1 − t))) = (2/π) arcsin √x.

We thus have that the limiting distribution of T_+(2n)/(2n) (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (T_x, x ≥ 0). This is another random walk, because the summands in T_x = Σ_{y=1}^{x} (T_y − T_{y−1}) are i.i.d., all distributed as T_1, and

    P(T_1 ≥ 2n) = P(S_{2n} = 0) = C(2n, n) 2^{−2n} ∼ 1/√(πn).

From this, since E T_1 = Σ_{n≥1} P(T_1 ≥ n), we immediately get that the expectation of T_1 is infinity: E T_1 = ∞. Many other quantities can be approximated in a similar manner.
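The arcsine law itself is visible already at moderate n. A simulation sketch (ours; 2n = 1000 and 4000 runs are arbitrary choices):

    import math, random

    def frac_positive(two_n):
        pos, above = 0, 0
        for _ in range(two_n):
            new = pos + (1 if random.random() < 0.5 else -1)
            above += (pos >= 0 and new >= 0)   # count sides above the axis
            pos = new
        return above / two_n

    runs, two_n = 4000, 1000
    samples = [frac_positive(two_n) for _ in range(runs)]
    for x in (0.1, 0.25, 0.5, 0.75, 0.9):
        emp = sum(f <= x for f in samples) / runs
        print(x, emp, 2 / math.pi * math.asin(math.sqrt(x)))   # close agreement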

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y), where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z. Let ξ_1, ξ_2, . . . be i.i.d. random vectors such that

    P(ξ_n = (0, 1)) = P(ξ_n = (0, −1)) = P(ξ_n = (1, 0)) = P(ξ_n = (−1, 0)) = 1/4.

We then let S_0 = (0, 0) and S_n = Σ_{i=1}^{n} ξ_i. If x ∈ Z², we let x^1 be its horizontal coordinate and x^2 its vertical, so S_n = (S_n^1, S_n^2).

Lemma 20. (S_n^1, n ∈ Z_+) is a random walk; so is (S_n^2, n ∈ Z_+). The two random walks are not independent.

Proof. Since ξ_1, ξ_2, . . . are i.i.d., so are ξ_1^1, ξ_2^1, . . . (the latter being functions of the former). Hence (S_n^1 = Σ_{i=1}^{n} ξ_i^1, n ∈ Z_+) is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0; the increments are identically distributed with

    P(ξ_n^1 = 0) = P(ξ_n = (0, 1)) + P(ξ_n = (0, −1)) = 1/2,
    P(ξ_n^1 = 1) = P(ξ_n = (1, 0)) = 1/4,
    P(ξ_n^1 = −1) = P(ξ_n = (−1, 0)) = 1/4.

The same argument applies to (S_n^2, n ∈ Z_+). However, the two stochastic processes are not independent: for each n, the random variables ξ_n^1, ξ_n^2 are not independent, simply because if one takes the value 0 the other is nonzero.

Consider a change of coordinates: rotate the axes by 45 degrees (and scale them), i.e. map (x^1, x^2) into (x^1 + x^2, x^1 − x^2). So let

    R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2, S_n^1 − S_n^2).

We have:

Lemma 21. Each of (R_n^1, n ∈ Z_+) and (R_n^2, n ∈ Z_+) is a simple symmetric random walk in one dimension, and the two are independent. Hence (R_n, n ∈ Z_+) is a random walk in two dimensions with independent coordinates.

Proof. We have R_n^1 = Σ_{i=1}^{n} (ξ_i^1 + ξ_i^2) and R_n^2 = Σ_{i=1}^{n} (ξ_i^1 − ξ_i^2). So R_n^1 = Σ_{i=1}^{n} η_i^1 and R_n^2 = Σ_{i=1}^{n} η_i^2, where η_n^1 := ξ_n^1 + ξ_n^2 and η_n^2 := ξ_n^1 − ξ_n^2. The sequence of vectors η_1, η_2, . . . is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ_1, ξ_2, . . . . The possible values of each of η_n^1, η_n^2 are ±1, and

    P(η_n^1 = 1, η_n^2 = 1) = P(ξ_n = (1, 0)) = 1/4,
    P(η_n^1 = −1, η_n^2 = −1) = P(ξ_n = (−1, 0)) = 1/4,
    P(η_n^1 = 1, η_n^2 = −1) = P(ξ_n = (0, 1)) = 1/4,
    P(η_n^1 = −1, η_n^2 = 1) = P(ξ_n = (0, −1)) = 1/4.

Hence P(η_n^1 = 1) = P(η_n^1 = −1) = 1/2 and P(η_n^2 = 1) = P(η_n^2 = −1) = 1/2, and so

    P(η_n^1 = ε, η_n^2 = ε′) = P(η_n^1 = ε) P(η_n^2 = ε′)   for all choices of ε, ε′ ∈ {−1, +1},

i.e. η_n^1 and η_n^2 are independent for each n. Thus each of (R_n^1), (R_n^2) is a sum of i.i.d. increments taking the values ±1 with equal probabilities, i.e. a simple symmetric random walk, and the two walks are independent.

Lemma 22.

    P(S_{2n} = 0) = (C(2n, n) 2^{−2n})².

Proof. The map (x^1, x^2) ↦ (x^1 + x^2, x^1 − x^2) sends only the origin to the origin, so S_n = 0 if and only if R_n = 0. Hence

    P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0, R_{2n}^2 = 0) = P(R_{2n}^1 = 0) P(R_{2n}^2 = 0),

where the last equality follows from the independence of the two random walks. But (R_n^1) is a simple symmetric random walk, so P(R_{2n}^1 = 0) = C(2n, n) 2^{−2n}, and the same is true for (R_n^2).

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^{∞} P(S_n = 0) = ∞. But, by the previous lemma and by Stirling's approximation, P(S_{2n} = 0) ∼ c/n, where c is a positive constant (in the sense that the ratio of the two sides goes to 1 as n → ∞). Since Σ_n 1/n = ∞, the walk is recurrent.
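Lemma 22 is easy to spot-check by simulation (a sketch of ours; n = 3 and the run count are arbitrary):

    import random
    from math import comb

    def planar_return(n, runs=200000):
        """Estimate P(S_2n = 0) for the planar SSRW."""
        hits = 0
        for _ in range(runs):
            x = y = 0
            for _ in range(2 * n):
                dx, dy = random.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
                x += dx; y += dy
            hits += (x == 0 and y == 0)
        return hits / runs

    n = 3
    print(planar_return(n), (comb(2 * n, n) * 0.5 ** (2 * n)) ** 2)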

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Recall, from Theorem 7, that for a simple symmetric random walk (started from S_0 = 0) and a < 0 < b,

    P(T_a < T_b) = b/(b − a),   P(T_b < T_a) = −a/(b − a).

We can read this as follows:

    P(S_{T_a ∧ T_b} = a) = b/(b − a),   P(S_{T_a ∧ T_b} = b) = −a/(b − a).

This last display tells us the distribution of S_{T_a ∧ T_b}. Note, in particular, that E S_{T_a ∧ T_b} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b with a < 0 < b such that pa + (1 − p)b = 0. To see this, if p = M/N with M, N ∈ N and M < N, choose a = M − N and b = M. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability it exits from the left point is p, and the probability it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (p_i, i ∈ Z) with zero mean: Σ_{i∈Z} i p_i = 0. Can we find a stopping time T such that S_T has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (S_n) be a simple symmetric random walk and (p_i, i ∈ Z) a given probability distribution on Z with zero mean: Σ_i i p_i = 0. Then there exists a stopping time T such that P(S_T = i) = p_i, i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ {(i, j) : i < 0 < j} with the following distribution:

    P(A = i, B = j) = c^{−1} (j − i) p_i p_j,   i < 0 < j,
    P(A = 0, B = 0) = p_0,

where c is chosen by normalisation:

    P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.

This leads to

    c^{−1} Σ_{j>0} j p_j Σ_{i<0} p_i + c^{−1} Σ_{i<0} (−i) p_i Σ_{j>0} p_j = 1 − p_0.

Since Σ_{i∈Z} i p_i = 0, the sums Σ_{i<0} (−i) p_i and Σ_{i>0} i p_i have a common value, and the display forces c to be precisely this common value:

    c = Σ_{i<0} (−i) p_i = Σ_{i>0} i p_i.

Letting T_i = inf{n ≥ 0 : S_n = i}, i ∈ Z, and assuming that (A, B) is independent of (S_n), we consider the random variable

    Z = S_{T_A ∧ T_B}.

The claim is that P(Z = k) = p_k, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p_0.

Suppose next k > 0. We have

    P(Z = k) = P(S_{T_A ∧ T_B} = k) = Σ_{i<0} Σ_{j>0} P(S_{T_i ∧ T_j} = k) P(A = i, B = j)
             = Σ_{i<0} P(S_{T_i ∧ T_k} = k) P(A = i, B = k)
             = Σ_{i<0} ((−i)/(k − i)) c^{−1} (k − i) p_i p_k
             = c^{−1} Σ_{i<0} (−i) p_i · p_k = p_k.

Similarly, P(Z = k) = p_k for k < 0.
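The construction can be run end to end. A simulation sketch (ours; the zero-mean target law p_{−1} = 1/2, p_0 = 1/4, p_2 = 1/4 is an arbitrary example):

    import random

    target = {-1: 0.5, 0: 0.25, 2: 0.25}          # mean: -1/2 + 2/4 = 0
    c = sum(i * pi for i, pi in target.items() if i > 0)
    pairs, weights = [(0, 0)], [target.get(0, 0.0)]
    for i, pi in target.items():
        if i >= 0: continue
        for j, pj in target.items():
            if j <= 0: continue
            pairs.append((i, j))
            weights.append((j - i) * pi * pj / c)  # weight c^-1 (j - i) p_i p_j

    def embedded_sample():
        a, b = random.choices(pairs, weights)[0]
        if a == b == 0:
            return 0
        pos = 0
        while pos != a and pos != b:               # run the walk until it exits [a, b]
            pos += random.choice((1, -1))
        return pos

    runs, counts = 100000, {}
    for _ in range(runs):
        z = embedded_sample()
        counts[z] = counts.get(z, 0) + 1
    print({k: v / runs for k, v in sorted(counts.items())}, target)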

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

    N = {1, 2, 3, . . .}.

The set Z of integers consists of N, their negatives, and zero (denoted by 0):

    Z = {· · · , −3, −2, −1, 0, 1, 2, 3, · · ·}.

Given integers a, b with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

If a, b are arbitrary integers, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a and d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious: if d_1, d_2 are two such numbers then d_1 would divide d_2 and vice-versa, leading to d_1 = d_2.)

Notice that gcd(a, b) = gcd(a, b − a). Indeed, let d = gcd(a, b); we show that d = gcd(a, b − a). But d | a and d | b, therefore d | b − a. Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (i) and (ii) both hold, and we're done.

This gives us an algorithm for finding the gcd of two numbers: suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the two numbers and continue. We work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school): replace the largest of the two numbers by the remainder of the division of the largest by the smallest. For instance,

    gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38)
                 = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2)
                 = gcd(4, 2) = gcd(2, 2) = 2.

In the first line we subtracted 38 exactly 3 times; 3 is the maximum number of times that 38 "fits" into 120 (indeed, 4 × 38 > 120), so the first line is summarised by the single division of 120 by 38.
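The stepwise computation above is exactly Euclid's algorithm. A small sketch (ours) implements it, together with the back-substitution producing the integers x, y with gcd(a, b) = xa + yb that we discuss next:

    def extended_gcd(a, b):
        """Return (d, x, y) with d = gcd(a, b) and d = x*a + y*b."""
        if b == 0:
            return a, 1, 0
        d, x, y = extended_gcd(b, a % b)    # uses a = (a // b) * b + (a % b)
        return d, y, x - (a // b) * y

    print(extended_gcd(120, 38))   # (2, -6, 19): indeed 2 = (-6)*120 + 19*38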

The theorem of division says the following: given two positive integers a, b with a > b, we can find an integer q and an integer r ∈ {0, 1, . . . , b − 1} such that a = qb + r. The long division algorithm, learnt in elementary school, is a process for finding the quotient q and the remainder r. For example, dividing 3450 by 26 gives q = 132 and r = 18:

    3450 = 132 × 26 + 18.

Here are some other facts about the greatest common divisor. First, if d = gcd(a, b) then there are integers x, y such that d = xa + yb. To see why this is true, let us follow the previous example again: gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2. In the process, we used the divisions

    120 = 3 × 38 + 6,    38 = 6 × 6 + 2.

Working backwards, we have

    2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.

Thus gcd(38, 120) = 38x + 120y with x = 19, y = −6. The same logic works in general.

Second, for integers a, b, not both zero,

    gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.

To see this, let d be the right-hand side; then d = xa + yb for some x, y ∈ Z. If c | a and c | b then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), that d | a and d | b. Suppose this is not true, e.g. that d does not divide b. Since d > 0, we can divide b by d to find b = qd + r for some 1 ≤ r ≤ d − 1. But then b = q(xa + yb) + r, yielding r = (−qx)a + (1 − qy)b. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder): d is not minimum as claimed, and this is a contradiction. Hence r = 0 and d divides both a and b, so (i) holds. In particular, the greatest common divisor of any subset of the integers also makes sense; the greatest common divisor of 3 integers can be defined in a similar manner.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A; i.e. a relation can be thought of as a set of pairs of elements of the set A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. A relation is called reflexive if it contains all pairs (x, x). The relation < is not reflexive; the equality relation = is. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The relation < is not symmetric; the equality relation is. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric, and it can be thought of as reflexive (unless we do not want to accept the fact that one may not be a friend of oneself), but it is not necessarily transitive. A relation is transitive if, whenever it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x_1, x_2 ∈ X have different values f(x_1), f(x_2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names; a function is then an assignment of a name to each animal. (The same name may be used for different animals.) The name of the first animal can be chosen in m ways, that of the second in m ways, and so on; hence the total number of assignments is m × m × · · · × m (n times).

Suppose now we are not allowed to give the same name to different animals. Each assignment of this type is a one-to-one function from A to B, and to be able to do this we need more names than animals: n ≤ m. The first animal can be assigned any of the m available names; once this is done, the second can be assigned any of the remaining m − 1 names, and so on. Thus the total number of one-to-one functions from A to B is

    (m)_n := m(m − 1)(m − 2) · · · (m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A, and we denote (n)_n also by n!; in other words,

    n! = 1 × 2 × 3 × · · · × n

is the total number of permutations of a set of n elements. Any set A with n elements contains several subsets, and the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen: the first element can be chosen in n ways, the second in n − 1, and so on, the k-th in n − k + 1 ways. So we have (n)_k = n(n − 1) · · · (n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

    (n)_k / k! =: C(n, k) = n! / (k!(n − k)!),

where the symbol C(n, k) is merely a name for the number computed on the left side of this equation. Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1; clearly also C(n, k) = C(n, n − k). The total number of subsets of A (regardless of their number of elements) is thus equal to

    Σ_{k=0}^{n} C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem. Incidentally, we thus observe that there are as many subsets of A as there are elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.
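All of these identities check out in a few lines (a sketch of ours; the values of n, m, k are arbitrary):

    from itertools import combinations
    from math import comb, factorial, perm

    n, m, k = 4, 6, 2
    assert perm(m, n) == m * (m - 1) * (m - 2) * (m - 3)        # (m)_n injections
    assert factorial(n) == perm(n, n)                           # n! = (n)_n permutations
    assert comb(n, k) == perm(n, k) // factorial(k)             # C(n,k) = (n)_k / k!
    assert len(list(combinations(range(n), k))) == comb(n, k)   # subsets of size k
    assert sum(comb(n, j) for j in range(n + 1)) == 2 ** n      # all subsets
    print("counting identities verified")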

We shall let C(n, k) = 0 if n < k. More generally, if x is a real number, we define

    C(x, m) = x(x − 1) · · · (x − m + 1)/m!.

The Pascal triangle relation states that

    C(n + 1, k) = C(n, k) + C(n, k − 1).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say the pig. If the pig is not included in the sample, then there are n remaining animals from which to choose the k, and this can be done in C(n, k) ways. If the pig is included, then we choose k − 1 animals from the remaining n ones, and this can be done in C(n, k − 1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together to find the total number of ways to pick k animals from n + 1 animals.

Maximum and minimum

The minimum of two numbers a, b is denoted by a ∧ b = min(a, b); thus a ∧ b = a if a ≤ b, and = b if b ≤ a. The maximum is denoted by a ∨ b = max(a, b). The notation extends to more than two numbers: a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c). In particular, we use the notations

    a^+ = max(a, 0),   a^− = −min(a, 0).

Both quantities are nonnegative; a^+ is called the positive part of a, and a^− is called the negative part of a. Note that it is always true that a = a^+ − a^−, and the absolute value of a is |a| = a^+ + a^−.

Indicator function

1_A stands for the indicator function of a set A, meaning the function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that

    1_{A∩B} = 1_A 1_B,   1_{A^c} = 1 − 1_A.

An example of its application: since

    1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1_A)(1 − 1_B) = 1_A + 1_B − 1_{A∩B},

we have E[1_{A∪B}] = E[1_A] + E[1_B] − E[1_{A∩B}], which means P(A ∪ B) = P(A) + P(B) − P(A ∩ B). If A is an event then 1_A is a random variable: it takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is

    E 1_A = 1 × P(A) + 0 × P(A^c) = P(A).

Several quantities of interest are expressed in terms of indicators. For instance, if I go to a bank N times and A_i is the clause "I am robbed the i-th time", then Σ_{i=1}^{N} 1_{A_i} is the total number of times I am robbed. More generally, if A_1, A_2, . . . are sets or clauses, then Σ_{n≥1} 1_{A_n} is the number of true clauses. Another example: let X be a nonnegative random variable with values in N = {1, 2, . . .}. Then X = Σ_{n=1}^{X} 1 (the sum of X ones is X), so

    X = Σ_{n≥1} 1(X ≥ n),   and hence   E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that

    ϕ(αx + βy) = αϕ(x) + βϕ(y)   for all x, y ∈ R^n and all α, β ∈ R.

Let u_1, . . . , u_n be a basis for R^n. Then any x ∈ R^n can be uniquely written as x = Σ_{j=1}^{n} x^j u_j, and, because ϕ is linear, y := ϕ(x) = Σ_{j=1}^{n} x^j ϕ(u_j). But the ϕ(u_j) are vectors in R^m, so if we choose a basis v_1, . . . , v_m in R^m we can express each ϕ(u_j) as a linear combination of the basis vectors:

    ϕ(u_j) = Σ_{i=1}^{m} a_{ij} v_i.

The matrix A of ϕ with respect to the choice of the two bases is the m × n array (a_{ij}). Combining, we have ϕ(x) = Σ_{i=1}^{m} v_i Σ_{j=1}^{n} a_{ij} x^j. So, letting y^i := Σ_{j=1}^{n} a_{ij} x^j for i = 1, . . . , m, the column (y^1, . . . , y^m) is the column representation of y = ϕ(x) with respect to the given basis, just as (x^1, . . . , x^n) is the column representation of x; the relation between the two column representations is written y = Ax. Suppose now that ψ : R^n → R^m and ϕ : R^m → R^ℓ are linear functions with matrix representations B (an m × n matrix) and A (an ℓ × m matrix), for fixed bases in R^n, R^m, R^ℓ. Then ϕ ∘ ψ : R^n → R^ℓ is a linear function, and its matrix representation is the ℓ × n matrix AB, where

    (AB)_{ij} = Σ_{k=1}^{m} A_{ik} B_{kj},   1 ≤ i ≤ ℓ,  1 ≤ j ≤ n.

A matrix A is called square if it is an n × n matrix; a square matrix represents a linear transformation from R^n to R^n. Usually the choice of basis is the so-called standard basis: in R^d it consists of the vectors e_1, . . . , e_d, where e_i has a 1 in the i-th component and 0 in every other one, i.e. e_i^j = 1(i = j).

Supremum and infimum

When we have a set I of numbers (for instance, a sequence a_1, a_2, . . . of numbers), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all x ∈ I; similarly for max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, . . .} has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all lower bounds, i.e. of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists. inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. If I has no lower bound then inf I = −∞, and if I is empty then, by convention, inf ∅ = +∞. Similarly, we define the supremum sup I to be the minimum of all upper bounds; sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε. If I has no upper bound then sup I = +∞, and sup ∅ = −∞.

Limsup and liminf

If x_1, x_2, . . . is a sequence of numbers, then

    inf{x_1, x_2, . . .} ≤ inf{x_2, x_3, . . .} ≤ inf{x_3, x_4, . . .} ≤ · · ·

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it lim inf x_n (limit inferior): lim inf_{n→∞} x_n = lim_{n→∞} inf{x_n, x_{n+1}, . . .}. Similarly, we define lim sup x_n = lim_{n→∞} sup{x_n, x_{n+1}, . . .} (limit superior). We note that lim inf(−x_n) = −lim sup x_n. The sequence x_n has a limit ℓ if and only if lim sup x_n = lim inf x_n = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory), so this section collects the facts we use. Probability Theory is exceptionally simple in terms of its axioms: besides P(Ω) = 1, there is essentially one axiom, namely that if A_1, A_2, . . . are mutually disjoint events, then

    P( ⋃_{n≥1} A_n ) = Σ_{n≥1} P(A_n).

Everything else follows from this. (That is why Probability Theory is very rich; in contrast, Linear Algebra, which has many axioms, is more 'restrictive'.) If the events A, B are disjoint then P(A ∪ B) = P(A) + P(B); if they are not disjoint then P(A ∪ B) = P(A) + P(B) − P(A ∩ B). This is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. Similarly: if the events A_1, A_2, . . . are increasing with A = ⋃_n A_n, then P(A) = lim_{n→∞} P(A_n); if they are decreasing with A = ⋂_n A_n, then P(A) = lim_{n→∞} P(A_n). In general, by not requiring disjointness,

    P( ⋃_n A_n ) ≤ Σ_n P(A_n).

If A_n are events with P(A_n) = 0 then P(⋃_n A_n) = 0; if A_n are events with P(A_n) = 1 then P(⋂_n A_n) = 1.

Quite often, to compute P(A), it is useful to find disjoint events B_1, B_2, . . . such that ⋃_n B_n = Ω, in which case we write

    P(A) = Σ_n P(A ∩ B_n) = Σ_n P(A | B_n) P(B_n).

In Markov chain theory, we work a lot with conditional probabilities. Recall that P(B | A) is defined by P(A ∩ B)/P(A), and it is a very motivated definition: to compute P(A ∩ B), we read the definition as P(A ∩ B) = P(A) P(B | A). This is generalisable to a number of events:

    P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 ∩ A_2).

If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A) P(B | A) = P(B) P(A | B), and this is called Bayes' theorem.

I do sometimes "cheat" from a strict mathematical point of view in these notes, but not too much. I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. For example, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B). Often, to show that EX ≤ EY, we need to argue that X ≤ Y. If A is an event, then 1_A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 everywhere else, and E 1_A = P(A). Note that 1_A 1_B = 1_{A∩B} and 1_{A^c} = 1 − 1_A. Markov's inequality for a positive random variable X states that P(X > x) ≤ EX/x. This follows from the logical statement X 1(X > x) ≤ X, which holds since X is positive.

An application of this is: P(X < ∞) = lim_{x→∞} P(X ≤ x). In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable X may be defined by

    EX = ∫_0^∞ P(X > x) dx.

The formula holds for any positive random variable. For a random variable that takes values in N we can write EX = Σ_{n=1}^{∞} P(X ≥ n); this is due to the fact that P(X ≥ n) = E 1(X ≥ n). In case the random variable is discrete, the formula reduces to the known one, EX = Σ_x x P(X = x); in case it is absolutely continuous with density f, we have EX = ∫_{−∞}^{∞} x f(x) dx. For a random variable which takes positive and negative values, we define X^+ = max(X, 0) and X^− = −min(X, 0), which are both nonnegative, and then define EX = EX^+ − EX^−, which is motivated from the algebraic identity X = X^+ − X^−. This definition makes sense only if EX^+ < ∞ or EX^− < ∞.

Generating functions

A generating function of a sequence a_0, a_1, . . . is an infinite polynomial having the a_i as coefficients:

    G(s) = a_0 + a_1 s + a_2 s² + · · · .

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a_0. Note that |G(s)| ≤ Σ_n |a_n| |s|^n, and this is ≤ Σ_n |a_n| if s ∈ [−1, 1]; so if Σ_n |a_n| < ∞ then G(s) exists for all s ∈ [−1, 1]. If it exists for some s_1 then it exists uniformly for all s with |s| < |s_1| (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence a_n = ρ^n, n ≥ 0, is

    Σ_{n=0}^{∞} ρ^n s^n = 1/(1 − ρs),   as long as |ρs| < 1.    (39)

Suppose now that X is a random variable with values in Z_+ ∪ {+∞}. The generating function of the sequence p_n = P(X = n), n = 0, 1, . . . , is called the probability generating function of X:

    ϕ(s) = E s^X = Σ_{n=0}^{∞} s^n P(X = n).

This is defined for all −1 < s < 1. Note that Σ_{n=0}^{∞} p_n ≤ 1, because we do allow X to take the value +∞, with, possibly, p_∞ := P(X = ∞) = 1 − Σ_n p_n. The term s^∞ p_∞ could formally appear in the sum defining ϕ(s); but, since |s| < 1, we have s^∞ = 0, while

    ϕ(1) = Σ_{n=0}^{∞} P(X = n) = P(X < +∞).

Note also that ϕ(0) = P(X = 0). We can recover the probabilities p_n from the formula

    p_n = ϕ^{(n)}(0)/n!,

as follows from Taylor's theorem, and this is why ϕ is called probability generating function. Also, we can recover the expectation of X from

    EX = lim_{s↑1} ϕ′(s),

something that can be formally seen from (d/ds) E s^X = E X s^{X−1} and by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

    x_{n+2} = x_{n+1} + x_n,   with x_0 = x_1 = 1.

If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x_0), and the generating function of (x_{n+2}, n ≥ 0) is Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x_0 − s x_1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0):

    s^{−2}(X(s) − x_0 − s x_1) = s^{−1}(X(s) − x_0) + X(s).

Since x_0 = x_1 = 1, we can solve for X(s) and find

    X(s) = −1/(s² + s − 1).

But the polynomial s² + s − 1 has two roots,

    a = (√5 − 1)/2,   b = −(√5 + 1)/2.

Hence s² + s − 1 = (s − a)(s − b), and so

    X(s) = (1/(b − a)) ( 1/(b − s) − 1/(a − s) ).

But (39) tells us that 1/(b − s) = (1/b) · 1/(1 − s/b) is the generating function of (1/b)^{n+1}, and similarly 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of

    x_n = ((1/b)^{n+1} − (1/a)^{n+1}) / (b − a).

Noting that ab = −1, we can also write this as

    x_n = ((−a)^{n+1} − (−b)^{n+1}) / (b − a),

which is the solution to the recursion: the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
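That last remark is fun to confirm in code (a sketch of ours; floating point is exact enough here because the second term decays geometrically):

    from math import sqrt

    a = (sqrt(5) - 1) / 2
    b = -(sqrt(5) + 1) / 2

    def fib_closed(n):
        """The closed form extracted from the generating function."""
        return ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)

    xs = [1, 1]
    for _ in range(10):
        xs.append(xs[-1] + xs[-2])
    print([round(fib_closed(n)) for n in range(12)])
    print(xs)   # identical lists: 1, 1, 2, 3, 5, 8, ...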

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n, or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. If X is a random variable with values in R, then the so-called distribution function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density f(x) = (d/dx) P(X ≤ x). Even the generating function E s^X of a Z_+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z_+; it turns out to be a good one. All these objects can be referred to as "distribution" or "law" of X. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × · · · × n for large n, then, since the number is so big, it's better to use logarithms:

    n! = e^{log n!} = exp( Σ_{k=1}^{n} log k ).

Since log x does not vary too much between two successive positive integers k and k + 1, it follows that ∫_k^{k+1} log x dx ≈ log k, and hence Σ_{k=1}^{n} log k ≈ ∫_1^n log x dx. Since (d/dx)(x log x − x) = log x, it follows that

    ∫_1^n log x dx = n log n − n + 1 ≈ n log n − n,

and hence

    n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads, equals C(2n, n) 2^{−2n} ≈ 1, and this is terrible, because we know that the probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:

    n! ∼ n^n e^{−n} √(2πn),

where the symbol a_n ∼ b_n means that a_n/b_n → 1, as n → ∞.
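A few lines of code make the difference between the rough and the refined approximation vivid (a sketch of ours):

    from math import exp, factorial, log, pi, sqrt

    for n in (5, 10, 20):
        rough = exp(n * log(n) - n)                 # n^n e^-n
        fine = n ** n * exp(-n) * sqrt(2 * pi * n)  # with the sqrt(2 pi n) factor
        print(n, factorial(n) / rough, factorial(n) / fine)
    # the first ratio grows like sqrt(2*pi*n); the second tends to 1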

The geometric distribution

We say that the random variable X with values in Z_+ is geometric if it has the memoryless property:

    P(X − n ≥ m | X ≥ n) = P(X ≥ m),   m, n ∈ Z_+.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. The memoryless property forces the distribution to be of the form

    P(X ≥ n) = λ^n,   n ∈ Z_+,

where λ = P(X ≥ 1): a geometric function of n. We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the total number of visits to the origin,

    J_0 = Σ_{n=1}^{∞} 1(S_n = 0),

defined in (34). It was shown in Theorem 38 that J_0 satisfies the memoryless property; in fact, it was shown there that P(J_0 ≥ k) = P(J_0 ≥ 1)^k.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Consider an increasing and nonnegative function g : R → R_+. Then, for all x, y ∈ R,

    g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. So this single inequality captures both properties of g neatly. Now let X be a random variable. Then

    g(X) ≥ g(x) 1(X ≥ x),

and so, by taking expectations of both sides (expectation is an increasing operator),

    E g(X) ≥ g(x) P(X ≥ x).

This completely trivial thought gives the very useful Markov's inequality.
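For g(x) = x and a positive X, the inequality reads P(X > x) ≤ EX/x, and the slack is easy to see empirically (a sketch of ours; the exponential sample is an arbitrary choice of positive random variable):

    import random

    samples = [random.expovariate(1.0) for _ in range(100000)]
    mean = sum(samples) / len(samples)
    for x in (1.0, 2.0, 4.0):
        tail = sum(s > x for s in samples) / len(samples)
        print(x, tail, mean / x)   # the tail never exceeds the Markov bound EX/x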


