Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents

1 Introduction
2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains
3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains
4 Stationarity
  4.1 Finding a stationary distribution
5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains
19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain
24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory; I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

A few starred sections should be considered "advanced" and can be omitted at first reading. Also, "important" formulae are placed inside a coloured box, and I tried to put all terms in blue small capitals whenever they are first encountered.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

In other words. a totally ordered set. An example from Physics is provided by Newton’s laws of motion. Newton’s second law says that the acceleration is proportional to the force: m¨ = F. where r = rem(a. It is not hard to see that. Formally. 1. the real numbers). b). Assume that the force F depends only on the particle’s position ¨ dt and velocity. b) = gcd(r. yn+1 ). F = f (x. one of which is 0. n = 0. For instance. the integers). Suppose that a particle of mass m is moving under the action of a force F in a straight line.PART I: MARKOV CHAINS 1 Introduction A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. b) is the remainder of the division of a by b. Xn+2 . b) between two positive integers a and b. s ≥ t) ˙ 1 . . . The sequence of pairs thus produced. 2.e. X2 . yn ) into Xn+1 = (xn+1 . . The state of the particle at time t is described by the pair ˙ X(t) = (x(t). for any t. x). By repeating the procedure. and evolving as xn+1 = yn yn+1 = rem(xn . yn ). . or. for any n. the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke’s law). X1 . that gcd(a. . . the future pairs Xn+1 . more generally. a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system. x = x(t) denotes the position of the particle at time t. y0 = min(a. yn ). Its velocity is x = dx . An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a. x Here. we end up with two numbers. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. initialised by x0 = max(a. b). . x(t)). and its ˙ dt 2x acceleration is x = d 2 . there is a function F that takes the pair Xn = (xn . . the future trajectory (X(s). b). i. where xn ≥ yn . Recall (from elementary school maths). and the other the greatest common divisor. we let our “state” be a pair of integers (xn . continuous (i. depend only on the pair Xn . In the module following this one you will study continuous-time chains.e. X0 . In Mathematics. The “time” can be discrete (i.e. has the property that. .

respectively. A mouse lives in the cage. say. Some of these algorithms are deterministic.99. 0 ≤ Y0 ≤ B. 2.can be completely specified by the current state X(t). at time n + 1 it is either still in 1 or has moved to 2. uniformly. Markov chains generalise this concept of dependence of the future only on the present. In other words. 1 2 Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0. We will be dealing with random variables. it does so. The resulting ratio is an approximation b of a f (x)dx. A scientist’s job is to record the position of the mouse every minute. The generalisation takes us into the realm of Randomness. Repeat with (Xn . We will develop a theory that tells us how to describe. Try this on the computer! 2 .000 rows and 10. analyse. an effective way for dealing with large problems is via the use of randomness. Indeed. 2 2.05.000 columns or computing the integral of a complicated function of a large number of variables. we will see what kind of questions we can ask and what kind of answers we can hope for. containing fresh and stinky cheese.e. Choose a pair (X0 . regardless of where it was at earlier times. Similarly.. Perform this a large number N of times. In addition. . instead of deterministic objects. a ≤ X0 ≤ b. An example of a random algorithm is the Monte Carlo method for the b approximate computation of the integral a f (x)dx of a complicated function f : suppose that f is positive and bounded below B. Other examples of dynamical systems are the algorithms run.1 Examples of Markov chains A mouse in a cage A mouse is in a cage with two cells. i. Each time count 1 if Yn < f (Xn ) and 0 otherwise. for steps n = 1. past states are not needed. 1 and 2. Y0 ) of random numbers. . but some are stochastic. We will also see why they are useful and discuss how they are applied. .1 They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large” such as finding the determinant of a matrix with 10. via the use of the tools of Probability. Count the number of 1’s and divide by N . We can summarise this information by the transition diagram: 1 Stochastic means random. by the software in your computer. When the mouse is in cell 1 at time n (minutes) then. and use those mathematical models which are called Markov chains. Yn ). it moves from 2 to 1 with probability β = 0.
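Here is one way the Monte Carlo experiment just described might be coded (a minimal sketch; the integrand f and the bound B in the test call are arbitrary illustrative choices):

    import random

    def monte_carlo_integral(f, a, b, B, N=100_000):
        """Estimate the integral of f over [a, b], assuming 0 <= f <= B there."""
        count = 0
        for _ in range(N):
            x = random.uniform(a, b)   # X_n
            y = random.uniform(0, B)   # Y_n
            if y < f(x):
                count += 1
        # fraction of points under the graph, times the area of the rectangle
        return (count / N) * B * (b - a)

    print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))  # approximately 1/3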

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A mouse lives in the cage, and a scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2, regardless of where it was at earlier times; in other words, past states are not needed. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99. We can summarise this information by the transition diagram:

    [Transition diagram: two states 1, 2; arrow 1 → 2 labelled α, arrow 2 → 1 labelled β; self-loops labelled 1 − α and 1 − β.]

Another way to summarise the information is by the 2 × 2 transition probability matrix

    P = ( 1−α   α  )  =  ( 0.95  0.05 )
        (  β   1−β )     ( 0.99  0.01 )

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 ≈ 20 minutes. (This is the mean of the geometric distribution with parameter α.)
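The answer to question 1 is easy to check numerically. Below is a small simulation sketch of the mouse chain; the estimate should be close to 1/α = 20 minutes:

    import random

    alpha, beta = 0.05, 0.99
    P = {1: [(2, alpha), (1, 1 - alpha)],
         2: [(1, beta), (2, 1 - beta)]}

    def step(state):
        """One step of the chain: sample the next cell from the current row of P."""
        u, acc = random.random(), 0.0
        for nxt, prob in P[state]:
            acc += prob
            if u < acc:
                return nxt
        return state

    trials, total = 10_000, 0
    for _ in range(trials):
        state, t = 1, 0
        while state != 2:      # wait until the mouse first moves to cell 2
            state = step(state)
            t += 1
        total += t
    print(total / trials)      # roughly 20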

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

    Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that Dn, Sn, n = 0, 1, 2, . . ., are random variables with distributions FD, FS, respectively. Assume also that X0, D0, D1, D2, . . ., S0, S1, S2, . . . are independent random variables. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is defined by:

    px,y := P(Xn+1 = y | Xn = x),   x, y = 0, 1, 2, . . .

We easily see that if y > 0 then

    px,y = P(Dn − Sn = y − x)
         = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x)
         = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x)
         = Σ_{z=0}^∞ FS(z) FD(z + y − x),

while, if y = 0, we have

    px,0 = 1 − Σ_{y=1}^∞ px,y,

something that, in principle, can be computed from FS and FD.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction, so he may move forwards with equal probability that he moves backwards.

    [Transition diagram: the integers . . . , −2, −1, 0, 1, 2, 3, . . .; from each state, an arrow one step left and one step right, each with probability 1/2.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left). The transition probability matrix is

    P = ( p0,0  p0,1  p0,2  · · · )
        ( p1,0  p1,1  p1,2  · · · )
        ( p2,0  p2,1  p2,2  · · · )
        (  · · ·   · · ·   · · ·  )

and is an infinite matrix.
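Questions 1-4 can be explored empirically before we develop the theory. A minimal simulation of the drunkard's walk (with p a free parameter):

    import random

    def walk(p=0.5, steps=100):
        """One trajectory of the drunkard's walk started at 0 (p = prob. of a step right)."""
        x, path = 0, [0]
        for _ in range(steps):
            x += 1 if random.random() < p else -1
            path.append(x)
        return path

    print(walk()[:20])   # the first 20 positions of one random trajectory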

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

    [Figure: a square street grid; caption: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and, let's say, he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live; the answer to this is, therefore, crucial for the policy determination. It proposes the following model, summarising the state of health of an individual on a monthly basis:

    [Transition diagram: three states H (healthy), S (sick), D (dead); the labelled arrows include pH,S = 0.3, and the numbers 0.8 and 0.01 appear on other arrows.]

Thus, for example, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy. Note that the diagram omits pH,H, pS,S and pD,D, which are computed from pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual?
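This question can be answered by iterating the transition matrix restricted to the transient states {H, S}. The exact numbers in the diagram are partly illegible here, so the values below (p_HS = 0.3, p_HD = 0.01, p_SH = 0.8, p_SD = 0.1) are assumptions for illustration only; the self-loop probabilities follow from the row sums being 1:

    # Assumed transition probabilities (illustrative, not from the diagram):
    p_HS, p_HD = 0.3, 0.01
    p_SH, p_SD = 0.8, 0.1
    Q = [[1 - p_HS - p_HD, p_HS],    # from H to (H, S)
         [p_SH, 1 - p_SH - p_SD]]    # from S to (H, S)

    dist = [1.0, 0.0]                # start healthy
    for n in range(1, 13):
        # P(lifetime = n) = probability mass leaving {H, S} at this step
        p_die = dist[0] * p_HD + dist[1] * p_SD
        print(n, round(p_die, 5))
        dist = [dist[0] * Q[0][0] + dist[1] * Q[1][0],
                dist[0] * Q[0][1] + dist[1] * Q[1][1]]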

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. Usually, T = Z+ = {0, 1, 2, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states.

We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m ∈ T, m > n) is independent of the past process (Xm, m ∈ T, m < n), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property; it is then customary to call (Xn) a Markov chain. We focus almost exclusively on the first choice, T = Z+, so we will omit referring to T explicitly, unless needed.

(Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . . , in+1 ∈ S,

    P(Xn+1 = in+1 | Xn = in, . . . , X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for any n, m and any choice of states i0, . . . , in+m,

    P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain that (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) are independent conditionally on Xn, viz. the Markov property. This is true for any n and m, and so the future is independent of the past given the present.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in)
        = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions

    µ(i) := P(X0 = i), i ∈ S,   and   pi,j(n, n + 1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,

will specify all joint distributions²:

    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n − 1, n).

The function µ(i) is called initial distribution, and the function pi,j(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally,

    pi,j(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by (1),

    pi,j(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} pi,i1(m, m + 1) pi1,i2(m + 1, m + 2) · · · pin−1,j(n − 1, n),   (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

    P(m, n) := [pi,j(m, n)]i,j∈S,

i.e. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

    P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),

or, more generally, as

    P(m, n) = P(m, ℓ) P(ℓ, n)   if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

²And, by a result in Probability, all probabilities of events associated with the Markov chain.
³This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n, and therefore we can simply write

    pi,j(n, n + 1) =: pi,j.

We call pi,j the (one-step) transition probability from state i to state j, while P = [pi,j]i,j∈S is called the (one-step) transition probability matrix or, simply, transition matrix. Then

    P(m, n) = P × P × · · · × P = P^{n−m}   (n − m times),

for all m ≤ n, and the semigroup property is now even more obvious because

    P^{a+b} = P^a P^b,   for all integers a, b ≥ 0.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

Let µn := [µn(i) = P(Xn = i)]i∈S be the distribution of Xn, arranged in a row vector. Then we have

    µn = µ0 P^n,   (3)

as follows from (1) and the definition of P. The matrix notation is useful.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi. In other words,

    δi(j) := 1, if j = i;   0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = p δi1 + (1 − p) δi2.

Starting from a specific state i. A useful convention used for time-homogeneous chains is to write

    Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write

    Ei Y := E(Y | X0 = i) = Σ_y y P(Y = y | X0 = i) = Σ_y y Pi(Y = y).
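Equation (3) is easy to experiment with on the computer. A sketch, using the mouse chain of Section 2.1 and the unit mass δ1 as initial distribution (numpy is an arbitrary but convenient choice here):

    import numpy as np

    P = np.array([[0.95, 0.05],
                  [0.99, 0.01]])       # the mouse chain again
    mu = np.array([1.0, 0.0])          # unit mass delta_1: start in cell 1
    for n in range(5):
        print(n, mu)                   # mu_n = mu_0 P^n, per equation (3)
        mu = mu @ P                    # one more application of P
    # the semigroup property in matrix form: P^(a+b) = P^a P^b
    print(np.allclose(np.linalg.matrix_power(P, 5), np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 3)))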

4 Stationarity

We say that the process (Xn, n ≥ 0) is stationary if it has the same law as (Xn, n ≥ m), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then, at time n + 1, it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa. Let Xn be the number of molecules in room 1 at time n. Then

    pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
    pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, we indeed expect that we have N/2 molecules in each room. Another guess then is: Place

each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be i with probability

    π(i) = (N choose i) 2^{−N},   0 ≤ i ≤ N

(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

    P(X1 = i) = Σ_j P(X0 = j) pj,i = P(X0 = i − 1) pi−1,i + P(X0 = i + 1) pi+1,i
              = π(i − 1) pi−1,i + π(i + 1) pi+1,i
              = (N choose i−1) 2^{−N} (N − i + 1)/N + (N choose i+1) 2^{−N} (i + 1)/N = · · · = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

    P(X0 = i0, X1 = i1, . . . , Xn = in) = P(Xm = i0, Xm+1 = i1, . . . , Xm+n = in),

for all m ≥ 0 and i0, . . . , in ∈ S.

In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. Since, by (3), the distribution µn at time n is related to the distribution at time 0 by µn = µ0 P^n, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that

    πP = π.   (4)

Such a π is called stationary or invariant distribution. The equations (4) are called balance equations.
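The missing algebra above can also be checked numerically: for the Ehrenfest chain, the binomial distribution solves the balance equations (4). A sketch with a small N:

    import numpy as np
    from math import comb

    N = 10                                    # a small N, just for checking
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N:
            P[i, i + 1] = (N - i) / N         # a molecule moves into room 1
        if i > 0:
            P[i, i - 1] = i / N               # a molecule moves out of room 1
    pi = np.array([comb(N, i) * 2.0**(-N) for i in range(N + 1)])
    print(np.allclose(pi @ P, pi))            # True: binomial is stationary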

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1 because Σ_{j∈S} pi,j = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to that, π must be a probability: Σ_{i∈S} π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, . . . , d}; each x ∈ Sd is of the form x = (x1, . . . , xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

    π(x) = 1/d!,   x ∈ Sd.

Example: waiting for a bus. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, . . .; the times between successive buses are i.i.d. random variables. The state space is S = N = {1, 2, . . .}. Then (Xn, n ≥ 0) is a Markov chain with transition probabilities

    pi+1,i = 1,   i ≥ 1,
    p1,i = p(i),   i ≥ 1.

To find the stationary distribution, write the equations (4):

    π(i) = Σ_{j≥1} π(j) pj,i = π(1) p(i) + π(i + 1),   i ≥ 1.

These can easily be solved:

    π(i) = π(1) Σ_{j≥i} p(j),   i ≥ 1.

To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives

    π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^{−1}.

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

    ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

    Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

    ASSUMPTION ⇔ the expected time between successive bus arrivals is finite.

In other words, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.
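For a concrete interarrival law p(·) with finite support, the stationary distribution of the bus chain can be computed directly from the formulas above (the uniform law on {1, . . . , 5} below is an arbitrary illustrative assumption):

    # pi(i) = (sum_{j>=i} p(j)) / E(interarrival time), by the computation above
    p = {i: 0.2 for i in range(1, 6)}                 # assumed interarrival law
    tails = {i: sum(p[j] for j in p if j >= i) for i in range(1, 6)}
    mean = sum(j * p[j] for j in p)                   # expected time between buses
    pi = {i: tails[i] / mean for i in tails}
    print(pi, sum(pi.values()))                       # sums to 1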

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted.) Consider an infinite binary vector ξ = (ξ1, ξ2, . . .), i.e. ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0, then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1, then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

    ξ(n + 1) = ϕ(ξ(n)).

Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:

    P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2,   for any n = 0, 1, 2, . . . and any r = 1, 2, . . .   (5)

The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε1, . . . , εr ∈ {0, 1},

    P(ξ1(n + 1) = ε1, . . . , ξr(n + 1) = εr)
      = P(ξ1(n) = 0, ξ2(n) = ε1, . . . , ξr+1(n) = εr)
        + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . . , ξr+1(n) = 1 − εr).

You can check that, by the induction hypothesis, each term on the right equals 2^{−(r+1)}, so the sum equals 2^{−r}, which is exactly the i.i.d. fair-bit probability at time n + 1. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. But to even think about it will give you some strength. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2, and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread: it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: when a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

    π = πP   (balance equations)
    π1 = 1   (normalisation condition).

In other words,

    π(i) = Σ_{j∈S} π(j) pj,i,   i ∈ S,   and   Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:

    F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j.

Think of π(i) pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

    F(A, Ac) = F(Ac, A),   for all A ⊂ S.

Proof. First, suppose that the balance equations hold. Fix a set A ⊂ S and write

    π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Multiply π(i) by Σ_{j∈S} pi,j (which equals 1):

    π(i) Σ_{j∈S} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Now split the left sum also:

    Σ_{j∈A} π(i) pi,j + Σ_{j∈Ac} π(i) pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek. Conversely, if the latter condition holds, then choose A = {i} and see that you obtain the balance equations.

    [Figure: schematic representation of the flow balance relations F(A, Ac) = F(Ac, A).]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β. (Consequently, p11 = 1 − α, p22 = 1 − β.) Assume 0 < α, β < 1. We have

    π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

    π(1) = cβ,   π(2) = cα.

The constant c is found from π(1) + π(2) = 1. That is,

    π(1) = β/(α + β),   π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952 and π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
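A two-line check of this example, including the flow balance across the cut {1} | {2}:

    alpha, beta = 0.05, 0.99
    pi1, pi2 = beta / (alpha + beta), alpha / (alpha + beta)
    print(pi1, pi2)                                 # ~0.952 and ~0.048
    # F({1},{2}) = F({2},{1}): current 1 -> 2 equals current 2 -> 1
    print(abs(pi1 * alpha - pi2 * beta) < 1e-12)    # True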

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

    [Transition diagram: states 0, 1, 2, 3, 4, . . .; from each state i an arrow to i + 1 with probability p, from each i ≥ 1 an arrow to i − 1 with probability q, and a self-loop at 0 with probability q.]

The balance equations are:

    π(i) = p π(i − 1) + q π(i + 1),   i = 1, 2, . . .

But we can make them simpler by choosing A = {0, 1, . . . , i − 1}. If i ≥ 1 we have

    F(A, Ac) = π(i − 1) p = π(i) q = F(Ac, A).

These are much simpler than the previous ones; in fact, we can solve them immediately. For all i ≥ 1,

    π(i) = (p/q)^i π(0).

Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)
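With p < 1/2, normalising the geometric solution gives π(i) = (1 − ρ)ρ^i with ρ = p/q, which is easy to verify against a cut equation (p = 0.3 below is an arbitrary choice):

    p = 0.3
    q = 1 - p
    rho = p / q
    pi = [(1 - rho) * rho**i for i in range(50)]    # pi(0) = 1 - rho normalises the sum
    i = 5
    # flow right across the cut {0,...,i-1} equals flow left across it
    print(abs(pi[i - 1] * p - pi[i] * q) < 1e-15)   # True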

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by

    (i, j) ∈ E ⇔ pi,j > 0.

In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time with positive probability. In other words,

    i leads to j (written as i ⇝ j) ⇔ Pi(∃ n ≥ 0 Xn = j) > 0.

Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, . . . , in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since

    0 < Pi(∃ n ≥ 0 Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),

the sum on the right must have a nonzero term, i.e. Pi(Xn = j) > 0 for some n, and so (iii) holds.
• Suppose that (iii) holds. We have

    Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 · · · pin−1,j,   (6)

where the sum extends over all possible choices of states in S. Since Pi(Xn = j) > 0, the sum contains a nonzero term, meaning that (ii) holds.
• Suppose now that (ii) holds. Look at the last display again. By assumption, the sum contains a positive term, so Pi(Xn = j) > 0, and (i) holds because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j).

The relation ⇝ is also transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

    i communicates with j (written as i ↭ j) ⇔ i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ↭ j ⇔ j ↭ i).

Corollary 1. The relation ↭ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric; it is reflexive and transitive because ⇝ is so.

Just as any equivalence relation, ↭ partitions S into equivalence classes known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

    [i] := {j ∈ S : j ↭ i}.

Corollary 2. Two communicating classes are either identical or completely disjoint: [i] = [j] if and only if i ↭ j.

    [Figure: a chain on the six states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

    Σ_{j∈C} pi,j = 1,   for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: If i ∈ C and if i ⇝ j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, that i ∈ C, and that i ⇝ j. We can then find states i1, . . . , in−1 such that pi,i1 > 0, pi1,i2 > 0, . . . , pin−1,j > 0. Since i ∈ C, and C is closed, we have i1 ∈ C; then i2 ∈ C, and so on, and we conclude that j ∈ C. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts. If a state i is such that pi,i = 1, then we say that i is an absorbing state; in other words, the single-element set {i} is then a closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. Why?

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i ⇝ j but j does not lead back to i.

    [Figure: the general structure of a Markov chain, with communicating classes C1, C2, C3, C4, C5 and arrows between some of them.]

The general structure of a Markov chain is indicated in this figure. The classes C4, C5 are closed communicating classes; the classes C1, C2, C3 are not closed. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction. A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes, then we obtain a collection of disjoint closed communicating classes with no arrows between them.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. To put it otherwise, the state space S is a closed communicating class: starting from any state, we can reach any other state.

5.4 Period

Consider the following chain:

    [Figure: four states 1, 2, 3, 4 arranged in a square, with two-way arrows between neighbouring corners.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time, then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1, then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by pij^(n) the entries of P^n. Since P^(m+n) = P^m P^n, we have

    pij^(m+n) ≥ pik^(m) pkj^(n),   for all m, n ≥ 0 and all states i, j, k.

So pii^(2n) ≥ (pii^(n))² and, more generally, pii^(ℓn) ≥ (pii^(n))^ℓ for all ℓ ∈ N. Therefore, if pii^(n) > 0 then pii^(ℓn) > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if pii^(n) > 0, then pii^(m) > 0. So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

The period of an essential state is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

    d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.

A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.

Theorem 2 (the period is a class property). If i ↭ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let Di = {n ∈ N : pii^(n) > 0}, Dj = {n ∈ N : pjj^(n) > 0}, so that d(i) = gcd Di, d(j) = gcd Dj. Since i ⇝ j, there is α ∈ N with pij^(α) > 0, and, since j ⇝ i, there is β ∈ N with pji^(β) > 0.

So pjj^(α+β) ≥ pji^(β) pij^(α) > 0, showing that α + β ∈ Dj. Now let n ∈ Di. Then pii^(n) > 0, and so

    pjj^(α+n+β) ≥ pji^(β) pii^(n) pij^(α) > 0,

showing that α + n + β ∈ Dj. Therefore d(j) | α + β and d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers, then it also divides their difference; from the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state then there exists n0 such that pii^(n) > 0 for all n ≥ n0.

Proof. Pick n1, n2 ∈ Di such that n2 − n1 = 1. Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Therefore

    n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.

Because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di and Di is closed under addition, we have n ∈ Di as well.

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Note also that an irreducible chain with finitely many states is aperiodic if and only if there exists an n such that pij^(n) > 0 for all i, j ∈ S. Of particular interest are irreducible aperiodic chains, i.e. chains where all states communicate with one another and d = 1.
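The period of a state can be estimated numerically by taking the gcd of the return lengths found in the first few powers of P (a sketch, not a proof; the transition probabilities chosen for the 4-cycle chain below are arbitrary, only their positivity pattern matters):

    import numpy as np
    from math import gcd
    from functools import reduce

    def period(P, i, nmax=50):
        """gcd of all n <= nmax with (P^n)_{ii} > 0."""
        ns, Q = [], np.eye(len(P))
        for n in range(1, nmax + 1):
            Q = Q @ P
            if Q[i, i] > 0:
                ns.append(n)
        return reduce(gcd, ns)

    # the 4-cycle chain drawn above: each corner moves to its two neighbours
    P = np.array([[0, .5, 0, .5], [.5, 0, .5, 0], [0, .5, 0, .5], [.5, 0, .5, 0]])
    print(period(P, 0))   # 2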

More generally, if an irreducible chain has period d, then we can decompose the state space into d sets C0, C1, . . . , Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, . . . , Cd−1 such that

    Σ_{j∈Cr+1} pij = 1,   i ∈ Cr,   r = 0, 1, . . . , d − 1.

(Here Cd := C0.)

Proof. Define the relation

    i ~d~ j ⇔ pij^(nd) > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into d disjoint equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d~ j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, . . . , id−1, with corresponding classes C0, C1, . . . , Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0, then, necessarily, id ∈ C0. We now show that if i belongs to one of these classes and if pij > 0, then, necessarily, j belongs to the next class. Take, for example, i ∈ C0. Suppose pij > 0 but j ∉ C1, say j ∈ C2. Consider the path i0 → i → j → i2 → i3 → · · · → id → i0. Such a path is possible because of the choice of the i0, i1, . . ., and by the assumptions that i ~d~ i0 and j ~d~ i2. The existence of such a path implies that it is possible to go from i0 to i0 in a number of steps which is congruent to d − 1 modulo d, hence not an integer multiple of d (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by

    TA := inf{n ≥ 0 : Xn ∈ A}.

We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

    [Figure: three states 0, 1, 2; states 0 and 2 are absorbing; from state 1, the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6.]

⁴We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define

    ϕ(i) := Pi(TA < ∞).

We then have:

Theorem 5. The function ϕ(i) satisfies

    ϕ(i) = 1,   i ∈ A,
    ϕ(i) = Σ_{j∈S} pij ϕ(j),   i ∉ A.

Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A, then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + T′A, where T′A is the remaining time until A is hit. We first have

    Pi(TA < ∞) = Σ_{j∈S} Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σ_{j∈S} Pi(T′A < ∞ | X1 = j) pij.

But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, . . .). Hence, conditionally on X1, the event {T′A < ∞} is independent of X0:

    P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j).

But the Markov chain is homogeneous, which implies that

    P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).

Combining the above, we have

    Pi(TA < ∞) = Σ_{j∈S} ϕ(j) pij,

as needed. For the second part of the proof, let ϕ′(i) be another solution. Then

    ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ϕ′(j).

By self-feeding this equation, we have

    ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ( Σ_{k∈A} pjk + Σ_{k∉A} pjk ϕ′(k) )
          = Σ_{j∈A} pij + Σ_{j∉A} Σ_{k∈A} pij pjk + Σ_{j∉A} Σ_{k∉A} pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and we let ϕ(i) = Pi(T0 < ∞). We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0 it takes no time to hit 0; if the chain starts from 2 it will never hit 0.) As for ϕ(1), we have

    ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time:

    ψ(i) := Ei TA.

We have:

Theorem 6. The function ψ(i) satisfies

    ψ(i) = 0,   i ∈ A,
    ψ(i) = 1 + Σ_{j∉A} pij ψ(j),   i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A then, obviously, TA = 0, and so ψ(i) := Ei TA = 0. If i ∉ A, then TA = 1 + T′A, as above. Therefore,

    Ei TA = 1 + Ei T′A = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),

which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2} and ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, while

    ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3; therefore the mean number of flips until the first success is 3/2.)
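Both examples amount to solving one linear equation each, which is easy to re-derive mechanically:

    # First-step analysis for the 3-state example: states 0, 2 absorbing;
    # from state 1: 1/2 -> 0, 1/3 -> 1 (self-loop), 1/6 -> 2.
    # phi(1) = (1/3) phi(1) + (1/2) phi(0), with phi(0) = 1, phi(2) = 0:
    phi1 = (1 / 2) / (1 - 1 / 3)
    # psi(1) = 1 + (1/3) psi(1), with psi(0) = psi(2) = 0:
    psi1 = 1 / (1 - 1 / 3)
    print(phi1, psi1)   # 0.75 and 1.5, as computed in the text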

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities

    ϕAB(i) = Pi(TA < TB).

It should be clear that the equations satisfied are:

    ϕAB(i) = 1,   i ∈ A,
    ϕAB(i) = 0,   i ∈ B,
    ϕAB(i) = Σ_{j∈S} pij ϕAB(j),   otherwise.

Furthermore, ϕAB is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x, a reward f(x) is earned. This happens up to the first hitting time TA of the set A. The total reward is f(X0) + f(X1) + · · · + f(XTA). We are interested in the mean total reward

    h(x) := Ex Σ_{n=0}^{TA} f(Xn).

Clearly, h(x) = f(x) for x ∈ A, because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then TA = 1 + T′A, where T′A is the remaining time, as argued earlier. Then

    h(x) = Ex Σ_{n=0}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=1}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=0}^{T′A} f(Xn+1),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of X1, X2, . . .. Hence, by the Markov property,

    Ex Σ_{n=0}^{T′A} f(Xn+1) = Σ_{y∈S} pxy h(y).

Thus, the set of equations satisfied by h are:

    h(x) = f(x),   x ∈ A,
    h(x) = f(x) + Σ_{y∈S} pxy h(y),   x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die; why?)

    [Figure: a house with four rooms; 1 and 2 on top, 4 and 3 below, with doors 1–2, 2–3, 1–4 and 3–4.]

The question is to compute the average total reward. If we let h(i) be the average total reward if he starts from room i, we have

    h(4) = 0,
    h(1) = 1 + (1/2) h(2),
    h(3) = 3 + (1/2) h(2),
    h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
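The same system, written in the generic form h = f + Ah for the transient rooms and solved as a linear system:

    import numpy as np

    # rooms 1, 2, 3 are transient; room 4 is absorbing with h(4) = 0
    A = np.array([[0, .5, 0],     # from 1: half the time to 2 (half to 4)
                  [.5, 0, .5],    # from 2: to 1 or 3
                  [0, .5, 0]])    # from 3: half the time to 2 (half to 4)
    f = np.array([1.0, 2.0, 3.0])
    h = np.linalg.solve(np.eye(3) - A, f)
    print(h)                      # [5. 8. 7.]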

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake; so, if ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), and if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. So, if you start with X0 pounds, your fortune after the n-th toss is

    Xn = X0 + Σ_{k=1}^n Yk ξk.

The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Clearly, if the gambler wants to have a strategy at all, Yk cannot depend on ξk; it can only depend on ξ1, . . . , ξk−1 and, as is often the case in practice, on other considerations of e.g. astrological nature. In other words, no gambler is assumed to have any information about the outcome of the upcoming toss.

Let us consider a concrete problem. The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). She will stop playing if her money falls below a (a < x). We wish to find the probability that she reaches her target before her fortune drops to a.

We are clearly dealing with a Markov chain in the state space {a, a + 1, . . . , b} with transition probabilities

    pi,i+1 = p,   pi,i−1 = q = 1 − p,   a < i < b,

and where the states a, b are absorbing. Consider the hitting time of state y:

    Ty := inf{n ≥ 0 : Xn = y}.

We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x.

The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits).

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p (i.e., in the simplest case, Yk = 1 for all k). Let £x be the initial fortune of the gambler, a ≤ x ≤ b. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

    Px(Tb < Ta) = ((q/p)^{x−a} − 1) / ((q/p)^{b−a} − 1),   if p ≠ q,
    Px(Tb < Ta) = (x − a)/(b − a),   if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let ϕ(x, ℓ) := Px(Tℓ < T0). The general solution will then be obtained from Px(Tb < Ta) = ϕ(x − a, b − a). (Think why!) To simplify things, we write ϕ(x) instead of ϕ(x, ℓ). We obviously have

    ϕ(0) = 0,   ϕ(ℓ) = 1,

and, by first-step analysis,

    ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1),   0 < x < ℓ.

Since p + q = 1, we can write this as

    p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

    δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0),   0 ≤ x < ℓ.

Assume first that λ := q/p ≠ 1. Then

    ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^{x−1} + · · · + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).

Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),   0 ≤ x ≤ ℓ,   if λ ≠ 1.

If λ = 1 then q = p, and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x, and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ, and so ϕ(x) = x/ℓ. Summarising, we have obtained

    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1),   if λ ≠ 1,
    ϕ(x, ℓ) = x/ℓ,   if λ = 1,

for 0 ≤ x ≤ ℓ. The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

    Px(T0 < ∞) = (q/p)^x,   if p > 1/2,
    Px(T0 < ∞) = 1,   if p ≤ 1/2.

Proof. We have

    Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2, then

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
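Theorem 7 translates directly into code, and the numbers are instructive: even a small bias in the coin changes the picture dramatically.

    def win_prob(x, a, b, p):
        """P_x(T_b < T_a) for the simple gambler's ruin chain (Theorem 7)."""
        if p == 0.5:
            return (x - a) / (b - a)
        lam = (1 - p) / p
        return (lam**(x - a) - 1) / (lam**(b - a) - 1)

    print(win_prob(50, 0, 100, 0.50))   # 0.5
    print(win_prob(50, 0, 100, 0.49))   # roughly 0.12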

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, . . . , Xn) for all n ∈ Z+. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any deterministic time n, the future process is independent of the past process (X0, . . . , Xn) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain. Let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . . , Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, . . . , Xτ) if τ < ∞. Since

    (X0, . . . , Xτ) 1(τ < ∞) = Σ_{n∈Z+} (X0, . . . , Xn) 1(τ = n),

any such A must have the property that, for all n ∈ Z+, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A and any states j1, j2, . . .,

    P(Xτ+1 = j1, Xτ+2 = j2, . . . , A | Xτ = i, τ < ∞) = pi,j1 pj1,j2 · · · P(A | Xτ = i, τ < ∞).

We have:

    P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
      (a) = P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
      (b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
      (c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
      (d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn) and the ordinary splitting property (the Markov property) at time n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

    Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the Ti of Section 6: our Ti here is always positive, whereas the Ti of Section 6 was allowed to take the value 0. If X0 ≠ i, then the two definitions coincide. (And we set Ti^(1) := Ti.) We let Ti^(2) be the time of the second visit, Ti^(3) the time of the third visit, and so on. Formally, we recursively define

    Ti^(r) := inf{n > Ti^(r−1) : Xn = i},   r = 2, 3, . . .

All these random times are stopping times. (Why?) State i may or may not be visited again; regardless, the definitions make sense, with the convention that the infimum of an empty set is +∞.

Consider the "trajectory" of the chain between two successive visits to the state i:

    Xi^(r) := (Xn, Ti^(r) ≤ n < Ti^(r+1)),   r = 1, 2, . . .

Let also

    Xi^(0) := (Xn, 0 ≤ n < Ti^(1)).

We think of Xi^(0), Xi^(1), Xi^(2), . . . as "random variables", in a generalised sense: they are, in fact, random trajectories, "chunks". We refer to them as excursions of the process. Schematically:

    [Figure: a sample path split at the successive visit times Ti = Ti^(1), Ti^(2), Ti^(3), Ti^(4) to state i. These excursions are independent!]

So Xi^(r) is the r-th excursion: the chain visits

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions Xi^(0), Xi^(1), Xi^(2), . . . are independent random variables. Moreover, except, possibly, for Xi^(0), they are also identical in law (i.e. have the same distribution).

Proof. Apply the Strong Markov Property (SMP) at Ti^(r): SMP says that, given the state at the stopping time Ti^(r), the future after Ti^(r) is independent of the past before Ti^(r). But the state at Ti^(r) is, by definition, equal to i. Therefore, the future is independent of the past, and this proves the first claim. The claim that Xi^(1), Xi^(2), . . . have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(Xi^(0)), g(Xi^(1)), g(Xi^(2)), . . . of the excursions are independent random variables.

Example: Define
  Λr := Σ_{Ti^(r) ≤ n < Ti^(r+1)} 1(Xn = j).
The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ1, Λ2, . . . are independent random variables.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions.

First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:
  Pi(Xn = i for infinitely many n) = 1.
Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:
  Pi(Xn = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, there could, in principle, be yet another possibility:
  0 < Pi(Xn = i for infinitely many n) < 1.
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Suppose that the Markov chain starts from state i. Consider the total number of visits to state i (excluding the possibility that X0 = i):
  Ji = Σ_{n=1}^∞ 1(Xn = i) = sup{r ≥ 0 : Ti^(r) < ∞}.
Notice that: state i is visited infinitely many times ⇐⇒ Ji = ∞. Notice also that
  fii := Pi(Ji ≥ 1) = Pi(Ti < ∞)
can either be equal to 1 or smaller than 1. We claim that:

Lemma 4. Starting from X0 = i, the random variable Ji is geometric:
  Pi(Ji ≥ k) = fii^k,  k ≥ 0.

Proof. Indeed, the event {Ji ≥ k} implies that Ti < ∞ and that, counting only the visits after time Ti, at least k − 1 further visits occur. By the strong Markov property at time Ti^(k−1),
  Pi(Ji ≥ k) = Pi(Ji ≥ k − 1) Pi(Ji ≥ 1):
the past before Ti^(k−1) is independent of the future, and that is why we get the product. Hence
  Pi(Ji ≥ k) = fii Pi(Ji ≥ k − 1) = fii² Pi(Ji ≥ k − 2) = · · · = fii^k.

If fii = 1 we have that Pi(Ji ≥ k) = 1 for all k, i.e. Pi(Ji = ∞) = 1. This means that the state i is recurrent. If fii < 1, then Pi(Ji < ∞) = 1. This means that the state i is transient. Because either fii = 1 or fii < 1, we conclude what we asserted earlier, namely, that every state is either recurrent or transient.
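Lemma 4 is easy to probe numerically. For a simple random walk on Z with up-probability p = 0.6 (a drunkard's walk with drift, as in Section 2), the return probability to the origin is the classical value f00 = 1 − |2p − 1| = 0.8 < 1, so J0 should have the geometric tail P0(J0 ≥ k) = 0.8^k. A sketch (Python; the finite simulation horizon is an assumption, justified here because a transient walk makes almost all of its returns early):

    import random

    def visits_to_zero(p=0.6, horizon=2000):
        # count the returns to 0 of a biased walk started at 0; with
        # upward drift the walk is transient, so a finite horizon
        # captures (almost) all returns
        x, count = 0, 0
        for _ in range(horizon):
            x += 1 if random.random() < p else -1
            if x == 0:
                count += 1
        return count

    samples = [visits_to_zero() for _ in range(5000)]
    for k in range(5):
        tail = sum(1 for j in samples if j >= k) / len(samples)
        print(k, tail, 0.8 ** k)   # empirical P(J >= k) vs f_00^k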

We summarise this, together with another useful characterisation of recurrence & transience, in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. fii = Pi(Ti < ∞) = 1.
3. Pi(Ji = ∞) = 1.
4. Ei Ji = ∞.

Proof. The equivalence of the first two statements was proved above. State i is recurrent if, by definition, it is visited infinitely many times, which, by the definition of Ji, is equivalent to Pi(Ji = ∞) = 1. If Pi(Ji = ∞) = 1 then, clearly, Ei Ji = ∞. Conversely, if Ei Ji = ∞ then, since Ji is a geometric random variable, we see that its expectation cannot be infinity without Ji itself being infinity with probability one.

Lemma 6. The following are equivalent:
1. State i is transient.
2. fii = Pi(Ti < ∞) < 1.
3. Pi(Ji < ∞) = 1.
4. Ei Ji < ∞.

Proof. The equivalence of the first two statements was proved before the lemma. State i is transient if, by definition, it is visited finitely many times with probability 1, which, by the definition of Ji, is equivalent to Pi(Ji < ∞) = 1. On the other hand, if we assume Ei Ji < ∞ then, clearly, Ji cannot take the value +∞ with positive probability, therefore Pi(Ji < ∞) = 1. Conversely, if fii < 1 then, owing to the fact that Ji is a geometric random variable, Ei Ji < ∞.

Corollary 6. If i is a transient state then pii^(n) → 0, as n → ∞.

Proof. If i is transient then Ei Ji < ∞. But
  Ei Ji = Ei Σ_{n=1}^∞ 1(Xn = i) = Σ_{n=1}^∞ Pi(Xn = i) = Σ_{n=1}^∞ pii^(n),
and if a series is finite then the summands go to 0: pii^(n) → 0, as n → ∞.

10.1 First hitting time decomposition

Define
  fij^(n) := Pi(Tj = n),  n ≥ 1,
the probability to hit state j for the first time at time n ≥ 1, assuming that the chain starts from state i. Notice the difference with pij^(n): the latter is the probability to be in state j at time n (not necessarily for the first time) starting from i. Clearly, fij^(n) ≤ pij^(n), but the two sequences are related in an exact way:

Lemma 7. pij^(n) = Σ_{m=1}^n fij^(m) pjj^(n−m),  n ≥ 1.

Proof.
  pij^(n) = Pi(Xn = j) = Σ_{m=1}^n Pi(Tj = m, Xn = j) = Σ_{m=1}^n Pi(Tj = m) Pi(Xn = j | Tj = m).

The first term equals fij^(n). For the second term, we use the Strong Markov Property:
  Pi(Xn = j | Tj = m) = Pi(Xn = j | Xm = j) = pjj^(n−m).

Corollary 7. If j is a transient state then pij^(n) → 0, as n → ∞, for all states i.

Proof. If j is transient then Σ_{n=1}^∞ pjj^(n) = Ej Jj < ∞. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right hand side)
  Σ_{n=1}^∞ pij^(n) = Σ_{m=1}^∞ fij^(m) Σ_{n=m}^∞ pjj^(n−m) = (1 + Ej Jj) Σ_{m=1}^∞ fij^(m) = (1 + Ej Jj) Pi(Tj < ∞),
which is finite. Hence pij^(n) → 0, as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let
  fij := Pi(Tj < ∞),
we have i ⇝ j ⇐⇒ fij > 0. We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and
  fij = Pi(Tj < ∞) = 1.
In particular, j ⇝ i.

Proof. Start with X0 = i. If i is recurrent and i ⇝ j then there is a positive probability (= fij) that j will appear in one of the i.i.d. excursions Xi^(0), Xi^(1), . . ., and so the probability that j will appear in a specific excursion is positive:
  gij := Pi(Tj < Ti) > 0.
So the random variables
  δ_{j,r} := 1(j appears in excursion Xi^(r)),  r = 0, 1, . . . ,
are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence, not only fij > 0, but also fij = 1; and since the chain returns to i after each excursion, j ⇝ i. The last statement is simply an expression in symbols of what we said above.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Proof. If i is recurrent then every other j in the same communicating class satisfies i ⇝ j and hence, by the above theorem, every other j is also recurrent. If a communicating class contains a state which is transient then, necessarily, all other states in the class are transient as well (for if one of them were recurrent, the first part would make all of them recurrent).

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ̸⇝ i. Then there is m such that Pi(A) := Pi(Xm = j) > 0. Let B := {Xn = i for infinitely many n}. Since j ̸⇝ i, we have
  Pi(A ∩ B) = Pi(Xm = j, Xn = i for infinitely many n) = 0.
Hence
  0 < Pi(A) = Pi(A ∩ B) + Pi(A ∩ B^c) = Pi(A ∩ B^c) ≤ Pi(B^c),
so Pi(B) = Pi(Xn = i for infinitely many n) < 1. But this probability is either 0 or 1, so Pi(B) = 0, and so i is a transient state.

So if a communicating class is recurrent then it contains no inessential states and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i must be visited infinitely many times. This state is recurrent. Hence, by the above, all states are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let Ti be the first return time to i then either Pi(Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi(Ti < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U0, U1, U2, . . . be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. Let
  T := inf{n ≥ 1 : Un > U0}.
It should be clear that this can, for example, appear in a certain game. Thus T represents the number of variables we need to draw to get a value larger than the initial one. What is the expectation of T? First, observe that T is finite. Indeed,
  P(T > n) = P(U1 ≤ U0, . . . , Un ≤ U0) = ∫_0^1 P(U1 ≤ x, . . . , Un ≤ x | U0 = x) dx = ∫_0^1 x^n dx = 1/(n + 1).
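Before drawing the conclusions below, the claim P(T > n) = 1/(n + 1) is easy to confirm by simulation; a minimal sketch (Python):

    import random

    def T():
        u0, n = random.random(), 0
        while True:
            n += 1
            if random.random() > u0:
                return n

    samples = [T() for _ in range(100000)]
    for n in (1, 2, 5, 10):
        print(n, sum(1 for t in samples if t > n) / len(samples), 1 / (n + 1))
    # each T() returns a finite value, yet the empirical mean of the
    # samples keeps growing with the sample size, reflecting ET = infinity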

Therefore,
  P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n + 1)) = 1.
But
  ET = Σ_{n=0}^∞ P(T > n) = Σ_{n=0}^∞ 1/(n + 1) = ∞.
You are invited to think about the meaning of this: if the initial value U0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!

We now classify a recurrent state i as follows. We say that i is positive recurrent if
  Ei Ti < ∞.
We say that i is null recurrent if
  Ei Ti = ∞.
Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading: it is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z1, Z2, . . . be i.i.d. random variables.
(i) Suppose E|Z1| < ∞ and let µ = EZ1. Then
  P( lim_{n→∞} (Z1 + · · · + Zn)/n = µ ) = 1.
(ii) Suppose E max(Z1, 0) = ∞, but E min(Z1, 0) > −∞. Then
  P( lim_{n→∞} (Z1 + · · · + Zn)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X0, X1, X2, . . . of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been fixed once and for all), the sequence of excursions
  Xa^(1), Xa^(2), Xa^(3), . . .
forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that
  G(Xa^(1)), G(Xa^(2)), G(Xa^(3)), . . .
are i.i.d. random variables, for reasonable choices of the functions G taking values in R.

Example 1: To each state x ∈ S, assign a reward f(x), which we can think of as a reward received when the chain is in state x. Then, for an excursion X, define G(X) to be the maximum reward received during this excursion. Specifically, in this example,
  G(Xa^(r)) = max{f(Xt) : Ta^(r) ≤ t < Ta^(r+1)}.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X. Thus, in this example,
  G(Xa^(r)) = Σ_{Ta^(r) ≤ t < Ta^(r+1)} f(Xt).  (7)

Whichever the choice of G, the law of large numbers tells us that
  ( G(Xa^(1)) + · · · + G(Xa^(n)) )/n → E G(Xa^(1)),  (8)
with probability 1, as n → ∞. Note that
  E G(Xa^(1)) = E G(Xa^(2)) = · · · = E[ G(Xa^(0)) | X0 = a ] = Ea G(Xa^(0)),
because the excursions Xa^(1), Xa^(2), . . . are i.i.d., and because, if we start with X0 = a, the initial excursion Xa^(0) also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a.

13.2 Average reward

Consider now a function f : S → R, which, as in Example 2, we can think of as a reward function. We are interested in the existence of the limit
  time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(Xk),
which represents the "time-average reward" received. In particular, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. Ea Ta < ∞) and assume that the distribution of X0 is such that P(Ta < ∞) = 1. Let f : S → R+ be a bounded "reward" function. Then, with probability one, the "time-average reward" f̄ exists and is a deterministic quantity:
  P( lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = f̄ ) = 1,
where
  f̄ = Ea[ Σ_{n=0}^{Ta−1} f(Xn) ] / Ea Ta = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. We are looking for the total reward received between 0 and t. The idea is this: suppose that t is large enough; we break the sum into the total reward received over the initial excursion, plus the sum of the total rewards received over complete excursions, plus the reward over the last, incomplete, excursion. Thus we need to keep track of the number, Nt, of complete excursions from the ground state a that occur between 0 and t:
  Nt := max{r ≥ 1 : Ta^(r) ≤ t}.
For example, in the figure below, Nt = 5.

[Figure: a time axis marked 0, Ta^(1), Ta^(2), Ta^(3), Ta^(4), Ta^(5), t, showing five complete excursions before time t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
  Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + G_first + G_last,
where Gr is given by (7), G_first is the reward up to time Ta = Ta^(1) or t (whichever is smaller), and G_last is the reward over the last incomplete cycle.

We assumed that f is bounded, say |f(x)| ≤ B. Hence |G_first| ≤ B Ta. Since P(Ta < ∞) = 1, we have B Ta/t → 0, and so
  (1/t) G_first → 0,  as t → ∞,
with probability one. A similar argument shows that
  (1/t) G_last → 0,  as t → ∞,
with probability one.

So we are left with the sum of the rewards over complete cycles. Write now
  (1/t) Σ_{r=1}^{Nt−1} Gr = (Nt/t) · (1/Nt) Σ_{r=1}^{Nt−1} Gr.  (9)
Look again at what the Law of Large Numbers (8) tells us: that
  lim_{n→∞} (1/n) Σ_{r=1}^n Gr = E G1,  (10)
with probability one. Since Ta^(r) → ∞ as r → ∞, it follows that the number Nt of complete cycles before time t will also be tending to ∞ as t → ∞. Hence
  lim_{t→∞} (1/Nt) Σ_{r=1}^{Nt−1} Gr = E G1.
Now look at the sequence
  Ta^(r) = Ta^(1) + Σ_{k=2}^r ( Ta^(k) − Ta^(k−1) )
and see what the Law of Large Numbers tells us again: Ta^(1), and Ta^(k) − Ta^(k−1), k = 2, 3, . . ., are i.i.d. random variables with common expectation Ea Ta, so
  lim_{r→∞} Ta^(r)/r = Ea Ta.  (11)
By definition, the visit with index r = Nt to the ground state a is the last one before t (see also the figure above):
  Ta^(Nt) ≤ t < Ta^(Nt+1).
Hence
  Ta^(Nt)/Nt ≤ t/Nt < Ta^(Nt+1)/Nt.
But (11) tells us that
  lim_{t→∞} Ta^(Nt)/Nt = lim_{t→∞} Ta^(Nt+1)/Nt = Ea Ta.
Hence
  lim_{t→∞} t/Nt = Ea Ta,  i.e.  lim_{t→∞} Nt/t = 1/Ea Ta.  (12)
Using (10) and (12) in (9) we have
  lim_{t→∞} (1/t) Σ_{r=1}^{Nt−1} Gr = E G1 / Ea Ta.
Thus
  lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = E G1 / Ea Ta =: f̄.

To see that f̄ is as claimed, just observe that
  E G1 = E Σ_{Ta^(1) ≤ t < Ta^(2)} f(Xt) = Ea Σ_{0 ≤ t < Ta} f(Xt),
from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included); the denominator is the duration of the excursion. The constant f̄ is thus a sort of average reward over an excursion. Note that we include only one endpoint of the excursion, not both: we may write
  E G1 = Ea Σ_{0 ≤ t < Ta} f(Xt)  or  E G1 = Ea Σ_{0 < t ≤ Ta} f(Xt).

[Figure: the rewards f(X0) = f(a), f(X1), f(X2), . . . , f(X_{Ta−1}) collected over the excursion between time 0 and time Ta.]

Comment 2: A very important particular case occurs when we choose
  f(x) := 1(x = b),
i.e. we let f be 1 at the state b and 0 everywhere else. Then the numerator equals
  Ea Σ_{k=0}^{Ta−1} 1(Xk = b) = average number of times state b is visited between two successive visits to state a,
and so, with probability 1,
  lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = b) = Ea[ Σ_{k=0}^{Ta−1} 1(Xk = b) ] / Ea Ta.  (13)

Comment 3: If in the above we let b = a, the ground state, then the numerator equals 1, and we see that
  lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = a) = 1/Ea Ta,  (14)
with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes Ea Ta units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/Ea Ta.

Comment 4: Suppose that two states, a, b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/Eb Tb. The equality of these two limits gives
  Ea[ Σ_{k=0}^{Ta−1} 1(Xk = b) ] / Ea Ta = 1/Eb Tb.
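Comments 2 and 3 suggest a direct numerical check: the long-run fraction of time spent at a state should match the reciprocal of the mean return time. A sketch (Python; the 3-state transition matrix is an arbitrary assumption for illustration):

    import random

    P = [[0.2, 0.5, 0.3],
         [0.4, 0.1, 0.5],
         [0.3, 0.3, 0.4]]

    def step(i):
        u, acc = random.random(), 0.0
        for j, p in enumerate(P[i]):
            acc += p
            if u < acc:
                return j
        return len(P[i]) - 1  # guard against floating-point rounding

    a, x, visits, t, last = 0, 0, 0, 200000, 0
    return_times = []
    for n in range(1, t + 1):
        x = step(x)
        if x == a:
            visits += 1
            return_times.append(n - last)
            last = n

    print(visits / t)                                   # long-run fraction at a
    print(1 / (sum(return_times) / len(return_times)))  # 1 / (empirical E_a T_a)
    # the two numbers agree, as (14) predicts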

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
  ν^[a](x) := Ea Σ_{n=0}^{Ta−1} 1(Xn = x),  x ∈ S,  (15)
should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then ν^[a](x)/Ea Ta is the long-run fraction of times that state x is visited. Clearly, this should have something to do with the concept of stationary distribution discussed a few sections ago.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent. (We stress that we do not need the stronger assumption of positive recurrence here.) Consider the function ν^[a] defined in (15). Then ν^[a] satisfies the balance equations:
  ν^[a] = ν^[a] P,
i.e.
  ν^[a](x) = Σ_{y∈S} ν^[a](y) p_yx,  x ∈ S.
Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^[a](x) < ∞.

Note: Although ν^[a] depends on the choice of the "ground" state a, we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped. In view of this, notice that the choice of notation ν^[a] is justifiable through the notation (6) for the communicating class of the state a.

Proof. We start the Markov chain with X0 = a. Recurrence tells us that Ta < ∞ with probability one. Since a = X_Ta, we can write
  ν^[a](x) = Ea Σ_{n=1}^{Ta} 1(Xn = x).  (16)
This was a trivial but important step: we replaced the n = 0 term with the n = Ta term; both of them equal 1(x = a), so there is nothing lost in doing so. The real reason for doing so is that we wanted to replace a function of X0, X1, X2, . . . by a function of X1, X2, . . ., and thus be in a position to apply first-step analysis,

which is precisely what we do next:
  ν^[a](x) = Ea Σ_{n=1}^∞ 1(Xn = x, n ≤ Ta)
  = Σ_{n=1}^∞ Pa(Xn = x, n ≤ Ta)
  = Σ_{n=1}^∞ Σ_{y∈S} Pa(Xn = x, Xn−1 = y, n ≤ Ta)
  = Σ_{n=1}^∞ Σ_{y∈S} p_yx Pa(Xn−1 = y, n ≤ Ta)
  = Σ_{y∈S} p_yx Ea Σ_{n=1}^{Ta} 1(Xn−1 = y)
  = Σ_{y∈S} p_yx ν^[a](y),
where, in the last step, we used (16) (and, in the step before it, the Markov property: the event {n ≤ Ta} is determined by X0, . . . , Xn−1). So we have shown
  ν^[a](x) = Σ_{y∈S} p_yx ν^[a](y),  x ∈ S,
which are precisely the balance equations.

Next, let x be such that a ⇝ x, so there exists n ≥ 1 such that p_ax^(n) > 0. Iterating the balance equations gives ν^[a] = ν^[a] Pⁿ for all n ∈ N. Since ν^[a](a) = 1, by definition, we get
  ν^[a](x) ≥ ν^[a](a) p_ax^(n) = p_ax^(n) > 0.
Also, x ⇝ a, so there exists m ≥ 1 such that p_xa^(m) > 0. Hence
  1 = ν^[a](a) ≥ ν^[a](x) p_xa^(m),
so, indeed, ν^[a](x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then
  Σ_{x∈S} ν^[a](x) = Ea Σ_{n=1}^{Ta} Σ_{x∈S} 1(Xn = x) = Ea Σ_{n=1}^{Ta} 1 = Ea Ta,
which is finite when a is positive recurrent.
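The occupation measure ν^[a] can be estimated by averaging occupation counts over many simulated excursions; a sketch under the same illustrative 3-state chain as before (Python):

    import random
    from collections import defaultdict

    P = [[0.2, 0.5, 0.3],
         [0.4, 0.1, 0.5],
         [0.3, 0.3, 0.4]]

    def step(i):
        u, acc = random.random(), 0.0
        for j, p in enumerate(P[i]):
            acc += p
            if u < acc:
                return j
        return len(P[i]) - 1

    a, trials = 0, 50000
    nu, ETa = defaultdict(float), 0.0   # estimates of nu^[a](x) and E_a T_a
    for _ in range(trials):
        x, T = a, 0
        counts = defaultdict(int)
        while True:
            counts[x] += 1      # occupation over 0 <= n < T_a (one endpoint only)
            x, T = step(x), T + 1
            if x == a:
                break
        ETa += T / trials
        for s, c in counts.items():
            nu[s] += c / trials

    pi = {s: v / ETa for s, v in nu.items()}   # this is formula (17) below
    print(dict(nu), ETa, pi)    # pi solves pi = pi P and sums to 1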

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define
  π^[a](x) := ν^[a](x)/Ea Ta = Ea[ Σ_{n=0}^{Ta−1} 1(Xn = x) ] / Ea Ta,  x ∈ S.  (17)
Then π^[a] is a stationary distribution.

Proof. Since a is positive recurrent, we have Ea Ta < ∞, and so π^[a] is not trivially equal to zero. But π^[a] is simply a multiple of ν^[a], and we have already shown, in Proposition 2, that ν^[a] P = ν^[a]. Therefore π^[a] P = π^[a]. It only remains to show that π^[a] is a probability, i.e. that it sums up to one. But
  Σ_{x∈S} π^[a](x) = (1/Ea Ta) Σ_{x∈S} ν^[a](x) = 1.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! In other words: if there is a positive recurrent state, then the set of linear equations
  π(x) = Σ_{y∈S} π(y) p_yx,  Σ_{y∈S} π(y) = 1,
has at least one solution. We DID NOT claim that this distribution is, in general, unique. This is what we will try to do in the sequel.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.

Proof. We already know that j is recurrent. Start with X0 = i and consider the excursions Xi^(r), r = 0, 1, 2, . . . Since i ⇝ j, the probability
  gij = Pi(Tj < Ti)
that j occurs in a specific excursion is positive; and, since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. Hence j will occur in infinitely many excursions. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads gij > 0; the coins are i.i.d.; if heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. Let R1 (respectively, R2) be the index of the first (respectively, second) excursion that contains j.

[Figure: excursions of the chain from i, with the excursions of indices R1, R1+1, . . . , R2 highlighted; the total duration ζ spanned by them has, under Pi, the law of a random sum of independent copies of Ti.]

We just need to show that
  ζ := Σ_{r=1}^{R2−R1+1} Zr
has finite expectation, where the Zr are i.i.d. random variables with the same law as Ti under Pi. But, by Wald's lemma, the expectation of this random variable equals
  Ei ζ = (Ei Ti) × Ei(R2 − R1 + 1) = (Ei Ti) × (gij^{−1} + 1),
because R2 − R1 is a geometric random variable:
  Pi(R2 − R1 = r) = (1 − gij)^{r−1} gij,  r = 1, 2, . . .
Since gij > 0 and Ei Ti < ∞, we have Ei ζ < ∞. Since the time between two successive occurrences of j is bounded by ζ, it follows (by the strong Markov property at the first visit to j) that Ej Tj < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

In other words: An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent; it is called positive recurrent if one (and hence all) states are positive recurrent; it is called null recurrent if one (and hence all) states are null recurrent; and it is called transient if one (and hence all) states are transient.

[Diagram: an IRREDUCIBLE CHAIN is either RECURRENT or TRANSIENT; a recurrent chain is either POSITIVE RECURRENT or NULL RECURRENT.]

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we have seen before:

[Diagram: state 0 has p00 = 1/2, p01 = 1/3, p02 = 1/6; states 1 and 2 each have a self-loop of probability 1.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many (infinitely many) stationary distributions: for each 0 ≤ α ≤ 1, the π defined by
  π(0) = 0,  π(1) = α,  π(2) = 1 − α,
is a stationary distribution. (You should think why this is OBVIOUS!) Notice that this lack of communication between the states is, in fact, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Fix a ground state a. Let π be a stationary distribution. Consider the Markov chain (Xn, n ≥ 0) and suppose that P(X0 = x) = π(x), for all x ∈ S; in other words, we start the Markov chain from the stationary distribution π. Then we have
  P(Xn = x) = π(x),  for all n and all x ∈ S.  (18)
Since the chain is irreducible and positive recurrent, we have Pi(Tj < ∞) = 1 for all states i and j. Hence
  P(Ta < ∞) = Σ_{x∈S} π(x) Px(Ta < ∞) = Σ_{x∈S} π(x) = 1.
By (13), following Theorem 14, we have, with probability one,
  (1/t) Σ_{n=1}^t 1(Xn = x) → Ea[ Σ_{k=0}^{Ta−1} 1(Xk = x) ] / Ea Ta = π^[a](x),
as t → ∞. By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded above by 1, we can take expectations of both sides:
  E (1/t) Σ_{n=1}^t 1(Xn = x) → π^[a](x),
as t → ∞. But, by (18),
  E (1/t) Σ_{n=1}^t 1(Xn = x) = (1/t) Σ_{n=1}^t P(Xn = x) = π(x).

Therefore π(x) = π^[a](x), for all x ∈ S: an arbitrary stationary distribution π must be equal to the specific one π^[a]. Hence there is only one stationary distribution.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π^[a] = π^[b].

Proof. Both π^[a] and π^[b] are stationary distributions, and there is only one stationary distribution, so they are equal.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,
  π(a) = 1/Ea Ta.

Proof. From the definition of π^[a],
  π(a) = π^[a](a) = ν^[a](a)/Ea Ta = 1/Ea Ta.

In particular, we obtain the cycle formula: for all a, b ∈ S,
  Ea[ Σ_{k=0}^{Ta−1} 1(Xk = x) ] / Ea Ta = Eb[ Σ_{k=0}^{Tb−1} 1(Xk = x) ] / Eb Tb,  x ∈ S.
The cycle formula merely re-expresses the equality π^[a] = π^[b].

In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider a Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Let π(·|C) be the distribution defined by restricting π to C:
  π(x|C) := π(x)/π(C),  x ∈ C,
where π(C) = Σ_{y∈C} π(y). Consider the restriction of the chain to C; namely, delete all transition probabilities pxy with x ∈ C, y ∉ C. We can easily see that we thus obtain a Markov chain with state space C and stationary distribution π(·|C). This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By the above corollary,
  π(i|C) = 1/Ei Ti.
But π(i|C) > 0. Therefore Ei Ti < ∞, i.e. i is positive recurrent.

Now consider an arbitrary Markov chain and decompose it into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π^C,

This is defined by picking a state (any state!) a ∈ C and letting π C := π [a] . So we have that the above decomposition holds with αC = π(C). x∈C Then (π(x|C). consider the motion of a pendulum under the action of gravity. take an = (−1)n . where π(C) := π(C) π(x). There is dissipation of energy due to friction and that causes the settling down. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or. under irreducibility and positive recurrence conditions. Let π be an arbitrary stationary distribution. more generally. for example. as n → ∞. For each such C define the conditional distribution π(x|C) := π(x) . Proposition 3. the so-called Cesaro 1 averages n n P (Xk = x) converge to π(x). where αC ≥ 0. it will settle to the vertical position.1 Coupling and stability Definition of stability Stability refers to the convergence of the probabilities P (Xn = x) as n → ∞. This is the stable position. By our uniqueness result. In an ideal pendulum. We all know. For example. 18 18.which assigns positive probability to each state in C. that the pendulum will swing for a while and. the pendulum will perform an 45 . An arbitrary stationary distribution π must be a linear combination of these π C . Proof. Any stationary distribution π is necessarily of the form π= C: C positive recurrent communicating class αC π C . If π(x) > 0 then x belongs to some positive recurrent class C. A Markov chain is a simple model of a stochastic dynamical system. that it remains in a small region). that. such that C αC = 1. Why stability? There are physical reasons why we are interested in the stability of a Markov chain. π(x|C) ≡ π C (x). A sequence of real numbers an k=1 1 may not converge but its Cesaro averages n (a1 + · · · + an ) may converge. x ∈ C) is a stationary distribution. from physical experience. So far we have seen. eventually. To each positive recurrent communicating class C there corresponds a stationary distribution π C . without friction.

oscillatory motion ad infinitum; still, we may say that it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:
  ẋ = −ax + b,  x ∈ R,  a > 0.
The thing to observe here is that there is an equilibrium point, namely the one found by setting the right-hand side equal to zero:
  x* = b/a.
If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. Otherwise, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: solution curves x(t) started from several initial values, all converging to the level x* as t grows.]

There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever.

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
  Xn+1 = ρXn + ξn.
Here, X0, ξ0, ξ1, ξ2, . . . are given, mutually independent, random variables, and we shall assume that the ξn have the same law; the sequence X1, X2, . . . is then defined by means of the recursion above. This is not quite so simple as the deterministic equation, because of the noise terms ξn, which depend on the time parameter n. But we are in a happy situation here, for we can solve the recursion:
  X1 = ρX0 + ξ0
  X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0) + ξ1 = ρ²X0 + ρξ0 + ξ1
  X3 = ρX2 + ξ2 = ρ³X0 + ρ²ξ0 + ρξ1 + ξ2
  · · · · · ·
  Xn = ρⁿX0 + ρ^{n−1}ξ0 + ρ^{n−2}ξ1 + · · · + ρξ_{n−2} + ξ_{n−1}.
It follows that |ρ| < 1 is the condition for stability. But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point.
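The recursion is easy to simulate; here is a minimal sketch (Python; the uniform noise is an arbitrary assumption) showing that for |ρ| < 1 the law of Xn settles down, while for |ρ| > 1 it spreads out without bound:

    import random

    def simulate(rho, n, x0=0.0):
        x = x0
        for _ in range(n):
            x = rho * x + random.uniform(-1, 1)   # X_{n+1} = rho X_n + xi_n
        return x

    for rho in (0.5, 1.1):
        samples = sorted(simulate(rho, 200) for _ in range(2000))
        # a crude summary of the law of X_200: median and inter-quantile spread
        print(rho, samples[1000], samples[1900] - samples[100])
    # for rho = 0.5 the spread is moderate and does not grow with n;
    # for rho = 1.1 it explodes, and there is no limiting distribution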

Since |ρ| < 1, the first term ρⁿX0 goes to 0 as n → ∞. The second term,
  X̃n := ρ^{n−1}ξ0 + ρ^{n−2}ξ1 + · · · + ρξ_{n−2} + ξ_{n−1},
which is a linear combination of ξ0, . . . , ξ_{n−1}, does not converge. But notice that if we let
  X̂n := ρ^{n−1}ξ_{n−1} + ρ^{n−2}ξ_{n−2} + · · · + ρξ1 + ξ0,
then, since the ξn are i.i.d., we have
  P(X̃n ≤ x) = P(X̂n ≤ x),  uniformly in x.
The nice thing is that, under nice conditions (e.g. assume that the ξn are bounded), X̂n does converge:
  X̂n → X̂∞ = Σ_{k=0}^∞ ρᵏ ξk,
as n → ∞. So, while the limit of X̃n (and hence of Xn) does not exist, the limit of its distribution does:
  lim_{n→∞} P(X̃n ≤ x) = lim_{n→∞} P(X̂n ≤ x) = P(X̂∞ ≤ x).
This is precisely the reason why, for such systems, we cannot define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution π, then, for any initial distribution,
  lim_{n→∞} P(Xn = x) = π(x).
In particular, the n-step transition matrix Pⁿ converges, as n → ∞, to a matrix with rows all equal to π:
  lim_{n→∞} p_ij^(n) = π(j),  for all i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, i.e. we know f(x) = P(X = x) and g(y) = P(Y = y), but not their joint law: we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define
  P(X = x, Y = y) = P(X = x) P(Y = y).
But suppose that we have some further requirement, for instance, to try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.
  f(x) = Σ_y h(x, y),  g(y) = Σ_x h(x, y),
and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (Xn, n ≥ 0) and the law of a stochastic process Y = (Yn, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

We say that a random time T is a meeting time of the two processes if
  Xn = Yn for all n ≥ T.
This T is a random variable that takes values in the set of times or, on the event that the two processes never meet, the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0, and all x ∈ S,
  |P(Xn = x) − P(Yn = x)| ≤ P(T > n).
Since this holds for all x ∈ S and since the right hand side does not depend on x, we can write it as
  sup_{x∈S} |P(Xn = x) − P(Yn = x)| ≤ P(T > n).  (19)
If, in particular, P(T < ∞) = 1, i.e. T is finite with probability one, then
  |P(Xn = x) − P(Yn = x)| → 0,  uniformly in x, as n → ∞.

Proof. Suppose that we have two stochastic processes as above and suppose that they have been coupled. We have
  P(Xn = x) = P(Xn = x, n < T) + P(Xn = x, n ≥ T)
  = P(Xn = x, n < T) + P(Yn = x, n ≥ T)
  ≤ P(n < T) + P(Yn = x),
and hence P(Xn = x) − P(Yn = x) ≤ P(T > n). Repeating the argument with the roles of X and Y interchanged, we obtain P(Yn = x) − P(Xn = x) ≤ P(T > n). Combining the two inequalities we obtain the first assertion.
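The coupling inequality can be watched in action. In the sketch below (Python; the 3-state chain and initial states are illustrative assumptions), two copies of the same chain are run independently from different starting states and merged at their first meeting time; neither copy needs to be stationary for Proposition 4 to apply. The observed gap between the two marginal laws is then compared with the empirical P(T > n):

    import random

    P = [[0.2, 0.5, 0.3],
         [0.4, 0.1, 0.5],
         [0.3, 0.3, 0.4]]

    def step(i):
        u, acc = random.random(), 0.0
        for j, p in enumerate(P[i]):
            acc += p
            if u < acc:
                return j
        return len(P[i]) - 1

    N, n_max = 50000, 8
    cz = [[0] * 3 for _ in range(n_max + 1)]   # occupation counts of Z
    cy = [[0] * 3 for _ in range(n_max + 1)]   # occupation counts of Y
    t_exceed = [0] * (n_max + 1)               # counts of the event {T > n}

    for _ in range(N):
        x, y, met = 0, 2, False                # Z starts at 0, Y starts at 2
        for n in range(n_max + 1):
            met = met or (x == y)
            cz[n][x] += 1
            cy[n][y] += 1
            if not met:
                t_exceed[n] += 1
            y = step(y)
            x = y if met else step(x)          # after meeting, Z sticks to Y

    for n in range(n_max + 1):
        gap = max(abs(cz[n][s] - cy[n][s]) for s in range(3)) / N
        print(n, gap, t_exceed[n] / N)         # gap <= P(T > n), as (19) says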

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (Xn, n ≥ 0), has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Yn, n ≥ 0), has the law of a Markov chain with the same transition probabilities pij and Y0 distributed according to the stationary distribution π.

It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(Xn = x, Ym = y). We now couple them by doing the straightforward thing: we assume that X = (Xn, n ≥ 0) is independent of Y = (Yn, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
  T := inf{n ≥ 0 : Xn = Yn}.

Consider the process
  Wn := (Xn, Yn).
Then (Wn, n ≥ 0) is a Markov chain with state space S × S. Its (1-step) transition probabilities are
  q_{(x,y),(x′,y′)} := P( Wn+1 = (x′, y′) | Wn = (x, y) ) = P(Xn+1 = x′ | Xn = x) P(Yn+1 = y′ | Yn = y) = p_{x,x′} p_{y,y′}.
Its n-step transition probabilities are
  q_{(x,y),(x′,y′)}^(n) = p_{x,x′}^(n) p_{y,y′}^(n).
Its initial state W0 = (X0, Y0) has distribution
  P(W0 = (x, y)) = µ(x) π(y).
From Theorem 3 and the aperiodicity assumption, we have that p_{x,x′}^(n) > 0 and p_{y,y′}^(n) > 0 for all large n, implying that q_{(x,y),(x′,y′)}^(n) > 0 for all large n. Hence (Wn, n ≥ 0) is an irreducible chain. Notice that
  σ(x, y) := π(x) π(y),  (x, y) ∈ S × S,
is a stationary distribution for (Wn, n ≥ 0). (Indeed, had we started with both X0 and Y0 independent and distributed according to π, then, for all n, Xn and Yn would be independent and distributed according to π.)

Clearly. if we modify the original chain and make it aperiodic by.0). (1. In particular. Define now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ). then the process W = (X. Hence (Wn .1). Y ) may fail to be irreducible.Therefore σ(x.1). n ≥ 0) is identical in law to (Xn . For example. 1). 1} and the pij are p01 = p10 = 1. for all n and x. This is not irreducible. uniformly in x. after the meeting time. by assumption. p00 = 0. n ≥ 0). Obviously.(1. (Zn . n ≥ 0) is positive recurrent. we have |P (Xn = x) − π(x)| ≤ P (T > n). (1. Moreover. Since P (Zn = x) = P (Xn = x). 0). say. y) ∈ S × S. suppose that S = {0.0).(0. X arbitrary distribution Z stationary distribution Y T Thus. Zn = Yn . y) > 0 for all (x. 50 . T is a meeting time between Z and Y . By the coupling inequality. p01 = 0. P (T < ∞) = 1. Z0 = X0 which has an arbitrary distribution µ.1) = 1. as n → ∞. 1)} and the transition probabilities are q(0.(1. and since P (Yn = x) = π(x). (0.0) = q(0. Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X.0) = q(1.01. |P (Zn = x) − P (Yn = x)| ≤ P (T > n). precisely because P (T < ∞) = 1. Hence (Zn . However. In particular. ≥ 0) is a Markov chain with transition probabilities pij .99. Then the product chain has state space S ×S = {(0. P (Xn = x) = P (Zn = x). which converges to zero. 0). p10 = 1.(0.1) = q(1. Zn equals Xn before the meeting time of X and Y .

then the product chain becomes irreducible. It is not difficult to see that π(0), π(1) satisfy
  0.99 π(0) = π(1),  π(0) + π(1) = 1,
whence π(0) ≈ 0.503, π(1) ≈ 0.497, and
  lim_{n→∞} p00^(n) = lim_{n→∞} p10^(n) = π(0),  lim_{n→∞} p11^(n) = lim_{n→∞} p01^(n) = π(1),
or, in matrix notation,
  lim_{n→∞} [ p00^(n)  p01^(n) ; p10^(n)  p11^(n) ] = [ 0.503  0.497 ; 0.503  0.497 ].

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. Suppose X0 = 0. Then Xn = 0 for even n and Xn = 1 for odd n. Therefore p00^(n) = 1 for even n and = 0 for odd n, and so lim_{n→∞} p00^(n) does not exist.
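Both examples above are easy to verify by matrix multiplication; a minimal sketch (Python):

    def matmul(A, B):
        n = len(A)
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    def power(P, n):
        # n-th matrix power of P, n >= 1
        R = P
        for _ in range(n - 1):
            R = matmul(R, P)
        return R

    periodic  = [[0.0, 1.0], [1.0, 0.0]]
    aperiodic = [[0.01, 0.99], [1.0, 0.0]]

    print(power(periodic, 10)[0][0], power(periodic, 11)[0][0])
    # 1.0 then 0.0: p_00^(n) oscillates, so no limit exists
    print(power(aperiodic, 50))
    # both rows approach (0.503..., 0.497...), i.e. the stationary pi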

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"⁶. The reason is that the term "ergodic" means something specific in Mathematics (cf., e.g., Parry (1981)), and this is expressed by our terminology. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing.

⁶ We put "wrong" in quotes: terminology cannot be, per se, wrong; it is, however, customary to have a consistent terminology.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. By the earlier results, we know we have a unique stationary distribution π and that π(x) > 0 for all states x. Suppose that the period of the chain is d. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into d classes C0, C1, . . . , Cd−1, such that the chain moves cyclically between them: from C0 to C1, from C1 to C2, and so on, from Cd−1 back to C0. It is not difficult to see that:

Lemma 8. Let αr := Σ_{i∈Cr} π(i). Then αr = 1/d, for all r = 0, 1, . . . , d − 1.

Proof. We have that π satisfies the balance equations:
  π(i) = Σ_{j∈S} π(j) pji,  i ∈ S.
Suppose that i ∈ Cr. Then pji = 0 unless j ∈ Cr−1, so
  π(i) = Σ_{j∈Cr−1} π(j) pji,  i ∈ Cr.
Summing up over i ∈ Cr we obtain
  αr = Σ_{i∈Cr} π(i) = Σ_{j∈Cr−1} π(j) Σ_{i∈Cr} pji = Σ_{j∈Cr−1} π(j) = αr−1.
Since the αr add up to 1, each one must be equal to 1/d.

Suppose that X0 ∈ Cr. Then, if sampled every d steps, the chain stays in Cr: the chain (Xnd, n ≥ 0) consists of d closed communicating classes, namely C0, . . . , Cd−1. Hence if we restrict the chain (Xnd, n ≥ 0) to any of the sets Cr, we obtain a positive recurrent irreducible chain which is, moreover, aperiodic (all states have period 1 for this chain), and the previous results apply. We shall show that:

Lemma 9. The (unique) stationary distribution of the restriction of (Xnd, n ≥ 0) to Cr is (dπ(i), i ∈ Cr).

Proof. By Lemma 8, Σ_{i∈Cr} dπ(i) = d αr = 1; and, since π = πP implies π = πP^d, the vector dπ, restricted to Cr, satisfies the balance equations of the d-step chain.

Since the restricted chain is irreducible, positive recurrent and aperiodic, Theorem 17 gives
  p_ij^(nd) → dπ(j),  as n → ∞,  for all i, j ∈ Cr.
Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck and j ∈ Cℓ, and let r = ℓ − k (modulo d), where r is taken to be a number between 0 and d − 1 after reduction by d. Then, starting from i, the original chain enters Cℓ in r steps and, thereafter, returns to Cℓ every d steps. This means that:

Theorem 18. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d. Let i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (modulo d). Then
  p_ij^(r+nd) = Pi(X_{r+nd} = j) → dπ(j),  as n → ∞.

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_ij^(n), as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_ij^(n) → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_ij^(n) → 0, as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then
  p_ij^(n) → 0,  as n → ∞,  for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, . . ., a probability distribution on N,
  p^(n) = (p1^(n), p2^(n), p3^(n), . . .),
i.e. Σ_{i=1}^∞ pi^(n) = 1 and pi^(n) ≥ 0 for all i. Then we can find a sequence n1 < n2 < n3 < · · · of integers such that lim_{k→∞} pi^(nk) exists for all i ∈ N.

Proof. Since the numbers p1^(n), n = 1, 2, . . ., are contained between 0 and 1, there must be a subsequence (nk¹, k = 1, 2, . . .) such that lim_{k→∞} p1^(nk¹) exists and equals, say, p1. Now consider the numbers p2^(nk¹), k = 1, 2, . . . Since they are contained between 0 and 1, there must be a subsequence (nk²) of (nk¹) such that lim_{k→∞} p2^(nk²) exists and equals, say, p2. It is clear how we continue: for each i, the sequence (nk^i) of the i-th row is a subsequence of (nk^{i−1}) of the previous row, chosen so that lim_{k→∞} pi^(nk^i) exists and equals, say, pi. Schematically:

  p1^(n1¹), p1^(n2¹), p1^(n3¹), · · · → p1
  p2^(n1²), p2^(n2²), p2^(n3²), · · · → p2
  p3^(n1³), p3^(n2³), p3^(n3³), · · · → p3
  · · ·

Now pick the numbers in the diagonal, nk := nk^k, k = 1, 2, . . . By construction, the diagonal subsequence (nk^k, k = 1, 2, . . .) is a subsequence of each (nk^i, k = 1, 2, . . .), hence
  pi^(nk) → pi,  as k → ∞,  for each i.

The second result is Fatou's lemma, whose proof we omit⁷:

⁷ See Brémaud (1999).

Lemma 11. If (xi^(n), i ∈ N), n = 1, 2, . . ., and (yi, i ∈ N) are sequences of nonnegative numbers, then
  Σ_{i=1}^∞ yi lim inf_{n→∞} xi^(n) ≤ lim inf_{n→∞} Σ_{i=1}^∞ yi xi^(n).

Proof of Theorem 19. Let j be a null recurrent state, and suppose, for the sake of contradiction, that the sequence p_ij^(n) does not converge to 0. Then we can find a subsequence (nk) such that
  lim_{k→∞} p_ij^(nk) = αj > 0.
Consider the probability vector (p_ix^(nk), x ∈ S). By Lemma 10, we can pick a subsequence (n′k) of (nk) such that
  lim_{k→∞} p_ix^(n′k) = αx,  for all x ∈ S.
By Lemma 11,
  Σ_{x∈S} αx ≤ lim inf_{k→∞} Σ_{x∈S} p_ix^(n′k) = 1.
Since Pⁿ⁺¹ = PⁿP, we have
  p_ix^(n′k + 1) = Σ_{y∈S} p_iy^(n′k) p_yx,
and so, by Lemma 11 again,
  lim inf_{k→∞} p_ix^(n′k + 1) ≥ Σ_{y∈S} lim_{k→∞} p_iy^(n′k) p_yx = Σ_{y∈S} αy p_yx.
Hence (extracting, by Lemma 10, a further subsequence along which p_ix^(n′k + 1) also converges to αx for all x)
  αx ≥ Σ_{y∈S} αy p_yx,  for all x ∈ S.
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that, for some x0 ∈ S,
  αx0 > Σ_{y∈S} αy p_yx0.
This would imply that
  Σ_{x∈S} αx > Σ_{x∈S} Σ_{y∈S} αy p_yx = Σ_{y∈S} αy Σ_{x∈S} p_yx = Σ_{y∈S} αy,
and this is a contradiction (the number Σ_x αx cannot be strictly larger than itself). Thus
  αx = Σ_{y∈S} αy p_yx,  x ∈ S.
Since Σ_{x∈S} αx ≤ 1 and αj > 0, we have that
  π(x) := αx / Σ_{y∈S} αy
satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0 because αj > 0. By Corollary 13, state j is positive recurrent: contradiction. Therefore p_ij^(n) → 0, as n → ∞.

19.2 The general case

It remains to see what the limit of p_ij^(n) is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, let
  HC := inf{n ≥ 0 : Xn ∈ C},
and let ϕC(i) := Pi(HC < ∞). Then
  lim_{n→∞} p_ij^(n) = ϕC(i) π^C(j).

Proof. If i also belongs to C then ϕC(i) = 1 and so the result follows from Theorem 17. In general, we have
  p_ij^(n) = Pi(Xn = j) = Pi(Xn = j, HC ≤ n) = Σ_{m=1}^n Pi(HC = m) Pi(Xn = j | HC = m).
Let µ be the distribution of X_{HC} given that X0 = i and that {HC < ∞}. From the Strong Markov Property, we have
  Pi(Xn = j | HC = m) = Pµ(Xn−m = j).
But Theorem 17 tells us that the limit exists regardless of the initial distribution:
  Pµ(Xn−m = j) → π^C(j),  as n → ∞.
Hence
  lim_{n→∞} p_ij^(n) = π^C(j) Σ_{m=1}^∞ Pi(HC = m) = π^C(j) Pi(HC < ∞) = ϕC(i) π^C(j).

Remark: We can write the result of Theorem 20 as
  lim_{n→∞} p_ij^(n) = fij / Ej Tj,
where Tj = inf{n ≥ 1 : Xn = j} and fij = Pi(Tj < ∞): indeed, π^C(j) = 1/Ej Tj, and fij = ϕC(i), because once the chain enters C it will visit j with probability one. In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7, as in §10.1; this is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Diagram (read off the computations below): state 1 has a self-loop of probability 1 − γ and moves to 2 with probability γ; state 2 moves to 1 with probability 1; state 3 moves to 2 with probability 1 − α and to 4 with probability α; state 4 moves to 3 with probability β and to 5 with probability 1 − β; state 5 is absorbing.]

The communicating classes are I = {3, 4} (inessential states), C1 = {1, 2} (closed class), C2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are:
  π^C1(1) = 1/(1 + γ),  π^C1(2) = γ/(1 + γ),  π^C2(5) = 1.
Let ϕ(i) be the probability that class C1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, while, from first-step analysis,
  ϕ(3) = (1 − α) ϕ(2) + α ϕ(4),  ϕ(4) = (1 − β) ϕ(5) + β ϕ(3),
whence
  ϕ(3) = (1 − α)/(1 − αβ),  ϕ(4) = β(1 − α)/(1 − αβ).
Both C1, C2 are aperiodic classes. Hence, as n → ∞,
  p31^(n) → ϕ(3) π^C1(1),  p32^(n) → ϕ(3) π^C1(2),  p33^(n) → 0,  p34^(n) → 0,  p35^(n) → (1 − ϕ(3)) π^C2(5),
and, similarly,
  p41^(n) → ϕ(4) π^C1(1),  p42^(n) → ϕ(4) π^C1(2),  p43^(n) → 0,  p44^(n) → 0,  p45^(n) → (1 − ϕ(4)) π^C2(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space; it includes, for example, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19-th century French mathematicians Henri Poincaré and Joseph Liouville.

We will avoid giving a precise definition of the term "ergodic", but only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem. And so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state, and recall the quantity ν^[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P(Ta < ∞) = 1. Let f be a reward function such that
  Σ_{x∈S} |f(x)| ν^[a](x) < ∞.  (20)

Then
  P( lim_{t→∞} [ Σ_{n=0}^t f(Xn) ] / [ Σ_{n=0}^t 1(Xn = a) ] = f̄ ) = 1,  (21)
where
  f̄ := Σ_{x∈S} f(x) ν^[a](x).

Proof. As in the proof of Theorem 14, we break the total reward as follows:
  Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + G_first + G_last,
where Gr = Σ_{Ta^(r) ≤ s < Ta^(r+1)} f(Xs), while G_first, G_last are the total rewards over the first and last incomplete cycles, respectively. Note that the denominator Σ_{n=0}^t 1(Xn = a) in (21) equals the number Nt of visits to state a up to time t. The Law of Large Numbers tells us that
  lim_{n→∞} (1/n) Σ_{r=1}^n Gr = E G1,
with probability 1, as long as E|G1| < ∞; but this is precisely the condition (20). Replacing n by the subsequence Nt (which tends to ∞), and noting that the first and last incomplete cycles contribute negligibly after division by Nt, we obtain the result. We only need to check that E G1 = f̄, and this is left as an exercise. The reader is asked to revisit the proof of Theorem 14.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a. If so, then, as we saw in the proof of Theorem 14, Nt/t → 1/Ea Ta; so if we divide by t the numerator and the denominator of the fraction in (21), the numerator converges to f̄/Ea Ta while the denominator converges to 1/Ea Ta, and we recover Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem (Theorem 14) holds, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. First, since the number of states is finite, at least one of the states must be visited infinitely often. So there are always recurrent states. From Proposition 2 we know that the balance equations
  ν = νP
have at least one solution (take ν = ν^[a] for a recurrent state a). Since the number of states is finite, we have
  C := Σ_{i∈S} ν(i) < ∞.

Hence
  π(i) := ν(i)/C,  i ∈ S,
satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1, so π is a stationary distribution. At least one state i must have π(i) > 0, and, by Corollary 13, this state must be positive recurrent. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15), and, by Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent, with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have
  Pⁿ → Π,  as n → ∞,
where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

22 Time reversibility

Consider a Markov chain (Xn, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that
  (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0), for all n.
In other words, if we film the process and run the film backwards, it will look the same: we will not be able to tell that it is running backwards. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, we cannot tell the difference. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0) for all n. In particular, the first components of the two vectors must have the same distribution, i.e. Xn has the same distribution as X0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:
  π(i) pij = π(j) pji,  i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X0, X1) has the same distribution as (X1, X0), and this means that
  P(X0 = i, X1 = j) = P(X1 = i, X0 = j).
But P(X0 = i, X1 = j) = π(i) pij and P(X0 = j, X1 = i) = π(j) pji. So the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. The DBE imply immediately that (X0, X1) has the same distribution as (X1, X0). Using the DBE repeatedly, we also have
  P(X0 = i, X1 = j, X2 = k) = π(i) pij pjk = [π(j) pji] pjk = [π(j) pjk] pji = [π(k) pkj] pji = P(X0 = k, X1 = j, X2 = i),
and hence (X0, X1, X2) has the same distribution as (X2, X1, X0). The same logic works for any n and can be worked out by induction. So the chain is reversible.

Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible. 2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j; we have separation of variables:
  pij/pji = f(i) g(j).

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion holds: for all i1, i2, . . . , im ∈ S and all m ≥ 3,
  p_{i1 i2} p_{i2 i3} · · · p_{im−1 im} p_{im i1} = p_{i1 im} p_{im im−1} · · · p_{i2 i1}.  (22)

Proof. Suppose first that the chain is reversible; we will show the Kolmogorov loop criterion (KLC). Since the chain is irreducible, we have π(i) > 0 for all i. Pick 3 states i, j, k, and write, using the DBE,
  π(i) pij pjk pki = [π(j) pji] pjk pki = [π(j) pjk] pji pki = [π(k) pkj] pji pki = [π(k) pki] pji pkj = [π(i) pik] pji pkj = π(i) pik pkj pji.
Cancelling the π(i) from the two ends of the above display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Pick two states i1, i2 and sum (22) over all choices of the intermediate states i3, i4, . . . , im. This gives
  p_{i1 i2} p_{i2 i1}^{(m−1)} = p_{i1 i2}^{(m−1)} p_{i2 i1}.
If the chain is aperiodic, then, letting m → ∞, we have
  p_{i2 i1}^{(m−1)} → π(i1)  and  p_{i1 i2}^{(m−1)} → π(i2),
and hence, in the limit,
  π(i1) p_{i1 i2} = π(i2) p_{i2 i1}.
These are the DBE, and the chain is reversible.

(Convention: we form this ratio only when pij > 0; by Note 1, this is the case exactly when pji > 0.) Necessarily, f(i) g(i) must be 1 for all i (why?), so f(i) = 1/g(i) and we have
pij/pji = g(j)/g(i),  for each pair of distinct states i, j with pij > 0.
Multiplying by g(i) and summing over j we find
g(i) = Σ_{j∈S} g(j) pji.
So g satisfies the balance equations, and g(i) pij = g(j) pji are exactly the detailed balance equations with π proportional to g. Hence the chain is reversible, and π can be found very easily, using another method than solving the balance equations directly. This gives the following method for showing that a finite irreducible chain is reversible:
• Form the ratio pij/pji and show that the variables separate.
• If the chain has infinitely many states, then we need, in addition, to guarantee, e.g. by assumption, that the chain is positive recurrent.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. Form the graph by deleting all self-loops, and by replacing, for each pair of distinct states i, j for which pij > 0, the two oriented edges from i to j and from j to i by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible. For example, consider the chain: [figure] Replace it by: [figure] This is a tree: there isn't any loop.

The reason is as follows. Pick two distinct states i, j for which pij > 0. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop.) Let A, Ac be the sets in the two connected components. The balance equations are written in the form F(A, Ac) = F(Ac, A) (see §4.1). There is obviously only one arrow from A to Ac, namely the arrow from i to j, and the arrow from j to i is the only one from Ac to A. This gives π(i) pij = π(j) pji, i.e. the detailed balance equations. Hence the original chain is reversible, and hence π can be found very easily.
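The detailed balance check itself is easy to carry out numerically. Here is a minimal sketch (not part of the original notes): given a transition matrix P and a candidate stationary distribution π, it tests whether the probability flow π(i)pij is symmetric in i and j. The three-state chain used is a made-up example.

    import numpy as np

    # A made-up reversible chain: symmetric random moves on a triangle.
    P = np.array([[0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])

    # The uniform distribution is stationary for this doubly stochastic matrix.
    pi = np.array([1/3, 1/3, 1/3])

    def satisfies_dbe(pi, P, tol=1e-12):
        """Check pi(i) p_ij == pi(j) p_ji for all pairs i, j."""
        F = pi[:, None] * P      # F[i, j] = pi(i) p_ij, the flow from i to j
        return np.allclose(F, F.T, atol=tol)

    print(satisfies_dbe(pi, P))  # True: the chain is reversible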

Example 2: Random walk on a graph. Consider an undirected graph G with vertices S, a finite set, and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. States i and j are called neighbours if they are connected by an edge. For each state i define its degree
δ(i) = number of neighbours of i.
Now define the following transition probabilities:
pij = 1/δ(i), if j is a neighbour of i;  pij = 0, otherwise.
The Markov chain with these transition probabilities is called a random walk on a graph (RWG).

[Figure: a graph on the states 0, 1, 2, 3, 4. Here we have, for example, p12 = p10 = p14 = 1/3, p02 = 1/4, etc.]

Suppose the graph is connected; then the chain is irreducible. The stationary distribution of a RWG is very easy to find. We claim that
π(i) = C δ(i),  i ∈ S,
where C is a constant. To see this, let us verify that the detailed balance equations hold. There are two cases: If i, j are neighbours, then π(i) pij = C δ(i) (1/δ(i)) = C and π(j) pji = C δ(j) (1/δ(j)) = C, so both members are equal to C. If i, j are not neighbours, then both sides are 0. In all cases, the DBE hold. The constant C is found by the normalisation Σ_{i∈S} π(i) = 1. Since Σ_{i∈S} δ(i) = 2|E|, where |E| is the total number of edges, we have C = 1/2|E|. Hence we conclude that:
1. A RWG is a reversible chain.
2. The stationary distribution of a RWG assigns to a state probability which is proportional to its degree.
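As a quick numerical illustration (not in the original notes), the sketch below builds a small made-up graph from an arbitrary edge list, computes π(i) = δ(i)/2|E|, and verifies normalisation and detailed balance.

    from collections import defaultdict

    # A small made-up undirected graph (edge set chosen for illustration only).
    edges = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 3)]

    nbrs = defaultdict(set)
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)

    deg = {i: len(nbrs[i]) for i in nbrs}
    pi = {i: deg[i] / (2 * len(edges)) for i in deg}   # pi(i) = delta(i)/2|E|

    assert abs(sum(pi.values()) - 1) < 1e-12           # normalisation

    # Detailed balance: pi(i) p_ij = pi(j) p_ji = 1/2|E| on every edge.
    for i, j in edges:
        assert abs(pi[i] / deg[i] - pi[j] / deg[j]) < 1e-12

    print(pi)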

23 New Markov chains from old ones*

We now deal with the question: if we have a Markov chain (Xn), how can we create new ones?

23.1 Subsequences
The most obvious way to create a new Markov chain is by picking a subsequence, i.e. a sequence of integers (nk)_{k≥0} such that 0 ≤ n0 < n1 < n2 < ···, and by letting
Yk := X_{nk},  k = 0, 1, 2, ....
The sequence (Yk, k = 0, 1, 2, ...) has the Markov property but it is not, in general, time-homogeneous. If the subsequence forms an arithmetic progression, i.e. if nk = n0 + km, then (Yk, k = 0, 1, 2, ...) has time-homogeneous transition probabilities:
P(Yk+1 = j | Yk = i) = p^{(m)}_{ij},
where p^{(m)}_{ij} are the m-step transition probabilities of the original chain.

23.2 Watching a Markov chain when it visits a set
Suppose that the original chain is ergodic (i.e. irreducible and positive recurrent). Let A be a set of states and let T_A^{(r)} be the r-th visit to A. Define
Yr := X_{T_A^{(r)}},  r = 1, 2, ....
Due to the Strong Markov Property, this process is also a Markov chain and it does have time-homogeneous transition probabilities. The state space of (Yr) is A. If, in addition, π is a stationary distribution for the original chain, then π/π(A), restricted to A, is a stationary distribution for the new chain.

23.3 Subordination
Let (Xn, n ≥ 0) be Markov with transition probabilities pij. Let (St, t ≥ 0) be independent from (Xn), with S0 = 0 and
St+1 = St + ξt,  t ≥ 0,
where the ξt are i.i.d. random variables with values in N and distribution λ(n) = P(ξ0 = n), n ∈ N.

Then Yt := X_{St}, t ≥ 0, is a Markov chain with
P(Yt+1 = j | Yt = i) = Σ_{n=1}^{∞} λ(n) p^{(n)}_{ij}.

23.4 Deterministic function of a Markov chain
If (Xn) is a Markov chain with state space S and if f : S → S′ is some function from S to some countable set S′, then the process Yn = f(Xn), n ≥ 0, is NOT, in general, a Markov chain. If, however, f is a one-to-one function, then (Yn) is a Markov chain: indeed, knowledge of f(Xn) implies knowledge of Xn and so, conditional on Yn, the past is independent of the future. (Notation: We will denote the elements of S by i, j, ..., and the elements of S′ by α, β, ....) More generally, we have:

Lemma 13. Suppose that P(Yn+1 = β | Xn = i) depends on i only through f(i), i.e. that there is a function Q : S′ × S′ → R such that
P(Yn+1 = β | Xn = i) = Q(f(i), β).
Then (Yn) has the Markov property.

Proof. Fix α0, α1, ..., αn−1 ∈ S′ and let Yn−1 be the event
Yn−1 := {Y0 = α0, ..., Yn−1 = αn−1}.
We need to show that P(Yn+1 = β | Yn = α, Yn−1) = P(Yn+1 = β | Yn = α),

for all choices of the variables involved. But
P(Yn+1 = β | Yn = α, Yn−1)
= Σ_{i∈S} P(Yn+1 = β | Xn = i, Yn = α, Yn−1) P(Xn = i | Yn = α, Yn−1)
= Σ_{i∈S} Q(f(i), β) P(Xn = i | Yn = α, Yn−1)
= Σ_{i∈S} Q(f(i), β) P(Yn = α | Xn = i, Yn−1) P(Xn = i | Yn−1) / P(Yn = α | Yn−1)
= Σ_{i∈S} Q(α, β) 1(α = f(i)) P(Xn = i | Yn−1) / P(Yn = α | Yn−1)
= Q(α, β) P(Yn = α | Yn−1) / P(Yn = α | Yn−1)
= Q(α, β).
Thus (Yn) is Markov with transition probabilities Q(α, β).

Example: Consider the symmetric drunkard's random walk, i.e.
P(Xn+1 = i ± 1 | Xn = i) = 1/2,  i ∈ Z.
Define Yn := |Xn|. Then (Yn) is a Markov chain. To see this, observe that the function |·| is not one-to-one, so Lemma 13 is needed. But we will show that P(Yn+1 = β | Xn = i), i ∈ Z, β ∈ Z+, depends on i only through |i|. Indeed, conditional on {Xn = i} with i ≠ 0, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, because the probability of going up is the same as the probability of going down, regardless of whether i > 0 or i < 0. And if i = 0, then |Xn+1| = 1 for sure. In other words, if we let
Q(α, β) := 1/2, if β = α ± 1 and α > 0;  Q(0, 1) := 1;
then P(Yn+1 = β | Xn = i) = Q(|i|, β) for all i ∈ Z and all β ∈ Z+, and so (Yn) is, indeed, a Markov chain with transition probabilities Q(α, β).

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: At each point of time n we have a number Xn of individuals, each of which gives birth to a random number of offspring, independently of one another, and following the same distribution. The parent dies immediately upon giving birth. We let pk be the probability that an individual will bear k offspring, k = 0, 1, 2, .... Let ξ_k^{(n)} be the number of offspring of the k-th individual in the n-th generation. Then the size of the (n+1)-st generation is:
Xn+1 = ξ_1^{(n)} + ξ_2^{(n)} + ··· + ξ_{Xn}^{(n)}.
In other words, we find the population at time n + 1 by adding the number of offspring of each individual. The ξ_k^{(n)} are i.i.d. (and independent of the initial population size X0, a further assumption). If we let ξ^{(n)} = (ξ_1^{(n)}, ξ_2^{(n)}, ...), then the above equation is of the form Xn+1 = f(Xn, ξ^{(n)}), and, since ξ^{(0)}, ξ^{(1)}, ... are i.i.d., we see that (Xn, n ≥ 0) has the Markov property. It is a Markov chain with time-homogeneous transitions. Computing the transition probabilities pij = P(Xn+1 = j | Xn = i) is, in general, a hard problem. We will not attempt to compute the pij nor draw the transition diagram, because they are both futile; instead, we will find some different method to proceed. But here is an interesting special case.

Example: Suppose that p1 = p, p0 = 1 − p, i.e. each individual gives birth to one child with probability p or has no children with probability 1 − p. If we know that Xn = i, then Xn+1 is the number of successful births amongst i individuals, which has a binomial distribution:
pij = \binom{i}{j} p^j (1 − p)^{i−j},  0 ≤ j ≤ i.
In this case, the population cannot grow: it will eventually reach 0, with probability 1.

Back to the general case, one question of interest is whether the process will become extinct or not. Note that 0 is always an absorbing state. We distinguish two cases:
Case I: If p0 = P(ξ = 0) = 0, then the population can only grow; the state 0 is irrelevant because it cannot be reached from anywhere. If p1 = P(ξ = 1) = 1, then Xn = X0 for all n, and this is a trivial case; so assume that p1 ≠ 1.
Case II: If p0 = P(ξ = 0) > 0, then state 0 can be reached from any other state: indeed, if the current population is i, there is probability p0^i that everybody gives birth to zero offspring. Also, P(ξ = 0) + P(ξ ≥ 2) > 0 (since p1 ≠ 1), and hence all states i ≥ 1 are inessential, therefore transient. Therefore, with probability 1, either 0 is reached or, if 0 is never reached, Xn → ∞: if the

population does not become extinct, then it must grow to infinity. Thus, one question of interest is the probability of extinction. Let D be the event of 'death' or extinction:
D := {Xn = 0 for some n ≥ 0}.
We wish to compute the extinction probability ε(i) := Pi(D), when we start with X0 = i initial ancestors. Obviously, ε(0) = 1. For i ≥ 1, we can argue that ε(i) = ε^i, where ε = P1(D). Indeed, if we start with X0 = i initial ancestors, each one behaves independently of one another. For each k = 1, ..., i, let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. We then have that D = D1 ∩ ··· ∩ Di, and P(Dk) = ε, so P(D | X0 = i) = ε^i. Therefore the problem reduces to the study of the branching process starting with X0 = 1.

Let
μ = Eξ = Σ_{k≥1} k pk
be the mean number of offspring of an individual, and let
ϕ(z) := E z^ξ = Σ_{k≥0} z^k pk
be the probability generating function of ξ. We also let
ϕn(z) := E z^{Xn} = Σ_{k≥0} z^k P(Xn = k)
be the probability generating function of Xn. This function is of interest because it gives information about the extinction probability ε. Indeed, ϕn(0) = P(Xn = 0) and, since Xn = 0 implies Xm = 0 for all m ≥ n, we have
lim_{n→∞} ϕn(0) = lim_{n→∞} P(Xn = 0) = P(there exists n such that Xm = 0 for all m ≥ n) = P(D) = ε.

The result is this:

Proposition 5. Consider a Galton-Watson process starting with X0 = 1 initial ancestor, and let ε be the extinction probability of the process. If μ ≤ 1 then ε = 1. If μ > 1 then ε < 1 is the unique solution of ϕ(z) = z in [0, 1).

Proof.

Note first that ϕ0(z) = z and ϕ1(z) = ϕ(z). We then have
ϕn+1(z) = E z^{Xn+1} = E[z^{ξ_1^{(n)}} z^{ξ_2^{(n)}} ··· z^{ξ_{Xn}^{(n)}}]
= Σ_{i≥0} E[z^{ξ_1^{(n)}} ··· z^{ξ_i^{(n)}} 1(Xn = i)]
= Σ_{i≥0} E[z^{ξ_1^{(n)}}] ··· E[z^{ξ_i^{(n)}}] E[1(Xn = i)]
= Σ_{i≥0} ϕ(z)^i P(Xn = i) = E ϕ(z)^{Xn} = ϕn(ϕ(z)).
Hence ϕ2(z) = ϕ1(ϕ(z)) = ϕ(ϕ(z)), ϕ3(z) = ϕ2(ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2(z)), and so on; by induction,
ϕn+1(z) = ϕ(ϕn(z)).
In particular, ϕn+1(0) = ϕ(ϕn(0)). But ϕn(0) → ε and, since ϕ is a continuous function, ϕ(ϕn(0)) → ϕ(ε). Therefore ε = ϕ(ε), i.e. ε solves z = ϕ(z).

Notice now that
μ = dϕ(z)/dz evaluated at z = 1,
and recall that ϕ is a convex function with ϕ(1) = 1. If the slope μ at z = 1 is ≤ 1, the graph of ϕ(z) lies above the diagonal (the function z), and so the equation z = ϕ(z) has no solutions other than z = 1. This means that ε = 1 (see the left figure below).

On the other hand, if μ > 1, then the graph of ϕ(z) does intersect the diagonal at some point ε* < 1, and all we need to show is that ε = ε* (in particular, ε < 1). Since 0 ≤ ε* and ϕ is increasing, ϕ(0) ≤ ϕ(ε*) = ε*, and, by induction, ϕn(0) ≤ ε* for all n. But ϕn(0) → ε, so ε ≤ ε* < 1. Since both ε and ε* satisfy z = ϕ(z), and since the latter has only two solutions (z = ε* and z = 1), we conclude that, in fact, ε = ε*. (See the right figure below.)
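The proof suggests an algorithm for computing ε: iterate ϕ starting from 0. Below is a small sketch (not part of the notes); the offspring distribution p0 = 1/4, p1 = 1/4, p2 = 1/2 is a made-up example with μ = 5/4 > 1, for which z = ϕ(z) can also be solved by hand, giving ε = 1/2.

    def phi(z, p):
        """Probability generating function of the offspring distribution p."""
        return sum(pk * z**k for k, pk in enumerate(p))

    p = [0.25, 0.25, 0.5]            # made-up offspring distribution
    mu = sum(k * pk for k, pk in enumerate(p))   # mean offspring = 1.25

    eps = 0.0                        # phi_n(0) increases to epsilon
    for _ in range(200):             # iterate phi_{n+1}(0) = phi(phi_n(0))
        eps = phi(eps, p)

    print(mu, eps)                   # eps = 0.5, the smaller root of z = phi(z)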

[Figure: the graph of ϕ(z) against the diagonal z. Left: the case μ ≤ 1, where the only intersection is at z = 1, so ε = 1. Right: the case μ > 1, where the graphs intersect at some ε < 1.]

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). The problem of communication over a common channel, in a simplified model, is that if more than one host attempts to transmit a packet at the same time, then the transmission is not successful and the transmissions must be attempted anew. So, for a packet to go through, only one of the hosts should be transmitting at a given time.

One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 1, 2, 3, 4, 5, 6, ... are allocated to hosts 1, 2, 3, 1, 2, 3, .... But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit. Moreover, since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea, we let hosts transmit whenever they like, using the same frequency; if a collision happens, the transmissions must be attempted anew. (Nowadays, Ethernet has replaced Aloha.)

So, in a simplified model, assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot, assume there are a new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Let s be the number of selected packets. If s = 1, there is a successful transmission and, on the next time slot, there are x + a − 1 packets. If s ≥ 2, there is a collision and the transmission is unsuccessful; on the next time slot, there are x + a packets. So if we let Xn be the number of backlogged packets at the beginning of the n-th time slot, An the number of new packets in the same time slot, and Sn the number of selected packets, then
Xn+1 = max(Xn + An − 1, 0), if Sn = 1;  Xn+1 = Xn + An, otherwise.
We assume that the An are i.i.d. random variables with some common distribution:
P(An = k) = λk,  k ≥ 0,

where λk ≥ 0 and Σ_{k≥0} λk = 1. We also let μ := Σ_{k≥1} k λk be the expectation of An. The random variable Sn has, conditional on Xn and An (and all past states and arrivals), a distribution which is Binomial with parameters Xn + An and p:
P(Sn = k | Xn = x, An = a, Xn−1, An−1, ...) = \binom{x+a}{k} p^k q^{x+a−k},  0 ≤ k ≤ x + a,
where q = 1 − p; i.e. selections are made independently amongst the Xn + An packets. The reason we take the maximum with 0 in the recursion for Xn+1 is that, if Xn + An = 0, we should not subtract 1. The process (Xn, n ≥ 0) is a Markov chain.

Let us find its transition probabilities. To compute px,x−1, we think as follows: to have a reduction of x by 1, we must have no new packets arriving, and exactly one selection:
px,x−1 = P(An = 0, Sn = 1 | Xn = x) = λ0 x p q^{x−1},  x > 0.
To compute px,x+k for k ≥ 1, we think like this: to have an increase by k, we either need exactly k new packets and no successful transmission (i.e. Sn should be 0 or ≥ 2), or we need k + 1 new packets and one selection (Sn = 1). So
px,x+k = λk (1 − (x+k) p q^{x+k−1}) + λ_{k+1} (x+k+1) p q^{x+k},  k ≥ 1.
The probabilities px,x are computed by subtracting the sum of the above from 1.

The main result is:

Proposition 6. The Aloha protocol is unstable, i.e. the Markov chain (Xn) is transient.

Idea of proof. Proving this is beyond the scope of the lectures, but we mention that it is based on the so-called drift criterion, which says that if the expected change of Xn, conditional on Xn = x, is bounded below by a positive constant for all large x, then the chain is transient. We here only restrict ourselves to the computation of this expected change (the drift), defined by
Δ(x) := E[Xn+1 − Xn | Xn = x].
For x > 0, we have
Δ(x) = (−1) px,x−1 + Σ_{k≥1} k px,x+k
= −λ0 x p q^{x−1} + Σ_{k≥1} k λk − Σ_{k≥1} k λk (x+k) p q^{x+k−1} + Σ_{k≥1} k λ_{k+1} (x+k+1) p q^{x+k}
= μ − q^x G(x),
where G(x) is a function of the form α + βx. Since p > 0, we have q < 1, and so q^x tends to 0 much faster than G(x) tends to ∞; hence q^x G(x) tends to 0 as x → ∞. So we can make q^x G(x) as small as we like, for example ≤ μ/2, by choosing x sufficiently large. Therefore, unless μ = 0 (which is a trivial case: no arrivals!), Δ(x) ≥ μ/2 for all large x; the drift is positive for all large x, and this proves transience.
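The drift Δ(x) can be computed numerically straight from the transition probabilities. The sketch below (illustrative only, not from the notes) does this for Poisson(λ) arrivals, a made-up choice with μ = λ; the printed values approach μ as x grows, in agreement with Δ(x) = μ − q^x G(x).

    from math import exp, factorial

    p, q = 0.1, 0.9
    lam = 0.3                                    # made-up Poisson rate, mu = lam
    lam_k = lambda k: exp(-lam) * lam**k / factorial(k)

    def drift(x, kmax=80):
        """Delta(x) = E[X_{n+1} - X_n | X_n = x] for the ALOHA chain."""
        d = -lam_k(0) * x * p * q**(x - 1)       # downward term: p_{x,x-1}
        for k in range(1, kmax):                 # upward terms: k * p_{x,x+k}
            d += k * (lam_k(k) * (1 - (x + k) * p * q**(x + k - 1))
                      + lam_k(k + 1) * (x + k + 1) * p * q**(x + k))
        return d

    for x in [1, 5, 10, 20, 40]:
        print(x, round(drift(x), 4))             # tends to mu = 0.3 as x grows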

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The idea is simple: in an Internet-dominated society, the rank of a page may be very important for a company's profits, so it had better be assigned in a principled way. The way Google does it is by computing the stationary distribution of a Markov chain that we now describe.

The web graph is a graph with set of vertices V consisting of web pages, and edges representing hyperlinks: if page i has a link to page j, then we put an arrow from i to j and consider it as a directed edge of the graph. For each page i, let L(i) be the set of pages it links to. If L(i) = ∅, then i is a 'dangling page'. Define transition probabilities as follows:
pij = 1/|L(i)|, if j ∈ L(i);  pij = 1/|V|, if L(i) = ∅;  pij = 0, otherwise.
So if Xn = i is the current location of the chain, the next location Xn+1 is picked uniformly at random amongst all pages that i links to, unless i is a dangling page, in which case the chain moves to a page picked uniformly at random from all of V.

It is not clear that the chain is irreducible, neither that it is aperiodic. So we make the following modification. We pick α between 0 and 1 (typically, α ≈ 0.2) and let
p′ij := (1 − α) pij + α/|V|.
These are new transition probabilities, defining an irreducible and aperiodic chain. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored (and this happens with probability α), in which case he clicks on a page picked at random.

Google then computes the stationary distribution π corresponding to the Markov chain with transition matrix P′ = [p′ij]:
π = π P′,
and says: page i has higher rank than page j if π(i) > π(j). The algorithm used (PageRank) starts with an initial guess π^(0) for π and updates it by
π^(k) = π^(k−1) P′.
Its implementation, though, is one of the largest matrix computations in the world, because of the size of the problem; it uses advanced techniques from linear algebra to speed up computations.
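Here is a toy version of the computation (a sketch, not Google's implementation): the link structure below is hypothetical, and the damping parameter α and the update π^(k) = π^(k−1) P′ are as in the text.

    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # made-up web graph; 3 dangles
    N, alpha = len(links), 0.2

    P = np.zeros((N, N))
    for i, L in links.items():
        if L:
            P[i, L] = 1 / len(L)                 # uniform over out-links
        else:
            P[i, :] = 1 / N                      # dangling page: jump anywhere

    Pp = (1 - alpha) * P + alpha / N             # bored-surfer modification

    pi = np.full(N, 1 / N)                       # initial guess pi^(0)
    for _ in range(100):                         # power iteration
        pi = pi @ Pp

    print(pi, pi.sum())                          # stationary ranks, sum = 1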

Comments:
1. All you need to do to get high rank is pay (a lot of money): there are companies that sell high-rank pages.
2. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments link to each other by hiring their faculty from each other (and from themselves).
4. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality but, from the point of view of an administrator, it makes the evaluation procedure easy: what is important is that the evaluation can be done without ever looking at the content of one's research.
5. Needless to say, these uses of PageRank and PageRank-like systems represent corruption at the deepest level. But it is a kind of corruption that is hard to trace, since it is unanimously equated with modern use of Information Technology.

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. This means that the external appearance of a characteristic is a function of the pair of alleles. When two individuals mate, their offspring inherits an allele from each parent. Thus, two AA individuals will produce an AA child, but an AA mating with an Aa may produce a child of type AA or one of type Aa. We will assume that a child chooses an allele from each parent at random.

Imagine that we have a population of N individuals who mate at random, producing a number of offspring. We will also assume that the parents are replaced by the children. We would like to study the genetic diversity, generation after generation. Since we don't care to keep track of the individuals' identities, the genetic diversity depends only on the total number of alleles of each type.

[Figure: a generation with 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice; one of the square alleles is selected 4 times; three of the square alleles are never selected.]

Thus, all we need is the information outlined in this figure. Alleles that are never picked up may belong to individuals who never mate, or may belong to individuals who mate but whose offspring did not inherit this allele. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. So let x be the number of alleles of type A (so 2N − x is the number of those of type a). Then, at the next generation, the number of alleles of type A can be found as follows: Pick an allele at random; this allele is of type A with probability x/2N; maintain this allele for the next generation. Do the same thing 2N times. The process repeats, generation after generation. We have a transition from x to y if y alleles of type A were produced.

Since each allele of type A is selected with probability x/2N, and since selections are independent, we see that
pxy = \binom{2N}{y} (x/2N)^y (1 − x/2N)^{2N−y},  0 ≤ y ≤ 2N.
Notice that there is a chance (= (1 − x/2N)^{2N}) that no allele of type A is selected at all, in which case the population becomes one consisting of alleles of type a only; similarly, there is a chance (= (x/2N)^{2N}) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, p00 = p_{2N,2N} = 1, so both 0 and 2N are absorbing states. Since pxy > 0 for all 1 ≤ x, y ≤ 2N − 1, the states {1, ..., 2N − 1} form a communicating class of inessential states. Therefore
P(Xn = 0 or Xn = 2N for all large n) = 1,
i.e. the chain gets absorbed by either 0 or 2N.

Let M = 2N and, as usual,
Tx := inf{n ≥ 1 : Xn = x}.
(The fact that M is even is irrelevant for the mathematical model.) We would like to know the probability
ϕ(x) = Px(T0 > TM) = P(T0 > TM | X0 = x)
that the chain will eventually be absorbed by state M. These probabilities satisfy the first-step analysis equations:
ϕ(0) = 0,  ϕ(M) = 1,  ϕ(x) = Σ_{y=0}^{M} pxy ϕ(y).
The result is:
ϕ(x) = x/M.

One can argue as follows. Another way to represent the chain is this:
Xn+1 = ξ_1^n + ··· + ξ_M^n,
where, conditionally on (X0, ..., Xn), the random variables ξ_1^n, ..., ξ_M^n are i.i.d. with
P(ξ_k^n = 1 | Xn, ..., X0) = Xn/M,  P(ξ_k^n = 0 | Xn, ..., X0) = 1 − Xn/M.
Hence
E(Xn+1 | Xn, ..., X0) = E(Xn+1 | Xn) = M · (Xn/M) = Xn.
This implies that EXn+1 = EXn: on the average, the number of alleles of type A doesn't change; the expectation stays constant for all n ≥ 0. Since, at 'the end of time', the chain (started from x) will be either at M, with probability ϕ(x), or at 0, with probability 1 − ϕ(x), we should have
M · ϕ(x) + 0 · (1 − ϕ(x)) = x,
i.e. ϕ(x) = x/M. The result is correct, but the argument needs further justification because 'the end of time' is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations.
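Before the analytic verification, here is a quick Monte Carlo check (not in the notes) of ϕ(x) = x/M: simulate the chain, whose next state is Binomial(M, x/M), until absorption, and compare the frequency of absorption at M with x/M. The values of M, x and the number of trials are arbitrary.

    import random

    def absorbed_at_M(x, M):
        """Run the Wright-Fisher chain from x until it hits 0 or M."""
        while 0 < x < M:
            x = sum(random.random() < x / M for _ in range(M))  # Binomial(M, x/M)
        return x == M

    M, x, trials = 20, 7, 20000
    est = sum(absorbed_at_M(x, M) for _ in range(trials)) / trials
    print(est, x / M)   # the two numbers should be close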

The first two equations are obvious. As for the third, its right-hand side equals
Σ_{y=0}^{M} \binom{M}{y} (x/M)^y (1 − x/M)^{M−y} (y/M)
= (x/M) Σ_{y=1}^{M} \binom{M−1}{y−1} (x/M)^{y−1} (1 − x/M)^{(M−1)−(y−1)}
= (x/M) (x/M + 1 − x/M)^{M−1} = x/M,
as needed.

24.5 A storage or queueing application

A storage facility stores, say, Xn electronic components at time n. A number An of new components are ordered and added to the stock, while a number of them are sold to the market. If the demand is for Dn components, the demand is met instantaneously, provided that enough components are available, and, otherwise, the demand is only partially satisfied. We can express these requirements by the recursion
Xn+1 = (Xn + An − Dn) ∨ 0,
where a ∨ b := max(a, b). Thus, if Xn + An − Dn < 0, the stock reached level zero at time n + 1.

Here are our assumptions: The pairs of random variables ((An, Dn), n ≥ 0) are i.i.d., and the demand is, on the average, larger than the supply per unit of time: letting ξn := An − Dn, we assume
E ξn = E(An − Dn) = −μ < 0,
so that
Xn+1 = (Xn + ξn) ∨ 0.
We have that (Xn, n ≥ 0) is a Markov chain. We will show that this Markov chain is positive recurrent.

Since the Markov chain is represented by this very simple recursion, we will try to solve the recursion. Working for n = 1, 2, ..., we find
X1 = (X0 + ξ0) ∨ 0
X2 = (X1 + ξ1) ∨ 0 = (X0 + ξ0 + ξ1) ∨ ξ1 ∨ 0
X3 = (X2 + ξ2) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2) ∨ (ξ1 + ξ2) ∨ ξ2 ∨ 0
···
Xn = (X0 + ξ0 + ··· + ξn−1) ∨ (ξ1 + ··· + ξn−1) ∨ ··· ∨ ξn−1 ∨ 0.
The correctness of the last formula can formally be proved by showing that it satisfies the recursion. We will apply the so-called Loynes' method. Let us consider
g^{(n)}_{xy} := Px(Xn > y) = P((x + ξ0 + ··· + ξn−1) ∨ (ξ1 + ··· + ξn−1) ∨ ··· ∨ ξn−1 ∨ 0 > y).

Thus, the event whose probability we want to compute is a function of (ξ0, ..., ξn−1). Notice, however, that the random variables (ξn) are i.i.d., so, in computing the distribution of Xn, we can shuffle them in any way we like without altering their joint distribution:
(ξ0, ..., ξn−1) =(d) (ξn−1, ..., ξ0).
In particular, instead of considering them in their original order, we reverse it; hence we may replace the random variables ξ0, ..., ξn−1 by ξn−1, ..., ξ0:
g^{(n)}_{xy} = P((x + ξn−1 + ··· + ξ0) ∨ (ξn−2 + ··· + ξ0) ∨ ··· ∨ ξ0 ∨ 0 > y).
Let
Zn := ξ0 + ··· + ξn−1,  Mn := Zn ∨ Zn−1 ∨ ··· ∨ Z1 ∨ 0.
Then, for y ≥ 0,
g^{(n)}_{0y} = P(Mn > y).
Since Mn+1 ≥ Mn, we have g^{(n)}_{0y} ≤ g^{(n+1)}_{0y}, for all n. The limit of a bounded and increasing sequence certainly exists:
g(y) := lim_{n→∞} g^{(n)}_{0y} = lim_{n→∞} P0(Xn > y).
We claim that π(y) := g(y − 1) − g(y) = lim_{n→∞} P(Mn = y) satisfies the balance equations. Indeed,
Mn+1 = 0 ∨ max_{1≤j≤n+1} Zj = 0 ∨ (max_{1≤j≤n+1} (Zj − Z1) + ξ0) =(d) 0 ∨ (Mn + ξ0),
where the last equality (in distribution) follows from the fact that max_{1≤j≤n+1} (Zj − Z1) is distributed like Mn and is independent of ξ0. Hence
P(Mn+1 = y) = P(0 ∨ (Mn + ξ0) = y) = Σ_{x≥0} P(Mn = x) P(0 ∨ (x + ξ0) = y) = Σ_{x≥0} P(Mn = x) pxy.
Since the n-dependent terms are bounded, we may take the limit of both sides as n → ∞ and, using π(y) = lim_{n→∞} P(Mn = y), we find
π(y) = Σ_{x≥0} π(x) pxy.

So π satisfies the balance equations. We must make sure that it also satisfies Σ_y π(y) = 1, i.e. that g(y) → 0 as y → ∞. Observe that
g(y) = lim_{n→∞} P(Mn > y) = P(0 ∨ max{Z1, Z2, ...} > y).
Here, we will use the assumption that E ξn = −μ < 0. From the Law of Large Numbers we have
P(lim_{n→∞} Zn/n = −μ) = 1.
This implies that P(lim_{n→∞} Zn = −∞) = 1. But then
P(0 ∨ max{Z1, Z2, ...} < ∞) = 1.
Hence g(y) = P(0 ∨ max{Z1, Z2, ...} > y) → 0 as y → ∞ (in particular, g(y) is not identically equal to 1), as required.

It can also be shown (left as an exercise) that if E ξn > 0 then the Markov chain is transient. The case E ξn = 0 is delicate: depending on the actual distribution of ξn, the Markov chain may be transient or null recurrent.
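As a sanity check (not part of the notes), one can simulate the recursion Xn+1 = (Xn + ξn) ∨ 0 directly. The distribution of ξn below is a made-up ±1 example with E ξn = −0.2 < 0; for this particular choice the chain is a reflected birth-death chain whose stationary distribution is geometric with ratio 0.4/0.6 = 2/3 (by the detailed balance equations), and the empirical frequencies should be close to it.

    import random

    random.seed(1)

    def xi():
        """xi_n = A_n - D_n; a made-up distribution with mean -0.2 < 0."""
        return 1 if random.random() < 0.4 else -1

    # Long run of the recursion X_{n+1} = (X_n + xi_n) v 0, started from 0.
    X, counts, n = 0, {}, 10**6
    for _ in range(n):
        X = max(X + xi(), 0)
        counts[X] = counts.get(X, 0) + 1

    # Empirical stationary distribution pi(y), roughly proportional to (2/3)^y.
    for y in range(6):
        print(y, counts.get(y, 0) / n)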

PART II: RANDOM WALKS

25 Random walk

The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Our Markov chains, so far, have been processes with time-homogeneous transition probabilities. A random walk is, besides time, also space-homogeneous. To be able to say this, we need to give more structure to the state space S, namely, a structure that enables us to say that
px,y = px+z,y+z  for any translation z.
This means that the transition probability pxy should depend on x and y only through their relative positions in space. More general spaces (e.g. groups) are allowed, but, here, we won't go beyond Zd, the set of d-dimensional vectors with integer coordinates. So, given a function p(x), x ∈ Zd, a random walk is a Markov chain with time- and space-homogeneous transition probabilities given by
px,y = p(y − x).

Example 1: Let p(x) = C 2^{−|x1|−···−|xd|}, where C is such that Σ_x p(x) = 1.

Example 2: Let p1, ..., pd, q1, ..., qd ≥ 0 with p1 + ··· + pd + q1 + ··· + qd = 1, and let ej be the vector that has 1 in the j-th position and 0 everywhere else. Define
p(x) = pj, if x = ej;  p(x) = qj, if x = −ej, j = 1, ..., d;  p(x) = 0, otherwise.
A random walk of this type is called simple random walk or nearest neighbour random walk. In dimension d = 1, besides time- and space-homogeneity, this walk enjoys an additional property, the skip-free property, namely that to go from state x to state y it must pass through all intermediate states, because its value can change by at most 1 at each step. If d = 1, then p(1) = p, p(−1) = q, where p + q = 1, and this is the drunkard's walk we've seen before.

Example 3: Define
p(x) = 1/2d, if x ∈ {±e1, ..., ±ed};  p(x) = 0, otherwise.

This is a simple random walk which, in addition, has equal probabilities to move from one state x to one of its "neighbouring states" x ± e1, ..., x ± ed. We call it simple symmetric random walk. If d = 1, then p(1) = p(−1) = 1/2, and this is the symmetric drunkard's walk.

The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by Sn the state of the walk at time n. It is clear that Sn can be represented as
Sn = S0 + ξ1 + ··· + ξn,
where ξ1, ξ2, ... are i.i.d. random variables in Zd with common distribution
P(ξn = x) = p(x),  x ∈ Zd,
and where, in addition, the initial state S0 is independent of the increments. We refer to the ξn as the increments of the random walk. Obviously,
p^{(n)}_{xy} = Px(Sn = y) = P(x + ξ1 + ··· + ξn = y) = P(ξ1 + ··· + ξn = y − x).
The random walk ξ1 + ··· + ξn is a random walk starting from 0, so we can reduce several questions to questions about this random walk. Furthermore, notice that
P(ξ1 + ξ2 = x) = Σ_y P(ξ1 = y, ξ1 + ξ2 = x) = Σ_y p(y) p(x − y).
(When we write a sum without indicating explicitly the range of the summation variable, we will mean a sum over all possible values.) Now the operation in the last term is known as convolution. The convolution of two probabilities p(x), p′(x), x ∈ Zd, is defined as a new probability p ∗ p′ given by
(p ∗ p′)(x) = Σ_y p(y) p′(x − y).
In this notation, P(ξ1 + ξ2 = x) = (p ∗ p)(x). The reader can easily check that
p ∗ p′ = p′ ∗ p,  p ∗ (p′ ∗ p″) = (p ∗ p′) ∗ p″,  p ∗ δ0 = p.
(Recall that δz is the probability distribution that assigns probability 1 to the point z.) We will let p∗n be the convolution of p with itself n times. We easily see that
P(ξ1 + ··· + ξn = x) = p∗n(x).
One could stop here and say that we have a formula for the n-step transition probability. But this would be nonsense, because all we have done is that we dressed the original problem in more fancy clothes: computing the n-th order convolution p∗n for all n is, in general, a notoriously hard problem. We shall resist the temptation to compute p∗n and look at other methods.
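That said, for an increment distribution with finitely many values, convolution powers are easy to compute numerically; here is a small illustrative sketch (not from the notes), applied to the symmetric drunkard's walk.

    from collections import defaultdict

    def convolve(p, q):
        """(p * q)(x) = sum_y p(y) q(x - y), for distributions given as dicts."""
        r = defaultdict(float)
        for y, py in p.items():
            for z, qz in q.items():
                r[y + z] += py * qz
        return dict(r)

    p = {1: 0.5, -1: 0.5}          # symmetric drunkard's walk increments
    pn = {0: 1.0}                  # p*0 = delta_0
    for _ in range(4):
        pn = convolve(pn, p)

    print(pn)                      # p*4: P(S_4 = x) for x in {-4, -2, 0, 2, 4}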

After all, who cares about the exact value of p∗n(x) for all n and x? Perhaps it is the case (a) that different questions need to be asked, or (b) that approximate methods are more informative than exact ones. Here are a couple of meaningful questions:
A. Will the random walk ever return to where it started from?
B. If yes, how long will that take?
C. If not, where will it go?

26 The simple symmetric random walk in dimension 1: path counting

This is a random walk with values in Z and transition probabilities pi,i+1 = pi,i−1 = 1/2 for all i ∈ Z. When the walk starts from 0, we can write
Sn = ξ1 + ··· + ξn,
where the ξn are i.i.d. random variables (random signs) with P(ξn = ±1) = 1/2. In the sequel, we shall learn some counting methods. The principle is trivial, but its practice may be hard.

First, look at the distribution of (ξ1, ..., ξn): for fixed n, we are dealing with a uniform distribution on the sample space {−1, +1}^n, since
P(ξ1 = ε1, ..., ξn = εn) = 2^{−n}
for all ε1, ..., εn ∈ {−1, +1}; all possible values of the vector are equally likely. Events A that depend only on the first n random signs are subsets of this sample space, A ⊂ {−1, +1}^n. Since there are 2^n elements in {−1, +1}^n, if we can count the number of elements of A, we can compute its probability:
P(A) = #A / 2^n,
where #A is the number of elements of A.

We can thus think of the elements of the sample space as paths: to each initial position and to each sequence of n signs we assign a path of length n. So if the initial position is S0 = 0, and the sequence of signs is {+1, +1, −1, +1, −1}, then we have a path in the plane that joins the points
(0,0) → (1,1) → (2,2) → (3,1) → (4,2) → (5,1).
As a warm-up, consider the following:

Example: Assume that S0 = 0, and compute the probability of the event
A = {S2 ≥ 0, S4 = −1, S6 < 0, S7 < 0}.
The paths comprising this event are shown below. We can easily count that there are 6 paths in A, out of the 2^7 = 128 paths of length 7, and so
P(A) = #A / 2^7 = 6/128 = 3/64 ≈ 0.047.

[Figure: the 6 paths of length 7 belonging to the event A.]

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation
{(m, x) ⇝ (n, y)} := {all paths that start at (m, x) and end up at (n, y)}.

[Figure: the lightly shaded region contains all paths of length n that start at (0, 0) (there are 2^n of them); the darker region shows all paths that start at (0, 0) and end up at the specific point (n, i).]

Let us first show that:

Lemma 14. The number of paths that start from (0, 0) and end up at (n, i) is given by:
#{(0,0) ⇝ (n,i)} = \binom{n}{(n+i)/2}.
More generally, the number of paths that start from (m, x) and end up at (n, y) is given by:
#{(m,x) ⇝ (n,y)} = \binom{n−m}{(n−m+y−x)/2}.  (23)

Remark. If n + i is not an even number, then there is no path that starts from (0, 0) and ends at (n, i); in that case, the binomial coefficient above is to be interpreted as 0 (see the convention in the proof).

Proof. Consider a path that starts from (0, 0) and ends at (n, i).

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
[Figure: a path from (m, x) to (n, y), and its reflection around a horizontal line, ending at (n, −y). Notice that the reflection starts only at the point where the path touches the horizontal line.]

Note that if we know the reflected path, we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y), and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y:
#{(m,x) ⇝ (n,y); touch zero} = #{(m,x) ⇝ (n,−y)} = \binom{n−m}{(n−m−y−x)/2}.
So, for x > 0, y > 0,
#{(m,x) ⇝ (n,y); remain > 0} = #{(m,x) ⇝ (n,y)} − #{(m,x) ⇝ (n,y); touch zero}
= \binom{n−m}{(n−m+y−x)/2} − \binom{n−m}{(n−m−y−x)/2}.

Corollary 14. For any y > 0,
#{(1,1) ⇝ (n,y); remain > 0} = (y/n) #{(0,0) ⇝ (n,y)}.  (25)

Proof. Let m = 1, x = 1 in (24):
#{(1,1) ⇝ (n,y); remain > 0} = \binom{n−1}{(n+y)/2 − 1} − \binom{n−1}{(n−y)/2 − 1}
= (n−1)! / [((n+y)/2 − 1)! ((n−y)/2)!] − (n−1)! / [((n+y)/2)! ((n−y)/2 − 1)!]
= n! / [((n+y)/2)! ((n−y)/2)!] · [ (1/n)(n+y)/2 − (1/n)(n−y)/2 ]
= (y/n) \binom{n}{(n+y)/2},
and the latter term equals (y/n) times the total number of paths from (0, 0) to (n, y).

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by:
#{(m,x) ⇝ (n,y); remain > z} = \binom{n−m}{(n−m+y−x)/2} − \binom{n−m}{(n−m−y−x)/2 + z}.  (26)

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have
#{(m,x) ⇝ (n,y); remain > z} = #{(m, x−z) ⇝ (n, y−z); remain > 0},
and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by:
#{(0,0) ⇝ (n,y); remain ≥ 0} = \binom{n}{(n+y)/2} − \binom{n}{(n+y)/2 + 1}.  (27)

Proof.
#{(0,0) ⇝ (n,y); remain ≥ 0} = #{(0,0) ⇝ (n,y); remain > −1} = #{(0,1) ⇝ (n, y+1); remain > 0}
= \binom{n}{(n+y)/2} − \binom{n}{(n−y)/2 − 1} = \binom{n}{(n+y)/2} − \binom{n}{(n+y)/2 + 1},
where we used (26) with z = −1.

Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by:
#{(0,0) ⇝ (n,0); remain ≥ 0} = (1/(n+1)) \binom{n+1}{n/2}.

Proof. Apply (27) with y = 0:
#{(0,0) ⇝ (n,0); remain ≥ 0} = \binom{n}{n/2} − \binom{n}{n/2 + 1} = (1/(n+1)) \binom{n+1}{n/2},
where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by:
#{(0,0) ⇝ (n, •); remain ≥ 0} = \binom{n}{⌈n/2⌉},  (28)
where ⌈x⌉ denotes the smallest integer that is ≥ x.

Proof. To see this, just use (27) and sum over all y ≥ 0. If n = 2m, then we need to set y = 2z in (27) (otherwise (27) is just zero), and so we have
#{(0,0) ⇝ (2m, •); remain ≥ 0} = Σ_{z≥0} [ \binom{2m}{m+z} − \binom{2m}{m+z+1} ]
= [\binom{2m}{m} − \binom{2m}{m+1}] + [\binom{2m}{m+1} − \binom{2m}{m+2}] + ··· ± \binom{2m}{2m}
= \binom{2m}{m} = \binom{n}{n/2},
because the sum telescopes: the intermediate terms cancel with one another and only the first one remains. If n = 2m + 1, then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero), and so we have
#{(0,0) ⇝ (2m+1, •); remain ≥ 0} = Σ_{z≥0} [ \binom{2m+1}{m+z+1} − \binom{2m+1}{m+z+2} ]
= \binom{2m+1}{m+1} = \binom{n}{⌈n/2⌉},
by the same telescoping.

[Figure: two shaded regions containing the same number of paths.] We have shown that the two shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?
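Formulae such as (27) and (28) lend themselves to brute-force verification: for small n one can enumerate all 2^n sign sequences. The sketch below (not part of the notes) checks Corollaries 17 and 18 for n = 8.

    from itertools import product
    from math import comb, ceil

    n = 8
    paths = list(product([1, -1], repeat=n))

    def partial_sums(p):
        s, out = 0, []
        for step in p:
            s += step
            out.append(s)
        return out

    # Corollary 17: paths (0,0) -> (n,0) staying >= 0 (a Catalan number).
    c17 = sum(1 for p in paths
              if (s := partial_sums(p))[-1] == 0 and min(s) >= 0)
    print(c17, comb(n + 1, n // 2) // (n + 1))   # both print 14

    # Corollary 18: paths of length n staying >= 0, any endpoint.
    c18 = sum(1 for p in paths if min(partial_sums(p)) >= 0)
    print(c18, comb(n, ceil(n / 2)))             # both print 70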

27 The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of a given length have the same probability, we can translate the path-counting formulae into corresponding probability formulae.

27.1 Distribution after n steps

Lemma 16.
P0(Sn = y) = \binom{n}{(n+y)/2} 2^{−n}.

More generally,
P(Sn = y | Sm = x) = \binom{n−m}{(n−m+y−x)/2} 2^{−(n−m)}.
In particular, if n is even,
u(n) := P0(Sn = 0) = \binom{n}{n/2} 2^{−n}.  (29)

Proof. We have
P0(Sn = y) = #{(0,0) ⇝ (n,y)} / 2^n  and  P(Sn = y | Sm = x) = #{(m,x) ⇝ (n,y)} / 2^{n−m},
and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n is given by
P0(S1 > 0, ..., Sn > 0 | Sn = y) = y/n.  (30)

Proof. Since the first step must be upwards, the left side of (30) equals
#{paths (1,1) ⇝ (n,y) that do not touch zero} × 2^{−n} / [ #{paths (0,0) ⇝ (n,y)} × 2^{−n} ]
= [ \binom{n−1}{(n−1+y−1)/2} − \binom{n−1}{(n−1−y−1)/2} ] / \binom{n}{(n+y)/2} = y/n,
by (25). Such a simple answer begs for a physical/intuitive explanation. This is called the ballot theorem, and the reason will be given in Section 29.

27.3 Some conditional probabilities

Lemma 17. If n, y ≥ 0 are both even or both odd, then
P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = 1 − P0(Sn = y + 2) / P0(Sn = y) = (y + 1) / ((n+y)/2 + 1).
In particular, if n is even,
P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = 0) = 1 / ((n/2) + 1).

Proof. We use (27):
P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = #{(0,0) ⇝ (n,y); remain ≥ 0} / #{(0,0) ⇝ (n,y)}
= [ \binom{n}{(n+y)/2} − \binom{n}{(n+y)/2 + 1} ] / \binom{n}{(n+y)/2}
= 1 − P0(Sn = y + 2) / P0(Sn = y).
This is further equal to
1 − [ n! / (((n+y)/2 + 1)! ((n−y)/2 − 1)!) ] · [ ((n+y)/2)! ((n−y)/2)! / n! ]
= 1 − ((n−y)/2) / ((n+y)/2 + 1) = (y + 1) / ((n+y)/2 + 1).

27.4 Some remarkable identities

Theorem 26. If n is even,
P0(S1 ≥ 0, ..., Sn ≥ 0) = P0(S1 ≠ 0, ..., Sn ≠ 0) = P0(Sn = 0).

Proof. For even n we have, using (28) and (29),
P0(S1 ≥ 0, ..., Sn ≥ 0) = #{(0,0) ⇝ (n, •); remain ≥ 0} 2^{−n} = \binom{n}{n/2} 2^{−n} = P0(Sn = 0).
We next have
P0(S1 ≠ 0, ..., Sn ≠ 0) = P0(S1 > 0, ..., Sn > 0) + P0(S1 < 0, ..., Sn < 0)   (a)
= 2 P0(S1 > 0, ..., Sn > 0)   (b)
= 2 P0(ξ1 = 1) P0(1 + ξ2 > 0, ..., 1 + ξ2 + ··· + ξn > 0 | ξ1 = 1)   (c), (d)
= P0(ξ2 ≥ 0, ξ2 + ξ3 ≥ 0, ..., ξ2 + ··· + ξn ≥ 0)   (e)
= P0(S1 ≥ 0, ..., Sn−1 ≥ 0)   (f)
= P0(S1 ≥ 0, ..., Sn ≥ 0),   (g)
where (a) is obvious, (b) follows from symmetry, (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1, (d) follows from P0(ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning, (e) follows from the fact that ξ1 is independent of ξ2, ξ3, ..., and from the fact that, for integer m, 1 + m > 0 is equivalent to m ≥ 0, (f) follows from the replacement of (ξ2, ξ3, ...) by (ξ1, ξ2, ...), and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 (Sn−1 cannot take the value 0, as n − 1 is odd), and Sn−1 > 0 implies Sn ≥ 0.

27.5 First return to zero

In (29), we computed, for even n, the probability
u(n) = P0(Sn = 0) = \binom{n}{n/2} 2^{−n}
that the walk (started at 0) attains value 0 in n steps. This is not so interesting: we now want to find the probability f(n) that the walk will return to 0 for the first time in n steps.

Theorem 27. Let n be even. Then
f(n) := P0(S1 ≠ 0, ..., Sn−1 ≠ 0, Sn = 0) = u(n) / (n − 1).

Proof. Since the random walk cannot change sign before becoming zero,
f(n) = P0(S1 > 0, ..., Sn−1 > 0, Sn = 0) + P0(S1 < 0, ..., Sn−1 < 0, Sn = 0),
and, by symmetry, the two terms must be equal. So
f(n) = 2 P0(S1 > 0, ..., Sn−1 > 0, Sn = 0) = 2 P0(Sn = 0) P(S1 > 0, ..., Sn−1 > 0 | Sn = 0) = 2 u(n) P(S1 > 0, ..., Sn−1 > 0 | Sn = 0).
Now, given Sn = 0, we either have Sn−1 = 1 or Sn−1 = −1, with equal probability; and Sn−1 > 0, Sn = 0 means Sn−1 = 1. Hence, using the Markov property at time n − 1,
P(S1 > 0, ..., Sn−1 > 0 | Sn = 0) = (1/2) P0(S1 > 0, ..., Sn−2 > 0 | Sn−1 = 1),
and the last probability can be computed from the ballot theorem (see (30)): it is equal to 1/(n − 1). So
f(n) = 2 u(n) · (1/2) · 1/(n − 1) = u(n) / (n − 1).

28 The reflection principle for a simple random walk in dimension 1

The reflection principle says the following. Fix a state a and put a two-sided mirror at a. If a particle is to the right of a, then you see its image on the left, and if the particle is to the left of a, then its mirror image is on the right. So if the particle is in position s, then its mirror image is in position s′ = 2a − s, because s − a = a − s′. It is clear that if a particle starts at 0 and performs a simple symmetric random walk, then its mirror image starts at 2a and performs a simple symmetric random walk.

What is more interesting is this: Run the particle till it hits a (it will, for sure; see Theorem 35), and after that consider its mirror image. In other words, letting
Ta = inf{n ≥ 0 : Sn = a}
be the first hitting time of a, define
Ŝn := Sn, if n < Ta;  Ŝn := 2a − Sn, if n ≥ Ta.

Theorem 28. Let (Sn, n ≥ 0) be a simple symmetric random walk started at 0, and let a > 0. Then (Ŝn, n ∈ Z+) is also a simple symmetric random walk started at 0.

Proof. By the strong Markov property, the process (S_{Ta+m} − a, m ≥ 0) is a simple symmetric random walk, independent of the past before Ta. But a symmetric random walk has the same law as its negative; so, for n ≥ Ta, Sn − a can be replaced by a − Sn, i.e. Sn = a + (Sn − a) can be replaced by a − (Sn − a) = 2a − Sn. The resulting process, called (Ŝn, n ∈ Z+), has the same law as (Sn, n ∈ Z+).

28.1 Distribution of hitting time and maximum

We can take advantage of this principle to establish the distribution function of Ta:

Theorem 29. Let a > 0. Then
P0(Ta ≤ n) = 2 P0(Sn ≥ a) − P0(Sn = a) = P0(|Sn| ≥ a) − (1/2) P0(|Sn| = a).

Proof. If Sn ≥ a, then Ta ≤ n, so
P0(Sn ≥ a) = P0(Ta ≤ n, Sn ≥ a).
Trivially,
P0(Ta ≤ n) = P0(Ta ≤ n, Sn ≤ a) + P0(Ta ≤ n, Sn > a) = P0(Ta ≤ n, Sn ≤ a) + P0(Sn > a),
where the last equality follows from the fact that Sn > a implies Ta ≤ n. In the first event on the right we have n ≥ Ta, so we may replace Sn by 2a − Sn (Theorem 28):
P0(Ta ≤ n, Sn ≤ a) = P0(Ta ≤ n, 2a − Sn ≥ a) = P0(Ta ≤ n, Ŝn ≥ a) = P0(Sn ≥ a).
Combining the last two displays gives
P0(Ta ≤ n) = P0(Sn ≥ a) + P0(Sn > a) = 2 P0(Sn ≥ a) − P0(Sn = a).
Finally, observing that Sn has the same distribution as −Sn, we get that this is also equal to P0(|Sn| ≥ a) − (1/2) P0(|Sn| = a).
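Theorem 29 can be checked by exhaustive enumeration of all paths of a small length n (a sketch, not part of the notes; the values of n and a are arbitrary):

    from itertools import product

    def check(n, a):
        """Compare P0(Ta <= n) with P0(|Sn| >= a) - (1/2) P0(|Sn| = a)."""
        lhs = rhs_ge = rhs_eq = 0
        for steps in product([1, -1], repeat=n):
            s, hit = 0, False
            for x in steps:
                s += x
                hit = hit or (s == a)       # did the path reach level a?
            lhs += hit
            rhs_ge += abs(s) >= a
            rhs_eq += abs(s) == a
        total = 2 ** n
        return lhs / total, (rhs_ge - rhs_eq / 2) / total

    print(check(10, 3))   # the two probabilities coincide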

Define now the running maximum
Mn = max(S0, S1, ..., Sn).
As a corollary to Theorem 29, we obtain:

Corollary 19. For x ≥ 0,
P0(Mn ≥ x) = P0(Tx ≤ n) = 2 P0(Sn ≥ x) − P0(Sn = x) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).

Proof. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n, which is equivalent to Tx ≤ n, and use Theorem 29.

We can do more: we can derive the joint distribution of Mn and Sn:

Theorem 30. For x > y,
P0(Mn < x, Sn = y) = P0(Sn = y) − P0(Sn = 2x − y).

Proof. If Tx ≤ n, then, by applying the reflection principle at level x, we can replace Sn by 2x − Sn in the last part of the path:
P0(Tx ≤ n, Sn = y) = P0(Tx ≤ n, Sn = 2x − y).
But if x > y, then 2x − y > x, and so {Sn = 2x − y} ⊂ {Tx ≤ n}, which results into
P0(Tx ≤ n, Sn = y) = P0(Sn = 2x − y).  (31)
Since Mn < x is equivalent to Tx > n, we have
P0(Mn < x, Sn = y) = P0(Tx > n, Sn = y) = P0(Sn = y) − P0(Tx ≤ n, Sn = y).
This, combined with (31), gives the result.

29 Urns and the ballot theorem

Suppose that an urn⁸ contains n items, of which a are coloured azure and b = n − a black. A sample from the urn is a sequence η = (η1, ..., ηn), where ηi indicates the colour of the i-th item picked. The original ballot problem, posed in 1887⁹, asks the following: if we start picking the items from the urn one by one without replacement, what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved¹⁰ in the same year. The answer is very simple:

Theorem 31. If an urn contains a azure items and b = n − a black items, where a ≥ b, then the probability that, during sampling without replacement, the number of selected azure items is always ahead of the number of selected black ones equals (a − b)/n.

[Footnote 8: An urn is a vessel, usually a vase furnished with a foot or pedestal, employed for different purposes, as for holding liquids, ornamental objects, the ashes of the dead after cremation, and anciently for holding ballots to be drawn. It is the latter use we are interested in in Probability and Statistics.]
[Footnote 9: Bertrand, J. (1887). Solution d'un problème. Compt. Rend. Acad. Sci. Paris, 105, 369.]
[Footnote 10: André, D. (1887). Solution directe du problème résolu par M. Bertrand. Compt. Rend. Acad. Sci. Paris, 105, 436-437.]

An urn can be realised quite easily by a random walk:

Theorem 32. Consider a simple symmetric random walk (Sn, n ≥ 0) with values in Z and let y be a positive integer. Then, conditional on the event {Sn = y}, the sequence (ξ1, ..., ξn) of the increments is an urn process with parameters n and a = (n + y)/2.

Proof. The values of each ξi are ±1, so we let the 'azure' items correspond to + signs and the 'black' items to − signs. If Sn = y then, letting a, b be defined by a + b = n, a − b = y, we see that a out of the n increments will be +1 and b will be −1. It remains to show that, conditional on {Sn = y}, all such values are equally likely, i.e. that (ξ1, ..., ξn) is uniformly distributed in a set with \binom{n}{a} elements. Let ε1, ..., εn be elements of {−1, +1} such that ε1 + ··· + εn = y. Then, using the definition of conditional probability and Lemma 16,
P0(ξ1 = ε1, ..., ξn = εn | Sn = y) = P0(ξ1 = ε1, ..., ξn = εn) / P0(Sn = y) = 2^{−n} / [ \binom{n}{(n+y)/2} 2^{−n} ] = 1 / \binom{n}{a}.
Thus, conditional on {Sn = y}, the sequence (ξ1, ..., ξn) has, indeed, a uniform distribution on this set.

Proof of Theorem 31. Since an urn with parameters n, a can, by Theorem 32, be realised by a simple symmetric random walk starting from 0 and ending at Sn = y, where y = a − (n − a) = 2a − n, we can translate the original question into a question about the random walk: the event that the azure items are always ahead of the black ones during sampling is precisely the event {S1 > 0, ..., Sn > 0} that the random walk remains positive. But in (30) we computed that
P0(S1 > 0, ..., Sn > 0 | Sn = y) = y/n.
Since y = a − (n − a) = a − b, the result follows (always assuming, without any loss of generality, that a ≥ b = n − a).

30 The asymmetric simple random walk in dimension 1

Consider now a simple random walk with values in Z, but which is not necessarily symmetric. (The word "asymmetric" will always mean "not necessarily symmetric".) So
pi,i+1 = p,  pi,i−1 = q = 1 − p,  i ∈ Z.
We have
P0(Sn = k) = \binom{n}{(n+k)/2} p^{(n+k)/2} q^{(n−k)/2},  −n ≤ k ≤ n.

30.1 First hitting time analysis

We will use probabilistic and analytical methods to compute the distribution of the hitting times
Tx = inf{n ≥ 0 : Sn = x},  x ∈ Z.

Lemma 18. For all x, y ∈ Z and all t ≥ 0,
Px(Ty = t) = P0(Ty−x = t).

Proof. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x, only the relative position of y with respect to x is needed. In view of this, it is no loss of generality to consider the random walk starting from 0.

Lemma 19. Consider a simple random walk starting from 0. Then, for x > 0, the summands in
Tx = T1 + (T2 − T1) + ··· + (Tx − Tx−1)
are i.i.d. random variables; in other words, the process (Tx, x ≥ 0) is also a random walk.

Proof. To prove this, we make essential use of the skip-free property, namely the obvious fact that a simple random walk starting from 0 must pass through levels 1, 2, ..., x − 1 in order to reach x. The rest is a consequence of the strong Markov property.

Corollary 20. Let ϕ(s) := E0 s^{T1}. Then, for x > 0,
E0 s^{Tx} = ϕ(s)^x.

In view of Lemma 19, we need to know the distribution of T1. This random variable takes values in {1, 2, ...} ∪ {∞}. We will approach it using probability generating functions.

Theorem 33. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level 1 is reached is given by
ϕ(s) = E0 s^{T1} = (1 − √(1 − 4pqs²)) / (2qs).

Proof. (We tacitly assume that S0 = 0.) Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. If S1 = 1, then T1 = 1. If S1 = −1, then T1 is the sum of three times: one unit of time spent to go from 0 to −1, plus a time T1′ required to go from −1 to 0 for the first time, plus a further time T1″ required to go from 0 to +1 for the first time. See the figure below.

[Figure: a trajectory of Sn starting at 0 with S1 = −1; the times T1′ and T1″ are indicated.]

If S1 = −1 then, in order to hit 1, the RW must first hit 0 (it takes time T1′ to do this) and then 1 (it takes a further time of T1″). We thus have
T1 = 1, if S1 = 1;  T1 = 1 + T1′ + T1″, if S1 = −1.
From the strong Markov property, T1″ is independent of T1′ conditional on S1 = −1, and each has the same distribution as T1. This can be used as follows:
ϕ(s) = E0[s 1(S1 = 1) + s^{1+T1′+T1″} 1(S1 = −1)]
= s P(S1 = 1) + s E0[s^{T1′} s^{T1″} | S1 = −1] P(S1 = −1)
= sp + s ϕ(s)² q.
Hence
ϕ(s) = ps + qs ϕ(s)²,
and the formula is obtained by solving the quadratic (taking the root which is a probability generating function).

Corollary 21. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level −1 is reached is given by
E0 s^{T−1} = (1 − √(1 − 4pqs²)) / (2ps).

Proof. The first time n such that Sn = −1 is the first time n such that −Sn = 1. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Hence the previous formula applies with the roles of p and q interchanged.

Corollary 22. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level x ∈ Z is reached is given by
E0 s^{Tx} = [ (1 − √(1 − 4pqs²)) / (2qs) ]^x, if x > 0;
E0 s^{Tx} = [E0 s^{T−1}]^{|x|} = [ (1 − √(1 − 4pqs²)) / (2ps) ]^{|x|}, if x < 0.  (32)

Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21.

30.2 First return to the origin

Now let us ask: if the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time
T0′ := inf{n ≥ 1 : Sn = 0}.
(Pay attention: it is important to write n ≥ 1 inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by
E0 s^{T0′} = 1 − √(1 − 4pqs²).

Proof. We again do first-step analysis:
E0 s^{T0′} = E0[s^{T0′} 1(S1 = −1)] + E0[s^{T0′} 1(S1 = 1)] = q E0[s^{T0′} | S1 = −1] + p E0[s^{T0′} | S1 = 1].
But if S1 = −1, then T0′ equals 1 plus the time required for the walk to hit 0 starting from −1; the latter is, in distribution, the same as the time required for the walk to hit 1 starting from 0, i.e. T1. Similarly, if S1 = 1, then T0′ equals 1 plus a time whose distribution is that of the time required for the walk to hit −1 starting from 0. Hence
E0 s^{T0′} = q E0(s^{1+T1}) + p E0(s^{1+T−1}) = qs (1 − √(1 − 4pqs²))/(2qs) + ps (1 − √(1 − 4pqs²))/(2ps) = 1 − √(1 − 4pqs²).

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin. For a symmetric random walk, this was found in §27.5: P0(T0′ = n) = f(n) = u(n)/(n − 1) if n is even. For the general case, we have computed the probability generating function E0 s^{T0′}; see Theorem 34. The tool we need here is Taylor's theorem for the function f(x) = (1 + x)^α, where α is a real number. If α = n is a positive integer, then
(1 + x)^n = Σ_{k=0}^{n} \binom{n}{k} x^k,
by the binomial theorem. If we define \binom{n}{k} to be equal to 0 for k > n, then we can omit the upper limit in the sum and simply write
(1 + x)^n = Σ_{k≥0} \binom{n}{k} x^k.

Recall that

C(n,k) = n! / (k!(n − k)!) = n(n − 1) · · · (n − k + 1) / k!.

It turns out that the formula is formally true for general α, and it looks exactly the same:

(1 + x)^α = Σ_{k≥0} C(α,k) x^k,

as long as we define

C(α,k) = α(α − 1) · · · (α − k + 1) / k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but that we won't worry about which ones. For example, we will show that

(1 − x)^{−1} = 1 + x + x² + x³ + · · ·,

which is of course the all too familiar geometric series. Indeed, by definition,

C(−1,k) = (−1)(−2) · · · (−1 − k + 1) / k! = (−1)^k k! / k! = (−1)^k,

so

(1 − x)^{−1} = Σ_{k≥0} C(−1,k) (−x)^k = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^{2k} x^k = Σ_{k≥0} x^k,

as needed. As another example, let us compute √(1 − x). We have

√(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2,k) (−1)^k x^k.

We can write this in more familiar symbols if we observe that

C(1/2,k) (−1)^k = −(1/(2k − 1)) C(2k,k) 4^{−k},

and this is shown by direct computation. So

√(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k,k) (x/4)^k.

We apply this to E_0 s^{T_0'} = 1 − √(1 − 4pqs²):

E_0 s^{T_0'} = Σ_{k≥1} (1/(2k − 1)) C(2k,k) (pq)^k s^{2k}.

But, by the definition of E_0 s^{T_0'},

E_0 s^{T_0'} = Σ_{n≥0} P_0(T_0' = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps), and then

P_0(T_0' = 2k) = (1/(2k − 1)) C(2k,k) p^k q^k = (1/(2k − 1)) P_0(S_{2k} = 0), k = 1, 2, . . . .
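This formula is easy to check numerically (a sketch of mine, not part of the notes): the value (1/(2k − 1)) C(2k,k) p^k q^k should agree with a direct first-passage computation that tracks walks which have not yet revisited 0.

from math import comb

def first_return_probs(p, kmax):
    # P_0(T_0' = 2k) for k = 1..kmax, by dynamic programming over walks
    # that have not yet revisited 0
    q = 1 - p
    probs = {}
    f = {1: p, -1: q}            # position distribution after the first step
    for n in range(2, 2 * kmax + 1):
        g = {}
        for x, w in f.items():
            for step, pr in ((1, p), (-1, q)):
                y = x + step
                if y == 0:
                    probs[n // 2] = probs.get(n // 2, 0.0) + w * pr
                else:
                    g[y] = g.get(y, 0.0) + w * pr
        f = g
    return probs

p, kmax = 0.3, 6
dp = first_return_probs(p, kmax)
for k in range(1, kmax + 1):
    series = comb(2 * k, k) * (p * (1 - p)) ** k / (2 * k - 1)
    print(k, round(dp[k], 10), round(series, 10))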

31 Recurrence properties of simple random walks

Consider again a simple random walk S_n with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

P_0( lim_{n→∞} S_n/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so S_n tends to +∞ with probability one. Hence the walk is transient.
• If p < 1/2 then p − q < 0 and so S_n tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to Markov chain theory. The key fact needed here is

P_0(T_x < ∞) = lim_{s↑1} E_0 s^{T_x},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

1 − 4pq = 1 − 4p + 4p² = (1 − 2p)², and hence √(1 − 4pq) = |1 − 2p| = |p − q|.

So

P_0(T_x < ∞) = ((1 − |p − q|)/(2q))^x, if x > 0; ((1 − |p − q|)/(2p))^{|x|}, if x < 0.    (33)

This formula can be written more neatly as follows:

Theorem 35.

P_0(T_x < ∞) = ((p/q) ∧ 1)^x, if x > 0; ((q/p) ∧ 1)^{|x|}, if x < 0.

In particular, if p = q then P_0(T_x < ∞) = 1 for all x ∈ Z.

Proof. If p > q then |p − q| = p − q, so that (1 − |p − q|)/(2q) = 2q/(2q) = 1 and (1 − |p − q|)/(2p) = 2q/(2p) = q/p, and (33) gives

P_0(T_x < ∞) = 1, if x > 0; (q/p)^{|x|}, if x < 0.

If p < q then |p − q| = q − p and, again by (33),

P_0(T_x < ∞) = (p/q)^x, if x > 0; 1, if x < 0,

which is what we want.

Theorem 36. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} S_n = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

M_∞ = sup(S_0, S_1, S_2, . . .).

Corollary 23. Consider a simple random walk with values in Z, and assume p < 1/2. Then the overall maximum M_∞ has a geometric distribution:

P_0(M_∞ ≥ x) = (p/q)^x, x ≥ 0.

Proof. If we let M_n = max(S_0, . . . , S_n), then the sequence M_n is nondecreasing. Every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞ with probability 1, because p < 1/2. We thus have M_∞ = lim_{n→∞} M_n, and so, for x ≥ 0,

P_0(M_∞ ≥ x) = lim_{n→∞} P_0(M_n ≥ x) = lim_{n→∞} P_0(T_x ≤ n) = P_0(T_x < ∞) = (p/q)^x.
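Corollary 23 lends itself to a quick simulation (a sketch; the drift p = 0.3 and the horizon are my arbitrary choices, the horizon being long enough that the maximum has effectively been attained before the walk escapes towards −∞):

import numpy as np

rng = np.random.default_rng(0)
p, q, trials, horizon = 0.3, 0.7, 5000, 1000
steps = np.where(rng.random((trials, horizon)) < p, 1, -1)
# running maximum of each walk; the outer maximum with 0 accounts for S_0 = 0
M = np.maximum(np.cumsum(steps, axis=1).max(axis=1), 0)
for x in range(5):
    print(x, round(float((M >= x).mean()), 4), round((p / q) ** x, 4))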

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. Otherwise, with some positive probability, it will never ever return to the point where it started from; unless p = 1/2, this probability is always strictly less than 1. What is this probability? Let

T_0' = inf{n ≥ 1 : S_n = 0}

be the first return time to 0.

Theorem 37. Consider a random walk with values in Z. Then

P_0(T_0' < ∞) = 2(p ∧ q).

Proof. The probability generating function of T_0' was obtained in Theorem 34. But

P_0(T_0' < ∞) = lim_{s↑1} E_0 s^{T_0'} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. If p = q this gives 1. In all cases the value is 2(p ∧ q).

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by

J_i = Σ_{n=1}^∞ 1(S_n = i),    (34)

a random variable first considered in Section 10, where it was also shown to be geometric: in Lemma 4 we showed that, if a Markov chain starts from a state i, then J_i is geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero,

P_0(J_0 ≥ k) = (2(p ∧ q))^k, k = 0, 1, 2, . . . ,
E_0 J_0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J_0 ≥ 1 if and only if the first return time to zero is finite. So

P_0(J_0 ≥ 1) = P_0(T_0' < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so, since J_0 is geometric (Lemma 4),

P_0(J_0 ≥ k) = P_0(J_0 ≥ 1)^k.

This proves the first formula. For the second formula we have

E_0 J_0 = Σ_{k=1}^∞ P_0(J_0 ≥ k) = Σ_{k=1}^∞ (2(p ∧ q))^k = 2(p ∧ q) / (1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any i:

P_i(J_i ≥ k) = (2(p ∧ q))^k, E_i J_i = 2(p ∧ q) / (1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case, P_i(J_i ≥ k) = 1 for all k, meaning that P_i(J_i = ∞) = 1, as already known.
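Theorem 38 can likewise be illustrated by simulation (a sketch with parameters of my own; the finite horizon truncates J_0, but for p well away from 1/2 the truncation error is negligible):

import random

random.seed(1)
p, trials, horizon = 0.35, 5000, 2000
r = 2 * min(p, 1 - p)
visits = []
for _ in range(trials):
    s, j = 0, 0
    for _ in range(horizon):
        s += 1 if random.random() < p else -1
        if s == 0:
            j += 1                  # one more visit to state 0
    visits.append(j)
for k in range(4):
    emp = sum(v >= k for v in visits) / trials
    print(k, round(emp, 4), round(r ** k, 4))
print("mean:", round(sum(visits) / trials, 4), " formula:", round(r / (1 - r), 4))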

We now look at the number of visits to some state x when we start from a different state.

Theorem 39. Consider a random walk with values in Z, and let k ≥ 1. If p ≤ 1/2 then

P_0(J_x ≥ k) = (p/q)^{x∨0} (2p)^{k−1}, E_0 J_x = (p/q)^{x∨0} / (1 − 2p),

while if p ≥ 1/2 then

P_0(J_x ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1}, E_0 J_x = (q/p)^{(−x)∨0} / (1 − 2q).

Proof. Suppose x > 0. Then

P_0(J_x ≥ k) = P_0(T_x < ∞, J_x ≥ k),

simply because if J_x ≥ k ≥ 1 then T_x < ∞. Apply now the strong Markov property at T_x:

P_0(T_x < ∞, J_x ≥ k) = P_0(T_x < ∞) P_x(J_x ≥ k − 1).

The reason that we replaced k by k − 1 is because, given that T_x < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain

P_0(J_x ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^{k−1}, x > 0.

If p ≤ 1/2 this gives P_0(J_x ≥ k) = (p/q)^x (2p)^{k−1}; if p ≥ 1/2 this gives P_0(J_x ≥ k) = (2q)^{k−1}. We repeat the argument for x < 0 and find

P_0(J_x ≥ k) = ((q/p) ∧ 1)^{|x|} (2(p ∧ q))^{k−1}, x < 0.

As for the expectation, we apply E_0 J_x = Σ_{k=1}^∞ P_0(J_x ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E_0 J_x is finite if p ≠ 1/2. Moreover, if the random walk "drifts down" (i.e. p < 1/2) then, for x > 0, E_0 J_x drops down to zero geometrically fast as x tends to infinity; but for x < 0, we have E_0 J_x = 1/(1 − 2p) regardless of x. In other words, starting from 0, the walk will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae, i.e.

P_y(J_x ≥ k) = P_0(J_{x−y} ≥ k), E_y J_x = E_0 J_{x−y}.
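The remark that a downward-drifting walk visits every negative level the same expected number of times is perhaps surprising, and can be checked directly (a sketch of mine; the finite horizon truncates the counts, harmlessly here, since the walk passes any fixed negative level early on):

import random

random.seed(2)
p, trials, horizon = 0.35, 3000, 4000
counts = {-1: 0, -3: 0, -5: 0}
for _ in range(trials):
    s = 0
    for _ in range(horizon):
        s += 1 if random.random() < p else -1
        if s in counts:
            counts[s] += 1
print("formula 1/(1 - 2p):", round(1 / (1 - 2 * p), 4))
for x in (-1, -3, -5):
    print(x, round(counts[x] / trials, 4))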

32 Duality

Consider a random walk S_k = Σ_{i=1}^k ξ_i, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (Ŝ_k, 0 ≤ k ≤ n) of (S_k, 0 ≤ k ≤ n) is defined by

Ŝ_k := S_n − S_{n−k}, 0 ≤ k ≤ n.

Theorem 40. (Ŝ_k, 0 ≤ k ≤ n) has the same distribution as (S_k, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ_1, ξ_2, . . . , ξ_n) has the same distribution as (ξ_n, ξ_{n−1}, . . . , ξ_1).

Thus, every time we have a probability for S we can translate it into a probability for Ŝ, which is basically another probability for S. Here is an application of this:

Theorem 41. Let (S_k) be a simple symmetric random walk and x a positive integer. Let T_x = inf{n ≥ 1 : S_n = x} be the first time that x will be visited. Then

P_0(T_x = n) = (x/n) P_0(S_n = x).

Proof. We have

P_0(T_x = n) = P_0(S_k < x, 0 ≤ k ≤ n − 1, S_n = x).

Rewrite the ballot theorem in terms of the dual:

P_0(Ŝ_1 > 0, . . . , Ŝ_n > 0 | Ŝ_n = x) = x/n.

Since Ŝ_k = S_n − S_{n−k} and Ŝ_n = S_n, the event on the left is {S_k < x, 0 ≤ k ≤ n − 1}, and the conditioning event is {S_n = x}. Because the dual has the same distribution as the walk itself (Theorem 40), we conclude

x/n = P_0(S_k < x, 0 ≤ k ≤ n − 1 | S_n = x) = P_0(T_x = n) / P_0(S_n = x).

Pause for reflection: We have been dealing with the random times T_x for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, T_x is the sum of x i.i.d. random variables, each distributed as T_1, and thus derived the generating function of T_x:

E_0 s^{T_x} = [(1 − √(1 − 4pqs²)) / (2qs)]^x.    (35)

Specialising to the symmetric case, we have

E_0 s^{T_x} = [(1 − √(1 − s²)) / s]^x.

In Theorem 29 we used the reflection principle and derived the distribution function of T_x for a simple symmetric random walk:

P_0(T_x ≤ n) = P_0(|S_n| ≥ x) − (1/2) P_0(|S_n| = x).    (36)

Lastly, in Theorem 41, we derived, using duality, the probabilities

P_0(T_x = n) = (x/n) P_0(S_n = x).    (37)

We remark here something methodological: completely different techniques led to results about, essentially, the same thing: the distribution of T_x. This is an important lesson to learn. Of course, the three formulae should be compatible: the number in (37) should be the coefficient of s^n in (35); summing up (37) should give (36), and taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
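A numerical check, on the other hand, is easy (my own sketch, not part of the notes): compute P_0(T_x = n) for the symmetric walk by a direct first-passage recursion and compare with (x/n) P_0(S_n = x) from (37).

from math import comb

def first_passage(x, nmax):
    # P_0(T_x = n) for the simple symmetric random walk, by tracking the
    # distribution of walks that have not yet reached level x
    f = {0: 1.0}
    out = {}
    for n in range(1, nmax + 1):
        g = {}
        for y, w in f.items():
            for z in (y - 1, y + 1):
                if z == x:
                    out[n] = out.get(n, 0.0) + w / 2
                else:
                    g[z] = g.get(z, 0.0) + w / 2
        f = g
    return out

x, nmax = 2, 10
fp = first_passage(x, nmax)
for n in range(x, nmax + 1, 2):
    rhs = (x / n) * comb(n, (n + x) // 2) * 0.5 ** n
    print(n, round(fp.get(n, 0.0), 8), round(rhs, 8))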

33 Amount of time that a SSRW is positive*

Consider a SSRW (S_n) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, S_n) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T_+(2n) be the number of sides of the polygonal line that are above the horizontal axis, and T_−(2n) = 2n − T_+(2n) the number of sides that are below. Clearly, T_±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T_0', it behaves again like a SSRW and is independent of the past.]

We can think of T_+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T_+(2n), arguing that P(T_+(2n) = m) is maximised when m = n, i.e. when T_+(2n) = T_−(2n). This is wrong. In fact, as the theorem below shows, the most likely values of T_+(2n) are m = 0 and m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

P(T_+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0), 0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T_0' that the RW returns to 0:

P(T_+(2n) = 2k) = Σ_{r=1}^n P(T_+(2n) = 2k, T_0' = 2r).

Between 0 and T_0' the random walk is either above the horizontal axis or below; call the first event A_+ and the second A_−, and write

P(T_+(2n) = 2k) = Σ_{r=1}^n P(A_+, T_0' = 2r) P(T_+(2n) = 2k | A_+, T_0' = 2r)
                + Σ_{r=1}^n P(A_−, T_0' = 2r) P(T_+(2n) = 2k | A_−, T_0' = 2r).

In the first case,

P(T_+(2n) = 2k | A_+, T_0' = 2r) = P(T_+(2n − 2r) = 2k − 2r),

because, if T_0' = 2r and A_+ occurs, then the RW has already spent 2r units of time above the axis and so T_+(2n) is reduced by 2r; furthermore, we may remove the conditioning because the future evolution of the RW after T_0' is independent of the past. In the second case, a similar argument shows

P(T_+(2n) = 2k | A_−, T_0' = 2r) = P(T_+(2n − 2r) = 2k).

We also observe that, by symmetry,

P(A_+, T_0' = 2r) = P(A_−, T_0' = 2r) = (1/2) P(T_0' = 2r) = (1/2) f_{2r}.

Letting p_{2k,2n} = P(T_+(2n) = 2k), we have shown that

p_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we realise that

p_{2k,2n} = (1/2) u_{2n−2k} Σ_{r=1}^k f_{2r} u_{2k−2r} + (1/2) u_{2k} Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r}.

By the first-return decomposition u_{2m} = Σ_{r=1}^m f_{2r} u_{2m−2r}, the first sum equals u_{2k} and the second equals u_{2n−2k}; hence p_{2k,2n} = u_{2k} u_{2n−2k}, completing the induction. Explicitly, we have

P(T_+(2n) = 2k) = C(2k,k) C(2n−2k, n−k) 2^{−2n}, 0 ≤ k ≤ n.    (38)
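Formula (38) is straightforward to evaluate (a small sketch of mine): the probabilities sum to 1 over 0 ≤ k ≤ n, and the U-shape, with the mass largest at the two ends, is immediate.

from math import comb

n = 50                       # i.e. 2n = 100 steps, as in the plot below
probs = [comb(2 * k, k) * comb(2 * n - 2 * k, n - k) / 4 ** n
         for k in range(n + 1)]
print("total mass:", round(sum(probs), 12))
for k in (0, 10, 25, 40, 50):
    print("P(T_+(100) =", 2 * k, ") =", round(probs[k], 5))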

With 2n = 100, we plot the function P(T_+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.


[Plot: the function P(T_+(100) = 2k) for 0 ≤ k ≤ 50; the values are smallest near the middle of the range and largest (about 0.08) at the two ends, so the distribution is U-shaped.]

The arcsine law*
Use of Stirling's approximation in formula (38) gives

P(T_+(2n) = 2k) ∼ 1 / (π √(k(n − k))), as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

P( T_+(2n)/(2n) ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T_+(2n)/(2n) = k/n )
                           ≈ Σ_{a ≤ k/n ≤ b} 1 / (π √(k(n − k)))
                           = Σ_{a ≤ k/n ≤ b} (1/n) · 1 / (π √((k/n)(1 − k/n))).

This is easily seen to have a limit as n → ∞, the limit being

∫_a^b dt / (π √(t(1 − t))),

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:

lim_{n→∞} P( T_+(2n)/(2n) ≤ x ) = ∫_0^x dt / (π √(t(1 − t))) = (2/π) arcsin √x.

We thus have that the limiting distribution of T_+(2n)/(2n) (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1 / (π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (T_x, x ≥ 0). This is another random walk, because the summands in T_x = Σ_{y=1}^x (T_y − T_{y−1}) are i.i.d., all distributed as T_1, and

P(T_1 ≥ 2n) = P(S_{2n} = 0) = C(2n,n) 2^{−2n} ∼ 1/√(πn).

From this, we immediately get that the expectation of T_1 is infinity: E T_1 = ∞. Many other quantities can be approximated in a similar manner.
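The arcsine law is also easy to observe empirically. The sketch below (mine; the number of steps and trials are arbitrary) estimates the distribution of the fraction of time spent above the axis and compares it with (2/π) arcsin √x. A side of the trajectory from (n−1, S_{n−1}) to (n, S_n) lies above the axis exactly when S_{n−1} + S_n > 0, since consecutive values differ by 1.

import math
import random

random.seed(3)
steps_per_walk, trials = 1000, 4000
fractions = []
for _ in range(trials):
    s, above = 0, 0
    for _ in range(steps_per_walk):
        prev = s
        s += random.choice((-1, 1))
        if prev + s > 0:       # this side of the trajectory is above the axis
            above += 1
    fractions.append(above / steps_per_walk)
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    emp = sum(f <= x for f in fractions) / trials
    print(x, round(emp, 3), round(2 / math.pi * math.asin(math.sqrt(x)), 3))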

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z. Let ξ_1, ξ_2, . . . be i.i.d. random vectors such that

P(ξ_n = (0,1)) = P(ξ_n = (0,−1)) = P(ξ_n = (1,0)) = P(ξ_n = (−1,0)) = 1/4.

We then let S_0 = (0,0) and S_n = Σ_{i=1}^n ξ_i. If x ∈ Z², we let x^1 be its horizontal coordinate and x^2 its vertical one, so that S_n = (S_n^1, S_n^2). We have:

Lemma 20. (S_n^1, n ∈ Z_+) is a random walk; so is (S_n^2, n ∈ Z_+). The two random walks are not independent.

Proof. Since ξ_1, ξ_2, . . . are independent, so are ξ_1^1, ξ_2^1, . . ., the latter being functions of the former. Hence (S_n^1 = Σ_{i=1}^n ξ_i^1, n ∈ Z_+) is a sum of i.i.d. random variables and hence a random walk. The increments are identically distributed with

P(ξ_n^1 = 0) = P(ξ_n = (0,1)) + P(ξ_n = (0,−1)) = 1/2,
P(ξ_n^1 = 1) = P(ξ_n = (1,0)) = 1/4,
P(ξ_n^1 = −1) = P(ξ_n = (−1,0)) = 1/4.

It is not a simple random walk because there is a possibility that the increment takes the value 0. The same argument applies to (S_n^2, n ∈ Z_+), showing that it, too, is a random walk. However, the two stochastic processes are not independent: for each n, the random variables ξ_n^1, ξ_n^2 are not independent, simply because if one takes the value 0 the other is nonzero.

Consider now a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x^1, x^2) into (x^1 + x^2, x^1 − x^2). So let

R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2, S_n^1 − S_n^2).

Lemma 21. (R_n^1, n ∈ Z_+) and (R_n^2, n ∈ Z_+) are independent simple symmetric random walks in one dimension.

Proof. We have R_n^1 = Σ_{i=1}^n (ξ_i^1 + ξ_i^2) and R_n^2 = Σ_{i=1}^n (ξ_i^1 − ξ_i^2). So R_n = Σ_{i=1}^n η_i, where η_n is the vector η_n = (η_n^1, η_n^2) = (ξ_n^1 + ξ_n^2, ξ_n^1 − ξ_n^2). The sequence of vectors η_1, η_2, . . . is i.i.d., being obtained by taking the same function on the elements of the sequence ξ_1, ξ_2, . . .; hence (R_n, n ∈ Z_+) is a random walk in two dimensions. The possible values of each of the η_n^1, η_n^2 are ±1, and

P(η_n^1 = 1, η_n^2 = 1) = P(ξ_n = (1,0)) = 1/4,
P(η_n^1 = 1, η_n^2 = −1) = P(ξ_n = (0,1)) = 1/4,
P(η_n^1 = −1, η_n^2 = 1) = P(ξ_n = (0,−1)) = 1/4,
P(η_n^1 = −1, η_n^2 = −1) = P(ξ_n = (−1,0)) = 1/4.

Hence P(η_n^1 = 1) = P(η_n^1 = −1) = 1/2 and P(η_n^2 = 1) = P(η_n^2 = −1) = 1/2, and, to show that the two components are independent, we check that

P(η_n^1 = ε, η_n^2 = ε′) = P(η_n^1 = ε) P(η_n^2 = ε′) for all choices of ε, ε′ ∈ {−1, +1},

which is immediate from the display above.

Lemma 22.

P(S_{2n} = (0,0)) = (C(2n,n) 2^{−2n})².

Proof. Since S_n^1 = (R_n^1 + R_n^2)/2 and S_n^2 = (R_n^1 − R_n^2)/2, we have S_{2n} = (0,0) if and only if R_{2n}^1 = R_{2n}^2 = 0. Hence, by the independence of the two walks,

P(S_{2n} = (0,0)) = P(R_{2n}^1 = 0, R_{2n}^2 = 0) = P(R_{2n}^1 = 0) P(R_{2n}^2 = 0) = (C(2n,n) 2^{−2n})²,

where the last equality uses the fact that each of (R_n^1), (R_n^2) is a simple symmetric random walk.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^∞ P(S_n = (0,0)) = ∞. But, by the previous lemma and by Stirling's approximation, P(S_{2n} = (0,0)) ∼ c/n, as n → ∞ (in the sense that the ratio of the two sides goes to 1), where c is a positive constant. Hence the series diverges.
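Lemma 22 can be confirmed exactly for small n by brute-force dynamic programming over the two-dimensional walk (a sketch of mine, not part of the notes):

from math import comb

def p_return_2d(nsteps):
    # probability that the 2-D SSRW is back at the origin after nsteps steps
    dist = {(0, 0): 1.0}
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    for _ in range(nsteps):
        new = {}
        for (x, y), w in dist.items():
            for dx, dy in moves:
                key = (x + dx, y + dy)
                new[key] = new.get(key, 0.0) + w / 4
        dist = new
    return dist.get((0, 0), 0.0)

for n in (1, 2, 3, 4):
    exact = (comb(2 * n, n) / 4 ** n) ** 2
    print(2 * n, round(p_return_2d(2 * n), 10), round(exact, 10))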

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Recall, from Theorem 7, that for a simple symmetric random walk started from S_0 = 0 and a < 0 < b,

P(T_a < T_b) = b/(b − a), P(T_b < T_a) = −a/(b − a).

We can read this as follows:

P(S_{T_a ∧ T_b} = a) = b/(b − a), P(S_{T_a ∧ T_b} = b) = −a/(b − a).

This last display tells us the distribution of S_{T_a ∧ T_b}. Note, in particular, that E S_{T_a ∧ T_b} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b with a < 0 < b such that pa + (1 − p)b = 0. Indeed, if p = M/N with M, N ∈ N and M < N, choose a = M − N and b = M. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability it exits from the left point is p, and the probability it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (p_i, i ∈ Z) with zero mean: Σ_{i∈Z} i p_i = 0. Can we find a stopping time T such that S_T has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (S_n) be a simple symmetric random walk and (p_i, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i p_i = 0. Then there exists a stopping time T such that

P(S_T = i) = p_i, i ∈ Z.

Proof. Define a pair of random variables (A, B), independent of (S_n), taking values in {(0,0)} ∪ {(i,j) : i < 0 < j}, with the following distribution:

P(A = 0, B = 0) = p_0,
P(A = i, B = j) = c^{−1} (j − i) p_i p_j, i < 0 < j,

where c is chosen by normalisation:

P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.

This leads to

c^{−1} Σ_{j>0} j p_j Σ_{i<0} p_i + c^{−1} Σ_{i<0} (−i) p_i Σ_{j>0} p_j = 1 − p_0.

Since Σ_{i∈Z} i p_i = 0, we have Σ_{i<0} (−i) p_i = Σ_{i>0} i p_i, and, using this, we get that c is precisely this common value:

c = Σ_{i<0} (−i) p_i = Σ_{i>0} i p_i.

Letting T_i = inf{n ≥ 0 : S_n = i}, i ∈ Z, we consider the random variable

Z = S_{T_A ∧ T_B}.

The claim is that P(Z = k) = p_k, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p_0.

Suppose next k > 0. Conditioning on (A, B), and using that (A, B) is independent of (S_n), we have

P(Z = k) = P(S_{T_A ∧ T_B} = k) = Σ_{i<0} Σ_{j>0} P(S_{T_i ∧ T_j} = k) P(A = i, B = j)
         = Σ_{i<0} P(S_{T_i ∧ T_k} = k) P(A = i, B = k),

since S_{T_i ∧ T_j} ∈ {i, j} can equal k > 0 only if j = k. By the gambler's ruin formula, P(S_{T_i ∧ T_k} = k) = −i/(k − i), so

P(Z = k) = Σ_{i<0} (−i/(k − i)) c^{−1} (k − i) p_i p_k = c^{−1} p_k Σ_{i<0} (−i) p_i = p_k,

by the definition of c. Similarly, P(Z = k) = p_k for k < 0.
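The proof is constructive, so it translates directly into a simulation (a sketch; the zero-mean target distribution below is an arbitrary example of mine):

import random

random.seed(4)
target = {-2: 0.1, -1: 0.3, 0: 0.3, 1: 0.1, 2: 0.2}   # sum of i*p_i is 0
c = sum(i * pi for i, pi in target.items() if i > 0)  # = sum of (-i)p_i, i < 0

# joint law of (A, B): P(A=0,B=0) = p_0, P(A=i,B=j) = (j - i) p_i p_j / c
pairs, weights = [(0, 0)], [target.get(0, 0.0)]
for i, pi in target.items():
    for j, pj in target.items():
        if i < 0 < j:
            pairs.append((i, j))
            weights.append((j - i) * pi * pj / c)

def embedded_sample():
    a, b = random.choices(pairs, weights=weights)[0]
    s = 0
    while s != a and s != b:          # run the walk until T_A or T_B
        s += random.choice((-1, 1))
    return s

trials, counts = 20000, {}
for _ in range(trials):
    v = embedded_sample()
    counts[v] = counts.get(v, 0) + 1
for i in sorted(target):
    print(i, round(counts.get(i, 0) / trials, 3), target[i])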

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

N = {1, 2, 3, . . .}.

The set Z of integers consists of N, their negatives and zero (zero is denoted by 0):

Z = {· · · , −3, −2, −1, 0, 1, 2, 3, · · · }.

Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a, d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d_1, d_2 are two such numbers then d_1 would divide d_2 and vice-versa, leading to d_1 = d_2.)

Notice that

gcd(a, b) = gcd(a, b − a).

Indeed, let d = gcd(a, b); we will show that d = gcd(a, b − a). Because d | a and d | b, we have d | b − a, so (i) holds. Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d, so (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. For instance,

gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26)
             = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.

But in the first line we subtracted 38 exactly 3 times, and 3 is the maximum number of times that 38 "fits" into 120 (indeed, 4 × 38 > 120). So we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school):

gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

The theorem of division says the following: given two positive integers a, b, we can find an integer q and an integer r ∈ {0, 1, . . . , b − 1} such that

a = qb + r.

The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 by 26 we compute

3450 = 132 × 26 + 18,

and obtain q = 132, r = 18.

Here are some other facts about the greatest common divisor. First, if d = gcd(a, b), then there are integers x, y such that

d = xa + yb.

To see this, let us follow the previous example again:

gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

In the process, we used the divisions

120 = 3 × 38 + 6, 38 = 6 × 6 + 2.

Working backwards, we have

2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.

Thus, gcd(38, 120) = 38x + 120y with x = 19, y = −6. The same logic works in general.
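The backward substitution above is exactly the extended Euclidean algorithm, which can be coded in a few lines (a sketch, not part of the notes):

def extended_gcd(a, b):
    # return (d, x, y) with d = gcd(a, b) = x*a + y*b
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    return d, y, x - (a // b) * y     # unwind one division step

print(extended_gcd(38, 120))          # (2, 19, -6): 2 = 19*38 - 6*120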

Second, we have, for integers a, b, not both zero,

gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.

To see why this is true, let d be the right-hand side. We first show that d divides both a and b. Suppose, to the contrary, that d does not divide b. Since d > 0, we can divide b by d to find

b = qd + r, for some 1 ≤ r ≤ d − 1.

Writing d = xa + yb for some x, y ∈ Z, we then get b = q(xa + yb) + r, yielding

r = (−qx)a + (1 − qy)b.

So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). So d is not minimum as claimed, and this is a contradiction. Hence d | b and, by the same argument, d | a, so (i) of the definition of a greatest common divisor holds. Next, suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition.

Finally, the greatest common divisor of 3 integers can now be defined in a similar manner. In general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. In general, a relation can be thought of as a set of pairs of elements of the set A. A relation is called reflexive if it contains all pairs (x, x). The relation < is not reflexive; the equality relation = is. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The relation < is not symmetric; the equality relation is. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric, and it can be thought of as reflexive (provided we accept that one is a friend of oneself), but it is not necessarily transitive. A relation is transitive if, whenever it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. A function is called one-to-one (or injective) if two different x_1, x_2 ∈ X have different values f(x_1), f(x_2). The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called onto (or surjective) if its range is Y. If B ⊂ Y, then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names; a function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names; so can the second, and the third, and so on. Hence the total number of assignments is

m × m × · · · × m (n times) = m^n.

Suppose now we are not allowed to give the same name to different animals; each assignment of this type is a one-to-one function from A to B. To be able to do this, we need more names than animals: n ≤ m. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is

(m)_n := m(m − 1)(m − 2) · · · (m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A. Thus, (n)_n is the total number of permutations of a set of n elements, and we denote (n)_n also by n!. Incidentally, n! stands for the product of the first n natural numbers:

n! = 1 × 2 × 3 × · · · × n.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found as follows. The first element can be chosen in n ways, the second in n − 1 ways, and so on, the k-th in n − k + 1 ways. So we have (n)_k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

(n)_k / k! =: C(n, k),

where the symbol C(n, k) (read "n choose k") is merely a name for the number computed on the left side of this equation. Clearly,

C(n, k) = n! / (k!(n − k)!) = C(n, n − k).

Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1. The total number of subsets of A (regardless of their number of elements) is thus equal to

Σ_{k=0}^n C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem. We thus observe that there are as many subsets in A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.

We shall let C(n, k) = 0 if n < k. Also, if x is a real number, we define

C(x, m) = x(x − 1) · · · (x − m + 1) / m!,

and, if y is not an integer, we let C(n, y) = 0 by convention.

The Pascal triangle relation states that

C(n + 1, k) = C(n, k) + C(n, k − 1).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose k animals from the remaining n ones, and this can be done in C(n, k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen, and this can be done in C(n, k − 1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Maximum and minimum

The minimum of two numbers a, b is denoted by

a ∧ b = min(a, b).

Thus a ∧ b = a if a ≤ b, and a ∧ b = b if b ≤ a. The maximum is denoted by

a ∨ b = max(a, b).

Note that

−(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c).

The notation extends to more than two numbers; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant.

In particular, we use the following notations:

a⁺ = max(a, 0), a⁻ = −min(a, 0).

a⁺ is called the positive part of a, and a⁻ is called the negative part of a. Both quantities are nonnegative. Note that it is always true that

a = a⁺ − a⁻.

The absolute value of a is defined by

|a| = a⁺ + a⁻.

Indicator function

1_A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. It is worth remembering that

1_{A∩B} = 1_A 1_B, 1 − 1_A = 1_{A^c}.

This is very useful notation, because conjunction of clauses is transformed into multiplication. An example of its application: since

1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1_A)(1 − 1_B) = 1_A + 1_B − 1_{A∩B},

we have, taking expectations,

E[1_{A∪B}] = E[1_A] + E[1_B] − E[1_{A∩B}],

which means

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

If A is an event then 1_A is a random variable. It takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is

E 1_A = 1 × P(A) + 0 × P(A^c) = P(A).

Several quantities of interest are expressed in terms of indicators. For instance, if A_1, A_2, . . . are sets or clauses, then the number Σ_{n≥1} 1_{A_n} is the number of true clauses: if I go to a bank N times and A_i is the clause "I am robbed the i-th time", then Σ_{n=1}^N 1_{A_n} is the total number of times I am robbed.

Another example: let X be a nonnegative random variable with values in N = {1, 2, . . .}. Then

X = Σ_{n=1}^X 1 (the sum of X ones is X),

so

X = Σ_{n≥1} 1(X ≥ n),

and therefore

E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function φ from R^n to R^m is such that

φ(αx + βy) = αφ(x) + βφ(y),

for all x, y ∈ R^n and all α, β ∈ R.

Let u_1, . . . , u_n be a basis for R^n. Then any x ∈ R^n can be uniquely written as

x = Σ_{j=1}^n x^j u_j,

where x^1, . . . , x^n ∈ R. The column (x^1, . . . , x^n)^T is a column representation of x with respect to the given basis. Then, because φ is linear,

y := φ(x) = Σ_{j=1}^n x^j φ(u_j).

But the φ(u_j) are vectors in R^m. So if we choose a basis v_1, . . . , v_m in R^m, we can express each φ(u_j) as a linear combination of the basis vectors:

φ(u_j) = Σ_{i=1}^m a_{ij} v_i.

Combining, we have

φ(x) = Σ_{i=1}^m v_i Σ_{j=1}^n a_{ij} x^j.

Let y^i := Σ_{j=1}^n a_{ij} x^j, i = 1, . . . , m. The column (y^1, . . . , y^m)^T is a column representation of y := φ(x) with respect to the basis v_1, . . . , v_m. The matrix A of φ with respect to the choice of the two bases is defined as the m × n array

A := ( a_{11} · · · a_{1n} ; · · · ; a_{m1} · · · a_{mn} ),

and the relation between the column representation of y and the column representation of x is

y^i = Σ_{j=1}^n a_{ij} x^j, i = 1, . . . , m,

i.e. the column of y equals A times the column of x.

Suppose now that ψ, φ are linear functions

R^n → R^m → R^ℓ,

ψ being the first arrow and φ the second. Then φ ∘ ψ is a linear function from R^n to R^ℓ.

Fix three sets of bases on R^n, R^m and R^ℓ, and let A, B be the matrix representations of φ, ψ, respectively, with respect to these bases. Then the matrix representation of φ ∘ ψ is AB, where

(AB)_{ij} = Σ_{k=1}^m A_{ik} B_{kj}, 1 ≤ i ≤ ℓ, 1 ≤ j ≤ n.

So A is an ℓ × m matrix, B is an m × n matrix, and AB is an ℓ × n matrix.

A matrix A is called square if it is an n × n matrix. If we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n, i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis: the standard basis in R^d consists of the vectors

e_1 = (1, 0, . . . , 0)^T, e_2 = (0, 1, . . . , 0)^T, . . . , e_d = (0, 0, . . . , 1)^T;

in other words, the j-th component of e_i is 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence a_1, a_2, . . . of numbers), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I; the maximum max I is defined likewise. This minimum (or maximum) may not exist. For instance I = {1, 1/2, 1/3, . . .} has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I, i.e. the maximum of all lower bounds. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists, and it is this that we call infimum. If I has no lower bound then inf I = −∞; if I is empty then, by convention, inf ∅ = +∞. Note that inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I.

Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I has no upper bound then sup I = +∞; by convention, sup ∅ = −∞. sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If x_1, x_2, . . . is a sequence of numbers then

inf{x_1, x_2, . . .} ≤ inf{x_2, x_3, . . .} ≤ inf{x_3, x_4, . . .} ≤ · · ·

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it the limit inferior of the sequence:

lim inf x_n = lim_{n→∞} inf{x_n, x_{n+1}, x_{n+2}, . . .}.

Similarly, we define lim sup x_n (limit superior) as the limit of the decreasing sequence sup{x_n, x_{n+1}, . . .}. We note that lim inf(−x_n) = −lim sup x_n. The sequence x_n has a limit ℓ if and only if lim sup x_n = lim inf x_n = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory); for this reason, I do sometimes "cheat" from a strict mathematical point of view, but not too much. But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely: if A_1, A_2, . . . are mutually disjoint events, then

P(∪_{n≥1} A_n) = Σ_{n≥1} P(A_n).

Everything else follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.)

Some derived rules, used everywhere: if A, B are disjoint events then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. Similarly, P(∪_n A_n) ≤ Σ_n P(A_n). If the events A_1, A_2, . . . are increasing with A = ∪_n A_n, then P(A) = lim_{n→∞} P(A_n); similarly, if A_1, A_2, . . . are decreasing with A = ∩_n A_n, then P(A) = lim_{n→∞} P(A_n). If A_n are events with P(A_n) = 0 then P(∪_n A_n) = 0; similarly, if A_n are events with P(A_n) = 1 then P(∩_n A_n) = 1.

Often, to compute P(A), it is useful to find disjoint events B_1, B_2, . . . such that ∪_n B_n = Ω, in which case we write

P(A) = Σ_n P(A ∩ B_n) = Σ_n P(A | B_n) P(B_n).

In Markov chain theory, we work a lot with conditional probabilities. Recall that P(B | A) is defined by P(A ∩ B)/P(A), and this is a very motivated definition. To compute P(A ∩ B), we read the definition as

P(A ∩ B) = P(A) P(B | A).

This is generalisable to a number of events:

P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 ∩ A_2).

If we interchange the roles of A, B in the definition, we get

P(A ∩ B) = P(A) P(B | A) = P(B) P(A | B),

and this is called Bayes' theorem.

If A is an event, then 1_A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 everywhere else. Note that 1_A 1_B = 1_{A∩B}, 1_{A^c} = 1 − 1_A, and E 1_A = P(A).

Quite often, in these notes, we need to argue that X ≤ Y in order to show that EX ≤ EY, or that A implies B (i.e. A is a subset of B) in order to show that P(A) ≤ P(B); showing that P(A) ≤ P(B) is thus more of a logical problem than a numerical one. Markov's inequality for a positive random variable X states that

P(X > x) ≤ EX / x.

This follows from the inequality x 1(X > x) ≤ X, which holds since X is positive: take expectations of both sides.

An application of this is:

P(X < ∞) = lim_{x→∞} P(X ≤ x).

In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable X may be defined by

EX = ∫_0^∞ P(X > x) dx.

The formula holds for any positive random variable. For a random variable that takes values in N, we can write

EX = Σ_{n=1}^∞ P(X ≥ n).

This is due to the fact that Σ_{n=1}^∞ P(X ≥ n) = E Σ_{n=1}^∞ 1(X ≥ n) = EX.

For a random variable which takes positive and negative values, we define X⁺ = max(X, 0) and X⁻ = −min(X, 0), which are both nonnegative, and then define

EX = EX⁺ − EX⁻,

which is motivated from the algebraic identity X = X⁺ − X⁻. This definition makes sense only if EX⁺ < ∞ or EX⁻ < ∞. In case the random variable is discrete, the formula reduces to the known one: EX = Σ_x x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^∞ x f(x) dx.

Generating functions

A generating function of a sequence a_0, a_1, . . . is an infinite polynomial having the a_i as coefficients:

G(s) = a_0 + a_1 s + a_2 s² + · · · .

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a_0. If it exists for some s_1 then it exists uniformly for all s with |s| < |s_1| (and this needs a proof, taught in calculus courses). Note that |G(s)| ≤ Σ_n |a_n||s|^n, and this is ≤ Σ_n |a_n| if s ∈ [−1, 1]. So if Σ_n a_n < ∞ then G(s) exists for all s ∈ [−1, 1]; this makes G a useful object when a_0, a_1, . . . is a probability distribution on Z_+ = {0, 1, 2, . . .}, because then Σ_n a_n = 1. As an example, the generating function of the geometric sequence a_n = ρ^n, n ≥ 0, is

Σ_{n=0}^∞ ρ^n s^n = 1/(1 − ρs),    (39)

as long as |ρs| < 1, i.e. |s| < 1/|ρ|.

Suppose now that X is a random variable with values in Z_+ ∪ {+∞}. The generating function of the sequence p_n = P(X = n), n = 0, 1, . . .,

φ(s) = E s^X = Σ_{n=0}^∞ s^n P(X = n),

is called the probability generating function of X.

This is defined for all −1 < s < 1. We do allow X to, possibly, take the value +∞; since |s| < 1, we have s^∞ = 0, so the term s^∞ p_∞, which could formally appear in the sum defining φ(s), vanishes. Note that

Σ_{n=0}^∞ p_n ≤ 1, with p_∞ := P(X = ∞) = 1 − Σ_{n=0}^∞ p_n,

and that

φ(0) = P(X = 0), φ(1) = Σ_{n=0}^∞ P(X = n) = P(X < +∞);

this is why φ is called probability generating function. We can recover the probabilities p_n from the formula

p_n = φ^{(n)}(0) / n!,

as follows from Taylor's theorem. Also, we can recover the expectation of X from

EX = lim_{s↑1} φ′(s),

something that can be formally seen from

(d/ds) E s^X = E (d/ds) s^X = E X s^{X−1},

by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

x_{n+2} = x_{n+1} + x_n, n ≥ 0, with x_0 = x_1 = 1.

If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is

Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x_0),

and the generating function of (x_{n+2}, n ≥ 0) is

Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x_0 − s x_1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0):

s^{−2}(X(s) − x_0 − s x_1) = s^{−1}(X(s) − x_0) + X(s).

Since x_0 = x_1 = 1, we can solve for X(s) and find

X(s) = −1 / (s² + s − 1).

But the polynomial s² + s − 1 has two roots:

a = (√5 − 1)/2, b = −(√5 + 1)/2.

Hence s² + s − 1 = (s − a)(s − b), and so

X(s) = (1/(a − b)) [1/(s − b) − 1/(s − a)] = (1/(b − a)) [1/(b − s) − 1/(a − s)].

But (39) tells us that ρ^n has generating function 1/(1 − ρs), so 1/((1/ρ) − s) = ρ/(1 − ρs) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1}, and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of

x_n = (1/(b − a)) [(1/b)^{n+1} − (1/a)^{n+1}],

which is therefore the solution to the recursion, i.e. the n-th term of the Fibonacci sequence. Noting that ab = −1, we can also write this as

x_n = ((−a)^{n+1} − (−b)^{n+1}) / (b − a).

Despite the square roots in the formula, this number is always an integer (it has to be!).
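The closed form can be checked against the recursion in a few lines (a sketch of mine; floating point suffices for small n):

import math

a = (math.sqrt(5) - 1) / 2
b = -(math.sqrt(5) + 1) / 2

def fib_closed(n):
    # the n-th Fibonacci number from the generating-function solution
    return ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)

xs = [1, 1]
while len(xs) < 12:
    xs.append(xs[-1] + xs[-2])
print([round(fib_closed(n)) for n in range(12)])
print(xs)                             # the two lists agree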

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. If X is a random variable with values in R, then the so-called distribution function, namely the function

P(X ≤ x), x ∈ R,

is such an example. But I could very well use the function P(X > x), for example, or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density

f(x) = (d/dx) P(X ≤ x).

All these objects can be referred to as the "distribution" or "law" of X. Even the generating function E s^X of a Z_+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z_+.

Warning: the verb "specify" should not be interpreted as "compute" in the sense of the existence of an analytical formula or algorithm. We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. Earlier, we encountered the concept of an excursion of a Markov chain from a state i: such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using, or merely talking about, such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × · · · × n for large n then, since the number is so big, it's better to use logarithms:

n! = e^{log n!} = exp Σ_{k=1}^n log k.

Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that

∫_k^{k+1} log x dx ≈ log k,

and hence Σ_{k=1}^n log k ≈ ∫_1^n log x dx. Since (d/dx)(x log x − x) = log x, it follows that

∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses,

we have exactly n heads equals C(2n,n) 2^{−2n} ≈ 1, and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:

n! ∼ n^n e^{−n} √(2πn),

where the symbol a_n ∼ b_n means that a_n / b_n → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z_+ is geometric if it has the memoryless property:

P(X − n ≥ m | X ≥ n) = P(X ≥ m), m, n ≥ 0.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X itself. The memoryless property shows that such a random variable has a distribution of the form

P(X ≥ n) = λ^n, n ∈ Z_+,

where λ = P(X ≥ 1). We baptise this the geometric(λ) distribution. We have already seen a geometric random variable, namely

J_0 = Σ_{n=1}^∞ 1(S_n = 0),

defined in (34). It was shown in Theorem 38 that J_0 satisfies the memoryless property; in fact, it was shown there that P(J_0 ≥ k) = P(J_0 ≥ 1)^k, a geometric function of k.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely, you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Consider an increasing and nonnegative function g : R → R_+. Then, for all x, y ∈ R,

g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. The single display above thus casts both properties of g neatly. Now let X be a random variable. Then

g(X) ≥ g(x) 1(X ≥ x),

and so, by taking expectations of both sides (expectation is an increasing operator),

E g(X) ≥ g(x) P(X ≥ x).

This completely trivial thought gives the very useful Markov's inequality.

