Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1  Introduction
2  Examples of Markov chains
   2.1  A mouse in a cage
   2.2  Bank account
   2.3  Simple random walk (drunkard's walk)
   2.4  Simple random walk (drunkard's walk) in a city
   2.5  Actuarial chains
3  The Markov property
   3.1  Definition of the Markov property
   3.2  Transition probability and initial distribution
   3.3  Time-homogeneous chains
4  Stationarity
   4.1  Finding a stationary distribution
5  Topological structure
   5.1  The graph of a chain
   5.2  The relation "leads to"
   5.3  The relation "communicates with"
   5.4  Period
6  Hitting times and first-step analysis
7  Gambler's ruin
8  Stopping times and the strong Markov property
9  Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1  First hitting time decomposition
   10.2  Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1  Functions of excursions
   13.2  Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1  Definition of stability
   18.2  The fundamental stability theorem for Markov chains
   18.3  Coupling
   18.4  Proof of the fundamental stability theorem
   18.5  Terminology
   18.6  Periodic chains
19 Limiting theory for Markov chains*
   19.1  Limiting probabilities for null recurrent states
   19.2  The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1  Subsequences
   23.2  Watching a Markov chain when it visits a set
   23.3  Subordination
   23.4  Deterministic function of a Markov chain
24 Applications
   24.1  Branching processes: a population growth application
   24.2  The ALOHA protocol: a computer communications application
   24.3  PageRank (trademark of Google): A World Wide Web application
   24.4  The Wright-Fisher model: an example from biology
   24.5  A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1  Paths that start and end at specific points
   26.2  Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1  Distribution after n steps
   27.2  The ballot theorem
   27.3  Some conditional probabilities
   27.4  Some remarkable identities
   27.5  First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1  Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1  First hitting time analysis
   30.2  First return to the origin
   30.3  The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1  Asymptotic distribution of the maximum
   31.2  Return probabilities
   31.3  Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. At the end, I have a little mathematical appendix. A few starred sections should be considered "advanced" and can be omitted at first reading. I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). Here, x = x(t) denotes the position of the particle at time t, its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). Newton's second law says that the acceleration is proportional to the force: mẍ = F. It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as
xn+1 = yn,
yn+1 = rem(xn, yn).
In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, ..., has the property that, for any n, the future pairs Xn+1, Xn+2, ... depend only on the pair Xn.
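For concreteness, here is the gcd recursion written out as a little program (a sketch in Python, not part of the original notes; the function name gcd_chain is my own):

def gcd_chain(a, b):
    # State (x, y), initialised by x0 = max(a, b), y0 = min(a, b).
    x, y = max(a, b), min(a, b)
    while y > 0:
        # One step of the dynamical system: x_{n+1} = y_n, y_{n+1} = rem(x_n, y_n).
        x, y = y, x % y
    return x  # the other component is now 0; x is gcd(a, b)

print(gcd_chain(1071, 462))  # prints 21

Note that the next state depends only on the current pair (x, y), exactly the "dependence of the future only on the present" described above.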

e. an effective way for dealing with large problems is via the use of randomness. a ≤ X0 ≤ b.. Other examples of dynamical systems are the algorithms run. past states are not needed. The generalisation takes us into the realm of Randomness. we will see what kind of questions we can ask and what kind of answers we can hope for. Count the number of 1’s and divide by N . analyse.05. . Similarly. When the mouse is in cell 1 at time n (minutes) then. instead of deterministic objects. Markov chains generalise this concept of dependence of the future only on the present.000 columns or computing the integral of a complicated function of a large number of variables. Some of these algorithms are deterministic. In addition. Try this on the computer! 2 . say.99. We can summarise this information by the transition diagram: 1 Stochastic means random. In other words. An example of a random algorithm is the Monte Carlo method for the b approximate computation of the integral a f (x)dx of a complicated function f : suppose that f is positive and bounded below B. 1 and 2. Indeed. A scientist’s job is to record the position of the mouse every minute. . Y0 ) of random numbers. We will be dealing with random variables. Yn ). respectively. Choose a pair (X0 . The resulting ratio is an approximation b of a f (x)dx. Each time count 1 if Yn < f (Xn ) and 0 otherwise.000 rows and 10. it moves from 2 to 1 with probability β = 0.can be completely specified by the current state X(t).1 Examples of Markov chains A mouse in a cage A mouse is in a cage with two cells. i. at time n + 1 it is either still in 1 or has moved to 2. for steps n = 1. Perform this a large number N of times. Repeat with (Xn . 0 ≤ Y0 ≤ B. by the software in your computer. We will also see why they are useful and discuss how they are applied. We will develop a theory that tells us how to describe. but some are stochastic. via the use of the tools of Probability. . 1 2 Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.1 They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but “very large” such as finding the determinant of a matrix with 10. uniformly. 2. regardless of where it was at earlier times. and use those mathematical models which are called Markov chains. 2 2. A mouse lives in the cage. containing fresh and stinky cheese. it does so.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse lives in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; it does so regardless of where it was at earlier times. Similarly, it moves from 2 to 1 with probability β = 0.99. We can summarise this information by the transition diagram:

[Transition diagram: two states 1 and 2; an arrow 1 → 2 labelled α, an arrow 2 → 1 labelled β, a self-loop at 1 labelled 1 − α, and a self-loop at 2 labelled 1 − β.]

Another way to summarise the information is by the 2×2 transition probability matrix

P = [ 1−α    α  ]  =  [ 0.95  0.05 ]
    [  β    1−β ]     [ 0.99  0.01 ]

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)
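This answer is easy to check by simulation (a sketch, not part of the notes): simulate the coin tosses and average the time of the first move from cell 1 to cell 2.

import random

alpha = 0.05  # P(move from cell 1 to cell 2 in one minute)

def time_to_leave_cell_1():
    # Number of minutes until the mouse first moves from cell 1 to cell 2.
    n = 0
    while True:
        n += 1
        if random.random() < alpha:
            return n

trials = 100_000
print(sum(time_to_leave_cell_1() for _ in range(trials)) / trials)  # about 20

Question 2 is answered later, once stationary distributions are available (Section 4.1).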

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:
Xn+1 = max{Xn + Dn − Sn, 0}.
Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that Dn, Sn, n = 0, 1, 2, ..., are random variables with distributions FD, FS, respectively. Assume also that X0, D0, D1, D2, ..., S0, S1, S2, ... are independent random variables. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability, which is defined by:
px,y := P(Xn+1 = y | Xn = x), x, y = 0, 1, 2, ....
We easily see that if y > 0 then
px,y = P(Dn − Sn = y − x)
     = Σ_{z=0}^{∞} P(Sn = z, Dn − Sn = y − x)
     = Σ_{z=0}^{∞} P(Sn = z, Dn = z + y − x)
     = Σ_{z=0}^{∞} P(Sn = z)P(Dn = z + y − x)
     = Σ_{z=0}^{∞} FS(z)FD(z + y − x),
while, if y = 0, we have
px,0 = 1 − Σ_{y=1}^{∞} px,y,
something that, in principle, can be computed from FS and FD. The transition probability matrix is

     [ p0,0  p0,1  p0,2  ··· ]
P =  [ p1,0  p1,1  p1,2  ··· ]
     [ p2,0  p2,1  p2,2  ··· ]
     [  ···   ···   ···  ··· ]

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.
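The formula for px,y is easy to evaluate on a computer once FD and FS are specified. A sketch (the Poisson deposits and geometric spending below are assumptions of mine, chosen only so that the code is concrete):

from math import exp, factorial

def FD(k):  # P(Dn = k): Poisson(2) deposits (an assumed distribution)
    return exp(-2) * 2**k / factorial(k) if k >= 0 else 0.0

def FS(k):  # P(Sn = k): geometric spending on {0, 1, 2, ...} (assumed)
    return 0.5 * 0.5**k if k >= 0 else 0.0

def p(x, y, zmax=200):
    # Transition probability p_{x,y} = sum_z P(S = z) P(D = z + y - x), y > 0.
    if y > 0:
        return sum(FS(z) * FD(z + y - x) for z in range(zmax))
    # p_{x,0} = 1 - sum_{y >= 1} p_{x,y}
    return 1.0 - sum(p(x, y) for y in range(1, x + zmax))

print(p(3, 4))                             # chance the account goes from 3 to 4
print(sum(p(3, y) for y in range(0, 300))) # row sums are (about) 1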

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

[Diagram: the integers ..., −2, −1, 0, 1, 2, 3, ..., with arrows of probability 1/2 from each position to each of its two neighbours.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and, let's say, he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Transition diagram on the states H (healthy), S (sick), D (dead): H → S with probability 0.3, S → H with probability 0.8, H → D with probability 0.01, S → D with probability 0.1.]

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H, pS,S and pD,D. The first two are pH,H = 1 − pH,S − pH,D and pS,S = 1 − pS,H − pS,D, since the probabilities of the arrows out of each state must sum to one. The third is pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
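This question can already be attacked numerically: the probability that a currently healthy individual is still alive after n months comes from powers of the transition matrix restricted to the living states. A sketch, assuming the values read off the diagram above (pH,S = 0.3, pH,D = 0.01, pS,H = 0.8, pS,D = 0.1):

# States: 0 = H, 1 = S (D is absorbing, so we only track the living states).
# Q holds the transition probabilities among {H, S}.
Q = [[1 - 0.3 - 0.01, 0.3],   # from H: stay H, become S
     [0.8, 1 - 0.8 - 0.1]]    # from S: recover, stay S

def alive_prob(n):
    # P(alive after n months | healthy now) = sum of the H-row of Q^n.
    row = [1.0, 0.0]  # start healthy
    for _ in range(n):
        row = [row[0] * Q[0][0] + row[1] * Q[1][0],
               row[0] * Q[0][1] + row[1] * Q[1][1]]
    return sum(row)

# Under the convention that the lifetime is the month of entering D:
# P(lifetime = n) = P(alive after n-1 months) - P(alive after n months).
for n in (1, 12, 120):
    print(n, alive_prob(n - 1) - alive_prob(n))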

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property.

Usually, T = Z+ = {0, 1, 2, 3, ...} or N = {1, 2, 3, ...} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice. So we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, ..., in+1 ∈ S,
P(Xn+1 = in+1 | Xn = in, ..., X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, ..., Xn−1) given Xn. On the other hand, if the claim holds, we can see, by applying it several times, that we obtain
P(∀ k ∈ [n+1, n+m] Xk = ik | Xn = in, ..., X0 = i0) = P(∀ k ∈ [n+1, n+m] Xk = ik | Xn = in),
for any m and any choice of states i0, ..., in+m, which precisely means that (Xn+1, ..., Xn+m) and (X0, ..., Xn−1) are independent conditionally on Xn. This is true for any n and m, and so future is independent of past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as
P(X0 = i0, X1 = i1, X2 = i2, ..., Xn = in)
= P(X0 = i0)P(X1 = i1 | X0 = i0)P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions
µ(i) := P(X0 = i), i ∈ S,
pi,j(n, n+1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,
will specify all joint distributions²:
P(X0 = i0, X1 = i1, X2 = i2, ..., Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n−1, n).
The function µ(i) is called initial distribution. The function pi,j(n, n+1) is called transition probability from state i at time n to state j at time n + 1. More generally,
pi,j(m, n) := P(Xn = j | Xm = i)
is the transition probability from state i at time m to state j at time n ≥ m. By (1),
pi,j(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} pi,i1(m, m+1) pi1,i2(m+1, m+2) · · · pin−1,j(n−1, n),   (2)
which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix
P(m, n) := [pi,j(m, n)]i,j∈S,
i.e. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we have, by definition, pi,j(n, n) = 1(i = j) and, by (2), we can write
P(m, n) = P(m, m+1)P(m+1, m+2) · · · P(n−1, n),
or
P(m, n) = P(m, ℓ)P(ℓ, n), if m ≤ ℓ ≤ n,
something that can be called semigroup property or Chapman-Kolmogorov equation.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n+1) do not depend on n.

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

In this case we can simply write pi,j(n, n+1) =: pi,j, and call pi,j the (one-step) transition probability from state i to state j, while
P = [pi,j]i,j∈S
is called the (one-step) transition probability matrix or, simply, transition matrix. Then
P(m, n) = P × P × · · · × P = P^(n−m) (n − m factors),
for all m ≤ n, as follows from (1) and the definition of P. From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let
µn := [µn(i) = P(Xn = i)]i∈S
be the distribution of Xn, arranged in a row vector. Then we have
µn = µ0 P^n,
and the semigroup property is now even more obvious because
P^(a+b) = P^a P^b,
for all integers a, b ≥ 0.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi:
δi(j) := 1, if j = i; 0, otherwise.   (3)
We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = pδi1 + (1 − p)δi2.

A useful convention used for time-homogeneous chains is to write
Pi(A) := P(A | X0 = i).
Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write
Ei Y := E(Y | X0 = i) = Σy y P(Y = y | X0 = i) = Σy y Pi(Y = y).
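The matrix notation is just as handy on a computer. A small sketch (using numpy; the two-state matrix is the mouse example's):

import numpy as np

P = np.array([[0.95, 0.05],
              [0.99, 0.01]])      # mouse transition matrix
mu0 = np.array([1.0, 0.0])        # unit mass at state 1, i.e. delta_1

# mu_n = mu_0 P^n, computed by repeated right-multiplication:
mu = mu0
for n in range(5):
    mu = mu @ P
    print(n + 1, mu)

# Semigroup property: P^(a+b) = P^a P^b
a, b = 3, 4
assert np.allclose(np.linalg.matrix_power(P, a + b),
                   np.linalg.matrix_power(P, a) @ np.linalg.matrix_power(P, b))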

4 Stationarity

We say that the process (Xn, n = 0, 1, ...) is stationary if it has the same law as (Xn, n = m, m+1, ...), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa. Let Xn be the number of molecules in room 1 at time n. Then
pi,i+1 = P(Xn+1 = i+1 | Xn = i) = (N − i)/N,
pi,i−1 = P(Xn+1 = i−1 | Xn = i) = i/N.
If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, we indeed expect that we have N/2 molecules in each room, as it takes some time before the molecules diffuse from room to room. Another guess then is:

. Proof. . Lemma 2. then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . Since. Xm+1 = i1 . i 0 ≤ i ≤ N. we are led to consider: The stationarity problem: Given [pij ]. . . A Markov chain (Xn . we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. .i = P (X0 = i − 1)pi−1. 2−N i−1 i+1 N N (The dots signify some missing algebra. the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. 10 . then we have created a stationary Markov chain. Indeed. in ∈ S. . In other words.i + π(i + 1)pi+1. .) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. The equations (4) are called balance equations. Xn = in ) = P (Xm = i0 . If we do so. for all m ≥ 0 and i0 . Changing notation. Xm+n = in ). . (a binomial distribution). For the if part. X2 = i2 . by (3). . . (4) Such a π is called . If we can do so. hence. Xm+2 = i2 . P (Xn = i) = π(i) for all n. stationary or invariant distribution.i = N N −i+1 N i+1 2−N + = · · · = π(i). . We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. a Markov chain is stationary if and only if the distribution of Xn does not change with n. . use (1) to see that P (X0 = i0 . The only if part is obvious. which you should go through by yourselves. find those distributions π such that πP = π.each molecule at random either in room 1 or in room 2.i + P (X0 = i + 1)pi+1. But we need to prove this.i = π(i − 1)pi−1. P (X1 = i) = j P (X0 = j)pj. X1 = i1 .

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1, because Σj∈S pi,j = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to (4), π must be a probability:
Σi∈S π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, 2, ..., d}. Each x ∈ Sd is of the form x = (x1, ..., xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:
π(x) = 1/d!, x ∈ Sd.

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i ≥ 1. The times between successive buses are i.i.d. random variables. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. Then (Xn, n ≥ 0) is a Markov chain with transition probabilities
pi+1,i = 1, i ≥ 1,
p1,i = p(i), i ≥ 1.
The state space is S = N = {1, 2, ...}. To find the stationary distribution, write the equations (4):
π(i) = Σj≥1 π(j)pj,i = π(1)p(i) + π(i+1), i ≥ 1.
These can easily be solved:
π(i) = π(1) Σj≥i p(j), i ≥ 1.
To compute π(1), we use the normalisation condition Σi≥1 π(i) = 1, which gives
π(1) = ( Σi≥1 Σj≥i p(j) )^(−1).
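A sketch computing π for a concrete interarrival distribution (the finite-support p(·) below is my choice, only so the sums are finite):

# Next bus arrives in i minutes with probability p(i); a finite-support example.
p = {1: 0.2, 2: 0.3, 3: 0.5}          # assumed interarrival distribution
imax = max(p)

tail = {i: sum(p.get(j, 0.0) for j in range(i, imax + 1)) for i in range(1, imax + 1)}
pi1 = 1.0 / sum(tail.values())         # pi(1) = (sum_i sum_{j>=i} p(j))^(-1)
pi = {i: pi1 * tail[i] for i in range(1, imax + 1)}

print(pi)                              # the stationary distribution
print(sum(i * p[i] for i in p), 1 / pi1)  # both equal the mean interarrival time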

But here, we must be careful: Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:
ASSUMPTION: Σi≥1 Σj≥i p(j) < ∞.
Let us work out the sum to see if this assumption can be expressed in more "physical" terms:
Σi≥1 Σj≥i p(j) = Σj≥1 Σi≥1 1(j ≥ i)p(j) = Σj≥1 j p(j),
and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:
ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.
In fact, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength.) Consider an infinite binary vector
ξ = (ξ1, ξ2, ...), ξi ∈ {0, 1} for all i,
and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, ...); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, ...). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:
ξ(n+1) = ϕ(ξ(n)).
Here ξ(n) = (ξ1(n), ξ2(n), ...) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, ..., i.i.d. with
P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2,
then the same will be true for all n:
P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2.
The proof is by induction: if the ξ1(n), ξ2(n), ... are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2 then the same is true at n + 1. Indeed, for any n = 0, 1, 2, ..., any r = 1, 2, ..., and any ε1, ..., εr ∈ {0, 1},
P(ξ1(n+1) = ε1, ..., ξr(n+1) = εr)
= P(ξ1(n) = 0, ξ2(n) = ε1, ..., ξr+1(n) = εr) + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, ..., ξr+1(n) = 1 − εr),   (5)
and, from (5) and the induction hypothesis, each of the two terms equals 2^(−(r+1)), so their sum equals 2^(−r), as required. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies
π = πP (balance equations)
π1 = 1 (normalisation condition).
In other words,
π(i) = Σj∈S π(j)pj,i, i ∈ S,
Σj∈S π(j) = 1.
If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:
F(A, Ac) := Σi∈A Σj∈Ac π(i)pi,j.
Think of π(i)pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and
F(A, Ac) = F(Ac, A), for all A ⊂ S.

Proof. First, suppose that the balance equations hold. Fix a set A ⊂ S and write
π(i) = Σj∈S π(j)pj,i = Σj∈A π(j)pj,i + Σj∈Ac π(j)pj,i.
Multiply π(i) by Σj∈S pi,j (which equals 1):
π(i) Σj∈A pi,j + π(i) Σj∈Ac pi,j = Σj∈A π(j)pj,i + Σj∈Ac π(j)pj,i.

Now split the left sum also:
Σj∈A π(i)pi,j + Σj∈Ac π(i)pi,j = Σj∈A π(j)pj,i + Σj∈Ac π(j)pj,i.
Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek, namely F(A, Ac) = F(Ac, A). Conversely, if the latter condition holds, then choose A = {i} and see that you obtain the balance equations.

[Schematic figure: a set A and its complement Ac, with the two flows F(A, Ac) and F(Ac, A) between them. Schematic representation of the flow balance relations.]

One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, the extra degree of freedom provided by the last result gives us flexibility; here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β (and, consequently, p11 = 1 − α, p22 = 1 − β). Assume 0 < α, β < 1. Taking A = {1} in Proposition 1, we have
π(1)α = π(2)β,
which immediately tells us that π(1) is proportional to β and π(2) proportional to α:
π(1) = cβ, π(2) = cα.
The constant c is found from π(1) + π(2) = 1. Thus,
π(1) = β/(α + β), π(2) = α/(α + β).
If we take α = 0.05, β = 0.99 (as in the mouse example), we find
π(1) = 0.99/1.04 ≈ 0.952, π(2) = 0.05/1.04 ≈ 0.048.
That is, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
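The same numbers drop out of a direct computation. A sketch checking πP = π for the two-state chain:

alpha, beta = 0.05, 0.99   # the mouse example

# Balance plus normalisation give pi(1) = beta/(alpha+beta), pi(2) = alpha/(alpha+beta):
pi = (beta / (alpha + beta), alpha / (alpha + beta))
print(pi)  # approximately (0.952, 0.048)

# Check stationarity: pi P = pi.
P = [[1 - alpha, alpha], [beta, 1 - beta]]
piP = (pi[0] * P[0][0] + pi[1] * P[1][0], pi[0] * P[0][1] + pi[1] * P[1][1])
assert all(abs(a - b) < 1e-12 for a, b in zip(pi, piP))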

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

[Diagram: states 0, 1, 2, 3, 4, ...; from each state i an arrow to i + 1 with probability p; from each state i ≥ 1 an arrow to i − 1 with probability q; at state 0 the walk stays put with probability q.]

The balance equations are:
π(i) = pπ(i−1) + qπ(i+1), i = 1, 2, ....
But we can make them simpler by choosing A = {0, 1, ..., i−1}. If i ≥ 1 we have
F(A, Ac) = π(i−1)p = π(i)q = F(Ac, A).
These are much simpler than the previous ones. In fact, we can solve them immediately: for i ≥ 1,
π(i) = (p/q)^i π(0).
Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^{∞} π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is a set of vertices and E a set of directed edges. In graph-theoretic terminology, S is the state space and E ⊂ S × S is defined by
(i, j) ∈ E ⇐⇒ pi,j > 0.
We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time with positive probability. In other words,
i leads to j (written as i ⇝ j) ⇐⇒ Pi(∃ n ≥ 0 Xn = j) > 0.
Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j,
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, ..., in−1,
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Then
0 < Pi(∃ n ≥ 0 Xn = j) ≤ Σn≥0 Pi(Xn = j).

Since the left-hand side is positive, the sum on the right must have a nonzero term, and so (iii) holds.
• Suppose that (ii) holds. We have
Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 · · · pin−1,j,   (6)
where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term, so it is positive. Hence Pi(Xn = j) > 0, and (iii) holds.
• Suppose now that (iii) holds. Since
Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j),
we see that (i) holds. Look at the display (6) again: since Pi(Xn = j) > 0, the sum is positive, and so one of its terms is positive, meaning that (ii) holds.

Corollary 1. The relation ⇝ is transitive: If i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation
i communicates with j (written as i ⟷ j) ⇐⇒ i ⇝ j and j ⇝ i.
Obviously, this relation is symmetric (i ⟷ j ⇐⇒ j ⟷ i). It is also reflexive and transitive, because ⇝ is so.

Corollary 2. The relation ⟷ is an equivalence relation.

Just as any equivalence relation, it partitions S into equivalence classes known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:
[i] := {j ∈ S : j ⟷ i}.
So, [i] = [j] if and only if i ⟷ j. Two communicating classes are either identical or completely disjoint.

[Figure: a chain on the six states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if
Σj∈C pi,j = 1, for all i ∈ C.
If a state i is such that pi,i = 1 then we say that i is an absorbing state; in graph terms, the single-element set {i} is then a closed communicating class.

Lemma 3. A communicating class C is closed if and only if the following holds: If i ∈ C and if i ⇝ j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ∈ C and i ⇝ j. We then can find states i1, ..., in−1 such that pi,i1 pi1,i2 · · · pin−1,j > 0. Since i ∈ C, pi,i1 > 0 and C is closed, we have i1 ∈ C. Similarly, pi1,i2 > 0, and so i2 ∈ C, and so on. Thus, we conclude that j ∈ C. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.

[Figure: the general structure of a Markov chain, indicated by communicating classes C1, C2, C3, C4, C5 with arrows between some of them.]

In the figure, the communicating classes C1, C2, C3 are not closed, while the classes C4, C5 are closed communicating classes. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction. A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them.

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i ⇝ j but j does not lead to i. (Why?)

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In other words, for an irreducible chain, the state space S is a single closed communicating class and, starting from any state, we can reach any other state.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes.
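Communicating classes can be found mechanically from the graph of the chain. A sketch (the edge set below is one plausible reading of the six-state figure, an assumption on my part; only which pi,j are positive matters):

# Edges (i, j) with p_ij > 0; a plausible reading of the six-state figure.
edges = {1: [2], 2: [3, 6], 3: [2, 4], 4: [5], 5: [4], 6: [6]}

def reaches(i):
    # All states j with i leading to j (graph search; i leads to itself).
    seen, stack = {i}, [i]
    while stack:
        for j in edges[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

R = {i: reaches(i) for i in edges}
classes = {frozenset(j for j in R[i] if i in R[j]) for i in edges}
for C in classes:
    closed = all(set(edges[i]) <= C for i in C)
    print(sorted(C), "closed" if closed else "not closed")

With these edges the program prints the four classes {1}, {2, 3}, {4, 5}, {6}, with the last two marked closed, as in the figure.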

5.4 Period

Consider the following chain:

[Figure: a cycle on four states, 1 → 2 → 3 → 4 → 1.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. The period of an essential state i is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:
d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.
A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.

Let us denote by pij^(n) the entries of P^n, and let Di = {n ∈ N : pii^(n) > 0}, so that d(i) = gcd Di. Note that whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. Another way to say this is: if the integer n divides m (denote this by: n | m) and if pii^(n) > 0, then pii^(m) > 0. Indeed, since P^(m+n) = P^m P^n, we have
pij^(m+n) ≥ pik^(m) pkj^(n), for all m, n ≥ 0 and all i, j, k ∈ S.
So pii^(2n) ≥ (pii^(n))² and, more generally,
pii^(ℓn) ≥ (pii^(n))^ℓ, ℓ ∈ N.
Therefore, if pii^(n) > 0 then pii^(ℓn) > 0 for all ℓ ∈ N.

Theorem 2 (the period is a class property). If i ⟷ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let
Di = {n ∈ N : pii^(n) > 0}, d(i) = gcd Di,
Dj = {n ∈ N : pjj^(n) > 0}, d(j) = gcd Dj.

Since i ⟷ j, there are α, β ∈ N with pij^(α) > 0 and pji^(β) > 0. So
pjj^(α+β) ≥ pji^(β) pij^(α) > 0,
showing that α + β ∈ Dj; therefore d(j) | α + β. Now let n ∈ Di. Then pii^(n) > 0 and so
pjj^(α+n+β) ≥ pji^(β) pii^(n) pij^(α) > 0,
showing that α + n + β ∈ Dj; therefore d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference. From the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Of particular interest are irreducible chains, i.e. chains where, as defined earlier, all states communicate with one another, which are moreover aperiodic.

Theorem 3. If i is an aperiodic state then there exists n0 such that pii^(n) > 0 for all n ≥ n0.

Proof. Pick n2, n1 ∈ Di such that n2 − n1 = 1 (this is possible because d(i) = 1). Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Therefore
n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.
Because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di, and Di is closed under addition, we have n ∈ Di as well.

It follows, in particular, that an irreducible chain with finitely many states is aperiodic if and only if there exists an n such that pij^(n) > 0 for all i, j ∈ S.

More generally, if an irreducible chain has period d then we can decompose the state space into d sets C0, C1, ..., Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, ..., Cd−1 such that
Σj∈Cr+1 pij = 1, i ∈ Cr, r = 0, 1, ..., d − 1.
(Here Cd := C0.)

[Figure: the internal structure of a closed communicating class with period d = 4; four groups of states arranged in a cycle, with all arrows out of one group leading to the next.]

Proof. Define the relation
i ~d j ⇐⇒ pij^(nd) > 0 for some n ≥ 0.
Notice that this is an equivalence relation. Hence it partitions the state space S into disjoint equivalence classes. We show that these classes satisfy what we need.

Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states
i0, i1, ..., id−1,
with corresponding classes
C0, C1, ..., Cd−1.
It is easy to see that if we continue and pick a further state id with pid−1,id > 0 then, necessarily, id ∈ C0.

We now show that if i belongs to one of these classes and if pij > 0 then, necessarily, j belongs to the next class. Suppose pij > 0 with, say, i ∈ C0, but that, contrary to the claim, j is not in C1; take, for example, j ∈ C2. Consider the path
i0 → i → j → i2 → i3 → · · · → id → i0.
Such a path is possible because of the choice of the i0, i1, ..., id, and by the assumptions that i0 ~d i and j ~d i2. The existence of such a path implies that it is possible to go from i0 back to i0 in a number of steps which is one less than an integer multiple of d (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by
TA := inf{n ≥ 0 : Xn ∈ A}.
We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

[Figure: states 0, 1, 2; states 0 and 2 each have a self-loop with probability 1; from state 1 there are arrows to 0, to 1 itself, and to 2, with probabilities 1/2, 1/3 and 1/6, respectively.]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∈ A the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define
ϕ(i) := Pi(TA < ∞).
We then have:

Theorem 5. The function ϕ(i) satisfies
ϕ(i) = 1, i ∈ A,
ϕ(i) = Σj∈S pij ϕ(j), i ∉ A.
Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + TA′, where TA′ is the remaining time until A is hit, a function of the future after time 1 (i.e. a function of X1, X2, ...). We first have
Pi(TA < ∞) = Σj∈S Pi(1 + TA′ < ∞ | X1 = j)Pi(X1 = j) = Σj∈S Pi(TA′ < ∞ | X1 = j)pij.
But observe that, conditionally on X1, the event {TA′ < ∞} is independent of X0. Hence:
P(TA′ < ∞ | X1 = j, X0 = i) = P(TA′ < ∞ | X1 = j).
But the Markov chain is homogeneous, which implies that
P(TA′ < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).
Combining the above, we have
Pi(TA < ∞) = Σj∈S ϕ(j)pij,
as needed.

For the second part of the proof, let ϕ′(i) be another solution. Then
ϕ′(i) = Σj∈A pij + Σj∉A pij ϕ′(j).
By self-feeding this equation, we have
ϕ′(i) = Σj∈A pij + Σj∉A pij [ Σk∈A pjk + Σk∉A pjk ϕ′(k) ]
      = Σj∈A pij + Σj∉A Σk∈A pij pjk + Σj∉A Σk∉A pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0. We let ϕ(i) = Pi(T0 < ∞). We immediately have ϕ(0) = 1, and ϕ(2) = 0 (if the chain starts from 2 it will never hit 0, obviously). As for ϕ(1), we have
ϕ(1) = (1/3)ϕ(1) + (1/2)ϕ(0),
so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time,
ψ(i) := Ei TA.
We have:

Theorem 6. The function ψ(i) satisfies
ψ(i) = 0, i ∈ A,
ψ(i) = 1 + Σj∉A pij ψ(j), i ∉ A.
Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0. If i ∉ A, then, as above, TA = 1 + TA′, and so
Ei TA = 1 + Ei TA′ = 1 + Σj∈S pij Ej TA = 1 + Σj∉A pij ψ(j),
which is the second of the equations (because ψ(j) = 0 for j ∈ A). The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}, and let ψ(i) = Ei TA. Start the chain from i = 1. Clearly, ψ(0) = ψ(2) = 0, while
ψ(1) = 1 + (1/3)ψ(1).

This gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3: the chain leaves state 1 at each step with probability 2/3, and the mean number of flips until the first success is 3/2.)

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities
ϕAB(i) = Pi(TA < TB).
It should be clear that the equations satisfied are:
ϕAB(i) = 1, i ∈ A,
ϕAB(i) = 0, i ∈ B,
ϕAB(i) = Σj pij ϕAB(j), otherwise.
Furthermore, ϕAB is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x a reward f(x) is earned. This happens up to the first hitting time TA of a set A. The total reward is f(X0) + f(X1) + · · · + f(XTA). We are interested in the mean total reward
h(x) := Ex Σ_{n=0}^{TA} f(Xn).
Clearly, h(x) = f(x) if x ∈ A, because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then TA = 1 + TA′, where TA′ is the remaining time until A is hit, after one step. Then, by the Markov property,
h(x) = Ex Σ_{n=0}^{1+TA′} f(Xn) = f(x) + Ex Σ_{n=1}^{1+TA′} f(Xn) = f(x) + Ex Σ_{n=0}^{TA′} f(Xn+1),
where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, .... Hence, as argued earlier,
Ex Σ_{n=0}^{TA′} f(Xn+1) = Σy∈S pxy h(y).
Thus, the set of equations satisfied by h are:
h(x) = f(x), x ∈ A,
h(x) = f(x) + Σy∈S pxy h(y), x ∉ A.
You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.
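First-step analysis always produces a linear system, so the computer can solve these equations directly. A sketch for the three-state example above (states 0, 1, 2 with 0 and 2 absorbing), recovering ϕ(1) = 3/4 and ψ(1) = 3/2:

import numpy as np

# From state 1: to 0 w.p. 1/2, stay w.p. 1/3, to 2 w.p. 1/6; 0 and 2 absorbing.
P = np.array([[1, 0, 0],
              [1/2, 1/3, 1/6],
              [0, 0, 1]])

# phi(i) = P_i(T_0 < infinity): phi(0) = 1, phi(2) = 0,
# and for state 1: -p_10 phi(0) + (1 - p_11) phi(1) - p_12 phi(2) = 0.
A = np.array([[1, 0, 0],
              [-P[1, 0], 1 - P[1, 1], -P[1, 2]],
              [0, 0, 1]])
phi = np.linalg.solve(A, np.array([1.0, 0.0, 0.0]))
print(phi[1])   # 0.75

# psi(i) = E_i T_A with A = {0, 2}: psi(0) = psi(2) = 0, psi(1) = 1 + p_11 psi(1).
psi1 = 1 / (1 - P[1, 1])
print(psi1)     # 1.5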

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds. He does so unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die in room 4. Why?)

[Figure: a house with four rooms arranged in a square, numbered 1, 2 (top row) and 4, 3 (bottom row); doors connect rooms 1-2, 2-3, 3-4 and 4-1, and the thief chooses a door at random.]

The question is to compute the average total reward if he starts from room 2. If we let h(i) be the average total reward if he starts from room i, the set of equations satisfied by h, as above, are
h(x) = 0, x ∈ A,
h(x) = f(x) + Σy∈S pxy h(y), x ∉ A,
with A = {4} and f(i) = i for i = 1, 2, 3. So we have
h(4) = 0,
h(1) = 1 + (1/2)h(2),
h(3) = 3 + (1/2)h(2),
h(2) = 2 + (1/2)h(1) + (1/2)h(3).
We solve and find h(2) = 8. (Also, h(1) = 5 and h(3) = 7.)

7 Gambler's ruin

Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. So, if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), your fortune after the n-th toss is
Xn = X0 + Σ_{k=1}^{n} Yk ξk.
The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss.

So, if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk. It can only depend on ξ1, ..., ξk−1 and–as is often the case in practice–on other considerations of, e.g., astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. To simplify things, we consider the simplest case, in which Yk = 1 for all k, i.e. the gambler bets £1 at each integer time. The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). She will stop playing if her money falls below a (a < x). We are clearly dealing with a Markov chain in the state space
{a, a + 1, a + 2, ..., b − 1, b}
with transition probabilities
pi,i+1 = p, pi,i−1 = q = 1 − p, a < i < b,
and where the states a, b are absorbing. (Think why!)

The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and, in addition, it will make enormous profits).

Let us consider a concrete problem. Consider the hitting time of state y:
Ty := inf{n ≥ 0 : Xn = y}.
We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x.

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:
Px(Tb < Ta) = ((q/p)^(x−a) − 1) / ((q/p)^(b−a) − 1), if p ≠ q,
Px(Tb < Ta) = (x − a)/(b − a), if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let
ϕ(x, ℓ) := Px(Tℓ < T0), 0 ≤ x ≤ ℓ.
The general solution will then be obtained from
Px(Tb < Ta) = ϕ(x − a, b − a).
(Think why!) To simplify things, we write ϕ(x) instead of ϕ(x, ℓ). We obviously have
ϕ(0) = 0, ϕ(ℓ) = 1,
and, by first-step analysis, ϕ satisfies the following equation for 0 < x < ℓ:

ϕ(x) = pϕ(x + 1) + qϕ(x − 1), 0 < x < ℓ.
Since p + q = 1, we can write this as
p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].
Let δ(x) := ϕ(x + 1) − ϕ(x). We have
δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0).
Assuming that λ := q/p ≠ 1, we find
ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^(x−1) + · · · + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).
Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find
ϕ(x) = (λ^x − 1)/(λ^ℓ − 1), 0 ≤ x ≤ ℓ, λ ≠ 1.
If λ = 1, then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so
ϕ(x) = x/ℓ, 0 ≤ x ≤ ℓ, λ = 1.
Summarising, we have obtained
ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1; x/ℓ, if λ = 1.
The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is
Px(T0 < ∞) = (q/p)^x, if p > 1/2; 1, if p ≤ 1/2.

Proof. We have
Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).
If p > 1/2, then λ = q/p < 1, and so
Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.
If p < 1/2, then λ = q/p > 1, and so
Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.
Finally, if p = q = 1/2, then
Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
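The ruin formula is easy to check against a simulation. A sketch (the values of x, p and the truncation horizon are my choices; the horizon is harmless here because, for p > 1/2, a walk that has not been ruined for a long time has almost surely escaped to +∞):

import random

def ruin_prob_formula(x, p):
    # P_x(T_0 < infinity) for the gambler betting 1 pound per round.
    q = 1 - p
    return (q / p) ** x if p > 1/2 else 1.0

def ruin_prob_simulated(x, p, trials=20000, horizon=10000):
    ruined = 0
    for _ in range(trials):
        fortune = x
        for _ in range(horizon):
            fortune += 1 if random.random() < p else -1
            if fortune == 0:
                ruined += 1
                break
    return ruined / trials

x, p = 5, 0.6
print(ruin_prob_formula(x, p))     # (2/3)^5 = 0.1317...
print(ruin_prob_simulated(x, p))   # close to the above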

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, ..., Xn) for all n ∈ Z+; in words, whether or not τ = n can be decided by looking at the trajectory of the chain up to time n. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any deterministic time n, the future process (Xn+1, Xn+2, ...) is independent of the past process (X0, ..., Xn−1) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain and let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, ..., Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, ..., Xτ) if τ < ∞. Since
(X0, ..., Xτ)1(τ < ∞) = Σ_{n∈Z+} (X0, ..., Xn)1(τ = n),
any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, ..., Xn). We are going to show that, for any such A,
P(Xτ+1 = j1, Xτ+2 = j2, ..., A, Xτ = i, τ < ∞) = pi,j1 pj1,j2 · · · P(Xτ = i, A, τ < ∞),
which yields both that
P(Xτ+1 = j1, Xτ+2 = j2, ... | Xτ = i, A, τ < ∞) = pi,j1 pj1,j2 · · ·
and that the future is independent of A, conditionally on Xτ = i and τ < ∞. We have:
P(Xτ+1 = j1, Xτ+2 = j2, ..., A, Xτ = i, τ = n)
(a) = P(Xn+1 = j1, Xn+2 = j2, ..., A, Xn = i, τ = n)
(b) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
(c) = P(Xn+1 = j1, Xn+2 = j2, ... | Xn = i) P(Xn = i, A, τ = n)
(d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),
where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, ..., Xn), together with the ordinary splitting (Markov) property at n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, A, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:
Ti := inf{n ≥ 1 : Xn = i}.
Note the subtle difference between this Ti and the Ti of Section 6: the Ti of Section 6 was allowed to take the value 0, while our Ti here is always positive. If X0 = i then the two definitions coincide. (And we set Ti^(1) := Ti.) We let Ti^(2) be the time of the second visit, Ti^(3) the time of the third visit, and so on. Formally, we recursively define
Ti^(r) := inf{n > Ti^(r−1) : Xn = i}, r = 2, 3, ....
All these random times are stopping times. (Why?)

State i may or may not be visited again. Regardless, consider the "trajectory" of the chain between two successive visits to the state i:
Xi^(r) := (Xn, Ti^(r) ≤ n < Ti^(r+1)), r = 1, 2, ....
Let also
Xi^(0) := (Xn, 0 ≤ n < Ti^(1)).
We refer to them as excursions of the process. They are, in fact, random trajectories, i.e. "random variables" in a generalised sense.

[Figure: a sample path of the chain, cut at the successive visit times Ti^(1), Ti^(2), Ti^(3), Ti^(4) to state i; the pieces between the cuts are the excursions. These excursions are independent!]

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, . . . are independent random variables. Moreover, for time-homogeneous Markov chains, all, except, possibly, X_i^{(0)}, have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at T_i^{(r)}: SMP says that, given the state at the stopping time T_i^{(r)}, the future is independent of the past. But the state at T_i^{(r)} is, by definition, equal to i. Therefore the future after T_i^{(r)} is independent of the past before T_i^{(r)}, and this proves the first claim. The claim that X_i^{(1)}, X_i^{(2)}, . . . have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(X_i^{(0)}), g(X_i^{(1)}), g(X_i^{(2)}), . . . of the excursions are independent random variables.

Example: Define
Λ_r := Σ_{T_i^{(r)} ≤ n < T_i^{(r+1)}} 1(X_n = j).
The meaning of Λ_r is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ_1, Λ_2, . . . are independent random variables; they are also identical in law (i.i.d.).

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions.

First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:
P_i(X_n = i for infinitely many n) = 1.
Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:
P_i(X_n = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:
0 < P_i(X_n = i for infinitely many n) < 1.
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Notice that f_ii := P_i(T_i < ∞) can either be equal to 1 or smaller than 1. Consider the total number of visits to state i (excluding the possibility that X_0 = i):
J_i = Σ_{n=1}^∞ 1(X_n = i) = sup{r ≥ 0 : T_i^{(r)} < ∞}.
Notice that: state i is visited infinitely many times ⟺ J_i = ∞. Notice also that
f_ii = P_i(J_i ≥ 1) = P_i(T_i < ∞).
We claim that:

Lemma 4. Starting from X_0 = i, the random variable J_i is geometric:
P_i(J_i ≥ k) = f_ii^k,  k ≥ 0.

Proof. Indeed, the event {J_i ≥ k} implies that T_i^{(k−1)} < ∞ and that, after time T_i^{(k−1)}, the chain visits i at least once more. By the strong Markov property at time T_i^{(k−1)},
P_i(J_i ≥ k) = P_i(J_i ≥ k − 1) P_i(J_i ≥ 1);
the past before T_i^{(k−1)} is independent of the future, and that is why we get the product. Therefore,
P_i(J_i ≥ k) = f_ii P_i(J_i ≥ k − 1) = f_ii^2 P_i(J_i ≥ k − 2) = · · · = f_ii^k.

Because either f_ii = 1 or f_ii < 1, we conclude what we asserted earlier, namely that every state is either recurrent or transient. Indeed, if f_ii = 1, we have that P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1; this means that the state i is recurrent. If f_ii < 1, then P_i(J_i < ∞) = 1; this means that the state i is transient.
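A quick Monte Carlo check of Lemma 4, on an illustrative two-state chain of my own choosing: take state 0 with p_00 = q and p_01 = 1 − q, and make state 1 absorbing, so that f_00 = q; the empirical tail of J_0 should then be close to q^k.

import random

q = 0.6  # assumed value, for illustration; f_00 = q for this chain

def sample_J0():
    # Count the visits to state 0 at times n >= 1; once the chain jumps to
    # the absorbing state 1 it never returns, so J_0 is finite a.s.
    j = 0
    while random.random() < q:  # each further return happens with probability q
        j += 1
    return j

N = 100_000
samples = [sample_J0() for _ in range(N)]
for k in range(5):
    print(k, sum(s >= k for s in samples) / N, q**k)  # P(J_0 >= k) vs q^k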

We summarise this, together with another useful characterisation of recurrence & transience, in the following two lemmas.

Lemma 5. Suppose that the Markov chain starts from state i. The following are equivalent:
1. State i is recurrent.
2. f_ii = P_i(T_i < ∞) = 1.
3. P_i(J_i = ∞) = 1.
4. E_i J_i = ∞.

Proof. The equivalence of the first two statements was proved before the lemma. The equivalence of the first and the third is by the definition of J_i: state i is visited infinitely many times if and only if J_i = ∞. Finally, if P_i(J_i = ∞) = 1 then clearly E_i J_i = ∞; conversely, since J_i is a geometric random variable, its expectation cannot be infinite without J_i itself being infinite with probability one.

Lemma 6. Suppose that the Markov chain starts from state i. The following are equivalent:
1. State i is transient.
2. f_ii = P_i(T_i < ∞) < 1.
3. P_i(J_i < ∞) = 1.
4. E_i J_i < ∞.

Proof. The equivalence of the first two statements was proved above. If f_ii < 1 then, since J_i is a geometric random variable, E_i J_i < ∞. If E_i J_i < ∞ then J_i cannot take the value +∞ with positive probability, i.e. P_i(J_i < ∞) = 1. Finally, P_i(J_i < ∞) = 1 means, by the definition of J_i, that i is visited finitely many times with probability 1, i.e. that i is transient.

Corollary 6. If i is a transient state then p_ii^{(n)} → 0, as n → ∞.

Proof. If i is transient then E_i J_i < ∞. But
E_i J_i = E_i Σ_{n=1}^∞ 1(X_n = i) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ p_ii^{(n)},
and if a series is finite then its summands go to 0: p_ii^{(n)} → 0.

10.1 First hitting time decomposition

Define
f_ij^{(n)} := P_i(T_j = n),  n ≥ 1,
the probability to hit state j for the first time at time n ≥ 1, assuming that the chain starts from state i. Notice the difference with p_ij^{(n)}: the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, f_ij^{(n)} ≤ p_ij^{(n)}, but the two sequences are related in an exact way:

Lemma 7.
p_ij^{(n)} = Σ_{m=1}^n f_ij^{(m)} p_jj^{(n−m)},  n ≥ 1.

Proof. From Elementary Probability,
p_ij^{(n)} = P_i(X_n = j) = Σ_{m=1}^n P_i(T_j = m, X_n = j) = Σ_{m=1}^n P_i(T_j = m) P_i(X_n = j | T_j = m).

The first term equals f_ij^{(m)}, by definition. For the second term, we use the Strong Markov Property:
P_i(X_n = j | T_j = m) = P_i(X_n = j | X_m = j) = p_jj^{(n−m)}.

Corollary 7. If j is a transient state then p_ij^{(n)} → 0, as n → ∞, for all states i.

Proof. If j is transient, we have Σ_{n=1}^∞ p_jj^{(n)} = E_j J_j < ∞. Now, by summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right-hand side)
Σ_{n=1}^∞ p_ij^{(n)} = Σ_{m=1}^∞ f_ij^{(m)} Σ_{n=m}^∞ p_jj^{(n−m)} = (1 + E_j J_j) Σ_{m=1}^∞ f_ij^{(m)} = (1 + E_j J_j) P_i(T_j < ∞),
which is finite. Hence p_ij^{(n)} → 0, as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let f_ij := P_i(T_j < ∞), we have
i ⇝ j ⟺ f_ij > 0.
We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and if i ⇝ j, then j is recurrent and
f_ij = P_i(T_j < ∞) = 1.
In particular, j ⇝ i, and g_ij := P_i(T_j < T_i) > 0.

Proof. Start with X_0 = i. If i is recurrent and i ⇝ j, then there is a positive probability (= f_ij) that j will appear in one of the i.i.d. excursions X_i^{(0)}, X_i^{(1)}, . . ., and so the probability that j will appear in a specific excursion is positive:
g_ij = P_i(T_j < T_i) > 0.
So the random variables
δ_{j,r} := 1(j appears in excursion X_i^{(r)}),  r = 0, 1, . . .,
are i.i.d., and, since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence, not only f_ij > 0, but also f_ij = 1. Since, moreover, the chain starting from i visits j infinitely often with probability one, the strong Markov property at T_j shows that P_j(X_n = j for infinitely many n) = 1, i.e. j is recurrent; and, because the chain returns to i after each visit to j, we also have j ⇝ i. The last statement of the theorem is simply an expression in symbols of what we said above.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Proof. If i is recurrent, then every other j in the same communicating class satisfies i ⇝ j, and hence, by the above theorem, every other j is also recurrent. If a communicating class contains a state which is transient then, necessarily, all other states are also transient (if one of them were recurrent, the first part would make every state of the class, including the given one, recurrent).

So, if a communicating class is recurrent, then it contains no inessential states, and so it is closed.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ̸⇝ i. Let B := {X_n = i for infinitely many n}. Since i ⇝ j, there is m such that the event A := {X_m = j} has P_i(A) = P_i(X_m = j) > 0. But P_i(A ∩ B) = P_i(X_m = j, X_n = i for infinitely many n) = 0, because j ̸⇝ i. Hence
0 < P_i(A) = P_i(A ∩ B) + P_i(A ∩ B^c) = P_i(A ∩ B^c) ≤ P_i(B^c),
i.e. P_i(B) = P_i(X_n = i for infinitely many n) < 1. But we have shown that P_i(B) is either 0 or 1; hence, necessarily, P_i(B) = 0, and so i is a transient state.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i must be visited infinitely many times. This state is recurrent, and hence, by Corollary 8, all states in C are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let T_i be the first return time to i, then either P_i(T_i < ∞) = 1 (in which case we say that i is recurrent) or P_i(T_i < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U_0, U_1, U_2, . . . be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. (It should be clear that such variables can, for example, appear in a certain game.) Let
T := inf{n ≥ 1 : U_n > U_0}.
Thus T represents the number of variables we need to draw to get a value larger than the initial one. What is the expectation of T? First, observe that T is finite:
P(T > n) = P(U_1 ≤ U_0, . . . , U_n ≤ U_0) = ∫_0^1 P(U_1 ≤ x, . . . , U_n ≤ x | U_0 = x) dx = ∫_0^1 x^n dx = 1/(n + 1).
Therefore,
P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n + 1)) = 1.
But
E T = Σ_{n≥0} P(T > n) ≥ Σ_{n≥1} 1/(n + 1) = ∞.
You are invited to think about the meaning of this: if the initial value U_0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!
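A small simulation (helper names are mine) confirming both facts: the tail P(T > n) matches 1/(n + 1), while the sample mean of T keeps drifting upwards with the sample size, as one expects of a finite random variable with infinite expectation.

import random

def draw_T():
    # T = inf{n >= 1 : U_n > U_0} for i.i.d. uniforms on [0, 1].
    u0, n = random.random(), 0
    while True:
        n += 1
        if random.random() > u0:
            return n

N = 200_000
samples = [draw_T() for _ in range(N)]
for n in (1, 4, 9, 99):
    print(n, sum(t > n for t in samples) / N, 1 / (n + 1))  # P(T > n) vs 1/(n+1)
# Every sample is finite, yet the sample mean grows without bound as N grows,
# reflecting E T = infinity.
print(sum(samples) / N)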

We now classify a recurrent state i as follows. We say that i is positive recurrent if
E_i T_i < ∞.
We say that i is null recurrent if
E_i T_i = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading: it is not a Law; it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z_1, Z_2, . . . be i.i.d. random variables.
(i) Suppose E|Z_1| < ∞ and let µ = EZ_1. Then
P( lim_{n→∞} (Z_1 + · · · + Z_n)/n = µ ) = 1.
(ii) Suppose E max(Z_1, 0) = ∞, but E min(Z_1, 0) > −∞. Then
P( lim_{n→∞} (Z_1 + · · · + Z_n)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X_0, X_1, X_2, . . . of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions
X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, . . .
forms an independent sequence of (generalised) random variables.

Example 1: To each state x ∈ S assign a reward f(x), and define G(X) to be the maximum reward received during an excursion X:
G(X_a^{(r)}) = max{ f(X_t) : T_a^{(r)} ≤ t < T_a^{(r+1)} }.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:
G(X_a^{(r)}) = Σ_{T_a^{(r)} ≤ t < T_a^{(r+1)}} f(X_t).  (7)

Whichever the choice of G: since functions of independent random variables are also independent, we have that
G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), . . .
are i.i.d. random variables, for reasonable choices of the functions G taking values in R. Note that
E G(X_a^{(1)}) = E G(X_a^{(2)}) = · · · = E[ G(X_a^{(0)}) | X_0 = a ] = E_a G(X_a^{(0)}).
The final equality in the last display is merely our notation for the expectation of a random variable conditional on X_0 = a; it holds because, if we start with X_0 = a, the initial excursion X_a^{(0)} also has the same law as the rest of them. Thus, with probability 1, the law of large numbers tells us that
( G(X_a^{(1)}) + · · · + G(X_a^{(n)}) ) / n → E G(X_a^{(1)}),  as n → ∞.  (8)

13.2 Average reward

Consider now a function f : S → R, which we can think of as a reward received when the chain is in state x. We are interested in the existence of the limit
time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(X_k),
which represents the "time-average reward" received. Specifically, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a T_a < ∞) and assume that the distribution of X_0 is such that P(T_a < ∞) = 1. Let f : S → R+ be a bounded "reward" function. Then, with probability one, the "time-average reward" f̄ exists and is a deterministic quantity:
P( lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = f̄ ) = 1,
where
f̄ = E_a[ Σ_{n=0}^{T_a−1} f(X_n) ] / E_a T_a = Σ_{x∈S} f(x) π(x).
We shall present the proof in a way that the constant f̄ will reveal itself; it will be discovered.

Proof. The idea is this. Suppose that t is large. We are looking for the total reward received between 0 and t. We break it into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, N_t say, of complete excursions from the ground state a that occur between 0 and t:
N_t := max{r ≥ 1 : T_a^{(r)} ≤ t}.
For example, in the figure below, N_t = 5.

[Figure: the times T_a^{(1)}, . . . , T_a^{(5)} on the interval [0, t]; five complete excursions before t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G^first + G^last,
where G_r is given by (7), G^first is the reward up to time T_a = T_a^{(1)} or t (whichever is smaller), and G^last is the reward over the last incomplete cycle.

We assumed that f is bounded, say |f(x)| ≤ B. Hence |G^first| ≤ B T_a. Since P(T_a < ∞) = 1, we have B T_a/t → 0, and so
(1/t) G^first → 0,  as t → ∞,
with probability one. A similar argument shows that
(1/t) G^last → 0,  as t → ∞,
with probability one.

So we are left with the sum of the rewards over complete cycles. Write now
(1/t) Σ_{r=1}^{N_t−1} G_r = (N_t/t) · (1/N_t) Σ_{r=1}^{N_t−1} G_r.  (9)
Look again at what the Law of Large Numbers (8) tells us:
lim_{n→∞} (1/n) Σ_{r=1}^n G_r = E G_1,  with probability one.  (10)
Since T_a^{(r)} → ∞ as r → ∞, it follows that the number N_t of complete cycles before time t will also be tending to ∞, as t → ∞. Hence, replacing n by the subsequence N_t in (10),
lim_{t→∞} (1/N_t) Σ_{r=1}^{N_t−1} G_r = E G_1,  with probability one.
Now look at the sequence
T_a^{(r)} = T_a^{(1)} + Σ_{k=2}^r ( T_a^{(k)} − T_a^{(k−1)} ),
and see what the Law of Large Numbers says again: the differences T_a^{(k)} − T_a^{(k−1)}, k = 2, 3, . . ., are i.i.d. random variables with common expectation E_a T_a, and, since T_a^{(1)}/r → 0, we have
lim_{r→∞} T_a^{(r)}/r = E_a T_a.  (11)
By definition, the visit with index r = N_t to the ground state a is the last one before t (see also the figure above):
T_a^{(N_t)} ≤ t < T_a^{(N_t+1)}.
Hence
T_a^{(N_t)}/N_t ≤ t/N_t < T_a^{(N_t+1)}/N_t.
But (11) tells us that
lim_{t→∞} T_a^{(N_t)}/N_t = lim_{t→∞} T_a^{(N_t+1)}/N_t = E_a T_a.
Hence
lim_{t→∞} N_t/t = 1/E_a T_a.  (12)
Using (10) and (12) in (9), we have
lim_{t→∞} (1/t) Σ_{r=1}^{N_t−1} G_r = E G_1 / E_a T_a.
Thus
lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = E G_1 / E_a T_a =: f̄.

To see that f̄ is as claimed, just observe that
E G_1 = E Σ_{T_a^{(1)} ≤ t < T_a^{(2)}} f(X_t) = E_a Σ_{0 ≤ t < T_a} f(X_t),
from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion: the numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included), while the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both: we may write
E G_1 = E_a Σ_{0 ≤ t < T_a} f(X_t)  or  E G_1 = E_a Σ_{0 < t ≤ T_a} f(X_t).

[Figure: the rewards f(X_0) = f(a), f(X_1), f(X_2), . . . , f(X_{T_a−1}) collected over one excursion from a.]

Comment 2: A very important particular case occurs when we choose
f(x) := 1(x = b),
i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, with probability 1,
lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = b) = E_a[ Σ_{k=0}^{T_a−1} 1(X_k = b) ] / E_a T_a.  (13)
The numerator here is the average number of times state b is visited between two successive visits to state a.

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence, with probability 1,
lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = a) = 1/E_a T_a.  (14)
The physical interpretation of the latter should be clear: if, on the average, it takes E_a T_a units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/E_a T_a.

Comment 4: Suppose that two states a, b communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b T_b. The equality of these two limits gives:
E_a[ Σ_{k=0}^{T_a−1} 1(X_k = b) ] / E_a T_a = 1/E_b T_b.
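Theorem 14 and formula (14) are easy to see numerically. Below is a sketch on an illustrative birth-death chain of my own (for which the detailed balance equations, discussed later, give π = (1/4, 1/2, 1/4) exactly, so Σ_x f(x)π(x) = 0.75 and 1/E_0 T_0 = π(0) = 0.25).

import random

P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]  # illustrative chain
f = [1.0, 0.0, 2.0]   # an arbitrary bounded reward
a = 0                 # ground state

def step(x):
    u, c = random.random(), 0.0
    for j, p in enumerate(P[x]):
        c += p
        if u < c:
            return j
    return len(P[x]) - 1

T = 500_000
x, total, visits_a = a, 0.0, 0
for _ in range(T):
    x = step(x)
    total += f[x]
    visits_a += (x == a)
print(total / T)     # time-average reward: ~0.75 = sum_x f(x) pi(x)
print(visits_a / T)  # long-run fraction of visits to a: ~0.25 = 1/E_a T_a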

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
ν^{[a]}(x) := E_a Σ_{n=0}^{T_a−1} 1(X_n = x),  x ∈ S,  (15)
where a is the fixed recurrent ground state whose existence was assumed, should play an important role. Indeed, we saw (Comment 2 above) that, if a is a positive recurrent state, then
ν^{[a]}(x) / E_a T_a
is the long-run fraction of times that state x is visited. In view of this, ν^{[a]} should have something to do with the concept of stationary distribution discussed a few sections ago. Note: Although ν^{[a]} depends on the choice of the "ground" state a (the choice of notation ν^{[a]} is justifiable through the notation (6) for the communicating class of the state a), we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent⁵. Consider the function ν^{[a]} defined in (15). Then ν^{[a]} satisfies the balance equations:
ν^{[a]} = ν^{[a]} P,
i.e.
ν^{[a]}(x) = Σ_{y∈S} ν^{[a]}(y) p_yx,  x ∈ S.
Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^{[a]}(x) < ∞.

⁵ We stress that we do not need the stronger assumption of positive recurrence here.

Proof. We start the Markov chain with X_0 = a. Recurrence tells us that T_a < ∞ with probability one. Since a = X_{T_a}, we can write
ν^{[a]}(x) = E_a Σ_{n=1}^{T_a} 1(X_n = x).  (16)
This was a trivial but important step: we replaced the n = 0 term with the n = T_a term. Both of them equal 1(x = a), so there is nothing lost in doing so. The real reason for doing so is that we wanted to replace a function of X_0, X_1, X_2, . . . by a function of X_1, X_2, X_3, . . . and thus be in a position to apply first-step analysis,

which is precisely what we do next:
ν^{[a]}(x) = E_a Σ_{n=1}^∞ 1(X_n = x, n ≤ T_a)
= Σ_{n=1}^∞ P_a(X_n = x, n ≤ T_a)
= Σ_{n=1}^∞ Σ_{y∈S} P_a(X_n = x, X_{n−1} = y, n ≤ T_a)
= Σ_{n=1}^∞ Σ_{y∈S} p_yx P_a(X_{n−1} = y, n ≤ T_a)
= Σ_{y∈S} p_yx E_a Σ_{n=1}^{T_a} 1(X_{n−1} = y)
= Σ_{y∈S} p_yx ν^{[a]}(y),
where the first equality is (16) and, in the last step, we changed the summation variable and used the definition (15) of ν^{[a]}. So we have shown
ν^{[a]}(x) = Σ_{y∈S} p_yx ν^{[a]}(y),
which are precisely the balance equations.

Next, let x be such that a ⇝ x. First note that, by definition, ν^{[a]}(a) = 1. From ν^{[a]} = ν^{[a]}P we have ν^{[a]} = ν^{[a]}P^n for all n ∈ N. Since a ⇝ x, there exists n ≥ 1 such that p_ax^{(n)} > 0; hence
ν^{[a]}(x) ≥ ν^{[a]}(a) p_ax^{(n)} = p_ax^{(n)} > 0.
From Theorem 10, x ⇝ a; so there exists m ≥ 1 such that p_xa^{(m)} > 0. Hence
1 = ν^{[a]}(a) ≥ ν^{[a]}(x) p_xa^{(m)},
so that ν^{[a]}(x) ≤ 1/p_xa^{(m)} < ∞. Thus, indeed, 0 < ν^{[a]}(x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then
Σ_{x∈S} ν^{[a]}(x) = E_a Σ_{n=1}^{T_a} Σ_{x∈S} 1(X_n = x) = E_a Σ_{n=1}^{T_a} 1 = E_a T_a,
which is finite when a is positive recurrent.
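Before stating the corollary, here is a simulation sketch (same illustrative chain as before; all names are mine) estimating ν^{[a]}(x) as the average number of visits to x per excursion from a, and checking both the balance property and the normalisation Σ_x ν^{[a]}(x) = E_a T_a. For this particular chain the exact values are ν^{[a]} = (1, 2, 1) and E_a T_a = 4.

import random

P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]  # illustrative chain
a = 0

def step(x):
    u, c = random.random(), 0.0
    for j, p in enumerate(P[x]):
        c += p
        if u < c:
            return j
    return len(P[x]) - 1

def one_cycle():
    # Visits to each state during one excursion (times 0 <= n < T_a).
    counts, x, n = [0, 0, 0], a, 0
    while True:
        counts[x] += 1
        x = step(x)
        n += 1
        if x == a:
            return counts, n   # n = T_a for this cycle

R = 100_000
nu, ETa = [0.0, 0.0, 0.0], 0.0
for _ in range(R):
    counts, Ta = one_cycle()
    nu = [v + c for v, c in zip(nu, counts)]
    ETa += Ta
nu = [v / R for v in nu]
print(nu)                # estimate of nu^[a], ~ (1, 2, 1)
print(sum(nu), ETa / R)  # both ~ E_a T_a = 4
print([sum(nu[y] * P[y][x] for y in range(3)) for x in range(3)])  # ~ nu P = nu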

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define
π^{[a]}(x) := ν^{[a]}(x)/E_a T_a = E_a[ Σ_{n=0}^{T_a−1} 1(X_n = x) ] / E_a T_a,  x ∈ S.  (17)
Then π^{[a]} is a stationary distribution.

Proof. We have already shown, in Proposition 2, that ν^{[a]}P = ν^{[a]}. But π^{[a]} is simply a multiple of ν^{[a]}; therefore π^{[a]}P = π^{[a]}. It only remains to show that π^{[a]} is a probability, i.e. that it sums up to one. Since a is positive recurrent, we have E_a T_a < ∞, and
Σ_{x∈S} π^{[a]}(x) = (1/E_a T_a) Σ_{x∈S} ν^{[a]}(x) = 1.
Also, π^{[a]}(a) = 1/E_a T_a > 0, and so π^{[a]} is not trivially equal to zero.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! In other words, if there is a positive recurrent state, then the set of linear equations
π(x) = Σ_{y∈S} π(y) p_yx, x ∈ S,  Σ_{y∈S} π(y) = 1,
has at least one solution. We DID NOT claim that this distribution is, in general, unique. This is what we will try to establish in the sequel.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state, and suppose that i ⇝ j. Then j is positive recurrent.

Proof. We already know that j is recurrent. Start with X_0 = i. Consider the excursions X_i^{(r)}, r = 0, 1, 2, . . ., and recall that the probability g_ij = P_i(T_j < T_i) that j occurs in a specific excursion is positive. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads g_ij > 0. If heads show up at the r-th excursion, then the r-th excursion contains state j; otherwise, if tails show up, then the r-th excursion does not contain state j. Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion: the coins are i.i.d. Hence j will occur in infinitely many excursions. Let R_1 (respectively, R_2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the first two appearances of j, in excursions R_1 and R_2; the time between these two visits to j has the same law as T_j under P_j.]

We just need to show that the sum of the durations of the excursions with indices R_1, . . . , R_2, i.e.
ζ := Σ_{r=1}^{R_2−R_1+1} Z_r,
has finite expectation, where the Z_r are i.i.d. with the same law as T_i under P_i; indeed, the time between the two successive visits to j is bounded by ζ, so that E_j T_j ≤ E_i ζ. By Wald's lemma, the expectation of this random variable equals
E_i ζ = (E_i T_i) × E_i(R_2 − R_1 + 1) = (E_i T_i) × (g_ij^{−1} + 1),
because R_2 − R_1 is a geometric random variable:
P_i(R_2 − R_1 = r) = (1 − g_ij)^{r−1} g_ij,  r = 1, 2, . . .
Since g_ij > 0 and E_i T_i < ∞, we have that E_i ζ < ∞. Hence E_j T_j < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent; transient if one (and hence all) states are transient; positive recurrent if one (and hence all) states are positive recurrent; and null recurrent if one (and hence all) states are null recurrent. Schematically:

IRREDUCIBLE CHAIN: either RECURRENT (and then either POSITIVE RECURRENT or NULL RECURRENT) or TRANSIENT.

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we have seen before:

[Figure: states 0, 1, 2, with p_00 = 1/2, p_01 = 1/3, p_02 = 1/6, and p_11 = p_22 = 1.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons: it is, after all, an absorbing state. (You should think why this is OBVIOUS!) Notice that 1 ̸⇝ 2; this lack of communication is, indeed, what prohibits uniqueness. And observe that there are many, in fact infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by
π(0) = 0, π(1) = α, π(2) = 1 − α,
is a stationary distribution.
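A two-line numerical check of this one-parameter family (using the transition values read off the figure above) confirms that every member satisfies the balance equations:

import numpy as np

P = np.array([[1/2, 1/3, 1/6],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
for alpha in (0.0, 0.3, 1.0):
    pi = np.array([0.0, alpha, 1 - alpha])
    print(alpha, np.allclose(pi @ P, pi))  # True: pi P = pi for every alpha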

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution. Fix a ground state a. Since the chain is irreducible and positive recurrent, we have P_i(T_j < ∞) = 1 for all states i and j. Hence
P(T_a < ∞) = Σ_{x∈S} π(x) P_x(T_a < ∞) = Σ_{x∈S} π(x) = 1.
In other words, if we start the Markov chain from the stationary distribution π, then T_a < ∞ with probability one. Consider the Markov chain (X_n, n ≥ 0) and suppose that P(X_0 = x) = π(x), x ∈ S. Then, by stationarity, P(X_n = x) = π(x) for all n. Following Theorem 14 (see (13)), we have, with probability one,
(1/t) Σ_{n=1}^t 1(X_n = x) → E_a[ Σ_{k=0}^{T_a−1} 1(X_k = x) ] / E_a T_a = π^{[a]}(x),  (18)
as t → ∞, for all x ∈ S. By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides:
E[ (1/t) Σ_{n=1}^t 1(X_n = x) ] → π^{[a]}(x), as t → ∞.
But, by stationarity,
E[ (1/t) Σ_{n=1}^t 1(X_n = x) ] = (1/t) Σ_{n=1}^t P(X_n = x) = π(x).
Therefore π(x) = π^{[a]}(x), for all x ∈ S. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π^{[a]} = π^{[b]}.

Proof. Both π^{[a]} and π^{[b]} are stationary distributions; by the above theorem there is only one stationary distribution, so they are equal.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,
π(a) = 1/E_a T_a.

Proof. By the above corollary, an arbitrary stationary distribution π must be equal to the specific one π^{[a]}. In particular, from the definition of π^{[a]},
π(a) = π^{[a]}(a) = ν^{[a]}(a)/E_a T_a = 1/E_a T_a.

In particular, for any two states a, b we obtain the cycle formula:
E_a[ Σ_{k=0}^{T_a−1} 1(X_k = x) ] / E_a T_a = E_b[ Σ_{k=0}^{T_b−1} 1(X_k = x) ] / E_b T_b,  x ∈ S.
The cycle formula merely re-expresses the equality π^{[a]} = π^{[b]}.

In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider a Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C, i.e. delete all transition probabilities p_xy with x ∈ C, y ∉ C. We can easily see that we thus obtain a Markov chain with state space C. Let π(·|C) be the distribution defined by restricting π to C:
π(x|C) := π(x)/π(C),  x ∈ C,
where π(C) = Σ_{y∈C} π(y). One checks that π(·|C) is a stationary distribution for the restricted chain; it is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. But π(i|C) > 0. Therefore, applying Corollary 12 to the restricted chain,
π(i|C) = 1/E_i T_i.
Hence E_i T_i < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider an arbitrary Markov chain and decompose it into communicating classes. To each positive recurrent communicating class C there corresponds a unique stationary distribution π^C,

which assigns positive probability to each state in C. It is defined by picking a state (any state!) a ∈ C and letting π^C := π^{[a]}.

Proposition 3. Any stationary distribution π is necessarily of the form
π = Σ_{C: C positive recurrent communicating class} α_C π^C,
where α_C ≥ 0, with Σ_C α_C = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then x belongs to some positive recurrent class C. For each such C, define the conditional distribution
π(x|C) := π(x)/π(C),  where π(C) := Σ_{x∈C} π(x).
Then (π(x|C), x ∈ C) is a stationary distribution. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with α_C = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(X_n = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesàro averages (1/n) Σ_{k=1}^n P(X_k = x) converge to π(x), as n → ∞. A sequence of real numbers a_n may fail to converge while its Cesàro averages (a_1 + · · · + a_n)/n do converge; for example, take a_n = (−1)^n.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. In this case we may say it is stable because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:
ẋ = −ax + b,  x ∈ R, a > 0.
There are myriads of applications giving physical interpretation to this. The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero:
x* = b/a.
If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. Moreover, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: solution curves of ẋ = −ax + b, all converging to the equilibrium x* = b/a.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous-time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
X_{n+1} = ρ X_n + ξ_n.
Here we assume that X_0, ξ_0, ξ_1, ξ_2, . . . are given, mutually independent, random variables, and we define a stochastic sequence X_1, X_2, . . . by means of the recursion above. We shall assume that the ξ_n have the same law. (There are myriads of interpretations of this recursion, an example in finance, or an example in physics, or whatever; I will leave it upon the reader to decide whether he or she wants to consider this as a financial example.)

But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point, simply because the ξ_n's depend on the time parameter n. And we are in a happy situation here, for we can solve the recursion:
X_1 = ρX_0 + ξ_0
X_2 = ρX_1 + ξ_1 = ρ(ρX_0 + ξ_0) + ξ_1 = ρ²X_0 + ρξ_0 + ξ_1
X_3 = ρX_2 + ξ_2 = ρ³X_0 + ρ²ξ_0 + ρξ_1 + ξ_2
· · ·
X_n = ρⁿX_0 + ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + · · · + ρξ_{n−2} + ξ_{n−1}.
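A short simulation of this recursion (parameter values assumed, for illustration) makes the point of the discussion that follows: a single path never settles, but its distribution does, as one can see from the sample quantiles at two different large times.

import random

rho = 0.8  # assumed value with |rho| < 1

def path_value(n):
    # X_n from X_{k+1} = rho X_k + xi_k, X_0 = 0, with bounded
    # i.i.d. noise xi_k uniform on [-1, 1].
    x = 0.0
    for _ in range(n):
        x = rho * x + random.uniform(-1.0, 1.0)
    return x

N = 50_000
for n in (50, 100):
    samples = sorted(path_value(n) for _ in range(N))
    quartiles = [samples[N // 4], samples[N // 2], samples[3 * N // 4]]
    print(n, quartiles)  # nearly identical for both n: the law has converged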

It follows that |ρ| < 1 is the condition for stability. Indeed, since |ρ| < 1, the first term, ρⁿX_0, goes to 0 as n → ∞. The second term,
X̃_n := ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + · · · + ρξ_{n−2} + ξ_{n−1},
which is a linear combination of ξ_0, . . . , ξ_{n−1}, does not converge. But notice that, if we let
X̂_n := ρ^{n−1}ξ_{n−1} + ρ^{n−2}ξ_{n−2} + · · · + ρξ_1 + ξ_0,
then, since the ξ's are i.i.d., we have
P(X̃_n ≤ x) = P(X̂_n ≤ x), uniformly in x.
The nice thing is that, under nice conditions (e.g. assume that the ξ_n are bounded), X̂_n does converge to
X̂_∞ = Σ_{k=0}^∞ ρ^k ξ_k.
In other words, while the limit of X̃_n does not exist, the limit of its distribution does:
lim_{n→∞} P(X̃_n ≤ x) = lim_{n→∞} P(X̂_n ≤ x) = P(X̂_∞ ≤ x).
This is precisely the reason why we cannot, for stochastic systems, define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,
lim_{n→∞} P(X_n = x) = π(x).
What this means is that, for time-homogeneous Markov chains, the n-step transition matrix Pⁿ converges to a matrix with rows all equal to π:
lim_{n→∞} p_ij^{(n)} = π(j),  i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y; i.e. we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. A coupling of them refers to a joint construction on the same probability space, i.e. a definition of the joint law P(X = x, Y = y); "coupling" refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define
P(X = x, Y = y) = P(X = x) P(Y = y).
But suppose that we have some further requirement, for instance, to try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.
f(x) = Σ_y h(x, y),  g(y) = Σ_x h(x, y),
and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (X_n, n ≥ 0) and the law of a stochastic process Y = (Y_n, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

We say that a random time T is a meeting time of the two processes if
X_n = Y_n for all n ≥ T.
This T is a random variable that takes values in the set of times or, possibly, the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,
|P(X_n = x) − P(Y_n = x)| ≤ P(T > n).
If, in particular, P(T < ∞) = 1, i.e. T is finite with probability one, then
|P(X_n = x) − P(Y_n = x)| → 0, uniformly in x ∈ S, as n → ∞.

Proof. Suppose that we have two stochastic processes as above, and suppose that they have been coupled. We have
P(X_n = x) = P(X_n = x, n < T) + P(X_n = x, n ≥ T)
= P(X_n = x, n < T) + P(Y_n = x, n ≥ T)
≤ P(n < T) + P(Y_n = x),
where we used the fact that X_n = Y_n on the event {n ≥ T}. Hence
P(X_n = x) − P(Y_n = x) ≤ P(T > n).  (19)
Repeating (19), but with the roles of X and Y interchanged, we obtain
P(Y_n = x) − P(X_n = x) ≤ P(T > n).
Combining the two inequalities we obtain the first assertion:
|P(X_n = x) − P(Y_n = x)| ≤ P(T > n),  n ≥ 0.
Since this holds for all x ∈ S and since the right-hand side does not depend on x, we can write it as
sup_{x∈S} |P(X_n = x) − P(Y_n = x)| ≤ P(T > n).
Assume now that P(T < ∞) = 1. Then
lim_{n→∞} P(T > n) = P(T = ∞) = 0,
and so
sup_{x∈S} |P(X_n = x) − P(Y_n = x)| → 0, as n → ∞.
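The coupling inequality can be watched at work numerically. The sketch below (illustrative chain and names of my own) runs two copies of the same chain, one from a fixed state and one from π, moves them independently until they first meet and together thereafter (the construction used in the next subsection), and compares the distance between the one-dimensional laws with the estimate of P(T > n).

import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2]])   # illustrative aperiodic irreducible chain

# Solve pi P = pi, sum(pi) = 1, by least squares.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0., 0., 0., 1.]), rcond=None)[0]

N, H = 50_000, 10
not_met = np.zeros(H)        # estimates of P(T > n)
x_freq = np.zeros((H, 3))    # empirical law of the X-copy at each n
for _ in range(N):
    x, y, met = 0, rng.choice(3, p=pi), False
    for n in range(H):
        met = met or (x == y)
        if not met:
            not_met[n] += 1
        x_freq[n, x] += 1
        if met:              # glue the two copies after the meeting time
            x = y = rng.choice(3, p=P[y])
        else:
            x, y = rng.choice(3, p=P[x]), rng.choice(3, p=P[y])
for n in range(H):
    print(n, np.max(np.abs(x_freq[n] / N - pi)), not_met[n] / N)
    # up to simulation error, the first column is bounded by the second,
    # and both tend to 0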

18.4 Proof of the fundamental stability theorem

First, let us review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (X_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_ij and X_0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Y_n, n ≥ 0), has the law of a Markov chain with the same transition probabilities p_ij, but with Y_0 distributed according to the stationary distribution π. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen.

It is very important to note that we have only defined the law of X and the law of Y; we have not defined a joint law. In other words, we have not defined joint probabilities such as P(X_n = x, X_m = y). We now couple them by doing the straightforward thing: we assume that X = (X_n, n ≥ 0) is independent of Y = (Y_n, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
T := inf{n ≥ 0 : X_n = Y_n}.

Consider the process
W_n := (X_n, Y_n).
Then (W_n, n ≥ 0) is a Markov chain with state space S × S. Its (1-step) transition probabilities are
q_{(x,y),(x′,y′)} := P( W_{n+1} = (x′, y′) | W_n = (x, y) ) = P(X_{n+1} = x′ | X_n = x) P(Y_{n+1} = y′ | Y_n = y) = p_{xx′} p_{yy′},
and its n-step transition probabilities are
q_{(x,y),(x′,y′)}^{(n)} = p_{xx′}^{(n)} p_{yy′}^{(n)}.
Its initial state W_0 = (X_0, Y_0) has distribution
P( W_0 = (x, y) ) = µ(x) π(y).
From Theorem 3 and the aperiodicity assumption, we have that p_{xx′}^{(n)} > 0 and p_{yy′}^{(n)} > 0 for all large n, implying that q_{(x,y),(x′,y′)}^{(n)} > 0 for all large n; hence (W_n, n ≥ 0) is an irreducible chain. Notice that
σ(x, y) := π(x) π(y),  (x, y) ∈ S × S,
is a stationary distribution for (W_n, n ≥ 0). (Indeed, had we started with both X_0 and Y_0 independent and distributed according to π, then, for all n ≥ 1, X_n and Y_n would be independent and distributed according to π.)

Since σ(x, y) = π(x)π(y) > 0 for all (x, y) ∈ S × S, Corollary 13 shows that every state of the product chain is positive recurrent; hence (W_n, n ≥ 0) is positive recurrent, and, being irreducible, it visits every state, in particular the diagonal states (x, x), with probability one. Therefore
P(T < ∞) = 1.

Define now a third process by:
Z_n := X_n 1(n < T) + Y_n 1(n ≥ T).
In other words, Z_n equals X_n before the meeting time of X and Y and, after the meeting time, Z_n equals Y_n. Obviously, Z_0 = X_0, which has an arbitrary distribution µ. Moreover, (Z_n, n ≥ 0) is a Markov chain with transition probabilities p_ij. Hence (Z_n, n ≥ 0) is identical in law to (X_n, n ≥ 0); in particular,
P(X_n = x) = P(Z_n = x),  for all n and x.

[Figure: X starts from the arbitrary distribution µ, Y from the stationary distribution π; Z follows X up to the meeting time T, and Y thereafter.]

Clearly, T is a meeting time between Z and Y. By the coupling inequality,
|P(Z_n = x) − P(Y_n = x)| ≤ P(T > n),
which converges to zero, precisely because P(T < ∞) = 1. Since P(Z_n = x) = P(X_n = x), for all n and x, and since P(Y_n = x) = π(x), we have
|P(X_n = x) − π(x)| ≤ P(T > n),
uniformly in x, with the right-hand side tending to 0 as n → ∞. This proves Theorem 17.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_ij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and p_01 = p_10 = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and the transition probabilities are
q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.
This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, setting p_00 = 0.01, p_01 = 0.99, p_10 = 1, then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p_01 = p_10 = 1. Suppose X_0 = 0. Then X_n = 0 for even n and X_n = 1 for odd n. Therefore p_00^{(n)} = 1 for even n and = 0 for odd n; thus lim_{n→∞} p_00^{(n)} does not exist. However, if we modify the chain by taking p_00 = 0.01, p_01 = 0.99, p_10 = 1, then, by the results of this section,
lim_{n→∞} p_00^{(n)} = lim_{n→∞} p_10^{(n)} = π(0),  lim_{n→∞} p_11^{(n)} = lim_{n→∞} p_01^{(n)} = π(1),
where π(0), π(1) satisfy
0.99 π(0) = π(1),  π(0) + π(1) = 1,
whence π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,
lim_{n→∞} [ p_00^{(n)} p_01^{(n)} ; p_10^{(n)} p_11^{(n)} ] = [ 0.503 0.497 ; 0.503 0.497 ].
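Both phenomena are visible in a few lines of numpy: the periodic chain's powers keep oscillating, while the aperiodic modification converges to the matrix above.

import numpy as np

P_periodic = np.array([[0., 1.], [1., 0.]])
P_aperiodic = np.array([[0.01, 0.99], [1., 0.]])

print(np.linalg.matrix_power(P_periodic, 50))   # identity: p_00^(50) = 1
print(np.linalg.matrix_power(P_periodic, 51))   # swapped:  p_00^(51) = 0
print(np.linalg.matrix_power(P_aperiodic, 50))  # both rows ~ (0.503, 0.497)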

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"⁶. It is, however, customary to have a consistent terminology, and this is expressed by our choice; terminology cannot be, per se, wrong.

⁶ We put "wrong" in quotes. The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981).

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. By the results of Section 14, we know we have a unique stationary distribution π and that π(x) > 0 for all states x. Suppose that the period of the chain is d. By the results of §5.4, we can decompose the state space into d classes C_0, C_1, . . . , C_{d−1}, such that the chain moves cyclically between them: from C_0 to C_1, from C_1 to C_2, and so on, and, finally, from C_{d−1} back to C_0. It is not difficult to see that:

Lemma 8. The chain (X_{nd}, n ≥ 0), i.e. the original chain sampled every d steps, is aperiodic: all its states have period 1. Moreover, it consists of d closed communicating classes, namely C_0, . . . , C_{d−1}: if X_0 ∈ C_r, then (X_{nd}) remains in C_r. Hence, if we restrict the chain (X_{nd}, n ≥ 0) to any of the sets C_r, we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply.

We shall show that:

Lemma 9. The (unique) stationary distribution of (X_{nd}, n ≥ 0), restricted to C_r, is
dπ(i),  i ∈ C_r.

Proof. Let α_r := Σ_{i∈C_r} π(i). We have that π satisfies the balance equations:
π(i) = Σ_{j∈S} π(j) p_ji,  i ∈ S.
Suppose i ∈ C_r. Then p_ji = 0 unless j ∈ C_{r−1}, so
π(i) = Σ_{j∈C_{r−1}} π(j) p_ji.
Summing up over i ∈ C_r, we obtain
α_r = Σ_{i∈C_r} π(i) = Σ_{j∈C_{r−1}} π(j) Σ_{i∈C_r} p_ji = Σ_{j∈C_{r−1}} π(j) = α_{r−1}.
Since the α_r add up to 1, each one must be equal to 1/d. Hence (dπ(i), i ∈ C_r) is a probability distribution on C_r; moreover, since π = πP^d, it satisfies the balance equations of the sampled chain restricted to C_r, and, being the latter's unique stationary distribution (Lemma 8), the claim follows.

This means that:

Theorem 18. Suppose that p_ij are the transition probabilities of a positive recurrent irreducible chain with period d and stationary distribution π. Then, for i, j in the same class C_r,
p_ij^{(nd)} = P_i(X_{nd} = j) → dπ(j),  as n → ∞.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ C_k, j ∈ C_ℓ, and let r := ℓ − k (where r is taken to be a number between 0 and d − 1, after reduction modulo d). Then, starting from i, the original chain enters C_ℓ in r steps and, thereafter, it visits C_ℓ at times r, r + d, r + 2d, . . .. Hence
p_ij^{(r+nd)} → dπ(j),  as n → ∞.

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_ij^{(n)}, as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_ij^{(n)} → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_ij^{(n)} → 0 as n → ∞, for all i.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state, then
p_ij^{(n)} → 0,  as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, . . ., a probability distribution on N,
p^{(n)} = (p_1^{(n)}, p_2^{(n)}, p_3^{(n)}, . . .),
where Σ_{i=1}^∞ p_i^{(n)} = 1 and p_i^{(n)} ≥ 0 for all i. Then we can find a sequence n_1 < n_2 < n_3 < · · · of integers and numbers p_i such that lim_{k→∞} p_i^{(n_k)} = p_i, for all i ∈ N.

Proof. Since the numbers p_1^{(n)}, n = 1, 2, . . ., are contained between 0 and 1, there must be a subsequence (n_k^1), k = 1, 2, . . ., such that lim_{k→∞} p_1^{(n_k^1)} exists and equals, say, p_1. Now consider the numbers p_2^{(n_k^1)}, k = 1, 2, . . .. Since they are contained between 0 and 1, there must be a subsequence (n_k^2) of (n_k^1) such that lim_{k→∞} p_2^{(n_k^2)} exists and equals, say, p_2. It is clear how we continue: for each i, we find a subsequence (n_k^i) of (n_k^{i−1}) along which p_i^{(·)} converges to some p_i. Schematically,
p_1^{(n_1^1)}, p_1^{(n_2^1)}, p_1^{(n_3^1)}, · · · → p_1
p_2^{(n_1^2)}, p_2^{(n_2^2)}, p_2^{(n_3^2)}, · · · → p_2
p_3^{(n_1^3)}, p_3^{(n_2^3)}, p_3^{(n_3^3)}, · · · → p_3
· · ·
where the sequence (n_k^2) of the second row is a subsequence of (n_k^1) of the first row, the sequence (n_k^3) of the third row is a subsequence of (n_k^2) of the second row, and so on. Now pick the numbers in the diagonal, n_k := n_k^k, k = 1, 2, . . .. By construction, (n_k, k ≥ i) is a subsequence of each (n_k^i). Hence
p_i^{(n_k)} → p_i,  as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit⁷:

⁷ See Brémaud (1999).

Lemma 11 (Fatou). If (x_i^{(n)}, i ∈ N), n = 1, 2, . . ., and (y_i, i ∈ N) are sequences of nonnegative numbers, then
Σ_{i=1}^∞ y_i lim inf_{n→∞} x_i^{(n)} ≤ lim inf_{n→∞} Σ_{i=1}^∞ y_i x_i^{(n)}.

Proof of Theorem 19. Let j be a null recurrent state, and suppose that, for some i, the sequence p_ij^{(n)} does not converge to 0. Then we can find a subsequence (n_k) such that
lim_{k→∞} p_ij^{(n_k)} = α_j > 0.
Consider the probability vector (p_ix^{(n_k)}, x ∈ S). By Lemma 10, we can pick a subsequence (n′_k) of (n_k) such that
lim_{k→∞} p_ix^{(n′_k)} = α_x,  for all x ∈ S.
Since Σ_{x∈S} p_ix^{(n′_k)} = 1, we have, by Lemma 11 (with y ≡ 1),
Σ_{x∈S} α_x ≤ lim inf_{k→∞} Σ_{x∈S} p_ix^{(n′_k)} = 1.
But P^{n+1} = P^n P, i.e.
p_ix^{(n′_k + 1)} = Σ_{y∈S} p_iy^{(n′_k)} p_yx,
so, by Lemma 11 again (extracting, if necessary, a further subsequence along which the left-hand side converges),
lim_{k→∞} p_ix^{(n′_k + 1)} ≥ Σ_{y∈S} lim_{k→∞} p_iy^{(n′_k)} p_yx,
and hence
α_x ≥ Σ_{y∈S} α_y p_yx,  x ∈ S.
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. for some x_0 ∈ S it is a strict inequality:
α_{x_0} > Σ_{y∈S} α_y p_{y x_0}.
This would imply that
Σ_{x∈S} α_x > Σ_{x∈S} Σ_{y∈S} α_y p_yx = Σ_{y∈S} α_y Σ_{x∈S} p_yx = Σ_{y∈S} α_y,
and this is a contradiction: the finite number Σ_x α_x cannot be strictly larger than itself. Therefore
α_x = Σ_{y∈S} α_y p_yx,  x ∈ S,
i.e. the α_x satisfy the balance equations. Since 0 < Σ_{x∈S} α_x ≤ 1 (the lower bound holds because α_j > 0), we have that
π(x) := α_x / Σ_{y∈S} α_y
satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0, because α_j > 0. By Corollary 13, state j is positive recurrent: contradiction. Therefore p_ij^{(n)} → 0 for all i.

19.2 The general case

It remains to see what the limit of p_ij^{(n)} is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Let C be the communicating class of j, and let π^C be the stationary distribution associated with C; recall that π^C(j) = 1/E_j T_j, where T_j = inf{n ≥ 1 : X_n = j}. Let
H_C := inf{n ≥ 0 : X_n ∈ C}
and
ϕ_C(i) := P_i(H_C < ∞).

Theorem 20. Let j be a positive recurrent state with period 1, and let i be an arbitrary state. Then
lim_{n→∞} p_ij^{(n)} = ϕ_C(i) π^C(j).

Proof. If i also belongs to C, then ϕ_C(i) = 1 and the result follows from Theorem 17. In general, we have
p_ij^{(n)} = P_i(X_n = j) = P_i(X_n = j, H_C ≤ n) = Σ_{m=1}^n P_i(H_C = m) P_i(X_n = j | H_C = m).
From the Strong Markov Property,
P_i(X_n = j | H_C = m) = P_µ(X_{n−m} = j),
where µ is the distribution of X_{H_C} given that X_0 = i and that {H_C < ∞}. But Theorem 17 tells us that the limit exists regardless of the initial distribution:
P_µ(X_{n−m} = j) → π^C(j),  as n → ∞.
Hence
lim_{n→∞} p_ij^{(n)} = π^C(j) Σ_{m=1}^∞ P_i(H_C = m) = π^C(j) P_i(H_C < ∞) = ϕ_C(i) π^C(j).

Remark: We can write the result of Theorem 20 as
lim_{n→∞} p_ij^{(n)} = f_ij / E_j T_j,
where f_ij = P_i(T_j < ∞). Indeed, ϕ_C(i) = f_ij (once the chain enters C, it hits j with probability one), and π^C(j) = 1/E_j T_j. In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7; this is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: states 1, . . . , 5, with p_11 = 1 − γ, p_12 = γ, p_21 = 1, p_32 = 1 − α, p_34 = α, p_43 = β, p_45 = 1 − β, p_55 = 1.]

The communicating classes are I = {3, 4} (inessential states), C_1 = {1, 2} (closed class), and C_2 = {5} (absorbing state). Both C_1, C_2 are aperiodic classes. The stationary distributions associated with the two closed classes are:
π^{C_1}(1) = 1/(1 + γ),  π^{C_1}(2) = γ/(1 + γ);  π^{C_2}(5) = 1.
Let ϕ(i) be the probability that class C_1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, and, from first-step analysis,
ϕ(3) = (1 − α) ϕ(2) + α ϕ(4),  ϕ(4) = (1 − β) ϕ(5) + β ϕ(3),
whence
ϕ(3) = (1 − α)/(1 − αβ),  ϕ(4) = β(1 − α)/(1 − αβ).
Hence, as n → ∞,
p_31^{(n)} → ϕ(3) π^{C_1}(1),  p_32^{(n)} → ϕ(3) π^{C_1}(2),  p_33^{(n)} → 0,  p_34^{(n)} → 0,  p_35^{(n)} → (1 − ϕ(3)) π^{C_2}(5),
p_41^{(n)} → ϕ(4) π^{C_1}(1),  p_42^{(n)} → ϕ(4) π^{C_1}(2),  p_43^{(n)} → 0,  p_44^{(n)} → 0,  p_45^{(n)} → (1 − ϕ(4)) π^{C_2}(5).
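Theorem 20 is easy to verify numerically on this example. The sketch below uses illustrative parameter values of my own (α = 0.5, β = 0.4, γ = 0.3) and compares a large matrix power with the limits predicted by the formulas above.

import numpy as np

alpha, beta, gamma = 0.5, 0.4, 0.3   # assumed values, for illustration
# States ordered 1, 2, 3, 4, 5 as in the example above.
P = np.array([
    [1 - gamma, gamma,     0.0,  0.0,   0.0],
    [1.0,       0.0,       0.0,  0.0,   0.0],
    [0.0,       1 - alpha, 0.0,  alpha, 0.0],
    [0.0,       0.0,       beta, 0.0,   1 - beta],
    [0.0,       0.0,       0.0,  0.0,   1.0],
])
Pn = np.linalg.matrix_power(P, 500)
phi3 = (1 - alpha) / (1 - alpha * beta)
print(Pn[2, 0], phi3 / (1 + gamma))   # p_31^(n) vs phi(3) * pi^{C1}(1)
print(Pn[2, 4], 1 - phi3)             # p_35^(n) vs (1 - phi(3)) * pi^{C2}(5)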

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, among others, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville. We will avoid giving a precise definition of the term "ergodic", but only mention what constitutes a result in the area.

Theorem 14 is a type of ergodic theorem, and so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X_0 such that P(T_a < ∞) = 1. Consider the quantity ν^{[a]}(x), defined in (15) as the average number of visits to state x between two successive visits to state a. Let f be a reward function such that
Σ_{x∈S} |f(x)| ν^{[a]}(x) < ∞.  (20)
Then
P( lim_{t→∞} [ Σ_{n=0}^t f(X_n) ] / [ Σ_{n=0}^t 1(X_n = a) ] = f̄ ) = 1,  (21)
where
f̄ := Σ_{x∈S} f(x) ν^{[a]}(x).
Note that the denominator Σ_{n=0}^t 1(X_n = a) equals the number N_t of visits to state a up to time t.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a and normalised by t; here a may be null recurrent, and we normalise by N_t instead. If a is positive recurrent, the two statements agree: dividing the numerator and denominator of (21) by t, the denominator N_t/t converges to 1/E_a T_a, so the numerator converges to f̄/E_a T_a, recovering Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent, then the Ergodic Theorem (Theorem 14) holds, and this justifies the terminology of §18.5.

Proof. The reader is asked to revisit the proof of Theorem 14. As there, we break the total reward into rewards over complete excursions plus a first and a last term:
Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G^first + G^last,
where G_r = Σ_{T_a^{(r)} ≤ s < T_a^{(r+1)}} f(X_s), while G^first, G^last are the total rewards over the first and last incomplete cycles, respectively. The Law of Large Numbers tells us that
lim_{n→∞} (1/n) Σ_{r=1}^n G_r = E G_1,
with probability 1, as long as E|G_1| < ∞; but this is precisely the condition (20). Since N_t → ∞, we may replace n by the subsequence N_t. So, if we divide the numerator and the denominator of the fraction in (21) by N_t, the denominator becomes 1, while the numerator converges to E G_1, the contributions of the first and last incomplete cycles being negligible as in the proof of Theorem 14. We only need to check that E G_1 = f̄, which is the quantity denoted by f̄ above; this is done as in Theorem 14 and is left as an exercise.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Suppose, then, that the number of states is finite. At least one of the states must be visited infinitely often; so there are always recurrent states. From Proposition 2 we know that the balance equations ν = νP have at least one solution (take ν = ν^{[a]} for a recurrent state a). Now, since the number of states is finite, we have
C := Σ_{i∈S} ν(i) < ∞.

Hence
π(i) := ν(i)/C,  i ∈ S,
satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0; by Corollary 13, this state must be positive recurrent. In other words, a finite Markov chain always has positive recurrent states. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15), and, by Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have Pⁿ → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

22 Time reversibility

Consider a Markov chain (X_n, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed, like playing a film backwards. More precisely, reversibility means that
(X_0, X_1, . . . , X_n) has the same distribution as (X_n, X_{n−1}, . . . , X_0), for all n.
Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same; but if we film a man who walks, and run the film backwards, then we instantly know that the film is played in reverse. The walking man is, in a sense, not reversible.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X_0, X_1, . . . , X_n) has the same distribution as (X_n, X_{n−1}, . . . , X_0) for all n. In particular, the first components of the two vectors must have the same distribution, i.e. X_n has the same distribution as X_0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_ij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:
π(i) p_ij = π(j) p_ji,  i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X_0, X_1) has the same distribution as (X_1, X_0), i.e.
P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j).
But P(X_0 = i, X_1 = j) = π(i) p_ij, and P(X_0 = j, X_1 = i) = π(j) p_ji,

and hence the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. Since the chain is irreducible, we have π(i) > 0 for all i. The DBE imply immediately that (X_0, X_1) has the same distribution as (X_1, X_0). For two steps, pick a state k and write, using the DBE,
π(i) p_ij p_jk = [π(j) p_ji] p_jk = p_ji [π(j) p_jk] = p_ji [π(k) p_kj] = π(k) p_kj p_ji,
so that
P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i),
i.e. (X_0, X_1, X_2) has the same distribution as (X_2, X_1, X_0). The same logic works for any n and can be worked out by induction: for all n, (X_0, X_1, . . . , X_n) has the same distribution as (X_n, X_{n−1}, . . . , X_0). So the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_ij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i_1, i_2, . . . , i_m ∈ S and all m ≥ 3,
p_{i_1 i_2} p_{i_2 i_3} · · · p_{i_{m−1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m−1}} · · · p_{i_2 i_1}.  (22)

Proof. Suppose first that the chain is reversible; then the detailed balance equations hold. We will show the Kolmogorov loop criterion (KLC). Pick 3 states i, j, k and write, using the DBE,
π(i) p_ij p_jk p_ki = [π(j) p_ji] p_jk p_ki = [π(j) p_jk] p_ji p_ki = [π(k) p_kj] p_ji p_ki = [π(k) p_ki] p_ji p_kj = [π(i) p_ik] p_ji p_kj = π(i) p_ik p_kj p_ji.
Since the chain is irreducible, we have π(i) > 0 for all i. Hence, cancelling the π(i) from the two ends of the above display, we obtain the KLC for m = 3:
p_ij p_jk p_ki = p_ik p_kj p_ji.
The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix two states i_1, i_2 and sum (22) over all choices of the intermediate states i_3, . . . , i_m; we obtain
p_{i_1 i_2} p_{i_2 i_1}^{(m−1)} = p_{i_1 i_2}^{(m−1)} p_{i_2 i_1}.
If the chain is aperiodic, then, letting m → ∞, we have
p_{i_2 i_1}^{(m−1)} → π(i_1) and p_{i_1 i_2}^{(m−1)} → π(i_2),
implying that
π(i_1) p_{i_1 i_2} = π(i_2) p_{i_2 i_1}.
These are the DBE, and the chain is reversible.

2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j; in other words, that we have separation of variables: pij/pji = f(i)g(j). (Convention: we form this ratio only when pij > 0.) Necessarily, f(i)g(i) must be 1 (why?), so we have

    pij/pji = g(j)/g(i).

Multiplying by g(i) and summing over j we find

    g(i) = Σ_{j∈S} g(j) pji.

So g satisfies the balance equations, and hence π can be found very easily by normalising g. This gives the following method for showing that a finite irreducible chain is reversible:

• Form the ratio pij/pji and show that the variables separate.

• If the chain has infinitely many states, then we need, in addition, to guarantee that the chain is positive recurrent.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. Form the graph by deleting all self-loops and by replacing, for each pair of distinct states i, j for which pij > 0, the two oriented edges from i to j and from j to i by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible.

The reason is as follows. Pick two distinct states i, j for which pij > 0. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop; but, by assumption, there isn't any.) Let A, Ac be the sets in the two connected components, with i ∈ A. There is obviously only one arrow from A to Ac, namely the arrow from i to j, and the arrow from j to i is the only one from Ac to A. The balance equations are written in the form F(A, Ac) = F(Ac, A) (see §4.1), which gives π(i)pij = π(j)pji, i.e. the detailed balance equations. Hence the chain is reversible.

For example, consider the chain: [figure omitted] Replace it by: [figure omitted] This is a tree. Hence the original chain is reversible.

Example 2: Random walk on a graph. Consider an undirected graph G with a set S of vertices (a finite set) and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. States j and i are called neighbours if they are connected by an edge. For each state i define its degree

    δ(i) = number of neighbours of i.

Now define the following transition probabilities:

    pij = 1/δ(i),  if j is a neighbour of i;
    pij = 0,       otherwise.

The Markov chain with these transition probabilities is called a random walk on a graph (RWG). [Figure omitted: a graph on the vertices 0, 1, 2, 3, 4.] Here, for instance, we have p12 = p10 = p14 = 1/3, p02 = 1/4, etc. Suppose the graph is connected; then the chain is irreducible.

The stationary distribution of a RWG is very easy to find: it assigns to a state probability which is proportional to its degree. We claim that

    π(i) = C δ(i),

where C is a constant. To see this, let us verify that the detailed balance equations hold: π(i)pij = π(j)pji. There are two cases. If i, j are neighbours, then π(i)pij = Cδ(i)·(1/δ(i)) = C and π(j)pji = Cδ(j)·(1/δ(j)) = C, so both members are equal to C. If i, j are not neighbours, then both sides are 0. In all cases, the DBE hold. Hence we conclude that:

1. A RWG is a reversible chain.
2. Its stationary distribution is π(i) = C δ(i), where the constant C is found by normalisation: Σ_{i∈S} π(i) = 1. Since Σ_{i∈S} δ(i) = 2|E|, where |E| is the total number of edges, we have C = 1/2|E|.
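A quick numerical sketch (our addition; the graph below is an arbitrary made-up example) confirming the formula π(i) = δ(i)/2|E|:

    import numpy as np

    # An arbitrary connected undirected graph on 5 vertices, given by its edges.
    E = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 3)]
    n = 5
    deg = [sum(1 for e in E if v in e) for v in range(n)]

    # Random walk on the graph: p_ij = 1/deg(i) for each neighbour j of i.
    P = np.zeros((n, n))
    for i, j in E:
        P[i, j] = 1.0 / deg[i]
        P[j, i] = 1.0 / deg[j]

    pi = np.array(deg, dtype=float) / (2 * len(E))   # claimed stationary law
    print(np.allclose(pi @ P, pi))                   # True: pi is stationary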

23 New Markov chains from old ones*

We now deal with the question: if we have a Markov chain (Xn), how can we create new ones?

23.1 Subsequences

The most obvious way to create a new Markov chain is by picking a subsequence, i.e. a sequence of integers (nk)_{k≥0} such that 0 ≤ n0 < n1 < n2 < · · ·, and by letting Yk := X_{nk}. The sequence (Yk, k = 0, 1, 2, ...) has, in general, the Markov property, but it is not time-homogeneous. If the subsequence forms an arithmetic progression, i.e. if nk = n0 + km, then (Yk, k = 0, 1, 2, ...) does have time-homogeneous transition probabilities:

    P(Y_{k+1} = j | Yk = i) = p^{(m)}_{ij},

where the p^{(m)}_{ij} are the m-step transition probabilities of the original chain. If, in addition, π is a stationary distribution for the original chain, then it is a stationary distribution for the new chain.

23.2 Watching a Markov chain when it visits a set

Suppose that the original chain is ergodic (i.e. irreducible and positive recurrent). Let A be a set of states and let T_A^{(r)} be the r-th visit to A. Define

    Yr := X_{T_A^{(r)}},    r = 1, 2, ....

Due to the Strong Markov Property, this process is also a Markov chain, and it does have time-homogeneous transition probabilities. The state space of (Yr) is A. If π is the stationary distribution of the original chain, then π/π(A) is the stationary distribution of (Yr).
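Returning to §23.1 for a moment, the arithmetic-progression case is easy to check numerically (a sketch we add here, with a made-up two-state chain): sampling every m-th step of a chain with matrix P gives a chain with matrix P^m, and the stationary distribution is shared.

    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])              # a made-up two-state chain
    m = 3
    Pm = np.linalg.matrix_power(P, m)       # transition matrix of Y_k := X_{km}

    pi = np.array([0.8, 0.2])               # solves pi P = pi for this P
    print(np.allclose(pi @ P, pi), np.allclose(pi @ Pm, pi))   # True True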

23.3 Subordination

Let (Xn, n ≥ 0) be Markov with transition probabilities pij. Let (St, t ≥ 0) be independent of (Xn), with

    S0 = 0,    S_{t+1} = St + ξt,    t ≥ 0,

where the ξt are i.i.d. random variables with values in N and distribution λ(n) = P(ξ0 = n), n ∈ N. Then

    Yt := X_{St},    t ≥ 0,

is a Markov chain with

    P(Y_{t+1} = j | Yt = i) = Σ_{n=1}^{∞} λ(n) p^{(n)}_{ij}.

23.4 Deterministic function of a Markov chain

If (Xn) is a Markov chain with state space S, and if f : S → S′ is some function from S to some countable set S′, then the process Yn = f(Xn) is NOT, in general, a Markov chain. If, however, f is a one-to-one function, then (Yn) is a Markov chain: indeed, knowledge of f(Xn) implies knowledge of Xn, and so, conditional on Yn, the past is independent of the future. More generally, we have:

Lemma 13. Suppose that P(Y_{n+1} = β | Xn = i) depends on i only through f(i), i.e. that there is a function Q : S′ × S′ → R such that

    P(Y_{n+1} = β | Xn = i) = Q(f(i), β).

Then (Yn) has the Markov property.

(Notation: We will denote the elements of S by i, j, ..., and the elements of S′ by α, β, ....)

Proof. Fix α0, α1, ..., α_{n−1} ∈ S′ and let Y_{n−1} denote the event

    Y_{n−1} := {Y0 = α0, ..., Y_{n−1} = α_{n−1}}.

We need to show that

    P(Y_{n+1} = β | Yn = α, Y_{n−1}) = P(Y_{n+1} = β | Yn = α)

for all choices of the variables involved.

Indeed,

    P(Y_{n+1} = β | Yn = α, Y_{n−1})
      = Σ_{i∈S} P(Y_{n+1} = β | Xn = i, Yn = α, Y_{n−1}) P(Xn = i | Yn = α, Y_{n−1})
      = Σ_{i∈S} Q(f(i), β) P(Xn = i | Yn = α, Y_{n−1})
      = Σ_{i∈S} Q(f(i), β) P(Yn = α | Xn = i, Y_{n−1}) P(Xn = i | Y_{n−1}) / P(Yn = α | Y_{n−1})
      = Σ_{i∈S} Q(α, β) 1(α = f(i)) P(Xn = i | Y_{n−1}) / P(Yn = α | Y_{n−1})
      = Q(α, β) P(Yn = α | Y_{n−1}) / P(Yn = α | Y_{n−1})
      = Q(α, β),

since P(Yn = α | Xn = i, Y_{n−1}) = 1(α = f(i)) and Σ_{i∈S} 1(α = f(i)) P(Xn = i | Y_{n−1}) = P(Yn = α | Y_{n−1}). Thus (Yn) is Markov with transition probabilities Q(α, β).

Example: Consider the symmetric drunkard's random walk, i.e. the chain with

    P(X_{n+1} = i ± 1 | Xn = i) = 1/2,    i ∈ Z.

Define Yn := |Xn|. Then (Yn) is a Markov chain. To see this, observe first that the function |·| is not one-to-one, so Lemma 13 is needed. We will show that P(Y_{n+1} = β | Xn = i), for all i ∈ Z and all β ∈ Z+, depends on i only through |i|. Indeed, conditional on {Xn = i} with i ≠ 0, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, regardless of whether i > 0 or i < 0, because the probability of going up is the same as the probability of going down; and if i = 0, then |X_{n+1}| = 1 for sure. In other words,

    P(Y_{n+1} = β | Xn = i) = Q(|i|, β),    i ∈ Z, β ∈ Z+,

if we let

    Q(α, β) := 1/2,  if β = α ± 1, α > 0;
    Q(α, β) := 1,    if β = 1, α = 0;
    Q(α, β) := 0,    otherwise.

And so (Yn) is, indeed, a Markov chain with transition probabilities Q(α, β).
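A simulation sketch (our addition) of this example: whenever Yn = |Xn| = 1, the next value of Y is 2 with probability Q(1, 2) = 1/2, however state 1 was reached.

    import random

    random.seed(0)
    x, up, total = 0, 0, 0
    for _ in range(200000):
        x_next = x + random.choice([-1, 1])
        if abs(x) == 1:            # current state of Y is 1
            total += 1
            if abs(x_next) == 2:   # did Y move up to 2?
                up += 1
        x = x_next
    print(up / total)              # close to 1/2 = Q(1, 2)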

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: At each point of time n we have a number Xn of individuals, each of which gives birth to a random number of offspring, independently from one another and following the same distribution. The parent dies immediately upon giving birth. We let pk be the probability that an individual will bear k offspring, k = 0, 1, 2, .... Let ξ_k^{(n)} be the number of offspring of the k-th individual in the n-th generation. We find the population at time n + 1 by adding the number of offspring of each individual, so the size of the (n + 1)-st generation is:

    X_{n+1} = ξ_1^{(n)} + ξ_2^{(n)} + · · · + ξ_{Xn}^{(n)}.

The ξ_k^{(n)} are i.i.d. (and independent of the initial population size X0, a further assumption). If we let ξ^{(n)} = (ξ_1^{(n)}, ξ_2^{(n)}, ...), then the above equation is of the form X_{n+1} = f(Xn, ξ^{(n)}) and, since ξ^{(0)}, ξ^{(1)}, ... are i.i.d., we have that (Xn, n ≥ 0) has the Markov property. It is a Markov chain with time-homogeneous transitions. Computing the transition probabilities pij = P(X_{n+1} = j | Xn = i) is, in general, a hard problem; we will not attempt to compute the pij nor draw the transition diagram, because they are both futile, so we will find some different method to proceed.

But here is an interesting special case. Example: Suppose that p1 = p, p0 = 1 − p: each individual gives birth to at most one child, with probability p, or to no children, with probability 1 − p. If we know that Xn = i, then X_{n+1} is the number of successful births amongst i individuals, which has a binomial distribution:

    pij = C(i, j) p^j (1 − p)^{i−j},    0 ≤ j ≤ i,

where C(i, j) denotes the binomial coefficient. In this case, the population cannot grow: it will eventually reach 0, with probability 1.

Back to the general case. Note that 0 is always an absorbing state, and one question of interest is whether the process will become extinct or not. If p1 = P(ξ = 1) = 1, then Xn = X0 for all n, and this is a trivial case; so assume that p1 < 1. Then P(ξ = 0) + P(ξ ≥ 2) > 0. We distinguish two cases:

Case I: If p0 = P(ξ = 0) = 0, then the population can only grow. The state 0 is irrelevant because it cannot be reached from anywhere, and all states i ≥ 1 are transient.

Case II: If p0 = P(ξ = 0) > 0, then state 0 can be reached from any other state: indeed, if the current population is i, there is probability p0^i that everybody gives birth to zero offspring. But 0 is an absorbing state. Hence all states i ≥ 1 are inessential, therefore transient. Therefore, with probability 1, if 0 is never reached then Xn → ∞.

In other words, if the population does not become extinct, then it must grow to infinity. Hence the problem of interest is the computation of the extinction probability. Let D be the event of 'death' or extinction:

    D := {Xn = 0 for some n ≥ 0}.

We wish to compute ε(i) := Pi(D), the extinction probability when we start with X0 = i. Obviously, ε(0) = 1. For i ≥ 1 we can argue that

    ε(i) = ε^i,    where ε = P1(D).

Indeed, if we start with X0 = i initial ancestors, each one behaves independently of the others. For each k = 1, ..., i, let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. We then have D = D1 ∩ · · · ∩ Di and P(Dk) = ε, so P(D | X0 = i) = ε^i. Therefore the problem reduces to the study of the branching process starting with X0 = 1.

Let

    µ = Eξ = Σ_{k≥1} k pk

be the mean number of offspring of an individual, and let

    ϕ(z) := E z^ξ = Σ_{k≥0} z^k pk

be the probability generating function of ξ. This function is of interest because it gives information about the extinction probability ε. We also let

    ϕn(z) := E z^{Xn} = Σ_{k≥0} z^k P(Xn = k)

be the probability generating function of Xn. In particular, ϕn(0) = P(Xn = 0) and, since Xn = 0 implies Xm = 0 for all m ≥ n, we have

    lim_{n→∞} ϕn(0) = lim_{n→∞} P(Xm = 0 for all m ≥ n)
                    = P(there exists n such that Xm = 0 for all m ≥ n) = P(D) = ε.

The result is this:

Proposition 5. Consider a Galton-Watson process starting with X0 = 1 initial ancestor, and let ε be the extinction probability of the process. If µ ≤ 1, then ε = 1. If µ > 1, then ε < 1 is the unique solution of ϕ(z) = z in [0, 1).

Proof. Note that ϕ0(z) = z and ϕ1(z) = ϕ(z). We then have

    ϕ_{n+1}(z) = E z^{X_{n+1}} = E[ z^{ξ_1^{(n)}} z^{ξ_2^{(n)}} · · · z^{ξ_{Xn}^{(n)}} ]
      = Σ_{i≥0} E[ z^{ξ_1^{(n)}} · · · z^{ξ_i^{(n)}} 1(Xn = i) ]
      = Σ_{i≥0} E[z^{ξ_1^{(n)}}] · · · E[z^{ξ_i^{(n)}}] E[1(Xn = i)]
      = Σ_{i≥0} ϕ(z)^i P(Xn = i)
      = E ϕ(z)^{Xn} = ϕn(ϕ(z)).

Hence ϕ2(z) = ϕ1(ϕ(z)) = ϕ(ϕ(z)), ϕ3(z) = ϕ2(ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2(z)), and so on; by induction,

    ϕ_{n+1}(z) = ϕ(ϕn(z)).

In particular, ϕ_{n+1}(0) = ϕ(ϕn(0)). But ϕn(0) → ε as n → ∞ and, since ϕ is a continuous function, ϕ(ϕn(0)) → ϕ(ε); on the other hand, ϕ_{n+1}(0) → ε. Therefore

    ε = ϕ(ε).

Notice now that

    µ = (d/dz) ϕ(z) at z = 1,

and recall that ϕ is a convex function with ϕ(1) = 1. If the slope µ at z = 1 is ≤ 1, then the graph of ϕ(z) lies above the diagonal (the function z), and so the equation z = ϕ(z) has no solutions other than z = 1. This means that ε = 1 (see the left figure below). On the other hand, if µ > 1, then the graph of ϕ(z) does intersect the diagonal at some point ε* < 1, and the equation z = ϕ(z) has only two solutions, z = ε* and z = 1. Since ε satisfies z = ϕ(z), all we need to show is that ε < 1, i.e. that ε = ε*. Indeed, since 0 ≤ ε*, we have ϕ(0) ≤ ϕ(ε*) = ε* and, by induction, ϕn(0) ≤ ε* for all n. But ϕn(0) → ε, so ε ≤ ε* < 1. Therefore ε = ε*; in fact, ε is the unique solution of ϕ(z) = z in [0, 1). (See the right figure below.)

[Figure omitted: the graph of ϕ(z) against the diagonal z in the two cases. Case µ ≤ 1: the only intersection is at z = 1, so ε = 1. Case µ > 1: the graphs intersect at ε < 1.]
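A numeric sketch (our addition) of Proposition 5: iterating ϕ_{n+1}(0) = ϕ(ϕn(0)) converges to ε. The offspring distribution below is an arbitrary choice with µ > 1.

    # Offspring distribution: p0 = 0.2, p1 = 0.3, p2 = 0.5, so mu = 1.3 > 1.
    p = [0.2, 0.3, 0.5]
    phi = lambda z: sum(pk * z**k for k, pk in enumerate(p))

    eps = 0.0                # eps_n = phi_n(0) = P(X_n = 0)
    for _ in range(200):
        eps = phi(eps)       # phi_{n+1}(0) = phi(phi_n(0))
    print(eps)               # the smallest root of phi(z) = z, here 0.4

For this ϕ, the fixed points of ϕ(z) = z are z = 0.4 and z = 1, so the iteration indeed settles at ε = 0.4 < 1.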

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (a radio channel). (Nowadays, Ethernet has replaced Aloha.) The problem of communication over a common channel, using the same frequency, is that if more than one host attempts to transmit a packet, there is a collision and the transmission is unsuccessful: the transmissions must then be attempted anew.

One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 1, 2, 3, 4, 5, 6, ... are allocated to hosts 1, 2, 3, 1, 2, 3, .... But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit and, since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea, we let hosts transmit whenever they like. Thus, for a packet to go through, only one of the hosts should be transmitting at a given time.

So, in a simplified model, assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot assume there are a new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Let s be the number of selected packets. If s = 1, there is a successful transmission and, on the next time slot, there are x + a − 1 packets. If s ≥ 2, there is a collision and, on the next time slot, there are x + a packets. So if we let Xn be the number of backlogged packets at the beginning of the n-th time slot, An the number of new packets on the same time slot, and Sn the number of selected packets, then

    X_{n+1} = max(Xn + An − 1, 0),  if Sn = 1;
    X_{n+1} = Xn + An,              otherwise.

We assume that the An are i.i.d. random variables with some common distribution:

    P(An = k) = λk,    k ≥ 0,

where λk ≥ 0 and Σ_{k≥0} λk = 1. We also let µ := Σ_{k≥1} k λk be the expectation of An. The process (Xn, n ≥ 0) is a Markov chain; let us find its transition probabilities. Conditional on Xn = x and An = a (and on all past states and arrivals), selections are made amongst the Xn + An packets independently, so Sn has a Binomial distribution with parameters x + a and p:

    P(Sn = k | Xn = x, An = a, X_{n−1}, A_{n−1}, ...) = C(x + a, k) p^k q^{x+a−k},    0 ≤ k ≤ x + a,

where q = 1 − p. (The reason we take the maximum with 0 in the recursion for X_{n+1} is that, if Xn + An = 0, we should not subtract 1.)

To compute p_{x,x−1} for x > 0, we think as follows: to have a reduction of x by 1, we must have no new packets arriving and exactly one selection:

    p_{x,x−1} = P(An = 0, Sn = 1 | Xn = x) = λ0 x p q^{x−1}.

To compute p_{x,x+k} for k ≥ 1, we think like this: to have an increase by k, we either need exactly k new packets and no successful transmission (i.e. Sn should be 0 or ≥ 2), or we need k + 1 new packets and one selection (Sn = 1):

    p_{x,x+k} = λk (1 − (x + k) p q^{x+k−1}) + λ_{k+1} (x + k + 1) p q^{x+k},    k ≥ 1.

The probabilities p_{x,x} are computed by subtracting the sum of the above from 1. The main result is:

Proposition 6. Unless µ = 0 (which is a trivial case: no arrivals!), the Markov chain (Xn) is transient. The Aloha protocol is unstable.

Idea of proof. Proving this is beyond the scope of the lectures, but we mention that it is based on the so-called drift criterion, which says that if the expected change X_{n+1} − Xn, conditional on Xn = x, is bounded below by a positive constant for all large x, then the chain is transient. We here only restrict ourselves to the computation of this expected change, the drift, defined by

    ∆(x) := E[X_{n+1} − Xn | Xn = x].

For x > 0 we have

    ∆(x) = (−1) p_{x,x−1} + Σ_{k≥1} k p_{x,x+k}
         = −λ0 x p q^{x−1} + Σ_{k≥1} k λk − Σ_{k≥1} k λk (x + k) p q^{x+k−1} + Σ_{k≥1} k λ_{k+1} (x + k + 1) p q^{x+k}
         = µ − q^x G(x),

where G(x) is a function of the form α + βx. Since p > 0, we have q < 1, and so q^x tends to 0 much faster than G(x) tends to ∞; hence q^x G(x) tends to 0 as x → ∞, and we can make it as small as we like by choosing x sufficiently large, for example ≤ µ/2. Therefore ∆(x) ≥ µ/2 for all large x: the drift is positive for all large x, and this proves transience.
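A sketch (our addition) computing the drift ∆(x) directly from the transition probabilities above, for an arbitrary made-up arrival distribution; it approaches µ for large x, as claimed.

    p, q = 0.1, 0.9
    lam = {0: 0.5, 1: 0.3, 2: 0.2}        # made-up arrival law, mu = 0.7
    mu = sum(k * v for k, v in lam.items())

    def drift(x):
        # Delta(x) = -p_{x,x-1} + sum_{k>=1} k p_{x,x+k}, with the formulas above
        d = -lam[0] * x * p * q**(x - 1)
        for k in (1, 2):
            sel1 = (x + k) * p * q**(x + k - 1)   # P(exactly one selection)
            d += k * (lam[k] * (1 - sel1)
                      + lam.get(k + 1, 0.0) * (x + k + 1) * p * q**(x + k))
        return d

    for x in (1, 5, 20, 60):
        print(x, round(drift(x), 4))       # tends to mu = 0.7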

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. In an Internet-dominated society, the rank of a page may be very important for a company's profits. The way importance is assigned is by computing the stationary distribution of a Markov chain that we now describe.

The web graph is a graph with a set of vertices V consisting of web pages, and edges representing hyperlinks: if page i has a link to page j, then we put an arrow from i to j and consider it as a directed edge of the graph. For each page i let L(i) be the set of pages it links to; if L(i) = ∅, then i is a 'dangling page'. Define transition probabilities as follows:

    pij = 1/|L(i)|,  if j ∈ L(i);
    pij = 1/|V|,     if L(i) = ∅;
    pij = 0,         otherwise.

The idea is simple: if Xn = i is the current location of the chain, the next location X_{n+1} is picked uniformly at random amongst all pages that i links to, unless i is a dangling page; in the latter case, the chain moves to a random page.

It is not clear that the chain is irreducible, neither that it is aperiodic. So we make the following modification: we pick α between 0 and 1 (typically, α ≈ 0.2) and let

    p′ij := (1 − α) pij + α/|V|.

These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored (and this happens with probability α), in which case he clicks on a random page.

So this is how Google assigns ranks to pages: it computes the stationary distribution π corresponding to the Markov chain with transition matrix P = [p′ij], i.e. π = πP, and then it says: page i has higher rank than j if π(i) > π(j). Its implementation, though, is one of the largest matrix computations in the world, because of the size of the problem. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by

    π(k) = π(k−1) P.

It uses advanced techniques from linear algebra to speed up computations.

Comments:

1. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.

2. Therefore, there are companies that sell high-rank pages: all you need to do to get high rank is pay (a lot of money).
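Here is a minimal sketch (our addition, not Google's actual code) of the update π(k) = π(k−1)P on a toy web graph, including the 'bored surfer' modification; the link structure is invented for illustration.

    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # page 3 is a dangling page
    n, alpha = 4, 0.2

    P = np.zeros((n, n))
    for i, L in links.items():
        P[i, :] = 1.0 / n if not L else 0.0      # dangling page: jump uniformly
        for j in L:
            P[i, j] = 1.0 / len(L)
    P = (1 - alpha) * P + alpha / n              # bored-surfer modification

    pi = np.full(n, 1.0 / n)                     # initial guess pi^(0)
    for _ in range(100):
        pi = pi @ P                              # pi^(k) = pi^(k-1) P
    print(pi)                                    # stationary distribution = ranks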

3. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments "link" to each other by hiring their faculty from each other (and from themselves).

4. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality but, from the point of view of an administrator, it makes the evaluation procedure easy: what is important is that the evaluation can be done without ever looking at the content of one's research.

5. Needless to say, these uses of PageRank and PageRank-like systems represent corruption at the deepest level. But it is a kind of corruption that is hard to trace, since it is unanimously equated with modern use of Information Technology.

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. An individual can be of type AA (meaning that all its alleles are A), or aa, or Aa. This means that the external appearance of a characteristic is a function of the pair of alleles. When two individuals mate, their offspring inherits an allele from each parent; thus, two AA individuals will produce an AA child, but an AA mating with an Aa may produce a child of type AA or one of type Aa.

Imagine that we have a population of N individuals who mate at random, producing a number of offspring. We will assume that a child chooses an allele from each parent at random, and that the parents are replaced by the children. The process repeats, generation after generation. We would like to study the genetic diversity, meaning how many characteristics we have.

Since we don't care to keep track of the individuals' identities, the genetic diversity depends only on the total number of alleles of each type. So let x be the number of alleles of type A (so that 2N − x is the number of those of type a). Then, at the next generation, the number of alleles of type A can be found as follows: Pick an allele at random; this allele is of type A with probability x/2N; maintain this allele for the next generation. Do the same thing 2N times, in each generation, independently.

[Figure omitted: 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice; one of the square alleles is selected 4 times; three of the square alleles are never selected.] Alleles that are never picked up may belong to individuals who never mate, or may belong to individuals who mate but whose offspring did not inherit this allele. And so on. To simplify, all we need is the information outlined in this figure.

We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. We have a transition from x to y if y alleles of type A were produced.

Each allele of type A is selected with probability x/2N and, since selections are independent, we see that

    p_{xy} = C(2N, y) (x/2N)^y (1 − x/2N)^{2N−y},    0 ≤ y ≤ 2N.

Notice that there is a chance (= (1 − x/2N)^{2N}) that no allele of type A is selected at all, in which case the population becomes one consisting of alleles of type a only; similarly, there is a chance (= (x/2N)^{2N}) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, p00 = p_{2N,2N} = 1, so both 0 and 2N are absorbing states. Since p_{xy} > 0 for all 1 ≤ x, y ≤ 2N − 1, the states {1, ..., 2N − 1} form a communicating class of inessential states. Therefore

    P(Xn = 0 or Xn = 2N for all large n) = 1,

i.e. the chain gets absorbed by either 0 or 2N. Let M = 2N and, as usual,

    Tx := inf{n ≥ 1 : Xn = x}.

We would like to know the probability

    ϕ(x) = Px(T0 > TM) = P(T0 > TM | X0 = x)

that the chain will eventually be absorbed by state M. (Here, the fact that M is even is irrelevant for the mathematical model.)

Another way to represent the chain is as follows:

    X_{n+1} = ξ_1^n + · · · + ξ_M^n,

where, conditionally on (X0, ..., Xn), the random variables ξ_1^n, ..., ξ_M^n are i.i.d. with

    P(ξ_k^n = 1 | Xn, ..., X0) = Xn/M,    P(ξ_k^n = 0 | Xn, ..., X0) = 1 − Xn/M.

Hence

    E(X_{n+1} | Xn, ..., X0) = E(X_{n+1} | Xn) = M (Xn/M) = Xn,

i.e., on the average, the number of alleles of type A won't change; conditionally on the past, the expectation doesn't change, and, in particular, EX_{n+1} = EXn for all n ≥ 0.

One can argue that, at "the end of time", the chain (started from x) will be either M, with probability ϕ(x), or 0, with probability 1 − ϕ(x). Since, on the average, the expectation doesn't change, we should have

    M × ϕ(x) + 0 × (1 − ϕ(x)) = x,

and thus we guess

    ϕ(x) = x/M.

The result is correct, but the argument needs further justification, because "the end of time" is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. These are:

    ϕ(0) = 0,    ϕ(M) = 1,    ϕ(x) = Σ_{y=0}^{M} p_{xy} ϕ(y).
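Before the analytic verification below, here is a quick simulation sketch (our addition) of the guess ϕ(x) = x/M, with made-up parameters:

    import random

    M = 10          # M = 2N alleles
    x0 = 3          # initial number of A alleles; guess: absorb at M w.p. 3/10

    random.seed(1)
    hits, trials = 0, 20000
    for _ in range(trials):
        x = x0
        while 0 < x < M:
            # next generation: Binomial(M, x/M) alleles of type A
            x = sum(random.random() < x / M for _ in range(M))
        hits += (x == M)
    print(hits / trials)   # close to x0/M = 0.3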

The first two are obvious. For the third, with ϕ(y) = y/M, the right-hand side equals

    Σ_{y=0}^{M} C(M, y) (x/M)^y (1 − x/M)^{M−y} (y/M)
      = Σ_{y=1}^{M} [ M! / (y!(M − y)!) ] (x/M)^y (1 − x/M)^{M−y} (y/M)
      = (x/M) Σ_{y=1}^{M} [ (M − 1)! / ((y − 1)!(M − y)!) ] (x/M)^{y−1} (1 − x/M)^{(M−1)−(y−1)}
      = (x/M) ( x/M + 1 − x/M )^{M−1}
      = x/M,

as needed.

24.5 A storage or queueing application

A storage facility stores, say, Xn electronic components at time n. A number An of new components are ordered and added to the stock, while a number of them are sold to the market: the demand is for Dn components. If enough components are available, the demand is met instantaneously and, at time n + 1, there are Xn + An − Dn components in the stock. Otherwise, the demand is only partially satisfied, and the stock reaches level zero at time n + 1. We can express these requirements by the recursion

    X_{n+1} = (Xn + An − Dn) ∨ 0,

where a ∨ b := max(a, b). Here are our assumptions: the pairs of random variables ((An, Dn), n ≥ 0) are i.i.d., and

    E(An − Dn) = −µ < 0,

i.e. the demand is, on the average, larger than the supply per unit of time. We will show that this Markov chain is positive recurrent. Let ξn := An − Dn, so that

    X_{n+1} = (Xn + ξn) ∨ 0,    n ≥ 0.

We have that (Xn, n ≥ 0) is a Markov chain. Since the Markov chain is represented by this very simple recursion, we will try to solve the recursion. Working for n = 1, 2, ..., we find

    X1 = (X0 + ξ0) ∨ 0
    X2 = (X1 + ξ1) ∨ 0 = (X0 + ξ0 + ξ1) ∨ ξ1 ∨ 0
    X3 = (X2 + ξ2) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2) ∨ (ξ1 + ξ2) ∨ ξ2 ∨ 0
      · · ·
    Xn = (X0 + ξ0 + · · · + ξ_{n−1}) ∨ (ξ1 + · · · + ξ_{n−1}) ∨ · · · ∨ ξ_{n−1} ∨ 0.

The correctness of the last formula can formally be proved by showing that it does satisfy the recursion. Thus, for y ≥ 0,

    g_{xy}^{(n)} := Px(Xn > y) = P( (x + ξ0 + · · · + ξ_{n−1}) ∨ (ξ1 + · · · + ξ_{n−1}) ∨ · · · ∨ ξ_{n−1} ∨ 0 > y ).

We will apply the so-called Loynes' method. Notice that the event whose probability we want to compute is a function of (ξ0, ..., ξ_{n−1}). Since the random variables (ξn) are i.i.d., we have

    (ξ0, ..., ξ_{n−1})  =(d)  (ξ_{n−1}, ..., ξ0),

so we can shuffle them in any way we like without altering their joint distribution: instead of considering them in their original order, we reverse it. Hence

    g_{xy}^{(n)} = P( (x + ξ_{n−1} + · · · + ξ0) ∨ (ξ_{n−2} + · · · + ξ0) ∨ · · · ∨ ξ0 ∨ 0 > y ).

Let

    Zn := ξ0 + · · · + ξ_{n−1},    Mn := Zn ∨ Z_{n−1} ∨ · · · ∨ Z1 ∨ 0.

Then, in particular,

    g_{0y}^{(n)} = P(Mn > y).

Since M_{n+1} ≥ Mn, we have g_{0y}^{(n)} ≤ g_{0y}^{(n+1)} for all n, and the limit of a bounded and increasing sequence certainly exists:

    g(y) := lim_{n→∞} g_{0y}^{(n)} = lim_{n→∞} P0(Xn > y).

We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. Indeed,

    M_{n+1} = 0 ∨ max_{1≤j≤n+1} Zj = 0 ∨ ( max_{1≤j≤n+1} (Zj − Z1) + ξ0 )  =(d)  0 ∨ (Mn + ξ0),

where the equality in distribution holds because max_{1≤j≤n+1}(Zj − Z1) is a function of (ξ1, ..., ξn) only, has the same distribution as Mn, and is independent of ξ0. Hence

    P(M_{n+1} = y) = P(0 ∨ (Mn + ξ0) = y) = Σ_{x≥0} P(Mn = x) P(0 ∨ (x + ξ0) = y) = Σ_{x≥0} P(Mn = x) p_{xy}.

Since the n-dependent terms are bounded, we may take the limit of both sides as n → ∞ and, using π(y) = lim_{n→∞} P(Mn = y), we find

    π(y) = Σ_{x≥0} π(x) p_{xy}.

So π satisfies the balance equations. We must make sure that it also satisfies Σ_y π(y) = 1; this will be the case if g(y) is not identically equal to 1. Here, we will use the assumption that Eξn = −µ < 0. From the Law of Large Numbers we have

    P( lim_{n→∞} Zn/n = −µ ) = 1,

and this implies that

    P( lim_{n→∞} Zn = −∞ ) = 1.

But then

    P( lim_{n→∞} Mn = 0 ∨ max{Z1, Z2, ...} < ∞ ) = 1.

Hence

    g(y) = P( 0 ∨ max{Z1, Z2, ...} > y )

is not identically equal to 1, as required. Therefore the Markov chain is positive recurrent.

It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. The case Eξn = 0 is delicate: depending on the actual distribution of ξn, the Markov chain may be transient or null recurrent.
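A simulation sketch (our addition, with made-up supply/demand laws) of the storage recursion: with Eξ < 0 the chain settles into a stationary regime, and the empirical distribution of Xn stabilises.

    import random
    from collections import Counter

    random.seed(2)
    # xi_n = A_n - D_n with A ~ Bernoulli(0.4), D ~ Bernoulli(0.5): E xi = -0.1
    def xi():
        return (random.random() < 0.4) - (random.random() < 0.5)

    x, counts = 0, Counter()
    for n in range(200000):
        x = max(x + xi(), 0)       # the recursion X_{n+1} = (X_n + xi_n) v 0
        if n > 1000:               # crude burn-in
            counts[x] += 1
    total = sum(counts.values())
    print({y: round(c / total, 3) for y, c in sorted(counts.items())[:6]})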

PART II: RANDOM WALKS

25 Random walk

The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Our Markov chains, so far, have been processes with time-homogeneous transition probabilities. A random walk possesses an additional property, that of spatial homogeneity, besides time-homogeneity. This means that the transition probability p_{xy} should depend on x and y only through their relative positions in space, i.e. that

    p_{x,y} = p_{x+z,y+z}    for any translation z.

To be able to say this, we need to give more structure to the state space S. More general spaces (e.g. groups) are allowed but, here, we won't go beyond Zd, the set of d-dimensional vectors with integer coordinates. So we let S = Zd; then, given a function p(x), x ∈ Zd, with Σ_x p(x) = 1, a random walk is a Markov chain with time- and space-homogeneous transition probabilities given by

    p_{x,y} = p(y − x).

Example 1: Let

    p(x) = C 2^{−|x1|−···−|xd|},

where C is such that Σ_x p(x) = 1.

Example 2: Let ej be the vector that has 1 in the j-th position and 0 everywhere else, and define

    p(x) = pj,  if x = ej, j = 1, ..., d;
    p(x) = qj,  if x = −ej, j = 1, ..., d;
    p(x) = 0,   otherwise,

where p1 + · · · + pd + q1 + · · · + qd = 1. A random walk of this type is called simple random walk or nearest neighbour random walk. This walk enjoys, besides time- and space-homogeneity, the skip-free property, namely that to go from state x to state y it must pass through all intermediate states, because its value can change by at most 1 at each step. If d = 1, then p(1) = p, p(−1) = q; this is the drunkard's walk we've seen before.

Example 3: Define

    p(x) = 1/2d,  if x ∈ {±e1, ..., ±ed};
    p(x) = 0,     otherwise.

This is a simple random walk which, in addition, has equal probabilities to move from one state x to one of its "neighbouring states" x ± e1, ..., x ± ed. We call it simple symmetric random walk. If d = 1, then p(1) = p(−1) = 1/2, and this is the symmetric drunkard's walk.

The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by Sn the state of the walk at time n. It is clear that Sn can be represented as

    Sn = S0 + ξ1 + · · · + ξn,

where ξ1, ξ2, ... are i.i.d. random variables in Zd with common distribution P(ξn = x) = p(x), and the initial state S0 is independent of the increments. We refer to the ξn as the increments of the random walk. The random walk ξ1 + · · · + ξn is a random walk starting from 0, and we can reduce several questions to questions about this random walk. Furthermore,

    p_{xy}^{(n)} = Px(Sn = y) = P(x + ξ1 + · · · + ξn = y) = P(ξ1 + · · · + ξn = y − x).

Now, notice that

    P(ξ1 + ξ2 = x) = Σ_y P(ξ1 = y, ξ2 = x − y) = Σ_y p(y) p(x − y).

The operation in the last term is known as convolution: the convolution of two probabilities p(x), p′(x), x ∈ Zd, is defined as a new probability p ∗ p′ given by

    (p ∗ p′)(x) = Σ_y p(y) p′(x − y).

The reader can easily check that

    p ∗ p′ = p′ ∗ p,    p ∗ (p′ ∗ p′′) = (p ∗ p′) ∗ p′′,    p ∗ δ0 = p.

(Recall that δz is the probability distribution that assigns probability 1 to the point z.) We will let p^{∗n} be the convolution of p with itself n times; we easily see that

    P(ξ1 + · · · + ξn = x) = p^{∗n}(x).

(When we write a sum without indicating explicitly the range of the summation variable, we mean a sum over all possible values.) In this notation,

    p_{xy}^{(n)} = Px(Sn = y) = p^{∗n}(y − x).

One could stop here and say that we have a formula for the n-step transition probability. But this would be nonsense, because all we have done is dress the original problem in more fancy clothes: computing the n-th order convolution for all n is, in general, a notoriously hard problem. We shall resist the temptation to compute and look at other

methods. After all, who cares about the exact value of p^{∗n}(x) for all n and x? Perhaps it is the case (a) that different questions need to be asked, or (b) that approximate methods are more informative than exact ones. Here are a couple of meaningful questions:

A. Will the random walk ever return to where it started from?
B. If yes, how long will that take?
C. If not, where will it go?

26 The simple symmetric random walk in dimension 1: path counting

This is a random walk with values in Z and transition probabilities p_{i,i+1} = p_{i,i−1} = 1/2 for all i ∈ Z. When the walk starts from 0 we can write

    Sn = ξ1 + · · · + ξn,

where the ξn are i.i.d. random variables (random signs) with P(ξn = ±1) = 1/2. Look at the distribution of (ξ1, ..., ξn): it is a random vector with values in {−1, +1}^n and, for fixed n,

    P(ξ1 = ε1, ..., ξn = εn) = 2^{−n}

for every (ε1, ..., εn) ∈ {−1, +1}^n. Since there are 2^n elements in {−1, +1}^n, all possible values of the vector are equally likely: we are dealing with a uniform distribution on the sample space {−1, +1}^n. Events A that depend only on the first n random signs are subsets of this sample space, A ⊂ {−1, +1}^n, and so

    P(A) = #A / 2^n,

where #A is the number of elements of A. So, if we can count the number of elements of A, we can compute its probability. The principle is trivial, but its practice may be hard. In the sequel, we shall learn some counting methods.

First, we can associate, to each initial position and to each sequence of n signs, a path of length n in the plane. For example, if the initial position is S0 = 0 and the sequence of signs is +1, +1, −1, +1, −1, we have a path that joins the points

    (0, 0) → (1, 1) → (2, 2) → (3, 1) → (4, 2) → (5, 1).

[Figure omitted: this path drawn in the plane, with a + or − sign attached to each step.] We can thus think of the elements of the sample space as paths. As a warm-up, consider the following:

Example: Assume that S0 = 0, and compute the probability of the event

    A = {S2 ≥ 0, S4 = −1, S6 < 0, S7 < 0}.

[Figure omitted: the paths of length 7 comprising the event A, drawn over the time axis 0, 1, ..., 7, with values between −4 and 2.] We can easily count that there are 6 paths in A, and so

    P(A) = 6/2^7 = 6/128 = 3/64 ≈ 0.047.

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation

    {(m, x) → (n, y)} := {all paths that start at (m, x) and end up at (n, y)}.

Let us first show:

Lemma 14. The number of paths that start from (0, 0) and end up at (n, i) is given by

    #{(0, 0) → (n, i)} = C(n, (n + i)/2),

where C(n, k) = n!/(k!(n − k)!) denotes the binomial coefficient. More generally, the number of paths that start from (m, x) and end up at (n, y) is given by:

    #{(m, x) → (n, y)} = C(n − m, (n − m + y − x)/2).    (23)

Remark. If n + i is not an even number, then there is no path that starts from (0, 0) and ends at (n, i); in this case we interpret C(n, (n + i)/2) as 0.

Proof. [Figure omitted: the lightly shaded region contains all paths of length n that start at (0, 0); the darker region shows all paths that start at (0, 0) and end up at the specific point (n, i).] Consider a path that starts from (0, 0) and ends at (n, i).

Let u be the number of +'s in this path (the number of upward segments) and d the number of −'s (the number of downward segments). Then

    u + d = n,    u − d = i.

Hence u = (n + i)/2 and d = (n − i)/2. So if we know where the path starts from and where it ends at, we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly C(n, u) ways, and this proves the formula. Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently, if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i): for example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define C(n, u) to be zero if u is not an integer. More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23).
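A brute-force sketch (our addition) confirming (23) for a small case:

    from itertools import product
    from math import comb

    n, i = 8, 2
    paths = sum(1 for signs in product((-1, 1), repeat=n) if sum(signs) == i)
    print(paths, comb(n, (n + i) // 2))   # both equal 56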

26.2 Paths that start and end at specific points and do not touch zero at all

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) → (n, y) that never become zero (i.e. they do not touch the horizontal axis) is given by:

    #{(m, x) → (n, y); remain > 0}
      = C(n − m, (n − m + y − x)/2) − C(n − m, (n − m − y − x)/2).    (24)

Proof. [Figure omitted: the shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.]

Consider, instead, a path that starts at (m, x) and ends up at the point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, called reflection, which is useful here. Note that to each such path there corresponds another path, obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero, and then follow the dotted path: this is the reflected path.

[Figure omitted: a path from (m, x) to (n, y) and its reflection around the horizontal axis, ending at (n, −y). Notice that the reflection starts only at the point where the path touches the horizontal line.]

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y), and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y:

    #{(m, x) → (n, y); touch zero} = #{(m, x) → (n, −y)} = C(n − m, (n − m − y − x)/2).

So, for x > 0, y > 0,

    #{(m, x) → (n, y); remain > 0}
      = #{(m, x) → (n, y)} − #{(m, x) → (n, y); touch zero}
      = C(n − m, (n − m + y − x)/2) − C(n − m, (n − m − y − x)/2).
Corollary 14. For any y > 0,

    #{(1, 1) → (n, y); remain > 0} = (y/n) #{(0, 0) → (n, y)}.    (25)

Proof. Let m = 1, x = 1 in (24):

    #{(1, 1) → (n, y); remain > 0}
      = C(n − 1, (n + y)/2 − 1) − C(n − 1, (n − y)/2 − 1)
      = (n − 1)! / [ ((n+y)/2 − 1)! ((n−y)/2)! ]  −  (n − 1)! / [ ((n−y)/2 − 1)! ((n+y)/2)! ]
      = [ (1/n)·(n+y)/2 − (1/n)·(n−y)/2 ] × n! / [ ((n+y)/2)! ((n−y)/2)! ]
      = (y/n) C(n, (n+y)/2),

and the latter term equals the total number of paths from (0, 0) to (n, y).

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by:

    #{(m, x) → (n, y); remain > z}
      = C(n − m, (n − m + y − x)/2) − C(n − m, (n − m − y − x)/2 + z).    (26)

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have

    #{(m, x) → (n, y); remain > z} = #{(m, x − z) → (n, y − z); remain > 0},

and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by:

    #{(0, 0) → (n, y); remain ≥ 0} = C(n, (n+y)/2) − C(n, (n+y)/2 + 1).    (27)

Proof.

    #{(0, 0) → (n, y); remain ≥ 0} = #{(0, 0) → (n, y); remain > −1}
      = #{(0, 1) → (n, y + 1); remain > 0}
      = C(n, (n+y)/2) − C(n, (n+y)/2 + 1),

where we used (26) with z = −1.

Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by:

    #{(0, 0) → (n, 0); remain ≥ 0} = (1/(n + 1)) C(n + 1, n/2).

Proof. Apply (27) with y = 0:

    #{(0, 0) → (n, 0); remain ≥ 0} = C(n, n/2) − C(n, n/2 + 1) = (1/(n + 1)) C(n + 1, n/2),

where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by:

    #{(0, 0) → (n, •); remain ≥ 0} = C(n, ⌈n/2⌉),    (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x.

Proof. To see this, just use (27) and sum over all y ≥ 0. If n = 2m, then we need to set y = 2z in (27) (otherwise (27) is just zero), and so we have

    #{(0, 0) → (2m, •); remain ≥ 0} = Σ_{z≥0} [ C(2m, m + z) − C(2m, m + z + 1) ]
      = C(2m, m) − C(2m, m+1) + C(2m, m+1) − C(2m, m+2) ± · · ·
      = C(2m, m) = C(n, n/2),

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1, then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero), and so we have

    #{(0, 0) → (2m + 1, •); remain ≥ 0} = Σ_{z≥0} [ C(2m+1, m + z + 1) − C(2m+1, m + z + 2) ]
      = C(2m+1, m+1) − C(2m+1, m+2) + C(2m+1, m+2) ± · · ·
      = C(2m+1, m+1) = C(n, ⌈n/2⌉).

[Figure omitted: two shaded regions of paths of length 2m, one for each of the two counts in the even case.] We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

27 The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of a given length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1 Distribution after n steps

Lemma 16.

    P0(Sn = y) = C(n, (n + y)/2) 2^{−n}.

More generally,

    P(Sn = y | Sm = x) = C(n − m, (n − m + y − x)/2) 2^{−(n−m)},

and, in particular,

    u(n) := P0(Sn = 0) = C(n, n/2) 2^{−n},    if n is even.    (29)

Proof. We have

    P0(Sn = y) = #{(0, 0) → (n, y)} × 2^{−n}

and

    P(Sn = y | Sm = x) = #{(m, x) → (n, y)} × 2^{−(n−m)},

and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n is given by

    P0(S1 > 0, S2 > 0, ..., Sn > 0 | Sn = y) = y/n.    (30)

Proof. The left side of (30) equals

    #{paths (1, 1) → (n, y) that do not touch zero} × 2^{−n}
    divided by #{paths (0, 0) → (n, y)} × 2^{−n},

which equals y/n by (25). This is called the ballot theorem, and the reason will be given in Section 29. Such a simple answer begs for a physical/intuitive explanation; but, for now, we shall just compute it.
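A quick Monte Carlo sketch (our addition) of the ballot theorem (30):

    import random

    random.seed(3)
    n, y = 12, 4
    kept = wins = 0
    while kept < 20000:
        steps = [random.choice((-1, 1)) for _ in range(n)]
        if sum(steps) != y:
            continue                  # condition on S_n = y
        kept += 1
        s, positive = 0, True
        for st in steps:
            s += st
            if s <= 0:
                positive = False
                break
        wins += positive
    print(wins / kept, y / n)         # both close to 1/3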

27.3 Some conditional probabilities

Lemma 17. If n, y ≥ 0 are both even or both odd, then

    P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = 1 − P0(Sn = y + 2)/P0(Sn = y) = (y + 1) / ( (n + y)/2 + 1 ).

In particular, if n is even,

    P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = 0) = 1 / ( (n/2) + 1 ).

Proof. We use (27):

    P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y)
      = [ C(n, (n+y)/2) − C(n, (n+y)/2 + 1) ] / C(n, (n+y)/2)
      = 1 − P0(Sn = y + 2)/P0(Sn = y).

This is further equal to

    1 − [ n! / ( ((n+y)/2 + 1)! ((n−y)/2 − 1)! ) ] × [ ((n+y)/2)! ((n−y)/2)! / n! ]
      = 1 − ( (n−y)/2 ) / ( (n+y)/2 + 1 )
      = (y + 1) / ( (n+y)/2 + 1 ).

27.4 Some remarkable identities

Theorem 26. If n is even, then

    P0(S1 ≥ 0, ..., Sn ≥ 0) = P0(S1 ≠ 0, ..., Sn ≠ 0) = P0(Sn = 0).

Proof. For even n we have, using (28) and (29),

    P0(S1 ≥ 0, ..., Sn ≥ 0) = #{(0, 0) → (n, •); remain ≥ 0} × 2^{−n} = C(n, n/2) 2^{−n} = P0(Sn = 0).

We next have

    P0(S1 ≠ 0, ..., Sn ≠ 0)
      (a)= P0(S1 > 0, ..., Sn > 0) + P0(S1 < 0, ..., Sn < 0)
      (b)= 2 P0(S1 > 0, ..., Sn > 0)
      (c)= 2 P0(ξ1 = 1) P0(S1 > 0, ..., Sn > 0 | ξ1 = 1)
      (d)= P0(1 + ξ2 > 0, 1 + ξ2 + ξ3 > 0, ..., 1 + ξ2 + · · · + ξn > 0)
      (e)= P0(ξ2 ≥ 0, ξ2 + ξ3 ≥ 0, ..., ξ2 + · · · + ξn ≥ 0)
      (f)= P0(S1 ≥ 0, ..., S_{n−1} ≥ 0)
      (g)= P0(S1 ≥ 0, ..., Sn ≥ 0),

where (a) is obvious, (b) follows from symmetry, (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1, (d) follows from P0(ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning, (e) follows from the fact that, if m is an integer, then 1 + m > 0 is equivalent to m ≥ 0, (f) follows from the replacement of (ξ2, ξ3, ...) by (ξ1, ξ2, ...), which is allowed because ξ1 is independent of ξ2, ξ3, ... and the latter are i.i.d., and finally (g) is trivial, because S_{n−1} ≥ 0 means S_{n−1} > 0 (S_{n−1} cannot take the value 0, since n − 1 is odd), and S_{n−1} > 0 implies Sn ≥ 0. Combining the two displays completes the proof.

27.5 First return to zero

In (29), we computed, for even n, the probability u(n) that the walk (started at 0) attains the value 0 in n steps. We now want to find the probability f(n) that the walk will return to 0 for the first time in n steps.

Theorem 27. Let n be even. Then

    f(n) := P0(S1 ≠ 0, ..., S_{n−1} ≠ 0, Sn = 0) = u(n) / (n − 1).

Proof. Since the random walk cannot change sign before becoming zero,

    f(n) = P0(S1 > 0, ..., S_{n−1} > 0, Sn = 0) + P0(S1 < 0, ..., S_{n−1} < 0, Sn = 0)
         = 2 P0(S1 > 0, ..., S_{n−1} > 0, Sn = 0),

by symmetry. To take care of the last term in the last expression for f(n), write

    P0(S1 > 0, ..., S_{n−1} > 0, Sn = 0)
      = P0(Sn = 0) P0(S_{n−1} > 0 | Sn = 0) P0(S1 > 0, ..., S_{n−2} > 0 | S_{n−1} > 0, Sn = 0).

Now, given that Sn = 0, we either have S_{n−1} = 1 or S_{n−1} = −1 and, by symmetry, the two cases must be equally likely, so P0(S_{n−1} > 0 | Sn = 0) = 1/2. Moreover, S_{n−1} > 0 together with Sn = 0 means S_{n−1} = 1 and, by the Markov property at time n − 1,

    P0(S1 > 0, ..., S_{n−2} > 0 | S_{n−1} = 1, Sn = 0) = P0(S1 > 0, ..., S_{n−2} > 0 | S_{n−1} = 1),

and the last probability can be computed from the ballot theorem (see (30)): it is equal to 1/(n − 1). So

    f(n) = 2 u(n) (1/2) (1/(n − 1)) = u(n)/(n − 1).

28 The reflection principle for a simple random walk in dimension 1

The reflection principle says the following. Fix a state a and put a two-sided mirror at a. If a particle is to the right of a, then you see its image on the left, and if the particle is to the left of a, then its mirror image is on the right. So if the particle is in position s, then its mirror image is in position s′ = 2a − s, because s − a = a − s′. It is clear that if a particle starts at 0 and performs a simple symmetric random walk, then its mirror image starts at 2a and performs a simple symmetric random walk.

What is more interesting is this: run the particle till it hits a (it will, for sure; see Theorem 35), and after that consider its mirror image. In other words, define

    S̃n := Sn,         if n < Ta;
    S̃n := 2a − Sn,    if n ≥ Ta,

where

    Ta = inf{n ≥ 0 : Sn = a}

is the first hitting time of a. Then:

Theorem 28. If (Sn) is a simple symmetric random walk started at 0, then so is (S̃n).

Proof. Trivially, S̃n = Sn for n < Ta. By the strong Markov property, the process (S_{Ta+m} − a, m ≥ 0) is a simple symmetric random walk, independent of the past before Ta. But a symmetric random walk has the same law as its negative, so Sn − a, n ≥ Ta, can be replaced by a − Sn, i.e. Sn by 2a − Sn; the resulting process, (S̃n, n ∈ Z+), has the same law as (Sn, n ∈ Z+).

28.1 Distribution of hitting time and maximum

We can take advantage of this principle to establish the distribution function of Ta:

Theorem 29. Let (Sn, n ≥ 0) be a simple symmetric random walk and a > 0. Then

    P0(Ta ≤ n) = 2 P0(Sn ≥ a) − P0(Sn = a) = P0(|Sn| ≥ a) − ½ P0(|Sn| = a).

Proof. If Sn ≥ a, then Ta ≤ n, so

    P0(Sn ≥ a) = P0(Ta ≤ n, Sn ≥ a).

In the last event we have n ≥ Ta, so we may replace Sn by 2a − Sn (Theorem 28):

    P0(Ta ≤ n, Sn ≥ a) = P0(Ta ≤ n, 2a − Sn ≥ a) = P0(Ta ≤ n, Sn ≤ a).

On the other hand,

    P0(Ta ≤ n) = P0(Ta ≤ n, Sn ≤ a) + P0(Ta ≤ n, Sn > a) = P0(Ta ≤ n, Sn ≤ a) + P0(Sn > a),

where the last equality follows from the fact that Sn > a implies Ta ≤ n. Combining the last two displays gives

    P0(Ta ≤ n) = P0(Sn ≥ a) + P0(Sn > a) = 2 P0(Sn ≥ a) − P0(Sn = a).

Finally, observing that Sn has the same distribution as −Sn, we get that this is also equal to P0(|Sn| ≥ a) − ½ P0(|Sn| = a).

Define now the running maximum

    Mn = max(S0, S1, ..., Sn).
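An exact check (our addition) of Theorem 29 by enumerating all paths of a small length:

    from itertools import product

    n, a = 8, 3
    hit = ge = eq = 0
    for signs in product((-1, 1), repeat=n):
        s, smax = 0, 0
        for st in signs:
            s += st
            smax = max(smax, s)
        hit += (smax >= a)     # event {T_a <= n}, since a > 0
        ge += (s >= a)         # event {S_n >= a}
        eq += (s == a)         # event {S_n = a}
    print(hit, 2 * ge - eq)    # equal counts, as Theorem 29 asserts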

Then, as a corollary to the above result, we obtain:

Corollary 19.

    P0(Mn ≥ x) = P0(Tx ≤ n) = 2 P0(Sn ≥ x) − P0(Sn = x) = P0(|Sn| ≥ x) − ½ P0(|Sn| = x).

Proof. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n, which is equivalent to Tx ≤ n, and use Theorem 29.

We can do more: we can derive the joint distribution of Mn and Sn.

Theorem 30. For x > y,

    P0(Mn ≥ x, Sn = y) = P0(Tx ≤ n, Sn = y) = P0(Sn = 2x − y).    (31)

Proof. If Tx ≤ n, we can replace Sn by 2x − Sn by applying the reflection principle (Theorem 28):

    P0(Tx ≤ n, Sn = y) = P0(Tx ≤ n, Sn = 2x − y).

But if x > y then 2x − y > x, and so {Sn = 2x − y} ⊂ {Tx ≤ n}. Hence

    P0(Tx ≤ n, Sn = y) = P0(Sn = 2x − y).

Equivalently, since Mn < x is the same as Tx > n, we have

    P0(Mn < x, Sn = y) = P0(Tx > n, Sn = y) = P0(Sn = y) − P0(Tx ≤ n, Sn = y) = P0(Sn = y) − P0(Sn = 2x − y),

which results into (31).

29 Urns and the ballot theorem

Suppose that an urn contains n items, of which a are coloured azure and b black; n = a + b. (An urn is a vessel, usually a vase furnished with a foot or pedestal, employed for different purposes: as for holding liquids, ornamental objects, the ashes of the dead after cremation, and anciently for holding ballots to be drawn. It is the latter use we are interested in in Probability and Statistics.) A sample from the urn is a sequence η = (η1, ..., ηn), where ηi indicates the colour of the i-th item picked. The original ballot problem, posed in 1887 by Bertrand, asks the following: if we start picking the items from the urn one by one without replacement, what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved by André in the same year. The answer is very simple:

Theorem 31. If an urn contains a azure items and b = n − a black items, then, as long as a ≥ b, the probability that, during sampling without replacement, the number of selected azure items is always ahead of the black ones equals (a − b)/n.

[References: Bertrand, J. (1887). Solution d'un problème. Compt. Rend. Acad. Sci. Paris 105, 369. André, D. (1887). Solution directe du problème résolu par M. Bertrand. Compt. Rend. Acad. Sci. Paris 105, 436-437.]

We shall call (η1, ..., ηn) an urn process with parameters n, a, always assuming (without any loss of generality) that a ≥ b = n − a. An urn can be realised quite easily by a random walk:

Theorem 32. Consider a simple symmetric random walk (Sn, n ≥ 0) with values in Z and let y be a positive integer. Then, conditional on the event {Sn = y}, the sequence (ξ1, ..., ξn) of the increments is an urn process with parameters n and a = (n + y)/2.

Proof. The values of each ξi are ±1, so let 'azure' items be the +1's and 'black' items the −1's. On the event {Sn = y}, letting a, b be defined by a + b = n, a − b = y, exactly a of the n increments are +1 and b are −1. We need to show that, conditional on {Sn = y}, the vector (ξ1, ..., ξn) is uniformly distributed over its set of possible values, a set with C(n, a) elements; it remains to show that all values are equally likely. Indeed, let ε1, ..., εn be elements of {−1, +1} such that ε1 + · · · + εn = y. Using the definition of conditional probability and Lemma 16, we have

    P0(ξ1 = ε1, ..., ξn = εn | Sn = y) = P0(ξ1 = ε1, ..., ξn = εn) / P0(Sn = y)
      = 2^{−n} / ( C(n, (n+y)/2) 2^{−n} ) = 1 / C(n, (n+y)/2),

which does not depend on (ε1, ..., εn). Hence all values are, indeed, equally likely.

Proof of Theorem 31. An urn with parameters n, a can be realised by a simple symmetric random walk starting from 0 and conditioned to end at Sn = y, where y = a − (n − a) = a − b (Theorem 32); so we can translate the original question into a question about the random walk. Since the 'azure' items correspond to + signs and the 'black' items to − signs, the event that the azure items are always ahead of the black ones during sampling is precisely the event {S1 > 0, ..., Sn > 0} that the random walk remains positive. But in (30) we computed that

    P0(S1 > 0, ..., Sn > 0 | Sn = y) = y/n.

Since y = a − b, the result follows.

30 The asymmetric simple random walk in dimension 1

Consider now a simple random walk with values in Z which is not necessarily symmetric. (The word "asymmetric" will always mean "not necessarily symmetric".) Thus

    p_{i,i+1} = p,    p_{i,i−1} = q = 1 − p,    i ∈ Z,

and we have

    P0(Sn = k) = C(n, (n + k)/2) p^{(n+k)/2} q^{(n−k)/2},    −n ≤ k ≤ n.

30.1 First hitting time analysis

We will use probabilistic and analytical methods to compute the distribution of the hitting times

    Tx = inf{n ≥ 0 : Sn = x},    x ∈ Z.

(We tacitly assume that S0 = 0.)

Lemma 18. For all x, y ∈ Z and all t ≥ 0,

    Px(Ty = t) = P0(T_{y−x} = t).

Proof. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x, only the relative position of y with respect to x is needed.

In view of this, it is no loss of generality to consider the random walk starting from 0.

Lemma 19. Consider a simple random walk starting from 0. The process (Tx, x ≥ 0) is also a random walk: if x > 0, the summands in Tx = T1 + (T2 − T1) + · · · + (Tx − T_{x−1}) are i.i.d. random variables.

Proof. To prove this, we make essential use of the skip-free property, namely the obvious fact that a simple random walk starting from 0 must pass through levels 1, 2, ..., x − 1 in order to reach x. The rest is a consequence of the strong Markov property.

Corollary 20. Consider a simple random walk starting from 0, and let ϕ(s) := E0 s^{T1}. Then, for x > 0,

    E0 s^{Tx} = ϕ(s)^x.

In view of Lemma 19, we need to know the distribution of T1. This random variable takes values in {0, 1, 2, ...} ∪ {∞}. We will approach it using probability generating functions.

Theorem 33. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level 1 is reached is given by

    ϕ(s) = E0 s^{T1} = ( 1 − √(1 − 4pqs²) ) / (2qs).

Proof. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. If S1 = 1, then T1 = 1. If S1 = −1, then T1 is the sum of three times: one unit of time spent to go from 0 to −1, plus a time T1′ required to go from −1 to 0 for the first time, plus a further time T1′′ required to go from 0 to +1 for the first time. See the figure below.

[Figure omitted: a sample path with S1 = −1; in order to hit 1, the random walk must first hit 0 (it takes time T1′ to do this) and then 1 (it takes a further time T1′′).]

Hence

    T1 = 1,                 if S1 = 1;
    T1 = 1 + T1′ + T1′′,    if S1 = −1.

From the strong Markov property, T1′ and T1′′ are independent, conditional on S1 = −1, and each has the same distribution as T1. This can be used as follows:

    ϕ(s) = E0[ s 1(S1 = 1) + s^{1 + T1′ + T1′′} 1(S1 = −1) ]
         = s P(S1 = 1) + s E0[ s^{T1′} s^{T1′′} | S1 = −1 ] P(S1 = −1)
         = ps + qs ϕ(s)².

The formula is obtained by solving this quadratic in ϕ(s).

Corollary 21. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level −1 is reached is given by

    E0 s^{T_{−1}} = ( 1 − √(1 − 4pqs²) ) / (2ps).

Proof. The first time n such that Sn = −1 is the first time n such that −Sn = 1. But −Sn is a simple random walk with upward transition probability q and downward transition probability p. Hence the previous formula applies with the roles of p and q interchanged.

Corollary 22. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level x ∈ Z is reached is given by

    E0 s^{Tx} = ( (1 − √(1 − 4pqs²)) / (2qs) )^x,        if x > 0;
    E0 s^{Tx} = ( (1 − √(1 − 4pqs²)) / (2ps) )^{|x|},    if x < 0.    (32)

Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21.
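A numeric sketch (our addition) comparing the closed form of Theorem 33 with a Monte Carlo estimate of E0 s^{T1}, for made-up values of p and s:

    import random
    from math import sqrt

    p, s = 0.6, 0.9
    q = 1 - p
    closed = (1 - sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)

    random.seed(4)
    est, trials = 0.0, 100000
    for _ in range(trials):
        pos, t = 0, 0
        while pos < 1 and t < 10000:     # T_1: first time the walk hits level 1
            pos += 1 if random.random() < p else -1
            t += 1
        est += s**t if pos == 1 else 0.0  # paths that never hit 1 contribute 0
    print(closed, est / trials)           # the two numbers should be close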

30.2 First return to the origin

Now let us ask: if the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time

    T0′ := inf{n ≥ 1 : Sn = 0}.

(Pay attention: it is important to write n ≥ 1 inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by

    E0 s^{T0′} = 1 − √(1 − 4pqs²).

Proof. We again do first-step analysis:

    E0 s^{T0′} = E0[ s^{T0′} 1(S1 = −1) ] + E0[ s^{T0′} 1(S1 = 1) ]
              = q E0[ s^{T0′} | S1 = −1 ] + p E0[ s^{T0′} | S1 = 1 ].

But if S1 = −1, then T0′ equals 1 plus the time required for the walk to hit 0 starting from −1; the latter is, in distribution, the same as the time required for the walk to hit 1 starting from 0. Similarly, if S1 = 1, then T0′ equals 1 plus a time whose distribution is that of the time required for the walk to hit −1 starting from 0. Hence

    E0 s^{T0′} = q E0 s^{1 + T1} + p E0 s^{1 + T_{−1}}
              = qs (1 − √(1 − 4pqs²))/(2qs) + ps (1 − √(1 − 4pqs²))/(2ps)
              = 1 − √(1 − 4pqs²).

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin. For a symmetric random walk, this was found in §27.5: P0(T0′ = n) = f(n) = u(n)/(n − 1) if n is even. For the general case, we have computed the probability generating function E0 s^{T0′} (see Theorem 34). The tool we need here is Taylor's theorem for the function f(x) = (1 + x)^α, where α is a real number. If α = n is a positive integer, then

    (1 + x)^n = Σ_{k=0}^{n} C(n, k) x^k,

by the binomial theorem. If we define C(n, k) to be equal to 0 for k > n, then we can omit the upper limit in the sum and simply write

    (1 + x)^n = Σ_{k≥0} C(n, k) x^k.

Recall that

  C(n, k) = n!/(k!(n − k)!) = n(n − 1)···(n − k + 1)/k!.

It turns out that the formula is formally true for general α, and it looks exactly the same:

  (1 + x)^α = Σ_{k≥0} C(α, k) x^k,

as long as we define

  C(α, k) = α(α − 1)···(α − k + 1)/k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but that we won't worry about which ones. For example, we will show that

  (1 − x)^{−1} = 1 + x + x² + x³ + · · · ,

which is of course the all too-familiar geometric series. Indeed, let us compute

  (1 − x)^{−1} = Σ_{k≥0} C(−1, k) (−x)^k.

But, by definition,

  C(−1, k) = (−1)(−2)···(−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,

so

  (1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^{2k} x^k = Σ_{k≥0} x^k,

as needed. As another example, consider √(1 − x). We have

  √(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2, k) (−1)^k x^k.

We can write this in more familiar symbols if we observe that

  C(1/2, k) (−1)^k = −(1/(2k − 1)) C(2k, k) 4^{−k},

and this is shown by direct computation. So

  √(1 − x) = −Σ_{k≥0} (1/(2k − 1)) C(2k, k) (x/4)^k = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) (x/4)^k.

We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs²):

  E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^{2k}.
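As a quick aside, the expansion of √(1 − x) is easy to check numerically; a minimal sketch (the truncation at 60 terms and the value x = 0.5 are arbitrary choices):

    import math

    def sqrt_series(x, terms=60):
        """Partial sum of 1 - sum_{k>=1} C(2k,k)/(2k-1) * (x/4)^k."""
        return 1 - sum(math.comb(2 * k, k) / (2 * k - 1) * (x / 4) ** k
                       for k in range(1, terms + 1))

    x = 0.5
    print(sqrt_series(x), math.sqrt(1 - x))  # both ~ 0.70710678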

But, by the definition of E0 s^{T0′},

  E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps) and then

  P0(T0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P0(S_{2k} = 0).

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

  P0( lim_{n→∞} Sn/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence it is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is

  P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

  1 − 4pq = 1 − 4p + 4p² = (1 − 2p)²,   √(1 − 4pq) = |1 − 2p| = |p − q|.

So

  P0(Tx < ∞) = ((1 − |p − q|)/(2q))^x,       if x > 0,
  P0(Tx < ∞) = ((1 − |p − q|)/(2p))^{|x|},   if x < 0.    (33)
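A sketch checking (33) by simulation (a finite horizon stands in for the event {Tx < ∞}; for p ≠ q the truncation error is tiny, and the horizon and trial count below are arbitrary choices):

    import random

    def hits(x, p, horizon=2_000):
        S = 0
        for _ in range(horizon):
            S += 1 if random.random() < p else -1
            if S == x:
                return True
        return False

    p, q, x = 0.4, 0.6, 3
    trials = 10_000
    emp = sum(hits(x, p) for _ in range(trials)) / trials
    print(emp, (p / q) ** x)   # both ~ 0.296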

This formula can be written more neatly as follows:

Theorem 35.

  P0(Tx < ∞) = ((p/q) ∧ 1)^x,        if x > 0,
  P0(Tx < ∞) = ((q/p) ∧ 1)^{|x|},    if x < 0.

In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z.

Proof. If p > q then |p − q| = p − q, and so (33) gives P0(Tx < ∞) = 1 for x > 0 and P0(Tx < ∞) = (q/p)^{|x|} for x < 0. If p < q then |p − q| = q − p and, again by (33), P0(Tx < ∞) = (p/q)^x for x > 0 and P0(Tx < ∞) = 1 for x < 0, which is what we want.

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.

Proof. We know that it cannot be positive recurrent, and the last part of Theorem 35 tells us that it is recurrent. And so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} Sn = −∞ with probability 1. This means that there is a largest value that will be ever attained by the random walk. This value is random and is denoted by

  M∞ = sup(S0, S1, S2, . . .).

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:

  P0(M∞ ≥ x) = (p/q)^x,   x ≥ 0.

Proof. If we let Mn = max(S0, . . . , Sn), then the sequence Mn is nondecreasing. Every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞ with probability 1, because p < 1/2. Indeed, we have M∞ = lim_{n→∞} Mn, and so

  P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x,   x ≥ 0.
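A quick simulation sketch of Theorem 36: since the walk drifts to −∞, the running maximum over a long but finite horizon equals M∞ with high probability (the horizon and sample size below are arbitrary choices, not part of the theorem):

    import random

    def sample_Minf(p, horizon=1_000):
        S = M = 0
        for _ in range(horizon):
            S += 1 if random.random() < p else -1
            M = max(M, S)
        return M

    p, q, x = 0.3, 0.7, 3
    samples = [sample_Minf(p) for _ in range(20_000)]
    print(sum(m >= x for m in samples) / len(samples), (p / q) ** x)  # both ~ 0.0787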

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. Unless p = 1/2, this probability is always strictly less than 1, i.e. with some positive probability the walk will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in Z. Let T0′ = inf{n ≥ 1 : Sn = 0} be the first return time to 0. Then

  P0(T0′ < ∞) = 2(p ∧ q).

Proof. The probability generating function of T0′ was obtained in Theorem 34. But

  P0(T0′ < ∞) = lim_{s↑1} E0 s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. If p = q this gives 1. In all cases, P0(T0′ < ∞) = 2(p ∧ q).

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times, i.e. Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by

  Ji = Σ_{n=1}^∞ 1(Sn = i),    (34)

a random variable first considered in Section 10, where it was also shown to be geometric: in Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J0 of visits to state 0 is geometrically distributed:

  P0(J0 ≥ k) = (2(p ∧ q))^k,   k = 0, 1, 2, . . . ,
  E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J0 ≥ 1 if and only if the first return time to zero is finite. So

  P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so

  P0(J0 ≥ k) = P0(J0 ≥ 1)^k = (2(p ∧ q))^k,   k = 0, 1, 2, . . . .

This proves the first formula. For the second formula we have

  E0 J0 = Σ_{k=1}^∞ P0(J0 ≥ k) = Σ_{k=1}^∞ (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, the same formulae apply for any i:

  Pi(Ji ≥ k) = (2(p ∧ q))^k,   Ei Ji = 2(p ∧ q)/(1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2.
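A sketch of Theorem 38 by simulation (the finite horizon stands in for "forever"; for p ≠ 1/2 the truncation error is negligible, and horizon and sample size are ad hoc choices):

    import random

    def sample_J0(p, horizon=1_000):
        S, J = 0, 0
        for _ in range(horizon):
            S += 1 if random.random() < p else -1
            J += (S == 0)
        return J

    p = 0.35
    lam = 2 * min(p, 1 - p)  # 2(p ^ q)
    samples = [sample_J0(p) for _ in range(10_000)]
    for k in range(4):
        print(k, sum(j >= k for j in samples) / len(samples), lam ** k)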

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a random walk with values in Z. If p ≥ 1/2 then

  P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},  k ≥ 1,   E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q).

If p ≤ 1/2 then

  P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1},  k ≥ 1,   E0 Jx = (p/q)^{x∨0} / (1 − 2p).

Proof. Suppose x > 0. Then

  P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),

simply because if Jx ≥ k ≥ 1 then Tx < ∞. Apply now the strong Markov property at Tx:

  P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1) = P0(Tx < ∞) (2(p ∧ q))^{k−1}.

The reason that we replaced k by k − 1 is because, given that Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain

  P0(Jx ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^{k−1}.

If p ≥ 1/2 then this gives P0(Jx ≥ k) = (2q)^{k−1}. If p ≤ 1/2 then this gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}. We repeat the argument for x < 0 and find

  P0(Jx ≥ k) = ((q/p) ∧ 1)^{|x|} (2(p ∧ q))^{k−1}.

If p ≥ 1/2 then this gives P0(Jx ≥ k) = (q/p)^{|x|} (2q)^{k−1}. If p ≤ 1/2 then this gives P0(Jx ≥ k) = (2p)^{k−1}. The formulae of the theorem follow, since x∨0 = x for x > 0 and (−x)∨0 = |x| for x < 0. As for the expectation, we apply E0 Jx = Σ_{k=1}^∞ P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2. Consider the case p < 1/2. For x < 0 we have E0 Jx = 1/(1 − 2p) regardless of x. In other words, if the random walk "drifts down" (i.e. p < 1/2) then, starting from 0, it will visit any point below 0 exactly the same number of times on the average, while E0 Jx drops down to zero geometrically fast as x tends to +∞.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae, i.e.

  Py(Jx ≥ k) = P0(J_{x−y} ≥ k),   Ey Jx = E0 J_{x−y}.

32 Duality

Consider a random walk Sk = Σ_{i=1}^k ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (S*_k, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by

  S*_k := Sn − S_{n−k},   0 ≤ k ≤ n.

Theorem 40. The dual (S*_k, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ1, ξ2, . . . , ξn) has the same distribution as (ξn, ξ_{n−1}, . . . , ξ1).

Thus, every time we have a probability for S we can translate it into a probability for S*, which is basically another probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk and x a positive integer. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then

  P0(Tx = n) = (x/n) P0(Sn = x).

Proof. Rewrite the ballot theorem in terms of the dual:

  P0(S*_1 > 0, . . . , S*_n > 0 | S*_n = x) = x/n.

But the left-hand side is

  P0(Sn − S_{n−k} > 0, 1 ≤ k ≤ n | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1 | Sn = x) = P0(Tx = n) / P0(Sn = x).

Pause for reflection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:

  E0 s^{Tx} = ((1 − √(1 − 4pqs²))/(2qs))^x.    (35)

Specialising to the symmetric case we have

  E0 s^{Tx} = ((1 − √(1 − s²))/s)^x.

In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:

  P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).    (36)

Lastly, in Theorem 41, we derived, using duality, for a simple symmetric random walk, the probabilities

  P0(Tx = n) = (x/n) P0(Sn = x).    (37)
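Identity (37) can be checked exactly for small n by dynamic programming over paths that avoid x before time n; a sketch in exact rational arithmetic (symmetric walk; assumes n + x even, otherwise both sides are 0):

    from fractions import Fraction
    import math

    def P_Tx_eq_n(x, n):
        """P0(Tx = n) for the simple symmetric walk, by dynamic programming."""
        f = {0: Fraction(1)}          # f[s] = P(at s, never having hit x yet)
        for step in range(1, n + 1):
            g = {}
            for s, pr in f.items():
                for s2 in (s - 1, s + 1):
                    if s2 < x or step == n:   # x may be reached only at the last step
                        g[s2] = g.get(s2, Fraction(0)) + pr / 2
            f = g
        return f.get(x, Fraction(0))

    x, n = 2, 10
    lhs = P_Tx_eq_n(x, n)
    rhs = Fraction(x, n) * Fraction(math.comb(n, (n + x) // 2), 2 ** n)
    print(lhs, rhs, lhs == rhs)   # 21/512 21/512 True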

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of sn in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+(2n) be the number of sides of the polygonal line that are above the horizontal axis and T−(2n) = 2n − T+(2n) the number of sides that are below. Clearly, T±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, the most likely values of T+(2n) are m = 0 and m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

  P(T+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0),   0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:

  P(T+(2n) = 2k) = Σ_{r=1}^n P(T+(2n) = 2k, T0′ = 2r).

Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write

  P(T+(2n) = 2k) = Σ_{r=1}^n P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
                 + Σ_{r=1}^n P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).

In the first case,

  P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),

because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis and so T+(2n) is reduced by 2r. Furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows

  P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).

We also observe that, by symmetry,

  P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f_{2r}.

Letting p_{2k,2n} = P(T+(2n) = 2k), we have shown that

  p_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we will realise that

  p_{2k,2n} = u_{2n−2k} (1/2) Σ_{r=1}^k f_{2r} u_{2k−2r} + u_{2k} (1/2) Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r}.

Explicitly, we have

  P(T+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^{−2n},   0 ≤ k ≤ n.    (38)

With 2n = 100, we plot the function P(T+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Figure: plot of P(T+(100) = 2k) against k, 0 ≤ k ≤ 50; the probabilities fall from about 0.08 at the two ends (k = 0 and k = 50) down to about 0.013 in the middle.]

The arcsine law*
Use of Stirling's approximation in formula (38) gives

  P(T+(2n) = 2k) ∼ (1/π) · 1/√(k(n − k)),   as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

  P( T+(2n)/(2n) ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T+(2n)/(2n) = k/n )
                            ≈ Σ_{a ≤ k/n ≤ b} (1/π) · 1/√(k(n − k))
                            = Σ_{a ≤ k/n ≤ b} (1/π) · (1/n) · 1/√((k/n)(1 − k/n)).

This is easily seen to have a limit as n → ∞, the limit being

  ∫_a^b (1/π) dt/√(t(1 − t)),

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:

  lim_{n→∞} P( T+(2n)/(2n) ≤ x ) = ∫_0^x (1/π) dt/√(t(1 − t)) = (2/π) arcsin √x.

We thus have that the limiting distribution of T+(2n)/(2n) (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π√(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk because the summands in Tx = Σ_{y=1}^x (Ty − T_{y−1}) are i.i.d., all distributed as T1:

  P(T1 ≥ 2n) = P(S_{2n} = 0) = C(2n, n) 2^{−2n} ∼ 1/√(πn).

From the last asymptotic relation, we immediately get that the expectation of T1 is infinity: ET1 = ∞. Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z: Let ξ1, ξ2, . . . be i.i.d. random vectors such that

  P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4.

We then let S0 = (0, 0) and Sn = Σ_{i=1}^n ξi, n ≥ 1. If x ∈ Z², we let x¹ be its horizontal coordinate and x² its vertical. So Sn = (Sn¹, Sn²) = (Σ_{i=1}^n ξi¹, Σ_{i=1}^n ξi²).

Lemma 20. (Sn¹, n ∈ Z+) is a random walk. (Sn², n ∈ Z+) is also a random walk. The two random walks are not independent.

Proof. Since ξ1, ξ2, . . . are independent, so are ξ1¹, ξ2¹, . . .. They are also identically distributed with

  P(ξn¹ = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
  P(ξn¹ = 1) = P(ξn = (1, 0)) = 1/4,
  P(ξn¹ = −1) = P(ξn = (−1, 0)) = 1/4.

Hence (Sn¹, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0. The same argument applies to (Sn², n ∈ Z+), showing that it, too, is a random walk. However, the two stochastic processes are not independent: for each n, the random variables ξn¹, ξn² are not independent, simply because if one takes the value 0 the other is nonzero.

Consider a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x¹, x²) into (x¹ + x², x¹ − x²). So let

  Rn = (Rn¹, Rn²) := (Sn¹ + Sn², Sn¹ − Sn²),

the latter being functions of the former. We have:

Lemma 21. (Rn, n ∈ Z+) is a random walk in two dimensions. Moreover, (Rn¹, n ∈ Z+) is a random walk in one dimension, (Rn², n ∈ Z+) is a random walk in one dimension, and the last two are independent, each being a simple symmetric random walk.

Proof. We have Rn¹ = Σ_{i=1}^n (ξi¹ + ξi²) and Rn² = Σ_{i=1}^n (ξi¹ − ξi²). So Rn = Σ_{i=1}^n ηi, where ηn = (ηn¹, ηn²) := (ξn¹ + ξn², ξn¹ − ξn²). The sequence of vectors η1, η2, . . . is i.i.d., being obtained by taking the same function on the elements of the sequence ξ1, ξ2, . . .. Hence (Rn, n ∈ Z+) is a random walk in two dimensions. A similar argument shows that each of (Rn¹, n ∈ Z+), (Rn², n ∈ Z+) is a random walk in one dimension. The possible values of each of the ηn¹, ηn² are ±1. We have

  P(ηn¹ = 1, ηn² = 1) = P(ξn = (1, 0)) = 1/4,
  P(ηn¹ = −1, ηn² = −1) = P(ξn = (−1, 0)) = 1/4,
  P(ηn¹ = 1, ηn² = −1) = P(ξn = (0, 1)) = 1/4,
  P(ηn¹ = −1, ηn² = 1) = P(ξn = (0, −1)) = 1/4.

Hence P(ηn¹ = 1) = P(ηn¹ = −1) = 1/2 and P(ηn² = 1) = P(ηn² = −1) = 1/2. To show that the last two walks are independent, we check that P(ηn¹ = ε, ηn² = ε′) = P(ηn¹ = ε) P(ηn² = ε′) for all choices of ε, ε′ ∈ {−1, +1}, and this follows from the display above.

Lemma 22.

  P(S_{2n} = 0) = (C(2n, n) 2^{−2n})².

Proof. We have

  P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}¹ = 0, R_{2n}² = 0) = P(R_{2n}¹ = 0) P(R_{2n}² = 0) = (C(2n, n) 2^{−2n})²,

where the last-but-one equality follows from the independence of the two random walks.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^∞ P(Sn = 0) = ∞. But, by the previous lemma and by Stirling's approximation, P(S_{2n} = 0) ∼ c/n, where c is a positive constant, in the sense that the ratio of the two sides goes to 1. Hence Σ_n P(Sn = 0) = ∞.
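Lemma 22 is easy to test by simulation; a sketch (the trial count is an arbitrary choice):

    import math, random

    def back_at_origin(two_n):
        x = y = 0
        for _ in range(two_n):
            dx, dy = random.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
            x += dx
            y += dy
        return x == 0 and y == 0

    n, trials = 5, 200_000
    emp = sum(back_at_origin(2 * n) for _ in range(trials)) / trials
    print(emp, (math.comb(2 * n, n) / 4 ** n) ** 2)   # both ~ 0.0606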

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Recall, from Theorem 7, that for a simple symmetric random walk (started from S0 = 0), and a < 0 < b,

  P(Ta < Tb) = b/(b − a),   P(Tb < Ta) = −a/(b − a).

We can read this as follows:

  P(S_{Ta ∧ Tb} = a) = b/(b − a),   P(S_{Ta ∧ Tb} = b) = −a/(b − a).

This last display tells us the distribution of S_{Ta ∧ Tb}. Note, in particular, that E S_{Ta ∧ Tb} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b, a < 0 < b, such that pa + (1 − p)b = 0. Indeed, if p = M/N, with M < N, M, N ∈ N, choose a = M − N and b = M. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability it exits from the left point is p, and the probability it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z). Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that P(ST = i) = pi, i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ {(i, j) : i < 0 < j} with the following distribution:

  P(A = i, B = j) = c⁻¹ (j − i) pi pj,   i < 0 < j,
  P(A = 0, B = 0) = p0,

where c⁻¹ is chosen by normalisation: Σ_i Σ_j P(A = i, B = j) = 1. This leads to

  c⁻¹ Σ_{j>0} j pj Σ_{i<0} pi + c⁻¹ Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.

Since Σ_{i∈Z} i pi = 0, we have the common value Σ_{i<0} (−i) pi = Σ_{i>0} i pi and, using this, we get that c is precisely this common value:

  c = Σ_{i<0} (−i) pi = Σ_{i>0} i pi.

Letting Ti = inf{n ≥ 0 : Sn = i}, i ∈ Z, and assuming that (A, B) is independent of (Sn), we consider the random variable

  Z = S_{TA ∧ TB}.

The claim is that P(Z = k) = pk, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has

probability p0, by definition. Suppose next k > 0. We have

  P(Z = k) = P(S_{TA ∧ TB} = k) = Σ_{i<0} Σ_{j>0} P(S_{Ti ∧ Tj} = k) P(A = i, B = j)
           = Σ_{i<0} P(S_{Ti ∧ Tk} = k) P(A = i, B = k)
           = Σ_{i<0} ((−i)/(k − i)) c⁻¹ (k − i) pi pk
           = c⁻¹ Σ_{i<0} (−i) pi pk = pk.

Similarly, P(Z = k) = pk for k < 0.
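A sketch of the construction in this proof, for one concrete zero-mean distribution (the particular distribution is an arbitrary choice; T = TA ∧ TB is the stopping time, and the fallback in the sampler only guards against floating-point slack):

    import random

    p = {-2: 0.2, -1: 0.2, 0: 0.1, 1: 0.4, 2: 0.1}   # zero mean: sum i*p_i = 0
    c = sum(-i * pi for i, pi in p.items() if i < 0)  # normalising constant

    def sample_AB():
        u, acc = random.random(), p[0]
        if u < acc:
            return 0, 0
        for i in [i for i in p if i < 0]:
            for j in [j for j in p if j > 0]:
                acc += (j - i) * p[i] * p[j] / c
                if u < acc:
                    return i, j
        return 0, 0   # unreachable up to rounding

    def sample_ST():
        A, B = sample_AB()
        if A == B == 0:
            return 0
        S = 0
        while S != A and S != B:       # run the walk until it exits (A, B)
            S += random.choice((-1, 1))
        return S

    N = 100_000
    samples = [sample_ST() for _ in range(N)]
    for i in sorted(p):
        print(i, samples.count(i) / N, p[i])   # empirical vs target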

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

  N = {1, 2, 3, . . .}.

Zero is denoted by 0. The set Z of integers consists of N, their negatives and zero:

  Z = {· · · , −3, −2, −1, 0, 1, 2, 3, · · · }.

Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a, d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.) Notice that

  gcd(a, b) = gcd(a, b − a).

Indeed, let d = gcd(a, b). Then d | a and d | b. Therefore d | b − a. So (i) holds for the pair (a, b − a). Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. For instance,

  gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38)
               = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2)
               = gcd(4, 2) = gcd(2, 2) = 2.

But in the first line we subtracted 38 exactly 3 times, and 3 is the maximum number of times that 38 "fits" into 120 (indeed, 4 × 38 > 120). So we can work faster: we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest, i.e. we invoke Euclid's algorithm (the process of long division that one learns in elementary school).
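A minimal sketch of Euclid's algorithm as just described:

    def gcd(a, b):
        """Greatest common divisor by repeated long division."""
        while b != 0:
            a, b = b, a % b   # replace the pair by (divisor, remainder)
        return a

    print(gcd(120, 38))   # 2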

The theorem of division says the following: Given two positive integers a, b, with a > b, we can find an integer q and an integer r ∈ {0, 1, . . . , b − 1} such that

  a = qb + r.

Its proof is obvious. The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 by 26 we perform the long division and obtain

  3450 = 132 × 26 + 18,

i.e. q = 132, r = 18.

Here are some other facts about the greatest common divisor: First, we have the fact that if d = gcd(a, b) then there are integers x, y such that

  d = xa + yb.

To see why this is true, let us follow the previous example again:

  gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

In the process, we used the divisions

  120 = 3 × 38 + 6,
  38 = 6 × 6 + 2.

Working backwards, we have

  2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.

Thus, gcd(38, 120) = 38x + 120y, with x = 19, y = −6. The same logic works in general.
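The "working backwards" computation can be done systematically; a sketch of the extended Euclidean algorithm, returning (d, x, y) with d = gcd(a, b) = xa + yb:

    def egcd(a, b):
        """Return (d, x, y) with d = gcd(a, b) = x*a + y*b."""
        if b == 0:
            return a, 1, 0
        d, x, y = egcd(b, a % b)
        return d, y, x - (a // b) * y

    print(egcd(38, 120))   # (2, 19, -6), i.e. 2 = 19*38 - 6*120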

Second, we can show that, for integers a, b, not both zero,

  gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.

To see this, let d be the right-hand side. Then d = xa + yb, for some x, y ∈ Z. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that d | a and d | b. Suppose this is not true, i.e. that d does not divide both numbers; say that d does not divide b. Since d > 0, we can divide b by d to find

  b = qd + r,

for some 1 ≤ r ≤ d − 1. But then b = q(xa + yb) + r, yielding

  r = (−qx)a + (1 − qy)b.

So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). So d is not minimum as claimed, and this is a contradiction. Hence r = 0, so d divides both a and b.

The greatest common divisor of 3 integers can now be defined in a similar manner. In general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. In general, a relation can be thought of as a set of pairs of elements of the set A. A relation is called reflexive if it contains the pairs (x, x). The relation < is not reflexive; the equality relation = is reflexive. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The relation < is not symmetric; the equality relation is symmetric. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric; it can be thought of as reflexive (unless we do not want to accept the fact that one may not be a friend of oneself), but is not necessarily transitive. A relation is transitive if, whenever it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i communicates-with j in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f⁻¹(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names; so can the second, and the third, and so on. Hence the total number of assignments is

  m × m × · · · × m   (n times),

i.e. m^n.

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is

  (m)n := m(m − 1)(m − 2) · · · (m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A. We denote (n)n also by n!; in other words, n! is the total number of permutations of a set of n elements. Incidentally, n! stands for the product of the first n natural numbers: n! = 1 × 2 × 3 × · · · × n.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen. The first element can be chosen in n ways, the second in n − 1 ways, and so on, the k-th in n − k + 1 ways. So we have (n)k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

  (n)k / k! =: C(n, k),

where the symbol C(n, k) is merely a name for the number computed on the left side of this equation. Clearly,

  C(n, k) = n!/(k!(n − k)!) = C(n, n − k).

Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1. The total number of subsets of A (regardless of their number of elements) is thus equal to

  Σ_{k=0}^n C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem. We thus observe that there are as many subsets in A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.

We shall let C(n, k) = 0 if n < k. The Pascal triangle relation states that

  C(n + 1, k) = C(n, k) + C(n, k − 1).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose k animals from the remaining n ones and this can be done in C(n, k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen and this can be done in C(n, k − 1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Also, if x is a real number and m a positive integer, we define

  C(x, m) = x(x − 1) · · · (x − m + 1) / m!.

If y is not an integer then, by convention, we let C(n, y) = 0.

Maximum and minimum

The minimum of two numbers a, b is denoted by

  a ∧ b = min(a, b).

Thus a ∧ b = a if a ≤ b, or = b if b ≤ a. The maximum is denoted by

  a ∨ b = max(a, b).

The notation extends to more than two numbers. For example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c).

In particular, we use the following notations:

  a+ = max(a, 0),   a− = −min(a, 0).

a+ is called the positive part of a, and a− is called the negative part of a. Both quantities are nonnegative. Note that it is always true that

  a = a+ − a−.

The absolute value of a is defined by

  |a| = a+ + a−.

Indicator function

1_A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that

  1_{A∩B} = 1_A 1_B,   1 − 1_A = 1_{A^c}.

Several quantities of interest are expressed in terms of indicators. For instance, if A1, A2, . . . are sets or clauses then the number Σ_{n≥1} 1_{An} is the number of true clauses. For example, if I go to a bank N times and Ai is the clause "I am robbed the i-th time" then Σ_{n=1}^N 1_{Ai} is the total number of times I am robbed.

If A is an event then 1_A is a random variable. It takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is

  E 1_A = 1 × P(A) + 0 × P(A^c) = P(A).

An example of its application: Since

  1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1_A)(1 − 1_B) = 1_A + 1_B − 1_A 1_B = 1_A + 1_B − 1_{A∩B},

we have

  E[1_{A∪B}] = E[1_A] + E[1_B] − E[1_{A∩B}],

which means P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Another example: Let X be a nonnegative random variable with values in N = {1, 2, . . .}. Then

  X = Σ_{n=1}^X 1

(the sum of X ones is X), so

  X = Σ_{n≥1} 1(X ≥ n),

and it is worth remembering that

  E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that

  ϕ(αx + βy) = αϕ(x) + βϕ(y)

for all x, y ∈ R^n and all α, β ∈ R. Let u1, . . . , un be a basis for R^n. Then any x ∈ R^n can be uniquely written as

  x = Σ_{j=1}^n x^j uj.

The column (x¹, . . . , x^n)ᵀ is a column representation of x with respect to the given basis. Then, because ϕ is linear,

  y := ϕ(x) = Σ_{j=1}^n x^j ϕ(uj).

But the ϕ(uj) are vectors in R^m. So if we choose a basis v1, . . . , vm in R^m we can express each ϕ(uj) as a linear combination of the basis vectors:

  ϕ(uj) = Σ_{i=1}^m a_{ij} vi.

Combining, we have

  ϕ(x) = Σ_{i=1}^m vi Σ_{j=1}^n a_{ij} x^j.

Let

  y^i := Σ_{j=1}^n a_{ij} x^j,   i = 1, . . . , m.

The column (y¹, . . . , y^m)ᵀ is a column representation of y := ϕ(x) with respect to the basis v1, . . . , vm. The matrix A of ϕ with respect to the choice of the two bases is defined as the m × n array

  A := [a_{ij}],  with rows (a_{11}, . . . , a_{1n}), . . . , (a_{m1}, . . . , a_{mn}),

and the relation between the column representation of y and the column representation of x is written as

  (y¹, . . . , y^m)ᵀ = A (x¹, . . . , x^n)ᵀ.

Suppose that ψ, ϕ are linear functions:

  R^n → R^m → R^ℓ,  the first arrow being ψ and the second ϕ.

Then ϕ ∘ ψ is a linear function from R^n to R^ℓ.

Fix three sets of bases on R^n, R^m and R^ℓ and let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. Then the matrix representation of ϕ ∘ ψ is AB, where

  (AB)_{ij} = Σ_{k=1}^m A_{ik} B_{kj},   1 ≤ i ≤ ℓ, 1 ≤ j ≤ n.

So A is a ℓ × m matrix, B is a m × n matrix, and AB is a ℓ × n matrix.

A matrix A is called square if it is an n × n matrix. If we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n, i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in R^d consists of the vectors

  e1 = (1, 0, . . . , 0)ᵀ,  e2 = (0, 1, . . . , 0)ᵀ,  . . . ,  ed = (0, 0, . . . , 1)ᵀ;

in other words, the j-th component of ei is 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence a1, a2, . . . of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, · · · } has no minimum. A property of real numbers says that if the set I is bounded from below then the maximum of all lower bounds exists; it is this that we call the infimum of I, denoted inf I and defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I is empty then inf ∅ = +∞, sup ∅ = −∞. If I has no lower bound then inf I = −∞. If I has no upper bound then sup I = +∞. inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Similarly, sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If x1, x2, . . . is a sequence of numbers then

  inf{x1, x2, . . .} ≤ inf{x2, x3, . . .} ≤ inf{x3, x4, . . .} ≤ · · ·

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it lim inf xn (limit inferior) of the sequence:

  lim inf xn = lim_{n→∞} inf{xn, x_{n+1}, x_{n+2}, . . .}.

Similarly, we define

  lim sup xn = lim_{n→∞} sup{xn, x_{n+1}, x_{n+2}, . . .}.

We note that lim inf(−xn) = − lim sup xn. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory).¹¹ But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely:

  If A1, A2, . . . are mutually disjoint events, then P(∪_{n≥1} An) = Σ_{n≥1} P(An).

Everything else follows from this axiom. That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.

If the events A, B are disjoint then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events. Similarly, P(∪n An) ≤ Σn P(An). If An are events with P(An) = 0 then P(∪n An) = 0. Similarly, if An are events with P(An) = 1 then P(∩n An) = 1. If A1, A2, . . . are increasing with A = ∪n An then P(A) = lim_{n→∞} P(An). Similarly, if A1, A2, . . . are decreasing with A = ∩n An then P(A) = lim_{n→∞} P(An).

In Markov chain theory, we work a lot with conditional probabilities. Recall that P(B|A) is defined by P(A ∩ B)/P(A), and this is a very motivated definition. Therefore, to compute P(A ∩ B), we read the definition as

  P(A ∩ B) = P(A) P(B|A).

If we interchange the roles of A, B in the definition, we get

  P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B),

and this is called Bayes' theorem. This can be generalised to any finite number of events:

  P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2).

Quite often, to compute P(A), it is useful to find disjoint events B1, B2, . . . such that ∪n Bn = Ω, in which case we write

  P(A) = Σn P(A ∩ Bn) = Σn P(A|Bn) P(Bn).

If A is an event then 1A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 everywhere else. Note that 1A 1B = 1_{A∩B}, 1_{A^c} = 1 − 1A, and E 1A = P(A).

Often, we need to argue that X ≤ Y in order to show that EX ≤ EY. Similarly, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B). Markov's inequality for a positive random variable X states that

  P(X > x) ≤ EX/x.

This follows from the logical statement X 1(X > x) ≤ X, which holds since X is positive.

¹¹ I do sometimes "cheat" from a strict mathematical point of view, in these notes and everywhere else, but not too much.

An application of this is: P(X < ∞) = lim_{x→∞} P(X ≤ x). In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable X may be defined by

  EX = ∫₀^∞ P(X > x) dx.

The formula holds for any positive random variable. In case the random variable is discrete, the formula reduces to the known one: EX = Σ_x x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^∞ x f(x) dx. For a random variable that takes values in N we can write EX = Σ_{n=1}^∞ P(X ≥ n); this is due to the fact that Σ_{n=1}^∞ P(X ≥ n) = E Σ_{n=1}^∞ 1(X ≥ n).

For a random variable which takes positive and negative values, we define

  X⁺ = max(X, 0),   X⁻ = −min(X, 0),

which are both nonnegative, and then define EX = EX⁺ − EX⁻, which is motivated from the algebraic identity X = X⁺ − X⁻. This definition makes sense only if EX⁺ < ∞ or EX⁻ < ∞.

Generating functions

A generating function of a sequence a0, a1, . . . is an infinite polynomial having the ai as coefficients:

  G(s) = a0 + a1 s + a2 s² + · · · .

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. If it exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). Note that |G(s)| ≤ Σn |an| |s|^n and this is ≤ Σn |an| if s ∈ [−1, 1]. So if Σn |an| < ∞ then G(s) exists for all s ∈ [−1, 1]. As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is

  Σ_{n=0}^∞ ρ^n s^n = 1/(1 − ρs),    (39)

as long as |ρs| < 1, i.e. |s| < 1/|ρ|.

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence pn = P(X = n), n = 0, 1, . . ., is

  ϕ(s) = Es^X = Σ_{n=0}^∞ s^n P(X = n),

making it a particularly useful object, because the pn form a probability distribution on Z+ = {0, 1, 2, . . .},

and this is why ϕ is called the probability generating function of X. It is defined for all −1 < s < 1. Note that

  Σ_{n=0}^∞ pn ≤ 1,

because we do allow X to, possibly, take the value +∞, with

  p∞ := P(X = ∞) = 1 − Σ_{n=0}^∞ pn.

The term s^∞ p∞ could formally appear in the sum defining ϕ(s); but, since |s| < 1, we have s^∞ = 0. Note that

  ϕ(0) = P(X = 0),

while

  ϕ(1) = Σ_{n=0}^∞ P(X = n) = P(X < +∞).

We can recover the probabilities pn from the formula

  pn = ϕ^{(n)}(0)/n!,

as follows from Taylor's theorem. Also, we can recover the expectation of X from

  EX = lim_{s→1} ϕ′(s),

something that can be formally seen from

  (d/ds) Es^X = E (d/ds) s^X = E X s^{X−1},

and by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

  x_{n+2} = x_{n+1} + x_n,   n ≥ 0,

with x0 = x1 = 1. If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is Σ_{n≥0} x_{n+1} s^n = s⁻¹(X(s) − x0), and the generating function of (x_{n+2}, n ≥ 0) is Σ_{n≥0} x_{n+2} s^n = s⁻²(X(s) − x0 − s x1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0), i.e.

  s⁻²(X(s) − x0 − s x1) = s⁻¹(X(s) − x0) + X(s).

Since x0 = x1 = 1, we can solve for X(s) and find

  X(s) = −1/(s² + s − 1).

But the polynomial s² + s − 1 has two roots:

  a = (√5 − 1)/2,   b = −(√5 + 1)/2.

Hence s² + s − 1 = (s − a)(s − b), and so

  X(s) = (1/(a − b)) (1/(s − b) − 1/(s − a)) = (1/(b − a)) (1/(b − s) − 1/(a − s)).

But (39) tells us that ρ^n has generating function 1/(1 − ρs) = (1/ρ)/((1/ρ) − s); thus 1/((1/ρ) − s) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1} and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of

  x_n = (1/(b − a)) ((1/b)^{n+1} − (1/a)^{n+1}),

and this is the solution to the recursion. Noting that ab = −1, we can also write this as

  x_n = ((−a)^{n+1} − (−b)^{n+1}) / (b − a),

the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
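A tiny sketch comparing the closed form with the recursion (floating point suffices for small n; rounding recovers the integer):

    import math

    def fib_closed(n):
        a = (math.sqrt(5) - 1) / 2
        b = -(math.sqrt(5) + 1) / 2
        return round(((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a))

    x = [1, 1]
    for _ in range(10):
        x.append(x[-1] + x[-2])
    print(x)                                    # 1, 1, 2, 3, 5, 8, ...
    print([fib_closed(n) for n in range(12)])   # the same integers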

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, then the so-called distribution function, i.e. the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density

  f(x) = (d/dx) P(X ≤ x).

All these objects can be referred to as "distribution" or "law" of X. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm. Even the generating function Es^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × · · · × n for large n then, since the number is so big, it's better to use logarithms:

  n! = e^{log n!} = exp( Σ_{k=1}^n log k ).

More generally, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that

  ∫_k^{k+1} log x dx ≈ log k.

Hence

  Σ_{k=1}^n log k ≈ ∫_1^n log x dx.

Now, since (d/dx)(x log x − x) = log x, it follows that

  ∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

  n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. (b) The approximation is not good when we consider ratios of products.
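Problem (b) invites a numerical experiment; the sketch below (computed in logarithms to avoid overflow) previews the coin-toss example discussed next: the rough approximation estimates the ratio C(2n, n) 2^{−2n} as exactly 1, while the refined formula stated below recovers the true ~1/√(πn) decay.

    import math

    n = 200
    exact = math.comb(2 * n, n) / 4 ** n

    def log_stirling(m):                     # log of m^m e^-m sqrt(2 pi m)
        return m * math.log(m) - m + 0.5 * math.log(2 * math.pi * m)

    refined = math.exp(log_stirling(2 * n) - 2 * log_stirling(n) - n * math.log(4))
    rough = math.exp((2 * n) * math.log(2 * n) - 2 * n
                     - 2 * (n * math.log(n) - n) - n * math.log(4))
    print(exact, refined, 1 / math.sqrt(math.pi * n), rough)   # rough gives 1.0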

For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads equals

  C(2n, n) 2^{−2n} ≈ 1,

and this is terrible, because we know that the probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:

  n! ∼ n^n e^{−n} √(2πn),

where the symbol an ∼ bn means that an/bn → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:

  P(X − n ≥ m | X ≥ n) = P(X ≥ m),   0 ≤ n ≤ m.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. The memoryless property shows that a geometric random variable has a distribution of the form

  P(X ≥ n) = λ^n,   n ∈ Z+,

where λ = P(X ≥ 1). We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the random variable J0 = Σ_{n=1}^∞ 1(Sn = 0), defined in (34). It was shown in Theorem 38 that J0 satisfies the memoryless property; in fact it was shown there that P(J0 ≥ k) = P(J0 ≥ 1)^k, a geometric function of k.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely, you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Let X be a random variable. Consider an increasing and nonnegative function g : R → R+. We can capture both properties of g neatly by writing:

  For all x, y ∈ R,   g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity. And if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. Hence

  g(X) ≥ g(x) 1(X ≥ x),

and so, by taking expectations of both sides (expectation is an increasing operator), we obtain what we are looking for:

  E g(X) ≥ g(x) P(X ≥ x).

This completely trivial thought gives the very useful Markov's inequality.
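A small sketch tying the last two sections together: a geometric(λ) variable has tail P(X ≥ n) = λ^n, while Markov's inequality (with g(y) = y) gives the much cruder bound EX/n; the sampler and sample size below are arbitrary choices.

    import random

    lam = 0.6
    def sample_X():
        """Geometric(lam): survive another unit with probability lam."""
        n = 0
        while random.random() < lam:
            n += 1
        return n

    N = 100_000
    samples = [sample_X() for _ in range(N)]
    EX = lam / (1 - lam)                 # = sum_{n>=1} lam^n
    for n in (1, 2, 5):
        tail = sum(x >= n for x in samples) / N
        print(n, tail, lam ** n, EX / n)   # empirical, exact, Markov bound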



UsefulNot useful