Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006–2009

Contents

1 Introduction
2 Examples of Markov chains
   2.1 A mouse in a cage
   2.2 Bank account
   2.3 Simple random walk (drunkard's walk)
   2.4 Simple random walk (drunkard's walk) in a city
   2.5 Actuarial chains
3 The Markov property
   3.1 Definition of the Markov property
   3.2 Transition probability and initial distribution
   3.3 Time-homogeneous chains
4 Stationarity
   4.1 Finding a stationary distribution
5 Topological structure
   5.1 The graph of a chain
   5.2 The relation "leads to"
   5.3 The relation "communicates with"
   5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1 First hitting time decomposition
   10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1 Functions of excursions
   13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1 Definition of stability
   18.2 The fundamental stability theorem for Markov chains
   18.3 Coupling
   18.4 Proof of the fundamental stability theorem
   18.5 Terminology
   18.6 Periodic chains
19 Limiting theory for Markov chains*
   19.1 Limiting probabilities for null recurrent states
   19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1 Subsequences
   23.2 Watching a Markov chain when it visits a set
   23.3 Subordination
   23.4 Deterministic function of a Markov chain
24 Applications
   24.1 Branching processes: a population growth application
   24.2 The ALOHA protocol: a computer communications application
   24.3 PageRank (trademark of Google): A World Wide Web application
   24.4 The Wright-Fisher model: an example from biology
   24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1 Paths that start and end at specific points
   26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1 Distribution after n steps
   27.2 The ballot theorem
   27.3 Some conditional probabilities
   27.4 Some remarkable identities
   27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1 First hitting time analysis
   30.2 First return to the origin
   30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1 Asymptotic distribution of the maximum
   31.2 Return probabilities
   31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix. A few starred sections should be considered "advanced" and can be omitted at first reading.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force:

m ẍ = F.

Here, x = x(t) denotes the position of the particle at time t, its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It has the property that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

xn+1 = yn
yn+1 = rem(xn, yn),   n = 0, 1, 2, . . .

In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic ("stochastic" means random): they are stochastic either because the problem they are solving is stochastic, or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫ab f(x)dx of a complicated function f: suppose that f is positive and bounded above by a constant B. Choose a pair (X0, Y0) of random numbers, a ≤ X0 ≤ b, 0 ≤ Y0 ≤ B, uniformly, and count 1 if Y0 < f(X0) and 0 otherwise. Repeat with fresh pairs (Xn, Yn) for steps n = 1, 2, . . .; perform this a large number N of times, count the number of 1's and divide by N. The resulting ratio is an approximation of ∫ab f(x)dx divided by the area B(b − a) of the rectangle. Try this on the computer!

Markov chains generalise this concept of dependence of the future only on the present: the generalisation takes us into the realm of Randomness, and we will be dealing with random variables instead of deterministic objects. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains, via the use of the tools of Probability. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.
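For the computationally minded, here is a minimal sketch of the Monte Carlo method in Python (the integrand f(x) = x² and the bound B = 1 below are illustrative choices):

import random

def monte_carlo_integral(f, a, b, B, N=100_000):
    # Throw N uniform points (X, Y) in the rectangle [a, b] x [0, B] and
    # count the fraction falling below the graph of f; the integral is that
    # fraction multiplied by the area B*(b - a) of the rectangle.
    count = 0
    for _ in range(N):
        x = random.uniform(a, b)
        y = random.uniform(0, B)
        if y < f(x):
            count += 1
    return (count / N) * B * (b - a)

print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))   # ~ 1/3

By the Law of Large Numbers (a theme we will return to), the estimate converges to the true integral, with typical error of order 1/√N.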

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2, regardless of where it was at earlier times. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99. We can summarise this information by the transition diagram:

(Transition diagram: 1 → 2 with probability α, 2 → 1 with probability β; self-loops with probabilities 1 − α and 1 − β.)

Another way to summarise the information is by the 2×2 transition probability matrix

P = ( 1−α    α  ) = ( 0.95  0.05 )
    (  β    1−β )   ( 0.99  0.01 ).

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)
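Before developing any theory, one can check the intuitive answer by simulation. Below is a minimal Python sketch of the mouse chain (the chain and its parameters are exactly those above; the sample sizes are arbitrary):

import random

alpha, beta = 0.05, 0.99   # mouse chain: 1 -> 2 w.p. alpha, 2 -> 1 w.p. beta

def time_to_move():
    # Minutes until the mouse first moves from cell 1 to cell 2:
    # a geometric random variable with parameter alpha.
    t = 0
    while True:
        t += 1
        if random.random() < alpha:
            return t

def fraction_in_room_1(n_steps=10**6):
    # Long-run fraction of time spent in cell 1 (Question 2).
    x, count = 1, 0
    for _ in range(n_steps):
        count += (x == 1)
        if x == 1:
            x = 2 if random.random() < alpha else 1
        else:
            x = 1 if random.random() < beta else 2
    return count / n_steps

print(sum(time_to_move() for _ in range(10**5)) / 10**5)  # close to 1/alpha = 20
print(fraction_in_room_1())                               # Question 2, empirically

The empirical answer to Question 2 produced this way will be explained, and computed exactly, when we study stationary distributions.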

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So if he wishes to spend more than what is available, he can't. Assume that Dn, n = 0, 1, 2, . . ., are random variables with distribution FD, that Sn, n = 0, 1, 2, . . ., are random variables with distribution FS, and also that X0, D0, S0, D1, S1, S2, . . . are independent random variables. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability, which is defined by:

px,y := P(Xn+1 = y | Xn = x),   x, y = 0, 1, 2, . . .

We easily see that if y > 0 then

px,y = P(Dn − Sn = y − x) = Σ_{z=0}^∞ P(Sn = z, Dn − Sn = y − x)
     = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x)
     = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x)
     = Σ_{z=0}^∞ FS(z) FD(z + y − x),

which can be computed from FS and FD. Of course, if y = 0, we have

px,0 = 1 − Σ_{y=1}^∞ px,y.

The transition probability matrix is

P = ( p0,0  p0,1  p0,2  · · · )
    ( p1,0  p1,1  p1,2  · · · )
    ( p2,0  p2,1  p2,2  · · · )
    (  · · ·   · · ·   · · ·  )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How often will he be visiting 0?
3. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction, so he may move forwards with the same probability that he moves backwards:

· · · −2 ⇄ −1 ⇄ 0 ⇄ 1 ⇄ 2 ⇄ 3 · · ·   (each arrow carries probability 1/2)

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
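A quick simulation of Question 4 for the biased model (the value p = 0.6 is an illustrative choice; for the totally drunk case p = 1/2 the walk does reach home eventually, but, as we shall see, the mean time is infinite, so the simulation would not settle):

import random

def steps_pub_to_home(p=0.6, home=100):
    # Steps for a walk started at 0 (the pub) to first hit `home`,
    # moving right w.p. p and left w.p. 1 - p.
    x, n = 0, 0
    while x != home:
        x += 1 if random.random() < p else -1
        n += 1
    return n

print(sum(steps_pub_to_home() for _ in range(100)) / 100)   # average over 100 nights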

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

(Figure: Salvador Dalí (1922), The Drunkard.)

Suppose that the drunkard moves from corner to corner, and let's say he's completely drunk. So he can move east, west, north or south, and we assign probability 1/4 for each move. We may again want to ask similar questions. But observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. It proposes the following model summarising the state of health of an individual on a monthly basis:

(Diagram: states H (healthy), S (sick), D (dead), with arrows H → S (0.3), S → H (0.8), S → D (0.1), H → D (0.01).)

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H, pS,S and pD,D. Clearly, pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection; therefore, also, pD,H = pD,S = 0.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual?

Clearly, the answer to this is crucial for the policy determination: the company must have an idea on how long its clients will live. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.
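The lifetime question can be answered numerically by iterating the transition matrix. A minimal Python sketch (the numerical entries are those of the diagram above):

# States: 0 = H, 1 = S, 2 = D. Off-diagonal entries from the diagram;
# diagonal entries filled in as explained in the text.
pHS, pHD, pSH, pSD = 0.3, 0.01, 0.8, 0.1
P = [[1 - pHS - pHD, pHS, pHD],
     [pSH, 1 - pSH - pSD, pSD],
     [0.0, 0.0, 1.0]]           # D is absorbing: no resurrection

def alive_after(n):
    # P(lifetime > n months | currently healthy) = P(X_n != D | X_0 = H).
    mu = [1.0, 0.0, 0.0]        # unit mass at H
    for _ in range(n):
        mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
    return mu[0] + mu[1]

for n in (1, 12, 120):
    print(n, alive_after(n))

The distribution of the lifetime is then P(lifetime = n) = alive_after(n − 1) − alive_after(n).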

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property.

Usually, T = Z+ = {0, 1, 2, 3, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . . , in+1 ∈ S,

P(Xn+1 = in+1 | Xn = in, . . . , X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, . . . , Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain

P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),

for any m, which precisely means that (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) are independent conditionally on Xn. This is true for any n and m, and so the future is independent of the past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in)
= P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions

µ(i) := P(X0 = i),   i ∈ S,
pi,j(n, n + 1) := P(Xn+1 = j | Xn = i),   i, j ∈ S, n ≥ 0,

will specify all joint distributions (and hence, by a result in Probability, all probabilities of events associated with the Markov chain):

P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n − 1, n).   (2)

The function µ(i) is called initial distribution. The function pi,j(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally,

pi,j(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. By (1),

pi,j(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} pi,i1(m, m + 1) pi1,i2(m + 1, m + 2) · · · pin−1,j(n − 1, n),

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

P(m, n) := [pi,j(m, n)]i,j∈S,

viz. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large. (This causes some analytical difficulties; but we will circumvent them by using Probability only.) In this new notation, we can write (2) as

P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),

or as

P(m, n) = P(m, ℓ) P(ℓ, n)   if m ≤ ℓ ≤ n,

something that can be called the semigroup property or Chapman-Kolmogorov equation. Of course, pi,j(n, n) = 1(i = j), by definition.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n. Therefore we can simply write

pi,j(n, n + 1) =: pi,j,

and call pi,j the (one-step) transition probability from state i to state j, while

P = [pi,j]i,j∈S

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

P(m, n) = P × P × · · · × P = P^{n−m}   (n − m times),

for all m ≤ n, and the semigroup property is now even more obvious because

P^{a+b} = P^a P^b,   for all integers a, b ≥ 0.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let µn := [µn(i) = P(Xn = i)]i∈S be the distribution of Xn, arranged in a row vector. Then we have

µn = µ0 P^n,   (3)

as follows from (1) and the definition of P.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi:

δi(j) := 1, if j = i; 0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = p δi1 + (1 − p) δi2.

Starting from a specific state i. A useful convention used for time-homogeneous chains is to write

Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write

Ei Y := E(Y | X0 = i) = Σy y P(Y = y | X0 = i) = Σy y Pi(Y = y).
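Equation (3) and the semigroup property are easy to check numerically. A short sketch using numpy, on the mouse chain of Section 2.1:

import numpy as np

alpha, beta = 0.05, 0.99
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

mu0 = np.array([1.0, 0.0])                    # unit mass at state 1
print(mu0 @ np.linalg.matrix_power(P, 100))   # mu_100 = mu_0 P^100, by (3)

a, b = 3, 5                                   # P^(a+b) = P^a P^b
print(np.allclose(np.linalg.matrix_power(P, a + b),
                  np.linalg.matrix_power(P, a) @ np.linalg.matrix_power(P, b)))

Notice how quickly µn stops depending on n here — a first glimpse of the stationarity discussed next.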

4 Stationarity

We say that the process (Xn, n = 0, 1, . . .) is stationary if it has the same law as (Xn, n = m, m + 1, . . .), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process.

We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa. Let Xn be the number of molecules in room 1 at time n. Then

pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, however, we indeed expect to have N/2 molecules in each room.

Another guess then is: Place each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be i with probability

π(i) = (N choose i) 2^{−N},   0 ≤ i ≤ N,

(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

P(X1 = i) = Σj P(X0 = j) pj,i = P(X0 = i − 1) pi−1,i + P(X0 = i + 1) pi+1,i
          = π(i − 1) pi−1,i + π(i + 1) pi+1,i
          = 2^{−N} [ (N choose i−1)(N − i + 1)/N + (N choose i+1)(i + 1)/N ] = · · · = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this; notice first:

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = P(Xm = i0, Xm+1 = i1, Xm+2 = i2, . . . , Xm+n = in),

for all m ≥ 0 and i0, . . . , in ∈ S.

In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n: P(Xn = i) = π(i) for all n. Since, by (3), the distribution µn at time n is related to the distribution at time 0 by µn = µ0 P^n, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that

πP = π.   (4)

Such a π is called a stationary or invariant distribution. The equations (4) are called balance equations.
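The "missing algebra" indicated by the dots can be delegated to a computer. The following sketch verifies, for a small illustrative N, that the binomial distribution satisfies the balance equations (4) for the Ehrenfest chain:

from math import comb

N = 10                                    # a small number of molecules

def p(i, j):
    if j == i + 1: return (N - i) / N     # Ehrenfest transition probabilities
    if j == i - 1: return i / N
    return 0.0

pi = [comb(N, i) * 2**(-N) for i in range(N + 1)]    # the binomial guess

# Balance equations (4): pi(i) = sum_j pi(j) p(j, i).
assert all(abs(pi[i] - sum(pi[j] * p(j, i) for j in range(N + 1))) < 1e-12
           for i in range(N + 1))
print("the binomial distribution is stationary for the Ehrenfest chain")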

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1, because Σ_{j∈S} pi,j = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to satisfying (4), π must be a probability:

Σ_{i∈S} π(i) = 1.

Example: waiting for a bus. The times between successive buses are i.i.d. random variables: after a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, . . .. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. Then (Xn, n ≥ 0) is a Markov chain with state space S = N = {1, 2, . . .} and transition probabilities

pi+1,i = 1,   i ≥ 1,
p1,i = p(i),   i ≥ 1.

To find the stationary distribution, write the equations (4):

π(i) = Σ_{j≥1} π(j) pj,i = π(i + 1) + π(1) p(i),   i ≥ 1.

These can easily be solved:

π(i) = π(1) Σ_{j≥i} p(j),   i ≥ 1.

To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives

π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^{−1}.

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

ASSUMPTION:   Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.

In other words, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.
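A numerical sanity check of the bus example, for an interarrival distribution with finite support (the particular p below is an illustrative choice):

p = {1: 0.2, 2: 0.5, 3: 0.3}                  # P(next bus in i minutes)
mean = sum(i * pi for i, pi in p.items())     # expected interarrival time

pi1 = 1.0 / mean                              # pi(1) = 1 / (mean interarrival)
tail = lambda i: sum(p.get(j, 0.0) for j in range(i, 4))
pi = {i: pi1 * tail(i) for i in (1, 2, 3)}

print(abs(sum(pi.values()) - 1.0) < 1e-12)    # pi is a probability distribution
print(all(abs(pi[i] - (pi.get(i + 1, 0.0) + pi[1] * p[i])) < 1e-12
          for i in (1, 2, 3)))                # pi solves the balance equations

This agrees with the ASSUMPTION discussion: π is well defined exactly when the mean interarrival time is finite, and π(1) is its reciprocal.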

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, . . . , d}. Thus, each x ∈ Sd is of the form x = (x1, . . . , xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

π(x) = 1/d!,   x ∈ Sd.

You can check that it satisfies the balance equations.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength.) Consider an infinite binary vector ξ = (ξ1, ξ2, . . .), ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

ξ(n + 1) = ϕ(ξ(n)).

Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:

P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2,   for any n = 0, 1, 2, . . . and any r = 1, 2, . . .   (5)

The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε1, . . . , εr ∈ {0, 1},

P(ξ1(n + 1) = ε1, . . . , ξr(n + 1) = εr)
= P(ξ1(n) = 0, ξ2(n) = ε1, . . . , ξr+1(n) = εr) + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . . , ξr+1(n) = 1 − εr),

and, by (5) applied at time n, each of the two terms equals 2^{−(r+1)}, so the sum equals 2^{−r}, as required. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread: it stretches the dough till it doubles in length and then folds it back in the middle.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

π = πP   (balance equations)
π1 = 1   (normalisation condition).

In other words,

π(i) = Σ_{j∈S} π(j) pj,i,   i ∈ S,
Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations, but we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:

F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j.

Think of π(i) pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

F(A, Ac) = F(Ac, A),   for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Multiply π(i) by Σ_{j∈S} pi,j (which equals 1) and split the left sum also:

π(i) Σ_{j∈A} pi,j + π(i) Σ_{j∈Ac} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Summing over i ∈ A, we see that the first term on the left cancels with the first on the right (both equal Σ_{i∈A} Σ_{j∈A} π(i) pi,j, after renaming the summation indices), while the remaining two terms give the equality we seek.

(Figure: schematic representation of the flow balance relations.)

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β (and, consequently, p11 = 1 − α, p22 = 1 − β). Assume 0 < α, β < 1. Choosing A = {1}, flow balance reads

π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

π(1) = cβ,   π(2) = cα.

The constant c is found from π(1) + π(2) = 1. Thus,

π(1) = β/(α + β),   π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952, π(2) = 0.05/1.04 ≈ 0.048. That is, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
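Proposition 1 is also easy to test numerically. Here is a sketch for the 2-state chain above, checking flow balance across every cut:

alpha, beta = 0.05, 0.99
p = {(1, 1): 1 - alpha, (1, 2): alpha, (2, 1): beta, (2, 2): 1 - beta}
pi = {1: beta / (alpha + beta), 2: alpha / (alpha + beta)}

def F(A, Ac):
    # probability flow from A to its complement under pi
    return sum(pi[i] * p[(i, j)] for i in A for j in Ac)

for A in ({1}, {2}):
    Ac = {1, 2} - A
    print(A, abs(F(A, Ac) - F(Ac, A)) < 1e-12)    # True: the flows balance

For a two-state chain the only nontrivial cut is A = {1}, and flow balance is precisely the single equation π(1)α = π(2)β used above.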

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

0 ⇄ 1 ⇄ 2 ⇄ 3 ⇄ 4 · · ·   (each right arrow carries probability p, each left arrow probability q; at 0 the walk stays put with probability q)

The balance equations are:

π(i) = p π(i − 1) + q π(i + 1),   i = 1, 2, 3, . . .

But we can make them simpler by choosing A = {0, 1, . . . , i − 1}. For i ≥ 1, flow balance gives

F(A, Ac) = π(i − 1) p = π(i) q = F(Ac, A).

These are much simpler than the previous ones, and we can solve them immediately:

π(i) = (p/q)^i π(0),   i ≥ 0.

Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by

(i, j) ∈ E ⇐⇒ pi,j > 0.

In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time with positive probability. In other words,

i leads to j (written as i → j) ⇐⇒ Pi(∃ n ≥ 0 Xn = j) > 0.

Notice that this relation is reflexive: ∀ i ∈ S, i → i. Furthermore:

Theorem 1. The following are equivalent:
(i) i → j;
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, . . . , in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i → j. Since

0 < Pi(∃ n ≥ 0 Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),

the sum on the right must have a nonzero term, i.e. Pi(Xn = j) > 0 for some n, and so (iii) holds.
• Suppose that (ii) holds. We have

Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 · · · pin−1,j,

where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term, hence Pi(Xn = j) > 0, i.e. (iii) holds. Conversely, if (iii) holds, look at the last display again: since the sum is positive, one of its terms is positive, meaning that (ii) holds.
• Suppose now that (iii) holds. Since

Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j),

the latter probability is also positive, and so (i) holds.

Corollary 1. The relation → is reflexive and transitive: if i → j and j → k then i → k.

5.3 The relation "communicates with"

Define next the relation

i communicates with j (written as i ↔ j) ⇐⇒ i → j and j → i.

Obviously, this relation is symmetric (i ↔ j ⇐⇒ j ↔ i).

Corollary 2. The relation ↔ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric; it is reflexive and transitive because → is so.

Just as any equivalence relation, it partitions S into equivalence classes, known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

[i] := {j ∈ S : j ↔ i}.   (6)

So, by definition, [i] = [j] if and only if i ↔ j. Two communicating classes are either identical or completely disjoint.

(Figure: a chain with states 1, 2, 3, 4, 5, 6.)

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

Σ_{j∈C} pi,j = 1,   for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: If i ∈ C and if i → j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ∈ C and i → j. We then can find states i1, . . . , in−1 such that pi,i1 pi1,i2 · · · pin−1,j > 0. Since pi,i1 > 0, and C is closed, we have i1 ∈ C. Similarly, pi1,i2 > 0, and so i2 ∈ C, and so on. Continuing in this way, we conclude that j ∈ C. The converse is left as an exercise.

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i → j but not j → i. (Why?)

Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction.

(Figure: the general structure of a Markov chain. The communicating classes C1, C2, C3 are not closed; the classes C4, C5 are closed.)

If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts. A general Markov chain can have an infinite number of communicating classes.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In other words, the chain is irreducible if the state space S is a single closed communicating class, i.e., starting from any state, we can reach any other state.

If a state i is such that pi,i = 1 then we say that i is an absorbing state. To put it otherwise, the single-element set {i} is a closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. (Why?)
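Communicating classes, and whether they are closed, can be computed mechanically from the transition matrix. A sketch follows; the matrix P below is a hypothetical chain with the same class structure as the 6-state figure (states 0–5 play the roles of 1–6, and the probability values are illustrative):

def communicating_classes(P):
    n = len(P)
    # reach[i][j] = True iff i leads to j (reflexive-transitive closure)
    reach = [[P[i][j] > 0 or i == j for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    seen, classes = set(), []
    for i in range(n):
        if i not in seen:
            C = {j for j in range(n) if reach[i][j] and reach[j][i]}
            seen |= C
            # closed iff no one-step arrow leads out of C
            closed = all(P[x][j] == 0 for x in C for j in range(n) if j not in C)
            classes.append((sorted(C), closed))
    return classes

P = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.0, 0.0, 0.5],
     [0.0, 0.3, 0.0, 0.7, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
print(communicating_classes(P))
# [([0], False), ([1, 2], False), ([3, 4], True), ([5], True)]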

5.4 Period

Consider the following chain:

(Figure: states 1, 2, 3, 4 arranged on a square; the chain moves between adjacent vertices.)

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by observing that if the chain is in the set C1 = {1, 3} at some point of time then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by pij(n) the entries of P^n. Since P^{m+n} = P^m P^n, we have

pij(m+n) ≥ pik(m) pkj(n),   for all m, n ≥ 0 and i, j, k ∈ S.

So pii(2n) ≥ (pii(n))² and, more generally, pii(ℓn) ≥ (pii(n))^ℓ for all ℓ ∈ N. Therefore, if pii(n) > 0 then pii(ℓn) > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if pii(n) > 0, then pii(m) > 0. So, whether it is possible to return to i in m steps can be decided by looking at the integer divisors of m.

The period of an essential state is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.

A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.

Theorem 2 (the period is a class property). If i ↔ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let Di = {n ∈ N : pii(n) > 0}, Dj = {n ∈ N : pjj(n) > 0}, so that d(i) = gcd Di, d(j) = gcd Dj. If i ↔ j, there is α ∈ N with pij(α) > 0 and

there is β ∈ N with pji(β) > 0. Then pjj(α+β) ≥ pij(α) pji(β) > 0, showing that α + β ∈ Dj. Now let n ∈ Di. Then pii(n) > 0, and so

pjj(α+n+β) ≥ pij(α) pii(n) pji(β) > 0,

showing that α + n + β ∈ Dj. Therefore d(j) | α + β, and d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference. From the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state then there exists n0 such that pii(n) > 0 for all n ≥ n0.

Proof. Pick n1, n2 ∈ Di such that n2 − n1 = 1. (Such n1, n2 exist because gcd Di = 1 and Di is closed under addition — a standard number-theoretic fact.) Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Therefore

n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.

Because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di and Di is closed under addition, we have n ∈ Di as well.

Of particular interest are irreducible chains, i.e. chains where all states communicate with one another. Since the period is a class property, all states of an irreducible chain have a common period, so the following makes sense:

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

One can also show that an irreducible chain with finitely many states is aperiodic if and only if there exists an n such that pij(n) > 0 for all i, j ∈ S.
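The period of an irreducible chain can be computed from the graph alone. A standard way, sketched below in Python, is to do a breadth-first search from any state and take the gcd of (level[u] + 1 − level[v]) over all arrows u → v:

from math import gcd
from collections import deque

def period(P, root=0):
    n, level = len(P), [None] * len(P)
    level[root], q, d = 0, deque([root]), 0
    while q:
        u = q.popleft()
        for v in range(n):
            if P[u][v] > 0:
                if level[v] is None:
                    level[v] = level[u] + 1
                    q.append(v)
                else:
                    d = gcd(d, level[u] + 1 - level[v])
    return d

# The square of Section 5.4 (period 2), and a deterministic 3-cycle
# (period 3 -- answering the question posed above):
square = [[0, .5, 0, .5], [.5, 0, .5, 0], [0, .5, 0, .5], [.5, 0, .5, 0]]
cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(period(square), period(cycle3))   # 2 3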

More generally, if an irreducible chain has period d, then we can decompose the state space into d sets C0, C1, . . . , Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, . . . , Cd−1 such that

Σ_{j∈Cr+1} pij = 1,   i ∈ Cr, r = 0, 1, . . . , d − 1.

(Here Cd := C0.)

(Figure: the internal structure of a closed communicating class with period d = 4.)

Proof. Define the relation

i ~d j ⇐⇒ pij(nd) > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, . . . , id−1, with pi0,i1 > 0, pi1,i2 > 0, . . ., and with corresponding classes C0, C1, . . . , Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0, then, necessarily, id ∈ C0: such a path shows that it is possible to go from i0 to id in d steps, i.e. in a number of steps which is an integer multiple of d.

We now show that if i belongs to one of these classes and if pij > 0 then, necessarily, j belongs to the next class. Take, for example, i ∈ C0. Suppose pij > 0 but j ∉ C1; say, j ∈ C2. Consider the path

i0 → i → j → i2 → i3 → · · · → id → i0.

Such a path is possible because of the choice of the i0, i1, . . ., and by the assumptions that i0 ~d i and j ~d i2. Its length is congruent to d − 1 modulo d (why?), so it shows that it is possible to go from i0 back to i0 in a number of steps which is not a multiple of d, which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time of a set of states A by

TA := inf{n ≥ 0 : Xn ∈ A}.

(We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A the two times coincide. We avoid excessive notation by using the same letter for both, but the reader is warned to be alert as to which of the two variables we are considering at each time.)

We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

(Figure: states 0, 1, 2; states 0 and 2 are absorbing; from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6.)

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define

ϕ(i) := Pi(TA < ∞).

We then have:

Theorem 5. The function ϕ(i) satisfies

ϕ(i) = 1,   i ∈ A,
ϕ(i) = Σ_{j∈S} pij ϕ(j),   i ∉ A.

Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + T′A, where T′A is the remaining time until A is hit. We first have

Pi(TA < ∞) = Σ_{j∈S} Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σ_{j∈S} Pi(T′A < ∞ | X1 = j) pij.

But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, . . .). Hence, conditionally on X1, the event {T′A < ∞} is independent of X0. But the Markov chain is homogeneous, which implies that

P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).

Combining the above, we have

Pi(TA < ∞) = Σ_{j∈S} ϕ(j) pij,

as needed. For the second part of the proof, let ϕ′(i) be another solution. Then

ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ϕ′(j).

By self-feeding this equation, we have

ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij [ Σ_{k∈A} pjk + Σ_{k∉A} pjk ϕ′(k) ]
      = Σ_{j∈A} pij + Σ_{j∉A} pij Σ_{k∈A} pjk + Σ_{j∉A} Σ_{k∉A} pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1), and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2), while the third term is nonnegative. By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we let ϕ(i) = Pi(T0 < ∞), i.e. we choose A = {0}, the set that contains only state 0. We immediately have ϕ(0) = 1 (if the chain starts from 0 it takes no time to hit 0), and ϕ(2) = 0 (if the chain starts from 2 it will never hit 0). As for ϕ(1), we have

ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time,

ψ(i) := Ei TA.

We have:

Theorem 6. The function ψ(i) satisfies

ψ(i) = 0,   i ∈ A,
ψ(i) = 1 + Σ_{j∉A} pij ψ(j),   i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. Start the chain from i. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0. If i ∉ A, then, as above, TA = 1 + T′A. Therefore,

Ei TA = 1 + Ei T′A = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),

which is the second of the equations (we may restrict the sum to j ∉ A because ψ(j) = 0 for j ∈ A). The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}, and let ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, because it takes no time to hit A if the chain starts from A. As for ψ(1), we have

ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been obvious from the start: the underlying experiment is the flipping of a coin with "success" probability 2/3 — success meaning that the chain leaves state 1 — therefore the mean number of flips until the first success is 3/2.)
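For larger chains one solves the linear systems of Theorems 5 and 6 mechanically. For the 3-state example above the systems have a single unknown, and the sketch below just isolates it (the probabilities are those of the figure):

# 0 and 2 absorbing; from 1: to 0 w.p. 1/2, stay w.p. 1/3, to 2 w.p. 1/6.
p10, p11, p12 = 1/2, 1/3, 1/6

phi1 = p10 / (1 - p11)   # phi(1) = p11 phi(1) + p10, since phi(0)=1, phi(2)=0
psi1 = 1 / (1 - p11)     # psi(1) = 1 + p11 psi(1), since psi(0)=psi(2)=0
print(phi1, psi1)        # 0.75 and 1.5, i.e. 3/4 and 3/2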

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities

ϕAB(i) = Pi(TA < TB).

It should be clear that the equations satisfied are:

ϕAB(i) = 1,   i ∈ A,
ϕAB(i) = 0,   i ∈ B,
ϕAB(i) = Σj pij ϕAB(j),   otherwise.

Furthermore, ϕAB is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x, a reward f(x) is earned. This happens up to the first hitting time TA of the set A. The total reward is f(X0) + f(X1) + · · · + f(XTA). We are interested in the mean total reward

h(x) := Ex Σ_{n=0}^{TA} f(Xn).

Clearly, if x ∈ A, then h(x) = f(x), because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then, as argued earlier, TA = 1 + T′A, where T′A is the remaining time until set A is reached. Hence

h(x) = Ex Σ_{n=0}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=1}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=0}^{T′A} f(Xn+1),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, . . .. Hence, by the Markov property,

Ex Σ_{n=0}^{T′A} f(Xn+1) = Σ_{y∈S} pxy h(y).

Thus, the set of equations satisfied by h are:

h(x) = f(x),   x ∈ A,
h(x) = f(x) + Σ_{y∈S} pxy h(y),   x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, for i = 1, 2, 3, unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die; why?)

(Figure: rooms 1, 2, 3, 4, with room 2 adjacent to rooms 1 and 3, and rooms 1 and 3 each adjacent to rooms 2 and 4; from each room the thief moves to one of its two neighbours with probability 1/2.)

The question is to compute the average total reward if he starts from room 2. If we let h(i) be the average total reward if he starts from room i, the first-step equations give

h(4) = 0,
h(1) = 1 + (1/2) h(2),
h(3) = 3 + (1/2) h(2),
h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
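Equivalently, the thief's equations form a 3×3 linear system which a computer solves at once (a sketch with numpy):

import numpy as np

# Unknowns in the order (h(1), h(2), h(3)); h(4) = 0 is already eliminated:
#    h(1) - h(2)/2          = 1
#   -h(1)/2 + h(2) - h(3)/2 = 2
#           -h(2)/2 + h(3)  = 3
A = np.array([[1.0, -0.5, 0.0],
              [-0.5, 1.0, -0.5],
              [0.0, -0.5, 1.0]])
f = np.array([1.0, 2.0, 3.0])
print(np.linalg.solve(A, f))   # [5. 8. 7.], i.e. h(1)=5, h(2)=8, h(3)=7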

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. The successive stakes (Yk, k ∈ N) form the strategy of the gambler: if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), your fortune after the n-th toss is

Xn = X0 + Σ_{k=1}^n Yk ξk.

Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss; so, if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk: it can only depend on ξ1, . . . , ξk−1 and – as is often the case in practice – on other considerations of, e.g., astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that – at least – the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem. The gambler starts with X0 = x pounds and has a target: to make b pounds (b > x). She will stop playing if her money falls to a (a < x). To solve the problem, consider the simplest case, Yk = 1 for all k. We are then clearly dealing with a Markov chain in the state space {a, a + 1, a + 2, . . . , b − 1, b} with transition probabilities

pi,i+1 = p,   pi,i−1 = q = 1 − p,   a < i < b,

and where the states a, b are absorbing. Consider the hitting time of state y:

Ty := inf{n ≥ 0 : Xn = y}.

We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x.

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

Px(Tb < Ta) = [ (q/p)^{x−a} − 1 ] / [ (q/p)^{b−a} − 1 ],  if p ≠ q;   (x − a)/(b − a),  if p = q = 1/2,   a ≤ x ≤ b.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let

ϕ(x, ℓ) := Px(Tℓ < T0),   0 ≤ x ≤ ℓ.

The general solution will then be obtained from Px(Tb < Ta) = ϕ(x − a, b − a). (Think why!) To simplify things, write ϕ(x) instead of ϕ(x, ℓ). We obviously have ϕ(0) = 0, ϕ(ℓ) = 1,

and, by first-step analysis,

ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1),   0 < x < ℓ.

Since p + q = 1, we can write this as

p [ϕ(x + 1) − ϕ(x)] = q [ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0).

Assuming that λ := q/p ≠ 1, we find

ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^{x−1} + · · · + λ + 1] δ(0) = [(λ^x − 1)/(λ − 1)] δ(0).

Since ϕ(ℓ) = 1, we have [(λ^ℓ − 1)/(λ − 1)] δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),   0 ≤ x ≤ ℓ, λ ≠ 1.

If λ = 1, then q = p, and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so ϕ(x) = x/ℓ, 0 ≤ x ≤ ℓ. Summarising, we have obtained

ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1),  if λ ≠ 1;   x/ℓ,  if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

Px(T0 < ∞) = (q/p)^x,  if p > 1/2;   1,  if p ≤ 1/2.

Proof. We have

Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2,

Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
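Corollary 4 is easy to test by simulation. A sketch (the parameters x = 5, p = 0.6 and the high barrier b = 200 are illustrative; stopping at the barrier only slightly underestimates the ruin probability when p > 1/2):

import random

def ruin_formula(x, p):
    return ((1 - p) / p)**x if p > 0.5 else 1.0

def ruin_simulated(x, p, b=200, trials=5000):
    ruined = 0
    for _ in range(trials):
        pos = x
        while 0 < pos < b:                    # stop at ruin or at the barrier
            pos += 1 if random.random() < p else -1
        ruined += (pos == 0)
    return ruined / trials

x, p = 5, 0.6
print(ruin_formula(x, p))     # (q/p)^x = (2/3)^5 ~ 0.132
print(ruin_simulated(x, p))   # should be close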

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, . . . , Xn) for all n ∈ Z+. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for all n ∈ Z+, conditionally on the present Xn, the future process (Xn, Xn+1, . . .) is independent of the past process (X0, . . . , Xn). We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain. Let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . . , Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, . . . , Xτ) if τ < ∞. In particular, any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A and any states j1, j2, . . .,

P(Xτ+1 = j1, Xτ+2 = j2, . . . , A | Xτ = i, τ < ∞) = pi,j1 pj1,j2 · · · P(A | Xτ = i, τ < ∞).

Indeed,

P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
(a) = P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
(b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
(c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
(d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),

where (a) is just logic; (b) is by the definition of conditional probability; (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn), together with the ordinary splitting (Markov) property at the deterministic time n; and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain; the random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property). Schematically:

(Figure: a trajectory cut at the successive visit times Ti = Ti(1), Ti(2), Ti(3), Ti(4) to state i. These excursions are independent!)

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the Ti of Section 6: our Ti here is always positive, while the Ti of Section 6 was allowed to take the value 0. If X0 ≠ i then the two definitions coincide.

State i may or may not be visited again. If it is, we let Ti(2) be the time of the second visit, Ti(3) the time of the third visit, and so on. Formally, we recursively define

Ti(r) := inf{n > Ti(r−1) : Xn = i},   r = 2, 3, . . .

(And we set Ti(1) := Ti.) All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

Xi(r) := (Xn, Ti(r) ≤ n < Ti(r+1)),   r = 1, 2, . . .,

and let also

Xi(0) := (Xn, 0 ≤ n < Ti(1)).

We think of Xi(0), Xi(1), Xi(2), . . . as "random variables", in a generalised sense: they are, in fact, random trajectories. We refer to them as excursions of the process. So Xi(r) is the r-th excursion: the chain visits

So X_i^{(r)} is the r-th excursion: the chain visits state i for the r-th time and "goes for an excursion"; we "watch" it up until the next time it returns to state i.

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, ... are independent random variables. Moreover, for time-homogeneous Markov chains, the excursions X_i^{(1)}, X_i^{(2)}, ... are also identical in law (i.i.d.).

Proof. Apply the Strong Markov Property (SMP) at T_i^{(r)}: the SMP says that, given the state at the stopping time T_i^{(r)}, the future after T_i^{(r)} is independent of the past before T_i^{(r)}. But the state at T_i^{(r)} is, by definition, equal to i. Therefore the future is independent of the past, and this proves the first claim. The claim that X_i^{(1)}, X_i^{(2)}, ... have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time. ∎

A trivial but important corollary of the observation of this section is:

Corollary 5. (Numerical) functions g(X_i^{(0)}), g(X_i^{(1)}), g(X_i^{(2)}), ... of the excursions are independent random variables.

Example: Define
  Λ_r := Σ_{T_i^{(r)} ≤ n < T_i^{(r+1)}} 1(X_n = j).
The meaning of Λ_r is this: it expresses the number of visits to a state j between two successive visits to the state i. The corollary ensures us that Λ_1, Λ_2, ... are independent random variables.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions.

First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:
  P_i(X_n = i for infinitely many n) = 1.
Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:
  P_i(X_n = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility: 0 < P_i(X_n = i for infinitely many n) < 1. We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state i (excluding the possibility that X_0 = i):
  J_i = Σ_{n=1}^∞ 1(X_n = i) = sup{r ≥ 0 : T_i^{(r)} < ∞}.
Notice that: state i is visited infinitely many times ⟺ J_i = ∞.

Suppose that the Markov chain starts from state i. Notice that f_{ii} := P_i(T_i < ∞) can either be equal to 1 or smaller than 1, and that f_{ii} = P_i(J_i ≥ 1). We claim that:

Lemma 4. Starting from X_0 = i, the random variable J_i is geometric:
  P_i(J_i ≥ k) = f_{ii}^k, k ≥ 0.

Proof. Indeed, the event {J_i ≥ k} requires that T_i^{(k-1)} < ∞ and that the chain returns to i at least once more after T_i^{(k-1)}. By the strong Markov property at time T_i^{(k-1)}, the past before T_i^{(k-1)} is independent of the future, and that is why we get the product:
  P_i(J_i ≥ k) = P_i(J_i ≥ k-1) P_i(J_i ≥ 1) = f_{ii} P_i(J_i ≥ k-1) = f_{ii}^2 P_i(J_i ≥ k-2) = ··· = f_{ii}^k. ∎

Because either f_{ii} = 1 or f_{ii} < 1, we conclude what we asserted earlier: every state is either recurrent or transient. If f_{ii} = 1 we have P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1; this means that the state i is recurrent. If f_{ii} < 1, then P_i(J_i < ∞) = 1; this means that the state i is transient.

We summarise this, together with another useful characterisation of recurrence and transience, in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. f_{ii} = P_i(T_i < ∞) = 1.
3. P_i(J_i = ∞) = 1.
4. E_i J_i = ∞.

Proof. The equivalence of the first two statements was proved before the lemma, and the equivalence of 2 and 3 is Lemma 4. Clearly, P_i(J_i = ∞) = 1 implies E_i J_i = ∞. Conversely, since J_i is a geometric random variable, its expectation cannot be infinite without J_i itself being infinite with positive (hence full) probability. ∎

Lemma 6. The following are equivalent:
1. State i is transient.
2. f_{ii} = P_i(T_i < ∞) < 1.
3. P_i(J_i < ∞) = 1.
4. E_i J_i < ∞.

Proof. The equivalence of the first two statements was proved above. If f_{ii} < 1 then, owing to the fact that J_i is a geometric random variable, E_i J_i = Σ_{k≥1} f_{ii}^k < ∞. Conversely, if we assume E_i J_i < ∞, then J_i cannot take the value +∞ with positive probability, i.e. P_i(J_i < ∞) = 1, which, by the definition of J_i, means that i is visited finitely many times with probability 1, i.e. that i is transient. ∎

Corollary 6. If i is a transient state then p_{ii}^{(n)} → 0, as n → ∞.

Proof. If i is transient then E_i J_i < ∞. But
  E_i J_i = E_i Σ_{n=1}^∞ 1(X_n = i) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ p_{ii}^{(n)},
and if a series is finite then its summands go to 0. ∎
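Here is a small numerical sketch of Lemmas 4-6 (added as an illustration; the walk is truncated at a finite horizon, which slightly undercounts very late returns): for the biased simple random walk on Z with up-probability p = 0.7, state 0 is transient, and its return probability is known to be f_{00} = 2 min(p, 1-p) = 0.6, so the tail of J_0 should be close to 0.6^k.

```python
import numpy as np

# Sketch: the number J0 of returns to 0 of a biased random walk is
# (up to truncation at `horizon`) geometric: P(J0 >= k) ~ f00^k.
rng = np.random.default_rng(1)
p, horizon, trials = 0.7, 1000, 5000
steps = np.where(rng.random((trials, horizon)) < p, 1, -1)
J = (steps.cumsum(axis=1) == 0).sum(axis=1)   # returns to 0 in each walk
f00 = 2 * min(p, 1 - p)
for k in range(1, 5):
    print(k, (J >= k).mean(), f00 ** k)
```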

10.1 First hitting time decomposition

Define
  f_{ij}^{(n)} := P_i(T_j = n), n ≥ 1,
the probability to hit state j for the first time at time n, starting from i. Notice the difference with p_{ij}^{(n)}: the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly f_{ij}^{(n)} ≤ p_{ij}^{(n)}, but the two sequences are related in an exact way:

Lemma 7. p_{ij}^{(n)} = Σ_{m=1}^n f_{ij}^{(m)} p_{jj}^{(n-m)}, n ≥ 1.

Proof. We have
  p_{ij}^{(n)} = P_i(X_n = j) = Σ_{m=1}^n P_i(T_j = m, X_n = j) = Σ_{m=1}^n P_i(T_j = m) P_i(X_n = j | T_j = m).
The first factor equals f_{ij}^{(m)}, by definition. For the second factor, we use the strong Markov property:
  P_i(X_n = j | T_j = m) = P_i(X_n = j | X_m = j) = p_{jj}^{(n-m)}. ∎
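The identity of Lemma 7 is easy to test numerically. The sketch below (with a hypothetical 3-state matrix) computes the first-passage probabilities by the standard recursion f_{xj}^{(1)} = p_{xj}, f_{xj}^{(m)} = Σ_{k≠j} p_{xk} f_{kj}^{(m-1)}, and compares the two sides of the lemma.

```python
import numpy as np

# Sketch: numerical check of Lemma 7, p_ij^(n) = sum_m f_ij^(m) p_jj^(n-m).
P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.25, 0.25]])
i, j, n = 0, 2, 6

f = [None, P[:, j].copy()]        # f[m][x] = f_{xj}^(m)
for m in range(2, n + 1):
    prev = f[-1].copy()
    prev[j] = 0.0                 # first passage: exclude paths revisiting j
    f.append(P @ prev)

Pk = [np.eye(3)]                  # Pk[k] = P^k
for _ in range(n):
    Pk.append(Pk[-1] @ P)

lhs = Pk[n][i, j]
rhs = sum(f[m][i] * Pk[n - m][j, j] for m in range(1, n + 1))
print(lhs, rhs)                   # the two numbers coincide
```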

Corollary 7. If j is a transient state then p_{ij}^{(n)} → 0, as n → ∞, for all states i.

Proof. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums on the right-hand side)
  Σ_{n=1}^∞ p_{ij}^{(n)} = Σ_{m=1}^∞ f_{ij}^{(m)} Σ_{n=m}^∞ p_{jj}^{(n-m)} = (1 + E_j J_j) Σ_{m=1}^∞ f_{ij}^{(m)} = (1 + E_j J_j) P_i(T_j < ∞).
If j is transient, we have Σ_{n=1}^∞ p_{jj}^{(n)} = E_j J_j < ∞, so the left-hand side is finite, and hence p_{ij}^{(n)} → 0 as n → ∞. ∎

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let f_{ij} := P_i(T_j < ∞), we have
  i ⇝ j ⟺ f_{ij} > 0.
(The latter is simply an expression in symbols of what we said above.) We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and, moreover,
  f_{ij} = P_i(T_j < ∞) = 1.
In particular, j ⇝ i.

Proof. Start with X_0 = i. If i is recurrent and i ⇝ j then there is a positive probability (= f_{ij}) that j will appear in one of the i.i.d. excursions X_i^{(0)}, X_i^{(1)}, ..., and so the probability that j will appear in a specific excursion is positive:
  g_{ij} := P_i(T_j < T_i) > 0.
So the random variables
  δ_{j,r} := 1(j appears in excursion X_i^{(r)}), r = 0, 1, 2, ...,
are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence, not only f_{ij} > 0, but also f_{ij} = 1. Since, starting from i, the chain thus visits j infinitely often with probability one, the strong Markov property applied at T_j shows that P_j(X_n = j for infinitely many n) = 1, i.e. j is recurrent; and since i itself is visited infinitely often along the same trajectories, j ⇝ i. ∎

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ̸⇝ i. Then there is m such that P_i(A) := P_i(X_m = j) > 0. Let B := {X_n = i for infinitely many n}. Since j ̸⇝ i,
  P_i(A ∩ B) = P_i(X_m = j, X_n = i for infinitely many n) = 0.
Hence
  0 < P_i(A) = P_i(A ∩ B) + P_i(A ∩ B^c) = P_i(A ∩ B^c) ≤ P_i(B^c),
i.e. P_i(B) < 1, and so i is a transient state. ∎

So if a communicating class is recurrent then it contains no inessential states, and so it is closed. Conversely, if a communicating class contains a state which is transient then, necessarily, all its states are transient.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. If the chain is started in C then, by closedness, X remains in C; by finiteness, some state i ∈ C must be visited infinitely many times with positive probability. By the dichotomy established above, this state is recurrent. Hence, by Theorem 10, every other j ∈ C is also recurrent. Hence all states are recurrent. ∎

11 Positive recurrence

Recall that if we fix a state i and let T_i be the first return time to i, then either P_i(T_i < ∞) = 1 (in which case we say that i is recurrent) or P_i(T_i < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U_0, U_1, U_2, ... be i.i.d. random variables, uniformly distributed in the unit interval [0, 1], and let
  T := inf{n ≥ 1 : U_n > U_0}.
It should be clear that this T can, for example, appear in a certain game: it represents the number of variables we need to draw to get a value larger than the initial one. First, observe that T is finite:
  P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 - 1/(n+1)) = 1,
because
  P(T > n) = P(U_1 ≤ U_0, ..., U_n ≤ U_0) = ∫_0^1 P(U_1 ≤ x, ..., U_n ≤ x | U_0 = x) dx = ∫_0^1 x^n dx = 1/(n+1).
What is the expectation of T? We have
  ET = Σ_{n=0}^∞ P(T > n) = Σ_{n=0}^∞ 1/(n+1) = ∞.
You are invited to think about the meaning of this: if the initial value U_0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!
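A simulation sketch of this example (each run capped at a large number of draws, so the reported mean slightly underestimates): T is finite in every run, yet the sample mean keeps growing with the sample size, reflecting ET = ∞.

```python
import numpy as np

# Sketch: T = inf{n >= 1 : U_n > U_0} is finite a.s. but ET = infinity.
rng = np.random.default_rng(2)

def draw_T(cap=10**6):
    u0 = rng.random()
    for n in range(1, cap + 1):
        if rng.random() > u0:
            return n
    return cap  # truncation; rare, and it biases the mean downwards

T = np.array([draw_T() for _ in range(10_000)])
print("P(T > n) vs 1/(n+1):", [((T > n).mean(), 1 / (n + 1)) for n in (1, 4, 9)])
print("sample mean of T   :", T.mean())  # grows without bound with more samples
```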

We now classify a recurrent state i as follows. We say that i is positive recurrent if
  E_i T_i < ∞,
and that i is null recurrent if
  E_i T_i = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Moreover, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as a "Law". But this is misleading: it is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z_1, Z_2, ... be i.i.d. random variables.
(i) Suppose E|Z_1| < ∞ and let µ = EZ_1. Then
  P(lim_{n→∞} (Z_1 + ··· + Z_n)/n = µ) = 1.
(ii) Suppose E max(Z_1, 0) = ∞, but E min(Z_1, 0) > -∞. Then
  P(lim_{n→∞} (Z_1 + ··· + Z_n)/n = ∞) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X_0, X_1, X_2, ... of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

d.i. Since functions of independent random variables are also independent.d. . We are interested in the existence of the limit time-average reward = lim 1 t→∞ t t f (Xk ). . because the excursions Xa . assign a reward f (x). 35 . Xa(2) . (r) (r+1) G(Xa(r) ) = max{f (Xt ) : Ta ≤ t < Ta }. n as n → ∞. for an excursion X . define G(X ) to be the maximum reward received during this excursion. in this example. . given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been fixed once and for all) the sequence of excursions Xa(1) . (1) (2) (1) (n) (8) 13. Example 1: To each state x ∈ S. are i. . we can think of as a reward received when the chain is in state x.13. with probability 1. k=0 which represents the ‘time-average reward’ received. as in Example 2.. Note that EG(Xa(1) ) = EG(Xa(2) ) = · · · = E G(Xa(0) ) | X0 = a = Ea G(Xa(0) ). but define G(X ) to be the total reward received over the excursion X . we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit. forms an independent sequence of (generalised) random variables. In particular. . Example 2: Let f (x) be as in example 1. are i. random variables. and because if we start with X0 = a. for reasonable choices of the functions G taking values in R. . G(Xa(3) ). (7) Whichever the choice of G.1 Functions of excursions Namely. we have that G(Xa(1) ). The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a. Specifically. the law of large numbers tells us that G(Xa ) + · · · + G(Xa ) → EG(Xa(1) ).2 Average reward Consider now a function f : S → R. G(Xa(2) ). . Xa(3) . . the (0) initial excursion Xa also has the same law as the rest of them. G(Xa(r) ) = (r) (r+1) Ta ≤t<Ta f (Xt ). Xa . Then. Thus. .i. which.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a T_a < ∞) and assume that the distribution of X_0 is such that P(T_a < ∞) = 1. Let f : S → R+ be a bounded "reward" function. Then, with probability one, the "time-average reward" f̄ exists and is a deterministic quantity:
  P(lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = f̄) = 1,
where
  f̄ = E_a[Σ_{n=0}^{T_a - 1} f(X_n)] / E_a T_a = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself; it will be discovered. The idea is this: suppose that t is large enough. We are looking for the total reward received between 0 and t. We break this into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, N_t, of complete excursions from the ground state a that occur between 0 and t:
  N_t := max{r ≥ 1 : T_a^{(r)} ≤ t}.

[Figure: a timeline marked 0, T_a^{(1)}, T_a^{(2)}, T_a^{(3)}, T_a^{(4)}, T_a^{(5)}, t; in the figure, N_t = 5.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
  Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t - 1} G_r + G^first_t + G^last_t,
where G_r is given by (7), G^first_t is the reward up to time T_a = T_a^{(1)} or t (whichever is smaller), and G^last_t is the reward over the last incomplete cycle. We assumed that f is bounded, say |f(x)| ≤ B. Hence |G^first_t| ≤ B T_a. Since P(T_a < ∞) = 1, we have B T_a / t → 0, and so
  (1/t) G^first_t → 0, as t → ∞, with probability one.
A similar argument shows that
  (1/t) G^last_t → 0, as t → ∞, with probability one.

So we are left with the sum of the rewards over complete cycles. Write now
  (1/t) Σ_{r=1}^{N_t - 1} G_r = (N_t / t) · (1/N_t) Σ_{r=1}^{N_t - 1} G_r.  (9)
Look again at what the Law of Large Numbers (8) tells us:
  lim_{n→∞} (1/n) Σ_{r=1}^n G_r = EG_1,  (10)
with probability one. Since T_a^{(r)} → ∞ as r → ∞, it follows that the number N_t of complete cycles before time t will also be tending to ∞, as t → ∞; hence (10) also holds along the subsequence N_t.

Now look at the sequence
  T_a^{(r)} = T_a^{(1)} + Σ_{k=2}^r (T_a^{(k)} - T_a^{(k-1)})
and see what the Law of Large Numbers says again: the differences T_a^{(k)} - T_a^{(k-1)}, k = 2, 3, ..., are i.i.d. random variables with common expectation E_a T_a, and, since T_a^{(1)}/r → 0,
  lim_{r→∞} T_a^{(r)}/r = E_a T_a.  (11)
By definition, the visit with index r = N_t to the ground state a is the last one before t (see also the figure above):
  T_a^{(N_t)} ≤ t < T_a^{(N_t + 1)}.
Hence
  T_a^{(N_t)}/N_t ≤ t/N_t < T_a^{(N_t + 1)}/N_t.
But (11) tells us that
  lim_{t→∞} T_a^{(N_t)}/N_t = lim_{t→∞} T_a^{(N_t + 1)}/N_t = E_a T_a.
Hence
  lim_{t→∞} N_t/t = 1/E_a T_a.  (12)
Using (10) and (12) in (9) we have
  lim_{t→∞} (1/t) Σ_{r=1}^{N_t - 1} G_r = EG_1 / E_a T_a.
Thus
  lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = EG_1 / E_a T_a =: f̄.

To see that f̄ is as claimed, just observe that
  EG_1 = E[Σ_{T_a^{(1)} ≤ t < T_a^{(2)}} f(X_t)] = E_a[Σ_{0 ≤ t < T_a} f(X_t)],
from the strong Markov property. ∎

The constant f̄ is a sort of average over an excursion.

Comment 1: It is important to understand what the formula says. The numerator of f̄ is computed by summing up the rewards over an excursion (making sure that not both endpoints are included); the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both: we may write
  EG_1 = E_a Σ_{0 ≤ t < T_a} f(X_t) or EG_1 = E_a Σ_{0 < t ≤ T_a} f(X_t).
Both of them are equal.

[Figure: the rewards f(X_0) = f(a), f(X_1), f(X_2), ..., f(X_{T_a - 1}) collected over one excursion from a, between times 0 and T_a.]

Comment 2: A very important particular case occurs when we choose f(x) := 1(x = b), i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, with probability 1, for all b ∈ S,
  lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = b) = E_a[Σ_{k=0}^{T_a - 1} 1(X_k = b)] / E_a T_a.  (13)
The numerator is the average number of times state b is visited between two successive visits to state a.

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence
  lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = a) = 1/E_a T_a,  (14)
with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes E_a T_a units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/E_a T_a.

Comment 4: Suppose that two states, a, b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b T_b. The equality of these two limits gives:
  E_a[Σ_{k=0}^{T_a - 1} 1(X_k = b)] / E_a T_a = 1/E_b T_b,
i.e. (average no. of times state b is visited between two successive visits to state a) / E_a T_a = 1/E_b T_b.
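The theorem is easy to test by simulation. The sketch below uses a hypothetical 3-state chain and reward (chosen here for illustration) and compares the time-average reward along one long trajectory with Σ_x f(x)π(x), where π is computed as the normalised left eigenvector of P for the eigenvalue 1.

```python
import numpy as np

# Sketch: time-average reward along one trajectory vs sum_x f(x) pi(x).
rng = np.random.default_rng(3)
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, 0.0, 5.0])

w, V = np.linalg.eig(P.T)                       # pi solves pi = pi P
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

x, total, t = 0, 0.0, 200_000
for _ in range(t):
    total += f[x]
    x = rng.choice(3, p=P[x])
print(total / t, f @ pi)                        # the two numbers are close
```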

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
  ν^{[a]}(x) := E_a Σ_{n=0}^{T_a - 1} 1(X_n = x), x ∈ S,  (15)
should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then ν^{[a]}(x)/E_a T_a is the long-run fraction of times that state x is visited. In view of this, this should have something to do with the concept of stationary distribution discussed a few sections ago. (Notice that the choice of notation ν^{[a]} is justifiable through the notation (6) for the communicating class of the state a.)

Note: Although ν^{[a]} depends on the choice of the "ground" state a, we will soon realise that this dependence vanishes when the chain is irreducible, so there is nothing lost in fixing a once and for all; the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent.^5 Consider the function ν^{[a]} defined in (15). Then ν^{[a]} satisfies the balance equations
  ν^{[a]} = ν^{[a]} P, i.e. ν^{[a]}(x) = Σ_{y∈S} ν^{[a]}(y) p_{yx}, x ∈ S.
Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^{[a]}(x) < ∞.

^5 We stress that we do not need the stronger assumption of positive recurrence here.

Proof. We start the Markov chain with X_0 = a. Recurrence tells us that T_a < ∞ with probability one. Since a = X_{T_a}, we can write
  ν^{[a]}(x) = E_a Σ_{n=1}^{T_a} 1(X_n = x).  (16)
This was a trivial but important step: we replaced the n = 0 term with the n = T_a term; both of them equal 1(x = a). The real reason for doing so is that we wanted to replace a function of X_0, X_1, X_2, ... by a function of X_1, X_2, X_3, ...

by definition. n ≤ Ta ) Pa (Xn = x.and thus be in a position to apply first-step analysis. is finite when a is positive recurrent. First note that. (n) Next. so there exists m ≥ 1. So we have shown ν [a] (x) = y∈S pyx ν [a] (y). for all x ∈ S. indeed. in this last step. for all n ∈ N. x ∈ S. ν [a] = ν [a] Pn . Note: The last result was shown under the assumption that a is a recurrent state. x a. Suppose that the Markov chain possesses a positive recurrent state a. such that pax > 0. n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). We just showed that ν [a] = ν [a] P. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). where. from (16). 40 . We are now going to assume more. Therefore. such that pxa > 0. namely. which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. that a is positive recurrent. which. xa so. which are precisely the balance equations. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. let x be such that a x. ax ax From Theorem 10. so there exists n ≥ 1. We now have: Corollary 9. Define Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . we used (16). Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. (17) This π is a stationary distribution. Xn−1 = y. n ≤ Ta ) pyx Pa (Xn−1 = y. Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta . ν [a] (x) < ∞.

Proof. It only remains to show that π^{[a]} is a probability, i.e. that it sums up to one; and indeed
  Σ_{x∈S} π^{[a]}(x) = (1/E_a T_a) Σ_{x∈S} ν^{[a]}(x) = 1.
(Also, since ν^{[a]}(a) = 1, π^{[a]} is not trivially equal to zero.) That π^{[a]} P = π^{[a]} is immediate: π^{[a]} is simply a multiple of ν^{[a]}, and we have already shown, in Proposition 2, that ν^{[a]} P = ν^{[a]}. ∎

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique. In other words: if there is a positive recurrent state, then the set of linear equations
  π(x) = Σ_{y∈S} π(y) p_{yx}, x ∈ S,  Σ_{y∈S} π(y) = 1
has at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.

Proof. We already know that j is recurrent (Theorem 10). Start with X_0 = i and consider the excursions X_i^{(r)}, r = 0, 1, 2, .... The probability g_{ij} = P_i(T_j < T_i) that j occurs in a specific excursion is positive, and, since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. In words, to decide whether j will occur in a specific excursion, we toss a coin with probability of heads g_{ij} > 0; the coins are i.i.d.; if heads shows up at the r-th excursion, then j occurs in that excursion; if tails shows up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R_1 (respectively, R_2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the excursions from i with indices R_1, R_1 + 1, R_1 + 2, ..., R_2; the indicated stretch ζ covers the excursions with indices R_1 + 1, ..., R_2 and dominates the time between the occurrence of j in excursion R_1 and its occurrence in excursion R_2, a random variable which, by the strong Markov property, has the same law as T_j under P_j.]

We just need to show that the sum of the durations of the excursions with indices R_1 + 1, ..., R_2 + 1, i.e.
  ζ := Σ_{r=1}^{R_2 - R_1 + 1} Z_r,
has finite expectation, where the Z_r are i.i.d. with the same law as T_i under P_i. But R_2 - R_1 is a geometric random variable:
  P_i(R_2 - R_1 = r) = (1 - g_{ij})^{r-1} g_{ij}, r = 1, 2, ...,
so, by Wald's lemma, the expectation of ζ equals
  E_i ζ = (E_i T_i) × E_i(R_2 - R_1 + 1) = (E_i T_i) × (g_{ij}^{-1} + 1).
Since g_{ij} > 0 and E_i T_i < ∞, we have that E_i ζ < ∞. Hence E_j T_j < ∞, i.e. j is positive recurrent. ∎

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

In other words, all states of an irreducible chain are of the same type, so we may classify the chain itself: an irreducible Markov chain is called recurrent if one (and hence all) of its states are recurrent; transient if one (and hence all) states are transient; positive recurrent if one (and hence all) states are positive recurrent; and null recurrent if one (and hence all) states are null recurrent.

[Diagram: an IRREDUCIBLE CHAIN is either RECURRENT or TRANSIENT; a recurrent chain is further either POSITIVE RECURRENT or NULL RECURRENT.]

Example: Consider the chain we have seen before:

[Figure: states 0, 1, 2; state 0 is absorbing (p_{00} = 1); from state 1, p_{10} = 1/3, p_{11} = 1/2, p_{12} = 1/6; state 2 is absorbing (p_{22} = 1).]

Clearly, state 0 (and likewise state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). Hence, by Corollary 9, there is a stationary distribution. But observe that there are many, infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by
  π(0) = α, π(1) = 0, π(2) = 1 - α,
is a stationary distribution. (You should think why this is OBVIOUS!) Notice that states 0 and 2 do not communicate; this lack of communication is, indeed, what prohibits uniqueness.
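A quick numerical confirmation of the claim (a sketch):

```python
import numpy as np

# Sketch: every pi = (alpha, 0, 1 - alpha) is stationary for this chain.
P = np.array([[1.0, 0.0, 0.0],
              [1/3, 1/2, 1/6],
              [0.0, 0.0, 1.0]])
for alpha in (0.0, 0.25, 0.9):
    pi = np.array([alpha, 0.0, 1.0 - alpha])
    print(alpha, np.allclose(pi @ P, pi))   # True in every case
```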

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress again that a stationary distribution may not be unique. Under irreducibility, however, it is:

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution and fix a ground state a. Since the chain is irreducible and positive recurrent, we have P_i(T_a < ∞) = 1 for all states i. Hence, if X_0 is distributed according to π,
  P(T_a < ∞) = Σ_{x∈S} π(x) P_x(T_a < ∞) = Σ_{x∈S} π(x) = 1.  (18)
Therefore Theorem 14 applies (with f = 1(· = x); see Comment 2), and, with probability one,
  (1/t) Σ_{n=1}^t 1(X_n = x) → π^{[a]}(x), as t → ∞, for all x ∈ S.
On the other hand, since we start the Markov chain from the stationary distribution π, we have P(X_n = x) = π(x) for all n, so
  E[(1/t) Σ_{n=1}^t 1(X_n = x)] = (1/t) Σ_{n=1}^t P(X_n = x) = π(x), for all t.
By a theorem of Analysis (the Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations in the almost sure limit:
  E[(1/t) Σ_{n=1}^t 1(X_n = x)] → π^{[a]}(x), as t → ∞.
Therefore π(x) = π^{[a]}(x), for all x ∈ S. Hence there is only one stationary distribution. ∎

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π^{[a]} = π^{[b]}.

Proof. Both π^{[a]} and π^{[b]} are stationary distributions; by the above theorem, they are equal. The cycle formula of Comment 4 merely re-expresses this equality: for all x ∈ S,
  E_a[Σ_{k=0}^{T_a - 1} 1(X_k = x)] / E_a T_a = E_b[Σ_{k=0}^{T_b - 1} 1(X_k = x)] / E_b T_b. ∎

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,
  π(a) = 1/E_a T_a.

Proof. There is a unique stationary distribution, so an arbitrary stationary distribution π must be equal to the specific one π^{[a]}. In particular, π(a) = π^{[a]}(a) = 1/E_a T_a. ∎

The following is a sort of converse, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0 and let C be the communicating class of i. Let π(·|C) be the distribution defined by restricting π to C:
  π(x|C) := π(x)/π(C), x ∈ C, where π(C) = Σ_{y∈C} π(y).
Consider the restriction of the chain to C; namely, delete all transition probabilities p_{xy} with x ∈ C, y ∉ C. We can easily see that we thus obtain a Markov chain with state space C whose stationary distribution is π(·|C); this is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By Corollary 12, π(i|C) = 1/E_i T_i. But π(i|C) > 0. Therefore E_i T_i < ∞, i.e. i is positive recurrent. ∎

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution. Consider now a Markov chain and decompose its state space into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π^C, which assigns positive probability to each state in C.

This π^C is defined by picking a state (any state!) a ∈ C and letting π^C := π^{[a]}. An arbitrary stationary distribution π must be a linear combination of these π^C:

Proposition 3. Any stationary distribution π is necessarily of the form
  π = Σ_{C: C positive recurrent communicating class} α_C π^C,
where α_C ≥ 0 and Σ_C α_C = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then, by Corollary 13, x belongs to some positive recurrent class C. For each such C define the conditional distribution
  π(x|C) := π(x)/π(C), where π(C) := Σ_{x∈C} π(x).
Then (π(x|C), x ∈ C) is a stationary distribution (for the chain restricted to C). By our uniqueness result, π(x|C) ≡ π^C(x). So the above decomposition holds with α_C = π(C). ∎

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(X_n = x) as n → ∞. Recall that a sequence of real numbers a_n may fail to converge while its Cesàro averages (1/n)(a_1 + ··· + a_n) do converge; for example, take a_n = (-1)^n. So far we have seen, under irreducibility and positive recurrence conditions, that the Cesàro averages (1/n) Σ_{k=1}^n P(X_k = x) converge to π(x), as n → ∞. Stability asks for more.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an oscillatory motion ad infinitum; still, we may say it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the differential equation
  ẋ = -ax + b, x ∈ R, a > 0.
There are myriads of applications giving physical interpretation to this. The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero:
  x* = b/a.
If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If not, then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: several solution curves x(t) approaching the level x* as t grows.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous-time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
  X_{n+1} = ρ X_n + ξ_n,
and define a stochastic sequence X_1, X_2, ... by means of this recursion. Here we assume that X_0, ξ_0, ξ_1, ξ_2, ... are given, mutually independent random variables, and that the ξ_n all have the same law. (There are many interpretations; I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever.) But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point. However, notice what happens when we solve the recursion; and we are in a happy situation here, for we can do it:
  X_1 = ρX_0 + ξ_0
  X_2 = ρX_1 + ξ_1 = ρ(ρX_0 + ξ_0) + ξ_1 = ρ²X_0 + ρξ_0 + ξ_1
  X_3 = ρX_2 + ξ_2 = ρ³X_0 + ρ²ξ_0 + ρξ_1 + ξ_2
  ······
  X_n = ρ^n X_0 + ρ^{n-1}ξ_0 + ρ^{n-2}ξ_1 + ··· + ρξ_{n-2} + ξ_{n-1}.
It will follow that |ρ| < 1 is the condition for stability. Indeed, since |ρ| < 1, the first term ρ^n X_0 goes to 0 as n → ∞. The second term,
  X̃_n := ρ^{n-1}ξ_0 + ρ^{n-2}ξ_1 + ··· + ρξ_{n-2} + ξ_{n-1},
which is a linear combination of ξ_0, ..., ξ_{n-1}, does not converge. But notice that if we let
  X̂_n := ρ^{n-1}ξ_{n-1} + ρ^{n-2}ξ_{n-2} + ··· + ρξ_1 + ξ_0,
then, since the ξ's are i.i.d., we have
  P(X̃_n ≤ x) = P(X̂_n ≤ x), uniformly in x,
while, under nice conditions (e.g. if the ξ_n are bounded), X̂_n does converge, to
  X̂_∞ = Σ_{k=0}^∞ ρ^k ξ_k.
What this means is that, while the limit of X̃_n does not exist, the limit of its distribution does:
  lim_{n→∞} P(X̃_n ≤ x) = lim_{n→∞} P(X̂_n ≤ x) = P(X̂_∞ ≤ x).
This is precisely the reason why, for such systems, we cannot define stability in a way other than convergence of the distributions of the random variables.
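A simulation sketch of this phenomenon, taking the ξ_n to be standard Gaussian (an assumption made here only for the illustration): two ensembles started from very different initial values end up with the same empirical distribution, even though individual trajectories never settle.

```python
import numpy as np

# Sketch: the law of X_n for X_{n+1} = rho X_n + xi_n, |rho| < 1, settles
# to the same limit regardless of X_0; here xi_n ~ N(0,1) for illustration.
rng = np.random.default_rng(6)
rho, n, copies = 0.8, 200, 10_000
for x0 in (0.0, 50.0):
    x = np.full(copies, x0)
    for _ in range(n):
        x = rho * x + rng.standard_normal(copies)
    # stationary mean 0 and standard deviation 1/sqrt(1 - rho^2) ~ 1.667
    print(x0, round(x.mean(), 3), round(x.std(), 3))
```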

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution π then, for any initial distribution,
  lim_{n→∞} P(X_n = x) = π(x).
In particular, for time-homogeneous Markov chains, the n-step transition matrix P^n converges to a matrix with rows all equal to π:
  lim_{n→∞} p_{ij}^{(n)} = π(j), i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, i.e. we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. How should we then define what P(X = x, Y = y) is? A straightforward (often not so useful) construction is to assume that the random variables are independent and define P(X = x, Y = y) = P(X = x)P(Y = y). But suppose that we have some further requirement: for example, to try to make the coincidence probability P(X = Y) as large as possible. Coupling refers to the various methods for constructing this joint law.

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e. f(x) = Σ_y h(x, y), g(y) = Σ_x h(x, y), and which maximises P(X = Y).
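One possible solution to the exercise (sketched here; the notes leave it to the reader) is the so-called maximal coupling: put mass min(f(x), g(x)) on the diagonal and spread the leftover mass off the diagonal, achieving P(X = Y) = Σ_x min(f(x), g(x)), which is the largest possible value.

```python
import numpy as np

# Sketch: a maximal coupling of two discrete laws f and g.
f = np.array([0.5, 0.3, 0.2])
g = np.array([0.2, 0.2, 0.6])

m = np.minimum(f, g)                   # mass placed on the diagonal X = Y
h = np.diag(m)
fr, gr = f - m, g - m                  # residuals, with disjoint supports
if fr.sum() > 0:
    h += np.outer(fr, gr) / fr.sum()   # any off-diagonal arrangement works

print(h.sum(axis=1), h.sum(axis=0))    # marginals: f and g, as required
print("P(X = Y) =", np.trace(h), "= sum_x min(f, g) =", m.sum())
```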

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (X_n, n ≥ 0) and the law of a stochastic process Y = (Y_n, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space. We say that a random time T is a meeting time of the two processes if
  X_n = Y_n for all n ≥ T.
This T is a random variable that takes values in the set of times, or the value +∞ on the event that the two processes never meet. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,
  |P(X_n = x) - P(Y_n = x)| ≤ P(T > n).  (19)
If, in particular, P(T < ∞) = 1, i.e. T is finite with probability one, then
  |P(X_n = x) - P(Y_n = x)| → 0, uniformly in x ∈ S, as n → ∞.

Proof. Suppose that the two processes have been coupled. We have
  P(X_n = x) = P(X_n = x, n < T) + P(X_n = x, n ≥ T)
   = P(X_n = x, n < T) + P(Y_n = x, n ≥ T)
   ≤ P(n < T) + P(Y_n = x),
and hence P(X_n = x) - P(Y_n = x) ≤ P(T > n). Repeating the argument with the roles of X and Y interchanged, we obtain P(Y_n = x) - P(X_n = x) ≤ P(T > n). Combining the two inequalities, we obtain the first assertion, (19). Since (19) holds for all x ∈ S and since the right-hand side does not depend on x, we can write it as
  sup_{x∈S} |P(X_n = x) - P(Y_n = x)| ≤ P(T > n).
Assume now that P(T < ∞) = 1. Then
  lim_{n→∞} P(T > n) = P(T = ∞) = 0,
and therefore sup_{x∈S} |P(X_n = x) - P(Y_n = x)| → 0, as n → ∞. ∎

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (X_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_{ij} and X_0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Y_n, n ≥ 0), has the law of a Markov chain with the same transition probabilities p_{ij}, but with Y_0 distributed according to the stationary distribution π. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen. It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(X_n = x, Y_m = y).

We now couple them by doing the straightforward thing: we assume that X = (X_n, n ≥ 0) is independent of Y = (Y_n, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
  T := inf{n ≥ 0 : X_n = Y_n}.

Consider the process
  W_n := (X_n, Y_n).
Then (W_n, n ≥ 0) is a Markov chain with state space S × S. Its initial state W_0 = (X_0, Y_0) has distribution
  P(W_0 = (x, y)) = µ(x)π(y),
its (1-step) transition probabilities are
  q_{(x,y),(x′,y′)} := P(W_{n+1} = (x′, y′) | W_n = (x, y)) = P(X_{n+1} = x′ | X_n = x) P(Y_{n+1} = y′ | Y_n = y) = p_{x,x′} p_{y,y′},
and, likewise, its n-step transition probabilities are
  q_{(x,y),(x′,y′)}^{(n)} = p_{x,x′}^{(n)} p_{y,y′}^{(n)}.
From Theorem 3 and the aperiodicity assumption, we have p_{x,x′}^{(n)} > 0 and p_{y,y′}^{(n)} > 0 for all large n, implying that q_{(x,y),(x′,y′)}^{(n)} > 0 for all large n. Therefore (W_n, n ≥ 0) is an irreducible chain. Notice that
  σ(x, y) := π(x)π(y), (x, y) ∈ S × S,
is a stationary distribution for (W_n, n ≥ 0). (Indeed, had we started with both X_0 and Y_0 independent and distributed according to π, then, for all n, X_n and Y_n would be independent and distributed according to π.)

Therefore σ(x, y) > 0 for all (x, y) ∈ S × S, and hence (W_n, n ≥ 0) is positive recurrent (Corollary 13). In particular,
  P(T < ∞) = 1,
since, by positive recurrence, W hits any fixed diagonal state (x, x) in finite time with probability one.

Define now a third process by
  Z_n := X_n 1(n < T) + Y_n 1(n ≥ T).
In other words, Z_n equals X_n before the meeting time of X and Y and, after the meeting time, Z_n = Y_n. Obviously, Z_0 = X_0, which has the arbitrary distribution µ. Moreover, (Z_n, n ≥ 0) is a Markov chain with transition probabilities p_{ij} (this uses the strong Markov property at T), and so (Z_n, n ≥ 0) is identical in law to (X_n, n ≥ 0). In particular,
  P(X_n = x) = P(Z_n = x), for all n and x.

[Figure: X starts from the arbitrary distribution µ and Y from the stationary distribution π; Z follows X up to the meeting time T and follows Y thereafter.]

Clearly, T is a meeting time between Z and Y. By the coupling inequality,
  |P(Z_n = x) - P(Y_n = x)| ≤ P(T > n),
which converges to zero precisely because P(T < ∞) = 1. Since P(Z_n = x) = P(X_n = x) and P(Y_n = x) = π(x), we conclude:
  |P(X_n = x) - π(x)| ≤ P(T > n) → 0, uniformly in x, as n → ∞. ∎

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_{ij} are the transition probabilities of an irreducible chain X and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and p_{01} = p_{10} = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and transition probabilities
  q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.
This is not irreducible: started from (0, 1), for instance, the chain never reaches (0, 0). However, if we modify the original chain and make it aperiodic by, say, p_{00} = 0.01, p_{01} = 0.99, p_{10} = 1, then the product chain becomes irreducible.
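The coupling in the proof is easy to simulate. The sketch below (a hypothetical 3-state mixing chain) runs two independent copies, one from a fixed state and one from π, estimates P(T > n), and compares it with the actual discrepancy max_x |p_{0x}^{(n)} - π(x)|, which the coupling inequality bounds.

```python
import numpy as np

# Sketch: the coupling inequality |P(X_n = x) - pi(x)| <= P(T > n) in action.
rng = np.random.default_rng(5)
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()

def meeting_time():
    x, y = 0, rng.choice(3, p=pi)         # X from state 0, Y from pi
    for n in range(10_000):
        if x == y:
            return n
        x, y = rng.choice(3, p=P[x]), rng.choice(3, p=P[y])
    return 10_000

T = np.array([meeting_time() for _ in range(5000)])
n = 5
print("max_x |p_{0x}^(n) - pi(x)| :", np.abs(np.linalg.matrix_power(P, n)[0] - pi).max())
print("P(T > n)                   :", (T > n).mean())   # an upper bound, as claimed
```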

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p_{01} = p_{10} = 1, and suppose X_0 = 0. Then X_n = 0 for even n and X_n = 1 for odd n. Therefore p_{00}^{(n)} = 1 for even n and = 0 for odd n, so lim_{n→∞} p_{00}^{(n)} does not exist. However, if we modify the chain by p_{00} = 0.01, p_{01} = 0.99, p_{10} = 1, then
  lim_{n→∞} p_{00}^{(n)} = lim_{n→∞} p_{10}^{(n)} = π(0), lim_{n→∞} p_{11}^{(n)} = lim_{n→∞} p_{01}^{(n)} = π(1),
where π(0), π(1) satisfy the balance equations
  0.99 π(0) = π(1), π(0) + π(1) = 1,
whence π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,
  lim_{n→∞} P^n = [ 0.503 0.497 ; 0.503 0.497 ].
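These numbers are immediate to verify by raising the matrix to a high power (a sketch):

```python
import numpy as np

# Sketch: with p00 = 0.01 the chain is aperiodic and P^n converges.
P = np.array([[0.01, 0.99],
              [1.00, 0.00]])
print(np.linalg.matrix_power(P, 200))
# both rows ~ [0.5025, 0.4975], i.e. (pi(0), pi(1)) = (1/1.99, 0.99/1.99)
```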

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong".^6 The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry (1981)), and this is expressed by our terminology. It is, however, customary to have a consistent terminology.

^6 We put "wrong" in quotes: terminology cannot be, per se, wrong.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain, and suppose that the period of the chain is d. By the results of §5.4 and, in particular, Theorem 4, we can decompose the state space into d classes C_0, C_1, ..., C_{d-1}, such that the chain moves cyclically between them: from C_0 to C_1, from C_1 to C_2, and so on, from C_{d-1} back to C_0. It is not difficult to see that:

Lemma 8. The chain (X_{nd}, n ≥ 0) consists of d closed communicating classes, namely C_0, ..., C_{d-1}, and it is aperiodic (all its states have period 1). Hence, if we restrict the chain (X_{nd}, n ≥ 0) to any one of the sets C_r, we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply.

Let π be the stationary distribution of the original chain. We shall show that:

Lemma 9. Let α_r := Σ_{i∈C_r} π(i), r = 0, 1, ..., d-1. Then α_r = 1/d for all r, and the (unique) stationary distribution of (X_{nd}, n ≥ 0) restricted to C_r is dπ(i), i ∈ C_r.

Proof. We have that π satisfies the balance equations
  π(i) = Σ_{j∈S} π(j) p_{ji}, i ∈ S.
Suppose i ∈ C_r. Then p_{ji} = 0 unless j ∈ C_{r-1}, so
  π(i) = Σ_{j∈C_{r-1}} π(j) p_{ji}.
Summing up over i ∈ C_r we obtain
  α_r = Σ_{i∈C_r} π(i) = Σ_{j∈C_{r-1}} π(j) Σ_{i∈C_r} p_{ji} = Σ_{j∈C_{r-1}} π(j) = α_{r-1}.
Since the α_r add up to 1, each one must be equal to 1/d. Consequently (dπ(i), i ∈ C_r) is a probability distribution on C_r satisfying the balance equations of the restricted chain, i.e. it is its stationary distribution. ∎

This means that:

Theorem 18. Suppose that p_{ij} are the transition probabilities of a positive recurrent irreducible chain with period d. If i, j belong to the same class C_r, then
  p_{ij}^{(nd)} = P_i(X_{nd} = j) → dπ(j), as n → ∞.
Can we also find the limit when i and j do not belong to the same class? Suppose i ∈ C_k, j ∈ C_ℓ, and let r = ℓ - k (modulo d, reduced to a number between 0 and d-1). Starting from i, the original chain enters C_ℓ in r steps and, thereafter, it returns to C_ℓ at times r, r + d, r + 2d, .... We then have
  p_{ij}^{(r+nd)} → dπ(j), as n → ∞.
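A sketch illustrating Theorem 18 on the random walk on a 4-cycle (period d = 2, uniform π = 1/4): along even times p_{00}^{(2n)} → dπ(0) = 1/2, while p_{00}^{(n)} = 0 at odd times.

```python
import numpy as np

# Sketch: periodic limits for the random walk on a 4-cycle (d = 2).
P = np.zeros((4, 4))
for i in range(4):
    P[i, (i + 1) % 4] = P[i, (i - 1) % 4] = 0.5
for n in (10, 11, 50, 51):
    print(n, np.linalg.matrix_power(P, n)[0, 0])   # 0.5 at even n, 0 at odd n
```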

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_{ij}^{(n)}, as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_{ij}^{(n)} → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_{ij}^{(n)} → 0 as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then
  p_{ij}^{(n)} → 0, as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution
  p^{(n)} = (p_1^{(n)}, p_2^{(n)}, p_3^{(n)}, ...)
on N, i.e. Σ_{i=1}^∞ p_i^{(n)} = 1 and p_i^{(n)} ≥ 0 for all i. Then we can find a sequence n_1 < n_2 < n_3 < ··· of integers, and numbers p_i ≥ 0, such that
  lim_{k→∞} p_i^{(n_k)} = p_i, for all i ∈ N.

Proof. Since the numbers p_1^{(n)}, n = 1, 2, ..., are contained between 0 and 1, there must be a subsequence (n_k^1) such that lim_{k→∞} p_1^{(n_k^1)} exists and equals, say, p_1. Now consider the numbers p_2^{(n_k^1)}, k = 1, 2, .... Since they are also contained between 0 and 1, there must be a subsequence (n_k^2) of (n_k^1) such that lim_{k→∞} p_2^{(n_k^2)} exists and equals, say, p_2. It is clear how we continue: for each i, the sequence (n_k^i) of the i-th row is a subsequence of the previous row's (n_k^{i-1}), chosen so that lim_{k→∞} p_i^{(n_k^i)} exists and equals, say, p_i. Schematically:

  p_1^{(n_1^1)}, p_1^{(n_2^1)}, p_1^{(n_3^1)}, ··· → p_1
  p_2^{(n_1^2)}, p_2^{(n_2^2)}, p_2^{(n_3^2)}, ··· → p_2
  p_3^{(n_1^3)}, p_3^{(n_2^3)}, p_3^{(n_3^3)}, ··· → p_3
  ···

Now pick the numbers in the diagonal, n_k := n_k^k. By construction, (n_k, k ≥ i) is a subsequence of each row (n_k^i, k = 1, 2, ...), so
  p_i^{(n_k)} → p_i, as k → ∞, for each i. ∎

The second result is Fatou's lemma, whose proof we omit.^7

^7 See Brémaud (1999).

Lemma 11. If (x_i^{(n)}, i ∈ N), n = 1, 2, ..., and (y_i, i ∈ N) are sequences of nonnegative numbers then
  Σ_{i=1}^∞ y_i lim inf_{n→∞} x_i^{(n)} ≤ lim inf_{n→∞} Σ_{i=1}^∞ y_i x_i^{(n)}.

Proof of Theorem 19. Let j be a null recurrent state and suppose that, for some i, the sequence p_{ij}^{(n)} does not converge to 0. Then we can find a subsequence (n_k) such that
  lim_{k→∞} p_{ij}^{(n_k)} = α_j > 0.
Consider the probability vector (p_{ix}^{(n_k)}, x ∈ S). By Lemma 10, we can pick a subsequence (n′_k) of (n_k) such that
  lim_{k→∞} p_{ix}^{(n′_k)} = α_x, for all x ∈ S.
Since Σ_{x∈S} p_{ix}^{(n′_k)} = 1, Lemma 11 gives Σ_{x∈S} α_x ≤ 1; and Σ_{x∈S} α_x ≥ α_j > 0. Also, since P^{n+1} = P^n P,
  p_{ix}^{(n′_k + 1)} = Σ_{y∈S} p_{iy}^{(n′_k)} p_{yx};
passing, if necessary, to a further subsequence along which the left-hand side converges for every x (Lemma 10 again), and applying Lemma 11 to the right-hand side, one obtains in the limit
  α_x ≥ Σ_{y∈S} α_y p_{yx}, for all x ∈ S.
We wish to show that these inequalities are, actually, equalities. Suppose one of them is not, i.e. it is a strict inequality:
  α_{x_0} > Σ_{y∈S} α_y p_{y x_0}, for some x_0 ∈ S.
This would imply
  Σ_{x∈S} α_x > Σ_{x∈S} Σ_{y∈S} α_y p_{yx} = Σ_{y∈S} α_y Σ_{x∈S} p_{yx} = Σ_{y∈S} α_y,
and this is a contradiction (the number Σ_x α_x cannot be strictly larger than itself). Thus
  π(x) := α_x / Σ_{y∈S} α_y, x ∈ S,
satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0 because α_j > 0. By Corollary 13, state j is positive recurrent: contradiction. ∎

19.2 The general case

It remains to see what the limit of p_{ij}^{(n)} is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, and let
  H_C := inf{n ≥ 0 : X_n ∈ C}, ϕ_C(i) := P_i(H_C < ∞).
Then
  lim_{n→∞} p_{ij}^{(n)} = ϕ_C(i) π^C(j).

Proof. If i also belongs to C then ϕ_C(i) = 1 and so the result follows from Theorem 17. In general, since the chain cannot be in state j at time n unless it has entered C by then,
  p_{ij}^{(n)} = P_i(X_n = j) = P_i(X_n = j, H_C ≤ n) = Σ_{m=1}^n P_i(H_C = m) P_i(X_n = j | H_C = m).
Let µ be the distribution of X_{H_C}, given that X_0 = i and that {H_C < ∞}. From the strong Markov property,
  P_i(X_n = j | H_C = m) = P_µ(X_{n-m} = j).
But Theorem 17 tells us that the limit exists regardless of the initial distribution:
  P_µ(X_{n-m} = j) → π^C(j), as n → ∞.
Hence
  lim_{n→∞} p_{ij}^{(n)} = π^C(j) Σ_{m=1}^∞ P_i(H_C = m) = π^C(j) P_i(H_C < ∞) = ϕ_C(i) π^C(j). ∎

Remark: We can write the result of Theorem 20 as
  lim_{n→∞} p_{ij}^{(n)} = f_{ij} / E_j T_j,
where T_j = inf{n ≥ 1 : X_n = j} and f_{ij} = P_i(T_j < ∞); indeed, π^C(j) = 1/E_j T_j, and f_{ij} = ϕ_C(i) (once the chain reaches the recurrent class C, it is certain to visit j). In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: five states; from 1: to 1 with probability 1-γ and to 2 with probability γ; from 2: to 1 with probability 1; from 3: to 2 with probability 1-α and to 4 with probability α; from 4: to 3 with probability β and to 5 with probability 1-β; state 5 is absorbing.]

The communicating classes are C_1 = {1, 2} (closed class), C_2 = {5} (absorbing state), and I = {3, 4} (inessential states). Both C_1 and C_2 are aperiodic classes. The stationary distributions associated with the two closed classes are:
  π^{C_1}(1) = 1/(1+γ), π^{C_1}(2) = γ/(1+γ), π^{C_2}(5) = 1.
Let ϕ(i) := ϕ_{C_1}(i) be the probability that class C_1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1 and ϕ(5) = 0, while, from first-step analysis,
  ϕ(3) = (1-α)ϕ(2) + αϕ(4), ϕ(4) = βϕ(3) + (1-β)ϕ(5),
whence
  ϕ(3) = (1-α)/(1-αβ), ϕ(4) = β(1-α)/(1-αβ).
Hence, as n → ∞,
  p_{31}^{(n)} → ϕ(3)π^{C_1}(1), p_{32}^{(n)} → ϕ(3)π^{C_1}(2), p_{33}^{(n)} → 0, p_{34}^{(n)} → 0, p_{35}^{(n)} → (1-ϕ(3))π^{C_2}(5),
  p_{41}^{(n)} → ϕ(4)π^{C_1}(1), p_{42}^{(n)} → ϕ(4)π^{C_1}(2), p_{43}^{(n)} → 0, p_{44}^{(n)} → 0, p_{45}^{(n)} → (1-ϕ(4))π^{C_2}(5).
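With concrete (hypothetical) values α = 0.4, β = 0.5, γ = 0.3, the limits above can be confirmed by raising the transition matrix to a high power:

```python
import numpy as np

# Sketch: check p_{31}^(n) -> phi(3) pi^{C1}(1), p_{32}^(n) -> phi(3) pi^{C1}(2)
# for alpha = 0.4, beta = 0.5, gamma = 0.3 (states 1..5 mapped to indices 0..4).
a, b, g = 0.4, 0.5, 0.3
P = np.array([
    [1 - g, g,     0, 0, 0],
    [1,     0,     0, 0, 0],
    [0,     1 - a, 0, a, 0],
    [0,     0,     b, 0, 1 - b],
    [0,     0,     0, 0, 1]])
phi3 = (1 - a) / (1 - a * b)
piC1 = np.array([1 / (1 + g), g / (1 + g)])
print(np.linalg.matrix_power(P, 300)[2, :2])   # empirical limits
print(phi3 * piC1)                             # theoretical values
```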

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it also includes, for example, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville.

Theorem 14 is a type of ergodic theorem, and so is the Law of Large Numbers itself (Theorem 13). We will avoid giving a precise definition of the term "ergodic" and only mention what constitutes a result in the area. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this. Consider the quantity ν^{[a]}(x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X_0 such that P(T_a < ∞) = 1. Let f be a reward function such that
  Σ_{x∈S} |f(x)| ν^{[a]}(x) < ∞.  (20)
Then
  P(lim_{t→∞} [Σ_{n=0}^t f(X_n)] / [Σ_{n=0}^t 1(X_n = a)] = f̄) = 1,  (21)
where
  f̄ := Σ_{x∈S} f(x) ν^{[a]}(x).

Proof. The reader is asked to revisit the proof of Theorem 14. As there, we break the total reward into rewards over complete excursions plus a first and a last term:
  Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t - 1} G_r + G^first_t + G^last_t,
where G_r = Σ_{T_a^{(r)} ≤ s < T_a^{(r+1)}} f(X_s), while G^first_t, G^last_t are the total rewards over the first and last incomplete cycles, respectively. As in the proof of Theorem 14, we have
  lim_{t→∞} (1/t) G^last_t = lim_{t→∞} (1/t) G^first_t = 0,
with probability one (this requires a little more care here, and the check is left as an exercise). The Law of Large Numbers tells us that
  lim_{n→∞} (1/n) Σ_{r=1}^n G_r = EG_1,
with probability 1, as long as E|G_1| < ∞. But this is precisely the condition (20). Now note that the denominator in (21), Σ_{n=0}^t 1(X_n = a), equals the number N_t of visits to state a up to time t. Replacing n by the subsequence N_t in the display above and dividing the numerator and the denominator of the fraction in (21) by N_t, we obtain the result. We only need to check that EG_1 = f̄, and this, too, is left as an exercise. ∎

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a.

Remark 2: If a chain is irreducible and positive recurrent, then the Ergodic Theorem 14 holds: if we further divide numerator and denominator in (21) by t, the denominator becomes N_t/t and converges to 1/E_a T_a, so the numerator converges to f̄/E_a T_a, which is the quantity denoted by f̄ in Theorem 14. This also justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. First, since the number of states is finite, at least one of the states must be visited infinitely often; so there are always recurrent states. Second, from Proposition 2 we know that the balance equations ν = νP have at least one solution, and, since the number of states is finite,
  C := Σ_{i∈S} ν(i) < ∞.

Hence
  π(i) := ν(i)/C, i ∈ S,
satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1; therefore π is a stationary distribution. At least some state i must have π(i) > 0; by Corollary 13, this state must be positive recurrent, and so are all states in its communicating class, because positive recurrence is a class property (Theorem 15). In particular, if the chain is irreducible, then all states are positive recurrent.

We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent, with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have
  P^n → Π, as n → ∞,
where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

Proof. As we just saw, a finite Markov chain always has positive recurrent states; since the chain is irreducible, all states are positive recurrent, and, by Theorem 16, the stationary distribution is unique. The convergence P^n → Π under aperiodicity is Theorem 17. ∎

22 Time reversibility

Consider a Markov chain (X_n, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that
  (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n-1}, ..., X_0), for all n.
It is, in a sense, like playing a film backwards: if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, without us being able to tell that it is actually running backwards. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time.

Lemma 12. A reversible Markov chain is stationary.

Proof. Taking n = 1 in the definition of reversibility, we see that (X_0, X_1) has the same distribution as (X_1, X_0). In particular, the first components of the two vectors must have the same distribution: X_1 has the same distribution as X_0. Because the process is Markov, it follows that X_n has the same distribution as X_0 for all n, and this means that the chain is stationary. ∎

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:
  π(i) p_{ij} = π(j) p_{ji}, i, j ∈ S.

Proof. Suppose first that the chain is reversible. Then (X_0, X_1) has the same distribution as (X_1, X_0), and this means that
  P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j).
But P(X_0 = i, X_1 = j) = π(i)p_{ij} and P(X_0 = j, X_1 = i) = π(j)p_{ji}. So the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold (and start the chain from π). The DBE imply immediately that (X_0, X_1) has the same distribution as (X_1, X_0). Pick 3 states i, j, k, and write, using the DBE,
  π(i) p_{ij} p_{jk} = [π(j)p_{ji}] p_{jk} = [π(j)p_{jk}] p_{ji} = [π(k)p_{kj}] p_{ji},
and this means that
  P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i),
i.e. (X_0, X_1, X_2) has the same distribution as (X_2, X_1, X_0). The same logic works for any n and can be worked out by induction, so (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n-1}, ..., X_0), for all n. Hence the chain is reversible. ∎
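A sketch checking Theorem 23 numerically on a small (hypothetical) birth-death chain, a class of chains for which the detailed balance equations always hold:

```python
import numpy as np

# Sketch: a birth-death chain satisfies detailed balance with its
# stationary distribution, hence is reversible (Theorem 23).
p = np.array([0.5, 0.4, 0.3])      # up-probabilities out of states 0, 1, 2
q = np.array([0.5, 0.6, 0.7])      # down-probabilities out of states 1, 2, 3
P = np.zeros((4, 4))
for i in range(3):
    P[i, i + 1] = p[i]
    P[i + 1, i] = q[i]
P += np.diag(1 - P.sum(axis=1))    # self-loops complete each row

w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
F = pi[:, None] * P                # F[i, j] = pi(i) p_ij
print(np.allclose(F, F.T))         # True: the DBE hold
```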

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i_1, i_2, ..., i_m ∈ S and all m ≥ 3,
  p_{i_1 i_2} p_{i_2 i_3} ··· p_{i_{m-1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m-1}} ··· p_{i_2 i_1}.  (22)

Proof. Suppose first that the chain is reversible; we will show the KLC. Since the chain is irreducible, we have π(i) > 0 for all i. Pick 3 states i, j, k and write, using the DBE,
  π(i) p_{ij} p_{jk} p_{ki} = [π(j)p_{ji}] p_{jk} p_{ki} = [π(j)p_{jk}] p_{ji} p_{ki} = [π(k)p_{kj}] p_{ji} p_{ki}
   = [π(k)p_{ki}] p_{ji} p_{kj} = [π(i)p_{ik}] p_{ji} p_{kj} = π(i) p_{ik} p_{kj} p_{ji}.
Cancelling the π(i) from the two ends of this display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix the states i_1, i_2 and sum (22) over all choices of the intermediate states i_3, i_4, ..., i_m; we obtain
  p_{i_1 i_2} p_{i_2 i_1}^{(m-1)} = p_{i_1 i_2}^{(m-1)} p_{i_2 i_1}.
If the chain is aperiodic, then, letting m → ∞ and using Theorem 17,
  p_{i_2 i_1}^{(m-1)} → π(i_1) and p_{i_1 i_2}^{(m-1)} → π(i_2),
so that
  π(i_1) p_{i_1 i_2} = π(i_2) p_{i_2 i_1}.
These are the DBE, and the chain is reversible by Theorem 23. ∎

Notes: 1) A necessary condition for reversibility is that p_{ij} > 0 if and only if p_{ji} > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio p_{ij}/p_{ji} can be written as a product of a function of i and a function of j; in other words, that we have separation of variables:
  p_{ij}/p_{ji} = f(i) g(j).
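A sketch checking the KLC numerically: build a reversible chain from symmetric conductances (p_{ij} = c_{ij}/Σ_k c_{ik} with c_{ij} = c_{ji}, a standard construction used here as an illustration) and compare the products around a loop and around its reversal.

```python
import numpy as np

# Sketch: Kolmogorov's loop criterion for a conductance-built chain.
C = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])           # symmetric conductances
P = C / C.sum(axis=1, keepdims=True)
fwd = P[0, 1] * P[1, 2] * P[2, 0]         # loop 0 -> 1 -> 2 -> 0
rev = P[0, 2] * P[2, 1] * P[1, 0]         # reversed loop
print(fwd, rev)                           # equal (both 0.1), as (22) requires
```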

If the graph thus obtained is a tree. The balance equations are written in the form F (A. and by replacing. 60 . For example. • If the chain has infinitely many states. consider the chain: Replace it by: This is a tree. Ac ) = F (Ac . The reason is as follows. there would be a loop.(Convention: we form this ratio only when pij > 0. to guarantee. Hence the original chain is reversible. Hence the chain is reversible.) Necessarily. then the chain is reversible. Pick two distinct states i. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. So g satisfies the balance equations. and.1) which gives π(i)pij = π(j)pji . There is obviously only one arrow from AtoAc . This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate.e. Form the graph by deleting all self-loops. j for which pij > 0. the graph becomes disconnected. A) (see §4. j for which pij > 0. the two oriented edges from i to j and from j to i by a single edge without arrow. there isn’t any. in addition. pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . and the arrow from j to j is the only one from Ac to A. Ac be the sets in the two connected components. (If it didn’t. so we have pij g(j) = . then we need. Example 1: Chain with a tree-structure. by assumption. the detailed balance equations. namely the arrow from i to j. that the chain is positive recurrent. using another method. for each pair of distinct states i. f (i)g(i) must be 1 (why?). And hence π can be found very easily. i.) Let A. meaning that it has no loops. If we remove the link between them.

The stationary distribution of a RWG is very easy to find. the DBE hold. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. i. Suppose the graph is connected. a finite set. etc. A RWG is a reversible chain. The constant C is found by normalisation.Example 2: Random walk on a graph. The stationary distribution assigns to a state probability which is proportional to its degree.e. so both members are equal to C. π(i) = 1. j are not neighbours then both sides are 0. If i. For example. For each state i define its degree δ(i) = number of neighbours of i. 0. i∈S Since δ(i) = 2|E|. π(i)pij = π(j)pji . States j are i are called neighbours if they are connected by an edge. where C is a constant. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). and a set E of undirected edges. 61 . j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. Now define the the following transition probabilities: pij = 1/δ(i). p02 = 1/4. In all cases. We claim that π(i) = Cδ(i). i∈S where |E| is the total number of edges. Each edge connects a pair of distinct vertices (states) without orientation. if j is a neighbour of i otherwise. Hence we conclude that: 1. 2. Consider an undirected graph G with vertices S. To see this let us verify that the detailed balance equations hold. There are two cases: If i. we have that C = 1/2|E|.

n ≥ 0) be Markov with transition probabilities pij . Let (St . 2. . In this case. . . irreducible and positive recurrent).d. . 1.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence.3 Subordination Let (Xn . π is a stationary distribution for the original chain. in addition. if. 2. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). t ≥ 0) be independent from (Xn ) with S0 = 0. Due to the Strong Markov Property. . .2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. i. . where pij are the m-step transition probabilities of the original chain. . 2. 1. k = 0. 2. . Let A be (r) a set of states and let TA be the r-th visit to A. A r = 1. how can we create new ones? 23. The state space of (Yr ) is A. k = 0.e. 2. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . . if nk = n0 + km. 62 n ∈ N. The sequence (Yk . St+1 = St + ξt . If the subsequence forms an arithmetic progression. . this process is also a Markov chain and it does have time-homogeneous transition probabilities.i. . 1. . then (Yk .e.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). i.) has the Markov property but it is not time-homogeneous. k = 0. random variables with values in N and distribution λ(n) = P (ξ0 = n). Define Yr := XT (r) . (m) (m) 23.e.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . k = 0. . in general. 1. . . t ≥ 0. where ξt are i. . then it is a stationary distribution for the new chain. 23.

. . n≥0 63 . . and the elements of S ′ by α. (Notation: We will denote the elements of S by i.e. We need to show that P (Yn+1 = β | Yn = α. ∞ λ(n)pij . β. is NOT. that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). conditional on Yn the past is independent of the future. Then (Yn ) has the Markov property.. we have: Lemma 13. j. knowledge of f (Xn ) implies knowledge of Xn and so. then the process Yn = f (Xn ). (n) 23. . . β).) More generally. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 .Then Yt := XSt . i. . in general. Yn−1 ) = P (Yn+1 = β | Yn = α). however. f is a one-to-one function then (Yn ) is a Markov chain. Proof. ..4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . If. . . . Fix α0 . . a Markov chain. . Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. . Yn−1 = αn−1 }. . α1 . Indeed.

if β = 1. 64 . β). depends on i only through |i|. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. if i = 0. if we let 1/2. and so (Yn ) is. On the other hand. α = 0. conditional on {Xn = i}. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. P (Xn+1 = i ± 1|Xn = i) = 1/2. β). and if i = 0. β) P (Yn = α|Yn−1 ) i∈S = Q(α. for all i ∈ Z and all β ∈ Z+ . regardless of whether i > 0 or i < 0. i. β) P (Yn = α|Xn = i. if β = α ± 1. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). then |Xn+1 | = 1 for sure. β). But P (Yn+1 = β | Yn = α. β)P (Xn = i | Yn = α. We have that P (Yn+1 = β|Xn = i) = Q(|i|. a Markov chain with transition probabilities Q(α. i ∈ Z. i ∈ Z. Yn−1 ) Q(f (i). Define Yn := |Xn |.e. Indeed. observe that the function | · | is not one-to-one. But we will show that P (Yn+1 = β|Xn = i). because the probability of going up is the same as the probability of going down. α > 0 Q(α. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. Yn−1 ) Q(f (i).for all choices of the variables involved. Example: Consider the symmetric drunkard’s random walk. Thus (Yn ) is Markov with transition probabilities Q(α. β) := 1. β) i∈S P (Yn = α|Xn = i. β). In other words. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β) P (Yn = α. Then (Yn ) is a Markov chain. β ∈ Z+ . Yn = α. To see this. Yn−1 )P (Xn = i | Yn = α. indeed.

independently from one another. (and independent of the initial population size X0 –a further assumption).. one question of interest is whether the process will become extinct or not. ξ (1) . If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. ξ2 . But here is an interesting special case. then we see that the above equation is of the form Xn+1 = f (Xn . The state 0 is irrelevant because it cannot be reached from anywhere. Back to the general case. We find the population (n) at time n + 1 by adding the number of offspring of each individual. . in general. . . are i.24 24. therefore transient. since ξ (0) . The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring. a hard problem. 0 ≤ j ≤ i. It is a Markov chain with time-homogeneous transitions. . . each individual gives birth to at one child with probability p or has children with probability 1 − p. p0 = 1 − p. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. with probability 1. 1.d.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. if 0 is never reached then Xn → ∞. Let ξk be the number of offspring of the k-th individual in the n-th generation. if the current population is i there is probability pi that everybody gives birth to zero offspring. the population cannot grow: it will eventually reach 0. The parent dies immediately upon giving birth. and following the same distribution. ξ (n) ). . n ≥ 0) has the Markov property. so we will find some different method to proceed.i. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. Hence all states i ≥ 1 are transient. In other words. if the 65 (n) (n) . We will not attempt to compute the pij neither draw the transition diagram because they are both futile. and. Hence all states i ≥ 1 are inessential. . So assume that p1 = 1. Example: Suppose that p1 = p. Therefore. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. 2. and 0 is always an absorbing state. we have that (Xn . which has a binomial distribution: i j pij = p (1 − p)i−j . k = 0. . . j In this case. If we let ξ (n) = ξ1 . Thus. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . We let pk be the probability that an individual will bear k offspring. 0 But 0 is an absorbing state. Then P (ξ = 0) + P (ξ ≥ 2) > 0. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow.

when we start with X0 = i. i. since Xn = 0 implies Xm = 0 for all m ≥ n. We then have that D = D1 ∩ · · · ∩ Di . Indeed. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. If µ ≤ 1 then the ε = 1. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. Obviously. Proof. For i ≥ 1. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. so P (D|X0 = i) = εi . Therefore the problem reduces to the study of the branching process starting with X0 = 1. We wish to compute the extinction probability ε(i) := Pi (D). This function is of interest because it gives information about the extinction probability ε. 66 . Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. ϕn (0) = P (Xn = 0). we can argue that ε(i) = εi . . where ε = P1 (D). if we start with X0 = i in ital ancestors. . The result is this: Proposition 5. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. . Indeed. each one behaves independently of one another. ε(0) = 1. Let ε be the extinction probability of the process. . and P (Dk ) = ε. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. For each k = 1. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . and.population does not become extinct then it must grow to infinity.

Therefore ε = ϕ(ε). (See right figure below. if µ > 1. if the slope µ at z = 1 is ≤ 1. ϕ(ϕn (0)) → ϕ(ε). On the other hand. ϕn+1 (0) = ϕ(ϕn (0)). Therefore ε = ε∗ . We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). ε = ε∗ . in fact. This means that ε = 1 (see left figure below). = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. ϕn+1 (z) = ϕ(ϕn (z)). ϕn (0) ≤ ε∗ . z=1 Also. and so on.Note also that ϕ0 (z) = 1. Therefore. Let us show that. recall that ϕ is a convex function with ϕ(1) = 1. Notice now that µ= d ϕ(z) dz . all we need to show is that ε < 1. since ϕ is a continuous function. and since the latter has only two solutions (z = 0 and z = ε∗ ). But ϕn+1 (0) → ε as n → ∞. In particular. So ε ≤ ε∗ < 1. But ϕn (0) → ε. and. Since both ε and ε∗ satisfy z = ϕ(z).) 67 . Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . By induction. Also ϕn (0) → ε. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). But 0 ≤ ε∗ .

. for a packet to go through. Abandoning this idea. then max(Xn + An − 1. there is no time to make a prior query whether a host has something to transmit or not. Xn+1 = Xn + An . But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. . there is collision and the transmission is unsuccessful. then the transmission is not successful and the transmissions must be attempted anew. k ≥ 0. on the next times slot. 6. otherwise. . is that if more than one hosts attempt to transmit a packet. there is a collision and. An the number of new packets on the same time slot. Then ε < 1. 1. only one of the hosts should be transmitting at a given time. 68 . 2. 4. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. The problem of communication over a common channel.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). random variables with some common distribution: P (An = k) = λk . . say. if Sn = 1. If collision happens. 3 hosts. 2.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1.i. 0). in a simplified model. We assume that the An are i. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). 1. then time slots 0. If s ≥ 2. we let hosts transmit whenever they like. So. and Sn the number of selected packets. there are x + a − 1 packets. using the same frequency.d. . One obvious solution is to allocate time slots to hosts in a specific order. 1. 3. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Ethernet has replaced Aloha. Case: µ > 1. periodically. Let s be the number of selected packets. So if there are. are allocated to hosts 1. 3. During this time slot assume there are a new packets that need to be transmitted. . Since things happen fast in a computer network. 5. If s = 1 there is a successful transmission and. 24. . 3. Nowadays. 2. there are x + a packets. Then ε = 1. on the next times slot. Thus.

n ≥ 0) is a Markov chain. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x).x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . we have that the drift is positive for all large x and this proves transience.x are computed by subtracting the sum of the above from 1. An = a.x−1 + k≥1 kpx. defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). For x > 0. 69 . we must have no new packets arriving. if Xn + An = 0. The Aloha protocol is unstable. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An .e. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x.x−1 = P (An = 0. Xn−1 . To compute px. the Markov chain (Xn ) is transient. and p. We here only restrict ourselves to the computation of this expected change. so q x G(x) tends to 0 as x → ∞. and exactly one selection: px. So we can make q x G(x) as small as we like by choosing x sufficiently large. The random variable Sn has. for example we can make it ≤ µ/2. To compute px. we have ∆(x) = (−1)px. we have q < 1. k ≥ 1. So: px. The main result is: Proposition 6. Indeed. where q = 1 − p. where G(x) is a function of the form α + βx . We also let µ := k≥1 kλk be the expectation of An . The reason we take the maximum with 0 in the above equation is because.x−1 when x > 0. and so q x tends to 0 much faster than G(x) tends to ∞.e. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. . Proving this is beyond the scope of the lectures. Therefore ∆(x) ≥ µ/2 for all large x.) = x + a k x−k p q . The process (Xn . selections are made amongst the Xn + An packets.where λk ≥ 0 and k≥0 kλk = 1.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . . Idea of proof. Let us find its transition probabilities. Sn = 1|Xn = x) = λ0 xpq x−1 . . i. The probabilities px. An−1 . then the chain is transient. Since p > 0. Unless µ = 0 (which is a trivial case: no arrivals!). we should not subtract 1. So P (Sn = k|Xn = x. we think as follows: to have a reduction of x by 1. k 0 ≤ k ≤ x + a.x+k for k ≥ 1.

The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. Comments: 1.24. there are companies that sell high-rank pages. There is a technique. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. It uses advanced techniques from linear algebra to speed up computations.2) and let α pij := (1 − α)pij + . the next location Xn+1 is picked uniformly at random amongst all pages that i links to. if j ∈ L(i)  pij = 1/|V |. It is not clear that the chain is irreducible. 3. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. 2. because of the size of the problem. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. All you need to do to get high rank is pay (a lot of money). unless he gets bored–and this happens with probability α–in which case he clicks on a random page. that allows someone to manipulate the PageRank of a web page and make it much higher. unless i is a dangling page. We pick α between 0 and 1 (typically. Then it says: page i has higher rank than j if π(i) > π(j). A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. So this is how Google assigns ranks to pages: It uses an algorithm. α ≈ 0. PageRank. in the latter case. Define transition probabilities as follows:  1/|L(i)|. otherwise. |V | These are new transition probabilities. if L(i) = ∅   0. For each page i let L(i) be the set of pages it links to. Academic departments link to each other by hiring their 70 . So if Xn = i is the current location of the chain. the rank of a page may be very important for a company’s profits. The idea is simple. Its implementation though is one of the largest matrix computations in the world. If L(i) = ∅. Therefore. then i is a ‘dangling page’. known as ‘302 Google jacking’. So we make the following modification. neither that it is aperiodic. the chain moves to a random page.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. In an Internet-dominated society.

Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. maintain this allele for the next generation. This means that the external appearance of a characteristic is a function of the pair of alleles. And so on. then. it makes the evaluation procedure easy. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. two AA individuals will produce and AA child. But an AA mating with a Aa may produce a child of type AA or one of type Aa. The alleles in an individual come in pairs and express the individual’s genotype for that gene. generation after generation. One of the square alleles is selected 4 times. the genetic diversity depends only on the total number of alleles of each type. Imagine that we have a population of N individuals who mate at random producing a number of offspring. When two individuals mate. Do the same thing 2N times. This number could be independent of the research quality but. meaning how many characteristics we have. We have a transition from x to y if y alleles of type A were produced. from the point of view of an administrator. The process repeats. in each generation.g. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. If so. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level.4 The Wright-Fisher model: an example from biology In an organism. We will assume that a child chooses an allele from each parent at random. We will also assume that the parents are replaced by the children. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). 4. this allele is of type A with probability x/2N . 71 . colour of eyes) appears in two forms. Since we don’t care to keep track of the individuals’ identities. 5. a gene controlling a specific characteristic (e. 24. What is important is that the evaluation can be done without ever looking at the content of one’s research. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. We would like to study the genetic diversity. To simplify. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). Three of the square alleles are never selected. a dominant (say A) and a recessive (say a).faculty from each other (and from themselves). their offspring inherits an allele from each parent. all we need is the information outlined in this figure. we will assume that. One of the circle alleles is selected twice. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. Thus.

These are: M ϕ(0) = 0. . the chain gets absorbed by either 0 or 2N . Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . i. The fact that M is even is irrelevant for the mathematical model.e. Since pxy > 0 for all 1 ≤ x. the random variables ξ1 . . . . . . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. ξM are i. Clearly. 72 . E(Xn+1 |Xn . there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. p00 = p2N. i. . .e. y ≤ 2N − 1. . we see that 2N x 2N −y x y pxy = . but the argument needs further justification because “the end of time” is not a deterministic time. X0 ) = E(Xn+1 |Xn ) = M This implies that.) One can argue that. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . . Clearly. . ϕ(x) = x/M. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). . We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. (Here. ϕ(x) = y=0 pxy ϕ(y). ϕ(M ) = 1. we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. conditionally on (X0 . M n P (ξk = 0|Xn .d.Each allele of type A is selected with probability x/2N . X0 ) = 1 − Xn . . 2N − 1} form a communicating class of inessential states. as usual. the states {1. on the average. . Xn ). . M and thus. and the population becomes one consisting of alleles of type a only. The result is correct. with n P (ξk = 1|Xn . the expectation doesn’t change and. and the population becomes one consisting of alleles of type A only. for all n ≥ 0.i. on the average. . . . 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. since. M Therefore. Tx := inf{n ≥ 1 : Xn = x}. Let M = 2N . Since selections are independent. EXn+1 = EXn Xn = Xn .2N = 1. Similarly. 0 ≤ y ≤ 2N. . n n where. X0 ) = Xn . . so both 0 and 2N are absorbing states. . the number of alleles of type A won’t change. since. . . .

while a number of them are sold to the market. on the average. Otherwise. Let ξn := An − Dn . say. We will apply the so-called Loynes’ method. and the stock reached level zero at time n + 1. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. A number An of new components are ordered and added to the stock. there are Xn + An − Dn components in the stock. so that Xn+1 = (Xn + ξn ) ∨ 0. the demand is met instantaneously. The correctness of the latter can formally be proved by showing that it does satisfy the recursion.d. . we will try to solve the recursion. for all n ≥ 0.The first two are obvious. Working for n = 1. M x M (M −1)−(y−1) x x +1− M M M −1 = 24. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. We will show that this Markov chain is positive recurrent. n ≥ 0) are i. Thus. 73 . We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. . and E(An − Dn ) = −µ < 0. . Dn ). b).5 A storage or queueing application A storage facility stores. If the demand is for Dn components. Xn electronic components at time n. n ≥ 0) is a Markov chain. the demand is only partially satisfied. at time n + 1. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . Here are our assumptions: The pairs of random variables ((An . M −1 y−1 x M y−1 1− x . 2. Since the Markov chain is represented by this very simple recursion. provided that enough components are available. the demand is.i. and so. larger than the supply per unit of time. We have that (Xn .

The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). (n) . . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. Therefore. we may replace the latter random variables by ξ1 . ξn−1 ). . ξ0 ). . . . . . ξn−1 ) = (ξn−1 . that d (ξ0 . . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. in computing the distribution of Mn which depends on ξ0 . for y ≥ 0. . . = x≥0 Since the n-dependent terms are bounded. .Thus. so we can shuffle them in any way we like without altering their joint distribution. (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . we reverse it. Notice. . the random variables (ξn ) are i. for all n. we may take the limit of both sides as n → ∞. . .i. 74 . and. Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). using π(y) = limn→∞ P (Mn = y). . ξn−1 . Indeed. . where the latter equality follows from the fact that. . Hence. P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . instead of considering them in their original order.d. the event whose probability we want to compute is a function of (ξ0 . . ξn . But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). . however. . In particular. we find π(y) = x≥0 π(x)pxy .

Here.} < ∞) = 1. n→∞ This implies that P ( lim Zn = −∞) = 1. . n→∞ Hence g(y) = P (0 ∨ max{Z1 . It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. The case Eξn = 0 is delicate. we will use the assumption that Eξn = −µ < 0. . as required.So π satisfies the balance equations. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. This will be the case if g(y) is not identically equal to 1. . 75 . Depending on the actual distribution of ξn . Z2 . . the Markov chain may be transient or null recurrent. . We must make sure that it also satisfies y π(y) = 1. .} > y) is not identically equal to 1. Z2 .

y+z for any translation z. otherwise where p + q = 1. . otherwise. . we can let S = Zd . here. so far. To be able to say this. if x = ej .and space-homogeneity. if x ∈ {±e1 . but. where C is such that x p(x) = 1. This means that the transition probability pxy should depend on x and y only through their relative positions in space. . A random walk possesses an additional property. we need to give more structure to the state space S. the skip-free property. . besides time. the set of d-dimensional vectors with integer coordinates.g. If d = 1. . ±ed } otherwise 76 . A random walk of this type is called simple random walk or nearest neighbour random walk. . j = 1. . . namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. given a function p(x). then p(1) = p.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. a random walk is a Markov chain with time. . Example 3: Define p(x) = 1 2d . d  p(x) = qj . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. have been processes with timehomogeneous transition probabilities. a structure that enables us to say that px. Our Markov chains. More general spaces (e. d   0. this walk enjoys. j = 1. x ∈ Zd . namely. . This is the drunkard’s walk we’ve seen before. that of spatial homogeneity. Example 1: Let p(x) = C2−|x1 |−···−|xd | . . p(−1) = q. So. 0. where p1 + · · · + pd + q1 + · · · + qd = 1.y = px+z.and space-homogeneous transition probabilities given by px. groups) are allowed. Define  pj .y = p(y − x). if x = −ej . For example. p(x) = 0. In dimension d = 1. we won’t go beyond Zd . .

We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. . (Recall that δz is a probability distribution that assigns probability 1 to the point z. x + ξ2 = y) = y p(x)p(y − x). The special structure of a random walk enables us to give more detailed formulae for its behaviour. x ± ed . which.d. We will let p∗n be the convolution of p with itself n times. If d = 1. . So. We refer to ξn as the increments of the random walk. y p(x)p′ (y − x). But this would be nonsense. we can reduce several questions to questions about this random walk. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . p′ (x). p ∗ δz = p. p(x) = 0.This is a simple random walk. Obviously. because all we have done is that we dressed the original problem in more fancy clothes. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . The convolution of two probabilities p(x). p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). We call it simple symmetric random walk. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . then p(1) = p(−1) = 1/2. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. Let us denote by Sn the state of the walk at time n. the initial state S0 is independent of the increments. (When we write a sum without indicating explicitly the range of the summation variable. random variables in Zd with common distribution P (ξn = x) = p(x). we will mean a sum over all possible values. in addition.) In this notation. otherwise. . where ξ1 . we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). Indeed. are i. One could stop here and say that we have a formula for the n-step transition probability. . We shall resist the temptation to compute and look at other 77 . Furthermore. . . . x ∈ Zd . Furthermore. x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. ξ2 . and this is the symmetric drunkard’s walk.) Now the operation in the last term is known as convolution. in general.i. computing the n-th order convolution for all n is a notoriously hard problem.

.2) (4. 2) → (3. 1): (2. for fixed n. +1.d. The principle is trivial. Here are a couple of meaningful questions: A. So. +1.2) where the ξn are i. 2) → (5. and the sequence of signs is {+1. but its practice may be hard. +1}n . to each initial position and to each sequence of n signs.i. . Look at the distribution of (ξ1 . 1) → (2. −1} then we have a path in the plane that joins the points (0. Therefore. In the sequel. 2 where #A is the number of elements of A. 1) → (4. .i−1 = 1/2 for all i ∈ Z. a random vector with values in {−1. If yes. Will the random walk ever return to where it started from? B. a path of length n. Events A that depend only on these the first n random signs are subsets of this sample space. If not. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn .0) We can thus think of the elements of the sample space as paths.1) + − (5. and so #A P (A) = n . . A ⊂ {−1. consider the following: 78 . random variables (random signs) with P (ξn = ±1) = 1/2. .1) − (3. all possible values of the vector are equally likely. 1}n . +1}n . we shall learn some counting methods. After all.methods. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. + (1.i+1 = pi. 0) → (1. So if the initial position is S0 = 0. First. if we can count the number of elements of A we can compute its probability. we have P (ξ1 = ε1 . ξn ). how long will that take? C. . Clearly. . +1}n . who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. we are dealing with a uniform distribution on the sample space {−1. . As a warm-up. Since there are 2n elements in {0. −1. we should assign.1) + (0. ξn = εn ) = 2−n .

0) The lightly shaded region contains all paths of length n that start at (0.i) (0. 0) and end up at (n. The number of paths that start from (0. x) and end up at (n.) The darker region shows all paths that start at (0.Example: Assume that S0 = 0. 0)).1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. S6 < 0. the number of paths that start from (m. (n. (There are 2n of them. The paths comprising this event are shown below. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. and compute the probability of the event A = {S2 ≥ 0. Consider a path that starts from (0. 0) (n. y) is given by: #{(m. i)} = n n+i 2 More generally. y) is given by: #{(0. 0) and ends at (n. x) and end up at (n. i). i). In if n + i is not an even number then there is no path that starts from (0. y)} =: {all paths that start at (m. We use the notation {(m. n 79 . 0) and end up at the specific point (n. S7 < 0}. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. in this case. i). y)}. S4 = −1. Hence (n+i)/2 = 0. x) (n. We can easily count that there are 6 paths in A. x) (n. Proof. 0) and n ends at (n. (n − m + y − x)/2 (23) Remark. y)} = n−m .047. Let us first show that Lemma 14.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

. P0 (Sn = y) 2 (n + y) + 1 . . for now. given that S0 = 0 and Sn = y > 0. 27. Sn ≥ 0 | Sn = y) = 1 − In particular. Sn > 0 | Sn = y) = y . (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. P0 (S1 ≥ 0 . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 .2 The ballot theorem Theorem 25. 27. 0) (n. Such a simple answer begs for a physical/intuitive explanation. n/2 (29) #{(0. Sn ≥ 0 | Sn = 0) = 84 1 . If n. n 2 #{paths (m. y) that do not touch zero} × 2−n #{paths (0. S2 > 0. 1) (n.3 Some conditional probabilities Lemma 17. 0) (n. n−m 2−n−m . . . The probability that. y)} . . we shall just compute it. y)} . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . The left side of (30) equals #{paths (1. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . n 2−n . .P (Sn = y|Sm = x) = In particular. . But. . if n is even. the random walk never becomes zero up to time n is given by P0 (S1 > 0. We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). n This is called the ballot theorem and the reason will be given in Section 29. x) 2n−m (n. . n (30) Proof. if n is even. .

Sn ≥ 0. . . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . . . .4 Some remarkable identities Theorem 26. . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . Sn ≥ 0) = P0 (S1 = 0. Sn ≥ 0) = #{(0. . . . . . ξ3 . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. and Sn−1 > 0 implies that Sn ≥ 0. Sn = 0) = P0 (Sn = 0). . . (e) follows from the fact that ξ1 is independent of ξ2 . . For even n we have P0 (S1 ≥ 0. . . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . . . . Sn < 0) (b) (a) = 2P0 (S1 > 0. Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . . . . . . . .) by (ξ1 .Proof. . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . (f ) follows from the replacement of (ξ2 . . . . ξ3 . . . (c) (d) (e) (f ) (g) where (a) is obvious. . ξ2 + ξ3 ≥ 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . n/2 where we used (28) and (29). •). If n is even. . . 0) (n. . Proof. We next have P0 (S1 = 0. . We use (27): P0 (Sn ≥ 0 . +1 27. . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. Sn ≥ 0). . (b) follows from symmetry. P0 (S1 ≥ 0. Sn = 0) = P0 (S1 > 0. . . . . 85 . . . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. ξ2 .). . . . . . . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . . and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. . Sn > 0) + P0 (S1 < 0. . . 1 + ξ2 + ξ3 > 0.

Let n be even. P0 (Sn−1 > 0. Then f (n) := P0 (S1 = 0. . So u(n) f (n) = . the probability u(n) that the walk (started at 0) attains value 0 in n steps.5 First return to zero In (29). It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. So the last term is 1/2. Sn−2 > 0|Sn−1 = 1). . . . This is not so interesting. we can omit the last event. . Put a two-sided mirror at a. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). By symmetry. So 1 f (n) = 2u(n) P0 (S1 > 0. . given that Sn = 0. Theorem 27. the two terms must be equal.27. . Sn = 0) = u(n) . we either have Sn−1 = 1 or −1. . we see that Sn−1 > 0. Also. n−1 Proof. Sn−2 > 0|Sn−1 > 0. 86 . . n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. with equal probability. Sn = 0) P0 (S1 > 0. we computed. Sn = 0) = 2P0 (Sn−1 > 0. Sn = 0) + P0 (S1 < 0. To take care of the last term in the last expression for f (n). Sn−1 > 0. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . and u(n) = P0 (Sn = 0). . . Sn = 0 means Sn−1 = 1. Sn = 0. . . . . If a particle is to the right of a then you see its image on the left. Fix a state a. Sn−1 > 0. We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. and if the particle is to the left of a then its mirror image is on the right. . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). Sn = 0) Now. . Since the random walk cannot change sign before becoming zero. . . Sn = 0). . Sn−1 < 0. for even n. f (n) = P0 (S1 > 0. So f (n) = 2P0 (S1 > 0. By the Markov property at time n − 1. Sn−1 = 0. . So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. . . .

n ≥ Ta . 28. as a corollary to the above result. so we may replace Sn by 2a − Sn (Theorem 28). n ∈ Z+ ) has the same law as (Sn . But a symmetric random walk has the same law as its negative. Sn ≤ a) + P0 (Sn > a). Then Ta = inf{n ≥ 0 : Sn = a} Sn . independent of the past before Ta . But P0 (Ta ≤ n) = P0 (Ta ≤ n. n < Ta . define Sn = where is the first hitting time of a.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. Finally. 2a − Sn ≥ a) = P0 (Ta ≤ n. . for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . Proof. . a + (Sn − a). So: P0 (Sn ≥ a) = P0 (Ta ≤ n. n < Ta . Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). Then. n ∈ Z+ ). 2 Proof. Sn = Sn . we have n ≥ Ta . S1 . Sn ≥ a). Sn ≤ a). m ≥ 0) is a simple symmetric random walk. The last equality follows from the fact that Sn > a implies Ta ≤ n. Theorem 28. observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). Trivially. n ≥ Ta By the strong Markov property. the process (STa +m − a. 2a − Sn . In the last event. Let (Sn . In other words. P0 (Sn ≥ a) = P0 (Ta ≤ n. . So Sn − a can be replaced by a − Sn in the last display and the resulting process. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). called (Sn . Sn > a) = P0 (Ta ≤ n. Sn ).What is more interesting is this: Run the particle till it hits a (it will. Thus. n ≥ 0) be a simple symmetric random walk and a > 0. . we obtain 87 . If Sn ≥ a then Ta ≤ n. 2 Define now the running maximum Mn = max(S0 . Sn ≤ a) + P0 (Ta ≤ n.

Sci. 9 Bertrand. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . the number of selected azure items is always ahead of the black ones equals (a − b)/n. usually a vase furnished with a foot or pedestal. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). It is the latter use we are interested in in Probability and Statistics. then. e 10 Andr´. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. The answer is very simple: Theorem 31. Bertrand. Compt. e e e Paris. This. (31) x > y. Sn = y) = P0 (Tx > n. Sci. (1887). by applying the reflection principle (Theorem 28). Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. . . We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. we have P0 (Mn < x. combined with (31). ornamental objects. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. of which a are coloured azure and b black.Corollary 19. Sn = y). If an urn contains a azure items and b = n−a black items. J. 369. 2 Proof. . then the probability that. employed for different purposes. Sn = y) = P0 (Tx ≤ n. n = a + b. gives the result. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Paris. Solution directe du probl`me r´solu par M. Sn = y) = P0 (Sn = 2x − y). Proof. as for holding liquids. (1887). Compt. 8 88 . during sampling without replacement. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. A sample from the urn is a sequence η = (η1 . . 29 Urns and the ballot theorem Suppose that an urn8 contains n items. where ηi indicates the colour of the i-th item picked. the ashes of the dead after cremation. P0 (Mn < x. An urn is a vessel. D. If Tx ≤ n. Rend. Acad. 436-437. The set of values of η contains n elements and η is uniformly distributed a in this set. Solution d’un probl`me. which results into P0 (Tx ≤ n. as long as a ≥ b. 105. and anciently for holding ballots to be drawn. 105. The original ballot problem. Rend. ηn ). Sn = 2x − y). Acad. Since Mn < x is equivalent to Tx > n.

i−1 = q = 1 − p.i+1 = p. . . n . we can translate the original question into a question about the random walk. . the ‘black’ items correspond to − signs). . So the number of possible values of (ξ1 . It remains to show that all values are equally a likely. . conditional on the event {Sn = y}. . We have P0 (Sn = k) = n n+k 2 pi. . where y = a − (n − a) = 2a − n. Sn > 0} that the random walk remains positive. we have: P0 (ξ1 = ε1 . . so ‘azure’ items are +1 and ‘black’ items are −1. . . . . . using the definition of conditional probability and Lemma 16. . 89 . always assuming (without any loss of generality) that a ≥ b = n − a. the result follows. . a we have that a out of the n increments will be +1 and b will be −1. the sequence (ξ1 . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . . . Since an urn with parameters n. ξn ) is uniformly distributed in a set with n elements. indeed. b be defined by a + b = n. . a − b = y. . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. −n ≤ k ≤ n. conditional on {Sn = y}. . . conditional on {Sn = y}. . . . . . ηn ) an urn process with parameters n. Then. n Since y = a − (n − a) = a − b. letting a. . (ξ1 . . p(n+k)/2 q (n−k)/2 . . Consider a simple symmetric random walk (Sn . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. Let ε1 . But if Sn = y then. An urn can be realised quite easily be a random walk: Theorem 32. We need to show that. εn be elements of {−1. i ∈ Z. . Since the ‘azure’ items correspond to + signs (respectively. The values of each ξi are ±1. ξn ) is. Proof. ξn ) has a uniform distribution.) So pi. . . the sequence (ξ1 . Proof of Theorem 31.We shall call (η1 . . . . . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. +1} such that ε1 + · · · + εn = y. . Then. (The word “asymmetric” will always mean “not necessarily symmetric”. . but which is not necessarily symmetric. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. But in (30) we computed that y P0 (S1 > 0. ξn = εn ) P0 (Sn = y) 2−n 1 = = . Sn > 0 | Sn = y) = . . n n −n 2 (n + y)/2 a Thus. a. . n ≥ 0) with values in Z and let y be a positive integer.

i. and all t ≥ 0. Theorem 33. Let ϕ(s) := E0 sT1 . x ∈ Z. Lemma 18.30. In view of Lemma 19 we need to know the distribution of T1 . In view of this.} ∪ {∞}. 2. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . This random variable takes values in {0. The rest is a consequence of the strong Markov property. only the relative position of y with respect to x is needed. .d. Consider a simple random walk starting from 0. y ∈ Z. We will approach this using probability generating functions. 2. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. . namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. Lemma 19. See figure below. Proof. . For all x. we make essential use of the skip-free property. Px (Ty = t) = P0 (Ty−x = t). Consider a simple random walk starting from 0. random variables. x − 1 in order to reach x. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. 1. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. 90 . If S1 = 1 then T1 = 1. for x > 0. plus a time T1 required to go from −1 to 0 for the first time. . x ≥ 0) is also a random walk. To prove this.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. . (We tacitly assume that S0 = 0. plus a ′′ further time T1 required to go from 0 to +1 for the first time. E0 sTx = ϕ(s)x . then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. . it is no loss of generality to consider the random walk starting from 0. Corollary 20. Proof. Consider a simple random walk starting from 0. . 2qs Proof. If x > 0. The process (Tx .) Then.

T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . Corollary 21. The first time n such that Sn = −1 is the first time n such that −Sn = 1. from the strong Markov property.  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Corollary 22. Hence the previous formula applies with the roles of p and q interchanged. in order to hit 1. if x < 0 2ps Proof. 91 . Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. Consider a simple random walk starting from 0. 2ps Proof. Consider a simple random walk starting from 0. The formula is obtained by solving the quadratic. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ).Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p.

Then the probability generating function of the first return time to 0 is given by E0 s T 0 = 1 − Proof. if S1 = 1.30. Consider a simple random walk starting from 0.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. 2ps ′ =1− 30. Similarly. We again do first-step analysis. k by the binomial theorem. this was found in §30. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . For a symmetric ′ random walk. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 .3 The distribution of the first return to the origin We wish to find the exact distribution of the first return time to the origin. 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . The latter is. then we can omit the k upper limit in the sum. the same as the time required for the walk to hit 1 starting ′ from 0. 1 − 4pqs2 . and simply write (1 + x)n = k≥0 n k x . we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. in distribution. If α = n is a positive integer then n (1 + x)n = k=0 n k x . where α is a real number.2 First return to the origin Now let us ask: If the particle starts at 0. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. k 92 . For the general case. If we define n to be equal to 0 for k > n. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. how long will it take till it first returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}.

Recall that

C(n, k) = n!/(k!(n − k)!) = n(n − 1)···(n − k + 1)/k!.

It turns out that the formula is formally true for general α, and it looks exactly the same:

(1 + x)^α = Σ_{k≥0} C(α, k) x^k,

as long as we define

C(α, k) = α(α − 1)···(α − k + 1)/k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. For example, we will show that

(1 − x)^{−1} = 1 + x + x² + x³ + ···,

which is of course the all too-familiar geometric series. Indeed, by definition,

(1 − x)^{−1} = Σ_{k≥0} C(−1, k)(−x)^k.

But

C(−1, k) = (−1)(−2)···(−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,

so

(1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^{2k} x^k = Σ_{k≥0} x^k,

as needed. As another example, let us compute √(1 − x). We have

√(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2, k)(−1)^k x^k.

We can write this in more familiar symbols if we observe that

C(1/2, k)(−1)^k = −(1/(2k − 1)) C(2k, k) 4^{−k},

and this is shown by direct computation. So

√(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) (x/4)^k.

We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs²):

E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^{2k}.

But, by the definition of E0 s^{T0′},

E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps) and then

P0(T0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P0(S_{2k} = 0).
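A short exact check (ours; p = 0.3 is again an arbitrary choice): the first-return probabilities f_{2k} = P0(T0′ = 2k) also satisfy the decomposition u_{2n} = Σ_{k=1}^{n} f_{2k} u_{2n−2k}, where u_{2n} = P0(S_{2n} = 0); solving this recursion numerically reproduces the closed form just derived.

    from math import comb, isclose

    p, q = 0.3, 0.7
    N = 20
    u = [comb(2*n, n) * p**n * q**n for n in range(N + 1)]   # u_{2n} = P0(S_{2n} = 0)
    f = [0.0] * (N + 1)                                      # f_{2n} = P0(T0' = 2n)
    for n in range(1, N + 1):
        # first-return decomposition: u_{2n} = sum_{k=1}^{n} f_{2k} u_{2n-2k}
        f[n] = u[n] - sum(f[k] * u[n - k] for k in range(1, n))
    for k in range(1, N + 1):
        assert isclose(f[k], comb(2*k, k) * p**k * q**k / (2*k - 1), rel_tol=1e-9)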

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

P0 ( lim_{n→∞} Sn/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence it is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is:

P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

1 − 4pq = 1 − 4p + 4p² = (1 − 2p)²,   and hence   √(1 − 4pq) = |1 − 2p| = |p − q|.

So

P0(Tx < ∞) = ( (1 − |p − q|)/(2q) )^x,       if x > 0,
P0(Tx < ∞) = ( (1 − |p − q|)/(2p) )^{|x|},   if x < 0.    (33)

This formula can be written more neatly as follows:

Theorem 35.

P0(Tx < ∞) = ( (p/q) ∧ 1 )^x,       if x > 0,
P0(Tx < ∞) = ( (q/p) ∧ 1 )^{|x|},   if x < 0.

In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z.

Proof. If p > q then |p − q| = p − q, so 1 − |p − q| = 2q, and (33) gives

P0(Tx < ∞) = 1, if x > 0;   P0(Tx < ∞) = (q/p)^{|x|}, if x < 0,

which is what we want. If p < q then |p − q| = q − p and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then limn→∞ Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

M∞ = sup(S0, S1, S2, ...).

If we let Mn = max(S0, S1, ..., Sn), then the sequence Mn is nondecreasing. Indeed, every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞, with probability 1 because p < 1/2, and we have M∞ = limn→∞ Mn.

Theorem 36. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:

P0(M∞ ≥ x) = (p/q)^x,   x ≥ 0.

Proof.

P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x,   x ≥ 0.
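A simulation sketch (ours; the parameters are illustrative, and the horizon must be long enough for the running maximum to settle) comparing the empirical tail of M∞ with (p/q)^x:

    import random

    p, q = 0.3, 0.7
    trials, horizon = 5000, 1000
    counts = [0] * 6
    for _ in range(trials):
        s = m = 0
        for _ in range(horizon):
            s += 1 if random.random() < p else -1
            if s > m:
                m = s
        for x in range(6):
            counts[x] += (m >= x)
    for x in range(6):
        print(x, counts[x] / trials, (p / q) ** x)   # empirical vs theoretical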

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return, with probability 1, to the point where it started from. Otherwise this probability is always strictly less than 1: unless p = 1/2, with some positive probability the walk will never ever return to the point where it started from. What is this probability? We now compute its exact value.

Theorem 37. Consider a random walk with values in Z. Let T0′ = inf{n ≥ 1 : Sn = 0} be the first return time to 0. Then

P0(T0′ < ∞) = 2(p ∧ q).

Proof. The probability generating function of T0′ was obtained in Theorem 34. But

P0(T0′ < ∞) = lim_{s↑1} E0 s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. If p = q this gives 1.

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. We now compute its exact distribution. The total number of visits to state i is given by

Ji = Σ_{n=1}^{∞} 1(Sn = i),    (34)

a random variable first considered in Section 10, where it was also shown to be geometric: in Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J0 of visits to state 0 is geometrically distributed:

P0(J0 ≥ k) = (2(p ∧ q))^k,   k = 0, 1, 2, ...,
E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J0 ≥ 1 if and only if the first return time to zero is finite. So

P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so, as already known,

P0(J0 ≥ k) = P0(J0 ≥ 1)^k = (2(p ∧ q))^k,   k = 0, 1, 2, ....

This proves the first formula. For the second formula we have

E0 J0 = Σ_{k=1}^{∞} P0(J0 ≥ k) = Σ_{k=1}^{∞} (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any i:

Pi(Ji ≥ k) = (2(p ∧ q))^k,   Ei Ji = 2(p ∧ q)/(1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case, Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1.

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a random walk with values in Z. If p ≤ 1/2 then

P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1},   k ≥ 1,     E0 Jx = (p/q)^{x∨0} / (1 − 2p).

If p ≥ 1/2 then

P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},   k ≥ 1,   E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q).

Proof. Suppose x > 0. Then

P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),

simply because if Jx ≥ k ≥ 1 then Tx < ∞. Apply now the strong Markov property at Tx:

P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).

The reason that we replaced k by k − 1 is because, given that Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain

P0(Jx ≥ k) = ( (p/q) ∧ 1 )^x (2(p ∧ q))^{k−1}.

If p ≤ 1/2 then this gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}. If p ≥ 1/2 then this gives P0(Jx ≥ k) = (2q)^{k−1}. We repeat the argument for x < 0 and find

P0(Jx ≥ k) = ( (q/p) ∧ 1 )^{|x|} (2(p ∧ q))^{k−1}.

If p ≤ 1/2 then this gives P0(Jx ≥ k) = (2p)^{k−1}. If p ≥ 1/2 then this gives P0(Jx ≥ k) = (q/p)^{|x|} (2q)^{k−1}. As for the expectation, we apply E0 Jx = Σ_{k=1}^{∞} P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae, i.e.

Py(Jx ≥ k) = P0(Jx−y ≥ k),   Ey Jx = E0 Jx−y.

Interestingly, if the random walk "drifts down" (i.e. p < 1/2), then E0 Jx drops down to zero geometrically fast as x tends to infinity. But for x < 0 we have E0 Jx = 1/(1 − 2p) regardless of x. In other words, starting from 0, the walk will visit any point below 0 exactly the same number of times on the average.

32 Duality

Consider a random walk Sk = Σ_{i=1}^{k} ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (S̃k, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by

S̃k := Sn − Sn−k,   0 ≤ k ≤ n.

Theorem 40. (S̃k, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ1, ξ2, ..., ξn) has the same distribution as (ξn, ξn−1, ..., ξ1).

Thus, every time we have a probability for S̃ we can translate it into a probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk and x a positive integer. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then

P0(Tx = n) = (x/n) P0(Sn = x),   x > 0.

Proof. Rewrite the ballot theorem in terms of the dual:

P0(S̃1 > 0, ..., S̃n > 0 | S̃n = x) = x/n.

But the left-hand side is

P0(Sn − Sn−k > 0, 1 ≤ k ≤ n | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1, Sn = x) / P0(Sn = x) = P0(Tx = n) / P0(Sn = x).

Pause for reflection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:

E0 s^{Tx} = ( (1 − √(1 − 4pqs²))/(2qs) )^x,   x > 0.    (35)

Specialising to the symmetric case we have

E0 s^{Tx} = ( (1 − √(1 − s²))/s )^x.

In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:

P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).    (36)

Lastly, in Theorem 41, we derived, using duality, the probabilities

P0(Tx = n) = (x/n) P0(Sn = x),   n ≥ 1.    (37)

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of sn in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
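While the algebraic verification is not easy, a numerical cross-check is: the following Python fragment (ours; x and N are arbitrary small values) confirms that summing the point probabilities (37) up to N reproduces the distribution function (36) for the symmetric walk.

    from math import comb

    def p_sn(n, x):    # P0(Sn = x) for the simple symmetric random walk
        if (n + x) % 2 or abs(x) > n:
            return 0.0
        return comb(n, (n + x) // 2) * 2.0**(-n)

    x, N = 2, 14
    lhs = sum((x / n) * p_sn(n, x) for n in range(1, N + 1))      # summing (37)
    rhs = (sum(p_sn(N, y) for y in range(x, N + 1))               # P0(|S_N| >= x) ...
           + sum(p_sn(N, y) for y in range(-N, -x + 1))
           - 0.5 * (p_sn(N, x) + p_sn(N, -x)))                    # ... minus (1/2) P0(|S_N| = x)
    assert abs(lhs - rhs) < 1e-12                                 # both equal P0(Tx <= N), as in (36)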

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+(2n) be the number of sides of the polygonal line that are above the horizontal axis and T−(2n) = 2n − T+(2n) the number of sides that are below. Clearly, T±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, the most likely values of T+(2n) are m = 0 and m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

P(T+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0),   0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:

P(T+(2n) = 2k) = Σ_{r=1}^{n} P(T+(2n) = 2k, T0′ = 2r).

Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write

P(T+(2n) = 2k) = Σ_{r=1}^{n} P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
               + Σ_{r=1}^{n} P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).

In the first case,

P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) − 2r = 2k − 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),

because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis and so T+(2n) is reduced by 2r. Furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows

P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k),

since now the first 2r units of time are spent below the axis. We also observe that, by symmetry,

P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) =: (1/2) f_{2r}.

Letting p_{2k,2n} := P(T+(2n) = 2k), we have shown that

p_{2k,2n} = (1/2) Σ_{r=1}^{k} f_{2r} p_{2k−2r,2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k,2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} := P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we realise that

p_{2k,2n} = u_{2n−2k} (1/2) Σ_{r=1}^{k} f_{2r} u_{2k−2r} + u_{2k} (1/2) Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r}.

By the first-return decomposition u_{2m} = Σ_{r=1}^{m} f_{2r} u_{2m−2r}, the first sum equals u_{2k} and the second equals u_{2n−2k}, so

p_{2k,2n} = (1/2) u_{2n−2k} u_{2k} + (1/2) u_{2k} u_{2n−2k} = u_{2k} u_{2n−2k},

completing the induction. Explicitly, we have

P(T+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^{−2n},   0 ≤ k ≤ n.    (38)

With 2n = 100, we plot the function P(T+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Plot of P(T+(100) = 2k) against k, 0 ≤ k ≤ 50: the values range between about 0.02 in the middle and about 0.08 at the ends, giving a U-shaped curve with maxima at k = 0 and k = 50.]
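The plotted values are immediate to reproduce from (38); a minimal Python computation (ours) confirms that the maxima sit at the endpoints:

    from math import comb

    n = 50    # 2n = 100
    probs = [comb(2*k, k) * comb(2*n - 2*k, n - k) * 2.0**(-2*n) for k in range(n + 1)]
    print(probs[0], probs[n // 2], probs[n])   # about 0.0796 at the ends, 0.0126 in the middle
    assert max(probs) == probs[0] == probs[n]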

The arcsine law*
Use of Stirling's approximation in formula (38) gives

P(T+(2n) = 2k) ∼ 1 / (π √(k(n − k))),   as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

P( T+(2n)/2n ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T+(2n)/2n = k/n )
                        ≈ Σ_{a ≤ k/n ≤ b} 1/(π √(k(n − k)))
                        = Σ_{a ≤ k/n ≤ b} (1/n) · 1/(π √((k/n)(1 − k/n))).

This is easily seen to have a limit as n → ∞, the limit being

∫_a^b dt / (π √(t(1 − t))),

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:

lim_{n→∞} P( T+(2n)/2n ≤ x ) = ∫_0^x dt / (π √(t(1 − t))) = (2/π) arcsin √x.

We thus have that the limiting distribution of T+(2n)/2n (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.
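A simulation sketch (ours; the trial and horizon sizes are illustrative) comparing the empirical distribution of the fraction of time positive with (2/π) arcsin √x:

    import math, random

    def fraction_positive(two_n):
        s, pos = 0, 0
        for _ in range(two_n):
            step = 1 if random.random() < 0.5 else -1
            pos += (s > 0) or (s == 0 and step == 1)   # this side lies above the axis
            s += step
        return pos / two_n

    samples = sorted(fraction_positive(1000) for _ in range(2000))
    for x in (0.1, 0.5, 0.9):
        empirical = sum(f <= x for f in samples) / len(samples)
        print(x, empirical, 2 / math.pi * math.asin(math.sqrt(x)))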

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk because the summands in Tx = Σ_{y=1}^{x} (Ty − Ty−1) are i.i.d., all distributed as T1:

P(T1 ≥ 2n) = P(S_{2n} = 0) = C(2n, n) 2^{−2n} ∼ 1/√(πn).

From this, we immediately get that the expectation of T1 is infinity:

ET1 = ∞.

Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z: Let ξ1, ξ2, ... be i.i.d. random vectors such that

P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4.

We then let S0 = (0, 0) and Sn = Σ_{i=1}^{n} ξi, n ≥ 1. If x ∈ Z², we let x¹ be its horizontal coordinate and x² its vertical, so that Sn = (Sn¹, Sn²).

Lemma 20. (Sn¹, n ∈ Z+) is a random walk, and so is (Sn², n ∈ Z+); however, the two random walks are not independent.

Proof. Sn¹ = Σ_{i=1}^{n} ξi¹ and, since ξ1, ξ2, ... are i.i.d., so are ξ1¹, ξ2¹, ...; hence Sn¹ is a sum of i.i.d. random variables and hence a random walk. They are also identically distributed with

P(ξn¹ = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
P(ξn¹ = 1) = P(ξn = (1, 0)) = 1/4,
P(ξn¹ = −1) = P(ξn = (−1, 0)) = 1/4.

It is not a simple random walk because there is a possibility that the increment takes the value 0. The same argument applies to (Sn², n ∈ Z+), showing that it, too, is a random walk. However, the two random walks are not independent: for each n, the random variables ξn¹, ξn² are not independent, simply because if one takes value 0 the other is nonzero.

Consider a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x¹, x²) into (x¹ + x², x¹ − x²). So let

Rn = (Rn¹, Rn²) := (Sn¹ + Sn², Sn¹ − Sn²).

We have:

Lemma 21. (Rn¹, n ∈ Z+) and (Rn², n ∈ Z+) are independent simple symmetric random walks in one dimension.

Proof. We have Rn¹ = Σ_{i=1}^{n} (ξi¹ + ξi²) and Rn² = Σ_{i=1}^{n} (ξi¹ − ξi²). So Rn = Σ_{i=1}^{n} ηi, where ηn := (ηn¹, ηn²) = (ξn¹ + ξn², ξn¹ − ξn²). The sequence of vectors η1, η2, ..., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, ..., is i.i.d. too. Hence (Rn, n ∈ Z+) is a random walk in two dimensions, and a similar argument shows that each of (Rn¹, n ∈ Z+), (Rn², n ∈ Z+) is a random walk in one dimension. The possible values of each of ηn¹, ηn² are ±1, and

P(ηn¹ = 1, ηn² = 1) = P(ξn = (1, 0)) = 1/4,
P(ηn¹ = 1, ηn² = −1) = P(ξn = (0, 1)) = 1/4,
P(ηn¹ = −1, ηn² = 1) = P(ξn = (0, −1)) = 1/4,
P(ηn¹ = −1, ηn² = −1) = P(ξn = (−1, 0)) = 1/4.

Hence P(ηn¹ = 1) = P(ηn¹ = −1) = 1/2 and, likewise, P(ηn² = 1) = P(ηn² = −1) = 1/2, so each walk is simple symmetric. To show that the two walks are independent, we check that

P(ηn¹ = ε, ηn² = ε′) = P(ηn¹ = ε) P(ηn² = ε′)   for all choices of ε, ε′ ∈ {−1, +1},

and this holds because both sides equal 1/4.

Lemma 22.

P(S_{2n} = 0) = ( C(2n, n) )² 2^{−4n}.

Proof. Since Sn = 0 if and only if Rn = 0,

P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}¹ = 0, R_{2n}² = 0) = P(R_{2n}¹ = 0) P(R_{2n}² = 0) = ( C(2n, n) 2^{−2n} )²,

where the third equality follows from the independence of the two random walks.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^{∞} P(Sn = 0) = ∞. But, by the previous lemma and Stirling's approximation, P(S_{2n} = 0) ∼ c/n as n → ∞, where c is a positive constant (in the sense that the ratio of the two sides goes to 1). Hence the series diverges.
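An exact check (ours) of Lemma 22 by dynamic programming over the two-dimensional step distribution for small n:

    from math import comb

    steps = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    dist = {(0, 0): 1.0}
    for two_n in range(1, 13):
        new = {}
        for (x, y), pr in dist.items():
            for dx, dy in steps:
                new[(x + dx, y + dy)] = new.get((x + dx, y + dy), 0.0) + pr / 4
        dist = new
        if two_n % 2 == 0:
            n = two_n // 2
            assert abs(dist[(0, 0)] - comb(2*n, n)**2 * 2.0**(-4*n)) < 1e-12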

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let a < 0 < b. Recall, from Theorem 7, that

P(Ta < Tb) = b/(b − a),   P(Tb < Ta) = −a/(b − a).

We can read this as follows:

P(S_{Ta∧Tb} = a) = b/(b − a),   P(S_{Ta∧Tb} = b) = −a/(b − a).

This last display tells us the distribution of S_{Ta∧Tb}. Note, in particular, that E S_{Ta∧Tb} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b, with a < 0 < b, such that pa + (1 − p)b = 0. Indeed, if p = M/N, M < N, M, N ∈ N, choose a = M − N, b = M. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability it exits it from the left point is p, and the probability it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z) on Z with zero mean: Σ_{i∈Z} i pi = 0. Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that

P(ST = i) = pi,   i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ ((−N) × N) with the following distribution:

P(A = i, B = j) = c⁻¹ (j − i) pi pj,   i < 0 < j,
P(A = 0, B = 0) = p0,

where c⁻¹ is chosen by normalisation:

P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.

This leads to

c⁻¹ Σ_{j>0} j pj Σ_{i<0} pi + c⁻¹ Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.

Since Σ_{i∈Z} i pi = 0, we have the common value Σ_{i<0} (−i) pi = Σ_{i>0} i pi and, using this, we get that c is precisely this common value:

c = Σ_{i<0} (−i) pi = Σ_{i>0} i pi.

Letting Ti = inf{n ≥ 0 : Sn = i}, and assuming that (A, B) are independent of (Sn), we consider the random variable

Z := S_{TA∧TB}.

The claim is that P(Z = k) = pk, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p0.

Suppose next k > 0. We have

P(Z = k) = P(S_{TA∧TB} = k) = Σ_{i<0} Σ_{j>0} P(S_{Ti∧Tj} = k) P(A = i, B = j)
         = Σ_{i<0} P(S_{Ti∧Tk} = k) P(A = i, B = k)
         = Σ_{i<0} ( (−i)/(k − i) ) c⁻¹ (k − i) pi pk
         = c⁻¹ Σ_{i<0} (−i) pi pk = pk.

Similarly, P(Z = k) = pk for k < 0.
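A simulation sketch (ours; the target distribution below is an arbitrary zero-mean example) of the embedding: sample (A, B), run a SSRW until it exits (A, B), and check that S_T approximately has the target law.

    import random

    target = {-2: 0.2, -1: 0.1, 0: 0.3, 1: 0.3, 2: 0.1}   # zero mean
    neg = {i: pi for i, pi in target.items() if i < 0}
    pos = {j: pj for j, pj in target.items() if j > 0}
    c = sum(-i * pi for i, pi in neg.items())             # = sum_{j>0} j p_j

    def sample_AB():
        if random.random() < target.get(0, 0.0):
            return 0, 0
        pairs = [((i, j), (j - i) * pi * pj / c)
                 for i, pi in neg.items() for j, pj in pos.items()]
        r = random.random() * sum(w for _, w in pairs)    # conditional on not (0, 0)
        for (i, j), w in pairs:
            r -= w
            if r <= 0:
                return i, j
        return pairs[-1][0]

    counts = {}
    for _ in range(50000):
        a, b = sample_AB()
        s = 0
        while not (s == a or s == b):                     # a = b = 0 stops at once
            s += random.choice((-1, 1))
        counts[s] = counts.get(s, 0) + 1
    print({k: counts.get(k, 0) / 50000 for k in sorted(target)})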

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

N = {1, 2, 3, ...}.

The set Z of integers consists of N, their negatives and zero:

Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}.

Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a, d | b; (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.)

Notice that:

gcd(a, b) = gcd(a, b − a).

Indeed, let d = gcd(a, b). Then d | a and d | b. Therefore d | b − a. So (i) holds for the pair (a, b − a). Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. For instance,

gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26)
             = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.

But we can work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school). Indeed 4 × 38 > 120, so 3 is the maximum number of times that 38 "fits" into 120, and in the first line above we subtracted 38 exactly 3 times. So we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest, and similarly with the second line:

gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

The theorem of division says the following: Given two positive integers a, b, with a > b, we can find an integer q and an integer r ∈ {0, 1, ..., b − 1} such that

a = qb + r.

The long division algorithm is a process for finding the quotient q and the remainder r. For example, dividing 3450 by 26 gives quotient q = 132 and remainder r = 18:

3450 = 132 × 26 + 18.

Here are some other facts about the greatest common divisor: First, if d = gcd(a, b), then there are integers x, y such that

d = xa + yb.

To see this, let us follow the previous example again. In the process, we used the divisions

120 = 3 × 38 + 6,
38 = 6 × 6 + 2.

Working backwards, we have

2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.

Thus, gcd(38, 120) = 38x + 120y, with x = 19, y = −6. The same logic works in general.
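This "working backwards" can be packaged into the classical extended Euclid routine; the following Python sketch (ours, not part of the notes) returns gcd(a, b) together with the coefficients x, y such that d = xa + yb.

    def extended_gcd(a, b):
        # returns (g, x, y) with g = gcd(a, b) = x*a + y*b
        if b == 0:
            return a, 1, 0
        q, r = divmod(a, b)
        g, x, y = extended_gcd(b, r)
        return g, y, x - q * y

    print(extended_gcd(120, 38))   # (2, -6, 19): 2 = -6*120 + 19*38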

Second, we can show that, for integers a, b, not both zero,

gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.

To see why this is true, let d be the right-hand side. Then d = xa + yb for some x, y ∈ Z. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that d | a and d | b. Suppose this is not true, say that d does not divide b. Since d > 0, we can divide b by d to find b = qd + r, for some 1 ≤ r ≤ d − 1. But then

b = q(xa + yb) + r,

yielding r = (−qx)a + (1 − qy)b. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). So d is not minimum as claimed, and this is a contradiction.

The greatest common divisor of 3 integers can now be defined in a similar manner. In general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. In general, a relation can be thought of as a set of pairs of elements of the set A. A relation is called reflexive if it contains pairs (x, x). The equality relation = is reflexive; the relation < is not. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The equality relation is symmetric; the relation < is not. A relation is transitive if, whenever it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric; it can be thought of as reflexive (unless we do not want to accept the fact that one is a friend of oneself), but is not necessarily transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

We denote a function f from a set X into a set Y by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f⁻¹(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

We may think of the elements of A as animals and those of B as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names; so does the second, and the third, and so on. Hence the total number of assignments is

m × m × ··· × m (n times);

in other words, the number of functions from A to B (i.e. the cardinality of the set B^A) is m^n.

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is

(m)_n := m(m − 1)(m − 2)···(m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A. Incidentally, we denote (n)_n also by n!; in other words, n! stands for the product of the first n natural numbers:

n! = 1 × 2 × 3 × ··· × n,

and n! is the total number of permutations of a set of n elements.

Any set A contains several subsets. The number of subsets with k elements each can be found by designating which of the n elements of A will be chosen. The first element can be chosen in n ways, the second in n − 1, and so on, the k-th in n − k + 1 ways. So we have (n)_k = n(n − 1)(n − 2)···(n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

(n)_k / k! =: C(n, k),

where the symbol C(n, k) is merely a name for the number computed on the left side of this equation. Clearly,

C(n, k) = n!/(k!(n − k)!),   C(n, k) = C(n, n − k).

Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1. The total number of subsets of A (regardless of their number of elements) is thus equal to

Σ_{k=0}^{n} C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem. We thus observe that there are as many subsets in A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.

We shall let C(n, k) = 0 if n < k. The Pascal triangle relation states that

C(n + 1, k) = C(n, k) + C(n, k − 1).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is included in the sample, then there are k − 1 remaining animals to be chosen from the other n, and this can be done in C(n, k − 1) ways. If the pig is not included, we choose k animals from the remaining n ones, and this can be done in C(n, k) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Also, if x is a real number and m a nonnegative integer, we define

C(x, m) = x(x − 1)···(x − m + 1)/m!.

If y is not a nonnegative integer then, by convention, we let C(x, y) = 0.

Maximum and minimum

The minimum of two numbers a, b is denoted by

a ∧ b = min(a, b).

Thus a ∧ b = a if a ≤ b, or = b if b ≤ a. The maximum is denoted by

a ∨ b = max(a, b).

The notation extends to more than two numbers: for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c).

In particular, if a is a real number, we use the following notations:

a+ = max(a, 0),   a− = −min(a, 0).

a+ is called the positive part of a, and a− is called the negative part of a. Both quantities are nonnegative. Note that it is always true that

a = a+ − a−.

The absolute value of a is defined by

|a| = a+ + a−.

Indicator function

1_A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that

1_{A∩B} = 1_A 1_B,   1_{A∪B} = 1_A + 1_B − 1_{A∩B},   1_{A^c} = 1 − 1_A.

If A is an event then 1_A is a random variable. It takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is

E 1_A = 1 × P(A) + 0 × P(A^c) = P(A).

An example of its application: Since

1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1_A)(1 − 1_B) = 1_A + 1_B − 1_A 1_B,

we have E[1_{A∪B}] = E[1_A] + E[1_B] − E[1_{A∩B}], which means

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Several quantities of interest are expressed in terms of indicators. For instance, if I go to a bank N times and Ai is the clause "I am robbed the i-th time", then Σ_{i=1}^{N} 1_{Ai} is the total number of times I am robbed. Thus, if A1, A2, ... are sets or clauses, then the number Σ_{n≥1} 1_{An} is the number of true clauses.

Another example: Let X be a nonnegative random variable with values in N = {1, 2, ...}. Then

X = Σ_{n=1}^{X} 1   (the sum of X ones is X),

so

X = Σ_{n≥1} 1(X ≥ n),

and therefore

E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that

ϕ(αx + βy) = αϕ(x) + βϕ(y),

for all x, y ∈ R^n and all α, β ∈ R.

Let u1, ..., un be a basis for R^n. Then any x ∈ R^n can be uniquely written as

x = Σ_{j=1}^{n} x^j u_j,

where x^j ∈ R. The column (x^1, ..., x^n)^T is a column representation of x with respect to the given basis. Then, because ϕ is linear,

y := ϕ(x) = Σ_{j=1}^{n} x^j ϕ(u_j).

But the ϕ(u_j) are vectors in R^m. So if we choose a basis v1, ..., vm in R^m, we can express each ϕ(u_j) as a linear combination of the basis vectors:

ϕ(u_j) = Σ_{i=1}^{m} a_{ij} v_i,   j = 1, ..., n.

Combining, we have

ϕ(x) = Σ_{i=1}^{m} v_i Σ_{j=1}^{n} a_{ij} x^j.

Let y^i := Σ_{j=1}^{n} a_{ij} x^j, i = 1, ..., m. The column (y^1, ..., y^m)^T is a column representation of y := ϕ(x) with respect to the basis v1, ..., vm. The matrix A of ϕ with respect to the choice of the two bases is defined by

A := ( a_{ij} ),   1 ≤ i ≤ m,  1 ≤ j ≤ n.

It is an m × n matrix. The relation between the column representation of y and the column representation of x is written compactly as y = Ax, i.e.

y^i = Σ_{j=1}^{n} a_{ij} x^j,   i = 1, ..., m.

Suppose that ψ, ϕ are linear functions:

R^n →(ψ) R^m →(ϕ) R^ℓ.

Then ϕ ∘ ψ is a linear function: R^n →(ϕ∘ψ) R^ℓ.

Fix three sets of bases on R^n, R^m and R^ℓ and let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. So A is an ℓ × m matrix, B is an m × n matrix. Then the matrix representation of ϕ ∘ ψ is AB, where

(AB)_{ij} = Σ_{k=1}^{m} A_{ik} B_{kj},   1 ≤ i ≤ ℓ,  1 ≤ j ≤ n,

and AB is an ℓ × n matrix. A matrix A is called square if it is an n × n matrix. If we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n; i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in R^d consists of the vectors

e1 = (1, 0, ..., 0)^T,  e2 = (0, 1, ..., 0)^T,  ...,  ed = (0, 0, ..., 1)^T;

in other words, e_i^j = 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence a1, a2, ... of numbers), then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist. For instance I = {1, 1/2, 1/3, ...} has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists, and it is this that we call the infimum. Similarly, we define the supremum sup I to be the minimum of all upper bounds. inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Likewise, sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞. If I is empty then, by convention, inf ∅ = +∞, sup ∅ = −∞.

Limsup and liminf

If x1, x2, ... is a sequence of numbers, then

inf{x1, x2, ...} ≤ inf{x2, x3, ...} ≤ inf{x3, x4, ...} ≤ ···

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it lim inf xn (limit inferior) of the sequence. So

lim inf xn = lim_{n→∞} inf{xn, xn+1, xn+2, ...}.

Similarly, we define lim sup xn. We note that lim inf(−xn) = − lim sup xn. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely: If A1, A2, ... are mutually disjoint events, then

P( ∪_{n≥1} An ) = Σ_{n≥1} P(An).

Everything else follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.) Of course, I do sometimes "cheat" from a strict mathematical point of view, in these notes and everywhere else, by not requiring this, but not too much.

If the events A, B are disjoint then P(A ∪ B) = P(A) + P(B). This can be generalised to any finite number of events. If A, B are not disjoint then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

This is the inclusion-exclusion formula for two events. If the events A1, A2, ... are increasing with A = ∪n An, then P(A) = limn→∞ P(An). Similarly, if the events A1, A2, ... are decreasing with A = ∩n An, then P(A) = limn→∞ P(An). Also,

P( ∪n An ) ≤ Σn P(An).

If An are events with P(An) = 0 then P(∪n An) = 0. Similarly, if An are events with P(An) = 1 then P(∩n An) = 1.

Quite often, to compute P(A), it is useful to find disjoint events B1, B2, ... such that ∪n Bn = Ω, in which case we write

P(A) = Σn P(A ∩ Bn) = Σn P(A | Bn) P(Bn).

In Markov chain theory, and everywhere else, we work a lot with conditional probabilities. Recall that P(B|A) is defined by P(A ∩ B)/P(A), and this is a very motivated definition. Therefore, to compute P(A ∩ B) we read the definition as

P(A ∩ B) = P(A) P(B|A).

This is generalisable to a number of events:

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2).

If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B), and this is called Bayes' theorem.

Often, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B). Similarly, to show that EX ≤ EY, we often need to argue that X ≤ Y.

If A is an event then 1_A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 on A^c. Note that 1_A 1_B = 1_{A∩B}, 1_{A^c} = 1 − 1_A, and E 1_A = P(A).

Markov's inequality for a positive random variable X states that

P(X > x) ≤ EX/x.

This follows from the inequality x·1(X > x) ≤ X, which holds since X is positive, by taking expectations of both sides.

An application of this is:

P(X < ∞) = lim_{x→∞} P(X ≤ x).

The expectation of a nonnegative random variable X may be defined by

EX = ∫_0^∞ P(X > x) dx.

The formula holds for any nonnegative random variable. For a random variable which takes positive and negative values, we define

X+ = max(X, 0),   X− = −min(X, 0),

which are both nonnegative, and then define

EX = EX+ − EX−,

which is motivated from the algebraic identity X = X+ − X−. This definition makes sense only if EX+ < ∞ or EX− < ∞. In case the random variable is discrete, the formula reduces to the known one, viz. EX = Σ_x x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^{∞} x f(x) dx. For a random variable that takes values in N we can write

EX = Σ_{n=1}^{∞} P(X ≥ n).

This is due to the fact that X = Σ_{n=1}^{∞} 1(X ≥ n) and E 1(X ≥ n) = P(X ≥ n).

In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

Generating functions

A generating function of a sequence a0, a1, a2, ... is an infinite polynomial having the ai as coefficients:

G(s) = a0 + a1 s + a2 s² + ···.

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. If it exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). Note that |G(s)| ≤ Σn |an| |s|^n, and this is ≤ Σn |an| if s ∈ [−1, 1]. So if Σn |an| < ∞ then G(s) exists for all s ∈ [−1, 1], making it a useful object when the a0, a1, ... is a probability distribution on Z+ = {0, 1, 2, ...}, because then Σn an = 1. As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is

Σ_{n=0}^{∞} ρ^n s^n = 1/(1 − ρs),   as long as |ρs| < 1, i.e. |s| < 1/|ρ|.    (39)

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence pn = P(X = n), n = 0, 1, ..., is called the probability generating function of X:

ϕ(s) = Es^X = Σ_{n=0}^{∞} s^n P(X = n).

This is defined for all −1 < s < 1. Note that

Σ_{n=0}^{∞} pn ≤ 1   and   p∞ := P(X = ∞) = 1 − Σ_{n=0}^{∞} pn.

We do allow X to, possibly, take the value +∞; the term s^∞ p∞ could formally appear in the sum defining ϕ(s), but, since |s| < 1, we have s^∞ = 0. Note that ϕ(0) = P(X = 0), while

ϕ(1) = Σ_{n=0}^{∞} P(X = n) = P(X < +∞),

and this is why ϕ is called probability generating function. We can recover the probabilities pn from the formula

pn = ϕ^{(n)}(0)/n!,

as follows from Taylor's theorem. Also, we can recover the expectation of X from

EX = lim_{s↑1} ϕ′(s),

something that can be formally seen from

(d/ds) Es^X = E (d/ds) s^X = E X s^{X−1},

and by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

x_{n+2} = x_{n+1} + x_n,   n ≥ 0,   with x0 = x1 = 1.

If we let X(s) = Σ_{n≥0} xn s^n be the generating function of (xn, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is

Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x0),

and the generating function of (x_{n+2}, n ≥ 0) is

Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x0 − s x1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (xn, n ≥ 0):

s^{−2}(X(s) − x0 − s x1) = s^{−1}(X(s) − x0) + X(s).

Since x0 = x1 = 1, we can solve for X(s) and find

X(s) = −1/(s² + s − 1).

But the polynomial s² + s − 1 has two roots:

a = (√5 − 1)/2,   b = −(√5 + 1)/2.

Hence s² + s − 1 = (s − a)(s − b), and so

X(s) = (1/(a − b)) ( 1/(s − b) − 1/(s − a) ) = (1/(b − a)) ( 1/(b − s) − 1/(a − s) ).

But (39) tells us that ρ^n has generating function 1/(1 − ρs) = (1/ρ) / ((1/ρ) − s). Thus 1/((1/ρ) − s) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1} and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of

xn = (1/(b − a)) ( (1/b)^{n+1} − (1/a)^{n+1} ).

Noting that ab = −1, we can also write this as

xn = ( (−a)^{n+1} − (−b)^{n+1} ) / (b − a),

which is the solution to the recursion, i.e. the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
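A one-minute check (ours) of the closed form against the recursion:

    import math

    sqrt5 = math.sqrt(5)
    a = (sqrt5 - 1) / 2
    b = -(sqrt5 + 1) / 2

    x = [1, 1]
    for n in range(2, 20):
        x.append(x[-1] + x[-2])          # the recursion x_{n+2} = x_{n+1} + x_n
    for n in range(20):
        closed = ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)
        assert abs(closed - x[n]) < 1e-9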

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, then the so-called distribution function, namely the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density

f(x) = (d/dx) P(X ≤ x),

instead. All these objects can be referred to as "distribution" or "law" of X. Even the generating function Es^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × ··· × n for large n then, since the number is so big, it's better to use logarithms:

n! = e^{log n!} = exp( Σ_{k=1}^{n} log k ).

Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that

∫_k^{k+1} log x dx ≈ log k.

Hence

Σ_{k=1}^{n} log k ≈ ∫_1^n log x dx.

Since (d/dx)(x log x − x) = log x, it follows that

∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. It turns out to be a good one.
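A quick numeric look (ours) at how the rough approximation behaves on a log scale, alongside the refined version with the √(2πn) factor proved below:

    import math

    for n in (10, 100, 1000):
        exact = math.lgamma(n + 1)                        # log n!
        rough = n * math.log(n) - n                       # log of n^n e^{-n}
        full = rough + 0.5 * math.log(2 * math.pi * n)    # log of n^n e^{-n} sqrt(2 pi n)
        print(n, exact, rough, full)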

But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. (b) The approximation is not good when we consider ratios of products. For example (try it!) the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads equals

C(2n, n) 2^{−2n} ≈ 1,

and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:

n! ∼ n^n e^{−n} √(2πn),

where the symbol an ∼ bn means that an/bn → 1, as n → ∞.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely, you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Consider an increasing and nonnegative function g : R → R+. We can capture both properties of g neatly by writing:

For all x, y ∈ R,   g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity. And if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. Let X be a random variable. Then, for all x,

g(X) ≥ g(x) 1(X ≥ x),

and so, by taking expectations of both sides (expectation is an increasing operator),

E g(X) ≥ g(x) P(X ≥ x).

This completely trivial thought gives the very useful Markov's inequality.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:

P(X − n ≥ m | X ≥ n) = P(X ≥ m),   m, n ∈ Z+.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. The memoryless property shows that a geometric random variable has a distribution of the form

P(X ≥ n) = λ^n,   n ∈ Z+,

where λ = P(X ≥ 1). We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the random variable

J0 = Σ_{n=1}^{∞} 1(Sn = 0),

defined in (34). It was shown in Theorem 38 that J0 satisfies the memoryless property; in fact it was shown there that P(J0 ≥ k) = P(J0 ≥ 1)^k, a geometric function of k.

Bibliography

1. Brémaud, P. (1997) An Introduction to Probabilistic Modeling. Springer.
2. Brémaud, P. (1999) Markov Chains. Springer.
3. Chung, K. L. and AitSahlia, F. (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller, W. (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead, C. M. and Snell, J. L. (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
6. Norris, J. R. (1998) Markov Chains. Cambridge University Press.
7. Parry, W. (1981) Topics in Ergodic Theory. Cambridge University Press.

