Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos
Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1 Introduction

2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains

3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains

4 Stationarity
  4.1 Finding a stationary distribution

5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period

6 Hitting times and first-step analysis

7 Gambler's ruin

8 Stopping times and the strong Markov property

9 Regenerative structure of a Markov chain

10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience

11 Positive recurrence

12 Law (=Theorem) of Large Numbers in Probability Theory

13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward

14 Construction of a stationary distribution

15 Positive recurrence is a class property

16 Uniqueness of the stationary distribution

17 Structure of a stationary distribution

18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains

19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case

20 Ergodic theorems

21 Finite chains

22 Time reversibility

23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain

24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application

25 Random walk

26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all

27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero

28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum

29 Urns and the ballot theorem

30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin

31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state

32 Duality

33 Amount of time that a SSRW is positive*

34 Simple symmetric random walk in two dimensions

35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. At the end, I have a little mathematical appendix. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. A few starred sections should be considered "advanced" and can be omitted at first reading. I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Here, x = x(t) denotes the position of the particle at time t. Its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). Newton's second law says that the acceleration is proportional to the force: mẍ = F. The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

    xn+1 = yn
    yn+1 = rem(xn, yn),   n = 0, 1, 2, . . .

By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn.
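The Euclidean recursion above is easy to run on a computer; here is a minimal sketch (Python; the function name gcd_chain is our own, not from the text):

```python
def gcd_chain(a, b):
    """Iterate the deterministic dynamical system (x, y) -> (y, rem(x, y)).

    Yields the successive states X_n = (x_n, y_n); the last state
    contains gcd(a, b) in its first coordinate.
    """
    x, y = max(a, b), min(a, b)
    while y > 0:
        yield (x, y)
        x, y = y, x % y          # x_{n+1} = y_n, y_{n+1} = rem(x_n, y_n)
    yield (x, y)                 # terminal state: (gcd(a, b), 0)

print(list(gcd_chain(84, 30)))   # [(84, 30), (30, 24), (24, 6), (6, 0)]
```

Each state determines the entire future of the sequence, which is exactly the dynamical-system property described above.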

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫_a^b f(x) dx of a complicated function f: suppose that f is positive and bounded above by B. Choose a pair (X0, Y0) of random numbers, a ≤ X0 ≤ b, 0 ≤ Y0 ≤ B, uniformly. Repeat with (Xn, Yn), for steps n = 1, 2, . . .. Each time count 1 if Yn < f(Xn) and 0 otherwise. Perform this a large number N of times. Count the number of 1's and divide by N. The resulting ratio is an approximation of ∫_a^b f(x) dx divided by B(b − a); multiplying it by B(b − a) gives an approximation of the integral itself. Try this on the computer!

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables, instead of deterministic objects. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for, via the use of the tools of Probability.

¹ Stochastic means random.
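A quick illustration of the Monte Carlo method just described (a sketch in Python; the test function f and the bound B are our own choices):

```python
import random

def monte_carlo_integral(f, a, b, B, N=100_000):
    """Estimate the integral of f over [a, b] by dart throwing.

    f is assumed positive and bounded above by B on [a, b].
    The fraction of points (X, Y) falling under the graph of f
    approximates integral / (B * (b - a)).
    """
    hits = 0
    for _ in range(N):
        x = random.uniform(a, b)
        y = random.uniform(0, B)
        if y < f(x):
            hits += 1
    return (hits / N) * B * (b - a)

# Example: the integral of x^2 on [0, 1] is 1/3; B = 1 bounds f there.
print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))
```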

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A mouse lives in the cage. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1 it is either still in 1 or has moved to 2, regardless of where it was at earlier times, i.e. past states are not needed. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99. We can summarise this information by the transition diagram:

    [Transition diagram: states 1 and 2; arrow 1 → 2 labelled α, self-loop at 1 labelled 1 − α, arrow 2 → 1 labelled β, self-loop at 2 labelled 1 − β.]

Another way to summarise the information is by the 2×2 transition probability matrix

    P = ( 1 − α    α   )  =  ( 0.95  0.05 )
        (   β    1 − β )     ( 0.99  0.01 )

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

    Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, . . . are independent random variables with common distribution FD, that S0, S1, S2, . . . are independent random variables with common distribution FS, and that X0, (Dn), (Sn) are independent of one another. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is defined by:

    px,y := P(Xn+1 = y | Xn = x),   x, y = 0, 1, 2, . . .

We easily see that if y > 0 then

    px,y = P(Dn − Sn = y − x)
         = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x)
         = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x)
         = Σ_{z=0}^∞ FS(z) FD(z + y − x),

something that, in principle, can be computed from FS and FD. Of course, if y = 0, we have

    px,0 = 1 − Σ_{y=1}^∞ px,y.
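To make the formula concrete, here is a small sketch (Python) that builds these transition probabilities from given probability mass functions for deposits and spending; the truncation level and the example pmfs are our own illustrative choices:

```python
def transition_prob(x, y, FD, FS, zmax=200):
    """p_{x,y} for the account chain X_{n+1} = max(X_n + D_n - S_n, 0).

    FD(k), FS(k) are the pmfs of a deposit and of desired spending.
    For y > 0 this is the convolution sum from the text; p_{x,0} is
    obtained by complementation (truncated at zmax).
    """
    if y > 0:
        return sum(FS(z) * FD(z + y - x) for z in range(zmax + 1))
    return 1.0 - sum(transition_prob(x, yy, FD, FS, zmax)
                     for yy in range(1, x + zmax + 1))

# Example: deposits and spending both uniform on {0, 1, 2}.
FD = FS = lambda k: 1/3 if k in (0, 1, 2) else 0.0
print(transition_prob(1, 2, FD, FS))   # P(D - S = 1) = 2/9
```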

The transition probability matrix is

    P = ( p0,0  p0,1  p0,2  ··· )
        ( p1,0  p1,1  p1,2  ··· )
        ( p2,0  p2,1  p2,2  ··· )
        (  ···   ···   ···  ··· )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?
We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with the same probability that he moves backwards:

    [Transition diagram: states . . . , −2, −1, 0, 1, 2, 3, . . . ; from each state the walker moves one step to the left or one step to the right, with probability 1/2 each.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
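A short simulation sketch (Python) for the biased walk just mentioned; the target, the bound on steps and the value of p are illustrative choices:

```python
import random

def walk_hitting_time(start=0, target=100, p=0.5, max_steps=10_000_000):
    """Simulate the drunkard until he reaches `target`; return the number
    of steps taken, or None if `max_steps` is exceeded."""
    pos, steps = start, 0
    while pos != target and steps < max_steps:
        pos += 1 if random.random() < p else -1
        steps += 1
    return steps if pos == target else None

# With p > 1/2 the walk drifts right and the pub-to-home time has finite
# mean; with p = 1/2 the target is still reached eventually, but the
# expected time is infinite, so single runs can be extremely long.
print(walk_hitting_time(p=0.6))
```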

    [Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

    [Transition diagram: states H (healthy), S (sick), D (dead); arrows H → S with probability 0.3, S → H with probability 0.8, H → D with probability 0.01, S → D with probability 0.1.]

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H = 1 − pH,S − pH,D, pS,S = 1 − pS,H − pS,D, and pD,D = 1. Also, pD,H = pD,S = 0, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
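A sketch (Python) of how one could compute this lifetime distribution numerically from the model above, by taking powers of the 3×3 transition matrix; the state ordering (H, S, D) is our own convention:

```python
# States ordered H, S, D; rows sum to 1.
P = [
    [0.69, 0.30, 0.01],   # from H: stay healthy, get sick, die
    [0.80, 0.10, 0.10],   # from S: recover, stay sick, die
    [0.00, 0.00, 1.00],   # D is absorbing
]

def matvec(row, M):
    return [sum(row[k] * M[k][j] for k in range(3)) for j in range(3)]

# Since D is absorbing, P(lifetime = n) = P(dead at n) - P(dead at n - 1).
mu, prev_dead = [1.0, 0.0, 0.0], 0.0      # start healthy
for n in range(1, 13):
    mu = matvec(mu, P)
    print(f"P(lifetime = {n} months) = {mu[2] - prev_dead:.4f}")
    prev_dead = mu[2]
```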

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m ∈ T, m > n) is independent of the past process (Xm, m ∈ T, m < n), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property. Usually, T = Z+ = {0, 1, 2, 3, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively to the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . . , in+1 ∈ S,

    P(Xn+1 = in+1 | Xn = in, . . . , X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, . . . , Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain

    P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),

for any m and any choice of states i0, . . . , in+m. This is true for any n and m, which precisely means that (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) are independent conditionally on Xn. This is the Markov property, viz. that future is independent of past given the present.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in)
      = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions

    μ(i) := P(X0 = i), i ∈ S,
    pi,j(n, n + 1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,

will specify all joint distributions²:

    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = μ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n − 1, n).

The function μ(i) is called initial distribution. The function pi,j(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally,

    pi,j(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by (1),

    pi,j(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} pi,i1(m, m + 1) pi1,i2(m + 1, m + 2) · · · pin−1,j(n − 1, n),   (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

    P(m, n) := [pi,j(m, n)]_{i,j∈S},

a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

    P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),

or as

    P(m, n) = P(m, ℓ) P(ℓ, n) if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n,

and therefore we can simply write

    pi,j(n, n + 1) =: pi,j,

and call pi,j the (one-step) transition probability from state i to state j, while P = [pi,j]_{i,j∈S} is called the (one-step) transition probability matrix or, simply, transition matrix. Then

    P(m, n) = P × P × · · · × P = P^{n−m}   (n − m times),

for all m ≤ n, and the semigroup property is now even more obvious because

    P^{a+b} = P^a P^b, for all integers a, b ≥ 0.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

Let μn := [μn(i) = P(Xn = i)]_{i∈S} be the distribution of Xn, arranged in a row vector. Then we have

    μn = μ0 P^n,   (3)

as follows from (1) and the definition of P. The matrix notation is useful.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi:

    δi(j) := 1, if j = i; 0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if μ assigns probability p to state i1 and 1 − p to state i2, then μ = p δi1 + (1 − p) δi2.

Starting from a specific state i. A useful convention used for time-homogeneous chains is to write

    Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write

    Ei Y := E(Y | X0 = i) = Σ_y y P(Y = y | X0 = i) = Σ_y y Pi(Y = y).
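For a finite chain, relation (3) is just repeated matrix-vector multiplication. A minimal sketch (Python, reusing the mouse chain of Section 2.1):

```python
import numpy as np

P = np.array([[0.95, 0.05],    # mouse chain: rows index cells 1, 2
              [0.99, 0.01]])

mu0 = np.array([1.0, 0.0])     # unit mass delta_1: start in cell 1

# mu_n = mu_0 P^n, with the row vector multiplied on the left of P.
mu = mu0
for n in range(1, 6):
    mu = mu @ P
    print(f"n = {n}: P(X_n = 1) = {mu[0]:.4f}, P(X_n = 2) = {mu[1]:.4f}")

# Semigroup (Chapman-Kolmogorov) property: P^(a+b) = P^a P^b.
assert np.allclose(np.linalg.matrix_power(P, 5),
                   np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 3))
```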

4 Stationarity

We say that the process (Xn, n = 0, 1, . . .) is stationary if it has the same law as (Xn, n = m, m + 1, . . .), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa. Let Xn be the number of molecules in room 1 at time n. Then

    pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
    pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, we indeed expect that we have N/2 molecules in each room. Another guess then is: Place

each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be i with probability

    π(i) = (N choose i) 2^{−N}, 0 ≤ i ≤ N

(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

    P(X1 = i) = Σ_j P(X0 = j) pj,i
              = P(X0 = i − 1) pi−1,i + P(X0 = i + 1) pi+1,i
              = π(i − 1) pi−1,i + π(i + 1) pi+1,i
              = 2^{−N} [ (N choose i−1) (N − i + 1)/N + (N choose i+1) (i + 1)/N ] = · · · = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = P(Xm = i0, Xm+1 = i1, Xm+2 = i2, . . . , Xm+n = in),

for all m ≥ 0 and i0, . . . , in ∈ S.

In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. Since, by (3), the distribution μn at time n is related to the distribution at time 0 by μn = μ0 P^n, we see that the Markov chain is stationary if and only if the distribution μ0 is chosen so that μ0 = μ0 P. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that

    πP = π.   (4)

Such a π is called stationary or invariant distribution. The equations (4) are called balance equations.
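A quick numerical check of the balance equations for the Ehrenfest chain (a sketch in Python; N is kept small so the full (N+1)×(N+1) matrix fits comfortably):

```python
import numpy as np
from math import comb

N = 10
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i < N:
        P[i, i + 1] = (N - i) / N   # a molecule jumps from room 2 to room 1
    if i > 0:
        P[i, i - 1] = i / N         # a molecule jumps from room 1 to room 2

pi = np.array([comb(N, i) * 2.0**(-N) for i in range(N + 1)])

print(np.allclose(pi @ P, pi))      # True: pi P = pi, the balance equations
print(abs(pi.sum() - 1.0) < 1e-12)  # True: pi is a probability distribution
```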

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P has the eigenvalue 1 always because

    Σ_{j∈S} pi,j = 1, i ∈ S,

which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1, and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to that, π must be a probability:

    Σ_{i∈S} π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, 2, . . . , d}. Thus, each x ∈ Sd is of the form x = (x1, . . . , xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

    π(x) = 1/d!, x ∈ Sd.

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, . . .. The times between successive buses are i.i.d. random variables. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. Then (Xn, n ≥ 0) is a Markov chain with transition probabilities

    pi+1,i = 1, i ∈ N,
    p1,i = p(i), i ∈ N.

The state space is S = N = {1, 2, . . .}. To find the stationary distribution, write the equations (4):

    π(i) = Σ_j π(j) pj,i = π(1)p(i) + π(i + 1), i ≥ 1.

These can easily be solved:

    π(i) = π(1) Σ_{j≥i} p(j), i ≥ 1.

To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives

    π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^{−1}.

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

    ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.
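To see the bus-chain computation in action, here is a sketch (Python) with a concrete inter-arrival distribution (geometric, our own choice) truncated to finitely many states:

```python
# p(j) = P(next bus in j minutes); geometric, mean 1/0.4 = 2.5 < infinity.
q = 0.4
p = lambda j: q * (1 - q)**(j - 1)

M = 60                                         # truncation level
tail = lambda i: sum(p(j) for j in range(i, M + 1))

norm = sum(tail(i) for i in range(1, M + 1))   # approx. E(time between buses)
pi = {i: tail(i) / norm for i in range(1, M + 1)}

# Check one balance equation: pi(i) = pi(1) p(i) + pi(i + 1).
i = 3
print(abs(pi[i] - (pi[1] * p(i) + pi[i + 1])) < 1e-9)   # True (up to truncation)
```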

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

    Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

    ASSUMPTION ⇔ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted.) Consider an infinite binary vector

    ξ = (ξ1, ξ2, . . .), ξi ∈ {0, 1} for all i,

and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

    ξ(n + 1) = ϕ(ξ(n)).

Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so.

As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.

We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with

    P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2,

then the same will be true for all n:

    P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, for any r = 1, 2, . . . and any n = 0, 1, 2, . . ..

The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2 then the same is true at n + 1. You can check that, for any ε1, . . . , εr ∈ {0, 1},

    P(ξ1(n + 1) = ε1, . . . , ξr(n + 1) = εr)
      = P(ξ1(n) = 0, ξ2(n) = ε1, . . . , ξr+1(n) = εr) + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . . , ξr+1(n) = 1 − εr).   (5)

Hence, from (5), choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states. So it goes beyond our theory. But to even think about it will give you some strength.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

    π = πP   (balance equations)
    π1 = 1   (normalisation condition).

In other words,

    π(i) = Σ_{j∈S} π(j) pj,i, i ∈ S,
    Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:

    F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j.

Think of π(i)pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

    F(A, Ac) = F(Ac, A), for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

    π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Multiply π(i) by Σ_{j∈S} pi,j (which equals 1):

    π(i) Σ_{j∈S} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Now split the left sum also:

    Σ_{j∈A} π(i) pi,j + Σ_{j∈Ac} π(i) pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

    [Figure: schematic representation of the flow balance relations F(A, Ac) = F(Ac, A).]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β. (Consequently, p11 = 1 − α, p22 = 1 − β.) Assume 0 < α, β < 1. We have

    π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

    π(1) = cβ, π(2) = cα.

The constant c is found from π(1) + π(2) = 1. That is,

    π(1) = β/(α + β), π(2) = α/(α + β).

If we take α = 0.05 and β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952 and π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.

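As a numerical sanity check of Proposition 1, here is a sketch (Python) that verifies the flow balance F(A, Ac) = F(Ac, A) for the two-state chain above:

```python
import numpy as np

alpha, beta = 0.05, 0.99
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)   # stationary distribution

def flow(pi, P, A, Ac):
    """F(A, Ac) = sum over i in A, j in Ac of pi(i) p_{i,j}."""
    return sum(pi[i] * P[i, j] for i in A for j in Ac)

A, Ac = [0], [1]                                # states 1 and 2, 0-indexed
print(flow(pi, P, A, Ac), flow(pi, P, Ac, A))   # both equal pi(1)*alpha = pi(2)*beta
```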
Example: a walk with a barrier. Consider the following chain, where p + q = 1:

    [Transition diagram: states 0, 1, 2, 3, 4, . . .; each arrow to the right has probability p, each arrow to the left has probability q; at state 0 the leftward arrow is a self-loop (the barrier).]

The balance equations are:

    π(i) = p π(i − 1) + q π(i + 1), i = 1, 2, 3, . . ..

But we can make them simpler by choosing A = {0, 1, . . . , i − 1}. For i ≥ 1, we have

    F(A, Ac) = π(i − 1) p = π(i) q = F(Ac, A).

These are much simpler than the previous ones. In fact, we can solve them immediately:

    π(i) = (p/q)^i π(0), i ≥ 1.

Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by

    (i, j) ∈ E ⇔ pi,j > 0.

In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time with positive probability. In other words,

    i leads to j (written as i ⇝ j) ⇔ Pi(∃ n ≥ 0 : Xn = j) > 0.

Notice that this relation is reflexive: i ⇝ i for all i ∈ S. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, . . . , in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since

    0 < Pi(∃ n ≥ 0 : Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),

the sum on the right must have a nonzero term. Hence Pi(Xn = j) > 0 for some n, and so (iii) holds.
• Suppose that (ii) holds. We have

    Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 · · · pin−1,j,   (6)

where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term. Hence Pi(Xn = j) > 0, and so (iii) holds.
• Suppose now that (iii) holds. Look at the last display again. Since Pi(Xn = j) > 0, we have that the sum is also positive and so one of its terms is positive, meaning that (ii) holds. In addition, (i) holds, because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 : Xm = j).

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

    i communicates with j (written as i ↔ j) ⇔ i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ↔ j ⇔ j ↔ i).

Corollary 2. The relation "communicates with" is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because ⇝ is so.

Just as any equivalence relation, it partitions S into equivalence classes known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

    [i] := {j ∈ S : j ↔ i}.

So, by definition, [i] = [j] if and only if i ↔ j. Two communicating classes are either identical or completely disjoint.

    [Figure: a chain with states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

    Σ_{j∈C} pi,j = 1, for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: If i ∈ C and if i ⇝ j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ∈ C and i ⇝ j. We then can find states i1, . . . , in−1 such that pi,i1 pi1,i2 · · · pin−1,j > 0. Since pi,i1 > 0 and C is closed, we have i1 ∈ C. Similarly, pi1,i2 > 0, and so i2 ∈ C, and so on; we conclude that j ∈ C. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In other words, the chain is irreducible if and only if, starting from any state, we can reach any other state. To put it otherwise, the state space S is a closed communicating class.

If a state i is such that pi,i = 1 then we say that i is an absorbing state. In graph terms, the single-element set {i} is a closed communicating class.

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i ⇝ j but j does not lead back to i. (Why?)

    [Figure: the general structure of a Markov chain, with communicating classes C1, C2, C3, C4, C5. The classes C4, C5 are closed communicating classes; the classes C1, C2, C3 are not closed.]

The general structure of a Markov chain is indicated in this figure. A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes.

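The communicating classes of a finite chain can be computed mechanically from the positivity pattern of P, since mutual reachability in the digraph is exactly "communicates with". A sketch (Python, plain breadth-first search, no graph library):

```python
from collections import deque

def reachable(adj, s):
    """Set of states j with s leading to j (including s itself: n = 0 steps)."""
    seen, queue = {s}, deque([s])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def communicating_classes(P):
    n = len(P)
    adj = [[j for j in range(n) if P[i][j] > 0] for i in range(n)]
    reach = [reachable(adj, i) for i in range(n)]
    classes = []
    for i in range(n):
        cls = {j for j in reach[i] if i in reach[j]}   # j with i <-> j
        if cls not in classes:
            classes.append(cls)
    return classes

P = [[1, 0, 0], [1/3, 1/3, 1/3], [0, 0, 1]]            # toy 3-state chain
print(communicating_classes(P))                         # [{0}, {1}, {2}]
```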
5.4 Period

Consider the following chain:

    [Figure: states 1, 2, 3, 4 arranged in a cycle 1 → 2 → 3 → 4 → 1.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time then, it will move to the set C2 = {2, 4} at the next step and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, and so on. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. The period of an essential state is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

    d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.

The period of an inessential state is not defined. A state i is called aperiodic if d(i) = 1.

Let us denote by p(n)ij the entries of P^n. Since P^{m+n} = P^m P^n, we have

    p(m+n)ij ≥ p(m)ik p(n)kj, for all m, n ≥ 0 and all states i, j, k.

So p(2n)ii ≥ (p(n)ii)² and, more generally, p(ℓn)ii ≥ (p(n)ii)^ℓ, ℓ ∈ N. Therefore if p(n)ii > 0 then p(ℓn)ii > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if p(n)ii > 0, then p(m)ii > 0. So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

Theorem 2 (the period is a class property). If i ↔ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let

    Di = {n ∈ N : p(n)ii > 0}, Dj = {n ∈ N : p(n)jj > 0},

so that d(i) = gcd Di, d(j) = gcd Dj. Since i ↔ j, there is α ∈ N with p(α)ij > 0 and

there is β ∈ N with p(β)ji > 0. So p(α+β)jj ≥ p(α)ij p(β)ji > 0. Therefore d(j) | α + β. Now let n ∈ Di. Then p(n)ii > 0 and so

    p(α+n+β)jj ≥ p(α)ij p(n)ii p(β)ji > 0,

showing that α + n + β ∈ Dj. Therefore d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference. From the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state then there exists n0 such that p(n)ii > 0 for all n ≥ n0.

Proof. Pick n1, n2 ∈ Di such that n2 − n1 = 1. Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Therefore

    n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.

Because n is sufficiently large, q − r > 0. Since n1, n2 ∈ Di, we have n ∈ Di as well.

Of particular interest are irreducible chains, i.e. chains where all states communicate with one another. Since the period is a class property, we can define the period of such a chain: An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Corollary 3. An irreducible chain with finitely many states is aperiodic if and only if there exists an n such that p(n)ij > 0 for all i, j ∈ S.

More generally, if an irreducible chain has period d then we can decompose the state space into d sets C0, C1, . . . , Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, . . . , Cd−1 such that

    Σ_{j∈Cr+1} pij = 1, i ∈ Cr, r = 0, 1, . . . , d − 1.

(Here Cd := C0.)

    [Figure: the internal structure of a closed communicating class with period d = 4.]

Proof. Assume d > 1. Define the relation

    i ∼d j ⇔ p(nd)ij > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into d disjoint equivalence classes. We show that these classes satisfy what we need. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ∼d j). Then pick i1 such that pi0i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, . . . , id−1, with corresponding classes C0, C1, . . . , Cd−1. It is easy to see that if we continue and pick a further state id with pid−1id > 0 then, necessarily, id ∈ C0. We now show that if i belongs to one of these classes and if pij > 0 then, necessarily, j belongs to the next class. Take, for example, i ∈ C0 and suppose that pij > 0 but j ∉ C1, say j ∈ C2. Consider the path

    i0 → i → j → i2 → i3 → · · · → id → i0.

Such a path is possible because of the choice of the i0, i1, . . ., and by the assumptions that i ∈ C0 and j ∈ C2. The existence of such a path implies that it is possible to go from i0 to i0 in a number of steps which is an integer multiple of d − 1 (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by

    TA := inf{n ≥ 0 : Xn ∈ A}.

We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". We avoid excessive notation by using the same letter for related quantities; the reader is warned to be alert as to which of the two variables we are considering at each time.

As an example, consider the chain

    [Figure: states 0, 1, 2; p00 = 1, p10 = 1/2, p11 = 1/3, p12 = 1/6, p22 = 1.]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A, the two times coincide.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define

    ϕ(i) := Pi(TA < ∞).

We then have:

Theorem 5. The function ϕ(i) satisfies

    ϕ(i) = 1, i ∈ A,
    ϕ(i) = Σ_{j∈S} pij ϕ(j), i ∉ A.

Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + T′A, where T′A is the remaining time until A is hit. We first have

    Pi(TA < ∞) = Σ_{j∈S} Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σ_{j∈S} Pi(T′A < ∞ | X1 = j) pij.

But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, . . .). Hence, conditionally on X1, the event {T′A < ∞} is independent of X0. Therefore,

    P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j).

But the Markov chain is homogeneous, which implies that

    P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).

Combining the above, we have

    Pi(TA < ∞) = Σ_{j∈S} ϕ(j) pij,

as needed.

For the second part of the proof, let ϕ′(i) be another solution. Then

    ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ϕ′(j)
          = Σ_{j∈A} pij + Σ_{j∉A} pij [ Σ_{k∈A} pjk + Σ_{k∉A} pjk ϕ′(k) ]
          = Σ_{j∈A} pij + Σ_{j∉A} pij Σ_{k∈A} pjk + Σ_{j∉A} Σ_{k∉A} pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and let ϕ(i) = Pi(T0 < ∞). We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0 it takes no time to hit 0; on the other hand, if the chain starts from 2 it will never hit 0.) As for ϕ(1), we have

    ϕ(1) = (1/3)ϕ(1) + (1/2)ϕ(0),

so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time:

    ψ(i) := Ei TA.

We have:

Theorem 6. The function ψ(i) satisfies

    ψ(i) = 0, i ∈ A,
    ψ(i) = 1 + Σ_{j∉A} pij ψ(j), i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. Start the chain from i. If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0, because it takes no time to hit A if the chain starts from A. If i ∉ A, then TA = 1 + T′A, as above. Therefore,

    Ei TA = 1 + Ei T′A = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),

which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}. Clearly, ψ(0) = ψ(2) = 0. We have

    ψ(1) = 1 + (1/3)ψ(1),

which gives ψ(1) = 3/2.
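For a finite chain, Theorems 5 and 6 reduce to small linear systems. A sketch (Python) for the three-state example above, where the only unknowns are ϕ(1) and ψ(1):

```python
# States 0, 1, 2; from state 1: to 0 w.p. 1/2, stay w.p. 1/3, to 2 w.p. 1/6.
p10, p11, p12 = 1/2, 1/3, 1/6

# phi(1) = p10 * 1 + p11 * phi(1) + p12 * 0, so:
phi1 = p10 / (1 - p11)
print(phi1)        # 0.75 = 3/4

# psi(1) = 1 + p11 * psi(1) for A = {0, 2}, so:
psi1 = 1 / (1 - p11)
print(psi1)        # 1.5 = 3/2
```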

(This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3; therefore the mean number of flips until the first success is 3/2.)

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities

    ϕAB(i) = Pi(TA < TB).

It should be clear that the equations satisfied are:

    ϕAB(i) = 1, i ∈ A,
    ϕAB(i) = 0, i ∈ B,
    ϕAB(i) = Σ_j pij ϕAB(j), otherwise.

Furthermore, ϕAB is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x, a reward f(x) is earned. This happens up to the first hitting time TA of the set A. The total reward is f(X0) + f(X1) + · · · + f(XTA). We are interested in the mean total reward

    h(x) := Ex Σ_{n=0}^{TA} f(Xn).

Clearly, h(x) = f(x) for x ∈ A, because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then TA = 1 + T′A, where T′A is the remaining time, and

    h(x) = Ex Σ_{n=0}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=1}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=0}^{T′A} f(Xn+1),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, . . .. Hence, by the Markov property, as argued earlier,

    Ex Σ_{n=0}^{T′A} f(Xn+1) = Σ_{y∈S} pxy h(y).

Thus, the set of equations satisfied by h are

    h(x) = f(x), x ∈ A,
    h(x) = f(x) + Σ_{y∈S} pxy h(y), x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks in the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die in room 4, why?)

    [Figure: a house with four rooms arranged in a square, rooms 1 and 2 on top, rooms 4 and 3 below; from each room the thief moves through one of the two available doors with probability 1/2 each.]

The question is to compute the average total reward. If we let h(i) be the average total reward if he starts from room i, we have h(4) = 0 and

    h(1) = 1 + (1/2) h(2),
    h(3) = 3 + (1/2) h(2),
    h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
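The three equations for h can be solved mechanically. A sketch (Python) that sets up (I − Q)h = f on the transient rooms:

```python
import numpy as np

# Transient rooms 1, 2, 3 (room 4 is absorbing with zero reward).
Q = np.array([[0, 1/2, 0],      # from room 1: to 2 (the other half goes to 4)
              [1/2, 0, 1/2],    # from room 2: to 1 or 3
              [0, 1/2, 0]])     # from room 3: to 2
f = np.array([1, 2, 3])         # reward collected per room

h = np.linalg.solve(np.eye(3) - Q, f)
print(h)                        # [5. 8. 7.] = h(1), h(2), h(3)
```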

7 Gambler's ruin

Consider the simplest game of chance of tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. So, if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), your fortune after the n-th toss is

    Xn = X0 + Σ_{k=1}^n Yk ξk.

The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Of course, no gambler is assumed to have any information about the outcome of the upcoming toss, so Yk cannot depend on ξk. It can only depend on ξ1, . . . , ξk−1 and, as is often the case in practice, on other considerations of, e.g., astrological nature. In other words, if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In the simplest case, Yk = 1 for all k.

Let us consider a concrete problem. The gambler starts with X0 = x pounds and has a target: to make b pounds (b > x). She will stop playing if her money falls below a (a < x). We are clearly dealing with a Markov chain in the state space

    {a, a + 1, . . . , b}

with transition probabilities

    pi,i+1 = p, pi,i−1 = q, a < i < b,

and where the states a, b are absorbing. We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Consider the hitting time of state y:

    Ty := inf{n ≥ 0 : Xn = y}.

We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x. The following is our next goal.

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

    Px(Tb < Ta) = [ (q/p)^{x−a} − 1 ] / [ (q/p)^{b−a} − 1 ], if p ≠ q,
    Px(Tb < Ta) = (x − a)/(b − a), if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let

    ϕ(x, ℓ) := Px(Tℓ < T0), 0 ≤ x ≤ ℓ.

The general solution will then be obtained from

    Px(Tb < Ta) = ϕ(x − a, b − a).

(Think why!) To simplify things, write ϕ(x) instead of ϕ(x, ℓ). We obviously have

    ϕ(0) = 0, ϕ(ℓ) = 1,

and, by first-step analysis,

    ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1), 0 < x < ℓ.

Since p + q = 1, we can write this as

    p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

    δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0).

Assuming that λ := q/p ≠ 1, we find

    ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^{x−1} + · · · + λ + 1] δ(0) = [(λ^x − 1)/(λ − 1)] δ(0).

Since ϕ(ℓ) = 1, we have [(λ^ℓ − 1)/(λ − 1)] δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1), 0 ≤ x ≤ ℓ, λ ≠ 1.

If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so

    ϕ(x) = x/ℓ, 0 ≤ x ≤ ℓ, λ = 1.

Summarising, we have obtained

    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1,
    ϕ(x, ℓ) = x/ℓ, if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

    Px(T0 < ∞) = (q/p)^x, if p > 1/2,
    Px(T0 < ∞) = 1, if p ≤ 1/2.

Proof. We have

    Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2, then

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
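A sketch (Python) that checks Theorem 7 against simulation for one choice of parameters:

```python
import random

def reach_b_before_a(x, a, b, p, trials=100_000):
    """Estimate P_x(T_b < T_a) for the unit-stake game by simulation."""
    wins = 0
    for _ in range(trials):
        pos = x
        while a < pos < b:
            pos += 1 if random.random() < p else -1
        wins += (pos == b)
    return wins / trials

x, a, b, p = 5, 0, 10, 0.45
lam = (1 - p) / p
exact = (lam**(x - a) - 1) / (lam**(b - a) - 1)
print(exact, reach_b_before_a(x, a, b, p))   # the two numbers should be close
```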

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, . . . , Xn) for all n ∈ Z+. In other words, we can decide whether τ = n by looking at the process up to time n only. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, the future process (Xm, m ≥ n) is independent of the past process (X0, . . . , Xn) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain. Let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . . , Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, . . . , Xτ) if τ < ∞. Since

    (X0, . . . , Xτ)1(τ < ∞) = Σ_{n∈Z+} (X0, . . . , Xn)1(τ = n),

any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A,

    P(Xτ+1 = j1, Xτ+2 = j2, . . . , A | Xτ = i, τ < ∞) = pi,j1 pj1,j2 · · · P(A | Xτ = i, τ < ∞).

We have:

    P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
      (a) = P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
      (b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
      (c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
      (d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn) and the ordinary splitting property at n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks": random trajectories that are, in a generalised sense, independent and identically distributed. The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

    Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the Ti of Section 6: the Ti of Section 6 was allowed to take the value 0; our Ti here is always positive. If X0 ≠ i then the two definitions coincide. State i may or may not be visited again. Regardless, we let Ti(2) be the time of the second visit to state i, Ti(3) the time of the third visit, and so on. Formally, we recursively define

    Ti(r) := inf{n > Ti(r−1) : Xn = i}, r = 2, 3, . . .

(And we set Ti(1) := Ti.) All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

    Xi(r) := (Xn, Ti(r) ≤ n < Ti(r+1)), r = 1, 2, . . .

Let also

    Xi(0) := (Xn, 0 ≤ n < Ti).

We think of these as "random variables", in fact as random trajectories, and we refer to them as excursions of the process. So Xi(r) is the r-th excursion: the chain visits state i for the r-th time and "goes for an excursion"; we "watch" it up until the next time it returns to state i.

    [Figure: a trajectory split at the successive visit times Ti = Ti(1), Ti(2), Ti(3), Ti(4); the pieces between consecutive visits are the excursions. These excursions are independent!]
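A sketch (Python) that simulates a finite chain and splits the trajectory into excursions from a state i, in the sense just described:

```python
import random

def simulate(P, x0, steps):
    """Simulate a finite chain with transition matrix P (list of rows)."""
    path, x = [x0], x0
    for _ in range(steps):
        x = random.choices(range(len(P)), weights=P[x])[0]
        path.append(x)
    return path

def excursions(path, i):
    """Split the path at the successive visit times of state i (n >= 1)."""
    visits = [n for n in range(1, len(path)) if path[n] == i]
    if not visits:
        return [path]                        # i never visited: one long piece
    pieces = [path[:visits[0]]]              # X_i^(0): before the first visit
    pieces += [path[s:t] for s, t in zip(visits, visits[1:])]
    return pieces

P = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.4, 0.0, 0.6]]
print(excursions(simulate(P, 0, 30), i=0))
```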

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, ... are independent random variables. Moreover, all of them, except, possibly, X_i^{(0)}, have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at T_i^{(r)}: the SMP says that, given the state at the stopping time T_i^{(r)}, the future after T_i^{(r)} is independent of the past before T_i^{(r)}. But the state at T_i^{(r)} is, by definition, equal to i. Therefore the future is independent of the past, and this proves the first claim. The claim that X_i^{(1)}, X_i^{(2)}, ... have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of this observation is:

Corollary 5. (Numerical) functions g(X_i^{(0)}), g(X_i^{(1)}), g(X_i^{(2)}), ... of the excursions are independent random variables.

Example: Define

Λ_r := \sum_{T_i^{(r)} ≤ n < T_i^{(r+1)}} 1(X_n = j).

The meaning of Λ_r is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures that Λ_1, Λ_2, ... are independent random variables.
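The excursion decomposition is also easy to see in a simulation. Here is a sketch in Python (illustration only; the 2-state chain is arbitrary) which cuts a trajectory started at i = 0 into excursions from 0; by Theorem 9 these pieces are i.i.d., and the average excursion length estimates E_0 T_0, which by first-step analysis equals 1 + 0.7/0.6 = 13/6 here.

import random

# Sketch: cutting a trajectory into excursions from state 0.
p = {0: [0.3, 0.7], 1: [0.6, 0.4]}   # arbitrary 2-state transition probabilities

x, path = 0, [0]
for _ in range(100_000):
    x = random.choices([0, 1], weights=p[x])[0]
    path.append(x)

visits = [t for t, s in enumerate(path) if s == 0]
excursions = [path[a:b] for a, b in zip(visits, visits[1:])]
lengths = [len(e) for e in excursions]
print(sum(lengths) / len(lengths))   # estimates E_0 T_0 = 13/6, approx 2.17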

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions. First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:

P_i(X_n = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:

P_i(X_n = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, there could be, in principle, yet another possibility:

0 < P_i(X_n = i for infinitely many n) < 1.

We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Suppose that the Markov chain starts from state i. Consider the total number of visits to state i (excluding the possibility that X_0 = i):

J_i = \sum_{n=1}^{∞} 1(X_n = i) = sup{r ≥ 0 : T_i^{(r)} < ∞}.

Notice that: state i is visited infinitely many times ⟺ J_i = ∞. Notice also that

f_ii := P_i(T_i < ∞) = P_i(J_i ≥ 1).

We claim that:

Lemma 4. Starting from X_0 = i, the random variable J_i is geometric:

P_i(J_i ≥ k) = f_ii^k,  k ≥ 0.

Proof. By the strong Markov property at time T_i^{(k−1)},

P_i(J_i ≥ k) = P_i(J_i ≥ k − 1) P_i(J_i ≥ 1).

Indeed, the event {J_i ≥ k} implies that T_i^{(k−1)} < ∞ and that J_i ≥ k − 1. But the past before T_i^{(k−1)} is independent of the future, and that is why we get the product. Hence

P_i(J_i ≥ k) = f_ii P_i(J_i ≥ k − 1) = f_ii^2 P_i(J_i ≥ k − 2) = ··· = f_ii^k.

Notice that f_ii = P_i(T_i < ∞) can either be equal to 1 or smaller than 1. If f_ii = 1, we have P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1: this means that the state i is recurrent. If f_ii < 1, then P_i(J_i < ∞) = 1: this means that the state i is transient. Because either f_ii = 1 or f_ii < 1, we conclude what we asserted earlier, namely that every state is either recurrent or transient.

We summarise this, together with another useful characterisation of recurrence and transience, in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. P_i(J_i = ∞) = 1.
3. f_ii = P_i(T_i < ∞) = 1.
4. E_i J_i = ∞.
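Lemma 4 can be tested on the simplest possible example. In the sketch below (Python; the numbers are arbitrary), state 1 is absorbing and p_00 = 0.4, so that f_00 = 0.4, and the empirical tail P_0(J_0 ≥ k) is compared with f_00^k.

import random

# Sketch: J_0 = number of returns to 0 before absorption at state 1.
# Here p00 = 0.4, p01 = 0.6 and state 1 is absorbing, so f00 = 0.4.
def returns_to_0():
    J = 0
    while random.random() < 0.4:   # each further return occurs w.p. 0.4
        J += 1
    return J

trials = 200_000
samples = [returns_to_0() for _ in range(trials)]
for k in range(1, 5):
    emp = sum(1 for J in samples if J >= k) / trials
    print(k, round(emp, 4), round(0.4 ** k, 4))   # the two columns should agree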

Proof. The equivalence of the first two statements was proved before the lemma. The equivalence of 2 and 3 is immediate from Lemma 4: P_i(J_i = ∞) = lim_{k→∞} f_ii^k, which equals 1 if and only if f_ii = 1. Finally, if P_i(J_i = ∞) = 1 then, clearly, E_i J_i = ∞; conversely, if E_i J_i = ∞ then, owing to the fact that J_i is a geometric random variable (which has finite expectation whenever f_ii < 1), we must have f_ii = 1.

Lemma 6. The following are equivalent:
1. State i is transient.
2. P_i(J_i < ∞) = 1.
3. f_ii = P_i(T_i < ∞) < 1.
4. E_i J_i < ∞.

Proof. The equivalence of the first two statements was proved above: state i is transient if, by definition, it is visited finitely many times with probability 1, i.e. P_i(J_i < ∞) = 1. If f_ii < 1 then, since J_i is a geometric random variable, this implies E_i J_i < ∞. On the other hand, if E_i J_i < ∞ then, clearly, J_i cannot take the value +∞ with positive probability, i.e. P_i(J_i < ∞) = 1.

Corollary 6. If i is a transient state then p_ii^{(n)} → 0, as n → ∞.

Proof. If i is transient then E_i J_i < ∞. But

E_i J_i = E_i \sum_{n=1}^{∞} 1(X_n = i) = \sum_{n=1}^{∞} P_i(X_n = i) = \sum_{n=1}^{∞} p_ii^{(n)},

and if a series is finite then its summands go to 0, i.e. p_ii^{(n)} → 0, as n → ∞.

10.1 First hitting time decomposition

Define

f_ij^{(n)} := P_i(T_j = n),  n ≥ 1,

the probability to hit state j for the first time at time n ≥ 1, assuming that the chain starts from state i. Notice the difference with p_ij^{(n)}: the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, f_ij^{(n)} ≤ p_ij^{(n)}, but the two sequences are related in an exact way:

Lemma 7. p_ij^{(n)} = \sum_{m=1}^{n} f_ij^{(m)} p_jj^{(n−m)},  n ≥ 1.

Proof. We have

p_ij^{(n)} = P_i(X_n = j) = \sum_{m=1}^{n} P_i(T_j = m, X_n = j) = \sum_{m=1}^{n} P_i(T_j = m) P_i(X_n = j | T_j = m).

The first factor equals f_ij^{(m)}. For the second factor, we use the Strong Markov Property:

P_i(X_n = j | T_j = m) = P_i(X_n = j | X_m = j) = p_jj^{(n−m)}.
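Lemma 7 lends itself to a direct numerical check: compute p_ij^{(n)} by matrix powers, compute f_ij^{(m)} by propagating the chain while "tabooing" state j, and compare. A sketch (Python with numpy; the 3-state matrix is arbitrary):

import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.1, 0.4]])   # arbitrary transition matrix
i, j, N = 0, 2, 10

# f[m] = P_i(T_j = m): evolve the chain killed upon hitting j
f = np.zeros(N + 1)
Q = P.copy(); Q[:, j] = 0          # transitions that avoid j
mass = np.eye(3)[i]                # law of the killed chain, X_0 = i
for m in range(1, N + 1):
    f[m] = mass @ P[:, j]          # first entrance to j at time m
    mass = mass @ Q

lhs = np.linalg.matrix_power(P, N)[i, j]
rhs = sum(f[m] * np.linalg.matrix_power(P, N - m)[j, j] for m in range(1, N + 1))
print(lhs, rhs)                    # identical up to rounding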

32 . by definition. . showing that j will appear in infinitely many of the excursions for sure. Corollary 8. we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj . and since they take the value 1 with positive probability. . If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. either all states are recurrent or all states are transient. starting from i. If j is transient. and so the probability that j will appear in a specific excursion is positive. we have n=1 pjj = Ej Jj < ∞. ∞ Proof. 1.The first term equals fij . Xi . and gij := Pi (Tj < Ti ) > 0. which is finite. Corollary 7. by summing over n the relation of Lemma 7.r := 1 j appears in excursion Xi (r) . 10. excursions Xi . are i.. we have i j ⇐⇒ fij > 0. For the second term. In particular. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞).i. . j i.i.) Theorem 10. . the probability to go to j in some finite time is positive. Now. . ( Remember: another property that was a class property was the property that the state have a certain period. but also fij = 1. . for all states i. and hence pij as n → ∞.2 Communicating classes and recurrence/transience Recall the definition of the relation “state i leads to state j”. We now show that recurrence is a class property. r = 0. j i. not only fij > 0. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. if we let fij := Pi (Tj < ∞) . The last statement is simply an expression in symbols of what we said above. So the random variables δj. Proof.d. If j is a transient state then pij → 0. as n → ∞. the probability that j will appear in a specific excursion equals gij = Pi (Tj < Ti ). Hence. denoted by i j: it means that. Start with X0 = i. Hence. infinitely many of them will be 1 (with probability 1).d. In other words. In a communicating class. Indeed.

. Theorem 12. X remains in C. U2 .e. uniformly distributed in the unit interval [0. If i is inessential then it is transient.Proof. and so i is a transient state. Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x. U1 .i. By finiteness. n+1 33 = 0 . . random variables. appear in a certain game. This state is recurrent. Every finite closed communicating class is recurrent. Xn = i for infinitely many n) = 0. Example: Let U0 . If i is recurrent then every other j in the same communicating class satisfies i j. i. i. Un ≤ x | U0 = x)dx xn dx = 1 . . but Pi (A ∩ B) = Pi (Xm = j. Theorem 11. . Let T := inf{n ≥ 1 : Un > U0 }. . Hence all states are recurrent. for example. necessarily. By closedness. 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ). Let C be the class.e. and hence. Pi (B) = 0. if started in C. If a communicating class contains a state which is transient then. . We now take a closer look at recurrence. Thus T represents the number of variables we need to draw to get a value larger than the initial one. Suppose that i is inessential. . all other states are also transient. . So if a communicating class is recurrent then it contains no inessential states and so it is closed. there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0. Indeed.d. observe that T is finite. be i. by the above theorem. P (T > n) = P (U1 ≤ U0 . 1]. Proof. Proof. P (B) = Pi (Xn = i for infinitely many n) < 1. j but j i. every other j is also recurrent. But then. Recall that a finite random variable may not necessarily have finite expectation. Hence 11 Positive recurrence Recall that if we fix a state i and let Ti be the first return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state). It should be clear that this can. Hence. . some state i must be visited infinitely many times. . What is the expectation of T ? First. .

We now classify a recurrent state i as follows. We say that i is positive recurrent if E_i T_i < ∞. We say that i is null recurrent if E_i T_i = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is, by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as a "Law". But this is misleading: it is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z_1, Z_2, ... be i.i.d. random variables.
(i) Suppose E|Z_1| < ∞ and let µ = E Z_1. Then

P( lim_{n→∞} (Z_1 + ··· + Z_n)/n = µ ) = 1.

(ii) Suppose E max(Z_1, 0) = ∞, but E min(Z_1, 0) > −∞. Then

P( lim_{n→∞} (Z_1 + ··· + Z_n)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X_0, X_1, X_2, ... of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions

X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, ...

forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that, for reasonable choices of a function G taking values in R, the variables

G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), ...

are i.i.d. random variables. Then, with probability 1, the law of large numbers tells us that

(G(X_a^{(1)}) + ··· + G(X_a^{(n)}))/n → E G(X_a^{(1)}),   (8)

as n → ∞, because the excursions X_a^{(1)}, X_a^{(2)}, ... are i.i.d., and because, if we start with X_0 = a, the initial excursion X_a^{(0)} also has the same law as the rest of them. Note that

E G(X_a^{(1)}) = E G(X_a^{(2)}) = ··· = E[ G(X_a^{(0)}) | X_0 = a ] = E_a G(X_a^{(0)}).

The final equality in the last display is merely our notation for the expectation of a random variable conditional on X_0 = a.

13.2 Average reward

Consider now a function f : S → R, which we can think of as a reward received when the chain is in state x. We are interested in the existence of the limit

time-average reward = lim_{t→∞} (1/t) \sum_{k=0}^{t} f(X_k),

which represents the "time-average reward" received. In particular, we shall assume that f is bounded, and shall examine those conditions which yield a nontrivial limit.

Example 1: To each state x ∈ S assign a reward f(x) and, for an excursion X, define G(X) to be the maximum reward received during this excursion. Specifically,

G(X_a^{(r)}) = max{ f(X_t) : T_a^{(r)} ≤ t < T_a^{(r+1)} }.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X. Thus, in this example,

G(X_a^{(r)}) = \sum_{T_a^{(r)} ≤ t < T_a^{(r+1)}} f(X_t).   (7)

Whichever the choice of G, the law of large numbers (8) applies.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a T_a < ∞), and assume that the distribution of X_0 is such that P(T_a < ∞) = 1. Let f : S → R+ be a bounded "reward" function, say |f(x)| ≤ B. Then, with probability one, the "time-average reward" f̄ exists and is a deterministic quantity:

P( lim_{t→∞} (1/t) \sum_{n=0}^{t} f(X_n) = f̄ ) = 1,

where

f̄ = E_a [ \sum_{n=0}^{T_a − 1} f(X_n) ] / E_a T_a = \sum_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this. Suppose that t is large. We are looking for the total reward received between 0 and t. We break it into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, N_t, of complete excursions from the ground state a that occur between 0 and t:

N_t := max{ r ≥ 1 : T_a^{(r)} ≤ t }.

[Figure: a time axis from 0 to t with the visit times T_a^{(1)}, ..., T_a^{(5)} marked; in this picture N_t = 5.]

We write the total reward as a sum of rewards over complete excursions, plus a first and a last term:

\sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t − 1} G_r + G_t^{first} + G_t^{last},

where G_r is given by (7), G_t^{first} is the reward up to time T_a = T_a^{(1)} or t (whichever is smaller), and G_t^{last} is the reward over the last incomplete cycle. We assumed that f is bounded, |f(x)| ≤ B. Hence |G_t^{first}| ≤ B T_a. Since P(T_a < ∞) = 1, we have B T_a / t → 0, and so

(1/t) G_t^{first} → 0,  as t → ∞,

with probability one. A similar argument shows that

(1/t) G_t^{last} → 0,  as t → ∞,

with probability one. So we are left with the sum of the rewards over complete cycles.

Write now

(1/t) \sum_{r=1}^{N_t − 1} G_r = (N_t / t) · (1/N_t) \sum_{r=1}^{N_t − 1} G_r.   (9)

Look again at what the Law of Large Numbers (8) tells us:

lim_{n→∞} (1/n) \sum_{r=1}^{n} G_r = E G_1,   (10)

with probability one. Since T_a^{(r)} → ∞ as r → ∞, it follows that the number N_t of complete cycles before time t also tends to ∞ as t → ∞. Hence, replacing n by the subsequence N_t in (10),

lim_{t→∞} (1/N_t) \sum_{r=1}^{N_t − 1} G_r = E G_1.

Now look at the sequence

T_a^{(r)} = T_a^{(1)} + \sum_{k=2}^{r} (T_a^{(k)} − T_a^{(k−1)}),

and see what the Law of Large Numbers tells us again: the differences T_a^{(k)} − T_a^{(k−1)}, k = 2, 3, ..., are i.i.d. random variables with common expectation E_a T_a. Since T_a^{(1)}/r → 0, we have

lim_{r→∞} T_a^{(r)} / r = E_a T_a.   (11)

By definition, the visit with index r = N_t to the ground state a is the last one before t (see also the figure above):

T_a^{(N_t)} ≤ t < T_a^{(N_t + 1)}.

Hence

T_a^{(N_t)} / N_t ≤ t / N_t < T_a^{(N_t + 1)} / N_t.

But (11) tells us that

lim_{t→∞} T_a^{(N_t)} / N_t = lim_{t→∞} T_a^{(N_t + 1)} / N_t = E_a T_a.

Hence

lim_{t→∞} N_t / t = 1 / E_a T_a.   (12)

Using (10) and (12) in (9), we have

lim_{t→∞} (1/t) \sum_{r=1}^{N_t − 1} G_r = E G_1 / E_a T_a.

Thus

lim_{t→∞} (1/t) \sum_{n=0}^{t} f(X_n) = E G_1 / E_a T_a =: f̄.

To see that f̄ is as claimed, just observe that

E G_1 = E \sum_{T_a^{(1)} ≤ t < T_a^{(2)}} f(X_t) = E_a \sum_{0 ≤ t < T_a} f(X_t),

from the Strong Markov Property. The constant f̄ is a sort of average over an excursion.

[Figure: the rewards f(X_0) = f(a), f(X_1), f(X_2), ..., f(X_{T_a − 1}) collected over one excursion, from time 0 to time T_a.]

Comment 1: It is important to understand what the formula says. The numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included); the denominator is the duration of the excursion. We may write

E G_1 = E_a \sum_{0 ≤ t < T_a} f(X_t)  or  E G_1 = E_a \sum_{0 < t ≤ T_a} f(X_t).

Note that we include only one endpoint of the excursion, not both.

Comment 2: A very important particular case occurs when we choose

f(x) := 1(x = b),

i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, for all b ∈ S,

lim_{t→∞} (1/t) \sum_{n=1}^{t} 1(X_n = b) = E_a [ \sum_{k=0}^{T_a − 1} 1(X_k = b) ] / E_a T_a,   (13)

with probability 1.

Comment 3: If in the above we let b = a, the ground state, then the numerator equals 1. Hence

lim_{t→∞} (1/t) \sum_{n=1}^{t} 1(X_n = a) = 1 / E_a T_a,   (14)

with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes E_a T_a units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/E_a T_a.

Comment 4: Suppose that two states, a and b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b T_b. The equality of these two limits gives:

average no. of times state b is visited between two successive visits to state a = E_a \sum_{k=0}^{T_a − 1} 1(X_k = b) = E_a T_a / E_b T_b.
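Formula (14) can be checked by a simple simulation (a sketch; the 3-state chain is arbitrary): the long-run fraction of time spent at the ground state a and the reciprocal of the empirical mean return time E_a T_a should agree.

import random

P = [[0.1, 0.6, 0.3],
     [0.5, 0.2, 0.3],
     [0.4, 0.4, 0.2]]            # arbitrary transition matrix

def step(i):
    return random.choices([0, 1, 2], weights=P[i])[0]

a, x, T = 0, 0, 1_000_000        # ground state a; start X_0 = a
visits, last, gaps, ngaps = 0, 0, 0.0, 0
for n in range(1, T + 1):
    x = step(x)
    if x == a:
        visits += 1
        gaps += n - last; ngaps += 1; last = n

print(visits / T)                # long-run fraction of visits to a
print(1 / (gaps / ngaps))        # 1 / (empirical E_a T_a): should agree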

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

ν^{[a]}(x) := E_a \sum_{n=0}^{T_a − 1} 1(X_n = x),  x ∈ S,   (15)

should play an important role. Indeed, we saw (Comment 2 above) that, if a is a positive recurrent state, then ν^{[a]}(x)/E_a T_a is the long-run fraction of times that state x is visited. In view of this, it should have something to do with the concept of a stationary distribution discussed a few sections ago.

Note: Although ν^{[a]} depends on the choice of the "ground" state a, we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped. Moreover, notice that the choice of notation ν^{[a]} is justifiable through the notation (6) for the communicating class of the state a.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent^5. Consider the function ν^{[a]} defined in (15). Then ν^{[a]} satisfies the balance equations:

ν^{[a]} = ν^{[a]} P,  i.e.  ν^{[a]}(x) = \sum_{y∈S} ν^{[a]}(y) p_yx,  x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^{[a]}(x) < ∞.

Proof. We start the Markov chain with X_0 = a. Recurrence tells us that T_a < ∞ with probability one. Since a = X_{T_a}, we can write

ν^{[a]}(x) = E_a \sum_{n=1}^{T_a} 1(X_n = x).   (16)

This was a trivial but important step: we replaced the n = 0 term with the n = T_a term. The real reason for doing so is that we wanted to replace a function of X_0, X_1, X_2, ... by a function of X_1, X_2, X_3, ..., and thus be in a position to apply first-step analysis, which is precisely what we do next.

^5 We stress that we do not need the stronger assumption of positive recurrence here.

Indeed, from (16),

ν^{[a]}(x) = E_a \sum_{n=1}^{∞} 1(X_n = x, n ≤ T_a)
= \sum_{n=1}^{∞} P_a(X_n = x, n ≤ T_a)
= \sum_{n=1}^{∞} \sum_{y∈S} P_a(X_n = x, X_{n−1} = y, n ≤ T_a)
= \sum_{n=1}^{∞} \sum_{y∈S} p_yx P_a(X_{n−1} = y, n ≤ T_a)
= \sum_{y∈S} p_yx E_a \sum_{n=1}^{T_a} 1(X_{n−1} = y)
= \sum_{y∈S} p_yx ν^{[a]}(y),

where, in the last step, we used (15), since \sum_{n=1}^{T_a} 1(X_{n−1} = y) = \sum_{n=0}^{T_a − 1} 1(X_n = y). So we have shown

ν^{[a]}(x) = \sum_{y∈S} p_yx ν^{[a]}(y), for all x ∈ S,

which are precisely the balance equations: ν^{[a]} = ν^{[a]} P. Therefore,

ν^{[a]} = ν^{[a]} P^n,  for all n ∈ N.

Next, let x be such that a ⇝ x, so that there exists n ≥ 1 with p_ax^{(n)} > 0. First note that ν^{[a]}(a) = 1 (in the sum (15), only the n = 0 term can contribute to ν^{[a]}(a)). Hence

ν^{[a]}(x) ≥ ν^{[a]}(a) p_ax^{(n)} = p_ax^{(n)} > 0.

From Theorem 10, x ⇝ a, so there exists m ≥ 1 such that p_xa^{(m)} > 0. Hence

1 = ν^{[a]}(a) ≥ ν^{[a]}(x) p_xa^{(m)},

so ν^{[a]}(x) ≤ 1/p_xa^{(m)} < ∞. Thus, indeed, 0 < ν^{[a]}(x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely that a is positive recurrent. First observe that, by (16),

\sum_{x∈S} ν^{[a]}(x) = E_a \sum_{n=1}^{T_a} \sum_{x∈S} 1(X_n = x) = E_a \sum_{n=1}^{T_a} 1 = E_a T_a,

which is finite when a is positive recurrent.

Therefore. in general. 41 . Let R1 (respectively. But π [a] (x) = x∈S 1 E a Ta ν [a] (x) = 1. π(y) = 1 y∈S x∈S have at least one solution. the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. to decide whether j will occur in a specific excursion. . and so π [a] is not trivially equal to zero. We already know that j is recurrent. Start with X0 = i. If heads show up at the r-th excursion. unique. Proof. We have already shown. But π [a] is simply a multiple of ν [a] . 2. Then j is positive Proof. no matter which one. that ν [a] P = ν [a] . π [a] P = π [a] . we toss a coin with probability of heads gij > 0. Since a is positive recurrent. that it sums up to one. Also. then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is. 1. It only remains to show that π [a] is a probability. i. Otherwise. since the excursions are independent. The coins are i.i. and the probability gij = Pi (Tj < ∞) that j occurs in a specific excursion is positive. 15 Positive recurrence is a class property We now show that (just like recurrence) positive recurrence is also a class property. Let i be a positive recurrent state. positive recurrent state. R2 ) be the index of the first (respectively.d. (r) j. then j occurs in the r-th excursion. Suppose that i recurrent. then the set of linear equations π(x) = y∈S π(y)pyx .In other words: If there is a positive recurrent state. if tails show up.e. Hence j will occur in infinitely many excursions. . x∈S Pause for thought: We have achieved something VERY IMPORTANT: if we have one. we have Ea Ta < ∞. This is what we will try to do in the sequel. then the r-th excursion does not contain state j. . Theorem 15. r = 0. in Proposition 2. In words. Consider the excursions Xi . second) excursion that contains j.

.i. An irreducible Markov chain is called transient if one (and hence all) states are transient. where the Zr are i. R2 has finite expectation. . or all states are null recurrent or all states are transient. . by Wald’s lemma. .d. IRREDUCIBLE CHAIN r = 1. . Let C be a communicating class. But. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞. Then either all states in C are positive recurrent. 2. . Corollary 10. with the same law as Ti . we just need to show that R2 −R1 +1 ζ := r=1 Zr has finite expectation. In other words. RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 . An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . .R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 .

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we have seen before:

[Figure: states 0, 1, 2; from state 0 the chain moves to 1 with probability 1/3, stays at 0 with probability 1/2, and moves to 2 with probability 1/6; states 1 and 2 each have a self-loop of probability 1.]

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons: it is an absorbing state. But observe that there are many (infinitely many) stationary distributions: for each 0 ≤ α ≤ 1, the π defined by

π(0) = 0,  π(1) = α,  π(2) = 1 − α,

is a stationary distribution. (You should think why this is OBVIOUS!) Notice that states 1 and 2 do not communicate; this lack of communication is, after all, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Fix a ground state a ∈ S; by Corollary 9, there is a stationary distribution, namely π^{[a]}. Let π be an arbitrary stationary distribution. Consider the Markov chain (X_n, n ≥ 0) and suppose that P(X_0 = x) = π(x), x ∈ S, i.e. we start the Markov chain from the stationary distribution π. Then, for all n, we have P(X_n = x) = π(x), for all x ∈ S. Since the chain is irreducible and positive recurrent, we have P_i(T_j < ∞) = 1, for all states i and j. Hence

P(T_a < ∞) = \sum_{x∈S} π(x) P_x(T_a < ∞) = \sum_{x∈S} π(x) = 1.   (18)

By (13) and (18), we then have, with probability one,

(1/t) \sum_{n=1}^{t} 1(X_n = x) → E_a [ \sum_{k=0}^{T_a − 1} 1(X_k = x) ] / E_a T_a = π^{[a]}(x),  as t → ∞.

By a theorem of Analysis (the Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides:

E[ (1/t) \sum_{n=1}^{t} 1(X_n = x) ] → π^{[a]}(x),  as t → ∞.

But, by stationarity,

E[ (1/t) \sum_{n=1}^{t} 1(X_n = x) ] = (1/t) \sum_{n=1}^{t} P(X_n = x) = π(x).

Consider an arbitrary Markov chain. Consider a positive recurrent irreducible Markov chain and two states a. Corollary 12. y ∈ C. b ∈ S. Namely. Both π [a] and π[b] are stationary distributions. which is useful because it is often used as a criterion for positive recurrence: Corollary 13. Assume that a stationary distribution π exists. The following is a sort of converse to that. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π C 44 . 17 Structure of a stationary distribution It is not hard to show now what an arbitrary stationary distribution looks like. from the definition of π [a] . Corollary 11. delete all transition probabilities pxy with x ∈ C. Then π [a] = π [b] . There is a unique stationary distribution. Hence there is only one stationary distribution. The cycle formula merely re-expresses this equality. Proof. Consider a Markov chain. Let C be the communicating class of i. In particular. π(i|C) = Ei Ti . for all x ∈ S. Decompose it into communicating classes. Consider the restriction of the chain to C. for all a ∈ S. Thus.Therefore π(x) = π [a] (x). There is only one stationary distribution. But π(i|C) > 0. Let i be such that π(i) > 0. E a Ta Proof. i. By the 1 above corollary. Let π(·|C) be the distribution defined by restricting π to C: π(x|C) := π(x) . i is positive recurrent. an arbitrary stationary distribution π must be equal to the specific one π [a] . the only stationary distribution for the restricted chain because C is irreducible. we obtain the cycle formula: Ta −1 Tb −1 Ea k=0 1(Xk = x) E a Ta = Eb k=0 1(Xk = x) E b Tb . In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. x ∈ S. Then.e. Hence π = π [a] . We can easily see that we thus obtain a Markov chain with state space C with stationary distribution π(·|C). In particular. so they are equal. Then any state i such that π(i) > 0 is positive recurrent. Proof. π(C) x ∈ C. This is in fact. Consider a positive recurrent irreducible Markov chain with stationary distribution π. 1 π(a) = . π(a) = π [a] (a) = 1/Ea Ta . Therefore Ei Ti < ∞. where π(C) = y∈C π(y).

An arbitrary stationary distribution π must be a linear combination of these π^C:

Proposition 3. Any stationary distribution π is necessarily of the form

π = \sum_{C: C positive recurrent communicating class} α_C π^C,

where α_C ≥ 0 and \sum_C α_C = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then, by Corollary 13, x belongs to some positive recurrent class C. For each such C, define the conditional distribution

π(x|C) := π(x)/π(C),  where π(C) := \sum_{x∈C} π(x).

Then (π(x|C), x ∈ C) is a stationary distribution for the chain restricted to C, and, by our uniqueness result, π(x|C) ≡ π^C(x). So the above decomposition holds with α_C = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(X_n = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesaro averages (1/n) \sum_{k=1}^{n} P(X_k = x) converge to π(x), as n → ∞. A sequence of real numbers a_n may fail to converge while its Cesaro averages (1/n)(a_1 + ··· + a_n) do converge; for example, take a_n = (−1)^n.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain: a Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that is what causes the settling down. An ideal pendulum, without friction, will perform an oscillatory motion ad infinitum; still, we may say that it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:

ẋ = −a x + b,  x ∈ R,  a > 0.

The thing to observe here is that there is an equilibrium point, namely the one found by setting the right-hand side equal to zero:

x* = b/a.

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. Moreover, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

[Figure: several solution curves x(t), started from different initial values, all converging to the level x* as t grows.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus from speaking about continuous-time stochastic systems, let us consider a stochastic analogue of the above in discrete time:

X_{n+1} = ρ X_n + ξ_n.

There are myriads of applications giving physical interpretation to this (an example in physics, a financial example, or whatever), so I will leave it to the reader to decide how he or she wants to interpret it. Specifically, we here assume that X_0, ξ_0, ξ_1, ξ_2, ... are given, mutually independent, random variables, and we define a stochastic sequence X_1, X_2, ..., by means of the recursion above. We shall assume that the ξ_n all have the same law.

But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point. However, notice what happens when we solve the recursion; and we are in a happy situation here, for we can do it:

X_1 = ρ X_0 + ξ_0
X_2 = ρ X_1 + ξ_1 = ρ(ρ X_0 + ξ_0) + ξ_1 = ρ^2 X_0 + ρ ξ_0 + ξ_1
X_3 = ρ X_2 + ξ_2 = ρ^3 X_0 + ρ^2 ξ_0 + ρ ξ_1 + ξ_2
······
X_n = ρ^n X_0 + ρ^{n−1} ξ_0 + ρ^{n−2} ξ_1 + ··· + ρ ξ_{n−2} + ξ_{n−1}.

It will follow that |ρ| < 1 is the condition for stability.

Since |ρ| < 1, the first term ρ^n X_0 goes to 0 as n → ∞. The second term, which is a linear combination of ξ_0, ..., ξ_{n−1}, does not converge. But notice that, if we let

X̂_n := ρ^{n−1} ξ_{n−1} + ρ^{n−2} ξ_{n−2} + ··· + ρ ξ_1 + ξ_0,
X̃_n := ρ^{n−1} ξ_0 + ρ^{n−2} ξ_1 + ··· + ρ ξ_{n−2} + ξ_{n−1},

then, since the ξ_n are i.i.d., we have

P(X̃_n ≤ x) = P(X̂_n ≤ x),  for all x.

The nice thing is that, under nice conditions (e.g. if we assume that the ξ_n are bounded), X̂_n does converge:

X̂_n → X̂_∞ = \sum_{k=0}^{∞} ρ^k ξ_k.

Hence, while the limit of X_n does not exist, the limit of its distribution does:

lim_{n→∞} P(X̃_n ≤ x) = lim_{n→∞} P(X̂_n ≤ x) = P(X̂_∞ ≤ x).

What this means is that we need to define stability in a way other than convergence of the random variables themselves: we ask for convergence of their distributions.
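A quick simulation makes the point vividly (a sketch; ρ and the noise law are chosen arbitrarily, with bounded ξ_n uniform on (−1, 1)): individual trajectories of X_n keep fluctuating, yet the empirical mean and variance of X_n are essentially the same at any two large times, i.e. the distribution converges.

import random

rho = 0.8

def X(n):
    x = 0.0
    for _ in range(n):
        x = rho * x + random.uniform(-1, 1)   # bounded i.i.d. noise
    return x

for n in (50, 51, 500):
    sample = [X(n) for _ in range(20_000)]
    mean = sum(sample) / len(sample)
    var = sum((s - mean) ** 2 for s in sample) / len(sample)
    print(n, round(mean, 3), round(var, 3))
# mean stays near 0, variance near (1/3)/(1 - rho**2), about 0.93, at all large n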

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,

lim_{n→∞} P(X_n = x) = π(x),

uniformly in x. In particular, the n-step transition matrix P^n converges to a matrix with rows all equal to π:

lim_{n→∞} p_ij^{(n)} = π(j),  i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law: for instance, we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define

P(X = x, Y = y) = P(X = x) P(Y = y).

But suppose that we have some further requirement; for example, we may wish to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), but suppose we are not told what the joint probabilities are. Find a joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

f(x) = \sum_y h(x, y),  g(y) = \sum_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (X_n, n ≥ 0) and the law of a stochastic process Y = (Y_n, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space. We say that a random time T is a meeting time of the two processes if

X_n = Y_n for all n ≥ T.

This T is a random variable that takes values in the set of times, or the value +∞ on the event that the two processes never meet. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,

|P(X_n = x) − P(Y_n = x)| ≤ P(T > n).   (19)

If, in particular, T is finite with probability one, i.e. P(T < ∞) = 1, then

|P(X_n = x) − P(Y_n = x)| → 0, uniformly in x ∈ S, as n → ∞.

Proof. Suppose that the two processes have been coupled. We have

P(X_n = x) = P(X_n = x, n < T) + P(X_n = x, n ≥ T)
= P(X_n = x, n < T) + P(Y_n = x, n ≥ T)
≤ P(n < T) + P(Y_n = x),

and hence P(X_n = x) − P(Y_n = x) ≤ P(T > n). Repeating the argument with the roles of X and Y interchanged, we obtain P(Y_n = x) − P(X_n = x) ≤ P(T > n). Combining the two inequalities, we obtain the first assertion (19). Since (19) holds for all x ∈ S, and since its right-hand side does not depend on x, we can write it as

sup_{x∈S} |P(X_n = x) − P(Y_n = x)| ≤ P(T > n).

Assume now that P(T < ∞) = 1. Then

lim_{n→∞} P(T > n) = P(T = ∞) = 0,

and therefore

sup_{x∈S} |P(X_n = x) − P(Y_n = x)| → 0,  as n → ∞.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let p_ij denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (X_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_ij and X_0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Y_n, n ≥ 0), has the law of a Markov chain with the same transition probabilities p_ij and Y_0 distributed according to the stationary distribution π. It is very important to note that we have only defined the law of X and the law of Y; we have not defined a joint law. In other words, we have not defined joint probabilities such as P(X_n = x, Y_m = y). We now couple them by doing the straightforward thing: we assume that X = (X_n, n ≥ 0) is independent of Y = (Y_n, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

T := inf{n ≥ 0 : X_n = Y_n}.

Consider the process

W_n := (X_n, Y_n).

Then (W_n, n ≥ 0) is a Markov chain with state space S × S. Its (1-step) transition probabilities are

q_{(x,y),(x′,y′)} := P(W_{n+1} = (x′, y′) | W_n = (x, y)) = P(X_{n+1} = x′ | X_n = x) P(Y_{n+1} = y′ | Y_n = y) = p_{xx′} p_{yy′},

and its n-step transition probabilities are

q_{(x,y),(x′,y′)}^{(n)} = p_{xx′}^{(n)} p_{yy′}^{(n)}.

Its initial state W_0 = (X_0, Y_0) has distribution P(W_0 = (x, y)) = µ(x)π(y). From Theorem 3 and the aperiodicity assumption, we have p_{xx′}^{(n)} > 0 and p_{yy′}^{(n)} > 0 for all large n, implying that q_{(x,y),(x′,y′)}^{(n)} > 0 for all large n. Hence (W_n, n ≥ 0) is an irreducible chain.

Notice that

σ(x, y) := π(x)π(y),  (x, y) ∈ S × S,

is a stationary distribution for (W_n, n ≥ 0). (Indeed, had we started with both X_0 and Y_0 independent and distributed according to π, then, for all n, X_n and Y_n would be independent and distributed according to π.) Moreover, σ(x, y) > 0 for all (x, y) ∈ S × S. By Corollary 13, (W_n, n ≥ 0) is positive recurrent; in particular, it is recurrent, so it reaches the diagonal in finite time:

P(T < ∞) = 1.
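The coupling used in the proof can be simulated directly. The sketch below (Python; an arbitrary aperiodic 3-state chain, with π = (26/77, 30/77, 21/77) computed from its balance equations) runs X from a fixed state and Y from π, independently, and estimates the tail P(T > n) of the meeting time; by the coupling inequality, this bounds sup_x |P(X_n = x) − π(x)|.

import random

P = [[0.1, 0.6, 0.3],
     [0.5, 0.2, 0.3],
     [0.4, 0.4, 0.2]]                    # arbitrary transition matrix
pi = [26/77, 30/77, 21/77]               # its stationary distribution

def step(i):
    return random.choices([0, 1, 2], weights=P[i])[0]

def meeting_time():
    x = 2                                            # X_0 fixed
    y = random.choices([0, 1, 2], weights=pi)[0]     # Y_0 ~ pi
    n = 0
    while x != y:
        x, y, n = step(x), step(y), n + 1
    return n

samples = [meeting_time() for _ in range(100_000)]
for n in (1, 3, 5, 10):
    print(n, sum(1 for t in samples if t > n) / len(samples))   # P(T > n)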

Define now a third process by

Z_n := X_n 1(n < T) + Y_n 1(n ≥ T);

i.e. Z_n equals X_n before the meeting time of X and Y, and Z_n = Y_n thereafter. Note that Z_0 = X_0, which has the arbitrary distribution µ. By the strong Markov property, (Z_n, n ≥ 0) is identical in law to (X_n, n ≥ 0): after the meeting time, the process continues to move with the same transition probabilities p_ij, so (Z_n, n ≥ 0) is a Markov chain with transition probabilities p_ij and initial distribution µ.

[Figure: paths of X (arbitrary initial distribution), Y (stationary initial distribution) and Z: Z follows X up to the meeting time T and follows Y from then on.]

Obviously, T is also a meeting time between Z and Y. By the coupling inequality,

|P(Z_n = x) − P(Y_n = x)| ≤ P(T > n),

which converges to zero, precisely because P(T < ∞) = 1. Since P(Z_n = x) = P(X_n = x) for all n and x, and since P(Y_n = x) = π(x), we have

|P(X_n = x) − π(x)| ≤ P(T > n),

uniformly in x, with the right-hand side tending to 0 as n → ∞. This proves Theorem 17.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_ij are the transition probabilities of an irreducible chain X, and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and the p_ij are

p_01 = p_10 = 1.

Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and its transition probabilities are

q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = 1,  q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.

This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, setting p_00 = 0.01, p_01 = 0.99, p_10 = 1,

then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above, with p_01 = p_10 = 1, and suppose X_0 = 0. Then X_n = 0 for even n and X_n = 1 for odd n. Therefore p_00^{(n)} = 1 for even n and = 0 for odd n, and so lim_{n→∞} p_00^{(n)} does not exist. However, suppose we modify the chain by taking p_00 = 0.01, p_01 = 0.99, p_10 = 1. By the results of §5.4, the chain is aperiodic; by the results of Section 14, we know we have a unique stationary distribution π with π(x) > 0 for both states x. Indeed, π(0), π(1) satisfy

0.99 π(0) = π(1),  π(0) + π(1) = 1,

whence π(0) ≈ 0.503, π(1) ≈ 0.497. Then, by Theorem 17,

lim_{n→∞} p_00^{(n)} = lim_{n→∞} p_10^{(n)} = π(0),  lim_{n→∞} p_11^{(n)} = lim_{n→∞} p_01^{(n)} = π(1);

in matrix notation,

lim_{n→∞} [ p_00^{(n)}  p_01^{(n)} ; p_10^{(n)}  p_11^{(n)} ] = [ 0.503  0.497 ; 0.503  0.497 ].
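Both examples can be seen at once with a few matrix powers (a sketch in Python):

import numpy as np

periodic = np.array([[0.0, 1.0],
                     [1.0, 0.0]])          # p01 = p10 = 1: period 2
aperiodic = np.array([[0.01, 0.99],
                      [1.00, 0.00]])       # the modified, aperiodic chain

for n in (10, 11):
    print(np.linalg.matrix_power(periodic, n))    # keeps alternating
print(np.linalg.matrix_power(aperiodic, 200))     # rows near (0.503, 0.497)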

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"^6: the term "ergodic" means something specific in Mathematics (cf. Parry, 1981), and it is customary to have a consistent terminology, which is what our choice expresses.

^6 We put "wrong" in quotes: terminology cannot be, per se, wrong.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain, and suppose that the period of the chain is d. By the results of §5.4, we can decompose the state space into d classes C_0, C_1, ..., C_{d−1}, such that the chain moves cyclically between them: from C_0 to C_1, from C_1 to C_2, ..., and from C_{d−1} back to C_0. It is not difficult to see that:

Lemma 8. Let π be the stationary distribution and let

α_r := \sum_{i∈C_r} π(i),  r = 0, 1, ..., d − 1.

Then α_r = 1/d for all r.

Proof. We have that π satisfies the balance equations:

π(i) = \sum_{j∈S} π(j) p_ji,  i ∈ S.

Suppose i ∈ C_r. Then p_ji = 0 unless j ∈ C_{r−1}, so

π(i) = \sum_{j∈C_{r−1}} π(j) p_ji.

Summing up over i ∈ C_r, we obtain

α_r = \sum_{i∈C_r} π(i) = \sum_{j∈C_{r−1}} π(j) \sum_{i∈C_r} p_ji = \sum_{j∈C_{r−1}} π(j) = α_{r−1}.

Since the α_r add up to 1, each one must be equal to 1/d.

The chain (X_{nd}, n ≥ 0) consists of d closed communicating classes, namely C_0, ..., C_{d−1}, and all its states have period 1, i.e. (X_{nd}, n ≥ 0) is aperiodic. Hence, if we restrict the chain (X_{nd}, n ≥ 0) to any one of the sets C_r, we obtain a positive recurrent irreducible aperiodic chain, whose (unique) stationary distribution is (d π(i), i ∈ C_r), and the previous results apply. This means that:

Theorem 18. Suppose that p_ij are the transition probabilities of a positive recurrent irreducible chain with period d, and suppose that X_0 ∈ C_r. Then, for all i, j ∈ C_r,

p_ij^{(nd)} → d π(j),  as n → ∞.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ C_k, j ∈ C_ℓ, and let r := ℓ − k (modulo d). We shall show that:

Lemma 9. Let i ∈ C_k, j ∈ C_ℓ, and r = ℓ − k (modulo d). Then

p_ij^{(r+nd)} → d π(j),  as n → ∞.

Proof. Starting from i, the original chain enters C_ℓ in r steps (where r is taken to be a number between 0 and d − 1, after reduction modulo d) and, thereafter, returns to C_ℓ every d steps. Hence, conditioning on the position of the chain after the first r steps and applying Theorem 18,

p_ij^{(r+nd)} = \sum_{m∈C_ℓ} p_im^{(r)} p_mj^{(nd)} → \sum_{m∈C_ℓ} p_im^{(r)} d π(j) = d π(j),

as n → ∞.

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_ij^{(n)}, as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_ij^{(n)} → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_ij^{(n)} → 0 as n → ∞, for all i ∈ S.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then

p_ij^{(n)} → 0,  as n → ∞,

for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly–Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution on N:

p^{(n)} = (p_1^{(n)}, p_2^{(n)}, p_3^{(n)}, ...),  where \sum_{i=1}^{∞} p_i^{(n)} = 1 and p_i^{(n)} ≥ 0 for all i.

Then we can find a sequence n_1 < n_2 < n_3 < ··· of integers such that lim_{k→∞} p_i^{(n_k)} exists for all i ∈ N.

Proof. Since the numbers p_1^{(n)}, n = 1, 2, ..., are contained between 0 and 1, there must be a subsequence (n_k^1, k = 1, 2, ...) such that lim_{k→∞} p_1^{(n_k^1)} exists and equals, say, p_1. Now consider the numbers p_2^{(n_k^1)}, k = 1, 2, ... Since they are also contained between 0 and 1, there must be a subsequence (n_k^2) of (n_k^1) such that lim_{k→∞} p_2^{(n_k^2)} exists and equals, say, p_2. It is clear how we continue: for each i, we find a subsequence (n_k^i, k = 1, 2, ...) of the previous one such that lim_{k→∞} p_i^{(n_k^i)} exists and equals, say, p_i. By construction, the sequence (n_k^2) of the second row is a subsequence of (n_k^1) of the first row, the sequence (n_k^3) of the third row is a subsequence of (n_k^2) of the second row, and so on. Schematically:

p_1^{(n_1^1)}, p_1^{(n_2^1)}, p_1^{(n_3^1)}, ··· → p_1
p_2^{(n_1^2)}, p_2^{(n_2^2)}, p_2^{(n_3^2)}, ··· → p_2
p_3^{(n_1^3)}, p_3^{(n_2^3)}, p_3^{(n_3^3)}, ··· → p_3
···

Now pick the numbers in the diagonal. The diagonal subsequence (n_k^k, k = 1, 2, ...) is, eventually, a subsequence of each row (n_k^i, k = 1, 2, ...). Hence

p_i^{(n_k^k)} → p_i,  as k → ∞, for all i.

The second result is Fatou's lemma, whose proof we omit^7:

^7 See Brémaud (1999).

Lemma 11. If (x_i^{(n)}, i ∈ N), n = 1, 2, ..., and (y_i, i ∈ N) are sequences of nonnegative numbers, then

\sum_{i=1}^{∞} y_i lim inf_{n→∞} x_i^{(n)} ≤ lim inf_{n→∞} \sum_{i=1}^{∞} y_i x_i^{(n)}.

Proof of Theorem 19. Let j be a null recurrent state, and suppose, towards a contradiction, that for some i the sequence p_ij^{(n)} does not converge to 0. Then we can find a subsequence n_k such that

lim_{k→∞} p_ij^{(n_k)} = α_j > 0.

Consider the probability vector (p_ix^{(n_k)}, x ∈ S). By Lemma 10, we can pick a subsequence (n′_k) of (n_k) such that

lim_{k→∞} p_ix^{(n′_k)} = α_x,  for all x ∈ S.

Since \sum_{x∈S} p_ix^{(n′_k)} = 1 for all k, Lemma 11 gives

0 < \sum_{x∈S} α_x ≤ 1,

the strict positivity because α_j > 0. Next, since P^{n+1} = P^n P, we have

p_ix^{(n′_k + 1)} = \sum_{y∈S} p_iy^{(n′_k)} p_yx,

and, by Lemma 11,

lim inf_{k→∞} p_ix^{(n′_k + 1)} ≥ \sum_{y∈S} ( lim_{k→∞} p_iy^{(n′_k)} ) p_yx = \sum_{y∈S} α_y p_yx.

Hence

α_x ≥ \sum_{y∈S} α_y p_yx,  for all x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. it is a strict inequality:

α_{x_0} > \sum_{y∈S} α_y p_{y x_0},  for some x_0 ∈ S.

This would imply

\sum_{x∈S} α_x > \sum_{x∈S} \sum_{y∈S} α_y p_yx = \sum_{y∈S} α_y \sum_{x∈S} p_yx = \sum_{y∈S} α_y,

and this is a contradiction: the number \sum_{x∈S} α_x cannot be strictly larger than itself. Thus the inequalities are equalities, and

π(x) := α_x / \sum_{y∈S} α_y,  x ∈ S,

satisfies the balance equations and has \sum_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0, because α_j > 0. By Corollary 13, state j is positive recurrent: contradiction. Therefore p_ij^{(n)} → 0 as n → ∞, for all i.

19.2 The general case

It remains to see what the limit of p_ij^{(n)} is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, let

H_C := inf{n ≥ 0 : X_n ∈ C},

and let ϕ_C(i) := P_i(H_C < ∞). Then

lim_{n→∞} p_ij^{(n)} = ϕ_C(i) π^C(j).

Proof. If i belongs to C then ϕ_C(i) = 1, and the result follows from Theorem 17. In general, since the chain cannot be at j at time n unless it has entered C by then, we have

p_ij^{(n)} = P_i(X_n = j) = P_i(X_n = j, H_C ≤ n) = \sum_{m=1}^{n} P_i(H_C = m) P_i(X_n = j | H_C = m).

From the Strong Markov Property,

P_i(X_n = j | H_C = m) = P_µ(X_{n−m} = j),

where µ is the distribution of X_{H_C} given that X_0 = i and that {H_C < ∞}. But Theorem 17 tells us that the limit exists regardless of the initial distribution:

P_µ(X_{n−m} = j) → π^C(j),  as n → ∞.

Hence

lim_{n→∞} p_ij^{(n)} = π^C(j) \sum_{m=1}^{∞} P_i(H_C = m) = π^C(j) P_i(H_C < ∞) = ϕ_C(i) π^C(j).

Remark: We can write the result of Theorem 20 as

lim_{n→∞} p_ij^{(n)} = f_ij / E_j T_j,

where T_j = inf{n ≥ 1 : X_n = j} and f_ij = P_i(T_j < ∞); indeed ϕ_C(i) = f_ij (for a recurrent class, hitting C is equivalent to eventually hitting j) and π^C(j) = 1/E_j T_j. In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: states 1, 2, 3, 4, 5; state 1 has a self-loop of probability 1 − γ and leads to 2 with probability γ; state 2 leads to 1 with probability 1; state 3 leads to 2 with probability 1 − α and to 4 with probability α; state 4 leads to 3 with probability β and to 5 with probability 1 − β; state 5 is absorbing.]

The communicating classes are C_1 = {1, 2} (closed class), C_2 = {5} (absorbing state), and I = {3, 4} (inessential states). The stationary distributions associated with the two closed classes are:

π^{C_1}(1) = 1/(1+γ),  π^{C_1}(2) = γ/(1+γ),  π^{C_2}(5) = 1.

Let ϕ(i) be the probability that class C_1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, and, from first-step analysis,

ϕ(3) = (1 − α) ϕ(2) + α ϕ(4),  ϕ(4) = (1 − β) ϕ(5) + β ϕ(3),

whence

ϕ(3) = (1 − α)/(1 − αβ),  ϕ(4) = β(1 − α)/(1 − αβ).

Both C_1 and C_2 are aperiodic classes. Hence, as n → ∞,

p_31^{(n)} → ϕ(3) π^{C_1}(1),  p_32^{(n)} → ϕ(3) π^{C_1}(2),  p_33^{(n)} → 0,  p_34^{(n)} → 0,  p_35^{(n)} → (1 − ϕ(3)) π^{C_2}(5),
p_41^{(n)} → ϕ(4) π^{C_1}(1),  p_42^{(n)} → ϕ(4) π^{C_1}(2),  p_43^{(n)} → 0,  p_44^{(n)} → 0,  p_45^{(n)} → (1 − ϕ(4)) π^{C_2}(5).
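For concreteness, here is a numerical check of these limits (a sketch; the values α = β = γ = 1/2 are arbitrary, and the transition matrix encodes our reading of the diagram above):

import numpy as np

a, b, g = 0.5, 0.5, 0.5                      # arbitrary alpha, beta, gamma
#     1      2      3    4    5
P = np.array([
    [1 - g,  g,     0.0, 0.0, 0.0],          # 1: stays w.p. 1-gamma, to 2 w.p. gamma
    [1.0,    0.0,   0.0, 0.0, 0.0],          # 2 -> 1
    [0.0,    1 - a, 0.0, a,   0.0],          # 3 -> 2 or 4
    [0.0,    0.0,   b,   0.0, 1 - b],        # 4 -> 3 or 5
    [0.0,    0.0,   0.0, 0.0, 1.0],          # 5 absorbing
])

phi3 = (1 - a) / (1 - a * b)
print(np.linalg.matrix_power(P, 200)[2, 0])  # p_31^{(n)} for large n
print(phi3 / (1 + g))                        # phi(3) * pi^{C1}(1): should match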

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics: the ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, for example, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville.

We will avoid giving a precise definition of the term "ergodic", and only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem, and so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that there we assumed the existence of a positive recurrent state; here we assume recurrence only. Consider the quantity ν^{[a]}(x) of (16), the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X_0 such that P(T_a < ∞) = 1. Let f be a reward function such that

\sum_{x∈S} |f(x)| ν^{[a]}(x) < ∞.   (20)

Then

P( lim_{t→∞} \sum_{n=0}^{t} f(X_n) / \sum_{n=0}^{t} 1(X_n = a) = f̄ ) = 1,   (21)

where

f̄ := \sum_{x∈S} f(x) ν^{[a]}(x).

Proof. Note that the denominator \sum_{n=0}^{t} 1(X_n = a) equals the number N_t of visits to state a up to time t. As in the proof of Theorem 14, we break the total reward as follows:

\sum_{n=0}^{t} f(X_n) = \sum_{r=1}^{N_t − 1} G_r + G_t^{first} + G_t^{last},

where G_r = \sum_{T_a^{(r)} ≤ s < T_a^{(r+1)}} f(X_s), while G_t^{first} and G_t^{last} are the total rewards over the first and last incomplete cycles, respectively. The Law of Large Numbers tells us that

lim_{n→∞} (1/n) \sum_{r=1}^{n} G_r = E G_1,

with probability 1, as long as E|G_1| < ∞; but this is precisely the condition (20). Replacing n by the subsequence N_t (which tends to ∞ with probability one), and arguing as in the proof of Theorem 14 that the contributions of G_t^{first} and G_t^{last} are negligible, we obtain the result. We only need to check that E G_1 = f̄, and this is left as an exercise.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a, whereas here we assume only recurrence. When a is positive recurrent we also have N_t/t → 1/E_a T_a, so, dividing the numerator and the denominator of the fraction in (21) by t, the denominator converges to 1/E_a T_a and the numerator to f̄/E_a T_a, which is the quantity denoted by f̄ in Theorem 14. The reader is asked to revisit the proof of Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem 14 holds, and this justifies the terminology of §18.5: such a chain is rightly called ergodic.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. When the state space is finite, at least one of the states must be visited infinitely often; so there are always recurrent states. In fact, we can say more:

Consider an irreducible Markov chain with finitely many states. By Theorem 12, all its states are recurrent. Fix any state a and consider the function ν = ν^{[a]}. From Proposition 2 we know that the balance equations ν = νP have at least one solution. Since the number of states is finite, we have

C := \sum_{i∈S} ν(i) < ∞.

Hence

π(i) := ν(i)/C,  i ∈ S,

satisfies the balance equations and the normalisation condition \sum_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0; by Corollary 13, this state must be positive recurrent, and then all states are positive recurrent, because positive recurrence is a class property (Theorem 15) and the chain is irreducible. In addition, the stationary distribution is unique, by Theorem 16. If, in addition, the chain is aperiodic then, by Theorem 17, P^n → Π as n → ∞, where Π is the matrix all the rows of which are equal to π.

We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent, with a unique stationary distribution π. If, in addition, the chain is aperiodic, then P^n → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π.
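Theorem 22 suggests the standard recipe for finite chains: solve the balance equations plus the normalisation condition for π and, in the aperiodic case, watch P^n converge to Π. A sketch (Python with numpy; the irreducible aperiodic matrix is arbitrary):

import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.2, 0.0, 0.8],
              [0.5, 0.5, 0.0]])   # arbitrary irreducible aperiodic chain

# solve pi P = pi together with sum(pi) = 1
A = np.vstack([P.T - np.eye(3), np.ones(3)])
pi, *_ = np.linalg.lstsq(A, np.array([0, 0, 0, 1.0]), rcond=None)
print(pi)
print(np.linalg.matrix_power(P, 100))   # every row approaches pi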

22 Time reversibility

Consider a Markov chain (X_n, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that

(X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0), for all n.

Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. It is, in a sense, like playing a film backwards: if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, and we will not be able to tell that it is, actually, running backwards. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0) for all n. In particular, the first components of the two vectors must have the same distribution, i.e. X_n has the same distribution as X_0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_ij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations (DBE) hold:

π(i) p_ij = π(j) p_ji,  i, j ∈ S.

Proof. Suppose first that the chain is reversible. By Lemma 12 it is stationary, so P(X_0 = i) = π(i) for all i. Taking n = 1 in the definition of reversibility, (X_0, X_1) has the same distribution as (X_1, X_0), and this means that

P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j).

But P(X_0 = i, X_1 = j) = π(i) p_ij and P(X_0 = j, X_1 = i) = π(j) p_ji. Hence the DBE hold.

Conversely, suppose the DBE hold. They imply immediately that (X_0, X_1) has the same distribution as (X_1, X_0). Next, pick three states i, j, k and write, using the DBE,

π(i) p_ij p_jk = [π(j) p_ji] p_jk = [π(j) p_jk] p_ji = π(k) p_kj p_ji,

which says that P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i), and hence (X_0, X_1, X_2) has the same distribution as (X_2, X_1, X_0). By induction, it is easy to show, in the same way, that (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0) for all n. So the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_ij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i_1, i_2, ..., i_m ∈ S and all m ≥ 3,

p_{i_1 i_2} p_{i_2 i_3} ··· p_{i_{m−1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m−1}} ··· p_{i_2 i_1}.   (22)

Proof. Suppose first that the chain is reversible; we will show the KLC. The DBE hold and, using them repeatedly,

π(i) p_ij p_jk p_ki = [π(j) p_ji] p_jk p_ki = [π(j) p_jk] p_ji p_ki = [π(k) p_kj] p_ji p_ki = [π(k) p_ki] p_ji p_kj = [π(i) p_ik] p_ji p_kj = π(i) p_ik p_kj p_ji.

Since the chain is irreducible, we have π(i) > 0 for all i; cancelling π(i) from the two ends of the display, we obtain the KLC for m = 3 (with i_1 = i, i_2 = j, i_3 = k). The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix i_1 and i_2, and sum (22) over all choices of the states i_3, i_4, ..., i_m; we obtain

p_{i_1 i_2} p_{i_2 i_1}^{(m−1)} = p_{i_1 i_2}^{(m−1)} p_{i_2 i_1}.

Since the chain is aperiodic (and, by Corollary 13, positive recurrent), letting m → ∞ we have, by Theorem 17,

p_{i_2 i_1}^{(m−1)} → π(i_1)  and  p_{i_1 i_2}^{(m−1)} → π(i_2),

implying that

π(i_1) p_{i_1 i_2} = π(i_2) p_{i_2 i_1}.

These are the DBE, and hence, by Theorem 23, the chain is reversible.

Notes: 1) A necessary condition for reversibility is that p_ij > 0 if and only if p_ji > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible. 2) The DBE are equivalent to saying that the ratio p_ij/p_ji can be written as a product of a function of i and a function of j; in other words, that we have separation of variables:

that the chain is positive recurrent. there isn’t any.) Necessarily. and by replacing. the two oriented edges from i to j and from j to i by a single edge without arrow. If the graph thus obtained is a tree. then we need.(Convention: we form this ratio only when pij > 0. j for which pij > 0. namely the arrow from i to j. There is obviously only one arrow from AtoAc .1) which gives π(i)pij = π(j)pji . The reason is as follows. Example 1: Chain with a tree-structure. and the arrow from j to j is the only one from Ac to A. to guarantee. the detailed balance equations.) Let A.e. then the chain is reversible. Pick two distinct states i. so we have pij g(j) = . If we remove the link between them. i. the graph becomes disconnected. So g satisfies the balance equations. by assumption. The balance equations are written in the form F (A. f (i)g(i) must be 1 (why?). Hence the chain is reversible. and. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. for each pair of distinct states i. there would be a loop. Form the graph by deleting all self-loops. And hence π can be found very easily. 60 . For example. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. A) (see §4. in addition. • If the chain has infinitely many states. j for which pij > 0. pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . (If it didn’t. meaning that it has no loops. Ac be the sets in the two connected components. using another method. Ac ) = F (Ac . consider the chain: Replace it by: This is a tree. Hence the original chain is reversible.

There are two cases: If i. 0. For each state i define its degree δ(i) = number of neighbours of i. p02 = 1/4. i∈S Since δ(i) = 2|E|. 2. j are not neighbours then both sides are 0. To see this let us verify that the detailed balance equations hold. where C is a constant. π(i)pij = π(j)pji . Suppose the graph is connected. If i. and a set E of undirected edges. Now define the the following transition probabilities: pij = 1/δ(i). i∈S where |E| is the total number of edges. For example.e. so both members are equal to C. Consider an undirected graph G with vertices S. The Markov chain with these transition probabilities is called a random walk on a graph (RWG).Example 2: Random walk on a graph. We claim that π(i) = Cδ(i). Each edge connects a pair of distinct vertices (states) without orientation. States j are i are called neighbours if they are connected by an edge. if j is a neighbour of i otherwise. etc. Hence we conclude that: 1. The stationary distribution assigns to a state probability which is proportional to its degree. The constant C is found by normalisation. i. The stationary distribution of a RWG is very easy to find. a finite set. we have that C = 1/2|E|. 61 . the DBE hold. In all cases. π(i) = 1. A RWG is a reversible chain. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C.

i.) has the Markov property but it is not time-homogeneous. Let (St . . . (m) (m) 23.3 Subordination Let (Xn . 1. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). . . 1. . 2. The sequence (Yk .e. k = 0.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. n ≥ 0) be Markov with transition probabilities pij .1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. in general. 2. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . . random variables with values in N and distribution λ(n) = P (ξ0 = n).i. . 1. A r = 1. where pij are the m-step transition probabilities of the original chain. t ≥ 0. 2. Define Yr := XT (r) . . . The state space of (Yr ) is A. if nk = n0 + km. i. irreducible and positive recurrent). . .e. this process is also a Markov chain and it does have time-homogeneous transition probabilities. 62 n ∈ N. k = 0. .) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij .23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). k = 0. how can we create new ones? 23. t ≥ 0) be independent from (Xn ) with S0 = 0. . 1.e. . then (Yk . π is a stationary distribution for the original chain. If the subsequence forms an arithmetic progression. . . In this case. 2. then it is a stationary distribution for the new chain. Due to the Strong Markov Property. if. 23.d. 2. in addition. . St+1 = St + ξt . k = 0. Let A be (r) a set of states and let TA be the r-th visit to A. where ξt are i.

. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). in general. (Notation: We will denote the elements of S by i. ∞ λ(n)pij . .) More generally. (n) 23. .Then Yt := XSt . . we have: Lemma 13. . Fix α0 . α1 . Proof.e. . is NOT. β). conditional on Yn the past is independent of the future. . Then (Yn ) has the Markov property. .. Indeed. . however. i. . n≥0 63 .4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . . and the elements of S ′ by α. . Yn−1 = αn−1 }. . If. We need to show that P (Yn+1 = β | Yn = α. then the process Yn = f (Xn ). a Markov chain. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. β. f is a one-to-one function then (Yn ) is a Markov chain. Yn−1 ) = P (Yn+1 = β | Yn = α). that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . knowledge of f (Xn ) implies knowledge of Xn and so.. j. .

β) i∈S P (Yn = α|Xn = i. But we will show that P (Yn+1 = β|Xn = i). for all i ∈ Z and all β ∈ Z+ . Example: Consider the symmetric drunkard’s random walk. Define Yn := |Xn |. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Yn = α. β). the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. On the other hand. i ∈ Z. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β) P (Yn = α|Xn = i. In other words. then |Xn+1 | = 1 for sure. P (Xn+1 = i ± 1|Xn = i) = 1/2. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. a Markov chain with transition probabilities Q(α. β). depends on i only through |i|. 64 . indeed. α = 0. Thus (Yn ) is Markov with transition probabilities Q(α. i ∈ Z. β). β) P (Yn = α. if i = 0. But P (Yn+1 = β | Yn = α. β) := 1. To see this. β)P (Xn = i | Yn = α. conditional on {Xn = i}. regardless of whether i > 0 or i < 0. Yn−1 )P (Xn = i | Yn = α. Yn−1 ) Q(f (i). We have that P (Yn+1 = β|Xn = i) = Q(|i|. observe that the function | · | is not one-to-one. Indeed. and so (Yn ) is. because the probability of going up is the same as the probability of going down. Yn−1 ) Q(f (i). β) P (Yn = α|Yn−1 ) i∈S = Q(α. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). if we let 1/2. β ∈ Z+ . i. β). α > 0 Q(α. Then (Yn ) is a Markov chain. and if i = 0. if β = 1.for all choices of the variables involved. if β = α ± 1.e.

If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. Back to the general case. therefore transient. and 0 is always an absorbing state. . Then P (ξ = 0) + P (ξ ≥ 2) > 0. which has a binomial distribution: i j pij = p (1 − p)i−j . .. since ξ (0) . ξ2 . ξ (1) . Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. and. independently from one another. In other words. k = 0.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. Therefore. in general. . Thus. But here is an interesting special case. if the current population is i there is probability pi that everybody gives birth to zero offspring. 0 ≤ j ≤ i. . Hence all states i ≥ 1 are inessential. 2. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. if 0 is never reached then Xn → ∞. then we see that the above equation is of the form Xn+1 = f (Xn . It is a Markov chain with time-homogeneous transitions. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. We let pk be the probability that an individual will bear k offspring. a hard problem. are i. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. n ≥ 0) has the Markov property. Let ξk be the number of offspring of the k-th individual in the n-th generation. we have that (Xn . Example: Suppose that p1 = p. the population cannot grow: it will eventually reach 0. . So assume that p1 = 1. so we will find some different method to proceed. if the 65 (n) (n) . 1. We find the population (n) at time n + 1 by adding the number of offspring of each individual. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring. with probability 1. each individual gives birth to at one child with probability p or has children with probability 1 − p.24 24. j In this case. . (and independent of the initial population size X0 –a further assumption).d. ξ (n) ). 0 But 0 is an absorbing state. . If we let ξ (n) = ξ1 . . Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . Hence all states i ≥ 1 are transient.i. . and following the same distribution. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. one question of interest is whether the process will become extinct or not. p0 = 1 − p. The state 0 is irrelevant because it cannot be reached from anywhere. The parent dies immediately upon giving birth.

population does not become extinct then it must grow to infinity. ϕn (0) = P (Xn = 0). The result is this: Proposition 5. . . . This function is of interest because it gives information about the extinction probability ε. We wish to compute the extinction probability ε(i) := Pi (D). If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. each one behaves independently of one another. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. We then have that D = D1 ∩ · · · ∩ Di . since Xn = 0 implies Xm = 0 for all m ≥ n. For each k = 1. if we start with X0 = i in ital ancestors. ε(0) = 1. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. and P (Dk ) = ε. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. we can argue that ε(i) = εi . and. Indeed. Let ε be the extinction probability of the process. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. when we start with X0 = i. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . . Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. Indeed. Therefore the problem reduces to the study of the branching process starting with X0 = 1. where ε = P1 (D). i. If µ ≤ 1 then the ε = 1. For i ≥ 1. so P (D|X0 = i) = εi . 66 . Consider a Galton-Watson process starting with X0 = 1 initial ancestor. Obviously. Proof.

recall that ϕ is a convex function with ϕ(1) = 1. all we need to show is that ε < 1. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. and since the latter has only two solutions (z = 0 and z = ε∗ ). ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). Notice now that µ= d ϕ(z) dz .) 67 . ϕn (0) ≤ ε∗ . if µ > 1. By induction. Also ϕn (0) → ε. Let us show that. So ε ≤ ε∗ < 1. ϕn+1 (0) = ϕ(ϕn (0)). ϕ(ϕn (0)) → ϕ(ε). ϕn+1 (z) = ϕ(ϕn (z)).Note also that ϕ0 (z) = 1. since ϕ is a continuous function. z=1 Also. if the slope µ at z = 1 is ≤ 1. But ϕn+1 (0) → ε as n → ∞. and. Since both ε and ε∗ satisfy z = ϕ(z). We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). In particular. But ϕn (0) → ε. in fact. This means that ε = 1 (see left figure below). Therefore. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . But 0 ≤ ε∗ . Therefore ε = ε∗ . ε = ε∗ . (See right figure below. and so on. Therefore ε = ϕ(ε). On the other hand.

Nowadays. One obvious solution is to allocate time slots to hosts in a specific order. If s ≥ 2.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). We assume that the An are i. . and Sn the number of selected packets. So. 4. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). So if there are. Then ε < 1. Then ε = 1. only one of the hosts should be transmitting at a given time. 0).1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. periodically. 2. 5. random variables with some common distribution: P (An = k) = λk . otherwise. on the next times slot. there is collision and the transmission is unsuccessful. . is that if more than one hosts attempt to transmit a packet. . then max(Xn + An − 1. Case: µ > 1. 1. 1. 6. for a packet to go through. 2. Ethernet has replaced Aloha. Thus. 3 hosts. there are x + a − 1 packets. there are x + a packets. k ≥ 0. An the number of new packets on the same time slot.i. . then time slots 0. Since things happen fast in a computer network. if Sn = 1. 68 . say. 1. are allocated to hosts 1. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. 3. using the same frequency.d. If collision happens. Abandoning this idea. we let hosts transmit whenever they like. in a simplified model. Let s be the number of selected packets. The problem of communication over a common channel. on the next times slot. 3. 2. Xn+1 = Xn + An . Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p.. there is no time to make a prior query whether a host has something to transmit or not. 3. If s = 1 there is a successful transmission and. 24. During this time slot assume there are a new packets that need to be transmitted. then the transmission is not successful and the transmissions must be attempted anew. there is a collision and. . .

Unless µ = 0 (which is a trivial case: no arrivals!). We here only restrict ourselves to the computation of this expected change.) = x + a k x−k p q . The main result is: Proposition 6. if Xn + An = 0. Proving this is beyond the scope of the lectures.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . The reason we take the maximum with 0 in the above equation is because. we should not subtract 1. we must have no new packets arriving. To compute px. selections are made amongst the Xn + An packets. n ≥ 0) is a Markov chain. 69 . so q x G(x) tends to 0 as x → ∞. The random variable Sn has. Let us find its transition probabilities. The probabilities px.e. For x > 0. Since p > 0. we have that the drift is positive for all large x and this proves transience. we think as follows: to have a reduction of x by 1. Idea of proof. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. and so q x tends to 0 much faster than G(x) tends to ∞.x−1 = P (An = 0. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). i. we have q < 1. and p. k 0 ≤ k ≤ x + a. So P (Sn = k|Xn = x. Sn = 1|Xn = x) = λ0 xpq x−1 . where G(x) is a function of the form α + βx . Xn−1 . Therefore ∆(x) ≥ µ/2 for all large x. . An = a. where q = 1 − p.where λk ≥ 0 and k≥0 kλk = 1. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . for example we can make it ≤ µ/2. To compute px. . the Markov chain (Xn ) is transient.e. we have ∆(x) = (−1)px.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . k ≥ 1. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1).x−1 when x > 0. Indeed. So: px.x−1 + k≥1 kpx. So we can make q x G(x) as small as we like by choosing x sufficiently large. defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. The Aloha protocol is unstable. .x are computed by subtracting the sum of the above from 1. and exactly one selection: px. then the chain is transient. The process (Xn . An−1 . We also let µ := k≥1 kλk be the expectation of An .x+k for k ≥ 1.

In an Internet-dominated society. the rank of a page may be very important for a company’s profits. Academic departments link to each other by hiring their 70 . So this is how Google assigns ranks to pages: It uses an algorithm. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. the chain moves to a random page. All you need to do to get high rank is pay (a lot of money). then i is a ‘dangling page’. α ≈ 0. Then it says: page i has higher rank than j if π(i) > π(j). It uses advanced techniques from linear algebra to speed up computations. There is a technique. known as ‘302 Google jacking’. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. PageRank. The idea is simple.2) and let α pij := (1 − α)pij + . Comments: 1. there are companies that sell high-rank pages. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. Its implementation though is one of the largest matrix computations in the world. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. For each page i let L(i) be the set of pages it links to. 2. because of the size of the problem. Define transition probabilities as follows:  1/|L(i)|. |V | These are new transition probabilities. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. It is not clear that the chain is irreducible.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. in the latter case. We pick α between 0 and 1 (typically. Therefore. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. So we make the following modification. neither that it is aperiodic. If L(i) = ∅. otherwise. unless i is a dangling page.24. that allows someone to manipulate the PageRank of a web page and make it much higher. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. So if Xn = i is the current location of the chain. if L(i) = ∅   0. 3. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. if j ∈ L(i)  pij = 1/|V |.

a dominant (say A) and a recessive (say a). To simplify.4 The Wright-Fisher model: an example from biology In an organism. this allele is of type A with probability x/2N . maintain this allele for the next generation. a gene controlling a specific characteristic (e. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. in each generation. But an AA mating with a Aa may produce a child of type AA or one of type Aa. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. then. 5. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). One of the circle alleles is selected twice. We will assume that a child chooses an allele from each parent at random. 4. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele.g. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. We have a transition from x to y if y alleles of type A were produced. The alleles in an individual come in pairs and express the individual’s genotype for that gene. 24. their offspring inherits an allele from each parent. The process repeats. What is important is that the evaluation can be done without ever looking at the content of one’s research. all we need is the information outlined in this figure. meaning how many characteristics we have. This number could be independent of the research quality but. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. 71 . from the point of view of an administrator. we will assume that.faculty from each other (and from themselves). In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). Imagine that we have a population of N individuals who mate at random producing a number of offspring. And so on. Since we don’t care to keep track of the individuals’ identities. generation after generation. When two individuals mate. Thus. the genetic diversity depends only on the total number of alleles of each type. We would like to study the genetic diversity. two AA individuals will produce and AA child. If so. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. it makes the evaluation procedure easy. This means that the external appearance of a characteristic is a function of the pair of alleles. Do the same thing 2N times. One of the square alleles is selected 4 times. colour of eyes) appears in two forms. Three of the square alleles are never selected. We will also assume that the parents are replaced by the children.

. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . ϕ(x) = y=0 pxy ϕ(y). with n P (ξk = 1|Xn . . n n where. Clearly. . . . conditionally on (X0 . at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. . . so both 0 and 2N are absorbing states. . Clearly. 2N − 1} form a communicating class of inessential states. the chain gets absorbed by either 0 or 2N . .Each allele of type A is selected with probability x/2N . on the average. . X0 ) = Xn . X0 ) = E(Xn+1 |Xn ) = M This implies that. . . . . as usual. and the population becomes one consisting of alleles of type a only. . .) One can argue that. the random variables ξ1 . on the average. M n P (ξk = 0|Xn . . . the expectation doesn’t change and. . since. Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. the states {1. y ≤ 2N − 1. we see that 2N x 2N −y x y pxy = . and the population becomes one consisting of alleles of type A only. 72 . . but the argument needs further justification because “the end of time” is not a deterministic time. we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. The fact that M is even is irrelevant for the mathematical model. (Here. Similarly. for all n ≥ 0.e. i. The result is correct. ϕ(x) = x/M. . since. . . Xn ). These are: M ϕ(0) = 0. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . X0 ) = 1 − Xn . Since selections are independent. M Therefore. . We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. EXn+1 = EXn Xn = Xn . Since pxy > 0 for all 1 ≤ x. 0 ≤ y ≤ 2N.e. p00 = p2N.i. M and thus. i. ϕ(M ) = 1. the number of alleles of type A won’t change. E(Xn+1 |Xn . Tx := inf{n ≥ 1 : Xn = x}.2N = 1. ξM are i. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. Let M = 2N .d.

say. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. A number An of new components are ordered and added to the stock. we will try to solve the recursion. . M x M (M −1)−(y−1) x x +1− M M M −1 = 24.d. there are Xn + An − Dn components in the stock. Working for n = 1. . M −1 y−1 x M y−1 1− x . The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. . while a number of them are sold to the market. for all n ≥ 0. b). and E(An − Dn ) = −µ < 0. Since the Markov chain is represented by this very simple recursion. on the average. and the stock reached level zero at time n + 1.The first two are obvious. n ≥ 0) is a Markov chain. n ≥ 0) are i. We have that (Xn .5 A storage or queueing application A storage facility stores. at time n + 1. Thus. 73 . Let ξn := An − Dn . Otherwise. Dn ). If the demand is for Dn components. provided that enough components are available. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. the demand is met instantaneously. larger than the supply per unit of time. the demand is only partially satisfied. Xn electronic components at time n. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. and so.i. the demand is. We will show that this Markov chain is positive recurrent. We will apply the so-called Loynes’ method. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . Here are our assumptions: The pairs of random variables ((An . so that Xn+1 = (Xn + ξn ) ∨ 0. 2.

the event whose probability we want to compute is a function of (ξ0 . we may replace the latter random variables by ξ1 . Hence. that d (ξ0 . . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations.Thus. (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . where the latter equality follows from the fact that. = x≥0 Since the n-dependent terms are bounded. Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). . . and. we reverse it. ξn−1 ) = (ξn−1 . ξn−1 . The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). we find π(y) = x≥0 π(x)pxy . . . . Therefore. . ξn . . . . ξn−1 ). we may take the limit of both sides as n → ∞. . (n) . for all n. Notice.i. for y ≥ 0. In particular. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). ξ0 ). in computing the distribution of Mn which depends on ξ0 .d. Indeed. . the random variables (ξn ) are i. 74 . . . . . . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. using π(y) = limn→∞ P (Mn = y). instead of considering them in their original order. . so we can shuffle them in any way we like without altering their joint distribution. however. .

Z2 .So π satisfies the balance equations. . n→∞ This implies that P ( lim Zn = −∞) = 1. as required. Depending on the actual distribution of ξn . This will be the case if g(y) is not identically equal to 1. . . It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. we will use the assumption that Eξn = −µ < 0. n→∞ Hence g(y) = P (0 ∨ max{Z1 . n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . Z2 . . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. . Here. The case Eξn = 0 is delicate.} < ∞) = 1.} > y) is not identically equal to 1. 75 . We must make sure that it also satisfies y π(y) = 1. the Markov chain may be transient or null recurrent. .

. namely. so far. . p(x) = 0. groups) are allowed.and space-homogeneity. besides time. . then p(1) = p. Define  pj . x ∈ Zd . . if x = −ej . we need to give more structure to the state space S. we won’t go beyond Zd . A random walk possesses an additional property. This is the drunkard’s walk we’ve seen before. . Our Markov chains. Example 1: Let p(x) = C2−|x1 |−···−|xd | . if x ∈ {±e1 . . j = 1. . . . 0. given a function p(x). . here.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. If d = 1. Example 3: Define p(x) = 1 2d . So. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. if x = ej . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. In dimension d = 1. .y+z for any translation z. d   0.y = p(y − x). This means that the transition probability pxy should depend on x and y only through their relative positions in space. j = 1. where p1 + · · · + pd + q1 + · · · + qd = 1. the skip-free property. p(−1) = q. otherwise. To be able to say this. where C is such that x p(x) = 1. this walk enjoys. that of spatial homogeneity. . we can let S = Zd . the set of d-dimensional vectors with integer coordinates. ±ed } otherwise 76 .and space-homogeneous transition probabilities given by px. have been processes with timehomogeneous transition probabilities. More general spaces (e. For example. d  p(x) = qj . a random walk is a Markov chain with time. A random walk of this type is called simple random walk or nearest neighbour random walk. but.y = px+z. a structure that enables us to say that px.g. otherwise where p + q = 1.

xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. The convolution of two probabilities p(x). . . . The special structure of a random walk enables us to give more detailed formulae for its behaviour. the initial state S0 is independent of the increments. we can reduce several questions to questions about this random walk. Indeed. . p ∗ δz = p. p′ (x). . . are i. p(x) = 0. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . (When we write a sum without indicating explicitly the range of the summation variable. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x).) Now the operation in the last term is known as convolution. So. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. x ± ed . x ∈ Zd . p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). otherwise. Let us denote by Sn the state of the walk at time n. We shall resist the temptation to compute and look at other 77 .i. (Recall that δz is a probability distribution that assigns probability 1 to the point z. . has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 .) In this notation. and this is the symmetric drunkard’s walk. we will mean a sum over all possible values.d. y p(x)p′ (y − x). computing the n-th order convolution for all n is a notoriously hard problem. ξ2 . in addition. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . We will let p∗n be the convolution of p with itself n times. Obviously. random variables in Zd with common distribution P (ξn = x) = p(x). which. If d = 1.This is a simple random walk. in general. x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. then p(1) = p(−1) = 1/2. But this would be nonsense. One could stop here and say that we have a formula for the n-step transition probability. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). Furthermore. Furthermore. We refer to ξn as the increments of the random walk. because all we have done is that we dressed the original problem in more fancy clothes. where ξ1 . We call it simple symmetric random walk. x + ξ2 = y) = y p(x)p(y − x).

if we can count the number of elements of A we can compute its probability. +1}n . Will the random walk ever return to where it started from? B. a path of length n. random variables (random signs) with P (ξn = ±1) = 1/2. we should assign. . . Here are a couple of meaningful questions: A. 2) → (3. and so #A P (A) = n . Therefore. 2 where #A is the number of elements of A. a random vector with values in {−1. 1) → (4. A ⊂ {−1.d. After all.0) We can thus think of the elements of the sample space as paths. . we have P (ξ1 = ε1 . consider the following: 78 . . for fixed n. Clearly. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. 1}n . how long will that take? C. If not. 1): (2.2) (4. −1. we shall learn some counting methods.1) + − (5. we are dealing with a uniform distribution on the sample space {−1. If yes. + (1. 1) → (2. +1}n . In the sequel.1) − (3. +1. +1.i−1 = 1/2 for all i ∈ Z. . . 2) → (5. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . Events A that depend only on these the first n random signs are subsets of this sample space. So. Look at the distribution of (ξ1 . Since there are 2n elements in {0. 0) → (1. . −1} then we have a path in the plane that joins the points (0. all possible values of the vector are equally likely. +1}n . The principle is trivial. As a warm-up.methods. and the sequence of signs is {+1. . but its practice may be hard.1) + (0.2) where the ξn are i. ξn = εn ) = 2−n .i. ξn ). So if the initial position is S0 = 0. First. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones.i+1 = pi. to each initial position and to each sequence of n signs.

Proof. y) is given by: #{(m. We can easily count that there are 6 paths in A. y)}.) The darker region shows all paths that start at (0. 0)). x) and end up at (n. Hence (n+i)/2 = 0.1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. y)} = n−m . (n − m + y − x)/2 (23) Remark. 0) and n ends at (n. 0) (n. y) is given by: #{(0. 0) and end up at (n. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. n 79 . i). (There are 2n of them. S6 < 0.i) (0. x) (n. We use the notation {(m. In if n + i is not an even number then there is no path that starts from (0. i). The paths comprising this event are shown below. Consider a path that starts from (0. and compute the probability of the event A = {S2 ≥ 0. x) (n. 0) and ends at (n. i). x) and end up at (n. The number of paths that start from (0. Let us first show that Lemma 14. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0.0) The lightly shaded region contains all paths of length n that start at (0. S7 < 0}. i)} = n n+i 2 More generally.047. 0) and end up at the specific point (n. in this case. y)} =: {all paths that start at (m. the number of paths that start from (m. (n. S4 = −1.Example: Assume that S0 = 0.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

. The probability that.P (Sn = y|Sm = x) = In particular. we shall just compute it. (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y .3 Some conditional probabilities Lemma 17. for now. . 27. If n. if n is even. n/2 (29) #{(0. n 2 #{paths (m. given that S0 = 0 and Sn = y > 0. We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). n 2−n . Such a simple answer begs for a physical/intuitive explanation. . 0) (n. P0 (Sn = y) 2 (n + y) + 1 . The left side of (30) equals #{paths (1. n This is called the ballot theorem and the reason will be given in Section 29. . n (30) Proof. Sn ≥ 0 | Sn = y) = 1 − In particular. But. Sn > 0 | Sn = y) = y . . y) that do not touch zero} × 2−n #{paths (0. n−m 2−n−m . y)} . y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . . x) 2n−m (n. if n is even. . S2 > 0. 1) (n. . the random walk never becomes zero up to time n is given by P0 (S1 > 0. 27. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. P0 (S1 ≥ 0 . Sn ≥ 0 | Sn = 0) = 84 1 . y)} . . .2 The ballot theorem Theorem 25. 0) (n.

. (f ) follows from the replacement of (ξ2 . . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . Sn ≥ 0. ξ2 . . . We use (27): P0 (Sn ≥ 0 . Sn ≥ 0) = #{(0. . . . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . . . . (b) follows from symmetry. . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . . Sn < 0) (b) (a) = 2P0 (S1 > 0. . .4 Some remarkable identities Theorem 26. •). . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . . . . . . Sn = 0) = P0 (S1 > 0. . ξ2 + ξ3 ≥ 0. . . Proof. . . . . . Sn ≥ 0) = P0 (S1 = 0. Sn > 0) + P0 (S1 < 0. For even n we have P0 (S1 ≥ 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . (e) follows from the fact that ξ1 is independent of ξ2 . . . 0) (n. . . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. and Sn−1 > 0 implies that Sn ≥ 0. .) by (ξ1 . . . . Sn = 0) = P0 (Sn = 0). Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . . . P0 (S1 ≥ 0. .Proof. . (c) (d) (e) (f ) (g) where (a) is obvious. +1 27. .). . 1 + ξ2 + ξ3 > 0. . . Sn ≥ 0). ξ3 . . . If n is even. . . n/2 where we used (28) and (29). 85 . . . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . We next have P0 (S1 = 0. ξ3 . . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. . . . and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0.

. P0 (Sn−1 > 0. . We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps.27. we can omit the last event. Sn = 0) = 2P0 (Sn−1 > 0. . . . By the Markov property at time n − 1. . Since the random walk cannot change sign before becoming zero. . Theorem 27. To take care of the last term in the last expression for f (n). n−1 Proof. with equal probability. By symmetry. Sn−1 = 0. Sn−2 > 0|Sn−1 > 0. . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). This is not so interesting. . given that Sn = 0. . Sn = 0. Sn−2 > 0|Sn−1 = 1). . the probability u(n) that the walk (started at 0) attains value 0 in n steps. . So f (n) = 2P0 (S1 > 0. Also. . . and u(n) = P0 (Sn = 0). Sn−1 > 0. Sn = 0). . for even n. n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. . Sn = 0 means Sn−1 = 1. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. . So 1 f (n) = 2u(n) P0 (S1 > 0. and if the particle is to the left of a then its mirror image is on the right. So the last term is 1/2.5 First return to zero In (29). the two terms must be equal. So u(n) f (n) = . . 86 . So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. Sn = 0) = u(n) . . Sn−1 > 0. . Fix a state a. Sn = 0) P0 (S1 > 0. f (n) = P0 (S1 > 0. Put a two-sided mirror at a. Sn = 0) Now. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). . Then f (n) := P0 (S1 = 0. . we see that Sn−1 > 0. Sn−1 < 0. If a particle is to the right of a then you see its image on the left. we computed. Let n be even. . we either have Sn−1 = 1 or −1. . n We know that P0 (Sn = 0) = u(n) = n/2 2−n . Sn = 0) + P0 (S1 < 0.

So: P0 (Sn ≥ a) = P0 (Ta ≤ n. . n ∈ Z+ ). But a symmetric random walk has the same law as its negative. 28. Thus. If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). so we may replace Sn by 2a − Sn (Theorem 28). observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). Sn ). we have n ≥ Ta . Then. 2 Proof. we obtain 87 . n ≥ Ta By the strong Markov property. S1 . called (Sn . Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). Sn ≤ a) + P0 (Sn > a). m ≥ 0) is a simple symmetric random walk. Sn > a) = P0 (Ta ≤ n. . P0 (Sn ≥ a) = P0 (Ta ≤ n. Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). n ≥ Ta . The last equality follows from the fact that Sn > a implies Ta ≤ n. n < Ta . If Sn ≥ a then Ta ≤ n. Proof. for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . . Let (Sn . the process (STa +m − a. Theorem 28. Sn ≤ a) + P0 (Ta ≤ n. n ≥ 0) be a simple symmetric random walk and a > 0. In other words. n ∈ Z+ ) has the same law as (Sn . define Sn = where is the first hitting time of a. a + (Sn − a). Sn ≥ a).What is more interesting is this: Run the particle till it hits a (it will.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. as a corollary to the above result. 2a − Sn . So Sn − a can be replaced by a − Sn in the last display and the resulting process. In the last event. Then Ta = inf{n ≥ 0 : Sn = a} Sn . n < Ta . Sn ≤ a). Sn = Sn . 2a − Sn ≥ a) = P0 (Ta ≤ n. But P0 (Ta ≤ n) = P0 (Ta ≤ n. independent of the past before Ta . . Finally. Trivially. 2 Define now the running maximum Mn = max(S0 .

A sample from the urn is a sequence η = (η1 . Acad. 9 Bertrand. where ηi indicates the colour of the i-th item picked. . 105. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. Rend. as for holding liquids. Sci. An urn is a vessel. Compt. Proof. Compt. which results into P0 (Tx ≤ n. the ashes of the dead after cremation. 2 Proof. e 10 Andr´. Sci. If an urn contains a azure items and b = n−a black items. . posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. (1887). The answer is very simple: Theorem 31. then the probability that. The set of values of η contains n elements and η is uniformly distributed a in this set. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. (31) x > y. Rend. (1887). during sampling without replacement. It is the latter use we are interested in in Probability and Statistics. P0 (Mn < x. 105.Corollary 19. ηn ). Sn = y) = P0 (Tx ≤ n. Since Mn < x is equivalent to Tx > n. 436-437. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. usually a vase furnished with a foot or pedestal. the number of selected azure items is always ahead of the black ones equals (a − b)/n. ornamental objects. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . . D. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). then. 369. Solution directe du probl`me r´solu par M. as long as a ≥ b. e e e Paris. 8 88 . employed for different purposes. Sn = y) = P0 (Sn = 2x − y). of which a are coloured azure and b black. The original ballot problem. Solution d’un probl`me. J. and anciently for holding ballots to be drawn. If Tx ≤ n. . Acad. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. This. combined with (31). Sn = 2x − y). gives the result. Paris. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. Bertrand. Sn = y). we have P0 (Mn < x. n = a + b. Sn = y) = P0 (Tx > n. by applying the reflection principle (Theorem 28). we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n.

. The values of each ξi are ±1. Sn > 0 | Sn = y) = .i+1 = p. . An urn can be realised quite easily be a random walk: Theorem 32. n Since y = a − (n − a) = a − b. . (The word “asymmetric” will always mean “not necessarily symmetric”. so ‘azure’ items are +1 and ‘black’ items are −1. but which is not necessarily symmetric. . . Then. the sequence (ξ1 . we can translate the original question into a question about the random walk. . ηn ) an urn process with parameters n. Since the ‘azure’ items correspond to + signs (respectively. . . . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. . a. . But in (30) we computed that y P0 (S1 > 0. εn be elements of {−1. . Proof. . . So the number of possible values of (ξ1 . But if Sn = y then. i ∈ Z. . . Proof of Theorem 31. We have P0 (Sn = k) = n n+k 2 pi. It remains to show that all values are equally a likely. . the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. n ≥ 0) with values in Z and let y be a positive integer. 89 . b be defined by a + b = n. . . where y = a − (n − a) = 2a − n. a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. p(n+k)/2 q (n−k)/2 . Consider a simple symmetric random walk (Sn . . +1} such that ε1 + · · · + εn = y. ξn = εn ) P0 (Sn = y) 2−n 1 = = . Then. n n −n 2 (n + y)/2 a Thus. indeed. ξn ) has a uniform distribution. conditional on the event {Sn = y}. . using the definition of conditional probability and Lemma 16. Since an urn with parameters n. . . . conditional on {Sn = y}. . . . . . conditional on {Sn = y}. . the result follows. We need to show that. . always assuming (without any loss of generality) that a ≥ b = n − a. ξn ) is. 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. −n ≤ k ≤ n. . ξn ) is uniformly distributed in a set with n elements. . . the sequence (ξ1 .We shall call (η1 . Sn > 0} that the random walk remains positive. the ‘black’ items correspond to − signs). . .) So pi. (ξ1 . letting a. . .i−1 = q = 1 − p. a − b = y. a we have that a out of the n increments will be +1 and b will be −1. ξn = εn | Sn = y) = P0 (ξ1 = ε1 . we have: P0 (ξ1 = ε1 . . Let ε1 . n . .

Consider a simple random walk starting from 0. This random variable takes values in {0. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. . y ∈ Z. x ∈ Z. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . 2. If S1 = 1 then T1 = 1. For all x. Proof.d. 2qs Proof. Consider a simple random walk starting from 0. plus a time T1 required to go from −1 to 0 for the first time. we make essential use of the skip-free property. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. E0 sTx = ϕ(s)x . Theorem 33. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. To prove this. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. Let ϕ(s) := E0 sT1 . x ≥ 0) is also a random walk. .1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. The process (Tx . We will approach this using probability generating functions.) Then.i. 2. In view of Lemma 19 we need to know the distribution of T1 . 90 . Lemma 18. 1. it is no loss of generality to consider the random walk starting from 0. for x > 0. In view of this. Consider a simple random walk starting from 0. . . See figure below. Corollary 20. The rest is a consequence of the strong Markov property. only the relative position of y with respect to x is needed. Px (Ty = t) = P0 (Ty−x = t). x − 1 in order to reach x.} ∪ {∞}. random variables. . If x > 0. Proof. Lemma 19. plus a ′′ further time T1 required to go from 0 to +1 for the first time. . and all t ≥ 0. .30. (We tacitly assume that S0 = 0.

91 . Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. from the strong Markov property. The formula is obtained by solving the quadratic. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. if x < 0 2ps Proof. in order to hit 1. Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Hence the previous formula applies with the roles of p and q interchanged. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). The first time n such that Sn = −1 is the first time n such that −Sn = 1. 2ps Proof. Corollary 22.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. Corollary 21. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 .  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . Consider a simple random walk starting from 0. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Consider a simple random walk starting from 0.

30. then we can omit the k upper limit in the sum. If we define n to be equal to 0 for k > n. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . how long will it take till it first returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. if S1 = 1.3 The distribution of the first return to the origin We wish to find the exact distribution of the first return time to the origin. If α = n is a positive integer then n (1 + x)n = k=0 n k x . (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. this was found in §30. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . and simply write (1 + x)n = k≥0 n k x . k 92 . 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . 1 − 4pqs2 . where α is a real number. The latter is. Consider a simple random walk starting from 0. the same as the time required for the walk to hit 1 starting ′ from 0. Similarly. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. We again do first-step analysis. k by the binomial theorem. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. For a symmetric ′ random walk.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even.2 First return to the origin Now let us ask: If the particle starts at 0. For the general case. in distribution. Then the probability generating function of the first return time to 0 is given by E0 s T 0 = 1 − Proof. 2ps ′ =1− 30.

√ 1 − x. let us compute √ (−1)k (−x)k = k≥0 k≥0 k≥0 −1 (−x)k k = (−1)k k! (−1)(−2) · · · (−1 − k + 1) = = (−1)k . k! The adverb ‘formally’ is supposed to mean that this formula holds for some x around 0. k as long as we define α k = α(α − 1) · · · (α − k + 1) . So √ 1−x= ′ k≥0 1 2k k x =1+ 2k − 1 k 1 − 4pqs2 : ′ k≥1 2k k 1 x 2k − 1 k We apply this to E0 sT0 = 1 − E0 s T 0 = k≥1 1 2k (pq)k s2k . k!(n − k)! k! α k x . −1 k so (1 − x)−1 = as needed. Indeed.Recall that n k = It turns out that the formula is formally true for general α. k! k! (−1)2k xk = k≥0 xk . 2k − 1 k 93 . by definition. k We can write this in more familiar symbols if we observe that 2k −k 1 1/2 4 . (1 − x)−1 = But. (−1)k = 2k − 1 k k and this is shown by direct computation. We have k≥0 1 − x = (1 − x)1/2 = 1/2 (−1)k xk . and it looks exactly the same: (1 + x)α = k≥0 n(n − 1) · · · (n − k + 1) n! = . For example. we will show that (1 − x)−1 = 1 + x + x2 + x3 + · · · which is of course the all too-familiar geometric series. but that we won’t worry about which ones. As another example.

That it is actually null-recurrent follows from Markov chain theory. Comparing the two ‘infinite polynomials’ we see that we must have n = 2k (we knew that– the random walk can return to 0 only in an even number of steps) and then ′ P0 (T0 = 2k) = 1 1 2k k k p q = P0 (S2k = 0). We will see how the results about the generating functions of the hitting and return times help us answer these questions directly. x 1−|p−q| 2q 1−|p−q| 2p . x if x > 0 . s↑1 which is true for any probability generating function. if x < 0 This formula can be written more neatly as follows: Theorem 35. The quantity inside the square root becomes 1 − 4pq. Hence it is transient. so 1 − 4pq = So P0 (Tx < ∞) = 1 − 4p + 4p2 =        (1 − 2p)2 = |1 − 2p| = |p − q|. • If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one.But. E0 s T 0 = n≥0 ′ ′ ′ P0 (T0 = n)sn . without appeal to the Markov chain theory. by the definition of E0 sT0 . • If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. But remember that q = 1 − p. Hence it is again transient. if x < 0 |x| In particular. if p = q then P0 (Tx < ∞) = 1 for all x ∈ Z. • If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. 94 . The key fact needed here is: P0 (Tx < ∞) = lim E0 sTx . P0 (Tx < ∞) = p q q p ∧1 ∧1 . 2k − 1 k 2k − 1 31 Recurrence properties of simple random walks Consider again a simple random walk Sn with values in Z with upward probability p and downward probability q = 1 − p. |x| if x > 0 (33) . By applying the law of large numbers we have that P0 Sn =p−q n→∞ n lim = 1. We apply this to (32).

. Assume p < 1/2. If p < q then |p − q| = q − p and. then we have M∞ = limn→∞ Mn . This means that there is a largest value that will be ever attained by the random walk. S2 . Indeed. 31. the sequence Mn is nondecreasing. n→∞ n→∞ x ≥ 0. The simple symmetric random walk with values in Z is null-recurrent. except that the limit may be ∞. this probability is always strictly less than 1. Then the overall maximum M∞ has a geometric distribution: P0 (M∞ ≥ x) = (p/q)x . but here it is < ∞. 95 . it will never ever return to the point where it started from. Corollary 23. Sn ). Consider a random walk with values in Z. Proof. every nondecreasing sequence has a limit. Then limn→∞ Sn = −∞ with probability 1. . . What is this probability? Theorem 37. with probability 1 because p < 1/2. q p. with some positive probability. Theorem 36. 31. if x > 0 . if x < 0 which is what we want.) If we let Mn = max(S0 . Proof. P0 (M∞ ≥ x) = lim P0 (Mn ≥ x) = lim P0 (Tx ≤ n) = P0 (Tx < ∞) = (p/q)x . . x ≥ 0. The last part of Theorem 35 tells us that it is recurrent. Consider a simple random walk with values in Z. This value is random and is denoted by M∞ = sup(S0 . . Otherwise. Then ′ P0 (T0 < ∞) = 2(p ∧ q). . Let ′ T0 = inf{n ≥ 1 : Sn = 0} be the first return time to 0.1 Asymptotic distribution of the maximum Suppose now that p < 1/2.2 Return probabilities Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from. S1 .Proof. we obtain what we are looking for. Unless p = 1/2. If p > q then |p − q| = p − q and so (33) gives P0 (Tx < ∞) = 1. . and so it is null-recurrent. again. We know that it cannot be positive recurrent.

2. If p = q this gives 1. (34) a random variable first considered in Section 10 where it was also shown to be geometric. For the second formula we have ∞ ∞ E0 J0 = k=1 P0 (J0 ≥ k) = k=1 (2(p ∧ q))k = 2(p ∧ q) . But J0 ≥ 1 if and only if the first return time to zero is finite. In the latter case. even for p = 1/2. . 2. We now look at the the number of visits to some state but if we start from a different state. Ei Ji = 1 − 2(p ∧ q) Remark 2: The formulae work for all values of p ∈ [0. This proves the first formula. Consider a simple random walk with values in Z. . 1. . . We now compute its exact distribution. 2(p ∧ q) E0 J0 = . So ′ P0 (J0 ≥ 1) = P0 (T0 < ∞) = 2(p ∧ q). k = 0. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. The total number of visits to state i is given by ∞ Ji = n=1 1(Sn = i).3 Total number of visits to a state If the random walk is recurrent then it visits any state infinitely many times. . 1 − 2(p ∧ q) this being a geometric series. 1. In Lemma 4 we showed that. And so P0 (J0 ≥ k) = P0 (J0 ≥ 1)k . k = 0. 1]. meaning that Pi (Ji = ∞) = 1. we obviously have that the same formulae apply for any i: Pi (Ji ≥ k) = (2(p ∧ q))k . But if it is transient. But ′ P0 (T0 < ∞) = lim EsT0 = 1 − s↑1 ′ 1 − 4pq = 1 − |p − q|. the total number of visits to a state will be finite. . . . The probability generating function of T0 was obtained in Theorem 34. as already known.′ Proof. if a Markov chain starts from a state i then Ji is geometric. Pi (Ji ≥ k) = 1 for all k. Then. Remark 1: By spatial homogeneity. 1 − 2(p ∧ q) Proof. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. 2(p ∧ q) . starting from zero. the total number J0 of visits to state 0 is geometrically distributed with P0 (J0 ≥ k) = (2(p ∧ q))k . k = 0. Theorem 38. 31. 96 . where the last probability was computed in Theorem 37. . 2. 1. .

If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (q/p)|x| (2q)k−1 . p < 1/2) then. Jx ≥ k) = P0 (Tx < ∞)Px (Jx ≥ k − 1). we have E0 Jx = 1/(1−2p) regardless of x. In other words. E0 Jx = If p ≥ 1/2 then P0 (Jx ≥ k) = (q/p)(−x)∨0 (2q)k−1 . Consider the case p < 1/2. Consider a random walk with values in Z.e. Then P0 (Tx < ∞. then E0 Jx drops down to zero geometrically fast as x tends to infinity. ∞ As for the expectation. We repeat the argument for x < 0 and find P0 (Jx ≥ k) = q ∧1 p |x| (2(p ∧ q))k−1 . The reason that we replaced k by k − 1 is because. So we obtain P0 (Jx ≥ k) = p ∧1 q x (p/q)x∨0 1 − 2p k≥1 . If p ≤ 1/2 then P0 (Jx ≥ k) = (p/q)x∨0 (2p)k−1 . if the random walk “drifts down” (i. Remark 1: The expectation E0 Jx is finite if p = 1/2. If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (2p)k−1 . If p ≤ 1/2 then this gives P0 (Jx ≥ k) = (p/q)x (2p)k−1 .e. The first term was computed in Theorem 35. and the second in Theorem 38. But for x < 0. 97 . Jx ≥ k).Theorem 39. Apply now the strong Markov property at Tx . simply because if Jx ≥ k ≥ 1 then Tx < ∞. Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae. (q/p)(−x)∨0 1 − 2q k≥1 . starting from 0. E0 Jx = Proof. we apply E0 Jx = k=1 P0 (Jx ≥ k) and perform the sum of a geometric series. (2(p ∧ q))k−1 . If p ≥ 1/2 then this gives P0 (Jx ≥ k) = (2q)k−1 . For x > 0. Suppose x > 0. k ≥ 1. Then P0 (Jx ≥ k) = P0 (Tx < ∞. given that Tx < ∞. Py (Jx ≥ k) = P0 (Jx−y ≥ k) and Ey Jx = E0 Jx−y . there has already been one visit to state x and so we need at least k − 1 remaining visits in order to have at least k visits in total. it will visit any point below 0 exactly the same number of times on the average. i.

for a simple symmetric random walk. Then P0 (Tx = n) = x P0 (Sn = x) . ξ1 ). i. 0 ≤ k ≤ n) of (Sk . (35) In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk: 1 P0 (Tx ≤ n) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). in Theorem 41.d. the probabilities P0 (Tx = n) = x P0 (Sn = x). Use the obvious fact that (ξ1 . 0 ≤ k ≤ n) is defined by Sk := Sn − Sn−k . Pause for reflection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. and thus derived the generating function of Tx : E0 sTx = Specialising to the symmetric case we have E0 s Tx 1− 1 − 4pqs2 2qs x . every time we have a probability for S we can translate it into a probability for S. . . random variables. . for a simple random walk and for x > 0. . . Sn = x) P0 (Tx = n) = . ξ2 . P0 (Sn = x) P0 (Sn = x) x . .32 Duality Consider random walk Sk = k ξi . . 0 ≤ k ≤ n) has the same distribution as (Sk . n x > 0. Tx is the sum of x i. 0 ≤ k ≤ n − 1. because the two have the same distribution. . which is basically another probability for S. Proof. Rewrite the ballot theorem terms of the dual: P0 (S1 > 0. i=1 Then the dual (Sk . we derived using duality. Let (Sk ) be a simple symmetric random walk and x a positive integer. n Proof. 0 ≤ k ≤ n). random variables. Thus. In Lemma 19 we observed that. ξn ) has the same distribution as (ξn . = 1− √ 1 − s2 s x . . Here is an application of this: Theorem 41. Fix an integer n.i. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. n (37) 98 .e. Sn > 0 | Sn = x) = But the left-hand side is P0 (Sn −Sn−k > 0. (36) 2 Lastly.i. 1 ≤ k ≤ n | Sn = x) = P0 (Sk < x. ξn−1 . . Theorem 40. a sum of i. (Sk . each distributed as T1 .d. . .

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of sn in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.

33

Amount of time that a SSRW is positive*

Consider a SSRW (Sn ) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn ) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+ (2n) be the number of sides of the polygonal line that are above the horizontal axis and T− (2n) = 2n − T+ (2n) the number of sides that are below. Clearly, T± (2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+ (2n) as the number of leads in 2n coin tosses, where we

0

T 0’

′ Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0 ,

it behaves again like a SSRW and is independent of the past.

have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads=tails). Most people take this statement and translate it into a statement about T+ (2n) arguing that P (T+ (2n) = m) is maximised when m = n, i.e. when T+ (2n) = T− (2n). This is wrong. In fact, as the theorem below shows, the most likely value of P (T+ (2n) = m) is when m = 0 or m = 2n. Theorem 42 ( distribution of the total time the random walk is positive ). P (T+ (2n) = 2k) = P (S2k = 0)P (S2n−2k = 0) , 0 ≤ k ≤ n.

′ Proof. The idea is to condition at the first time T0 that the RW returns to 0: n ′ P (T+ (2n) = 2k, T0 = 2r). r=1

P (T+ (2n) = 2k) =

99

′ Between 0 and T0 the random walk is either above the horizontal axis or below; call the first event A+ and the second A− and write n ′ ′ P (A+ , T0 = 2r)P (T+ (2n) = 2k | A+ , T0 = 2r) n ′ ′ P (A− , T0 = 2r)P (T+ (2n) = 2k | A− , T0 = 2r)

P (T+ (2n) = 2k) =
r=1

+
r=1

In the first case,
′ ′ P (T+ (2n) = 2k | A+ , T0 = 2r) = P (T+ (2n) = 2k−2r | A+ , T0 = 2r) = P (T+ (2n−2r) = 2k−2r), ′ because, if T0 = 2r and A+ occurs then the RW has already spent 2r units of time above the axis and so T+ (2n) is reduced by 2r. Furthermore, we may remove the conditioning ′ because the future evolution of the RW after T0 is independent of the past. In the second case, a similar argument shows ′ ′ P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n) = 2k | A− , T0 = 2r) = P (T− (2n − 2r) = 2k).

We also observe that, by symmetry, 1 ′ ′ ′ P (A+ , T0 = 2r) = P (A− , T0 = 2r) = P (T0 = 2r) = f2r . 2 Letting p2k,2n = P (T+ (2n) = 2k), we have shown that p2k,2n = 1 2
k

f2r p2k−2r,2n−2r +
r=1

1 2

n−k

f2r p2k,2n−2r
r=1

We want to show that p2k,2n = u2k u2n−2k , where u2k = P (S2k = 0). With this hypothesis, we can start playing with the recursion of the display above and using induction on n we will realise that p2k,2n 1 = u2n−2k 2
k r=1

1 f2r u2k−2r + u2k 2

n−k

f2r u2n−2k−2r .
r=1

Explicitly, we have P (T+ (2n) = 2k) = 2k k 2n − 2k −2n 2 , n−k 0 ≤ k ≤ n. (38)

With 2n = 100, we plot the function P (T100 = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

100

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0

10

20 k

30

40

50

The arcsine law*
Use of Stirling’s approximation in formula (38) gives P (T+ (2n) = 2k) ∼ 1/π k(n − k) , as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability: P T+ (2n) ∈ [a, b] 2n =
na≤k≤nb

P

T+ (2n) k = 2n n 1/π k(n − k) =
a≤k/n≤b

a≤k/n≤b

1 . k k (n − n ) n n

1/π

This is easily seen to have a limit as n → ∞, the limit being
b

1/π t(1 − t)

dt,

a

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law: x √ T+ (2n) 2 1/π lim P dt = arcsin x. ≤x = n→∞ 2n π t(1 − t) 0 We thus have that the limiting distribution of T+ (2n) (the fraction of time that the RW is √ 2 positive), as n → ∞, is the distribution function π arcsin x, 0 < x < 1. This is known as the arcsine law. This distribution function has density √ 1/π , 0 < x < 1, the plot of
x(1−x)

which is amazingly close to the plot of the figure above. Now consider again a SSRW and consider the hitting times (Tx , x ≥ 0). This is another random walk because the summands in Tx = x (Ty − Ty−1 ) are i.i.d. all distributed as y=1 T1 : 2n −2n 1 P (T1 ≥ 2n) = P (S2n = 0) = 2 ∼√ . n πn 101

too. the random 1 2 variables ξn . west. Consider a change of coordinates: rotate the axes by 45o (and scale them). We have 1. y ∈ Z. ξn are not independent simply because if one takes value 0 the other is nonzero. n ∈ Z+ ) is a random walk. 0)) = P (ξn = (−1. We describe this motion as a random walk in Z2 = Z × Z: Let ξ1 . 1)) + P (ξn = (0. It is not a simple random walk because there is a possibility that the increment takes the value 0. Sn = i=1 n ξi . . . 2 Same argument applies to (Sn . When he reaches a corner he moves again in the same manner. ξ2 . Indeed. 0)) = 1/4.. However. Since ξ1 . 34 Simple symmetric random walk in two dimensions Consider a drunkard in a city whose roads are arranged in a rectangular grid. . 0) and. north or south.e. . x2 ) into (x1 + x2 . Many other quantities can be approximated in a similar manner. the latter being functions of the former. . so are ξ1 . north or south. So Sn = (Sn . y) where x. (Sn . are independent. . 1 Hence (Sn . west. x1 − x2 ). They are also identically distributed with 1 P (ξn = 0) = P (ξn = (0. random variables and hence a random walk. (Sn . 2 1 If x ∈ Z2 . . ξ2 . . The two random walks are not independent. He starts at (0.i. he moves east. 0). n ∈ Z+ ) is a sum of i. i. i=1 ξi i=1 ξi 2 1 Lemma 20.d. n ≥ 1. The corners are described by coordinates (x. 1 1 Proof. 0)) = 1/4 1 P (ξn = −1) = P (ξn = (−1. Sn − Sn ). i. is a random walk. . we let x1 be its horizontal coordinate and x2 its vertical. ξ2 . the two stochastic processes are not independent.From this. 0)) = 1/4. Sn ) = n n 2 . We have: 102 .d. map (x1 . 1)) = P (ξn = (−1. random vectors such that P (ξn = (0. n ∈ Z+ ) showing that it. being totally drunk. we immediately get that the expectation of T1 is infinity: ET1 = ∞.i. Rn ) := (Sn + Sn . 0)) = P (ξn = (1. n ∈ Z+ ) is also a random walk. We then let S0 = (0. So let 1 2 1 2 1 2 Rn = (Rn . for each n. with equal probability (1/4). going at random east. 1 P (ξn = 1) = P (ξn = (1. −1)) = 1/2.

So Rn = n ηn where ηn is i=1 i=1 i=1 1 + ξ 2 . as n → ∞. in the sense that the ratio of the two sides goes to 1. But (Rn ) 1 2 is a simple symmetric random walk. So P (R2n = 0) = 2n 2−2n . b−a −a . that for a simple symmetric random walk (started from S0 = 0). n ∈ Z+ ). = 1)) = 1/4. being obtained the vector ηn = (ξn 1 2 n n n by taking the same function on the elements of the sequence ξ1 . 0)) = 1/4 1 2 P (ηn = −1. +1}. Lemma 22. (Rn + 2 2 1 2 1 1 we check that ηn = ξn + ξn . R2n = 0) = P (R2n = 0)(R2n = 0). 1)) = 1/4 1 2 P (ηn = −1. is i. 2n n 2 2−4n . . from Theorem 7. ξ 1 − ξ 2 ).d. Rn = n (ξi − ξi ). n ∈ Z ) is a random walk in one dimension. ηn = 1) = P (ξn = (1. The same is true for (Rn ). and by Stirling’s approximation. We have of each of the ηn n 1 2 P (ηn = 1. (Rn + independent. n ∈ Z+ ) is a random 2 . 1 2 1 2 P (S2n = 0) = P (R2n = 0) = P (R2n = 0. We have Rn = n (ξi + ξi ). . The last two are walk in one dimension. ξ2 . 0)) = 1/4 1 2 P (ηn = 1. Hence P (ηn = 1) = P (ηn = −1) = 1/2. The two-dimensional simple symmetric random walk is recurrent. P (Tb < Ta ) = We can read this as follows: P (STa ∧Tb = b) = −a . (Rn . η . η 2 are ±1. . b−a 103 . n Corollary 24. . 35 Skorokhod embedding* We take a closer look at the gambler’s ruin problem for a simple symmetric random walk in one dimension. (Rn . The possible values 1 .. ηn = 1) = P (ξn = (0. ηn = −1) = P (ξn = (−1. b−a P (Ta < Tb ) = b . b−a P (STa ∧Tb = a) = b . Proof. ηn = −1) = P (ξn = (0. But by the previous n=0 c lemma.1 Lemma 21. 2 . 2 1 2 2 1 1 Proof.i. n ∈ Z+ ) is a random walk in two dimensions. n ∈ Z+ ) 1 is a random walk in two dimensions. ηn = ξn − ξn are independent for each n. n ∈ Z ) is a random walk in one dimension. Hence (Rn . where c is a positive constant. Let a < 0 < b. 1 1 2 2 Hence P (ηn = 1) = P (ηn = −1) = 1/2. Similar argument shows that each of (Rn . ηn = ε′ ) = P (ηn = ε)P (ηn = ε′ ) for all choices of ε. . By Lemma 5. Recall. P (S2n = 0) = Proof. And so 2 1 2 1 P (ηn = ε. we need to show that ∞ P (Sn = 0) = ∞. ε′ ∈ {−1. 1 where the last equality follows from the independence of the two random walks. The sequence of vectors η . P (S2n = 0) ∼ n . . To show that the last two are independent.

B = j) = 1.g. Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this. Note. 1 − p). such that p is rational. Then Z = 0 if and only if A = B = 0 and this has 104 . b = N . B = 0) + P (A = i. i < 0 < j. a < 0 < b such that pa + (1 − p)b = 0. Indeed. choose. a = M − N . Then there exists a stopping time T such that P (ST = i) = pi . if p = M/N . we have this common value: i<0 (−i)pi = and. b]. i ∈ Z. How about more complicated distributions? Let suppose we are given a probability distribution (pi .This last display tells us the distribution of STa ∧Tb . e. b. Let (Sn ) be a simple symmetric random walk and (pi . we can always find integers a. M < N . Define a pair of random variables (A. and the probability it exits from the right point is 1 − p. in particular. we consider the random variable Z = STA ∧TB . Proof. we get that c is precisely ipi . B) are independent of (Sn ). i<0 j>0 This leads to: c−1 j>0 jpj i<0 pi + c−1 i<0 (−i)pi j>0 i>0 ipi pj = 1 − p0 Since i∈Z ipi = 0. using this. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a. N ∈ N. To see this. 0)} ∪ (N × N) with the following distribution: P (A = i. i ∈ Z) a given probability distribution on Z with zero mean: i∈Z ipi = 0. suppose first k = 0. B) taking values in {(0. M. P (A = 0. c= i<0 (−i)pi = i>0 Letting Ti = inf{n ≥ 0 : Sn = i}. Theorem 43. and assuming that (A. Conversely. i ∈ Z). The probability it exits it from the left point is p. B = 0) = p0 . B = j) = c−1 (j − i)pi pj . k ∈ Z. The claim is that P (Z = k) = pk . i ∈ Z. where c−1 is chosen by normalisation: P (A = 0. given a 2-point probability distribution (p. that ESTa ∧Tb = 0.

by definition. We have P (Z = k) = P (STA ∧TB = k) = i<0 j>0 P (STi ∧Tj = k)P (A = i. 105 . Suppose next k > 0.probability p0 . B = j) P (STi ∧Tk = k)P (A = i. B = k) i<0 = = i<0 −i −1 c (k − i)pi pk k−i (−i)pi pk = pk . P (Z = k) = pk for k < 0. i<0 = c−1 Similarly.

−3. 3. So we work faster. 38) = gcd(6. · · · }. if we also invoke Euclid’s algorithm (i. −2. Now suppose that c | a and c | b − a. So (i) holds. c | b. 2. let d = gcd(a. Indeed 4 × 38 > 120. 38) = gcd(6. But in the first line we subtracted 38 exactly 3 times. b − a). b. We will show that d = gcd(a. 2. replace b − a by b − 2a. Indeed. Similarly. Zero is denoted by 0. we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. 4. 106 . For instance. leading to d1 = d2 . 14) = gcd(6. Then c | a + (b − a). 32) = gcd(6. i.PART III: APPENDICES This section contains some sine qua non elements from standard Mathematics. we have c | d. But 3 is the maximum number of times that 38 “fits” into 120. . with the second line. we can define their greatest common divisor d = gcd(a. (That it is unique is obvious from the fact that if d1 . the process of long division that one learns in elementary school). 0. 2) = gcd(4. Now. with b = 0. 38) = gcd(44. If b − a > a. d2 are two such numbers then d1 would divide d2 and vice-versa. This gives us an algorithm for finding the gcd between two numbers: Suppose 0 < a < b are integers. 2) = 2. they merely serve as reminders. with at least one of them nonzero. their negatives and zero. −1. Arithmetic Arithmetic deals with integers. Given integers a. d | b.) Notice that: gcd(a. b − a). . Therefore d | b − a. In other words. Keep doing this until you obtain b − qa < a. So (ii) holds. If c | b and b | a then c | a. gcd(120. as taught in high schools or universities.e.}. The set Z of integers consists of N. and we’re done. and d = gcd(a. b). Replace b by b − a. . then d | c. if a. They are not supposed to replace textbooks or basic courses. b) as the unique positive integer such that (i) d | a. 3. 2) = gcd(2. 5. But d | a and d | b. 8) = gcd(6. 26) = gcd(6. b). (ii) if c | a and c | b. If a | b and b | a then a = b. 1. 20) = gcd(6. We write this by b | a.e. b) = gcd(a. Then interchange the roles of the numbers and continue. Because c | a and c | b. b are arbitrary integers. 6. The set N of natural numbers is the set of positive integers: N = {1. we say that b divides a if there is an integer k such that a = kb. 38) = gcd(82. Z = {· · · .

in particular. y ∈ Z. This shows (ii) of the definition of a greatest common divisor. 1. Thus. gcd(38. it divides d. with x = 19. to divide 3450 with 26 we do 132 —— 3450 26 — 85 78 — 70 52 — 18 26 | and obtain q = 26. r = 18. we have 2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120. a. b) then there are integers x. for integers a. i. 107 . To see this. b. Its proof is obvious. let d be the right hand side. Then c divides any linear combination of a. Working backwards. To see why this is true. Here are some other facts about the greatest common divisor: First. For example. y ∈ Z. b) = min{xa + yb : x.The theorem of division says the following: Given two positive integers b. Then d = xa + yb. .e. b ∈ Z. we used the divisions 120 = 3 × 38 + 6 38 = 6 × 6 + 2. not both zero. . xa + yb > 0}. In the process. The long division algorithm is a process for finding the quotient q and the remainder r. we can find an integer q and an integer r ∈ {0. y such that d = xa + yb. 2) = 2. . The same logic works in general. with a > b. a − 1}. Second. we have the fact that if d = gcd(a. . 38) = gcd(6. y = 6. 120) = 38x + 120y. let us follow the previous example again: gcd(120. for some x. We need to show (i). 38) = gcd(6. we can show that. such that a = qb + r. we have gcd(a. Suppose that c | a and c | b.

A relation is called reflexive if it contains pairs (x. a relation can be thought of as a set of pairs of elements of the set A. A relation which possesses all the three properties is called equivalence relation. the greatest common divisor of any subset of the integers also makes sense.e. A relation is called symmetric if it cannot contain a pair (x. B be finite sets with n. The equality relation = is reflexive.e. y) such that x is smaller than y. we can divide b by d to find b = qd + r. and this is a contradiction. y) and (y. Since d > 0. respectively. A function is called one-to-one (or injective) if two different x1 . x). i. then the relation “x is a friend of y” is symmetric. that d does not divide b.g. it can be thought as reflexive (unless we do not want to accept the fact that one may not be afriend of oneself). But then b = q(xa + yb) + r. for some 1 ≤ r ≤ d − 1. m elements. j (i communicates with j) in Markov chain Functions We a function f from a set X into a set Y is denoted by f : X → Y . A function is called onto (or surjective) if its range is Y . f (x2 ). so r = 0. For example. The relation < is not reflexive. x). The range of f is the set of its values. Counting Let A. The greatest common divisor of 3 integers can now be defined in a similar manner. A relation is transitive if when it contains (x. In general. x2 ∈ X have different values f (x1 ). z) it also contains (x. say. i. yielding r = (−qx)a + (1 − qy)b. the set {f (x) : x ∈ X}. z). that d does not divide both numbers. but is not necessarily transitive. The relation < is not symmetric. the relation < between real numbers can be thought of as consisting of pairs of numbers (x. e. Suppose this is not true. So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder). Relations A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. The set of functions from X to Y is denoted by Y X . Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i theory. so d divides both a and b. If B ⊂ Y then f −1 (B) consists of all x ∈ X with f (x) ∈ B.. The equality relation is symmetric. y) without contain its symmetric (y. Finally. Both < and = are transitive. So d is not minimum as claimed. 108 .that d | a and d | b. If we take A to be the set of students in a classroom.

The first element can be chosen in n ways. the cardinality of the set B A ) is mn . and this follows from the binomial theorem. the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen.) So the first animal can be assigned any of the m available names. Clearly. 109 . k! k where the symbol equation. we need more names than animals: n ≤ m. The total number of subsets of A (regardless of their number of elements) is thus equal to n k=0 n k = (1 + 1)n = 2n . Each assignment of this type is a one-to-one function from A to B. If A has n elements. n! is the total number of permutations of a set of n elements.The number of functions from A to B (i. Since the name of the first animal can be chosen in m ways. If we don’t then we divide by the number k! of permutations of the k elements concluding that the number of subsets with k elements of a set with n elements equals (n)k n =: . the k-th in n − k + 1 ways. in other words. and so on. In the special case m = n. A function is then an assignment of a name to each animal. n! stands for the product of the first n natural numbers: n! = 1 × 2 × 3 × · · · × n. and the third. the second in n − 1. 1}n of sequences of length n with elements 0 or 1. Thus. we have = 1. Hence the total number of assignments is m × m × ··· × m n times Suppose now we are not allowed to give the same name to different animals. To be able to do this. n k is merely a name for the number computed on the left side of this n k = n n−k n 0 = n! . and so on. we may think of the elements of A as animals and those of B as names. we denote (n)n also by n!. and so on.e. Any set A contains several subsets. that of the second in m − 1 ways. We thus observe that there are as many subsets in A as number of elements in the set {0. So we have (n)k = n(n − 1)(n − 2) · · · (n − k + 1) ways to pick k elements if we care about their order. Indeed. the total number of one-to-one functions from A to B is (m)n := m(m − 1)(m − 2) · · · (m − n + 1). k!(n − k)! Since there is only one subset with no elements (it is the empty set ∅). so does the second. a one-to-one function from the set A into A is called a permutation of A. (The same name may be used for different animals. Incidentally.

The notation extends to more than one numbers. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c). 110 . + k−1 k Indeed. b). in choosing k animals from a set of n + 1 ones. Thus a ∧ b = a if a ≤ b or = b if b ≤ a. b is denoted by a ∧ b = min(a. and we don’t care to put parantheses because the order of finding a maximum is irrelevant. we let The Pascal triangle relation states that n+1 k = = 0. 0). The maximum is denoted by a ∨ b = max(a. n n . and a− is called the negative part of a. In particular. 0). k Also. b). consider a specific animal. For example. Both quantities are nonnegative. a∨b∨c refers to the maximum of the three numbers a. we choose k animals from the remaining n ones and this can be done in n ways. if x is a real number. m! n y If y is not an integer then. Maximum and minimum The minimum of two numbers a. by convention. by convention. a− = − min(a. b. then there are k − 1 remaining k n animals to be chosen and this can be done in k−1 ways. we define x m = x(x − 1) · · · (x − m + 1) . The absolute value of a is defined by |a| = a+ + a− . Note that it is always true that a = a+ − a− . say. If the pig is not included in the sample.We shall let n = 0 if n < k. we use the following notations: a+ = max(a. Since the sentences “the pig is included” and “the pig is not included” cannot be simultaneously true. If the pig is included. the pig. a+ is called the positive part of a. we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals. c.

. This is very useful notation because conjunction of clauses is transformed into multiplication. It takes value 1 with probability P (A) and 0 with probability P (Ac ). A2 . if I go to a bank N times and Ai is the clause “I am robbed the i-th time” then N n=1 1Ai is the total number of times I am robbed. . Therefore its expectation is E 1A = 1 × P (A) + 0 × P (Ac ) = P (A). . 2. we have E[1A∪B ] = E[1A ] + E[1B ] − E[1A∩B ]. . if A1 . Another example: Let X be a nonnegative random variable with values in N = {1.Indicator function stands for the indicator function of a set A. An example of its application: Since 1A∪B = 1 − 1(A∪B)c = 1 − 1Ac ∩Bc = 1 − 1Ac 1Bc = 1 − (1 − 1A )(1 − 1B ) = 1A + 1B − 1A 1B . meaning a function that takes value 1 on A and 0 on the complement Ac of A. A linear function ϕ from Rn to Rm is such that ϕ(αx + βy) = αϕ(x) + βϕ(y).}. It is easy to see that 1A 1A∩B = 1A 1B . are sets or clauses then the number n≥1 1An is the number true clauses. Then X X= n=1 1 (the sum of X ones is X). It is worth remembering that. = 1A + 1B − 1A∩B . If A is an event then 1A is a random variable. We also use the notation 1(clause) for a number that is 1 whenever the clause is true or 0 otherwise. For instance. Thus. P (X ≥ n). . So X= n≥1 1(X ≥ n). 1 − 1A = 1 A c . Several quantities of interest are expressed in terms of indicators. . 111 . which means P (A ∪ B) = P (A) + P (B) − P (A ∩ B). So E[X] = n≥1 E 1(X ≥ n) = n≥1 Matrices Let Rd be the set of vectors with d components.

 is a column representation of y := ϕ(x) with respect to the given basis . . . we have ϕ(x) = m n vi i=1 j=1 aij xj . . x1 . Combining. un be a basis for Rn . The matrix A of ϕ with respect to the choice of the two basis is defined by   a11 · · · a1n A :=  · · · · · · · · ·  am1 · · · amn  y1  Then ϕ ◦ ψ is a linear function: Rn − → Rℓ − 112 ϕ◦ψ . ym v1 . . x  . xn n   1 y := ϕ(x) = j=1 xj ϕ(uj ).for all x. .  = ··· ··· ···  . . . . Suppose that ψ. . But the ϕ(uj ) are vectors in Rm . m. . Then any x ∈ Rn can be uniquely written as n x= j=1 x j uj . Then. . vm . . . . . Let u1 . because ϕ is linear. So if we choose a basis v1 . xn the given basis. . . y ∈ Rn .   . . β ∈ R. and all α. .  The column  .  is a column representation of x with respect to . vm in Rn we can express each ϕ(uj ) as a linear combination of the basis vectors: m ϕ(uj ) = i=1 aij vi . . ϕ are linear functions: It is a m × n matrix. j=1 i = 1. . The relation between the column representation of y and the column representation of x is written as  1    1 x y a11 · · · a1n  .  where ∈ R. am1 · · · amn xn ym Rn − Rm − Rℓ → → ψ ϕ  . The column  .    . Let y i := n aij xj .  .

. . . x2 . . is a sequence of numbers then inf{x1 . xn+2 . Limsup and liminf If x1 . . e2 =  . 1/2. ei = 1(i = j). for any ε > 0. 1 ≤ j ≤ n. .  . and AB is a ℓ × n matrix. inf I is characterised by: inf I ≤ x for all x ∈ I and. . . we define the supremum sup I to be the minimum of all upper bounds. i. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists. there is x ∈ I with x ≥ ε + inf I.Fix three sets of basis on Rn . with respect to these bases. a square matrix represents a linear transformation from Rn to Rn . So lim sup xn = limn→∞ inf{xn . where m (AB)ij = k=1 Aik Bkj . If I has no lower bound then inf I = −∞. . . If I has no upper bound then sup I = +∞. 1 ≤ i ≤ ℓ. we define lim sup xn . . ed =  .} ≤ · · · and. 113 . and it is this that we called infimum. defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. sup ∅ = −∞.  . the choice of the basis is the so-called standard basis. The standard basis in Rd consists of the vectors       1 0 0 0 1 0       e 1 =  . . Supremum and infimum When we have a set of numbers I (for instance. . since any increasing sequence has a limit (possibly +∞) we take this limit and call it lim sup xn (limit superior) of the sequence. ψ. Usually. sup I is characterised by: sup I ≥ x for all x ∈ I and. j So A is a ℓ × m matrix. . a2 .} has no minimum. . We note that lim inf(−xn ) = − lim sup xn . it is its matrix representation with respect to the fixed basis. B be the matrix representations of ϕ. . In these cases we use the notation inf I. . respectively.  . of numbers) then the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. x4 . we define max I. B is a m × n matrix. . A matrix A is called square if it is a n × n matrix. 1/3. 0 0 1 In other words. there is x ∈ I with x ≤ ε + inf I. . If I is empty then inf ∅ = +∞. . Similarly. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ. x3 . xn+1 . If we fix a basis in Rn .} ≤ inf{x3 . Likewise. . .} ≤ inf{x2 . Similarly. x2 . Then the matrix representation of ϕ ◦ ψ is AB.e. . for any ε > 0. . . · · · }. . Rm and Rℓ and let A. This minimum (or maximum) may not exist. standing for the “infimum” of I. . a sequence a1 . For instance I = {1. .

B in the definition. Indeed. . (That is why Probability Theory is very rich. if he events A1 . In contrast. B2 . follows from this axiom.) If A. we work a lot with conditional probabilities. This follows from the logical statement X 1(X > x) ≤ X. then P  n≥1 An  = P (An ). Of course. Therefore. if An are events with P (An ) = 1 then P (∩n An ) = 1. are increasing with A = ∪n An then P (A) = limn→∞ P (An ). namely:   Everything else. and everywhere else. but not too much. If A1 . Quite often.e. If the events A. . Linear Algebra that has many axioms is more ‘restrictive’. are mutually disjoint events. derive formulae. 11 114 . n≥1 Usually. . we get P (A ∩ B) = P (A)P (B|A) = P (B)P (A|B) and this is called Bayes’ theorem. by not requiring this. which holds since X is positive. i. to compute P (A). This can be generalised to any finite number of events. showing that P (A) ≤ P (B) is more of a logical problem than a numerical one: we need to show that A implies B (i. . . P (∪n An ) ≤ Similarly. . For example. to show that EX ≤ EY . for (besides P (Ω) = 1) there is essentially one axiom . . This is the inclusion-exclusion formula for two events. in these notes. Probability Theory is exceptionally simple in terms of its axioms.e. to compute P (A ∩ B) we read the definition as P (A ∩ B) = P (A)P (B|A). I do sometimes “cheat” from a strict mathematical point of view. the function that takes value 1 on A and 0 on Ac . . Recall that P (B|A) is defined by P (A ∩ B)/P (A) and is a very motivated definition. Markov’s inequality for a positive random variable X states that P (X > x) ≤ EX/x. If An are events with P (An ) = 0 then P (∪n An ) = 0. n If the events A1 . and apply them. This is generalisable for a number of events: P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) If we interchange the roles of A. n P (An ). E 1A = P (A). B are disjoint events then P (A ∪ B) = P (A) + P (B). and 1 Ac = 1 − 1 A . . But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems. .. we often need to argue that X ≤ Y . Similarly. are decreasing with A = ∩n An then P (A) = limn→∞ P (An ). such that ∪n Bn = Ω. A is a subset of B). Similarly. A2 . If A is an event then 1A or 1(A) is the indicator random variable.. . Often. That is why Probability Theory is very rich. in which case we write P (A) = n P (A ∩ Bn ) = P (A|Bn )P (Bn ).Probability Theory I do not require11 knowledge of axiomatic Probability Theory (with Measure Theory). A2 . B are not disjoint then P (A ∪ B) = P (A) + P (B) − P (A ∩ B). . In Markov chain theory. A2 . Linear Algebra that has many axioms is more ‘restrictive’. Note that 1A 1B = 1A∩B . In contrast. it is useful to find disjoint events B1 .

If it exists for some s1 then it exists uniformly for all s with |s| < |s1 | (and this needs a proof. . EX = x xP (X = x). 0). Suppose now that X is a random variable with values in Z+ ∪ {+∞}. . 1. .e.An application of this is: P (X < ∞) = lim P (X ≤ x). . is an infinite polynomial having the ai as coefficients: G(s) = a0 + a1 s + a2 s2 + · · · . This is due to the fact that n=1 P (X ≥ n) = E 1(X ≥ n). n ≥ 0 is ∞ ρn s n = n=0 1 . For a random variable that takes values in N we can write EX = ∞ P (X ≥ n). n = 0. . Generating functions A generating function of a sequence a0 . 1 − ρs (39) Note that |G(s)| ≤ n |an ||s|n and this is ≤ n |an | if s ∈ [−1. the formula reduces to the known one. making it a useful object when the a0 . obviously. This definition makes sense only if EX + < ∞ or if EX − < ∞. . We do not a priori know that this exists for any s other than. . we have EX = −∞ xf (x)dx. is a probability distribution on Z+ = {0.} because then n an = 1. 2. is called probability generating function of X: ∞ as long as |ρs| < 1. |s| < 1/|ρ|. In case the random variable is discrete. taught in calculus courses). a1 . In case the random variable is ∞ absoultely continuous with density f . a1 . which is motivated from the algebraic identity X = X + − X − . X − = − min(X. ϕ(s) = EsX = n=0 sn P (X = n) 115 . As an example. 0). So if n an < ∞ then G(s) exists for all s ∈ [−1. x→∞ In Markov chain theory. 1].. 1. i. so it is useful to keep this in mind. which are both nonnegative and then define EX = EX + − EX − . . for s = 0 for which G(0) = 0. . viz. . The formula holds for any positive random variable. The generating function of the sequence pn = P (X = n). The expectation of a nonnegative random variable X may be defined by ∞ EX = 0 P (X > x)dx. we define X + = max(X. the generating function of the geometric sequence an = ρn . 1]. we encounter situations where P (X = ∞) may be positive. For a random variable which takes positive and negative values. . .

since |s| < 1. Generating functions are useful for solving linear recursions. 116 . Also. possibly. s→1 something that can be formally seen from d d EsX = E sX = EXsX−1 . we have s∞ = 0. n ≥ 0) is n≥0 xn+1 sn = s−1 (X(s) − x0 ) and the generating function of (xn+2 . Since the term s∞ p∞ could formally appear in the sum defining ϕ(s). and this is why ϕ is called probability generating function. while ∞ ϕ(1) = n=0 P (X = n) = P (X < +∞). n ≥ 0) is n≥0 xn+2 sn = s−2 (X(s) − x0 − sx1 ). xn sn be the generating of (xn . Note that ϕ(0) = P (X = 0). n ≥ 0) then the generating function of (xn+1 . We can recover the probabilities pn from the formula pn = ϕ(n) (0) . As an example. with x0 = x1 = 1. but. take the value +∞. Note that ∞ n=0 pn ≤ 1. ds ds and by letting s = 1. n=0 We do allow X to. we can recover the expectation of X from EX = lim ϕ′ (s).This is defined for all −1 < s < 1. ∞ and p∞ := P (X = ∞) = 1 − pn . consider the Fibonacci sequence defined by the recursion xn+2 = xn+1 + xn . If we let X(s) = n≥0 n ≥ 0. n! as follows from Taylor’s theorem.

then the so-called distribution function. +s−1 But the polynomial s2 + s − 1 has two roots: √ √ a = ( 5 − 1)/2.From the recursion xn+2 = xn+1 + xn we have (and this is were linearity is used) that the generating function of (xn+2 . s−2 (X(s) − x0 − sx1 ) = s−1 (X(s) − x0 ) + X(s). Hence X(s) is the generating function of 1 xn = ((1/b)n+1 − (1/a)n+1 ). this number is always an integer (it has to be!) Distribution or law of a random object When we speak of the “distribution” or “law” of a random variable X with values in Z or R or Rn or even in a much larger space. 117 . x ∈ R. Hence s2 + s − 1 = (s − a)(s − b). n ≥ 0) equals the sum of the generating functions of (xn+1 . namely. we can also write this as xn = (−a)n+1 − (−b)n+1 . we speak of some function which enables us to specify the probabilities of the form P (X ∈ B) where B is a set of values of X. n ≥ 0). n ≥ 0) and (xn . the n-th term of the Fibonacci sequence.e. if X is a random variable with values in R. i. Thus (1/ρ)−s is the 1 generating function of ρn+1 . the function P (X ≤ x). b−a Noting that ab = −1. is such an example. and so X(s) = 1 1 1 1 1 1 = . we can solve for X(s) and find X(s) = s2 −1 . This tells us that s−b is the generating function of (1/b)n+1 1 and that s−a is the generating function of (1/a)n+1 . − − a−b s−b s−a b−a b−s a−s 1/ρ 1 1 But (39) tells us that ρn has generating function 1−ρs = (1/ρ)−s . i. So. Since x0 = x1 = 1.e. for example. Despite the square roots in the formula. b = −( 5 + 1)/2. b−a is the solution to the recursion. But I could very well use the function P (X > x).

in 2n fair coin tosses 118 . since the number is so big. it follows that n 1 log x dx = n log n − n + 1 ≈ n log n − n. we encountered the concept of an excursion of a Markov chain from a state i. it’s better to use logarithms: n n! = e log n = exp k=1 log k. n ∈ Z+ . Now.or the 2-parameter function P (a < X < b). I could use its density f (x) = d P (X ≤ x). n Hence n k=1 log k ≈ log x dx. If X is absolutely continuous. (b) The approximation is not good when we consider ratios of products. to compute a large sum we can. But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. instead. Large products: Stirling’s formula If we want to compute n! = 1 × 2 × · · · × n for large n then. for example they could take values which are themselves functions. Warning: the verb “specify” should not be interpreted as “compute” in the sense of existence of an analytical formula or algorithm. but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier. We frequently encounter random objects which belong to larger spaces. Even the generating function EsX of a Z+ -valued random variable X is an example. n! ≈ exp(n log n − n) = nn e−n . For example (try it!) the rough Stirling approximation give that the probability that. for it does allow us to specify the probabilities P (X = n). dx All these objects can be referred to as “distribution” or “law” of X. Since log x does not vary too much between two successive positive integers k and k + 1 it follows that k+1 k log x dx ≈ log k. It turns out to be a good one. compute an integral and obtain an approximation. Specifying the distribution of such an object may be hard. Such an excursion is a random object which should be thought of as a random function. For instance. Hence we have We call this the rough Stirling approximation. 1 Since d dx (x log x − x) = log x.

if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x). for all x. And if y < x then 1(y < x) = 0 and the display reads g(y) ≥ 0. Surely. To remedy this. and this is terrible. Consider an increasing and nonnegative function g : R → R+ . y ∈ R g(y) ≥ g(x)1(y ≥ x). and so. you have seen it before. We can case both properties of g neatly by writing For all x. defined in (34). where λ = P (X ≥ 1). expressing non-negativity. Let X be a random variable. expressing monotonicity. then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from X. by taking expectations of both sides (expectation is an increasing operator). as n → ∞. 119 . If we think of X as the time of occurrence of a random phenomenon. n ∈ Z+ . The geometric distribution We say that the random variable X with values in Z+ is geometric if it has the memoryless property: P (X − n ≥ m | X ≥ n) = P (X ≥ m). This completely trivial thought gives the very useful Markov’s inequality. Indeed. 0 ≤ n ≤ m. We have already seen a geometric random variable. Then. we shall prove Strirling’s approximation: √ n! ∼ nn e−n 2πn where the symbol an ∼ bn means that an /bn → 0. and is based on the concept of monotonicity. Eg(X) ≥ g(x)P (X ≥ x) . Markov’s inequality Arguably this is the most useful inequality in Probability. a geometric function of k.we have exactly n heads equals 2n 2−2n ≈ 1. That was the random variable J0 = ∞ n=1 1(Sn = 0). because we know that the n probability goes to zero as n → ∞. g(X) ≥ g(x)1(X ≥ x). The same process shows that a geometric random variable has a distribution of the form P (X ≥ n) = λn . We baptise this geometric(λ) distribution. so this little section is a reminder. In fact it was shown there that P (J0 ≥ k) = P (J0 ≥ 1)k . It was shown in Theorem 38 that J0 satisfies the memoryless property.

edu/~chance/teaching_aids/books_articles/ /probability_book/amsbook.mac. Cambridge University Press. 5. 3. Soc. Math.pdf 6.Bibliography ´ 1. Bremaud P (1997) An Introduction to Probabilistic Modeling. W (1981) Topics in Ergodic Theory. W (1968) An Introduction to Probability Theory and its Applications. Parry. Springer. K L and Aitsahlia F (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). J R (1998) Markov Chains. Chung. Norris. Also available at: http://www. Cambridge University Press. 4. Springer. Bremaud P (1999) Markov Chains. 120 . Amer. C M and Snell. J L (1997) Introduction to Probability. Wiley. ´ 2. Feller.dartmouth. Springer. Grinstead. 7.

119 matrix. 108 ergodic. 119 Google. 110 meeting time. 10 irreducible. 84 bank account. 16. 15. 15 Loynes’ method. 119 minimum. 51 essential state. 17 Aloha.Index absorbing. 7 closed. 115 geometric distribution. 17 communicates with. 106 ground state. 73 Markov chain. 48 memoryless property. 108 relation. 20 function of a Markov chain. 47 coupling inequality. 6 Markov property. 53 Fibonacci sequence. 68 aperiodic. 108 ruin probability. 70 greatest common divisor. 29 reflection. 110 mixing. 116 first-step analysis. 18 permutation. 18 dual. 117 division. 70 period. 115 random walk on a graph. 77 indicator. 61 detailed balance equations. 86 reflexive. 61 recurrent. 65 Catalan numbers. 26 running maximum. 1 equivalence classes. 34 probability flow. 16 convolution. 3 branching process. 65 generating function. 20 increments. 17 Ethernet. 109 positive recurrent. 6 Markov’s inequality. 35 Helly-Bray Lemma. 18 arcsine law. 77 coupling. 117 leads to. 44 degree. 82 Cesaro averages. 98 dynamical system. 7 invariant distribution. 58 digraph (directed graph). 16 communicating classes. 111 inessential state. 61 null recurrent. 53 hitting time. 34 PageRank. 45 Chapman-Kolmogorov equation. 15 distribution. 51 Monte Carlo. 48 cycle formula. 87 121 . 13 probability generating function. 81 reflection principle. 114 balance equations. 17 infimum. 68 Fatou’s lemma. 17 Kolmogorov’s loop criterion. 63 Galton-Watson process. 16 equivalence relation. 59 law. 113 initial distribution. 101 axiom of Probability Theory. 76 neighbours. 111 maximum. 2 nearest neighbour random walk. 10 ballot theorem.

62 Wright-Fisher model. 60 urn. 6 statespace. 77 skip-free property. 7 time-reversible. 8 transition probability matrix. 16. 108 time-homogeneous. 76 Skorokhod embedding. 119 strong Markov property. 62 subsequence of a Markov chain. 2 transition probability. 113 symmetric. 88 watching a chain in a set. 29 transition diagram:. 3. 27 storage chain. 58 transient.semigroup property. 24 Strirling’s approximation. 62 supremum. 13 stopping time. 108 tree. 7 simple random walk. 9 stationary distribution. 6 stationary. 27 subordination. 7. 73 strategy of the gambler. 71 122 . 10 steady-state. 76 simple symmetric random walk. 104 states. 16. 8 transitive.

Sign up to vote on this title
UsefulNot useful