Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1  Introduction
2  Examples of Markov chains
   2.1  A mouse in a cage
   2.2  Bank account
   2.3  Simple random walk (drunkard's walk)
   2.4  Simple random walk (drunkard's walk) in a city
   2.5  Actuarial chains
3  The Markov property
   3.1  Definition of the Markov property
   3.2  Transition probability and initial distribution
   3.3  Time-homogeneous chains
4  Stationarity
   4.1  Finding a stationary distribution
5  Topological structure
   5.1  The graph of a chain
   5.2  The relation "leads to"
   5.3  The relation "communicates with"
   5.4  Period
6  Hitting times and first-step analysis
7  Gambler's ruin
8  Stopping times and the strong Markov property
9  Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1 First hitting time decomposition
   10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1 Functions of excursions
   13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1 Definition of stability
   18.2 The fundamental stability theorem for Markov chains
   18.3 Coupling
   18.4 Proof of the fundamental stability theorem
   18.5 Terminology
   18.6 Periodic chains
19 Limiting theory for Markov chains*
   19.1 Limiting probabilities for null recurrent states
   19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1 Subsequences
   23.2 Watching a Markov chain when it visits a set
   23.3 Subordination
   23.4 Deterministic function of a Markov chain
24 Applications
   24.1 Branching processes: a population growth application
   24.2 The ALOHA protocol: a computer communications application
   24.3 PageRank (trademark of Google): A World Wide Web application
   24.4 The Wright-Fisher model: an example from biology
   24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1 Paths that start and end at specific points
   26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1 Distribution after n steps
   27.2 The ballot theorem
   27.3 Some conditional probabilities
   27.4 Some remarkable identities
   27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1 First hitting time analysis
   30.2 First return to the origin
   30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1 Asymptotic distribution of the maximum
   31.2 Return probabilities
   31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix.

A few starred sections should be considered "advanced" and can be omitted at first reading. I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come...

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

    xn+1 = yn
    yn+1 = rem(xn, yn),    n = 0, 1, 2, . . .

In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1). The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: mẍ = F. Here, x = x(t) denotes the position of the particle at time t; its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)), which has the property that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).
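As a quick illustration, here is a minimal sketch (in Python, nothing beyond the standard library assumed) of the arithmetic example: the state is the pair (xn, yn) and the map F drives it until yn = 0.

```python
def gcd_chain(a, b):
    """Run the dynamical system x_{n+1} = y_n, y_{n+1} = rem(x_n, y_n)."""
    x, y = max(a, b), min(a, b)
    while y != 0:
        x, y = y, x % y   # the map F applied to the state (x, y)
    return x              # when y hits 0, x is gcd(a, b)

print(gcd_chain(1071, 462))  # prints 21
```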

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic, or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness.

An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫ab f(x)dx of a complicated function f: suppose that f is positive and bounded above by B. Choose a pair (X0, Y0) of random numbers, uniformly, with a ≤ X0 ≤ b, 0 ≤ Y0 ≤ B. Count 1 if Y0 < f(X0) and 0 otherwise. Repeat with fresh pairs (Xn, Yn), for steps n = 1, 2, . . ., performing this a large number N of times. Count the number of 1's and divide by N. The resulting ratio, multiplied by the area B(b − a) of the bounding rectangle, is an approximation of ∫ab f(x)dx. Try this on the computer!

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: instead of deterministic objects, we will be dealing with random variables. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

¹ Stochastic means random.
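Taking up the invitation, here is a minimal hit-or-miss sketch of the method (standard library only; the test function and bounds are merely illustrative):

```python
import random

def monte_carlo_integral(f, a, b, B, N=100_000):
    """Hit-or-miss Monte Carlo: throw N uniform points in [a,b] x [0,B]
    and count the fraction falling under the graph of f."""
    hits = 0
    for _ in range(N):
        x = random.uniform(a, b)
        y = random.uniform(0, B)
        if y < f(x):
            hits += 1
    return (hits / N) * B * (b - a)  # rescale by the rectangle's area

# Example: the integral of x^2 over [0, 1] is 1/3.
print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))
```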

y = 0. to move from cell 1 to cell 2? 2. the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0. .01 Questions of interest: 1. . D1 .2 Bank account The amount of money in Mr Baxter’s bank account evolves. from month to month. . FS . (This is the mean of the binomial distribution with parameter α. px. S0 . Dn is the money he deposits. intuitive. 1.y := P (Xn+1 = y|Xn = x). S2 . and Sn is the money he wants to spend. . are random variables with distributions FD . How often is the mouse in room 1 ? Question 1 has an easy. 0}. D0 .y = P (Dn − Sn = y − x) = = z=0 ∞ z=0 ∞ P (Sn = z. How long does it take for the mouse. Xn is the money (in pounds) at the beginning of the month. are independent random variables. Sn .05 ≈ 20 minutes.95 0. as follows: Xn+1 = max{Xn + Dn − Sn . Assume that Dn . D2 . on the average.05 0. = z=0 ∞ = z=0 3 . he can’t. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is defined by: px. . Assume also that X0 . We easily see that if y > 0 then ∞ x. 2.) 2. So if he wishes to spend more than what is available. answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell.α 1−α 1 β 2 1−β Another way to summarise the information is by the 2×2 transition probability matrix P= 1−α α β 1−β = 0. S1 .99 0. Here. . Dn = z + y − x) P (Sn = z)P (Dn = z + y − x) FS (z)FD (z + y − x). respectively. Dn − Sn = y − x) P (Sn = z.

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

    Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, . . . and S0, S1, S2, . . . are random variables with distributions FD, FS, respectively, and that X0, D0, D1, . . ., S0, S1, . . . are independent random variables.

The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability, which is defined by:

    px,y := P(Xn+1 = y | Xn = x),    x, y ≥ 0.

We easily see that if y > 0 then

    px,y = P(Dn − Sn = y − x)
         = Σz≥0 P(Sn = z, Dn − Sn = y − x)
         = Σz≥0 P(Sn = z, Dn = z + y − x)
         = Σz≥0 P(Sn = z) P(Dn = z + y − x)
         = Σz≥0 FS(z) FD(z + y − x),

while, if y = 0, we have px,0 = 1 − Σy≥1 px,y, something that, in principle, can be computed from FS and FD. The transition probability matrix is

    P = ( p0,0  p0,1  p0,2  · · · )
        ( p1,0  p1,1  p1,2  · · · )
        ( p2,0  p2,1  p2,2  · · · )
        (  · · ·  · · ·  · · ·  · · · )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

[transition diagram: the integers . . . , −2, −1, 0, 1, 2, 3, . . . , with probability 1/2 of moving one step to the left or one step to the right]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
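A two-line simulation sketch of this walk (standard library only; p = 1/2 gives the symmetric case), which one can use informally to explore Questions 1-4:

```python
import random

def walk(steps=1000, p=0.5):
    """Simulate a simple random walk started at 0; return its path."""
    path = [0]
    for _ in range(steps):
        path.append(path[-1] + (1 if random.random() < p else -1))
    return path

path = walk()
print("visits to 0:", sum(1 for x in path if x == 0))
```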

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[figure: a city grid of streets laid out in a square pattern]

[picture: Salvador Dali (1922), The Drunkard]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[transition diagram: states H (healthy), S (sick), D (dead); arrows H → S with probability 0.3, S → H with probability 0.8, H → D with probability 0.01, S → D with probability 0.1]

Thus, given that he is currently healthy, there is probability pH,S = 0.3 that the person will become sick. Note that the diagram omits pH,H = 1 − pH,S − pH,D and pS,S = 1 − pS,H − pS,D. Also, pD,H and pD,S are omitted, the reason being that the company does not believe that its clients are subject to resurrection; therefore, pD,D = 1.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
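For a first feel of the answer, one can simulate the H/S/D chain directly; a sketch (standard library only), with the self-loop probabilities filled in from the diagram above:

```python
import random

P = {  # transition probabilities from the actuarial diagram
    'H': [('S', 0.3), ('D', 0.01), ('H', 0.69)],
    'S': [('H', 0.8), ('D', 0.1), ('S', 0.1)],
    'D': [('D', 1.0)],
}

def lifetime(start='H'):
    """Months until absorption in state D, starting from a healthy individual."""
    state, months = start, 0
    while state != 'D':
        u, cum = random.random(), 0.0
        for nxt, prob in P[state]:
            cum += prob
            if u < cum:
                state = nxt
                break
        months += 1
    return months

samples = [lifetime() for _ in range(10_000)]
print("mean lifetime in months:", sum(samples) / len(samples))
```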

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. Usually, T = Z+ = {0, 1, 2, 3, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice. So we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Formally, the Markov property says: for any n, m and any choice of states i0, . . . , in+m,

    P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in).

Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . . , in+1 ∈ S,

    P(Xn+1 = in+1 | Xn = in, . . . , X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, . . . , Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain

    P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),

which precisely means that (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) are independent conditionally on Xn. This is true for any n and m, and so future is independent of past given the present.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in)
        = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),    (1)

which means that the two functions

    µ(i) := P(X0 = i),  i ∈ S,    and    pi,j(n, n + 1) := P(Xn+1 = j | Xn = i),  i, j ∈ S,  n ≥ 0,

will specify all joint distributions²:

    P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n − 1, n).

The function µ(i) is called initial distribution. The function pi,j(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally, pi,j(m, n) := P(Xn = j | Xm = i) is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by a result in Probability,

    pi,j(m, n) = Σi1∈S · · · Σin−1∈S pi,i1(m, m + 1) · · · pin−1,j(n − 1, n),    (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

    P(m, n) := [pi,j(m, n)]i,j∈S ,

viz. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

    P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),

or as

    P(m, n) = P(m, ℓ) P(ℓ, n),    if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by (1), so are all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n. Therefore we can simply write

    pi,j(n, n + 1) =: pi,j ,

and call pi,j the (one-step) transition probability from state i to state j, while

    P = [pi,j]i,j∈S

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

    P(m, n) = P × P × · · · × P = Pn−m    (n − m times),

for all m ≤ n, as follows from (2) and the definition of P. The matrix notation is useful. For example, let

    µn := [µn(i) = P(Xn = i)]i∈S

be the distribution of Xn, arranged in a row vector. Then we have

    µn = µ0 Pn ,    (3)

as follows from (1), and the semigroup property is now even more obvious because

    Pa+b = Pa Pb ,    for all integers a, b ≥ 0.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi. In other words,

    δi(j) := 1, if j = i;    0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = pδi1 + (1 − p)δi2.

Starting from a specific state i. A useful convention used for time-homogeneous chains is to write

    Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write

    Ei Y := E(Y | X0 = i) = Σy y P(Y = y | X0 = i) = Σy y Pi(Y = y).
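Equation (3) is easy to experiment with numerically; a sketch (assuming numpy is available) using the mouse matrix from Section 2.1:

```python
import numpy as np

P = np.array([[0.95, 0.05],
              [0.99, 0.01]])    # mouse chain: rows are states 1, 2
mu0 = np.array([1.0, 0.0])      # unit mass delta_1: start in cell 1

# mu_n = mu_0 P^n, the distribution of X_n
for n in (1, 10, 100):
    print(n, mu0 @ np.linalg.matrix_power(P, n))
```

The printed rows converge quickly, foreshadowing the stationary distribution computed in Section 4.1.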

4 Stationarity

We say that the process (Xn, n = 0, 1, . . .) is stationary if it has the same law as (Xn, n = m, m + 1, . . .), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10²⁵) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Let Xn be the number of molecules in room 1 at time n. Then

    pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
    pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. On the average, we indeed expect that we have N/2 molecules in each room. But this will not work, because it is impossible for the number of molecules to remain constant at all times.

Another guess, then, is: Place each molecule at random either in room 1 or in room 2. Then the number of molecules in room 1 will be i with probability

    π(i) = (N choose i) 2^−N,    0 ≤ i ≤ N

(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

    P(X1 = i) = Σj P(X0 = j) pj,i = P(X0 = i − 1) pi−1,i + P(X0 = i + 1) pi+1,i
              = π(i − 1) pi−1,i + π(i + 1) pi+1,i
              = 2^−N [ (N choose i−1) (N − i + 1)/N + (N choose i+1) (i + 1)/N ] = · · · = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

    P(X0 = i0, X1 = i1, . . . , Xn = in) = P(Xm = i0, Xm+1 = i1, . . . , Xm+n = in),

for all m ≥ 0 and i0, . . . , in ∈ S.

In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. Since, by (3), the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that

    πP = π.    (4)

Such a π is called a stationary or invariant distribution. The equations (4) are called balance equations.
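Numerically, a stationary distribution of a finite chain can be found by solving the balance equations together with the normalisation condition; a sketch (assuming numpy) for the mouse chain:

```python
import numpy as np

P = np.array([[0.95, 0.05],
              [0.99, 0.01]])

# Solve pi P = pi together with sum(pi) = 1: stack the transposed
# balance equations (P - I)^T pi = 0 with the normalisation row.
A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)             # approximately [0.952, 0.048]
print(pi @ P - pi)    # approximately [0, 0]: the balance equations hold
```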

. Who tells us that the sum inside the parentheses is finite? If the sum is infinite. each x ∈ Sd is of the form x = (x1 . . which gives p(j) . we must be careful. This means that the solution to the balance equations is identically zero. random variables. . which. π must be a probability: π(i) = 1. i∈S The matrix P has the eigenvalue 1 always because j∈S pi. the next one will arrive in i minutes with probability p(i). . It is easy to see that the stationary distribution is uniform: π(x) = 1 . which in turn means that the normalisation condition cannot be satisfied. So we must make an: ASSUMPTION: i≥1 j≥i p(j) < ∞. where all the xi are distinct.j = 1. In addition to that. p1. . Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. π(1) will be zero.i. where 1 is a column whose entries are all 1. i≥1 π(i) −1 To compute π(1). i ≥ 1.i = π(0)p(i) + π(i + 1). But here. After bus arrives at a bus stop. xd ).}. The times between successive buses are i. d! x ∈ Sd . Example: waiting for a bus.Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. The state space is S = N = {1. we use the normalisation condition π(1) = i≥1 j≥i = 1. Keep doing it continuously.d. can be written as P1 = 1..i = 1. . Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus.i = p(i). 2 . We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1. if i > 0 i ∈ N. 2. in matrix notation. n ≥ 0) is a Markov chain with transition probabilities pi+1. . . . Example: card shuffling. 11 . To find the stationary distribution. . Then (Xn . d}. Thus. and so each π(i) will be zero. i ≥ 0. . then we run into trouble: indeed. i = 1. write the equations (4): π(i) = π(j)pj. . j≥1 These can easily be solved: π(i) = π(1) j≥i p(j). .

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

    Σi≥1 Σj≥i p(j) = Σj≥1 Σi≥1 1(j ≥ i) p(j) = Σj≥1 j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

    ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.

In other words, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, . . . , d}. Thus, each x ∈ Sd is of the form x = (x1, . . . , xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

    π(x) = 1/d!,    x ∈ Sd.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength.) Consider an infinite binary vector

    ξ = (ξ1, ξ2, . . .),    ξi ∈ {0, 1} for all i,

and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

    ξ(n + 1) = ϕ(ξ(n)).    (5)

Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so.

We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:

    P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2,    any r = 1, 2, . . . and any n = 0, 1, 2, . . .

The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2 then, from (5), the same is true at n + 1. Indeed, for any r = 1, 2, . . . and any ε1, . . . , εr ∈ {0, 1},

    P(ξ1(n + 1) = ε1, . . . , ξr(n + 1) = εr)
        = P(ξ1(n) = 0, ξ2(n) = ε1, . . . , ξr+1(n) = εr)
        + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . . , ξr+1(n) = 1 − εr).

You can check that, by the induction hypothesis, the right-hand side equals 2^−(r+1) + 2^−(r+1) = 2^−r, as required. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread: it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

    π = πP         (balance equations)
    π1 = 1         (normalisation condition).

In other words,

    π(i) = Σj∈S π(j) pj,i,    i ∈ S,
    Σj∈S π(j) = 1.

If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:

    F(A, Ac) := Σi∈A Σj∈Ac π(i) pi,j.

Think of π(i)pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

    F(A, Ac) = F(Ac, A),    for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

    π(i) = Σj∈S π(j) pj,i = Σj∈A π(j) pj,i + Σj∈Ac π(j) pj,i.

Multiply π(i) by Σj∈S pi,j (which equals 1):

    π(i) Σj∈A pi,j + π(i) Σj∈Ac pi,j = Σj∈A π(j) pj,i + Σj∈Ac π(j) pj,i.

Now split the left sum also, and sum over i ∈ A:

    Σi∈A Σj∈A π(i) pi,j + Σi∈A Σj∈Ac π(i) pi,j = Σi∈A Σj∈A π(j) pj,i + Σi∈A Σj∈Ac π(j) pj,i.

We see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

[figure: schematic representation of the flow balance relations F(A, Ac) = F(Ac, A) between a set A and its complement]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p21 = β. (Consequently, p11 = 1 − α, p22 = 1 − β.) Assume 0 < α, β < 1. We have

    π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

    π(1) = cβ,    π(2) = cα.

The constant c is found from π(1) + π(2) = 1. Thus,

    π(1) = β/(α + β),    π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952 and π(2) = 0.05/1.04 ≈ 0.048. That is, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

[transition diagram: states 0, 1, 2, 3, 4, . . .; from each state i there is an arrow to i + 1 with probability p and, for i ≥ 1, an arrow to i − 1 with probability q]

The balance equations are:

    π(i) = pπ(i − 1) + qπ(i + 1),    i = 1, 2, 3, . . .

But we can make them simpler by choosing A = {0, 1, . . . , i − 1}. These are much simpler than the previous ones. In fact, we can solve them immediately: for i ≥ 1,

    F(A, Ac) = π(i − 1)p = π(i)q = F(Ac, A),

and so

    π(i) = (p/q)^i π(0),    i ≥ 1.

Does this provide a stationary distribution? Yes, provided that Σi≥0 π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by

    (i, j) ∈ E  ⇐⇒  pi,j > 0.

In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time. In other words,

    i leads to j (written as i ⇝ j)  ⇐⇒  Pi(∃ n ≥ 0 : Xn = j) > 0.

Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, . . . , in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since

    0 < Pi(∃ n ≥ 0 : Xn = j) ≤ Σn≥0 Pi(Xn = j),

the sum on the right must have a nonzero term, i.e. Pi(Xn = j) > 0 for some n ≥ 0, and so (iii) holds.
• Suppose that (ii) holds. We have

    Pi(Xn = j) = Σi1,...,in−1 pi,i1 pi1,i2 · · · pin−1,j,    (6)

where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term. Hence Pi(Xn = j) > 0, and so (iii) holds.
• Suppose now that (iii) holds. Look at the last display again. Since Pi(Xn = j) > 0, the sum is positive and so one of its terms is positive, meaning that (ii) holds. In addition, because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 : Xm = j), (i) holds as well.

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

    i communicates with j (written as i ⇌ j)  ⇐⇒  i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ⇌ j ⇐⇒ j ⇌ i).

Corollary 2. The relation "communicates with" is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because ⇝ is so.

Just as any equivalence relation, it partitions S into equivalence classes, known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

    [i] := {j ∈ S : j ⇌ i}.

So [i] = [j] if and only if i ⇌ j. Two communicating classes are either identical or completely disjoint.

[figure: a digraph on the six states 1, 2, 3, 4, 5, 6]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

    Σj∈C pi,j = 1,    for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: If i ∈ C and if i ⇝ j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ∈ C and i ⇝ j. We can then find states i1, . . . , in−1 such that pi,i1 pi1,i2 · · · pin−1,j > 0. Since pi,i1 > 0 and C is closed, we have i1 ∈ C. Similarly, pi1,i2 > 0, and so i2 ∈ C, and so on. Thus, we conclude that j ∈ C. The converse is left as an exercise.

If a state i is such that pi,i = 1 then we say that i is an absorbing state. In other words, the single-element set {i} is a closed communicating class.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In graph terms: starting from any state, we can reach any other state. In other words still, the state space S is a single closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. Why?

A state i is essential if it belongs to a closed communicating class. Otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i ⇝ j but j ̸⇝ i.

[figure: the general structure of a Markov chain, with communicating classes C1, C2, C3 (not closed) and C4, C5 (closed)]

The general structure of a Markov chain is indicated in this figure: the communicating classes C4, C5 are closed, while C1, C2, C3 are not. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction. Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them.
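Communicating classes are easy to compute mechanically, since i ⇝ j is just reachability in the digraph of the chain. A sketch (standard library only; the example edges are an illustrative guess in the spirit of the six-state figure above):

```python
def reaches(edges, i):
    """Set of states reachable from i, where edges[i] lists j with p_ij > 0."""
    seen, stack = {i}, [i]
    while stack:
        for j in edges[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def communicating_class(edges, i):
    """[i] = the set of states j with i ~> j and j ~> i."""
    return {j for j in reaches(edges, i) if i in reaches(edges, j)}

edges = {1: [2], 2: [3], 3: [2, 4], 4: [5], 5: [4], 6: [6]}
print(communicating_class(edges, 2))  # {2, 3}  (not closed)
print(communicating_class(edges, 4))  # {4, 5}  (closed)
```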

5.4 Period

Consider the following chain:

[transition diagram: four states 1, 2, 3, 4 arranged in a cycle; from each state the chain moves to either of its two neighbours]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, . . . We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. The period of an essential state i is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

    d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.

A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.

Let us denote by pij^(n) the entries of Pn. Since Pm+n = Pm Pn, we have

    pij^(m+n) ≥ pik^(m) pkj^(n),    for all m, n ≥ 0 and all i, j, k ∈ S.

So pii^(2n) ≥ (pii^(n))² and, more generally, pii^(ℓn) ≥ (pii^(n))^ℓ, ℓ ∈ N. Therefore, if pii^(n) > 0 then pii^(ℓn) > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if pii^(n) > 0 then pii^(m) > 0. So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

Theorem 2 (the period is a class property). If i ⇌ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let

    Di = {n ∈ N : pii^(n) > 0},    Dj = {n ∈ N : pjj^(n) > 0},    d(i) = gcd Di,    d(j) = gcd Dj.

Since i ⇌ j, there is α ∈ N with pij^(α) > 0 and

there is β ∈ N with pji^(β) > 0. So pjj^(α+β) ≥ pji^(β) pij^(α) > 0, showing that α + β ∈ Dj. Therefore d(j) | α + β. Now let n ∈ Di. Then pjj^(α+n+β) ≥ pji^(β) pii^(n) pij^(α) > 0, showing that α + n + β ∈ Dj. Therefore d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers then it also divides their difference. From the last two displays we therefore get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

By Theorem 2, all states of an irreducible chain have a common period: an irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Theorem 3. If i is an aperiodic state then there exists n0 such that pii^(n) > 0 for all n ≥ n0.

Proof. Since d(i) = 1, we can pick n2, n1 ∈ Di such that n2 − n1 = 1. Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Because n is sufficiently large, we have q − r > 0. Therefore

    n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.

Since n1, n2 ∈ Di, we have n ∈ Di as well.

Corollary 3. An irreducible chain with finitely many states is aperiodic if and only if there exists an n such that pij^(n) > 0 for all i, j ∈ S.

More generally, if an irreducible chain has period d then we can decompose the state space into d sets C0, C1, . . . , Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, . . . , Cd−1 such that

    Σj∈Cr+1 pij = 1,    i ∈ Cr,    r = 0, 1, . . . , d − 1.

(Here Cd := C0.)

[figure: the internal structure of a closed communicating class with period d = 4, the chain moving cyclically through C0, C1, C2, C3]
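The period is also easy to compute mechanically: collect the lengths n of loops i → i with pii^(n) > 0 up to some horizon and take their gcd. A sketch (standard library only; the finite horizon is an assumption of the sketch, adequate once enough return lengths have been seen):

```python
from math import gcd

def period(edges, i, horizon=50):
    """d(i): gcd of lengths of paths i -> i, collected up to `horizon` steps."""
    d, level = 0, {i}
    for n in range(1, horizon + 1):
        level = {k for j in level for k in edges[j]}  # states reachable in n steps
        if i in level:
            d = gcd(d, n)
    return d

cycle4 = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}  # the 4-cycle drawn above
print(period(cycle4, 1))  # prints 2
```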

Proof. Define the relation

    i ⇝d j  ⇐⇒  pij^(nd) > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ⇝d j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, . . . , id−1, with corresponding classes C0, C1, . . . , Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0 then, necessarily, id ∈ C0.

We now show that if i belongs to one of these classes and if pij > 0 then, necessarily, j belongs to the next class. Take, for example, i ∈ C0, and suppose pij > 0 but j ∉ C1. Consider the path i0 → i → j → i2 → i3 → · · · → id → i0. The existence of such a path implies that it is possible to go from i0 back to i0 in a number of steps which is not an integer multiple of d (why?), which contradicts the definition of d.

(The reader is warned to be alert as to which of the two variables we are considering at each time. We avoid excessive notation by using the same letter for both.)

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by

    TA := inf{n ≥ 0 : Xn ∈ A}.

We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis".

As an example, consider the chain

[transition diagram: states 0, 1, 2, with 0 and 2 absorbing; from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A, the two times coincide.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define

    ϕ(i) := Pi(TA < ∞).

We then have:

Theorem 5. The function ϕ(i) satisfies

    ϕ(i) = 1,    i ∈ A,
    ϕ(i) = Σj∈S pij ϕ(j),    i ∉ A.

Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1, so TA = 1 + T′A, where T′A is the remaining time until A is hit. We first have

    Pi(TA < ∞) = Σj∈S Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σj∈S Pi(T′A < ∞ | X1 = j) pij.

But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, . . .). Therefore, conditionally on X1, the event {T′A < ∞} is independent of X0. But the Markov chain is homogeneous, which implies that

    P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).

Combining the above, we have

    Pi(TA < ∞) = Σj∈S ϕ(j) pij,

as needed. For the second part of the proof, let ϕ′(i) be another solution. Then

    ϕ′(i) = Σj∈A pij + Σj∉A pij ϕ′(j).

By self-feeding this equation, we have

    ϕ′(i) = Σj∈A pij + Σj∉A pij ( Σk∈A pjk + Σk∉A pjk ϕ′(k) )
          = Σj∈A pij + Σj∉A Σk∈A pij pjk + Σj∉A Σk∉A pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1), while the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and let ϕ(i) = Pi(T0 < ∞), i = 0, 1, 2. We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0 it takes no time to hit 0; obviously, if the chain starts from 2 it will never hit 0.) As for ϕ(1), we have

    ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time,

    ψ(i) := Ei TA .

If i ∈ A then, obviously, TA = 0 and so ψ(i) := Ei TA = 0, because it takes no time to hit A if the chain starts from A. If i ∉ A, then, as above, TA = 1 + T′A, and

    Ei TA = 1 + Ei T′A = 1 + Σj∈S pij Ej TA = 1 + Σj∉A pij ψ(j).

We have:

Theorem 6. The function ψ(i) satisfies

    ψ(i) = 0,    i ∈ A,
    ψ(i) = 1 + Σj∉A pij ψ(j),    i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S. The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}. Start the chain from i, and let ψ(i) = Ei TA. Clearly, ψ(0) = ψ(2) = 0, while

    ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3; therefore the mean number of flips until the first success is 3/2.)

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities

    ϕAB(i) = Pi(TA < TB).

It should be clear that the equations satisfied are:

    ϕAB(i) = 1,    i ∈ A,
    ϕAB(i) = 0,    i ∈ B,
    ϕAB(i) = Σj∈S pij ϕAB(j),    otherwise.

Furthermore, ϕAB is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x a reward f(x) is earned. This happens up to the first hitting time TA of the set A. The total reward is f(X0) + f(X1) + · · · + f(XTA). We are interested in the mean total reward

    h(x) := Ex Σn=0..TA f(Xn).

Clearly, h(x) = f(x) for x ∈ A, because when X0 ∈ A then TA = 0 and so the total reward is f(X0). Next, if x ∉ A, then, as argued earlier, TA = 1 + T′A, where T′A is the remaining time until A is hit after one step. Hence

    h(x) = Ex Σn=0..1+T′A f(Xn) = f(x) + Ex Σn=1..1+T′A f(Xn) = f(x) + Ex Σn=0..T′A f(Xn+1),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, . . .. Hence, by the Markov property,

    Ex Σn=0..T′A f(Xn+1) = Σy∈S pxy h(y).

Thus, the set of equations satisfied by h are

    h(x) = f(x),    x ∈ A,
    h(x) = f(x) + Σy∈S pxy h(y),    x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.
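All of these first-step systems are finite linear systems, so they can be solved mechanically. A sketch (assuming numpy) for ϕ and ψ in the three-state example above, where only ϕ(1) and ψ(1) are unknown:

```python
import numpy as np

# States 0, 1, 2; 0 and 2 absorbing; from 1: 1/2 -> 0, 1/3 stay, 1/6 -> 2.
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 1/3, 1/6],
              [0.0, 0.0, 1.0]])

# phi(1) = P_1(T_{0} < infinity): phi(1) = p_10 + p_11 phi(1).
phi1 = P[1, 0] / (1 - P[1, 1])
print("phi(1) =", phi1)   # 0.75

# psi(1) = E_1 T_A with A = {0, 2}: psi(1) = 1 + p_11 psi(1).
psi1 = 1 / (1 - P[1, 1])
print("psi(1) =", psi1)   # 1.5
```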

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die; why?)

[figure: a house with four rooms 1, 2, 3, 4; from room 1 or room 3 the thief moves to room 2 or room 4 with probability 1/2 each; from room 2 he moves to room 1 or room 3 with probability 1/2 each]

The question is to compute the average total reward. If we let h(i) be the average total reward if he starts from room i, we have

    h(4) = 0
    h(1) = 1 + (1/2) h(2)
    h(3) = 3 + (1/2) h(2)
    h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.

7 Gambler's ruin

Consider the simplest game of chance of tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. So, if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up, or lost Yk pounds if tails appeared. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), your fortune after the n-th toss is

    Xn = X0 + Σk=1..n Yk ξk .

The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss, so

if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk. It can only depend on ξ1, . . . , ξk−1 and–as is often the case in practice–on other considerations of, e.g., astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem: a gambler plays the simple coin tossing game described above, betting £1 at each integer time, against a coin which has P(heads) = p, P(tails) = 1 − p; i.e., in the simplest case, Yk = 1 for all k. Let £x be the initial fortune of the gambler. The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). She will stop playing if her money falls below a (a < x). We are clearly dealing with a Markov chain in the state space

    {a, a + 1, . . . , b}

with transition probabilities

    pi,i+1 = p,    pi,i−1 = q = 1 − p,    a < i < b,

and where the states a, b are absorbing. Consider the hitting time of state y:

    Ty := inf{n ≥ 0 : Xn = y}.

We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x.

Theorem 7. The probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

    Px(Tb < Ta) = [ (q/p)^(x−a) − 1 ] / [ (q/p)^(b−a) − 1 ],    if p ≠ q;
    Px(Tb < Ta) = (x − a)/(b − a),    if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let

    ϕ(x, ℓ) := Px(Tℓ < T0),    0 ≤ x ≤ ℓ.

The general solution will then be obtained from

    Px(Tb < Ta) = ϕ(x − a, b − a).

(Think why!) To simplify things, write ϕ(x) instead of ϕ(x, ℓ). We obviously have

    ϕ(0) = 0,    ϕ(ℓ) = 1,

and, by first-step analysis,

    ϕ(x) = pϕ(x + 1) + qϕ(x − 1),    0 < x < ℓ.

Since p + q = 1, we can write this as

    p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

    δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0),    0 < x < ℓ.

Assuming that λ := q/p ≠ 1, we find

    ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^(x−1) + · · · + λ + 1] δ(0) = [(λ^x − 1)/(λ − 1)] δ(0).

Since ϕ(ℓ) = 1, we have [(λ^ℓ − 1)/(λ − 1)] δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),    0 ≤ x ≤ ℓ,    λ ≠ 1.

If λ = 1 then q = p, and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = xδ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so ϕ(x) = x/ℓ. Summarising, we have obtained

    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1),   if λ ≠ 1;        ϕ(x, ℓ) = x/ℓ,   if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

    Px(T0 < ∞) = (q/p)^x,   if p > 1/2;        Px(T0 < ∞) = 1,   if p ≤ 1/2.

Proof. We have

    Px(T0 < ∞) = lim(ℓ→∞) Px(T0 < Tℓ) = 1 − lim(ℓ→∞) Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

    Px(T0 < ∞) = 1 − lim(ℓ→∞) (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

    Px(T0 < ∞) = 1 − lim(ℓ→∞) (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2, then

    Px(T0 < ∞) = 1 − lim(ℓ→∞) x/ℓ = 1 − 0 = 1.
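A quick check of Theorem 7 by simulation, with illustrative parameters (standard library only):

```python
import random

def reach_b_before_a(x, a, b, p, trials=20_000):
    """Estimate Px(Tb < Ta) for the unit-stake game."""
    wins = 0
    for _ in range(trials):
        f = x
        while a < f < b:
            f += 1 if random.random() < p else -1
        wins += (f == b)
    return wins / trials

p, q, x, a, b = 0.45, 0.55, 5, 0, 10
lam = q / p
exact = (lam ** (x - a) - 1) / (lam ** (b - a) - 1)
print(exact, reach_b_before_a(x, a, b, p))  # the two numbers should be close
```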

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X0, . . . , Xn) for all n ∈ Z+. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, the future process (Xn, Xn+1, . . .) is independent of the past process (X0, . . . , Xn) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain and let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . . , Xτ) and has the same law as the original process started from i.

Proof. Let τ be a stopping time and let A be any event determined by the past (X0, . . . , Xτ) if τ < ∞. Since τ is a stopping time, any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A,

    P(Xτ+1 = j1, Xτ+2 = j2, . . . | Xτ = i, A, τ < ∞) = pi,j1 pj1,j2 · · ·

and that

    P(A | Xτ = i, Xτ+1 = j1, Xτ+2 = j2, . . . , τ < ∞) = P(A | Xτ = i, τ < ∞).

We have:

    P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
      (a) = P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
      (b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
      (c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
      (d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn) and the ordinary splitting property at n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, A, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

    Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the Ti of Section 6: the Ti of Section 6 was allowed to take the value 0, whereas our Ti here is always positive. If X0 ≠ i then the two definitions coincide. (Why?) Regardless, we will use the same letter for both and rely on the context.

We let Ti^(2) be the time of the second visit, Ti^(3) the time of the third visit, and so on. Formally, we recursively define

    Ti^(r) := inf{n > Ti^(r−1) : Xn = i},    r = 2, 3, . . .

(And we set Ti^(1) := Ti.) All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

    Xi^(r) := (Xn, Ti^(r) ≤ n < Ti^(r+1)),    r = 1, 2, . . .

Let also

    Xi^(0) := (Xn, 0 ≤ n < Ti^(1)).

We think of them as random trajectories and we refer to them as excursions of the process.

[figure: a trajectory of the chain, with the successive visit times Ti^(1), Ti^(2), Ti^(3), Ti^(4) to state i marking off the excursions; these excursions are independent!]

So Xi^(r) is the r-th excursion: the chain visits

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions Xi^(0), Xi^(1), Xi^(2), . . . are independent random variables. Moreover, all of them, except, possibly, Xi^(0), have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at Ti^(r): SMP says that, given the state at the stopping time Ti^(r), the future after Ti^(r) is independent of the past before Ti^(r). But the state at Ti^(r) is, by definition, equal to i. Therefore, the future is independent of the past, and this proves the first claim. The claim that Xi^(1), Xi^(2), . . . have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial but important corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(Xi^(0)), g(Xi^(1)), g(Xi^(2)), . . . of the excursions are independent random variables. Moreover, g(Xi^(1)), g(Xi^(2)), . . . are also identical in law (i.i.d.).

Example: Define

  Λr := Σ_{Ti^(r) ≤ n < Ti^(r+1)} 1(Xn = j).

The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ1, Λ2, . . . are i.i.d.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions. First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:

  Pi(Xn = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:

  Pi(Xn = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:

  0 < Pi(Xn = i for infinitely many n) < 1.

We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state i (excluding the possibility that X0 = i):

  Ji = Σ_{n=1}^∞ 1(Xn = i) = sup{r ≥ 0 : Ti^(r) < ∞}.

Suppose that the Markov chain starts from state i. Notice that:

  state i is visited infinitely many times ⟺ Ji = ∞.

Notice also that

  fii := Pi(Ji ≥ 1) = Pi(Ti < ∞).

We claim that:

Lemma 4. Starting from X0 = i, the random variable Ji is geometric:

  Pi(Ji ≥ k) = fii^k,  k ≥ 0.

Proof. By the strong Markov property at time Ti^(k−1),

  Pi(Ji ≥ k) = Pi(Ji ≥ k − 1) Pi(Ji ≥ 1).

Indeed, the event {Ji ≥ k} implies that Ti^(k−1) < ∞ (that is, that Ji ≥ k − 1) and that, after time Ti^(k−1), the chain visits i at least once more. But the past before Ti^(k−1) is independent of the future, and that is why we get the product. Therefore,

  Pi(Ji ≥ k) = fii Pi(Ji ≥ k − 1) = fii^2 Pi(Ji ≥ k − 2) = · · · = fii^k.

Notice that fii = Pi(Ti < ∞) can either be equal to 1 or smaller than 1. If fii = 1 we have that Pi(Ji ≥ k) = 1 for all k, i.e. Pi(Ji = ∞) = 1. This means that the state i is recurrent. If fii < 1, then Pi(Ji < ∞) = 1. This means that the state i is transient. Because either fii = 1 or fii < 1, we conclude what we asserted earlier, namely, that every state is either recurrent or transient.
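Lemma 4 is easy to test numerically. In the sketch below (a hypothetical example, not from the notes) the chain is a biased random walk on Z, which is transient; a standard computation gives f00 = 1 − |p − q| = 0.6 for up-probability p = 0.7, so the simulated tails P0(J0 ≥ k) should be close to 0.6^k:

    import random

    def J0(max_steps=1_000):
        """Count returns to 0 of a walk with steps +1 w.p. 0.7, -1 w.p. 0.3.
        The walk is transient, so truncating the horizon hardly matters."""
        x, visits = 0, 0
        for _ in range(max_steps):
            x += 1 if random.random() < 0.7 else -1
            visits += (x == 0)
        return visits

    samples = [J0() for _ in range(10_000)]
    for k in range(1, 5):
        tail = sum(j >= k for j in samples) / len(samples)
        print(k, tail, 0.6 ** k)   # empirical vs geometric tail f00**k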

We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. fii = Pi(Ti < ∞) = 1.
3. Pi(Ji = ∞) = 1.
4. Ei Ji = ∞.

Proof. State i is recurrent if, by definition, it is visited infinitely many times, which, by the definition of Ji, is equivalent to Pi(Ji = ∞) = 1. The equivalence of the first two statements was proved above. Finally, if Pi(Ji = ∞) = 1 then, clearly, Ei Ji = ∞; conversely, since Ji is a geometric random variable, its expectation cannot be infinity without Ji itself being infinity with probability one.

Lemma 6. The following are equivalent:
1. State i is transient.
2. fii = Pi(Ti < ∞) < 1.
3. Pi(Ji < ∞) = 1.
4. Ei Ji < ∞.

Proof. The equivalence of the first two statements was proved before the lemma. State i is transient if, by definition, it is visited finitely many times with probability 1, which, by the definition of Ji, is equivalent to Pi(Ji < ∞) = 1. If i is transient then Ei Ji < ∞, owing to the fact that Ji is a geometric random variable with fii < 1. On the other hand, if we assume Ei Ji < ∞, then Ji cannot take the value +∞ with positive probability, and therefore Pi(Ji < ∞) = 1.

Corollary 6. If i is a transient state then pii^(n) → 0, as n → ∞.

Proof. If i is transient then Ei Ji < ∞. But

  Ei Ji = Ei Σ_{n=1}^∞ 1(Xn = i) = Σ_{n=1}^∞ Pi(Xn = i) = Σ_{n=1}^∞ pii^(n),

and if a series is finite then the summands go to 0, i.e. pii^(n) → 0, as n → ∞.

10.1 First hitting time decomposition

Define

  fij^(n) := Pi(Tj = n),  n ≥ 1,

the probability to hit state j for the first time at time n ≥ 1, assuming that the chain starts from state i. Notice the difference with pij^(n) = Pi(Xn = j): the latter is the probability to be in state j at time n (not necessarily for the first time) starting from i. Clearly, fij^(n) ≤ pij^(n), but the two sequences are related in an exact way:

Lemma 7.

  pij^(n) = Σ_{m=1}^n fij^(m) pjj^(n−m),  n ≥ 1.

Proof. From Elementary Probability,

  pij^(n) = Pi(Xn = j) = Σ_{m=1}^n Pi(Tj = m, Xn = j) = Σ_{m=1}^n Pi(Tj = m) Pi(Xn = j | Tj = m).

The first factor equals fij^(m), by definition. For the second factor, we use the Strong Markov Property:

  Pi(Xn = j | Tj = m) = Pi(Xn = j | Xm = j) = pjj^(n−m).

Corollary 7. If j is a transient state then pij^(n) → 0, as n → ∞, for all states i.

Proof. If j is transient, we have Σ_{n=1}^∞ pjj^(n) = Ej Jj < ∞. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right-hand side)

  Σ_{n=1}^∞ pij^(n) = Σ_{m=1}^∞ fij^(m) Σ_{n=m}^∞ pjj^(n−m) = (1 + Ej Jj) Σ_{m=1}^∞ fij^(m) = (1 + Ej Jj) Pi(Tj < ∞),

which is finite. Hence pij^(n) → 0, as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In particular, if we let fij := Pi(Tj < ∞), we have

  i ⇝ j ⟺ fij > 0,  for all states i, j.

We now show that recurrence is a class property. (Remember: another property that was a class property was the property that all states of a communicating class have a certain common period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and, moreover,

  fij = Pi(Tj < ∞) = 1.

In particular, j ⇝ i.

Proof. Start with X0 = i. If i is recurrent and i ⇝ j then the probability that j will appear in a specific excursion is positive and equals

  gij := Pi(Tj < Ti) > 0.

So the random variables

  δ_{j,r} := 1(j appears in excursion Xi^(r)),  r = 0, 1, 2, . . . ,

are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence j is visited infinitely many times, so j is recurrent, and not only fij > 0, but also fij = 1. The last statement, j ⇝ i, is simply an expression in symbols of what we said above: each excursion containing j ends with a return to i.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Proof. If the class contains a recurrent state i, then every other state j in the class satisfies i ⇝ j and is therefore, by the above theorem, recurrent. If, on the other hand, the class contains a state which is transient then, necessarily,

all other states are also transient: if some state were recurrent, the previous argument would force every state in the class, including the given one, to be recurrent, a contradiction.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ⇝̸ i. Pick m such that Pi(A) := Pi(Xm = j) > 0, and let B := {Xn = i for infinitely many n}. Since j ⇝̸ i, we have

  Pi(A ∩ B) = Pi(Xm = j, Xn = i for infinitely many n) = 0.

Hence

  0 < Pi(A) = Pi(A ∩ B) + Pi(A ∩ B^c) = Pi(A ∩ B^c) ≤ Pi(B^c),

and so Pi(B) = Pi(Xn = i for infinitely many n) < 1, i.e. i is a transient state.

So if a communicating class is recurrent then it contains no inessential states, and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i ∈ C must be visited infinitely many times with positive probability; by the dichotomy established above, this probability is then equal to one, so this state is recurrent. Hence, by the class property, all states of C are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let Ti be the first return time to i, then either Pi(Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi(Ti < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U0, U1, U2, . . . be i.i.d. random variables, uniformly distributed in the unit interval [0, 1], and let

  T := inf{n ≥ 1 : Un > U0}.

Thus T represents the number of variables we need to draw to get a value larger than the initial one. (It should be clear that this can appear, for example, in a certain game.) What is the expectation of T? First, observe that T is finite. Indeed,

  P(T > n) = P(U1 ≤ U0, . . . , Un ≤ U0) = ∫₀¹ P(U1 ≤ x, . . . , Un ≤ x | U0 = x) dx = ∫₀¹ xⁿ dx = 1/(n + 1).
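Both this formula and the infinite expectation computed next are easy to watch in simulation (a sketch, not from the notes); note how the sample mean of T refuses to settle down as the sample grows:

    import random

    def T():
        u0, n = random.random(), 0
        while True:
            n += 1
            if random.random() > u0:
                return n

    samples = [T() for _ in range(100_000)]
    for n in (1, 2, 4, 9):
        print(n, sum(t > n for t in samples) / len(samples), 1 / (n + 1))

    # E T = infinity: the running sample mean keeps growing with the sample size.
    for size in (10**3, 10**4, 10**5):
        print(size, sum(samples[:size]) / size)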

It follows that

  P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n+1)) = 1.

But

  E T = Σ_{n=0}^∞ P(T > n) = Σ_{n=0}^∞ 1/(n+1) = ∞.

You are invited to think about the meaning of this: if the initial value U0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!

We now classify a recurrent state i as follows. We say that i is positive recurrent if Ei Ti < ∞. We say that i is null recurrent if Ei Ti = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading: it is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z1, Z2, . . . be i.i.d. random variables.
(i) Suppose E|Z1| < ∞ and let µ = EZ1. Then

  P( lim_{n→∞} (Z1 + · · · + Zn)/n = µ ) = 1.

(ii) Suppose E max(Z1, 0) = ∞, but E min(Z1, 0) > −∞. Then

  P( lim_{n→∞} (Z1 + · · · + Zn)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X0, X1, X2, . . . of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions

  Xa^(1), Xa^(2), Xa^(3), . . .

forms an independent sequence of (generalised) random variables; moreover, these excursions are identically distributed and, if we start with X0 = a, the initial excursion Xa^(0) also has the same law as the rest of them. Since functions of independent random variables are also independent, we have that

  G(Xa^(1)), G(Xa^(2)), G(Xa^(3)), . . .

are i.i.d. random variables, for reasonable choices of the functions G taking values in R.

Example 1: To each state x ∈ S, assign a reward f(x), which we can think of as a reward received when the chain is in state x. For an excursion X, define G(X) to be the maximum reward received during this excursion. Specifically,

  G(Xa^(r)) = max{ f(Xt) : Ta^(r) ≤ t < Ta^(r+1) }.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:

  G(Xa^(r)) = Σ_{Ta^(r) ≤ t < Ta^(r+1)} f(Xt).  (7)

Whichever the choice of G, the law of large numbers tells us that

  (G(Xa^(1)) + · · · + G(Xa^(n)))/n → E G(Xa^(1)),  (8)

as n → ∞, with probability 1. Note that

  E G(Xa^(1)) = E G(Xa^(2)) = · · · = E[ G(Xa^(0)) | X0 = a ] = Ea G(Xa^(0)).

The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a.

13.2 Average reward

Consider now a reward function f : S → R, as in the examples above. We are interested in the existence of the limit

  time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(Xk),

which represents the 'time-average reward' received. In the sequel, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. Ea Ta < ∞) and assume that the distribution of X0 is such that P(Ta < ∞) = 1. Let f : S → R+ be a bounded 'reward' function. Then the 'time-average reward' f̄ exists and is a deterministic quantity:

  P( lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = f̄ ) = 1,

where

  f̄ = Ea[ Σ_{n=0}^{Ta−1} f(Xn) ] / Ea Ta = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this. Suppose that t is large. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, Nt, of complete excursions from the ground state a that occur between 0 and t:

  Nt := max{r ≥ 1 : Ta^(r) ≤ t}.

For example, in the figure below, Nt = 5.

(Figure: a time axis from 0 to t, marking the visit times Ta^(1), . . . , Ta^(5) before t.)

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:

  Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + G^first + G^last,

where Gr is given by (7), G^first is the reward up to time Ta = Ta^(1) or t (whichever is smaller), and G^last is the reward over the last incomplete cycle. We assumed that f is bounded, say |f(x)| ≤ B. Hence |G^first| ≤ B Ta. Since P(Ta < ∞) = 1, we have B Ta/t → 0, and so

  (1/t) G^first → 0,  as t → ∞,

with probability one. A similar argument shows that

  (1/t) G^last → 0,  as t → ∞,

with probability one.

So we are left with the sum of the rewards over complete cycles. Write now

  (1/t) Σ_{r=1}^{Nt−1} Gr = (Nt/t) · (1/Nt) Σ_{r=1}^{Nt−1} Gr.  (9)

Look again at what the Law of Large Numbers (8) tells us: that

  lim_{n→∞} (1/n) Σ_{r=1}^n Gr = E G1,  (10)

with probability one. Since Ta^(r) → ∞ as r → ∞, it follows that the number Nt of complete cycles before time t will also be tending to ∞, as t → ∞. Hence also

  lim_{t→∞} (1/Nt) Σ_{r=1}^{Nt−1} Gr = E G1.

Now look at the sequence

  Ta^(r) = Ta^(1) + Σ_{k=2}^r (Ta^(k) − Ta^(k−1)),

and see what the Law of Large Numbers says again: the differences Ta^(k) − Ta^(k−1), k = 2, 3, . . . , are i.i.d. random variables with common expectation Ea Ta, and, since Ta^(1)/r → 0, we have

  lim_{r→∞} Ta^(r)/r = Ea Ta.  (11)

By definition, the visit with index r = Nt to the ground state a is the last before t (see also the figure above):

  Ta^(Nt) ≤ t < Ta^(Nt+1).

Hence

  Ta^(Nt)/Nt ≤ t/Nt < Ta^(Nt+1)/Nt.

But (11) tells us that

  lim_{t→∞} Ta^(Nt)/Nt = lim_{t→∞} Ta^(Nt+1)/Nt = Ea Ta.

Hence

  lim_{t→∞} Nt/t = 1/Ea Ta.  (12)

Using (10) and (12) in (9) we have

  lim_{t→∞} (1/t) Σ_{r=1}^{Nt−1} Gr = E G1 / Ea Ta.

Thus

  lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = E G1 / Ea Ta =: f̄.

To see that f̄ is as claimed, just observe that

  E G1 = E Σ_{Ta^(1) ≤ t < Ta^(2)} f(Xt) = Ea Σ_{0 ≤ t < Ta} f(Xt),

from the Strong Markov Property. We may write E G1 = Ea Σ_{0 ≤ t < Ta} f(Xt) or E G1 = Ea Σ_{0 < t ≤ Ta} f(Xt); both of them are equal, since X0 = XTa = a under Pa. Note that we include only one endpoint of the excursion, not both.

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion: the numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included), and the denominator is the duration of the excursion.

(Figure: the rewards f(X0) = f(a), f(X1), f(X2), . . . , f(XTa−1) collected over one excursion from a.)

Comment 2: A very important particular case occurs when we choose

  f(x) := 1(x = b),

i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, with probability 1,

  lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = b) = Ea[ Σ_{k=0}^{Ta−1} 1(Xk = b) ] / Ea Ta,  (13)

for all b ∈ S.

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence

  lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = a) = 1/Ea Ta,  (14)

with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes Ea Ta units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/Ea Ta.

Comment 4: Suppose that two states, a, b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/Eb Tb. The equality of these two limits gives:

  average no. of times state b is visited between two successive visits to state a = Ea Σ_{k=0}^{Ta−1} 1(Xk = b) = Ea Ta / Eb Tb.
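The theorem and its comments can be checked numerically. The sketch below (a hypothetical two-state chain, not from the notes) compares the simulated time-average reward with Σx f(x)π(x), which, by the theorem, is the same as the excursion average f̄:

    import random

    p01, p10 = 0.3, 0.2            # hypothetical two-state chain
    f = {0: 1.0, 1: 5.0}           # reward function

    def step(x):
        if x == 0:
            return 1 if random.random() < p01 else 0
        return 0 if random.random() < p10 else 1

    # Time-average reward along one long trajectory.
    x, total, t = 0, 0.0, 10**6
    for _ in range(t):
        total += f[x]
        x = step(x)
    print(total / t)

    # Stationary average: pi(0) = p10/(p01 + p10), pi(1) = p01/(p01 + p10).
    pi0 = p10 / (p01 + p10)
    print(f[0] * pi0 + f[1] * (1 - pi0))   # should agree with the above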

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

  ν[a](x) := Ea Σ_{n=0}^{Ta−1} 1(Xn = x),  x ∈ S,  (15)

should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then ν[a](x)/Ea Ta is the long-run fraction of times that state x is visited. In view of this, it should have something to do with the concept of stationary distribution discussed a few sections ago. (Incidentally, notice that the choice of notation ν[a] is justifiable through the notation (6) for the communicating class of the state a.) Note: Although ν[a] depends on the choice of the 'ground' state a, we will soon realise that this dependence vanishes when the chain is irreducible; hence the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent⁵. Consider the function ν[a] defined in (15). Then ν[a] satisfies the balance equations:

  ν[a] = ν[a] P,

i.e.

  ν[a](x) = Σ_{y∈S} ν[a](y) p_{yx},  x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν[a](x) < ∞.

⁵ We stress that we do not need the stronger assumption of positive recurrence here.

Proof. We start the Markov chain with X0 = a. Recurrence tells us that Ta < ∞ with probability one. Since a = XTa, we can write

  ν[a](x) = Ea Σ_{n=1}^{Ta} 1(Xn = x).  (16)

This was a trivial but important step: we replaced the n = 0 term with the n = Ta term. (Both of them equal 1 when x = a and 0 otherwise, so there is nothing lost in doing so.) The real reason for doing so is because we wanted to replace a function of X0, X1, X2, . . . by a function of X1, X2, X3, . . . ,

and thus be in a position to apply first-step analysis, which is precisely what we do next:

  ν[a](x) = Ea Σ_{n=1}^∞ 1(Xn = x, n ≤ Ta)
      = Σ_{n=1}^∞ Pa(Xn = x, n ≤ Ta)
      = Σ_{n=1}^∞ Σ_{y∈S} Pa(Xn = x, Xn−1 = y, n ≤ Ta)
      = Σ_{n=1}^∞ Σ_{y∈S} p_{yx} Pa(Xn−1 = y, n ≤ Ta)
      = Σ_{y∈S} p_{yx} Ea Σ_{n=1}^∞ 1(Xn−1 = y, n ≤ Ta)
      = Σ_{y∈S} p_{yx} ν[a](y),

where, in this last step, we used (16): the sum Σ_{n=1}^∞ 1(Xn−1 = y, n ≤ Ta) counts the visits to y at times 0, 1, . . . , Ta − 1, which is exactly the quantity in (15). So we have shown

  ν[a](x) = Σ_{y∈S} p_{yx} ν[a](y),  for all x ∈ S,

which are precisely the balance equations: ν[a] = ν[a] P. Therefore, ν[a] = ν[a] Pⁿ, for all n ∈ N.

Next, let x be such that a ⇝ x, so there exists n ≥ 1 such that p_{ax}^(n) > 0. First note that, by definition, ν[a](a) = 1. Hence

  ν[a](x) ≥ ν[a](a) p_{ax}^(n) = p_{ax}^(n) > 0.

From Theorem 10, x ⇝ a, so there exists m ≥ 1 such that p_{xa}^(m) > 0. Hence

  1 = ν[a](a) ≥ ν[a](x) p_{xa}^(m),

so ν[a](x) ≤ 1/p_{xa}^(m) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then, summing (16) over x,

  Σ_{x∈S} ν[a](x) = Ea Σ_{n=1}^{Ta} Σ_{x∈S} 1(Xn = x) = Ea Σ_{n=1}^{Ta} 1 = Ea Ta,

which, by definition, is finite when a is positive recurrent.
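For the same hypothetical two-state chain used earlier, ν[a] can be computed directly by averaging occupation counts over simulated excursions from the ground state a, confirming both ν[a](a) = 1 and that ν[a](x)/Ea Ta gives the stationary probabilities (a sketch, not part of the notes):

    import random

    p01, p10 = 0.3, 0.2

    def step(x):
        if x == 0:
            return 1 if random.random() < p01 else 0
        return 0 if random.random() < p10 else 1

    # Average occupation counts over excursions from a = 0:
    # nu[x] estimates E_a sum_{0 <= n < T_a} 1(X_n = x).
    a, runs = 0, 100_000
    nu, total_len = {0: 0, 1: 0}, 0
    for _ in range(runs):
        x = a
        while True:
            nu[x] += 1
            total_len += 1
            x = step(x)
            if x == a:
                break

    print(nu[0] / runs, nu[1] / runs)            # about 1.0 and 1.5
    print(nu[0] / total_len, nu[1] / total_len)  # nu/Ea_Ta = (0.4, 0.6) = pi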

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define

  π[a](x) := ν[a](x)/Ea Ta = Ea[ Σ_{n=0}^{Ta−1} 1(Xn = x) ] / Ea Ta,  x ∈ S.  (17)

Then π[a] is a stationary distribution.

Proof. We have already shown, in Proposition 2, that ν[a] satisfies the balance equations. But π[a] is simply a multiple of ν[a], so π[a] P = π[a] as well. Also, π[a](a) = 1/Ea Ta > 0, so π[a] is not trivially equal to zero. It only remains to show that π[a] is a probability, i.e. that it sums up to one. But

  Σ_{x∈S} π[a](x) = (1/Ea Ta) Σ_{x∈S} ν[a](x) = 1,

since Σ_{x∈S} ν[a](x) = Ea Ta, because a is positive recurrent.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique; this is what we will try to show in the sequel. In other words: if there is a positive recurrent state, then the set of linear equations

  π(x) = Σ_{y∈S} π(y) p_{yx},  x ∈ S,  Σ_{y∈S} π(y) = 1,

has at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X0 = i. We already know that j is recurrent. Consider the excursions Xi^(r), r = 0, 1, 2, . . . . Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion, and the probability gij = Pi(Tj < Ti) that j occurs in a specific excursion is positive. In words, to decide whether j will occur in a specific excursion, we toss a coin with probability of heads gij > 0: if heads show up at the r-th excursion, then j occurs in the r-th excursion; otherwise, if tails show up, then the r-th excursion does not contain state j. The coins are i.i.d. Hence j will occur in infinitely many excursions. Let R1 (respectively, R2) be the index of the first (respectively, second) excursion that contains j.

(Figure: excursions from i with indices R1, R1 + 1, . . . , R2; the occurrences of j in excursions R1 and R2 are marked, and the random variable ζ below spans the time between them. Each excursion duration has the same law as Ti under Pi.)

The time between the occurrence of j in excursion R1 and its occurrence in excursion R2 is at most the sum of the durations of the excursions with indices R1, R1 + 1, . . . , R2. So we just need to show that this sum,

  ζ := Σ_{r=1}^{R2−R1+1} Zr,

has finite expectation, where the Zr are i.i.d. with the same law as Ti. But, by Wald's lemma, the expectation of this random variable equals

  Ei ζ = (Ei Ti) × Ei(R2 − R1 + 1) = (Ei Ti) × (gij^{−1} + 1),

because R2 − R1 is a geometric random variable:

  Pi(R2 − R1 = r) = (1 − gij)^{r−1} gij,  r = 1, 2, . . . .

Since gij > 0 and Ei Ti < ∞, we have that Ei ζ < ∞. Hence Ej Tj < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. An irreducible Markov chain is called transient if one (and hence all) states are transient. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.

(Diagram: for an IRREDUCIBLE CHAIN, the states are either all TRANSIENT or all RECURRENT; in the recurrent case, they are either all POSITIVE RECURRENT or all NULL RECURRENT.)

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

(Diagram: state 0 stays at 0 with probability 1/3, moves to state 1 with probability 1/2 and to state 2 with probability 1/6; states 1 and 2 are absorbing.)

Clearly, state 1 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many (infinitely many) stationary distributions: for each 0 ≤ α ≤ 1, the π defined by

  π(0) = 0,  π(1) = α,  π(2) = 1 − α,

is a stationary distribution. (You should think why this is OBVIOUS!) Notice that 1 does not lead to 2. This lack of communication is, indeed, what prohibits uniqueness.
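For the example above, it is straightforward to verify the whole family of stationary distributions mechanically (a sketch; the matrix below encodes the diagram, with the probabilities 1/3, 1/2, 1/6 out of state 0 and states 1, 2 absorbing):

    import numpy as np

    P = np.array([[1/3, 1/2, 1/6],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

    for alpha in (0.0, 0.25, 0.5, 1.0):
        pi = np.array([0.0, alpha, 1 - alpha])
        print(alpha, np.allclose(pi @ P, pi))   # True for every alpha in [0, 1]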

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution and fix a ground state a. Since the chain is irreducible and positive recurrent, we have Pi(Tj < ∞) = 1 for all states i and j. Hence

  P(Ta < ∞) = Σ_{x∈S} π(x) Px(Ta < ∞) = Σ_{x∈S} π(x) = 1.  (18)

In other words, if we start the Markov chain from the stationary distribution π, then Ta is finite with probability one. So consider the chain (Xn, n ≥ 0) with P(X0 = x) = π(x), x ∈ S. Since π is stationary, we have P(Xn = x) = π(x), for all n and all x ∈ S. By (13), following Theorem 14, and by (18), we have, with probability one,

  (1/t) Σ_{n=1}^t 1(Xn = x) → π[a](x),  as t → ∞.

By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded above by 1, we can take expectations of both sides:

  E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] → π[a](x),  as t → ∞.

But

  E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] = (1/t) Σ_{n=1}^t P(Xn = x) = π(x).

Therefore π(x) = π[a](x), for all x ∈ S. Thus, an arbitrary stationary distribution π must be equal to the specific one π[a], and hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π[a] = π[b]. In particular, we obtain the cycle formula:

  Ea[ Σ_{k=0}^{Ta−1} 1(Xk = x) ] / Ea Ta = Eb[ Σ_{k=0}^{Tb−1} 1(Xk = x) ] / Eb Tb,  x ∈ S.

Proof. Both π[a] and π[b] are stationary distributions, so, by the above theorem, they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,

  π(a) = 1/Ea Ta.

Proof. There is only one stationary distribution. Hence π = π[a], and, from the definition of π[a],

  π(a) = π[a](a) = 1/Ea Ta.

In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C: namely, delete all transition probabilities p_{xy} with x ∈ C, y ∉ C. Let π(·|C) be the distribution defined by restricting π to C:

  π(x|C) := π(x)/π(C),  x ∈ C,  where π(C) = Σ_{y∈C} π(y).

We can easily see that we thus obtain a Markov chain with state space C and stationary distribution π(·|C). This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By the above corollary,

  π(i|C) = 1/Ei Ti.

But π(i|C) > 0. Therefore Ei Ti < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider a Markov chain and decompose it into communicating classes. Then to each positive recurrent communicating class C there corresponds a unique stationary distribution π^C,

which assigns positive probability to each state in C. This is defined by picking a state (any state!) a ∈ C and letting π^C := π[a].

Proposition 3. Any stationary distribution π is necessarily of the form

  π = Σ_{C: C positive recurrent communicating class} αC π^C,

where αC ≥ 0, and ΣC αC = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then x belongs to some positive recurrent class C, by Corollary 13. For each such C, define the conditional distribution

  π(x|C) := π(x)/π(C),  where π(C) := Σ_{x∈C} π(x).

Then (π(x|C), x ∈ C) is a stationary distribution for the chain restricted to C. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with αC = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(Xn = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesàro averages (1/n) Σ_{k=1}^n P(Xk = x) converge to π(x), as n → ∞. (A sequence of real numbers an may not converge while its Cesàro averages (1/n)(a1 + · · · + an) do converge; for example, take an = (−1)ⁿ.)

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. Still, we may say that it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:

  ẋ = −ax + b,  x ∈ R,  a > 0.

The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero:

  x* = b/a.

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. Moreover, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

(Figure: several trajectories x(t), started from different initial values, all converging to the equilibrium level x* as t grows.)

There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever.

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:

  Xn+1 = ρXn + ξn.

Specifically, we here assume that X0, ξ0, ξ1, ξ2, . . . are given, mutually independent, random variables, and define a stochastic sequence X1, X2, . . . , by means of the recursion above. We shall assume that the ξn have the same law; otherwise the system would not be time-homogeneous, simply because the ξn's would depend on the time parameter n. It will turn out that |ρ| < 1 is the condition for stability. But what does stability mean? In general, it cannot mean convergence to a fixed equilibrium point. However, notice what happens when we solve the recursion, and we are in a happy situation here, for we can do it:

  X1 = ρX0 + ξ0
  X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0) + ξ1 = ρ²X0 + ρξ0 + ξ1
  X3 = ρX2 + ξ2 = ρ³X0 + ρ²ξ0 + ρξ1 + ξ2
  · · · · · ·
  Xn = ρⁿX0 + ρ^{n−1}ξ0 + ρ^{n−2}ξ1 + · · · + ρξ_{n−2} + ξ_{n−1}.

Since |ρ| < 1, the first term, ρⁿX0, goes to 0 as n → ∞. The second term,

  X̃n := ρ^{n−1}ξ0 + ρ^{n−2}ξ1 + · · · + ρξ_{n−2} + ξ_{n−1},

which is a linear combination of ξ0, . . . , ξ_{n−1}, does not converge. But notice that if we let

  X̂n := ρ^{n−1}ξ_{n−1} + ρ^{n−2}ξ_{n−2} + · · · + ρξ1 + ξ0,

then, since the ξ's are i.i.d., we have

  P(X̃n ≤ x) = P(X̂n ≤ x),  uniformly in x.

Under nice conditions (e.g. assume that the ξn are bounded), X̂n does converge, as n → ∞, to

  X̂∞ = Σ_{k=0}^∞ ρᵏ ξk.

Hence, while the limit of Xn does not exist, the limit of its distribution does:

  lim_{n→∞} P(Xn ≤ x) = lim_{n→∞} P(X̃n ≤ x) = lim_{n→∞} P(X̂n ≤ x) = P(X̂∞ ≤ x).

This is precisely the reason why we cannot, for instance, define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,

  lim_{n→∞} P(Xn = x) = π(x).

In particular, for time-homogeneous Markov chains, the n-step transition matrix Pⁿ converges to a matrix with rows all equal to π:

  lim_{n→∞} p_{ij}^(n) = π(j),  i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law. In other words, we know f(x) = P(X = x), g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define P(X = x, Y = y) = P(X = x)P(Y = y). But suppose that we have some further requirement, for instance, to try to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

  f(x) = Σ_y h(x, y),  g(y) = Σ_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (Xn, n ≥ 0) and the law of a stochastic process Y = (Yn, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

We say that a random time T is a meeting time of the two processes if

  Xn = Yn  for all n ≥ T.

This T is a random variable that takes values in the set of times, or it may take the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,

  |P(Xn = x) − P(Yn = x)| ≤ P(T > n).

Since this holds for all x ∈ S and since the right-hand side does not depend on x, we can write it as

  sup_{x∈S} |P(Xn = x) − P(Yn = x)| ≤ P(T > n).  (19)

If, in particular, T is finite with probability one, i.e. P(T < ∞) = 1, then

  |P(Xn = x) − P(Yn = x)| → 0,  uniformly in x ∈ S, as n → ∞.

Proof. We have

  P(Xn = x) = P(Xn = x, n < T) + P(Xn = x, n ≥ T)
      = P(Xn = x, n < T) + P(Yn = x, n ≥ T)
      ≤ P(n < T) + P(Yn = x),

where the middle equality holds because Xn = Yn on the event {n ≥ T}. Hence

  P(Xn = x) − P(Yn = x) ≤ P(T > n).

Repeating (19) with the roles of X and Y interchanged, we obtain P(Yn = x) − P(Xn = x) ≤ P(T > n). Combining the two inequalities we obtain the first assertion:

  |P(Xn = x) − P(Yn = x)| ≤ P(T > n),  n ≥ 0, x ∈ S.

Assume now that P(T < ∞) = 1. Then lim_{n→∞} P(T > n) = P(T = ∞) = 0, and the second assertion follows from (19).
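The inequality can be watched in action (a sketch with the hypothetical two-state chain used earlier, not from the notes): run two copies from different initial states, independently until they first meet and together afterwards, so that T is a genuine meeting time; then compare |P(Xn = 0) − P(Yn = 0)|, estimated empirically, with P(T > n):

    import random

    p01, p10 = 0.3, 0.2

    def step(x):
        if x == 0:
            return 1 if random.random() < p01 else 0
        return 0 if random.random() < p10 else 1

    N, horizon = 50_000, 30
    not_met = [0] * horizon    # counts of {T > n}
    x0 = [0] * horizon         # counts of {X_n = 0}, with X_0 = 0
    y0 = [0] * horizon         # counts of {Y_n = 0}, with Y_0 = 1
    for _ in range(N):
        x, y, met = 0, 1, False
        for n in range(horizon):
            met = met or (x == y)
            x0[n] += (x == 0)
            y0[n] += (y == 0)
            not_met[n] += (not met)
            x = step(x)
            y = x if met else step(y)   # move together after meeting

    for n in (0, 5, 10, 20):
        gap = abs(x0[n] - y0[n]) / N
        print(n, gap, not_met[n] / N)   # gap <= P(T > n); both tend to 0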

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S. Let pij denote, as usual, the transition probabilities.

We shall consider the laws of two processes. The first process, denoted by X = (Xn, n ≥ 0), has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Yn, n ≥ 0), has the law of a Markov chain with the same transition probabilities pij and Y0 distributed according to the stationary distribution π. The two processes differ only in how the initial states are chosen.

It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(Xn = x, Ym = y). We now couple them by doing the straightforward thing: we assume that X = (Xn, n ≥ 0) is independent of Y = (Yn, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

  T := inf{n ≥ 0 : Xn = Yn}.

Consider the process

  Wn := (Xn, Yn).

Then (Wn, n ≥ 0) is a Markov chain with state space S × S. Its initial state W0 = (X0, Y0) has distribution

  P(W0 = (x, y)) = µ(x)π(y).

Its (1-step) transition probabilities are

  q_{(x,y),(x′,y′)} := P(Wn+1 = (x′, y′) | Wn = (x, y)) = P(Xn+1 = x′ | Xn = x) P(Yn+1 = y′ | Yn = y) = p_{x,x′} p_{y,y′},

and its n-step transition probabilities are

  q_{(x,y),(x′,y′)}^(n) = p_{x,x′}^(n) p_{y,y′}^(n).

From Theorem 3 and the aperiodicity assumption, we have that p_{x,x′}^(n) > 0 and p_{y,y′}^(n) > 0 for all large n, implying that q_{(x,y),(x′,y′)}^(n) > 0 for all large n. Hence (Wn, n ≥ 0) is an irreducible chain.

Notice that

  σ(x, y) := π(x)π(y),  (x, y) ∈ S × S,

is a stationary distribution for (Wn, n ≥ 0). (Indeed, had we started with both X0 and Y0 independent and distributed according to π, then, for all n, Xn and Yn would be independent and distributed according to π.)

Since (Wn, n ≥ 0) is irreducible and possesses a stationary distribution, it is positive recurrent. In particular, it hits any fixed diagonal state (x, x) in finite time with probability one; since T is at most this hitting time, we have

  P(T < ∞) = 1.

Define now a third process by:

  Zn := Xn 1(n < T) + Yn 1(n ≥ T).

Thus, Zn equals Xn before the meeting time of X and Y, and Zn = Yn afterwards. In particular, Z0 = X0, which has an arbitrary distribution µ.

(Figure: the trajectories of X (arbitrary initial distribution), Y (stationary initial distribution) and Z; Z follows X up to the meeting time T and follows Y thereafter.)

Obviously, (Zn, n ≥ 0) is identical in law to (Xn, n ≥ 0): by the strong Markov property, switching from X to Y at the meeting time does not alter the law. Hence (Zn, n ≥ 0) is a Markov chain with transition probabilities pij, and

  P(Xn = x) = P(Zn = x),  for all n and x.

Moreover, T is a meeting time between Z and Y. By the coupling inequality,

  |P(Zn = x) − P(Yn = x)| ≤ P(T > n),  uniformly in x,

which converges to zero, precisely because P(T < ∞) = 1. Since P(Zn = x) = P(Xn = x), and since P(Yn = x) = π(x) for all n, we conclude:

  |P(Xn = x) − π(x)| ≤ P(T > n) → 0,  as n → ∞.

This proves the theorem.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and the pij are p01 = p10 = 1. Then the product chain has state space S × S = {(0, 0), (0, 1), (1, 0), (1, 1)} and the transition probabilities are

  q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.

This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, taking p00 = 0.01, p01 = 0.99, p10 = 1,

then the product chain becomes irreducible.

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature. Indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong"⁶. It is, however, customary to have a consistent terminology, and this is expressed by our terminology.

⁶ We put "wrong" in quotes: terminology cannot be, per se, wrong. The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981).

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. By the results of Section 14, we know we have a unique stationary distribution π and that π(x) > 0 for all states x.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above, p01 = p10 = 1, and suppose X0 = 0. Then Xn = 0 for even n and Xn = 1 for odd n. Therefore p00^(n) = 1 for even n and = 0 for odd n, so limn→∞ p00^(n) does not exist. However, if we modify the chain by taking p00 = 0.01, p01 = 0.99, p10 = 1, then

  lim_{n→∞} p00^(n) = lim_{n→∞} p10^(n) = π(0),  lim_{n→∞} p11^(n) = lim_{n→∞} p01^(n) = π(1),

where π(0), π(1) satisfy

  0.99 π(0) = π(1),  π(0) + π(1) = 1,

so that π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,

  lim_{n→∞} ( p00^(n) p01^(n) ; p10^(n) p11^(n) ) = ( 0.503 0.497 ; 0.503 0.497 ).
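The two-state computations above are easily reproduced (a sketch, not from the notes): the powers of the periodic matrix oscillate, while the powers of the modified, aperiodic matrix converge to the matrix with both rows equal to (π(0), π(1)) ≈ (0.503, 0.497):

    import numpy as np

    periodic  = np.array([[0.0, 1.0],
                          [1.0, 0.0]])
    aperiodic = np.array([[0.01, 0.99],
                          [1.00, 0.00]])

    for n in (10, 11, 100, 101):
        print(n)
        print(np.linalg.matrix_power(periodic, n))    # alternates: I, P, I, ...
        print(np.linalg.matrix_power(aperiodic, n))   # rows -> (0.503, 0.497)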

Suppose now that the period of the chain is d. By the results of §5.4, we can decompose the state space into d classes C0, C1, . . . , Cd−1, such that the chain moves cyclically between them: from C0 to C1, from C1 to C2, and so on, and from Cd−1 back to C0. It is not difficult to see that:

Lemma 8. Let αr := Σ_{i∈Cr} π(i), r = 0, 1, . . . , d − 1. Then αr = 1/d for each r.

Proof. We have that π satisfies the balance equations:

  π(i) = Σ_{j∈S} π(j) p_{ji},  i ∈ S.

Suppose that i ∈ Cr. Then p_{ji} = 0 unless j ∈ Cr−1, so

  π(i) = Σ_{j∈Cr−1} π(j) p_{ji},  i ∈ Cr.

Summing up over i ∈ Cr we obtain

  αr = Σ_{i∈Cr} π(i) = Σ_{j∈Cr−1} π(j) Σ_{i∈Cr} p_{ji} = Σ_{j∈Cr−1} π(j) = αr−1.

Since the αr add up to 1, each one must be equal to 1/d.

Consider now the chain (Xnd, n ≥ 0), i.e. the original chain sampled every d steps. It consists of d closed communicating classes, namely C0, . . . , Cd−1: starting from i ∈ Cr, it remains in Cr. Moreover, it is aperiodic: all states have period 1 for this chain. Hence if we restrict the chain (Xnd, n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply. We shall show that:

Lemma 9. The (unique) stationary distribution of (Xnd, n ≥ 0), restricted to Cr, is

  dπ(i),  i ∈ Cr.

(Note that, by Lemma 8, Σ_{i∈Cr} dπ(i) = d · (1/d) = 1, so this is indeed a probability distribution on Cr.)

This means that:

Theorem 18. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d, stationary distribution π and cyclic classes C0, . . . , Cd−1. Then, for i, j in the same class Cr,

  pij^(nd) = Pi(Xnd = j) → dπ(j),  as n → ∞.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck, j ∈ Cℓ, and let r = ℓ − k (where r is taken to be a number between 0 and d − 1, after reduction modulo d). Then, starting from i, the original chain enters Cℓ in r steps and, thereafter, it remains in Cℓ when sampled every d steps. Hence

  pij^(r+nd) → dπ(j),  as n → ∞.

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities pij^(n), as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then pij^(n) → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then pij^(n) → 0, as n → ∞, for all i ∈ S.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then pij^(n) → 0, as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, . . . , a probability distribution on N,

  p^(n) = (p1^(n), p2^(n), p3^(n), . . .),

where Σ_{i=1}^∞ pi^(n) = 1 and pi^(n) ≥ 0 for all i. Then we can find a sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi^(nk) exists for all i ∈ N.

Proof. Since the numbers p1^(n), n = 1, 2, . . . , are contained between 0 and 1, there must be a subsequence n¹₁ < n¹₂ < · · · such that limk→∞ p1^(n¹ₖ) exists and equals, say, p1. Now consider the numbers p2^(n¹ₖ), k = 1, 2, . . . . Since they are contained between 0 and 1, there must be a subsequence n²ₖ of n¹ₖ such that limk→∞ p2^(n²ₖ) exists and equals, say, p2. It is clear how we continue: for each i, we find a subsequence nⁱₖ of nⁱ⁻¹ₖ such that limk→∞ pi^(nⁱₖ) exists and equals, say, pi. Schematically:

  p1^(n¹₁), p1^(n¹₂), p1^(n¹₃), · · · → p1
  p2^(n²₁), p2^(n²₂), p2^(n²₃), · · · → p2
  p3^(n³₁), p3^(n³₂), p3^(n³₃), · · · → p3
  · · ·

By construction, the sequence n²ₖ of the second row is a subsequence of n¹ₖ of the first row, the sequence n³ₖ of the third row is a subsequence of n²ₖ of the second row, and so on. Now pick the numbers in the diagonal: the diagonal subsequence (nᵏₖ, k = 1, 2, . . .) is eventually a subsequence of each (nⁱₖ, k = 1, 2, . . .). Hence

  pi^(nᵏₖ) → pi,  as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit⁷:

⁷ See Brémaud (1999).

Lemma 11. If (yi, i ∈ N) and (xi^(n), i ∈ N), n = 1, 2, . . . , are sequences of nonnegative numbers, then

  Σ_{i=1}^∞ yi lim inf_{n→∞} xi^(n) ≤ lim inf_{n→∞} Σ_{i=1}^∞ yi xi^(n).

Proof of Theorem 19. Let j be a null recurrent state, and suppose that, for some i, the sequence pij^(n) does not converge to 0. Then we can find a subsequence nk such that

  limk→∞ pij^(nk) = αj > 0.

Consider the probability vector (pix^(nk), x ∈ S). By Lemma 10, we can pick a subsequence (n′k) of (nk) such that

  limk→∞ pix^(n′k) = αx,  for all x ∈ S.

Since Σ_{x∈S} pix^(n′k) = 1, Lemma 11 gives Σ_{x∈S} αx ≤ 1. But Pn+1 = Pn P, i.e.

  pix^(n′k+1) = Σ_{y∈S} piy^(n′k) p_{yx},

and, by Lemma 11,

  limk→∞ pix^(n′k+1) ≥ Σ_{y∈S} (limk→∞ piy^(n′k)) p_{yx}.

Hence

  αx ≥ Σ_{y∈S} αy p_{yx},  for all x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that it is a strict inequality:

  αx0 > Σ_{y∈S} αy p_{yx0},  for some x0 ∈ S.

This would imply that

  Σ_{x∈S} αx > Σ_{x∈S} Σ_{y∈S} αy p_{yx} = Σ_{y∈S} αy Σ_{x∈S} p_{yx} = Σ_{y∈S} αy,

and this is a contradiction: the number Σ_{x∈S} αx cannot be strictly larger than itself. Since αj > 0, we have

  0 < Σ_{x∈S} αx ≤ 1,

and we conclude that

  π(x) := αx / Σ_{y∈S} αy,  x ∈ S,

satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0 because αj > 0. By Corollary 13, state j is positive recurrent: contradiction. Therefore the sequence pij^(n) converges to 0, for every i.

19.2 The general case

It remains to see what the limit of pij^(n) is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, let

  HC := inf{n ≥ 0 : Xn ∈ C}

be the hitting time of the set C, and let ϕC(i) := Pi(HC < ∞). Then

  lim_{n→∞} pij^(n) = ϕC(i) π^C(j).

Proof. If i also belongs to C then ϕC(i) = 1 and the result follows from Theorem 17. In general, starting with the first hitting time decomposition relation of Lemma 7, we have

  pij^(n) = Pi(Xn = j) = Pi(Xn = j, HC ≤ n) = Σ_{m=1}^n Pi(HC = m) Pi(Xn = j | HC = m).

Let µ be the distribution of XHC given that X0 = i and that HC < ∞. From the Strong Markov Property,

  Pi(Xn = j | HC = m) = Pµ(Xn−m = j).

But Theorem 17 tells us that the limit exists regardless of the initial distribution:

  Pµ(Xn−m = j) → π^C(j),  as n → ∞.

Hence

  lim_{n→∞} pij^(n) = π^C(j) Σ_{m=1}^∞ Pi(HC = m) = π^C(j) Pi(HC < ∞) = ϕC(i) π^C(j).

(The result can also be obtained by the so-called Abel's Lemma of Analysis; this is a more old-fashioned way of getting the same result.)

Remark: We can write the result of Theorem 20 as

  lim_{n→∞} pij^(n) = fij / Ej Tj,

where Tj = inf{n ≥ 1 : Xn = j} and fij = Pi(Tj < ∞). Indeed, π^C(j) = 1/Ej Tj, and fij = ϕC(i), because, once the chain enters C, it hits j with probability 1.

Example: Consider the following chain (0 < α, β, γ < 1):

(Diagram: five states; state 1 loops to itself with probability 1 − γ and moves to state 2 with probability γ; state 2 moves to state 1 with probability 1; state 3 moves to state 2 with probability 1 − α and to state 4 with probability α; state 4 moves to state 3 with probability β and to state 5 with probability 1 − β; state 5 is absorbing.)

The communicating classes are I = {3, 4} (inessential states), C1 = {1, 2} (closed class), C2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are:

  π^{C1}(1) = 1/(1 + γ),  π^{C1}(2) = γ/(1 + γ);  π^{C2}(5) = 1.

Let ϕ(i) be the probability that class C1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, while, from first-step analysis,

  ϕ(3) = (1 − α)ϕ(2) + αϕ(4),  ϕ(4) = (1 − β)ϕ(5) + βϕ(3),

whence

  ϕ(3) = (1 − α)/(1 − αβ),  ϕ(4) = β(1 − α)/(1 − αβ).

Both C1, C2 are aperiodic classes. Hence, by Theorem 20, as n → ∞,

  p31^(n) → ϕ(3)π^{C1}(1),  p32^(n) → ϕ(3)π^{C1}(2),  p33^(n) → 0,  p34^(n) → 0,  p35^(n) → (1 − ϕ(3))π^{C2}(5),
  p41^(n) → ϕ(4)π^{C1}(1),  p42^(n) → ϕ(4)π^{C1}(2),  p43^(n) → 0,  p44^(n) → 0,  p45^(n) → (1 − ϕ(4))π^{C2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, for example, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville.

We will avoid giving a precise definition of the term 'ergodic' and only mention what constitutes a result in the area. Theorem 14 is a type of ergodic theorem. And so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14. Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. Consider the quantity ν[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P(Ta < ∞) = 1. Let f be a reward function such that

  Σ_{x∈S} |f(x)| ν[a](x) < ∞.  (20)

Then

  P( lim_{t→∞} [ Σ_{n=0}^t f(Xn) ] / [ Σ_{n=0}^t 1(Xn = a) ] = f̄ ) = 1,  (21)

where

  f̄ := Σ_{x∈S} f(x) ν[a](x).

Proof. As in the proof of Theorem 14, we break the total reward as follows:

  Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt−1} Gr + Gt^first + Gt^last,

where

  Gr = Σ_{Ta^(r) ≤ s < Ta^(r+1)} f(Xs),

while Gt^first, Gt^last are the total rewards over the first and last incomplete cycles, respectively. As in the proof of Theorem 14, we have

  lim_{t→∞} (1/t) Gt^first = lim_{t→∞} (1/t) Gt^last = 0,

with probability one. The Law of Large Numbers tells us that

  lim_{n→∞} (1/n) Σ_{r=1}^n Gr = E G1,

with probability 1, as long as E|G1| < ∞; but this is precisely the condition (20). We only need to check that E G1 = f̄, and this is left as an exercise. Note that the denominator Σ_{n=0}^t 1(Xn = a) equals the number Nt of visits to state a up to time t, and that Nt → ∞ as t → ∞. So if we divide the numerator and the denominator of the fraction in (21) by Nt, and replace n by the subsequence Nt in the Law of Large Numbers, we obtain the result.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a; here, recurrence suffices.

Remark 2: If the chain is irreducible and positive recurrent then, as we saw in the proof of Theorem 14, the denominator divided by t, Nt/t, converges to 1/Ea Ta, while the numerator divided by t converges to f̄/Ea Ta; we thus recover the Ergodic Theorem, Theorem 14. This also justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Consider an irreducible Markov chain with finitely many states. Then at least one of the states must be visited infinitely often; so there are always recurrent states. From Proposition 2 we know that the balance equations

  ν = νP

have at least one solution. Since the number of states is finite, we have

  C := Σ_{i∈S} ν(i) < ∞.

Hence

  π(i) := ν(i)/C,  i ∈ S,

satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0. By Corollary 13, this state must be positive recurrent, and then all states are positive recurrent, because positive recurrence is a class property (Theorem 15) and the chain is irreducible. By Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have Pⁿ → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

22 Time reversibility

Consider a Markov chain (Xn, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, the chain is reversible if, for all n, (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0). Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. It is, in a sense, like playing a film backwards. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, without us being able to tell that it is running backwards. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0) for all n. In particular, the first components of the two vectors must have the same distribution, that is, Xn has the same distribution as X0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:

  π(i) pij = π(j) pji,  i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X0, X1) has the same distribution as (X1, X0), i.e.

  P(X0 = i, X1 = j) = P(X1 = i, X0 = j).

But P(X0 = i, X1 = j) = π(i)pij and P(X0 = j, X1 = i) = π(j)pji, and this means that the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. Using the DBE repeatedly, it is easy to show, by induction, that for all n,

  P(X0 = i0, X1 = i1, . . . , Xn = in) = π(i0) p_{i0 i1} · · · p_{in−1 in} = π(in) p_{in in−1} · · · p_{i1 i0} = P(X0 = in, . . . , Xn = i0).

Hence (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn−1, . . . , X0), for all n, and the chain is reversible. (The DBE imply immediately that (X0, X1) has the same distribution as (X1, X0); one more application shows that (X0, X1, X2) has the same distribution as (X2, X1, X0); and so on.)

We will also show the Kolmogorov's loop criterion (KLC):

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion holds: for all i1, i2, . . . , im ∈ S and all m ≥ 3,

  p_{i1 i2} p_{i2 i3} · · · p_{im−1 im} p_{im i1} = p_{i1 im} p_{im im−1} · · · p_{i2 i1}.  (22)

Proof. Suppose first that the chain is reversible. Since the chain is irreducible, we have π(i) > 0 for all i. Pick three states i, j, k and write, using the DBE,

  π(i) pij pjk pki = [π(j) pji] pjk pki = [π(j) pjk] pji pki = [π(k) pkj] pji pki = [π(k) pki] pji pkj = [π(i) pik] pji pkj = π(i) pik pkj pji.

Hence, cancelling the π(i) from the above display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix two states i1, i2 and sum (22) over all choices of the intermediate states i3, . . . , im; this gives

  p_{i1 i2} p_{i2 i1}^{(m−1)} = p_{i1 i2}^{(m−1)} p_{i2 i1}.

If the chain is aperiodic, then, letting m → ∞, we have p_{i2 i1}^{(m−1)} → π(i1) and p_{i1 i2}^{(m−1)} → π(i2), implying that

  π(i1) p_{i1 i2} = π(i2) p_{i2 i1}.

These are the DBE, and the chain is reversible.

Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0: if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible. 2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j, i.e. that we have separation of variables: pij/pji = f(i)g(j).

that the chain is positive recurrent. j for which pij > 0. in addition.(Convention: we form this ratio only when pij > 0. then the chain is reversible. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. Pick two distinct states i.e. so we have pij g(j) = . to guarantee. consider the chain: Replace it by: This is a tree.) Let A. Ac be the sets in the two connected components. The reason is as follows.1) which gives π(i)pij = π(j)pji . the graph becomes disconnected. (If it didn’t. And hence π can be found very easily. then we need. pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . by assumption. Ac ) = F (Ac . Hence the chain is reversible. A) (see §4. So g satisfies the balance equations. using another method.) Necessarily. j for which pij > 0. Example 1: Chain with a tree-structure. For example. namely the arrow from i to j. If the graph thus obtained is a tree. the two oriented edges from i to j and from j to i by a single edge without arrow. • If the chain has infinitely many states. i. and. 60 . the detailed balance equations. and by replacing. meaning that it has no loops. and the arrow from j to j is the only one from Ac to A. If we remove the link between them. There is obviously only one arrow from AtoAc . Hence the original chain is reversible. for each pair of distinct states i. there would be a loop. f (i)g(i) must be 1 (why?). Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. there isn’t any. The balance equations are written in the form F (A. Form the graph by deleting all self-loops.

if j is a neighbour of i otherwise. In all cases. i∈S where |E| is the total number of edges. 61 . j are not neighbours then both sides are 0. where C is a constant. so both members are equal to C. the DBE hold. 0.Example 2: Random walk on a graph. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. To see this let us verify that the detailed balance equations hold. a finite set. Suppose the graph is connected. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). For example.e. Each edge connects a pair of distinct vertices (states) without orientation. The stationary distribution assigns to a state probability which is proportional to its degree. For each state i define its degree δ(i) = number of neighbours of i. The stationary distribution of a RWG is very easy to find. etc. A RWG is a reversible chain. Hence we conclude that: 1. Now define the the following transition probabilities: pij = 1/δ(i). π(i) = 1. i∈S Since δ(i) = 2|E|. States j are i are called neighbours if they are connected by an edge. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. π(i)pij = π(j)pji . we have that C = 1/2|E|. There are two cases: If i. We claim that π(i) = Cδ(i). i. If i. and a set E of undirected edges. p02 = 1/4. 2. Consider an undirected graph G with vertices S. The constant C is found by normalisation.

if.e. . (m) (m) 23. . Define Yr := XT (r) . n ≥ 0) be Markov with transition probabilities pij . . . . if nk = n0 + km. how can we create new ones? 23. 62 n ∈ N. 2. . . . Let (St . 2. 1.d. . If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). i.i. If the subsequence forms an arithmetic progression.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. . π is a stationary distribution for the original chain. i. in general. A r = 1. . random variables with values in N and distribution λ(n) = P (ξ0 = n). 2. . where pij are the m-step transition probabilities of the original chain. St+1 = St + ξt .3 Subordination Let (Xn . irreducible and positive recurrent). 2. k = 0. 1. k = 0. in addition. In this case. k = 0. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . then (Yk . . then it is a stationary distribution for the new chain. The state space of (Yr ) is A. . t ≥ 0. where ξt are i. t ≥ 0) be independent from (Xn ) with S0 = 0. . k = 0. this process is also a Markov chain and it does have time-homogeneous transition probabilities. . Due to the Strong Markov Property.e. . 2. 1.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . 23.e.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence.23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). Let A be (r) a set of states and let TA be the r-th visit to A. 1. The sequence (Yk .) has the Markov property but it is not time-homogeneous.

we have: Lemma 13. If. . . We need to show that P (Yn+1 = β | Yn = α. .. conditional on Yn the past is independent of the future. and the elements of S ′ by α. Yn−1 = αn−1 }. .e. β. Then (Yn ) has the Markov property. . j. Fix α0 . . ∞ λ(n)pij . . then the process Yn = f (Xn ). Indeed. in general. (Notation: We will denote the elements of S by i. Yn−1 ) = P (Yn+1 = β | Yn = α). α1 . . . is NOT. knowledge of f (Xn ) implies knowledge of Xn and so. . β). (n) 23.4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . .. f is a one-to-one function then (Yn ) is a Markov chain. a Markov chain. i. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). however. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. Proof.Then Yt := XSt .) More generally. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . . n≥0 63 . . that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). .

Define Yn := |Xn |. for all i ∈ Z and all β ∈ Z+ . Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). We have that P (Yn+1 = β|Xn = i) = Q(|i|. β) P (Yn = α|Xn = i. But we will show that P (Yn+1 = β|Xn = i). β ∈ Z+ . conditional on {Xn = i}. β) P (Yn = α|Yn−1 ) i∈S = Q(α. if we let 1/2. β). i. But P (Yn+1 = β | Yn = α. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. regardless of whether i > 0 or i < 0. Example: Consider the symmetric drunkard’s random walk. because the probability of going up is the same as the probability of going down. if i = 0. indeed. β). P (Xn+1 = i ± 1|Xn = i) = 1/2. depends on i only through |i|. i ∈ Z. In other words. if β = α ± 1. if β = 1. Then (Yn ) is a Markov chain. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. α > 0 Q(α. and so (Yn ) is. β) P (Yn = α. observe that the function | · | is not one-to-one. Yn−1 )P (Xn = i | Yn = α. Thus (Yn ) is Markov with transition probabilities Q(α. Yn = α.e. and if i = 0. i ∈ Z. then |Xn+1 | = 1 for sure. On the other hand. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β) := 1. To see this. Yn−1 ) Q(f (i). Indeed. a Markov chain with transition probabilities Q(α.for all choices of the variables involved. β). β)P (Xn = i | Yn = α. α = 0. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. β). Yn−1 ) Q(f (i). Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. 64 . β) i∈S P (Yn = α|Xn = i.

then we see that the above equation is of the form Xn+1 = f (Xn . Then P (ξ = 0) + P (ξ ≥ 2) > 0. n ≥ 0) has the Markov property. with probability 1. . Back to the general case. in general. . But here is an interesting special case. ξ (1) . . Thus. Hence all states i ≥ 1 are inessential. . The state 0 is irrelevant because it cannot be reached from anywhere. are i. . If we let ξ (n) = ξ1 . each individual gives birth to at one child with probability p or has children with probability 1 − p. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. ξ2 . if the current population is i there is probability pi that everybody gives birth to zero offspring. if the 65 (n) (n) . 2. we have that (Xn . (and independent of the initial population size X0 –a further assumption). . therefore transient. since ξ (0) . 1. Therefore. and. ξ (n) ). We let pk be the probability that an individual will bear k offspring. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. independently from one another. . . We find the population (n) at time n + 1 by adding the number of offspring of each individual.d. It is a Markov chain with time-homogeneous transitions. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. and 0 is always an absorbing state. j In this case. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. In other words.i. 0 But 0 is an absorbing state. Let ξk be the number of offspring of the k-th individual in the n-th generation. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. So assume that p1 = 1. so we will find some different method to proceed. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. if 0 is never reached then Xn → ∞. a hard problem. The parent dies immediately upon giving birth. the population cannot grow: it will eventually reach 0. which has a binomial distribution: i j pij = p (1 − p)i−j .. . and following the same distribution. Hence all states i ≥ 1 are transient. one question of interest is whether the process will become extinct or not. k = 0. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . We will not attempt to compute the pij neither draw the transition diagram because they are both futile. 0 ≤ j ≤ i.24 24. p0 = 1 − p. Example: Suppose that p1 = p.

we can argue that ε(i) = εi . we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. If µ ≤ 1 then the ε = 1. . and P (Dk ) = ε. Therefore the problem reduces to the study of the branching process starting with X0 = 1. ε(0) = 1. ϕn (0) = P (Xn = 0). let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. For each k = 1. This function is of interest because it gives information about the extinction probability ε. 66 . when we start with X0 = i. if we start with X0 = i in ital ancestors. Obviously. . We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn .population does not become extinct then it must grow to infinity. Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. i. The result is this: Proposition 5. Indeed. so P (D|X0 = i) = εi . We then have that D = D1 ∩ · · · ∩ Di . For i ≥ 1. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. where ε = P1 (D). and. Indeed. We wish to compute the extinction probability ε(i) := Pi (D). If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. Proof. . since Xn = 0 implies Xm = 0 for all m ≥ n. . Let ε be the extinction probability of the process. each one behaves independently of one another.

Notice now that µ= d ϕ(z) dz .Note also that ϕ0 (z) = 1. Therefore ε = ϕ(ε). But ϕn+1 (0) → ε as n → ∞. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . ε = ε∗ . So ε ≤ ε∗ < 1. On the other hand. Therefore ε = ε∗ .) 67 . ϕn+1 (z) = ϕ(ϕn (z)). But 0 ≤ ε∗ . if µ > 1. By induction. z=1 Also. But ϕn (0) → ε. ϕ(ϕn (0)) → ϕ(ε). This means that ε = 1 (see left figure below). and so on. ϕn+1 (0) = ϕ(ϕn (0)). Also ϕn (0) → ε. Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). and since the latter has only two solutions (z = 0 and z = ε∗ ). in fact. all we need to show is that ε < 1. ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). and. In particular. then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. Since both ε and ε∗ satisfy z = ϕ(z). (See right figure below. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. Let us show that. recall that ϕ is a convex function with ϕ(1) = 1. since ϕ is a continuous function. Therefore. We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). if the slope µ at z = 1 is ≤ 1. ϕn (0) ≤ ε∗ . = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)).

. 2. there is collision and the transmission is unsuccessful. if Sn = 1. . then time slots 0. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. 2. say. Thus. If s ≥ 2. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. Then ε = 1. 1. there are x + a packets. on the next times slot. for a packet to go through. Case: µ > 1. otherwise. random variables with some common distribution: P (An = k) = λk . and Sn the number of selected packets. Then ε < 1. Let s be the number of selected packets. 4. there is a collision and. 1. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. periodically. . 3. there are x + a − 1 packets. The problem of communication over a common channel. 2.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). Ethernet has replaced Aloha. Xn+1 = Xn + An . on the next times slot. 3. 0). 3 hosts. . k ≥ 0. 1. is that if more than one hosts attempt to transmit a packet. are allocated to hosts 1. . So.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. using the same frequency. then max(Xn + An − 1. We assume that the An are i. Nowadays. So if there are. During this time slot assume there are a new packets that need to be transmitted. we let hosts transmit whenever they like. One obvious solution is to allocate time slots to hosts in a specific order. 6. An the number of new packets on the same time slot. 3. If collision happens. . 5. there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea. then the transmission is not successful and the transmissions must be attempted anew. only one of the hosts should be transmitting at a given time. If s = 1 there is a successful transmission and. 24.i. 68 . Since things happen fast in a computer network.. in a simplified model.d.

Sn = 1|Xn = x) = λ0 xpq x−1 . Therefore ∆(x) ≥ µ/2 for all large x. k 0 ≤ k ≤ x + a. we have q < 1. and exactly one selection: px. We also let µ := k≥1 kλk be the expectation of An .e. Xn−1 . .x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k . Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). . defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. The reason we take the maximum with 0 in the above equation is because. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). So: px. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. So P (Sn = k|Xn = x.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . i. we think as follows: to have a reduction of x by 1. k ≥ 1. For x > 0. Unless µ = 0 (which is a trivial case: no arrivals!).e.x−1 when x > 0. Let us find its transition probabilities. and so q x tends to 0 much faster than G(x) tends to ∞. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . selections are made amongst the Xn + An packets. n ≥ 0) is a Markov chain. Proving this is beyond the scope of the lectures. and p. . we should not subtract 1. we have ∆(x) = (−1)px. The random variable Sn has. then the chain is transient. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. The process (Xn .x−1 + k≥1 kpx. 69 . An = a. Indeed. Idea of proof. To compute px.x−1 = P (An = 0. so q x G(x) tends to 0 as x → ∞. An−1 . where q = 1 − p.) = x + a k x−k p q . if Xn + An = 0. So we can make q x G(x) as small as we like by choosing x sufficiently large. We here only restrict ourselves to the computation of this expected change. for example we can make it ≤ µ/2. we must have no new packets arriving. where G(x) is a function of the form α + βx . the Markov chain (Xn ) is transient. The probabilities px. we have that the drift is positive for all large x and this proves transience. The main result is: Proposition 6. Since p > 0.where λk ≥ 0 and k≥0 kλk = 1.x are computed by subtracting the sum of the above from 1. To compute px.x+k for k ≥ 1. The Aloha protocol is unstable.

Comments: 1. the chain moves to a random page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. It uses advanced techniques from linear algebra to speed up computations. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. α ≈ 0. if L(i) = ∅   0. We pick α between 0 and 1 (typically. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. In an Internet-dominated society. then i is a ‘dangling page’. If L(i) = ∅. All you need to do to get high rank is pay (a lot of money).3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. 3. PageRank. Its implementation though is one of the largest matrix computations in the world. unless i is a dangling page.2) and let α pij := (1 − α)pij + . So we make the following modification. Define transition probabilities as follows:  1/|L(i)|. So this is how Google assigns ranks to pages: It uses an algorithm. For each page i let L(i) be the set of pages it links to. the rank of a page may be very important for a company’s profits. because of the size of the problem. otherwise. neither that it is aperiodic. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. So if Xn = i is the current location of the chain. Academic departments link to each other by hiring their 70 . there are companies that sell high-rank pages. The idea is simple. that allows someone to manipulate the PageRank of a web page and make it much higher. There is a technique. if j ∈ L(i)  pij = 1/|V |. Therefore.24. 2. Then it says: page i has higher rank than j if π(i) > π(j). It is not clear that the chain is irreducible. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. known as ‘302 Google jacking’. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. in the latter case. |V | These are new transition probabilities. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. the next location Xn+1 is picked uniformly at random amongst all pages that i links to.

Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. we will assume that. This number could be independent of the research quality but. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. To simplify. The alleles in an individual come in pairs and express the individual’s genotype for that gene. the genetic diversity depends only on the total number of alleles of each type. a dominant (say A) and a recessive (say a). If so. One of the square alleles is selected 4 times. 4. 24. all we need is the information outlined in this figure.faculty from each other (and from themselves).g. Three of the square alleles are never selected. One of the circle alleles is selected twice. generation after generation. We will assume that a child chooses an allele from each parent at random. a gene controlling a specific characteristic (e. We have a transition from x to y if y alleles of type A were produced. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. We would like to study the genetic diversity. meaning how many characteristics we have. this allele is of type A with probability x/2N . So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). When two individuals mate. in each generation. Do the same thing 2N times. What is important is that the evaluation can be done without ever looking at the content of one’s research. colour of eyes) appears in two forms. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. Since we don’t care to keep track of the individuals’ identities. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). 71 . Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. 5. it makes the evaluation procedure easy. from the point of view of an administrator. then. The process repeats. Thus. We will also assume that the parents are replaced by the children. This means that the external appearance of a characteristic is a function of the pair of alleles. Imagine that we have a population of N individuals who mate at random producing a number of offspring. But an AA mating with a Aa may produce a child of type AA or one of type Aa. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. their offspring inherits an allele from each parent.4 The Wright-Fisher model: an example from biology In an organism. And so on. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. maintain this allele for the next generation. two AA individuals will produce and AA child.

The fact that M is even is irrelevant for the mathematical model. ϕ(M ) = 1. y ≤ 2N − 1. Similarly. on the average. Clearly. X0 ) = 1 − Xn . since. ϕ(x) = y=0 pxy ϕ(y). . . M Therefore. . These are: M ϕ(0) = 0.) One can argue that. . . and the population becomes one consisting of alleles of type a only. . we see that 2N x 2N −y x y pxy = . . Tx := inf{n ≥ 1 : Xn = x}. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected.e. . . the random variables ξ1 . . .2N = 1. i. 2N − 1} form a communicating class of inessential states. as usual. . the number of alleles of type A won’t change. E(Xn+1 |Xn . X0 ) = E(Xn+1 |Xn ) = M This implies that. . EXn+1 = EXn Xn = Xn . . . at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). (Here. . . We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. but the argument needs further justification because “the end of time” is not a deterministic time. n n where. 0 ≤ y ≤ 2N. Since pxy > 0 for all 1 ≤ x. . so both 0 and 2N are absorbing states. ϕ(x) = x/M. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. the chain gets absorbed by either 0 or 2N . Since selections are independent. . the states {1. with n P (ξk = 1|Xn .i. for all n ≥ 0. p00 = p2N. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . Clearly.e. the expectation doesn’t change and. Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. since. conditionally on (X0 . . we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . on the average. i. . . Let M = 2N . and the population becomes one consisting of alleles of type A only. M and thus. ξM are i. Xn ).Each allele of type A is selected with probability x/2N .d. M n P (ξk = 0|Xn . 72 . The result is correct. . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. . X0 ) = Xn .

the demand is only partially satisfied. for all n ≥ 0. b). Since the Markov chain is represented by this very simple recursion. the demand is met instantaneously.The first two are obvious. on the average. provided that enough components are available. 73 . . at time n + 1. . M −1 y−1 x M y−1 1− x . We will apply the so-called Loynes’ method. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. If the demand is for Dn components. there are Xn + An − Dn components in the stock. Xn electronic components at time n. 2. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. Let ξn := An − Dn .d. M x M (M −1)−(y−1) x x +1− M M M −1 = 24. n ≥ 0) are i. We will show that this Markov chain is positive recurrent. . Thus. Working for n = 1. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. We have that (Xn . Dn ). Here are our assumptions: The pairs of random variables ((An . A number An of new components are ordered and added to the stock. n ≥ 0) is a Markov chain. while a number of them are sold to the market. say. Otherwise. and the stock reached level zero at time n + 1. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. so that Xn+1 = (Xn + ξn ) ∨ 0.i. the demand is. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . larger than the supply per unit of time. we will try to solve the recursion. and so. and E(An − Dn ) = −µ < 0.5 A storage or queueing application A storage facility stores.

ξ0 ). we reverse it. Notice. In particular. (n) . and. Indeed. (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . . that d (ξ0 . ξn . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy .Thus. The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). . . . . for y ≥ 0. n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. for all n. .d. . so we can shuffle them in any way we like without altering their joint distribution. 74 . instead of considering them in their original order. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). however. . ξn−1 ) = (ξn−1 . . Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. . Therefore. = x≥0 Since the n-dependent terms are bounded. . . . where the latter equality follows from the fact that. . in computing the distribution of Mn which depends on ξ0 . .i. we find π(y) = x≥0 π(x)pxy . . we may take the limit of both sides as n → ∞. the event whose probability we want to compute is a function of (ξ0 . . ξn−1 ). we may replace the latter random variables by ξ1 . Hence. the random variables (ξn ) are i. using π(y) = limn→∞ P (Mn = y). . ξn−1 . . .

75 . . Here. Z2 . . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. We must make sure that it also satisfies y π(y) = 1. Z2 . . as required. . n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . the Markov chain may be transient or null recurrent. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. The case Eξn = 0 is delicate. we will use the assumption that Eξn = −µ < 0.} < ∞) = 1. . This will be the case if g(y) is not identically equal to 1. Depending on the actual distribution of ξn . . n→∞ Hence g(y) = P (0 ∨ max{Z1 . n→∞ This implies that P ( lim Zn = −∞) = 1.} > y) is not identically equal to 1.So π satisfies the balance equations.

we won’t go beyond Zd . if x ∈ {±e1 . here. A random walk possesses an additional property. otherwise where p + q = 1. . then p(1) = p.g. so far. we can let S = Zd . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. d   0. d  p(x) = qj . but. given a function p(x). if x = ej . 0. the skip-free property. To be able to say this. the set of d-dimensional vectors with integer coordinates. . . where p1 + · · · + pd + q1 + · · · + qd = 1. we need to give more structure to the state space S. groups) are allowed. For example. have been processes with timehomogeneous transition probabilities. besides time. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. . This is the drunkard’s walk we’ve seen before. . namely.y = p(y − x). p(x) = 0. Example 1: Let p(x) = C2−|x1 |−···−|xd | .and space-homogeneity.y = px+z. this walk enjoys. This means that the transition probability pxy should depend on x and y only through their relative positions in space. p(−1) = q. a random walk is a Markov chain with time. otherwise. a structure that enables us to say that px.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. x ∈ Zd . j = 1. . j = 1. . Example 3: Define p(x) = 1 2d . A random walk of this type is called simple random walk or nearest neighbour random walk. If d = 1. if x = −ej . . Our Markov chains. . More general spaces (e.y+z for any translation z. that of spatial homogeneity. . Define  pj . . . where C is such that x p(x) = 1. So. In dimension d = 1. ±ed } otherwise 76 .and space-homogeneous transition probabilities given by px.

We refer to ξn as the increments of the random walk. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). If d = 1. . then p(1) = p(−1) = 1/2. p′ (x). x ∈ Zd . the initial state S0 is independent of the increments. The convolution of two probabilities p(x). ξ2 . xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. computing the n-th order convolution for all n is a notoriously hard problem. we can reduce several questions to questions about this random walk. But this would be nonsense. p(x) = 0. (When we write a sum without indicating explicitly the range of the summation variable. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . So. are i. We call it simple symmetric random walk. p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). . . . We will let p∗n be the convolution of p with itself n times. Obviously.i. where ξ1 . in general. . y p(x)p′ (y − x). we will mean a sum over all possible values. Furthermore. One could stop here and say that we have a formula for the n-step transition probability. (Recall that δz is a probability distribution that assigns probability 1 to the point z.) Now the operation in the last term is known as convolution.) In this notation. otherwise. x + ξ2 = y) = y p(x)p(y − x). Furthermore. p ∗ δz = p.d. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . because all we have done is that we dressed the original problem in more fancy clothes. x ± ed . The special structure of a random walk enables us to give more detailed formulae for its behaviour. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). We shall resist the temptation to compute and look at other 77 . . It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p.This is a simple random walk. and this is the symmetric drunkard’s walk. . Let us denote by Sn the state of the walk at time n. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. which. Indeed. random variables in Zd with common distribution P (ξn = x) = p(x). in addition.

So. First. Will the random walk ever return to where it started from? B.i−1 = 1/2 for all i ∈ Z. all possible values of the vector are equally likely. After all. and so #A P (A) = n . In the sequel. . ξn ). we are dealing with a uniform distribution on the sample space {−1. Events A that depend only on these the first n random signs are subsets of this sample space. .2) (4. we shall learn some counting methods.0) We can thus think of the elements of the sample space as paths. 2 where #A is the number of elements of A. + (1. ξn = εn ) = 2−n . Look at the distribution of (ξ1 . +1}n .i+1 = pi. Here are a couple of meaningful questions: A. . 0) → (1. 1) → (2. Clearly. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . a path of length n. how long will that take? C. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. 1) → (4.methods. −1} then we have a path in the plane that joins the points (0. consider the following: 78 .1) + (0. we have P (ξ1 = ε1 . 1}n . If yes. +1}n . Since there are 2n elements in {0. 1): (2. . The principle is trivial. random variables (random signs) with P (ξn = ±1) = 1/2. . 2) → (3. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. . +1. .1) + − (5. −1. if we can count the number of elements of A we can compute its probability. and the sequence of signs is {+1.d. 2) → (5. to each initial position and to each sequence of n signs. As a warm-up. but its practice may be hard. Therefore. +1}n . If not. . for fixed n.i.1) − (3. A ⊂ {−1.2) where the ξn are i. +1. a random vector with values in {−1. we should assign. So if the initial position is S0 = 0.

047. y) is given by: #{(0. 0) and end up at (n. We can easily count that there are 6 paths in A. y)}. 0)). 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. S6 < 0. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0.1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. We use the notation {(m. i). S7 < 0}.0) The lightly shaded region contains all paths of length n that start at (0. y) is given by: #{(m. x) (n. y)} = n−m . The paths comprising this event are shown below. the number of paths that start from (m. 0) and ends at (n. Proof. S4 = −1. (n. and compute the probability of the event A = {S2 ≥ 0.Example: Assume that S0 = 0. i).) The darker region shows all paths that start at (0. Hence (n+i)/2 = 0. 0) (n.i) (0. In if n + i is not an even number then there is no path that starts from (0. x) (n. 0) and end up at the specific point (n. Let us first show that Lemma 14. x) and end up at (n. i). in this case. (n − m + y − x)/2 (23) Remark. Consider a path that starts from (0. (There are 2n of them. The number of paths that start from (0. x) and end up at (n. 0) and n ends at (n. n 79 . y)} =: {all paths that start at (m. i)} = n n+i 2 More generally.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

1) (n. y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . S2 > 0. The probability that. . n 2−n . P0 (Sn = y) 2 (n + y) + 1 . for now. 0) (n. y)} . If n. we shall just compute it. . P0 (S1 ≥ 0 . y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . if n is even. y) that do not touch zero} × 2−n #{paths (0.3 Some conditional probabilities Lemma 17. . . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 .2 The ballot theorem Theorem 25. We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). n This is called the ballot theorem and the reason will be given in Section 29. if n is even. n 2 #{paths (m. . given that S0 = 0 and Sn = y > 0. . x) 2n−m (n. 27. Sn ≥ 0 | Sn = 0) = 84 1 . Sn ≥ 0 | Sn = y) = 1 − In particular. But. 0) (n. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. 27. n−m 2−n−m . . The left side of (30) equals #{paths (1. n/2 (29) #{(0. . . the random walk never becomes zero up to time n is given by P0 (S1 > 0.P (Sn = y|Sm = x) = In particular. n (30) Proof. Such a simple answer begs for a physical/intuitive explanation. . y)} . Sn > 0 | Sn = y) = y .

. . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. . . ξ2 + ξ3 ≥ 0. . . . 1 + ξ2 + ξ3 > 0. . . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. P0 (S1 ≥ 0. . 85 . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . . . . . . •). . . . . . . . (b) follows from symmetry. P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0.Proof. (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . . . . . ξ3 . . . . . . . Sn ≥ 0) = #{(0. Sn ≥ 0) = P0 (S1 = 0. . . . . . Sn = 0) = P0 (S1 > 0. Sn = 0) = P0 (Sn = 0). and Sn−1 > 0 implies that Sn ≥ 0. Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. We next have P0 (S1 = 0. . 0) (n. . Sn < 0) (b) (a) = 2P0 (S1 > 0. . . . . . +1 27. . . Sn ≥ 0). ξ2 . . ξ3 . . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. We use (27): P0 (Sn ≥ 0 . (e) follows from the fact that ξ1 is independent of ξ2 . If n is even.4 Some remarkable identities Theorem 26. . . . (f ) follows from the replacement of (ξ2 . Sn > 0) + P0 (S1 < 0. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . . . . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). (c) (d) (e) (f ) (g) where (a) is obvious. . n/2 where we used (28) and (29). . . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. Proof. For even n we have P0 (S1 ≥ 0. . and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. Sn ≥ 0.).) by (ξ1 .

. Sn = 0) = 2P0 (Sn−1 > 0. Sn−1 > 0. Put a two-sided mirror at a. . Sn−1 < 0. Sn = 0) Now. we can omit the last event. By the Markov property at time n − 1. Let n be even. . Fix a state a. 86 . we computed. . This is not so interesting. We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. . So u(n) f (n) = . To take care of the last term in the last expression for f (n). It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. the probability u(n) that the walk (started at 0) attains value 0 in n steps. n−1 Proof. . . So f (n) = 2P0 (S1 > 0.27. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . for even n. Sn = 0) = u(n) . . . . Sn−2 > 0|Sn−1 = 1). we see that Sn−1 > 0. . P0 (Sn−1 > 0. . So 1 f (n) = 2u(n) P0 (S1 > 0. Sn = 0. Sn = 0). and u(n) = P0 (Sn = 0). given that Sn = 0. . n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. Sn−1 = 0. Also. and if the particle is to the left of a then its mirror image is on the right. Sn = 0 means Sn−1 = 1. . Theorem 27. . Sn−1 > 0. . Sn−2 > 0|Sn−1 > 0.5 First return to zero In (29). . Sn = 0) P0 (S1 > 0. with equal probability. . we either have Sn−1 = 1 or −1. . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). . f (n) = P0 (S1 > 0. Since the random walk cannot change sign before becoming zero. . . If a particle is to the right of a then you see its image on the left. . Then f (n) := P0 (S1 = 0. So the last term is 1/2. By symmetry. Sn = 0) + P0 (S1 < 0. . the two terms must be equal. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0).

m ≥ 0) is a simple symmetric random walk. Sn ≥ a). the process (STa +m − a. P0 (Sn ≥ a) = P0 (Ta ≤ n. called (Sn . Let (Sn . Trivially.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. 2a − Sn ≥ a) = P0 (Ta ≤ n. In the last event. for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). Sn ≤ a) + P0 (Sn > a). S1 . observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). n ≥ 0) be a simple symmetric random walk and a > 0. In other words. Sn > a) = P0 (Ta ≤ n. . Proof. as a corollary to the above result. Sn ). Then Ta = inf{n ≥ 0 : Sn = a} Sn .What is more interesting is this: Run the particle till it hits a (it will. Thus. So Sn − a can be replaced by a − Sn in the last display and the resulting process. n ∈ Z+ ) has the same law as (Sn . . If Sn ≥ a then Ta ≤ n. Sn = Sn . n < Ta . so we may replace Sn by 2a − Sn (Theorem 28). n < Ta . But P0 (Ta ≤ n) = P0 (Ta ≤ n. 2 Define now the running maximum Mn = max(S0 . independent of the past before Ta . The last equality follows from the fact that Sn > a implies Ta ≤ n. we have n ≥ Ta . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. n ∈ Z+ ). a + (Sn − a). But a symmetric random walk has the same law as its negative. n ≥ Ta . . 2 Proof. Then. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). 28. Theorem 28. we obtain 87 . Finally. . define Sn = where is the first hitting time of a. Sn ≤ a) + P0 (Ta ≤ n. 2a − Sn . n ≥ Ta By the strong Markov property. Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). Sn ≤ a).

as long as a ≥ b. combined with (31). Paris. and anciently for holding ballots to be drawn. e 10 Andr´. The set of values of η contains n elements and η is uniformly distributed a in this set. then. gives the result. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. J. An urn is a vessel. If an urn contains a azure items and b = n−a black items. . during sampling without replacement. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). Sn = y) = P0 (Tx ≤ n. 2 Proof. (1887). n = a + b. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. Sn = y). . Compt. 369. Solution directe du probl`me r´solu par M. as for holding liquids. the number of selected azure items is always ahead of the black ones equals (a − b)/n. If Tx ≤ n. Solution d’un probl`me. 105. we have P0 (Mn < x. then the probability that. This. employed for different purposes. which results into P0 (Tx ≤ n. (31) x > y. Since Mn < x is equivalent to Tx > n. The answer is very simple: Theorem 31. Acad. ηn ). Sn = 2x − y). 9 Bertrand. It is the latter use we are interested in in Probability and Statistics. of which a are coloured azure and b black. Compt. Rend. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30.Corollary 19. Sci. . A sample from the urn is a sequence η = (η1 . Rend. where ηi indicates the colour of the i-th item picked. Sn = y) = P0 (Tx > n. e e e Paris. Sn = y) = P0 (Sn = 2x − y). ornamental objects. Bertrand. P0 (Mn < x. usually a vase furnished with a foot or pedestal. Proof. (1887). 8 88 . The original ballot problem. the ashes of the dead after cremation. D. Sci. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . by applying the reflection principle (Theorem 28). Acad. . But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. 436-437. 105.

ηn ) an urn process with parameters n. . . the sequence (ξ1 .We shall call (η1 . . ξn ) is uniformly distributed in a set with n elements. . . i ∈ Z. n . (ξ1 . We need to show that. But if Sn = y then. a we have that a out of the n increments will be +1 and b will be −1. . . . but which is not necessarily symmetric. .i−1 = q = 1 − p. n n −n 2 (n + y)/2 a Thus. the sequence (ξ1 . . +1} such that ε1 + · · · + εn = y. n Since y = a − (n − a) = a − b.i+1 = p. the result follows. . So the number of possible values of (ξ1 . Since the ‘azure’ items correspond to + signs (respectively. conditional on {Sn = y}. Consider a simple symmetric random walk (Sn . −n ≤ k ≤ n. . . conditional on {Sn = y}. . . It remains to show that all values are equally a likely. . ξn ) is. we have: P0 (ξ1 = ε1 . a. Sn > 0} that the random walk remains positive. . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. We have P0 (Sn = k) = n n+k 2 pi. Then. . . . the ‘black’ items correspond to − signs). the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. . conditional on the event {Sn = y}. Let ε1 . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. . . The values of each ξi are ±1. (The word “asymmetric” will always mean “not necessarily symmetric”. . . b be defined by a + b = n. . εn be elements of {−1. ξn ) has a uniform distribution. . . . Then. Sn > 0 | Sn = y) = . . a − b = y. . . Proof of Theorem 31. where y = a − (n − a) = 2a − n. An urn can be realised quite easily be a random walk: Theorem 32. a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. . . we can translate the original question into a question about the random walk. Proof. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . always assuming (without any loss of generality) that a ≥ b = n − a. so ‘azure’ items are +1 and ‘black’ items are −1. . n ≥ 0) with values in Z and let y be a positive integer. indeed. . using the definition of conditional probability and Lemma 16. . But in (30) we computed that y P0 (S1 > 0. p(n+k)/2 q (n−k)/2 . . ξn = εn ) P0 (Sn = y) 2−n 1 = = .) So pi. 89 . . Since an urn with parameters n. letting a.

1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. for x > 0. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . If S1 = 1 then T1 = 1. The process (Tx . plus a ′′ further time T1 required to go from 0 to +1 for the first time. 1. and all t ≥ 0. Lemma 18. See figure below. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. We will approach this using probability generating functions. For all x. 2. Let ϕ(s) := E0 sT1 . This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. . . x − 1 in order to reach x. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. .30. . If x > 0. Corollary 20.i. This random variable takes values in {0. x ≥ 0) is also a random walk. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. random variables. . Consider a simple random walk starting from 0.) Then. To prove this. In view of this. . 2. plus a time T1 required to go from −1 to 0 for the first time. In view of Lemma 19 we need to know the distribution of T1 . Consider a simple random walk starting from 0. we make essential use of the skip-free property. Proof. y ∈ Z. E0 sTx = ϕ(s)x . x ∈ Z. Proof. The rest is a consequence of the strong Markov property. only the relative position of y with respect to x is needed.d. it is no loss of generality to consider the random walk starting from 0. Consider a simple random walk starting from 0. .} ∪ {∞}. (We tacitly assume that S0 = 0. 2qs Proof. Lemma 19. Px (Ty = t) = P0 (Ty−x = t). 90 . Theorem 33.

Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . from the strong Markov property. The first time n such that Sn = −1 is the first time n such that −Sn = 1. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . 91 .Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . if x < 0 2ps Proof. Consider a simple random walk starting from 0. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . Hence the previous formula applies with the roles of p and q interchanged. Corollary 21. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. 2ps Proof. Corollary 22. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Consider a simple random walk starting from 0. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. The formula is obtained by solving the quadratic. in order to hit 1.  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 .

30. Similarly. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . 1 − 4pqs2 . we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . k 92 . Then the probability generating function of the first return time to 0 is given by E0 s T 0 = 1 − Proof. if S1 = 1. this was found in §30. 2ps ′ =1− 30. the same as the time required for the walk to hit 1 starting ′ from 0. The latter is. If we define n to be equal to 0 for k > n. For the general case. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . For a symmetric ′ random walk.2 First return to the origin Now let us ask: If the particle starts at 0. If α = n is a positive integer then n (1 + x)n = k=0 n k x . Consider a simple random walk starting from 0. then we can omit the k upper limit in the sum. We again do first-step analysis. k by the binomial theorem. where α is a real number. how long will it take till it first returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. and simply write (1 + x)n = k≥0 n k x . 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 .2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. in distribution.3 The distribution of the first return to the origin We wish to find the exact distribution of the first return time to the origin.

and it looks exactly the same for general $\alpha$. Recall that
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{k!}.$$
It turns out that the formula is formally true for general $\alpha$:
$$(1+x)^\alpha = \sum_{k \ge 0} \binom{\alpha}{k} x^k,$$
as long as we define
$$\binom{\alpha}{k} = \frac{\alpha(\alpha-1)\cdots(\alpha-k+1)}{k!}.$$
The adverb 'formally' is supposed to mean that this formula holds for some $x$ around 0, but we won't worry about which ones.

For example, we will show that
$$(1-x)^{-1} = 1 + x + x^2 + x^3 + \cdots,$$
which is of course the all too-familiar geometric series. Indeed, by definition,
$$\binom{-1}{k} = \frac{(-1)(-2)\cdots(-1-k+1)}{k!} = \frac{(-1)^k k!}{k!} = (-1)^k,$$
so
$$(1-x)^{-1} = \sum_{k \ge 0} \binom{-1}{k} (-x)^k = \sum_{k \ge 0} (-1)^k (-x)^k = \sum_{k \ge 0} (-1)^{2k} x^k = \sum_{k \ge 0} x^k,$$
as needed.

As another example, let us compute $\sqrt{1-x}$. We have
$$\sqrt{1-x} = (1-x)^{1/2} = \sum_{k \ge 0} \binom{1/2}{k} (-1)^k x^k.$$
We can write this in more familiar symbols if we observe that
$$\binom{1/2}{k} (-1)^k = -\frac{1}{2k-1} \binom{2k}{k} 4^{-k},$$
and this is shown by direct computation. So
$$\sqrt{1-x} = 1 - \sum_{k \ge 1} \frac{1}{2k-1} \binom{2k}{k} \left( \frac{x}{4} \right)^k.$$
We apply this to $E_0 s^{T_0'} = 1 - \sqrt{1 - 4pqs^2}$:
$$E_0 s^{T_0'} = \sum_{k \ge 1} \frac{1}{2k-1} \binom{2k}{k} (pq)^k s^{2k}.$$
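The identity used above, and the resulting series for $\sqrt{1-x}$, can be checked numerically. This snippet is not from the notes:

```python
# Check binom(1/2, k)(-1)^k = -(1/(2k-1)) binom(2k, k) 4^{-k}, and the
# series for sqrt(1 - x), at an arbitrary point x = 0.3.
from math import comb, isclose, sqrt

def gen_binom(alpha, k):
    """binom(alpha, k) = alpha(alpha-1)...(alpha-k+1)/k! for real alpha."""
    out = 1.0
    for i in range(k):
        out *= (alpha - i) / (i + 1)
    return out

for k in range(1, 8):
    lhs = gen_binom(0.5, k) * (-1) ** k
    rhs = -comb(2 * k, k) / ((2 * k - 1) * 4 ** k)
    assert isclose(lhs, rhs), (k, lhs, rhs)

x = 0.3
series = 1 - sum(comb(2*k, k) / ((2*k - 1) * 4**k) * x**k for k in range(1, 60))
print(series, sqrt(1 - x))  # the two numbers agree to many decimals
```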

But, by the definition of $E_0 s^{T_0'}$,
$$E_0 s^{T_0'} = \sum_{n \ge 0} P_0(T_0' = n) s^n.$$
Comparing the two 'infinite polynomials' we see that we must have $n = 2k$ (we knew that: the random walk can return to 0 only in an even number of steps), and then
$$P_0(T_0' = 2k) = \frac{1}{2k-1} \binom{2k}{k} p^k q^k = \frac{1}{2k-1} P_0(S_{2k} = 0).$$

31 Recurrence properties of simple random walks

Consider again a simple random walk $S_n$ with values in $\mathbb{Z}$, with upward probability $p$ and downward probability $q = 1 - p$. By applying the law of large numbers we have that
$$P_0\left( \lim_{n\to\infty} \frac{S_n}{n} = p - q \right) = 1.$$
• If $p > 1/2$ then $p - q > 0$ and so $S_n$ tends to $+\infty$ with probability one. Hence it is transient.
• If $p < 1/2$ then $p - q < 0$ and so $S_n$ tends to $-\infty$ with probability one. Hence it is again transient.
• If $p = 1/2$ then $p - q = 0$ and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is:
$$P_0(T_x < \infty) = \lim_{s \uparrow 1} E_0 s^{T_x},$$
which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes $1 - 4pq$. But remember that $q = 1 - p$, so
$$\sqrt{1 - 4pq} = \sqrt{1 - 4p + 4p^2} = \sqrt{(1-2p)^2} = |1 - 2p| = |p - q|.$$
So
$$P_0(T_x < \infty) = \begin{cases} \left( \dfrac{1 - |p-q|}{2q} \right)^{x}, & \text{if } x > 0, \\[10pt] \left( \dfrac{1 - |p-q|}{2p} \right)^{|x|}, & \text{if } x < 0. \end{cases} \tag{33}$$
This formula can be written more neatly as follows:

Theorem 35.
$$P_0(T_x < \infty) = \begin{cases} \left( \dfrac{p}{q} \wedge 1 \right)^{x}, & \text{if } x > 0, \\[10pt] \left( \dfrac{q}{p} \wedge 1 \right)^{|x|}, & \text{if } x < 0. \end{cases}$$
In particular, if $p = q$ then $P_0(T_x < \infty) = 1$ for all $x \in \mathbb{Z}$.

Proof. If $p > q$ then $|p - q| = p - q$, and so (33) gives $P_0(T_x < \infty) = 1$ for $x > 0$ and $P_0(T_x < \infty) = (q/p)^{|x|}$ for $x < 0$. If $p < q$ then $|p - q| = q - p$ and, again, we obtain what we want.

Corollary 23. The simple symmetric random walk with values in $\mathbb{Z}$ is null recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that $p < 1/2$. Then $\lim_{n\to\infty} S_n = -\infty$ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by
$$M_\infty = \sup(S_0, S_1, S_2, \ldots).$$
If we let $M_n = \max(S_0, S_1, \ldots, S_n)$, the sequence $M_n$ is nondecreasing, so we have $M_\infty = \lim_{n\to\infty} M_n$. Indeed, every nondecreasing sequence has a limit, except that the limit may be $\infty$; but here it is $< \infty$, with probability 1, because $p < 1/2$.

Theorem 36. Consider a simple random walk with values in $\mathbb{Z}$. Assume $p < 1/2$. Then the overall maximum $M_\infty$ has a geometric distribution:
$$P_0(M_\infty \ge x) = (p/q)^x, \qquad x \ge 0.$$

Proof.
$$P_0(M_\infty \ge x) = \lim_{n\to\infty} P_0(M_n \ge x) = \lim_{n\to\infty} P_0(T_x \le n) = P_0(T_x < \infty) = (p/q)^x, \qquad x \ge 0.$$
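A quick empirical look at Theorem 36; this is not in the original notes, and the finite horizon below is a pragmatic stand-in for "infinity", justified by the downward drift:

```python
# Check P_0(M_infinity >= x) = (p/q)^x by simulation, for p < 1/2.
import random

p, q = 0.3, 0.7
trials, horizon = 5000, 2000  # after `horizon` steps the walk is far below 0

tail = [0] * 8
for _ in range(trials):
    pos, m = 0, 0
    for _ in range(horizon):
        pos += 1 if random.random() < p else -1
        m = max(m, pos)
    for x in range(8):
        tail[x] += (m >= x)

for x in range(8):
    print(x, tail[x] / trials, (p / q) ** x)
```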

31.2 Return probabilities

Only in the symmetric case $p = q = 1/2$ will the random walk return with certainty to the point where it started from. Otherwise, with some positive probability, it will never ever return to the point where it started from. Unless $p = 1/2$, this probability is always strictly less than 1. What is this probability?

Theorem 37. Consider a random walk with values in $\mathbb{Z}$. Let $T_0' = \inf\{n \ge 1 : S_n = 0\}$ be the first return time to 0. Then
$$P_0(T_0' < \infty) = 2(p \wedge q).$$

Proof. The probability generating function of $T_0'$ was obtained in Theorem 34. But
$$P_0(T_0' < \infty) = \lim_{s \uparrow 1} E s^{T_0'} = 1 - \sqrt{1 - 4pq} = 1 - |p - q|.$$
If $p > q$ this gives $1 - (p - q) = 1 - (1 - q - q) = 2q$. If $p < q$ this gives $1 - (q - p) = 1 - (1 - p - p) = 2p$. If $p = q$ this gives 1.

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state $i$ is given by
$$J_i = \sum_{n=1}^\infty \mathbf{1}(S_n = i), \tag{34}$$
a random variable first considered in Section 10, where it was also shown to be geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in $\mathbb{Z}$. Then, starting from zero, the total number $J_0$ of visits to state 0 is geometrically distributed:
$$P_0(J_0 \ge k) = (2(p \wedge q))^k, \quad k = 0, 1, 2, \ldots, \qquad E_0 J_0 = \frac{2(p \wedge q)}{1 - 2(p \wedge q)}.$$

Proof. In Lemma 4 we showed that, if a Markov chain starts from a state $i$, then $J_i$ is geometric. So $P_0(J_0 \ge k) = P_0(J_0 \ge 1)^k$. But $J_0 \ge 1$ if and only if the first return time to zero is finite. So
$$P_0(J_0 \ge 1) = P_0(T_0' < \infty) = 2(p \wedge q),$$
where the last probability was computed in Theorem 37. This proves the first formula. For the second formula we have
$$E_0 J_0 = \sum_{k=1}^\infty P_0(J_0 \ge k) = \sum_{k=1}^\infty (2(p \wedge q))^k = \frac{2(p \wedge q)}{1 - 2(p \wedge q)},$$
this being a geometric series.

Remark 1: By spatial homogeneity, the same formulae obviously apply for any $i$:
$$P_i(J_i \ge k) = (2(p \wedge q))^k, \qquad E_i J_i = \frac{2(p \wedge q)}{1 - 2(p \wedge q)}.$$

Remark 2: The formulae work for all values of $p \in [0, 1]$, even for $p = 1/2$. In the latter case, $P_i(J_i \ge k) = 1$ for all $k$, meaning that $P_i(J_i = \infty) = 1$, as already known.

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a random walk with values in $\mathbb{Z}$, starting from 0. If $p \le 1/2$ then
$$P_0(J_x \ge k) = (p/q)^{x \vee 0}\, (2p)^{k-1}, \quad k \ge 1, \qquad E_0 J_x = \frac{(p/q)^{x \vee 0}}{1 - 2p}.$$
If $p \ge 1/2$ then
$$P_0(J_x \ge k) = (q/p)^{(-x) \vee 0}\, (2q)^{k-1}, \quad k \ge 1, \qquad E_0 J_x = \frac{(q/p)^{(-x) \vee 0}}{1 - 2q}.$$

Proof. Suppose $x > 0$. Then
$$P_0(J_x \ge k) = P_0(T_x < \infty,\ J_x \ge k),$$
simply because if $J_x \ge k \ge 1$ then $T_x < \infty$. Apply now the strong Markov property at $T_x$:
$$P_0(T_x < \infty,\ J_x \ge k) = P_0(T_x < \infty)\, P_x(J_x \ge k - 1).$$
The reason that we replaced $k$ by $k-1$ is because, given that $T_x < \infty$, there has already been one visit to state $x$, and so we need at least $k-1$ remaining visits in order to have at least $k$ visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain
$$P_0(J_x \ge k) = \left( \frac{p}{q} \wedge 1 \right)^{x} (2(p \wedge q))^{k-1}.$$
If $p \le 1/2$ this gives $P_0(J_x \ge k) = (p/q)^x (2p)^{k-1}$; if $p \ge 1/2$, it gives $P_0(J_x \ge k) = (2q)^{k-1}$. We repeat the argument for $x < 0$ and find
$$P_0(J_x \ge k) = \left( \frac{q}{p} \wedge 1 \right)^{|x|} (2(p \wedge q))^{k-1}.$$
If $p \le 1/2$ this gives $P_0(J_x \ge k) = (2p)^{k-1}$; if $p \ge 1/2$, it gives $P_0(J_x \ge k) = (q/p)^{|x|} (2q)^{k-1}$. As for the expectation, we apply $E_0 J_x = \sum_{k=1}^\infty P_0(J_x \ge k)$ and perform the sum of a geometric series.

Remark 1: The expectation $E_0 J_x$ is finite if $p \ne 1/2$. In other words, if the random walk "drifts down" (i.e. $p < 1/2$) then, for $x < 0$, we have $E_0 J_x = 1/(1 - 2p)$ regardless of $x$: it will visit any point below 0 the same number of times on the average. But for $x > 0$, $E_0 J_x$ drops down to zero geometrically fast as $x$ tends to infinity.

Remark 2: If the initial state is not 0, then use spatial homogeneity to obtain the analogous formulae, i.e. $P_y(J_x \ge k) = P_0(J_{x-y} \ge k)$ and $E_y J_x = E_0 J_{x-y}$.
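The geometric law of Theorem 38 can also be checked by simulation. This is not in the original notes, and the finite horizon again approximates an infinite one (harmless here because of the drift):

```python
# Empirical check of P_0(J_0 >= k) = (2 min(p, q))^k.
import random

p = 0.3
trials, horizon = 10000, 1000
lam = 2 * min(p, 1 - p)

tail = [0] * 5
for _ in range(trials):
    pos, visits = 0, 0
    for _ in range(horizon):
        pos += 1 if random.random() < p else -1
        if pos == 0:
            visits += 1
    for k in range(5):
        tail[k] += (visits >= k)

for k in range(5):
    print(k, tail[k] / trials, lam ** k)
```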

32 Duality

Consider a random walk $S_k = \sum_{i=1}^k \xi_i$, i.e. a sum of i.i.d. random variables. Fix an integer $n$. Then the dual $(\widehat{S}_k,\ 0 \le k \le n)$ of $(S_k,\ 0 \le k \le n)$ is defined by
$$\widehat{S}_k := S_n - S_{n-k}.$$

Theorem 40. $(\widehat{S}_k,\ 0 \le k \le n)$ has the same distribution as $(S_k,\ 0 \le k \le n)$.

Proof. Use the obvious fact that $(\xi_1, \xi_2, \ldots, \xi_n)$ has the same distribution as $(\xi_n, \xi_{n-1}, \ldots, \xi_1)$.

Thus, every time we have a probability for $S$ we can translate it into a probability for $\widehat{S}$, which is basically another probability for $S$, because the two have the same distribution. Here is an application of this:

Theorem 41. Let $(S_k)$ be a simple symmetric random walk and $x$ a positive integer. Then
$$P_0(T_x = n) = \frac{x}{n}\, P_0(S_n = x).$$

Proof. Rewrite the ballot theorem in terms of the dual:
$$P_0(S_1 > 0, \ldots, S_n > 0 \mid S_n = x) = \frac{x}{n}.$$
But the left-hand side is
$$P_0(S_n - S_{n-k} > 0,\ 1 \le k \le n \mid S_n = x) = \frac{P_0(S_k < x,\ 0 \le k \le n-1,\ S_n = x)}{P_0(S_n = x)} = \frac{P_0(T_x = n)}{P_0(S_n = x)}.$$

Pause for reflection: we have been dealing with the random times $T_x$ for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for $x > 0$, $T_x$ is the sum of $x$ i.i.d. random variables, each distributed as $T_1$, and thus derived the generating function of $T_x$:
$$E_0 s^{T_x} = \left( \frac{1 - \sqrt{1 - 4pqs^2}}{2qs} \right)^x.$$
Specialising to the symmetric case we have
$$E_0 s^{T_x} = \left( \frac{1 - \sqrt{1 - s^2}}{s} \right)^x. \tag{35}$$
In Theorem 29 we used the reflection principle and derived the distribution function of $T_x$ for a simple symmetric random walk:
$$P_0(T_x \le n) = P_0(|S_n| \ge x) - \tfrac{1}{2} P_0(|S_n| = x). \tag{36}$$
Lastly, in Theorem 41, we derived, using duality, for a simple symmetric random walk, the probabilities
$$P_0(T_x = n) = \frac{x}{n}\, P_0(S_n = x). \tag{37}$$

We remark here something methodological: it was completely different techniques that led to results about, essentially, the same thing: the distribution of $T_x$. This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of $s^n$ in (35); summing up (37) should give (36); and taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
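Verifying the compatibility numerically, on the other hand, is straightforward. The following check is not part of the original notes:

```python
# Compare the reflection formula (36) with the duality formula (37)
# for a simple symmetric random walk and x > 0: summing (37) over
# n <= N should reproduce (36) at N.
from math import comb

def P_S(n, x):                         # P_0(S_n = x)
    if (n + x) % 2 or abs(x) > n:
        return 0.0
    return comb(n, (n + x) // 2) * 0.5 ** n

def P_T_eq(x, n):                      # duality formula (37)
    return x / n * P_S(n, x)

def P_T_le(x, n):                      # reflection formula (36)
    p_ge = sum(P_S(n, y) for y in range(-n, n + 1) if abs(y) >= x)
    p_eq = P_S(n, x) + P_S(n, -x)
    return p_ge - 0.5 * p_eq

x, N = 3, 25
print(sum(P_T_eq(x, n) for n in range(1, N + 1)))  # summing up (37) ...
print(P_T_le(x, N))                                # ... gives (36)
```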

33 Amount of time that a SSRW is positive*

Consider a SSRW $(S_n)$ starting from 0. Define its trajectory as the polygonal line obtained by joining the points $(n, S_n)$ in the plane, as described earlier. Watch this trajectory up to time $2n$. Let $T_+(2n)$ be the number of sides of the polygonal line that are above the horizontal axis and $T_-(2n) = 2n - T_+(2n)$ the number of sides that are below. Clearly, $T_\pm(2n)$ are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of $T_+(2n)$ as the number of leads in $2n$ coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time $T_0'$, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed $2n$ times, so the event that the number of heads equals $k$ has maximum probability when $k = n$ (heads = tails). Most people take this statement and translate it into a statement about $T_+(2n)$, arguing that $P(T_+(2n) = m)$ is maximised when $m = n$, i.e. when $T_+(2n) = T_-(2n)$. This is wrong. In fact, as the theorem below shows, $P(T_+(2n) = m)$ is maximised when $m = 0$ or $m = 2n$.

Theorem 42 (distribution of the total time the random walk is positive).
$$P(T_+(2n) = 2k) = P(S_{2k} = 0)\, P(S_{2n-2k} = 0), \qquad 0 \le k \le n.$$

Proof. The idea is to condition at the first time $T_0'$ that the RW returns to 0:
$$P(T_+(2n) = 2k) = \sum_{r=1}^n P(T_+(2n) = 2k,\ T_0' = 2r).$$
Between 0 and $T_0'$ the random walk is either above the horizontal axis or below; call the first event $A_+$ and the second $A_-$, and write
$$P(T_+(2n) = 2k) = \sum_{r=1}^n P(A_+,\ T_0' = 2r)\, P(T_+(2n) = 2k \mid A_+,\ T_0' = 2r) + \sum_{r=1}^n P(A_-,\ T_0' = 2r)\, P(T_+(2n) = 2k \mid A_-,\ T_0' = 2r).$$
In the first case,
$$P(T_+(2n) = 2k \mid A_+,\ T_0' = 2r) = P(T_+(2n) - 2r = 2k - 2r \mid A_+,\ T_0' = 2r) = P(T_+(2n - 2r) = 2k - 2r),$$
because, if $T_0' = 2r$ and $A_+$ occurs, then the RW has already spent $2r$ units of time above the axis, and so $T_+(2n)$ is reduced by $2r$. Furthermore, we may remove the conditioning because the future evolution of the RW after $T_0'$ is independent of the past. In the second case, a similar argument shows
$$P(T_+(2n) = 2k \mid A_-,\ T_0' = 2r) = P(T_+(2n - 2r) = 2k).$$
We also observe that, by symmetry,
$$P(A_+,\ T_0' = 2r) = P(A_-,\ T_0' = 2r) = \tfrac{1}{2} P(T_0' = 2r) = \tfrac{1}{2} f_{2r}.$$
Letting $p_{2k,2n} = P(T_+(2n) = 2k)$, we have shown that
$$p_{2k,2n} = \frac{1}{2} \sum_{r=1}^{k} f_{2r}\, p_{2k-2r,\,2n-2r} + \frac{1}{2} \sum_{r=1}^{n-k} f_{2r}\, p_{2k,\,2n-2r}.$$
We want to show that $p_{2k,2n} = u_{2k} u_{2n-2k}$, where $u_{2k} = P(S_{2k} = 0)$. With this hypothesis, we can start playing with the recursion of the display above and, using induction on $n$, we will realise that
$$p_{2k,2n} = \frac{1}{2}\, u_{2n-2k} \sum_{r=1}^{k} f_{2r}\, u_{2k-2r} + \frac{1}{2}\, u_{2k} \sum_{r=1}^{n-k} f_{2r}\, u_{2n-2k-2r},$$
and the right-hand side equals $u_{2k} u_{2n-2k}$ by the renewal identity $u_{2m} = \sum_{r=1}^m f_{2r} u_{2m-2r}$ (decompose the event $\{S_{2m} = 0\}$ according to the first return time). Explicitly, we have
$$P(T_+(2n) = 2k) = \binom{2k}{k} \binom{2n-2k}{n-k} 2^{-2n}, \qquad 0 \le k \le n. \tag{38}$$

With $2n = 100$, we plot the function $P(T_+(100) = 2k)$ for $0 \le k \le 50$ below. Notice that it is maximised at the ends of the interval.

[Plot: $P(T_+(100) = 2k)$ against $k$, $0 \le k \le 50$; the probability is largest (about 0.08) at the two ends $k = 0$ and $k = 50$, and smallest in the middle.]
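The plotted values are easy to regenerate directly from (38); this snippet is not from the notes:

```python
# Evaluate formula (38) for 2n = 100 and inspect the shape of the law.
from math import comb

n = 50  # so that 2n = 100
probs = [comb(2*k, k) * comb(2*n - 2*k, n - k) * 0.25 ** n for k in range(n + 1)]
print(probs[0], probs[n // 2], probs[n])   # ends vs middle: ends are largest
print(abs(sum(probs) - 1.0) < 1e-12)       # and the probabilities sum to 1
```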

The arcsine law*
Use of Stirling's approximation in formula (38) gives
$$P(T_+(2n) = 2k) \sim \frac{1/\pi}{\sqrt{k(n-k)}}, \quad \text{as } k \to \infty \text{ and } n - k \to \infty.$$
Now consider the calculation of the following probability:
$$P\left( \frac{T_+(2n)}{2n} \in [a, b] \right) = \sum_{na \le k \le nb} P\left( \frac{T_+(2n)}{2n} = \frac{k}{n} \right) \approx \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{k(n-k)}} = \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{\frac{k}{n}\left(1 - \frac{k}{n}\right)}} \cdot \frac{1}{n}.$$
This is easily seen to have a limit as $n \to \infty$, the limit being
$$\int_a^b \frac{1/\pi}{\sqrt{t(1-t)}}\, dt,$$
because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
$$\lim_{n\to\infty} P\left( \frac{T_+(2n)}{2n} \le x \right) = \int_0^x \frac{1/\pi}{\sqrt{t(1-t)}}\, dt = \frac{2}{\pi} \arcsin \sqrt{x}.$$
We thus have that the limiting distribution of $T_+(2n)/2n$ (the fraction of time that the RW is positive), as $n \to \infty$, is the distribution function $\frac{2}{\pi} \arcsin \sqrt{x}$, $0 < x < 1$. This is known as the arcsine law. This distribution function has density
$$\frac{1/\pi}{\sqrt{x(1-x)}}, \qquad 0 < x < 1,$$
the plot of which is amazingly close to the plot of the figure above.
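The arcsine law is also easy to see empirically. The following simulation sketch is not part of the original notes; the convention used for deciding when a side of the trajectory counts as "above the axis" is the one implied by the definition of $T_+(2n)$:

```python
# Empirical distribution of T_+(2n)/2n vs the arcsine law (2/pi) asin(sqrt(x)).
import math
import random

def frac_positive(two_n):
    pos, above = 0, 0
    for _ in range(two_n):
        prev = pos
        pos += 1 if random.random() < 0.5 else -1
        if prev > 0 or pos > 0:   # this side of the polygonal line is above the axis
            above += 1
    return above / two_n

two_n, trials = 1000, 4000
samples = sorted(frac_positive(two_n) for _ in range(trials))
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    empirical = sum(s <= x for s in samples) / trials
    print(x, empirical, 2 / math.pi * math.asin(math.sqrt(x)))
```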

Now consider again a SSRW and consider the hitting times $(T_x,\ x \ge 0)$. This is another random walk, because the summands in $T_x = \sum_{y=1}^x (T_y - T_{y-1})$ are i.i.d., all distributed as $T_1$:
$$P(T_1 \ge 2n) = P(S_{2n} = 0) = \binom{2n}{n} 2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$
From this, we immediately get that the expectation of $T_1$ is infinity:
$$E T_1 = \infty.$$
Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates $(x, y)$ where $x, y \in \mathbb{Z}$. He starts at $(0,0)$ and, being totally drunk, he moves east, west, north or south, with equal probability (1/4). When he reaches a corner, he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in $\mathbb{Z}^2 = \mathbb{Z} \times \mathbb{Z}$: let $\xi_1, \xi_2, \ldots$ be i.i.d. random vectors such that
$$P(\xi_n = (0,1)) = P(\xi_n = (0,-1)) = P(\xi_n = (1,0)) = P(\xi_n = (-1,0)) = 1/4.$$
We then let $S_0 = (0,0)$ and
$$S_n = \sum_{i=1}^n \xi_i, \qquad n \ge 1.$$
If $x \in \mathbb{Z}^2$, we let $x^1$ be its horizontal coordinate and $x^2$ its vertical. So $S_n = (S_n^1, S_n^2)$.

Lemma 20. Each of $(S_n^1,\ n \in \mathbb{Z}_+)$ and $(S_n^2,\ n \in \mathbb{Z}_+)$ is a random walk.

Proof. Since $\xi_1, \xi_2, \ldots$ are i.i.d., so are $\xi_1^1, \xi_2^1, \ldots$. Hence $(S_n^1,\ n \in \mathbb{Z}_+)$ is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0. The increments are identically distributed with
$$P(\xi_n^1 = 0) = P(\xi_n = (0,1)) + P(\xi_n = (0,-1)) = 1/2,$$
$$P(\xi_n^1 = 1) = P(\xi_n = (1,0)) = 1/4, \qquad P(\xi_n^1 = -1) = P(\xi_n = (-1,0)) = 1/4.$$
The same argument applies to $(S_n^2,\ n \in \mathbb{Z}_+)$, showing that it, too, is a random walk.

The two random walks are not independent. Indeed, for each $n$, the random variables $\xi_n^1, \xi_n^2$ are not independent, simply because if one takes the value 0 the other is nonzero; hence the two stochastic processes, the latter being functions of the former, are not independent either. However, consider a change of coordinates: rotate the axes by $45°$ (and scale them), i.e. map $(x^1, x^2)$ into $(x^1 + x^2,\ x^1 - x^2)$. So let
$$R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2,\ S_n^1 - S_n^2).$$
We have:

Lemma 21. $(R_n,\ n \in \mathbb{Z}_+)$ is a random walk in two dimensions. Each of $(R_n^1,\ n \in \mathbb{Z}_+)$ and $(R_n^2,\ n \in \mathbb{Z}_+)$ is a random walk in one dimension, and the last two are independent simple symmetric random walks.

Proof. We have $R_n^1 = \sum_{i=1}^n (\xi_i^1 + \xi_i^2)$ and $R_n^2 = \sum_{i=1}^n (\xi_i^1 - \xi_i^2)$. So $R_n^1 = \sum_{i=1}^n \eta_i^1$ and $R_n^2 = \sum_{i=1}^n \eta_i^2$, where $\eta_n^1 = \xi_n^1 + \xi_n^2$ and $\eta_n^2 = \xi_n^1 - \xi_n^2$. The sequence of vectors $\eta_1, \eta_2, \ldots$ is i.i.d., being obtained by taking the same function $(\xi^1, \xi^2) \mapsto (\xi^1 + \xi^2,\ \xi^1 - \xi^2)$ on the elements of the sequence $\xi_1, \xi_2, \ldots$. Hence $(R_n,\ n \in \mathbb{Z}_+)$ is a random walk in two dimensions, and a similar argument shows that each of $(R_n^1)$, $(R_n^2)$ is a random walk in one dimension. The possible values of each of the $\eta_n^1, \eta_n^2$ are $\pm 1$. We have
$$P(\eta_n^1 = 1,\ \eta_n^2 = 1) = P(\xi_n = (1,0)) = 1/4,$$
$$P(\eta_n^1 = 1,\ \eta_n^2 = -1) = P(\xi_n = (0,1)) = 1/4,$$
$$P(\eta_n^1 = -1,\ \eta_n^2 = 1) = P(\xi_n = (0,-1)) = 1/4,$$
$$P(\eta_n^1 = -1,\ \eta_n^2 = -1) = P(\xi_n = (-1,0)) = 1/4.$$
Hence $P(\eta_n^1 = 1) = P(\eta_n^1 = -1) = 1/2$ and $P(\eta_n^2 = 1) = P(\eta_n^2 = -1) = 1/2$, and
$$P(\eta_n^1 = \varepsilon,\ \eta_n^2 = \varepsilon') = P(\eta_n^1 = \varepsilon)\, P(\eta_n^2 = \varepsilon')$$
for all choices of $\varepsilon, \varepsilon' \in \{-1, +1\}$. So $\eta_n^1, \eta_n^2$ are independent for each $n$, and the two walks $(R_n^1)$, $(R_n^2)$ are independent; each of them is a simple symmetric random walk.

Lemma 22. $P(S_{2n} = 0) \sim \dfrac{c}{n}$, as $n \to \infty$, where $c$ is a positive constant, in the sense that the ratio of the two sides goes to 1.

Proof. Since $S_{2n} = 0$ if and only if $R_{2n} = 0$,
$$P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0,\ R_{2n}^2 = 0) = P(R_{2n}^1 = 0)\, P(R_{2n}^2 = 0),$$
where the last equality follows from the independence of the two random walks. But $(R_n^1)$ is a simple symmetric random walk, so $P(R_{2n}^1 = 0) = \binom{2n}{n} 2^{-2n}$, and the same holds for $(R_n^2)$. So
$$P(S_{2n} = 0) = \binom{2n}{n}^2 2^{-4n},$$
and, by Stirling's approximation, this is $\sim c/n$.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that $\sum_{n=0}^\infty P(S_n = 0) = \infty$. But, by the previous lemma, $P(S_{2n} = 0) \sim c/n$, and $\sum_n c/n = \infty$.
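The exact return probabilities of Lemma 22 can be evaluated directly. The snippet below is not from the notes; the constant works out to $c = 1/\pi$, since $\binom{2n}{n} 2^{-2n} \sim 1/\sqrt{\pi n}$:

```python
# P(S_{2n} = 0) for the planar SSRW, compared with 1/(pi n).
from math import comb, pi

for n in (5, 50, 500):
    exact = (comb(2 * n, n) * 0.5 ** (2 * n)) ** 2
    print(n, exact, 1 / (pi * n))
```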

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let $a < 0 < b$. Recall, from Theorem 7, that for a simple symmetric random walk (started from $S_0 = 0$),
$$P(T_a < T_b) = \frac{b}{b-a}, \qquad P(T_b < T_a) = \frac{-a}{b-a}.$$
We can read this as follows:
$$P(S_{T_a \wedge T_b} = a) = \frac{b}{b-a}, \qquad P(S_{T_a \wedge T_b} = b) = \frac{-a}{b-a}.$$
This last display tells us the distribution of $S_{T_a \wedge T_b}$. Note, in particular, that $E S_{T_a \wedge T_b} = 0$.

Conversely, given a 2-point probability distribution $(p, 1-p)$ such that $p$ is rational, we can always find integers $a, b$, $a < 0 < b$, such that $pa + (1-p)b = 0$. Indeed, if $p = M/N$, with $M < N$, $M, N \in \mathbb{N}$, choose $a = M - N$ and $b = M$: then
$$pa + (1-p)b = \frac{M}{N}(M - N) + \frac{N - M}{N}\, M = 0,$$
and the walk exits $[a, b]$ from the left point with probability $b/(b-a) = M/N = p$, and from the right point with probability $1 - p$. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval $[a, b]$.

How about more complicated distributions? Suppose we are given a probability distribution $(p_i,\ i \in \mathbb{Z})$ with zero mean: $\sum_{i \in \mathbb{Z}} i p_i = 0$. Can we find a stopping time $T$ such that $S_T$ has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let $(S_n)$ be a simple symmetric random walk and $(p_i,\ i \in \mathbb{Z})$ a given probability distribution on $\mathbb{Z}$ with zero mean: $\sum_{i \in \mathbb{Z}} i p_i = 0$. Then there exists a stopping time $T$ such that
$$P(S_T = i) = p_i, \qquad i \in \mathbb{Z}.$$

Proof. Define a pair of random variables $(A, B)$ taking values in $\{(0,0)\} \cup \{(i,j) : i < 0 < j\}$ with the following distribution:
$$P(A = i,\ B = j) = c^{-1} (j - i)\, p_i p_j, \quad i < 0 < j, \qquad P(A = 0,\ B = 0) = p_0,$$
where $c^{-1}$ is chosen by normalisation:
$$P(A = 0,\ B = 0) + \sum_{i<0} \sum_{j>0} P(A = i,\ B = j) = 1.$$
This leads to
$$c^{-1} \sum_{j>0} j p_j \sum_{i<0} p_i + c^{-1} \sum_{i<0} (-i) p_i \sum_{j>0} p_j = 1 - p_0.$$
Since $\sum_{i \in \mathbb{Z}} i p_i = 0$, we have
$$\sum_{i<0} (-i) p_i = \sum_{i>0} i p_i,$$
and, using this, we get that $c$ is precisely this common value:
$$c = \sum_{i<0} (-i) p_i = \sum_{i>0} i p_i.$$
Letting $T_i = \inf\{n \ge 0 : S_n = i\}$, $i \in \mathbb{Z}$, and assuming that $(A, B)$ is independent of $(S_n)$, we consider the random variable
$$Z = S_{T_A \wedge T_B}.$$
The claim is that $P(Z = k) = p_k$, $k \in \mathbb{Z}$. To see this, suppose first $k = 0$. Then $Z = 0$ if and only if $A = B = 0$, and this has probability $p_0$.

Suppose next $k > 0$. We have
$$P(Z = k) = P(S_{T_A \wedge T_B} = k) = \sum_{i<0} \sum_{j>0} P(S_{T_i \wedge T_j} = k)\, P(A = i,\ B = j) = \sum_{i<0} P(S_{T_i \wedge T_k} = k)\, P(A = i,\ B = k),$$
since $S_{T_i \wedge T_j} \in \{i, j\}$. Hence
$$P(Z = k) = \sum_{i<0} \frac{-i}{k - i}\, c^{-1} (k - i)\, p_i p_k = c^{-1} \sum_{i<0} (-i) p_i\, p_k = p_k,$$
by the definition of $c$. Similarly, $P(Z = k) = p_k$ for $k < 0$.
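The construction in the proof can be simulated directly. The sketch below is not from the notes, and the target distribution (uniform on $\{-2, \ldots, 2\}$, which has zero mean) is a hypothetical choice for illustration:

```python
# Simulation sketch of the Skorokhod embedding of Theorem 43.
import random

p = {i: 0.2 for i in range(-2, 3)}               # zero-mean target distribution
c = sum(-i * w for i, w in p.items() if i < 0)   # c = 0.6 here

pairs = [((0, 0), p[0])]                         # P(A=0, B=0) = p_0
for i, wi in p.items():                          # P(A=i, B=j) = (j-i) p_i p_j / c
    if i < 0:
        for j, wj in p.items():
            if j > 0:
                pairs.append(((i, j), (j - i) * wi * wj / c))

population = [ab for ab, _ in pairs]
weights = [w for _, w in pairs]

def embed():
    a, b = random.choices(population, weights)[0]
    pos = 0
    while pos not in (a, b):                     # run the SSRW until T_A ^ T_B
        pos += 1 if random.random() < 0.5 else -1
    return pos

trials = 20000
freq = {k: 0 for k in p}
for _ in range(trials):
    freq[embed()] += 1
for k in sorted(p):
    print(k, freq[k] / trials, p[k])             # empirical law vs target p_k
```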

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set $\mathbb{N}$ of natural numbers is the set of positive integers:
$$\mathbb{N} = \{1, 2, 3, \ldots\}.$$
The set $\mathbb{Z}$ of integers consists of $\mathbb{N}$, their negatives and zero:
$$\mathbb{Z} = \{\cdots, -3, -2, -1, 0, 1, 2, 3, \cdots\}.$$
Given integers $a, b$, with $b \ne 0$, we say that $b$ divides $a$ if there is an integer $k$ such that $a = kb$. We write this by $b \mid a$. If $a \mid b$ and $b \mid a$ then $a = b$. If $c \mid b$ and $b \mid a$ then $c \mid a$.

Given integers $a, b$, with at least one of them nonzero, we can define their greatest common divisor $d = \gcd(a, b)$ as the unique positive integer such that (i) $d \mid a$, $d \mid b$, and (ii) if $c \mid a$ and $c \mid b$, then $c \mid d$. (That it is unique is obvious from the fact that if $d_1, d_2$ are two such numbers then $d_1$ would divide $d_2$ and vice-versa, leading to $d_1 = d_2$.)

Notice that:
$$\gcd(a, b) = \gcd(a, b - a).$$
Indeed, let $d = \gcd(a, b)$. We will show that $d = \gcd(a, b - a)$. But $d \mid a$ and $d \mid b$; therefore $d \mid b - a$. So (i) holds. Now suppose that $c \mid a$ and $c \mid b - a$. Then $c \mid a + (b - a)$, i.e. $c \mid b$. Because $c \mid a$ and $c \mid b$, and $d = \gcd(a, b)$, we have $c \mid d$. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: suppose $0 < a < b$ are integers. Replace $b$ by $b - a$. If $b - a > a$, replace $b - a$ by $b - 2a$. Keep doing this until you obtain $b - qa < a$. Then interchange the roles of the numbers and continue. For instance,

gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.

In other words, we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. But 3 is the maximum number of times that 38 "fits" into 120 (indeed, $4 \times 38 > 120$), and in the first line we subtracted 38 exactly 3 times. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school):

gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.
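Both procedures are one-liners in code; this small illustration is not part of the original text:

```python
# The subtraction method and Euclid's algorithm, side by side.
def gcd_subtract(a, b):
    while a != b:
        a, b = min(a, b), max(a, b) - min(a, b)  # replace the larger by the difference
    return a

def gcd_euclid(a, b):
    while b:
        a, b = b, a % b                          # replace by the remainder instead
    return a

print(gcd_subtract(120, 38), gcd_euclid(120, 38))  # both print 2
```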

The theorem of division says the following: given two positive integers $b, a$, with $a > b$, we can find an integer $q$ and an integer $r \in \{0, 1, \ldots, b - 1\}$, such that
$$a = qb + r.$$
The long division algorithm is a process for finding the quotient $q$ and the remainder $r$. For example, to divide 3450 by 26 we do

         132
        -----
   26 | 3450
        26
        ---
         85
         78
         ---
          70
          52
          ---
          18

and obtain $q = 132$, $r = 18$.

Here are some other facts about the greatest common divisor. First, if $d = \gcd(a, b)$, then there are integers $x, y$ such that
$$d = xa + yb.$$
To see why this is true, let us follow the previous example again:
$$\gcd(120, 38) = \gcd(6, 38) = \gcd(6, 2) = 2.$$
In the process, we used the divisions
$$120 = 3 \times 38 + 6, \qquad 38 = 6 \times 6 + 2.$$
Working backwards, we have
$$2 = 38 - 6 \times 6 = 38 - 6 \times (120 - 3 \times 38) = 19 \times 38 - 6 \times 120.$$
Thus, $2 = \gcd(38, 120) = 38x + 120y$, with $x = 19$, $y = -6$. The same logic works in general.

Second, we have
$$\gcd(a, b) = \min\{xa + yb : x, y \in \mathbb{Z},\ xa + yb > 0\}.$$
To see this, let $d$ be the right hand side. Then $d = xa + yb$, for some $x, y \in \mathbb{Z}$. We need to show (i),

that $d \mid a$ and $d \mid b$. Suppose this is not true, i.e. that $d$ does not divide both numbers, say that $d$ does not divide $b$. Since $d > 0$, we can divide $b$ by $d$ to find
$$b = qd + r,$$
for some $1 \le r \le d - 1$. But then, since $d = xa + yb$,
$$b = q(xa + yb) + r,$$
yielding $r = (-qx)a + (1 - qy)b$. So $r$ is expressible as a linear combination of $a$ and $b$ and is strictly smaller than $d$ (being a remainder). So $d$ is not minimum as claimed, and this is a contradiction. Hence $r = 0$, i.e. $d$ divides both $a$ and $b$, and (i) holds. For (ii), suppose that $c \mid a$ and $c \mid b$. Then $c$ divides any linear combination of $a, b$; in particular, it divides $d$. This shows (ii) of the definition of a greatest common divisor.

The greatest common divisor of 3 integers can now be defined in a similar manner; in general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set $A$ is a mathematical notion that establishes a way of pairing elements of $A$. In other words, a relation can be thought of as a set of pairs of elements of the set $A$. For example, the relation $<$ between real numbers can be thought of as consisting of pairs of numbers $(x, y)$ such that $x$ is smaller than $y$. A relation is called reflexive if it contains pairs $(x, x)$. The equality relation $=$ is reflexive. The relation $<$ is not reflexive. If we take $A$ to be the set of students in a classroom, then the relation "$x$ is a friend of $y$" can be thought of as reflexive (unless we do not want to accept that one is a friend of oneself). A relation is called symmetric if it cannot contain a pair $(x, y)$ without containing its symmetric $(y, x)$. The friendship relation is symmetric. The equality relation is symmetric. The relation $<$ is not symmetric. A relation is transitive if, when it contains $(x, y)$ and $(y, z)$, it also contains $(x, z)$. Both $<$ and $=$ are transitive. The friendship relation is symmetric, but is not necessarily transitive. A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of $x$ contains all $y$ that are related (are equivalent) to $x$. We encountered the equivalence relation $i \leftrightarrow j$ ($i$ communicates with $j$) in Markov chain theory.

Functions

A function $f$ from a set $X$ into a set $Y$ is denoted by $f : X \to Y$. The range of $f$ is the set of its values, i.e. the set $\{f(x) : x \in X\}$. A function is called one-to-one (or injective) if two different $x_1, x_2 \in X$ have different values $f(x_1), f(x_2)$. A function is called onto (or surjective) if its range is $Y$. If $B \subset Y$ then $f^{-1}(B)$ consists of all $x \in X$ with $f(x) \in B$. The set of functions from $X$ to $Y$ is denoted by $Y^X$.

Counting

Let $A, B$ be finite sets with $n$ and $m$ elements, respectively. We may think of the elements of $A$ as animals and those of $B$ as names. A function from $A$ to $B$ is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the $m$ available names; so can the second, and the third,

and so on. Hence the total number of assignments, i.e. the number of functions from $A$ to $B$ (in other words, the cardinality of the set $B^A$), is
$$\underbrace{m \times m \times \cdots \times m}_{n \text{ times}} = m^n.$$
Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: $n \le m$. Each assignment of this type is a one-to-one function from $A$ to $B$. Since the name of the first animal can be chosen in $m$ ways, that of the second in $m - 1$ ways, and so on, the total number of one-to-one functions from $A$ to $B$ is
$$(m)_n := m(m-1)(m-2)\cdots(m-n+1).$$
In the special case $m = n$, a one-to-one function from the set $A$ into $A$ is called a permutation of $A$. Incidentally, we denote $(n)_n$ also by $n!$; thus, $n!$ is the total number of permutations of a set of $n$ elements. Clearly, $n!$ stands for the product of the first $n$ natural numbers:
$$n! = 1 \times 2 \times 3 \times \cdots \times n.$$
Any set $A$ contains several subsets. If $A$ has $n$ elements, the number of subsets with $k$ elements each can be found by designating which of the $n$ elements of $A$ will be chosen. The first element can be chosen in $n$ ways, the second in $n - 1$ ways, and the $k$-th in $n - k + 1$ ways. So we have
$$(n)_k = n(n-1)(n-2)\cdots(n-k+1)$$
ways to pick $k$ elements if we care about their order. If we don't, then we divide by the number $k!$ of permutations of the $k$ elements, concluding that the number of subsets with $k$ elements of a set with $n$ elements equals
$$\frac{(n)_k}{k!} =: \binom{n}{k},$$
where the symbol $\binom{n}{k}$ is merely a name for the number computed on the left side of this equation. Clearly,
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \binom{n}{n-k}.$$
Since there is only one subset with no elements (it is the empty set $\varnothing$), we have $\binom{n}{0} = 1$. We thus observe that there are as many subsets in $A$ as the number of elements in the set $\{0, 1\}^n$ of sequences of length $n$ with elements 0 or 1. The total number of subsets of $A$ (regardless of their number of elements) is thus equal to
$$\sum_{k=0}^n \binom{n}{k} = (1 + 1)^n = 2^n,$$
and this follows from the binomial theorem.

We shall let $\binom{n}{k} = 0$ if $n < k$. Also, if $x$ is a real number, we define
$$\binom{x}{m} = \frac{x(x-1)\cdots(x-m+1)}{m!}.$$
If $y$ is not an integer then, by convention, we let $\binom{x}{y} = 0$.

The Pascal triangle relation states that
$$\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1}.$$
Indeed, in choosing $k$ animals from a set of $n + 1$ ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose $k$ animals from the remaining $n$ ones, and this can be done in $\binom{n}{k}$ ways. If the pig is included, then there are $k - 1$ remaining animals to be chosen, and this can be done in $\binom{n}{k-1}$ ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick $k$ animals from $n + 1$ animals.

Maximum and minimum

The minimum of two numbers $a, b$ is denoted by
$$a \wedge b = \min(a, b).$$
Thus $a \wedge b = a$ if $a \le b$, or $= b$ if $b \le a$. The maximum is denoted by
$$a \vee b = \max(a, b).$$
The notation extends to more than one numbers. For example, $a \vee b \vee c$ refers to the maximum of the three numbers $a, b, c$, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that
$$-(a \vee b) = (-a) \wedge (-b) \quad \text{and} \quad (a \wedge b) + c = (a + c) \wedge (b + c).$$
In particular, we use the following notations:
$$a^+ = \max(a, 0), \qquad a^- = -\min(a, 0).$$
The quantity $a^+$ is called the positive part of $a$, and $a^-$ is called the negative part of $a$. Both quantities are nonnegative. Note that it is always true that
$$a = a^+ - a^-.$$
The absolute value of $a$ is defined by
$$|a| = a^+ + a^-.$$

Indicator function

$\mathbf{1}_A$ stands for the indicator function of a set $A$, meaning a function that takes value 1 on $A$ and 0 on the complement $A^c$ of $A$. We also use the notation $\mathbf{1}(\text{clause})$ for a number that is 1 whenever the clause is true and 0 otherwise. It is easy to see that
$$\mathbf{1}_{A \cap B} = \mathbf{1}_A \mathbf{1}_B, \qquad \mathbf{1}_{A \cup B} = \mathbf{1}_A + \mathbf{1}_B - \mathbf{1}_{A \cap B}, \qquad 1 - \mathbf{1}_A = \mathbf{1}_{A^c}.$$
This is very useful notation because conjunction of clauses is transformed into multiplication. If $A$ is an event, then $\mathbf{1}_A$ is a random variable. It takes value 1 with probability $P(A)$ and 0 with probability $P(A^c)$. Therefore its expectation is
$$E \mathbf{1}_A = 1 \times P(A) + 0 \times P(A^c) = P(A).$$
An example of its application: since
$$\mathbf{1}_{A \cup B} = 1 - \mathbf{1}_{(A \cup B)^c} = 1 - \mathbf{1}_{A^c \cap B^c} = 1 - \mathbf{1}_{A^c} \mathbf{1}_{B^c} = 1 - (1 - \mathbf{1}_A)(1 - \mathbf{1}_B) = \mathbf{1}_A + \mathbf{1}_B - \mathbf{1}_A \mathbf{1}_B = \mathbf{1}_A + \mathbf{1}_B - \mathbf{1}_{A \cap B},$$
we have $E[\mathbf{1}_{A \cup B}] = E[\mathbf{1}_A] + E[\mathbf{1}_B] - E[\mathbf{1}_{A \cap B}]$, which means
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
Several quantities of interest are expressed in terms of indicators. For instance, if I go to a bank $N$ times and $A_i$ is the clause "I am robbed the $i$-th time", then $\sum_{i=1}^N \mathbf{1}_{A_i}$ is the total number of times I am robbed. More generally, if $A_1, A_2, \ldots$ are sets or clauses, then the number $\sum_{n \ge 1} \mathbf{1}_{A_n}$ is the number of true clauses. Another example: let $X$ be a nonnegative random variable with values in $\mathbb{N} = \{1, 2, \ldots\}$. Then
$$X = \sum_{n=1}^X 1 \quad (\text{the sum of } X \text{ ones is } X),$$
so
$$X = \sum_{n \ge 1} \mathbf{1}(X \ge n),$$
and therefore
$$E[X] = \sum_{n \ge 1} E\, \mathbf{1}(X \ge n) = \sum_{n \ge 1} P(X \ge n).$$

Matrices

Let $\mathbb{R}^d$ be the set of vectors with $d$ components. A linear function $\varphi$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ is such that
$$\varphi(\alpha x + \beta y) = \alpha \varphi(x) + \beta \varphi(y),$$

for all $x, y \in \mathbb{R}^n$ and all $\alpha, \beta \in \mathbb{R}$. Let $u_1, \ldots, u_n$ be a basis for $\mathbb{R}^n$. Then any $x \in \mathbb{R}^n$ can be uniquely written as
$$x = \sum_{j=1}^n x^j u_j,$$
where $x^1, \ldots, x^n \in \mathbb{R}$. The column $(x^1, \ldots, x^n)^\top$ is a column representation of $x$ with respect to the given basis. Then, because $\varphi$ is linear,
$$y := \varphi(x) = \sum_{j=1}^n x^j \varphi(u_j).$$
But the $\varphi(u_j)$ are vectors in $\mathbb{R}^m$. So if we choose a basis $v_1, \ldots, v_m$ in $\mathbb{R}^m$, we can express each $\varphi(u_j)$ as a linear combination of the basis vectors:
$$\varphi(u_j) = \sum_{i=1}^m a_{ij} v_i.$$
The matrix $A$ of $\varphi$ with respect to the choice of the two bases is defined by
$$A := \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix},$$
an $m \times n$ matrix. Combining, we have
$$\varphi(x) = \sum_{i=1}^m v_i \sum_{j=1}^n a_{ij} x^j.$$
Let $y^i := \sum_{j=1}^n a_{ij} x^j$, $i = 1, \ldots, m$. The column $(y^1, \ldots, y^m)^\top$ is a column representation of $y = \varphi(x)$ with respect to the basis $v_1, \ldots, v_m$. The relation between the column representation of $y$ and the column representation of $x$ is written as
$$\begin{pmatrix} y^1 \\ \vdots \\ y^m \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}.$$
Suppose that $\psi, \varphi$ are linear functions:
$$\mathbb{R}^n \xrightarrow{\ \psi\ } \mathbb{R}^m \xrightarrow{\ \varphi\ } \mathbb{R}^\ell.$$
Then $\varphi \circ \psi$ is a linear function:
$$\mathbb{R}^n \xrightarrow{\ \varphi \circ \psi\ } \mathbb{R}^\ell.$$
Fix three sets of bases on $\mathbb{R}^n$, $\mathbb{R}^m$ and $\mathbb{R}^\ell$, and let $A, B$ be the matrix representations of $\varphi, \psi$, respectively, with respect to these bases. Then the matrix representation of $\varphi \circ \psi$ is $AB$, where
$$(AB)_{ij} = \sum_{k=1}^m A_{ik} B_{kj}, \qquad 1 \le i \le \ell,\ 1 \le j \le n.$$
So $A$ is an $\ell \times m$ matrix, $B$ is an $m \times n$ matrix, and $AB$ is an $\ell \times n$ matrix.

A matrix $A$ is called square if it is an $n \times n$ matrix. If we fix a basis in $\mathbb{R}^n$, a square matrix represents a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^n$, i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in $\mathbb{R}^d$ consists of the vectors
$$e_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad e_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad e_d = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix};$$
in other words, $e_i^j = \mathbf{1}(i = j)$.

Supremum and infimum

When we have a set of numbers $I$ (for instance, a sequence $a_1, a_2, \ldots$ of numbers), the minimum of $I$ is denoted by $\min I$ and refers to a number $c \in I$ such that $c \le x$ for all $x \in I$. Similarly, we define $\max I$. This minimum (or maximum) may not exist; for instance, $I = \{1, 1/2, 1/3, \cdots\}$ has no minimum. In these cases we use the notation $\inf I$, standing for the "infimum" of $I$, defined as the maximum of all lower bounds, i.e. of all the numbers $y$ with the property $y \le x$ for all $x \in I$. A property of real numbers says that if the set $I$ is bounded from below, then this maximum of all lower bounds exists, and it is this that we call infimum. Likewise, we define the supremum $\sup I$ to be the minimum of all upper bounds. $\inf I$ is characterised by: $\inf I \le x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \le \varepsilon + \inf I$. Similarly, $\sup I$ is characterised by: $\sup I \ge x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \ge \sup I - \varepsilon$. If $I$ has no lower bound then $\inf I = -\infty$; if $I$ has no upper bound then $\sup I = +\infty$. If $I$ is empty then, by convention, $\inf \varnothing = +\infty$ and $\sup \varnothing = -\infty$.

Limsup and liminf

If $x_1, x_2, \ldots$ is a sequence of numbers, then
$$\inf\{x_1, x_2, \ldots\} \le \inf\{x_2, x_3, \ldots\} \le \inf\{x_3, x_4, \ldots\} \le \cdots$$
and, since any nondecreasing sequence has a limit (possibly $+\infty$), we take this limit and call it the $\liminf$ (limit inferior) of the sequence:
$$\liminf_{n\to\infty} x_n = \lim_{n\to\infty} \inf\{x_n, x_{n+1}, x_{n+2}, \ldots\}.$$
Similarly, the $\limsup$ (limit superior) is defined as the limit of the nonincreasing sequence of suprema:
$$\limsup_{n\to\infty} x_n = \lim_{n\to\infty} \sup\{x_n, x_{n+1}, x_{n+2}, \ldots\}.$$
We note that $\liminf(-x_n) = -\limsup x_n$. The sequence $x_n$ has a limit $\ell$ if and only if $\limsup x_n = \liminf x_n = \ell$.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). (Of course, I do sometimes "cheat" from a strict mathematical point of view, in these notes, by not requiring this.) But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides $P(\Omega) = 1$) there is essentially one axiom, namely: if $A_1, A_2, \ldots$ are mutually disjoint events, then
$$P\left( \bigcup_{n \ge 1} A_n \right) = \sum_{n \ge 1} P(A_n).$$
Everything else follows from this axiom. That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.

If the events $A, B$ are disjoint, then $P(A \cup B) = P(A) + P(B)$. If $A, B$ are not disjoint, then
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
This is the inclusion-exclusion formula for two events; it can be generalised to any finite number of events. If the events $A_1, A_2, \ldots$ are increasing with $A = \cup_n A_n$, then $P(A) = \lim_{n\to\infty} P(A_n)$. Similarly, if the events $A_1, A_2, \ldots$ are decreasing with $A = \cap_n A_n$, then $P(A) = \lim_{n\to\infty} P(A_n)$. In addition,
$$P(\cup_n A_n) \le \sum_n P(A_n).$$
If $A_n$ are events with $P(A_n) = 0$, then $P(\cup_n A_n) = 0$. Similarly, if $A_n$ are events with $P(A_n) = 1$, then $P(\cap_n A_n) = 1$.

Quite often, to compute $P(A)$, it is useful to find disjoint events $B_1, B_2, \ldots$, such that $\cup_n B_n = \Omega$, in which case we write
$$P(A) = \sum_n P(A \cap B_n) = \sum_n P(A \mid B_n) P(B_n).$$
Recall that $P(B \mid A)$ is defined by $P(A \cap B)/P(A)$ and is a very motivated definition. Often, to compute $P(A \cap B)$, we read the definition as
$$P(A \cap B) = P(A) P(B \mid A).$$
This is generalisable to a number of events:
$$P(A_1 \cap A_2 \cap A_3) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 \cap A_2).$$
If we interchange the roles of $A, B$ in the definition, we get
$$P(A \cap B) = P(A) P(B \mid A) = P(B) P(A \mid B),$$
and this is called Bayes' theorem. In Markov chain theory, we work a lot with conditional probabilities.

Note that showing $P(A) \le P(B)$ is often more of a logical problem than a numerical one: we need to show that $A$ implies $B$ (i.e. that $A$ is a subset of $B$). Similarly, to show that $EX \le EY$, we often need to argue that $X \le Y$.

If $A$ is an event, then $\mathbf{1}_A$ or $\mathbf{1}(A)$ is the indicator random variable, i.e. the function that takes value 1 on $A$ and 0 everywhere else. Note that $\mathbf{1}_A \mathbf{1}_B = \mathbf{1}_{A \cap B}$, $\mathbf{1}_{A^c} = 1 - \mathbf{1}_A$, and $E \mathbf{1}_A = P(A)$.

Markov's inequality for a positive random variable $X$ states that
$$P(X > x) \le \frac{EX}{x}.$$
This follows from the logical statement $x\, \mathbf{1}(X > x) \le X$, which holds since $X$ is positive, by taking expectations.

An application of this is:
$$P(X < \infty) = \lim_{x \to \infty} P(X \le x).$$
In Markov chain theory, we encounter situations where $P(X = \infty)$ may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable $X$ may be defined by
$$EX = \int_0^\infty P(X > x)\, dx.$$
The formula holds for any positive random variable. In case the random variable is discrete, the formula reduces to the known one: $EX = \sum_x x P(X = x)$. In case the random variable is absolutely continuous with density $f$, we have $EX = \int_{-\infty}^\infty x f(x)\, dx$. For a random variable that takes values in $\mathbb{N}$, we can write $EX = \sum_{n=1}^\infty P(X \ge n)$; this is due to the fact that $P(X \ge n) = E\, \mathbf{1}(X \ge n)$. For a random variable which takes positive and negative values, we define $X^+ = \max(X, 0)$ and $X^- = -\min(X, 0)$, which are both nonnegative, and then define
$$EX = EX^+ - EX^-,$$
which is motivated by the algebraic identity $X = X^+ - X^-$. This definition makes sense only if $EX^+ < \infty$ or if $EX^- < \infty$.

Generating functions

A generating function of a sequence $a_0, a_1, a_2, \ldots$ is an infinite polynomial having the $a_i$ as coefficients:
$$G(s) = a_0 + a_1 s + a_2 s^2 + \cdots.$$
We do not a priori know that this exists for any $s$ other than, obviously, $s = 0$, for which $G(0) = a_0$. Note that
$$|G(s)| \le \sum_n |a_n| |s|^n,$$
and this is $\le \sum_n |a_n|$ if $s \in [-1, 1]$. So if $\sum_n |a_n| < \infty$, then $G(s)$ exists for all $s \in [-1, 1]$, making it a useful object when the $a_0, a_1, \ldots$ is a probability distribution on $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, because then $\sum_n a_n = 1$. If it exists for some $s_1$, then it exists uniformly for all $s$ with $|s| < |s_1|$ (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence $a_n = \rho^n$, $n \ge 0$, is
$$\sum_{n=0}^\infty \rho^n s^n = \frac{1}{1 - \rho s}, \qquad |s| < 1/|\rho|. \tag{39}$$
Suppose now that $X$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$. The generating function of the sequence $p_n = P(X = n)$, $n = 0, 1, \ldots$, is called the probability generating function of $X$:
$$\varphi(s) = E s^X = \sum_{n=0}^\infty s^n P(X = n).$$

This is defined for all $-1 < s < 1$. The term $s^\infty p_\infty$ could formally appear in the sum defining $\varphi(s)$, but, since $|s| < 1$, we have $s^\infty = 0$; here
$$p_\infty := P(X = \infty) = 1 - \sum_{n=0}^\infty p_n.$$
Note that $\sum_{n=0}^\infty p_n \le 1$: we do allow $X$ to, possibly, take the value $+\infty$. Note that $\varphi(0) = P(X = 0)$, while
$$\varphi(1) = \sum_{n=0}^\infty P(X = n) = P(X < +\infty),$$
and this is why $\varphi$ is called probability generating function. We can recover the probabilities $p_n$ from the formula
$$p_n = \frac{\varphi^{(n)}(0)}{n!},$$
as follows from Taylor's theorem. Also, we can recover the expectation of $X$ from
$$EX = \lim_{s \uparrow 1} \varphi'(s),$$
something that can be formally seen from
$$\frac{d}{ds} E s^X = E \frac{d}{ds} s^X = E X s^{X-1},$$
and by letting $s = 1$.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
$$x_{n+2} = x_{n+1} + x_n, \qquad n \ge 0,$$
with $x_0 = x_1 = 1$. If we let $X(s) = \sum_{n \ge 0} x_n s^n$ be the generating function of $(x_n,\ n \ge 0)$, then the generating function of $(x_{n+1},\ n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+1} s^n = s^{-1}(X(s) - x_0),$$
and the generating function of $(x_{n+2},\ n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+2} s^n = s^{-2}(X(s) - x_0 - s x_1).$$

From the recursion $x_{n+2} = x_{n+1} + x_n$ we have (and this is where linearity is used) that the generating function of $(x_{n+2},\ n \ge 0)$ equals the sum of the generating functions of $(x_{n+1},\ n \ge 0)$ and $(x_n,\ n \ge 0)$, i.e.
$$s^{-2}(X(s) - x_0 - s x_1) = s^{-1}(X(s) - x_0) + X(s).$$
Since $x_0 = x_1 = 1$, we can solve for $X(s)$ and find
$$X(s) = \frac{-1}{s^2 + s - 1}.$$
But the polynomial $s^2 + s - 1$ has two roots:
$$a = (\sqrt{5} - 1)/2, \qquad b = -(\sqrt{5} + 1)/2.$$
Hence $s^2 + s - 1 = (s - a)(s - b)$, and so
$$X(s) = \frac{1}{a - b}\left( \frac{1}{s - b} - \frac{1}{s - a} \right) = \frac{1}{b - a}\left( \frac{1}{b - s} - \frac{1}{a - s} \right).$$
But (39) tells us that $\rho^n$ has generating function $\frac{1}{1 - \rho s}$; thus $\frac{1}{(1/\rho) - s}$ is the generating function of $\rho^{n+1}$. This tells us that $\frac{1}{b - s}$ is the generating function of $(1/b)^{n+1}$, and that $\frac{1}{a - s}$ is the generating function of $(1/a)^{n+1}$. Hence $X(s)$ is the generating function of
$$x_n = \frac{1}{b - a}\left( (1/b)^{n+1} - (1/a)^{n+1} \right).$$
Noting that $ab = -1$, so that $1/b = -a$ and $1/a = -b$, we can also write this as
$$x_n = \frac{(-a)^{n+1} - (-b)^{n+1}}{b - a}.$$
This is the solution to the recursion, the $n$-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
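A quick check that the closed form really produces the integer Fibonacci numbers; this snippet is not from the notes:

```python
# Compare the recursion with the closed form derived from the generating
# function: x_n = ((-a)^{n+1} - (-b)^{n+1}) / (b - a).
from math import sqrt

a = (sqrt(5) - 1) / 2
b = -(sqrt(5) + 1) / 2

def fib_closed(n):
    return ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)

x = [1, 1]
for n in range(2, 15):
    x.append(x[-1] + x[-2])
for n in range(15):
    print(n, x[n], round(fib_closed(n)))  # the two columns coincide
```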

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable $X$ with values in $\mathbb{Z}$ or $\mathbb{R}$ or $\mathbb{R}^n$, or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form $P(X \in B)$, where $B$ is a set of values of $X$. For example, if $X$ is a random variable with values in $\mathbb{R}$, then the so-called distribution function, namely the function $P(X \le x)$, $x \in \mathbb{R}$, is such an example. But I could very well use the function $P(X > x)$, or the 2-parameter function $P(a < X < b)$. If $X$ is absolutely continuous, I could use its density
$$f(x) = \frac{d}{dx} P(X \le x).$$
All these objects can be referred to as "distribution" or "law" of $X$. Even the generating function $E s^X$ of a $\mathbb{Z}_+$-valued random variable $X$ is an example, for it does allow us to specify the probabilities $P(X = n)$, $n \in \mathbb{Z}_+$.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state $i$. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

Large products: Stirling's formula

If we want to compute $n! = 1 \times 2 \times \cdots \times n$ for large $n$ then, since the number is so big, it's better to use logarithms:
$$n! = e^{\log n!} = \exp \sum_{k=1}^n \log k.$$
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since $\log x$ does not vary too much between two successive positive integers $k$ and $k + 1$, it follows that
$$\int_k^{k+1} \log x\, dx \approx \log k.$$
Hence
$$\sum_{k=1}^n \log k \approx \int_1^n \log x\, dx.$$
Since $\frac{d}{dx}(x \log x - x) = \log x$, it follows that
$$\int_1^n \log x\, dx = n \log n - n + 1 \approx n \log n - n.$$
Hence we have
$$n! \approx \exp(n \log n - n) = n^n e^{-n}.$$
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol $\approx$ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in $2n$ fair coin tosses, we have exactly $n$ heads equals
$$\binom{2n}{n} 2^{-2n} \approx 1,$$
and this is terrible, because we know that the probability goes to zero as $n \to \infty$. To remedy this, we shall prove Stirling's approximation:
$$n! \sim n^n e^{-n} \sqrt{2\pi n},$$
where the symbol $a_n \sim b_n$ means that $a_n / b_n \to 1$, as $n \to \infty$.
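The difference between the rough and the full approximation is easy to see numerically; the following comparison is not part of the original text:

```python
# Compare n! with the rough approximation n^n e^{-n} and with
# Stirling's formula n^n e^{-n} sqrt(2 pi n).
from math import factorial, exp, log, pi, sqrt

for n in (5, 10, 20, 50):
    exact = factorial(n)
    rough = exp(n * log(n) - n)
    stirling = rough * sqrt(2 * pi * n)
    print(n, exact / rough, exact / stirling)
# The first ratio grows like sqrt(2 pi n); the second tends to 1,
# which is exactly why ratios of products need the full formula.
```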

The geometric distribution

We say that the random variable $X$ with values in $\mathbb{Z}_+$ is geometric if it has the memoryless property:
$$P(X - n \ge m \mid X \ge n) = P(X \ge m), \qquad m, n \ge 0.$$
If we think of $X$ as the time of occurrence of a random phenomenon, then $X$ is such that knowledge that $X$ has exceeded $n$ does not make the distribution of the remaining time $X - n$ any different from that of $X$. The memoryless property forces the distribution to be of the form
$$P(X \ge n) = \lambda^n, \qquad n \in \mathbb{Z}_+,$$
where $\lambda = P(X \ge 1)$. We baptise this the geometric($\lambda$) distribution. We have already seen a geometric random variable, defined in (34). That was the random variable
$$J_0 = \sum_{n=1}^\infty \mathbf{1}(S_n = 0).$$
It was shown in Theorem 38 that $J_0$ satisfies the memoryless property; in fact, it was shown there that
$$P(J_0 \ge k) = P(J_0 \ge 1)^k,$$
a geometric function of $k$.

Markov's inequality

Arguably this is the most useful inequality in Probability, and it is based on the concept of monotonicity. Surely, you have seen it before, so this little section is a reminder. Consider an increasing and nonnegative function $g : \mathbb{R} \to \mathbb{R}_+$. Then, for all $x, y \in \mathbb{R}$,
$$g(y) \ge g(x)\, \mathbf{1}(y \ge x).$$
Indeed, if $y \ge x$ then $\mathbf{1}(y \ge x) = 1$ and the display tells us $g(y) \ge g(x)$, expressing monotonicity. And if $y < x$ then $\mathbf{1}(y \ge x) = 0$ and the display reads $g(y) \ge 0$, expressing non-negativity. We can thus encase both properties of $g$ neatly in this single display. Now let $X$ be a random variable. Then
$$g(X) \ge g(x)\, \mathbf{1}(X \ge x),$$
and so, by taking expectations of both sides (expectation is an increasing operator),
$$E g(X) \ge g(x)\, P(X \ge x).$$
This completely trivial thought gives the very useful Markov's inequality.

Bibliography

1. Brémaud, P. (1997) An Introduction to Probabilistic Modeling. Springer.
2. Brémaud, P. (1999) Markov Chains. Springer.
3. Chung, K. L. and AitSahlia, F. (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller, W. (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead, C. M. and Snell, J. L. (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
6. Norris, J. R. (1998) Markov Chains. Cambridge University Press.
7. Parry, W. (1981) Topics in Ergodic Theory. Cambridge University Press.

