Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1 Introduction
2 Examples of Markov chains
   2.1 A mouse in a cage
   2.2 Bank account
   2.3 Simple random walk (drunkard's walk)
   2.4 Simple random walk (drunkard's walk) in a city
   2.5 Actuarial chains
3 The Markov property
   3.1 Definition of the Markov property
   3.2 Transition probability and initial distribution
   3.3 Time-homogeneous chains
4 Stationarity
   4.1 Finding a stationary distribution
5 Topological structure
   5.1 The graph of a chain
   5.2 The relation "leads to"
   5.3 The relation "communicates with"
   5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1 First hitting time decomposition
   10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1 Functions of excursions
   13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1 Definition of stability
   18.2 The fundamental stability theorem for Markov chains
   18.3 Coupling
   18.4 Proof of the fundamental stability theorem
   18.5 Terminology
   18.6 Periodic chains
19 Limiting theory for Markov chains*
   19.1 Limiting probabilities for null recurrent states
   19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1 Subsequences
   23.2 Watching a Markov chain when it visits a set
   23.3 Subordination
   23.4 Deterministic function of a Markov chain
24 Applications
   24.1 Branching processes: a population growth application
   24.2 The ALOHA protocol: a computer communications application
   24.3 PageRank (trademark of Google): A World Wide Web application
   24.4 The Wright-Fisher model: an example from biology
   24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1 Paths that start and end at specific points
   26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1 Distribution after n steps
   27.2 The ballot theorem
   27.3 Some conditional probabilities
   27.4 Some remarkable identities
   27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1 First hitting time analysis
   30.2 First return to the origin
   30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1 Asymptotic distribution of the maximum
   31.2 Return probabilities
   31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability. A few starred sections should be considered "advanced" and can be omitted at first reading. I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: m x'' = F. Here, x = x(t) denotes the position of the particle at time t, its velocity is x' = dx/dt, and its acceleration is x'' = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, x'). The state of the particle at time t is described by the pair X(t) = (x(t), x'(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t).

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. Formally, we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

   xn+1 = yn
   yn+1 = rem(xn, yn),   n = 0, 1, 2, . . .

By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn. In other words, there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1).
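For readers who like to see such things run, here is a minimal sketch of the gcd dynamical system just described (the function name is mine, chosen for illustration):

    def gcd_dynamical_system(a, b):
        """Iterate the map (x, y) -> (y, rem(x, y)) until y = 0.

        The state is the pair (x_n, y_n); the next state depends only
        on the current one, which is what makes it a dynamical system.
        """
        x, y = max(a, b), min(a, b)
        while y != 0:
            x, y = y, x % y   # x_{n+1} = y_n, y_{n+1} = rem(x_n, y_n)
        return x              # the other number is 0; x is gcd(a, b)

    print(gcd_dynamical_system(1071, 462))  # 21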

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness. We will be dealing with random variables, instead of deterministic objects.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫_a^b f(x)dx of a complicated function f: suppose that f is positive and bounded above by B. Choose a pair (X0, Y0) of random numbers, a ≤ X0 ≤ b, 0 ≤ Y0 ≤ B, uniformly. Count 1 if Y0 < f(X0) and 0 otherwise. Repeat with (Xn, Yn), for steps n = 1, 2, . . .. Perform this a large number N of times. Count the number of 1's and divide by N. The resulting ratio, rescaled by the area B(b − a) of the enclosing rectangle, is an approximation of ∫_a^b f(x)dx. Try this on the computer!

¹ Stochastic means random.

We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.
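Taking up the invitation to try the Monte Carlo method on the computer, here is one possible sketch; the choice of f, a, b, B below is only an illustration:

    import random

    def monte_carlo_integral(f, a, b, B, N=10**6):
        """Estimate the integral of f over [a, b], assuming 0 <= f <= B:
        throw N uniform points into the rectangle [a, b] x [0, B] and
        count the fraction that falls under the graph of f."""
        hits = 0
        for _ in range(N):
            x = random.uniform(a, b)
            y = random.uniform(0, B)
            if y < f(x):
                hits += 1
        return (hits / N) * B * (b - a)  # rescale by the rectangle's area

    # Example: the integral of x^2 over [0, 1] is 1/3.
    print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))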

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99, regardless of where it was at earlier times. We can summarise this information by the transition diagram:

[Transition diagram: 1 → 2 with probability α, 2 → 1 with probability β, and self-loops 1 → 1 with probability 1 − α, 2 → 2 with probability 1 − β.]

Another way to summarise the information is by the 2×2 transition probability matrix

   P = [ 1−α    α  ]  =  [ 0.95  0.05 ]
       [  β   1−β  ]     [ 0.99  0.01 ]

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.)
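The intuitive answer can be confirmed by a quick simulation; the sketch below is mine, with α = 0.05 as above:

    import random

    ALPHA = 0.05  # probability of moving from cell 1 to cell 2

    def time_to_leave_cell_1():
        """Number of minutes until the mouse first moves from 1 to 2."""
        n = 0
        while True:
            n += 1
            if random.random() < ALPHA:
                return n

    trials = 100_000
    mean = sum(time_to_leave_cell_1() for _ in range(trials)) / trials
    print(mean)  # close to 1/ALPHA = 20 (geometric distribution)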

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

   Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that Dn, Sn, n = 1, 2, 3, . . ., are random variables with distributions FD, FS, respectively. Assume also that X0, D0, S0, D1, S1, D2, S2, . . . are independent random variables. The information described above by the evolution of the account together with the distributions for Dn and Sn leads to the (one-step) transition probability which is defined by:

   p_{x,y} := P(Xn+1 = y | Xn = x),   x, y = 0, 1, 2, . . .

We easily see that if y > 0 then

   p_{x,y} = P(Dn − Sn = y − x)
           = Σ_{z=0}^∞ P(Sn = z, Dn − Sn = y − x)
           = Σ_{z=0}^∞ P(Sn = z, Dn = z + y − x)
           = Σ_{z=0}^∞ P(Sn = z) P(Dn = z + y − x)
           = Σ_{z=0}^∞ FS(z) FD(z + y − x),

something that, in principle, can be computed from FS and FD. Of course, if y = 0, we have p_{x,0} = 1 − Σ_{y=1}^∞ p_{x,y}. The transition probability matrix is

   P = [ p_{0,0}  p_{0,1}  p_{0,2}  · · · ]
       [ p_{1,0}  p_{1,1}  p_{1,2}  · · · ]
       [ p_{2,0}  p_{2,1}  p_{2,2}  · · · ]
       [   · · ·    · · ·    · · ·       ]

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

[Transition diagram: states . . . , −2, −1, 0, 1, 2, 3, . . .; from each position the walker moves one step to the right with probability 1/2 and one step to the left with probability 1/2.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).
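These questions are answered later in the notes; in the meantime one can experiment. The following sketch (my own illustration) simulates the walk with a right-step probability p:

    import random

    def walk(p, steps):
        """Simple random walk: +1 with probability p, -1 otherwise."""
        position = 0
        path = [0]
        for _ in range(steps):
            position += 1 if random.random() < p else -1
            path.append(position)
        return path

    # Symmetric case: how often does the walk revisit 0 in 10,000 steps?
    path = walk(0.5, 10_000)
    print(sum(1 for x in path[1:] if x == 0))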

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: a square street grid; inset: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost.

These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Transition diagram over the states H (healthy), S (sick), D (dead): H → S with probability 0.3, S → H with probability 0.8, H → D with probability 0.01, S → D with probability 0.1.]

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits pH,H and pS,S, because pH,H = 1 − pH,S − pH,D and pS,S = 1 − pS,H − pS,D. Also, pD,H and pD,S are omitted, and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection; therefore, D is an absorbing state.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
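The lifetime question can, in principle, be answered by iterating the transition matrix; the sketch below assumes the numbers read off the diagram above (H → S: 0.3, H → D: 0.01, S → H: 0.8, S → D: 0.1) and computes the distribution of the death month:

    # States: 0 = H (healthy), 1 = S (sick), 2 = D (dead).
    P = [
        [0.69, 0.30, 0.01],  # H: stays H with 1 - 0.3 - 0.01
        [0.80, 0.10, 0.10],  # S: recovers with 0.8, dies with 0.1
        [0.00, 0.00, 1.00],  # D is absorbing
    ]

    def death_distribution(months):
        """P(lifetime = n) for n = 1..months, starting healthy."""
        mu = [1.0, 0.0, 0.0]      # initial distribution: unit mass at H
        dead_so_far = 0.0
        out = []
        for _ in range(months):
            mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
            out.append(mu[2] - dead_so_far)   # newly dead this month
            dead_so_far = mu[2]
        return out

    print(death_distribution(5))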

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m > n, m ∈ T) is independent of the past process (Xm, m < n, m ∈ T), conditionally on Xn. Usually, T = Z+ = {0, 1, 2, 3, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice. So we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space. The elements of S are frequently called states.

Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property; it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . . , in+1 ∈ S,

   P(Xn+1 = in+1 | Xn = in, . . . , X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, . . . , Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain

   P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . . , X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),

for any n, m and any choice of states i0, . . . , in+m, which precisely means that (Xn+1, . . . , Xn+m) and (X0, . . . , Xn−1) are independent conditionally on Xn. This is true for any n and m, and so future is independent of past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

   P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in)
      = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),   (1)

which means that the two functions

   µ(i) := P(X0 = i), i ∈ S,   and   p_{i,j}(n, n + 1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,

will specify all joint distributions²:

   P(X0 = i0, X1 = i1, . . . , Xn = in) = µ(i0) p_{i0,i1}(0, 1) p_{i1,i2}(1, 2) · · · p_{in−1,in}(n − 1, n).

The function µ(i) is called initial distribution. The function p_{i,j}(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally,

   p_{i,j}(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

   P(m, n) := [p_{i,j}(m, n)]_{i,j∈S},

i.e. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are the p_{i,j}(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we have, by definition, p_{i,j}(n, n) = 1(i = j), and, by (1),

   p_{i,j}(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} p_{i,i1}(m, m + 1) · · · p_{in−1,j}(n − 1, n),   (2)

which is reminiscent of matrix multiplication. Indeed, we can write (2) as

   P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),

or as

   P(m, n) = P(m, ℓ) P(ℓ, n)   if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities p_{i,j}(n, n + 1) do not depend on n. Then we can simply write

   p_{i,j}(n, n + 1) =: p_{i,j},

and call p_{i,j} the (one-step) transition probability from state i to state j, while

   P = [p_{i,j}]_{i,j∈S}

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

   P(m, n) = P × P × · · · × P = P^{n−m}   (n − m times),

for all m ≤ n, and the semigroup property is now even more obvious because

   P^{a+b} = P^a P^b,

for all integers a, b ≥ 0. From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let

   µn := [µn(i) = P(Xn = i)]_{i∈S}

be the distribution of Xn, arranged in a row vector. Then we have

   µn = µ0 P^n,   (3)

as follows from (1) and the definition of P.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi:

   δi(j) := 1, if j = i;   0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = p δ_{i1} + (1 − p) δ_{i2}.

A useful convention used for time-homogeneous chains is to write

   Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Starting from a specific state i, if Y is a real random variable, we write

   Ei Y := E(Y | X0 = i) = Σ_y y P(Y = y | X0 = i) = Σ_y y Pi(Y = y).
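In matrix form these relations are one line of code. A sketch using numpy, reusing the mouse chain of Section 2.1 as data:

    import numpy as np

    P = np.array([[0.95, 0.05],
                  [0.99, 0.01]])   # mouse chain from Section 2.1
    mu0 = np.array([1.0, 0.0])     # unit mass delta_1: start in cell 1

    # mu_n = mu_0 P^n, and the semigroup property P^(a+b) = P^a P^b
    mu5 = mu0 @ np.linalg.matrix_power(P, 5)
    print(mu5)                     # distribution of X_5
    print(np.allclose(np.linalg.matrix_power(P, 5),
                      np.linalg.matrix_power(P, 2)
                      @ np.linalg.matrix_power(P, 3)))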

4 Stationarity

We say that the process (Xn, n = 0, 1, . . .) is stationary if it has the same law as (Xn, n = m, m + 1, . . .), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process.

We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Let Xn be the number of molecules in room 1 at time n. Then

   p_{i,i+1} = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
   p_{i,i−1} = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. On the average, we indeed expect that we have N/2 molecules in each room. Another guess then is: place each molecule at random either in room 1 or in room 2, with probability 1/2 each.

If we do so, then the number of molecules in room 1 will be i with probability

   π(i) = (N choose i) 2^{−N},   0 ≤ i ≤ N

(a binomial distribution). We can now verify that if P(X0 = i) = π(i) then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

   P(X1 = i) = Σ_j P(X0 = j) p_{j,i} = P(X0 = i − 1) p_{i−1,i} + P(X0 = i + 1) p_{i+1,i}
             = π(i − 1) p_{i−1,i} + π(i + 1) p_{i+1,i}
             = (N choose i−1) 2^{−N} (N − i + 1)/N + (N choose i+1) 2^{−N} (i + 1)/N = · · · = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

   P(X0 = i0, X1 = i1, X2 = i2, . . . , Xn = in) = P(Xm = i0, Xm+1 = i1, Xm+2 = i2, . . . , Xm+n = in),

for all m ≥ 0 and i0, . . . , in ∈ S.

In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. Since, by (3), the distribution µn at time n is related to the distribution at time 0 by µn = µ0 P^n, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P.

Changing notation, we are led to consider:

The stationarity problem: Given [p_{ij}], find those distributions π such that

   πP = π.   (4)

Such a π is called a stationary or invariant distribution. The equations (4) are called balance equations.
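For the Ehrenfest chain, the "missing algebra" can at least be sanity-checked numerically; a sketch with a small N:

    import numpy as np
    from math import comb

    N = 10
    # Ehrenfest transition matrix: p_{i,i+1} = (N-i)/N, p_{i,i-1} = i/N
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N:
            P[i, i + 1] = (N - i) / N
        if i > 0:
            P[i, i - 1] = i / N

    pi = np.array([comb(N, i) * 2.0**(-N) for i in range(N + 1)])
    print(np.allclose(pi @ P, pi))  # True: the binomial solves pi P = pi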

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. In addition to that, π must be a probability:

   Σ_{i∈S} π(i) = 1.

The matrix P always has the eigenvalue 1, because Σ_{j∈S} p_{i,j} = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, 2, . . . , d}. Hence each x ∈ Sd is of the form x = (x1, . . . , xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

   π(x) = 1/d!,   x ∈ Sd.

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, . . .. The times between successive buses are i.i.d. random variables. Consider the following process: At time n (measured in minutes) let Xn be the time till the arrival of the next bus. The state space is S = N = {1, 2, . . .}. Then (Xn, n ≥ 0) is a Markov chain with transition probabilities

   p_{i+1,i} = 1, i ≥ 1,
   p_{1,i} = p(i), i ∈ N.

To find the stationary distribution, write the equations (4):

   π(i) = Σ_{j≥1} π(j) p_{j,i} = π(1) p(i) + π(i + 1),   i ≥ 1.

These can easily be solved:

   π(i) = π(1) Σ_{j≥i} p(j),   i ≥ 1.

To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives

   π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^{−1}.

But here, we must be careful: π must be a probability, i.e. Σ_i π(i) = 1. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then π(1) will be zero, and so each π(i) will be zero, which in turn means that the normalisation condition cannot be satisfied. This means that the solution to the balance equations is identically zero. So we must make an:

   ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

   Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

   ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength.) Consider an infinite binary vector ξ = (ξ1, ξ2, . . .), ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

   ξ(n + 1) = ϕ(ξ(n)).

Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2, then the same will be true for all n:

   P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2,   for any n = 0, 1, 2, . . . and any r = 1, 2, . . .

The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε1, . . . , εr ∈ {0, 1},

   P(ξ1(n + 1) = ε1, . . . , ξr(n + 1) = εr)
      = P(ξ1(n) = 0, ξ2(n) = ε1, . . . , ξr+1(n) = εr)
      + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . . , ξr+1(n) = 1 − εr).   (5)

You can check that, from (5), the claim follows. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2 and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.
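The induction can also be checked empirically on a truncated vector; the sketch below is mine, and the finite window of bits is an approximation of the infinite state:

    import random

    def phi(xs):
        """One step of the bread-kneading map on a finite bit window."""
        if xs[0] == 0:
            return xs[1:]
        return [1 - b for b in xs[1:]]

    # Start from i.i.d. fair bits; the first coordinate should stay fair.
    trials, ones = 20_000, 0
    for _ in range(trials):
        xs = [random.randint(0, 1) for _ in range(40)]
        for _ in range(10):          # apply the map 10 times
            xs = phi(xs)
        ones += xs[0]
    print(ones / trials)             # close to 1/2, as stationarity predicts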

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

   π = πP (balance equations)
   π1 = 1 (normalisation condition).

In other words,

   π(i) = Σ_{j∈S} π(j) p_{j,i},   i ∈ S,
   Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations, but we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:

   F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) p_{i,j}.

Think of π(i) p_{i,j} as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

   F(A, Ac) = F(Ac, A),   for all A ⊂ S.

[Schematic representation of the flow balance relations: the current F(A, Ac) out of A equals the current F(Ac, A) into A.]

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

   π(i) = Σ_{j∈S} π(j) p_{j,i} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈Ac} π(j) p_{j,i}.

Multiply π(i) by Σ_{j∈S} p_{i,j} (which equals 1):

   π(i) Σ_{j∈S} p_{i,j} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈Ac} π(j) p_{j,i}.

Now split the left sum also:

   Σ_{j∈A} π(i) p_{i,j} + Σ_{j∈Ac} π(i) p_{i,j} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈Ac} π(j) p_{j,i}.

This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p11 = 1 − α, p21 = β, p22 = 1 − β. Assume 0 < α, β < 1. Choosing A = {1}, Proposition 1 gives

   π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

   π(1) = cβ,   π(2) = cα.

The constant c is found from π(1) + π(2) = 1. That is,

   π(1) = β/(α + β),   π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952, π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends only 4.8% of the time in room 2, and 95.2% of the time in room 1.

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

[Transition diagram: states 0, 1, 2, 3, 4, . . .; from each state i the chain moves to i + 1 with probability p; from i ≥ 1 it moves to i − 1 with probability q; from 0 it stays at 0 with probability q.]

The balance equations are:

   π(i) = p π(i − 1) + q π(i + 1),   i = 1, 2, . . .
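Before solving these equations (done next), one can check a candidate solution numerically. The sketch below tries the geometric guess π(i) proportional to (p/q)^i, truncated to 50 states, for the illustrative value p = 0.3:

    p, q = 0.3, 0.7
    r = p / q
    pi = [(1 - r) * r**i for i in range(50)]   # candidate: (1-r)(p/q)^i

    # Balance at 0: pi(0) = q pi(0) + q pi(1);
    # balance at i >= 1: pi(i) = p pi(i-1) + q pi(i+1).
    print(abs(pi[0] - (q * pi[0] + q * pi[1])))
    print(max(abs(pi[i] - (p * pi[i - 1] + q * pi[i + 1]))
              for i in range(1, 49)))          # both close to 0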


j) ⇐⇒ i j and j i. [i] = [j] if and only if i identical or completely disjoint. (6) j. i}. Since Pi (Xn = j) > 0. we see that there are four communicating classes: {1}. By assumption. The class {4. Hence Pi (Xn = j) > 0.. reflexive and transitive. (i) holds. it partitions S into equivalence classes known as communicating classes. The first two classes differ from the last two in character. 5} is closed: if the chain goes in there then it never moves out of it. the set of all states that communicate with i: [i] := {j ∈ S : j So. It is reflexive and transitive because is so. {6}. In addition. • Suppose that (ii) holds.j . this is symmetric (i Corollary 2.. The relation j ⇐⇒ j i).i2 · · · pin−1 . Equivalence relation means that it is symmetric. We have Pi (Xn = j) = i1 . because Pi (Xn = j) ≤ Pi (∃m ≥ 0 Xm = j).. Proof. Two communicating classes are either 1 2 3 4 5 6 In the example of the figure. the sum contains a nonzero term. and so (iii) holds. is an equivalence relation. Look at the last display again. we have that the sum is also positive and so one of its terms is positive. • Suppose now that (iii) holds. 16 . {2.. Just as any equivalence relation.the sum on the right must have a nonzero term. meaning that (ii) holds. 3} is not closed. We just observed it is symmetric.3 The relation “communicates with” Define next the relation i communicates with j (written as i Obviously. 3}. The relation is transitive: If i j and j k then i k. However. (iii) holds. where the sum extends over all possible choices of states in S. 5.in−1 pi. the class {2. 5}. by definition.i1 pi1 . by definition. In other words. Corollary 1. {4. The communicating class corresponding to the state i is.

The communicating classes C1 . then the chain (or. Similarly.j = 1. In graph terms. the state space S is a closed communicating class. We then can find states i1 . C5 in the figure above are closed communicating classes. we have i1 ∈ C.i = 1 then we say that i is an absorbing state. starting from any state. Lemma 3. C2 . we can reach any other state. . pi. If a state i is such that pi. Otherwise. A communicating class C is closed if and only if the following holds: If i ∈ C and if i j then j ∈ C. Note: If we consider a Markov chain with transition matrix P and fix n ∈ N then the Markov chain with transition matrix Pn has exactly the same closed communicating classes. There can be arrows between two nonclosed communicating classes but they are always in the same direction. pi1 . and so i2 ∈ C. Suppose that i j. The converse is left as an exercise. The classes C4 . Note that there can be no arrows between closed communicating classes.i2 · · · pin−1 . and so on. parts. Closed communicating classes are particularly important because they decompose the chain into smaller. in−1 such that pi. Since i ∈ C. Suppose first that C is a closed communicating class. j∈C for all i ∈ C.i2 > 0. its transition matrix P) is called irreducible.More generally. C3 are not closed. C1 C2 C3 C4 C5 The general structure of a Markov chain is indicated in this figure. If all states communicate with all others.i1 > 0. and C is closed. A general Markov chain can have an infinite number of communicating classes.i1 pi1 . the state is inessential. Thus. .j > 0. Why? A state i is essential if it belongs to a closed communicating class. . Proof. we say that a set of states C ⊂ S is closed if pi. we conclude that j ∈ C. more manageable. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them. In other words still. . 17 . Note: An inessential state i is such that there exists a state j such that i j but j i. To put it otherwise. more precisely. the single-element set {i} is a closed communicating class.

Let Di = {n ∈ N : pii > 0}. X3 ∈ C2 . 4} at the next step and vice versa. whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. more generally. Dj = {n ∈ (n) (α) N : pjj > 0}. The period of an inessential state is not defined. i. So. pii (ℓn) ≥ pii (n) ℓ . (n) Proof. . If i j there is α ∈ N with pij > 0 and if 18 . X4 ∈ C1 . Since Pm+n = Pm Pn . (Can you find an example of a chain with period 3?) We will now give the definition of the period of a state of an arbitrary chain. Another way to say this is: (n) (m) If the integer n divides m (denote this by: n | m) and if pii > 0 then pii > 0. j ∈ S. Theorem 2 (the period is a class property). it will move to the set C2 = {2. 3} at some point of time then. . we have So pii (2n) (n) pij ≥ pii (n) 2 (m+n) ≥ pik pkj . d(j) = gcd Dj .e. n ≥ 0.4 Period Consider the following chain: 1 2 4 3 The chain is irreducible. i. Therefore if pii > 0 then pii (n) (ℓn) > 0 for all ℓ ∈ N. A state i is called aperiodic if d(i) = 1. (m) (n) for all m. all states form a single closed communicating class. ℓ ∈ N. X2 ∈ C1 . Let us denote by pij the entries of Pn . Notice however. that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1. The period of an essential state is defined as the greatest common divisor of all natural numbers n with such that it is possible to return to i in n steps: d(i) := gcd{n ∈ N : Pi (Xn = i) > 0} . The chain alternates between the two sets. We say that the chain has period 2. . and. So if the chain starts with X0 ∈ C1 then we know that X1 ∈ C2 . If i j then d(i) = d(j).5. . d(i) = gcd Di . j. Consider two distinct essential states i.

So d(i) = d(j). the chain is called aperiodic. . this is the content of the following theorem. (Here Cd := C0 . So pjj ≥ pij pji > 0. If a number divides two other numbers then it also divides their difference. So d(j) is a divisor of all elements of Di . Because n is sufficiently large. Pick n2 . . Cd−1 such that the chain moves cyclically between them. all states communicate with one another. j ∈ S.e. Corollary 3.j i there is β ∈ N with pji > 0. Arguing symmetrically. Of particular interest. If i is an aperiodic state then there exists n0 such that pii > 0 for all n ≥ n0 . if d = 1. Let n be sufficiently large. More generally. Therefore d(j) | α + β. Then we can uniquely partition the state space S into d disjoint subsets C0 . n1 ∈ Di such that n2 − n1 = 1.) 19 . showing that α + β ∈ Dj . . . if an irreducible chain has period d then we can decompose the state space into d sets C0 . (n) (n) (α+n+β) (α) (n) (β) (β) (α+β) (α) (β) This figure shows the internal structure of a closed communicating class with period d = 4. . Cd−1 such that pij = 1. q − r > 0. Consider an irreducible chain with period d. Therefore n = qn1 + r(n2 − n1 ) = (q − r)n1 + rn2 . . . An irreducible chain has period d if one (and hence all) of the states have period d. Theorem 4. 1. From the last two displays we get d(j) | n for all n ∈ Di . d(j) | d(i)). Now let n ∈ Di . chains where. C1 . . Proof. . Theorem 3. Formally. Since d(i) is the greatest common divisor we have d(i) ≥ d(j) (in fact. n2 ∈ Di . An irreducible chain with finitely many states is aperiodic if and only if there (n) exists an n such that pij > 0 for all i. Divide n by n1 to obtain n = qn1 + r. Since n1 . where the remainder r ≤ n1 − 1. we have n ∈ Di as well. d − 1. Therefore d(j) | α + β + n for all n ∈ Di . . r = 0. C1 . we obtain d(i) ≤ d(j) as well. are irreducible chains. as defined earlier. In particular. i. showing that α + n + β ∈ Dj . Then pii > 0 and so pjj ≥ pij pii pji > 0. . j∈Cr+1 i ∈ Cr . .

i1 . Notice that this is an equivalence relation. It is easy to see that if we continue and pick a further state id with pid−1 . Let C1 be the equivalence class of i1 . necessarily. . .. and by the assumptions that d d j. which contradicts the definition of d. As an example. that is why we call it “first-step analysis”. Such a path is possible because of the choice of the i0 . The existence of such a path implies that it is possible to go from i0 to i.. necessarily. Suppose pij > 0 but j ∈ C1 but. for example. i1 . We show that these classes satisfy what we need. Cd−1 . Continue in this manner and define states i0 . consider the chain 1 1/3 1 0 4 1/2 1 1/6 2 We shall later consider the time inf{n ≥ 1 : Xn ∈ A} which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. . 20 . We use a method that is based upon considering what the Markov chain does at time 1. after it takes one step from its current position. We define the hitting time4 of a set of states A by TA := inf{n ≥ 0 : Xn ∈ A}.. . j belongs to the next class. Define the relation i d j ⇐⇒ pij (nd) > 0 for some n ≥ 0. then. i. id−1 d 6 Hitting times and first-step analysis Consider a Markov chain with transition probability matrix P.. We now show that if i belongs to one of these classes and if pij > 0 then. j ∈ C2 . Hence it partitions the state space S into d disjoint equivalence classes. C1 . id ∈ C0 . Pick a state i0 and let C0 be its equivalence class (i.. We are interested in deriving formulae for the probability that this time is finite as well as the expectation of this time.id > 0. i ∈ C0 . . The reader is warned to be alert as to which of the two variables we are considering at each time. Take.e.Proof. Assume d > 1.. Consider the path i0 → i → j → i2 → i3 → · · · → id → i0 . If X0 ∈ A the two times coincide. with corresponding classes C0 . We avoid excessive notation by using the same letter for both. i2 i0 i0 in a number of steps which is an integer multiple of d − 1 (why?). Then pick i1 such that pi0 i1 > 0. say.e the set of states j such that i j)..

′ is the remaining time until A is hit). But the Markov chain is homogeneous. We then have: Theorem 5. . For the second part of the proof. . and define ϕ(i) := Pi (TA < ∞). let ϕ′ (i) be another solution. So TA = 1 + TA . a ′ function of X1 . the event TA is independent of X0 . ϕ(i) = j∈S i∈A pij ϕ(j).). Furthermore. Therefore. By self-feeding this equation. fix a set of states A.It is clear that P1 (T0 < ∞) < 1. The function ϕ(i) satisfies ϕ(i) = 1. We first have (where TA Pi (TA < ∞) = j∈S ′ Pi (1 + TA < ∞|X1 = j)Pi (X1 = j) = j∈S ′ Pi (TA < ∞|X1 = j)pij ′ But observe that the random variable TA is a function of the future after time 1 (i. conditionally on X1 . But what exactly is this probability equal to? To answer this in its generality. X2 . X0 = i) = P (TA < ∞|X1 = j). we have Pi (TA < ∞) = ϕ(j)pij . Then ϕ′ (i) = j∈A pij + j∈A pij ϕ′ (j). which implies that ′ P (TA < ∞|X1 = j) = P (TA < ∞|X0 = j) = Pj (TA < ∞) = ϕ(j). i ∈ A.e. ′ Proof. If i ∈ A then TA = 0. If i ∈ A. . we have ϕ′ (i) = j∈A pij + j∈A = j∈A pij + j∈A pij pij   pjk + k∈A k∈A pjk + j∈A pij pjk ϕ′ (k) k∈A  pjk ϕ′ (k) k∈A 21 . if ϕ′ (i) is any other solution of these equations then ϕ′ (i) ≥ ϕ(i) for all i ∈ S. Hence: ′ ′ P (TA < ∞|X1 = j. j∈S as needed. and so ϕ(i) = 1. Combining the above. then TA ≥ 1. because the chain may end up in state 2.

because it takes no time to hit A if the chain starts from A. By omitting the last term we obtain ϕ′ (i) ≥ Pi (TA ≤ 2). Clearly. 1. the set that contains only state 0. (If the chain starts from 0 it takes no time to hit 0. we have 1 1 ϕ(1) = ϕ(1) + ϕ(0). we choose A = {0}. let A = {0. ′ E i TA = 1 + E i TA = 1 + j∈S pij Ej TA = 1 + j∈A pij ψ(j). 1 ψ(1) = 1 + ψ(1). Example: Continuing the previous example. The function ψ(i) satisfies ψ(i) = 0. obviously.We recognise that the first term equals Pi (TA = 1). so the first two terms together equal Pi (TA ≤ 2). if ψ ′ (i) is any other solution of these equations then ψ ′ (i) ≥ ψ(i) for all i ∈ S. then TA = 1 + TA . If ′ i ∈ A. On the other hand. i = 0. 3 2 so ϕ(1) = 3/4. we obtain ϕ′ (i) ≥ Pi (TA < ∞) = ϕ(i). 2.) As for ϕ(1). and let ψ(i) = Ei TA . We let ϕ(i) = Pi (T0 < ∞). 2}. Example: In the example of the previous figure. Proof. We have: Theorem 6. if the chain starts from 2 it will never hit 0. The second part of the proof is omitted. j∈A i ∈ A. as above. Therefore. ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Next consider the mean time (mean number of steps) until the set A is hit for the first time. the second equals Pi (TA = 2). which is the second of the equations. 3 22 . ψ(i) = 1 + i∈A pij ψ(j). ψ(i) := Ei TA . Letting n → ∞. If i ∈ A then. ψ(0) = ψ(2) = 0. TA = 0 and so ψ(i) := Ei TA = 0. By continuing self-feeding the above equation n times. we obtain ϕ′ (i) ≥ Pi (TA ≤ n). We immediately have ϕ(0) = 1. Furthermore. and ϕ(2) = 0. Start the chain from i.

i. This happens up to the first hitting time TA first hitting time of the set A. Then ′ 1+TA x ∈ A. We may ask to find the probabilities ϕAB (i) = Pi (TA < TB ).) ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ Now let us consider two hitting times TA .. We are interested in the mean total reward TA h(x) := Ex n=0 f (Xn ).e. otherwise. X2 .which gives ψ(1) = 3/2. TB for two disjoint sets of states A. Clearly. because when X0 ∈ A then T0 = 0 and so the total reward is f (X0 ). Now the last sum is a function of the future after time 1. ′ TA = 1 + TA . Next. of the random variables X1 . by the Markov property. . ϕAB (i) = 0. in the last sum. . h(x) = Ex n=0 f (Xn ) ′ 1+TA = f (x) + Ex n=1 ′ TA f (Xn ) = f (x) + Ex n=0 f (Xn+1 ). (This should have been obvious from the start. if x ∈ A. ′ where TA is the remaining time. because the underlying experiment is the flipping of a coin with “success” probability 2/3. after one step. . It should be clear that the equations satisfied are: ϕAB (i) = 1. B. Hence. until set A is reached. i∈A Furthermore. 23 . h(x) = f (x). ∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼ As yet another application of the first-step analysis method. The total reward is f (X0 ) + f (X1 ) + · · · + f (XTA ). as argued earlier. ϕAB (i) = j∈A i∈B pij ϕAB (j). where. consider the following situation: Every time the state is x a reward f (x) is earned. ′ TA Ex n=0 f (Xn+1 ) = y∈S pxy h(y). we just changed index from n to n + 1. therefore the mean number of flips until the first success is 3/2. ϕAB is the minimal solution to these equations.

3. he will die. if he starts from room 2. (Also. h(3) = 7. y∈S h(x) = f (x) + x ∈ A. in which case he dies immediately. until he dies in room 4. so 24 . if just before the k-th toss you decide to bet Yk pounds then after the realisation of the toss you will have earned Yk pounds if heads came up or lost Yk pounds if tails appeared. So. are h(x) = 0. why?) If we let h(i) be the average total reward if he starts from room i.Thus. the set of equations satisfied by h. Let p the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. your fortune after the n-th toss is n Xn = X0 + k=1 Y k ξk . The successive stakes (Yk . 1 2 4 3 The question is to compute the average total reward. no gambler is assumed to have any information about the outcome of the upcoming toss. Clearly. (Sooner or later. he collects i pounds. 2.) 7 Gambler’s ruin Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. x∈A pxy h(y). h(1) = 5. for i = 1. If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails). 2 2 We solve and find h(2) = 8. k ∈ N) form the strategy of the gambler. we have h(4) = 0 1 h(1) = 1 + h(2) 2 1 h(3) = 3 + h(2) 2 1 1 h(2) = 2 + h(1) + h(3). unless he is in room 4. Example: A thief enters sneaks in the following house and moves around the rooms. You can remember the latter ones by thinking that the expectecd total reward equals the immediate reward f (x) plus the expected remaining reward. collecting “rewards”: when in room i.

Theorem 7. 0 ≤ x ≤ ℓ. Let £x be the initial fortune of the gambler.g. We use the notation Px (respectively. 25 ϕ(ℓ) = 1. We are clearly dealing with a Markov chain in the state space {a. .i−1 = q = 1 − p. Ex ) for the probability (respectively. and where the states a. . P (tails) = 1 − p. x − a    . . . in addition.if the gambler wants to have a strategy at all. The gambler starts with X0 = x pounds and has a target: To make b pounds (b > x). is our next goal. . . astrological nature. if p = q Px (Tb < Ta ) = . it will make enormous profits). a < i < b. write ϕ(x) instead of ϕ(x. a + 1. Consider the hitting time of state y: Ty := inf{n ≥ 0 : Xn = y}. if p = q = 1/2 b−a Proof. .i+1 = p. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and. To solve the problem. We wish to consider the problem of finding the probability that the gambler will never lose her money. b = ℓ > 0 and let ϕ(x. She will stop playing if her money fall below a (a < x). ℓ) := Px (Tℓ < T0 ). Yk cannot depend on ξk . The general solution will then obtained from Px (Tb < Ta ) = ϕ(x − a. Then the probability that the gambler’s fortune will reach level b > x before reaching level a < x equals:  x−a − 1  (q/p)    (q/p)b−a − 1 . in simplest case Yk = 1 for all k. The difference between the gambler’s ruin problem for a company and that for a mere mortal is that. We obviously have ϕ(0) = 0. expectation) under the initial condition X0 = x. pi. a ≤ x ≤ b. in the first case. b are absorbing. (Think why!) To simplify things. b − 1. b} with transition probabilities pi. We first solve the problem when a = 0. . as described above. In other words. a + 2. b − a). she must base it on whatever outcomes have appeared thus far. . betting £1 at each integer time against a coin which has P (heads) = p. ξk−1 and–as is often the case in practice–on other considerations of e. It can only depend on ξ1 . Consider a gambler playing a simple coin tossing game. the company has control over the probability p. ℓ). Let us consider a concrete problem.

Since p + q = 1, we can write this as

   p[ϕ(x + 1) − ϕ(x)] = q[ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

   δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0),   0 < x < ℓ.

Assuming that λ := q/p ≠ 1, we find

   ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^{x−1} + · · · + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).

Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

   ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),   0 ≤ x ≤ ℓ, λ ≠ 1.

If λ = 1 then q = p and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ and so

   ϕ(x) = x/ℓ,   0 ≤ x ≤ ℓ, λ = 1.

Summarising, we have obtained

   ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1;   x/ℓ, if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

   Px(T0 < ∞) = (q/p)^x, if p > 1/2;   1, if p ≤ 1/2.

Proof. We have

   Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

   Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x.

If p < 1/2, then λ = q/p > 1, and so

   Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q, then

   Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
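The formula of Theorem 7 is easy to check against simulation; the sketch below compares the two for illustrative values of x, a, b, p:

    import random

    def ruin_formula(x, a, b, p):
        """P_x(T_b < T_a) from Theorem 7."""
        q = 1 - p
        if p == q:
            return (x - a) / (b - a)
        lam = q / p
        return (lam**(x - a) - 1) / (lam**(b - a) - 1)

    def simulate(x, a, b, p, trials=100_000):
        wins = 0
        for _ in range(trials):
            y = x
            while a < y < b:
                y += 1 if random.random() < p else -1
            wins += (y == b)
        return wins / trials

    print(ruin_formula(5, 0, 10, 0.45), simulate(5, 0, 10, 0.45))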

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A random time τ is a stopping time if, for any n, 1(τ = n) is a deterministic function of (X0, . . . , Xn). In other words, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that whether {τ = n} has occurred or not can be decided by looking at (X0, . . . , Xn) only.

An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any deterministic time n, the future process is independent of the past process, conditionally on the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is a time-homogeneous Markov chain with transition probabilities p_{ij}, has the same law as the original process started from i, and is independent of the past process (X0, . . . , Xτ).

Proof. Let A be any event determined by the past (X0, . . . , Xτ); any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, . . . , Xn). We are going to show that, for any such A,

   P(Xτ+1 = j1, Xτ+2 = j2, . . . , A | Xτ = i, τ < ∞) = p_{i,j1} p_{j1,j2} · · · P(A | Xτ = i, τ < ∞).

We have:

   P(Xτ+1 = j1, Xτ+2 = j2, . . . , A, Xτ = i, τ = n)
      (a) = P(Xn+1 = j1, Xn+2 = j2, . . . , A, Xn = i, τ = n)
      (b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
      (c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
      (d) = p_{i,j1} p_{j1,j2} · · · P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . . , Xn) and the ordinary splitting property at n, and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities p_{ij}. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, A, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

   Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the TA of Section 6: the TA of Section 6 was allowed to take the value 0; our Ti here is always positive. If X0 ≠ i then the two definitions coincide. State i may or may not be visited again.

We let Ti^{(1)} := Ti (and we set Ti^{(0)} := 0). We then let Ti^{(2)} be the time of the second visit, Ti^{(3)} the time of the third visit, and so on. Formally, we recursively define

   Ti^{(r)} := inf{n > Ti^{(r−1)} : Xn = i},   r = 2, 3, . . .

All these random times are stopping times. (Why?) Schematically:

[Figure: a trajectory of the chain; the successive visits to state i at times Ti = Ti^{(1)}, Ti^{(2)}, Ti^{(3)}, Ti^{(4)} split it into segments.]

Consider the "trajectory" of the chain between two successive visits to the state i:

   Xi^{(r)} := (Xn, Ti^{(r)} ≤ n < Ti^{(r+1)}),   r = 1, 2, . . .

Let also

   Xi^{(0)} := (Xn, 0 ≤ n < Ti^{(1)}).

We think of Xi^{(0)}, Xi^{(1)}, Xi^{(2)}, . . . as random trajectories or, in a generalised sense, as "random variables". We refer to them as excursions of the process. So Xi^{(r)} is the r-th excursion: the chain visits

Apply Strong Markov Property (SMP) at Ti : SMP says that. of the excursions are Λr := (r) n=Ti 1(Xn = j) The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. the chain returns to i infinitely many times: Pi (Xn = i for infinitely many n) = 1. possibly. Perhaps some states (inessential ones) will be visited finitely many times. are independent random variables. but there are always states which will be visited again and again. . starting from i. Example: Define Ti (r+1) (0) ).). Xi distribution. 10 Recurrence and transience When a Markov chain has finitely many states. at least one state must be visited infinitely many times.state i for the r-th time. all. equal to i.i. if it starts from state i at some point of time. . but important corollary of the observation of this section is that: Corollary 5. ad infinitum. 29 . (Numerical) functions g(Xi independent random variables. Therefore. we “watch” it up until the next time it returns to state i. . by definition. we say that a state i is recurrent if. have the same Proof. and this proves the first claim. we say that a state i is transient if. Xi (2) .. We abstract this trivial observation and give a couple of definitions: First. . Xi have the same distribution follows from the fact that the Markov chain is time-homogeneous. . it will behave in the same way as if it starts from the same state at another point of time.. for time-homogeneous Markov chains. (r) . Λ2 . Second. The excursions Xi (0) . Therefore. the future after Ti is independent of the past before Ti . But (r) the state at Ti is. .d. Xi (1) . A trivial. (0) are independent random variables. The last corollary ensures us that Λ1 . given the state at (r) (r) (r) the stopping time Ti . and “goes for an excursion”. g(Xi (1) ). In particular. except. Moreover. the chain returns to i only finitely many times: Pi (Xn = i for infinitely many n) = 0. they are also identical in law (i. starting from i. the future is independent of the (1) (2) past. An immediate consequence of the strong Markov property is: Theorem 9. g(Xi (2) ). . The claim that Xi .

Indeed. Pi (Ji = ∞) = 1. The following are equivalent: 1. 30 . . therefore concluding that every state is either recurrent or transient. namely. in principle.e. 2 k Pi (Ji ≥ k) = fii Pi (Ji ≥ k − 1) = fii Pi (Ji ≥ k − 2) = · · · = fii . Ei Ji = ∞. Lemma 5. the event {Ji ≥ k} implies that Ti < ∞ and that Ji ≥ k − 1. This means that the state i is transient. k ≥ 0. and that is why we get the product. But the past (k−1) before Ti is independent of the future. there could be yet another possibility: 0 < Pi (Xn = i for infinitely many n) < 1. the random variable Ji is geometric: k Pi (Ji ≥ k) = fii . 4. Pi (Ji = ∞) = 1. Notice that: state i is visited infinitely many times ⇐⇒ Ji = ∞. Suppose that the Markov chain starts from state i.Important note: Notice that the two statements above are not negations of one another. 3. We claim that: Lemma 4. then Pi (Ji < ∞) = 1. Notice also that fii := Pi (Ji ≥ 1) = Pi (Ti < ∞). If fii = 1 we have that Pi (Ji ≥ k) = 1 for all k. If fii < 1. We will show that such a possibility does not exist. By the strong Markov property at time Ti (k−1) Pi (Ji ≥ k) = Pi (Ji ≥ k − 1)Pi (Ji ≥ 1). We summarise this together with another useful characterisation of recurrence & transience in the following two lemmas. Consider the total number of visits to state i (excluding the possibility that X0 = i): ∞ Ji = n=1 1(Xn = i) = sup{r ≥ 0 : Ti(r) < ∞} . Starting from X0 = i. Therefore. 2. Indeed. we have Proof. (k−1) Notice that fii = Pi (Ti < ∞) can either be equal to 1 or smaller than 1. This means that the state i is recurrent. every state is either recurrent or transient. Because either fii = 1 or fii < 1. we conclude what we asserted earlier. i. State i is recurrent. fii = Pi (Ti < ∞) = 1.

If i is transient then Ei Ji < ∞. as n → ∞. State i is transient if. which. Lemma 6. If Ei Ji = ∞. as n → ∞. But ∞ ∞ ∞ (n) Ei Ji = Ei n=1 1(Xn = i) = n=1 Pi (Xn = i) = n=1 (n) pii . State i is transient. 10. Pi (Ji < ∞) = 1. if we assume Ei Ji < ∞.1 Define First hitting time decomposition fij := Pi (Tj = n). From Elementary Probability. Proof. by definition. The equivalence of the first two statements was proved before the lemma. but the two sequences are related in an exact way: Lemma 7. 3. Proof. 2. it is visited infinitely many times. by definition.Proof. n ≥ 1. Corollary 6. Ji cannot take value +∞ with positive probability. 4. (n) pij n (m) (n−m) (n) (n) = m=1 fij pjj . since Ji is a geometric random variable. clearly. is equivalent to Pi (Ji < ∞) = 1. (n) i.e. by definition of Ji is equivalent to Pi (Ji = ∞) = 1. Hence Ei Ji = ∞ also. The latter is the probability to be in state j at time n (not necessarily for the first time) starting from i. This. The equivalence of the first two statements was proved above. it is visited finitely many times with probability 1.e. the probability to hit state j for the first time at time n ≥ 1. then. by the definition of Ji . fii = Pi (Ti < ∞) < 1. i. this implies that Ei Ji < ∞. we see that its expectation cannot be infinity without Ji itself being infinity with probability one. assuming that the chain (n) starts from state i. and. The following are equivalent: 1. owing to the fact that Ji is a geometric random variable. therefore Pi (Ji < ∞) = 1. Notice the difference with pij . Xn = j) Pi (Tj = m)Pi (Xn = j | Tj = m) 31 = m=1 . fij ≤ pij . On the other hand. Proof. State i is recurrent if. Clearly. (n) But if a series is finite then the summands go to 0. Ei Ji < ∞. (n) pij n = Pi (Xn = j) = m=1 n Pi (Tj = m. If i is a transient state then pii → 0. pii → 0. then.

In other words.d. we obtain (after interchanging the order of the two sums in the right hand side) ∞ n=1 (n) pij ∞ (m) fij ∞ n=m (n) (n−m) pjj ∞ (m) (n) (m) = m=1 = (1 + Ej Jj ) m=1 fij = (1 + Ej Jj )Pi (Tj < ∞). j i. for all states i. ( Remember: another property that was a class property was the property that the state have a certain period. we have n=1 pjj = Ej Jj < ∞.The first term equals fij . j i.2 Communicating classes and recurrence/transience Recall the definition of the relation “state i leads to state j”.d. . If j is a transient state then pij → 0. If i is recurrent and i j then there is a positive probability (0) (1) (= fij ) that j will appear in one of the i. and so the probability that j will appear in a specific excursion is positive.i. by definition. If i is recurrent and if i j then j is recurrent and fij = Pi (Tj < ∞) = 1. but also fij = 1. If j is transient. Corollary 8. Xi . We now show that recurrence is a class property. showing that j will appear in infinitely many of the excursions for sure. if we let fij := Pi (Tj < ∞) . and since they take the value 1 with positive probability. In particular.) Theorem 10. . r = 0. either all states are recurrent or all states are transient. Indeed. denoted by i j: it means that.. and gij := Pi (Tj < Ti ) > 0. by summing over n the relation of Lemma 7. as n → ∞. we have i j ⇐⇒ fij > 0. 1. 32 . . Proof. For the second term. are i. the probability to go to j in some finite time is positive. . Hence. Corollary 7. infinitely many of them will be 1 (with probability 1). So the random variables δj. the probability that j will appear in a specific excursion equals gij = Pi (Tj < Ti ). which is finite. Now.i. excursions Xi . . The last statement is simply an expression in symbols of what we said above. In a communicating class. not only fij > 0. Start with X0 = i. we use the Strong Markov Property: (n−m) Pi (Xn = j | Tj = m) = Pi (Xn = j | Xm = j) = pjj . and hence pij as n → ∞.r := 1 j appears in excursion Xi (r) . ∞ Proof. starting from i. 10. . Hence.

. . Let T := inf{n ≥ 1 : Un > U0 }. 0 < Pi (A) = Pi (A ∩ B) + Pi (A ∩ B c ) = Pi (A ∩ B c ) ≤ Pi (B c ). n+1 33 = 0 . some state i must be visited infinitely many times. It should be clear that this can.d. Recall that a finite random variable may not necessarily have finite expectation.Proof. for example. Proof. . . U1 . necessarily. uniformly distributed in the unit interval [0.i. Hence. 1]. By closedness. P (T > n) = P (U1 ≤ U0 . . By finiteness. What is the expectation of T ? First. . We now take a closer look at recurrence. and so i is a transient state. Theorem 11. .e. P (B) = Pi (Xn = i for infinitely many n) < 1. every other j is also recurrent. So if a communicating class is recurrent then it contains no inessential states and so it is closed. if started in C. Un ≤ x | U0 = x)dx xn dx = 1 . observe that T is finite. by the above theorem. and hence. This state is recurrent. random variables. Hence 11 Positive recurrence Recall that if we fix a state i and let Ti be the first return time to i then either Pi (Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi (Ti < ∞) < 1 (transient state). Proof. Let C be the class. . Un ≤ U0 ) 1 = 0 1 P (U1 ≤ x. appear in a certain game. . be i. Indeed.e. Theorem 12. U2 . j but j i. Example: Let U0 . Every finite closed communicating class is recurrent. but Pi (A ∩ B) = Pi (Xm = j. i. i. . all other states are also transient. If i is recurrent then every other j in the same communicating class satisfies i j. Xn = i for infinitely many n) = 0. If a communicating class contains a state which is transient then. Hence all states are recurrent. X remains in C. Thus T represents the number of variables we need to draw to get a value larger than the initial one. . But then. Suppose that i is inessential. If i is inessential then it is transient. there is a state j such that i there is m such that Pi (A) = Pi (Xm = j) > 0. Pi (B) = 0.

34 . Before doing that. Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. We say that i is null recurrent if Ei Ti = ∞. it is a Theorem that can be derived from the axioms of Probability Theory. . Z2 . (i) Suppose E|Z1 | < ∞ and let µ = EZ1 . X1 . we shall discover what the form of this stationary distribution is by using the Law of Large Numbers 12 Law (=Theorem) of Large Numbers in Probability Theory The following result is. (ii) Suppose E max(Z1 . random variables. but E min(Z1 . To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.Therefore. But this is misleading. . It is traditional to refer to it as “Law”. 0) = ∞.i. Let Z1 . of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. n+1 You are invited to think about the meaning of this: If the initial value U0 is unknown. When we have a Markov chain. It is not a Law. X2 . Then P Z1 + · · · + Zn =∞ n→∞ n lim = 1. the sequence X0 . P (T < ∞) = lim P (T ≤ n) = lim (1 − n→∞ ∞ n→∞ 1 ) = 1. . We state it here without proof. then it takes. the Fundamental Theorem of Probability Theory. . 13 Law of Large Numbers for Markov chains The law of large numbers stated above requires independence. Theorem 13. n+1 But ET = ∞ P (T > n) = n=1 n=1 1 = ∞. Then P Z1 + · · · + Zn =µ n→∞ n lim = 1. an infinite number of steps to exceed it! We now classify a recurrent state i as follows: We say that i is positive recurrent if Ei Ti < ∞. 0) > −∞. . . on the average.d. arguably. be i.

(7) Whichever the choice of G.i. Xa(3) . with probability 1. n as n → ∞. and because if we start with X0 = a.d. . which. Since functions of independent random variables are also independent.2 Average reward Consider now a function f : S → R. for reasonable choices of the functions G taking values in R.13. G(Xa(2) ). In particular. Specifically. the law of large numbers tells us that G(Xa ) + · · · + G(Xa ) → EG(Xa(1) ). (r) (r+1) G(Xa(r) ) = max{f (Xt ) : Ta ≤ t < Ta }. are i. G(Xa(3) ). . random variables. . Xa(2) . (1) (2) (1) (n) (8) 13. assign a reward f (x). given a recurrent state a ∈ S (which will be referred to as the ground state for no good reason other than that it has been fixed once and for all) the sequence of excursions Xa(1) . . but define G(X ) to be the total reward received over the excursion X . Example 2: Let f (x) be as in example 1.1 Functions of excursions Namely. Example 1: To each state x ∈ S. the (0) initial excursion Xa also has the same law as the rest of them. are i. . . . forms an independent sequence of (generalised) random variables. Xa . because the excursions Xa . Then. in this example.d. . we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit. We are interested in the existence of the limit time-average reward = lim 1 t→∞ t t f (Xk ). k=0 which represents the ‘time-average reward’ received.. G(Xa(r) ) = (r) (r+1) Ta ≤t<Ta f (Xt ).i. Note that EG(Xa(1) ) = EG(Xa(2) ) = · · · = E G(Xa(0) ) | X0 = a = Ea G(Xa(0) ). 35 . for an excursion X . define G(X ) to be the maximum reward received during this excursion. as in Example 2. we can think of as a reward received when the chain is in state x. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a. . Thus. we have that G(Xa(1) ).

Hence |Gfirst | ≤ BTa . with probability one. Ea Ta < ∞) and assume that the distribution of X0 is such that P (Ta < ∞) = 1. We assumed that f is t bounded. t t In other words. Then. Gfirst is the reward up to time Ta = Ta or t (whichever t is smaller). (r) Nt := max{r ≥ 1 : Ta ≤ t}. and Glast is the reward over the last incomplete cycle. say |f (x)| ≤ B. 0 Ta (1) Ta (2) T (3) a Ta(4) Ta (5) t We write the total reward as a sum of rewards over complete excursions plus a first and a last term: t Nt −1 r=1 (1) f (Xn ) = n=0 Gr + Gfirst + Glast . The idea is this: Suppose that t is large enough. say. We break the sum into the total reward received over the initial excursion. t t as t → ∞. Nt = 5. Nt . Since P (Ta < ∞) = 1.e. A similar argument shows that 1 last G → 0. with probability one.Theorem 14. of complete excursions from the ground state a that occur between 0 and t. For example. the ‘time-average reward’ f exists and is a deterministic quantity: P where Ta −1 1 t→∞ t lim t f (Xn ) = f n=0 =1 . Proof. t as t → ∞. Gr is given by (7). and so 1 first G → 0. Let f : S → R+ be a bounded ‘reward’ function. We shall present the proof in a way that the constant f will reveal itself–it will be discovered. t t 36 . plus the total reward received over the next excursion. We are looking for the total reward received between 0 and t. and so on. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i. Ea f= n=0 f (Xn ) E a Ta = x∈S f (x)π(x). in the figure below. Thus we need to keep track of the number. we have BTa /t → 0.

t→∞ t E a Ta (12) Using (10) and (12) in (9) we have 1 lim t→∞ t Thus 1 t→∞ t lim t Nt −1 Gr = r=1 EG1 . r=1 with probability one.i. Hence. .d. it follows that the number Nt of complete cycles before time t will also be tending to ∞. r=1 (9) Look again what the Law of Large Numbers (8) tells us: that 1 n→∞ n lim (r) n Gr = EG1 .. (11) By definition. random variables with common expectation Ea Ta . we have Ta r→∞ r lim (r) . 1 Nt = . Hence Ta t t Ta t ≤ < Nt Nt Nt But (11) tells us that Ta t Ta t = lim t→∞ Nt t→∞ Nt lim Hence lim (N ) (N +1) (N ) (N +1) . and since Ta − Ta k = 1. 2. Since Ta /r → 0. E a Ta 37 . lim t∞ 1 Nt Nt −1 Gr = EG1 . = E a Ta . = E a Ta . as t → ∞. Write now 1 t Nt −1 Gr = r=1 Nt 1 t Nt Nt −1 Gr . we have that the visit with index r = Nt to the ground state a is the last before t (see also figure above): (N (N Ta t ) ≤ t < Ta t +1) . .also. So we are left with the sum of the rewards over complete cycles. as r → ∞. are i. . E a Ta f (Xn ) = n=0 EG1 =: f . r=1 (10) Now look at the sequence r (r) (1) Ta = Ta + k=1 (k) (k−1) ) (Ta − Ta (1) (k) (k−1) and see what the Law of Large Numbers again. Since Ta → ∞.

Comment 4: Suppose that two states. Note that we include only one endpoint of the excursion. it takes Ea Ta units of time between successive visits to the state a. E b Tb 38 . f(X2 ) f(X1 ) f(X )= f(a) 0 f(T −1) a 0 T a Comment 2: A very important particular case occurs when we choose f (x) := 1(x = b). then the long-run proportion of visits to the state a should equal 1/Ea Ta . The equality of these two limits gives: average no. The numerator of f is computed by summing up the rewards over the excursion (making sure that not both endpoints are included). b. not both. We may write EG1 = Ea 0≤t<Ta f (Xt ) or EG1 = Ea 0<t≤Ta f (Xt ). Hence t→∞ lim 1 t t 1(Xn = a) = n=1 1 .e. t Ea 1(Xk = b) E a Ta . on the average. Applying formula (14) with b in place of a we find that the limit in (13) also equals 1/Eb Tb . The physical interpretation of the latter should be clear: If. The theorem above then says that. just observe that EG1 = E (1) (1+1) Ta ≤t<Ta f (Xt ) = Ea 0≤t<Ta f (Xt ). for all b ∈ S. E a Ta (14) with probability 1. Comment 1: It is important to understand what the formula says. we let f to be 1 at the state b and 0 everywhere else. the ground state. Ta −1 k=0 1 lim t→∞ t with probability 1. The constant f is a sort of average over an excursion. from the Strong Markov Property. of times state b is visited = Ea between two successive visits to state a Ta −1 1(Xk = b) = k=0 E a Ta . (13) 1(Xn = b) = n=1 Comment 3: If in the above we let b = a. i. then we see that the numerator equals 1. a. which communicate and are positive recurrent.To see that f is as claimed. We can use b as a ground state instead of a. The denominator is the duration of the excursion.

Fix a ground state a ∈ S and assume that it is recurrent5 . . . 5 We stress that we do not need the stronger assumption of positive recurrence here. (16) This was a trivial but important step: We replaced the n = 0 term with the n = Ta term.14 Construction of a stationary distribution Our discussions on the law of large numbers for Markov chains revealed that the quantity Ta −1 ν (x) := Ea n=0 [a] 1(Xn = x) . 39 . X1 .e. we saw (Comment 2 above) that if a is a positive recurrent state then ν [a] (x) E a Ta is the long-run fraction of times that state x is visited. x ∈ S. this should have something to do with the concept of stationary distribution discussed a few sections ago. x ∈ S. where a is the fixed recurrent ground state whose existence was assumed. Since a = XT a . . (15) should play an important role. . so there is nothing lost in doing so. X3 . notice that the choice of notation ν [a] is justifiable through the notation (6) for the communicating class of the state a. we have 0 < ν [a] (x) < ∞. X2 . Both of them equal 1. We start the Markov chain with X0 = a. Consider the function ν [a] defined in (15). . The real reason for doing so is because we wanted to replace a function of X0 . . Proof. Then ν [a] satisfies the balance equations: ν [a] = ν [a] P. we can write ν (x) = Ea k=1 [a] Ta 1(Xn = x). Indeed. for any state b such that a → x (a leads to x). Note: Although ν [a] depends on the choice of the ‘ground’ state a. by a function of X1 . we will soon realise that this dependence vanishes when the chain is irreducible. Proposition 2. X2 . ν [a] (x) = y∈S ν [a] (y)pyx . In view of this. Hence the superscript [a] will soon be dropped. Clearly. i. Moreover. Recurrence tells us that Ta < ∞ with probability one.

and thus be in a position to apply first-step analysis. by definition. n ≤ Ta ) pyx Pa (Xn−1 = y. from (16). Hence ν [a] (x) ≥ ν [a] (a)p(n) = p(n) > 0. We just showed that ν [a] = ν [a] P. Note: The last result was shown under the assumption that a is a recurrent state. 40 . x ∈ S. let x be such that a x. which are precisely the balance equations. so there exists m ≥ 1. (17) This π is a stationary distribution. Define Ta −1 ν [a] (x) = π (x) := E a Ta [a] Ea n=0 1(Xn = x) E a Ta . ax ax From Theorem 10. ν [a] (x) < ∞. which. which is precisely what we do next: ∞ ν (x) = Ea n=1 ∞ [a] 1(Xn = x. x a. We now have: Corollary 9. Hence 1 = ν [a] (a) ≥ ν [a] (x)p(m) > ν [a] (x). in this last step. so there exists n ≥ 1. Xn−1 = y. such that pax > 0. We are now going to assume more. Therefore. where. we used (16). xa so. is finite when a is positive recurrent. Suppose that the Markov chain possesses a positive recurrent state a. namely. such that pxa > 0. indeed. So we have shown ν [a] (x) = y∈S pyx ν [a] (y). that a is positive recurrent. n ≤ Ta ) Ta = n=1 y∈S ∞ = n=1 y∈S = y∈S pyx Ea n=1 [a] 1(Xn−1 = y) = y∈S pyx ν (y). n ≤ Ta ) Pa (Xn = x. (n) Next. for all x ∈ S. for all n ∈ N. First note that. n ≤ Ta ) = n=1 ∞ Pa (Xn = x. Ta Ta (m) ν [a] (x) = Ea x∈S n=1 x∈S 1(Xn = x) = Ea n=1 1 = E a Ta . ν [a] = ν [a] Pn .

We already know that j is recurrent. In words. The coins are i. Suppose that i recurrent. unique. Also. in general. we have Ea Ta < ∞. then the set of linear equations π(x) = y∈S π(y)pyx . But π [a] (x) = x∈S 1 E a Ta ν [a] (x) = 1. (r) j. second) excursion that contains j. Let i be a positive recurrent state. We have already shown. r = 0. 15 Positive recurrence is a class property We now show that (just like recurrence) positive recurrence is also a class property. 2. Consider the excursions Xi . that ν [a] P = ν [a] . Let R1 (respectively. .e. . since the excursions are independent. Hence j will occur in infinitely many excursions. no matter which one.d. then the r-th excursion does not contain state j. then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is. If heads show up at the r-th excursion. if tails show up. in Proposition 2. . 41 . x∈S Pause for thought: We have achieved something VERY IMPORTANT: if we have one. 1. Since a is positive recurrent. and so π [a] is not trivially equal to zero. positive recurrent state. Start with X0 = i. to decide whether j will occur in a specific excursion. Theorem 15. we toss a coin with probability of heads gij > 0. that it sums up to one. Otherwise. Then j is positive Proof. Therefore. π [a] P = π [a] . It only remains to show that π [a] is a probability. Proof. R2 ) be the index of the first (respectively. This is what we will try to do in the sequel. the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. But π [a] is simply a multiple of ν [a] .i. π(y) = 1 y∈S x∈S have at least one solution. and the probability gij = Pi (Tj < ∞) that j occurs in a specific excursion is positive. then j occurs in the r-th excursion.In other words: If there is a positive recurrent state. i.

Let C be a communicating class. . or all states are null recurrent or all states are transient. In other words. . RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 . An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. But. An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. . R2 has finite expectation. we just need to show that R2 −R1 +1 ζ := r=1 Zr has finite expectation.i. . the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . . IRREDUCIBLE CHAIN r = 1. Then either all states in C are positive recurrent. where the Zr are i. Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞. by Wald’s lemma. An irreducible Markov chain is called transient if one (and hence all) states are transient. with the same law as Ti . 2. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. .R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . Corollary 10.d. .

we can take expectations of both sides: E as t → ∞. (You should think why this is OBVIOUS!) Notice that 1 This lack of communication is. Proof. an absorbing state). Then P (Ta < ∞) = x∈S In other words. indeed. By a theorem of Analysis (Bounded Convergence Theorem). we stress that a stationary distribution may not be unique. But. Fix a ground state a. Consider the Markov chain (Xn . for totally trivial reasons (it is. after all.16 Uniqueness of the stationary distribution At the risk of being too pedantic. what prohibits uniqueness. x∈S By (13). we start the Markov chain from the stationary distribution π. as t → ∞. π(2) = 1 − α 2. (18) π(x)Px (Ta < ∞) = π(x) = 1. 1(Xn = x) n=1 = 1 t t P (Xn = x) = π(x). is a stationary distribution. recognising that the random variables on the left are nonnegative and bounded below 1. x ∈ S. we have. by (18). Then. there is a stationary distribution. state 1 (and state 2) is positive recurrent. Consider positive recurrent irreducible Markov chain. Theorem 16. x ∈ S. for all x ∈ S. the π defined by π(0) = α. E 1 t t 1 t t 1(Xn = x) n=1 → π [a] (x). Let π be a stationary distribution. But observe that there are many–infinitely many–stationary distributions: For each 0 ≤ α ≤ 1. π(1) = 0. Then there is a unique stationary distribution. Example: Consider the chain we’ve seen before: 1 1/3 1 0 1/2 1 1/6 2 Clearly. n=1 43 . Hence. Ta −1 1 t t Ea 1(Xk = x) k=0 1(Xn = x) n=1 → E a Ta = π [a] (x). for all n. following Theorem 14. Since the chain is irreducible and positive recurrent we have Pi (Tj < ∞) = 1. for all states i and j. n ≥ 0) and suppose that P (X0 = x) = π(x). with probability one. we have P (Xn = x) = π(x).

the only stationary distribution for the restricted chain because C is irreducible. x ∈ S. E a Ta Proof. Consider the restriction of the chain to C. y ∈ C. Consider a positive recurrent irreducible Markov chain with stationary distribution π. from the definition of π [a] .e. Then π [a] = π [b] . for all a ∈ S. Let i be such that π(i) > 0. Let C be the communicating class of i. 1 π(a) = . Hence π = π [a] . The following is a sort of converse to that. Then any state i such that π(i) > 0 is positive recurrent. which is useful because it is often used as a criterion for positive recurrence: Corollary 13. Proof. delete all transition probabilities pxy with x ∈ C. By the 1 above corollary. Therefore Ei Ti < ∞. The cycle formula merely re-expresses this equality. Consider a positive recurrent irreducible Markov chain and two states a. Assume that a stationary distribution π exists. an arbitrary stationary distribution π must be equal to the specific one π [a] . π(i|C) = Ei Ti . We can easily see that we thus obtain a Markov chain with state space C with stationary distribution π(·|C). Thus. In particular. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π C 44 . Hence there is only one stationary distribution. Let π(·|C) be the distribution defined by restricting π to C: π(x|C) := π(x) . But π(i|C) > 0. we obtain the cycle formula: Ta −1 Tb −1 Ea k=0 1(Xk = x) E a Ta = Eb k=0 1(Xk = x) E b Tb . where π(C) = y∈C π(y). There is a unique stationary distribution. Corollary 12. This is in fact. so they are equal.Therefore π(x) = π [a] (x). i. Proof. π(C) x ∈ C. π(a) = π [a] (a) = 1/Ea Ta . There is only one stationary distribution. Consider a Markov chain. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. Then. Both π [a] and π[b] are stationary distributions. In particular. Corollary 11. b ∈ S. 17 Structure of a stationary distribution It is not hard to show now what an arbitrary stationary distribution looks like. Decompose it into communicating classes. i is positive recurrent. Consider an arbitrary Markov chain. Namely. for all x ∈ S.

take an = (−1)n . consider the motion of a pendulum under the action of gravity. A Markov chain is a simple model of a stochastic dynamical system. There is dissipation of energy due to friction and that causes the settling down. from physical experience.1 Coupling and stability Definition of stability Stability refers to the convergence of the probabilities P (Xn = x) as n → ∞. where π(C) := π(C) π(x). A sequence of real numbers an k=1 1 may not converge but its Cesaro averages n (a1 + · · · + an ) may converge. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or. for example. This is defined by picking a state (any state!) a ∈ C and letting π C := π [a] . Let π be an arbitrary stationary distribution. Any stationary distribution π is necessarily of the form π= C: C positive recurrent communicating class αC π C . π(x|C) ≡ π C (x). more generally. x∈C Then (π(x|C). the pendulum will perform an 45 . We all know. x ∈ C) is a stationary distribution. Proof. the so-called Cesaro 1 averages n n P (Xk = x) converge to π(x). under irreducibility and positive recurrence conditions. So far we have seen. that it remains in a small region). For example. This is the stable position. By our uniqueness result. that the pendulum will swing for a while and. where αC ≥ 0. An arbitrary stationary distribution π must be a linear combination of these π C . In an ideal pendulum. Why stability? There are physical reasons why we are interested in the stability of a Markov chain. Proposition 3. For each such C define the conditional distribution π(x|C) := π(x) . eventually. So we have that the above decomposition holds with αC = π(C). 18 18. If π(x) > 0 then x belongs to some positive recurrent class C. it will settle to the vertical position. such that C αC = 1. without friction. as n → ∞. that.which assigns positive probability to each state in C. To each positive recurrent communicating class C there corresponds a stationary distribution π C .

ξ1 . The thing to observe here is that there is an equilibrium point i. 4 3 2 x* 1 0 0 1 2 t 3 t4 If we start with x(t = 0) = x∗ . The simplest differential equation As another example. We shall assume that the ξn have the same law. x ∈ R. We say that x∗ is a stable equilibrium. let us consider a stochastic analogue of the above in discrete time: Xn+1 = ρXn + ξn . so I’ll leave it upon the reader to decide whether he or she wants to consider this as an financial example. simple because the ξn ’s depend on the time parameter n. But what does stability mean? In general. it cannot mean convergence to a fixed equilibrium point. and define a stochastic sequence X1 . Since we are prohibited by the syllabus to speak about continuous time stochastic systems. . ξ2 . A linear stochastic system We now pass on to stochastic systems. we here assume that X0 . However notice what happens when we solve the recursion. Specifically. .oscillatory motion ad infinitum. . ξ0 . we may it is stable because it does not (it cannot!) swing too far away. ······ 46 . If. we have x(t) → x∗ . . moreover. a > 0. or an example in physics. . mutually independent. then regardless of the starting position. then x(t) = x∗ for all t > 0. are given.e. X2 . Still. random variables. the one found by setting the right-hand side equal to zero: x∗ = b/a. . or whatever. as t → ∞. by means of the recursion above. ˙ There are myriads of applications giving physical interpretation to this. consider the following differential equation x = −ax + b. for we can do it: X1 = ρX0 + ξ0 X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0 ) + ξ1 = ρ2 X0 + ρξ0 + ξ1 X3 = ρX2 + ξ2 = ρ3 X0 + ρ2 ξ0 + ρξ1 + ξ2 Xn = ρn X0 + ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 . It follows that |ρ| < 1 is the condition for stability. And we are in a happy situation here.

In particular. 18. under nice conditions. but notice that if we let ˆ Xn := ρn−1 ξn−1 + ρn−2 ξn−2 + · · · + ρξ1 + ξ0 . converges. positive recurrent and aperiodic with (unique) stationary distribution π then n→∞ lim P (Xn = x) = π(x). uniformly in x. while the limit of Xn does not exists. as n → ∞. In other words. g(y) = P (Y = y). What this means is that. Coupling refers to the various methods for constructing this joint law. suppose you are given the laws of two random variables X. for any initial distribution. i. j ∈ S. we know f (x) = P (X = x). But suppose that we have some further requirement. for instance. How should we then define what P (X = x. Y = y) = P (X = x)P (Y = y).2 The fundamental stability theorem for Markov chains Theorem 17. which is a linear combination of ξ0 . Y = y) is? 47 . The second term. . 18. ˜ Xn := ρn−1 ξ0 + ρn−2 ξ1 + · · · + ρξn−2 + ξn−1 .Since |ρ| < 1. to try to make the coincidence probability P (X = Y ) as large as possible. Y = y) is. but we are not given what P (X = x. If a Markov chain is irreducible. define stability in a way other than convergence of the distributions of the random variables. Y but not their joint law. to a matrix with rows all equal to π: (n) lim p n→∞ ij = π(j).3 Coupling To understand why the fundamental stability theorem works.g. Starting at an elementary level. A straightforward (often not so useful) construction is to assume that the random variables are independent and define P (X = x. we have that the first term goes to 0 as n → ∞. we need to understand the notion of coupling. (e. . for time-homogeneous Markov chains. we have ˜ ˆ P (Xn ≤ x) = P (Xn ≤ x). assume that the ξn are bounded) Xn does converge to ˆ X∞ = k=0 ρk ξk . ∞ ˆ The nice thing is that. the n-step transition matrix Pn . . ξn−1 does not converge. . the limit of its distribution does: ˜ ˆ ˆ lim P (Xn ≤ x) = lim P (Xn ≤ x) = lim P (Xn ≤ x) = P (X∞ ≤ x) n→∞ n→∞ n→∞ This is precisely the reason why we cannot.

x ∈ S. Note that it makes sense to speak of a meeting time. for all n ≥ 0. Suppose that we have two stochastic processes as above and suppose that they have been coupled.Exercise: Let X. Find the joint law h(x. i. n ≥ 0. n ≥ 0) and the law of a stochastic process Y = (Yn . we can write it as sup |P (Xn = x) − P (Yn = x)| ≤ P (T > n). f (x) = y h(x. n < T ) + P (Xn = x. i. We have P (Xn = x) = P (Xn = x. Y be random variables with values in Z and laws f (x) = P (X = x). |P (Xn = x) − P (Yn = x)| ≤ P (T > n).e. y). y) = P (X = x. respectively. and all x ∈ S. but with the roles of X and Y interchanged. Repeating (19). A coupling of them refers to a joint construction on the same probability space.e. and which maximises P (X = Y ). This T is a random variables that takes values in the set of times or it may take values +∞. precisely because of the assumption that the two processes have been constructed together. n ≥ 0. we obtain P (Yn = x) − P (Xn = x) ≤ P (T > n). Combining the two inequalities we obtain the first assertion: |P (Xn = x) − P (Yn = x)| ≤ P (T > n). Let T be a meeting of two coupled stochastic processes X. and hence P (Xn = x) − P (Yn = x) ≤ P (T > n). Since this holds for all x ∈ S and since the right hand side does not depend on x. Then. Y = y) which respects the marginals. uniformly in x ∈ S. The coupling we are interested in is a coupling of processes. n ≥ 0). If. g(y) = P (Y = y). (19) as n → ∞. The following result is the so-called coupling inequality: Proposition 4. T is finite with probability one. g(y) = x h(x. in particular. Proof. We are given the law of a stochastic process X = (Xn . on the event that the two processes never meet. Y . P (T < ∞) = 1. y). but we are not told what the joint probabilities are. n ≥ T ) = P (Xn = x. x∈S 48 . n < T ) + P (Yn = x. n ≥ T ) ≤ P (n < T ) + P (Yn = x). We say that ransom time T is meeting time of the two processes if Xn = Yn for all n ≥ T. then |P (Xn = x) − P (Yn = x)| → 0.

as usual. y) ∈ S × S. (x. y) Its n-step transition probabilities are q(x. We know that there is a unique stationary distribution.4 Proof of the fundamental stability theorem First. Xn and Yn would be independent and distributed according to π. we can define joint probabilities. n ≥ 0) is a Markov chain with state space S × S. denoted by Y = (Yn .y′ ) = px. x∈S as n → ∞. . π(x) > 0 for all x ∈ S. denoted by X = (Xn .x′ ).y′ . 18. in other words. It is very important to note that we have only defined the law of X and the law of Y but we have not defined a joint law. we have that px. Having coupled them.x′ ).(y. review the assumptions of the theorem: the chain is irreducible. n ≥ 0) is an irreducible chain. Then n→∞ lim P (T > n) = P (T = ∞) = 0. sup |P (Xn = x) − P (Yn = x)| → 0. and so (Wn . had we started with both X0 and Y0 independent and distributed according to π. Notice that σ(x. Y0 ) has distribution P (W0 = (x. y) = µ(x)π(y).y′ ) := P Wn+1 = (x′ . Xm = y). is a stationary distribution for (Wn . positive recurrent and aperiodic. n ≥ 0) is independent of Y = (Yn . Let pij denote.y′ . y ′ ) | Wn = (x. Its (1-step) transition probabilities are q(x. n ≥ 0). Its initial state W0 = (X0 . From Theorem 3 and the aperiodicity assumption. Then (Wn .x′ py. and it also makes sense to consider their first meeting time: T := inf{n ≥ 0 : Xn = Yn }.Assume now that P (T < ∞) = 1. (Indeed.x′ > 0 and py.y′ for all large n. We shall consider the laws of two processes. they differ only in how the initial states are chosen. The first process. 49 (n) (n) (n) (n) (n) (n) = P (Xn+1 = x′ | Xn = x) P (Yn+1 = y ′ | Yn = y) = px. We now couple them by doing the straightforward thing: we assume that X = (Xn .x′ ). then.(y. denoted by π. n ≥ 0).) By positive recurrence. n ≥ 0) has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. Consider the process Wn := (Xn . we have not defined joint probabilities such as P (Xn = x.(y. implying that q(x. Therefore. y) := π(x)π(y). The second process. for all n ≥ 1. Yn ).y′ ) for all large n.x′ py. n ≥ 0) has the law of a Markov chain with transition probabilities pij and Y0 distributed according to the stationary distribution µ. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities. the transition probabilities.

we have |P (Xn = x) − π(x)| ≤ P (T > n). 0). Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X and if Y is an independent copy of X. However. Since P (Zn = x) = P (Xn = x).0).(0. T is a meeting time between Z and Y . precisely because P (T < ∞) = 1. Z0 = X0 which has an arbitrary distribution µ. then the process W = (X. (1.0) = q(1. Hence (Zn . Clearly. if we modify the original chain and make it aperiodic by. ≥ 0) is a Markov chain with transition probabilities pij . (Zn .1) = q(1. for all n and x. n ≥ 0). 1). X arbitrary distribution Z stationary distribution Y T Thus.01. For example. 1} and the pij are p01 = p10 = 1. say. after the meeting time. Define now a third process by: Zn := Xn 1(n < T ) + Yn 1(n ≥ T ).(0. n ≥ 0) is identical in law to (Xn .1). by assumption. suppose that S = {0. 50 .1). P (Xn = x) = P (Zn = x).0).1) = 1. p10 = 1. Y ) may fail to be irreducible. (0. 0).(1. |P (Zn = x) − P (Yn = x)| ≤ P (T > n). y) > 0 for all (x. Zn equals Xn before the meeting time of X and Y . Obviously. 1)} and the transition probabilities are q(0. In particular.Therefore σ(x. y) ∈ S × S. uniformly in x. Then the product chain has state space S ×S = {(0. n ≥ 0) is positive recurrent.(1. (1. p01 = 0. Hence (Wn . Moreover. which converges to zero. P (T < ∞) = 1. By the coupling inequality. This is not irreducible. Zn = Yn .99. as n → ∞.0) = q(0. In particular. p00 = 0. and since P (Yn = x) = π(x).

from Cd back to C0 . p01 = 0. and this is expressed by our terminology.497 18. Indeed.01. C1 .99. n→∞ (n) where π(0). Therefore p00 = 1 for even n (n) and = 0 for odd n. we can decompose the state space into d classes C0 . Then Xn = 0 for even n and = 1 for odd n. By the results of §5.503 0. n→∞ (n) (n) lim p n→∞ 11 = lim p01 = π(1). i. Thus. per se wrong. A Markov chain which is irreducible and positive recurrent and aperiodic will be called mixing. in particular. Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. In matrix notation. It is not difficult to see that 6 We put “wrong” in quotes. 1981).e. By the results of Section 14. Cd−1 . from C1 to C2 . 6 The reason is that the term “ergodic” means something specific in Mathematics (c. However. π(1) satisfy 0. .503. It is. p10 = 1. Theorem 4. then (n) lim p n→∞ 00 (n) = lim p10 = π(0). Terminology cannot be. 51 . however. and so on. . such that the chain moves cyclically between them: from C0 to C1 .497 . π(0) ≈ 0.then the product chain becomes irreducible.503 0. . 18.99π(0) = π(1). π(0) + π(1) = 1.6 Periodic chains Suppose now that we have a positive recurrent irreducible chain. .f. n→∞ lim p00 p01 (n) (n) p10 p11 (n) (n) = 0. we know we have a unique stationary distribution π and that π(x) > 0 for all states x. if we modify the chain by p00 = 0. Suppose that the period of the chain is d.5 Terminology A Markov chain which is irreducible and positive recurrent will be called ergodic. positive recurrent and aperiodic Markov chain. limn→∞ p00 does not exist.4 and.497. Suppose X0 = 0. The reader is asked to pay attention to this terminology which is different from the one used in the Markov chain literature. Parry. customary to have a consistent terminology. 0. most people use the term “ergodic” for an irreducible. π(1) ≈ 0. But this is “wrong”.

each one must be equal to 1/d. We have that π satisfies the balance equations: π(i) = j∈S π(j)pji . . after reduction by d) and. the original chain enters Cℓ in r = ℓ − k steps (where r is taken to be a number between 0 and d − 1. π(i) = 1/d. Suppose that X0 ∈ Cr . and let r = ℓ − k (modulo d). 52 . All states have period 1 for this chain. as n → ∞. n ≥ 0) is dπ(i). The chain (Xnd . → dπ(j). namely C0 . Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck . j ∈ Cℓ . Proof. i∈Cr for all r = 0. j ∈ Cℓ . Let i ∈ Ck . . . . We shall show that Lemma 9. d − 1. . i ∈ Cr . . Cd−1 . This means that: Theorem 18. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d. the (unique) stationary distribution of (Xnd . . . where π is the stationary distribution. i ∈ Cr . thereafter. i ∈ S. Let αr := i∈Cr π(i). Since (Xnd . as n → ∞. Then. n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain. Since the αr add up to 1. Suppose i ∈ Cr . starting from i. n ≥ 0) consists of d closed communicating classes. we have pij for all i. Hence if we restrict the chain (Xnd . Then pij (r+nd) (nd) = Pi (Xnd = j) → dπ(j).Lemma 8. if sampled every d steps. So π(i) = j∈Cr−1 π(j)pji . n ≥ 0) is aperiodic. Then pji = 0 unless j ∈ Cr−1 . Summing up over i ∈ Cr we obtain αr = i∈Cr π(i) = j∈Cr−1 π(j) i∈Cr pji = j∈Cr−1 π(j) = αr−1 . it remains in Cℓ . Then. j ∈ Cr . 1. and the previous results apply.

By construction. Since the numbers p1 n1 1 n1 2 (n) (n1 ) < < n1 < · · · . .). if j is transient. 2. Consider. The proof will be based on two results from Analysis. p2 . . for each The second result is Fatou’s lemma. p3 . say. . If j is a null recurrent state then pij → 0. . for all i ∈ S. 2. k = 1. Hence pi k i. such that limk→∞ pi k exists and equals. We have. k = 1. p1 . p(n) = (n) (n) (n) (n) (n) ∞ p1 . . . k = 1. . such that limk→∞ p1 k 3 (n1 ) numbers p2 k . . and so on. a probability distribution on N.e. whose proof we omit7 : 7 See Br´maud (1999). (nk ) k → pi . i. . the sequence n3 of the third row is a subsequence k k of n2 of the second row. 2. 2. e 53 . Then we can find a i=1 pi sequence n1 < n2 < n3 < · · · of integers such that limk→∞ pi Proof. for each i. . say..19 Limiting theory for Markov chains* We wish to complete our discussion about the possible limits of the n-step transition prob(n) abilities pij . as k → ∞. where = 1 and pi ≥ 0 for all i. . p2 . So the diagonal subsequence (nk . . Now consider the contained between 0 and 1 there must be a exists and equals. . For each i we have found subsequence ni . .. p1 p1 (n1 ) 1 p1 (n1 ) 2 (n2 ) 2 p1 (n1 ) 3 (n2 ) 3 (n3 ) 3 (n2 ) 1 (n3 ) 1 ··· ··· ··· ··· → → → ··· p1 p2 p3 ··· p1 p1 ··· p1 p1 ··· (n3 ) 2 p1 ··· Now pick the numbers in the diagonal. schematically.) is a k k subsequence of each (ni . Recall that we know what happens when j is positive recurrent and i is in the same commu(n) nicating class as j If the period is 1. If the period is d then the limit exists only along a selected subsequence (Theorem 18). pi . as n → ∞. 2. there must be a subsequence exists and equals. say. the sequence n2 of the second row k is a subsequence of n1 of the first row. then pij → 0 as n → ∞. . . are contained between 0 and 1. the first of which is known as HellyBray Lemma: Lemma 10. Since they are (n2 ) subsequence n2 of n1 such that limk→∞ p1 k k k (nk ) (n) as n → ∞. It is clear how we (ni ) k continue. for each n = 1. 19. We also saw in Corollary (n) 7 that. then pij → π(j) (Theorems 17).1 Limiting probabilities for null recurrent states Theorem 19. . k = 1. . . exists for all i ∈ N. for all i ∈ S.

(n′ ) lim pix k (n′ +1) ≥ y∈S k→∞ lim piy k pyx . By Corollary 13. i ∈ N). state j is positive recurrent: Contradiction. n = 1. k→∞ (n′ +1) = y∈S piy k pyx .Lemma 11. are sequences of nonnegative numbers then ∞ (n) yi lim inf xi i=1 n→∞ (n) ∞ ≤ lim inf n→∞ yi x i . we have that π(x) := αx y∈S αy satisfies the balance equations and has x∈S π(x) = 1. and this is a contradiction (the number αx = αx cannot be strictly larger than itself).e. Let j be a null recurrent state. 54 . But Pn+1 = Pn P. y∈S Since x∈S αx ≤ 1. pix k Therefore. x ∈ S . x ∈ S. pix k . Suppose that for some i. . the sequence (n) pij does not converge to 0. If (yi . . This would imply that αx > x∈S x∈S y∈S x∈S αy pyx = y∈S αy x∈S pyx = y∈S αy .e. Thus. by assumption. Suppose that one of them is not. for some x0 ∈ S. y∈S x ∈ S. x∈S (n′ ) Since aj > 0.e. (xi . (n ) (n ) Consider the probability vector By Lemma 10. pix k = 1. we can pick subsequence (n′ ) of (nk ) such that k k→∞ lim pix k = αx . i. i. and the last equality is obvious since we have a probability vector. 2. it is a strict inequality: αx0 > y∈S αy pyx0 . αy pyx . (n′ ) Hence αx ≥ αy pyx . i=1 (n) Proof of Theorem 19. Now π(j) > 0 because αj > 0. π is a stationary distribution. We wish to show that these inequalities are. we have 0< x∈S The inequality follows from Lemma 11. equalities. i ∈ N). i. αx ≤ lim inf k→∞ (n′ ) for all x ∈ S. Hence we can find a subsequence nk such that k→∞ lim pij k = αj > 0. by Lemma 11. actually. .

and fij = Pi (Tj < ∞). Hence (n) lim p n→∞ ij ∞ = π (j) m=1 C Pi (HC = m) = π C (j)Pi (HC < ∞). In general. we have pij = Pi (Xn = j) = Pi (Xn = j. Let j be a positive recurrent state with period 1. when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. the result can be obtained by the so-called Abel’s Lemma of Analysis. Example: Consider the following chain (0 < α. so we will only look that the aperiodic case. we have Pi (Xn = j | HC = m) = Pµ (Xn−m = j). From the Strong Markov Property.2. Indeed. as n → ∞.19. Let HC := inf{n ≥ 0 : Xn ∈ C}. Let µ be the distribution of XHC given that X0 = i and that {HC < ∞}. Let i be an arbitrary state. This is a more old-fashioned way of getting the same result. Proof. γ < 1): γ 1−γ 1 1 2 1−α α 3 β 4 1−β 5 1 55 . In this form. Remark: We can write the result of Theorem 20 as (n) lim p n→∞ ij = fij . If i also belongs to C then ϕC (i) = 1 and so the result follows from Theorem 17.2 The general case (n) It remains to see what the limit of pij is. But Theorem 17 tells us that the limit exists regardless of the initial distribution. as in §10. that is. starting with the first hitting time decomposition relation of Lemma 7. Let C be the communicating class of j and let π C be the stationary distribution associated with C. which is what we need. and fij = ϕC (j). HC ≤ n) n (n) = m=1 Pi (HC = m)Pi (Xn = j | HC = m). as n → ∞. Theorem 20. Then (n) lim p n→∞ ij = ϕC (i)π C (j). The result when the period j is larger than 1 is a bit awkward to formulate. Pµ (Xn−m = j) → π C (j). E j Tj where Tj = inf{n ≥ 1 : Xn = j}. Let ϕC (i) := Pi (HC < ∞). β. π C (j) = 1/Ej Tj .

The communicating classes are I = {3. 56 (20) x∈S . We now generalise this. we assumed the existence of a positive recurrent state. The stationary distributions associated with the two closed classes are: 1 γ π C1 (1) = . 1+γ 1+γ Let ϕ(i) be the probability that class C1 will be reached starting from state i. Pioneers in the are were. ϕ(4) = . information about its velocity. (n) (n) p43 → 0. The phase (or state) space of a particle is not just a physical space but it includes. 2} (closed class). Similarly. π C2 (5) = 1. C1 = {1. Let f be a reward function such that |f (x)|ν [a] (x) < ∞. (n) (n) p35 → (1 − ϕ(3))π C2 (5). And so is the Law of Large Numbers itself (Theorem 13). p42 → ϕ(4)π C1 (2). Theorem 21. for example. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its “phase space” is proportional to the volume of this region. whence 1−α β(1 − α) . in Theorem 14. C2 are aperiodic classes. (n) (n) p34 → 0. C2 = {5} (absorbing state). The subject of Ergodic Theory has its origins in Physics. ϕ(3) = p31 → ϕ(3)π C1 (1). Hence. 20 Ergodic theorems An ergodic theorem is a result about time-averages of random processes. ϕ(3) = (1 − α)ϕ(2) + αϕ(4) . ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. p44 → 0. the 19-th century French mathematicians Henri Poincar´ and e Joseph Liouville. as n → ∞. p41 → ϕ(4)π C1 (1). For example. ϕ(5) = 0. Consider the quantity ν [a] (x) defined as the average number of visits to state x between two successive visits to state a–see 16. from first-step analysis. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P (Ta < ∞) = 1. We will avoid to give a precise definition of the term ‘ergodic’ but only mention what constitutes a result in the area. We have ϕ(1) = ϕ(2) = 1. while. p45 → (1 − ϕ(4))π C2 (5). 1 − αβ 1 − αβ Both C1 . π C1 (2) = . among others. Theorem 14 is a type of ergodic theorem. 4} (inessential states). Recall that. (n) (n) p33 → 0. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science. (n) (n) p32 → ϕ(3)π C1 (2). ϕ(4) = (1 − β)ϕ(5) + βϕ(3).

(21) f (x)ν [a] (x). From proposition 2 we know that the balance equations ν = νP have at least one solution. But this is precisely the condition (20). respectively. then. so the numerator converges to f /Ea Ta . in the former. we have that the denominator equals Nt /t and converges to 1/Ea Ta . So there are always recurrent states. We only need to check that EG1 = f . So if we divide by t the numerator and denominator of the fraction in (21). we assumed positive recurrence of the ground state a. t t→∞ t t t with probability one. 21 Finite chains We take a brief look at some things that are specific to chains with finitely many states. Remark 1: The difference between Theorem 14 and 21 is that. we obtain the result. Now. 57 . Glast are the total rewards over the first and t t a a last incomplete cycles. as we saw in the proof of Theorem 14. Note that the denominator t n=0 1(Xn = a) equals the number Nt of visits to state a up to time t. and this justifies the terminology of §18. t t where Gr = T (r) ≤t<T (r+1) f (Xt ). Proof. which is the quantity denoted by f in Theorem 14. we break the total reward as follows: t Nt −1 f (Xn ) = n=0 r=1 Gr + Gfirst + Glast .5. we have C := i∈S ν(i) < ∞. we have t→∞ lim 1 last 1 G = lim Gfirst = 0. The Law of Large Numbers tells us that 1 lim n→∞ n n−1 Gr = EG1 . The reader is asked to revisit the proof of Theorem 14. As in the proof of Theorem 14. If so. Replacing n by the subsequence Nt . Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem Theorem 14 holds. and this is left as an exercise. Then at least one of the states must be visited infinitely often. since the number of states is finite. As in Theorem 14. while Gfirst . r=0 with probability 1 as long as E|G1 | < ∞.Then P where f := x∈S t→∞ lim t n=0 t n=0 f (Xn ) 1(Xn = a) =f =1 . we have Nt /t → 1/Ea Ta .

Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. if we film a standing man who raises his arms up and down successively. . in addition. that is. . Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. in a sense. n ≥ 0). . X1 . . At least some state i must have π(i) > 0. running backwards. X0 ). for all n. because positive recurrence is a class property (Theorem 15). i. But if we film a man who walks and run the film backwards. by Theorem 16. the chain is irreducible. 22 Time reversibility Consider a Markov chain (Xn . like playing a film backwards. X1 ) has the same distribution as (X1 . a finite Markov chain always has positive recurrent states. In addition. . the stationary distribution is unique. Then the chain is positive recurrent with a unique stationary distribution π. It is. X0 ) for all n. the first components of the two vectors must have the same distribution. Xn ) has the same distribution as (Xn . X0 ). we see that (X0 . We say that it is time-reversible. it will look the same. . Then the chain is reversible if and only if the detailed balance equations hold: π(i)pij = π(j)pji . where P is the transition matrix and Π is a matrix all the rows of which are equal to π. . We summarise the above discussion in: Theorem 22.e. X1 . . reversible if it has the same distribution when the arrow of time is reversed. Xn has the same distribution as X0 for all n. C satisfies the balance equations and the normalisation condition i∈S π(i) = 1. Therefore π is a stationary distribution. Proof. Because the process is Markov. For example. X0 = j). . or. Reversibility means that (X0 . Theorem 23. Suppose first that the chain is reversible. and run the film backwards. If. (X0 . . we have Pn → Π. P (X0 = i. then we instantly know that the film is played in reverse. Consider an irreducible Markov chain with finitely many states. Xn−1 . . By Corollary 13. Xn ) has the same distribution as (Xn . More precisely. π(i) := Hence. then all states are positive recurrent. in addition. Proof. the chain is aperiodic. A reversible Markov chain is stationary.Hence 1 ν(i). . without being able to tell that it is. as n → ∞. j ∈ S. Xn−1 . . . Taking n = 1 in the definition of reversibility. actually. . i. . X1 = j) = P (X1 = i. 58 . simply. In particular. If. Lemma 12. i ∈ S. this means that it is stationary. this state must be positive recurrent.

. then the chain cannot be reversible. . X2 = k) = P (X0 = k. for all i. X1 = j. X0 ). π(i)pij pjk = [π(j)pji ]pjk = [π(j)pjk ]pji = π(k)pkj pji . k. Conversely. The same logic works for any m and can be worked out by induction. So the detailed balance equations (DBE) hold. . These are the DBE. X1 = j. π(i)pij pjk pki = [π(j)pji ]pjk pki = [π(j)pjk ]pji pki = [π(k)pkj ]pji pki = [π(k)pki ]pji pkj = [π(i)pik ]pji pkj = π(i)pik pkj pji Since the chain is irreducible. and hence (X0 . using DBE. In other words. j. Hence the DBE hold. . . then. X0 ). P (X0 = j. We will show the Kolmogorov’s loop criterion (KLC). X1 . The DBE imply immediately that (X0 . X1 = i) = π(j)pji . . X2 = i). and pi1 i2 (m−3) → π(i2 ). Pick 3 states i. and write. we have pi2 i1 implying that π(i1 )pi1 i2 = π(i2 )pi2 i1 . we have π(i) > 0 for all i. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Summing up over all choices of states i3 . X1 ) has the same distribution as (X1 . So the chain is reversible. . Theorem 24. Suppose first that the chain is reversible.for all i and j. . . Hence cancelling the π(i) from the above display we obtain the KLC for m = 3. . X0 ). using the DBE. But P (X0 = i. i4 . (X0 . . and the chain is reversible. Xn−1 . . Conversely. Now pick a state k and write. . j. By induction. suppose that the KLC (22) holds. X1 . i2 . suppose the DBE hold. k. pi1 i2 pi2 i3 · · · pim−1 im = pi1 im pim im−1 · · · pi2 i1 . pji 59 (m−3) → π(i1 ). . Then the chain is reversible if and only if the Kolmogorov’s loop criterion holds: for all i1 . 2) The DBE are equivalent to saying that the ratio pij /pji can be written as a product of a function of i and a function of j. X2 ) has the same distribution as (X2 . . and this means that P (X0 = i. . we have separation of variables: pij = f (i)g(j). X1 . Xn ) has the same distribution as (Xn . (22) Proof. letting m → ∞. . for all n. im−1 we obtain (m−3) (m−3) pi1 i2 pi2 i1 = pi1 i2 pi2 i1 If the chain is aperiodic. Notes: 1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. if there is an arrow from i to j in the graph of the chain but no arrow from j to i. X1 = j) = π(i)pij . im ∈ S and all m ≥ 3. it is easy to show that.

The reason is as follows. and by replacing. pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . There is obviously only one arrow from AtoAc . Example 1: Chain with a tree-structure. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. by assumption. there isn’t any. f (i)g(i) must be 1 (why?). and the arrow from j to j is the only one from Ac to A. so we have pij g(j) = . Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. So g satisfies the balance equations.1) which gives π(i)pij = π(j)pji . that the chain is positive recurrent. Ac be the sets in the two connected components. Ac ) = F (Ac .) Necessarily. then the chain is reversible. the two oriented edges from i to j and from j to i by a single edge without arrow. 60 . in addition. for each pair of distinct states i. consider the chain: Replace it by: This is a tree. Pick two distinct states i. For example. j for which pij > 0. And hence π can be found very easily. there would be a loop. the detailed balance equations. and. meaning that it has no loops. If we remove the link between them. to guarantee. A) (see §4. j for which pij > 0. i. The balance equations are written in the form F (A. the graph becomes disconnected. (If it didn’t.) Let A. Hence the chain is reversible. then we need. using another method.e. namely the arrow from i to j. Hence the original chain is reversible.(Convention: we form this ratio only when pij > 0. If the graph thus obtained is a tree. Form the graph by deleting all self-loops. • If the chain has infinitely many states.

0. where C is a constant. i. i∈S Since δ(i) = 2|E|. if j is a neighbour of i otherwise. The stationary distribution assigns to a state probability which is proportional to its degree. the DBE hold. p02 = 1/4. etc. Hence we conclude that: 1. so both members are equal to C. In all cases. Now define the the following transition probabilities: pij = 1/δ(i). Suppose the graph is connected. Consider an undirected graph G with vertices S. we have that C = 1/2|E|.Example 2: Random walk on a graph. 2. and a set E of undirected edges. a finite set. j are not neighbours then both sides are 0. We claim that π(i) = Cδ(i). π(i)pij = π(j)pji . Each edge connects a pair of distinct vertices (states) without orientation. To see this let us verify that the detailed balance equations hold. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. There are two cases: If i. The stationary distribution of a RWG is very easy to find.e. For example. π(i) = 1. For each state i define its degree δ(i) = number of neighbours of i. States j are i are called neighbours if they are connected by an edge. The constant C is found by normalisation. A RWG is a reversible chain. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. 61 . If i. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). i∈S where |E| is the total number of edges.

t ≥ 0) be independent from (Xn ) with S0 = 0.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. if nk = n0 + km. if. in general. k = 0.e. .) has the Markov property but it is not time-homogeneous. 23. 62 n ∈ N. 2. then it is a stationary distribution for the new chain. . 1. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . . The state space of (Yr ) is A. . . . . 2. random variables with values in N and distribution λ(n) = P (ξ0 = n). i.e. where ξt are i. Let A be (r) a set of states and let TA be the r-th visit to A.e. . . . 1. The sequence (Yk . this process is also a Markov chain and it does have time-homogeneous transition probabilities. . k = 0. t ≥ 0. k = 0. Due to the Strong Markov Property. . 2. A r = 1. . St+1 = St + ξt . . k = 0. n ≥ 0) be Markov with transition probabilities pij . . in addition.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ).23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). If the subsequence forms an arithmetic progression. 2. i. Let (St . Define Yr := XT (r) . then (Yk . In this case. where pij are the m-step transition probabilities of the original chain.3 Subordination Let (Xn . . how can we create new ones? 23.i. . irreducible and positive recurrent). π is a stationary distribution for the original chain. 2. 1.d. (m) (m) 23. 1.

that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). . and the elements of S ′ by α. (Notation: We will denote the elements of S by i. i. knowledge of f (Xn ) implies knowledge of Xn and so. β. j.. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0.4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . . . Yn−1 = αn−1 }. Yn−1 ) = P (Yn+1 = β | Yn = α). a Markov chain. . We need to show that P (Yn+1 = β | Yn = α. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . .e. in general.Then Yt := XSt . n≥0 63 . α1 . If.. . Proof. we have: Lemma 13. is NOT. conditional on Yn the past is independent of the future. Fix α0 . then the process Yn = f (Xn ). . Then (Yn ) has the Markov property.) More generally. . . ∞ λ(n)pij . f is a one-to-one function then (Yn ) is a Markov chain. . . . β). however. . . (n) 23. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). Indeed.

β) i∈S P (Yn = α|Xn = i. Thus (Yn ) is Markov with transition probabilities Q(α. i ∈ Z. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. indeed. α = 0. because the probability of going up is the same as the probability of going down. if β = 1. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Then (Yn ) is a Markov chain. β) P (Yn = α|Xn = i. β) P (Yn = α. P (Xn+1 = i ± 1|Xn = i) = 1/2. β). Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. if i = 0. Example: Consider the symmetric drunkard’s random walk.e.for all choices of the variables involved. β)P (Xn = i | Yn = α. Yn−1 ) Q(f (i). for all i ∈ Z and all β ∈ Z+ . Yn−1 ) Q(f (i). But we will show that P (Yn+1 = β|Xn = i). and if i = 0. β) P (Yn = α|Yn−1 ) i∈S = Q(α. if β = α ± 1. To see this. 64 . regardless of whether i > 0 or i < 0. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). a Markov chain with transition probabilities Q(α. Yn = α. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. conditional on {Xn = i}. β). α > 0 Q(α. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. depends on i only through |i|. i ∈ Z. then |Xn+1 | = 1 for sure. Indeed. observe that the function | · | is not one-to-one. and so (Yn ) is. β ∈ Z+ . β). i. In other words. Yn−1 )P (Xn = i | Yn = α. Define Yn := |Xn |. On the other hand. β) := 1. We have that P (Yn+1 = β|Xn = i) = Q(|i|. But P (Yn+1 = β | Yn = α. β). if we let 1/2.

each individual gives birth to at one child with probability p or has children with probability 1 − p. if 0 is never reached then Xn → ∞. Thus. one question of interest is whether the process will become extinct or not.i. . since ξ (0) . which has a binomial distribution: i j pij = p (1 − p)i−j . In other words. Let ξk be the number of offspring of the k-th individual in the n-th generation. If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. k = 0. 1. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. ξ (n) ). if the current population is i there is probability pi that everybody gives birth to zero offspring. . Hence all states i ≥ 1 are inessential. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. if the 65 (n) (n) . with probability 1. . It is a Markov chain with time-homogeneous transitions. 0 ≤ j ≤ i.24 24. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring. p0 = 1 − p. Back to the general case. the population cannot grow: it will eventually reach 0. So assume that p1 = 1. . . ξ2 . and. . so we will find some different method to proceed. Hence all states i ≥ 1 are transient.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. 0 But 0 is an absorbing state. and following the same distribution. Example: Suppose that p1 = p. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. 2. But here is an interesting special case. We find the population (n) at time n + 1 by adding the number of offspring of each individual. independently from one another. We let pk be the probability that an individual will bear k offspring. then we see that the above equation is of the form Xn+1 = f (Xn . Then P (ξ = 0) + P (ξ ≥ 2) > 0. and 0 is always an absorbing state. Therefore. The state 0 is irrelevant because it cannot be reached from anywhere. If we let ξ (n) = ξ1 . a hard problem. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is.d. (and independent of the initial population size X0 –a further assumption).. therefore transient. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . are i. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. j In this case. . we have that (Xn . . ξ (1) . The parent dies immediately upon giving birth. in general. . n ≥ 0) has the Markov property.

when we start with X0 = i. 66 . so P (D|X0 = i) = εi . Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. We then have that D = D1 ∩ · · · ∩ Di . Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. Obviously. . We wish to compute the extinction probability ε(i) := Pi (D). Proof. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. and P (Dk ) = ε. For each k = 1. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. i. Therefore the problem reduces to the study of the branching process starting with X0 = 1. Indeed. each one behaves independently of one another. we can argue that ε(i) = εi . and. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. . The result is this: Proposition 5. Let ε be the extinction probability of the process. Indeed. if we start with X0 = i in ital ancestors. This function is of interest because it gives information about the extinction probability ε.population does not become extinct then it must grow to infinity. since Xn = 0 implies Xm = 0 for all m ≥ n. . Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. . where ε = P1 (D). If µ ≤ 1 then the ε = 1. ε(0) = 1. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . ϕn (0) = P (Xn = 0). For i ≥ 1. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies.

But ϕn+1 (0) → ε as n → ∞. Since both ε and ε∗ satisfy z = ϕ(z). Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). This means that ε = 1 (see left figure below). then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. and so on. Also ϕn (0) → ε. since ϕ is a continuous function. In particular. z=1 Also. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). in fact. Notice now that µ= d ϕ(z) dz . the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. if the slope µ at z = 1 is ≤ 1. ϕn+1 (z) = ϕ(ϕn (z)). and since the latter has only two solutions (z = 0 and z = ε∗ ). recall that ϕ is a convex function with ϕ(1) = 1. Therefore ε = ϕ(ε). (See right figure below. Let us show that. On the other hand. By induction. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . all we need to show is that ε < 1. and. ϕ(ϕn (0)) → ϕ(ε). ϕn (0) ≤ ε∗ .) 67 . Therefore ε = ε∗ . We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). But 0 ≤ ε∗ . ϕn+1 (0) = ϕ(ϕn (0)).Note also that ϕ0 (z) = 1. if µ > 1. Therefore. ε = ε∗ . So ε ≤ ε∗ < 1. But ϕn (0) → ε.

Abandoning this idea. there is no time to make a prior query whether a host has something to transmit or not. there are x + a − 1 packets. 6. Since things happen fast in a computer network. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. otherwise. 0). then time slots 0. 1. During this time slot assume there are a new packets that need to be transmitted.. 1. Let s be the number of selected packets. 2. there are x + a packets.d. we let hosts transmit whenever they like. is that if more than one hosts attempt to transmit a packet. 68 . . Ethernet has replaced Aloha.i. using the same frequency. then max(Xn + An − 1. Then ε < 1. are allocated to hosts 1. One obvious solution is to allocate time slots to hosts in a specific order. We assume that the An are i. random variables with some common distribution: P (An = k) = λk . . 3. say. The problem of communication over a common channel. and Sn the number of selected packets. 2. An the number of new packets on the same time slot. Then ε = 1. on the next times slot. then the transmission is not successful and the transmissions must be attempted anew. Nowadays. 24.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). If s = 1 there is a successful transmission and. periodically. . Case: µ > 1. there is a collision and. Xn+1 = Xn + An . But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. 4. So. 2. 3 hosts. k ≥ 0. So if there are. in a simplified model. Thus.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). on the next times slot. if Sn = 1. If s ≥ 2. . If collision happens. for a packet to go through. there is collision and the transmission is unsuccessful. 5. 3. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. only one of the hosts should be transmitting at a given time. . . 1. 3.

we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. We also let µ := k≥1 kλk be the expectation of An . where q = 1 − p. . so q x G(x) tends to 0 as x → ∞. The process (Xn . we have ∆(x) = (−1)px.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . Sn = 1|Xn = x) = λ0 xpq x−1 . The random variable Sn has. Let us find its transition probabilities. k 0 ≤ k ≤ x + a. So: px. To compute px. Indeed. Therefore ∆(x) ≥ µ/2 for all large x.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k .x are computed by subtracting the sum of the above from 1. So we can make q x G(x) as small as we like by choosing x sufficiently large. So P (Sn = k|Xn = x.where λk ≥ 0 and k≥0 kλk = 1. The reason we take the maximum with 0 in the above equation is because.x−1 = P (An = 0. and so q x tends to 0 much faster than G(x) tends to ∞. Idea of proof.x−1 when x > 0. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. k ≥ 1.e. An = a. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). The probabilities px. where G(x) is a function of the form α + βx . Since p > 0. we have that the drift is positive for all large x and this proves transience. defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. we must have no new packets arriving. For x > 0.x+k for k ≥ 1. Proving this is beyond the scope of the lectures. To compute px. we think as follows: to have a reduction of x by 1.x−1 + k≥1 kpx. The main result is: Proposition 6. .) = x + a k x−k p q . 69 . and exactly one selection: px. i. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An .e. Xn−1 . . The Aloha protocol is unstable. we have q < 1. n ≥ 0) is a Markov chain. We here only restrict ourselves to the computation of this expected change. for example we can make it ≤ µ/2. then the chain is transient. selections are made amongst the Xn + An packets. Unless µ = 0 (which is a trivial case: no arrivals!). if Xn + An = 0. and p. we should not subtract 1. An−1 . the Markov chain (Xn ) is transient.

2) and let α pij := (1 − α)pij + . The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. For each page i let L(i) be the set of pages it links to. Its implementation though is one of the largest matrix computations in the world. Academic departments link to each other by hiring their 70 . Then it says: page i has higher rank than j if π(i) > π(j). Comments: 1. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph.24. We pick α between 0 and 1 (typically. There is a technique. It is not clear that the chain is irreducible. then i is a ‘dangling page’. unless i is a dangling page. that allows someone to manipulate the PageRank of a web page and make it much higher. PageRank.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. It uses advanced techniques from linear algebra to speed up computations. otherwise. All you need to do to get high rank is pay (a lot of money). 2. if j ∈ L(i)  pij = 1/|V |. Therefore. α ≈ 0. known as ‘302 Google jacking’. there are companies that sell high-rank pages. Define transition probabilities as follows:  1/|L(i)|. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. So we make the following modification. The idea is simple. the chain moves to a random page. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. If L(i) = ∅. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. because of the size of the problem. if L(i) = ∅   0. 3. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. in the latter case. the rank of a page may be very important for a company’s profits. neither that it is aperiodic. So if Xn = i is the current location of the chain. |V | These are new transition probabilities. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. So this is how Google assigns ranks to pages: It uses an algorithm. In an Internet-dominated society.

Thus. their offspring inherits an allele from each parent. We would like to study the genetic diversity. Imagine that we have a population of N individuals who mate at random producing a number of offspring. a gene controlling a specific characteristic (e. Since we don’t care to keep track of the individuals’ identities. the genetic diversity depends only on the total number of alleles of each type. this allele is of type A with probability x/2N . Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. This means that the external appearance of a characteristic is a function of the pair of alleles. all we need is the information outlined in this figure.faculty from each other (and from themselves). meaning how many characteristics we have. We will assume that a child chooses an allele from each parent at random. One of the square alleles is selected 4 times. it makes the evaluation procedure easy. This number could be independent of the research quality but. We have a transition from x to y if y alleles of type A were produced. And so on. The process repeats. To simplify. two AA individuals will produce and AA child. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). 24. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. 5. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. 4.g. a dominant (say A) and a recessive (say a). What is important is that the evaluation can be done without ever looking at the content of one’s research. One of the circle alleles is selected twice. 71 .4 The Wright-Fisher model: an example from biology In an organism. The alleles in an individual come in pairs and express the individual’s genotype for that gene. we will assume that. colour of eyes) appears in two forms. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. When two individuals mate. in each generation. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. from the point of view of an administrator. generation after generation. If so. Do the same thing 2N times. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. maintain this allele for the next generation. Three of the square alleles are never selected. But an AA mating with a Aa may produce a child of type AA or one of type Aa. then. We will also assume that the parents are replaced by the children. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares).

. M Therefore. Clearly.e. the random variables ξ1 . 72 . . as usual. the number of alleles of type A won’t change.e. on the average. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. since. Clearly. for all n ≥ 0. ϕ(x) = x/M. X0 ) = 1 − Xn . We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. on the average. . the states {1. (Here. Therefore P (Xn = 0 or Xn = 2N for all large n) = 1.Each allele of type A is selected with probability x/2N . . Tx := inf{n ≥ 1 : Xn = x}. with n P (ξk = 1|Xn . n n where. Since selections are independent. . The result is correct. and the population becomes one consisting of alleles of type a only. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). so both 0 and 2N are absorbing states. y ≤ 2N − 1. . M n P (ξk = 0|Xn . Xn ). ϕ(M ) = 1. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M . X0 ) = Xn . Similarly. These are: M ϕ(0) = 0. M and thus. . i.2N = 1. we see that 2N x 2N −y x y pxy = . . since. . . . the chain gets absorbed by either 0 or 2N . . The fact that M is even is irrelevant for the mathematical model. X0 ) = E(Xn+1 |Xn ) = M This implies that. Let M = 2N .) One can argue that. the expectation doesn’t change and. p00 = p2N. . EXn+1 = EXn Xn = Xn .i. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . E(Xn+1 |Xn . 2N − 1} form a communicating class of inessential states. . . . Since pxy > 0 for all 1 ≤ x. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. but the argument needs further justification because “the end of time” is not a deterministic time. . and the population becomes one consisting of alleles of type A only. . i. . . ξM are i. ϕ(x) = y=0 pxy ϕ(y). . 0 ≤ y ≤ 2N. .d. . conditionally on (X0 .

We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. provided that enough components are available. at time n + 1.The first two are obvious. We will show that this Markov chain is positive recurrent. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. there are Xn + An − Dn components in the stock. A number An of new components are ordered and added to the stock. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. and the stock reached level zero at time n + 1. 73 . and so. Working for n = 1. If the demand is for Dn components. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . Xn electronic components at time n. larger than the supply per unit of time. . Here are our assumptions: The pairs of random variables ((An . Since the Markov chain is represented by this very simple recursion. . M x M (M −1)−(y−1) x x +1− M M M −1 = 24. We have that (Xn . .i. Dn ). Let ξn := An − Dn . b). M −1 y−1 x M y−1 1− x . while a number of them are sold to the market. and E(An − Dn ) = −µ < 0. we will try to solve the recursion. the demand is met instantaneously. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. the demand is. the demand is only partially satisfied. We will apply the so-called Loynes’ method. so that Xn+1 = (Xn + ξn ) ∨ 0. 2. n ≥ 0) are i. Otherwise. on the average.d. for all n ≥ 0. Thus. say. n ≥ 0) is a Markov chain.5 A storage or queueing application A storage facility stores.

Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). . ξn . and. Indeed. In particular. in computing the distribution of Mn which depends on ξ0 . . . . we reverse it. . (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. . so we can shuffle them in any way we like without altering their joint distribution. 74 . the event whose probability we want to compute is a function of (ξ0 . . where the latter equality follows from the fact that. Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. . that d (ξ0 . . . we may replace the latter random variables by ξ1 . however. Notice. . . ξn−1 ) = (ξn−1 .Thus. for y ≥ 0.d. Hence. . (n) . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . we find π(y) = x≥0 π(x)pxy . the random variables (ξn ) are i. . we may take the limit of both sides as n → ∞. . using π(y) = limn→∞ P (Mn = y). for all n. .i. The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). = x≥0 Since the n-dependent terms are bounded. instead of considering them in their original order. . ξn−1 ). . . But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). Therefore. ξ0 ). ξn−1 . .

. The case Eξn = 0 is delicate. .So π satisfies the balance equations. Depending on the actual distribution of ξn . n→∞ This implies that P ( lim Zn = −∞) = 1. . . Z2 . This will be the case if g(y) is not identically equal to 1. we will use the assumption that Eξn = −µ < 0. 75 . .} > y) is not identically equal to 1. It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient.} < ∞) = 1. . Z2 . We must make sure that it also satisfies y π(y) = 1. From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. Here. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . as required. n→∞ Hence g(y) = P (0 ∨ max{Z1 . the Markov chain may be transient or null recurrent.

y+z for any translation z. For example. . here. where p1 + · · · + pd + q1 + · · · + qd = 1. we can let S = Zd . if x = ej . given a function p(x). A random walk of this type is called simple random walk or nearest neighbour random walk. otherwise where p + q = 1. So. Our Markov chains. the set of d-dimensional vectors with integer coordinates. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. namely. p(−1) = q. j = 1. . . d   0. Define  pj . this walk enjoys. but. . .y = p(y − x). . we won’t go beyond Zd . besides time. have been processes with timehomogeneous transition probabilities.g. Example 3: Define p(x) = 1 2d . .and space-homogeneous transition probabilities given by px. p(x) = 0. that of spatial homogeneity. otherwise. so far.y = px+z. we need to give more structure to the state space S.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. if x = −ej . ±ed } otherwise 76 . j = 1. In dimension d = 1. This means that the transition probability pxy should depend on x and y only through their relative positions in space. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. then p(1) = p. This is the drunkard’s walk we’ve seen before. if x ∈ {±e1 . Example 1: Let p(x) = C2−|x1 |−···−|xd | . a structure that enables us to say that px. a random walk is a Markov chain with time. 0. .and space-homogeneity. groups) are allowed. If d = 1. . . where C is such that x p(x) = 1. . the skip-free property. More general spaces (e. x ∈ Zd . d  p(x) = qj . . A random walk possesses an additional property. To be able to say this.

But this would be nonsense. x + ξ2 = y) = y p(x)p(y − x). random variables in Zd with common distribution P (ξn = x) = p(x). p(x) = 0. x ∈ Zd . (When we write a sum without indicating explicitly the range of the summation variable. So. where ξ1 . We call it simple symmetric random walk. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. (Recall that δz is a probability distribution that assigns probability 1 to the point z. in general. y p(x)p′ (y − x). otherwise.) Now the operation in the last term is known as convolution. x ± ed .d. We will let p∗n be the convolution of p with itself n times. and this is the symmetric drunkard’s walk. we will mean a sum over all possible values. ξ2 . One could stop here and say that we have a formula for the n-step transition probability. . p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). we can reduce several questions to questions about this random walk. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . Furthermore. . p ∗ δz = p. which. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. Furthermore. The convolution of two probabilities p(x). because all we have done is that we dressed the original problem in more fancy clothes. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. the initial state S0 is independent of the increments. . Obviously. . We refer to ξn as the increments of the random walk.) In this notation. The special structure of a random walk enables us to give more detailed formulae for its behaviour.i. If d = 1. Indeed. Let us denote by Sn the state of the walk at time n. computing the n-th order convolution for all n is a notoriously hard problem. . It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). then p(1) = p(−1) = 1/2.This is a simple random walk. We shall resist the temptation to compute and look at other 77 . in addition. . . p′ (x). are i.

who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. The principle is trivial. . .2) where the ξn are i. −1. 2 where #A is the number of elements of A. Will the random walk ever return to where it started from? B. consider the following: 78 . Therefore. how long will that take? C. Look at the distribution of (ξ1 . . If yes. +1. for fixed n. As a warm-up. In the sequel. a random vector with values in {−1. 2) → (5. Here are a couple of meaningful questions: A. all possible values of the vector are equally likely. +1}n . When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn .1) + (0. and the sequence of signs is {+1. we have P (ξ1 = ε1 . If not. So if the initial position is S0 = 0. random variables (random signs) with P (ξn = ±1) = 1/2. ξn = εn ) = 2−n . −1} then we have a path in the plane that joins the points (0. . A ⊂ {−1. 0) → (1.i.2) (4. we should assign.1) + − (5. 2) → (3. a path of length n. Events A that depend only on these the first n random signs are subsets of this sample space. 1) → (4.1) − (3. we are dealing with a uniform distribution on the sample space {−1. and so #A P (A) = n .i+1 = pi. ξn ).methods. First.0) We can thus think of the elements of the sample space as paths. +1}n . Since there are 2n elements in {0. . but its practice may be hard. 1}n . +1}n . Clearly. if we can count the number of elements of A we can compute its probability. 1): (2. . where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi.i−1 = 1/2 for all i ∈ Z. +1.d. we shall learn some counting methods. . to each initial position and to each sequence of n signs. 1) → (2. After all. . So. + (1.

i). The number of paths that start from (0. Hence (n+i)/2 = 0. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. i). So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. in this case. 0) and end up at (n. i)} = n n+i 2 More generally.i) (0. y)}. 0) and n ends at (n.1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. The paths comprising this event are shown below. S6 < 0. 0) and ends at (n. x) and end up at (n. We use the notation {(m. Consider a path that starts from (0.) The darker region shows all paths that start at (0. We can easily count that there are 6 paths in A. and compute the probability of the event A = {S2 ≥ 0. n 79 . x) (n. the number of paths that start from (m.Example: Assume that S0 = 0. S4 = −1. (There are 2n of them. (n − m + y − x)/2 (23) Remark. 0) (n. i). y)} =: {all paths that start at (m. 0)). y)} = n−m .0) The lightly shaded region contains all paths of length n that start at (0. y) is given by: #{(m. Proof. In if n + i is not an even number then there is no path that starts from (0. y) is given by: #{(0. 0) and end up at the specific point (n. (n. x) (n.047. x) and end up at (n. Let us first show that Lemma 14. S7 < 0}.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

given that S0 = 0 and Sn = y > 0. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof.3 Some conditional probabilities Lemma 17. for now. (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . The left side of (30) equals #{paths (1. y)} . P0 (S1 ≥ 0 . n−m 2−n−m . . . . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . 1) (n. if n is even. S2 > 0. . n/2 (29) #{(0. y) that do not touch zero} × 2−n #{paths (0. . 27. the random walk never becomes zero up to time n is given by P0 (S1 > 0. n (30) Proof. Sn ≥ 0 | Sn = 0) = 84 1 . we shall just compute it. if n is even. If n. y)} . . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). n 2−n . 0) (n. . P0 (Sn = y) 2 (n + y) + 1 . 0) (n. 27. x) 2n−m (n. . . n 2 #{paths (m. n This is called the ballot theorem and the reason will be given in Section 29. . Such a simple answer begs for a physical/intuitive explanation.P (Sn = y|Sm = x) = In particular. Sn ≥ 0 | Sn = y) = 1 − In particular. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . Sn > 0 | Sn = y) = y . The probability that. But.2 The ballot theorem Theorem 25.

. . . . . . . . . (b) follows from symmetry. . . Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. •).Proof. . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). . . .). . . and Sn−1 > 0 implies that Sn ≥ 0.4 Some remarkable identities Theorem 26. . . Sn = 0) = P0 (S1 > 0. . (c) (d) (e) (f ) (g) where (a) is obvious. . .) by (ξ1 . 1 + ξ2 + ξ3 > 0. Sn ≥ 0) = P0 (S1 = 0. Sn ≥ 0) = #{(0. . Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. . . . We next have P0 (S1 = 0. . . . . . . . Sn = 0) = P0 (Sn = 0). . (e) follows from the fact that ξ1 is independent of ξ2 . . . ξ2 + ξ3 ≥ 0. . . Sn ≥ 0. (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . n/2 where we used (28) and (29). For even n we have P0 (S1 ≥ 0. Proof. ξ2 . ξ3 . . . . . . . . (f ) follows from the replacement of (ξ2 . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . We use (27): P0 (Sn ≥ 0 . . . . . . . . If n is even. P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . Sn > 0) + P0 (S1 < 0. . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. P0 (S1 ≥ 0. . . . Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. 85 . . . 0) (n. Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. Sn < 0) (b) (a) = 2P0 (S1 > 0. ξ3 . . +1 27. . Sn ≥ 0). and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0.

Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. . Put a two-sided mirror at a. Sn−1 > 0. . Sn = 0). Then f (n) := P0 (S1 = 0. Sn−2 > 0|Sn−1 = 1). This is not so interesting. we see that Sn−1 > 0. and u(n) = P0 (Sn = 0). We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. Fix a state a. with equal probability. 86 . given that Sn = 0. . . . Sn = 0) + P0 (S1 < 0. Sn = 0) Now. . So the last term is 1/2. By symmetry. . n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. . f (n) = P0 (S1 > 0.27. and if the particle is to the left of a then its mirror image is on the right. the two terms must be equal. Sn = 0) = 2P0 (Sn−1 > 0. Theorem 27. we computed. . Sn = 0 means Sn−1 = 1. So u(n) f (n) = . So 1 f (n) = 2u(n) P0 (S1 > 0. . . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). Sn−1 = 0. . . Let n be even. Sn = 0) = u(n) . Sn = 0. . . So f (n) = 2P0 (S1 > 0. for even n. . By the Markov property at time n − 1.5 First return to zero In (29). Sn−2 > 0|Sn−1 > 0. Also. . Sn = 0) P0 (S1 > 0. . . P0 (Sn−1 > 0. we can omit the last event. Sn−1 > 0. . . To take care of the last term in the last expression for f (n). . we either have Sn−1 = 1 or −1. n−1 Proof. the probability u(n) that the walk (started at 0) attains value 0 in n steps. . So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. . Since the random walk cannot change sign before becoming zero. Sn−1 < 0. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . If a particle is to the right of a then you see its image on the left.

n ∈ Z+ ). n ≥ Ta By the strong Markov property. so we may replace Sn by 2a − Sn (Theorem 28). the process (STa +m − a. Then. called (Sn . observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). 2 Define now the running maximum Mn = max(S0 . Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). as a corollary to the above result. independent of the past before Ta . a + (Sn − a). But a symmetric random walk has the same law as its negative. 2 Proof. n < Ta . . If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). But P0 (Ta ≤ n) = P0 (Ta ≤ n. 2a − Sn . n ≥ 0) be a simple symmetric random walk and a > 0. S1 . define Sn = where is the first hitting time of a. n ≥ Ta . Sn ≤ a) + P0 (Ta ≤ n. Sn > a) = P0 (Ta ≤ n. . for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . . Sn = Sn . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. Then Ta = inf{n ≥ 0 : Sn = a} Sn . Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). P0 (Sn ≥ a) = P0 (Ta ≤ n. we have n ≥ Ta . If Sn ≥ a then Ta ≤ n. m ≥ 0) is a simple symmetric random walk. Proof. we obtain 87 . Thus. 28. Finally.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. Sn ≥ a). The last equality follows from the fact that Sn > a implies Ta ≤ n. Trivially. Sn ). Sn ≤ a) + P0 (Sn > a). In other words. In the last event.What is more interesting is this: Run the particle till it hits a (it will. Theorem 28. So Sn − a can be replaced by a − Sn in the last display and the resulting process. 2a − Sn ≥ a) = P0 (Ta ≤ n. n < Ta . . Let (Sn . n ∈ Z+ ) has the same law as (Sn . Sn ≤ a).

Rend. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . 9 Bertrand. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). It is the latter use we are interested in in Probability and Statistics. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. 105. Sn = 2x − y). P0 (Mn < x. the number of selected azure items is always ahead of the black ones equals (a − b)/n. This. . 2 Proof. Compt. then. the ashes of the dead after cremation. Since Mn < x is equivalent to Tx > n. Bertrand. If an urn contains a azure items and b = n−a black items. Sn = y) = P0 (Tx > n. . 436-437. Compt. Proof. of which a are coloured azure and b black. An urn is a vessel. Solution d’un probl`me. Sci. by applying the reflection principle (Theorem 28). Sn = y). J. D. 8 88 . during sampling without replacement. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. Rend. where ηi indicates the colour of the i-th item picked. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Solution directe du probl`me r´solu par M. usually a vase furnished with a foot or pedestal. The answer is very simple: Theorem 31. Acad. which results into P0 (Tx ≤ n. The original ballot problem. 105. we have P0 (Mn < x. n = a + b. A sample from the urn is a sequence η = (η1 . gives the result. combined with (31). (31) x > y. e e e Paris. .Corollary 19. Sn = y) = P0 (Sn = 2x − y). and anciently for holding ballots to be drawn. Paris. The set of values of η contains n elements and η is uniformly distributed a in this set. 369. as for holding liquids. Sn = y) = P0 (Tx ≤ n. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. e 10 Andr´. employed for different purposes. Sci. ornamental objects. . Acad. ηn ). If Tx ≤ n. (1887). then the probability that. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. as long as a ≥ b. (1887).

the result follows. Let ε1 . . conditional on {Sn = y}. . So the number of possible values of (ξ1 . . . . where y = a − (n − a) = 2a − n. Sn > 0} that the random walk remains positive. the ‘black’ items correspond to − signs). . i ∈ Z. . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. (The word “asymmetric” will always mean “not necessarily symmetric”. a − b = y. but which is not necessarily symmetric. . It remains to show that all values are equally a likely. the sequence (ξ1 . . . . letting a. a. −n ≤ k ≤ n. ξn ) has a uniform distribution. ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. n n −n 2 (n + y)/2 a Thus. . . The values of each ξi are ±1. . n ≥ 0) with values in Z and let y be a positive integer. . +1} such that ε1 + · · · + εn = y. indeed. . . . Since an urn with parameters n. (ξ1 . the sequence (ξ1 . . Proof of Theorem 31. so ‘azure’ items are +1 and ‘black’ items are −1. always assuming (without any loss of generality) that a ≥ b = n − a. .i−1 = q = 1 − p. ξn ) is. . . Since the ‘azure’ items correspond to + signs (respectively. . Consider a simple symmetric random walk (Sn . Then. . 89 . ξn ) is uniformly distributed in a set with n elements. conditional on the event {Sn = y}. . But in (30) we computed that y P0 (S1 > 0. . ηn ) an urn process with parameters n. We have P0 (Sn = k) = n n+k 2 pi. b be defined by a + b = n.i+1 = p. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. n Since y = a − (n − a) = a − b. p(n+k)/2 q (n−k)/2 . Then. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . . εn be elements of {−1. . n . . Sn > 0 | Sn = y) = . . using the definition of conditional probability and Lemma 16. we have: P0 (ξ1 = ε1 . But if Sn = y then. . . .We shall call (η1 . . . .) So pi. we can translate the original question into a question about the random walk. An urn can be realised quite easily be a random walk: Theorem 32. . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. ξn = εn ) P0 (Sn = y) 2−n 1 = = . We need to show that. Proof. a we have that a out of the n increments will be +1 and b will be −1. conditional on {Sn = y}. . .

. 2.} ∪ {∞}. E0 sTx = ϕ(s)x . If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. Consider a simple random walk starting from 0. Px (Ty = t) = P0 (Ty−x = t). x ≥ 0) is also a random walk. In view of Lemma 19 we need to know the distribution of T1 . then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. In view of this. To prove this. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. The rest is a consequence of the strong Markov property. (We tacitly assume that S0 = 0. 1. Proof. Consider a simple random walk starting from 0. plus a ′′ further time T1 required to go from 0 to +1 for the first time. x ∈ Z.d. Theorem 33. 90 . If S1 = 1 then T1 = 1. For all x. for x > 0.) Then. . . Corollary 20. only the relative position of y with respect to x is needed. namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. and all t ≥ 0. Consider a simple random walk starting from 0. We will approach this using probability generating functions. plus a time T1 required to go from −1 to 0 for the first time. we make essential use of the skip-free property. y ∈ Z. . See figure below. This random variable takes values in {0. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. . 2. random variables. .i. Proof.30. x − 1 in order to reach x. . Lemma 18. The process (Tx .1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . it is no loss of generality to consider the random walk starting from 0. Lemma 19. 2qs Proof. Let ϕ(s) := E0 sT1 . If x > 0.

Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). The formula is obtained by solving the quadratic. from the strong Markov property. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. if x < 0 2ps Proof. We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . Corollary 22. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. 91 . Corollary 21. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Consider a simple random walk starting from 0. Consider a simple random walk starting from 0.  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . in order to hit 1. Hence the previous formula applies with the roles of p and q interchanged. 2ps Proof. T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . The first time n such that Sn = −1 is the first time n such that −Sn = 1.

30.2 First return to the origin

Now let us ask: if the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time
T0′ := inf{n ≥ 1 : Sn = 0}.
(Pay attention: it is important to write n ≥ 1 inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by
E0 s^{T0′} = 1 − √(1 − 4pqs²).

Proof. We again do first-step analysis:
E0 s^{T0′} = E0[s^{T0′} 1(S1 = −1)] + E0[s^{T0′} 1(S1 = 1)]
           = q E0[s^{T0′} | S1 = −1] + p E0[s^{T0′} | S1 = 1].
But if S1 = −1 then T0′ equals 1 plus the time required for the walk to hit 0 starting from −1. The latter is, in distribution, the same as the time required for the walk to hit 1 starting from 0. Similarly, if S1 = 1, then T0′ equals 1 plus a time whose distribution is that of the time required for the walk to hit −1 starting from 0. Hence
E0 s^{T0′} = q E0(s^{1+T1}) + p E0(s^{1+T−1})
           = qs · (1 − √(1 − 4pqs²))/(2qs) + ps · (1 − √(1 − 4pqs²))/(2ps)
           = 1 − √(1 − 4pqs²).

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin. For a symmetric random walk this was found earlier: P0(T0′ = n) = f(n) = u(n)/(n − 1) if n is even. For the general case, we have computed the probability generating function E0 s^{T0′} — see Theorem 34.

The tool we need here is Taylor's theorem for the function f(x) = (1 + x)^α, where α is a real number. If α = n is a positive integer then, by the binomial theorem,
(1 + x)^n = Σ_{k=0}^n C(n, k) x^k,
where C(n, k) denotes the binomial coefficient,
C(n, k) = n! / (k!(n − k)!) = n(n − 1)···(n − k + 1) / k!.
If we define C(n, k) to be equal to 0 for k > n, then we can omit the upper limit in the sum and simply write (1 + x)^n = Σ_{k≥0} C(n, k) x^k.

It turns out that the formula is formally true for general α, and it looks exactly the same:
(1 + x)^α = Σ_{k≥0} C(α, k) x^k,
as long as we define
C(α, k) = α(α − 1)···(α − k + 1) / k!.
The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones.

For example, we will show that
(1 − x)^{−1} = 1 + x + x² + x³ + ···,
which is of course the all-too-familiar geometric series. Indeed,
C(−1, k) = (−1)(−2)···(−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,
so
(1 − x)^{−1} = Σ_{k≥0} C(−1, k)(−x)^k = Σ_{k≥0} (−1)^k (−1)^k x^k = Σ_{k≥0} x^k,
as needed.

As another example, let us compute √(1 − x) = (1 − x)^{1/2}. We have
√(1 − x) = Σ_{k≥0} C(1/2, k)(−1)^k x^k.
We can write this in more familiar symbols if we observe that
C(1/2, k)(−1)^k = −(1/(2k − 1)) C(2k, k) 4^{−k},
and this is shown by direct computation. So
√(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) (x/4)^k.
We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs²):
E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^{2k}.

But, by the definition of E0 s^{T0′},
E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.
Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps) and then
P0(T0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P0(S_{2k} = 0).

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that
P0( lim_{n→∞} Sn/n = p − q ) = 1.
Hence:
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence the walk is transient.
• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is
P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},
which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so
1 − 4pq = 1 − 4p + 4p² = (1 − 2p)², hence √(1 − 4pq) = |1 − 2p| = |p − q|.
So
P0(Tx < ∞) = ((1 − |p − q|)/(2q))^x,     if x > 0,
P0(Tx < ∞) = ((1 − |p − q|)/(2p))^{|x|}, if x < 0.
In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z. This formula can be written more neatly as follows:

Theorem 35.
P0(Tx < ∞) = ((p/q) ∧ 1)^x,     if x > 0,
P0(Tx < ∞) = ((q/p) ∧ 1)^{|x|}, if x < 0.   (33)
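As a check of the first-return formula just derived, one can compare it with a simulation. The snippet is ours (the choice p = 0.3 and the horizon are arbitrary assumptions; runs that do not return within the horizon are treated as non-returning, which only affects the far tail).

import math
import random
from collections import Counter

def first_return(p, horizon=500):
    position = 0
    for n in range(1, horizon + 1):
        position += 1 if random.random() < p else -1
        if position == 0:
            return n
    return None  # no return within the horizon (T0' may well be infinite)

p, q, runs = 0.3, 0.7, 100_000
counts = Counter(first_return(p) for _ in range(runs))
for k in range(1, 6):
    empirical = counts[2 * k] / runs
    exact = math.comb(2 * k, k) * p**k * q**k / (2 * k - 1)
    print(f"k={k}: empirical {empirical:.4f}  formula {exact:.4f}")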

Proof. If p > q then |p − q| = p − q, and so (33) gives P0(Tx < ∞) = 1 if x > 0, and (q/p)^{|x|} if x < 0, which is what we want. If p < q then |p − q| = q − p and, again, we obtain what we are looking for.

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then lim_{n→∞} Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by
M∞ = sup(S0, S1, S2, ...).
Indeed, if we let Mn = max(S0, S1, ..., Sn), the sequence Mn is nondecreasing; every nondecreasing sequence has a limit, except that the limit may be ∞. Here it is < ∞ with probability 1, because p < 1/2, and we have M∞ = lim_{n→∞} Mn.

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:
P0(M∞ ≥ x) = (p/q)^x, x ≥ 0.

Proof. P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x, x ≥ 0.

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return, with probability 1, to the point where it started from. Unless p = 1/2, this probability is always strictly less than 1: with some positive probability, the walk will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in Z, and let
T0′ = inf{n ≥ 1 : Sn = 0}
be the first return time to 0. Then
P0(T0′ < ∞) = 2(p ∧ q).
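Theorem 36 is easy to test numerically. The sketch below is ours; approximating M∞ by the running maximum over a finite horizon is an assumption justified by the downward drift when p < 1/2.

import random

def running_max(p, horizon=2_000):
    position, best = 0, 0
    for _ in range(horizon):
        position += 1 if random.random() < p else -1
        if position > best:
            best = position
    return best  # with high probability equal to the overall maximum

p, q, runs = 0.4, 0.6, 20_000
maxima = [running_max(p) for _ in range(runs)]
for x in range(6):
    empirical = sum(m >= x for m in maxima) / runs
    print(f"x={x}: empirical {empirical:.4f}  (p/q)^x {(p/q)**x:.4f}")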

Proof. The probability generating function of T0′ was obtained in Theorem 34. But
P0(T0′ < ∞) = lim_{s↑1} E s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.
If p = q this gives 1. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. In all cases, then, the value is 2(p ∧ q).

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. We now compute its exact distribution. The total number of visits to state i is given by
Ji = Σ_{n=1}^∞ 1(Sn = i),   (34)
a random variable first considered in Section 10, where it was also shown to be geometric. In Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J0 of visits to state 0 is geometrically distributed:
P0(J0 ≥ k) = (2(p ∧ q))^k, k = 0, 1, 2, ...,
E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J0 ≥ 1 if and only if the first return time to zero is finite. So
P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),
where the last probability was computed in Theorem 37. And so
P0(J0 ≥ k) = P0(J0 ≥ 1)^k = (2(p ∧ q))^k, k = 0, 1, 2, ....
This proves the first formula. For the second formula we have
E0 J0 = Σ_{k=1}^∞ P0(J0 ≥ k) = Σ_{k=1}^∞ (2(p ∧ q))^k = 2(p ∧ q) / (1 − 2(p ∧ q)),
this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any starting state i:
Pi(Ji ≥ k) = (2(p ∧ q))^k,  Ei Ji = 2(p ∧ q) / (1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case they give Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1, as already known.

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a random walk with values in Z, starting from 0, and let k ≥ 1. If p ≤ 1/2 then
P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1},  E0 Jx = (p/q)^{x∨0} / (1 − 2p),
while if p ≥ 1/2 then
P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},  E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q).

Proof. Suppose x > 0. Then
P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),
simply because if Jx ≥ k ≥ 1 then Tx < ∞. Apply now the strong Markov property at Tx:
P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).
The reason that we replaced k by k − 1 is that, given Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain
P0(Jx ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^{k−1}.
If p ≤ 1/2 this gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}; if p ≥ 1/2 this gives P0(Jx ≥ k) = (2q)^{k−1}. We repeat the argument for x < 0 and find
P0(Jx ≥ k) = ((q/p) ∧ 1)^{|x|} (2(p ∧ q))^{k−1}.
If p ≤ 1/2 this gives P0(Jx ≥ k) = (2p)^{k−1}; if p ≥ 1/2 this gives P0(Jx ≥ k) = (q/p)^{|x|} (2q)^{k−1}. As for the expectation, we apply E0 Jx = Σ_{k≥1} P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2. If the random walk "drifts down" (i.e. p < 1/2) then, for x > 0, E0 Jx = (p/q)^x / (1 − 2p) drops down to zero geometrically fast as x tends to infinity. But for x < 0, we have E0 Jx = 1/(1 − 2p), regardless of x. In other words, if the random walk drifts down, it will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0, use spatial homogeneity to obtain the analogous formulae:
Py(Jx ≥ k) = P0(Jx−y ≥ k),  Ey Jx = E0 Jx−y.

32 Duality

Consider a random walk Sk = Σ_{i=1}^k ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. The dual (S̃k, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by
S̃k := Sn − S_{n−k}, 0 ≤ k ≤ n.

Theorem 40. (S̃k, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ1, ξ2, ..., ξn) has the same distribution as (ξn, ξ_{n−1}, ..., ξ1).

Thus, every time we have a probability for S we can translate it into a probability for S̃, which is basically another probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk and x a positive integer. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then
P0(Tx = n) = (x/n) P0(Sn = x).

Proof. Rewrite the ballot theorem in terms of the dual:
P0(S̃1 > 0, ..., S̃n > 0 | S̃n = x) = x/n.
But, by Theorem 40, the left-hand side equals
P0(Sn − S_{n−k} > 0, 0 ≤ k ≤ n − 1 | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1 | Sn = x)
= P0(Tx = n, Sn = x) / P0(Sn = x) = P0(Tx = n) / P0(Sn = x),
because, on the event {Sn = x}, the event {Sk < x for 0 ≤ k ≤ n − 1} is exactly {Tx = n}.

Pause for reflection: we have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:
E0 s^{Tx} = ((1 − √(1 − 4pqs²))/(2qs))^x.   (35)
Specialising to the symmetric case, this reads
E0 s^{Tx} = ((1 − √(1 − s²))/s)^x.
In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:
P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).   (36)
Lastly, in Theorem 41, we derived, using duality, the probabilities
P0(Tx = n) = (x/n) P0(Sn = x), x > 0.   (37)

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of sn in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
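Numerically, however, the compatibility is easy to confirm. The following sketch is ours: for the symmetric walk it sums (37) up to n and compares the result with (36), using the exact probabilities P0(Sm = x) = C(m, (m+x)/2) 2^{−m} when m + x is even.

import math

def prob_S(m, x):
    """P0(Sm = x) for the simple symmetric random walk."""
    if (m + x) % 2 or abs(x) > m:
        return 0.0
    return math.comb(m, (m + x) // 2) * 0.5**m

x, n = 3, 25
# P0(Tx <= n) via (37): sum of (x/m) P0(Sm = x)
from_37 = sum(x / m * prob_S(m, x) for m in range(1, n + 1))
# P0(Tx <= n) via (36): P0(|Sn| >= x) - (1/2) P0(|Sn| = x)
p_ge = sum(prob_S(n, y) for y in range(-n, n + 1) if abs(y) >= x)
from_36 = p_ge - 0.5 * (prob_S(n, x) + prob_S(n, -x))
print(from_37, from_36)  # the two numbers agree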

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+(2n) be the number of sides of the polygonal line that are above the horizontal axis and T−(2n) = 2n − T+(2n) the number of sides that are below. Clearly, T±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+(2n) as the number of leads in 2n coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, P(T+(2n) = m) is largest when m = 0 or m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).
P(T+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0), 0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:
P(T+(2n) = 2k) = Σ_{r=1}^n P(T+(2n) = 2k, T0′ = 2r).
Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write
P(T+(2n) = 2k) = Σ_{r=1}^n P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
               + Σ_{r=1}^n P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).
In the first case,
P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) − 2r = 2k − 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),
because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis, and so T+(2n) is reduced by 2r; furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows
P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).
We also observe that, by symmetry,
P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f_{2r},
where f_{2r} := P(T0′ = 2r). Letting p_{2k,2n} = P(T+(2n) = 2k), we have shown that
p_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.
We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} := P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we will realise that
p_{2k,2n} = (1/2) u_{2n−2k} Σ_{r=1}^k f_{2r} u_{2k−2r} + (1/2) u_{2k} Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r};
since u_{2m} = Σ_{r=1}^m f_{2r} u_{2m−2r} (decomposition according to the first return to 0), the right-hand side equals (1/2) u_{2n−2k} u_{2k} + (1/2) u_{2k} u_{2n−2k} = u_{2k} u_{2n−2k}, completing the induction. Explicitly, we have
P(T+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^{−2n}, 0 ≤ k ≤ n.   (38)

With 2n = 100, we plot the function P(T+(2n) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Plot: P(T+(100) = 2k) against k, 0 ≤ k ≤ 50; the function is U-shaped, largest (≈ 0.08) at k = 0 and k = 50 and smallest in the middle.]

The arcsine law*
Use of Stirling's approximation in formula (38) gives
P(T+(2n) = 2k) ∼ 1/(π √(k(n − k))), as k → ∞ and n − k → ∞.
Now consider the calculation of the following probability:
P( T+(2n)/2n ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T+(2n)/2n = k/n )
                        ∼ Σ_{a ≤ k/n ≤ b} 1/(π √(k(n − k)))
                        = Σ_{a ≤ k/n ≤ b} (1/n) · 1/(π √((k/n)(1 − k/n))).
This is easily seen to have a limit as n → ∞, the limit being
∫_a^b dt / (π √(t(1 − t))),
because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
lim_{n→∞} P( T+(2n)/2n ≤ x ) = ∫_0^x dt / (π √(t(1 − t))) = (2/π) arcsin √x.
We thus have that the limiting distribution of T+(2n)/2n (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk because the summands in Tx = Σ_{y=1}^x (Ty − T_{y−1}) are i.i.d., all distributed as T1:
P(T1 ≥ 2n) = P(S_{2n} = 0) = C(2n, n) 2^{−2n} ∼ 1/√(πn).
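A simulation sketch of the arcsine law (ours, not from the notes); the convention that a side of the polygonal line counts as positive when either of its endpoints is above the axis matches the definition of T+.

import math
import random

def fraction_positive(n_steps=1_000):
    position = positive_sides = 0
    for _ in range(n_steps):
        previous = position
        position += random.choice((-1, 1))
        if previous > 0 or position > 0:   # this side lies above the axis
            positive_sides += 1
    return positive_sides / n_steps

runs = 10_000
fractions = [fraction_positive() for _ in range(runs)]
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    empirical = sum(f <= x for f in fractions) / runs
    limit = 2 / math.pi * math.asin(math.sqrt(x))
    print(f"x={x}: empirical {empirical:.3f}  (2/pi) arcsin sqrt(x) = {limit:.3f}")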

From this, we immediately get that the expectation of T1 is infinity: ET1 = ∞. Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, moves east, west, north or south with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z: let ξ1, ξ2, ... be i.i.d. random vectors such that
P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4,
and let S0 = (0, 0), Sn = Σ_{i=1}^n ξi, n ≥ 1. If x ∈ Z², we let x¹ be its horizontal coordinate and x² its vertical. So Sn = (Sn¹, Sn²), where Sn¹ = Σ_{i=1}^n ξi¹ and Sn² = Σ_{i=1}^n ξi².

Lemma 20. (Sn¹, n ∈ Z+) is a random walk, and (Sn², n ∈ Z+) is also a random walk. The two random walks are not independent.

Proof. Since ξ1, ξ2, ... are independent, so are ξ1¹, ξ2¹, ..., the latter being functions of the former. They are also identically distributed, with
P(ξn¹ = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
P(ξn¹ = 1) = P(ξn = (1, 0)) = 1/4,
P(ξn¹ = −1) = P(ξn = (−1, 0)) = 1/4.
Hence (Sn¹, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0. The same argument applies to (Sn², n ∈ Z+), showing that it, too, is a random walk. However, for each n, the random variables ξn¹, ξn² are not independent, simply because if one takes value 0 the other is nonzero. Hence the two stochastic processes are not independent.

Consider now a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x¹, x²) into (x¹ + x², x¹ − x²). So let
Rn = (Rn¹, Rn²) := (Sn¹ + Sn², Sn¹ − Sn²).
We have:

Lemma 21. (Rn, n ∈ Z+) is a random walk in two dimensions. Moreover, each of (Rn¹, n ∈ Z+) and (Rn², n ∈ Z+) is a random walk in one dimension.

Proof. We have Rn¹ = Σ_{i=1}^n (ξi¹ + ξi²) and Rn² = Σ_{i=1}^n (ξi¹ − ξi²). So Rn = Σ_{i=1}^n ηi, where ηn is the vector ηn = (ηn¹, ηn²) := (ξn¹ + ξn², ξn¹ − ξn²). The sequence of vectors η1, η2, ... is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, .... Hence (Rn, n ∈ Z+) is a random walk in two dimensions. A similar argument shows that each of (Rn¹, n ∈ Z+) and (Rn², n ∈ Z+) is a random walk in one dimension.

Lemma 22. (Rn¹, n ∈ Z+) and (Rn², n ∈ Z+) are independent simple symmetric random walks.

Proof. The possible values of each of ηn¹, ηn² are ±1. We have
P(ηn¹ = 1, ηn² = 1) = P(ξn = (1, 0)) = 1/4,
P(ηn¹ = −1, ηn² = −1) = P(ξn = (−1, 0)) = 1/4,
P(ηn¹ = 1, ηn² = −1) = P(ξn = (0, 1)) = 1/4,
P(ηn¹ = −1, ηn² = 1) = P(ξn = (0, −1)) = 1/4.
Hence P(ηn¹ = 1) = P(ηn¹ = −1) = 1/2 and P(ηn² = 1) = P(ηn² = −1) = 1/2, and so
P(ηn¹ = ε, ηn² = ε′) = P(ηn¹ = ε) P(ηn² = ε′) for all choices of ε, ε′ ∈ {−1, +1},
showing that ηn¹, ηn² are independent for each n. Hence the two one-dimensional walks, being built from independent increments, are independent simple symmetric random walks.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^∞ P(Sn = 0) = ∞. But S_{2n} = 0 if and only if R¹_{2n} = R²_{2n} = 0, and, by the independence just proved,
P(S_{2n} = 0) = P(R¹_{2n} = 0, R²_{2n} = 0) = P(R¹_{2n} = 0) P(R²_{2n} = 0) = (C(2n, n) 2^{−2n})² = C(2n, n)² 2^{−4n},
where we used that (Rn¹) is a simple symmetric random walk. By Stirling's approximation,
P(S_{2n} = 0) ∼ c/n, as n → ∞,
where c is a positive constant (∼ meaning, as usual, that the ratio of the two sides goes to 1). Hence Σ_n P(S_{2n} = 0) = ∞, and the walk is recurrent.
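The identity P(S_{2n} = 0) = (C(2n, n) 2^{−2n})² is easy to check by simulation; the snippet below is ours and the parameters are arbitrary.

import math
import random

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def back_at_origin(n_steps):
    x = y = 0
    for _ in range(n_steps):
        dx, dy = random.choice(STEPS)
        x += dx
        y += dy
    return x == 0 and y == 0

n, runs = 10, 200_000
empirical = sum(back_at_origin(2 * n) for _ in range(runs)) / runs
exact = (math.comb(2 * n, n) * 2.0**(-2 * n))**2
print(f"empirical {empirical:.5f}  exact {exact:.5f}  1/(pi n) = {1/(math.pi*n):.5f}")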

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let a < 0 < b. Recall, from Theorem 7, that
P(Ta < Tb) = b/(b − a),  P(Tb < Ta) = −a/(b − a).
We can read this as follows:
P(S_{Ta∧Tb} = a) = b/(b − a),  P(S_{Ta∧Tb} = b) = −a/(b − a).
This last display tells us the distribution of S_{Ta∧Tb}. Note, in particular, that E S_{Ta∧Tb} = 0.

Conversely, given a 2-point probability distribution (p, 1 − p) such that p is rational, we can always find integers a, b, with a < 0 < b, such that pa + (1 − p)b = 0. Indeed, if p = M/N, M, N ∈ N, M < N, choose a = M − N and b = M; then
pa + (1 − p)b = (M/N)(M − N) + ((N − M)/N) M = 0.
This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability it exits it from the left point is p, and the probability it exits from the right point is 1 − p.

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z). Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that P(ST = i) = pi, i ∈ Z.

Proof. Define a pair of random variables (A, B) taking values in {(0, 0)} ∪ {(i, j) : i < 0 < j} with the following distribution:
P(A = 0, B = 0) = p0,
P(A = i, B = j) = c⁻¹ (j − i) pi pj, i < 0 < j,
where c⁻¹ is chosen by normalisation:
P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.
This leads to
c⁻¹ Σ_{j>0} j pj Σ_{i<0} pi + c⁻¹ Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.
Since Σ_{i∈Z} i pi = 0, we have the common value
Σ_{i<0} (−i) pi = Σ_{i>0} i pi,
and, using this, we get that c is precisely this common value:
c = Σ_{i<0} (−i) pi = Σ_{i>0} i pi.
Letting Ti = inf{n ≥ 0 : Sn = i}, i ∈ Z, and assuming that (A, B) is independent of (Sn), we consider the random variable
Z = S_{TA ∧ TB}.
The claim is that P(Z = k) = pk, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p0, by definition. Suppose next k > 0. We have
P(Z = k) = P(S_{TA∧TB} = k) = Σ_{i<0} Σ_{j>0} P(S_{Ti∧Tj} = k) P(A = i, B = j)
         = Σ_{i<0} P(S_{Ti∧Tk} = k) P(A = i, B = k)
         = Σ_{i<0} ((−i)/(k − i)) · c⁻¹ (k − i) pi pk
         = c⁻¹ Σ_{i<0} (−i) pi pk = pk.
Similarly, P(Z = k) = pk for k < 0. Thus T = TA ∧ TB has the required property.

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers: N = {1, 2, 3, ...}. The set Z of integers consists of N, their negatives and zero: Z = {..., −2, −1, 0, 1, 2, ...}. Zero is denoted by 0.

Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

If a, b are arbitrary integers, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a, d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.) Notice that
gcd(a, b) = gcd(a, b − a).
Indeed, let d = gcd(a, b). Then d | a and d | b, therefore d | b − a; so (i) holds for the pair (a, b − a). Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b; because c | a and c | b, we have c | d. So (ii) holds.

This gives us an algorithm for finding the gcd between two numbers: Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. In other words, at each step we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. For instance,
gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.
But in the first steps we subtracted 38 exactly 3 times, and 3 is the maximum number of times that 38 "fits" into 120 (indeed, 4 × 38 > 120). So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school):
gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

The theorem of division says the following: Given two positive integers a, b, with a > b, we can find an integer q and an integer r ∈ {0, 1, ..., b − 1} such that a = qb + r. The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 by 26 we do

  132
 ——————
 3450 | 26
  26
  ——
  85
  78
  ——
  70
  52
  ——
  18

and obtain q = 132, r = 18.

Here are some other facts about the greatest common divisor. First, we have the fact that if d = gcd(a, b) then there are integers x, y such that d = xa + yb. To see this, work backwards through the divisions performed by Euclid's algorithm. In the example above, we used the divisions
120 = 3 × 38 + 6,  38 = 6 × 6 + 2,
so, working backwards,
2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120,
i.e. gcd(38, 120) = 19 × 38 − 6 × 120, with x = 19, y = −6. The same logic works in general.

Second, for integers a, b, not both zero, we have
gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.
To see this, let d be the right-hand side. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that d | a and d | b. Suppose this is not true, say that d does not divide b. Then we can divide b by d to find b = qd + r, for some 1 ≤ r ≤ d − 1. But d = xa + yb for some x, y ∈ Z (d is of this form, by definition), so
b = q(xa + yb) + r, yielding r = (−qx)a + (1 − qy)b.
So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder), and this is a contradiction: d is not minimum as claimed. Hence r = 0, so d | b; similarly d | a.

The greatest common divisor of 3 integers can now be defined in a similar manner; in fact, the greatest common divisor of any subset of the integers also makes sense.
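The backward substitution just described is exactly the extended form of Euclid's algorithm; here is a standard sketch (ours, not from the notes) that returns x, y along with the gcd.

def extended_gcd(a, b):
    """Return (d, x, y) with d = gcd(a, b) = x*a + y*b."""
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    # d = x*b + y*(a - (a//b)*b) = y*a + (x - (a//b)*y)*b
    return d, y, x - (a // b) * y

print(extended_gcd(38, 120))  # (2, 19, -6), matching the worked example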

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. In general, a relation can be thought of as a set of pairs of elements of the set A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y.

A relation is called reflexive if it contains all pairs (x, x). The equality relation = is reflexive; the relation < is not. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The equality relation is symmetric; the relation < is not. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric, and it can be thought of as reflexive (unless we do not want to accept the fact that one may not be a friend of oneself), but is not necessarily transitive. A relation is transitive if, when it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

We denote a function f from a set X into a set Y by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f⁻¹(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively. To count the number of functions from A to B (i.e. the cardinality of the set B^A), we may think of the elements of A as animals and those of B as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names; so can the second; and so on. Hence the total number of assignments, i.e. the number of functions from A to B, is
m × m × ··· × m (n times) = m^n.

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is
(m)n := m(m − 1)(m − 2)···(m − n + 1).
In the special case m = n, a one-to-one function from the set A into A is called a permutation of A, and we denote (n)n also by n!; thus n! is the total number of permutations of a set of n elements. Incidentally, n! stands for the product of the first n natural numbers: n! = 1 × 2 × 3 × ··· × n.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen: the first element can be chosen in n ways, the second in n − 1, and so on, the k-th in n − k + 1 ways. So we have (n)k = n(n − 1)(n − 2)···(n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals
(n)k / k! =: C(n, k) = n! / (k!(n − k)!),
the binomial coefficient. Clearly, C(n, k) = C(n, n − k), and C(n, 0) = 1, since there is only one subset with no elements (it is the empty set ∅). The total number of subsets of A (regardless of their number of elements) is thus equal to
Σ_{k=0}^n C(n, k) = (1 + 1)^n = 2^n,
and this follows from the binomial theorem. We thus observe that there are as many subsets in A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.

We shall let C(n, k) = 0 if n < k. The Pascal triangle relation states that
C(n + 1, k) = C(n, k) + C(n, k − 1).
Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is not included in the sample, then we choose k animals from the remaining n ones, and this can be done in C(n, k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen, and this can be done in C(n, k − 1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

Also, if x is a real number, we define
C(x, m) = x(x − 1)···(x − m + 1) / m!,
and, if y is not a nonnegative integer, we let C(n, y) = 0 by convention.

Maximum and minimum

The minimum of two numbers a, b is denoted by a ∧ b = min(a, b). Thus a ∧ b = a if a ≤ b, and = b if b ≤ a. The maximum is denoted by a ∨ b = max(a, b). The notation extends to more than one number; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that −(a ∨ b) = (−a) ∧ (−b) and (a ∧ b) + c = (a + c) ∧ (b + c).

In particular, we use the following notations:
a⁺ = max(a, 0),  a⁻ = −min(a, 0).
Both quantities are nonnegative; a⁺ is called the positive part of a, and a⁻ is called the negative part of a. Note that it is always true that a = a⁺ − a⁻. The absolute value of a is defined by |a| = a⁺ + a⁻.

Indicator function

1_A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement A^c of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that
1_{A∩B} = 1_A 1_B,  1_{A∪B} = 1_A + 1_B − 1_{A∩B},  1_{A^c} = 1 − 1_A.
If A is an event then 1_A is a random variable. It takes value 1 with probability P(A) and 0 with probability P(A^c). Therefore its expectation is
E 1_A = 1 × P(A) + 0 × P(A^c) = P(A).
An example of its application: since
1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1_A)(1 − 1_B) = 1_A + 1_B − 1_A 1_B,
we have E[1_{A∪B}] = E[1_A] + E[1_B] − E[1_{A∩B}], which means P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Several quantities of interest are expressed in terms of indicators. For instance, if I go to a bank N times and A_i is the clause "I am robbed the i-th time", then Σ_{i=1}^N 1_{A_i} is the total number of times I am robbed. More generally, if A_1, A_2, ... are sets or clauses, then Σ_{n≥1} 1_{A_n} is the number of true clauses. Another example: let X be a nonnegative random variable with values in N = {1, 2, ...}. Then X = Σ_{n=1}^X 1 (the sum of X ones is X), so
X = Σ_{n≥1} 1(X ≥ n),  and hence  E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).
It is worth remembering these identities.

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that
ϕ(αx + βy) = αϕ(x) + βϕ(y),
for all x, y ∈ R^n and all α, β ∈ R. Let u_1, ..., u_n be a basis for R^n. Then any x ∈ R^n can be uniquely written as
x = Σ_{j=1}^n x^j u_j,
and (x¹, ..., xⁿ) is a column representation of x with respect to the given basis. Because ϕ is linear,
y := ϕ(x) = Σ_{j=1}^n x^j ϕ(u_j).
But the ϕ(u_j) are vectors in R^m, so if we choose a basis v_1, ..., v_m in R^m we can express each ϕ(u_j) as a linear combination of the basis vectors:
ϕ(u_j) = Σ_{i=1}^m a_{ij} v_i.
The matrix A of ϕ with respect to the choice of the two bases is defined as the m × n array A = (a_{ij}). Combining, we have
ϕ(x) = Σ_{i=1}^m ( Σ_{j=1}^n a_{ij} x^j ) v_i,
so, letting y^i := Σ_{j=1}^n a_{ij} x^j, i = 1, ..., m, the column (y¹, ..., yᵐ) is the column representation of y = ϕ(x) with respect to the basis v_1, ..., v_m. The relation between the column representation of y and the column representation of x is thus exactly matrix–vector multiplication.

Suppose now that ψ : R^n → R^m and ϕ : R^m → R^ℓ are linear functions. Then ϕ ∘ ψ : R^n → R^ℓ is a linear function.

Fix three sets of bases on R^n, R^m and R^ℓ, and let B, A be the matrix representations of ψ, ϕ, respectively, with respect to these bases. So A is an ℓ × m matrix and B is an m × n matrix. Then the matrix representation of ϕ ∘ ψ is AB, an ℓ × n matrix, where
(AB)_{ij} = Σ_{k=1}^m A_{ik} B_{kj}, 1 ≤ i ≤ ℓ, 1 ≤ j ≤ n.

A matrix A is called square if it is an n × n matrix. In other words, a square matrix represents a linear transformation from R^n to R^n: if we fix a basis in R^n, it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in R^d consists of the vectors
e_1 = (1, 0, ..., 0), e_2 = (0, 1, 0, ..., 0), ..., e_d = (0, ..., 0, 1),
i.e. e_i^j = 1(i = j).

Supremum and infimum

When we have a set I of numbers (for instance, a set of values of a sequence), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all x ∈ I. This minimum (or, likewise, the maximum max I) may not exist: for instance, I = {1, 1/2, 1/3, ...} has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all lower bounds, i.e. of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists, and it is this that we call infimum. Thus inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. If I has no lower bound then inf I = −∞; if I is empty then inf ∅ = +∞. Similarly, we define the supremum sup I to be the minimum of all upper bounds. It is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε. If I has no upper bound then sup I = +∞; sup ∅ = −∞.

Limsup and liminf

If x1, x2, x3, ... is a sequence of numbers, then
inf{x1, x2, ...} ≤ inf{x2, x3, ...} ≤ inf{x3, x4, ...} ≤ ···,
and, since any nondecreasing sequence has a limit (possibly +∞), we take this limit and call it lim inf x_n (limit inferior) of the sequence:
lim inf x_n = lim_{n→∞} inf{x_n, x_{n+1}, x_{n+2}, ...}.
Similarly, we define lim sup x_n, using the nonincreasing sequence sup{x_n, x_{n+1}, ...}. We note that lim inf(−x_n) = − lim sup x_n. The sequence x_n has a limit ℓ if and only if lim sup x_n = lim inf x_n = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory); I do sometimes "cheat" from a strict mathematical point of view, but not too much. But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely: if A1, A2, ... are mutually disjoint events, then
P(∪_{n≥1} A_n) = Σ_{n≥1} P(A_n).
Everything else follows from this axiom. That is why Probability Theory is very rich. (In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.)

Quite often, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B). Similarly, to show that EX ≤ EY, we often need to argue that X ≤ Y.

Some useful consequences of the axiom: If A, B are disjoint events then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. In general, P(∪_n A_n) ≤ Σ_n P(A_n). If A_n are events with P(A_n) = 0 then P(∪_n A_n) = 0; similarly, if A_n are events with P(A_n) = 1 then P(∩_n A_n) = 1. If the events A1, A2, ... are increasing with A = ∪_n A_n, then P(A) = lim_{n→∞} P(A_n); if they are decreasing with A = ∩_n A_n, then P(A) = lim_{n→∞} P(A_n). An application of this is:
P(X < ∞) = lim_{x→∞} P(X ≤ x).

In Markov chain theory, we work a lot with conditional probabilities. Recall that P(B | A) is defined by P(A ∩ B)/P(A), and is a very motivated definition. To compute P(A ∩ B) we read the definition as
P(A ∩ B) = P(A) P(B | A).
If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A)P(B | A) = P(B)P(A | B), and this is called Bayes' theorem. This is generalisable for a number of events:
P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2).
Often, to compute P(A), it is useful to find disjoint events B1, B2, ..., such that ∪_n B_n = Ω, in which case we write
P(A) = Σ_n P(A ∩ B_n) = Σ_n P(A | B_n) P(B_n).

If A is an event then 1_A or 1(A) is the indicator random variable, i.e. the function that takes value 1 on A and 0 everywhere else. Note that 1_A 1_B = 1_{A∩B}, 1_{A^c} = 1 − 1_A, and E 1_A = P(A). For example, Markov's inequality for a positive random variable X states that
P(X > x) ≤ EX/x.
This follows from the logical statement x 1(X > x) ≤ X, which holds since X is positive: take expectations of both sides.

Generating functions

A generating function of a sequence a0, a1, a2, ..., is an infinite polynomial having the a_i as coefficients:
G(s) = a0 + a1 s + a2 s² + ···.
We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. Note that
|G(s)| ≤ Σ_n |a_n| |s|^n, and this is ≤ Σ_n |a_n| if s ∈ [−1, 1].
So if Σ_n |a_n| < ∞ then G(s) exists for all s ∈ [−1, 1], making it a useful object when the a0, a1, ... form a probability distribution on Z+ = {0, 1, 2, ...}, because then Σ_n a_n = 1. If G exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence a_n = ρ^n, n ≥ 0, is
Σ_{n=0}^∞ ρ^n s^n = 1/(1 − ρs), |s| < 1/|ρ|.   (39)

The expectation of a nonnegative random variable X may be defined by
EX = ∫_0^∞ P(X > x) dx.
The formula holds for any positive random variable. In case the random variable is discrete, it reduces to the known one, EX = Σ_x x P(X = x); in case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^∞ x f(x) dx. For a random variable that takes values in N we can write EX = Σ_{n=1}^∞ P(X ≥ n); this is due to the fact that Σ_{n=1}^∞ 1(X ≥ n) = X. For a random variable which takes positive and negative values, we define
X⁺ = max(X, 0), X⁻ = −min(X, 0),
which are both nonnegative, and then define EX = EX⁺ − EX⁻, which is motivated from the algebraic identity X = X⁺ − X⁻. This definition makes sense only if EX⁺ < ∞ or EX⁻ < ∞.

In Markov chain theory, we encounter situations where P(X = ∞) may be positive. So suppose that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence p_n = P(X = n), n = 0, 1, ..., is called the probability generating function of X:
ϕ(s) = E s^X = Σ_{n=0}^∞ s^n P(X = n).
This is defined for all −1 < s < 1. We do allow X to, possibly, take the value +∞. Since the term s^∞ p_∞, where
p_∞ := P(X = ∞) = 1 − Σ_{n=0}^∞ p_n,
could formally appear in the sum defining ϕ(s), we note that, since |s| < 1, we have s^∞ = 0. Note that Σ_{n=0}^∞ p_n ≤ 1, that ϕ(0) = P(X = 0), and that
ϕ(1) = Σ_{n=0}^∞ P(X = n) = P(X < +∞);
this is why ϕ is called a probability generating function. We can recover the probabilities p_n from the formula
p_n = ϕ⁽ⁿ⁾(0)/n!,
as follows from Taylor's theorem. We can also recover the expectation of X from
EX = lim_{s↑1} ϕ′(s),
something that can be formally seen from
(d/ds) E s^X = E (d/ds) s^X = E X s^{X−1},
by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
x_{n+2} = x_{n+1} + x_n, n ≥ 0, with x0 = x1 = 1.
If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is Σ_{n≥0} x_{n+1} s^n = s⁻¹(X(s) − x0), and the generating function of (x_{n+2}, n ≥ 0) is Σ_{n≥0} x_{n+2} s^n = s⁻²(X(s) − x0 − s x1). From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0):
s⁻²(X(s) − x0 − s x1) = s⁻¹(X(s) − x0) + X(s).
Since x0 = x1 = 1, we can solve for X(s) and find
X(s) = −1/(s² + s − 1).
But the polynomial s² + s − 1 has two roots:
a = (√5 − 1)/2, b = −(√5 + 1)/2.
Hence s² + s − 1 = (s − a)(s − b), and so
X(s) = −1/((s − a)(s − b)) = (1/(b − a)) (1/(b − s) − 1/(a − s)).
But (39) tells us that ρ^n has generating function 1/(1 − ρs) = (1/ρ)/((1/ρ) − s); thus 1/((1/ρ) − s) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1} and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of
x_n = ((1/b)^{n+1} − (1/a)^{n+1}) / (b − a),
and this is the solution to the recursion. Noting that ab = −1, we can also write this as
x_n = ((−a)^{n+1} − (−b)^{n+1}) / (b − a),
the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
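The closed form can be checked in a few lines; the snippet is ours. Despite the square roots, it returns integers, and they match the recursion.

import math

a = (math.sqrt(5) - 1) / 2
b = -(math.sqrt(5) + 1) / 2

def fib_closed(n):
    # x_n = ((-a)^{n+1} - (-b)^{n+1}) / (b - a)
    return ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)

xs = [1, 1]
for _ in range(10):
    xs.append(xs[-1] + xs[-2])
print(xs)
print([round(fib_closed(n)) for n in range(12)])  # identical lists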

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or Rⁿ, or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, then the so-called distribution function, i.e. the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density
f(x) = (d/dx) P(X ≤ x).
All these objects can be referred to as the "distribution" or "law" of X. Even the generating function E s^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × ··· × n for large n then, since the number is so big, it's better to use logarithms:
n! = e^{log n!} = exp Σ_{k=1}^n log k.
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that
∫_k^{k+1} log x dx ≈ log k,
hence Σ_{k=1}^n log k ≈ ∫_1^n log x dx. Since (d/dx)(x log x − x) = log x, it follows that
∫_1^n log x dx = n log n − n + 1 ≈ n log n − n,
and hence
n! ≈ exp(n log n − n) = nⁿ e^{−n}.
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) we do not know in what sense the symbol ≈ should be interpreted; (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads equals
C(2n, n) 2^{−2n} ≈ 1,
and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:
n! ∼ nⁿ e^{−n} √(2πn),
where the symbol a_n ∼ b_n means that a_n/b_n → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:
P(X − n ≥ m | X ≥ n) = P(X ≥ m), n, m ∈ Z+.
If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. A short computation shows that a geometric random variable has a distribution of the form
P(X ≥ n) = λⁿ, n ∈ Z+, where λ = P(X ≥ 1).
We baptise this the geometric(λ) distribution. Surely, you have seen it before, so this little section is a reminder. Indeed, we have already seen a geometric random variable: the total number of visits to zero,
J0 = Σ_{n=1}^∞ 1(Sn = 0),
defined in (34). It was shown in Theorem 38 that J0 satisfies the memoryless property; in fact, it was shown there that P(J0 ≥ k) = P(J0 ≥ 1)^k, a geometric function of k.

Markov's inequality

Arguably this is the most useful inequality in Probability, and it is based on the concept of monotonicity. Let X be a random variable, and consider an increasing and nonnegative function g : R → R+. We can capture both properties of g neatly by writing:
for all x, y ∈ R, g(y) ≥ g(x) 1(y ≥ x).
Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. Hence
g(X) ≥ g(x) 1(X ≥ x),
and, by taking expectations of both sides (expectation is an increasing operator), we obtain
E g(X) ≥ g(x) P(X ≥ x).
This completely trivial thought gives the very useful Markov's inequality.
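A quick check (ours) of the claim about ratios made above: the exact central binomial probability behaves like 1/√(πn), which the rough approximation, giving ≈ 1, completely misses. (The log-gamma route avoids floating-point underflow for large n.)

import math

for n in (10, 100, 1000):
    # C(2n, n) 2^{-2n}, computed via log-gamma to stay within float range
    exact = math.exp(math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1) - 2 * n * math.log(2))
    print(n, round(exact, 6), round(1 / math.sqrt(math.pi * n), 6))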

