Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1 Introduction
2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains
3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains
4 Stationarity
  4.1 Finding a stationary distribution
5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains
19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain
24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes are based on a number of courses I taught over the years in Greece and the U.S. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I have tried both orderings and prefer the current one. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. At the end, I have a little mathematical appendix. A few starred sections should be considered "advanced" and can be omitted at first reading. I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: m ẍ = F. Here, x = x(t) denotes the position of the particle at time t; its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t). In other words, past states are not needed.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (x_n, y_n), where x_n ≥ y_n, initialised by x_0 = max(a, b), y_0 = min(a, b), and evolving as
    x_{n+1} = y_n
    y_{n+1} = rem(x_n, y_n),    n = 0, 1, 2, ....
In other words, there is a function F that takes the pair X_n = (x_n, y_n) into X_{n+1} = (x_{n+1}, y_{n+1}). The sequence of pairs thus produced, X_0, X_1, X_2, ..., has the property that, for any n, the future pairs X_{n+1}, X_{n+2}, ... depend only on the pair X_n.
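This iteration is easy to put on a computer. Here is a minimal Python sketch of the gcd dynamical system (the function name and the printed example are ours, not part of the notes):

    def gcd_dynamical_system(a, b):
        """Iterate (x, y) -> (y, rem(x, y)) and return the whole trajectory.

        The last pair before y becomes 0 has the gcd as its first coordinate.
        """
        x, y = max(a, b), min(a, b)
        trajectory = [(x, y)]
        while y > 0:
            x, y = y, x % y
            trajectory.append((x, y))
        return trajectory

    # Example: gcd(42, 30) = 6.
    states = gcd_dynamical_system(42, 30)
    print(states)          # [(42, 30), (30, 12), (12, 6), (6, 0)]
    print(states[-1][0])   # 6, the greatest common divisor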

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables, instead of deterministic objects.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic, or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫_a^b f(x) dx of a complicated function f: suppose that f is positive and bounded above by B. Choose a pair (X_0, Y_0) of random numbers, a ≤ X_0 ≤ b, 0 ≤ Y_0 ≤ B, uniformly. Repeat with (X_n, Y_n), for steps n = 1, 2, .... Each time count 1 if Y_n < f(X_n) and 0 otherwise. Perform this a large number N of times. Count the number of 1's and divide by N. The resulting ratio, multiplied by B(b − a), is an approximation of ∫_a^b f(x) dx. Try this on the computer!

¹ Stochastic means random.
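A minimal sketch of this hit-or-miss Monte Carlo method, for the (arbitrarily chosen) function f(x) = x² on [0, 1] with bound B = 1:

    import random

    def monte_carlo_integral(f, a, b, B, N=100_000):
        """Estimate the integral of f over [a, b] by the hit-or-miss method.

        f is assumed to satisfy 0 <= f(x) <= B on [a, b]. The fraction of
        uniform points (X, Y) in [a, b] x [0, B] falling below the graph of
        f approximates the integral divided by the area B * (b - a).
        """
        hits = 0
        for _ in range(N):
            X = random.uniform(a, b)
            Y = random.uniform(0, B)
            if Y < f(X):
                hits += 1
        return (hits / N) * B * (b - a)

    # Example: the integral of x^2 over [0, 1] is 1/3.
    print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0))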

We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for, via the use of the tools of Probability.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99, regardless of where it was at earlier times. We can summarise this information by the transition diagram:

[Transition diagram: two states 1 and 2; 1 → 2 with probability α, 1 → 1 with probability 1 − α, 2 → 1 with probability β, 2 → 2 with probability 1 − β.]

Another way to summarise the information is by the 2×2 transition probability matrix
    P = [ 1−α  α  ;  β  1−β ] = [ 0.95  0.05 ;  0.99  0.01 ].

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 ≈ 20 minutes. (This is the mean of the geometric distribution with parameter α.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:
    X_{n+1} = max{X_n + D_n − S_n, 0}.
Here, X_n is the money (in pounds) at the beginning of the month, D_n is the money he deposits, and S_n is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that D_0, D_1, D_2, ... and S_0, S_1, S_2, ... are independent random variables with distributions F_D, F_S, respectively. Assume also that X_0, D_n, S_n are independent random variables. The information described above by the evolution of the account together with the distributions for D_n and S_n leads to the (one-step) transition probability which is defined by:
    p_{x,y} := P(X_{n+1} = y | X_n = x),    x, y = 0, 1, 2, ....
We easily see that if y > 0 then
    p_{x,y} = P(D_n − S_n = y − x)
            = Σ_{z=0}^∞ P(S_n = z, D_n = z + y − x)
            = Σ_{z=0}^∞ P(S_n = z) P(D_n = z + y − x)
            = Σ_{z=0}^∞ F_S(z) F_D(z + y − x),
something that, in principle, can be computed from F_S and F_D. Of course, if y = 0, we have
    p_{x,0} = 1 − Σ_{y=1}^∞ p_{x,y}.
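To make the formula concrete, here is a small sketch that computes p_{x,y} numerically, assuming (for illustration only) that deposits and spendings take finitely many values with given probability mass functions:

    def transition_probability(x, y, pD, pS):
        """p_{x,y} = P(max(x + D - S, 0) = y) for pmfs pD, pS on {0, ..., K}."""
        K = len(pD) - 1   # pD and pS are assumed to have the same support
        total = 0.0
        for d in range(K + 1):
            for s in range(K + 1):
                if max(x + d - s, 0) == y:
                    total += pD[d] * pS[s]
        return total

    # Example: deposits and spendings uniform on {0, 1, 2}.
    pD = pS = [1/3, 1/3, 1/3]
    row = [transition_probability(1, y, pD, pS) for y in range(4)]
    print(row, sum(row))   # the row sums to 1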

The transition probability matrix is
    P = [ p_{0,0}  p_{0,1}  p_{0,2}  ··· ;
          p_{1,0}  p_{1,1}  p_{1,2}  ··· ;
          p_{2,0}  p_{2,1}  p_{2,2}  ··· ;
          ···      ···      ···      ··· ]
and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?
We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards with equal probability that he moves backwards.

[Figure: the integers ..., −2, −1, 0, 1, 2, 3, ..., with transition probability 1/2 to each of the two neighbours.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

[Figure: a square grid of streets; the painting "The Drunkard" by Salvador Dali (1922).]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and, let's say, he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

[Figure: transition diagram on the states H (healthy), S (sick), D (dead), with arrows p_{H,S} = 0.3, p_{S,H} = 0.8, p_{H,D} = 0.01, p_{S,D} = 0.1.]

Thus, there is probability p_{H,S} = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits p_{H,H}, p_{S,S} and p_{D,D}. These are
    p_{H,H} = 1 − p_{H,S} − p_{H,D},    p_{S,S} = 1 − p_{S,H} − p_{S,D},    p_{D,D} = 1.
Also, p_{D,H} = p_{D,S} = 0, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
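Assuming the arrow values read off the diagram above (p_{H,S} = 0.3, p_{S,H} = 0.8, p_{H,D} = 0.01, p_{S,D} = 0.1), the lifetime distribution can be computed numerically; a sketch:

    import numpy as np

    # States: 0 = H (healthy), 1 = S (sick), 2 = D (dead).
    P = np.array([[0.69, 0.30, 0.01],
                  [0.80, 0.10, 0.10],
                  [0.00, 0.00, 1.00]])

    # Since D is absorbing, P(lifetime = n) = P(X_n = D) - P(X_{n-1} = D).
    mu = np.array([1.0, 0.0, 0.0])   # start healthy
    dead_prob = [0.0]
    for n in range(1, 61):
        mu = mu @ P
        dead_prob.append(mu[2])
    lifetime_pmf = np.diff(dead_prob)   # P(lifetime = n), n = 1, ..., 60
    print(lifetime_pmf[:5])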

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {X_n, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (X_m, m > n, m ∈ T) is independent of the past process (X_m, m < n, m ∈ T), conditionally on X_n. Usually, T = Z+ = {0, 1, 2, 3, ...} or N = {1, 2, 3, ...} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the X_n take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (X_n) a Markov chain.

Sometimes we say that {X_n, n ∈ T} is Markov instead of saying that it has the Markov property.

Let us now write the Markov property in an equivalent way.

Lemma 1. (X_n, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i_0, ..., i_{n+1} ∈ S,
    P(X_{n+1} = i_{n+1} | X_n = i_n, ..., X_0 = i_0) = P(X_{n+1} = i_{n+1} | X_n = i_n).

Proof. If (X_n) is Markov then the claim holds by the conditional independence of X_{n+1} and (X_0, ..., X_{n−1}) given X_n. On the other hand, if the claim holds, we can see that by applying it several times we obtain
    P(∀ k ∈ [n+1, n+m] X_k = i_k | X_n = i_n, ..., X_0 = i_0) = P(∀ k ∈ [n+1, n+m] X_k = i_k | X_n = i_n),
for any m and any choice of states i_0, ..., i_{n+m}, which precisely means that (X_{n+1}, ..., X_{n+m}) and (X_0, ..., X_{n−1}) are independent conditionally on X_n. This is true for any n and m, and so future is independent of past given the present, viz. the Markov property.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as
    P(X_0 = i_0, X_1 = i_1, X_2 = i_2, ..., X_n = i_n)
        = P(X_0 = i_0) P(X_1 = i_1 | X_0 = i_0) P(X_2 = i_2 | X_1 = i_1) ··· P(X_n = i_n | X_{n−1} = i_{n−1}),    (1)

which means that the two functions
    µ(i) := P(X_0 = i),    i ∈ S,
    p_{i,j}(n, n+1) := P(X_{n+1} = j | X_n = i),    i, j ∈ S,  n ≥ 0,
will specify all joint distributions²:
    P(X_0 = i_0, X_1 = i_1, X_2 = i_2, ..., X_n = i_n) = µ(i_0) p_{i_0,i_1}(0, 1) p_{i_1,i_2}(1, 2) ··· p_{i_{n−1},i_n}(n−1, n).
The function µ(i) is called initial distribution. The function p_{i,j}(n, n+1) is called transition probability from state i at time n to state j at time n + 1. More generally,
    p_{i,j}(m, n) := P(X_n = j | X_m = i)
is the transition probability from state i at time m to state j at time n ≥ m. Of course, p_{i,j}(n, n) = 1(i = j) and, by (1),
    p_{i,j}(m, n) = Σ_{i_1∈S} ··· Σ_{i_{n−1}∈S} p_{i,i_1}(m, m+1) ··· p_{i_{n−1},j}(n−1, n),    (2)
which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix
    P(m, n) := [p_{i,j}(m, n)]_{i,j∈S},
viz. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are p_{i,j}(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as
    P(m, n) = P(m, m+1) P(m+1, m+2) ··· P(n−1, n),
or as
    P(m, n) = P(m, ℓ) P(ℓ, n),    if m ≤ ℓ ≤ n,
something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by a result in Probability, all probabilities of events associated with the Markov chain are specified by them.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which means that the transition probabilities p_{i,j}(n, n+1) do not depend on n,

and therefore we can simply write
    p_{i,j}(n, n+1) =: p_{i,j},
and call p_{i,j} the (one-step) transition probability from state i to state j, while
    P = [p_{i,j}]_{i,j∈S}
is called the (one-step) transition probability matrix or, simply, transition matrix. Then
    P(m, n) = P × P × ··· × P = P^{n−m}    (n − m times),
for all m ≤ n, as follows from (1) and the definition of P. From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let
    µ_n := [µ_n(i) = P(X_n = i)]_{i∈S}
be the distribution of X_n, arranged in a row vector. Then we have
    µ_n = µ_0 P^n,    (3)
and the semigroup property is now even more obvious because
    P^{a+b} = P^a P^b,    for all integers a, b ≥ 0.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δ_i. In other words,
    δ_i(j) := 1, if j = i; 0, otherwise.
We call δ_i the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i_1 and 1 − p to state i_2, then µ = p δ_{i_1} + (1 − p) δ_{i_2}.

Starting from a specific state i. A useful convention used for time-homogeneous chains is to write
    P_i(A) := P(A | X_0 = i).
Thus P_i is the law of the chain when the initial distribution is δ_i. Similarly, if Y is a real random variable, we write
    E_i Y := E(Y | X_0 = i) = Σ_y y P(Y = y | X_0 = i) = Σ_y y P_i(Y = y).
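As an illustration of (3), a minimal numpy sketch computing µ_n = µ_0 P^n for the mouse chain of Section 2.1:

    import numpy as np

    alpha, beta = 0.05, 0.99
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])   # transition matrix of the mouse chain

    mu0 = np.array([1.0, 0.0])         # unit mass delta_1: start in cell 1
    mun = mu0 @ np.linalg.matrix_power(P, 100)   # mu_n = mu_0 P^n, n = 100
    print(mun)   # close to the stationary distribution (0.952..., 0.048...)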

4 Stationarity

We say that the process (X_n, n = 0, 1, ...) is stationary if it has the same law as (X_n, n = m, m + 1, ...), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then, at time n + 1, it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Let X_n be the number of molecules in room 1 at time n. Then
    p_{i,i+1} = P(X_{n+1} = i + 1 | X_n = i) = (N − i)/N,
    p_{i,i−1} = P(X_{n+1} = i − 1 | X_n = i) = i/N.
If we start the chain with X_0 = N (all molecules in room 1) then the process (X_n, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X_0 = N/2. On the average, we indeed expect that we have N/2 molecules in each room. But this will not work, because it is impossible for the number of molecules to remain constant at all times. Another guess then is:

Place each molecule at random either in room 1 or in room 2. If we do so, then the number of molecules in room 1 will be i with probability
    π(i) = (N choose i) 2^{−N},    0 ≤ i ≤ N
(a binomial distribution). We can now verify that if P(X_0 = i) = π(i) then P(X_1 = i) = π(i) and, hence, P(X_n = i) = π(i) for all n. Indeed,
    P(X_1 = i) = Σ_j P(X_0 = j) p_{j,i} = P(X_0 = i−1) p_{i−1,i} + P(X_0 = i+1) p_{i+1,i}
               = π(i−1) p_{i−1,i} + π(i+1) p_{i+1,i}
               = (N choose i−1) 2^{−N} (N−i+1)/N + (N choose i+1) 2^{−N} (i+1)/N = ··· = π(i).
(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: we found (by guessing!) some distribution π with the property that if X_0 has distribution π then every other X_n also has distribution π. If we can do so, then we have created a stationary Markov chain. But we need to prove this.

Lemma 2. A Markov chain (X_n, n ∈ Z+) is stationary if and only if it is time-homogeneous and X_n has the same distribution as X_ℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that
    P(X_0 = i_0, X_1 = i_1, ..., X_n = i_n) = P(X_m = i_0, X_{m+1} = i_1, ..., X_{m+n} = i_n),
for all m ≥ 0 and i_0, ..., i_n ∈ S.

Since, by (3), the distribution µ_n at time n is related to the distribution at time 0 by µ_n = µ_0 P^n, we see that the Markov chain is stationary if and only if the distribution µ_0 is chosen so that µ_0 = µ_0 P. In other words, a Markov chain is stationary if and only if the distribution of X_n does not change with n. Changing notation, we are led to consider:

The stationarity problem: Given [p_{ij}], find those distributions π such that
    πP = π.    (4)
Such a π is called a stationary or invariant distribution. The equations (4) are called balance equations.
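The balance equations can be checked mechanically. Here is a sketch verifying, for a small N of our own choosing, that the binomial distribution is stationary for the Ehrenfest chain:

    import numpy as np
    from math import comb

    N = 10
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N:
            P[i, i + 1] = (N - i) / N   # a molecule moves into room 1
        if i > 0:
            P[i, i - 1] = i / N         # a molecule moves out of room 1

    pi = np.array([comb(N, i) * 2.0**(-N) for i in range(N + 1)])
    print(np.allclose(pi @ P, pi))   # True: the binomial law solves pi P = pi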

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1, because Σ_{j∈S} p_{i,j} = 1, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to that, π must be a probability:
    Σ_{i∈S} π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set S_d of d! permutations of {1, ..., d}. Each x ∈ S_d is of the form x = (x_1, ..., x_d), where all the x_i are distinct. It is easy to see that the stationary distribution is uniform:
    π(x) = 1/d!,    x ∈ S_d.

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, .... The times between successive buses are i.i.d. random variables. Consider the following process: at time n (measured in minutes) let X_n be the time till the arrival of the next bus. Then (X_n, n ≥ 0) is a Markov chain with transition probabilities
    p_{i+1,i} = 1,    i ≥ 1,
    p_{1,i} = p(i),    i ≥ 1.
The state space is S = N = {1, 2, ...}. To find the stationary distribution, write the equations (4):
    π(i) = Σ_{j≥1} π(j) p_{j,i} = π(1) p(i) + π(i+1),    i ≥ 1.
These can easily be solved:
    π(i) = π(1) Σ_{j≥i} p(j),    i ≥ 1.
To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives
    π(1) = (Σ_{i≥1} Σ_{j≥i} p(j))^{−1}.
But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero, which in turn means that the normalisation condition cannot be satisfied. This means that the solution to the balance equations is identically zero. So we must make an:
    ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:
    Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),
and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:
    ASSUMPTION ⟺ the expected time between successive bus arrivals is finite.
So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength. So let us do so.)

Consider an infinite binary vector ξ = (ξ_1, ξ_2, ...), i.e. ξ_i ∈ {0, 1} for all i, and transform it according to the following rule: if ξ_1 = 0, then shift to the left to obtain (ξ_2, ξ_3, ...); if ξ_1 = 1, then shift to the left and flip all bits to obtain (1 − ξ_2, 1 − ξ_3, ...). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:
    ξ(n + 1) = ϕ(ξ(n)).
Here ξ(n) = (ξ_1(n), ξ_2(n), ...) is the state at time n. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. We can show that if we start with the ξ_r(0), r = 1, 2, ..., i.i.d. with P(ξ_r(0) = 1) = P(ξ_r(0) = 0) = 1/2, then the same will be true for all n:
    P(ξ_r(n) = 1) = P(ξ_r(n) = 0) = 1/2,    for any r = 1, 2, ... and any n = 0, 1, 2, ....
The proof is by induction: if the ξ_1(n), ξ_2(n), ... are i.i.d. with P(ξ_r(n) = 1) = P(ξ_r(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε_1, ..., ε_r ∈ {0, 1},
    P(ξ_1(n+1) = ε_1, ..., ξ_r(n+1) = ε_r)
        = P(ξ_1(n) = 0, ξ_2(n) = ε_1, ..., ξ_{r+1}(n) = ε_r)
        + P(ξ_1(n) = 1, ξ_2(n) = 1 − ε_1, ..., ξ_{r+1}(n) = 1 − ε_r).    (5)
You can check that, from (5) and the induction hypothesis, each of the two terms on the right equals 2^{−(r+1)}, so their sum is 2^{−r}, as required. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ_1 = 0 then x < 1/2, and our shifting rule maps x into 2x. If ξ_1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread: it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies
    π = πP    (balance equations)
    π1 = 1    (normalisation condition).
In other words,
    π(i) = Σ_{j∈S} π(j) p_{j,i},    i ∈ S,
    Σ_{j∈S} π(j) = 1.
If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:
    F(A, A^c) := Σ_{i∈A} Σ_{j∈A^c} π(i) p_{i,j}.
Think of π(i) p_{i,j} as a "current" flowing from i to j. So F(A, A^c) is the total current from A to A^c. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and
    F(A, A^c) = F(A^c, A),    for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write
    π(i) = Σ_{j∈S} π(j) p_{j,i} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈A^c} π(j) p_{j,i}.
Multiply π(i) by Σ_{j∈S} p_{i,j} (which equals 1):
    π(i) Σ_{j∈S} p_{i,j} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈A^c} π(j) p_{j,i}.

Now split the left sum also:
    Σ_{j∈A} π(i) p_{i,j} + Σ_{j∈A^c} π(i) p_{i,j} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈A^c} π(j) p_{j,i}.
This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek.

[Figure: schematic representation of the flow balance relations F(A, A^c) = F(A^c, A).]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p_{12} = α, p_{21} = β (consequently, p_{11} = 1 − α, p_{22} = 1 − β). Assume 0 < α, β < 1. We have
    π(1)α = π(2)β,
which immediately tells us that π(1) is proportional to β and π(2) proportional to α:
    π(1) = cβ,    π(2) = cα.
The constant c is found from π(1) + π(2) = 1. That is,
    π(1) = β/(α + β),    π(2) = α/(α + β).
If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952, π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
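The same numbers can be obtained numerically by solving the balance equations together with the normalisation condition; a sketch (the least-squares formulation is just one convenient way to append the extra equation):

    import numpy as np

    alpha, beta = 0.05, 0.99
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    # Solve pi (P - I) = 0 together with pi 1 = 1, as an overdetermined system.
    A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(pi)                            # [0.9519..., 0.0480...]
    print(pi[0] * alpha, pi[1] * beta)   # flow balance: F({1},{2}) = F({2},{1})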

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

[Figure: states 0, 1, 2, 3, 4, ...; each state i moves to i + 1 with probability p and to i − 1 (or stays at 0) with probability q.]

The balance equations are:
    π(i) = p π(i−1) + q π(i+1),    i = 1, 2, 3, ....
But we can make them simpler by choosing A = {0, 1, ..., i−1}. If i ≥ 1 we have
    F(A, A^c) = π(i−1) p = π(i) q = F(A^c, A).
These are much simpler than the previous ones. In fact, we can solve them immediately: for i ≥ 1,
    π(i) = (p/q)^i π(0).
Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by
    (i, j) ∈ E ⟺ p_{i,j} > 0.
In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time. In other words,
    i leads to j (written as i ⇝ j) ⟺ P_i(∃ n ≥ 0: X_n = j) > 0.
Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) p_{i,i_1} p_{i_1,i_2} ··· p_{i_{n−1},j} > 0 for some n ∈ N and some states i_1, ..., i_{n−1};
(iii) P_i(X_n = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since
    0 < P_i(∃ n ≥ 0: X_n = j) ≤ Σ_{n≥0} P_i(X_n = j),

the sum on the right must have a nonzero term, and so (iii) holds.
• Suppose that (ii) holds. We have
    P_i(X_n = j) = Σ_{i_1,...,i_{n−1}} p_{i,i_1} p_{i_1,i_2} ··· p_{i_{n−1},j},
where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term. Hence P_i(X_n = j) > 0, and so (iii) holds.
• Suppose now that (iii) holds. Look at the last display again. Since P_i(X_n = j) > 0, we have that the sum is also positive and so one of its terms is positive, meaning that (ii) holds. In addition, (i) holds because P_i(X_n = j) ≤ P_i(∃ m ≥ 0: X_m = j).

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation
    i communicates with j (written as i ⇄ j) ⟺ i ⇝ j and j ⇝ i.
Obviously, this relation is symmetric (i ⇄ j ⟺ j ⇄ i).

Corollary 2. The relation ⇄ is an equivalence relation.

Proof. Equivalence relation means that it is symmetric, reflexive and transitive. We just observed it is symmetric. It is reflexive and transitive because ⇝ is so.

Just as any equivalence relation, it partitions S into equivalence classes, known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:
    [i] := {j ∈ S : j ⇄ i}.    (6)
So, by definition, [i] = [j] if and only if i ⇄ j. Two communicating classes are either identical or completely disjoint.

[Figure: a chain with states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there, then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if
    Σ_{j∈C} p_{i,j} = 1,    for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: if i ∈ C and if i ⇝ j, then j ∈ C.

Proof. Suppose first that C is a closed communicating class and that i ⇝ j. We then can find states i_1, ..., i_{n−1} such that p_{i,i_1} p_{i_1,i_2} ··· p_{i_{n−1},j} > 0. In particular, p_{i,i_1} > 0. Since i ∈ C, and C is closed, we have i_1 ∈ C. Similarly, p_{i_1,i_2} > 0, and so i_2 ∈ C, and so on. Thus, we conclude that j ∈ C. The converse is left as an exercise.

Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In graph terms: starting from any state, we can reach any other state. In other words still, the state space S is a closed communicating class.

If a state i is such that p_{i,i} = 1, then we say that i is an absorbing state. To put it otherwise, the single-element set {i} is a closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. (Why?)

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: an inessential state i is such that there exists a state j such that i ⇝ j but j ⇝̸ i.

A general Markov chain can have an infinite number of communicating classes. If we remove the non-closed communicating classes, then we obtain a collection of disjoint closed communicating classes with no arrows between them. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction.

[Figure: the general structure of a Markov chain; the communicating classes C1, C2, C3 are not closed, while the classes C4, C5 are closed.]
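Communicating classes depend only on which transition probabilities are positive, so they can be computed mechanically from the graph of the chain; a sketch (the function name is ours):

    def communicating_classes(P):
        """Partition {0, ..., n-1} into communicating classes of the chain P."""
        n = len(P)
        # reach[i][j] is True iff i leads to j (Warshall's transitive closure).
        reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
        classes, assigned = [], set()
        for i in range(n):
            if i in assigned:
                continue
            cls = {j for j in range(n) if reach[i][j] and reach[j][i]}
            classes.append(cls)
            assigned |= cls
        return classes

    # Example: {0, 1} is a closed class, {2} is not closed, {3} is absorbing.
    P = [[0.5, 0.5, 0.0, 0.0],
         [0.5, 0.5, 0.0, 0.0],
         [0.3, 0.0, 0.7, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
    print(communicating_classes(P))   # [{0, 1}, {2}, {3}]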

5.4 Period

Consider the following chain:

[Figure: an irreducible 4-state chain on the states 1, 2, 3, 4.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice however that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time, then it will move to the set C2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X_0 ∈ C1, then we know that X_1 ∈ C2, X_2 ∈ C1, X_3 ∈ C2, X_4 ∈ C1, .... We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. The period of an essential state i is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:
    d(i) := gcd{n ∈ N : P_i(X_n = i) > 0}.
The period of an inessential state is not defined. Let us denote by p_{ij}^{(n)} the entries of P^n. So
    d(i) = gcd D_i,    where D_i = {n ∈ N : p_{ii}^{(n)} > 0}.
Since P^{m+n} = P^m P^n, we have
    p_{ij}^{(m+n)} ≥ p_{ik}^{(m)} p_{kj}^{(n)},    for all m, n ≥ 0 and all i, j, k ∈ S.
So p_{ii}^{(2n)} ≥ (p_{ii}^{(n)})² and, more generally, p_{ii}^{(ℓn)} ≥ (p_{ii}^{(n)})^ℓ, ℓ ∈ N. Therefore, if p_{ii}^{(n)} > 0, then p_{ii}^{(ℓn)} > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if p_{ii}^{(n)} > 0, then p_{ii}^{(m)} > 0. So, whether it is possible to return to i in m steps can be decided by looking at the integer divisors of m.

A state i is called aperiodic if d(i) = 1.

Theorem 2 (the period is a class property). If i ⇄ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let D_i = {n ∈ N : p_{ii}^{(n)} > 0}, D_j = {n ∈ N : p_{jj}^{(n)} > 0}, d(i) = gcd D_i, d(j) = gcd D_j. Since i ⇄ j, there is α ∈ N with p_{ij}^{(α)} > 0, and

there is β ∈ N with p_{ji}^{(β)} > 0. Then
    p_{jj}^{(β+α)} ≥ p_{ji}^{(β)} p_{ij}^{(α)} > 0,
showing that α + β ∈ D_j. Therefore d(j) | α + β. Now let n ∈ D_i. Then p_{ii}^{(n)} > 0, and so
    p_{jj}^{(β+n+α)} ≥ p_{ji}^{(β)} p_{ii}^{(n)} p_{ij}^{(α)} > 0,
showing that α + n + β ∈ D_j. Therefore d(j) | α + β + n for all n ∈ D_i. If a number divides two other numbers, then it also divides their difference; so, from the last two displays, we get d(j) | n for all n ∈ D_i. Hence d(j) is a divisor of all elements of D_i. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

Theorem 3. If i is an aperiodic state, then there exists n_0 such that p_{ii}^{(n)} > 0 for all n ≥ n_0.

Proof. Pick n_2, n_1 ∈ D_i such that n_2 − n_1 = 1. Let n be sufficiently large. Divide n by n_1 to obtain n = q n_1 + r, where the remainder r ≤ n_1 − 1. Therefore
    n = q n_1 + r(n_2 − n_1) = (q − r) n_1 + r n_2.
Because n is sufficiently large, q − r > 0. Since n_1, n_2 ∈ D_i, we have n ∈ D_i as well.

Of particular interest are irreducible chains, i.e. chains where, as defined earlier, all states communicate with one another.

Corollary 3. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Note also: an irreducible chain with finitely many states is aperiodic if and only if there exists an n such that p_{ij}^{(n)} > 0 for all i, j ∈ S.

[Figure: the internal structure of a closed communicating class with period d = 4.]
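The period of a state can be computed directly from its definition as gcd D_i, by scanning the powers of P up to a finite horizon (the horizon is our own pragmatic choice, adequate for small chains):

    from math import gcd
    import numpy as np

    def period(P, i, horizon=None):
        """gcd of {n >= 1 : p_ii^{(n)} > 0}, scanned up to a finite horizon."""
        n_states = P.shape[0]
        horizon = horizon or 2 * n_states * n_states
        d, Q = 0, np.eye(n_states)
        for n in range(1, horizon + 1):
            Q = Q @ P               # Q = P^n
            if Q[i, i] > 0:
                d = gcd(d, n)
                if d == 1:
                    break
        return d

    # Example: a 3-cycle, where every state has period 3.
    P = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
    print(period(P, 0))   # 3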

More generally, if an irreducible chain has period d, then we can decompose the state space into d sets C_0, C_1, ..., C_{d−1} such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C_0, C_1, ..., C_{d−1} such that
    Σ_{j∈C_{r+1}} p_{ij} = 1,    i ∈ C_r,    r = 0, 1, ..., d − 1.
(Here C_d := C_0.)

Proof. Assume d > 1. Define the relation
    i ∼_d j ⟺ p_{ij}^{(nd)} > 0 for some n ≥ 0.
Notice that this is an equivalence relation. Hence it partitions the state space S into disjoint equivalence classes. We show that these classes satisfy what we need. Pick a state i_0 and let C_0 be its equivalence class (i.e. the set of states j such that i_0 ∼_d j). Then pick i_1 such that p_{i_0 i_1} > 0, and let C_1 be the equivalence class of i_1. Continue in this manner and define states i_0, i_1, ..., i_{d−1}, with corresponding classes C_0, C_1, ..., C_{d−1}. It is easy to see that if we continue and pick a further state i_d with p_{i_{d−1},i_d} > 0, then, necessarily, i_d ∈ C_0. We now show that if i belongs to one of these classes and if p_{ij} > 0, then, necessarily, j belongs to the next class. Take, for example, i ∈ C_0. Suppose p_{ij} > 0 but j ∉ C_1; say, j ∈ C_2. Consider the path
    i_0 → i → j → i_2 → i_3 → ··· → i_d → i_0.
Such a path is possible because of the choice of the states i_0, i_1, ..., i_d, and by the assumptions that i_0 ∼_d i and i_2 ∼_d j. The existence of such a path implies that it is possible to go from i_0 to i_0 in a number of steps which is not an integer multiple of d (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by
    T_A := inf{n ≥ 0 : X_n ∈ A}.
We are interested in deriving formulae for the probability that this time is finite, as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis".

As an example, consider the chain

[Figure: states 0, 1, 2; from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6; states 0 and 2 are absorbing.]

⁴ We shall later consider the time inf{n ≥ 1 : X_n ∈ A}, which differs from T_A simply by considering n ≥ 1 instead of n ≥ 0. If X_0 ∉ A, the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that P_1(T_0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define
    ϕ(i) := P_i(T_A < ∞).
We then have:

Theorem 5. The function ϕ(i) satisfies
    ϕ(i) = 1,    i ∈ A,
    ϕ(i) = Σ_{j∈S} p_{ij} ϕ(j),    i ∉ A.
Furthermore, if ϕ′(i) is any other solution of these equations, then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A, then T_A = 0, and so ϕ(i) = 1. If i ∉ A, then T_A ≥ 1. So T_A = 1 + T′_A, where T′_A is the remaining time until A is hit. We first have
    P_i(T_A < ∞) = Σ_{j∈S} P_i(1 + T′_A < ∞ | X_1 = j) P_i(X_1 = j) = Σ_{j∈S} P_i(T′_A < ∞ | X_1 = j) p_{ij}.
But observe that the random variable T′_A is a function of the future after time 1 (i.e. a function of X_1, X_2, ...). Hence, conditionally on X_1, the event {T′_A < ∞} is independent of X_0:
    P(T′_A < ∞ | X_1 = j, X_0 = i) = P(T′_A < ∞ | X_1 = j).
But the Markov chain is homogeneous, which implies that
    P(T′_A < ∞ | X_1 = j) = P(T_A < ∞ | X_0 = j) = P_j(T_A < ∞) = ϕ(j).
Combining the above, we have
    P_i(T_A < ∞) = Σ_{j∈S} ϕ(j) p_{ij},
as needed. For the second part of the proof, let ϕ′(i) be another solution. Then
    ϕ′(i) = Σ_{j∈A} p_{ij} + Σ_{j∉A} p_{ij} ϕ′(j).
By self-feeding this equation, we have
    ϕ′(i) = Σ_{j∈A} p_{ij} + Σ_{j∉A} p_{ij} ( Σ_{k∈A} p_{jk} + Σ_{k∉A} p_{jk} ϕ′(k) )
          = Σ_{j∈A} p_{ij} + Σ_{j∉A} Σ_{k∈A} p_{ij} p_{jk} + Σ_{j∉A} Σ_{k∉A} p_{ij} p_{jk} ϕ′(k).

We recognise that the first term equals P_i(T_A = 1) and the second equals P_i(T_A = 2), so the first two terms together equal P_i(T_A ≤ 2). By omitting the last term, we obtain ϕ′(i) ≥ P_i(T_A ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ P_i(T_A ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ P_i(T_A < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and we let ϕ(i) = P_i(T_0 < ∞). We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0, it takes no time to hit 0; if the chain starts from 2, it will never hit 0.) As for ϕ(1), we have
    ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),
so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time,
    ψ(i) := E_i T_A.
We have:

Theorem 6. The function ψ(i) satisfies
    ψ(i) = 0,    i ∈ A,
    ψ(i) = 1 + Σ_{j∉A} p_{ij} ψ(j),    i ∉ A.
Furthermore, if ψ′(i) is any other solution of these equations, then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. Start the chain from i. If i ∈ A then, obviously, T_A = 0, and so ψ(i) := E_i T_A = 0. If i ∉ A then, as above, T_A = 1 + T′_A, where T′_A is the remaining time. Therefore,
    E_i T_A = 1 + E_i T′_A = 1 + Σ_{j∈S} p_{ij} E_j T_A = 1 + Σ_{j∉A} p_{ij} ψ(j),
which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2}. Clearly, ψ(0) = ψ(2) = 0, while
    ψ(1) = 1 + (1/3) ψ(1),
which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3; therefore, the mean number of flips until the first success is 3/2.)
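For a finite chain, Theorems 5 and 6 reduce first-step analysis to solving small linear systems; a sketch for the three-state example above:

    import numpy as np

    # States 0, 1, 2; from state 1: to 0 w.p. 1/2, stay w.p. 1/3, to 2 w.p. 1/6.
    P = np.array([[1.0, 0.0, 0.0],
                  [1/2, 1/3, 1/6],
                  [0.0, 0.0, 1.0]])

    # phi(1) = P_1(T_{{0}} < oo):  phi(1) = p10 * 1 + p11 * phi(1) + p12 * 0.
    phi1 = P[1, 0] / (1 - P[1, 1])
    print(phi1)   # 0.75

    # psi(1) = E_1 T_A with A = {0, 2}:  psi(1) = 1 + p11 * psi(1).
    psi1 = 1 / (1 - P[1, 1])
    print(psi1)   # 1.5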

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times T_A, T_B for two disjoint sets of states A, B. We may ask to find the probabilities
    ϕ_{AB}(i) = P_i(T_A < T_B).
It should be clear that the equations satisfied are:
    ϕ_{AB}(i) = 1,    i ∈ A,
    ϕ_{AB}(i) = 0,    i ∈ B,
    ϕ_{AB}(i) = Σ_{j∈S} p_{ij} ϕ_{AB}(j),    otherwise.
Furthermore, ϕ_{AB} is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: every time the state is x, a reward f(x) is earned. This happens up to the first hitting time T_A of the set A. The total reward is f(X_0) + f(X_1) + ··· + f(X_{T_A}). We are interested in the mean total reward
    h(x) := E_x Σ_{n=0}^{T_A} f(X_n).
Clearly, h(x) = f(x) for x ∈ A, because when X_0 ∈ A, then T_A = 0 and so the total reward is f(X_0). Next, if x ∉ A, then T_A = 1 + T′_A, where T′_A is the remaining time, and
    h(x) = E_x Σ_{n=0}^{1+T′_A} f(X_n) = f(x) + E_x Σ_{n=1}^{1+T′_A} f(X_n) = f(x) + E_x Σ_{n=0}^{T′_A} f(X_{n+1}),
where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X_1, X_2, .... Hence, by the Markov property, as argued earlier,
    E_x Σ_{n=0}^{T′_A} f(X_{n+1}) = Σ_{y∈S} p_{xy} h(y).

Thus, the set of equations satisfied by h are
    h(x) = f(x),    x ∈ A,
    h(x) = f(x) + Σ_{y∈S} p_{xy} h(y),    x ∉ A.
You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds — unless he is in room 4, in which case he dies immediately. (Sooner or later, he will die; why?)

[Figure: a house with four rooms, 1, 2, 3, 4.]

The question is to compute the average total reward. If we let h(i) be the average total reward if he starts from room i, we have
    h(4) = 0,
    h(1) = 1 + (1/2) h(2),
    h(3) = 3 + (1/2) h(2),
    h(2) = 2 + (1/2) h(1) + (1/2) h(3).
We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
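The thief's equations form a 3 × 3 linear system that can be solved mechanically; a sketch:

    import numpy as np

    # Unknowns h(1), h(2), h(3); h(4) = 0. From the equations above:
    #   h(1) - 0.5 h(2)            = 1
    #  -0.5 h(1) + h(2) - 0.5 h(3) = 2
    #           - 0.5 h(2) + h(3)  = 3
    A = np.array([[ 1.0, -0.5,  0.0],
                  [-0.5,  1.0, -0.5],
                  [ 0.0, -0.5,  1.0]])
    f = np.array([1.0, 2.0, 3.0])
    print(np.linalg.solve(A, f))   # [5. 8. 7.]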

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up, then you win £1 per pound of stake. So, if just before the k-th toss you decide to bet Y_k pounds, then after the realisation of the toss you will have earned Y_k pounds if heads came up, or lost Y_k pounds if tails appeared. If ξ_k is the outcome of the k-th toss (where ξ_k = +1 means heads and ξ_k = −1 means tails), your fortune after the n-th toss is
    X_n = X_0 + Σ_{k=1}^n Y_k ξ_k.
The successive stakes (Y_k, k ∈ N) form the strategy of the gambler. Of course, no gambler is assumed to have any information about the outcome of the upcoming toss, so if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Y_k cannot depend on ξ_k. It can only depend on ξ_1, ..., ξ_{k−1} and–as is often the case in practice–on other considerations of, e.g., astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: an insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and, in addition, that it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem. The gambler starts with X_0 = x pounds and has a target: to make b pounds (b > x). She will stop playing if her money falls below a (a < x). To simplify things, let us consider the simplest case, Y_k = 1 for all k, i.e. she bets £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. We are clearly dealing with a Markov chain in the state space
    {a, a + 1, a + 2, ..., b − 1, b}
with transition probabilities
    p_{i,i+1} = p,    p_{i,i−1} = q = 1 − p,    a < i < b,
and where the states a, b are absorbing. Consider the hitting time of state y: T_y := inf{n ≥ 0 : X_n = y}. We use the notation P_x (respectively, E_x) for the probability (respectively, expectation) under the initial condition X_0 = x.

Theorem 7. Let £x be the initial fortune of the gambler. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals
    P_x(T_b < T_a) = ((q/p)^{x−a} − 1) / ((q/p)^{b−a} − 1),    if p ≠ q,
    P_x(T_b < T_a) = (x − a)/(b − a),    if p = q = 1/2.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let
    ϕ(x, ℓ) := P_x(T_ℓ < T_0),    0 ≤ x ≤ ℓ.
The general solution will then be obtained from
    P_x(T_b < T_a) = ϕ(x − a, b − a).
(Think why!) To solve the problem, write ϕ(x) instead of ϕ(x, ℓ). We obviously have
    ϕ(0) = 0,    ϕ(ℓ) = 1.

By first-step analysis,
    ϕ(x) = p ϕ(x+1) + q ϕ(x−1),    0 < x < ℓ.
Since p + q = 1, we can write this as
    p[ϕ(x+1) − ϕ(x)] = q[ϕ(x) − ϕ(x−1)].
Let δ(x) := ϕ(x+1) − ϕ(x). We have
    δ(x) = (q/p) δ(x−1) = (q/p)² δ(x−2) = ··· = (q/p)^x δ(0),    0 < x < ℓ.
Assuming that λ := q/p ≠ 1, we find
    ϕ(x) = δ(x−1) + δ(x−2) + ··· + δ(0) = [λ^{x−1} + ··· + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).
Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find
    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),    0 ≤ x ≤ ℓ,    λ ≠ 1.
If λ = 1, then q = p, and so ϕ(x+1) − ϕ(x) = ϕ(x) − ϕ(x−1). This gives δ(x) = δ(0) for all x, and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ, and so
    ϕ(x) = x/ℓ,    0 ≤ x ≤ ℓ,    λ = 1.
Summarising, we have obtained
    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1;    x/ℓ, if λ = 1.
The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is
    P_x(T_0 < ∞) = (q/p)^x, if p > 1/2;    1, if p ≤ 1/2.

Proof. We have
    P_x(T_0 < ∞) = lim_{ℓ→∞} P_x(T_0 < T_ℓ) = 1 − lim_{ℓ→∞} P_x(T_ℓ < T_0).
If p > 1/2, then λ = q/p < 1, and so
    P_x(T_0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.
If p < 1/2, then λ = q/p > 1, and so
    P_x(T_0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.
Finally, if p = q = 1/2,
    P_x(T_0 < ∞) = 1 − lim_{ℓ→∞} (x/ℓ) = 1 − 0 = 1.
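Theorem 7 is easy to check by simulation; a sketch comparing the formula for P_x(T_b < T_a) with a Monte Carlo estimate (the parameter values are arbitrary):

    import random

    def p_target(x, a, b, p):
        """P_x(T_b < T_a) for the simple gambler's chain, from Theorem 7."""
        if p == 0.5:
            return (x - a) / (b - a)
        lam = (1 - p) / p
        return (lam**(x - a) - 1) / (lam**(b - a) - 1)

    def simulate(x, a, b, p, trials=100_000):
        wins = 0
        for _ in range(trials):
            pos = x
            while a < pos < b:
                pos += 1 if random.random() < p else -1
            wins += (pos == b)
        return wins / trials

    print(p_target(5, 0, 10, 0.45), simulate(5, 0, 10, 0.45))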

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X_0, ..., X_n) for all n ∈ Z+. An example of a stopping time is the time T_A, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, the future process (X_{n+1}, X_{n+2}, ...) is independent of the past process (X_0, ..., X_{n−1}) if we know the present X_n. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose that (X_n, n ≥ 0) is a time-homogeneous Markov chain and let τ be a stopping time. Then, conditional on τ < ∞ and X_τ = i, the future process (X_{τ+n}, n ≥ 0) is independent of the past process (X_0, ..., X_τ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X_0, ..., X_τ) if τ < ∞. Since
    (X_0, ..., X_τ) 1(τ < ∞) = Σ_{n∈Z+} (X_0, ..., X_n) 1(τ = n),
any such A must have the property that, for all n ∈ Z+, A ∩ {τ = n} is determined by (X_0, ..., X_n). We are going to show that, for any such A,
    P(X_{τ+1} = j_1, X_{τ+2} = j_2, ... | X_τ = i, A, τ < ∞) = p_{i,j_1} p_{j_1,j_2} ···
and that
    P(A | X_τ = i, τ < ∞, X_{τ+1} = j_1, X_{τ+2} = j_2, ...) = P(A | X_τ = i, τ < ∞),
which together express both assertions of the theorem. We have:
    P(X_{τ+1} = j_1, X_{τ+2} = j_2, ..., A, X_τ = i, τ = n)
    (a) = P(X_{n+1} = j_1, X_{n+2} = j_2, ..., A, X_n = i, τ = n)
    (b) = P(X_{n+1} = j_1, X_{n+2} = j_2, ... | X_n = i, A, τ = n) P(X_n = i, A, τ = n)
    (c) = P(X_{n+1} = j_1, X_{n+2} = j_2, ... | X_n = i) P(X_n = i, A, τ = n)
    (d) = p_{i,j_1} p_{j_1,j_2} ··· P(X_n = i, A, τ = n),
where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X_0, ..., X_n) and the ordinary splitting property at n, and (d) is due to the fact that (X_n) is a time-homogeneous Markov chain with transition probabilities p_{ij}. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(X_τ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables X_n comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let T_i be the first time that the chain will visit this state:
    T_i := inf{n ≥ 1 : X_n = i}.
Note the subtle difference between this T_i and the T_i of Section 6: our T_i here is always positive, whereas the T_i of Section 6 was allowed to take the value 0; if X_0 ≠ i, the two definitions coincide. (And we set T_i^{(1)} := T_i.) State i may or may not be visited again. Regardless, we let T_i^{(2)} be the time of the second visit, T_i^{(3)} the time of the third visit, and so on. Formally, we recursively define
    T_i^{(r)} := inf{n > T_i^{(r−1)} : X_n = i},    r = 2, 3, ....
All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:
    X_i^{(r)} := (X_n, T_i^{(r)} ≤ n < T_i^{(r+1)}),    r = 1, 2, 3, ....
Let also
    X_i^{(0)} := (X_n, 0 ≤ n < T_i^{(1)}).
We think of them as random variables, in a generalised sense: they are random trajectories. We refer to them as excursions of the process.

[Figure: a sample path split at the successive visit times T_i^{(1)}, T_i^{(2)}, T_i^{(3)}, T_i^{(4)} to state i. These excursions are independent!]

So X_i^{(r)} is the r-th excursion: the chain visits

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions X_i^(0), X_i^(1), X_i^(2), . . . are independent random variables. Moreover, for time-homogeneous Markov chains, all of them, except, possibly, X_i^(0), are also identical in law.

Proof. Apply the Strong Markov Property (SMP) at T_i^(r): SMP says that, given the state at the stopping time T_i^(r), the future after T_i^(r) is independent of the past before T_i^(r). But the state at T_i^(r) is, by definition, equal to i; therefore, the future is independent of the past, and this proves the first claim. The claim that X_i^(1), X_i^(2), . . . have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important, corollary of this observation is:

Corollary 5. (Numerical) functions g(X_i^(0)), g(X_i^(1)), g(X_i^(2)), . . . of the excursions are independent random variables.

Example: Define
    Λ_r := Σ_{T_i^(r) ≤ n < T_i^(r+1)} 1(X_n = j).
The meaning of Λ_r is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ_1, Λ_2, . . . are independent.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions.

First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:
    P_i(X_n = i for infinitely many n) = 1.
Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:
    P_i(X_n = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:
    0 < P_i(X_n = i for infinitely many n) < 1.
We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Suppose that the Markov chain starts from state i. Consider the total number of visits to state i (excluding the possibility that X_0 = i):
    J_i := Σ_{n=1}^∞ 1(X_n = i) = sup{r ≥ 0 : T_i^(r) < ∞}.
Notice that: state i is visited infinitely many times ⟺ J_i = ∞. Notice also that
    f_ii := P_i(T_i < ∞) = P_i(J_i ≥ 1).
We claim that:

Lemma 4. Starting from X_0 = i, the random variable J_i is geometric:
    P_i(J_i ≥ k) = f_ii^k,   k ≥ 0.

Proof. Indeed, the event {J_i ≥ k} implies that T_i^(k−1) < ∞ and that J_i ≥ k − 1. By the strong Markov property at time T_i^(k−1),
    P_i(J_i ≥ k) = P_i(J_i ≥ k − 1) P_i(J_i ≥ 1).
(The past before T_i^(k−1) is independent of the future, and that is why we get the product.) Hence
    P_i(J_i ≥ k) = f_ii P_i(J_i ≥ k − 1) = f_ii^2 P_i(J_i ≥ k − 2) = · · · = f_ii^k.

Notice that f_ii = P_i(T_i < ∞) can either be equal to 1 or smaller than 1. If f_ii = 1, we have P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1: this means that the state i is recurrent. If f_ii < 1, then P_i(J_i < ∞) = 1: this means that the state i is transient. Because either f_ii = 1 or f_ii < 1, we conclude what we asserted earlier, namely that every state is either recurrent or transient.
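A quick Monte Carlo sanity check of Lemma 4 (an illustrative sketch, not part of the notes): for the simple random walk on Z with up-probability p = 0.7, state 0 is transient with f_00 = 1 − |p − q| = 0.6 (a standard fact about the drifted random walk, quoted here without proof), so P_0(J_0 ≥ k) should be close to 0.6^k. Each run is truncated at a finite horizon, which biases J_0 slightly downwards.

```python
import random

def returns_to_zero(horizon=2_000, p=0.7):
    # count returns of a drifted random walk to 0 within a finite horizon
    x, count = 0, 0
    for _ in range(horizon):
        x += 1 if random.random() < p else -1
        if x == 0:
            count += 1
    return count

samples = [returns_to_zero() for _ in range(20_000)]
for k in (1, 2, 3):
    est = sum(j >= k for j in samples) / len(samples)
    print(k, round(est, 3), round(0.6 ** k, 3))   # empirical vs geometric f_00^k
```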

We summarise this, together with another useful characterisation of recurrence and transience, in the following two lemmas.

Lemma 5. The following are equivalent:
    1. State i is recurrent.
    2. f_ii = P_i(T_i < ∞) = 1.
    3. P_i(J_i = ∞) = 1.
    4. E_i J_i = ∞.

Proof. The equivalence of the first two statements was proved above. Recurrence means, by definition, that i is visited infinitely many times, which, by the definition of J_i, is equivalent to P_i(J_i = ∞) = 1. Clearly, P_i(J_i = ∞) = 1 implies E_i J_i = ∞. On the other hand, since J_i is a geometric random variable, its expectation cannot be infinite without J_i itself being infinite with probability one.

Lemma 6. The following are equivalent:
    1. State i is transient.
    2. f_ii = P_i(T_i < ∞) < 1.
    3. P_i(J_i < ∞) = 1.
    4. E_i J_i < ∞.

Proof. The equivalence of the first two statements was proved before the lemma. If f_ii < 1 then, owing to the fact that J_i is a geometric random variable, J_i cannot take the value +∞ with positive probability, i.e. P_i(J_i < ∞) = 1, and, moreover, E_i J_i < ∞. Conversely, if E_i J_i < ∞ then, clearly, P_i(J_i < ∞) = 1, which, by definition, means that i is visited finitely many times with probability 1, i.e. i is transient.

Corollary 6. If i is a transient state then p_ii^(n) → 0, as n → ∞.

Proof. If i is transient then E_i J_i < ∞. But
    E_i J_i = E_i Σ_{n=1}^∞ 1(X_n = i) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ p_ii^(n),
and if a series is finite then its summands go to 0: p_ii^(n) → 0.

10.1 First hitting time decomposition

Define
    f_ij^(n) := P_i(T_j = n),   n ≥ 1,
the probability to hit state j for the first time at time n, starting from i. Notice the difference with p_ij^(n): the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, f_ij^(n) ≤ p_ij^(n), but the two sequences are related in an exact way:

Lemma 7. p_ij^(n) = Σ_{m=1}^n f_ij^(m) p_jj^(n−m).

Proof. From Elementary Probability,
    p_ij^(n) = P_i(X_n = j) = Σ_{m=1}^n P_i(T_j = m, X_n = j) = Σ_{m=1}^n P_i(T_j = m) P_i(X_n = j | T_j = m).
The first factor equals f_ij^(m), by definition. For the second factor, we use the Strong Markov Property:
    P_i(X_n = j | T_j = m) = P_i(X_n = j | X_m = j) = p_jj^(n−m).
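Lemma 7 can also be verified numerically (an illustrative sketch; the 3-state transition matrix is an assumption). The vector f^(m) below is computed by the first-step recursion f_xj^(m) = Σ_{k ≠ j} p_xk f_kj^(m−1), with f_xj^(1) = p_xj:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.6, 0.0, 0.4]])
i, j, n = 0, 2, 6

f = [None, P[:, j].copy()]          # f[m][x] = f_xj^{(m)}
for m in range(2, n + 1):
    v = f[m - 1].copy()
    v[j] = 0.0                      # forbid visiting j strictly before time m
    f.append(P @ v)

Pn = [np.eye(3)]                    # Pn[k] = P^k, so Pn[0] = identity
for _ in range(n):
    Pn.append(Pn[-1] @ P)

lhs = Pn[n][i, j]                                        # p_ij^{(n)}
rhs = sum(f[m][i] * Pn[n - m][j, j] for m in range(1, n + 1))
print(lhs, rhs)                     # the two numbers agree
```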

Corollary 7. If j is a transient state then p_ij^(n) → 0, as n → ∞, for all states i.

Proof. If j is transient then, by Lemma 6, Σ_{n=1}^∞ p_jj^(n) = E_j J_j < ∞. Hence, summing the relation of Lemma 7 over n, we obtain (after interchanging the order of the two sums on the right-hand side)
    Σ_{n=1}^∞ p_ij^(n) = Σ_{m=1}^∞ f_ij^(m) Σ_{n=m}^∞ p_jj^(n−m) = (1 + E_j J_j) Σ_{m=1}^∞ f_ij^(m) = (1 + E_j J_j) P_i(T_j < ∞),
which is finite. But if a series is finite then its summands go to 0, and hence p_ij^(n) → 0, as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let f_ij := P_i(T_j < ∞), we have
    i ⇝ j ⟺ f_ij > 0.
We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j, then j is recurrent and
    f_ij = P_i(T_j < ∞) = 1.
In particular, j ⇝ i.

Proof. Start with X_0 = i. If i is recurrent and i ⇝ j, then there is a positive probability (= f_ij) that j will appear in one of the i.i.d. excursions X_i^(0), X_i^(1), . . ., and so the probability that j appears in a specific excursion is positive; indeed, it equals
    g_ij := P_i(T_j < T_i) > 0.
Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next one. So the random variables
    δ_{j,r} := 1(j appears in excursion X_i^(r)),   r = 0, 1, 2, . . .,
are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence, not only is f_ij > 0, but in fact f_ij = 1; and j, being visited infinitely many times, is recurrent. The last statement, j ⇝ i, is simply an expression in symbols of what we said above.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Proof. If i is recurrent then every other j in the same communicating class satisfies i ⇝ j, and hence, by the above theorem, every other j is also recurrent. If a communicating class contains a state which is transient then, necessarily, all other states are also transient (for if one of them were recurrent, all of them would be).

We now take a closer look at recurrence.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j does not lead to i. So there is m such that P_i(A) = P_i(X_m = j) > 0, where A := {X_m = j}. Let B := {X_n = i for infinitely many n}. Since j does not lead to i, we have
    P_i(A ∩ B) = P_i(X_m = j, X_n = i for infinitely many n) = 0.
Hence
    0 < P_i(A) = P_i(A ∩ B) + P_i(A ∩ B^c) = P_i(A ∩ B^c) ≤ P_i(B^c),
i.e. P_i(B) = P_i(X_n = i for infinitely many n) < 1, and so i is a transient state.

So if a communicating class is recurrent then it contains no inessential states, and so it is closed.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i must be visited infinitely many times. This state is recurrent, and hence, by Corollary 8, all states in C are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let T_i be the first return time to i, then either P_i(T_i < ∞) = 1 (in which case we say that i is recurrent) or P_i(T_i < ∞) < 1 (transient state). Recall also that a finite random variable may not necessarily have finite expectation.

Example: Let U_0, U_1, U_2, . . . be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. Let
    T := inf{n ≥ 1 : U_n > U_0}.
Thus T represents the number of variables we need to draw to get a value larger than the initial one. (It should be clear that this can, for example, appear in a certain game.) What is the expectation of T? First, observe that T is finite with probability one. Indeed,
    P(T > n) = P(U_1 ≤ U_0, . . . , U_n ≤ U_0) = ∫_0^1 P(U_1 ≤ x, . . . , U_n ≤ x | U_0 = x) dx = ∫_0^1 x^n dx = 1/(n+1).
Therefore,
    P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n+1)) = 1.
But
    E T = Σ_{n=0}^∞ P(T > n) = Σ_{n=0}^∞ 1/(n+1) = ∞.
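A simulation sketch of this example (not in the notes): the tail P(T > n) = 1/(n+1) is easy to confirm, and the sample mean of T is unstable, hinting at E T = ∞.

```python
import random

def draw_T():
    # T = inf{n >= 1 : U_n > U_0} for i.i.d. uniforms
    u0, n = random.random(), 0
    while True:
        n += 1
        if random.random() > u0:
            return n

samples = [draw_T() for _ in range(100_000)]
for n in (1, 4, 9):
    print(n, sum(t > n for t in samples) / len(samples), 1 / (n + 1))
print(sum(samples) / len(samples))   # sample mean: grows with the sample size
```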

You are invited to think about the meaning of this: if the initial value U_0 is unknown, then, on the average, it takes an infinite number of steps to exceed it!

We now classify a recurrent state i as follows. We say that i is positive recurrent if
    E_i T_i < ∞,
and that i is null recurrent if
    E_i T_i = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as a "Law". But this is misleading: it is not a Law; it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z_1, Z_2, . . . be i.i.d. random variables.
(i) Suppose E|Z_1| < ∞ and let µ = EZ_1. Then
    P( lim_{n→∞} (Z_1 + · · · + Z_n)/n = µ ) = 1.
(ii) Suppose E max(Z_1, 0) = ∞, but E min(Z_1, 0) > −∞. Then
    P( lim_{n→∞} (Z_1 + · · · + Z_n)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X_0, X_1, X_2, . . . of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions
    X_a^(1), X_a^(2), X_a^(3), . . .
forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that
    G(X_a^(1)), G(X_a^(2)), G(X_a^(3)), . . .
are i.i.d. random variables, for reasonable choices of the functions G taking values in R.

Example 1: To each state x ∈ S, assign a reward f(x), which we can think of as a reward received when the chain is in state x. For an excursion X, define G(X) to be the maximum reward received during this excursion. Specifically,
    G(X_a^(r)) = max{f(X_t) : T_a^(r) ≤ t < T_a^(r+1)}.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:
    G(X_a^(r)) = Σ_{T_a^(r) ≤ t < T_a^(r+1)} f(X_t).        (7)

Whichever the choice of G, the law of large numbers tells us that
    (G(X_a^(1)) + · · · + G(X_a^(n)))/n → E G(X_a^(1)),        (8)
with probability 1, as n → ∞. Note that
    E G(X_a^(1)) = E G(X_a^(2)) = · · · = E[ G(X_a^(0)) | X_0 = a ] = E_a G(X_a^(0)),
because, if we start with X_0 = a, the initial excursion X_a^(0) also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X_0 = a.

13.2 Average reward

Consider now a function f : S → R, assigning a reward f(x) as in Example 2; in this section we shall assume that f is bounded. We are interested in the existence of the limit
    time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(X_k),
which represents the 'time-average reward' received, and we shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a T_a < ∞) and assume that the distribution of X_0 is such that P(T_a < ∞) = 1. Let f : S → R_+ be a bounded 'reward' function. Then, with probability one, the 'time-average reward' f̄ exists and is a deterministic quantity:
    P( lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = f̄ ) = 1,
where
    f̄ = E_a[ Σ_{n=0}^{T_a−1} f(X_n) ] / E_a T_a = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. We are looking for the total reward received between 0 and t. The idea is this: suppose that t is large enough. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, N_t, of complete excursions from the ground state a that occur between 0 and t:
    N_t := max{r ≥ 1 : T_a^(r) ≤ t}.

[Figure: the excursion times T_a^(1), . . . , T_a^(5) marked on the time axis before t; in this picture N_t = 5.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:
    Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_t^first + G_t^last,
where G_r is given by (7), G_t^first is the reward up to time T_a = T_a^(1) or t (whichever is smaller), and G_t^last is the reward over the last incomplete cycle. We assumed that f is bounded, say |f(x)| ≤ B. Hence |G_t^first| ≤ B T_a. Since P(T_a < ∞) = 1, we have B T_a/t → 0, and so
    (1/t) G_t^first → 0,   as t → ∞,
with probability one. A similar argument shows that
    (1/t) G_t^last → 0,   as t → ∞,
with probability one. So we are left with the sum of the rewards over complete cycles:

    (1/t) Σ_{r=1}^{N_t−1} G_r.        (9)

Look again at what the Law of Large Numbers (8) tells us: that
    lim_{n→∞} (1/n) Σ_{r=1}^n G_r = EG_1,        (10)
with probability one. Now look at the sequence
    T_a^(r) = T_a^(1) + Σ_{k=2}^r ( T_a^(k) − T_a^(k−1) ),
and see what the Law of Large Numbers tells us again: since the differences T_a^(k) − T_a^(k−1), k = 2, 3, . . ., are i.i.d. random variables with common expectation E_a T_a, and since T_a^(1)/r → 0, we have
    lim_{r→∞} T_a^(r)/r = E_a T_a.        (11)
Since T_a^(r) → ∞, as r → ∞, it follows that the number N_t of complete cycles before time t also tends to ∞, as t → ∞. By definition, the visit with index r = N_t to the ground state a is the last one before t (see also the figure above):
    T_a^(N_t) ≤ t < T_a^(N_t+1).
Hence
    T_a^(N_t)/N_t ≤ t/N_t < T_a^(N_t+1)/N_t.
But (11) tells us that
    lim_{t→∞} T_a^(N_t)/N_t = lim_{t→∞} T_a^(N_t+1)/N_t = E_a T_a.
Hence lim_{t→∞} t/N_t = E_a T_a, i.e.
    lim_{t→∞} N_t/t = 1/E_a T_a.        (12)
Write now
    (1/t) Σ_{r=1}^{N_t−1} G_r = (N_t/t) · (1/N_t) Σ_{r=1}^{N_t−1} G_r.
Since N_t → ∞, (10) gives lim_{t→∞} (1/N_t) Σ_{r=1}^{N_t−1} G_r = EG_1. Using (10) and (12) in (9), we have
    lim_{t→∞} (1/t) Σ_{r=1}^{N_t−1} G_r = EG_1 / E_a T_a.
Thus
    lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = EG_1 / E_a T_a =: f̄.

To see that f̄ is as claimed, just observe that
    EG_1 = E Σ_{T_a^(1) ≤ t < T_a^(2)} f(X_t) = E_a Σ_{0 ≤ t < T_a} f(X_t),
from the Strong Markov Property. The constant f̄ is a sort of average over an excursion.

Comment 1: It is important to understand what the formula says. The numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included); the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both: we may write
    EG_1 = E_a Σ_{0 ≤ t < T_a} f(X_t)   or   EG_1 = E_a Σ_{0 < t ≤ T_a} f(X_t).

[Figure: the rewards f(X_0) = f(a), f(X_1), f(X_2), . . . , f(X_{T_a − 1}) collected over one excursion from a, between times 0 and T_a.]

Comment 2: A very important particular case occurs when we choose
    f(x) := 1(x = b),
i.e. we let f be 1 at the state b and 0 everywhere else. The theorem above then says that, for all b ∈ S,
    lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = b) = E_a[ Σ_{k=0}^{T_a−1} 1(X_k = b) ] / E_a T_a,        (13)
with probability 1. The numerator is the average number of times state b is visited between two successive visits to state a.

Comment 3: If in the above we let b = a, then the numerator equals 1. Hence
    lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = a) = 1 / E_a T_a,        (14)
with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes E_a T_a units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/E_a T_a.

Comment 4: Suppose that two states, a, b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b T_b. The equality of these two limits gives:
    E_a[ Σ_{k=0}^{T_a−1} 1(X_k = b) ] / E_a T_a = 1 / E_b T_b.
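These comments are easy to test by simulation (an illustrative sketch; the three-state chain is an assumption): the long-run fraction of time spent at a matches 1/E_a T_a estimated from the observed return times.

```python
import random

P = {0: [(0, 0.5), (1, 0.5)], 1: [(1, 0.3), (2, 0.7)], 2: [(0, 1.0)]}

def step(i):
    u, acc = random.random(), 0.0
    for j, p in P[i]:
        acc += p
        if u < acc:
            return j
    return P[i][-1][0]

a, x, t = 0, 0, 1_000_000
visits, returns, last = 0, [], 0
for n in range(1, t + 1):
    x = step(x)
    if x == a:
        visits += 1
        returns.append(n - last)   # observed return times to a
        last = n
print(visits / t, 1 / (sum(returns) / len(returns)))   # the two agree
```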

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity
    ν^[a](x) := E_a Σ_{n=0}^{T_a−1} 1(X_n = x),   x ∈ S,        (15)
should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state, then ν^[a](x)/E_a T_a is the long-run fraction of times that state x is visited. In view of this, it should have something to do with the concept of stationary distribution discussed a few sections ago.

Note: Although ν^[a] depends on the choice of the 'ground' state a (the choice of notation is justifiable through the notation (6) for the communicating class of the state a), we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent (we stress that we do not need the stronger assumption of positive recurrence here). Consider the function ν^[a] defined in (15). Then ν^[a] satisfies the balance equations:
    ν^[a] = ν^[a] P,
i.e.
    ν^[a](x) = Σ_{y∈S} ν^[a](y) p_yx,   x ∈ S.
Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^[a](x) < ∞.

Proof. We start the Markov chain with X_0 = a. Recurrence tells us that T_a < ∞ with probability one. Since a = X_{T_a}, we can write
    ν^[a](x) = E_a Σ_{n=1}^{T_a} 1(X_n = x).        (16)
This was a trivial but important step: we replaced the n = 0 term with the n = T_a term. Both of them equal 1(x = a), so there is nothing lost in doing so. The real reason for doing so is that we wanted to replace a function of X_0, X_1, X_2, . . . by a function of X_1, X_2, X_3, . . .,

and thus be in a position to apply first-step analysis, which is precisely what we do next:
    ν^[a](x) = E_a Σ_{n=1}^∞ 1(X_n = x, n ≤ T_a)
             = Σ_{n=1}^∞ P_a(X_n = x, n ≤ T_a)
             = Σ_{n=1}^∞ Σ_{y∈S} P_a(X_n = x, X_{n−1} = y, n ≤ T_a)
             = Σ_{n=1}^∞ Σ_{y∈S} p_yx P_a(X_{n−1} = y, n ≤ T_a)
             = Σ_{y∈S} p_yx E_a Σ_{n=1}^{T_a} 1(X_{n−1} = y)
             = Σ_{y∈S} p_yx ν^[a](y),
where, in the last step, we used the definition (15), after shifting the summation index. So we have shown
    ν^[a](x) = Σ_{y∈S} p_yx ν^[a](y),   x ∈ S,
which are precisely the balance equations.

Next, let x be such that a ⇝ x; we show that 0 < ν^[a](x) < ∞. First note that, by (16), ν^[a](a) = 1, by the definition of T_a. We just showed that ν^[a] = ν^[a] P; by iteration, ν^[a] = ν^[a] P^n for all n ∈ N. Since a ⇝ x, there exists n ≥ 1 such that p_ax^(n) > 0. Hence
    ν^[a](x) ≥ ν^[a](a) p_ax^(n) = p_ax^(n) > 0.
From Theorem 10, x ⇝ a, so there exists m ≥ 1 such that p_xa^(m) > 0. Hence
    1 = ν^[a](a) ≥ ν^[a](x) p_xa^(m),
so ν^[a](x) ≤ 1/p_xa^(m) < ∞, indeed.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then
    Σ_{x∈S} ν^[a](x) = E_a Σ_{n=1}^{T_a} Σ_{x∈S} 1(X_n = x) = E_a Σ_{n=1}^{T_a} 1 = E_a T_a,
which, by definition, is finite when a is positive recurrent. We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define
    π^[a](x) := ν^[a](x)/E_a T_a = E_a[ Σ_{n=0}^{T_a−1} 1(X_n = x) ] / E_a T_a,   x ∈ S.        (17)
Then π^[a] is a stationary distribution.

Proof. We have already shown, in Proposition 2, that ν^[a] P = ν^[a]. But π^[a] is simply a multiple of ν^[a], so π^[a] P = π^[a]. It only remains to show that π^[a] is a probability, i.e. that it sums up to one. Since a is positive recurrent, E_a T_a < ∞, and
    Σ_{x∈S} π^[a](x) = (1/E_a T_a) Σ_{x∈S} ν^[a](x) = 1.
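A simulation sketch of (17) (an assumed small example, not from the notes): estimate ν^[a](x)/E_a T_a from simulated excursions and compare with the stationary distribution obtained by solving the balance equations directly.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [1.0, 0.0, 0.0]])
a, runs = 0, 50_000

rng = np.random.default_rng(0)
counts, total = np.zeros(3), 0
for _ in range(runs):
    x = a
    while True:                     # one excursion from a back to a
        counts[x] += 1              # accumulates sum_{n=0}^{T_a - 1} 1(X_n = x)
        total += 1                  # accumulates T_a
        x = rng.choice(3, p=P[x])
        if x == a:
            break
print("simulated:", counts / total)             # nu^[a](x) / E_a T_a

# exact: solve pi = pi P with normalisation, via least squares
M = np.vstack([P.T - np.eye(3), np.ones(3)])
pi = np.linalg.lstsq(M, np.array([0., 0., 0., 1.]), rcond=None)[0]
print("exact:    ", pi)
```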

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique; this is what we will try to establish in the sequel. In other words: if there is a positive recurrent state, then the set of linear equations
    π(x) = Σ_{y∈S} π(y) p_yx,   x ∈ S,        Σ_{y∈S} π(y) = 1,
has at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X_0 = i. We already know that j is recurrent (Theorem 10). Consider the excursions X_i^(r), r = 0, 1, 2, . . ., and recall that the probability g_ij = P_i(T_j < T_i) that j occurs in a specific excursion is positive. Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. In words: to decide whether j will occur in a specific excursion, we toss a coin with probability of heads g_ij > 0; the coins are i.i.d.; if heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R_1 (respectively, R_2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the excursions with indices R_1, R_1+1, R_1+2, . . . , R_2 from state i; the marked stretch ζ covers the time between two successive visits to j, a random variable with the same law as T_j under P_j.]

By the strong Markov property, the time between two successive visits to j has the same law as T_j under P_j, and it is bounded above by the sum of the durations of the excursions with indices R_1, R_1+1, . . . , R_2. So we just need to show that
    ζ := Σ_{r=1}^{R_2 − R_1 + 1} Z_r
has finite expectation, where the Z_r are i.i.d. with the same law as T_i under P_i. Now, R_2 − R_1 is a geometric random variable:
    P_i(R_2 − R_1 = r) = (1 − g_ij)^{r−1} g_ij,   r = 1, 2, . . .,
so R_2 − R_1 + 1 has finite expectation, and, by Wald's lemma,
    E_i ζ = E_i(R_2 − R_1 + 1) × E_i T_i = (g_ij^{−1} + 1) E_i T_i.
Since g_ij > 0 and E_i T_i < ∞, we have E_i ζ < ∞, whence E_j T_j < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

In other words:
An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent.
An irreducible Markov chain is called transient if one (and hence all) states are transient.
An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent.
An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent.

[Figure: classification tree. An IRREDUCIBLE CHAIN is either RECURRENT (which splits into POSITIVE RECURRENT and NULL RECURRENT) or TRANSIENT.]

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Diagram: states 0, 1, 2; state 1 jumps to 0 with probability 1/2, stays at 1 with probability 1/3, and jumps to 2 with probability 1/6; states 0 and 2 are absorbing.]

Clearly, state 0 (and likewise state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many, infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by
    π(0) = α,   π(1) = 0,   π(2) = 1 − α,
is a stationary distribution. (You should think why this is OBVIOUS!) Notice that states 0 and 2 do not communicate; this lack of communication is, indeed, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution and fix a ground state a. Since the chain is irreducible and positive recurrent, we have P_i(T_j < ∞) = 1 for all states i and j. Then
    P(T_a < ∞) = Σ_{x∈S} π(x) P_x(T_a < ∞) = Σ_{x∈S} π(x) = 1.        (18)
Hence, by (13), we have, with probability one,
    (1/t) Σ_{n=1}^t 1(X_n = x) → E_a[ Σ_{k=0}^{T_a−1} 1(X_k = x) ] / E_a T_a = π^[a](x),
as t → ∞, for all x ∈ S. Now consider the Markov chain (X_n, n ≥ 0) with P(X_0 = x) = π(x), x ∈ S, i.e. we start the chain from the stationary distribution π. Then, for all n, we have P(X_n = x) = π(x). By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded above by 1, we can take expectations of both sides:
    E[ (1/t) Σ_{n=1}^t 1(X_n = x) ] = (1/t) Σ_{n=1}^t P(X_n = x) = π(x) → π^[a](x),
as t → ∞.

Therefore π(x) = π^[a](x), for all x ∈ S: an arbitrary stationary distribution π must be equal to the specific one π^[a]. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π^[a] = π^[b].

Proof. Both π^[a] and π^[b] are stationary distributions; there is only one stationary distribution, so they are equal. In particular, from the definition of π^[a], we obtain the cycle formula:
    E_a[ Σ_{k=0}^{T_a−1} 1(X_k = x) ] / E_a T_a = E_b[ Σ_{k=0}^{T_b−1} 1(X_k = x) ] / E_b T_b,   x ∈ S.
The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,
    π(a) = 1 / E_a T_a.

Proof. By the above, π(a) = π^[a](a) = ν^[a](a)/E_a T_a = 1/E_a T_a, since ν^[a](a) = 1.

In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. The following is a sort of converse, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider an arbitrary Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C, and let π(·|C) be the distribution defined by restricting π to C:
    π(x|C) := π(x)/π(C),   x ∈ C,   where π(C) = Σ_{y∈C} π(y).
We can easily see that we thus obtain a Markov chain with state space C and stationary distribution π(·|C). This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. But π(i|C) > 0, and, by Corollary 12 applied to the restricted chain, π(i|C) = 1/E_i T_i. Therefore E_i T_i < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider an arbitrary Markov chain and decompose its state space into communicating classes. To each positive recurrent communicating class C there corresponds a unique stationary distribution π^C,

which assigns positive probability to each state in C. It is defined by picking a state (any state!) a ∈ C and letting π^C := π^[a].

Proposition 3. Any stationary distribution π is necessarily of the form
    π = Σ_{C : C positive recurrent communicating class} α_C π^C,
where α_C ≥ 0 and Σ_C α_C = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0, then x belongs to some positive recurrent class C. For each such C, define the conditional distribution
    π(x|C) := π(x)/π(C),   where π(C) := Σ_{x∈C} π(x).
Then (π(x|C), x ∈ C) is a stationary distribution, and, by our uniqueness result, π(x|C) ≡ π^C(x). So the above decomposition holds with α_C = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(X_n = x) as n → ∞. So far we have seen that, under irreducibility and positive recurrence conditions, the so-called Cesàro averages (1/n) Σ_{k=1}^n P(X_k = x) converge to π(x), as n → ∞. A sequence of real numbers (a_n) may fail to converge while its Cesàro averages (1/n)(a_1 + · · · + a_n) do converge; take, for example, a_n = (−1)^n.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain, a Markov chain being a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. Still, we may say it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the differential equation
    ẋ = −ax + b,   x ∈ R,   a > 0.
There are myriads of applications giving physical interpretation to this. The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero:
    x* = b/a.

[Figure: several solution curves x(t) approaching the equilibrium level x* as t grows.]

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If we start from any other point then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous-time stochastic systems, let us consider a stochastic analogue of the above in discrete time:
    X_{n+1} = ρX_n + ξ_n.
Here we assume that X_0, ξ_0, ξ_1, ξ_2, . . . are given, mutually independent, random variables, that the ξ_n all have the same law, and we define the stochastic sequence X_1, X_2, . . . by means of the recursion above. There are myriads of interpretations, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever. But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point, since the noise ξ_n keeps perturbing the system. However, notice what happens when we solve the recursion; and we are in a happy situation here, for we can do it:
    X_1 = ρX_0 + ξ_0
    X_2 = ρX_1 + ξ_1 = ρ(ρX_0 + ξ_0) + ξ_1 = ρ²X_0 + ρξ_0 + ξ_1
    X_3 = ρX_2 + ξ_2 = ρ³X_0 + ρ²ξ_0 + ρξ_1 + ξ_2
    · · · · · ·
    X_n = ρ^n X_0 + ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + · · · + ρξ_{n−2} + ξ_{n−1}.
It will follow that |ρ| < 1 is the condition for stability.
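A simulation sketch of this system (not in the notes; ρ = 0.8 and bounded uniform noise are assumptions), anticipating the discussion that follows: the values X_n keep fluctuating, but two copies started from very different X_0 have almost the same empirical law after a moderate number of steps.

```python
import random

def run(x0, n=50, rho=0.8):
    x = x0
    for _ in range(n):
        x = rho * x + random.uniform(-1, 1)   # bounded i.i.d. noise
    return x

a = sorted(run(0.0) for _ in range(20_000))
b = sorted(run(100.0) for _ in range(20_000))
for q in (0.1, 0.5, 0.9):                     # compare empirical quantiles
    k = int(q * len(a))
    print(q, round(a[k], 3), round(b[k], 3))  # nearly identical distributions
```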

Since |ρ| < 1, the first term, ρ^n X_0, goes to 0 as n → ∞. The second term,
    X̃_n := ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + · · · + ρξ_{n−2} + ξ_{n−1},
which is a linear combination of ξ_0, . . . , ξ_{n−1}, does not converge. But notice that if we reverse the order of the coefficients and let
    X̂_n := ρ^{n−1}ξ_{n−1} + ρ^{n−2}ξ_{n−2} + · · · + ρξ_1 + ξ_0,
then, since the ξ_n are i.i.d., we have
    P(X̃_n ≤ x) = P(X̂_n ≤ x),
while, under nice conditions (e.g. if we assume that the ξ_n are bounded), X̂_n does converge:
    X̂_n → X̂_∞ = Σ_{k=0}^∞ ρ^k ξ_k.
What this means is that, while the limit of X_n does not exist, the limit of its distribution does:
    lim_{n→∞} P(X̃_n ≤ x) = lim_{n→∞} P(X̂_n ≤ x) = P(X̂_∞ ≤ x).
This is precisely the reason why we define stability as convergence of the distributions of the random variables, and not, for instance, as convergence of the random variables themselves.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,
    lim_{n→∞} P(X_n = x) = π(x),
uniformly in x. In particular, the n-step transition matrix P^n converges to a matrix with rows all equal to π:
    lim_{n→∞} p_ij^(n) = π(j),   i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law: we know f(x) = P(X = x) and g(y) = P(Y = y). How should we then define what P(X = x, Y = y) is? A straightforward (often not so useful) construction is to assume that the random variables are independent and define P(X = x, Y = y) = P(X = x)P(Y = y). But suppose that we have some further requirement, for example, to try to make the coincidence probability P(X = Y) as large as possible. Coupling refers to the various methods for constructing such a joint law.

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find a joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.
    f(x) = Σ_y h(x, y),   g(y) = Σ_x h(x, y),
and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (X_n, n ≥ 0) and the law of a stochastic process Y = (Y_n, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

We say that a random time T is a meeting time of the two processes if
    X_n = Y_n for all n ≥ T.
This T is a random variable that takes values in the set of times, or, on the event that the two processes never meet, the value +∞. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,
    |P(X_n = x) − P(Y_n = x)| ≤ P(T > n).
Since this holds for all x ∈ S, and since the right-hand side does not depend on x, we can write it as
    sup_{x∈S} |P(X_n = x) − P(Y_n = x)| ≤ P(T > n).        (19)
If, in particular, T is finite with probability one, i.e. P(T < ∞) = 1, then
    sup_{x∈S} |P(X_n = x) − P(Y_n = x)| → 0,   as n → ∞.

Proof. We have
    P(X_n = x) = P(X_n = x, n < T) + P(X_n = x, n ≥ T)
               = P(X_n = x, n < T) + P(Y_n = x, n ≥ T)
               ≤ P(n < T) + P(Y_n = x),
and hence P(X_n = x) − P(Y_n = x) ≤ P(T > n). Repeating with the roles of X and Y interchanged, we obtain P(Y_n = x) − P(X_n = x) ≤ P(T > n). Combining the two inequalities, we obtain the first assertion:
    |P(X_n = x) − P(Y_n = x)| ≤ P(T > n),   n ≥ 0,   x ∈ S.
Assume now that P(T < ∞) = 1. Then
    lim_{n→∞} P(T > n) = P(T = ∞) = 0,
and so sup_{x∈S} |P(X_n = x) − P(Y_n = x)| → 0, as n → ∞.
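A sketch of the independence coupling used in the next subsection (the two-state chain is an illustrative assumption): run two copies of the same chain independently from different initial states and record their first meeting time T; by the coupling inequality, the estimated tail P(T > n) bounds |P(X_n = x) − P(Y_n = x)|.

```python
import random

P = {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.9), (1, 0.1)]}

def step(i):
    u, acc = random.random(), 0.0
    for j, p in P[i]:
        acc += p
        if u < acc:
            return j
    return P[i][-1][0]

def meeting_time(x0=0, y0=1):
    x, y, n = x0, y0, 0
    while x != y:                  # run the two chains independently
        x, y, n = step(x), step(y), n + 1
    return n

samples = [meeting_time() for _ in range(50_000)]
for n in (0, 1, 2, 5):
    print(n, sum(t > n for t in samples) / len(samples))   # P(T > n)
```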

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let p_ij denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S.

We shall consider the laws of two processes. The first process, denoted by X = (X_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_ij and X_0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Y_n, n ≥ 0), has the law of a Markov chain with the same transition probabilities p_ij, but with Y_0 distributed according to the stationary distribution π. In other words, the two processes differ only in how their initial states are chosen. It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(X_n = x, Y_m = y).

We now couple them by doing the straightforward thing: we assume that X = (X_n, n ≥ 0) is independent of Y = (Y_n, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:
    T := inf{n ≥ 0 : X_n = Y_n}.
Consider the process
    W_n := (X_n, Y_n).
Then (W_n, n ≥ 0) is a Markov chain with state space S × S. Its initial state W_0 = (X_0, Y_0) has distribution P(W_0 = (x, y)) = µ(x)π(y). Its (1-step) transition probabilities are
    q_{(x,y),(x′,y′)} := P(W_{n+1} = (x′, y′) | W_n = (x, y))
                      = P(X_{n+1} = x′ | X_n = x) P(Y_{n+1} = y′ | Y_n = y) = p_{xx′} p_{yy′},
and its n-step transition probabilities are
    q_{(x,y),(x′,y′)}^(n) = p_{xx′}^(n) p_{yy′}^(n).
From Theorem 3 and the aperiodicity assumption, we have that p_{xx′}^(n) > 0 and p_{yy′}^(n) > 0 for all large n, implying that q_{(x,y),(x′,y′)}^(n) > 0 for all large n, and so (W_n, n ≥ 0) is an irreducible chain. Notice also that
    σ(x, y) := π(x)π(y),   (x, y) ∈ S × S,
is a stationary distribution for (W_n, n ≥ 0). (Indeed, had we started with both X_0 and Y_0 independent and distributed according to π, then, for all n, X_n and Y_n would be independent and distributed according to π.)

Therefore σ(x, y) > 0 for all (x, y) ∈ S × S, and so, by Corollary 13, (W_n, n ≥ 0) is positive recurrent. In particular,
    P(T < ∞) = 1.
Define now a third process by:
    Z_n := X_n 1(n < T) + Y_n 1(n ≥ T),
i.e. Z_n equals X_n before the meeting time of X and Y, and Z_n = Y_n after the meeting time.

[Figure: a trajectory of X (arbitrary initial distribution) and one of Y (stationary initial distribution) merging at time T; Z follows X before T and Y after T.]

(Z_n, n ≥ 0) is a Markov chain with transition probabilities p_ij and with Z_0 = X_0, which has an arbitrary distribution µ. Obviously, (Z_n, n ≥ 0) is identical in law to (X_n, n ≥ 0). Hence
    P(X_n = x) = P(Z_n = x),   for all n and x.
Moreover, T is a meeting time between Z and Y. By the coupling inequality,
    |P(Z_n = x) − P(Y_n = x)| ≤ P(T > n),
uniformly in x, and the right-hand side converges to zero, precisely because P(T < ∞) = 1. Since P(Z_n = x) = P(X_n = x), and since P(Y_n = x) = π(x) for all n, we have
    |P(X_n = x) − π(x)| ≤ P(T > n) → 0,   as n → ∞,
uniformly in x.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_ij are the transition probabilities of an irreducible chain X, and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and that p_01 = p_10 = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and transition probabilities
    q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.
This is not irreducible. However, if we modify the original chain and make it aperiodic by setting, say, p_00 = 0.01, p_01 = 0.99, p_10 = 1,

then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p_01 = p_10 = 1. Suppose X_0 = 0. Then X_n = 0 for even n, and X_n = 1 for odd n. Therefore p_00^(n) = 1 for even n and = 0 for odd n; thus lim_{n→∞} p_00^(n) does not exist. However, if we modify the chain by setting p_00 = 0.01, p_01 = 0.99, p_10 = 1, then
    lim_{n→∞} p_00^(n) = lim_{n→∞} p_10^(n) = π(0),
    lim_{n→∞} p_11^(n) = lim_{n→∞} p_01^(n) = π(1),
where π(0), π(1) satisfy 0.99 π(0) = π(1) and π(0) + π(1) = 1; indeed, π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,
    lim_{n→∞} [ p_00^(n)  p_01^(n) ; p_10^(n)  p_11^(n) ] = [ 0.503  0.497 ; 0.503  0.497 ].

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong" (we put "wrong" in quotes because the term "ergodic" means something specific in Mathematics; cf. Parry, 1981). Terminology cannot be, per se, wrong; it is, however, customary to have a consistent terminology, and this is what our choice expresses.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. We know that we have a unique stationary distribution π and that π(x) > 0 for all states x. Suppose that the period of the chain is d. By the results of §5.4, we can decompose the state space into d classes C_0, C_1, . . . , C_{d−1} such that the chain moves cyclically between them: from C_0 to C_1, from C_1 to C_2, . . . , and from C_{d−1} back to C_0. It is not difficult to see that:

Lemma 8. Let α_r := Σ_{i∈C_r} π(i). Then
    α_r = 1/d,   for all r = 0, 1, . . . , d − 1.

Proof. We have that π satisfies the balance equations π(i) = Σ_{j∈S} π(j) p_ji, i ∈ S. Suppose i ∈ C_r. Then p_ji = 0 unless j ∈ C_{r−1}, so
    π(i) = Σ_{j∈C_{r−1}} π(j) p_ji.
Summing over i ∈ C_r, we obtain
    α_r = Σ_{i∈C_r} π(i) = Σ_{j∈C_{r−1}} π(j) Σ_{i∈C_r} p_ji = Σ_{j∈C_{r−1}} π(j) = α_{r−1}.
Since the α_r add up to 1, each one must be equal to 1/d.

The chain (X_{nd}, n ≥ 0), i.e. the original chain sampled every d steps, is aperiodic and consists of d closed communicating classes, namely C_0, . . . , C_{d−1}. Hence, if we restrict the chain (X_{nd}, n ≥ 0) to any one of the sets C_r, we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply. We shall show that:

Lemma 9. Suppose that X_0 ∈ C_r. Then the (unique) stationary distribution of (X_{nd}, n ≥ 0) is (dπ(i), i ∈ C_r).

Proof. All states of the restricted chain have period 1, so it has a unique stationary distribution. By Lemma 8, Σ_{i∈C_r} dπ(i) = dα_r = 1, so (dπ(i), i ∈ C_r) is a probability distribution on C_r; and since π = πP^d, it satisfies the balance equations of the d-step chain. Hence it is the stationary distribution in question.

This means that:

Theorem 18. Suppose that p_ij are the transition probabilities of a positive recurrent irreducible chain with period d, and with stationary distribution π. Let i ∈ C_k, j ∈ C_ℓ, and let r = ℓ − k (modulo d, taken as a number between 0 and d − 1). Then
    p_ij^(r+nd) → dπ(j),   as n → ∞.

Proof. Suppose first that i, j belong to the same class C_r. Applying Theorem 17 to the restricted chain of Lemma 9, we obtain
    p_ij^(nd) = P_i(X_{nd} = j) → dπ(j),   as n → ∞.
Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ C_k and j ∈ C_ℓ. Starting from i, the original chain enters C_ℓ in r = ℓ − k steps (where r is taken between 0 and d − 1, after reduction by d) and, thereafter, returns to C_ℓ every d steps. Hence, conditioning on the state reached after the first r steps and using the same-class limit,
    p_ij^(r+nd) = P_i(X_{r+nd} = j) → dπ(j),   as n → ∞.
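Theorem 18 is easy to check numerically (an illustrative sketch): the simple random walk on a 4-cycle has period d = 2 and uniform stationary distribution π ≡ 1/4, so p_00^(2n) should tend to dπ(0) = 1/2, while p_00^(2n+1) = 0.

```python
import numpy as np

P = np.zeros((4, 4))
for i in range(4):
    P[i, (i - 1) % 4] = P[i, (i + 1) % 4] = 0.5   # walk on the 4-cycle

print(np.linalg.matrix_power(P, 40)[0, 0])   # close to 0.5 = d * pi(0)
print(np.linalg.matrix_power(P, 41)[0, 0])   # exactly 0 (wrong parity)
```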

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_ij^(n), as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_ij^(n) → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw, in Corollary 7, that if j is transient then p_ij^(n) → 0, as n → ∞, for all i.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then
    p_ij^(n) → 0,   as n → ∞,   for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, . . ., a probability distribution on N,
    p^(n) = (p_1^(n), p_2^(n), p_3^(n), . . .),
i.e. Σ_{i=1}^∞ p_i^(n) = 1 and p_i^(n) ≥ 0 for all i. Then we can find a sequence n_1 < n_2 < n_3 < · · · of integers such that
    lim_{k→∞} p_i^(n_k) exists for all i ∈ N.

Proof. Since the numbers p_1^(n), n = 1, 2, . . ., are contained between 0 and 1, there must be a subsequence (n_k^1) such that lim_{k→∞} p_1^(n_k^1) exists and equals, say, p_1. Now consider the numbers p_2^(n_k^1), k = 1, 2, . . .. Since they are also contained between 0 and 1, there must be a subsequence (n_k^2) of (n_k^1) such that lim_{k→∞} p_2^(n_k^2) exists and equals, say, p_2. It is clear how we continue: for each i, we find a subsequence (n_k^i) of (n_k^{i−1}) such that lim_{k→∞} p_i^(n_k^i) exists and equals, say, p_i. Schematically:

    p_1^(n_1^1), p_1^(n_2^1), p_1^(n_3^1), · · ·  → p_1
    p_2^(n_1^2), p_2^(n_2^2), p_2^(n_3^2), · · ·  → p_2
    p_3^(n_1^3), p_3^(n_2^3), p_3^(n_3^3), · · ·  → p_3
    · · ·

Now pick the numbers in the diagonal, i.e. set n_k := n_k^k. By construction, the sequence (n_k^2) is a subsequence of (n_k^1), the sequence (n_k^3) is a subsequence of (n_k^2), and so on; hence the diagonal sequence (n_k, k = 1, 2, . . .) is eventually a subsequence of each (n_k^i, k = 1, 2, . . .). Therefore
    p_i^(n_k) → p_i,   as k → ∞, for each i.
(Note that the limit (p_1, p_2, . . .) need not itself be a probability distribution: mass may escape to infinity, and all we can say is that Σ_i p_i ≤ 1.)

The second result is Fatou's lemma, whose proof we omit (see Brémaud, 1999):

Lemma 11. If (x_i^(n), i ∈ N), n = 1, 2, . . ., and (y_i, i ∈ N) are sequences of nonnegative numbers, then
    Σ_{i=1}^∞ y_i lim inf_{n→∞} x_i^(n) ≤ lim inf_{n→∞} Σ_{i=1}^∞ y_i x_i^(n).

Proof of Theorem 19. Let j be a null recurrent state, and suppose that, for some i, the sequence p_ij^(n) does not converge to 0. Then we can find a subsequence (n_k) such that
    lim_{k→∞} p_ij^(n_k) = α_j > 0.
Consider the probability vector (p_ix^(n_k), x ∈ S). By Lemma 10, we can pick a subsequence (n′_k) of (n_k) such that
    lim_{k→∞} p_ix^(n′_k) = α_x,   for all x ∈ S.
Since Σ_{x∈S} p_ix^(n′_k) = 1 for all k, Lemma 11 gives Σ_{x∈S} α_x ≤ 1; and since α_j > 0, we have
    0 < Σ_{x∈S} α_x ≤ 1.
Now use P^{n+1} = P^n P, i.e.
    p_ix^(n′_k+1) = Σ_{y∈S} p_iy^(n′_k) p_yx.
Taking limits along the subsequence and applying Lemma 11 to the right-hand side, we obtain
    α_x ≥ Σ_{y∈S} α_y p_yx,   for all x ∈ S.
We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that for some x_0 ∈ S we have the strict inequality
    α_{x_0} > Σ_{y∈S} α_y p_{y x_0}.
This would imply
    Σ_{x∈S} α_x > Σ_{x∈S} Σ_{y∈S} α_y p_yx = Σ_{y∈S} α_y Σ_{x∈S} p_yx = Σ_{y∈S} α_y,
and this is a contradiction (the number Σ_x α_x cannot be strictly larger than itself). Thus
    α_x = Σ_{y∈S} α_y p_yx,   x ∈ S.
Hence
    π(x) := α_x / Σ_{y∈S} α_y,   x ∈ S,
satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0, because α_j > 0. By Corollary 13, state j is positive recurrent: contradiction.

19.2 The general case

It remains to see what the limit of p_ij^(n) is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case. Recall that, by Corollary 12, if j is a positive recurrent state belonging to a (closed) communicating class C with stationary distribution π^C, then
    π^C(j) = 1/E_j T_j,   where T_j = inf{n ≥ 1 : X_n = j}.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j, and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, and define
    H_C := inf{n ≥ 0 : X_n ∈ C},   φ_C(i) := P_i(H_C < ∞).
Then
    lim_{n→∞} p_ij^(n) = φ_C(i) π^C(j).

Proof. If i belongs to C, then φ_C(i) = 1 and the result follows from Theorem 17. If i ∉ C, we have
    p_ij^(n) = P_i(X_n = j) = P_i(X_n = j, H_C ≤ n) = Σ_{m=1}^n P_i(H_C = m) P_i(X_n = j | H_C = m).
Let µ be the distribution of X_{H_C}, given that X_0 = i and that H_C < ∞. From the Strong Markov Property,
    P_i(X_n = j | H_C = m) = P_µ(X_{n−m} = j).
But Theorem 17 tells us that this limit exists regardless of the initial distribution, which is what we need:
    P_µ(X_{n−m} = j) → π^C(j),   as n − m → ∞.
Hence
    lim_{n→∞} p_ij^(n) = π^C(j) Σ_{m=1}^∞ P_i(H_C = m) = π^C(j) P_i(H_C < ∞) = φ_C(i) π^C(j).

Remark: Since hitting C means, eventually, hitting the recurrent state j, we have φ_C(i) = P_i(T_j < ∞) = f_ij, and we can write the result of Theorem 20 as
    lim_{n→∞} p_ij^(n) = f_ij / E_j T_j.
In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition of Lemma 7. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Diagram: states 1, . . . , 5 with transition probabilities p_11 = 1 − γ, p_12 = γ, p_21 = 1, p_32 = 1 − α, p_34 = α, p_43 = β, p_45 = 1 − β, p_55 = 1.]

The communicating classes are I = {3, 4} (inessential states), C_1 = {1, 2} (closed class), and C_2 = {5} (absorbing state). The stationary distributions associated with the two closed classes are
    π^{C_1}(1) = 1/(1+γ),   π^{C_1}(2) = γ/(1+γ),   π^{C_2}(5) = 1.
Let φ(i) := φ_{C_1}(i) be the probability that class C_1 will be reached starting from state i. We have φ(1) = φ(2) = 1, φ(5) = 0, and, from first-step analysis,
    φ(3) = (1 − α)φ(2) + αφ(4),   φ(4) = (1 − β)φ(5) + βφ(3),
whence
    φ(3) = (1 − α)/(1 − αβ),   φ(4) = β(1 − α)/(1 − αβ).
Both C_1, C_2 are aperiodic classes. Hence, as n → ∞,
    p_31^(n) → φ(3)π^{C_1}(1),  p_32^(n) → φ(3)π^{C_1}(2),  p_33^(n) → 0,  p_34^(n) → 0,  p_35^(n) → (1 − φ(3))π^{C_2}(5),
    p_41^(n) → φ(4)π^{C_1}(1),  p_42^(n) → φ(4)π^{C_1}(2),  p_43^(n) → 0,  p_44^(n) → 0,  p_45^(n) → (1 − φ(4))π^{C_2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it also includes, for example, information about its velocity.) Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. We will avoid giving a precise definition of the term 'ergodic', and only mention what constitutes a result in the area.

Theorem 14 is a type of ergodic theorem, and so is the Law of Large Numbers itself (Theorem 13). We now generalise Theorem 14: recall that there we assumed the existence of a positive recurrent state a, while here mere recurrence will do. Consider the quantity ν^[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X_0 such that P(T_a < ∞) = 1. Let f be a reward function such that
    Σ_{x∈S} |f(x)| ν^[a](x) < ∞.        (20)

Then
    P( lim_{t→∞} Σ_{n=0}^t f(X_n) / Σ_{n=0}^t 1(X_n = a) = f̄ ) = 1,        (21)
where
    f̄ := Σ_{x∈S} f(x) ν^[a](x).

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a; here, a need only be recurrent, and the normalisation is by the number of visits to a rather than by t.

Remark 2: If the chain is irreducible and positive recurrent, then Theorem 14 follows from Theorem 21. Indeed, the denominator Σ_{n=0}^t 1(X_n = a) equals the number N_t of visits to state a up to time t, and, as we saw in the proof of Theorem 14, N_t/t → 1/E_a T_a. So, if we divide the numerator and the denominator of the fraction in (21) by t, the denominator converges to 1/E_a T_a, and hence the numerator converges to f̄/E_a T_a, which is the quantity denoted by f̄ in Theorem 14. This also justifies the terminology of §18.5.

Proof. As in the proof of Theorem 14, we break the total reward as follows:
    Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_t^first + G_t^last,
where G_r = Σ_{T_a^(r) ≤ s < T_a^(r+1)} f(X_s), while G_t^first, G_t^last are the total rewards over the first and last incomplete cycles, respectively. As in the proof of Theorem 14, we have
    lim_{t→∞} (1/t) G_t^first = lim_{t→∞} (1/t) G_t^last = 0,
with probability one; this is left as an exercise. The Law of Large Numbers tells us that
    lim_{n→∞} (1/n) Σ_{r=1}^n G_r = EG_1,
with probability 1, as long as E|G_1| < ∞; but this is precisely the condition (20). We only need to check that EG_1 = f̄, i.e. that EG_1 = Σ_{x∈S} f(x)ν^[a](x), which follows as in Theorem 14. Replacing n by the subsequence N_t (which tends to ∞) and dividing the numerator and the denominator of (21) by N_t, we obtain the result.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Since the number of states is finite, at least one of the states must be visited infinitely often; so there are always recurrent states. Pick a recurrent state a. From Proposition 2 we know that the balance equations
    ν = νP
have at least one solution, namely ν = ν^[a]. Since the number of states is finite, we have
    C := Σ_{i∈S} ν(i) < ∞.

Hence
    π(i) := ν(i)/C,   i ∈ S,
satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least one state i must have π(i) > 0; by Corollary 13, this state must be positive recurrent. In other words, a finite Markov chain always has positive recurrent states. If, in addition, the chain is irreducible, then all states are positive recurrent, because positive recurrence is a class property (Theorem 15); and, by Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent, with a unique stationary distribution π. If, in addition, the chain is aperiodic, then
    P^n → Π,   as n → ∞,
where P is the transition matrix and Π is a matrix all the rows of which are equal to π.

22 Time reversibility

Consider a Markov chain (X_n, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that
    (X_0, X_1, . . . , X_n) has the same distribution as (X_n, X_{n−1}, . . . , X_0), for all n.
Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. It is, in a sense, like playing a film backwards. For example, if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same: we are not able to tell that it is running backwards. But if we film a man who walks, and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X_0, X_1, . . . , X_n) has the same distribution as (X_n, X_{n−1}, . . . , X_0) for all n. In particular, the first components of the two vectors must have the same distribution: X_n has the same distribution as X_0, for all n. Because the process is Markov, this means that it is stationary.
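Theorem 22 is easy to see in action (an illustrative sketch; the matrix is an assumed example of a finite irreducible aperiodic chain): the rows of P^n all converge to the stationary distribution π.

```python
import numpy as np

P = np.array([[0.1, 0.9, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])
print(np.linalg.matrix_power(P, 50))   # all rows (approximately) equal to pi
```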

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_ij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:
    π(i) p_ij = π(j) p_ji,   i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X_0, X_1) has the same distribution as (X_1, X_0), i.e.
    P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j),
for all i and j. But P(X_0 = i, X_1 = j) = π(i)p_ij and P(X_0 = j, X_1 = i) = π(j)p_ji, and this means that π(i)p_ij = π(j)p_ji. So the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. The DBE imply immediately that (X_0, X_1) has the same distribution as (X_1, X_0). For three states, using the DBE twice,
    π(i) p_ij p_jk = [π(j)p_ji] p_jk = [π(j)p_jk] p_ji = π(k) p_kj p_ji,
i.e. P(X_0 = i, X_1 = j, X_2 = k) = P(X_0 = k, X_1 = j, X_2 = i), and hence (X_0, X_1, X_2) has the same distribution as (X_2, X_1, X_0). By induction, the same argument shows that (X_0, X_1, . . . , X_n) has the same distribution as (X_n, X_{n−1}, . . . , X_0), for all n, and the chain is reversible.

Notes: 1) A necessary condition for reversibility is that p_ij > 0 if and only if p_ji > 0: if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible. 2) The DBE are equivalent to saying that the ratio p_ij/p_ji can be written as a product of a function of i and a function of j, i.e. we have separation of variables: p_ij/p_ji = f(i)g(j). (Convention: we form this ratio only when p_ij > 0.)

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_ij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds: for all i_1, i_2, . . . , i_m ∈ S and all m ≥ 3,
    p_{i_1 i_2} p_{i_2 i_3} · · · p_{i_{m−1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m−1}} · · · p_{i_2 i_1}.        (22)
(In words: the probability of traversing any loop equals the probability of traversing it in the opposite direction.)

Proof. Suppose first that the chain is reversible. Pick three states i, j, k and write, using the DBE,
    π(i) p_ij p_jk p_ki = [π(j)p_ji] p_jk p_ki = [π(j)p_jk] p_ji p_ki = [π(k)p_kj] p_ji p_ki
                        = [π(k)p_ki] p_ji p_kj = [π(i)p_ik] p_ji p_kj = π(i) p_ik p_kj p_ji.
Since the chain is irreducible, we have π(i) > 0 for all i. Hence, cancelling the π(i) from the two ends of the display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix two states i_1, i_2 and sum (22) over all choices of the intermediate states i_3, i_4, . . . , i_m. The left-hand side becomes p_{i_1 i_2} p_{i_2 i_1}^(m−1), and the right-hand side becomes p_{i_1 i_2}^(m−1) p_{i_2 i_1}, so
    p_{i_1 i_2} p_{i_2 i_1}^(m−1) = p_{i_1 i_2}^(m−1) p_{i_2 i_1}.
If the chain is aperiodic then, letting m → ∞, we have p_{i_2 i_1}^(m−1) → π(i_1) and p_{i_1 i_2}^(m−1) → π(i_2), implying that
    π(i_1) p_{i_1 i_2} = π(i_2) p_{i_2 i_1},

the detailed balance equations. meaning that it has no loops. and by replacing. Example 1: Chain with a tree-structure. the two oriented edges from i to j and from j to i by a single edge without arrow. And hence π can be found very easily. So g satisfies the balance equations.(Convention: we form this ratio only when pij > 0. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. and. by assumption.) Let A. Form the graph by deleting all self-loops.) Necessarily. for each pair of distinct states i. using another method. the graph becomes disconnected. Pick two distinct states i. The reason is as follows. to guarantee. 60 . f (i)g(i) must be 1 (why?).1) which gives π(i)pij = π(j)pji . j for which pij > 0. (If it didn’t. For example. There is obviously only one arrow from AtoAc . in addition. then the chain is reversible.e. If we remove the link between them. Ac be the sets in the two connected components. A) (see §4. consider the chain: Replace it by: This is a tree. The balance equations are written in the form F (A. and the arrow from j to j is the only one from Ac to A. Hence the chain is reversible. there isn’t any. Ac ) = F (Ac . If the graph thus obtained is a tree. so we have pij g(j) = . • If the chain has infinitely many states. that the chain is positive recurrent. then we need. i. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. j for which pij > 0. pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . Hence the original chain is reversible. namely the arrow from i to j. there would be a loop.

2. i∈S where |E| is the total number of edges. p02 = 1/4. The constant C is found by normalisation. For example. i. where C is a constant.Example 2: Random walk on a graph. j are not neighbours then both sides are 0. 61 . We claim that π(i) = Cδ(i). 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3. a finite set. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). The stationary distribution assigns to a state probability which is proportional to its degree. so both members are equal to C. 0. π(i)pij = π(j)pji . If i. π(i) = 1. States j are i are called neighbours if they are connected by an edge. To see this let us verify that the detailed balance equations hold. There are two cases: If i. Suppose the graph is connected.e. if j is a neighbour of i otherwise. For each state i define its degree δ(i) = number of neighbours of i. Each edge connects a pair of distinct vertices (states) without orientation. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. and a set E of undirected edges. we have that C = 1/2|E|. Hence we conclude that: 1. The stationary distribution of a RWG is very easy to find. In all cases. Consider an undirected graph G with vertices S. i∈S Since δ(i) = 2|E|. etc. Now define the the following transition probabilities: pij = 1/δ(i). A RWG is a reversible chain. the DBE hold.

Define Yr := XT (r) . 1. 2. . 2. k = 0. 2. k = 0.d. n ≥ 0) be Markov with transition probabilities pij . where ξt are i. 62 n ∈ N. St+1 = St + ξt . 23. A r = 1. 1. . . . 2. i.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . 1.e. t ≥ 0. If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). . π is a stationary distribution for the original chain. . The state space of (Yr ) is A. Due to the Strong Markov Property. if. i. . . where pij are the m-step transition probabilities of the original chain. . this process is also a Markov chain and it does have time-homogeneous transition probabilities. 1. how can we create new ones? 23. . k = 0. then it is a stationary distribution for the new chain. random variables with values in N and distribution λ(n) = P (ξ0 = n). t ≥ 0) be independent from (Xn ) with S0 = 0. Let (St . . in general. Let A be (r) a set of states and let TA be the r-th visit to A.e. . . .23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). .i. . irreducible and positive recurrent).e.3 Subordination Let (Xn . in addition.2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i.) has the Markov property but it is not time-homogeneous. k = 0. 2. . if nk = n0 + km. If the subsequence forms an arithmetic progression. In this case.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence. then (Yk . The sequence (Yk . a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . (m) (m) 23.

f is a one-to-one function then (Yn ) is a Markov chain. and the elements of S ′ by α. Yn−1 = αn−1 }. in general. .4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . . however. . that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i). . β). . Indeed. Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). Yn−1 ) = P (Yn+1 = β | Yn = α). . αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 .. . . .e. Proof. If. (Notation: We will denote the elements of S by i. i. is NOT. . We need to show that P (Yn+1 = β | Yn = α. .) More generally. j. α1 . we have: Lemma 13. conditional on Yn the past is independent of the future. n≥0 63 . then the process Yn = f (Xn ). . Fix α0 . . Then (Yn ) has the Markov property. is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0.. β.Then Yt := XSt . knowledge of f (Xn ) implies knowledge of Xn and so. a Markov chain. ∞ λ(n)pij . (n) 23. .

β) P (Yn = α. and if i = 0. To see this. But we will show that P (Yn+1 = β|Xn = i). β ∈ Z+ . if i = 0. β)P (Xn = i | Yn = α. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. Thus (Yn ) is Markov with transition probabilities Q(α. α > 0 Q(α. Yn−1 ) Q(f (i). β). On the other hand. β). β) P (Yn = α|Xn = i. conditional on {Xn = i}. Indeed. a Markov chain with transition probabilities Q(α. if β = 1. then |Xn+1 | = 1 for sure. Yn−1 ) Q(f (i).e. Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β) i∈S P (Yn = α|Xn = i. i. We have that P (Yn+1 = β|Xn = i) = Q(|i|. indeed. Then (Yn ) is a Markov chain. i ∈ Z. 64 . β) := 1. In other words. But P (Yn+1 = β | Yn = α. β). and so (Yn ) is.for all choices of the variables involved. Define Yn := |Xn |. Yn−1 )P (Xn = i | Yn = α. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. P (Xn+1 = i ± 1|Xn = i) = 1/2. β). Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). depends on i only through |i|. α = 0. i ∈ Z. if β = α ± 1. observe that the function | · | is not one-to-one. for all i ∈ Z and all β ∈ Z+ . Yn = α. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. β) P (Yn = α|Yn−1 ) i∈S = Q(α. Example: Consider the symmetric drunkard’s random walk. the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. regardless of whether i > 0 or i < 0. because the probability of going up is the same as the probability of going down. if we let 1/2.

which has a binomial distribution: i j pij = p (1 − p)i−j . . ξ (1) . if 0 is never reached then Xn → ∞. 0 But 0 is an absorbing state. ξ (n) ). we have that (Xn . n ≥ 0) has the Markov property. If we let ξ (n) = ξ1 . p0 = 1 − p. But here is an interesting special case. .i. since ξ (0) . Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is. if the current population is i there is probability pi that everybody gives birth to zero offspring. Let ξk be the number of offspring of the k-th individual in the n-th generation. In other words. j In this case. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . then we see that the above equation is of the form Xn+1 = f (Xn .. . a hard problem. . If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. so we will find some different method to proceed. . It is a Markov chain with time-homogeneous transitions.1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. 0 ≤ j ≤ i. are i. with probability 1. one question of interest is whether the process will become extinct or not.d.24 24. Example: Suppose that p1 = p. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. So assume that p1 = 1. If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. and 0 is always an absorbing state. (and independent of the initial population size X0 –a further assumption). Hence all states i ≥ 1 are inessential. the population cannot grow: it will eventually reach 0. . Back to the general case. We find the population (n) at time n + 1 by adding the number of offspring of each individual. independently from one another. The parent dies immediately upon giving birth. We will not attempt to compute the pij neither draw the transition diagram because they are both futile. ξ2 . Hence all states i ≥ 1 are transient. . Therefore. if the 65 (n) (n) . k = 0. in general. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. . The state 0 is irrelevant because it cannot be reached from anywhere. each individual gives birth to at one child with probability p or has children with probability 1 − p. and. . and following the same distribution. 2. Thus. We let pk be the probability that an individual will bear k offspring. 1. Then P (ξ = 0) + P (ξ ≥ 2) > 0. therefore transient.

This function is of interest because it gives information about the extinction probability ε. when we start with X0 = i. each one behaves independently of one another. . Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. Obviously. Therefore the problem reduces to the study of the branching process starting with X0 = 1. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . ε(0) = 1. The result is this: Proposition 5. Indeed. where ε = P1 (D). If µ ≤ 1 then the ε = 1. We then have that D = D1 ∩ · · · ∩ Di . . If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. and P (Dk ) = ε. Indeed. 66 . For each k = 1. if we start with X0 = i in ital ancestors. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. . we can argue that ε(i) = εi . For i ≥ 1. and. so P (D|X0 = i) = εi . Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}.population does not become extinct then it must grow to infinity. Proof. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. Consider a Galton-Watson process starting with X0 = 1 initial ancestor. since Xn = 0 implies Xm = 0 for all m ≥ n. i. ϕn (0) = P (Xn = 0). Let ε be the extinction probability of the process. We wish to compute the extinction probability ε(i) := Pi (D). .

We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). if the slope µ at z = 1 is ≤ 1.) 67 . But 0 ≤ ε∗ . Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. in fact. But ϕn+1 (0) → ε as n → ∞. ϕn+1 (z) = ϕ(ϕn (z)). On the other hand. z=1 Also. Notice now that µ= d ϕ(z) dz . recall that ϕ is a convex function with ϕ(1) = 1. ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). and so on. if µ > 1. Therefore.Note also that ϕ0 (z) = 1. So ε ≤ ε∗ < 1. By induction. Also ϕn (0) → ε. ϕn+1 (0) = ϕ(ϕn (0)). ϕn (0) ≤ ε∗ . the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. Since both ε and ε∗ satisfy z = ϕ(z). Therefore ε = ϕ(ε). In particular. ϕ(ϕn (0)) → ϕ(ε). This means that ε = 1 (see left figure below). ε = ε∗ . and. (See right figure below. Let us show that. all we need to show is that ε < 1. since ϕ is a continuous function. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). But ϕn (0) → ε. Therefore ε = ε∗ . and since the latter has only two solutions (z = 0 and z = ε∗ ).

Abandoning this idea. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). So. . 1. During this time slot assume there are a new packets that need to be transmitted. Xn+1 = Xn + An . Let s be the number of selected packets. 3 hosts. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. We assume that the An are i. there is a collision and. . 5. for a packet to go through.. there is no time to make a prior query whether a host has something to transmit or not. . Thus. If s ≥ 2. Ethernet has replaced Aloha. on the next times slot. then time slots 0. k ≥ 0. 4. only one of the hosts should be transmitting at a given time. and Sn the number of selected packets. 3.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). 2. say. 3. 1. is that if more than one hosts attempt to transmit a packet.d. there are x + a packets. in a simplified model. 1. then max(Xn + An − 1. we let hosts transmit whenever they like. The problem of communication over a common channel. So if there are. Then ε < 1. if Sn = 1. If s = 1 there is a successful transmission and. using the same frequency.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. Case: µ > 1. 0). Then ε = 1.i. are allocated to hosts 1. If collision happens. 24. on the next times slot. 68 . 2. there are x + a − 1 packets. otherwise. there is collision and the transmission is unsuccessful. 3. then the transmission is not successful and the transmissions must be attempted anew. One obvious solution is to allocate time slots to hosts in a specific order. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. 6. An the number of new packets on the same time slot. periodically. . random variables with some common distribution: P (An = k) = λk . Nowadays. . So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. 2. Since things happen fast in a computer network. .

An = a. but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. and p. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. where q = 1 − p. Xn−1 . So P (Sn = k|Xn = x. we should not subtract 1. So we can make q x G(x) as small as we like by choosing x sufficiently large.x are computed by subtracting the sum of the above from 1.x+k for k ≥ 1.x−1 when x > 0. The probabilities px. for example we can make it ≤ µ/2. To compute px.x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k .x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k .x−1 + k≥1 kpx. so q x G(x) tends to 0 as x → ∞. we think as follows: to have a reduction of x by 1. and so q x tends to 0 much faster than G(x) tends to ∞. Idea of proof.e. n ≥ 0) is a Markov chain. . the Markov chain (Xn ) is transient. Since p > 0. where G(x) is a function of the form α + βx . x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. and exactly one selection: px. Indeed. Let us find its transition probabilities. Therefore ∆(x) ≥ µ/2 for all large x. 69 . conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An .) = x + a k x−k p q .where λk ≥ 0 and k≥0 kλk = 1. Sn = 1|Xn = x) = λ0 xpq x−1 . So: px. To compute px. We here only restrict ourselves to the computation of this expected change. we have that the drift is positive for all large x and this proves transience. The reason we take the maximum with 0 in the above equation is because. . The Aloha protocol is unstable. then the chain is transient. k 0 ≤ k ≤ x + a. we have ∆(x) = (−1)px. Proving this is beyond the scope of the lectures.e. The main result is: Proposition 6. selections are made amongst the Xn + An packets. k ≥ 1. We also let µ := k≥1 kλk be the expectation of An . Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). For x > 0. The process (Xn .x−1 = P (An = 0. An−1 . . Unless µ = 0 (which is a trivial case: no arrivals!). we have q < 1. i. if Xn + An = 0. The random variable Sn has. we must have no new packets arriving.

α ≈ 0.3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. 2. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. Academic departments link to each other by hiring their 70 .2) and let α pij := (1 − α)pij + . to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. So if Xn = i is the current location of the chain. In an Internet-dominated society. then i is a ‘dangling page’. |V | These are new transition probabilities. if j ∈ L(i)  pij = 1/|V |. there are companies that sell high-rank pages. the rank of a page may be very important for a company’s profits. unless i is a dangling page. There is a technique. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. that allows someone to manipulate the PageRank of a web page and make it much higher. otherwise.24. It uses advanced techniques from linear algebra to speed up computations. in the latter case. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. All you need to do to get high rank is pay (a lot of money). Therefore. Then it says: page i has higher rank than j if π(i) > π(j). So this is how Google assigns ranks to pages: It uses an algorithm. unless he gets bored–and this happens with probability α–in which case he clicks on a random page. It is not clear that the chain is irreducible. neither that it is aperiodic. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. Its implementation though is one of the largest matrix computations in the world. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. because of the size of the problem. Define transition probabilities as follows:  1/|L(i)|. For each page i let L(i) be the set of pages it links to. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. We pick α between 0 and 1 (typically. If L(i) = ∅. Comments: 1. PageRank. 3. So we make the following modification. the chain moves to a random page. if L(i) = ∅   0. The idea is simple. known as ‘302 Google jacking’.

from the point of view of an administrator. we will assume that. a dominant (say A) and a recessive (say a). This number could be independent of the research quality but. We will also assume that the parents are replaced by the children. To simplify. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. generation after generation. The process repeats. 71 .g.faculty from each other (and from themselves). at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. One of the circle alleles is selected twice. their offspring inherits an allele from each parent. colour of eyes) appears in two forms. Imagine that we have a population of N individuals who mate at random producing a number of offspring. it makes the evaluation procedure easy. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output.4 The Wright-Fisher model: an example from biology In an organism. We would like to study the genetic diversity. Three of the square alleles are never selected. Thus. a gene controlling a specific characteristic (e. two AA individuals will produce and AA child. all we need is the information outlined in this figure. This means that the external appearance of a characteristic is a function of the pair of alleles. Do the same thing 2N times. maintain this allele for the next generation. When two individuals mate. One of the square alleles is selected 4 times. And so on. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. The alleles in an individual come in pairs and express the individual’s genotype for that gene. 4. this allele is of type A with probability x/2N . But an AA mating with a Aa may produce a child of type AA or one of type Aa. 24. In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). If so. meaning how many characteristics we have. We have a transition from x to y if y alleles of type A were produced. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. We will assume that a child chooses an allele from each parent at random. the genetic diversity depends only on the total number of alleles of each type. then. What is important is that the evaluation can be done without ever looking at the content of one’s research. Since we don’t care to keep track of the individuals’ identities. 5. in each generation.

Let M = 2N . . . the chain gets absorbed by either 0 or 2N . M n P (ξk = 0|Xn . (Here. since.e. . the number of alleles of type A won’t change. X0 ) = 1 − Xn . . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. for all n ≥ 0. i. and the population becomes one consisting of alleles of type a only. the expectation doesn’t change and.e. Tx := inf{n ≥ 1 : Xn = x}.) One can argue that. the random variables ξ1 . Since pxy > 0 for all 1 ≤ x. . we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. Clearly. with n P (ξk = 1|Xn . there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. i. the states {1. ϕ(x) = y=0 pxy ϕ(y). Xn ). ξM are i. . 0 ≤ y ≤ 2N. on the average. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x). .d. . and the population becomes one consisting of alleles of type A only. 72 . Clearly. . but the argument needs further justification because “the end of time” is not a deterministic time. p00 = p2N. . . we see that 2N x 2N −y x y pxy = . we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M .Each allele of type A is selected with probability x/2N . . 2N − 1} form a communicating class of inessential states. M and thus. . .2N = 1. . y ≤ 2N − 1.i. . . ϕ(M ) = 1. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. Similarly. The result is correct. on the average. E(Xn+1 |Xn . as usual. Since selections are independent. so both 0 and 2N are absorbing states. . ϕ(x) = x/M. These are: M ϕ(0) = 0. . since. . The fact that M is even is irrelevant for the mathematical model. conditionally on (X0 . X0 ) = Xn . . . . EXn+1 = EXn Xn = Xn . . X0 ) = E(Xn+1 |Xn ) = M This implies that. Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . n n where. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. M Therefore.

i. 2. and so. M x M (M −1)−(y−1) x x +1− M M M −1 = 24. say. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. If the demand is for Dn components. A number An of new components are ordered and added to the stock. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a.5 A storage or queueing application A storage facility stores. M −1 y−1 x M y−1 1− x . so that Xn+1 = (Xn + ξn ) ∨ 0. Otherwise.d. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. Thus. Here are our assumptions: The pairs of random variables ((An . Xn electronic components at time n. larger than the supply per unit of time. We will show that this Markov chain is positive recurrent. we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. Dn ). We have that (Xn . on the average. . and E(An − Dn ) = −µ < 0. the demand is met instantaneously. the demand is only partially satisfied. b). at time n + 1. provided that enough components are available.The first two are obvious. Let ξn := An − Dn . . and the stock reached level zero at time n + 1. there are Xn + An − Dn components in the stock. We will apply the so-called Loynes’ method. the demand is. 73 . Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . n ≥ 0) are i. for all n ≥ 0. . we will try to solve the recursion. while a number of them are sold to the market. Since the Markov chain is represented by this very simple recursion. Working for n = 1. n ≥ 0) is a Markov chain.

. . Notice. n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. and.d. ξn−1 . . ξn−1 ). Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). In particular. so we can shuffle them in any way we like without altering their joint distribution. ξn−1 ) = (ξn−1 . however. But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). that d (ξ0 . Therefore. The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). . Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. in computing the distribution of Mn which depends on ξ0 .i. we find π(y) = x≥0 π(x)pxy . ξn . for y ≥ 0. we may take the limit of both sides as n → ∞. . . instead of considering them in their original order. = x≥0 Since the n-dependent terms are bounded. . . P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . (n) . . . the random variables (ξn ) are i. . . we reverse it. (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . . we may replace the latter random variables by ξ1 . where the latter equality follows from the fact that. Hence. . Indeed. 74 . using π(y) = limn→∞ P (Mn = y). ξ0 ). the event whose probability we want to compute is a function of (ξ0 . . for all n. . . . . .Thus.

It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. . n→∞ This implies that P ( lim Zn = −∞) = 1. Depending on the actual distribution of ξn . This will be the case if g(y) is not identically equal to 1. as required. The case Eξn = 0 is delicate. . Z2 . 75 . n→∞ Hence g(y) = P (0 ∨ max{Z1 . . . n→∞ But then P ( lim Mn = 0 ∨ max{Z1 .So π satisfies the balance equations. we will use the assumption that Eξn = −µ < 0. Here. . From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. the Markov chain may be transient or null recurrent. We must make sure that it also satisfies y π(y) = 1. .} > y) is not identically equal to 1.} < ∞) = 1. Z2 .

. but. In dimension d = 1. this walk enjoys. j = 1. besides time. then p(1) = p. where C is such that x p(x) = 1. given a function p(x). A random walk possesses an additional property. so far. . p(x) = 0. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. we need to give more structure to the state space S. j = 1. To be able to say this. if x = ej .y+z for any translation z. Define  pj . where p1 + · · · + pd + q1 + · · · + qd = 1. 0. otherwise where p + q = 1.PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. the set of d-dimensional vectors with integer coordinates. we won’t go beyond Zd . A random walk of this type is called simple random walk or nearest neighbour random walk. a structure that enables us to say that px.y = px+z. otherwise. have been processes with timehomogeneous transition probabilities. So.and space-homogeneous transition probabilities given by px. a random walk is a Markov chain with time. If d = 1. . . More general spaces (e. x ∈ Zd .and space-homogeneity. . Example 1: Let p(x) = C2−|x1 |−···−|xd | . This is the drunkard’s walk we’ve seen before. Example 3: Define p(x) = 1 2d . p(−1) = q. This means that the transition probability pxy should depend on x and y only through their relative positions in space. . . Our Markov chains. if x ∈ {±e1 .g. . here. . . we can let S = Zd . d  p(x) = qj . the skip-free property. that of spatial homogeneity. namely. For example. ±ed } otherwise 76 . d   0.y = p(y − x). groups) are allowed. . Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. . if x = −ej .

d. We call it simple symmetric random walk. p′ (x). computing the n-th order convolution for all n is a notoriously hard problem.This is a simple random walk. . notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. in addition. . The special structure of a random walk enables us to give more detailed formulae for its behaviour. random variables in Zd with common distribution P (ξn = x) = p(x). and this is the symmetric drunkard’s walk. If d = 1. Furthermore. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). . Obviously. p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . We will let p∗n be the convolution of p with itself n times. Let us denote by Sn the state of the walk at time n. because all we have done is that we dressed the original problem in more fancy clothes. then p(1) = p(−1) = 1/2. where ξ1 . otherwise. . . p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). which. But this would be nonsense. x ± ed . (When we write a sum without indicating explicitly the range of the summation variable.) In this notation. p ∗ δz = p. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . ξ2 . One could stop here and say that we have a formula for the n-step transition probability. So. .i. Indeed. the initial state S0 is independent of the increments. The convolution of two probabilities p(x). Furthermore. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. are i. . x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. we can reduce several questions to questions about this random walk. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). x + ξ2 = y) = y p(x)p(y − x). We shall resist the temptation to compute and look at other 77 .) Now the operation in the last term is known as convolution. (Recall that δz is a probability distribution that assigns probability 1 to the point z. we will mean a sum over all possible values. p(x) = 0. y p(x)p′ (y − x). We refer to ξn as the increments of the random walk. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . x ∈ Zd . in general.

1}n . and the sequence of signs is {+1. . −1} then we have a path in the plane that joins the points (0. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones. we shall learn some counting methods. 2) → (3. +1}n .methods. . a path of length n. . If not. + (1. +1. −1.i. Here are a couple of meaningful questions: A.d. we should assign. Clearly. 1) → (4. +1}n . to each initial position and to each sequence of n signs. .2) where the ξn are i. If yes. ξn ). As a warm-up. 2 where #A is the number of elements of A.1) + (0. ξn = εn ) = 2−n . we are dealing with a uniform distribution on the sample space {−1. Events A that depend only on these the first n random signs are subsets of this sample space. random variables (random signs) with P (ξn = ±1) = 1/2. Therefore. . In the sequel. 1): (2.i−1 = 1/2 for all i ∈ Z.i+1 = pi. Since there are 2n elements in {0. A ⊂ {−1. . So if the initial position is S0 = 0.0) We can thus think of the elements of the sample space as paths. how long will that take? C. . . So. for fixed n. +1. After all. a random vector with values in {−1. +1}n . When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn .1) + − (5. 2) → (5. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi. 1) → (2. we have P (ξ1 = ε1 . but its practice may be hard. if we can count the number of elements of A we can compute its probability. First. The principle is trivial. consider the following: 78 . Will the random walk ever return to where it started from? B. and so #A P (A) = n . all possible values of the vector are equally likely. Look at the distribution of (ξ1 .1) − (3. 0) → (1.2) (4.

2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26.) The darker region shows all paths that start at (0. the number of paths that start from (m. The paths comprising this event are shown below. x) (n. y)} = n−m . in this case. 0) and ends at (n. (There are 2n of them. 0)). The number of paths that start from (0. y)} =: {all paths that start at (m. Hence (n+i)/2 = 0. (n − m + y − x)/2 (23) Remark. (n. x) and end up at (n. i). 0) and end up at (n.0) The lightly shaded region contains all paths of length n that start at (0. We use the notation {(m. y)}. S6 < 0. So P (A) = 6/26 = 6/128 = 3/64 ≈ 0.Example: Assume that S0 = 0. i). Consider a path that starts from (0. and compute the probability of the event A = {S2 ≥ 0. 0) and n ends at (n. y) is given by: #{(m. i)} = n n+i 2 More generally. We can easily count that there are 6 paths in A. i). Let us first show that Lemma 14. S4 = −1. x) (n. n 79 . S7 < 0}. y) is given by: #{(0.1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. 0) and end up at the specific point (n. x) and end up at (n.047. 0) (n.i) (0. Proof. In if n + i is not an even number then there is no path that starts from (0.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

If n. S2 > 0. Sn ≥ 0 | Sn = y) = 1 − In particular. .P (Sn = y|Sm = x) = In particular. 0) (n. if n is even. n/2 (29) #{(0. Such a simple answer begs for a physical/intuitive explanation. x) 2n−m (n. n 2 #{paths (m. . n 2−n . y)} . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23). y)} . y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . 27. (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. P0 (S1 ≥ 0 . But. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . . we shall just compute it. n (30) Proof. . The left side of (30) equals #{paths (1. for now. 27. Sn > 0 | Sn = y) = y . if n is even. . given that S0 = 0 and Sn = y > 0. P0 (Sn = y) 2 (n + y) + 1 . . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 .3 Some conditional probabilities Lemma 17. . 1) (n. y) that do not touch zero} × 2−n #{paths (0. n This is called the ballot theorem and the reason will be given in Section 29. 0) (n. The probability that. Sn ≥ 0 | Sn = 0) = 84 1 . . the random walk never becomes zero up to time n is given by P0 (S1 > 0. . n−m 2−n−m .2 The ballot theorem Theorem 25. .

. . . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. . . . . . . . Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . P0 (S1 ≥ 0. (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. and Sn−1 > 0 implies that Sn ≥ 0. . We next have P0 (S1 = 0. . . . . . (f ) follows from the replacement of (ξ2 . Sn ≥ 0). n/2 where we used (28) and (29). . P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . . . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. .). ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0. ξ3 . . . . . . ξ2 + ξ3 ≥ 0. Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0. . . . . .4 Some remarkable identities Theorem 26. Sn = 0) = P0 (Sn = 0). Sn > 0) + P0 (S1 < 0. •). . . . (b) follows from symmetry. Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. 85 . . If n is even. 1 + ξ2 + ξ3 > 0. . . . . remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). ξ3 . . . .Proof.) by (ξ1 . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. Sn ≥ 0) = #{(0. . . . Proof. We use (27): P0 (Sn ≥ 0 . Sn ≥ 0. 0) (n. +1 27. . . . For even n we have P0 (S1 ≥ 0. . . . . . . (e) follows from the fact that ξ1 is independent of ξ2 . . Sn ≥ 0) = P0 (S1 = 0. . Sn < 0) (b) (a) = 2P0 (S1 > 0. (c) (d) (e) (f ) (g) where (a) is obvious. . . . . . ξ2 . Sn = 0) = P0 (S1 > 0. .

So 1 f (n) = 2u(n) P0 (S1 > 0. Sn = 0) = 2P0 (Sn−1 > 0. we computed.5 First return to zero In (29). and u(n) = P0 (Sn = 0). By the Markov property at time n − 1. . Sn = 0) Now. Sn = 0) P0 (S1 > 0. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). . Sn = 0 means Sn−1 = 1. . Fix a state a. Then f (n) := P0 (S1 = 0. So u(n) f (n) = . Let n be even. the probability u(n) that the walk (started at 0) attains value 0 in n steps. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . Also. . Theorem 27. . We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. we either have Sn−1 = 1 or −1. P0 (Sn−1 > 0. Sn = 0). If a particle is to the right of a then you see its image on the left. Sn = 0) + P0 (S1 < 0. Sn−1 < 0. . So the last term is 1/2. . for even n. the two terms must be equal. n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. Sn−2 > 0|Sn−1 > 0. Sn = 0) = u(n) . This is not so interesting. . . and if the particle is to the left of a then its mirror image is on the right. So f (n) = 2P0 (S1 > 0. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. . Put a two-sided mirror at a. . f (n) = P0 (S1 > 0. So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. . Since the random walk cannot change sign before becoming zero. . Sn−1 > 0. . with equal probability. . Sn−1 = 0. 86 . By symmetry. we see that Sn−1 > 0. Sn = 0. . To take care of the last term in the last expression for f (n). . n−1 Proof. . . Sn−1 > 0.27. given that Sn = 0. . we can omit the last event. . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). Sn−2 > 0|Sn−1 = 1). . . .

S1 . If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). n ∈ Z+ ). independent of the past before Ta . 2a − Sn . called (Sn . Sn ≥ a). n ∈ Z+ ) has the same law as (Sn . define Sn = where is the first hitting time of a. . In the last event. as a corollary to the above result. P0 (Sn ≥ a) = P0 (Ta ≤ n. But P0 (Ta ≤ n) = P0 (Ta ≤ n. so we may replace Sn by 2a − Sn (Theorem 28). 2 Proof. Sn ≤ a).What is more interesting is this: Run the particle till it hits a (it will. Then Ta = inf{n ≥ 0 : Sn = a} Sn . a + (Sn − a). Proof. Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). m ≥ 0) is a simple symmetric random walk. the process (STa +m − a. Then. So Sn − a can be replaced by a − Sn in the last display and the resulting process. Sn ). we obtain 87 . for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . Let (Sn . Trivially. If Sn ≥ a then Ta ≤ n. 28. n < Ta . Theorem 28. Sn > a) = P0 (Ta ≤ n. 2a − Sn ≥ a) = P0 (Ta ≤ n. . Sn ≤ a) + P0 (Sn > a). Sn ≤ a) + P0 (Ta ≤ n. In other words. Finally. n ≥ Ta By the strong Markov property. we have n ≥ Ta . Sn = Sn . But a symmetric random walk has the same law as its negative. Thus. n ≥ 0) be a simple symmetric random walk and a > 0. The last equality follows from the fact that Sn > a implies Ta ≤ n. . n ≥ Ta . 2 Define now the running maximum Mn = max(S0 . Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a).1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. . n < Ta . observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). So: P0 (Sn ≥ a) = P0 (Ta ≤ n.

Sci. Proof. gives the result. usually a vase furnished with a foot or pedestal. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. as for holding liquids. . 8 88 . . Rend. 436-437. P0 (Mn < x. e 10 Andr´. Bertrand. Compt. Sn = y) = P0 (Tx ≤ n. n = a + b. (1887). 2 Proof. then. Acad. which results into P0 (Tx ≤ n. Sn = y). J. Sn = y) = P0 (Tx > n. 105. Paris. An urn is a vessel. ηn ). Since Mn < x is equivalent to Tx > n. Solution d’un probl`me. during sampling without replacement. ornamental objects. If Tx ≤ n. e e e Paris. Rend. Sn = 2x − y). the number of selected azure items is always ahead of the black ones equals (a − b)/n. Solution directe du probl`me r´solu par M. 105. and anciently for holding ballots to be drawn. as long as a ≥ b. . Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. D. The set of values of η contains n elements and η is uniformly distributed a in this set. 9 Bertrand. we have P0 (Mn < x.Corollary 19. (31) x > y. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). A sample from the urn is a sequence η = (η1 . combined with (31). The answer is very simple: Theorem 31. 29 Urns and the ballot theorem Suppose that an urn8 contains n items. what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Compt. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. Sn = y) = P0 (Sn = 2x − y). . We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. employed for different purposes. If an urn contains a azure items and b = n−a black items. The original ballot problem. (1887). we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. 369. by applying the reflection principle (Theorem 28). of which a are coloured azure and b black. It is the latter use we are interested in in Probability and Statistics. then the probability that. where ηi indicates the colour of the i-th item picked. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . This. Acad. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. Sci. the ashes of the dead after cremation.

. we can translate the original question into a question about the random walk. . but which is not necessarily symmetric. a. letting a. . conditional on the event {Sn = y}. . the sequence (ξ1 . always assuming (without any loss of generality) that a ≥ b = n − a. . . using the definition of conditional probability and Lemma 16. . . εn be elements of {−1. . Proof of Theorem 31.i−1 = q = 1 − p. . It remains to show that all values are equally a likely. . . . n Since y = a − (n − a) = a − b. . Sn > 0 | Sn = y) = . We need to show that. . a we have that a out of the n increments will be +1 and b will be −1. . Since an urn with parameters n. the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. b be defined by a + b = n. An urn can be realised quite easily be a random walk: Theorem 32. indeed. ηn ) an urn process with parameters n. Sn > 0} that the random walk remains positive. we have: P0 (ξ1 = ε1 . . so ‘azure’ items are +1 and ‘black’ items are −1. −n ≤ k ≤ n. But if Sn = y then. +1} such that ε1 + · · · + εn = y.i+1 = p. (The word “asymmetric” will always mean “not necessarily symmetric”. . . n ≥ 0) with values in Z and let y be a positive integer. We have P0 (Sn = k) = n n+k 2 pi. . . So the number of possible values of (ξ1 . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. . . .We shall call (η1 . . . conditional on {Sn = y}. . . n . 89 . . n n −n 2 (n + y)/2 a Thus. where y = a − (n − a) = 2a − n. the sequence (ξ1 . Consider a simple symmetric random walk (Sn . ξn ) is uniformly distributed in a set with n elements. Then. . Then. Let ε1 . . . . conditional on {Sn = y}. i ∈ Z. a − b = y. But in (30) we computed that y P0 (S1 > 0. . a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. Proof. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . Since the ‘azure’ items correspond to + signs (respectively. . The values of each ξi are ±1. . . the ‘black’ items correspond to − signs). . p(n+k)/2 q (n−k)/2 . . ξn ) is.) So pi. ξn = εn ) P0 (Sn = y) 2−n 1 = = . (ξ1 . the result follows. ξn ) has a uniform distribution.

Consider a simple random walk starting from 0. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. E0 sTx = ϕ(s)x . x ∈ Z. x ≥ 0) is also a random walk. y ∈ Z. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. 90 . The rest is a consequence of the strong Markov property. The process (Tx . In view of Lemma 19 we need to know the distribution of T1 .1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}.d. . . Proof. Proof. 2qs Proof. random variables. we make essential use of the skip-free property. 2. We will approach this using probability generating functions. See figure below. . Consider a simple random walk starting from 0. it is no loss of generality to consider the random walk starting from 0. and all t ≥ 0. Consider a simple random walk starting from 0.30.) Then.i. Lemma 19. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. For all x. plus a time T1 required to go from −1 to 0 for the first time. for x > 0. . only the relative position of y with respect to x is needed. Corollary 20. x − 1 in order to reach x. . Theorem 33. Lemma 18. 2. To prove this. Let ϕ(s) := E0 sT1 . If S1 = 1 then T1 = 1. (We tacitly assume that S0 = 0. plus a ′′ further time T1 required to go from 0 to +1 for the first time. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. 1. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . Px (Ty = t) = P0 (Ty−x = t). . namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. This random variable takes values in {0. In view of this. If x > 0.} ∪ {∞}. .

We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . The first time n such that Sn = −1 is the first time n such that −Sn = 1. The formula is obtained by solving the quadratic. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). 91 . Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . Corollary 21.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. in order to hit 1. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. 2ps Proof.  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . Consider a simple random walk starting from 0. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. Hence the previous formula applies with the roles of p and q interchanged. if x < 0 2ps Proof. from the strong Markov property. Consider a simple random walk starting from 0. Corollary 22. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p.

30.2 First return to the origin

Now let us ask: if the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time
$$T_0' := \inf\{n \ge 1 : S_n = 0\}.$$
(Pay attention: it is important to write $n \ge 1$ inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by
$$E_0 s^{T_0'} = 1 - \sqrt{1 - 4pqs^2}.$$

Proof. We again do first-step analysis:
$$E_0 s^{T_0'} = E_0\big[s^{T_0'} 1(S_1 = -1)\big] + E_0\big[s^{T_0'} 1(S_1 = 1)\big] = q\,E_0\big[s^{T_0'} \mid S_1 = -1\big] + p\,E_0\big[s^{T_0'} \mid S_1 = 1\big].$$
But if $S_1 = -1$ then $T_0'$ equals 1 plus the time required for the walk to hit 0 starting from $-1$; the latter is, in distribution, the same as the time required for the walk to hit 1 starting from 0. Similarly, if $S_1 = 1$ then $T_0'$ equals 1 plus a time whose distribution is that of the time required for the walk to hit $-1$ starting from 0. Hence
$$E_0 s^{T_0'} = q\,E_0 s^{1+T_1} + p\,E_0 s^{1+T_{-1}} = qs\,\frac{1 - \sqrt{1 - 4pqs^2}}{2qs} + ps\,\frac{1 - \sqrt{1 - 4pqs^2}}{2ps} = 1 - \sqrt{1 - 4pqs^2}.$$

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin; its probability generating function was computed in Theorem 34. The tool we need here is Taylor's theorem for the function $f(x) = (1+x)^\alpha$, where $\alpha$ is a real number. Recall that
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{k!}.$$
If $\alpha = n$ is a positive integer then, by the binomial theorem,
$$(1+x)^n = \sum_{k=0}^{n} \binom{n}{k} x^k.$$
If we define $\binom{n}{k}$ to be equal to 0 for $k > n$, then we can omit the upper limit in the sum and simply write
$$(1+x)^n = \sum_{k \ge 0} \binom{n}{k} x^k.$$
It turns out that the formula is formally true for general $\alpha$, and it looks exactly the same:
$$(1+x)^\alpha = \sum_{k \ge 0} \binom{\alpha}{k} x^k, \qquad \text{as long as we define} \quad \binom{\alpha}{k} = \frac{\alpha(\alpha-1)\cdots(\alpha-k+1)}{k!}.$$
The adverb 'formally' is supposed to mean that this formula holds for some $x$ around 0, but we won't worry about which ones.

For example, we will show that $(1-x)^{-1} = 1 + x + x^2 + x^3 + \cdots$, which is of course the all too familiar geometric series. Indeed, by definition,
$$\binom{-1}{k} = \frac{(-1)(-2)\cdots(-1-k+1)}{k!} = \frac{(-1)^k k!}{k!} = (-1)^k,$$
so
$$(1-x)^{-1} = \sum_{k \ge 0} \binom{-1}{k}(-x)^k = \sum_{k \ge 0} (-1)^{2k} x^k = \sum_{k \ge 0} x^k,$$
as needed.

As another example, let us compute $\sqrt{1-x} = (1-x)^{1/2} = \sum_{k \ge 0} \binom{1/2}{k}(-1)^k x^k$. We can write this in more familiar symbols if we observe that
$$\binom{1/2}{k}(-1)^k = -\frac{1}{2k-1}\binom{2k}{k} 4^{-k},$$
and this is shown by direct computation. So
$$\sqrt{1-x} = 1 - \sum_{k \ge 1} \frac{1}{2k-1}\binom{2k}{k}\Big(\frac{x}{4}\Big)^k.$$
We apply this to $E_0 s^{T_0'} = 1 - \sqrt{1 - 4pqs^2}$, i.e. with $x = 4pqs^2$:
$$E_0 s^{T_0'} = \sum_{k \ge 1} \frac{1}{2k-1}\binom{2k}{k}(pq)^k s^{2k}.$$
But, by the definition of $E_0 s^{T_0'}$,
$$E_0 s^{T_0'} = \sum_{n \ge 0} P_0(T_0' = n) s^n.$$
Comparing the two 'infinite polynomials' we see that we must have $n = 2k$ (we knew that: the random walk can return to 0 only in an even number of steps) and then
$$P_0(T_0' = 2k) = \frac{1}{2k-1}\binom{2k}{k} p^k q^k = \frac{1}{2k-1} P_0(S_{2k} = 0).$$
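This exact distribution can be tabulated directly. The following minimal sketch (assuming Python 3.8+ for math.comb) checks that the probabilities add up to the return probability $1 - |p - q| = 2(p \wedge q)$ derived below; the truncation of the sum at $k = 300$ is a choice for this illustration, harmless because the terms decay geometrically when $p \ne 1/2$:

```python
from math import comb

p = 0.3
q = 1 - p
# P_0(T_0' = 2k) = C(2k, k) (pq)^k / (2k - 1), tabulated for k = 1, ..., 299
probs = [comb(2 * k, k) * (p * q) ** k / (2 * k - 1) for k in range(1, 300)]
print(round(sum(probs), 8), 2 * min(p, q))   # both should be about 0.6
print(round(probs[0], 4))                    # P_0(T_0' = 2) = 2pq = 0.42
```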

31 Recurrence properties of simple random walks

Consider again a simple random walk $S_n$ with values in $\mathbb{Z}$, with upward probability $p$ and downward probability $q = 1 - p$. By applying the law of large numbers we have that
$$P_0\Big(\lim_{n\to\infty} \frac{S_n}{n} = p - q\Big) = 1.$$
  • If $p > 1/2$ then $p - q > 0$ and so $S_n$ tends to $+\infty$ with probability one. Hence it is transient.
  • If $p < 1/2$ then $p - q < 0$ and so $S_n$ tends to $-\infty$ with probability one. Hence it is again transient.
  • If $p = 1/2$ then $p - q = 0$ and the law of large numbers only tells us that $S_n/n \to 0$, which is inconclusive. We know from Markov chain theory that the walk cannot be positive recurrent; that it is recurrent will follow from Theorem 35 below, and that it is then actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is:
$$P_0(T_x < \infty) = \lim_{s \uparrow 1} E_0 s^{T_x},$$
which is true for any probability generating function (the limit exists by monotonicity). We apply this to (32). The quantity inside the square root becomes $1 - 4pq$. But remember that $q = 1 - p$, so
$$1 - 4pq = 1 - 4p + 4p^2 = (1 - 2p)^2, \qquad \sqrt{1 - 4pq} = |1 - 2p| = |p - q|.$$
So
$$P_0(T_x < \infty) = \begin{cases} \Big(\dfrac{1 - |p-q|}{2q}\Big)^x, & \text{if } x > 0, \\[2mm] \Big(\dfrac{1 - |p-q|}{2p}\Big)^{|x|}, & \text{if } x < 0. \end{cases} \tag{33}$$
This formula can be written more neatly as follows:

Theorem 35. Consider a simple random walk with values in $\mathbb{Z}$. Then
$$P_0(T_x < \infty) = \begin{cases} \Big(\dfrac{p}{q} \wedge 1\Big)^x, & \text{if } x > 0, \\[2mm] \Big(\dfrac{q}{p} \wedge 1\Big)^{|x|}, & \text{if } x < 0. \end{cases}$$
In particular, if $p = q$ then $P_0(T_x < \infty) = 1$ for all $x \in \mathbb{Z}$.

Proof. If $p > q$ then $|p - q| = p - q$, so $1 - |p - q| = 2q$ and (33) gives $P_0(T_x < \infty) = 1$ for $x > 0$ and $(q/p)^{|x|}$ for $x < 0$. If $p < q$ then $|p - q| = q - p$ and, again, we obtain what we are looking for: $(p/q)^x$ for $x > 0$ and $1$ for $x < 0$.

Theorem 36. The simple symmetric random walk with values in $\mathbb{Z}$ is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.
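Theorem 35 can also be checked by simulation. A minimal sketch for the drift-down case (the floor of $-60$ is again an illustrative cut-off: once the walk is that far below $x$, the chance of ever reaching $x$ is negligible):

```python
import random

def hits(x, p, rng, floor=-60):
    """Does a simple random walk with up-probability p < 1/2 ever hit x > 0?"""
    pos = 0
    while pos > floor:
        pos += 1 if rng.random() < p else -1
        if pos == x:
            return True
    return False

p, q, x = 0.4, 0.6, 3
rng = random.Random(1)
freq = sum(hits(x, p, rng) for _ in range(20000)) / 20000
print(round(freq, 3), round((p / q) ** x, 3))  # (2/3)^3 is about 0.296
```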

31.1 Asymptotic distribution of the maximum

Suppose now that $p < 1/2$. Then $\lim_{n\to\infty} S_n = -\infty$ with probability 1, and so there is a largest value that will ever be attained by the random walk. This value is random and is denoted by
$$M_\infty = \sup(S_0, S_1, S_2, \ldots).$$
If we let $M_n = \max(S_0, S_1, \ldots, S_n)$, then the sequence $M_n$ is nondecreasing; every nondecreasing sequence has a limit, so we have $M_\infty = \lim_{n\to\infty} M_n$, and here $M_\infty < \infty$ with probability 1 because $p < 1/2$. What is the distribution of $M_\infty$?

Corollary 23. Consider a simple random walk with values in $\mathbb{Z}$ and assume $p < 1/2$. Then the overall maximum $M_\infty$ has a geometric distribution:
$$P_0(M_\infty \ge x) = (p/q)^x, \quad x \ge 0.$$

Proof. For $x \ge 0$,
$$P_0(M_\infty \ge x) = \lim_{n\to\infty} P_0(M_n \ge x) = \lim_{n\to\infty} P_0(T_x \le n) = P_0(T_x < \infty) = (p/q)^x,$$
by Theorem 35.

31.2 Return probabilities

Only in the symmetric case $p = q = 1/2$ will the random walk return with probability 1 to the point where it started from. Otherwise, with some positive probability, it will never return: unless $p = 1/2$, the return probability is always strictly less than 1. What is this probability?

Theorem 37. Consider a simple random walk with values in $\mathbb{Z}$. Let
$$T_0' = \inf\{n \ge 1 : S_n = 0\}$$
be the first return time to 0. Then
$$P_0(T_0' < \infty) = 2(p \wedge q).$$

Proof. The probability generating function of $T_0'$ was obtained in Theorem 34. But
$$P_0(T_0' < \infty) = \lim_{s \uparrow 1} E_0 s^{T_0'} = 1 - \sqrt{1 - 4pq} = 1 - |p - q|.$$
If $p > q$ this gives $1 - (p - q) = 1 - (1 - q - q) = 2q$. If $p < q$ this gives $1 - (q - p) = 1 - (1 - p - p) = 2p$. If $p = q$ this gives 1. In all cases the answer is $2(p \wedge q)$.

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. We now compute its exact distribution, starting from zero. The total number of visits to state $i$ is given by
$$J_i = \sum_{n=1}^{\infty} 1(S_n = i), \tag{34}$$
a random variable first considered in Section 10, where it was also shown to be geometric. (In Lemma 4 we showed that, if a Markov chain starts from a recurrent state $i$, then $P_i(J_i \ge k) = 1$ for all $k$, meaning that $P_i(J_i = \infty) = 1$.)

Theorem 38. Consider a simple random walk with values in $\mathbb{Z}$. Then, starting from zero, the total number $J_0$ of visits to state 0 is geometrically distributed:
$$P_0(J_0 \ge k) = (2(p \wedge q))^k, \quad k = 0, 1, 2, \ldots, \qquad E_0 J_0 = \frac{2(p \wedge q)}{1 - 2(p \wedge q)}.$$

Proof. $J_0 \ge 1$ if and only if the first return time to zero is finite, so
$$P_0(J_0 \ge 1) = P_0(T_0' < \infty) = 2(p \wedge q),$$
where the last probability was computed in Theorem 37. After each return to 0 the walk starts afresh, by the strong Markov property, and so
$$P_0(J_0 \ge k) = P_0(J_0 \ge 1)^k, \quad k = 0, 1, 2, \ldots$$
This proves the first formula. For the second formula we have
$$E_0 J_0 = \sum_{k=1}^{\infty} P_0(J_0 \ge k) = \sum_{k=1}^{\infty} (2(p \wedge q))^k = \frac{2(p \wedge q)}{1 - 2(p \wedge q)},$$
this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any initial state $i$: $P_i(J_i \ge k) = (2(p \wedge q))^k$ and $E_i J_i = 2(p \wedge q)/(1 - 2(p \wedge q))$.

Remark 2: The formulae work for all values of $p \in [0, 1]$, even for $p = 1/2$ (in which case $P_i(J_i \ge k) = 1$ for all $k$, i.e. $J_i = \infty$ almost surely, as already known).

We now look at the number of visits to some state when we start from a different state.

Theorem 39. Consider a simple random walk with values in $\mathbb{Z}$ and let $x \ne 0$. If $p \le 1/2$ then
$$P_0(J_x \ge k) = (p/q)^{x \vee 0}\,(2p)^{k-1}, \quad k \ge 1, \qquad E_0 J_x = \frac{(p/q)^{x \vee 0}}{1 - 2p} \quad (p < 1/2).$$
If $p \ge 1/2$ then
$$P_0(J_x \ge k) = (q/p)^{(-x) \vee 0}\,(2q)^{k-1}, \quad k \ge 1, \qquad E_0 J_x = \frac{(q/p)^{(-x) \vee 0}}{1 - 2q} \quad (p > 1/2).$$

Proof. Suppose $x > 0$. Then
$$P_0(J_x \ge k) = P_0(T_x < \infty,\ J_x \ge k),$$
simply because if $J_x \ge k \ge 1$ then $T_x < \infty$. Apply now the strong Markov property at $T_x$:
$$P_0(T_x < \infty,\ J_x \ge k) = P_0(T_x < \infty)\, P_x(J_x \ge k - 1).$$
The reason that we replaced $k$ by $k - 1$ is that, given $T_x < \infty$, there has already been one visit to state $x$, and so we need at least $k - 1$ remaining visits in order to have at least $k$ visits in total. The first term was computed in Theorem 35, and the second in Theorem 38 (together with its Remark 1). So we obtain
$$P_0(J_x \ge k) = \Big(\frac{p}{q} \wedge 1\Big)^x (2(p \wedge q))^{k-1}.$$
We repeat the argument for $x < 0$ and find
$$P_0(J_x \ge k) = \Big(\frac{q}{p} \wedge 1\Big)^{|x|} (2(p \wedge q))^{k-1}.$$
Specialising to $p \le 1/2$ and to $p \ge 1/2$ gives the stated formulae. As for the expectation, we apply $E_0 J_x = \sum_{k=1}^{\infty} P_0(J_x \ge k)$ and perform the sum of a geometric series.

Remark 1: The expectation $E_0 J_x$ is finite if $p \ne 1/2$. In other words, if the random walk "drifts down" (i.e. $p < 1/2$) then $E_0 J_x = (p/q)^x/(1 - 2p)$ drops down to zero geometrically fast as $x$ tends to $+\infty$. But for $x < 0$ we have $E_0 J_x = 1/(1 - 2p)$ regardless of $x$: starting from 0, the walk will visit any point below 0 exactly the same number of times on the average.

Remark 2: If the initial state is not 0, then use spatial homogeneity to obtain the analogous formulae:
$$P_y(J_x \ge k) = P_0(J_{x-y} \ge k), \qquad E_y J_x = E_0 J_{x-y}.$$
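A quick numerical sketch of the expected-visits formula, for a drift-down walk: a long but finite horizon is used as a proxy for "all time", which is harmless here since visits to a fixed site essentially stop once the walk has drifted past it.

```python
import random

def visits(x, p, horizon, rng):
    """Count visits to x by a simple random walk over a finite horizon."""
    pos, count = 0, 0
    for _ in range(horizon):
        pos += 1 if rng.random() < p else -1
        count += (pos == x)
    return count

p, q = 0.4, 0.6
rng = random.Random(2)
for x in (-2, 3):
    est = sum(visits(x, p, 5000, rng) for _ in range(2000)) / 2000
    exact = (p / q) ** max(x, 0) / (1 - 2 * p)   # Theorem 39, p < 1/2
    print(x, round(est, 2), round(exact, 2))     # about 5.0 and 1.48
```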

32 Duality

Consider a random walk $S_k = \sum_{i=1}^{k} \xi_i$, i.e. a sum of i.i.d. random variables. Fix an integer $n$. The dual $(\hat S_k,\ 0 \le k \le n)$ of $(S_k,\ 0 \le k \le n)$ is defined by
$$\hat S_k := S_n - S_{n-k}.$$

Theorem 40. $(\hat S_k,\ 0 \le k \le n)$ has the same distribution as $(S_k,\ 0 \le k \le n)$.

Proof. Use the obvious fact that $(\xi_1, \xi_2, \ldots, \xi_n)$ has the same distribution as $(\xi_n, \xi_{n-1}, \ldots, \xi_1)$, because the $\xi_i$ are i.i.d.

Thus, every time we have a probability for $S$ we can translate it into a probability for $\hat S$, which is basically another probability for $S$. Here is an application of this:

Theorem 41. Let $(S_k)$ be a simple symmetric random walk, $x$ a positive integer, and $T_x = \inf\{n \ge 1 : S_n = x\}$ the first time that $x$ will be visited. Then
$$P_0(T_x = n) = \frac{x}{n}\, P_0(S_n = x).$$

Proof. Rewrite the ballot theorem in terms of the dual:
$$P_0(\hat S_k > 0,\ 1 \le k \le n \mid \hat S_n = x) = \frac{x}{n}, \quad x > 0.$$
But $\hat S_k = S_n - S_{n-k}$, so (substituting $j = n - k$) the left-hand side equals
$$\frac{P_0(S_j < S_n,\ 0 \le j \le n-1,\ S_n = x)}{P_0(S_n = x)} = \frac{P_0(S_j < x,\ 0 \le j \le n-1,\ S_n = x)}{P_0(S_n = x)} = \frac{P_0(T_x = n)}{P_0(S_n = x)}.$$

Pause for reflection: we have been dealing with the random times $T_x$ for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for $x > 0$, $T_x$ is the sum of $x$ i.i.d. random variables, each distributed as $T_1$, and thus derived the generating function of $T_x$:
$$E_0 s^{T_x} = \Big(\frac{1 - \sqrt{1 - 4pqs^2}}{2qs}\Big)^x. \tag{35}$$
Specialising to the symmetric case we have
$$E_0 s^{T_x} = \Big(\frac{1 - \sqrt{1 - s^2}}{s}\Big)^x.$$
In Theorem 29 we used the reflection principle and derived the distribution function of $T_x$ for a simple symmetric random walk:
$$P_0(T_x \le n) = P_0(|S_n| \ge x) - \tfrac{1}{2} P_0(|S_n| = x). \tag{36}$$
Lastly, in Theorem 41, we derived, using duality, the probabilities
$$P_0(T_x = n) = \frac{x}{n}\, P_0(S_n = x), \quad x > 0. \tag{37}$$

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of sn in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
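Numerically, however, the check is immediate. The sketch below computes $P_0(T_x = n)$ for the symmetric walk by exact dynamic programming over paths that avoid $x$ before time $n$, and compares with (37); the function name is illustrative.

```python
from math import comb

def hitting_time_prob(x, n):
    """P_0(T_x = n) for a SSRW, by exact dynamic programming."""
    dist = {0: 1.0}                 # mass of paths that have not yet hit x
    for m in range(1, n + 1):
        new = {}
        for k, pr in dist.items():
            for k2 in (k - 1, k + 1):
                new[k2] = new.get(k2, 0.0) + 0.5 * pr
        hit = new.pop(x, 0.0)       # mass arriving at x for the first time at m
        if m == n:
            return hit
        dist = new
    return 0.0

x, n = 3, 11                         # n and x must have the same parity
lhs = hitting_time_prob(x, n)
rhs = (x / n) * comb(n, (n + x) // 2) / 2 ** n   # (x/n) P_0(S_n = x)
print(round(lhs, 6), round(rhs, 6))  # both about 0.043945
```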

33 Amount of time that a SSRW is positive*

Consider a SSRW $(S_n)$ starting from 0. Define its trajectory as the polygonal line obtained by joining the points $(n, S_n)$ in the plane, as described earlier, and watch this trajectory up to time $2n$. Let $T_+(2n)$ be the number of sides of the polygonal line that are above the horizontal axis and $T_-(2n) = 2n - T_+(2n)$ the number of sides that are below. Clearly, $T_\pm(2n)$ are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of $T_+(2n)$ as the number of leads in $2n$ coin tosses, where we have a lead at some time if the number of heads till that time exceeds the number of tails.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time $T_0'$, it behaves again like a SSRW and is independent of the past.]

We are dealing with a fair coin which is tossed $2n$ times, so the event that the number of heads equals $k$ has maximum probability when $k = n$ (heads = tails). Most people take this statement and translate it into a statement about $T_+(2n)$, arguing that $P(T_+(2n) = m)$ should be maximised when $m = n$, i.e. when $T_+(2n) = T_-(2n)$. This is wrong. In fact, as the theorem below shows, $P(T_+(2n) = m)$ is largest when $m = 0$ or $m = 2n$.

Theorem 42 (distribution of the total time the random walk is positive).
$$P(T_+(2n) = 2k) = P(S_{2k} = 0)\, P(S_{2n-2k} = 0), \quad 0 \le k \le n.$$

Proof. The idea is to condition on the first time $T_0'$ that the RW returns to 0:
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(T_+(2n) = 2k,\ T_0' = 2r).$$
Between 0 and $T_0'$ the random walk is either above the horizontal axis or below; call the first event $A_+$ and the second $A_-$ and write
$$P(T_+(2n) = 2k) = \sum_{r=1}^{n} P(A_+, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_+, T_0' = 2r) + \sum_{r=1}^{n} P(A_-, T_0' = 2r)\, P(T_+(2n) = 2k \mid A_-, T_0' = 2r).$$
In the first case,
$$P(T_+(2n) = 2k \mid A_+, T_0' = 2r) = P(T_+(2n - 2r) = 2k - 2r),$$
because, if $T_0' = 2r$ and $A_+$ occurs, then the RW has already spent $2r$ units of time above the axis and so $T_+(2n)$ is reduced by $2r$; furthermore, we may remove the conditioning because the future evolution of the RW after $T_0'$ is independent of the past. In the second case, a similar argument shows
$$P(T_+(2n) = 2k \mid A_-, T_0' = 2r) = P(T_+(2n - 2r) = 2k).$$
We also observe that, by symmetry,
$$P(A_+, T_0' = 2r) = P(A_-, T_0' = 2r) = \tfrac{1}{2}\, P(T_0' = 2r) = \tfrac{1}{2} f_{2r}.$$
Letting $p_{2k,2n} = P(T_+(2n) = 2k)$, we have shown that
$$p_{2k,2n} = \frac{1}{2}\sum_{r=1}^{k} f_{2r}\, p_{2k-2r,\,2n-2r} + \frac{1}{2}\sum_{r=1}^{n-k} f_{2r}\, p_{2k,\,2n-2r}.$$
We want to show that $p_{2k,2n} = u_{2k}\, u_{2n-2k}$, where $u_{2k} = P(S_{2k} = 0)$. With this hypothesis, we can start playing with the recursion of the display above and, using induction on $n$, we will realise that
$$p_{2k,2n} = \frac{1}{2}\, u_{2n-2k} \sum_{r=1}^{k} f_{2r}\, u_{2k-2r} + \frac{1}{2}\, u_{2k} \sum_{r=1}^{n-k} f_{2r}\, u_{2n-2k-2r}.$$
Each inner sum collapses by the first-return decomposition $u_{2m} = \sum_{r=1}^{m} f_{2r}\, u_{2m-2r}$, giving
$$p_{2k,2n} = \tfrac{1}{2}\, u_{2n-2k}\, u_{2k} + \tfrac{1}{2}\, u_{2k}\, u_{2n-2k} = u_{2k}\, u_{2n-2k},$$
as claimed. Explicitly, we have
$$P(T_+(2n) = 2k) = \binom{2k}{k}\binom{2n-2k}{n-k} 2^{-2n}, \quad 0 \le k \le n. \tag{38}$$

With $2n = 100$, we plot the function $P(T_+(100) = 2k)$ for $0 \le k \le 50$ below. Notice that it is maximised at the ends of the interval.
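The figure can be reproduced directly from (38); a minimal sketch (assuming Python 3.8+ for math.comb):

```python
from math import comb

n = 50  # 2n = 100
probs = [comb(2 * k, k) * comb(2 * (n - k), n - k) / 4 ** n for k in range(n + 1)]
print(round(sum(probs), 10))   # 1.0: (38) is indeed a probability distribution
# U-shape: largest at the two ends, smallest in the middle
print(round(probs[0], 4), round(probs[25], 4), round(probs[50], 4))
```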

[Figure: plot of $P(T_+(100) = 2k)$ against $k$, $0 \le k \le 50$; the values range between about 0.01 and 0.08, and the curve is U-shaped, largest at $k = 0$ and $k = 50$.]

The arcsine law*
Use of Stirling's approximation in formula (38) gives
$$P(T_+(2n) = 2k) \sim \frac{1/\pi}{\sqrt{k(n-k)}}, \quad \text{as } k \to \infty \text{ and } n - k \to \infty.$$
Now consider the calculation of the following probability:
$$P\Big(\frac{T_+(2n)}{2n} \in [a, b]\Big) = \sum_{a \le k/n \le b} P\Big(\frac{T_+(2n)}{2n} = \frac{k}{n}\Big) \sim \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{k(n-k)}} = \sum_{a \le k/n \le b} \frac{1/\pi}{\sqrt{\frac{k}{n}\big(1 - \frac{k}{n}\big)}}\cdot\frac{1}{n}.$$
This is easily seen to have a limit as $n \to \infty$, the limit being
$$\int_a^b \frac{1/\pi}{\sqrt{t(1-t)}}\, dt,$$
because the last sum is simply a Riemann approximation to this integral. This shows the celebrated arcsine law:
$$\lim_{n\to\infty} P\Big(\frac{T_+(2n)}{2n} \le x\Big) = \int_0^x \frac{1/\pi}{\sqrt{t(1-t)}}\, dt = \frac{2}{\pi}\arcsin\sqrt{x}.$$
We thus have that the limiting distribution of $T_+(2n)/2n$ (the fraction of time that the RW is positive), as $n \to \infty$, is the distribution function $\frac{2}{\pi}\arcsin\sqrt{x}$, $0 < x < 1$. This is known as the arcsine law. Its density,
$$\frac{1/\pi}{\sqrt{x(1-x)}}, \quad 0 < x < 1,$$
has a plot which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times $(T_x,\ x \ge 0)$. This is another random walk, because the summands in $T_x = \sum_{y=1}^{x}(T_y - T_{y-1})$ are i.i.d., all distributed as $T_1$, and
$$P(T_1 \ge 2n) = P(S_{2n} = 0) = \binom{2n}{n} 2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$
From this, we immediately get that the expectation of $T_1$ is infinite: $E T_1 = \infty$. Many other quantities can be approximated in a similar manner.
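The arcsine law is striking enough to be worth seeing empirically. A minimal Monte Carlo sketch (the edge-counting convention matches the definition of $T_+$ above: an edge out of 0 is counted as positive exactly when it steps up):

```python
import random, math

def frac_positive(n2, rng):
    """Fraction of the first n2 edges of a SSRW trajectory lying above the axis."""
    pos, above = 0, 0
    for _ in range(n2):
        step = 1 if rng.random() < 0.5 else -1
        above += (pos > 0 or (pos == 0 and step == 1))
        pos += step
    return above / n2

rng = random.Random(3)
xs = sorted(frac_positive(1000, rng) for _ in range(5000))
for u in (0.1, 0.5, 0.9):
    emp = sum(x <= u for x in xs) / len(xs)
    print(u, round(emp, 3), round(2 / math.pi * math.asin(math.sqrt(u)), 3))
```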

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates $(x, y)$ where $x, y \in \mathbb{Z}$. He starts at $(0, 0)$ and, being totally drunk, he moves east, west, north or south, with equal probability $(1/4)$. When he reaches a corner he moves again in the same manner, going at random east, west, north or south.

We describe this motion as a random walk in $\mathbb{Z}^2 = \mathbb{Z} \times \mathbb{Z}$: let $\xi_1, \xi_2, \ldots$ be i.i.d. random vectors such that
$$P(\xi_n = (0,1)) = P(\xi_n = (0,-1)) = P(\xi_n = (1,0)) = P(\xi_n = (-1,0)) = 1/4.$$
We then let $S_0 = (0,0)$ and $S_n = \sum_{i=1}^{n}\xi_i$, $n \ge 1$. If $x \in \mathbb{Z}^2$, we let $x^1$ be its horizontal coordinate and $x^2$ its vertical one, so $S_n = (S_n^1, S_n^2)$ with $S_n^1 = \sum_{i=1}^{n}\xi_i^1$ and $S_n^2 = \sum_{i=1}^{n}\xi_i^2$.

Lemma 20. $(S_n^1,\ n \in \mathbb{Z}_+)$ and $(S_n^2,\ n \in \mathbb{Z}_+)$ are random walks; the two random walks are not independent.

Proof. Since $\xi_1, \xi_2, \ldots$ are independent, so are $\xi_1^1, \xi_2^1, \ldots$, the latter being functions of the former. They are also identically distributed, with
$$P(\xi_n^1 = 0) = P(\xi_n = (0,1)) + P(\xi_n = (0,-1)) = 1/2, \quad P(\xi_n^1 = 1) = P(\xi_n = (1,0)) = 1/4, \quad P(\xi_n^1 = -1) = P(\xi_n = (-1,0)) = 1/4.$$
Hence $(S_n^1,\ n \in \mathbb{Z}_+)$ is a sum of i.i.d. random variables and hence a random walk; it is not a simple random walk because there is a possibility that the increment takes the value 0. The same argument applies to $(S_n^2,\ n \in \mathbb{Z}_+)$, showing that it, too, is a random walk. However, the two random walks are not independent: for each $n$, the random variables $\xi_n^1, \xi_n^2$ are not independent, simply because if one takes value 0 the other is nonzero.

Consider now a change of coordinates: rotate the axes by $45^\circ$ (and scale them), i.e. map $(x^1, x^2)$ into $(x^1 + x^2,\ x^1 - x^2)$. So let
$$R_n = (R_n^1, R_n^2) := (S_n^1 + S_n^2,\ S_n^1 - S_n^2).$$

Lemma 21. $(R_n,\ n \in \mathbb{Z}_+)$ is a random walk in two dimensions.

Proof. We have $R_n = \sum_{i=1}^{n}\eta_i$, where $\eta_n := (\eta_n^1, \eta_n^2) = (\xi_n^1 + \xi_n^2,\ \xi_n^1 - \xi_n^2)$. The sequence of vectors $\eta_1, \eta_2, \ldots$ is i.i.d., each being obtained by taking the same function on the elements of the sequence $\xi_1, \xi_2, \ldots$.

Lemma 22. Each of $(R_n^1,\ n \in \mathbb{Z}_+)$ and $(R_n^2,\ n \in \mathbb{Z}_+)$ is a simple symmetric random walk in one dimension, and the two are independent.

Proof. The possible values of each of $\eta_n^1, \eta_n^2$ are $\pm 1$. We have
$$P(\eta_n^1 = 1,\ \eta_n^2 = 1) = P(\xi_n = (1,0)) = 1/4,$$
$$P(\eta_n^1 = 1,\ \eta_n^2 = -1) = P(\xi_n = (0,1)) = 1/4,$$
$$P(\eta_n^1 = -1,\ \eta_n^2 = 1) = P(\xi_n = (0,-1)) = 1/4,$$
$$P(\eta_n^1 = -1,\ \eta_n^2 = -1) = P(\xi_n = (-1,0)) = 1/4.$$
Hence $P(\eta_n^1 = 1) = P(\eta_n^1 = -1) = 1/2$, and the same for $\eta_n^2$. And
$$P(\eta_n^1 = \varepsilon,\ \eta_n^2 = \varepsilon') = 1/4 = P(\eta_n^1 = \varepsilon)\,P(\eta_n^2 = \varepsilon') \quad \text{for all } \varepsilon, \varepsilon' \in \{-1, +1\},$$
so $\eta_n^1, \eta_n^2$ are independent for each $n$. Hence the two one-dimensional random walks are independent.

Since the map $(x^1, x^2) \mapsto (x^1 + x^2,\ x^1 - x^2)$ is one-to-one and maps 0 to 0, we have $\{S_{2n} = 0\} = \{R_{2n} = 0\}$, and so
$$P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0,\ R_{2n}^2 = 0) = P(R_{2n}^1 = 0)\,P(R_{2n}^2 = 0) = \Big(\binom{2n}{n} 2^{-2n}\Big)^2,$$
where the last step uses the independence of the two random walks. By Stirling's approximation,
$$P(S_{2n} = 0) \sim \frac{c}{n}, \quad \text{as } n \to \infty,$$
where $c$ is a positive constant.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that $\sum_{n=0}^{\infty} P(S_n = 0) = \infty$. But, by the previous computation, $P(S_{2n} = 0) \sim c/n$, and the harmonic series diverges.
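The identity $P(S_{2n} = 0) = \big(\binom{2n}{n}2^{-2n}\big)^2$ from Lemma 22 is easy to confirm by simulation; a minimal sketch:

```python
import random
from math import comb

def back_at_origin(n2, rng):
    """One 2-D SSRW run of n2 steps; True if it ends at the origin."""
    x = y = 0
    for _ in range(n2):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x += dx; y += dy
    return x == 0 and y == 0

n = 4
rng = random.Random(4)
emp = sum(back_at_origin(2 * n, rng) for _ in range(200000)) / 200000
exact = (comb(2 * n, n) / 4 ** n) ** 2
print(round(emp, 4), round(exact, 4))  # both about 0.0748
```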

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let $a < 0 < b$. Recall, from Theorem 7, that for a simple symmetric random walk (started from $S_0 = 0$),
$$P(T_a < T_b) = \frac{b}{b-a}, \qquad P(T_b < T_a) = \frac{-a}{b-a}.$$
We can read this as follows:
$$P(S_{T_a \wedge T_b} = a) = \frac{b}{b-a}, \qquad P(S_{T_a \wedge T_b} = b) = \frac{-a}{b-a}.$$
This last display tells us the distribution of $S_{T_a \wedge T_b}$; note, in particular, that $E S_{T_a \wedge T_b} = 0$. This means that we can simulate a zero-mean 2-point distribution on $\{a, b\}$ by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval $[a, b]$: the probability that it exits from the left point is $p := b/(b-a)$, and the probability that it exits from the right point is $1 - p$. Conversely, given a 2-point probability distribution $(p, 1-p)$ such that $p$ is rational, say $p = M/N$ with $M < N$, $M, N \in \mathbb{N}$, we can always find integers $a < 0 < b$ realising it: choose, e.g., $a = M - N$ and $b = M$, so that $b/(b-a) = M/N = p$.

How about more complicated distributions? Suppose we are given a probability distribution $(p_i,\ i \in \mathbb{Z})$ with zero mean: $\sum_{i \in \mathbb{Z}} i p_i = 0$. Can we find a stopping time $T$ such that $S_T$ has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let $(S_n)$ be a simple symmetric random walk and $(p_i,\ i \in \mathbb{Z})$ a given probability distribution on $\mathbb{Z}$ with zero mean: $\sum_i i p_i = 0$. Then there exists a stopping time $T$ such that
$$P(S_T = i) = p_i, \quad i \in \mathbb{Z}.$$

Proof. Define a pair of random variables $(A, B)$, independent of $(S_n)$, taking values in $\{(0,0)\} \cup \{(i,j) : i < 0 < j\}$, with the following distribution:
$$P(A = 0,\ B = 0) = p_0, \qquad P(A = i,\ B = j) = c^{-1}(j - i)\, p_i p_j, \quad i < 0 < j,$$
where $c$ is chosen by normalisation:
$$P(A = 0,\ B = 0) + \sum_{i<0}\sum_{j>0} P(A = i,\ B = j) = 1.$$
This leads to
$$c^{-1}\sum_{j>0} j p_j \sum_{i<0} p_i + c^{-1}\sum_{i<0}(-i) p_i \sum_{j>0} p_j = 1 - p_0.$$
Since $\sum_{i \in \mathbb{Z}} i p_i = 0$, we have $\sum_{i<0}(-i) p_i = \sum_{i>0} i p_i$, and, using this, we get that $c$ is precisely this common value:
$$c = \sum_{i<0}(-i) p_i = \sum_{i>0} i p_i.$$
Letting $T_i = \inf\{n \ge 0 : S_n = i\}$, $i \in \mathbb{Z}$, we consider the random variable
$$Z = S_{T_A \wedge T_B}.$$
The claim is that $P(Z = k) = p_k$, $k \in \mathbb{Z}$. To see this, suppose first $k = 0$. Then $Z = 0$ if and only if $A = B = 0$, and this has probability $p_0$, by definition. Suppose next $k > 0$. Then
$$P(Z = k) = \sum_{i<0}\sum_{j>0} P(S_{T_i \wedge T_j} = k)\,P(A = i,\ B = j) = \sum_{i<0} P(S_{T_i \wedge T_k} = k)\,P(A = i,\ B = k),$$
since the exit point can be $k$ only if $B = k$. Hence
$$P(Z = k) = \sum_{i<0} \frac{-i}{k-i}\, c^{-1}(k - i)\, p_i p_k = c^{-1}\sum_{i<0}(-i) p_i\; p_k = p_k.$$
Similarly, $P(Z = k) = p_k$ for $k < 0$. So $T := T_A \wedge T_B$ does the job.
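The construction in the proof is directly implementable. A minimal sketch (the target distribution is an arbitrary zero-mean example chosen for this illustration):

```python
import random

def skorokhod_sample(p, rng):
    """Draw one sample of S_T as in Theorem 43; p maps i -> p_i, zero mean."""
    pairs, weights = [], []
    if 0 in p:
        pairs.append((0, 0)); weights.append(p[0])
    c = sum(j * pj for j, pj in p.items() if j > 0)   # c = sum_{j>0} j p_j
    for i, pi in p.items():
        for j, pj in p.items():
            if i < 0 < j:
                pairs.append((i, j)); weights.append((j - i) * pi * pj / c)
    a, b = rng.choices(pairs, weights)[0]
    if (a, b) == (0, 0):
        return 0
    pos = 0
    while pos != a and pos != b:                       # run the SSRW until it exits (a, b)
        pos += 1 if rng.random() < 0.5 else -1
    return pos

rng = random.Random(5)
target = {-2: 0.1, -1: 0.2, 0: 0.3, 1: 0.4}            # mean: -0.2 - 0.2 + 0.4 = 0
draws = [skorokhod_sample(target, rng) for _ in range(50000)]
for i in sorted(target):
    print(i, round(draws.count(i) / len(draws), 3), target[i])
```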

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set $\mathbb{N}$ of natural numbers is the set of positive integers:
$$\mathbb{N} = \{1, 2, 3, \ldots\}.$$
The set $\mathbb{Z}$ of integers consists of $\mathbb{N}$, their negatives and zero:
$$\mathbb{Z} = \{\cdots, -3, -2, -1, 0, 1, 2, 3, \cdots\}.$$
Given integers $a, b$, with $b \ne 0$, we say that $b$ divides $a$ if there is an integer $k$ such that $a = kb$; we write this by $b \mid a$. If $a \mid b$ and $b \mid a$ then $|a| = |b|$. If $c \mid b$ and $b \mid a$ then $c \mid a$.

Given integers $a, b$, with at least one of them nonzero, we can define their greatest common divisor $d = \gcd(a, b)$ as the unique positive integer such that (i) $d \mid a$ and $d \mid b$, and (ii) if $c \mid a$ and $c \mid b$, then $c \mid d$. (That it is unique is obvious from the fact that if $d_1, d_2$ are two such numbers then $d_1$ would divide $d_2$ and vice-versa, leading to $d_1 = d_2$.)

Notice that
$$\gcd(a, b) = \gcd(a, b - a).$$
Indeed, let $d = \gcd(a, b)$. But $d \mid a$ and $d \mid b$, therefore $d \mid b - a$; so (i) holds for the pair $(a, b-a)$. Now suppose that $c \mid a$ and $c \mid b - a$. Then $c \mid a + (b - a)$, i.e. $c \mid b$. Because $c \mid a$ and $c \mid b$, we have $c \mid d$. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: suppose $0 < a < b$ are integers. Replace $b$ by $b - a$. If $b - a > a$, replace $b - a$ by $b - 2a$. Keep doing this until you obtain $b - qa < a$; then interchange the roles of the numbers and continue. For example,
$$\gcd(120, 38) = \gcd(82, 38) = \gcd(44, 38) = \gcd(6, 38) = \gcd(6, 32) = \gcd(6, 26) = \gcd(6, 20) = \gcd(6, 14) = \gcd(6, 8) = \gcd(6, 2) = \gcd(4, 2) = \gcd(2, 2) = 2.$$
But in the first line we subtracted 38 exactly 3 times, and 3 is the maximum number of times that 38 "fits" into 120 (indeed, $4 \times 38 > 120$). So we work faster if we replace, at each stage, the largest of the two numbers by the remainder of the division of the largest with the smallest, i.e. if we also invoke Euclid's algorithm (the process of long division that one learns in elementary school):
$$\gcd(120, 38) = \gcd(6, 38) = \gcd(6, 2) = 2.$$

The theorem of division says the following: given two positive integers $a, b$, we can find an integer $q$ (the quotient) and an integer $r \in \{0, 1, \ldots, b-1\}$ (the remainder) such that
$$a = qb + r.$$
The long division algorithm is a process for finding the quotient and the remainder. For example, dividing 3450 by 26 gives $q = 132$ and $r = 18$, i.e. $3450 = 132 \times 26 + 18$.

Here are some other facts about the greatest common divisor. First, if $d = \gcd(a, b)$ then there are integers $x, y$ such that $d = xa + yb$. To see why this is true, let us follow the previous example again. In computing $\gcd(120, 38)$ we used the divisions
$$120 = 3 \times 38 + 6, \qquad 38 = 6 \times 6 + 2.$$
Working backwards, we have
$$2 = 38 - 6 \times 6 = 38 - 6 \times (120 - 3 \times 38) = 19 \times 38 - 6 \times 120.$$
Thus $\gcd(38, 120) = 38x + 120y$ with $x = 19$, $y = -6$. The same logic works in general.

Second, we can show that, for integers $a, b$ not both zero,
$$\gcd(a, b) = \min\{xa + yb : x, y \in \mathbb{Z},\ xa + yb > 0\}.$$
To see this, let $d$ be the right-hand side. Suppose that $c \mid a$ and $c \mid b$. Then $c$ divides any linear combination of $a, b$; in particular, it divides $d$. This shows (ii) of the definition of a greatest common divisor. We need to show (i), that $d \mid a$ and $d \mid b$. Suppose this is not true, say that $d$ does not divide $b$. Write $d = xa + yb$ for some $x, y \in \mathbb{Z}$. Since $d > 0$, we can divide $b$ by $d$ to find $b = qd + r$, for some $1 \le r \le d - 1$. But then
$$r = b - qd = b - q(xa + yb) = (-qx)a + (1 - qy)b,$$
so $r$ is expressible as a positive linear combination of $a$ and $b$ and is strictly smaller than $d$ (being a remainder), contradicting the minimality of $d$. Hence $d \mid b$, and similarly $d \mid a$.

The greatest common divisor of 3 integers can now be defined in a similar manner; in general, the greatest common divisor of any subset of the integers also makes sense.
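Both the algorithm and the "working backwards" step are short programs. A minimal sketch:

```python
def gcd(a, b):
    """Euclid's algorithm: repeatedly replace the pair by (smaller, remainder)."""
    while b:
        a, b = b, a % b
    return a

def extended_gcd(a, b):
    """Return (d, x, y) with d = gcd(a, b) = x*a + y*b (the backwards step)."""
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    return d, y, x - (a // b) * y

print(gcd(120, 38))            # 2
print(extended_gcd(120, 38))   # (2, -6, 19): indeed 2 = -6*120 + 19*38
```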

Relations

A relation on a set $A$ is a mathematical notion that establishes a way of pairing elements of $A$. In general, a relation can be thought of as a set of pairs of elements of the set $A$. For example, the relation $<$ between real numbers can be thought of as consisting of pairs of numbers $(x, y)$ such that $x$ is smaller than $y$.

A relation is called reflexive if it contains all pairs $(x, x)$. The equality relation $=$ is reflexive; the relation $<$ is not. A relation is called symmetric if it cannot contain a pair $(x, y)$ without containing its symmetric $(y, x)$. The equality relation is symmetric; the relation $<$ is not. If we take $A$ to be the set of students in a classroom, then the relation "$x$ is a friend of $y$" is symmetric; it can be thought of as reflexive (unless we do not want to accept the fact that one may be a friend of oneself), but it is not necessarily transitive. A relation is transitive if, whenever it contains $(x, y)$ and $(y, z)$, it also contains $(x, z)$. Both $<$ and $=$ are transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of $x$ contains all $y$ that are related (are equivalent) to $x$. We encountered the equivalence relation "$i$ communicates with $j$" in Markov chain theory.

Functions

We denote a function $f$ from a set $X$ into a set $Y$ by $f : X \to Y$. The range of $f$ is the set of its values, i.e. the set $\{f(x) : x \in X\}$. A function is called one-to-one (or injective) if two different $x_1, x_2 \in X$ have different values $f(x_1), f(x_2)$. A function is called onto (or surjective) if its range is $Y$. If $B \subset Y$ then $f^{-1}(B)$ consists of all $x \in X$ with $f(x) \in B$. The set of functions from $X$ to $Y$ is denoted by $Y^X$.

Counting

Let $A, B$ be finite sets with $n$, $m$ elements, respectively. Incidentally, we may think of the elements of $A$ as animals and those of $B$ as names; a function is then an assignment of a name to each animal. (The same name may be used for different animals.) Each animal can be assigned any of the $m$ available names, so the total number of assignments is
$$\underbrace{m \times m \times \cdots \times m}_{n \text{ times}} = m^n.$$
Hence the number of functions from $A$ to $B$ (i.e. the cardinality of the set $B^A$) is $m^n$.

Suppose now we are not allowed to give the same name to different animals: each assignment of this type is a one-to-one function from $A$ to $B$. To be able to do this, we need more names than animals: $n \le m$. The name of the first animal can be chosen in $m$ ways, that of the second in $m - 1$ ways, and so on. Thus the total number of one-to-one functions from $A$ to $B$ is
$$(m)_n := m(m-1)(m-2)\cdots(m-n+1).$$
In the special case $m = n$, a one-to-one function from the set $A$ into $A$ is called a permutation of $A$, and we denote $(n)_n$ also by $n!$:
$$n! = 1 \times 2 \times 3 \times \cdots \times n,$$
the total number of permutations of a set of $n$ elements.

Any set $A$ contains several subsets. The number of subsets with $k$ elements each can be found by designating which of the $n$ elements of $A$ will be chosen: the first can be picked in $n$ ways, the second in $n - 1$ ways, and so on, the $k$-th in $n - k + 1$ ways. So we have $(n)_k = n(n-1)\cdots(n-k+1)$ ways to pick $k$ elements if we care about their order. If we don't, then we divide by the number $k!$ of permutations of the $k$ elements, concluding that the number of subsets with $k$ elements of a set with $n$ elements equals
$$\binom{n}{k} := \frac{(n)_k}{k!} = \frac{n!}{k!(n-k)!},$$
where the symbol $\binom{n}{k}$ is merely a name for the number computed on the left side of this equation. Clearly, $\binom{n}{k} = \binom{n}{n-k}$, and since there is only one subset with no elements (it is the empty set $\emptyset$), we have $\binom{n}{0} = 1$. We thus observe that there are as many subsets in $A$ as there are elements in the set $\{0, 1\}^n$ of sequences of length $n$ with elements 0 or 1: the total number of subsets of $A$ (regardless of their number of elements) is
$$\sum_{k=0}^{n} \binom{n}{k} = (1+1)^n = 2^n,$$
and this follows from the binomial theorem.
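All of these formulas can be confirmed by brute-force enumeration on tiny sets; a minimal sketch with $n = 3$ animals and $m = 4$ names (the sizes are illustrative choices):

```python
from itertools import product, combinations
from math import comb

n, m = 3, 4
B = range(m)
funcs = list(product(B, repeat=n))                   # all functions A -> B
print(len(funcs), m ** n)                            # 64 64
one_to_one = [f for f in funcs if len(set(f)) == n]  # injective assignments
print(len(one_to_one), m * (m - 1) * (m - 2))        # 24 24
print(len(list(combinations(B, 2))), comb(m, 2))     # 6 6
```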

We shall let $\binom{n}{k} = 0$ if $n < k$. The Pascal triangle relation states that
$$\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1}.$$
Indeed, in choosing $k$ animals from a set of $n + 1$ ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose $k$ animals from the remaining $n$ ones, and this can be done in $\binom{n}{k}$ ways. If the pig is included, then there are $k - 1$ remaining animals to be chosen, and this can be done in $\binom{n}{k-1}$ ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick $k$ animals from $n + 1$ animals.

Also, if $x$ is a real number and $m$ a nonnegative integer, we define
$$\binom{x}{m} = \frac{x(x-1)\cdots(x-m+1)}{m!}.$$
If $y$ is not an integer then, by convention, we let $\binom{n}{y} = 0$.

Maximum and minimum

The minimum of two numbers $a, b$ is denoted by $a \wedge b = \min(a, b)$; the maximum is denoted by $a \vee b = \max(a, b)$. Thus $a \wedge b = a$ if $a \le b$, $= b$ if $b \le a$. The notation extends to more than one number; for example, $a \vee b \vee c$ refers to the maximum of the three numbers $a, b, c$, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that
$$-(a \vee b) = (-a) \wedge (-b), \qquad (a \wedge b) + c = (a + c) \wedge (b + c).$$
In particular, if $a$ is a real number, we use the following notations:
$$a^+ = \max(a, 0), \qquad a^- = -\min(a, 0).$$
Both quantities are nonnegative; $a^+$ is called the positive part of $a$, and $a^-$ is called the negative part of $a$. Note that it is always true that
$$a = a^+ - a^-, \qquad |a| = a^+ + a^-.$$

. . meaning a function that takes value 1 on A and 0 on the complement Ac of A. 111 .}. . .Indicator function stands for the indicator function of a set A. So X= n≥1 1(X ≥ n). we have E[1A∪B ] = E[1A ] + E[1B ] − E[1A∩B ]. . if I go to a bank N times and Ai is the clause “I am robbed the i-th time” then N n=1 1Ai is the total number of times I am robbed. 1 − 1A = 1 A c . An example of its application: Since 1A∪B = 1 − 1(A∪B)c = 1 − 1Ac ∩Bc = 1 − 1Ac 1Bc = 1 − (1 − 1A )(1 − 1B ) = 1A + 1B − 1A 1B . P (X ≥ n). Thus. We also use the notation 1(clause) for a number that is 1 whenever the clause is true or 0 otherwise. . So E[X] = n≥1 E 1(X ≥ n) = n≥1 Matrices Let Rd be the set of vectors with d components. It takes value 1 with probability P (A) and 0 with probability P (Ac ). For instance. = 1A + 1B − 1A∩B . It is worth remembering that. A2 . if A1 . It is easy to see that 1A 1A∩B = 1A 1B . are sets or clauses then the number n≥1 1An is the number true clauses. Several quantities of interest are expressed in terms of indicators. Therefore its expectation is E 1A = 1 × P (A) + 0 × P (Ac ) = P (A). which means P (A ∪ B) = P (A) + P (B) − P (A ∩ B). 2. This is very useful notation because conjunction of clauses is transformed into multiplication. Another example: Let X be a nonnegative random variable with values in N = {1. A linear function ϕ from Rn to Rm is such that ϕ(αx + βy) = αϕ(x) + βϕ(y). Then X X= n=1 1 (the sum of X ones is X). If A is an event then 1A is a random variable.

Let $u_1, \ldots, u_n$ be a basis for $\mathbb{R}^n$. Then any $x \in \mathbb{R}^n$ can be uniquely written as
$$x = \sum_{j=1}^{n} x^j u_j.$$
But the $\varphi(u_j)$ are vectors in $\mathbb{R}^m$. So if we choose a basis $v_1, \ldots, v_m$ in $\mathbb{R}^m$, we can express each $\varphi(u_j)$ as a linear combination of the basis vectors:
$$\varphi(u_j) = \sum_{i=1}^{m} a_{ij} v_i, \quad j = 1, \ldots, n.$$
Combining, and because $\varphi$ is linear,
$$y := \varphi(x) = \sum_{j=1}^{n} x^j \varphi(u_j) = \sum_{i=1}^{m} v_i \sum_{j=1}^{n} a_{ij} x^j.$$
The column $(x^1, \ldots, x^n)^\top$ is a column representation of $x$ with respect to the given basis, and the column $(y^1, \ldots, y^m)^\top$, where $y^i := \sum_{j=1}^{n} a_{ij} x^j$, is a column representation of $y = \varphi(x)$ with respect to the basis $v_1, \ldots, v_m$. The matrix $A$ of $\varphi$ with respect to the choice of the two bases is defined by
$$A := \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix},$$
and the relation between the column representations of $y$ and $x$ is written as
$$\begin{pmatrix} y^1 \\ \vdots \\ y^m \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}\begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}.$$
Suppose that $\psi : \mathbb{R}^n \to \mathbb{R}^m$ and $\varphi : \mathbb{R}^m \to \mathbb{R}^\ell$ are linear functions. Then $\varphi \circ \psi : \mathbb{R}^n \to \mathbb{R}^\ell$ is a linear function. Fix three bases on $\mathbb{R}^n$, $\mathbb{R}^m$ and $\mathbb{R}^\ell$ and let $A$, $B$ be the matrix representations of $\varphi$, $\psi$, respectively, with respect to these bases. Then the matrix representation of $\varphi \circ \psi$ is $AB$, where
$$(AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}, \quad 1 \le i \le \ell,\ 1 \le j \le n.$$
So $A$ is a $\ell \times m$ matrix, $B$ is a $m \times n$ matrix, and $AB$ is a $\ell \times n$ matrix.

A matrix $A$ is called square if it is a $n \times n$ matrix. If we fix a basis in $\mathbb{R}^n$, a square matrix represents a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^n$; it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis. The standard basis in $\mathbb{R}^d$ consists of the vectors
$$e_1 = (1, 0, \ldots, 0)^\top, \quad e_2 = (0, 1, \ldots, 0)^\top, \quad \ldots, \quad e_d = (0, 0, \ldots, 1)^\top;$$
in other words, $e_i^j = 1(i = j)$.

Supremum and infimum

When we have a set of numbers $I$ (for instance, a sequence $a_1, a_2, \ldots$ of numbers), the minimum of $I$ is denoted by $\min I$ and refers to a number $c \in I$ such that $c \le x$ for all $x \in I$. This minimum (or maximum) may not exist: for example, $I = \{1, 1/2, 1/3, \cdots\}$ has no minimum. In these cases we use the notation $\inf I$, standing for the "infimum" of $I$, defined as the maximum of all the numbers $y$ with the property $y \le x$ for all $x \in I$. A property of real numbers says that if the set $I$ is bounded from below then this maximum of all lower bounds exists, and it is this that we call infimum. If $I$ has no lower bound then $\inf I = -\infty$; if $I$ is empty then $\inf \varnothing = +\infty$. $\inf I$ is characterised by: $\inf I \le x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \le \varepsilon + \inf I$. Similarly, we define the supremum $\sup I$ to be the minimum of all upper bounds; it is characterised by: $\sup I \ge x$ for all $x \in I$ and, for any $\varepsilon > 0$, there is $x \in I$ with $x \ge \sup I - \varepsilon$. If $I$ has no upper bound then $\sup I = +\infty$; $\sup \varnothing = -\infty$.

Limsup and liminf

If $x_1, x_2, \ldots$ is a sequence of numbers then
$$\inf\{x_1, x_2, \ldots\} \le \inf\{x_2, x_3, \ldots\} \le \inf\{x_3, x_4, \ldots\} \le \cdots,$$
and, since any nondecreasing sequence has a limit (possibly $+\infty$), we take this limit and call it $\liminf x_n$ (limit inferior):
$$\liminf_{n\to\infty} x_n = \lim_{n\to\infty} \inf\{x_n, x_{n+1}, x_{n+2}, \ldots\}.$$
Similarly, we define $\limsup_{n\to\infty} x_n = \lim_{n\to\infty} \sup\{x_n, x_{n+1}, \ldots\}$ (limit superior). We note that $\liminf(-x_n) = -\limsup x_n$. The sequence $x_n$ has a limit $\ell$ if and only if $\limsup x_n = \liminf x_n = \ell$.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). (I do sometimes "cheat" from a strict mathematical point of view, by not requiring this, but not too much, and everywhere else in these notes.) But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides $P(\Omega) = 1$) there is essentially one axiom, namely:
$$\text{if } A_1, A_2, \ldots \text{ are mutually disjoint events, then } P\Big(\bigcup_{n \ge 1} A_n\Big) = \sum_{n \ge 1} P(A_n).$$
Everything else follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.)

Some consequences. If $A, B$ are disjoint events then $P(A \cup B) = P(A) + P(B)$. If the events $A, B$ are not disjoint then $P(A \cup B) = P(A) + P(B) - P(A \cap B)$; this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events. If $A_1, A_2, \ldots$ are increasing with $A = \cup_n A_n$ then $P(A) = \lim_{n\to\infty} P(A_n)$; similarly, if they are decreasing with $A = \cap_n A_n$ then $P(A) = \lim_{n\to\infty} P(A_n)$. Also, $P(\cup_n A_n) \le \sum_n P(A_n)$. If $A_n$ are events with $P(A_n) = 0$ then $P(\cup_n A_n) = 0$; similarly, if $A_n$ are events with $P(A_n) = 1$ then $P(\cap_n A_n) = 1$.

Quite often, showing that $P(A) \le P(B)$ is more of a logical problem than a numerical one: we need to show that $A$ implies $B$ (i.e. $A$ is a subset of $B$). Similarly, to show that $EX \le EY$, we often need to argue that $X \le Y$. Markov's inequality for a positive random variable $X$ states that $P(X > x) \le EX/x$; this follows from the inequality $x\,1(X > x) \le X$, which holds since $X$ is positive.

In Markov chain theory, we work a lot with conditional probabilities. Recall that $P(B \mid A)$ is defined by $P(A \cap B)/P(A)$, and this is a very motivated definition. Often, to compute $P(A \cap B)$, we read the definition as $P(A \cap B) = P(A) P(B \mid A)$. This is generalisable to a number of events:
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2).$$
If we interchange the roles of $A, B$ in the definition, we get $P(A \cap B) = P(A) P(B \mid A) = P(B) P(A \mid B)$, and this is called Bayes' theorem. Usually, to compute $P(A)$, it is useful to find disjoint events $B_1, B_2, \ldots$ such that $\cup_n B_n = \Omega$, in which case we write
$$P(A) = \sum_n P(A \cap B_n) = \sum_n P(A \mid B_n) P(B_n).$$

If $A$ is an event then $1_A$ or $1(A)$ is the indicator random variable, i.e. the function that takes value 1 on $A$ and 0 on $A^c$. Note that $1_A 1_B = 1_{A \cap B}$, $1_{A^c} = 1 - 1_A$, and $E\,1_A = P(A)$.

An application of this is: $P(X < \infty) = \lim_{x\to\infty} P(X \le x)$.

The expectation of a nonnegative random variable $X$ may be defined by
$$EX = \int_0^\infty P(X > x)\, dx.$$
The formula holds for any positive random variable. For a random variable which takes positive and negative values, we define $X^+ = \max(X, 0)$ and $X^- = -\min(X, 0)$, which are both nonnegative, and then define $EX = EX^+ - EX^-$, which is motivated by the algebraic identity $X = X^+ - X^-$. This definition makes sense only if $EX^+ < \infty$ or $EX^- < \infty$. In case the random variable is discrete, the formula reduces to the known one: $EX = \sum_x x\,P(X = x)$. In case the random variable is absolutely continuous with density $f$, we have $EX = \int_{-\infty}^{\infty} x f(x)\, dx$. For a random variable that takes values in $\mathbb{N}$ we can write $EX = \sum_{n=1}^{\infty} P(X \ge n)$; this is due to the fact that $\sum_{n=1}^{\infty} 1(X \ge n) = X$ and $E\,1(X \ge n) = P(X \ge n)$.

In Markov chain theory, we encounter situations where $P(X = \infty)$ may be positive, i.e. random variables with values in $\mathbb{Z}_+ \cup \{+\infty\}$, so it is useful to keep this in mind.

Generating functions

A generating function of a sequence $a_0, a_1, \ldots$ is an infinite polynomial having the $a_n$ as coefficients:
$$G(s) = a_0 + a_1 s + a_2 s^2 + \cdots.$$
We do not a priori know that this exists for any $s$ other than, obviously, $s = 0$, for which $G(0) = a_0$. Note that $|G(s)| \le \sum_n |a_n| |s|^n$, and this is $\le \sum_n |a_n|$ if $s \in [-1, 1]$. So if $\sum_n |a_n| < \infty$ then $G(s)$ exists for all $s \in [-1, 1]$, making it a useful object when $a_0, a_1, \ldots$ is a probability distribution on $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, because then $\sum_n a_n = 1$. If $G$ exists for some $s_1$ then it exists uniformly for all $s$ with $|s| < |s_1|$ (and this needs a proof, taught in calculus courses).

As an example, the generating function of the geometric sequence $a_n = \rho^n$, $n \ge 0$, is
$$\sum_{n=0}^{\infty} \rho^n s^n = \frac{1}{1 - \rho s}, \quad |s| < 1/|\rho|, \tag{39}$$
as long as $|\rho s| < 1$.

Suppose now that $X$ is a random variable with values in $\mathbb{Z}_+ \cup \{+\infty\}$. The generating function of the sequence $p_n = P(X = n)$, $n \ge 0$, is called the probability generating function of $X$:
$$\varphi(s) = E s^X = \sum_{n=0}^{\infty} s^n P(X = n).$$

This is defined for all $-1 < s < 1$. Note that $\sum_{n=0}^{\infty} p_n \le 1$: we do allow $X$ to, possibly, take the value $+\infty$, with
$$p_\infty := P(X = \infty) = 1 - \sum_{n=0}^{\infty} p_n.$$
The term $s^\infty p_\infty$ could formally appear in the sum defining $\varphi(s)$, but, since $|s| < 1$, we have $s^\infty = 0$, so it contributes nothing.

Note that $\varphi(0) = P(X = 0)$, while
$$\varphi(1) = \sum_{n=0}^{\infty} P(X = n) = P(X < +\infty),$$
and this is why $\varphi$ is called probability generating function. We can recover the probabilities $p_n$ from the formula
$$p_n = \frac{\varphi^{(n)}(0)}{n!},$$
as follows from Taylor's theorem. Also, we can recover the expectation of $X$ from
$$EX = \lim_{s \uparrow 1} \varphi'(s),$$
something that can be formally seen from
$$\frac{d}{ds} E s^X = E \frac{d}{ds} s^X = E X s^{X-1},$$
by letting $s \uparrow 1$.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
$$x_{n+2} = x_{n+1} + x_n, \quad n \ge 0, \qquad \text{with } x_0 = x_1 = 1.$$
If we let $X(s) = \sum_{n \ge 0} x_n s^n$ be the generating function of $(x_n,\ n \ge 0)$, then the generating function of $(x_{n+1},\ n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+1} s^n = s^{-1}(X(s) - x_0),$$
and the generating function of $(x_{n+2},\ n \ge 0)$ is
$$\sum_{n \ge 0} x_{n+2} s^n = s^{-2}(X(s) - x_0 - s x_1).$$

From the recursion $x_{n+2} = x_{n+1} + x_n$ we have (and this is where linearity is used) that the generating function of $(x_{n+2},\ n \ge 0)$ equals the sum of the generating functions of $(x_{n+1},\ n \ge 0)$ and $(x_n,\ n \ge 0)$:
$$s^{-2}(X(s) - x_0 - s x_1) = s^{-1}(X(s) - x_0) + X(s).$$
Since $x_0 = x_1 = 1$, we can solve for $X(s)$ and find
$$X(s) = \frac{-1}{s^2 + s - 1}.$$
But the polynomial $s^2 + s - 1$ has two roots,
$$a = (\sqrt{5} - 1)/2, \qquad b = -(\sqrt{5} + 1)/2,$$
so $s^2 + s - 1 = (s - a)(s - b)$, and
$$X(s) = \frac{-1}{(s-a)(s-b)} = \frac{1}{b-a}\Big(\frac{1}{b-s} - \frac{1}{a-s}\Big).$$
But (39) tells us that $\rho^n$ has generating function $\frac{1}{1 - \rho s}$; thus $\frac{1}{(1/\rho) - s}$ is the generating function of $\rho^{n+1}$. This tells us that $\frac{1}{b-s}$ is the generating function of $(1/b)^{n+1}$ and that $\frac{1}{a-s}$ is the generating function of $(1/a)^{n+1}$. Hence $X(s)$ is the generating function of
$$x_n = \frac{1}{b-a}\big((1/b)^{n+1} - (1/a)^{n+1}\big),$$
which is therefore the $n$-th term of the Fibonacci sequence. Noting that $ab = -1$, so that $1/b = -a$ and $1/a = -b$, we can also write this as
$$x_n = \frac{(-a)^{n+1} - (-b)^{n+1}}{b - a}.$$
This is the solution to the recursion. Despite the square roots in the formula, this number is always an integer (it has to be!).
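A two-minute check of the closed form against the recursion itself; a minimal sketch:

```python
from math import sqrt

a = (sqrt(5) - 1) / 2
b = -(sqrt(5) + 1) / 2

def fib_closed(n):
    """x_n from the generating-function solution above."""
    return ((1 / b) ** (n + 1) - (1 / a) ** (n + 1)) / (b - a)

x0, x1 = 1, 1
for n in range(10):
    print(n, x0, round(fib_closed(n)))   # the two columns agree: 1 1 2 3 5 8 ...
    x0, x1 = x1, x0 + x1
```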

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable $X$ with values in $\mathbb{Z}$ or $\mathbb{R}$ or $\mathbb{R}^n$ or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form $P(X \in B)$, where $B$ is a set of values of $X$. For example, if $X$ is a random variable with values in $\mathbb{R}$, then the so-called distribution function $P(X \le x)$, $x \in \mathbb{R}$, is such an example. But I could very well use, instead, the function $P(X > x)$, or the 2-parameter function $P(a < X < b)$. If $X$ is absolutely continuous, I could use its density
$$f(x) = \frac{d}{dx} P(X \le x).$$
All these objects can be referred to as "distribution" or "law" of $X$. Even the generating function $E s^X$ of a $\mathbb{Z}_+$-valued random variable $X$ is an example, for it does allow us to specify the probabilities $P(X = n)$, $n \in \mathbb{Z}_+$. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state $i$. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute $n! = 1 \times 2 \times \cdots \times n$ for large $n$ then, since the number is so big, it's better to use logarithms:
$$n! = e^{\log n!} = \exp \sum_{k=1}^{n} \log k.$$
Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since $\log x$ does not vary too much between two successive positive integers $k$ and $k + 1$, it follows that
$$\int_k^{k+1} \log x\, dx \approx \log k, \qquad \text{hence} \qquad \sum_{k=1}^{n} \log k \approx \int_1^n \log x\, dx.$$
Since $\frac{d}{dx}(x \log x - x) = \log x$, it follows that
$$\int_1^n \log x\, dx = n \log n - n + 1 \approx n \log n - n.$$
Hence we have
$$n! \approx \exp(n \log n - n) = n^n e^{-n}.$$
We call this the rough Stirling approximation. But there are two problems: (a) we do not know in what sense the symbol $\approx$ should be interpreted, and (b) the approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in $2n$ fair coin tosses, we have exactly $n$ heads equals $\binom{2n}{n} 2^{-2n} \approx 1$, and this is terrible, because we know that this probability goes to zero as $n \to \infty$. To remedy this, we shall prove Stirling's approximation:
$$n! \sim n^n e^{-n} \sqrt{2\pi n},$$
where the symbol $a_n \sim b_n$ means that $a_n/b_n \to 1$, as $n \to \infty$.
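The difference between the rough and the full approximation is easy to see numerically; a minimal sketch:

```python
from math import factorial, pi, e, sqrt, exp, log

# Compare n! with exp(n log n - n) (rough) and sqrt(2 pi n)(n/e)^n (full).
for n in (5, 10, 50):
    rough = exp(n * log(n) - n)
    full = sqrt(2 * pi * n) * (n / e) ** n
    print(n, factorial(n), round(rough), round(full))
# The ratio full/n! tends to 1, while rough/n! tends to 0.
```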

The geometric distribution

We say that the random variable $X$ with values in $\mathbb{Z}_+$ is geometric if it has the memoryless property:
$$P(X - n \ge m \mid X \ge n) = P(X \ge m), \quad \text{for all } m, n \ge 0.$$
If we think of $X$ as the time of occurrence of a random phenomenon, then $X$ is such that knowledge that $X$ has exceeded $n$ does not make the distribution of the remaining time $X - n$ any different from that of $X$. The memoryless property forces the distribution to be of the form
$$P(X \ge n) = \lambda^n, \quad n \in \mathbb{Z}_+,$$
where $\lambda = P(X \ge 1)$. We baptise this the geometric($\lambda$) distribution. We have already seen a geometric random variable: the random variable $J_0 = \sum_{n=1}^{\infty} 1(S_n = 0)$, defined in (34). It was shown in Theorem 38 that $J_0$ satisfies the memoryless property; in fact, it was shown there that $P(J_0 \ge k) = P(J_0 \ge 1)^k$, a geometric function of $k$.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely, you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Let $X$ be a random variable and consider an increasing and nonnegative function $g : \mathbb{R} \to \mathbb{R}_+$. Then, for all $x$,
$$g(X) \ge g(x)\,1(X \ge x).$$
Indeed, if $X \ge x$ then $1(X \ge x) = 1$ and the display tells us $g(X) \ge g(x)$, expressing monotonicity; and if $X < x$ then $1(X \ge x) = 0$ and the display reads $g(X) \ge 0$, expressing non-negativity. (We can capture both properties of $g$ neatly by writing: for all $x, y \in \mathbb{R}$, $g(y) \ge g(x)\,1(y \ge x)$.) This completely trivial thought gives the very useful Markov's inequality, by taking expectations of both sides (expectation is an increasing operator):
$$E\,g(X) \ge g(x)\,P(X \ge x).$$
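A quick empirical illustration of the inequality with $g(y) = y$ (the exponential sample is an arbitrary choice of positive random variable made for this sketch):

```python
import random

rng = random.Random(6)
xs = [rng.expovariate(1.0) for _ in range(100000)]
mean = sum(xs) / len(xs)
for x in (1.0, 2.0, 4.0):
    tail = sum(v >= x for v in xs) / len(xs)
    print(x, round(tail, 4), "<=", round(mean / x, 4))  # P(X >= x) <= EX / x
```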


Index

absorbing, Aloha, aperiodic, arcsine law, axiom of Probability Theory, balance equations, ballot theorem, bank account, branching process, Catalan numbers, Cesaro averages, Chapman-Kolmogorov equation, closed, communicates with, communicating classes, convolution, coupling, coupling inequality, cycle formula, degree, detailed balance equations, digraph (directed graph), distribution, division, dual, dynamical system, equivalence classes, equivalence relation, ergodic, essential state, Ethernet, Fatou's lemma, Fibonacci sequence, first-step analysis, function of a Markov chain, Galton-Watson process, generating function, geometric distribution, Google, greatest common divisor, ground state, Helly-Bray Lemma, hitting time, increments, indicator, inessential state, infimum, initial distribution, invariant distribution, irreducible, Kolmogorov's loop criterion, law, leads to, Loynes' method, Markov chain, Markov property, Markov's inequality, matrix, maximum, meeting time, memoryless property, minimum, mixing, Monte Carlo, nearest neighbour random walk, neighbours, null recurrent, PageRank, period, permutation, positive recurrent, probability flow, probability generating function, random walk on a graph, recurrent, reflection, reflection principle, reflexive, relation, ruin probability, running maximum, semigroup property, simple random walk, simple symmetric random walk, skip-free property, Skorokhod embedding, states, statespace, stationary, stationary distribution, steady-state, Stirling's approximation, stopping time, storage chain, strategy of the gambler, strong Markov property, subordination, subsequence of a Markov chain, supremum, symmetric, time-homogeneous, time-reversible, transient, transition diagram, transition probability, transition probability matrix, transitive, tree, urn, watching a chain in a set, Wright-Fisher model
