Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos, Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1 Introduction
2 Examples of Markov chains
   2.1 A mouse in a cage
   2.2 Bank account
   2.3 Simple random walk (drunkard's walk)
   2.4 Simple random walk (drunkard's walk) in a city
   2.5 Actuarial chains
3 The Markov property
   3.1 Definition of the Markov property
   3.2 Transition probability and initial distribution
   3.3 Time-homogeneous chains
4 Stationarity
   4.1 Finding a stationary distribution
5 Topological structure
   5.1 The graph of a chain
   5.2 The relation "leads to"
   5.3 The relation "communicates with"
   5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
   10.1 First hitting time decomposition
   10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
   13.1 Functions of excursions
   13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
   18.1 Definition of stability
   18.2 The fundamental stability theorem for Markov chains
   18.3 Coupling
   18.4 Proof of the fundamental stability theorem
   18.5 Terminology
   18.6 Periodic chains
19 Limiting theory for Markov chains*
   19.1 Limiting probabilities for null recurrent states
   19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
   23.1 Subsequences
   23.2 Watching a Markov chain when it visits a set
   23.3 Subordination
   23.4 Deterministic function of a Markov chain
24 Applications
   24.1 Branching processes: a population growth application
   24.2 The ALOHA protocol: a computer communications application
   24.3 PageRank (trademark of Google): A World Wide Web application
   24.4 The Wright-Fisher model: an example from biology
   24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
   26.1 Paths that start and end at specific points
   26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
   27.1 Distribution after n steps
   27.2 The ballot theorem
   27.3 Some conditional probabilities
   27.4 Some remarkable identities
   27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
   28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
   30.1 First hitting time analysis
   30.2 First return to the origin
   30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
   31.1 Asymptotic distribution of the maximum
   31.2 Return probabilities
   31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory. I have tried both and prefer the current ordering. A few starred sections should be considered "advanced" and can be omitted at first reading. At the end, I have a little mathematical appendix.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force: m ẍ = F. Here, x = x(t) denotes the position of the particle at time t; its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t); past states are not needed.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. Formally, we let our "state" be a pair of integers (x_n, y_n), where x_n ≥ y_n, initialised by x_0 = max(a, b), y_0 = min(a, b), and evolving as

    x_{n+1} = y_n
    y_{n+1} = rem(x_n, y_n),    n = 0, 1, 2, ...

The sequence of pairs thus produced, X_0, X_1, X_2, ..., has the property that, for any n, the future pairs X_{n+1}, X_{n+2}, ... depend only on the pair X_n. In other words, there is a function F that takes the pair X_n = (x_n, y_n) into X_{n+1} = (x_{n+1}, y_{n+1}).

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables, instead of deterministic objects. We will develop a theory that tells us how to describe, analyse, and use those mathematical models which are called Markov chains. We will also see why they are useful and discuss how they are applied. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness, i.e. via the use of the tools of Probability.

An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral of a complicated function f over an interval [a, b]: suppose that f is positive and bounded above by B. Choose a pair (X_0, Y_0) of random numbers, a ≤ X_0 ≤ b, 0 ≤ Y_0 ≤ B, uniformly. Repeat with (X_n, Y_n) for steps n = 1, 2, .... Each time count 1 if Y_n < f(X_n) and 0 otherwise. Perform this a large number N of times, count the number of 1's and divide by N. The resulting ratio, multiplied by the area B(b − a) of the enclosing rectangle, is an approximation of the integral of f over [a, b]. Try this on the computer!

¹ Stochastic means random.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A mouse lives in the cage. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1 it is either still in 1 or has moved to 2, regardless of where it was at earlier times. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from cell 2 to cell 1 with probability β = 0.99. We can summarise this information by the transition diagram:

    [Transition diagram: 1 → 2 with probability α (1 stays put with probability 1 − α);
     2 → 1 with probability β (2 stays put with probability 1 − β).]

Another way to summarise the information is by the 2×2 transition probability matrix

    P = ( 1−α    α  )  =  ( 0.95  0.05 )
        (  β   1−β  )     ( 0.99  0.01 )

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: Since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 ≈ 20 minutes. (This is the mean of the geometric distribution with parameter α.)

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

    X_{n+1} = max{X_n + D_n − S_n, 0}.

Here, X_n is the money (in pounds) at the beginning of the month, D_n is the money he deposits, and S_n is the money he wants to spend. So, if he wishes to spend more than what is available, he can't. Assume that D_n, S_n, n = 1, 2, ..., are random variables with distributions F_D, F_S, respectively, and assume also that X_0, D_0, S_0, D_1, S_1, D_2, S_2, ... are independent random variables. The information described above by the evolution of the account together with the distributions for D_n and S_n leads to the (one-step) transition probability, defined by

    p_{x,y} := P(X_{n+1} = y | X_n = x),    x, y = 0, 1, 2, ....

We easily see that if y > 0 then

    p_{x,y} = P(D_n − S_n = y − x)
            = Σ_{z=0}^∞ P(S_n = z, D_n − S_n = y − x)
            = Σ_{z=0}^∞ P(S_n = z, D_n = z + y − x)
            = Σ_{z=0}^∞ P(S_n = z) P(D_n = z + y − x)
            = Σ_{z=0}^∞ F_S(z) F_D(z + y − x),

while, if y = 0, we have

    p_{x,0} = 1 − Σ_{y=1}^∞ p_{x,y},

something that, in principle, can be computed from F_S and F_D. The transition probability matrix is

    P = ( p_{0,0}  p_{0,1}  p_{0,2}  ··· )
        ( p_{1,0}  p_{1,1}  p_{1,2}  ··· )
        ( p_{2,0}  p_{2,1}  p_{2,2}  ··· )
        (   ···      ···      ···        )

and is an infinite matrix.
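As a sketch of how the formula above can be used in practice, the following Python fragment tabulates p_{x,y} for hypothetical pmfs of D_n and S_n (the pmfs and the truncation level M are illustrative assumptions, not part of the text):

    import numpy as np

    # Hypothetical pmfs for deposits D_n and spending S_n (assumptions).
    FD = {0: 0.2, 1: 0.5, 2: 0.3}   # P(D_n = d)
    FS = {0: 0.4, 1: 0.4, 2: 0.2}   # P(S_n = s)

    M = 10  # truncate the state space to {0, 1, ..., M} for computation
    P = np.zeros((M + 1, M + 1))
    for x in range(M + 1):
        for d, pd in FD.items():
            for s, ps in FS.items():
                # X_{n+1} = max(X_n + D_n - S_n, 0), capped at M by truncation
                y = min(max(x + d - s, 0), M)
                P[x, y] += pd * ps

    assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1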

Questions of interest:
1. Will Mr Baxter's account grow, and, if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction. So he may move forwards or backwards with equal probability:

    [Transition diagram: the integers ..., −2, −1, 0, 1, 2, 3, ..., with probability 1/2
     of moving one step to the right and probability 1/2 of moving one step to the left.]

Questions:
1. If he starts from position 0, how often will he be visiting 0?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?

You might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

    [Figure: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner. So he can move east, west, north or south, and let's say he's completely drunk, so we assign probability 1/4 for each move. We may again want to ask similar questions. But observe that, since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea of how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

    [Transition diagram on the states H (healthy), S (sick), D (dead):
     p_{H,S} = 0.3,  p_{S,H} = 0.8,  p_{H,D} = 0.01,  p_{S,D} = 0.1.]

Thus, there is probability p_{H,S} = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits p_{H,H}, p_{S,S} and p_{D,D}. Clearly, p_{H,H} = 1 − p_{H,S} − p_{H,D}, p_{S,S} = 1 − p_{S,H} − p_{S,D}, and p_{D,D} = 1. Also, p_{D,H} and p_{D,S} are omitted because they are equal to zero: the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination.
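A sketch of how this question can be answered numerically: the lifetime is the hitting time of D starting from H, and P(lifetime = n) can be read off successive powers of the transition matrix. The matrix entries below are the values in the diagram; the 12-month horizon is an arbitrary illustration:

    import numpy as np

    # States ordered H, S, D; off-diagonal entries as in the diagram.
    P = np.array([[0.69, 0.30, 0.01],   # from H: stays H w.p. 1 - 0.3 - 0.01
                  [0.80, 0.10, 0.10],   # from S: back to H w.p. 0.8, dies w.p. 0.1
                  [0.00, 0.00, 1.00]])  # D is absorbing

    mu = np.array([1.0, 0.0, 0.0])      # currently healthy
    dead_so_far = 0.0
    for n in range(1, 13):
        mu = mu @ P
        print(n, mu[2] - dead_so_far)   # P(lifetime = n) = P(dead by n) - P(dead by n-1)
        dead_so_far = mu[2]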

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {X_n, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (X_m, m > n, m ∈ T) is independent of the past process (X_m, m < n, m ∈ T), conditionally on X_n. Sometimes we say that {X_n, n ∈ T} is Markov instead of saying that it has the Markov property.

Usually, T = Z+ = {0, 1, 2, 3, ...} or N = {1, 2, 3, ...} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the X_n take values in some countable set S, called the state space. The elements of S are frequently called states. Since S is countable, it is customary to call (X_n) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (X_n, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i_0, ..., i_{n+1} ∈ S,

    P(X_{n+1} = i_{n+1} | X_n = i_n, ..., X_0 = i_0) = P(X_{n+1} = i_{n+1} | X_n = i_n).

Proof. If (X_n) is Markov then the claim holds by the conditional independence of X_{n+1} and (X_0, ..., X_{n−1}) given X_n. On the other hand, if the claim holds, we can see, by applying it several times, that

    P(∀ k ∈ [n+1, n+m] X_k = i_k | X_n = i_n, ..., X_0 = i_0) = P(∀ k ∈ [n+1, n+m] X_k = i_k | X_n = i_n),

for any n, m and any choice of states i_0, ..., i_{n+m}, which precisely means that (X_{n+1}, ..., X_{n+m}) and (X_0, ..., X_{n−1}) are independent conditionally on X_n. This is true for any n and m, and so the future is independent of the past given the present, i.e. the Markov property holds.

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

    P(X_0 = i_0, X_1 = i_1, X_2 = i_2, ..., X_n = i_n)
        = P(X_0 = i_0) P(X_1 = i_1 | X_0 = i_0) P(X_2 = i_2 | X_1 = i_1) · · · P(X_n = i_n | X_{n−1} = i_{n−1}),    (1)

which means that the two functions

    µ(i) := P(X_0 = i),    i ∈ S,
    p_{i,j}(n, n+1) := P(X_{n+1} = j | X_n = i),    i, j ∈ S, n ≥ 0,

will specify all joint distributions²:

    P(X_0 = i_0, X_1 = i_1, X_2 = i_2, ..., X_n = i_n) = µ(i_0) p_{i_0,i_1}(0, 1) p_{i_1,i_2}(1, 2) · · · p_{i_{n−1},i_n}(n−1, n).

The function µ(i) is called initial distribution. The function p_{i,j}(n, n+1) is called transition probability from state i at time n to state j at time n + 1. More generally,

    p_{i,j}(m, n) := P(X_n = j | X_m = i)

is the transition probability from state i at time m to state j at time n ≥ m. Of course, p_{i,j}(n, n) = 1(i = j) and, by (1),

    p_{i,j}(m, n) = Σ_{i_1∈S} · · · Σ_{i_{n−1}∈S} p_{i,i_1}(m, m+1) · · · p_{i_{n−1},j}(n−1, n),    (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

    P(m, n) := [p_{i,j}(m, n)]_{i,j∈S},

viz., a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are p_{i,j}(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

    P(m, n) = P(m, m+1) P(m+1, m+2) · · · P(n−1, n),

or as

    P(m, n) = P(m, ℓ) P(ℓ, n)    if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities p_{i,j}(n, n+1) do not depend on n.

We can therefore simply write

    p_{i,j}(n, n+1) =: p_{i,j},

and call p_{i,j} the (one-step) transition probability from state i to state j, while

    P = [p_{i,j}]_{i,j∈S}

is called the (one-step) transition probability matrix or, simply, transition matrix. Then

    P(m, n) = P × P × · · · × P = P^{n−m}    (n − m times),

for all m ≤ n, and the semigroup property is now even more obvious because

    P^{a+b} = P^a P^b,

for all integers a, b ≥ 0. From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let

    µ_n := [µ_n(i) = P(X_n = i)]_{i∈S}

be the distribution of X_n, arranged in a row vector. Then we have

    µ_n = µ_0 P^n,

as follows from (1) and the definition of P.

Notational conventions

Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δ_i:

    δ_i(j) := 1 if j = i, 0 otherwise.    (3)

We call δ_i the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i_1 and 1 − p to state i_2, then µ = p δ_{i_1} + (1 − p) δ_{i_2}.

A useful convention used for time-homogeneous chains is to write

    P_i(A) := P(A | X_0 = i).

Thus P_i is the law of the chain when the initial distribution is δ_i. Similarly, if Y is a real random variable, we write

    E_i Y := E(Y | X_0 = i) = Σ_y y P(Y = y | X_0 = i) = Σ_y y P_i(Y = y).
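A minimal sketch of these matrix conventions in Python, using the 2×2 mouse matrix of Section 2.1 (numpy's matrix_power plays the role of P^n):

    import numpy as np

    alpha, beta = 0.05, 0.99
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    mu0 = np.array([1.0, 0.0])                  # delta_1: mouse starts in cell 1
    mun = mu0 @ np.linalg.matrix_power(P, 100)  # mu_n = mu_0 P^n
    print(mun)                                  # distribution of X_100

    # Semigroup (Chapman-Kolmogorov) property: P^(a+b) = P^a P^b
    a, b = 3, 4
    assert np.allclose(np.linalg.matrix_power(P, a + b),
                       np.linalg.matrix_power(P, a) @ np.linalg.matrix_power(P, b))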

4 Stationarity

We say that the process (X_n, n = 0, 1, ...) is stationary if it has the same law as (X_n, n = m, m+1, ...), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a canonical heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, then at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10^25) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear, as it takes some time before the molecules diffuse from room to room, that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1, and vice versa. Let X_n be the number of molecules in room 1 at time n. Then

    p_{i,i+1} = P(X_{n+1} = i + 1 | X_n = i) = (N − i)/N,
    p_{i,i−1} = P(X_{n+1} = i − 1 | X_n = i) = i/N.

The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X_0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. If we start the chain with X_0 = N (all molecules in room 1) then the process (X_n, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started. On the average, we indeed expect that we have N/2 molecules in each room.

. we are led to consider: The stationarity problem: Given [pij ]. . use (1) to see that P (X0 = i0 .each molecule at random either in room 1 or in room 2. . then the number of molecules in room 1 will be i with probability π(i) = N −N 2 . . in ∈ S. 10 . n ∈ Z+ ) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ. The equations (4) are called balance equations. Changing notation. If we do so. (a binomial distribution). We can now verify that if P (X0 = i) = π(i) then P (X1 = i) = π(i) and. . find those distributions π such that πP = π. we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. The only if part is obvious.i + π(i + 1)pi+1.i + P (X0 = i + 1)pi+1. In other words. But we need to prove this. i 0 ≤ i ≤ N. then we have created a stationary Markov chain. by (3). A Markov chain (Xn . X1 = i1 . for all m ≥ 0 and i0 . P (X1 = i) = j P (X0 = j)pj. the distribution µn at time n is related to the distribution at time 0 by µn = µ0 Pn . . Xm+1 = i1 . which you should go through by yourselves. . If we can do so. 2−N i−1 i+1 N N (The dots signify some missing algebra.i = P (X0 = i − 1)pi−1. . Xm+n = in ). Lemma 2. Since. .i = N N −i+1 N i+1 2−N + = · · · = π(i). For the if part. Proof. . Indeed. a Markov chain is stationary if and only if the distribution of Xn does not change with n. . (4) Such a π is called .) The example above is typical: We found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. Xm+2 = i2 . Xn = in ) = P (Xm = i0 . stationary or invariant distribution.i = π(i − 1)pi−1. P (Xn = i) = π(i) for all n. hence. . X2 = i2 .

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P always has the eigenvalue 1, because Σ_{j∈S} p_{i,j} = 1 for all i ∈ S, which, in matrix notation, can be written as P1 = 1, where 1 is a column whose entries are all 1; this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to that, π must be a probability: Σ_{i∈S} π(i) = 1.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set S_d of d! permutations of {1, ..., d}. Thus, each x ∈ S_d is of the form x = (x_1, ..., x_d), where all the x_i are distinct. It is easy to see that the stationary distribution is uniform:

    π(x) = 1/d!,    x ∈ S_d.

Example: waiting for a bus. After a bus arrives at a bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, .... Consider the following process: At time n (measured in minutes) let X_n be the time till the arrival of the next bus. Then (X_n, n ≥ 0) is a Markov chain with state space S = N = {1, 2, ...} and transition probabilities

    p_{i+1,i} = 1,    p_{1,i} = p(i),    i ≥ 1.

To find the stationary distribution, write the equations (4):

    π(i) = Σ_{j≥1} π(j) p_{j,i} = π(1) p(i) + π(i+1),    i ≥ 1.

These can easily be solved:

    π(i) = π(1) Σ_{j≥i} p(j),    i ≥ 1.

To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives

    π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^{−1}.

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero. This means that the solution to the balance equations is identically zero, which in turn means that the normalisation condition cannot be satisfied. So we must make an:

ASSUMPTION: Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

So let us do so. Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

    Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.

Also, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted.) Consider an infinite binary vector ξ = (ξ_1, ξ_2, ...), i.e. ξ_i ∈ {0, 1} for all i, and transform it according to the following rule: if ξ_1 = 0 then shift to the left to obtain (ξ_2, ξ_3, ...); if ξ_1 = 1 then shift to the left and flip all bits to obtain (1 − ξ_2, 1 − ξ_3, ...). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

    ξ(n + 1) = ϕ(ξ(n)).

Here ξ(n) = (ξ_1(n), ξ_2(n), ...) is the state at time n. Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. But to even think about it will give you some strength. Note that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random.

We can show that if we start with the ξ_r(0), r = 1, 2, ..., i.i.d. with P(ξ_r(0) = 1) = P(ξ_r(0) = 0) = 1/2, then the same will be true for all n: P(ξ_r(n) = 1) = P(ξ_r(n) = 0) = 1/2. The proof is by induction: if the ξ_1(n), ξ_2(n), ... are i.i.d. with P(ξ_r(n) = 1) = P(ξ_r(n) = 0) = 1/2, then, from (5), the same is true at n + 1. You can check that, for any n = 0, 1, 2, ..., any r = 1, 2, ..., and any ε_1, ..., ε_r ∈ {0, 1},

    P(ξ_1(n+1) = ε_1, ..., ξ_r(n+1) = ε_r)
        = P(ξ_1(n) = 0, ξ_2(n) = ε_1, ..., ξ_{r+1}(n) = ε_r)
        + P(ξ_1(n) = 1, ξ_2(n) = 1 − ε_1, ..., ξ_{r+1}(n) = 1 − ε_r).    (5)

Hence choosing the initial state at random in this manner results in a stationary Markov chain.

As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ_1 = 0 then x < 1/2 and our shifting rule maps x into 2x. If ξ_1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread, for it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: When a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

    π = πP    (balance equations)
    π1 = 1    (normalisation condition).

In other words,

    π(i) = Σ_{j∈S} π(j) p_{j,i},    i ∈ S,
    Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples.

Let us introduce the notion of "probability flow" from a set A of states to its complement under a distribution π:

    F(A, A^c) := Σ_{i∈A} Σ_{j∈A^c} π(i) p_{i,j}.

Think of π(i) p_{i,j} as a "current" flowing from i to j. So F(A, A^c) is the total current from A to A^c. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

    F(A, A^c) = F(A^c, A),

for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

    π(i) = Σ_{j∈S} π(j) p_{j,i} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈A^c} π(j) p_{j,i}.

Multiply π(i) by Σ_{j∈S} p_{i,j} (which equals 1):

    π(i) Σ_{j∈S} p_{i,j} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈A^c} π(j) p_{j,i}.

Now split the left sum also:

    Σ_{j∈A} π(i) p_{i,j} + Σ_{j∈A^c} π(i) p_{i,j} = Σ_{j∈A} π(j) p_{j,i} + Σ_{j∈A^c} π(j) p_{j,i}.

This is true for all i ∈ S. Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek, F(A, A^c) = F(A^c, A).

    [Figure: schematic representation of the flow balance relations, F(A, A^c) = F(A^c, A).]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p_{12} = α, p_{21} = β. (Consequently, p_{11} = 1 − α, p_{22} = 1 − β.) Assume 0 < α, β < 1. We have

    π(1) α = π(2) β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

    π(1) = cβ,    π(2) = cα.

The constant c is found from π(1) + π(2) = 1. That is,

    π(1) = β/(α + β),    π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952, π(2) = 0.05/1.04 ≈ 0.048. Thus, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
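A sketch of the direct numerical route for the same example: replace one (redundant) balance equation by the normalisation condition and solve the resulting linear system:

    import numpy as np

    alpha, beta = 0.05, 0.99
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    # pi (P - I) = 0 together with pi 1 = 1: drop one balance equation
    # (they are linearly dependent) and append the normalisation condition.
    A = np.vstack([(P.T - np.eye(2))[:-1], np.ones(2)])
    b = np.array([0.0, 1.0])
    pi = np.linalg.solve(A, b)
    print(pi)   # approximately [0.952, 0.048], as computed above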

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

    [Transition diagram: states 0, 1, 2, 3, 4, ..., with p_{i,i+1} = p and p_{i,i−1} = q
     for i ≥ 1, and p_{0,1} = p, p_{0,0} = q at the barrier.]

The balance equations are

    π(i) = p π(i − 1) + q π(i + 1),    i = 1, 2, ....

But we can make them simpler by choosing A = {0, 1, ..., i − 1}. For i ≥ 1, we have

    F(A, A^c) = π(i − 1) p = π(i) q = F(A^c, A),

i.e. π(i) = (p/q) π(i − 1). These are much simpler than the previous ones. In fact, we can solve them immediately:

    π(i) = (p/q)^i π(0),    i ≥ 1.

Does this provide a stationary distribution? Yes, provided that Σ_{i=0}^∞ π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by

    (i, j) ∈ E ⇐⇒ p_{i,j} > 0.

In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time. In other words,

    i leads to j (written as i ⇝ j) ⇐⇒ P_i(∃n ≥ 0 X_n = j) > 0.

Notice that this relation is reflexive: ∀i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) p_{i,i_1} p_{i_1,i_2} · · · p_{i_{n−1},j} > 0 for some n ∈ N and some states i_1, ..., i_{n−1};
(iii) P_i(X_n = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since

    0 < P_i(∃n ≥ 0 X_n = j) ≤ Σ_{n≥0} P_i(X_n = j),

the sum on the right must have a nonzero term. Hence P_i(X_n = j) > 0 for some n, and so (iii) holds.
• Suppose that (iii) holds. We have

    P_i(X_n = j) = Σ_{i_1,...,i_{n−1}} p_{i,i_1} p_{i_1,i_2} · · · p_{i_{n−1},j},    (6)

where the sum extends over all possible choices of states in S. Since P_i(X_n = j) > 0, the sum contains a nonzero term, and so (ii) holds.
• Suppose now that (ii) holds. Look at the last display again. By assumption, the sum contains a nonzero term, so we have that the sum is also positive, meaning that P_i(X_n = j) > 0. Since P_i(X_n = j) ≤ P_i(∃m ≥ 0 X_m = j), (i) holds.

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

    i communicates with j (written as i ↭ j) ⇐⇒ i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ↭ j ⇐⇒ j ↭ i). It is reflexive and transitive because ⇝ is so.

Corollary 2. The relation ↭ is an equivalence relation, i.e. it is symmetric, reflexive and transitive.

Just as any equivalence relation, it partitions S into equivalence classes known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

    [i] := {j ∈ S : j ↭ i}.

So, by definition, [i] = [j] if and only if i ↭ j. Two communicating classes are either identical or completely disjoint.

    [Figure: a chain on the six states 1, 2, 3, 4, 5, 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

    Σ_{j∈C} p_{i,j} = 1,    for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: if i ∈ C and if i ⇝ j then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ∈ C and i ⇝ j. We then can find states i_1, ..., i_{n−1} such that p_{i,i_1} p_{i_1,i_2} · · · p_{i_{n−1},j} > 0. Since p_{i,i_1} > 0, and C is closed, we have i_1 ∈ C. Similarly, p_{i_1,i_2} > 0, and so i_2 ∈ C, and so on: we conclude that j ∈ C. The converse is left as an exercise.

Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction. If we remove the non-closed communicating classes then we obtain a collection of disjoint closed communicating classes with no arrows between them.

    [Figure: the general structure of a Markov chain, with communicating classes
     C1, C2, C3, C4, C5 and arrows from the non-closed classes towards the closed ones.]

The classes C4, C5 in the figure above are closed communicating classes. The communicating classes C1, C2, C3 are not closed.

If a state i is such that p_{i,i} = 1 then we say that i is an absorbing state. In other words, the single-element set {i} is a closed communicating class.

A general Markov chain can have an infinite number of communicating classes. If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In graph terms, irreducibility means that, starting from any state, we can reach any other state. To put it otherwise, the state space S is a closed communicating class.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. Why?

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. In other words, an inessential state i is such that there exists a state j with i ⇝ j but j ⇝̸ i. Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts.
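A sketch of how communicating classes can be computed mechanically from the digraph: compute the relation "leads to" by search and intersect it with its reverse. The adjacency lists below are a guess at the figure's edges (an assumption, chosen only to reproduce the four classes named above):

    def reachable(adj, i):
        """All states j with i leads-to j, found by depth-first search."""
        seen, stack = {i}, [i]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def communicating_classes(adj):
        states = list(adj)
        reach = {i: reachable(adj, i) for i in states}
        classes = []
        for i in states:
            cls = frozenset(j for j in reach[i] if i in reach[j])  # i <-> j
            if cls not in classes:
                classes.append(cls)
        return classes

    # Hypothetical edges consistent with the figure's four classes:
    print(communicating_classes({1: [2], 2: [3, 6], 3: [2, 4],
                                 4: [5], 5: [4], 6: [6]}))
    # -> the classes {1}, {2, 3}, {4, 5}, {6}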

5.4 Period

Consider the following chain:

    [Figure: four states 1, 2, 3, 4 arranged in a square; from each state
     the chain moves to one of its two neighbours with equal probability.]

The chain is irreducible: all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C_1 = {1, 3} at some point of time then it will move to the set C_2 = {2, 4} at the next step, and vice versa. The chain alternates between the two sets. So if the chain starts with X_0 ∈ C_1 then we know that X_1 ∈ C_2, X_2 ∈ C_1, X_3 ∈ C_2, X_4 ∈ C_1, .... We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. The period of an essential state i is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

    d(i) := gcd{n ∈ N : P_i(X_n = i) > 0}.

The period of an inessential state is not defined. A state i is called aperiodic if d(i) = 1. Let us denote by p^{(n)}_{ij} the entries of P^n, and let D_i = {n ∈ N : p^{(n)}_{ii} > 0}, so that d(i) = gcd D_i. Since P^{m+n} = P^m P^n, we have

    p^{(m+n)}_{ij} ≥ p^{(m)}_{ik} p^{(n)}_{kj},    for all m, n ≥ 0 and all i, j, k ∈ S.

So p^{(2n)}_{ii} ≥ (p^{(n)}_{ii})² and, more generally,

    p^{(ℓn)}_{ii} ≥ (p^{(n)}_{ii})^ℓ,    ℓ ∈ N.

Therefore if p^{(n)}_{ii} > 0 then p^{(ℓn)}_{ii} > 0 for all ℓ ∈ N. Another way to say this is: if the integer n divides m (denote this by: n | m) and if p^{(n)}_{ii} > 0 then p^{(m)}_{ii} > 0. So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m.

Theorem 2 (the period is a class property). If i ↭ j then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let D_i = {n ∈ N : p^{(n)}_{ii} > 0}, D_j = {n ∈ N : p^{(n)}_{jj} > 0}, d(i) = gcd D_i, d(j) = gcd D_j. If i ↭ j there is α ∈ N with p^{(α)}_{ij} > 0 and

there is β ∈ N with p^{(β)}_{ji} > 0. So

    p^{(α+β)}_{jj} ≥ p^{(β)}_{ji} p^{(α)}_{ij} > 0,

showing that α + β ∈ D_j, and therefore d(j) | α + β. Now let n ∈ D_i. Then p^{(n)}_{ii} > 0, and so

    p^{(α+n+β)}_{jj} ≥ p^{(β)}_{ji} p^{(n)}_{ii} p^{(α)}_{ij} > 0,

showing that α + n + β ∈ D_j, and therefore d(j) | α + β + n for all n ∈ D_i. If a number divides two other numbers then it also divides their difference. From the last two displays we get d(j) | n for all n ∈ D_i. So d(j) is a divisor of all elements of D_i. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j).

More generally, an irreducible chain has period d if one (and hence all) of its states have period d. In particular, if d = 1, the chain is called aperiodic. Of particular interest are irreducible chains, i.e. chains in which all states communicate with one another.

Theorem 3. If i is an aperiodic state then there exists n_0 such that p^{(n)}_{ii} > 0 for all n ≥ n_0.

Proof. Pick n_1, n_2 ∈ D_i such that n_2 − n_1 = 1. Let n be sufficiently large. Divide n by n_1 to obtain n = q n_1 + r, where the remainder r ≤ n_1 − 1. Therefore

    n = q n_1 + r(n_2 − n_1) = (q − r) n_1 + r n_2.

Because n is sufficiently large, q − r > 0. Since n_1, n_2 ∈ D_i, and since D_i is closed under addition, we have n ∈ D_i as well.

Corollary 3. An irreducible chain with finitely many states is aperiodic if and only if there exists an n such that p^{(n)}_{ij} > 0 for all i, j ∈ S.

If an irreducible chain has period d then we can decompose the state space into d sets C_0, C_1, ..., C_{d−1} such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C_0, C_1, ..., C_{d−1} such that

    Σ_{j∈C_{r+1}} p_{ij} = 1,    i ∈ C_r,  r = 0, 1, ..., d − 1.

(Here C_d := C_0.)

Proof. Define the relation

    i ~_d j ⇐⇒ p^{(nd)}_{ij} > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into disjoint equivalence classes. We show that these classes satisfy what we need. Assume d > 1. Pick a state i_0 and let C_0 be its equivalence class (i.e. the set of states j such that i_0 ~_d j). Then pick i_1 such that p_{i_0 i_1} > 0, and let C_1 be the equivalence class of i_1. Continue in this manner and define states i_0, i_1, ..., i_{d−1} with corresponding classes C_0, C_1, ..., C_{d−1}. It is easy to see that if we continue and pick a further state i_d with p_{i_{d−1},i_d} > 0 then, necessarily, i_d ∈ C_0. We now show that if i belongs to one of these classes and if p_{ij} > 0 then, necessarily, j belongs to the next class. Take, for example, i ∈ C_0. Suppose p_{ij} > 0 but j ∉ C_1; say, instead, j ∈ C_2. Consider the path

    i_0 → i → j → i_2 → i_3 → · · · → i_d → i_0.

Such a path is possible because of the choice of the i_0, i_1, ..., and by the assumptions that i_0 ~_d i and j ~_d i_2. The existence of such a path implies that it is possible to go from i_0 to i_0 in a number of steps which is an integer multiple of d − 1 (why?), which contradicts the definition of d.

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by

    T_A := inf{n ≥ 0 : X_n ∈ A}.

We are interested in deriving formulae for the probability that this time is finite as well as for the expectation of this time. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

    [Figure: states 0, 1, 2; from state 1 the chain moves to 0 with probability 1/2,
     stays at 1 with probability 1/3, and moves to 2 with probability 1/6;
     states 0 and 2 are absorbing.]

⁴ We shall later consider the time inf{n ≥ 1 : X_n ∈ A}, which differs from T_A simply by considering n ≥ 1 instead of n ≥ 0. If X_0 ∈ A the two times coincide. We avoid excessive notation by using the same letter for both. The reader is warned to be alert as to which of the two variables we are considering at each time.

It is clear that P_1(T_0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define

    ϕ(i) := P_i(T_A < ∞).

We then have:

Theorem 5. The function ϕ(i) satisfies

    ϕ(i) = 1,    i ∈ A,
    ϕ(i) = Σ_{j∈S} p_{ij} ϕ(j),    i ∉ A.

Furthermore, if ϕ′(i) is any other solution of these equations then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A then T_A = 0, and so ϕ(i) = 1. If i ∉ A, then T_A ≥ 1. So T_A = 1 + T′_A, where T′_A is the remaining time until A is hit. We first have

    P_i(T_A < ∞) = Σ_{j∈S} P_i(1 + T′_A < ∞ | X_1 = j) P_i(X_1 = j) = Σ_{j∈S} P_i(T′_A < ∞ | X_1 = j) p_{ij}.

But observe that the random variable T′_A is a function of the future after time 1 (i.e. a function of X_1, X_2, ...). Hence, conditionally on X_1, the event {T′_A < ∞} is independent of X_0. Therefore,

    P(T′_A < ∞ | X_1 = j, X_0 = i) = P(T′_A < ∞ | X_1 = j).

But the Markov chain is homogeneous, which implies that

    P(T′_A < ∞ | X_1 = j) = P(T_A < ∞ | X_0 = j) = P_j(T_A < ∞) = ϕ(j).

Combining the above, we have

    P_i(T_A < ∞) = Σ_{j∈S} ϕ(j) p_{ij},

as needed. For the second part of the proof, let ϕ′(i) be another solution. Then

    ϕ′(i) = Σ_{j∈A} p_{ij} + Σ_{j∉A} p_{ij} ϕ′(j).

By self-feeding this equation, we have

    ϕ′(i) = Σ_{j∈A} p_{ij} + Σ_{j∉A} p_{ij} ( Σ_{k∈A} p_{jk} + Σ_{k∉A} p_{jk} ϕ′(k) )
          = Σ_{j∈A} p_{ij} + Σ_{j∉A} p_{ij} Σ_{k∈A} p_{jk} + Σ_{j∉A} Σ_{k∉A} p_{ij} p_{jk} ϕ′(k).

We recognise that the first term equals P_i(T_A = 1), and the second equals P_i(T_A = 2), so the first two terms together equal P_i(T_A ≤ 2). By omitting the last term we obtain ϕ′(i) ≥ P_i(T_A ≤ 2). By continuing to self-feed the above equation n times, we obtain ϕ′(i) ≥ P_i(T_A ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ P_i(T_A < ∞) = ϕ(i).

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and let ϕ(i) = P_i(T_0 < ∞). We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0 it takes no time to hit 0; if the chain starts from 2 it will never hit 0.) As for ϕ(1), we have

    ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time,

    ψ(i) := E_i T_A.

We have:

Theorem 6. The function ψ(i) satisfies

    ψ(i) = 0,    i ∈ A,
    ψ(i) = 1 + Σ_{j∉A} p_{ij} ψ(j),    i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A then, obviously, T_A = 0, and so ψ(i) := E_i T_A = 0, because it takes no time to hit A if the chain starts from A. If i ∉ A, then T_A = 1 + T′_A, as above. Therefore,

    E_i T_A = 1 + E_i T′_A = 1 + Σ_{j∈S} p_{ij} E_j T_A = 1 + Σ_{j∉A} p_{ij} ψ(j),

which is the second of the equations. The second part of the proof is omitted.

Example: Continuing the previous example, let A = {0, 2} and let ψ(i) = E_i T_A. Clearly, ψ(0) = ψ(2) = 0. Start the chain from 1. We have

    ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been obvious from the start, because the underlying experiment is the flipping of a coin with "success" probability 2/3, therefore the mean number of flips until the first success is 3/2.)
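A sketch recovering ϕ(1) = 3/4 and ψ(1) = 3/2 from the equations of Theorems 5 and 6, for the three-state chain of the figure:

    import numpy as np

    # States 0, 1, 2; from state 1: to 0 w.p. 1/2, stay w.p. 1/3, to 2 w.p. 1/6.
    P = np.array([[1.0, 0.0, 0.0],
                  [1/2, 1/3, 1/6],
                  [0.0, 0.0, 1.0]])

    # phi(1) = p_{10} * phi(0) + p_{11} * phi(1), with phi(0) = 1, phi(2) = 0.
    phi1 = P[1, 0] / (1 - P[1, 1])   # (1/2) / (2/3) = 3/4
    print(phi1)

    # psi(1) = 1 + p_{11} * psi(1), with A = {0, 2} and psi(0) = psi(2) = 0.
    psi1 = 1.0 / (1 - P[1, 1])       # 3/2
    print(psi1)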

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times T_A, T_B for two disjoint sets of states A, B. We may ask to find the probabilities

    ϕ_{AB}(i) = P_i(T_A < T_B).

It should be clear that the equations satisfied are:

    ϕ_{AB}(i) = 1,    i ∈ A,
    ϕ_{AB}(i) = 0,    i ∈ B,
    ϕ_{AB}(i) = Σ_{j∈S} p_{ij} ϕ_{AB}(j),    otherwise.

Furthermore, ϕ_{AB} is the minimal solution to these equations.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: Every time the state is x a reward f(x) is earned. This happens up to the first hitting time T_A of the set A. The total reward is f(X_0) + f(X_1) + · · · + f(X_{T_A}). We are interested in the mean total reward

    h(x) := E_x Σ_{n=0}^{T_A} f(X_n).

Clearly, if x ∈ A, h(x) = f(x), because when X_0 ∈ A then T_A = 0 and so the total reward is f(X_0). Next, if x ∉ A, then T_A = 1 + T′_A, where T′_A is the remaining time, as argued earlier. Then

    h(x) = E_x Σ_{n=0}^{1+T′_A} f(X_n) = f(x) + E_x Σ_{n=1}^{1+T′_A} f(X_n) = f(x) + E_x Σ_{n=0}^{T′_A} f(X_{n+1}),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X_1, X_2, .... Hence, by the Markov property,

    E_x Σ_{n=0}^{T′_A} f(X_{n+1}) = Σ_{y∈S} p_{xy} h(y).

Thus, the set of equations satisfied by h are

    h(x) = f(x),    x ∈ A,
    h(x) = f(x) + Σ_{y∈S} p_{xy} h(y),    x ∉ A.

You can remember the latter ones by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, for i = 1, 2, 3, unless he is in room 4, in which case he dies immediately (and collects nothing, so f(4) = 0).

    [Figure: four rooms, 1 and 2 on top, 4 and 3 below; doors connect
     rooms 1–2, 2–3, 3–4 and 4–1, and the thief picks a door at random.]

The question is to compute the average total reward if he starts from room 2. (Sooner or later, he will die in room 4; why?) If we let h(i) be the average total reward if he starts from room i, we have

    h(4) = 0
    h(1) = 1 + (1/2) h(2)
    h(3) = 3 + (1/2) h(2)
    h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
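A sketch solving the thief's equations as the linear system h = f + Qh, where Q is the transition matrix restricted to rooms 1, 2, 3 (room 4 is absorbing with zero reward):

    import numpy as np

    # Rooms ordered 1, 2, 3. From rooms 1 and 3 the thief goes to room 2 or
    # to room 4 with probability 1/2 each; from room 2 he goes to 1 or 3.
    Q = np.array([[0.0, 0.5, 0.0],    # room 1 -> room 2 (the other door leads to 4)
                  [0.5, 0.0, 0.5],    # room 2 -> rooms 1 and 3
                  [0.0, 0.5, 0.0]])   # room 3 -> room 2
    f = np.array([1.0, 2.0, 3.0])     # rewards collected in rooms 1, 2, 3

    h = np.linalg.solve(np.eye(3) - Q, f)   # solves h = f + Q h
    print(h)    # [5. 8. 7.], i.e. h(1) = 5, h(2) = 8, h(3) = 7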

7 Gambler's ruin

Consider the simplest game of chance: tossing a (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake. If ξ_k is the outcome of the k-th toss (where ξ_k = +1 means heads and ξ_k = −1 means tails), and if just before the k-th toss you decide to bet Y_k pounds, then after the realisation of the toss you will have earned Y_k pounds if heads came up or lost Y_k pounds if tails appeared. So your fortune after the n-th toss is

    X_n = X_0 + Σ_{k=1}^n Y_k ξ_k.

The successive stakes (Y_k, k ∈ N) form the strategy of the gambler. Clearly, no gambler is assumed to have any information about the outcome of the upcoming toss, so if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Y_k cannot depend on ξ_k. It can only depend on ξ_1, ..., ξ_{k−1} and–as is often the case in practice–on other considerations of, e.g., astrological nature.

We wish to consider the problem of finding the probability that the gambler will never lose her money. This has profound actuarial applications: An insurance company must make sure that the strategies it employs are such that–at least–the company will not go bankrupt (and, in addition, it will make enormous profits). The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p.

Let us consider a concrete problem. Consider a gambler playing a simple coin tossing game, as described above, betting £1 at each integer time (the simplest case, Y_k = 1 for all k) against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler: X_0 = x. She has a target: to make b pounds (b > x), and she will stop playing if her money falls below a (a < x). We are clearly dealing with a Markov chain in the state space {a, a + 1, a + 2, ..., b} with transition probabilities

    p_{i,i+1} = p,    p_{i,i−1} = q = 1 − p,    a < i < b,

and where the states a, b are absorbing. Consider the hitting time of state y:

    T_y := inf{n ≥ 0 : X_n = y}.

We use the notation P_x (respectively, E_x) for the probability (respectively, expectation) under the initial condition X_0 = x.

Theorem 7. The probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

    P_x(T_b < T_a) = ((q/p)^{x−a} − 1) / ((q/p)^{b−a} − 1),    if p ≠ q,
    P_x(T_b < T_a) = (x − a)/(b − a),    if p = q = 1/2,

for a ≤ x ≤ b.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let

    ϕ(x, ℓ) := P_x(T_ℓ < T_0),    0 ≤ x ≤ ℓ.

The general solution will then be obtained from

    P_x(T_b < T_a) = ϕ(x − a, b − a).

(Think why!) To simplify things, write ϕ(x) instead of ϕ(x, ℓ). We obviously have

    ϕ(0) = 0,    ϕ(ℓ) = 1,

and, by first-step analysis,

    ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1),    0 < x < ℓ.

Since p + q = 1, we can write this as

    p [ϕ(x + 1) − ϕ(x)] = q [ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

    δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0).

Assuming that λ := q/p ≠ 1, we find

    ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^{x−1} + · · · + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).

Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),    0 ≤ x ≤ ℓ,  λ ≠ 1.

If λ = 1 then q = p, and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ, and so

    ϕ(x) = x/ℓ,    0 ≤ x ≤ ℓ,  λ = 1.

Summarising, we have obtained

    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1), if λ ≠ 1;    ϕ(x, ℓ) = x/ℓ, if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a.

Corollary 4. The ruin probability of a gambler starting with £x is

    P_x(T_0 < ∞) = (q/p)^x, if p > 1/2;    P_x(T_0 < ∞) = 1, if p ≤ 1/2.

Proof. We have

    P_x(T_0 < ∞) = lim_{ℓ→∞} P_x(T_0 < T_ℓ) = 1 − lim_{ℓ→∞} P_x(T_ℓ < T_0).

If p > 1/2, then λ = q/p < 1, and so

    P_x(T_0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x = (q/p)^x.

If p < 1/2, then λ = q/p > 1, and so

    P_x(T_0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2,

    P_x(T_0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.
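A sketch comparing the formula of Theorem 7 with a brute-force simulation (the particular values of x, b, p below are illustrative):

    import random

    def ruin_formula(x, b, p, a=0):
        """P_x(T_b < T_a) from Theorem 7."""
        if p == 0.5:
            return (x - a) / (b - a)
        lam = (1 - p) / p
        return (lam**(x - a) - 1) / (lam**(b - a) - 1)

    def simulate(x, b, p, a=0, runs=100_000):
        wins = 0
        for _ in range(runs):
            pos = x
            while a < pos < b:
                pos += 1 if random.random() < p else -1
            wins += (pos == b)
        return wins / runs

    x, b, p = 5, 10, 0.45
    print(ruin_formula(x, b, p), simulate(x, b, p))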

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. Formally, a stopping time τ is a random variable with values in Z+ ∪ {+∞} such that 1(τ = n) is a deterministic function of (X_0, ..., X_n) for all n ∈ Z+. An example of a stopping time is the time T_A, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, conditionally on X_n, the future process (X_m, m ≥ n) is independent of the past process (X_m, m ≤ n). We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (X_n, n ≥ 0) is a time-homogeneous Markov chain. Let τ be a stopping time. Then, conditional on τ < ∞ and X_τ = i, the future process (X_{τ+n}, n ≥ 0) is independent of the past process (X_0, ..., X_τ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X_0, ..., X_τ) if τ < ∞. Since

    (X_0, ..., X_τ) 1(τ < ∞) = Σ_{n∈Z+} (X_0, ..., X_n) 1(τ = n),

any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X_0, ..., X_n). We are going to show that, for any such A,

    P(X_{τ+1} = j_1, X_{τ+2} = j_2, ... | X_τ = i, A, τ < ∞) = p_{i,j_1} p_{j_1,j_2} · · ·

and that

    P(X_{τ+1} = j_1, X_{τ+2} = j_2, ..., A | X_τ = i, τ < ∞) = p_{i,j_1} p_{j_1,j_2} · · · P(A | X_τ = i, τ < ∞).

We have:

    P(X_{τ+1} = j_1, X_{τ+2} = j_2, ..., A, X_τ = i, τ = n)
      (a) = P(X_{n+1} = j_1, X_{n+2} = j_2, ..., A, X_n = i, τ = n)
      (b) = P(X_{n+1} = j_1, X_{n+2} = j_2, ... | X_n = i, A, τ = n) P(X_n = i, A, τ = n)
      (c) = P(X_{n+1} = j_1, X_{n+2} = j_2, ... | X_n = i) P(X_n = i, A, τ = n)
      (d) = p_{i,j_1} p_{j_1,j_2} · · · P(X_n = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X_0, ..., X_n) and the ordinary splitting property at n, and (d) is due to the fact that (X_n) is a time-homogeneous Markov chain with transition probabilities p_{ij}. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(X_τ = i, τ < ∞).

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain: the random variables X_n comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property). Schematically, these excursions are independent:

    [Figure: a sample path split at the successive visit times
     T_i = T_i^{(1)} < T_i^{(2)} < T_i^{(3)} < T_i^{(4)} of state i;
     the pieces between consecutive visits are the excursions.]

Fix a state i ∈ S and let T_i be the first time that the chain will visit this state:

    T_i := inf{n ≥ 1 : X_n = i}.

Note the subtle difference between this T_i and the T_i of Section 6: the T_i of Section 6 was allowed to take the value 0; our T_i here is always positive. If X_0 = i then the two definitions coincide. We avoid excessive notation by using the same letter for both; the reader is warned to be alert as to which of the two variables we are considering at each time.

State i may or may not be visited again. If it is, we let T_i^{(2)} be the time of the second visit, T_i^{(3)} the time of the third visit, and so on. Formally, we recursively define

    T_i^{(r)} := inf{n > T_i^{(r−1)} : X_n = i},    r = 2, 3, ...,

(and we set T_i^{(1)} := T_i). All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

    X_i^{(r)} := (X_n, T_i^{(r)} ≤ n < T_i^{(r+1)}),    r = 1, 2, ....

Let also

    X_i^{(0)} := (X_n, 0 ≤ n < T_i^{(1)}).

We think of them as random variables, in a generalised sense: random trajectories. We refer to them as excursions of the process. So X_i^{(r)} is the r-th excursion: the chain visits

An immediate consequence of the strong Markov property is the following observation: for time-homogeneous Markov chains, if the chain starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time. More precisely:

Theorem 9. The excursions Xi^(0), Xi^(1), Xi^(2), . . . are independent random variables. Moreover, all, except, possibly, Xi^(0), have the same distribution.

Proof. Apply the Strong Markov Property (SMP) at Ti^(r): SMP says that, given the state at the stopping time Ti^(r), the future after Ti^(r) is independent of the past before Ti^(r). But the state at Ti^(r) is, by definition, equal to i. Therefore, the future is independent of the past, and this proves the first claim. The claim that Xi^(1), Xi^(2), . . . have the same distribution follows from the fact that the Markov chain is time-homogeneous.

A trivial, but important corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(Xi^(0)), g(Xi^(1)), g(Xi^(2)), . . . of the excursions are independent random variables.

Example: Define

  Λr := Σ_{Ti^(r) ≤ n < Ti^(r+1)} 1(Xn = j).

The meaning of Λr is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ1, Λ2, . . . are independent random variables; moreover, they are also identical in law (i.d.).

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions. First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:

  Pi(Xn = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:

  Pi(Xn = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:

  0 < Pi(Xn = i for infinitely many n) < 1.

We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state i (excluding the possibility that X0 = i):

  Ji := Σ_{n=1}^∞ 1(Xn = i) = sup{r ≥ 0 : Ti^(r) < ∞}.

Notice that: state i is visited infinitely many times if and only if Ji = ∞. Notice also that

  fii := Pi(Ji ≥ 1) = Pi(Ti < ∞)

can either be equal to 1 or smaller than 1. We claim that:

Lemma 4. Starting from X0 = i, the random variable Ji is geometric:

  Pi(Ji ≥ k) = fii^k, k ≥ 0.

Proof. By the strong Markov property at time Ti^(k-1),

  Pi(Ji ≥ k) = Pi(Ji ≥ k - 1) Pi(Ji ≥ 1).

Indeed, the event {Ji ≥ k} implies that Ti^(k-1) < ∞ and that, after the (k-1)-st return, the chain returns to i at least once more. But the past before Ti^(k-1) is independent of the future, and that is why we get the product. Therefore,

  Pi(Ji ≥ k) = fii Pi(Ji ≥ k - 1) = fii^2 Pi(Ji ≥ k - 2) = · · · = fii^k.

Because either fii = 1 or fii < 1, we conclude what we asserted earlier, namely, that every state is either recurrent or transient. Indeed, if fii = 1 we have that Pi(Ji ≥ k) = 1 for all k, i.e. Pi(Ji = ∞) = 1: this means that the state i is recurrent. If fii < 1, then Pi(Ji < ∞) = 1: this means that the state i is transient.
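A quick Monte Carlo sketch of Lemma 4 (the chain is an illustrative biased random walk on the integers, moving up with probability p = 0.7, for which the return probability f00 = 2 min(p, 1-p) = 0.6 is a standard fact; the finite horizon slightly undercounts visits):

    import random

    def J0(p=0.7, horizon=2000):
        # Number of visits of the p-biased walk to its starting point,
        # excluding time 0 (truncated at a finite horizon).
        x, count = 0, 0
        for _ in range(horizon):
            x += 1 if random.random() < p else -1
            if x == 0:
                count += 1
        return count

    samples = [J0() for _ in range(20_000)]
    f00 = 0.6
    for k in range(5):
        empirical = sum(s >= k for s in samples) / len(samples)
        print(k, round(empirical, 3), round(f00 ** k, 3))   # P(J >= k) vs f00^k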

We summarise this, together with another useful characterisation of recurrence & transience, in the following two lemmas.

Lemma 5. The following are equivalent:
1. State i is recurrent.
2. fii = Pi(Ti < ∞) = 1.
3. Pi(Ji = ∞) = 1.
4. Ei Ji = ∞.

Proof. The equivalence of the first two statements was proved before the lemma. Statement 3 is equivalent to statement 1 by the definition of Ji. If Pi(Ji = ∞) = 1 then, clearly, Ei Ji = ∞. Conversely, since Ji is a geometric random variable, we see that its expectation cannot be infinity without Ji itself being infinity with probability one.

Lemma 6. The following are equivalent:
1. State i is transient.
2. fii = Pi(Ti < ∞) < 1.
3. Pi(Ji < ∞) = 1.
4. Ei Ji < ∞.

Proof. The equivalence of the first two statements was proved above. Statement 3, by the definition of Ji, is equivalent to saying that i is visited finitely many times with probability 1. If i is transient then, owing to the fact that Ji is a geometric random variable with fii < 1, we have Ei Ji < ∞. On the other hand, if we assume Ei Ji < ∞, then Ji cannot take the value +∞ with positive probability, and therefore Pi(Ji < ∞) = 1.

Corollary 6. If i is a transient state then pii^(n) → 0, as n → ∞.

Proof. If i is transient then Ei Ji < ∞. But

  Ei Ji = Ei Σ_{n=1}^∞ 1(Xn = i) = Σ_{n=1}^∞ Pi(Xn = i) = Σ_{n=1}^∞ pii^(n).

But if a series is finite then the summands go to 0, i.e. pii^(n) → 0, as n → ∞.

10.1 First hitting time decomposition

Define

  fij^(n) := Pi(Tj = n), n ≥ 1,

the probability to hit state j for the first time at time n, assuming that the chain starts from state i. Notice the difference with pij^(n) = Pi(Xn = j): the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, fij^(n) ≤ pij^(n), but the two sequences are related in an exact way:

Lemma 7.

  pij^(n) = Σ_{m=1}^n fij^(m) pjj^(n-m), n ≥ 1.

Proof.

  pij^(n) = Pi(Xn = j) = Σ_{m=1}^n Pi(Tj = m, Xn = j) = Σ_{m=1}^n Pi(Tj = m) Pi(Xn = j | Tj = m).

The first term in each summand equals fij^(m), by definition. For the second term, we use the Strong Markov Property:

  Pi(Xn = j | Tj = m) = Pi(Xn = j | Xm = j) = pjj^(n-m).

Corollary 7. If j is a transient state then pij^(n) → 0, as n → ∞, for all states i.

Proof. If j is transient, we have Σ_{n=1}^∞ pjj^(n) = Ej Jj < ∞. By summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums in the right hand side)

  Σ_{n=1}^∞ pij^(n) = Σ_{m=1}^∞ fij^(m) Σ_{n=m}^∞ pjj^(n-m) = (1 + Ej Jj) Σ_{m=1}^∞ fij^(m) = (1 + Ej Jj) Pi(Tj < ∞),

which is finite. Hence pij^(n) → 0, as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let fij := Pi(Tj < ∞), we have

  i ⇝ j ⟺ fij > 0.

The last statement is simply an expression in symbols of what we said above.

We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states have a certain period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and

  fij = Pi(Tj < ∞) = 1.

In particular, not only fij > 0, but also fij = 1; and j ⇝ i.

Proof. Start with X0 = i. Since i is recurrent, the excursions Xi^(0), Xi^(1), . . . are i.i.d. If i ⇝ j, there is a positive probability (= fij > 0) that j will appear in one of them, and so the probability that j will appear in a specific excursion is positive:

  gij := Pi(Tj < Ti) > 0.

So the random variables

  δ_{j,r} := 1(j appears in excursion Xi^(r)), r = 0, 1, . . . ,

are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence j is recurrent and fij = 1; moreover, j ⇝ i.

Corollary 8. In a communicating class, either all states are recurrent or all states are transient.

Proof. If a communicating class contains a state which is transient then, necessarily, all other states are also transient: if some state were recurrent then, by Theorem 10, every state it leads to, including the given one, would be recurrent.

Observe also that, by the above theorem, if i is recurrent and i ⇝ j then fij = 1, i.e. j is visited for sure. So if a communicating class is recurrent then it contains no inessential states, and so it is closed.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ̸⇝ i. There is m such that Pi(A) := Pi(Xm = j) > 0. Let B := {Xn = i for infinitely many n}. Then Pi(A ∩ B) = Pi(Xm = j, Xn = i for infinitely many n) = 0, because from j the chain can never return to i. Hence

  0 < Pi(A) = Pi(A ∩ B) + Pi(A ∩ B^c) = Pi(A ∩ B^c) ≤ Pi(B^c),

so that Pi(B) = Pi(Xn = i for infinitely many n) < 1. Since this probability is either 0 or 1, necessarily Pi(B) = 0, and so i is a transient state.

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, some state i must be visited infinitely many times. This state is recurrent. Hence, by the above theorem [Theorem 10], every other j ∈ C is also recurrent. Hence all states are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let Ti be the first return time to i then either Pi(Ti < ∞) = 1 (in which case we say that i is recurrent) or Pi(Ti < ∞) < 1 (transient state). We now take a closer look at recurrence. Recall that a finite random variable may not necessarily have finite expectation.

Example: Let U0, U1, U2, . . . be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. Let

  T := inf{n ≥ 1 : Un > U0}.

Thus T represents the number of variables we need to draw to get a value larger than the initial one. Such random times appear, for example, in a certain game. First, observe that T is finite:

  P(T > n) = P(U1 ≤ U0, . . . , Un ≤ U0) = ∫_0^1 P(U1 ≤ x, . . . , Un ≤ x | U0 = x) dx = ∫_0^1 x^n dx = 1/(n+1).

Therefore,

  P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 - 1/(n+1)) = 1.

What is the expectation of T? We have

  ET = Σ_{n≥0} P(T > n) = Σ_{n≥0} 1/(n+1) = ∞.

You are invited to think about the meaning of this: if the initial value U0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!
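A simulation sketch of this example (pure Python; the printout shows that P(T > 10) matches 1/11 while the sample mean keeps drifting upward with the sample size, the telltale sign of an infinite expectation):

    import random

    def T():
        u0, n = random.random(), 1
        while random.random() <= u0:   # keep drawing until we exceed U0
            n += 1
        return n

    for size in (10**3, 10**4, 10**5, 10**6):
        samples = [T() for _ in range(size)]
        tail = sum(s > 10 for s in samples) / size
        print(size, round(tail, 4), round(sum(samples) / size, 1))
    print("P(T > 10) should be 1/11 =", round(1 / 11, 4))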

We now classify a recurrent state i as follows. We say that i is positive recurrent if

  Ei Ti < ∞.

We say that i is null recurrent if

  Ei Ti = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as "Law". But this is misleading: it is not a Law; it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z1, Z2, . . . be i.i.d. random variables.
(i) Suppose E|Z1| < ∞ and let µ = EZ1. Then

  P( lim_{n→∞} (Z1 + · · · + Zn)/n = µ ) = 1.

(ii) Suppose E max(Z1, 0) = ∞, but E min(Z1, 0) > -∞. Then

  P( lim_{n→∞} (Z1 + · · · + Zn)/n = ∞ ) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X0, X1, X2, . . . of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified at an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions

  Xa^(1), Xa^(2), Xa^(3), . . .

forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that, for reasonable choices of a function G taking values in R,

  G(Xa^(1)), G(Xa^(2)), G(Xa^(3)), . . .

are i.i.d. random variables. Whichever the choice of G, the law of large numbers tells us that

  (G(Xa^(1)) + · · · + G(Xa^(n)))/n → EG(Xa^(1)),   (8)

with probability 1, as n → ∞. Note that

  EG(Xa^(1)) = EG(Xa^(2)) = · · · = E[G(Xa^(0)) | X0 = a] = Ea G(Xa^(0)),

because, if we start with X0 = a, the initial excursion Xa^(0) also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X0 = a.

Example 1: To each state x ∈ S, assign a reward f(x), which we can think of as a reward received when the chain is in state x. For an excursion X, define G(X) to be the maximum reward received during this excursion. Thus, in this example,

  G(Xa^(r)) = max{f(Xt) : Ta^(r) ≤ t < Ta^(r+1)}.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:

  G(Xa^(r)) = Σ_{Ta^(r) ≤ t < Ta^(r+1)} f(Xt).   (7)

13.2 Average reward

Consider now a function f : S → R, as in Example 2. We are interested in the existence of the limit

  time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(Xk),

which represents the "time-average reward" received. Specifically, we shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. Ea Ta < ∞) and assume that the distribution of X0 is such that P(Ta < ∞) = 1. Let f : S → R+ be a bounded "reward" function. Then, with probability one, the "time-average reward" f̄ exists and is a deterministic quantity:

  P( lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = f̄ ) = 1,

where

  f̄ = Ea[ Σ_{n=0}^{Ta-1} f(Xn) ] / Ea Ta = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this. Suppose that t is large enough. We are looking for the total reward received between 0 and t. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, Nt, of complete excursions from the ground state a that occur between 0 and t:

  Nt := max{r ≥ 1 : Ta^(r) ≤ t}.

[Figure: the times Ta^(1), . . . , Ta^(5) marked on the interval [0, t]; in this picture Nt = 5.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:

  Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt-1} Gr + Gt^first + Gt^last,

where Gr is given by (7), Gt^first is the reward up to time Ta = Ta^(1) or t (whichever is smaller), and Gt^last is the reward over the last incomplete cycle.

We assumed that f is bounded, say |f(x)| ≤ B. Hence |Gt^first| ≤ B Ta. Since P(Ta < ∞) = 1, we have B Ta/t → 0, and so

  (1/t) Gt^first → 0, as t → ∞,

with probability one. A similar argument shows that

  (1/t) Gt^last → 0, as t → ∞,

with probability one.

So we are left with the sum of the rewards over complete cycles. Write now

  (1/t) Σ_{r=1}^{Nt-1} Gr = (Nt/t) · (1/Nt) Σ_{r=1}^{Nt-1} Gr.   (9)

Look again at what the Law of Large Numbers (8) tells us: that

  lim_{n→∞} (1/n) Σ_{r=1}^n Gr = EG1,   (10)

with probability one. Since Ta^(r) → ∞, as r → ∞, it follows that the number Nt of complete cycles before time t will also be tending to ∞, as t → ∞. Hence, also,

  lim_{t→∞} (1/Nt) Σ_{r=1}^{Nt-1} Gr = EG1,

with probability one.

Now look at the sequence

  Ta^(r) = Ta^(1) + Σ_{k=2}^r (Ta^(k) - Ta^(k-1)),

and see what the Law of Large Numbers says: the differences Ta^(k) - Ta^(k-1), k = 2, 3, . . . , are i.i.d. random variables with common expectation Ea Ta, and Ta^(1)/r → 0, so we have

  lim_{r→∞} Ta^(r)/r = Ea Ta.   (11)

By definition, the visit with index r = Nt to the ground state a is the last before t (see also the figure above):

  Ta^(Nt) ≤ t < Ta^(Nt+1).

Hence

  Ta^(Nt)/Nt ≤ t/Nt < Ta^(Nt+1)/Nt.

But (11) tells us that

  lim_{t→∞} Ta^(Nt)/Nt = lim_{t→∞} Ta^(Nt+1)/Nt = Ea Ta.

Hence

  lim_{t→∞} Nt/t = 1/Ea Ta.   (12)

Using (10) and (12) in (9) we have

  lim_{t→∞} (1/t) Σ_{r=1}^{Nt-1} Gr = EG1 / Ea Ta.

Thus

  lim_{t→∞} (1/t) Σ_{n=0}^t f(Xn) = EG1 / Ea Ta =: f̄.

To see that f̄ is as claimed, just observe that

  EG1 = E Σ_{Ta^(1) ≤ t < Ta^(2)} f(Xt) = Ea Σ_{0 ≤ t < Ta} f(Xt),

from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion. The numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included); the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion: we may write

  EG1 = Ea Σ_{0 ≤ t < Ta} f(Xt) or EG1 = Ea Σ_{0 < t ≤ Ta} f(Xt),

not both.

[Figure: the rewards f(X0) = f(a), f(X1), f(X2), . . . , f(X_{Ta-1}) collected over one excursion from a.]

Comment 2: A very important particular case occurs when we choose

  f(x) := 1(x = b),

i.e. we let f be 1 at the state b and 0 everywhere else. Then the numerator equals

  Ea Σ_{k=0}^{Ta-1} 1(Xk = b) = (average no. of times state b is visited between two successive visits to state a),

and the theorem above then says that, with probability 1,

  lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = b) = Ea[ Σ_{k=0}^{Ta-1} 1(Xk = b) ] / Ea Ta,   (13)

for all b ∈ S.

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence

  lim_{t→∞} (1/t) Σ_{n=1}^t 1(Xn = a) = 1/Ea Ta,   (14)

with probability 1. The physical interpretation of the latter should be clear: if, on the average, it takes Ea Ta units of time between successive visits to the state a, then the long-run proportion of visits to the state a should equal 1/Ea Ta.

Comment 4: Suppose that two states, a, b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/Eb Tb. The equality of these two limits gives

  Ea[ Σ_{k=0}^{Ta-1} 1(Xk = b) ] / Ea Ta = 1/Eb Tb.
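These formulas are easy to check numerically. The sketch below (illustrative 3-state chain; π is obtained as the left eigenvector of P for eigenvalue 1) compares the simulated time-average reward with Σ_x f(x)π(x), and the empirical frequency of visits to the ground state a with π(a) = 1/Ea Ta.

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3],
                  [0.4, 0.4, 0.2]])      # illustrative chain
    f = np.array([1.0, -2.0, 5.0])       # illustrative reward function

    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi /= pi.sum()

    x, total, visits_a, t = 0, 0.0, 0, 200_000
    for _ in range(t):
        total += f[x]
        visits_a += (x == 0)
        x = rng.choice(3, p=P[x])
    print(total / t, f @ pi)        # time-average reward vs sum_x f(x) pi(x)
    print(visits_a / t, pi[0])      # visit frequency of a = 0 vs 1/E_a T_a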

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

  ν[a](x) := Ea Σ_{n=0}^{Ta-1} 1(Xn = x), x ∈ S,   (15)

should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state then ν[a](x)/Ea Ta is the long-run fraction of times that state x is visited. In view of this, this should have something to do with the concept of stationary distribution discussed a few sections ago. (Notice that the choice of notation ν[a] is justifiable through the notation (6) for the communicating class of the state a.)

Note: Although ν[a] depends on the choice of the "ground" state a, we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript [a] will soon be dropped.

Proposition 2. Fix a ground state a ∈ S and assume that it is recurrent. (We stress that we do not need the stronger assumption of positive recurrence here.) Then ν[a] satisfies the balance equations:

  ν[a] = ν[a]P, i.e. ν[a](x) = Σ_{y∈S} ν[a](y) pyx, x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν[a](x) < ∞.

Proof. We start the Markov chain with X0 = a. Recurrence tells us that Ta < ∞ with probability one. Since a = X_{Ta}, we can write

  ν[a](x) = Ea Σ_{n=1}^{Ta} 1(Xn = x).   (16)

This was a trivial but important step: we replaced the n = 0 term with the n = Ta term. Both of them equal 1(x = a), so there is nothing lost in doing so. The real reason for doing so is because we wanted to replace a function of X0, X1, X2, . . . by a function of X1, X2, X3, . . ., and thus be in a position to apply first-step analysis.

This is precisely what we do next:

  ν[a](x) = Ea Σ_{n=1}^∞ 1(Xn = x, n ≤ Ta)
      = Σ_{n=1}^∞ Pa(Xn = x, n ≤ Ta)
      = Σ_{n=1}^∞ Σ_{y∈S} Pa(Xn = x, X_{n-1} = y, n ≤ Ta)
      = Σ_{n=1}^∞ Σ_{y∈S} pyx Pa(X_{n-1} = y, n ≤ Ta)
      = Σ_{y∈S} pyx Ea Σ_{n=1}^∞ 1(X_{n-1} = y, n ≤ Ta)
      = Σ_{y∈S} pyx ν[a](y),

where, in this last step, we used (16). (The Markov property applies in the fourth line because the event {n ≤ Ta} is determined by X0, . . . , X_{n-1}.) So we have shown

  ν[a](x) = Σ_{y∈S} pyx ν[a](y), x ∈ S,

which are precisely the balance equations. We just showed that ν[a] = ν[a]P; by induction,

  ν[a] = ν[a]Pⁿ, for all n ∈ N.

Next, note that ν[a](a) = 1, by definition. Let x be such that a ⇝ x, so there exists n ≥ 1 such that pax^(n) > 0. Hence

  ν[a](x) ≥ ν[a](a) pax^(n) = pax^(n) > 0.

From Theorem 10, x ⇝ a, so there exists m ≥ 1 such that pxa^(m) > 0. Therefore

  1 = ν[a](a) ≥ ν[a](x) pxa^(m),

so, indeed, ν[a](x) < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. Then, summing over x,

  Σ_{x∈S} ν[a](x) = Ea Σ_{n=1}^{Ta} Σ_{x∈S} 1(Xn = x) = Ea Σ_{n=1}^{Ta} 1 = Ea Ta,

which is finite when a is positive recurrent.
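A simulation sketch of Proposition 2 (same illustrative 3-state chain as before): estimate ν[a](x) as the average number of visits to x per excursion from a, then check the balance equations ν[a] = ν[a]P numerically.

    import numpy as np

    rng = np.random.default_rng(1)
    P = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3],
                  [0.4, 0.4, 0.2]])      # illustrative chain
    a, n_excursions = 0, 50_000
    nu = np.zeros(3)
    for _ in range(n_excursions):
        x = a
        while True:                      # one excursion: times 0, ..., T_a - 1
            nu[x] += 1
            x = rng.choice(3, p=P[x])
            if x == a:
                break
    nu /= n_excursions
    print(nu)                  # estimate of nu^[a]; the a-entry should be ~1
    print(nu @ P)              # should agree with nu: the balance equations
    print(nu / nu.sum())       # nu / E_a T_a: the stationary distribution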

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define

  π[a](x) := ν[a](x)/Ea Ta = Ea[ Σ_{n=0}^{Ta-1} 1(Xn = x) ] / Ea Ta, x ∈ S.   (17)

Then π[a] is a stationary distribution.

Proof. We have already shown, in Proposition 2, that ν[a]P = ν[a]. But π[a] is simply a multiple of ν[a], so

  π[a]P = π[a].

It only remains to show that π[a] is a probability, i.e. that it sums up to one. Since a is positive recurrent, we have Ea Ta < ∞, and

  Σ_{x∈S} π[a](x) = (1/Ea Ta) Σ_{x∈S} ν[a](x) = 1.

Also, π[a](a) > 0, and so π[a] is not trivially equal to zero.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique; this is what we will try to do in the sequel. In other words: if there is a positive recurrent state, then the set of linear equations

  π(x) = Σ_{y∈S} π(y) pyx, x ∈ S, Σ_{y∈S} π(y) = 1,

have at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state. Suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X0 = i. We already know that j is recurrent. Consider the excursions Xi^(r), r = 0, 1, 2, . . . The probability gij = Pi(Tj < Ti) that j occurs in a specific excursion is positive and, since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. In other words, to decide whether j will occur in a specific excursion, we toss a coin with probability of heads gij > 0; the coins are i.i.d. If heads show up at the r-th excursion, then j occurs in the r-th excursion; otherwise, if tails show up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R1 (respectively, R2) be the index of the first (respectively, second) excursion that contains j.

[Figure: the excursions from i with indices R1, R1 + 1, . . . , R2; the stretch of trajectory between the first and the second occurrence of j has the same law as Tj under Pj.]

The stretch of trajectory between the first and the second occurrence of j has the same law as Tj under Pj, and it is covered by the excursions with indices R1, . . . , R2. So we just need to show that the sum of the durations of the excursions with indices R1, . . . , R2,

  ζ := Σ_{r=1}^{R2-R1+1} Zr,

has finite expectation, where the Zr are i.i.d. with the same law as Ti. Now R2 - R1 is a geometric random variable:

  Pi(R2 - R1 = r) = (1 - gij)^{r-1} gij, r = 1, 2, . . .

By Wald's lemma, the expectation of this random variable equals

  Ei ζ = (Ei Ti) × Ei(R2 - R1 + 1) = (Ei Ti) × (gij^{-1} + 1).

Since gij > 0 and Ei Ti < ∞, we have that Ei ζ < ∞. Hence Ej Tj < ∞, i.e. j is positive recurrent.

Corollary 10. Let C be a communicating class. Then either all states in C are positive recurrent, or all states are null recurrent, or all states are transient.

An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. An irreducible Markov chain is called transient if one (and hence all) states are transient. In other words, an irreducible chain is either recurrent or transient, and a recurrent irreducible chain is either positive recurrent or null recurrent:

  IRREDUCIBLE CHAIN: RECURRENT (POSITIVE RECURRENT or NULL RECURRENT) or TRANSIENT.

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Figure: states 0, 1, 2; state 1 moves to 0 with probability 1/3, stays at 1 with probability 1/2, and moves to 2 with probability 1/6; states 0 and 2 are absorbing (self-loop probability 1).]

Clearly, state 0 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). Hence there is a stationary distribution. But observe that there are many, infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by

  π(0) = α, π(1) = 0, π(2) = 1 - α,

is a stationary distribution. (You should think why this is OBVIOUS!) Notice that this lack of communication between the states is, indeed, what prohibits uniqueness.
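A two-line numerical check of this example (the matrix encodes the figure as read above, itself a reconstruction of the diagram; the conclusion does not depend on the exact row of the transient state, since π(1) = 0): every vector (α, 0, 1 - α) passes the test πP = π.

    import numpy as np

    P = np.array([[1.0, 0.0, 0.0],
                  [1/3, 1/2, 1/6],
                  [0.0, 0.0, 1.0]])   # 0 and 2 absorbing, 1 transient
    for alpha in (0.0, 0.3, 1.0):
        pi = np.array([alpha, 0.0, 1 - alpha])
        print(alpha, np.allclose(pi @ P, pi))   # True for every alpha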

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution. Since the chain is irreducible and positive recurrent, we have Pi(Tj < ∞) = 1 for all states i and j. Then

  P(Ta < ∞) = Σ_{x∈S} π(x) Px(Ta < ∞) = Σ_{x∈S} π(x) = 1.   (18)

Consider the Markov chain (Xn, n ≥ 0) and suppose that P(X0 = x) = π(x), x ∈ S, i.e. we start the Markov chain from the stationary distribution π. Then, for all n, we have P(Xn = x) = π(x), for all x ∈ S. Hence, for all t,

  E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] = (1/t) Σ_{n=1}^t P(Xn = x) = π(x), x ∈ S.

By (13), and since, by (18), P(Ta < ∞) = 1, we have, with probability one,

  (1/t) Σ_{n=1}^t 1(Xn = x) → Ea[ Σ_{k=0}^{Ta-1} 1(Xk = x) ] / Ea Ta = π[a](x), as t → ∞.

By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides:

  E[ (1/t) Σ_{n=1}^t 1(Xn = x) ] → π[a](x), as t → ∞.

Therefore π(x) = π[a](x), for all x ∈ S. Thus, an arbitrary stationary distribution π must be equal to the specific one π[a]. Hence there is only one stationary distribution.

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π[a] = π[b]. In particular, we obtain the cycle formula:

  Ea[ Σ_{k=0}^{Ta-1} 1(Xk = x) ] / Ea Ta = Eb[ Σ_{k=0}^{Tb-1} 1(Xk = x) ] / Eb Tb, x ∈ S.

Proof. Both π[a] and π[b] are stationary distributions, so they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,

  π(a) = 1/Ea Ta.

Proof. There is a unique stationary distribution; namely, from the definition of π[a], π(a) = π[a](a) = 1/Ea Ta.

The following is a sort of converse to that, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider a Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0 and let C be the communicating class of i. Consider the restriction of the chain to C, i.e. delete all transition probabilities pxy with x ∈ C, y ∉ C. We can easily see that we thus obtain a Markov chain with state space C whose stationary distribution is π(·|C), the distribution defined by restricting π to C:

  π(x|C) := π(x)/π(C), x ∈ C, where π(C) = Σ_{y∈C} π(y).

(This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible.) By the above corollary,

  π(i|C) = 1/Ei Ti.

But π(i|C) > 0. Therefore Ei Ti < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. In Section 14 we assumed the existence of a positive recurrent state a and proved the existence of a stationary distribution π. Consider an arbitrary Markov chain and decompose it into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π^C which assigns positive probability to each state in C.

This π^C is defined by picking a state (any state!) a ∈ C and letting π^C := π[a].

Proposition 3. Any stationary distribution π is necessarily of the form

  π = Σ_{C: C positive recurrent communicating class} αC π^C,

where αC ≥ 0, such that Σ_C αC = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then x belongs to some positive recurrent class C. For each such C define the conditional distribution

  π(x|C) := π(x)/π(C), where π(C) := Σ_{x∈C} π(x).

Then (π(x|C), x ∈ C) is a stationary distribution. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with αC = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(Xn = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesàro averages (1/n) Σ_{k=1}^n P(Xk = x) converge to π(x), as n → ∞. A sequence of real numbers an may fail to converge while its Cesàro averages (1/n)(a1 + · · · + an) converge: for example, take an = (-1)ⁿ.

Why stability? There are physical reasons why we are interested in the stability of a Markov chain. A Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an oscillatory motion ad infinitum.

Still, we may say it is stable, because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:

  ẋ = -ax + b, x ∈ R, a > 0.

There are myriads of applications giving physical interpretation to this. The thing to observe here is that there is an equilibrium point, i.e. the one found by setting the right-hand side equal to zero:

  x* = b/a.

[Figure: solution curves x(t), started from various initial values, all converging to the equilibrium x* as t grows.]

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If, moreover, we start from any other position then, regardless of the starting position, we have x(t) → x*, as t → ∞. We say that x* is a stable equilibrium.

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:

  Xn+1 = ρXn + ξn,

where the ξn are random variables (a simple system, even though the driving terms ξn depend on the time parameter n). There are many possible interpretations, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, or an example in physics, or whatever. Specifically, we here assume that X0, ξ0, ξ1, ξ2, . . . are mutually independent random variables, and we define a stochastic sequence X1, X2, . . . by means of the recursion above. We shall assume that the ξn have the same law.

However, notice what happens when we solve the recursion. And we are in a happy situation here, for we can do it:

  X1 = ρX0 + ξ0
  X2 = ρX1 + ξ1 = ρ(ρX0 + ξ0) + ξ1 = ρ²X0 + ρξ0 + ξ1
  X3 = ρX2 + ξ2 = ρ³X0 + ρ²ξ0 + ρξ1 + ξ2
  · · · · · ·
  Xn = ρⁿX0 + ρ^{n-1}ξ0 + ρ^{n-2}ξ1 + · · · + ρξ_{n-2} + ξ_{n-1}.

It follows that |ρ| < 1 is the condition for stability.
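A simulation sketch of this recursion (ρ and the uniform noise law are illustrative choices): for |ρ| < 1 the empirical law of Xn settles down, whatever the distribution of X0.

    import numpy as np

    rng = np.random.default_rng(2)
    rho, steps, paths = 0.8, 200, 50_000
    x = rng.uniform(-100, 100, size=paths)            # wildly spread-out X0
    for _ in range(steps):
        x = rho * x + rng.uniform(-1, 1, size=paths)  # xi_n i.i.d. uniform(-1, 1)
    # The sample should now look like X_inf = sum_k rho^k xi_k:
    # mean 0 and variance (1/3)/(1 - rho^2) for this noise law.
    print(x.mean(), x.var(), (1/3) / (1 - rho**2))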

To see in what sense the system is stable, notice that if we let

  X̂n := ρ^{n-1}ξ_{n-1} + ρ^{n-2}ξ_{n-2} + · · · + ρξ1 + ξ0,
  X̃n := ρ^{n-1}ξ0 + ρ^{n-2}ξ1 + · · · + ρξ_{n-2} + ξ_{n-1},

then, since the ξn have the same law and are independent, we have

  P(X̃n ≤ x) = P(X̂n ≤ x), for all n,

where X̃n is just Xn without the term ρⁿX0. Since |ρ| < 1, the first term ρⁿX0 goes to 0 as n → ∞. The second term, X̃n, which is a linear combination of ξ0, . . . , ξ_{n-1}, does not converge. The nice thing is that, under nice conditions (e.g. assume that the ξn are bounded), X̂n does converge to

  X̂∞ = Σ_{k=0}^∞ ρᵏ ξk.

Hence, while the limit of X̃n does not exist, the limit of its distribution does:

  lim_{n→∞} P(X̃n ≤ x) = lim_{n→∞} P(X̂n ≤ x) = P(X̂∞ ≤ x).

This is precisely the reason why we cannot, for such stochastic systems, define stability in a way other than convergence of the distributions of the random variables.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic with (unique) stationary distribution π, then, for any initial distribution,

  lim_{n→∞} P(Xn = x) = π(x), uniformly in x.

In particular, the n-step transition matrix Pⁿ converges to a matrix with rows all equal to π:

  lim_{n→∞} pij^(n) = π(j), i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, i.e. we know

  f(x) = P(X = x), g(y) = P(Y = y),

but we are not given what P(X = x, Y = y) is. Coupling refers to the various methods for constructing this joint law. A straightforward (often not so useful) construction is to assume that the random variables are independent and define

  P(X = x, Y = y) = P(X = x)P(Y = y).

But suppose that we have some further requirement: for instance, we may wish to make the coincidence probability P(X = Y) as large as possible. How should we then define what P(X = x, Y = y) is?

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

  f(x) = Σ_y h(x, y), g(y) = Σ_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (Xn, n ≥ 0) and the law of a stochastic process Y = (Yn, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space.

We say that a random time T is a meeting time of the two processes if

  Xn = Yn for all n ≥ T.

This T is a random variable that takes values in the set of times, or it may take the value +∞ on the event that the two processes never meet. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Suppose that we have two stochastic processes as above and suppose that they have been coupled. Let T be a meeting time of X and Y. Then, for all n ≥ 0 and all x ∈ S,

  |P(Xn = x) - P(Yn = x)| ≤ P(T > n).   (19)

If, in particular, T is finite with probability one, i.e. P(T < ∞) = 1, then

  |P(Xn = x) - P(Yn = x)| → 0, uniformly in x ∈ S, as n → ∞.

Proof. We have

  P(Xn = x) = P(Xn = x, n < T) + P(Xn = x, n ≥ T)
      = P(Xn = x, n < T) + P(Yn = x, n ≥ T)
      ≤ P(n < T) + P(Yn = x),

and hence P(Xn = x) - P(Yn = x) ≤ P(T > n). Repeating the argument, but with the roles of X and Y interchanged, we obtain P(Yn = x) - P(Xn = x) ≤ P(T > n). Combining the two inequalities we obtain the first assertion:

  |P(Xn = x) - P(Yn = x)| ≤ P(T > n), for all n ≥ 0, x ∈ S.

Since this holds for all x ∈ S and since the right hand side does not depend on x, we can write it as

  sup_{x∈S} |P(Xn = x) - P(Yn = x)| ≤ P(T > n).

Assume now that P(T < ∞) = 1. Then

  lim_{n→∞} P(T > n) = P(T = ∞) = 0,

and so sup_{x∈S} |P(Xn = x) - P(Yn = x)| → 0, as n → ∞.
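The inequality is easy to see in a simulation (illustrative two-state aperiodic chain; the coupling below runs the two copies independently until they first agree and then glues them together, which makes that first agreement time a meeting time in the above sense):

    import numpy as np

    rng = np.random.default_rng(3)
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])           # illustrative aperiodic chain
    runs, n_max = 20_000, 25
    px = np.zeros(n_max); py = np.zeros(n_max); tail = np.zeros(n_max)
    for _ in range(runs):
        x, y = 0, 1                      # the two copies start apart
        for n in range(n_max):
            px[n] += (x == 0); py[n] += (y == 0); tail[n] += (x != y)
            nx = rng.choice(2, p=P[x])
            y = nx if x == y else rng.choice(2, p=P[y])   # glue after meeting
            x = nx
    px /= runs; py /= runs; tail /= runs
    # |P(Xn = 0) - P(Yn = 0)| is dominated by P(T > n) at every n:
    print(np.round(np.abs(px - py), 3))
    print(np.round(tail, 3))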

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S. Let pij denote, as usual, the transition probabilities.

We shall consider the laws of two processes. The first process, denoted by X = (Xn, n ≥ 0), has the law of a Markov chain with transition probabilities pij and X0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Yn, n ≥ 0), has the law of a Markov chain with transition probabilities pij and Y0 distributed according to the stationary distribution π. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen.

It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(Xn = x, Ym = y). We now couple them by doing the straightforward thing: we assume that X = (Xn, n ≥ 0) is independent of Y = (Yn, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

  T := inf{n ≥ 0 : Xn = Yn}.

Consider the process

  Wn := (Xn, Yn), n ≥ 0.

Then (Wn, n ≥ 0) is a Markov chain with state space S × S. Its initial state W0 = (X0, Y0) has distribution

  P(W0 = (x, y)) = µ(x)π(y).

Its (1-step) transition probabilities are

  q_{(x,y),(x',y')} := P(Wn+1 = (x', y') | Wn = (x, y)) = P(Xn+1 = x' | Xn = x) P(Yn+1 = y' | Yn = y) = p_{x,x'} p_{y,y'},

and its n-step transition probabilities are

  q_{(x,y),(x',y')}^(n) = p_{x,x'}^(n) p_{y,y'}^(n).

From Theorem 3 and the aperiodicity assumption, we have that p_{x,x'}^(n) > 0 and p_{y,y'}^(n) > 0 for all large n, implying that q_{(x,y),(x',y')}^(n) > 0 for all large n. Therefore (Wn, n ≥ 0) is an irreducible chain.

Notice that

  σ(x, y) := π(x)π(y), (x, y) ∈ S × S,

is a stationary distribution for (Wn, n ≥ 0). (Indeed, had we started with both X0 and Y0 independent and distributed according to π, then, for all n ≥ 1, Xn and Yn would be independent and distributed according to π.)

Therefore σ(x, y) > 0 for all (x, y) ∈ S × S, and so, by Corollary 13, (Wn, n ≥ 0) is positive recurrent. In particular, the chain W visits states of the form (x, x) with probability one, whence

  P(T < ∞) = 1.

Define now a third process by:

  Zn := Xn 1(n < T) + Yn 1(n ≥ T).

Obviously, Zn equals Xn before the meeting time of X and Y; after the meeting time, Zn = Yn. Note that Z0 = X0, which has an arbitrary distribution µ. Clearly, (Zn, n ≥ 0) is a Markov chain with transition probabilities pij, and so (Zn, n ≥ 0) is identical in law to (Xn, n ≥ 0).

[Figure: the processes X (arbitrary initial distribution µ), Y (stationary distribution π) and Z, which follows X up to the meeting time T and Y thereafter.]

Thus, T is a meeting time between Z and Y. By the coupling inequality,

  |P(Zn = x) - P(Yn = x)| ≤ P(T > n), for all n and x.

Since P(Zn = x) = P(Xn = x) and since P(Yn = x) = π(x), we have

  |P(Xn = x) - π(x)| ≤ P(T > n),

which converges to zero, as n → ∞, uniformly in x, precisely because P(T < ∞) = 1. This proves Theorem 17.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If pij are the transition probabilities of an irreducible chain X, and if Y is an independent copy of X, then the process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and the pij are p01 = p10 = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and the transition probabilities are

  q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.

This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, p00 = 0.01, p01 = 0.99, p10 = 1, then the product chain becomes irreducible.

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p01 = p10 = 1. Suppose X0 = 0. Then Xn = 0 for even n and = 1 for odd n. Therefore p00^(n) = 1 for even n and = 0 for odd n. Thus, lim_{n→∞} p00^(n) does not exist. However, if we modify the chain by p00 = 0.01, p01 = 0.99, p10 = 1, then, by the results of Section 14, we know we have a unique stationary distribution π, with π(x) > 0 for all states x, and, by Theorem 17,

  lim_{n→∞} p00^(n) = lim_{n→∞} p10^(n) = π(0), lim_{n→∞} p11^(n) = lim_{n→∞} p01^(n) = π(1),

where π(0), π(1) satisfy 0.99 π(0) = π(1), π(0) + π(1) = 1; in particular, π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,

  lim_{n→∞} [ p00^(n) p01^(n) ; p10^(n) p11^(n) ] = [ 0.503 0.497 ; 0.503 0.497 ].

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature. Indeed, most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong".⁶ It is, however, customary to have a consistent terminology, and this is expressed by our choice.

⁶ We put "wrong" in quotes. Terminology cannot be, per se, wrong. The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981).

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain whose period is d. By the results of §5.4, we can decompose the state space into d classes C0, C1, . . . , Cd-1, such that the chain moves cyclically between them: from C0 to C1, from C1 to C2, and so on, from Cd-1 back to C0.

It is not difficult to see that:

Lemma 8. The chain (Xnd, n ≥ 0), i.e. the original chain sampled every d steps, consists of d closed communicating classes, namely C0, . . . , Cd-1, and it is aperiodic: all states have period 1 for this chain.

Hence if we restrict the chain (Xnd, n ≥ 0) to any of the sets Cr we obtain a positive recurrent irreducible aperiodic chain, and the previous results apply. This means that:

Theorem 18. Suppose that pij are the transition probabilities of a positive recurrent irreducible chain with period d, and let π be its stationary distribution. Then, for i, j in the same class Cr,

  pij^(nd) → dπ(j), as n → ∞.

Proof. We shall show that the (unique) stationary distribution of (Xnd, n ≥ 0) restricted to Cr assigns to each i ∈ Cr the value dπ(i). We first show:

Lemma 9. Let αr := Σ_{i∈Cr} π(i). Then αr = 1/d for each r = 0, 1, . . . , d - 1.

Proof. We have that π satisfies the balance equations:

  π(i) = Σ_{j∈S} π(j) pji, i ∈ S.

Suppose i ∈ Cr. Then pji = 0 unless j ∈ Cr-1, so

  π(i) = Σ_{j∈Cr-1} π(j) pji.

Summing up over i ∈ Cr we obtain

  αr = Σ_{i∈Cr} π(i) = Σ_{j∈Cr-1} π(j) Σ_{i∈Cr} pji = Σ_{j∈Cr-1} π(j) = αr-1.

Since the αr add up to 1, each one must be equal to 1/d.

Hence (dπ(i), i ∈ Cr) is a probability distribution on Cr, and it is the (unique) stationary distribution of the sampled chain (Xnd, n ≥ 0) restricted to Cr. Suppose that X0 ∈ Cr. Since the restricted chain is positive recurrent, irreducible and aperiodic,

  pij^(nd) → dπ(j), as n → ∞, i, j ∈ Cr.

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ Ck, j ∈ Cℓ, and let r = ℓ - k (modulo d), where r is taken to be a number between 0 and d - 1 (after reduction by d). Then, starting from i, the original chain enters Cℓ in r steps and, thereafter, returns to Cℓ every d steps. Hence

  pij^(r+nd) = Pi(X_{r+nd} = j) → dπ(j), as n → ∞, j ∈ Cℓ.
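A numerical sketch of Theorem 18 (illustrative period-2 chain with C0 = {0}, C1 = {1, 2}; its stationary distribution (1/2, 1/4, 1/4) can be checked by hand):

    import numpy as np

    P = np.array([[0.0, 0.5, 0.5],
                  [1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])     # period d = 2: C0 = {0}, C1 = {1, 2}
    pi = np.array([0.5, 0.25, 0.25])    # stationary: check pi @ P == pi
    print(np.allclose(pi @ P, pi))
    even = np.linalg.matrix_power(P, 40)
    odd = np.linalg.matrix_power(P, 41)
    print(even[1, 1], 2 * pi[1])        # p_11^(nd) -> d*pi(1), same class
    print(odd[0, 1], 2 * pi[1])         # p_01^(r+nd) -> d*pi(1), across classes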

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities pij^(n), as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then pij^(n) → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then pij^(n) → 0 as n → ∞.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then

  pij^(n) → 0, as n → ∞, for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, . . . , a probability distribution

  p(n) = (p1^(n), p2^(n), p3^(n), . . .)

on N, i.e. Σ_{i=1}^∞ pi^(n) = 1 and pi^(n) ≥ 0 for all i. Then we can find a sequence n1 < n2 < n3 < · · · of integers and numbers pi such that

  lim_{k→∞} pi^(nk) exists and equals pi, for all i ∈ N.

Proof. Since the numbers p1^(n), n = 1, 2, . . . , are contained between 0 and 1, there must be a subsequence (nk¹) such that lim_{k→∞} p1^(nk¹) exists and equals, say, p1. Now consider the numbers p2^(nk¹), k = 1, 2, . . . Since they are contained between 0 and 1, there must be a subsequence (nk²) of (nk¹) such that lim_{k→∞} p2^(nk²) exists and equals, say, p2. It is clear how we continue: the sequence (nk³) of the third row is a subsequence of (nk²) of the second row, and so on. Schematically:

  row 1: p1^(n1¹), p1^(n2¹), p1^(n3¹), . . . → p1
  row 2: p2^(n1²), p2^(n2²), p2^(n3²), . . . → p2
  row 3: p3^(n1³), p3^(n2³), p3^(n3³), . . . → p3
  · · ·

Now pick the numbers in the diagonal, i.e. set nk := nk^k, k = 1, 2, . . . By construction, (nk, k = 1, 2, . . .) is (eventually) a subsequence of each (nk^i, k = 1, 2, . . .). Hence

  pi^(nk) → pi, as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit.⁷

⁷ See Brémaud (1999).

Lemma 11. If (xi, i ∈ N) and (yi^(n), i ∈ N), n = 1, 2, . . . , are sequences of nonnegative numbers, then

  Σ_{i=1}^∞ (lim inf_{n→∞} yi^(n)) xi ≤ lim inf_{n→∞} Σ_{i=1}^∞ yi^(n) xi.

Proof of Theorem 19. Let j be a null recurrent state and suppose, by way of contradiction, that, for some i, the sequence pij^(n) does not converge to 0. Hence we can find a subsequence (nk) such that

  lim_{k→∞} pij^(nk) = αj > 0.

Consider the probability vector (pix^(nk), x ∈ S). By Lemma 10, we can pick a subsequence (nk') of (nk) such that

  lim_{k→∞} pix^(nk') = αx, for all x ∈ S.

Since Σ_{x∈S} pix^(nk') = 1, we have, by Lemma 11,

  0 < Σ_{x∈S} αx ≤ 1,

the left inequality because αj > 0. Next,

  pix^(nk'+1) = Σ_{y∈S} piy^(nk') pyx,

and so, by Lemma 11,

  αx ≥ Σ_{y∈S} αy pyx, for all x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. it is a strict inequality:

  αx0 > Σ_{y∈S} αy pyx0, for some x0 ∈ S.

This would imply that

  Σ_{x∈S} αx > Σ_{x∈S} Σ_{y∈S} αy pyx = Σ_{y∈S} αy Σ_{x∈S} pyx = Σ_{y∈S} αy,

and this is a contradiction (the number Σ_x αx cannot be strictly larger than itself). Hence

  αx = Σ_{y∈S} αy pyx, for all x ∈ S.

Therefore

  π(x) := αx / Σ_{y∈S} αy, x ∈ S,

satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0 because αj > 0. By Corollary 13, state j is positive recurrent: Contradiction. Hence pij^(n) → 0 for all i.

19.2 The general case

It remains to see what the limit of pij^(n) is, as n → ∞, when j is a positive recurrent state, but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, and let

  HC := inf{n ≥ 0 : Xn ∈ C}, ϕC(i) := Pi(HC < ∞).

Then

  lim_{n→∞} pij^(n) = ϕC(i) π^C(j).

Proof. If i also belongs to C then ϕC(i) = 1 and so the result follows from Theorem 17. In general, we have

  pij^(n) = Pi(Xn = j) = Pi(Xn = j, HC ≤ n) = Σ_{m=1}^n Pi(HC = m) Pi(Xn = j | HC = m).

Let µ be the distribution of X_{HC}, given that X0 = i and that {HC < ∞}. From the Strong Markov Property, we have

  Pi(Xn = j | HC = m) = Pµ(X_{n-m} = j).

But Theorem 17 tells us that the limit exists regardless of the initial distribution:

  Pµ(X_{n-m} = j) → π^C(j), as n → ∞.

Hence

  lim_{n→∞} pij^(n) = π^C(j) Σ_{m=1}^∞ Pi(HC = m) = π^C(j) Pi(HC < ∞),

which is what we need.

Remark: We can write the result of Theorem 20 as

  lim_{n→∞} pij^(n) = fij / Ej Tj,

where Tj = inf{n ≥ 1 : Xn = j} and fij = Pi(Tj < ∞). Indeed, π^C(j) = 1/Ej Tj and fij = ϕC(i), since, once the chain enters the recurrent class C, it visits j with probability one. Alternatively, the result can be obtained by starting with the first hitting time decomposition relation of Lemma 7 and applying the so-called Abel's Lemma of Analysis; this is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: states 1, 2, 3, 4, 5; state 1 loops with probability 1 - γ and moves to 2 with probability γ; state 2 moves to 1 with probability 1; state 3 moves to 2 with probability 1 - α and to 4 with probability α; state 4 moves to 3 with probability β and to 5 with probability 1 - β; state 5 is absorbing.]

The communicating classes are C1 = {1, 2} (closed class), C2 = {5} (absorbing state), and I = {3, 4} (inessential states). The stationary distributions associated with the two closed classes are:

  π^C1(1) = 1/(1+γ), π^C1(2) = γ/(1+γ), π^C2(5) = 1.

Both C1, C2 are aperiodic classes. Let ϕ(i) be the probability that class C1 will be reached starting from state i. We have ϕ(1) = ϕ(2) = 1, ϕ(5) = 0, while, from first-step analysis,

  ϕ(3) = (1 - α)ϕ(2) + αϕ(4), ϕ(4) = (1 - β)ϕ(5) + βϕ(3),

whence

  ϕ(3) = (1 - α)/(1 - αβ), ϕ(4) = β(1 - α)/(1 - αβ).

Hence, as n → ∞,

  p31^(n) → ϕ(3)π^C1(1), p32^(n) → ϕ(3)π^C1(2), p33^(n) → 0, p34^(n) → 0, p35^(n) → (1 - ϕ(3))π^C2(5),
  p41^(n) → ϕ(4)π^C1(1), p42^(n) → ϕ(4)π^C1(2), p43^(n) → 0, p44^(n) → 0, p45^(n) → (1 - ϕ(4))π^C2(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. The phase (or state) space of a particle is not just a physical space but it includes, for example, information about its velocity. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville. Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. We will avoid giving a precise definition of the term "ergodic", but only mention what constitutes a result in the area.

Theorem 14 is a type of ergodic theorem. And so is the Law of Large Numbers itself (Theorem 13). In Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X0 such that P(Ta < ∞) = 1. Consider the quantity ν[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a. Let f be a reward function such that

  Σ_{x∈S} |f(x)| ν[a](x) < ∞.   (20)

Then, with probability one,

  P( lim_{t→∞} [ Σ_{n=0}^t f(Xn) ] / [ Σ_{n=0}^t 1(Xn = a) ] = f̄ ) = 1,   (21)

where

  f̄ := Σ_{x∈S} f(x) ν[a](x).

Proof. As in Theorem 14, we break the total reward as follows:

  Σ_{n=0}^t f(Xn) = Σ_{r=1}^{Nt-1} Gr + Gt^first + Gt^last,

where Gr = Σ_{Ta^(r) ≤ s < Ta^(r+1)} f(Xs), while Gt^first, Gt^last are the total rewards over the first and last incomplete cycles, respectively. As in the proof of Theorem 14, we have

  lim_{t→∞} (1/t) Gt^first = lim_{t→∞} (1/t) Gt^last = 0,

with probability one, and this is left as an exercise (the reader is asked to revisit the proof of Theorem 14). The Law of Large Numbers tells us that

  lim_{n→∞} (1/n) Σ_{r=1}^n Gr = EG1,

with probability 1, as long as E|G1| < ∞; but this is precisely the condition (20). We only need to check that EG1 = f̄, which follows just as in Theorem 14. Note that the denominator Σ_{n=0}^t 1(Xn = a) equals the number Nt of visits to state a up to time t; replacing n by the subsequence Nt in the Law of Large Numbers above, we obtain the result.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a; here, recurrence suffices.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem of this section implies Theorem 14. Indeed, if we divide by t the numerator and denominator of the fraction in (21), then the denominator equals Nt/t, which, as we saw in the proof of Theorem 14, converges to 1/Ea Ta; so the numerator converges to f̄/Ea Ta, which is the quantity denoted by f̄ in Theorem 14. Thus an ergodic chain satisfies an ergodic theorem, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Since the number of states is finite, at least one of the states must be visited infinitely often. So there are always recurrent states; in a sense, a finite Markov chain always has positive recurrent states. To see this, suppose the chain is irreducible. From Proposition 2 we know that the balance equations ν = νP have at least one solution. Since the number of states is finite, we have

  C := Σ_{i∈S} ν(i) < ∞.

Hence

  π(i) := ν(i)/C, i ∈ S,

satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0. By Corollary 13, this state must be positive recurrent. Hence, because positive recurrence is a class property (Theorem 15), all states are positive recurrent. By Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have

  Pⁿ → Π, as n → ∞,

where P is the transition matrix and Π is a matrix all the rows of which are equal to π.
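A quick numerical illustration of Theorem 22 (illustrative 3-state irreducible aperiodic chain): raising P to a large power makes all rows collapse onto π.

    import numpy as np

    P = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3],
                  [0.4, 0.4, 0.2]])      # illustrative finite irreducible chain
    Pi = np.linalg.matrix_power(P, 50)
    print(Pi)                            # all three rows numerically equal
    print(Pi[0] @ P)                     # and the common row is stationary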

22 Time reversibility

Consider a Markov chain (Xn, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that

  (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn-1, . . . , X0), for all n.

Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. It is, actually, like playing a film backwards: if we film a standing man who raises his arms up and down successively, and run the film backwards, it will look the same, without our being able to tell that it is, in fact, running backwards. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn-1, . . . , X0) for all n. In particular, the first components of the two vectors must have the same distribution, i.e. Xn has the same distribution as X0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:

  π(i)pij = π(j)pji, i, j ∈ S.

Proof. Suppose first that the chain is reversible. Then (X0, X1) has the same distribution as (X1, X0), i.e.

  P(X0 = i, X1 = j) = P(X1 = i, X0 = j), for all i and j.

But P(X0 = i, X1 = j) = π(i)pij and P(X0 = j, X1 = i) = π(j)pji. So the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. The DBE imply immediately that (X0, X1) has the same distribution as (X1, X0). Pick 3 states i, j, k and write, using the DBE,

  π(i)pij pjk = [π(j)pji]pjk = [π(j)pjk]pji = π(k)pkj pji.

This means that P(X0 = i, X1 = j, X2 = k) = P(X0 = k, X1 = j, X2 = i), and hence (X0, X1, X2) has the same distribution as (X2, X1, X0). By induction, the same logic shows that (X0, X1, . . . , Xn) has the same distribution as (Xn, Xn-1, . . . , X0), for all n. So the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities pij and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion (KLC) holds:

  p_{i1 i2} p_{i2 i3} · · · p_{im-1 im} = p_{i1 im} p_{im im-1} · · · p_{i2 i1},   (22)

for all i1, i2, . . . , im ∈ S and all m ≥ 3.

Proof. Suppose first that the chain is reversible. Since the chain is irreducible, we have π(i) > 0 for all i. Pick 3 states i, j, k and write, using the DBE,

  π(i)pij pjk pki = [π(j)pji]pjk pki = [π(j)pjk]pji pki = [π(k)pkj]pji pki = [π(k)pki]pji pkj = [π(i)pik]pji pkj = π(i)pik pkj pji.

Hence, cancelling the π(i) from the above display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Fix two states i1, i2 and sum (22) over all choices of the intermediate states i3, i4, . . . , im-1; we obtain

  p_{i1 i2} p_{i2 i1}^(m-3) = p_{i1 i2}^(m-3) p_{i2 i1}.

If the chain is aperiodic, then, as m → ∞,

  p_{i2 i1}^(m-3) → π(i1) and p_{i1 i2}^(m-3) → π(i2).

Hence, letting m → ∞, we obtain π(i1)p_{i1 i2} = π(i2)p_{i2 i1}. These are the DBE, and the chain is reversible.

Notes:

1) A necessary condition for reversibility is that pij > 0 if and only if pji > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible.

2) The DBE are equivalent to saying that the ratio pij/pji can be written as a product of a function of i and a function of j, i.e. we have separation of variables:

  pij/pji = f(i)g(j).

(Convention: we form this ratio only when pij > 0.) Necessarily, f(i)g(i) must be 1 (why?), so we have

  pij/pji = g(j)/g(i).

Multiplying by g(i) and summing over j, we find

  g(i) = Σ_{j∈S} g(j)pji.

So g satisfies the balance equations.
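A numerical check of Theorem 23 on a small birth-death-type example (illustrative numbers): the detailed balance equations hold entrywise for the stationary π, which shows up as the symmetry of the matrix with entries π(i)pij.

    import numpy as np

    P = np.array([[0.0, 1.0, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 1.0, 0.0]])     # illustrative walk on the path {0,1,2}
    pi = np.array([0.25, 0.5, 0.25])    # stationary (degree-proportional)
    print(np.allclose(pi @ P, pi))      # pi is indeed stationary
    D = pi[:, None] * P                 # D[i, j] = pi(i) p_ij
    print(np.allclose(D, D.T))          # detailed balance: D is symmetric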

(Convention: we form this ratio only when pij > 0.) Necessarily, f(i)g(i) must be 1 (why?), so we can write pij/pji = g(j)/g(i), i.e. g(i)pij = g(j)pji. Summing over j, we find
g(i) = Σ_{j∈S} g(j) pji.
So g satisfies the balance equations, and hence π can be found very easily, by normalising g. This gives the following method for showing that a finite irreducible chain is reversible:
• Form the ratio pij/pji and show that the variables separate, i.e. pij/pji = g(j)/g(i) for some function g.
• Normalise g to obtain π; the chain is then reversible with this π.
If the chain has infinitely many states, then we need, in addition, to guarantee, using another method, that the chain is positive recurrent.

Example 1: Chain with a tree-structure. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. Form the graph of the chain by deleting all self-loops and by replacing, for each pair of distinct states i, j for which pij > 0, the two oriented edges from i to j and from j to i by a single edge without arrow. If the graph thus obtained is a tree, meaning that it has no loops, then the chain is reversible.

The reason is as follows. Pick two distinct states i, j for which pij > 0. If we remove the link between them, the graph becomes disconnected. (If it didn't, there would be a loop, and, by assumption, there isn't any.) Let A, Ac be the sets of states in the two connected components, with i ∈ A and j ∈ Ac. The balance equations can be written in the form F(A, Ac) = F(Ac, A) (see §4.1). There is obviously only one arrow from A to Ac, namely the arrow from i to j, and the arrow from j to i is the only one from Ac to A. This gives
π(i) pij = π(j) pji,
the detailed balance equations. Hence the chain is reversible.

For example, consider the chain:
[Figure: a small chain with self-loops and pairs of opposite arrows.]
Replace it by:
[Figure: the same chain with self-loops deleted and each pair of opposite arrows replaced by a single plain edge.]
This is a tree. Hence the original chain is reversible.
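To illustrate the separation-of-variables method numerically, here is a small sketch (our own example, not from the notes): a birth-death chain, whose graph is a path and hence a tree, so by the discussion above it must be reversible. We recover π from the detailed balance equations and check the balance equations.

import numpy as np

# A birth-death chain on {0,...,N}: its graph is a path, hence a tree.
N = 5
up = np.array([0.6, 0.5, 0.4, 0.3, 0.2])     # p_{i,i+1}, i = 0..N-1 (our choice)
down = np.array([0.3, 0.4, 0.5, 0.6, 0.7])   # p_{i+1,i}, i = 0..N-1

# Solve the detailed balance equations pi(i) p_{i,i+1} = pi(i+1) p_{i+1,i}
# recursively, then normalise.
pi = np.ones(N + 1)
for i in range(N):
    pi[i + 1] = pi[i] * up[i] / down[i]
pi /= pi.sum()

# Build the full transition matrix (self-loops absorb the leftover mass)
# and verify the balance equations pi = pi P.
P = np.zeros((N + 1, N + 1))
for i in range(N):
    P[i, i + 1] = up[i]
    P[i + 1, i] = down[i]
P += np.diag(1 - P.sum(axis=1))
print(np.allclose(pi @ P, pi))   # True: pi is the stationary distribution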

Example 2: Random walk on a graph. Consider an undirected graph G with vertices S, a finite set, and a set E of undirected edges. Each edge connects a pair of distinct vertices (states) without orientation. States i and j are called neighbours if they are connected by an edge. For each state i define its degree
δ(i) = number of neighbours of i.
Now define the following transition probabilities:
pij = 1/δ(i), if j is a neighbour of i;  0, otherwise.
The Markov chain with these transition probabilities is called a random walk on a graph (RWG). Suppose the graph is connected. For example:
[Figure: a graph on the vertices 0, 1, 2, 3, 4.]
Here we have p12 = p10 = p14 = 1/3, p02 = 1/4, etc.

A RWG is a reversible chain. To see this, let us verify that the detailed balance equations hold. We claim that
π(i) = Cδ(i),
where C is a constant. There are two cases: if i, j are neighbours, then π(i)pij = Cδ(i)·(1/δ(i)) = C and π(j)pji = Cδ(j)·(1/δ(j)) = C, so both members are equal to C; if i, j are not neighbours, then both sides are 0. In all cases,
π(i) pij = π(j) pji,
i.e. the DBE hold. The constant C is found by normalisation: Σ_{i∈S} π(i) = 1. Since Σ_{i∈S} δ(i) = 2|E|, where |E| is the total number of edges, we have C = 1/2|E|. Hence we conclude that:
1. The stationary distribution of a RWG is very easy to find.
2. The stationary distribution assigns to a state a probability which is proportional to its degree.
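As a quick numerical check of the claim π(i) = δ(i)/2|E| (a sketch; the edge list below is our own small connected graph, any connected graph would do):

import numpy as np

# Random walk on an undirected graph: check that pi(i) = deg(i) / (2|E|)
# satisfies both the balance equations and the DBE.
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4)]
n = 5

deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

P = np.zeros((n, n))
for i, j in edges:
    P[i, j] = P[j, i] = 1.0
P = P / deg[:, None]                  # p_ij = 1/deg(i) for each neighbour j

pi = deg / (2 * len(edges))           # the claimed stationary distribution
print(np.allclose(pi @ P, pi))        # True: balance equations hold
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))  # True: DBE hold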

23 New Markov chains from old ones*

We now deal with the question: if we have a Markov chain (Xn), how can we create new ones?

23.1 Subsequences

The most obvious way to create a new Markov chain is by picking a subsequence, i.e. a sequence of integers (nk)_{k≥0} such that 0 ≤ n0 < n1 < n2 < ···, and by letting
Yk := X_{nk}, k = 0, 1, 2, ....
The sequence (Yk, k = 0, 1, 2, ...) has the Markov property, but it is not, in general, time-homogeneous. If the subsequence forms an arithmetic progression, i.e. if nk = n0 + km, then (Yk, k = 0, 1, 2, ...) has time-homogeneous transition probabilities:
P(Yk+1 = j | Yk = i) = p^{(m)}_{ij},
where p^{(m)}_{ij} are the m-step transition probabilities of the original chain. If, in addition, π is a stationary distribution for the original chain, then it is a stationary distribution for the new chain as well.

23.2 Watching a Markov chain when it visits a set

Suppose that the original chain is ergodic (i.e. irreducible and positive recurrent). Let A be a set of states and let T_A^{(r)} be the r-th visit to A. Define
Yr := X_{T_A^{(r)}}, r = 1, 2, ....
Due to the strong Markov property, this process is also a Markov chain, and it does have time-homogeneous transition probabilities. The state space of (Yr) is A. If π is the stationary distribution of the original chain, then π(i)/π(A), i ∈ A, is the stationary distribution of (Yr).

23.3 Subordination

Let (Xn, n ≥ 0) be Markov with transition probabilities pij. Let (St, t ≥ 0) be independent of (Xn), with S0 = 0 and
St+1 = St + ξt, t ≥ 0,
where the ξt are i.i.d. random variables with values in N and distribution
λ(n) = P(ξ0 = n), n ∈ N.

Then
Yt := X_{St}, t ≥ 0,
is a Markov chain with
P(Yt+1 = j | Yt = i) = Σ_{n=1}^∞ λ(n) p^{(n)}_{ij}.

23.4 Deterministic function of a Markov chain

If (Xn) is a Markov chain with state space S and if f : S → S′ is some function from S to some countable set S′, then the process
Yn = f(Xn), n ≥ 0,
is NOT, in general, a Markov chain. If, however, f is a one-to-one function, then (Yn) is a Markov chain: indeed, knowledge of f(Xn) implies knowledge of Xn, and so, conditional on Yn, the past is independent of the future. More generally, we have:

Lemma 13. Suppose that P(Yn+1 = β | Xn = i) depends on i only through f(i), i.e. that there is a function Q : S′ × S′ → R such that
P(Yn+1 = β | Xn = i) = Q(f(i), β).
Then (Yn) has the Markov property. (Notation: we will denote the elements of S by i, j, ..., and the elements of S′ by α, β, ....)

Proof. Fix α0, α1, ..., αn−1 ∈ S′ and let Yn−1 be the event
Yn−1 := {Y0 = α0, ..., Yn−1 = αn−1}.
We need to show that
P(Yn+1 = β | Yn = α, Yn−1) = P(Yn+1 = β | Yn = α),

for all choices of the variables involved; in other words, that, conditional on Yn, the past is independent of the future. Indeed,
P(Yn+1 = β | Yn = α, Yn−1)
= Σ_{i∈S} P(Yn+1 = β | Xn = i, Yn = α, Yn−1) P(Xn = i | Yn = α, Yn−1)
= Σ_{i∈S} Q(f(i), β) · P(Yn = α | Xn = i, Yn−1) P(Xn = i | Yn−1) / P(Yn = α | Yn−1)
= Σ_{i∈S} Q(f(i), β) · 1(α = f(i)) P(Xn = i | Yn−1) / P(Yn = α | Yn−1)
= Q(α, β) Σ_{i∈S} 1(α = f(i)) P(Xn = i | Yn−1) / P(Yn = α | Yn−1)
= Q(α, β) · P(Yn = α | Yn−1) / P(Yn = α | Yn−1) = Q(α, β).
Thus (Yn) is Markov with transition probabilities Q(α, β).

Example: Consider the symmetric drunkard's random walk,
P(Xn+1 = i ± 1 | Xn = i) = 1/2, i ∈ Z,
and define Yn := |Xn|. Then (Yn) is a Markov chain. To see this, observe first that the function |·| is not one-to-one, so the lemma, rather than the one-to-one criterion, is needed. But we will show that P(Yn+1 = β | Xn = i), i ∈ Z, β ∈ Z+, depends on i only through |i|. Indeed, conditional on {Xn = i} with i ≠ 0, the absolute value of the next state will either increase by 1 or decrease by 1 with probability 1/2, regardless of whether i > 0 or i < 0, because the probability of going up is the same as the probability of going down; and if i = 0, then |Xn+1| = 1 for sure. In other words, if we let
Q(α, β) := 1/2, if β = α ± 1, α > 0;  1, if α = 0, β = 1;  0, otherwise,
then
P(Yn+1 = β | Xn = i) = Q(|i|, β), for all i ∈ Z and all β ∈ Z+.
By Lemma 13, (Yn) is, indeed, a Markov chain with transition probabilities Q(α, β).
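A short simulation sketch (ours) of this example: the empirical transition frequencies of Yn = |Xn| match Q(α, β).

import random
from collections import Counter

# Simulate the symmetric drunkard's walk and record the transitions of |X_n|.
random.seed(0)
x, counts = 0, Counter()
for _ in range(200_000):
    x_next = x + random.choice((-1, 1))
    counts[(abs(x), abs(x_next))] += 1
    x = x_next

# Empirical transition frequencies out of a few states alpha of Y_n = |X_n|.
for a in (0, 1, 2):
    total = sum(c for (s, _), c in counts.items() if s == a)
    for (s, t), c in sorted(counts.items()):
        if s == a:
            print(f"Q({a},{t}) ~= {c / total:.3f}")
# Prints Q(0,1) ~= 1.000, and Q(a, a+1) ~= Q(a, a-1) ~= 0.5 for a > 0.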

24 Applications

24.1 Branching processes: a population growth application

We describe a basic model for population growth known as the Galton-Watson process. The idea is this: at each point of time n we have a number Xn of individuals, each of which gives birth to a random number of offspring, independently of one another and following the same distribution. We let pk be the probability that an individual will bear k offspring, k = 0, 1, 2, .... The parent dies immediately upon giving birth. Let ξ_k^{(n)} be the number of offspring of the k-th individual in the n-th generation. We find the population at time n + 1 by adding the number of offspring of each individual:
X_{n+1} = ξ_1^{(n)} + ξ_2^{(n)} + ··· + ξ_{Xn}^{(n)}.
Since ξ^{(0)}, ξ^{(1)}, ... are i.i.d. (and independent of the initial population size X0, a further assumption), if we let ξ^{(n)} = (ξ_1^{(n)}, ξ_2^{(n)}, ...), then we see that the above equation is of the form
X_{n+1} = f(Xn, ξ^{(n)}),
and hence (Xn, n ≥ 0) has the Markov property. It is a Markov chain with time-homogeneous transitions, and 0 is always an absorbing state. One question of interest is whether the process will become extinct or not.

Computing the transition probabilities pij = P(Xn+1 = j | Xn = i) is, in general, a hard problem. We will not attempt to compute the pij nor draw the transition diagram, because both are futile; instead, we will find some different method to proceed. But here is an interesting special case.

Example: Suppose that p1 = p, p0 = 1 − p: each individual gives birth to one child with probability p or has no children with probability 1 − p. If we know that Xn = i, then Xn+1 is the number of successful births amongst i individuals, which has a binomial distribution:
pij = \binom{i}{j} p^j (1 − p)^{i−j},  0 ≤ j ≤ i.
In this case, the population cannot grow: it will eventually reach 0, with probability 1, and all states i ≥ 1 are therefore transient.

Back to the general case. If p1 = P(ξ = 1) = 1, then Xn = X0 + n and this is a trivial case. So assume that p1 ≠ 1. Then P(ξ = 0) + P(ξ ≥ 2) > 0. We distinguish two cases.

Case I: If p0 = P(ξ = 0) = 0, then the population can only grow. The state 0 is irrelevant because it cannot be reached from anywhere. Hence all states i ≥ 1 are inessential, therefore transient, and Xn → ∞ with probability 1.

Case II: If p0 = P(ξ = 0) > 0, then state 0 can be reached from any other state: indeed, if the current population is i, there is probability p0^i that everybody gives birth to zero offspring. But 0 is an absorbing state. Hence all states i ≥ 1 are inessential, therefore transient. Therefore, with probability 1, if 0 is never reached, then Xn → ∞: if the

population does not become extinct, then it must grow to infinity. Let D be the event of 'death', or extinction:
D := {Xn = 0 for some n ≥ 0}.
We wish to compute the extinction probability
ε(i) := Pi(D),
when we start with X0 = i. Obviously, ε(0) = 1. For i ≥ 1, we can argue that
ε(i) = ε^i,
where ε = P1(D). Indeed, if we start with X0 = i initial ancestors, each one behaves independently of the others. For each k = 1, ..., i, let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. We then have D = D1 ∩ ··· ∩ Di and P(Dk) = ε, so P(D | X0 = i) = ε^i. Therefore the problem reduces to the study of the branching process starting with X0 = 1.

Let
μ = Eξ = Σ_{k≥1} k pk
be the mean number of offspring of an individual, and let
φ(z) := E z^ξ = Σ_{k≥0} z^k pk
be the probability generating function of ξ.

The result is this:

Proposition 5. Consider a Galton-Watson process starting with X0 = 1 initial ancestor, and let ε be the extinction probability of the process. If μ ≤ 1, then ε = 1. If μ > 1, then ε < 1 is the unique solution of φ(z) = z in [0, 1).

Proof. We let
φn(z) := E z^{Xn} = Σ_{k≥0} z^k P(Xn = k)
be the probability generating function of Xn. This function is of interest because it gives information about the extinction probability ε. Indeed, φn(0) = P(Xn = 0) and, since Xn = 0 implies Xm = 0 for all m ≥ n, we have
lim_{n→∞} φn(0) = lim_{n→∞} P(Xm = 0 for all m ≥ n) = P(there exists n such that Xm = 0 for all m ≥ n) = P(D) = ε.

Note also that φ0(z) = z and φ1(z) = φ(z). We then have
φ_{n+1}(z) = E z^{X_{n+1}} = E[ z^{ξ_1^{(n)}} z^{ξ_2^{(n)}} ··· z^{ξ_{Xn}^{(n)}} ]
= Σ_{i≥0} E[ z^{ξ_1^{(n)}} ··· z^{ξ_i^{(n)}} 1(Xn = i) ]
= Σ_{i≥0} E[z^{ξ_1^{(n)}}] ··· E[z^{ξ_i^{(n)}}] E[1(Xn = i)]
= Σ_{i≥0} φ(z)^i P(Xn = i) = E[ φ(z)^{Xn} ] = φn(φ(z)).
Hence φ2(z) = φ1(φ(z)) = φ(φ(z)), φ3(z) = φ2(φ(z)) = φ(φ(φ(z))) = φ(φ2(z)), and so on; by induction,
φ_{n+1}(z) = φ(φn(z)).
In particular, φ_{n+1}(0) = φ(φn(0)). But φn(0) → ε and, since φ is a continuous function, φ(φn(0)) → φ(ε); also φ_{n+1}(0) → ε. Therefore
ε = φ(ε),
i.e. ε is a solution of z = φ(z). Now notice that
μ = dφ(z)/dz at z = 1,
and recall that φ is a convex function with φ(1) = 1. If the slope μ at z = 1 is ≤ 1, then the graph of φ(z) lies above the diagonal (the function z), and so the equation z = φ(z) has no solutions other than z = 1. This means that ε = 1 (see the left figure below). If, on the other hand, μ > 1, then the graph of φ(z) does intersect the diagonal at some point ε* < 1. Since both ε and ε* satisfy z = φ(z), and since the latter equation has only the two solutions z = ε* and z = 1, all we need to show is that ε < 1; then, in fact, ε = ε*. But 0 ≤ ε*, so φ(0) ≤ φ(ε*) = ε*, and, by induction, φn(0) ≤ ε* for all n (because φ_{n+1}(0) = φ(φn(0)) ≤ φ(ε*) = ε*). Since φn(0) → ε, we obtain ε ≤ ε* < 1. Therefore ε = ε*. (See the right figure below.)
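The proof suggests a practical way to compute ε when μ > 1: iterate the map z ↦ φ(z) starting from 0, since φn(0) increases to ε. A minimal sketch (the offspring distribution is our own choice, with μ = 1.3):

# Extinction probability of a Galton-Watson process by iterating
# phi(0), phi(phi(0)), ...  which increases to epsilon (see the proof).
p = [0.2, 0.3, 0.5]          # offspring distribution p_k; mean mu = 1.3 > 1

def phi(z):
    return sum(pk * z**k for k, pk in enumerate(p))

eps = 0.0
for _ in range(1000):
    eps = phi(eps)
print(eps)                    # ~0.4

Here φ(z) = 0.2 + 0.3z + 0.5z², whose fixed points are 0.4 and 1, so the iteration converges to ε = 0.4, in agreement with Proposition 5.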

[Figure: the graph of φ(z) against the diagonal z. Left: case μ ≤ 1, where the only intersection in [0, 1] is at z = 1, so ε = 1. Right: case μ > 1, where the graph crosses the diagonal at ε < 1.]

24.2 The ALOHA protocol: a computer communications application

This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). The problem of communication over a common channel is that if more than one host attempts to transmit a packet at the same time, using the same frequency, then there is a collision, the transmission is not successful, and the transmissions must be attempted anew. For a packet to go through, only one of the hosts should be transmitting at a given time.

One obvious solution is to allocate time slots to hosts in a specific order, periodically. So if there are, say, 3 hosts, then time slots 0, 1, 2, 3, 4, 5, 6, ... are allocated to hosts 1, 2, 3, 1, 2, 3, 1, .... But this is not a good method: it is a waste of time to allocate a slot to a host if that host has nothing to transmit, and, since things happen fast in a computer network, there is no time to make a prior query whether a host has something to transmit or not. Abandoning this idea, we let hosts transmit whenever they like. (Nowadays, Ethernet has replaced Aloha.)

In a simplified model, assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are 'backlogged'). During this time slot, assume there are a new packets that need to be transmitted. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. Let s be the number of selected packets. If s = 1, there is a successful transmission and, on the next time slot, there are x + a − 1 packets. If s ≥ 2, there is a collision and the transmission is unsuccessful; if s = 0, nothing is transmitted. In either of these cases, on the next time slot there are x + a packets. So if we let Xn be the number of backlogged packets at the beginning of the n-th time slot, An the number of new packets in the same time slot, and Sn the number of selected packets, then
Xn+1 = max(Xn + An − 1, 0), if Sn = 1;  Xn + An, otherwise.
We assume that the An are i.i.d. random variables with some common distribution:
P(An = k) = λk, k ≥ 0,

where λk ≥ 0 and Σ_{k≥0} λk = 1. We also let
μ := Σ_{k≥1} k λk
be the expectation of An. The reason we take the maximum with 0 in the above equation is that, if Xn + An = 0, we should not subtract 1. The process (Xn, n ≥ 0) is a Markov chain.

Let us find its transition probabilities. Conditional on Xn and An (and on all past states and arrivals), selections are made amongst the Xn + An packets, so Sn has a Binomial distribution with parameters Xn + An and p:
P(Sn = k | Xn = x, An = a, Xn−1, An−1, ...) = \binom{x+a}{k} p^k q^{x+a−k},  0 ≤ k ≤ x + a,
where q = 1 − p. To compute px,x−1 when x > 0, we think as follows: to have a reduction of x by 1, we must have no new packets arriving and exactly one selection:
px,x−1 = P(An = 0, Sn = 1 | Xn = x) = λ0 x p q^{x−1}.
To compute px,x+k for k ≥ 1, we think like this: to have an increase by k, we either need exactly k new packets and no successful transmission (i.e. Sn should be 0 or ≥ 2), or we need k + 1 new packets and one selection (Sn = 1). So:
px,x+k = λk (1 − (x+k) p q^{x+k−1}) + λ_{k+1} (x+k+1) p q^{x+k},  k ≥ 1.
The probabilities px,x are computed by subtracting the sum of the above from 1.

The main result is:

Proposition 6. The Aloha protocol is unstable, i.e. the Markov chain (Xn) is transient.

Idea of proof. Proving this is beyond the scope of the lectures, but we mention that it is based on the so-called drift criterion, which says that if the expected change of Xn, conditional on Xn = x, is bounded below by a positive constant for all large x, then the chain is transient. We here only restrict ourselves to the computation of this expected change, the drift, defined by
Δ(x) := E[Xn+1 − Xn | Xn = x].
We have
Δ(x) = (−1) px,x−1 + Σ_{k≥1} k px,x+k
= −λ0 x p q^{x−1} + Σ_{k≥1} k λk − Σ_{k≥1} k λk (x+k) p q^{x+k−1} + Σ_{k≥1} k λ_{k+1} (x+k+1) p q^{x+k}
= μ − q^x G(x),
where G(x) is a function of the form α + βx. Unless μ = 0 (which is a trivial case: no arrivals!), the drift is eventually positive. Indeed, since p > 0, we have q < 1, so q^x tends to 0 much faster than G(x) tends to ∞; hence q^x G(x) tends to 0 as x → ∞. So we can make q^x G(x) as small as we like by choosing x sufficiently large, for example ≤ μ/2. Therefore Δ(x) ≥ μ/2 for all large x, i.e. the drift is positive for all large x, and this proves transience.
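A simulation sketch of this instability (p and the arrival distribution λ are our own choices): the backlog grows roughly linearly, as the positive drift ≈ μ predicts once q^x G(x) is negligible.

import numpy as np

# Simulate the ALOHA backlog chain: each of the X_n + A_n packets is
# selected with probability p; exactly one selection means one departure.
rng = np.random.default_rng(1)
p, lam = 0.1, [0.5, 0.3, 0.2]          # selection prob.; P(A_n = k) = lam[k]
x = 0
for n in range(1, 100_001):
    a = rng.choice(len(lam), p=lam)
    s = rng.binomial(x + a, p)
    x = max(x + a - 1, 0) if s == 1 else x + a
    if n % 20_000 == 0:
        print(n, x)    # the backlog keeps growing: the chain is transient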

24.3 PageRank (trademark of Google): A World Wide Web application

PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. The way it does that is by computing the stationary distribution of a Markov chain that we now describe. The web graph is a graph with a set of vertices V consisting of web pages, and edges representing hyperlinks: if page i has a link to page j, then we put an arrow from i to j and consider it as a directed edge of the graph. For each page i, let L(i) be the set of pages it links to. If L(i) = ∅, then i is a 'dangling page'. Define transition probabilities as follows:
pij = 1/|L(i)|, if j ∈ L(i);  1/|V|, if L(i) = ∅;  0, otherwise.
So if Xn = i is the current location of the chain, the next location Xn+1 is picked uniformly at random amongst all pages that i links to, unless i is a dangling page, in which case the chain moves to a random page.

It is not clear that this chain is irreducible, neither that it is aperiodic. So we make the following modification. We pick α between 0 and 1 (typically, α ≈ 0.2) and let
p̃ij := (1 − α) pij + α/|V|.
These are new transition probabilities. The interpretation is that of a 'bored surfer': a surfer moves according to the original Markov chain, unless he gets bored (and this happens with probability α), in which case he clicks on a random page. The modified chain is irreducible and aperiodic, so we can compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [p̃ij]:
π = πP.
Then PageRank says: page i has higher rank than page j if π(i) > π(j).
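A minimal power-iteration sketch, computing π = πP by iterating π(k) = π(k−1)P on a toy web graph (the five-page link structure is our own invention; the real computation differs mainly in scale):

import numpy as np

# PageRank by power iteration on a toy web graph (our own example).
links = {0: [1, 2], 1: [2], 2: [0], 3: [2], 4: []}    # page i -> L(i); 4 dangles
V, alpha = len(links), 0.2

P = np.zeros((V, V))
for i, L in links.items():
    targets = L if L else list(range(V))              # dangling page -> all pages
    P[i, targets] = 1.0 / len(targets)
P = (1 - alpha) * P + alpha / V                       # the 'bored surfer' chain

pi = np.full(V, 1.0 / V)                              # initial guess pi^(0)
for _ in range(100):
    pi = pi @ P                                       # pi^(k) = pi^(k-1) P
print(pi)                                             # page 2 gets the top rank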

Comments:
1. The idea is simple. Its implementation, though, is one of the largest matrix computations in the world, because of the size of the problem. The algorithm used starts with an initial guess π(0) for π and updates it by
π(k) = π(k−1) P;
it uses advanced techniques from linear algebra to speed up computations.
2. In an Internet-dominated society, the rank of a page may be very important for a company's profits. Therefore, there are companies that sell high-rank pages: all you need to do to get high rank is pay (a lot of money).
3. There is a technique, known as '302 Google jacking', that allows someone to manipulate the PageRank of a web page and make it much higher.
4. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. Academic departments 'link' to each other by hiring their faculty from each other (and from themselves).
5. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher's output. This number could be independent of the research quality but, from the point of view of an administrator, it makes the evaluation procedure easy: what is important is that the evaluation can be done without ever looking at the content of one's research. And so on.
Needless to say, these uses of PageRank and PageRank-like systems represent corruption at the deepest level; but it is a kind of corruption that is hard to trace, since it is unanimously equated with modern use of Information Technology.

24.4 The Wright-Fisher model: an example from biology

In an organism, a gene controlling a specific characteristic (e.g. colour of eyes) appears in two forms, a dominant (say A) and a recessive (say a). The alleles in an individual come in pairs and express the individual's genotype for that gene. An individual can be of type AA (meaning that all its alleles are A), or aa, or Aa. The external appearance of a characteristic is a function of the pair of alleles: when two individuals mate, their offspring inherits an allele from each parent. Thus, two AA individuals will produce an AA child, but an AA mating with an Aa may produce a child of type AA or one of type Aa.

Imagine that we have a population of N individuals who mate at random, producing a number of offspring. We will assume that a child chooses an allele from each parent at random, and we will also assume that the parents are replaced by the children. We would like to study the genetic diversity, meaning how many alleles of each type are present. Since we don't care to keep track of the individuals' identities, the genetic diversity depends only on the total number of alleles of each type. Thus, all we need is the information outlined in the following figure.

[Figure: 4 alleles of type A (circles) and 6 of type a (squares). One of the circle alleles is selected twice; one of the square alleles is selected 4 times; three of the square alleles are never selected.]

So let x be the number of alleles of type A (so 2N − x is the number of those of type a). To simplify, we will assume that, in each generation, the number of alleles of type A at the next generation can be found as follows: pick an allele at random; this allele is of type A with probability x/2N; maintain this allele for the next generation; do the same thing 2N times. Alleles that are never picked up may belong to individuals who never mate, or to individuals who mate but whose offspring did not inherit this allele. The process repeats, generation after generation. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n: we have a transition from x to y if y alleles of type A were produced by this sampling.

Each allele of type A is selected with probability x/2N. Since selections are independent, we see that
pxy = \binom{2N}{y} (x/2N)^y (1 − x/2N)^{2N−y},  0 ≤ y ≤ 2N.
Notice that there is a chance (= (1 − x/2N)^{2N}) that no allele of type A is selected at all, and the population becomes one consisting of alleles of type a only; similarly, there is a chance (= (x/2N)^{2N}) that only alleles of type A are selected, and the population becomes one consisting of alleles of type A only. Clearly, p00 = p2N,2N = 1, so both 0 and 2N are absorbing states. Since pxy > 0 for all 1 ≤ x, y ≤ 2N − 1, the states {1, ..., 2N − 1} form a communicating class of inessential states. Therefore
P(Xn = 0 or Xn = 2N for all large n) = 1,
i.e. the chain gets absorbed by either 0 or 2N. Let M = 2N. (The fact that M is even is irrelevant for the mathematical model.) We would like to know the probability
φ(x) = Px(T0 > TM) = P(T0 > TM | X0 = x)
that the chain will eventually be absorbed by state M, where, as usual, Tx := inf{n ≥ 1 : Xn = x}.

Another way to represent the chain is as follows: conditionally on (X0, ..., Xn),
Xn+1 = ξ_1^n + ··· + ξ_M^n,
where the random variables ξ_1^n, ..., ξ_M^n are i.i.d. with
P(ξ_k^n = 1 | Xn, ..., X0) = Xn/M,  P(ξ_k^n = 0 | Xn, ..., X0) = 1 − Xn/M.
Hence
E(Xn+1 | Xn, ..., X0) = E(Xn+1 | Xn) = M · (Xn/M) = Xn,
i.e. the expectation doesn't change and, on the average, the number of alleles of type A won't change. This implies that, for all n ≥ 0, EXn+1 = EXn = ··· = EX0 = x.

One can argue that, at 'the end of time', the chain (started from x) will be either M, with probability φ(x), or 0, with probability 1 − φ(x). Since, on the average, the expectation doesn't change, we should have
M φ(x) + 0 · (1 − φ(x)) = x,
and thus
φ(x) = x/M.
The result is correct, but the argument needs further justification, because 'the end of time' is not a deterministic time. We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. These are:
φ(0) = 0,  φ(M) = 1,  φ(x) = Σ_{y=0}^M pxy φ(y).
The first two are obvious. The right-hand side of the last equals
Σ_{y=0}^M \binom{M}{y} (x/M)^y (1 − x/M)^{M−y} (y/M)
= Σ_{y=1}^M \frac{M!}{y!(M−y)!} (x/M)^y (1 − x/M)^{M−y} (y/M)
= (x/M) Σ_{y=1}^M \binom{M−1}{y−1} (x/M)^{y−1} (1 − x/M)^{(M−1)−(y−1)}
= x/M,
since the last sum is the total mass of a Binomial(M − 1, x/M) distribution, namely (x/M + 1 − x/M)^{M−1} = 1. This is as needed.
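A quick simulation sketch (N and the initial state are our choices) confirming the absorption probability φ(x) = x/M:

import numpy as np

# Wright-Fisher chain: estimate the probability of absorption at M = 2N
# starting from x0 alleles of type A, and compare with phi(x0) = x0/M.
rng = np.random.default_rng(0)
N, x0, runs = 10, 6, 20_000
M = 2 * N

hits = 0
for _ in range(runs):
    x = x0
    while 0 < x < M:
        x = rng.binomial(M, x / M)   # next generation's type-A count
    hits += (x == M)
print(hits / runs, x0 / M)           # both ~0.3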

24.5 A storage or queueing application

A storage facility stores, say, Xn electronic components at time n. A number An of new components are ordered and added to the stock, while a number of them are sold to the market. If the demand is for Dn components, the demand is met instantaneously, provided that enough components are available, and so, at time n + 1, there are Xn + An − Dn components in the stock. Otherwise, the demand is only partially satisfied, and the stock reaches level zero at time n + 1. We can express these requirements by the recursion
Xn+1 = (Xn + An − Dn) ∨ 0,
where a ∨ b := max(a, b).

Here are our assumptions: the pairs of random variables ((An, Dn), n ≥ 0) are i.i.d. and, on the average, the demand is larger than the supply per unit of time:
E(An − Dn) = −μ < 0.
We have that (Xn, n ≥ 0) is a Markov chain. We will show that this Markov chain is positive recurrent. Let ξn := An − Dn, so that
Xn+1 = (Xn + ξn) ∨ 0.
Since the Markov chain is represented by this very simple recursion, we will try to solve the recursion. Working for n = 1, 2, 3, ..., we find
X1 = (X0 + ξ0) ∨ 0
X2 = (X1 + ξ1) ∨ 0 = (X0 + ξ0 + ξ1) ∨ ξ1 ∨ 0
X3 = (X2 + ξ2) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2) ∨ (ξ1 + ξ2) ∨ ξ2 ∨ 0
and, in general,
Xn = (X0 + ξ0 + ··· + ξn−1) ∨ (ξ1 + ··· + ξn−1) ∨ ··· ∨ ξn−1 ∨ 0.
(The correctness of the latter can formally be proved by showing that it satisfies the recursion.)

We will apply the so-called Loynes' method. Let us consider
g^{(n)}_{xy} := Px(Xn > y) = P( (x + ξ0 + ··· + ξn−1) ∨ (ξ1 + ··· + ξn−1) ∨ ··· ∨ ξn−1 ∨ 0 > y ).

Notice, however, that the random variables (ξn) are i.i.d., so we can shuffle them in any way we like without altering their joint distribution; in particular,
(ξ0, ξ1, ..., ξn−1) =d (ξn−1, ..., ξ1, ξ0).
Hence, in computing the distribution of Xn, which depends on ξ0, ..., ξn−1, we may replace the latter random variables by ξn−1, ..., ξ1, ξ0: instead of considering them in their original order, we reverse it. Therefore,
g^{(n)}_{xy} = P( (x + ξn−1 + ··· + ξ0) ∨ (ξn−2 + ··· + ξ0) ∨ ··· ∨ ξ0 ∨ 0 > y ).
Let
Zn := ξ0 + ··· + ξn−1,  Mn := Zn ∨ Zn−1 ∨ ··· ∨ Z1 ∨ 0.
Then
g^{(n)}_{0y} = P(Mn > y).
Since Mn+1 ≥ Mn, we have g^{(n)}_{0y} ≤ g^{(n+1)}_{0y} for all n. The limit of a bounded and increasing sequence certainly exists:
g(y) := lim_{n→∞} g^{(n)}_{0y} = lim_{n→∞} P0(Xn > y).
We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. Indeed,
Mn+1 = 0 ∨ max_{1≤j≤n+1} Zj = 0 ∨ ( max_{1≤j≤n+1} (Zj − Z1) + ξ0 ) =d 0 ∨ (Mn + ξ0),
where the equality in distribution follows from the fact that max_{1≤j≤n+1}(Zj − Z1) is a function of (ξ1, ..., ξn) only, is distributed like Mn, and is independent of ξ0. Hence, for y ≥ 0,
P(Mn+1 = y) = P(0 ∨ (Mn + ξ0) = y) = Σ_{x≥0} P(Mn = x) P(0 ∨ (x + ξ0) = y) = Σ_{x≥0} P(Mn = x) pxy.
Since the n-dependent terms are bounded, we may take the limit of both sides as n → ∞ and, using π(y) = lim_{n→∞} P(Mn = y), we find
π(y) = Σ_{x≥0} π(x) pxy.

So π satisfies the balance equations, as required. We must make sure that it also satisfies Σ_y π(y) = 1. This will be the case if g(y) is not identically equal to 1, i.e. if
g(y) = P( 0 ∨ max{Z1, Z2, ...} > y )
is not identically equal to 1. Here, we will use the assumption that Eξn = −μ < 0. From the Law of Large Numbers we have
P( lim_{n→∞} Zn/n = −μ ) = 1.
This implies that
P( lim_{n→∞} Zn = −∞ ) = 1.
But then
P( 0 ∨ max{Z1, Z2, ...} < ∞ ) = 1,
so g(y) is not identically equal to 1, and the Markov chain is positive recurrent.

It can also be shown (left as an exercise) that if Eξn > 0, then the Markov chain is transient. The case Eξn = 0 is delicate: depending on the actual distribution of ξn, the Markov chain may be transient or null recurrent.
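Loynes' construction is easy to try numerically. In the following sketch (the distribution of ξn is our own choice, with Eξn = −1/2), we sample the maximum M = 0 ∨ max{Z1, Z2, ...} (well approximated by a long partial-sum maximum, since Zn → −∞) and compare its law with the chain's long-run occupation frequencies; the two agree, as the theory predicts.

import random
from collections import Counter

# Storage chain X_{n+1} = (X_n + xi_n) v 0 with E(xi_n) = -1/2 < 0.
random.seed(2)
inc = (-1, -1, -1, 1)              # xi takes -1 w.p. 3/4, +1 w.p. 1/4

m_counts = Counter()               # samples of M = 0 v max of partial sums
for _ in range(20_000):
    z = m = 0
    for _ in range(300):
        z += random.choice(inc)
        m = max(m, z)
    m_counts[m] += 1

x, occ = 0, Counter()              # long-run occupation of the chain itself
for _ in range(2_000_000):
    x = max(x + random.choice(inc), 0)
    occ[x] += 1

for y in range(4):
    print(y, m_counts[y] / 20_000, occ[y] / 2_000_000)   # columns agree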

PART II: RANDOM WALKS

25 Random walk

The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. Our Markov chains, so far, have been processes with time-homogeneous transition probabilities. A random walk possesses an additional property, that of spatial homogeneity: the transition probability pxy should depend on x and y only through their relative positions in space. To be able to say this, we need to give more structure to the state space S, namely a structure that enables us to say that
px,y = px+z,y+z for any translation z.
More general spaces (e.g. groups) are allowed, but, here, we won't go beyond Zd, the set of d-dimensional vectors with integer coordinates. So we can let S = Zd and, given a probability distribution p(x), x ∈ Zd, a random walk is a Markov chain with time- and space-homogeneous transition probabilities given by
px,y = p(y − x).

Example 1: Let p(x) = C 2^{−|x1|−···−|xd|}, where C is such that Σ_x p(x) = 1.

Example 2: Define
p(x) = pj, if x = ej, j = 1, ..., d;  qj, if x = −ej, j = 1, ..., d;  0, otherwise,
where p1 + ··· + pd + q1 + ··· + qd = 1, and where ej is the vector that has 1 in the j-th position and 0 everywhere else. A random walk of this type is called simple random walk or nearest neighbour random walk. In dimension d = 1, besides time- and space-homogeneity, this walk enjoys the skip-free property, namely that to go from state x to state y it must pass through all intermediate states, because its value can change by at most 1 at each step. If d = 1, then p(1) = p, p(−1) = q, and p(x) = 0 otherwise, where p + q = 1. This is the drunkard's walk we've seen before.

Example 3: Define
p(x) = 1/2d, if x ∈ {±e1, ..., ±ed};  0, otherwise.

This is a simple random walk which, in addition, has equal probabilities to move from one state x to any one of its "neighbouring states" x ± e1, ..., x ± ed. We call it simple symmetric random walk. If d = 1, then p(1) = p(−1) = 1/2, and this is the symmetric drunkard's walk.

The special structure of a random walk enables us to give more detailed formulae for its behaviour. Let us denote by Sn the state of the walk at time n. It is clear that Sn can be represented as
Sn = S0 + ξ1 + ··· + ξn,
where ξ1, ξ2, ... are i.i.d. random variables in Zd with common distribution P(ξn = x) = p(x), and the initial state S0 is independent of the increments. We refer to the ξn as the increments of the random walk. Obviously,
p^{(n)}_{xy} = Px(Sn = y) = P(x + ξ1 + ··· + ξn = y) = P(ξ1 + ··· + ξn = y − x).
The random walk ξ1 + ··· + ξn is a random walk starting from 0, so we can reduce several questions to questions about this random walk.

Furthermore, notice that
P(ξ1 + ξ2 = x) = Σ_y P(ξ1 = y, ξ1 + ξ2 = x) = Σ_y p(y) p(x − y).
(When we write a sum without indicating explicitly the range of the summation variable, we will mean a sum over all possible values.) The operation in the last term is known as convolution. The convolution of two probabilities p(x), p′(x), x ∈ Zd, is defined as a new probability p ∗ p′ given by
(p ∗ p′)(x) = Σ_y p(y) p′(x − y).
The reader can easily check that
p ∗ p′ = p′ ∗ p,  p ∗ (p′ ∗ p′′) = (p ∗ p′) ∗ p′′,  p ∗ δ0 = p.
(Recall that δz is the probability distribution that assigns probability 1 to the point z.) In this notation, we have
P(ξ1 + ξ2 = x) = (p ∗ p)(x).
We will let p∗n be the convolution of p with itself n times; we easily see that
P(ξ1 + ··· + ξn = x) = p∗n(x).
One could stop here and say that we have a formula for the n-step transition probability. But this would be nonsense, because all we have done is dress the original problem in more fancy clothes: computing the n-th order convolution for all n is, in general, a notoriously hard problem. We shall resist the temptation to compute, and look at other methods.
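In dimension 1, for instance, p∗n is easy to compute numerically by repeated convolution; a sketch (with the asymmetric drunkard's walk, p = 0.7, as the step distribution):

import numpy as np
from math import comb

# n-step distribution of a random walk on Z by repeated convolution.
# Index k of `dist` stands for the point k - n after n steps, since each
# convolution extends the support by one unit on each side.
step = np.array([0.3, 0.0, 0.7])      # p(-1) = 0.3, p(0) = 0, p(+1) = 0.7
dist, n = np.array([1.0]), 10
for _ in range(n):
    dist = np.convolve(dist, step)

# Cross-check: P(S_10 = 0) should be C(10,5) p^5 q^5.
print(dist[n], comb(10, 5) * 0.7**5 * 0.3**5)   # both ~0.1029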

After all, who cares about the exact value of p∗n(x) for all n and x? Perhaps it is the case (a) that different questions need to be asked, or (b) that approximate methods are more informative than exact ones. Here are a couple of meaningful questions:
A. Will the random walk ever return to where it started from?
B. If yes, how long will that take?
C. If not, where will it go?

26 The simple symmetric random walk in dimension 1: path counting

This is a random walk with values in Z and transition probabilities pi,i+1 = pi,i−1 = 1/2 for all i ∈ Z. When the walk starts from 0 we can write
Sn = ξ1 + ··· + ξn,
where the ξn are i.i.d. random variables (random signs) with P(ξn = ±1) = 1/2. In the sequel, we shall learn some counting methods. The principle is trivial, but its practice may be hard.

First, for fixed n, look at the distribution of (ξ1, ..., ξn), a random vector with values in {−1, +1}^n. Clearly, all possible values of the vector are equally likely:
P(ξ1 = ε1, ..., ξn = εn) = 2^{−n}.
So we are dealing with a uniform distribution on the sample space {−1, +1}^n. Events A that depend only on these first n random signs are subsets of this sample space and, since there are 2^n elements in {−1, +1}^n, if we can count the number of elements of A, we can compute its probability:
P(A) = #A / 2^n,
where #A is the number of elements of A.

We can thus think of the elements of the sample space as paths: to each initial position and to each sequence of n signs, we should assign a path of length n. So if the initial position is S0 = 0 and the sequence of signs is (+1, +1, −1, +1, +1), then we have a path in the plane that joins the points
(0,0) → (1,1) → (2,2) → (3,1) → (4,2) → (5,3).

[Figure: the path of length 5 just described, with the signs +, +, −, +, + marked on its segments.]

As a warm-up, consider the following:

Example: Assume that S0 = 0, and compute the probability of the event
A = {S2 ≥ 0, S4 = −1, S6 < 0, S7 < 0}.
The paths comprising this event are shown below. We can easily count that there are 6 paths in A, while there are 2^7 paths of length 7 in all. So
P(A) = 6/2^7 = 6/128 = 3/64 ≈ 0.047.

[Figure: the 6 paths of length 7 comprising the event A.]

26.1 Paths that start and end at specific points

Now consider a path that starts from a specific point and ends up at a specific point. We use the notation
{(m,x) → (n,y)} := {all paths that start at (m,x) and end up at (n,y)}.
Let us first show that:

Lemma 14. The number of paths that start from (0,0) and end up at (n,i) is given by:
#{(0,0) → (n,i)} = \binom{n}{(n+i)/2}.
If n + i is not an even number, then there is no path that starts from (0,0) and ends at (n,i); in this case we interpret \binom{n}{(n+i)/2} as 0. More generally, the number of paths that start from (m,x) and end up at (n,y) is given by:
#{(m,x) → (n,y)} = \binom{n−m}{(n−m+y−x)/2}.  (23)

[Figure: the lightly shaded region contains all paths of length n that start at (0,0) (there are 2^n of them); the darker region shows all paths that start at (0,0) and end up at the specific point (n,i).]

Proof.

Let u be the number of +'s in such a path (the number of upward segments) and d the number of −'s (the number of downward segments). Then
u + d = n,  u − d = i.
Hence u = (n+i)/2, d = (n−i)/2. So if we know where the path starts from and where it ends at, we know how many times it has gone up and how many times down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? In exactly \binom nu ways, and this proves the formula. Notice that u = (n+i)/2 must be an integer. This is true if n + i is even or, equivalently, if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0,0) to (n,i); for example (look at the figure), there is no path that goes from (0,0) to (3,2). We shall thus define \binom nu to be zero if u is not an integer. More generally, the number of paths that start from (m,x) and end up at (n,y) equals the number of paths that start from (0,0) and end up at (n−m, y−x). So, applying the previous formula, we obtain (23).
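The count in Lemma 14 is easy to confirm by brute-force enumeration (a small sketch):

from itertools import product
from math import comb

# Verify Lemma 14: the number of +/-1 paths of length n from 0 to i
# equals C(n, (n+i)/2) (and is 0 when n + i is odd).
n = 8
for i in range(-n, n + 1):
    count = sum(1 for signs in product((-1, 1), repeat=n) if sum(signs) == i)
    formula = comb(n, (n + i) // 2) if (n + i) % 2 == 0 else 0
    assert count == formula
print("Lemma 14 verified for n =", n)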

26.2 Paths that start and end at specific points and do not touch zero at all

Lemma 15. Suppose that x, y > 0. The number of paths (m,x) → (n,y) that never become zero (i.e. they do not touch the horizontal axis) is given by:
#{(m,x) → (n,y); remain > 0} = \binom{n−m}{\frac12(n−m+y−x)} − \binom{n−m}{\frac12(n−m−y−x)}.  (24)

Proof.

[Figure: the shaded region contains all possible paths that start at (m,x) and end up at the specific point (n,y) and, in addition, never touch the horizontal axis.]

Consider, instead, a path that starts at (m,x) and ends up at the point (n,y) but DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, called reflection, which is useful here. Note that to each such path there corresponds another path, obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m,x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.

[Figure: a path from (m,x) to (n,y) and its reflection around the horizontal line, ending at (n,−y). Notice that the reflection starts only at the point where the path touches the horizontal line.]

Note that if we know the reflected path, we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m,x) and ends at (n,−y), and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y:
#{(m,x) → (n,y); touch zero} = #{(m,x) → (n,−y)} = \binom{n−m}{\frac12(n−m−y−x)}.
So, for x > 0, y > 0,
#{(m,x) → (n,y); remain > 0} = #{(m,x) → (n,y)} − #{(m,x) → (n,y); touch zero}
= \binom{n−m}{\frac12(n−m+y−x)} − \binom{n−m}{\frac12(n−m−y−x)}.
Corollary 14. For any y > 0,
#{(1,1) → (n,y); remain > 0} = \frac yn · #{(0,0) → (n,y)}.  (25)

Proof. Let m = 1, x = 1 in (24):
#{(1,1) → (n,y); remain > 0} = \binom{n−1}{\frac12(n+y)−1} − \binom{n−1}{\frac12(n−y)−1}
= \frac{(n−1)!}{(\frac{n+y}{2}−1)!\,(\frac{n−y}{2})!} − \frac{(n−1)!}{(\frac{n+y}{2})!\,(\frac{n−y}{2}−1)!}
= \frac{n!}{(\frac{n+y}{2})!\,(\frac{n−y}{2})!}\left(\frac1n\,\frac{n+y}{2} − \frac1n\,\frac{n−y}{2}\right)
= \frac yn \binom{n}{\frac{n+y}{2}},
and the latter term equals the total number of paths from (0,0) to (n,y).

Corollary 15. The number of paths from (m,x) to (n,y) that do not touch level z < min(x,y) is given by:
#{(m,x) → (n,y); remain > z} = \binom{n−m}{\frac12(n−m+y−x)} − \binom{n−m}{\frac12(n−m−y−x)+z}.  (26)

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have
#{(m,x) → (n,y); remain > z} = #{(m, x−z) → (n, y−z); remain > 0},
and so the formula is obtained by an application of (24).

Corollary 16. The number of paths from (0,0) to (n,y), where y ≥ 0, that remain ≥ 0 is given by:
#{(0,0) → (n,y); remain ≥ 0} = \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2}+1}.  (27)

Proof.
#{(0,0) → (n,y); remain ≥ 0} = #{(0,0) → (n,y); remain > −1}
= #{(0,1) → (n, y+1); remain > 0}
= \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n−y}{2}−1} = \binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2}+1},
where we used (26) with z = −1.

Corollary 17 (Catalan numbers). The number of paths from (0,0) to (n,0) that remain ≥ 0 is given by:
#{(0,0) → (n,0); remain ≥ 0} = \frac{1}{n+1}\binom{n+1}{n/2}.

Proof. Apply (27) with y = 0:
#{(0,0) → (n,0); remain ≥ 0} = \binom{n}{n/2} − \binom{n}{\frac n2+1} = \frac{1}{n+1}\binom{n+1}{n/2},
where the latter calculation is as in the proof of (25).

Corollary 18. The number of paths that start from (0,0), have length n, and remain nonnegative is given by:
#{(0,0) → (n,•); remain ≥ 0} = \binom{n}{\lceil n/2\rceil},  (28)
where ⌈x⌉ denotes the smallest integer that is ≥ x.

Proof. To see this, just use (27) and sum over all y ≥ 0. If n = 2m, then we need to set y = 2z in (27) (otherwise (27) is just zero), and so we have
#{(0,0) → (2m,•); remain ≥ 0} = Σ_{z≥0} \left[ \binom{2m}{m+z} − \binom{2m}{m+z+1} \right]
= \left[\binom{2m}{m} − \binom{2m}{m+1}\right] + \left[\binom{2m}{m+1} − \binom{2m}{m+2}\right] + ··· ;
all the terms cancel with one another except the first, and the sum equals
\binom{2m}{m} = \binom{n}{n/2}.
If n = 2m+1, then we need to set y = 2z+1 in (27) (otherwise (27) is just zero), and so we have
#{(0,0) → (2m+1,•); remain ≥ 0} = Σ_{z≥0} \left[ \binom{2m+1}{m+z+1} − \binom{2m+1}{m+z+2} \right] = \binom{2m+1}{m+1} = \binom{n}{\lceil n/2\rceil}.

[Figure: two shaded regions of length-2m paths.] We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?
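Corollaries 17 and 18 can likewise be confirmed by enumeration (a small sketch):

from itertools import product
from math import comb, ceil

# Check Corollaries 17 and 18 by enumerating all +/-1 paths of length n.
for n in range(1, 13):
    paths = list(product((-1, 1), repeat=n))
    def nonneg(p):                      # partial sums S_1..S_n all >= 0
        s = 0
        for step in p:
            s += step
            if s < 0:
                return False
        return True
    # Corollary 18: paths of length n that remain nonnegative.
    assert sum(map(nonneg, paths)) == comb(n, ceil(n / 2))
    if n % 2 == 0:
        # Corollary 17 (Catalan): nonnegative paths ending at 0.
        assert sum(1 for p in paths if nonneg(p) and sum(p) == 0) \
               == comb(n + 1, n // 2) // (n + 1)
print("Corollaries 17 and 18 verified")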

27 The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of a certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1 Distribution after n steps

Lemma 16.
P0(Sn = y) = \binom{n}{(n+y)/2} 2^{−n}.

In particular,
P(Sn = y | Sm = x) = \binom{n−m}{(n−m+y−x)/2} 2^{−(n−m)},
and, if n is even,
u(n) := P0(Sn = 0) = \binom{n}{n/2} 2^{−n}.  (29)

Proof. We have
P0(Sn = y) = #{(0,0) → (n,y)} × 2^{−n} and P(Sn = y | Sm = x) = #{(m,x) → (n,y)} × 2^{−(n−m)},
and we apply (23).

27.2 The ballot theorem

Theorem 25. The probability that, given that S0 = 0 and Sn = y > 0, the random walk never becomes zero up to time n is given by:
P0(S1 > 0, S2 > 0, ..., Sn > 0 | Sn = y) = \frac yn.  (30)
This is called the ballot theorem, and the reason will be given in Section 29. Such a simple answer begs for a physical/intuitive explanation; but, for now, we shall just compute it.

Proof. The left side of (30) equals
\frac{#{paths (1,1) → (n,y) that do not touch zero} × 2^{−n}}{#{paths (0,0) → (n,y)} × 2^{−n}} = \frac yn,
by Corollary 14.

27.3 Some conditional probabilities

Lemma 17. If n, y ≥ 0 are both even or both odd, then
P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = 1 − \frac{P0(Sn = y+2)}{P0(Sn = y)} = \frac{y+1}{\frac12(n+y)+1}.
In particular, if n is even,
P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = 0) = \frac{1}{\frac n2 + 1}.

Proof. We use (27):
P0(S1 ≥ 0, ..., Sn ≥ 0 | Sn = y) = \frac{#{(0,0) → (n,y); remain ≥ 0}}{#{(0,0) → (n,y)}}
= \frac{\binom{n}{\frac{n+y}{2}} − \binom{n}{\frac{n+y}{2}+1}}{\binom{n}{\frac{n+y}{2}}}
= 1 − \frac{P0(Sn = y+2)}{P0(Sn = y)}.
This is further equal to
1 − \frac{n!}{(\frac{n+y}{2}+1)!\,(\frac{n−y}{2}−1)!} · \frac{(\frac{n+y}{2})!\,(\frac{n−y}{2})!}{n!}
= 1 − \frac{\frac{n−y}{2}}{\frac{n+y}{2}+1} = \frac{y+1}{\frac{n+y}{2}+1}.

27.4 Some remarkable identities

Theorem 26. If n is even,
P0(S1 ≥ 0, ..., Sn ≥ 0) = P0(S1 ≠ 0, ..., Sn ≠ 0) = P0(Sn = 0).

Proof. For even n we have, using (28) and (29),
P0(S1 ≥ 0, ..., Sn ≥ 0) = #{(0,0) → (n,•); remain ≥ 0} · 2^{−n} = \binom{n}{n/2} 2^{−n} = P0(Sn = 0).
We next have
P0(S1 ≠ 0, ..., Sn ≠ 0) = P0(S1 > 0, ..., Sn > 0) + P0(S1 < 0, ..., Sn < 0)  (a)
= 2 P0(S1 > 0, ..., Sn > 0)  (b)
= 2 P0(ξ1 = 1) P0(S1 > 0, ..., Sn > 0 | ξ1 = 1)  (c)
= P0(1 + ξ2 > 0, 1 + ξ2 + ξ3 > 0, ..., 1 + ξ2 + ··· + ξn > 0)  (d)
= P0(ξ2 ≥ 0, ξ2 + ξ3 ≥ 0, ..., ξ2 + ··· + ξn ≥ 0)  (e)
= P0(S1 ≥ 0, ..., Sn−1 ≥ 0)  (f)
= P0(S1 ≥ 0, ..., Sn ≥ 0),  (g)
where (a) is obvious, (b) follows from symmetry, (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1, (d) follows from P0(ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning, (e) follows from the fact that if m is an integer, then 1 + m > 0 is equivalent to m ≥ 0, (f) follows from the replacement of (ξ2, ξ3, ...) by (ξ1, ξ2, ...), which is legitimate because ξ1 is independent of ξ2, ξ3, ... and all are identically distributed, and, finally, (g) holds because Sn−1 ≥ 0 means Sn−1 > 0 (Sn−1 cannot take the value 0 since n − 1 is odd), and Sn−1 > 0 implies Sn ≥ 0.

27.5 First return to zero

In (29), we computed, for even n, the probability u(n) = P0(Sn = 0) that the walk (started at 0) attains the value 0 in n steps. We now want to find the probability f(n) that the walk will return to 0 for the first time in n steps.

Theorem 27. Let n be even. Then
f(n) := P0(S1 ≠ 0, ..., Sn−1 ≠ 0, Sn = 0) = \frac{u(n)}{n−1}.

Proof. Since the random walk cannot change sign before becoming zero,
f(n) = P0(S1 > 0, ..., Sn−1 > 0, Sn = 0) + P0(S1 < 0, ..., Sn−1 < 0, Sn = 0),
and, by symmetry, the two terms must be equal. So
f(n) = 2 P0(S1 > 0, ..., Sn−1 > 0, Sn = 0).
Now, given that Sn = 0, we either have Sn−1 = 1 or Sn−1 = −1, with equal probability; so
P0(Sn−1 > 0, Sn = 0) = P0(Sn = 0) P(Sn−1 > 0 | Sn = 0) = \frac12 u(n).
Also, Sn−1 > 0, Sn = 0 means Sn−1 = 1, so, by the Markov property at time n − 1 (given the present, the past is independent of the future),
f(n) = 2 u(n) · \frac12 · P0(S1 > 0, ..., Sn−2 > 0 | Sn−1 = 1),
and the last probability can be computed from the ballot theorem (see (30)): it is equal to 1/(n−1). So
f(n) = \frac{u(n)}{n−1}.
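A quick enumeration check of Theorem 27 (our sketch):

from itertools import product
from math import comb

# Check f(n) = u(n)/(n-1): first return to 0 at time n (n even).
for n in (2, 4, 6, 8, 10):
    first_return = 0
    for signs in product((-1, 1), repeat=n):
        s, path = 0, []
        for step in signs:
            s += step
            path.append(s)
        if path[-1] == 0 and all(v != 0 for v in path[:-1]):
            first_return += 1
    f = first_return / 2**n
    u = comb(n, n // 2) / 2**n
    assert abs(f - u / (n - 1)) < 1e-12
print("Theorem 27 verified")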

28 The reflection principle for a simple random walk in dimension 1

The reflection principle says the following. Fix a state a and put a two-sided mirror at a. If a particle is to the right of a, then you see its image on the left, and if the particle is to the left of a, then its mirror image is on the right. So if the particle is in position s, then its mirror image is in position s* = 2a − s, because s − a = a − s*. It is clear that if a particle starts at 0 and performs a simple symmetric random walk, then its mirror image starts at 2a and performs a simple symmetric random walk. This is not so interesting. What is more interesting is this: run the particle till it hits a (it will, for sure; see Theorem 35) and after that consider its mirror image S*n = 2a − Sn. In other words, define
S*n = Sn, if n < Ta;  2a − Sn, if n ≥ Ta,
where
Ta = inf{n ≥ 0 : Sn = a}
is the first hitting time of a.

Theorem 28. If (Sn) is a simple symmetric random walk started at 0, then so is (S*n).

Proof. By the strong Markov property, the process (S_{Ta+m} − a, m ≥ 0) is a simple symmetric random walk, independent of the past before Ta. But a symmetric random walk has the same law as its negative. So Sn − a can be replaced by a − Sn for n ≥ Ta, and the resulting process, called (S*n, n ∈ Z+), has the same law as (Sn, n ∈ Z+).

28.1 Distribution of hitting time and maximum

We can take advantage of this principle to establish the distribution function of Ta:

Theorem 29. Let (Sn, n ≥ 0) be a simple symmetric random walk and a > 0. Then
P0(Ta ≤ n) = 2P0(Sn ≥ a) − P0(Sn = a) = P0(|Sn| ≥ a) − \frac12 P0(|Sn| = a).

Proof. If Sn ≥ a, then Ta ≤ n, so
P0(Sn ≥ a) = P0(Ta ≤ n, Sn ≥ a).
In the last event we have n ≥ Ta, so we may replace Sn by 2a − S*n (Theorem 28):
P0(Ta ≤ n, Sn ≥ a) = P0(Ta ≤ n, 2a − S*n ≥ a) = P0(Ta ≤ n, S*n ≤ a).
Since (S*n) is also a simple symmetric random walk, which hits a at the same time Ta, this equals P0(Ta ≤ n, Sn ≤ a). Trivially,
P0(Ta ≤ n) = P0(Ta ≤ n, Sn ≤ a) + P0(Ta ≤ n, Sn > a),
and the last term equals P0(Sn > a), because Sn > a implies Ta ≤ n. Combining the last displays gives
P0(Ta ≤ n) = P0(Sn ≥ a) + P0(Sn > a) = P0(Sn ≥ a) + P0(Sn ≥ a) − P0(Sn = a).
Finally, observing that Sn has the same distribution as −Sn, we get that this is also equal to P0(|Sn| ≥ a) − \frac12 P0(|Sn| = a).

Define now the running maximum
Mn = max(S0, S1, ..., Sn).
Then, as a corollary to the above result, we obtain:

Corollary 19.
P0(Mn ≥ x) = P0(Tx ≤ n) = 2P0(Sn ≥ x) − P0(Sn = x) = P0(|Sn| ≥ x) − \frac12 P0(|Sn| = x).

Proof. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n, which is equivalent to Tx ≤ n, and use Theorem 29.

We can do more: we can derive the joint distribution of Mn and Sn:

Theorem 30.
P0(Mn < x, Sn = y) = P0(Sn = y) − P0(Sn = 2x − y),  x > y.  (31)

Proof. Since Mn < x is equivalent to Tx > n, we have
P0(Mn < x, Sn = y) = P0(Tx > n, Sn = y) = P0(Sn = y) − P0(Tx ≤ n, Sn = y).
If Tx ≤ n, then, by applying the reflection principle (Theorem 28), we can replace Sn by 2x − Sn in the last part of the last display and get
P0(Tx ≤ n, Sn = y) = P0(Tx ≤ n, Sn = 2x − y).
But if x > y, then 2x − y > x, and so {Sn = 2x − y} ⊂ {Tx ≤ n}, which results into
P0(Tx ≤ n, Sn = y) = P0(Sn = 2x − y).
This, combined with (31)'s first display, gives the result.

29 Urns and the ballot theorem

Suppose that an urn⁸ contains n items, of which a are coloured azure and b black. A sample from the urn is a sequence η = (η1, ..., ηn), where ηi indicates the colour of the i-th item picked. The set of values of η contains \binom na elements, and η is uniformly distributed in this set. The original ballot problem, posed in 1887⁹, asks the following: if we start picking the items from the urn one by one without replacement, what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved¹⁰ in the same year. The answer is very simple:

Theorem 31. If an urn contains a azure items and b = n − a black items, with a ≥ b, then the probability that, during sampling without replacement, the number of selected azure items is always ahead of the black ones equals (a − b)/n.

⁸ An urn is a vessel, usually a vase furnished with a foot or pedestal, employed for different purposes, as for holding liquids, ornamental objects, the ashes of the dead after cremation, and anciently for holding ballots to be drawn. It is the latter use we are interested in in Probability and Statistics.
⁹ Bertrand, J. (1887). Solution d'un problème. Compt. Rend. Acad. Sci. Paris 105, 369.
¹⁰ André, D. (1887). Solution directe du problème résolu par M. Bertrand. Compt. Rend. Acad. Sci. Paris 105, 436-437.

We shall call (η1, ..., ηn) an urn process with parameters n, a. An urn process can be realised quite easily by a random walk:

Theorem 32. Consider a simple symmetric random walk (Sn, n ≥ 0) with values in Z and let y be a positive integer. Then, conditional on the event {Sn = y}, the sequence (ξ1, ..., ξn) of the increments is an urn process with parameters n and a = (n+y)/2.

Proof. If Sn = y, then a out of the n increments will be +1 and b will be −1, where a + b = n and a − b = y. The values of each ξi are ±1, so the number of possible values of (ξ1, ..., ξn), conditional on {Sn = y}, is \binom na. It remains to show that all values are equally likely. Let ε1, ..., εn be elements of {−1, +1} such that ε1 + ··· + εn = y. Then, using the definition of conditional probability and Lemma 16,
P0(ξ1 = ε1, ..., ξn = εn | Sn = y) = \frac{P0(ξ1 = ε1, ..., ξn = εn)}{P0(Sn = y)} = \frac{2^{−n}}{\binom{n}{(n+y)/2} 2^{−n}} = \frac{1}{\binom na}.
Thus, conditional on {Sn = y}, the sequence (ξ1, ..., ξn) is uniformly distributed in a set with \binom na elements, i.e. it is an urn process with parameters n and a.

Proof of Theorem 31. Since the 'azure' items correspond to + signs (respectively, the 'black' items correspond to − signs), we can translate the original question into a question about the random walk. Letting a, b be defined by a + b = n, a − b = y (always assuming, without any loss of generality, that a ≥ b = n − a), the event that the azure items are always ahead of the black ones during sampling is precisely the event {S1 > 0, ..., Sn > 0} that the random walk remains positive, conditional on {Sn = y}. But in (30) we computed that
P0(S1 > 0, ..., Sn > 0 | Sn = y) = \frac yn.
Since y = a − (n − a) = a − b, the result follows.

30 The asymmetric simple random walk in dimension 1

Consider now a simple random walk with values in Z, but which is not necessarily symmetric. (The word "asymmetric" will always mean "not necessarily symmetric".) So
pi,i+1 = p,  pi,i−1 = q = 1 − p,  i ∈ Z.
We have
P0(Sn = k) = \binom{n}{(n+k)/2} p^{(n+k)/2} q^{(n−k)/2},  −n ≤ k ≤ n.

30.1 First hitting time analysis

We will use probabilistic and analytical methods to compute the distribution of the hitting times
Tx = inf{n ≥ 0 : Sn = x},  x ∈ Z.
This random variable takes values in {0, 1, 2, ...} ∪ {∞}. (We tacitly assume that S0 = 0.)

Lemma 18. For all x, y ∈ Z and all t ≥ 0,
Px(Ty = t) = P0(Ty−x = t).

Proof. This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x, only the relative position of y with respect to x is needed. In view of this, it is no loss of generality to consider the random walk starting from 0.

Lemma 19. Consider a simple random walk starting from 0. If x > 0, then the summands in
Tx = T1 + (T2 − T1) + ··· + (Tx − Tx−1)
are i.i.d. random variables.

Proof. To prove this, we make essential use of the skip-free property, namely the obvious fact that a simple random walk starting from 0 must pass through levels 1, 2, ..., x − 1 in order to reach x. The rest is a consequence of the strong Markov property.

Corollary 20. The process (Tx, x ≥ 0) is also a random walk.

In view of Lemma 19, we need to know the distribution of T1. We will approach this using probability generating functions. Let
ϕ(s) := E0 s^{T1}.

Theorem 33. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level 1 is reached is given by:
ϕ(s) = E0 s^{T1} = \frac{1 − \sqrt{1 − 4pqs^2}}{2qs}.

Proof. Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. If S1 = 1, then T1 = 1. If S1 = −1, then T1 is the sum of three times: one unit of time spent to go from 0 to −1, plus a time T1′ required to go from −1 to 0 for the first time, plus a further time T1′′ required to go from 0 to +1 for the first time. See the figure below.

[Figure: a trajectory with S1 = −1; the first passage from −1 back to 0 takes time T1′, and the subsequent first passage from 0 to 1 takes a further time T1′′.]

Indeed, if S1 = −1 then, in order to hit 1, the walk must first hit 0 (it takes time T1′ to do this) and then 1 (it takes a further time T1′′). Hence
T1 = 1, if S1 = 1;  1 + T1′ + T1′′, if S1 = −1.
From the strong Markov property, T1′ is independent of T1′′ conditional on S1 = −1, and each has the same distribution as T1. This can be used as follows:
ϕ(s) = E0[ s·1(S1 = 1) + s^{1+T1′+T1′′}·1(S1 = −1) ] = sP(S1 = 1) + sE0[ s^{T1′} s^{T1′′} | S1 = −1 ] P(S1 = −1),
so that
ϕ(s) = ps + qs ϕ(s)^2.
The formula is obtained by solving the quadratic and picking the root that is a probability generating function (i.e. the one with ϕ(0) = 0).

Corollary 21. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level −1 is reached is given by:
E0 s^{T−1} = \frac{1 − \sqrt{1 − 4pqs^2}}{2ps}.

Proof. The first time n such that Sn = −1 is the first time n such that −Sn = 1. But −Sn is a simple random walk with upward transition probability q and downward transition probability p. Hence the previous formula applies with the roles of p and q interchanged.

Corollary 22. Consider a simple random walk starting from 0. Then the probability generating function of the first time that level x ∈ Z is reached is given by:
E0 s^{Tx} = (E0 s^{T1})^x = \left( \frac{1 − \sqrt{1 − 4pqs^2}}{2qs} \right)^x, if x > 0;
E0 s^{Tx} = (E0 s^{T−1})^{|x|} = \left( \frac{1 − \sqrt{1 − 4pqs^2}}{2ps} \right)^{|x|}, if x < 0.  (32)

Proof. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21.
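The formula for ϕ(s) is easy to test against simulation (a sketch; p = 0.6 and s = 0.5 are our choices, and p > q guarantees T1 < ∞, so the simulation terminates):

import random
from math import sqrt

# Compare E0 s^{T_1} from Theorem 33 with a Monte Carlo estimate.
random.seed(3)
p, q, s = 0.6, 0.4, 0.5

def T1():                      # first hitting time of level 1
    x, n = 0, 0
    while x != 1:
        x += 1 if random.random() < p else -1
        n += 1
    return n

mc = sum(s ** T1() for _ in range(100_000)) / 100_000
exact = (1 - sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)
print(mc, exact)               # both ~0.32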

30.2 First return to the origin

Now let us ask: if the particle starts at 0, how long will it take till it first returns to 0? We are talking about the stopping time
T0′ := inf{n ≥ 1 : Sn = 0}.
(Pay attention: it is important to write n ≥ 1 inside the set!)

Theorem 34. Consider a simple random walk starting from 0. Then the probability generating function of the first return time to 0 is given by:
E0 s^{T0′} = 1 − \sqrt{1 − 4pqs^2}.

Proof. We again do first-step analysis:
E0 s^{T0′} = E0[ s^{T0′} 1(S1 = −1) ] + E0[ s^{T0′} 1(S1 = 1) ] = q E0[ s^{T0′} | S1 = −1 ] + p E0[ s^{T0′} | S1 = 1 ].
If S1 = −1, then T0′ equals 1 plus the time required for the walk to hit 0 starting from −1; the latter is, in distribution, the same as the time required for the walk to hit 1 starting from 0. Similarly, if S1 = 1, then T0′ equals 1 plus a time whose distribution is that of the time required for the walk to hit −1 starting from 0. Hence
E0 s^{T0′} = q E0(s^{1+T1}) + p E0(s^{1+T−1}) = qs \frac{1 − \sqrt{1 − 4pqs^2}}{2qs} + ps \frac{1 − \sqrt{1 − 4pqs^2}}{2ps} = 1 − \sqrt{1 − 4pqs^2}.

30.3 The distribution of the first return to the origin

We wish to find the exact distribution of the first return time to the origin. For a symmetric random walk, this was found in §27.5: P0(T0′ = n) = f(n) = u(n)/(n−1), if n is even. For the general case, we have computed the probability generating function E0 s^{T0′} (see Theorem 34). The tool we need here is Taylor's theorem for the function f(x) = (1 + x)^α, where α is a real number. If α = n is a positive integer, then
(1 + x)^n = Σ_{k=0}^n \binom nk x^k,
by the binomial theorem. If we define \binom nk to be equal to 0 for k > n, then we can omit the upper limit in the sum, and simply write
(1 + x)^n = Σ_{k≥0} \binom nk x^k.

Recall that
    C(n,k) = n!/(k!(n − k)!) = n(n − 1) · · · (n − k + 1) / k!.
It turns out that the formula is formally true for general α, and it looks exactly the same:
    (1 + x)^α = Σ_{k≥0} C(α,k) x^k,
as long as we define
    C(α,k) = α(α − 1) · · · (α − k + 1) / k!.
The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. For example, we will show that
    (1 − x)^{−1} = 1 + x + x² + x³ + · · ·,
which is of course the all too-familiar geometric series. Indeed, by definition,
    (1 − x)^{−1} = Σ_{k≥0} C(−1,k) (−x)^k.
But
    C(−1,k) = (−1)(−2) · · · (−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,
so
    (1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^{2k} x^k = Σ_{k≥0} x^k,
as needed.

As another example, let us compute √(1 − x). We have
    √(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2,k) (−1)^k x^k.
We can write this in more familiar symbols if we observe that
    C(1/2,k) (−1)^k = −(1/(2k − 1)) C(2k,k) 4^{−k},
and this is shown by direct computation. So
    √(1 − x) = −Σ_{k≥0} (1/(2k − 1)) C(2k,k) (x/4)^k = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k,k) (x/4)^k.
We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs²):
    E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) C(2k,k) (pq)^k s^{2k}.

But, by the definition of E0 s^{T0′},
    E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.
Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps) and then
    P0(T0′ = 2k) = (1/(2k − 1)) C(2k,k) p^k q^k = (1/(2k − 1)) P0(S_{2k} = 0).

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that
    P0( lim_{n→∞} Sn/n = p − q ) = 1.
• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence it is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.
We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to the Markov chain theory. The key fact needed here is:
    P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},
which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so
    1 − 4pq = 1 − 4p + 4p² = (1 − 2p)²,  hence  √(1 − 4pq) = |1 − 2p| = |p − q|.
So
    P0(Tx < ∞) = ((1 − |p − q|)/(2q))^x       if x > 0,
    P0(Tx < ∞) = ((1 − |p − q|)/(2p))^{|x|}    if x < 0.     (33)
This formula can be written more neatly as follows:

Theorem 35.
    P0(Tx < ∞) = ((p/q) ∧ 1)^x       if x > 0,
    P0(Tx < ∞) = ((q/p) ∧ 1)^{|x|}    if x < 0.
In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z.
Proof. If p > q then |p − q| = p − q, and so (33) gives P0(Tx < ∞) = (2q/2q)^x = 1 if x > 0 and (2q/2p)^{|x|} = (q/p)^{|x|} if x < 0, which is what we want. If p < q then |p − q| = q − p and, again, we obtain what we are looking for.
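Theorem 35 is easy to test by simulation. The sketch below is mine, not part of the notes: runs are truncated at a finite horizon, and, since the walk drifts downwards for the chosen p = 0.3, a run is abandoned early once the walk is far below the target level x.

    import random

    def hit_prob(x, p, trials=20_000, horizon=5_000):
        # Monte Carlo estimate of P_0(T_x < infinity), for x > 0 and a
        # downward-drifting walk (p < 1/2); the early stop is then safe.
        hits = 0
        for _ in range(trials):
            s = 0
            for _ in range(horizon):
                s += 1 if random.random() < p else -1
                if s == x:
                    hits += 1
                    break
                if s < x - 60:   # reaching x from here is astronomically unlikely
                    break
        return hits / trials

    p, q, x = 0.3, 0.7, 3
    print(hit_prob(x, p), (min(p / q, 1.0)) ** x)   # both close to (3/7)^3
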

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.
Proof. We know that it cannot be positive recurrent. The last part of Theorem 35 tells us that it is recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then limn→∞ Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by
    M∞ = sup(S0, S1, S2, . . .).

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:
    P0(M∞ ≥ x) = (p/q)^x,  x ≥ 0.
Proof. If we let Mn = max(S0, . . . , Sn), then the sequence Mn is nondecreasing. Indeed, every nondecreasing sequence has a limit, except that the limit may be ∞; but here it is < ∞, with probability 1, because p < 1/2. So we have M∞ = limn→∞ Mn, and hence
    P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x,  x ≥ 0.
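The geometric law of the overall maximum is also easy to see empirically. The following sketch is my own illustration (the choices p = 0.3, horizon 1000 and 10 000 samples are arbitrary); the horizon is long enough that, with drift p − q < 0, the running maximum has essentially stopped growing.

    import random

    def overall_max(p, horizon=1_000):
        # with p < 1/2 the walk drifts to -infinity, so after `horizon`
        # steps the running maximum has almost surely been attained
        s = m = 0
        for _ in range(horizon):
            s += 1 if random.random() < p else -1
            if s > m:
                m = s
        return m

    p, q = 0.3, 0.7
    samples = [overall_max(p) for _ in range(10_000)]
    for x in range(5):
        emp = sum(m >= x for m in samples) / len(samples)
        print(x, round(emp, 4), round((p / q) ** x, 4))   # empirical vs (p/q)^x
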

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return, with probability 1, to the point where it started from. Otherwise, this probability is always strictly less than 1: unless p = 1/2, with some positive probability the walk will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in Z, and let
    T0′ = inf{n ≥ 1 : Sn = 0}
be the first return time to 0. Then
    P0(T0′ < ∞) = 2(p ∧ q).
Proof. The probability generating function of T0′ was obtained in Theorem 34. But
    P0(T0′ < ∞) = lim_{s↑1} E s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.
If p = q this gives 1. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. In all cases, 2(p ∧ q).

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by
    Ji = Σ_{n=1}^∞ 1(Sn = i),     (34)
a random variable first considered in Section 10, where it was also shown to be geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J0 of visits to state 0 is geometrically distributed:
    P0(J0 ≥ k) = (2(p ∧ q))^k,  k = 0, 1, 2, . . . ,
    E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).
Proof. In Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric, meaning that
    P0(J0 ≥ k) = P0(J0 ≥ 1)^k,  k = 0, 1, 2, . . . .
But J0 ≥ 1 if and only if the first return time to zero is finite, so
    P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),
where the last probability was computed in Theorem 37. This proves the first formula. For the second formula we have
    E0 J0 = Σ_{k=1}^∞ P0(J0 ≥ k) = Σ_{k=1}^∞ (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),
this being a geometric series.

Remark 1: By spatial homogeneity, we obviously have that the same formulae apply for any i:
    Pi(Ji ≥ k) = (2(p ∧ q))^k,  k = 0, 1, 2, . . . .
Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case, Pi(Ji ≥ k) = 1 for all k, meaning that Pi(Ji = ∞) = 1, as already known.

We now look at the number of visits to some state x if we start from a different state.

Theorem 39. Consider a random walk with values in Z, and let x ≠ 0. If p ≤ 1/2 then
    P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1},  k ≥ 1,      E0 Jx = (p/q)^{x∨0} / (1 − 2p),
while if p ≥ 1/2 then
    P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1},  k ≥ 1,    E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q).
Proof. Suppose x > 0. Then
    P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),
simply because if Jx ≥ k ≥ 1 then Tx < ∞. Apply now the strong Markov property at Tx:
    P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).
The reason that we replaced k by k − 1 is because, given that Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain
    P0(Jx ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^{k−1},  k ≥ 1.
If p ≤ 1/2 then this gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}; if p ≥ 1/2 then this gives P0(Jx ≥ k) = (2q)^{k−1}. We repeat the argument for x < 0 and find
    P0(Jx ≥ k) = ((q/p) ∧ 1)^{|x|} (2(p ∧ q))^{k−1},  k ≥ 1.
If p ≥ 1/2 then this gives P0(Jx ≥ k) = (q/p)^{|x|} (2q)^{k−1}; if p ≤ 1/2 then this gives P0(Jx ≥ k) = (2p)^{k−1}. As for the expectation, we apply E0 Jx = Σ_{k≥1} P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2. Consider the case p < 1/2. Then, for x > 0, E0 Jx drops down to zero geometrically fast as x tends to infinity. But for x < 0 we have E0 Jx = 1/(1 − 2p) regardless of x. In other words, if the random walk "drifts down" (i.e. p < 1/2), then, starting from 0, it will visit any point below 0 exactly the same number of times on the average.
Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae:
    Py(Jx ≥ k) = P0(J_{x−y} ≥ k),   Ey Jx = E0 J_{x−y}.
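The expectations in Theorem 39 can be checked against the identity E0 Jx = Σ_{n≥1} P0(Sn = x), summing exact binomial probabilities. This is a sketch of mine; the truncation at N = 1000 steps and the choice p = 0.4 are arbitrary.

    from math import comb

    def expected_visits(x, p, N=1_000):
        # E_0 J_x = sum over n >= 1 of P_0(S_n = x); the tail beyond N is
        # negligible here, since P_0(S_n = x) decays like (2*sqrt(p*q))^n
        q = 1 - p
        total = 0.0
        for n in range(1, N + 1):
            if (n + x) % 2 == 0 and abs(x) <= n:
                k = (n + x) // 2          # number of up-steps
                total += comb(n, k) * p**k * q**(n - k)
        return total

    p, q = 0.4, 0.6
    for x in (-3, -1, 2, 5):              # x != 0: Theorem 39
        closed = (p / q) ** max(x, 0) / (1 - 2 * p)
        print(x, round(expected_visits(x, p), 4), round(closed, 4))
    # x = 0 is Theorem 38 instead (the visit at time 0 is not counted):
    print(0, round(expected_visits(0, p), 4), round(2 * p / (1 - 2 * p), 4))
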

32 Duality

Consider a random walk Sk = Σ_{i=1}^k ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (Ŝk, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by
    Ŝk := Sn − S_{n−k},  0 ≤ k ≤ n.

Theorem 40. (Ŝk, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).
Proof. Use the obvious fact that (ξ1, ξ2, . . . , ξn) has the same distribution as (ξn, ξ_{n−1}, . . . , ξ1).

Thus, every time we have a probability for Ŝ we can translate it into a probability for S, which is basically another probability for S, because the two have the same distribution. Here is an application of this:

Theorem 41. Let (Sk) be a simple symmetric random walk and x a positive integer. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then
    P0(Tx = n) = (x/n) P0(Sn = x),  x > 0.     (37)
Proof. Rewrite the ballot theorem in terms of the dual:
    P0(Ŝ1 > 0, . . . , Ŝn > 0 | Ŝn = x) = x/n.
But the left-hand side is
    P0(Sn − S_{n−k} > 0, 1 ≤ k ≤ n | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1, Sn = x) / P0(Sn = x)
                                            = P0(Tx = n) / P0(Sn = x).

Pause for reflection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:
    E0 s^{Tx} = ((1 − √(1 − 4pqs²))/(2qs))^x.     (35)
Specialising to the symmetric case we have
    E0 s^{Tx} = ((1 − √(1 − s²))/s)^x.
In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:
    P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).     (36)
Lastly, in Theorem 41, we derived, using duality, for a simple symmetric random walk, the probabilities
    P0(Tx = n) = (x/n) P0(Sn = x).     (37)

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of sn in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
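Numerically, on the other hand, the compatibility is immediate to check. Below is a sketch of mine (not part of the notes) that computes the law of Tx exactly by dynamic programming and compares it with (37); the choices x = 3 and n ≤ 15 are arbitrary.

    from math import comb

    def first_passage_law(x, N):
        # hit[n] = P_0(T_x = n) for the simple symmetric random walk,
        # by evolving the sub-probability law of S_n on the event {T_x > n}
        hit = [0.0] * (N + 1)
        alive = {0: 1.0}
        for n in range(1, N + 1):
            new = {}
            for pos, w in alive.items():
                for y in (pos - 1, pos + 1):
                    if y == x:
                        hit[n] += 0.5 * w
                    else:
                        new[y] = new.get(y, 0.0) + 0.5 * w
            alive = new
        return hit

    x, N = 3, 15
    hit = first_passage_law(x, N)
    for n in range(x, N + 1, 2):          # n and x must have the same parity
        rhs = (x / n) * comb(n, (n + x) // 2) * 2.0 ** (-n)   # formula (37)
        print(n, round(hit[n], 6), round(rhs, 6))
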

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn ) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn ) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+ (2n) be the number of sides of the polygonal line that are above the horizontal axis and T− (2n) = 2n − T+ (2n) the number of sides that are below. Clearly, T± (2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+ (2n) as the number of leads in 2n coin tosses, where we

Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.

have a lead at some time if the number of heads till that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, the most likely values of T+(2n) are m = 0 and m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).
    P(T+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0),  0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:
    P(T+(2n) = 2k) = Σ_{r=1}^n P(T+(2n) = 2k, T0′ = 2r).
Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write
    P(T+(2n) = 2k) = Σ_{r=1}^n P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r)
                   + Σ_{r=1}^n P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).
In the first case,
    P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) = 2k − 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),
because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis and so T+(2n) is reduced by 2r; furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows
    P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).
We also observe that, by symmetry,
    P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f_{2r}.
Letting p_{2k,2n} = P(T+(2n) = 2k), we have shown that
    p_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} p_{2k−2r,2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k,2n−2r}.
We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we will realise that
    p_{2k,2n} = (1/2) u_{2n−2k} Σ_{r=1}^k f_{2r} u_{2k−2r} + (1/2) u_{2k} Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r};
since the first-return decomposition gives Σ_{r=1}^m f_{2r} u_{2m−2r} = u_{2m}, the right-hand side equals (1/2) u_{2n−2k} u_{2k} + (1/2) u_{2k} u_{2n−2k} = u_{2k} u_{2n−2k}, as required. Explicitly, we have
    P(T+(2n) = 2k) = C(2k,k) C(2n−2k, n−k) 2^{−2n},  0 ≤ k ≤ n.     (38)
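Formula (38) is easy to tabulate. The sketch below is mine, not from the notes; it also confirms that the probabilities sum to 1 and that the distribution is U-shaped, i.e. largest at k = 0 and k = n.

    from math import comb

    def u(two_k):                       # u_{2k} = P(S_{2k} = 0)
        k = two_k // 2
        return comb(2 * k, k) * 2.0 ** (-2 * k)

    n = 50                              # 2n = 100 steps, as in the plot below
    probs = [u(2 * k) * u(2 * (n - k)) for k in range(n + 1)]   # formula (38)
    print(round(sum(probs), 12))        # 1.0: it is a probability distribution
    print(round(probs[0], 4), round(probs[n // 2], 4), round(probs[n], 4))
    # ends ~0.0796, middle ~0.0126: the extremes are the most likely values
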

With 2n = 100, we plot the function P(T+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.

[Plot of P(T+(100) = 2k) against k, 0 ≤ k ≤ 50; the probability is maximised at the two ends of the interval.]

The arcsine law*
Use of Stirling's approximation in formula (38) gives
    P(T+(2n) = 2k) ∼ 1 / (π √(k(n − k))),  as k → ∞ and n − k → ∞.
Now consider the calculation of the following probability:
    P( T+(2n)/(2n) ∈ [a, b] ) = Σ_{na ≤ k ≤ nb} P( T+(2n)/(2n) = k/n )
                              ∼ Σ_{a ≤ k/n ≤ b} 1 / (π √(k(n − k)))
                              = Σ_{a ≤ k/n ≤ b} (1/n) · 1 / (π √((k/n)(1 − k/n))).
This is easily seen to have a limit as n → ∞, the limit being
    ∫_a^b dt / (π √(t(1 − t))),
because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:
    lim_{n→∞} P( T+(2n)/(2n) ≤ x ) = ∫_0^x dt / (π √(t(1 − t))) = (2/π) arcsin √x.
We thus have that the limiting distribution of T+(2n)/(2n) (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1 / (π √(x(1 − x))), 0 < x < 1, the plot of
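A simulation makes the arcsine law vivid: long stretches on one side of the axis are typical, not exceptional. The following sketch is my own illustration (the sample sizes are arbitrary); it compares the empirical distribution of the fraction of time positive with (2/π) arcsin √x.

    import math, random

    def fraction_positive(steps):
        # fraction of the sides of the path lying above the axis; a side
        # between S_{k-1} and S_k is above iff one of its endpoints is > 0
        # (the endpoints differ by 1, so a side can never straddle the axis)
        s, above = 0, 0
        for _ in range(steps):
            prev = s
            s += random.choice((-1, 1))
            if prev > 0 or s > 0:
                above += 1
        return above / steps

    samples = [fraction_positive(1_000) for _ in range(5_000)]
    for x in (0.1, 0.3, 0.5, 0.7, 0.9):
        emp = sum(f <= x for f in samples) / len(samples)
        print(x, round(emp, 3), round(2 / math.pi * math.asin(math.sqrt(x)), 3))
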

which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk because the summands in Tx = Σ_{y=1}^x (Ty − T_{y−1}) are i.i.d., all distributed as T1:
    P(T1 ≥ 2n) = P(S_{2n} = 0) = C(2n,n) 2^{−2n} ∼ 1/√(πn).

From this, we immediately get that the expectation of T1 is infinity:
    ET1 = ∞.
Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y) where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, moves east, west, north or south, with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z: Let ξ1, ξ2, . . . be i.i.d. random vectors such that
    P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4.
We then let S0 = (0, 0) and Sn = Σ_{i=1}^n ξi, n ≥ 1. If x ∈ Z², we let x¹ be its horizontal coordinate and x² its vertical. So
    Sn = (Sn¹, Sn²) = ( Σ_{i=1}^n ξi¹, Σ_{i=1}^n ξi² ).

Lemma 20. (Sn¹, n ∈ Z+) is a random walk. The same is true for (Sn², n ∈ Z+). The two random walks are not independent.
Proof. Since ξ1, ξ2, . . . are independent, so are ξ1¹, ξ2¹, . . . . They are also identically distributed with
    P(ξn¹ = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
    P(ξn¹ = 1) = P(ξn = (1, 0)) = 1/4,
    P(ξn¹ = −1) = P(ξn = (−1, 0)) = 1/4.
Hence (Sn¹, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. It is not a simple random walk because there is a possibility that the increment takes the value 0. The same argument applies to (Sn², n ∈ Z+), showing that it, too, is a random walk. However, for each n, the random variables ξn¹, ξn² are not independent, simply because if one takes value 0 the other is nonzero. Hence the two stochastic processes are not independent, the latter being functions of the former.

Consider now a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x¹, x²) into (x¹ + x², x¹ − x²). So let
    Rn = (Rn¹, Rn²) := (Sn¹ + Sn², Sn¹ − Sn²).
We have:

Lemma 21. (Rn, n ∈ Z+) is a random walk in two dimensions. Moreover, (Rn¹, n ∈ Z+) and (Rn², n ∈ Z+) are independent simple symmetric random walks in one dimension.
Proof. We have Rn¹ = Σ_{i=1}^n (ξi¹ + ξi²) and Rn² = Σ_{i=1}^n (ξi¹ − ξi²). So Rn = Σ_{i=1}^n ηi, where ηn is the vector ηn = (ξn¹ + ξn², ξn¹ − ξn²). The sequence of vectors η1, η2, . . . is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, . . . . Hence (Rn, n ∈ Z+) is a random walk in two dimensions. A similar argument shows that each of (Rn¹, n ∈ Z+), (Rn², n ∈ Z+) is a random walk in one dimension. To show that the last two are independent, we check that ηn¹ = ξn¹ + ξn² and ηn² = ξn¹ − ξn² are independent for each n. The possible values of each of ηn¹, ηn² are ±1. We have
    P(ηn¹ = 1, ηn² = 1) = P(ξn = (1, 0)) = 1/4,
    P(ηn¹ = 1, ηn² = −1) = P(ξn = (0, 1)) = 1/4,
    P(ηn¹ = −1, ηn² = 1) = P(ξn = (0, −1)) = 1/4,
    P(ηn¹ = −1, ηn² = −1) = P(ξn = (−1, 0)) = 1/4.
Hence P(ηn¹ = 1) = P(ηn¹ = −1) = 1/2 and P(ηn² = 1) = P(ηn² = −1) = 1/2, and so
    P(ηn¹ = ε, ηn² = ε′) = P(ηn¹ = ε) P(ηn² = ε′)  for all choices of ε, ε′ ∈ {−1, +1}.
Hence (Rn¹) and (Rn²) are independent simple symmetric random walks.

Lemma 22. For the two-dimensional walk,
    P(S_{2n} = 0) = ( C(2n,n) 2^{−2n} )².
Proof. Since Sn = 0 if and only if Rn = 0, we have
    P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}¹ = 0, R_{2n}² = 0) = P(R_{2n}¹ = 0) P(R_{2n}² = 0),
where the last equality follows from the independence of the two random walks. But (Rn¹) is a simple symmetric random walk, so P(R_{2n}¹ = 0) = C(2n,n) 2^{−2n}. The same is true for (Rn²).

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.
Proof. By Lemma 5, we need to show that Σ_{n=0}^∞ P(Sn = 0) = ∞. But, by the previous lemma and by Stirling's approximation,
    P(S_{2n} = 0) ∼ c/n,  as n → ∞,
where c is a positive constant, in the sense that the ratio of the two sides goes to 1. Hence Σ_n P(S_{2n} = 0) = ∞.
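The 45° trick is easy to test empirically. The sketch below is mine (20 steps and 200 000 trials are arbitrary choices); it estimates the return probability of the two-dimensional walk and compares it with Lemma 22.

    import random
    from math import comb

    STEPS = ((0, 1), (0, -1), (1, 0), (-1, 0))

    n2, trials, back = 20, 200_000, 0     # n2 = 2n = 20 steps
    for _ in range(trials):
        x = y = 0
        for _ in range(n2):
            dx, dy = random.choice(STEPS)
            x += dx
            y += dy
        back += (x == 0 and y == 0)

    exact = (comb(n2, n2 // 2) * 2.0 ** (-n2)) ** 2   # Lemma 22
    print(round(back / trials, 4), round(exact, 4))   # both ~0.031
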

35 Skorokhod embedding*

We take a closer look at the gambler's ruin problem for a simple symmetric random walk in one dimension. Let a < 0 < b. Recall, from Theorem 7, that
    P(Ta < Tb) = b/(b − a),   P(Tb < Ta) = −a/(b − a).
We can read this as follows:
    P(S_{Ta∧Tb} = a) = b/(b − a),   P(S_{Ta∧Tb} = b) = −a/(b − a).
This last display tells us the distribution of S_{Ta∧Tb}. Note, in particular, that E S_{Ta∧Tb} = 0. Conversely, given a 2-point probability distribution (p, 1 − p) with zero mean, i.e. a pair a < 0 < b such that pa + (1 − p)b = 0, we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a, b]: the probability it exits from the left point is p, and the probability it exits from the right point is 1 − p. Such a, b can always be found if p is rational, e.g. if p = M/N with M < N, M, N ∈ N, choose a = M − N and b = M.

How about more complicated distributions? Suppose we are given a probability distribution (pi, i ∈ Z) with zero mean: Σ_{i∈Z} i pi = 0. Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this.

Theorem 43. Let (Sn) be a simple symmetric random walk and (pi, i ∈ Z) a given probability distribution on Z with zero mean: Σ_{i∈Z} i pi = 0. Then there exists a stopping time T such that
    P(ST = i) = pi,  i ∈ Z.
Proof. Define a pair of random variables (A, B), independent of (Sn), taking values in {(0, 0)} ∪ {(i, j) : i < 0 < j} with the following distribution:
    P(A = i, B = j) = c^{−1} (j − i) pi pj,  i < 0 < j,
    P(A = 0, B = 0) = p0,
where c^{−1} is chosen by normalisation:
    P(A = 0, B = 0) + Σ_{i<0} Σ_{j>0} P(A = i, B = j) = 1.
This leads to:
    c^{−1} Σ_{j>0} j pj Σ_{i<0} pi + c^{−1} Σ_{i<0} (−i) pi Σ_{j>0} pj = 1 − p0.
Since Σ_{i∈Z} i pi = 0, we have the common value m := Σ_{i<0} (−i) pi = Σ_{i>0} i pi, so the left-hand side equals c^{−1} m (Σ_{i<0} pi + Σ_{j>0} pj) = c^{−1} m (1 − p0); using this, we get that c is precisely this common value:
    c = Σ_{i<0} (−i) pi = Σ_{i>0} i pi.
Letting Ti = inf{n ≥ 0 : Sn = i}, we consider the random variable
    Z = S_{T_A ∧ T_B}.
The claim is that P(Z = k) = pk, k ∈ Z. To see this, suppose first k = 0. Then Z = 0 if and only if A = B = 0, and this has probability p0.

Suppose next k > 0. We have, by the independence of (A, B) from the walk,
    P(Z = k) = Σ_{i<0} Σ_{j>0} P(S_{Ti∧Tj} = k) P(A = i, B = j)
             = Σ_{i<0} P(S_{Ti∧Tk} = k) P(A = i, B = k)
             = Σ_{i<0} ((−i)/(k − i)) c^{−1} (k − i) pi pk
             = c^{−1} Σ_{i<0} (−i) pi pk = pk.
Similarly, P(Z = k) = pk for k < 0.
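Here is a direct implementation of the construction in the proof. This is my own sketch; the test distribution is an arbitrary zero-mean law on Z.

    import random
    from collections import Counter

    def skorokhod_sample(p):
        # one sample of S_T for the stopping time of Theorem 43;
        # p is a dict {i: p_i} on Z with zero mean
        if random.random() < p.get(0, 0.0):
            return 0                                  # the case A = B = 0
        neg = [(i, w) for i, w in p.items() if i < 0]
        pos = [(j, w) for j, w in p.items() if j > 0]
        # draw (A, B) = (i, j) with probability proportional to (j - i) p_i p_j
        pairs = [((i, j), (j - i) * wi * wj) for i, wi in neg for j, wj in pos]
        r = random.random() * sum(w for _, w in pairs)
        for (a, b), w in pairs:
            r -= w
            if r <= 0:
                break
        s = 0                                         # run a SSRW until it exits [a, b]
        while a < s < b:
            s += random.choice((-1, 1))
        return s

    dist = {-3: 0.2, -1: 0.2, 0: 0.1, 1: 0.2, 2: 0.3}   # mean: -0.6-0.2+0.2+0.6 = 0
    counts = Counter(skorokhod_sample(dist) for _ in range(100_000))
    for i in sorted(dist):
        print(i, round(counts[i] / 100_000, 3), dist[i])   # empirical vs p_i
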

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:
    N = {1, 2, 3, . . .}.
The set Z of integers consists of N, their negatives and zero:
    Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . .}.
Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If a | b and b | a then a = b. If c | b and b | a then c | a.

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a, d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.)

Now, if a, b are arbitrary integers, we have
    gcd(a, b) = gcd(a, b − a).
Indeed, let d = gcd(a, b). Then d | a and d | b, therefore d | b − a; so (i) holds for the pair (a, b − a). Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. We work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school): each round, we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. For instance, by repeated subtraction,
    gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26)
                 = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.
In the first line we subtracted 38 exactly 3 times; but 3 is the maximum number of times that 38 "fits" into 120 (indeed, 4 × 38 > 120), so with long division we pass directly from gcd(120, 38) to gcd(6, 38), and similarly from gcd(38, 6) to gcd(2, 6), obtaining the answer much faster.

The theorem of division says the following: Given two positive integers a, b, we can find an integer q and an integer r ∈ {0, 1, . . . , b − 1} such that
    a = qb + r.
The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 by 26 we perform the familiar long division and obtain q = 132, r = 18, i.e. 3450 = 132 × 26 + 18.

Here are some other facts about the greatest common divisor: First, if d = gcd(a, b) then there are integers x, y such that
    d = xa + yb.
To see why this is true, let us follow the previous example again. In the process of computing gcd(120, 38) = 2 we used the divisions
    120 = 3 × 38 + 6,   38 = 6 × 6 + 2.
Working backwards, we have
    2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120,
i.e. gcd(38, 120) = 38x + 120y with x = 19, y = −6. The same logic works in general.

Second, we can show that, for integers a, b not both zero,
    gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.
To see this, let d be the right-hand side; then d = xa + yb for some x, y ∈ Z. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), i.e. that d | a and d | b. Suppose this is not true, say that d does not divide b. Since d > 0, we can divide b by d to find
    b = qd + r,  for some 1 ≤ r ≤ d − 1.
But then
    r = b − qd = b − q(xa + yb) = (−qx)a + (1 − qy)b.
So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder), contradicting the fact that d is the minimum such combination. Hence r = 0, i.e. d | b; similarly, d | a.

The greatest common divisor of 3 integers can now be defined in a similar manner; indeed, the greatest common divisor of any subset of the integers also makes sense.
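Both the plain and the extended version of Euclid's algorithm fit in a few lines (a Python sketch of mine):

    def gcd(a, b):
        # Euclid's algorithm: replace the pair by (smaller, remainder)
        while b != 0:
            a, b = b, a % b
        return a

    def extended_gcd(a, b):
        # returns (d, x, y) with d = gcd(a, b) and d = x*a + y*b,
        # obtained by the "working backwards" substitution above
        if b == 0:
            return a, 1, 0
        d, x, y = extended_gcd(b, a % b)
        return d, y, x - (a // b) * y

    print(gcd(120, 38))            # 2
    print(extended_gcd(38, 120))   # (2, 19, -6), i.e. 19*38 - 6*120 = 2
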

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. In general, a relation can be thought of as a set of pairs of elements of the set A. A relation is called reflexive if it contains pairs (x, x). The relation < is not reflexive; the equality relation = is. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). The relation < is not symmetric; the equality relation is. If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric (and it can be thought of as reflexive, unless we do not want to accept the fact that one may not be a friend of oneself), but is not necessarily transitive. A relation is transitive if whenever it contains (x, y) and (y, z) it also contains (x, z). Both < and = are transitive. A relation which possesses all the three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

We denote a function f from a set X into a set Y by f : X → Y. The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively.

The number of functions from A to B (i.e. the cardinality of the set B^A) is m^n. Indeed, we may think of the elements of A as animals and those of B as names; a function is then an assignment of a name to each animal. (The same name may be used for different animals.) So the first animal can be assigned any of the m available names, so can the second, and the third, and so on. Hence the total number of assignments is
    m × m × · · · × m  (n times)  =  m^n.
Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Since the name of the first animal can be chosen in m ways, that of the second in m − 1 ways, and so on, the total number of one-to-one functions from A to B is
    (m)_n := m(m − 1)(m − 2) · · · (m − n + 1).
In the special case m = n, a one-to-one function from the set A into A is called a permutation of A, and we denote (n)_n also by n!; in other words, n! stands for the product of the first n natural numbers,
    n! = 1 × 2 × 3 × · · · × n,
and is the total number of permutations of a set of n elements.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen: the first element can be chosen in n ways, the second in n − 1, and so on, the k-th in n − k + 1 ways. So we have (n)_k = n(n − 1) · · · (n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals
    (n)_k / k! =: C(n,k),
where the symbol C(n,k) is merely a name for the number computed on the left side of this equation. Clearly,
    C(n,k) = n! / (k!(n − k)!) = C(n, n − k).
Since there is only one subset with no elements (it is the empty set ∅), we have C(n,0) = 1. The total number of subsets of A (regardless of their number of elements) is thus equal to
    Σ_{k=0}^n C(n,k) = (1 + 1)^n = 2^n,
and this follows from the binomial theorem. We thus observe that there are as many subsets in A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.
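These counts are easy to confirm by brute force for small sets (a throwaway sketch of mine, using the standard library):

    from itertools import product, permutations, combinations
    from math import comb, factorial

    n, m = 3, 5
    A, B = range(n), range(m)

    print(len(list(product(B, repeat=n))), m**n)            # all functions A -> B
    injections = [f for f in product(B, repeat=n) if len(set(f)) == n]
    print(len(injections), m * (m - 1) * (m - 2))           # (m)_n one-to-one maps
    print(len(list(permutations(A))), factorial(n))         # permutations of A
    print(len(list(combinations(B, 2))), comb(m, 2))        # 2-element subsets of B
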

We shall let C(n,k) = 0 if n < k. The Pascal triangle relation states that
    C(n+1, k) = C(n, k) + C(n, k−1).
Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say the pig. If the pig is not included in the sample, then we choose the k animals from the remaining n ones, and this can be done in C(n,k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen, and this can be done in C(n, k−1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

More generally, if x is a real number and m a nonnegative integer, we define
    C(x, m) = x(x − 1) · · · (x − m + 1) / m!;
if y is not a nonnegative integer then, by convention, we let C(x, y) = 0.

Maximum and minimum

The minimum of two numbers a, b is denoted by
    a ∧ b = min(a, b).
Thus a ∧ b = a if a ≤ b, or = b if b ≤ a. The maximum is denoted by
    a ∨ b = max(a, b).
The notation extends to more than one number; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that
    −(a ∨ b) = (−a) ∧ (−b)  and  (a ∧ b) + c = (a + c) ∧ (b + c).
In particular, we use the following notations:
    a+ = max(a, 0),   a− = −min(a, 0).
The quantity a+ is called the positive part of a, and a− is called the negative part of a. Both quantities are nonnegative. Note that it is always true that
    a = a+ − a−,
and the absolute value of a is given by
    |a| = a+ + a−.

Indicator function

1A stands for the indicator function of a set A, meaning a function that takes value 1 on A and 0 on the complement Ac of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true or 0 otherwise. This is very useful notation because conjunction of clauses is transformed into multiplication. It is easy to see that
    1A 1B = 1A∩B,   1 − 1A = 1Ac.
An example of its application: Since
    1A∪B = 1 − 1(A∪B)c = 1 − 1Ac∩Bc = 1 − 1Ac 1Bc = 1 − (1 − 1A)(1 − 1B) = 1A + 1B − 1A 1B,
we have 1A∪B = 1A + 1B − 1A∩B. If A is an event then 1A is a random variable: it takes value 1 with probability P(A) and 0 with probability P(Ac). Therefore its expectation is
    E 1A = 1 × P(A) + 0 × P(Ac) = P(A).
Taking expectations in the display above, we get
    E[1A∪B] = E[1A] + E[1B] − E[1A∩B],
which means
    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Several quantities of interest are expressed in terms of indicators. For instance, if I go to a bank N times and Ai is the clause "I am robbed the i-th time", then Σ_{i=1}^N 1Ai is the total number of times I am robbed. More generally, if A1, A2, . . . are sets or clauses, then Σ_{n≥1} 1An is the number of true clauses. Another example: Let X be a nonnegative random variable with values in N = {1, 2, . . .}. Then
    X = Σ_{n=1}^X 1  (the sum of X ones is X),
so
    X = Σ_{n≥1} 1(X ≥ n),
and therefore
    E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that
    ϕ(αx + βy) = αϕ(x) + βϕ(y),

for all x, y ∈ R^n and all α, β ∈ R. Let u1, . . . , un be a basis for R^n. Then any x ∈ R^n can be uniquely written as
    x = Σ_{j=1}^n x^j uj,
and, because ϕ is linear,
    y := ϕ(x) = Σ_{j=1}^n x^j ϕ(uj).
But the ϕ(uj) are vectors in R^m, so if we choose a basis v1, . . . , vm in R^m we can express each ϕ(uj) as a linear combination of the basis vectors:
    ϕ(uj) = Σ_{i=1}^m aij vi.
The matrix A of ϕ with respect to the choice of the two bases is defined as the m × n array
    A := (aij),  1 ≤ i ≤ m, 1 ≤ j ≤ n.
Combining, we have
    ϕ(x) = Σ_{i=1}^m vi Σ_{j=1}^n aij x^j.
Let y^i := Σ_{j=1}^n aij x^j, i = 1, . . . , m. The column (x^1, . . . , x^n)^T is a column representation of x with respect to the given basis, and (y^1, . . . , y^m)^T is a column representation of y := ϕ(x) with respect to the basis v1, . . . , vm; the relation between the two is exactly matrix-vector multiplication, y = Ax.
Suppose now that ψ : R^n → R^m and ϕ : R^m → R^ℓ are linear functions. Then ϕ ∘ ψ : R^n → R^ℓ is a linear function, and its matrix representation is the product AB of the matrices A, B of ϕ, ψ.
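In code, composition of linear maps is exactly matrix multiplication; a small numpy check (my sketch, with arbitrary random matrices):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(-3, 4, size=(2, 3))   # matrix of phi : R^3 -> R^2
    B = rng.integers(-3, 4, size=(3, 4))   # matrix of psi : R^4 -> R^3
    x = rng.integers(-3, 4, size=4)

    # applying psi and then phi agrees with the single matrix A @ B
    print(np.array_equal(A @ (B @ x), (A @ B) @ x))   # True
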

More precisely, fix three sets of bases on R^n, R^m and R^ℓ, and let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. Then the matrix representation of ϕ ∘ ψ is AB, where
    (AB)ij = Σ_{k=1}^m Aik Bkj,  1 ≤ i ≤ ℓ, 1 ≤ j ≤ n.
So A is a ℓ × m matrix, B is a m × n matrix, and AB is a ℓ × n matrix. A matrix A is called square if it is a n × n matrix; if we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n. Usually, the choice of the basis is the so-called standard basis: the standard basis in R^d consists of the vectors e1, . . . , ed, where ei has components
    (ei)_j = 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence a1, a2, . . . of numbers), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all the numbers x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist; for instance, I = {1, 1/2, 1/3, . . .} has no minimum. A property of real numbers says that if the set I is bounded from below then the maximum of all lower bounds exists, and it is this that we call the infimum of I and denote by inf I: it is defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. Thus, inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Likewise, we define the supremum sup I to be the minimum of all upper bounds; sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞. If I is empty then, by convention, inf ∅ = +∞ and sup ∅ = −∞.

Limsup and liminf

If x1, x2, . . . is a sequence of numbers, then
    inf{x1, x2, . . .} ≤ inf{x2, x3, . . .} ≤ inf{x3, x4, . . .} ≤ · · ·
and, since any nondecreasing sequence has a limit (possibly +∞), we take this limit and call it lim inf xn (limit inferior):
    lim inf xn = lim_{n→∞} inf{xn, x_{n+1}, x_{n+2}, . . .}.
Similarly, the suprema sup{xn, x_{n+1}, . . .} decrease with n, and their limit is called lim sup xn (limit superior):
    lim sup xn = lim_{n→∞} sup{xn, x_{n+1}, x_{n+2}, . . .}.
We note that lim inf(−xn) = −lim sup xn. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require¹¹ knowledge of axiomatic Probability Theory (with Measure Theory). But I would like the reader to keep in mind one thing: every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely:
    If A1, A2, . . . are mutually disjoint events, then P(∪_{n≥1} An) = Σ_{n≥1} P(An).
Everything else follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.)

Recall that 1A 1B = 1A∩B, 1Ac = 1 − 1A, and E 1A = P(A). In Markov chain theory, and everywhere else, we often need to argue that one probability is smaller than another, or that X ≤ Y. Showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. that A is a subset of B). Similarly, to show that EX ≤ EY, we argue X ≤ Y pointwise. For example, Markov's inequality for a positive random variable X states that
    P(X > x) ≤ EX/x;
this follows, by taking expectations, from the logical statement X 1(X > x) ≤ X, which holds since X is positive.
Some basic consequences of the axiom:
• If the events A, B are disjoint, then P(A ∪ B) = P(A) + P(B). If A, B are not disjoint, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events, and it can be generalised to any finite number of events.
• P(∪n An) ≤ Σn P(An). Similarly, if An are events with P(An) = 0 then P(∪n An) = 0, and if An are events with P(An) = 1 then P(∩n An) = 1.
• If the events A1, A2, . . . are increasing with A = ∪n An, then P(A) = limn→∞ P(An). Similarly, if the events A1, A2, . . . are decreasing with A = ∩n An, then P(A) = limn→∞ P(An).
Quite often, we work with conditional probabilities. Recall that P(B|A) is defined by P(A ∩ B)/P(A) and is a very motivated definition. To compute P(A ∩ B) we read the definition as
    P(A ∩ B) = P(A) P(B|A).
If we interchange the roles of A, B in the definition, we get
    P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B),
and this is called Bayes' theorem. This is generalisable for a number of events:
    P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2).
Often, to compute P(A), it is useful to find disjoint events B1, B2, . . . , such that ∪n Bn = Ω, in which case we write
    P(A) = Σn P(A ∩ Bn) = Σn P(A|Bn) P(Bn).

The expectation of a nonnegative random variable X may be defined by
    EX = ∫_0^∞ P(X > x) dx.
For a random variable that takes values in N we can write EX = Σ_{n=1}^∞ P(X ≥ n); this is due to the fact that Σ_{n=1}^∞ 1(X ≥ n) = X. In case the random variable is discrete, the formula reduces to the known one, EX = Σx x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^∞ x f(x) dx. For a random variable which takes positive and negative values, we define X+ = max(X, 0) and X− = −min(X, 0), which are both nonnegative, and then define
    EX = EX+ − EX−,
which is motivated from the algebraic identity X = X+ − X−. This definition makes sense only if EX+ < ∞ or if EX− < ∞.
An application of this is:
    P(X < ∞) = lim_{x→∞} P(X ≤ x).
In Markov chain theory, we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

¹¹ I do sometimes "cheat" from a strict mathematical point of view, in these notes, but not too much.

Generating functions

A generating function of a sequence a0, a1, a2, . . . is an infinite polynomial having the ai as coefficients:
    G(s) = a0 + a1 s + a2 s² + · · · .
We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. Note that |G(s)| ≤ Σn |an| |s|^n, and this is ≤ Σn |an| if s ∈ [−1, 1]. So if Σn |an| < ∞ then G(s) exists for all s ∈ [−1, 1]; this makes G a particularly useful object when a0, a1, . . . is a probability distribution on Z+ = {0, 1, 2, . . .}, because then Σn an = 1. If G exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is
    Σ_{n=0}^∞ ρ^n s^n = 1/(1 − ρs),  as long as |ρs| < 1.     (39)

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence pn = P(X = n), n = 0, 1, . . . , is called the probability generating function of X:
    ϕ(s) = E s^X = Σ_{n=0}^∞ s^n P(X = n).

This is defined for all −1 < s < 1. Note that
    Σ_{n=0}^∞ pn ≤ 1  and  p∞ := P(X = ∞) = 1 − Σn pn;
we do allow X to, possibly, take the value +∞. Although the term s^∞ p∞ could formally appear in the sum defining ϕ(s), we have s^∞ = 0 since |s| < 1, and this is why ϕ is called the probability generating function. Note that ϕ(0) = P(X = 0), while
    ϕ(1) = Σ_{n=0}^∞ P(X = n) = P(X < +∞).
We can recover the probabilities pn from the formula
    pn = ϕ^{(n)}(0) / n!,
as follows from Taylor's theorem. Also, we can recover the expectation of X from
    EX = lim_{s↑1} ϕ′(s),
something that can be formally seen from
    (d/ds) E s^X = E (d/ds) s^X = E X s^{X−1},
by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion
    x_{n+2} = x_{n+1} + x_n,  with x0 = x1 = 1.
If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is
    Σ_{n≥0} x_{n+1} s^n = s^{−1} (X(s) − x0),
and the generating function of (x_{n+2}, n ≥ 0) is
    Σ_{n≥0} x_{n+2} s^n = s^{−2} (X(s) − x0 − s x1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0):
    s^{−2} (X(s) − x0 − s x1) = s^{−1} (X(s) − x0) + X(s).
Since x0 = x1 = 1, we can solve for X(s) and find
    X(s) = −1 / (s² + s − 1).
But the polynomial s² + s − 1 has two roots:
    a = (√5 − 1)/2,   b = −(√5 + 1)/2.
Hence s² + s − 1 = (s − a)(s − b), and so
    X(s) = (1/(a − b)) ( 1/(s − b) − 1/(s − a) ) = (1/(b − a)) ( 1/(b − s) − 1/(a − s) ).
But (39) tells us that ρ^n has generating function 1/(1 − ρs) = (1/ρ) / ((1/ρ) − s); thus 1/(x − s) is the generating function of (1/x)^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1} and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of
    x_n = (1/(b − a)) ( (1/b)^{n+1} − (1/a)^{n+1} ),
which is therefore the n-th term of the Fibonacci sequence. Noting that ab = −1, we can also write this as
    x_n = ( (−a)^{n+1} − (−b)^{n+1} ) / (b − a).
Despite the square roots in the formula, this number is always an integer (it has to be!).
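A quick check that the closed form reproduces the recursion (a sketch of mine; floating point is exact enough for small n):

    import math

    SQRT5 = math.sqrt(5.0)
    a = (SQRT5 - 1) / 2
    b = -(SQRT5 + 1) / 2

    def fib_closed(n):
        # x_n = ((1/b)^(n+1) - (1/a)^(n+1)) / (b - a)
        return ((1 / b) ** (n + 1) - (1 / a) ** (n + 1)) / (b - a)

    x = [1, 1]
    while len(x) < 12:
        x.append(x[-1] + x[-2])          # the recursion x_{n+2} = x_{n+1} + x_n
    print(x)
    print([round(fib_closed(n)) for n in range(12)])   # identical lists
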

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, then the so-called distribution function, i.e. the function
    P(X ≤ x),  x ∈ R,
is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use its density
    f(x) = (d/dx) P(X ≤ x).
All these objects can be referred to as "distribution" or "law" of X. Even the generating function E s^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+. Warning: the verb "specify" should not be interpreted as "compute" in the sense of existence of an analytical formula or algorithm.
We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using or merely talking about such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × · · · × n for large n then, since the number is so big, it's better to use logarithms:
    n! = e^{log n!} = exp( Σ_{k=1}^n log k ).
Since log x does not vary too much between two successive positive integers k and k + 1, it follows that
    ∫_k^{k+1} log x dx ≈ log k,
hence
    Σ_{k=1}^n log k ≈ ∫_1^n log x dx.
Now, since (d/dx)(x log x − x) = log x, it follows that
    ∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.
Hence we have
    n! ≈ exp(n log n − n) = n^n e^{−n}.
We call this the rough Stirling approximation. It turns out to be a good one. But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. (b) The approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses,

we have exactly n heads equals C(2n,n) 2^{−2n} ≈ 1, and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:
    n! ∼ n^n e^{−n} √(2πn),
where the symbol an ∼ bn means that an/bn → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:
    P(X − n ≥ m | X ≥ n) = P(X ≥ m),  0 ≤ n ≤ m.
If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from X. A simple argument shows that a geometric random variable has a distribution of the form
    P(X ≥ n) = λ^n,  n ∈ Z+,
where λ = P(X ≥ 1). We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the total number of visits to a state,
    J0 = Σ_{n=1}^∞ 1(Sn = 0),
defined in (34). It was shown in Theorem 38 that J0 satisfies the memoryless property; in fact it was shown there that P(J0 ≥ k) = P(J0 ≥ 1)^k.

Markov's inequality

Arguably this is the most useful inequality in Probability, and it is based on the concept of monotonicity. Surely, you have seen it before, so this little section is a reminder. Let X be a random variable and consider an increasing and nonnegative function g : R → R+. We can capture both properties of g neatly by writing:
    For all x, y ∈ R:  g(y) ≥ g(x) 1(y ≥ x).
Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. Hence, for all x,
    g(X) ≥ g(x) 1(X ≥ x),
and so, by taking expectations of both sides (expectation is an increasing operator),
    E g(X) ≥ g(x) P(X ≥ x).
This completely trivial thought gives the very useful Markov's inequality.
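Point (b) above is easy to experience numerically. The sketch below (mine, with arbitrary values of n) compares both Stirling variants against n!, and then against the coin-toss probability:

    import math

    for n in (5, 10, 50, 100):
        exact = math.factorial(n)
        rough = math.exp(n * math.log(n) - n)               # n^n e^{-n}
        full = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
        print(n, round(full / exact, 6), round(rough / exact, 6))
    # full/exact -> 1, while rough/exact ~ 1/sqrt(2*pi*n) -> 0

    n = 50    # P(n heads in 2n tosses) and its Stirling estimate 1/sqrt(pi*n)
    exact = math.comb(2 * n, n) * 2.0 ** (-2 * n)
    print(round(exact, 6), round(1 / math.sqrt(math.pi * n), 6))
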


