Introductory lecture notes on

MARKOV CHAINS AND RANDOM WALKS
Takis Konstantopoulos∗ Autumn 2009

© Takis Konstantopoulos 2006-2009

Contents
1 Introduction
2 Examples of Markov chains
  2.1 A mouse in a cage
  2.2 Bank account
  2.3 Simple random walk (drunkard's walk)
  2.4 Simple random walk (drunkard's walk) in a city
  2.5 Actuarial chains
3 The Markov property
  3.1 Definition of the Markov property
  3.2 Transition probability and initial distribution
  3.3 Time-homogeneous chains
4 Stationarity
  4.1 Finding a stationary distribution
5 Topological structure
  5.1 The graph of a chain
  5.2 The relation "leads to"
  5.3 The relation "communicates with"
  5.4 Period
6 Hitting times and first-step analysis
7 Gambler's ruin
8 Stopping times and the strong Markov property
9 Regenerative structure of a Markov chain
10 Recurrence and transience
  10.1 First hitting time decomposition
  10.2 Communicating classes and recurrence/transience
11 Positive recurrence
12 Law (=Theorem) of Large Numbers in Probability Theory
13 Law of Large Numbers for Markov chains
  13.1 Functions of excursions
  13.2 Average reward
14 Construction of a stationary distribution
15 Positive recurrence is a class property
16 Uniqueness of the stationary distribution
17 Structure of a stationary distribution
18 Coupling and stability
  18.1 Definition of stability
  18.2 The fundamental stability theorem for Markov chains
  18.3 Coupling
  18.4 Proof of the fundamental stability theorem
  18.5 Terminology
  18.6 Periodic chains
19 Limiting theory for Markov chains*
  19.1 Limiting probabilities for null recurrent states
  19.2 The general case
20 Ergodic theorems
21 Finite chains
22 Time reversibility
23 New Markov chains from old ones*
  23.1 Subsequences
  23.2 Watching a Markov chain when it visits a set
  23.3 Subordination
  23.4 Deterministic function of a Markov chain
24 Applications
  24.1 Branching processes: a population growth application
  24.2 The ALOHA protocol: a computer communications application
  24.3 PageRank (trademark of Google): A World Wide Web application
  24.4 The Wright-Fisher model: an example from biology
  24.5 A storage or queueing application
25 Random walk
26 The simple symmetric random walk in dimension 1: path counting
  26.1 Paths that start and end at specific points
  26.2 Paths that start and end at specific points and do not touch zero at all
27 The simple symmetric random walk in dimension 1: simple probability calculations
  27.1 Distribution after n steps
  27.2 The ballot theorem
  27.3 Some conditional probabilities
  27.4 Some remarkable identities
  27.5 First return to zero
28 The reflection principle for a simple random walk in dimension 1
  28.1 Distribution of hitting time and maximum
29 Urns and the ballot theorem
30 The asymmetric simple random walk in dimension 1
  30.1 First hitting time analysis
  30.2 First return to the origin
  30.3 The distribution of the first return to the origin
31 Recurrence properties of simple random walks
  31.1 Asymptotic distribution of the maximum
  31.2 Return probabilities
  31.3 Total number of visits to a state
32 Duality
33 Amount of time that a SSRW is positive*
34 Simple symmetric random walk in two dimensions
35 Skorokhod embedding*

Preface

These notes were written based on a number of courses I taught over the years in the U.S., Greece and the U.K. They form the core material for an undergraduate course on Markov chains in discrete time. There are, of course, dozens of good books on the topic. The only new thing here is that I give emphasis to probabilistic methods as soon as possible. Also, I introduce stationarity before even talking about state classification. I tried to make everything as rigorous as possible while maintaining each step as accessible as possible. The notes should be readable by someone who has taken a course in introductory (non-measure-theoretic) probability.

The first part is about Markov chains and some applications. The second one is specifically for simple random walks. Of course, one can argue that random walk calculations should be done before the student is exposed to the Markov chain theory; I have tried both and prefer the current ordering. At the end, I have a little mathematical appendix. A few starred sections should be considered "advanced" and can be omitted at first reading.

I tried to put all terms in blue small capitals whenever they are first encountered. Also, "important" formulae are placed inside a coloured box.

These notes are still incomplete. I plan to add a few more sections:
– On algorithms and simulation
– On criteria for positive recurrence
– On doing stuff with matrices
– On finance applications
– On a few more delicate computations for simple random walks
– Reshape the appendix
– Add more examples
These are things to come.

PART I: MARKOV CHAINS

1 Introduction

A Markov chain is a mathematical model of a random phenomenon evolving with time in a way that the past affects the future only through the present. The "time" can be discrete (i.e. the integers), continuous (i.e. the real numbers), or, more generally, a totally ordered set. We are herein constrained by the Syllabus to discuss only discrete-time Markov chains. In the module following this one you will study continuous-time chains.

In Mathematics, a phenomenon which evolves with time in a way that only the present affects the future is called a dynamical system.

An example from Physics is provided by Newton's laws of motion. Suppose that a particle of mass m is moving under the action of a force F in a straight line. For instance, the particle is suspended at the end of a spring and the force is produced by extending the spring and letting the system oscillate (Hooke's law). Newton's second law says that the acceleration is proportional to the force:

    mẍ = F.

Here, x = x(t) denotes the position of the particle at time t; its velocity is ẋ = dx/dt, and its acceleration is ẍ = d²x/dt². Assume that the force F depends only on the particle's position and velocity, i.e. F = f(x, ẋ). The state of the particle at time t is described by the pair X(t) = (x(t), ẋ(t)). It is not hard to see that, for any t, the future trajectory (X(s), s ≥ t) can be completely specified by the current state X(t). In other words, past states are not needed.

An example from Arithmetic of a dynamical system in discrete time is the one which finds the greatest common divisor gcd(a, b) between two positive integers a and b. Recall (from elementary school maths) that gcd(a, b) = gcd(r, b), where r = rem(a, b) is the remainder of the division of a by b. This suggests the following procedure: we let our "state" be a pair of integers (xn, yn), where xn ≥ yn, initialised by x0 = max(a, b), y0 = min(a, b), and evolving as

    xn+1 = yn
    yn+1 = rem(xn, yn),        n = 0, 1, 2, . . .

By repeating the procedure, we end up with two numbers, one of which is 0, and the other the greatest common divisor. The sequence of pairs thus produced, X0, X1, X2, . . ., has the property that, for any n, the future pairs Xn+1, Xn+2, . . . depend only on the pair Xn: there is a function F that takes the pair Xn = (xn, yn) into Xn+1 = (xn+1, yn+1).

Markov chains generalise this concept of dependence of the future only on the present. The generalisation takes us into the realm of Randomness: we will be dealing with random variables, instead of deterministic objects, and use those mathematical models which are called Markov chains. We will develop a theory that tells us how to describe and analyse them, via the use of the tools of Probability. In addition, we will see what kind of questions we can ask and what kind of answers we can hope for, and we will see why these models are useful and discuss how they are applied.

Other examples of dynamical systems are the algorithms run, say, by the software in your computer. Some of these algorithms are deterministic, but some are stochastic.¹ They are stochastic either because the problem they are solving is stochastic, or because the problem is deterministic but "very large", such as finding the determinant of a matrix with 10,000 rows and 10,000 columns, or computing the integral of a complicated function of a large number of variables. Indeed, an effective way for dealing with large problems is via the use of randomness. An example of a random algorithm is the Monte Carlo method for the approximate computation of the integral ∫_a^b f(x) dx of a complicated function f: suppose that f is positive and bounded above by B. Choose a pair (X0, Y0) of random numbers, uniformly, i.e. a ≤ X0 ≤ b and 0 ≤ Y0 ≤ B, and count 1 if Y0 < f(X0) and 0 otherwise. Repeat with (Xn, Yn) for steps n = 1, 2, . . ., performing this a large number N of times; each time count 1 if Yn < f(Xn) and 0 otherwise. Count the number of 1's and divide by N. The resulting ratio, multiplied by the area B(b − a) of the rectangle, is an approximation of ∫_a^b f(x) dx. Try this on the computer!

¹ Stochastic means random.

2 Examples of Markov chains

2.1 A mouse in a cage

A mouse is in a cage with two cells, 1 and 2, containing fresh and stinky cheese, respectively. A scientist's job is to record the position of the mouse every minute. When the mouse is in cell 1 at time n (minutes) then, at time n + 1, it is either still in 1 or has moved to 2. Statistical observations lead the scientist to believe that the mouse moves from cell 1 to cell 2 with probability α = 0.05; similarly, it moves from 2 to 1 with probability β = 0.99. It does so regardless of where it was at earlier times. We can summarise this information by the transition diagram:

    1 --α--> 2,   2 --β--> 1,   with self-loops 1 --(1−α)--> 1 and 2 --(1−β)--> 2.

Another way to summarise the information is by the 2×2 transition probability matrix

    P = ( 1−α    α  )   =   ( 0.95  0.05 )
        (  β    1−β )       ( 0.99  0.01 ).

Questions of interest:
1. How long does it take for the mouse, on the average, to move from cell 1 to cell 2?
2. How often is the mouse in room 1?

Question 1 has an easy, intuitive, answer: since the mouse really tosses a coin to decide whether to move or stay in a cell, the first time that the mouse will move from 1 to 2 will have mean 1/α = 1/0.05 = 20 minutes. (This is the mean of the geometric distribution with parameter α.) Question 2 requires more thought; we will answer it when we discuss stationary distributions.

2.2 Bank account

The amount of money in Mr Baxter's bank account evolves, from month to month, as follows:

    Xn+1 = max{Xn + Dn − Sn, 0}.

Here, Xn is the money (in pounds) at the beginning of the month, Dn is the money he deposits, and Sn is the money he wants to spend. So if he wishes to spend more than what is available, he can't. Assume that D0, D1, D2, . . . are independent random variables with common distribution FD, that S0, S1, S2, . . . are independent random variables with common distribution FS, and that X0, D0, S0, D1, S1, . . . are independent random variables. The information described above by the evolution of the account, together with the distributions for Dn and Sn, leads to the (one-step) transition probability, which is defined by:

    px,y := P(Xn+1 = y | Xn = x),    x, y = 0, 1, 2, . . .

We easily see that if y > 0 then

    px,y = P(Dn − Sn = y − x)
         = Σ_{z≥0} P(Sn = z, Dn − Sn = y − x)
         = Σ_{z≥0} P(Sn = z, Dn = z + y − x)
         = Σ_{z≥0} P(Sn = z) P(Dn = z + y − x)
         = Σ_{z≥0} FS(z) FD(z + y − x),

something that, in principle, can be computed from FS and FD. (Here FS(z) and FD(z) stand for the probabilities P(Sn = z) and P(Dn = z).) If y = 0, we have

    px,0 = 1 − Σ_{y≥1} px,y.
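These sums can indeed be evaluated numerically once the two distributions are specified. A small sketch; the Poisson choice for FD and FS is purely illustrative, not something the notes assume:

```python
from math import exp, factorial

def pois(k, lam):
    """Poisson(lam) probability mass at k: an illustrative choice for FD, FS."""
    return exp(-lam) * lam**k / factorial(k) if k >= 0 else 0.0

def p(x, y, lam_D=2.0, lam_S=2.0, zmax=50):
    """One-step transition probability p_{x,y} for X_{n+1} = max{X_n + D_n - S_n, 0},
    truncating the infinite sums at zmax."""
    if y > 0:
        # p_{x,y} = sum_{z>=0} P(S_n = z) P(D_n = z + y - x)
        return sum(pois(z, lam_S) * pois(z + y - x, lam_D) for z in range(zmax))
    # p_{x,0} = 1 - sum_{y>=1} p_{x,y}
    return 1.0 - sum(p(x, yy, lam_D, lam_S, zmax) for yy in range(1, x + zmax))

print(p(5, 6), p(5, 0))  # e.g. from 5 pounds to 6, and the chance of emptying
```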

The transition probability matrix is

        ( p0,0  p0,1  p0,2  · · · )
    P = ( p1,0  p1,1  p1,2  · · · )
        ( p2,0  p2,1  p2,2  · · · )
        (  · · ·      · · ·       )

and is an infinite matrix.

Questions of interest:
1. Will Mr Baxter's account grow, and if so, how fast?
2. How long will it take (if ever) for the bank account to empty?

We shall develop methods for answering these and other questions.

2.3 Simple random walk (drunkard's walk)

Consider a completely drunk person who walks along a street. Being drunk, he has no sense of direction, so he may move forwards with the same probability that he moves backwards:

    · · ·  −2 ⇄ −1 ⇄ 0 ⇄ 1 ⇄ 2 ⇄ 3  · · ·    (each step left or right with probability 1/2)

Questions:
1. If the pub is located in position 0 and his home in position 100, how long will it take him, on the average, to go from the pub to home?
2. Are there any places which he will never visit?
3. Are there any places which he will visit infinitely many times?
4. If he starts from position 0, how often will he be visiting 0?

Of course, you might object to the assumption that the person is totally drunk and has no sense of direction. We may want to model such a case by assigning probability p for the drunkard to go one step to the right (and probability 1 − p for him to go one step to the left).

2.4 Simple random walk (drunkard's walk) in a city

Suppose that the drunkard is allowed to move in a city whose streets are laid down in a square pattern:

    [Picture: Salvador Dali (1922), The Drunkard.]

Suppose that the drunkard moves from corner to corner, and let's say he's completely drunk, so we assign probability 1/4 for each move: he can move east, west, north or south. We may again want to ask similar questions. But, observe, that since there are more degrees of freedom now, there is clearly a higher chance that our man will be lost.

2.5 Actuarial chains

A life insurance company wants to find out how much money to charge its clients. Clearly, the company must have an idea on how long the clients will live. It proposes the following model summarising the state of health of an individual on a monthly basis:

    H --0.3--> S,   S --0.8--> H,   S --0.01--> D,   D --1--> D.

Thus, there is probability pH,S = 0.3 that the person will become sick, given that he is currently healthy, etc. Note that the diagram omits the self-loops at H and S: clearly, pH,H = 1 − pH,S − pH,D and pS,S = 1 − pS,H − pS,D. Also, pD,H = pD,S = 0 and pD,D = 1, the reason being that the company does not believe that its clients are subject to resurrection.

Question: What is the distribution of the lifetime (in months) of a currently healthy individual? Clearly, the answer to this is crucial for the policy determination. These things are, for now, mathematically imprecise, but they will become more precise in the sequel.

3 The Markov property

3.1 Definition of the Markov property

Consider a (finite or infinite) sequence of random variables {Xn, n ∈ T}, where T is a subset of the integers. We say that this sequence has the Markov property if, for any n ∈ T, the future process (Xm, m ∈ T, m > n) is independent of the past process (Xm, m ∈ T, m < n), conditionally on Xn. Sometimes we say that {Xn, n ∈ T} is Markov instead of saying that it has the Markov property.

Usually, T = Z+ = {0, 1, 2, 3, . . .} or N = {1, 2, 3, . . .} or the set Z of all integers (positive, zero, or negative). We focus almost exclusively on the first choice, so we will omit referring to T explicitly, unless needed. We will also assume, almost exclusively (unless otherwise said explicitly), that the Xn take values in some countable set S, called the state space; the elements of S are frequently called states. Since S is countable, it is customary to call (Xn) a Markov chain.

Let us now write the Markov property in an equivalent way.

Lemma 1. (Xn, n ∈ Z+) is Markov if and only if, for all n ∈ Z+ and all i0, . . ., in+1 ∈ S,

    P(Xn+1 = in+1 | Xn = in, . . ., X0 = i0) = P(Xn+1 = in+1 | Xn = in).

Proof. If (Xn) is Markov then the claim holds by the conditional independence of Xn+1 and (X0, . . ., Xn−1) given Xn. On the other hand, if the claim holds, we can see that by applying it several times we obtain

    P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in, . . ., X0 = i0) = P(∀ k ∈ [n + 1, n + m] Xk = ik | Xn = in),

for any n, m and any choice of states i0, . . ., in+m, which precisely means that (Xn+1, . . ., Xn+m) and (X0, . . ., Xn−1) are independent conditionally on Xn. This is true for any n and m, and so future is independent of past given the present, viz. the Markov property. □

3.2 Transition probability and initial distribution

We therefore see that joint probabilities can be expressed as

    P(X0 = i0, X1 = i1, X2 = i2, . . ., Xn = in)
        = P(X0 = i0) P(X1 = i1 | X0 = i0) P(X2 = i2 | X1 = i1) · · · P(Xn = in | Xn−1 = in−1),    (1)

which means that the two functions

    µ(i) := P(X0 = i), i ∈ S,    and    pi,j(n, n + 1) := P(Xn+1 = j | Xn = i), i, j ∈ S, n ≥ 0,

will specify all joint distributions²:

    P(X0 = i0, X1 = i1, X2 = i2, . . ., Xn = in) = µ(i0) pi0,i1(0, 1) pi1,i2(1, 2) · · · pin−1,in(n − 1, n).

The function µ(i) is called initial distribution, and the function pi,j(n, n + 1) is called transition probability from state i at time n to state j at time n + 1. More generally,

    pi,j(m, n) := P(Xn = j | Xm = i)

is the transition probability from state i at time m to state j at time n ≥ m. Of course, pi,j(n, n) = 1(i = j) and, by (1),

    pi,j(m, n) = Σ_{i1∈S} · · · Σ_{in−1∈S} pi,i1(m, m + 1) pi1,i2(m + 1, m + 2) · · · pin−1,j(n − 1, n),    (2)

which is reminiscent of matrix multiplication. Therefore, remembering a bit of Algebra, we define, for each m ≤ n, the matrix

    P(m, n) := [pi,j(m, n)]_{i,j∈S},

viz. a square matrix whose rows and columns are indexed by S (and this, of course, requires some ordering on S) and whose entries are pi,j(m, n). If |S| = d then we have d × d matrices, but if S is an infinite set, then the matrices are infinitely large.³ In this new notation, we can write (2) as

    P(m, n) = P(m, m + 1) P(m + 1, m + 2) · · · P(n − 1, n),

or as

    P(m, n) = P(m, ℓ) P(ℓ, n)    if m ≤ ℓ ≤ n,

something that can be called semigroup property or Chapman-Kolmogorov equation.

² And, by a result in Probability, all probabilities of events associated with the Markov chain.
³ This causes some analytical difficulties. But we will circumvent them by using Probability only.

3.3 Time-homogeneous chains

We will specialise to the case of time-homogeneous chains, which, by definition, means that the transition probabilities pi,j(n, n + 1) do not depend on n. Therefore we can simply write pi,j, and call pi,j the (one-step) transition probability from state i to state j, while P = [pi,j]_{i,j∈S} is called the (one-step) transition probability matrix or, simply, transition matrix. Then

    P(m, n) = P × P × · · · × P = P^{n−m}    (n − m times),

for all m ≤ n, and the semigroup property is now even more obvious because

    P^{a+b} = P^a P^b,    for all integers a, b ≥ 0.

From now on, unless otherwise specified, when we say "Markov chain" we will mean "time-homogeneous Markov chain".

The matrix notation is useful. For example, let µn := [µn(i) = P(Xn = i)]_{i∈S} be the distribution of Xn, arranged in a row vector. Then we have

    µn = µ0 P^n,    (3)

as follows from (1) and the definition of P.

Notational conventions. Unit mass. The distribution that assigns value 1 to a state i and 0 to all other states is denoted by δi:

    δi(j) := 1, if j = i;  0, otherwise.

We call δi the unit mass at i. It corresponds, of course, to picking state i with probability 1. Notice that any probability distribution on S can be written as a linear combination of unit masses. For example, if µ assigns probability p to state i1 and 1 − p to state i2, then µ = p δi1 + (1 − p) δi2.

A useful convention used for time-homogeneous chains is to write

    Pi(A) := P(A | X0 = i).

Thus Pi is the law of the chain when the initial distribution is δi. Similarly, if Y is a real random variable, we write

    Ei Y := E(Y | X0 = i) = Σ_y y P(Y = y | X0 = i) = Σ_y y Pi(Y = y).
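Equation (3) and the semigroup property are easy to check numerically. A tiny sketch using numpy, taking the mouse chain of Section 2.1 as the transition matrix:

```python
import numpy as np

# Mouse chain of Section 2.1: states (1, 2), alpha = 0.05, beta = 0.99.
P = np.array([[0.95, 0.05],
              [0.99, 0.01]])
mu0 = np.array([1.0, 0.0])        # start in cell 1, i.e. mu0 = delta_1

# mu_n = mu_0 P^n  (equation (3))
mun = mu0 @ np.linalg.matrix_power(P, 100)
print(mun)  # close to the stationary distribution computed in Section 4
```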

4 Stationarity

We say that the process (Xn, n = 0, 1, . . .) is stationary if it has the same law as (Xn, n = m, m + 1, . . .), for any m ∈ Z+. In other words, the law of a stationary process does not depend on the origin of time. Clearly, a sequence of i.i.d. random variables provides an example of a stationary process. We now investigate when a Markov chain is stationary. To provide motivation and intuition, let us look at the following example:

Example: Motion on a polygon. Consider a regular heptagon and let a bug move on its vertices. If at time n the bug is on a vertex, at time n + 1 it moves to one of the adjacent vertices with equal probability. It should be clear that if we initially place the bug at random on one of the vertices, selecting a vertex with probability 1/7, then, at any point of time n, the distribution of the bug will still be the same. Hence the uniform probability distribution is stationary.

Example: the Ehrenfest chain. This is a model of gas distributed between two identical communicating chambers. Imagine there are N molecules of gas (where N is a large number, say N = 10²⁵) in a metallic chamber with a separator in the middle, dividing it into two identical rooms. The gas is placed partly in room 1 and partly in room 2. Then a hole is opened in the separator and we watch what happens as the molecules diffuse between the two rooms. It is clear that if room 1 contains more gas than room 2 then there is a tendency to observe net motion from 1 to 2, and vice versa. Let Xn be the number of molecules in room 1 at time n. We can model this by saying that the chance that a molecule moves from room 1 to room 2 is proportional to the number of molecules in room 1. Then

    pi,i+1 = P(Xn+1 = i + 1 | Xn = i) = (N − i)/N,
    pi,i−1 = P(Xn+1 = i − 1 | Xn = i) = i/N.

If we start the chain with X0 = N (all molecules in room 1) then the process (Xn, n ≥ 0) will not be stationary: indeed, for a while we will be able to tell how long ago the process started, as it takes some time before the molecules diffuse from room to room. The question then is: can we distribute the initial number of molecules in a way that the origin of time plays no role? One guess might be to take X0 = N/2. But this will not work, because it is impossible for the number of molecules to remain constant at all times. (On the average, though, we indeed expect to have N/2 molecules in each room.) Another guess, then, is the following.

Place each molecule at random either in room 1 or in room 2, independently of the others, with probability 1/2 each. Then the number of molecules in room 1 will be i with probability

    π(i) = (N choose i) 2^{−N},    0 ≤ i ≤ N,

(a binomial distribution). If we do so, then we have created a stationary Markov chain. But we need to prove this. We can verify that if P(X0 = i) = π(i) for all i, then P(X1 = i) = π(i) and, hence, P(Xn = i) = π(i) for all n. Indeed,

    P(X1 = i) = Σ_j P(X0 = j) pj,i = P(X0 = i − 1) pi−1,i + P(X0 = i + 1) pi+1,i
              = π(i − 1) pi−1,i + π(i + 1) pi+1,i
              = 2^{−N} (N choose i−1) (N − i + 1)/N + 2^{−N} (N choose i+1) (i + 1)/N = · · · = π(i).

(The dots signify some missing algebra, which you should go through by yourselves.)

The example above is typical: we found (by guessing!) some distribution π with the property that if X0 has distribution π then every other Xn also has distribution π. Changing notation, we are led to consider:

The stationarity problem: Given [pij], find those distributions π such that

    πP = π.    (4)

Such a π is called a stationary or invariant distribution, and the equations (4) are called balance equations.

Since, by (3), the distribution µn at time n is related to the distribution at time 0 by µn = µ0 P^n, we see that the Markov chain is stationary if and only if the distribution µ0 is chosen so that µ0 = µ0 P. In other words, a Markov chain is stationary if and only if the distribution of Xn does not change with n. More precisely:

Lemma 2. A Markov chain (Xn, n ∈ Z+) is stationary if and only if it is time-homogeneous and Xn has the same distribution as Xℓ for all n and ℓ.

Proof. The only if part is obvious. For the if part, use (1) to see that

    P(X0 = i0, X1 = i1, . . ., Xn = in) = P(Xm = i0, Xm+1 = i1, Xm+2 = i2, . . ., Xm+n = in),

for all m ≥ 0 and i0, . . ., in ∈ S. □
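For the Ehrenfest chain, the "missing algebra" can also be delegated to the computer. The following sketch checks, for a modest N (the physical N = 10²⁵ is of course out of reach), that the binomial distribution satisfies the balance equations (4):

```python
import numpy as np
from math import comb

N = 20
# Ehrenfest transition matrix: p_{i,i+1} = (N-i)/N, p_{i,i-1} = i/N.
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i < N:
        P[i, i + 1] = (N - i) / N
    if i > 0:
        P[i, i - 1] = i / N

pi = np.array([comb(N, i) * 2.0**(-N) for i in range(N + 1)])
print(np.allclose(pi @ P, pi))  # True: pi P = pi, the balance equations (4)
```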

Eigenvector interpretation: The equation π = πP says that π is a left eigenvector of the matrix P with eigenvalue 1. The matrix P has the eigenvalue 1 always, because Σ_{j∈S} pi,j = 1 for all i ∈ S; in matrix notation, this can be written as P1 = 1, where 1 is a column whose entries are all 1, and this means that 1 is a (right) eigenvector of P corresponding to the eigenvalue 1. In addition to being a left eigenvector, π must be a probability:

    Σ_{i∈S} π(i) = 1.

Example: waiting for a bus. Consider the following process: at time n (measured in minutes) let Xn be the time till the arrival of the next bus. After a bus arrives at the bus stop, the next one will arrive in i minutes with probability p(i), i = 1, 2, . . .; the times between successive buses are i.i.d. random variables. Then (Xn, n ≥ 0) is a Markov chain with state space S = N = {1, 2, . . .} and transition probabilities

    pi+1,i = 1,  i ∈ N,        p1,i = p(i),  i ∈ N.

To find the stationary distribution, write the equations (4):

    π(i) = Σ_{j≥1} π(j) pj,i = π(1) p(i) + π(i + 1),    i ≥ 1.

These can easily be solved:

    π(i) = π(1) Σ_{j≥i} p(j),    i ≥ 1.

To compute π(1), we use the normalisation condition Σ_{i≥1} π(i) = 1, which gives

    π(1) = ( Σ_{i≥1} Σ_{j≥i} p(j) )^{−1}.

But here, we must be careful. Who tells us that the sum inside the parentheses is finite? If the sum is infinite, then we run into trouble: indeed, π(1) will be zero, and so each π(i) will be zero, which in turn means that the normalisation condition cannot be satisfied. This means that the solution to the balance equations is identically zero. So we must make an:

    ASSUMPTION:  Σ_{i≥1} Σ_{j≥i} p(j) < ∞.

Let us work out the sum to see if this assumption can be expressed in more "physical" terms:

    Σ_{i≥1} Σ_{j≥i} p(j) = Σ_{j≥1} Σ_{i≥1} 1(j ≥ i) p(j) = Σ_{j≥1} j p(j),

and the latter sum is the expected time between two successive bus arrivals. So our assumption really means:

    ASSUMPTION ⇐⇒ the expected time between successive bus arrivals is finite.

So, what we have really shown is that this assumption is a necessary and sufficient condition for the existence and uniqueness of a stationary distribution.

Example: card shuffling. Consider a deck of d = 52 cards and shuffle them in the following (stupid) way: pick two cards at random and swap their positions. Keep doing it continuously. We can describe this by a Markov chain which takes values in the set Sd of d! permutations of {1, . . ., d}; each x ∈ Sd is of the form x = (x1, . . ., xd), where all the xi are distinct. It is easy to see that the stationary distribution is uniform:

    π(x) = 1/d!,    x ∈ Sd.

Example: bread kneading.* (This is slightly beyond the scope of these lectures and can be omitted. But to even think about it will give you some strength.) Consider an infinite binary vector ξ = (ξ1, ξ2, . . .), ξi ∈ {0, 1} for all i, and transform it according to the following rule: if ξ1 = 0 then shift to the left to obtain (ξ2, ξ3, . . .); if ξ1 = 1 then shift to the left and flip all bits to obtain (1 − ξ2, 1 − ξ3, . . .). We have thus defined a mapping ϕ from binary vectors into binary vectors. The Markov chain is obtained by applying the mapping ϕ again and again:

    ξ(n + 1) = ϕ(ξ(n)).

Here ξ(n) = (ξ1(n), ξ2(n), . . .) is the state at time n.

Remarks: This is not a Markov chain with countably many states, so it goes beyond our theory. Note also that there is nothing random in the transition mechanism of this chain! The only way to instill randomness is by choosing the initial state at random. So let us do so. We can show that if we start with the ξr(0), r = 1, 2, . . ., i.i.d. with

    P(ξr(0) = 1) = P(ξr(0) = 0) = 1/2,

then the same will be true for all n:

    P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2,    for any r = 1, 2, . . . and n = 0, 1, 2, . . .    (5)

The proof is by induction: if the ξ1(n), ξ2(n), . . . are i.i.d. with P(ξr(n) = 1) = P(ξr(n) = 0) = 1/2, then the same is true at n + 1. Indeed, for any ε1, . . ., εr ∈ {0, 1},

    P(ξ1(n+1) = ε1, . . ., ξr(n+1) = εr)
        = P(ξ1(n) = 0, ξ2(n) = ε1, . . ., ξr+1(n) = εr) + P(ξ1(n) = 1, ξ2(n) = 1 − ε1, . . ., ξr+1(n) = 1 − εr).

You can check that, by the induction hypothesis, each of the two terms equals 2^{−(r+1)}, so the sum equals 2^{−r}, as required. Hence choosing the initial state at random in this manner results in a stationary Markov chain.

As for the name of the chain (bread kneading), it can be justified as follows: to each ξ there corresponds a number x between 0 and 1, because ξ can be seen as the binary representation of the real number x. If ξ1 = 0 then x < 1/2, and our shifting rule maps x into 2x. If ξ1 = 1 then x ≥ 1/2, and the rule maps x into 2(1 − x). But the function f(x) = min(2x, 2(1 − x)), applied to the interval [0, 1] (think of [0, 1] as a flexible bread dough arranged in a long rod), kneads the bread: it stretches it till it doubles in length and then folds it back in the middle.

Note on terminology: when a Markov chain is stationary we refer to it as being in steady-state.

4.1 Finding a stationary distribution

As explained above, a stationary distribution π satisfies

    π = πP    (balance equations)
    π1 = 1    (normalisation condition).

In other words,

    π(i) = Σ_{j∈S} π(j) pj,i,    i ∈ S,
    Σ_{j∈S} π(j) = 1.

If the state space has d states, then the above are d + 1 equations. But we only have d unknowns. This is OK, because the first d equations are linearly dependent: the sum of both sides equals 1, therefore one of them is always redundant.

Writing the equations in this form is not always the best thing to do. Instead of eliminating equations, we introduce more, expecting to be able to choose, amongst them, a set of d linearly independent equations which can be more easily solved. The convenience is often suggested by the topological structure of the chain, as we will see in examples. Let us introduce the notion of "probability flow" from a set A of states to its complement, under a distribution π:

    F(A, Ac) := Σ_{i∈A} Σ_{j∈Ac} π(i) pi,j.

Think of π(i) pi,j as a "current" flowing from i to j. So F(A, Ac) is the total current from A to Ac. We have:

Proposition 1. π is a stationary distribution if and only if π1 = 1 and

    F(A, Ac) = F(Ac, A),    for all A ⊂ S.

Proof. If the latter condition holds, then choose A = {i} and see that you obtain the balance equations. Conversely, suppose that the balance equations hold. Fix a set A ⊂ S and write

    π(i) = Σ_{j∈S} π(j) pj,i = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Multiply π(i) by Σ_{j∈S} pi,j (which equals 1):

    π(i) Σ_{j∈S} pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Now split the left sum also:

    Σ_{j∈A} π(i) pi,j + Σ_{j∈Ac} π(i) pi,j = Σ_{j∈A} π(j) pj,i + Σ_{j∈Ac} π(j) pj,i.

Summing over i ∈ A, we see that the first term on the left cancels with the first on the right, while the remaining two terms give the equality we seek, F(A, Ac) = F(Ac, A). This is true for all A ⊂ S. □

    [Figure: schematic representation of the flow balance relations F(A, Ac) = F(Ac, A) between A and its complement.]

The extra degree of freedom provided by the last result gives us flexibility. One can ask how to choose a minimal complete set of linearly independent equations, but we shall not do this here. Instead, here is an example:

Example: Consider the 2-state Markov chain with p12 = α, p11 = 1 − α, p21 = β, p22 = 1 − β. Assume 0 < α, β < 1. We have

    π(1)α = π(2)β,

which immediately tells us that π(1) is proportional to β and π(2) proportional to α:

    π(1) = cβ,    π(2) = cα.

The constant c is found from π(1) + π(2) = 1. Thus,

    π(1) = β/(α + β),    π(2) = α/(α + β).

If we take α = 0.05, β = 0.99 (as in the mouse example), we find π(1) = 0.99/1.04 ≈ 0.952 and π(2) = 0.05/1.04 ≈ 0.048. Consequently, in steady state, the mouse spends 95.2% of the time in room 1 and only 4.8% of the time in room 2.
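For any finite chain, the balance equations together with the normalisation condition form a linear system that a computer solves mechanically. A sketch, replacing one (redundant) balance equation by the normalisation condition, and checked against the mouse example:

```python
import numpy as np

def stationary(P):
    """Solve pi P = pi with sum(pi) = 1 for a finite transition matrix P,
    replacing one redundant balance equation by the normalisation condition."""
    d = P.shape[0]
    A = np.vstack([(P.T - np.eye(d))[:-1], np.ones(d)])
    b = np.zeros(d)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# The mouse chain: alpha = 0.05, beta = 0.99, so pi = (beta, alpha)/(alpha+beta).
P = np.array([[0.95, 0.05],
              [0.99, 0.01]])
print(stationary(P))  # approximately [0.952, 0.048]
```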

Example: a walk with a barrier. Consider the following chain, where p + q = 1:

    0 --p--> 1 --p--> 2 --p--> 3 --p--> 4 · · ·
    (and back: i --q--> i − 1 for i ≥ 1, with a self-loop 0 --q--> 0 at the barrier)

The balance equations are:

    π(i) = p π(i − 1) + q π(i + 1),    i = 1, 2, . . .

But we can make them simpler by choosing A = {0, 1, . . ., i − 1}. If i ≥ 1 we have

    F(A, Ac) = π(i − 1) p = π(i) q = F(Ac, A).

These are much simpler than the previous ones. In fact, we can solve them immediately: for i ≥ 1,

    π(i) = (p/q)^i π(0).

Does this provide a stationary distribution? Yes, provided that Σ_{i≥0} π(i) = 1. This is possible if and only if p < q, i.e. p < 1/2. (If p ≥ 1/2 there is no stationary distribution.)

5 Topological structure

5.1 The graph of a chain

This refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. This is often cast in terms of a digraph (directed graph), i.e. a pair (S, E) where S is the state space and E ⊂ S × S is defined by

    (i, j) ∈ E  ⇐⇒  pi,j > 0.

In graph-theoretic terminology, S is a set of vertices, and E a set of directed edges. We can frequently (but not always) draw the graph (as we did in examples) and visualise the set E as the set of all edges with arrows.

5.2 The relation "leads to"

We say that a state i leads to state j if, starting from i, the chain will visit j at some finite time. In other words,

    i leads to j (written as i ⇝ j)  ⇐⇒  Pi(∃ n ≥ 0  Xn = j) > 0.

Notice that this relation is reflexive: ∀ i ∈ S, i ⇝ i. Furthermore:

Theorem 1. The following are equivalent:
(i) i ⇝ j;
(ii) pi,i1 pi1,i2 · · · pin−1,j > 0 for some n ∈ N and some states i1, . . ., in−1;
(iii) Pi(Xn = j) > 0 for some n ≥ 0.

Proof. If i = j there is nothing to prove. So assume i ≠ j.
• Suppose that i ⇝ j. Since

    0 < Pi(∃ n ≥ 0  Xn = j) ≤ Σ_{n≥0} Pi(Xn = j),

the sum on the right must have a nonzero term, and so (iii) holds.
• Suppose that (ii) holds. We have

    Pi(Xn = j) = Σ_{i1,...,in−1} pi,i1 pi1,i2 · · · pin−1,j,    (6)

where the sum extends over all possible choices of states in S. By assumption, the sum contains a nonzero term. Hence Pi(Xn = j) > 0, meaning that (iii) holds.
• Suppose now that (iii) holds. Then i ⇝ j, because Pi(Xn = j) ≤ Pi(∃ m ≥ 0 Xm = j), and so (i) holds. Also, look at the last display (6) again: since Pi(Xn = j) > 0, we have that the sum is also positive and so one of its terms is positive, meaning that (ii) holds. □

Corollary 1. The relation ⇝ is transitive: if i ⇝ j and j ⇝ k then i ⇝ k.

5.3 The relation "communicates with"

Define next the relation

    i communicates with j (written as i ↔ j)  ⇐⇒  i ⇝ j and j ⇝ i.

Obviously, this relation is symmetric (i ↔ j ⇐⇒ j ↔ i). It is reflexive and transitive because ⇝ is so. Hence:

Corollary 2. The relation "communicates with" is an equivalence relation.

(Equivalence relation means that it is symmetric, reflexive and transitive.) Just as any equivalence relation, it partitions S into equivalence classes, known as communicating classes. The communicating class corresponding to the state i is, by definition, the set of all states that communicate with i:

    [i] := {j ∈ S : j ↔ i}.

So, by definition, [i] = [j] if and only if i ↔ j. Two communicating classes are either identical or completely disjoint.

    [Figure: a chain with states 1, . . ., 6.]

In the example of the figure, we see that there are four communicating classes: {1}, {2, 3}, {4, 5}, {6}. The first two classes differ from the last two in character. The class {4, 5} is closed: if the chain goes in there, then it never moves out of it. However, the class {2, 3} is not closed.

More generally, we say that a set of states C ⊂ S is closed if

    Σ_{j∈C} pi,j = 1,    for all i ∈ C.

Lemma 3. A communicating class C is closed if and only if the following holds: if i ∈ C and if i ⇝ j, then j ∈ C.

Proof. Suppose first that C is a closed communicating class, and suppose that i ⇝ j. We then can find states i1, . . ., in−1 such that pi,i1 > 0, pi1,i2 > 0, . . ., pin−1,j > 0. Since i ∈ C and pi,i1 > 0, and since C is closed, we have i1 ∈ C. Similarly, i2 ∈ C, and so on; we conclude that j ∈ C. The converse is left as an exercise. □

Note: If a state i is such that pi,i = 1, then we say that i is an absorbing state; in other words, the single-element set {i} is a closed communicating class.

If all states communicate with all others, then the chain (or, more precisely, its transition matrix P) is called irreducible. In other words, the state space S is a closed communicating class; still otherwise said, starting from any state, we can reach any other state.

Note: If we consider a Markov chain with transition matrix P and fix n ∈ N, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. (Why?)

A state i is essential if it belongs to a closed communicating class; otherwise, the state is inessential. Note: An inessential state i is such that there exists a state j such that i ⇝ j but j ⇝̸ i.

    [Figure: the general structure of a Markov chain, with communicating classes C1, . . ., C5.]

The classes C4, C5 in the figure are closed communicating classes; the classes C1, C2, C3 are not closed. Note that there can be no arrows between closed communicating classes. There can be arrows between two non-closed communicating classes, but they are always in the same direction. A general Markov chain can have an infinite number of communicating classes. Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable, parts: if we remove the non-closed communicating classes, then we obtain a collection of disjoint closed communicating classes with no arrows between them.
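Communicating classes are precisely the strongly connected components of the digraph of the chain, so they can be computed by two reachability searches. A sketch for small chains; the edge list for the 6-state example is an assumption, since the figure itself is not recoverable here:

```python
def classes(S, edges):
    """Communicating classes of the digraph (S, E): i and j are in the same
    class iff each leads to the other (mutual reachability)."""
    adj = {i: set() for i in S}
    for i, j in edges:
        adj[i].add(j)

    def reach(i):                      # states j with i "leads to" j
        seen, stack = {i}, [i]
        while stack:
            for k in adj[stack.pop()]:
                if k not in seen:
                    seen.add(k); stack.append(k)
        return seen

    R = {i: reach(i) for i in S}
    out, done = [], set()
    for i in S:
        if i not in done:
            cls = {j for j in R[i] if i in R[j]}   # j <-> i
            out.append(cls); done |= cls
    return out

# Hypothetical edges consistent with the 6-state example above.
E = [(1,2),(2,3),(3,2),(2,4),(3,5),(4,5),(5,4),(6,6)]
print(classes(range(1, 7), E))  # [{1}, {2, 3}, {4, 5}, {6}]
```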

5.4 Period

Consider the following chain:

    [Figure: states 1, 2, 3, 4 arranged in a square; from each state the chain moves to one of its two neighbours.]

The chain is irreducible, i.e. all states form a single closed communicating class. Notice, however, that we can further simplify the behaviour of the chain by noticing that if the chain is in the set C1 = {1, 3} at some point of time, then it will move to the set C2 = {2, 4} at the next step, and vice versa: the chain alternates between the two sets. So if the chain starts with X0 ∈ C1, then we know that X1 ∈ C2, X2 ∈ C1, X3 ∈ C2, X4 ∈ C1, . . .. We say that the chain has period 2. (Can you find an example of a chain with period 3?)

We will now give the definition of the period of a state of an arbitrary chain. Let us denote by p(n)ij the entries of P^n. Since P^{m+n} = P^m P^n, we have

    p(m+n)ij ≥ p(m)ik p(n)kj,    for all m, n ≥ 0 and all states i, j, k.

So p(2n)ii ≥ (p(n)ii)² and, more generally, p(ℓn)ii ≥ (p(n)ii)^ℓ, ℓ ∈ N. Therefore, if p(n)ii > 0, then p(ℓn)ii > 0 for all ℓ ∈ N. Another way to say this is:

    If the integer n divides m (denote this by: n | m) and if p(n)ii > 0, then p(m)ii > 0.

So, whether it is possible to return to i in m steps can be decided by one of the integer divisors of m. The period of an essential state is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps:

    d(i) := gcd{n ∈ N : Pi(Xn = i) > 0}.

A state i is called aperiodic if d(i) = 1. The period of an inessential state is not defined.

Theorem 2 (the period is a class property). If i ↔ j, then d(i) = d(j).

Proof. Consider two distinct essential states i, j. Let Di = {n ∈ N : p(n)ii > 0}, Dj = {n ∈ N : p(n)jj > 0}, d(i) = gcd Di, d(j) = gcd Dj. Since i ↔ j, there is α ∈ N with p(α)ij > 0 and, since

j ⇝ i, there is β ∈ N with p(β)ji > 0. So p(α+β)jj ≥ p(α)ij p(β)ji > 0, showing that α + β ∈ Dj; therefore d(j) | α + β. Now let n ∈ Di. Then p(n)ii > 0, and so

    p(α+n+β)jj ≥ p(α)ij p(n)ii p(β)ji > 0,

showing that α + n + β ∈ Dj; therefore d(j) | α + β + n for all n ∈ Di. If a number divides two other numbers, then it also divides their difference. From the last two displays we get d(j) | n for all n ∈ Di. So d(j) is a divisor of all elements of Di. Since d(i) is the greatest common divisor, we have d(i) ≥ d(j) (in fact, d(j) | d(i)). Arguing symmetrically, we obtain d(i) ≤ d(j) as well. So d(i) = d(j). □

Of particular interest are irreducible chains, i.e. chains where all states communicate with one another. An irreducible chain has period d if one (and hence all) of the states have period d. In particular, if d = 1, the chain is called aperiodic.

Theorem 3. If i is an aperiodic state, then there exists n0 such that p(n)ii > 0 for all n ≥ n0.

Proof. Pick n1, n2 ∈ Di such that n2 − n1 = 1. Let n be sufficiently large. Divide n by n1 to obtain n = qn1 + r, where the remainder r ≤ n1 − 1. Therefore

    n = qn1 + r(n2 − n1) = (q − r)n1 + rn2.

Since n1, n2 ∈ Di, and since q − r > 0 (because n is sufficiently large), we have n ∈ Di as well. □

Corollary 3. An irreducible chain with finitely many states is aperiodic if and only if there exists an n such that p(n)ij > 0 for all i, j ∈ S.

More generally, if an irreducible chain has period d, then we can decompose the state space into d sets C0, C1, . . ., Cd−1 such that the chain moves cyclically between them. Formally, this is the content of the following theorem.

Theorem 4. Consider an irreducible chain with period d. Then we can uniquely partition the state space S into d disjoint subsets C0, C1, . . ., Cd−1 such that

    Σ_{j∈Cr+1} pij = 1,    i ∈ Cr,  r = 0, 1, . . ., d − 1.

(Here Cd := C0.)

Proof. Define the relation

    i ~d j  ⇐⇒  p(nd)ij > 0 for some n ≥ 0.

Notice that this is an equivalence relation. Hence it partitions the state space S into disjoint equivalence classes. Pick a state i0 and let C0 be its equivalence class (i.e. the set of states j such that i0 ~d j). Then pick i1 such that pi0,i1 > 0, and let C1 be the equivalence class of i1. Continue in this manner and define states i0, i1, . . ., id−1, with corresponding classes C0, C1, . . ., Cd−1. It is easy to see that if we continue and pick a further state id with pid−1,id > 0, then, necessarily, id ∈ C0. We show that these classes satisfy what we need. Assume d > 1. We now show that if i belongs to one of these classes and if pij > 0, then, necessarily, j belongs to the next class. Take, for example, i ∈ C0, and suppose pij > 0 but j ∉ C1 — say, j ∈ C2. Consider the path

    i0 → i → j → i2 → i3 → · · · → id → i0.

Such a path is possible because of the choice of the i0, i1, . . ., id, and by the assumptions that i0 ~d i and i2 ~d j. The existence of such a path implies that it is possible to go from i0 to i0 in a number of steps equal to an integer multiple of d plus d − 1 (why?), which contradicts the definition of d. □

    [Figure: the internal structure of a closed communicating class with period d = 4.]
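The period of a state can be computed straight from the definition, by taking gcds of the lengths n for which a return is possible. A brute-force sketch for small chains, tried on a chain that alternates between {1, 3} and {2, 4} like the one at the start of this section:

```python
from math import gcd
import numpy as np

def period(P, i, nmax=200):
    """d(i) = gcd{n >= 1 : P_i(X_n = i) > 0}, probing n up to nmax."""
    d, Pn = 0, np.eye(P.shape[0])
    for n in range(1, nmax + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            d = gcd(d, n)
    return d

# A chain alternating between {1, 3} and {2, 4} (states 0..3 here): period 2.
P = np.array([[0, .5, 0, .5],
              [.5, 0, .5, 0],
              [0, .5, 0, .5],
              [.5, 0, .5, 0]])
print(period(P, 0))  # 2
```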

6 Hitting times and first-step analysis

Consider a Markov chain with transition probability matrix P. We define the hitting time⁴ of a set of states A by

    TA := inf{n ≥ 0 : Xn ∈ A}.

If X0 ∈ A, then TA = 0. We are interested in deriving formulae for the probability that this time is finite, as well as for its expectation. We use a method that is based upon considering what the Markov chain does at time 1, i.e. after it takes one step from its current position; that is why we call it "first-step analysis". As an example, consider the chain

    [Figure: states 0, 1, 2; from state 1 the chain moves to 0 with probability 1/2, stays at 1 with probability 1/3, and moves to 2 with probability 1/6; states 0 and 2 are absorbing.]

⁴ We shall later consider the time inf{n ≥ 1 : Xn ∈ A}, which differs from TA simply by considering n ≥ 1 instead of n ≥ 0. If X0 ∉ A, the two times coincide. The reader is warned to be alert as to which of the two variables we are considering at each time; we avoid excessive notation by using the same letter for both.

It is clear that P1(T0 < ∞) < 1, because the chain may end up in state 2. But what exactly is this probability equal to? To answer this in its generality, fix a set of states A, and define

    ϕ(i) := Pi(TA < ∞).

We then have:

Theorem 5. The function ϕ(i) satisfies

    ϕ(i) = 1,    i ∈ A,
    ϕ(i) = Σ_{j∈S} pij ϕ(j),    i ∉ A.

Furthermore, if ϕ′(i) is any other solution of these equations, then ϕ′(i) ≥ ϕ(i) for all i ∈ S.

Proof. If i ∈ A, then TA = 0, and so ϕ(i) = 1. If i ∉ A, then TA ≥ 1. So TA = 1 + T′A, where T′A is the remaining time until A is hit. We first have

    Pi(TA < ∞) = Σ_{j∈S} Pi(1 + T′A < ∞ | X1 = j) Pi(X1 = j) = Σ_{j∈S} Pi(T′A < ∞ | X1 = j) pij.

But observe that the random variable T′A is a function of the future after time 1 (i.e. a function of X1, X2, . . .). Hence, conditionally on X1, the event {T′A < ∞} is independent of X0:

    P(T′A < ∞ | X1 = j, X0 = i) = P(T′A < ∞ | X1 = j).

But the Markov chain is homogeneous, which implies that

    P(T′A < ∞ | X1 = j) = P(TA < ∞ | X0 = j) = Pj(TA < ∞) = ϕ(j).

Combining the above, we have

    Pi(TA < ∞) = Σ_{j∈S} ϕ(j) pij,

as needed. For the second part of the proof, let ϕ′(i) be another solution. Then

    ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ϕ′(j).

By self-feeding this equation, we have

    ϕ′(i) = Σ_{j∈A} pij + Σ_{j∉A} pij ( Σ_{k∈A} pjk + Σ_{k∉A} pjk ϕ′(k) )
          = Σ_{j∈A} pij + Σ_{j∉A} Σ_{k∈A} pij pjk + Σ_{j∉A} Σ_{k∉A} pij pjk ϕ′(k).

We recognise that the first term equals Pi(TA = 1) and the second equals Pi(TA = 2), so the first two terms together equal Pi(TA ≤ 2). By omitting the last term, we obtain ϕ′(i) ≥ Pi(TA ≤ 2). By continuing to self-feed the equation n times, we obtain ϕ′(i) ≥ Pi(TA ≤ n). Letting n → ∞, we obtain ϕ′(i) ≥ Pi(TA < ∞) = ϕ(i). □

Example: In the example of the previous figure, we choose A = {0}, the set that contains only state 0, and let ϕ(i) = Pi(T0 < ∞). We immediately have ϕ(0) = 1 and ϕ(2) = 0. (If the chain starts from 0, it takes no time to hit 0; and, obviously, if the chain starts from 2, it will never hit 0.) As for ϕ(1), we have

    ϕ(1) = (1/3) ϕ(1) + (1/2) ϕ(0),

so ϕ(1) = 3/4.

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Next consider the mean time (mean number of steps) until the set A is hit for the first time:

    ψ(i) := Ei TA.

We have:

Theorem 6. The function ψ(i) satisfies

    ψ(i) = 0,    i ∈ A,
    ψ(i) = 1 + Σ_{j∉A} pij ψ(j),    i ∉ A.

Furthermore, if ψ′(i) is any other solution of these equations, then ψ′(i) ≥ ψ(i) for all i ∈ S.

Proof. If i ∈ A, then, obviously, TA = 0, and so ψ(i) := Ei TA = 0, because it takes no time to hit A if the chain starts from A. If i ∉ A, then, as above, TA = 1 + T′A, and so

    Ei TA = 1 + Ei T′A = 1 + Σ_{j∈S} pij Ej TA = 1 + Σ_{j∉A} pij ψ(j),

which is the second of the equations. The second part of the proof is omitted. □

Example: Continuing the previous example, let A = {0, 2} and ψ(i) = Ei TA. We have ψ(0) = ψ(2) = 0, while

    ψ(1) = 1 + (1/3) ψ(1),

which gives ψ(1) = 3/2. (This should have been expected: the underlying experiment is the flipping of a coin with "success" probability 2/3, namely the probability of leaving state 1 in one step; therefore the mean number of flips until the first success is 3/2.)
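Both systems are linear, so for a finite chain they can be solved directly. A sketch for the 3-state example above (states 0, 1, 2, with 0 and 2 absorbing):

```python
import numpy as np

P = np.array([[1,   0,   0  ],
              [1/2, 1/3, 1/6],
              [0,   0,   1  ]])

# phi(i) = P_i(T_0 < infinity): phi(0) = 1, phi(2) = 0, and
# phi(1) = sum_j p_{1j} phi(j) -- a single linear equation here.
phi1 = (P[1, 0] * 1 + P[1, 2] * 0) / (1 - P[1, 1])
print(phi1)            # 0.75

# psi(i) = E_i T_A with A = {0, 2}: psi(0) = psi(2) = 0, and
# psi(1) = 1 + p_{11} psi(1).
psi1 = 1 / (1 - P[1, 1])
print(psi1)            # 1.5
```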

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

Now let us consider two hitting times TA, TB for two disjoint sets of states A, B. We may ask to find the probabilities

    ϕAB(i) = Pi(TA < TB).

It should be clear that the equations satisfied are:

    ϕAB(i) = 1,    i ∈ A,
    ϕAB(i) = 0,    i ∈ B,
    ϕAB(i) = Σ_j pij ϕAB(j),    otherwise.

Furthermore, ϕAB is the minimal solution to these equations. (This should have been obvious from the start.)

∼◦∼◦∼◦∼◦∼◦∼◦∼◦∼

As yet another application of the first-step analysis method, consider the following situation: every time the state is x, a reward f(x) is earned. This happens up to the first hitting time TA, i.e. until set A is reached. The total reward is f(X0) + f(X1) + · · · + f(X_{TA}). We are interested in the mean total reward

    h(x) := Ex Σ_{n=0}^{TA} f(Xn).

Clearly,

    h(x) = f(x),    x ∈ A,

because when X0 ∈ A, then TA = 0 and so the total reward is f(X0). Now suppose that x ∉ A. Then, as argued earlier, TA = 1 + T′A, where T′A is the remaining time until A is hit after one step. Hence

    h(x) = Ex Σ_{n=0}^{TA} f(Xn) = f(x) + Ex Σ_{n=1}^{1+T′A} f(Xn) = f(x) + Ex Σ_{n=0}^{T′A} f(Xn+1),

where, in the last sum, we just changed index from n to n + 1. Now the last sum is a function of the future after time 1, i.e. of the random variables X1, X2, . . .. Hence, by the Markov property,

    Ex Σ_{n=0}^{T′A} f(Xn+1) = Σ_{y∈S} pxy h(y).

Therefore,

    h(x) = f(x) + Σ_{y∈S} pxy h(y),    x ∉ A.

You can remember the latter equations by thinking that the expected total reward equals the immediate reward f(x) plus the expected remaining reward. Thus, the set of equations satisfied by h is:

    h(x) = f(x),    x ∈ A,
    h(x) = f(x) + Σ_{y∈S} pxy h(y),    x ∉ A.

Example: A thief sneaks into the following house and moves around the rooms, collecting "rewards": when in room i, he collects i pounds, unless he is in room 4, in which case he dies immediately.

    [Figure: four rooms arranged in a square, 1-2-3-4, with doors between adjacent rooms; from each room the thief moves to one of the two adjacent rooms with probability 1/2.]

(Sooner or later, he will die; why?) If we let h(i) be the average total reward if he starts from room i, the equations are

    h(4) = 0
    h(1) = 1 + (1/2) h(2)
    h(3) = 3 + (1/2) h(2)
    h(2) = 2 + (1/2) h(1) + (1/2) h(3).

We solve and find h(2) = 8, h(1) = 5, h(3) = 7.
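The same linear-system recipe verifies the thief's numbers; the room layout is the one assumed in the figure (room 2 adjacent to rooms 1 and 3; rooms 1 and 3 adjacent to room 4):

```python
import numpy as np

# Unknowns h(1), h(2), h(3), with h(4) = 0:  h = f + P h on the safe rooms.
A = np.array([[1,   -1/2,  0  ],   # h1 - (1/2) h2            = 1
              [-1/2, 1,   -1/2],   # h2 - (1/2) h1 - (1/2) h3 = 2
              [0,   -1/2,  1  ]])  # h3 - (1/2) h2            = 3
f = np.array([1, 2, 3])
print(np.linalg.solve(A, f))       # [5, 8, 7]
```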

7 Gambler's ruin

Consider the simplest game of chance of tossing an (possibly unfair) coin repeatedly and independently. Let p be the probability of heads and q = 1 − p that of tails. If heads come up then you win £1 per pound of stake; if just before the k-th toss you decide to bet Yk pounds, then after the realisation of the toss you will have earned Yk pounds if heads came up, or lost Yk pounds if tails appeared. (In the simplest case, Yk = 1 for all k.) If ξk is the outcome of the k-th toss (where ξk = +1 means heads and ξk = −1 means tails), your fortune after the n-th toss is

    Xn = X0 + Σ_{k=1}^{n} Yk ξk.

The successive stakes (Yk, k ∈ N) form the strategy of the gambler. Of course, no gambler is assumed to have any information about the outcome of the upcoming toss, so if the gambler wants to have a strategy at all, she must base it on whatever outcomes have appeared thus far. In other words, Yk cannot depend on ξk; it can only depend on ξ1, . . ., ξk−1 and (as is often the case in practice) on other considerations of, e.g., astrological nature.

To simplify things, consider a gambler playing the simple coin tossing game described above, betting £1 at each integer time against a coin which has P(heads) = p, P(tails) = 1 − p. Let £x be the initial fortune of the gambler: she starts with X0 = x pounds and has a target, namely to make b pounds (b > x); she will stop playing if her money falls below a (a < x). We are clearly dealing with a Markov chain in the state space {a, a + 1, . . ., b} with transition probabilities

    pi,i+1 = p,    pi,i−1 = q = 1 − p,    a < i < b,

and where the states a, b are absorbing. Consider the hitting time of state y:

    Ty := inf{n ≥ 0 : Xn = y}.

We use the notation Px (respectively, Ex) for the probability (respectively, expectation) under the initial condition X0 = x. Finding the probability that the gambler reaches her target before ruin is our next goal.

The difference between the gambler's ruin problem for a company and that for a mere mortal is that, in the first case, the company has control over the probability p. This has profound actuarial applications: an insurance company must make sure that the strategies it employs are such that, at least, the company will not go bankrupt (and that, in addition, it will make enormous profits). (Think why!)

Theorem 7. Consider a gambler playing a simple coin tossing game, as described above. Then the probability that the gambler's fortune will reach level b > x before reaching level a < x equals:

    Px(Tb < Ta) = ((q/p)^{x−a} − 1) / ((q/p)^{b−a} − 1),    if p ≠ q,
    Px(Tb < Ta) = (x − a) / (b − a),    if p = q = 1/2,

for a ≤ x ≤ b.

Proof. We first solve the problem when a = 0, b = ℓ > 0, and let ϕ(x, ℓ) := Px(Tℓ < T0), 0 ≤ x ≤ ℓ. The general solution will then be obtained from

    Px(Tb < Ta) = ϕ(x − a, b − a).

To simplify notation, write ϕ(x) instead of ϕ(x, ℓ). We obviously have

    ϕ(0) = 0,    ϕ(ℓ) = 1,

and, by first-step analysis,

    ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1),    0 < x < ℓ.

Since p + q = 1, we can write this as

    p [ϕ(x + 1) − ϕ(x)] = q [ϕ(x) − ϕ(x − 1)].

Let δ(x) := ϕ(x + 1) − ϕ(x). We have

    δ(x) = (q/p) δ(x − 1) = (q/p)² δ(x − 2) = · · · = (q/p)^x δ(0),    0 < x < ℓ.

Assuming that λ := q/p ≠ 1, we find

    ϕ(x) = δ(x − 1) + δ(x − 2) + · · · + δ(0) = [λ^{x−1} + · · · + λ + 1] δ(0) = ((λ^x − 1)/(λ − 1)) δ(0).

Since ϕ(ℓ) = 1, we have ((λ^ℓ − 1)/(λ − 1)) δ(0) = 1, which gives δ(0) = (λ − 1)/(λ^ℓ − 1). We finally find

    ϕ(x) = (λ^x − 1)/(λ^ℓ − 1),    0 ≤ x ≤ ℓ,  λ ≠ 1.

If λ = 1, then q = p, and so ϕ(x + 1) − ϕ(x) = ϕ(x) − ϕ(x − 1). This gives δ(x) = δ(0) for all x, and ϕ(x) = x δ(0). Since ϕ(ℓ) = 1, we find δ(0) = 1/ℓ, and so

    ϕ(x) = x/ℓ,    0 ≤ x ≤ ℓ,  λ = 1.

Summarising, we have obtained

    ϕ(x, ℓ) = (λ^x − 1)/(λ^ℓ − 1),  if λ ≠ 1;        ϕ(x, ℓ) = x/ℓ,  if λ = 1.

The general case is obtained by replacing ℓ by b − a and x by x − a. □

Corollary 4. The ruin probability of a gambler starting with £x is

    Px(T0 < ∞) = (q/p)^x,  if p > 1/2;        Px(T0 < ∞) = 1,  if p ≤ 1/2.

Proof. We have

    Px(T0 < ∞) = lim_{ℓ→∞} Px(T0 < Tℓ) = 1 − lim_{ℓ→∞} Px(Tℓ < T0).

If p > 1/2, then λ = q/p < 1, and so

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − (λ^x − 1)/(0 − 1) = λ^x.

If p < 1/2, then λ = q/p > 1, and so

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} (λ^x − 1)/(λ^ℓ − 1) = 1 − 0 = 1.

Finally, if p = q = 1/2, then

    Px(T0 < ∞) = 1 − lim_{ℓ→∞} x/ℓ = 1 − 0 = 1.  □
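Theorem 7 invites a sanity check by simulation. A rough sketch; the parameter values are arbitrary illustrative choices:

```python
import random

def reaches_b_first(x, a, b, p):
    """One play: does the fortune hit b before a, starting from x?"""
    while a < x < b:
        x += 1 if random.random() < p else -1
    return x == b

x, a, b, p, N = 5, 0, 10, 0.45, 100_000
est = sum(reaches_b_first(x, a, b, p) for _ in range(N)) / N
lam = (1 - p) / p
exact = (lam**(x - a) - 1) / (lam**(b - a) - 1)
print(est, exact)   # the two numbers should be close
```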

8 Stopping times and the strong Markov property

A random time is a random variable taking values in the time-set, including, possibly, the value +∞. A random time τ is called a stopping time if 1(τ = n) is a deterministic function of (X0, . . ., Xn), for all n ∈ Z+. An example of a stopping time is the time TA, the first hitting time of the set A, considered earlier. An example of a random time which is not a stopping time is the last visit of a specific set.

Recall that the Markov property states that, for any n, the future process (Xn, Xn+1, . . .) is independent of the past process (X0, . . ., Xn−1) if we know the present Xn. We will now prove that this remains true if we replace the deterministic n by a random stopping time τ. This stronger property is referred to as the strong Markov property.

Theorem 8 (strong Markov property). Suppose (Xn, n ≥ 0) is a time-homogeneous Markov chain and let τ be a stopping time. Then, conditional on τ < ∞ and Xτ = i, the future process (Xτ+n, n ≥ 0) is independent of the past process (X0, . . ., Xτ) and has the same law as the original process started from i.

Proof. Let A be any event determined by the past (X0, . . ., Xτ) if τ < ∞. Formally, any such A must have the property that, for any n, A ∩ {τ = n} is determined by (X0, . . ., Xn). (Since (X0, . . ., Xτ)1(τ < ∞) = Σ_{n∈Z+} (X0, . . ., Xn)1(τ = n), this is the natural requirement.) We are going to show that, for any such A,

    P(Xτ+1 = j1, Xτ+2 = j2, . . . | Xτ = i, A, τ < ∞) = pi,j1 pj1,j2 · · ·

and that

    P(Xτ+1 = j1, Xτ+2 = j2, . . ., A | Xτ = i, τ < ∞)
        = P(Xτ+1 = j1, Xτ+2 = j2, . . . | Xτ = i, τ < ∞) P(A | Xτ = i, τ < ∞).

We have:

    P(Xτ+1 = j1, Xτ+2 = j2, . . ., A, Xτ = i, τ = n)
    (a) = P(Xn+1 = j1, Xn+2 = j2, . . ., A, Xn = i, τ = n)
    (b) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i, A, τ = n) P(Xn = i, A, τ = n)
    (c) = P(Xn+1 = j1, Xn+2 = j2, . . . | Xn = i) P(Xn = i, A, τ = n)
    (d) = pi,j1 pj1,j2 · · · P(Xn = i, A, τ = n),

where (a) is just logic, (b) is by the definition of conditional probability, (c) is due to the fact that A ∩ {τ = n} is determined by (X0, . . ., Xn) and the ordinary Markov property (splitting at n), and (d) is due to the fact that (Xn) is a time-homogeneous Markov chain with transition probabilities pij. Our assertions are now easily obtained from this by first summing over all n (whence the event {τ < ∞} appears) and then by dividing both sides by P(Xτ = i, τ < ∞). □

9 Regenerative structure of a Markov chain

A sequence of i.i.d. random variables is a (very) particular example of a Markov chain. The random variables Xn comprising a general Markov chain are not independent. However, by using the strong Markov property, we will show that we can split the sequence into i.i.d. "chunks". The idea is that if we can find a state i that is visited again and again, then the behaviour of the chain between successive visits cannot be influenced by the past or the future (and that is due to the strong Markov property).

Fix a state i ∈ S and let Ti be the first time that the chain will visit this state:

    Ti := inf{n ≥ 1 : Xn = i}.

Note the subtle difference between this Ti and the Ti of Section 6: the Ti of Section 6 was allowed to take the value 0, whereas our Ti here is always positive. If X0 ≠ i, the two definitions coincide. We avoid excessive notation by using the same letter for both; the reader is warned to be alert as to which of the two variables we are considering at each time.

State i may or may not be visited again. Regardless, we let Ti^(2) be the time of the second visit, Ti^(3) the time of the third visit, and so on. (And we set Ti^(1) := Ti.) Formally, we recursively define

    Ti^(r) := inf{n > Ti^(r−1) : Xn = i},    r = 2, 3, . . .

All these random times are stopping times. (Why?)

Consider the "trajectory" of the chain between two successive visits to the state i:

    Xi^(r) := (Xn, Ti^(r) ≤ n < Ti^(r+1)),    r = 1, 2, . . .

Let also

    Xi^(0) := (Xn, 0 ≤ n < Ti^(1)).

We think of the Xi^(r) as "random variables", in a generalised sense: they are random trajectories. We refer to them as excursions of the process. These excursions are independent!

    [Figure: a trajectory split into excursions between the successive visits to i, at times Ti^(1), Ti^(2), Ti^(3), Ti^(4).]

An immediate consequence of the strong Markov property is:

Theorem 9. The excursions X_i^{(0)}, X_i^{(1)}, X_i^{(2)}, ... are independent random variables. Moreover, they are also identical in law (i.e. have the same distribution), except, possibly, for X_i^{(0)}.

Proof. Apply the Strong Markov Property (SMP) at T_i^{(r)}: the SMP says that, given the state at the stopping time T_i^{(r)}, the future after T_i^{(r)} is independent of the past before T_i^{(r)}. But the state at T_i^{(r)} is, by definition, equal to i. Therefore, the future is independent of the past, and this proves the first claim. The claim that X_i^{(1)}, X_i^{(2)}, ... have the same distribution follows from the fact that the Markov chain is time-homogeneous: if it starts from state i at some point of time, it will behave in the same way as if it starts from the same state at another point of time.

A trivial, but important corollary of the observation of this section is that:

Corollary 5. (Numerical) functions g(X_i^{(0)}), g(X_i^{(1)}), g(X_i^{(2)}), ... of the excursions are independent random variables.

Example: Define

  Λ_r := Σ_{T_i^{(r)} ≤ n < T_i^{(r+1)}} 1(X_n = j).

The meaning of Λ_r is this: it expresses the number of visits to a state j between two successive visits to the state i. The last corollary ensures us that Λ_1, Λ_2, ... are independent random variables.

10 Recurrence and transience

When a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states (inessential ones) will be visited finitely many times, but there are always states which will be visited again and again, ad infinitum. We abstract this trivial observation and give a couple of definitions.

First, we say that a state i is recurrent if, starting from i, the chain returns to i infinitely many times:

  P_i(X_n = i for infinitely many n) = 1.

Second, we say that a state i is transient if, starting from i, the chain returns to i only finitely many times:

  P_i(X_n = i for infinitely many n) = 0.

Important note: Notice that the two statements above are not negations of one another. Indeed, in principle, there could be yet another possibility:

  0 < P_i(X_n = i for infinitely many n) < 1.

We will show that such a possibility does not exist, therefore concluding that every state is either recurrent or transient.

Consider the total number of visits to state i (excluding the possibility that X_0 = i):

  J_i = Σ_{n=1}^∞ 1(X_n = i) = sup{r ≥ 0 : T_i^{(r)} < ∞}.

Notice that: state i is visited infinitely many times ⟺ J_i = ∞. Notice also that

  f_ii := P_i(J_i ≥ 1) = P_i(T_i < ∞),

which can either be equal to 1 or smaller than 1. We claim that:

Lemma 4. Starting from X_0 = i, the random variable J_i is geometric:

  P_i(J_i ≥ k) = f_ii^k,  k ≥ 0.

Proof. By the strong Markov property at time T_i^{(k−1)},

  P_i(J_i ≥ k) = P_i(J_i ≥ k − 1) P_i(J_i ≥ 1).

Indeed, the event {J_i ≥ k} implies that T_i^{(k−1)} < ∞ and that, after time T_i^{(k−1)}, the chain visits i at least once more. But the past before T_i^{(k−1)} is independent of the future, and that is why we get the product. Therefore,

  P_i(J_i ≥ k) = f_ii P_i(J_i ≥ k − 1) = f_ii^2 P_i(J_i ≥ k − 2) = ··· = f_ii^k.

Because either f_ii = 1 or f_ii < 1, we conclude what we asserted earlier, namely that every state is either recurrent or transient: if f_ii = 1 we have P_i(J_i ≥ k) = 1 for all k, i.e. P_i(J_i = ∞) = 1, which means that the state i is recurrent; if f_ii < 1, then P_i(J_i < ∞) = 1, which means that the state i is transient.

We summarise this together with another useful characterisation of recurrence and transience in the following two lemmas.

Lemma 5. Suppose that the Markov chain starts from state i. The following are equivalent:
1. State i is recurrent.
2. f_ii = P_i(T_i < ∞) = 1.
3. P_i(J_i = ∞) = 1.
4. E_i J_i = ∞.

Proof. The equivalence of the first two statements was proved before the lemma. The equivalence of the first and the third is, by the definition of J_i, just a restatement of the definition of recurrence. If P_i(J_i = ∞) = 1 then clearly E_i J_i = ∞. On the other hand, if E_i J_i = ∞ then, since J_i is a geometric random variable, we see that its expectation cannot be infinity without J_i itself being infinity with probability one.

Lemma 6. Suppose that the Markov chain starts from state i. The following are equivalent:
1. State i is transient.
2. f_ii = P_i(T_i < ∞) < 1.
3. P_i(J_i < ∞) = 1.
4. E_i J_i < ∞.

Proof. The equivalence of the first two statements was proved above. State i is transient, by definition, if it is visited finitely many times with probability 1, which, by the definition of J_i, is equivalent to P_i(J_i < ∞) = 1. If P_i(J_i < ∞) = 1 then, owing to the fact that J_i is a geometric random variable with f_ii < 1, we have E_i J_i < ∞. Conversely, if E_i J_i < ∞ then, clearly, J_i cannot take the value +∞ with positive probability, therefore P_i(J_i < ∞) = 1.

Corollary 6. If i is a transient state then p_ii^{(n)} → 0, as n → ∞.

Proof. If i is transient then E_i J_i < ∞. But

  E_i J_i = E_i Σ_{n=1}^∞ 1(X_n = i) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ p_ii^{(n)},

and if a series is finite then the summands go to 0: p_ii^{(n)} → 0.

10.1 First hitting time decomposition

Define

  f_ij^{(n)} := P_i(T_j = n),  n ≥ 1,

i.e. the probability, assuming that the chain starts from state i, to hit state j for the first time at time n. Notice the difference with p_ij^{(n)}: the latter is the probability to be in state j at time n (not necessarily for the first time), starting from i. Clearly, f_ij^{(n)} ≤ p_ij^{(n)}, but the two sequences are related in an exact way:

Lemma 7.

  p_ij^{(n)} = Σ_{m=1}^n f_ij^{(m)} p_jj^{(n−m)},  n ≥ 1.

Proof. From Elementary Probability,

  p_ij^{(n)} = P_i(X_n = j) = Σ_{m=1}^n P_i(T_j = m, X_n = j) = Σ_{m=1}^n P_i(T_j = m) P_i(X_n = j | T_j = m).

The first term equals f_ij^{(m)}, by definition. For the second term, we use the Strong Markov Property:

  P_i(X_n = j | T_j = m) = P_i(X_n = j | X_m = j) = p_jj^{(n−m)}.

Corollary 7. If j is a transient state then p_ij^{(n)} → 0, as n → ∞, for all states i.

Proof. If j is transient, we have Σ_{n=1}^∞ p_jj^{(n)} = E_j J_j < ∞. Now, by summing over n the relation of Lemma 7, we obtain (after interchanging the order of the two sums on the right-hand side)

  Σ_{n=1}^∞ p_ij^{(n)} = Σ_{m=1}^∞ f_ij^{(m)} Σ_{n=m}^∞ p_jj^{(n−m)} = (1 + E_j J_j) Σ_{m=1}^∞ f_ij^{(m)} = (1 + E_j J_j) P_i(T_j < ∞),

which is finite. But if a series is finite then the summands go to 0, and hence p_ij^{(n)} → 0 as n → ∞.

10.2 Communicating classes and recurrence/transience

Recall the definition of the relation "state i leads to state j", denoted by i ⇝ j: it means that, starting from i, the probability to go to j in some finite time is positive. In other words, if we let

  f_ij := P_i(T_j < ∞),

we have i ⇝ j ⟺ f_ij > 0.

We now show that recurrence is a class property. (Remember: another property that was a class property was the property that the states of a communicating class have a certain common period.)

Theorem 10. If i is recurrent and i ⇝ j then j is recurrent and

  f_ij = P_i(T_j < ∞) = 1.

In particular, j ⇝ i. Hence, in a communicating class, either all states are recurrent or all states are transient.

Proof. Start with X_0 = i. If i is recurrent and i ⇝ j then there is a positive probability (= f_ij) that j will appear in one of the i.i.d. excursions X_i^{(0)}, X_i^{(1)}, ..., and so the probability that j will appear in a specific excursion equals

  g_ij := P_i(T_j < T_i) > 0.

So the random variables

  δ_{j,r} := 1(j appears in excursion X_i^{(r)}),  r = 0, 1, ...,

are i.i.d., and since they take the value 1 with positive probability, infinitely many of them will be 1 (with probability 1), showing that j will appear in infinitely many of the excursions for sure. Hence, not only f_ij > 0, but also f_ij = 1, and j is recurrent. The last statement is simply an expression in symbols of what we said above.
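Before moving on, Lemma 4 above is easy to check numerically. Here is a sketch under an assumed toy chain (not from the notes): take state 1 absorbing and let the chain stay at state 0 with probability 0.7, so that f_00 = P_0(T_0 < ∞) = 0.7 and state 0 is transient; the empirical law of J_0 should match the geometric tail f_00^k.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical toy chain: from 0 stay at 0 w.p. 0.7, otherwise jump to
    # the absorbing state 1.  Hence f_00 = P_0(T_0 < infinity) = 0.7.
    f00 = 0.7

    def J0():
        """Number of returns to state 0 (excluding time 0)."""
        count = 0
        while rng.random() < f00:   # each return happens independently w.p. f00
            count += 1
        return count

    samples = np.array([J0() for _ in range(100_000)])
    for k in range(1, 5):
        print(k, (samples >= k).mean(), f00**k)   # empirical P(J_0 >= k) vs f00^k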

We now take a closer look at recurrence.

Theorem 11. If i is inessential then it is transient.

Proof. Suppose that i is inessential, i.e. there is a state j such that i ⇝ j but j ̸⇝ i. So there is m such that P_i(A) = P_i(X_m = j) > 0. Let B := {X_n = i for infinitely many n}. Since j ̸⇝ i, once the chain is at j it can never return to i, so P_i(A ∩ B) = P_i(X_m = j, X_n = i for infinitely many n) = 0. But then

  0 < P_i(A) = P_i(A ∩ B) + P_i(A ∩ B^c) = P_i(A ∩ B^c) ≤ P_i(B^c),

and hence P_i(B) = P_i(X_n = i for infinitely many n) < 1, and so i is a transient state.

So if a communicating class is recurrent then it contains no inessential states and so it is closed. Also, if a communicating class contains a state which is transient then, necessarily, all other states are also transient (Theorem 10).

Theorem 12. Every finite closed communicating class is recurrent.

Proof. Let C be the class. By closedness, if started in C, X remains in C. By finiteness, starting from any state of C, some state i ∈ C must be visited infinitely many times, so that, by the strong Markov property at the first visit to i,

  P_i(X_n = i for infinitely many n) > 0.

Since every state is either recurrent or transient, and for a transient state this probability is 0, the state i is recurrent. Hence, by Theorem 10, every other j in C is also recurrent. Hence all states are recurrent.

11 Positive recurrence

Recall that if we fix a state i and let T_i be the first return time to i, then either P_i(T_i < ∞) = 1 (in which case we say that i is recurrent) or P_i(T_i < ∞) < 1 (transient state). Recall also that a finite random variable may not necessarily have finite expectation.

Example: Let U_0, U_1, U_2, ... be i.i.d. random variables, uniformly distributed in the unit interval [0, 1]. Let

  T := inf{n ≥ 1 : U_n > U_0}.

Thus T represents the number of variables we need to draw to get a value larger than the initial one. (It should be clear that this can, for example, appear in a certain game.) What is the expectation of T? First, observe that T is finite. Indeed,

  P(T > n) = P(U_1 ≤ U_0, ..., U_n ≤ U_0) = ∫_0^1 P(U_1 ≤ x, ..., U_n ≤ x | U_0 = x) dx = ∫_0^1 x^n dx = 1/(n+1),

and therefore

  P(T < ∞) = lim_{n→∞} P(T ≤ n) = lim_{n→∞} (1 − 1/(n+1)) = 1.

But

  ET = Σ_{n=0}^∞ P(T > n) = Σ_{n=0}^∞ 1/(n+1) = ∞.
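A short Monte Carlo sketch of this example (Python): the tail P(T > n) = 1/(n+1) is heavy enough that the sample mean of T keeps growing as more samples are drawn, consistent with ET = ∞.

    import numpy as np

    rng = np.random.default_rng(3)

    def draw_T():
        """T = inf{n >= 1 : U_n > U_0} for i.i.d. uniform U_0, U_1, ..."""
        u0, n = rng.random(), 1
        while rng.random() <= u0:
            n += 1
        return n

    samples = np.array([draw_T() for _ in range(100_000)])
    for n in (1, 2, 5, 10):
        print(f"P(T > {n}) ~ {(samples > n).mean():.4f}  vs  1/(n+1) = {1/(n+1):.4f}")
    print("sample mean of T:", samples.mean())   # grows slowly with the sample size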

You are invited to think about the meaning of this: if the initial value U_0 is unknown, then it takes, on the average, an infinite number of steps to exceed it!

We now classify a recurrent state i as follows. We say that i is positive recurrent if

  E_i T_i < ∞.

We say that i is null recurrent if

  E_i T_i = ∞.

Our goal now is to show that a chain possessing a positive recurrent state also possesses a stationary distribution. Before doing that, we shall discover what the form of this stationary distribution is by using the Law of Large Numbers.

12 Law (=Theorem) of Large Numbers in Probability Theory

The following result is, arguably, the Fundamental Theorem of Probability Theory. It is traditional to refer to it as a "Law". But this is misleading: it is not a Law, it is a Theorem that can be derived from the axioms of Probability Theory. We state it here without proof.

Theorem 13. Let Z_1, Z_2, ... be i.i.d. random variables.
(i) Suppose E|Z_1| < ∞ and let µ = EZ_1. Then

  P(lim_{n→∞} (Z_1 + ··· + Z_n)/n = µ) = 1.

(ii) Suppose E max(Z_1, 0) = ∞, but E min(Z_1, 0) > −∞. Then

  P(lim_{n→∞} (Z_1 + ··· + Z_n)/n = ∞) = 1.

13 Law of Large Numbers for Markov chains

The law of large numbers stated above requires independence. When we have a Markov chain, the sequence X_0, X_1, X_2, ... of successive states is not an independent sequence. To apply the law of large numbers we must create some kind of independence. To our rescue comes the regenerative structure of a Markov chain identified in an earlier section.

13.1 Functions of excursions

Namely, given a recurrent state a ∈ S (which will be referred to as the ground state, for no good reason other than that it has been fixed once and for all), the sequence of excursions

  X_a^{(1)}, X_a^{(2)}, X_a^{(3)}, ...

forms an independent sequence of (generalised) random variables. Since functions of independent random variables are also independent, we have that

  G(X_a^{(1)}), G(X_a^{(2)}), G(X_a^{(3)}), ...

are i.i.d. random variables, for reasonable choices of the functions G taking values in R. In particular, the law of large numbers tells us that, with probability 1,

  (G(X_a^{(1)}) + ··· + G(X_a^{(n)}))/n → EG(X_a^{(1)}),    (8)

as n → ∞. Note that

  EG(X_a^{(1)}) = EG(X_a^{(2)}) = ··· = E[G(X_a^{(0)}) | X_0 = a] = E_a G(X_a^{(0)}),

because the excursions X_a^{(1)}, X_a^{(2)}, ... are i.i.d., and because if we start with X_0 = a, the initial excursion X_a^{(0)} also has the same law as the rest of them. The final equality in the last display is merely our notation for the expectation of a random variable conditional on X_0 = a.

Example 1: To each state x ∈ S, assign a reward f(x), which we can think of as a reward received when the chain is in state x. Then, for an excursion X, define G(X) to be the maximum reward received during this excursion:

  G(X_a^{(r)}) = max{f(X_t) : T_a^{(r)} ≤ t < T_a^{(r+1)}}.

Example 2: Let f(x) be as in Example 1, but define G(X) to be the total reward received over the excursion X:

  G(X_a^{(r)}) = Σ_{T_a^{(r)} ≤ t < T_a^{(r+1)}} f(X_t).    (7)

13.2 Average reward

Consider now a function f : S → R, which, as in Example 2, we can think of as a reward function. We are interested in the existence of the limit

  time-average reward = lim_{t→∞} (1/t) Σ_{k=0}^t f(X_k),

which represents the 'time-average reward' received. We shall assume that f is bounded and shall examine those conditions which yield a nontrivial limit.

Theorem 14. Suppose that a Markov chain possesses a positive recurrent state a ∈ S (i.e. E_a T_a < ∞) and assume that the distribution of X_0 is such that P(T_a < ∞) = 1. Let f : S → R+ be a bounded 'reward' function. Then the 'time-average reward' f̄ exists and is a deterministic quantity:

  P(lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = f̄) = 1,

where

  f̄ = E_a[Σ_{n=0}^{T_a−1} f(X_n)] / E_a T_a = Σ_{x∈S} f(x) π(x).

Proof. We shall present the proof in a way that the constant f̄ will reveal itself: it will be discovered. The idea is this. Suppose that t is large enough. We are looking for the total reward received between 0 and t. We break the sum into the total reward received over the initial excursion, plus the total reward received over the next excursion, and so on. Thus we need to keep track of the number, N_t, of complete excursions from the ground state a that occur between 0 and t:

  N_t := max{r ≥ 1 : T_a^{(r)} ≤ t}.

For example, in the figure below, N_t = 5.

[Figure: a trajectory with the successive visit times T_a^{(1)}, ..., T_a^{(5)} marked between 0 and t.]

We write the total reward as a sum of rewards over complete excursions plus a first and a last term:

  Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_first + G_last,

where G_r is given by (7), G_first is the reward up to time T_a = T_a^{(1)} or t (whichever is smaller), and G_last is the reward over the last incomplete cycle.

We assumed that f is bounded, say |f(x)| ≤ B. Hence |G_first| ≤ B T_a. Since P(T_a < ∞) = 1, we have B T_a / t → 0, and so

  (1/t) G_first → 0,  as t → ∞,

with probability one. A similar argument shows that

  (1/t) G_last → 0,  as t → ∞,

with probability one.

Since T_a^{(r)} → ∞ as r → ∞, it follows that the number N_t of complete cycles before time t will also be tending to ∞, as t → ∞. So we are left with the sum of the rewards over complete cycles. Write now

  (1/t) Σ_{r=1}^{N_t−1} G_r = (N_t / t) · (1/N_t) Σ_{r=1}^{N_t−1} G_r.    (9)

Look again at what the Law of Large Numbers (8) tells us: that

  lim_{n→∞} (1/n) Σ_{r=1}^n G_r = EG_1,

with probability one. Replacing n by N_t (which tends to ∞ as t → ∞), we have

  lim_{t→∞} (1/N_t) Σ_{r=1}^{N_t−1} G_r = EG_1.    (10)

Now look at the sequence

  T_a^{(r)} = T_a^{(1)} + Σ_{k=2}^r (T_a^{(k)} − T_a^{(k−1)})

and see what the Law of Large Numbers says again. Since T_a^{(1)}/r → 0, and since the differences T_a^{(k)} − T_a^{(k−1)}, k = 2, 3, ..., are i.i.d. random variables with common expectation E_a T_a, we have

  lim_{r→∞} T_a^{(r)} / r = E_a T_a.    (11)

By definition, the visit with index r = N_t to the ground state a is the last before t (see also the figure above):

  T_a^{(N_t)} ≤ t < T_a^{(N_t+1)}.

Hence

  T_a^{(N_t)} / N_t ≤ t / N_t < T_a^{(N_t+1)} / N_t.

But (11) tells us that

  lim_{t→∞} T_a^{(N_t)} / N_t = lim_{t→∞} T_a^{(N_t+1)} / N_t = E_a T_a.

Hence

  lim_{t→∞} N_t / t = 1 / E_a T_a.    (12)

Using (10) and (12) in (9) we have

  lim_{t→∞} (1/t) Σ_{r=1}^{N_t−1} G_r = EG_1 / E_a T_a.

Thus

  lim_{t→∞} (1/t) Σ_{n=0}^t f(X_n) = EG_1 / E_a T_a =: f̄.

To see that f̄ is as claimed, just observe that

  EG_1 = E Σ_{T_a^{(1)} ≤ t < T_a^{(2)}} f(X_t) = E_a Σ_{0 ≤ t < T_a} f(X_t),

from the Strong Markov Property.

Comment 1: It is important to understand what the formula says. The constant f̄ is a sort of average over an excursion: the numerator of f̄ is computed by summing up the rewards over the excursion (making sure that not both endpoints are included), while the denominator is the duration of the excursion. Note that we include only one endpoint of the excursion, not both: we may write

  EG_1 = E_a Σ_{0 ≤ t < T_a} f(X_t)  or  EG_1 = E_a Σ_{0 < t ≤ T_a} f(X_t).

Both of them are equal, since X_0 = X_{T_a} = a under P_a.

[Figure: the rewards f(X_0) = f(a), f(X_1), f(X_2), ..., f(X_{T_a−1}) collected over one excursion, between times 0 and T_a.]

Comment 2: A very important particular case occurs when we choose

  f(x) := 1(x = b),

i.e. we let f be 1 at the state b and 0 everywhere else. The theorem then says that, with probability 1,

  lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = b) = E_a Σ_{k=0}^{T_a−1} 1(X_k = b) / E_a T_a.    (13)

Comment 3: If in the above we let b = a, the ground state, then we see that the numerator equals 1. Hence

  lim_{t→∞} (1/t) Σ_{n=1}^t 1(X_n = a) = 1 / E_a T_a,    (14)

with probability 1. The theorem above then says that the long-run proportion of visits to the state a equals 1/E_a T_a. The physical interpretation of the latter should be clear: if, on the average, it takes E_a T_a units of time between successive visits to the state a, then the long-run proportion of visits to a should equal 1/E_a T_a.

Comment 4: Suppose that two states, a, b, communicate and are positive recurrent. We can use b as a ground state instead of a. Applying formula (14) with b in place of a, we find that the limit in (13) also equals 1/E_b T_b. The equality of these two limits gives:

  [average no. of times state b is visited between two successive visits to state a] = E_a Σ_{k=0}^{T_a−1} 1(X_k = b) = E_a T_a / E_b T_b.
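A simulation sketch of Comments 2 to 4 (Python, on a hypothetical three-state chain): the long-run fraction of time spent at a matches 1/E_a T_a, and the mean number of visits to b per a-excursion can be read off the same path.

    import numpy as np

    rng = np.random.default_rng(4)
    P = np.array([[0.1, 0.6, 0.3],     # hypothetical irreducible chain
                  [0.4, 0.2, 0.4],
                  [0.3, 0.3, 0.4]])

    x, path = 0, [0]
    for _ in range(200_000):
        x = rng.choice(3, p=P[x])
        path.append(x)
    path = np.array(path)

    a, b = 0, 1
    visits_a = np.flatnonzero(path == a)
    Ea_Ta = np.diff(visits_a).mean()            # mean cycle length T_a

    print("long-run fraction at a:", (path == a).mean(), " 1/Ea Ta:", 1 / Ea_Ta)

    # Mean number of visits to b per a-excursion (Comment 4):
    per_cycle = [(path[s:t] == b).sum() for s, t in zip(visits_a, visits_a[1:])]
    print("mean visits to b per a-cycle:", np.mean(per_cycle))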

14 Construction of a stationary distribution

Our discussions on the law of large numbers for Markov chains revealed that the quantity

  ν^[a](x) := E_a Σ_{n=0}^{T_a−1} 1(X_n = x),  x ∈ S,    (15)

should play an important role. Indeed, we saw (Comment 2 above) that if a is a positive recurrent state, then ν^[a](x) / E_a T_a is the long-run fraction of times that state x is visited. In view of this, this quantity should have something to do with the concept of stationary distribution discussed a few sections ago. Note: although ν^[a] depends on the choice of the 'ground' state a (the choice of notation ν^[a] is justifiable through the notation (6) for the communicating class of the state a), we will soon realise that this dependence vanishes when the chain is irreducible. Hence the superscript [a] will soon be dropped.

Fix a ground state a ∈ S and assume that it is recurrent. (We stress that we do not need the stronger assumption of positive recurrence here.)

Proposition 2. Consider the function ν^[a] defined in (15), where a is the fixed recurrent ground state whose existence was assumed. Then ν^[a] satisfies the balance equations:

  ν^[a] = ν^[a] P,

i.e.

  ν^[a](x) = Σ_{y∈S} ν^[a](y) p_{yx},  x ∈ S.

Moreover, for any state x such that a ⇝ x (a leads to x), we have 0 < ν^[a](x) < ∞.

Proof. We start the Markov chain with X_0 = a. Recurrence tells us that T_a < ∞ with probability one. Since a = X_{T_a}, we can write

  ν^[a](x) = E_a Σ_{n=1}^{T_a} 1(X_n = x).    (16)

This was a trivial but important step: we replaced the n = 0 term with the n = T_a term. Both of them equal 1(x = a), since X_0 = X_{T_a} = a, so there is nothing lost in doing so. The real reason for doing so is because we wanted to replace a function of X_0, X_1, X_2, ... by a function of X_1, X_2, X_3, ...

and thus be in a position to apply first-step analysis, which is precisely what we do next:

  ν^[a](x) = E_a Σ_{n=1}^∞ 1(X_n = x, n ≤ T_a)
  = Σ_{n=1}^∞ P_a(X_n = x, n ≤ T_a)
  = Σ_{n=1}^∞ Σ_{y∈S} P_a(X_n = x, X_{n−1} = y, n ≤ T_a)
  = Σ_{y∈S} Σ_{n=1}^∞ p_{yx} P_a(X_{n−1} = y, n ≤ T_a)
  = Σ_{y∈S} p_{yx} E_a Σ_{n=1}^{T_a} 1(X_{n−1} = y)
  = Σ_{y∈S} p_{yx} ν^[a](y),

where, in the fourth step, we used the Markov property (the event {n ≤ T_a} is determined by X_0, ..., X_{n−1}), and, in the last step, the definition (15) of ν^[a]. So we have shown

  ν^[a](x) = Σ_{y∈S} p_{yx} ν^[a](y),

which are precisely the balance equations.

Next, let x be such that a ⇝ x. From Theorem 10, x ⇝ a as well, so there exists n ≥ 1 such that p_{ax}^{(n)} > 0, and there exists m ≥ 1 such that p_{xa}^{(m)} > 0. Since ν^[a] = ν^[a] P, we have, by iterating,

  ν^[a] = ν^[a] P^n,  for all n ∈ N.

Noting that, by definition, ν^[a](a) = 1 (only the n = 0 term contributes in (15)), we obtain

  ν^[a](x) ≥ ν^[a](a) p_{ax}^{(n)} = p_{ax}^{(n)} > 0,

and

  1 = ν^[a](a) ≥ ν^[a](x) p_{xa}^{(m)},

so that ν^[a](x) ≤ 1/p_{xa}^{(m)} < ∞.

Note: The last result was shown under the assumption that a is a recurrent state. We are now going to assume more, namely, that a is positive recurrent. First note that

  Σ_{x∈S} ν^[a](x) = E_a Σ_{n=1}^{T_a} Σ_{x∈S} 1(X_n = x) = E_a Σ_{n=1}^{T_a} 1 = E_a T_a,

which is finite when a is positive recurrent.
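Before stating the corollary, here is a simulation sketch of this construction (Python, with a hypothetical chain): estimate ν^[a](x) over many excursions and check that the normalised quantity ν^[a]/E_a T_a agrees with the solution of the balance equations computed directly.

    import numpy as np

    rng = np.random.default_rng(5)
    P = np.array([[0.1, 0.6, 0.3],     # hypothetical irreducible chain
                  [0.4, 0.2, 0.4],
                  [0.3, 0.3, 0.4]])

    # Estimate nu^[a](x) = E_a sum_{n=0}^{Ta-1} 1(X_n = x) over many excursions.
    a, n_exc = 0, 20_000
    counts, total_len = np.zeros(3), 0.0
    for _ in range(n_exc):
        x = a
        while True:
            counts[x] += 1
            total_len += 1
            x = rng.choice(3, p=P[x])
            if x == a:
                break
    nu = counts / n_exc                  # estimate of nu^[a]
    pi_hat = nu / (total_len / n_exc)    # nu^[a] / E_a T_a

    # Exact stationary distribution: solve pi (P - I) = 0 with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    pi = np.linalg.lstsq(A, np.array([0, 0, 0, 1.0]), rcond=None)[0]
    print("simulated:", pi_hat, "\nexact:    ", pi)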

We now have:

Corollary 9. Suppose that the Markov chain possesses a positive recurrent state a. Define

  π^[a](x) := ν^[a](x) / E_a T_a = E_a Σ_{n=0}^{T_a−1} 1(X_n = x) / E_a T_a,  x ∈ S.    (17)

Then π^[a] is a stationary distribution.

Proof. π^[a] is simply a multiple of ν^[a], and we have already shown, in Proposition 2, that ν^[a] P = ν^[a]. Therefore π^[a] P = π^[a]. Since a is positive recurrent, we have E_a T_a < ∞, and so π^[a] is not trivially equal to zero. It only remains to show that π^[a] is a probability, i.e. that it sums up to one. But

  Σ_{x∈S} π^[a](x) = (1/E_a T_a) Σ_{x∈S} ν^[a](x) = 1.

Pause for thought: We have achieved something VERY IMPORTANT: if we have one positive recurrent state, no matter which one, then we immediately have a stationary (probability) distribution! We DID NOT claim that this distribution is, in general, unique. This is what we will try to do in the sequel.

In other words: if there is a positive recurrent state, then the set of linear equations

  π(x) = Σ_{y∈S} π(y) p_{yx},  x ∈ S,  Σ_{y∈S} π(y) = 1,

have at least one solution.

15 Positive recurrence is a class property

We now show that (just like recurrence) positive recurrence is also a class property.

Theorem 15. Let i be a positive recurrent state and suppose that i ⇝ j. Then j is positive recurrent.

Proof. Start with X_0 = i. We already know that j is recurrent. Consider the excursions X_i^{(r)}, r = 0, 1, 2, .... Since i ⇝ j, the probability

  g_ij = P_i(T_j < T_i)

that j occurs in a specific excursion is positive. Since the excursions are independent, the event that j occurs in a specific excursion is independent from the event that j occurs in the next excursion. In words, to decide whether j will occur in a specific excursion, we toss a coin with probability of heads g_ij > 0; the coins are i.i.d. If heads show up at the r-th excursion, then j occurs in the r-th excursion; if tails show up, then the r-th excursion does not contain state j. Hence j will occur in infinitely many excursions. Let R_1 (respectively, R_2) be the index of the first (respectively, second) excursion that contains j.

Then either all states in C are positive recurrent.R2 R1 j i R1+1 R1+2 This random variable has the same law as T under P i i ζ We just need to show that the sum of the durations of excursions with indices R1 . Since gij > 0 and Ei Ti < ∞ we have that Ei ζ < ∞.i. An irreducible Markov chain is called positive recurrent if one (and hence all) states are positive recurrent. or all states are null recurrent or all states are transient. 2. with the same law as Ti . we just need to show that R2 −R1 +1 ζ := r=1 Zr has finite expectation. R2 has finite expectation. An irreducible Markov chain is called null recurrent if one (and hence all) states are null recurrent. . IRREDUCIBLE CHAIN r = 1. . the expectation of this random variable equals −1 Ei ζ = (Ei Ti ) × Ei (R2 − R1 + 1) = (Ei Ti ) × (gij + 1) because R2 − R1 is a geometric random variable: Pi (R2 − R1 = r) = (1 − gij )r−1 gij . by Wald’s lemma. . . RECURRENT TRANSIENT POSITIVE RECURRENT NULL RECURRENT 42 . . In other words. An irreducible Markov chain is called recurrent if one (and hence all) states are recurrent. . But. Corollary 10.d. Let C be a communicating class. An irreducible Markov chain is called transient if one (and hence all) states are transient. . where the Zr are i.

16 Uniqueness of the stationary distribution

At the risk of being too pedantic, we stress that a stationary distribution may not be unique.

Example: Consider the chain we've seen before:

[Figure: a chain on the states {0, 1, 2} with p_00 = 1, p_22 = 1, p_10 = 1/3, p_11 = 1/2, p_12 = 1/6.]

Clearly, state 0 (and state 2) is positive recurrent, for totally trivial reasons (it is, after all, an absorbing state). But observe that there are many, in fact infinitely many, stationary distributions: for each 0 ≤ α ≤ 1, the π defined by

  π(0) = α,  π(1) = 0,  π(2) = 1 − α,

is a stationary distribution. (You should think why this is OBVIOUS!) Notice that states 0 and 2 do not communicate. This lack of communication is, indeed, what prohibits uniqueness.

Theorem 16. Consider a positive recurrent irreducible Markov chain. Then there is a unique stationary distribution.

Proof. Let π be a stationary distribution, i.e. consider the Markov chain (X_n, n ≥ 0) and suppose that P(X_0 = x) = π(x), x ∈ S; in other words, we start the Markov chain from the stationary distribution π. Then, for all n, we have P(X_n = x) = π(x), for all x ∈ S, and hence

  E[(1/t) Σ_{n=1}^t 1(X_n = x)] = (1/t) Σ_{n=1}^t P(X_n = x) = π(x).    (18)

Fix a ground state a. Since the chain is irreducible and positive recurrent, we have P_x(T_a < ∞) = 1 for all states x, so that

  P(T_a < ∞) = Σ_{x∈S} π(x) P_x(T_a < ∞) = Σ_{x∈S} π(x) = 1.

In other words, the assumption of Theorem 14 holds, and so, by (13), with probability one,

  (1/t) Σ_{n=1}^t 1(X_n = x) → E_a Σ_{k=0}^{T_a−1} 1(X_k = x) / E_a T_a = π^[a](x),

as t → ∞. By a theorem of Analysis (Bounded Convergence Theorem), recognising that the random variables on the left are nonnegative and bounded by 1, we can take expectations of both sides:

  E[(1/t) Σ_{n=1}^t 1(X_n = x)] → π^[a](x),  as t → ∞.

Therefore, comparing with (18), π(x) = π^[a](x), for all x ∈ S. Hence there is only one stationary distribution: an arbitrary stationary distribution π must be equal to the specific one π^[a].

Corollary 11. Consider a positive recurrent irreducible Markov chain and two states a, b ∈ S. Then π^[a] = π^[b]. In particular, we obtain the cycle formula:

  E_a Σ_{k=0}^{T_a−1} 1(X_k = x) / E_a T_a = E_b Σ_{k=0}^{T_b−1} 1(X_k = x) / E_b T_b,  x ∈ S.

Proof. Both π^[a] and π^[b] are stationary distributions. By the above theorem, there is a unique stationary distribution, so they are equal. The cycle formula merely re-expresses this equality.

Corollary 12. Consider a positive recurrent irreducible Markov chain with stationary distribution π. Then, for all a ∈ S,

  π(a) = 1 / E_a T_a.

Proof. From the definition (17) of π^[a] we have π^[a](a) = ν^[a](a)/E_a T_a = 1/E_a T_a, and, by the above corollary, an arbitrary stationary distribution π must be equal to π^[a]. Hence π(a) = 1/E_a T_a.

The following is a sort of converse to this, which is useful because it is often used as a criterion for positive recurrence:

Corollary 13. Consider a Markov chain and assume that a stationary distribution π exists. Then any state i such that π(i) > 0 is positive recurrent.

Proof. Let i be such that π(i) > 0, and let C be the communicating class of i. Consider the restriction of the chain to C, i.e. delete all transition probabilities p_{xy} with x ∈ C, y ∉ C. Let π(·|C) be the distribution defined by restricting π to C:

  π(x|C) := π(x) / π(C),  x ∈ C,

where π(C) = Σ_{y∈C} π(y). We can easily see that we thus obtain a Markov chain with state space C whose stationary distribution is π(·|C). This is, in fact, the only stationary distribution for the restricted chain, because C is irreducible. By the above corollary,

  π(i|C) = 1 / E_i T_i.

But π(i|C) > 0. Therefore E_i T_i < ∞, i.e. i is positive recurrent.

17 Structure of a stationary distribution

It is not hard to show now what an arbitrary stationary distribution looks like. Consider an arbitrary Markov chain and decompose its state space into communicating classes. Then for each positive recurrent communicating class C there corresponds a unique stationary distribution π^C,

which assigns positive probability to each state in C. This π^C is defined by picking a state (any state!) a ∈ C and letting π^C := π^[a].

Proposition 3. Any stationary distribution π is necessarily of the form

  π = Σ_{C: C positive recurrent communicating class} α_C π^C,

where α_C ≥ 0, such that Σ_C α_C = 1.

Proof. Let π be an arbitrary stationary distribution. If π(x) > 0 then x belongs to some positive recurrent class C (Corollary 13). For each such C define the conditional distribution

  π(x|C) := π(x) / π(C),  where π(C) := Σ_{x∈C} π(x).

Then (π(x|C), x ∈ C) is a stationary distribution. By our uniqueness result, π(x|C) ≡ π^C(x). So we have that the above decomposition holds with α_C = π(C).

18 Coupling and stability

18.1 Definition of stability

Stability refers to the convergence of the probabilities P(X_n = x) as n → ∞. So far we have seen, under irreducibility and positive recurrence conditions, that the so-called Cesaro averages (1/n) Σ_{k=1}^n P(X_k = x) converge to π(x), as n → ∞. (A sequence of real numbers a_n may not converge but its Cesaro averages (1/n)(a_1 + ··· + a_n) may converge; for example, take a_n = (−1)^n.)

Why stability? There are physical reasons why we are interested in the stability of a Markov chain, for a Markov chain is a simple model of a stochastic dynamical system. Stability of a deterministic dynamical system means that the system eventually converges to a fixed point (or, more generally, that it remains in a small region). For example, consider the motion of a pendulum under the action of gravity. We all know, from physical experience, that the pendulum will swing for a while and, eventually, it will settle to the vertical position. This is the stable position. There is dissipation of energy due to friction, and that causes the settling down. In an ideal pendulum, without friction, the pendulum will perform an

oscillatory motion ad infinitum. Still, we may say it is stable because it does not (it cannot!) swing too far away.

The simplest differential equation. As another example, consider the following differential equation:

  ẋ = −ax + b,  x ∈ R,  a > 0.

There are myriads of applications giving physical interpretation to this, so I'll leave it upon the reader to decide whether he or she wants to consider this as a financial example, an example in physics, or whatever. The thing to observe here is that there is an equilibrium point, the one found by setting the right-hand side equal to zero:

  x* = b/a.

If we start with x(t = 0) = x*, then x(t) = x* for all t > 0. If we start from somewhere else, then, regardless of the starting position, we have x(t) → x* as t → ∞. We say that x* is a stable equilibrium.

[Figure: solution curves x(t) of the differential equation, all converging to the equilibrium x* as t grows.]

A linear stochastic system. We now pass on to stochastic systems. Since we are prohibited by the syllabus to speak about continuous time stochastic systems, let us consider a stochastic analogue of the above in discrete time:

  X_{n+1} = ρ X_n + ξ_n,

where we assume that X_0, ξ_0, ξ_1, ξ_2, ... are given, mutually independent, random variables, and that the ξ_n all have the same law. But what does stability mean here? In general, it cannot mean convergence to a fixed equilibrium point. However, notice what happens when we solve the recursion, for we can do it:

  X_1 = ρX_0 + ξ_0
  X_2 = ρX_1 + ξ_1 = ρ(ρX_0 + ξ_0) + ξ_1 = ρ²X_0 + ρξ_0 + ξ_1
  X_3 = ρX_2 + ξ_2 = ρ³X_0 + ρ²ξ_0 + ρξ_1 + ξ_2
  ······
  X_n = ρⁿX_0 + ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + ··· + ρξ_{n−2} + ξ_{n−1}.

It will turn out that |ρ| < 1 is the condition for stability.
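Before analysing the solution, here is a quick simulation sketch (Python, with hypothetical parameter values): for |ρ| < 1 the values X_n keep fluctuating from step to step, yet their distribution across many independent paths settles down; the analysis that follows explains why.

    import numpy as np

    rng = np.random.default_rng(6)
    rho, n_paths, n_steps = 0.8, 50_000, 60    # hypothetical parameters

    X = np.zeros(n_paths)                      # start all paths at X_0 = 0
    for n in range(n_steps):
        X = rho * X + rng.uniform(-1, 1, n_paths)   # xi_n i.i.d. uniform(-1, 1)

    # Compare with the series X_hat = sum_k rho^k xi_k, truncated at 60 terms:
    # the two have (approximately) the same distribution.
    Xhat = sum(rho**k * rng.uniform(-1, 1, n_paths) for k in range(n_steps))
    print("mean/std of X_n:   ", X.mean(), X.std())
    print("mean/std of X_hat: ", Xhat.mean(), Xhat.std())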

Since |ρ| < 1, we have that the first term, ρⁿX_0, goes to 0 as n → ∞. The second term,

  X̃_n := ρ^{n−1}ξ_0 + ρ^{n−2}ξ_1 + ··· + ρξ_{n−2} + ξ_{n−1},

which is a linear combination of ξ_0, ..., ξ_{n−1}, does not converge. But notice that if we let

  X̂_n := ρ^{n−1}ξ_{n−1} + ρ^{n−2}ξ_{n−2} + ··· + ρξ_1 + ξ_0,

then, since the ξ's are independent with a common law, we have

  P(X̃_n ≤ x) = P(X̂_n ≤ x).

The nice thing is that, under nice conditions (e.g. assume that the ξ_n are bounded), X̂_n does converge:

  X̂_n → X̂_∞ = Σ_{k=0}^∞ ρ^k ξ_k.

Hence, while the limit of X_n does not exist, the limit of its distribution does:

  lim_{n→∞} P(X_n ≤ x) = lim_{n→∞} P(X̃_n ≤ x) = lim_{n→∞} P(X̂_n ≤ x) = P(X̂_∞ ≤ x).

This is precisely the reason why we define stability in terms of convergence of the distributions of the random variables, and not in some other way.

18.2 The fundamental stability theorem for Markov chains

Theorem 17. If a Markov chain is irreducible, positive recurrent and aperiodic, with (unique) stationary distribution π, then, for any initial distribution,

  lim_{n→∞} P(X_n = x) = π(x),

uniformly in x. In particular, for time-homogeneous Markov chains, the n-step transition matrix Pⁿ converges, as n → ∞, to a matrix with rows all equal to π:

  lim_{n→∞} p_{ij}^{(n)} = π(j),  i, j ∈ S.

18.3 Coupling

To understand why the fundamental stability theorem works, we need to understand the notion of coupling. Starting at an elementary level, suppose you are given the laws of two random variables X, Y, but not their joint law; i.e. we know f(x) = P(X = x) and g(y) = P(Y = y), but we are not given what P(X = x, Y = y) is. A coupling of X and Y refers to a joint construction of the two on the same probability space. How should we then define what P(X = x, Y = y) is? A straightforward (often not so useful) construction is to assume that the random variables are independent and define

  P(X = x, Y = y) = P(X = x) P(Y = y).

But suppose that we have some further requirement, for example, to try to make the coincidence probability P(X = Y) as large as possible. Coupling refers to the various methods for constructing this joint law.

Exercise: Let X, Y be random variables with values in Z and laws f(x) = P(X = x), g(y) = P(Y = y), respectively. Find the joint law h(x, y) = P(X = x, Y = y) which respects the marginals, i.e.

  f(x) = Σ_y h(x, y),  g(y) = Σ_x h(x, y),

and which maximises P(X = Y).

The coupling we are interested in is a coupling of processes. We are given the law of a stochastic process X = (X_n, n ≥ 0) and the law of a stochastic process Y = (Y_n, n ≥ 0), but we are not told what the joint probabilities are. A coupling of them refers to a joint construction on the same probability space. We say that a random time T is a meeting time of the two processes if

  X_n = Y_n  for all n ≥ T.

This T is a random variable that takes values in the set of times or, possibly, the value +∞, on the event that the two processes never meet. Note that it makes sense to speak of a meeting time precisely because of the assumption that the two processes have been constructed together.

The following result is the so-called coupling inequality:

Proposition 4. Let T be a meeting time of two coupled stochastic processes X, Y. Then, for all n ≥ 0 and all x ∈ S,

  |P(X_n = x) − P(Y_n = x)| ≤ P(T > n).

If, in particular, P(T < ∞) = 1, i.e. T is finite with probability one, then

  |P(X_n = x) − P(Y_n = x)| → 0,

uniformly in x ∈ S, as n → ∞.

Proof. We have

  P(X_n = x) = P(X_n = x, n < T) + P(X_n = x, n ≥ T)
  = P(X_n = x, n < T) + P(Y_n = x, n ≥ T)
  ≤ P(n < T) + P(Y_n = x),

and hence

  P(X_n = x) − P(Y_n = x) ≤ P(T > n).    (19)

Repeating (19), but with the roles of X and Y interchanged, we obtain

  P(Y_n = x) − P(X_n = x) ≤ P(T > n).

Combining the two inequalities we obtain the first assertion:

  |P(X_n = x) − P(Y_n = x)| ≤ P(T > n),  for all n ≥ 0, x ∈ S.

Since this holds for all x ∈ S, and since the right-hand side does not depend on x, we can write it as

  sup_{x∈S} |P(X_n = x) − P(Y_n = x)| ≤ P(T > n).

Assume now that P(T < ∞) = 1. Then

  lim_{n→∞} P(T > n) = P(T = ∞) = 0,

and therefore

  sup_{x∈S} |P(X_n = x) − P(Y_n = x)| → 0,  as n → ∞.

18.4 Proof of the fundamental stability theorem

First, review the assumptions of the theorem: the chain is irreducible, positive recurrent and aperiodic. Let p_{ij} denote, as usual, the transition probabilities. We know that there is a unique stationary distribution, denoted by π, and that π(x) > 0 for all x ∈ S (by positive recurrence).

We shall consider the laws of two processes. The first process, denoted by X = (X_n, n ≥ 0), has the law of a Markov chain with transition probabilities p_{ij} and X_0 distributed according to some arbitrary distribution µ. The second process, denoted by Y = (Y_n, n ≥ 0), has the law of a Markov chain with the same transition probabilities p_{ij} and Y_0 distributed according to the stationary distribution π. So both processes are positive recurrent aperiodic Markov chains with the same transition probabilities; they differ only in how the initial states are chosen.

It is very important to note that we have only defined the law of X and the law of Y, but we have not defined a joint law; in other words, we have not defined joint probabilities such as P(X_n = x, Y_m = y). We now couple them by doing the straightforward thing: we assume that X = (X_n, n ≥ 0) is independent of Y = (Y_n, n ≥ 0). Having coupled them, we can define joint probabilities, and it also makes sense to consider their first meeting time:

  T := inf{n ≥ 0 : X_n = Y_n}.

Consider the process

  W_n := (X_n, Y_n).

Then (W_n, n ≥ 0) is a Markov chain with state space S × S. Its initial state W_0 = (X_0, Y_0) has distribution

  P(W_0 = (x, y)) = µ(x) π(y),

and its (1-step) transition probabilities are

  q_{(x,y),(x′,y′)} := P(W_{n+1} = (x′, y′) | W_n = (x, y)) = P(X_{n+1} = x′ | X_n = x) P(Y_{n+1} = y′ | Y_n = y) = p_{x,x′} p_{y,y′}.

Its n-step transition probabilities are

  q_{(x,y),(x′,y′)}^{(n)} = p_{x,x′}^{(n)} p_{y,y′}^{(n)}.

From Theorem 3 and the aperiodicity assumption, we have that p_{x,x′}^{(n)} > 0 and p_{y,y′}^{(n)} > 0 for all large n, implying that q_{(x,y),(x′,y′)}^{(n)} > 0 for all large n, and so (W_n, n ≥ 0) is an irreducible chain.

Notice that

  σ(x, y) := π(x) π(y),  (x, y) ∈ S × S,

is a stationary distribution for (W_n, n ≥ 0). (Indeed, had we started with both X_0 and Y_0 independent and distributed according to π, then, for all n, X_n and Y_n would be independent and distributed according to π.) Therefore (W_n, n ≥ 0) is positive recurrent.

In particular,

  P(T < ∞) = 1,

since T is bounded above by the first hitting time of any fixed diagonal state (x, x), which is finite by the recurrence of the chain W.

Define now a third process by:

  Z_n := X_n 1(n < T) + Y_n 1(n ≥ T).

In other words, Z_n equals X_n before the meeting time of X and Y, and Z_n = Y_n thereafter.

[Figure: X starts from the arbitrary distribution and Y from the stationary distribution; the process Z follows X up to the meeting time T and follows Y afterwards.]

Clearly, (Z_n, n ≥ 0) is a Markov chain with transition probabilities p_{ij}, and Z_0 = X_0, which has an arbitrary distribution µ. Hence (Z_n, n ≥ 0) is identical in law to (X_n, n ≥ 0). Thus,

  P(X_n = x) = P(Z_n = x),  for all n and x.

Obviously, T is also a meeting time between Z and Y. By the coupling inequality,

  |P(Z_n = x) − P(Y_n = x)| ≤ P(T > n).

Since P(Z_n = x) = P(X_n = x), and since P(Y_n = x) = π(x), for all n and x, we have

  |P(X_n = x) − π(x)| ≤ P(T > n),

which converges to zero, as n → ∞, uniformly in x, precisely because P(T < ∞) = 1.

Example showing that the aperiodicity assumption is essential for the method used in the proof: If p_{ij} are the transition probabilities of an irreducible chain X, and if Y is an independent copy of X, then the product process W = (X, Y) may fail to be irreducible. For example, suppose that S = {0, 1} and the p_{ij} are p_01 = p_10 = 1. Then the product chain has state space S × S = {(0,0), (0,1), (1,0), (1,1)} and the transition probabilities are

  q_{(0,0),(1,1)} = q_{(1,1),(0,0)} = q_{(0,1),(1,0)} = q_{(1,0),(0,1)} = 1.

This is not irreducible. However, if we modify the original chain and make it aperiodic by, say, setting p_00 = 0.01, p_01 = 0.99, p_10 = 1, then the product chain becomes irreducible.
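The coupling construction in the proof above can be simulated directly. Below is a sketch (Python, on a hypothetical aperiodic two-state chain): run X from a fixed state and Y from π independently, record the meeting time T, and check the coupling bound |P(X_n = x) − π(x)| ≤ P(T > n).

    import numpy as np

    rng = np.random.default_rng(7)
    P = np.array([[0.1, 0.9],          # hypothetical aperiodic two-state chain
                  [0.6, 0.4]])
    pi = np.array([0.4, 0.6])          # its stationary distribution: pi P = pi

    def meeting_time():
        x, y = 0, rng.choice(2, p=pi)          # X_0 fixed, Y_0 ~ pi, independent
        t = 0
        while x != y:
            x = rng.choice(2, p=P[x])
            y = rng.choice(2, p=P[y])
            t += 1
        return t

    T = np.array([meeting_time() for _ in range(50_000)])
    n = 3
    Pn = np.linalg.matrix_power(P, n)
    # Coupling inequality: |P(X_n = x) - pi(x)| <= P(T > n).
    print("|P(X_n=0) - pi(0)| =", abs(Pn[0, 0] - pi[0]), "  P(T > n) ~", (T > n).mean())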

Example showing that the aperiodicity assumption is essential for the limiting result: Consider the same chain as above: p_01 = p_10 = 1. Suppose X_0 = 0. Then X_n = 0 for even n and X_n = 1 for odd n. Therefore p_00^{(n)} = 1 for even n and = 0 for odd n, so lim_{n→∞} p_00^{(n)} does not exist. However, if we modify the chain by p_00 = 0.01, p_01 = 0.99, p_10 = 1, then

  lim_{n→∞} p_00^{(n)} = lim_{n→∞} p_10^{(n)} = π(0),  lim_{n→∞} p_11^{(n)} = lim_{n→∞} p_01^{(n)} = π(1),

where π(0), π(1) satisfy

  0.99 π(0) = π(1),  π(0) + π(1) = 1,

i.e. π(0) ≈ 0.503, π(1) ≈ 0.497. In matrix notation,

  lim_{n→∞} [ p_00^{(n)}  p_01^{(n)}          [ 0.503  0.497
              p_10^{(n)}  p_11^{(n)} ]    =     0.503  0.497 ].

18.5 Terminology

A Markov chain which is irreducible and positive recurrent will be called ergodic. A Markov chain which is irreducible, positive recurrent and aperiodic will be called mixing. The reader is asked to pay attention to this terminology, which is different from the one used in the Markov chain literature: most people use the term "ergodic" for an irreducible, positive recurrent and aperiodic Markov chain. But this is "wrong". (We put "wrong" in quotes: terminology cannot be, per se, wrong; it is, however, customary to have a consistent terminology.) The reason is that the term "ergodic" means something specific in Mathematics (c.f. Parry, 1981), and this is expressed by our terminology.

18.6 Periodic chains

Suppose now that we have a positive recurrent irreducible chain. By the results of Sections 14 and 16, we know we have a unique stationary distribution π and that π(x) > 0 for all states x. Suppose that the period of the chain is d. By the results of §5.4, we can decompose the state space into d classes C_0, C_1, ..., C_{d−1}, such that the chain moves cyclically between them: from C_0 to C_1, from C_1 to C_2, and so on, and from C_{d−1} back to C_0. It is not difficult to see that the following holds.

Lemma 8. The chain (X_{nd}, n ≥ 0), with transition probabilities p_{ij}^{(d)}, consists of d closed communicating classes, namely C_0, ..., C_{d−1}, and all its states have period 1. Hence if we restrict the chain (X_{nd}, n ≥ 0) to any one of the sets C_r, we obtain a positive recurrent irreducible aperiodic chain.

We shall show that:

Lemma 9. The (unique) stationary distribution of (X_{nd}, n ≥ 0), restricted to C_r, assigns to each state i ∈ C_r the probability dπ(i).

Proof. Let α_r := Σ_{i∈C_r} π(i). We have that π satisfies the balance equations:

  π(i) = Σ_{j∈S} π(j) p_{ji},  i ∈ S.

Suppose that i ∈ C_r. Then p_{ji} = 0 unless j ∈ C_{r−1}, so

  π(i) = Σ_{j∈C_{r−1}} π(j) p_{ji},  i ∈ C_r.

Summing up over i ∈ C_r we obtain

  α_r = Σ_{i∈C_r} π(i) = Σ_{j∈C_{r−1}} π(j) Σ_{i∈C_r} p_{ji} = Σ_{j∈C_{r−1}} π(j) = α_{r−1}.

Since the α_r add up to 1, each one must be equal to 1/d. Hence the stationary distribution of the restricted chain assigns to i ∈ C_r the probability π(i)/α_r = dπ(i), as claimed, since the restricted chain is positive recurrent, irreducible and aperiodic, and the previous results apply.

This means that:

Theorem 18. Suppose that p_{ij} are the transition probabilities of a positive recurrent irreducible chain with period d, cyclic classes C_0, ..., C_{d−1} and stationary distribution π. Let i, j belong to the same class C_r. Then, as n → ∞,

  p_{ij}^{(nd)} = P_i(X_{nd} = j) → dπ(j).

Can we also find the limit when i and j do not belong to the same class? Suppose that i ∈ C_k, j ∈ C_ℓ, and let r = ℓ − k (modulo d), where r is taken to be a number between 0 and d − 1 (after reduction by d). Then, starting from i, the original chain enters C_ℓ in r steps and, thereafter, it remains in C_ℓ if sampled every d steps. We have, as n → ∞,

  p_{ij}^{(r+nd)} → dπ(j),  i ∈ C_k, j ∈ C_ℓ.
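A small numerical sketch of Theorem 18 (Python, with a hypothetical chain of period d = 2): sampling the chain every d steps makes the n-step probabilities converge, to dπ(j).

    import numpy as np

    # Hypothetical chain of period d = 2: C0 = {0}, C1 = {1, 2}.
    P = np.array([[0.0, 0.3, 0.7],
                  [1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])

    # Stationary distribution: solve pi = pi P with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    pi = np.linalg.lstsq(A, np.array([0, 0, 0, 1.0]), rcond=None)[0]

    d, n = 2, 20
    Pnd = np.linalg.matrix_power(P, n * d)
    print("p_00^(nd) =", Pnd[0, 0], "  d*pi(0) =", d * pi[0])
    print("p_11^(nd) =", Pnd[1, 1], "  d*pi(1) =", d * pi[1])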

19 Limiting theory for Markov chains*

We wish to complete our discussion about the possible limits of the n-step transition probabilities p_{ij}^{(n)}, as n → ∞. Recall that we know what happens when j is positive recurrent and i is in the same communicating class as j: if the period is 1, then p_{ij}^{(n)} → π(j) (Theorem 17); if the period is d, then the limit exists only along a selected subsequence (Theorem 18). We also saw in Corollary 7 that, if j is transient, then p_{ij}^{(n)} → 0, as n → ∞, for all i.

19.1 Limiting probabilities for null recurrent states

Theorem 19. If j is a null recurrent state then

  p_{ij}^{(n)} → 0,  as n → ∞,  for all i ∈ S.

The proof will be based on two results from Analysis, the first of which is known as the Helly-Bray Lemma:

Lemma 10. Consider, for each n = 1, 2, ..., a probability distribution on N:

  p^{(n)} = (p_1^{(n)}, p_2^{(n)}, p_3^{(n)}, ...),

where Σ_{i=1}^∞ p_i^{(n)} = 1 and p_i^{(n)} ≥ 0 for all i. Then we can find a sequence n_1 < n_2 < n_3 < ··· of integers such that lim_{k→∞} p_i^{(n_k)} exists for each i ∈ N.

Proof. Since the numbers p_1^{(n)}, n = 1, 2, ..., are contained between 0 and 1, there must be a subsequence (n_k^1, k = 1, 2, ...) such that lim_{k→∞} p_1^{(n_k^1)} exists and equals, say, p_1. Now consider the numbers p_2^{(n_k^1)}, k = 1, 2, .... Since they are contained between 0 and 1, there must be a subsequence (n_k^2) of (n_k^1) such that lim_{k→∞} p_2^{(n_k^2)} exists and equals, say, p_2. It is clear how we continue: for each i, we find a subsequence (n_k^i) of the previous one such that lim_{k→∞} p_i^{(n_k^i)} exists and equals, say, p_i. Schematically,

  p_1^{(n_1^1)}, p_1^{(n_2^1)}, p_1^{(n_3^1)}, ··· → p_1
  p_2^{(n_1^2)}, p_2^{(n_2^2)}, p_2^{(n_3^2)}, ··· → p_2
  p_3^{(n_1^3)}, p_3^{(n_2^3)}, p_3^{(n_3^3)}, ··· → p_3
  ···

By construction, the sequence (n_k^2) of the second row is a subsequence of (n_k^1) of the first row, the sequence (n_k^3) of the third row is a subsequence of (n_k^2) of the second row, and so on. Now pick the numbers in the diagonal: the diagonal subsequence (n_k^k, k = 1, 2, ...) is eventually a subsequence of each (n_k^i, k = 1, 2, ...). Hence p_i^{(n_k^k)} → p_i, as k → ∞, for each i.

The second result is Fatou's lemma, whose proof we omit (see Brémaud, 1999):

Lemma 11. If (y_i, i ∈ N) and (x_i^{(n)}, i ∈ N), n = 1, 2, ..., are sequences of nonnegative numbers, then

  Σ_{i=1}^∞ y_i lim inf_{n→∞} x_i^{(n)} ≤ lim inf_{n→∞} Σ_{i=1}^∞ y_i x_i^{(n)}.

Proof of Theorem 19. Let j be a null recurrent state. Suppose that, for some i, the sequence p_{ij}^{(n)} does not converge to 0. Hence we can find a subsequence (n_k) such that

  lim_{k→∞} p_{ij}^{(n_k)} = α_j > 0.

Consider the probability vector (p_{ix}^{(n_k)}, x ∈ S). By Lemma 10, we can pick a subsequence (n′_k) of (n_k) such that

  lim_{k→∞} p_{ix}^{(n′_k)} = α_x,  for all x ∈ S.

Since Σ_{x∈S} p_{ix}^{(n′_k)} = 1, we have, by Lemma 11 (the first inequality below) and since α_j > 0,

  0 < Σ_{x∈S} α_x ≤ 1.

But P^{n+1} = P^n P, i.e.

  p_{ix}^{(n′_k+1)} = Σ_{y∈S} p_{iy}^{(n′_k)} p_{yx},

and, by Lemma 11,

  lim inf_{k→∞} p_{ix}^{(n′_k+1)} ≥ Σ_{y∈S} lim_{k→∞} p_{iy}^{(n′_k)} p_{yx} = Σ_{y∈S} α_y p_{yx}.

Hence

  α_x ≥ Σ_{y∈S} α_y p_{yx},  x ∈ S.

We wish to show that these inequalities are, actually, equalities. Suppose that one of them is not, i.e. that for some x_0 ∈ S it is a strict inequality:

  α_{x_0} > Σ_{y∈S} α_y p_{y x_0}.

This would imply that

  Σ_{x∈S} α_x > Σ_{x∈S} Σ_{y∈S} α_y p_{yx} = Σ_{y∈S} α_y Σ_{x∈S} p_{yx} = Σ_{y∈S} α_y,

and this is a contradiction (the number Σ_x α_x cannot be strictly larger than itself). Thus the α_x satisfy the balance equations. Since 0 < Σ_x α_x ≤ 1, we have that

  π(x) := α_x / Σ_{y∈S} α_y,  x ∈ S,

satisfies the balance equations and has Σ_{x∈S} π(x) = 1, i.e. π is a stationary distribution. Now π(j) > 0, because α_j > 0. By Corollary 13, state j is positive recurrent: contradiction. Hence p_{ij}^{(n)} → 0 for all i.

19.2 The general case

It remains to see what the limit of p_{ij}^{(n)} is, as n → ∞, when j is a positive recurrent state but i and j do not necessarily belong to the same communicating class. The result when the period of j is larger than 1 is a bit awkward to formulate, so we will only look at the aperiodic case.

Theorem 20. Let j be a positive recurrent state with period 1. Let C be the communicating class of j and let π^C be the stationary distribution associated with C. Let i be an arbitrary state, let

  H_C := inf{n ≥ 0 : X_n ∈ C}

be the first hitting time of the set C, and let φ_C(i) := P_i(H_C < ∞). Then

  lim_{n→∞} p_{ij}^{(n)} = φ_C(i) π^C(j).

Proof. If i also belongs to C then φ_C(i) = 1 and so the result follows from Theorem 17. In general, we have

  p_{ij}^{(n)} = P_i(X_n = j) = P_i(X_n = j, H_C ≤ n) = Σ_{m=1}^n P_i(H_C = m) P_i(X_n = j | H_C = m).

Let µ be the distribution of X_{H_C}, given that X_0 = i and that {H_C < ∞}. From the Strong Markov Property, we have

  P_i(X_n = j | H_C = m) = P_µ(X_{n−m} = j).

But Theorem 17 tells us that the limit exists regardless of the initial distribution:

  P_µ(X_{n−m} = j) → π^C(j),  as n → ∞.

Hence

  lim_{n→∞} p_{ij}^{(n)} = π^C(j) Σ_{m=1}^∞ P_i(H_C = m) = π^C(j) P_i(H_C < ∞),

which is what we need.

Remark: We can write the result of Theorem 20 as

  lim_{n→∞} p_{ij}^{(n)} = f_ij / E_j T_j,

where T_j = inf{n ≥ 1 : X_n = j} and f_ij = P_i(T_j < ∞). Indeed, π^C(j) = 1/E_j T_j, and f_ij = φ_C(i) (once the chain enters the recurrent class C, it hits j with probability 1). In this form, the result can also be obtained by the so-called Abel's Lemma of Analysis, starting with the first hitting time decomposition relation of Lemma 7, as in §10.1. This is a more old-fashioned way of getting the same result.

Example: Consider the following chain (0 < α, β, γ < 1):

[Figure: a chain on the states {1, 2, 3, 4, 5} with p_11 = 1 − γ, p_12 = γ, p_21 = 1, p_32 = 1 − α, p_34 = α, p_43 = β, p_45 = 1 − β, p_55 = 1.]

Let φ(i) := φ_{C_1}(i) be the probability that the class C_1 will be reached starting from state i. The communicating classes are I = {3, 4} (inessential states), C_1 = {1, 2} (closed class), C_2 = {5} (absorbing state). Both C_1 and C_2 are aperiodic classes. We have φ(1) = φ(2) = 1 and φ(5) = 0, while, from first-step analysis,

  φ(3) = (1 − α) φ(2) + α φ(4),  φ(4) = (1 − β) φ(5) + β φ(3),

whence

  φ(3) = (1 − α)/(1 − αβ),  φ(4) = β(1 − α)/(1 − αβ).

The stationary distributions associated with the two closed classes are:

  π^{C_1}(1) = 1/(1 + γ),  π^{C_1}(2) = γ/(1 + γ),  π^{C_2}(5) = 1.

Hence, as n → ∞,

  p_31^{(n)} → φ(3) π^{C_1}(1),  p_32^{(n)} → φ(3) π^{C_1}(2),  p_33^{(n)} → 0,  p_34^{(n)} → 0,  p_35^{(n)} → (1 − φ(3)) π^{C_2}(5).

Similarly,

  p_41^{(n)} → φ(4) π^{C_1}(1),  p_42^{(n)} → φ(4) π^{C_1}(2),  p_43^{(n)} → 0,  p_44^{(n)} → 0,  p_45^{(n)} → (1 − φ(4)) π^{C_2}(5).

20 Ergodic theorems

An ergodic theorem is a result about time-averages of random processes. The subject of Ergodic Theory has its origins in Physics. The ergodic hypothesis in physics says that the proportion of time that a particle spends in a particular region of its "phase space" is proportional to the volume of this region. (The phase, or state, space of a particle is not just a physical space: it includes, for example, information about its velocity.) Ergodic Theory is nowadays an area that combines many fields of Mathematics and Science, ranging from Mathematical Physics and Dynamical Systems to Probability and Statistics. Pioneers in the area were, among others, the 19th-century French mathematicians Henri Poincaré and Joseph Liouville. We will avoid giving a precise definition of the term 'ergodic' but only mention what constitutes a result in the area.

Theorem 14 is a type of ergodic theorem, and so is the Law of Large Numbers itself (Theorem 13). Recall that, in Theorem 14, we assumed the existence of a positive recurrent state. We now generalise this. Consider the quantity ν^[a](x), defined in (15) as the average number of visits to state x between two successive visits to state a.

Theorem 21. Suppose that a Markov chain possesses a recurrent state a ∈ S and starts from an initial state X_0 such that P(T_a < ∞) = 1. Let f be a reward function such that

  Σ_{x∈S} |f(x)| ν^[a](x) < ∞.    (20)

Then, with probability 1,

  lim_{t→∞} Σ_{n=0}^t f(X_n) / Σ_{n=0}^t 1(X_n = a) = f̄,    (21)

where

  f̄ := Σ_{x∈S} f(x) ν^[a](x).

Proof. The reader is asked to revisit the proof of Theorem 14. As in that proof, we break the total reward as follows:

  Σ_{n=0}^t f(X_n) = Σ_{r=1}^{N_t−1} G_r + G_first + G_last,

where

  G_r = Σ_{T_a^{(r)} ≤ s < T_a^{(r+1)}} f(X_s),

while G_first, G_last are the total rewards over the first and last incomplete cycles, respectively. The Law of Large Numbers tells us that

  lim_{n→∞} (1/n) Σ_{r=1}^n G_r = EG_1,

with probability 1, as long as E|G_1| < ∞; but this is precisely the condition (20). As in the proof of Theorem 14, we have, with probability one,

  lim_{t→∞} (1/t) G_last = lim_{t→∞} (1/t) G_first = 0,

and, replacing n by the subsequence N_t, we have

  lim_{t→∞} (1/N_t) Σ_{r=1}^{N_t−1} G_r = EG_1.

Note that the denominator Σ_{n=0}^t 1(X_n = a) in (21) equals the number N_t of visits to state a up to time t. We only need to check that EG_1 = f̄, and this is left as an exercise.

Remark 1: The difference between Theorem 14 and Theorem 21 is that, in the former, we assumed positive recurrence of the ground state a. If a is positive recurrent, then, as we saw in the proof of Theorem 14, N_t/t → 1/E_a T_a. So if we divide by t the numerator and the denominator of the fraction in (21), the denominator converges to 1/E_a T_a, while the numerator converges to EG_1/E_a T_a (which is the quantity denoted by f̄ in Theorem 14), and we obtain the result of Theorem 14.

Remark 2: If a chain is irreducible and positive recurrent then the Ergodic Theorem (Theorem 14) holds, and this justifies the terminology of §18.5.

21 Finite chains

We take a brief look at some things that are specific to chains with finitely many states. Consider an irreducible chain with finitely many states. Then at least one of the states must be visited infinitely often; so there are always recurrent states, since the number of states is finite. From Proposition 2 we know that the balance equations

  ν = νP

have at least one solution. Now, since the number of states is finite, we have

  C := Σ_{i∈S} ν(i) < ∞.

Hence

  π(i) := ν(i)/C,  i ∈ S,

satisfies the balance equations and the normalisation condition Σ_{i∈S} π(i) = 1. Therefore π is a stationary distribution. At least some state i must have π(i) > 0. By Corollary 13, this state must be positive recurrent, and, because positive recurrence is a class property (Theorem 15) and the chain is irreducible, all states are positive recurrent. By Theorem 16, the stationary distribution is unique. We summarise the above discussion in:

Theorem 22. Consider an irreducible Markov chain with finitely many states. Then the chain is positive recurrent with a unique stationary distribution π. If, in addition, the chain is aperiodic, we have Pⁿ → Π, as n → ∞, where P is the transition matrix and Π is a matrix all the rows of which are equal to π.
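Theorem 22 can be checked numerically in a few lines (a sketch in Python, with a hypothetical finite chain): iterate the transition matrix and watch all rows of Pⁿ converge to π.

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],     # hypothetical irreducible aperiodic chain
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])

    Pn = np.linalg.matrix_power(P, 50)
    print(Pn)            # all rows are (approximately) equal: each row is pi

    # Cross-check: solve the balance equations pi = pi P with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    pi = np.linalg.lstsq(A, np.array([0, 0, 0, 1.0]), rcond=None)[0]
    print(pi)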

22 Time reversibility

Consider a Markov chain (X_n, n ≥ 0). We say that it is time-reversible, or, simply, reversible, if it has the same distribution when the arrow of time is reversed. More precisely, reversibility means that

  (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0), for all n.

In other words, if we film the process and run the film backwards, it will look the same; the process is, in a sense, like a film that can be played in reverse without being able to tell that it is, actually, running backwards. Another way to put this is by saying that observing any part of the process will not reveal any information about whether the process runs forwards or backwards in time. For example, if we film a standing man who raises his arms up and down successively, and then run the film backwards, we cannot tell that the film is played in reverse. But if we film a man who walks and run the film backwards, then we instantly know that the film is played in reverse.

Lemma 12. A reversible Markov chain is stationary.

Proof. Reversibility means that (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0), for all n. In particular, the first components of the two vectors must have the same distribution: X_n has the same distribution as X_0, for all n. Because the process is Markov, this means that it is stationary.

Theorem 23. Consider an irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if the detailed balance equations hold:

  π(i) p_{ij} = π(j) p_{ji},  i, j ∈ S.

Proof. Suppose first that the chain is reversible. Taking n = 1 in the definition of reversibility, we see that (X_0, X_1) has the same distribution as (X_1, X_0), i.e.

  P(X_0 = i, X_1 = j) = P(X_1 = i, X_0 = j),

for all i and j. But P(X_0 = i, X_1 = j) = π(i)p_{ij} and P(X_0 = j, X_1 = i) = π(j)p_{ji}. So the detailed balance equations (DBE) hold.

Conversely, suppose the DBE hold. The DBE imply immediately that (X_0, X_1) has the same distribution as (X_1, X_0). Using the DBE twice, we also obtain, e.g.,

  P(X_0 = i, X_1 = j, X_2 = k) = π(i) p_{ij} p_{jk} = [π(j)p_{ji}] p_{jk} = p_{ji} [π(j)p_{jk}] = p_{ji} [π(k)p_{kj}] = π(k) p_{kj} p_{ji} = P(X_0 = k, X_1 = j, X_2 = i),

so that (X_0, X_1, X_2) has the same distribution as (X_2, X_1, X_0). By induction, it is easy to show that, for all n, (X_0, X_1, ..., X_n) has the same distribution as (X_n, X_{n−1}, ..., X_0). Hence the chain is reversible.

Theorem 24. Consider an aperiodic irreducible Markov chain with transition probabilities p_{ij} and stationary distribution π. Then the chain is reversible if and only if Kolmogorov's loop criterion holds: for all i_1, i_2, ..., i_m ∈ S and all m ≥ 3,

  p_{i_1 i_2} p_{i_2 i_3} ··· p_{i_{m−1} i_m} p_{i_m i_1} = p_{i_1 i_m} p_{i_m i_{m−1}} ··· p_{i_3 i_2} p_{i_2 i_1}.    (22)

Proof. We will show the equivalence of the DBE and the Kolmogorov loop criterion (KLC). Suppose first that the chain is reversible, so that the DBE hold. Pick 3 states i, j, k, and write, using the DBE,

  π(i) p_{ij} p_{jk} p_{ki} = [π(j)p_{ji}] p_{jk} p_{ki} = [π(j)p_{jk}] p_{ji} p_{ki} = [π(k)p_{kj}] p_{ji} p_{ki} = [π(k)p_{ki}] p_{ji} p_{kj} = [π(i)p_{ik}] p_{ji} p_{kj} = π(i) p_{ik} p_{kj} p_{ji}.

Since the chain is irreducible, we have π(i) > 0 for all i. Hence, cancelling the π(i) from the above display, we obtain the KLC for m = 3. The same logic works for any m and can be worked out by induction.

Conversely, suppose that the KLC (22) holds. Pick two states i_1, i_2 and sum (22) over all choices of the intermediate states i_3, i_4, ..., i_m; we obtain

  p_{i_1 i_2} p_{i_2 i_1}^{(m−1)} = p_{i_1 i_2}^{(m−1)} p_{i_2 i_1}.

If the chain is aperiodic, then, letting m → ∞, we have

  p_{i_2 i_1}^{(m−1)} → π(i_1)  and  p_{i_1 i_2}^{(m−1)} → π(i_2),

implying that π(i_1) p_{i_1 i_2} = π(i_2) p_{i_2 i_1}. These are the DBE, and the chain is reversible.

Notes:
1) A necessary condition for reversibility is that p_{ij} > 0 if and only if p_{ji} > 0. In other words, if there is an arrow from i to j in the graph of the chain but no arrow from j to i, then the chain cannot be reversible.
2) The DBE are equivalent to saying that the ratio p_{ij}/p_{ji} can be written as a product of a function of i and a function of j; in other words, we have separation of variables:

  p_{ij}/p_{ji} = f(i) g(j).

For example. • If the chain has infinitely many states. consider the chain: Replace it by: This is a tree.) Necessarily. so we have pij g(j) = . The reason is as follows. there would be a loop. Consider a positive recurrent irreducible Markov chain such that if pij > 0 then pji > 0. 60 . Form the graph by deleting all self-loops. Pick two distinct states i. Hence the original chain is reversible. the graph becomes disconnected. j for which pij > 0. and. Hence the chain is reversible.1) which gives π(i)pij = π(j)pji . and by replacing. This gives the following method for showing that a finite irreducible chain is reversible: • Form the ratio pij /pji and show that the variables separate. by assumption. namely the arrow from i to j. to guarantee. So g satisfies the balance equations. If we remove the link between them. The balance equations are written in the form F (A. using another method. Example 1: Chain with a tree-structure. And hence π can be found very easily. that the chain is positive recurrent.(Convention: we form this ratio only when pij > 0. then the chain is reversible. meaning that it has no loops. the detailed balance equations. i.e. then we need. there isn’t any. the two oriented edges from i to j and from j to i by a single edge without arrow. Ac be the sets in the two connected components.) Let A. and the arrow from j to j is the only one from Ac to A. in addition. for each pair of distinct states i. (If it didn’t. Ac ) = F (Ac . If the graph thus obtained is a tree. j for which pij > 0. pji g(i) Multiplying by g(i) and summing over j we find g(i) = j∈S g(j)pji . f (i)g(i) must be 1 (why?). There is obviously only one arrow from AtoAc . A) (see §4.

we have that C = 1/2|E|. π(i) = 1. and a set E of undirected edges. the DBE hold. 2. j are neighbours 1 1 then π(i)pij = Cδ(i) δ(i) = C and π(j)pji = Cδ(j) δ(j) = C. 1 4 0 2 3 Here we have p12 = p10 = p14 = 1/3.Example 2: Random walk on a graph. if j is a neighbour of i otherwise. In all cases. A RWG is a reversible chain. 61 . so both members are equal to C. π(i)pij = π(j)pji . The stationary distribution assigns to a state probability which is proportional to its degree. If i. Each edge connects a pair of distinct vertices (states) without orientation. Consider an undirected graph G with vertices S. i∈S Since δ(i) = 2|E|. Suppose the graph is connected. j are not neighbours then both sides are 0. i. There are two cases: If i. The constant C is found by normalisation. We claim that π(i) = Cδ(i). a finite set. p02 = 1/4. 0. etc.e. Hence we conclude that: 1. For each state i define its degree δ(i) = number of neighbours of i. States j are i are called neighbours if they are connected by an edge. i∈S where |E| is the total number of edges. The stationary distribution of a RWG is very easy to find. The Markov chain with these transition probabilities is called a random walk on a graph (RWG). To see this let us verify that the detailed balance equations hold. Now define the the following transition probabilities: pij = 1/δ(i). where C is a constant. For example.

If π is the stationary distribution of the original chain then π/π(A) is the stationary distribution of (Yr ). 1. 23. Due to the Strong Markov Property.) has time-homogeneous transition probabilities: P (Yk+1 = j|Yk = i) = pij . . 2. . where pij are the m-step transition probabilities of the original chain. 2. 1. k = 0. (m) (m) 23. if nk = n0 + km.1 Subsequences The most obvious way to create a new Markov chain is by picking a subsequence.e. Define Yr := XT (r) .2 Watching a Markov chain when it visits a set Suppose that the original chain is ergodic (i. 62 n ∈ N. 2.3 Subordination Let (Xn .i.e. then it is a stationary distribution for the new chain. random variables with values in N and distribution λ(n) = P (ξ0 = n). 1. . . 1. . . The state space of (Yr ) is A.e. then (Yk . k = 0. . St+1 = St + ξt . .23 New Markov chains from old ones* We now deal with the question: if we have a Markov chain (Xn ). Let (St . . i. . t ≥ 0. k = 0. in addition.d.) has the Markov property but it is not time-homogeneous. 2. t ≥ 0) be independent from (Xn ) with S0 = 0. a sequence of integers (nk )k≥0 such that 0 ≤ n0 < n1 < n2 < · · · and by letting Yk := Xnk . if. . . 2. In this case. . π is a stationary distribution for the original chain. . . irreducible and positive recurrent). in general. how can we create new ones? 23. where ξt are i. . The sequence (Yk . n ≥ 0) be Markov with transition probabilities pij . this process is also a Markov chain and it does have time-homogeneous transition probabilities. k = 0. . Let A be (r) a set of states and let TA be the r-th visit to A. If the subsequence forms an arithmetic progression. A r = 1. i.

4 Deterministic function of a Markov chain If (Xn ) is a Markov chain with state space S and if f : S → S ′ is some function from S to some countable set S ′ . Yn−1 = αn−1 }. that there is a function Q : S ′ × S ′ → R such that P (Yn+1 = β | Xn = i) = Q(f (i).) More generally. we have: Lemma 13. αn−1 ∈ S ′ and let Yn−1 be the event Yn−1 := {Y0 = α0 . is NOT. knowledge of f (Xn ) implies knowledge of Xn and so.. then the process Yn = f (Xn ). in general. and the elements of S ′ by α. β. (n) 23. f is a one-to-one function then (Yn ) is a Markov chain. . ∞ λ(n)pij . .Then Yt := XSt . . . Then (Yn ) has the Markov property. j. a Markov chain. . If. Yn−1 ) = P (Yn+1 = β | Yn = α). . Fix α0 .. n≥0 63 . . is a Markov chain with P (Yt+1 = j|Yt = i) = n=1 t ≥ 0. Proof. . (Notation: We will denote the elements of S by i. We need to show that P (Yn+1 = β | Yn = α. Indeed. . β). Suppose that P (Yn+1 = β | Xn = i) depends on i only through f (i). . conditional on Yn the past is independent of the future. . α1 . however. . . .e. i.

β). if we let 1/2. regardless of whether i > 0 or i < 0. i ∈ Z. In other words. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. a Markov chain with transition probabilities Q(α. Then (Yn ) is a Markov chain. observe that the function | · | is not one-to-one. β ∈ Z+ . β)P (Xn = i | Yn = α. if β = 1. P (Xn+1 = i ± 1|Xn = i) = 1/2. Example: Consider the symmetric drunkard’s random walk. β). and so (Yn ) is. because the probability of going up is the same as the probability of going down. Yn−1 ) Q(f (i). Xn = i|Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. To see this. conditional on {Xn = i}. depends on i only through |i|. α = 0. Thus (Yn ) is Markov with transition probabilities Q(α. Yn−1 )P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = i∈S = i∈S = i∈S Q(f (i). But we will show that P (Yn+1 = β|Xn = i). Define Yn := |Xn |. But P (Yn+1 = β | Yn = α. Yn = α.for all choices of the variables involved. β) i∈S P (Yn = α|Xn = i. β) 1(α = f (i))P (Xn = i|Yn−1 ) P (Yn = α|Yn−1 ) = Q(α. then |Xn+1 | = 1 for sure. Yn−1 ) = i∈S P (Yn+1 = β | Xn = i. On the other hand. β) P (Yn = α|Xn = i. if i = 0. β) P (Yn = α|Yn−1 ) i∈S = Q(α.e. i ∈ Z. β). 64 . the absolute value of next state will either increase by 1 or decrease by 1 with probability 1/2. indeed. for all i ∈ Z and all β ∈ Z+ . β). if β = α ± 1. i. α > 0 Q(α. Indeed. Yn−1 )P (Xn = i | Yn = α. β) := 1. Yn−1 ) Q(f (i). We have that P (Yn+1 = β|Xn = i) = Q(|i|. and if i = 0. β) P (Yn = α.

Thus. . Therefore.d. Case II: If p0 = P (ξ = 0) > 0 then state 0 can be reached from any other state: indeed. a hard problem. one question of interest is whether the process will become extinct or not. with probability 1. It is a Markov chain with time-homogeneous transitions. the population cannot grow: it will eventually reach 0. independently from one another. If we let ξ (n) = ξ1 . each individual gives birth to at one child with probability p or has children with probability 1 − p. then we see that the above equation is of the form Xn+1 = f (Xn . . . In other words. are i.. Then the size of the (n + 1)-st generation is: (n) (n) (n) Xn+1 = ξ1 + ξ2 + · · · + ξXn . If p1 = P (ξ = 1) = 1 then Xn = X0 + n and this is a trivial case. 0 ≤ j ≤ i. k = 0. We let pk be the probability that an individual will bear k offspring. ξ (n) ).1 Applications Branching processes: a population growth application We describe a basic model for population growth known as the Galton-Watson process. and 0 is always an absorbing state. . Back to the general case. Hence all states i ≥ 1 are transient. ξ (1) . If we know that Xn = i then Xn+1 is the number of successful births amongst i individuals. if the current population is i there is probability pi that everybody gives birth to zero offspring. (and independent of the initial population size X0 –a further assumption). 0 But 0 is an absorbing state. . which has a binomial distribution: i j pij = p (1 − p)i−j . So assume that p1 = 1. if the 65 (n) (n) . 1.24 24. n ≥ 0) has the Markov property. j In this case. The idea is this: At each point of time n we have a number Xn of individuals each of which gives birth to a random number of offspring. . The state 0 is irrelevant because it cannot be reached from anywhere. and. if 0 is never reached then Xn → ∞. 2. and following the same distribution. Then P (ξ = 0) + P (ξ ≥ 2) > 0. We distinguish two cases: Case I: If p0 = P (ξ = 0) = 0 then the population can only grow. in general. Hence all states i ≥ 1 are inessential. But here is an interesting special case. Example: Suppose that p1 = p. p0 = 1 − p. We find the population (n) at time n + 1 by adding the number of offspring of each individual. therefore transient. . We will not attempt to compute the pij neither draw the transition diagram because they are both futile. Computing the transition probabilities pij = P (Xn+1 = j|Xn = i) is.i. since ξ (0) . . so we will find some different method to proceed. ξ2 . we have that (Xn . . The parent dies immediately upon giving birth. Let ξk be the number of offspring of the k-th individual in the n-th generation.

i. Let ∞ ϕ(z) := Ez ξ = k=0 z k pk be the probability generating function of ξ. since Xn = 0 implies Xm = 0 for all m ≥ n. If µ ≤ 1 then the ε = 1. We let ϕn (z) := Ez Xn ∞ = k=0 z k P (Xn = k) be the probability generating function of Xn . Consider a Galton-Watson process starting with X0 = 1 initial ancestor. and P (Dk ) = ε. where ε = P1 (D). We wish to compute the extinction probability ε(i) := Pi (D).population does not become extinct then it must grow to infinity. . . Proof. The result is this: Proposition 5. Indeed. each one behaves independently of one another. if we start with X0 = i in ital ancestors. For i ≥ 1. Let D be the event of ‘death’ or extinction: D := {Xn = 0 for some n ≥ 0}. we have n→∞ lim ϕn (0) = lim P (Xm = 0 for all m ≥ n) n→∞ = P (there exists n such that Xm = 0 for all m ≥ n) = P (D) = ε. This function is of interest because it gives information about the extinction probability ε. let Dk be the event that the branching process corresponding to the k-th initial ancestor dies. Let ε be the extinction probability of the process. ε(0) = 1. . 66 . we can argue that ε(i) = εi . Therefore the problem reduces to the study of the branching process starting with X0 = 1. For each k = 1. If µ > 1 then ε < 1 is the unique solution of ϕ(z) = z. Indeed. Let ∞ µ = Eξ = k=1 kpk be the mean number of offspring of an individual. Obviously. and. . We then have that D = D1 ∩ · · · ∩ Di . ϕn (0) = P (Xn = 0). when we start with X0 = i. so P (D|X0 = i) = εi .

and so on. ϕn (0) ≤ ε∗ . and. This means that ε = 1 (see left figure below). recall that ϕ is a convex function with ϕ(1) = 1. Therefore ε = ϕ(ε). (See right figure below. Let us show that.) 67 . Notice now that µ= d ϕ(z) dz . Hence ϕ2 (z) = ϕ1 (ϕ(z)) = ϕ(ϕ(z)). if the slope µ at z = 1 is ≤ 1. and since the latter has only two solutions (z = 0 and z = ε∗ ). We then have ϕn+1 (z) = Ez Xn+1 = E z ξ1 z ξ2 · · · z ξXn ∞ (n) (n) (n) ϕ1 (z) = ϕ(z). ϕ(ϕn (0)) → ϕ(ε). But ϕn+1 (0) → ε as n → ∞. in fact. But 0 ≤ ε∗ . ε = ε∗ . Also ϕn (0) → ε. if µ > 1. Since both ε and ε∗ satisfy z = ϕ(z). Therefore. On the other hand. the graph of ϕ(z) lies above the diagonal (the function z) and so the equation z = ϕ(z) has no solutions other than z = 1. By induction.Note also that ϕ0 (z) = 1. Therefore ε = ε∗ . But ϕn (0) → ε. In particular. z=1 Also. = i=0 ∞ E z ξ 1 z ξ 2 · · · z ξ Xn E z ξ1 z ξ2 · · · z ξi (n) (n) (n) (n) (n) (n) (n) (n) 1(Xn = i) 1(Xn = i) (n) = i=0 ∞ = i=0 ∞ E[z ξ1 ] E[z ξ2 ] · · · E[z ξi ] E[1(Xn = i)] ϕ(z)i P (Xn = i) = i=0 = E ϕ(z)Xn = ϕn (ϕ(z)). all we need to show is that ε < 1. Hence ϕ(0) ≤ ϕ(ε∗ ) = ε∗ . then the graph of ϕ(z) does intersect the diagonal at some point ε∗ < 1. So ε ≤ ε∗ < 1. ϕ3 (z) = ϕ2 (ϕ(z)) = ϕ(ϕ(ϕ(z))) = ϕ(ϕ2 (z)). since ϕ is a continuous function. ϕn+1 (z) = ϕ(ϕn (z)). ϕn+1 (0) = ϕ(ϕn (0)).

3. in a simplified model. If collision happens. there are x + a − 1 packets. then max(Xn + An − 1. During this time slot assume there are a new packets that need to be transmitted. then the transmission is not successful and the transmissions must be attempted anew. Thus. Each of the backlogged or newly arrived packets is selected for transmission with (a small) probability p. 3. 3. then time slots 0.2 The ALOHA protocol: a computer communications application This is a protocol used in the early days of computer networking for having computer hosts communicate over a shared medium (radio channel). Then ε < 1. Then ε = 1. 1. 1. So if there are. we let hosts transmit whenever they like. But this is not a good method it is a waste of time to allocate a slot to a host if that host has nothing to transmit. Xn+1 = Xn + An .d. k ≥ 0. random variables with some common distribution: P (An = k) = λk . Nowadays. Abandoning this idea. for a packet to go through. Ethernet has replaced Aloha. 68 . there is collision and the transmission is unsuccessful. 5. there are x + a packets. and Sn the number of selected packets. 2. Case: µ > 1. there is no time to make a prior query whether a host has something to transmit or not. So if we let Xn be the number of backlogged packets at the beg ginning of the n-th time slot. . If s = 1 there is a successful transmission and.1 1 (z) (z) ε 0 1 0 ε 1 z z Case: µ ≤ 1. 2. only one of the hosts should be transmitting at a given time. 24. on the next times slot. 1. . on the next times slot. So. Let s be the number of selected packets. periodically. there is a collision and. 0). . is that if more than one hosts attempt to transmit a packet. . say. are allocated to hosts 1.i. 2. 4. assume that on a time slot there are x packets amongst all hosts that are waiting for transmission (they are ‘backlogged’). One obvious solution is to allocate time slots to hosts in a specific order. . The problem of communication over a common channel. We assume that the An are i. 3 hosts.. Since things happen fast in a computer network. otherwise. using the same frequency. An the number of new packets on the same time slot. . If s ≥ 2. 6. if Sn = 1.

we have q < 1. for example we can make it ≤ µ/2. So we can make q x G(x) as small as we like by choosing x sufficiently large. then the chain is transient. k ≥ 1. So: px.where λk ≥ 0 and k≥0 kλk = 1. we think as follows: to have a reduction of x by 1. The Aloha protocol is unstable. Sn should be 0 or ≥ 2) or we need k + 1 new packets and one selection (Sn = 1). k 0 ≤ k ≤ x + a. An−1 . selections are made amongst the Xn + An packets. we think like this: to have an increase by k we either need exactly k new packets and no successful transmissions (i. 69 . We here only restrict ourselves to the computation of this expected change. we should not subtract 1.x−1 + k≥1 kpx. The reason we take the maximum with 0 in the above equation is because. and so q x tends to 0 much faster than G(x) tends to ∞. i. Proving this is beyond the scope of the lectures. we have that the drift is positive for all large x and this proves transience. An = a. The process (Xn . So P (Sn = k|Xn = x. The main result is: Proposition 6. where G(x) is a function of the form α + βx . we must have no new packets arriving. if Xn + An = 0. To compute px. conditional on Xn and An (and all past states and arrivals) a distribution which is Binomial with parameters Xn + An . n ≥ 0) is a Markov chain.e. Sn = 1|Xn = x) = λ0 xpq x−1 . We also let µ := k≥1 kλk be the expectation of An .x are computed by subtracting the sum of the above from 1. . For x > 0.) = x + a k x−k p q . The random variable Sn has. so q x G(x) tends to 0 as x → ∞. the Markov chain (Xn ) is transient. and p. and exactly one selection: px. Idea of proof.x−1 when x > 0. The probabilities px. Let us find its transition probabilities. Unless µ = 0 (which is a trivial case: no arrivals!). defined by‘ ∆(x) := E[Xn+1 − Xn |Xn = x]. Since p > 0. we have ∆(x) = (−1)px. .e. .x+k kλk − kλk (x + k)pq x+k−1 + k≥1 x k≥1 = −λ0 xpq x−1 + = −λ0 xpq x x−1 kλk+1 (x + k + 1)pq x+k . but we mention that it is based on the so-called drift criterion which says that if the expected change Xn+1 − Xn conditional on Xn = x is bounded below from a positive constant for all large x. To compute px.x+k for k ≥ 1.x−1 = P (An = 0. Indeed. x+k−1 k≥1 + µ − λ1 (x + 1)pq − λk (x + k)pq k≥2 = µ − q G(x). where q = 1 − p. Xn−1 . Therefore ∆(x) ≥ µ/2 for all large x.x+k = λk (1 − (x + k)pq x+k−1 ) + λk+1 (x + k + 1)pq x+k .

unless he gets bored–and this happens with probability α–in which case he clicks on a random page. So this is how Google assigns ranks to pages: It uses an algorithm. in the latter case. 3. The interpretation is that of a ‘bored surfer’: a surfer moves according the the original Markov chain. to compute the stationary distribution π corresponding to the Markov chain with transition matrix P = [pij ]: π = πP. 2. if L(i) = ∅   0. The algorithm used (PageRank) starts with an initial guess π(0) for π and updates it by π (k) = π (k−1) P. unless i is a dangling page. known as ‘302 Google jacking’. It uses advanced techniques from linear algebra to speed up computations. We pick α between 0 and 1 (typically. A new use of PageRank is to rank academic doctoral programs based on their records of finding jobs for their graduates. otherwise. α ≈ 0. if page i has a link to page j then we put an arrow from i to j and consider it as a directed edge of the graph. then i is a ‘dangling page’. So if Xn = i is the current location of the chain. if j ∈ L(i)  pij = 1/|V |. Define transition probabilities as follows:  1/|L(i)|. All you need to do to get high rank is pay (a lot of money).3 PageRank (trademark of Google): A World Wide Web application PageRank is an algorithm used by the popular search engine Google in order to assign importance to a web page. because of the size of the problem. So we make the following modification. For each page i let L(i) be the set of pages it links to. PageRank. |V | These are new transition probabilities. Its implementation though is one of the largest matrix computations in the world. Comments: 1. the chain moves to a random page.24. If L(i) = ∅. It is not clear that the chain is irreducible.2) and let α pij := (1 − α)pij + . The way it does that is by computing the stationary distribution of a Markov chain that we now describe. neither that it is aperiodic. the next location Xn+1 is picked uniformly at random amongst all pages that i links to. Therefore. In an Internet-dominated society. there are companies that sell high-rank pages. There is a technique. that allows someone to manipulate the PageRank of a web page and make it much higher. The web graph is a a graph set of vertices V consisting of web pages and edges representing hyperlinks: thus. the rank of a page may be very important for a company’s profits. Then it says: page i has higher rank than j if π(i) > π(j). Academic departments link to each other by hiring their 70 . The idea is simple.

maintain this allele for the next generation. An individual can be of type AA (meaning that all its alleles are A) or aa or Aa. it makes the evaluation procedure easy. We will also assume that the parents are replaced by the children. Needless to say that these uses of PageRank and PageRank-like systems represent corruption at the deepest level. two AA individuals will produce and AA child. Thus. We are now ready to define the Markov chain Xn representing the number of alleles of type A in generation n. at the next generation the number of alleles of type A can be found as follows: Pick an allele at random. Three of the square alleles are never selected.faculty from each other (and from themselves). we will assume that. Lazy university administrators use PageRank-like systems for assigning a single number to a researcher’s output. The alleles in an individual come in pairs and express the individual’s genotype for that gene. But an AA mating with a Aa may produce a child of type AA or one of type Aa. One of the square alleles is selected 4 times. in each generation. We would like to study the genetic diversity. this allele is of type A with probability x/2N . So let x be the number of alleles of type A (so and 2N − x is the number of those of type a). Alleles that are never picked up may belong to individuals who never mate or may belong to individuals who mate but their offspring did not inherit this allele. the genetic diversity depends only on the total number of alleles of each type. This means that the external appearance of a characteristic is a function of the pair of alleles. If so. a gene controlling a specific characteristic (e. colour of eyes) appears in two forms. Since we don’t care to keep track of the individuals’ identities. To simplify. meaning how many characteristics we have. When two individuals mate. then.4 The Wright-Fisher model: an example from biology In an organism. We will assume that a child chooses an allele from each parent at random. 71 . In this figure there are 4 alleles of type A (circles) and 6 of type a (squares). Imagine that we have a population of N individuals who mate at random producing a number of offspring. This number could be independent of the research quality but. from the point of view of an administrator. generation after generation. 5. The process repeats. And so on. What is important is that the evaluation can be done without ever looking at the content of one’s research.g. But it is a kind of corruption that is hard to trace since it is unanimously equated with modern use of Information Technology. all we need is the information outlined in this figure. We have a transition from x to y if y alleles of type A were produced. a dominant (say A) and a recessive (say a). their offspring inherits an allele from each parent. 4. 24. One of the circle alleles is selected twice. Do the same thing 2N times.

and the population becomes one consisting of alleles of type a only. we should have M × ϕ(x) + 0 × (1 − ϕ(x)) = x. ϕ(x) = x/M. the chain gets absorbed by either 0 or 2N .d. . on the average. since. . . . Tx := inf{n ≥ 1 : Xn = x}. and the population becomes one consisting of alleles of type A only. M and thus. . since. so both 0 and 2N are absorbing states. Since selections are independent. y ≤ 2N − 1. ξM are i. Clearly. .e. the number of alleles of type A won’t change. p00 = p2N. 72 . . conditionally on (X0 . . Xn ).e. we would like to know the probability ϕ(x) = Px (T0 > TM )P (T0 > TM |X0 = x) that the chain will eventually be absorbed by state M .) One can argue that. . X0 ) = 1 − Xn . . . . X0 ) = Xn . . . . . ϕ(x) = y=0 pxy ϕ(y). .Each allele of type A is selected with probability x/2N . Clearly. on the average. (Here. Let M = 2N . the random variables ξ1 . we see that 2N x 2N −y x y pxy = . X0 ) = E(Xn+1 |Xn ) = M This implies that. the states {1.i. there is a x chance (= ( 2N )2N ) that only alleles of type A are selected. . These are: M ϕ(0) = 0. M n P (ξk = 0|Xn . 2N − 1} form a communicating class of inessential states. M Therefore. . . Therefore P (Xn = 0 or Xn = 2N for all large n) = 1. at “the end of time” the chain (started from x) will be either M with probability ϕ(x) or 0 with probability 1 − ϕ(x).2N = 1. with n P (ξk = 1|Xn . 0 ≤ y ≤ 2N. . i. . The result is correct. i. the expectation doesn’t change and. 1− y 2N 2N x Notice that there is a chance (= (1 − 2N )2N ) that no allele of type A is selected at all. but the argument needs further justification because “the end of time” is not a deterministic time. . We show that our guess is correct by proceeding via the mundane way of verifying that it satisfies the first-step analysis equations. EXn+1 = EXn Xn = Xn . as usual. Similarly. E(Xn+1 |Xn . Another way to represent the chain is as follows: n n Xn+1 = ξ1 + · · · + ξM . . for all n ≥ 0. Since pxy > 0 for all 1 ≤ x. ϕ(M ) = 1. The fact that M is even is irrelevant for the mathematical model. n n where.

5 A storage or queueing application A storage facility stores. and so. Dn ). 73 . . we will try to solve the recursion. the demand is. We will show that this Markov chain is positive recurrent. If the demand is for Dn components. The right-hand side of the last equals M y=0 M y x M y x 1− M M −y y = M = M y=1 M y=1 M! x y!(M − y)! M y 1− x M M −y y M x M M −y−1+1 x (M − 1)! (y − 1)!(M − y)! M M y=1 y−1+1 1− x = M x = M as needed. The correctness of the latter can formally be proved by showing that it does satisfy the recursion. b). at time n + 1. A number An of new components are ordered and added to the stock. Here are our assumptions: The pairs of random variables ((An . and the stock reached level zero at time n + 1. Working for n = 1. the demand is met instantaneously. . there are Xn + An − Dn components in the stock.The first two are obvious. We can express these requirements by the recursion Xn+1 = (Xn + An − Dn ) ∨ 0 where a ∨ b := max(a. We have that (Xn . Let ξn := An − Dn . the demand is only partially satisfied. Let us consider (n) gxy := Px (Xn > y) = P (x + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0 > y . we find X1 = (X0 + ξ0 ) ∨ 0 X2 = (X1 + ξ1 ) ∨ 0 = (X0 + ξ0 + ξ1 ) ∨ ξ1 ∨ 0 ··· X3 = (X2 + ξ2 ) ∨ 0 = (X0 + ξ0 + ξ1 + ξ2 ) ∨ (ξ1 + ξ2 ) ∨ ξ2 ∨ 0 Xn = (X0 + ξ0 + · · · + ξn−1 ) ∨ (ξ1 + · · · + ξn−1 ) ∨ · · · ∨ ξn−1 ∨ 0. n ≥ 0) is a Markov chain. Thus. n ≥ 0) are i. 2. say. Xn electronic components at time n.d. provided that enough components are available.i. for all n ≥ 0. We will apply the so-called Loynes’ method. and E(An − Dn ) = −µ < 0. on the average. Otherwise. M x M (M −1)−(y−1) x x +1− M M M −1 = 24. while a number of them are sold to the market. Since the Markov chain is represented by this very simple recursion. so that Xn+1 = (Xn + ξn ) ∨ 0. larger than the supply per unit of time. . M −1 y−1 x M y−1 1− x .

we find π(y) = x≥0 π(x)pxy . in computing the distribution of Mn which depends on ξ0 . . where the latter equality follows from the fact that. . . we may replace the latter random variables by ξ1 . the random variables (ξn ) are i. . . . we reverse it. the event whose probability we want to compute is a function of (ξ0 . But Mn+1 = 0 ∨ max Zj 1≤j≤n+1 =0∨ d 1≤j≤n+1 max (Zj − Z1 ) + ξ0 = 0 ∨ (Mn + ξ0 ). however. (n) . n→∞ n→∞ We claim that π(y) := g(y) − g(y − 1) satisfies the balance equations. for all n. . for y ≥ 0. that d (ξ0 . ξn−1 ) = (ξn−1 . . . . ξn . . Hence. P (Mn+1 = y) = P (0 ∨ (Mn + ξ0 ) = y) = x≥0 P (Mn = x)P (0 ∨ (x + ξ0 ) = y) P (Mn = x) pxy . Notice. so we can shuffle them in any way we like without altering their joint distribution.i. ξn−1 ). = x≥0 Since the n-dependent terms are bounded. . Then Since Mn+1 ≥ Mn we have g0y ≤ g0y (n) (n) (n+1) g0y = P (Mn > y). 74 . Therefore. Indeed. In particular. Let Zn := ξ0 + · · · + ξn−1 Mn := Zn ∨ Zn−1 ∨ · · · ∨ Z1 ∨ 0. . .d.Thus. . and. ξ0 ). . instead of considering them in their original order. using π(y) = limn→∞ P (Mn = y). ξn−1 . . we may take the limit of both sides as n → ∞. The limit of a bounded and increasing sequence certainly exists: g(y) := lim g0y = lim P0 (Xn > y). (n) gxy = P (x + ξn−1 + · · · + ξ0 ) ∨ (ξn−2 + · · · + ξ0 ) ∨ · · · ∨ ξ0 ∨ 0 > y . . . .

we will use the assumption that Eξn = −µ < 0. Here. . . as required. 75 . .} > y) is not identically equal to 1. n→∞ This implies that P ( lim Zn = −∞) = 1. From the Law of Large Numbers we have P ( lim Zn /n = −µ) = 1. The case Eξn = 0 is delicate. .} < ∞) = 1. the Markov chain may be transient or null recurrent. . It can also be shown (left as an exercise) that if Eξn > 0 then the Markov chain is transient. This will be the case if g(y) is not identically equal to 1. Z2 . Depending on the actual distribution of ξn . . n→∞ Hence g(y) = P (0 ∨ max{Z1 .So π satisfies the balance equations. n→∞ But then P ( lim Mn = 0 ∨ max{Z1 . Z2 . We must make sure that it also satisfies y π(y) = 1.

d  p(x) = qj . p(−1) = q. so far.and space-homogeneity. . otherwise where p + q = 1. x ∈ Zd . given a function p(x). where C is such that x p(x) = 1. here. besides time. Our Markov chains. a random walk is a Markov chain with time. the skip-free property. if x = −ej . that of spatial homogeneity. To be able to say this. . j = 1. This is the drunkard’s walk we’ve seen before. . if x ∈ {±e1 . A random walk of this type is called simple random walk or nearest neighbour random walk. namely that to go from state x to state y it must pass through all intermediate states because its value can change by at most 1 at each step. ±ed } otherwise 76 . Define  pj . . groups) are allowed. otherwise. 0. For example. Example 2: Let where ej be the vector that has 1 in the j-th position and 0 everywhere else. namely. .PART II: RANDOM WALKS 25 Random walk The purpose of this part of the notes is to take a closer look at a special kind of Markov chain known as random walk. . p(x) = 0.y+z for any translation z. Example 3: Define p(x) = 1 2d . this walk enjoys.g.y = p(y − x). . .y = px+z.and space-homogeneous transition probabilities given by px. where p1 + · · · + pd + q1 + · · · + qd = 1. . if x = ej . we can let S = Zd . we need to give more structure to the state space S. In dimension d = 1. the set of d-dimensional vectors with integer coordinates. This means that the transition probability pxy should depend on x and y only through their relative positions in space. So. . A random walk possesses an additional property. we won’t go beyond Zd . . Example 1: Let p(x) = C2−|x1 |−···−|xd | . If d = 1. have been processes with timehomogeneous transition probabilities. . but. d   0. a structure that enables us to say that px. j = 1. then p(1) = p. More general spaces (e.

One could stop here and say that we have a formula for the n-step transition probability. and this is the symmetric drunkard’s walk. in addition.i. xy The random walk ξ1 + · · · + ξn is a random walk starting from 0. otherwise. It is clear that Sn can be represented as Sn = S0 + ξ1 + · · · + ξn . But this would be nonsense. p(n) = Px (Sn = y) = P (x + ξ1 + · · · + ξn = y) = P (ξ1 + · · · + ξn = y − x). . . x + ξ2 = y) = y p(x)p(y − x). x ∈ Zd . we will mean a sum over all possible values. has equal probabilities to move from one state x to one of its “neighbouring states” x ± e1 . random variables in Zd with common distribution P (ξn = x) = p(x). The convolution of two probabilities p(x). We shall resist the temptation to compute and look at other 77 . We refer to ξn as the increments of the random walk. So. Indeed.) In this notation. where ξ1 . are i. which. . p′ (x). in general.d. Furthermore. Furthermore. the initial state S0 is independent of the increments. p(x) = 0. We easily see that P (ξ1 + · · · + ξn = x) = p∗n (x). Let us denote by Sn the state of the walk at time n.This is a simple random walk. ξ2 . x ± ed . . (Recall that δz is a probability distribution that assigns probability 1 to the point z. . x ∈ Zd is defined as a new probability p ∗ p′ given by (p ∗ p′ )(x) = The reader can easily check that p ∗ p′ = p′ ∗ p. Obviously. we have P (ξ1 + ξ2 = x) = (p ∗ p)(x). (When we write a sum without indicating explicitly the range of the summation variable. notice that P (ξ1 + ξ2 = x) = y P (ξ1 = x. . we can reduce several questions to questions about this random walk. because all we have done is that we dressed the original problem in more fancy clothes. then p(1) = p(−1) = 1/2. If d = 1. computing the n-th order convolution for all n is a notoriously hard problem. . p ∗ (p′ ∗ p′′ ) = (p ∗ p′ ) ∗ p′′ . We will let p∗n be the convolution of p with itself n times. p ∗ δz = p. The special structure of a random walk enables us to give more detailed formulae for its behaviour. We call it simple symmetric random walk.) Now the operation in the last term is known as convolution. y p(x)p′ (y − x).

we should assign. a path of length n. but its practice may be hard. 2) → (5. If not.i−1 = 1/2 for all i ∈ Z. . we shall learn some counting methods. So.i. to each initial position and to each sequence of n signs. −1} then we have a path in the plane that joins the points (0. . In the sequel. ξn ). +1}n . and the sequence of signs is {+1. . 1): (2.methods. So if the initial position is S0 = 0. Here are a couple of meaningful questions: A. A ⊂ {−1. +1. First. +1. a random vector with values in {−1. After all. . +1}n . . 1}n . 1) → (4. Events A that depend only on these the first n random signs are subsets of this sample space.0) We can thus think of the elements of the sample space as paths. all possible values of the vector are equally likely. . we have P (ξ1 = ε1 .1) − (3. for fixed n. how long will that take? C.i+1 = pi.1) + (0.1) + − (5. if we can count the number of elements of A we can compute its probability. + (1. 2) → (3. 0) → (1. Clearly. we are dealing with a uniform distribution on the sample space {−1. where will it go? 26 The simple symmetric random walk in dimension 1: path counting This is a random walk with values in Z and transition probabilities pi.2) where the ξn are i. As a warm-up. 2 where #A is the number of elements of A. Look at the distribution of (ξ1 . −1.d. +1}n .2) (4. 1) → (2. The principle is trivial. . Will the random walk ever return to where it started from? B. ξn = εn ) = 2−n . If yes. . Since there are 2n elements in {0. consider the following: 78 . Therefore. When the walk starts from 0 we can write Sn = ξ1 + · · · + ξn . and so #A P (A) = n . random variables (random signs) with P (ξn = ±1) = 1/2. who cares about the exact value of p∗n (x) for all n and x? Perhaps it is the case (a) that different questions we need to ask or (b) that approximate methods are more informative than exact ones.

in this case. 0)).1 Paths that start and end at specific points Now consider a path that starts from a specific point and ends up at a specific point. S6 < 0. 2 1 3 0 1 2 4 5 6 7 −1 −2 −3 − −4 −7 26. We can easily count that there are 6 paths in A.0) The lightly shaded region contains all paths of length n that start at (0. Hence (n+i)/2 = 0. The number of paths that start from (0. i). x) (n. S7 < 0}. i)} = n n+i 2 More generally. 0) and end up at the specific point (n. y) is given by: #{(0. In if n + i is not an even number then there is no path that starts from (0. the number of paths that start from (m. x) and end up at (n. We use the notation {(m. n 79 . So P (A) = 6/26 = 6/128 = 3/64 ≈ 0. x) and end up at (n. 0) and ends at (n. y)}. Let us first show that Lemma 14. (n. and compute the probability of the event A = {S2 ≥ 0. y)} =: {all paths that start at (m. (n − m + y − x)/2 (23) Remark. 0) (n.047. y) is given by: #{(m. i). x) (n. 0) and end up at (n. Consider a path that starts from (0. The paths comprising this event are shown below. S4 = −1.Example: Assume that S0 = 0. Proof.i) (0.) The darker region shows all paths that start at (0. y)} = n−m . i). 0) and n ends at (n. (There are 2n of them.

Let u be the number of +’s in this path (the number of upward segments) and d the number of −’s (the number of downward segments). Then u + d = n, Hence u = (n + i)/2, d = (n − i)/2. So if we know where the path starts from and where it ends at we know how many times it has gone up and how many down. The question then becomes: given a sequence of n signs out of which u are + and the rest are −, in how many ways can we arrange them? Well, in exactly n ways, and this proves the formula. u Notice that u = (n + i)/2 must be an integer. This is true if n + i is even or, equivalently if n and i are simultaneously even or simultaneously odd. If this is not the case, then there is no path that can go from (0, 0) to (n, i). For example (look at the figure), there is no path that goes from (0, 0) to (3, 2). We shall thus define n to be zero if u is not an integer. u More generally, the number of paths that start from (m, x) and end up at (n, y) equals the number of paths that start from (0, 0) and end up at (n − m, y − x). So, applying the previous formula, we obtain (23). u − d = i.

26.2

Paths that start and end at specific points and do not touch zero at all
(n, y) that never become n−m . − m − y − x)

Lemma 15. Suppose that x, y > 0. The number of paths (m, x) zero (i.e. they do not touch the horizontal axis) is given by: #{(m, x) (n, y); remain > 0} =
1 2 (n

n−m − − m + y − x)

1 2 (n

(24)

Proof.

(m,x) (n,y)

The shaded region contains all possible paths that start at (m, x) and end up at the specific point (n, y) which, in addition, never touch the horizontal axis.

n−m

Consider, instead, a path that starts at (m, x) and ends up at point (n, y) which DOES touch the horizontal axis, like the path indicated by the solid line below. There is a trick, 80

called reflection, which is useful here. Note that to each such path there corresponds another path obtained by following it up to the first time it hits zero and then reflecting it around the horizontal axis. In the figure below, follow the solid path starting at (m, x) up to the first time it hits zero and then follow the dotted path: this is the reflected path.
(n, y)

(m, x)

(n, −y)

A path and its reflection around a horizontal line. Notice that the reflection starts only at the point where the path touches the horizontal line.

Note that if we know the reflected path we can recover the original path. Hence we may as well count the number of reflected paths. But each reflected path starts from (m, x) and ends at (n, −y) and so it necessarily crosses the horizontal axis. We thus apply (23) with −y in place of y: #{(m, x) (n, y); touch zero} = #{(m, x) = So, for x > 0, y > 0, #{(m, x) (n, y); remain > 0} = #{(m, x) =
1 2 (n

n−m . (n − m − y − x)/2

(n, −y)}

n−m − − m + y − x)

(n, y)} − #{(m, x)
1 2 (n

(n, y); touch zero}

n−m . − m − y − x)

Corollary 14. For any y > 0, #{(1, 1) Let m = 1, x = 1 in (24): #{(1, 1) (n, y); remain > 0} = n−1 n−1 − 1 + y) − 1 2 (n − y) − 1 (n − 1)! (n − 1)! − n−y = n+y n−y n−y ! ! 2 −1 ! 2 2 −1 ! 2 1n+y 1n−y n! = − n+y n 2 n 2 ! n−y ! 2 2
1 2 (n

(n, y); remain > 0} =

y n

#{(0, 0)

(n, y)}.

(25)

=

y n

n
n+y 2

,

and the latter term equals the total number of paths from (0, 0) to (n, y).

81

Corollary 15. The number of paths from (m, x) to (n, y) that do not touch level z < min(x, y) is given by: #{(m, x) (n, y); remain > z} = n−m − − m + y − x) n−m . − m − y − x) − z (26)

1 2 (n

1 2 (n

Proof. Because it is only the relative position of x and y with respect to z that plays any role, we have #{(m, x) (n, y); remain > z} = #{(m, x + z) (n, y + z); remain > 0}

and so the formula is obtained by an application of (24). Corollary 16. The number of paths from (0, 0) to (n, y), where y ≥ 0, that remain ≥ 0 is given by: n n #{(0, 0) (n, y); remain ≥ 0} = n+y − n+y (27) 2 2 +1 Proof. #{(0, 0) (n, y); remain ≥ 0} = #{(0, 0) = = where we used (26) with z = −1. Corollary 17 (Catalan numbers). The number of paths from (0, 0) to (n, 0) that remain ≥ 0 is given by: #{(0, 0) Proof. Apply (27) with y = 0: #{(0, 0) (n, 0); remain ≥ 0} = n
n 2

= #{(0, 1) n
n+y 2

(n, y); remain > −1} n −1 n , n+y 2 +1
n−y 2

(n, y + 1); remain > 0}

− −

n
n+y 2

(n, 0); remain ≥ 0} =

1 n+1 . n + 1 n/2

n 2

n −1

=

1 n+1 , n + 1 n/2

where the latter calculation is as in the proof of (25). Corollary 18. The number of paths that start from (0, 0), have length n, and remain nonnegative is given by: #{(0, 0) (n, •); remain ≥ 0} = n . ⌈n/2⌉ (28)

where ⌈x⌉ denotes the smallest integer that is ≥ x. 82

Proof. To see, this, just use (27) and sum over all y ≥ 0. If n = 2m then we need to set y = 2z in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m, •); remain ≥ 0} = 2m m+z − 2m m+z+1

z≥0

2m 2m 2m 2m 2m 2m 2m ± ··· + − + ··· + − + − 2m 2m 2m − 1 m+2 m+1 m+1 m n 2m , = = n/2 m

because all the terms that are denoted by · · · are equal to 0, while the rest cancel with one another, and only the first one remains. If n = 2m + 1 then we need to set y = 2z + 1 in (27) (otherwise (27) is just zero) and so we have #{(0, 0) = (2m + 1, •); remain ≥ 0} = 2m + 1 2m + 1 − m+z+1 m+z+2

z≥0

2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 2m + 1 ±· · · + − +· · ·+ − + − 2m + 1 2m + 1 2m m+3 m+2 m+2 m+1 n 2m + 1 . = = ⌈n/2⌉ m+1

We have shown that the shaded regions contain exactly the same number of paths. Was this, a posteriori, obvious?

2m

2m

27

The simple symmetric random walk in dimension 1: simple probability calculations

Since all paths of certain length have the same probability, we can translate the path counting formulae into corresponding probability formulae.

27.1

Distribution after n steps
n 2−n . (n + y)/2 83

Lemma 16. P0 (Sn = y) =

n This is called the ballot theorem and the reason will be given in Section 29. The left side of (30) equals #{paths (1. n 2 #{paths (m. for now. 27. n−m 2−n−m . we shall just compute it. Sn ≥ 0 | Sn = 0) = 84 1 . if n is even.2 The ballot theorem Theorem 25.3 Some conditional probabilities Lemma 17. . If n. y)} . But. P0 (Sn = y) 2 (n + y) + 1 . We have P0 (Sn = y) = and P (Sn = y|Sm = x) = and we apply (23).P (Sn = y|Sm = x) = In particular. 27. the random walk never becomes zero up to time n is given by P0 (S1 > 0. 0) (n. y)} . S2 > 0. . 1) (n. . (n − m + y − x)/2 u(n) := P0 (Sn = 0) = Proof. The probability that. . Such a simple answer begs for a physical/intuitive explanation. Sn > 0 | Sn = y) = y . n (30) Proof. x) 2n−m (n. n/2 (29) #{(0. y ≥ 0 are both even or both odd then P0 (S1 ≥ 0 . . 0) (n. y) that do not touch zero} × 2−n #{paths (0. y)} × 2−n n−1 n−1 − 1 1 (n − 1 + y − 1) 2 (n − 1 − y − 1) = 2 n 1 (n + y) 2 = y . Sn ≥ 0 | Sn = y) = 1 − In particular. P0 (S1 ≥ 0 . . . . n 2−n . (n/2) + 1 y+1 P0 (Sn = y + 2) = 1 . given that S0 = 0 and Sn = y > 0. if n is even. . .

. . . . . . . . . . . For even n we have P0 (S1 ≥ 0. . .). .) by (ξ1 . and finally (g) is trivial because Sn−1 ≥ 0 means Sn−1 > 0 because Sn−1 cannot take the value 0.Proof. 1 + ξ2 + ξ3 > 0. ξ2 + ξ3 ≥ 0. . Sn ≥ 0) = P0 (S1 = 0. . . Sn < 0) (b) (a) = 2P0 (S1 > 0. Sn = 0) = P0 (Sn = 0). . . . (b) follows from symmetry. Sn ≥ 0) = #{(0. (c) (d) (e) (f ) (g) where (a) is obvious. . . 85 . . . . . . . . 1 + ξ2 + · · · + ξn > 0 | ξ1 = 1) = P0 (S1 ≥ 0. remain ≥ 0} = 2n n 2−n = P0 (Sn = 0). Sn > 0) + P0 (S1 < 0. ξ2 . Sn ≥ 0). . We use (27): P0 (Sn ≥ 0 . . . . . . . ξ3 . If n is even. . n/2 where we used (28) and (29). . . and from the fact that if m is integer then 1 + m > 0 is equivalent to m ≥ 0. . . +1 27. •). 0) (n. Sn ≥ 0. ξ3 . (c) is just conditioning on {ξ1 = 1} and using the fact that S1 = ξ1 . Sn = 0) = P0 (S1 > 0. Sn−1 ≥ 0) = P0 (ξ2 ≥ 0. . . Sn > 0 | ξ1 = 1) = P0 (1 + ξ2 > 0. We next have P0 (S1 = 0.4 Some remarkable identities Theorem 26. P0 (Sn = y) − n n+y 2 n +1 n+y 2 This is further equal to 1− ( n+y 2 ( n! n−y + 1)! ( 2 − 1)! n+y 2 )! ( n−y )! 2 =1− n! n−y 2 n+y 2 + 1 = n+y 2 y+1 . (d) follows from P0 (ξ1 = 1) = 1/2 and from the substitution of the value of ξ1 on the left of the conditioning. Sn ≥ 0 | Sn = y) = n = n+y 2 P0 (Sn ≥ 0 . . Sn = y) P0 (Sn = y) =1− P0 (Sn = y + 2) . . . . . . . . Proof. . (f ) follows from the replacement of (ξ2 . (e) follows from the fact that ξ1 is independent of ξ2 . . P0 (S1 ≥ 0. Sn > 0) = 2P0 (ξ1 = 1)P0 (S2 > 0. . . . and Sn−1 > 0 implies that Sn ≥ 0. . . . . . . . ξ2 + · · · + ξn ≥ 0) = P0 (S1 ≥ 0.

. . Then f (n) := P0 (S1 = 0. given that Sn = 0. Sn−1 < 0. n We know that P0 (Sn = 0) = u(n) = n/2 2−n . . . So the last term is 1/2. If a particle is to the right of a then you see its image on the left. . . n−1 28 The reflection principle for a simple random walk in dimension 1 The reflection principle says the following. Put a two-sided mirror at a. Theorem 27. for even n. . Sn−1 = 0. . we computed.27. . . Sn = 0) P0 (S1 > 0. f (n) = P0 (S1 > 0. . and if the particle is to the left of a then its mirror image is on the right. we see that Sn−1 > 0. So u(n) f (n) = . n−1 Proof. with equal probability. the probability u(n) that the walk (started at 0) attains value 0 in n steps. Sn = 0) + P0 (S1 < 0. we can omit the last event. Sn = 0) Now. Fix a state a. So f (n) = 2P0 (S1 > 0. and u(n) = P0 (Sn = 0). . By symmetry. . . Sn = 0) = 2P0 (Sn−1 > 0. This is not so interesting. Let n be even. 86 . 2 and the last probability can be computed from the ballot theorem–see (30): it is equal to 1/(n − 1). Sn−2 > 0|Sn−1 > 0.5 First return to zero In (29). . Sn = 0) = u(n) . Sn−1 > 0. . Sn−1 > 0. . Since the random walk cannot change sign before becoming zero. . To take care of the last term in the last expression for f (n). So if the particle is in position s then its mirror image is in position s = 2a − s because s − a = a − s. Sn = 0) = P0 (Sn = 0) P (Sn−1 > 0|Sn = 0). we either have Sn−1 = 1 or −1. It is clear that if a particle starts at 0 and performs a simple symmetric random walk then its mirror image starts at a and performs a simple symmetric random walk. We now want to find the probability f (n) that the walk will return to 0 for the first time in n steps. . Sn = 0 means Sn−1 = 1. So 1 f (n) = 2u(n) P0 (S1 > 0. . P0 (Sn−1 > 0. . By the Markov property at time n − 1. . . Also. the two terms must be equal. Sn = 0). Sn = 0. Sn−2 > 0|Sn−1 = 1). .

Thus. . If (Sn ) is a simple symmetric random walk started at 0 then so is (Sn ). Trivially. Proof. we have n ≥ Ta . n ∈ Z+ ) has the same law as (Sn . Sn ≤ a). so we may replace Sn by 2a − Sn (Theorem 28). In the last event. Then. a + (Sn − a). Sn ). If Sn ≥ a then Ta ≤ n. . define Sn = where is the first hitting time of a.1 Distribution of hitting time and maximum We can take advantage of this principle to establish the distribution function of Ta : Theorem 29. Sn ≤ a) + P0 (Sn > a). as a corollary to the above result. observing that Sn has the same distribution as −Sn we get that this is also equal to P0 (|Sn | ≥ a) − 1 P0 (|Sn | = a). called (Sn .What is more interesting is this: Run the particle till it hits a (it will. S1 . 2 Proof. m ≥ 0) is a simple symmetric random walk. . n ≥ Ta . 2 Define now the running maximum Mn = max(S0 . Then 1 P0 (Ta ≤ n) = 2P0 (Sn ≥ a) − P0 (Sn = a) = P0 (|Sn | ≥ a) − P0 (|Sn | = a). But a symmetric random walk has the same law as its negative. the process (STa +m − a. Combining the last two displays gives P0 (Ta ≤ n) = P0 (Sn ≥ a) + P0 (Sn > a) = P0 (Sn ≥ a) + P0 (Sn ≥ a) − P0 (Sn = a). n < Ta . In other words. Sn ≤ a) + P0 (Ta ≤ n. n ≥ 0) be a simple symmetric random walk and a > 0. n < Ta . n ∈ Z+ ). n ≥ Ta By the strong Markov property. Let (Sn . So: P0 (Sn ≥ a) = P0 (Ta ≤ n. . for sure–Theorem 35) and after that consider its mirror image Sn = 2a − Sn . 2a − Sn . we obtain 87 . Sn ≥ a). The last equality follows from the fact that Sn > a implies Ta ≤ n. independent of the past before Ta . P0 (Sn ≥ a) = P0 (Ta ≤ n. Theorem 28. Sn = Sn . So Sn − a can be replaced by a − Sn in the last display and the resulting process. But P0 (Ta ≤ n) = P0 (Ta ≤ n. 2a − Sn ≥ a) = P0 (Ta ≤ n. Sn > a) = P0 (Ta ≤ n. Finally. 28. Then Ta = inf{n ≥ 0 : Sn = a} Sn .

The answer is very simple: Theorem 31. Acad. Sci. e 10 Andr´. 369. Sn = 2x − y). what is the probability that the number of azure items is constantly ahead of the number of black ones? The problem was solved10 in the same year. Observe that Mn ≥ x if and only if Sk ≥ x for some k ≤ n which is equivalent to Tx ≤ n and use Theorem 29. Solution d’un probl`me. Sn = y) = P0 (Tx > n. 1 P0 (Mn ≥ x) = P0 (Tx ≤ n) = 2P0 (Sn ≥ x) − P0 (Sn = x) = P0 (|Sn | ≥ x) − P0 (|Sn | = x). A sample from the urn is a sequence η = (η1 . J. we can replace Sn by 2x − Sn in the last part of the last display and get P0 (Tx ≤ n. where ηi indicates the colour of the i-th item picked. 105. (31) x > y. But if x > y then 2x − y > x and so {Sn = 2x − y} ⊂ {Tx ≤ n}. Paris. Since Mn < x is equivalent to Tx > n. ornamental objects. by applying the reflection principle (Theorem 28). D. usually a vase furnished with a foot or pedestal. then. The set of values of η contains n elements and η is uniformly distributed a in this set. The original ballot problem. . 29 Urns and the ballot theorem Suppose that an urn8 contains n items. we have P0 (Mn < x. of which a are coloured azure and b black. Compt. and anciently for holding ballots to be drawn. An urn is a vessel. Compt. n = a + b. 436-437. We can do more: we can derive the joint distribution of Mn and Sn : Theorem 30. . 105. 8 88 . then the probability that. . 9 Bertrand. which results into P0 (Tx ≤ n. the number of selected azure items is always ahead of the black ones equals (a − b)/n. the ashes of the dead after cremation. Sci. e e e Paris. Sn = y). as for holding liquids. posed in 18879 asks the following: if we start picking the items from the urn one by one without replacement. Sn = y) = P0 (Sn = y) − P0 (Tx ≤ n. employed for different purposes. Acad. ηn ). gives the result. This.Corollary 19. Rend. (1887). . P0 (Mn < x. Bertrand. Proof. It is the latter use we are interested in in Probability and Statistics. as long as a ≥ b. Sn = y) = P0 (Tx ≤ n. If Tx ≤ n. combined with (31). during sampling without replacement. Solution directe du probl`me r´solu par M. If an urn contains a azure items and b = n−a black items. Sn = y) = P0 (Sn = y) − P0 (Sn = 2x − y) . (1887). Sn = y) = P0 (Sn = 2x − y). Rend. 2 Proof.

Since an urn with parameters n. where y = a − (n − a) = 2a − n. . . ξn ) has a uniform distribution.i−1 = q = 1 − p. the ‘black’ items correspond to − signs). . . Since the ‘azure’ items correspond to + signs (respectively. We need to show that. So the number of possible values of (ξ1 . a we have that a out of the n increments will be +1 and b will be −1. n Since y = a − (n − a) = a − b. . . . . the sequence (ξ1 . . . But in (30) we computed that y P0 (S1 > 0. a can be realised by a simple symmetric random walk starting from 0 and ending at Sn = y. Consider a simple symmetric random walk (Sn . . we can translate the original question into a question about the random walk. . εn be elements of {−1. We have P0 (Sn = k) = n n+k 2 pi. we have: P0 (ξ1 = ε1 . (The word “asymmetric” will always mean “not necessarily symmetric”. the sequence (ξ1 . Sn > 0} that the random walk remains positive. Proof of Theorem 31. . a. . 89 . . the event that the azure are always ahead of the black ones during sampling is precisely the event {S1 > 0. conditional on {Sn = y}. −n ≤ k ≤ n. An urn can be realised quite easily be a random walk: Theorem 32. . It remains to show that all values are equally a likely. p(n+k)/2 q (n−k)/2 . . indeed. Then. . . b be defined by a + b = n. . ηn ) an urn process with parameters n.) So pi. . . always assuming (without any loss of generality) that a ≥ b = n − a. ξn = εn ) P0 (Sn = y) 2−n 1 = = . . . n n −n 2 (n + y)/2 a Thus. Proof. . ξn ) is uniformly distributed in a set with n elements. . . ξn ) is. so ‘azure’ items are +1 and ‘black’ items are −1. . . 30 The asymmetric simple random walk in dimension 1 Consider now a simple random walk with values in Z. conditional on {Sn = y}. . . ξn ) of the increments is an urn process with parameters n and a = (n + y)/2. . . (ξ1 . n ≥ 0) with values in Z and let y be a positive integer. . ξn = εn | Sn = y) = P0 (ξ1 = ε1 . conditional on the event {Sn = y}. . . letting a. a − b = y.We shall call (η1 . i ∈ Z. n . using the definition of conditional probability and Lemma 16. Sn > 0 | Sn = y) = . Then. . The values of each ξi are ±1.i+1 = p. but which is not necessarily symmetric. . the result follows. . . Let ε1 . But if Sn = y then. +1} such that ε1 + · · · + εn = y.

Consider a simple random walk starting from 0.} ∪ {∞}. Let ϕ(s) := E0 sT1 . This follows from the spatial homogeneity of the transition probabilities: to reach y starting from x. If S1 = −1 then T1 is the sum of three times: one unit of time spent ′ to go from 0 to −1. We will approach this using probability generating functions. The process (Tx . 90 . See figure below. and all t ≥ 0. plus a time T1 required to go from −1 to 0 for the first time. (We tacitly assume that S0 = 0.d. 1. In view of Lemma 19 we need to know the distribution of T1 . Start with S0 = 0 and watch the process until the first n = T1 such that Sn = 1. Px (Ty = t) = P0 (Ty−x = t). 2. we make essential use of the skip-free property. E0 sTx = ϕ(s)x . The rest is a consequence of the strong Markov property. Lemma 18. . This random variable takes values in {0. . namely the obvious fact that a simple random walk starting from 0 must pass through levels 1. Lemma 19.30. . . For all x. In view of this. then the summands in Tx = T1 + (T2 − T1 ) + · · · + (Tx − Tx−1 ) are i. Proof. only the relative position of y with respect to x is needed.i. 2.1 First hitting time analysis We will use probabilistic + analytical methods to compute the distribution of the hitting times Tx = inf{n ≥ 0 : Sn = x}. Consider a simple random walk starting from 0. x ∈ Z. for x > 0. Consider a simple random walk starting from 0. y ∈ Z. If x > 0. If S1 = 1 then T1 = 1. . it is no loss of generality to consider the random walk starting from 0. Theorem 33.) Then. Then the probability generating function of the first time that level 1 is reached is given by ϕ(s) = E0 sT1 = 1− 1 − 4pqs2 . Proof. To prove this. Corollary 20. random variables. . x − 1 in order to reach x. 2qs Proof. . x ≥ 0) is also a random walk. plus a ′′ further time T1 required to go from 0 to +1 for the first time.

Hence the previous formula applies with the roles of p and q interchanged. Consider a simple random walk starting from 0. Use Lemma 19 and the previous formulae of Theorem 33 and Corollary 21. 91 . Corollary 22. Hence T1 = This can be used as follows: ϕ(s) = E0 [s1(S1 = 1) + s1+T1 +T1 1(S1 = −1)] = sP (S1 = 1) + sE0 [sT1 sT1 |S1 = −1]P (S1 = −1) = sp + sE0 [sT1 sT1 |S1 = −1]q ′′ ′ But. if x < 0 2ps Proof.  0 if x > 0  2qs Tx (32) E0 s = √ |x|   E sT−1 |x| = 1− 1−4pqs2  0 . The formula is obtained by solving the quadratic. Then the probability generating function of the first time that level −1 is reached is given by E0 sT−1 = 1− 1 − 4pqs2 . 2ps Proof.Sn 1 0 −1 n T1’ T1’’ If S1 = −1 then. from the strong Markov property. But note that −Sn is a simple random walk with upward transition probability q and downward transition probability p. Then the probability generating function of the first time that level x ∈ Z is reached is given by  √ x  E sT1 x = 1− 1−4pqs2 . We thus have ′ ′′ ′ ′′ ′ ′′ 1 ′ ′′ 1 + T1 + T 1 if S1 = 1 if S1 = −1 ϕ(s) = ps + qsϕ(s)2 . T1 is independent of T1 conditional on S1 = −1 and each with distribution that of T1 . The first time n such that Sn = −1 is the first time n such that −Sn = 1. Consider a simple random walk starting from 0. in order to hit 1. the RW must first hit 0 (it takes time ′ ′′ T1 to do this) and then 1 (it takes a further time of T1 ). Corollary 21.

if S1 = 1. ′ But if S1 = −1 then T0 equals 1 plus the time required for the walk to hit 0 starting from −1. then we can omit the k upper limit in the sum.2: P0 (T0 = n) = f (n) = u(n)/(n − 1) is n is even. the same as the time required for the walk to hit 1 starting ′ from 0. k 92 . 1(S1 = −1) ′ + E0 s T 0 ′ ′ 1(S1 = 1) = qE0 sT0 | S1 = −1 + pE0 sT0 | S1 = 1 . Then the probability generating function of the first return time to 0 is given by E0 s T 0 = 1 − Proof.2 First return to the origin Now let us ask: If the particle starts at 0. k by the binomial theorem. Similarly. how long will it take till it first returns to 0? We are talking about the stopping time ′ T0 := inf{n ≥ 1 : Sn = 0}. and simply write (1 + x)n = k≥0 n k x . E0 s T 0 = E0 s T 0 ′ ′ ′ 1 − 4pqs2 . If α = n is a positive integer then n (1 + x)n = k=0 n k x . this was found in §30. in distribution. then T0 equals 1 plus a time whose distribution that of the time required for the walk to hit −1 starting from 0. 1 − 4pqs2 . If we define n to be equal to 0 for k > n. where α is a real number. Consider a simple random walk starting from 0. E0 sT0 = qE0 (s1+T1 ) + pE0 (s1+T−1 ) = qs 1− 1 − 4pqs2 1− + ps 2qs 1 − 4pqs2 . 2ps ′ =1− 30. we have computed the probability generating function E0 sT0 –see Theorem 34 The tool we need here is Taylor’s theorem for the function f (x) = (1 + x)α . The latter is. For the general case. We again do first-step analysis. (Pay attention: it is important to write n ≥ 1 inside the set!) Theorem 34.30.3 The distribution of the first return to the origin We wish to find the exact distribution of the first return time to the origin. For a symmetric ′ random walk.

Recall that

C(n, k) = n!/(k!(n − k)!) = n(n − 1)···(n − k + 1)/k!.

It turns out that the formula is formally true for general α, and it looks exactly the same:

(1 + x)^α = Σ_{k≥0} C(α, k) x^k,

as long as we define

C(α, k) = α(α − 1)···(α − k + 1)/k!.

The adverb 'formally' is supposed to mean that this formula holds for some x around 0, but we won't worry about which ones. For example, we will show that

(1 − x)^{−1} = 1 + x + x² + x³ + ···,

which is of course the all too-familiar geometric series. Indeed,

(1 − x)^{−1} = Σ_{k≥0} C(−1, k) (−x)^k.

But

C(−1, k) = (−1)(−2)···(−1 − k + 1)/k! = (−1)^k k!/k! = (−1)^k,

so

(1 − x)^{−1} = Σ_{k≥0} (−1)^k (−x)^k = Σ_{k≥0} (−1)^{2k} x^k = Σ_{k≥0} x^k,

as needed. As another example, let us compute √(1 − x). We have

√(1 − x) = (1 − x)^{1/2} = Σ_{k≥0} C(1/2, k) (−1)^k x^k.

We can write this in more familiar symbols if we observe that

C(1/2, k) (−1)^k = −(1/(2k − 1)) C(2k, k) 4^{−k},

and this is shown by direct computation. So

√(1 − x) = 1 − Σ_{k≥1} (1/(2k − 1)) C(2k, k) (x/4)^k.

We apply this to E0 s^{T0′} = 1 − √(1 − 4pqs²):

E0 s^{T0′} = Σ_{k≥1} (1/(2k − 1)) C(2k, k) (pq)^k s^{2k}.

But, by the definition of E0 s^{T0′},

E0 s^{T0′} = Σ_{n≥0} P0(T0′ = n) s^n.

Comparing the two 'infinite polynomials' we see that we must have n = 2k (we knew that: the random walk can return to 0 only in an even number of steps) and then

P0(T0′ = 2k) = (1/(2k − 1)) C(2k, k) p^k q^k = (1/(2k − 1)) P0(S_{2k} = 0).
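For small k this formula can be checked exactly by brute-force enumeration of all 2k-step paths. The sketch below (with the arbitrary illustrative choice p = 0.3) does precisely that.

```python
from itertools import product
from math import comb

def first_return_prob(k, p):
    # exact P0(T0' = 2k): enumerate every (+1/-1)-path of length 2k
    q, total = 1 - p, 0.0
    for steps in product((1, -1), repeat=2*k):
        s, first_zero = 0, None
        for n, d in enumerate(steps, start=1):
            s += d
            if s == 0:
                first_zero = n
                break
        if first_zero == 2*k:
            total += p**k * q**k   # such a path has k up- and k down-steps
    return total

p = 0.3
for k in (1, 2, 3, 4):
    print(k, first_return_prob(k, p), comb(2*k, k) * (p*(1-p))**k / (2*k - 1))
```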

31 Recurrence properties of simple random walks

Consider again a simple random walk Sn with values in Z, with upward probability p and downward probability q = 1 − p. By applying the law of large numbers we have that

P0( lim_{n→∞} Sn/n = p − q ) = 1.

• If p > 1/2 then p − q > 0 and so Sn tends to +∞ with probability one. Hence the walk is transient.
• If p < 1/2 then p − q < 0 and so Sn tends to −∞ with probability one. Hence it is again transient.
• If p = 1/2 then p − q = 0 and the law of large numbers only tells us that the walk is not positive recurrent. That it is actually null-recurrent follows from Markov chain theory.

We will see how the results about the generating functions of the hitting and return times help us answer these questions directly, without appeal to Markov chain theory. The key fact needed here is

P0(Tx < ∞) = lim_{s↑1} E0 s^{Tx},

which is true for any probability generating function. We apply this to (32). The quantity inside the square root becomes 1 − 4pq. But remember that q = 1 − p, so

1 − 4pq = 1 − 4p + 4p² = (1 − 2p)², hence √(1 − 4pq) = |1 − 2p| = |p − q|.

So

P0(Tx < ∞) = ((1 − |p − q|)/(2q))^x if x > 0, and ((1 − |p − q|)/(2p))^{|x|} if x < 0.

In particular, if p = q then P0(Tx < ∞) = 1 for all x ∈ Z. This formula can be written more neatly as follows:

Theorem 35.

P0(Tx < ∞) = ((p/q) ∧ 1)^x if x > 0, and ((q/p) ∧ 1)^{|x|} if x < 0.   (33)

Proof. If p > q then |p − q| = p − q, so (1 − |p − q|)/(2q) = (1 − p + q)/(2q) = 2q/(2q) = 1 = (p/q) ∧ 1. If p < q then |p − q| = q − p and, again, (1 − |p − q|)/(2q) = 2p/(2q) = p/q = (p/q) ∧ 1. The case x < 0 is identical, with the roles of p and q interchanged, which is what we want.

Corollary 23. The simple symmetric random walk with values in Z is null-recurrent.

Proof. The last part of Theorem 35 tells us that it is recurrent. We know that it cannot be positive recurrent, and so it is null-recurrent.

31.1 Asymptotic distribution of the maximum

Suppose now that p < 1/2. Then limn→∞ Sn = −∞ with probability 1. This means that there is a largest value that will ever be attained by the random walk. This value is random and is denoted by

M∞ = sup(S0, S1, S2, ...).

Theorem 36. Consider a simple random walk with values in Z. Assume p < 1/2. Then the overall maximum M∞ has a geometric distribution:

P0(M∞ ≥ x) = (p/q)^x, x ≥ 0.

Proof. If we let Mn = max(S0, ..., Sn), then the sequence Mn is nondecreasing, and every nondecreasing sequence has a limit, except that the limit may be ∞; here it is < ∞ with probability 1, because p < 1/2. So M∞ = limn→∞ Mn. Indeed,

P0(M∞ ≥ x) = lim_{n→∞} P0(Mn ≥ x) = lim_{n→∞} P0(Tx ≤ n) = P0(Tx < ∞) = (p/q)^x, x ≥ 0,

by Theorem 35.
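The geometric law of the overall maximum is easy to see in simulation. With p < 1/2 the walk drifts to −∞, so its maximum is, with overwhelming probability, attained within a modest finite horizon; the sketch below uses the arbitrary illustrative choices p = 0.3 and a 500-step horizon.

```python
import random
from collections import Counter

def walk_max(p, steps=500):
    # maximum of a negatively drifting walk; a finite horizon captures
    # M_infinity with high probability when p < 1/2
    s = m = 0
    for _ in range(steps):
        s += 1 if random.random() < p else -1
        if s > m:
            m = s
    return m

p, trials = 0.3, 20_000
counts = Counter(walk_max(p) for _ in range(trials))
for x in range(5):
    empirical = sum(c for m, c in counts.items() if m >= x) / trials
    print(x, round(empirical, 4), round((p/(1-p))**x, 4))
```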

31.2 Return probabilities

Only in the symmetric case p = q = 1/2 will the random walk return to the point where it started from with probability 1. Otherwise, with some positive probability, it will never ever return to the point where it started from. What is this probability?

Theorem 37. Consider a random walk with values in Z, and let T0′ = inf{n ≥ 1 : Sn = 0} be the first return time to 0. Then

P0(T0′ < ∞) = 2(p ∧ q).

Unless p = 1/2, this probability is always strictly less than 1.

Proof. The probability generating function of T0′ was obtained in Theorem 34. But

P0(T0′ < ∞) = lim_{s↑1} E0 s^{T0′} = 1 − √(1 − 4pq) = 1 − |p − q|.

If p = q this gives 1, as already known. If p > q this gives 1 − (p − q) = 1 − (1 − q − q) = 2q. And if p < q this gives 1 − (q − p) = 1 − (1 − p − p) = 2p. In all cases, P0(T0′ < ∞) = 2(p ∧ q).

31.3 Total number of visits to a state

If the random walk is recurrent then it visits any state infinitely many times, meaning that Pi(Ji = ∞) = 1. But if it is transient, the total number of visits to a state will be finite. The total number of visits to state i is given by

Ji = Σ_{n=1}^∞ 1(Sn = i),   (34)

a random variable first considered in Section 10, where it was also shown to be geometric: in Lemma 4 we showed that, if a Markov chain starts from a state i, then Ji is geometric. We now compute its exact distribution.

Theorem 38. Consider a simple random walk with values in Z. Then, starting from zero, the total number J0 of visits to state 0 is geometrically distributed, with

P0(J0 ≥ k) = (2(p ∧ q))^k, k = 0, 1, 2, ...,
E0 J0 = 2(p ∧ q) / (1 − 2(p ∧ q)).

Proof. J0 ≥ 1 if and only if the first return time to zero is finite. So

P0(J0 ≥ 1) = P0(T0′ < ∞) = 2(p ∧ q),

where the last probability was computed in Theorem 37. And so

P0(J0 ≥ k) = P0(J0 ≥ 1)^k = (2(p ∧ q))^k, k = 0, 1, 2, ....

This proves the first formula. For the second formula we have

E0 J0 = Σ_{k=1}^∞ P0(J0 ≥ k) = Σ_{k=1}^∞ (2(p ∧ q))^k = 2(p ∧ q)/(1 − 2(p ∧ q)),

this being a geometric series.

Remark 1: By spatial homogeneity, the same formulae apply for any i:

Pi(Ji ≥ k) = (2(p ∧ q))^k, Ei Ji = 2(p ∧ q)/(1 − 2(p ∧ q)).

Remark 2: The formulae work for all values of p ∈ [0, 1], even for p = 1/2. In the latter case Pi(Ji ≥ k) = 1 for all k, i.e. Pi(Ji = ∞) = 1.

We now look at the number of visits to some state x when we start from a different state.

Theorem 39. Consider a simple random walk with values in Z. If p ≥ 1/2 then

P0(Jx ≥ k) = (q/p)^{(−x)∨0} (2q)^{k−1}, k ≥ 1,   E0 Jx = (q/p)^{(−x)∨0} / (1 − 2q),

and if p ≤ 1/2 then

P0(Jx ≥ k) = (p/q)^{x∨0} (2p)^{k−1}, k ≥ 1,   E0 Jx = (p/q)^{x∨0} / (1 − 2p).

Proof. Suppose x > 0. Then

P0(Jx ≥ k) = P0(Tx < ∞, Jx ≥ k),

simply because if Jx ≥ k ≥ 1 then Tx < ∞. Apply now the strong Markov property at Tx:

P0(Tx < ∞, Jx ≥ k) = P0(Tx < ∞) Px(Jx ≥ k − 1).

The reason that we replaced k by k − 1 is that, given that Tx < ∞, there has already been one visit to state x, and so we need at least k − 1 remaining visits in order to have at least k visits in total. The first term was computed in Theorem 35, and the second in Theorem 38. So we obtain

P0(Jx ≥ k) = ((p/q) ∧ 1)^x (2(p ∧ q))^{k−1}.

If p ≥ 1/2 this gives P0(Jx ≥ k) = (2q)^{k−1}; if p ≤ 1/2 it gives P0(Jx ≥ k) = (p/q)^x (2p)^{k−1}. We repeat the argument for x < 0 and find

P0(Jx ≥ k) = ((q/p) ∧ 1)^{|x|} (2(p ∧ q))^{k−1}.

If p ≥ 1/2 this gives P0(Jx ≥ k) = (q/p)^{|x|} (2q)^{k−1}; if p ≤ 1/2 it gives P0(Jx ≥ k) = (2p)^{k−1}. As for the expectation, we apply E0 Jx = Σ_{k=1}^∞ P0(Jx ≥ k) and perform the sum of a geometric series.

Remark 1: The expectation E0 Jx is finite if p ≠ 1/2.

Remark 2: If the initial state is not 0 then use spatial homogeneity to obtain the analogous formulae:

Py(Jx ≥ k) = P0(Jx−y ≥ k), Ey Jx = E0 Jx−y.

In other words, if the random walk "drifts down" (i.e. p < 1/2) then, for x > 0, E0 Jx drops down to zero geometrically fast as x tends to infinity. But for x < 0 we have E0 Jx = 1/(1 − 2p) regardless of x; i.e., starting from 0, the walk will visit any point below 0 exactly the same number of times on the average.
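These visit counts are also straightforward to check by simulation. A rough sketch follows (the choice p = 0.3 is an arbitrary illustration); the horizon is finite, which can only undercount visits, but with a strong downward drift the error is negligible.

```python
import random

def visits(p, x, steps=1_000):
    # number of visits of the walk to state x within a finite horizon
    s, count = 0, 0
    for _ in range(steps):
        s += 1 if random.random() < p else -1
        count += (s == x)
    return count

p, trials = 0.3, 10_000
for x in (-2, 0, 3):
    mean = sum(visits(p, x) for _ in range(trials)) / trials
    exact = (p / (1 - p)) ** max(x, 0) / (1 - 2*p)   # E0 Jx from Theorem 39
    print(x, round(mean, 3), round(exact, 3))
```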

32 Duality

Consider a random walk Sk = Σ_{i=1}^k ξi, i.e. a sum of i.i.d. random variables. Fix an integer n. Then the dual (Ŝk, 0 ≤ k ≤ n) of (Sk, 0 ≤ k ≤ n) is defined by

Ŝk := Sn − Sn−k, 0 ≤ k ≤ n.

Theorem 40. (Ŝk, 0 ≤ k ≤ n) has the same distribution as (Sk, 0 ≤ k ≤ n).

Proof. Use the obvious fact that (ξ1, ..., ξn) has the same distribution as (ξn, ξn−1, ..., ξ1).

Thus, every time we have a probability for S we can translate it into a probability for Ŝ, which is basically another probability for S. Here is an application of this:

Theorem 41. Let Tx = inf{n ≥ 1 : Sn = x} be the first time that x will be visited. Then, for a simple random walk and for x > 0,

P0(Tx = n) = (x/n) P0(Sn = x).

Proof. Fix an integer n. Rewrite the ballot theorem in terms of the dual:

P0(S1 > 0, ..., Sn > 0 | Sn = x) = x/n.

But the left-hand side equals

P0(Sn − Sn−k > 0, 1 ≤ k ≤ n | Sn = x) = P0(Sk < x, 0 ≤ k ≤ n − 1 | Sn = x) = P0(Tx = n, Sn = x)/P0(Sn = x) = P0(Tx = n)/P0(Sn = x),

because the two processes have the same distribution.

Pause for reflection: We have been dealing with the random times Tx for simple random walks (symmetric or asymmetric) from the beginning of the lectures. In Lemma 19 we observed that, for a simple random walk and for x > 0, Tx is the sum of x i.i.d. random variables, each distributed as T1, and thus derived the generating function of Tx:

E0 s^{Tx} = ((1 − √(1 − 4pqs²))/(2qs))^x.   (35)

Specialising to the symmetric case we have

E0 s^{Tx} = ((1 − √(1 − s²))/s)^x.

In Theorem 29 we used the reflection principle and derived the distribution function of Tx for a simple symmetric random walk:

P0(Tx ≤ n) = P0(|Sn| ≥ x) − (1/2) P0(|Sn| = x).   (36)

Lastly, in Theorem 41, we derived, using duality, the probabilities

P0(Tx = n) = (x/n) P0(Sn = x), x > 0.   (37)

We remark here something methodological: It was completely different techniques that led to results about, essentially, the same thing: the distribution of Tx . This is an important lesson to learn. Of course, the three formulae should be compatible: the number (37) should be the coefficient of sn in (35); summing up (37) should give (36), or taking differences in (36) should give (37). But on trying to verify these things algebraically, the reader will realise that it is not easy.
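While an algebraic verification is painful, a numerical one is immediate. The sketch below computes P0(Tx = n) for the symmetric walk in two independent ways, by dynamic programming on walks killed at level x and by formula (37), and the two columns agree.

```python
from math import comb

def hitting_dist_dp(x, nmax):
    # P0(Tx = n) for the SSRW: propagate the walk, killing it at level x
    alive = {0: 1.0}
    out = {}
    for n in range(1, nmax + 1):
        nxt = {}
        for pos, pr in alive.items():
            for step in (1, -1):
                nxt[pos + step] = nxt.get(pos + step, 0.0) + pr / 2
        out[n] = nxt.pop(x, 0.0)   # mass arriving at x for the first time
        alive = nxt
    return out

x, nmax = 2, 12
dp = hitting_dist_dp(x, nmax)
for n in range(1, nmax + 1):
    dual = x / n * comb(n, (n + x) // 2) / 2**n if (n + x) % 2 == 0 else 0.0
    print(n, round(dp[n], 6), round(dual, 6))
```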

33 Amount of time that a SSRW is positive*

Consider a SSRW (Sn) starting from 0. Define its trajectory as the polygonal line obtained by joining the points (n, Sn) in the plane, as described earlier. Watch this trajectory up to time 2n. Let T+(2n) be the number of sides of the polygonal line that are above the horizontal axis and T−(2n) = 2n − T+(2n) the number of sides that are below. Clearly, T±(2n) are both even numbers because, if the RW is at 0, it takes an even number of steps to return to 0. We can think of T+(2n) as the number of leads in 2n coin tosses.

[Figure 1: Part of the time the SSRW is positive and part negative. After it returns to 0 at time T0′, it behaves again like a SSRW and is independent of the past.]

We say that we have a lead at some time if the number of heads up to that time exceeds the number of tails. We are dealing with a fair coin which is tossed 2n times, so the event that the number of heads equals k has maximum probability when k = n (heads = tails). Most people take this statement and translate it into a statement about T+(2n), arguing that P(T+(2n) = m) is maximised when m = n, i.e. when T+(2n) = T−(2n). This is wrong. In fact, as the theorem below shows, P(T+(2n) = m) is maximised when m = 0 or m = 2n.

Theorem 42 (distribution of the total time the random walk is positive).

P(T+(2n) = 2k) = P(S_{2k} = 0) P(S_{2n−2k} = 0), 0 ≤ k ≤ n.

Proof. The idea is to condition at the first time T0′ that the RW returns to 0:

P(T+(2n) = 2k) = Σ_{r=1}^n P(T+(2n) = 2k, T0′ = 2r).


Between 0 and T0′ the random walk is either above the horizontal axis or below; call the first event A+ and the second A−, and write

P(T+(2n) = 2k) = Σ_{r=1}^n P(A+, T0′ = 2r) P(T+(2n) = 2k | A+, T0′ = 2r) + Σ_{r=1}^n P(A−, T0′ = 2r) P(T+(2n) = 2k | A−, T0′ = 2r).

In the first case,

P(T+(2n) = 2k | A+, T0′ = 2r) = P(T+(2n) = 2k − 2r + 2r | A+, T0′ = 2r) = P(T+(2n − 2r) = 2k − 2r),

because, if T0′ = 2r and A+ occurs, then the RW has already spent 2r units of time above the axis, and so T+(2n) is reduced by 2r. Furthermore, we may remove the conditioning because the future evolution of the RW after T0′ is independent of the past. In the second case, a similar argument shows

P(T+(2n) = 2k | A−, T0′ = 2r) = P(T+(2n − 2r) = 2k).

We also observe that, by symmetry,

P(A+, T0′ = 2r) = P(A−, T0′ = 2r) = (1/2) P(T0′ = 2r) = (1/2) f_{2r}.

Letting p_{2k,2n} = P(T+(2n) = 2k), we have shown that

p_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} p_{2k−2r, 2n−2r} + (1/2) Σ_{r=1}^{n−k} f_{2r} p_{2k, 2n−2r}.

We want to show that p_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2k} = P(S_{2k} = 0). With this hypothesis, we can start playing with the recursion of the display above and, using induction on n, we will realise that

p_{2k,2n} = u_{2n−2k} (1/2) Σ_{r=1}^k f_{2r} u_{2k−2r} + u_{2k} (1/2) Σ_{r=1}^{n−k} f_{2r} u_{2n−2k−2r};

each of the two sums then collapses by the first-return decomposition u_{2m} = Σ_{r=1}^m f_{2r} u_{2m−2r}, giving p_{2k,2n} = u_{2k} u_{2n−2k}, as required. Explicitly, we have

P(T+(2n) = 2k) = C(2k, k) C(2n − 2k, n − k) 2^{−2n}, 0 ≤ k ≤ n.   (38)

With 2n = 100, we plot the function P(T+(100) = 2k) for 0 ≤ k ≤ 50 below. Notice that it is maximised at the ends of the interval.


[Plot of P(T+(100) = 2k) against k, 0 ≤ k ≤ 50: a U-shaped curve, roughly 0.08 at k = 0 and k = 50 and roughly 0.02 in the middle.]
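Formula (38) makes this plot a one-liner to reproduce; a small sketch:

```python
from math import comb

def t_plus_dist(n):
    # P(T+(2n) = 2k), k = 0..n, by formula (38)
    return [comb(2*k, k) * comb(2*(n - k), n - k) / 4**n for k in range(n + 1)]

d = t_plus_dist(50)            # 2n = 100
print(round(sum(d), 12))       # the probabilities sum to 1
print(round(d[0], 4), round(d[25], 4), round(d[50], 4))   # ends beat the middle
```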

The arcsine law*
Use of Stirling's approximation in formula (38) gives

P(T+(2n) = 2k) ∼ 1/(π √(k(n − k))), as k → ∞ and n − k → ∞.

Now consider the calculation of the following probability:

P(T+(2n)/(2n) ∈ [a, b]) = Σ_{na ≤ k ≤ nb} P(T+(2n)/(2n) = k/n) ∼ Σ_{a ≤ k/n ≤ b} 1/(π √(k(n − k))) = Σ_{a ≤ k/n ≤ b} (1/n) · 1/(π √((k/n)(1 − k/n))).

This is easily seen to have a limit as n → ∞, the limit being

∫_a^b dt/(π √(t(1 − t))),

because the last sum is simply a Riemann approximation to the last integral. This shows the celebrated arcsine law:

lim_{n→∞} P(T+(2n)/(2n) ≤ x) = ∫_0^x dt/(π √(t(1 − t))) = (2/π) arcsin √x.

We thus have that the limiting distribution of T+(2n)/(2n) (the fraction of time that the RW is positive), as n → ∞, is the distribution function (2/π) arcsin √x, 0 < x < 1. This is known as the arcsine law. This distribution function has density 1/(π √(x(1 − x))), 0 < x < 1, the plot of which is amazingly close to the plot of the figure above.

Now consider again a SSRW and consider the hitting times (Tx, x ≥ 0). This is another random walk, because the summands in Tx = Σ_{y=1}^x (Ty − Ty−1) are i.i.d., all distributed as T1:

P(T1 ≥ 2n) = P(S_{2n} = 0) = C(2n, n) 2^{−2n} ∼ 1/√(πn).
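The tail identity P(T1 ≥ 2n) = P(S_{2n} = 0) and its 1/√(πn) asymptotics can be sanity-checked by simulation; a rough sketch (n = 10 is an arbitrary illustrative choice):

```python
import random
from math import comb, pi, sqrt

def t1_tail(n, trials=100_000):
    # Monte Carlo estimate of P(T1 >= 2n): level 1 not hit in steps 1..2n-1
    survived = 0
    for _ in range(trials):
        s = 0
        for _ in range(2*n - 1):
            s += random.choice((-1, 1))
            if s == 1:
                break
        else:
            survived += 1
    return survived / trials

n = 10
print(t1_tail(n), comb(2*n, n) / 4**n, 1 / sqrt(pi * n))
```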

From this, we immediately get that the expectation of T1 is infinite: ET1 = ∞. Many other quantities can be approximated in a similar manner.

34 Simple symmetric random walk in two dimensions

Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners are described by coordinates (x, y), where x, y ∈ Z. He starts at (0, 0) and, being totally drunk, he moves east, west, north or south with equal probability (1/4). When he reaches a corner he moves again in the same manner, going at random east, west, north or south. We describe this motion as a random walk in Z² = Z × Z: Let ξ1, ξ2, ... be i.i.d. random vectors such that

P(ξn = (0, 1)) = P(ξn = (0, −1)) = P(ξn = (1, 0)) = P(ξn = (−1, 0)) = 1/4.

We then let S0 = (0, 0) and Sn = Σ_{i=1}^n ξi. Since ξ1, ξ2, ... are i.i.d., (Sn, n ∈ Z+) is a sum of i.i.d. random variables and hence a random walk. If x ∈ Z², we let x^1 be its horizontal coordinate and x^2 its vertical one, so Sn = (S_n^1, S_n^2).

Lemma 20. (S_n^1, n ∈ Z+) is a random walk. The same is true for (S_n^2, n ∈ Z+). However, the two random walks are not independent.

Proof. Since ξ1, ξ2, ... are independent, so are ξ_1^1, ξ_2^1, ...; they are also identically distributed, with

P(ξ_n^1 = 0) = P(ξn = (0, 1)) + P(ξn = (0, −1)) = 1/2,
P(ξ_n^1 = 1) = P(ξn = (1, 0)) = 1/4,
P(ξ_n^1 = −1) = P(ξn = (−1, 0)) = 1/4.

Hence (S_n^1, n ∈ Z+) is a random walk. It is not a simple random walk, because there is a possibility that the increment takes the value 0. The same argument applies to (S_n^2, n ∈ Z+), showing that it, too, is a random walk. But the two stochastic processes are not independent: for each n, the random variables ξ_n^1, ξ_n^2 are not independent, simply because if one takes the value 0 the other is nonzero.

Consider now a change of coordinates: rotate the axes by 45° (and scale them), i.e. map (x^1, x^2) into (x^1 + x^2, x^1 − x^2). So let

Rn = (R_n^1, R_n^2) := (S_n^1 + S_n^2, S_n^1 − S_n^2).

We have:

Lemma 21. (R_n^1, n ∈ Z+) is a simple symmetric random walk in one dimension. The same is true for (R_n^2, n ∈ Z+). Moreover, the two random walks are independent.

Proof. We have R_n^1 = Σ_{i=1}^n (ξ_i^1 + ξ_i^2) and R_n^2 = Σ_{i=1}^n (ξ_i^1 − ξ_i^2). So R_n^1 = Σ_{i=1}^n η_i^1 and R_n^2 = Σ_{i=1}^n η_i^2, where η_n := (η_n^1, η_n^2) := (ξ_n^1 + ξ_n^2, ξ_n^1 − ξ_n^2). The sequence of vectors η1, η2, ... is i.i.d., being obtained by taking the same function on the elements of the i.i.d. sequence ξ1, ξ2, .... The possible values of each of the η_n^1, η_n^2 are ±1, and we check:

P(η_n^1 = 1, η_n^2 = 1) = P(ξn = (1, 0)) = 1/4,
P(η_n^1 = 1, η_n^2 = −1) = P(ξn = (0, 1)) = 1/4,
P(η_n^1 = −1, η_n^2 = 1) = P(ξn = (0, −1)) = 1/4,
P(η_n^1 = −1, η_n^2 = −1) = P(ξn = (−1, 0)) = 1/4.

Hence P(η_n^1 = 1) = P(η_n^1 = −1) = 1/2 and P(η_n^2 = 1) = P(η_n^2 = −1) = 1/2, so each of (R_n^1), (R_n^2) is a simple symmetric random walk in one dimension. To show that the last two are independent, we check from the table above that

P(η_n^1 = ε, η_n^2 = ε′) = P(η_n^1 = ε) P(η_n^2 = ε′) for all choices of ε, ε′ ∈ {−1, +1}.

Lemma 22.

P(S_{2n} = 0) = (C(2n, n) 2^{−2n})².

Proof. Since Sn = 0 if and only if Rn = 0,

P(S_{2n} = 0) = P(R_{2n} = 0) = P(R_{2n}^1 = 0, R_{2n}^2 = 0) = P(R_{2n}^1 = 0) P(R_{2n}^2 = 0),

where the last equality follows from the independence of the two random walks. But (R_n^1) is a simple symmetric random walk, so P(R_{2n}^1 = 0) = C(2n, n) 2^{−2n}, and the same is true for (R_n^2).

By Stirling's approximation, P(S_{2n} = 0) ∼ c/n, as n → ∞, where c is a positive constant, in the sense that the ratio of the two sides goes to 1.

Corollary 24. The two-dimensional simple symmetric random walk is recurrent.

Proof. By Lemma 5, we need to show that Σ_{n=0}^∞ P(Sn = 0) = ∞. But by the previous lemma, P(S_{2n} = 0) ∼ c/n, and Σ_n c/n = ∞.
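Lemma 22 is easy to check by simulation; a minimal sketch comparing a Monte Carlo estimate of P(S_{2n} = (0, 0)) with the product formula (n = 3 is an arbitrary illustrative choice):

```python
import random
from math import comb

def p_return_2d(n, trials=200_000):
    # Monte Carlo estimate of P(S_{2n} = (0, 0)) for the 2D drunkard's walk
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    hits = 0
    for _ in range(trials):
        x = y = 0
        for _ in range(2 * n):
            dx, dy = random.choice(moves)
            x += dx
            y += dy
        hits += (x == 0 and y == 0)
    return hits / trials

n = 3
print(p_return_2d(n), (comb(2*n, n) / 4**n) ** 2)
```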

How about more complicated distributions? Let suppose we are given a probability distribution (pi . b].This last display tells us the distribution of STa ∧Tb . B = j) = c−1 (j − i)pi pj . Theorem 43. a < 0 < b such that pa + (1 − p)b = 0. we can always find integers a. To see this. using this. such that p is rational. i ∈ Z) a given probability distribution on Z with zero mean: i∈Z ipi = 0. in particular. suppose first k = 0. B = j) = 1. P (A = 0. i ∈ Z). a = M − N . i < 0 < j. we have this common value: i<0 (−i)pi = and. Indeed. The probability it exits it from the left point is p. i ∈ Z. M < N . i ∈ Z. Can we find a stopping time T such that ST has the given distribution? The Skorokhod embedding theorem tells us how to do this. c= i<0 (−i)pi = i>0 Letting Ti = inf{n ≥ 0 : Sn = i}. and the probability it exits from the right point is 1 − p.g. k ∈ Z. Then there exists a stopping time T such that P (ST = i) = pi . that ESTa ∧Tb = 0. The claim is that P (Z = k) = pk . e. Then Z = 0 if and only if A = B = 0 and this has 104 . where c−1 is chosen by normalisation: P (A = 0. 0)} ∪ (N × N) with the following distribution: P (A = i. B = 0) + P (A = i. i<0 j>0 This leads to: c−1 j>0 jpj i<0 pi + c−1 i<0 (−i)pi j>0 i>0 ipi pj = 1 − p0 Since i∈Z ipi = 0. choose. B) taking values in {(0. Define a pair of random variables (A. Proof. N ∈ N. we consider the random variable Z = STA ∧TB . M. Let (Sn ) be a simple symmetric random walk and (pi . b = N . b. if p = M/N . we get that c is precisely ipi . B = 0) = p0 . Conversely. B) are independent of (Sn ). Note. given a 2-point probability distribution (p. 1 − p). and assuming that (A. This means that we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it till the first time it exits the interval [a.

Suppose next k > 0. We have

P(Z = k) = P(S_{TA∧TB} = k) = Σ_{i<0} Σ_{j>0} P(S_{Ti∧Tj} = k) P(A = i, B = j) = Σ_{i<0} P(S_{Ti∧Tk} = k) P(A = i, B = k) = Σ_{i<0} (−i)/(k − i) · c^{−1} (k − i) pi pk = c^{−1} Σ_{i<0} (−i) pi pk = pk.

Similarly, P(Z = k) = pk for k < 0. So T = TA ∧ TB is, by definition, the required stopping time.
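The construction in the proof translates directly into a simulation recipe. A sketch, for the hypothetical zero-mean distribution p_{−1} = 0.5, p_0 = 0.1, p_1 = 0.3, p_2 = 0.1:

```python
import random
from collections import Counter

def skorokhod_sample(p):
    # one draw of S_{T_A ∧ T_B}, where P(A=i, B=j) = (j - i) p_i p_j / c, i < 0 < j
    c = sum(-i * w for i, w in p.items() if i < 0)
    pairs, weights = [(0, 0)], [p.get(0, 0.0)]
    for i, pi in p.items():
        for j, pj in p.items():
            if i < 0 < j:
                pairs.append((i, j))
                weights.append((j - i) * pi * pj / c)
    a, b = random.choices(pairs, weights=weights)[0]
    s = 0
    while s != a and s != b:          # run the walk until it exits (a, b)
        s += random.choice((-1, 1))
    return s

p = {-1: 0.5, 0: 0.1, 1: 0.3, 2: 0.1}   # zero mean: -0.5 + 0.3 + 0.2 = 0
cnt = Counter(skorokhod_sample(p) for _ in range(100_000))
print({i: round(cnt[i] / 100_000, 3) for i in sorted(p)}, "target:", p)
```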

PART III: APPENDICES

This section contains some sine qua non elements from standard Mathematics, as taught in high schools or universities. They are not supposed to replace textbooks or basic courses; they merely serve as reminders.

Arithmetic

Arithmetic deals with integers. The set N of natural numbers is the set of positive integers:

N = {1, 2, 3, ...}.

Zero is denoted by 0. The set Z of integers consists of N, their negatives, and zero:

Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}.

Given integers a, b, with b ≠ 0, we say that b divides a if there is an integer k such that a = kb. We write this by b | a. If c | b and b | a then c | a. If a | b and b | a then a = b.

Given integers a, b, with at least one of them nonzero, we can define their greatest common divisor d = gcd(a, b) as the unique positive integer such that (i) d | a, d | b, and (ii) if c | a and c | b, then c | d. (That it is unique is obvious from the fact that if d1, d2 are two such numbers then d1 would divide d2 and vice-versa, leading to d1 = d2.)

Notice that

gcd(a, b) = gcd(a, b − a).

Indeed, let d = gcd(a, b). We will show that d = gcd(a, b − a). But d | a and d | b; therefore d | b − a. So (i) holds. Now suppose that c | a and c | b − a. Then c | a + (b − a), i.e. c | b. Because c | a and c | b, we have c | d. So (ii) holds, and we're done.

This gives us an algorithm for finding the gcd between two numbers: Suppose 0 < a < b are integers. Replace b by b − a. If b − a > a, replace b − a by b − 2a. Keep doing this until you obtain b − qa < a. Then interchange the roles of the numbers and continue. For instance,

gcd(120, 38) = gcd(82, 38) = gcd(44, 38) = gcd(6, 38) = gcd(6, 32) = gcd(6, 26) = gcd(6, 20) = gcd(6, 14) = gcd(6, 8) = gcd(6, 2) = gcd(4, 2) = gcd(2, 2) = 2.

In the first line we subtracted 38 exactly 3 times; and 3 is the maximum number of times that 38 "fits" into 120 (indeed, 4 × 38 > 120). Thus we may summarise the first line by saying that we replace the largest of the two numbers by the remainder of the division of the largest with the smallest. So we work faster if we also invoke Euclid's algorithm (i.e. the process of long division that one learns in elementary school):

gcd(120, 38) = gcd(6, 38) = gcd(6, 2) = 2.

The theorem of division says the following: Given two positive integers a, b, with a > b, we can find an integer q and an integer r ∈ {0, 1, ..., b − 1} such that

a = qb + r.

The long division algorithm is a process for finding the quotient q and the remainder r. For example, to divide 3450 by 26 we do

        132
       -----
   26 | 3450
        26
        --
         85
         78
         --
          70
          52
          --
          18

and obtain q = 132, r = 18.

Now, in the process of computing gcd(120, 38) above, we used the divisions

120 = 3 × 38 + 6, 38 = 6 × 6 + 2.

Working backwards, we have

2 = 38 − 6 × 6 = 38 − 6 × (120 − 3 × 38) = 19 × 38 − 6 × 120.

Thus gcd(38, 120) = 38x + 120y, with x = 19, y = −6. The same logic works in general: if d = gcd(a, b), then there are integers x, y such that

d = xa + yb.
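For completeness, here is a short sketch of the extended Euclidean algorithm, which produces d together with such a pair (x, y):

```python
def extended_gcd(a, b):
    # returns (d, x, y) with d = gcd(a, b) and d == x*a + y*b
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    return d, y, x - (a // b) * y

print(extended_gcd(120, 38))   # (2, -6, 19): 2 == -6*120 + 19*38
```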

Here is another fact about the greatest common divisor: for integers a, b, not both zero,

gcd(a, b) = min{xa + yb : x, y ∈ Z, xa + yb > 0}.

To see why this is true, let d be the right-hand side. Suppose that c | a and c | b. Then c divides any linear combination of a, b; in particular, it divides d. This shows (ii) of the definition of a greatest common divisor. We need to show (i), that d | a and d | b. Suppose this is not true, i.e. that d does not divide both numbers, say that d does not divide b. Since d > 0, we can divide b by d to find

b = qd + r, for some 1 ≤ r ≤ d − 1.

Writing d = xa + yb for some x, y ∈ Z, we then have b = q(xa + yb) + r, yielding

r = (−qx)a + (1 − qy)b.

So r is expressible as a linear combination of a and b and is strictly smaller than d (being a remainder); this contradicts the fact that d is the minimum. So d | b, and similarly d | a, proving (i).

The greatest common divisor of 3 integers can now be defined in a similar manner; in general, the greatest common divisor of any subset of the integers also makes sense.

Relations

A relation on a set A is a mathematical notion that establishes a way of pairing elements of A. For example, the relation < between real numbers can be thought of as consisting of pairs of numbers (x, y) such that x is smaller than y. In general, a relation can be thought of as a set of pairs of elements of the set A.

A relation is called reflexive if it contains the pairs (x, x). The equality relation = is reflexive; the relation < is not. A relation is called symmetric if it cannot contain a pair (x, y) without containing its symmetric (y, x). If we take A to be the set of students in a classroom, then the relation "x is a friend of y" is symmetric (and it can be thought of as reflexive, if we accept that one is a friend of oneself). The equality relation is symmetric; the relation < is not. A relation is transitive if, when it contains (x, y) and (y, z), it also contains (x, z). Both < and = are transitive; the friendship relation is not necessarily transitive.

A relation which possesses all three properties is called an equivalence relation. Any equivalence relation splits a set into equivalence classes: the equivalence class of x contains all y that are related (are equivalent) to x. We encountered the equivalence relation i ↔ j (i communicates with j) in Markov chain theory.

Functions

A function f from a set X into a set Y is denoted by f : X → Y. A function is called one-to-one (or injective) if two different x1, x2 ∈ X have different values f(x1), f(x2). The range of f is the set of its values, i.e. the set {f(x) : x ∈ X}. A function is called onto (or surjective) if its range is Y. If B ⊂ Y then f^{−1}(B) consists of all x ∈ X with f(x) ∈ B. The set of functions from X to Y is denoted by Y^X.

Counting

Let A, B be finite sets with n, m elements, respectively. We may think of the elements of A as animals and those of B as names. A function is then an assignment of a name to each animal. (The same name may be used for different animals.) Since the name of the first animal can be chosen in m ways, that of the second also in m ways, and so on, the total number of assignments, i.e. the number of functions from A to B (the cardinality of the set B^A), is

m × m × ··· × m (n times) = m^n.

Suppose now we are not allowed to give the same name to different animals. To be able to do this, we need more names than animals: n ≤ m. Each assignment of this type is a one-to-one function from A to B. The first animal can be assigned any of the m available names, the second any of the remaining m − 1 names, and so on. Thus the total number of one-to-one functions from A to B is

(m)n := m(m − 1)(m − 2)···(m − n + 1).

In the special case m = n, a one-to-one function from the set A into A is called a permutation of A. Incidentally, we denote (n)n also by n!; in other words, n! stands for the product of the first n natural numbers:

n! = 1 × 2 × 3 × ··· × n.

Clearly, n! is the total number of permutations of a set of n elements.

Any set A contains several subsets. If A has n elements, the number of subsets with k elements each can be found by designating which of the n elements of A will be chosen: the first element can be chosen in n ways, the second in n − 1 ways, and so on, the k-th in n − k + 1 ways. So we have (n)k = n(n − 1)(n − 2)···(n − k + 1) ways to pick k elements if we care about their order. If we don't, then we divide by the number k! of permutations of the k elements, concluding that the number of subsets with k elements of a set with n elements equals

(n)k / k! =: C(n, k),

where the symbol C(n, k) is merely a name for the number computed on the left side of this equation. Clearly,

C(n, k) = n! / (k!(n − k)!) = C(n, n − k).

Since there is only one subset with no elements (it is the empty set ∅), we have C(n, 0) = 1. The total number of subsets of A (regardless of their number of elements) is thus equal to

Σ_{k=0}^n C(n, k) = (1 + 1)^n = 2^n,

and this follows from the binomial theorem. We thus observe that there are as many subsets in A as the number of elements in the set {0, 1}^n of sequences of length n with elements 0 or 1.

We shall let C(n, k) = 0 if n < k. The Pascal triangle relation states that

C(n + 1, k) = C(n, k) + C(n, k − 1).

Indeed, in choosing k animals from a set of n + 1 ones, consider a specific animal, say, the pig. If the pig is not included in the sample, we choose k animals from the remaining n ones, and this can be done in C(n, k) ways. If the pig is included, then there are k − 1 remaining animals to be chosen, and this can be done in C(n, k − 1) ways. Since the sentences "the pig is included" and "the pig is not included" cannot be simultaneously true, we add the two numbers together in order to find the total number of ways to pick k animals from n + 1 animals.

More generally, if x is a real number, we define

C(x, m) = x(x − 1)···(x − m + 1) / m!.

If y is not an integer then, by convention, we let C(n, y) = 0.

Maximum and minimum

The minimum of two numbers a, b is denoted by

a ∧ b = min(a, b).

Thus a ∧ b = a if a ≤ b, and = b if b ≤ a. The maximum is denoted by

a ∨ b = max(a, b).

The notation extends to more than two numbers; for example, a ∨ b ∨ c refers to the maximum of the three numbers a, b, c, and we don't care to put parentheses because the order of finding a maximum is irrelevant. Note that

−(a ∨ b) = (−a) ∧ (−b), (a ∧ b) + c = (a + c) ∧ (b + c).

In particular, we use the following notations:

a+ = max(a, 0), a− = − min(a, 0).

Both quantities are nonnegative; a+ is called the positive part of a, and a− the negative part of a. Note that it is always true that a = a+ − a−. The absolute value of a is defined by |a| = a+ + a−.

Indicator function

1A stands for the indicator function of a set A, meaning the function that takes value 1 on A and 0 on the complement Ac of A. We also use the notation 1(clause) for a number that is 1 whenever the clause is true and 0 otherwise. This is very useful notation, because conjunction of clauses is transformed into multiplication. It is easy to see that

1_{A∩B} = 1A 1B, 1_{Ac} = 1 − 1A.

An example of its application: Since

1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{Ac∩Bc} = 1 − 1_{Ac} 1_{Bc} = 1 − (1 − 1A)(1 − 1B) = 1A + 1B − 1A 1B = 1A + 1B − 1_{A∩B},

and since, if A is an event, 1A is a random variable taking value 1 with probability P(A) and 0 with probability P(Ac), with expectation

E 1A = 1 × P(A) + 0 × P(Ac) = P(A),

we have, by taking expectations,

E[1_{A∪B}] = E[1A] + E[1B] − E[1_{A∩B}], which means P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Several quantities of interest are expressed in terms of indicators. For instance, if A1, A2, ... are sets or clauses, then Σ_{n≥1} 1_{An} is the number of true clauses. For example, if I go to a bank N times and Ai is the clause "I am robbed the i-th time", then Σ_{i=1}^N 1_{Ai} is the total number of times I am robbed.

Another example: Let X be a nonnegative random variable with values in N = {1, 2, ...}. Then

X = Σ_{n=1}^X 1 (the sum of X ones is X), so X = Σ_{n≥1} 1(X ≥ n),

and it is worth remembering that

E[X] = Σ_{n≥1} E 1(X ≥ n) = Σ_{n≥1} P(X ≥ n).

Matrices

Let R^d be the set of vectors with d components. A linear function ϕ from R^n to R^m is such that

ϕ(αx + βy) = αϕ(x) + βϕ(y),

for all x, y ∈ R^n and all α, β ∈ R. Let u1, ..., un be a basis for R^n. Then any x ∈ R^n can be uniquely written as

x = Σ_{j=1}^n x^j uj,

and the column (x^1, ..., x^n)^T is a column representation of x with respect to the given basis. Since ϕ is linear,

y := ϕ(x) = Σ_{j=1}^n x^j ϕ(uj).

But the ϕ(uj) are vectors in R^m. So if we choose a basis v1, ..., vm in R^m, we can express each ϕ(uj) as a linear combination of the basis vectors:

ϕ(uj) = Σ_{i=1}^m a_{ij} vi.

The matrix A of ϕ with respect to the choice of the two bases is defined as the m × n array

A := (a_{ij}, 1 ≤ i ≤ m, 1 ≤ j ≤ n).

Combining, we have

ϕ(x) = Σ_{i=1}^m vi Σ_{j=1}^n a_{ij} x^j.

Let y^i := Σ_{j=1}^n a_{ij} x^j, i = 1, ..., m. The column (y^1, ..., y^m)^T is then a column representation of y = ϕ(x) with respect to the basis v1, ..., vm, and the relation between the column representation of y and the column representation of x is the usual matrix-vector multiplication:

y^i = Σ_{j=1}^n a_{ij} x^j, i = 1, ..., m.

Suppose now that ψ : R^n → R^m and ϕ : R^m → R^ℓ are linear functions. Then ϕ ∘ ψ : R^n → R^ℓ is a linear function.

Fix three bases on R^n, R^m and R^ℓ, and let A, B be the matrix representations of ϕ, ψ, respectively, with respect to these bases. So A is a ℓ × m matrix and B is a m × n matrix. Then the matrix representation of ϕ ∘ ψ is AB, where

(AB)_{ij} = Σ_{k=1}^m A_{ik} B_{kj}, 1 ≤ i ≤ ℓ, 1 ≤ j ≤ n,

and AB is a ℓ × n matrix.

A matrix A is called square if it is an n × n matrix. If we fix a basis in R^n, a square matrix represents a linear transformation from R^n to R^n, i.e. it is its matrix representation with respect to the fixed basis. Usually, the choice of the basis is the so-called standard basis: the standard basis in R^d consists of the vectors

e1 = (1, 0, ..., 0)^T, e2 = (0, 1, ..., 0)^T, ..., ed = (0, 0, ..., 1)^T;

in other words, e_i^j = 1(i = j).

Supremum and infimum

When we have a set of numbers I (for instance, a sequence a1, a2, ... of numbers), the minimum of I is denoted by min I and refers to a number c ∈ I such that c ≤ x for all x ∈ I. Similarly, we define max I. This minimum (or maximum) may not exist: for instance, I = {1, 1/2, 1/3, ...} has no minimum. In these cases we use the notation inf I, standing for the "infimum" of I, defined as the maximum of all the numbers y with the property y ≤ x for all x ∈ I. A property of real numbers says that if the set I is bounded from below then this maximum of all lower bounds exists, and it is this that we call the infimum. Similarly, we define the supremum sup I to be the minimum of all upper bounds. If I has no lower bound then inf I = −∞; if I has no upper bound then sup I = +∞. If I is empty then, by convention, inf ∅ = +∞ and sup ∅ = −∞.

inf I is characterised by: inf I ≤ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≤ ε + inf I. Similarly, sup I is characterised by: sup I ≥ x for all x ∈ I and, for any ε > 0, there is x ∈ I with x ≥ sup I − ε.

Limsup and liminf

If x1, x2, ... is a sequence of numbers, then

inf{x1, x2, ...} ≤ inf{x2, x3, ...} ≤ inf{x3, x4, ...} ≤ ···

and, since any increasing sequence has a limit (possibly +∞), we take this limit and call it lim inf xn (the limit inferior) of the sequence:

lim inf xn = lim_{n→∞} inf{xn, xn+1, xn+2, ...}.

Similarly, we define lim sup xn = lim_{n→∞} sup{xn, xn+1, xn+2, ...} (the limit superior). We note that lim inf(−xn) = − lim sup xn. The sequence xn has a limit ℓ if and only if lim sup xn = lim inf xn = ℓ.

Probability Theory

I do not require knowledge of axiomatic Probability Theory (with Measure Theory). (I do sometimes "cheat" from a strict mathematical point of view in these notes, but not too much.) But I would like the reader to keep in mind one thing: Every mathematical system has a set of axioms from which we can prove (useful) theorems, derive formulae, and apply them. Probability Theory is exceptionally simple in terms of its axioms, for (besides P(Ω) = 1) there is essentially one axiom, namely: if the events A1, A2, ... are mutually disjoint, then

P(∪_{n≥1} An) = Σ_{n≥1} P(An).

Everything else follows from this axiom. (That is why Probability Theory is very rich. In contrast, Linear Algebra, which has many axioms, is more 'restrictive'.)

If A, B are disjoint events then P(A ∪ B) = P(A) + P(B). This can be generalised to any finite number of events. If the events A, B are not disjoint, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B); this is the inclusion-exclusion formula for two events. If A1, A2, ... are increasing, with A = ∪n An, then P(A) = limn→∞ P(An). Similarly, if A1, A2, ... are decreasing, with A = ∩n An, then P(A) = limn→∞ P(An). If An are events with P(An) = 0 then P(∪n An) = 0; indeed, P(∪n An) ≤ Σn P(An). Similarly, if An are events with P(An) = 1 then P(∩n An) = 1.

Quite often, to show that EX ≤ EY we need to argue that X ≤ Y; similarly, showing that P(A) ≤ P(B) is more of a logical problem than a numerical one: we need to show that A implies B (i.e. A is a subset of B).

In Markov chain theory we work a lot with conditional probabilities. Recall that P(B|A) is defined by P(A ∩ B)/P(A), and this is a very motivated definition. Often, to compute P(A ∩ B), we read the definition as

P(A ∩ B) = P(A) P(B|A).

This is generalisable to a number of events:

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2).

If we interchange the roles of A, B in the definition, we get P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B), and this is called Bayes' theorem.

It is often useful to find disjoint events B1, B2, ... such that ∪n Bn = Ω, in which case we write, to compute P(A),

P(A) = Σn P(A ∩ Bn) = Σn P(A|Bn) P(Bn).

If A is an event then 1A (or 1(A)) is the indicator random variable, i.e. the function that takes value 1 on A and 0 everywhere else. Note that 1A 1B = 1_{A∩B}, 1_{Ac} = 1 − 1A, and E 1A = P(A).

Markov's inequality for a positive random variable X states that

P(X > x) ≤ EX/x.

This follows from the inequality x 1(X > x) ≤ X, which holds since X is positive, by taking expectations of both sides.

An application of this is: P(X < ∞) = lim_{x→∞} P(X ≤ x). In Markov chain theory we encounter situations where P(X = ∞) may be positive, so it is useful to keep this in mind.

The expectation of a nonnegative random variable X may be defined by

EX = ∫_0^∞ P(X > x) dx.

The formula holds for any positive random variable. For a random variable that takes values in N we can write EX = Σ_{n=1}^∞ P(X ≥ n); this is due to the fact that Σ_{n=1}^∞ 1(X ≥ n) = X. In case the random variable is discrete, the formula reduces to the known one, EX = Σ_x x P(X = x). In case the random variable is absolutely continuous with density f, we have EX = ∫_{−∞}^∞ x f(x) dx.

For a random variable which takes positive and negative values, we define X+ = max(X, 0) and X− = −min(X, 0), which are both nonnegative, and then define

EX = EX+ − EX−,

which is motivated by the algebraic identity X = X+ − X−. This definition makes sense only if EX+ < ∞ or EX− < ∞.

Generating functions

A generating function of a sequence a0, a1, ... is an infinite polynomial having the ai as coefficients:

G(s) = a0 + a1 s + a2 s² + ···.

We do not a priori know that this exists for any s other than, obviously, s = 0, for which G(0) = a0. If it exists for some s1 then it exists uniformly for all s with |s| < |s1| (and this needs a proof, taught in calculus courses). Note that |G(s)| ≤ Σn |an| |s|^n, and this is ≤ Σn |an| if s ∈ [−1, 1]. So if Σn an < ∞ then G(s) exists for all s ∈ [−1, 1]; this is the case, in particular, if (an) is a probability distribution on Z+ = {0, 1, ...}, because then Σn an = 1. As an example, the generating function of the geometric sequence an = ρ^n, n ≥ 0, is

Σ_{n=0}^∞ ρ^n s^n = 1/(1 − ρs), as long as |ρs| < 1.   (39)

Suppose now that X is a random variable with values in Z+ ∪ {+∞}. The generating function of the sequence pn = P(X = n), n = 0, 1, ..., is called the probability generating function of X:

ϕ(s) = E s^X = Σ_{n=0}^∞ s^n P(X = n).

This is defined for all −1 < s < 1. We do allow X to, possibly, take the value +∞; the term s^∞ p∞, where p∞ := P(X = ∞) = 1 − Σ_{n=0}^∞ pn, could formally appear in the sum defining ϕ(s), but, since |s| < 1, we have s^∞ = 0. Note that

Σ_{n=0}^∞ pn ≤ 1,

that ϕ(0) = P(X = 0), and that

ϕ(1) = Σ_{n=0}^∞ P(X = n) = P(X < +∞).

We can recover the probabilities pn from the formula

pn = ϕ^{(n)}(0)/n!,

as follows from Taylor's theorem, and this is why ϕ is called probability generating function. Also, we can recover the expectation of X from

EX = lim_{s↑1} ϕ′(s),

something that can be formally seen from (d/ds) E s^X = E (d/ds) s^X = E X s^{X−1}, by letting s = 1.

Generating functions are useful for solving linear recursions. As an example, consider the Fibonacci sequence defined by the recursion

x_{n+2} = x_{n+1} + x_n, n ≥ 0, with x0 = x1 = 1.

If we let X(s) = Σ_{n≥0} x_n s^n be the generating function of (x_n, n ≥ 0), then the generating function of (x_{n+1}, n ≥ 0) is Σ_{n≥0} x_{n+1} s^n = s^{−1}(X(s) − x0), and the generating function of (x_{n+2}, n ≥ 0) is Σ_{n≥0} x_{n+2} s^n = s^{−2}(X(s) − x0 − s x1).

From the recursion x_{n+2} = x_{n+1} + x_n we have (and this is where linearity is used) that the generating function of (x_{n+2}, n ≥ 0) equals the sum of the generating functions of (x_{n+1}, n ≥ 0) and (x_n, n ≥ 0), i.e.

s^{−2}(X(s) − x0 − s x1) = s^{−1}(X(s) − x0) + X(s).

Since x0 = x1 = 1, we can solve for X(s) and find

X(s) = −1/(s² + s − 1).

But the polynomial s² + s − 1 has two roots,

a = (√5 − 1)/2, b = −(√5 + 1)/2,

so s² + s − 1 = (s − a)(s − b), and partial fractions give

X(s) = −1/((s − a)(s − b)) = (1/(b − a)) (1/(b − s) − 1/(a − s)).

But (39) tells us that ρ^n has generating function 1/(1 − ρs) = (1/ρ)/((1/ρ) − s); thus 1/((1/ρ) − s) is the generating function of ρ^{n+1}. This tells us that 1/(b − s) is the generating function of (1/b)^{n+1}, and that 1/(a − s) is the generating function of (1/a)^{n+1}. Hence X(s) is the generating function of

x_n = (1/(b − a)) ((1/b)^{n+1} − (1/a)^{n+1}).

Noting that ab = −1, so that 1/b = −a and 1/a = −b, we can also write this as

x_n = ((−a)^{n+1} − (−b)^{n+1}) / (b − a),

which is the solution to the recursion: the n-th term of the Fibonacci sequence. Despite the square roots in the formula, this number is always an integer (it has to be!).
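A two-line check that the closed form really produces integers, namely the Fibonacci numbers:

```python
from math import sqrt

a, b = (sqrt(5) - 1) / 2, -(sqrt(5) + 1) / 2

def fib(n):
    # closed form derived above: x_n = ((-a)^(n+1) - (-b)^(n+1)) / (b - a)
    return ((-a) ** (n + 1) - (-b) ** (n + 1)) / (b - a)

print([round(fib(n)) for n in range(10)])   # 1, 1, 2, 3, 5, 8, 13, 21, 34, 55
```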

Distribution or law of a random object

When we speak of the "distribution" or "law" of a random variable X with values in Z or R or R^n, or even in a much larger space, we speak of some function which enables us to specify the probabilities of the form P(X ∈ B), where B is a set of values of X. For example, if X is a random variable with values in R, then the so-called distribution function, namely the function P(X ≤ x), x ∈ R, is such an example. But I could very well use the function P(X > x), or the 2-parameter function P(a < X < b). If X is absolutely continuous, I could use, instead, its density f(x) = (d/dx) P(X ≤ x). All these objects can be referred to as the "distribution" or "law" of X. Even the generating function E s^X of a Z+-valued random variable X is an example, for it does allow us to specify the probabilities P(X = n), n ∈ Z+. Warning: the verb "specify" should not be interpreted as "compute", in the sense of existence of an analytical formula or algorithm.

We frequently encounter random objects which belong to larger spaces; for example, they could take values which are themselves functions. For instance, we encountered the concept of an excursion of a Markov chain from a state i. Such an excursion is a random object which should be thought of as a random function. Specifying the distribution of such an object may be hard, but using, or merely talking about, such an object presents no difficulty other than overcoming a psychological barrier.

Large products: Stirling's formula

If we want to compute n! = 1 × 2 × ··· × n for large n then, since the number is so big, it's better to use logarithms:

n! = e^{log n!} = exp Σ_{k=1}^n log k.

Now, to compute a large sum we can, instead, compute an integral and obtain an approximation. Since log x does not vary too much between two successive positive integers k and k + 1, it follows that

∫_k^{k+1} log x dx ≈ log k.

Hence Σ_{k=1}^n log k ≈ ∫_1^n log x dx. Since (d/dx)(x log x − x) = log x, it follows that

∫_1^n log x dx = n log n − n + 1 ≈ n log n − n.

Hence we have

n! ≈ exp(n log n − n) = n^n e^{−n}.

We call this the rough Stirling approximation. It turns out to be a good one.
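How good? A quick numerical look at the ratio n!/(n^n e^{−n}) is instructive:

```python
from math import factorial, exp, log, pi, sqrt

for n in (5, 10, 50, 100):
    rough = exp(n * log(n) - n)        # n^n e^(-n), computed in log space
    print(n, factorial(n) / rough, sqrt(2 * pi * n))

# the ratio is not 1: it grows like sqrt(2*pi*n), which is exactly the
# factor supplied by the refined Stirling formula discussed below
```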

But there are two problems: (a) We do not know in what sense the symbol ≈ should be interpreted. (b) The approximation is not good when we consider ratios of products. For example (try it!), the rough Stirling approximation gives that the probability that, in 2n fair coin tosses, we have exactly n heads equals

C(2n, n) 2^{−2n} ≈ 1,

and this is terrible, because we know that this probability goes to zero as n → ∞. To remedy this, we shall prove Stirling's approximation:

n! ∼ n^n e^{−n} √(2πn),

where the symbol an ∼ bn means that an/bn → 1, as n → ∞.

The geometric distribution

We say that the random variable X with values in Z+ is geometric if it has the memoryless property:

P(X − n ≥ m | X ≥ n) = P(X ≥ m), 0 ≤ n ≤ m.

If we think of X as the time of occurrence of a random phenomenon, then X is such that knowledge that X has exceeded n does not make the distribution of the remaining time X − n any different from that of X. The memoryless property shows that a geometric random variable has a distribution of the form

P(X ≥ n) = λ^n, n ∈ Z+, where λ = P(X ≥ 1),

a geometric function of n. We baptise this the geometric(λ) distribution. We have already seen a geometric random variable: the total number of visits J0 = Σ_{n=1}^∞ 1(Sn = 0), defined in (34). In fact, it was shown in Theorem 38 that P0(J0 ≥ k) = P0(J0 ≥ 1)^k, a geometric function of k.

Markov's inequality

Arguably this is the most useful inequality in Probability. Surely you have seen it before, so this little section is a reminder. It is based on the concept of monotonicity. Let X be a random variable, and consider an increasing and nonnegative function g : R → R+. We can capture both properties of g neatly by writing:

For all x, y ∈ R: g(y) ≥ g(x) 1(y ≥ x).

Indeed, if y ≥ x then 1(y ≥ x) = 1 and the display tells us g(y) ≥ g(x), expressing monotonicity; and if y < x then 1(y ≥ x) = 0 and the display reads g(y) ≥ 0, expressing non-negativity. Then, for all x,

g(X) ≥ g(x) 1(X ≥ x),

and so, by taking expectations of both sides (expectation is an increasing operator),

E g(X) ≥ g(x) P(X ≥ x).

This completely trivial thought gives the very useful Markov's inequality.
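A quick empirical illustration of the inequality in its most common form, g(y) = y ∨ 0, so that P(X ≥ x) ≤ EX/x for a nonnegative X (the example random variable below is an arbitrary illustrative choice):

```python
import random

# X = number of heads in 10 fair coin tosses, a nonnegative random variable
samples = [sum(random.random() < 0.5 for _ in range(10)) for _ in range(100_000)]
ex = sum(samples) / len(samples)

for x in (4, 6, 8):
    tail = sum(s >= x for s in samples) / len(samples)
    print(f"P(X >= {x}) = {tail:.4f} <= EX/x = {ex/x:.4f}")
```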

Bibliography

1. Brémaud, P. (1999) Markov Chains. Springer.
2. Brémaud, P. (1997) An Introduction to Probabilistic Modeling. Springer.
3. Chung, K. L. and AitSahlia, F. (2003) Elementary Probability Theory (With Stochastic Processes and an Introduction to Mathematical Finance). Springer.
4. Feller, W. (1968) An Introduction to Probability Theory and its Applications. Wiley.
5. Grinstead, C. M. and Snell, J. L. (1997) Introduction to Probability. Amer. Math. Soc. Also available at: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
6. Norris, J. R. (1998) Markov Chains. Cambridge University Press.
7. Parry, W. (1981) Topics in Ergodic Theory. Cambridge University Press.

