Lecture Notes
Towards brevity, clarity and precision.
on courses taught by
Dr. Rakesh Nigam
Edition 4
This book has been compiled from a series of lecture notes on a comprehensive
set of topics: Linear Algebra, Discrete Mathematics, Probability foundations,
Stochastic Processes, Computational Finance and Data Science. The lecture
notes on all of the above topics have been collectively written by the PGDM
(2019-21) batch at the Madras School of Economics, based on courses taught by
Dr. Rakesh Nigam.
Our attempt has been to make this book self-contained, in the sense that it covers
an extensive portion of the topics that form the building blocks of modern-day
Analytics and Research. The concepts related to Linear Algebra and Stochastics
have been built from the ground up, so the book can serve as a convenient reference
for students and professionals alike who are looking to pursue a deeper
understanding of topics in Analytics and Finance.
Editors
February 19, 2021
Features new in the 4th edition
• Linear Algebra - Linear Algebra concepts are essential for a better
understanding of analysis methods: multivariate data representations
rely heavily on Linear Algebra, code representations and implementations
are based on matrices and vectors, and various unsupervised Machine Learning
methods are built entirely on Linear Algebra computations.
We start with the basics of vector algebra, moving on to more complex
algorithms like the Singular Value Decomposition and Principal Component
Analysis. Applications in Text Analytics are also covered here.
• Advanced Probability - Most Machine Learning and Econometric models
are probabilistic in nature, making it essential for us to get a good
grasp of the fundamentals of probability theory. This part of the book first
builds the foundations of probability theory with concepts related to random
variables, combinatorics, expectation and variance, and subsequently
moves on to more advanced topics that include time series analysis, joint
distributions and conditional probability theorems.
• Stochastic Processes - We introduce the Markov Chain framework in this
section. This part of the book relies heavily on the memoryless property,
the Exponential and Poisson distributions, and discrete-time Markov
Chains. There are sections on Queueing theory as well; queueing models
are constructed entirely using the concepts of Markov Chains and stochastic
processes.
• Stochastic Calculus and Computational Finance - In this section we con-
tinue from where we left off in the Stochastic Processes section - into the
domain of Continuous time Markov Chains and Martingales. We then move
on to advanced topics related to the Black Scholes Merton Model for pricing
options.
• Topics in Data Science - This section is devoted to the mathematical under-
pinnings of some advanced computational frameworks that include - Natu-
ral Language Processing, Computer Vision and an introduction to Bayesian
models of statistical computation.
• Calculus and Discrete Mathematics - In our attempt to make this book as
self-contained as possible, we have included a rather comprehensive section
that covers a host of topics: Fourier Transforms, Fundamentals of Calculus,
Propositional Logic, Proof Techniques, Graph theory and Recurrence relations.
Contents
I Linear Algebra 15
1 Introduction to Linear Algebra 16
1.1 Introduction to Vector Addition and Subtraction . . . . . . . . . . . 16
1.1.1 Vector Addition and Scalar multiplication . . . . . . . . . . . . 16
1.2 Vector Dot product . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Norm of a vector . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.2 Unit Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.3 Angles between vectors and dot products . . . . . . . . . . . . . 19
1.3 Solving system of Linear equations using Linear combinations . . . 19
1.4 Motivation from Chemistry . . . . . . . . . . . . . . . . . . . . . . 20
1.5 Application of Linear algebra in Mechanical Engineering and Eco-
nomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5.1 Engineering Terminology . . . . . . . . . . . . . . . . . . . . . 21
1.5.2 Static analysis of a 3-bar truss structure . . . . . . . . . . . . . 22
1.5.3 Economic analysis of a manufacturing firm . . . . . . . . . . . . 25
1.5.4 Solving this problem for the firm . . . . . . . . . . . . . . . . . 26
2 On Solving Ax = b 28
2.1 Premise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.1 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Case 1: Full Rank (r = m = n) . . . . . . . . . . . . . . . . . . 29
2.3 Case 2: Full Column Rank (r = n < m) . . . . . . . . . . . . . . 30
2.3.1 Unique Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.2 No solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Case 3: Full Row Rank (r = m < n) . . . . . . . . . . . . . . . . 32
2.5 Case 4: Not Full Rank (r < n, r < m) . . . . . . . . . . . . . . . 32
2.6 Applications in Economics: The Leontief Input-Output Model . . . 34
2.7 Applications in Physics: Kirchhoff’s Voltage and Current Laws . . . 35
3.5 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.1 Concavity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.2 Envelope theorem . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.3 Comparative statics . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Multivariate optimization . . . . . . . . . . . . . . . . . . . . . . . 47
3.6.1 Multivariate comparative statics . . . . . . . . . . . . . . . . . . 48
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.2 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . 78
7.3 Visualizing SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.4 Information Retrieval Strategies . . . . . . . . . . . . . . . . . . . . 80
7.5 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.6 Weighted-Zone Scoring . . . . . . . . . . . . . . . . . . . . . . . . 80
7.7 Term frequency ranking . . . . . . . . . . . . . . . . . . . . . . . . 82
7.8 Inverse document frequency ranking . . . . . . . . . . . . . . . . . 85
II Advanced Probability 87
8 Fundamentals of Probability 88
8.1 Combinatorics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.1.1 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.1.2 Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.1.3 Binomial Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.2 Basic Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.3 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . 91
8.3.1 Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.3.2 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.4 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.4.1 Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.4.2 Geometric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.4.3 Negative Binomial . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.5 Cumulative Frequency distributions . . . . . . . . . . . . . . . . . 94
8.6 General points about discrete variables . . . . . . . . . . . . . . . . 94
8.7 Conditions For Independence . . . . . . . . . . . . . . . . . . . . . 95
8.7.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.7.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.8 Axioms Of Probability . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.9 Moment Generating Function . . . . . . . . . . . . . . . . . . . . . 97
8.10 Consequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.10.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.11 Properties Of Moment Generating Function . . . . . . . . . . . . . 101
8.12 Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Part I
Linear Algebra

Chapter 1
Introduction to Linear Algebra
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \tag{1.1}$$

$$z = x + y = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \tag{1.2}$$
A vector can be multiplied by any scalar, say c, to produce another vector in the
same plane. This can be represented as:
Chapter 1. Introduction to Linear Algebra 17
$$cx = c \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} cx_1 \\ cx_2 \end{bmatrix} \tag{1.3}$$
Consider an example where we take our x vector as the point (4, 2) and the y vector
as the point (−1, 2); upon adding the two vectors we get the vector z as
a result. This is shown geometrically in figure 1 and can be represented in
vector algebraic form as follows:
$$\begin{bmatrix} 4 \\ 2 \end{bmatrix} + \begin{bmatrix} -1 \\ 2 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \tag{1.4}$$
A short note on linear combinations: if one vector can be expressed as a sum
of scalar multiples of other vectors, it is known as a linear combination of
those vectors.
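The vector operations above can be sketched numerically; the following NumPy snippet (an illustration, not part of the original notes) reproduces the worked example:

```python
import numpy as np

# Vectors from the example above: x = (4, 2), y = (-1, 2)
x = np.array([4, 2])
y = np.array([-1, 2])

z = x + y                      # vector addition, as in eq. (1.4)
print(z)                       # [3 4]

c = 3
print(c * x)                   # scalar multiplication: [12 6]

# z is a linear combination of x and y, with both scalars equal to 1
w = 1 * x + 1 * y
print(np.array_equal(w, z))    # True
```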
The norm of a vector u is the length or magnitude of the vector and can be calculated using the formula:
$$\|u\| = \sqrt{u_1^2 + u_2^2 + u_3^2 + \cdots + u_n^2} \tag{1.5}$$
The length of a vector with two elements is the square root of the sum of each
element squared. The magnitude of a vector is sometimes called the length of the
vector, or the norm of the vector. Essentially, the norm of a vector is a measure of
distance, symbolized by double vertical bars. Usually a vector norm refers to the
norm in Euclidean space and is also known as the Euclidean norm.
A short note on Euclidean space: Euclidean space is the fundamental space of
classical (Euclidean) geometry. It is a concept that was introduced by the ancient
Greek mathematician Euclid of Alexandria. The set of n-tuples of real
numbers equipped with the dot product is essentially a Euclidean space of
dimension n.
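The norm formula above can be checked against NumPy's built-in Euclidean norm (a minimal sketch of ours, not from the notes):

```python
import numpy as np

u = np.array([3.0, 4.0])

# Norm by formula (1.5): square root of the sum of squared elements
norm_manual = np.sqrt(np.sum(u**2))

# Same result from NumPy's built-in Euclidean norm
norm_np = np.linalg.norm(u)

print(norm_manual)   # 5.0
print(norm_np)       # 5.0

# A unit vector in the direction of u has norm 1
u_hat = u / norm_np
print(np.linalg.norm(u_hat))
```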
r = xî + y ĵ (1.6)
$$\hat{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \hat{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{1.7}$$

$$2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 2 \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} \tag{1.8}$$
a1 x1 + a2 x2 + a3 x3 + · · · + an xn = b (1.9)
where b and the coefficients a1 · · · an are real or complex numbers, usually known
in advance. The subscript n may be any positive integer. In textbook examples
and exercises, n is normally between 2 and 5 so as to make computations easier.
In real-life problems, n might be 50 or 5000, or anything for that matter. We will
attempt to solve a system of linear equations using linear combinations. Consider
a set of linear equations as follows:
2x − y = 0 (1.10)
− x + 2y − z = −1 (1.11)
− 3y + 4z = 4 (1.12)
Representing this system of equations in the form Ax = b, where A is the coeffi-
cient matrix, x is the vector of unknowns and b is the result vector.
Writing the rows of A as r1ᵀ, r2ᵀ, r3ᵀ and its columns as C1, C2, C3:

$$\begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -3 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ -1 \\ 4 \end{bmatrix} \tag{1.13}$$

$$x \begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} + y \begin{bmatrix} -1 \\ 2 \\ -3 \end{bmatrix} + z \begin{bmatrix} 0 \\ -1 \\ 4 \end{bmatrix} = \begin{bmatrix} 0 \\ -1 \\ 4 \end{bmatrix} \tag{1.14}$$
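As a quick numerical check, the system of equations (1.10)-(1.12) can be solved in NumPy (an illustrative sketch of ours):

```python
import numpy as np

# Coefficient matrix and right-hand side from equations (1.10)-(1.12)
A = np.array([[ 2, -1,  0],
              [-1,  2, -1],
              [ 0, -3,  4]])
b = np.array([0, -1, 4])

x = np.linalg.solve(A, b)
print(np.allclose(x, [0, 0, 1]))    # True: the solution is x = 0, y = 0, z = 1

# b is the same linear combination of the columns of A
recombined = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
print(np.allclose(recombined, b))   # True
```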
2x + 0y = 2z → x + 0y − z = 0 (1.17)
0x + 2y = z → 0x + 2y − z = 0 (1.18)
We will now write this system of equations in the form of Ax = b and attempt to
solve it using the concept of linear combinations of columns of coefficient matrix
A, as follows :
$$\begin{bmatrix} 1 & 0 & -1 \\ 0 & 2 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{1.19}$$

$$x \begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 2 \end{bmatrix} + z \begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{1.20}$$
Now, putting in the value y = 1, the solution to the system of equations turns out to
be x = 2 and z = 2, whereas if we put y = 2, the solution turns out to be x = 4
and z = 4. So the final balanced equation for water can now be written as
2H₂ + O₂ → 2H₂O.
The resulting system of equations has an infinite number of solutions. For each arbitrary
value of y, we get a corresponding combination of values for x and z as well.
As we can see, a very basic system of chemical equations can be easily solved using
matrix manipulation, vector dot products, linear combinations and vector addition
(as explained in the above sections). The illustration of chemical equations was
taken simply to drive home the point that these concepts can be applied to any
field and can consequently simplify computations.
• Pin joints : A pin jointed truss is a structure made up from separate compo-
nents by connecting them together at pinned joints or nodes, usually to form
a series of triangles.
• Aim :
Since the static equilibrium equations of a truss structure tell us that the
sum of the forces along x and y directions at all the joints should be equal
to 0, we will use this concept to form the equations in Matrix notation and
solve for finding the unknown variables - Bar forces and Reaction forces
along each beam.
• Procedure :
The truss structure is shown in figure 1, and a static analysis of this
structure can determine the unknown internal bar forces N12 , N13 , N23 and
reaction forces R2x , R2y , R3y . Internal bar forces are taken as positive.
Equations are now formed and represented using matrix notation as shown
below :
Joint 1: ΣFx = 0 : −cos α · N12 + cos β · N13 = 0
Joint 1: ΣFy = 0 : −sin α · N12 − sin β · N13 − 1000 = 0
Joint 2: ΣFx = 0 : cos α · N12 + N23 + R2x = 0
Joint 2: ΣFy = 0 : sin α · N12 + R2y = 0
Joint 3: ΣFx = 0 : −N23 − cos β · N13 = 0
Joint 3: ΣFy = 0 : sin β · N13 + R3y = 0
We now attempt to solve the system of equations and solve for the unknown
forces. The measurement of the angles and their corresponding cosine values
are as follows :
β = 36.86◦
α = 53.13◦
cos α = 0.6
cos β = 0.8
sin α = 0.8
sin β = 0.6
• Conclusion :
5a + 5b + 5c = 250 (1.23)
By considering the distribution of the 150 available units of milk :
3a + 4b + 2c = 150 (1.24)
Similarly, since 100 units of coffee are used, it must be the case that :
2a + b + 3c = 100 (1.25)
In other words, we have the system of equations as follows :
5a + 5b + 5c = 250 (1.26)
3a + 4b + 2c = 150 (1.27)
2a + b + 3c = 100 (1.28)
We solve this using elementary row operations, starting with the augmented ma-
trix, as follows :
$$\left[\begin{array}{ccc|c} 5 & 5 & 5 & 250 \\ 3 & 4 & 2 & 150 \\ 2 & 1 & 3 & 100 \end{array}\right] \rightarrow \left[\begin{array}{ccc|c} 1 & 1 & 1 & 50 \\ 3 & 4 & 2 & 150 \\ 2 & 1 & 3 & 100 \end{array}\right] \rightarrow \left[\begin{array}{ccc|c} 1 & 1 & 1 & 50 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right]$$
Therefore, the system is equivalent to:
a + b + c = 50
b−c=0
from which we obtain b = c and a = 50 − b − c = 50 − 2c. In other words, the
weekly production levels of ChocoB and ChocoC bars are equal, and the production
of ChocoA is (in thousands) 50 − 2c, where c is the production of ChocoC (and
ChocoB) bars. Clearly, none of a, b, c can be negative, so the production level, c,
of ChocoC bars must be such that a = 50 − 2c ≥ 0, that is, c ≤ 25. Therefore, the
maximum number of ChocoC bars which it is possible to manufacture in a week is
25000 (in which case the same number of ChocoB bars are produced and no ChocoA
bars will be manufactured).
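The general solution derived above can be verified numerically; this NumPy sketch (ours, for illustration) checks that b = c, a = 50 − 2c satisfies the system for any feasible c:

```python
import numpy as np

# Coefficient matrix and right-hand side of equations (1.26)-(1.28)
A = np.array([[5, 5, 5],
              [3, 4, 2],
              [2, 1, 3]])
rhs = np.array([250, 150, 100])

# The general solution derived above: b = c and a = 50 - 2c
for c in [0, 10, 25]:
    x = np.array([50 - 2 * c, c, c])   # (a, b, c)
    print(np.allclose(A @ x, rhs))     # True for every feasible c
```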
Summary
• Introduction to Vector addition and scalar multiplication and how they
combine to form linear combinations.
• References for definitions : David Lay, Norman Biggs and Aalborg Uni-
versity Journal
Chapter 2
On Solving Ax = b
2.1 Premise
Consider a matrix A of order m × n, where m denotes the number of rows and
n denotes the number of columns. Also consider the vector b having m rows and
vector x comprising n variables.¹ Then a given system of m linear equations in
n variables can be written as:
Ax = b
Objective: We attempt to solve the above for any given set of linear equations.
This is achieved by the algorithmic way of elimination which uses elementary row
operations. The key idea behind elimination is to obtain an upper triangular ma-
trix U . This matrix is further solved using back-substitution.
¹ All vectors considered here are column vectors, unless otherwise stated.
2.1.1 Rank
Definition 2.1.1 The rank of A is the number of pivots. This number is denoted
by r. The rank determines the structure of the reduced row echelon form R. There
are four cases, which are described in the subsequent sections.
• r = m =⇒ Every row is a pivot row and R has no zero rows. This signals
that each row is an independent row.
• We know that,
r = m =⇒ Ax = b has a solution for every right side b ∈ Rm and,
r = n =⇒ Ax = b has only one solution (if it has any),
thus, Ax = b has a unique solution for r = m = n.
• The null space N (A) contains only the zero vector x = 0 (because of full
column rank) and the column space C(A) = Rm (because of full row rank).
x + 2y + z = 2
3x + 8y + z = 12
4y + z = 2 (2.2)
From the above, we can see that R = I and the unique solution is given by x =
2, y = 1, z = −2.
• r < m =⇒ Not every row is a pivot row, and there are linear dependencies
among the rows.
• If b ∈ C(A), then since the null space contains only the zero vector,
N (A) = {0} (because of full column rank), Ax = b will have a unique solution.
• If b ∉ C(A), then Ax = b does not have a solution.
• The rref of A has the structure $R = \begin{bmatrix} I \\ 0 \end{bmatrix}$
x + 5y = 1
2x + 4y = 2
3x + 9y = 3
8x + 10y = 8 (2.5)
2.3.2 No solution
Consider the following system of linear equations:
x + 5y = 2
2x + 4y = 1
3x + 9y = 1
8x + 10y = 4 (2.8)
These equations can be represented in the form of a matrix A of order 4 × 2. We
solve Ax = b using elimination.
$$\begin{bmatrix} 1 & 5 \\ 2 & 4 \\ 3 & 9 \\ 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 1 \\ 4 \end{bmatrix} \implies [A \,|\, b] = \left[\begin{array}{cc|c} 1 & 5 & 2 \\ 2 & 4 & 1 \\ 3 & 9 & 1 \\ 8 & 10 & 4 \end{array}\right] \tag{2.9}$$
On performing row operations, we obtain U and R as follows:
$$[U \,|\, c] = \left[\begin{array}{cc|c} 1 & 5 & 2 \\ 0 & -6 & -3 \\ 0 & 0 & -2 \\ 0 & 0 & 3 \end{array}\right] \longrightarrow [R \,|\, d] = \left[\begin{array}{cc|c} 1 & 0 & -1/2 \\ 0 & 1 & 1/2 \\ 0 & 0 & -2 \\ 0 & 0 & 3 \end{array}\right] \tag{2.10}$$
From the above, we can see that $R = \begin{bmatrix} I \\ 0 \end{bmatrix}$.
Clearly, 0x + 0y ≠ −2 or 3. Hence, Ax = b is not solvable, as b ∉ C(A).
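For a tall inconsistent system like this, NumPy's least-squares routine exposes the failure: it returns full column rank together with a strictly positive residual, confirming that no exact solution exists (an illustrative sketch of ours):

```python
import numpy as np

# The inconsistent system (2.8): b is not in the column space of A
A = np.array([[1, 5],
              [2, 4],
              [3, 9],
              [8, 10]], dtype=float)
b = np.array([2, 1, 1, 4], dtype=float)

# lstsq returns the best approximate solution and the squared residual
x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(rank)             # 2 (full column rank)
print(residual[0] > 0)  # True -> Ax = b has no exact solution
```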
A(x + αv) = b
Ax + αAv = b
Ax = b (∵ Av = 0) (2.11)
x + 3y + 4z + 3w = 2
2x + 5y + 8z + 11w = 3 (2.12)
From the above, we can see that R = [ I | F ]. b is obtained using linear combina-
tions of the columns of A. Thus infinite solutions are obtained when b ∈ C(A).
• r < n =⇒ There are r pivot columns and (n − r) free columns. This signals
that there are infinite solutions if b ∈ C(A).
• The rref of A has the structure $R = \begin{bmatrix} I_r & F \\ 0 & 0 \end{bmatrix}$
1x + 2y + 2z + 2w = 1
2x + 4y + 6z + 8w = 5
3x + 6y + 8z + 10w = 6 (2.15)
The null space matrix N comprises the null space solutions for the set of equations.
In order to obtain the null space solution, set Rx = 0 and solve for x. We
further observe that,
$$N = \begin{bmatrix} -2 & 2 \\ 0 & -2 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -F \\ I \end{bmatrix} \tag{2.19}$$
The null space solution for R also turns out to be the null space solution for A.
Hence, AN = O, that is,
$$AN = \begin{bmatrix} 1 & 2 & 2 & 2 \\ 2 & 6 & 4 & 8 \\ 3 & 8 & 6 & 10 \end{bmatrix} \begin{bmatrix} -2 & 2 \\ 0 & -2 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} = O \tag{2.20}$$

(here the columns of A are ordered to match the variable ordering of N, with the pivot variables first).
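The null space computation can be cross-checked symbolically. The SymPy sketch below (ours, not from the notes) recovers a null space basis for the coefficient matrix of system (2.15) and confirms that A annihilates it:

```python
from sympy import Matrix

# Coefficient matrix of the system (2.15)
A = Matrix([[1, 2, 2, 2],
            [2, 4, 6, 8],
            [3, 6, 8, 10]])

# sympy returns a basis for N(A); with r = 2 pivots and n = 4 columns
# we expect n - r = 2 special solutions
basis = A.nullspace()
print(len(basis))              # 2
for v in basis:
    print((A * v).norm())      # 0: A v = 0 for each basis vector
```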
Let x be the production vector that lists the output for each sector and let d be
the final demand of the non-productive consumption sector. A will represent the
intermediate demand matrix resulting from the interdependencies among the in-
dustries. This can be represented as follows:
$$\underbrace{x}_{\text{amount produced}} = \underbrace{Ax}_{\text{intermediate demand}} + \underbrace{d}_{\text{final demand}}$$
Here, aij represents the output produced by industry i that will be used in the
production of industry j while di represents the final external demand for industry
i. Then x will be computed as:
x = (I − A)−1 d (2.21)
• To produce one unit of services, the industry uses 0.25 units of electricity,
0.4 units of its own production and 0.15 units of automobile.
• To produce one unit of automobile, the industry consumes 0.2 units of elec-
tricity, 0.5 units of services and 0.1 units of its own production.
• The final demand for electricity, services and automobile was 50,000 units,
75,000 units and 1,25,000 units respectively.
How much should each industry produce in order to fulfil the total demand?
$$(I - A)^{-1} = \begin{bmatrix} 1.464 & 0.803 & 0.771 \\ 1.007 & 2.488 & 1.606 \\ 0.330 & 0.503 & 1.464 \end{bmatrix}$$

Clearly, the rank of the above matrix is equal to the order of the matrix, i.e. r =
m = n. Therefore, this system will have a unique solution in x, which is given by

$$x = \begin{bmatrix} 1.464 & 0.803 & 0.771 \\ 1.007 & 2.488 & 1.606 \\ 0.330 & 0.503 & 1.464 \end{bmatrix} \cdot \begin{bmatrix} 50000 \\ 75000 \\ 125000 \end{bmatrix} = \begin{bmatrix} 229921.59 \\ 437795.27 \\ 237401.57 \end{bmatrix}$$
Thus, the electricity generation industry will produce 229921.59 units of electricity,
the service industry will produce 437795.27 units and the automobile industry will
produce 237401.57 units to cater to the combined internal and external demand.
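The Leontief computation above is a single matrix-vector product; the following NumPy sketch (ours) uses the rounded inverse given in the text, so it reproduces the production vector to within rounding error:

```python
import numpy as np

# (I - A)^{-1} as computed above (entries rounded to three decimals)
leontief_inv = np.array([[1.464, 0.803, 0.771],
                         [1.007, 2.488, 1.606],
                         [0.330, 0.503, 1.464]])

# Final demand for electricity, services and automobiles
d = np.array([50_000, 75_000, 125_000])

x = leontief_inv @ d
print(x)   # production levels, close to the figures quoted in the text
```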
Electricity is one of the key components demanded for production in any industry, and the supply of electricity
to industries far and wide happens through a series of lines and networks.
The subsequent application in physics throws light on a very fundamental
set of laws concerning networks, loops and junctions: Kirchhoff's Voltage
and Current Laws.
Kirchhoff's laws are conservation laws that form the foundation of all of circuit
theory. Kirchhoff's voltage law states that the directed sum of potential drops
around any closed loop is zero. Kirchhoff's current law states that the algebraic
sum of currents at any junction equals zero. Using the idea of networks
and graphs one can construct basic circuits, which can then be represented in
matrix form. Further, we show how this is linked to the Ax = b form; it is also
shown that Kirchhoff's laws directly embody this form, and the idea of rank
helps in extracting meaningful properties which are of relevance to physics.
[Figure: a circuit with four nodes x1, x2, x3, x4 connected by six edges b1, ..., b6.]
Consider the above circuit with four nodes labelled 1, 2, 3 and 4. Let xi be the
electric potential at the ith node. The potential difference on the edge b1 is given
by x2 − x1 . Thus the vector b can be constructed as follows.
$$b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix} = \begin{bmatrix} x_2 - x_1 \\ x_3 - x_2 \\ x_1 - x_3 \\ x_4 - x_1 \\ x_4 - x_2 \\ x_4 - x_3 \end{bmatrix} = \text{vector of potential differences between the nodes} \tag{2.22}$$
The above equation can now be rewritten in order to bring out the Ax = b form
as shown below. The matrix A is of order 6 × 4, with 6 edges and 4 nodes.
$$A \longrightarrow \begin{bmatrix} -1 & 0 & 0 & 1 \\ 0 & -1 & 0 & 1 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} I_3 & F \\ 0 & 0 \end{bmatrix}$$

(after elimination, with rows swapped so that the pivot rows come first; the final matrix is the rref R).
We know that the null space matrix contains the vectors in the N (A). The infor-
mation about the null space matrix N is found in the matrix F which contains
pivot variables under elimination. Thus, the null space solution of Ax is:
$$N = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \alpha \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$
The above result is significant in physics. It means that if the potential differences
between the nodes are zero (b = 0), then the nodes are at the same electric
potential. In other words, these nodes are equipotential.
Using Ohm's law and the information about the conductance of each edge, one can
create a conductance matrix, G. This can then be used to obtain the currents yi on
each edge. The vector denoted by y has entries comprising the currents in all the
edges. Thus Gb = y, where G is the 6 × 6 conductance matrix, b is the vector of
potential differences and y is the vector of currents.
[Figure: the same circuit, with currents y1, ..., y6 flowing along the six edges between nodes 1-4.]
The currents along with their flow directions are shown in the above figure. Now
Kirchhoff's current law can be written as Aᵀy = 0. Thus, vectors in the null
space of Aᵀ correspond to collections of currents in the loops that satisfy Kirchhoff's
laws. This is summarized in the following schematic.
x = {x1 , x2 , x3 , x4 } — potentials at the nodes, related to b through Ax = b
Aᵀy = 0 — Kirchhoff's Current Law
The Aᵀy = 0 equation is written out, and using the method of elimination the
reduced row echelon form (rref) is obtained; the rank r is evident in the
number of pivots.
$$\begin{bmatrix} -1 & 0 & 1 & -1 & 0 & 0 \\ 1 & -1 & 0 & 0 & -1 & 0 \\ 0 & 1 & -1 & 0 & 0 & -1 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Elimination on $[A^T \,|\, 0]$ gives the upper triangular matrix U:

$$U = \begin{bmatrix} -1 & 0 & 1 & -1 & 0 & 0 \\ 0 & -1 & 1 & -1 & -1 & 0 \\ 0 & 0 & 0 & -1 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

Continuing to the rref and swapping columns 3 and 4 so that the pivot columns come first:

$$R = \begin{bmatrix} 1 & 0 & 0 & -1 & -1 & -1 \\ 0 & 1 & 0 & -1 & 0 & -1 \\ 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} I_3 & F \\ 0 & 0 \end{bmatrix}$$
The null space matrix contains the vectors in the N (AT ). The information about
the null space matrix N is found in the matrix F which contains pivot variables
under elimination. Thus, the null space solution of AT y is:
$$N = \begin{bmatrix} y_1 \\ y_2 \\ y_4 \\ y_3 \\ y_5 \\ y_6 \end{bmatrix}: \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & -1 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \implies y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{bmatrix}: \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & -1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

(the rows of N are ordered with the pivot variables y1, y2, y4 first; restoring the natural ordering gives the columns of y).
Each of these null space solutions corresponds to a current flow around a particular loop. We also notice that linear combinations of these
solutions give the other loops. The rank r = 3 therefore gives an insight into the
number of independent loops in the network.
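The three loop solutions found above can be verified against Kirchhoff's current law in one line (a NumPy sketch of ours):

```python
import numpy as np

# A^T from the circuit: rows are nodes, columns are edges
At = np.array([[-1,  0,  1, -1,  0,  0],
               [ 1, -1,  0,  0, -1,  0],
               [ 0,  1, -1,  0,  0, -1],
               [ 0,  0,  0,  1,  1,  1]])

# The three loop currents found above as null space solutions of A^T y = 0
# (columns are the loop vectors y)
loops = np.array([[1, 1, 1,  0, 0, 0],
                  [1, 0, 0, -1, 1, 0],
                  [1, 1, 0, -1, 0, 1]]).T

print(np.allclose(At @ loops, 0))   # True: each loop satisfies KCL
```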
Summary
• The four different cases for the linear equations (reflected in rref, R) de-
pending on the rank r are:
Note that there are certain situations wherein we do not require a specific sign on
the quadratic form for all vectors, but only over a restricted set of values. In this
situation, we say that A is positive definite subject to the constraint bx = 0 if
xT Ax > 0 for all x ≠ 0 with bx = 0.
Chapter 3. Matrix Definiteness and Cramers Rule 42
There are necessary and sufficient conditions for the definiteness criteria. First we note the
definition of minor matrices: these are submatrices of A formed by eliminating k
rows and the same-numbered k columns. We say that the naturally ordered or
nested principal minor matrices of A are the minor matrices given by:
$$a_{11}, \qquad \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \qquad \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \tag{3.2}$$
Minor determinants are simply the determinants of the minors. Note now that if
we are given a square matrix A and a vector b then we can border A by b in the
following manner. Note that what this gives us is essentially a bordered matrix.
$$\begin{bmatrix} 0 & b_1 & b_2 & \cdots & b_n \\ b_1 & a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_n & a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \tag{3.3}$$
Now it turns out that the border-preserving principal minor matrices are the
principal minors of this bordered matrix that include the border elements; they are
typically used to test definiteness under constraint conditions. We say that the
square matrix A is:
• Positive definite if and only if the principal minor determinants are all pos-
itive.
• Negative definite if and only if the principal minor determinants of order k
have a sign as (−1)k for k = 1, 2, · · · , n.
• Positive definite subject to the constraint bx = 0 if and only if the border-preserving
principal minors are all negative (in general the condition involves the sign
(−1)m , where m is the number of constraints).
• Negative definite subject to constraint bx = 0 if and only if the border
preserving principal minors have sign (−1)k for k = 1, 2, · · · , n.
3.2 Analysis
Given a vector x in the space Rn and a positive real number e, we can define an
open ball of radius e at x as Be (x) = {y ∈ Rn : ‖y − x‖ < e}.
We note that a set of points A is an open set if for every x in A there exists some
Be (x) which is contained in A. Further we note that if x is in an arbitrary set A and
there exists some e > 0 such that Be (x) is in A, then x is said to be in the interior
of A.
3.3 Calculus
Calculus gives us the tools to approximate certain functions with linear functions.
Given a function f : R → R we can define its derivative at a point x∗ as

$$f'(x^*) = \lim_{h \to 0} \frac{f(x^* + h) - f(x^*)}{h}$$

Note that an important interpretation of the gradient is that it points in the
direction in which the function f increases most rapidly.
Now note that since f (x∗ ) = a, we have the following formula for the tangent
hyperplane:
H(a) = {x : Df (x∗ )(x − x∗ ) = 0} (3.18)
With this we can state that if x is a vector in the tangent hyperplane, then x − x∗
is orthogonal to the gradient of f at x∗ . In other words, a tangent hyperplane is
defined by the set of points x such that x − x∗ is orthogonal to the gradient, at x∗ ,
of a function f on whose level set the point x∗ lies.
3.5 Optimization
Let f : R → R be a function; we say that this function achieves its maximum
at x∗ if f (x∗ ) ≥ f (x) for all x. Along similar lines, we say that x∗ is a minimum
point if f (x∗ ) ≤ f (x) for all x. Suppose now that a function achieves a maximum
at x∗ . We know from calculus that at this point the first derivative
is equal to 0 and the second derivative is less than or equal to 0 at
x∗ . This can be depicted as:
f 0 (x∗ ) = 0 (3.19)
f 00 (x∗ ) ≤ 0 (3.20)
Similarly, for minimization the second order condition becomes f 00 (x∗ ) ≥ 0.
3.5.1 Concavity
A function is said to be concave if it satisfies the following condition:
f (tx + (1 − t)y) ≥ tf (x) + (1 − t)f (y)
A concave function always has its second derivative less than or equal to 0, and any
stationary point of a concave function is a maximum. Now a convex function is one that satisfies the following property:
f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y) (3.23)
A convex function has its second derivative greater than or equal to 0, and any
stationary point of a convex function is a minimum.
Now the two separate first order conditions are given by:

$$\frac{\partial f(x_1, x_2)}{\partial x_1} = 0 \tag{3.32}$$

$$\frac{\partial f(x_1, x_2)}{\partial x_2} = 0 \tag{3.33}$$
Generally, for n choice variables we define the gradient vector as:

$$Df(x) = \left( \frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_n} \right) \tag{3.34}$$

and the first order condition becomes:

$$Df(x^*) = 0 \tag{3.35}$$
Now the second order conditions of this problem can be expressed in the form of
the Hessian matrix, given by:

$$H = \begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix} \tag{3.36}$$

where each entry is the second partial derivative

$$f_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \tag{3.37}$$
Now from calculus we can say that for a maximization problem, at the optimal choice
x∗ the Hessian matrix must be negative semidefinite. Therefore for any vector
(h1 , h2 ) it must satisfy:

$$\begin{bmatrix} h_1 & h_2 \end{bmatrix} \begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} \leq 0 \tag{3.38}$$

This can also be written as hHhT ≤ 0; note that the corresponding condition
for minimization would require the same matrix to be of a positive semidefinite
form.
We can now simply use Cramer's rule to find the respective rates of change
of the optimal choices:

$$\frac{\partial x_1}{\partial a} = \frac{\begin{vmatrix} -f_{13} & f_{12} \\ -f_{23} & f_{22} \end{vmatrix}}{\begin{vmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{vmatrix}} \tag{3.44}$$
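The determinant ratio in (3.44) is easy to evaluate numerically. In the sketch below the second-derivative values f11, ..., f23 are hypothetical numbers chosen by us purely for illustration; the result is cross-checked by solving the underlying linear system directly:

```python
import numpy as np

# Hypothetical second-derivative values, for illustration only
f11, f12, f21, f22 = -2.0, 1.0, 1.0, -3.0   # Hessian entries
f13, f23 = 0.5, 0.25                        # cross-partials with the parameter a

H = np.array([[f11, f12],
              [f21, f22]])
numerator = np.array([[-f13, f12],
                      [-f23, f22]])

# Cramer's rule (3.44): ratio of determinants gives the comparative static
dx1_da = np.linalg.det(numerator) / np.linalg.det(H)
print(dx1_da)

# Cross-check: solve H (dx/da) = (-f13, -f23) directly
check = np.linalg.solve(H, np.array([-f13, -f23]))
print(np.isclose(dx1_da, check[0]))   # True
```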
Chapter 4
Multivariate Data Analysis - Basic concepts
• Find the eigenvalues of E = AT A. Take their positive square roots to get the
singular values:

$$\sigma_{A1} = \sqrt{\lambda_{E1}}, \qquad \sigma_{A2} = \sqrt{\lambda_{E2}} \tag{4.6}$$
• Find the matrix U by using the following rearranged form of the SVD equation:

$$AV = U\Sigma \tag{4.10}$$
$$Av_1 = u_1 \sigma_1 \tag{4.11}$$
$$u_1 = \frac{1}{\sigma_1} A v_1 \tag{4.12}$$
• Present the reduced dimensional form of the matrix by assembling A as the
sum of r rank-1 matrices, each the outer product of a pair of orthonormal
singular vectors:

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T \tag{4.13}$$
• Finally, the truncated form of the SVD, which is essentially the reduced rank
approximation of A, is given as:

$$A = U_r \Sigma_r V_r^T \tag{4.14}$$
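The rank-1 sum (4.13) and the truncated form (4.14) give the same matrix, which can be checked directly with NumPy's SVD (an illustrative sketch of ours on a random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-r truncation: keep the r largest singular values, as in eq. (4.14)
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Equivalent: the sum of r rank-1 matrices sigma_i u_i v_i^T, as in eq. (4.13)
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))

print(np.allclose(A_r, A_sum))        # True
print(np.linalg.matrix_rank(A_r))     # 2
```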
$$A_L^{-1} = (A^T A)^{-1} A^T \tag{4.17}$$
$$A_L^{-1} = V_n \Sigma^{-1} U_m^T \tag{4.18}$$
In a similar way we can compute the right inverse of a matrix in the special case
n > m = r. Here the left null space does not exist and we can compute the
right inverse as follows:
$$R_m = AA^T \tag{4.19}$$
$$R_m^{-1} R_m = I_m \tag{4.20}$$
$$A_R^{-1} = A^T (AA^T)^{-1} \tag{4.21}$$
$$A_R^{-1} = V_n \Sigma^{-1} U_m^T \tag{4.22}$$
In the case r < m and r < n, both null spaces are non-trivial and neither one-sided
inverse exists. In this case we compute the pseudoinverse, which is given by:

$$A^+ = V_r \Sigma^+ U_r^T \tag{4.23}$$
$$A^+ u_i = \frac{1}{\sigma_i} v_i \tag{4.24}$$
Let in = [1, 1, 1, · · · , 1]T be the vector of n ones. The sum of all the entries of the
jth column is thus represented as:

$$i_n^T x_j = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{bmatrix} \tag{4.28}$$
And finally the average can be expressed as:

$$\bar{x}_j = \frac{1}{n} i_n^T x_j \tag{4.29}$$
Now we note that the vector with all mean values has entries that are all equal,
so we can write that expression as:

$$\begin{bmatrix} \bar{x}_j \\ \vdots \\ \bar{x}_j \end{bmatrix} = \bar{x}_j i_n = i_n \bar{x}_j \tag{4.31}$$
J is called the centering matrix; it is symmetric (J = J T ), idempotent (J 2 = J),
and satisfies iTn J = 0. The last condition means that the sum of deviations is 0.
We must additionally note some key properties of premultiplying a matrix, say
sJA (where s is some scalar), with the centering matrix:
iT (sJA) = siT JA = 0 (4.36)
J(sJA) = sJJA = sJA (4.37)
Remember the important point that the average of the centered values in the new
column is 0. This can be checked with computation as well [1]. Suppose
there is a centered column of the jth variable in the data matrix, called yj = Jxj .
When we take the average we essentially compute:
iT yj = iT (Jxj ) = 0 (4.38)
This process of transforming raw values into standard values is called standard-
ization and in this case the mean of the standard vector is 0 and variance is 1.
Z = JXD−1 (4.48)
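The centering and standardization steps above can be verified with a short NumPy sketch (NumPy and the random data here are assumptions of this example, not part of the text):

```python
import numpy as np

n, p = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))

# Centering matrix J = I - (1/n) 1 1^T.
i_n = np.ones((n, 1))
J = np.eye(n) - (i_n @ i_n.T) / n

# J is symmetric and idempotent, and 1^T J = 0.
assert np.allclose(J, J.T)
assert np.allclose(J @ J, J)
assert np.allclose(i_n.T @ J, 0)

# Centered data: each column of Y = JX has mean 0.
Y = J @ X
assert np.allclose(Y.mean(axis=0), 0)

# D = diagonal matrix of per-column standard deviations (population, 1/n).
# Z = J X D^{-1} standardizes: each column has mean 0 and variance 1.
D = np.diag(X.std(axis=0))
Z = J @ X @ np.linalg.inv(D)
assert np.allclose(Z.mean(axis=0), 0)
assert np.allclose(Z.var(axis=0), 1)
```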
4.3 Correlations
We shall now consider an n × p data matrix wherein n refers to the number of observations or individuals and p refers to the number of attributes or factors of the data.
This term is computed using the centered values of the variable vectors and is
positive for positive correlation, negative for negative correlation and 0 for no
correlation. We can represent the same formula in vector form as:
v_jk = (1/n) [x_1j − x̄_j · · · x_nj − x̄_j] [x_1k − x̄_k · · · x_nk − x̄_k]^T   (4.50)
Note that in the correlation coefficient the 1/n terms can be dropped, since they simply cancel out. This equation can be further simplified.
Note that the formula for deriving the cosine of the angle between two vectors is
also given by the same formula as the correlation coefficient. So the angle between
the two vectors is essentially a measure of their correlation as well. Additionally
note that the correlation coefficient between pairs of standardized variables and
raw variables is the same. Now, the p × p covariance matrix of the data matrix X
is given by :-
V = (1/n) [ x_j^T J^T J x_k ] ,  j, k = 1, · · · , p   (4.54)
Factoring out the rows and columns:
V = (1/n) [ x_1^T J^T ; · · · ; x_p^T J^T ] [ J x_1 · · · J x_p ]   (4.55)
V = (1/n) [ x_1^T ; · · · ; x_p^T ] J^T J [ x_1 · · · x_p ]   (4.56)
V = (1/n) X^T J^T J X = (1/n) X^T J X = (1/n) Y^T Y   (4.57)
Similarly we can derive the correlation matrix. Since covariances among pairs
of the standardized variables are nothing but the correlations between the raw
variables, we can denote the correlation matrix in terms of z as follows :-
R = (1/n) [ z_j^T J^T J z_k ] ,  j, k = 1, · · · , p   (4.58)
R = (1/n) [ z_j^T z_k ] ,  j, k = 1, · · · , p   (4.59)
The above result is obtained with the property J z = z. Finally we can write the correlation matrix as:
R = (1/n) Z^T Z   (4.60)
Now recall that previously we had shown that Z = J X D^{-1}. We can substitute this into the expression for R to obtain:
R = (1/n) D^{-1} X^T J^T J X D^{-1} = D^{-1} V D^{-1}   (4.61)
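As a quick numerical check of V = (1/n) X^T J X and R = D^{-1} V D^{-1}, here is a sketch assuming NumPy (the random data is purely illustrative):

```python
import numpy as np

n, p = 6, 3
rng = np.random.default_rng(1)
X = rng.normal(size=(n, p))

# Centering matrix J = I - (1/n) 1 1^T.
J = np.eye(n) - np.ones((n, n)) / n

# Covariance matrix V = (1/n) X^T J X (population covariance, divisor n).
V = X.T @ J @ X / n
assert np.allclose(V, np.cov(X, rowvar=False, bias=True))

# Correlation matrix R = D^{-1} V D^{-1}, D = diag of column std devs.
D_inv = np.diag(1.0 / X.std(axis=0))
R = D_inv @ V @ D_inv
assert np.allclose(R, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(R), 1)
```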
T (v) = Av (4.63)
u = c1 v1 + c2 v2 + · · · + cn vn (4.69)
u = 6 − 4x + 3x2 (4.71)
When we operate this matrix on the basis vectors v1 , v2 , v3 we would get the result
of the derivative transformation of the basis vectors respectively as 0, v1 , v2 . We
can write the input function also as a + bx + cx2 and the corresponding derivative
output as b + 2cx. Hence we can write this operation as :-
[ 0 1 0 ; 0 0 2 ] [ a ; b ; c ] = [ b ; 2c ]   (4.76)
Note that in the input scenario, our basis vectors were 1, x, x^2 and their coordinates a, b, c gave us the u vector on which the transformation was performed. But now, in the output, our basis vectors have changed to 1, x and their coordinates b, 2c describe the transformed u vector in a lower dimensional space.
NOTE: Any vector v is a linear combination of input basis vectors with coordinates as the scalar weights: c_1 v_1 + · · · + c_n v_n. In the same manner, T(v) will be the linear combination of output basis vectors with coordinates as the scalar weights: c_1 T(v_1) + · · · + c_n T(v_n).
examples we will consider different bases to illustrate how the matrix is formed.
For our case T (v) = v we will change the basis from v to w where each vector v
is a linear combination of w1 and w2 .[2] The collection of input and output bases
vector in a matrix form are given below :-
[v_1, v_2] = [ 3 6 ; 3 8 ]   (4.80)
[w_1, w_2] = [ 3 0 ; 1 2 ]   (4.81)
Notice the change of bases by the following relation :-
WB = V (4.84)
B = W −1 V (4.85)
[w1 , w2 ][B] = [v1 , v2 ] (4.86)
[ 3 0 ; 1 2 ] [ 1 2 ; 1 3 ] = [ 3 6 ; 3 8 ]   (4.87)
So we note a rather important point here: when the input basis is in V and the output basis is in W, the identity transformation T = I is actually doing the operation B = W^{-1} V. Another interpretation of this is as follows. Consider a vector u that can be written using both these bases:
u = c_1 v_1 + · · · + c_n v_n   (4.88)
u = d_1 w_1 + · · · + d_n w_n   (4.89)
Then we have the relation :-
V c = Wd (4.90)
So we notice that the coordinates of the new basis or the coefficients can be ex-
pressed as :-
d = W −1 V c (4.91)
d = Bc (4.92)
B = W −1 V (4.93)
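The change-of-basis relations (4.84)-(4.93) can be checked directly with the bases from the text, in a short NumPy sketch (NumPy and the test vector c are assumptions of this example):

```python
import numpy as np

# Input and output bases from the text: columns are basis vectors.
V = np.array([[3., 6.],
              [3., 8.]])      # [v1, v2]
W = np.array([[3., 0.],
              [1., 2.]])      # [w1, w2]

# Change-of-basis matrix B = W^{-1} V, so that W B = V.
B = np.linalg.inv(W) @ V
assert np.allclose(W @ B, V)
assert np.allclose(B, [[1., 2.], [1., 3.]])   # matches equation (4.87)

# A vector u with coordinates c in the V basis has coordinates d = B c
# in the W basis, since V c = W d.
c = np.array([2., -1.])
d = B @ c
assert np.allclose(V @ c, W @ d)
```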
Therefore the elements a11 , · · · , am1 form the first column of A. As an example
when the transformation T is the derivative operator and the first basis vector is
1 then T (v1 ) = 0 that is the first column of A is the zero vector. Suppose the
input basis vectors vi are : 1, x, x2 , x3 . The output basis vectors wi are : 1, x, x2 .
Transformation T is the derivative operator and A is the derivative matrix. We
define the various processes as follows :-
v = c_1 + c_2 x + c_3 x^2 + c_4 x^3   (4.96)
dv/dx = 1c_2 + 2c_3 x + 3c_4 x^2   (4.97)
Ac = [ 0 1 0 0 ; 0 0 2 0 ; 0 0 0 3 ] [ c_1 ; c_2 ; c_3 ; c_4 ] = [ c_2 ; 2c_3 ; 3c_4 ]   (4.98)
Note additionally that every v is a combination of the input basis vectors and every
T (v) is a combination of the output basis vectors. Also note that the result of Ac is
the coefficients or the coordinates of the transformed vector as per the new basis.
• Now we will see that when we choose our bases as the eigenvectors of a projection matrix, we get bases such as:
v_2 = w_2 = (1, 1)   (4.104)
We would notice that the projection gives T(v_1) = v_1 and therefore λ_1 = 1, while it gives T(v_2) = 0 and therefore λ_2 = 0.
Remember that when we have an identity transformation, or when our input and output bases are the same, we essentially change our bases from standard to new bases by using B^{-1}, where B = W^{-1} V. Since in the standard basis our projection or transformation matrix is A, we can change the bases to the eigenvector bases.
Here b refers to the new basis consisting of the eigenvectors, and std refers to the standard basis. In case our input and output bases are different, we change our bases to the singular basis vectors, which are the result of an SVD on the T matrix.
We can transform our original transformation matrix A by :-
B_out^{-1} A B_in = U^{-1} A V = Σ   (4.107)
Here the matrix V contains the input basis vectors and U contains the output basis
vectors.
Chapter 5
Singular Value Decomposition
5.2 Introduction
We decompose matrices in order to make our computation easier and to reveal
relationships among attributes. Decomposing a matrix with the method of diagonalization is only possible for a square matrix; an SVD, however, can be applied to a matrix of any size. The SVD happens to be one of the most popular
decompositions since it reveals the four fundamental subspaces that make up the
entire vector space, namely :- The row space, the column space, the null space
and the null space of the transposed matrix. Once we decompose any matrix to
reveal the elements of these fundamental subspaces, it is possible to easily identify
the relevant and most important properties of that matrix. We are instantly able
to see the relationships between rows and columns as well identify correlations
or dependencies among various attributes. In addition, we find that the Singular
values in the decomposed matrix reveal important information about the effective
rank of a matrix. As we go along exploring this method of decomposition, we
will see how the SVD results in a dimensionality reduction of a matrix. In essence
this means that any large matrix can be decomposed or compressed into a ma-
trix with a very small effective rank and as a result our computations can become
much faster and more efficient. There are wide-ranging applications of the SVD, as we will see later on. It is used extensively in disciplines like economics, finance, engineering, physics, social science, data analysis, statistical modelling and many others. It is of especially great help when applied to big data. We will now develop the mathematics behind the left and right inverses of a matrix using the SVD, preceded by a small recap.
A = [ u_1 u_2 · · · u_r · · · u_m ] · diag(σ_1 , σ_2 , · · · , σ_r , 0, · · · , 0) · [ v_1^T ; v_2^T ; · · · ; v_n^T ]   (5.1)
A_{m×n} = U_{m×r} Σ_{r×r} V^T_{r×n}   (5.2)
A = Σ_{i=1}^{r} σ_i u_i v_i^T   (5.3)
Equation (5.2) shows the truncated form of the SVD, i.e. the rank-r approximation of A, and equation (5.3) shows how matrix A is assembled from a summation of r rank-1 matrices.
What are singular values? (A recap): The diagonal entries of the matrix Σ are called the singular values of a matrix and are computed by taking the positive square root of the eigenvalues of A A^T. They are typically arranged in descending order in the matrix:
σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0
where r denotes the effective rank of the matrix and is also the number of non-zero singular values present in the diagonal matrix.
NOTE: In equation 7, the expression (U^T_{m×m} U_{m×m}) resolves to the identity matrix I_{m×m}.
The matrix A maps the vector of unknowns x ∈ R^n, which belongs to the row space of A, to b ∈ R^m, which belongs to the column space of A. Notice here that since all the rows and columns of matrix
A are independent, the N (A) and N (AT ) do not exist. This is a very important
result. Since the null space and the left null space do not exist, that essentially
implies that both the null spaces contain only the zero vector. It is due to this
reason that the general purpose inverse of A exists and there is a unique solution
for the system of equations.
Chapter 5. Singular Value Decomposition 68
cases as well. One involving the situation where the vector b belongs to the C(A)
and one case where it does not. They are explained below :
• Case 1: b_{m×1} ∈ C(A): In this case the vector b lies in C(A), and hence there will be only one solution to the system of equations.
• Case 2: b_{m×1} ∉ C(A): In this case we will project the vector b onto the column space of A. This in turn means that we are solving for a least squares solution vector. The equations are given by:
A x_LS = b″ ∈ C(A)   (5.12)
A^T A x_LS = A^T b   (5.13)
x_LS = (A^T A)^{-1} A^T b
b″ = P b
P = A (A^T A)^{-1} A^T   (5.14)
P = A L_n^{-1} A^T   (5.15)
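The least-squares solution and projection matrix above can be illustrated with a small NumPy sketch (the tall matrix A and the vector b are hypothetical, chosen so that b is not in C(A)):

```python
import numpy as np

# Tall matrix with independent columns (m > n = r): left inverse exists.
A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([1., 2., 2.])   # generally NOT in C(A)

# Least-squares solution x_LS = (A^T A)^{-1} A^T b.
x_ls = np.linalg.inv(A.T @ A) @ A.T @ b
assert np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=None)[0])

# Projection matrix P = A (A^T A)^{-1} A^T projects b onto C(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T
b_proj = P @ b                      # this is b'' in the text
assert np.allclose(A @ x_ls, b_proj)

# The residual b - b'' is orthogonal to the column space of A.
assert np.allclose(A.T @ (b - b_proj), 0)
```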
• This last step leads us to a contradiction since our initial assumption is vio-
lated. Hence we conclude the following.
• L_n z_n = A^T A z_n = O_n can never happen, and hence the null space does not exist.
L_n = A^T A   (5.16)
L_n^{-1} L_n = I_n
A_L^{-1} = L_n^{-1} A^T   (5.18)
5.6.3 Notes
Finally we note that if there is a vector in N (AT ) then it will be mapped to the
zero vector and if there is a vector in the C(A) then it will stay there. This can be
further validated using the left inverse in the following equations :
Chapter 6
Principal Component Analysis
The setting behind the subsequent explanations relies on an example that considers a 3-D space wherein we have attached a mass to the end of a fixed, frictionless spring. We aim to measure its oscillations or frequency along the x axis. So we essentially place three cameras A, B, C at three arbitrary viewing angles (not knowing which axis is in fact the x axis). After this we record measurements of the position of the mass over a certain amount of time. Consider that it moves with a frequency of 120 Hz.
Chapter 6. Principal Component Analysis 72
Let X be our original dataset where each column represents a recording of the
position of the mass at a certain point of time and essentially it is a 6 × 72000
matrix.[4] Now let Y be another dataset that is related to the original dataset by
linear transformation P . Y is our new representation of our dataset and is given
by :-
PX = Y (6.3)
Now we will define a few variables for further interpretation:
• p_i are the rows of P
• x_i are the columns of X
• y_i are the columns of Y
• Also remember that P is a matrix that transforms X into Y
• P can be treated as a matrix that rotates and then stretches X in the process of transformation
• The rows of P: [p_1, · · · , p_m] are the new basis vectors that express the coordinates of the X columns
We can see how this transformation plays out in matrices :-
PX = [ p_1 ; · · · ; p_m ] [ x_1 · · · x_n ] = [ p_1 · x_1 · · · p_1 · x_n ; · · · ; p_m · x_1 · · · p_m · x_n ]   (6.4)
6.2 Variance
We realise at this point that we need to keep the noise in our measurements low so as to get the most information out of our signal. Also, all noise is measured relative to the signal. [4] With this we get the signal-to-noise ratio (SNR), which is a ratio of variances:
SNR = σ²_signal / σ²_noise   (6.6)
A high SNR indicates precise measurement, while a low SNR indicates more noise in the data. Each camera's motion recording should reveal the direction of motion of our mass. An additional point to note is that measuring multiple variables causes redundancy or confounding: two measurements in this space may be correlated. [4] As a result we can say that measuring one variable is essentially enough to express the data in a concise manner. This is the fundamental idea behind dimensionality reduction.
6.2.1 Covariance
In the case of 2 variables it is somewhat easy to expose confounding by fitting a best-fit line to the data and assessing the quality of the fit. We will now generalize this notion to any dimension. Consider two sets of measurements or samples with zero means:
A = [a1 , · · · , an ], B = [b1 , · · · , bn ] (6.7)
The variance of A and B are given as follows :-
σ²_A = (1/n) Σ_{i=1}^{n} a_i² ,  σ²_B = (1/n) Σ_{i=1}^{n} b_i²   (6.8)
If A and B in the above example are considered to be row vectors we can write
the covariance formula in matrix form as follows :-
σ²_ab = (1/n) a b^T   (6.10)
Further expanding this idea, if we consider the matrix X as containing many such
row vectors of the form a and b then we can simultaneously find the covariance
between each pair of these measurement vectors by computing the covariance
matrix in the following way :-
C_X = (1/n) X X^T   (6.11)
Remember that this is a square m × m matrix with variance of each sample mea-
surement vector along the diagonals of the matrix. [5]The off diagonal terms
are the covariance terms between various pairs of sample measurement vectors.
Remember two crucial points :-
• The high values in diagonal elements reflect the interesting structure that
explains the significant variance of the measurement that interests us.
• The off diagonal terms should be made 0 and in essence, the effect of the y
measurements that cause confounding is decoupled from our measurements.
• Now find another direction along which variance is maximised, but this time restrict the search to directions orthogonal to those already found. This is due to the orthonormality constraint. Keep saving these vectors as p_i.
Note that the resultant ordered set of vectors p_i are precisely what are known as principal components. The importance of rank-ordering these direction vectors is that we can easily obtain the importance of each direction.
C_Y = P ( (1/n) X X^T ) P^T   (6.15)
C_Y = P C_X P^T   (6.16)
Additionally we note that any square symmetric matrix is in fact diagonalized by an orthogonal matrix of its eigenvectors. This relation of similarity transformation or diagonalization is given by:
A = E D E^T   (6.17)
If we now choose the rows of P to be the eigenvectors of C_X, i.e. P = E^T and hence E = P^T, then:
C_Y = P C_X P^T   (6.19)
C_Y = P (E D E^T) P^T   (6.20)
C_Y = P (P^T D P) P^T   (6.21)
C_Y = (P P^T) D (P P^T)   (6.22)
C_Y = (P P^{-1}) D (P P^{-1})   (6.23)
C_Y = D   (6.24)
And hence we find that with that particular choice of P our CY has been diago-
nalized. Our results can be summarized as follows :-
• The principal components of X are the eigenvectors of C_X = (1/n) X X^T.
• The ith diagonal value of CY is the variance of X along the principal compo-
nent pi .
• The process involves first centering the values of X and then finding the
eigenvectors of CX .
• u_i · u_j = 1 if i = j and 0 if i ≠ j.
• Xvi = σi ui
Note that the sets of eigenvectors v and the vectors u are orthonormal vectors in r
dimensional space. Also Σ is a diagonal matrix as previously shown with diagonal
elements containing the singular values in a rank ordered manner. Finally when
we fill up all the v and u vectors into matrices we get the decomposition of X as
follows :-
X = U ΣV T (6.27)
This essentially means that the orthonormal matrix V^T first rotates the input, the Σ matrix then stretches it, and the U matrix rotates it again. In the equation XV = UΣ we can think of the columns of V as input vectors and the columns of U as output vectors, wherein these vectors span the input and output spaces respectively.
Now we present this manipulation :-
X = U ΣV T (6.28)
U T X = ΣV T (6.29)
UT X = Z (6.30)
Where Z = ΣV T . We note that in this last equation U T can be essentially seen as
a change of basis matrix such that X is re-expressed as Z. Note that the columns
of V in this case are the principal components of X.
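The whole chain, covariance matrix, eigenvector basis, and the equivalent SVD route, can be sketched in NumPy (the toy data, seed, and dimensions are assumptions of this example):

```python
import numpy as np

# Toy data: m = 2 measurement types, n = 200 samples (rows = variables),
# strongly correlated so one principal direction dominates.
rng = np.random.default_rng(42)
signal = rng.normal(size=200)
X = np.vstack([signal, 2 * signal + 0.1 * rng.normal(size=200)])

# Center each row (variable) as the text prescribes.
X = X - X.mean(axis=1, keepdims=True)

# Principal components = eigenvectors of C_X = (1/n) X X^T ...
n = X.shape[1]
C_X = X @ X.T / n
eigvals, E = np.linalg.eigh(C_X)          # eigenvalues in ascending order

# ... equivalently the left singular vectors of X, with lambda_i = sigma_i^2 / n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(np.sort(s**2 / n), eigvals)

# Choosing P = U^T (rows = principal components) diagonalizes C_Y.
P = U.T
C_Y = P @ C_X @ P.T
assert np.allclose(C_Y, np.diag(s**2 / n))
```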
References
[1] Adachi, K. (2016). Matrix-Based Introduction to Multivariate Data Analysis. Springer.
Chapter 7
Text Analytics using SVD
7.1 Introduction
Text Analytics is a discipline of computer science that is used to draw meaning-
ful inferences from unstructured text documents. Some popular applications of
text analytics include sentiment analysis, email spam filters, plagiarism detection,
stock market predictions based on sentiments of web users and also in dream
content analysis in psychology. It uses techniques of matrix decomposition such
as eigendecomposition, singular value decomposition to construct a low-rank ap-
proximation to the term-document matrix.
C = U ΣV T (7.1)
where, U and V are orthonormal matrices provided C contains only real entries.
The matrix U has dimensions m × m and contains the orthogonal eigenvectors
of CC T while the matrix V has dimensions n × n and contains the orthogonal
eigenvectors of C^T C. The matrix Σ is diagonal and contains entries σ_i called singular values. The matrix Σ has dimensions m × n and the number of non-zero singular values in it is at most min(m, n). Let p = min(m, n).
Chapter 7. Text Analytics using SVD 79
eigenvalues are also arranged in this order. A key assumption here is that the
eigenvalues are all non-negative λi ≥ 0 ∀ i ∈ 1, 2, · · · , p and thus the singular
values are all real, σi ∈ R+ . The matrix Σ has entries such that Σii = σi for
1 ≤ i ≤ r and zero otherwise. Therefore:
C_{m×n} = U_{m×m} Σ_{m×n} V^T_{n×n}
7.5 Query
A query is a mini-document that a user wishes to search for in a collection of
documents. This section expounds different treatments of a query. A Boolean query is one which returns the documents in the collection that match the given query. However, in real life, one often has to sift through a large number of documents, either on the internet or in a private intranet collection. Therefore, it is incumbent on the search engine to rank the various documents on the basis of relevance to the query and display matching documents in rank order. In order to accomplish this, we develop various scoring techniques.
For instance, a document can have title, abstract, author, body and references as zones. These zones are chosen because of their natural and logical separation in actual documents such as research papers; however, it should be noted that
zones can be arbitrarily chosen. Suppose the query is to search for all works by
the author Strogatz, that is q = “Strogatz”, then a greater weight must be assigned
on the author zone than the others. However, if the query is to search for the
works that deal with chaos, q = “chaos”, then weights must be larger on the title,
abstract and body. The distinction made here implies that the search engine has to
weigh different zones (such as abstract, author etc.) differently according to the
query being made.
Now consider a query-document pair (q, d). Our objective is to compute a weighted score for the document given the query q. Let the document be divided into l zones. It is important that if this document is taken from a collection, then every document in the collection also has l zones. Let g_i be the weight on the i-th zone. Further, let g_1, g_2, · · · , g_i, · · · , g_l ∈ [0, 1] such that,
Σ_{i=1}^{l} g_i = 1   (7.3)
For 1 ≤ i ≤ l, si is the Boolean score denoting a match between the query q and
the ith zone. Suppose all the query q terms occur in the zone i, then the Boolean
score for the ith zone is 1. That is:
Boolean score: if query q matches zone i, then s_i = 1; if query q does NOT match zone i, then s_i = 0.
The weighted zone score is defined as the summation of all Boolean scores weighted
by their individual zone weight in a document. That is,
WZS = Σ_{i=1}^{l} g_i s_i   (7.4)
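The weighted zone score can be expressed as a few lines of Python. This is a minimal sketch; the zone names, the sample texts, and the weights are hypothetical, chosen to mirror the worked example below.

```python
# A minimal weighted-zone-scoring sketch; zones and weights are made up.
def weighted_zone_score(query, zones, weights):
    """WZS = sum over zones of g_i * s_i, where the Boolean score s_i is 1
    iff every query term occurs in zone i."""
    terms = query.lower().split()
    score = 0.0
    for name, text in zones.items():
        s_i = 1 if all(t in text.lower().split() for t in terms) else 0
        score += weights[name] * s_i
    return score

zones = {
    "title": "chaos in economic models",
    "author": "strogatz",
    "abstract": "we study chaos and nonlinear dynamics",
    "body": "the paper explores chaos in backward dynamics",
}
weights = {"title": 0.2, "author": 0.1, "abstract": 0.2, "body": 0.5}
assert abs(sum(weights.values()) - 1.0) < 1e-12

print(weighted_zone_score("chaos", zones, weights))     # 0.9
print(weighted_zone_score("strogatz", zones, weights))  # 0.1
```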
Example. Consider the following document d divided into 4 zones - title, author,
abstract and body as shown below. Each of these zones are assigned weights:
g1 = 0.2, g2 = 0.1, g3 = 0.2, and, g4 = 0.5. The search query is chaos. For this
particular query-document pair, determine the WZS.
For the given (q, d) pair, with q = chaos, we know the zonal weights, and we can observe that the sum of zonal weights for the document equals one, Σ g_i = 1:

Zone      Weight  Value
Title     g_1     0.2
Author    g_2     0.1
Abstract  g_3     0.2
Body      g_4     0.5
For this particular query, the word chaos pertains to the subject matter in the document. Thus it is reflected in the given weights that the zone author takes the least weight. However, if this were a search engine indexing based on the
[Figure: a document d divided into l zones, the i-th zone carrying weight g_i.]
author, the opposite would be true. Since the term chaos appears in d in zones
title, abstract and body, the corresponding Boolean scores are si = 1. Hence the
WZS is:
WZS = Σ_{i=1}^{4} g_i s_i = (0.2)(1) + (0.1)(0) + (0.2)(1) + (0.5)(1) = 0.9
Note. In case the query comprises two terms, there are two methods of combining Boolean scores. The two-term query can be expressed as q = q_1 · q_2, where the · denotes Boolean AND; thus only if the independent Boolean scores of both terms are 1 do we obtain a match with the query q. Another way is to express the two-term query as q = q_1 + q_2, where the + denotes Boolean OR. In this case a match is obtained if either of the two terms is present in a particular zone.
Note. For multiple-term queries, we take the “bag of words” model here, which
means that we ignore syntax and treat any combinations of the words in the given
query as identical to the original. Thus if q1 = A red bag full of apples. and if q2 =
A bag full of red apples, then q1 ' q2 , although their meanings are different.
[Figure: an example document d with four zones. Title (g_1): "Li-Yorke Chaos in Models with Backward Dynamics"; Author (g_2): David R. Stockman, Department of Economics; Abstract (g_3); Body (g_4): "1 Introduction. The equilibrium of a dynamic economic model can often be characterized as a trajectory generated by a dynamical system. Many nonlinear dynamical systems are single-valued moving forward, but multi-valued going backward, i.e., the system is not invertible." Σ g_i = 1.]
these zones, and thereby obtaining the WZS, has been discussed in Section ??. Then, in order to determine the relevance of a document, one has to take into account the adverse effect of common words, i.e. words that occur too often. These effects have to be attenuated for meaningful and relevant search results. In due course we will make use of the SVD to carry this out.
We use raw term frequency to characterize queries, where all terms are equally
important. This, however, is not useful to us, for the following reason. Consider
the previous example in 7.6. If the collection of documents is the Journal on Non-
linear Dynamics and Chaos, then the query chaos has little to no significance in
rank-ordering documents for all documents are likely to contain this term.
[Figure: a collection c comprising individual documents d_1, d_2, d_3, d_4.]
A better method for weighing terms uses the document frequency (df ) defined as
the total number of documents in the collection that contain the term t. It makes
sense to use df because our purpose here is to discriminate between documents
for the purpose of scoring and ranking them.
Example. Consider the following table, drawn from a collection of 200 documents. For the words equilibrium and there, the collection frequencies (cf) are almost identical. However, since there is a common word, its df is higher. While scoring, one has to search for equilibrium in only 102 documents, making the process efficient and retrieval speedy.

Word         cf    df
equilibrium  235   102
there        259   195
chaos        240   161
It can be observed that the idf of a rare term is high while that of a common term
is low. For the example in 7.8, we can see that idfequilibrium > idfthere .
In order to incorporate the merits of using both tf and idf in rank ordering, we
combine them into tf -idf composite weighting. This weight for a term t in a
document d is defined as:
tf -idft,d = tft,d × idft (7.6)
This method assigns a weight to a term t in a document d such that tf-idf is highest when t occurs many times within a small number of documents, lower when the term occurs fewer times in a document or occurs in many documents, and lowest when the term occurs in virtually all documents.
Example. Consider the following table with terms from a collection of 800 documents of the Journal on Non-linear Dynamics and Chaos. Here equilibrium is an important term not found in all documents and has a relatively high idf. A common word there is found in virtually all documents and has the least idf. However, a rare word like dexterity has the highest idf.

Term         df    idf
equilibrium  102   0.894
there        695   0.061
chaos        490   0.212
dexterity    20    1.602
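The idf values in the table are reproduced by idf_t = log10(N / df_t) with N = 800. A short Python sketch (the tf value below is a hypothetical illustration):

```python
import math

# Counts mirror the example table: N = 800 documents.
N = 800
df = {"equilibrium": 102, "there": 695, "chaos": 490, "dexterity": 20}

# idf_t = log10(N / df_t) reproduces the tabulated values.
idf = {t: math.log10(N / d) for t, d in df.items()}
print(round(idf["equilibrium"], 3))  # 0.894
print(round(idf["there"], 3))        # 0.061
print(round(idf["dexterity"], 3))    # 1.602

# tf-idf for a term with a hypothetical tf = 3 occurrences in a document d:
tf = 3
print(round(tf * idf["chaos"], 3))
```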
References
[1] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman,
R. (1990). Indexing by latent semantic analysis. Journal of the American
society for information science, 41(6), 391-407.
[5] Strang, G., (1993). Introduction to linear algebra (Vol. 3). Wellesley, MA:
Wellesley-Cambridge Press.
[6] Golub, G. H., & Van Loan, C. F. (2012). Matrix computations (Vol. 3). JHU
press.
Part II
Advanced Probability
Chapter 8
Fundamentals of Probability
8.1 Combinatorics
Starting with the fundamental principle of counting, we can assume that experi-
ment 1 results in any of m possible outcomes and experiment 2 results in any of
n possible outcomes. Then, if these two experiments are performed in succession,
we would observe that there are a total of mn outcomes possible. Note that the
below matrix lists out all the possible pairs of outcomes from experiment 1 and
2. The item (i, j) corresponds to the pair in which i was obtained in experiment 1
and j was obtained in experiment 2.
(1, 1)  (1, 2)  · · ·  (1, n)
  ⋮       ⋮     · · ·    ⋮         (8.1)
(m, 1)  (m, 2)  · · ·  (m, n)
8.1.1 Permutations
Ordered arrangements of elements are called permutations. For example if we
have letters a, b, c then the permutations of these letters (elements) is given as:
abc, acb, bac, bca, cab, cba (8.3)
Each such arrangement is called a permutation. Note that as a general rule, for n objects there are n! permutations:
n! = n(n − 1)(n − 2) · · · 3 · 2 · 1   (8.4)
Things become a bit more involved when we are permuting elements in which
there are some objects that are alike. For example if we want to find different
arrangements of the word PEPPER, then obviously we will have a total of 6! permutations possible, since there are six letters in the word. But what if we simply interchange the alike elements in the word? For example, if we interchange the two middle P's, it wouldn't really change our permutation. For this reason we calculate the total number of permutations of PEPPER by adjusting for the permutations among the alike elements as well. So the final number of permutations would become:
6! / (3! 2!)   (8.5)
Note that 3! refers to the number of permutations among the P’s (which are three
in number) and 2! refers to the number of permutations among the E’s. As a
general rule we can say:
n! / (n_1! n_2! · · · n_r!)   (8.6)
Where there are n1 alike elements of type 1, n2 alike elements of type 2 and so on.
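The PEPPER count can be confirmed both by the formula and by brute force. A minimal Python sketch (using only the standard library):

```python
import math
from itertools import permutations

# Count distinct arrangements of PEPPER: 6! / (3! 2! 1!) = 60.
word = "PEPPER"
counts = {c: word.count(c) for c in set(word)}   # {'P': 3, 'E': 2, 'R': 1}

denom = math.prod(math.factorial(k) for k in counts.values())
formula = math.factorial(len(word)) // denom
print(formula)  # 60

# Brute-force check: collect distinct permutations in a set.
assert len(set(permutations(word))) == formula
```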
8.1.2 Combinations
If we want to form groups of r objects from a total of n objects where essentially
our permutations are now order irrelevant then we call them combinations. The
number of such combinations is given by:
(n choose r) = n! / ((n − r)! r!)   (8.7)
As a simple example, consider finding the expansion of (x + y)3 . We can use the
binomial theorem to expand this expression:
(x + y)^3 = (3 choose 0) x^0 y^3 + (3 choose 1) x^1 y^2 + (3 choose 2) x^2 y^1 + (3 choose 3) x^3 y^0   (8.9)
This set of all unions consists of all outcomes in at least one of the E_i events. In a similar manner, the event consisting of outcomes in all of the E_i events is given by a continuous intersection of these sets:
∩_{n=1}^{∞} E_n   (8.12)
Note the all-important De Morgan's laws, given by the following expressions. Also note that the superscript c refers to the complement of a set (which is nothing but the set of elements not in the set).
(A ∪ B)^c = A^c ∩ B^c  →  ( ∪_{i=1}^{n} E_i )^c = ∩_{i=1}^{n} E_i^c   (8.13)
(A ∩ B)^c = A^c ∪ B^c  →  ( ∩_{i=1}^{n} E_i )^c = ∪_{i=1}^{n} E_i^c   (8.14)
Note an important point that events are nothing but sets of outcomes and hence
we can denote events as sets and perform set manipulation on them. For example,
we can denote the concept of mutually exclusive events using set notation as
follows:
E1 ∩ E2 = E1 E2 = φ (8.15)
The above equation basically means that if the intersection of two sets is the empty set then they are, effectively, mutually exclusive sets or events. We can also compute the probability of the union of many mutually exclusive events as follows:
P(E_1 ∪ E_2) = P(E_1) + P(E_2)  →  P( ∪_{i=1}^{∞} E_i ) = Σ_{i=1}^{∞} P(E_i)   (8.16)
Moving on, we can state the basic expansion of a union of sets that are not
mutually exclusive as:
E ∪ F = E + F − EF (8.17)
Expanding the union of three sets:
E ∪ F ∪ G = E + F + G − EF − EG − F G + EF G (8.18)
Applying the probability operator and we simply get:
P (E∪F ∪G) = P (E)+P (F )+P (G)−P (EF )−P (EG)−P (F G)+P (EF G) (8.19)
Look hard enough and you'll see that a pattern emerges in the signs of the above summation. The combined union is the sum of all (positive sign) sets taken one at a time, all (negative sign) sets taken two at a time, and all (positive sign) sets taken three at a time. We can generalize this to the union of n sets as follows, in terms of probability:
P(E_1 ∪ E_2 ∪ · · · ∪ E_n) = Σ_{i=1}^{n} P(E_i) − Σ_{i_1 < i_2} P(E_{i_1} E_{i_2}) + · · · + (−1)^{n+1} P(E_1 E_2 · · · E_n)   (8.20)
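The inclusion-exclusion pattern can be verified by direct enumeration. Here is a sketch using three hypothetical events in a 10-outcome equally likely sample space:

```python
from itertools import combinations

# Hypothetical events in a 10-outcome equally likely sample space.
omega = set(range(10))
E = {0, 1, 2, 3}
F = {2, 3, 4, 5}
G = {3, 5, 6, 7}

def P(A):
    return len(A) / len(omega)

# Three-set inclusion-exclusion, as in equation (8.19).
lhs = P(E | F | G)
rhs = (P(E) + P(F) + P(G)
       - P(E & F) - P(E & G) - P(F & G)
       + P(E & F & G))
assert abs(lhs - rhs) < 1e-12

# General alternating-sign form, as in equation (8.20).
events = [E, F, G]
total = 0.0
for k in range(1, len(events) + 1):
    for combo in combinations(events, k):
        total += (-1) ** (k + 1) * P(set.intersection(*combo))
assert abs(total - lhs) < 1e-12
```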
8.3.1 Bayes
We will introduce the concept of Bayes' theorem with the help of a common example. Suppose that D is the event that a person has a disease and E is the event that upon testing for the disease, the test comes out positive (note that there can be a false positive: the test may come out positive even if the person does not have the disease). Now suppose we want to find the probability that the person has the disease given that the result is positive:
P(D|E) = P(DE) / P(E)   (8.25)
       = P(E|D) P(D) / [ P(E|D) P(D) + P(E|D^c) P(D^c) ]   (8.26)
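Equation (8.26) is easy to evaluate numerically. In this sketch the prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen only for illustration:

```python
# Bayes' theorem for the disease-testing example; all rates are made up.
p_D = 0.01             # prior P(D): prevalence of the disease
p_E_given_D = 0.95     # sensitivity P(E | D)
p_E_given_notD = 0.05  # false-positive rate P(E | D^c)

# P(D | E) = P(E|D) P(D) / (P(E|D) P(D) + P(E|D^c) P(D^c))
num = p_E_given_D * p_D
den = num + p_E_given_notD * (1 - p_D)
p_D_given_E = num / den
print(round(p_D_given_E, 3))  # 0.161
```

Note how a rare disease and a modest false-positive rate combine to make the posterior probability surprisingly small.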
8.3.2 Odds
As a quick note, odds are defined as the ratio of probability of occurrence of an
event to the probability of the non-occurrence of the event. It is given as:
P(A) / P(A^c) = P(A) / (1 − P(A)) (8.27)
8.4 Distributions
Starting with the Bernoulli random variable, we define this random variable as
the outcome of a single trial when the outcomes are of only two types - success
and failure, encoded as 1 and 0 respectively.
p(0) = P (X = 0) = 1 − p (8.28)
p(1) = P (X = 1) = p (8.29)
Extending the same concept further, suppose we have n independent trials, each with probability of success p and probability of failure (1 − p). If we define the random variable X as the number of successes in n trials, then X is a binomial random variable.
p(i) = C(n, i) p^i (1 − p)^{n−i} (8.30)
E[X] = np (8.31)
VAR[X] = npq = np(1 − p) (8.32)
P(X ≤ i) = Σ_{k=0}^{i} C(n, k) p^k (1 − p)^{n−k} (8.33)
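The moments in equations (8.31) and (8.32) can be checked directly from the pmf (8.30); here is a minimal sketch in Python, with n and p chosen arbitrarily.

```python
import math

def binom_pmf(n, p, i):
    # C(n, i) p^i (1 - p)^(n - i), as in equation (8.30)
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)

n, p = 10, 0.3
mean = sum(i * binom_pmf(n, p, i) for i in range(n + 1))
var = sum(i * i * binom_pmf(n, p, i) for i in range(n + 1)) - mean**2
print(round(mean, 6), round(var, 6))  # matches np = 3.0 and np(1-p) = 2.1
```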
8.4.1 Poisson
A random variable X taking on the values 0, 1, 2, · · · is called a Poisson random variable with parameter λ if:
p(i) = P(X = i) = e^{−λ} λ^i / i! (8.34)
Note that this is the approximation of a binomial variable when n is very large and
p is small. Some general properties:
E[X] = λ (8.35)
V AR[X] = λ (8.36)
The derivation (optional reading) can be presented step by step:
• We let λ = np, or p = λ/n, and with this we can rewrite the binomial formula in terms of λ as follows:
P(X = i) = [n! / ((n − i)! i!)] (λ/n)^i (1 − λ/n)^{n−i} (8.38)
= [n(n − 1) · · · (n − i + 1) / n^i] · (λ^i / i!) · (1 − λ/n)^n / (1 − λ/n)^i (8.39)
(1 − λ/n)^n ≈ e^{−λ} (8.40)
n(n − 1) · · · (n − i + 1) / n^i ≈ 1 (8.41)
(1 − λ/n)^i ≈ 1 (8.42)
• And finally we end up with:
P(X = i) = e^{−λ} λ^i / i! (8.43)
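The approximation can be seen numerically by comparing the two pmfs for large n and small p; the values below are assumed purely for illustration.

```python
import math

def binom_pmf(n, p, i):
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)

def poisson_pmf(lam, i):
    return math.exp(-lam) * lam**i / math.factorial(i)

n, p = 1000, 0.003  # large n, small p, so lambda = np = 3
for i in range(6):
    # the two columns agree to roughly three decimal places
    print(i, round(binom_pmf(n, p, i), 5), round(poisson_pmf(n * p, i), 5))
```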
8.4.2 Geometric
Suppose now that we perform independent trials, each with probability of success p, until a success occurs. Our random variable X is the number of trials required until the first success is encountered.
P (X = n) = (1 − p)n−1 p (8.44)
Some key points:
E[X] = 1/p (8.45)
VAR[X] = (1 − p)/p² = q/p² (8.46)
This logic extends to the negative binomial random variable: for us to stop conducting the trials, the r-th success has to happen at the n-th trial, so we count the combinations of the r − 1 successes that must have occurred in the first n − 1 trials. Some key points:
E[X] = r/p (8.48)
VAR[X] = r(1 − p)/p² = rq/p² (8.49)
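Equations (8.45) and (8.46) can be verified by simulating the trial-until-first-success experiment; sample size and p are arbitrary choices.

```python
import random

random.seed(0)
p = 0.25

def first_success(p):
    # count Bernoulli(p) trials until the first success, as in (8.44)
    n = 1
    while random.random() >= p:
        n += 1
    return n

samples = [first_success(p) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 2), round(var, 1))  # close to 1/p = 4 and (1-p)/p^2 = 12
```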
F(x) = P(X ≤ x) (8.50)
Note that for a distribution function F, F(b) denotes the probability that the random variable takes on a value less than or equal to b. Some properties of CDFs are:
• F is non-decreasing, which means that for a < b we have F(a) ≤ F(b).
8.7.1 Example 1
Let us consider a fair coin is tossed twice. Let sample space be denoted as S and
A1 , A2 , A3 are the three events in sample space S
S = {HH, HT, T H, T T }
A1 = {HH, HT }, A2 = {HH, T H}, A3 = {HH, T T }
Let us see whether independence exists between the events:
P(A1 A2) = P{HH} = 1/4
P(A1) = 2/4, P(A2) = 2/4
P(A1)P(A2) = 1/4
P(A1 A2) = P(A1)P(A2)
so A1 and A2 are independent, and the same holds for the other pairs. Let us check the remaining condition for mutual independence:
P(A1 A2 A3) = P(A1)P(A2)P(A3) (8.65)
P(A1 A2 A3) = P{HH} = 1/4
P(A1) = 1/2, P(A2) = 1/2, P(A3) = 1/2
P(A1)P(A2)P(A3) = 1/8
Equation (8.65) is not satisfied. Therefore the events are pairwise independent but not mutually independent.
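The calculations above can be reproduced exactly in Python using rational arithmetic; the event names mirror the example.

```python
from fractions import Fraction

# fair coin tossed twice: four equally likely outcomes
S = ["HH", "HT", "TH", "TT"]
P = {s: Fraction(1, 4) for s in S}

A1 = {"HH", "HT"}; A2 = {"HH", "TH"}; A3 = {"HH", "TT"}

def prob(event):
    return sum(P[s] for s in event)

# pairwise independence holds for every pair of events
pairs = [(A1, A2), (A1, A3), (A2, A3)]
print(all(prob(a & b) == prob(a) * prob(b) for a, b in pairs))   # True

# but mutual independence fails: P(A1 A2 A3) = 1/4, not 1/8
print(prob(A1 & A2 & A3) == prob(A1) * prob(A2) * prob(A3))      # False
```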
8.7.2 Example 2
Let us consider a fair die that is rolled.
S = {1, 2, 3, 4, 5, 6}
A1 = {1, 2, 3, 4}
A2 = {4, 5, 6}
A3 = {4, 5, 6}
Let us check for the following condition
P (A1 A2 ) = 1/6
P (A1 ) = 4/6
P (A2 ) = 1/2
P (A1 )P (A2 ) = 1/3
P(A1 A2) is not equal to P(A1)P(A2). Therefore the events are not independent.
P : A → R,
where R denotes the real numbers. If the sample space is infinite, then some restriction must be placed on the collection of events A.
For a power series f(x) = a0 + a1 x + a2 x² + · · ·, putting x = 0 and differentiating repeatedly gives:
f(0) = a0
f′(0) = a1
f′′(0) = 2a2
f′′′(0) = 6a3
f⁽ⁿ⁾(0) = n! an
an = (1/n!) f⁽ⁿ⁾(0)
The power series expansion of f(x) is unique.
8.10 Consequences
Let us consider discrete X and Y with joint probability mass function p(x, y).
1. E(X + Y) = E(X) + E(Y)
E(X) = Σ_x x p(x)
E(X + Y) = Σ_x Σ_y (x + y) p(x, y)
E(X + Y) = Σ_x Σ_y x p(x, y) + Σ_x Σ_y y p(x, y)
E(X) = Σ_x Σ_y x p(x, y)
E(Y) = Σ_x Σ_y y p(x, y)
2. If X and Y are independent, then E(XY) = E(X)E(Y). Since independence gives p(x, y) = p_X(x) p_Y(y), we have
E(XY) = Σ_{x,y} x y p_X(x) p_Y(y)
E(XY) = Σ_x Σ_y x y p_X(x) p_Y(y)
E(XY) = [Σ_x x p_X(x)] [Σ_y y p_Y(y)]
We know
E(X) = Σ_x x p_X(x)
E(Y) = Σ_y y p_Y(y)
Therefore
E(XY) = E(X)E(Y)
The same result applies in the continuous case.
3. var(X + Y ) = E[(X + Y )2 ] − [E(X + Y )]2
M_X(t) = E(e^{tX})
e^x = Σ_{n=0}^∞ x^n / n!
8.10.1 Example 1
X is binomial with parameters n and p; M_X(t) = ?
M_X(t) = Σ_{k=0}^n e^{tk} P(X = k)
M_X(t) = Σ_{k=0}^n e^{tk} C(n, k) p^k q^{n−k}
M_X(t) = Σ_{k=0}^n C(n, k) (e^t p)^k q^{n−k}
M_X(t) = (e^t p + q)^n
M_X^{(k)}(0) = E(X^k)
f(x) = a0 + a1 x + · · · + an x^n + · · ·
an = f^{(n)}(0)/n!
M_X(t) = (e^t p + q)^n
M_X′(t) = n(e^t p + q)^{n−1} p e^t
M_X′′(t) = n(n − 1)(e^t p + q)^{n−2} (p e^t)² + n(e^t p + q)^{n−1} p e^t
M_X′(0) = n(p e^0 + q)^{n−1} p e^0 = np = E(X), the mean of the binomial
M_X′′(0) = n(n − 1)p² + np = n²p² − np² + np = E(X²)
Let us calculate the variance now:
σ² = E(X²) − [E(X)]² = n²p² − np² + np − n²p² = np(1 − p) = npq
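The derivatives M_X′(0) = np and the variance npq can be sanity-checked numerically; here is a sketch using central finite differences on the binomial MGF (the step size and parameters are arbitrary choices).

```python
import math

n, p = 10, 0.3
q = 1 - p

def M(t):
    # binomial MGF (e^t p + q)^n from the example above
    return (p * math.exp(t) + q) ** n

h = 1e-5
m1 = (M(h) - M(-h)) / (2 * h)              # approximates M'(0) = E[X]
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2      # approximates M''(0) = E[X^2]
print(round(m1, 3), round(m2 - m1**2, 3))  # close to np = 3 and npq = 2.1
```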
Fact:
Consider random variables X and Y. If M_X(t) = M_Y(t) for all t, then X and Y have the same distribution.
Note:
X and Y may have the same probability mass function or cumulative distribution function, yet be two different random variables; they merely share the same distribution.
Mx (t) = My (t)
E(X) = E(Y )
This implies means are same
E(X 2 ) = E(Y 2 )
M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY})
= E(e^{tX}) E(e^{tY})
since X and Y are independent random variables, e^{tX} and e^{tY} (being functions of X and Y respectively) are also independent.
= M_X(t) M_Y(t)
For the standard normal Z we have µ = 0, σ² = 1, and we can write X = σZ + µ. Then
F(x) = P(σZ + µ ≤ x)
F′(x) = pdf of σZ + µ at x
F(x) = (1/√(2π)) ∫_{−∞}^{(x−µ)/σ} e^{−t²/2} dt
F′(x) = (1/√(2π)) e^{−(1/2)((x−µ)/σ)²} (1/σ)
We want M_X(t), where X has the normal pdf with parameters µ, σ:
X = σZ + µ
8.12 Theorem
If X and Y are independent random variables with
X ∼ N(µx, σx²)
Y ∼ N(µy, σy²)
then X + Y is normal, N(µx + µy, σx² + σy²). Since the moment generating function completely determines the distribution, and X and Y are independent,
M_{X+Y}(t) = M_X(t) M_Y(t) = exp[(µx + µy)t + (σx² + σy²)t²/2],
which is the moment generating function of a normal random variable with mean (µx + µy) and variance (σx² + σy²).
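The theorem can be illustrated with a quick simulation; the means and standard deviations below are arbitrary illustrative values.

```python
import random, statistics

random.seed(1)
mux, sx = 2.0, 3.0    # illustrative parameters for X
muy, sy = -1.0, 4.0   # illustrative parameters for Y

# draw X + Y for independent normals X and Y
z = [random.gauss(mux, sx) + random.gauss(muy, sy) for _ in range(200_000)]

print(round(statistics.mean(z), 1))      # close to mu_x + mu_y = 1
print(round(statistics.variance(z), 0))  # close to sigma_x^2 + sigma_y^2 = 25
```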
Chapter 9
Sequences and Series
9.1 Sequences
A sequence is simply a list of numbers written in a specific order. Sequences can be infinite or finite. A general sequence can be written as:
a1 - first term
a2 - second term
an - nth term
{a_n} (9.2)
{a_n}_{n=1}^∞ (9.3)
To illustrate with an example, here is how we would write the first few terms of a
sequence:
{(n + 1)/n²}_{n=1}^∞ = {2, 3/4, 4/9, 5/16, · · ·} (9.4)
(terms for n = 1, 2, 3, 4)
An interesting way to think about sequences is as functions that map index values
to the value that the particular sequence might take. For example consider the
same sequence as above written as a function and its values written in a tuple of
the format (n, f (n)).
f(n) = (n + 1)/n² (9.5)
values → (1, 2), (2, 3/4), (3, 4/9), (4, 5/16) (9.6)
We do this because in this situation we can essentially plot out the values and
obtain a graphical representation of a sequence.
Chapter 9. Sequences and Series 104
[Figure: plot of f(n) = (n + 1)/n² against n; the points decrease toward 0.]
We can observe from this graph that as n increases, the sequence terms get closer and closer to zero. Hence we can say that the limiting value of this sequence is zero:
lim_{n→∞} a_n = lim_{n→∞} (n + 1)/n² = 0 (9.7)
lim_{n→∞} a_n = L
• We say that lim_{n→∞} a_n = −∞ if for every number M < 0 there exists a number N such that a_n < M whenever n > N.
• The key insight is that for a limit to exist and have a finite value, all the sequence terms must get closer and closer to that finite value as n approaches infinity.
• Given a sequence {a_n}, if we have a function f(x) such that f(n) = a_n and lim_{x→∞} f(x) = L, then we can say that:
lim_{n→∞} a_n = L
• If there exists a number m such that m ≤ an for every n then we say that
the sequence is bounded below and m is called the lower bound of the
sequence.
• If there exists a number M such that an ≤ M for every n then we say that
the sequence is bounded above and M is called the upper bound of the
sequence.
• Finally we can say that if {an } is bounded and monotonic then {an } is con-
vergent.
9.2 Series
To begin defining an infinite series we first start with a sequence {an }. Note that
a sequence is just a sequence of numbers whereas a series represents some kind of
operation on those sequence of numbers. We can define a basic series as:
s1 = a1
s2 = a1 + a2
s3 = a1 + a2 + a3
s_n = Σ_{i=1}^n a_i
We can further note that the successive values of the series form a sequence of numbers, represented as {s_n}_{n=1}^∞. This is the sequence of partial sums. Now we can compute the limiting value of this sequence of partial sums as:
lim_{n→∞} s_n = lim_{n→∞} Σ_{i=1}^n a_i = Σ_{i=1}^∞ a_i (9.9)
Note that as in the case of sequences before, if the sequence of series values has a
finite limit, then the series is said to be convergent and if the limit does not exist
then it is divergent. Now we will prove the following theorem :
If Σ a_n converges, then lim_{n→∞} a_n = 0.
• Step 1: We can write the following two partial sums for the given series:
s_{n−1} = Σ_{i=1}^{n−1} a_i = a1 + a2 + · · · + a_{n−1}
s_n = Σ_{i=1}^n a_i = a1 + a2 + · · · + a_n
a_n = s_n − s_{n−1}
• We can say that if Σ a_n is convergent, then the sequence of partial sums converges to some finite value s. The same holds for the partial sums indexed by n and by (n − 1):
{s_n}_{n=1}^∞ → lim_{n→∞} s_n = s and lim_{n→∞} s_{n−1} = s
Hence lim_{n→∞} a_n = lim_{n→∞} (s_n − s_{n−1}) = s − s = 0.
Now, to present the root test, suppose we have the series defined by:
Σ a_n (9.12)
Chapter 10
Basics of Convolutions
Now the limits of our integral are defined over the X and Y values such that X + Y ≤ a. We can break this into the following integral limits: we let y take on any value between minus infinity and infinity, and we ensure the inequality is satisfied by letting x ≤ a − y. Therefore Y takes on any value, whereas X ranges from minus infinity up to the bound a − y.
F_{X+Y}(a) = ∫_{−∞}^∞ ∫_{−∞}^{a−y} f_X(x) f_Y(y) dx dy (10.6)
Chapter 10. Basics of Convolutions 109
= ∫_{−∞}^∞ [∫_{−∞}^{a−y} f_X(x) dx] f_Y(y) dy (10.7)
= ∫_{−∞}^∞ F_X(a − y) f_Y(y) dy (10.8)
−∞
The above final expression is derived from the following definition of cumulative
function of a continuous random variable:
F_X(a) = ∫_{−∞}^a f_X(x) dx (10.9)
Hence our final cumulative distribution function of the sum of two independent
random variables FX+Y is known as the convolution of the distributions FX and
FY - the individual cumulative distributions.
F_{X+Y}(a) = ∫_{−∞}^∞ F_X(a − y) f_Y(y) dy (10.10)
Now we note from the definition of fX that this function will take on a value of 1
in the interval:
0<a−y <1 (10.16)
The inequality 0 < a − y < 1 means that y lies between a − 1 and a, not between 0 and 1 as required; we obtain the correct limits for y by splitting a into two sets of intervals, which together give the correct limits. First, for 0 < a < 1, the interval is:
0 < y < a (10.19)
We have 0 on the left side of this inequality because when a lies between 0 and 1, the expression a − 1 is always negative, so we collapse the lower limit to 0 since the density vanishes for negative y. Therefore we have:
f_{X+Y}(a) = ∫_0^a dy = a (10.20)
Second, we split a as 1 ≤ a ≤ 2, so that the main integral limits become a − 1 < y < 1. We collapse the right-hand limit to 1, since the density vanishes for y above 1. Therefore we can resolve the density as:
f_{X+Y}(a) = ∫_{a−1}^1 dy = 1 − (a − 1) = 2 − a (10.22)
Therefore, when we draw the density function, taking equation (10.20) with a between 0 and 1 and equation (10.22) with a between 1 and 2, we get a linearly rising line up to x = 1 and a linearly falling line from x = 1 to x = 2.
[Figure: the triangular density of X + Y, rising linearly on (0, 1) and falling linearly on (1, 2).]
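The triangular shape can be confirmed with a quick simulation of the sum of two independent Uniform(0, 1) variables; the grid points and bin width below are arbitrary choices.

```python
import random

random.seed(2)
samples = [random.random() + random.random() for _ in range(200_000)]

# the empirical density should follow f(a) = a on (0, 1) and f(a) = 2 - a on (1, 2)
width = 0.1
results = {}
for a in (0.25, 0.75, 1.25, 1.75):
    count = sum(1 for s in samples if a - width / 2 <= s < a + width / 2)
    results[a] = count / (len(samples) * width)   # histogram density estimate
    expected = a if a < 1 else 2 - a
    print(a, round(results[a], 2), expected)
```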
Chapter 11
Limit theorems and Convergence - Part 1
If [Xn → X in probability (P)], then [Xn → X in distribution (D)], as n → ∞ (11.3)
The above statement says that convergence in probability implies convergence in distribution. The converse does not hold: convergence in distribution does not imply convergence in probability. Let us try to verify this with a counterexample.
Chapter 11. Limit theorems and Convergence - Part 1 112
Now the above equation can be written as the union of the two cases obtained by removing the modulus from X:
Event |X| > ε/2 = (X > ε/2) ∪ (X < −ε/2) (11.9)
Now, using the law of total probability, the above equation can be written as the sum of two terms.
We know that a joint probability is the probability of an intersection of two events, which is smaller than either marginal probability, and that a conditional probability is at most one. Hence we can write as follows:
Using the above, we can substitute this value into equation (22) as follows.
Therefore, combining equation (23) and equation (27), we can rewrite equation (22) as follows:
F = F1 ∪ F2 (11.34)
Now we can see that P(F1) ≤ P(F), since P(F2) ≥ 0. Comparing with the above equation,
Let us consider the indicator
X_i(ω) = { 1, if ω ∈ A; 0, if ω ∉ A } (11.45)
Using the above condition, we can obtain the random sample mean S̄_n as shown in the equation below:
S̄_n = (X1 + X2 + · · · + Xn)/n = (1/n) Σ_{i=1}^n X_i = fraction of times ω ∈ A (11.47)
11.6 WLLN
We define S̄_n as the fraction of times the outcome ω ∈ Ω lies in a given set A; it converges in probability to E(X_i) = µ = P(A), the probability of the set (event) A. Also E(X_i) = µ and var(X_i) = σ².
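The WLLN statement can be illustrated with a quick coin-flip simulation; the sample sizes below are arbitrary choices.

```python
import random

random.seed(3)
p = 0.5   # P(A) for the event A = "heads" on a fair coin

deviations = []
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < p for _ in range(n))
    deviations.append(abs(heads / n - p))
    print(n, round(deviations[-1], 4))   # |fraction of heads - P(A)| shrinks with n
```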
Thus the above two equations mean that the distance between X_n and X tends to zero, i.e. d(X_n, X) → 0, if
lim_{n→∞} d(X_n, X) = lim_{n→∞} E[|X_n − X|^r] → 0 (11.51)
When r = 2, the above equation gives us mean square convergence (m.s.):
X_n →(m.s.) X as n → ∞ (11.52)
Now let us explain the above definition with the help of an example. Consider random variables X_n uniformly distributed on (0, 1/n), so that the density of X_n equals n on (0, 1/n). We need to prove that
X_n →(L^r) X = 0 for all r ≥ 1, (11.53)
E[|X_n|^r] = ∫_{−∞}^∞ |x|^r f_{X_n}(x) dx = ∫_0^{1/n} x^r n dx = n ∫_0^{1/n} x^r dx (11.56)
E[|X_n|^r] = 1/((r + 1) n^r) → 0 as n → ∞, for all r ≥ 1 (11.59)
Thus we proved that the random sequence converges to X in the r-th mean:
X_n →(L^r) X = 0 (proved) (11.60)
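The value of E[|X_n|^r] computed above can be sanity-checked by simulation; this sketch assumes X_n uniform on (0, 1/n), matching the integral used above, with r and the sample size chosen arbitrarily.

```python
import random

random.seed(8)
r = 2
results = []
for n in (1, 10, 100):
    xs = [random.uniform(0, 1 / n) for _ in range(50_000)]  # X_n ~ Uniform(0, 1/n)
    est = sum(x**r for x in xs) / len(xs)                   # estimates E[|X_n|^r]
    exact = 1 / ((r + 1) * n**r)                            # equation (11.59)
    results.append((est, exact))
    print(n, round(est, 6), exact)
```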
X_n →(L^r) X as n → ∞ ⟹ X_n →(P) X as n → ∞ (11.61)
Proof. Consider, for any ε > 0, the tail probability P[|X_n − X| ≥ ε]. We can apply Markov's inequality to |X_n − X|^r because |X_n − X|^r ≥ 0 (and thus |X_n − X| ≥ 0), i.e. it is a non-negative random variable, and r > 0.
Using this, we can write:
P[|X_n − X| ≥ ε] ≤ E[|X_n − X|^r]/ε^r → 0 as n → ∞ (11.64)
If
X_n →(L^r) X as n → ∞ ⟹ E[|X_n − X|^r] → 0 as n → ∞ (11.66)
Note. The converse of the above theorem is not true: a sequence of random variables X_n may converge in probability but not converge in the r-th mean.
Mathematically, X_n →(P) X as n → ∞ ⇏ X_n →(L^r) X (11.67)
11.8 Recap
So far we have introduced the topics of Convergence and Moment Generating
Functions. We have also looked upto the limit theorems in detail. And in this
lecture we will discuss in detail the almost sure convergence and strong law of
large number. And in the coming lecture we will bulid the course and introduce
to the stochastic process.
closer to X for some value of n as n increases. Suppose we want to observe the value of the random variable X, but we cannot observe it directly. So we come up with some estimation technique to measure X, and obtain X1. We estimate again and update the estimate to X2, and so on. Continuing this process gives X1, X2, . . ., and as n increases our estimate gets better and better. So we hope that the value Xn converges to X.
There are different ways in which a sequence can converge. Some of these modes of convergence are stronger than others: if a sequence converges in a stronger mode, it also converges in the weaker ones. A sequence can converge in the following ways:
• Convergence in distribution
• Convergence in probability
• Convergence in mean
For example, using the figure, we conclude that if a sequence of random variables
converges in probability to a random variable X, then the sequence converges in
distribution to X as well.
• Dice Problem
Suppose that in a dice-making factory the first batch of dice produced came out biased or defective, so the outcome of tossing such a die follows a distribution different from the uniform one.
But as the production process improved, the dice became less and less defective, and the outcome of throwing a die now follows the uniform distribution more and more closely.
• Note
It is shown by,
Xn →(L^r) X (11.72)
Chapter 12
Limit theorems and Convergence - Part 2
Almost sure means it occurs almost everywhere; there may be some places where it does not occur. This is pointwise convergence on the same sample space S. We have a sequence of random variables X1, X2, . . . , Xn defined on an underlying sample space, and we assume S is a finite set, |S| < ∞.
We have functions Xn and X that map S to the real numbers: Xn : S → R and X : S → R. The limiting random variable is also defined on the same sample space.
Let si be the ith outcome. Take a random variable Xn on the sample space with Xn(si) = xni; this is the ith real-number outcome, for i = 1, 2, . . . , k and n = 1, 2, . . .
After a random experiment is performed (for example, a coin is tossed), one of the si will occur (say H occurs). Since the outcome of the experiment is then known, the values of the Xn are known, i.e. the xni are known.
Chapter 12. Limit theorems and Convergence - Part 2 124
The sequence x1i, x2i, . . . , xni, . . . of real numbers is then observed, and we can discuss the convergence of these real numbers.
12.1.1 Example 1
Suppose we do a random experiment of tossing a coin, the outcomes being head or tail. |S| = 2 < ∞.
Consider a sequence of random variable X1 , X2 , .........., Xn .
Xn = { n/(n + 1), if s = H; (−1)^n, otherwise } (12.1)
So, consider each of the outcomes H or T and determine whether the sequence of real numbers converges or not.
If s = H ⟹ Xn(H) = n/(n + 1), which converges to 1 as n → ∞.
If s = T ⟹ Xn(T) = (−1)^n, and this sequence does not converge since it oscillates between −1 and +1 as n becomes larger and larger. Let us define E∞ as the event that the sequence converges.
This event occurs when the outcome is heads (s = H). So the probability of the event E∞, i.e. the probability of heads, is 1/2, since it is a single toss of a fair coin; the sequence did not converge when s = T. If the probability that the sequence Xn(s) converges to X(s) equals 1, then Xn converges to X almost surely (with probability 1).
Xn →(a.s.) X or Xn →(w.p.1) X as n → ∞ (12.4)
Xn →(a.s.) X if P(E∞) = 1. (12.5)
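A tiny sketch of Example 1: for each outcome of the coin we inspect the corresponding sequence of real numbers (the variable names are illustrative).

```python
# Sketch of Example 1: one real-number sequence per coin outcome.
def x_n(n, s):
    return n / (n + 1) if s == "H" else (-1) ** n

heads_seq = [x_n(n, "H") for n in range(1, 10_001)]
tails_seq = [x_n(n, "T") for n in range(1, 10_001)]

print(abs(heads_seq[-1] - 1) < 1e-3)   # True: X_n(H) -> 1
print(tails_seq[-2], tails_seq[-1])    # -1 1: X_n(T) keeps oscillating
```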
12.1.2 Example 2
Suppose the sample space is S = [0, 1] with the uniform probability measure on S.
Putting in different values of n, the intervals on which Xn(s) = 1 (defined below) have right endpoints (n + 1)/(2n): 1 for n = 1, 3/4 for n = 2, and so on. The intervals are shrinking.
• Case (ii): Now for s > 1/2 ⟹ X(s) = 0. Since s > 1/2, we have 2s > 1, or (2s − 1) > 0.
Xn(s) = { 1, if 0 ≤ s < (n + 1)/(2n); 0, otherwise } (12.10)
We know (2s − 1) > 0. Then lim_{n→∞} Xn(s) = X(s) = 0 for all s > 1/2; indeed Xn(s) = 0 for all n > 1/(2s − 1). So we choose n large enough to exceed 1/(2s − 1), and the condition lim_{n→∞} Xn(s) = X(s) = 0 holds. That implies s ∈ E∞,
⟹ (1/2, 1) ⊂ E∞.
Now we can write the event as E∞ = [0, 1/2] ∪ (1/2, 1). Applying probability, we get:
P[E∞] = P{[0, 1/2]} + P{(1/2, 1)} (12.11)
By the axioms of probability these are disjoint events, and both are uniform measures, so each has probability one half:
P[E∞] = 1/2 + 1/2 = 1.
So, from the cases above we have shown that the sequence of random variables X1, X2, . . . , Xn converges almost surely to X(s) as the sample size increases, i.e.
Xn →(a.s.) X as n → ∞
This is a stronger form of convergence, because it implies the weaker forms: the strong law implies the weak law. So the expected value of the mean is E(S̃n) = µ (this tends to a degenerate random variable µ), and var(S̃n) goes to zero as n → ∞.
12.2.1 Example 3
Consider a sample space S = [0, 1], 0 ≤ s ≤ 1, with the uniform probability distribution.
We define a sequence of random variables
Xn(s) = s + s^n and X(s) = s, for all s ∈ [0, 1],
that is, 0 ≤ s ≤ 1, such that
s^n → 0 as n → ∞ (for s < 1; the single point s = 1 has probability zero).
So, as n tends to infinity, s^n becomes smaller and smaller. This implies lim_{n→∞} Xn(s) → s, since s^n → 0 as n → ∞.
That means the sample mean of the random variables converges in probability, as n tends to infinity, to a degenerate random variable X = µ, where µ is a deterministic constant.
The mean is also finite:
E(Xi) < ∞. (12.20)
So,
E(S̄n) = µ, var(S̄n) = σ²/n → 0 as n → ∞. (12.21)
NOTE: What happens if we remove the finite variance assumption? Let the Xi be independent, identically distributed random variables with a well-defined moment generating function, so that MXi(0) = 1 and M′Xi(0) = µ = E(Xi). By definition,
MXi(t) = E[e^{tXi}] (12.22)
So now the moment generating function of S̄n is
MS̄n(t) = E[e^{t S̄n}] (12.23)
= E[e^{(t/n) Σ Xi}] (12.24)
where
S̄n = (1/n) Σ_{i=1}^n Xi (12.26)
MS̄n(t) = E[e^{(t/n)(X1 + X2 + · · · + Xn)}] (12.27)
= E[e^{(t/n)X1} · e^{(t/n)X2} · · · e^{(t/n)Xn}] (12.28)
Breaking down the equation: as the Xi are independent, the expectation above factors.
MS̄n(t) = E[e^{(t/n)X1}] · E[e^{(t/n)X2}] · · · E[e^{(t/n)Xn}] (12.29)
So what we get is the product of the moment generating functions:
MS̄n(t) = MX1(t/n) · MX2(t/n) · MX3(t/n) · · · MXn(t/n) (12.30)
Since all the moment generating functions are identical (each Xi is identical in distribution to a random variable X), we have
MS̄n(t) = [MX(t/n)]^n (12.31)
Let us see this through the Taylor series expansion: the Taylor series of f about the origin, to finite order.
So the distribution of S̄n converges weakly to the distribution of the degenerate random variable Y = µ.
Recall:
We saw convergence in distribution to a constant µ implies convergence in proba-
bility hence we have weak law of large number. So,
[Xn →(D) µ as n → ∞] ⟹ [Xn →(P) µ as n → ∞] (12.38)
where µ is a constant.
NOTE:
Here, using the moment generating function we can prove the weak law of large
number without finite variance assumption.
• Moments
The nth moment of a random variable is the expected value of its nth power.
Definition: Let X be a random variable and n ∈ N. If the expected value
µX(n) = E[X^n] (12.39)
exists and is finite, then X is said to possess a finite nth moment, and µX(n) is called the nth moment.
For example, the first moment gives the expected value E[X], and the second central moment is the variance of X. Like the mean and variance, other moments give useful information about random variables.
MX(s) = E[e^{sX}]
We say that the MGF of X exists if there is a positive constant a such that MX(s) is finite for all s ∈ [−a, a].
One question that can be raised is: why are moment generating functions useful? There are two reasons. First, the moment generating function of any random variable X gives us all the moments of X; that is why it is called the moment generating function.
Second, MGF can uniquely determine the distribution. So, therefore if two random
variable have the same MGF then they must have the same distribution as well.
Thus, if you find the MGF of a random variable, you have indeed determined its
distribution.
• Finding Moments from MGF:
Remember the Taylor series for e^x: for all x ∈ R, we have
e^x = 1 + x + x²/2! + x³/3! + · · · = Σ_{k=0}^∞ x^k/k! (12.40)
• Property 1
If two random variables have the same moment generating function, then they have the same distribution. Suppose X and Y are two random variables with the same moment generating function MX(s); then X and Y are distributed in the same way (same CDF, etc.).
So the moment generating function determines the distribution of a random variable, which can come in handy when dealing with an unknown random variable.
• Property 2
Moment generating functions make sums of random variables easier to handle. If there are two independent random variables X and Y, and we want the moment generating function of X + Y, we simply multiply the individual moment generating functions of X and Y.
So, if X and Y are independent and X has moment generating function MX (s)
and Y has moment generating function MY (s) , then the moment generating func-
tion of X + Y is just MX (s) MY (s) , or the product of the two moment generating
functions.
Example 4
If Y ∼ Uniform(0, 1), find E[Y^k] using MY(s).
MY(s) = E[e^{sY}] = ∫_0^1 e^{sy} dy = (e^s − 1)/s
= (1/s) Σ_{k=1}^∞ s^k/k! (12.45)
= Σ_{k=1}^∞ s^{k−1}/k! (12.46)
= Σ_{k=0}^∞ [1/(k + 1)] s^k/k! (12.47)
Thus the coefficient of s^k/k! in the Taylor series for MY(s) is 1/(k + 1), so
E[Y^k] = 1/(k + 1) (12.48)
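The result E[Y^k] = 1/(k + 1) can be checked by simulation; sample size and the set of k values are arbitrary choices.

```python
import random

random.seed(4)
ys = [random.random() for _ in range(200_000)]   # Y ~ Uniform(0, 1)

moments = {}
for k in (1, 2, 3, 4):
    moments[k] = sum(y**k for y in ys) / len(ys)  # Monte Carlo estimate of E[Y^k]
    print(k, round(moments[k], 3), 1 / (k + 1))   # estimate vs exact 1/(k+1)
```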
• Definition:
Consider i.i.d. random variables X1, X2, . . . , Xn, and let the mean of each random variable be µ. We define the sample mean as
X̄n = (X1 + X2 + · · · + Xn)/n (12.49)
It is to note that the sample mean X̄n is random itself.
It makes sense that this sample mean will fluctuate, because the components that
make it up (the X terms) are themselves random.
Based on this concept we have two different laws of large numbers. The strong law of large numbers and almost sure convergence were discussed thoroughly in the previous section of this lecture:
X̄n →(a.s.) µ as n → ∞ (12.50)
• Definition
The central limit theorem states that if we choose a sufficiently large random sample from a population with mean µ and variance σ², then the sample mean will be approximately normally distributed with mean µ and variance σ²/n.
This is an extremely powerful result, because it holds no matter what the distribution of the underlying random variables (i.e., the X's) is:
X̄n →(D) N(µ, σ²/n) (12.52)
where →(D) means 'converges in distribution'; it is implied here that this convergence takes place as n, the number of underlying random variables, grows.
• NOTE:
The LLN states that the mean of a large number of i.i.d. random variables converges to the expected value.
The CLT states that, under similar conditions, the sum of a large sample of random variables has an approximately normal distribution.
Zn = (X̄ − µ)/(σ/√n) (12.53)
The central limit theorem states that the CDF of Zn converges to the standard normal CDF.
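The standardization in (12.53) can be illustrated with a simulation; the exponential distribution, sample size and replication count below are arbitrary illustrative choices.

```python
import random, statistics

random.seed(5)
n, reps = 500, 5_000

def z_n():
    xs = [random.expovariate(1.0) for _ in range(n)]  # mu = 1, sigma = 1
    xbar = sum(xs) / n
    return (xbar - 1.0) / (1.0 / n**0.5)              # Z_n as in (12.53)

zs = [z_n() for _ in range(reps)]
print(round(statistics.mean(zs), 1))              # close to 0
print(round(statistics.stdev(zs), 1))             # close to 1
print(round(sum(z < 0 for z in zs) / reps, 2))    # close to Phi(0) = 0.5
```

Even though the underlying exponential distribution is heavily skewed, Z_n already looks close to standard normal at this sample size.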
Suppose that X is a random variable taking on non-negative values; then for any a > 0 we have:
P(X ≥ a) ≤ E[X]/a
Now we define an indicator random variable over a > 0 as follows:
I = { 1, if X ≥ a; 0, otherwise }
We can say that for X ≥ 0 the indicator variable satisfies:
I ≤ X/a
Taking expectations on both sides we get:
E[I] ≤ E[X]/a
While computing E[I] we can put it in the form of a weighted average with probabilities as the weights:
E[I] = 1 × P(X ≥ a) + 0 × P(X < a) = P(X ≥ a)
Since we have shown that E[I] = P(X ≥ a), we have justified Markov's inequality:
P(X ≥ a) ≤ E[X]/a
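Markov's inequality can be checked empirically at several thresholds; the exponential distribution and thresholds below are arbitrary illustrative choices.

```python
import random

random.seed(6)
xs = [random.expovariate(1.0) for _ in range(100_000)]   # X >= 0, E[X] = 1
mean = sum(xs) / len(xs)

checks = []
for a in (1, 2, 5):
    tail = sum(x >= a for x in xs) / len(xs)   # empirical P(X >= a)
    bound = mean / a                           # Markov bound E[X]/a
    checks.append(tail <= bound)
    print(a, round(tail, 4), round(bound, 4))
print(all(checks))   # True: the bound holds at every threshold
```

Note that the bound is loose: the true exponential tail e^{−a} is far below E[X]/a.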
P(|X − µ| ≥ k) ≤ σ²/k²
Now (X − µ)² is a non-negative random variable, so if we apply Markov's inequality to it with threshold a = k², we can write:
P((X − µ)² ≥ k²) ≤ E[(X − µ)²]/k²
We also know that (X − µ)² ≥ k² and |X − µ| ≥ k are equivalent conditions. Therefore we can rewrite the above equation as:
P(|X − µ| ≥ k) ≤ σ²/k²
P(X = E[X]) = 1
Now for a zero-variance random variable we can state Chebyshev's inequality, for any n ≥ 1, as follows:
P(|X − µ| > 1/n) ≤ 0
Since we know that probability cannot be negative, this inequality resolves to an equation:
P(|X − µ| > 1/n) = 0
Now we can let n → ∞ and use the continuity property of probability to get:
0 = lim_{n→∞} P(|X − µ| > 1/n) = P(lim_{n→∞} {|X − µ| > 1/n}) = P(X ≠ µ)
On the right-hand side, resolving the modulus operator inside the limit gives events which combine as follows:
(X − µ) > 0 → X > µ
(−X + µ) > 0 → X < µ
together these give X ≠ µ.
P(|(X1 + X2 + · · · + Xn)/n − µ| > ε) → 0 as n → ∞
Before moving on with the proof, we assume that the random variables have finite variance σ². We can now calculate the mean and variance associated with the sample mean statistic.
E[(X1 + X2 + · · · + Xn)/n] = µ
Var[(X1 + X2 + · · · + Xn)/n] = σ²/n
P(|(X1 + X2 + · · · + Xn)/n − µ| ≥ ε) ≤ σ²/(nε²) → 0 as n → ∞
And since the n random variables in the sequence are independent and identical, we can raise the common factor to the power n.
Now we let L(t) = log M(t) = log E[exp(tXi)]. We can note that:
= lim_{n→∞} [−L″(t/√n) n^{−3/2} t²] / [−2 n^{−3/2}], again using L'Hôpital's rule
= lim_{n→∞} L″(t/√n) t²/2 = t²/2, since L″(0) = 1
Upon expanding the right-hand-side products, we get terms of this form because we are essentially selecting pairs of 2 out of 4 elements. Now, expanding the product, resolving the zero terms and taking expectations, we get:
E[Sn⁴] = C(n, 1) E[Xi⁴] + 6 C(n, 2) E[Xi² Xj²]
Note that we have ignored the expansion term 3K/n³, since the larger expression still gives us an upper bound. Finally, we take an infinite sum on both sides to get:
E[Σ_{n=1}^∞ Sn⁴/n⁴] = Σ_{n=1}^∞ E[Sn⁴/n⁴] < ∞
Note that we can write this because of the fact that the right hand side sequence
would converge to a finite sum over an infinite series. We note now that if there
is a positive measure of probability for which an infinite sum is infinite, then its
expected value would also be infinite. But this is not the case here since our
expected value is finite. Therefore we can say with probability 1 that:
Σ_{n=1}^∞ Sn⁴/n⁴ < ∞
We also know that the terms of a convergent infinite series go to 0, and hence we can say with probability 1 that:
lim_{n→∞} Sn⁴/n⁴ = 0
Now, since the fourth power of Sn/n goes to 0, we can also conclude with probability 1 that:
Sn/n → 0 as n → ∞
Finally, in this case we took the mean to be 0, but in the more general case we can state, with probability 1:
lim_{n→∞} Σ_{i=1}^n (Xi − µ)/n = 0
lim_{n→∞} Σ_{i=1}^n Xi/n = µ
Xi ∼ unif(0, 1)
Further, the general formulas for the mean and variance of a uniform distribution on (a, b) are:
µ = (a + b)/2, σ² = (b − a)²/12
The mean for this distribution is 1/2 and variance is 1/12. The general form for
this probability distribution function is given by:
fXi(x) = { 1/(b − a), x ∈ (a, b); 0, x ∈ (a, b)^c }
Now what we will do is essentially this: take varying numbers of random samples from this distribution, compute the standardized Z statistic (which describes the distribution of the sum of the random variables) for each sample size, and check the distribution of this sum. The general formulation is:
Zn = [(X1 + X2 + · · · + Xn) − nµ] / (√n σ)
If we sample once we get:
Z1 = (X1 − 1/2)/√(1/12)
Now if we sample twice we get:
Z2 = [(X1 + X2) − 2/2]/√(2/12)
Similarly, if we sample say 50 times we would get:
Z50 = [(X1 + · · · + X50) − 50/2]/√(50/12)
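The sampling procedure just described can be sketched as follows; the sample sizes and replication count are arbitrary choices.

```python
import random, statistics

random.seed(7)
mu, var = 0.5, 1 / 12   # Uniform(0, 1) mean and variance

def z(n):
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (n * var) ** 0.5   # the standardized Z statistic

stats = {}
for n in (1, 2, 50):
    zs = [z(n) for _ in range(20_000)]
    stats[n] = (statistics.mean(zs), statistics.stdev(zs))
    print(n, round(stats[n][0], 2), round(stats[n][1], 2))  # near 0 and 1
```

The mean and standard deviation of Z_n are near 0 and 1 for every n; what changes with n is the shape, which goes from flat (n = 1) through triangular (n = 2) to approximately normal (n = 50).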
Chapter 13
Borel-Cantelli Lemma
Note that any subset E of the sample space is called an event. This is typically a set that contains various outcomes of the experiment, and we say that if a particular outcome is contained within E, then event E has occurred. For example, if we define E as the event that heads appears on the first of two coin tosses, then the associated set for this event would be:
E = {(H, H), (H, T)}
Now if we consider an event such that it contains all the outcomes contained
in both E and F then that event would be the intersection of the two events
and is shown as follows. Assume that E = {(H, H), (H, T ), (T, H)} and F =
{(H, T ), (T, H), (T, T )}.
E ∩ F = EF = {(H, T ), (T, H)} (13.4)
Now let us consider another example of two events obtained from rolling two dice, where each outcome tuple records the two individual rolls. Suppose
E = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} is the event that the sum of two die
rolls is 7 and let F = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} be the event that the sum of
two die rolls is 6. Look carefully and you might notice that the two events have
nothing in common. There are no outcomes that are contained in both sets and
hence we say that such an event simply could not occur. Such an event is known
as a null event and is denoted as EF = φ. In this case we say that events E and
F are mutually exclusive.
In a similar manner, the intersection of many events can be defined as follows:
$$\bigcap_{n=1}^{\infty} E_n \tag{13.6}$$
Now if we want to define an event such that it contains all those outcomes in
the sample space S that are not in event E, then such an event is known as the
complement of E denoted as E c . Note that the complement of the sample space
is the null set (S c = φ). Further, for any two events E and F if all the outcomes in
E are also present in F then we say that E is a subset of F and consequently, F is
the superset of E. This is denoted as:
E⊂F (13.7)
Note that the condition for equality of two sets is that they are both subsets of each other. That is:
E = F ⇐⇒ E ⊂ F and F ⊂ E (13.8)
Some of these concepts seem quite intuitive when viewed in the form of Venn diagrams of two sets A and B.
De Morgan's law relates intersections and unions through complements:
$$\left(\bigcap_{i=1}^{n} E_i\right)^c = \bigcup_{i=1}^{n} E_i^c \tag{13.10}$$
• P (S) = 1.
A decreasing sequence of events satisfies:
$$E_1 \supset E_2 \supset \cdots \supset E_n \supset E_{n+1} \tag{13.13}$$
while an increasing sequence satisfies $E_1 \subset E_2 \subset \cdots$. Now for an increasing sequence of events we can define a limiting event in the form of:
$$\lim_{n\to\infty} E_n = \bigcup_{i=1}^{\infty} E_i \tag{13.14}$$
Similarly we can define the limiting event for a decreasing sequence of events as:
$$\lim_{n\to\infty} E_n = \bigcap_{i=1}^{\infty} E_i \tag{13.15}$$
Additionally we note an important proposition that lays down the probability for
an increasing or decreasing sequence of events as:
With this in mind, we define infinite sums as a limiting case of finite sums, denoted as:
$$\sum_{i=1}^{\infty} x_i = \lim_{n\to\infty} \sum_{i=1}^{n} x_i \tag{13.18}$$
Therefore we further state that in order to even have an infinite sum, it has to
be possible to arrange the terms in a sequence. Note now that if an infinite set
of terms can be arranged in a sequence it is called countable and otherwise it is
uncountable. The positive rationals are countable since we can list them in an order as ratios of integers:
$$\frac{1}{2}, \frac{2}{3}, \frac{3}{4}, \cdots \tag{13.19}$$
However we must note that real numbers between 0 and 1 are not countable.
Suppose we try to arrange these real numbers into a sequence $x_1, x_2, \cdots$, and note that we can express each of them through its decimal expansion:
$$x_j = \sum_{i=1}^{\infty} d_{ij}\, 10^{-i} \tag{13.20}$$
where $d_{ij} \in \{0, 1, 2, \cdots, 9\}$ is the $i$th digit after the decimal place of the $j$th number in the sequence. What is happening here is that we assume any given $x$ can be expressed this way: $x_1 = \sum_{i=1}^{\infty} d_{i1}\, 10^{-i}$, $x_2 = \sum_{i=1}^{\infty} d_{i2}\, 10^{-i}$, and so on.
We assume that with the above sequence we can list out the entire set between 0 and 1. Now consider an indicator variable such that $I(A) = 1$ if condition A is true and $I(A) = 0$ if condition A is false. Then we can define a new number by:
$$y = \sum_{i=1}^{\infty} \big(1 + I\{d_{ii} = 1\}\big)\, 10^{-i} \tag{13.21}$$
Now look closely at this number: if the diagonal element $d_{ii}$ of the array of $x_j$ expansions equals 1, then the $i$th digit of $y$ is 2; otherwise it is 1. In the decimal expansion of $y$, the first digit therefore differs from the $d_{11}$ digit of $x_1$, the second digit differs from the $d_{22}$ digit of $x_2$, and so on. With this we have essentially proven that every number $x_j$ in the sequence differs in at least one digit
from the newly defined number y. Therefore we note that while y does in fact
belong to the set (0, 1) it is not equal to any of the xj sequence of numbers. Hence
we can say that the elements between 0 and 1 cannot be arranged or explicitly
listed out.
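The diagonal construction can be illustrated with a small, finite sketch (the helper `diagonal_number` and the sample digit rows are ours, purely for illustration):

```python
def diagonal_number(digit_rows):
    """digit_rows[j][i] holds d_{ij}, the i-th decimal digit of x_j.
    Return the digits of y, where digit i is 2 if d_ii == 1 and 1
    otherwise, mirroring the 1 + I{d_ii = 1} construction above."""
    return [2 if digit_rows[j][j] == 1 else 1 for j in range(len(digit_rows))]

# Four hypothetical numbers, given by their first four decimal digits
xs = [[1, 4, 1, 3], [5, 1, 5, 9], [2, 6, 1, 8], [2, 8, 4, 5]]
y = diagonal_number(xs)
# y differs from every x_j in the j-th digit, so y is not in the list
assert all(y[j] != xs[j][j] for j in range(len(xs)))
print(y)  # [2, 2, 2, 1]
```

However long the list, the same construction always produces a number missing from it, which is the heart of the uncountability argument.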
13.3 Recap
An infimum of a subset $S$ of a partially ordered set $T$, denoted $\inf S$, is the greatest element in $T$ that is less than or equal to all elements of $S$: the greatest lower bound. The supremum of a subset $S$ of a partially ordered set $T$, denoted $\sup S$, is the least element in $T$ that is greater than or equal to all elements of $S$: the least upper bound.
Then we translated this broad condition into events. Basically we are interested in the event that $E_k$ happens infinitely many times. Before moving further, note two basic correspondences:
$$\forall\, n \in \mathbb{N} \;\longrightarrow\; \bigcap_{n=1}^{\infty}, \quad \text{intersections} \tag{13.23}$$
$$\exists\, k \ge n \;\longrightarrow\; \bigcup_{k=n}^{\infty}, \quad \text{unions} \tag{13.24}$$
With this we can write our condition of $E_k$ happening infinitely many times as:
$$\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k \tag{13.25}$$
After this we saw that if the series of event probabilities is divergent then, under the assumption that the $E_k$ events are disjoint, the tail sum (starting at cutoff point $n$) remains infinite:
$$\lim_{n\to\infty} \sum_{k=n}^{\infty} P(E_k) = \infty \tag{13.26}$$
And conversely, if the series is convergent then the tail of the series goes to 0 as $n$ tends to infinity:
$$\lim_{n\to\infty} \sum_{k=n}^{\infty} P(E_k) = 0 \tag{13.27}$$
Finally, with these fundamental properties laid out, we formulated the Borel-Cantelli Lemma, which states:
$$\text{if } \sum_{n=1}^{\infty} P(E_n) < \infty \;\Longrightarrow\; P\left(\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k\right) = 0 \tag{13.28}$$
For an event A that occurs all but finitely often, we define the lim inf as follows:
$$\liminf_{n\to\infty} A_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k \tag{13.30}$$
Now let us consider Ω to be our sample space and in that we consider a sample
point or an outcome ω ∈ Ω. Then the following condition can be defined:
$\omega \in [\limsup_{n\to\infty} A_n] \iff \omega$ lies in infinitely many of the individual sets $A_n$.
We can define a similar condition for the infimum as well:
$\omega \in [\liminf_{n\to\infty} A_n] \iff \omega$ lies in all but a finite number of the sets $A_n$.
As an example, consider the alternating sequence of sets:
$$\{X_n\} = \{(X_1 = \{0\}), (X_2 = \{1\}), (X_3 = \{0\}), (X_4 = \{1\}), \cdots\} \tag{13.31}$$
From this sequence of sets, we can clearly see that the odd-indexed elements are all $\{0\}$, whereas the even-indexed elements are all $\{1\}$. With these we define
two new series of event classes that separately contain the odd and even indexed
elements:
{Yn } = {{0}, {0}, · · · } (13.32)
{Zn } = {{1}, {1}, · · · } (13.33)
Now consider the original series {Xn }. In order to evaluate its Lim Sup we need
to first compute the successive unions of all the subsets in the series of the form
{0} ∪ {1}. Note that in every union iteration, we get the same set of the form
$\{0, 1\}$. Now define the lim sup as follows:
$$\limsup_{n\to\infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k \;\longrightarrow\; \limsup_{n\to\infty} X_n = \bigcap_{n=1}^{\infty} \{0, 1\} = \{0, 1\} \tag{13.34}$$
Now similarly, if we want to compute the lim inf of this sequence, we need to take successive intersections of all the events, which in this case are found iteratively as $\{0\} \cap \{1\} = \phi$, $\phi \cap \{0\} = \phi$, and so on. In the end we get only the null set $\phi$. With this the lim inf is:
$$\liminf_{n\to\infty} A_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k \;\longrightarrow\; \liminf_{n\to\infty} X_n = \bigcup_{n=1}^{\infty} \phi = \phi \tag{13.35}$$
We can see from equations (13.34) and (13.35) that the lim sup and lim inf of this particular sequence are not equal. When this is the case, we say that the limit of the sequence does not exist. Now recall from equation (13.32) the sequence $Y_n$. We will compute its lim sup and lim inf and check whether its limit exists, which is essentially the same condition as the lim sup and lim inf being equal.
$$\limsup_{n\to\infty} Y_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} Y_k = \bigcap_{n=1}^{\infty} \{0\} = \{0\} \tag{13.36}$$
$$\liminf_{n\to\infty} Y_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} Y_k = \bigcup_{n=1}^{\infty} \{0\} = \{0\} \tag{13.37}$$
Hence from the above two equations we see clearly that the limit of this sequence exists and is given by $\lim_{n\to\infty} Y_n = \{0\}$. Now consider a new sequence with a few transient sets followed by the alternating tail pattern:
$$\{B_n\} = \{\underbrace{(B_1 = \{50\}), (B_2 = \{20\}), (B_3 = \{35\}), (B_4 = \{-15\})}_{\text{transients}}, \underbrace{\{0\}, \{1\}, \cdots}_{\text{tail pattern}}\} \tag{13.39}$$
Now for each value of the cutoff $n$, specifying the start of the tail, we call the union of all events after the cutoff point $D_n$. With this the lim sup can be specified as:
$$\limsup_{n\to\infty} B_n = \bigcap_{n=1}^{\infty} \underbrace{\bigcup_{k=n}^{\infty} B_k}_{D_n} = D_1 \cap D_2 \cap D_3 \cap \cdots = \{0, 1\} \tag{13.40}$$
For all values of the cutoff point, the event $\{0, 1\}$ happens infinitely often, since the unions for all $n$ cutoff points resolve to that set. To explicitly break down this process, we can see the contents of those $D_n$ sets and how they evolve:
$$D_1 = \bigcup_{k=1}^{\infty} B_k = \{50, 20, 35, -15, 0, 1\}$$
Now we compute the lim inf of this sequence. Note that the successive intersections of the sets in this sequence are simply the null set:
$$\liminf_{n\to\infty} B_n = \bigcup_{n=1}^{\infty} \underbrace{\bigcap_{k=n}^{\infty} B_k}_{E_n} = E_1 \cup E_2 \cup E_3 \cdots = \phi \cup \phi \cdots = \phi \tag{13.41}$$
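The lim sup and lim inf computations for $\{B_n\}$ can be checked numerically by truncating the infinite unions and intersections, which is harmless here because the tail pattern repeats (the finite sequence below is our stand-in for the example):

```python
from functools import reduce

# Transient sets followed by the alternating tail {0}, {1}, {0}, ...
N = 12
B = [{50}, {20}, {35}, {-15}] + [{0} if k % 2 == 0 else {1} for k in range(N - 4)]

def union(sets):
    return reduce(lambda a, b: a | b, sets)

def intersection(sets):
    return reduce(lambda a, b: a & b, sets)

# D_n = union of B_k for k >= n; lim sup = intersection of the D_n
limsup = intersection([union(B[n:]) for n in range(N - 1)])
# E_n = intersection of B_k for k >= n; lim inf = union of the E_n
liminf = union([intersection(B[n:]) for n in range(N - 1)])
print(limsup, liminf)  # {0, 1} set()
```

The transients are washed out of the lim sup because every $D_n$ with $n$ past the transients contains only $\{0, 1\}$, while every tail intersection is empty, giving the null set for the lim inf.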
We see that the sequence $\{I_n\}$ is in fact an increasing sequence with $I_n \subset I_{n+1}$. Why is this an increasing sequence? Because in each successive iteration of forming $I_n$ we take fewer intersections among the $X_n$ events, and fewer intersections naturally correspond to bigger sets. Hence the $(n+1)$th set will be at least as big as the $n$th set. Now we say that the least upper bound on this sequence of infimums $(I_n)$ is the lim inf, given by:
$$\liminf_{n\to\infty} X_n = \sup_n \{\inf\{X_m \mid m \in (n, n+1, \cdots)\}\} = \bigcup_{n=1}^{\infty} \left[\bigcap_{m=n}^{\infty} X_m\right] \tag{13.44}$$
14.1 Recap
We have discussed what random (stochastic) processes are. A stochastic process describes values changing randomly over time; in its simplest form, it involves a variable changing at a random rate through time. There are various types of stochastic processes, mainly classified into discrete time stochastic processes and continuous time stochastic processes.
14.2 Introduction
For $FT[y(t)]$ to exist, $y(t)$ must be absolutely integrable:
$$\int_{-\infty}^{\infty} |y(t)|\, dt < \infty \tag{14.1}$$
Here $R_{xx}(\tau)$ is the autocorrelation function, $S_{xx}(\omega)$ is the spectral density, $FT[R_{xx}(\tau)]$ is the Fourier transform, and $FT^{-1}[S_{xx}(\omega)]$ is the inverse Fourier transform.
Chapter 14. Time Series Analysis: Basics 152
$$E(x^2(t)) = R_{xx}(0) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{xx}(\omega)\, d\omega \tag{14.5}$$
Here $E(x^2(t))$, the mean square value of the stochastic process $X(t)$, is the average power of $X(t)$.
Substituting $\tau \to -\tau$ leaves the integral unchanged, so the function is even. Moreover,
$$S_{xx}(\omega) = S_{xx}^{*}(\omega) \tag{14.9}$$
where $S_{xx}^{*}(\omega)$ is the complex conjugate of $S_{xx}(\omega)$.
Recall: if $a = x + jy$ then $a^{*} = x - jy$, and $a = a^{*} \implies x + jy = x - jy \implies 2y = 0 \implies y = 0 \implies a = x$, which is real valued. Hence $S_{xx}(\omega)$ is real.
6. If $\int_{-\infty}^{+\infty} R_{xx}(\tau)\, d\tau < \infty$, then $S_{xx}(\omega)$ is a continuous function of $\omega$:
$$S_{xx}(\omega) = \int_{-\infty}^{+\infty} R_{xx}(\tau)\, e^{-j\omega\tau}\, d\tau = FT[R_{xx}(\tau)] \tag{14.10}$$
Note: Since the power spectral density $S_{xx}(\omega)$ must be an even, non-negative, real function, a candidate $R_{xx}(\tau)$ whose Fourier transform violates these conditions cannot be the autocorrelation function of a WSSP $X(t)$.
Definition: For two stochastic processes $X(t)$ and $Y(t)$ that are jointly WSS, the cross power spectral density is $S_{xy}(\omega) = FT[R_{xy}(\tau)]$:
$$S_{xy}(\omega) = FT[R_{xy}(\tau)] = \int_{-\infty}^{\infty} R_{xy}(\tau)\, e^{-j\omega\tau}\, d\tau \tag{14.11}$$
Note that $S_{xy}(\omega)$ is in general a complex function even when $X(t)$ and $Y(t)$ are real stochastic processes.
Figure 12.1: time points $(t - \tau)$, $t$, $(t + \tau)$ on the time axis.
$$S_{yx}(\omega) = S_{xy}(-\omega) = S_{xy}^{*}(\omega) \tag{14.13}$$
By definition, $R_{xy}(\tau) = E[X(t)Y(t+\tau)]$ and $R_{yx}(\tau) = E[Y(t)X(t+\tau)]$.
Then:
$$S_{xy}(\omega) = \int_{-\infty}^{\infty} R_{xy}(\tau)\, e^{-j\omega\tau}\, d\tau = FT[R_{xy}(\tau)]$$
$$S_{xy}^{*}(\omega) = \int_{-\infty}^{\infty} R_{xy}(\tau)\, e^{j\omega\tau}\, d\tau = \int_{-\infty}^{\infty} R_{yx}(-\tau)\, e^{j\omega\tau}\, d\tau$$
Substituting $\tau_1 = -\tau$ (so $d\tau_1 = -d\tau$):
$$S_{xy}^{*}(\omega) = \int_{-\infty}^{\infty} R_{yx}(\tau_1)\, e^{-j\omega\tau_1}\, d\tau_1 = FT[R_{yx}(\tau)] = S_{yx}(\omega)$$
$$S_{xy}^{*}(\omega) = S_{yx}(\omega) \tag{14.14}$$
Example: Find the autocorrelation function $R_{xx}(\tau)$ of a stochastic process with power spectral density
$$S_{xx}(\omega) = \begin{cases} S_0, & |\omega| < \omega_0 \\ 0, & \text{otherwise} \end{cases}$$
Sol: $|\omega| < \omega_0$ means $\omega < \omega_0$ if $\omega > 0$ and $\omega > -\omega_0$ if $\omega < 0$.
Figure 12.2: $S_{XX}(\omega)$ equals $S_0$ on the band $(-\omega_0, +\omega_0)$ and is zero outside.
Adding the identities $e^{jx} = \cos x + j\sin x$ and $e^{-jx} = \cos x - j\sin x$ gives:
$$\cos x = \frac{e^{jx} + e^{-jx}}{2} \tag{14.17}$$
and subtracting them gives:
$$\sin x = \frac{e^{jx} - e^{-jx}}{2j} \tag{14.18}$$
$$R_{xx}(\tau) = FT^{-1}[S_{xx}(\omega)] = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{xx}(\omega)\, e^{j\omega\tau}\, d\omega = \frac{S_0}{2\pi} \int_{-\omega_0}^{\omega_0} e^{j\omega\tau}\, d\omega \tag{14.19}$$
$$\Rightarrow R_{xx}(\tau) = \frac{S_0}{2\pi j\tau} \left[e^{j\omega\tau}\right]_{-\omega_0}^{\omega_0} = \frac{S_0}{2\pi j\tau} \left(e^{j\omega_0\tau} - e^{-j\omega_0\tau}\right) = \frac{S_0}{\pi\tau} \cdot \frac{e^{j\omega_0\tau} - e^{-j\omega_0\tau}}{2j} \tag{14.20}$$
$$\Rightarrow R_{xx}(\tau) = \frac{S_0}{\pi\tau} \sin(\omega_0\tau)$$
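The closed form can be cross-checked by numerically inverting the band-limited spectrum (a sketch assuming `numpy`; the parameter values are arbitrary):

```python
import numpy as np

S0, w0, tau = 2.0, 3.0, 0.7

# Trapezoidal approximation of (1/2pi) * integral of S0*exp(j*w*tau)
# over the band (-w0, w0), where the spectrum is nonzero
w = np.linspace(-w0, w0, 200_001)
dw = w[1] - w[0]
f = S0 * np.exp(1j * w * tau)
numeric = (0.5 * (f[:-1] + f[1:]) * dw).sum().real / (2 * np.pi)

closed_form = S0 * np.sin(w0 * tau) / (np.pi * tau)
print(numeric, closed_form)  # the two agree
```

The imaginary parts cancel by symmetry of the band, which is why only the real part of the integral survives.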
In statistics and econometrics one often assumes that an observed series of data
values is the sum of a series of values generated by a deterministic linear process, depending on certain independent/explanatory variables, and on a series of random noise values. If there is non-zero correlation between the noise values underlying different observations, then the estimated model parameters are still unbiased, but estimates of their uncertainties, such as confidence intervals, will be biased.
In Time series analysis there are often no explanatory variables other than the past
values of the variable being modeled i.e. the dependent variable. In this case the
noise process is often modeled as a moving average process, in which the current
value of the dependent variable depends on current and past values of a sequential
white noise process.
Here, N(t) denotes white noise.
Definition: White noise is a random function $N(t)$ whose power spectral density $S_{nn}(\omega)$ is constant for all frequencies $\omega$:
$$S_{nn}(\omega) = \frac{N_0}{2} \text{ is constant } \forall\, \omega \tag{14.21}$$
where $N_0$ is a real positive constant.
The autocorrelation of white noise is
$$R_{nn}(\tau) = FT^{-1}[S_{nn}(\omega)] = FT^{-1}\left(\frac{N_0}{2}\right) = \frac{N_0}{2}\, FT^{-1}[1] = \frac{N_0}{2}\, \delta(\tau)$$
since
$$FT[\delta(\tau)] = \int_{-\infty}^{\infty} \delta(\tau)\, e^{-j\omega\tau}\, d\tau = e^{-j\omega(0)} = e^0 = 1 \implies FT^{-1}[1] = \delta(\tau)$$
where
$$\delta(\tau) = \begin{cases} \infty, & \tau = 0 \\ 0, & \tau \ne 0 \end{cases} \tag{14.22}$$
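A discrete-time sanity check (assuming `numpy`): the sample autocovariance of simulated white noise is approximately the noise power at lag 0 and zero at other lags, the discrete analogue of $(N_0/2)\,\delta(\tau)$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0  # noise power, playing the role of N0/2
n = rng.normal(0.0, np.sqrt(sigma2), size=200_000)

def sample_autocov(x, m):
    """Sample autocovariance of a zero-mean series at lag m >= 0."""
    return float(np.mean(x[: len(x) - m] * x[m:])) if m else float(np.mean(x * x))

print(sample_autocov(n, 0))                       # ~ 4.0
print(sample_autocov(n, 1), sample_autocov(n, 5)) # ~ 0.0 each
```

The flat spectrum of white noise is exactly the statement that all nonzero-lag correlations vanish.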
Figure 12.3: the power spectral density $S_{NN}(\omega) = N_0/2$ is flat in $\omega$, while the autocorrelation function $R_{NN}(\tau) = (N_0/2)\,\delta(\tau)$ is an impulse at $\tau = 0$; the two are related by the Fourier transform pair $FT$ and $FT^{-1}$.
Example: Let $Y(t) = X(t) + N(t)$ be a weakly stationary process, with $X(t)$ the actual speed and $N(t)$ a zero-mean noise process with variance $\sigma_N^2$ and $\mu_N = 0$. Find the power spectral density of $Y(t)$, i.e. $S_{yy}(\omega)$.
Sol:
$$S_{yy}(\omega) = FT[R_{yy}(\tau)] = \int_{-\infty}^{\infty} R_{yy}(\tau)\, e^{-j\omega\tau}\, d\tau$$
$$\mu_N = 0 \text{ and } Var(N(t)) = \sigma_N^2 \tag{14.23}$$
$$R_{yy}(\tau) = E[Y(t)Y(t+\tau)] = E\big[(X(t) + N(t))(X(t+\tau) + N(t+\tau))\big]$$
$$R_{yy}(\tau) = E[X(t)X(t+\tau)] + E[X(t)N(t+\tau)] + E[N(t)X(t+\tau)] + E[N(t)N(t+\tau)]$$
Since the cross terms vanish (the noise is zero mean and uncorrelated with the signal):
$$R_{yy}(\tau) = R_{xx}(\tau) + R_{nn}(\tau) = R_{xx}(\tau) + \sigma_N^2\, \delta(\tau)$$
$$\sigma_N^2 = Var[N(t)] = E[N^2(t)] - \big(E[N(t)]\big)^2 = E[N^2(t)] - \mu_N^2$$
$$\Rightarrow \sigma_N^2 = E[N^2(t)] = R_{nn}(0)$$
$$\Rightarrow R_{nn}(\tau) = \sigma_N^2\, \delta(\tau), \quad \text{where } \delta(\tau) = \begin{cases} \infty, & \tau = 0 \\ 0, & \tau \ne 0 \end{cases}$$
$$R_{yy}(\tau) = R_{xx}(\tau) + \sigma_N^2\, \delta(\tau)$$
Now solving for the power spectral density of $Y(t)$:
$$S_{yy}(\omega) = FT[R_{yy}(\tau)] = FT[R_{xx}(\tau)] + FT[\sigma_N^2\, \delta(\tau)]$$
Since $FT[\delta(\tau)] = 1$:
$$S_{yy}(\omega) = S_{xx}(\omega) + \sigma_N^2$$
Till now we have discussed continuous time stochastic processes (CTSP). Let us look at discrete time stochastic processes in the next section.
When interpreted as time, if the index set of a stochastic process has a finite or
countable number of elements, such as a finite set of numbers, the set of integers,
or the natural numbers, then the stochastic process is said to be in discrete time.
Figure 12.4: sampling instants $n = 0, 1, 2$ at times $0, T_S, 2T_S$.
$$\delta(m) = \begin{cases} 1, & m = 0 \\ 0, & m \ne 0 \end{cases}$$
Definition: The power spectral density of $X(n)$ is
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\, e^{-j\Omega m} = DFT[R_{xx}(m)] \tag{14.24}$$
where $R_{xx}(m)$ is the discrete autocorrelation function of $X(n)$ and $DFT$ denotes the discrete-time Fourier transform. Since
$$e^{-j(\Omega + 2\pi)n} = e^{-j\Omega n},$$
$e^{-j\Omega n}$ is periodic in $\Omega$ with period $2\pi$, so $S_{xx}(\Omega)$ is periodic with period $2\pi$. Therefore it is sufficient to define $S_{xx}(\Omega)$ on the range $\Omega \in (-\pi, \pi)$.
Figure 12.5: $S_{xx}(\Omega)$ shown over one period $\Omega \in (-\pi, \pi)$.
The autocorrelation function of $X(n)$ is recovered as
$$R_{xx}(m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_{xx}(\Omega)\, e^{j\Omega m}\, d\Omega$$
Properties of the power spectral density $S_{xx}(\Omega)$:
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\, e^{-j\Omega m} \tag{14.27}$$
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\, \big[\cos(\Omega m) - j\sin(\Omega m)\big] = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\cos(\Omega m) - j \sum_{m=-\infty}^{+\infty} R_{xx}(m)\sin(\Omega m) \tag{14.28}$$
Since $\cos(\Omega m)$ is even in $m$ and $\sin(\Omega m)$ is odd, while $R_{xx}(m)$ is even, the imaginary sum cancels. $S_{xx}(\Omega)$ is therefore an even function,
$\Rightarrow S_{xx}(\Omega)$ is real.
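These properties can be verified numerically for an assumed even autocorrelation sequence, here $R_{xx}(m) = 0.8^{|m|}$ truncated at $|m| \le 50$:

```python
import numpy as np

m = np.arange(-50, 51)
R = 0.8 ** np.abs(m)  # an even autocorrelation sequence

def S(Omega):
    """Truncated DTFT of R_xx(m), i.e. the power spectral density."""
    return np.sum(R * np.exp(-1j * Omega * m))

for Om in (0.5, 1.3):
    val = S(Om)
    assert abs(val.imag) < 1e-10       # S_xx(Omega) is real
    assert abs(S(-Om) - val) < 1e-10   # and even in Omega
    print(Om, val.real)
```

Because the sequence is even, every $\sin(\Omega m)$ term at $+m$ is cancelled by its partner at $-m$, leaving only the real cosine sum.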
Example: Assume $X(n)$ is a real SP, so that $R_{xx}(-m) = R_{xx}(m)$. Find the power spectral density of $X(n)$, i.e. $S_{xx}(\Omega)$.
Sol: Split the sum into negative and non-negative lags:
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{-1} R_{xx}(m)\, e^{-j\Omega m} + \sum_{m=0}^{+\infty} R_{xx}(m)\, e^{-j\Omega m}$$
Introducing a dummy index $k = -m$ in the first sum:
$$S_{xx}(\Omega) = \sum_{k=1}^{\infty} R_{xx}(-k)\, e^{j\Omega k} + R_{xx}(0) + \sum_{k=1}^{\infty} R_{xx}(k)\, e^{-j\Omega k}$$
$$S_{xx}(\Omega) = R_{xx}(0) + \sum_{k=1}^{\infty} R_{xx}(k)\, \big[e^{j\Omega k} + e^{-j\Omega k}\big] = R_{xx}(0) + 2\sum_{k=1}^{\infty} R_{xx}(k)\cos(\Omega k)$$
which is real.
Further, substituting $\Omega = -\alpha$ in the inversion formula gives
$$R_{xx}(-m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_{xx}(-\alpha)\, e^{j\alpha m}\, d\alpha,$$
which equals $R_{xx}(m)$ when $S_{xx}$ is even.
A discrete time stochastic process (DTSP) can be obtained by sampling a CTSP $X(t)$. If the CTSP $X(t)$ is sampled at constant intervals of $T_s$ time units ($T_s$ is the sampling period), then the samples define a DTSP $X(n)$.
Figure 12.6: sample indices $\cdots, -2, -1, 0, +1, +2, \cdots$ at times $\cdots, -2T_S, -T_S, 0, T_S, 2T_S, \cdots$.
Note: If X(t) is a WSSP in continuous time then X(n) is also WSSP in discrete
time with µx (n) = µx = Constant and Rxx (m) = Rxx (mTs ) .
Figure 12.7: one realization (sample path) $X(\omega_1, t)$ of the CTSP $\{X(t)\}$ in continuous time, and the DTSP $X_n$, $n = 0, 1, 2, \ldots$ obtained from it by sampling.
Consider the CTSP $\{X(t), t \in T\}$. For $t_0 \in T$, $X(t_0)$ is a random variable, so its CDF is $F_{X(t_0)}(x) = P(X(t_0) \le x)$. For a stationary process, the joint CDF of $X(t_1)$ and $X(t_2)$ is the same as the joint distribution of $X(t_1 + \Delta)$ and $X(t_2 + \Delta)$; i.e. a time shift of $\Delta$ does not change its stationarity properties.
Definition: The CTSP $\{X(t), t \in T\}$ is SSSP if for all $t_1, t_2, \ldots, t_n \in \mathbb{R}$ and all $\tau \in \mathbb{R}$, the joint CDF of $X(t_1), X(t_2), \ldots, X(t_n)$, evaluated at any real numbers $x_1, x_2, \ldots, x_n$, equals the joint CDF of $X(t_1 + \tau), X(t_2 + \tau), \ldots, X(t_n + \tau)$. Weak stationarity only requires:
1. The mean function does not change due to shifts in time and is independent
of time. E[X(t1 )]= E[X(t2 )] i.e. µx (t1 ) = µx (t2 ) = Constant.
2. The autocorrelation function does not change by shifts in time and is inde-
pendent of time. E[X(t1 ) X(t2 )]= E[X(t1 + τ ) X(t2 + τ )].
For a CTSP the mean condition reads $\mu_x(t) = \mu_x \ \forall\, t \in \mathbb{R}$; for a DTSP it reads $\mu_x(n) = \mu_x \ \forall\, n \in \mathbb{Z}$.
Figure 12.8: $R_{XX}(\tau)$ attains its maximum value at $\tau = 0$.
A signal that is just a function of time and not a sample path of a stochastic pro-
cess can exhibit cyclostationary properties in the framework of the fraction-of-time
point of view. If the signal is further ergodic, all sample paths exhibit the same time-average.
This process has a periodic structure: the statistical properties repeat every $T_p$ units of time. That is, if the random variables $X(t_1), X(t_2), \ldots, X(t_n)$ have the same joint CDF as the RVs $X(t_1 + T_p), X(t_2 + T_p), \ldots, X(t_n + T_p)$, then the process is cyclostationary.
For example: $X(t) = A\cos(\omega t) \Rightarrow X(t + \frac{2\pi}{\omega}) = A\cos\big(\omega(t + \frac{2\pi}{\omega})\big) = A\cos(\omega t + 2\pi) = X(t)$.
1. µx (n + M ) = µx ∀ n ∈ Z
Note: Mean square continuity does not mean that every possible realization of $X(t)$ is a continuous function.
$$\delta(\tau) = \begin{cases} \infty, & \tau = 0 \\ 0, & \tau \ne 0 \end{cases}$$
$E[N^2(t)] = R_{nn}(0) = \infty$ ⇒ a white noise stochastic process has infinite power.
Also, $R_{nn}(\tau) = 0$ for any $\tau \ne 0$
⇒ N(t1 ) and N(t2 ) are uncorrelated for any t1 6= t2
⇒ White Gaussian noise GN(t1 ) and GN(t2 ) are independent for any t1 6= t2
For Gaussian RVS independence ⇔ uncorrelated
Such quantities include the average value of the process over a range of times and
the error in estimating the average using sample values at a small set of times.
The Gaussian random variable is clearly the most commonly used and of most im-
portance. For continuous variables, possible values are distributed on a continuous
scale and the probability density function links every possible value with a given
probability intensity which we can think of as the probability to find the value of
the variable around every possible value. A theoretical frequency distribution for a
random variable, characterized by a bell-shaped curve symmetrical about its mean.
$\vec{X} = [X_1, X_2, \ldots, X_n]^T$ is a random vector and $\vec{a} = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^n$.
The RVs $X_1, X_2, \ldots, X_n$ are jointly normal if $\vec{a}^T \vec{X} = \sum_i a_i X_i$ is a normal random variable for all $a_i \in \mathbb{R}$.
Jointly Gaussian random variables can be characterized by the property that every
scalar linear combination of such variables is Gaussian. An important property of
jointly normal random variables is that their joint PDF is completely determined
by their mean and covariance matrices.
$\vec{m} = E[\vec{X}]$, where $\vec{X} = [X_1, X_2, \ldots, X_n]^T$.
The covariance matrix is $C = E\big[(\vec{X} - \vec{m})(\vec{X} - \vec{m})^T\big]$, with $|C| = \det(C)$.
The 1-D Gaussian PDF of $X$ is
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{14.34}$$
and the $N$-D Gaussian PDF of $\vec{X}$ is
$$f_{\vec{X}}(\vec{x}) = \frac{1}{(2\pi)^{n/2}\, |C|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\vec{m})^T C^{-1} (\vec{x}-\vec{m})} \tag{14.35}$$
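Equation (14.35) can be evaluated directly (a sketch assuming `numpy`; the function `gaussian_pdf` and the example numbers are ours):

```python
import numpy as np

def gaussian_pdf(x, m, C):
    """N-dimensional Gaussian density with mean vector m and
    covariance matrix C, following equation (14.35)."""
    n = len(m)
    d = x - m
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(C))
    return float(np.exp(-0.5 * d @ np.linalg.inv(C) @ d) / norm)

m = np.array([0.0, 0.0])
C = np.array([[1.0, 0.5], [0.5, 2.0]])
# At the mean the exponent vanishes, leaving 1 / (2*pi*sqrt(|C|))
print(gaussian_pdf(m, m, C))
```

This illustrates the claim above: the whole joint density is pinned down by the mean vector and the covariance matrix alone.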
Note: If two jointly normal random processes $X(t)$ and $Y(t)$ are uncorrelated, that is $C_{xy}(t_1, t_2) = 0 \ \forall\, t_1, t_2$, then $X(t)$ and $Y(t)$ are independent SPs.
Note: For a Gaussian SP, weak stationarity (WSSP) and strong stationarity (SSSP) are equivalent.
Theorem: For a Gaussian SP $\{X(t), t \in T\}$, if $X(t)$ is WSSP then $X(t)$ is SSSP.
Definition: Two SPs $\{X(t)\}$ and $\{Y(t)\}$ are jointly Gaussian if for all $t_1, t_2, \ldots, t_n \in R_x$ and $t_1', t_2', \ldots, t_m' \in R_y$, the RVs $X(t_1), \ldots, X(t_n), Y(t_1'), \ldots, Y(t_m')$ are jointly normal.
Proof: We need to show that $\forall\, t_1, t_2, \ldots, t_k \in \mathbb{R}$, the variables $X(t_1), X(t_2), \ldots, X(t_k)$ have the same joint CDF as the RVs $X(t_1 + \tau), X(t_2 + \tau), \ldots, X(t_k + \tau)$. Since these RVs are jointly Gaussian, it suffices to show that the mean vectors and covariance matrices are the same. If $X(t)$ is WSSP, then $\mu_x(t_i) = \mu_x(t_j) = \mu_x = \text{constant} \ \forall\, i, j$, and $C_{xx}(t_i + \tau, t_j + \tau) = C_{xx}(t_i, t_j) = C_{xx}(t_i - t_j) \ \forall\, i, j$. Hence the mean vector and covariance matrix of $X(t_1), \ldots, X(t_k)$ are the same as those of $X(t_1 + \tau), \ldots, X(t_k + \tau)$.
14.10 Summary:
The power spectral density $S_{xx}(\omega)$ is a non-negative, even, real and continuous function of $\omega$; a function whose Fourier transform violates these conditions cannot be the autocorrelation function of a WSSP $X(t)$. White noise is a random function $N(t)$ whose power spectral density $S_{nn}(\omega) = N_0/2$ is constant for all frequencies $\omega$. A discrete time stochastic process (DTSP) is called white noise if the random variables $X(n_k)$ are uncorrelated. The discrete-time PSD $S_{xx}(\Omega)$ is periodic with period $2\pi$ and is a real, even function of $\Omega$.
A DTSP can be obtained by sampling a CTSP $X(t)$: if $X(t)$ is sampled at constant intervals of $T_s$ time units ($T_s$ is the sampling period), the samples define $X(n)$. If $X(t)$ is a WSSP in continuous time, then $X(n)$ is also WSSP in discrete time, with $\mu_x(n) = \mu_x = \text{constant}$ and $R_{xx}(m) = R_{xx}(m T_s)$.
For weak stationarity, the mean function does not change under shifts in time and is independent of time: $E[X(t_1)] = E[X(t_2)]$, i.e. $\mu_x(t_1) = \mu_x(t_2) = \text{constant}$. The autocorrelation function likewise does not change under shifts in time: $E[X(t_1)X(t_2)] = E[X(t_1 + \tau)X(t_2 + \tau)]$. $R_{xx}(\tau)$ takes its maximum value at $\tau = 0$; that is, $X(t + \tau)$ and $X(t)$ have their highest correlation at $\tau = 0$.
A cyclostationary process has a periodic structure: its statistical properties repeat every $T_p$ units of time, i.e. the random variables $X(t_1), X(t_2), \ldots, X(t_n)$ have the same joint CDF as $X(t_1 + T_p), X(t_2 + T_p), \ldots, X(t_n + T_p)$. For a Gaussian SP, weak stationarity and strong stationarity (SSSP) are equivalent.
Chapter 15
Note that the constant $\alpha$ is called the probability limit of $z_n$, and convergence in probability can be written in either of the following notations:
$$\text{plim}_{n\to\infty}\, z_n = \alpha \tag{15.2}$$
$$z_n \xrightarrow{P} \alpha \tag{15.3}$$
• A sequence of random scalars {zn } converges almost surely to a constant α
if we have:
P rob( lim zn = α) = 1 (15.4)
n→∞
• Now in all of the above scenarios our sequence was converging to a constant
value, however convergence holds for a target random variable as well. We
can say that a sequence of random variables {zn } converges to a random
variable z if:
$$\{z_n - z\} \xrightarrow{P} 0 \tag{15.6}$$
$$z_n \xrightarrow{P} z \tag{15.7}$$
Chapter 15. Time Series Analysis: Intermediate 169
from this statement that the mean, variance and other higher moments remain the
same across all i. Now we note some important definitions within this framework.
• A sequence of independently and identically distributed random variables
is a stationary process that exhibits no serial dependence.
• There are many aggregate time series such as GDP that are not stationary
because they exhibit time trends. Further we note that many time trends can
be reduced to stationary processes. A process is called trend stationary if it
becomes stationary after subtracting from it a linear function of time. Also,
if a process is non stationary but its first difference zi −zi−1 is stationary, then
the sequence {zi } is called difference stationary.
• A stochastic process is said to be weakly (covariance) stationary if its mean and autocovariances do not depend on $i$; its autocovariances then satisfy:
$$\Gamma_j = \Gamma_{-j} \tag{15.10}$$
We can say the the 0th order autocovariance is nothing but the variance given
by:
Γ0 = V ar(zi ) (15.11)
The corresponding notation for scalar quantities is :
γj = γ−j (15.12)
If we take a string of $n$ successive values of the stochastic process $(z_i, z_{i+1}, \cdots, z_{i+n-1})$, then by covariance stationarity the $(n \times n)$ covariance matrix is the same as that of $(z_1, z_2, \cdots, z_n)$ and is given by:
$$Var(z_i, z_{i+1}, \cdots, z_{i+n-1}) = \begin{pmatrix} \gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{n-1} \\ \gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{n-2} \\ \vdots & \cdots & \cdots & \ddots & \vdots \\ \gamma_{n-2} & \cdots & \gamma_1 & \gamma_0 & \gamma_1 \\ \gamma_{n-1} & \cdots & \gamma_2 & \gamma_1 & \gamma_0 \end{pmatrix} \tag{15.13}$$
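The Toeplitz structure in equation (15.13) can be built mechanically from the autocovariances (a sketch assuming `numpy`; the sample values of $\gamma_j$ are arbitrary):

```python
import numpy as np

def autocov_matrix(gammas):
    """Covariance matrix of (z_i, ..., z_{i+n-1}) for a covariance
    stationary process: entry (s, t) equals gamma_{|s - t|}."""
    n = len(gammas)
    g = np.asarray(gammas)
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return g[idx]

V = autocov_matrix([2.0, 1.2, 0.5, 0.1])
print(V)
# Symmetric, with gamma_0 on the diagonal and constant off-diagonals
assert np.allclose(V, V.T)
```

The matrix depends only on lags $|s - t|$, which is exactly what covariance stationarity asserts.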
$$y_t = \phi y_{t-1} + w_t \tag{15.15}$$
y2 = φy1 + w2 (15.18)
(y1 , y2 , · · · , yT ) (15.22)
Now we note that a time series operator typically transforms one type of series into
another type of time series and one such popular operator is the lag operator. This
lag operator basically gives the previous values of a variable at a particular date.
It is represented by L and its operation is shown as:
$$L^k x_t = x_{t-k} \tag{15.25}$$
yt = φyt−1 + wt (15.26)
yt = φLyt + wt (15.27)
(1 + φL + φ2 L2 + · · · + φt Lt ) (15.29)
Consider only the LHS compound operator and expand the operator in brackets to
get:
$$(1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t)(1 - \phi L) \tag{15.31}$$
$$= (1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t) - (1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t)\,\phi L \tag{15.32}$$
$$= (1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t) - (\phi L + \phi^2 L^2 + \phi^3 L^3 + \cdots + \phi^{t+1} L^{t+1}) \tag{15.33}$$
$$= (1 - \phi^{t+1} L^{t+1}) \tag{15.34}$$
Now we can substitute this compound operator back into our main equation given
by equation 16 and obtain:
Now note that if $t$ becomes very large and $|\phi| < 1$, then the expression $\phi^{t+1} y_{-1}$ tends to 0. Hence we can think of the operator $(1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t)$ as an approximation of the inverse of $(1 - \phi L)$, which in turn satisfies the following condition:
(1 − φL)−1 (1 − φL) = 1 (15.39)
With this kind of an operation over our difference equation we can essentially
write it in the form:
yt = (1 − φL)−1 wt (15.40)
yt = wt + φwt−1 + φ2 wt−2 + · · · (15.41)
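The equivalence between the AR(1) recursion and its $MA(\infty)$ expansion can be checked by simulation (assuming `numpy`; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
phi, T = 0.6, 500
w = rng.normal(size=T)

# Recursive form: y_t = phi * y_{t-1} + w_t, with y_0 = w_0
y = np.zeros(T)
y[0] = w[0]
for t in range(1, T):
    y[t] = phi * y[t - 1] + w[t]

# MA form from inverting (1 - phi*L): y_t = sum_j phi^j * w_{t-j}
t = T - 1
ma = sum(phi ** j * w[t - j] for j in range(t + 1))
print(y[t], ma)  # the two representations agree
```

With $|\phi| < 1$ the weights $\phi^j$ die out geometrically, which is why the infinite expansion converges and can be truncated in practice.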
(1 − φ1 L − φ2 L2 )yt = wt (15.43)
For a moment, let us only consider the lag operator polynomial in the LHS. We can
essentially factorize this polynomial by selecting numbers λ1 and λ2 such that:
Clearly in our search for these λ values we look to satisfy the following properties:
λ1 + λ2 = φ1 (15.45)
λ1 λ2 = −φ2 (15.46)
Now obviously we need to ensure that the left hand side of the above equation is
equal to the right hand side. With this, we can actually write out a corresponding
polynomial that fulfils the same criterion.
(1 − φ1 z − φ2 z 2 ) = (1 − λ1 z)(1 − λ2 z) (15.47)
$$E(\varepsilon_t\, \varepsilon_\tau) = 0 \quad \text{for } t \ne \tau \tag{15.53}$$
$$\varepsilon_t \sim N(0, \sigma^2) \tag{15.54}$$
$$Y_t = \mu + \varepsilon_t + \theta\, \varepsilon_{t-1} \tag{15.55}$$
We note that the constant term included in the process equation is actually the mean. The first autocorrelation of $Y$ is given by:
$$\rho_1 = \frac{\theta\sigma^2}{(1+\theta^2)\sigma^2} = \frac{\theta}{1+\theta^2} \tag{15.65}$$
Note that all the higher autocorrelations will be zero in this case.
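The value $\rho_1 = \theta/(1+\theta^2)$ can be confirmed by simulating an $MA(1)$ process (assuming `numpy`; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, mu, T = 0.5, 1.0, 500_000
eps = rng.normal(size=T + 1)

# MA(1): Y_t = mu + eps_t + theta * eps_{t-1}
Y = mu + eps[1:] + theta * eps[:-1]

Yc = Y - Y.mean()
rho1 = np.mean(Yc[1:] * Yc[:-1]) / np.mean(Yc * Yc)
print(rho1, theta / (1 + theta ** 2))  # both ~ 0.4
```

Estimating autocorrelations at lags 2 and beyond in the same way gives values near zero, as stated above.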
15.6.1 MA(q)
The q th order moving average process can be described as:
Now since the white noise terms are uncorrelated, we can write the variance as:
Coming to the computation of the j th lag covariance and dropping out the cross
product terms of the white noise terms (since they are uncorrelated and will re-
solve to zero), we get:
With $\theta_0 = 1$. Now consider the process that results when $q \to \infty$. This can be shown as:
$$Y_t = \mu + \sum_{j=0}^{\infty} \psi_j\, \varepsilon_{t-j} = \mu + \psi_0 \varepsilon_t + \psi_1 \varepsilon_{t-1} + \cdots \tag{15.78}$$
This is essentially an $MA(\infty)$ process. We further note that this infinite sequence yields a covariance stationary process provided that:
$$\sum_{j=0}^{\infty} |\psi_j| < \infty \tag{15.79}$$
j=0
Note that a sequence of numbers satisfying the above condition is said to be ab-
solutely summable. We can now calculate the mean and autocovariances of an
M A(∞) process with absolutely summable coefficients.
$$E(Y_t) = \lim_{T\to\infty} E(\mu + \psi_0 \varepsilon_t + \psi_1 \varepsilon_{t-1} + \cdots + \psi_T \varepsilon_{t-T}) = \mu \tag{15.80}$$
Yt = c + φYt−1 + t (15.86)
Where again {t } is a white noise process. In earlier sections when we looked
at the analysis of difference equations, we learnt that if |φ| ≥ 1 then the effect
of the terms on Y tend to accumulate rather than die out over time, in which
case a covariance stationary process would not exist. If however $|\phi| < 1$, then we would have a covariance stationary process, which could be characterized by the following stable equation. Note that this is the same equation we obtained after recursively substituting a general difference equation (here $w_t = c + \varepsilon_t$).
This would end up being equal to $1/(1 - |\phi|)$, since it is essentially a geometric series of partial sums. The mean of the AR(1) process can be represented as:
$$E(Y_t) = \frac{c}{1-\phi} \tag{15.90}$$
The variance can be represented as:
15.8.1 AR(2)
The second order autoregressive function can be written as:
(1 − φ1 L − φ2 L2 )Yt = c + t (15.116)
From our earlier discussions regarding difference equations and their stability, we
can say that this equation is stable if the roots of the characteristic polynomial lie
outside the unit circle. It is only when this condition is satisfied that we can say
that the AR(2) process is covariance stationary.
(1 − φ1 z − φ2 z 2 ) = 0 (15.117)
Note that the inverse of this autoregressive operator can be written as:
Now multiplying both sides of our main equation with this function we get:
Also, since $c$ is a constant, the operator applied to $c$ simply gives:
$$\psi(L)\, c = \frac{c}{1 - \phi_1 - \phi_2} \tag{15.120}$$
Additionally we also have the condition that:
$$\sum_{j=0}^{\infty} |\psi_j| < \infty \tag{15.121}$$
Now since we have effectively resolved our AR(2) process into an $MA(\infty)$ process, as is evident from the equation above, we can state the mean of the process as:
$$\mu = \frac{c}{1 - \phi_1 - \phi_2} \tag{15.122}$$
Now we can find the second moment by rewriting the main equation as:
E[Yt −µ]2 = φ1 E[(Yt −µ)(Yt−1 −µ)]+φ2 E[(Yt −µ)(Yt−2 −µ)]+E[t (Yt −µ)] (15.130)
Note what the last term in this equation would resolve to:
$$\gamma_0 = \phi_1 \rho_1 \gamma_0 + \phi_2 \rho_2 \gamma_0 + \sigma^2 \tag{15.132}$$
$$\gamma_0 = \frac{(1-\phi_2)\,\sigma^2}{(1+\phi_2)\left[(1-\phi_2)^2 - \phi_1^2\right]} \tag{15.133}$$
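The closed-form variance can be cross-checked against the Yule-Walker equations, which give $\rho_1 = \phi_1/(1-\phi_2)$, $\rho_2 = \phi_1\rho_1 + \phi_2$ and $\gamma_0 = \sigma^2/(1 - \phi_1\rho_1 - \phi_2\rho_2)$ (the parameter values below are arbitrary):

```python
phi1, phi2, sigma2 = 0.5, 0.3, 1.0

# Yule-Walker route to the AR(2) variance
rho1 = phi1 / (1 - phi2)
rho2 = phi1 * rho1 + phi2
gamma0_yw = sigma2 / (1 - phi1 * rho1 - phi2 * rho2)

# Closed-form expression for the AR(2) variance gamma_0
gamma0_cf = (1 - phi2) * sigma2 / ((1 + phi2) * ((1 - phi2) ** 2 - phi1 ** 2))

print(gamma0_yw, gamma0_cf)  # both ~ 2.2436
assert abs(gamma0_yw - gamma0_cf) < 1e-12
```

Both routes start from the same autocovariance recursion, so any stationary choice of $(\phi_1, \phi_2)$ gives matching answers.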
15.8.2 AR(p)
An autoregressive processes of the pth order can be written as:
For ensuring stationarity, the roots of our characteristic polynomial must lie outside the unit circle.
1 − φ1 z − φ2 z 2 − · · · − φp z p = 0 (15.135)
After applying the inverse of the characteristic lag operator polynomial, we can obtain the covariance stationary representation of this process as follows:
Yt = µ + ψ(L)t (15.136)
We can take expectations on the main equation to get the mean as follows:
µ = c + φ1 µ + φ2 µ + · · · + φp µ (15.139)
$$\mu = \frac{c}{1 - \phi_1 - \phi_2 - \cdots - \phi_p} \tag{15.140}$$
Writing the main autoregressive equation in mean deviation form we obtain:
Now if we multiply both sides by Yt−j − µ we would essentially obtain the autoco-
variance functions as:
γ0 = φ1 γ1 + φ2 γ2 + · · · + φp γp + σ 2 (15.143)
15.9 ARMA(p,q)
First we note that ARM A(p, q) is a process that includes both autoregressive and
moving average terms.
Our precondition for stationarity in the ARM A process is essentially the same
condition as the AR process and its stationarity essentially depends on the AR
parameters.
(1 − φ1 z − φ2 z 2 − · · · − φp z p ) = 0 (15.146)
The above equation should ideally have roots that lie outside the unit circle for the
equation system to be stable and for stationarity to exist. We would now divide
both sides of the main equation by (1 − φ1 L − φ2 L2 − · · · − φp Lp ) to get :
Yt = µ + ψ(L)t (15.147)
Where we have:
ψ(L) = (1 + θ1 L + θ2 L^2 + · · · + θq L^q) / (1 − φ1 L − φ2 L^2 − · · · − φp L^p) (15.148)
We would ultimately obtain the process mean as:
µ = c / (1 − φ1 − φ2 − · · · − φp) (15.149)
The mean is the same as for the AR(p) process. Now we can write the ARMA in
terms of mean deviations to get:
Yt − µ = φ1 (Yt−1 − µ) + · · · + φp (Yt−p − µ) + εt + θ1 εt−1 + · · · + θq εt−q
Premultiplying this equation by (Yt−j − µ) and taking expectations, we would get the covariance function
as:
γj = φ1 γj−1 + φ2 γj−2 + · · · + φp γj−p (15.151)
A word of caution here is that the above set of covariances are true only for j >
q. It is basically after q lags that the autocovariance function of ARM A follows
the same autocovariance pattern as the AR(p) process. Further we note that the
above autocovariance function does not hold for j ≤ q because of the presence of
correlations between θj εt−j and Yt−j .
15.10 Invertibility
We will now define invertibility of an M A(1) process. Consider the M A(1) as fol-
lows:
Yt − µ = (1 + θL)εt (15.152)
We note that the white noise terms are uncorrelated and have constant variance
σ^2. Provided that |θ| < 1 we can multiply both sides by (1 + θL)^{−1} to obtain:
(1 − θL + θ^2 L^2 − · · · )(Yt − µ) = εt (15.153)
We note that the above equation can essentially be viewed as an AR(∞) repre-
sentation. If a moving average process can be written as an infinite autoregressive
process by simply inverting the moving average operator (1+θL), then the moving
average process is said to be invertible. In a similar manner we can also define
invertibility for an M A(q) process as well.
Yt − µ = (1 + θ1 L + θ2 L^2 + · · · + θq L^q) εt (15.154)
Now provided that the roots of the characteristic polynomial lie outside the unit
circle, the invertibility condition would be valid.
1 + θ1 z + θ2 z^2 + · · · + θq z^q = 0 (15.155)
Therefore we can now invert the moving average operator :
(1 + η1 L + η2 L2 + · · · ) = (1 + θ1 L + θ2 L2 + · · · + θq Lq )−1 (15.156)
Multiplying on both sides we would get:
(1 + η1 L + η2 L^2 + · · · )(Yt − µ) = εt (15.157)
The above equation is essentially an AR(∞) process.
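The invertibility condition above (all roots of the MA characteristic polynomial outside the unit circle) is easy to check numerically. A small sketch assuming NumPy; the example coefficients are hypothetical:

```python
import numpy as np

def is_invertible(thetas):
    """Check invertibility of an MA(q) process Yt - mu = (1 + θ1 L + ... + θq L^q) εt
    by testing whether all roots of 1 + θ1 z + ... + θq z^q lie outside the unit circle."""
    # np.roots expects coefficients from the highest power down to the constant term.
    coeffs = list(thetas)[::-1] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_invertible([0.5]))  # True: the single root is z = -2, |z| = 2 > 1
print(is_invertible([2.0]))  # False: the single root is z = -0.5, inside the unit circle
```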
An important type of stochastic process to make note of is the white noise process, which is typically char-
acterized by a zero mean, constant finite variance and serially uncorrelated terms.
The random walk model (RWM) without drift is then written in terms of such white noise terms as:
Yt = Yt−1 + ut (15.161)
Thus in this model the value of Yt is equal to the value of Yt−1 plus the effects of a
random shock. Using continuous substitution we can write:
Y1 = Y0 + u1 (15.162)
Y2 = Y1 + u2 = Y0 + u1 + u2 (15.163)
If the process started at some initial period t = 0 then we can write the time series as:
Yt = Y0 + Σ_{i=1}^{t} ui (15.164)
V ar(Yt ) = tσ 2 (15.166)
We can clearly notice that the variance of this series depends on time and hence
violates the stationarity condition. Further we note that a RWM with no intercept
is essentially a model without drift and here Y0 = 0, therefore E(Yt ) = 0. An
interesting feature of the RWM is the persistence of random shocks as is even
evident from equation 7. We can see here that Yt is the sum of initial value Y0
plus the sum of various random shocks. We can say that the impact of a particular
shock does not die out. For example if we encounter u2 = 2 then every value of Yt
from Y2 onwards will be 2 units higher persistently hence the effect of this shock
does not die out. For this reason a RWM has infinite memory. We note that the
quantity Σ ui is known as a stochastic trend. Now if we write the RWM equation
as:
Yt − Yt−1 = ∆Yt = ut (15.167)
With ∆ being the first difference operator, we notice that while the series Yt is
nonstationary, the series of its first differences is in fact stationary.
The random walk with drift is written as:
Yt = δ + Yt−1 + ut (15.168)
Here δ is called the drift parameter. By successive substitution we can rewrite the equation as:
Yt = Y0 + tδ + Σ_{i=1}^{t} ui (15.169)
This basically shows that Yt shifts upwards or downwards depending on the sign
of δ. Computing the mean and variance for this model:
E(Yt ) = Y0 + tδ (15.170)
V ar(Yt ) = tσ 2 (15.171)
Even in this case we can see that the variance, as well as the mean, depends on time
and hence violates the stationarity conditions. The RWM with and without drift are
both nonstationary series. Consider the general model:
Yt = β1 + β2 t + β3 Yt−1 + ut (15.173)
The RWM without drift is the special case β1 = β2 = 0, β3 = 1:
Yt = Yt−1 + ut (15.174)
Its first difference, ∆Yt = ut , is stationary. Hence we can say that a RWM without
drift is a difference stationary process (DSP).
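The claims that E(Yt ) = Y0 + tδ and V ar(Yt ) = tσ^2 for the RWM with drift can be verified by simulating many paths. A sketch assuming NumPy; δ, σ and the horizon are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, delta, t, paths = 1.0, 0.5, 200, 10_000

# Many RWM-with-drift paths: Y_t = delta + Y_{t-1} + u_t, with Y_0 = 0.
u = rng.normal(0.0, sigma, size=(paths, t))
Y = delta * np.arange(1, t + 1) + np.cumsum(u, axis=1)

print(Y[:, -1].mean())      # close to t * delta = 100   (eq. 15.170 with Y0 = 0)
print(Y[:, -1].var() / t)   # close to sigma^2 = 1       (eq. 15.171)
```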
• If Xt ∼ I(d) and Yt ∼ I(d) then Zt = (aXt + bYt ) ∼ I(d∗). Note that d∗ can
be equal to d or even less than it.
Yt = Yt−1 + ut (15.183)
Xt = Xt−1 + vt (15.184)
Suppose we generated 500 observations from ut ∼ N (0, 1) and 500 observations
from vt ∼ N (0, 1). Assume that the initial values of both Xt and Yt are zero. Further
assume that ut and vt are serially and mutually uncorrelated. We know that both
these series are nonstationary, are I(1) and exhibit stochastic trends. Now if we
regress Yt on Xt we should expect an R2 that tends to be 0 and no relation since
they are fundamentally uncorrelated processes. But if by chance we obtain a result
that gives a statistically significant coefficient, then that would be termed as a
spurious regression. A good rule of thumb to identify spurious regressions is if
R2 > d where d is the Durbin Watson statistic.
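The spurious-regression experiment described above can be reproduced directly. A sketch assuming NumPy, with OLS and the Durbin-Watson statistic computed by hand:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
y = np.cumsum(rng.normal(size=n))   # one random walk (I(1))
x = np.cumsum(rng.normal(size=n))   # another, unrelated random walk

# OLS of y on x with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()

# Durbin-Watson statistic of the residuals; near 0 signals strong residual autocorrelation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(r2, dw)
```

Despite the two walks being unrelated, the fit often looks sizable while DW hugs zero, which is exactly the R^2 > d warning sign mentioned above.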
Here n is the sample size. Finally we can get the sample ACF at lag k as:
ρ̂k = γ̂k / γ̂0 (15.188)
A plot of sample ACF with k is called the sample correlogram. If the correlogram
of a time series hovers around zero and resembles that of a purely white noise cor-
relogram, it is probably a stationary series. On the other hand the correlogram of
a nonstationary random walk series will exhibit strong correlations upto large lag
lengths. In this case the autocorrelation coefficient starts at a very high value and
slowly declines towards 0 as lag length increases. Next we will begin to answer
some pertinent questions: How do we choose the lag length over which to observe
ACF patterns? And how do we decide if the correlation coefficient at a particular
lag is statistically significant?
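The sample ACF of equation (15.188), and the white-noise-versus-random-walk contrast just described, can be sketched as follows (assuming NumPy):

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample ACF: rho_hat_k = gamma_hat_k / gamma_hat_0 (eq. 15.188)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    gamma0 = np.sum(d ** 2) / n
    acf = []
    for k in range(1, max_lag + 1):
        gamma_k = np.sum(d[k:] * d[:-k]) / n
        acf.append(gamma_k / gamma0)
    return np.array(acf)

rng = np.random.default_rng(0)
white = rng.normal(size=2000)
walk = np.cumsum(white)
print(sample_acf(white, 3))  # all near zero: white-noise correlogram
print(sample_acf(walk, 3))   # starts near 1 and declines slowly: random walk
```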
where n is the sample size and m is the lag length. This test is usually used to
check if a given series is a white noise series or not. In large samples this Q test
statistic is approximately distributed as a chi square variable with m degrees of
freedom. Therefore if the computed Q exceeds the critical Q from the chi square
distribution at the chosen level of significance, then we reject the null hypothesis
that all the ρk are simultaneously zero. A variation of this test is the Ljung Box
test statistic which is given as:
LB = n(n + 2) Σ_{k=1}^{m} [ ρ̂k^2 / (n − k) ] ∼ χ^2 (m) (15.193)
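A direct implementation of the Ljung-Box statistic (15.193), assuming NumPy; the series length and seed are arbitrary:

```python
import numpy as np

def ljung_box(y, m):
    """LB = n(n+2) * sum_{k=1}^m rho_hat_k^2 / (n - k); approx. chi^2(m) under H0."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    gamma0 = d @ d / n
    stat = 0.0
    for k in range(1, m + 1):
        rho_k = (d[k:] @ d[:-k] / n) / gamma0
        stat += rho_k ** 2 / (n - k)
    return n * (n + 2) * stat

rng = np.random.default_rng(3)
noise = rng.normal(size=1000)
# The 5% critical value of chi-square with 10 d.o.f. is about 18.31; white noise
# usually falls below it, while a random walk blows far past it.
print(ljung_box(noise, 10))
print(ljung_box(np.cumsum(noise), 10))
```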
With this manipulation we are effectively testing the hypothesis that δ = 0 against
the alternative hypothesis that δ < 0. If δ = 0, and therefore ρ = 1, we conclude that
our process is infact unit root nonstationary. Before proceeding another interesting
point to note is that if δ actually turns out to be 0 then our model would effectively
collapse to:
∆Yt = (Yt − Yt−1 ) = ut (15.197)
which implies our earlier point that the first differences of a RWM is a station-
ary random process. Now reiterating the problem - we will first regress the first
differences of Yt on Yt−1 and check if the estimated slope of the regression δ̂ is sta-
tistically equal to 0 or not. If it is zero we conclude the series to be nonstationary
and if it is negative we conclude it to be stationary. Note that we cannot use a t test
for the null hypothesis δ = 0 here because even in large samples the coefficient
does not asymptotically resolve to a normal distribution. Now Dickey and Fuller
showed that under the null hypothesis, t value of the coefficient of Yt−1 follows a
tau statistic. This Tau statistic test is known as the DF test. The various possible
scenarios under which this test might be applied are:
• In all cases the null hypothesis is given by: H0 : δ = 0 - there is a unit root;
time series is nonstationary; has a stochastic trend.
Now we note that if the null hypothesis is rejected it could mean that (1) Yt is
stationary with zero mean in the first case. (2) Yt is stationary with nonzero
mean. We must note that the critical values of the Tau statistic would be different
for different model specifications and hence we must try to ensure that we are not
committing a specification error in modeling. Now we estimate as follows:
• Divide the estimated coefficient of Yt−1 by its standard error to get the Tau
statistic and refer to the DF table.
• If the computed Tau value |τ | exceeds the critical DF value then we reject
the null hypothesis and conclude stationarity of the process. Alternatively, if
the computed value is smaller than the critical DF value, we do not reject the
null hypothesis and conclude that the series is nonstationary.
As a demonstration the regression of the above three model specifications is
given below:
• After this we find from the DF table that the 5% significance level critical
values for the three models are: −1.95 (no intercept, no trend), −2.88
(intercept, no trend) and −3.43 (intercept and trend).
• Now, first of all, we should immediately rule out model 1 because its coeffi-
cient is positive. δ > 0 would imply ρ > 1, which would make the model
explosive.
• In the other two models we can accordingly compute the Tau stat and com-
pare with the critical values.
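The estimation steps above can be sketched by running the DF regression by hand. This is a simplified illustration assuming NumPy (no lag augmentation; intercept-no-trend case only); the resulting tau statistic would then be compared with the critical values quoted above, e.g. −2.88.

```python
import numpy as np

def df_tau(y):
    """Tau statistic from the DF regression  ΔY_t = α + δ Y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])  # delta-hat divided by its standard error

rng = np.random.default_rng(7)
walk = np.cumsum(rng.normal(size=500))   # unit-root series
ar = np.zeros(500)                       # stationary AR(1) with phi = 0.5
for t in range(1, 500):
    ar[t] = 0.5 * ar[t - 1] + rng.normal()

print(df_tau(walk))  # typically modest in magnitude: do not reject the unit root
print(df_tau(ar))    # large negative: reject the unit root, the series is stationary
```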
Here εt is a pure white noise error term and ∆Yt−1 = (Yt−1 − Yt−2 ), ∆Yt−2 =
(Yt−2 − Yt−3 ) and so on. The point is to specifically include the lagged terms so
as to ensure that the error terms are not correlated, thereby giving us unbiased
estimates of δ. Note that lag length is usually selected based on information
criteria.
Yt = β1 + β2 t + ut (15.201)
Note that the L denotes natural logarithm and β2 represents the elasticity of real
personal consumption expenditure to real disposable personal income. We can
denote it as consumption elasticity.
ut = LPCEt − β1 − β2 LDPIt (15.204)
Now suppose we subject ut to a unit root test and find that it is stationary I(0).
The interesting point here is that even though LPCE and LDPI are individually I(1)
having stochastic trends, we find that their linear combination is I(0). Basically we
can say that the linear combination has the effect of cancelling out the two stochas-
tic trends in the series. Therefore in this scenario this regression of consumption
on income would make sense. We say that the two variables are cointegrated if
their linear combination gives a stationary series I(0). Note that typically variables
are cointegrated if they have a long term relationship between them. In short, to
test cointegration we see if the residuals of the regression function are I(0).
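The residual-based cointegration check can be illustrated with simulated data. A sketch assuming NumPy; the data-generating process and coefficients are made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000
x = np.cumsum(rng.normal(size=n))        # I(1) "income" series
y = 1.0 + 2.0 * x + rng.normal(size=n)   # cointegrated with x: y - 1 - 2x is I(0)

# Cointegrating regression of y on x with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta                         # residual of the regression

print(beta[1])     # close to the true long-run coefficient 2
print(np.std(u))   # bounded (close to 1), even though x and y themselves wander
```

In practice the residual u would then be subjected to a unit root test, as the text describes.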
Further we note that here δ is the mean of Yt and ut is a white noise process with
mean 0 and variance σ 2 . We can say that Yt follows a first order autoregressive
process or is AR(1). The value of Y at time t depends on its previous first lag
value and a random shock. Note that the Y values are typically expressed as their
mean deviations and the model is basically stating that the current value of Y is
a proportion of its previous value and an error term. In a more general way, an
AR(p) process can be expressed as:
Yt = µ + β1 ut + β2 ut−1 (15.216)
• Estimation: After figuring out the appropriate p and q values to form the
model, we need to estimate the parameters included in the model which are
the AR and M A terms.
15.22 Identification
The main tools used in the identification process are the ACF, PACF and their cor-
responding correlograms which are nothing but plots of ACF/PACF with the corre-
sponding lag lengths. Now the concept of partial autocorrelation is analogous to
the concept of a partial regression coefficient in multivariate regression analysis.
For example the k th coefficient βk measures the rate of change in the mean value of
the regressand for a unit increase in the value of the k th regressor, while holding
the effect of all other regressors constant. Similarly, the partial autocorrelation
ρkk measures the correlation between observations that are k time periods apart
after controlling for autocorrelations in the intermediate time lags (less than k).
For example the correlation exhibited between Yt and Yt−k might be due to the
correlation they exhibit with intermediate lags Yt−1 , Yt−2 , · · · . The partial correla-
tion ρkk basically removes their influence.
From the standard figures of ACF and PACF correlograms in the GDP series we
will notice that the ACF declines very slowly and the ACF values upto 22 lag val-
ues are statistically significantly different from zero, that is they all fall outside the
95% confidence interval bounds. On the other hand, the PACF falls dramatically
after the second lag and PACF values after lag 2 are statistically insignificant. Now
we must note that this GDP series is nonstationary and before applying the Box
Jenkins approach we must convert it to a stationary series. What we have seen
before as well is that if we take the first differences in GDP we do not observe any
time trends and also after conducting DF unit root tests we were able to ascertain
that the first differenced series is stationary.
We see that after this the ACF and PACF plots look very different. In this par-
ticular example we have the ACF at lags 1, 2, 5 that are statistically significant and
all others are insignificant. In the PACF, we have only the lags 1, 12 that are sta-
tistically significant. Now to see if our data can be modeled as an ARM A, M A
or AR process, we need to compare the ACF and PACF plots of standard models
of AR(1), M A(1), ARM A(1, 1), M A(2), ARM A(2, 2) and so on. All these series
have typically characteristic patterns in the ACF and PACF and if our data ACF
and PACF patterns resemble any of those, we will select that as our modeling ap-
proach. We will then apply diagnostic checks to see if the chosen ARM A model is
accurate or not. To do this we keep some ground rules handy:
• AR(p) - ACF decays exponentially or with a damped sine wave or both. PACF
has significant spikes through lag p and is insignificant thereafter.
• M A(q) - ACF has significant spikes through lag q and is insignificant thereafter.
PACF declines exponentially.
• ARM A(p, q) - Both the ACF and the PACF show exponential decay.
15.23 Estimation
Let Y ∗ denote the first differenced logged GDP figures and let us assume that our
data exhibits an M A(2) pattern and so we model it that way.
Joint Distributions
FX (a) = lim_{b→∞} P (X ≤ a, Y ≤ b) (16.4)
P (a1 < X < a2 , b1 < Y < b2 ) = F (a2 , b2 )+F (a1 , b1 )−F (a1 , b2 )−F (a2 , b1 ) (16.12)
Chapter 16. Joint Distributions 198
In case random variables X and Y happen to be discrete we can define the joint
probability mass function of X and Y as follows:
p(x, y) = P (X = x, Y = y) (16.13)
The associated Marginals would be given as follows:
pX (x) = P (X = x) = Σ_y p(x, y) (16.14)
pY (y) = P (Y = y) = Σ_x p(x, y) (16.15)
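Equations (16.14)-(16.15) amount to summing a joint pmf table over the other variable. A tiny sketch with a hypothetical joint pmf, assuming NumPy:

```python
import numpy as np

# Hypothetical joint pmf p(x, y): rows index x in {0, 1}, columns index y in {0, 1}.
p = np.array([[0.10, 0.20],
              [0.30, 0.40]])

p_x = p.sum(axis=1)   # marginal of X: sum over y, eq. (16.14)
p_y = p.sum(axis=0)   # marginal of Y: sum over x, eq. (16.15)
print(p_x, p_y)       # [0.3, 0.7] and [0.4, 0.6] up to floating point
```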
The function f (x, y) is known as the joint PDF of X and Y . Suppose that A and
B are sets of real numbers and we define C such that C = {(x, y) : x ∈ A, y ∈ B}
then we can write:
P {X ∈ A, Y ∈ B} = ∫_B ∫_A f (x, y) dx dy (16.17)
Also note that:
F (a, b) = P {X ∈ (−∞, a], Y ∈ (−∞, b]} = ∫_{−∞}^{b} ∫_{−∞}^{a} f (x, y) dx dy (16.18)
If we partially differentiate the above function by both a and b we would get back
the PDF as follows:
f (a, b) = ∂^2 F (a, b) / ∂a ∂b (16.19)
We note a rather interesting interpretation of the joint density function below:
P {a < X < a + da, b < Y < b + db} = ∫_a^{a+da} ∫_b^{b+db} f (x, y) dy dx ≈ f (a, b) da db (16.20)
Where da and db are very small quantities. Now the marginals for the individual
continuous random variables can be given by:
P {X ∈ A} = P {X ∈ A, Y ∈ (−∞, ∞)} (16.21)
= ∫_A ∫_{−∞}^{∞} f (x, y) dy dx = ∫_A fX (x) dx (16.22)
Note the following points additionally:
fX (x) = ∫_{−∞}^{∞} f (x, y) dy (16.23)
fY (y) = ∫_{−∞}^{∞} f (x, y) dx (16.24)
16.1.2 Independence
The random variables X and Y are said to be independent if for two sets of real
numbers A and B we can write:
P {X ∈ A, Y ∈ B} = P {X ∈ A}P {Y ∈ B} (16.25)
Basically we say that the above two random variables are independent if for all
possible A and B the events {X ∈ A} and {Y ∈ B} are independent.
What follows from this rule are the following points:
F (a, b) = FX (a)FY (b) (16.26)
p(x, y) = pX (x)pY (y) (16.27)
16.2 Conditionals
For two events the conditional probability of E given F is as follows:
P (E|F ) = P (EF ) / P (F ) (16.34)
For discrete variables we would have to define the conditional probability mass
function of X given Y = y. This is given as:
pX|Y (x|y) = P {X = x|Y = y} = P {X = x, Y = y} / P {Y = y} = p(x, y) / pY (y) (16.35)
A crucial point to note is that if X and Y are independent then the conditional dis-
tribution of X with respect to Y is just the same as the unconditional distribution
of X.
pX|Y (x|y) = P (X = x) (16.37)
16.3 Expectations
We start off by recalling the expected value of a random variable X as given by
the expression:
E[X] = Σ_x x p(x) (16.40)
E[X] = ∫_{−∞}^{∞} x f (x) dx (16.41)
We say that E[X] is a weighted average of all possible values of X with the asso-
ciated probabilities as the weights. Consider two random variables X and Y and
g as a function of these two variables. Then we have:
E[g(X, Y )] = Σ_y Σ_x g(x, y) p(x, y) (16.42)
E[g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f (x, y) dx dy (16.43)
16.4 Covariance
The covariance is an expression that captures the extent of linear relationship
between two random variables X and Y . It is defined as:
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]
Note that if X and Y are independent then the covariance would be 0. Note some
properties of the covariance:
• Cov(aX, Y ) = a Cov(X, Y )
• Cov( Σ_{i=1}^{n} Xi , Σ_{j=1}^{m} Yj ) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(Xi , Yj )
We can actually obtain probabilities apart from just expectations of a random vari-
able by conditioning it on some other random variable. Let E be an arbitrary event
and let X denote an indicator random variable as follows:
X = 1 if E occurs, and X = 0 if E does not occur (16.53)
We can now apply the law of total expectation in terms of probabilities as follows:
P (E) = Σ_y P (E|Y = y) P (Y = y) (16.56)
P (E) = ∫ P (E|Y = y) fY (y) dy (16.57)
Now we can essentially add equations 21 and 22 to get the variance of X. With
this we can present the conditional variance decomposition as follows:
Conditional Probability
Definition 17.1.1 Joint Probability Function. When X and Y are discrete random
variables, then we refer to f (x, y) as the joint probability mass function of X and
Y iff the following is satisfied.
f (x, y) = P (X = x, Y = y) (17.1)
Definition 17.1.2 Joint Probability Density Function. When X and Y are con-
tinuous random variables, then we refer to f (x, y) as the joint probability density
function iff the following is satisfied.
P (a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f (x, y) dy dx (17.2)
When we are given the joint probability distribution functions (pmf or pdf), we can
compute the marginal probability distribution functions of the random component
variables using the following definitions.
Chapter 17. Conditional Probability 204
Definition 17.1.5 Joint Cumulative Distribution Function. The joint cdf of two
random variables X and Y is given by:
FXY (x, y) = P (X ≤ x, Y ≤ y) (17.7)
The above definition of the joint cdf is applicable to discrete, continuous and mixed
random variables. We can also write this as:
P (x1 < X ≤ x2 , y1 < Y ≤ y2 ) = F (x2 , y2 ) − F (x1 , y2 ) − F (x2 , y1 ) + F (x1 , y1 )
[Figure: the rectangle with corners (x1 , y1 ), (x2 , y1 ), (x1 , y2 ), (x2 , y2 ) over which this probability is computed.]
PX|A (xi ) = P [X = xi | A] = P (X = xi and A) / P (A)
FX|A (x) = P [X ≤ x | A]
PX|Y (xi | yj ) = P [X = xi | Y = yj ] = P [X = xi , Y = yj ] / P [Y = yj ] = PXY (xi , yj ) / PY (yj )
When two random variables X and Y are independent, then the occurrence of X
does not depend on the occurrence of Y and vice-versa. Thus, we will modify the
pmf and the cdf accordingly.
This occurs because when X and Y are independent PX|Y (xi | yj ) = PX (xi ) since
Y does not provide any information about X.
A sample space Ω has partitions B1 , B2 , B3 ... and so on. Also an event A is seen to
occur within the same sample space Ω.
The total probability of A can then be a cumulative sum of all the probabilities of
intersections of A and the partitions.
[Figure: sample space Ω partitioned into B1 , B2 , . . . , with event A overlapping the partitions as the regions A ∩ B1 , A ∩ B2 , . . . ]
If you want to find P (A), you can look at a partition of Ω, and add the amount of
probability of A that falls in each partition.
P (A) = Σ_i P (A | Bi ) · P (Bi ) (Probability of A) (17.15)
We compute the probability of X using the partitions of the discrete random vari-
able Y
PX (x) = Σ_{yj ∈ RY} PX|Y (x | yj ) PY (yj ) (Probability of X using partitions of Y ) (17.16)
The expectation of X and the conditional expectation of X given Bi using the Law
of Total Probability are given by:
E[X] = Σ_{x ∈ RX} x · P [X = x] (Expectation of X) (17.17)
E[X|Bi ] = Σ_{x ∈ RX} x · P [X = x | Bi ] (Conditional expectation of X in every partition) (17.18)
Thus, substituting equations (17.17) and (17.18) to compute the total expectation of X,
we get
Σ_{x ∈ RX} x · PX (x) = Σ_{yj ∈ RY} [ Σ_{x ∈ RX} x · P [X = x | Y = yj ] ] PY (yj ) (17.20)
As we can clearly see, the LHS in equation (17.20) is simply the expectation of X and
the term within brackets on the RHS is E[X | Y = yj ].
The above results can be summarised to obtain the Law of Total Expectation which
is given by
E[X] = Σ_{yj ∈ RY} E(X | yj ) PY (yj ) (17.21)
Basically the law of total expectation tells us how to compute the expectation
E(X) using the knowledge of the conditional expectation of X given Y = yj . This is
directly derived from the law of total probability since the expectation is summed
up over the marginal probabilities of the partitions of Y
g(y) = E[X | Y = y] = Σ_{xi ∈ RX} xi · PX|Y (xi | y)
g(Y ) = E[X | Y ]
Example 1: X = (aY + b)
E[X | Y = y] = E[(aY + b) | Y = y] = ay + b
E[X | Y ] = aY + b
Example 2: Let us assume two random variables X and Y with joint pmf and let
Z = E[X | Y ]
           Y = 0    Y = 1
X = 0      1/5      2/5      P (X = 0) = 1/5 + 2/5 = 3/5
X = 1      2/5      0        P (X = 1) = 2/5 + 0 = 2/5
           P (Y = 0) = 1/5 + 2/5 = 3/5    P (Y = 1) = 2/5 + 0 = 2/5
(X | Y = 0) ∼ Bern(2/3)
PX|Y (1 | 1) = PX,Y (1, 1) / PY (1) = 0 / (2/5) = 0
∴ we can say with certainty that whenever Y = 1, X will take the value 0.
(X | Y = 1) ∼ Bern(0)
We know that if X and Y are independent,
PX,Y (x, y) = PX (x).PY (y) ∀ x, y
In the above example,
PX,Y (0, 0) = 1/5 ≠ PX (0) · PY (0) = (3/5) · (3/5) = 9/25
Thus, X and Y are not independent random variables.
Z = E[X | Y ] = E[X | Y = 0] with probability P (Y = 0) = 3/5, and
Z = E[X | Y ] = E[X | Y = 1] with probability P (Y = 1) = 2/5, i.e.
Z = 2/3 with probability 3/5, and Z = 0 with probability 2/5.
Therefore, PMF of Z can be written as
PZ (z) = 3/5 if z = 2/3; 2/5 if z = 0; 0 otherwise
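The worked example above can be reproduced exactly with Python's `fractions` module, which also confirms the law of total expectation, E[Z] = E[X]:

```python
from fractions import Fraction as F

# Joint pmf from the example: p(x, y) for x, y in {0, 1}.
p = {(0, 0): F(1, 5), (0, 1): F(2, 5),
     (1, 0): F(2, 5), (1, 1): F(0)}

p_y = {y: p[(0, y)] + p[(1, y)] for y in (0, 1)}
# g(y) = E[X | Y = y] = sum over x of x * p(x, y) / p_Y(y)
g = {y: (0 * p[(0, y)] + 1 * p[(1, y)]) / p_y[y] for y in (0, 1)}

E_X = sum(x * prob for (x, _), prob in p.items())
E_Z = sum(g[y] * p_y[y] for y in (0, 1))
print(g[0], g[1], E_X, E_Z)  # 2/3 0 2/5 2/5
```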
Note: For given random variables X and Y and known functions g(X) and
h(Y ), we can compute the conditional expectation as
E[g(X) · h(Y ) | X] = g(X) · E[h(Y ) | X]
E[X] = E[E[X | Y ]] = E[g(Y )]
Independence ⇒ Cov(X, Y ) = 0, but Cov(X, Y ) = 0 does not imply independence.
Also, if X and Y are independent random variables
E[g(X).h(Y )] = E[g(X)].E[h(Y )]
Recall,
σX^2 = E[X^2 ] − (µX )^2
Note: For given random variables X and Y , σ^2_{X|Y=y} = V ar(X | Y = y) is a
deterministic function of y, while V ar(X | Y ) is a random variable.
17.7.1 Derivation
Assuming Z = E[X | Y ],
V = V ar(X | Y ) = E[X^2 | Y ] − (E[X | Y ])^2 = E[X^2 | Y ] − Z^2 (17.24)
Using the law of total expectation, the expectation of V will be given as
E[V ] = E[E[X^2 | Y ]] − E[Z^2 ] = E[X^2 ] − E[Z^2 ] (17.25)
17.7.3 Intuition
Let us demonstrate the intuition behind the law of total variance with some exam-
ples.
Consider, for example, measuring the heights of people across different countries.
Here, the variance decomposition has been done in a way that the LHS gives us
the total variance of height in the population. The first term on the RHS gives us
the within-country variability of height, i.e. the average of the variances of heights
within each country. The second term is the inter-sample phenomenon we men-
tioned earlier, i.e. it tells us the variability of height across countries by giving
us the variance of the average height across countries.
• µN = E(N )
• σN^2 = V ar(N )
Assume that Xi are independent of each other which implies that the amount spent
by every customer is independent of the amount spent by any other customer.
Also, assume that Xi is independent of N i.e. the amount spent by each customer
is independent of the number of customers that visit the store on that day.
E(Y ) = E[ Σ_{i=1}^{N} Xi ] ≠ Σ_{i=1}^{N} E(Xi ), since N is itself random.
Conditioning on N instead:
E(Y ) = E[E(Y | N )] = E[ E[ Σ_{i=1}^{N} Xi | N ] ] = E[ Σ_{i=1}^{N} E[Xi | N ] ]
= E[ Σ_{i=1}^{N} E(Xi ) ] (since Xi is independent of N , E[Xi | N ] = E(Xi ))
= E[N · E(X)]
∴ E(Y ) = E(N ) · E(X)
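The store example (total spend Y = Σ Xi with a random number of customers N) can be checked by Monte Carlo. A sketch assuming NumPy; the Poisson and exponential distributional choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
days = 20_000
lam, mean_spend = 4.0, 10.0   # N ~ Poisson(4) customers, X ~ Exponential(mean 10)

n_customers = rng.poisson(lam, size=days)
# Daily totals: sum of N iid spends, with N independent of the spends.
totals = np.array([rng.exponential(mean_spend, size=n).sum() for n in n_customers])

print(totals.mean())   # close to E(N) * E(X) = 4 * 10 = 40
```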
17.8 References
• Ramachandran, K. M., & Tsokos, C. P. (2020). Mathematical statistics with
applications in R. Academic Press.
Chapter 18. Multivariate Gaussian Distribution 217
In extension, the expected value for a sum of random vectors, is the sum of indi-
vidual expected values of each random vector.
E[X1^T + X2^T + · · · + Xk^T ] = E[X1^T ] + E[X2^T ] + · · · + E[Xk^T ] (18.4)
Writing each random vector out in components, the expectation is taken elementwise,
E[Xi ] = (E[Xi1 ], E[Xi2 ], . . . , E[Xin ])^T for each i = 1, . . . , k,
and the sum above is simply the componentwise sum of these expectation vectors.
In a similar manner, we define the covariance matrix (CX ) for a random vector X
as the matrix whose (i, j) entry is E[(Xi − µXi )(Xj − µXj )]:
CX = E[(X − µX )(X − µX )^T ]
   = [ V ar(X1 )      Cov(X1 , X2 )   . . .   Cov(X1 , Xn ) ]
     [ Cov(X2 , X1 )  V ar(X2 )       . . .   Cov(X2 , Xn ) ]
     [ . . .                                                 ]
     [ Cov(Xn , X1 )  Cov(Xn , X2 )   . . .   V ar(Xn )     ] (18.6)
In a similar manner, we can also define for any two random vectors X and Y ,
cross-correlation (RXY ) and cross-covariance (CXY ) matrices.
RXY = E[XY^T ]
CXY = E[(X − µX )(Y − µY )^T ] (18.7)
Note. Here, µX = E[X] is the expected value of random vector X and µY = E[Y ]
is the expected value of random vector Y .
µX = E[X] = (E[X1 ], E[X2 ], . . . , E[Xn ])^T = (µ1 , µ2 , . . . , µn )^T (18.8)
Note here that Y^T is not a vector but a number, thus Y^T = Y . Substituting the
value of Y = b^T (X − E[X]) in the above expression,
E[Y Y^T ] ≥ 0 ⟹ E[ (b^T (X − E[X])) (b^T (X − E[X]))^T ] ≥ 0
⟹ E[ b^T (X − E[X]) (X − E[X])^T b ] ≥ 0
⟹ b^T E[ (X − E[X]) (X − E[X])^T ] b ≥ 0, where the middle factor is CX , the covariance matrix of X
⟹ b^T CX b ≥ 0 for all b, which is the condition for positive semidefiniteness ⟹ CX is PSD. (18.10)
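The PSD conclusion can be confirmed numerically: a sample covariance matrix has nonnegative eigenvalues and a nonnegative quadratic form. A sketch assuming NumPy; the mixing matrix is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 3-dimensional data via an arbitrary mixing matrix.
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
X = rng.normal(size=(10_000, 3)) @ A

C = np.cov(X, rowvar=False)       # sample covariance matrix C_X
eigvals = np.linalg.eigvalsh(C)   # eigenvalues of the symmetric matrix C

b = rng.normal(size=3)
print(np.all(eigvals >= -1e-10))  # True: C_X is PSD
print(b @ C @ b >= -1e-10)        # True: b^T C_X b >= 0 for any b
```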
Theorem 18.2.1 If X ∼ N(µX , σX^2 ) and Y ∼ N(µY , σY^2 ) are jointly normal random
variables, setting a = b = 1 ⟹ aX + bY = X + Y , then the resultant random
variable X + Y ∼ N(µX + µY , σX^2 + σY^2 + 2ρXY σX σY ). Essentially, a linear com-
bination of two jointly normal variables X and Y gives a normal variable X + Y .
We know from the above theorem that since Z1 and Z2 are independent normal
distributions, they are also jointly normal. We can therefore attempt to obtain the
joint PDF of Z1 and Z2 . Because, these are independent, we have,
fZ1 ,Z2 (z1 , z2 ) = (1/2π) e^{−(z1^2 + z2^2)/2} (18.11)
In order to prove that X and Y are bivariate normal, we need to show first that
aX + bY is normal ∀ a, b ∈ R.
aX + bY = aZ1 + b[ρZ1 + √(1 − ρ^2) Z2 ]
        = (a + bρ) Z1 + b√(1 − ρ^2) Z2 (18.12)
Both coefficients are constants, so aX + bY is a linear combination of the independent
normals Z1 and Z2 and is therefore normal.
V ar(X) = V ar(Z1 ) = 1
V ar(Y ) = V ar(ρZ1 ) + V ar(√(1 − ρ^2) Z2 ) = ρ^2 V ar(Z1 ) + (1 − ρ^2) V ar(Z2 ) = 1
E[X] = E[Z1 ] = 0
E[Y ] = ρ E[Z1 ] + √(1 − ρ^2) E[Z2 ] = 0 (both expectations are 0, as given)
ρ(X, Y ) = Cov(X, Y ) = Cov(Z1 , ρZ1 + √(1 − ρ^2) Z2 )
         = ρ Cov(Z1 , Z1 ) + √(1 − ρ^2) Cov(Z1 , Z2 )
         = ρ V ar(Z1 ) + 0 = ρ
since V ar(Z1 ) = 1 and Cov(Z1 , Z2 ) = 0 by the independence of Z1 and Z2 .
In order to prove a very important result about the PDF of a bivariate normal
distribution, we first construct two normal random variables X ∼ N(µX , σX^2 ) and
Y ∼ N(µY , σY^2 ), such that ρ(X, Y ) = ρ. Let Z1 ∼ N(0, 1) and Z2 ∼ N(0, 1) be
independent random variables.
X = σX Z1 + µX
Y = σY [ρZ1 + √(1 − ρ^2) Z2 ] + µY (18.14)
where µX , µY ∈ R, σX > 0, σY > 0 and ρ ∈ (−1, 1) are all parameters of the
PDF of the joint normal distribution fX,Y (x, y).
Theorem 18.2.2 Let X and Y be two bivariate normal random variables with joint
PDF fX,Y (x, y) given by:
fX,Y (x, y) = [1 / (2π σX σY √(1 − ρ^2))] exp{ −[1/(2(1 − ρ^2))] [ ((x − µX)/σX)^2 + ((y − µY)/σY)^2 − 2ρ(x − µX)(y − µY)/(σX σY) ] } (18.17)
Then, there exist independent standard normal variables Z1 and Z2 such that X
and Y are given by:
X = σX Z1 + µX
Y = σY [ρZ1 + √(1 − ρ^2) Z2 ] + µY (18.18)
Note. We can generate X and Y from the standard normal Z1 and Z2 . Using this
method, we can generate samples from the bivariate normal distribution.
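A sketch of that generation recipe (assuming NumPy; the parameter values are hypothetical): build X and Y from independent standard normals Z1, Z2 as in (18.18), then check the sample moments.

```python
import numpy as np

rng = np.random.default_rng(8)
mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7
n = 100_000

z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
x = sd_x * z1 + mu_x                                       # X = sigma_X Z1 + mu_X
y = sd_y * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu_y    # eq. (18.18)

print(np.corrcoef(x, y)[0, 1])   # close to rho = 0.7
print(x.mean(), y.mean())        # close to 1.0 and -2.0
```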
On Independence of RVs
If X and Y are jointly normal random variables and they are uncorrelated,
Cov(X, Y ) = 0, then X and Y are independent random variables.
Theorem 18.2.3 If X and Y are bivariate normal and are uncorrelated Cov(X, Y ) =
0, then they are independent random variables.
fX,Y (x, y) = [ (1/(√(2π) σX )) e^{−(x − µX)^2 / (2σX^2)} ] · [ (1/(√(2π) σY )) e^{−(y − µY)^2 / (2σY^2)} ]
            = fX (x) · fY (y)
where the first factor is the marginal distribution of X and the second is the marginal distribution of Y .
W = a1 X1 + a2 X2 + · · · + an Xn (18.20)
where W is a linear combination of all n random variables. In vector form,
W = a^T X (18.21)
The above equation (18.22) is the PDF of a Standard Gaussian vector. We may
now, extend this PDF for a general normal random vector X with mean vector m
and covariance matrix CX
Let Z be a standard normal random variable such that Z ∼ N(0, 1), and let X
be a general normal random variable with finite mean µ and variance σ^2. We can
obtain X from Z using the transformation:
X = σZ + µ
Likewise, one can extend this to the case of random vector X with mean vector
m, where m = E[X] and covariance matrix CX such that
î ó
CX = E (X − m)(X − m)T
Our objective now is to find such a transformation. It turns out that if CX can
be decomposed into CX = AAT , then
X = AZ + m
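One standard choice for the factor A with CX = A Aᵀ is the Cholesky decomposition. A sketch assuming NumPy; the mean vector and covariance matrix are hypothetical:

```python
import numpy as np

m = np.array([1.0, 2.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])

A = np.linalg.cholesky(C)            # lower-triangular A with C = A @ A.T
rng = np.random.default_rng(9)
Z = rng.normal(size=(2, 100_000))    # columns of independent standard normals
X = A @ Z + m[:, None]               # each column is one draw of the vector X

print(X.mean(axis=1))   # close to m
print(np.cov(X))        # close to C
```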
References
[1] Stochastic Processes Course notes of Prof. Rakesh Nigam.
[3] Karlin, Samuel, and Howard E. Taylor. A second course in stochastic pro-
cesses. Elsevier, 1981.
[4] Karlin, S., and H. M. Taylor. A first course in stochastic processes, Acad. Press,
New York. 1966.
Part III
Stochastic Processes
Chapter 19
19.1 Premise
In probability theory, a stochastic process is a set of random variables indexed by
time or space. It refers to a system that has observations at various points of time
and the outcome or the observed value for each observation is actually a random
variable.
Chapter 19. Introduction to Markov Chains 227
Suppose we model the mood of a person as a state. A person can be in two states
- a happy state H or a sad state S. A Markov chain on this model will imply that
the mood of the person tomorrow will only be affected by the mood of the person
today and not by her mood yesterday or any previous day. The possible states are
S = {H, S} and the occurrences are indexed by time intervals n, n + 1.
[Figure: two-state transition diagram; from H, stay in H with probability 0.8 or move to S with probability 0.2; from S, move to H with probability 0.7 or stay in S with probability 0.3.]
The corresponding transition matrix (rows: current state H, S; columns: next state H, S) is:
P = [ 0.8 0.2 ]
    [ 0.7 0.3 ]
The transitions between states occur at discrete times. In this example we can say
that P (Hn+1 | Hn ) = 0.8 i.e. if the person is happy today, the probability that she
will be happy tomorrow is 0.8 and the probability that she is sad tomorrow given
that she was happy today is P (Sn+1 | Hn ) = 0.2.
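The mood example can be explored numerically with its transition matrix (a sketch assuming NumPy): powers of P give multi-step transition probabilities, and high powers reveal the long-run distribution.

```python
import numpy as np

# Transition matrix of the mood chain, states ordered (H, S).
P = np.array([[0.8, 0.2],
              [0.7, 0.3]])

P2 = P @ P
# P(happy in two days | happy today) = 0.8*0.8 + 0.2*0.7
print(P2[0, 0])                        # 0.78

pi = np.linalg.matrix_power(P, 50)[0]  # rows converge to the stationary distribution
print(pi)                              # approximately [0.778, 0.222], i.e. (7/9, 2/9)
```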
If you start with an initial wealth W0 , the question that we need to address is:
given the above setup, should you play this game or not?
According to the previous cases, the outcome of the bet can be modelled as a
random variable.
W (t) = W (t − 1) + Y (t)
where Y(t) is given as

Y(t) = +b if the bet is won, with probability p
       −b if the bet is lost, with probability 1 − p = q
E[W(t)] = E[W0] + Σ_{i=1}^{t} E[Y(i)]
E[W(t)] = W(0) + E[Y(i)] t
E[W(t)] = W(0) + (2p − 1)bt
since E[Y(i)] = pb − qb = (2p − 1)b. For example, if 2p = 1.1 (i.e. p = 0.55), then 2p − 1 = 0.1.
W (t) = W (t − 1) + Y (t)
E[W (t) | W (t − 1)] = E[W (t − 1) | W (t − 1)] + E[Y (t) | W (t − 1)]
= W (t − 1) + E[Y (t) | W (t − 1)]
since Y(t) is independent of previous wealth
= W (t − 1) + E[Y (t)]
= W (t − 1) + (2p − 1)b
Now conditioning on W (t − 2)
W (t) = W (t − 1) + Y (t)
W (t − 1) = W (t − 2) + Y (t − 1)
W (t) = W (t − 2) + Y (t − 1) + Y (t)
E[W (t) | W (t − 2)] = W (t − 2) + (2p − 1)b + (2p − 1)b
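A small Monte Carlo sketch of the betting model (the values of p, b, W0 and t below are made-up) can confirm E[W(t)] = W(0) + (2p − 1)bt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: win probability p, bet size b, initial wealth W0.
p, b, W0, t, trials = 0.55, 1.0, 100.0, 50, 50_000

# Each bet contributes +b with probability p and -b with probability 1 - p.
bets = rng.choice([b, -b], size=(trials, t), p=[p, 1 - p])
wealth = W0 + bets.sum(axis=1)

theory = W0 + (2 * p - 1) * b * t   # E[W(t)] = W0 + (2p - 1)bt
print(wealth.mean(), theory)
```
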
For a chain on the states {0, 1, 2, . . .}, the transition probability matrix has the form:
      0     1     2    . . .
0 ( p00   p01   p02   . . . )
1 ( p10   p11   p12   . . . )
2 ( p20   p21   p22   . . . )
  (  .     .     .          )
i ( pi0   pi1   pi2   . . . )
  (  .     .     .          )
• The system can enter a given state either from the previous state with prob-
ability pi−1,i or from the next state with probability pi+1,i or it can stay in the
same state with probability pii .
• The system can exit a given state either to the previous state with probability
pi,i−1 or to the next state with probability pi,i+1 .
Note: Here the superscript 2 does not denote the square of the entry p_ij; it denotes the number of time steps over which the transition probability is computed.
p^(2)_ij = P[X_{n+2} = j | X_n = i]
Using matrix multiplication, p^(2)_ij is the (i, j) entry of P · P: the dot product of the ith row of P with the jth column of P.
p^(m+n)_ij = P[X_{m+n} = j | X_0 = i]
p^(m+n)_ij = Σ_{k=1}^{∞} P[X_{m+n} = j, X_m = k | X_0 = i]
p^(m+n)_ij = Σ_{k=1}^{∞} P[X_{m+n} = j | X_m = k, X_0 = i] · P[X_m = k | X_0 = i]
Removing the past X_0 = i (by the Markov property),
p^(m+n)_ij = Σ_{k=1}^{∞} P[X_{m+n} = j | X_m = k] · P[X_m = k | X_0 = i]
p^(m+n)_ij = Σ_{k=1}^{∞} p^(n)_kj · p^(m)_ik
p^(m+n)_ij = Σ_{k=1}^{∞} p^(m)_ik · p^(n)_kj
This gives us the final transition probability matrix and the Chapman Kolmogorov
equation,
P m+n = P m .P n
• p^(m)_ik is the probability of going from X_0 = i to X_m = k, and p^(n)_kj is the probability of going from X_m = k to X_{m+n} = j.
• Since any state k might have occurred at time m, we sum over all k.
P m+n = P m .P n
where
P^(m+n) is the (m + n)-step transition matrix,
P^m is the m-step transition matrix, and
P^n is the n-step transition matrix.
P (2) = P.P = P 2
P (n) = P n−1 .P = P n−2 .P.P = . . . = P n
Theorem 19.5.1 The matrix of n step transition probabilities P (n) is given by the
nth power of the transition probability matrix P =⇒ P (n) = P n .
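The theorem can be sanity-checked numerically; the sketch below (with a hypothetical 3-state transition matrix) verifies that P^(m+n) = P^m P^n entry by entry:

```python
import numpy as np

# A hypothetical 3-state transition matrix (rows sum to 1).
P = np.array([[0.0, 0.3, 0.7],
              [0.1, 0.5, 0.4],
              [0.1, 0.2, 0.7]])

m, n = 2, 3
lhs = np.linalg.matrix_power(P, m + n)   # 5-step transition matrix
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)

# Chapman-Kolmogorov: the (i, j) entry of P^(m+n) equals sum_k p^m_ik p^n_kj.
assert np.allclose(lhs, rhs)
print(lhs[0, 2])
```
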
Chapter 20
The Gambler's Ruin Framework
• One possible outcome is that you gain $b with probability p and you lose $b
with probability (1 − p) = q.
• There exists a bias factor which is the ratio of the probability of losing to
the probability of winning and is denoted as:
α = q/p   (20.1)
• Concerning the bias factor we have three broad situations - first is that if
α > 1 or q > p then you are more likely to lose than win the bet. Second
is when α < 1 or q < p you are more likely to win rather than lose the bet.
Third is when α = 1 or p = q = 0.5 which means that the gamble is a fair
game.
• There are two boundary conditions that state that - You keep playing until
you go broke or you reach a maximum possible wealth of $bN .
• Given all these rules of the game, we are interested in answering the question
- What is the probability Si of reaching $bN before going broke given that our
initial wealth is $ib ?
235
Chapter 20. The Gambler’s ruin Framework 236
• We first set up the state space - this would obviously consist of positive and
negative increments of bet amount $b. However we choose to normalize the
state space by dividing each value in the set by $b to ultimately get:
S = {0, 1, · · · , i − 1, i, i + 1, · · · , N } (20.2)
• After applying the Markov property (conditioning on the outcome of the first bet), we get the main expression as:
S_i = p S_{i+1} + q S_{i−1}
• Recall that the bias factor is given by α = q/p. We can further rewrite the above equation in terms of the bias factor as:
S_{i+1} − S_i = α (S_i − S_{i−1})
• Indeed we can see a pattern emerging here. Using the boundary condition S_0 = 0, we can generalize the above expression as:
S_{i+1} − S_i = α^i (S_1 − S_0) = α^i S_1
• Now if we sum all the above expressions, the middle terms like - S2 , S3 , · · · , Si−1
will simply cancel out and we will be left with:
Si − S1 = S1 [α + α2 + · · · + αi−1 ] (20.16)
Si = S1 [1 + α + α2 + · · · + αi−1 ] (20.17)
• Note in the above equation, the term in brackets is a geometric series. Now assuming that α ≠ 1 we can write the general form of the above expression as:
S_i = S_1 [(1 − α^i)/(1 − α)]   (20.18)
• We will now apply the first boundary condition, S_N = 1, which states that when i = N the probability of winning is 1. Substituting i = N into the expression for S_i gives:
1 = S_N = S_1 (1 − α^N)/(1 − α)   (20.19)
S_1 = (1 − α)/(1 − α^N)   (20.20)
• Now that we have an analytic expression for S1 , we can substitute this into
the general formula of Si to obtain:
S_i = (1 − α^i)/(1 − α^N)   (20.21)
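The closed form S_i = (1 − α^i)/(1 − α^N) can be cross-checked against a direct solve of the recursion S_i = pS_{i+1} + qS_{i−1} with S_0 = 0 and S_N = 1; the parameter values below are made-up:

```python
import numpy as np

# Hypothetical parameters: win probability p, target N (in units of the bet b).
p, N = 0.45, 10
q = 1 - p
alpha = q / p

# Closed form from the derivation: S_i = (1 - alpha^i) / (1 - alpha^N).
S_closed = [(1 - alpha**i) / (1 - alpha**N) for i in range(N + 1)]

# Cross-check by solving the recursion S_i = p S_{i+1} + q S_{i-1}
# with boundary conditions S_0 = 0, S_N = 1 as a linear system.
A = np.zeros((N + 1, N + 1))
rhs = np.zeros(N + 1)
A[0, 0], A[N, N], rhs[N] = 1.0, 1.0, 1.0
for i in range(1, N):
    A[i, i], A[i, i + 1], A[i, i - 1] = 1.0, -p, -q
S_solved = np.linalg.solve(A, rhs)

assert np.allclose(S_closed, S_solved)
print(S_solved[5])
```
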
• We define fi as the probability that starting at state i (X0 = i), the DTMC
ever reenters state i. This can be written as:
f_i = P[ ⋃_{n=1}^{∞} {X_n = i} | X_0 = i ] = P[ ⋃_{n=m+1}^{∞} {X_n = i} | X_m = i ]   (20.27)
• We say that state i is recurrent if fi = 1. That is, the DTMC reenters state i
again and again almost surely, infinitely often.
• We say that state i is transient if fi < 1. This essentially means that there is
a positive probability (1 − fi ) of never coming back to transient state i.
• Now we will show that state R1 is also recurrent. At the first time step
(n = 0) we have:
P [X1 = R1 |X0 = R1 ] = 0.3 (20.29)
• Now at the second time step (n = 1) we can go from R1 to R2 and then from
R2 to R1 .
P[X_2 = R1, X_1 ≠ R1 | X_0 = R1] = (0.7)(0.6)   (20.30)
• Similarly we can calculate the probability that the random process would
take on the value of state R1 after 3 time steps (n = 3). So we go from R1 to
R2 , loop within R2 for one time step and return to R1 .
P[X_3 = R1, X_2 ≠ R1, X_1 ≠ R1 | X_0 = R1] = (0.7)(0.4)(0.6)   (20.31)
• Since we are checking for the process returning to state R1 in any number of
time steps, we sum across all possible paths with different values of n.
f_i = Σ_{n=0}^{∞} P[X_{n+1} = R1, X_n ≠ R1, · · · , X_1 ≠ R1 | X_0 = R1]   (20.33)
f_i = P[X_1 = R1 | X_0 = R1] + Σ_{n=1}^{∞} (0.7)(0.4)^{n−1}(0.6)   (20.34)
f_i = 0.3 + 0.7 [ Σ_{n=1}^{∞} (0.4)^{n−1}(0.6) ]   (20.35)
f_i = 0.3 + 0.7 = 1   (20.36)
• Finally we note that since, as per our definition, f_i = 1, we can say that R1 is in fact a recurrent state.
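The geometric sum that shows f_i = 1 for R1 can be checked with a few lines (probabilities taken from the example):

```python
# Partial sums of the return probability f_i for state R1:
# f_i = 0.3 + sum_{n>=1} (0.7)(0.4)^(n-1)(0.6), using the example's probabilities.
f = 0.3
for n in range(1, 60):
    f += 0.7 * 0.4 ** (n - 1) * 0.6
print(f)  # approaches 1, so R1 is recurrent
```
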
p^(l+r)_ki = Σ_j p^(l)_kj p^(r)_ji ≥ p^(l)_kj p^(r)_ji   (20.40)
where p^(l+r)_ki is a k → i transition, p^(l)_kj a k → j transition and p^(r)_ji a j → i transition.
From the above two equations we can gather that (i → k) and (k → i); hence it follows that i ⇐⇒ k.
• Lastly we note that the equivalence relation partitions the set of states
into disjoint classes.
The I in the above relation denotes an indicator random variable that can be
defined as follows:
I{X_n = i | X_0 = i} = 1, if X_n = i
                       0, if X_n ≠ i   (20.42)
We now go a bit further within this new framework and define the probability of revisiting state i exactly n times:
P[N_i = n | X_0 = i] = f_i^n (1 − f_i)
Note that the (1 − fi ) expression is crucial here since it denotes the probability of
escaping from state i. This is because we only want our random variable to visit
the state i exactly n times. So, at the (n + 1)th step we escape from this state.
The above expression indeed looks familiar - it is the definition of a geometric
random variable. We can then say that:
Ni ∼ geometric(1 − fi ) (20.44)
Since we have now effectively characterized a distribution for the random variable, we can compute the expected number of visits to state i given X_0 = i (counting the initial visit at time 0 together with the n revisits):
E[N_i] = Σ_{n=0}^{∞} (n + 1) f_i^n (1 − f_i) = 1/(1 − f_i)   (20.45)
We can now state the first part of this theorem: state i is transient iff (f_i < 1), which holds iff
E[N_i] = 1/(1 − f_i) < ∞ ⇐⇒ Σ_{n=1}^{∞} p^(n)_ii < ∞   (20.47)
The second part of this theorem states that state i is recurrent iff (f_i = 1), which holds iff
E[N_i] = ∞ ⇐⇒ Σ_{n=1}^{∞} p^(n)_ii = ∞   (20.48)
Chapter 21
Setting the DTMC Framework
• We say that for a probability transition matrix, the probabilities become independent of time n as n → ∞, since the matrix product converges. All rows are equal, which means that the probabilities are also independent of the initial condition. For example:
P^30 = ( 0.6 0.4 )
       ( 0.6 0.4 )   (21.2)
Chapter 21. Setting the DTMC Framework 244
• Positive and null recurrence are class properties and recurrent states in a
finite MC are positive recurrent.
• Ergodic states are those that are positive recurrent and aperiodic. An irre-
ducible MC with ergodic states is said to be an Ergodic MC.
21.2 Example 1
• We look at the various probabilities of return times to state 0 as given below:
P[return time = 2] = 1 · (1/2) = 1/2   (21.6)
P[return time = 3] = 1 · (1/2) · (1/3) = 1/(2 · 3)   (21.7)
P[return time = n] = 1/((n − 1)n)   (21.8)
• Now using the induction hypothesis, we can show the following general result to hold:
Σ_{m=2}^{k+1} 1/((m − 1)m) = k/(k + 1)   (21.9)
• Using the above result, we can show that state 0 is recurrent by proving that
the probability of returning to 0 is 1:
Σ_{m=2}^{n} P[return time = m] = Σ_{m=2}^{n} 1/((m − 1)m)   (21.10)
= 1/(1 · 2) + 1/(2 · 3) + 1/(3 · 4) + · · ·   (21.11)
= (n − 1)/n → 1 as n → ∞   (21.12)
• Now we look to characterize the expected return time to state 0:
E[return time to 0] = Σ_{n=2}^{∞} n P[return time = n] = Σ_{n=2}^{∞} n · 1/((n − 1)n)   (21.13)
= Σ_{n=2}^{∞} 1/(n − 1) → ∞   (21.14)
The above result holds due to the divergence of the harmonic series. Because the return probability is 1 but the expected return time is infinite, state 0 is null recurrent.
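Partial sums make the null-recurrence conclusion concrete: the return probability telescopes to 1 while the partial expected return time grows like a harmonic sum. A small sketch:

```python
import math

# Return-time distribution from the example: P[return time = n] = 1/((n-1)n), n >= 2.
N = 10_000
prob_mass = sum(1.0 / ((n - 1) * n) for n in range(2, N + 1))
expected_partial = sum(n * (1.0 / ((n - 1) * n)) for n in range(2, N + 1))

print(prob_mass)          # telescopes to (N-1)/N -> 1: state 0 is recurrent
print(expected_partial)   # partial harmonic sum ~ ln(N): diverges, so null recurrent
```
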
Σ_{j=0}^{∞} π_j = 1   (21.17)
• The limiting probabilities given above are independent of the initial state X_0 = i, and the above algebraic equations can be solved to find the π_j.
P[X_{n+1} = j] = Σ_{i=1}^{∞} P[X_{n+1} = j | X_n = i] P[X_n = i]   (21.19)
= Σ_{i=1}^{∞} p_ij P[X_n = i]   (21.20)
21.4 Example 2
• We now look at an example, characterizing stationary distribution properties
using a geometric random variable X ∼ geom(p).
• Now consider a MC with two states (with loops and equal probabilities),
using which we denote the following return probabilities:
P[T_1 = 1] = 1/2 = q = p =⇒ 1/p = 2   (21.28)
P[T_1 = 2] = (1/2)^2   (21.29)
P[T_1 = k] = (1/2)^k   (21.30)
• We now compute the expected return time to state 1 as follows:
E[T_1] = Σ_{k=1}^{∞} k P[T_1 = k] = Σ_{k=1}^{∞} k (1/2)^k = Σ_{k=1}^{∞} k q^k = 2   (21.31)
• The fact that the above sum equals 2 can be proved using the geometric series result Σ_{k=1}^{∞} k q^{k−1} = 1/p^2:
(1/q) Σ_{k=1}^{∞} k q^k = 1/p^2 =⇒ Σ_{k=1}^{∞} k q^k = q/p^2 = (1/2)/(1/4) = 2   (21.32)
• Since the expected time to re-enter state 1 is finite, π_1 = π_2 = 1/2 is the stationary distribution. Also, state 1 is positive recurrent due to its finite expected return time.
We can clearly see above that, as before, in the limiting case all the rows have the same probabilities, which in turn implies that the probabilities are independent of the initial state X_0 = i.
• So we can write the probability distribution vector after the nth step as follows:
lim_{n→∞} P^(n) = lim_{n→∞} [P^T]^n P^(0) = [π_1, π_2, · · · , π_J]^T   (21.34)
21.6 Example 3
• We consider the example of the following aperiodic, irreducible MC, charac-
terized by the following transition probability matrix:
P = ( 0    0.3  0.7 )
    ( 0.1  0.5  0.4 )
    ( 0.1  0.2  0.7 )   (21.39)
Σ_{i=1}^{3} π_i = 1   (21.42)
• The above two equations can be put into a system of equations in matrix
format as follows:
( 0    0.1  0.1 )  ( π_1 )   ( π_1 )
( 0.3  0.5  0.2 )  ( π_2 ) = ( π_2 )
( 0.7  0.4  0.7 )  ( π_3 )   ( π_3 )
( 1    1    1   )            (  1  )   (21.43)
where the upper 3 × 3 block is P^T.
=⇒ π_j^(1) = Σ_i p_ij π_i^(0)   (21.45)
• From this we say that if the probability distribution is unchanged for all n
then:
P(X_n = i) = π_i^(n) = π_i   (21.46)
• We say that the DTMC is stationary in a probabilistic sense: the states might change, but the probabilities don't.
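For the transition matrix of Example 3, the stationary distribution can be found by solving π = πP together with the normalization Σπ_i = 1, and cross-checked against the rows of P^n; this is a sketch, not part of the notes:

```python
import numpy as np

# Transition matrix from Example 3 (rows sum to 1).
P = np.array([[0.0, 0.3, 0.7],
              [0.1, 0.5, 0.4],
              [0.1, 0.2, 0.7]])

# Solve pi = pi P with sum(pi) = 1 by replacing one balance
# equation with the normalization constraint.
A = np.vstack([(P.T - np.eye(3))[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)

# Cross-check: for an ergodic chain the rows of P^n converge to pi.
Pn = np.linalg.matrix_power(P, 50)
assert np.allclose(Pn, np.tile(pi, (3, 1)), atol=1e-6)
print(pi)
```
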
21.7 Example 4
• We let T_i^(n) denote the fraction of time spent in the ith state up to time n. This can be expressed with the help of an indicator random variable as:
T_i^(n) = (1/n) Σ_{m=1}^{n} I{X_m = i}   (21.47)
• Now taking the limiting case under the condition that the probability of being in state i is the same regardless of n, we get the following result:
E[T_i^(n)] = (1/n) Σ_{m=1}^{n} P[X_m = i] → (1/n) Σ_{m=1}^{n} π_i as n → ∞   (21.50)
= (1/n)(π_i + π_i + · · · + π_i) = π_i   (21.51)
P[X_m = i] → π_i as n → ∞ =⇒ E[T_i^(n)] → π_i as n → ∞   (21.52)
• We note that for an Ergodic DTMC, as n tends to infinity, the time average is equal to the ensemble average. This can be written as follows:
lim_{n→∞} t_i^(n) = lim_{n→∞} (1/n) Σ_{m=1}^{n} I{X_m = i} = E[T_i^(n)] = π_i   (21.53)
where the left-hand side is the time average and π_i is the ensemble average.
• We note that for an Ergodic DTMC, the same condition is true without the
expectation as well:
lim_{n→∞} T_i^(n) = lim_{n→∞} (1/n) Σ_{m=1}^{n} I{X_m = i} = π_i   (21.54)
Chapter 22
Introducing Inhomogeneous DTMCs
Chapter 22. Introducing Inhomogeneous DTMCs 251
• In the above equation we note that the general 2-dimensional tuple random variable, given by X'_n = (X_n, n), is specified in terms of:
• Now the projection condition that recovers the original time-inhomogeneous random variable as the first-dimension projection of the time-homogeneous random variable is given as:
X_n = Pr_1(X'_n)   (22.5)
• The crucial point to grasp is that, given P[X_0 = i] for i ∈ E, we can express the following transition probabilities without the time-index dependence:
P'_{(i,k),(j,l)} = P'[X'_{k+1} = (j, l) | X'_k = (i, k)] = δ_{l,(k+1)} P_{k;(ij)}   (22.7)
• We note that because of the above relation - it suffices to discuss the time
inhomogeneous MC in terms of an associated time homogeneous MC.
π_0 = (0.25, 0.75), with components corresponding to states (a, b)   (22.8)
π'_0 = (0.25, 0.75, 0, 0, 0, 0), with components corresponding to states ((a,0), (b,0), (a,1), (b,1), (a,2), (b,2))   (22.9)
• At this point it is instructive to recall the general equation relating the limit-
ing probabilities to the transition matrix as follows:
π^(n+1) = π^(n) P   (22.10)
where π^(n+1) and π^(n) are (1 × m) row vectors and P is (m × m).
• We must also note the various phases of transition across states for this pro-
cess - the limiting distributions and the associated transition probability ma-
trices.
X_0 (π_0) --P_0--> X_1 (π_1) --P_1--> X_2 (π_2) --P_2--> X_3 (π_3)   (22.11)
• Now with different transition probability matrices at different time steps, we
can then write the comprehensive partitioned transtition probability matrix
of the associated time homogeneous MC with 2 dimensional tuple states.
With block rows indexed by the current tuple states and block columns by the next ones, the partitioned matrix is:
              (a,1),(b,1)  (a,2),(b,2)  (a,3),(b,3)
(a,0),(b,0) (     P_0          0            0     )
(a,1),(b,1) (      0          P_1           0     )
(a,2),(b,2) (      0           0           P_2    )
• We can then write the following equations that relate the limiting distri-
butions across different time steps to the individual probability transition
matrices:
π_1 = π_0 P_0   (22.13)
π_2 = π_1 P_1   (22.14)
π_3 = π_2 P_2   (22.15)
• Further, the time-homogeneous limiting distribution vectors can be expressed as follows:
π(0) = [π_0 0 0]   (22.16)
π(1) = [0 π_1 0]   (22.17)
π(2) = [0 0 π_2]   (22.18)
• Again, the final vector output in the previous equation is a result of the correction we applied after equation (22.19). Finally, we do the following computation to get the third distribution:
π(3) = [0 0 π_2] · ( P_0 0 0 ; 0 P_1 0 ; 0 0 P_2 ) = [0 0 π_3]   (22.21)
Problem. (Gambler's Ruin) The player starts the game with an initial wealth of amount n. A roulette wheel is spun, and on each bet the player either wins $1 (event W) or loses an amount of $1 (event L). Let p be the probability of winning and let q = 1 − p be the probability of losing the bet. The player keeps playing until an amount of m is won or until they go broke.
A random walk system is used to model the behavior of the player’s gains and
losses. Through each time step, the player’s outcomes are recorded as a string of
W’s and L’s. For instance, {WLLLLLWLL...} is a sample string. For every W, the
random walker moves up a step from the initial position of n and for every L, they
move a step downward, relative to the previous position.
Note that the probability of winning the ith bet is mutually independent of that of winning any past (or future) bets. Thus the random walk is a memoryless, independent system. An unbiased random walk with mutually independent moves is a martingale; not all random walks are martingales. If p ≠ 1/2 the random walk is said to be biased, and if p = 1/2 it is unbiased. A random walk may or may not have boundaries. In our case, the random walk has boundaries at 0 and n + m.
Define W ∗ as the event that the RW hits the upper bound amount T = n + m,
before it hits 0, the bottom boundary. T here denotes the top boundary. Define D
as the event that the RW starts with amount n.
Chapter 23. Random Walks and Gambler’s Ruin 255
23.1 Problem
Our objective is to compute the probability that the player reaches the upper bound T before going broke, denoted by X_n = P[W* | D = n]. The standard tree method of computing the probability is complicated, as the size of the sample space |S| is infinite. It is theoretically possible to take an infinite sequence of alternating W and L steps such that the player never wins the amount m in finite time. One such sample path is given by {WLWLWL...} and so on.
If the player starts off broke, then they would never be able to enter the game; thus the probability that they would gain wealth T is 0. Likewise, a player with initial wealth T has, by construction of the game, no reason to enter it. But for a player entering the game with initial wealth in the range (0, T), it can be proved that the probability of reaching a level of wealth T before going broke satisfies X_n = pX_{n+1} + (1 − p)X_{n−1}.
Let E denote the event that the first bet is won. By the law of total probability,
X_n = P[W* | D = n]
    = P[W* ∩ E | D = n] + P[W* ∩ Ē | D = n]   (23.1)
X_n = P[E | D = n] · P[W* | E ∧ (D = n)] + P[Ē | D = n] · P[W* | Ē ∧ (D = n)]
X_n = p P[W* | D = n + 1] + (1 − p) P[W* | D = n − 1]
where the last step uses mutual independence: E ∧ (D = n) is equivalent to restarting with D = n + 1, and Ē ∧ (D = n) to restarting with D = n − 1.
Solution. The above equation is a linear homogeneous recurrence relation and can be solved in the standard way of solving difference equations. We first assume a trial solution X_n = r^n and obtain a general solution; by then applying the boundary conditions, the arbitrary constants take fixed values and we obtain the particular solution. Rearranging the terms in the recurrence relation,
pX_{n+1} − X_n + (1 − p)X_{n−1} = 0
pr^{n+1} − r^n + (1 − p)r^{n−1} = 0
pr − 1 + (1 − p)r^{−1} = 0
pr^2 − r + (1 − p) = 0   (23.4)
The above equation is called the characteristic equation; its solutions are plugged in to obtain the general solution of the linear homogeneous difference equation.
r = [1 ± √(1 − 4p(1 − p))] / (2p) = [1 ± √(1 − 4p + 4p^2)] / (2p)
r = [1 ± √((1 − 2p)^2)] / (2p) = [1 ± (1 − 2p)] / (2p)
r = (2 − 2p)/(2p) or (2p)/(2p)
r = (1 − p)/p or 1   (23.5)
p
X_n = A r_1^n + B r_2^n
X_n = A((1 − p)/p)^n + B(1)^n
X_n = A((1 − p)/p)^n + B   (23.6)
Applying the boundary condition X_0 = 0 gives A + B = 0, i.e. B = −A. The condition X_T = 1 then gives
X_T = A((1 − p)/p)^T − A = 1
A = [((1 − p)/p)^T − 1]^{−1}  and  B = −[((1 − p)/p)^T − 1]^{−1}   (23.7)
In the case of a roulette, p < 1/2, so (1 − p)/p > 1, and T = n + m. We have,
Theorem. If p < 1/2 (more likely to lose a bet), the probability of winning an amount m before going broke is
X_n = [((1 − p)/p)^n − 1] / [((1 − p)/p)^{n+m} − 1]
For the fair game p = 1/2, the characteristic equation has the repeated root r = 1, so the general solution is X_n = An + B. Applying the boundary conditions:
0 = X_0 = B =⇒ B = 0   (23.12)
1 = X_T = AT + B =⇒ A = 1/T   (23.13)
Thus, X_n = n/T = n/(n + m)   (23.14)
Theorem. If p = 1/2 (fair game), the probability of winning an amount m before going broke is X_n = n/(n + m).
After x bets (or steps of the RW), we are drifted by (1 − 2p)x. This is the expected loss over x consecutive bets, and it grows linearly in x, whereas the typical swings of the walk are only of order √x:
E[loss on x bets] = (1 − 2p)x ≫ O(√x)   (23.17)
Since the swings are only of order √x while the drift grows like x, the swings cannot save the walker: the random walk drifts downward and crashes almost surely. The probability of winning m before going broke in an unfair game was computed in the previous section.
One way in which the random walker does not go broke is if they play forever. The probability of playing forever is zero; equivalently, the walk hits a boundary with probability 1. There are sample points WLWLWLW... going on forever, but by measure theory, when we add up the probabilities of these sample points we get zero:
P [W LW . . . ] + P [LW . . . ] + · · · = 0 (23.19)
This motivates the question: How long does it take for one to go broke (or win)?
and is answered in the subsequent sections.
Define S to be the number of steps until the random walker hits a boundary, and let E_n = E[S | D = n]. We claim that

E_n = 0, if n = 0 (already broke)
E_n = 0, if n = T (already wealthy)
E_n = 1 + pE_{n+1} + (1 − p)E_{n−1}, if 0 < n < T   (23.20)
Let us focus on the last case, and take up the recurrence relation,
E_n = 1 + pE_{n+1} + (1 − p)E_{n−1}   (23.21)
Here the 1 counts the first bet (and makes the recurrence inhomogeneous); pE_{n+1} covers winning the first bet and starting over from n + 1; (1 − p)E_{n−1} covers losing the first bet and starting over from n − 1.
The particular solution can be derived by applying the boundary conditions and solving for A and B. Note that the general solution is the sum of the homogeneous solution and the particular solution. Trying a constant fails (the constant cancels, leaving 0 = 1), so we try a linear form:
E_n = a (constant)   (23.24)
E_n = an + b   (23.25)
=⇒ a = 1/(1 − 2p)   (23.26)
Set b = 0   (23.27)
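The expected-duration recurrence E_n = 1 + pE_{n+1} + (1 − p)E_{n−1} with E_0 = E_T = 0 can also be solved directly as a linear system (the p and T values below are made-up):

```python
import numpy as np

# Expected number of steps E_n until the walk hits 0 or T, from the recurrence
# E_n = 1 + p E_{n+1} + (1 - p) E_{n-1}, with E_0 = E_T = 0.
p, T = 0.45, 10
q = 1 - p

A = np.zeros((T + 1, T + 1))
b = np.zeros(T + 1)
A[0, 0] = A[T, T] = 1.0           # boundary conditions E_0 = E_T = 0
for i in range(1, T):
    A[i, i], A[i, i + 1], A[i, i - 1] = 1.0, -p, -q
    b[i] = 1.0
E = np.linalg.solve(A, b)
print(E[5])  # expected duration starting from the middle
```
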
Chapter 24
Memoryless Property
We also note that the expected value and variance of the exponential distribution are
E[X] = 1/λ;  E[X^2] = 2/λ^2;  Var(X) = 1/λ^2   (24.4)
Consider x to represent a time period. Starting from time zero on the timeline, there is a point a on the timeline, and after a further duration x has passed the process reaches the point x + a. This is shown in Figure 24.1.
We can consider the event that the random variable X (modelling time in this
case) has a value X > x + a, given that the event X > a has occurred. The
memoryless property is stated as,
P[X > (x + a) | (X > a)] = P[X > x]   (24.5)
Chapter 24. Memoryless Property 261
[Figure 24.1: timeline marking the points 0, a and a + x]
Given that the event X > a has already occurred, the conditional probability of X > (x + a) is the same as P[X > x]; that is, the process behaves as if the event X > a had not even occurred, just as when it started at time zero.
Note that the right-hand side, P[X > x], is the same as P[X > x | X > 0], and the event X > 0 is the entire sample space Ω; naturally P(E) = P(E | Ω), and similarly E[X] = E[X | Ω].
Figure 24.2: Time till car engine fails modelled by random variable X
The required probability is: given the event E = (X > t), what is the conditional probability of the event X ≤ (t + s)? This is given by the probability of the intersection of the two events divided by the probability of E. From Figure 24.2, the probability of the event P[(X ≤ t + s) ∩ (X > t)] is written as
P[(X ≤ t + s) ∩ (X > t)] = P[X > t] − P[X > t + s] = e^{−λt} − e^{−λ(t+s)}
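The memoryless property (24.5) can be checked empirically with exponential samples; the rate and interval values below are made-up:

```python
import random
import math

random.seed(0)

# Empirical check of the memoryless property for X ~ exp(lambda):
# P[X > x + a | X > a] should equal P[X > x] = e^{-lambda * x}.
lam, a, x, n = 0.5, 1.0, 2.0, 500_000
samples = [random.expovariate(lam) for _ in range(n)]

exceed_a = [s for s in samples if s > a]
cond = sum(s > x + a for s in exceed_a) / len(exceed_a)
print(cond, math.exp(-lam * x))  # both close to e^{-1}
```
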
References
[1] Stochastic Processes Course notes of Prof. Rakesh Nigam.
[2] Ross, Sheldon. A First Course in Probability. Pearson, 2014.
[3] Karlin, Samuel, and Howard E. Taylor. A Second Course in Stochastic Processes. Elsevier, 1981.
[4] Karlin, S., and H. M. Taylor. A First Course in Stochastic Processes. Academic Press, New York, 1966.
Figure 24.3: Memoryless Property of the exponential distribution arises from the
shifting nature of its PDF.
Chapter 25
The Poisson Process
• The probability density at a point x is given by the following (note that ∆ denotes the length of the interval):
f_X(x) = lim_{∆→0} P[x < X < x + ∆] / ∆   (25.1)
• The expression in the numerator of the RHS of the above equation can be written as a difference of CDF values:
P[x < X < x + ∆] = F_X(x + ∆) − F_X(x)
• We can say that for a small δ value, the following relation would hold true:
f_X(x) δ ≈ P[x < X < x + δ]
Chapter 25. The Poisson Process 265
• In each individual discrete time slot, the probability of a success (arrival) occurring is taken as p.
• Here, the interarrival times would follow a geometric PMF. Recall that the interarrival time characterizes the time until the first arrival.
• Now using the above result, we can alternatively define the arrival rate as the expected number of arrivals per unit time:
λ = p/δ
We divide a long interval of length τ into small intervals of length δ. The length of the big interval is the sum of the lengths of the small intervals; if we have n such intervals, then:
n = τ/δ   (25.17)
• Further we note that each individual δ time interval can have either 1 or 0 arrivals. So in each trial, the probability of having 1 arrival is just like the probability of success in a discrete time slot and is given as p = λδ.
• With this framework, we can think of this as a Bernoulli process with parameters n = τ/δ and p = λδ. We now write the probability of having k arrivals in n slots:
P[k arrivals in n slots] = (n choose k) p^k (1 − p)^{n−k}   (25.18)
k
• Now, since δ = τ/n, we can rewrite the above equation as follows:
P[k arrivals in n slots] = (n choose k) (λτ/n)^k (1 − λτ/n)^{n−k}   (25.19)
• Now if we follow the steps laid out in section 0.2 on the Poisson as a binomial approximation, we can let δ → 0 and n → ∞ to obtain the following Poisson distribution:
P(k, τ) = (λτ)^k e^{−λτ} / k!   (25.20)
• More generally, we can model the number of arrivals N in time t as a random variable N_t ∼ Pois(λt), whose expected value and variance are equal and given by λt.
• We can relate the Bernoulli process expected value and variance to those of the Poisson process as follows (as n → ∞, δ → 0 and p → 0):
E = np = λτ  and  Var = np(1 − p) → np = λτ
• The way we think about these arrivals is this - There are k − 1 arrivals in the
interval of time [0, t] and then in the last small interval [t, t + δ] we have one
arrival, totalling to k arrivals. We also have the last interval of time such that
δ → 0.
• This random variable can be denoted as Y_k, and its probability can be given by using the result from section 0.1 at the start of this note:
f_{Y_k}(t) δ = P[t ≤ Y_k ≤ t + δ]   (25.23)
= P[(k − 1) arrivals in [0, t]] P[1 arrival in [t, t + δ]]   (25.24)
• Note that we are multiplying the probabilities in the above equation since
the time intervals are disjoint and hence, independent. We can then rewrite
the above equation as:
fYk (t)δ = P [(k − 1) arrivals in [0, t]](λδ) (25.25)
• Now note that Yk is nothing but the random variable encoding the time until
k th arrival. Furthermore, the left term of the RHS in the above equation
could easily be modeled as a Poisson distribution with parameter (λt) and
we also know that (λδ) is the probability of 1 arrival in the last time interval.
With this we can write:
f_{Y_k}(t) δ = [(λt)^{k−1}/(k − 1)! · e^{−λt}] (λδ)   (25.26)
• Cancelling δ on both sides, the resulting Erlang distribution is represented as follows (note that the PDF is parameterized by k and λ):
f_{Y_k}(t) = λ^k t^{k−1} e^{−λt} / (k − 1)!   (25.27)
• We additionally note that the time until first arrival or the interarrival time
in the Poisson process follows an exponential distribution - Ti ∼ exp(λ).
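Since Y_k is a sum of k i.i.d. exp(λ) interarrival times, its Erlang mean k/λ can be checked by simulation (parameter values below are made-up):

```python
import random

random.seed(42)

# The time Y_k of the k-th Poisson arrival is a sum of k i.i.d. exp(lambda)
# interarrival times, i.e. Erlang(k, lambda) with mean k / lambda.
lam, k, n = 2.0, 3, 200_000
yk = [sum(random.expovariate(lam) for _ in range(k)) for _ in range(n)]

emp_mean = sum(yk) / n
print(emp_mean)  # theory: k / lambda = 1.5
```
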
• If we consider the time intervals (0, t) and (s, s + t), then stationary increments implies the following:
P[N(s + t) − N(s) = k] = P[N(t) = k]   (25.32)
• Property 2 - The number of events in time interval (0, t) has a Poisson dis-
tribution with mean (λt). We say that Nt ∼ pois(λt) such that:
e−λt (λt)n
P [N (t) = n] = (25.33)
n!
25.6.1 An Example
• We let {N (t), t ∈ (0, ∞)} be a Poisson process with rate λ and we let X1
denote its first arrival time. We will now attempt to show that given the fact
that N (t) = 1 - then X1 is uniformly distributed in the time interval (0, t) -
instead of being distributed exponentially, as could be mistakenly perceived.
• In short, we need to show that [X_1 | N(t) = 1] ∼ unif(0, t), i.e. that the following property holds:
P[X_1 ≤ x | N(t) = 1] = x/t,  0 ≤ x ≤ t   (25.36)
• We can then write the following expression for the above conditional probability:
P[X_1 ≤ x | N(t) = 1] = P[X_1 ≤ x, N(t) = 1] / P[N(t) = 1]   (25.37)
• Now we recall the fact that N_t ∼ Pois(λt), so:
P[N(t) = k] = e^{−λt}(λt)^k / k!   (25.38)
P[N(t) = 1] = λt e^{−λt}   (25.39)
• Using the above facts and expressions, we note that the event {X_1 ≤ x, N(t) = 1} is the event of one arrival in the interval (0, x) and no arrival in the interval (x, t). This probability is given as:
P[X_1 ≤ x, N(t) = 1] = [e^{−λx}(λx)^1 / 1!] · [e^{−λ(t−x)}]   (25.40)
= λx e^{−λx} e^{−λt} e^{λx} = (λx) e^{−λt}   (25.41)
P[X_1 ≤ x | N(t) = 1] = λx e^{−λt} / (λt e^{−λt}) = x/t   (25.42)
• The above expression essentially defines the probability as defined by the
uniform distribution. Hence we have proved our initial argument.
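This conditional-uniformity result can be checked by simulating the Poisson process and keeping only the runs with exactly one arrival in (0, t); the parameter values below are made-up:

```python
import random

random.seed(7)

# Simulate Poisson arrivals on (0, t); among runs with exactly one arrival,
# the arrival time should be uniform on (0, t): P[X1 <= x | N(t) = 1] = x / t.
lam, t, x, runs = 1.0, 2.0, 0.5, 400_000
hits = total = 0
for _ in range(runs):
    arrivals = []
    s = random.expovariate(lam)     # exponential interarrival times
    while s < t:
        arrivals.append(s)
        s += random.expovariate(lam)
    if len(arrivals) == 1:
        total += 1
        hits += arrivals[0] <= x
print(hits / total)  # theory: x / t = 0.25
```
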
25.7 PASTA
• PASTA stands for - Poisson arrivals see time averages. We build this con-
cept by considering a queueing system - which can be thought of as a restau-
rant in which customers arrive at a certain rate. The number of customers in
the restaurant at a particular time is said to be the state of the system.
• We say that customers arrive at a rate of λ into the restaurant and we say
that the state of the system is characterized by Ej . The system spends its
time in different states across various time intervals.
πj = P (Ej ) (25.44)
πj∗ = Prob that system is in state Ej just before a randomly chosen arrival
(25.45)
25.7.1 An Example
• Let us look at ’My PC (computer)’ - which is a system of one customer and
one server.
• The above system can only be in two possible states. The PC can be free (E0 )
or the PC can be occupied (E1 ).
• Now note that since ’I am’ the only customer - the PC will always be free
just when ’I need it’. Hence the following customer perspective probability
would be:
π0∗ = Prob that PC is in state E0 (PC is free) just when I need it = 1 (25.46)
• From the observer’s point of view, we have the following probability defini-
tions:
π0 = P (E0 ) = Proportion of time PC is free < 1 (25.48)
π1 = P (E1 ) = Proportion of time PC is occupied > 0 (25.49)
• NOTE - In this example we must note that the arrival process is not Pois-
son. This means that when an arrival has occurred (you have started to work
on your PC) - then for a while it is quite unlikely that another arrival process
would occur. That is, you have essentially stopped the previous session and
started a new one - hence the arrivals at different times are not independent.
• However we also note that for a Poisson arrival, the PASTA property is satis-
fied - that is that πj = πj∗ for state Ej .
• The above point is true because the sequence of arrivals have exponen-
tially distributed interarrival times - this comes from the memoryless
property of the exponential distribution.
• With this, we say that the remaining time to the next arrival has the same
exponential distribution irrespective of the time that has already elapsed
since the previous arrival.
• Since the stochastic characterization of the arrival process before the instant of consideration is the same irrespective of how the instant has been chosen, the state distributions of the system (induced by past arrivals) at the instant of consideration must then be the same in both cases.
• The steady state probability of the queue having k customers is given by the
following expression:
Pk = lim P [N (t) = k] (25.52)
t→∞
• Then, we define the probability that the process monitor (outside observer) sees k customers in the queue just before an arrival as:
a_k = lim_{t→∞} P[N(t) = k | arrival in (t, t + δ)]
= lim_{t→∞} P[arrival in (t, t + δ) | N(t) = k] · P[N(t) = k] / P[arrival in (t, t + δ)]
• Now we note that since Poisson arrivals are essentially independent of the queue size, we can write the following equality:
P[arrival in (t, t + δ) | N(t) = k] = P[arrival in (t, t + δ)]
• The right term of the numerator then cancels with the denominator, and we are left with:
a_k = lim_{t→∞} P[N(t) = k] = P_k
N (t) = number of events that occur in the time interval [0, t] (25.60)
• Sixth condition: Combining the above two conditions, we can get the probability of 0 events happening as:
N (t) = 0 (25.67)
N (t + h) = 0 (25.68)
{N (t + h) − N (t)} = 0 (25.69)
P0 (t + h) = P [N (t + h) = 0] (25.75)
P0 (t + h) = P0 (t)[1 − λh − o(h)] (25.76)
• Now we rearrange both sides of the above equation and divide throughout by h to obtain the following:
[P_0(t + h) − P_0(t)]/h = −λP_0(t) − P_0(t) · o(h)/h   (25.77)
lim_{h→0} [P_0(t + h) − P_0(t)]/h = −λP_0(t) − lim_{h→0} P_0(t) · o(h)/h, where the last limit is 0   (25.78)
=⇒ dP_0(t)/dt = −λP_0(t)   (25.79)
• The above equation is a differential equation which, with the initial condition P_0(0) = 1, has the solution:
P_0(t) = e^{−λt}
• The above result can then be extended to show the general case - probability
of getting k events during a period of length t follows a Poisson distribution
with parameter (λt) as follows:
P[N(t) = k] = \frac{e^{-\lambda t} (\lambda t)^k}{k!} \qquad (25.83)
expected number of events in a period =⇒ E[N (t)] = λt (25.84)
• We had stated earlier that - events occur at a constant rate λ. This can be
further expanded as:
• Further we state that the expected time between arrivals is given as follows:
E[\text{time between arrivals}] = \frac{1}{\lambda} \qquad (25.86)
• Now we want to find the probability that time T2 is actually greater than s
given that the first event occurs at time t. We can write this as follows:
• The above equation is nothing but the complementary CDF of the exponential distribution - hence we now know that T_2 \sim \exp(\lambda).
Chapter 25. The Poisson Process 277
25.11 PASTA
• PASTA (Poisson Arrivals See Time Averages) property refers to the expected
state of a queueing system as seen by an arrival from a Poisson Process. An
arrival from a Poisson Process observes the system as if it was arriving at a
random moment in time. Therefore the expected value of any parameter
of the queue at the instant of a Poisson arrival is the long run average value
of that parameter.
• Consider that till time a1 from time 0 - we have 0 arrivals. Then in the
interval between a1 and (a1 + b) - that is in the interval of size b - there is one
arrival. Finally, there are also 0 arrivals in the interval between (a1 + b) and
(a1 + b + a2 ) = t - that is in the interval of size a2 .
N (a1 ) = 0 (25.91)
N (b) = 1 (25.92)
N (a2 ) = 0 (25.93)
N (t) = 1 (25.94)
• We will now show that the probability that the Poisson Process produces 1
arrival in the interval of length b is the same as the probability of a randomly
chosen point being in the interval b.
P[\text{one arrival in } [0, t] \text{ occurs in } b] = \frac{P[0 \text{ in } a_1 \text{ and } 1 \text{ in } b \text{ and } 0 \text{ in } a_2]}{P[1 \text{ arrival in } t]} \qquad (25.95)

P[N(b) = 1 \mid N(t) = 1] = \frac{P[N(a_1) = 0, N(b) = 1, N(a_2) = 0]}{P[N(t) = 1]} \qquad (25.96)

\text{ind. increments} \implies \frac{P[N(a_1) = 0]\, P[N(b) = 1]\, P[N(a_2) = 0]}{P[N(t) = 1]} \qquad (25.97)

\text{stationary increments} \implies \frac{(e^{-\lambda a_1})[\lambda b\, e^{-\lambda b}][e^{-\lambda a_2}]}{[\lambda t\, e^{-\lambda t}]} = \frac{b}{t} \qquad (25.98)
• From the above result we can state that - arriving from a Poisson Process
is statistically identical to arriving at a random moment in time.
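The chain of equalities (25.95)-(25.98) can be checked numerically: the conditional probability collapses to b/t for every rate λ. The interval split a1, b, a2 below is an illustrative choice:

```python
import math

def p_one_arrival_in_b(lam, a1, b, a2):
    """P[N(b)=1 | N(t)=1] computed from eqs. 25.95-25.98."""
    t = a1 + b + a2
    num = math.exp(-lam * a1) * (lam * b * math.exp(-lam * b)) * math.exp(-lam * a2)
    den = lam * t * math.exp(-lam * t)
    return num / den

a1, b, a2 = 1.0, 0.5, 2.5      # hypothetical split of [0, t] with t = 4
for lam in (0.3, 1.0, 7.0):     # the answer should not depend on lam
    print(p_one_arrival_in_b(lam, a1, b, a2))   # 0.125 = b / t each time
```

The rate λ cancels exactly, which is the PASTA intuition: the single arrival is uniformly located in [0, t].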
Chapter 26
Introduction to Queues
• We might also say that πj is the steady state probability of being in state j or
the long term probability of being in state j. The global balance equation
for the state j then can be given as follows:
\pi_j = \sum_{k=1}^{m} \pi_k p_{kj} \qquad (26.1)
• If we run a long trajectory of a Markov Chain, which visits many states with
different frequencies over time - then the long run frequency of being in
state j is given by πj .
• The diagram below tells us how the probabilities of transition from all pos-
sible states sum up to get the probability of being in state j. In the figure
below we note that π1 is the fraction of time we are in state 1. Similarly,
π1 p1j denotes the transition from state 1 to state j. This is shown in the figure
from all m states. Note that p1j represents the fraction of transitions from
state 1 to state j whenever we find ourselves in state 1.
• Here we have j = 1, 2 as the possible states and π1 and π2 are the steady
state probabilities of being in state 1 and 2 respectively. We can then write
down the steady state balance equation as follows for a general state j:
\pi_j = \sum_{k=1}^{2} \pi_k p_{kj} \qquad (26.4)
• From the above equation, we can obtain the steady state probabilities for
state 1 and 2 as follows:
π1 = π1 p11 + π2 p21 = 0.5π1 + 0.2π2 (26.5)
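A minimal numerical sketch of this balance computation: the transition matrix below fills in the remaining entries consistently with p11 = 0.5 and p21 = 0.2 from equation 26.5 (rows must sum to 1), and solves the balance equations together with the normalization:

```python
import numpy as np

# Transition matrix consistent with eq. 26.5 (p11 = 0.5, p21 = 0.2);
# the remaining entries are implied by row sums of 1.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])

# Solve pi = pi P together with pi_1 + pi_2 = 1 as one linear system.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)        # ~[0.2857, 0.7143] = [2/7, 5/7]
print(pi @ P)    # equals pi again: the global balance equations hold
```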
• Further we assume that the arrival rate is constant at each time slot. This
means that we have pi = p and qi = q for all possible values of i. Additionally,
we define the load factor of the system as follows:
\rho = \frac{p}{q} \qquad (26.8)
• With this framework in place, we can write the detailed balance equation for some state i - by balancing the frequency of upward transitions out of state i against the frequency of downward transitions back into state i. These two rates should be in equilibrium. Note that \pi_i corresponds to the fraction of time we are in state i and p_i corresponds to the frequency of transition from state i to state (i + 1). The detailed balance is given by:
πi pi = πi+1 qi+1 (26.9)
• Note that with the above detailed balance equation - we essentially have a
recursion equation - since πi+1 can be computed in terms of πi .
• Further, we have a total of (m + 1) unknowns in the form of (\pi_0, \cdots, \pi_m), and \pi_0 is not known either. We then have m equations and a normalization condition of the form:

\pi_i p_i = \pi_{i+1} q_{i+1} \qquad (26.10)

\sum_{i=0}^{m} \pi_i = 1 \qquad (26.11)
• From the detailed balance equation, we can then write the limiting proba-
bilities of states in terms of the load factor as follows:
\pi_i p = \pi_{i+1} q \qquad (26.15)

\implies \pi_{i+1} = \frac{p}{q} \pi_i = \pi_i \rho \qquad (26.16)

\implies \pi_i = \pi_0 \rho^i, \text{ by continued recursion} \qquad (26.17)

\implies \sum_{i=0}^{m} \pi_0 \rho^i = 1 \qquad (26.19)

\implies \pi_0 = \frac{1}{\sum_{i=0}^{m} \rho^i} \qquad (26.20)
• Now we note that if ρ = 1 then πi = π0 for all i. This means that all the
steady state probabilities are equal. We can say that every state i is equally
likely to occur in the long run. We can then write the following relations:
\pi_0 = \frac{1}{\sum_{i=0}^{m} \rho^i} = \frac{1}{\sum_{i=0}^{m} 1} = \frac{1}{m+1} \qquad (26.21)

\pi_i = \pi_0 = \frac{1}{m+1} \qquad (26.22)
• Now we can consider the case when p < q which implies that ρ < 1. This
means that our system is stable and that there is a tendency of customers to
be served faster than they arrive. The drift is said to be leftward. Further, if we take the limiting case of m tending to \infty, then we have the following:

\pi_0 = \frac{1}{\underbrace{\sum_{i=0}^{\infty} \rho^i}_{\text{geometric series}}} \qquad (26.23)

\implies \pi_0 = \frac{1}{1/(1-\rho)} = (1 - \rho) \qquad (26.24)
• With this result we can obtain the limiting distribution of state i as a geo-
metric distribution as follows:
πi = π0 ρi = (1 − ρ)ρi (26.25)
• After characterizing the general distribution, we can get the expected num-
ber of customers in the queueing system as follows:
E[X_n] = \frac{\rho}{1-\rho} \qquad (26.26)
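The geometric limiting distribution (26.25) and the mean queue length (26.26) can be checked directly; the load factor ρ = 0.8 below is an illustrative choice:

```python
rho = 0.8   # hypothetical load factor p/q < 1

# Truncated geometric distribution pi_i = (1 - rho) rho^i  (eq. 26.25);
# 500 terms are far more than enough for rho = 0.8.
pi = [(1 - rho) * rho**i for i in range(500)]
mean_queue = sum(i * p for i, p in enumerate(pi))

print(sum(pi))       # ~1.0: a valid distribution
print(mean_queue)    # ~4.0 = rho / (1 - rho), matching eq. 26.26
```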
• We then assume that during each individual time slot - a packet can arrive
with probability λ. Then, the packet arrival rate can be given by:
• Another assumption is that during each individual time slot - a packet departs the system with probability µ. Then the departure rate can be given as follows:

\text{departure rate} = \frac{\text{expected number of departures}}{\text{unit of time}} = \frac{\mu}{\Delta t} \qquad (26.28)
• In this model we assume that there are no simultaneous arrivals and de-
partures in a given time slot. Note the following variable definitions:
qn = number of packets in queue in the nth time slot (queue length) (26.29)
• If on the other hand we have qn > 0 then the queue length at time (n + 1)
would have elements of both arrivals and departures and would be given by
(note that the positive sign above the brackets indicates non-negativity of
the queue length):
qn+1 = [qn + An − Dn ]+ (26.36)
P [An = 1] = λ (26.37)
P [Dn = 1] = µ (26.38)
• After making the Markov assumption that future queue lengths depend only
on the current queue length, we get the probability of queue length in-
creasing and probability of queue length decreasing (on the condition that
i > 0) as follows:
• The probability of queue length not changing given that currently the queue
length is 0 - essentially removes the possibility of a departure happening -
such a probability measure is then given by:
• Now for an infinite state space MC - we can specify the various transition
probabilities as given below:
p0,0 = 1 − λ (26.43)
pi,(i+1) = λ (26.44)
pi,(i−1) = µ (26.45)
p_{i,i} = 1 - \lambda - \mu, \quad \forall\, i \neq 0 \qquad (26.46)
• Now we can use the global balance equations to find the limiting probability
distributions by essentially computing the eigenvector of transition matrix
P T associated with the largest eigenvalue of 1. These relations are specified
as follows:
[P^T]^n p(0) = p(n) \implies P^T \pi = \pi \qquad (26.50)

\pi = \lim_{n \to \infty} p(n) \qquad (26.51)
• Further note that the resulting probability distribution would have an exponential form. View the figure below for a pictorial specification of the model we have described.
• We shall now write down the limiting distribution equations for states 0 and i respectively in the following manner:
π0 = (1 − λ)π0 + µπ1 (26.52)
πi = λπi−1 + (1 − λ − µ)πi + µπi+1 (26.53)
• Now we would formulate a general exponential expression for the limiting
distribution with parameters α and c - this can be written as follows:
πi = cαi =⇒ π0 = cα0 (26.54)
• Further note that the ratio (µ/λ) is known as the queue stability margin -
A larger value of this ratio implies fewer packets in queue.
• Now recall the limiting distribution equation for state i and then rearrange
the terms to obtain the following result:
• The above two results tell us one basic result - that the average rate at
which the queue leaves state 0 (λπ0 ) is equal to the average rate at which
the queue enters state 0 (µπ1 ).
• Even though the evolution equations remain the same - the queue probabilities will change now. Note the following probabilities:
• Given that in equilibrium - the rate at which the queue leaves a state equals
the rate at which the queue enters the state, we can write the following
queue balance equations:
• As in the previous case, we will now attempt to find the parameters of the
limiting distribution by specifying its general exponential form:
πi = cαi (26.78)
\lambda(1-\mu) c \alpha^0 = \mu(1-\lambda) c \alpha^1 \implies \alpha = \frac{\lambda(1-\mu)}{\mu(1-\lambda)} \qquad (26.79)
• We now substitute the general form distribution into equation 77 and obtain
the quadratic form that can be satisfied by the above expression for α.
• Now before moving ahead in simplifying the above expression, we can es-
sentially rearrange the α expression given by equation 79 to obtain the fol-
lowing:
λ(1 − µ) = αµ(1 − λ) (26.82)
• Now in equation 81 we will use the above relation to essentially substitute the RHS of the above equation in place of the LHS expression. With this we find that the quadratic equation is satisfied.
• Now our next step is to find the constant c by applying the constraint equa-
tion of the limiting probabilities as follows:
\sum_{i=0}^{\infty} \pi_i = \sum_{i=0}^{\infty} c\alpha^i = 1 \qquad (26.84)

c \sum_{i=0}^{\infty} \alpha^i = 1 \implies \frac{c}{1-\alpha} = 1 \qquad (26.85)

\implies c = (1 - \alpha) \qquad (26.86)
• As a final step, with our specified parameters - we can write the limiting
distribution in the following form:
• We specify state J as being the state of full-capacity queue length. Note that there cannot be any arrivals in this state - if there were an arrival, it would lead to buffer overflow.
πJ = cαJ (26.89)
\implies \alpha = \frac{\lambda(1-\mu)}{\mu(1-\lambda)} \qquad (26.91)
• Now we try to find the constant c by taking into account the constraint
equations as follows:
\sum_{i=0}^{J} \pi_i = \sum_{i=0}^{J} c\alpha^i = 1 \qquad (26.92)

\implies \sum_{i=0}^{J} c\alpha^i = \frac{c[1 - \alpha^{(J+1)}]}{[1 - \alpha]} = 1 \qquad (26.93)

\implies c = \frac{(1 - \alpha)}{[1 - \alpha^{(J+1)}]} \qquad (26.94)
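The finite-buffer limiting distribution (26.89)-(26.94) is short enough to evaluate directly; the slot probabilities λ = 0.3, µ = 0.4 and buffer size J = 10 below are illustrative assumptions:

```python
lam, mu, J = 0.3, 0.4, 10   # hypothetical arrival/departure probabilities, buffer size

alpha = lam * (1 - mu) / (mu * (1 - lam))     # eq. 26.91
c = (1 - alpha) / (1 - alpha ** (J + 1))      # eq. 26.94
pi = [c * alpha ** i for i in range(J + 1)]   # eq. 26.89

print(sum(pi))   # 1.0: the distribution over states 0..J is normalized
print(pi[J])     # probability the buffer is full (an arrival would overflow)
```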
This chapter focuses on developing the fundamental concepts of Queueing theory involving - Exponential distributions, Counting processes, Markov chains and transition rates. Consequently, a quick overview of the M/M/1 model is given before delving into the M/M/s model, since it essentially serves as the connecting link.
We can therefore state that in queueing models, interarrival times and service times follow an exponential distribution and the number of customer arrivals follows a Poisson distribution.
Chapter 27. Queueing models - M/M/1 291
The CDF, denoted as P[T ≤ t], and its complementary CDF, denoted as P[T > t], are given below:
Note that λ represents the rate at which events occur in time interval T and the expected
value E[T ] is the mean interarrival time between events. Note that the Variance of this
distribution is given by:
\text{var}[T] = \frac{1}{\lambda^2} \qquad (27.7)
We also note the transition times denoted as Ti which means - the transition of a random
variable out of state i. This represents the time it takes to transit from one state to another
and is essentially the interarrival time. The standard notation is that when an event occurs,
the probability that we transit out of state i to state j is Pij . Now in most cases we are
often dealing with the limiting transition probabilities or the steady state transition
probabilities. This is represented as:
P[N(t) = j] = \frac{e^{-\lambda t} (\lambda t)^j}{j!} \qquad (27.10)
Further we note what the steady state probability p_0 actually means - it is the steady state probability of finding the system in state 0, which essentially means a state where no customers are in the system. Similarly, p_n stands for the steady state probability of finding the system in state n.
27.4.2 Derivation
Referring to the Birth and Death model, we write the infinitesimal transition rates as λn and µn for n customers in the system. We note that the arrival rate is constant - that is, it does not change depending on the number of customers already in the system. We now determine whether µn changes or not. Assuming n busy servers as of time t, the probability of a given server completing service during the interval (t, t + ∆t) is given by:
Given that out of n servers, the probability that 1 server completes service is:
\binom{n}{1} [\mu \Delta t + o(\Delta t)]^1 [1 - \mu \Delta t + o(\Delta t)]^{n-1} = n\mu \Delta t + o(\Delta t) \qquad (27.27)
Now given that n servers are busy, the probability that r servers complete service in the
given interval is given by:
\binom{n}{r} [\mu \Delta t + o(\Delta t)]^r [1 - \mu \Delta t + o(\Delta t)]^{n-r} = o(\Delta t), \quad r \ge 2 \qquad (27.28)
We see clearly that the only case with non-negligible probability of service completion is the completion of exactly one service. Therefore the service rate is nµ when n servers are busy, and since we have s servers in all, its maximum value is sµ. Also, since the arrival rate is constant, we have λn = λ. Recall that we defined the concept of steady state transition
probabilities in the previous section. With this we can write the Chapman-Kolmogorov
equations in terms of steady state probabilities as follows:
pn = ρn−s ps , n ≥ s (27.35)
Where ρ again represents the traffic intensity and since the maximum service rate is sµ,
in this case it is given by:
\rho = \frac{\lambda}{s\mu} \qquad (27.36)
A key point to note is that if the number of customers in the system is more than s, then the system effectively behaves like an M/M/1 system with service rate s\mu. Some other notations that we need for further proofs are:
• Expected number of busy servers: \alpha = \frac{\lambda}{\mu}
• Alternative expression 1: \frac{\alpha}{s} = \rho
• Alternative expression 2: \alpha = s\rho
p_0 = \left[ \sum_{r=0}^{s-1} \frac{\alpha^r}{r!} + \frac{\alpha^s}{s!} \left(1 - \frac{\alpha}{s}\right)^{-1} \right]^{-1} \qquad (27.37)
With this we get the mean number of customers in the system as:
L = \alpha + \frac{\rho \alpha^s p_0}{s!(1-\rho)^2} \qquad (27.38)
L_q = \frac{\rho \alpha^s p_0}{s!(1-\rho)^2} \qquad (27.39)
Getting to the waiting time, we get the mean waiting time in queue as follows:
W_q = \frac{\alpha^s p_0}{s!\, s\mu\, (1-\rho)^2} \qquad (27.40)
References
[1] U. Narayan Bhat - An Introduction to Queueing Theory
Example. Consider the following 2-state DTMC with transition probability matrix
T1 . Let us denote the states by S = {1, 2}.
T_1 = \begin{bmatrix} 0.75 & 0.25 \\ 0.5 & 0.5 \end{bmatrix}
Each entry in T1 denotes the conditional probability that the system moves to state j at time t given that it is in state i at
Chapter 28. Continuous Time Markov Chains 297
time (t - 1). That is, P[X_t = j \mid X_{t-1} = i].

P(X_t = j \mid X_{t-1} = i) = \sum_k P(X_{t-\frac{1}{2}} = k \mid X_{t-1} = i)\, P(X_t = j \mid X_{t-\frac{1}{2}} = k) \qquad (28.1)

T_1(i, j) = \sum_k T_{\frac{1}{2}}(i, k)\, T_{\frac{1}{2}}(k, j) \qquad (28.2)

T_1 = T_{\frac{1}{2}}\, T_{\frac{1}{2}} = \big[T_{\frac{1}{2}}\big]^2 \implies T_{\frac{1}{2}} = T_1^{1/2} \qquad (28.3)
sampling, it is actually not so, because this holds only in the case when T1 is positive definite (has positive eigenvalues). In the following example, we consider a transition matrix with at least one negative eigenvalue and show that it does not yield a real-valued stochastic matrix.
The eigenvalues of T1 are 1 and -0.8, so T1 is not positive definite. As in the previous example, decomposing T1 using the matrix square root gives the transition matrix T_{1/2} for half sub-interval sampling of the Markov Chain.

T_{\frac{1}{2}} = \begin{bmatrix} 0.5 + 0.447i & 0.5 - 0.447i \\ 0.5 - 0.447i & 0.5 + 0.447i \end{bmatrix}
Thus we see that there is no real-valued stochastic matrix describing the same process as T1 but at half the sampling periodicity. Put differently, there is no 2-state CTMC which, when sampled at a rate of 1 time unit, produces a Markov Chain with matrix T1. The problem in generating T_{1/2} arises because T1 has negative eigenvalues.
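This example can be reproduced numerically. The text does not print the matrix with eigenvalues 1 and -0.8, but the symmetric stochastic matrix below has exactly those eigenvalues and yields the complex square root displayed above:

```python
import numpy as np
from scipy.linalg import sqrtm

# A stochastic matrix whose eigenvalues are 1 and -0.8 (one negative),
# consistent with the example in the text.
T1 = np.array([[0.1, 0.9],
               [0.9, 0.1]])
print(np.linalg.eigvals(T1))      # 1.0 and -0.8, in some order

T_half = sqrtm(T1)                # principal matrix square root
print(T_half)                     # entries ~0.5 +/- 0.447i: complex, so no real
                                  # half-period stochastic matrix exists
print(np.allclose(T_half @ T_half, T1))
```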
Proposition. Only stochastic transition matrices with all positive eigenvalues cor-
respond to a CTMC process sampled at a given periodicity. This means that:
3. Many systems do not have a natural sampling rate. The rate is chosen for
computational or measurement convenience.
Markov Property The Markov property states that the conditional probability
of the process to be in a future state j depends only on the current state and is
independent of the past path taken by the process.
P[X(t) = j \mid X(t_1) = i_1, \ldots, X(t_n) = i_n] = P[X(t) = j \mid X(t_n) = i_n] \qquad (28.4)
Time Homogeneity This refers to the property that the conditional probability
of the process being in a future state j given a current state i remains the same
so long as the time interval between transition is the same. That is, for example
2-period transitional probability remains the same no matter what the time point
of the initial state - transition from i to j in interval t = 1 to t = 3 is same as that
in the interval t = 5 to t = 7.
P[X(t) = j \mid X(s) = i] = P[X(t-s) = j \mid X(0) = i] \qquad (28.5)

p_{ij}(s, t+s) = P[X(t+s) = j \mid X(s) = i] \qquad (28.6)

p_{ij}(0, t) = p_{ij}(t) = P[X(t) = j \mid X(0) = i] \qquad (28.7)

p_{ij}(0) = P[X(0) = j \mid X(0) = i] = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \qquad (28.9)

P(t) = \big[p_{ij}(t)\big],\; t \ge 0 \implies P(0) = I \qquad (28.10)
This implies that the probability for a same-instant transition is 1 if the process remains in the same state and 0 if it moves to another state. Thus single-instant transitions from one state to a different state are not allowed. Consequently, the transition matrix P(0) is an identity matrix.
p_{ij}(0, s+t) = \sum_k p_{ik}(0, s)\, p_{kj}(s, s+t), \qquad 0 < s < t \qquad (28.13)
k
P (s + t) = P (s).P (t) (28.14)
q_{ij} = \left. \frac{dp_{ij}(t)}{dt} \right|_{t=0} \qquad (28.15)

q_{ij} = \lim_{h \to 0} \left[ \frac{p_{ij}(t+h) - p_{ij}(t)}{h} \right]_{t=0} \qquad (28.16)

q_{ij} = \lim_{h \to 0} \left[ \frac{p_{ij}(h) - p_{ij}(0)}{h} \right] \qquad (28.17)
Proposition. The elements of the transition probability matrix pij are related to
those of the rate matrix by the following equation, where qii ≤ 0, qij ≥ 0 and h
takes small non-negative values.
p_{ij}(h) = \begin{cases} 1 + h q_{ii} + o(h), & i = j \\ h q_{ij} + o(h), & i \neq j \end{cases} \qquad \text{for small } h \to 0 \qquad (28.18)
Proof. We first note that p_{ij}(0) = \delta_{ij}, the Kronecker Delta function, which takes the value 1 when i = j and 0 otherwise. Differentiating p_{ii}(t) with respect to t,
\left. \frac{dp_{ii}(t)}{dt} \right|_{t=0} = \lim_{h \to 0} \left[ \frac{p_{ii}(h) - p_{ii}(0)}{h} \right] = \lim_{h \to 0} \left[ \frac{1 + h q_{ii} + o(h) - 1}{h} \right] \qquad (28.19)

= \lim_{h \to 0} \left[ q_{ii} + \frac{o(h)}{h} \right] = q_{ii} + 0 = q_{ii} \qquad (28.20)

\left. \frac{dp_{ij}(t)}{dt} \right|_{t=0} = \lim_{h \to 0} \left[ \frac{p_{ij}(h) - p_{ij}(0)}{h} \right] = \lim_{h \to 0} \left[ \frac{h q_{ij} + o(h) - 0}{h} \right] \qquad (28.21)

= \lim_{h \to 0} \left[ q_{ij} + \frac{o(h)}{h} \right] = q_{ij} + 0 = q_{ij} \qquad (28.22)

\left. \frac{d}{dt} \Big( \sum_j p_{ij}(t) \Big) \right|_{t=0} = \left. \frac{d(1)}{dt} \right|_{t=0} \qquad (28.23)

\sum_j \left. \frac{dp_{ij}(t)}{dt} \right|_{t=0} = 0 \implies \sum_j q_{ij} = 0 \qquad (28.24)
1. How long does the CTMC stay at a particular state before jumping to the
next state?
2. With what probability does the CTMC jump to the given next state?
Let Ti denote the time spent by the CTMC in state i before moving to another state. For i ≥ 1, the probability that the time spent in state i is greater than an arbitrary value t is the probability of the intersection of the events that it stays in i for period s, for all values of s between 0 and t, conditioned on the initial state i.
P[T_i > t] = P[X(s) = i,\; 0 \le s \le t \mid X(0) = i] \qquad (28.25)
Applying the Markov chain rule, the probability of being in state i for sub-intervals
(fractions of total time) depends only on the current state information and not the
past.
P[T_i > t] = P\big[X(s) = i,\; 0 \le s \le \tfrac{t}{n} \mid X(0) = i\big] \times P\big[X(s) = i,\; \tfrac{t}{n} \le s \le \tfrac{2t}{n} \mid X(\tfrac{t}{n}) = i\big] \times \cdots \times P\big[X(s) = i,\; \tfrac{(n-1)t}{n} \le s \le t \mid X(\tfrac{(n-1)t}{n}) = i\big] \qquad (28.26)

P[T_i > t] = P\big[X(s) = i,\; 0 \le s \le \tfrac{t}{n} \mid X(0) = i\big]^n \quad \forall n

= \lim_{n \to \infty} P\big[X(s) = i,\; 0 \le s \le \tfrac{t}{n} \mid X(0) = i\big]^n

= \lim_{n \to \infty} \big[p_{ii}(\tfrac{t}{n})\big]^n = \lim_{n \to \infty} \underbrace{\big[1 + q_{ii}\tfrac{t}{n} + o(\tfrac{t}{n})\big]^n}_{\text{exponential limit}} = e^{t q_{ii}} \qquad (28.27)
The probability that the time spent by the CTMC in state i is greater than t follows an exponential distribution, with the negated rate matrix diagonal element -q_{ii} as the parameter. Also note that q_{ii} \le 0. Thus,

P[T_i > t] = e^{t q_{ii}}, \qquad P[T_i \le t] = 1 - e^{t q_{ii}}

\implies T_i \sim \text{Exp}(-q_{ii}), \qquad E[T_i] = \frac{-1}{q_{ii}} \qquad (28.28)
Suppose the CTMC changes state at time t, from state i to state j. The probability of this jump is given by \lim_{h \to 0} P[X(t+h) = j \mid X(t) = i, X(t+h) \neq i]. Expanding the term within the limit, we obtain,
P\big[X(t+h) = j \mid X(t) = i, X(t+h) \neq i\big] \qquad (28.29)

= \frac{P[X(t+h) = j,\, X(t) = i,\, X(t+h) \neq i]}{P[X(t) = i,\, X(t+h) \neq i]}

= \frac{P[X(t+h) = j,\, X(t) = i]}{P[X(t+h) \neq i,\, X(t) = i]}, \qquad j \neq i \qquad (28.30)

P[X(t+h) = j \mid X(t) = i] = \frac{P[X(t+h) = j,\, X(t) = i]}{P[X(t) = i]}

P[X(t+h) \neq i \mid X(t) = i] = \frac{P[X(t+h) \neq i,\, X(t) = i]}{P[X(t) = i]}

\implies \frac{P[X(t+h) = j \mid X(t) = i]}{P[X(t+h) \neq i \mid X(t) = i]} = \frac{P[X(t+h) = j,\, X(t) = i]}{P[X(t+h) \neq i,\, X(t) = i]}, \qquad j \neq i \qquad (28.31)

\lim_{h \to 0} P\big[X(t+h) = j \mid X(t) = i, X(t+h) \neq i\big] = \lim_{h \to 0} \frac{P[X(t+h) = j,\, X(t) = i]}{P[X(t+h) \neq i,\, X(t) = i]}, \qquad j \neq i

= \lim_{h \to 0} \frac{p_{ij}(h)}{\sum_{k \neq i} p_{ik}(h)} = \lim_{h \to 0} \frac{h q_{ij}}{\sum_{k \neq i} h q_{ik}} = \frac{q_{ij}}{\sum_{k \neq i} q_{ik}} = \frac{q_{ij}}{-q_{ii}} \qquad (28.32)
Since we know that each row of the Q matrix sums to zero, \sum_k q_{ik} = 0, and that q_{ii} \le 0, it follows that q_{ii} = -\sum_{k \neq i} q_{ik}.
We have now set up the premise to answer the two questions initially raised. The CTMC remains in state i for a period T_i, such that T_i \sim \text{Exp}(-q_{ii}) with mean E[T_i] = \frac{-1}{q_{ii}}. Then it jumps to another state j \neq i, with probability \frac{q_{ij}}{-q_{ii}}. Thus we have shown that the CTMC process depends only on the rate matrix Q rather than on the transition probability matrix P. If the CTMC process is observed only at its jumps, then a Markov Chain is obtained with transition matrix P. This MC with P as the transition matrix is called the Embedded Markov Chain.

P = [p_{ij}], \qquad p_{ij} = \frac{q_{ij}}{-q_{ii}}
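The holding-time/jump-probability characterization above translates directly into a simulation recipe (exponential holding times, then the embedded chain's jump probabilities). The 3-state rate matrix Q below is a hypothetical example, not one from the text:

```python
import random

# Hypothetical 3-state rate matrix Q (each row sums to zero, q_ii <= 0).
Q = [[-3.0,  2.0,  1.0],
     [ 1.0, -2.0,  1.0],
     [ 2.0,  2.0, -4.0]]

def simulate_ctmc(Q, state, horizon, rng):
    """Simulate a CTMC path: hold in state i for Exp(-q_ii) time, then jump
    to j != i with probability q_ij / (-q_ii) (the embedded Markov chain)."""
    t, path = 0.0, [state]
    while True:
        t += rng.expovariate(-Q[state][state])   # T_i ~ Exp(-q_ii)
        if t > horizon:
            return path
        weights = [Q[state][j] if j != state else 0.0 for j in range(len(Q))]
        state = rng.choices(range(len(Q)), weights=weights)[0]
        path.append(state)

rng = random.Random(0)
path = simulate_ctmc(Q, state=0, horizon=5.0, rng=rng)
print(path[:10])   # the visited-state sequence is the embedded MC
```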
" #
dP (t) P (t + h) − P (t)
= lim
dt h→0 h
" Ä ä#
dP (t) P (t) P (h) − P (0)
= lim
dt h→0 h
"Ä ä#
dP (t) P (h) − P (0)
= P (t) lim
dt h→0 h
" #
dP (t) dP (t)
= P (t) = P (t)Q (28.34)
dt dt
t=0
This results in the famous Forward Kolmogorov Equation for determining the tran-
sition probabilities of a CTMC given the rate matrix. In the matrix form, we have,
\frac{dP(t)}{dt} = P(t)Q \qquad (28.35)

\frac{dp_{ij}(t)}{dt} = \sum_k p_{ik}(t)\, q_{kj} \quad \forall\, i, j \qquad (28.36)
The FKE can also be expressed in its element-wise form. Express the Chapman-
Kolmogorov equation in element-wise form:
p_{ij}(s, t+h) = \sum_{k \in S} p_{ik}(s, t)\, p_{kj}(t, t+h)
\frac{dP(t)}{dt} = QP(t) \qquad (28.40)

\frac{dp_{ij}(t)}{dt} = \sum_k q_{ik}\, p_{kj}(t) \quad \forall\, i, j \qquad (28.41)
In Finance, we use the BKE in pricing financial products such as options, futures and other derivatives. Suppose we have a payoff from a derivative at maturity period t = T. We wish to compute the initial price of the derivative at t = 0, which can be done by modelling the dynamics of its underlying asset with the Backward Kolmogorov equation.
The forward and backward Kolmogorov equations give the dynamics of the system P(t). We know that P(0) = I and P(t) = e^{Qt}. For a finite state CTMC, the stationary solutions of both equations are the same.
\text{FKE: } \frac{dP(t)}{dt} = P(t)Q \implies PQ = 0 \text{ gives the stationary solution.} \qquad (28.42)

\text{BKE: } \frac{dP(t)}{dt} = QP(t) \implies QP = 0 \text{ gives the stationary solution.} \qquad (28.43)
Both the equations result in the same stationary solution for P (t).
=⇒ πP (t) = π ∀t (28.45)
Taking the derivative on both sides of the equation,
\frac{d\big(\pi P(t)\big)}{dt} = \frac{d\pi}{dt} \implies \pi \frac{dP(t)}{dt} = 0 \quad \forall t \qquad (28.46)

\left. \pi \frac{dP(t)}{dt} \right|_{t=0} = 0 \implies \pi Q = 0 \qquad (28.47)
• A basic intuition about this rate matrix is that - Probability mass flowing
out of one state will go to another state - essentially it is conserved.
• If we know that the process takes on state i by time t, then the probability of
the process jumping to state j in a small time ∆t is given by:
• If we consider the above result by considering all possible states i and j then
we can write the same in vector form as follows:
Chapter 29. CTMC and Embedded MC 308
• Since this is independent of time, its derivative with respect to time would
be zero. Then we have the following relation:
\frac{d\alpha(t)}{dt} = \alpha(t)Q \implies \pi Q = 0 \qquad (29.19)
dt
• The above stated balance equation essentially characterizes the balance of
probability flows across states. We take the specific case of the j th row as
follows:

\pi_j \underbrace{q_j}_{= \sum_{i \neq j} q_{ji}} = \sum_{i \neq j} \pi_i q_{ij} \qquad (29.20)

\implies \pi_j \sum_{i \neq j} q_{ji} = \sum_{i \neq j} \pi_i q_{ij} \qquad (29.21)
• The above equation states that the total outflow of probability from state j
and the inflow into state j is the same. We can essentially remove the sum-
mation and consequently obtain the detailed balance equation as follows:
• We note that these equations are linearly dependent - that is, any given equation is automatically satisfied if the other equations are satisfied - due to conservation of probability. The solution is unique up to a constant factor and is uniquely determined by adding a normalization condition as follows:
\pi^T e = 1 \implies \sum_j \pi_j = 1 \qquad (29.24)
• Further note that the stationary distribution is nothing but the left eigen-
vector corresponding to the eigenvalue of 0 - after solving the following
equation:
πT Q = 0 (29.25)
Figure 29.1:
• Given the above setup, we can write the balance equations over these two
sets as follows:

\sum_{i \in A,\, j \in A^C} \pi_i q_{ij} = \sum_{j \in A^C,\, i \in A} \pi_j q_{ji} \qquad (29.26)

\implies \pi^T Q = 0 \qquad (29.27)

\implies \begin{bmatrix} \pi_A & \pi_{A^C} \end{bmatrix} \begin{bmatrix} Q_{AA} & Q_{AA^C} \\ Q_{A^C A} & Q_{A^C A^C} \end{bmatrix} = \begin{bmatrix} 0_A & 0_{A^C} \end{bmatrix} \qquad (29.28)
\begin{bmatrix} \pi_A Q_{AA} + \pi_{A^C} Q_{A^C A} & \quad \pi_A Q_{AA^C} + \pi_{A^C} Q_{A^C A^C} \end{bmatrix} = \begin{bmatrix} 0_A & 0_{A^C} \end{bmatrix} \qquad (29.29)

(1) \implies \pi_A Q_{AA} + \pi_{A^C} Q_{A^C A} = 0_A \qquad (29.30)

(2) \implies \pi_A Q_{AA^C} + \pi_{A^C} Q_{A^C A^C} = 0_{A^C} \qquad (29.31)

\text{NOTE: } \underbrace{Q_{AA} = Q_{A^C A^C} = 0}_{\text{zero matrices}} \qquad (29.32)

\implies \pi_A Q_{AA^C} = \pi_{A^C} Q_{A^C A} \qquad (29.33)
• The e vector and E matrix of all ones help us to frame the normalization in
a vector-matrix format - these are given by:
e^T = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}, \qquad \underbrace{E}_{n \times n} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix} = \begin{bmatrix} e^T \\ \vdots \\ e^T \end{bmatrix} \qquad (29.36)
• We can now note the following equations and associated computations that
help us arrive at the solution for the stationary distribution vector.
(1) \implies \pi^T Q = 0^T \qquad (29.39)

(2) \implies \pi^T E = e^T \qquad (29.40)

\text{adding the above two} \implies \pi^T [Q + E] = 0^T + e^T = e^T \qquad (29.41)

\implies [Q + E]^T \pi = e \implies [Q^T + E]\pi = e \qquad (29.42)

\implies \pi = [Q^T + E]^{-1} e \qquad (29.43)
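The closed-form solution (29.43) can be sketched in a few lines; the 3-state rate matrix Q below is a hypothetical example, not one from the text:

```python
import numpy as np

# Hypothetical 3-state rate matrix Q (rows sum to zero).
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 2.0,  2.0, -4.0]])
n = Q.shape[0]
E = np.ones((n, n))   # the all-ones matrix used for normalization

# pi^T [Q + E] = e^T  =>  pi = ([Q + E]^T)^{-1} e   (eqs. 29.41-29.43)
pi = np.linalg.solve((Q + E).T, np.ones(n))
print(pi)                      # the stationary distribution
print(pi.sum())                # 1.0: normalization is built in
print(np.abs(pi @ Q).max())    # ~0: pi Q = 0 as required
```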
• Let us consider the state transitions of the process Xt occurring at time in-
stances - t0 , t1 , · · · .
• Note the definition: X_n^{(e)} = the value of X_t immediately after the transition at time t_n - that is, at time instant t_n^+, or simply, the value of X_t in the interval (t_n, t_{n+1}).
• Since the process X_t is a Markov Process, we say that X_n^{(e)} is a DTMC and the following condition holds:
• We note that when X_i \sim \exp(\lambda_i), the following condition holds:

P[\min(X_1, \cdots, X_n) = X_i] = \frac{\lambda_i}{\lambda_1 + \cdots + \lambda_n} \qquad (29.49)
Part V
Part VI
Chapter 30
Fourier Transforms
30.1 Introduction
To provide some context for our discussion about Fourier transforms and series, let us imagine a scenario. Since Fourier analysis is concerned with signals and waves, we imagine a musician playing a steady note on a trumpet. Further, there is a microphone in front of the trumpet that is capturing the sound produced. The mic typically has a diaphragm which undergoes pressure due to the sound waves from the trumpet, and this pressure then translates into voltage, which is proportional to the instantaneous air pressure. Now if we measure this with an oscilloscope we will get a graph of pressure against time F(t) which turns out to be periodic. Note that it is the reciprocal of the period which is termed the frequency of the note being played on the trumpet. The typical relationship between frequency and time period is:
\nu = \frac{1}{T} \qquad (30.1)
Let us say that the fundamental frequency of this one note sound is 256Hz. Now
in reality one sine wave of the said frequency is not produced, rather multiple
overtones are produced which are multiples of the fundamental frequency with
various amplitudes and phases. Phase basically determines where in the cycle
the signal would start repeating. Technically, we can analyse the wave by finding
a list of the amplitudes and phases of the various sine waves that comprise the
complex signal. We can plot a graph of amplitudes against frequency denoted by
A(ν). Now since we are effectively bringing the function from the time domain
to the frequency domain we say that A(ν) is the Fourier transform of F (t).
Here ν0 represents the fundamental frequency. The various sine and cosine func-
tions in the series denote the various phases of the signal that are not in step with
the fundamental signal. We can rewrite the previous formula as:
F(t) = \sum_{n=-\infty}^{\infty} \big[a_n \cos(2\pi n \nu_0 t) + b_n \sin(2\pi n \nu_0 t)\big] \qquad (30.3)
Note that this process of constructing a waveform by adding together the funda-
mental frequency and its overtones of various amplitudes is called Fourier syn-
thesis. Given that cos(−x) = cos(x) and sin(x) = − sin(−x) we can rewrite the
above expression as:
F(t) = A_0/2 + \sum_{n=1}^{\infty} \big[A_n \cos(2\pi n \nu_0 t) + B_n \sin(2\pi n \nu_0 t)\big] \qquad (30.4)
30.3 Amplitudes
Now note that the opposite process of extracting the frequencies and amplitudes
from the original signal is called Fourier analysis. We are interested in trying to
find the amplitudes Am and Bm for various instances of m. Now before moving
ahead, we note the utilisation of the orthogonality property of trigonometric
functions - the central idea is that if we take a sine and a cosine, or two sines or two
cosines (as multiples of the fundamental frequency), then take their product and
integrate this product over the period of fundamental frequency, then the result is
zero. Noting that 1 period is denoted as the inverse of frequency: P = 1/ν0 , we
have (for m \neq n in the first two):

\int_{t=0}^{P} \cos(2\pi n \nu_0 t) \cos(2\pi m \nu_0 t)\, dt = 0 \qquad (30.5)

\int_{t=0}^{P} \sin(2\pi n \nu_0 t) \sin(2\pi m \nu_0 t)\, dt = 0 \qquad (30.6)

\int_{t=0}^{P} \sin(2\pi n \nu_0 t) \cos(2\pi m \nu_0 t)\, dt = 0 \qquad (30.7)
Note that in case m = n then the first two integrals would resolve to 1/2ν0 . Now
we note some general expressions for the coefficient values:
B_m = \frac{2}{P} \int_{t=0}^{P} F(t) \sin(2\pi m \nu_0 t)\, dt \qquad (30.8)

A_m = \frac{2}{P} \int_{t=0}^{P} F(t) \cos(2\pi m \nu_0 t)\, dt \qquad (30.9)
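The analysis integrals (30.8)-(30.9) can be evaluated numerically with a simple midpoint rule. The square-wave signal and the frequency ν0 = 1 below are illustrative assumptions; a unit square wave has the known coefficients B_1 = 4/π, A_1 = 0:

```python
import math

nu0 = 1.0            # hypothetical fundamental frequency; period P = 1/nu0
P = 1.0 / nu0
N = 100_000          # midpoint-rule integration steps

def F(t):
    """Illustrative signal: a unit square wave with period P."""
    return 1.0 if (t % P) < P / 2 else -1.0

def coeff(m, trig):
    """Midpoint-rule approximation of (2/P) * integral of F(t) trig(2 pi m nu0 t)."""
    dt = P / N
    return (2 / P) * sum(F((k + 0.5) * dt) * trig(2 * math.pi * m * nu0 * (k + 0.5) * dt)
                         for k in range(N)) * dt

print(coeff(1, math.sin))   # ~4/pi ~ 1.2732: B_1 of a square wave
print(coeff(2, math.sin))   # ~0: even harmonics vanish
print(coeff(1, math.cos))   # ~0: A_1 vanishes for this odd signal
```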
An alternate way of writing the Fourier series is shown below. Note that this expression comes about as a result of taking A_m = R_m \cos\phi_m and B_m = R_m \sin\phi_m.

F(t) = \frac{A_0}{2} + \sum_{m=1}^{\infty} R_m \cos(2\pi m \nu_0 t - \phi_m) \qquad (30.10)
Now the coefficients C_m are basically complex numbers. Typically, without going into the derivations, we use inversion formulae to get the coefficient values of the real and complex parts of the coefficients:

A_m = 2\nu_0 \int_0^{1/\nu_0} F(t) \cos(2\pi m \nu_0 t)\, dt \qquad (30.13)

B_m = 2\nu_0 \int_0^{1/\nu_0} F(t) \sin(2\pi m \nu_0 t)\, dt \qquad (30.14)

C_m = 2\nu_0 \int_0^{1/\nu_0} F(t) e^{-2\pi i m \nu_0 t}\, dt \qquad (30.15)
The above expressions can be rewritten in slightly different notation if we let \nu_0 = \omega_0/2\pi, so that 2\nu_0 = \omega_0/\pi. The expressions become:

A_m = \frac{\omega_0}{\pi} \int_0^{2\pi/\omega_0} F(t) \cos(m\omega_0 t)\, dt \qquad (30.16)

B_m = \frac{\omega_0}{\pi} \int_0^{2\pi/\omega_0} F(t) \sin(m\omega_0 t)\, dt \qquad (30.17)

C_m = \frac{\omega_0}{\pi} \int_0^{2\pi/\omega_0} F(t) e^{-im\omega_0 t}\, dt \qquad (30.18)
An easy way to remember these formulae is by writing them as:
A_m = \frac{2}{\text{period}} \int_{\text{one period}} F(t) \cos\left(\frac{2\pi m t}{\text{period}}\right) dt \qquad (30.19)

B_m = \frac{2}{\text{period}} \int_{\text{one period}} F(t) \sin\left(\frac{2\pi m t}{\text{period}}\right) dt \qquad (30.20)
Note that if F (t) is real then a(ν) and b(ν) are real as well, however if the function
F (t) is asymmetrical that is if F (t) 6= F (−t) then we have complex values Φ(ν).
In certain cases F (t) is symmetrical, which in turn implies that Φ(ν) is real and
F (t) consists only of cosines. Our Fourier series would then become:
F(t) = \int_{-\infty}^{\infty} \Phi(\nu) \cos(2\pi \nu t)\, d\nu \qquad (30.24)
Now comes the interesting bit. We can actually recover the function that contains
information about the frequencies, Φ(ν), from F(t) by way of inversion:

Φ(ν) = ∫_{−∞}^{∞} F(t) cos(2πνt) dt    (30.25)
Finally we say that Φ(ν), which is a function in the frequency domain, is the Fourier
transform of F(t), which is in the time domain. Another general formulation of
this is given by:

Φ(ν) = ∫_{−∞}^{∞} F(t) e^{−2πiνt} dt    (30.26)
30.6 Spectrum
Note first that the square of the amplitude of oscillation of a wave gives a measure
of the power contained in each harmonic of the wave. If the Fourier transform Φ(ν)
of F(t) is complex, then taking the product of Φ(ν) and its complex conjugate Φ*(ν)
gives the power spectrum, or spectral power density, of F(t):

S(ν) = Φ(ν)Φ*(ν)    (30.27)
Note that Φ is nothing but the Fourier transform of F(t), and the multiplication
of the Fourier transform with its complex conjugate gives us the power spectral
density of F(t). Finally we note that the autocorrelation function is the Fourier
transform of this power spectrum we just obtained (the Wiener–Khinchin theorem).
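As a discrete sketch of (30.26)-(30.27) (our own illustration; the sampled cosine and all names here are assumptions, not from the text), the code below takes a DFT of a sampled cosine and forms S = ΦΦ*; the power concentrates at the cosine's harmonic:

```python
import cmath, math

N = 64
k_true = 3                                   # 3 whole cycles over the window
F = [math.cos(2 * math.pi * k_true * n / N) for n in range(N)]

# DFT: Phi[k] = sum_n F[n] e^{-2*pi*i*k*n/N}, a discrete analogue of (30.26)
Phi = [sum(F[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
       for k in range(N)]

# Power spectrum (30.27): S[k] = Phi[k] * conj(Phi[k]) = |Phi[k]|^2
S = [(p * p.conjugate()).real for p in Phi]

# The dominant positive-frequency bin should be k = 3.
peak = max(range(1, N // 2), key=lambda k: S[k])
```

For an integer number of cycles, all power sits in the bins k = 3 and k = N − 3, each with |Φ|² = (N/2)².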
References
[1] J. F. James - A Student's Guide to Fourier Transforms
Chapter 31
Laplace, Dirac Delta and Fourier Series
A popular property of the Laplace transform is that of linearity, which can be stated
as:
L{aF1(t) + bF2(t)} = aL{F1(t)} + bL{F2(t)}    (31.2)
Yet another important theorem associated with this transform is called the first
shift theorem and can be defined as follows:
→ ∫_0^T t e^{−st} dt = [−(t/s) e^{−st}]_0^T + (1/s) ∫_0^T e^{−st} dt    (31.7)

= −(T/s) e^{−sT} + [−(1/s^2) e^{−st}]_0^T    (31.8)

= −(T/s) e^{−sT} − (1/s^2) e^{−sT} + 1/s^2  →  1/s^2  as T → ∞    (31.9)

Therefore the Laplace transform of the function F(t) = t is given by f(s) = 1/s^2.
Now we note some general formulae regarding various Laplace transforms. The
derivation of these expressions is omitted in this section.
• L(t^n) = n!/s^{n+1}

• L{t e^{at}} = 1/(s − a)^2
• Before the next formula we must recall Euler's formula, which gives us the
polar form of a complex number: e^{it} = cos(t) + i sin(t).
Now we note that, due to the linearity property, the Laplace transform of e^{it}
is given by:

L(e^{it}) = L(cos(t)) + iL(sin(t))    (31.11)

where the Laplace transforms of the individual trigonometric functions are:

L(cos(t)) = s/(s^2 + 1)    (31.12)

L(sin(t)) = 1/(s^2 + 1)    (31.13)
• L{tF(t)} = −(d/ds) f(s)
• A popular function whose Laplace transform is immensely useful is the
Heaviside unit step function, which is given by:

H(t) = { 0  if t < 0,
         1  if t ≥ 0.    (31.14)
In a similar manner, we can generalize the above two points to write the
Laplace transform of an n-times differentiable function as:

L{F^(n)(t)} = s^n f(s) − s^{n−1} F(0) − s^{n−2} F′(0) − · · · − F^{(n−1)}(0)    (31.18)
• L{sin(wt)} = w/(s^2 + w^2)

• L{cos(wt)} = s/(s^2 + w^2)
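Several of the table entries above can be verified by brute-force numerical integration. The following is a rough sketch of ours (the helper name, the truncation point T and the step count are assumptions): it truncates the infinite integral and applies a midpoint rule.

```python
import math

def laplace(F, s, T=40.0, steps=100000):
    """Approximate L{F}(s) = integral from 0 to infinity of e^{-st} F(t) dt,
    truncated at t = T (the tail is negligible for s > 0 and T large)."""
    dt = T / steps
    return sum(math.exp(-s * (i + 0.5) * dt) * F((i + 0.5) * dt) * dt
               for i in range(steps))

s = 2.0
L_t   = laplace(lambda t: t, s)      # table entry: L(t)     = 1/s^2
L_sin = laplace(math.sin, s)         # table entry: L(sin t) = 1/(s^2 + 1)
L_cos = laplace(math.cos, s)         # table entry: L(cos t) = s/(s^2 + 1)
```

All three quadratures agree with the closed forms to several decimal places.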
Note that, as per the way this function is defined, for t < t0 we have H(t − t0) = 0,
so the transform integral runs only over those t such that t > t0, where the function
evaluates to H(t − t0) = 1:

L{H(t − t0)} = ∫_{t0}^∞ e^{−st} dt = [−e^{−st}/s]_{t0}^∞ = e^{−st0}/s    (31.20)
This function assumes special relevance when it is multiplied with another func-
tion: multiplying by the Heaviside function is analogous to 'switching on' the
other function. With this intuition we can state the second shift theorem, defined
as:
L{H(t − t0)F(t − t0)} = e^{−st0} f(s)    (31.21)
Note that with this we can find the Laplace transform of a function that is switched
on at t = t0 .
Now we take a simple example wherein the inverse transform is determined using
partial fractions:

→ L^{−1}( a/(s^2 − a^2) )    (31.24)

Solving the undetermined coefficients using partial fractions we get:

a/(s^2 − a^2) = (1/2)[ 1/(s − a) − 1/(s + a) ]    (31.25)

Now we can simply apply the linearity property of the inverse transform operator
to get:

L^{−1}[ a/(s^2 − a^2) ] = (1/2)(e^{at} − e^{−at})    (31.26)
The above is defined for any function h(t) that is continuous in the interval (−∞, ∞).
The Dirac-δ function can be thought of as the limiting case of a top hat function
with unit area as it becomes infinitesimally thin and tall. First we define a function
as follows:
Tp(t) = { 0    if t ≤ −1/T,
          T/2  if −1/T < t < 1/T,
          0    if t ≥ 1/T.    (31.29)

The Dirac Delta function then models the limiting behaviour of this function and
can be written as:

δ(t) = lim_{T→∞} Tp(t)    (31.30)
The value of the integral within the limits indicates the area under the curve
h(t)Tp(t), and we say that this area approaches the value h(0) as T → ∞.
Further, for a very large value of T the interval [−1/T, 1/T] will be small enough
for the value of h(t) not to differ appreciably from its value at the origin. With
this we can express h in the form h(t) = h(0) + ε(t), where the term ε(t) tends
to 0 as T goes to infinity. Therefore we can say that h(t) tends to h(0) for
extremely large values of T. Note that δ(t) is not a true function since it is not
defined at t = 0; therefore δ(0) has no value. Writing out the left and right side
limits we get:
∫_{0−}^{∞} h(t)δ(t) dt = h(0)    (31.32)

∫_{−∞}^{0+} h(t)δ(t) dt = h(0)    (31.33)
As a limiting case of the top hat function, the Dirac Delta function is drawn as a
unit arrow at the origin. We note an important property: as the interval gets
smaller and smaller due to T becoming large, the area under the top hat function
always remains unity.
Hence in the limiting case, the length of the arrow (which happens to represent
the Dirac-δ function) is 1. Therefore we have, with h = 1:

∫_{−∞}^{∞} δ(t) dt = 1    (31.34)

L{δ(t)} = 1    (31.35)

This essentially means that we are reducing the width of the top hat function
such that it lies between 0 and 1/T (because the Laplace transform integral has
limits starting from 0), and that we are increasing the height from T/2 to T so as
to preserve the unit area.
We can get the Laplace transform of the shifted Dirac Delta function, provided
that t0 > 0, as:
L{δ(t − t0)} = e^{−st0}    (31.38)
This has been called the filtering property since we can see clearly from the defi-
nition that the Dirac-δ function helps us pick out a particular value of a function.
∫_{−∞}^{∞} h(t)δ(t − t0) dt = h(t0)    (31.39)
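The limiting behaviour of the top hat function (31.29) can be checked numerically. In this sketch of ours (the smooth test function h and all names are assumptions), the integral of h(t)Tp(t) approaches h(0) as T grows:

```python
import math

def top_hat(t, T):
    """T_p of (31.29): height T/2 on (-1/T, 1/T), zero outside; unit area."""
    return T / 2.0 if -1.0 / T < t < 1.0 / T else 0.0

def sift(h, T, steps=4000):
    """Midpoint-rule integral of h(t) * T_p(t) over the support (-1/T, 1/T)."""
    a, b = -1.0 / T, 1.0 / T
    dt = (b - a) / steps
    return sum(h(a + (i + 0.5) * dt) * top_hat(a + (i + 0.5) * dt, T) * dt
               for i in range(steps))

h = lambda t: math.cos(t) + 3.0          # smooth test function, h(0) = 4
vals = [sift(h, T) for T in (10.0, 100.0, 1000.0)]
```

As T increases the integrals home in on h(0) = 4, the filtering behaviour of (31.32)-(31.33).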
If f is piecewise continuous with left and right derivatives at the end points, then
we say that for each x ∈ [−π, π] the Fourier series of f converges to:

[f(x−) + f(x+)] / 2    (31.40)

And at both the end points (x = ±π) the series converges to:

[f(π−) + f((−π)+)] / 2    (31.41)

The Fourier series thus gives us the result that, at points of discontinuity, the
Fourier series of f takes the mean of the one-sided limits of f as its value at the
discontinuous point.
These terms together form a trigonometric system, and the resulting series so
obtained is called the trigonometric series.
Here the a and b terms are the coefficients of the series, and we say that if the
coefficients are such that the series converges, then its sum will also have the
same period as the individual components, that is, 2π. Now if we have a function
f(x) of period 2π that can be represented by a convergent series of this form,
then we say that the Fourier series of f(x) is:

f(x) = a0 + Σ_{n=1}^∞ (an cos(nx) + bn sin(nx))    (31.45)
Consequently, the Fourier coefficients can be found using the following equa-
tions:

a0 = (1/2π) ∫_{−π}^{π} f(x) dx    (31.46)

an = (1/π) ∫_{−π}^{π} f(x) cos(nx) dx    (31.47)

bn = (1/π) ∫_{−π}^{π} f(x) sin(nx) dx    (31.48)
A crucial point to note is that the underlying concept behind this Fourier series is
the orthogonality of the trigonometric system - which means that every term in
the trigonometric series is orthogonal to each other, or that their inner product is
zero. In terms of integrals we can write this condition as:

∫_{−π}^{π} cos(nx) sin(mx) dx = 0    (31.49)
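The Euler formulas (31.46)-(31.48) can be exercised numerically. The sketch below is our own illustration (the helper name and the choice f(x) = x are assumptions): f(x) = x is odd on (−π, π), so only sine terms survive, with the known result bn = 2(−1)^{n+1}/n.

```python
import math

def coeff(f, kind, n, steps=20000):
    """Midpoint-rule approximation of the Euler formulas (31.46)-(31.48)."""
    dx = 2 * math.pi / steps
    total = 0.0
    for i in range(steps):
        x = -math.pi + (i + 0.5) * dx
        if kind == "a0":
            total += f(x) * dx / (2 * math.pi)
        elif kind == "a":
            total += f(x) * math.cos(n * x) * dx / math.pi
        else:
            total += f(x) * math.sin(n * x) * dx / math.pi
    return total

f = lambda x: x                          # odd function: cosine terms vanish
a0 = coeff(f, "a0", 0)
b = [coeff(f, "b", n) for n in (1, 2, 3)]   # expect 2, -1, 2/3
```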
Chapter 32
A Primer in Calculus
• [a, b] −→ {x | a ≤ x ≤ b}

• Example: solving the inequality

2x − 1 < x + 3    (32.1)
2x < x + 4    (32.2)
x < 4    (32.3)
−→ x ∈ (−∞, 4)    (32.4)
• |ab| = |a||b|
• |a + b| ≤ |a| + |b|
The problem statement of solving an inequality with absolute values can be stated
as: the inequality |a| < D states that the distance from a to 0 is less than D, or we
can say that a lies between −D and D. This is traditionally denoted as −D < a < D.
To give a clear demonstration, we compute the solutions for the inequalities
|2x − 3| ≤ 1 and |2x − 3| ≥ 1: the first gives −1 ≤ 2x − 3 ≤ 1, i.e. 1 ≤ x ≤ 2,
while the second gives 2x − 3 ≤ −1 or 2x − 3 ≥ 1, i.e. x ≤ 1 or x ≥ 2.
If the value of f(x) stays arbitrarily close to a number L for all x sufficiently close
to x0, we say that f approaches the limit L as x approaches x0. This is typically
expressed as:
Now we note the formal definition of a limit: let a function f(x) be defined on
an open interval about x0, except possibly at x0 itself. We say that f(x) approaches
the limit L as x approaches x0:

lim_{x→x0} f(x) = L    (32.19)

if for every small number ε > 0 there exists a corresponding number δ > 0 such
that for all x:

0 < |x − x0| < δ  −→  |f(x) − L| < ε    (32.20)
Now we shall note the definition of a continuous function. We say that a function
f(x) is continuous at an interior point x = c of its domain if lim_{x→c} f(x) = f(c).
Next, the derivative of a function f(x) is defined as the limit:

f′(x) = lim_{h→0} [f(x + h) − f(x)] / h    (32.22)
We say that the domain of f 0 (x) is the set of all points in the domain of f for which
the limit exists. If f 0 (x) exists we say that f is differentiable at x. Finally, the
process of calculating the derivative of a function is called differentiation. Note
some common differentiation rules:
(d/dx)(u/v) = [v (du/dx) − u (dv/dx)] / v^2    (32.24)
Next we discuss the chain rule. This is a rule that is used to compute the derivatives
of composite functions. If f(u) is differentiable at the point u = g(x) and in turn
g(x) is differentiable at the point x, then the composite function satisfies
(f ◦ g)′(x) = f′(g(x)) · g′(x). Letting y = f(u) and u = g(x) we can write:

dy/dx = (dy/du)(du/dx)    (32.26)
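The chain rule can be sanity-checked with a central finite difference. This is a small sketch of ours (the composite sin(x^2) and the step size are arbitrary choices):

```python
import math

def num_deriv(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Composite y = sin(u) with u = x^2: the chain rule gives dy/dx = cos(x^2) * 2x.
x = 1.3
numeric = num_deriv(lambda t: math.sin(t * t), x)
chain = math.cos(x * x) * 2 * x
```

The two values agree to high precision, as the chain rule predicts.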
In this, f(x) = x can be differentiated twice to get 0, and g(x) = e^x can be integrated
multiple times without any problem. The integration by parts formula is given by:

∫ u dv = uv − ∫ v du    (32.39)
• Lastly, we note the derivative rule for inverse functions, given by:

(f^{−1})′(x) = 1 / f′(f^{−1}(x))    (32.48)
References
[1] P. P. G. Dyke - An Introduction to Laplace Transforms and Fourier Series
Chapter 33
Transforms and the Memoryless Property
33.1 Transforms
• The Laplace transform of a function f : [0, ∞) → C is defined as follows:

F(s) = L(f(t)) = ∫_0^∞ e^{−st} f(t) dt    (33.1)

for those s ∈ C where the integral converges.
Note that the Laplace transform of a function f(t) exists only if it is of expo-
nential order. Now, just like taking the Laplace transform of a function f(t)
in the time space, we can take the inverse Laplace transform of the cor-
responding function F(s) in frequency space (or s space) to get back the
original function in the time space.
• When f(t) and g(t) are two piecewise continuous functions defined over
t > 0, then their convolution is represented as follows:

(f ∗ g)(t) = ∫_0^t f(t − u) g(u) du,  ∀ 0 ≤ t < ∞    (33.3)
We now note that the Laplace transform of the convolution of two functions
is the product of the Laplace transforms of the individual functions:
L{(f ∗ g)(t)} = F(s)G(s).
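Definition (33.3) can be illustrated numerically. In this sketch of ours (the choice of functions is an assumption), f(t) = e^{−t} and g(t) = 1 give (f ∗ g)(t) = 1 − e^{−t}, whose transform 1/(s(s+1)) is indeed the product of the individual transforms 1/(s+1) and 1/s:

```python
import math

def convolve(f, g, t, steps=2000):
    """Midpoint-rule approximation of (f*g)(t) = integral_0^t f(t-u) g(u) du."""
    du = t / steps
    return sum(f(t - (i + 0.5) * du) * g((i + 0.5) * du) * du
               for i in range(steps))

t = 1.5
approx = convolve(lambda u: math.exp(-u), lambda u: 1.0, t)
exact = 1 - math.exp(-t)                  # closed form of (e^{-t} * 1)(t)
```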
The area under the graph of this function is always 1. If we consider the
limiting case of this impulse function, as k → ∞, we will have infinite height
while the area remains unity, and the Laplace transform of the resulting
Dirac Delta function is:

L(δ(t)) = 1    (33.6)
= Σ_{n=0}^∞ f(nT) e^{−nTs}    (33.12)

Now we take z = e^{sT} and substitute in the above equation to get the Z-
transform as:

F(z) = Σ_{n=0}^∞ f(nT) z^{−n}    (33.13)

The function F(z) is called the Z-transform of the discrete time signal func-
tion f(nT):

F(z) = Z(f(nT)) = Σ_{n=0}^∞ f(nT) z^{−n}    (33.14)
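Formula (33.14) can be tried on a concrete sequence. For the sampled exponential f(nT) = a^n (a hypothetical example of ours, with arbitrary values of a and z), the series is geometric and sums in closed form to 1/(1 − a z^{−1}):

```python
# Partial sums of F(z) = sum_n f(nT) z^{-n} for f(nT) = a^n.
a, z = 0.5, 2.0
partial = sum(a**n * z**(-n) for n in range(200))   # (a/z)^n shrinks fast
closed = 1.0 / (1.0 - a / z)                        # geometric-series limit
```

For |a/z| < 1 the partial sum converges rapidly to the closed form.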
We can interpret this rule as follows. Suppose we are watching how customers
arrive at a shop. If we do not observe a customer arrival until time a has elapsed,
the distribution of the waiting time from time a until the next customer arrives is
the same as when we start our waiting time again from 0. Consider the conditional
distribution that denotes the probability that X is within the range of time t and
time t + s, given that time t has already passed - this is the same as the process
starting from zero and X lying below s.
Note now that in the above equation the RHS is nothing but the CDF. Further, if
we denote s = (x − t), we get an expression that essentially denotes a shift in the
PDF. The new expected value is given by:

E[X | X > t] = t + 1/λ = t + E[X]    (33.24)
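The memoryless property can be verified directly from the exponential CDF. A short sketch of ours (the parameter values λ, t and s are arbitrary):

```python
import math

lam = 0.7
cdf = lambda x: 1 - math.exp(-lam * x) if x > 0 else 0.0   # exponential CDF

t, s = 2.0, 1.5
# P(t < X <= t+s | X > t) = [F(t+s) - F(t)] / [1 - F(t)]
conditional = (cdf(t + s) - cdf(t)) / (1 - cdf(t))
unconditional = cdf(s)                                      # P(X <= s)
```

The two probabilities coincide: having already waited t changes nothing.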
This indicator variable says that we get Y = 0 (tails) with probability q and
Y = 1 (heads) with probability p. Writing the law of total expectation (LTE)
formulation as:
E(X) = E(X | Y = 1)P(Y = 1) + E(X | Y = 0)P(Y = 0)
Now we also know that if we have observed a head by the first time step, then
the expected time to heads will be 1. So we have:
E(X | Y = 1) = 1    (33.29)
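Completing the computation (the numeric value of p below is our own choice): conditioning on a first-flip tail restarts the wait, so E(X | Y = 0) = 1 + E(X), and the law of total expectation gives E(X) = p·1 + q·(1 + E(X)), i.e. E(X) = 1/p. This can be cross-checked against the geometric distribution directly:

```python
p = 0.3
q = 1 - p

# First-step analysis: E = p*1 + q*(1 + E)  =>  E = 1/p
E_solved = 1.0 / p

# Cross-check by direct summation of the geometric distribution:
# P(X = n) = q^{n-1} p for n = 1, 2, ...
E_direct = sum(n * q**(n - 1) * p for n in range(1, 2000))
```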
Chapter 34
Recurrence Relations
To solve a linear homogeneous recurrence relation with constant coefficients, we
look for solutions of the form:
an = r^n    (34.3)
• Next, take all the RHS terms to the left hand side (so as to get 0 on the RHS)
and also divide each term by rn−k to get:
• Step 3: We then find the roots of the above equation as r1 = 2 and r2 = −1.
Hence we say that the sequence {an } is a solution to the given recurrence
relation if and only if:
an = α1 · 2^n + α2 · (−1)^n    (34.10)
a0 = 2 = α1 + α2    (34.11)
a1 = 7 = 2α1 − α2    (34.12)
• Step 5: We then solve the above two equations to get the value of α1 = 3
and α2 = −1. Finally, substituting all the values we get the general solution
to the recurrence relation of the form:
an = 3 · 2n − (−1)n (34.13)
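The closed form can be checked against the recurrence itself. The characteristic roots 2 and −1 correspond to r^2 = r + 2, i.e. the recurrence an = an−1 + 2an−2 of this worked example:

```python
# Recurrence a_n = a_{n-1} + 2 a_{n-2} (characteristic roots 2 and -1),
# with initial conditions a0 = 2, a1 = 7.
a = [2, 7]
for n in range(2, 21):
    a.append(a[n - 1] + 2 * a[n - 2])

# General solution found above: a_n = 3 * 2^n - (-1)^n
closed = [3 * 2**n - (-1)**n for n in range(21)]
```

The two sequences agree term by term (e.g. a2 = 7 + 4 = 11 = 12 − 1).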
• Now consider the following non-homogeneous recurrence relation; its asso-
ciated homogeneous recurrence relation is the equation we get from
simply ignoring the F(n) = 2n term:
an = 3an−1 + 2n    (34.16)
• Step 1: we first try to find a general solution for the associated homoge-
neous linear recurrence relation given by (just write the equation after
ignoring that 2n term):
an = 3an−1    (34.17)
r = 3    (34.18)
• Step 2: From the above characteristic equation we get the root as 3. The
general solution can now be written in the form:

an^(h) = α · 3^n    (34.19)

Note that the superscript (h) only denotes that this is the associated homo-
geneous solution.
• Substituting a trial particular solution of the form an^(p) = cn + d into the
recurrence gives:
cn + d = 3(c(n − 1) + d) + 2n    (34.21)
Matching coefficients yields c = −1 and d = −3/2, so that:
an^(p) = −n − 3/2    (34.24)
• Step 5: Finally, we sum the general and particular solutions to get the com-
plete solution as:
an = an^(h) + an^(p)    (34.25)
an = −n − 3/2 + α · 3^n    (34.26)
• Step 6: As an absolute last step, we put the initial condition a1 = 3 into the
complete solution, which gives α = 11/6. With this the proper solution is:

an = −n − 3/2 + (11/6) · 3^n    (34.27)
References
[1] K. H. Rosen - Discrete Mathematics and its Applications
Chapter 35
Propositional Logic
Logic
Propositional logic is the branch of logic that studies ways of combining or alter-
ing statements or propositions to form complicated statements/propositions. It
studies the way in which statements interact with each other. Predicate logic is an
extension of propositional logic. It adds the ideas of predicates and quantifiers
which in essence captures meaning in statements. Modal logic is an extension to
propositional and predicate logic which includes operators expressing modality.
Temporal logic is any system of rules and symbolism for representing and reason-
ing about propositions qualified in terms of time. We will discuss propositional
logic in the following sections.
Definition 35.2.1 Atom: An atom is the most basic form of proposition. It is de-
noted by capital letters (e.g. P, Q, R, etc.)
Symbol Definition
¬ Not
∧ And
∨ Or
=⇒ Implies
⇐⇒ Iff (If and only if)
Definition 35.2.4 φ[I]: The truth value of formula φ evaluated for an interpreta-
tion I.
(i) If φ is an atom ρ then φ[I] is the truth value assigned to ρ in the interpreta-
tion.
(ii) If φ = ¬ρ then φ[I]= ¬ρ[I].
(iii) If φ = ρ1 ∧ ρ2 then φ[I] = ρ1 [I] ∧ ρ2 [I].
(iv) If φ = ρ1 ∨ ρ2 then φ[I] = ρ1 [I] ∨ ρ2 [I].
Examples:
I      P1     P2     P3     ...    Pn
I1     T=1    T=1    T=1    ...    T=1
I2     T=1    T=1    F=0    ...    F=0
I3     T=1    F=0    F=0    ...    F=0
...
Im     F=0    F=0    F=0    ...    F=0
I      P1     P2
I1     T=1    T=1
I2     T=1    F=0
I3     F=0    T=1
I4     F=0    F=0
have only 2 values (T or F). Therefore, the total number of possible truth tables for
2 atoms will be
2^(2^n) = 2^(2^2) = 2^4 = 16 ways.
The set of all interpretations of a formula is given by
S = {I1, I2, . . . , I_{2^n}} = {I1, I2, I3, I4}
[Tree diagram: branching over the truth value (T or F) assigned at each of
I1, I2, I3, I4 produces all 2^4 = 16 possible truth tables, each corresponding to a
subset of {I1, I2, I3, I4} - from the full set {I1, I2, I3, I4} down to the empty set {}.]
Figure 35.1: Tree diagram showing the number of truth tables formed from 2
atoms.
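The count 2^(2^n) can be reproduced programmatically. A small sketch of ours (names are our own):

```python
from itertools import product

# For n = 2 atoms there are 2^n = 4 interpretations I1..I4; a truth table
# assigns T/F to each interpretation, giving 2^(2^n) = 16 possible tables.
interpretations = list(product([True, False], repeat=2))
truth_tables = list(product([True, False], repeat=len(interpretations)))

# Each table corresponds to the subset of interpretations it maps to T.
subsets = [frozenset(i + 1 for i, v in enumerate(table) if v)
           for table in truth_tables]
```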
De Morgan's Laws:
De Morgan's Laws allow the expression of conjunctions and disjunctions purely
in terms of each other via negation. The laws can be used to simplify computation
of truth tables. Consider two propositions φ and ψ. De Morgan's Laws are
given as follows:
¬(φ ∧ ψ) ≡ ¬φ ∨ ¬ψ
¬(φ ∨ ψ) ≡ ¬φ ∧ ¬ψ
Propositional Satisfiability
Definition 35.3.1 Satisfiability: A propositional formula φ is satisfiable if there
exists a truth assignment I such that φ[I] = T for that assignment. This is denoted
as I ⊨ φ when φ[I] = T for the given I.
Propositional Validity
Definition 35.3.2 Validity: A propositional formula φ is valid (also called a
tautology) if φ[I] = T for all truth assignments I. Hence we have I ⊨ φ, ∀ I.
Summary
• Logic is categorized into four types - Propositional, Predicate, Modal and
Temporal.
35.4.1 Review
Definition 35.4.1 Satisfiability: A propositional formula φ is satisfiable if there
exists a truth assignment I such that φ[I] = T for that assignment. This is denoted
as I ⊨ φ when φ[I] = T for the given I.
I      P     ¬P    φ = P ∨ ¬P
I1     T     F     T
I2     F     T     T
Since ∀ I, φ[I] = T, it can be said that φ = P ∨ ¬P is a valid formula.

I      P     ¬P    φ = P ∧ ¬P
I1     T     F     F
I2     F     T     F
Since ∀ I, φ[I] = F, it can be said that φ = P ∧ ¬P is a contradiction.
Note:
• A formula φ is valid ⇐⇒ ¬φ is unsatisfiable.
Fact 1 (F1): Gaurav carries an umbrella if it is cloudy and the forecast calls for rain.
Fact 2 (F2): It is not cloudy.
Solution. The statements in the given facts can be broken down into the following
atoms. Further, the facts and premises can be condensed on the basis of the atoms.
Atoms:
Atom P: It is cloudy.
Atom Q: The forecast calls for rain.
Atom R: Gaurav carries an umbrella.
Facts:
F1: P ∧ Q → R
F2: ¬P
Premises:
P1: P ∧ Q
P2: ¬P
Conclusion: Is ¬R true? (¬R =? T)
One can quickly assert that all problems that can be solved in polynomial time
can also be checked for, in polynomial time. Thus it can be said that all P prob-
lems form a subset of N P problems.
If P = N P , then that would mean that all problems which can be verified in poly-
nomial time, can also be solved in polynomial time. This has widespread ramifi-
cations to the way of life as we know it. Many problems in operations research
such as integer linear programming and the travelling salesman problem can be
efficiently solved, leading to widespread impact on logistics. Further, solving the
N P complete protein folding problem would lead to significant advances in life
sciences and medicine.
Negative consequences would also arise, for NP-complete problems are funda-
mental to several disciplines. For instance, cryptography relies on certain problems
being hard to solve. Finding successful and efficient ways to solve NP-complete
problems would lead to the breakage of many cryptosystems, thereby putting indi-
viduals' privacy and security at risk.
35.5.2 Entailment
Definition 35.5.1 A set of n formulae ψ1 , ψ2 , . . . , ψn entails a single formula ψ
if, for every truth assignment I that satisfies all of ψ1 , ψ2 , . . . , ψn , I satisfies ψ.
The entailment of formulae ψ1, ψ2, . . . , ψn in a single formula ψ is denoted by:
ψ1, ψ2, . . . , ψn ⊨ ψ.
I P Q φ=P ∨Q ψ=P ¬φ ¬φ ∨ ψ ψ φ ¬ψ ¬ψ ∨ φ
I1 T T T T F T T T F T
I2 T F T T F T T T F T
I3 F T T F F F F T T T
I4 F F F F T T F F T T
The process of checking for entailment involves going through the entire truth
table and checking for satisfiability of a formula and hence is tedious. There is an
easier way to go about this. In order to check if φ ⊨ ψ, create a truth table column
¬φ ∨ ψ. If this column is T ∀ I, then φ entails ψ. In the above example, the column
¬ψ ∨ φ is all T; thus, ψ ⊨ φ.
I P Q φ=P ∧Q ψ=P ¬φ ¬φ ∨ ψ ψ φ ¬ψ ¬ψ ∨ φ
I1 T T T T F T T T F T
I2 T F F T T T T F F F
I3 F T F F T T F F T T
I4 F F F F T T F F T T
From the truth table, ψ is T in all cases when φ is T (the sole case being truth
assignment I1). Further, ¬φ ∨ ψ = T, ∀ I, which means that φ entails ψ, or
P ∧ Q ⊨ P. However, in truth assignment I2, one can observe that φ is F when ψ
is T. In addition, ¬ψ ∨ φ ≠ T, ∀ I, because for I2, ¬ψ ∨ φ = F. Hence, ψ does not
entail φ, making P ⊭ P ∧ Q.
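The truth-table procedure above is easy to mechanise. A brute-force sketch of ours (function names are assumptions):

```python
from itertools import product

def entails(premise, conclusion):
    """premise |= conclusion iff every assignment satisfying the premise also
    satisfies the conclusion (equivalently, ~premise OR conclusion is valid)."""
    return all((not premise(p, q)) or conclusion(p, q)
               for p, q in product([True, False], repeat=2))

phi = lambda p, q: p and q        # P AND Q
psi = lambda p, q: p              # P

r_forward = entails(phi, psi)     # P AND Q |= P  -> True
r_backward = entails(psi, phi)    # P |= P AND Q  -> False (fails at P=T, Q=F)
```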
Theorem 1. ψ1, ψ2, . . . , ψn entails ψ ⇐⇒ [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid.
Proof. This theorem has to be proved in both directions - the 'only if' direction and
the 'if' direction. Thus, the theorem is expressed in two statements as follows,
and both statements are individually proved.
Only if case: ψ1, ψ2, . . . , ψn entails ψ =⇒ [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid.
If case: [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid =⇒ ψ1, ψ2, . . . , ψn entails ψ.
Only if case (=⇒):
Suppose ψ1, ψ2, . . . , ψn entails ψ; we wish to show that φ = {[ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→
ψ} is valid. In order to show that φ is valid, we have to prove that for every truth
assignment I, φ[I] = T.
Consider any truth assignment I; the left side of the formula can either be T or F.
This makes two cases - one where (ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = F and the other where
(ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = T.
Case 1: (ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = F
Since φ −→ ψ is defined as ¬φ ∨ ψ, we have
φ[I] = ¬(ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] ∨ ψ[I] = ¬F ∨ ψ[I] = T ∨ ψ[I] = T =⇒ φ[I] is T.
Case 2: (ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = T
Since the premises entail ψ, we have ψ[I] = T, so
φ[I] = T −→ T = T =⇒ φ[I] is T.
If case (⇐=):
Suppose φ = {[ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ} is valid for all truth assignments I. This
means ∀ I, φ[I] = T. We wish to show that ψ1, ψ2, . . . , ψn entails ψ.
Consider any interpretation I that satisfies all of ψ1, ψ2, . . . , ψn, so that
(ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = T. Since φ is valid, φ[I] = {T −→ ψ[I]} = T. Thus
ψ[I] = T, proving that ψ1, ψ2, . . . , ψn entails ψ.
Summary
• A formula φ is satisfiable if there exists truth assignment I such that φ[I] =
T.
• A formula φ is valid if φ[I] = T ∀ I.
• A formula φ is a contradiction if φ[I] = F ∀ I.
• A formula φ is valid ⇐⇒ ¬φ is unsatisfiable.
• Checking whether a formula is satisfiable for some truth interpretation is
called the satisfiability problem. This falls under the general class of NP
problems, whose solutions can be verified in polynomial time but which are
not known to be solvable in polynomial time.
• Logical inference is the process of reducing statements into atoms and
checking for validity of conclusions from facts and premises.
• A set of n formulae ψ1, ψ2, . . . , ψn entails a single formula ψ if, for every truth
assignment I that satisfies all of ψ1, ψ2, . . . , ψn, I satisfies ψ. This is denoted
by: ψ1, ψ2, . . . , ψn ⊨ ψ.
• Theorem: ψ1, ψ2, . . . , ψn entails ψ ⇐⇒ [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid.
Chapter 36
Graph Theory
Example 1. Consider the following undirected graph. The degrees of the vertices
of the graph have been calculated.
[Graph: vertices v1, v2, v3, with edges e1 and e2 both incident on v1.]
d1 = Deg{v1} = 2
d2 = Deg{v2} = 1
d3 = Deg{v3} = 1
Example 2. Consider the following directed graph. There are two types of de-
grees for the vertices in directed graphs, namely, dout and din.

[Directed graph: vertices v1, v2, v3 with edges e1, e2, e3.]

        v1    v2    v3
din     2     1     0
dout    0     1     2
This shows that the sum of degrees of a graph double counts the edges, since
every edge is counted once from each of its two end vertices.
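The double-counting observation (the handshaking lemma) can be checked on Example 1. The edge labelling below is a hypothetical one of ours, chosen to be consistent with the listed degrees:

```python
# Assumed edges for Example 1: e1 = {v1, v2}, e2 = {v1, v3}.
edges = [("v1", "v2"), ("v1", "v3")]

degree = {}
for u, v in edges:                       # each edge adds 1 to both endpoints
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

total_degree = sum(degree.values())      # should equal 2 * |E|
```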
Definition 36.1.3 A tree T is a connected acyclic graph.
[Circuit graph: four nodes x1, x2, x3, x4 connected by five edges b1, b2, b3, b4, b5.]
Consider the above circuit with four nodes labelled 1, 2, 3 and 4. Let xi be the
electric potential at the ith node. The potential difference on each edge, such as
b1, is given by the difference of the potentials at its end nodes.
These equations can now be rewritten in order to bring about the Ax = b form
as shown below. The matrix A is of the order 5 × 4, with 5 edges and 4 nodes.
We know that the null space matrix contains the vectors in N(A). The infor-
mation about the null space matrix N is found in the matrix F obtained under
elimination. Thus, the null space solution of Ax = 0 is:

N = (x1, x2, x3, x4)^T = α (1, 1, 1, 1)^T
Chapter 36. Graph Theory 359
The above result is significant in physics. It means that if the potential differences
between the nodes are zero (b = 0), then the nodes are at the same electric poten-
tial. In other words, these nodes are equipotential. It can therefore be seen that
Ax = b encapsulates Kirchhoff's Voltage Law.
Using Ohm's law and the information about the conductance of each edge, one
could create a conductance matrix G. This could then be used to obtain the
currents yi on each edge. The vector denoted by y has entries comprising the
currents in all the edges. Thus Gb = y, where G is the 5 × 5 conductance matrix,
b is the vector of potential differences and y is the vector of currents.
[Figure: the same four-node circuit with the currents y1, y2, y3, y4, y5 and their
flow directions marked on the edges.]
The currents along with their flow directions are shown in the above figure. Now
Kirchhoff's Current Law can be written as AT y = 0. Thus, vectors in the null
space of AT correspond to collections of currents in the loops that satisfy Kirchhoff's
laws. This is summarized in the following schematic.
[Schematic: x = {x1, x2, x3, x4}, the potentials at the nodes, feeds Ax = b; the
currents y satisfy AT y = 0, Kirchhoff's Current Law.]
The AT y = 0 equation is written out, and using the method of elimination the
reduced row echelon form (rref) is obtained; the rank r is evident from the
number of pivots.
AT y = 0:

[ −1   0  −1  −1   0 ] [y1]   [0]
[  1  −1   0   0   0 ] [y2]   [0]
[  0   1   1   0  −1 ] [y3] = [0]
[  0   0   0   1   1 ] [y4]   [0]
                       [y5]

Writing this as the augmented matrix [AT | 0] and eliminating (add row 1 to
row 2, then row 2 to row 3, then row 3 to row 4) gives the upper triangular
matrix U:

[ −1   0  −1  −1   0 ]
[  0  −1  −1  −1   0 ]
[  0   0   0  −1  −1 ]
[  0   0   0   0   0 ]

Continuing the elimination upwards to the reduced row echelon form, and
swapping columns 3 and 4 so that the pivots occupy the first three columns
(variable order y1, y2, y4, y3, y5), we obtain:

R = [ 1  0  0  1  −1 ]
    [ 0  1  0  1  −1 ]   =   [ I3  F ]
    [ 0  0  1  0   1 ]       [ 0   0 ]
    [ 0  0  0  0   0 ]
The null space matrix contains the vectors in N(AT). The information about
the null space matrix N is found in the matrix F obtained under elimination.
Thus, the null space solutions of AT y = 0, first in the swapped variable order
(y1, y2, y4, y3, y5) and then restored to the original order, are:

    [y1]   [ −1   1 ]             [y1]   [ −1   1 ]
    [y2]   [ −1   1 ]             [y2]   [ −1   1 ]
N = [y4] = [  0  −1 ]    =⇒   y = [y3] = [  1   0 ]
    [y3]   [  1   0 ]             [y4]   [  0  −1 ]
    [y5]   [  0   1 ]             [y5]   [  0   1 ]
One can quickly notice that the null space of AT, given by AT y = 0, encapsulates
Kirchhoff's Current Law, where each solution y gives the current flow around a
particular loop.
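The two loop-current vectors can be checked against AT y = 0 directly. A small sketch (the matrix entries are those of this worked example; the helper name is ours):

```python
# A^T from the worked example (rows = nodes, columns = edges).
AT = [[-1,  0, -1, -1,  0],
      [ 1, -1,  0,  0,  0],
      [ 0,  1,  1,  0, -1],
      [ 0,  0,  0,  1,  1]]

# The two null-space (loop-current) vectors read off from the elimination:
y_loop1 = [-1, -1, 1, 0, 0]
y_loop2 = [ 1,  1, 0, -1, 1]

def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

res1 = matvec(AT, y_loop1)
res2 = matvec(AT, y_loop2)
```

Both products come out as the zero vector, confirming Kirchhoff's Current Law on each loop.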
where m is the number of edges and r is the number of nodes minus one. The
Dim[N(AT)] gives the number of independent loops L in the graph. The number of
rows in the incidence matrix A is equal to the number of edges |E| in the graph. The
rank of the incidence matrix r (which is invariant under the transpose operation)
is one less than the number of nodes, |V| − 1. Therefore,
L = |E| − (|V| − 1), which rearranges to Euler's formula |V| − |E| + L = 1.
Summary
• A graph G is a set of vertices V and edges E. =⇒ G = {V, E}.
Here, the set of vertices is given by V = {v1 , v2 , . . . , vn } where, |V | = n and
the set of edges is given by E = {e1 , e2 , . . . , en } where, |E| = m.
• The degree of a vertex v in a graph is the number of edges incident on v ∈ V .
• A tree T is a connected acyclic graph.
• Therefore, the sum of degrees for a tree can be generalised as:
Σ_{i=1}^{n} d(vi) = 2m = 2(n − 1)
Chapter 37
Proof Techniques
Topics
• Introduction to Proof Techniques.
• Direct Proofs.
• Indirect Proofs - Proof by Contrapositive and Proof by Contradiction.
The direct proof technique shows that if P is T then Q must also be T , and there-
fore, the case that P is T and Q is F never arises.
[Diagram: proof techniques split into Direct and Indirect; Indirect splits into
Contrapositive and Contradiction.]
P : n2 is even.
Q: n is even.
Suppose P is T and Q is F.
−→ n^2 is even and n is not even, i.e. n is odd.
−→ n^2 is even and n^2 is odd (from Ex. 1, an odd n gives an odd n^2).
−→ But n^2 cannot be both even and odd, which gives us a
contradiction (as A ∧ ¬A is always F).
[Schematic of proof by contradiction: assume the premise P is T and the
conclusion C is F; logical steps then yield both P and ¬P, i.e. (P ∧ ¬P),
which can never be T - a contradiction.]
Example: To prove that √2 is irrational.
p^2 = 2q^2 and p = 2l
(2l)^2 = 2q^2
4l^2 = 2q^2
q^2 = 2l^2
q^2 is even. Therefore q is even. —- (C)
(B) and (C) give a contradiction. Also, if p and q are even then 2 should be a
common factor of p and q.(D)
However, from (A) we know that the only common factor of p and q is 1. Thus, A
and D also give us a contradiction.
Thus, we can say that ¬P is F i.e. P is true.
If P is T, then Q is T.
The statement P is T is called the antecedent or the hypothesis while the statement
Q is T is called the consequence or the conclusion. Whenever P is T , Q is T as
well. Also, whenever Q is not T (¬Q), P is not T as well (¬P ).
P ⇒ Q ≡ ¬Q ⇒ ¬P
[Venn diagrams: on the left, the region P sits inside Q - whenever P is T, Q is T
as well; on the right, the region ¬Q sits inside ¬P - whenever ¬Q is T, ¬P is T
as well.]
Biconditionals (P ⇔ Q)
The biconditional statement P ⇔ Q is the proposition P if and only if Q. This is
the same as stating (P ⇒ Q) ∧ (Q ⇒ P). It is true only when P and Q have
the same truth values and is false otherwise. It also means that P is necessary and
sufficient for Q. For example, any integer n is even iff n^2 is even is a biconditional
statement, i.e. ∀ n ∈ Z, n is even ⇔ n^2 is even.
[Diagram: P =⇒ Q and Q =⇒ P combine to give (P =⇒ Q) ∧ (Q =⇒ P).]
(¬P =⇒ F) =⇒ P
(¬¬P ∨ F) =⇒ P
(P ∨ F) =⇒ P
P =⇒ P
x + y = 16 ⇒ [(x ≥ 8) ∨ (y ≥ 8)]
Assume the conclusion is false, i.e. x < 8 and y < 8. Adding these gives
(x + y) < 16, which is contrary to our given premise, and therefore our conclusion
that (x ≥ 8) ∨ (y ≥ 8) is true.
We will now prove the same using the technique of proof by contradiction.
P: x + y = 16 ⇒ [(x ≥ 8) ∨ (y ≥ 8)]
If x + y = 16 then x ≥ 8 or y ≥ 8.
Let us assume that P is not true; then x + y = 16 while x < 8 and y < 8, and
adding the two inequalities gives x + y < 16, a contradiction.
• Conclude that P is ¬F = T:
(x + y) = 16 → (x ≥ 8) ∨ (y ≥ 8)    (37.1)
A(k + 1) = 1 + 2 + 3 + · · · + k + (k + 1)    (37.8)

A(k) = 1 + 2 + 3 + · · · + k = k(k + 1)/2    (37.9)

A(k + 1) = A(k) + (k + 1)    (37.10)

A(k + 1) = k(k + 1)/2 + (k + 1)    (37.11)

A(k + 1) = (k + 1)[k/2 + 1]    (37.12)

A(k + 1) = (k + 1)(k + 2)/2    (37.13)

→ A(k + 1) = (k + 1)((k + 1) + 1)/2    (37.14)

Hence, with the last step of Proof by Induction, we have proved that
A(n) = 1 + 2 + 3 + · · · + n = n(n + 1)/2 for all positive integers n.
• Base case: A(1) = 1, which is equal to 1^2 = 1. Hence the base case is true.
• Inductive step: assuming A(k) = k^2 and adding the next odd number (2k + 1):

A(k + 1) = A(k) + (2k + 1) = k^2 + 2k + 1    (37.16)

A(k + 1) = (k + 1)^2    (37.17)
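Both induction results can be spot-checked numerically for many n (assuming, for the second, that A(n) is the sum of the first n odd numbers, which is what the algebra above suggests):

```python
# Sum of the first n integers: 1 + 2 + ... + n = n(n+1)/2
ok_triangular = all(sum(range(1, n + 1)) == n * (n + 1) // 2
                    for n in range(1, 101))

# Sum of the first n odd numbers: 1 + 3 + ... + (2n-1) = n^2
ok_odd = all(sum(2 * k - 1 for k in range(1, n + 1)) == n * n
             for n in range(1, 101))
```

A finite check is not a proof, but it is a quick guard against algebra slips in the inductive step.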